How to Web Scrape Yelp.com (2023 Update): A Step-by-Step Guide

Raluca Penciuc on Mar 03 2023

Yelp is a platform that allows users to search for businesses, read reviews, and even make reservations. It is a popular website with millions of monthly visitors, making it an ideal target for data scraping.

Knowing how to web scrape Yelp can be a powerful tool for businesses and entrepreneurs looking to gather valuable information about the local market.

In this article, we will explore the advantages of web scraping Yelp and walk through how to set up the environment, locate the data, and extract it.

We will also look at the potential business ideas that can be created using this scraped data, and why using a professional scraper is better than creating your own. By the end of this article, you will have a solid understanding of how to web scrape Yelp.

Environment setup

Before we begin, let's ensure we have the necessary tools.

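First, download and install Node.js from the official website, making sure to use the Long-Term Support (LTS) version. This also installs the Node Package Manager (NPM) automatically, which we will use to install further dependencies.

For this tutorial, we will use Visual Studio Code as our Integrated Development Environment (IDE), but you can use any other IDE of your choice. Create a new folder for your project, open the terminal, and run the following command to set up a new Node.js project: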

npm init -y

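This creates a package.json file in your project directory, which stores information about the project and its dependencies.

Next, we need to install TypeScript and the type definitions for Node.js. TypeScript offers optional static typing, which helps prevent errors in the code. To do this, run the following in the terminal: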

npm install typescript @types/node --save-dev

You can verify the installation by running:

npx tsc --version

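TypeScript uses a configuration file called tsconfig.json to store compiler options and other settings. To create this file in your project, run the following command: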

npx tsc --init

Make sure that the value for “outDir” is set to “dist”. This way we will separate the TypeScript files from the compiled ones. You can find more information about this file and its properties in the official TypeScript documentation.

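Now, create a “src” directory in your project and a new “index.ts” file inside it. This is where we will keep the scraping code. TypeScript code has to be compiled before it can be executed, so to make sure we don't forget this extra step, we can use a custom command.

Head over to the “package.json” file and edit the “scripts” section like this:

"scripts": {
    "test": "npx tsc && node dist/index.js"
}

This way, when you want to execute the script, you only need to type “npm run test” in your terminal.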

Finally, to scrape the data from the website, we will use Puppeteer, a headless browser library for Node.js that allows you to control a web browser and interact with websites programmatically. To install it, run this command in the terminal:

npm install puppeteer

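It is highly recommended when you want to ensure the completeness of your data, as many websites nowadays contain dynamically generated content. If you're curious, you can check out the Puppeteer documentation before continuing, to see what it is capable of.

Data location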

Now that you have your environment set up, we can start looking at extracting the data. For this article, I chose to scrape the page of an Irish restaurant from Dublin: https://www.yelp.ie/biz/the-boxty-house-dublin?osq=Restaurants.

We will extract the following data:

the restaurant name;
the restaurant rating;
the restaurant's number of reviews;
the business website;
the business phone number;
the restaurant's physical address.

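You can see all this information highlighted in the screenshot below:

By opening the Developer Tools on each of these elements, you will be able to see the CSS selectors that we will use to locate the HTML elements. If you're fairly new to how CSS selectors work, feel free to refer to this beginner guide.

Extracting the data

Before writing our script, let's verify that the Puppeteer installation went all right: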

import puppeteer from 'puppeteer';

async function scrapeYelpData(yelp_url: string): Promise<void> {
    // Launch Puppeteer
    const browser = await puppeteer.launch({
        headless: false,
        args: ['--start-maximized'],
        defaultViewport: null
    })

    // Create a new page
    const page = await browser.newPage()

    // Navigate to the target URL
    await page.goto(yelp_url)

    // Close the browser
    await browser.close()
}

scrapeYelpData("https://www.yelp.ie/biz/the-boxty-house-dublin?osq=Restaurants")

Here we open a browser window, create a new page, navigate to our target URL, and close the browser. For the sake of simplicity and visual debugging, I open the browser window maximized in non-headless mode.
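One thing this verification script does not cover is failure: if page.goto throws (on a timeout, for example), browser.close() is never reached and the browser process may be left running. A minimal sketch of one way to guard against that follows. The helper name withResource and the Closable interface are illustrative assumptions, not part of Puppeteer; the sketch relies only on the browser object having an async close() method, as used above.

```typescript
// Anything with an async close() method; the browser object used above fits this shape.
interface Closable {
    close(): Promise<void>
}

// Hypothetical helper: opens a resource, runs `use` with it, and always closes it.
async function withResource<T extends Closable, R>(
    open: () => Promise<T>,
    use: (resource: T) => Promise<R>
): Promise<R> {
    const resource = await open()
    try {
        return await use(resource)
    } finally {
        // Runs whether `use` resolved or threw
        await resource.close()
    }
}
```

With Puppeteer, this could be called as withResource(() => puppeteer.launch(), async browser => { ... }), with the page logic moved into the callback.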

Now, let's take a look at how the website is structured:

It seems that Yelp displays a somewhat difficult page structure, as the class names are randomly generated and very few elements have unique attribute values.

But fear not, we can get creative with the solution. Firstly, to get the restaurant name, we target the only “h1” element present on the page.

// Extract restaurant name
const restaurant_name = await page.evaluate(() => {
    const name = document.querySelector('h1')
    return name ? name.textContent : ''
})

console.log(restaurant_name)

Now, to get the restaurant rating, you can notice that beyond the star icons, the explicit value is present in the attribute “aria-label”. So, we target the “div” element whose “aria-label” attribute ends with the “star rating” string.

// Extract restaurant rating
const restaurant_rating = await page.evaluate(() => {
    const rating = document.querySelector('div[aria-label$="star rating"]')
    return rating ? rating.getAttribute('aria-label') : ''
})

console.log(restaurant_rating)
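The value returned here is the full label (for example “4.5 star rating”), not a number. If a numeric rating is needed, the string can be parsed afterwards. The sketch below assumes the label always has the form “<number> star rating”; parseRating is a hypothetical helper name, not something provided by Puppeteer.

```typescript
// Hypothetical helper: turns "4.5 star rating" into 4.5.
// Returns null when the label does not match the assumed format.
function parseRating(label: string | null): number | null {
    if (!label) return null
    const match = label.trim().match(/^(\d+(?:\.\d+)?)\s+star rating$/)
    return match ? parseFloat(match[1]) : null
}

console.log(parseRating("4.5 star rating")) // 4.5
console.log(parseRating("no rating yet"))   // null
```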

And finally (for this particular HTML section), we can easily get the number of reviews by targeting the anchor element that links to the “#reviews” section.

// Extract restaurant reviews
const restaurant_reviews = await page.evaluate(() => {
    const reviews = document.querySelector('a[href="#reviews"]')
    return reviews ? reviews.textContent : ''
})

console.log(restaurant_reviews)
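The anchor text is a string such as “948 reviews” rather than a number. Here is a small sketch of converting it to an integer, assuming the text consists of the count (optionally with thousands separators) followed by “review” or “reviews”. The name parseReviewCount is illustrative, not part of any library.

```typescript
// Hypothetical helper: turns "948 reviews" into 948.
// Returns null when the text does not match the assumed format.
function parseReviewCount(text: string | null): number | null {
    if (!text) return null
    const match = text.trim().match(/^(\d[\d,]*)\s+reviews?$/)
    if (!match) return null
    // Drop thousands separators before converting
    return parseInt(match[1].replace(/,/g, ""), 10)
}

console.log(parseReviewCount("948 reviews")) // 948
```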

Easy peasy. Let’s take a look at the business information widget:

Unfortunately, in this situation, we cannot rely on CSS selectors. Luckily, we can make use of another method to locate the HTML elements: XPath. If you’re fairly new to how XPath works, feel free to refer to this beginner guide.

To extract the restaurant’s website, we apply the following logic:

locate the “p” element that has “Business website” as text content;
locate its following sibling;
locate the anchor element inside it and its “href” attribute.

// Extract restaurant website
const restaurant_website_element = await page.$x("//p[contains(text(), 'Business website')]/following-sibling::p/a/@href")

const restaurant_website = await page.evaluate(
    element => element.nodeValue,
    restaurant_website_element[0]
)

console.log(restaurant_website)

Now, for the phone number and the address we can follow the same logic, with two exceptions:

for the phone number, we stop at the following sibling and extract its textContent property;
for the address, we start from the “Get Directions” anchor and target the following sibling of its parent element.

// Extract restaurant phone number
const restaurant_phone_element = await page.$x("//p[contains(text(), 'Phone number')]/following-sibling::p")

const restaurant_phone = await page.evaluate(
    element => element.textContent,
    restaurant_phone_element[0]
)

console.log(restaurant_phone)

// Extract restaurant address
const restaurant_address_element = await page.$x("//a[contains(text(), 'Get Directions')]/parent::p/following-sibling::p")

const restaurant_address = await page.evaluate(
    element => element.textContent,
    restaurant_address_element[0]
)

console.log(restaurant_address)
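The website and phone number queries share the same shape: find the “p” element containing a label, then step to its following sibling. A minimal sketch of generating these expressions from the label is shown here. The helper siblingOfLabel is hypothetical; it only builds the XPath string, which would still be passed to page.$x as in the snippets above.

```typescript
// Hypothetical helper: builds the XPath for "the <p> that follows the <p> containing `label`".
// `tail` optionally continues the path, e.g. "/a/@href" for the website link.
function siblingOfLabel(label: string, tail: string = ""): string {
    // XPath 1.0 string literals have no escape sequence, so reject labels
    // that would break out of the single-quoted literal.
    if (label.indexOf("'") !== -1) {
        throw new Error("label must not contain a single quote")
    }
    return `//p[contains(text(), '${label}')]/following-sibling::p${tail}`
}

const websiteXPath = siblingOfLabel("Business website", "/a/@href")
const phoneXPath = siblingOfLabel("Phone number")

console.log(websiteXPath)
console.log(phoneXPath)
```

The address lookup starts from an anchor element rather than a label paragraph, so it is left out of this helper.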

The final result should look like this:

The Boxty House

4.5 star rating

948 reviews

/biz_redir?url=http%3A%2F%2Fwww.boxtyhouse.ie%2F&cachebuster=1673542348&website_link_type=website&src_bizid=EoMjdtjMgm3sTv7dwmfHsg&s=16fbda8bbdc467c9f3896a2dcab12f2387c27793c70f0b739f349828e3eeecc3

(01) 677 2762

20-21 Temple Bar Dublin 2
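Note that the website value is a Yelp redirect path rather than the restaurant's own address; the real URL is carried, percent-encoded, in the “url” query parameter. A minimal sketch of recovering it with the standard URL class follows. The function name resolveRedirect and the base URL argument are assumptions made for the example, and the sample path is a shortened version of the one printed above.

```typescript
// Hypothetical helper: extracts the target of a "/biz_redir?url=..." link.
// `base` is only needed so that the relative path can be parsed.
function resolveRedirect(href: string, base: string = "https://www.yelp.ie"): string | null {
    const parsed = new URL(href, base)
    // searchParams.get() returns the value already percent-decoded
    return parsed.searchParams.get("url")
}

// Shortened sample of the redirect path from the output above
const sampleHref = "/biz_redir?url=http%3A%2F%2Fwww.boxtyhouse.ie%2F&website_link_type=website"

console.log(resolveRedirect(sampleHref)) // http://www.boxtyhouse.ie/
```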

Bypass bot detection

While scraping Yelp may seem easy at first, the process can become more complex and challenging as you scale up your project. The website implements various techniques to detect and prevent automated traffic, so your scaled-up scraper starts getting blocked.

Yelp collects several pieces of browser data to generate a unique fingerprint and associate it with you. Some of these are:

properties from the Navigator object (deviceMemory, hardwareConcurrency, platform, userAgent, webdriver, etc.)
timing and performance checks
service workers
screen dimension checks
and many more
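To illustrate why these values matter, here is a simplified sketch of how a handful of Navigator properties could be combined into a single identifier. This is not Yelp's actual mechanism, which is not public. The BrowserSnapshot shape, the fingerprint function, and the hash choice (an FNV-1a style hash over UTF-16 code units) are all assumptions made for the example.

```typescript
// A few of the Navigator properties listed above, captured as plain data.
interface BrowserSnapshot {
    deviceMemory: number
    hardwareConcurrency: number
    platform: string
    userAgent: string
    webdriver: boolean
}

// FNV-1a style 32-bit hash over UTF-16 code units, returned as 8 hex digits.
function fnv1a(input: string): string {
    let hash = 0x811c9dc5
    for (let i = 0; i < input.length; i++) {
        hash ^= input.charCodeAt(i)
        hash = Math.imul(hash, 0x01000193)
    }
    return ("00000000" + (hash >>> 0).toString(16)).slice(-8)
}

// Hypothetical fingerprint: hash the properties in a fixed order.
function fingerprint(s: BrowserSnapshot): string {
    const parts = [s.deviceMemory, s.hardwareConcurrency, s.platform, s.userAgent, s.webdriver]
    return fnv1a(parts.join("|"))
}
```

Under this sketch, two scraper instances launched with identical defaults produce the same identifier, and a webdriver value of true is a signal on its own. That is why unvaried browser data becomes a liability once a scraper is scaled up.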

One way to overcome these challenges and continue scraping at a large scale is to use a scraping API. These kinds of services provide a simple and reliable way to access data from websites like yelp.com, without the need to build and maintain your own scraper.

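WebScrapingAPI is one such product. Its proxy rotation mechanism avoids CAPTCHAs altogether, and its extended knowledge base makes it possible to randomize the browser data so that requests look like they come from a real user.

The setup is quick and easy. All you need to do is register an account, and you will receive your API key. You can access it from your dashboard, and it is used to authenticate the requests you send.

As you have already set up your Node.js environment, we can make use of the corresponding SDK. Run the following command to add it to your project dependencies: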

npm install webscrapingapi

Now all that is left is to send a GET request, and we receive the website's HTML document. Note that this is not the only way you can access the API.

import webScrapingApiClient from 'webscrapingapi';

const client = new webScrapingApiClient("YOUR_API_KEY");

async function exampleUsage() {
    const api_params = {
        'render_js': 1,
        'proxy_type': 'residential',
    }

    const URL = "https://www.yelp.ie/biz/the-boxty-house-dublin?osq=Restaurants"

    const response = await client.get(URL, api_params)

    if (response.success) {
        console.log(response.response.data)
    } else {
        console.log(response.error.response.data)
    }
}

exampleUsage();

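By enabling the “render_js” parameter, we send the request using a headless browser, just like you did earlier in this tutorial.

After receiving the HTML document, you can use another library to extract the data of interest, like Cheerio. Never heard of it? Check out this guide to help you get started!

Conclusion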

This article has presented you with a comprehensive guide on how to web scrape Yelp using TypeScript and Puppeteer. We have gone through setting up the environment, locating and extracting the data, and the reasons why using a professional scraper is a better solution than creating your own.

The data scraped from Yelp can be used for various purposes such as identifying market trends, analyzing customer sentiment, monitoring competitors, creating targeted marketing campaigns, and many more.

Overall, web scraping Yelp.com can be a valuable asset for anyone looking to gain a competitive advantage in their local market, and this guide has provided a great starting point to do so.