如何用 JavaScript 抓取 HTML 表格

米哈伊-马克西姆（Mihai Maxim），2023 年 1 月 31 日

在处理网络上显示的表格数据时，掌握如何使用 javascript 抓取 HTML 表格是一项至关重要的技能。网站通常会在表格中显示重要信息，如产品详细信息、价格、库存水平和财务数据。在为各种分析任务收集数据时，提取这些信息的能力非常有用。在本文中，我们将深入探讨 HTML 表格，并构建一个简单但功能强大的程序来从中提取数据并导出到 CSV 或 JSON 文件。我们将使用 Node.js 和 cheerio 来轻松完成这一过程。

了解 HTML 表格的结构

HTML tables are a powerful tool for marking up structured tabular data and displaying it in a way that is easy for users to read and understand. Tables are made up of data organized into rows and columns, and HTML provides several different elements for defining and structuring this data. A table must include at least the following elements: <table>, <tr> (table row), and <td> (table data). For added structure and semantic value, tables may also include the <th> (table header) element, as well as the <thead>, <tbody>, and <tfoot> elements.

让我们通过一个小例子来了解一下标签。

请注意第二个表格使用了更具体的语法

The <thead> tag will apply a bold font to the "Fruit" and "Color" cells in the second table. Other than that, you can see how both syntaxes achieve the same organization of the data.

从网络上抓取表格时，必须注意可能会遇到以不同程度的语义特定性编写的表格。换句话说，有些表格可能包含更详细、描述性更强的 HTML 标记，而有些表格可能使用更简单、描述性更弱的语法。

使用 Node.js 和 cheerio 抓取 HTML 表格

欢迎来到有趣的部分！我们已经了解了 HTML 表格的结构和用途，现在是将这些知识付诸实践的时候了。本教程的目标是 https://chartmasters.org/best-selling-artists-of-all-time/ 上的畅销艺术家史表。首先，我们将设置工作环境并安装必要的库。然后，我们将探索目标网站，并提出一些选择器来提取表格中的数据。之后，我们将编写代码来实际抓取数据，最后将数据导出为 CSV 和 JSON 等不同格式。

建立工作环境

好了，让我们开始我们的全新项目吧！开始之前，请确保已安装 node.js。您可以从 https://nodejs.org/en/ 下载。

现在打开你最喜欢的代码编辑器，打开你的项目目录，然后运行（在终端）：

npm init -y

这将初始化一个新项目，并创建一个默认 package.json 文件。

{

  "name": "html_table_scraper", // the name of your project folder

  "version": "1.0.0",

  "description": "",

  "main": "index.js",

  "scripts": {

    "test": "echo \"Error: no test specified\" && exit 1"

  },

  "keywords": [],

  "author": "",

  "license": "ISC"

}

为了简单起见，我们将使用 "require "导入模块。但如果您想使用 "import "语句导入模块，请将此添加到 package.json 文件中：

 "类型"："module",

 // 这将使您能够使用 import 语句

 // ex: import * as cheerio from 'cheerio'；

main"："index.js "选项指定了作为程序入口的文件名。既然如此，现在就可以创建一个空的 index.js 文件。

我们将使用 cheerio 库来解析目标网站的 HTML。您可以使用

npm install cheerio

现在打开 index.js 文件，并将其作为模块加入：

const cheerio = require('cheerio')；

基本的工作环境已经确定。下一章，我们将探讨历届最畅销艺术家表的结构。

使用 DevTools 测试目标网站

通过检查 "开发工具 "的 "元素 "选项卡，我们可以提取有关表格结构的宝贵信息：

The table is stored in a <thead>, <tbody> format.

所有表格元素都分配了描述性 id、类别和角色。

在浏览器中，您可以直接访问网页的 HTML。这意味着您可以使用 getElementsByTagName 或 querySelector 等 JavaScript 函数从 HTML 中提取数据。

有鉴于此，我们可以使用开发工具控制台来测试一些选择器。

让我们使用 role="columnheader" 属性提取标题名称

现在，让我们使用 role="cell" 和 role="row" 属性从第一行提取数据：

如您所见：

我们可以使用"[role=columheader]"来选择所有标题元素。

我们可以使用 "tbody [role=row]"来选择所有行元素。

对于每一行，我们可以使用"[role=cell]"来选择其单元格。

需要注意的一点是，PIC 单元格包含一张图片，我们应编写一条特殊规则来提取其 URL。

实施 HTML 表格搜索器

现在，是时候使用 Node.js 和 cheerio 让事情变得更先进一些了。

要在 Node.js 项目中获取网站的 HTML，您必须向网站发出获取请求。这会以字符串形式返回 HTML，这意味着您无法使用 JavaScript DOM 函数来提取数据。这就是 cheerio 的用武之地。Cheerio 是一个库，可让您像在浏览器中一样解析和操作 HTML 字符串。这意味着您可以使用熟悉的 CSS 选择器从 HTML 中提取数据，就像在浏览器中一样。

还需要注意的是，从获取请求中返回的 HTML 可能与您在浏览器中看到的 HTML 不同。这是因为浏览器运行的 JavaScript 可以修改显示的 HTML。而在获取请求中，你得到的只是原始的 HTML，不需要对 JavaScript 进行任何修改。

出于测试目的，让我们向目标网站发出获取请求，并将结果 HTML 写入本地文件：

//index.js

const fs = require('fs');

(async () => {

    const response =  await fetch('https://chartmasters.org/best-selling-artists-of-all-time/');

    const raw_html = await response.text();

    fs.writeFileSync('raw-best-selling-artists-of-all-time.html', raw_html);

})();

// it will write the raw html to raw-best-selling-artists-of-all-time.html

// try it with other websites: https://randomwordgenerator.com/

您可以使用

node index.js

你会得到

表格结构保持不变。这意味着我们在前一章中发现的选择器仍然适用。

好了，现在让我们继续学习实际代码：

向https://chartmasters.org/best-selling-artists-of-all-time/ 发送获取请求后，您必须将原始 HTML 加载到 cheerio 中：

const cheerio = require('cheerio');

(async () => {

    const response =  await fetch('https://chartmasters.org/best-selling-artists-of-all-time/');

    const raw_html = await response.text();

    const $ = cheerio.load(raw_html);

})();

加载了 cheerio 后，让我们看看如何提取标题：

    const headers = $("[role=columnheader]")

 const header_names = []

 headers.each((index, el) => {

 header_names.push($(el).text())

 })

 //header_names

    [

        '#',

        'PIC',

        'Artist',

        'Total CSPC',

        'Studio Albums Sales',

        'Other LPs Sales',

        'Physical Singles Sales',

        'Digital Singles Sales',

        'Sales Update',

        'Streams EAS (Update)'

    ]

第一排

    const first_row = $("tbody [role=row]")[0]

 const first_row_cells = $(first_row).find('[role=cell]')

 const first_row_data = []

 first_row_cells.each((index, f_r_c) => {

 const image = $(f_r_c).find('img').attr('src')

 if(image) {

 first_row_data.push(image)

 }

 else {

 first_row_data.push($(f_r_c).text())

 }

   })

   //first_row_data

     [

        '1',

        'https://i.scdn.co/image/ab6761610000f178e9348cc01ff5d55971b22433',

        'The Beatles',

        '421,300,000',

        '160,650,000',

        '203,392,000',

        '116,080,000',

        '35,230,000',

        '03/01/17',

        '17,150,000 (01/03/23)'

      ]

还记得我们在浏览器开发工具控制台中使用 JavaScript 刮擦 HTML 表格的情形吗？此时，我们复制了在 Node.js 项目中实现的相同功能。您可以回顾上一章，观察这两种实现方式之间的许多相似之处。

接下来，让我们重写代码，刮取所有记录：

    const rows = $("tbody [role=row]")

 const rows_data = []

 rows.each((index, row) => {

 const row_cell_data = []

 const cells = $(row).find('[role=cell]')

 cells.each((index, cell) => {

 const image = $(cell).find('img').attr('src')

 if(image) {

 row_cell_data.push(image)

 }

 else {

 row_cell_data.push($(cell).text())

 })

       })

        rows_data.push(row_cell_data)

 })

   //rows_data

 [

 [

 '1',

 'https://i.scdn.co/image/ab6761610000f178e9348cc01ff5d55971b22433',

 'The Beatles',

 '421,300,000',

 '160,650,000',

 '203,392,000'、

         '116,080,000',

 '35,230,000',

 '03/01/17',

 '17,150,000 (01/03/23)'

 ],

 [

 '2',

 'https：//i.scdn.co/image/ab6761610000f178a2a0b9e3448c1e702de9dc90',

 'Michael Jackson',

 '336,084,000',

 '182,600,000',

 '101、997,000',

 '79,350,000',

 '79,930,000',

 '09/27/17',

 '15,692,000 (01/06/23)'

 ],

...

   ]

现在，我们获得了数据，让我们看看如何导出这些数据。

导出数据

在您成功获取了要搜刮的数据后，考虑如何存储这些信息非常重要。最常用的选项是 .json 和 .csv。请选择最符合您特定需求和要求的格式。

将数据导出为 json

如果要将数据导出为 .json，应首先将数据捆绑到与 .json 格式相似的 JavaScript 对象中。

我们有一个包含标题名称的数组（header_names）和另一个包含行数据的数组（rows_data，数组的数组）。.json 格式以键值对的形式存储信息。我们需要按照这一规则捆绑数据：

[ // this is what we need to obtain

  {

    '#': '1',

    PIC: 'https://i.scdn.co/image/ab6761610000f178e9348cc01ff5d55971b22433',

    Artist: 'The Beatles',

    'Total CSPC': '421,300,000',

    'Studio Albums Sales': '160,650,000',

    'Other LPs Sales': '203,392,000',

    'Physical Singles Sales': '116,080,000',

    'Digital Singles Sales': '35,230,000',

    'Sales Update': '03/01/17',

    'Streams EAS (Update)': '17,150,000 (01/03/23)'

  },

  {

    '#': '2',

    PIC: 'https://i.scdn.co/image/ab6761610000f178a2a0b9e3448c1e702de9dc90',

    Artist: 'Michael Jackson',

    'Total CSPC': '336,084,000',

    'Studio Albums Sales': '182,600,000',

    'Other LPs Sales': '101,997,000',

    'Physical Singles Sales': '79,350,000',

    'Digital Singles Sales': '79,930,000',

    'Sales Update': '09/27/17',

    'Streams EAS (Update)': '15,692,000 (01/06/23)'

  } 

  ...

]

以下是实现这一目标的方法：

     // go through each row

 const table_data = rows_data.map(row => {

        // create a new object

        const obj = {};

        // forEach element in header_names

        header_names.forEach((header_name, i) => {

          // add a key-value pair to the object where the key is the current header name and the value is the value at the same index in the row

          obj[header_name] = row[i];

        });

        // return the object

        return obj;

 });

现在，您可以使用 JSON.stringify() 函数将 JavaScript 对象转换为 .json 字符串，然后将其写入文件。

 const fs = require('fs');

 const table_data_json_string = JSON.stringify(table_data, null, 2)

 fs.writeFile('table_data.json', table_data_json_string, (err) => {

if (err) throw err;

console.log('The file has been saved as table_data.json!');

 })

将数据导出为 csv 格式

.csv 格式代表 "逗号分隔值"。如果要将表格保存为 .csv，则需要使用类似的格式：

id,name,age // 表头后面的行

1,Alice,20

2,Bob,25

3,Charlie,30

我们的表格数据包括一个标题名称数组（header_names）和另一个包含行数据的数组（rows_data，数组的数组）。下面是将这些数据写入 .csv 文件的方法：

    let csv_string = header_names.join(',') + '\n'; // add the headers

    // forEach row in rows_data

    rows_data.forEach(row => {

      // add the row to the CSV string

      csv_string += row.join(',') + '\n';

    });

   

    // write the string to a file

    fs.writeFile('table_data.csv', csv_string, (err) => {

      if (err) throw err;

      console.log('The file has been saved as table_data.csv!');

    });

避免受阻

您是否遇到过这样的问题：当您试图抓取一个网站的信息时，却发现该页面没有完全加载？这可能会令人沮丧，尤其是当您知道网站使用 JavaScript 生成内容时。我们无法像普通浏览器那样执行 JavaScript，这可能会导致数据不完整，甚至因为在短时间内发出过多请求而被网站禁止。

WebScrapingApi 就是解决这一问题的方法之一。有了我们的服务，您只需向我们的 API 提出请求，它就会为您处理所有复杂的任务。它将执行 JavaScript、旋转代理，甚至处理验证码。

Here is how you can make a simple fetch request to a <target_url> and write the response to a file:

const fs = require('fs');

(async () => {

    const result = await fetch('https://api.webscrapingapi.com/v1?' + new URLSearchParams({

        api_key: '<api_key>',

        url: '<target_url>',

        render_js: 1,

        proxy_type: 'residential',

    }))

   

    const html = await result.text();

    fs.writeFileSync('wsa_test.html', html);

})();

您可以在https://www.webscrapingapi.com/上创建新账户，免费获取 API_KEY

通过指定 render_js=1 参数，您将启用 WebScrapingAPI 使用无头浏览器访问目标网页的功能，该功能允许在将最终刮擦结果返回给您之前渲染 JavaScript 页面元素。

查看https://docs.webscrapingapi.com/webscrapingapi/advanced-api-features/proxies，了解我们旋转代理的功能。

结论

在本文中，我们了解了使用 JavaScript 对 HTML 表格进行网络搜刮的强大功能，以及它如何帮助我们从网站中提取有价值的数据。我们探讨了 HTML 表格的结构，并学习了如何将 cheerio 库与 Node.js 结合使用，轻松地从表格中抓取数据。我们还了解了导出数据的不同方法，包括 CSV 和 JSON 格式。按照本文概述的步骤，您现在应该已经为在任何网站上刮擦 HTML 表格打下了坚实的基础。

无论您是经验丰富的专家，还是刚刚开始第一个刮擦项目，WebScrapingAPI's 都会在每个环节为您提供帮助。我们的团队随时乐意回答您的任何问题，并为您的项目提供指导。因此，如果您感到困惑或需要帮助，请随时联系我们。