Parse HTML Like a Pro: Mastering Web Scraping with Python and Regex

Suciu Dan, April 13, 2023

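The amount of data on the internet has grown steadily over the past few decades. People use this data for all kinds of purposes, from personal interest to business research.

However, if this data is not returned in a format such as XML or JSON, it can be difficult or impossible for software applications to read. This is where web scraping comes in.

Web scraping is the process of collecting and processing raw data from the internet. Once parsed, this data can be used for many purposes, such as price intelligence, market research, training AI models, sentiment analysis, brand audits, and SEO audits.

One of the key parts of web scraping is parsing HTML. This can be done with a variety of tools, such as BeautifulSoup for Python, Cheerio for NodeJS, and Nokogiri for Ruby.

A regular expression (regex) is a sequence of characters that defines a search pattern.

In this article, we will explore how to parse HTML documents using regex and Python. We will also discuss some of the challenges that come with web scraping and alternative solutions.

By the end of the article, you will have a solid understanding of the topic and of the tools and techniques available.

Basic Regex Parsing

Most general-purpose programming languages support regex. You can use it in Python, C, C++, Java, Rust, OCaml, JavaScript, and many others.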

Here’s what a regex rule for extracting the value from the <title> tag looks like:

<title>(.*?)</title>

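Scary, right? Keep in mind that this is only the beginning. We'll be going down the rabbit hole soon.

In this article, I'm using Python 3.11.1. Let's put this rule into code. Create a file named main.py and paste this snippet: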

import re

html = "<html><head><title>Scraping</title></head></html>"

# Capture the text between the opening and closing <title> tags
title_search = re.search(r"<title>(.*?)</title>", html)
title = title_search.group(1)

print(title)

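Run the code with the `python main.py` command. The output is the word "Scraping".

In this example, we use the `re` module to work with regex. The `re.search()` function searches a string for a specific pattern. The first argument is the regex pattern, and the second is the string to search.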

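The regex pattern in this example is "<title>(.*?)</title>". It consists of several parts:

<title>: This is a literal string; it matches the characters "<title>" exactly.
(.*?): This is a capturing group, denoted by the parentheses. The . character matches any single character (except a newline), and the * quantifier means to match 0 or more of the preceding character. The ? makes the * non-greedy, which means it stops as soon as it finds the closing tag.
</title>: This is also a literal string; it matches the characters "</title>" exactly.

If a match is found, re.search() returns a match object, and the group(1) method extracts the text matched by the first capturing group, which is the text between the opening and closing title tags.

This text is assigned to the variable title, and the output is "Scraping".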
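To make the effect of the non-greedy ? concrete, here is a minimal sketch that runs the greedy and non-greedy versions of the rule against the same string. The input with two title tags is a made-up example, not something a real page would contain. The last lines also show that re.search() returns None when nothing matches, which is worth checking before calling group().

```python
import re

# Hypothetical input with two <title> tags on one line, used only to
# contrast greedy and non-greedy matching.
html = "<title>First</title><title>Second</title>"

# Greedy: .* grabs as much as possible and stops at the LAST </title>.
greedy = re.search(r"<title>(.*)</title>", html)
# Non-greedy: .*? grabs as little as possible and stops at the FIRST </title>.
lazy = re.search(r"<title>(.*?)</title>", html)

print(greedy.group(1))  # First</title><title>Second
print(lazy.group(1))    # First

# re.search() returns None when the pattern is not found,
# so calling .group(1) on the result would raise AttributeError.
missing = re.search(r"<title>(.*?)</title>", "<html></html>")
print(missing is None)  # True
```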

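Advanced Regex Parsing

Extracting data from a single HTML tag isn't that useful. It gives you a taste of what regular expressions can do, but you can't use it in a real-world situation.

Let's look at the PyPI website, the Python Package Index. The home page displays four statistics: the number of projects, releases, files, and users.

We want to extract the number of projects. To do that, we can use this regex:

([0-9,]+) projects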

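The regular expression matches any string that starts with one or more digits, optionally separated by commas, and ends with " projects". Here is how it breaks down:

([0-9,]+): This is a capturing group, denoted by the parentheses; the square brackets [0-9,] match any digit from 0 to 9 and the character `,`; the + quantifier means to match 1 or more of the preceding character.
projects: This is a literal string; it matches the characters " projects" (including the leading space) exactly.

It's time to put the rule to the test. Update the `main.py` code with the following snippet: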
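Before pointing the rule at the live site, it can be tried against a hard-coded string. The sketch below is only an illustration: `sample_html` is an invented fragment (the real PyPI markup and number will differ) and `extract_project_count` is a hypothetical helper name. It also converts the captured text to an integer by dropping the commas.

```python
import re

# Invented fragment standing in for the PyPI home page.
sample_html = "<p>512,345 projects</p>"


def extract_project_count(html):
    """Return the number in front of ' projects' as an int, or None if absent."""
    match = re.search(r"([0-9,]+) projects", html)
    if match is None:
        return None
    digits = match.group(1).replace(",", "")
    # The character class also accepts a lone comma, so guard before converting.
    return int(digits) if digits else None


print(extract_project_count(sample_html))                  # 512345
print(extract_project_count("<p>no statistics here</p>"))  # None
```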

import re
import urllib.request

# Fetch the PyPI home page and decode the response body
response = urllib.request.urlopen("https://pypi.org/")
html = response.read().decode("utf-8")

# Capture the number that precedes " projects"
matches = re.search(r"([0-9,]+) projects", html)
projects = matches.group(1)

print(projects)

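We use the urlopen method from the urllib library to make a GET request to the pypi.org website. We read the response into the html variable. We run the regex rule against the HTML content and print the first matched group.

Run the code with the `python main.py` command and check the output: it displays the number of projects from the website.

Extracting Links

Now that we have a simple scraper that can fetch a site's HTML document, let's play with the code a bit.

We can extract all links with this rule: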
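Network requests can fail or hang, so a slightly more defensive fetch may be useful once the scraper grows. The following is a sketch, not part of the original script: `fetch_html` is a hypothetical helper, the 10-second timeout is an arbitrary choice, and falling back to UTF-8 when the server declares no charset is an assumption.

```python
import urllib.request


def fetch_html(url, timeout=10.0):
    """Return the decoded body of `url`, or None if the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            # Use the charset declared in the Content-Type header, else UTF-8.
            charset = response.headers.get_content_charset() or "utf-8"
            return response.read().decode(charset, errors="replace")
    except OSError as error:
        # URLError, HTTPError, and socket timeouts are all OSError subclasses.
        print(f"Request failed: {error}")
        return None


# Example usage (performs a real network request):
# html = fetch_html("https://pypi.org/")
```

Returning None on failure keeps the calling code simple, at the cost of hiding the exact error; re-raising may be preferable in a larger program.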

href=[\'"]?([^\'" >]+)

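This regular expression consists of several parts:

href=: This is a literal string; it matches the characters "href=" exactly.
[\'"]?: The square brackets [] match any single character listed inside them, in this case ' or "; the ? quantifier means to match zero or one of the preceding character, so the href value may or may not be wrapped in ' or ".
([^\'" >]+): This is a capturing group, denoted by the parentheses; the ^ inside the square brackets negates the set, so it matches any character that is not ', ", >, or a space; the + quantifier means to match 1 or more of the preceding character, so the group captures one or more characters that fit the pattern.

Extracting Images

One more thing and we're almost done writing regex rules: we need to extract the images. Let's use this rule: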
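The link rule can be tried out in the same way as the earlier ones. This is a minimal sketch run against an invented HTML fragment with a double-quoted, a single-quoted, and an unquoted href value. Resolving relative links with urllib.parse.urljoin and the `base_url` value are additions for illustration, not part of the rule itself.

```python
import re
from urllib.parse import urljoin

# Invented fragment covering the three quoting styles the rule accepts.
html = (
    '<a href="https://pypi.org/help/">Help</a>'
    "<a href='/sponsors/'>Sponsors</a>"
    "<a href=/account/login/>Log in</a>"
)

# Same rule as above; the raw triple-quoted string avoids escaping the quotes.
links = re.findall(r"""href=['"]?([^'" >]+)""", html)
print(*links, sep="\n")

# Relative links can be resolved against the page URL (assumed here).
base_url = "https://pypi.org/"
absolute_links = [urljoin(base_url, link) for link in links]
print(*absolute_links, sep="\n")
```

Note that the rule matches any href attribute, so on a real page it also picks up stylesheet and icon URLs from link tags, not only anchors.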

<img.*?src="(.*?)"

This regular expression consists of several parts:

<img: This is a literal string; it matches the characters "<img" exactly.
.*?: The .* matches any character (except a newline) 0 or more times, and the ? makes the match non-greedy, so it consumes as few characters as possible. This part covers whatever appears before the src attribute in the <img> tag, which lets the pattern match an <img> tag regardless of how many attributes it has.
src=": This is a literal string; it matches the characters src=" exactly.
(.*?): This is a capturing group, denoted by the parentheses. The .*? matches any character (except a newline) 0 or more times, as few as possible. This group captures the src value of the <img> tag.
": This is a literal string; it matches the closing " character exactly.

Let's test it. Replace the previous snippet with the following code:

import re
import urllib.request

# Fetch the PyPI home page and decode the response body
response = urllib.request.urlopen("https://pypi.org/")
html = response.read().decode("utf-8")

# Collect the src value of every <img> tag
images = re.findall(r'<img.*?src="(.*?)"', html)

print(*images, sep="\n")

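The output of this code is a list of all the image links on the PyPI page.

Limitations

Web scraping with regular expressions can be a powerful way to extract data from websites, but it has its limits. One of the main problems with using regex for web scraping is that it can fail when the HTML structure changes.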
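Because . does not match a newline by default, the image rule misses tags whose attributes are spread over several lines. The sketch below shows this on an invented tag and shows how the re.DOTALL flag changes the result. Treat it as an illustration rather than a recommended fix: with re.DOTALL the .*? can also run past the end of an <img> tag that has no src attribute.

```python
import re

# Invented <img> tag whose attributes are split across lines.
html = '<img class="logo"\n     alt="Logo"\n     src="/static/logo.svg">'

pattern = r'<img.*?src="(.*?)"'

# Default behavior: "." stops at the first newline, so nothing matches.
without_dotall = re.findall(pattern, html)
print(without_dotall)  # []

# With re.DOTALL, "." also matches newlines and the src value is found.
with_dotall = re.findall(pattern, html, re.DOTALL)
print(with_dotall)  # ['/static/logo.svg']
```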

For example, consider the following markup, from which we are trying to extract the text of the h2 tags using regex:

<html>
   <head>
       <title>Example Title</title>
   </head>
   <body>
       <h1>Page Title</h1>
       <p>This is a paragraph under the title</p>
       <h2>First Subtitle</h2>
       <p>First paragraph under the subtitle</p>
       <h2>Second Subtitle</p>
   </body>
</html>

Compare the first <h2> tag with the second one. You may notice the second <h2> is not properly closed, and the code has </p> instead of </h2>. Let’s update the snippet with this:

import re

html = "<html><head><title>Example Title</title></head><body><h1>Page Title</h1><p>This is a paragraph under the title</p><h2>First Subtitle</h2><p>First paragraph under the subtitle</p><h2>Second Subtitle</p></body></html>"

# Capture the text of every properly closed <h2> tag
heading_tags = re.findall(r"<h2>(.*?)</h2>", html)

print(*heading_tags, sep="\n")

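Let's run the code and check the output:

First Subtitle

The text from the second heading tag is missing. This happens because the regex rule doesn't match the unclosed heading tag.

One way to solve this problem is to use a library like BeautifulSoup, which lets you navigate and search the HTML tree structure instead of relying on regular expressions. With BeautifulSoup, you can extract the h2 headings like this: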
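Malformed markup is not the only way the rule can break. As a rough illustration, the sketch below uses an invented markup change, where a class attribute is added to the heading, and shows the same rule returning nothing. A looser pattern recovers the text, but each new variation in the HTML tends to need another tweak.

```python
import re

before = "<h2>First Subtitle</h2>"
after = '<h2 class="section">First Subtitle</h2>'  # hypothetical markup change

strict = r"<h2>(.*?)</h2>"
strict_before = re.findall(strict, before)
strict_after = re.findall(strict, after)
print(strict_before)  # ['First Subtitle']
print(strict_after)   # []

# Allowing any attributes inside the opening tag makes the rule match again.
loose = r"<h2[^>]*>(.*?)</h2>"
loose_after = re.findall(loose, after)
print(loose_after)    # ['First Subtitle']
```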

from bs4 import BeautifulSoup

html = "<html><head><title>Example Title</title></head><body><h1>Page Title</h1><p>This is a paragraph under the title</p><h2>First Subtitle</h2><p>First paragraph under the subtitle</p><h2>Second Subtitle</p></body></html>"

soup = BeautifulSoup(html, "html.parser")

# find_all() is the current name for the older findAll() alias
for heading_tag in soup.find_all("h2"):
    print(heading_tag.text)

BeautifulSoup successfully extracts the text from the malformed tag, and the output looks like this: