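Extract and Parse Web Data with Python and BeautifulSoup

Sorin-Gabriel Marica, July 13, 2021

Web scrapers are tools that help you extract specific information from websites. In theory you could do this by hand, but a scraper lets you process large amounts of data far more efficiently.

Python is one of the most popular programming languages for web scraping. One of its third-party libraries, BeautifulSoup, simplifies the whole process, and together they make web scraping much easier than it is in many other languages.

In my opinion, BeautifulSoup is the easiest way to build a simple web scraper from scratch. If you want to learn more, keep reading: I will show you how to create your own web scraper with Python and BeautifulSoup.

BeautifulSoup overview

As its documentation states, BeautifulSoup is a Python library for pulling data out of HTML and XML files. You can therefore use Python to download the HTML content of a website and then use BeautifulSoup to parse that HTML and keep only the relevant information.

The main advantage of BeautifulSoup is its simple syntax. With this library you can navigate the DOM tree, search for specific elements, or modify the HTML content. These strengths have made it one of the most popular Python libraries for parsing HTML and XML documents.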
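As a quick illustration of the searching side, here is a minimal sketch. The HTML snippet, the class name "movie" and the variable names are made up for this example and are not taken from any real site.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML used only for illustration
html = """
<ul>
  <li class="movie"><a href="/m/first">First Movie</a></li>
  <li class="movie"><a href="/m/second">Second Movie</a></li>
  <li class="other">Not a movie</li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

# find() returns the first match, find_all() returns every match
first_item = soup.find("li", class_="movie")
all_items = soup.find_all("li", class_="movie")

# select() accepts a CSS selector instead of keyword filters
links = [a["href"] for a in soup.select("li.movie a")]

print(first_item.a.string)
print(len(all_items))
print(links)
```

Keyword filters such as class_ are handy for simple lookups, while select() tends to read better once the condition spans several nested elements.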

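Installation

To install BeautifulSoup, check the installation guide in its documentation, because the steps differ depending on the machine you use. In this article I am using a Linux system, where running the following command is enough: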

pip install beautifulsoup4

If you are using Python 3, you may need to install the library with the following command instead:

pip3 install beautifulsoup4

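Note that my machine already has Python 3 installed. If you are new to Python, follow a Python installation guide first. You can also check our ultimate guide to building a web scraper with Python for more information on the topic.

Building a scraper with BeautifulSoup

Now, if everything went well, we can start building our own scraper. For this article, I chose to retrieve the top 100 movies of all time from RottenTomatoes and save everything in both JSON and CSV formats.

Retrieving the page source

To warm up and get familiar with BeautifulSoup, we will first retrieve the full HTML of the page and save it to a new file named "page.txt".

If you want to see the HTML source of any page, you can press CTRL+U in Google Chrome. This opens a new tab where you will see something like this:

To get the same source code with BeautifulSoup and Python, we can use the following code: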

import requests
from bs4 import BeautifulSoup
 
scraped_url = 'https://www.rottentomatoes.com/top/bestofrt/'
page = requests.get(scraped_url)
 
soup = BeautifulSoup(page.content, 'html.parser')
 
# The with block closes the file once the content has been written
with open('page.txt', mode='w', encoding='utf-8') as file:
    file.write(soup.prettify())

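In this code, we make a request to the RottenTomatoes page and then load the whole page content into a BeautifulSoup object. The only use of BeautifulSoup in this example is the final call to prettify(), which formats the HTML code to make it easier to read.

To understand the function better: for the HTML code "<div><span>Test</span></div>", prettify() adds line breaks and indentation and turns it into this formatted code:

<div>
 <span>
  Test
 </span>
</div>

The end result of the code is a file named page.txt that contains the entire page source of the URL we requested:

Note that this is the page source before any JavaScript code runs. Some websites change the page content dynamically, in which case the page source looks different from what is actually displayed to the user. If you need your scraper to execute JavaScript, you can read our guide on building a web scraper with Selenium, or use WebScrapingAPI, our product that handles this problem for you.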
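The download step in this section assumes the request always succeeds. As a minimal sketch of a more defensive version, the hypothetical helper fetch_html below adds a timeout and stops on HTTP error codes before any HTML is parsed. The helper name, the 10-second timeout and the User-Agent value are assumptions made for this example, not part of the original script.

```python
import requests


def fetch_html(url, timeout=10):
    """Download a page and return its HTML text (hypothetical helper)."""
    # The User-Agent value is an arbitrary example string
    headers = {"User-Agent": "my-scraper/0.1"}
    response = requests.get(url, headers=headers, timeout=timeout)
    # Raises requests.HTTPError for 4xx and 5xx responses
    response.raise_for_status()
    return response.text
```

With this helper, the requests.get(scraped_url) call in the script could become fetch_html(scraped_url), and the returned text can be passed to BeautifulSoup in the same way as page.content.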

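Scraping the web data

If you look at the page source from the previous step, you will see that it contains the movie names and their ratings. Fortunately, RottenTomatoes does not load the movie list dynamically, so we can go ahead and scrape the information we need.

First, we inspect the page to see how the HTML is structured. To do that, right-click a movie title and choose the "Inspect Element" option. The following window appears:

I used a red line to highlight the useful information in this image. You can see that the page displays the top movies in a table and that each table row (<tr> element) has four cells.

The first cell contains the movie's rank, the second holds the rating (the element with the class tMeterScore), the third contains the movie title, and the last one gives the number of reviews.

Knowing this structure, we can start extracting the information we need.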

import requests
from bs4 import BeautifulSoup
 
links_base = 'https://www.rottentomatoes.com'
scraped_url = 'https://www.rottentomatoes.com/top/bestofrt/'
page = requests.get(scraped_url)
 
soup = BeautifulSoup(page.content, 'html.parser')
 
table = soup.find("table", class_="table") # We extract just the table code from the entire page
rows = table.find_all("tr") # This will extract each table row, in a list
 
movies = []
 
for index, row in enumerate(rows):
    if index > 0: # We skip the first row since this row only contains the column names
        link = row.find("a") # We get the link from the table row
        rating = row.find(class_="tMeterScore") # We get the element with the class tMeterScore from the table row
        movies.append({
            "link": links_base + link.get('href'), # The href attribute of the link
            "title": link.string.strip(), # The strip function removes blank spaces at the beginning and the end of a string
            "rating": rating.string.strip().replace("\xa0", ""), # The parser turns &nbsp; into the non-breaking space "\xa0"; we remove it along with the blank spaces
        })
        
print(movies)

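When you run this code, you should get a result like this:

In this example, we extract the table content and loop through the table rows. Since the first row only contains the column names, we skip it.

For the rest of the rows, we extract the anchor (<a>) element and the span element with the class "tMeterScore". With these two elements, we can retrieve the information we need.

The movie title is the text of the anchor element, the link is in the anchor's "href" attribute, and the rating is in the span element with the class "tMeterScore". We simply create a new dictionary for each row and append it to our movies list.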
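To try the row-parsing logic without hitting the site, here is a small offline sketch. The table markup is a simplified stand-in written for this example, assuming the four-cell layout described earlier; the real page contains more attributes and may have changed since this article was written. The parse_row helper is hypothetical and simply skips rows that lack a link or a rating.

```python
from bs4 import BeautifulSoup

# Simplified, made-up markup that mimics the structure described above
sample_html = """
<table class="table">
  <tr><th>Rank</th><th>Rating</th><th>Title</th><th>No. of Reviews</th></tr>
  <tr>
    <td>1.</td>
    <td><span class="tMeterScore">&nbsp;99%</span></td>
    <td><a href="/m/it_happened_one_night"> It Happened One Night (1934)</a></td>
    <td>100</td>
  </tr>
</table>
"""

links_base = "https://www.rottentomatoes.com"


def parse_row(row):
    """Return a movie dictionary, or None when the row has no link or rating."""
    link = row.find("a")
    rating = row.find(class_="tMeterScore")
    if link is None or rating is None:
        return None  # for example, the header row
    return {
        "link": links_base + link.get("href"),
        # get_text(strip=True) also works when the tag has nested children
        "title": link.get_text(strip=True),
        "rating": rating.get_text(strip=True),
    }


soup = BeautifulSoup(sample_html, "html.parser")
rows = soup.find("table", class_="table").find_all("tr")
movies = [movie for movie in map(parse_row, rows) if movie is not None]
print(movies)
```

Checking for None instead of skipping the first row by index keeps the loop working even if the table gains extra header or separator rows.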

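Saving the web data

So far, the scraper has retrieved and formatted the data, but we have only displayed it in the terminal. Alternatively, we can save the information on our computer as JSON or CSV. The full code of the scraper, including the creation of the local files, is below: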

import requests
from bs4 import BeautifulSoup
import csv
import json
 
links_base = 'https://www.rottentomatoes.com'
scraped_url = 'https://www.rottentomatoes.com/top/bestofrt/'
page = requests.get(scraped_url)
 
soup = BeautifulSoup(page.content, 'html.parser')
 
table = soup.find("table", class_="table") # We extract just the table code from the entire page
rows = table.find_all("tr") # This will extract each table row from the table, in a list
 
movies = []
 
for index, row in enumerate(rows):
    if index > 0: # We skip the first row since this row only contains the column names
        link = row.find("a") # We get the link from the table row
        rating = row.find(class_="tMeterScore") # We get the element with the class tMeterScore from the table row
        movies.append({
            "link": links_base + link.get('href'), # The href attribute of the link
            "title": link.string.strip(), # The strip function removes blank spaces at the beginning and the end of a string
            "rating": rating.string.strip().replace("\xa0", ""), # The parser turns &nbsp; into the non-breaking space "\xa0"; we remove it along with the blank spaces
        })
        
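# Save the list as JSON; the with block closes the file after writing
with open('movies.json', mode='w', encoding='utf-8') as file:
    file.write(json.dumps(movies))
 
# newline='' keeps the csv module from writing blank lines between rows on Windows
with open('movies.csv', mode='w', newline='', encoding='utf-8') as file:
    writer = csv.writer(file)
    for movie in movies:
        writer.writerow(movie.values())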
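The CSV file written above has no header row. As one possible variation, csv.DictWriter from the standard library writes the column names first. The sample movies list and the file name movies_with_header.csv are placeholders chosen for this sketch.

```python
import csv
import json

# Placeholder data in the same shape as the scraper's movies list
movies = [
    {"link": "https://example.com/m/a", "title": "Movie A", "rating": "99%"},
    {"link": "https://example.com/m/b", "title": "Movie B", "rating": "98%"},
]

# DictWriter writes a header row and maps each dictionary key to a column
with open("movies_with_header.csv", mode="w", newline="", encoding="utf-8") as file:
    writer = csv.DictWriter(file, fieldnames=["title", "rating", "link"])
    writer.writeheader()
    writer.writerows(movies)

# Reading the file back confirms the rows survived the round trip
with open("movies_with_header.csv", newline="", encoding="utf-8") as file:
    loaded = list(csv.DictReader(file))

# indent=2 makes the JSON output easier to read by eye
print(json.dumps(loaded, indent=2))
```

Naming the columns explicitly also fixes their order, so the file layout no longer depends on the order in which the dictionary keys were created.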

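Going further

Now that you have all the information, you can take the scraping further. Remember that each movie has a link. You could continue by scraping the movie pages and extracting even more information about them.

For example, if you check the page for the movie It Happened One Night (1934), you will see that you could still scrape useful information such as the audience score, the movie's runtime, the genre, and so on.

However, making all of these requests in a short period of time looks very unusual and may lead to CAPTCHA checks or even IP blocks. To avoid this, you should use rotating proxies so that the traffic you send looks natural and comes from multiple IPs.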
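A minimal sketch of that idea, assuming you already have a pool of proxies: each request goes through the next proxy in the list, with a short pause in between. The proxy addresses are placeholders that will not work as written, and the helper names and the one-second delay are assumptions made for this example.

```python
import itertools
import time

import requests

# Placeholder proxy addresses; replace them with proxies you actually control
PROXIES = [
    "http://proxy1.example.com:8080",
    "http://proxy2.example.com:8080",
    "http://proxy3.example.com:8080",
]

# cycle() yields the proxies in order and starts over after the last one
proxy_pool = itertools.cycle(PROXIES)


def next_proxy_config():
    """Return a requests-style proxies dictionary for the next proxy."""
    proxy = next(proxy_pool)
    return {"http": proxy, "https": proxy}


def fetch_all(urls, delay_seconds=1.0):
    """Fetch each URL through a different proxy, pausing between requests."""
    pages = []
    for url in urls:
        response = requests.get(url, proxies=next_proxy_config(), timeout=10)
        response.raise_for_status()
        pages.append(response.text)
        time.sleep(delay_seconds)  # spread the requests out over time
    return pages
```

Round-robin rotation is the simplest policy; a real setup would probably also drop proxies that keep failing and retry the affected URLs.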

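Other features of BeautifulSoup

Although our RottenTomatoes scraper is complete, BeautifulSoup still has a lot more to offer. Whenever you work on a project, keep the documentation open so you can quickly look up a solution when you get stuck.

For example, BeautifulSoup lets you navigate the DOM tree of a page: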

from bs4 import BeautifulSoup
 
soup = BeautifulSoup("<head><title>Title</title></head><body><div><p>Some text <span>Span</span></p></div></body>", 'html.parser')
 
print(soup.head.title) # Will print "<title>Title</title>"
print(soup.body.div.p.span) # Will print "<span>Span</span>"

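This feature helps when you need to select an element that cannot be identified by its attributes. In that case, the only way to find it is through the structure of the DOM.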
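Beyond the dotted access shown above, the tree can also be walked upward and sideways. The sketch below uses a slightly extended sample document with a second paragraph added for illustration; parent, children and find_next_sibling belong to BeautifulSoup's navigation API, while the variable names are just for this example.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

soup = BeautifulSoup(
    "<body><div><p>Some text <span>Span</span></p><p>Second paragraph</p></div></body>",
    "html.parser",
)

span = soup.find("span")

# .parent moves one level up the tree
print(span.parent.name)

# find_next_sibling() moves sideways to the next matching tag on the same level
second = span.parent.find_next_sibling("p")
print(second.string)

# .children yields text nodes as well as tags, so we keep only the tags
child_tags = [child.name for child in span.parent.children if isinstance(child, Tag)]
print(child_tags)
```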

Another nice thing about BeautifulSoup is that you can modify the page source:

from bs4 import BeautifulSoup
 
soup = BeautifulSoup("<head><title>Title</title></head><body><div><p>Some text <span>Span</span></p></div></body>", 'html.parser')
 
soup.head.title.string = "New Title"
print(soup)
# The line above will print "<head><title>New Title</title></head><body><div><p>Some text <span>Span</span></p></div></body>"

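This can be invaluable if you want to build a service that lets users optimize their pages. For example, you could use a script to scrape a website, get its CSS, minify it, and replace it in the HTML source. The possibilities are endless!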
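As a rough sketch of that idea, the snippet below collapses whitespace in every inline <style> block and writes the result back into the tree. The minify_css helper is hypothetical and deliberately naive (it would, for instance, mangle whitespace inside quoted CSS strings); a real service would rely on a dedicated CSS minifier.

```python
import re

from bs4 import BeautifulSoup

# Made-up page with one inline stylesheet
html = """<head><style>
body {
    margin: 0;
    color: black;
}
</style></head><body><p>Hello</p></body>"""


def minify_css(css):
    """Naive minifier: collapse whitespace and trim it around punctuation."""
    css = re.sub(r"\s+", " ", css)
    css = re.sub(r"\s*([{};:,])\s*", r"\1", css)
    return css.strip()


soup = BeautifulSoup(html, "html.parser")

for style in soup.find_all("style"):
    if style.string:
        # Assigning to .string replaces the contents of the tag
        style.string = minify_css(style.string)

print(soup)
```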