You can print the capital of the first country with the following code:
parsel_dom = parsel.Selector(text=raw_html)
first_capital = parsel_dom.css(".country-capital::text").get()
print(first_capital)
// Output
Andorra la Vella
parsel_dom.css(".country-capital::text").get() will select the inner text of the first element that has the country-capital class.
You can print the names of all the countries with the following code:
countries_names = filter(lambda line: line.strip() != "", parsel_dom.css(".country-name::text").getall())
for country_name in countries_names:
    print(country_name.strip())
// Output
Andorra
United Arab Emirates
Afghanistan
Antigua and Barbuda
Anguilla
. . .
parsel_dom.css(".country-name::text").getall() will select the inner texts of all the elements that have the "country-name" class.
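Note that the output needs some cleanup. Every element with the country-name class contains a nested <i> tag, so its text is split into two text nodes: a whitespace-only one before the tag and the country name after it. The country name itself is also surrounded by whitespace.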
<h3 class="country-name">
<i class="flag-icon flag-icon-ae"></i> // the whitespace before this tag is picked up as a blank string
United Arab Emirates // this is picked up as " United Arab Emirates "
</h3>
Now, let's write a script that extracts all the data using CSS selectors:
import parsel
import requests

# Download the page and parse it
response = requests.get("https://www.scrapethissite.com/pages/simple/")
raw_html = response.text
parsel_dom = parsel.Selector(text=raw_html)

# Each .country element holds the data for one country
countries = parsel_dom.css(".country")
countries_data = []
for country in countries:
    # The name is the second text node; the first is the whitespace before the <i> tag
    country_name = country.css(".country-name::text").getall()[1].strip()
    country_capital = country.css(".country-capital::text").get()
    country_population = country.css(".country-population::text").get()
    country_area = country.css(".country-area::text").get()
    countries_data.append({
        "name": country_name,
        "capital": country_capital,
        "population": country_population,
        "area": country_area
    })

for country_data in countries_data:
    print(country_data)
// Outputs
{'name': 'Andorra', 'capital': 'Andorra la Vella', 'population': '84000', 'area': '468.0'}
{'name': 'United Arab Emirates', 'capital': 'Abu Dhabi', 'population': '4975593', 'area': '82880.0'}
{'name': 'Afghanistan', 'capital': 'Kabul', 'population': '29121286', 'area': '647500.0'}
...
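As the output shows, every extracted value is a string, including the population and the area. If you want to sort or compare those fields, one option is to convert them after scraping. The following is a minimal sketch, not part of the original script: to_typed_record is a hypothetical helper, the sample records are copied from the output above, and it assumes both fields always hold valid numeric strings.

```python
def to_typed_record(record):
    """Return a copy of a scraped record with its numeric fields converted."""
    return {
        "name": record["name"],
        "capital": record["capital"],
        # Assumption: population is always an integer string such as '84000'.
        "population": int(record["population"]),
        # Assumption: area is always a decimal string such as '468.0'.
        "area": float(record["area"]),
    }


# Two records copied from the output above, standing in for countries_data.
sample_records = [
    {"name": "Andorra", "capital": "Andorra la Vella",
     "population": "84000", "area": "468.0"},
    {"name": "United Arab Emirates", "capital": "Abu Dhabi",
     "population": "4975593", "area": "82880.0"},
]

typed_records = [to_typed_record(record) for record in sample_records]

# With numeric values, comparisons behave as expected.
largest = max(typed_records, key=lambda record: record["area"])
print(largest["name"])
```

If a field can be missing on the real page, .get() returns None for it and int() or float() would raise an error, so a production version would need to handle that case.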