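The following complete script combines environment setup, request handling, data parsing, and pagination in a single runnable file. Adjust the search term and page range to suit your needs.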
```python
import time

import requests
from bs4 import BeautifulSoup

HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/120.0.0.0 Safari/537.36"
    ),
    "Accept-Language": "en-US,en;q=0.9",
    # requests decodes "br" only if the brotli (or brotlicffi) package is installed.
    "Accept-Encoding": "gzip, deflate, br",
    "Accept": "text/html,application/xhtml+xml",
}


def scrape_amazon(query, max_pages=5):
    """Collect search-result cards for `query` from up to `max_pages` pages."""
    all_products = []
    for page in range(1, max_pages + 1):
        # Pass the query through params so requests URL-encodes it.
        resp = requests.get(
            "https://www.amazon.com/s",
            params={"k": query, "page": page},
            headers=HEADERS,
            timeout=10,
        )
        if resp.status_code != 200:
            print(f"Page {page}: HTTP {resp.status_code}")
            break
        soup = BeautifulSoup(resp.text, "lxml")  # requires the lxml package
        cards = soup.select('div[data-component-type="s-search-result"]')
        if not cards:
            break  # no result cards on this page; stop paginating
        for card in cards:
            img = card.select_one("img.s-image")
            all_products.append({
                "asin": card.get("data-asin", ""),
                "title": _text(card, "h2 a span"),
                "price": _text(card, "span.a-price span.a-offscreen"),
                "rating": _text(card, "span.a-icon-alt"),
                "image": img.get("src") if img else None,
            })
        time.sleep(2)  # pause between pages
    return all_products


def _text(card, selector):
    """Return the stripped text of the first match, or None if nothing matches."""
    tag = card.select_one(selector)
    return tag.get_text(strip=True) if tag else None


if __name__ == "__main__":
    results = scrape_amazon("wireless earbuds", max_pages=3)
    print(f"Scraped {len(results)} products")
```
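A few points are worth highlighting. The `_text` helper keeps the parsing loop compact and avoids repeating the same `None` check for every field. The `Accept-Encoding` and `Accept` headers round out the request fingerprint so that it looks more like a real browser's. Wrapping everything in a function makes it easy to embed the scraper in a larger pipeline or to call it from a scheduler.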
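To illustrate the last point, here is a minimal sketch of how the function might be plugged into a larger pipeline. The names `save_products` and `run_pipeline`, the output file names, and the step that drops records without an ASIN are all assumptions made for illustration and are not part of the original script. The scraper is passed in as a callable with the same signature as `scrape_amazon`.

```python
import csv
import json
from pathlib import Path


def save_products(products, csv_path, json_path):
    """Write a list of product dicts to CSV and JSON; return the row count."""
    # Field names mirror the keys produced by the scraper above.
    fields = ["asin", "title", "price", "rating", "image"]
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fields, extrasaction="ignore")
        writer.writeheader()
        writer.writerows(products)  # None values are written as empty cells
    Path(json_path).write_text(
        json.dumps(products, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return len(products)


def run_pipeline(scrape, query, out_dir, max_pages=3):
    """Hypothetical pipeline step: scrape, drop records without an ASIN, save."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    # Assumption: records with an empty ASIN are not useful downstream.
    products = [p for p in scrape(query, max_pages=max_pages) if p.get("asin")]
    return save_products(products, out / "products.csv", out / "products.json")
```

Passing `scrape_amazon` as the `scrape` argument wires the two pieces together, and because the scraper is injected, the saving logic can be exercised with a stub that returns canned data instead of making live requests.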