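`extract_rules` tells the API which CSS selectors to use and which data to return. `js_scenario` gives instructions to the built-in headless browser: `wait` pauses execution for the given number of milliseconds, and `evaluate` runs custom JavaScript in the page context (for scrolling, clicking, and similar actions).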
```python
import os, json
import pandas as pd
import webscrapingapi

API_KEY = os.environ.get("WSAPI_KEY")
client = webscrapingapi.WebScrapingAPIClient(API_KEY)

# Verify these selectors against a live Expedia page before use
CARD_SELECTOR = "[data-stid='lodging-card-responsive']"

extract_rules = {
    "hotels": {
        "selector": CARD_SELECTOR,
        "type": "list",
        "output": {
            "name": {"selector": "[data-stid='content-hotel-title']", "output": "text"},
            "price": {"selector": "[data-stid='price-summary']", "output": "text"},
            "rating": {"selector": ".uitk-rating-medium", "output": "text"},
            "reviews": {"selector": "[data-stid='reviews-summary']", "output": "text"},
            "location": {"selector": "[data-stid='content-hotel-neighborhood']", "output": "text"},
        }
    }
}

# Wait 2 s → scroll to bottom → wait 2 s to trigger lazy-loaded cards
js_scenario = {"instructions": [
    {"wait": 2000},
    {"evaluate": "window.scrollTo(0, document.body.scrollHeight)"},
    {"wait": 2000}
]}
```
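This two-stage wait pattern (pause, scroll, then pause again) is deliberate. Expedia uses JavaScript rendering to lazy-load hotel cards as the viewport scrolls down the page. Skipping either wait can return an incomplete hotel list, especially on slow connections or for destinations with many search results.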
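If a single scroll does not load every card, the same pattern can be repeated. The sketch below builds such a scenario from the `wait` and `evaluate` instructions described above. The helper name `build_scroll_scenario` and its parameters are hypothetical, and whether the API caps the number of instructions or the total run time is an assumption to check against its documentation.

```python
def build_scroll_scenario(scroll_steps=1, wait_ms=2000):
    """Build a js_scenario dict: one initial wait, then (scroll, wait) pairs.

    Hypothetical helper for illustration; it is not part of any SDK.
    """
    if scroll_steps < 1:
        raise ValueError("scroll_steps must be at least 1")

    # Initial pause so the first batch of cards can render
    instructions = [{"wait": wait_ms}]

    for _ in range(scroll_steps):
        # Scroll to the current bottom of the page to trigger lazy loading
        instructions.append(
            {"evaluate": "window.scrollTo(0, document.body.scrollHeight)"}
        )
        # Pause again so the newly requested cards have time to appear
        instructions.append({"wait": wait_ms})

    return {"instructions": instructions}


# With the defaults this reproduces the wait, scroll, wait scenario shown earlier
js_scenario = build_scroll_scenario()

# Three scroll passes for destinations with long result lists
deep_scenario = build_scroll_scenario(scroll_steps=3)
```

Each extra scroll step adds `wait_ms` to the request time, so it is worth raising `scroll_steps` only when the returned list is visibly short.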