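Real-world scraping rarely involves just loading a page. You often need to click buttons, fill in search forms, or scroll to trigger lazy-loaded content. Scrapy Playwright's page methods let you do all of this through the playwright_page_methods key in the request meta.
A PageMethod is a wrapper around a Playwright page action. You pass a list of actions, and the handler executes each one in order after the initial navigation.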
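To make the "executed in order" behaviour concrete, here is a minimal sketch of the idea using only the standard library. It is not the scrapy-playwright implementation: PageMethodSketch, FakePage, and apply_page_methods are hypothetical stand-ins invented for illustration, and the fake page only records the calls it receives instead of driving a browser.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Any


@dataclass
class PageMethodSketch:
    """Hypothetical stand-in for PageMethod: a method name plus its arguments."""
    method: str
    args: tuple = ()
    kwargs: dict = field(default_factory=dict)
    result: Any = None


class FakePage:
    """Hypothetical stand-in for a Playwright page; it only records calls."""

    def __init__(self):
        self.calls = []

    async def click(self, selector):
        self.calls.append(("click", selector))

    async def wait_for_selector(self, selector):
        self.calls.append(("wait_for_selector", selector))
        return f"<handle for {selector}>"


async def apply_page_methods(page, page_methods):
    # Look up each named method on the page and await it, in list order.
    for pm in page_methods:
        bound = getattr(page, pm.method)
        pm.result = await bound(*pm.args, **pm.kwargs)


page = FakePage()
steps = [
    PageMethodSketch("click", kwargs={"selector": "button#load-more"}),
    PageMethodSketch("wait_for_selector", kwargs={"selector": "div.new-content"}),
]
asyncio.run(apply_page_methods(page, steps))
print(page.calls)
```

The recorded calls come out in the same order as the list, which is why the position of each entry in playwright_page_methods matters.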
Clicking a button:
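import scrapy
from scrapy_playwright.page import PageMethod

# Inside a spider method such as start_requests():
yield scrapy.Request(
    url,
    meta={
        "playwright": True,  # route this request through Playwright
        "playwright_page_methods": [
            # Click the button, then wait for the content it loads.
            PageMethod("click", selector="button#load-more"),
            PageMethod("wait_for_selector", selector="div.new-content"),
        ],
    },
    callback=self.parse,
)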
Filling in and submitting a form:
yield scrapy.Request(
    url,
    meta={
        "playwright": True,
        "playwright_page_methods": [
            # Type the query, submit the form, then wait for the results.
            PageMethod("fill", selector="input#search", value="python scrapy"),
            PageMethod("click", selector="button[type=submit]"),
            PageMethod("wait_for_selector", selector="div.results"),
        ],
    },
    callback=self.parse,
)
Scrolling to the bottom of the page:
yield scrapy.Request(
    url,
    meta={
        "playwright": True,
        "playwright_page_methods": [
            # Run JavaScript in the page to jump to the bottom.
            PageMethod(
                "evaluate",
                "window.scrollTo(0, document.body.scrollHeight)",
            ),
            # Pause for 2000 ms so lazy-loaded content can arrive.
            PageMethod("wait_for_timeout", 2000),
        ],
    },
    callback=self.parse,
)
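Notice the pattern: you chain PageMethod calls to simulate a real user session. The handler processes them sequentially, so order matters. After any action that triggers new content (a click that fires an API call, a scroll that loads more items), always add a wait so the page has time to finish updating before Scrapy captures the final HTML.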
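One way to keep the "action, then wait" pairing consistent is to generate the steps with a small helper. The sketch below is only an illustration: scroll_steps and its pause_ms parameter are hypothetical names, and the helper returns plain (method, args) pairs rather than real PageMethod objects so that it runs without any third-party packages.

```python
SCROLL_TO_BOTTOM = "window.scrollTo(0, document.body.scrollHeight)"


def scroll_steps(times, pause_ms=1000):
    """Return (method, args) pairs in which every scroll is followed by a wait."""
    steps = []
    for _ in range(times):
        # The action that triggers new content...
        steps.append(("evaluate", (SCROLL_TO_BOTTOM,)))
        # ...immediately followed by a pause so the page can update.
        steps.append(("wait_for_timeout", (pause_ms,)))
    return steps


steps = scroll_steps(2, pause_ms=500)
for method, args in steps:
    print(method, args)
```

With scrapy-playwright installed, the pairs could be turned into page methods with `[PageMethod(method, *args) for method, args in steps]` and passed as the playwright_page_methods value.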
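A practical tip: keep the playwright_page_methods list as short as possible. Every method call adds latency. If you can reach the same result in fewer steps (for example, by navigating directly to a filtered URL instead of filling in a form), prefer the simpler approach.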
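As a sketch of the "filtered URL" shortcut, the snippet below builds the search URL directly instead of filling in the form. It assumes the target site accepts the query as a URL parameter. The https://example.com/search endpoint, the q parameter name, and the build_search_url helper are all hypothetical, so check how the real site encodes its searches before relying on this.

```python
from urllib.parse import urlencode


def build_search_url(base_url, params):
    """Append URL-encoded query parameters to a base URL that has no query yet."""
    return f"{base_url}?{urlencode(params)}"


# Hypothetical endpoint and parameter name, used only for illustration.
search_url = build_search_url("https://example.com/search", {"q": "python scrapy"})
print(search_url)
```

Requesting a URL built this way replaces the fill and click steps, leaving at most a single wait_for_selector in the page-method list.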