Before writing the script, let's verify that Puppeteer is installed correctly:
import puppeteer from 'puppeteer';

async function scrapeIdealistaData(idealista_url: string): Promise<void> {
    // Launch Puppeteer
    const browser = await puppeteer.launch({
        headless: false,
        args: ['--start-maximized'],
        defaultViewport: null
    })

    // Create a new page
    const page = await browser.newPage()

    // Navigate to the target URL
    await page.goto(idealista_url)

    // Close the browser
    await browser.close()
}

scrapeIdealistaData("https://www.idealista.com/pt/alquiler-viviendas/toledo/buenavista-valparaiso-la-legua/")
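Here we open a browser window, create a new page, navigate to the target URL, and then close the browser. To keep things simple and make visual debugging easier, I open the browser window maximized and in non-headless mode.
Because every listing has the same structure and the same kinds of data, we can extract each piece of information for the entire set of listings in one pass. After running the script, we can loop over all the results and merge them into a single list.
To get the URLs of all the properties, we target the anchor elements with the "item-link" class. We then convert the result to a JavaScript array and map each element to the value of its "href" attribute.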
// Extract listings location
const listings_location = await page.evaluate(() => {
    const locations = document.querySelectorAll('a.item-link')
    const locations_array = Array.from(locations)
    return locations ? locations_array.map(a => a.getAttribute('href')) : []
})
console.log(listings_location.length, listings_location)
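The "href" values collected this way are relative paths, as the sample output at the end of this section shows. If you need full links, one option is to resolve them against the site origin. The snippet below is only a minimal sketch: `toAbsoluteUrl` and `BASE_URL` are hypothetical names I introduce for illustration, and the standard `URL` constructor does the actual work. Null values are kept as null so that the list stays aligned with the other lists by index.

```typescript
// Assumption: the site origin used to resolve relative listing paths.
const BASE_URL = 'https://www.idealista.com'

// Hypothetical helper (not part of the original script): turn a relative
// href into an absolute URL, passing null through unchanged.
function toAbsoluteUrl(href: string | null, base: string = BASE_URL): string | null {
    if (href === null) return null
    return new URL(href, base).toString()
}

// Example with a relative path taken from the sample output.
console.log(toAbsoluteUrl('/pt/inmueble/99004556/'))
```

You could apply it with `listings_location.map(href => toAbsoluteUrl(href))` once the list has been returned from `page.evaluate`.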
For the titles, we can reuse the same anchor elements, only this time we extract the "title" attribute.
// Extract listings titles
const listings_title = await page.evaluate(() => {
    const titles = document.querySelectorAll('a.item-link')
    const titles_array = Array.from(titles)
    return titles ? titles_array.map(t => t.getAttribute('title')) : []
})
console.log(listings_title.length, listings_title)
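For the prices, we target the "span" elements that carry two class names: "item-price" and "h2-simulated". It is important to identify elements as precisely as possible so that stray matches do not affect the final result. This result also needs to be converted to an array and then mapped to its text content.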
// Extract listings prices
const listings_price = await page.evaluate(() => {
    const prices = document.querySelectorAll('span.item-price.h2-simulated')
    const prices_array = Array.from(prices)
    return prices ? prices_array.map(p => p.textContent) : []
})
console.log(listings_price.length, listings_price)
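The prices come back as display strings such as '750€/mês' or '1.000€/mês'. If you want to sort or filter by price later, a numeric value is easier to work with. Here is a rough sketch of one way to convert them. `parsePrice` is a hypothetical helper, and it assumes whole-euro amounts that use "." as the thousands separator, which is what the sample output suggests; other formats would need different handling.

```typescript
// Hypothetical helper (not part of the original script): extract the numeric
// amount from a price string. Assumes whole euros with "." as the thousands
// separator; returns null when no number is found.
function parsePrice(text: string | null): number | null {
    if (text === null) return null
    // Grab the first run of digits and dots, e.g. "1.000" from "1.000€/mês".
    const match = text.match(/\d[\d.]*/)
    if (match === null) return null
    // Drop the thousands separators before parsing.
    return Number.parseInt(match[0].replace(/\./g, ''), 10)
}

console.log(parsePrice('750€/mês'), parsePrice('1.000€/mês'))
```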
For the property details, we apply the same principle and parse the "div" elements with the "item-detail-char" class.
// Extract listings details
const listings_detail = await page.evaluate(() => {
    const details = document.querySelectorAll('div.item-detail-char')
    const details_array = Array.from(details)
    return details ? details_array.map(d => d.textContent) : []
})
console.log(listings_detail.length, listings_detail)
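Each details string packs several features into one block of text separated by line breaks (rooms, area, floor, publication date). If you prefer to work with the individual features, you could split the text into lines. The following is just a sketch under that assumption; `splitDetails` is a hypothetical name, and it does not try to guess which line means what.

```typescript
// Hypothetical helper (not part of the original script): split a details
// block into its non-empty, trimmed lines.
function splitDetails(text: string | null): string[] {
    if (text === null) return []
    return text
        .split(/\r?\n/)
        .map(line => line.trim())
        .filter(line => line.length > 0)
}

// Example with a details string taken from the sample output.
console.log(splitDetails('\n3 quart.\n115 m² área bruta\n2º andar exterior com elevador\nOntem \n'))
```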
Finally, the property descriptions. Here we also apply a regular expression to remove all the superfluous line breaks.
// Extract listings descriptions
const listings_description = await page.evaluate(() => {
    const descriptions = document.querySelectorAll('div.item-description.description')
    const descriptions_array = Array.from(descriptions)
    // textContent is typed as string | null, so fall back to an empty string
    return descriptions ? descriptions_array.map(d => (d.textContent ?? '').replace(/(\r\n|\n|\r)/gm, "")) : []
})
console.log(listings_description.length, listings_description)
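One side effect of deleting line breaks outright is that sentences which were separated only by a line break may end up glued together (something like "Toledo.Três" in the sample output). If that bothers you, a possible alternative is to collapse every run of whitespace into a single space. This is a variation I am suggesting, not the approach used in the script above, and `normalizeWhitespace` is a hypothetical name.

```typescript
// Hypothetical alternative to the regex above: replace each run of
// whitespace (including line breaks) with one space, then trim the ends.
function normalizeWhitespace(text: string | null): string {
    return (text ?? '').replace(/\s+/g, ' ').trim()
}

console.log(normalizeWhitespace('\nFirst sentence.\nSecond   sentence.\n'))
```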
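You should now have 5 lists, one for each set of data we scraped. As mentioned earlier, we need to merge them into a single list. That way, the information we collected will be much easier to process further.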
// Group the lists
const listings = []
for (let i = 0; i < listings_location.length; i++) {
    listings.push({
        url: listings_location[i],
        title: listings_title[i],
        price: listings_price[i],
        details: listings_detail[i],
        description: listings_description[i]
    })
}
console.log(listings.length, listings)
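Keep in mind that the grouping loop pairs the lists purely by index. If one selector matches fewer elements than the others (for example, if a listing happened to have no description element), every later entry would be paired with the wrong data. A cheap safeguard is to compare the list lengths before merging. The sketch below shows the idea as a standalone function; `mergeListings`, `Listing`, and `Column` are hypothetical names, and throwing an error is just one way to react to a mismatch.

```typescript
// Shape of one merged listing; field names follow the grouping loop above.
interface Listing {
    url: string | null
    title: string | null
    price: string | null
    details: string | null
    description: string | null
}

type Column = (string | null)[]

// Hypothetical helper (not part of the original script): merge the parallel
// lists by index, but refuse to do so when their lengths differ.
function mergeListings(urls: Column, titles: Column, prices: Column, details: Column, descriptions: Column): Listing[] {
    const lengths = [urls, titles, prices, details, descriptions].map(column => column.length)
    if (lengths.some(n => n !== lengths[0])) {
        throw new Error(`List lengths differ: ${lengths.join(', ')}`)
    }
    return urls.map((url, i) => ({
        url,
        title: titles[i] ?? null,
        price: prices[i] ?? null,
        details: details[i] ?? null,
        description: descriptions[i] ?? null
    }))
}

// Example with placeholder values.
console.log(mergeListings(['/a/'], ['Title A'], ['1€'], ['details A'], ['description A']))
```

In the script, you would call it with the five lists returned by the `page.evaluate` calls.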
The final result should look like this:
[
{
url: '/pt/inmueble/99004556/',
title: 'Apartamento em ronda de Buenavista, Buenavista-Valparaíso-La Legua, Toledo',
price: '750€/mês',
details: '\n3 quart.\n115 m² área bruta\n2º andar exterior com elevador\nOntem \n',
description: 'Apartamento para alugar na Ronda Buenavista, em Toledo.Três quartos e duas casas de banho, sala, cozinha, terraço, garagem e arrecadação....'
},
{
url: '/pt/inmueble/100106615/',
title: 'Moradia em banda em Buenavista-Valparaíso-La Legua, Toledo',
price: '1.000€/mês',
details: '\n4 quart.\n195 m² área bruta\nOntem \n',
description: 'Magnífica casa geminada para alugar com 3 andares, 4 quartos aconchegantes, 3 banheiros, sala ampla e luminosa, cozinha totalmente equipa...'
},
{
url: '/pt/inmueble/100099977/',
title: 'Moradia em banda em calle Francisco Ortiz, Buenavista-Valparaíso-La Legua, Toledo',
price: '800€/mês',
details: '\n3 quart.\n118 m² área bruta\n10 jan \n',
description: 'O REMAX GRUPO FV aluga uma casa mobiliada na Calle Francisco Ortiz, em Toledo.Moradia geminada com 148 metros construídos, distribuídos...'
},
{
url: '/pt/inmueble/100094142/',
title: 'Apartamento em Buenavista-Valparaíso-La Legua, Toledo',
price: '850€/mês',
details: '\n4 quart.\n110 m² área bruta\n1º andar exterior com elevador\n10 jan \n',
description: 'Apartamento muito espaçoso para alugar sem móveis, cozinha totalmente equipada.Composto por 4 quartos, 1 casa de banho, terraço.Calefaç...'
}
]