Common Questions About Web Scraping - Answers & Tips

Mihai Maxim on Mar 03 2023


Navigating the world of web scraping can be a bit overwhelming. You have to choose the right programming language and the right library, and deal with many unforeseen setbacks. It quickly becomes a lot to take in. But don't let that discourage you! In this article, I've answered some of the most frequently asked questions about web scraping. You'll learn what other people are doing and the challenges they've faced, which should help guide your own decision-making. Whether you're new to the field or a seasoned pro, there's something here for everyone.

Why can't my scraper see the same data as my browser?

You've written a script to fetch HTML from a website, but you're not getting the full data. You've tested your selectors in the browser and they should work, right? Not always. Websites that rely on JavaScript to render their content won't return the full data in response to a simple GET request. Libraries like Puppeteer and Selenium use headless browsers to render JavaScript: they make the request in the context of a real browser and wait for JavaScript to finish executing, so you can get the full HTML. That said, you may not always need a headless browser to get the missing data. Search the raw HTML for <script> tags; the missing data could be hidden inside them as JavaScript variables.
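As an illustration of that second approach, here is a minimal Python sketch. The HTML snippet and the pageData variable are made up for the example; on a real page you would inspect the fetched HTML to find the script tag and variable name that hold the data you need:

```python
import json
import re

# Hypothetical HTML response: the page renders with JavaScript,
# but the raw data is embedded in a <script> tag as a JS variable.
html = """
<html>
<body>
<div id="app"></div>
<script>
var pageData = {"title": "Winter Storm", "views": 1024};
</script>
</body>
</html>
"""

# Pull the JSON object assigned to the variable out of the script tag
match = re.search(r"var pageData = (\{.*?\});", html)
data = json.loads(match.group(1))
print(data["title"])
# Prints 'Winter Storm'
```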

How can I scrape a website that uses generated CSS classes?

Some websites use libraries that automatically create unique class names for different page components. This can make it difficult to use traditional CSS selectors to target specific elements.

One solution is to use XPath expressions instead. XPath selectors rely on the layout of the page, rather than specific class names. This means that even if the class names change, the XPath selector will still be able to locate the desired element.

For example, if you have an HTML component that looks like this:

<div class="container">
    <div class="subcontainer_af21">
        <ul class="ul_ax1">
            <li class="li_adef">
                <a href="https://link1">Winter Storm</a>
            </li>
        </ul>
        <ul class="ul_cgt4">
            <li class="li_ocv2">
                <a href="https://link2">SpaceX</a>
            </li>
        </ul>
    </div>
</div>

You can select the second <a> element with:

//div[@class='container']/div/ul[2]/li/a
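To sketch how this works in practice, here is how that XPath expression could be evaluated in Python with the lxml library (the HTML is the example component from above, condensed):

```python
from lxml import etree

# The example HTML component with generated class names
html = """
<div class="container">
  <div class="subcontainer_af21">
    <ul class="ul_ax1">
      <li class="li_adef"><a href="https://link1">Winter Storm</a></li>
    </ul>
    <ul class="ul_cgt4">
      <li class="li_ocv2"><a href="https://link2">SpaceX</a></li>
    </ul>
  </div>
</ul>
</div>
"""

dom = etree.HTML(html)
# The selector relies on position (second ul), not the generated class names
links = dom.xpath("//div[@class='container']/div/ul[2]/li/a")
print(links[0].text)
# Prints 'SpaceX'
```

Even if the generated suffixes (_af21, _cgt4, ...) change on the next build, the positional selector still finds the element.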

Is cheerio faster than Puppeteer?

Yes, Cheerio is generally considered to be faster than Puppeteer. That's because Cheerio is a server-side library that works directly with the HTML content, while Puppeteer is a browser automation library that controls a headless browser to load web pages and interact with them. Cheerio is limited in the sense that it can only work with static pages; it doesn't have the ability to interact with the browser the way Puppeteer does.

Are XPath selectors better than CSS selectors?

It depends on the context. If you are looking to extract data based on the position of elements, XPath is the better choice. However, if you are looking to extract data based on properties such as class or id, CSS selectors are a better option.

Is Playwright better than Puppeteer?

Both of them offer similar functionality, but there are some differences. Playwright supports multiple browsers, including Chrome, Firefox, and Safari (via WebKit). Puppeteer supports only Chrome and Chromium.

Playwright has better support for working with multiple tabs and windows. It also has built-in support for handling browser contexts, cookies, and storage. Playwright is better suited for complex projects.

How can I avoid IP bans?

In general, you can try to space out your requests, use different IPs, use proxies, and alter your browser fingerprint. For most people, this is a never-ending battle. The good news is that it doesn't have to be this way. You can use our solution, WebScrapingAPI. WebScrapingAPI provides an API that handles all the heavy lifting for you. It can execute JavaScript, rotate proxies, and even handle CAPTCHAs, so you never have to worry about getting your IP banned. But don't take our word for it. You can try it for free.
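As a small illustration of the first two tactics, here is a Python sketch that pairs each request with a random proxy from a pool and a randomized delay. The proxy addresses are placeholders, not real endpoints:

```python
import random
import time

# Placeholder proxy pool; swap in real proxy endpoints
PROXIES = [
    "http://proxy1.example.com:8080",
    "http://proxy2.example.com:8080",
    "http://proxy3.example.com:8080",
]

def next_request_settings(min_delay=1.0, max_delay=3.0):
    """Pick a random proxy and a randomized delay for the next request."""
    proxy = random.choice(PROXIES)
    delay = random.uniform(min_delay, max_delay)
    return proxy, delay

proxy, delay = next_request_settings()
time.sleep(delay)  # space out requests
# With the requests library, you would then do something like:
# requests.get(url, proxies={"http": proxy, "https": proxy})
print(proxy, round(delay, 2))
```

Randomizing the delay (rather than sleeping a fixed interval) makes the traffic pattern look less mechanical.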

How to extract text from HTML with BeautifulSoup?

You can use the BeautifulSoup library. Here is an example of extracting text using the .get_text() function:

from bs4 import BeautifulSoup

html_doc = """
<html>
<head>
<title>title of the page</title>
</head>
<body>
<p>a paragraph</p>
<a href='https://link.com'>a link</a>
</body>
</html>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

paragraph_text = soup.find('p').text
print(paragraph_text)
# Prints 'a paragraph'

link_text = soup.find('a').text
print(link_text)
# Prints 'a link'

all_text = soup.get_text()
print(all_text)
"""
title of the page
a paragraph
a link
"""

How to extract text from HTML with Selenium?

Here’s how you can do it in Selenium:

from selenium import webdriver
from selenium.webdriver.common.by import By

DRIVER_PATH = 'path/to/chromedriver'
driver = webdriver.Chrome(executable_path=DRIVER_PATH)
driver.get("https://en.wikipedia.org/wiki/Main_Page")

# get the first h2 element
content = driver.find_element(By.TAG_NAME, "h2")
print(content.text)
# Prints 'From today's featured article'

How to select HTML elements by text with BeautifulSoup?

With BeautifulSoup, you can use the soup.find method with the text=re.compile("<text>") parameter:

from bs4 import BeautifulSoup
import re

html_doc = """
<html>
<body>
<p class="my_paragraph">a paragraph.</p>
<p class="my_paragraph">another paragraph.</p>
</body>
</html>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

# find the first p tag that contains the text 'a par'
pTag = soup.find("p", text=re.compile("a par"))
print(pTag)

How to select HTML elements by text with Selenium?

In Selenium, you can do it with XPath:

from selenium import webdriver
from selenium.webdriver.common.by import By

DRIVER_PATH = 'path/to/chromedriver'
driver = webdriver.Chrome(executable_path=DRIVER_PATH)
driver.get("https://en.wikipedia.org/wiki/Main_Page")

# find the first span whose text contains 'Did'
span = driver.find_element(By.XPATH, "//span[contains(text(), 'Did')]")
print(span.text)
# Prints 'Did you know ...'

driver.quit()

How to find HTML elements with CSS selectors in BeautifulSoup?

Here’s how you can do it with BeautifulSoup and the find_all method (for raw CSS selector strings, you can also use soup.select):

from bs4 import BeautifulSoup

html_doc = """
<html>
<body>
<p class="my_paragraph">First paragraph.</p>
<p class="my_paragraph">Second paragraph.</p>
<p>Last paragraph.</p>
</body>
</html>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

# find all elements with class 'my_paragraph'
elements = soup.find_all(class_="my_paragraph")
for element in elements:
    print(element.text)
# Prints 'First paragraph.' and 'Second paragraph.'

How to find HTML elements by class with Selenium?

Here is how you can do it with Selenium:

from selenium import webdriver
from selenium.webdriver.common.by import By

DRIVER_PATH = 'path/to/chromedriver'
driver = webdriver.Chrome(executable_path=DRIVER_PATH)
driver.get("https://en.wikipedia.org/wiki/Main_Page")

# get all the elements with class vector-body
elements = driver.find_elements(By.CLASS_NAME, "vector-body")
for element in elements:
    print(element.text)

driver.quit()

How to use XPath with BeautifulSoup?

You will need the lxml Python library:

import requests
from bs4 import BeautifulSoup
from lxml import etree

response = requests.get("https://en.wikipedia.org/wiki/Main_Page")
soup = BeautifulSoup(response.content, 'html.parser')
dom = etree.HTML(str(soup))

xpath_str = '//h1//text()'
print(dom.xpath(xpath_str))
# Prints ['Main Page', 'Welcome to ', 'Wikipedia']

How to wait for the page to load in Selenium?

If you simply want to wait for a certain time before timing out when trying to find any element, you can use the driver.implicitly_wait(time_in_seconds) method:

from selenium import webdriver
from selenium.webdriver.common.by import By

DRIVER_PATH = 'path/to/chromedriver'
driver = webdriver.Chrome(executable_path=DRIVER_PATH)
driver.implicitly_wait(10)
driver.get("https://en.wikipedia.org/wiki/Main_Page")

# the element does not exist; Selenium waits up to 10 seconds for it,
# then raises NoSuchElementException
element = driver.find_element(By.ID, "not_found_id")
text = element.text
print(text)

# Close the browser
driver.quit()

You can also choose to wait until a certain condition is met:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

DRIVER_PATH = 'path/to/chromedriver'
driver = webdriver.Chrome(executable_path=DRIVER_PATH)
driver.get("https://en.wikipedia.org/wiki/Main_Page")

# Wait up to 10 seconds for the element with id 'content' to be present
wait = WebDriverWait(driver, 10)
element = wait.until(EC.presence_of_element_located((By.ID, "content")))

text = element.text
print(text)

# Close the browser
driver.quit()

How to find HTML elements with CSS selectors in Puppeteer?

In Puppeteer, you can use the page.$() and page.$$() functions to select elements with CSS selectors. The page.$() function is used to find the first element that matches the selector. The page.$$() function is used to find all the elements that match the selector.

const puppeteer = require('puppeteer');

(async () => {
    const browser = await puppeteer.launch({
        headless: false,
    });
    const page = await browser.newPage();
    await page.goto('https://www.scrapethissite.com/pages/simple/');

    // Extract the first row element
    const firstOddRow = await page.$('.container .row');
    console.log(await firstOddRow.evaluate(node => node.textContent));

    // Extract all the rows
    const allOddRows = await page.$$('.container .row');
    for (const oddRow of allOddRows) {
        console.log(await oddRow.evaluate(node => node.textContent));
    }

    await browser.close();
})();

How to find HTML elements with CSS selectors in Playwright?

Here is how you can do it with Playwright. It is very similar to Puppeteer:

const { chromium } = require('playwright');

(async () => {
    const browser = await chromium.launch({
        headless: false,
    });
    const context = await browser.newContext();
    const page = await context.newPage();
    await page.goto('https://www.scrapethissite.com/pages/simple/');

    // Extract the first row element
    const firstOddRow = await page.$('.container .row');
    console.log(await firstOddRow.textContent());

    // Extract all the rows
    const allOddRows = await page.$$('.container .row');
    for (const oddRow of allOddRows) {
        console.log(await oddRow.textContent());
    }

    await browser.close();
})();

How to find HTML elements with CSS selectors in cheerio?

With cheerio, you’ll have to fetch the HTML (I used the request library to do that) and then pass it to the cheerio library:

const request = require('request');
const cheerio = require('cheerio');

const url = 'https://www.scrapethissite.com/pages/simple/';

request(url, (error, response, html) => {
    if (!error && response.statusCode === 200) {
        const $ = cheerio.load(html);
        const firstOddRow = $('.container .row').first();
        console.log(firstOddRow.text());
        const allOddRows = $('.container .row');
        allOddRows.each((i, oddRow) => {
            console.log($(oddRow).text());
        });
    }
});

How to use XPath with Puppeteer?

With Puppeteer, you can use the page.$x() function to select elements with XPath selectors:

const puppeteer = require('puppeteer');

(async () => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    await page.goto('https://www.scrapethissite.com/pages/forms/');

    // Extract the table header elements
    const allTableHeaders = await page.$x('//table/tbody/tr[1]//th');
    for (let i = 0; i < allTableHeaders.length; i++) {
        const header = await page.evaluate(el => el.textContent, allTableHeaders[i]);
        console.log(header.trim());
    }

    await browser.close();
})();

// Output:
// Team Name
// Year
// Wins
// Losses
// OT Losses
// Win %
// Goals For (GF)
// Goals Against (GA)
// + / -

How to use XPath with Playwright?

const { chromium } = require('playwright');

(async () => {
    const browser = await chromium.launch({
        headless: false,
    });
    const context = await browser.newContext();
    const page = await context.newPage();
    await page.goto('https://www.scrapethissite.com/pages/forms/');

    // Extract the table header elements
    const allTableHeaders = await page.locator('xpath=//table/tbody/tr[1]//th').all();
    for (let i = 0; i < allTableHeaders.length; i++) {
        const headerText = await allTableHeaders[i].innerText();
        console.log(headerText);
    }

    await browser.close();
})();

Any selector string starting with // or .. is assumed to be an XPath selector. For example, Playwright converts '//html/body' to 'xpath=//html/body'.

How to find HTML elements by text in Puppeteer?

In Puppeteer, the simplest way to find elements by text is to use the XPath text() function:

const puppeteer = require('puppeteer');

(async () => {
    const browser = await puppeteer.launch({
        headless: false,
    });
    const page = await browser.newPage();
    await page.goto('https://en.wikipedia.org/wiki/Web_scraping');

    // Select all the p tag texts that contain the word "prevent"
    const pTags = await page.$x('//p[contains(text(), "prevent")]/text()');
    for (let i = 0; i < pTags.length; i++) {
        const pTag = await page.evaluate(el => el.textContent, pTags[i]);
        console.log(pTag, "\n");
    }

    await browser.close();
})();

// Output:
// There are methods that some websites use to prevent web scraping, such as detecting and disallowing bots from crawling (viewing) their pages. In response, there are web scraping systems that rely on using techniques in ...

How to find HTML elements by text in Playwright?

If you want to find elements by text in Playwright, you can use the allInnerTexts() function in combination with XPath.

const { chromium } = require('playwright');

(async () => {
    const browser = await chromium.launch({
        headless: false,
    });
    const context = await browser.newContext();
    const page = await context.newPage();
    await page.goto('https://en.wikipedia.org/wiki/Web_scraping');

    // Select all the p tag texts that contain the word "prevent"
    const pTags = await page.locator('//p[contains(text(), "prevent")]').allInnerTexts();
    for (let i = 0; i < pTags.length; i++) {
        console.log(pTags[i], "\n");
    }

    await browser.close();
})();

How to find HTML elements by text in cheerio?

const request = require('request');
const cheerio = require('cheerio');

const url = 'https://en.wikipedia.org/wiki/Web_scraping';

request(url, (error, response, html) => {
    if (!error && response.statusCode === 200) {
        const $ = cheerio.load(html);
        // Select all the p tags that contain the word "prevent"
        const elements = $('p').filter((i, el) => $(el).text().includes('prevent'));
        elements.each((i, el) => {
            console.log($(el).text());
        });
    }
});

How to wait for selectors in Puppeteer?

In Puppeteer, you can use the page.waitForSelector() function to wait for a specific element to appear on the page before continuing with the script. You can use it with both CSS and XPath selectors:

await page.waitForSelector('.basic-element', { timeout: 10000 });

await page.waitForXPath("//div[@class='basic-element']", { timeout: 10000 });

The timeout parameter specifies the maximum wait time in ms.

You can also wait for an element to reach a certain state:

await page.waitForSelector('.basic-element', { visible: true });

// wait until the element becomes visible

How to wait for selectors in Playwright?

Playwright is similar to Puppeteer. You can use the page.waitForSelector() method to wait for a specific element to appear on the page.

await page.waitForSelector('.element-class', { timeout: 10000 });

You can also wait for an element to reach a certain state:

await page.waitForSelector('.basic-element', { state: 'visible' });

// wait for element to become visible

Wrapping up

Web scraping is a vast subject, and this article only scratches the surface. Choosing the right tool for your specific use case is crucial. For example, if you want to scrape a static website with JavaScript, the cheerio library is a good option. However, if the website requires JavaScript to load fully, Puppeteer or Playwright are better choices. Web scraping is challenging, but understanding the tools can save you a lot of headaches. I hope this article broadened your perspective, and I wish you all the best in your web scraping endeavors.
