Find out how to Scrape JavaScript Tables with Python

Andrei Ogiolan on Apr 24 2023

Introduction

Web scraping is a powerful tool that allows you to extract data from websites and use it for a variety of purposes. It can be used to gather data for business intelligence, track changes on websites, or build your own web applications. In this article, we will be focusing on how to scrape JavaScript tables using Python.

What are JavaScript tables?

JavaScript tables are a common way to display tabular data on the web, and they can be found on a wide range of websites. Scraping these tables can be challenging because the data is often embedded in the page's source code as a JavaScript object, rather than in a standard HTML table. However, with the right tools and techniques, it is possible to extract this data using Python.
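As a rough illustration of the embedded-object case, the sketch below pulls a JSON array out of a page's source using only the standard library. The variable name tableData, the sample HTML, and the helper extract_table_data are all hypothetical. Real pages name things differently, and an object literal that is not valid JSON (single quotes, unquoted keys, nested arrays followed by a semicolon) would need a more forgiving approach.

```python
import json
import re

# Hypothetical page source: the rows live in a script tag, not in <table> markup.
SAMPLE_HTML = """
<html><body>
<div id="grid"></div>
<script>
  var tableData = [{"name": "Alice", "age": 30}, {"name": "Bob", "age": 25}];
  renderGrid(tableData);
</script>
</body></html>
"""


def extract_table_data(html, var_name="tableData"):
    """Return the JSON array assigned to `var_name`, or None if it is absent."""
    # Non-greedy match from the opening bracket to the first "];" that follows.
    pattern = re.compile(
        r"var\s+" + re.escape(var_name) + r"\s*=\s*(\[.*?\])\s*;",
        re.DOTALL,
    )
    match = pattern.search(html)
    if match is None:
        return None
    return json.loads(match.group(1))


rows = extract_table_data(SAMPLE_HTML)
```

When the data is not available in the initial source like this, for example because it is fetched after the page loads, a real browser is needed, which is the approach the rest of this article takes.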

We will begin by setting up the necessary tools and installing any required libraries. Then, we will walk through the process of using Python and a web driver to load the webpage and extract the data from the table. Finally, we will discuss some ways to improve the efficiency and reliability of the scraping process, and why it may be better to use a professional scraper for more complex tasks.

Setting up

Before we can start scraping JavaScript tables with Python, there are two prerequisites that we need to take care of:

Python: This tutorial assumes that you have Python installed on your machine. If you don't have Python installed, you can download it from the official website and follow the instructions for your operating system.
A web driver: In order to load and interact with webpages using Python, we will need to use a web driver. There are several options available, such as ChromeDriver, GeckoDriver (for Firefox), and SafariDriver. For this tutorial, we will be using ChromeDriver.

Once you have Python and a web driver installed, you will need to install the following libraries:

Selenium: Selenium is a library that allows you to control a web browser through Python. We will use it to load and interact with the webpage containing the table. For JavaScript tables, it is important to use a library like Selenium rather than the Python requests library, because requests only downloads the initial HTML and never runs the page's JavaScript, whereas Selenium lets you wait until a JavaScript-generated element appears on the page.
Pandas: Pandas is a library that provides easy-to-use data structures and data analysis tools for Python. We will use it to store and manipulate the data that we extract from the table.

To install these libraries, open a terminal or command prompt and use the pip command to install them:

$ pip install selenium pandas

That's it! You are now ready to start scraping JavaScript tables with Python. In the next section, we will walk through the process step by step.

Let’s start scraping

Now that we have all of the necessary tools installed, it's time to start scraping JavaScript tables with Python. The process involves the following steps:

Load the webpage containing the table using Selenium and a web driver.
Extract the data from the table using Selenium and Python.
Store and manipulate the data using Pandas.

Let's walk through each of these steps in more detail:

Step 1: Loading the webpage

The first thing we need to do is load the webpage containing the table that we want to scrape. We can do this using Selenium and a web driver.

First, let's import the necessary libraries:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
import time

Next, we'll create an instance of the web driver and use it to load the webpage:

# Replace "path/to/chromedriver" with the path to your ChromeDriver executable
driver = webdriver.Chrome(service=Service('path/to/chromedriver'))

# Load the webpage
driver.get('https://html.com/tags/table/')

It's important to note that the table must be present in the page before you try to extract data from it. You can call driver.implicitly_wait() to set how long Selenium keeps retrying driver.find_element(By.*, '...') before giving up, or use an explicit wait with WebDriverWait to block until a specific element, such as the table itself, has appeared.
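To make the waiting step concrete, here is a minimal hand-rolled polling helper. Selenium ships its own explicit-wait class, WebDriverWait, which is usually the better choice in real code; this sketch only spells out the same idea so the logic is visible. The function name wait_for_elements and its parameters are made up for illustration, and it relies on nothing but the driver's find_elements method.

```python
import time


def wait_for_elements(driver, by, value, min_count=1, timeout=10.0, poll=0.5):
    """Poll driver.find_elements(by, value) until at least `min_count`
    elements are found, or raise TimeoutError after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while True:
        elements = driver.find_elements(by, value)
        if len(elements) >= min_count:
            return elements
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"fewer than {min_count} elements matched {value!r} after {timeout}s"
            )
        # Sleep briefly so the loop does not hammer the browser.
        time.sleep(poll)
```

With the driver created above, this might be called as wait_for_elements(driver, By.CSS_SELECTOR, 'table tr') right after driver.get(), so that extraction only starts once at least one row exists.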

Step 2: Extract the data

Once the webpage is loaded, we can use Selenium to extract the data from the table. There are several ways to do this, but one method is to use the driver.find_elements(By.CSS_SELECTOR, 'table tr') method to locate the rows in the table, then find the cells inside each row and extract their text.

Here is an example of how to extract the data from a simple table:

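# Find all of the rows in the table
rows = driver.find_elements(By.CSS_SELECTOR, 'table tr')

# For each row, collect its data cells, falling back to header cells.
# find_elements returns an empty list when nothing matches, so no try/except is needed.
for row in rows:
    cells = row.find_elements(By.CSS_SELECTOR, 'td') or row.find_elements(By.CSS_SELECTOR, 'th')
    # Print the cells of one row on a single line
    for cell in cells:
        print(cell.text, end=' ')
    print()

# Close the browser and end the WebDriver session
driver.quit()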

Remember that you may need to use a different CSS selector depending on the table's structure and the elements it contains. You can use the developer tools in your web browser to inspect the page and find the appropriate selector.

Step 3: Store and manipulate the data

Once you've extracted the data from the table, you can store it in a Pandas data frame and manipulate it as needed. Here's an example of how to do this:

from io import StringIO
import time

import pandas as pd
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

driver = webdriver.Chrome(service=Service('/path/to/chromedriver'))

# Use the webdriver to load a webpage
driver.get('https://html.com/tags/table/')

# When scraping JavaScript generated content it is important to wait a few seconds
time.sleep(4)

table = driver.find_element(By.CSS_SELECTOR, 'table')

# Parse the table's HTML with Pandas (this needs an HTML parser such as lxml installed).
# read_html returns a list of DataFrames, one per table found, so take the first one.
df = pd.read_html(StringIO(table.get_attribute('outerHTML')))[0]
print(df)

# Close the browser and end the WebDriver session
driver.quit()
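Printing the frame is only the first step. The sketch below shows one possible way to tidy and store scraped rows with Pandas. The sample rows, the column names, and the helper rows_to_frame are invented for illustration; it assumes the cells were already collected as lists of strings, with the header row first.

```python
import pandas as pd

# Hypothetical rows as they might come out of Step 2: header first, all strings.
scraped_rows = [
    ["Name", "Age"],
    ["Alice", "30"],
    ["Bob", "25"],
]


def rows_to_frame(rows):
    """Build a DataFrame from a header row followed by data rows."""
    header, *body = rows
    return pd.DataFrame(body, columns=header)


df = rows_to_frame(scraped_rows)

# Scraped cell text is always a string, so convert numeric columns explicitly.
df["Age"] = pd.to_numeric(df["Age"])

# Sort by age and renumber the rows.
df = df.sort_values("Age").reset_index(drop=True)

# With no path argument, to_csv returns the CSV text instead of writing a file.
csv_text = df.to_csv(index=False)
```

Passing a file path as the first argument to to_csv would write the same content to disk instead of returning it as a string.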

Diving deeper

While the steps described above will allow you to scrape JavaScript tables using Python, there are a few ways to improve the process's efficiency and reliability.

One way to improve efficiency is to use a headless browser, which is a browser that runs in the background without a GUI. This can be faster than running a full browser, and it is less resource-intensive. To use a headless browser with Selenium, add the --headless argument to the Chrome options that you pass when creating the web driver instance.
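A headless setup might look like the sketch below. The helper names headless_arguments and make_headless_driver are hypothetical, the window size is an arbitrary choice, and the driver path is the same placeholder used earlier. Selenium is imported inside the second function only so that the flag list can be inspected without launching a browser.

```python
def headless_arguments(width=1920, height=1080):
    """Return Chrome command-line flags for a headless session."""
    return [
        "--headless",                       # run without a visible window
        f"--window-size={width},{height}",  # give pages a desktop-sized viewport
    ]


def make_headless_driver(driver_path="path/to/chromedriver"):
    """Create a headless Chrome driver (requires Selenium and Chrome)."""
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service

    options = webdriver.ChromeOptions()
    for argument in headless_arguments():
        options.add_argument(argument)
    return webdriver.Chrome(service=Service(driver_path), options=options)
```

The returned driver is used exactly like the one in the earlier steps: call driver.get(), extract the table, then driver.quit().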

Another way to improve reliability is to route your requests through a proxy service that provides rotating IP addresses. This can help you avoid being detected as a scraper and blocked by the website, because each request appears to come from a different IP address. WebScrapingAPI is a service that lets you scrape a website through a proxy server. To learn more about how to use proxies for web scraping, feel free to check our docs.

To use a proxy server with Selenium, I strongly recommend using selenium-wire, since it is more straightforward than plain Selenium when it comes to connecting to a proxy server. Like any other Python package, you can install it by running the following command:

$ pip install selenium-wire

Then you can use the following code sample to connect to a proxy server with Selenium:

from seleniumwire import webdriver
from selenium.webdriver.chrome.service import Service
import time

# Create a webdriver instance with the desired proxy server and authentication details
API_KEY = '<YOUR-API-KEY-HERE>'

options = {
    'proxy': {
        'http': f'http://webscrapingapi:{API_KEY}@proxy.webscrapingapi.com:80',
        'https': f'https://webscrapingapi:{API_KEY}@proxy.webscrapingapi.com:80',
        'no_proxy': 'localhost,127.0.0.1'
    }
}

driver = webdriver.Chrome(service=Service('/path/to/chromedriver'), seleniumwire_options=options)

# Use the webdriver to load a webpage
driver.get('http://httpbin.org/ip')

# When scraping JavaScript generated content it is important to wait a few seconds
time.sleep(5)

# Do something with the page, such as extract data or take a screenshot
# ...

# Close the webdriver
driver.quit()

While these techniques can improve the efficiency and reliability of your web scraping, covering them in full is beyond the scope of this article. For more complex scraping tasks, it may be more efficient and reliable to use a professional scraper, such as WebScrapingAPI. This tool provides additional features, such as IP rotation and CAPTCHA bypass, that can make the scraping process much easier and more reliable.

In the next section, we will summarize the steps for scraping JavaScript tables with Python and discuss the benefits of using a professional scraper for more complex tasks.

Summary

In conclusion, scraping JavaScript tables with Python is a powerful way to extract data from websites and use it for a variety of purposes. Whether you are using your own code or a professional scraper, this technique can be a valuable tool for gathering data and gaining insights.
