Scrapy Splash Tutorial: Mastering the Art of Scraping JavaScript-Rendered Websites with Scrapy and Splash

Ștefan Răcila on Aug 10 2023


In the complex web landscape of today, where content is often generated dynamically using JavaScript, AJAX calls, or other client-side scripting, scraping information becomes a challenging task. Traditional scraping techniques might fail to extract data that is loaded asynchronously, requiring a more sophisticated approach. This is where Scrapy Splash enters the scene.

Splash is a streamlined, scriptable browser equipped with an HTTP API. Unlike bulkier browsers, it is lightweight yet powerful, designed to render websites that build their content with JavaScript or through AJAX calls. The scrapy-splash plugin (often referred to simply as Scrapy Splash) connects this browser to the Scrapy framework. By simulating a real browser's behavior, Splash can render dynamic elements, making the combination an invaluable tool for any data extraction need involving JavaScript-rendered content.

In this comprehensive guide, we will explore the unique capabilities of Scrapy Splash, illustrating step by step how to leverage this tool effectively to scrape data from websites that utilize JavaScript for rendering. Whether you're an experienced data miner or just starting, understanding Scrapy Splash's functionalities will empower you to obtain the information you need from an increasingly dynamic web.

Stay with us as we delve into the ins and outs of using Scrapy Splash for scraping the modern, interactive web, beginning with its installation and ending with real-world examples.

How to Configure Splash: A Step-by-Step Guide to Installation and Configuration

Scrapy Splash is an immensely powerful tool that can unlock new opportunities for scraping data from dynamic websites. However, before we start reaping the benefits of Scrapy Splash, we must first get our systems set up. This involves several essential steps, including the installation of Docker, Splash, Scrapy, and the necessary configurations to make everything work together seamlessly.

1) Setting Up and Installing Docker

Docker is a cutting-edge containerization technology that allows us to isolate and run the Splash instance in a virtual container, ensuring a smooth and consistent operation.

For Linux Users:

Execute the following command in the terminal:

curl -fsSL https://get.docker.com -o get-docker.sh
sudo sh get-docker.sh
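
You can verify that the installation succeeded by checking the Docker version:

docker --version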

For Other Operating Systems:

Windows, macOS, and other OS users can find detailed installation guides on the Docker website.

2) Downloading and Installing Splash via Docker

With Docker installed, you can proceed to download the Splash Docker image, an essential part of our scraping infrastructure.

Execute the command:

docker pull scrapinghub/splash

This will download the image. Now run it with:

docker run -it -p 8050:8050 --rm scrapinghub/splash

Congratulations! Your Splash instance is now ready at localhost:8050. You should see the default Splash page when you visit this URL in your browser.
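
As a quick sanity check, you can also ask Splash to render a page directly through its HTTP API. The render.html endpoint returns the final HTML after JavaScript has executed, and the wait parameter tells Splash how many seconds to let the page settle:

curl 'http://localhost:8050/render.html?url=http://quotes.toscrape.com/&wait=1'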

3) Installing Scrapy and the Scrapy-Splash Plugin

Scrapy is a flexible scraping framework, and the scrapy-splash plugin bridges Scrapy with Splash. You can install both with:

pip install scrapy scrapy-splash

The command above downloads all the required dependencies and installs them.
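
You can confirm that both packages are importable before moving on:

python -c "import scrapy, scrapy_splash; print(scrapy.__version__)"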

4) Creating Your First Scrapy Project

Kickstart your scraping journey with the following command:

scrapy startproject splashscraper

This creates a Scrapy project named splashscraper with a structure similar to:

splashscraper
├── scrapy.cfg
└── splashscraper
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders
        └── __init__.py

5) Integrating Scrapy with Splash

Now comes the essential part - configuring Scrapy to work with Splash. This requires modifying the settings.py file in your Scrapy project.

Splash URL Configuration:

Define a variable for your Splash instance:

SPLASH_URL = 'http://localhost:8050'

Downloader Middlewares:

These settings enable interaction with Splash. The scrapy-splash documentation also recommends re-prioritizing Scrapy's built-in HttpCompressionMiddleware so that compressed Splash responses are handled correctly:

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

Spider Middlewares and Duplicate Filters:

Further, include the Splash middleware that deduplicates Splash arguments, and switch to the Splash-aware duplicate filter:

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'

The rest of the settings may remain at their default values.
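
For reference, once everything is in place, the Splash-related portion of settings.py should look roughly like this (http://localhost:8050 assumes you kept the default port mapping from step 2):

SPLASH_URL = 'http://localhost:8050'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'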

Writing a Scrapy Splash Spider

Scraping data from dynamic web pages may require interaction with JavaScript. That's where Scrapy Splash comes into play. By the end of this guide, you'll know how to create a spider using Scrapy Splash to scrape quotes from quotes.toscrape.com.

Step 1: Generating the Spider

We will use Scrapy's built-in command to generate a spider. The command is:

scrapy genspider quotes quotes.toscrape.com

Upon execution, a new file named quotes.py will be created in the spiders directory.

Step 2: Understanding the Basics of a Scrapy Spider

Opening quotes.py, you'll find:

import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        pass
  • name: The spider’s name, used when running it
  • allowed_domains: Restricts the spider to the listed domains
  • start_urls: The URLs where crawling begins
  • parse: The callback invoked for each downloaded response

Step 3: Scrape Data from a Single Page

Now, let's make the spider functional.

a) Inspect Elements Using a Web Browser

Use your browser's developer tools to analyze the HTML structure. You'll find each quote enclosed in a div tag with the class quote; inside it, the quote text lives in a span with class text, the author in a small tag with class author, and the tags in the content attribute of a meta tag with class keywords.

b) Prepare the SplashscraperItem Class

In items.py, modify it to include three fields: author, text, and tags:

import scrapy

class SplashscraperItem(scrapy.Item):
    author = scrapy.Field()
    text = scrapy.Field()
    tags = scrapy.Field()

c) Implement parse() Method

Import the SplashscraperItem class and update the parse method in quotes.py:

from splashscraper.items import SplashscraperItem

def parse(self, response):
    for quote in response.css("div.quote"):
        text = quote.css("span.text::text").extract_first("")
        author = quote.css("small.author::text").extract_first("")
        tags = quote.css("meta.keywords::attr(content)").extract_first("")

        item = SplashscraperItem()
        item['text'] = text
        item['author'] = author
        item['tags'] = tags
        yield item

Note that the content attribute of meta.keywords holds the tags as a single comma-separated string; call tags.split(',') on it if you prefer a list.

Step 4: Handling Pagination

Add the following at the end of the parse() method to navigate through all the pages. The href extracted from li.next is relative (e.g. /page/2/), so response.urljoin() is used to turn it into an absolute URL:

next_url = response.css("li.next > a::attr(href)").extract_first("")
if next_url:
    yield scrapy.Request(response.urljoin(next_url), self.parse)

Step 5: Adding Splash Requests for Dynamic Content

To use SplashRequest, you’ll have to make changes to the current spider:

from scrapy_splash import SplashRequest

def start_requests(self):
    url = 'https://quotes.toscrape.com/'
    yield SplashRequest(url, self.parse, args={'wait': 1})
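
The args dictionary is passed straight through to Splash's rendering endpoint, so any parameter Splash accepts can go there. For example, wait pauses rendering for the given number of seconds, while timeout caps the total time Splash spends on the request:

yield SplashRequest(url, self.parse, args={'wait': 1, 'timeout': 30})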

Update the parse method to use SplashRequest as well:

if next_url:
    yield SplashRequest(response.urljoin(next_url), self.parse, args={'wait': 1})

Congratulations! You've just written a fully functional Scrapy spider that utilizes Splash to scrape dynamic content. You can now run the spider and extract all the quotes, authors, and tags from quotes.toscrape.com.
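
For example, the following command runs the spider and exports the scraped items to a JSON file (make sure the Splash container from step 2 is still running):

scrapy crawl quotes -o quotes.json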

The code provides an excellent template for scraping other dynamic websites with similar structures. Happy scraping!

Handling Splash Responses in Scrapy

Splash responses in Scrapy have a few characteristics that set them apart from standard Scrapy responses. scrapy-splash returns a different response subclass depending on what Splash produced, but the extraction process itself uses the familiar Scrapy methods. Let's delve into it.

Understanding how Splash Responds to Requests and Its Response Object

When Scrapy Splash processes a request, it returns different response subclasses depending on the request type:

  • SplashResponse: for binary responses, such as images and other media files
  • SplashTextResponse: when the result is textual, such as rendered HTML
  • SplashJsonResponse: when the result is a JSON object, for example from the render.json or execute endpoints
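
You normally don't need to check the response type yourself, but it can be useful when one spider talks to several Splash endpoints. Here is a minimal sketch, assuming the response classes are importable from scrapy_splash and using the .data attribute that SplashJsonResponse exposes for the decoded JSON:

from scrapy_splash import SplashJsonResponse, SplashTextResponse

def parse(self, response):
    if isinstance(response, SplashJsonResponse):
        # Decoded JSON returned by render.json or an execute script
        data = response.data
    elif isinstance(response, SplashTextResponse):
        # Rendered HTML: the usual Scrapy selectors apply
        title = response.css("title::text").get()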

Parsing Data from Splash Responses

Scrapy’s built-in parser and Selector classes can be employed to parse Splash Responses. This means that, although the response types are different, the methods used to extract data from them remain the same.

Here's an example of how to extract data from a Splash response:

text = quote.css("span.text::text").extract_first("")
author = quote.css("small.author::text").extract_first("")
tags = quote.css("meta.keywords::attr(content)").extract_first("")

Explanation:

  • .css("span.text::text"): This uses CSS Selectors to locate the span element with class text, and ::text tells Scrapy to extract the text property from that element.
  • .css("meta.keywords::attr(content)"): Here, ::attr(content) is used to get the content attribute of the meta tag with class keywords.

Conclusion

Handling Splash responses in Scrapy doesn't require any specialized treatment. You can still use the familiar methods and syntax to extract data. The primary difference lies in understanding the type of Splash response returned, which could be a standard text, binary, or JSON. These types can be handled similarly to regular Scrapy responses, allowing for a smooth transition if you're adding Splash to an existing Scrapy project.

Happy scraping with Splash!
