JSoup: HTML Parsing In Java

Mihai Maxim, January 31, 2023

Introducing JSoup

Web scraping can be thought of as digital treasure hunting. You go through a website and dig out all the information you need. It's a technique that's used for all sorts of things, like finding the cheapest prices, analyzing customer sentiment, or collecting data for research.

Java is considered a great programming language for web scraping because it has a wide variety of libraries and frameworks that can assist in the process. One of the most well-known libraries for web scraping in Java is JSoup. JSoup lets you navigate and search through a website's HTML and extract all the data you need.

By combining Java with JSoup, you can create awesome web scraping apps that can extract data from websites quickly and easily. In this article, I will walk you through the basics of web scraping with JSoup.

Setting up a JSoup project

In this section, we will create a new Java project with Maven and configure it to run from the command line using the exec-maven-plugin. This lets you package and run the project on a server, which makes it possible to automate and scale the data extraction process. After that, we will install the JSoup library.

Creating a Maven project

Maven is a build automation tool for Java projects. It manages a project's build process, dependencies, and documentation, which makes complex Java projects easier to organize. It also integrates easily with other tools and frameworks.

Installing Maven is a simple process that can be done in a few steps.

First, download the latest version of Maven from the official website (https://maven.apache.org/download.cgi).

Once the download is complete, extract the contents of the archive to a directory of your choice.

Next, you'll need to set up the environment variables.

On Windows, set the JAVA_HOME variable to the location of your JDK and add the bin folder of the Maven installation to the PATH variable.

On Linux/macOS, you'll need to add the following lines to your ~/.bashrc or ~/.bash_profile file:

export JAVA_HOME=path/to/the/jdk

export PATH=$PATH:path/to/maven/bin

Confirm the Maven installation by running mvn --version in a terminal.

With Maven installed, you can now create a new Java Maven project:

mvn archetype:generate -DgroupId=com.project.scraper -DartifactId=jsoup-scraper-project -DarchetypeArtifactId=maven-archetype-quickstart -DinteractiveMode=false

This creates a new folder called "jsoup-scraper-project" containing the project's contents.

The entry point for the application (the main class) will be in the “com.project.scraper” package.

Running the project from the command line

In order to run a Maven Java project from the command line, we will use the exec-maven-plugin.

To install the plugin, you need to add it to the project's pom.xml file. This can be done by adding the following code snippet to the <build><plugins> section of the pom.xml file:

<build>

 <plugins>

   <plugin>

     <groupId>org.codehaus.mojo</groupId>

     <artifactId>exec-maven-plugin</artifactId>

     <version>3.1.0</version>

     <executions>

       <execution>

         <goals>

           <goal>java</goal>

         </goals>

       </execution>

     </executions>

     <configuration>

       <mainClass>com.project.scraper.App</mainClass>

     </configuration>

   </plugin>

 </plugins>

</build>

Make sure the mainClass value matches the fully qualified name of your project's main class.

Use mvn package exec:java in the terminal (in the project directory) to run the project.

Installing the JSoup library

To install the JSoup library, add the following dependency to your project's pom.xml file:

<dependency>

 <groupId>org.jsoup</groupId>

 <artifactId>jsoup</artifactId>

 <version>1.14.3</version>

</dependency>

Visit https://mvnrepository.com/artifact/org.jsoup/jsoup to check the latest version.
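To confirm that the dependency resolves, a small check like the one below can help. This is a minimal sketch: the class name JsoupSmokeTest and the HTML string are made up for illustration, and it parses an in-memory string, so no network access is needed.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Hypothetical class name, used only for this check.
class JsoupSmokeTest {
    // Parses an HTML string held in memory and returns the page title.
    static String titleOf(String html) {
        Document document = Jsoup.parse(html);
        return document.title();
    }

    public static void main(String[] args) {
        String html = "<html><head><title>Hello JSoup</title></head><body></body></html>";
        System.out.println(titleOf(html));
    }
}
```

If this compiles and runs, JSoup is on the classpath and the project is ready for the scraping code.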

Parsing HTML in Java with JSoup

In this section, we will explore the https://www.scrapethissite.com/pages/forms/ website and see how we can extract the information about hockey teams. By examining a real-world website, you will understand the concepts and techniques used in web scraping with JSoup and how you could apply them to your own projects.

Fetching the HTML

To get the HTML from the website, you need to make an HTTP request to it. In JSoup, the connect() method is used to create a connection to a specified URL. It returns a Connection object, which can be used to configure the request and retrieve the response from the server.

Let’s see how we can use the connect() method to fetch the HTML from our URL and then write it to a local HTML file (hockey.html):

package com.project.scraper;

import org.jsoup.Jsoup;

import org.jsoup.nodes.Document;

import java.io.FileWriter;

import java.io.IOException;

public class App

{

   public static void main( String[] args )

   {

       String RAW_HTML;

       try {

           Document document = Jsoup.connect("https://www.scrapethissite.com/pages/forms/")

                   .get();

           RAW_HTML = document.html();

           // try-with-resources closes the writer even if write() throws

           try (FileWriter writer = new FileWriter("hockey.html")) {

               writer.write(RAW_HTML);

           }

       } catch (IOException e) {

           e.printStackTrace();

       }

   }

}
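Since connect() returns a Connection object, the request can be configured before it is sent. The sketch below sets a user agent, a timeout, and a header. The class name ConnectionSetup, the user agent string, and the chosen values are assumptions for illustration, not settings the site requires.

```java
import org.jsoup.Connection;
import org.jsoup.Jsoup;

import java.io.IOException;

// Hypothetical helper class for this example.
class ConnectionSetup {
    // Builds a configured Connection; nothing is sent over the network here.
    static Connection configure(String url) {
        return Jsoup.connect(url)
                .userAgent("jsoup-scraper-project/1.0") // made-up identifier
                .timeout(10_000)                        // in milliseconds
                .header("Accept-Language", "en");
    }

    public static void main(String[] args) throws IOException {
        Connection connection = configure("https://www.scrapethissite.com/pages/forms/");
        // The request is only sent when get() is called.
        System.out.println(connection.get().title());
    }
}
```

Keeping the configuration in one method means every request in the project is sent with the same settings.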

Now we can open the file in a browser and examine the structure of the HTML with the Developer Tools.

The data we need is present in an HTML table on the page. Now that we have accessed the page, we can proceed to extract the content from the table using selectors.
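While experimenting with selectors, it can be convenient to parse a saved copy of the page instead of requesting it every time. Here is a minimal sketch, assuming the file was saved as UTF-8. The class LocalHtmlParser and its temporary file are stand-ins for hockey.html, used so that the example runs on its own.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

// Hypothetical helper class for this example.
class LocalHtmlParser {
    // Writes HTML to a temporary file, standing in for a previously saved page.
    static File saveToTempFile(String html) {
        try {
            File file = File.createTempFile("page", ".html");
            file.deleteOnExit();
            Files.write(file.toPath(), html.getBytes(StandardCharsets.UTF_8));
            return file;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Parses an HTML file from disk and counts its <tr> elements.
    static int countRows(File htmlFile) {
        try {
            Document document = Jsoup.parse(htmlFile, "UTF-8");
            return document.select("tr").size();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        File file = saveToTempFile("<table><tr><td>Boston Bruins</td></tr></table>");
        System.out.println(countRows(file));
    }
}
```

To parse the real page, countRows(new File("hockey.html")) would do the same against the file written earlier.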

Writing selectors

The selector methods in JSoup resemble the DOM methods in JavaScript. Both use a similar syntax and let you select elements from an HTML document based on their tag name, class, id, and attributes.

Here are some of the main selectors you can use with JSoup:

getElementsByTag(): Selects elements based on their tag name.
getElementsByClass(): Selects elements based on their class name.
getElementById(): Selects an element based on its id.
select(): Selects elements based on a CSS selector (similar to querySelectorAll)
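The four methods listed above can be tried on a small in-memory document. The markup below is made up for illustration: the id "hockey" and the class "year" are assumptions and are not taken from the real page.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// Hypothetical demo class; the HTML is invented for this example.
class SelectorDemo {
    static final String HTML =
            "<table id=\"hockey\">"
          + "<tr><th>Team Name</th><th>Year</th></tr>"
          + "<tr class=\"team\"><td class=\"name\">Boston Bruins</td><td class=\"year\">1990</td></tr>"
          + "<tr class=\"team\"><td class=\"name\">Buffalo Sabres</td><td class=\"year\">1990</td></tr>"
          + "</table>";

    public static void main(String[] args) {
        Document document = Jsoup.parse(HTML);

        Elements rows = document.getElementsByTag("tr");       // header row + 2 team rows
        Elements names = document.getElementsByClass("name");  // one cell per team row
        Element table = document.getElementById("hockey");     // a single element, or null
        Elements years = table.select("tr.team > td.year");    // CSS selector, scoped to the table

        System.out.println(rows.size());   // 3
        System.out.println(names.size());  // 2
        System.out.println(years.text());  // 1990 1990
    }
}
```

Note that getElementById() returns a single Element, while the other three return an Elements collection.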

Now let’s use some of them to extract all the team names:

try {

   Document document = Jsoup.connect("https://www.scrapethissite.com/pages/forms/")

           .get();

   Elements rows = document.getElementsByTag("tr");

   for(Element row : rows) {

      

       Elements teamName = row.getElementsByClass("name");

      

       // skip rows without a team name, such as the header row

       if (!teamName.text().isEmpty())

           System.out.println(teamName.text());

      

   }

} catch (IOException e) {

   e.printStackTrace();

}

// Prints the team names:

Boston Bruins

Buffalo Sabres

Calgary Flames

Chicago Blackhawks

Detroit Red Wings

Edmonton Oilers

Hartford Whalers

...

We iterated over every row and, for each one, printed the team name using the class selector 'name'. Rows without a 'name' cell, such as the header row, produce empty text and are skipped.

This example shows that selector methods can be applied again to elements that have already been extracted, which is particularly useful when dealing with large, complex HTML documents.

Here’s another version that uses Java streams and the select() method to print all the team names:

try {

   Document document = Jsoup.connect("https://www.scrapethissite.com/pages/forms/")

           .get();

   Elements teamNamesElements = document.select("table .team .name");

   String[] teamNames = teamNamesElements.stream()

                                         .map(element -> element.text())

                                         .toArray(String[]::new);

   for (String teamName : teamNames) {

       System.out.println(teamName);

   }

} catch (IOException e) {

   e.printStackTrace();

}

// Also prints the team names:

Boston Bruins

Buffalo Sabres

Calgary Flames

...
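When only the text of each matched element is needed, Elements.eachText() may be a shorter alternative to the stream above. According to the JSoup documentation, it leaves out elements that have no text. The sketch below runs on a made-up HTML string, and the class name TeamNames is an assumption.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.util.List;

// Hypothetical helper class for this example.
class TeamNames {
    // Returns the text of every element matched by the selector.
    static List<String> teamNames(String html) {
        Document document = Jsoup.parse(html);
        return document.select("table .team .name").eachText();
    }

    public static void main(String[] args) {
        String html =
                "<table>"
              + "<tr class=\"team\"><td class=\"name\">Boston Bruins</td></tr>"
              + "<tr class=\"team\"><td class=\"name\">Buffalo Sabres</td></tr>"
              + "</table>";
        for (String teamName : teamNames(html)) {
            System.out.println(teamName);
        }
    }
}
```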

Now let’s print all the table headers and rows:

try {

   Document document = Jsoup.connect("https://www.scrapethissite.com/pages/forms/")

           .get();

   Elements tableHeadersElements = document.select("table th");

   Elements tableRowsElements = document.select("table .team");

   String[] tableHeaders =

   tableHeadersElements.stream()

                       .map(element -> element.text())

                       .toArray(String[]::new);

   String[][] tableRows =

   tableRowsElements.stream()

            .map(

                table_row -> table_row

                .select("td")

                .stream()

                .map(row_element -> row_element.text())

                .toArray(String[]::new)

               )

            .toArray(String[][]::new);

   for (int i = 0; i < tableHeaders.length; i++) {

       System.out.print(tableHeaders[i] + " ");

   }

   for (int i = 0; i < tableRows.length; i++) {

       for (int j = 0; j < tableRows[i].length; j++) {

           System.out.print(tableRows[i][j] + " ");

       }

       System.out.println();

   }

} catch (IOException e) {

   e.printStackTrace();

}

// Prints

Team Name Year Wins Losses OT Losses Win ...

Boston Bruins 1990 44 24  0.55 299 264 35 

Buffalo Sabres 1990 31 30  0.388 292 278 14 

Calgary Flames 1990 46 26  0.575 344 263 81 

Chicago Blackhawks 1990 49 23  0.613 284 211 73 

Detroit Red Wings 1990 34 38  0.425 273 298 -25

...

Notice that we used streams to store the rows. Here is a simpler way of doing it, using for loops:

String[][] tableRows = new String[tableRowsElements.size()][];

for (int i = 0; i < tableRowsElements.size(); i++) {

   Element table_row = tableRowsElements.get(i);

   Elements tableDataElements = table_row.select("td");

   String[] rowData = new String[tableDataElements.size()];

   for (int j = 0; j < tableDataElements.size(); j++) {

       Element row_element = tableDataElements.get(j);

       String text = row_element.text();

       rowData[j] = text;

   }

   tableRows[i] = rowData;

}
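Every cell is stored as a String, and some cells are empty (the fifth column is blank in the sample rows printed earlier). If you need numbers instead, a small helper along these lines could convert them. The class name CellParser and the fallback value of 0 are assumptions; pick a default that suits your data.

```java
// Hypothetical helper class for this example.
class CellParser {
    // Returns the parsed integer, or the fallback when the cell is blank.
    static int parseIntOrDefault(String cell, int fallback) {
        String trimmed = cell.trim();
        return trimmed.isEmpty() ? fallback : Integer.parseInt(trimmed);
    }

    public static void main(String[] args) {
        // A row shaped like the sample output; the fifth cell is empty.
        String[] row = {"Boston Bruins", "1990", "44", "24", "", "0.55", "299", "264", "35"};

        int year = parseIntOrDefault(row[1], 0);
        int wins = parseIntOrDefault(row[2], 0);
        int otLosses = parseIntOrDefault(row[4], 0);

        System.out.println(year + " " + wins + " " + otLosses); // 1990 44 0
    }
}
```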

Handling pagination

When extracting data from a website, it is common for the information to be split across multiple pages. To scrape all the relevant data, you need to make a request to each page and extract the information from each one. We can easily add this feature to our project.

All we have to do is change the page_num query parameter in the URL and make another HTTP request with the connect() method.

int pageLimit = 25;

String [] tableHeaders = new String[0];

Vector<String[][]> rowsGroups = new Vector<String [][]>();

for (int currentPage=1; currentPage<pageLimit; currentPage++) {

   try {

       Document document = Jsoup.connect("https://www.scrapethissite.com/pages/forms/?page_num=" + currentPage)

               .get();

       if(currentPage == 1) {

           Elements tableHeadersElements = document.select("table th");

           tableHeaders = tableHeadersElements.stream()

                   .map(element -> element.text())

                   .toArray(String[]::new);

       }

       Elements tableRowsElements = document.select("table .team");

       String[][] tableRows = new String[tableRowsElements.size()][];

       for (int i = 0; i < tableRowsElements.size(); i++) {

           Element table_row = tableRowsElements.get(i);

           Elements tableDataElements = table_row.select("td");

           String[] rowData = new String[tableDataElements.size()];

           for (int j = 0; j < tableDataElements.size(); j++) {

               Element row_element = tableDataElements.get(j);

               String text = row_element.text();

               rowData[j] = text;

           }

           tableRows[i] = rowData;

       }

       rowsGroups.add(tableRows);

   } catch (IOException e) {

       e.printStackTrace();

   }

   // do something with the headers and the table row groups

}

Since the tables from each page have the same headers, you should make sure not to scrape them multiple times.
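The loop above relies on a hard-coded page limit. One alternative is to keep requesting pages until one comes back without rows, pausing briefly between requests. The sketch below shows only that control flow. It assumes that a page past the end yields no rows, which you should verify for your target site. The names Paginator, collectAllRows, and fakePage are hypothetical, and fakePage stands in for the JSoup request.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntFunction;

// Hypothetical helper class for this example.
class Paginator {
    // Requests page 1, 2, 3, ... and stops at the first page without rows,
    // or once maxPages pages have been requested.
    static List<String[]> collectAllRows(IntFunction<List<String[]>> fetchPage,
                                         int maxPages, long delayMillis) {
        List<String[]> allRows = new ArrayList<>();
        for (int page = 1; page <= maxPages; page++) {
            List<String[]> rows = fetchPage.apply(page);
            if (rows.isEmpty()) {
                break; // assumed to mean that there are no more pages
            }
            allRows.addAll(rows);
            try {
                Thread.sleep(delayMillis); // pause between requests
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
        }
        return allRows;
    }

    // Stand-in for a real fetch: pages 1 to 3 hold one row each, later pages are empty.
    static List<String[]> fakePage(int page) {
        List<String[]> rows = new ArrayList<>();
        if (page <= 3) {
            rows.add(new String[] {"Team " + page});
        }
        return rows;
    }

    public static void main(String[] args) {
        System.out.println(collectAllRows(Paginator::fakePage, 25, 0).size()); // 3
    }
}
```

In the scraper, fetchPage would wrap the connect() call and the row extraction shown above, and maxPages acts as a safety cap.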

The full code

Here is the full code that extracts all the tables from the https://www.scrapethissite.com/pages/forms/ website. I also included a function that saves the data to a CSV file:

package com.project.scraper;

import org.jsoup.Jsoup;

import org.jsoup.nodes.Document;

import org.jsoup.nodes.Element;

import org.jsoup.select.Elements;

import java.io.File;

import java.io.FileWriter;

import java.io.IOException;

import java.util.Vector;

public class App

{

   public static void main( String[] args )

   {

       int pageLimit = 25;

       String [] tableHeaders = new String[0];

       Vector<String[][]> rowsGroups = new Vector<String [][]>();

       for (int currentPage=1; currentPage<pageLimit; currentPage++) {

           try {

               Document document = Jsoup.connect("https://www.scrapethissite.com/pages/forms/?page_num=" + currentPage)

                       .get();

               if(currentPage == 1) {

                   Elements tableHeadersElements = document.select("table th");

                   tableHeaders = tableHeadersElements.stream()

                           .map(element -> element.text())

                           .toArray(String[]::new);

               }

               Elements tableRowsElements = document.select("table .team");

               String[][] tableRows = new String[tableRowsElements.size()][];

               for (int i = 0; i < tableRowsElements.size(); i++) {

                   Element table_row = tableRowsElements.get(i);

                   Elements tableDataElements = table_row.select("td");

                   String[] rowData = new String[tableDataElements.size()];

                   for (int j = 0; j < tableDataElements.size(); j++) {

                       Element row_element = tableDataElements.get(j);

                       String text = row_element.text();

                       rowData[j] = text;

                   }

                   tableRows[i] = rowData;

               }

               rowsGroups.add(tableRows);

           } catch (IOException e) {

               e.printStackTrace();

           }

       }

       writeFullTableToCSV(rowsGroups, tableHeaders, "full_table.csv");

   }

   public static void writeFullTableToCSV(Vector<String[][]> rowsGroups, String[] headers, String fileName) {

       File file = new File(fileName);

       // try-with-resources closes the writer even if a write fails

       try (FileWriter writer = new FileWriter(file)) {

           // write the headers first

           for (int i = 0; i < headers.length; i++) {

               writer.append(headers[i]);

               if (i != headers.length - 1) {

                   writer.append(",");

               }

           }

           writer.append("\n");

           // write all the row groups

           for (String[][] rowsGroup : rowsGroups) {

               for (String[] row : rowsGroup) {

                   for (int i = 0; i < row.length; i++) {

                       writer.append(row[i]);

                       if (i != row.length - 1) {

                           writer.append(",");

                       }

                   }

                   writer.append("\n");

               }

           }

       } catch (IOException e) {

           e.printStackTrace();

       }

   }

}
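The CSV writer above joins cells with commas as they are, so a cell that itself contains a comma or a double quote would break the file's columns. A minimal sketch of the usual quoting convention (the one described in RFC 4180) is shown below. The class CsvEscaper and its methods are hypothetical helpers, not part of JSoup.

```java
// Hypothetical helper class for this example.
class CsvEscaper {
    // Quotes a cell when it contains a comma, a double quote, or a line break,
    // and doubles any embedded double quotes.
    static String escape(String cell) {
        boolean needsQuoting = cell.contains(",") || cell.contains("\"")
                || cell.contains("\n") || cell.contains("\r");
        if (!needsQuoting) {
            return cell;
        }
        return "\"" + cell.replace("\"", "\"\"") + "\"";
    }

    // Joins one row into a single CSV line, without the trailing line break.
    static String toCsvLine(String[] row) {
        StringBuilder line = new StringBuilder();
        for (int i = 0; i < row.length; i++) {
            if (i > 0) {
                line.append(',');
            }
            line.append(escape(row[i]));
        }
        return line.toString();
    }

    public static void main(String[] args) {
        System.out.println(toCsvLine(new String[] {"Bruins, Boston", "1990", ""}));
        // prints: "Bruins, Boston",1990,
    }
}
```

In writeFullTableToCSV, each writer.append(row[i]) call could pass the cell through escape() first.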

Summary

In this article, we covered how to install Maven and create a new Java Maven project, as well as how to run the project from the command line. We also discussed how to install the JSoup library by adding the dependency to the project's pom.xml file. Finally, we went over an example of how to use JSoup to parse HTML and extract data from a website. By following the steps outlined in the article, you should have a solid foundation for setting up a JSoup project and starting to extract data from websites. JSoup offers a wide range of options and possibilities for web scraping, and I encourage you to explore them and apply them to your own projects.

As you have seen, data is often spread across multiple web pages. Making rapid requests to the same domain can lead to your IP getting banned. With our product, WebScrapingAPI, you will never have to worry about such problems. Our API ensures that you can make as many requests as you need. And the best part is that you can try it for free.