Building a Multithreaded Web Crawler in Java

Mayur Patil

Introduction

Hey everyone! Today, I am excited to share another Java project that helped me understand an essential concept: multithreading. In this blog post, we'll explore how I built a multithreaded web crawler using Java, Jsoup for HTML parsing, and Phaser for task synchronization.

This project is designed to fetch hyperlinks from a given website up to a specified depth while making use of concurrent execution. If you're interested in learning about Java multithreading in a practical way, this blog is for you!


Why Build a Web Crawler?

A web crawler is a great project to explore multithreading because:

  • It requires handling multiple web pages simultaneously.

  • We need to synchronize tasks efficiently.

  • It helps in understanding how to manage thread execution using ExecutorService.

  • We can explore libraries like Jsoup for web scraping.


Technologies Used

  • Java (for programming)

  • Jsoup (for extracting hyperlinks)

  • ExecutorService (for multithreading)

  • Phaser (for synchronizing task execution)


How It Works

The web crawler follows these steps:

1️⃣ User Input: The program prompts for a starting URL, the maximum depth, and the number of threads.
2️⃣ Managing URLs: The URLStore class keeps track of visited URLs and prevents duplicate crawling.
3️⃣ Fetching Pages: The URLFetcher class uses Jsoup to retrieve and extract hyperlinks.
4️⃣ Multithreading: The CrawlerTask runs concurrently using a thread pool (ExecutorService) and synchronizes execution with Phaser.


Implementation Details

1️⃣ WebCrawler.java (Main Entry Point)

import java.util.Scanner;

public class WebCrawler {
    public static void main(String[] args) {
        Scanner scanner = new Scanner(System.in);

        System.out.print("Enter the starting URL: ");
        String startUrl = scanner.nextLine();

        System.out.print("Enter the maximum depth: ");
        int maxDepth = scanner.nextInt();

        System.out.print("Enter the number of threads: ");
        int numThreads = scanner.nextInt();
        scanner.close();

        // Kick off the crawl and wait for all tasks to finish
        CrawlerTask crawler = new CrawlerTask(startUrl, maxDepth, numThreads);
        crawler.start();
    }
}

2️⃣ URLStore.java (Managing URLs)

import java.util.HashSet;
import java.util.Set;

public class URLStore {
    private final Set<String> visitedUrls = new HashSet<>();

    // Atomically checks and records a URL; returns false if it was already visited
    public synchronized boolean addUrl(String url) {
        return visitedUrls.add(url);
    }
}
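
As a side note, the synchronized method above is just one way to make the check-and-add atomic. A concurrent set works equally well; here is a minimal alternative sketch (the ConcurrentURLStore name is only for illustration, the project itself uses the class above):

import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class ConcurrentURLStore {
    // Thread-safe set backed by ConcurrentHashMap, no explicit locking required
    private final Set<String> visitedUrls = ConcurrentHashMap.newKeySet();

    public boolean addUrl(String url) {
        // add() returns false if the URL was already present
        return visitedUrls.add(url);
    }
}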

3️⃣ URLFetcher.java (Fetching Links)

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

public class URLFetcher {
    public static Set<String> fetchLinks(String url) {
        Set<String> links = new HashSet<>();
        try {
            // Download and parse the page, then collect every absolute link it contains
            Document document = Jsoup.connect(url).get();
            Elements elements = document.select("a[href]");
            for (Element element : elements) {
                String link = element.absUrl("href");
                if (!link.isEmpty()) { // absUrl() returns "" when the href cannot be resolved
                    links.add(link);
                }
            }
        } catch (IOException e) {
            System.err.println("Failed to fetch: " + url);
        }
        return links;
    }
}
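
If you want to try the fetcher on its own before wiring up the threads, a tiny throwaway main like the one below does the job (the FetcherDemo class is only for illustration and is not part of the project):

import java.util.Set;

public class FetcherDemo {
    public static void main(String[] args) {
        // Fetch a single page and print every absolute link found on it
        Set<String> links = URLFetcher.fetchLinks("https://example.com");
        for (String link : links) {
            System.out.println("Found link: " + link);
        }
    }
}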

4️⃣ CrawlerTask.java (Multithreading with Phaser)

import java.util.concurrent.*;
import java.util.Set;

public class CrawlerTask {
    private final ExecutorService executor;
    private final Phaser phaser;
    private final URLStore urlStore;
    private final int maxDepth;

    public CrawlerTask(String startUrl, int maxDepth, int numThreads) {
        this.executor = Executors.newFixedThreadPool(numThreads);
        this.phaser = new Phaser(1); // party #1 is the main thread, which waits in start()
        this.urlStore = new URLStore();
        this.maxDepth = maxDepth;
        crawl(startUrl, 0);
    }

    private void crawl(String url, int depth) {
        // Stop if the depth limit is reached or the URL has already been visited
        if (depth > maxDepth || !urlStore.addUrl(url)) return;
        phaser.register(); // register before submitting so the phaser always waits for this task
        executor.execute(() -> {
            System.out.println("Crawling: " + url);
            Set<String> links = URLFetcher.fetchLinks(url);
            for (String link : links) {
                System.out.println("Found link: " + link);
                crawl(link, depth + 1);
            }
            phaser.arriveAndDeregister(); // signal that this task is done
        });
    }

    public void start() {
        // Blocks until every registered crawl task has arrived, then shuts the pool down
        phaser.arriveAndAwaitAdvance();
        executor.shutdown();
    }
}
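
If the Phaser bookkeeping feels opaque, here is a minimal, self-contained sketch of the same register / arriveAndDeregister / arriveAndAwaitAdvance pattern outside the crawler (the PhaserDemo class is only for illustration):

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Phaser;

public class PhaserDemo {
    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(3);
        Phaser phaser = new Phaser(1); // party #1 is the main thread

        for (int i = 0; i < 5; i++) {
            final int taskId = i;
            phaser.register(); // register BEFORE submitting so the phase cannot advance early
            pool.execute(() -> {
                System.out.println("Task " + taskId + " running");
                phaser.arriveAndDeregister(); // signal completion
            });
        }

        phaser.arriveAndAwaitAdvance(); // blocks until all five tasks have arrived
        pool.shutdown();
        System.out.println("All tasks finished");
    }
}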

Running the Web Crawler

1️⃣ Compile and run the Java program.
2️⃣ Enter the starting URL, maximum depth, and number of threads.
3️⃣ The crawler will fetch and display extracted links recursively.

Example Output:

Enter the starting URL: https://example.com
Enter the maximum depth: 2
Enter the number of threads: 5

Crawling: https://example.com
Found link: https://example.com/about
Found link: https://example.com/contact
Crawling: https://example.com/about
Crawling: https://example.com/contact
...

Key Learnings

While building this project, I learned:

✅ How to use multithreading in Java.
✅ The importance of Phaser for synchronizing concurrent tasks.
✅ How to use Jsoup for web scraping.
✅ Managing URL traversal efficiently using HashSet.


Conclusion

This project was an amazing hands-on experience in Java multithreading and web crawling! If you're learning Java, I highly recommend trying it out. Feel free to fork the repository and contribute!


Code Repository

Link to GitHub repository
