Multi-threaded Web Crawler in Java

Mayur Patil
5 min read

The web is a vast ocean of information, and web crawlers are the boats we use to navigate it. In this blog, we’ll explore how to build a multi-threaded web crawler in Java. This project leverages advanced Java concurrency utilities to enable efficient and scalable crawling.

By the end of this blog, you’ll have a working web crawler that can fetch links up to a specified depth, process multiple URLs concurrently, and measure execution time for performance insights.


🔎 1. Introduction: What are we building and why?

A web crawler is a program designed to systematically browse the web and extract hyperlinks. In this project, we’re building a multi-threaded web crawler that:

  • Starts from a given URL.

  • Crawls up to a specified depth.

  • Uses multiple threads for concurrent crawling.

  • Ensures no URL is visited more than once.

This project is perfect for learning:

  • Java concurrency utilities like Phaser, ExecutorService, and ConcurrentHashMap.

  • How to manage shared resources in a multi-threaded environment.

  • Practical use of the JSoup library for parsing HTML.


📚 2. Key Components and Classes

CrawlerTask

The CrawlerTask class represents a single crawling task. It fetches links from a URL and submits new tasks for each link, provided the maximum depth hasn’t been reached.

Key features:

  • Uses Phaser for thread coordination.

  • Prevents revisiting URLs using URLStore.

  • Submits new tasks for discovered links.

package org.example;

import java.util.Set;
import java.util.concurrent.Phaser;

public class CrawlerTask implements Runnable {
    private final URLStore urlStore;
    private final URLFetcher urlFetcher;
    private final int maxDepth;
    private final int currentDepth;
    private final Phaser phaser;

    public CrawlerTask(URLStore urlStore, URLFetcher urlFetcher, int maxDepth, int currentDepth, Phaser phaser) {
        this.urlStore = urlStore;
        this.urlFetcher = urlFetcher;
        this.maxDepth = maxDepth;
        this.currentDepth = currentDepth;
        this.phaser = phaser;
    }

    @Override
    public void run() {
        try {
            // Each task processes exactly one URL from the shared store.
            String url = urlStore.getNextUrl();
            if (url == null || currentDepth > maxDepth) return;
            System.out.println(Thread.currentThread().getName() + " " + url);

            // Fetch the page and submit a new task for every link we haven't seen yet.
            Set<String> links = urlFetcher.fetchLinks(url);
            for (String link : links) {
                if (urlStore.addUrl(link)) {
                    // Register the child task as a new party before submitting it,
                    // so the phaser cannot advance while work is still pending.
                    phaser.register();
                    WebCrawler.submitTask(urlStore, urlFetcher, currentDepth + 1, maxDepth);
                }
            }
        } catch (Exception e) {
            System.out.println("Error while crawling: " + e.getMessage());
        } finally {
            // This task's party is done, whether it succeeded or failed.
            phaser.arriveAndDeregister();
        }
    }
}

URLFetcher

The URLFetcher class uses the JSoup library to fetch hyperlinks from a web page.

Key features:

  • Connects to the given URL and parses its HTML.

  • Extracts and returns all anchor (<a>) tags with valid href attributes.

package org.example;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

public class URLFetcher {
    public Set<String> fetchLinks(String url) {
        Set<String> links = new HashSet<>();
        Document document;
        try {
            // 50-second timeout for slow pages; tune as needed.
            document = Jsoup.connect(url).timeout(50000).get();
        } catch (IOException e) {
            System.out.println("Failed to fetch " + url + ": " + e.getMessage());
            return links;
        }

        // Collect every anchor tag and resolve its href to an absolute URL.
        Elements anchorTags = document.select("a[href]");
        for (Element link : anchorTags) {
            String extractedUrl = link.absUrl("href");
            if (!extractedUrl.isEmpty()) {
                links.add(extractedUrl);
            }
        }
        return links;
    }
}

URLStore

The URLStore class manages the URLs to be visited. It ensures no URL is visited more than once.

Key features:

  • Uses a ConcurrentHashMap to keep track of visited URLs.

  • Uses a BlockingQueue to store URLs for crawling.

package org.example;

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;

public class URLStore {
    // Tracks every URL we have ever seen, so each one is crawled at most once.
    private final ConcurrentHashMap<String, Boolean> visitedUrl = new ConcurrentHashMap<>();
    // Holds URLs that are waiting to be crawled.
    private final BlockingQueue<String> urlQueue = new LinkedBlockingQueue<>();

    // Returns true only for the first caller to add a given URL.
    public boolean addUrl(String url) {
        if (visitedUrl.putIfAbsent(url, true) == null) {
            urlQueue.offer(url);
            return true;
        }
        return false;
    }

    // Non-blocking: returns null if the queue is currently empty.
    public String getNextUrl() {
        return urlQueue.poll();
    }

    public boolean isQueueEmpty() {
        return urlQueue.isEmpty();
    }
}
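
To see what putIfAbsent buys us, here is a tiny hypothetical usage snippet (not part of the repository; it simply calls the URLStore class shown above). The first addUrl call for a URL wins and enqueues it, and every later call returns false, even when many crawler threads race on the same link.

package org.example;

public class URLStoreDemo {
    public static void main(String[] args) {
        URLStore store = new URLStore();
        // Only the first call for a given URL succeeds, because putIfAbsent is atomic.
        System.out.println(store.addUrl("https://example.com")); // true  - first time seen
        System.out.println(store.addUrl("https://example.com")); // false - already visited
        System.out.println(store.getNextUrl());                  // https://example.com
    }
}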

WebCrawler

The WebCrawler class is the entry point for the application. It:

  1. Takes user input for the starting URL, depth, and number of threads.

  2. Initializes the Phaser for thread synchronization.

  3. Submits the first crawling task and manages thread execution.

package org.example;

import java.util.Scanner;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Phaser;

public class WebCrawler {
    private static Phaser phaser;
    private static ExecutorService executorService;

    public static void main(String[] args) {
        Scanner sc = new Scanner(System.in);

        System.out.println("Enter your url");
        String url = sc.nextLine();

        System.out.println("Enter the depth of the crawler");
        int maxDepth = sc.nextInt();

        System.out.println("Enter the number of workers");
        int maxThreads = sc.nextInt();

        URLStore urlStore = new URLStore();
        URLFetcher urlFetcher = new URLFetcher();
        // One party is registered up front for the very first task.
        phaser = new Phaser(1);

        executorService = Executors.newFixedThreadPool(maxThreads);

        urlStore.addUrl(url);

        long start = System.currentTimeMillis();

        submitTask(urlStore, urlFetcher, 0, maxDepth);

        // Block until every registered task has arrived and deregistered.
        phaser.awaitAdvance(phaser.getPhase());

        executorService.shutdown();

        System.out.println("Time taken: " + (System.currentTimeMillis() - start) + " ms");
    }

    public static void submitTask(URLStore urlStore, URLFetcher urlFetcher, int currentDepth, int maxDepth) {
        executorService.submit(new CrawlerTask(urlStore, urlFetcher, maxDepth, currentDepth, phaser));
    }
}

🧠 3. Challenges & Learnings

Challenges:

  1. Handling shared resources (URLStore) safely in a multi-threaded environment.

  2. Efficiently coordinating threads using Phaser.

  3. Parsing and validating URLs with JSoup.

Learnings:

  • How to use advanced Java concurrency utilities like Phaser and ExecutorService (see the minimal sketch after this list).

  • Best practices for managing shared resources in multi-threaded applications.

  • Practical usage of third-party libraries for web scraping.
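
To make the Phaser pattern concrete, here is a minimal, self-contained sketch (not taken from the crawler repository; the class and variable names are invented for illustration). Every submitted task is one registered party, and the phase advances only after all parties have arrived and deregistered, which is exactly what lets the crawler's main thread wait for the whole crawl to finish.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Phaser;

public class PhaserDemo {
    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        Phaser phaser = new Phaser(1); // one party for the first task

        pool.submit(() -> {
            try {
                System.out.println("root task running");
                for (int i = 0; i < 3; i++) {
                    phaser.register();                    // one new party per child task
                    pool.submit(() -> {
                        try {
                            System.out.println("child task running");
                        } finally {
                            phaser.arriveAndDeregister(); // child done
                        }
                    });
                }
            } finally {
                phaser.arriveAndDeregister();             // root done
            }
        });

        phaser.awaitAdvance(phaser.getPhase());           // blocks until all parties arrive
        pool.shutdown();
        System.out.println("all tasks finished");
    }
}

The same register-before-submit ordering appears in CrawlerTask, so the phaser never sees zero unarrived parties while work is still queued.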


🏃‍♂️ 4. How to Run It

Prerequisites:

  • JDK (Java Development Kit) installed.

  • JSoup library (jsoup-1.15.3.jar or later) added to your project.

Steps:

  1. Clone the GitHub repository:

     git clone https://github.com/Mayurdpatil67/web-crawler.git
     cd web-crawler
    
  2. Compile the Java files (on Windows, use ; instead of : as the classpath separator):

     javac -cp .:jsoup-1.15.3.jar org/example/*.java
    
  3. Run the program:

     java -cp .:jsoup-1.15.3.jar org.example.WebCrawler
    

Example Interaction:

Enter your url
https://example.com
Enter the depth of the crawler
2
Enter the number of workers
4
Time taken: 1234 ms

🌟 5. GitHub Repository

The complete source code for this project is available on GitHub. Feel free to clone, fork, or contribute to the repository: https://github.com/Mayurdpatil67/web-crawler


🌟 6. Conclusion

This project showcases the power of multi-threading and concurrency in Java. You’ve learned how to:

  • Build a scalable, multi-threaded web crawler.

  • Use Phaser for thread synchronization.

  • Leverage libraries like JSoup for web scraping.

What’s Next?

  • Add support for robots.txt to comply with web crawling standards (a minimal check is sketched after this list).

  • Implement a depth-first or breadth-first crawling strategy.

  • Save crawled data to a database or file system.
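
For the robots.txt item above, here is a rough, hypothetical sketch in plain Java (the class name, method, and simplified rule handling are my own assumptions, not code from the repository). It downloads /robots.txt from the site root and checks a path against the Disallow rules in the wildcard (User-agent: *) group; a production crawler should use a proper parser and honour per-agent groups and crawl delays.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class RobotsCheck {
    // Returns true if the path is not matched by any Disallow rule found
    // under the wildcard user-agent group.
    public static boolean isAllowed(String baseUrl, String path) throws IOException {
        URI robots = URI.create(baseUrl).resolve("/robots.txt");
        List<String> disallowed = new ArrayList<>();
        boolean inWildcardGroup = false;
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(robots.toURL().openStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String trimmed = line.trim();
                if (trimmed.toLowerCase().startsWith("user-agent:")) {
                    inWildcardGroup = trimmed.substring(11).trim().equals("*");
                } else if (inWildcardGroup && trimmed.toLowerCase().startsWith("disallow:")) {
                    String rule = trimmed.substring(9).trim();
                    if (!rule.isEmpty()) {
                        disallowed.add(rule);
                    }
                }
            }
        }
        for (String rule : disallowed) {
            if (path.startsWith(rule)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(isAllowed("https://example.com", "/some/page"));
    }
}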

Happy crawling! 🌐🚀
