Multi-threaded Web Crawler in Java

Mayur Patil
5 min read

The web is a vast ocean of information, and web crawlers are the boats we use to navigate it. In this blog, we’ll explore how to build a multi-threaded web crawler in Java. This project leverages advanced Java concurrency utilities to enable efficient and scalable crawling.

By the end of this blog, you’ll have a working web crawler that can fetch links up to a specified depth, process multiple URLs concurrently, and measure execution time for performance insights.


🔎 1. Introduction: What are we building and why?

A web crawler is a program designed to systematically browse the web and extract hyperlinks. In this project, we’re building a multi-threaded web crawler that:

  • Starts from a given URL.

  • Crawls up to a specified depth.

  • Uses multiple threads for concurrent crawling.

  • Ensures no URL is visited more than once.

This project is perfect for learning:

  • Java concurrency utilities like Phaser, ExecutorService, and ConcurrentHashMap.

  • How to manage shared resources in a multi-threaded environment.

  • Practical use of the JSoup library for parsing HTML.


📚 2. Key Components and Classes

CrawlerTask

The CrawlerTask class represents a single crawling task. It fetches links from a URL and submits new tasks for each link, provided the maximum depth hasn’t been reached.

Key features:

  • Uses Phaser for thread coordination.

  • Prevents revisiting URLs using URLStore.

  • Submits new tasks for discovered links.

package org.example;

import java.util.Set;
import java.util.concurrent.Phaser;

public class CrawlerTask implements Runnable {
    private final URLStore urlStore;
    private final URLFetcher urlFetcher;
    private final int maxDepth;
    private final int currentDepth;
    private final Phaser phaser;

    public CrawlerTask(URLStore urlStore, URLFetcher urlFetcher, int maxDepth, int currentDepth, Phaser phaser) {
        this.urlStore = urlStore;
        this.urlFetcher = urlFetcher;
        this.maxDepth = maxDepth;
        this.currentDepth = currentDepth;
        this.phaser = phaser;
    }

    @Override
    public void run() {
        try {
            // Each task processes exactly one URL from the shared store.
            String url = urlStore.getNextUrl();
            if (url == null || currentDepth > maxDepth) return;
            System.out.println(Thread.currentThread().getName() + " " + url);

            // Fetch the page and submit a new task for every link we haven't seen yet.
            Set<String> links = urlFetcher.fetchLinks(url);
            for (String link : links) {
                if (urlStore.addUrl(link)) {
                    // Register the child task as a new party before submitting it,
                    // so the phaser cannot advance while work is still pending.
                    phaser.register();
                    WebCrawler.submitTask(urlStore, urlFetcher, currentDepth + 1, maxDepth);
                }
            }
        } catch (Exception e) {
            System.out.println("Error while crawling: " + e.getMessage());
        } finally {
            // This task's party is done, whether it succeeded or failed.
            phaser.arriveAndDeregister();
        }
    }
}

URLFetcher

The URLFetcher class uses the JSoup library to fetch hyperlinks from a web page.

Key features:

  • Connects to the given URL and parses its HTML.

  • Extracts and returns all anchor (<a>) tags with valid href attributes.

package org.example;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

public class URLFetcher {
    public Set<String> fetchLinks(String url) {
        Set<String> links = new HashSet<>();
        Document document;
        try {
            // 50-second timeout for slow pages; tune as needed.
            document = Jsoup.connect(url).timeout(50000).get();
        } catch (IOException e) {
            System.out.println("Failed to fetch " + url + ": " + e.getMessage());
            return links;
        }

        // Collect every anchor tag and resolve its href to an absolute URL.
        Elements anchorTags = document.select("a[href]");
        for (Element link : anchorTags) {
            String extractedUrl = link.absUrl("href");
            if (!extractedUrl.isEmpty()) {
                links.add(extractedUrl);
            }
        }
        return links;
    }
}

URLStore

The URLStore class manages the URLs to be visited. It ensures no URL is visited more than once.

Key features:

  • Uses a ConcurrentHashMap to keep track of visited URLs.

  • Uses a BlockingQueue to store URLs for crawling.

package org.example;

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;

public class URLStore {
    // Tracks every URL we have ever seen, so each one is crawled at most once.
    private final ConcurrentHashMap<String, Boolean> visitedUrl = new ConcurrentHashMap<>();
    // Holds URLs that are waiting to be crawled.
    private final BlockingQueue<String> urlQueue = new LinkedBlockingQueue<>();

    // Returns true only for the first caller to add a given URL.
    public boolean addUrl(String url) {
        if (visitedUrl.putIfAbsent(url, true) == null) {
            urlQueue.offer(url);
            return true;
        }
        return false;
    }

    // Non-blocking: returns null if the queue is currently empty.
    public String getNextUrl() {
        return urlQueue.poll();
    }

    public boolean isQueueEmpty() {
        return urlQueue.isEmpty();
    }
}
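
To see what putIfAbsent buys us, here is a tiny hypothetical usage snippet (not part of the repository; it simply calls the URLStore class shown above). The first addUrl call for a URL wins and enqueues it, and every later call returns false, even when many crawler threads race on the same link.

package org.example;

public class URLStoreDemo {
    public static void main(String[] args) {
        URLStore store = new URLStore();
        // Only the first call for a given URL succeeds, because putIfAbsent is atomic.
        System.out.println(store.addUrl("https://example.com")); // true  - first time seen
        System.out.println(store.addUrl("https://example.com")); // false - already visited
        System.out.println(store.getNextUrl());                  // https://example.com
    }
}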

WebCrawler

The WebCrawler class is the entry point for the application. It:

  1. Takes user input for the starting URL, depth, and number of threads.

  2. Initializes the Phaser for thread synchronization.

  3. Submits the first crawling task and manages thread execution.

package org.example;

import java.util.Scanner;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Phaser;

public class WebCrawler {
    private static Phaser phaser;
    private static ExecutorService executorService;

    public static void main(String[] args) {
        Scanner sc = new Scanner(System.in);

        System.out.println("Enter your url");
        String url = sc.nextLine();

        System.out.println("Enter the depth of the crawler");
        int maxDepth = sc.nextInt();

        System.out.println("Enter the number of workers");
        int maxThreads = sc.nextInt();

        URLStore urlStore = new URLStore();
        URLFetcher urlFetcher = new URLFetcher();
        // One party is registered up front for the very first task.
        phaser = new Phaser(1);

        executorService = Executors.newFixedThreadPool(maxThreads);

        urlStore.addUrl(url);

        long start = System.currentTimeMillis();

        submitTask(urlStore, urlFetcher, 0, maxDepth);

        // Block until every registered task has arrived and deregistered.
        phaser.awaitAdvance(phaser.getPhase());

        executorService.shutdown();

        System.out.println("Time taken: " + (System.currentTimeMillis() - start) + " ms");
    }

    public static void submitTask(URLStore urlStore, URLFetcher urlFetcher, int currentDepth, int maxDepth) {
        executorService.submit(new CrawlerTask(urlStore, urlFetcher, maxDepth, currentDepth, phaser));
    }
}

🧠 3. Challenges & Learnings

Challenges:

  1. Handling shared resources (URLStore) safely in a multi-threaded environment.

  2. Efficiently coordinating threads using Phaser.

  3. Parsing and validating URLs with JSoup.

Learnings:

  • How to use advanced Java concurrency utilities like Phaser and ExecutorService (see the minimal sketch after this list).

  • Best practices for managing shared resources in multi-threaded applications.

  • Practical usage of third-party libraries for web scraping.
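
To make the Phaser pattern concrete, here is a minimal, self-contained sketch (not taken from the crawler repository; the class and variable names are invented for illustration). Every submitted task is one registered party, and the phase advances only after all parties have arrived and deregistered, which is exactly what lets the crawler's main thread wait for the whole crawl to finish.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Phaser;

public class PhaserDemo {
    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        Phaser phaser = new Phaser(1); // one party for the first task

        pool.submit(() -> {
            try {
                System.out.println("root task running");
                for (int i = 0; i < 3; i++) {
                    phaser.register();                    // one new party per child task
                    pool.submit(() -> {
                        try {
                            System.out.println("child task running");
                        } finally {
                            phaser.arriveAndDeregister(); // child done
                        }
                    });
                }
            } finally {
                phaser.arriveAndDeregister();             // root done
            }
        });

        phaser.awaitAdvance(phaser.getPhase());           // blocks until all parties arrive
        pool.shutdown();
        System.out.println("all tasks finished");
    }
}

The same register-before-submit ordering appears in CrawlerTask, so the phaser never sees zero unarrived parties while work is still queued.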


🏃‍♂️ 4. How to Run It

Prerequisites:

  • JDK (Java Development Kit) installed.

  • JSoup library (jsoup-1.15.3.jar or later) added to your project.

Steps:

  1. Clone the GitHub repository:

     git clone https://github.com/Mayurdpatil67/web-crawler.git
     cd web-crawler
    
  2. Compile the Java files (on Windows, use ; instead of : as the classpath separator):

     javac -cp .:jsoup-1.15.3.jar org/example/*.java
    
  3. Run the program:

     java -cp .:jsoup-1.15.3.jar org.example.WebCrawler
    

Example Interaction:

Enter your url
https://example.com
Enter the depth of the crawler
2
Enter the number of workers
4
Time taken: 1234 ms

🌟 5. GitHub Repository

The complete source code for this project is available on GitHub. Feel free to clone, fork, or contribute to the repository: https://github.com/Mayurdpatil67/web-crawler


🌟 6. Conclusion

This project showcases the power of multi-threading and concurrency in Java. You’ve learned how to:

  • Build a scalable, multi-threaded web crawler.

  • Use Phaser for thread synchronization.

  • Leverage libraries like JSoup for web scraping.

What’s Next?

  • Add support for robots.txt to comply with web crawling standards (a minimal check is sketched after this list).

  • Implement a depth-first or breadth-first crawling strategy.

  • Save crawled data to a database or file system.
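
For the robots.txt item above, here is a rough, hypothetical sketch in plain Java (the class name, method, and simplified rule handling are my own assumptions, not code from the repository). It downloads /robots.txt from the site root and checks a path against the Disallow rules in the wildcard (User-agent: *) group; a production crawler should use a proper parser and honour per-agent groups and crawl delays.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class RobotsCheck {
    // Returns true if the path is not matched by any Disallow rule found
    // under the wildcard user-agent group.
    public static boolean isAllowed(String baseUrl, String path) throws IOException {
        URI robots = URI.create(baseUrl).resolve("/robots.txt");
        List<String> disallowed = new ArrayList<>();
        boolean inWildcardGroup = false;
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(robots.toURL().openStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String trimmed = line.trim();
                if (trimmed.toLowerCase().startsWith("user-agent:")) {
                    inWildcardGroup = trimmed.substring(11).trim().equals("*");
                } else if (inWildcardGroup && trimmed.toLowerCase().startsWith("disallow:")) {
                    String rule = trimmed.substring(9).trim();
                    if (!rule.isEmpty()) {
                        disallowed.add(rule);
                    }
                }
            }
        }
        for (String rule : disallowed) {
            if (path.startsWith(rule)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(isAllowed("https://example.com", "/some/page"));
    }
}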

Happy crawling! 🌐🚀
