Multi-threaded Web Crawler in Java


The web is a vast ocean of information, and web crawlers are the boats we use to navigate it. In this blog, we’ll explore how to build a multi-threaded web crawler in Java. This project leverages advanced Java concurrency utilities to enable efficient and scalable crawling.
By the end of this blog, you’ll have a working web crawler that can fetch links up to a specified depth, process multiple URLs concurrently, and measure execution time for performance insights.
🔎 1. Introduction: What are we building and why?
A web crawler is a program designed to systematically browse the web and extract hyperlinks. In this project, we’re building a multi-threaded web crawler that:
Starts from a given URL.
Crawls up to a specified depth.
Uses multiple threads for concurrent crawling.
Ensures no URL is visited more than once.
This project is perfect for learning:
Java concurrency utilities like Phaser, ExecutorService, and ConcurrentHashMap (see the minimal coordination sketch right after this list).
How to manage shared resources in a multi-threaded environment.
Practical use of the JSoup library for parsing HTML.
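Before walking through the crawler classes, here is a minimal, self-contained sketch of the coordination pattern the crawler relies on: every task is registered with a Phaser before it is submitted and deregisters itself when it finishes, so the main thread can simply wait for the phase to advance. The PhaserDemo class below is illustrative only and is not part of the project.
package org.example;

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Phaser;

// Illustrative only: the register / arriveAndDeregister / awaitAdvance pattern
// that CrawlerTask and WebCrawler use further down.
public class PhaserDemo {
    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        Phaser phaser = new Phaser(1); // one party registered up front, for the first task

        pool.submit(() -> {
            try {
                System.out.println("first task running");
                // A task that spawns more work registers the new task before submitting it.
                phaser.register();
                pool.submit(() -> {
                    try {
                        System.out.println("child task running");
                    } finally {
                        phaser.arriveAndDeregister(); // child task is done
                    }
                });
            } finally {
                phaser.arriveAndDeregister(); // first task is done
            }
        });

        // Blocks until every registered party has arrived, i.e. all tasks have finished.
        phaser.awaitAdvance(phaser.getPhase());
        pool.shutdown();
        System.out.println("all tasks completed");
    }
}
The same three calls appear, in the same order, in CrawlerTask and WebCrawler below.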
📚 2. Key Components and Classes
CrawlerTask
The CrawlerTask class represents a single crawling task. It fetches links from a URL and submits new tasks for each discovered link, provided the maximum depth hasn’t been reached.
Key features:
Uses Phaser for thread coordination.
Prevents revisiting URLs using URLStore.
Submits new tasks for discovered links.
package org.example;

import java.util.Set;
import java.util.concurrent.Phaser;

public class CrawlerTask implements Runnable {

    private final URLStore urlStore;
    private final URLFetcher urlFetcher;
    private final int maxDepth;
    private final int currentDepth;
    private final Phaser phaser;

    public CrawlerTask(URLStore urlStore, URLFetcher urlFetcher, int maxDepth, int currentDepth, Phaser phaser) {
        this.urlStore = urlStore;
        this.urlFetcher = urlFetcher;
        this.maxDepth = maxDepth;
        this.currentDepth = currentDepth;
        this.phaser = phaser;
    }

    @Override
    public void run() {
        try {
            // Take the next URL off the shared queue; stop if the queue is empty
            // or this task is already beyond the maximum depth.
            String url = urlStore.getNextUrl();
            System.out.println(Thread.currentThread().getName() + " " + url);
            if (url == null || currentDepth > maxDepth) return;

            // Fetch all hyperlinks on the page and submit a new task for every
            // link that has not been seen before.
            Set<String> links = urlFetcher.fetchLinks(url);
            for (String link : links) {
                if (urlStore.addUrl(link)) {
                    // Register the new task with the Phaser *before* submitting it,
                    // so the main thread keeps waiting until it completes.
                    phaser.register();
                    WebCrawler.submitTask(urlStore, urlFetcher, currentDepth + 1, maxDepth);
                }
            }
        } catch (Exception e) {
            System.out.println("Error while crawling: " + e.getMessage());
        } finally {
            // Always signal completion, even on failure, so the crawler can terminate.
            phaser.arriveAndDeregister();
        }
    }
}
URLFetcher
The URLFetcher class uses the JSoup library to fetch hyperlinks from a web page.
Key features:
Connects to the given URL and parses its HTML.
Extracts and returns all anchor (<a>) tags with valid href attributes.
package org.example;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

public class URLFetcher {

    public Set<String> fetchLinks(String url) {
        Set<String> links = new HashSet<>();
        Document document;
        try {
            // Download and parse the page, allowing up to 50 seconds per request.
            document = Jsoup.connect(url).timeout(50000).get();
        } catch (IOException e) {
            // Propagate the failure; CrawlerTask catches it and moves on.
            throw new RuntimeException(e);
        }

        // Collect the absolute URL of every anchor tag that has an href attribute.
        Elements anchorTags = document.select("a[href]");
        for (Element link : anchorTags) {
            String extractedUrl = link.absUrl("href");
            if (!extractedUrl.isEmpty()) {
                links.add(extractedUrl);
            }
        }
        return links;
    }
}
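If you want to try URLFetcher on its own before wiring up the full crawler, a quick throwaway main like the one below is enough. FetcherDemo and the example URL are placeholders, not part of the project; jsoup must be on the classpath.
package org.example;

import java.util.Set;

// Quick standalone check for URLFetcher (illustrative only).
public class FetcherDemo {
    public static void main(String[] args) {
        URLFetcher fetcher = new URLFetcher();
        // Any reachable page works here; https://example.com is just a placeholder.
        Set<String> links = fetcher.fetchLinks("https://example.com");
        System.out.println("Found " + links.size() + " links:");
        links.forEach(System.out::println);
    }
}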
URLStore
The URLStore class manages the URLs to be visited and ensures no URL is visited more than once.
Key features:
Uses a ConcurrentHashMap to keep track of visited URLs.
Uses a BlockingQueue to store URLs waiting to be crawled.
package org.example;

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;

public class URLStore {

    // Tracks every URL that has ever been added, so it is never crawled twice.
    private final ConcurrentHashMap<String, Boolean> visitedUrl = new ConcurrentHashMap<>();

    // Holds URLs that are waiting to be crawled.
    private final BlockingQueue<String> urlQueue = new LinkedBlockingQueue<>();

    public boolean addUrl(String url) {
        // putIfAbsent is atomic: only the first caller for a given URL gets null back,
        // so only that caller enqueues the URL.
        if (visitedUrl.putIfAbsent(url, true) == null) {
            urlQueue.offer(url);
            return true;
        }
        return false;
    }

    public String getNextUrl() {
        // poll() is non-blocking and returns null if the queue is empty.
        return urlQueue.poll();
    }

    public boolean isQueueEmpty() {
        return urlQueue.isEmpty();
    }
}
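Because putIfAbsent is atomic, only the first thread to add a given URL sees null and enqueues it; every later attempt gets false back. The tiny check below (illustrative only, not part of the project) makes that behaviour visible:
package org.example;

// Illustrative check of URLStore's deduplication.
public class URLStoreDemo {
    public static void main(String[] args) {
        URLStore store = new URLStore();
        System.out.println(store.addUrl("https://example.com")); // true  -> enqueued
        System.out.println(store.addUrl("https://example.com")); // false -> already seen
        System.out.println(store.getNextUrl());                  // https://example.com
        System.out.println(store.getNextUrl());                  // null  -> queue is empty
        System.out.println(store.isQueueEmpty());                // true
    }
}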
WebCrawler
The WebCrawler class is the entry point for the application. It:
Takes user input for the starting URL, depth, and number of threads.
Initializes the Phaser for thread synchronization.
Submits the first crawling task and manages thread execution.
package org.example;

import java.util.Scanner;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Phaser;

public class WebCrawler {

    private static Phaser phaser;
    private static ExecutorService executorService;

    public static void main(String[] args) {
        Scanner sc = new Scanner(System.in);
        System.out.println("Enter your url");
        String url = sc.nextLine();
        System.out.println("Enter the depth of the crawler");
        int maxDepth = sc.nextInt();
        System.out.println("Enter the number of workers");
        int maxThreads = sc.nextInt();

        URLStore urlStore = new URLStore();
        URLFetcher urlFetcher = new URLFetcher();

        // Start with one registered party, which the first submitted task
        // deregisters when it finishes.
        phaser = new Phaser(1);
        executorService = Executors.newFixedThreadPool(maxThreads);

        // Seed the queue with the starting URL and kick off the first task.
        urlStore.addUrl(url);
        long start = System.currentTimeMillis();
        submitTask(urlStore, urlFetcher, 0, maxDepth);

        // Block until every registered task has arrived and deregistered.
        phaser.awaitAdvance(phaser.getPhase());
        executorService.shutdown();
        System.out.println("Time taken: " + (System.currentTimeMillis() - start) + " ms");
    }

    public static void submitTask(URLStore urlStore, URLFetcher urlFetcher, int currentDepth, int maxDepth) {
        executorService.submit(new CrawlerTask(urlStore, urlFetcher, maxDepth, currentDepth, phaser));
    }
}
🧠 3. Challenges & Learnings
Challenges:
Handling shared resources (URLStore) safely in a multi-threaded environment.
Efficiently coordinating threads using Phaser.
Parsing and validating URLs with JSoup.
Learnings:
How to use advanced Java concurrency utilities like Phaser and ExecutorService.
Best practices for managing shared resources in multi-threaded applications.
Practical usage of third-party libraries for web scraping.
🏃‍♂️ 4. How to Run It
Prerequisites:
JDK installed (Java Development Kit).
JSoup library (jsoup-1.15.3.jar or later) added to your project.
Steps:
Clone the GitHub repository:
git clone https://github.com/Mayurdpatil67/web-crawler.git
cd web-crawler
Compile the Java files:
javac -cp .:jsoup-1.15.3.jar org/example/*.java
Run the program:
java -cp .:jsoup-1.15.3.jar org.example.WebCrawler
(On Windows, use ; instead of : as the classpath separator.)
Example Interaction:
Enter your url
https://example.com
Enter the depth of the crawler
2
Enter the number of workers
4
Time taken: 1234 ms
🌟 5. GitHub Repository
The complete source code for this project is available on GitHub. Feel free to clone, fork, or contribute to the repository:
- GitHub Link: Web Crawler System (https://github.com/Mayurdpatil67/web-crawler)
🌟 6. Conclusion
This project showcases the power of multi-threading and concurrency in Java. You’ve learned how to:
Build a scalable, multi-threaded web crawler.
Use Phaser for thread synchronization.
Leverage libraries like JSoup for web scraping.
What’s Next?
Add support for robots.txt to comply with web crawling standards (a rough sketch follows this list).
Implement a depth-first or breadth-first crawling strategy.
Save crawled data to a database or file system.
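As a taste of the first item, here is a rough sketch of a robots.txt check. It is an illustration only, not part of the project: it fetches /robots.txt with plain java.net, honours only Disallow rules under "User-agent: *", and ignores Allow rules, wildcards, and Crawl-delay.
package org.example;

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

// Naive robots.txt check (sketch only): returns false if the URL's path
// matches a Disallow rule in the "User-agent: *" group.
public class RobotsTxtChecker {

    public static boolean isAllowed(String url) {
        try {
            URI uri = URI.create(url);
            String robotsUrl = uri.getScheme() + "://" + uri.getHost() + "/robots.txt";

            List<String> disallowed = new ArrayList<>();
            boolean appliesToUs = false;
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(URI.create(robotsUrl).toURL().openStream()))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    line = line.trim();
                    if (line.toLowerCase().startsWith("user-agent:")) {
                        appliesToUs = line.substring(11).trim().equals("*");
                    } else if (appliesToUs && line.toLowerCase().startsWith("disallow:")) {
                        String rule = line.substring(9).trim();
                        if (!rule.isEmpty()) disallowed.add(rule);
                    }
                }
            }

            String path = (uri.getPath() == null || uri.getPath().isEmpty()) ? "/" : uri.getPath();
            for (String rule : disallowed) {
                if (path.startsWith(rule)) return false;
            }
            return true;
        } catch (Exception e) {
            // If robots.txt cannot be fetched, this sketch simply allows the URL.
            return true;
        }
    }
}
CrawlerTask could call something like RobotsTxtChecker.isAllowed(url) before fetching, and cache the parsed rules per host so robots.txt is not downloaded again for every page.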
Happy crawling! 🌐🚀
Written by Mayur Patil
Skilled in Java & Spring Boot, Backend Enthusiast