😵‍💫 What Are Infinite Traps in Web Crawling (And How to Avoid Them in Python)


It Begins

If you’ve ever built a web crawler or are just getting into it, you’ll quickly learn that the internet is a jungle. It’s full of exciting content—but also full of traps. One of the most frustrating issues you’ll run into is something called an infinite trap.

Let’s break down what infinite traps are, why they’re a problem, and how you can build a Python web crawler that avoids them.


🕳️ What’s an Infinite Trap?

Imagine you’re exploring a maze. You turn a corner, and there’s a door. You go through it… and find yourself right back where you started. That’s basically what an infinite trap is for a crawler.

Infinite traps are patterns on websites that can make a crawler get stuck in an endless loop—constantly visiting the same or similar pages without ever reaching new content. They’re bad news because:

  • They waste your time and bandwidth

  • They can cause your crawler to crash or slow down

  • They may get your crawler blocked by websites


🤔 Common Examples of Infinite Traps

Here are a few classic ones:

  1. Cyclic Links: Page A links to B, and B links back to A. Your crawler just keeps bouncing between them.

  2. Never-ending Pagination: Some sites have infinite scrolling or pagination that just keeps going (even when there’s no more content).

  3. Dynamic URLs with Parameters: URLs that look different but serve the same content, like:

    • /product?id=123

    • /product?id=123&utm=facebook

    • /product?id=123&utm=twitter

  4. Session IDs in URLs: Some pages add a unique session ID every time you visit. To your crawler, that looks like a brand-new page each time (the normalization sketch after this list shows one way to strip these).

  5. Redirect Loops: Page A redirects to B, which redirects to C, which… you guessed it, redirects back to A.
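
Traps 3 and 4 boil down to the same thing: many URLs that point at the same content. A common defence is to normalize every URL before comparing it against what you've already seen—drop the fragment and strip query parameters that don't change the page. Here's a minimal sketch of the idea; the parameter names in TRACKING_PARAMS are just common examples, so adjust the list for the sites you actually crawl.

from urllib.parse import urlparse, urlunparse, urlencode, parse_qsl, urldefrag

# Query parameters that usually don't change the content (examples only)
TRACKING_PARAMS = {"utm", "utm_source", "utm_medium", "utm_campaign",
                   "sessionid", "phpsessid", "sid"}

def canonicalize(url):
    url, _ = urldefrag(url)  # drop the #fragment
    parsed = urlparse(url)
    # Keep only query parameters that aren't tracking/session noise
    kept = [(k, v) for k, v in parse_qsl(parsed.query)
            if k.lower() not in TRACKING_PARAMS]
    return urlunparse(parsed._replace(query=urlencode(sorted(kept))))

print(canonicalize("https://example.com/product?id=123&utm=facebook"))
print(canonicalize("https://example.com/product?id=123&utm=twitter"))
# Both print: https://example.com/product?id=123

The crawler below takes a simpler route and throws away the query string entirely, which is safe for many sites but will collapse genuinely different pages like /product?id=123 and /product?id=456 into one.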


🛠️ Let’s Build a Python Crawler That Avoids Traps

We’ll use:

  • requests – to make HTTP requests

  • BeautifulSoup – to parse HTML

  • urllib.parse – to clean up and join URLs

Step 1: Install the Libraries

pip install requests beautifulsoup4

Step 2: The Code

Here’s a basic crawler that avoids infinite traps by keeping track of where it’s been and ignoring repeat or unnecessary pages.

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse, urldefrag

visited_urls = set()
MAX_PAGES = 1000 

def normalize_url(url):
    url, _ = urldefrag(url)
    parsed = urlparse(url)
    return parsed.scheme + "://" + parsed.netloc + parsed.path

def is_valid_url(url):
    # Skip file links like images or PDFs
    return not url.lower().endswith(('.jpg', '.jpeg', '.png', '.pdf', '.zip'))

def crawl(url, base_domain, depth=0):
    global visited_urls

    if len(visited_urls) >= MAX_PAGES:
        return

    normalized_url = normalize_url(url)

    if normalized_url in visited_urls or not is_valid_url(url):
        return

    try:
        response = requests.get(url, timeout=5)
        if response.status_code != 200:
            return
    except requests.RequestException:
        return

    print(f"[{depth}] Crawling: {normalized_url}")
    visited_urls.add(normalized_url)

    soup = BeautifulSoup(response.text, 'html.parser')
    for link in soup.find_all('a', href=True):
        href = urljoin(url, link['href'])
        href = normalize_url(href)

        # Only follow links within the same domain
        if urlparse(href).netloc == base_domain:
            crawl(href, base_domain, depth + 1)

if __name__ == "__main__":
    start_url = "https://example.com"
    domain = urlparse(start_url).netloc
    crawl(start_url, domain)

✅ What This Code Handles

  • Avoids visiting the same page twice (by normalizing URLs)

  • Skips unnecessary file types like images and PDFs

  • Stays within the same domain

  • Limits total pages so it doesn’t run forever
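
One more trap from the earlier list, redirect loops (trap 5), is mostly handled for free: requests follows at most 30 redirects by default and then raises TooManyRedirects, which is a subclass of RequestException and so already lands in the except branch above. If you'd rather fail faster, you can lower that limit on a shared Session. A small sketch, assuming you reuse one session for the whole crawl:

import requests

# Reuse one Session for the whole crawl and cap redirects so redirect
# loops fail fast instead of bouncing up to 30 times.
session = requests.Session()
session.max_redirects = 5  # requests raises TooManyRedirects past this

def fetch(url):
    try:
        return session.get(url, timeout=5)
    except requests.TooManyRedirects:
        print(f"Redirect loop detected: {url}")
        return None
    except requests.RequestException:
        return None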


💡 Bonus Ideas to Make It Smarter

If you want to level it up a bit (a sketch covering the first three ideas follows this list):

  • Respect robots.txt: Use Python’s robotparser to avoid crawling restricted pages.

  • Detect repeating content: Hash page content or titles and skip if they’re the same.

  • Add delays between requests: Don’t spam servers—add a time.sleep() between each request.

  • Use Scrapy: If you’re doing this seriously, consider using the Scrapy framework—it has a lot of this built-in.
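
Here's a rough sketch of the first three ideas bolted onto the crawler above; names like CRAWL_DELAY and seen_hashes are made up for this example.

import hashlib
import time
from urllib import robotparser

CRAWL_DELAY = 1.0      # seconds to wait between requests
seen_hashes = set()    # hashes of page bodies we've already processed

robots = robotparser.RobotFileParser()
robots.set_url("https://example.com/robots.txt")
robots.read()

def allowed(url):
    # Check robots.txt before fetching anything
    return robots.can_fetch("*", url)

def is_duplicate(html):
    # Hash the page body and skip it if we've seen identical content before
    digest = hashlib.sha256(html.encode("utf-8")).hexdigest()
    if digest in seen_hashes:
        return True
    seen_hashes.add(digest)
    return False

# Inside crawl(): call allowed(url) before requests.get(), check
# is_duplicate(response.text) before parsing, and time.sleep(CRAWL_DELAY)
# after every request.

If you go the Scrapy route instead, ROBOTSTXT_OBEY, DOWNLOAD_DELAY, and its built-in duplicate request filter give you most of this as configuration.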


🚀 Wrapping Up

Infinite traps are sneaky, but with some careful planning, you can build crawlers that are smart, respectful, and efficient. Start small, test often, and keep an eye on where your crawler goes—because the web doesn’t always play fair.
Till next time, folks. Happy coding.
