😵‍💫 What Are Infinite Traps in Web Crawling (And How to Avoid Them in Python)


It Begins
If you've ever built a web crawler or are just getting into it, you'll quickly learn that the internet is a jungle. It's full of exciting content, but also full of traps. One of the most frustrating issues you'll run into is something called an infinite trap.
Let's break down what infinite traps are, why they're a problem, and how you can build a Python web crawler that avoids them.
🕳️ What's an Infinite Trap?
Imagine you're exploring a maze. You turn a corner, and there's a door. You go through it… and find yourself right back where you started. That's basically what an infinite trap is for a crawler.
Infinite traps are patterns on websites that can make a crawler get stuck in an endless loop, constantly visiting the same or similar pages without ever reaching new content. They're bad news because:
They waste your time and bandwidth
They can cause your crawler to crash or slow down
They may get your crawler blocked by websites
Common Examples of Infinite Traps
Here are a few classic ones:
Cyclic Links: Page A links to B, and B links back to A. Your crawler just keeps bouncing between them.
Never-ending Pagination: Some sites have infinite scrolling or pagination that just keeps going (even when there's no more content).
Dynamic URLs with Parameters: URLs that look different but serve the same content, like:
/product?id=123
/product?id=123&utm=facebook
/product?id=123&utm=twitter
Session IDs in URLs: Some pages add a unique session ID every time you visit. To your crawler, that looks like a brand-new page each time (see the normalization sketch right after this list).
Redirect Loops: Page A redirects to B, which redirects to C, which… you guessed it, redirects back to A.
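A good first defence against the URL-based traps above is to normalize every link before deciding whether you've seen it. The crawler later in this post simply strips the whole query string; if some parameters actually matter (like id above), you can instead drop only known tracking and session parameters. Here's a minimal sketch of that second approach; the names in TRACKING_PARAMS are just examples you'd tune for the sites you crawl:

from urllib.parse import urlparse, urlunparse, parse_qsl, urlencode, urldefrag

# Query parameters that change the URL but (usually) not the content.
# This list is illustrative; adjust it for the sites you actually crawl.
TRACKING_PARAMS = {"utm", "utm_source", "utm_medium", "utm_campaign", "sessionid", "sid", "phpsessid"}

def canonicalize(url):
    url, _ = urldefrag(url)  # drop #fragments
    parts = urlparse(url)
    # Keep only parameters that aren't tracking/session noise, in a stable order
    query = sorted((k, v) for k, v in parse_qsl(parts.query)
                   if k.lower() not in TRACKING_PARAMS)
    return urlunparse(parts._replace(query=urlencode(query)))

print(canonicalize("https://shop.example/product?id=123&utm=facebook"))
print(canonicalize("https://shop.example/product?id=123&utm=twitter"))
# Both print: https://shop.example/product?id=123

Redirect loops need no extra code here: requests follows redirects on its own and raises TooManyRedirects if it goes in circles, which the try/except in the crawler below already catches.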
🛠️ Let's Build a Python Crawler That Avoids Traps
We'll use:
requests – to make HTTP requests
BeautifulSoup – to parse HTML
urllib.parse – to clean up and join URLs
Step 1: Install the Libraries
pip install requests beautifulsoup4
Step 2: The Code
Here's a basic crawler that avoids infinite traps by keeping track of where it's been and ignoring repeat or unnecessary pages.
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse, urldefrag

visited_urls = set()
MAX_PAGES = 1000

def normalize_url(url):
    # Drop the #fragment and the query string so look-alike URLs collapse into one
    url, _ = urldefrag(url)
    parsed = urlparse(url)
    return parsed.scheme + "://" + parsed.netloc + parsed.path

def is_valid_url(url):
    # Skip file links like images or PDFs
    return not url.lower().endswith(('.jpg', '.jpeg', '.png', '.pdf', '.zip'))

def crawl(url, base_domain, depth=0):
    if len(visited_urls) >= MAX_PAGES:
        return
    normalized_url = normalize_url(url)
    if normalized_url in visited_urls or not is_valid_url(url):
        return
    try:
        response = requests.get(url, timeout=5)
        if response.status_code != 200:
            return
    except requests.RequestException:
        return
    print(f"[{depth}] Crawling: {normalized_url}")
    visited_urls.add(normalized_url)
    soup = BeautifulSoup(response.text, 'html.parser')
    for link in soup.find_all('a', href=True):
        href = urljoin(url, link['href'])
        href = normalize_url(href)
        # Only follow links within the same domain
        if urlparse(href).netloc == base_domain:
            crawl(href, base_domain, depth + 1)

if __name__ == "__main__":
    start_url = "https://example.com"
    domain = urlparse(start_url).netloc
    crawl(start_url, domain)
✅ What This Code Handles
Avoids visiting the same page twice (by normalizing URLs)
Skips unnecessary file types like images and PDFs
Stays within the same domain
Limits total pages so it doesn't run forever
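One caveat worth knowing about: the crawl() function above is recursive, so a long chain of links can bump into Python's recursion limit (about 1,000 frames by default). Here's a rough sketch of the same logic with an explicit queue; it reuses the helpers and globals defined above, so treat it as an alternative loop rather than a drop-in from some library:

from collections import deque

def crawl_iterative(start_url, base_domain):
    # Same trap-avoidance checks as crawl(), but breadth-first with a queue
    queue = deque([(start_url, 0)])
    while queue and len(visited_urls) < MAX_PAGES:
        url, depth = queue.popleft()
        normalized_url = normalize_url(url)
        if normalized_url in visited_urls or not is_valid_url(url):
            continue
        try:
            response = requests.get(url, timeout=5)
            if response.status_code != 200:
                continue
        except requests.RequestException:
            continue
        print(f"[{depth}] Crawling: {normalized_url}")
        visited_urls.add(normalized_url)
        soup = BeautifulSoup(response.text, 'html.parser')
        for link in soup.find_all('a', href=True):
            href = normalize_url(urljoin(url, link['href']))
            if urlparse(href).netloc == base_domain:
                queue.append((href, depth + 1))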
💡 Bonus Ideas to Make It Smarter
If you want to level it up a bit (the first three ideas are sketched right after this list):
Respect robots.txt: Use Python's robotparser to avoid crawling restricted pages.
Detect repeating content: Hash page content or titles and skip pages if they're the same.
Add delays between requests: Don't spam servers; add a time.sleep() between each request.
Use Scrapy: If you're doing this seriously, consider using the Scrapy framework; it has a lot of this built in.
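Here's a rough sketch of how the first three ideas could bolt onto the crawler above. The helper names (allowed_by_robots, is_duplicate_content) and the CRAWL_DELAY value are made up for this example, and a real crawler would cache one robots.txt parser per domain instead of re-reading it for every URL:

import time
import hashlib
from urllib import robotparser
from urllib.parse import urlparse

CRAWL_DELAY = 1.0    # seconds to wait between requests (be polite)
seen_hashes = set()  # fingerprints of page bodies we've already processed

def allowed_by_robots(url, user_agent="*"):
    # Ask the site's robots.txt whether we may fetch this URL
    parts = urlparse(url)
    rp = robotparser.RobotFileParser()
    rp.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
    rp.read()
    return rp.can_fetch(user_agent, url)

def is_duplicate_content(html):
    # Hash the raw HTML; identical content served under different URLs gets skipped
    digest = hashlib.sha256(html.encode("utf-8", errors="ignore")).hexdigest()
    if digest in seen_hashes:
        return True
    seen_hashes.add(digest)
    return False

# Inside crawl(), before requests.get(url):
#     if not allowed_by_robots(url):
#         return
#     time.sleep(CRAWL_DELAY)
# ...and after a successful fetch:
#     if is_duplicate_content(response.text):
#         return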
Wrapping Up
Infinite traps are sneaky, but with some careful planning, you can build crawlers that are smart, respectful, and efficient. Start small, test often, and keep an eye on where your crawler goes, because the web doesn't always play fair.
Till next time, folks. Happy coding.