Build a Robust Web Scraper with Python: A Complete Guide


Overview

Over the past few weeks, I’ve been working on scraping structured contractor data from websites like YellowPages, and I kept running into the same problems: bot detection, content that loads dynamically and late, and an unreliable internet connection.

When I first got into web scraping, I had no idea where to start. There were countless tutorials online, but most were either outdated, too simplistic, or broke on real-world websites.

I wanted to extract structured data, but I ran into walls:

  • Websites blocked my bot within minutes

  • Pages didn’t load reliably due to dynamic content

  • Most examples never went beyond static requests + BeautifulSoup

  • No one talked about error handling or resuming after failure

Resources were scattered, and trial-and-error became my best teacher. I spent countless hours debugging XPath issues, avoiding bot detection, and building something reliable — only to watch it fail the next day due to a slight website change or IP block.

All these challenges eventually pushed me to build the flexible, reliable scraper I wish I’d had when I started, one that would:

  • Work on real-world sites

  • Handle network issues gracefully

  • Mimic human-like browsing to bypass detection

  • Save data in structured formats for future use

How It Works

Here’s a breakdown of how the scraper works, part by part. Each section of the code handles a specific task — from setting up the browser to navigating pages and extracting data — making the whole process smooth and reliable.
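
The snippets below are pieces of a single script. For orientation, here is roughly the set of imports they assume at the top of the file (reconstructed from the code shown, so treat it as a sketch rather than the exact header from the repo):

import os
import socket
import time
from datetime import datetime

import pandas as pd
import undetected_chromedriver as uc
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException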

📦 Wait for the Internet Connection

def wait_for_internet(timeout=120):
    while True:
        try:
            socket.create_connection(("8.8.8.8", 53), timeout=3)
            print(f"Internet is back at {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
            return True
        except OSError:
            print(f"Internet down at {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}. Retrying in 2 min...")
            time.sleep(timeout)

Before making any requests, this function checks for connectivity by opening a TCP connection to Google’s public DNS server (8.8.8.8); any other reliable host works just as well. It keeps the scraper from crashing when the internet drops mid-run.
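
Since nothing ties the check to Google’s resolver, one possible variant (not the version in the repo) makes the probe target configurable, so you could point it at Cloudflare’s 1.1.1.1 instead:

def wait_for_internet(host="8.8.8.8", port=53, retry_delay=120):
    """Block until a TCP connection to host:port succeeds, i.e. the network is back."""
    while True:
        try:
            socket.create_connection((host, port), timeout=3)
            print(f"Internet is back at {datetime.now():%Y-%m-%d %H:%M:%S}")
            return True
        except OSError:
            print(f"Internet down at {datetime.now():%Y-%m-%d %H:%M:%S}. Retrying in {retry_delay}s...")
            time.sleep(retry_delay)

# e.g. wait_for_internet(host="1.1.1.1")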

📦 Setup a Stealth WebDriver

def setup_driver():
    options = uc.ChromeOptions()
    options.add_argument("--disable-blink-features=AutomationControlled")
    options.add_argument("--start-maximized")
    options.add_argument("--no-sandbox")
    options.add_argument("--disable-dev-shm-usage")
    options.add_argument("--disable-gpu")
    options.add_argument("--headless")
    options.add_argument("--window-size=1920,1080")
    options.add_argument(
        "user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/114.0.0.0 Safari/537.36"
    )
    # Remove the --headless flag above if you want to watch the browser while debugging
    driver = uc.Chrome(options=options)
    driver.execute_cdp_cmd("Network.setUserAgentOverride", {
        "userAgent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                     "AppleWebKit/537.36 (KHTML, like Gecko) "
                     "Chrome/114.0.0.0 Safari/537.36"
    })
    return driver

We use undetected_chromedriver instead of the regular Selenium ChromeDriver. It is built to sidestep the bot-detection mechanisms (fingerprinting, automation flags) that websites often deploy. A quick way to verify the setup is sketched after the option list below.

Key options:

  • --headless: Runs browser without GUI (useful for servers)

  • Custom User-Agent: Mimics a real browser session

  • --disable-blink-features=AutomationControlled: Hides automation signals
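
A quick, illustrative way to sanity-check the stealth setup (not part of the original script): open a session and inspect what the page sees. With a stock ChromeDriver, navigator.webdriver is typically true; with the undetected driver it should not be.

# Hypothetical sanity check of the stealth configuration
driver = setup_driver()
try:
    print(driver.execute_script("return navigator.webdriver"))   # expect None/False here
    print(driver.execute_script("return navigator.userAgent"))   # should show the overridden user agent
finally:
    driver.quit()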

📦 Safe Data Extraction

def extract_contractor_data(driver, link):
    wait_for_internet()
    driver.get(link)
    time.sleep(2)
    wait = WebDriverWait(driver, 10)

    def safe_select(css, attr=None):
        try:
            el = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, css)))
            return el.get_attribute(attr).strip() if attr else el.text.strip()
        except Exception:
            return ""

    name    = safe_select("h1.dockable.business-name")
    phone   = safe_select("a.phone.dockable span.full")
    website = safe_select("a.website-link.dockable", "href")
    address = safe_select("span.address")
    email_href = safe_select("a.email-business", "href")
    email   = email_href.replace("mailto:", "") if email_href.startswith("mailto:") else ""

    return {
        "Name": name,
        "Phone": phone,
        "Website": website,
        "Address": address,
        "Email": email,
        "Profile Link": link
    }

This function opens an individual profile page and extracts:

  • Name

  • Phone Number

  • Website

  • Address

  • Email

It uses CSS Selectors and WebDriverWait to ensure elements are present before scraping — avoiding premature lookups.
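
For a quick standalone test, you can point the function at a single profile URL (the URL below is just a placeholder, matching the your_website.com stand-in used later):

# Hypothetical one-off test of the extraction logic
driver = setup_driver()
try:
    record = extract_contractor_data(driver, "https://www.your_website.com/some-profile")
    print(record)
finally:
    driver.quit()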

📦 Resume-Capable Main Scraper

def scrape(city_name, contractor_url, max_pages=100):
    driver = setup_driver()
    wait = WebDriverWait(driver, 10)
    data = []

    safe_city_name = city_name.replace(" ", "_")
    filename = f"{safe_city_name}.csv"
    existing_links = set()

    if os.path.exists(filename):
        print(f"Resuming from existing file: {filename}")
        existing_df = pd.read_csv(filename)
        existing_links = set(existing_df["Profile Link"].dropna().tolist())
        data.extend(existing_df.to_dict(orient="records"))

    try:
        for page in range(1, max_pages + 1):
            print(f"\n{city_name} — Processing page {page}...")
            url = f"https://www.your_website.com/{contractor_url}?page={page}" if page > 1 else f"https://www.your_website.com/{contractor_url}"
            wait_for_internet()
            driver.get(url)

            for attempt in range(3):
                try:
                    wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "a.business-name")))
                    break
                except TimeoutException:
                    print(f"Retry {attempt+1}/3 — waiting for contractor links…")
                    time.sleep(2)
            else:
                print("Could not load contractor links. Skipping city.")
                break

            links = [el.get_attribute("href")
                     for el in driver.find_elements(By.CSS_SELECTOR, "a.business-name")
                     if el.get_attribute("href")]

            for link in links:
                if link in existing_links:
                    print(f"Skipping already scraped: {link}")
                    continue

                print(f"Scraping: {link}")
                contractor_data = extract_contractor_data(driver, link)
                data.append(contractor_data)
                existing_links.add(link)

                # Optional: Save after each entry (safer but slower)
                pd.DataFrame(data).to_csv(filename, index=False)
                time.sleep(1)

    except KeyboardInterrupt:
        print("\nInterrupted by user! Saving collected data for this city…")

    finally:
        driver.quit()
        df = pd.DataFrame(data)
        df.to_csv(filename, index=False)
        print(f"Saved {len(df)} entries to '{filename}'")

The core of the scraper is the scrape() function. It navigates through multiple pages of search results (up to max_pages) and extracts individual contractor details from each profile page.

Here’s how it works:

  1. Construct the URL for each results page

  2. Find all the listing links on that page

  3. Pause briefly between requests (rate limiting; see the sketch after this list)

  4. Open each profile and extract data

  5. Save data to CSV after every new entry (optional)

  6. Resume support (very useful!)
    If a CSV file already exists for a city, the scraper loads it first and skips any links already listed in it. This makes the scraper resumable, efficient, and safe to stop and restart at any point without losing previous work.
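
On the rate-limiting point: the code above sleeps a fixed one second between profiles. If a site is particularly strict, a small randomized delay looks a touch more human. A minimal tweak (not in the original code) would be:

import random

# Hypothetical tweak: jitter the pause between profile requests instead of a fixed 1 s
time.sleep(random.uniform(1.5, 4.0))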

📦 Main driver function

def main():
    cities_df = pd.read_csv("your_file_name.csv") 
    for _, row in cities_df.iterrows():
        city_name = row["city"]
        contractor_url = row["url"]
        scrape(city_name, contractor_url, max_pages=100)


if __name__ == "__main__":
    main()

This reads a list of cities from a CSV file (with columns city and url) and invokes the scraper for each one sequentially.
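
As an example, the input file could be generated like this (the city names and URL slugs are made up purely for illustration):

import pandas as pd

# Hypothetical input file: one row per city, with the directory URL slug to scrape
pd.DataFrame({
    "city": ["Austin", "Denver"],
    "url": ["contractors-austin-tx", "contractors-denver-co"],
}).to_csv("your_file_name.csv", index=False)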

📦 Project Dependencies

pandas
undetected-chromedriver
selenium
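
All three are regular PyPI packages, so a plain pip install pandas undetected-chromedriver selenium (or an equivalent requirements.txt) covers the setup, along with a local Chrome installation for undetected-chromedriver to drive.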

Usage Tips

💡Customization for Other Websites

To adapt this scraper for other sites:

  1. Update CSS Selectors: Modify the selectors in extract_contractor_data() (see the sketch after this list)

  2. Change URL Pattern: Update the URL construction logic

  3. Adjust Wait Conditions: Modify what elements to wait for

  4. Update Data Fields: Change the returned dictionary structure
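
For instance, adapting the extractor to a hypothetical different directory site might look roughly like this; the selectors and field names below are invented for illustration and won’t match any real site:

def extract_listing_data(driver, link):
    # Hypothetical adaptation of extract_contractor_data() for another site
    wait_for_internet()
    driver.get(link)
    wait = WebDriverWait(driver, 10)

    def safe_select(css, attr=None):
        try:
            el = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, css)))
            return el.get_attribute(attr).strip() if attr else el.text.strip()
        except Exception:
            return ""

    return {
        "Name": safe_select("h2.listing-title"),                 # step 1: new selectors
        "Phone": safe_select("div.contact-info span.phone"),
        "Website": safe_select("a.external-site", "href"),
        "Profile Link": link,                                    # step 4: adjusted fields
    }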

Conclusion

This scraper template provides a solid foundation for reliable web scraping projects. The combination of stealth techniques, robust error handling, and data persistence makes it suitable for production environments where reliability is crucial.

The code successfully helped me scrape large datasets without encountering blocking issues, and the resume functionality saved countless hours when dealing with interruptions.

Feel free to adapt this template for your own scraping needs, and remember to always scrape responsibly!


I’ve made the entire code public on GitHub for others to use and contribute. Clone, fork, or star the repo:

https://github.com/Varun3507/selenium-stealth-scraper.git


Found this helpful? Give it a ⭐ on GitHub and follow me for more web scraping and automation content!

#webscraping #python #selenium #automation #datascience
