Ultimate Guide to Web Scraping with Python: Extract Indeed Job Listings

Saurabh Verma
5 min read

Web scraping is a powerful technique to automatically extract data from websites. Whether you're building a job aggregator, a price comparison engine, or just exploring data, Python makes scraping relatively simple and powerful. In this blog, we'll walk through building a real-world job scraper using Python, PostgreSQL, ScrapeOps, and Docker — all tied together in a production-grade setup.

🤔 What is Web Scraping?

Web scraping is the process of programmatically fetching data from websites. You send HTTP requests to a webpage, parse the HTML, and extract specific pieces of information like product prices, job listings, or news headlines.

🔍 Real-World Use Cases:

  • Price comparison from e-commerce platforms

  • Job aggregators pulling listings from multiple job boards

  • Market research using user reviews or social posts

  • News aggregation from various media sites

⚙️ How Web Scraping Works (In 5 Steps)

  1. Send HTTP Request to the target webpage

  2. Download HTML Content

  3. Parse the HTML using libraries like BeautifulSoup or lxml

  4. Extract Data using tags, classes, or DOM structure

  5. Store Data into CSV, JSON, or a database like PostgreSQL
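
To make these steps concrete, here is a minimal, self-contained sketch using requests and BeautifulSoup. The URL and class names are placeholders for illustration; the Indeed scraper later in this post follows the same pattern.

```python
# Minimal sketch of the five steps (URL and selectors are placeholders).
import csv

import requests
from bs4 import BeautifulSoup

# 1 & 2: send the HTTP request and download the HTML
response = requests.get("https://example.com/jobs")
response.raise_for_status()

# 3: parse the HTML
soup = BeautifulSoup(response.text, "html.parser")

# 4: extract data by tag/class (hypothetical markup)
rows = [
    {"title": card.get_text(strip=True)}
    for card in soup.find_all("h2", class_="job-title")
]

# 5: store the data (CSV here; PostgreSQL later in this post)
with open("jobs.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["title"])
    writer.writeheader()
    writer.writerows(rows)
```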

🐍 Scraping Job Listings with Python

We'll scrape job listings from Indeed, store them in a PostgreSQL database, export to CSV, and send results via email.

🧰 Prerequisites

  1. ✅ Python Environment

Ensure Python 3.7+ is installed. You can use a virtual environment:

```bash
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate
```

  2. ✅ PostgreSQL Setup

Create a PostgreSQL database and table:

```sql
CREATE TABLE master (
    id SERIAL PRIMARY KEY,
    job_title TEXT,
    company_name TEXT,
    job_description TEXT
);
```

  3. ✅ Environment Variables (.env)

Create a .env file with your sensitive config:

```env
DB_HOST=localhost
DB_PORT=5432
DB_NAME=your_database_name
DB_USER_NAME=your_db_user
DB_USER_PASSWORD=your_db_password

SCRAPEOPS_API_KEY=your_scrapeops_api_key
SCRAPEOPS_HEADER_URL=https://headers.scrapeops.io/v1/browser-headers?api_key=
SCRAPEOPS_PROXY_URL=https://proxy.scrapeops.io/v1/?
JOBS_URL=https://www.indeed.com/jobs?
JOBS_ID_URL=https://www.indeed.com/viewjob?jk=
CSV_LOCATION=job_data.csv
```

  4. ✅ Create Utility Modules

You need two utility files to support the main script:

  • index.py — Handles the PostgreSQL connection helpers used by the main script (a minimal sketch is shown below).

  • send_email.py — A utility to send email alerts (e.g., via SMTP or an API like SendGrid); a simple SMTP version is shown later in this post.

Load the .env file in Python using dotenv:

```python
from dotenv import load_dotenv

load_dotenv()
```
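
The main script later calls get_db_connection(), get_db_cursor(), and close_connection() from index.py. The exact implementation isn't reproduced in this post, but a minimal sketch built on psycopg2 and the .env values above could look like this:

```python
# index.py — minimal sketch of the database helpers (your version may differ)
import os

import psycopg2
from dotenv import load_dotenv

load_dotenv()

_conn = None  # shared connection, opened lazily


def get_db_connection():
    """Open (or reuse) a connection built from the .env settings."""
    global _conn
    if _conn is None or _conn.closed:
        _conn = psycopg2.connect(
            host=os.getenv("DB_HOST"),
            port=os.getenv("DB_PORT"),
            dbname=os.getenv("DB_NAME"),
            user=os.getenv("DB_USER_NAME"),
            password=os.getenv("DB_USER_PASSWORD"),
        )
    return _conn


def get_db_cursor():
    """Return a cursor on the shared connection."""
    return get_db_connection().cursor()


def close_connection():
    """Close the shared connection if it is open."""
    global _conn
    if _conn is not None and not _conn.closed:
        _conn.close()
```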

  5. 📦 requirements.txt

Add these libraries:

```text
requests==2.31.0
beautifulsoup4==4.12.2
pandas==2.1.2
httpx==0.23.2
psycopg2-binary==2.9.9
python-crontab==3.0.0
python-dotenv==1.0.0
```

🏗️ Setting Up Logging and ScrapeOps

Before scraping, let’s set up logging, our search keywords, and dynamic request headers:

```python
import os, logging, requests
from random import randint

SCRAPEOPS_API_KEY = os.getenv('SCRAPEOPS_API_KEY')
dir_path = os.path.dirname(os.path.realpath(__file__))
file_name = os.path.join(dir_path, 'test_log.log')

# Logger config
logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)
file_handler = logging.FileHandler(file_name)
file_handler.setFormatter(logging.Formatter('%(asctime)s - %(levelname)s - %(message)s'))
logger.addHandler(file_handler)

keyword_list = ['software engineer']
location_list = ['California']

def do_logging():
    logger.info("log event")

def get_headers_list():
    response = requests.get(os.getenv('SCRAPEOPS_HEADER_URL') + SCRAPEOPS_API_KEY)
    return response.json().get('result', [])

def get_random_header(header_list):
    return header_list[randint(0, len(header_list) - 1)]
```

Now that our environment is configured and our logging and headers are in place, we move into the core logic of scraping job listings from Indeed. In this part of the script, we make HTTP requests to fetch search results for specific job titles and locations, extract job IDs from the HTML, and then use those IDs to fetch full job details. Let's break it down step by step.
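
The pagination loop below relies on two URL helpers, get_indeed_search_url() and get_scrapeops_url(), whose definitions aren't shown above. A plausible sketch, assuming the JOBS_URL and SCRAPEOPS_PROXY_URL values from .env:

```python
# Sketch of the URL builders used by the pagination loop (assumed implementations).
import os
from urllib.parse import urlencode

def get_indeed_search_url(keyword, location, offset=0):
    # e.g. https://www.indeed.com/jobs?q=software+engineer&l=California&start=10
    params = {"q": keyword, "l": location, "start": offset}
    return os.getenv("JOBS_URL") + urlencode(params)

def get_scrapeops_url(url):
    # Wrap the target URL in the ScrapeOps proxy endpoint.
    params = {"api_key": os.getenv("SCRAPEOPS_API_KEY"), "url": url}
    return os.getenv("SCRAPEOPS_PROXY_URL") + urlencode(params)
```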

The nested pagination loop iterates over keywords, locations, and result-page offsets:

```python
header_list = get_headers_list()  # fetch a pool of browser headers once

for keyword in keyword_list:
    for location in location_list:
        for offset in range(0, 1010, 10):
            try:
                url = get_scrapeops_url(get_indeed_search_url(keyword, location, offset))
                response = requests.get(url, headers=get_random_header(header_list))
                ...
            except Exception as e:
                print("Error:", e)
```

📥 Storing Job Data

Insert results into PostgreSQL:

```python
# get_db_cursor(), get_db_connection(), and close_connection() come from index.py
cur = get_db_cursor()
conn = get_db_connection()

for job_desc in full_job_data_list:
    cur.execute(
        "INSERT INTO master (job_title, company_name, job_description) VALUES (%s, %s, %s)",
        job_desc,
    )
    conn.commit()

close_connection()
```

📤 Export to CSV

```python
import pandas as pd

df = pd.DataFrame(full_job_data_list)
df.to_csv(os.getenv('CSV_LOCATION'), index=False)
```

✉️ Send Email Notifications

```python
import smtplib, ssl

def send_email(message):
    port = 465  # SSL port for Gmail's SMTP server
    context = ssl.create_default_context()
    with smtplib.SMTP_SSL("smtp.gmail.com", port, context=context) as server:
        # In practice, read these credentials from .env (Gmail requires an app password).
        server.login("sender@gmail.com", "your_password")
        server.sendmail("sender@gmail.com", "receiver@gmail.com", message)
```

Trigger email if data found:

```python
import json

if len(full_job_data_list) > 0:
    message = "Subject: Jobs Found\n\n"
    for job in full_job_data_list:
        message += f"{json.dumps(job)}\n\n"
    send_email(message)
```

📜 Understanding Log Files

Log entries look like this:

```text
2023-12-30 23:12:01,782 - INFO - log event
```

This helps you:

  • Monitor scraper runs

  • Debug failures

  • Audit scheduled tasks

All logs are stored in test_log.log.
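
requirements.txt also lists python-crontab, which is a convenient way to run the scraper on a schedule (the scheduled tasks those log entries help you audit). A minimal sketch, assuming the script lives at /app/main.py:

```python
# Schedule the scraper daily at 09:00 with python-crontab (paths are examples).
from crontab import CronTab

cron = CronTab(user=True)  # current user's crontab
job = cron.new(
    command="python3 /app/main.py >> /app/cron.log 2>&1",
    comment="indeed-job-scraper",
)
job.setall("0 9 * * *")    # standard cron expression: every day at 09:00
cron.write()
```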

🐳 Dockerizing the Scraper

Use Docker for a portable and production-ready environment:

Dockerfile

```dockerfile
FROM python:3

ENV DB_HOST='localhost'
ENV DB_NAME='scaper_db'
ENV DB_PORT=5432
ENV DB_USER_NAME='postgres'
ENV DB_USER_PASSWORD='postgres'
ENV SCRAPEOPS_API_KEY='your_api_key_here'
ENV SCRAPEOPS_HEADER_URL='http://headers.scrapeops.io/v1/browser-headers?api_key='
ENV SCRAPEOPS_PROXY_URL='https://proxy.scrapeops.io/v1/?'
ENV JOBS_URL='https://www.indeed.com/jobs?'
ENV JOBS_ID_URL='https://www.indeed.com/viewjob?jk='
ENV CSV_LOCATION='data/export_dataframe.csv'

WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY main.py index.py send_email.py ./
COPY data/ ./data/

CMD ["python3", "./main.py"]
```

Run It:

```bash
docker build -t job-scraper .
docker run --rm job-scraper
```
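
Note: the Dockerfile above bakes credentials into the image as ENV defaults, which is fine for local experiments but risky for anything you share or deploy. A safer option is to leave the secrets out of the image and supply them at runtime, for example with `docker run --rm --env-file .env job-scraper`.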

Top Challenges Faced During Web Scraping

  • 🛑 Blocking by Anti-Bot Systems

    • Websites detect and block scraping bots based on IP, headers, or behavior.

    • Results in 403 Forbidden errors or being redirected to login/CAPTCHA pages.

  • ⏱️ Rate-Limiting

    • Servers limit how frequently requests can be made from a single IP.

    • Too many rapid requests trigger temporary bans or throttling.

  • 🔄 Website Structure Changes

    • HTML layout, tag names, or class names may change anytime.

    • Hardcoded selectors break, returning empty or incorrect data.

  • ⚙️ JavaScript-Rendered Content

    • Data isn’t present in the initial HTML; loaded dynamically with JavaScript.

    • Requires browser automation tools such as Selenium, Playwright, or Puppeteer.

  • 🔐 Authentication & CAPTCHA Challenges

    • Some data is only available after login or session verification.

    • CAPTCHA blocks prevent automated access unless solved manually or via OCR.

  • 🌍 Geo-Restrictions & IP Filtering

    • Content availability depends on your IP location.

    • Sites may block or alter content for foreign or suspicious IPs.

  • 🕵️ Anti-Scraping Tactics

    • Includes honeypot links, JavaScript obfuscation, tokenized sessions.

    • Designed to confuse or trap scrapers without affecting normal users.

  • ⚖️ Legal & Ethical Limitations

    • Many websites prohibit scraping in their Terms of Service.

    • Ignoring robots.txt or republishing data can raise legal risks.

  • 🧹 Incomplete, Noisy, or Dirty Data

    • Missing values, duplicates, or inconsistently formatted content.

    • Often requires heavy post-processing and cleaning.

My Web Scraping Project Is Available on GitHub

You can find the full code for this project and more details on my GitHub repository: https://github.com/saurabh-369/indeed_web_scapper/tree/main

✅ Final Thoughts

With the right tools and strategy, web scraping in Python becomes incredibly powerful. We covered:

  • Setting up a scraper for Indeed

  • Logging, data insertion, and email alerts

  • Using ScrapeOps to bypass anti-scraping protections

  • Dockerizing the whole project

Happy scraping! 🚀

