Ultimate Guide to Web Scraping with Python: Extract Indeed Job Listings

Web scraping is a powerful technique for automatically extracting data from websites. Whether you're building a job aggregator, a price comparison engine, or just exploring data, Python makes scraping relatively simple. In this blog, we'll walk through building a real-world job scraper using Python, PostgreSQL, ScrapeOps, and Docker, all tied together in a production-grade setup.
🤔 What is Web Scraping?
Web scraping is the process of programmatically fetching data from websites. You send HTTP requests to a webpage, parse the HTML, and extract specific pieces of information like product prices, job listings, or news headlines.
🔍 Real-World Use Cases:
Price comparison from e-commerce platforms
Job aggregators pulling listings from multiple job boards
Market research using user reviews or social posts
News aggregation from various media sites
⚙️ How Web Scraping Works (In 5 Steps)
Send HTTP Request to the target webpage
Download HTML Content
Parse the HTML using libraries like BeautifulSoup or lxml
Extract Data using tags, classes, or DOM structure
Store Data into CSV, JSON, or a database like PostgreSQL
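To make the five steps concrete, here is a minimal, self-contained example that fetches a page, parses it, extracts headings, and writes them to CSV. The URL and the selector are placeholders for illustration, not part of the Indeed scraper we build below.
```python
import csv
import requests
from bs4 import BeautifulSoup

# 1-2. Send an HTTP request and download the HTML (example.com is just a placeholder)
response = requests.get("https://example.com", timeout=10)
response.raise_for_status()

# 3. Parse the HTML
soup = BeautifulSoup(response.text, "html.parser")

# 4. Extract data using tags, classes, or DOM structure (the tag here is illustrative)
headings = [h.get_text(strip=True) for h in soup.find_all("h1")]

# 5. Store the data (CSV here; a database works the same way conceptually)
with open("headings.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["heading"])
    writer.writerows([[h] for h in headings])
```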
🐍 Scraping Job Listings with Python
We'll scrape job listings from Indeed, store them in a PostgreSQL database, export to CSV, and send results via email.
🧰 Prerequisites
✅ Python Environment
Ensure Python 3.7+ is installed. You can use a virtual environment:
```bash
python -m venv venv
source venv/bin/activate   # Windows: venv\Scripts\activate
```
✅ PostgreSQL Setup
Create a PostgreSQL database and table:
```sql
CREATE TABLE master (
    id SERIAL PRIMARY KEY,
    job_title TEXT,
    company_name TEXT,
    job_description TEXT
);
```
✅ Environment Variables (.env)
Create a .env file with your sensitive config:
```env
DB_HOST=localhost
DB_PORT=5432
DB_NAME=your_database_name
DB_USER_NAME=your_db_user
DB_USER_PASSWORD=your_db_password
SCRAPEOPS_API_KEY=your_scrapeops_api_key
SCRAPEOPS_HEADER_URL=https://headers.scrapeops.io/v1/browser-headers?api_key=
SCRAPEOPS_PROXY_URL=https://proxy.scrapeops.io/v1/?
JOBS_URL=https://www.indeed.com/jobs?
JOBS_ID_URL=https://www.indeed.com/viewjob?jk=
CSV_LOCATION=job_data.csv
```
✅ Create Utility Modules
You need two utility files to support the main script:
index.py: handles the PostgreSQL connection and exposes the get_db_connection, get_db_cursor, and close_connection helpers used later in the script.
send_email.py: a utility to send email alerts (e.g., via SMTP or an API like SendGrid). Make sure this exists and works.
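The article doesn't list index.py itself (the real version is in the GitHub repo linked at the end), but based on how it is used later, a minimal psycopg2-based sketch could look like this. Treat it as an assumption, not the author's exact code.
```python
# index.py -- minimal sketch (assumed implementation; see the GitHub repo for the real one)
import os
import psycopg2
from dotenv import load_dotenv

load_dotenv()

_connection = None

def get_db_connection():
    """Create (or reuse) a PostgreSQL connection built from the .env settings."""
    global _connection
    if _connection is None or _connection.closed:
        _connection = psycopg2.connect(
            host=os.getenv('DB_HOST'),
            port=os.getenv('DB_PORT'),
            dbname=os.getenv('DB_NAME'),
            user=os.getenv('DB_USER_NAME'),
            password=os.getenv('DB_USER_PASSWORD'),
        )
    return _connection

def get_db_cursor():
    """Return a cursor on the shared connection."""
    return get_db_connection().cursor()

def close_connection():
    """Close the shared connection if it is open."""
    global _connection
    if _connection is not None and not _connection.closed:
        _connection.close()
```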
Load it in Python using dotenv:
```python
from dotenv import load_dotenv

load_dotenv()
```
📦 requirements.txt
Add these libraries:
```text
requests==2.31.0
beautifulsoup4==4.12.2
pandas==2.1.2
httpx==0.23.2
psycopg2-binary==2.9.9
python-crontab==3.0.0
python-dotenv==1.0.0
```
🏗️ Setting Up Logging and ScrapeOps
Before scraping, let’s set up logging, keyword search, and dynamic headers:
```python
import os
import logging
import requests
from random import randint

SCRAPEOPS_API_KEY = os.getenv('SCRAPEOPS_API_KEY')

# Write the log file next to the script
dir_path = os.path.dirname(os.path.realpath(__file__))
file_name = os.path.join(dir_path, 'test_log.log')

# Logger config
logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)
file_handler = logging.FileHandler(file_name)
file_handler.setFormatter(logging.Formatter('%(asctime)s - %(levelname)s - %(message)s'))
logger.addHandler(file_handler)

# Search terms for the Indeed queries
keyword_list = ['software engineer']
location_list = ['California']

def do_logging():
    logger.info("log event")

def get_headers_list():
    # Fetch a list of realistic browser headers from the ScrapeOps headers API
    response = requests.get(os.getenv('SCRAPEOPS_HEADER_URL') + SCRAPEOPS_API_KEY)
    return response.json().get('result', [])

def get_random_header(header_list):
    # Pick one header set at random so our requests don't all look identical
    return header_list[randint(0, len(header_list) - 1)]
```
Now that our environment is configured and our logging and headers are in place, we move into the core logic of scraping job listings from Indeed. In this part of the script, we make HTTP requests to fetch search results for specific job titles and locations, extract job IDs from the HTML, and then use those IDs to fetch full job details. Let's break it down step by step.
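The loop below relies on two helpers, get_indeed_search_url and get_scrapeops_url, which build the Indeed search URL and route it through the ScrapeOps proxy. They aren't shown in this post, so here is a hedged sketch based on the environment variables defined earlier; the exact query parameters are assumptions.
```python
import os
from urllib.parse import urlencode

def get_indeed_search_url(keyword, location, offset=0):
    # Build an Indeed search URL from JOBS_URL (parameter names are assumptions)
    params = {'q': keyword, 'l': location, 'start': offset}
    return os.getenv('JOBS_URL') + urlencode(params)

def get_scrapeops_url(url):
    # Wrap the target URL in the ScrapeOps proxy endpoint so our own IP isn't blocked
    params = {'api_key': os.getenv('SCRAPEOPS_API_KEY'), 'url': url}
    return os.getenv('SCRAPEOPS_PROXY_URL') + urlencode(params)
```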
Nested pagination:
```python
# Rotate through the browser headers fetched from ScrapeOps
header_list = get_headers_list()

for keyword in keyword_list:
    for location in location_list:
        # Indeed paginates search results 10 at a time via the 'start' offset
        for offset in range(0, 1010, 10):
            try:
                url = get_scrapeops_url(get_indeed_search_url(keyword, location, offset))
                response = requests.get(url, headers=get_random_header(header_list))
                ...
            except Exception as e:
                print("Error:", e)
```
📥 Storing Job Data
Insert results into PostgreSQL:
```python
cur = get_db_cursor()
conn = get_db_connection()
# Each job_desc is a (job_title, company_name, job_description) tuple
for job_desc in full_job_data_list:
    cur.execute("INSERT INTO master(job_title, company_name, job_description) VALUES (%s, %s, %s)", job_desc)
conn.commit()
close_connection()
```
📤 Export to CSV
```python
import pandas as pd  # pandas is in requirements.txt but wasn't imported above

df = pd.DataFrame(full_job_data_list)
df.to_csv(os.getenv('CSV_LOCATION'), index=False)
```
✉️ Send Email Notifications
```python
import smtplib
import ssl

def send_email(message):
    port = 465  # implicit-SSL port for Gmail's SMTP server
    context = ssl.create_default_context()
    with smtplib.SMTP_SSL("smtp.gmail.com", port, context=context) as server:
        # With Gmail, use an app password here rather than your account password
        server.login("sender@gmail.com", "your_password")
        server.sendmail("sender@gmail.com", "receiver@gmail.com", message)
```
Trigger email if data found:
```python
import json

if len(full_job_data_list) > 0:
    message = "Subject: Jobs Found\n\n"
    for job in full_job_data_list:
        message += f"{json.dumps(job)}\n\n"
    send_email(message)
```
📜 Understanding Log Files
Log entries look like this:
```text
2023-12-30 23:12:01,782 - INFO - log event
```
This helps you:
Monitor scraper runs
Debug failures
Audit scheduled tasks
All logs are stored in test_log.log.
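The requirements file also includes python-crontab, which is one way to schedule the scraper as a recurring job so these log entries accumulate automatically. The post doesn't show the scheduling code, so the snippet below is only a hedged sketch: the schedule, script path, and cron comment are assumptions.
```python
from crontab import CronTab

# Manage the current user's crontab
cron = CronTab(user=True)

# Run the scraper every day at 06:00 (path and comment are placeholders)
job = cron.new(command='python3 /app/main.py', comment='indeed-job-scraper')
job.setall('0 6 * * *')

cron.write()
```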
🐳 Dockerizing the Scraper
Use Docker for a portable and production-ready environment:
Dockerfile
```dockerfile
FROM python:3

# Default configuration (override these at runtime rather than baking real secrets into the image)
ENV DB_HOST='localhost'
ENV DB_NAME='scraper_db'
ENV DB_PORT=5432
ENV DB_USER_NAME='postgres'
ENV DB_USER_PASSWORD='postgres'
ENV SCRAPEOPS_API_KEY='your_api_key_here'
ENV SCRAPEOPS_HEADER_URL='https://headers.scrapeops.io/v1/browser-headers?api_key='
ENV SCRAPEOPS_PROXY_URL='https://proxy.scrapeops.io/v1/?'
ENV JOBS_URL='https://www.indeed.com/jobs?'
ENV JOBS_ID_URL='https://www.indeed.com/viewjob?jk='
ENV CSV_LOCATION='data/export_dataframe.csv'

WORKDIR /app

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY main.py index.py send_email.py ./
COPY data/ ./data/

CMD ["python3", "./main.py"]
```
Run It:
```bash
docker build -t job-scraper .
docker run --rm job-scraper
```
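Because the Dockerfile only bakes in placeholder values, you can pass your real configuration at run time instead, for example by reusing the .env file from earlier. This keeps credentials out of the image; the flags below are standard docker run options.
```bash
# Override the baked-in defaults with your real .env at run time
docker run --rm --env-file .env job-scraper
```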
Top Challenges Faced During Web Scraping
🛑 Blocking by Anti-Bot Systems
Websites detect and block scraping bots based on IP, headers, or behavior.
Results in 403 Forbidden errors or being redirected to login/CAPTCHA pages.
⏱️ Rate-Limiting
Servers limit how frequently requests can be made from a single IP.
Too many rapid requests trigger temporary bans or throttling (a simple mitigation is sketched after this list).
🔄 Website Structure Changes
HTML layout, tag names, or class names may change anytime.
Hardcoded selectors break, returning empty or incorrect data.
⚙️ JavaScript-Rendered Content
Data isn’t present in the initial HTML; loaded dynamically with JavaScript.
Requires headless browser automation such as Selenium, Playwright, or Puppeteer.
🔐 Authentication & CAPTCHA Challenges
Some data is only available after login or session verification.
CAPTCHA blocks prevent automated access unless solved manually or via OCR.
🌍 Geo-Restrictions & IP Filtering
Content availability depends on your IP location.
Sites may block or alter content for foreign or suspicious IPs.
🕵️ Anti-Scraping Tactics
Includes honeypot links, JavaScript obfuscation, tokenized sessions.
Designed to confuse or trap scrapers without affecting normal users.
⚖️ Legal & Ethical Limitations
Many websites prohibit scraping in their Terms of Service.
Ignoring robots.txt or republishing data can raise legal risks.
🧹 Incomplete, Noisy, or Dirty Data
Missing values, duplicates, or inconsistently formatted content.
Often requires heavy post-processing and cleaning.
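As a small illustration of the rate-limiting point above, the sketch below wraps each request in a randomized delay and a simple retry with exponential backoff. It is not part of the scraper built in this post, just one common way to be polite to the target server.
```python
import time
import random
import requests

def polite_get(url, headers=None, max_retries=3):
    """GET with a random delay between calls and a simple backoff on failure."""
    for attempt in range(max_retries):
        time.sleep(random.uniform(1.0, 3.0))  # space out requests
        try:
            response = requests.get(url, headers=headers, timeout=15)
            if response.status_code == 200:
                return response
        except requests.RequestException:
            pass
        # Back off progressively before retrying (2s, 4s, 8s, ...)
        time.sleep(2 ** (attempt + 1))
    return None
```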
My Work for Web Scraping is Available on GitHub
You can find the full code for this project and more details on my GitHub repository: https://github.com/saurabh-369/indeed_web_scapper/tree/main
✅ Final Thoughts
With the right tools and strategy, web scraping in Python becomes incredibly powerful. We covered:
Setting up a scraper for Indeed
Logging, data insertion, and email alerts
Using ScrapeOps to bypass anti-scraping protections
Dockerizing the whole project
Happy scraping! 🚀
Written by Saurabh Verma
I’m a seasoned IT professional with over 12 years of experience, previously with industry leaders like Deloitte and Accenture. My journey has been driven by a deep passion for software development and strong, adaptive leadership. I bring expertise across the full technology stack—both frontend and backend—and thrive in areas like client engagement, talent acquisition, proposal management, and driving innovation. I’m always open to collaboration, exploring new tech trends, or discussing impactful opportunities. Let’s connect and turn forward-thinking ideas into real-world solutions that make a difference.