Scrapy with Sample Output
Table of contents
- 1. Introduction to Scrapy
- 2. Installing Scrapy
- 3. Creating a Scrapy Project
- 4. Writing and Running Spiders
- 5. Data Extraction with Scrapy
- 6. Handling Pagination in Scrapy
- 7. Storing and Exporting Data in Scrapy
- 8. Working with APIs
- 9. Scrapy Shell
- 10. Advanced Techniques for Job Scraping
- 11. Managing Scrapy Settings
- 12. Error Handling and Debugging in Scrapy
- 13. Scrapy Item Pipelines
- 14. Deploying Scrapy Projects
- 15. Tips and Tricks for Efficient Scraping
- 16. Building a Job Scraper for AIJobs.net, Indeed, and LinkedIn
- 1. Analyzing the Structure of Job Websites
- 2. Writing Spiders for AIJobs.net
- 3. Writing Spiders for Indeed
- 4. Writing Spiders for LinkedIn
- 5. Key Takeaways
- Focus Areas for Your Use Case
- 1. CSS vs. XPath Selectors
- 2. Pagination
- 3. APIs and Dynamic Content
- 4. Data Storage
- 5. Deployment
- Key Steps for Your Use Case
1. Introduction to Scrapy
What is Scrapy?
Scrapy is an open-source web scraping framework for Python. It is designed for fast and efficient extraction of data from websites, handling large-scale scraping projects with ease.
Features and Advantages
Fast and Efficient:
- Built-in support for asynchronous requests, making it faster than most scraping libraries.
Structured Data Handling:
- Uses Items and Pipelines to process and store data efficiently.
Flexible and Extensible:
- Customize behavior with middlewares and settings.
Built-in Support for Pagination:
- Handles multi-page scraping and recursive crawling.
Export Options:
- Save data in formats like JSON, CSV, XML, or directly into databases.
Use Cases for Scrapy in Job Scraping
Extracting job titles, descriptions, and companies from websites like Indeed, LinkedIn, and AIJobs.net.
Automating data collection for job analysis and trend insights.
Collecting job board data for market research or career planning.
Scrapy Workflow
Create a Project:
- Initialize a Scrapy project with the necessary files and folders.
Write a Spider:
- Define the logic for scraping, including target URLs and data extraction.
Run the Spider:
- Execute the spider to collect data.
Store the Data:
- Save extracted data into structured formats like JSON or CSV.
Example: Setting Up and Running Scrapy
1. Install Scrapy
Install Scrapy using pip:
pip install scrapy
Verify the installation:
scrapy --version
2. Create a Scrapy Project
Run the following command to create a Scrapy project:
scrapy startproject job_scraper
Directory Structure:
job_scraper/
scrapy.cfg
job_scraper/
__init__.py
items.py
middlewares.py
pipelines.py
settings.py
spiders/
__init__.py
3. Create a Spider
Navigate to the spiders folder and create a new spider file, e.g., job_spider.py.
Code Example: Basic Spider
import scrapy
class JobSpider(scrapy.Spider):
name = "job_spider"
start_urls = ["https://example.com/jobs"]
def parse(self, response):
# Extract job titles
job_titles = response.css('h2.job-title::text').getall()
for title in job_titles:
yield {"Job Title": title}
4. Run the Spider
Navigate to the project directory and run the spider:
scrapy crawl job_spider -o jobs.json
Sample Output (jobs.json):
[
{"Job Title": "Data Scientist"},
{"Job Title": "Machine Learning Engineer"},
{"Job Title": "AI Researcher"}
]
5. Features Demonstrated
Asynchronous Requests: Multiple pages are crawled simultaneously.
CSS Selectors: Used to extract job titles from the page.
Data Export: Saved in JSON format for analysis.
2. Installing Scrapy
1. Scrapy Installation on macOS
Step 1: Install Prerequisites
Scrapy requires Python and some system dependencies.
Install Homebrew (if not already installed):
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
Install Python:
brew install python
Install libxml2 and libxslt:
brew install libxml2 libxslt
Ensure pip is up to date:
python3 -m pip install --upgrade pip
Step 2: Install Scrapy
Install Scrapy using pip:
pip install scrapy
If you want to install it in a virtual environment (recommended), create and activate a virtual environment first:
python3 -m venv scrapy_env
source scrapy_env/bin/activate
pip install scrapy
2. Verifying the Installation
Run the following command to verify that Scrapy is installed successfully:
scrapy --version
Expected Output:
Scrapy 2.x.x
3. Setting Up the Environment
Step 1: Create a New Scrapy Project
Navigate to the directory where you want to create your project:
cd ~/projects
Create a new Scrapy project:
scrapy startproject job_scraper
Navigate to the newly created project:
cd job_scraper
Directory Structure:
job_scraper/
scrapy.cfg
job_scraper/
__init__.py
items.py
middlewares.py
pipelines.py
settings.py
spiders/
__init__.py
Step 2: Install Additional Dependencies
Install any additional dependencies your project might need, such as:
playwright for JavaScript-rendered content:
pip install scrapy-playwright
playwright install
pandas for data processing:
pip install pandas
4. Testing the Installation
Run the default Scrapy command to test the environment:
scrapy list
Expected Output: no spiders are listed yet (the command returns an empty list).
This indicates Scrapy is installed and the project is set up correctly. You can now create your first spider.
Quick Troubleshooting
Issue: Command scrapy not found.
Ensure Scrapy is installed in the correct Python environment.
Activate the virtual environment if used:
source scrapy_env/bin/activate
Issue: Missing dependencies (libxml2, libxslt).
Install using Homebrew:
brew install libxml2 libxslt
3. Creating a Scrapy Project
1. Directory Structure of a Scrapy Project
After creating a Scrapy project, it follows this structure:
job_scraper/
scrapy.cfg # Project configuration file
job_scraper/ # Main module for your Scrapy project
__init__.py
items.py # Define data models for scraped items
middlewares.py # Middleware for processing requests/responses
pipelines.py # Define how to process/store scraped data
settings.py # Project-wide settings
spiders/ # Folder containing spider definitions
__init__.py
2. Key Components of a Scrapy Project
Spiders
The spiders define the logic for crawling websites and extracting data.
They specify starting URLs, how to parse responses, and how to follow links.
Items
Items are Python classes that structure the scraped data.
They define fields such as job title, company, and location.
Pipelines
Pipelines process scraped data before storage.
Examples: Cleaning data, saving to a database, or exporting to files.
Middlewares
Middlewares process requests and responses.
Examples: Rotating proxies, adding custom headers.
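As an illustration (a sketch, not part of the generated project), a minimal downloader middleware that adds a custom header to every outgoing request could look like this; the class name and header value are hypothetical:
Code Example: Minimal Downloader Middleware (Sketch)
class CustomHeaderMiddleware:
    def process_request(self, request, spider):
        # Add a header to every outgoing request before it is downloaded
        request.headers.setdefault('X-Requested-With', 'job_scraper')
        return None  # Returning None lets Scrapy continue processing the request
To activate it, add the class to DOWNLOADER_MIDDLEWARES in settings.py.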
Settings
Settings configure the Scrapy project (e.g., download delays, concurrent requests).
Custom settings can be applied for each spider.
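For example, per-spider overrides can be declared with the custom_settings class attribute; a minimal sketch with illustrative values:
Code Example: Per-Spider custom_settings (Sketch)
import scrapy

class SlowJobSpider(scrapy.Spider):
    name = "slow_job_spider"
    # These settings override settings.py for this spider only
    custom_settings = {
        "DOWNLOAD_DELAY": 3,
        "CONCURRENT_REQUESTS": 4,
    }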
3. Creating Your First Scrapy Project
Step 1: Create a New Project
Run the following command to create a Scrapy project:
scrapy startproject job_scraper
Navigate to the project directory:
cd job_scraper
Step 2: Define an Item
Open items.py and define fields for the job data you want to scrape:
import scrapy
class JobScraperItem(scrapy.Item):
title = scrapy.Field()
company = scrapy.Field()
location = scrapy.Field()
link = scrapy.Field()
Step 3: Create a Spider
Inside the spiders/ directory, create a file named job_spider.py:
import scrapy
from job_scraper.items import JobScraperItem
class JobSpider(scrapy.Spider):
name = "job_spider"
start_urls = ["https://example.com/jobs"]
def parse(self, response):
# Loop through job cards
for job in response.css('div.job-card'):
item = JobScraperItem()
item['title'] = job.css('h2.job-title::text').get()
item['company'] = job.css('span.company-name::text').get()
item['location'] = job.css('span.job-location::text').get()
item['link'] = response.urljoin(job.css('a::attr(href)').get())
yield item
Step 4: Configure Settings
Open settings.py and configure as needed:
Enable JSON or CSV export:
FEEDS = { 'jobs.json': {'format': 'json'}, }
Set a download delay to prevent bans:
DOWNLOAD_DELAY = 2
Step 5: Run the Spider
Run your spider:
scrapy crawl job_spider
Output will be saved to jobs.json as configured in settings.py.
Sample Output (jobs.json):
[
{"title": "Data Scientist", "company": "TechCorp", "location": "San Francisco, CA", "link": "https://example.com/jobs/123"},
{"title": "Machine Learning Engineer", "company": "AI Corp", "location": "New York, NY", "link": "https://example.com/jobs/456"}
]
4. Running Your Project and Verifying Outputs
Check the Data
Ensure the scraped data matches the structure defined in items.py.
Debugging
Use the Scrapy Shell to debug selectors:
scrapy shell "https://example.com/jobs"
Example command in shell:
response.css('h2.job-title::text').getall()
4. Writing and Running Spiders
1. Creating a Spider
What is a Spider?
A spider in Scrapy is a class that defines how a website will be scraped:
The URLs to scrape.
How to extract data.
How to follow links.
Code Example: Creating a Spider
Navigate to the spiders/ directory and create a file, e.g., job_spider.py:
import scrapy
class JobSpider(scrapy.Spider):
name = "job_spider" # Unique name for the spider
start_urls = ["https://example.com/jobs"] # Initial URLs to scrape
def parse(self, response):
pass # Logic for extracting data goes here
2. Defining start_urls
Purpose of start_urls:
- It lists the initial URLs the spider will visit.
You can add multiple URLs:
start_urls = [
"https://example.com/jobs",
"https://example.com/remote-jobs",
]
3. Sending Requests and Receiving Responses
Scrapy automatically sends GET requests to start_urls and passes the response to the parse method.
Code Example: Extracting Response Status
def parse(self, response):
print(f"Visited {response.url}, Status: {response.status}")
Sample Output:
Visited https://example.com/jobs, Status: 200
4. Extracting Data from Job Listings
Using CSS Selectors
CSS selectors allow you to extract elements based on tags, classes, or IDs.
Example: Extracting Job Titles
def parse(self, response):
job_titles = response.css('h2.job-title::text').getall() # Get all job titles
for title in job_titles:
yield {"Job Title": title}
Using XPath
XPath is a more powerful alternative for selecting elements.
Example: Extracting Job Titles with XPath
def parse(self, response):
job_titles = response.xpath('//h2[@class="job-title"]/text()').getall()
for title in job_titles:
yield {"Job Title": title}
Extracting Multiple Fields
You can scrape job titles, companies, locations, and links in one loop.
Code Example:
def parse(self, response):
for job in response.css('div.job-card'):
yield {
"Job Title": job.css('h2.job-title::text').get(),
"Company": job.css('span.company-name::text').get(),
"Location": job.css('span.job-location::text').get(),
"Link": response.urljoin(job.css('a::attr(href)').get()),
}
Sample Output:
[
{"Job Title": "Data Scientist", "Company": "TechCorp", "Location": "San Francisco, CA", "Link": "https://example.com/job/123"},
{"Job Title": "AI Engineer", "Company": "AI Corp", "Location": "New York, NY", "Link": "https://example.com/job/456"}
]
5. Running Spiders Using scrapy crawl
Step 1: Run the Spider
Run the following command to execute the spider:
scrapy crawl job_spider
Step 2: Export Data to a File
Export data directly to JSON, CSV, or XML:
scrapy crawl job_spider -o jobs.json # Export to JSON
scrapy crawl job_spider -o jobs.csv # Export to CSV
Step 3: Debugging in Scrapy Shell
Use the Scrapy Shell to debug selectors:
scrapy shell "https://example.com/jobs"
Example commands in shell:
response.css('h2.job-title::text').getall() # Get all job titles
response.xpath('//h2[@class="job-title"]/text()').getall() # Using XPath
Key Takeaways
Spider Basics:
- Name your spider uniquely.
- Use start_urls for the initial requests.
Data Extraction:
- Use CSS and XPath selectors to extract structured data.
Running and Exporting:
- Use scrapy crawl to run spiders and export data in desired formats.
5. Data Extraction with Scrapy
1. CSS Selectors vs. XPath
How to Choose Between CSS and XPath
CSS Selectors:
Simple and readable syntax.
Ideal for extracting data based on tag names, classes, or IDs.
Supported in most browsers for quick testing.
XPath:
More powerful and flexible.
Ideal for selecting elements based on advanced conditions (e.g., siblings, specific attributes).
Useful when elements are deeply nested or don’t have unique classes/IDs.
Combining CSS and XPath
You can mix both in a spider for robust data extraction.
Code Example: Combining CSS and XPath
def parse(self, response):
job_cards = response.css('div.job-card') # CSS Selector for job cards
for job in job_cards:
title = job.css('h2.job-title::text').get() # CSS for title
company = job.xpath('.//span[@class="company-name"]/text()').get() # XPath for company name
yield {"Title": title, "Company": company}
2. CSS Selectors
Selecting Elements
By Tag Name: Select elements by their HTML tag:
h2
By Class: Select elements with a specific class:
.job-title
By ID: Select elements with a specific ID:
#main-header
By Attribute: Select elements with a specific attribute:
[type="text"]
Extracting Text
Use ::text to extract the text content of an element:
h2.job-title::text
Extracting Attributes
Use ::attr(attribute) to extract the value of an attribute:
a::attr(href)
Code Example: CSS Selectors
def parse(self, response):
for job in response.css('div.job-card'):
yield {
"Title": job.css('h2.job-title::text').get(),
"Company": job.css('span.company-name::text').get(),
"Location": job.css('span.job-location::text').get(),
"Link": response.urljoin(job.css('a::attr(href)').get()),
}
3. XPath Selectors
Basic XPath Syntax
By Tag Name: Select all <h2> elements:
//h2
By Class or Attribute: Select elements with a specific class or attribute:
//div[@class="job-card"]
Text Content: Select elements based on their text:
//h2[text()="Data Scientist"]
Partial Match: Use contains() for partial matches:
//h2[contains(@class, "title")]
Extracting Text
Use text() to extract the text content of an element:
//h2[@class="job-title"]/text()
Extracting Attributes
Use @attribute to extract the value of an attribute:
//a[@class="apply-link"]/@href
Code Example: XPath Selectors
def parse(self, response):
for job in response.xpath('//div[@class="job-card"]'):
yield {
"Title": job.xpath('.//h2[@class="job-title"]/text()').get(),
"Company": job.xpath('.//span[@class="company-name"]/text()').get(),
"Location": job.xpath('.//span[@class="job-location"]/text()').get(),
"Link": response.urljoin(job.xpath('.//a[@class="apply-link"]/@href').get()),
}
4. Using text() and @attribute in XPath
Extracting Direct Text
Extract text directly inside a tag:
//h2[@class="job-title"]/text()
Extracting Attribute Values
Extract attributes like href:
//a[@class="apply-link"]/@href
Combining Conditions
Extract elements that satisfy multiple conditions:
//h2[@class="job-title" and contains(text(), "Engineer")]
5. Comparison of CSS and XPath
| Feature | CSS | XPath |
| --- | --- | --- |
| Ease of Use | Simple syntax | More complex but flexible |
| Partial Matches | Limited support | Fully supported with contains() |
| Hierarchy | Limited parent-child selection | Robust parent-child and sibling selection |
| Attributes | Simple with ::attr() | Flexible with @attribute |
| Testing | Easy to test in browser DevTools | Testable in DevTools (e.g., $x()) or dedicated XPath tools |
Key Takeaways
CSS Selectors:
- Best for straightforward structures.
- Use ::text and ::attr() for text and attribute extraction.
XPath:
- Best for complex conditions and nested elements.
- Use text() and @attribute for fine-grained control.
Combination:
- Use both CSS and XPath selectors in a spider for optimal results.
6. Handling Pagination in Scrapy
1. Identifying Pagination Patterns
Types of Pagination
Next Button:
- A link or button to the next page, e.g., <a href="/jobs?page=2">Next</a>.
Numbered Links:
- Links for specific pages, e.g., <a href="/jobs?page=1">1</a>.
How to Identify Pagination
Inspect the website's pagination element using browser DevTools.
Look for a href attribute that changes across pages.
Verify if URLs are relative or absolute.
2. Extracting href Attributes for the Next Page
Example: Extracting the Next Page Link
If the pagination uses a "Next" button:
def parse(self, response):
# Extract the next page link
next_page = response.css('a.next::attr(href)').get() # Update CSS selector as per website
if next_page:
yield response.follow(next_page, self.parse)
Using XPath:
next_page = response.xpath('//a[@class="next"]/@href').get()
3. Sending Requests to Subsequent Pages
Scrapy's response.follow() handles relative and absolute URLs seamlessly.
Example: Following Pagination Links
def parse(self, response):
# Extract job data
for job in response.css('div.job-card'):
yield {
"Title": job.css('h2.job-title::text').get(),
"Company": job.css('span.company-name::text').get(),
"Location": job.css('span.job-location::text').get(),
"Link": response.urljoin(job.css('a::attr(href)').get()),
}
# Follow the next page
next_page = response.css('a.next::attr(href)').get()
if next_page:
yield response.follow(next_page, self.parse)
4. Implementing Recursive Crawling
Using start_requests
You can override the start_requests method to handle initial requests or pagination logic.
Example: Crawling Multiple Pages from Start
def start_requests(self):
base_url = "https://example.com/jobs?page="
for page in range(1, 6): # Crawl pages 1 to 5
yield scrapy.Request(url=f"{base_url}{page}", callback=self.parse)
Using a Custom Callback Function
Custom callbacks help when the pagination logic is complex or requires separate parsing for links.
Example: Callback for Parsing Subsequent Pages
def parse(self, response):
# Extract job data
for job in response.css('div.job-card'):
yield {
"Title": job.css('h2.job-title::text').get(),
"Company": job.css('span.company-name::text').get(),
"Location": job.css('span.job-location::text').get(),
"Link": response.urljoin(job.css('a::attr(href)').get()),
}
# Follow pagination
next_page = response.css('a.next::attr(href)').get()
if next_page:
yield response.follow(next_page, callback=self.parse_next_page)
def parse_next_page(self, response):
# Extract data from the next page
for job in response.css('div.job-card'):
yield {
"Title": job.css('h2.job-title::text').get(),
"Company": job.css('span.company-name::text').get(),
"Location": job.css('span.job-location::text').get(),
"Link": response.urljoin(job.css('a::attr(href)').get()),
}
# Continue following pagination
next_page = response.css('a.next::attr(href)').get()
if next_page:
yield response.follow(next_page, callback=self.parse_next_page)
5. Handling Numbered Pagination Links
When a website lists page numbers, iterate over all available links.
Example: Extracting All Pagination Links
def parse(self, response):
# Extract job data
for job in response.css('div.job-card'):
yield {
"Title": job.css('h2.job-title::text').get(),
"Company": job.css('span.company-name::text').get(),
"Location": job.css('span.job-location::text').get(),
"Link": response.urljoin(job.css('a::attr(href)').get()),
}
# Follow numbered pagination links
pagination_links = response.css('a.page-link::attr(href)').getall()
for link in pagination_links:
yield response.follow(link, callback=self.parse)
6. Full Example: Handling Pagination
Spider Example: Scraping a Job Website
import scrapy
class JobSpider(scrapy.Spider):
name = "job_spider"
start_urls = ["https://example.com/jobs"]
def parse(self, response):
# Extract job data
for job in response.css('div.job-card'):
yield {
"Title": job.css('h2.job-title::text').get(),
"Company": job.css('span.company-name::text').get(),
"Location": job.css('span.job-location::text').get(),
"Link": response.urljoin(job.css('a::attr(href)').get()),
}
# Follow the next page
next_page = response.css('a.next::attr(href)').get()
if next_page:
yield response.follow(next_page, callback=self.parse)
Command to Run Spider:
scrapy crawl job_spider -o jobs.json
Sample Output (jobs.json):
[
{"Title": "Data Scientist", "Company": "TechCorp", "Location": "San Francisco, CA", "Link": "https://example.com/job/123"},
{"Title": "AI Engineer", "Company": "AI Corp", "Location": "New York, NY", "Link": "https://example.com/job/456"},
...
]
Key Takeaways
Identify Pagination Patterns:
- Inspect next buttons or numbered links using browser DevTools.
Follow Pagination Links:
- Use response.follow() for seamless crawling across pages.
Recursive Crawling:
- Implement callbacks to handle dynamic or complex pagination logic.
Use start_requests:
- For iterating over predefined URL patterns.
7. Storing and Exporting Data in Scrapy
1. Structuring Scraped Data Using Items
Why Use Items?
Provides a structured way to define fields for your scraped data.
Easier to maintain and process data consistently.
Defining Items
Open items.py in your Scrapy project and define fields for the data you want to scrape.
Code Example: Defining Job Fields
import scrapy
class JobScraperItem(scrapy.Item):
title = scrapy.Field()
company = scrapy.Field()
location = scrapy.Field()
link = scrapy.Field()
2. Using ItemLoaders for Data Pre-Processing
Why Use ItemLoaders?
Pre-process scraped data before storing it.
Clean and format data (e.g., stripping whitespace, converting cases).
Example: Using ItemLoaders
Modify your spider to use ItemLoaders:
Code Example: Applying Pre-Processing
import scrapy
from scrapy.loader import ItemLoader
from itemloaders.processors import MapCompose, TakeFirst
from job_scraper.items import JobScraperItem

class JobLoader(ItemLoader):
    default_input_processor = MapCompose(str.strip)  # Strip whitespace from every value
    default_output_processor = TakeFirst()           # Return a single value instead of a list

class JobSpider(scrapy.Spider):
    name = "job_spider"
    start_urls = ["https://example.com/jobs"]

    def parse(self, response):
        for job in response.css('div.job-card'):
            loader = JobLoader(item=JobScraperItem(), selector=job)
            loader.add_css('title', 'h2.job-title::text')
            loader.add_css('company', 'span.company-name::text')
            loader.add_css('location', 'span.job-location::text')
            loader.add_css('link', 'a::attr(href)')
            yield loader.load_item()
Output Example After Pre-Processing:
{
"title": "Data Scientist",
"company": "TechCorp",
"location": "San Francisco, CA",
"link": "https://example.com/job/123"
}
3. Exporting Data
Scrapy supports exporting data directly in formats like JSON, CSV, and XML.
Exporting to JSON
Run the spider and specify the output format:
scrapy crawl job_spider -o jobs.json
Sample Output (jobs.json):
[
{"title": "Data Scientist", "company": "TechCorp", "location": "San Francisco, CA", "link": "https://example.com/job/123"},
{"title": "Machine Learning Engineer", "company": "AI Corp", "location": "New York, NY", "link": "https://example.com/job/456"}
]
Exporting to CSV
Run the spider with CSV format:
scrapy crawl job_spider -o jobs.csv
Sample Output (jobs.csv):
title,company,location,link
Data Scientist,TechCorp,"San Francisco, CA",https://example.com/job/123
Machine Learning Engineer,AI Corp,"New York, NY",https://example.com/job/456
Exporting to Excel via Pipelines
You can save data in Excel format (.xlsx) by installing and using openpyxl.
Step 1: Install openpyxl
pip install openpyxl
Step 2: Modify pipelines.py
Update the pipeline to save data to Excel.
Code Example: Excel Pipeline
import openpyxl
class ExcelPipeline:
def open_spider(self, spider):
# Create a new workbook and sheet
self.workbook = openpyxl.Workbook()
self.sheet = self.workbook.active
self.sheet.title = "Jobs"
# Define headers
self.sheet.append(["Title", "Company", "Location", "Link"])
def process_item(self, item, spider):
# Append data row
self.sheet.append([item['title'], item['company'], item['location'], item['link']])
return item
def close_spider(self, spider):
# Save the workbook
self.workbook.save("jobs.xlsx")
Step 3: Enable the Pipeline in settings.py
ITEM_PIPELINES = {
'job_scraper.pipelines.ExcelPipeline': 300,
}
Run the Spider:
scrapy crawl job_spider
Output: jobs.xlsx with structured job data.
4. Appending Data to Existing Files
To append data without overwriting:
For JSON and CSV
- Use the FEEDS setting with append mode in settings.py:
FEEDS = {
'jobs.json': {
'format': 'json',
'overwrite': False,
'encoding': 'utf8',
},
'jobs.csv': {
'format': 'csv',
'overwrite': False,
'encoding': 'utf8',
},
}
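Note that appending to a plain .json feed across runs produces multiple concatenated JSON arrays rather than one valid document. A minimal alternative sketch using Scrapy's built-in jsonlines format (one JSON object per line, which appends cleanly); the filename is illustrative:
FEEDS = {
    'jobs.jsonl': {
        'format': 'jsonlines',  # One JSON object per line; safe to append across runs
        'overwrite': False,
        'encoding': 'utf8',
    },
}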
For Excel
Modify the pipeline to open and append to an existing file.
Code Example: Append Data to Excel
import openpyxl
class ExcelPipeline:
def open_spider(self, spider):
try:
# Try to load existing workbook
self.workbook = openpyxl.load_workbook("jobs.xlsx")
self.sheet = self.workbook.active
except FileNotFoundError:
# Create a new workbook if file doesn't exist
self.workbook = openpyxl.Workbook()
self.sheet = self.workbook.active
self.sheet.append(["Title", "Company", "Location", "Link"])
def process_item(self, item, spider):
self.sheet.append([item['title'], item['company'], item['location'], item['link']])
return item
def close_spider(self, spider):
self.workbook.save("jobs.xlsx")
Key Takeaways
Structured Data with Items:
- Define fields to ensure consistent data structure.
- Use ItemLoaders for pre-processing.
Exporting Data:
- Easily export data to JSON, CSV, or Excel.
Appending Data:
- Use FEEDS settings or modify pipelines to prevent overwriting.
Excel Integration:
- Utilize openpyxl for storing data in .xlsx files.
8. Working with APIs
1. Identifying API Endpoints from Job Websites
Why Use APIs?
APIs provide structured data (e.g., JSON or XML) directly, eliminating the need for HTML parsing.
Faster and more efficient compared to scraping web pages.
Steps to Identify API Endpoints
Inspect Network Requests:
Open browser DevTools (F12 in Chrome/Firefox).
Navigate to the Network tab and perform an action (e.g., searching for jobs).
Look for API calls in the XHR or Fetch filters.
Analyze Request and Response:
Examine the URL, HTTP method (GET/POST), headers, and payload.
Check if the response contains JSON data relevant to your needs.
Validate Endpoint Accessibility:
- Copy the API URL and try sending a request using a tool like Postman or curl (or a quick Python script, as sketched below).
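A quick Python check (a sketch using the third-party requests library; the endpoint URL and headers are hypothetical):
import requests  # pip install requests

url = "https://example.com/api/jobs?page=1"  # Endpoint spotted in the Network tab
resp = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=10)
print(resp.status_code)                   # 200 means the endpoint is reachable outside the browser
print(resp.headers.get("Content-Type"))   # Confirm the response is JSON
print(resp.text[:300])                    # Peek at the payload before writing a spider for it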
2. Sending Authenticated API Requests with Scrapy
Step 1: Understand Authentication Requirements
No Authentication: Some APIs are open and can be accessed directly.
Token-Based Authentication: APIs like LinkedIn require an API key or OAuth token in the headers.
Step 2: Include Headers for Authentication
Add required headers to your Scrapy request.
Example: Adding API Key to Headers
import scrapy

class JobApiSpider(scrapy.Spider):
name = "job_api_spider"
start_urls = ["https://api.example.com/jobs"]
def start_requests(self):
headers = {
"Authorization": "Bearer YOUR_API_KEY",
"User-Agent": "Scrapy/1.0 (+https://example.com)",
}
for url in self.start_urls:
yield scrapy.Request(url, headers=headers, callback=self.parse)
def parse(self, response):
data = response.json() # Extract JSON data from the API response
for job in data.get('results', []):
yield {
"Title": job.get('title'),
"Company": job.get('company'),
"Location": job.get('location'),
"Link": job.get('url'),
}
3. Extracting Structured JSON Data
Step 1: Parse the JSON Response
Scrapy provides a response.json() method to parse JSON directly.
Example: Extracting Data
def parse(self, response):
data = response.json() # Parse JSON response
for job in data['jobs']:
yield {
"Title": job['title'],
"Company": job['company']['name'],
"Location": job['location'],
"Link": job['link'],
}
Sample API Response:
{
"jobs": [
{
"title": "Data Scientist",
"company": {"name": "TechCorp"},
"location": "San Francisco, CA",
"link": "https://example.com/job/123"
},
{
"title": "AI Engineer",
"company": {"name": "AI Corp"},
"location": "New York, NY",
"link": "https://example.com/job/456"
}
]
}
Output:
{
"Title": "Data Scientist",
"Company": "TechCorp",
"Location": "San Francisco, CA",
"Link": "https://example.com/job/123"
}
4. Storing API Results in CSV or JSON Formats
Export to JSON
Run the spider and save the output directly:
scrapy crawl job_api_spider -o jobs.json
Export to CSV
Save the output in CSV format:
scrapy crawl job_api_spider -o jobs.csv
Sample CSV Output (jobs.csv):
Title,Company,Location,Link
Data Scientist,TechCorp,"San Francisco, CA",https://example.com/job/123
AI Engineer,AI Corp,"New York, NY",https://example.com/job/456
5. Full Example: Using LinkedIn API
Step 1: Get API Access
Create a LinkedIn Developer Account.
Register an application to get your client ID and client secret.
Authenticate using OAuth to obtain an access token.
Step 2: Use LinkedIn API for Job Data
LinkedIn API requires an access token in the headers.
Spider Example:
import scrapy

class LinkedInJobSpider(scrapy.Spider):
name = "linkedin_jobs"
start_urls = ["https://api.linkedin.com/v2/jobSearch"]
def start_requests(self):
headers = {
"Authorization": "Bearer YOUR_ACCESS_TOKEN",
"Content-Type": "application/json",
}
for url in self.start_urls:
yield scrapy.Request(url, headers=headers, callback=self.parse)
def parse(self, response):
data = response.json()
for job in data['elements']:
yield {
"Title": job['title'],
"Company": job['companyName'],
"Location": job.get('location', 'Remote'),
"Link": job['applyUrl'],
}
Command to Run Spider:
scrapy crawl linkedin_jobs -o linkedin_jobs.json
6. Key Considerations for API Scraping
Advantages of Using APIs
Structured Data: Easier to extract and process.
Stability: Less prone to breakage compared to scraping HTML.
Efficiency: Faster data retrieval.
Challenges
Rate Limits: APIs often limit the number of requests per minute/hour.
Authentication: Some APIs require complex OAuth authentication.
Paid Access: Certain APIs charge for higher usage tiers.
Best Practices
Respect Rate Limits:
Use the DOWNLOAD_DELAY setting in Scrapy:
DOWNLOAD_DELAY = 2  # Delay of 2 seconds between requests
Retry Failed Requests:
Enable the Retry Middleware in settings.py:
RETRY_ENABLED = True
RETRY_TIMES = 3
Store Data Efficiently:
- Use pipelines to save data directly into databases (e.g., MongoDB or PostgreSQL).
Key Takeaways
Identify APIs:
- Use DevTools to find API endpoints.
- Test API calls with tools like Postman.
Authentication:
- Add required headers for token-based APIs.
Data Extraction:
- Use response.json() to parse structured JSON responses.
Exporting Results:
- Save results in JSON or CSV for analysis.
Best Practices:
- Respect rate limits and handle retries gracefully.
9. Scrapy Shell
1. What is the Scrapy Shell and How to Use It?
What is the Scrapy Shell?
The Scrapy Shell is an interactive command-line environment for testing and debugging scraping logic. It allows you to:
Fetch web pages.
Inspect HTML responses.
Experiment with CSS and XPath selectors.
Test data extraction logic before implementing it in spiders.
How to Start the Scrapy Shell
Run the Scrapy Shell for a specific URL:
scrapy shell "https://example.com/jobs"
2. Using the fetch() Command to Inspect Responses
Fetch a URL
If you're already in the shell, use the fetch() command to load a webpage:
fetch("https://example.com/jobs")
Inspect the Response
After fetching a URL, the response is stored in the response object. You can inspect the response attributes:
Check the status code:
response.status
Output:
200
View the response URL:
response.url
Output:
https://example.com/jobs
Check the HTML content:
print(response.text[:500]) # Display the first 500 characters
3. Experimenting with CSS and XPath in the Shell
Using CSS Selectors
You can test CSS selectors directly in the shell:
Select all job titles:
response.css('h2.job-title::text').getall()
Output:
['Data Scientist', 'AI Engineer', 'Machine Learning Engineer']
Extract a specific attribute:
response.css('a.apply-link::attr(href)').getall()
Output:
['/job/123', '/job/456', '/job/789']
Using XPath
Test XPath expressions for the same results:
Select all job titles:
response.xpath('//h2[@class="job-title"]/text()').getall()
Output:
['Data Scientist', 'AI Engineer', 'Machine Learning Engineer']
Extract specific attributes:
response.xpath('//a[@class="apply-link"]/@href').getall()
Output:
['/job/123', '/job/456', '/job/789']
Using Both CSS and XPath
You can use both CSS and XPath to locate elements and test which works better for your target website.
Example: Get Job Titles with CSS and XPath
# CSS Selector
response.css('h2.job-title::text').getall()
# XPath
response.xpath('//h2[@class="job-title"]/text()').getall()
4. Debugging Using Scrapy Shell
Inspecting Issues
If your spider isn’t working as expected, use the shell to debug:
Fetch the URL that’s causing the issue:
fetch("https://example.com/jobs")
Inspect the structure of the HTML:
print(response.text)
Test your selector logic:
CSS:
response.css('div.job-card h2.job-title::text').getall()
XPath:
response.xpath('//div[@class="job-card"]/h2[@class="job-title"]/text()').getall()
Check Element Presence
Sometimes, elements are missing or incorrectly targeted:
response.css('div.nonexistent-class').get()
Output:
None
This confirms that the element is not present in the HTML.
Check for Errors in Responses
If a request fails, you can inspect the response:
response.status # HTTP status code
response.headers # Response headers
5. Example Workflow in Scrapy Shell
Step 1: Start the Shell
scrapy shell "https://example.com/jobs"
Step 2: Inspect the Response
View the HTML content:
print(response.text[:500])
Step 3: Experiment with Selectors
Extract job titles:
response.css('h2.job-title::text').getall()
Extract links:
response.css('a.apply-link::attr(href)').getall()
Step 4: Debug Complex Selectors
Test XPath for deeply nested elements:
response.xpath('//div[@class="job-card"]/a/@href').getall()
Step 5: Verify Pagination
Test the next page selector:
response.css('a.next::attr(href)').get()
Key Takeaways
Scrapy Shell Basics:
- Use the shell for quick debugging and experimentation.
fetch() Command:
- Fetch and inspect responses dynamically.
Testing Selectors:
- Experiment with CSS and XPath to ensure data extraction works.
Debugging:
- Inspect missing elements or incorrect logic before modifying your spider.
10. Advanced Techniques for Job Scraping
1. Handling Dynamic JavaScript-Rendered Websites
Some job websites use JavaScript to load content dynamically. Scrapy alone cannot render JavaScript, but libraries like scrapy-playwright can help.
1.1 Using scrapy-playwright for Rendering JavaScript
Step 1: Install scrapy-playwright
pip install scrapy-playwright
Step 2: Enable scrapy-playwright in Settings
Add the following to your settings.py:
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
PLAYWRIGHT_BROWSER_TYPE = "chromium"  # Options: chromium, firefox, webkit
Step 3: Use playwright in Your Spider
import scrapy
class DynamicJobSpider(scrapy.Spider):
name = "dynamic_job_spider"
def start_requests(self):
yield scrapy.Request(
url="https://example.com/jobs",
meta={"playwright": True}, # Enable Playwright for this request
callback=self.parse,
)
def parse(self, response):
# Extract job data from JavaScript-rendered content
for job in response.css('div.job-card'):
yield {
"Title": job.css('h2.job-title::text').get(),
"Company": job.css('span.company-name::text').get(),
"Location": job.css('span.job-location::text').get(),
"Link": response.urljoin(job.css('a::attr(href)').get()),
}
1.2 Scraping Job Pages with Infinite Scrolling
For websites with infinite scrolling:
Use Playwright to scroll and load content dynamically.
Capture the full page content.
Code Example: Infinite Scrolling
from scrapy_playwright.page import PageCoroutine
class InfiniteScrollSpider(scrapy.Spider):
name = "infinite_scroll"
def start_requests(self):
yield scrapy.Request(
url="https://example.com/infinite-jobs",
meta={
"playwright": True,
"playwright_page_coroutines": [
PageCoroutine("evaluate", "window.scrollTo(0, document.body.scrollHeight)"),
PageCoroutine("wait_for_timeout", 5000), # Wait for more content to load
],
},
callback=self.parse,
)
def parse(self, response):
for job in response.css('div.job-card'):
yield {
"Title": job.css('h2.job-title::text').get(),
"Company": job.css('span.company-name::text').get(),
"Location": job.css('span.job-location::text').get(),
"Link": response.urljoin(job.css('a::attr(href)').get()),
}
2. Logging into Websites to Scrape Hidden Job Listings
Many job websites require user authentication to access full job details.
2.1 Using FormRequest for Login Forms
Code Example: Submitting a Login Form
import scrapy
from scrapy.http import FormRequest
class LoginSpider(scrapy.Spider):
name = "login_spider"
start_urls = ["https://example.com/login"]
def parse(self, response):
# Send a POST request to login
return FormRequest.from_response(
response,
formdata={"username": "your_username", "password": "your_password"},
callback=self.after_login,
)
def after_login(self, response):
# Scrape job listings after successful login
if "Welcome" in response.text:
yield scrapy.Request(
url="https://example.com/hidden-jobs",
callback=self.parse_jobs,
)
def parse_jobs(self, response):
for job in response.css('div.job-card'):
yield {
"Title": job.css('h2.job-title::text').get(),
"Company": job.css('span.company-name::text').get(),
"Location": job.css('span.job-location::text').get(),
}
2.2 Extracting CSRF Tokens Dynamically
Many websites use CSRF tokens for form submissions.
Code Example: Extract CSRF Tokens
def parse(self, response):
csrf_token = response.css('input[name="csrf_token"]::attr(value)').get()
return FormRequest.from_response(
response,
formdata={
"username": "your_username",
"password": "your_password",
"csrf_token": csrf_token,
},
callback=self.after_login,
)
3. Extracting Job Listings from Tables
Some websites display job data in HTML tables. Scraping these requires parsing rows (<tr>) and columns (<td>).
3.1 Scraping Tabular HTML Structures
HTML Example:
<table class="job-table">
<tr>
<th>Title</th>
<th>Company</th>
<th>Location</th>
</tr>
<tr>
<td>Data Scientist</td>
<td>TechCorp</td>
<td>San Francisco, CA</td>
</tr>
<tr>
<td>AI Engineer</td>
<td>AI Corp</td>
<td>New York, NY</td>
</tr>
</table>
Code Example: Scraping Table Data
import scrapy

class TableJobSpider(scrapy.Spider):
name = "table_job_spider"
start_urls = ["https://example.com/jobs"]
def parse(self, response):
rows = response.css('table.job-table tr')[1:] # Skip header row
for row in rows:
yield {
"Title": row.css('td:nth-child(1)::text').get(),
"Company": row.css('td:nth-child(2)::text').get(),
"Location": row.css('td:nth-child(3)::text').get(),
}
Output:
[
{"Title": "Data Scientist", "Company": "TechCorp", "Location": "San Francisco, CA"},
{"Title": "AI Engineer", "Company": "AI Corp", "Location": "New York, NY"}
]
3.2 Parsing and Cleaning Table Data
Clean and normalize data before storing it.
Code Example: Data Cleaning
def parse(self, response):
rows = response.css('table.job-table tr')[1:] # Skip header row
for row in rows:
title = row.css('td:nth-child(1)::text').get().strip()
company = row.css('td:nth-child(2)::text').get().strip()
location = row.css('td:nth-child(3)::text').get().strip()
yield {
"Title": title,
"Company": company,
"Location": location,
}
Key Takeaways
Dynamic Content:
- Use scrapy-playwright for JavaScript-rendered pages and infinite scrolling.
Authentication:
- Use FormRequest to log in and access hidden job listings.
- Dynamically extract CSRF tokens for secure login forms.
Tabular Data:
- Extract structured data from HTML tables and clean it for consistent storage.
11. Managing Scrapy Settings
Efficient Scrapy settings can improve scraping performance while reducing the risk of being blocked by target websites.
1. Configuring Settings for Efficient Scraping
1.1 Controlling Concurrency and Delays
CONCURRENT_REQUESTS: Specifies the number of concurrent requests sent by Scrapy. The default is 16.
DOWNLOAD_DELAY: Sets a delay between requests to the same domain to avoid triggering rate limits.
Example Settings in settings.py:
CONCURRENT_REQUESTS = 8 # Reduce concurrency for sensitive websites
DOWNLOAD_DELAY = 2 # Add a 2-second delay between requests
1.2 Enabling AUTOTHROTTLE for Dynamic Rate Limiting
What is AUTOTHROTTLE?
Dynamically adjusts request rates based on the server's response times.
Helps reduce the chances of getting banned by websites.
Example Settings in settings.py:
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1 # Initial download delay
AUTOTHROTTLE_MAX_DELAY = 10 # Maximum delay in case of high latency
AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0 # Average concurrent requests
AUTOTHROTTLE_DEBUG = True # Enable debug logging for auto-throttling
How It Works:
Scrapy monitors server response times and adjusts the request rate accordingly.
Ensures efficient scraping while reducing server load.
2. Setting Custom Headers and User Agents
2.1 Why Set Custom Headers?
Default headers can trigger bans if the website identifies Scrapy as a bot.
Mimic real user behavior with custom headers and user agents.
2.2 Setting Custom Headers
Example Settings in settings.py:
DEFAULT_REQUEST_HEADERS = {
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Language': 'en',
}
Add Custom Headers in the Spider:
class JobSpider(scrapy.Spider):
name = "job_spider"
start_urls = ["https://example.com/jobs"]
def start_requests(self):
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36',
}
for url in self.start_urls:
yield scrapy.Request(url, headers=headers, callback=self.parse)
2.3 Rotating User Agents
Install the scrapy-user-agents library to easily rotate user agents:
pip install scrapy-user-agents
Enable in settings.py:
DOWNLOADER_MIDDLEWARES = {
'scrapy_user_agents.middlewares.RandomUserAgentMiddleware': 400,
}
3. Rotating Proxies for Avoiding Bans
3.1 Why Rotate Proxies?
Prevent websites from blocking your IP address during large-scale scraping.
Proxies can mask your IP and distribute requests across multiple addresses.
3.2 Adding Proxies to Scrapy
Option 1: Manually Specify Proxies
import random
import scrapy

class ProxySpider(scrapy.Spider):
    name = "proxy_spider"
    start_urls = ["https://example.com/jobs"]

    def start_requests(self):
        proxies = ["http://proxy1:8080", "http://proxy2:8080"]
        for url in self.start_urls:
            proxy = random.choice(proxies)  # Pick a proxy at random for each request
            yield scrapy.Request(url, meta={"proxy": proxy}, callback=self.parse)
Option 2: Using a Proxy Middleware
Install scrapy-rotating-proxies:
pip install scrapy-rotating-proxies
Enable in settings.py:
ROTATING_PROXY_LIST = [
'http://proxy1:8080',
'http://proxy2:8080',
'http://proxy3:8080',
]
DOWNLOADER_MIDDLEWARES = {
'rotating_proxies.middlewares.RotatingProxyMiddleware': 610,
'rotating_proxies.middlewares.BanDetectionMiddleware': 620,
}
4. Full Settings Example
# Concurrency and Delay
CONCURRENT_REQUESTS = 8
DOWNLOAD_DELAY = 2
# Auto-throttle
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1
AUTOTHROTTLE_MAX_DELAY = 10
AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0
AUTOTHROTTLE_DEBUG = True
# Custom Headers
DEFAULT_REQUEST_HEADERS = {
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Language': 'en',
}
# Rotating User Agents
DOWNLOADER_MIDDLEWARES = {
'scrapy_user_agents.middlewares.RandomUserAgentMiddleware': 400,
}
# Rotating Proxies
ROTATING_PROXY_LIST = [
'http://proxy1:8080',
'http://proxy2:8080',
'http://proxy3:8080',
]
DOWNLOADER_MIDDLEWARES.update({
'rotating_proxies.middlewares.RotatingProxyMiddleware': 610,
'rotating_proxies.middlewares.BanDetectionMiddleware': 620,
})
Key Takeaways
Concurrency and Delays:
- Control request rates with CONCURRENT_REQUESTS and DOWNLOAD_DELAY.
- Use AUTOTHROTTLE for dynamic rate limiting.
Custom Headers and User Agents:
- Mimic browser behavior to avoid detection.
- Use scrapy-user-agents for rotating user agents automatically.
Rotating Proxies:
- Distribute requests across multiple proxies to prevent IP bans.
- Use scrapy-rotating-proxies for seamless proxy management.
12. Error Handling and Debugging in Scrapy
1. Common Scrapy Errors and Solutions
1.1 HTTP Errors
404 (Not Found):
The requested page doesn't exist.
Solution: Check if the URL is correct or dynamically constructed.
403 (Forbidden):
The server blocked your request, likely due to bot detection.
Solution:
Rotate user agents or proxies.
Add headers that mimic real browsers.
Slow down requests using DOWNLOAD_DELAY.
500 (Internal Server Error):
The server encountered an error.
Solution:
Retry the request.
Verify if the server is overloaded.
Enable HTTP Error Logging in settings.py:
HTTPERROR_ALLOWED_CODES = [403, 500] # Log these codes without stopping
1.2 Selector Errors
Empty Results from CSS or XPath Selectors:
The selectors don't match the target elements.
Solution:
Use the Scrapy Shell to debug selectors.
Ensure elements are loaded before scraping (e.g., JavaScript-rendered pages).
Example Debugging in Shell:
scrapy shell "https://example.com/jobs"
response.css('div.job-card').getall()
response.xpath('//div[@class="job-card"]').getall()
1.3 Missing or Unexpected Data
The structure of the website may have changed.
Solution:
- Regularly update selectors and test scraping logic.
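As a small illustration of defensive extraction (a sketch reusing the hypothetical selectors from earlier examples), log missing matches instead of silently yielding empty fields:
def parse(self, response):
    for job in response.css('div.job-card'):
        title = job.css('h2.job-title::text').get()
        if title is None:
            # No match usually means the page layout changed; log it for review
            self.logger.warning(f"Title selector matched nothing on {response.url}")
            continue
        yield {
            "Title": title.strip(),
            # default= avoids None when an optional field is missing
            "Company": job.css('span.company-name::text').get(default="Unknown").strip(),
        }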
2. Logging and Debugging Scrapy Spiders
2.1 Configuring Logging
Scrapy provides built-in logging to track spider activity.
Example Logging Configuration in settings.py:
LOG_LEVEL = 'INFO' # Options: DEBUG, INFO, WARNING, ERROR, CRITICAL
LOG_FILE = 'scrapy_log.txt' # Save logs to a file
Log Output Example:
INFO: Scrapy 2.x.x started
INFO: Spider opened
INFO: Crawled (200) <GET https://example.com/jobs> (referer: None)
WARNING: Response status code 403 for https://example.com/protected
2.2 Adding Custom Logs
Add logs within your spider to monitor specific events.
Code Example:
class JobSpider(scrapy.Spider):
name = "job_spider"
start_urls = ["https://example.com/jobs"]
def parse(self, response):
self.logger.info(f"Visited: {response.url}")
if response.status == 403:
self.logger.warning(f"403 Forbidden: {response.url}")
for job in response.css('div.job-card'):
yield {"title": job.css('h2.job-title::text').get()}
3. Retrying Failed Requests
Enable Retry Middleware
Scrapy retries failed requests by default. Configure it in settings.py:
RETRY_ENABLED = True
RETRY_TIMES = 3 # Number of retries
RETRY_HTTP_CODES = [500, 502, 503, 504, 403, 408] # Retry these status codes
Using Proxies for Retries
Combine retries with rotating proxies to increase success rates.
Example in settings.py:
DOWNLOADER_MIDDLEWARES.update({
'rotating_proxies.middlewares.RotatingProxyMiddleware': 610,
'rotating_proxies.middlewares.BanDetectionMiddleware': 620,
})
4. Handling Timeouts
Timeout Settings
Set time limits for requests to avoid hanging indefinitely.
Configuration in settings.py:
DOWNLOAD_TIMEOUT = 15 # Timeout in seconds
Detect and Retry Timeouts
Enable retrying for timeout errors:
RETRY_HTTP_CODES.append(408) # Add 408 (Request Timeout) to retry codes
5. Handling CAPTCHA Challenges
5.1 Detecting CAPTCHA
If your response contains a CAPTCHA page:
- Check for elements like <div class="captcha"> or specific keywords.
Code Example: Detect CAPTCHA
def parse(self, response):
if "captcha" in response.text.lower():
self.logger.warning(f"CAPTCHA encountered at {response.url}")
return
# Proceed with scraping
5.2 Bypassing CAPTCHA
Use Third-Party CAPTCHA Solvers:
- Services like 2Captcha or Anti-Captcha can solve CAPTCHA challenges.
Install the anticaptchaofficial Python library (Anti-Captcha's official client):
pip install anticaptchaofficial
Code Example: Solve reCAPTCHA Using Anti-Captcha
from anticaptchaofficial.recaptchav2proxyless import recaptchaV2Proxyless
def solve_captcha(site_key, page_url):
solver = recaptchaV2Proxyless()
solver.set_verbose(1)
solver.set_key("YOUR_ANTICAPTCHA_API_KEY")
solver.set_website_url(page_url)
solver.set_website_key(site_key)
return solver.solve_and_return_solution()
def parse(self, response):
site_key = response.css('div.g-recaptcha::attr(data-sitekey)').get()
captcha_solution = solve_captcha(site_key, response.url)
self.logger.info(f"CAPTCHA Solved: {captcha_solution}")
Use Playwright for JavaScript CAPTCHA:
- Use scrapy-playwright to render CAPTCHA pages and solve simple challenges.
6. Full Error Handling Example
Spider Code:
import scrapy
from scrapy.http import Request
class JobSpider(scrapy.Spider):
name = "job_spider"
start_urls = ["https://example.com/jobs"]
def parse(self, response):
if response.status == 403:
self.logger.warning(f"403 Forbidden: {response.url}")
return
for job in response.css('div.job-card'):
yield {
"Title": job.css('h2.job-title::text').get(),
"Company": job.css('span.company-name::text').get(),
}
# Follow pagination
next_page = response.css('a.next::attr(href)').get()
if next_page:
yield response.follow(next_page, self.parse)
Settings Example (settings.py):
# Retry Settings
RETRY_ENABLED = True
RETRY_TIMES = 3
RETRY_HTTP_CODES = [500, 502, 503, 504, 403, 408]
# Timeout
DOWNLOAD_TIMEOUT = 15
# Logging
LOG_LEVEL = 'INFO'
LOG_FILE = 'scrapy_log.txt'
# Throttling
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1
AUTOTHROTTLE_MAX_DELAY = 10
Key Takeaways
HTTP Errors:
- Handle 403 with headers, proxies, or user agents.
- Retry 500 and timeout errors using retry middleware.
Logging:
- Use custom logs to monitor spider behavior and debug issues.
CAPTCHA Challenges:
- Detect CAPTCHAs and use external solvers or Playwright.
Timeouts:
- Set a DOWNLOAD_TIMEOUT and retry timeouts when necessary.
13. Scrapy Item Pipelines
1. Enabling Pipelines in Scrapy Settings
To use item pipelines, you must enable them in the settings.py file by defining their priority. Pipelines with lower numbers are executed first.
Example in settings.py:
ITEM_PIPELINES = {
'job_scraper.pipelines.ValidationPipeline': 100,
'job_scraper.pipelines.CleaningPipeline': 200,
'job_scraper.pipelines.JsonPipeline': 300, # For JSON output
'job_scraper.pipelines.PostgreSQLPipeline': 400, # For PostgreSQL storage
}
2. Processing Scraped Data in Pipelines
Pipeline Template
All pipelines must implement the process_item method:
class YourPipeline:
def process_item(self, item, spider):
# Process and transform the item
return item
2.1 Saving Data Locally
a. Saving to JSON
Save each scraped item as a JSON line in a file.
Code Example: JSON Pipeline
import json
class JsonPipeline:
def open_spider(self, spider):
self.file = open('jobs.json', 'w')
def close_spider(self, spider):
self.file.close()
def process_item(self, item, spider):
line = json.dumps(dict(item)) + "\n"
self.file.write(line)
return item
b. Saving to CSV
Save data to a CSV file.
Code Example: CSV Pipeline
import csv
class CsvPipeline:
def open_spider(self, spider):
self.file = open('jobs.csv', 'w', newline='', encoding='utf-8')
self.writer = csv.DictWriter(self.file, fieldnames=["title", "company", "location", "link"])
self.writer.writeheader()
def close_spider(self, spider):
self.file.close()
def process_item(self, item, spider):
self.writer.writerow(dict(item))
return item
2.2 Storing Data in PostgreSQL
Step 1: Install PostgreSQL Python Driver
pip install psycopg2-binary
Step 2: Define the PostgreSQL Pipeline
This pipeline connects to a PostgreSQL database and inserts scraped data into a table.
Code Example: PostgreSQL Pipeline
import psycopg2
class PostgreSQLPipeline:
def open_spider(self, spider):
self.conn = psycopg2.connect(
dbname="job_database",
user="your_username",
password="your_password",
host="localhost",
port="5432"
)
self.cursor = self.conn.cursor()
# Create table if it doesn't exist
self.cursor.execute("""
CREATE TABLE IF NOT EXISTS jobs (
title TEXT,
company TEXT,
location TEXT,
link TEXT
)
""")
def close_spider(self, spider):
self.conn.commit()
self.conn.close()
def process_item(self, item, spider):
self.cursor.execute("""
INSERT INTO jobs (title, company, location, link)
VALUES (%s, %s, %s, %s)
""", (item['title'], item['company'], item['location'], item['link']))
return item
3. Writing Custom Pipelines for Validation and Cleaning
3.1 Validation Pipeline
Ensure that mandatory fields (e.g., title, company) are not empty. Drop items if required fields are missing.
Code Example: Validation Pipeline
from scrapy.exceptions import DropItem

class ValidationPipeline:
    def process_item(self, item, spider):
        if not item.get('title'):
            raise DropItem(f"Missing title in {item}")
        if not item.get('company'):
            item['company'] = "Unknown"  # Assign default value
        return item
3.2 Cleaning Pipeline
Standardize and clean the scraped data, such as removing extra spaces or formatting fields.
Code Example: Cleaning Pipeline
class CleaningPipeline:
def process_item(self, item, spider):
# Strip whitespace
item['title'] = item['title'].strip()
item['company'] = item['company'].strip()
item['location'] = item['location'].strip()
# Normalize location format
if "remote" in item['location'].lower():
item['location'] = "Remote"
return item
4. Full Example
Here’s an example setup with pipelines for validation, cleaning, and saving to JSON and PostgreSQL.
Pipeline Code (pipelines.py)
import scrapy
import json
import psycopg2
class ValidationPipeline:
def process_item(self, item, spider):
if not item.get('title'):
raise scrapy.exceptions.DropItem(f"Missing title in {item}")
if not item.get('company'):
item['company'] = "Unknown"
return item
class CleaningPipeline:
def process_item(self, item, spider):
item['title'] = item['title'].strip()
item['company'] = item['company'].strip()
item['location'] = item['location'].strip()
if "remote" in item['location'].lower():
item['location'] = "Remote"
return item
class JsonPipeline:
def open_spider(self, spider):
self.file = open('jobs.json', 'w')
def close_spider(self, spider):
self.file.close()
def process_item(self, item, spider):
line = json.dumps(dict(item)) + "\n"
self.file.write(line)
return item
class PostgreSQLPipeline:
def open_spider(self, spider):
self.conn = psycopg2.connect(
dbname="job_database",
user="your_username",
password="your_password",
host="localhost",
port="5432"
)
self.cursor = self.conn.cursor()
self.cursor.execute("""
CREATE TABLE IF NOT EXISTS jobs (
title TEXT,
company TEXT,
location TEXT,
link TEXT
)
""")
def close_spider(self, spider):
self.conn.commit()
self.conn.close()
def process_item(self, item, spider):
self.cursor.execute("""
INSERT INTO jobs (title, company, location, link)
VALUES (%s, %s, %s, %s)
""", (item['title'], item['company'], item['location'], item['link']))
return item
Settings (settings.py)
ITEM_PIPELINES = {
'job_scraper.pipelines.ValidationPipeline': 100,
'job_scraper.pipelines.CleaningPipeline': 200,
'job_scraper.pipelines.JsonPipeline': 300,
'job_scraper.pipelines.PostgreSQLPipeline': 400,
}
5. Output Examples
jobs.json (one JSON object per line, as written by JsonPipeline):
{"title": "Data Scientist", "company": "TechCorp", "location": "San Francisco, CA", "link": "https://example.com/job/123"}
{"title": "AI Engineer", "company": "AI Corp", "location": "Remote", "link": "https://example.com/job/456"}
PostgreSQL Table:
+--------------------+------------+--------------------+-----------------------------+
| title | company | location | link |
+--------------------+------------+--------------------+-----------------------------+
| Data Scientist | TechCorp | San Francisco, CA | https://example.com/job/123 |
| AI Engineer | AI Corp | Remote | https://example.com/job/456 |
+--------------------+------------+--------------------+-----------------------------+
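Once the data is in PostgreSQL it can be queried directly; a minimal sketch reusing the hypothetical connection details from the pipeline above:
import psycopg2  # Same driver used by the pipeline

conn = psycopg2.connect(dbname="job_database", user="your_username",
                        password="your_password", host="localhost", port="5432")
cur = conn.cursor()
# Count postings per company, most frequent first
cur.execute("SELECT company, COUNT(*) FROM jobs GROUP BY company ORDER BY COUNT(*) DESC;")
for company, n in cur.fetchall():
    print(company, n)
conn.close()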
Key Takeaways
Validation and Cleaning:
- Use custom pipelines to ensure data quality.
Local Storage:
- Save data as JSON or CSV using dedicated pipelines.
Database Storage:
- Store data in PostgreSQL for advanced querying and analysis.
Pipeline Priority:
- Order pipelines in settings.py for sequential processing.
14. Deploying Scrapy Projects
1. Deploying Spiders to Scrapy Cloud
Scrapy Cloud is a cloud-based platform by Scrapinghub for deploying, scheduling, and monitoring Scrapy spiders.
1.1 Install the shub CLI
pip install shub
1.2 Authenticate with Scrapy Cloud
Run the following command and enter your Scrapy Cloud API key:
shub login
1.3 Deploy Your Spider
In your Scrapy project directory, deploy the spider to Scrapy Cloud:
shub deploy
Output Example:
Packing version 1.0
Deploying to Scrapy Cloud project '12345'
Spider deployed: https://app.scrapinghub.com/p/12345
1.4 Schedule a Spider
After deployment, schedule the spider from the Scrapy Cloud dashboard or using the shub CLI:
shub schedule my_spider
2. Using scrapyd to Manage and Schedule Spiders
scrapyd is a service for running Scrapy spiders remotely, managing deployments, and scheduling jobs via an HTTP API.
2.1 Install scrapyd
pip install scrapyd
2.2 Start the scrapyd Service
Run scrapyd to start the server (default port: 6800):
scrapyd
2.3 Deploy Your Spider to scrapyd
Step 1: Install scrapyd-client
pip install scrapyd-client
Step 2: Configure Deployment
Create or edit the scrapy.cfg file in your project directory:
[settings]
default = my_project.settings
[deploy]
url = http://localhost:6800/
project = my_project
Step 3: Deploy the Spider
scrapyd-deploy
2.4 Schedule Spiders with scrapyd
Use curl or a Python script to schedule a spider on scrapyd.
Command Example:
curl http://localhost:6800/schedule.json -d project=my_project -d spider=my_spider
Python Example:
import requests
response = requests.post(
"http://localhost:6800/schedule.json",
data={"project": "my_project", "spider": "my_spider"}
)
print(response.json())
3. Running Spiders on Cloud Platforms
You can deploy Scrapy spiders to cloud platforms like AWS or GCP for better scalability.
3.1 Running Spiders on AWS
Step 1: Install AWS CLI
pip install awscli
Step 2: Set Up an EC2 Instance
Launch an EC2 instance with an appropriate image (e.g., Amazon Linux or Ubuntu).
SSH into the instance and install Python, Scrapy, and necessary dependencies:
sudo apt update
sudo apt install python3 python3-pip
pip install scrapy
Step 3: Run Your Spider
Upload your Scrapy project to the instance (one way to copy it over is sketched below) and run the spider:
scrapy crawl my_spider
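One common way to copy the project to the instance beforehand is scp (a sketch; the key path, username, and host are placeholders for your own instance details):
# Copy the local Scrapy project to the EC2 instance (my-key.pem, ubuntu, and <ec2-public-ip> are placeholders)
scp -i ~/.ssh/my-key.pem -r job_scraper/ ubuntu@<ec2-public-ip>:~/job_scraper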
3.2 Running Spiders on GCP
Step 1: Install Google Cloud CLI
curl -O https://dl.google.com/dl/cloudsdk/channels/rapid/downloads/google-cloud-sdk-<version>-linux-x86_64.tar.gz
tar -xf google-cloud-sdk-<version>-linux-x86_64.tar.gz
./google-cloud-sdk/install.sh
Step 2: Launch a Virtual Machine
Create a VM instance in the Google Cloud Console.
SSH into the instance and set up Python, Scrapy, and dependencies.
Step 3: Run Your Spider
Transfer your Scrapy project to the VM and execute the spider:
scrapy crawl my_spider
4. Scheduling Scrapy Jobs
4.1 Using cron
Step 1: Open crontab
crontab -e
Step 2: Add a Job
Schedule your spider to run at a specific time (e.g., every day at midnight):
0 0 * * * cd /path/to/scrapy/project && scrapy crawl my_spider
4.2 Using APScheduler
Install APScheduler:
pip install apscheduler
Code Example: Schedule a Spider
from apscheduler.schedulers.blocking import BlockingScheduler
from subprocess import call
def run_spider():
call(["scrapy", "crawl", "my_spider"])
scheduler = BlockingScheduler()
scheduler.add_job(run_spider, "interval", hours=24) # Run every 24 hours
try:
scheduler.start()
except (KeyboardInterrupt, SystemExit):
pass
Run the script to keep the scheduler active.
Key Takeaways
Scrapy Cloud:
- Use shub to deploy and schedule spiders on Scrapy Cloud.
scrapyd:
- Manage and schedule spiders remotely using scrapyd.
Cloud Platforms:
- Run spiders on AWS or GCP for scalability.
Job Scheduling:
- Use cron for simple scheduling or APScheduler for more flexibility.
15. Tips and Tricks for Efficient Scraping
1. Using Spider Arguments for Flexible Spiders
Why Use Spider Arguments?
- Spider arguments allow you to pass dynamic parameters to spiders at runtime, making them more flexible.
Example: Define Spider with Arguments
Modify your spider to accept arguments:
import scrapy
class JobSpider(scrapy.Spider):
name = "job_spider"
def __init__(self, location=None, job_type=None, *args, **kwargs):
super().__init__(*args, **kwargs)
self.location = location
self.job_type = job_type
def start_requests(self):
url = f"https://example.com/jobs?location={self.location}&type={self.job_type}"
yield scrapy.Request(url, self.parse)
def parse(self, response):
for job in response.css('div.job-card'):
yield {
"title": job.css('h2.job-title::text').get(),
"company": job.css('span.company-name::text').get(),
"location": self.location,
"type": self.job_type,
}
Run the Spider with Arguments
scrapy crawl job_spider -a location="New York" -a job_type="Data Scientist"
Output:
{
"title": "Data Scientist",
"company": "TechCorp",
"location": "New York",
"type": "Data Scientist"
}
2. Custom Callback Functions for Following Links
Why Use Custom Callbacks?
- Custom callbacks allow you to handle links and responses differently based on their context.
Example: Following Links with Custom Callback
import scrapy

class JobSpider(scrapy.Spider):
name = "job_spider"
start_urls = ["https://example.com/jobs"]
def parse(self, response):
for job in response.css('div.job-card'):
job_link = job.css('a::attr(href)').get()
yield response.follow(job_link, self.parse_job)
def parse_job(self, response):
yield {
"title": response.css('h1.job-title::text').get(),
"company": response.css('span.company-name::text').get(),
"description": response.css('div.job-description').get(),
}
3. Debugging CSS and XPath Selectors with the Scrapy Shell
Why Use Scrapy Shell?
- Scrapy Shell helps you test CSS and XPath selectors interactively, making it easier to debug and refine your extraction logic.
Example Workflow:
Start the Scrapy Shell:
scrapy shell "https://example.com/jobs"
Test CSS Selectors:
response.css('div.job-card h2.job-title::text').getall()
Output:
['Data Scientist', 'AI Engineer', 'ML Developer']
Test XPath Selectors:
response.xpath('//div[@class="job-card"]/h2[@class="job-title"]/text()').getall()
Output:
['Data Scientist', 'AI Engineer', 'ML Developer']
Inspect the Full HTML of a Node:
print(response.css('div.job-card').get())
4. Writing Reusable Standalone Spiders
Why Create Standalone Spiders?
- Standalone spiders can be reused across multiple projects with minimal changes.
Example: Reusable Spider
Create a base spider class with configurable attributes:
import json
import scrapy

class ReusableSpider(scrapy.Spider):
    name = "reusable_spider"

    def __init__(self, base_url=None, selectors=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.base_url = base_url
        # Spider arguments passed with -a arrive as strings, so a JSON-encoded
        # selector map must be decoded before it can be indexed.
        if isinstance(selectors, str):
            selectors = json.loads(selectors)
        self.selectors = selectors

    def start_requests(self):
        yield scrapy.Request(self.base_url, self.parse)

    def parse(self, response):
        for item in response.css(self.selectors['item']):
            yield {
                "title": item.css(self.selectors['title']).get(),
                "link": response.urljoin(item.css(self.selectors['link']).get()),
            }
Run the Spider with Arguments:
scrapy crawl reusable_spider -a base_url="https://example.com/jobs" -a selectors='{"item":"div.job-card","title":"h2.job-title::text","link":"a::attr(href)"}'
5. Best Practices for Reusable Spiders
Use Variables for Selectors:
- Define selectors in settings.py or pass them as arguments (see the sketch after this list).
Handle Common Scenarios:
- Add logic for pagination, error handling, and retries.
Separate Logic:
- Use item pipelines to clean data instead of adding all processing to spiders.
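A minimal sketch of the settings-driven approach (JOB_SELECTORS is not a Scrapy built-in, just a project-level setting invented for this example):
import scrapy

# settings.py would define the hypothetical selector map:
# JOB_SELECTORS = {
#     "item": "div.job-card",
#     "title": "h2.job-title::text",
#     "link": "a::attr(href)",
# }

class ConfigurableSpider(scrapy.Spider):
    name = "configurable_spider"
    start_urls = ["https://example.com/jobs"]

    def parse(self, response):
        # Spiders can read any project-level setting via self.settings
        selectors = self.settings.get("JOB_SELECTORS", {})
        for item in response.css(selectors["item"]):
            yield {
                "title": item.css(selectors["title"]).get(),
                "link": response.urljoin(item.css(selectors["link"]).get()),
            }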
Key Takeaways
Spider Arguments:
- Add flexibility to spiders by accepting arguments at runtime.
Custom Callbacks:
- Use callbacks to handle different pages and links contextually.
Scrapy Shell:
- Debug and refine your selectors interactively.
Reusable Spiders:
- Write spiders that can adapt to different use cases with minimal changes.
16. Building a Job Scraper for AIJobs.net, Indeed, and LinkedIn
1. Analyzing the Structure of Job Websites
Key Data Points
For job websites, we typically extract:
Job Titles: The role being advertised.
Companies: The employer or organization.
Locations: Where the job is based or if it’s remote.
Links: Direct URLs to job details.
Identifying Pagination Patterns
Look for "Next" Buttons or Pagination Links:
HTML example:
<a class="next" href="/jobs?page=2">Next</a>
Dynamic APIs:
- Inspect network requests in browser DevTools for APIs providing job data.
2. Writing Spiders for AIJobs.net
Step 1: Analyze AIJobs.net
Inspect the structure of job cards on the website:
<div class="job-card"> <h2 class="job-title">Data Scientist</h2> <span class="company-name">TechCorp</span> <a href="/job/123" class="job-link">View Job</a> </div>
Step 2: Create the Spider
Spider Code for AIJobs.net:
import scrapy
class AIJobsSpider(scrapy.Spider):
name = "aijobs"
start_urls = ["https://aijobs.net/jobs"]
def parse(self, response):
for job in response.css('div.job-card'):
yield {
"title": job.css('h2.job-title::text').get(),
"company": job.css('span.company-name::text').get(),
"link": response.urljoin(job.css('a.job-link::attr(href)').get()),
}
# Handle pagination
next_page = response.css('a.next::attr(href)').get()
if next_page:
yield response.follow(next_page, self.parse)
Run the Spider
scrapy crawl aijobs -o aijobs.json
3. Writing Spiders for Indeed
Step 1: Analyze Indeed
Identify job cards and key data points:
<div class="job_seen_beacon"> <h2 class="jobTitle">Data Scientist</h2> <span class="companyName">TechCorp</span> <div class="companyLocation">San Francisco, CA</div> <a href="/rc/clk?jk=abc123" class="jobTitle-link">View Job</a> </div>
Step 2: Create the Spider
Spider Code for Indeed:
import scrapy
class IndeedSpider(scrapy.Spider):
name = "indeed"
start_urls = ["https://www.indeed.com/jobs?q=data+scientist&l="]
def parse(self, response):
for job in response.css('div.job_seen_beacon'):
yield {
"title": job.css('h2.jobTitle::text').get(),
"company": job.css('span.companyName::text').get(),
"location": job.css('div.companyLocation::text').get(),
"link": response.urljoin(job.css('a::attr(href)').get()),
}
# Handle pagination
next_page = response.css('a[aria-label="Next"]::attr(href)').get()
if next_page:
yield response.follow(next_page, self.parse)
Run the Spider
scrapy crawl indeed -o indeed.json
4. Writing Spiders for LinkedIn
Step 1: Analyze LinkedIn
LinkedIn heavily relies on JavaScript for rendering job data. Scrapy alone cannot scrape it effectively. Use one of these approaches:
API: Requires developer access and an access token.
Selenium: Automates browser interaction for dynamic content.
Step 2: Using Selenium for LinkedIn
Install Selenium and WebDriver
pip install selenium
Code Example: LinkedIn Spider with Selenium
from scrapy import Spider, Request
from scrapy.http import HtmlResponse
from selenium import webdriver
import time

class LinkedInSpider(Spider):
    name = "linkedin"

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.driver = webdriver.Chrome()

    def start_requests(self):
        # start_requests may only yield Request objects, so the Selenium work
        # happens in the callback rather than here.
        url = "https://www.linkedin.com/jobs/search/?keywords=data%20scientist"
        yield Request(
            url,
            callback=self.parse_with_selenium,
            dont_filter=True,
            meta={"handle_httpstatus_all": True},  # run the callback even if LinkedIn returns a non-200
        )

    def parse_with_selenium(self, response):
        # Re-render the page with Selenium so JavaScript-generated job cards are present
        self.driver.get(response.url)
        time.sleep(5)  # Wait for the page to load
        html = self.driver.page_source
        selenium_response = HtmlResponse(url=response.url, body=html, encoding='utf-8')
        yield from self.parse(selenium_response)

    def parse(self, response):
        for job in response.css('li.result-card'):
            yield {
                "title": job.css('h3.result-card__title::text').get(),
                "company": job.css('h4.result-card__subtitle::text').get(),
                "location": job.css('span.job-result-card__location::text').get(),
                "link": job.css('a.result-card__full-card-link::attr(href)').get(),
            }

    def closed(self, reason):
        self.driver.quit()
Step 3: Run the Spider
Ensure you have the appropriate WebDriver installed (e.g., ChromeDriver for Chrome).
scrapy crawl linkedin -o linkedin.json
5. Key Takeaways
AIJobs.net:
- Pure HTML scraping with Scrapy is sufficient.
- Focus on job cards and pagination.
Indeed:
- Scrape job titles, companies, and locations with pagination handling.
- Handle URLs carefully, as they might be relative.
LinkedIn:
- Use Selenium for JavaScript-rendered content or APIs for structured data.
Focus Areas for Your Use Case
Here’s how to address each focus area step by step, tailored to job scraping.
1. CSS vs. XPath Selectors
Goal:
- Learn how to extract job-related data from real websites using both CSS and XPath.
CSS Selectors:
CSS is intuitive and efficient for most scraping tasks.
Example: Extracting Job Titles
response.css('h2.job-title::text').getall()
Output:
['Data Scientist', 'AI Engineer', 'ML Developer']
XPath Selectors:
XPath provides more power for complex HTML structures.
Example: Extracting Job Titles
response.xpath('//h2[@class="job-title"]/text()').getall()
Output:
['Data Scientist', 'AI Engineer', 'ML Developer']
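Where XPath earns its keep is matching on element text, which standard CSS selectors cannot do; a small sketch, assuming the job cards also contain a span holding the location text:
# Titles of job cards whose text mentions "Remote" (contains() has no standard CSS equivalent)
response.xpath('//div[@class="job-card"][.//span[contains(text(), "Remote")]]/h2[@class="job-title"]/text()').getall()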
2. Pagination
Goal:
- Enable spiders to navigate multi-page job listings.
Recursive Crawling Example:
import scrapy

class PaginationSpider(scrapy.Spider):
name = "pagination"
start_urls = ["https://example.com/jobs"]
def parse(self, response):
# Extract job data
for job in response.css('div.job-card'):
yield {
"title": job.css('h2.job-title::text').get(),
"company": job.css('span.company-name::text').get(),
"link": response.urljoin(job.css('a::attr(href)').get()),
}
# Follow pagination links
next_page = response.css('a.next::attr(href)').get()
if next_page:
yield response.follow(next_page, self.parse)
Test Pagination Logic:
Use Scrapy Shell to debug pagination links.
scrapy shell "https://example.com/jobs"
response.css('a.next::attr(href)').get()
3. APIs and Dynamic Content
Goal:
- Extract data from APIs or render JavaScript-heavy pages.
Using APIs:
APIs often return structured JSON data.
Example: Requesting Job Data from an API
import scrapy
import json
class APISpider(scrapy.Spider):
name = "api_spider"
start_urls = ["https://example.com/api/jobs"]
def parse(self, response):
data = json.loads(response.text)
for job in data['jobs']:
yield {
"title": job['title'],
"company": job['company'],
"location": job['location'],
"link": job['url'],
}
Handling JavaScript-Rendered Pages:
Use scrapy-playwright to scrape dynamic content.
Example: Scraping with Playwright
import scrapy
class PlaywrightSpider(scrapy.Spider):
name = "playwright_spider"
start_urls = ["https://example.com/jobs"]
def start_requests(self):
for url in self.start_urls:
yield scrapy.Request(url, meta={"playwright": True})
def parse(self, response):
for job in response.css('div.job-card'):
yield {
"title": job.css('h2.job-title::text').get(),
"company": job.css('span.company-name::text').get(),
}
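scrapy-playwright only takes effect once it is enabled in settings.py; install it first with pip install scrapy-playwright and playwright install, then register the download handlers and asyncio reactor as its documentation requires:
# settings.py: enable scrapy-playwright (settings taken from the scrapy-playwright documentation)
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"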
4. Data Storage
Goal:
- Save job data in structured formats (JSON, CSV).
Saving to JSON
import json
class JsonPipeline:
def open_spider(self, spider):
self.file = open('jobs.json', 'w')
def close_spider(self, spider):
self.file.close()
def process_item(self, item, spider):
line = json.dumps(dict(item)) + "\n"
self.file.write(line)
return item
Saving to CSV
import csv
class CsvPipeline:
def open_spider(self, spider):
self.file = open('jobs.csv', 'w', newline='', encoding='utf-8')
self.writer = csv.DictWriter(self.file, fieldnames=["title", "company", "location", "link"])
self.writer.writeheader()
def close_spider(self, spider):
self.file.close()
def process_item(self, item, spider):
self.writer.writerow(dict(item))
return item
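Neither pipeline runs until it is registered in ITEM_PIPELINES; a minimal sketch, assuming the job_scraper project layout used earlier in this guide:
# settings.py: register the storage pipelines (module path assumes the job_scraper project)
ITEM_PIPELINES = {
    'job_scraper.pipelines.JsonPipeline': 300,
    'job_scraper.pipelines.CsvPipeline': 400,
}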
5. Deployment
Goal:
- Automate spider execution using scrapyd or cloud platforms.
Using scrapyd:
Deploy and schedule spiders on a local or remote scrapyd instance.
Deploy Spider to scrapyd:
scrapyd-deploy
Schedule Spider:
curl http://localhost:6800/schedule.json -d project=my_project -d spider=my_spider
Using Cron Jobs:
Automate spider execution with cron.
Example Cron Job:
0 0 * * * cd /path/to/scrapy/project && scrapy crawl my_spider
Using Cloud Platforms:
Run spiders on AWS or GCP for scalability.
Example on AWS EC2:
SSH into the instance, upload your project, and execute:
scrapy crawl my_spider
Key Steps for Your Use Case
Practice CSS and XPath selectors:
- Extract job data interactively with Scrapy Shell.
Master Pagination:
- Scrape multiple pages dynamically.
Use APIs or scrapy-playwright:
- Handle modern job boards with JavaScript-rendered content.
Save Data:
- Use pipelines to export job listings in JSON or CSV.
Automate Deployment:
- Schedule jobs with scrapyd or cloud platforms.