20 Scrapy concepts with Before-and-After Examples
Table of contents
- 1. Creating a Scrapy Project
- 2. Creating a Spider
- 3. Running a Spider
- 4. Parsing Responses (parse method)
- 5. CSS Selectors (response.css)
- 6. XPath Selectors (response.xpath)
- 7. Extracting Links (response.follow)
- 8. Storing Data (Item Pipeline)
- 9. Defining Items (Item Class)
- 10. Handling Pagination (next page)
- 11. Configuring Settings (Settings Module)
- 12. Handling Cookies (COOKIES_ENABLED)
- 13. Customizing Request Headers (headers)
- 14. Downloading Files (media)
- 15. Using CrawlSpider (CrawlSpider Class)
- 16. Throttling Requests (AUTOTHROTTLE)
- 17. Handling Redirects (REDIRECT_ENABLED)
- 18. Rotating User Agents (fake_useragent)
- 19. Logging (LOG_LEVEL)
- 20. Middleware (Custom Middleware)
1. Creating a Scrapy Project
Boilerplate Code:
scrapy startproject myproject
Use Case: Initialize a new Scrapy project.
Goal: Set up the basic structure for your Scrapy project.
Sample Command:
scrapy startproject myproject
Before Example:
You need to scrape data but don't have a project structure.
No project directory.
After Example:
With scrapy startproject, you get a fully scaffolded project directory!
myproject/
├── myproject/
├── scrapy.cfg
└── ...
Challenge: Try creating a few Scrapy projects and explore the generated files (settings.py, items.py, pipelines.py, spiders/) to see what each one controls.
2. Creating a Spider
Boilerplate Code:
scrapy genspider spider_name domain.com
Use Case: Create a Spider to scrape a specific website.
Goal: Set up a spider that defines how to crawl and parse a website.
Sample Command:
scrapy genspider myspider example.com
Before Example:
You want to scrape a website but don't have a spider defined.
No spider available.
After Example:
With scrapy genspider, you generate a spider file ready to customize!
myproject/spiders/myspider.py
Challenge: Try creating spiders for multiple domains and define the rules for each.
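A freshly generated spider file looks roughly like this (the exact template varies slightly between Scrapy versions; the class and file names below assume scrapy genspider myspider example.com):

import scrapy

class MyspiderSpider(scrapy.Spider):
    name = "myspider"
    allowed_domains = ["example.com"]
    start_urls = ["https://example.com"]

    def parse(self, response):
        # Add your extraction logic here
        pass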
3. Running a Spider
Boilerplate Code:
scrapy crawl spider_name
Use Case: Use crawl to run your Scrapy spider.
Goal: Execute the spider to crawl and scrape data from the target site.
Sample Command:
scrapy crawl myspider
Before Example:
You've written your spider but don't know how to execute it.
Spider exists, but no data collected.
After Example:
With scrapy crawl, the spider runs, scrapes, and collects data!
Data is collected and printed or stored.
Challenge: Run the spider with the -o option to save scraped data into a file (e.g., json, csv).
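For the challenge, the -o flag uses Scrapy's feed exports and infers the format from the file extension (the file names here are just examples):

scrapy crawl myspider -o items.json   # single JSON array
scrapy crawl myspider -o items.csv    # CSV, one column per field
scrapy crawl myspider -o items.jl     # JSON Lines, one item per line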
4. Parsing Responses (parse method)
Boilerplate Code:
def parse(self, response):
    # Extract data here
    pass
Use Case: Define the parse method to handle the data extracted from responses.
Goal: Extract data from the HTML content of the page.
Sample Code:
def parse(self, response):
    title = response.css('title::text').get()
    yield {'title': title}
Before Example:
You have a spider that crawls pages but doesn't extract specific data.
HTML response is received but no data extracted.
After Example:
With parse, you extract specific elements from the page!
Extracted data: {"title": "Example Title"}
Challenge: Try extracting multiple fields like headers, paragraphs, or links using CSS or XPath selectors.
5. CSS Selectors (response.css)
Boilerplate Code:
response.css('css_selector')
Use Case: Use CSS selectors to locate elements within the HTML response.
Goal: Select and extract data using CSS-like syntax.
Sample Code:
title = response.css('title::text').get()
Before Example:
You have an HTML response but can't efficiently extract specific elements.
Data: <title>Example Title</title>
After Example:
With CSS selectors, you can easily extract the desired text or attributes!
Output: "Example Title"
Challenge: Use CSS selectors to extract different elements such as images (img::attr(src)), links (a::attr(href)), or text.
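A few common selector patterns as a starting point for the challenge (the selectors assume typical HTML; adjust them to the page you are scraping):

# All image sources on the page
image_urls = response.css('img::attr(src)').getall()

# All link targets
links = response.css('a::attr(href)').getall()

# Text of every <h2> heading
headings = response.css('h2::text').getall()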
6. XPath Selectors (response.xpath)
Boilerplate Code:
response.xpath('xpath_expression')
Use Case: Use XPath selectors to extract elements from the HTML response.
Goal: Use powerful XPath expressions for more flexible or complex queries.
Sample Code:
title = response.xpath('//title/text()').get()
Before Example:
You need to extract elements but CSS selectors are not flexible enough.
Data: <title>Example Title</title>
After Example:
With XPath, you can extract data using more complex queries!
Output: "Example Title"
Challenge: Try using XPath to extract nested elements or multiple attributes in a single query.
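For the challenge, XPath can reach nested elements and pull several values per node (the div/span structure below is hypothetical):

# Text of <span class="price"> nested anywhere inside a product <div>
prices = response.xpath('//div[@class="product"]//span[@class="price"]/text()').getall()

# Inside parse(): pair up the href and text of every link
for link in response.xpath('//a'):
    yield {
        'url': link.xpath('./@href').get(),
        'text': link.xpath('normalize-space(.)').get(),
    }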
7. Extracting Links (response.follow)
Boilerplate Code:
response.follow(link, callback)
Use Case: Use follow to navigate to links and scrape multiple pages.
Goal: Extract links from a page and follow them to scrape additional pages.
Sample Code:
for href in response.css('a::attr(href)').getall():
    yield response.follow(href, self.parse)
Before Example:
Your spider scrapes a single page but doesn't navigate to other linked pages.
Only the first page is scraped.
After Example:
With response.follow, you can follow links and scrape multiple pages!
The spider navigates and scrapes linked pages.
Challenge: Try following only specific links, such as those that contain certain keywords or paths.
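One way to approach the challenge is to filter the extracted hrefs before following them (the '/product/' path is just an example):

# Inside parse(): follow only links whose URL contains '/product/'
for href in response.css('a::attr(href)').getall():
    if '/product/' in href:
        yield response.follow(href, callback=self.parse)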
8. Storing Data (Item Pipeline)
Boilerplate Code:
class MyItemPipeline:
    def process_item(self, item, spider):
        # Process and store the item
        return item
Use Case: Use item pipelines to store or process the scraped data.
Goal: Define how scraped data should be processed and stored after extraction.
Sample Code:
class MyItemPipeline:
    def process_item(self, item, spider):
        # Save item to a file or database
        with open('output.txt', 'a') as f:
            f.write(f"{item}\n")
        return item
Before Example:
You've extracted data but have no way to store or process it.
Scraped data is printed but not saved.
After Example:
With pipelines, you can process and store data in files, databases, etc.!
Output: Data is saved to a file or database.
Challenge: Try implementing pipelines to save data in formats like CSV or JSON.
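A minimal sketch for the challenge: a pipeline that appends each item to a JSON Lines file (the file name output.jl is arbitrary, and the class only runs once it is enabled in ITEM_PIPELINES):

import json

class JsonLinesPipeline:
    def open_spider(self, spider):
        # Open the output file once, when the spider starts
        self.file = open('output.jl', 'w', encoding='utf-8')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        # Write one JSON object per line (assumes dict-like items)
        self.file.write(json.dumps(dict(item)) + '\n')
        return item

Enable it in settings.py, e.g. ITEM_PIPELINES = {'myproject.pipelines.JsonLinesPipeline': 300} (the module path depends on where you put the class).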
9. Defining Items (Item Class)
Boilerplate Code:
from scrapy import Item, Field
Use Case: Define a structured Item to represent the data you are scraping.
Goal: Organize the scraped data into a structured format.
Sample Code:
class MyItem(Item):
    title = Field()
    link = Field()
Before Example:
You've scraped data but don't have a structured format to represent it.
Unstructured data extraction.
After Example:
With Item, your data is organized into fields for better structure and processing!
Structured data: {"title": "Example", "link": "https://example.com"}
Challenge: Try defining multiple fields and extract values for each one using CSS or XPath.
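To tie the item back to a spider, you can populate and yield it from parse (the field names match the MyItem class above):

def parse(self, response):
    item = MyItem()
    item['title'] = response.css('title::text').get()
    item['link'] = response.url
    yield item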
10. Handling Pagination (next page)
Boilerplate Code:
next_page = response.css('a.next::attr(href)').get()
if next_page:
    yield response.follow(next_page, self.parse)
Use Case: Handle pagination to scrape data across multiple pages.
Goal: Automatically navigate through paginated content to collect more data.
Sample Code:
def parse(self, response):
    # Extract data from the current page
    yield {'title': response.css('title::text').get()}
    # Follow the pagination link
    next_page = response.css('a.next::attr(href)').get()
    if next_page:
        yield response.follow(next_page, self.parse)
Before Example:
Your spider scrapes only the first page of a paginated website.
Data is limited to the first page.
After Example:
With pagination handling, the spider follows links and scrapes additional pages!
Data collected from multiple pages.
Challenge: Try handling pagination where the "next" button has different forms (e.g., buttons, JavaScript events).
11. Configuring Settings (Settings Module)
Boilerplate Code:
from scrapy.utils.project import get_project_settings
Use Case: Use the settings module to configure how Scrapy runs.
Goal: Adjust settings like user-agent, download delays, and more.
Sample Code:
settings = get_project_settings()
settings.set('USER_AGENT', 'Mozilla/5.0 (compatible; MyScrapyBot/1.0)')
Before Example:
Your spider runs with default settings, like a default user-agent, causing potential blocking.
Scrapy default settings in use.
After Example:
With custom settings, you can fine-tune spider behavior like user-agent and download delays!
Custom user-agent or settings applied.
Challenge: Try adding download delays to prevent being blocked by websites (DOWNLOAD_DELAY = 2).
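In a typical project these options live in settings.py (or per spider via the custom_settings attribute) rather than being set at runtime; a sketch for the challenge, with illustrative values:

# settings.py
USER_AGENT = 'Mozilla/5.0 (compatible; MyScrapyBot/1.0)'
DOWNLOAD_DELAY = 2   # wait roughly 2 seconds between requests

# or, scoped to a single spider, inside its class:
#     custom_settings = {'DOWNLOAD_DELAY': 2}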
12. Handling Cookies (COOKIES_ENABLED)
Boilerplate Code:
settings.set('COOKIES_ENABLED', True)
Use Case: Enable or disable cookies in your Scrapy project.
Goal: Control how your spider handles cookies for session-based scraping.
Sample Code:
settings.set('COOKIES_ENABLED', True)
Before Example:
Your spider struggles to maintain a session because cookies are not handled.
Session information is lost.
After Example:
With cookies enabled, your spider maintains sessions correctly across requests!
Session data maintained via cookies.
Challenge: Try scraping a website that requires login using cookies to maintain the session.
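A minimal sketch for the login challenge, using FormRequest.from_response so Scrapy's built-in cookie handling keeps the session alive (the URLs, form field names, and credentials are hypothetical):

import scrapy

class LoginSpider(scrapy.Spider):
    name = 'login_spider'
    start_urls = ['https://example.com/login']

    def parse(self, response):
        # Submit the login form; the session cookie from the reply is stored automatically
        yield scrapy.FormRequest.from_response(
            response,
            formdata={'username': 'me', 'password': 'secret'},
            callback=self.after_login,
        )

    def after_login(self, response):
        # Subsequent requests reuse the session cookie
        yield scrapy.Request('https://example.com/profile', callback=self.parse_profile)

    def parse_profile(self, response):
        yield {'title': response.css('title::text').get()}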
13. Customizing Request Headers (headers)
Boilerplate Code:
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
    'Accept-Language': 'en'
}
yield scrapy.Request(url, headers=headers)
Use Case: Customize headers in your requests to mimic real browser behavior.
Goal: Avoid detection by websites and mimic genuine users.
Sample Code:
# Send a request with custom headers
headers = {'User-Agent': 'Mozilla/5.0', 'Accept-Language': 'en'}
yield scrapy.Request(url="https://example.com", headers=headers)
Before Example:
Your spider is blocked due to a missing or default user-agent.
Request blocked by server.
After Example:
With custom headers, your spider mimics a real browser request!
Request accepted with custom headers.
Challenge: Experiment with different headers like Referer and Accept-Encoding to bypass bot detection.
14. Downloading Files (media)
Boilerplate Code:
yield scrapy.Request(url, callback=self.save_file)
Use Case: Use Scrapy to download files like images or PDFs from the web.
Goal: Automate the process of downloading media files from web pages.
Sample Code:
def save_file(self, response):
    filename = response.url.split("/")[-1]
    with open(filename, 'wb') as f:
        f.write(response.body)
Before Example:
You manually download files, which is time-consuming.
Files manually downloaded.
After Example:
With Scrapy, files are automatically downloaded and saved!
Files automatically saved to your system.
Challenge: Try downloading multiple file types (e.g., images, PDFs, audio) from a website.
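For the challenge, Scrapy's built-in FilesPipeline can handle the downloads (including deduplication) for you; a sketch, where the folder name and file extensions are illustrative:

# settings.py
ITEM_PIPELINES = {'scrapy.pipelines.files.FilesPipeline': 1}
FILES_STORE = 'downloads'   # local folder where downloaded files are saved

# inside your spider's parse(): yield items with a 'file_urls' field
yield {'file_urls': [response.urljoin(href)
                     for href in response.css('a::attr(href)').getall()
                     if href.endswith(('.pdf', '.jpg', '.mp3'))]}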
15. Using CrawlSpider (CrawlSpider Class)
Boilerplate Code:
from scrapy.spiders import CrawlSpider, Rule
Use Case: Use CrawlSpider to handle more complex crawling, with automatic link extraction.
Goal: Define rules to crawl a website efficiently, automatically following links.
Sample Code:
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class MySpider(CrawlSpider):
    name = 'my_crawler'
    start_urls = ['https://example.com']
    rules = [Rule(LinkExtractor(allow=('category/',)), callback='parse_item')]

    def parse_item(self, response):
        # Extract data from each matched page
        yield {'title': response.css('title::text').get()}
Before Example:
Your spider requires manual coding to follow links and extract data.
Manually coded link following.
After Example:
With CrawlSpider, link extraction and crawling are automated!
Automatic crawling and data extraction based on rules.
Challenge: Define multiple rules for different types of links and customize crawling behavior.
16. Throttling Requests (AUTOTHROTTLE)
Boilerplate Code:
settings.set('AUTOTHROTTLE_ENABLED', True)
Use Case: Enable AutoThrottle to control the speed of requests dynamically.
Goal: Prevent being blocked by websites by adjusting request rates.
Sample Code:
settings.set('AUTOTHROTTLE_ENABLED', True)
settings.set('AUTOTHROTTLE_START_DELAY', 1)
settings.set('AUTOTHROTTLE_MAX_DELAY', 10)
Before Example:
Your spider sends too many requests too quickly, getting blocked by websites.
Website blocks requests due to high volume.
After Example:
With AutoThrottle, your spider automatically adjusts request speed to avoid detection!
Spider adapts to avoid being blocked.
Challenge: Try combining AutoThrottle with a proxy or user-agent rotation to further avoid detection.
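In settings.py the same AutoThrottle configuration looks like this (the values are illustrative):

# settings.py
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1            # initial download delay in seconds
AUTOTHROTTLE_MAX_DELAY = 10             # upper bound when the server responds slowly
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0   # average number of parallel requests per site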
17. Handling Redirects (REDIRECT_ENABLED)
Boilerplate Code:
settings.set('REDIRECT_ENABLED', False)
Use Case: Control how your spider handles redirects (enable/disable).
Goal: Decide whether to follow redirects or handle them manually.
Sample Code:
settings.set('REDIRECT_ENABLED', False) # Prevent following redirects
Before Example:
Your spider follows redirects, leading to pages you don't want to scrape.
Unwanted redirects followed.
After Example:
With redirects disabled, your spider stays on the original page and handles redirects manually!
Redirects are not automatically followed.
Challenge: Try enabling redirects and handling specific redirects programmatically.
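Redirects can also be controlled per request rather than globally; a sketch for the challenge using the dont_redirect and handle_httpstatus_list Request.meta keys (the URL is an example, and import scrapy is assumed at the top of the module):

# inside your spider:
def parse(self, response):
    yield scrapy.Request(
        'https://example.com/maybe-redirects',
        meta={'dont_redirect': True, 'handle_httpstatus_list': [301, 302]},
        callback=self.check_redirect,
    )

def check_redirect(self, response):
    if response.status in (301, 302):
        # Decide manually whether to follow the Location header
        location = response.headers.get('Location')
        self.logger.info('Redirect target: %s', location)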
18. Rotating User Agents (fake_useragent)
Boilerplate Code:
from fake_useragent import UserAgent
Use Case: Rotate user agents to avoid detection by websites.
Goal: Prevent being blocked by websites that monitor for bots with static user agents.
Sample Code:
import scrapy
from fake_useragent import UserAgent

# inside your Spider class:
def start_requests(self):
    ua = UserAgent()
    headers = {'User-Agent': ua.random}   # a different browser string on each call
    yield scrapy.Request(url='https://example.com', headers=headers)
Before Example:
You use the same user-agent for all requests, making it easy for websites to detect you as a bot.
Static user-agent leads to detection.
After Example:
With rotating user agents, you reduce the chance of being detected!
User-agent rotated for each request.
Challenge: Try using multiple user-agent strings and test different websites to see which are most effective.
19. Logging (LOG_LEVEL)
Boilerplate Code:
settings.set('LOG_LEVEL', 'INFO')
Use Case: Set the log level to control the verbosity of Scrapy's logging.
Goal: Adjust the level of logging (e.g., DEBUG, INFO, WARNING, ERROR).
Sample Code:
settings.set('LOG_LEVEL', 'DEBUG') # Show detailed logging info
Before Example:
Your logs are too verbose or too quiet, making it hard to debug or monitor the spider.
Irrelevant or missing log data.
After Example:
With log level control, you see only the logs you need!
Logs set to "DEBUG" for detailed information.
Challenge: Experiment with different log levels and monitor how your spider behaves in each case.
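The log level can also be set per spider, and your own messages go through the spider's logger; a small sketch:

import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'
    custom_settings = {'LOG_LEVEL': 'INFO'}

    def parse(self, response):
        # Emitted only when LOG_LEVEL is DEBUG
        self.logger.debug('Parsing %s', response.url)
        # Visible at INFO and DEBUG levels
        self.logger.info('Got status %s for %s', response.status, response.url)
        yield {'title': response.css('title::text').get()}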
20. Middleware (Custom Middleware)
Boilerplate Code:
class MyCustomMiddleware:
    def process_request(self, request, spider):
        # Custom request processing logic
        return None
Use Case: Write custom middleware to modify requests or responses before/after they are handled.
Goal: Intercept and modify requests or responses dynamically during scraping.
Sample Code:
class MyCustomMiddleware:
    def process_request(self, request, spider):
        # Add a custom header to all requests
        request.headers['Custom-Header'] = 'MyValue'
        return None
Before Example:
You need to modify requests/responses dynamically, but there's no built-in feature for your use case.
Static request handling.
After Example:
With middleware, you can intercept and modify requests or responses as needed!
Custom headers added to all requests.
Challenge: Try using middleware to retry failed requests or handle custom error conditions.
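Middleware only runs once it is registered; a sketch of enabling the class above in settings.py (the module path assumes the class lives in myproject/middlewares.py):

# settings.py
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.MyCustomMiddleware': 543,
}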