20 Scrapy Concepts with Before-and-After Examples

Anix Lynch

1. Creating a Scrapy Project πŸ“

Boilerplate Code:

scrapy startproject myproject

Use Case: Initialize a new Scrapy project. πŸ“

Goal: Set up the basic structure for your Scrapy project. 🎯

Sample Command:

scrapy startproject myproject

Before Example:
You need to scrape data but don’t have a project structure. πŸ€”

No project directory.

After Example:
With scrapy startproject, you get a fully scaffolded project directory! πŸ“

myproject/
    β”œβ”€β”€ myproject/
    β”œβ”€β”€ scrapy.cfg
    └── ...

Challenge: 🌟 Create a Scrapy project and explore the generated files (settings.py, items.py, pipelines.py, and the spiders/ folder) to see what each one is for.


2. Creating a Spider πŸ•·οΈ

Boilerplate Code:

scrapy genspider spider_name domain.com

Use Case: Create a Spider to scrape a specific website. πŸ•·οΈ

Goal: Set up a spider that defines how to crawl and parse a website. 🎯

Sample Command:

scrapy genspider myspider example.com

Before Example:
You want to scrape a website but don’t have a spider defined. πŸ€”

No spider available.

After Example:
With scrapy genspider, you generate a spider file ready to customize! πŸ•·οΈ

myproject/spiders/myspider.py

Challenge: 🌟 Try creating spiders for multiple domains and define the rules for each.


3. Running a Spider πŸƒβ€β™‚οΈ

Boilerplate Code:

scrapy crawl spider_name

Use Case: Use crawl to run your Scrapy spider. πŸƒβ€β™‚οΈ

Goal: Execute the spider to crawl and scrape data from the target site. 🎯

Sample Command:

scrapy crawl myspider

Before Example:
You’ve written your spider but don’t know how to execute it. πŸ€”

Spider exists, but no data collected.

After Example:
With scrapy crawl, the spider runs, scrapes, and collects data! πŸƒβ€β™‚οΈ

Data is collected and printed or stored.

Challenge: 🌟 Run the spider with the -o option to save scraped data into a file (e.g., json, csv).
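
For instance, the feed export option writes items straight to a file whose format is inferred from the extension (the file names here are arbitrary):

scrapy crawl myspider -o items.json
scrapy crawl myspider -o items.csv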


4. Parsing Responses (parse method) πŸ”

Boilerplate Code:

def parse(self, response):
    # Extract data here
    pass

Use Case: Define the parse method to handle responses and extract data from them. πŸ”

Goal: Extract data from the HTML content of the page. 🎯

Sample Code:

def parse(self, response):
    title = response.css('title::text').get()
    yield {'title': title}

Before Example:
You have a spider that crawls pages but doesn’t extract specific data. πŸ€”

HTML response is received but no data extracted.

After Example:
With parse, you extract specific elements from the page! πŸ”

Extracted data: {"title": "Example Title"}

Challenge: 🌟 Try extracting multiple fields like headers, paragraphs, or links using CSS or XPath selectors.
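
As a starting point, here is a hedged sketch of a parse method that pulls several fields at once (the selectors are generic and will need adjusting to the page you scrape):

def parse(self, response):
    # Heading text, paragraph text, and link URLs via generic CSS selectors
    yield {
        'heading': response.css('h1::text').get(),
        'paragraphs': response.css('p::text').getall(),
        'links': response.css('a::attr(href)').getall(),
    }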


5. CSS Selectors (response.css) 🌐

Boilerplate Code:

response.css('css_selector')

Use Case: Use CSS selectors to locate elements within the HTML response. 🌐

Goal: Select and extract data using CSS-like syntax. 🎯

Sample Code:

title = response.css('title::text').get()

Before Example:
You have an HTML response but can’t efficiently extract specific elements. πŸ€”

Data: <title>Example Title</title>

After Example:
With CSS selectors, you can easily extract the desired text or attributes! 🌐

Output: "Example Title"

Challenge: 🌟 Use CSS selectors to extract different elements such as images (img::attr(src)), links (a::attr(href)), or text.
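
For example, a few common selector patterns (assuming the page actually contains img, a, and p tags):

# All image URLs on the page
image_urls = response.css('img::attr(src)').getall()

# All link URLs
link_urls = response.css('a::attr(href)').getall()

# Text of the first paragraph
first_paragraph = response.css('p::text').get()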


6. XPath Selectors (response.xpath) 🧭

Boilerplate Code:

response.xpath('xpath_expression')

Use Case: Use XPath selectors to extract elements from the HTML response. 🧭

Goal: Use powerful XPath expressions for more flexible or complex queries. 🎯

Sample Code:

title = response.xpath('//title/text()').get()

Before Example:
You need to extract elements but CSS selectors are not flexible enough. πŸ€”

Data: <title>Example Title</title>

After Example:
With XPath, you can extract data using more complex queries! 🧭

Output: "Example Title"

Challenge: 🌟 Try using XPath to extract nested elements or multiple attributes in a single query.
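
As a sketch, a single XPath expression can combine conditions and reach nested elements (the div class used here is hypothetical):

# Link text inside a specific container
titles = response.xpath('//div[@class="post"]//a/text()').getall()

# href and title attributes of the same links
hrefs = response.xpath('//div[@class="post"]//a/@href').getall()
title_attrs = response.xpath('//div[@class="post"]//a/@title').getall()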


7. Following Links (response.follow) πŸ”—

Boilerplate Code:

response.follow(link, callback)

Use Case: Use follow to navigate to links and scrape multiple pages. πŸ”—

Goal: Extract links from a page and follow them to scrape additional pages. 🎯

Sample Code:

for href in response.css('a::attr(href)').getall():
    yield response.follow(href, self.parse)

Before Example:
Your spider scrapes a single page but doesn’t navigate to other linked pages. πŸ€”

Only the first page is scraped.

After Example:
With response.follow, you can follow links and scrape multiple pages! πŸ”—

The spider navigates and scrapes linked pages.

Challenge: 🌟 Try following only specific links, such as those that contain certain keywords or paths.
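
One hedged way to restrict which links are followed is a simple keyword filter on the href (the '/blog/' path is just an example):

def parse(self, response):
    for href in response.css('a::attr(href)').getall():
        # Only follow links whose path contains a keyword of interest
        if '/blog/' in href:
            yield response.follow(href, callback=self.parse)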


8. Storing Data (Item Pipeline) πŸ“Š

Boilerplate Code:

class MyItemPipeline:
    def process_item(self, item, spider):
        # Process and store the item
        return item

Use Case: Use item pipelines to store or process the scraped data. πŸ“Š

Goal: Define how scraped data should be processed and stored after extraction. 🎯

Sample Code:

class MyItemPipeline:
    def process_item(self, item, spider):
        # Save item to a file or database
        with open('output.txt', 'a') as f:
            f.write(f"{item}\n")
        return item

Before Example:
You’ve extracted data but have no way to store or process it. πŸ€”

Scraped data is printed but not saved.

After Example:
With pipelines, you can process and store data in files, databases, etc.! πŸ“Š

Output: Data is saved to a file or database.

Challenge: 🌟 Try implementing pipelines to save data in formats like CSV or JSON.
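
A minimal sketch of a pipeline that appends each item as one JSON object per line (the file name is arbitrary; for plain JSON or CSV output, the built-in feed exports from concept 3 are usually simpler):

import json

class JsonLinesPipeline:
    def process_item(self, item, spider):
        # Append each scraped item as a single JSON line
        with open('items.jl', 'a', encoding='utf-8') as f:
            f.write(json.dumps(dict(item)) + '\n')
        return item

Remember that a pipeline only runs once it is listed under ITEM_PIPELINES in settings.py.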


9. Defining Items (Item Class) πŸ“‹

Boilerplate Code:

from scrapy import Item, Field

Use Case: Define a structured Item to represent the data you are scraping. πŸ“‹

Goal: Organize the scraped data into a structured format. 🎯

Sample Code:

class MyItem(Item):
    title = Field()
    link = Field()

Before Example:
You’ve scraped data but don’t have a structured format to represent it. πŸ€”

Unstructured data extraction.

After Example:
With Item, your data is organized into fields for better structure and processing! πŸ“‹

Structured data: {"title": "Example", "link": "https://example.com"}

Challenge: 🌟 Try defining multiple fields and extract values for each one using CSS or XPath.
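
For example, a hedged sketch pairing an Item definition with extraction in a spider callback (the .author selector is hypothetical):

from scrapy import Item, Field

class ArticleItem(Item):
    title = Field()
    link = Field()
    author = Field()

def parse(self, response):
    item = ArticleItem()
    item['title'] = response.css('title::text').get()
    item['link'] = response.url
    item['author'] = response.css('.author::text').get()  # hypothetical selector
    yield item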


10. Handling Pagination (next page) πŸ”„

Boilerplate Code:

next_page = response.css('a.next::attr(href)').get()
if next_page:
    yield response.follow(next_page, self.parse)

Use Case: Handle pagination to scrape data across multiple pages. πŸ”„

Goal: Automatically navigate through paginated content to collect more data. 🎯

Sample Code:

def parse(self, response):
    # Extract data from the current page
    yield {'title': response.css('title::text').get()}

    # Follow the pagination link
    next_page = response.css('a.next::attr(href)').get()
    if next_page:
        yield response.follow(next_page, self.parse)

Before Example:
Your spider scrapes only the first page of a paginated website. πŸ€”

Data is limited to the first page.

After Example:
With pagination handling, the spider follows links and scrapes additional pages! πŸ”„

Data collected from multiple pages.

Challenge: 🌟 Try handling pagination where the "next" button has different forms (e.g., buttons, JavaScript events).
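
When the next link is not a tidy a.next anchor, one hedged alternative is matching it by its visible text (the link text is an assumption about the target page; pagination driven purely by JavaScript needs a browser-based add-on such as scrapy-playwright):

def parse(self, response):
    yield {'title': response.css('title::text').get()}

    # Find the pagination link by its visible text instead of a CSS class
    next_page = response.xpath('//a[contains(text(), "Next")]/@href').get()
    if next_page:
        yield response.follow(next_page, callback=self.parse)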


11. Configuring Settings (Settings Module) βš™οΈ

Boilerplate Code:

from scrapy.utils.project import get_project_settings

Use Case: Use the settings module to configure how Scrapy runs. βš™οΈ

Goal: Adjust settings like user-agent, download delays, and more. 🎯

Sample Code:

settings = get_project_settings()
settings.set('USER_AGENT', 'Mozilla/5.0 (compatible; MyScrapyBot/1.0)')

Before Example:
Your spider runs with default settings, like a default user-agent, causing potential blocking. πŸ€”

Scrapy default settings in use.

After Example:
With custom settings, you can fine-tune spider behavior like user-agent and download delays! βš™οΈ

Custom user-agent or settings applied.

Challenge: 🌟 Try adding download delays to prevent being blocked by websites (DOWNLOAD_DELAY = 2).
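
In practice these options usually live in the project's settings.py; a hedged example:

# settings.py
USER_AGENT = 'Mozilla/5.0 (compatible; MyScrapyBot/1.0)'
DOWNLOAD_DELAY = 2       # wait 2 seconds between requests
ROBOTSTXT_OBEY = True    # respect robots.txt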


12. Handling Cookies (COOKIES_ENABLED) πŸͺ

Boilerplate Code:

settings.set('COOKIES_ENABLED', True)

Use Case: Enable or disable cookies in your Scrapy project. πŸͺ

Goal: Control how your spider handles cookies for session-based scraping. 🎯

Sample Code:

settings.set('COOKIES_ENABLED', True)

Before Example:
Your spider struggles to maintain a session because cookies are not handled. πŸ€”

Session information is lost.

After Example:
With cookies enabled, your spider maintains sessions correctly across requests! πŸͺ

Session data maintained via cookies.

Challenge: 🌟 Try scraping a website that requires login using cookies to maintain the session.
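
A hedged sketch of such a login flow: submit the login form first, then keep scraping while Scrapy carries the session cookies along automatically (the URL and form field names are placeholders):

import scrapy

class LoginSpider(scrapy.Spider):
    name = 'login_spider'
    start_urls = ['https://example.com/login']

    def parse(self, response):
        # Fill and submit the login form found on the page
        yield scrapy.FormRequest.from_response(
            response,
            formdata={'username': 'user', 'password': 'pass'},
            callback=self.after_login,
        )

    def after_login(self, response):
        # Cookies set by the login response are reused on later requests
        yield {'title': response.css('title::text').get()}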


13. Customizing Request Headers (headers) πŸ“œ

Boilerplate Code:

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
    'Accept-Language': 'en'
}
yield scrapy.Request(url, headers=headers)

Use Case: Customize headers in your requests to mimic real browser behavior. πŸ“œ

Goal: Avoid detection by websites and mimic genuine users. 🎯

Sample Code:

# Send a request with custom headers
headers = {'User-Agent': 'Mozilla/5.0', 'Accept-Language': 'en'}
yield scrapy.Request(url="https://example.com", headers=headers)

Before Example:
Your spider is blocked due to a missing or default user-agent. πŸ€”

Request blocked by server.

After Example:
With custom headers, your spider mimics a real browser request! πŸ“œ

Request accepted with custom headers.

Challenge: 🌟 Experiment with different headers like Referer and Accept-Encoding to bypass bot detection.
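
For example (the header values are illustrative):

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
    'Referer': 'https://www.google.com/',
    'Accept-Language': 'en-US,en;q=0.9',
    'Accept-Encoding': 'gzip, deflate',
}
yield scrapy.Request(url='https://example.com', headers=headers)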


14. Downloading Files (media) πŸ“‚

Boilerplate Code:

yield scrapy.Request(url, callback=self.save_file)

Use Case: Use Scrapy to download files like images or PDFs from the web. πŸ“‚

Goal: Automate the process of downloading media files from web pages. 🎯

Sample Code:

def save_file(self, response):
    filename = response.url.split("/")[-1]
    with open(filename, 'wb') as f:
        f.write(response.body)

Before Example:
You manually download files, which is time-consuming. πŸ€”

Files manually downloaded.

After Example:
With Scrapy, files are automatically downloaded and saved! πŸ“‚

Files automatically saved to your system.

Challenge: 🌟 Try downloading multiple file types (e.g., images, PDFs, audio) from a website.
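
One hedged way to cover several file types is to filter link extensions in parse and hand each match to the save_file callback above:

def parse(self, response):
    for href in response.css('a::attr(href)').getall():
        # Follow only links that look like downloadable media
        if href.lower().endswith(('.pdf', '.jpg', '.png', '.mp3')):
            yield response.follow(href, callback=self.save_file)

For larger media jobs, Scrapy also ships dedicated FilesPipeline and ImagesPipeline components that handle downloading and storage for you.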


15. Using CrawlSpider (CrawlSpider Class) πŸ•ΈοΈ

Boilerplate Code:

from scrapy.spiders import CrawlSpider, Rule

Use Case: Use CrawlSpider to handle more complex crawling, with automatic link extraction. πŸ•ΈοΈ

Goal: Define rules to crawl a website efficiently, automatically following links. 🎯

Sample Code:

from scrapy.linkextractors import LinkExtractor

class MySpider(CrawlSpider):
    name = 'my_crawler'
    start_urls = ['https://example.com']
    rules = [Rule(LinkExtractor(allow=('category/',)), callback='parse_item')]

    def parse_item(self, response):
        # Extract data
        yield {'title': response.css('title::text').get()}

Before Example:
Your spider requires manual coding to follow links and extract data. πŸ€”

Manually coded link following.

After Example:
With CrawlSpider, link extraction and crawling are automated! πŸ•ΈοΈ

Automatic crawling and data extraction based on rules.

Challenge: 🌟 Define multiple rules for different types of links and customize crawling behavior.
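
A hedged sketch with two rules: one only follows category listings, the other parses product pages (the URL patterns are placeholders):

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class MultiRuleSpider(CrawlSpider):
    name = 'multi_rule_crawler'
    start_urls = ['https://example.com']
    rules = (
        # Keep crawling category pages without extracting anything from them
        Rule(LinkExtractor(allow=('category/',)), follow=True),
        # Extract data from product pages
        Rule(LinkExtractor(allow=('product/',)), callback='parse_item'),
    )

    def parse_item(self, response):
        yield {'title': response.css('title::text').get()}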


16. Throttling Requests (AUTOTHROTTLE) ⏳

Boilerplate Code:

settings.set('AUTOTHROTTLE_ENABLED', True)

Use Case: Enable AutoThrottle to control the speed of requests dynamically. ⏳

Goal: Prevent being blocked by websites by adjusting request rates. 🎯

Sample Code:

settings.set('AUTOTHROTTLE_ENABLED', True)
settings.set('AUTOTHROTTLE_START_DELAY', 1)
settings.set('AUTOTHROTTLE_MAX_DELAY', 10)

Before Example:
Your spider sends too many requests too quickly, getting blocked by websites. πŸ€”

Website blocks requests due to high volume.

After Example:
With AutoThrottle, your spider automatically adjusts request speed to avoid detection! ⏳

Spider adapts to avoid being blocked.

Challenge: 🌟 Try combining AutoThrottle with a proxy or user-agent rotation to further avoid detection.


17. Handling Redirects (REDIRECT_ENABLED) πŸ”„

Boilerplate Code:

settings.set('REDIRECT_ENABLED', False)

Use Case: Control how your spider handles redirects (enable/disable). πŸ”„

Goal: Decide whether to follow redirects or handle them manually. 🎯

Sample Code:

settings.set('REDIRECT_ENABLED', False)  # Prevent following redirects

Before Example:
Your spider follows redirects, leading to pages you don't want to scrape. πŸ€”

Unwanted redirects followed.

After Example:
With redirects disabled, your spider stays on the original page and handles redirects manually! πŸ”„

Redirects are not automatically followed.

Challenge: 🌟 Try enabling redirects and handling specific redirects programmatically.
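
A hedged sketch: if the redirect middleware is told to stand down for a request, the 3xx response reaches your callback and you can inspect the Location header yourself (the URL is a placeholder):

def start_requests(self):
    # dont_redirect disables the redirect middleware for this request only
    yield scrapy.Request(
        url='https://example.com/old-page',
        meta={'dont_redirect': True, 'handle_httpstatus_list': [301, 302]},
        callback=self.parse_redirect,
    )

def parse_redirect(self, response):
    if response.status in (301, 302):
        target = response.headers.get('Location', b'').decode()
        self.logger.info('Redirect target: %s', target)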


18. Rotating User Agents (FAKE USER AGENT) πŸ”„

Boilerplate Code:

from fake_useragent import UserAgent

Use Case: Rotate user agents to avoid detection by websites. πŸ”„

Goal: Prevent being blocked by websites that monitor for bots with static user agents. 🎯

Sample Code:

from fake_useragent import UserAgent

def start_requests(self):
    ua = UserAgent()
    headers = {'User-Agent': ua.random}
    yield scrapy.Request(url='https://example.com', headers=headers)

Before Example:
You use the same user-agent for all requests, making it easy for websites to detect you as a bot. πŸ€”

Static user-agent leads to detection.

After Example:
With rotating user agents, you reduce the chance of being detected! πŸ”„

User-agent rotated for each request.

Challenge: 🌟 Try using multiple user-agent strings and test different websites to see which are most effective.
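
A hedged sketch using a hand-picked list instead of fake_useragent (the strings shown are only examples):

import random
import scrapy

USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15',
    'Mozilla/5.0 (X11; Linux x86_64; rv:124.0) Gecko/20100101 Firefox/124.0',
]

def start_requests(self):
    for url in self.start_urls:
        # Pick a different user agent for each request
        yield scrapy.Request(url, headers={'User-Agent': random.choice(USER_AGENTS)})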


19. Logging (LOG_LEVEL) πŸ“

Boilerplate Code:

settings.set('LOG_LEVEL', 'INFO')

Use Case: Set the log level to control the verbosity of Scrapy’s logging. πŸ“

Goal: Adjust the level of logging (e.g., DEBUG, INFO, WARNING, ERROR). 🎯

Sample Code:

settings.set('LOG_LEVEL', 'DEBUG')  # Show detailed logging info

Before Example:
Your logs are too verbose or too quiet, making it hard to debug or monitor the spider. πŸ€”

Irrelevant or missing log data.

After Example:
With log level control, you see only the logs you need! πŸ“

Logs set to "DEBUG" for detailed information.

Challenge: 🌟 Experiment with different log levels and monitor how your spider behaves in each case.
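
One hedged alternative is setting the level per spider with custom_settings instead of the global settings object:

import scrapy

class QuietSpider(scrapy.Spider):
    name = 'quiet_spider'
    # Only this spider runs with reduced log verbosity
    custom_settings = {'LOG_LEVEL': 'WARNING'}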


20. Middleware (Custom Middleware) βš™οΈ

Boilerplate Code:


class MyCustomMiddleware:
    def process_request(self, request, spider):
        # Custom request processing logic
        return None

Use Case: Write custom middleware to modify requests or responses before/after they are handled. βš™οΈ

Goal: Intercept and modify requests or responses dynamically during scraping. 🎯

Sample Code:

class MyCustomMiddleware:
    def process_request(self, request, spider):
        # Add a custom header to all requests
        request.headers['Custom-Header'] = 'MyValue'
        return None

Before Example:
You need to modify requests/responses dynamically, but there’s no built-in feature for your use case. πŸ€”

Static request handling.

After Example:
With middleware, you can intercept and modify requests or responses as needed! βš™οΈ

Custom headers added to all requests.

Challenge: 🌟 Try using middleware to retry failed requests or handle custom error conditions.
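
As a hedged sketch, a downloader middleware can also inspect responses and re-queue requests that came back with a server error (the status codes and the retried flag are illustrative; Scrapy's built-in RetryMiddleware already covers the common cases):

class RetryOnErrorMiddleware:
    def process_response(self, request, response, spider):
        # Retry once on server errors by re-scheduling a copy of the request
        if response.status in (500, 502, 503) and not request.meta.get('retried'):
            new_request = request.replace(dont_filter=True)
            new_request.meta['retried'] = True
            return new_request
        return response

Like item pipelines, this only takes effect once the class is registered under DOWNLOADER_MIDDLEWARES in settings.py.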

