Web Data Scraping Demystified: A Tutorial with Real-World Examples


Data is the fuel that powers innovation. Whether you’re a data scientist, a hobbyist, or a developer with a passion for information, having access to quality data is crucial. One way to gather unique datasets for analysis or research is by scraping websites, provided you do so responsibly and ethically. In this tutorial, we explain how to scrape article content from a website, using BBC Yoruba articles as an example. We also discuss reasons you might want to scrape data, and outline best practices to ensure your work is legal and respectful of the source.
Why Scrape Data?
Scraping data can be extremely valuable for several reasons:
Research and Analysis:
Researchers and analysts often need data that is not readily available through APIs or public datasets. Scraping allows you to gather specific pieces of content, such as news articles, blog posts, or user reviews, for sentiment analysis, trend forecasting, or academic studies.
Competitive Intelligence:
Companies frequently use scraping to monitor trends, customer feedback, and market dynamics without the high cost of traditional data sources.
Content Aggregation:
Whether you’re building a news aggregator, a personal recommendation system, or a niche content portal, scraping lets you compile and present information from multiple sources in one place.
Learning and Development:
Experimenting with data scraping can be an excellent way to improve your coding skills, learn about data processing pipelines, and understand real-world challenges in data collection.
Before you begin scraping any website, it’s imperative to check its permissions via the robots.txt
file and review the Terms of Service to ensure that your activity is allowed.
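As a quick illustration of the robots.txt part of that check, here is a minimal sketch using Python's standard-library urllib.robotparser. The user agent string is a hypothetical placeholder, and the URL is the example article used later in this tutorial:

from urllib import robotparser

# Hypothetical values: substitute your own crawler name and target URL
USER_AGENT = "my-research-bot"
ARTICLE_URL = "https://www.bbc.com/yoruba/articles/c62q3vn8666o"

parser = robotparser.RobotFileParser()
parser.set_url("https://www.bbc.com/robots.txt")
parser.read()  # Download and parse the site's robots.txt

if parser.can_fetch(USER_AGENT, ARTICLE_URL):
    print("robots.txt allows fetching this URL")
else:
    print("robots.txt disallows fetching this URL; do not scrape it")

Keep in mind that robots.txt only covers crawler rules; the Terms of Service still need to be reviewed separately.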
Prerequisites
1. Install Python
Make sure you have Python installed on your system. Python is a versatile language, perfect for scraping tasks.
2. Create and Activate a Virtual Environment
Working within a virtual environment keeps your dependencies organized. Here’s how to set one up:
# Create a new virtual environment named 'env'
python3 -m venv env
# Activate the virtual environment (Linux/macOS)
source env/bin/activate
# On Windows, run:
# env\Scripts\activate
3. Install Required Libraries
With your virtual environment active, install the libraries we’ll need:
(env) pip install requests beautifulsoup4
requests: Makes HTTP requests to fetch webpage content.
BeautifulSoup: Parses the HTML so you can extract the necessary data.
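To see how the two libraries fit together, here is a minimal sketch that fetches a page and prints its title and first paragraph. The URL is only a placeholder, and the tags you target will depend on the site you are working with:

import requests
from bs4 import BeautifulSoup

# Placeholder URL; replace it with a page you have permission to fetch
url = "https://www.bbc.com/yoruba"

response = requests.get(url, timeout=10)  # requests fetches the raw HTML
soup = BeautifulSoup(response.content, "html.parser")  # BeautifulSoup parses it

# Print the page title and the first paragraph, if they exist
print(soup.title.get_text(strip=True) if soup.title else "No <title> found")
first_paragraph = soup.find("p")
if first_paragraph:
    print(first_paragraph.get_text(strip=True))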
Best Practices for Web Scraping
Scraping can be a powerful tool, but it comes with responsibilities. Here are some best practices:
Confirm Permission:
- Always Verify: Before scraping, check the website’s robots.txt (for instance, BBC’s robots.txt) and its Terms of Service. The robots.txt file outlines which parts of the site may be crawled and by whom.
Respect the Source:
- Avoid Overloading Servers: Use delays between requests (known as throttling) to avoid overwhelming the website’s server. Respect the site's bandwidth and performance.
Implement Robust Error Handling:
- Plan for Uncertainty: Websites change. Network issues, layout changes, or access denials can occur. Write your code to handle these gracefully; a short sketch that combines throttling with simple retry handling follows this list.
Use Clear Documentation and Type Hints:
- Improve Readability: Using docstrings and type annotations not only makes your code more maintainable, but also easier for others, and your future self, to understand and modify.
Include Metadata:
- Ensure Traceability: Always store metadata (like the source URL and provider information) along with your scraped data. This is essential for proper citation and maintaining data provenance.
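To make the throttling and error-handling advice concrete, here is a small sketch of a polite fetch helper. The delay value, retry count, and timeout are illustrative assumptions rather than values prescribed by any particular site:

from typing import Optional
import time
import requests

def polite_get(url: str, delay: float = 5.0, max_retries: int = 3) -> Optional[requests.Response]:
    """Fetch a URL politely: time out, retry on transient errors, and back off between attempts."""
    for attempt in range(1, max_retries + 1):
        try:
            response = requests.get(url, timeout=10)
            response.raise_for_status()  # Treat 4xx/5xx responses as failures
            return response
        except requests.RequestException as error:
            print(f"Attempt {attempt} failed for {url}: {error}")
            time.sleep(delay * attempt)  # Simple linear back-off before retrying
    return None

When scraping several pages, you would also sleep for the same delay between successive calls so requests are spread out rather than fired in a burst.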
Example Code
Below is a complete code example that scrapes an article from BBC Yoruba, organises the content into a structured JSON format, and saves it to a file. This example is tailored to the BBC’s structure, but remember to inspect and adjust the selectors for the structure of any website you work with.
import os
import json
import time
import requests
from bs4 import BeautifulSoup
from typing import List, Optional, Dict, Any
def scrape_article_content(article_url: str, language: str = "yoruba") -> Optional[List[Dict[str, Any]]]:
"""
Scrape content from an article URL and return a list of structured paragraph data.
Each paragraph is represented as a dictionary containing:
- text: The cleaned paragraph text.
- label: The language of the article.
- format: The content format (set to 'article').
- source: A dictionary with the provider (e.g., 'BBC') and the original article URL.
- citation: Additional citation info (None in this example).
Args:
article_url (str): The URL of the article.
language (str): The language label for the article content (default is "yooruba").
Returns:
Optional[List[Dict[str, Any]]]: A list of dictionaries containing paragraph data,
or None if the article could not be fetched.
"""
try:
response = requests.get(article_url)
if response.status_code == 200:
soup = BeautifulSoup(response.content, "html.parser")
# Find the main content container. Adjust tags as needed for the specific website.
content_div = soup.find("main") or soup.find("article")
paragraphs = content_div.find_all("p") if content_div else []
article_texts = [
{
"text": paragraph.get_text(strip=True).replace("\n", " "),
"label": language,
"format": "article",
"source": {
"provider": "BBC",
"url": article_url,
},
"citation": None,
}
for paragraph in paragraphs
if len(paragraph.get_text(strip=True).replace("\n", " ")) > 50 # Only include meaningful content
]
return article_texts
else:
print(f"Failed to fetch article: {article_url} (Status: {response.status_code})")
return None
except Exception as e:
print(f"Error scraping article: {article_url} (Error: {e})")
return None
def save_dataset_to_file(dataset: List[Dict[str, Any]], file_path: str) -> None:
"""
Save the scraped dataset to a JSON file in a structured format.
Args:
dataset (List[Dict[str, Any]]): A list of data dictionaries to save.
file_path (str): The file path where the dataset will be saved.
"""
try:
with open(file_path, "w", encoding="utf-8") as f:
json.dump(dataset, f, ensure_ascii=False, indent=4)
print(f"Dataset successfully saved to {file_path}")
except Exception as e:
print(f"Error saving dataset to file: {e}")
def main() -> None:
"""
Main function to scrape an article, compile the data, and save it to a JSON file.
"""
# Example BBC article URL (replace with a valid URL for live testing)
article_url = "https://www.bbc.com/yoruba/articles/c62q3vn8666o"
language = "yoruba"
# IMPORTANT: Confirm that the website allows scraping by checking its robots.txt and terms of service.
# Scrape the article content
dataset = scrape_article_content(article_url, language)
if dataset:
# Create the output directory if it doesn't exist
output_dir = "datasets"
os.makedirs(output_dir, exist_ok=True)
# Define the JSON file path for storing the dataset
file_path = os.path.join(output_dir, "article_dataset.json")
# Save the scraped data to a JSON file
save_dataset_to_file(dataset, file_path)
else:
print("No data to save.")
# Pause briefly between requests to avoid overwhelming the server
time.sleep(5)
if __name__ == "__main__":
main()
Explanation
Scraping Function (scrape_article_content):
- Sends an HTTP GET request to the specified article URL.
- Parses the HTML content using BeautifulSoup and extracts text from paragraph elements within a designated main content container.
- Filters out paragraphs that are too short to be meaningful.
- Packages the text along with metadata (language, format, source, etc.) into a list of dictionaries.
Dataset Saving Function (save_dataset_to_file):
- Writes the collected data into a JSON file, making it easy to share, analyse, or integrate with other applications.
Main Function:
- Defines an example article URL and the language label.
- Reminds you to check the website’s robots.txt and Terms of Service for scraping permission.
- Executes the scraping function and, if successful, saves the dataset.
- Includes a short delay to be courteous to the website’s server.
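For reference, a single record in the resulting article_dataset.json file looks roughly like the following; the text value here is an illustrative placeholder rather than real scraped content:

[
    {
        "text": "First meaningful paragraph of the article, cleaned and longer than 50 characters...",
        "label": "yoruba",
        "format": "article",
        "source": {
            "provider": "BBC",
            "url": "https://www.bbc.com/yoruba/articles/c62q3vn8666o"
        },
        "citation": null
    }
]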
Conclusion
Scraping data offers a powerful way to collect information for research, analysis, and innovative applications. However, always balance your technical curiosity with respect for the data source. Before you start scraping, ensure you have permission by reviewing the site’s robots.txt
and terms of use. With proper planning, clear documentation, and adherence to best practices, you can create a robust scraper that gathers valuable data while honouring ethical and legal considerations.
Happy scraping, and remember, data is a precious resource best handled responsibly!
Written by
TemiTope Kayode
Seasoned software engineer and founder specialising in web and mobile applications, enterprise applications, cloud computing, and DevOps, using tools like Django, React, Flutter, AWS, and DigitalOcean. Currently a Senior Software Developer and a mentor, I balance coding with family and leisure. I hold a distinction in a Master's in Computer Science from Coventry University, blending education with practical prowess. Passionate about technology and innovation, I am eager to connect and explore new possibilities.