Web Scraper using Python
Introduction:
Today, data is everywhere - whether for market research or content aggregation, there is a wealth of information just waiting to be analyzed. Manually collecting data from websites is impractical and time-consuming. That is where web scraping comes in: it automates the extraction of valuable data from websites for purposes such as research and SEO analysis.
Why Web Scraping?
Common uses of web scraping include:
Data gathering: Collecting data from blogs, news sites, e-commerce sites, and much more.
Competitor analysis: Monitoring competitor websites for SEO, product pricing, etc.
Research: Collecting data for academic, market, or other types of studies.
Content aggregation: Automatically collecting content from a variety of sources for incorporation into blogs or other websites.
Tools and Libraries Used:
In this project, I used the following tools:
requests: A simple HTTP library for sending requests and fetching web pages.
BeautifulSoup: A library for parsing HTML content and extracting data.
time & random: To simulate real browsing behavior by introducing delays between requests.
logging: For error handling and keeping track of the scraping process.
urllib.parse: To handle relative and absolute URLs.
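The snippets below assume roughly the following imports; note that Retry lives in urllib3 and HTTPAdapter in requests.adapters:
import logging
import random
import time
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry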
How the Code Works:
The main components of the scraper are:
1. Fetching Webpage Data with requests
To start, we need to send an HTTP request to the website and fetch the page’s content.
def get_soup(url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36'
    }
    session = requests.Session()
    retry = Retry(connect=3, backoff_factor=0.5)
    adapter = HTTPAdapter(max_retries=retry)
    session.mount('http://', adapter)
    session.mount('https://', adapter)
    try:
        response = session.get(url, headers=headers)
        response.raise_for_status()  # Raises an error for invalid status codes
        return BeautifulSoup(response.content, 'html.parser')
    except requests.exceptions.RequestException as e:
        logging.error(f"Error fetching {url}: {e}")
        return None
We use requests.Session() for persistent HTTP sessions, and the Retry and HTTPAdapter classes ensure the scraper retries failed requests.
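The Retry configuration above only retries connection errors. If you also want to retry on temporary server errors, urllib3's Retry accepts a status_forcelist; here is a sketch of such a configuration:
# Sketch: also retry on common transient HTTP status codes
retry = Retry(
    total=3,                                     # overall retry budget
    connect=3,                                   # retries for connection errors
    backoff_factor=0.5,                          # exponential backoff between attempts
    status_forcelist=[429, 500, 502, 503, 504],  # retry on these server responses
)
adapter = HTTPAdapter(max_retries=retry)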
2. Extracting Paragraphs
Once we have the HTML content, we use BeautifulSoup to extract all <p> (paragraph) tags.
def extract_paragraphs(soup):
    paragraphs = []
    for p in soup.find_all('p'):
        try:
            paragraphs.append(p.text.strip())
        except AttributeError:
            logging.warning("Error parsing paragraph text")
    return paragraphs
This function finds all paragraph tags and extracts the text from them, stripping out any unnecessary whitespace.
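For example, the two helpers can be combined like this (the URL here is just a placeholder):
soup = get_soup('https://example.com')  # placeholder URL for illustration
if soup:
    paragraphs = extract_paragraphs(soup)
    non_empty = [text for text in paragraphs if text]  # optionally drop empty paragraphs
    print(f"Found {len(non_empty)} non-empty paragraphs")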
3. Extracting Links
We also want to extract all the links (<a>
tags) from the page. The extract_links()
function converts any relative links into absolute URLs using urljoin()
.
def extract_links(soup, base_url, keyword=None):
    links = []
    for link in soup.find_all('a', href=True):
        full_url = urljoin(base_url, link['href'])
        if not keyword or keyword in full_url:
            links.append(full_url)
    return links
This function will help us gather all the URLs (internal and external) from the page. We can filter these URLs with a keyword if needed.
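For instance, to keep only links whose URL contains a given keyword ('python' here is just an example filter):
all_links = extract_links(soup, url)                       # every link on the page
python_links = extract_links(soup, url, keyword='python')  # only URLs containing 'python'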
4. Putting It All Together
Finally, we define the scrape_data() function to fetch the webpage, extract the relevant data (title, paragraphs, links), and log the process.
def scrape_data(url):
    start_time = time.time()
    soup = get_soup(url)
    if not soup:
        return

    # Extract and print title
    title = soup.title.text.strip() if soup.title else 'No title found'
    print(f"Title of the webpage: {title}")

    # Extract and print paragraphs
    print("\nParagraphs:")
    paragraphs = extract_paragraphs(soup)
    for paragraph in paragraphs:
        print(paragraph)

    # Extract and print links
    print("\nLinks:")
    links = extract_links(soup, url)
    for link in links:
        print(link)

    duration = time.time() - start_time
    logging.info(f"Scraped data from {url} in {duration:.2f} seconds")
Running the Scraper
To run the scraper, simply pass the URL you want to scrape:
if __name__ == "__main__":
    url = 'https://www.geeksforgeeks.org/python-web-scraping-tutorial/'
    scrape_data(url)
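One detail worth noting: the scraper logs its timing at INFO level, which Python's logging module hides by default. To see those messages in the console, you can configure logging before calling scrape_data(), for example:
logging.basicConfig(
    level=logging.INFO,                                  # show INFO messages and above
    format='%(asctime)s - %(levelname)s - %(message)s',  # timestamped log lines
)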
Key Features of This Scraper:
Retry Logic: The scraper retries failed HTTP requests to ensure reliable data collection.
Extracts Important Data: It pulls the title, paragraphs, and all links from a webpage.
Logging: Logs errors and execution time, which helps with debugging and monitoring.
Polite Scraping: The scraper adds random delays between requests to avoid overloading the server.
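The snippets above don't show the delay logic itself, so here is a minimal sketch of how time and random could space out requests across several pages (the URLs are placeholders):
urls = [
    'https://example.com/page1',  # placeholder URLs for illustration
    'https://example.com/page2',
]

for page_url in urls:
    scrape_data(page_url)
    time.sleep(random.uniform(1, 3))  # pause 1-3 seconds before the next request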
Conclusion:
With libraries such as requests and BeautifulSoup, Python makes it straightforward to automate data collection from websites. This simple scraper can easily be configured to extract useful information from the web, and it can be extended to scrape multiple pages, handle pagination, or even interact with websites that require form submissions or JavaScript-rendered content.
Automating data collection saves valuable time and resources, freeing you to focus on analyzing the data rather than gathering it.