Web Scraper using Python
Introduction:
Today, data is everywhere - whether for market research or content aggregation, there is a wealth of information just waiting to be analyzed. Manually collecting data from websites is impractical and time-consuming. That is where web scraping comes in: it automates the extraction of valuable data from websites for purposes such as research and SEO analysis.
Why Web Scraping?
Common uses of web scraping include:
Data gathering: Collecting data from blogs, news sites, e-commerce sites, and much more.
Competitor analysis: Monitoring competitor websites for SEO, product pricing, etc.
Research: Collecting data for academic, market, or other types of studies.
Content aggregation: Automatically collecting content from a variety of sources for incorporation into blogs or other websites.
Tools and Libraries Used:
In this project, I used the following tools:
requests: A simple HTTP library for sending requests and fetching web pages.
BeautifulSoup: A library for parsing HTML content and extracting data.
time & random: To simulate real browsing behavior by introducing delays between requests.
logging: For error handling and keeping track of the scraping process.
urllib.parse: To handle relative and absolute URLs.
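The snippets below assume roughly the following imports; note that Retry lives in urllib3 and HTTPAdapter in requests.adapters:
import logging
import random
import time
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry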
How the Code Works:
The main components of the scraper are:
1. Fetching Webpage Data with requests
To start, we need to send an HTTP request to the website and fetch the page’s content.
def get_soup(url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36'
    }
    session = requests.Session()
    retry = Retry(connect=3, backoff_factor=0.5)
    adapter = HTTPAdapter(max_retries=retry)
    session.mount('http://', adapter)
    session.mount('https://', adapter)
    try:
        response = session.get(url, headers=headers)
        response.raise_for_status()  # Raises an error for invalid status codes
        return BeautifulSoup(response.content, 'html.parser')
    except requests.exceptions.RequestException as e:
        logging.error(f"Error fetching {url}: {e}")
        return None
We use requests.Session() for persistent HTTP sessions, and the Retry and HTTPAdapter classes ensure the scraper retries failed requests.
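The Retry configuration above only retries connection errors. If you also want to retry on temporary server errors, urllib3's Retry accepts a status_forcelist; here is a sketch of such a configuration:
# Sketch: also retry on common transient HTTP status codes
retry = Retry(
    total=3,                                     # overall retry budget
    connect=3,                                   # retries for connection errors
    backoff_factor=0.5,                          # exponential backoff between attempts
    status_forcelist=[429, 500, 502, 503, 504],  # retry on these server responses
)
adapter = HTTPAdapter(max_retries=retry)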
2. Extracting Paragraphs
Once we have the HTML content, we use BeautifulSoup to extract all <p> (paragraph) tags.
def extract_paragraphs(soup):
    paragraphs = []
    for p in soup.find_all('p'):
        try:
            paragraphs.append(p.text.strip())
        except AttributeError:
            logging.warning("Error parsing paragraph text")
    return paragraphs
This function finds all paragraph tags and extracts the text from them, stripping out any unnecessary whitespace.
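For example, the two helpers can be combined like this (the URL here is just a placeholder):
soup = get_soup('https://example.com')  # placeholder URL for illustration
if soup:
    paragraphs = extract_paragraphs(soup)
    non_empty = [text for text in paragraphs if text]  # optionally drop empty paragraphs
    print(f"Found {len(non_empty)} non-empty paragraphs")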
3. Extracting Links
We also want to extract all the links (<a>
tags) from the page. The extract_links()
function converts any relative links into absolute URLs using urljoin()
.
def extract_links(soup, base_url, keyword=None):
    links = []
    for link in soup.find_all('a', href=True):
        full_url = urljoin(base_url, link['href'])
        if not keyword or keyword in full_url:
            links.append(full_url)
    return links
This function will help us gather all the URLs (internal and external) from the page. We can filter these URLs with a keyword if needed.
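For instance, to keep only links whose URL contains a given keyword ('python' here is just an example filter):
all_links = extract_links(soup, url)                       # every link on the page
python_links = extract_links(soup, url, keyword='python')  # only URLs containing 'python'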
4. Putting It All Together
Finally, we define the scrape_data() function to fetch the webpage, extract the relevant data (title, paragraphs, links), and log the process.
def scrape_data(url):
    start_time = time.time()
    soup = get_soup(url)
    if not soup:
        return

    # Extract and print title
    title = soup.title.text.strip() if soup.title else 'No title found'
    print(f"Title of the webpage: {title}")

    # Extract and print paragraphs
    print("\nParagraphs:")
    paragraphs = extract_paragraphs(soup)
    for paragraph in paragraphs:
        print(paragraph)

    # Extract and print links
    print("\nLinks:")
    links = extract_links(soup, url)
    for link in links:
        print(link)

    duration = time.time() - start_time
    logging.info(f"Scraped data from {url} in {duration:.2f} seconds")
Running the Scraper
To run the scraper, simply pass the URL you want to scrape:
if __name__ == "__main__":
    url = 'https://www.geeksforgeeks.org/python-web-scraping-tutorial/'
    scrape_data(url)
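One detail worth noting: the scraper logs its timing at INFO level, which Python's logging module hides by default. To see those messages in the console, you can configure logging before calling scrape_data(), for example:
logging.basicConfig(
    level=logging.INFO,                                  # show INFO messages and above
    format='%(asctime)s - %(levelname)s - %(message)s',  # timestamped log lines
)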
Key Features of This Scraper:
Retry Logic: The scraper retries failed HTTP requests to ensure reliable data collection.
Extracts Important Data: It pulls the title, paragraphs, and all links from a webpage.
Logging: Logs errors and execution time, which helps with debugging and monitoring.
Polite Scraping: The scraper adds random delays between requests to avoid overloading the server.
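The snippets above don't show the delay logic itself, so here is a minimal sketch of how time and random could space out requests across several pages (the URLs are placeholders):
urls = [
    'https://example.com/page1',  # placeholder URLs for illustration
    'https://example.com/page2',
]

for page_url in urls:
    scrape_data(page_url)
    time.sleep(random.uniform(1, 3))  # pause 1-3 seconds before the next request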
Conclusion:
With libraries such as requests and BeautifulSoup, Python makes it straightforward to automate data collection from websites. This simple scraper can easily be configured to extract useful information from the web, and it can be extended to scrape multiple pages, handle pagination, or even interact with websites that require form submissions or JavaScript-rendered content.
Automating data collection saves valuable time and resources, freeing you to focus on analyzing the data rather than gathering it.