Web Scraping and Data Handling in Python: My First Hands-On Experience


Introduction
As a newcomer to data science and artificial intelligence, I recently completed my first week's assignment in the Cyber Shujaa Program, focusing on web scraping and data handling in Python. This blog post documents my journey from complete beginner to successfully scraping, parsing, and storing web data, all while learning Google Colab along the way.
Getting Started with Google Colab
My first step was setting up a Google Colab account and creating a new notebook. Colab provides a fantastic cloud-based environment for running Python code without any local setup. Here’s how I began:
# Importing necessary libraries
from bs4 import BeautifulSoup
import requests
import pandas as pd
These three libraries became my essential toolkit:
BeautifulSoup for parsing HTML content
requests for fetching web pages
pandas for data manipulation and storage
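Before pointing these tools at a live site, it helped me to see what BeautifulSoup actually does on its own. Here's a minimal, offline sketch that parses a small made-up HTML snippet (the table content is invented for illustration, not from the real page):

```python
from bs4 import BeautifulSoup

# A tiny, made-up HTML snippet standing in for a real page
html = """
<table class="table">
  <tr><th>Team Name</th><th>Wins</th></tr>
  <tr><td>Boston Bruins</td><td>44</td></tr>
</table>
"""

soup = BeautifulSoup(html, 'html.parser')

# find_all returns every matching tag; .text pulls out the visible content
headers = [th.text.strip() for th in soup.find_all('th')]
cells = [td.text.strip() for td in soup.find_all('td')]

print(headers)  # ['Team Name', 'Wins']
print(cells)    # ['Boston Bruins', '44']
```

The same `find_all` and `.text.strip()` pattern carries over directly to the real scraping code below.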
Fetching and Parsing Web Content
The assignment used ScrapeThisSite’s forms page as our target URL. Here’s how I retrieved and parsed the content:
# Fetching the target URL
url = 'https://www.scrapethissite.com/pages/forms/'
response = requests.get(url)
# Parsing the HTML content
soup = BeautifulSoup(response.text, 'html.parser')
Extracting Table Data
The page contained hockey team statistics in a table, which I needed to extract:
# Finding the table and headers
table = soup.find_all('table', class_='table')[0]
headers = [header.text.strip() for header in table.find_all('th')]
# Creating an empty DataFrame with the headers
df = pd.DataFrame(columns=headers)
Two Approaches to Data Collection
I discovered two methods to populate my DataFrame with the table data:
1. List Accumulation Method
# Initialize empty list
data_rows = []
# Extract and process each row
for row in table.find_all('tr')[1:]:  # Skip header row
    row_data = row.find_all('td')
    clean_row_data = [cell.text.strip() for cell in row_data]
    data_rows.append(clean_row_data)
# Convert list to DataFrame
df = pd.DataFrame(data_rows, columns=headers)
2. Direct DataFrame Append
# Initialize empty DataFrame
df = pd.DataFrame(columns=headers)
# Append rows directly to DataFrame
for row in table.find_all('tr')[1:]:  # Skip header row
    row_data = row.find_all('td')
    clean_row_data = [cell.text.strip() for cell in row_data]
    length = len(df)
    df.loc[length] = clean_row_data
Both methods worked, but they have different performance characteristics. The list accumulation method is generally much faster for large datasets, because growing a DataFrame one row at a time with .loc forces pandas to do extra work on every iteration, while building the DataFrame once from a complete list does not. The direct append approach might still be more intuitive for small tables.
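To get a feel for the difference myself, a rough benchmark sketch like the following can compare the two approaches (the synthetic team data and the 500-row count are arbitrary choices, not from the assignment):

```python
import time
import pandas as pd

headers = ['Team Name', 'Wins', 'Losses']
rows = [[f'Team {i}', i, 100 - i] for i in range(500)]  # synthetic stand-in data

# Method 1: accumulate rows in a list, build the DataFrame once
start = time.perf_counter()
df_list = pd.DataFrame(rows, columns=headers)
t_list = time.perf_counter() - start

# Method 2: grow the DataFrame one row at a time with .loc
start = time.perf_counter()
df_loc = pd.DataFrame(columns=headers)
for row in rows:
    df_loc.loc[len(df_loc)] = row
t_loc = time.perf_counter() - start

print(f'list accumulation: {t_list:.4f}s, row-by-row .loc: {t_loc:.4f}s')
```

Exact timings vary by machine, but the gap widens quickly as the row count grows, which is why the list accumulation method is usually recommended for anything beyond a small table.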
Inspecting and Saving the Data
After collecting the data, I inspected it and saved it to a CSV file:
# Display the first few rows
print(df.head())
# Save to CSV
df.to_csv('hockey_team_stats.csv', index=False)
# Verify the saved data
saved_data = pd.read_csv('hockey_team_stats.csv')
print(saved_data.head())
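One subtlety I noticed is worth flagging: everything scraped from HTML arrives as text, but pd.read_csv re-infers column types on the way back in, so numeric-looking columns come back as numbers. A small sketch with made-up data (the file name is just for this demo):

```python
import pandas as pd

# Scraped values are strings, even when they look numeric
df = pd.DataFrame({'Team Name': ['Boston Bruins'], 'Wins': ['44']})
print(df['Wins'].dtype)  # object (i.e. text)

df.to_csv('roundtrip_demo.csv', index=False)

# read_csv inspects the values and infers a numeric column
back = pd.read_csv('roundtrip_demo.csv')
print(back['Wins'].dtype)  # int64
```

So the "verify the saved data" step above isn't just a formality: it's a chance to check that the reloaded types are what you expect before doing any arithmetic on the columns.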
Key Learnings and Reflections
This introductory project taught me several valuable lessons:
Hands-on learning works: Seeing immediate results from my code helped concepts stick better than just reading theory.
Python’s ecosystem is powerful: The combination of BeautifulSoup, requests, and pandas makes web scraping accessible even for beginners.
There’s often more than one solution: Discovering two different approaches to populate my DataFrame showed me that programming often offers multiple paths to the same destination.
Google Colab is beginner-friendly: The cloud-based environment eliminated setup headaches and let me focus on learning.
Looking Ahead
This assignment has given me confidence to tackle more complex data science challenges. I’m particularly excited to:
Explore more advanced web scraping techniques
Learn about data cleaning and preprocessing
Apply these skills to real-world problems
You can view my complete code on Google Colab.
As I continue through the Cyber Shujaa Program, I’ll be documenting my journey here on my blog. This portfolio of projects will not only reinforce my learning but also showcase my growing skills to potential employers in the Data and AI field.
Have you worked with web scraping before? What tips would you share with a beginner like me? Let me know in the comments!