Web Scraping and Data Handling in Python: My First Hands-On Experience

Capwell Murimi

Introduction

As a newcomer to data science and artificial intelligence, I recently completed my first week’s assignment in the Cyber Shujaa Program, focusing on web scraping and data handling in Python. This blog post documents my journey from complete beginner to successfully scraping, parsing, and storing web data, all while learning Google Colab along the way.

Getting Started with Google Colab

My first step was setting up a Google Colab account and creating a new notebook. Colab provides a fantastic cloud-based environment for running Python code without any local setup. Here’s how I began:

# Importing necessary libraries
from bs4 import BeautifulSoup
import requests
import pandas as pd

These three libraries became my essential toolkit:

  • BeautifulSoup for parsing HTML content

  • requests for fetching web pages

  • pandas for data manipulation and storage

Fetching and Parsing Web Content

The assignment used ScrapeThisSite’s forms page as our target URL. Here’s how I retrieved and parsed the content:

# Fetching the target URL
url = 'https://www.scrapethissite.com/pages/forms/'
response = requests.get(url)
# Parsing the HTML content
soup = BeautifulSoup(response.text, 'html.parser')
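
Before parsing, it’s worth confirming the request actually succeeded, since requests won’t raise an error for a failed page on its own. This check is my own addition rather than part of the assignment:

# Sanity check (my addition): requests doesn't raise on HTTP errors by
# default, so without this a 404 page would be parsed as if it were valid
response.raise_for_status()   # raises requests.HTTPError on 4xx/5xx status
print(response.status_code)   # expect 200 for a successful fetch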

Extracting Table Data

The page contained hockey team statistics in a table, which I needed to extract:

# Finding the table and headers
table = soup.find_all('table', class_='table')[0]
headers = [header.text.strip() for header in table.find_all('th')]
# Creating an empty DataFrame with the headers
df = pd.DataFrame(columns=headers)
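
Before filling in any rows, I found it helpful to print the headers and confirm they came through cleanly. The sample output here is illustrative; the exact column names depend on what the page serves:

# Quick look at the extracted column headers
print(headers)
# Illustrative output (the forms page lists hockey stats columns like):
# ['Team Name', 'Year', 'Wins', 'Losses', ...]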

Two Approaches to Data Collection

I discovered two methods to populate my DataFrame with the table data:

1. List Accumulation Method

# Initialize empty list
data_rows = []
# Extract and process each row
for row in table.find_all('tr')[1:]:  # Skip header row
    row_data = row.find_all('td')
    clean_row_data = [cell.text.strip() for cell in row_data]
    data_rows.append(clean_row_data)
# Convert list to DataFrame
df = pd.DataFrame(data_rows, columns=headers)

2. Direct DataFrame Append

# Initialize empty DataFrame
df = pd.DataFrame(columns=headers)
# Append rows directly to DataFrame
for row in table.find_all('tr')[1:]:
    row_data = row.find_all('td')
    clean_row_data = [cell.text.strip() for cell in row_data]
    length = len(df)
    df.loc[length] = clean_row_data

Both methods worked, but they have different performance characteristics. The list accumulation method is generally faster for large datasets, because pandas can build the whole DataFrame in a single pass; appending with df.loc grows the DataFrame one row at a time, which slows down as the table gets bigger. For a small table like this one, though, the direct append approach might feel more intuitive.
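
If you’re curious, here’s a rough micro-benchmark sketch of my own (not part of the assignment; exact timings will vary by machine) that shows the gap on synthetic data:

import time
import pandas as pd

# Synthetic rows standing in for scraped table data
rows = [[i, i * 2] for i in range(2000)]
cols = ['a', 'b']

start = time.perf_counter()
df_list = pd.DataFrame(rows, columns=cols)   # list accumulation: one pass
t_list = time.perf_counter() - start

start = time.perf_counter()
df_loc = pd.DataFrame(columns=cols)
for r in rows:                               # direct append: grows row by row
    df_loc.loc[len(df_loc)] = r
t_loc = time.perf_counter() - start

print(f"list accumulation: {t_list:.4f}s, direct append: {t_loc:.4f}s")

The list version should come out well ahead, and the gap widens as the row count grows.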

Inspecting and Saving the Data

After collecting the data, I inspected it and saved it to a CSV file:

# Display the first few rows
print(df.head())
# Save to CSV
df.to_csv('hockey_team_stats.csv', index=False)
# Verify the saved data
saved_data = pd.read_csv('hockey_team_stats.csv')
print(saved_data.head())
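
One Colab-specific detail: the CSV is written to the notebook’s temporary cloud filesystem, not your computer. If you want a local copy, Colab ships a small files helper for downloading it (this only works inside Colab):

# Download the CSV from the Colab runtime to your own machine
from google.colab import files  # Colab-only module
files.download('hockey_team_stats.csv')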

Key Learnings and Reflections

This introductory project taught me several valuable lessons:

  1. Hands-on learning works: Seeing immediate results from my code helped concepts stick better than just reading theory.

  2. Python’s ecosystem is powerful: The combination of BeautifulSoup, requests, and pandas makes web scraping accessible even for beginners.

  3. There’s often more than one solution: Discovering two different approaches to populate my DataFrame showed me that programming often offers multiple paths to the same destination.

  4. Google Colab is beginner-friendly: The cloud-based environment eliminated setup headaches and let me focus on learning.

Looking Ahead

This assignment has given me confidence to tackle more complex data science challenges. I’m particularly excited to:

  • Explore more advanced web scraping techniques

  • Learn about data cleaning and preprocessing

  • Apply these skills to real-world problems

You can view my complete code on Google Colab.

As I continue through the Cyber Shujaa Program, I’ll be documenting my journey here on my blog. This portfolio of projects will not only reinforce my learning but also showcase my growing skills to potential employers in the Data and AI field.

Have you worked with web scraping before? What tips would you share with a beginner like me? Let me know in the comments!
