How We Scraped Our First Website Using Beautiful Soup and Python

Palak Goyal
8 min read

What is web scraping in simple words?

In simple words, web scraping is like having a little helper that automatically gathers information from a website for you.

For example

  1. E-commerce platforms like Amazon and Flipkart monitor each other’s pricing strategies to offer competitive deals.

  2. Apps like Google News and Inshorts gather and organize breaking news from various sources for quick, easy reading.

If you want to dive deeper into web scraping, you can check out our earlier blog on the topic and then come back here, or just start here. No worries!

HTML Basics for Web Scraping

Before we begin scraping, we need to understand the basics of HTML. We'll focus on the essentials, which are enough to successfully scrape a website.

What is HTML?

HTML (HyperText Markup Language) structures content on the web.

It's made up of elements (tags) like <div>, <p>, <h1>, <a>, etc.

Example

<html>
<body>
<h1>Title</h1>
<p class="product">book</p>
<a href="/buy-now">Buy Now</a>
</body>
</html>

Important HTML Elements for Scraping

Tag | Meaning | Common Use Case
<div> | Division/Container | Group content
<p> | Paragraph | Text blocks
<h1>, <h2>, etc. | Headings | Titles and sections
<a> | Anchor (links) | URLs, navigation
<img> | Image | Pictures and icons
<table> | Table | Structured tabular data

What Are Attributes?

Attributes are important!

  • They appear inside the opening tag, like <tag name="value">...</tag>

  • They consist of a name and a value, like name="value"

  • They provide additional information about the element

  • We use these attributes to target the correct elements when scraping

  • Common attributes include id, class, href, and src

For example

  • The <a> tag defines a hyperlink. The href attribute specifies the URL the link points to

    <a href="https://ourtechtale.hashnode.dev/">Visit OurTechTale</a>

  • The src attribute in the <img> tag specifies the image source, and the width and height attributes define its dimensions (as plain numbers of pixels; units like px belong in CSS, not in these attributes)

    <img src="img_car.jpg" width="300" height="100">

Don't worry: a single tag can have more than one attribute. Just separate them with a space.

  • class and id attributes are commonly used for styling and scripting

    <p class="size" id="height">10cm</p>

CSS basics for Scraping

What is CSS?

CSS (Cascading Style Sheets) controls the style and layout of HTML elements.

For scraping, we mainly care about CSS selectors to find and extract data.

Common CSS Selectors

Selector | Meaning | Example
.class | Selects elements by class | .title, .price
#id | Selects an element by id | #main, #product-title
tag | Selects all elements of a tag | h1, div, p
tag.class | Selects a tag with a class | p.price, div.container

Example

Selecting Elements

HTML:

<p class="price">$29.99</p>

CSS selector to target this:

p.price
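As a quick sketch, the same selectors can be tried out with BeautifulSoup's select_one() method on a small, made-up HTML snippet:

```python
from bs4 import BeautifulSoup

# A tiny, invented HTML snippet for exercising the selectors above
html = """
<div id="main">
  <h1 class="title">Catalogue</h1>
  <p class="price">$29.99</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

print(soup.select_one(".title").text)   # class selector   -> Catalogue
print(soup.select_one("#main").name)    # id selector      -> div
print(soup.select_one("p.price").text)  # tag.class        -> $29.99
```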

Basic structure of an HTML document

<!DOCTYPE html>
<html>
<head>
<title>Sample Page</title>
</head>
<body>
<h1>Welcome!</h1>
<p>This is a sample page.</p>
</body>
</html>

Key Elements

<!DOCTYPE html>: Declares the document type.

<html>: Root element of the page.

<head>: Contains metadata, styles, and scripts.

<body>: Contains visible content.

Common HTML Tags Used in Scraping

Tag | Purpose
<div> | Section or container for content
<span> | Inline container
<a> | Anchor tag for hyperlinks
<img> | Displays images (uses src)
<ul>, <ol>, <li> | Lists and list items
<table>, <tr>, <td> | Tables and cells
<h1> to <h6> | Headings (various sizes)
<p> | Paragraph text
<form>, <input>, <button> | Form elements

Navigating HTML Structure in Scraping

Scraping tools like BeautifulSoup and Selenium use tag names and attributes to locate elements.

Example Targets

By tag: soup.find('div')

By class: soup.find('div', class_='product')

By ID: soup.find(id='header')

By attribute: soup.find('a', {'href': True})
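A minimal sketch tying these four lookups together (the HTML below is invented for illustration):

```python
from bs4 import BeautifulSoup

# Invented HTML snippet for demonstration purposes
html = """
<div id="header"><a href="/home">Home</a></div>
<div class="product"><p>Book</p></div>
"""
soup = BeautifulSoup(html, "html.parser")

print(soup.find("div"))                           # first <div> on the page
print(soup.find("div", class_="product").p.text)  # Book
print(soup.find(id="header").a.text)              # Home
print(soup.find("a", {"href": True})["href"])     # /home
```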

Python libraries required for scraping

  1. Requests

  2. Beautiful Soup

  3. lxml

  4. Selenium (optional)

Let’s explore these libraries

  1. Requests

  • It is a Python tool that helps your program talk to websites, like sending or receiving information from a webpage.

  • Talks to Websites: It lets your Python code open websites and get data, like reading the content of a news article or sending login details.

  • Very Easy to Use: You can get a webpage by just writing requests.get("https://example.com") — it's like typing a website URL in your browser, but using Python.

  • Handles Extras Automatically: It can also take care of things like login info, form data, or cookies without you needing to write a lot of extra code.

Installation

To install requests, run:

pip install requests

Example

import requests

url = "https://example.com"

response = requests.get(url)

print(response.text)
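Requests can also build query strings and headers for you. As a small offline sketch (the URL is just a placeholder), preparing a request shows the final URL without actually sending anything over the network:

```python
import requests

# Build (but don't send) a GET request with a query parameter and a custom header
req = requests.Request(
    "GET",
    "https://example.com/search",        # placeholder URL
    params={"q": "books"},
    headers={"User-Agent": "my-scraper/0.1"},
)
prepared = req.prepare()
print(prepared.url)  # https://example.com/search?q=books
```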

  2. Beautiful Soup

  • It is a Python tool that helps you read and pull out useful information from web pages

  • Reads Web Pages Like a Human: It takes the messy code behind a website (called HTML) and organizes it so your Python program can understand it easily.

  • Helps You Find Things: Once the webpage is organized, you can easily search for specific things like headings, paragraphs, links, or even prices using tag names or class names.

  • Perfect for Web Scraping: It works great with another tool called requests, and together, they help you grab and clean data from websites — like collecting product names from an online store.

Installation

To install Beautiful Soup, run:

pip install beautifulsoup4

Example 1

from bs4 import BeautifulSoup

html_code = """
<html>
<head><title>My Website</title></head>
<body>
<h1>Welcome!</h1>
<p>This is a sample page.</p>
</body>
</html>
"""

# Create a BeautifulSoup object
soup = BeautifulSoup(html_code, 'html.parser')

# Accessing elements

print(soup.title)        
# <title>My Website</title>

print(soup.h1.text)     
# Welcome!

Explanation

  • html_code: Sample HTML content.

  • BeautifulSoup(html_code, 'html.parser'): This creates a soup object, which is now a parsed version of the HTML.

  • Now you can easily access tags like title, h1, etc.

Example 2

import requests
from bs4 import BeautifulSoup

# Step 1: Send a GET request to the website
url = "https://example.com"
response = requests.get(url)

# Step 2: Create a BeautifulSoup object
soup = BeautifulSoup(response.text, 'html.parser')

# Step 3: Extract the title of the page
print("Page Title:", soup.title.text)

What's happening

  • requests.get(url) fetches the HTML content of the webpage.

  • BeautifulSoup(response.text, 'html.parser') parses that HTML.

  • soup.title.text gives the content inside the <title> tag of the page.

You can replace "https://example.com" with any public website that allows scraping.

  3. lxml

It is a tool in Python that helps you read and work with web page code (HTML or XML) quickly and easily.

  • Fast HTML/XML Parser: lxml is a high-performance library used to parse and process XML and HTML documents efficiently.

  • Supports XPath & CSS Selectors: It allows powerful searching within documents using XPath or CSS-style queries, making it ideal for complex data extraction.

  • Faster than html.parser: Compared to Python’s built-in html.parser, lxml is faster and more robust, especially for large or poorly structured HTML.

Installation

To install lxml, run:

pip install lxml

Example

from bs4 import BeautifulSoup
import requests

# Fetch the webpage
url = "https://example.com"
response = requests.get(url)

# Parse the HTML using lxml parser
soup = BeautifulSoup(response.text, 'lxml')

# Extract the title
print("Title:", soup.title.text)

Just like 'html.parser', you can pass 'lxml' as the parser to BeautifulSoup. It's faster and more lenient with messy HTML.

  4. Selenium

It is a tool that automatically opens a web browser and does things just like a human would: clicking buttons, typing in boxes, or scrolling pages. It is used for interacting with JavaScript-rendered pages.

  • Automates the Browser: It can open Chrome or Firefox and visit websites for you, just like you're doing it yourself.

  • Handles Dynamic Websites: If a website loads data using JavaScript (like Instagram or Flipkart), Selenium can still access that data — unlike simpler tools like requests or BeautifulSoup.

  • Acts Like a Real User: You can tell Selenium to click, type, fill out forms, or even take screenshots — great for testing websites or scraping data from complex pages.

We will not be using Selenium in this blog.

LET’S GET STARTED WITH SCRAPING

We will be using the books.toscrape.com website because it is made specifically for practicing scraping.

WARNING: Always check a website’s robots.txt file or terms of service. For major platforms (like LinkedIn, Amazon, Facebook), scraping is often against terms and could result in bans or legal warnings. Use public APIs where possible.

#Import the Libraries
import requests
from bs4 import BeautifulSoup

#Send a Request to the Website
url = "https://books.toscrape.com"
response = requests.get(url)

# Check if it worked
print(response.status_code)  # Should print 200

#Parse the HTML with BeautifulSoup
soup = BeautifulSoup(response.text, 'lxml')

#Print the parsed HTML
print(soup)

#Visualize and understand the structure of HTML and XML documents
#Adding some prettiness !!
print(soup.prettify())

#Some examples
# print(soup.h1)
# print(soup.h1.text)
# print(soup.find("p"))
# print(soup.find("p", class_="description"))
# print(soup.find_all("a"))
# print(soup.select_one("p.description"))
# print(soup.select("a"))

# link = soup.find("a")
# print(link["href"])

# print(soup.p.parent)
# print(list(soup.body.children))
# print(soup.h1.find_next_sibling())

MOST COMMON METHODS USED IN BEAUTIFUL SOUP

  1. Accessing Elements

    Access the first occurrence of a tag:

    soup.h1

  2. Get the text inside a tag:

    soup.h1.text

  3. find() Method

    Finds the first matching element:

    soup.find("p")

  4. Find a tag with specific attributes:

    soup.find("p", class_="description")

  5. find_all() Method

    Finds all matching elements:

    soup.find_all("a")

  6. Using select() and select_one()

    Select elements using CSS selectors.

    soup.select_one("p.description")

    soup.select("a")

  7. Extracting Attributes

    Get the value of an attribute, such as href from an <a> tag:

    link = soup.find("a")

    print(link["href"])

    Or using .get():

    print(link.get("href"))

  8. Traversing the Tree

    Access parent elements:

    soup.p.parent

  9. Access children elements:

    list(soup.body.children)

  10. Find the next sibling:

    soup.h1.find_next_sibling()
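Putting several of these methods together, here is a sketch that loops over product cards in markup resembling the books.toscrape.com listing. The snippet below is hand-written for illustration, not fetched from the live site, though the class names mirror what the site uses.

```python
from bs4 import BeautifulSoup

# Hand-written HTML resembling the books.toscrape.com product grid
html = """
<ol class="row">
  <li><article class="product_pod">
    <h3><a href="a-light-in-the-attic.html" title="A Light in the Attic">A Light in ...</a></h3>
    <p class="price_color">£51.77</p>
  </article></li>
  <li><article class="product_pod">
    <h3><a href="tipping-the-velvet.html" title="Tipping the Velvet">Tipping the ...</a></h3>
    <p class="price_color">£53.74</p>
  </article></li>
</ol>
"""
soup = BeautifulSoup(html, "html.parser")

for book in soup.find_all("article", class_="product_pod"):
    title = book.h3.a["title"]                         # full title lives in the link's title attribute
    price = book.find("p", class_="price_color").text  # price text, currency symbol included
    print(f"{title}: {price}")
```

The same loop works on the real page once you fetch it with requests and parse the response, as shown earlier.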

Conclusion

Web scraping might seem a bit challenging at first, but with Beautiful Soup and a little curiosity, it becomes much easier. This is just the start, and we're excited to dive deeper into web data with you. If you're just starting out, take it one page at a time. Before you know it, you'll be scraping like a pro!

💬 Got questions or scraping ideas? Drop them in the comments—we’d love to hear from you!

Palak, Abhishek | OurTechTale
