How We Scraped Our First Website Using Beautiful Soup and Python


What is web scraping in simple words?
In simple words, web scraping is like having a little helper that automatically gathers information from a website for you.
For example
E-commerce platforms like Amazon and Flipkart monitor each other’s pricing strategies to offer competitive deals.
Apps like Google News and Inshorts gather and organize breaking news from various sources for quick, easy reading.
If you want to dive deeper into web scraping, you can check out our introductory blog and then come back here, or just start here. No worries!
HTML Basics for Web Scraping
Before we begin scraping, we need to understand the basics of HTML. We'll focus on the essentials, which are enough to successfully scrape a website.
What is HTML?
HTML (HyperText Markup Language) structures content on the web.
It's made up of elements (tags) like <div>, <p>, <h1>, <a>, etc.
Example
<html>
<body>
<h1>Title</h1>
<p class="product">book</p>
<a href="/buy-now">Buy Now</a>
</body>
</html>
Important HTML Elements for Scraping
| Tag | Meaning | Common Use Case |
| --- | --- | --- |
| <div> | Division/Container | Group content |
| <p> | Paragraph | Text blocks |
| <h1>, <h2>, etc. | Headings | Titles and sections |
| <a> | Anchor (links) | URLs, navigation |
| <img> | Image | Pictures and icons |
| <table> | Table | Structured tabular data |
Do You Know What Attributes Are?
Attributes are important!
They are used within the opening tag, like <tag attribute="value">content</tag>
They consist of a name and a value, like name="value"
They provide additional information about the element
We use these attributes to target the correct elements
Common attributes include id, class, href, and src
For example:
The <a> tag defines a hyperlink. The href attribute specifies the URL of the page:
<a href="https://ourtechtale.hashnode.dev/">Visit OurTechTale</a>
The src attribute in the <img> tag specifies the image source, and the width and height attributes define the dimensions of elements like images:
<img src="img_car.jpg" width="300" height="100">
Don't worry, we can specify more than one attribute in a single tag. Just separate them with a space.
The class and id attributes are commonly used for styling and scripting:
<p class="size" id="height">10cm</p>
CSS basics for Scraping
What is CSS?
CSS (Cascading Style Sheets) controls the style and layout of HTML elements.
For scraping, we mainly care about CSS selectors to find and extract data.
Common CSS Selectors
| Selector | Meaning | Example |
| --- | --- | --- |
| .class | Selects elements by class | .title, .price |
| #id | Selects an element by id | #main, #product-title |
| tag | Selects all elements of a tag | h1, div, p |
| tag.class | Selects a tag with class | p.price, div.container |
Example
Selecting Elements
HTML:
<p class="price">$29.99</p>
CSS selector to target this:
p.price
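To connect these selectors to Python: Beautiful Soup's select() and select_one() accept the same CSS selectors. Here is a minimal sketch on an invented HTML snippet (the class names and values are made up for illustration):

```python
from bs4 import BeautifulSoup

html_doc = """
<div class="container">
  <h1 id="product-title">Sample Book</h1>
  <p class="price">$29.99</p>
  <p class="price">$14.50</p>
</div>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# #id selector: matches one element by its id
print(soup.select_one("#product-title").text)    # Sample Book
# tag.class selector: matches every <p> with class "price"
print([p.text for p in soup.select("p.price")])  # ['$29.99', '$14.50']
```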
Basic structure of an HTML document
<!DOCTYPE html>
<html>
<head>
<title>Sample Page</title>
</head>
<body>
<h1>Welcome!</h1>
<p>This is a sample page.</p>
</body>
</html>
Key Elements
<!DOCTYPE html>: Declares the document type.
<html>: Root element of the page.
<head>: Contains metadata, styles, and scripts.
<body>: Contains visible content.
Common HTML Tags Used in Scraping
| Tag | Purpose |
| --- | --- |
| <div> | Section or container for content |
| <span> | Inline container |
| <a> | Anchor tag for hyperlinks |
| <img> | Displays images (uses src) |
| <ul>, <ol>, <li> | Lists and list items |
| <table>, <tr>, <td> | Tables and cells |
| <h1> to <h6> | Headers (various sizes) |
| <p> | Paragraph text |
| <form>, <input>, <button> | Form elements |
Navigating HTML Structure in Scraping
Scraping tools like BeautifulSoup and Selenium use tag names and attributes to locate elements.
Example Targets
By tag: soup.find('div')
By class: soup.find('div', class_='product')
By ID: soup.find(id='header')
By attribute: soup.find('a', {'href': True})
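Here is how those four targets behave on a small, made-up HTML snippet (a sketch, assuming Beautiful Soup is installed):

```python
from bs4 import BeautifulSoup

html_doc = """
<div id="header"><a href="/home">Home</a></div>
<div class="product"><p>Book</p></div>
"""
soup = BeautifulSoup(html_doc, "html.parser")

print(soup.find("div").get("id"))                 # header (the first <div>)
print(soup.find("div", class_="product").p.text)  # Book
print(soup.find(id="header").a.text)              # Home
print(soup.find("a", {"href": True})["href"])     # /home
```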
Python libraries required for scraping
Requests
Beautiful Soup
lxml
Selenium (optional)
Let’s explore these libraries
- Requests
It is a Python tool that helps your program talk to websites — like sending or receiving information from a webpage.
Talks to Websites: It lets your Python code open websites and get data, like reading the content of a news article or sending login details.
Very Easy to Use: You can get a webpage by just writing requests.get("https://example.com") — it's like typing a website URL in your browser, but using Python.
Handles Extras Automatically: It can also take care of things like login info, form data, or cookies without you needing to write a lot of extra code.
Installation
To install requests, run:
pip install requests
Example
import requests
url = "https://example.com"
response = requests.get(url)
print(response.text)
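requests can also attach query parameters and headers for you. The sketch below builds (but does not send) a request, so you can inspect the URL it would produce; the URL and header values here are made up:

```python
import requests

# Prepare a GET request with query parameters and a custom header,
# without actually sending anything over the network.
req = requests.Request(
    "GET",
    "https://example.com/search",
    params={"q": "books", "page": 2},
    headers={"User-Agent": "my-scraper/0.1"},
)
prepared = req.prepare()

print(prepared.url)  # https://example.com/search?q=books&page=2
```

In everyday code you would simply call requests.get(url, params=..., headers=...) and let the library do this preparation for you.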
- Beautiful Soup
It is a Python tool that helps you read and pull out useful information from web pages.
Reads Web Pages Like a Human: It takes the messy code behind a website (called HTML) and organizes it so your Python program can understand it easily.
Helps You Find Things: Once the webpage is organized, you can easily search for specific things like headings, paragraphs, links, or even prices using tag names or class names.
Perfect for Web Scraping: It works great with another tool called requests, and together, they help you grab and clean data from websites — like collecting product names from an online store.
Installation
To install Beautiful Soup, run:
pip install beautifulsoup4
Example 1
from bs4 import BeautifulSoup
html_code = """
<html>
<head><title>My Website</title></head>
<body>
<h1>Welcome!</h1>
<p>This is a sample page.</p>
</body>
</html>
"""
# Create a BeautifulSoup object
soup = BeautifulSoup(html_code, 'html.parser')
# Accessing elements
print(soup.title)
# <title>My Website</title>
print(soup.h1.text)
# Welcome!
Explanation
html_code: Sample HTML content.
BeautifulSoup(html_code, 'html.parser'): This creates a soup object, which is now a parsed version of the HTML.
Now you can easily access tags like title, h1, etc.
Example 2
import requests
from bs4 import BeautifulSoup
# Step 1: Send a GET request to the website
url = "https://example.com"
response = requests.get(url)
# Step 2: Create a BeautifulSoup object
soup = BeautifulSoup(response.text, 'html.parser')
# Step 3: Extract the title of the page
print("Page Title:", soup.title.text)
What's happening
requests.get(url) fetches the HTML content of the webpage.
BeautifulSoup(response.text, 'html.parser') parses that HTML.
soup.title.text gives the content inside the <title> tag of the page.
You can replace "https://example.com" with any public website that allows scraping.
- lxml
It is a tool in Python that helps you read and work with web page code (HTML or XML) quickly and easily.
Fast HTML/XML Parser: lxml is a high-performance library used to parse and process XML and HTML documents efficiently.
Supports XPath & CSS Selectors: It allows powerful searching within documents using XPath or CSS-style queries, making it ideal for complex data extraction.
Faster than html.parser: Compared to Python’s built-in html.parser, lxml is faster and more robust, especially for large or poorly structured HTML.
Installation
To install lxml, run:
pip install lxml
Example
from bs4 import BeautifulSoup
import requests
# Fetch the webpage
url = "https://example.com"
response = requests.get(url)
# Parse the HTML using lxml parser
soup = BeautifulSoup(response.text, 'lxml')
# Extract the title
print("Title:", soup.title.text)
Just like 'html.parser', you can pass 'lxml' as the parser to BeautifulSoup. It's faster and more lenient with messy HTML.
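lxml can also be used on its own, without Beautiful Soup, when you want XPath queries. A minimal sketch on an invented snippet:

```python
from lxml import html

doc = html.fromstring("""
<html><body>
  <p class="price">$29.99</p>
  <a href="/buy-now">Buy Now</a>
</body></html>
""")

# XPath: text of every <p> whose class is "price"
print(doc.xpath('//p[@class="price"]/text()'))  # ['$29.99']
# XPath: href attribute of every <a>
print(doc.xpath('//a/@href'))                   # ['/buy-now']
```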
- Selenium
It is a tool that automatically opens a web browser and does things just like a human: clicking buttons, typing in boxes, or scrolling pages. It is used for interacting with JavaScript-rendered pages.
Automates the Browser: It can open Chrome or Firefox and visit websites for you, just like you're doing it yourself.
Handles Dynamic Websites: If a website loads data using JavaScript (like Instagram or Flipkart), Selenium can still access that data — unlike simpler tools like requests or BeautifulSoup.
Acts Like a Real User: You can tell Selenium to click, type, fill out forms, or even take screenshots — great for testing websites or scraping data from complex pages.
We will not be using Selenium in this blog.
LET’S GET STARTED WITH SCRAPING
We will be using the books.toscrape.com website because it was built specifically for scraping practice.
WARNING: Always check a website's robots.txt file or terms of service. For major platforms (like LinkedIn, Amazon, Facebook), scraping is often against the terms and could result in bans or legal warnings. Use public APIs where possible.
#Import the Libraries
import requests
from bs4 import BeautifulSoup
#Send a Request to the Website
url = "https://books.toscrape.com"
response = requests.get(url)
# Check if it worked
print(response.status_code) # Should print 200
#Parse the HTML with BeautifulSoup
soup = BeautifulSoup(response.text, 'lxml')
#Print the parsed HTML
print(soup)
#Visualize and understand the structure of HTML and XML documents
#Adding some prettiness !!
print(soup.prettify())
#Some examples
# print(soup.h1)
# print(soup.h1.text)
# print(soup.find("p"))
# print(soup.find("p", class_="description"))
# print(soup.find_all("a"))
# print(soup.select_one("p.description"))
# print(soup.select("a"))
# link = soup.find("a")
# print(link["href"])
# print(soup.p.parent)
# print(list(soup.body.children))
# print(soup.h1.find_next_sibling())
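Putting it together: on books.toscrape.com each book sits in an <article class="product_pod">, with the full title in the title attribute of the h3 > a link and the price in p.price_color. The sketch below runs on a static snippet copied in that shape, so it works offline; against the live site you would parse response.text instead:

```python
from bs4 import BeautifulSoup

# Static HTML mimicking the listing markup on books.toscrape.com
html_doc = """
<article class="product_pod">
  <h3><a title="A Light in the Attic" href="catalogue/a-light-in-the-attic_1000/index.html">A Light in the ...</a></h3>
  <p class="price_color">£51.77</p>
</article>
<article class="product_pod">
  <h3><a title="Tipping the Velvet" href="catalogue/tipping-the-velvet_999/index.html">Tipping the Velvet</a></h3>
  <p class="price_color">£53.74</p>
</article>
"""
soup = BeautifulSoup(html_doc, "html.parser")

books = []
for card in soup.find_all("article", class_="product_pod"):
    title = card.h3.a["title"]  # the full title lives in the attribute
    price = card.find("p", class_="price_color").text
    books.append((title, price))

print(books)
```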
MOST COMMON METHODS USED IN BEAUTIFUL SOUP
Accessing Elements
Access the first occurrence of a tag:
soup.h1
Get the text inside a tag:
soup.h1.text
find() Method
Finds the first matching element:
soup.find("p")
Find a tag with specific attributes:
soup.find("p", class_="description")
find_all() Method
Finds all matching elements:
soup.find_all("a")
Using select() and select_one()
Select elements using CSS selectors:
soup.select_one("p.description")
soup.select("a")
Extracting Attributes
Get the value of an attribute, such as href from an <a> tag:
link = soup.find("a")
print(link["href"])
Or using .get():
print(link.get("href"))
Traversing the Tree
Access parent elements:
soup.p.parent
Access children elements:
list(soup.body.children)
Find the next sibling:
soup.h1.find_next_sibling()
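These traversal calls in action on a tiny made-up document:

```python
from bs4 import BeautifulSoup

html_doc = "<body><h1>Title</h1><p>Intro</p><p>Details</p></body>"
soup = BeautifulSoup(html_doc, "html.parser")

print(soup.p.parent.name)                    # body
print([c.name for c in soup.body.children])  # ['h1', 'p', 'p']
print(soup.h1.find_next_sibling().text)      # Intro
```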
Conclusion
Web scraping might seem a bit challenging at first, but with Beautiful Soup and a little curiosity, it becomes much easier. This is just the start, and we're excited to dive deeper into web data with you. If you're just starting out, take it one page at a time. Before you know it, you'll be scraping like a pro!
💬 Got questions or scraping ideas? Drop them in the comments—we’d love to hear from you!
— Palak, Abhishek | OurTechTale