20 BeautifulSoup concepts with Before-and-After Examples
Table of contents
- 1. Creating a Soup Object (BeautifulSoup)
- 2. Finding Elements by Tag Name (find)
- 3. Finding Multiple Elements (find_all)
- 4. Extracting Text Content (get_text)
- 5. Navigating the DOM (soup.tag)
- 6. Extracting Attributes (tag['attribute'])
- 7. Finding by Class or ID (find class_/id)
- 8. Modifying HTML Content (tag.string)
- 9. Inserting New Tags (new_tag)
- 10. Removing Tags (decompose)
- 11. Finding by CSS Selectors (select)
- 12. Extracting Tag Names (name)
- 13. Navigating Parent Elements (parent)
- 14. Navigating Sibling Elements (next_sibling / previous_sibling)
- 15. Navigating Child Elements (children)
- 16. Accessing Tag Attributes (attrs)
- 17. Searching with Multiple Criteria (find/find_all with filters)
- 18. Using Lambda for Custom Filters (find_all)
- 19. Searching for Specific Text (string/text)
- 20. Modifying the DOM Tree (insert_before/insert_after)
1. Creating a Soup Object (BeautifulSoup)
Boilerplate Code:
from bs4 import BeautifulSoup
Use Case: Create a soup object to parse HTML or XML data.
Goal: Load and process an HTML or XML document for web scraping.
Sample Code:
# Load HTML data
html_data = "<html><body><h1>Hello World!</h1></body></html>"
# Create a soup object
soup = BeautifulSoup(html_data, "html.parser")
Before Example:
You have raw HTML or XML but no way to process or extract its content.
Data: "<html><body><h1>Hello World!</h1></body></html>"
After Example:
With BeautifulSoup, you can now parse and manipulate the HTML document!
Output: A soup object that allows you to navigate and extract information.
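As a minimal end-to-end sketch of the step above, the snippet below parses a small document and reads one value back out. The markup is hypothetical and only there for illustration; "html.parser" is Python's built-in parser, so nothing beyond bs4 itself should be needed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, used only for illustration
html_data = "<html><head><title>Sample Page</title></head><body><h1>Hello World!</h1></body></html>"

# Build the soup object with the standard-library parser
soup = BeautifulSoup(html_data, "html.parser")

# Read the text of the first <h1> and produce an indented view of the tree
heading_text = soup.h1.string
pretty = soup.prettify()

print(heading_text)
print(pretty)
```

Calling prettify() is a quick way to check that the document was parsed the way you expected before you start searching it.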
Challenge: Try parsing an HTML file from a webpage and print out the <title> tag.
2. Finding Elements by Tag Name (find)
Boilerplate Code:
soup.find('tag_name')
Use Case: Use find to retrieve the first element that matches a given tag.
Goal: Locate a specific tag in the HTML or XML document.
Sample Code:
# Find the first h1 tag
h1_tag = soup.find('h1')
print(h1_tag.text) # Output: "Hello World!"
Before Example:
You have an HTML document but don't know how to find specific elements.
Data: "<h1>Hello World!</h1>"
After Example:
With find, you can easily extract the first matching element!
Output: "Hello World!"
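One detail worth keeping in mind: find returns None when nothing matches, so calling .text on the result can raise an AttributeError. The sketch below guards against that; the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

html_data = "<html><body><h1>Hello World!</h1></body></html>"
soup = BeautifulSoup(html_data, "html.parser")

h1_tag = soup.find("h1")       # a Tag object
missing = soup.find("table")   # None, because the document has no <table>

# Guard against None before reading .text
h1_text = h1_tag.text if h1_tag is not None else ""
table_text = missing.text if missing is not None else ""

print(repr(h1_text), repr(table_text))
```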
Challenge: Try using find to locate other tags like <p> or <div>.
3. Finding Multiple Elements (find_all)
Boilerplate Code:
soup.find_all('tag_name')
Use Case: Use find_all to retrieve a list of all elements that match a given tag.
Goal: Extract all occurrences of a specific tag.
Sample Code:
# Find all paragraph tags
paragraphs = soup.find_all('p')
for p in paragraphs:
    print(p.text)
Before Example:
You need to extract multiple elements, but find only gives you the first one.
Data: "<p>Paragraph 1</p><p>Paragraph 2</p>"
After Example:
With find_all, you can extract all matching elements!
Output: "Paragraph 1" and "Paragraph 2", each printed on its own line.
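The sketch below shows a few variations on find_all, under the assumption of a small hypothetical document: collecting text into a list, capping the number of results with limit, and passing a list of tag names to match several tags at once.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, used only for illustration
html_data = "<h1>Title</h1><p>Paragraph 1</p><p>Paragraph 2</p>"
soup = BeautifulSoup(html_data, "html.parser")

# Collect the text of every <p> into a plain list
paragraph_texts = [p.text for p in soup.find_all("p")]

# limit stops the search after the given number of matches
first_only = soup.find_all("p", limit=1)

# A list of names matches any of those tags, in document order
names = [tag.name for tag in soup.find_all(["h1", "p"])]

# No match gives an empty result rather than None
none_found = soup.find_all("table")

print(paragraph_texts, names, len(none_found))
```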
Challenge: Try using find_all to extract all <a> tags and print their href attributes.
4. Extracting Text Content (get_text)
Boilerplate Code:
soup.get_text()
Use Case: Use get_text to retrieve all text content from an element or document.
Goal: Extract only the text from an HTML or XML element.
Sample Code:
# Get all text from the document
text_content = soup.get_text()
print(text_content)
Before Example:
You want to extract the text but HTML tags are in the way.
Data: "<p>Hello <b>World</b>!</p>"
After Example:
With get_text, you extract only the plain text!
Output: "Hello World!"
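By default get_text joins text pieces with nothing in between, which can run adjacent blocks together. A small sketch of the separator and strip arguments, using hypothetical markup:

```python
from bs4 import BeautifulSoup

# Hypothetical markup with two paragraphs and an inline <b> tag
html_data = "<div><p>Hello <b>World</b>!</p><p>Second line</p></div>"
soup = BeautifulSoup(html_data, "html.parser")

# Default: text pieces are concatenated as-is
plain = soup.get_text()

# separator is placed between every text piece; strip trims each piece first
spaced = soup.get_text(separator=" ", strip=True)

print(plain)   # Hello World!Second line
print(spaced)  # Hello World ! Second line
```

Note that the separator is inserted between every text node, including those split by inline tags like <b>, which is why a space appears before the "!".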
Challenge: Try using get_text() on different sections of the HTML document.
5. Navigating the DOM (soup.tag)
Boilerplate Code:
soup.tag
Use Case: Use dot notation to directly access the first tag with a given name in the document.
Goal: Quickly navigate the document's structure using tag names.
Sample Code:
# Access the body tag
body = soup.body
print(body)
Before Example:
You need to locate specific parts of the document but don't want to call find repeatedly.
Data: "<body>...</body>"
After Example:
With dot notation, navigating the document becomes quick and easy!
Output: The `<body>` element, including its contents.
Challenge: Try using soup.title to extract the <title> tag content.
6. Extracting Attributes (tag['attribute'])
Boilerplate Code:
tag['attribute']
Use Case: Extract attributes (like href and src) from HTML elements.
Goal: Retrieve specific attributes from tags, such as URLs from anchor tags.
Sample Code:
# Extract href attribute from an anchor tag
link = soup.find('a')
print(link['href'])
Before Example:
You need to get the URL from an anchor tag, but all you have is the tag itself.
Data: "<a href='https://example.com'>Example</a>"
After Example:
With tag['attribute'], you can easily extract the href value!
Output: "https://example.com"
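Square-bracket access raises a KeyError when the attribute is absent, so for tags that may lack it, tag.get is the safer choice. A short sketch, assuming a hypothetical second anchor with no href:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second anchor has no href attribute
html_data = "<a href='https://example.com'>Example</a><a>No link</a>"
soup = BeautifulSoup(html_data, "html.parser")
links = soup.find_all("a")

first_href = links[0]["href"]         # attribute exists
missing_href = links[1].get("href")   # None instead of an exception
fallback = links[1].get("href", "#")  # explicit default value

# Square brackets raise KeyError for a missing attribute
try:
    links[1]["href"]
    raised = False
except KeyError:
    raised = True

print(first_href, missing_href, fallback, raised)
```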
Challenge: Try extracting the src attribute from an <img> tag.
7. Finding by Class or ID (find class_/id)
Boilerplate Code:
soup.find('tag', class_='class_name')
soup.find('tag', id='id_name')
Use Case: Use class_ or id to find elements by their CSS class or ID.
Goal: Locate elements based on their class or ID attributes.
Sample Code:
# Find element by class name
element = soup.find('div', class_='my-class')
print(element)
# Find element by ID
element = soup.find('div', id='my-id')
print(element)
Before Example:
You have elements with specific classes or IDs but don't know how to locate them.
Data: "<div class='my-class'>...</div>"
After Example:
With class_ or id, you can directly find elements by their attributes!
Output: The element with the matching class or ID.
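The sketch below illustrates how class_ behaves when an element carries several classes: a single class name matches any element that has it among its classes. The class and id names are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the first div has two classes
html_data = '<div class="card featured" id="main">A</div><div class="card">B</div>'
soup = BeautifulSoup(html_data, "html.parser")

# class_ matches any element that includes "card" among its classes
cards = soup.find_all("div", class_="card")

# id lookups work without naming the tag
main = soup.find(id="main")

print(len(cards), main.text, main["class"])
```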
Challenge: Try using find_all to locate multiple elements with the same class.
8. Modifying HTML Content (tag.string)
Boilerplate Code:
tag.string = "New Content"
Use Case: Modify the content of an HTML element using tag.string.
Goal: Change the text inside an HTML tag.
Sample Code:
# Modify the content of an h1 tag
h1_tag = soup.find('h1')
h1_tag.string = "New Heading"
print(h1_tag)
Before Example:
You want to change the content of an HTML tag but don't know how to edit it.
Data: "<h1>Hello World!</h1>"
After Example:
With tag.string, you can change the text inside the tag!
Output: "<h1>New Heading</h1>"
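A caveat that the simple case hides: assigning to tag.string replaces everything inside the tag, nested tags included, and reading .string gives None when the tag has more than one child. A small sketch with hypothetical markup:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the paragraph contains text plus a nested <b> tag
html_data = "<p>Hello <b>World</b>!</p>"
soup = BeautifulSoup(html_data, "html.parser")
p_tag = soup.find("p")

# .string is None here because <p> has several children
before = p_tag.string

# Assigning replaces all existing children, including the <b> tag
p_tag.string = "Replaced"
after = str(p_tag)

print(before, after)
```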
Challenge: Try modifying multiple elements in the document by looping through them.
9. Inserting New Tags (new_tag)
Boilerplate Code:
new_tag = soup.new_tag("tag_name")
Use Case: Use new_tag to create and insert new HTML elements dynamically.
Goal: Add new elements to the document for manipulation or enhancement.
Sample Code:
# Create a new tag
new_tag = soup.new_tag("p")
new_tag.string = "This is a new paragraph."
# Append it to the body
soup.body.append(new_tag)
print(soup.body)
Before Example:
You want to add new content but don't know how to create new HTML elements.
Data: "<body>...</body>"
After Example:
With new_tag, you can dynamically insert new elements!
Output: A new paragraph added to the body.
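new_tag also accepts attributes as keyword arguments, which saves a separate assignment step. A minimal sketch, assuming an empty body and an illustrative link:

```python
from bs4 import BeautifulSoup

html_data = "<body></body>"
soup = BeautifulSoup(html_data, "html.parser")

# Attributes can be passed as keyword arguments when the tag is created
link = soup.new_tag("a", href="https://example.com")
link.string = "Example"

# Attach the new tag as the last child of <body>
soup.body.append(link)
result = str(soup.body)

print(result)
```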
Challenge: Try inserting multiple tags dynamically at different places in the document.
10. Removing Tags (decompose)
Boilerplate Code:
tag.decompose()
Use Case: Remove tags and their contents from the document using decompose.
Goal: Clean up unwanted tags or elements from the HTML.
Sample Code:
# Find and remove an h1 tag
h1_tag = soup.find('h1')
h1_tag.decompose()
print(soup)
Before Example:
You want to remove an element but don't know how to delete it.
Data: "<h1>Hello World!</h1>"
After Example:
With decompose, the element and its contents are completely removed!
Output: The `<h1>` tag is removed from the document.
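If you still need the removed element afterwards, extract is a related method worth knowing: it detaches the tag from the tree and returns it, whereas decompose destroys it. The sketch below contrasts the two on hypothetical markup.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, used only for illustration
html_data = "<body><h1>Hello World!</h1><p>Keep me</p></body>"
soup = BeautifulSoup(html_data, "html.parser")

# extract() removes the tag from the tree but hands it back for reuse
removed = soup.find("h1").extract()
after_extract = str(soup)

# decompose() removes the tag and discards it
soup.find("p").decompose()
after_decompose = str(soup)

print(removed.text, after_extract, after_decompose)
```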
Challenge: Try using decompose to remove multiple tags or entire sections of the document.
11. Finding by CSS Selectors (select)
Boilerplate Code:
soup.select('css_selector')
Use Case: Use select to find elements using CSS selectors (like .class, #id, or tag).
Goal: Locate elements based on complex CSS selectors.
Sample Code:
# Find elements by CSS selector
elements = soup.select('div.my-class')
for element in elements:
    print(element)
Before Example:
You need to find elements using CSS-style selectors.
Data: "<div class='my-class'>...</div>"
After Example:
With select, you can target elements using flexible and complex selectors!
Output: Elements found using the `.my-class` CSS selector.
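As a rough sketch of how select relates to select_one: select always returns a list (possibly empty), while select_one returns the first match or None. The markup and the id "other" are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, used only for illustration
html_data = (
    '<div class="my-class"><p>First</p><p>Second</p></div>'
    '<div id="other"><p>Third</p></div>'
)
soup = BeautifulSoup(html_data, "html.parser")

# select returns every match as a list
in_class = [p.text for p in soup.select("div.my-class p")]

# select_one returns only the first match, or None when nothing matches
by_id = soup.select_one("#other p").text
no_match = soup.select_one("span")

print(in_class, by_id, no_match)
```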
Challenge: Try using more advanced CSS selectors like div > p or ul li:first-child.
12. Extracting Tag Names (name)
Boilerplate Code:
tag.name
Use Case: Extract the tag name from an element using the name attribute.
Goal: Identify the type of element (e.g., h1, p, div).
Sample Code:
# Extract tag name
tag = soup.find('h1')
print(tag.name) # Output: "h1"
Before Example:
You want to check what kind of element you're dealing with but don't know its tag name.
Data: "<h1>Hello World!</h1>"
After Example:
With tag.name, you can extract and confirm the element type!
Output: "h1"
Challenge: Try printing the names of all elements within a specific tag (like <div>).
13. Navigating Parent Elements (parent)
Boilerplate Code:
tag.parent
Use Case: Access the parent element of a tag using tag.parent.
Goal: Navigate to the parent element of a given tag.
Sample Code:
# Find parent of an h1 tag
h1_tag = soup.find('h1')
print(h1_tag.parent)
Before Example:
You want to access the container element (parent) of a specific tag.
Data: "<body><h1>Hello World!</h1></body>"
After Example:
With parent, you can move up the DOM to the parent tag!
Output: The parent `<body>` element.
Challenge: Try accessing the grandparent by chaining .parent.parent.
14. Navigating Sibling Elements (next_sibling / previous_sibling)
Boilerplate Code:
tag.next_sibling
tag.previous_sibling
Use Case: Use next_sibling and previous_sibling to navigate between sibling elements.
Goal: Access the next or previous sibling element of a tag.
Sample Code:
# Find next sibling of an h1 tag
h1_tag = soup.find('h1')
print(h1_tag.next_sibling)
Before Example:
You want to move between elements on the same level in the DOM (siblings).
Data: "<h1>Hello World!</h1><p>This is a paragraph.</p>"
After Example:
With next_sibling, you can move to the next sibling element in the DOM!
Output: The `<p>` element following the `<h1>`.
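The example above works because there is no whitespace between the two tags. In real pages the tags are usually separated by newlines, and next_sibling then returns that whitespace text node instead of the next tag. A sketch of the difference, using find_next_sibling to skip to the next matching tag:

```python
from bs4 import BeautifulSoup

# Hypothetical markup with newlines between the tags, as in most real pages
html_data = "<body>\n<h1>Hello World!</h1>\n<p>This is a paragraph.</p>\n</body>"
soup = BeautifulSoup(html_data, "html.parser")
h1_tag = soup.find("h1")

# next_sibling returns the newline text node that follows the <h1>
raw_next = h1_tag.next_sibling

# find_next_sibling searches the following siblings for a matching tag
next_tag = h1_tag.find_next_sibling("p")

print(repr(raw_next), next_tag)
```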
Challenge: Try looping through all siblings of a tag.
15. Navigating Child Elements (children)
Boilerplate Code:
tag.children
Use Case: Access the children (direct descendants) of a tag using tag.children.
Goal: Iterate over all child elements of a given tag.
Sample Code:
# Loop through child elements of the body tag
body = soup.body
for child in body.children:
    print(child)
Before Example:
You want to access all direct child elements of a parent tag.
Data: "<body><h1>Hello</h1><p>World</p></body>"
After Example:
With children, you can easily loop through and extract all child tags!
Output: The `<h1>` and `<p>` elements inside the body.
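To make the difference between direct children and deeper nesting concrete, the sketch below compares .children with .descendants on hypothetical markup. Since .descendants also yields text nodes, the example filters for Tag objects.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical markup: <b> is nested inside <p>, so it is not a direct child of <body>
html_data = "<body><h1>Hello</h1><p>World <b>wide</b></p></body>"
soup = BeautifulSoup(html_data, "html.parser")

# Direct children only
child_names = [child.name for child in soup.body.children]

# Every nested node; keep only Tag objects to skip the text nodes
descendant_names = [
    node.name for node in soup.body.descendants if isinstance(node, Tag)
]

print(child_names, descendant_names)
```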
Challenge: Try using .descendants to access all descendants, including nested ones.
16. Accessing Tag Attributes (attrs)
Boilerplate Code:
tag.attrs
Use Case: Use attrs to get all attributes of an HTML tag.
Goal: Retrieve a dictionary of all attributes associated with a tag.
Sample Code:
# Get all attributes of an anchor tag
anchor = soup.find('a')
print(anchor.attrs) # Output: {'href': 'https://example.com', 'title': 'Example Link'}
Before Example:
You need to access all attributes of a tag, but you only know one.
Data: "<a href='https://example.com' title='Example Link'></a>"
After Example:
With attrs, you get a dictionary of all the attributes associated with the tag!
Output: {'href': 'https://example.com', 'title': 'Example Link'}
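One thing that can surprise when reading attrs: with the HTML parsers, class is treated as a multi-valued attribute and comes back as a list, while most other attributes are plain strings. A short sketch with hypothetical class names:

```python
from bs4 import BeautifulSoup

# Hypothetical markup with a two-value class attribute
html_data = '<a href="https://example.com" title="Example Link" class="btn primary">Example</a>'
soup = BeautifulSoup(html_data, "html.parser")
anchor = soup.find("a")

# class is a list of values; href and title are strings
classes = anchor.attrs["class"]
href = anchor.attrs["href"]

# has_attr checks for an attribute without risking a KeyError
has_id = anchor.has_attr("id")

print(classes, href, has_id)
```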
Challenge: Try modifying an attribute by directly editing tag.attrs['attribute'].
17. Searching with Multiple Criteria (find/find_all with filters)
Boilerplate Code:
soup.find('tag', {'attribute': 'value'})
Use Case: Use find or find_all with filters to search based on multiple criteria.
Goal: Locate elements that match specific tags and attributes.
Sample Code:
# Find div with specific class and id
div = soup.find('div', {'class': 'my-class', 'id': 'my-id'})
print(div)
Before Example:
You need to find elements that match both tag name and attributes.
Data: "<div class='my-class' id='my-id'>...</div>"
After Example:
With filters, you can locate elements that meet multiple conditions!
Output: The `<div>` tag with matching class and id.
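The dictionary form and the keyword form are interchangeable for ordinary attributes, but attribute names such as data-role are not valid Python keyword names and need the attrs dictionary. A sketch under that assumption, with a hypothetical data-role attribute:

```python
from bs4 import BeautifulSoup

# Hypothetical markup; data-role is an illustrative attribute
html_data = (
    '<div class="my-class" id="my-id" data-role="main">A</div>'
    '<div class="my-class">B</div>'
)
soup = BeautifulSoup(html_data, "html.parser")

# Dictionary form and keyword form find the same element
by_dict = soup.find("div", {"class": "my-class", "id": "my-id"})
by_keywords = soup.find("div", class_="my-class", id="my-id")

# Hyphenated attribute names have to go through the attrs dictionary
by_data = soup.find_all("div", attrs={"class": "my-class", "data-role": "main"})

print(by_dict.text, by_keywords.text, len(by_data))
```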
Challenge: Try combining class_, id, and other attributes for more complex searches.
18. Using Lambda for Custom Filters (find_all)
Boilerplate Code:
soup.find_all(lambda tag: some_condition)
Use Case: Use a lambda function in find_all to apply custom search filters.
Goal: Apply custom logic to filter elements based on non-standard conditions.
Sample Code:
# Find all tags with more than one attribute
tags = soup.find_all(lambda tag: len(tag.attrs) > 1)
for tag in tags:
    print(tag)
Before Example:
You need a custom search condition that standard filters can't handle.
Data: Multiple elements with varying attributes.
After Example:
With lambda filters, you can apply any custom condition to find elements!
Output: All tags with more than one attribute.
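Any callable that takes a tag and returns True or False works as a filter, so a named function can stand in for a lambda when the condition gets longer. The sketch below shows both on hypothetical markup; has_href_but_no_title is an illustrative helper, not part of bs4.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with varying attributes
html_data = '<a href="/a" title="A">A</a><a href="/b">B</a><img src="x.png" alt="x">'
soup = BeautifulSoup(html_data, "html.parser")

# Lambda filter: tags with more than one attribute
multi_attr = [tag.name for tag in soup.find_all(lambda tag: len(tag.attrs) > 1)]


def has_href_but_no_title(tag):
    """Illustrative helper: True for tags with an href but no title."""
    return tag.has_attr("href") and not tag.has_attr("title")


# Named function filter, passed the same way as the lambda
untitled_links = [tag.text for tag in soup.find_all(has_href_but_no_title)]

print(multi_attr, untitled_links)
```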
Challenge: Try using lambda to find all tags with specific text content or custom attribute logic.
19. Searching for Specific Text (string/text)
Boilerplate Code:
soup.find_all(string="specific text")
Use Case: Use string to find text in the document that exactly matches a given value.
Goal: Search the document by text content.
Sample Code:
# Find all strings that exactly match "Hello"
matches = soup.find_all(string="Hello")
print(matches)
Before Example:
You need to find elements that contain a specific string of text.
Data: "<p>Hello</p><p>World</p>"
After Example:
With string, you can locate text in the document by its content!
Output: ['Hello'] (a list of the matching strings, not the tags that contain them).
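Because a string-only search returns text nodes rather than tags, there are two common ways to get back to the enclosing element: follow .parent from each match, or combine a tag name with string. A minimal sketch on hypothetical markup:

```python
from bs4 import BeautifulSoup

# Hypothetical markup; only the first paragraph is exactly "Hello"
html_data = "<p>Hello</p><p>World</p><p>Hello there</p>"
soup = BeautifulSoup(html_data, "html.parser")

# String-only search: exact matches, returned as text nodes
matches = soup.find_all(string="Hello")

# Each text node knows the tag that contains it
parent_names = [text_node.parent.name for text_node in matches]

# Tag name plus string: returns the tags whose text is exactly "Hello"
tags = soup.find_all("p", string="Hello")

print(matches, parent_names, tags)
```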
Challenge: Try searching for partial matches or case-insensitive text.
20. Modifying the DOM Tree (insert_before/insert_after)
Boilerplate Code:
tag.insert_before(new_tag)
tag.insert_after(new_tag)
Use Case: Use insert_before or insert_after to insert elements into the DOM tree.
Goal: Dynamically insert new elements before or after existing ones.
Sample Code:
# Create a new paragraph tag
new_paragraph = soup.new_tag("p")
new_paragraph.string = "This is a new paragraph."
# Insert the new paragraph after the h1 tag
h1_tag = soup.find('h1')
h1_tag.insert_after(new_paragraph)
Before Example:
You want to add new content in specific positions in the DOM.
Data: "<h1>Hello World!</h1>"
After Example:
With insert_before and insert_after, you can add new elements dynamically!
Output: A new paragraph inserted after the `<h1>` tag.
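The sample covers insert_after only; the sketch below uses both methods around the same heading so the resulting order is easy to see. The paragraph texts are placeholders.

```python
from bs4 import BeautifulSoup

html_data = "<body><h1>Hello World!</h1></body>"
soup = BeautifulSoup(html_data, "html.parser")
h1_tag = soup.find("h1")

# Build two new paragraphs with placeholder text
intro = soup.new_tag("p")
intro.string = "Intro"
outro = soup.new_tag("p")
outro.string = "Outro"

# Place one before and one after the heading
h1_tag.insert_before(intro)
h1_tag.insert_after(outro)
result = str(soup.body)

print(result)
```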
Challenge: Try inserting multiple elements at different locations in the DOM.