When working with web scraping or offline website analysis, you might need to download not only the HTML content of a page but also its associated assets like CSS files, JavaScript, images, and fonts. Python has well-established libraries for this. In this guide, we'll use requests for making HTTP requests and BeautifulSoup for parsing HTML. We'll also download the assets and save them to the appropriate directories.

Here’s a step-by-step guide to help you download HTML and associated assets from a URL using Python.

Prerequisites

Ensure you have the required libraries installed. You can install them using pip if you haven't already:

pip install requests beautifulsoup4

Step-by-Step Script

Import Required Libraries

We'll need requests to fetch the webpage and its assets, BeautifulSoup to parse the HTML, and some standard libraries to handle files and directories.
```
 import requests
 from bs4 import BeautifulSoup
 import os
 from urllib.parse import urljoin
```

Define the URL and Fetch HTML Content

Define the URL from which you want to download the HTML and assets. Use requests to fetch the HTML content.

 # Define the URL
 url = 'http://example.com'

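 # Fetch the HTML content (the timeout avoids hanging on an unresponsive server)
 response = requests.get(url, timeout=10)
 response.raise_for_status()  # Stop early on 4xx/5xx responses
 html_content = response.text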

Save the HTML Content

Write the fetched HTML content to a file named index.html.

 # Save HTML content
 with open('index.html', 'w', encoding='utf-8') as file:
     file.write(html_content)

Parse HTML Content

Use BeautifulSoup to parse the HTML content for extracting asset URLs.
```
 # Parse HTML content
 soup = BeautifulSoup(html_content, 'html.parser')
```

Create Directories for Assets

Create directories to store CSS, JavaScript, images, and other assets.

 # Create directories to save CSS, JS, images, and other assets
 os.makedirs('css', exist_ok=True)
 os.makedirs('js', exist_ok=True)
 os.makedirs('images', exist_ok=True)
 os.makedirs('assets', exist_ok=True)  # For other assets
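
If you would rather keep everything for one site under a single output folder, the directory step above could be wrapped in a small helper. This is only a sketch: the function name make_asset_dirs and the ASSET_DIRS tuple are hypothetical, and the folder names simply mirror the ones used in this guide.

```python
from pathlib import Path

# Hypothetical constant: folder names mirror the ones used in this guide.
ASSET_DIRS = ("css", "js", "images", "assets")


def make_asset_dirs(root):
    """Create one subdirectory per asset type under `root` and return their paths."""
    root = Path(root)
    paths = {}
    for name in ASSET_DIRS:
        path = root / name
        # parents=True also creates `root`; exist_ok=True makes reruns harmless.
        path.mkdir(parents=True, exist_ok=True)
        paths[name] = path
    return paths
```

Calling make_asset_dirs('example_site') would return a dictionary such as {'css': Path('example_site/css'), ...}, whose values can be passed as the directory argument when saving files.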

Define a Function to Download Files

This function will download files from a URL and save them to the specified directory.

 # Function to download files
 def download_file(file_url, directory):
     try:
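         response = requests.get(file_url, timeout=10)
         response.raise_for_status()  # Check for request errors
         # Drop any query string or fragment so the result is a usable file name
         file_name = os.path.basename(file_url.split('?')[0].split('#')[0]) or 'index'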
         file_path = os.path.join(directory, file_name)
         with open(file_path, 'wb') as file:
             file.write(response.content)
         print(f"Downloaded: {file_url}")
     except requests.RequestException as e:
         print(f"Error downloading {file_url}: {e}")
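
Deriving the local file name is the fragile part of a download function like this one: URLs may carry query strings, percent-encoded characters, or no file name at all. The helper below is a minimal sketch of one way to handle those cases. The name local_file_name and the choice to append a short hash of the query string are assumptions made for illustration, not part of the original script.

```python
import hashlib
import os
from urllib.parse import unquote, urlparse


def local_file_name(file_url, default="index"):
    """Derive a local file name from a URL, ignoring the fragment."""
    parsed = urlparse(file_url)
    # Use only the path component, decoded (e.g. %20 becomes a space).
    name = os.path.basename(unquote(parsed.path)) or default
    if parsed.query:
        # Different query strings can point to different files, so add a
        # short hash of the query to keep their local names distinct.
        digest = hashlib.sha1(parsed.query.encode("utf-8")).hexdigest()[:8]
        stem, ext = os.path.splitext(name)
        name = f"{stem}-{digest}{ext}"
    return name


print(local_file_name("https://example.com/static/app.js"))  # app.js
print(local_file_name("https://example.com/"))               # index
```

Inside download_file, the result of this helper could replace the os.path.basename call when building file_path.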

Find and Download CSS Files

Locate CSS files and download them.

 # Find and download CSS files
 for link in soup.find_all('link', href=True):
     if 'stylesheet' in link.get('rel', []):
         css_url = urljoin(url, link['href'])
         download_file(css_url, 'css')
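
The loop above relies on urljoin to turn whatever appears in href into an absolute URL. The short example below shows how the common forms resolve; the base URL and paths are made up for illustration. If the server redirects the initial request, response.url may be a more accurate base than the original url variable.

```python
from urllib.parse import urljoin

# Illustrative base URL; in the script this would be the page's URL.
base = "https://example.com/blog/post.html"

# Relative path: resolved against the directory of the base URL.
print(urljoin(base, "css/site.css"))
# https://example.com/blog/css/site.css

# Root-relative path: resolved against the site root.
print(urljoin(base, "/css/site.css"))
# https://example.com/css/site.css

# Parent-directory reference: walks up one level from /blog/.
print(urljoin(base, "../theme.css"))
# https://example.com/theme.css

# Protocol-relative URL: inherits the scheme of the base URL.
print(urljoin(base, "//cdn.example.net/site.css"))
# https://cdn.example.net/site.css

# Absolute URL: returned unchanged.
print(urljoin(base, "https://other.example.org/a.css"))
# https://other.example.org/a.css
```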

Find and Download JavaScript Files

Locate JavaScript files and download them.

 # Find and download JavaScript bundles
 for script in soup.find_all('script', src=True):
     js_url = urljoin(url, script['src'])
     download_file(js_url, 'js')

Find and Download Images

Locate image files and download them.

 # Find and download images
 for img in soup.find_all('img', src=True):
     img_url = urljoin(url, img['src'])
     download_file(img_url, 'images')
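
Responsive pages often list additional image files in a srcset attribute, which the src-only loop above does not see. The function below is a rough sketch of how those candidates could be collected. The name srcset_urls is hypothetical, and the simple comma split assumes the URLs themselves contain no commas, so it is not suitable for inline data: URIs.

```python
from urllib.parse import urljoin


def srcset_urls(srcset, base_url):
    """Return absolute URLs for each candidate in a srcset attribute value."""
    urls = []
    for candidate in srcset.split(","):
        parts = candidate.strip().split()
        if not parts:
            continue  # skip empty entries such as a trailing comma
        # parts[0] is the URL; anything after it is a width ("480w")
        # or pixel-density ("2x") descriptor.
        urls.append(urljoin(base_url, parts[0]))
    return urls


print(srcset_urls("small.jpg 480w, /img/large.jpg 1080w", "https://example.com/blog/"))
# ['https://example.com/blog/small.jpg', 'https://example.com/img/large.jpg']
```

In the script, this could be applied to every tag returned by soup.find_all('img', srcset=True), passing each resulting URL to download_file.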

Find and Download Other Assets

Handle other assets like fonts, videos, and icons.

# Find and download other assets linked from the page (icons and web app manifests)
for link in soup.find_all('link', href=True):
    if 'icon' in link.get('rel', []) or 'manifest' in link.get('rel', []):
        asset_url = urljoin(url, link['href'])
        download_file(asset_url, 'assets')

# Example for handling video files (src may sit on <video> or on its <source> children)
for video in soup.find_all(['video', 'source'], src=True):
    video_url = urljoin(url, video['src'])
    download_file(video_url, 'assets')

# Example for handling font files linked directly from the HTML (e.g., preloaded fonts)
for link in soup.find_all('link', href=True):
    if link['href'].endswith(('.woff', '.woff2', '.ttf', '.otf')):
        font_url = urljoin(url, link['href'])
        download_file(font_url, 'assets')
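
Fonts and background images are more often referenced from inside stylesheets, through url(...) values, than from the HTML itself. As a rough sketch, the downloaded CSS text could be scanned for those references like this. The function name css_asset_urls and the regular expression are assumptions for illustration; a full CSS parser would be more reliable for unusual syntax.

```python
import re
from urllib.parse import urljoin

# Matches url(...) with optional single or double quotes around the target.
CSS_URL_RE = re.compile(r"""url\(\s*(['"]?)(.*?)\1\s*\)""", re.IGNORECASE)


def css_asset_urls(css_text, css_url):
    """Return absolute URLs referenced via url(...) in a stylesheet."""
    found = []
    for match in CSS_URL_RE.finditer(css_text):
        target = match.group(2).strip()
        # Skip empty values, inline data, and same-document fragments.
        if not target or target.startswith(("data:", "#")):
            continue
        # Relative paths in CSS resolve against the stylesheet's own URL,
        # not against the page that linked it.
        absolute = urljoin(css_url, target)
        if absolute not in found:
            found.append(absolute)
    return found


css = '@font-face { src: url("../fonts/a.woff2") format("woff2"); } body { background: url(img/bg.png); }'
print(css_asset_urls(css, "https://example.com/css/site.css"))
# ['https://example.com/fonts/a.woff2', 'https://example.com/css/img/bg.png']
```

Each returned URL could then be passed to download_file with the 'assets' directory.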

💡 Feel free to tweak and extend this script to handle more complex scenarios or additional types of assets.

Conclusion

This script provides a simple starting point for downloading HTML and associated assets from a URL. By utilizing requests and BeautifulSoup, you can fetch and save web content for offline analysis or replication. Note that the saved index.html still references the original asset URLs, so those links need to be rewritten if the page must work fully offline. Adjust the directories and asset handling as needed based on the specific requirements of your project. Happy scraping!

How to Download HTML and Assets from a URL with Python

ByteScrum Technologies
