Create a simple web scraping script in Python
What the project is about
This project is a simple, fun example of how web scraping can help you learn and practice coding. The script collects the products that Amazon returns for a search. If you search for "lamp", for example, it automates the process by going through each results page and saving the title and price of every product it finds.
What you need to know for this project
To understand and modify this project you should know the basics of Python and of HTTP requests. If you want an introduction to HTTP, you can check out my article about it: "HTTP protocol made simple"
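As a small illustration of what an HTTP GET request is made of, the sketch below builds a request for an Amazon search page without sending it. It uses only the standard library. The helper name `build_search_request` is made up for this example, and the user agent string is a placeholder you would replace with your own.

```python
from urllib.parse import urlencode
from urllib.request import Request


# Hypothetical helper for this example; it is not part of the script in this article.
def build_search_request(product, page, user_agent="Your User Agent"):
    # urlencode escapes spaces and special characters in the search term
    query = urlencode({"k": product, "page": page})
    url = f"https://www.amazon.it/s?{query}"
    # A GET request boils down to a method, a URL and a set of headers
    return Request(url, headers={"User-Agent": user_agent}, method="GET")


request = build_search_request("desk lamp", 2)
print(request.get_method())    # the HTTP method
print(request.full_url)        # the URL with the encoded query string
print(request.header_items())  # the headers that would be sent
```

Nothing is sent over the network here. The point is only to show the three pieces (method, URL, headers) that a library like `requests` puts together for you.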
The project
So let's take a look at the full script:
from urllib.parse import quote_plus

from bs4 import BeautifulSoup
import requests


# Get all the products on an Amazon results page and append them to a CSV file
def scrape_page(url):
    cache = set()
    count = 0
    # To find your user agent, search "what is my user agent" in your browser and paste the result here
    headers = {'user-agent': 'Your User Agent'}
    response = requests.get(url, headers=headers, timeout=10)
    soup = BeautifulSoup(response.content, "html.parser")
    cards = soup.find_all("div", class_="sg-col-inner")
    print("===========================")
    print("Page data:")
    print(f"Response Status Code: {response.status_code}")
    # Open the file once per page instead of once per product
    with open("data.csv", "a", encoding="utf-8") as file:
        for element in cards:
            title = element.find("span", class_="a-size-base-plus a-color-base a-text-normal")
            price = element.find("span", class_="a-price-whole")
            if title is None or price is None:
                continue
            # Compare the extracted text so the same product is stored only once per page
            row = (title.text.strip(), price.text.strip())
            if row not in cache:
                file.write(f"{row[0]};{row[1]}\n")
                cache.add(row)
                count += 1
    print(f"Element Stored: {count}")
    print("===========================")


def main():
    product = input("What product are you looking for? ")
    page_number = input("How many pages do you want to scrape? ").strip()
    while not page_number.isdigit():
        print("Please enter a number of pages")
        page_number = input("How many pages do you want to scrape? ").strip()
    page_number = int(page_number)
    # The script can also scrape more than one page
    for page in range(1, page_number + 1):
        # quote_plus escapes spaces and special characters in the search term
        url = f"https://www.amazon.it/s?k={quote_plus(product)}&page={page}"
        scrape_page(url)


if __name__ == "__main__":
    main()
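To see how the extraction step behaves without sending a request to Amazon, the same `find_all` and `find` calls can be tried on a small hand-written HTML snippet. This is only a sketch. `SAMPLE_HTML` and `extract_products` are names made up for this example, and the markup just imitates the class names used in the script. Real Amazon pages are more complex and their class names can change over time.

```python
from bs4 import BeautifulSoup

# Invented markup that imitates the class names the script looks for
SAMPLE_HTML = """
<div class="sg-col-inner">
  <span class="a-size-base-plus a-color-base a-text-normal">Desk lamp</span>
  <span class="a-price-whole">19</span>
</div>
<div class="sg-col-inner">
  <span class="a-size-base-plus a-color-base a-text-normal">Floor lamp</span>
</div>
"""


def extract_products(html):
    """Return a list of (title, price) tuples for cards that have both fields."""
    soup = BeautifulSoup(html, "html.parser")
    products = []
    for card in soup.find_all("div", class_="sg-col-inner"):
        title = card.find("span", class_="a-size-base-plus a-color-base a-text-normal")
        price = card.find("span", class_="a-price-whole")
        # Skip cards that are missing a title or a price, as the script does
        if title is not None and price is not None:
            products.append((title.get_text(strip=True), price.get_text(strip=True)))
    return products


print(extract_products(SAMPLE_HTML))
```

The second card has no price, so it is skipped. This is the same check that keeps incomplete results out of `data.csv`.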
Conclusion
I hope this simple script inspires you to build your own version, make it better, or improve its performance. Thanks for reading the article.
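As one possible improvement, the rows could be written with Python's `csv` module, which quotes titles that contain a semicolon so they do not break the columns. The sketch below is an assumption about how that might look, not part of the original script. The function name `write_products` is made up, and it takes an already opened file so it can be tried with any file-like object.

```python
import csv


# Hypothetical helper: writes unique (title, price) rows and returns how many were written
def write_products(products, file):
    seen = set()
    written = 0
    writer = csv.writer(file, delimiter=";")
    for title, price in products:
        # Skip rows that were already written in this call
        if (title, price) in seen:
            continue
        seen.add((title, price))
        # csv.writer quotes fields that contain the delimiter
        writer.writerow([title, price])
        written += 1
    return written


if __name__ == "__main__":
    rows = [("Desk lamp", "19"), ("Desk lamp", "19"), ("Lamp; large", "25")]
    # newline="" lets the csv module control line endings
    with open("data.csv", "a", newline="", encoding="utf-8") as file:
        print(write_products(rows, file))
```

Using a set for `seen` also makes the duplicate check faster than searching a list when a page contains many products.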
Follow and support me:
Special thanks if you subscribe to my channel :)
Written by
Paolo Ferrari
Hi, I am a self-taught software developer. I want to share how I learn new things, along with the few things I know, with other people.