ScrapeGraphAI + Datatune: scrape the web, then transform the data

I was browsing Amazon for a new laptop, getting overwhelmed by the hundreds of options and confusing specifications. After spending two hours manually copying specs into a spreadsheet, I decided there had to be a better approach. That's when I remembered reading about ScrapeGraphAI and thought it might be worth trying.
The Problem with Manual Research
Shopping for laptops online presents several challenges:
Hundreds of models with varying specifications
Frequently changing prices
Reviews scattered across different sections
Difficult comparisons between similar models
Like many people, I started by opening multiple browser tabs and manually collecting information. This process was time-consuming and prone to errors.
What is ScrapeGraphAI?
ScrapeGraphAI is a Python library that uses LLMs to extract data from websites and documents. Instead of writing complex scraping code, you describe what information you want in plain language, and the library handles the extraction.
Key features include:
Natural language instructions: No need for complex CSS selectors or XPath
Multiple LLM support: Works with GPT, Gemini, Groq, Azure, and local models via Ollama
Flexible formats: Handles XML, HTML, JSON, and Markdown documents
Using ScrapeGraphAI for Laptop Research
Here's how I used it to extract laptop information from Amazon:
Setup
pip install scrapegraphai
playwright install
Basic Implementation
from scrapegraphai.graphs import SmartScraperGraph

graph_config = {
    "llm": {
        "model": "gpt-3.5-turbo",
        "api_key": "YOUR_API_KEY"
    },
    "verbose": True,
    "headless": False
}

smart_scraper = SmartScraperGraph(
    prompt="Extract laptop name, price, rating, key specifications, and availability from this Amazon search page",
    source="https://amazon.com/s?k=laptops+under+1500",
    config=graph_config
)

results = smart_scraper.run()
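The run returns a nested dictionary, which you can persist as JSON for later processing. The exact keys depend on your prompt; below is a minimal sketch of the shape assumed by the rest of this article (field names like `batteryLife` are taken from the later pipeline, and the sample row is purely illustrative), saved to disk for the Datatune step:

```python
import json

# Illustrative shape of the scraper output - the real keys depend on
# what your prompt asks ScrapeGraphAI to extract
results = {
    "result": {
        "laptops": [
            {
                "brand": "Apple",
                "model": "MacBook Pro",
                "price": 1299,
                "ram": 16,
                "ssd": 512,
                "processor": "Apple M1",
                "batteryLife": 18,
                "weight": 1.4,
            }
        ]
    }
}

# Persist the run so the post-processing step can pick it up from disk
with open("smartscraper-2025-07-19.json", "w") as f:
    json.dump(results, f, indent=2)
```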
Processing the Output with Datatune
Once ScrapeGraphAI extracts the data, you can use datatune to clean and filter the results.
Install datatune using pip:
pip install datatune
import os
import json
import pandas as pd
import dask.dataframe as dd
import datatune as dt
from datatune.llm.llm import OpenAI

os.environ["OPENAI_API_KEY"] = "your-openai-api-key"
llm = OpenAI(model_name="gpt-3.5-turbo", tpm=200000, rpm=50)

# Convert ScrapeGraphAI JSON output to a DataFrame
with open('smartscraper-2025-07-19.json', 'r') as f:
    scrape_data = json.load(f)

# Extract the laptops array and convert it to a DataFrame
laptops_df = pd.DataFrame(scrape_data['result']['laptops'])
df = dd.from_pandas(laptops_df, npartitions=2)

# Map operation to standardize and enrich data
mapped = dt.Map(
    prompt="Standardize the processor name and categorize it as budget, mid-range, or high-end based on performance",
    output_fields=["processor_category", "standardized_processor"],
    input_fields=["processor"]
)(llm, df)

# Filter for laptops with 16GB RAM and reasonable pricing
filtered = dt.Filter(
    prompt="Keep only laptops with 16GB RAM, price between $800-$2000, and battery life over 8 hours",
    input_fields=["ram", "price", "batteryLife"]
)(llm, mapped)

# Additional mapping for value analysis
value_mapped = dt.Map(
    prompt="Calculate value score based on price, specs, and battery life. Rate as excellent, good, fair, or poor value",
    output_fields=["value_rating", "value_score"],
    input_fields=["price", "ram", "ssd", "processor", "batteryLife", "weight"]
)(llm, filtered)

# Final cleanup and export
result = dt.finalize(value_mapped)
final_df = result.compute()
final_df.to_csv("filtered_laptops.csv", index=False)

print(f"Found {len(final_df)} laptops matching criteria")
print(final_df[['brand', 'model', 'price', 'value_rating']].head())
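Because LLM-based filtering is probabilistic, it can be worth cross-checking the numeric constraints deterministically after the fact. Here is a minimal sketch with plain pandas, assuming the same column names (`ram`, `price`, `batteryLife`) and a few made-up rows:

```python
import pandas as pd

# Hypothetical mini-batch with the columns the Filter prompt relies on
df = pd.DataFrame([
    {"model": "ThinkPad X1 Carbon", "ram": 16, "price": 1399, "batteryLife": 15},
    {"model": "Aspire 5",           "ram": 8,  "price": 599,  "batteryLife": 9},
    {"model": "MacBook Pro",        "ram": 16, "price": 1299, "batteryLife": 18},
])

# The same constraints the LLM filter was asked to apply, expressed as
# a deterministic pandas mask - a cheap sanity check on the LLM's output
mask = (df["ram"] == 16) & df["price"].between(800, 2000) & (df["batteryLife"] > 8)
checked = df[mask]
print(checked["model"].tolist())  # only the 16GB machines survive
```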
Alternative Processing Approach
For more specific filtering based on use cases:
# Filter for different use cases
productivity_laptops = dt.Filter(
    prompt="Keep laptops suitable for productivity work: good battery life, lightweight, reliable processor",
    input_fields=["batteryLife", "weight", "processor", "ram"]
)(llm, df)

gaming_laptops = dt.Filter(
    prompt="Keep gaming laptops: powerful processor, dedicated graphics implied by model name, adequate RAM",
    input_fields=["processor", "model", "ram", "price"]
)(llm, df)

budget_laptops = dt.Filter(
    prompt="Keep budget-friendly laptops under $1000 with decent specifications",
    input_fields=["price", "ram", "processor", "ssd"]
)(llm, df)

# Process each category
categories = {
    "productivity": productivity_laptops,
    "gaming": gaming_laptops,
    "budget": budget_laptops
}

for category, filtered_df in categories.items():
    # Add category information
    categorized = dt.Map(
        prompt=f"Add category '{category}' and recommend this laptop with brief reasoning",
        output_fields=["category", "recommendation_reason"],
        input_fields=["brand", "model", "price", "processor", "ram"]
    )(llm, filtered_df)
    final_result = dt.finalize(categorized)
    final_result.compute().to_csv(f"{category}_laptops.csv", index=False)
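If you would rather end up with a single comparison table than three separate CSVs, the per-category results can be tagged and concatenated. A sketch with stand-in data frames (in practice these would be the computed Datatune results, and the sample rows here are illustrative):

```python
import pandas as pd

# Stand-in results for each use case - in the pipeline above these would
# come from the per-category outputs
frames = {
    "productivity": pd.DataFrame({"model": ["MacBook Pro"], "price": [1299]}),
    "gaming":       pd.DataFrame({"model": ["TUF Gaming A15"], "price": [999]}),
    "budget":       pd.DataFrame({"model": ["Aspire 5"], "price": [599]}),
}

# Tag each frame with its category and stack them into one comparison table
combined = pd.concat(
    [f.assign(category=name) for name, f in frames.items()],
    ignore_index=True,
)
print(combined)
```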
Final Output
After running the complete pipeline (ScrapeGraphAI extraction + datatune processing), here's what the final dataset looks like:
brand,model,price,ram,ssd,processor,screenSize,batteryLife,weight,releaseYear,processor_category,standardized_processor,value_rating,value_score,category,recommendation_reason
Apple,MacBook Pro,1299,16,512,Apple M1,13.3,18,1.4,2020,high-end,Apple M1 8-core,excellent,9.2,productivity,"Outstanding battery life and performance for professional work, lightweight design ideal for mobile productivity"
Lenovo,ThinkPad X1 Carbon,1399,16,512,Intel Core i7,14,15,1.1,2021,high-end,Intel Core i7-1165G7,good,8.1,productivity,"Ultra-lightweight business laptop with excellent build quality and long battery life, perfect for professionals"
ASUS,TUF Gaming A15,999,8,512,AMD Ryzen 5,15.6,10,2.2,2021,mid-range,AMD Ryzen 5 4600H,good,7.8,gaming,"Solid gaming performance at affordable price point, good processor and adequate RAM for modern games"
Acer,Aspire 5,599,8,256,Intel Core i5,15.6,9,1.75,2021,mid-range,Intel Core i5-1135G7,excellent,8.9,budget,"Outstanding value for money with solid specs for everyday computing, good balance of price and performance"
Microsoft,Surface Laptop 4,999,8,256,AMD Ryzen 5,13.5,19,1.27,2021,mid-range,AMD Ryzen 5 4680U,good,8.0,productivity,"Premium build quality with exceptional battery life, ideal for students and professionals who value portability"
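Once exported, the CSV is easy to rank and slice with pandas. A small sketch using a few abbreviated rows from the dataset above:

```python
import io
import pandas as pd

# A few rows from the final dataset, with columns abbreviated for brevity
csv_sample = """brand,model,price,value_score,category
Apple,MacBook Pro,1299,9.2,productivity
Acer,Aspire 5,599,8.9,budget
Lenovo,ThinkPad X1 Carbon,1399,8.1,productivity
"""

df = pd.read_csv(io.StringIO(csv_sample))

# Rank by the LLM-assigned value score to surface the best deals first
ranked = df.sort_values("value_score", ascending=False)
print(ranked[["model", "value_score"]].to_string(index=False))
```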
Available Pipeline Types
ScrapeGraphAI offers several scraping approaches:
SmartScraperGraph: Single-page extraction with user prompts
SearchGraph: Multi-page scraping from search results
SpeechGraph: Converts web content to audio files
ScriptCreatorGraph: Generates a reusable Python scraping script for the extraction
SmartScraperMultiGraph: Handles multiple pages simultaneously
Practical Applications
Beyond laptop shopping, this approach works for:
E-commerce and Retail
Price monitoring across competitors
Product availability tracking
Review analysis
Market trend identification
Business Intelligence
Lead generation from directories
Competitor analysis
Market research automation
Social media monitoring
Research and Academia
Academic paper data extraction
Survey data collection
Content analysis
Dataset creation
Financial Services
Stock price tracking
Financial news monitoring
Economic indicator collection
Risk assessment data
Getting Started
To try ScrapeGraphAI yourself:
- Installation
pip install scrapegraphai
playwright install
- Basic Usage
from scrapegraphai.graphs import SmartScraperGraph

config = {
    "llm": {
        "model": "gpt-3.5-turbo",
        "api_key": "your-api-key"
    }
}

scraper = SmartScraperGraph(
    prompt="Extract product names and prices",
    source="https://example-shop.com",
    config=config
)

data = scraper.run()
print(data)
- Advanced Features
Use local models with Ollama for privacy
Handle JavaScript-heavy sites with headless browsers
Process multiple pages simultaneously
Generate audio summaries of web content
Results and Considerations
In my laptop research, ScrapeGraphAI reduced data collection time from hours to minutes. The structured JSON output made it easy to compare options and identify the best value laptops in my budget range.
You can find more ScrapeGraphAI examples here: https://github.com/ScrapeGraphAI/scrapegraph-sdk/tree/main/scrapegraph-py/examples/sync
Conclusion
ScrapeGraphAI offers a practical solution for automated data extraction without requiring extensive programming knowledge. It significantly reduces the manual effort involved in gathering structured data from websites.
For anyone regularly collecting data from websites, whether for research, business intelligence, or personal projects, it's worth exploring. The natural language interface makes it accessible to non-programmers, while the flexibility supports more complex use cases.
The combination with datatune for post-processing provides a complete pipeline from raw web data to cleaned, categorized datasets ready for analysis.
Give us a star!
Datatune: https://github.com/vitalops/datatune
ScrapeGraphAI: https://github.com/ScrapeGraphAI/scrapegraph-sdk