ScrapeGraphAI + Datatune: Scrape the Web, Then Transform the Data

I was browsing Amazon for a new laptop, getting overwhelmed by the hundreds of options and confusing specifications. After spending two hours manually copying specs into a spreadsheet, I decided there had to be a better approach. That's when I remembered reading about ScrapeGraphAI and thought it might be worth trying.

The Problem with Manual Research

Shopping for laptops online presents several challenges:

  • Hundreds of models with varying specifications

  • Prices that change frequently

  • Reviews scattered across different sections

  • Difficult comparison between similar models

Like many people, I started by opening multiple browser tabs and manually collecting information. This process was time-consuming and prone to errors.

What is ScrapeGraphAI?

ScrapeGraphAI is a Python library that uses LLMs to extract data from websites and documents. Instead of writing complex scraping code, you describe what information you want in plain language, and the library handles the extraction.

Key features include:

  • Natural language instructions: No need for complex CSS selectors or XPath

  • Multiple LLM support: Works with GPT, Gemini, Groq, Azure, and local models via Ollama

  • Flexible formats: Handles XML, HTML, JSON, and Markdown documents

Using ScrapeGraphAI for Laptop Research

Here's how I used it to extract laptop information from Amazon:

Setup

pip install scrapegraphai
playwright install

Basic Implementation

from scrapegraphai.graphs import SmartScraperGraph

graph_config = {
    "llm": {
        "model": "gpt-3.5-turbo",
        "api_key": "YOUR_API_KEY"
    },
    "verbose": True,
    "headless": False
}

smart_scraper = SmartScraperGraph(
    prompt="Extract laptop name, price, rating, key specifications, and availability from this Amazon search page",
    source="https://amazon.com/s?k=laptops+under+1500",
    config=graph_config
)

results = smart_scraper.run()
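Before handing the data off to the next step, it helps to persist the run output to disk. The exact structure of `results` depends on your prompt and the page content; the dict below is a hypothetical sample shaped to match the file the Datatune step reads later:

```python
import json

# Hypothetical sample of what SmartScraperGraph might return for the prompt
# above; the real structure depends on the prompt and the scraped page.
results = {
    "laptops": [
        {"brand": "Apple", "model": "MacBook Pro", "price": 1299,
         "ram": 16, "processor": "Apple M1", "batteryLife": 18}
    ]
}

# Persist the output so downstream processing can read it from disk
with open("smartscraper-2025-07-19.json", "w") as f:
    json.dump({"result": results}, f, indent=2)
```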

Processing the Output with Datatune

Once ScrapeGraphAI extracts the data, you can use datatune to clean and filter the results.

Install datatune using pip:

pip install datatune

import os
import dask.dataframe as dd
import datatune as dt
from datatune.llm.llm import OpenAI
import json
import pandas as pd

os.environ["OPENAI_API_KEY"] = "your-openai-api-key"
llm = OpenAI(model_name="gpt-3.5-turbo", tpm=200000, rpm=50)

# Convert ScrapeGraphAI JSON output to DataFrame
with open('smartscraper-2025-07-19.json', 'r') as f:
    scrape_data = json.load(f)

# Extract laptops array and convert to DataFrame
laptops_df = pd.DataFrame(scrape_data['result']['laptops'])
df = dd.from_pandas(laptops_df, npartitions=2)

# Map operation to standardize and enrich data
mapped = dt.Map(
    prompt="Standardize the processor name and categorize it as budget, mid-range, or high-end based on performance",
    output_fields=["processor_category", "standardized_processor"],
    input_fields=["processor"]
)(llm, df)

# Filter for laptops with 16GB RAM and reasonable pricing
filtered = dt.Filter(
    prompt="Keep only laptops with 16GB RAM, price between $800-$2000, and battery life over 8 hours",
    input_fields=["ram", "price", "batteryLife"]
)(llm, mapped)

# Additional mapping for value analysis
value_mapped = dt.Map(
    prompt="Calculate value score based on price, specs, and battery life. Rate as excellent, good, fair, or poor value",
    output_fields=["value_rating", "value_score"],
    input_fields=["price", "ram", "ssd", "processor", "batteryLife", "weight"]
)(llm, filtered)

# Final cleanup and export
result = dt.finalize(value_mapped)
final_df = result.compute()
final_df.to_csv("filtered_laptops.csv", index=False)

print(f"Found {len(final_df)} laptops matching criteria")
print(final_df[['brand', 'model', 'price', 'value_rating']].head())

Alternative Processing Approach

For more specific filtering based on use cases:

# Filter for different use cases
productivity_laptops = dt.Filter(
    prompt="Keep laptops suitable for productivity work: good battery life, lightweight, reliable processor",
    input_fields=["batteryLife", "weight", "processor", "ram"]
)(llm, df)

gaming_laptops = dt.Filter(
    prompt="Keep gaming laptops: powerful processor, dedicated graphics implied by model name, adequate RAM",
    input_fields=["processor", "model", "ram", "price"]
)(llm, df)

budget_laptops = dt.Filter(
    prompt="Keep budget-friendly laptops under $1000 with decent specifications",
    input_fields=["price", "ram", "processor", "ssd"]
)(llm, df)

# Process each category
categories = {
    "productivity": productivity_laptops,
    "gaming": gaming_laptops, 
    "budget": budget_laptops
}

for category, filtered_df in categories.items():
    # Add category information
    categorized = dt.Map(
        prompt=f"Add category '{category}' and recommend this laptop with brief reasoning",
        output_fields=["category", "recommendation_reason"],
        input_fields=["brand", "model", "price", "processor", "ram"]
    )(llm, filtered_df)

    final_result = dt.finalize(categorized)
    final_result.compute().to_csv(f"{category}_laptops.csv", index=False)
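Once the loop above has written a CSV per category, the files can be recombined into a single comparison table. The mini-DataFrames below are hypothetical stand-ins for those CSVs (in practice you would `pd.read_csv` each file), just to sketch the merge:

```python
import pandas as pd

# Hypothetical mini-frames standing in for the per-category CSVs
productivity = pd.DataFrame({"brand": ["Apple"], "model": ["MacBook Pro"],
                             "price": [1299], "category": ["productivity"]})
gaming = pd.DataFrame({"brand": ["ASUS"], "model": ["TUF Gaming A15"],
                       "price": [999], "category": ["gaming"]})

# Stack the categories into one comparison table, cheapest first
combined = pd.concat([productivity, gaming], ignore_index=True)
combined = combined.sort_values("price").reset_index(drop=True)
print(combined)
```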

Final Output

After running the complete pipeline (ScrapeGraphAI extraction + datatune processing), here's what the final dataset looks like:

brand,model,price,ram,ssd,processor,screenSize,batteryLife,weight,releaseYear,processor_category,standardized_processor,value_rating,value_score,category,recommendation_reason
Apple,MacBook Pro,1299,16,512,Apple M1,13.3,18,1.4,2020,high-end,Apple M1 8-core,excellent,9.2,productivity,"Outstanding battery life and performance for professional work, lightweight design ideal for mobile productivity"
Lenovo,ThinkPad X1 Carbon,1399,16,512,Intel Core i7,14,15,1.1,2021,high-end,Intel Core i7-1165G7,good,8.1,productivity,"Ultra-lightweight business laptop with excellent build quality and long battery life, perfect for professionals"
ASUS,TUF Gaming A15,999,8,512,AMD Ryzen 5,15.6,10,2.2,2021,mid-range,AMD Ryzen 5 4600H,good,7.8,gaming,"Solid gaming performance at affordable price point, good processor and adequate RAM for modern games"
Acer,Aspire 5,599,8,256,Intel Core i5,15.6,9,1.75,2021,mid-range,Intel Core i5-1135G7,excellent,8.9,budget,"Outstanding value for money with solid specs for everyday computing, good balance of price and performance"
Microsoft,Surface Laptop 4,999,8,256,AMD Ryzen 5,13.5,19,1.27,2021,mid-range,AMD Ryzen 5 4680U,good,8.0,productivity,"Premium build quality with exceptional battery life, ideal for students and professionals who value portability"
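A dataset like this is easy to rank once it is back in pandas. The snippet below embeds two rows from the output above inline (via `StringIO`, so it is self-contained) and sorts by the value score:

```python
import io
import pandas as pd

# Two rows copied from the final output above, embedded inline
csv_text = """brand,model,price,value_score
Apple,MacBook Pro,1299,9.2
Acer,Aspire 5,599,8.9
"""

df = pd.read_csv(io.StringIO(csv_text))
# Rank by value score, best first
top = df.sort_values("value_score", ascending=False).reset_index(drop=True)
print(top[["brand", "model", "value_score"]])
```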

Available Pipeline Types

ScrapeGraphAI offers several scraping approaches:

  • SmartScraperGraph: Single-page extraction with user prompts

  • SearchGraph: Multi-page scraping from search results

  • SpeechGraph: Converts web content to audio files

  • ScriptCreatorGraph: Generates a Python scraping script for the target page

  • SmartScraperMultiGraph: Handles multiple pages simultaneously

Practical Applications

Beyond laptop shopping, this approach works for:

E-commerce and Retail

  • Price monitoring across competitors

  • Product availability tracking

  • Review analysis

  • Market trend identification
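Price monitoring, for instance, reduces to diffing two scrape snapshots. The dicts below are hypothetical stand-ins for data from two ScrapeGraphAI runs stored on different days:

```python
# Hypothetical snapshots from two scrape runs (model -> price)
yesterday = {"MacBook Pro": 1299, "Aspire 5": 599}
today = {"MacBook Pro": 1249, "Aspire 5": 599, "Surface Laptop 4": 999}

# Models whose price moved, mapped to (old_price, new_price)
changes = {
    model: (yesterday[model], price)
    for model, price in today.items()
    if model in yesterday and yesterday[model] != price
}
# Models that appeared since the last run
new_listings = [m for m in today if m not in yesterday]

print(changes)
print(new_listings)
```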

Business Intelligence

  • Lead generation from directories

  • Competitor analysis

  • Market research automation

  • Social media monitoring

Research and Academia

  • Academic paper data extraction

  • Survey data collection

  • Content analysis

  • Dataset creation

Financial Services

  • Stock price tracking

  • Financial news monitoring

  • Economic indicator collection

  • Risk assessment data

Getting Started

To try ScrapeGraphAI yourself:

  1. Installation
pip install scrapegraphai
playwright install
  2. Basic Usage
from scrapegraphai.graphs import SmartScraperGraph

config = {
    "llm": {
        "model": "gpt-3.5-turbo", 
        "api_key": "your-api-key"
    }
}

scraper = SmartScraperGraph(
    prompt="Extract product names and prices",
    source="https://example-shop.com",
    config=config
)

data = scraper.run()
print(data)
  3. Advanced Features
  • Use local models with Ollama for privacy

  • Handle JavaScript-heavy sites with headless browsers

  • Process multiple pages simultaneously

  • Generate audio summaries of web content
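For the local-model route, the config dict looks much like the OpenAI one shown earlier. This is a sketch only: the exact keys may vary by ScrapeGraphAI version, so check the docs for your installed release, and it assumes Ollama is serving on its default port:

```python
# Hedged sketch: a ScrapeGraphAI-style config pointing at a local Ollama model.
# Key names follow the OpenAI-style config shown earlier in this post;
# verify them against the ScrapeGraphAI docs for your version.
ollama_config = {
    "llm": {
        "model": "ollama/llama3",
        "base_url": "http://localhost:11434",  # default Ollama endpoint
    },
    "verbose": True,
    "headless": True,  # run the browser without a visible window
}
print(ollama_config["llm"]["model"])
```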

Results and Considerations

In my laptop research, ScrapeGraphAI reduced data collection time from hours to minutes. The structured JSON output made it easy to compare options and identify the best value laptops in my budget range.

You can find more ScrapeGraphAI examples here: https://github.com/ScrapeGraphAI/scrapegraph-sdk/tree/main/scrapegraph-py/examples/sync

Conclusion

ScrapeGraphAI offers a practical solution for automated data extraction without requiring extensive programming knowledge. It significantly reduces the manual effort involved in gathering structured data from websites.

For anyone regularly collecting data from websites, whether for research, business intelligence, or personal projects, it's worth exploring. The natural language interface makes it accessible to non-programmers, while the flexibility supports more complex use cases.

The combination with datatune for post-processing provides a complete pipeline from raw web data to cleaned, categorized datasets ready for analysis.

Give us a star!

Datatune: https://github.com/vitalops/datatune

ScrapeGraphAI: https://github.com/ScrapeGraphAI/scrapegraph-sdk


Written by

Abhijith Neil Abraham