Building an Automated Web Scraper Using Supabase Edge Functions

Subramanya M Rao · 10 min read

Web scraping is a common task in many applications, whether you're collecting pricing data, monitoring content updates, or aggregating information from multiple sources. While there are many ways to build a scraper, creating one that runs automatically on a schedule without managing your own server infrastructure can be challenging.

In this post, I'll walk through how I built a fully automated web scraping solution using Supabase's free tier. This solution periodically extracts data from a website and stores it in a database, all without managing any infrastructure.

The Problem to Solve

I needed to monitor a specific website for daily content updates. The content included images and text that changed daily, and I wanted to archive this information automatically. Specifically, I needed to:

  • Extract the image URL and title text for each day's content

  • Store this data in a structured database

  • Run the scraper automatically every few hours

  • Do all of this without managing servers

The Solution: Supabase Edge Functions + PostgreSQL

My solution combines several technologies:

  • Supabase Edge Functions - serverless functions that run our scraping code

  • Supabase PostgreSQL - database to store the scraped data

  • pg_cron - PostgreSQL extension for scheduling tasks

  • Cheerio - Node.js library for parsing HTML

  • Axios - HTTP client for making web requests

Step 1: Setting Up the Database Schema

First, I created a table in Supabase to store the scraped data:

create table daily_images (
  id serial primary key,
  date date unique not null,
  image_url text not null,
  title text,
  created_at timestamp default now()
);

-- Adding a fallback URL column for cases when scraping fails
alter table daily_images add column fallback_url text;

This schema allows me to:

  • Store one entry per day (using the unique constraint on date; see the example query after this list)

  • Save both the image URL and descriptive title

  • Track when each entry was created

  • Have a fallback image URL if the scraper fails to find an image
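
To make the "one entry per day" behavior concrete, here is a minimal, hypothetical upsert against this schema (the values are placeholders): thanks to the unique constraint, re-running it for the same day updates the existing row instead of inserting a duplicate.

insert into daily_images (date, image_url, title)
values (current_date, 'https://example.com/today.jpg', 'Daily Content')
on conflict (date) do update
  set image_url = excluded.image_url,
      title = excluded.title;

This is the same ON CONFLICT mechanism the Edge Function relies on later through Supabase's upsert() call.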

Step 2: Creating the Edge Function

Next, I wrote a Supabase Edge Function to perform the actual scraping. Edge Functions run on Deno rather than Node.js, so imports use URL and npm: specifiers instead of bare package names:

import { serve } from 'https://deno.land/std@0.177.0/http/server.ts'
import axios from 'npm:axios'
import * as cheerio from 'npm:cheerio'
import { format } from 'npm:date-fns'
import { createClient } from 'https://esm.sh/@supabase/supabase-js@2'

const supabaseUrl = Deno.env.get('SUPABASE_URL') as string
const supabaseServiceKey = Deno.env.get('SUPABASE_SERVICE_ROLE_KEY') as string
const supabase = createClient(supabaseUrl, supabaseServiceKey)

async function scrapeContent() {
    try {
        console.log('Starting scraping process...')

        // Fetch the target webpage
        const response = await axios.get('https://example.com/daily-content')
        console.log('Successfully fetched the webpage')

        const $ = cheerio.load(response.data)
        console.log('Loaded HTML content with cheerio')

        // Format today's date to match how it appears on the site
        const targetDate = format(new Date(), 'MMMM d')
        console.log('Looking for date:', targetDate)

        // Find the heading containing today's date
        const targetHeading = $(`h2.wp-block-heading:contains('${targetDate}')`)
        console.log('Found matching headings:', targetHeading.length)

        if (!targetHeading.length) {
            console.log('No heading found for today\'s date')
            return { imageUrl: null, title: null }
        }

        // Extract the heading text to use as title
        const title = targetHeading.text().trim()

        // Find the gallery element that follows the heading
        const gallery = targetHeading.next('figure.wp-block-gallery')
        console.log('Found gallery:', gallery.length > 0 ? 'yes' : 'no')

        // Get the first image from the gallery
        const firstImage = gallery.find('img').first().attr('src')
        console.log('Image URL found:', firstImage || 'none')

        return { imageUrl: firstImage || null, title }
    } catch (error) {
        console.error('Scraping failed with error:', error.message)
        return { imageUrl: null, title: null }
    }
}

serve(async (req) => {
    try {
        // Get the image URL and title
        const { imageUrl, title } = await scrapeContent()

        if (imageUrl) {
            // Get today's date formatted for the database
            const today = format(new Date(), 'yyyy-MM-dd')

            // Update the table with the new image URL and title
            const { data, error } = await supabase
                .from('daily_images')
                .upsert(
                    { 
                        date: today,
                        image_url: imageUrl,
                        title: title || `Daily Content - ${today}`,
                        created_at: new Date().toISOString()
                    },
                    { onConflict: 'date' }
                )

            if (error) throw error

            return new Response(JSON.stringify({ 
                success: true, 
                imageUrl,
                title 
            }), {
                headers: { 'Content-Type': 'application/json' },
                status: 200
            })
        } else {
            // If no image is found, you might want to set a fallback URL
            const fallbackUrl = "https://example.com/default-image.jpg"
            const today = format(new Date(), 'yyyy-MM-dd')

            // Update with fallback information
            const { data, error } = await supabase
                .from('daily_images')
                .upsert(
                    { 
                        date: today,
                        image_url: null,
                        fallback_url: fallbackUrl,
                        title: `Daily Content (Fallback) - ${today}`,
                        created_at: new Date().toISOString()
                    },
                    { onConflict: 'date' }
                )

            if (error) throw error

            return new Response(JSON.stringify({ 
                success: false, 
                message: 'No image found, fallback used',
                fallbackUrl
            }), {
                headers: { 'Content-Type': 'application/json' },
                status: 404
            })
        }
    } catch (error) {
        return new Response(JSON.stringify({ success: false, error: error.message }), {
            headers: { 'Content-Type': 'application/json' },
            status: 500
        })
    }
})

This function:

  1. Searches a webpage for content related to today's date (the date formats involved are shown after this list)

  2. Extracts the image URL and title text

  3. Updates our Supabase database with the information

  4. Handles failures gracefully by using a fallback URL

  5. Logs each step with console.log(), since the deployed Edge Function's logs can be viewed in the Supabase dashboard
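
As a quick reference for how the dates line up, here is a small illustrative snippet (not part of the deployed function) showing the two date-fns format strings the function uses:

import { format } from 'npm:date-fns'

// 'MMMM d' matches the human-readable headings on the page
console.log(format(new Date(2024, 3, 5), 'MMMM d'))     // "April 5"

// 'yyyy-MM-dd' matches the Postgres date column
console.log(format(new Date(2024, 3, 5), 'yyyy-MM-dd')) // "2024-04-05"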

Step 3: Local Development and Testing

Before deploying, I needed to set up a local development environment to test my Edge Function. Supabase Edge Functions run on Deno (not Node.js), so the local setup needs to reflect this.

Setting Up the Local Environment

First, I installed Deno:

# On macOS or Linux
curl -fsSL https://deno.land/x/install/install.sh | sh

# On Windows (using PowerShell)
iwr https://deno.land/x/install/install.ps1 -useb | iex

# Or using Homebrew
brew install deno

Next, I needed to create a local Supabase setup for testing. The Supabase CLI uses Docker to run a local development environment:

# Install Docker if you don't have it
# Instructions vary by OS: https://docs.docker.com/get-docker/

# Install the Supabase CLI (installing it as a global npm module isn't supported)
brew install supabase/tap/supabase
# or add it as a dev dependency and call it with npx: npm install supabase --save-dev

# Initialize a new Supabase project
mkdir my-scraper && cd my-scraper
supabase init

# Start the local Supabase stack
supabase start

This command spins up Docker containers with PostgreSQL, PostgREST, GoTrue, and other Supabase services locally.
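
The daily_images table from Step 1 also needs to exist in this local database. One way to do that, sketched with the CLI's migration workflow, is:

# Create a migration file, then paste the create table statement from Step 1 into it
supabase migration new create_daily_images

# Rebuild the local database and apply all migrations
supabase db reset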

Creating and Testing the Edge Function Locally

With the local environment running, I created and tested my Edge Function:

# Create a new Edge Function
supabase functions new scrape-daily-image

# Add my code to the function file at:
# supabase/functions/scrape-daily-image/index.ts

# No need to set SUPABASE_URL or SUPABASE_SERVICE_ROLE_KEY here:
# `supabase functions serve` injects them automatically, pointing at the local
# stack (run `supabase status` to see the local URL and keys).
# Extra secrets can go in supabase/functions/.env and be passed with --env-file.

# Run the function locally
supabase functions serve --no-verify-jwt

With the function running locally, I tested it from a separate terminal:

curl -X POST http://localhost:54321/functions/v1/scrape-daily-image \
-H "Content-Type: application/json"

Unit Testing Edge Functions

For more robust testing, Supabase Edge Functions can be unit tested with Deno's built-in testing framework.
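
As a minimal sketch, a test could exercise just the cheerio selector logic against a fixed HTML snippet (the file name and markup below are hypothetical):

// supabase/functions/scrape-daily-image/index.test.ts (hypothetical)
import { assertEquals } from 'https://deno.land/std@0.177.0/testing/asserts.ts'
import * as cheerio from 'npm:cheerio'

Deno.test('extracts the first gallery image that follows a dated heading', () => {
    // Trimmed-down stand-in for the real page structure
    const html = `
        <h2 class="wp-block-heading">April 5</h2>
        <figure class="wp-block-gallery">
            <img src="https://example.com/april-5.jpg" />
        </figure>`

    const $ = cheerio.load(html)
    const heading = $("h2.wp-block-heading:contains('April 5')")
    const imageUrl = heading.next('figure.wp-block-gallery').find('img').first().attr('src')

    assertEquals(imageUrl, 'https://example.com/april-5.jpg')
})

Running deno test in the function's directory executes this without touching the network, which keeps the test fast and deterministic.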

For more detailed information on testing Supabase Edge Functions, check out this comprehensive guide on testing Supabase Edge Functions with Deno Test and Supabase's official unit testing documentation.

Step 4: Deploying the Edge Function

With local testing complete, I was ready to deploy the function to Supabase:

# Login to Supabase
supabase login

# Deploy the function
supabase functions deploy scrape-daily-image --no-verify-jwt

The --no-verify-jwt flag allows the function to be called without authentication. For production use with sensitive operations, you might want to require authentication by omitting this flag.

After deployment, the function is available at https://your-project-ref.functions.supabase.co/scrape-daily-image.

Environment Variables for Production

In production, deployed Edge Functions automatically receive SUPABASE_URL, SUPABASE_ANON_KEY, and SUPABASE_SERVICE_ROLE_KEY as environment variables, so the Deno.env.get() calls in the function work without extra setup. Any additional secrets the function needs are set with the CLI (names starting with SUPABASE_ are reserved and will be rejected):

supabase secrets set MY_CUSTOM_SECRET="some-value" \
  --project-ref your-project-ref

With the built-in variables in place, the function can connect to the production Supabase database without hard-coding credentials.
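
To double-check which secrets are present on the hosted project, the CLI can list them (it shows names, not plain-text values):

supabase secrets list --project-ref your-project-ref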

Step 5: Testing the Deployed Function

Once deployed, I tested the function by invoking it directly:

curl -X POST https://your-project-ref.functions.supabase.co/scrape-daily-image \
-H "Authorization: Bearer your-anon-key" \
-H "Content-Type: application/json"

This ensured that my function could:

  1. Successfully connect to the target website

  2. Parse the HTML and extract the needed information

  3. Save the data to my Supabase database

Step 6: Setting Up Automatic Scheduling

The final piece was setting up automatic scheduling. For this, I used Supabase's PostgreSQL database with the pg_cron extension:

-- Enable the required extensions
CREATE EXTENSION IF NOT EXISTS pg_cron;
CREATE EXTENSION IF NOT EXISTS pg_net;

-- Create a function to call the edge function
CREATE OR REPLACE FUNCTION call_scrape_edge_function()
RETURNS void AS $$
BEGIN
  PERFORM net.http_post(
    url := 'https://your-project-ref.functions.supabase.co/scrape-daily-image',
    headers := '{"Authorization": "Bearer your-anon-key", "Content-Type": "application/json"}',
    body := '{}'
  );
END;
$$ LANGUAGE plpgsql;

-- Schedule the function to run every 3 hours
SELECT cron.schedule('scrape-daily-image', '0 */3 * * *', 'SELECT call_scrape_edge_function()');

After running this SQL, I verified that the job was correctly scheduled by checking the cron job table:

SELECT * FROM cron.job;

This returned:

| jobid | schedule    | command                            | nodename  | nodeport | database | username | active | jobname            |
| ----- | ----------- | ---------------------------------- | --------- | -------- | -------- | -------- | ------ | ------------------ |
| 1     | 0 */3 * * * | SELECT call_scrape_edge_function() | localhost | 5432     | postgres | postgres | true   | scrape-daily-image |

The scraper was now fully automated and would run every 3 hours!
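
pg_cron also makes it easy to keep an eye on the job or remove it later. Assuming a reasonably recent pg_cron version (which Supabase provides), for example:

-- Inspect recent runs of the job (pg_cron records a run history)
SELECT jobid, status, return_message, start_time
FROM cron.job_run_details
ORDER BY start_time DESC
LIMIT 10;

-- Stop the job by name if the schedule needs to change
SELECT cron.unschedule('scrape-daily-image');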

Key Benefits of This Approach

Using Supabase for this project provided several advantages:

  1. Zero infrastructure management - No servers to maintain or monitor

  2. Cost-effective - Runs on Supabase's free tier

  3. Automatic scheduling - Built-in PostgreSQL scheduling with pg_cron

  4. Persistent storage - Data is stored directly in a relational database

  5. Easily scalable - Can handle increased load as needed

  6. Simple deployment - Straightforward CLI-based deployment process

  7. Modern development experience - TypeScript support and Deno runtime

Challenges and Solutions

Like any project, I encountered a few challenges:

  1. Deno vs Node.js - Supabase Edge Functions use Deno, not Node.js, requiring slightly different import syntax and environment setup.

    • Solution: Used Deno-compatible imports and tested locally with Deno before deployment

    • Example: import axios from 'npm:axios' instead of import axios from 'axios'

  2. Missing pg_net extension - Initially got an error about the "net" schema not existing.

    • Solution: Added CREATE EXTENSION IF NOT EXISTS pg_net; to enable HTTP requests from PostgreSQL

    • Error message: ERROR: 3F000: schema "net" does not exist

  3. Error handling - Needed to handle cases where the expected content wasn't found.

    • Solution: Added fallback logic and proper error handling to ensure the process didn't fail silently

  4. Local development environment - Setting up a local testing environment with Deno requires some additional steps.

    • Solution: Used Docker with the Supabase CLI's supabase start command to create a fully functional local development environment

  5. Debugging deployed functions - Once deployed, debugging can be challenging.

    • Solution: Added extensive logging with console.log() statements and reviewed the function's logs in the Supabase dashboard

Conclusion

This project demonstrates how Supabase can be used to create a fully automated web scraping solution without managing any infrastructure. By combining Edge Functions, PostgreSQL, and pg_cron, I was able to build a reliable system that:

  1. Regularly scrapes a website for daily content

  2. Stores that content in a structured database

  3. Runs automatically on a schedule

  4. Handles failures gracefully

All of this was achieved using Supabase's free tier, making it a cost-effective solution for many small to medium projects.

What's particularly impressive is how this solution leverages the power of PostgreSQL not just as a database, but as a scheduler (via pg_cron) and as an HTTP client (via pg_net). This creates a truly serverless architecture where all components, from scheduling to execution to storage, are handled without managing any infrastructure.

Resources and Further Reading

For those looking to implement a similar solution, the official documentation for each piece of the stack is the best starting point:

  • Supabase Edge Functions documentation

  • Supabase CLI reference

  • pg_cron and pg_net documentation (both available as Postgres extensions on Supabase)

  • Cheerio documentation

  • The Deno manual

Next Steps

To extend this project further, I could:

  • Add notifications when new content is found (via email or webhooks; see the sketch after this list)

  • Create a simple front-end to display the archived content

  • Implement more sophisticated error handling and retries

  • Add authentication to protect the scraped data

  • Set up monitoring to alert me if the scraper fails consistently
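
For the notification idea, one lightweight approach would be to POST to a webhook from the Edge Function after a successful upsert. A rough sketch, where the WEBHOOK_URL secret and the payload shape are hypothetical:

// Hypothetical helper: notify a webhook (e.g. Slack, Discord, or your own endpoint)
// after new content is stored. WEBHOOK_URL would be set via `supabase secrets set`.
async function notifyWebhook(imageUrl: string, title: string | null) {
    const webhookUrl = Deno.env.get('WEBHOOK_URL')
    if (!webhookUrl) return

    await fetch(webhookUrl, {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({ text: `New daily content: ${title ?? 'untitled'} - ${imageUrl}` })
    })
}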

What scraping projects would you build with this approach? Let me know in the comments!
