Using DuckDB-WASM for in-browser Data Engineering

Tobias Müller
18 min read

Introduction

DuckDB, the in-process DBMS specialized in OLAP workloads, has seen very rapid growth during the last year, both in functionality and in popularity, among its users as well as among the developers who contribute many projects to the open-source DuckDB ecosystem.

DuckDB can not "only" be run on a variety of operating systems and architectures, there's also DuckDB-WASM, a version that allows running DuckDB directly in the browser. This opens up some very interesting use cases and has been gaining a lot of traction over the last 12 months.

Use Case: Building a SQL Workbench with DuckDB-WASM

One of the first things that came to my mind once I learned about the existence of DuckDB-WASM was that it could be used to create an online SQL Workbench, where people could interactively run queries, view their results, and also visualize them. DuckDB-WASM sits at its core, providing the storage layer, the query engine and much more.

You can find the project at sql-workbench.com.

It's built with the following core technologies / frameworks:

It's hosted as a static website export / single page application on AWS using

If you're interested in the hosting setup, you can have a look at https://github.com/tobilg/serverless-aws-static-websites, which can deploy such static websites on AWS via IaC with minimal effort.

Using the SQL Workbench

There are many ways you can use the SQL Workbench; some of them are described below.

Overview

When you open sql-workbench.com for the first time, you can see that the workbench is divided into three different areas:

  • On the left, there's the "Local Tables" area, which displays the tables created when you run queries such as CREATE TABLE names (name VARCHAR), or when you use the drag-and-drop area in the lower left corner to drop any CSV, Parquet or Arrow file onto it (see details below).

  • The upper main editor area is the SQL editor, where you can type your SQL queries. You're already presented with some example queries for different types of data once the page is loaded.

  • The lower main result area, where the results of the executed queries are shown, or alternatively the visualizations of these results.

💡
You can adjust the respective heights of the main areas by dragging the lever in the middle.

Running SQL queries

To run your first query, select the first line of SQL, either with your keyboard or with your mouse, and press the key combination CMD + Enter if you're on a Mac, or Ctrl + Enter if you're on a Windows or Linux machine.

The result of the query that was executed can then be found in the lower main area as a table:

💡
Queries will only be executed if one or more queries are selected. If multiple queries are to be executed, make sure each query ends with a semicolon, otherwise an error will be displayed.

Running multiple queries

You can also run multiple queries sequentially, e.g. to create a table, insert some records, and display the results:

CREATE TABLE first_names (name VARCHAR, birth_cnt integer);

INSERT INTO first_names (name, birth_cnt) VALUES ('Liam', 20456);
INSERT INTO first_names (name, birth_cnt) VALUES ('Noah', 18621);
INSERT INTO first_names (name, birth_cnt) VALUES ('Oliver', 15076);
INSERT INTO first_names (name, birth_cnt) VALUES ('James', 12028);
INSERT INTO first_names (name, birth_cnt) VALUES ('Elijah', 11979);
INSERT INTO first_names (name, birth_cnt) VALUES ('William', 11282);
INSERT INTO first_names (name, birth_cnt) VALUES ('Henry', 11221);
INSERT INTO first_names (name, birth_cnt) VALUES ('Lucas', 10909);
INSERT INTO first_names (name, birth_cnt) VALUES ('Benjamin', 10842);
INSERT INTO first_names (name, birth_cnt) VALUES ('Theodore', 10754);

SELECT * FROM first_names;

When you copy & paste the above SQL statements, select them, and run them, the result looks like this:

On the left-hand side you can see the newly created table first_names, which can be reused for other queries without having to reload the data.

💡
Only the result of the last-run query will be displayed in the lower main result area!

If you want to open a new SQL Workbench and directly run the above query, please click on the image below:

💡
The data is persisted until you reload the overall SQL Workbench page.

Querying data you have on your machine

To try this, you can for example download a list of AWS Services as a CSV from

https://raw.githubusercontent.com/tobilg/aws-iam-data/main/data/csv/aws_services.csv

This file has four columns: service_id, name, prefix and reference_url. Once you have downloaded the file, you can simply drag-and-drop it from the folder you downloaded it to onto the area in the lower left corner of the SQL Workbench:

A table called aws_services.csv has now been created automatically, which you can query via SQL, for example:

SELECT name, prefix from 'aws_services.csv';

If you want to open a new SQL Workbench and directly run the above query, please click on the image below:

Querying and visualizing remote data

DuckDB-WASM supports the loading of compatible data in different formats (e.g. CSV, Parquet or Arrow) from remote http(s) sources. Other data formats that can be used include JSON, but this requires the loading of so-called DuckDB extensions.
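
As a minimal sketch, assuming the json extension is available in the DuckDB-WASM bundle and using a placeholder URL, loading remote JSON data could look like this:

-- Install and load the JSON extension (availability in the browser bundle assumed)
INSTALL json;
LOAD json;

-- Query a remote JSON file (placeholder URL, replace with a real CORS-enabled source)
SELECT * FROM read_json_auto('https://example.com/data.json');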

💡
It's necessary that the websites hosting the data add the relevant CORS headers, otherwise the browser (not DuckDB-WASM or the SQL Workbench) will refuse to load the files and show an error message instead.

In this example, we will use data about AWS CloudFront Edge Locations, which is available at tobilg/aws-edge-locations, with this query:

SELECT * FROM 'https://raw.githubusercontent.com/tobilg/aws-edge-locations/main/data/aws-edge-locations.parquet';

The result will look like this:

We now want to create a bar chart of the data, showing the number of Edge Locations by country and city. This can be done by hovering over the result table and clicking on the small "configure" button that looks like a wrench, which then appears in the upper right corner of the table:

You then see an overview of the available columns and the current visualization type (Datagrid):

To get an overview of the possible visualization types, click on the Datagrid icon:

Then select "Y Bar". This will give you an initial bar chart:

But as we want to display the count of Edge Locations by country and city, we need to drag-and-drop the columns country and city to the "Group By" area:

We can now close the configuration menu to see the chart in its full size:

There are many other visualization types you can choose from, such as Treemaps and Sunbursts, as well as Map Scatters:

Exporting visualizations and data

You can also export the visualizations, as well as the data. Just click on "Export", type in a "Save as" name, and select the output format you want to download:

The data can be downloaded as CSV, JSON or Arrow file. Here's the CSV example:

"country (Group by 1)","count"
,517
"Argentina",3
"Australia",10
"Austria",3
"Bahrain",2
"Belgium",1
"Brazil",21
"Bulgaria",3
"Canada",8
"Chile",6
"China",8
"Colombia",3
"Croatia",1
"Czech Republic",1
"Denmark",3
"Finland",4
"France",17
"Germany",37
"Greece",1
"Hungary",1
"India",48
"Indonesia",5
"Ireland",2
"Israel",2
"Italy",16
"Japan",27
"Kenya",1
"Korea",8
"Malaysia",2
"Mexico",4
"Netherlands",5
"New Zealand",2
"Nigeria",1
"Norway",2
"Oman",1
"Peru",2
"Philippines",2
"Poland",5
"Portugal",1
"Romania",1
"Singapore",7
"South Africa",2
"Spain",12
"Sweden",4
"Switzerland",2
"Taiwan",3
"Thailand",2
"UAE",4
"UK",30
"United States",179
"Vietnam",2

And here's the exported PNG:

Exporting the HTML version will give you an interactive graph with hovering etc. Furthermore, you can also change the theme for the different visualizations:

This is also reflected in the exported graphs:

Using the schema browser

The schema browser can be found on the left-hand side. It's automatically updated after each executed query, so that all schema operations are captured. On the table level, the columns, constraints and indexes are shown:

If you right-click on a table name, a context menu is shown that has different options:

  • Generating scripts based on the table definition

  • Truncating, deleting or summarizing the table

  • Viewing table data (all records, first 10 and first 100)

Once clicked, those menu items will create a new tab (see below), and generate and execute the appropriate SQL statements:
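
For illustration, the statements generated for a table like first_names could look like the following (a sketch; the exact SQL generated by the SQL Workbench may differ):

-- Summarize the table
SUMMARIZE first_names;

-- View the first 10 records
SELECT * FROM first_names LIMIT 10;

-- Truncate (empty) the table
TRUNCATE first_names;

-- Delete the table
DROP TABLE first_names;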

Using query tabs

Another new feature is the possibility to have multiple query tabs. Those are either created automatically by context menu actions, or manually by clicking on the plus icon:

Each tab can be closed by clicking on the "x" icon next to the tab name.

Generating data models

If users have created some tables, it's then possible to create a data model from the schema metadata. If the tables also have foreign key relationships, those are also shown in the diagram. Just click on the "Data Model" menu entry on the lower left corner.

Under the hood, this feature generates Mermaid Entity Relationship Diagram code, which is dynamically rendered as a graph.
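
As an illustration, the schema metadata needed for such a diagram can be queried via DuckDB's built-in metadata functions (a sketch; not necessarily the exact queries the SQL Workbench uses internally):

-- Tables and columns of the current database
SELECT table_name, column_name, data_type
FROM duckdb_columns()
WHERE NOT internal;

-- Foreign key relationships that become the edges of the diagram
SELECT table_name, constraint_type, constraint_text
FROM duckdb_constraints()
WHERE constraint_type = 'FOREIGN KEY';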

Using the query history

Each query that is issued in the current version of the SQL Workbench is recorded in the so-called query history. It can be accessed by clicking on the "Query History" menu entry in the lower left corner. Once clicked, an overlay appears on the right-hand side with the list of the issued queries.

The newest queries can be found at the top of the list, and each entry also indicates when the query was run and how long it took to execute.

With the trash icon in the top-right corner, the complete query history can be cleared. Single query history entries can also be deleted, and specific queries can be re-run in a new tab by clicking "Replay query" in the menu that's present for each query history entry.

Example Data Engineering pipeline

Dataset & Goals

A well-known dataset is the NYC TLC Trip Record dataset. It is freely available on the website of the NYC Taxi and Limousine Commission, which also provides some explanations and additional lookup data. In this example, we focus on the yellow taxi data.

The goal of this example pipeline is to create a clean Data Mart from the given trip records and location data, which can support some basic analysis of the data via OLAP query patterns.

Source Data analysis

On the NYC TLC website, there's a PDF file explaining the structure and contents of the data. The table structure can be found below; the highlighted columns indicate dimensional values, for which we'll build dimension tables in the later steps.

  • VendorID: A code indicating the TPEP provider that provided the record.
  • tpep_pickup_datetime: The date and time when the meter was engaged.
  • tpep_dropoff_datetime: The date and time when the meter was disengaged.
  • Passenger_count: The number of passengers in the vehicle. This is a driver-entered value.
  • Trip_distance: The elapsed trip distance in miles reported by the taximeter.
  • PULocationID: TLC Taxi Zone in which the taximeter was engaged
  • DOLocationID: TLC Taxi Zone in which the taximeter was disengaged
  • RateCodeID: The final rate code in effect at the end of the trip.
  • Store_and_fwd_flag: This flag indicates whether the trip record was held in vehicle memory before sending to the vendor, aka “store and forward,” because the vehicle did not have a connection to the server.
  • Payment_type: A numeric code signifying how the passenger paid for the trip.
  • Fare_amount: The time-and-distance fare calculated by the meter.
  • Extra: Miscellaneous extras and surcharges. Currently, this only includes the $0.50 and $1 rush hour and overnight charges.
  • MTA_tax: $0.50 MTA tax that is automatically triggered based on the metered rate in use.
  • Improvement_surcharge: $0.30 improvement surcharge assessed trips at the flag drop. The improvement surcharge began being levied in 2015.
  • Tip_amount: Tip amount – This field is automatically populated for credit card tips. Cash tips are not included.
  • Tolls_amount: Total amount of all tolls paid in trip.
  • Total_amount: The total amount charged to passengers. Does not include cash tips.
  • Congestion_Surcharge: Total amount collected in trip for NYS congestion surcharge.
  • Airport_fee: $1.25 for pick up only at LaGuardia and John F. Kennedy Airports

There's an additional CSV file for the so-called Taxi Zones, as well as a SHX shapefile containing the same info, but with additional geo information. The structure is the following:

  • LocationID: TLC Taxi Zone, corresponding to the PULocationID and DOLocationID columns in the trip dataset
  • Borough: The name of the NYC borough this Taxi Zone is in
  • Zone: The name of the Taxi Zone
  • service_zone: Can either be "Yellow Zone" or "Boro Zone"

Target Data Model

The target data model is derived from the original trip record data, with extracted dimension tables plus a new date hierarchy dimension. Also, the naming scheme is unified and cleaned up.

It is modeled as a so-called Snowflake Schema (check the Mermaid source):

Loading & Transforming the data

The loading & transforming of the data is divided into multiple steps:

  • Generating the dimensional tables and values from the given dataset information

  • Generating a date hierarchy dimension

  • Loading and transforming the trip data

    • Replacing values with dimension references

    • Cleaning up the column naming

    • Unifying values

Generating the dimensional tables

We use the given dataset information from the PDF file to manually create dimension tables and their values:

-- Install and load the spatial extension
INSTALL spatial;
LOAD spatial;

-- Create temporary table
CREATE TABLE tmp_service_zones AS (
    SELECT 
        DISTINCT service_zone 
    FROM 
        'https://data.quacking.cloud/nyc-taxi-metadata/taxi_zones.csv' 
    WHERE 
        service_zone != 'N/A' 
    ORDER BY 
        service_zone
); 

-- Create dim_zone_type table
CREATE TABLE dim_zone_type (
    zone_type_id INTEGER PRIMARY KEY,
    name VARCHAR
);

-- Insert dim_zone_type table
INSERT INTO dim_zone_type
SELECT 
    -1 as zone_type_id, 
    'N/A' as name 
UNION ALL
SELECT 
    (rowid + 1)::INTEGER as zone_type_id, 
    service_zone as name 
FROM 
    tmp_service_zones 
; 

-- Drop table tmp_service_zones
DROP TABLE tmp_service_zones; 

-- Create temporary table
CREATE TABLE tmp_borough AS (
    SELECT 
        DISTINCT borough 
    FROM 
        'https://data.quacking.cloud/nyc-taxi-metadata/taxi_zones.csv' 
    WHERE
        borough != 'Unknown'
    ORDER BY 
        borough
); 

-- Create dim_borough table
CREATE TABLE dim_borough (
    borough_id INTEGER PRIMARY KEY,
    name VARCHAR
);

-- Insert dim_borough table
INSERT INTO dim_borough
SELECT 
    -1 as borough_id, 
    'N/A' as name 
UNION ALL
SELECT 
    (rowid + 1)::INTEGER as borough_id, 
    borough as name 
FROM 
    tmp_borough 
; 

-- Drop temporary table
DROP TABLE tmp_borough;

-- Create dim_zone table
CREATE TABLE dim_zone (
    zone_id INTEGER PRIMARY KEY,
    zone_type_id INTEGER,
    borough_id INTEGER,
    name VARCHAR,
    geojson VARCHAR,
    FOREIGN KEY (zone_type_id) REFERENCES dim_zone_type (zone_type_id),
    FOREIGN KEY (borough_id) REFERENCES dim_borough (borough_id)
);

-- Insert dim_zone table
INSERT INTO dim_zone 
SELECT DISTINCT
    CASE
        WHEN csv.LocationID IS NOT NULL THEN csv.LocationID::INT
        ELSE raw.LocationID
    END AS zone_id, 
    zt.zone_type_id, 
    CASE
        WHEN b.borough_id IS NOT NULL THEN b.borough_id
        ELSE -1
    END AS borough_id,
    CASE
        WHEN csv.Zone IS NOT NULL THEN csv.Zone
        ELSE raw.zone
    END AS name, 
    raw.geojson 
FROM 
    (
        SELECT 
            LocationID, 
            borough,
            zone,  
            geojson 
        FROM 
            (
                SELECT 
                    LocationID, 
                    borough, 
                    zone, 
                    rank() OVER (PARTITION BY LocationID ORDER BY Shape_Leng) AS ranked, 
                    ST_AsGeoJSON(ST_Transform(geom, 'ESRI:102718', 'EPSG:4326')) AS geojson 
                FROM ST_Read('https://data.quacking.cloud/nyc-taxi-metadata/taxi_zones.shx') 
            ) sub
        WHERE 
            sub.ranked = 1 
    ) raw 
FULL OUTER JOIN 
    (
        SELECT DISTINCT 
            LocationID,
            Zone, 
            service_zone 
        FROM 
            'https://data.quacking.cloud/nyc-taxi-metadata/taxi_zones.csv' 
    ) csv 
ON 
    csv.LocationId = raw.LocationId 
FULL OUTER JOIN 
    dim_zone_type zt 
ON 
    csv.service_zone = zt.name 
FULL OUTER JOIN 
    dim_borough b 
ON 
    b.name = raw.borough
WHERE
    zone_id IS NOT NULL
ORDER BY
    zone_id
;

-- Create dim_rate_code table
CREATE TABLE dim_rate_code (
    rate_code_id INTEGER PRIMARY KEY,
    name VARCHAR
);

INSERT INTO dim_rate_code (rate_code_id, name) VALUES (1, 'Standard rate');
INSERT INTO dim_rate_code (rate_code_id, name) VALUES (2, 'JFK');
INSERT INTO dim_rate_code (rate_code_id, name) VALUES (3, 'Newark');
INSERT INTO dim_rate_code (rate_code_id, name) VALUES (4, 'Nassau or Westchester');
INSERT INTO dim_rate_code (rate_code_id, name) VALUES (5, 'Negotiated fare');
INSERT INTO dim_rate_code (rate_code_id, name) VALUES (6, 'Group ride');
INSERT INTO dim_rate_code (rate_code_id, name) VALUES (99, 'N/A');

-- Create dim_payment_type table
CREATE TABLE dim_payment_type (
    payment_type_id INTEGER PRIMARY KEY,
    name VARCHAR
);

INSERT INTO dim_payment_type (payment_type_id, name) VALUES (1, 'Credit card');
INSERT INTO dim_payment_type (payment_type_id, name) VALUES (2, 'Cash');
INSERT INTO dim_payment_type (payment_type_id, name) VALUES (3, 'No charge');
INSERT INTO dim_payment_type (payment_type_id, name) VALUES (4, 'Dispute');
INSERT INTO dim_payment_type (payment_type_id, name) VALUES (5, 'Unknown');
INSERT INTO dim_payment_type (payment_type_id, name) VALUES (6, 'Voided trip');

-- Create dim_vendor table
CREATE TABLE dim_vendor (
    vendor_id INTEGER PRIMARY KEY,
    name VARCHAR
);

INSERT INTO dim_vendor (vendor_id, name) VALUES (1, 'Creative Mobile Technologies');
INSERT INTO dim_vendor (vendor_id, name) VALUES (2, 'VeriFone Inc.');

-- Create dim_stored_type table
CREATE TABLE dim_stored_type (
    stored_type_id INTEGER PRIMARY KEY,
    name VARCHAR
);

INSERT INTO dim_stored_type (stored_type_id, name) VALUES (1, 'Store and forward trip');
INSERT INTO dim_stored_type (stored_type_id, name) VALUES (2, 'Not a store and forward trip');

Generating a date hierarchy dimension

-- Create dim_date table
CREATE TABLE dim_date (
    day_dt DATE PRIMARY KEY,
    day_name VARCHAR,
    day_of_week INT,
    day_of_month INT,
    day_of_year INT,
    week_of_year INT,
    month_of_year INT,
    month_name VARCHAR,
    year INT
);

INSERT INTO dim_date
SELECT 
    date_key AS day_dt,
    DAYNAME(date_key)::VARCHAR AS day_name,
    ISODOW(date_key)::INT AS day_of_week,
    DAYOFMONTH(date_key)::INT AS day_of_month,
    DAYOFYEAR(date_key)::INT AS day_of_year, 
    WEEKOFYEAR(date_key)::INT AS week_of_year,
    MONTH(date_key)::INT AS month_of_year,
    MONTHNAME(date_key)::VARCHAR AS month_name,
    YEAR(date_key)::INT AS year
FROM 
    (
        SELECT 
            CAST(RANGE AS DATE) AS date_key 
        FROM 
            RANGE(DATE '2005-01-01', DATE '2030-12-31', INTERVAL 1 DAY)
    ) generate_date
;

Loading and transforming the trip data

-- Create sequence for generating trip_ids
CREATE SEQUENCE trip_id_sequence START 1;

-- Create fact_trip table
CREATE TABLE fact_trip (
    trip_id INTEGER  DEFAULT nextval('trip_id_sequence') PRIMARY KEY,
    pickup_zone_id INTEGER,
    pickup_dt DATE,
    pickup_ts TIMESTAMP,
    dropoff_zone_id INTEGER,
    dropoff_dt DATE,
    dropoff_ts TIMESTAMP,
    rate_code_id INTEGER, 
    stored_type_id INTEGER, 
    payment_type_id INTEGER, 
    vendor_id INTEGER, 
    passenger_count DOUBLE,
    trip_distance_miles DOUBLE,
    fare_amount DOUBLE,
    extra_amount DOUBLE,
    mta_tax_amount DOUBLE,
    improvement_surcharge_amount DOUBLE,
    tip_amount DOUBLE,
    tolls_amount DOUBLE,
    congestion_surcharge_amount DOUBLE,
    airport_fee_amount DOUBLE,
    total_amount DOUBLE
);

-- Deactivating FK relationships for now, due to performance issues when inserting 3 million records
-- FOREIGN KEY (pickup_zone_id) REFERENCES dim_zone (zone_id),
-- FOREIGN KEY (dropoff_zone_id) REFERENCES dim_zone (zone_id),
-- FOREIGN KEY (pickup_dt) REFERENCES dim_date (day_dt),
-- FOREIGN KEY (dropoff_dt) REFERENCES dim_date (day_dt),
-- FOREIGN KEY (rate_code_id) REFERENCES dim_rate_code (rate_code_id),
-- FOREIGN KEY (stored_type_id) REFERENCES dim_stored_type (stored_type_id),
-- FOREIGN KEY (payment_type_id) REFERENCES dim_payment_type (payment_type_id),
-- FOREIGN KEY (vendor_id) REFERENCES dim_vendor (vendor_id)

-- Insert transformed fact data
INSERT INTO fact_trip
SELECT 
    nextval('trip_id_sequence') AS trip_id,
    PULocationID::INT as pickup_zone_id,
    tpep_pickup_datetime::DATE as pickup_dt,
    tpep_pickup_datetime AS pickup_ts,
    DOLocationID::INT as dropoff_zone_id,
    tpep_dropoff_datetime::DATE as dropoff_dt,
    tpep_dropoff_datetime AS dropoff_ts,
    RatecodeID::INT AS rate_code_id,
    CASE
        WHEN store_and_fwd_flag = 'Y' THEN 1
        WHEN store_and_fwd_flag = 'N' THEN 2
    END AS stored_type_id,
    payment_type::INT AS payment_type_id,
    VendorID::INT AS vendor_id,
    passenger_count,
    trip_distance AS trip_distance_miles,
    fare_amount,
    extra AS extra_amount,
    mta_tax AS mta_tax_amount,
    improvement_surcharge AS improvement_surcharge_amount,
    tip_amount,
    tolls_amount,
    congestion_surcharge AS congestion_surcharge_amount,
    airport_fee AS airport_fee_amount,
    total_amount
FROM 
    'https://data.quacking.cloud/nyc-taxi-data/yellow_tripdata_2023-01.parquet'
;
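
After the insert has finished, a quick sanity check (an optional step, not part of the original pipeline) shows how many trip records were loaded into the fact table, which should be roughly 3 million for the January 2023 file:

-- Count the loaded trip records
SELECT count(*) AS loaded_trip_cnt FROM fact_trip;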

Data Analysis

The following analyses are just examples of how you could analyze the dataset. Feel free to think about your own questions for the dataset, and try to build queries yourself!

Preparation

DuckDB supports the SUMMARIZE command, which can help you understand the final data in the fact table before querying it. It launches a query that computes a number of aggregates over all columns, including min, max, avg, std and approx_unique:
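
For example, to summarize the fact table created above:

SUMMARIZE fact_trip;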

The output already shows some "interesting" things, such as pickup_dt being far in the past, e.g. 2008-12-31, and the dropoff_dt with similar values (2009-01-01).
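
If you want to take a closer look at such outliers, a simple ad-hoc check (not part of the original pipeline) could count the trips whose pickup date falls outside the expected month of January 2023:

-- Count trips with pickup dates outside of January 2023
SELECT count(*) AS outlier_trip_cnt
FROM fact_trip
WHERE pickup_dt < DATE '2023-01-01'
   OR pickup_dt >= DATE '2023-02-01';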

Most utilized trip locations

With this analysis, we want to have a look at the 20 most frequented trips from pickup zone to dropoff zone:

SELECT
    pz.name || ' -> ' || dz.name AS trip_description,
    count(DISTINCT ft.trip_id)::INT AS trip_count,
    sum(ft.passenger_count)::INT AS passenger_count,
    sum(ft.total_amount)::INT AS total_amount,
    sum(ft.trip_distance_miles)::INT AS trip_distance_miles,
    (sum(trip_distance_miles)/count(DISTINCT ft.trip_id))::DOUBLE AS trip_distance_miles_avg,
    (sum(ft.total_amount)/count(DISTINCT ft.trip_id))::DOUBLE AS total_amount_avg,
    (sum(ft.passenger_count)/count(DISTINCT ft.trip_id))::DOUBLE AS passenger_count_avg
FROM
    fact_trip ft
INNER JOIN
    dim_zone pz
ON
    pz.zone_id = ft.pickup_zone_id
INNER JOIN
    dim_zone dz
ON
    dz.zone_id = ft.dropoff_zone_id
GROUP BY
    trip_description
ORDER BY
    trip_count DESC
LIMIT 20;

On an M2 Mac Mini with 16GB RAM, aggregating the 3 million trips takes around 900ms. The result looks like the following:

We can now also create a Y Bar chart showing the (total) trip count, passenger count and trip distance for the 20 most frequented trips:

End result:

Trip frequency by weekday and time of day

To inspect the traffic patterns, we want to analyze the trip frequency by weekday and time of day (aggregated on an hourly level). To do this, we make use of DuckDB's advanced timestamp/time handling functions:

SELECT
    dd.day_name,
    dd.day_of_week,
    datepart('hour', time_bucket(INTERVAL '1 HOUR', ft.pickup_ts)) day_hour,
    count(DISTINCT ft.trip_id)::INT AS trip_count
FROM
    fact_trip ft
INNER JOIN
    dim_date dd
ON
    dd.day_dt = ft.pickup_dt
GROUP BY
    dd.day_name,
    dd.day_of_week,
    day_hour
ORDER BY
    dd.day_of_week,
    day_hour;

Then, we can configure a Y Bar chart that can show us the number of trips by weekday and hour:

End result:

Sharing of Data Pipelines & Visualizations

With the latest version of sql-workbench.com it's possible to share both queries and the (customized) visualization of the last executed query. To do so, you write your queries, run them to check whether they work, and then update the visualization configuration.

Once you've done that, you can click on "Share queries" in the lower left corner of the SQL Workbench. A toggle lets you choose whether you want to copy the visualization configuration as well or not.

If you want to run the complete pipeline to build our dimensional data model, you can click on the link below (this can take from 10 to 120 seconds depending on your machine and internet connection speed, as approximately 50MB of data will be downloaded):

💡
The above link unfortunately has to be routed through https://dub.co as a URL shortener, because HashNode doesn't support very long URLs as links!

Attaching remote databases

Since DuckDB v1.0.0, it is possible to attach a remote database via HTTPS or S3. As an example, you could use the below statements to load and query a remote dataset of AWS IAM data from GitHub:

ATTACH 'https://raw.githubusercontent.com/tobilg/aws-iam-data/main/data/db/iam.duckdb' AS iam (READ_ONLY);

SELECT 
  s.name as service_name, 
  count(distinct a.name)::int action_cnt 
FROM 
  iam.services s 
INNER JOIN 
  iam.actions a ON s.service_id = a.service_id 
GROUP BY ALL 
ORDER BY action_cnt desc 
LIMIT 25;

Click on the image below to execute the queries:

Conclusion

With DuckDB-WASM and some common web frameworks, it's pretty easy and fast to create custom data applications that can handle datasets of millions of records.

Those applications provide a very lightweight approach to working with different types of data (such as Parquet, Iceberg, Arrow, CSV, JSON or spatial data formats), whether local or remote (via HTTP(S) or S3-compatible storage services), thanks to the versatile DuckDB engine.

Users can interactively work with the data, create data pipelines by using raw SQL, and iterate until the final desired state has been achieved. The generated data pipeline queries can easily be shared with simple links to sql-workbench.com, so that other collaborators can continue to iterate on the existing work, or even create new solutions with it.

Once a data pipeline has been finalized, it could for example be deployed to DuckDB instances running in the users' own cloud accounts. A great example would be running DuckDB in AWS Lambda, e.g. for repartitioning Parquet data in S3 nightly, or automatically running reports based on aggregation pipelines.

The possibilities are nearly endless, so I'm very curious what you all build with this great technology! Thanks for reading this lengthy article; I'm happy to answer any questions in the comments.
