A Comprehensive Guide to Running dbt Models

VipinVipin
7 min read

Introduction:

Data transformation is a critical step in the analytics process, and dbt (data build tool) has become a cornerstone of modern data engineering. In this blog post, we’ll dive into the practical aspects of running dbt models, an essential part of any dbt project. Whether you’re new to dbt or looking to optimize your workflow, this guide will walk you through the key steps and best practices.

What Are dbt Models?

At the heart of dbt are models—SQL files that define the transformations you want to apply to your raw data. These models are housed in the models directory of your dbt project and are typically organized by theme or data domain.

A dbt model is more than just a SQL query. It’s a reusable transformation that dbt compiles, optimizes, and materializes in your data warehouse. dbt models can be used to create tables, views, or incremental loads, depending on how you configure them.

One of the powerful features of dbt is its use of the Jinja templating language, allowing you to create dynamic, parameterized SQL that adapts to different environments and scenarios.

Prerequisites

Before we dive in, make sure you have the following tools installed:

  • Python: Required to run dbt.

  • Visual Studio code. Install Python and dbt extension in VS code.

  • A Snowflake Account: Ensure you have a Snowflake account with sufficient permissions to create roles, warehouses, databases, etc. Refer https://vipinmp.hashnode.dev/snowflake-a-beginners-guide

Snowflake data load

The data used here is stored as CSV files in a public S3 bucket and the following steps will guide you through how to prepare your Snowflake account for that data and upload it.

  1. Log in to your trial Snowflake account.

  2. In the Snowflake UI, click + Worksheet in the upper right corner to create a new worksheet.

  3. Create a new virtual warehouse, two new databases (one for raw data, the other for future dbt development), and two new schemas (one for jaffle_shop data, the other for stripe data).

    To do this, run these SQL commands by typing them into the Editor of your new Snowflake worksheet and clicking Run in the upper right corner of the UI:

     create database raw;
     create database analytics;
     create schema raw.jaffle_shop;
     create schema raw.stripe;
    
  4. In the raw database and jaffle_shop and stripe schemas, create three tables and load relevant data into them:

    • First, delete all contents (empty) in the Editor of the Snowflake worksheet. Then, run this SQL command to create the customer table:

        create table raw.jaffle_shop.customers 
        ( id integer,
          first_name varchar,
          last_name varchar
        );
      
    • Delete all contents in the Editor, then run this command to load data into the customer table:

        copy into raw.jaffle_shop.customers (id, first_name, last_name)
        from 's3://dbt-tutorial-public/jaffle_shop_customers.csv'
        file_format = (
            type = 'CSV'
            field_delimiter = ','
            skip_header = 1
            );
      
    • Delete all contents in the Editor (empty), then run this command to create the orders table:

        create table raw.jaffle_shop.orders
        ( id integer,
          user_id integer,
          order_date date,
          status varchar,
          _etl_loaded_at timestamp default current_timestamp
        );
      
    • Delete all contents in the Editor, then run this command to load data into the orders table:

        copy into raw.jaffle_shop.orders (id, user_id, order_date, status)
        from 's3://dbt-tutorial-public/jaffle_shop_orders.csv'
        file_format = (
            type = 'CSV'
            field_delimiter = ','
            skip_header = 1
            );
      
    • Delete all contents in the Editor (empty), then run this command to create the payment table:

        create table raw.stripe.payment 
        ( id integer,
          orderid integer,
          paymentmethod varchar,
          status varchar,
          amount integer,
          created date,
          _batched_at timestamp default current_timestamp
        );
      
    • Delete all contents in the Editor, then run this command to load data into the payment table:

        copy into raw.stripe.payment (id, orderid, paymentmethod, status, amount, created)
        from 's3://dbt-tutorial-public/stripe_payments.csv'
        file_format = (
            type = 'CSV'
            field_delimiter = ','
            skip_header = 1
            );
      
  5. Verify that the data is loaded by running these SQL queries. Confirm that you can see output for each one.

Initialize dbt and check the connection

Create virtual environment and activate it
virtualenv my_dbt_venv
source my_dbt_venv/bin/activate

Install dbt along with the Snowflake adapter:
pip install dbt-core dbt-snowflake

  • Create .dbt folder in the home directory.

  • Create a New dbt Project: Run the following command to initialize a new dbt project:

  • Open the newly created project in VSCODE and Activate the virtual environment and test the connection using dbt debug

    command.

Build the Model

    1. Go to the models directory, then select Create file.

      1. Name the file customers_order.sql, then click Create.

      2. Copy the following query into the file and click Save.

    with customers as (

        select
            id as customer_id,
            first_name,
            last_name

        from raw.jaffle_shop.customers

    ),

    orders as (

        select
            id as order_id,
            user_id as customer_id,
            order_date,
            status

        from raw.jaffle_shop.orders

    ),

    customer_orders as (

        select
            customer_id,

            min(order_date) as first_order_date,
            max(order_date) as most_recent_order_date,
            count(order_id) as number_of_orders

        from orders

        group by 1

    ),

    final as (

        select
            customers.customer_id,
            customers.first_name,
            customers.last_name,
            customer_orders.first_order_date,
            customer_orders.most_recent_order_date,
            coalesce(customer_orders.number_of_orders, 0) as number_of_orders

        from customers

        left join customer_orders using (customer_id)

    )

    select * from final

Model explanation:

    1. customers CTE:

      • Selects customer data from the raw.jaffle_shop.customers table.

      • Renames the id column to customer_id for better clarity.

      1. orders CTE:

        • Selects order data from the raw.jaffle_shop.orders table.

        • Renames the id column to order_id and user_id to customer_id to maintain consistency with the customers CTE.

      2. customer_orders CTE:

        • Aggregates order data by customer_id.

        • Calculates the first_order_date, most_recent_order_date, and number_of_orders for each customer.

      3. final CTE:

        • Combines the customers and customer_orders data.

        • Uses a LEFT JOIN to ensure all customers are included, even those without any orders.

        • The coalesce function ensures that customers with no orders will show 0 for number_of_orders.

      4. Final SELECT:

        • Selects and displays all the relevant fields from the final CTE.
  1. Enter dbt run in the command prompt at the bottom of the screen. You should get a successful run and see the three models.

Change the way your model is materialized

One of the most powerful features of dbt is that you can change the way a model is materialized in your warehouse, simply by changing a configuration value. You can change things between tables and views by changing a keyword rather than writing the data definition language (DDL) to do this behind the scenes.

By default, everything gets created as a view. You can override that at the directory level so everything in that directory will materialize to a different materialization.

  1. Edit your dbt_project.yml file.

    • Update your project name to:

      dbt_project.yml

        name: 'dbt_demo'
      
    • Configure jaffle_shop so everything in it will be materialized as a table; and configure example so everything in it will be materialized as a view. Update your models config block to:

      dbt_project.yml

        models:
          dbt-demo:
            +materialized: table
      
    • Click Save.

  2. Enter the dbt run command. Your customers model should now be built as a table!

Build models on top of other models

As a best practice in SQL, you should separate logic that cleans up your data from logic that transforms your data. You have already started doing this in the existing query by using common table expressions (CTEs).

Now you can experiment by separating the logic out into separate models and using the ref function to build models on top of other models:

  • The DAG we want for our dbt project

    1. Create a new SQL file, models/stg_customers.sql, with the SQL from the customers CTE in our original query.

    2. Create a second new SQL file, models/stg_orders.sql, with the SQL from the orders CTE in our original query.

      models/stg_customers.sql

       select
           id as customer_id,
           first_name,
           last_name
      
       from raw.jaffle_shop.customers
      

      models/stg_orders.sql

       select
           id as order_id,
           user_id as customer_id,
           order_date,
           status
      
       from raw.jaffle_shop.orders
      
    3. Edit the SQL in your models/customers_order.sql file as follows:

      models/customers_order.sql

       with customers as (
      
           select * from {{ ref('stg_customers') }}
      
       ),
      
       orders as (
      
           select * from {{ ref('stg_orders') }}
      
       ),
      
       customer_orders as (
      
           select
               customer_id,
      
               min(order_date) as first_order_date,
               max(order_date) as most_recent_order_date,
               count(order_id) as number_of_orders
      
           from orders
      
           group by 1
      
       ),
      
       final as (
      
           select
               customers.customer_id,
               customers.first_name,
               customers.last_name,
               customer_orders.first_order_date,
               customer_orders.most_recent_order_date,
               coalesce(customer_orders.number_of_orders, 0) as number_of_orders
      
           from customers
      
           left join customer_orders using (customer_id)
      
       )
      
       select * from final
      

    4. Execute dbt run.

      This time, when you performed a dbt run, separate views/tables were created for stg_customers, stg_orders and customers. dbt inferred the order to run these models. Because customers depends on stg_customers and stg_orders, dbt builds customers last. You do not need to explicitly define these dependencies.

Resources and Further Reading

  • Conclusion

    Running dbt models efficiently is key to maintaining a robust and reliable data pipeline. Whether you're running individual models, incremental models, or your entire project, understanding how to leverage dbt’s powerful run commands will help streamline your workflow and ensure your data transformations are accurate and up-to-date.

0
Subscribe to my newsletter

Read articles from Vipin directly inside your inbox. Subscribe to the newsletter, and don't miss out.

Written by

Vipin
Vipin