Getting Started with dbt: A Beginner’s Guide

Sriram Krishnan

In most modern data stacks, raw data lands in a warehouse—but transforming that data into something reliable, trusted, and usable? That’s where things often go sideways. As teams scale, they start drowning in a swamp of SQL scripts, inconsistent logic, and dashboard patchwork.

dbt (Data Build Tool) solves this by turning SQL-based transformations into a structured, testable, and collaborative workflow—bringing engineering best practices into analytics without requiring you to become a full-fledged software engineer.


What dbt Brings to the Table

Here’s a high-level overview of what dbt enables inside a data team:

✅ Modular SQL Models

Transformations are broken into focused .sql files called models. These run in dependency order and can be layered cleanly.
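As a sketch of what "modular" means in practice, here is a hypothetical mart model that builds on a staging model via `ref()`; dbt reads the `ref()` call, infers the dependency, and builds `stg_payments` first (model and column names are illustrative):

```sql
-- models/marts/fct_revenue.sql (illustrative)
select
    date_trunc('month', paid_at) as revenue_month,
    sum(amount_usd)              as total_revenue
from {{ ref('stg_payments') }}  -- dbt resolves this to the built stg_payments relation
group by 1
```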

✅ Materializations

Models can be built as:

  • view: rebuilt at query time; always fresh, but can be slow for heavy queries

  • table: fully rebuilt and persisted on each run

  • incremental: only new or changed rows are processed on each run

  • ephemeral: not built in the warehouse; inlined as a CTE into downstream models
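The materialization can be set per model with a `config()` block, or for whole folders in `dbt_project.yml` (the project name below is a placeholder):

```sql
-- At the top of a model file; overrides the project-level default.
{{ config(materialized='incremental', unique_key='event_id') }}
```

```yaml
# dbt_project.yml (excerpt; "my_project" is hypothetical)
models:
  my_project:
    staging:
      +materialized: view
    marts:
      +materialized: table
```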

✅ Source Tracking

Define upstream raw tables (from ingestion tools) as sources. This brings visibility, metadata, and lineage into your transformation pipeline.

✅ Data Testing

Built-in generic tests (like not_null, unique, relationships) and custom singular tests (e.g., "no revenue amount is negative") validate your logic and assumptions.

✅ Documentation

Document your models and columns inline. dbt auto-generates a browsable documentation site with lineage, tags, and descriptions.

✅ Snapshots

Track historical changes (type 2 slowly changing dimensions) for evolving data like customer plans or subscription status.

✅ Macros & Jinja

Use Jinja templates and custom macros to keep logic DRY, dynamic, and reusable across models.

✅ Git-based Collaboration

All code is version-controlled in Git. You work in branches, open pull requests, and review changes—just like software teams.

✅ Scheduling & CI/CD

Run dbt jobs daily, hourly, or on-demand. Integrate it into a CI/CD process with dbt Cloud, GitHub Actions, or Airflow.

✅ Seeds & Packages

Use CSVs as reference tables (seeds), and import reusable utilities via packages like dbt-utils.


Realistic Example: Analytics for a SaaS Business

Let’s say you’re building a data platform for a subscription SaaS product. Your business wants insights into revenue, user behavior, churn, and customer segments. You ingest data from:

  • Stripe (billing)

  • HubSpot (CRM)

  • Segment (event tracking)

These land in your warehouse as raw tables:

  • raw.stripe_payments

  • raw.hubspot_contacts

  • raw.segment_events

Here’s how a dbt project can structure the transformation layer:

/models
  /sources
    stripe.yml
    hubspot.yml
    segment.yml

  /staging
    stg_payments.sql
    stg_contacts.sql
    stg_events.sql

  /intermediate
    int_mrr_breakdown.sql
    int_user_activity.sql

  /marts
    fct_revenue.sql
    fct_user_engagement.sql
    dim_customer.sql

/snapshots
  scd_customer_status.sql

/seeds
  currency_exchange_rates.csv

/tests
  test_no_negative_revenue.sql

/macros
  monthly_range.sql

How the Features Come Together

🔷 Sources

In sources/stripe.yml, you define raw.stripe_payments as a source. This lets you refer to it via source() instead of hardcoding table names, and it brings lineage metadata and freshness checks for upstream tables into your project.
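A minimal sketch of that source definition (the `_loaded_at` column is an assumption about what the ingestion tool writes):

```yaml
# models/sources/stripe.yml (sketch)
version: 2

sources:
  - name: stripe
    schema: raw
    tables:
      - name: stripe_payments
        loaded_at_field: _loaded_at   # enables `dbt source freshness` checks
```

Downstream models then select `from {{ source('stripe', 'stripe_payments') }}` instead of `from raw.stripe_payments`.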

🔷 Staging Models

You standardize inputs in stg_payments.sql, renaming fields like amount_paid_cents to amount_usd, filtering test transactions, and fixing timezones. These staging models act as the clean base layer.
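A sketch of such a staging model; the raw column names and the `convert_timezone` function (Snowflake/Redshift flavor) are assumptions:

```sql
-- models/staging/stg_payments.sql (sketch; raw column names are assumed)
select
    id                                  as payment_id,
    customer                            as customer_id,
    amount_paid_cents / 100.0           as amount_usd,
    convert_timezone('UTC', created_at) as paid_at
from {{ source('stripe', 'stripe_payments') }}
where not is_test_transaction   -- drop test/sandbox payments
```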

🔷 Intermediate Models

In int_user_activity.sql, you enrich events from Segment with user metadata and calculate active days. In int_mrr_breakdown.sql, you break down subscription revenue into components like upgrades and downgrades.

These intermediate models separate complex logic, keeping fact models readable and maintainable.
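The MRR breakdown could be sketched like this, assuming the staging model exposes `customer_id`, `paid_at`, and `amount_usd`:

```sql
-- models/intermediate/int_mrr_breakdown.sql (illustrative logic)
with monthly as (
    select
        customer_id,
        date_trunc('month', paid_at) as revenue_month,
        sum(amount_usd)              as mrr
    from {{ ref('stg_payments') }}
    group by 1, 2
)

select
    *,
    -- month-over-month delta per customer: positive = expansion, negative = contraction
    mrr - lag(mrr) over (partition by customer_id order by revenue_month) as mrr_change
from monthly
```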

🔷 Fact & Dimension Models

Your final output includes:

  • fct_revenue.sql: monthly MRR, churn, expansion, contraction

  • fct_user_engagement.sql: daily, weekly, and monthly active users (DAU/WAU/MAU)

  • dim_customer.sql: enriched profile of every user

These models feed dashboards, forecasts, and investor reports.

🔷 Materializations

You use:

  • table for fct_revenue (slow-changing, useful to cache)

  • incremental for fct_user_engagement (daily updates)

  • view for staging models

  • ephemeral for lightweight CTE chains (in int_user_activity.sql)

This balances performance and flexibility.
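The incremental engagement model might look like this sketch; `is_incremental()` and `{{ this }}` are standard dbt Jinja, while the column names are assumptions:

```sql
-- models/marts/fct_user_engagement.sql (sketch)
{{ config(materialized='incremental', unique_key='activity_date') }}

select
    date_trunc('day', event_at) as activity_date,
    count(distinct user_id)     as dau
from {{ ref('stg_events') }}
{% if is_incremental() %}
  -- on incremental runs, only scan events newer than what's already built
  where event_at > (select max(activity_date) from {{ this }})
{% endif %}
group by 1
```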

🔷 Snapshots

You use scd_customer_status.sql to track when customers switch from “trial” to “paid” to “churned.” This snapshot maintains historical states for analysis, using type-2 SCD strategy.
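A sketch of that snapshot, assuming the raw contacts table carries a `status` and an `updated_at` column:

```sql
-- snapshots/scd_customer_status.sql (sketch)
{% snapshot scd_customer_status %}

{{
    config(
      target_schema='snapshots',
      unique_key='customer_id',
      strategy='timestamp',
      updated_at='updated_at'
    )
}}

-- dbt adds dbt_valid_from / dbt_valid_to columns to record each state change
select customer_id, status, updated_at
from {{ source('hubspot', 'hubspot_contacts') }}

{% endsnapshot %}
```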

🔷 Tests

In your YAML files and tests/, you add:

  • Generic tests: not_null, unique, accepted_values

  • Custom tests: test_no_negative_revenue.sql (fails if any revenue amount is negative)

These run on every build and in CI to catch issues before dashboards break.
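Sketches of both styles (the `mrr` column on fct_revenue is an assumption):

```yaml
# models/marts/marts.yml (sketch)
version: 2

models:
  - name: fct_revenue
    columns:
      - name: revenue_month
        tests: [not_null, unique]
      - name: mrr
        tests: [not_null]
```

```sql
-- tests/test_no_negative_revenue.sql
-- A singular test passes when the query returns zero rows.
select *
from {{ ref('fct_revenue') }}
where mrr < 0
```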

🔷 Docs

Each model, column, and test has a description. dbt auto-generates browsable documentation with:

  • Descriptions

  • Owners

  • Tags

  • DAG visualizations (lineage)

You deploy this to your internal wiki so anyone in marketing or finance can browse what “MRR” or “Customer Type” actually means.

🔷 Seeds

You load currency_exchange_rates.csv as a seed and join it into your revenue model to normalize values to USD. It acts like a managed lookup table.
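Seeds are `ref()`-able just like models, so the join could look like this sketch (seed column names are assumptions):

```sql
-- Inside a revenue model: normalize local amounts to USD via the seed.
select
    p.payment_id,
    p.amount * fx.rate_to_usd as amount_usd
from {{ ref('stg_payments') }} p
left join {{ ref('currency_exchange_rates') }} fx
    on p.currency = fx.currency_code
```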

🔷 Macros & Jinja

You define a monthly_range() macro used across models to generate calendar-based cohort windows. This DRY approach prevents logic duplication.
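One possible shape for such a macro; the signature and the `dateadd` syntax (Snowflake/Redshift flavor) are assumptions:

```sql
-- macros/monthly_range.sql (hypothetical signature)
{% macro monthly_range(date_column, n_months=12) %}
    {{ date_column }} >= dateadd('month', -{{ n_months }}, date_trunc('month', current_date))
{% endmacro %}
```

Any model can then write `where {{ monthly_range('paid_at', n_months=6) }}` instead of repeating the date arithmetic.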

🔷 Version Control & CI/CD

The whole dbt repo lives in GitHub. Your team uses feature branches and pull requests with dbt test and dbt build triggered in CI (via dbt Cloud or GitHub Actions). Each change is reviewed and documented.
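A minimal GitHub Actions sketch of that CI step (adapter choice, Python version, and warehouse credentials handling are assumptions; `state:modified+` additionally requires prior run artifacts):

```yaml
# .github/workflows/dbt_ci.yml (sketch)
name: dbt CI
on: pull_request

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install dbt-snowflake   # adapter for your warehouse
      - run: dbt deps
      - run: dbt build                   # runs models, tests, seeds, snapshots
```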


Final Thoughts

dbt isn’t just another tool. It’s a methodology for managing complexity in the transformation layer—modular SQL, automated testing, clear documentation, and version control. It scales with your data, your team, and your ambitions.

If you’re just starting out, begin by building a simple dbt project with one raw table, a staging model, and a few tests. You’ll quickly see how it helps you work faster and safer.
