Getting Started with dbt: A Beginner’s Guide

In most modern data stacks, raw data lands in a warehouse—but transforming that data into something reliable, trusted, and usable? That’s where things often go sideways. As teams scale, they start drowning in a swamp of SQL scripts, inconsistent logic, and dashboard patchwork.
dbt (Data Build Tool) solves this by turning SQL-based transformations into a structured, testable, and collaborative workflow—bringing engineering best practices into analytics without requiring you to become a full-fledged software engineer.
What dbt Brings to the Table
Here’s a high-level overview of what dbt enables inside a data team:
✅ Modular SQL Models
Transformations are broken into focused `.sql` files called models. These run in dependency order and can be layered cleanly.
✅ Materializations
Models can be built as:
- `view`: always fresh, but built on demand
- `table`: persisted on each run
- `incremental`: only updates new data
- `ephemeral`: used as CTEs inside other models
✅ Source Tracking
Define upstream raw tables (from ingestion tools) as sources. This brings visibility, metadata, and lineage into your transformation pipeline.
✅ Data Testing
Built-in generic tests (like `not_null`, `unique`, `relationships`) and custom singular tests (e.g., "no revenue < 0") validate your logic and assumptions.
✅ Documentation
Document your models and columns inline. dbt auto-generates a browsable documentation site with lineage, tags, and descriptions.
✅ Snapshots
Track historical changes (type 2 slowly changing dimensions) for evolving data like customer plans or subscription status.
✅ Macros & Jinja
Use Jinja templates and custom macros to keep logic DRY, dynamic, and reusable across models.
✅ Git-based Collaboration
All code is version-controlled in Git. You build branches, create pull requests, and review changes—just like software teams.
✅ Scheduling & CI/CD
Run dbt jobs daily, hourly, or on-demand. Integrate it into a CI/CD process with dbt Cloud, GitHub Actions, or Airflow.
✅ Seeds & Packages
Use CSVs as reference tables (seeds), and import reusable utilities via packages like `dbt-utils`.
Realistic Example: Analytics for a SaaS Business
Let’s say you’re building a data platform for a subscription SaaS product. Your business wants insights into revenue, user behavior, churn, and customer segments. You ingest data from:
- Stripe (billing)
- HubSpot (CRM)
- Segment (event tracking)
These land in your warehouse as raw tables:
- `raw.stripe_payments`
- `raw.hubspot_contacts`
- `raw.segment_events`
Here’s how a dbt project can structure the transformation layer:
```
/models
  /sources
    stripe.yml
    hubspot.yml
    segment.yml
  /staging
    stg_payments.sql
    stg_contacts.sql
    stg_events.sql
  /intermediate
    int_mrr_breakdown.sql
    int_user_activity.sql
  /marts
    fct_revenue.sql
    fct_user_engagement.sql
    dim_customer.sql
/snapshots
  scd_customer_status.sql
/seeds
  currency_exchange_rates.csv
/tests
  test_no_negative_revenue.sql
/macros
  monthly_range.sql
```
How the Features Come Together
🔷 Sources
In `sources/stripe.yml`, you define `raw.stripe_payments` as a source. This lets you refer to it via `source()` instead of hardcoding table names, and it brings metadata, lineage, and control over upstream tables.
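A minimal sketch of what that source definition might look like (the schema and table names mirror this example project):

```yaml
# models/sources/stripe.yml -- minimal sketch of a source definition
version: 2

sources:
  - name: stripe
    schema: raw
    tables:
      - name: stripe_payments
```

Downstream models then reference it as `{{ source('stripe', 'stripe_payments') }}`, and dbt picks up the dependency automatically.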
🔷 Staging Models
You standardize inputs in `stg_payments.sql`: renaming fields like `amount_paid_cents` to `amount_usd`, filtering out test transactions, and fixing timezones. These staging models act as the clean base layer.
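Here's an illustrative version of `stg_payments.sql`; the Stripe column names are assumptions, and the timezone handling is warehouse-specific:

```sql
-- models/staging/stg_payments.sql (illustrative; column names are assumed)
select
    id as payment_id,
    customer_id,
    currency,
    amount_paid_cents / 100.0 as amount_usd,   -- cents -> dollars
    created_at at time zone 'UTC' as paid_at   -- syntax varies by warehouse
from {{ source('stripe', 'stripe_payments') }}
where not is_test_transaction                  -- drop Stripe test-mode rows
```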
🔷 Intermediate Models
In `int_user_activity.sql`, you enrich events from Segment with user metadata and calculate active days. In `int_mrr_breakdown.sql`, you break down subscription revenue into components like upgrades and downgrades.
These intermediate models separate complex logic, keeping fact models readable and maintainable.
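As a sketch of what `int_mrr_breakdown.sql` could look like, building on the staged columns assumed above:

```sql
-- models/intermediate/int_mrr_breakdown.sql (simplified sketch)
with monthly as (
    select
        customer_id,
        date_trunc('month', paid_at) as revenue_month,
        sum(amount_usd) as mrr
    from {{ ref('stg_payments') }}
    group by 1, 2
)

select
    customer_id,
    revenue_month,
    mrr,
    -- positive delta ~ expansion/upgrade, negative ~ contraction/downgrade
    mrr - lag(mrr) over (partition by customer_id order by revenue_month) as mrr_delta
from monthly
```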
🔷 Fact & Dimension Models
Your final output includes:
- `fct_revenue.sql`: monthly MRR, churn, expansion, and contraction
- `fct_user_engagement.sql`: DAU/WAU/MAU
- `dim_customer.sql`: an enriched profile of every user
These models feed dashboards, forecasts, and investor reports.
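To give a flavor of the mart layer, `fct_revenue.sql` might roll the intermediate model up to one row per month (simplified sketch, reusing the assumed columns from above):

```sql
-- models/marts/fct_revenue.sql (simplified sketch)
select
    revenue_month,
    sum(mrr) as total_mrr,
    sum(case when mrr_delta > 0 then mrr_delta else 0 end) as expansion_mrr,
    sum(case when mrr_delta < 0 then -mrr_delta else 0 end) as contraction_mrr
from {{ ref('int_mrr_breakdown') }}
group by 1
```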
🔷 Materializations
You use:
- `table` for `fct_revenue` (slow-changing, useful to cache)
- `incremental` for `fct_user_engagement` (daily updates)
- `view` for staging models
- `ephemeral` for lightweight CTE chains (in `int_user_activity.sql`)
This balances performance and flexibility.
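Materializations are set per model with a `config()` block (or per folder in `dbt_project.yml`). A sketch of the incremental case, with a hypothetical `activity_date` column:

```sql
-- models/marts/fct_user_engagement.sql (incremental sketch)
{{ config(materialized='incremental', unique_key='activity_date') }}

select
    activity_date,
    count(distinct user_id) as dau
from {{ ref('int_user_activity') }}
{% if is_incremental() %}
  -- on incremental runs, only process days newer than what's already built
  where activity_date > (select max(activity_date) from {{ this }})
{% endif %}
group by 1
```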
🔷 Snapshots
You use `scd_customer_status.sql` to track when customers switch from "trial" to "paid" to "churned." This snapshot maintains historical states for analysis, using a type-2 SCD strategy.
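A minimal snapshot sketch using dbt's `check` strategy (the source model and columns are assumptions):

```sql
-- snapshots/scd_customer_status.sql (minimal sketch)
{% snapshot scd_customer_status %}

{{
    config(
        target_schema='snapshots',
        unique_key='customer_id',
        strategy='check',
        check_cols=['status']
    )
}}

select customer_id, status
from {{ ref('stg_contacts') }}

{% endsnapshot %}
```

dbt adds `dbt_valid_from` / `dbt_valid_to` columns on each run, giving you the type-2 history for free.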
🔷 Tests
In your YAML files and `tests/`, you add:
- Generic tests: `not_null`, `unique`, `accepted_values`
- Custom tests: `test_no_negative_revenue.sql` (ensures no refunds exceed payments)
These run on every build and CI process to catch issues before dashboards break.
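Generic tests live in YAML next to the models, for example:

```yaml
# models/marts/schema.yml (excerpt)
models:
  - name: dim_customer
    columns:
      - name: customer_id
        tests:
          - not_null
          - unique
```

Singular tests are just SQL queries that fail if they return any rows. A sketch, assuming the `total_mrr` column from the revenue model above:

```sql
-- tests/test_no_negative_revenue.sql
-- any row returned here counts as a failure
select *
from {{ ref('fct_revenue') }}
where total_mrr < 0
```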
🔷 Docs
Each model, column, and test has a description. dbt auto-generates browsable documentation with:
- Descriptions
- Owners
- Tags
- DAG visualizations (lineage)
You deploy this to your internal wiki so anyone in marketing or finance can browse what “MRR” or “Customer Type” actually means.
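Descriptions live in the same YAML files as your tests. For instance:

```yaml
# models/marts/schema.yml (excerpt)
models:
  - name: fct_revenue
    description: "Monthly recurring revenue, with churn, expansion, and contraction."
    columns:
      - name: total_mrr
        description: "Total MRR across active subscriptions for the month."
```

Running `dbt docs generate` builds the site, and `dbt docs serve` hosts it locally for browsing.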
🔷 Seeds
You load `currency_exchange_rates.csv` as a seed and join it into your revenue model to normalize values to USD. It acts like a managed lookup table.
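Seeds are referenced with `ref()` just like models. A sketch of the join, assuming the payments keep a local-currency amount and the seed carries a rate column (both column names are hypothetical):

```sql
-- illustrative join; column names in both tables are assumptions
select
    p.payment_id,
    p.amount_local * fx.rate_to_usd as amount_usd
from {{ ref('stg_payments') }} as p
left join {{ ref('currency_exchange_rates') }} as fx
    on p.currency = fx.currency_code
```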
🔷 Macros & Jinja
You define a `monthly_range()` macro used across models to generate calendar-based cohort windows. This DRY approach prevents logic duplication.
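One hypothetical shape such a macro could take (interval syntax varies by warehouse):

```sql
-- macros/monthly_range.sql (hypothetical sketch)
{% macro monthly_range(n_months) %}
    -- returns the first day of the month n_months back from today
    date_trunc('month', current_date) - interval '{{ n_months }} month'
{% endmacro %}
```

A model can then filter to a trailing window with `where revenue_month >= {{ monthly_range(12) }}`, and every cohort model shares the same definition.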
🔷 Version Control & CI/CD
The whole dbt repo lives in GitHub. Your team uses feature branches and pull requests, with `dbt test` and `dbt build` triggered in CI (via dbt Cloud or GitHub Actions). Each change is reviewed and documented.
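A minimal GitHub Actions sketch of such a pipeline (the adapter, profile location, and credentials handling are assumptions):

```yaml
# .github/workflows/dbt_ci.yml (minimal sketch)
name: dbt CI
on: pull_request

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install dbt-snowflake   # swap in the adapter for your warehouse
      - run: dbt deps                    # install packages like dbt-utils
      - run: dbt build                   # runs models, tests, seeds, and snapshots
        env:
          DBT_PROFILES_DIR: .            # assumes a CI profiles.yml in the repo root
```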
Final Thoughts
dbt isn’t just another tool. It’s a methodology for managing complexity in the transformation layer—modular SQL, automated testing, clear documentation, and version control. It scales with your data, your team, and your ambitions.
If you’re just starting out, begin by building a simple dbt project with one raw table, a staging model, and a few tests. You’ll quickly see how it helps you work faster and safer.