Little Bit of Analytics Engineering
This article is a set of notes on the book Analytics Engineering with SQL and dbt.
Chapter 1. Analytics Engineering
Data Analytics Lifecycle
Problem definition
First, understand the problem: identify the business objectives and the resources available to solve it.
Data modeling
Begin modeling data according to the modeling technique that fits the needs: star schema, snowflake schema, Data Vault, etc.
Data ingestion and transformation
Ingest and prepare data from source systems to match the models created.
Data storage and structuring
Decide on the file format (Apache Parquet, Delta Lake, or Apache Iceberg), as well as partitioning strategies and storage components (Amazon S3, Redshift, Snowflake).
Data visualization and analysis
Visualize data to directly support decision making. This phase should happen in close coordination with business stakeholders.
Data quality monitoring, testing, and documentation
Implement quality control to ensure that stakeholders can trust exposed data models, document transformation steps and semantic meanings, and ensure pipeline testing.
Why Use dbt
Enables organizations to perform complex data transformation tasks quickly and easily.
Integrates with other tools, like Apache Airflow, for complete data pipeline management.
dbt Features
Data modeling: Allows defining data models with SQL-based syntax.
Data testing: Provides a testing framework for teams to test their data models, ensuring accuracy and reliability.
Data documentation: Enables documentation of data models and services.
Data tracking capabilities: Allows tracing the origin of data models (data lineage).
Data governance capabilities: Enforces data governance policies, such as data access controls, data lineage, and data quality checks.
ELT
Data is first extracted and loaded into the target system before transformation
Advantages over ETL:
Increased flexibility to support a wider range of data applications
Versatility in accommodating various data transformations and real-time insights directly within the target system
Ability to derive actionable insights from the data more rapidly and adapt to changing analytical needs
Disadvantage:
- Higher storage and ingestion costs, though this is justifiable given the flexibility it brings to operations
Chapter 2. Data Modeling for Analytics
Data Modeling
Defining the structure, relationships, and attributes of data entities within a system
An essential aspect of data modeling is normalization
Breaking data into logical units and organizing them into separate tables → eliminates data redundancy and improves data integrity
Normalization ensures that data is stored in a structured and consistent manner
With SQL, we can define tables, manipulate data, and retrieve information
- Complementary to SQL, dbt serves as a comprehensive framework for constructing and orchestrating complex data pipelines. Users can define transformation logic, apply essential business rules, and craft reusable, modular code components known as models
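As a minimal sketch (table and column names here are hypothetical), a dbt model is simply a SELECT statement saved as a .sql file that dbt materializes in the warehouse:

```sql
-- models/customer_orders.sql
-- A dbt model is just a SELECT statement; dbt creates the table/view.
SELECT
    customer_id,
    COUNT(order_id) AS order_count,      -- reusable business rule
    MIN(ordered_at) AS first_order_date
FROM raw_orders
GROUP BY customer_id
```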
Phases of Data Modeling
Conceptual
Identify the purpose of the database and clarify problems or requirements with stakeholders. Understand the required data elements, relationships, and constraints with subject matter experts (SMEs).
Logical
Normalize the data to eliminate redundancies, improve data integrity, and optimize query performance. The normalized logical model reflects the relationships and dependencies among entities.
Physical
Translate the logical data model into a specific database implementation.
Data Normalization Process
1NF
Eliminate repeating groups by breaking the data into smaller atomic units (atomicity).
2NF
Examine dependencies within the data. Ensure that non-primary-key columns depend on the entire primary key (redundancy).
3NF
Eliminate transitive dependencies by ensuring that non-primary-key columns depend only on the primary key (transitivity).
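A hypothetical walk-through of these steps: an orders table that stores repeated book columns, plus a publisher's country (which depends on the publisher, not on the order), would be decomposed as follows.

```sql
-- 1NF: no repeating groups; one atomic book per row
CREATE TABLE order_items (
    order_id INT,
    book_id  INT,
    PRIMARY KEY (order_id, book_id)
);

-- 2NF: book attributes depend on the whole books key, not on order_id
CREATE TABLE books (
    book_id      INT PRIMARY KEY,
    book_title   TEXT,
    publisher_id INT
);

-- 3NF: country moved out of books to remove the transitive
-- dependency book_id -> publisher_id -> country
CREATE TABLE publishers (
    publisher_id INT PRIMARY KEY,
    country      TEXT
);
```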
Dimensional Data Modeling
A popular approach to data modeling is dimensional modeling, which focuses on modeling data to support analytics and reporting requirements. It is particularly suited for data warehouse (DW) and business intelligence (BI) applications
Data modeling lays the foundation for capturing and structuring data, while dimensional modeling offers specialized techniques that support analytical and reporting needs. Together, they enable robust and flexible databases that facilitate both transactional processing and in-depth data analysis
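As an illustration (hypothetical schema), a minimal star schema places measures in a central fact table that references descriptive dimension tables:

```sql
-- Dimensions: descriptive attributes used for slicing and filtering
CREATE TABLE dim_customer (
    customer_key  INT PRIMARY KEY,
    customer_name TEXT,
    region        TEXT
);

CREATE TABLE dim_date (
    date_key  INT PRIMARY KEY,
    full_date DATE,
    year      INT
);

-- Fact: numeric measures plus foreign keys to the dimensions
CREATE TABLE fact_sales (
    customer_key INT REFERENCES dim_customer (customer_key),
    date_key     INT REFERENCES dim_date (date_key),
    quantity     INT,
    amount       NUMERIC(10, 2)
);
```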
Referencing Data Models in dbt
Using Jinja templates, users can establish clear dependencies among models, enhance code reusability, and ensure the consistency and accuracy of the data pipeline
- Jinja: templating language that allows dynamic transformation within SQL
{{ ref() }}: dbt can automatically detect and establish dependencies among models based on upstream tables
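A short sketch of {{ ref() }} in practice (model names hypothetical): dbt resolves the reference to the right schema and always builds stg_books before this model.

```sql
-- models/int_author_stats.sql
SELECT
    author_id,
    COUNT(book_id) AS book_count
FROM {{ ref('stg_books') }}  -- declares a dependency on the stg_books model
GROUP BY author_id
```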
Staging Data Models
The staging layer serves as the basis for the modular construction of complex data models. Each staging model corresponds to a source table in a 1:1 relationship
Example: select relevant columns from the raw table, with basic transformations such as renaming columns or converting data types
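A minimal staging-model sketch, assuming a raw source declared as bookstore.books (all names hypothetical); note that it only selects, renames, and casts:

```sql
-- models/staging/stg_books.sql
SELECT
    id                        AS book_id,           -- rename column
    title                     AS book_title,
    CAST(pub_year AS INTEGER) AS publication_year   -- convert data type
FROM {{ source('bookstore', 'books') }}             -- 1:1 with the raw table
```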
Intermediate Data Models
- Combine the atomic building blocks from the staging layer to create complex and meaningful models
Mart Models
Combine all relevant data from multiple sources and transform it into a cohesive view. Responsible for presenting business-defined entities via dashboards or applications
Typically materialized as tables; materialization enables faster query execution
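A hedged sketch of a mart model (upstream model names hypothetical), materialized as a table and joining staged data into one business-facing view:

```sql
-- models/marts/mart_author_sales.sql
{{ config(materialized='table') }}  -- table materialization for fast reads

SELECT
    a.author_name,
    SUM(s.amount) AS total_sales
FROM {{ ref('int_sales') }}   AS s
JOIN {{ ref('stg_authors') }} AS a
    ON s.author_id = a.author_id
GROUP BY a.author_name
```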
Testing in dbt
Tests are designed to identify rows or records that don’t meet the specified assertion criteria
Singular tests: specific and targeted tests written as SQL. They test specific aspects of the data, such as the absence of NULL values in a fact table or other validations
Generic tests: versatile tests that can be applied to multiple models or data sources, such as checking data consistency among tables or ensuring the integrity of specific columns
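A sketch of a singular test (model and column names hypothetical): in dbt, a test is a SELECT that returns the failing rows, and the test passes only when the query returns no rows.

```sql
-- tests/assert_fact_sales_amount_not_null.sql
-- Any returned row is a violation of the assertion.
SELECT *
FROM {{ ref('fact_sales') }}
WHERE amount IS NULL
```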
Chapter 3. SQL for Analytics
Views
A view is a virtual table in a DB defined by a query. The view dynamically retrieves data from the underlying tables when it's accessed
Example
```sql
CREATE VIEW author_book_count AS
SELECT
    authors.author_id,
    COUNT(books.book_id) AS book_count
FROM authors
JOIN books
    ON authors.author_id = books.author_id
GROUP BY authors.author_id;
```
CTE (Common Table Expressions)
Acts as a temporary result set. Enhances code readability by breaking queries into blocks
- Define once, use many times
WITH: keyword to declare the CTE. AS: keyword to define the CTE query
Example
```sql
WITH cte_one (column1, column2) AS (
    ...
),
cte_two (column3, column4) AS (
    ...
)
SELECT column1, column2
FROM cte_one;
```
Window Functions
Performs an aggregate-like operation on a set of query rows
Example: perform calculation on a set of table rows related to the current row
```sql
SELECT
    book_id,
    book_title,
    publication_year,
    ROW_NUMBER() OVER (ORDER BY publication_year) AS running_count
FROM books;
```
PARTITION BY: divides the rows into partitions based on one or more columns. The window function is applied separately to each partition
ORDER BY: specifies the columns that determine the order of rows within each partition. Helps define the logical sequence
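Extending the books example with both clauses (the author_id column is assumed): the numbering restarts for each author and follows publication order within each partition.

```sql
SELECT
    book_id,
    author_id,
    publication_year,
    ROW_NUMBER() OVER (
        PARTITION BY author_id       -- separate numbering per author
        ORDER BY publication_year    -- logical sequence inside each partition
    ) AS author_book_seq
FROM books;
```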
Practical dbt
A practical dbt use case demonstrated by the Danggeun data team
Structure
Source Layer (Raw): source data from the data warehouse
Base Layer (Staging)
You can slightly transform data in this layer, such as selecting columns, formatting data, and casting types
Mapped 1:1 to source data
We don't use physical tables here; instead, we materialize as views
Dimension Layer: entity attributes, such as User and Product
We can use advanced operations, such as JOIN and WHERE
Materialized as tables
Fresh data is needed → periodic recalculation, with the refresh period auto-adjusted using dbt tags
Fact Layer: contains "X did Y" information, such as user activity data, object creation, or transformation events
We can use advanced operations here, as in the dimension layer. However, as time progresses, more data gets added → large data size
- Since the data size is large and the data can be split by time frame, we materialize incrementally to take advantage of efficient data lookup and storage (see the sketch after this list)
Like the dimension layer, dbt tags can be used to auto-adjust the refresh period
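A minimal incremental-materialization sketch, assuming hypothetical model and column names; on incremental runs, dbt appends only rows newer than what the target table already holds.

```sql
-- models/fact/fct_user_activity.sql
{{ config(materialized='incremental') }}

SELECT
    event_id,
    user_id,
    event_type,
    event_at
FROM {{ ref('base_events') }}
{% if is_incremental() %}
  -- only process events newer than the current max in the target table
  WHERE event_at > (SELECT MAX(event_at) FROM {{ this }})
{% endif %}
```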
SQL
Use ref to depend on other models
- Wrote defensive logic to guard against cyclic dependencies
Created macros as needed for model attributes
- Example: automatically saving snapshots, adding custom labels to BigQuery queries
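The team's actual macros aren't shown in these notes; as a purely hypothetical illustration of what a dbt macro looks like, this one wraps a repeated calculation into a reusable Jinja block:

```sql
-- macros/cents_to_dollars.sql (hypothetical example macro)
{% macro cents_to_dollars(column_name, precision=2) %}
    ROUND({{ column_name }} / 100.0, {{ precision }})
{% endmacro %}
```

A model can then call {{ cents_to_dollars('price_cents') }} instead of repeating the arithmetic in every SELECT.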
YAML
- Storage, partitioning, and clustering methods are easily defined via config in YAML and SQL
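For instance, on BigQuery the partitioning and clustering of the hypothetical fact model above could be declared directly in its config block (a sketch, not the team's actual settings):

```sql
-- models/fact/fct_user_activity.sql (BigQuery)
{{ config(
    materialized='incremental',
    partition_by={'field': 'event_at', 'data_type': 'timestamp'},  -- time partitioning
    cluster_by=['user_id']                                         -- clustering key
) }}

SELECT event_id, user_id, event_type, event_at
FROM {{ ref('base_events') }}
```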
Orchestration
Use Astronomer Cosmos to run the pipeline
Observability using Datadog → Slack
Documentation
Use dbt docs to provide stakeholders with (1) model definitions and explanations, and (2) business logic