Data Modeling and Clustering Strategies in Snowflake


In the last few blog posts, we explored powerful features built into Snowflake—from Time Travel and Streams to UDFs and Snowpark. We also saw how dbt integrates with Snowflake as a transformation layer.

Now it’s time to shift gears and focus on data design.

This post is all about modeling your data to make the most of Snowflake's architecture. We’ll explore two tightly connected ideas:

  • Data modeling: the blueprint for how your data is stored and accessed

  • Clustering: how Snowflake physically organizes that data for peak performance


1. Why Modeling Still Matters in Snowflake

Snowflake separates storage and compute. It handles infrastructure behind the scenes. But that doesn't mean structure doesn't matter.

A poorly modeled warehouse doesn't just mean messy SQL—it translates directly into wasted compute, slow dashboards, and duplicated logic.

With a well-designed model, you can:

  • Minimize joins and avoid expensive full-table scans

  • Reduce compute usage across your queries

  • Improve dashboard load times and team productivity

Modeling is the foundation of scalable analytics.


2. Data Modeling in Practice

Data modeling in Snowflake typically revolves around two tried-and-true patterns:

The Star Schema: Simple and Fast
The most common approach. A central fact table contains measures (like sales_amount, orders_count) and connects directly to dimension tables (dim_customer, dim_product, dim_date).

This structure is simple, easy to understand, and efficient to query. Since all dimensions are denormalized, queries require fewer joins, which usually results in faster execution and easier SQL.
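
For a sense of what this looks like in SQL, here's a minimal sketch of a star-schema query. The fct_orders fact table and its columns are illustrative placeholders, paired with the dimension tables mentioned above:

-- One fact table joined directly to its dimensions: the typical star-schema shape.
select
    d.calendar_month,
    c.customer_segment,
    sum(f.sales_amount) as total_sales,
    count(f.order_id)   as orders_count
from fct_orders f
join dim_customer c on f.customer_id = c.customer_id
join dim_date d on f.order_date = d.date_day
group by 1, 2;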

The Snowflake Schema: Normalized and Efficient
An extension of the star schema. Here, dimension tables are further broken down into sub-dimensions. For example, dim_product might normalize into dim_product, dim_category, and dim_brand.

This reduces data duplication and improves data integrity, but increases join complexity. It's useful when storage efficiency and data governance are top priorities.
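
To see the trade-off, here's the same kind of product rollup in a snowflake schema, where dim_product points at the dim_category and dim_brand sub-dimensions from the example above (again, hypothetical names):

-- The extra normalization means extra hops through sub-dimension tables.
select
    b.brand_name,
    cat.category_name,
    sum(f.sales_amount) as total_sales
from fct_orders f
join dim_product p on f.product_id = p.product_id
join dim_category cat on p.category_id = cat.category_id
join dim_brand b on p.brand_id = b.brand_id
group by 1, 2;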

Flat Tables: Flexible but Heavy
Sometimes, a wide denormalized table is the right answer—especially for ML workflows or BI tools that don’t handle joins well. But beware: flat tables are easy to query but hard to maintain, and often lead to redundancy.
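
A flat table is usually just the star schema pre-joined once. As a sketch, with the same illustrative table names:

-- Every attribute pre-joined into one wide table so downstream tools avoid joins.
-- Easy to query, but any change to a dimension forces a full rebuild.
create or replace table analytics.wide_orders as
select
    f.order_id,
    f.sales_amount,
    c.customer_name,
    c.customer_segment,
    p.product_name,
    d.calendar_month
from fct_orders f
join dim_customer c on f.customer_id = c.customer_id
join dim_product p on f.product_id = p.product_id
join dim_date d on f.order_date = d.date_day;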

In dbt, a clean modeling project often looks like this:

  • staging/: raw data, lightly cleaned

  • intermediate/: business logic, joins, and transformations

  • marts/: final outputs for analysis or dashboards
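
As a rough sketch, a marts model in that layout might look like this (the model and column names are made up for illustration):

-- models/marts/fct_orders.sql
{{ config(materialized='table') }}

select
    o.order_id,
    o.customer_id,
    o.order_date,
    o.sales_amount
from {{ ref('int_orders_enriched') }} o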


3. What Are Micro-Partitions?

Under the hood, Snowflake stores all your table data in micro-partitions: columnar storage units that each hold roughly 50–500 MB of uncompressed data (stored compressed, so the physical footprint is smaller).

Each micro-partition has metadata: the min/max values for each column, number of distinct values, and null counts.

When you query a large table, Snowflake uses this metadata to prune irrelevant partitions—which can drastically reduce scan time. This is what makes Snowflake fast, especially when you filter data with WHERE clauses.
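
For example, a query like the one below (table name is illustrative) only has to touch the micro-partitions whose event_date min/max range overlaps the filter. You can see the effect in the query profile as "Partitions scanned" versus "Partitions total":

-- A selective range filter lets Snowflake skip partitions whose metadata
-- shows no rows in this date range.
select count(*)
from analytics.fct_events
where event_date between '2024-01-01' and '2024-01-07';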


4. Introduction to Clustering

Over time, as new data is inserted or updated, the natural ordering of micro-partitions degrades. Filtering gets slower, and Snowflake must scan more partitions than necessary.

Clustering fixes that. It’s Snowflake’s way of physically reordering the data by one or more columns (called clustering keys) so that related rows are grouped together inside partitions.

Think of it like a library: if books are shelved by author and genre, finding a specific one is faster. Clustering is Snowflake's way of doing the same under the hood.


5. Natural vs. Manual Clustering

Natural Clustering happens as data is ingested. Initially, this may be fine—especially if you're inserting data in order (e.g., by event_date). But over time, natural order breaks down.

Clustering keys let you explicitly tell Snowflake how to organize the data (a one-line ALTER TABLE statement, shown after this list). A well-chosen key:

  • Improves pruning

  • Keeps partition metadata accurate

  • Makes large-table queries much faster
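
Defining a key is a single statement. For example, on a hypothetical events table:

-- Add a clustering key to an existing table...
alter table analytics.fct_events cluster by (event_date);

-- ...or declare it when the table is created.
create or replace table analytics.fct_events (
    event_id   number,
    event_date date,
    payload    variant
) cluster by (event_date);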


6. Choosing a Clustering Key

The effectiveness of clustering hinges entirely on selecting the right key. Use columns that are frequently used in WHERE or JOIN clauses. Cardinality is important:

  • Low cardinality (e.g., country) might not reduce scan size much.

  • High cardinality (e.g., uuid) spreads data too thin.

Good keys:

cluster by (event_date)           -- For time-based queries
cluster by (customer_id, event_date)  -- For behavioral analytics

Less effective:

cluster by (uuid)     -- Too many unique values
cluster by (status)   -- Too few values
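
If you're unsure where a candidate column falls, a quick cardinality check helps (table and column names are illustrative):

-- Compare distinct counts against total rows: a handful of values won't prune much,
-- while near-unique values fragment the clustering.
select
    count(*)                            as total_rows,
    approx_count_distinct(event_date)   as event_date_values,
    approx_count_distinct(customer_id)  as customer_id_values,
    approx_count_distinct(status)       as status_values
from analytics.fct_events;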

7. Automatic vs. Manual Clustering

Once a table has a clustering key defined, Snowflake's Automatic Clustering takes over. It's a serverless service that:

  • Detects when reclustering is needed

  • Runs in the background

  • Doesn’t block queries

  • Charges only for serverless compute used

You can resume, suspend, or monitor it anytime. For most use cases, this is the simplest and most effective way to keep your tables optimized.

In edge cases, you can take reclustering into your own hands, for example by suspending Automatic Clustering and rebuilding or re-sorting the table yourself (Snowflake's older manual ALTER TABLE ... RECLUSTER command is deprecated). This gives more control but requires hands-on monitoring.
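
The switches themselves are simple ALTER TABLE statements (table name is illustrative):

-- Pause the background service, e.g. before a large backfill...
alter table analytics.fct_events suspend recluster;

-- ...and turn it back on afterwards.
alter table analytics.fct_events resume recluster;

-- SHOW TABLES includes an automatic_clustering column (ON/OFF) if you want to check the current state.
show tables like 'fct_events' in schema analytics;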


8. Implementing Clustering in dbt

Set clustering in your model config:

{{ config(
  materialized='table',
  cluster_by=['event_date']
) }}

For incremental models, you can keep Automatic Clustering active by resuming it in a post-hook:

{{ config(
  materialized='incremental',
  cluster_by=['event_date'],
  post_hook="ALTER TABLE {{ this }} RESUME RECLUSTER"
) }}

9. Monitoring Clustering Activity

Use built-in functions to understand how well your clustering is working:

select system$clustering_information('analytics.fct_events');
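
The result is a JSON document with fields such as total_partition_count, average_overlaps, and average_depth; lower depth and overlap values generally mean better-clustered data. If you only need one metric, you can extract it directly:

-- Pull a single metric out of the JSON returned by the function.
select parse_json(system$clustering_information('analytics.fct_events')):average_depth::float as avg_depth;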

And monitor recent activity:

select to_date(start_time), table_name, sum(credits_used) as credits
from snowflake.account_usage.automatic_clustering_history
where start_time > dateadd(day, -30, current_timestamp())
group by 1,2
order by 3 desc;

10. When Not to Cluster

Clustering isn’t always necessary. Skip it if:

  • The table is small

  • Queries are unpredictable

  • It’s a write-heavy table with little analytical value

Start with modeling. Add clustering only when it actually improves performance and saves cost.


Final Thoughts

You now understand the 'what' and 'why' of Snowflake's data modeling and clustering strategies.

  • Use star schemas for speed and simplicity

  • Consider snowflake schemas when integrity and normalization matter

  • Add clustering when query performance becomes a problem, not before

  • Let automatic clustering do the heavy lifting unless you have edge-case needs

Modeling and clustering are how you move from a functional Snowflake project to a fast, future-proof one.
