Selective Denormalization Strategy: Why Go in the Opposite Direction, Partially?

Everyone talks about normalization. But barely anyone talks about denormalization—let alone selective denormalization.

Back in college, while studying DBMS concepts, professors would emphasize how important—crucial, even—normalization is. And to be honest, they weren’t wrong. After all, that topic was a guaranteed guest in the exams.

But here’s the thing—they were only partially right.

Why? Because as you gain real-world experience, you begin to realize a simple but powerful truth:

Everything comes at a cost. And that cost can be paid in different currencies—whether it’s financial feasibility, query execution time, architectural complexity, system maintainability, or long-term sustainability.

At the end of the day, it’s all about trade-offs. What do you prioritize—and more importantly, why? That “why” becomes the driving force behind your decisions.

So, who should read this article?

Well, technically anyone. But if I had to be specific, here’s who would benefit the most:

  • Students who like to think beyond textbooks—the ones who challenge the classics.

  • Aspiring data professionals (especially data engineers).

  • Anyone with an interest in data warehouse architecture.

  • Curious minds exploring data modeling and dimensional modeling.

  • Engineers trying to improve performance in existing projects—this strategy might just give them a new perspective.

Let’s dive in.


Normalization vs. Denormalization: Through the Eyes of an Architect

🧩 Normalization

Let’s get this out of the way quickly. We won’t dwell too long on the theory—you’ve probably learned this in college or encountered it during system design. Still, here’s a quick refresher.

In short, normalization is the process of breaking down larger tables into smaller, well-structured ones to eliminate redundancy and maintain data integrity. This separation typically follows normal forms (1NF to 5NF) and reflects logical relationships between entities.

Advantages of Normalization

  1. Efficient Updates: Data lives in one place. Updating a customer’s name or product price? You change it once, and it reflects everywhere.

  2. Lightweight Writes: Since each table focuses on minimal fields, write operations are fast and storage-efficient.

⚠️ Disadvantages of Normalization

  1. Slower Analytical Reads: For analytical use cases, data often needs to be re-joined from multiple tables—this adds latency and compute cost.

  2. Harder for Business Users: Analysts and business users might find normalized schemas too complex to query directly.

  3. Expensive Joins: Over-normalization results in too many small tables. Querying across these with chained joins increases processing time and resource usage.

🏗️ Denormalization

Denormalization is the intentional act of introducing data redundancy into the schema. It involves combining data from multiple normalized tables—selectively—into one, to improve read performance and reduce join overhead.

That’s right: we do this on purpose, and with good reason.

But why?

Because we evaluate tradeoffs. And this strategy makes sense in the right contexts—especially in analytics and reporting-heavy systems where read performance matters more than update efficiency.

Advantages of Denormalization

  • Faster Reads: Data needed for queries is consolidated—so joins are reduced or eliminated.

  • Efficient Aggregations & Filters: With all relevant data available in one row, operations like filtering, grouping, and slicing become much quicker.

  • Simpler Query Logic: Especially useful for business users and dashboard tools that benefit from flatter schemas.

⚠️ Disadvantages of Denormalization

  • Redundant Data Updates: Repeating the same information (e.g., product category or customer name) across many rows means updates are expensive and prone to errors.

  • Data Inconsistency Risk: Without careful design or governance, duplicated values can drift and create inconsistencies.

  • Heavier Write Loads: Write operations carry more data, increasing I/O overhead and impacting ingestion speeds.

  • Over-Denormalization Pitfalls: Going too far leads to massive tables mixing multiple entities—hurting performance, making maintenance harder, and defeating the purpose of structured modeling.

Up next: Selective Denormalization — the sweet spot between normalization and full denormalization. When done right, it can drastically improve performance without sacrificing structure. Let’s dive in. 🔍


What Do You Do 🧐 and How Do You Do It? 😉

It’s rarely a smart move to commit blindly to one strategy—fully normalized or fully denormalized. Instead, the real advantage comes from striking a balance: leveraging the best of both worlds.

That balance is what we call the Selective Denormalization Strategy.

As the name suggests, we intentionally introduce data duplication—but only for selected attributes. If we were to duplicate everything, it would be no different than creating a monolithic, fully denormalized table, which comes with serious drawbacks.

Below is a 3-step guideline to help you identify the right fields to denormalize:

1️⃣ Understand Query Patterns & Limit the Scope of Denormalization

  • You don’t need to denormalize every categorical attribute from every dimension table.

  • Focus only on attributes frequently used in filtering or grouping within analytical queries.

  • How many and which columns to denormalize depends on the schema, the analytical workload, and the business questions you’re trying to answer.

  • Goal: Optimize for the most common and performance-critical queries, not for theoretical completeness.
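
If you’re unsure which queries actually dominate your workload, the warehouse’s own query history is a good place to look. Below is a minimal sketch for Snowflake (assuming access to the ACCOUNT_USAGE share; the other platforms expose similar system views) that surfaces the slowest recent queries touching the fact table, so you can see which attributes they really filter and group on.

SQL

-- Find the heaviest recent queries that touch the sales fact table,
-- then inspect their WHERE / GROUP BY clauses for candidate columns.
SELECT
  query_text,
  total_elapsed_time / 1000 AS elapsed_seconds
FROM snowflake.account_usage.query_history
WHERE query_text ILIKE '%fact_sales%'
  AND start_time >= DATEADD(day, -30, CURRENT_TIMESTAMP())
ORDER BY total_elapsed_time DESC
LIMIT 50;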

2️⃣ Avoid Denormalizing High-Cardinality or Volatile Data

  • Cardinality refers to the number of unique values in a column relative to the total number of rows.

  • Avoid denormalizing:

    • High-cardinality fields (e.g., customer email, transaction ID).

    • Highly volatile attributes (e.g., customer status, product stock level).

  • These are best kept in their respective dimension tables to avoid bloating the fact table and increasing update complexity.

  • Low-cardinality, low-volatility fields are typically great candidates for denormalization—especially for improving filtering and grouping performance.
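
A quick way to sanity-check a candidate is to profile it directly. Here’s a small sketch against dim_customers (using the column names from the schema later in this article): customer_segment has only a handful of distinct values, while email is nearly unique per row and should stay in the dimension.

SQL

-- Compare distinct counts against total rows to gauge cardinality.
SELECT
  COUNT(*)                         AS total_rows,
  COUNT(DISTINCT customer_segment) AS distinct_segments,  -- low cardinality: good candidate
  COUNT(DISTINCT email)            AS distinct_emails     -- high cardinality: keep in the dimension
FROM db_retail.analytics.dim_customers;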

3️⃣ Distinguish Between Contextual-Categorical and Descriptive-Categorical Values

  • Contextual-Categorical Values: These are tied to events, tend to be low in cardinality, and rarely change. Great candidates for denormalization.

    • Example: store_location (if it indicates where the product was sold).
  • Descriptive-Categorical Values: These are tied to entities (like customer or product) and can change more frequently or be reclassified.

    • Example: residential_location (tied to the customer entity and subject to change).
  • Why this matters: Descriptive values are more prone to data inconsistency when denormalized and can cause unnecessary duplication.

Real-World Example: Our fact_sales_denormalized Table

We selectively denormalized the following columns:

  • customer_segment (from dim_customers)

  • category (from dim_products)

  • store_type, store_location (from dim_stores)

  • campaign_name (from dim_campaigns)

At first glance, all of these might appear to be Descriptive-Categorical values—and they are. But here's a nuance worth understanding:

Notice that our dim_customers also includes a residential_location.

In this case:

  • residential_location would clearly remain a Descriptive-Categorical value — it’s tied directly to a customer (an entity).

  • However, store_location can now be treated as a Contextual-Categorical value — because it represents the location where the sale occurred, an event-based context.

This distinction is subtle but crucial — and this kind of reasoning helps identify the fields worth denormalizing to optimize performance without sacrificing structure.
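
To make this concrete, here is a minimal sketch of how fact_sales_denormalized could be built from the normalized tables. It uses the table and column names from the schema in the next section; the exact statement you run will depend on your platform and load strategy.

SQL

-- Build the selectively denormalized fact table by copying only the
-- chosen low-cardinality attributes out of the dimension tables.
CREATE OR REPLACE TABLE db_retail.analytics.fact_sales_denormalized AS
SELECT
  f.*,                  -- all original fact columns and surrogate keys
  c.customer_segment,   -- from dim_customers
  p.category,           -- from dim_products
  s.store_type,         -- from dim_stores
  s.store_location,     -- from dim_stores
  ca.campaign_name      -- from dim_campaigns
FROM db_retail.analytics.fact_sales_normalized AS f
JOIN db_retail.analytics.dim_customers AS c  ON f.customer_sk = c.customer_sk
JOIN db_retail.analytics.dim_products  AS p  ON f.product_sk  = p.product_sk
JOIN db_retail.analytics.dim_stores    AS s  ON f.store_sk    = s.store_sk
JOIN db_retail.analytics.dim_campaigns AS ca ON f.campaign_sk = ca.campaign_sk;

If some sales rows might not belong to a campaign, switch the campaign join to a LEFT JOIN so those rows are not dropped.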


The Schema, Metadata, and Targeted Insights

Below is the schema for the synthetically generated dataset used in this experiment.

The design was created in Figma, with a strong emphasis on visual clarity—we used color-coding to differentiate between entities, relationships, and categories.

This not only improves readability but also makes it easier to understand the context and purpose of each element in the schema.

A well-structured schema should tell a story — about the data, its purpose, and how it's intended to be queried.

dim_campaigns [ 50 total records ]

  • campaign_sk

  • campaign_id

  • campaign_name

  • start_date_sk

  • end_date_sk

  • campaign_budget

dim_customers [ 100,000 total records ]

  • customer_sk

  • customer_id

  • first_name

  • last_name

  • email

  • residential_location

  • customer_segment

dim_dates [ 366 total records ]

  • full_date

  • date_sk

  • year

  • month

  • day

  • weekday

  • quarter

dim_products [ 100,000 total records ]

  • product_sk

  • product_id

  • product_name

  • category

  • brand

  • origin_location

dim_salespersons [ 2,000 total records ]

  • salesperson_sk

  • salesperson_id

  • salesperson_name

  • salesperson_role

dim_stores [ 500 total records ]

  • store_sk

  • store_id

  • store_name

  • store_type

  • store_location

  • store_manager_sk

fact_sales_normalized [ 1,000,000 total records ]

  • sales_sk

  • sales_id

  • customer_sk

  • product_sk

  • store_sk

  • salesperson_sk

  • campaign_sk

  • sales_date

  • total_amount

fact_sales_denormalized [ 1,000,000 total records ]

  • sales_sk

  • sales_id

  • customer_sk

  • product_sk

  • store_sk

  • salesperson_sk

  • campaign_sk

  • sales_date

  • total_amount

  • customer_segment

  • category

  • store_type

  • store_location

  • campaign_name

Deriving Insights — Validating the Denormalized Columns

To validate the effectiveness and practicality of selective denormalization, we focused on a few key business insights that intentionally utilize the denormalized columns in distinct ways.

These insights were chosen to reflect real-world use cases, particularly analytical queries that would benefit from reduced joins and improved performance.

KPIs Derived:

  1. Total Sales by Customer Segment, Store Location, and Campaign

  2. Monthly Revenue by Product Category and Store Type

  3. Top 5 Campaigns by Total Revenue

Each KPI was derived twice, using the following approaches:

1. Using the Normalized Fact Table (fact_sales_normalized)

  • Required joining the fact table with relevant dimension tables (dim_products, dim_customers, dim_stores, etc.) to retrieve the necessary descriptive attributes.

  • This follows the standard normalized star schema pattern — optimal for write efficiency and data integrity, but potentially costly for complex reads.

2. Using the Denormalized Fact Table (fact_sales_denormalized)

  • Performed direct GROUP BY operations on the denormalized columns (like category, customer_segment, store_type, etc.) without any join.

  • This allows us to observe the performance gains and validate that the denormalized columns serve their purpose in real analytical workloads.

We’ve used Snowflake SQL for demonstration purposes because of its simplicity and readability.

While SQL syntax may vary slightly across Redshift, Azure Synapse, BigQuery, and Snowflake, the overall query logic remains largely the same — and should be easy to adapt across platforms.

SQL

-- Query 1: Total Sales by Customer Segment, Store Location, and Campaign
-- 1.A – Normalized Version

SELECT
  c.customer_segment,
  s.store_location,
  ca.campaign_name,
  SUM(f.total_amount) AS total_sales
FROM db_retail.analytics.fact_sales_normalized AS f
JOIN db_retail.analytics.dim_customers  AS c  ON f.customer_sk = c.customer_sk
JOIN db_retail.analytics.dim_stores     AS s  ON f.store_sk = s.store_sk
JOIN db_retail.analytics.dim_campaigns  AS ca ON f.campaign_sk = ca.campaign_sk
WHERE f.sales_date BETWEEN '2024-01-01' AND '2024-12-31'
GROUP BY c.customer_segment, s.store_location, ca.campaign_name
ORDER BY total_sales DESC;


-- Query 1: Total Sales by Customer Segment, Store Location, and Campaign
-- 1.B – Denormalized Version

SELECT
  customer_segment,
  store_location,
  campaign_name,
  SUM(total_amount) AS total_sales
FROM db_retail.analytics.fact_sales_denormalized
WHERE sales_date BETWEEN '2024-01-01' AND '2024-12-31'
GROUP BY customer_segment, store_location, campaign_name
ORDER BY total_sales DESC;

SQL

-- Query 2: Monthly Revenue by Product Category and Store Type
-- 2.A – Normalized Version

SELECT
  d.year,
  d.month,
  p.category,
  st.store_type,
  SUM(f.total_amount) AS monthly_revenue
FROM db_retail.analytics.fact_sales_normalized AS f
JOIN db_retail.analytics.dim_dates     AS d  ON CAST(f.sales_date AS DATE) = d.full_date
JOIN db_retail.analytics.dim_products AS p  ON f.product_sk = p.product_sk
JOIN db_retail.analytics.dim_stores   AS st ON f.store_sk   = st.store_sk
WHERE d.year = 2024
GROUP BY d.year, d.month, p.category, st.store_type
ORDER BY d.year, d.month, p.category, st.store_type;


-- Query 2: Monthly Revenue by Product Category and Store Type
-- 2.B – Denormalized Version

SELECT
  EXTRACT(YEAR  FROM sales_date) AS year,
  EXTRACT(MONTH FROM sales_date) AS month,
  category,
  store_type,
  SUM(total_amount) AS monthly_revenue
FROM db_retail.analytics.fact_sales_denormalized
WHERE EXTRACT(YEAR FROM sales_date) = 2024
GROUP BY year, month, category, store_type
ORDER BY year, month, category, store_type;

SQL

-- Query 3: Top 5 Campaigns by Total Revenue
-- 3.A – Normalized Version

SELECT
  ca.campaign_name,
  SUM(f.total_amount) AS campaign_revenue
FROM db_retail.analytics.fact_sales_normalized AS f
JOIN db_retail.analytics.dim_products  AS p  ON f.product_sk  = p.product_sk
JOIN db_retail.analytics.dim_campaigns AS ca ON f.campaign_sk = ca.campaign_sk
WHERE p.category = 'Electronics'
GROUP BY ca.campaign_name
ORDER BY campaign_revenue DESC
LIMIT 5;


-- Query 3: Top 5 Campaigns by Total Revenue
-- 3.B – Denormalized Version

SELECT
  campaign_name,
  SUM(total_amount) AS campaign_revenue
FROM db_retail.analytics.fact_sales_denormalized
WHERE category = 'Electronics'
GROUP BY campaign_name
ORDER BY campaign_revenue DESC
LIMIT 5;

Up next: We’ll walk through the cloud data warehouse setup for each platform and the performance comparisons observed across them.


The Cloud Data Warehouse Setup Process

This experiment was conducted using four of the most popular cloud-based data warehousing solutions (listed in no particular order):

  1. Amazon Redshift

  2. Microsoft Azure Synapse Analytics

  3. Google Cloud Platform (GCP) BigQuery

  4. Snowflake

    (Note: Snowflake is cloud-agnostic, so you’ll need to deploy it on either AWS, Azure, or GCP — depending on your preference.)

Each platform has its own strengths and trade-offs. Personally, I had prior hands-on experience with Amazon Redshift and Snowflake, so getting started with them was straightforward.

However, Azure Synapse and BigQuery were new to me — and experimenting with them turned out to be quite insightful. It helped me better understand not just the tooling, but also the design philosophy and ecosystem approach of each cloud provider.

Common Setup Flow Across Cloud Providers

While each cloud has its own services, all of them broadly require the following components to set up and run this kind of experiment:

  1. Object Storage

    To host and transfer your generated data files (CSV/Parquet, etc.) to the cloud environment.

  2. ETL or Data Ingestion Tool

    To load the data into the data warehouse. This could be native tools like COPY INTO for Redshift or external connectors like Airbyte.

  3. Data Warehouse Environment

    The actual database engine where you create tables, run queries, and benchmark performance.

  4. Identity and Access Management

    Make sure the data services have the necessary permissions to access the required assets and perform the required actions.

1. Storage Service

1.1 – AWS (Amazon Web Services)

  • Use Amazon S3 (Simple Storage Service) buckets for storing structured or semi-structured data such as CSV, Parquet, or JSON files.

  • Configure bucket policies or IAM roles for secure access.

1.2 – Microsoft Azure

  • Use Azure Blob Storage or Azure Data Lake Storage Gen2 (ADLS Gen2) for optimized analytics workloads.

  • When using ADLS Gen2, enable the Hierarchical Namespace feature to support directory-like structures and optimized analytics access.

  • Store data inside containers (equivalent to buckets in AWS/GCP terminology).

1.3 – Google Cloud Platform (GCP)

  • Use Google Cloud Storage (GCS) and create storage buckets for holding your datasets.

  • Choose the appropriate storage class (e.g., Standard, Nearline) based on access patterns.

2. Data Warehouse Service

2.1 – AWS (Amazon Web Services)

  • Use Amazon Redshift, a fully managed petabyte-scale data warehouse.

  • Deploy an RA3 node type, such as ra3.xlplus, which decouples compute and storage.

  • For testing, use a 1–2 node cluster.

2.2 – Microsoft Azure

  • Use Azure Synapse Analytics (formerly Azure SQL Data Warehouse).

  • Choose a Dedicated SQL Pool (formerly SQL DW) with a performance tier like DW100c (Gen2) for development and testing workloads.

  • Gen2 separates compute and storage and supports PolyBase and COPY INTO for data ingestion.

2.3 – Google Cloud Platform (GCP)

  • Use BigQuery, a serverless and fully managed data warehouse solution.

  • BigQuery uses an on-demand pricing model and autoscaling compute slots, so you don’t configure node sizes manually.

  • Supports both batch loading and external table queries directly over Cloud Storage.

3. Identity and Access Management (IAM)

AWS (Amazon Web Services)

  • Create an IAM Role with a trust relationship allowing Redshift to assume the role.

  • Attach the AmazonS3ReadOnlyAccess policy (or a custom policy) to allow read access to specific S3 buckets.

  • Associate the IAM role with your Amazon Redshift cluster using the aws_iam_role configuration.

  • In Redshift, run COPY commands using the IAM role to load data from S3 into Redshift tables.
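
For illustration, a hedged sketch of such a COPY command is shown below; the bucket path, role ARN, and region are placeholders for whatever you created in the steps above.

SQL

-- Example only: load the normalized fact table from S3 into Redshift
-- using the IAM role associated with the cluster.
COPY analytics.fact_sales_normalized
FROM 's3://my-retail-bucket/fact_sales_normalized/'
IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-s3-readonly'
FORMAT AS CSV
IGNOREHEADER 1
REGION 'us-east-1';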

Microsoft Azure

  • Create an Azure Storage Account and a container to store your data.

  • Create an Azure Synapse Analytics Workspace.

  • Within Synapse, enable and assign a System-Assigned Managed Identity or create a User-Assigned Managed Identity.

  • Grant Storage Blob Data Reader or Storage Blob Data Contributor role to the managed identity on the Storage Account (via RBAC).

  • Use PolyBase or COPY INTO in Synapse SQL to read data from the Storage Account.
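
A hedged sketch of the COPY INTO variant is shown below, assuming the managed identity already holds the Storage Blob Data Reader role; the storage account, container, and file names are placeholders.

SQL

-- Example only: load data into a dedicated SQL pool table via COPY INTO,
-- authenticating with the workspace's managed identity.
COPY INTO analytics.fact_sales_normalized
FROM 'https://myretailstorage.blob.core.windows.net/retail-data/fact_sales_normalized.csv'
WITH (
    FILE_TYPE = 'CSV',
    CREDENTIAL = (IDENTITY = 'Managed Identity'),
    FIRSTROW = 2
);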

Google Cloud Platform (GCP)

  • Create a GCP Project and set up Cloud Storage Buckets within the project.

  • Create a Service Account with the following IAM roles:

    • Storage Object Viewer – to read data from Cloud Storage

    • BigQuery Data Editor – to write into BigQuery tables

  • Enable the BigQuery API and Cloud Storage API for the project.

  • Use the Service Account key (for external access) or Workload Identity Federation (for secure access) to authenticate and run BigQuery LOAD DATA or EXTERNAL TABLE queries.
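
As a sketch, the SQL-based load into BigQuery can be as simple as the statement below; the dataset, table, and bucket names are placeholders.

SQL

-- Example only: batch-load a CSV from Cloud Storage into a BigQuery table.
LOAD DATA INTO analytics.fact_sales_normalized
FROM FILES (
  format = 'CSV',
  uris = ['gs://my-retail-bucket/fact_sales_normalized.csv'],
  skip_leading_rows = 1
);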

Snowflake

Compared to the other clouds, setting up Snowflake’s data access involves fewer steps, but each one requires careful attention.

  1. Create an integration object in Snowflake linked to the IAM role of your personal AWS account. This IAM role must have the necessary policies attached to access data stored in the S3 bucket.

  2. Describe the integration object in Snowflake to obtain two important values:

    • STORAGE_AWS_IAM_USER_ARN

    • STORAGE_AWS_EXTERNAL_ID

  3. Edit the Trust Relationship of the IAM role in your personal AWS account to include Snowflake’s AWS account, using the above values.
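
Here is a hedged sketch of steps 1 and 2 in Snowflake SQL; the role ARN and bucket are placeholders, and step 3 happens in the AWS console.

SQL

-- Step 1: create the integration object pointing at your IAM role and bucket.
CREATE OR REPLACE STORAGE INTEGRATION s3_retail_integration
  TYPE = EXTERNAL_STAGE
  STORAGE_PROVIDER = 'S3'
  ENABLED = TRUE
  STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::123456789012:role/snowflake-s3-access'
  STORAGE_ALLOWED_LOCATIONS = ('s3://my-retail-bucket/');

-- Step 2: read back the values needed for the IAM trust relationship.
DESC INTEGRATION s3_retail_integration;
-- Copy STORAGE_AWS_IAM_USER_ARN and STORAGE_AWS_EXTERNAL_ID into the
-- role's trust policy in your AWS account (step 3).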

Why these steps?

  • Steps 1 and 2 tell Snowflake which IAM role in your AWS account it should use, and surface the identifiers Snowflake will present when requesting access.

  • Step 3 makes sure your personal AWS account recognizes Snowflake’s AWS account and trusts it to assume the role for accessing the data.

In essence, this is a mutual trust setup between two separate AWS accounts — like two strangers shaking hands and acknowledging each other to establish a secure connection.


The Performance Benchmarks

For each cloud platform, the insights were derived by comparing the normalized and denormalized versions of the fact_sales table, and every query was executed twice.

The reason is that during the first execution, the query engine performs the full set of internal operations—such as scanning data blocks, filtering rows based on predicates, shuffling data across compute nodes, and executing joins or aggregations—all of which contribute to actual compute cost and performance.

However, in most modern cloud data warehouses, query results or intermediate execution plans are cached to some extent. Therefore, when the same query is executed again, the subsequent execution often benefits from query caching, result reuse, or accelerated IO paths, depending on the platform.

By executing the queries twice, we can observe the difference between cold execution performance (first run) and cached execution performance (second run), giving us insight into how much impact the caching mechanism has on query response times and resource utilization.
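
As a small aside, most platforms also let you limit caching if you want repeat runs to behave more like cold runs. On Snowflake, for example, result reuse can be switched off per session (the Subsequent Run numbers below were captured with caching left on):

SQL

-- Optional: disable result reuse so repeated runs don't return cached results.
ALTER SESSION SET USE_CACHED_RESULT = FALSE;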

Benchmarks #1 – AWS [ Amazon Redshift ]

First Run:

| Query | Normalized | Denormalized | Conclusion |
| --- | --- | --- | --- |
| Query 1: Total Sales by Customer Segment, Store Location, and Campaign | 7s 237ms | 5s 367ms | 😎 Faster |
| Query 2: Monthly Revenue by Product Category and Store Type | 4s 681ms | 5s 098ms | 🥲 Slower |
| Query 3: Top 5 Campaigns by Total Revenue | 6s 69ms | 4s 193ms | 😎 Faster |

Subsequent Run:

| Query | Normalized | Denormalized | Conclusion |
| --- | --- | --- | --- |
| Query 1: Total Sales by Customer Segment, Store Location, and Campaign | 11ms | 9ms | 😎 Faster |
| Query 2: Monthly Revenue by Product Category and Store Type | 16ms | 10ms | 😎 Faster |
| Query 3: Top 5 Campaigns by Total Revenue | 12ms | 9ms | 😎 Faster |

Benchmarks #2 – Azure [ Synapse ]

First Run:

| Query | Normalized | Denormalized | Conclusion |
| --- | --- | --- | --- |
| Query 1: Total Sales by Customer Segment, Store Location, and Campaign | 15s | 6s | 😎 Faster |
| Query 2: Monthly Revenue by Product Category and Store Type | 5s | 4s | 😎 Faster |
| Query 3: Top 5 Campaigns by Total Revenue | 2s | 7s | 🥲 Slower |

Subsequent Run:

| Query | Normalized | Denormalized | Conclusion |
| --- | --- | --- | --- |
| Query 1: Total Sales by Customer Segment, Store Location, and Campaign | 6s | 5s | 😎 Faster |
| Query 2: Monthly Revenue by Product Category and Store Type | 3s | 4s | 🥲 Slower |
| Query 3: Top 5 Campaigns by Total Revenue | 2s | 7s | 🥲 Slower |

Benchmarks #3 – GCP [ BigQuery ]

First Run:

| Query | Normalized | Denormalized | Conclusion |
| --- | --- | --- | --- |
| Query 1: Total Sales by Customer Segment, Store Location, and Campaign | 1000ms | 680ms | 😎 Faster |
| Query 2: Monthly Revenue by Product Category and Store Type | 787ms | 1000ms | 🥲 Slower |
| Query 3: Top 5 Campaigns by Total Revenue | 205ms | 146ms | 😎 Faster |

Subsequent Run:

| Query | Normalized | Denormalized | Conclusion |
| --- | --- | --- | --- |
| Query 1: Total Sales by Customer Segment, Store Location, and Campaign | - | - | 🤝 Draw |
| Query 2: Monthly Revenue by Product Category and Store Type | - | - | 🤝 Draw |
| Query 3: Top 5 Campaigns by Total Revenue | - | - | 🤝 Draw |

Benchmarks #4 – Snowflake

First Run:

| Query | Normalized | Denormalized | Conclusion |
| --- | --- | --- | --- |
| Query 1: Total Sales by Customer Segment, Store Location, and Campaign | 743ms | 322ms | 😎 Faster |
| Query 2: Monthly Revenue by Product Category and Store Type | 407ms | 290ms | 😎 Faster |
| Query 3: Top 5 Campaigns by Total Revenue | 120ms | 84ms | 😎 Faster |

Subsequent Run:

| Query | Normalized | Denormalized | Conclusion |
| --- | --- | --- | --- |
| Query 1: Total Sales by Customer Segment, Store Location, and Campaign | 44ms | 35ms | 😎 Faster |
| Query 2: Monthly Revenue by Product Category and Store Type | 25ms | 23ms | 😎 Faster |
| Query 3: Top 5 Campaigns by Total Revenue | 82ms | 32ms | 😎 Faster |

Conclusion

We can conclude that the strategy pays off most clearly on Amazon Redshift and Snowflake, where the denormalized queries were faster in almost every run.

GCP BigQuery performs decently, but the same cannot be said for Azure Synapse. What causes this unexpected difference in performance? That’s a great question — and could be the perfect topic for our next blog post.

This article took me nearly a month of research and experimentation, just to grasp the basics of each cloud platform and set up these benchmarks.

It wasn’t just about reading theory online — I actually ran these experiments on the cloud to capture accurate performance metrics. Otherwise, it would have been easy to unintentionally mislead you into thinking that selective denormalization works faster across all cloud data warehouses (which, as you can see, is not true once you look at Azure Synapse’s results).

I hope this article gives you enough insight to make informed data modeling decisions when the time comes. I truly enjoyed sharing this journey with you. Until next time — this is Charlie, signing off.

