Selective Denormalization Strategy: Why Go (Partially) in the Opposite Direction?


Everyone talks about normalization. But barely anyone talks about denormalization—let alone selective denormalization.
Back in college, while studying DBMS concepts, professors would emphasize how important—crucial, even—normalization is. And to be honest, they weren’t wrong. After all, that topic was a guaranteed guest in the exams.
But here’s the thing—they were only partially right.
Why? Because as you gain real-world experience, you begin to realize a simple but powerful truth:
Everything comes at a cost. And that cost can be paid in different currencies—whether it’s financial feasibility, query execution time, architectural complexity, system maintainability, or long-term sustainability.
At the end of the day, it’s all about trade-offs. What do you prioritize—and more importantly, why? That “why” becomes the driving force behind your decisions.
So, who should read this article?
Well, technically anyone. But if I had to be specific, here’s who would benefit the most:
Students who like to think beyond textbooks—the ones who challenge the classics.
Aspiring data professionals (especially data engineers).
Anyone with an interest in data warehouse architecture.
Curious minds exploring data modeling and dimensional modeling.
Engineers trying to improve performance in existing projects—this strategy might just give them a new perspective.
Let’s dive in.
Normalization vs. Denormalization: Through the Eyes of an Architect
🧩 Normalization
Let’s get this out of the way quickly. We won’t dwell too long on the theory—you’ve probably learned this in college or encountered it during system design. Still, here’s a quick refresher, plus some trustworthy resources if you’d like to brush up.
In short, normalization is the process of breaking down larger tables into smaller, well-structured ones to eliminate redundancy and maintain data integrity. This separation typically follows normal forms (1NF to 5NF) and reflects logical relationships between entities.
✅ Advantages of Normalization
Efficient Updates: Data lives in one place. Updating a customer’s name or product price? You change it once, and it reflects everywhere.
Lightweight Writes: Since each table focuses on minimal fields, write operations are fast and storage-efficient.
⚠️ Disadvantages of Normalization
Slower Analytical Reads: For analytical use cases, data often needs to be re-joined from multiple tables—this adds latency and compute cost.
Harder for Business Users: Analysts and business users might find normalized schemas too complex to query directly.
Expensive Joins: Over-normalization results in too many small tables. Querying across these with chained joins increases processing time and resource usage.
🏗️ Denormalization
Denormalization is the intentional act of introducing data redundancy into the schema. It involves combining data from multiple normalized tables—selectively—into one, to improve read performance and reduce join overhead.
That’s right: we do this on purpose, and with good reason.
But why?
Because we evaluate tradeoffs. And this strategy makes sense in the right contexts—especially in analytics and reporting-heavy systems where read performance matters more than update efficiency.
✅ Advantages of Denormalization
Faster Reads: Data needed for queries is consolidated—so joins are reduced or eliminated.
Efficient Aggregations & Filters: With all relevant data available in one row, operations like filtering, grouping, and slicing become much quicker.
Simpler Query Logic: Especially useful for business users and dashboard tools that benefit from flatter schemas.
⚠️ Disadvantages of Denormalization
Redundant Data Updates: Repeating the same information (e.g., product category or customer name) across many rows means updates are expensive and prone to errors (see the sketch after this list).
Data Inconsistency Risk: Without careful design or governance, duplicated values can drift and create inconsistencies.
Heavier Write Loads: Write operations carry more data, increasing I/O overhead and impacting ingestion speeds.
Over-Denormalization Pitfalls: Going too far leads to massive tables mixing multiple entities—hurting performance, making maintenance harder, and defeating the purpose of structured modeling.
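To make that update-cost contrast concrete, here’s a minimal sketch (the segment value and customer ID are illustrative, and the table names mirror the ones introduced later in this article): in a normalized schema the change touches a single dimension row, while in a denormalized table the same change has to be rewritten on every fact row that repeats the value.
SQL
-- Normalized: the segment lives in one place, so a single row changes.
UPDATE dim_customers
SET customer_segment = 'Premium'
WHERE customer_id = 'CUST-0042';

-- Denormalized: the segment is repeated on every sale, so potentially
-- thousands of fact rows must be rewritten to stay consistent.
UPDATE fact_sales_denormalized
SET customer_segment = 'Premium'
WHERE customer_sk = (
    SELECT customer_sk FROM dim_customers WHERE customer_id = 'CUST-0042'
);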
Up next: Selective Denormalization — the sweet spot between normalization and full denormalization. When done right, it can drastically improve performance without sacrificing structure. Let’s dive in. 🔍
What Do You Do 🧐 and How Do You Do It? 😉
It’s rarely a smart move to commit blindly to one strategy—fully normalized or fully denormalized. Instead, the real advantage comes from striking a balance: leveraging the best of both worlds.
That balance is what we call the Selective Denormalization Strategy.
As the name suggests, we intentionally introduce data duplication—but only for selected attributes. If we were to duplicate everything, it would be no different than creating a monolithic, fully denormalized table, which comes with serious drawbacks.
Below is a 3-step guideline to help you identify the right fields to denormalize:
1️⃣ Understand Query Patterns & Limit the Scope of Denormalization
You don’t need to denormalize every categorical attribute from every dimension table.
Focus only on attributes frequently used in filtering or grouping within analytical queries.
The decision on how many and which columns to denormalize is context-dependent: it depends on the schema, the analytical workload, and the business questions you're trying to answer.
Goal: Optimize for the most common and performance-critical queries, not for theoretical completeness.
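If you’re on Snowflake, one low-effort way to sample your workload is to skim recent query history for the fact table and note which attributes keep appearing in WHERE and GROUP BY clauses. A minimal sketch, assuming the standard SNOWFLAKE.ACCOUNT_USAGE share is available and that your fact table name contains "fact_sales":
SQL
-- Pull the most expensive recent queries touching the sales fact table,
-- then inspect which dimension attributes they filter or group on.
SELECT query_text,
       total_elapsed_time / 1000 AS elapsed_seconds
FROM snowflake.account_usage.query_history
WHERE start_time >= DATEADD(day, -30, CURRENT_TIMESTAMP())
  AND query_text ILIKE '%fact_sales%'
ORDER BY total_elapsed_time DESC
LIMIT 50;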
2️⃣ Avoid Denormalizing High-Cardinality or Volatile Data
Cardinality refers to the number of unique values in a column relative to the total number of rows.
Avoid denormalizing:
High-cardinality fields (e.g., customer email, transaction ID).
Highly volatile attributes (e.g., customer status, product stock level).
These are best kept in their respective dimension tables to avoid bloating the fact table and increasing update complexity.
Low-cardinality, low-volatility fields are typically great candidates for denormalization—especially for improving filtering and grouping performance.
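A quick way to sanity-check a candidate column is to measure its cardinality ratio directly. A minimal sketch against the customer dimension used later in this article (any dimension table and column can be swapped in):
SQL
-- Cardinality ratio = distinct values / total rows.
-- A handful of segments across 100,000 customers gives a tiny ratio,
-- which signals a good denormalization candidate; ratios near 1.0 do not.
SELECT
    COUNT(DISTINCT customer_segment)                   AS distinct_values,
    COUNT(*)                                           AS total_rows,
    COUNT(DISTINCT customer_segment) / COUNT(*)::FLOAT AS cardinality_ratio
FROM dim_customers;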
3️⃣ Distinguish Between Contextual and Descriptive-Categorical Values
Contextual-Categorical Values: These are tied to events, tend to be low in cardinality, and rarely change. Great candidates for denormalization.
- Example: store_location (if it indicates where the product was sold).
Descriptive-Categorical Values: These are tied to entities (like customer or product) and can change more frequently or be reclassified.
- Example: residential_address (tied to the customer entity and subject to change).
Why this matters: Descriptive values are more prone to data inconsistency when denormalized and can cause unnecessary duplication.
Real-World Example: Our fact_sales_denormalized Table
We selectively denormalized the following columns:
- customer_segment (from dim_customers)
- category (from dim_products)
- store_type, store_location (from dim_stores)
- campaign_name (from dim_campaigns)
At first glance, all of these might appear to be Descriptive-Categorical values—and they are. But here's a nuance worth understanding:
Let’s say our dim_customers also includes a residential_address.
In this case:
- residential_address would clearly remain a Descriptive-Categorical value — it’s tied directly to a customer (an entity).
- However, store_location can now be treated as a Contextual-Categorical value — because it represents the location where the sale occurred, an event-based context.
This distinction is subtle but crucial — and this kind of reasoning helps identify the fields worth denormalizing to optimize performance without sacrificing structure.
The Schema, Metadata, and Targeted Insights
Below is the schema for the synthetically generated dataset used in this experiment.
The design was created in Figma, with a strong emphasis on visual clarity—we used color-coding to differentiate between entities, relationships, and categories.
This not only improves readability but also makes it easier to understand the context and purpose of each element in the schema.
A well-structured schema should tell a story — about the data, its purpose, and how it's intended to be queried.
dim_campaigns [ 50 total records ]: campaign_sk, campaign_id, campaign_name, start_date_sk, end_date_sk, campaign_budget
dim_customers [ 100,000 total records ]: customer_sk, customer_id, first_name, last_name, email, residential_location, customer_segment
dim_dates [ 366 total records ]: full_date, date_sk, year, month, day, weekday, quarter
dim_products [ 100,000 total records ]: product_sk, product_id, product_name, category, brand, origin_location
dim_salespersons [ 2,000 total records ]: salesperson_sk, salesperson_id, salesperson_name, salesperson_role
dim_stores [ 500 total records ]: store_sk, store_id, store_name, store_type, store_location, store_manager_sk
fact_sales_normalized [ 1,000,000 total records ]: sales_sk, sales_id, customer_sk, product_sk, store_sk, salesperson_sk, campaign_sk, sales_date, total_amount
fact_sales_denormalized [ 1,000,000 total records ]: sales_sk, sales_id, customer_sk, product_sk, store_sk, salesperson_sk, campaign_sk, sales_date, total_amount, customer_segment, category, store_type, store_location, campaign_name
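For completeness, here is one way such a table could be materialized from the normalized star schema. This is a minimal CTAS sketch, not necessarily the exact pipeline used for this experiment:
SQL
-- Stitch the selected low-cardinality attributes onto the normalized fact rows.
CREATE OR REPLACE TABLE db_retail.analytics.fact_sales_denormalized AS
SELECT
    f.sales_sk,
    f.sales_id,
    f.customer_sk,
    f.product_sk,
    f.store_sk,
    f.salesperson_sk,
    f.campaign_sk,
    f.sales_date,
    f.total_amount,
    c.customer_segment,   -- from dim_customers
    p.category,           -- from dim_products
    s.store_type,         -- from dim_stores
    s.store_location,     -- from dim_stores
    ca.campaign_name      -- from dim_campaigns
FROM db_retail.analytics.fact_sales_normalized AS f
JOIN db_retail.analytics.dim_customers AS c  ON f.customer_sk = c.customer_sk
JOIN db_retail.analytics.dim_products  AS p  ON f.product_sk  = p.product_sk
JOIN db_retail.analytics.dim_stores    AS s  ON f.store_sk    = s.store_sk
JOIN db_retail.analytics.dim_campaigns AS ca ON f.campaign_sk = ca.campaign_sk;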
Deriving Insights — Validating the Denormalized Columns
To validate the effectiveness and practicality of selective denormalization, we focused on a few key business insights that intentionally utilize the denormalized columns in distinct ways.
These insights were chosen to reflect real-world use cases, particularly analytical queries that would benefit from reduced joins and improved performance.
KPIs Derived:
Total Sales by Customer Segment, Store Location, and Campaign (Query 1)
Monthly Revenue by Product Category and Store Type (Query 2)
Top 5 Campaigns by Total Revenue for Electronics (Query 3)
Each KPI was derived twice, using the following approaches:
1. Using the Normalized Fact Table (fact_sales_normalized)
Required joining the fact table with the relevant dimension tables (dim_products, dim_customers, dim_stores, etc.) to retrieve the necessary descriptive attributes.
This follows the standard normalized star schema pattern — optimal for write efficiency and data integrity, but potentially costly for complex reads.
2. Using the Denormalized Fact Table (fact_sales_denormalized)
Performed direct GROUP BY operations on the denormalized columns (like category, customer_segment, store_type, etc.) without any joins.
This allows us to observe the performance gains and validate that the denormalized columns serve their purpose in real analytical workloads.
We’ve used Snowflake SQL for demonstration purposes because of its simplicity and readability.
While SQL syntax may vary slightly across Redshift, Azure Synapse, BigQuery, and Snowflake, the overall query logic remains largely the same — and should be easy to adapt across platforms.
SQL
-- Query 1: Total Sales by Customer Segment, Store Location, and Campaign (2024)
-- 1.A – Normalized Version
SELECT
c.customer_segment,
s.store_location,
ca.campaign_name,
SUM(f.total_amount) AS total_sales
FROM db_retail.analytics.fact_sales_normalized AS f
JOIN db_retail.analytics.dim_customers AS c ON f.customer_sk = c.customer_sk
JOIN db_retail.analytics.dim_stores AS s ON f.store_sk = s.store_sk
JOIN db_retail.analytics.dim_campaigns AS ca ON f.campaign_sk = ca.campaign_sk
WHERE f.sales_date BETWEEN '2024-01-01' AND '2024-12-31'
GROUP BY c.customer_segment, s.store_location, ca.campaign_name
ORDER BY total_sales DESC;
-- Query 1: Total Sales by Customer Segment, Store Location, and Campaign (2024)
-- 1.B – Denormalized Version
SELECT
customer_segment,
store_location,
campaign_name,
SUM(total_amount) AS total_sales
FROM db_retail.analytics.fact_sales_denormalized
WHERE sales_date BETWEEN '2024-01-01' AND '2024-12-31'
GROUP BY customer_segment, store_location, campaign_name
ORDER BY total_sales DESC;
SQL
-- Query 2: Monthly Revenue by Product Category and Store Type (2024)
-- 2.A – Normalized Version
SELECT
d.year,
d.month,
p.category,
st.store_type,
SUM(f.total_amount) AS monthly_revenue
FROM db_retail.analytics.fact_sales_normalized AS f
JOIN db_retail.analytics.dim_dates AS d ON CAST(f.sales_date AS DATE) = d.full_date
JOIN db_retail.analytics.dim_products AS p ON f.product_sk = p.product_sk
JOIN db_retail.analytics.dim_stores AS st ON f.store_sk = st.store_sk
WHERE d.year = 2024
GROUP BY d.year, d.month, p.category, st.store_type
ORDER BY d.year, d.month, p.category, st.store_type;
-- Query 2: Monthly Revenue by Product Category and Store Type (2024)
-- 2.B – Denormalized Version
SELECT
EXTRACT(YEAR FROM sales_date) AS year,
EXTRACT(MONTH FROM sales_date) AS month,
category,
store_type,
SUM(total_amount) AS monthly_revenue
FROM db_retail.analytics.fact_sales_denormalized
WHERE EXTRACT(YEAR FROM sales_date) = 2024
GROUP BY year, month, category, store_type
ORDER BY year, month, category, store_type;
SQL
-- Query 3: Top 5 Campaigns by Total Revenue (Electronics)
-- 3.A – Normalized Version
SELECT
ca.campaign_name,
SUM(f.total_amount) AS campaign_revenue
FROM db_retail.analytics.fact_sales_normalized AS f
JOIN db_retail.analytics.dim_products AS p ON f.product_sk = p.product_sk
JOIN db_retail.analytics.dim_campaigns AS ca ON f.campaign_sk = ca.campaign_sk
WHERE p.category = 'Electronics'
GROUP BY ca.campaign_name
ORDER BY campaign_revenue DESC
LIMIT 5;
-- Query 3: Top 5 Campaigns by Total Revenue (Electronics)
-- 3.B – Denormalized Version
SELECT
campaign_name,
SUM(total_amount) AS campaign_revenue
FROM db_retail.analytics.fact_sales_denormalized
WHERE category = 'Electronics'
GROUP BY campaign_name
ORDER BY campaign_revenue DESC
LIMIT 5;
Up next: we’ll walk through the cloud data warehouse setup and the performance comparisons observed across platforms.
Cloud Data Warehouse Setup — The Process
This experiment was conducted using four of the most popular cloud-based data warehousing solutions (listed in no particular order):
Amazon Redshift
Microsoft Azure Synapse Analytics
Google Cloud Platform (GCP) BigQuery
Snowflake
(Note: Snowflake is cloud-agnostic, so you’ll need to deploy it on AWS, Azure, or GCP — depending on your preference.)
Each platform has its own strengths and trade-offs. Personally, I had prior hands-on experience with Amazon Redshift and Snowflake, so getting started with them was straightforward.
However, Azure Synapse and BigQuery were new to me — and experimenting with them turned out to be quite insightful. It helped me better understand not just the tooling, but also the design philosophy and ecosystem approach of each cloud provider.
Common Setup Flow Across Cloud Providers
While each cloud has its own services, all of them broadly require the following four components to set up and run this kind of experiment:
Object Storage
To host and transfer your generated data files (CSV/Parquet, etc.) to the cloud environment.
ETL or Data Ingestion Tool
To load the data into the data warehouse. This could be a native command such as COPY (Redshift) or COPY INTO (Snowflake, Synapse), or an external connector like Airbyte.
Data Warehouse Environment
The actual database engine where you create tables, run queries, and benchmark performance.
Identity and Access Management
Make sure the data services have the necessary permissions to access the relevant assets and perform the required actions.
1. Storage Service
1.1 – AWS (Amazon Web Services)
Use Amazon S3 (Simple Storage Service) buckets for storing structured or semi-structured data such as CSV, Parquet, or JSON files.
Configure bucket policies or IAM roles for secure access.
1.2 – Microsoft Azure
Use Azure Blob Storage or Azure Data Lake Storage Gen2 (ADLS Gen2) for optimized analytics workloads.
When using ADLS Gen2, enable the Hierarchical Namespace feature to support directory-like structures and optimized analytics access.
Store data inside containers (equivalent to buckets in AWS/GCP terminology).
1.3 – Google Cloud Platform (GCP)
Use Google Cloud Storage (GCS) and create storage buckets for holding your datasets.
Choose the appropriate storage class (e.g., Standard, Nearline) based on access patterns.
2. Data Warehouse Service
2.1 – AWS (Amazon Web Services)
Use Amazon Redshift, a fully managed petabyte-scale data warehouse.
Deploy an RA3 node type, such as ra3.xlplus, which decouples compute and storage.
For testing, use a 1–2 node cluster.
2.2 – Microsoft Azure
Use Azure Synapse Analytics (formerly Azure SQL Data Warehouse).
Choose a Dedicated SQL Pool (formerly SQL DW) with a performance tier like DW100c (Gen2) for development and testing workloads.
Gen2 provides separated compute and storage and supports PolyBase and COPY INTO for data ingestion.
2.3 – Google Cloud Platform (GCP)
Use BigQuery, a serverless and fully managed data warehouse solution.
BigQuery uses an on-demand pricing model and autoscaling compute slots, so you don’t configure node sizes manually.
Supports both batch loading and external table queries directly over Cloud Storage.
3. Identity and Access Management (IAM)
AWS (Amazon Web Services)
Create an IAM Role with a trust relationship allowing Redshift to assume the role.
Attach the AmazonS3ReadOnlyAccess policy (or a custom policy) to allow read access to specific S3 buckets.
Associate the IAM role with your Amazon Redshift cluster using the aws_iam_role configuration.
In Redshift, run COPY commands using the IAM role to load data from S3 into Redshift tables.
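For illustration, the load step might look like the sketch below (the bucket path and role ARN are placeholders):
SQL
-- Load the normalized fact table from S3 using the attached IAM role.
COPY analytics.fact_sales_normalized
FROM 's3://my-retail-bucket/fact_sales_normalized/'
IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-s3-readonly'
FORMAT AS CSV
IGNOREHEADER 1;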
Microsoft Azure
Create an Azure Storage Account and a container to store your data.
Create an Azure Synapse Analytics Workspace.
Within Synapse, enable and assign a System-Assigned Managed Identity or create a User-Assigned Managed Identity.
Grant Storage Blob Data Reader or Storage Blob Data Contributor role to the managed identity on the Storage Account (via RBAC).
Use PolyBase or COPY INTO in Synapse SQL to read data from the Storage Account.
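A minimal sketch of that COPY INTO pattern with a managed identity (the storage account, container, and path are placeholders):
SQL
-- Load data from Blob Storage / ADLS Gen2 using the workspace's managed identity.
COPY INTO analytics.fact_sales_normalized
FROM 'https://myretailstorage.blob.core.windows.net/retail-data/fact_sales_normalized/'
WITH (
    FILE_TYPE = 'CSV',
    CREDENTIAL = (IDENTITY = 'Managed Identity'),
    FIRSTROW = 2
);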
Google Cloud Platform (GCP)
Create a GCP Project and set up Cloud Storage Buckets within the project.
Create a Service Account with the following IAM roles:
- Storage Object Viewer – to read data from Cloud Storage
- BigQuery Data Editor – to write into BigQuery tables
Enable the BigQuery API and Cloud Storage API for the project.
Use a Service Account key (for external access) or Workload Identity Federation (for secure access) to authenticate and run BigQuery LOAD DATA or external table queries.
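A minimal sketch using BigQuery’s SQL LOAD DATA statement (the dataset, table, and gs:// path are placeholders); the same load can also be done through a load job or the bq CLI:
SQL
-- Batch-load CSV files from Cloud Storage into a BigQuery table.
LOAD DATA INTO retail_analytics.fact_sales_normalized
FROM FILES (
    format = 'CSV',
    skip_leading_rows = 1,
    uris = ['gs://my-retail-bucket/fact_sales_normalized/*.csv']
);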
Snowflake
Compared to other clouds, setting up Snowflake’s data access is shorter but requires careful attention.
Create an integration object in Snowflake linked to the IAM role of your personal AWS account. This IAM role must have the necessary policies attached to access data stored in the S3 bucket.
Describe the integration object in Snowflake to obtain two important values:
STORAGE_AWS_IAM_USER_ARN
STORAGE_AWS_EXTERNAL_ID
Edit the Trust Relationship of the IAM role in your personal AWS account to include Snowflake’s AWS account, using the above values.
Why these steps?
Step 1 ensures that Snowflake’s AWS account recognizes your personal AWS account and can request data access.
Step 2 makes sure your personal AWS account recognizes Snowflake’s AWS account and trusts it to assume the role for accessing the data.
In essence, this is a mutual trust setup between two separate AWS accounts — like two strangers shaking hands and acknowledging each other to establish a secure connection.
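Roughly, the Snowflake side of that handshake looks like the sketch below (the integration name, role ARN, and bucket are placeholders):
SQL
-- Step 1: create the integration object pointing at your IAM role and bucket.
CREATE OR REPLACE STORAGE INTEGRATION s3_retail_int
  TYPE = EXTERNAL_STAGE
  STORAGE_PROVIDER = 'S3'
  ENABLED = TRUE
  STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::123456789012:role/snowflake-s3-access'
  STORAGE_ALLOWED_LOCATIONS = ('s3://my-retail-bucket/');

-- Step 2: read STORAGE_AWS_IAM_USER_ARN and STORAGE_AWS_EXTERNAL_ID from the
-- output and add them to the IAM role's trust relationship in AWS.
DESC INTEGRATION s3_retail_int;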
The Performance Benchmarks
For each cloud platform, the insights were derived by comparing the normalized and denormalized versions of the fact_sales table, and each query was run twice.
The reason is that during the first execution, the query engine performs the full set of internal operations—such as scanning data blocks, filtering rows based on predicates, shuffling data across compute nodes, and executing joins or aggregations—all of which contribute to actual compute cost and performance.
However, in most modern cloud data warehouses, query results or intermediate execution plans are cached to some extent. Therefore, when the same query is executed again, the subsequent execution often benefits from query caching, result reuse, or accelerated IO paths, depending on the platform.
By executing the queries twice, we can observe the difference between cold execution performance (first run) and cached execution performance (second run), giving us insight into how much impact the caching mechanism has on query response times and resource utilization.
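If you want the “first run” numbers to stay honest on platforms that cache results, you can explicitly disable result reuse for the session. A small sketch for Snowflake and Redshift (BigQuery exposes this as a job or console option rather than a SQL setting):
SQL
-- Snowflake: turn off the result cache for the current session.
ALTER SESSION SET USE_CACHED_RESULT = FALSE;

-- Amazon Redshift: disable result caching for the current session.
SET enable_result_cache_for_session TO off;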
Benchmarks #1 – AWS [ Amazon Redshift ]
First Run:
Query No. | Normalized | Denormalized | Conclusion |
Query 1: Total Sales by Customer Segment, Store Location, and Campaign | 7s 237ms | 5s 367ms | 😎 Faster |
Query 2: Monthly Revenue by Product Category and Store Type | 4s 681ms | 5s 098ms | 🥲 Slower |
Query 3: Top 5 Campaigns by Total Revenue (Electronics) | 6s 69ms | 4s 193ms | 😎 Faster |
Subsequent Run:
Query No. | Normalized | Denormalized | Conclusion |
Query 1: Total Sales by Customer Segment, Store Location, and Campaign | 11ms | 9ms | 😎 Faster |
Query 2: Monthly Revenue by Product Category and Store Type | 16ms | 10ms | 😎 Faster |
Query 3: Top 5 Campaigns by Total Revenue (Electronics) | 12ms | 9ms | 😎 Faster |
Benchmarks #2 – Azure [ Synapse ]
First Run:
Query No. | Normalized | Denormalized | Conclusion |
Query 1: Total Sales by Customer Segment, Store Location, and Campaign | 15s | 6s | 😎 Faster |
Query 2: Monthly Revenue by Product Category and Store Type | 5s | 4s | 😎 Faster |
Query 3: Top 5 Campaigns by Total Revenue (Electronics) | 2s | 7s | 🥲 Slower |
Subsequent Run:
Query No. | Normalized | Denormalized | Conclusion |
Query 1: Total Sales by Customer Segment, Store Location, and Campaign | 6s | 5s | 😎 Faster |
Query 2: Monthly Revenue by Product Category and Store Type | 3s | 4s | 🥲 Slower |
Query 3: Top 5 Campaigns by Total Revenue (Electronics) | 2s | 7s | 🥲 Slower |
Benchmarks #3 – GCP [ BigQuery ]
First Run:
Query No. | Normalized | Denormalized | Conclusion |
Query 1: Total Sales by Customer Segment, Store Location, and Campaign | 1000ms | 680ms | 😎 Faster |
Query 2: Monthly Revenue by Product Category and Store Type | 787ms | 1000ms | 🥲 Slower |
Query 3: Top 5 Campaigns by Total Revenue (Electronics) | 205ms | 146ms | 😎 Faster |
Subsequent Run:
Query No. | Normalized | Denormalized | Conclusion |
Query 1: Total Sales by Customer Segment, Store Location, and Campaign | - | - | 🤝 Draw |
Query 2: Monthly Revenue by Product Category and Store Type | - | - | 🤝 Draw |
Query 3: Top 5 Campaigns by Total Revenue (Electronics) | - | - | 🤝 Draw |
Benchmarks #4 – Snowflake
First Run:
Query No. | Normalized | Denormalized | Conclusion |
Query 1: Total Sales by Customer Segment, Store Location, and Campaign | 743ms | 322ms | 😎 Faster |
Query 2: Monthly Revenue by Product Category and Store Type | 407ms | 290ms | 😎 Faster |
Query 3: Top 5 Campaigns by Total Revenue (Electronics) | 120ms | 84ms | 😎 Faster |
Subsequent Run:
Query No. | Normalized | Denormalized | Conclusion |
Query 1: Total Sales by Customer Segment, Store Location, and Campaign | 44ms | 35ms | 😎 Faster |
Query 2: Monthly Revenue by Product Category and Store Type | 25ms | 23ms | 😎 Faster |
Query 3: Top 5 Campaigns by Total Revenue (Electronics) | 82ms | 32ms | 😎 Faster |
Conclusion
We can conclude that Amazon Redshift and Snowflake handled this strategy really well: on those platforms, the denormalized queries were faster in nearly every run.
GCP BigQuery performs decently, but the same cannot be said for Azure Synapse. What causes this unexpected difference in performance? That’s a great question — and could be the perfect topic for our next blog post.
This article took me nearly a month to research and experiment with, just to grasp the basics of each cloud platform and to run these hands-on experiments.
It wasn’t just about reading theory online — I actually ran these experiments on the cloud to capture accurate performance metrics. Otherwise, it would have been easy to unintentionally mislead you into thinking that selective denormalization works faster across all cloud data warehouses (which, as you can see, is not true once you look at Azure Synapse’s results).
I hope this article gives you enough insight to make informed data modeling decisions when the time comes. I truly enjoyed sharing this journey with you. Until next time — this is Charlie, signing off.