Implement Medallion architecture with SCD Type 2 in Microsoft Fabric - Part 2
In the first article of the series on Medallion architecture in Fabric we looked into the process of load, cleanse and transfer of data from the bronze layer to the Silver layer.
In this blog post we would move further and delve into the process of moving data from Silver layer to the Gold layer.
The data load will be into a dimensional model organized in a star schema structure in the Gold layer. Star schema is dimensional modelling design where you have the fact table in the center with the dimension tables located at the edges of the star.
Image credit : Microsoft
SetUp
To get started we first need to create the underlying schemas for the dimension data and the fact data.
We would create four dimension tables and a single fact table. The dimension tables are Dim_Category
, Dim_Customer
, Dim_Product
and Dim_Supplier
and the fact table being Fact_Sales
.
CREATE TABLE [dbo].[Dim_Category](
[CategoryId] [int] NOT NULL,
[Category] [varchar](200) NOT NULL,
[DateInserted] [datetime2](0) NOT NULL
)
ALTER TABLE Dim_Category ADD CONSTRAINT PK_Dim_Category PRIMARY KEY NONCLUSTERED (CategoryId) NOT ENFORCED;
GO
CREATE TABLE [dbo].[Dim_Customer](
[CustomerId] [int] NOT NULL,
[CustomerName] [varchar](200) NOT NULL,
[CustomerRegion] [varchar](200) NOT NULL,
[CustomerEmail] [varchar](200) NOT NULL,
[DateInserted] [datetime2](0) NOT NULL
)
ALTER TABLE Dim_Customer ADD CONSTRAINT PK_Dim_Customery PRIMARY KEY NONCLUSTERED (CustomerId) NOT ENFORCED;
GO
CREATE TABLE [dbo].[Dim_Product](
[ProductId] [int] NOT NULL,
[Product] [varchar](200) NOT NULL,
[DateInserted] [datetime2](0) NOT NULL
)
ALTER TABLE Dim_Product ADD CONSTRAINT PK_Dim_Product PRIMARY KEY NONCLUSTERED (ProductId) NOT ENFORCED;
GO
CREATE TABLE [dbo].[Dim_Supplier](
[SupplierId] [int] NOT NULL,
[Supplier] [varchar](200) NOT NULL,
[DateInserted] [datetime2](0) NULL
)
ALTER TABLE Dim_Supplier ADD CONSTRAINT PK_Dim_Supplier PRIMARY KEY NONCLUSTERED (SupplierId) NOT ENFORCED;
GO
CREATE TABLE [dbo].[Fact_Sales](
[OrderNumber] [int] NOT NULL,
[CustomerId] [int] NOT NULL,
[ProductId] [int] NOT NULL,
[CategoryId] [int] NOT NULL,
[SupplierId] [int] NOT NULL,
[Date] [int] NOT NULL,
[Quantity] [int] NOT NULL,
[UnitPrice] [decimal](10, 0) NOT NULL,
[TotalSales] [int] NOT NULL,
[DateInserted] [datetime2](6) NOT NULL
)
GO
ALTER TABLE Fact_Sales ADD CONSTRAINT PK_Fact_Sales PRIMARY KEY NONCLUSTERED (OrderNumber) ;
GO
ALTER TABLE Fact_Sales ADD CONSTRAINT FK_CustomerId FOREIGN KEY (CustomerId) REFERENCES Dim_Customer (CustomerId) NOT ENFORCED;
GO
ALTER TABLE Fact_Sales ADD CONSTRAINT FK_ProductId FOREIGN KEY (ProductId) REFERENCES Dim_Product (ProductId) NOT ENFORCED;
GO
ALTER TABLE Fact_Sales ADD CONSTRAINT FK_CategoryId FOREIGN KEY (CategoryId) REFERENCES Dim_Category (CategoryId) NOT ENFORCED;
GO
ALTER TABLE Fact_Sales ADD CONSTRAINT FK_SupplierId FOREIGN KEY (SupplierId) REFERENCES Dim_Supplier (SupplierId) NOT ENFORCED;
GO
This is how the star schema dimensional model should look post creation of the tables.
Please note that at the time of writing this article Fabric data warehouse do not support identity keys that eventually could be used as surrogate keys. So we would have create customized process to replicate the surrogate key behavior and manage them through SQL queries as part of the load process.
Also note that PRIMARY KEYS
are only supported when NONCLUSTERED
and NOT ENFORCED
are both used while FOREIGN KEY
is only supported when NOT ENFORCED
is used. DEFAULT CONSTRAINTS
are also not supported.
And you cannot use PRIMARY/UNIQUE KEY
keyword in the CREATE TABLE
statement.
For example, trying to create a table and define a primary key on it through a single SQL statement will error out
CREATE TABLE [dbo].[Dim_Category](
[CategoryId] [int] NULL,
[Category] [varchar](200) NULL,
[DateInserted] [datetime2](0) NULL,
CONSTRAINT [UQ_Dim_Category] UNIQUE NONCLUSTERED
(
[CategoryId] ASC
)
)
GO
Instead the CREATE TABLE
statement has to be modified where the primary key is created through the ALTER TABLE
statement.
CREATE TABLE [dbo].[Dim_Category](
[CategoryId] [int] NOT NULL,
[Category] [varchar](200) NOT NULL,
[DateInserted] [datetime2](0) NOT NULL
)
ALTER TABLE Dim_Category ADD CONSTRAINT PK_Dim_Category PRIMARY KEY NONCLUSTERED (CategoryId) NOT ENFORCED;
GO
Same is the case when creating FOREIGN KEYS
. They have to be created through a separate CREATE
statement.
The following statement creates a FOREIGN KEY
relationship on the Fact_Sales
table that references the column CustomerId
in the Dim_Customer
table.
ALTER TABLE Fact_Sales ADD CONSTRAINT FK_CustomerId FOREIGN KEY (CustomerId) REFERENCES Dim_Customer (CustomerId) NOT ENFORCED;
GO
The Code
Now that we have the schema ready its to time to create stored procedures to load data from the Silver layer to the data warehouse in the Gold layer.
Below are the Stored Procedure scripts that loads the Dimension data from the Silver layer Lakehouse into the corresponding Dimensional tables in the Gold layer.
CREATE PROCEDURE [dbo].[Load_Dim_Category]
AS
BEGIN
DECLARE @Max_Category_Id INT = (SELECT MAX(CategoryId) FROM Dim_Category)
IF @Max_Category_Id IS NULL
INSERT INTO Dim_Category (CategoryId, Category, DateInserted)
SELECT
ROW_NUMBER() OVER (ORDER BY category)
,category
,GETDATE()
FROM Silver_Layer.[dbo].[Category]
ELSE
SELECT
@Max_Category_Id + 1
,category
,GETDATE()
FROM [Silver_Layer].[dbo].[Category] C1
WHERE NOT EXISTS (SELECT
1
FROM dbo.Dim_Category C2
WHERE C1.category = C2.Category)
END
GO
CREATE PROCEDURE [dbo].[Load_Dim_Customer]
AS
BEGIN
DECLARE @Max_Customer_Id INT = (SELECT
MAX(CustomerId)
FROM Dim_Customer)
IF @Max_Customer_Id IS NULL
INSERT INTO Dim_Customer (CustomerId, CustomerName, CustomerRegion, CustomerEmail, DateInserted)
SELECT
ROW_NUMBER() OVER (ORDER BY CustomerName, CustomerEmail)
,CustomerName
,CustomerRegion
,CustomerEmail
,GETDATE()
FROM Silver_Layer.[dbo].[Customer]
ELSE
INSERT INTO Dim_Customer (CustomerId, CustomerName, CustomerRegion, CustomerEmail, DateInserted)
SELECT
@Max_Customer_Id+1
,CustomerName
,CustomerRegion
,CustomerEmail
,GETDATE()
FROM [Silver_Layer].[dbo].[Customer] C1
WHERE NOT EXISTS (SELECT
1
FROM dbo.Dim_Customer C2
WHERE C1.[CustomerName] = C2.[CustomerName] and C1.CustomerEmail=C2.CustomerEmail)
END
UPDATE C1 SET
C1.CustomerRegion = C2.CustomerRegion,
C1.CustomerEmail = C2.CustomerEmail
from Dim_Customer C1 inner join Silver_Layer.[dbo].[Customer] C2 on C1.[CustomerName]=C2.[CustomerName]
WHERE (C1.CustomerEmail!=C2.CustomerEmail OR C1.CustomerRegion!=C2.CustomerRegion)
GO
CREATE PROCEDURE [dbo].[Load_Dim_Product]
AS
BEGIN
DECLARE @Max_Product_Id INT = (SELECT
MAX(ProductId)
FROM Dim_Product)
IF @Max_Product_Id IS NULL
INSERT INTO Dim_Product (ProductId, Product, DateInserted)
SELECT
ROW_NUMBER() OVER (ORDER BY product)
,[product]
,GETDATE()
FROM Silver_Layer.[dbo].[Product]
ELSE
INSERT INTO Dim_Product (ProductId, Product, DateInserted)
SELECT
@Max_Product_Id+1
,product
,GETDATE()
FROM [Silver_Layer].[dbo].[Product] P1
WHERE NOT EXISTS (SELECT
1
FROM dbo.Dim_Product P2
WHERE P1.[product] = P2.[Product])
END
GO
CREATE PROCEDURE [dbo].[Load_Dim_Supplier]
AS
BEGIN
DECLARE @Max_Supplier_Id INT = (SELECT
MAX(SupplierId)
FROM Dim_Supplier)
IF @Max_Supplier_Id IS NULL
INSERT INTO Dim_Supplier (SupplierId, Supplier, DateInserted)
SELECT
ROW_NUMBER() OVER (ORDER BY supplier)
,[supplier]
,GETDATE()
FROM Silver_Layer.[dbo].[Supplier]
ELSE
INSERT INTO Dim_Supplier (SupplierId, Supplier, DateInserted)
SELECT
@Max_Supplier_Id + 1
,supplier
,GETDATE()
FROM [Silver_Layer].[dbo].[Supplier] P1
WHERE NOT EXISTS (SELECT
1
FROM dbo.Dim_Supplier P2
WHERE P1.[supplier] = P2.[Supplier])
END
GO
The logic is fundamentally the similar across all the Dimension table loads.
Identify the type of load,first or incremental load :
DECLARE @Your_max_Id INT = (SELECT MAX(ColumnId) FROM Dim_Table)
IF @Your_max_Id IS NULL Then
Its a first load
else
Its a incremental load
The logic for updating Dim_Customers
is slightly different because there are several non-key attributes that may change such as a customer’s email or customer region. The Id attributes in the dimension tables are incremented based on the most recent value of the id present in the table.
Stored Procedure script to handle Fact_Sales
table load.
CREATE PROCEDURE [dbo].[Load_Fact_Sales]
AS
INSERT INTO Fact_Sales
SELECT
S.OrderNumber
,CU.CustomerId
,P.ProductId
,C.CategoryId
,SP.SupplierId
,S.Date
,S.Quantity
,S.UnitPrice
,S.TotalSales
,GETDATE()
FROM Silver_Layer.[dbo].[Sales] S
INNER JOIN [dbo].[Dim_Product] P
ON S.Product = P.Product
INNER JOIN [dbo].[Dim_Category] C
ON S.Category = C.Category
INNER JOIN [dbo].[Dim_Supplier] SP
ON S.Supplier = SP.Supplier
INNER JOIN [dbo].[Dim_Customer] CU
ON S.CustomerName = CU.CustomerName
AND S.CustomerEmail = CU.CustomerEmail
WHERE NOT EXISTS(SELECT 1 FROM Fact_Sales FS WHERE FS.OrderNumber=S.OrderNumber)
UPDATE FS SET
FS.CustomerId=CU.CustomerId,
FS.ProductId = P.ProductId,
FS.CategoryId = C.CategoryId,
FS.SupplierId=SP.SupplierId,
FS.Date=S.Date,
FS.Quantity=S.Quantity,
FS.UnitPrice=S.UnitPrice,
FS.TotalSales=S.TotalSales,
FS.DateInserted=GETDATE()
FROM Silver_Layer.[dbo].[Sales] S
INNER JOIN Fact_Sales FS on FS.OrderNumber=S.OrderNumber
INNER JOIN [dbo].[Dim_Product] P
ON S.Product = P.Product
INNER JOIN [dbo].[Dim_Category] C
ON S.Category = C.Category
INNER JOIN [dbo].[Dim_Supplier] SP
ON S.Supplier = SP.Supplier
INNER JOIN [dbo].[Dim_Customer] CU
ON S.CustomerName = CU.CustomerName
AND S.CustomerEmail = CU.CustomerEmail
WHERE FS.OrderNumber=S.OrderNumber and
( S.Product != P.Product
or S.Category != C.Category
or S.Supplier != SP.Supplier
or S.CustomerName != CU.CustomerName
or FS.Date != S.Date
or FS.Quantity != S.Quantity
or FS.UnitPrice != S.UnitPrice
or FS.TotalSales != S.TotalSales)
GO
The above procedure would map the corresponding id’s from the dimension table to the fact sales table and handle the load through an Upsert operation.
Once all objects are created and defined in the database, in the next step create Data pipelines to execute the entire end to end process.
The first step in the pipeline run is the notebook that would cleanse, format, and dedup the data in the Bronze layer before moving it to the Silver layer. This is followed by loading data into the dimension tables in the data warehouse's Gold layer and finally loading data into the Fact_sales
table.
The data in the Fact_sales
table should look something like this.
Once done we are good to go ahead with the pipeline execution.
This dimension model can be used as base source to your semantic model.
Conclusion
This concludes the end to end implementation of Medallion architecture. I hope this two part series would help you to get started on the Medallion architecture implementation and customized the process based on the business needs and design.
Thanks for reading !!!
Subscribe to my newsletter
Read articles from Sachin Nandanwar directly inside your inbox. Subscribe to the newsletter, and don't miss out.
Written by