Implement Medallion architecture with SCD Type 2 in Microsoft Fabric - Part 2

In the first article of the series on Medallion architecture in Fabric we looked into the process of load, cleanse and transfer of data from the bronze layer to the Silver layer.
In this blog post we would move further and delve into the process of moving data from Silver layer to the Gold layer.

The data load will be into a dimensional model organized in a star schema structure in the Gold layer. Star schema is dimensional modelling design where you have the fact table in the center with the dimension tables located at the edges of the star.

Diagram shows an illustration of a star schema for sales facts. There are five dimensions, each located at a point of the star.

Image credit : Microsoft

SetUp

To get started we first need to create the underlying schemas for the dimension data and the fact data.

We would create four dimension tables and a single fact table. The dimension tables are Dim_Category, Dim_Customer, Dim_Product and Dim_Supplier and the fact table being Fact_Sales.


CREATE TABLE [dbo].[Dim_Category](
    [CategoryId] [int] NOT NULL,
    [Category] [varchar](200) NOT NULL,
    [DateInserted] [datetime2](0) NOT NULL
    )
ALTER TABLE Dim_Category ADD CONSTRAINT PK_Dim_Category PRIMARY KEY NONCLUSTERED (CategoryId) NOT ENFORCED;
GO

CREATE TABLE [dbo].[Dim_Customer](
    [CustomerId] [int] NOT NULL,
    [CustomerName] [varchar](200) NOT NULL,
    [CustomerRegion] [varchar](200) NOT NULL,
    [CustomerEmail] [varchar](200) NOT NULL,
    [DateInserted] [datetime2](0) NOT NULL
    )
ALTER TABLE Dim_Customer ADD CONSTRAINT PK_Dim_Customery PRIMARY KEY NONCLUSTERED (CustomerId) NOT ENFORCED;
GO

CREATE TABLE [dbo].[Dim_Product](
    [ProductId] [int] NOT NULL,
    [Product] [varchar](200) NOT NULL,
    [DateInserted] [datetime2](0) NOT NULL
    )
ALTER TABLE Dim_Product ADD CONSTRAINT PK_Dim_Product PRIMARY KEY NONCLUSTERED (ProductId) NOT ENFORCED;
GO

CREATE TABLE [dbo].[Dim_Supplier](
    [SupplierId] [int] NOT NULL,
    [Supplier] [varchar](200) NOT NULL,
    [DateInserted] [datetime2](0) NULL
    )
ALTER TABLE Dim_Supplier ADD CONSTRAINT PK_Dim_Supplier PRIMARY KEY NONCLUSTERED (SupplierId) NOT ENFORCED;
GO

CREATE TABLE [dbo].[Fact_Sales](
    [OrderNumber] [int] NOT NULL,
    [CustomerId] [int] NOT NULL,
    [ProductId] [int] NOT NULL,
    [CategoryId] [int] NOT NULL,
    [SupplierId] [int] NOT NULL,
    [Date] [int] NOT NULL,
    [Quantity] [int] NOT NULL,
    [UnitPrice] [decimal](10, 0) NOT NULL,
    [TotalSales] [int] NOT NULL,
    [DateInserted] [datetime2](6) NOT NULL
    )
GO
ALTER TABLE Fact_Sales ADD CONSTRAINT PK_Fact_Sales PRIMARY KEY NONCLUSTERED (OrderNumber) ;
GO
ALTER TABLE Fact_Sales ADD CONSTRAINT FK_CustomerId FOREIGN KEY (CustomerId) REFERENCES Dim_Customer (CustomerId) NOT ENFORCED;
GO
ALTER TABLE Fact_Sales ADD CONSTRAINT FK_ProductId FOREIGN KEY (ProductId) REFERENCES Dim_Product (ProductId) NOT ENFORCED;
GO
ALTER TABLE Fact_Sales ADD CONSTRAINT FK_CategoryId FOREIGN KEY (CategoryId) REFERENCES Dim_Category (CategoryId) NOT ENFORCED;
GO
ALTER TABLE Fact_Sales ADD CONSTRAINT FK_SupplierId FOREIGN KEY (SupplierId) REFERENCES Dim_Supplier (SupplierId) NOT ENFORCED;
GO

This is how the star schema dimensional model should look post creation of the tables.

Please note that at the time of writing this article Fabric data warehouse do not support identity keys that eventually could be used as surrogate keys. So we would have create customized process to replicate the surrogate key behavior and manage them through SQL queries as part of the load process.

Also note that PRIMARY KEYS are only supported when NONCLUSTERED and NOT ENFORCED are both used while FOREIGN KEY is only supported when NOT ENFORCED is used. DEFAULT CONSTRAINTS are also not supported.

And you cannot use PRIMARY/UNIQUE KEY keyword in the CREATE TABLE statement.

For example, trying to create a table and define a primary key on it through a single SQL statement will error out

CREATE TABLE [dbo].[Dim_Category](
    [CategoryId] [int] NULL,
    [Category] [varchar](200) NULL,
    [DateInserted] [datetime2](0) NULL,
 CONSTRAINT [UQ_Dim_Category] UNIQUE NONCLUSTERED 
        (
            [CategoryId] ASC
        )
)
GO

Instead the CREATE TABLE statement has to be modified where the primary key is created through the ALTER TABLE statement.

CREATE TABLE [dbo].[Dim_Category](
    [CategoryId] [int] NOT NULL,
    [Category] [varchar](200) NOT NULL,
    [DateInserted] [datetime2](0) NOT NULL
    )
ALTER TABLE Dim_Category ADD CONSTRAINT PK_Dim_Category PRIMARY KEY NONCLUSTERED (CategoryId) NOT ENFORCED;
GO

Same is the case when creating FOREIGN KEYS. They have to be created through a separate CREATE statement.

The following statement creates a FOREIGN KEY relationship on the Fact_Sales table that references the column CustomerId in the Dim_Customer table.

ALTER TABLE Fact_Sales ADD CONSTRAINT FK_CustomerId FOREIGN KEY (CustomerId) REFERENCES Dim_Customer (CustomerId) NOT ENFORCED;
GO

The Code

Now that we have the schema ready its to time to create stored procedures to load data from the Silver layer to the data warehouse in the Gold layer.

Below are the Stored Procedure scripts that loads the Dimension data from the Silver layer Lakehouse into the corresponding Dimensional tables in the Gold layer.

CREATE PROCEDURE [dbo].[Load_Dim_Category]
AS
BEGIN

  DECLARE @Max_Category_Id INT = (SELECT MAX(CategoryId) FROM Dim_Category)


  IF @Max_Category_Id IS NULL
    INSERT INTO Dim_Category (CategoryId, Category, DateInserted)
      SELECT
        ROW_NUMBER() OVER (ORDER BY category)
       ,category
       ,GETDATE()
      FROM Silver_Layer.[dbo].[Category]
  ELSE
    SELECT
      @Max_Category_Id + 1
     ,category
     ,GETDATE()
    FROM [Silver_Layer].[dbo].[Category] C1
    WHERE NOT EXISTS (SELECT
        1
      FROM dbo.Dim_Category C2
      WHERE C1.category = C2.Category)
END

GO

CREATE  PROCEDURE [dbo].[Load_Dim_Customer]
AS
BEGIN

  DECLARE @Max_Customer_Id INT = (SELECT
      MAX(CustomerId)
    FROM Dim_Customer)

  IF @Max_Customer_Id IS NULL
    INSERT INTO Dim_Customer (CustomerId, CustomerName, CustomerRegion, CustomerEmail, DateInserted)
      SELECT
        ROW_NUMBER() OVER (ORDER BY CustomerName, CustomerEmail)
       ,CustomerName
       ,CustomerRegion
       ,CustomerEmail
       ,GETDATE()
      FROM Silver_Layer.[dbo].[Customer]
  ELSE
   INSERT INTO Dim_Customer (CustomerId, CustomerName, CustomerRegion, CustomerEmail, DateInserted)
    SELECT
      @Max_Customer_Id+1
     ,CustomerName
     ,CustomerRegion
     ,CustomerEmail
     ,GETDATE()
    FROM [Silver_Layer].[dbo].[Customer] C1
    WHERE NOT EXISTS (SELECT
        1
      FROM dbo.Dim_Customer C2
      WHERE C1.[CustomerName] = C2.[CustomerName] and C1.CustomerEmail=C2.CustomerEmail)

END

UPDATE C1 SET
C1.CustomerRegion = C2.CustomerRegion,
C1.CustomerEmail = C2.CustomerEmail

from Dim_Customer C1 inner join Silver_Layer.[dbo].[Customer] C2 on C1.[CustomerName]=C2.[CustomerName]
WHERE (C1.CustomerEmail!=C2.CustomerEmail OR C1.CustomerRegion!=C2.CustomerRegion)

GO

CREATE  PROCEDURE [dbo].[Load_Dim_Product]
AS
BEGIN

  DECLARE @Max_Product_Id INT = (SELECT
      MAX(ProductId)
    FROM Dim_Product)

  IF @Max_Product_Id IS NULL
    INSERT INTO Dim_Product (ProductId, Product, DateInserted)
      SELECT
        ROW_NUMBER() OVER (ORDER BY product)
       ,[product]
       ,GETDATE()
      FROM Silver_Layer.[dbo].[Product]
  ELSE
    INSERT INTO Dim_Product (ProductId, Product, DateInserted)
    SELECT
     @Max_Product_Id+1
     ,product
     ,GETDATE()
    FROM [Silver_Layer].[dbo].[Product] P1
    WHERE NOT EXISTS (SELECT
        1
      FROM dbo.Dim_Product P2
      WHERE P1.[product] = P2.[Product])
END

GO

CREATE PROCEDURE [dbo].[Load_Dim_Supplier]
AS
BEGIN

  DECLARE @Max_Supplier_Id INT = (SELECT
      MAX(SupplierId)
    FROM Dim_Supplier)


  IF @Max_Supplier_Id IS NULL
    INSERT INTO Dim_Supplier (SupplierId, Supplier, DateInserted)
      SELECT
        ROW_NUMBER() OVER (ORDER BY supplier)
       ,[supplier]
       ,GETDATE()
      FROM Silver_Layer.[dbo].[Supplier]
  ELSE
    INSERT INTO Dim_Supplier (SupplierId, Supplier, DateInserted)
    SELECT
      @Max_Supplier_Id + 1
     ,supplier
     ,GETDATE()
    FROM [Silver_Layer].[dbo].[Supplier] P1
    WHERE NOT EXISTS (SELECT
        1
      FROM dbo.Dim_Supplier P2
      WHERE P1.[supplier] = P2.[Supplier])
END
GO

The logic is fundamentally the similar across all the Dimension table loads.

Identify the type of load,first or incremental load :

 DECLARE @Your_max_Id INT = (SELECT MAX(ColumnId) FROM Dim_Table)

  IF @Your_max_Id IS NULL Then

   Its a first load
   else
   Its a incremental load

The logic for updating Dim_Customers is slightly different because there are several non-key attributes that may change such as a customer’s email or customer region. The Id attributes in the dimension tables are incremented based on the most recent value of the id present in the table.

Stored Procedure script to handle Fact_Sales table load.

CREATE  PROCEDURE [dbo].[Load_Fact_Sales]
AS

  INSERT INTO Fact_Sales
    SELECT
      S.OrderNumber
     ,CU.CustomerId
     ,P.ProductId
     ,C.CategoryId
     ,SP.SupplierId
     ,S.Date
     ,S.Quantity
     ,S.UnitPrice
     ,S.TotalSales
     ,GETDATE()
    FROM Silver_Layer.[dbo].[Sales] S
    INNER JOIN [dbo].[Dim_Product] P
      ON S.Product = P.Product
    INNER JOIN [dbo].[Dim_Category] C
      ON S.Category = C.Category
    INNER JOIN [dbo].[Dim_Supplier] SP
      ON S.Supplier = SP.Supplier
    INNER JOIN [dbo].[Dim_Customer] CU
      ON S.CustomerName = CU.CustomerName
     AND S.CustomerEmail = CU.CustomerEmail         
     WHERE NOT EXISTS(SELECT 1 FROM Fact_Sales FS WHERE FS.OrderNumber=S.OrderNumber)

UPDATE FS SET
FS.CustomerId=CU.CustomerId,
FS.ProductId = P.ProductId,
FS.CategoryId = C.CategoryId,
FS.SupplierId=SP.SupplierId,
FS.Date=S.Date,
FS.Quantity=S.Quantity,
FS.UnitPrice=S.UnitPrice,
FS.TotalSales=S.TotalSales,
FS.DateInserted=GETDATE()
 FROM Silver_Layer.[dbo].[Sales] S 
 INNER JOIN Fact_Sales FS on FS.OrderNumber=S.OrderNumber
  INNER JOIN [dbo].[Dim_Product] P
      ON S.Product = P.Product
    INNER JOIN [dbo].[Dim_Category] C
      ON S.Category = C.Category
    INNER JOIN [dbo].[Dim_Supplier] SP
      ON S.Supplier = SP.Supplier
    INNER JOIN [dbo].[Dim_Customer] CU
      ON S.CustomerName = CU.CustomerName
        AND S.CustomerEmail = CU.CustomerEmail
WHERE FS.OrderNumber=S.OrderNumber and
    ( S.Product != P.Product
    or S.Category != C.Category
    or S.Supplier != SP.Supplier
    or S.CustomerName != CU.CustomerName
    or FS.Date != S.Date
    or FS.Quantity != S.Quantity
    or FS.UnitPrice != S.UnitPrice
    or FS.TotalSales != S.TotalSales)
GO

The above procedure would map the corresponding id’s from the dimension table to the fact sales table and handle the load through an Upsert operation.

Once all objects are created and defined in the database, in the next step create Data pipelines to execute the entire end to end process.

The first step in the pipeline run is the notebook that would cleanse, format, and dedup the data in the Bronze layer before moving it to the Silver layer. This is followed by loading data into the dimension tables in the data warehouse's Gold layer and finally loading data into the Fact_sales table.

The data in the Fact_sales table should look something like this.

Once done we are good to go ahead with the pipeline execution.

This dimension model can be used as base source to your semantic model.

Conclusion

This concludes the end to end implementation of Medallion architecture. I hope this two part series would help you to get started on the Medallion architecture implementation and customized the process based on the business needs and design.

Thanks for reading !!!

0
Subscribe to my newsletter

Read articles from Sachin Nandanwar directly inside your inbox. Subscribe to the newsletter, and don't miss out.

Written by

Sachin Nandanwar
Sachin Nandanwar