Slowly Changing Dimensions with PySpark and Delta Lake

Slowly Changing Dimensions (SCDs) are a vital concept in data warehousing for managing data that changes over time. As entities evolve, it's crucial to track and manage these changes effectively, and SCD techniques define how a dimension table records them.

Consider an example table:

There are three main types of SCD:

  1. SCD Type 1 - This is the simplest approach: existing data is overwritten by new data, so only the latest values are kept and the historical information is lost. This method is suitable when historical data isn't critical.

    Table 1:

    Table 2:

  2. SCD Type 2 - Type 2 SCDs keep track of both current and historical data. When an attribute changes, a new record is inserted into the dimension table with the updated attribute values and a new surrogate key, and the previous record is marked as expired (typically through its effective dates). This preserves historical data, enabling analysis of how dimensions evolve over time. It typically adds tracking columns such as start_date and end_date, often alongside a flag marking the current record.

    Table 1:

    Table 2:

  3. SCD Type 3 - Unlike SCD1, which keeps only the latest state, and SCD2, which keeps full history as extra rows, SCD3 allows limited historical tracking while minimizing storage requirements. It adds a column that holds the previous value of the changed attribute alongside the current one.

    Table 1:

    Table 2:
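
To make the three types above concrete, here is a minimal sketch in plain Python, using lists of dicts in place of a Delta table so the logic is easy to follow. Everything in it is an assumption made for illustration: the customer rows, the `id` and `city` columns, the `sk` surrogate key, the `is_current` flag, the `previous_<column>` naming, and the helper names `scd_type1`, `scd_type2` and `scd_type3`. In PySpark with Delta Lake the same rules would typically be expressed as a merge against the dimension table rather than as Python loops.

```python
from datetime import date


def scd_type1(dim, change, key="id"):
    """Type 1: overwrite the matching row (or insert a new key). No history."""
    rows = [dict(r) for r in dim]  # copy so the input is left untouched
    for row in rows:
        if row[key] == change[key]:
            row.update(change)  # old values are lost here
            return rows
    rows.append(dict(change))
    return rows


def scd_type2(dim, change, effective, key="id"):
    """Type 2: expire the current row and append a new current version."""
    rows = [dict(r) for r in dim]
    for row in rows:
        if row[key] == change[key] and row["is_current"]:
            if all(row[k] == v for k, v in change.items()):
                return rows  # nothing changed, keep the current version
            row["end_date"] = effective
            row["is_current"] = False
    next_sk = max((r["sk"] for r in rows), default=0) + 1  # new surrogate key
    rows.append({"sk": next_sk, **change, "start_date": effective,
                 "end_date": None, "is_current": True})
    return rows


def scd_type3(dim, change, column, key="id"):
    """Type 3: keep only the previous value, in a previous_<column> field."""
    rows = [dict(r) for r in dim]
    for row in rows:
        if row[key] == change[key] and row[column] != change[column]:
            row[f"previous_{column}"] = row[column]
            row[column] = change[column]
    return rows


# Hypothetical sample data: one customer who moves from Pune to Mumbai.
change = {"id": 1, "name": "Asha", "city": "Mumbai"}

dim_t1 = [{"id": 1, "name": "Asha", "city": "Pune"}]
dim_t2 = [{"sk": 1, "id": 1, "name": "Asha", "city": "Pune",
           "start_date": date(2023, 1, 1), "end_date": None,
           "is_current": True}]
dim_t3 = [{"id": 1, "name": "Asha", "city": "Pune", "previous_city": None}]

result_t1 = scd_type1(dim_t1, change)
result_t2 = scd_type2(dim_t2, change, effective=date(2024, 1, 1))
result_t3 = scd_type3(dim_t3, change, column="city")
```

After the change, the Type 1 result still has a single row and no trace of the old city, the Type 2 result has two rows (the expired Pune row and the current Mumbai row), and the Type 3 result has a single row that carries both the current and the previous city.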


Written by

Harshita Chaudhary

Hi, my name is Harshita Chaudhary and I am a data engineer. I joined Hashnode to share my day-to-day learnings with you. I like to write concise tech blogs on the fundamentals of data engineering, with practical examples. Happy learning and keep querying!