Building the Data Foundation

In this phase, my goal was to prepare the raw data for analysis by creating a clean, structured dataset. This process was critical to ensure the accuracy and reliability of my final insights.

Data Source & Acquisition

I obtained the dataset from Kaggle as a zip file containing three CSV files and two JSON files. For the initial load, I placed the files into a Google Cloud Storage bucket. Afterward, I used Airflow to create a DAG that migrated the files to the project’s data lake, a bucket called ‘raw’ in Google Cloud Platform (a sketch of this DAG appears after the tools list below).

Tools & Technology

  • Python: The primary language used to migrate the data.

  • Pandas: Used to load the JSON file (mcc_code) into a DataFrame and transform it into a CSV.

  • BashOperator: to run command-line tasks from within the DAG

  • GCSToGCSOperator: to migrate data from one bucket to another

  • PythonOperator: to run Python code

  • GCSHook: to download the JSON file from the bucket so it could be transformed into a CSV
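For illustration, a migration DAG along these lines might look like the sketch below. Bucket names, file names, and the DAG id are placeholders rather than the project’s actual values, and the layout of the MCC JSON file is assumed to be a flat code-to-description mapping.

```python
# Illustrative Airflow DAG: copy the raw CSVs between buckets and convert the
# MCC JSON file to CSV. Names marked as placeholders are assumptions.
import json
from datetime import datetime

import pandas as pd
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.google.cloud.hooks.gcs import GCSHook
from airflow.providers.google.cloud.transfers.gcs_to_gcs import GCSToGCSOperator

LANDING_BUCKET = "landing-bucket"  # placeholder: bucket for the initial upload
RAW_BUCKET = "raw"                 # data lake bucket
CSV_FILES = ["cards.csv", "users.csv", "transactions.csv"]  # placeholder names


def mcc_json_to_csv() -> None:
    """Download the MCC JSON, convert it with pandas, upload the CSV to 'raw'."""
    hook = GCSHook()
    raw_bytes = hook.download(bucket_name=LANDING_BUCKET, object_name="mcc_codes.json")
    mapping = json.loads(raw_bytes)  # assumed shape: {"5411": "Grocery Stores", ...}
    df = pd.DataFrame(list(mapping.items()), columns=["mcc_code", "description"])
    hook.upload(
        bucket_name=RAW_BUCKET,
        object_name="mcc_codes.csv",
        data=df.to_csv(index=False),
        mime_type="text/csv",
    )


with DAG(
    dag_id="migrate_raw_files",
    start_date=datetime(2024, 1, 1),
    schedule=None,   # triggered manually for the one-off migration
    catchup=False,
) as dag:
    # One copy task per CSV, moving files as-is into the data lake bucket.
    copy_tasks = [
        GCSToGCSOperator(
            task_id=f"copy_{name.split('.')[0]}_to_raw",
            source_bucket=LANDING_BUCKET,
            source_object=name,
            destination_bucket=RAW_BUCKET,
        )
        for name in CSV_FILES
    ]

    convert_mcc = PythonOperator(
        task_id="mcc_json_to_csv",
        python_callable=mcc_json_to_csv,
    )

    copy_tasks >> convert_mcc
```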

Data Cleaning & Transformation

The dataset provided was relatively clean, but it contained a few issues that needed to be addressed. I resolved them while creating the staging files (a sketch of the equivalent logic follows the tools list below).

  • Incorrect Date Format: “transaction_date” was reformatted into a consistent month, day, year format.

  • Incorrect Currency Format: “credit_limit”, “per_capita_income”, “yearly_income”, “total_debt”, and “amount” were stored as strings and included a ‘$’. I replaced the dollar signs with an empty string (“”) and converted the values to numeric values with two decimal places.

Tools & Technology

  • SQL: The primary language used to manipulate the data.

  • dbt: Used to create the data model by leveraging SQL.
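The actual cleanup lives in the dbt SQL staging models; the pandas sketch below only mirrors that logic in Python for illustration. The column names come from the bullets above, while the function name and the exact target date format are assumptions.

```python
# Illustrative Python equivalent of the dbt staging cleanup; the real
# transformations are written in SQL.
import pandas as pd

CURRENCY_COLUMNS = ["credit_limit", "per_capita_income", "yearly_income",
                    "total_debt", "amount"]


def clean_staging(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()

    # Date fix: parse transaction_date and rewrite it as month/day/year.
    if "transaction_date" in out.columns:
        out["transaction_date"] = (
            pd.to_datetime(out["transaction_date"]).dt.strftime("%m/%d/%Y")
        )

    # Currency fix: strip the '$' and cast to a numeric value with two decimals.
    for col in CURRENCY_COLUMNS:
        if col in out.columns:
            out[col] = (
                out[col]
                .astype(str)
                .str.replace("$", "", regex=False)
                .astype(float)
                .round(2)
            )

    return out
```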

Data Ingestion

With the data successfully cleaned and transformed into a structured format, the final step in the pipeline was to load it into a centralized and scalable location. This phase is crucial because it makes the data ready for efficient querying and long-term storage, which is essential for the upcoming analysis and dashboard creation.

The Loading Process

I chose to load the data into Google’s BigQuery, which served as my project’s data warehouse. Using a robust database offers significant advantages over simply storing the data in a file, as it allows for fast querying, ensures data integrity, and can easily be accessed by BI tools like Power BI.

The process involved two main steps:

  1. Dataset creation: I first created a BigQuery dataset called transaction_analytics, then created the tables stg_cards, stg_mcc, stg_users, and stg_transactions. BigQuery’s schema auto-detection ensured the data would be correctly ingested.

  2. Data Ingestion: Using Airflow, I created a DAG that connected to Google BigQuery, ran the transformations defined in the dbt SQL files, and created the fact and dimension views, which were modeled as a snowflake schema. This method effectively created a data warehouse I could connect to (a sketch of this DAG appears below).
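For a concrete picture, here is a minimal sketch of that load-and-transform DAG. The dataset and table names come from the steps above; the object names, dbt paths, DAG id, and the use of default GCP connections are assumptions, so the real DAG differs in its details.

```python
# Illustrative load-and-transform DAG. Dataset and table names come from the
# post; file names, dbt paths, and the DAG id are placeholder assumptions.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.providers.google.cloud.operators.bigquery import (
    BigQueryCreateEmptyDatasetOperator,
)
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import (
    GCSToBigQueryOperator,
)

RAW_BUCKET = "raw"
DATASET = "transaction_analytics"
# Staging table -> source file in the raw bucket (file names are placeholders).
SOURCES = {
    "stg_cards": "cards.csv",
    "stg_users": "users.csv",
    "stg_transactions": "transactions.csv",
    "stg_mcc": "mcc_codes.csv",
}

with DAG(
    dag_id="load_transaction_analytics",
    start_date=datetime(2024, 1, 1),
    schedule=None,
    catchup=False,
) as dag:
    create_dataset = BigQueryCreateEmptyDatasetOperator(
        task_id="create_dataset",
        dataset_id=DATASET,
    )

    # Load each raw CSV into its staging table, letting BigQuery infer the schema.
    load_tasks = [
        GCSToBigQueryOperator(
            task_id=f"load_{table}",
            bucket=RAW_BUCKET,
            source_objects=[source_file],
            destination_project_dataset_table=f"{DATASET}.{table}",
            source_format="CSV",
            skip_leading_rows=1,          # CSV header row
            autodetect=True,              # schema auto-detection
            write_disposition="WRITE_TRUNCATE",
        )
        for table, source_file in SOURCES.items()
    ]

    # Build the staging models and the fact/dimension views defined in dbt.
    run_dbt = BashOperator(
        task_id="run_dbt_models",
        bash_command="dbt run --project-dir /opt/airflow/dbt --profiles-dir /opt/airflow/dbt",
    )

    create_dataset >> load_tasks >> run_dbt
```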

The End Result

At the end of this stage, my dataset was no longer a collection of simple CSV files; it was a set of structured tables in a data warehouse, ready to be queried and used as the foundation for my analysis. This final step of the data engineering phase ensures that the data is in the right state for the next stages of the project.

Here is a link to this project’s GitHub Repo.

Here is the link to this Project’s Board.
