Building the Data Foundation

In this phase, my goal was to prepare the raw data for analysis by creating a clean, structured dataset. This process was critical to ensure the accuracy and reliability of my final insights.

Data Source & Acquisition

I obtained the dataset from Kaggle as a zip file containing three CSV files and two JSON files. For the initial load, I placed the files into a Google Cloud Storage bucket. Afterward, I used Airflow to create a DAG that migrated the files to the project’s data lake, a bucket called ‘raw’ in Google Cloud Platform (a sketch of this DAG appears after the tools list below).

Tools & Technology

  • Python: The primary language used to migrate the data.

  • Pandas: Used to load the JSON file (mcc_code) into a DataFrame and transform it into a CSV.

  • BashOperator: to run command-line tasks from within the DAG

  • GCSToGCSOperator: to migrate data from one bucket to another

  • PythonOperator: to run Python code

  • GCSHook: to download the JSON file from the bucket so it could be transformed into a CSV
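For illustration, a migration DAG along these lines might look like the sketch below. Bucket names, file names, and the DAG id are placeholders rather than the project’s actual values, and the layout of the MCC JSON file is assumed to be a flat code-to-description mapping.

```python
# Illustrative Airflow DAG: copy the raw CSVs between buckets and convert the
# MCC JSON file to CSV. Names marked as placeholders are assumptions.
import json
from datetime import datetime

import pandas as pd
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.google.cloud.hooks.gcs import GCSHook
from airflow.providers.google.cloud.transfers.gcs_to_gcs import GCSToGCSOperator

LANDING_BUCKET = "landing-bucket"  # placeholder: bucket for the initial upload
RAW_BUCKET = "raw"                 # data lake bucket
CSV_FILES = ["cards.csv", "users.csv", "transactions.csv"]  # placeholder names


def mcc_json_to_csv() -> None:
    """Download the MCC JSON, convert it with pandas, upload the CSV to 'raw'."""
    hook = GCSHook()
    raw_bytes = hook.download(bucket_name=LANDING_BUCKET, object_name="mcc_codes.json")
    mapping = json.loads(raw_bytes)  # assumed shape: {"5411": "Grocery Stores", ...}
    df = pd.DataFrame(list(mapping.items()), columns=["mcc_code", "description"])
    hook.upload(
        bucket_name=RAW_BUCKET,
        object_name="mcc_codes.csv",
        data=df.to_csv(index=False),
        mime_type="text/csv",
    )


with DAG(
    dag_id="migrate_raw_files",
    start_date=datetime(2024, 1, 1),
    schedule=None,   # triggered manually for the one-off migration
    catchup=False,
) as dag:
    # One copy task per CSV, moving files as-is into the data lake bucket.
    copy_tasks = [
        GCSToGCSOperator(
            task_id=f"copy_{name.split('.')[0]}_to_raw",
            source_bucket=LANDING_BUCKET,
            source_object=name,
            destination_bucket=RAW_BUCKET,
        )
        for name in CSV_FILES
    ]

    convert_mcc = PythonOperator(
        task_id="mcc_json_to_csv",
        python_callable=mcc_json_to_csv,
    )

    copy_tasks >> convert_mcc
```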

Data Cleaning & Transformation

The dataset provided was relatively clean, but it contained a few issues that needed to be addressed. I resolved them while creating the staging files (a sketch of the equivalent logic follows the tools list below).

  • Incorrect Date Format: “transaction_date” was reformatted into a consistent month, day, year format.

  • Incorrect Currency Format: “credit_limit”, “per_capita_income”, “yearly_income”, “total_debt”, and “amount” were stored as strings and included a ‘$’. I replaced the dollar signs with an empty string (“”) and converted the values to numeric values with two decimal places.

Tools & Technology

  • SQL: The primary language used to manipulate the data.

  • dbt: Used to create the data model by leveraging SQL.
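The actual cleanup lives in the dbt SQL staging models; the pandas sketch below only mirrors that logic in Python for illustration. The column names come from the bullets above, while the function name and the exact target date format are assumptions.

```python
# Illustrative Python equivalent of the dbt staging cleanup; the real
# transformations are written in SQL.
import pandas as pd

CURRENCY_COLUMNS = ["credit_limit", "per_capita_income", "yearly_income",
                    "total_debt", "amount"]


def clean_staging(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()

    # Date fix: parse transaction_date and rewrite it as month/day/year.
    if "transaction_date" in out.columns:
        out["transaction_date"] = (
            pd.to_datetime(out["transaction_date"]).dt.strftime("%m/%d/%Y")
        )

    # Currency fix: strip the '$' and cast to a numeric value with two decimals.
    for col in CURRENCY_COLUMNS:
        if col in out.columns:
            out[col] = (
                out[col]
                .astype(str)
                .str.replace("$", "", regex=False)
                .astype(float)
                .round(2)
            )

    return out
```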

Data Ingestion

With the data successfully cleaned and transformed into a structured format, the final step in the pipeline was to load it into a centralized and scalable location. This phase is crucial because it makes the data ready for efficient querying and long-term storage, which is essential for the upcoming analysis and dashboard creation.

The Loading Process

I chose to load the data into Google’s BigQuery, which served as my project’s data warehouse. Using a robust database offers significant advantages over simply storing the data in a file, as it allows for fast querying, ensures data integrity, and can easily be accessed by BI tools like Power BI.

The process involved two main steps:

  1. Dataset creation: I first created a BigQuery dataset called transaction_analytics, then created the tables stg_cards, stg_mcc, stg_users, and stg_transactions. BigQuery’s schema auto-detection ensured the data would be correctly ingested.

  2. Data Ingestion: Using Airflow, I created a DAG that connected to Google BigQuery, ran the transformations defined in the dbt SQL files, and created the fact and dimension views, which were modeled as a snowflake schema. This method effectively created a data warehouse I could connect to (a sketch of this DAG appears below).
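For a concrete picture, here is a minimal sketch of that load-and-transform DAG. The dataset and table names come from the steps above; the object names, dbt paths, DAG id, and the use of default GCP connections are assumptions, so the real DAG differs in its details.

```python
# Illustrative load-and-transform DAG. Dataset and table names come from the
# post; file names, dbt paths, and the DAG id are placeholder assumptions.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.providers.google.cloud.operators.bigquery import (
    BigQueryCreateEmptyDatasetOperator,
)
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import (
    GCSToBigQueryOperator,
)

RAW_BUCKET = "raw"
DATASET = "transaction_analytics"
# Staging table -> source file in the raw bucket (file names are placeholders).
SOURCES = {
    "stg_cards": "cards.csv",
    "stg_users": "users.csv",
    "stg_transactions": "transactions.csv",
    "stg_mcc": "mcc_codes.csv",
}

with DAG(
    dag_id="load_transaction_analytics",
    start_date=datetime(2024, 1, 1),
    schedule=None,
    catchup=False,
) as dag:
    create_dataset = BigQueryCreateEmptyDatasetOperator(
        task_id="create_dataset",
        dataset_id=DATASET,
    )

    # Load each raw CSV into its staging table, letting BigQuery infer the schema.
    load_tasks = [
        GCSToBigQueryOperator(
            task_id=f"load_{table}",
            bucket=RAW_BUCKET,
            source_objects=[source_file],
            destination_project_dataset_table=f"{DATASET}.{table}",
            source_format="CSV",
            skip_leading_rows=1,          # CSV header row
            autodetect=True,              # schema auto-detection
            write_disposition="WRITE_TRUNCATE",
        )
        for table, source_file in SOURCES.items()
    ]

    # Build the staging models and the fact/dimension views defined in dbt.
    run_dbt = BashOperator(
        task_id="run_dbt_models",
        bash_command="dbt run --project-dir /opt/airflow/dbt --profiles-dir /opt/airflow/dbt",
    )

    create_dataset >> load_tasks >> run_dbt
```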

The End Result

At the end of this stage, my dataset was no longer a collection of simple CSV files; it was a set of structured tables in a data warehouse, ready to be queried and used as the foundation for my analysis. This final step of the data engineering phase ensures that the data is in the right state for the next stages of the project.

Here is a link to this project’s GitHub Repo.

Here is the link to this Project’s Board.
