Building an End-to-End Lending Data Analytics Platform: My DE Zoomcamp 2025 Final Project

Abhay Ahirkar

Introduction

As part of my DE Zoomcamp 2025 journey, I developed a comprehensive data engineering solution for the lending industry. The project addresses a critical business need: helping financial institutions make data-driven lending decisions by analyzing patterns in loan data, creating risk profiles for borrowers, and providing actionable insights through interactive dashboards.

In this article, I'll walk you through the architecture, implementation details, and key learnings from building this end-to-end data platform.

The Business Problem

Financial institutions face several challenges when making lending decisions:

  • Processing large volumes of customer and loan data

  • Identifying patterns in defaults and repayments

  • Creating accurate risk profiles for borrowers

  • Optimizing lending strategies

My solution addresses these challenges by creating a scalable, cloud-based data pipeline that transforms raw lending data into valuable business insights.

Technical Architecture

I built the platform entirely on Google Cloud Platform (GCP) using modern data engineering practices:

  1. Data Lake: Google Cloud Storage (GCS) stores raw and processed data

  2. Processing Engine: Apache Spark on Dataproc handles transformation tasks

  3. Data Warehouse: BigQuery stores processed data in an optimized schema

  4. Workflow Orchestration: Kestra manages pipeline dependencies

  5. Visualization: Metabase creates business intelligence dashboards

  6. Infrastructure as Code: Terraform ensures reproducibility

The Data Pipeline

The data pipeline consists of multiple transformation steps:

  1. Data Ingestion: Uploading the Lending Club dataset to GCS

  2. Data Splitting: Separating the raw data into logical domains (customers, loans, repayments, defaulters)

  3. Data Cleaning: Processing each domain with dedicated PySpark jobs

  4. Data Warehouse Loading: Creating external tables in BigQuery (a sketch of this step follows the list)

  5. Analytics Views: Building unified views for visualization

  6. Loan Scoring: Calculating risk scores for each loan
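To make step 4 concrete, here is a minimal sketch of how the cleaned Parquet files could be registered as external tables with the BigQuery Python client. The project, dataset, and bucket names are placeholders, not the ones used in the actual project:

```python
# create_external_tables.py -- sketch of the warehouse-loading step
from google.cloud import bigquery

PROJECT = "your-gcp-project"    # placeholder
DATASET = "lending"             # placeholder BigQuery dataset
BUCKET = "your-lending-bucket"  # placeholder GCS bucket

client = bigquery.Client(project=PROJECT)

for domain in ["customers", "loans", "repayments", "defaulters"]:
    # Point BigQuery at the Parquet files produced by the Spark jobs
    external_config = bigquery.ExternalConfig("PARQUET")
    external_config.source_uris = [f"gs://{BUCKET}/cleaned/{domain}/*.parquet"]

    table = bigquery.Table(f"{PROJECT}.{DATASET}.{domain}_ext")
    table.external_data_configuration = external_config

    client.create_table(table, exists_ok=True)
```

Because the tables are external, BigQuery reads the Parquet files directly from GCS, so nothing is duplicated inside the warehouse.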

All these steps are orchestrated using Kestra, which provides a clear visualization of job dependencies.

Key Implementation Details

1. Data Transformation with PySpark

The core of this project is a series of PySpark jobs that transform raw lending data into analytical datasets. For example, the first script breaks down the original dataset into the domain-specific tables described earlier: customers, loans, repayments, and defaulters.
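A simplified sketch of that splitting job is shown below; the column names follow the public Lending Club schema and the bucket name is a placeholder, so the actual script differs in detail:

```python
# split_lending_data.py -- simplified sketch of the domain-splitting job
from pyspark.sql import SparkSession

BUCKET = "your-lending-bucket"  # placeholder GCS data-lake bucket

spark = SparkSession.builder.appName("split-lending-data").getOrCreate()

# Read the raw dataset uploaded to the data lake
raw = spark.read.csv(f"gs://{BUCKET}/raw/lending_club.csv",
                     header=True, inferSchema=True)

# Carve the raw data into domain-specific tables
customers = raw.select("member_id", "emp_title", "emp_length",
                       "home_ownership", "annual_inc", "addr_state")
loans = raw.select("member_id", "loan_amnt", "term", "int_rate",
                   "grade", "purpose", "issue_d", "loan_status")
repayments = raw.select("member_id", "total_rec_prncp", "total_rec_int",
                        "last_pymnt_d", "last_pymnt_amnt")
defaulters = loans.filter(loans.loan_status.isin("Charged Off", "Default"))

# Write each domain back to GCS as Parquet for the downstream cleaning jobs
for name, df in [("customers", customers), ("loans", loans),
                 ("repayments", repayments), ("defaulters", defaulters)]:
    df.write.mode("overwrite").parquet(f"gs://{BUCKET}/split/{name}")
```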

Each subsequent script handles the cleaning and transformation of a specific domain.
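For instance, the cleaning job for the customers domain could look roughly like this; the specific rules shown (deduplication, type casting, null handling) are illustrative rather than the project's exact logic:

```python
# clean_customers.py -- illustrative cleaning step for the customers domain
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

BUCKET = "your-lending-bucket"  # placeholder GCS bucket

spark = SparkSession.builder.appName("clean-customers").getOrCreate()

customers = spark.read.parquet(f"gs://{BUCKET}/split/customers")

cleaned = (
    customers
    .dropDuplicates(["member_id"])
    # Turn "10+ years" style values into a plain integer
    .withColumn("emp_length",
                F.regexp_extract(F.col("emp_length"), r"(\d+)", 1).cast("int"))
    .withColumn("annual_inc", F.col("annual_inc").cast("double"))
    .na.fill({"annual_inc": 0.0})
)

cleaned.write.mode("overwrite").parquet(f"gs://{BUCKET}/cleaned/customers")
```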

2. Infrastructure as Code with Terraform

One of my priorities was ensuring the entire solution could be easily reproduced, so I used Terraform to define all the cloud resources described above.

This approach allows anyone to spin up an identical infrastructure with just a few commands.

3. Workflow Orchestration with Kestra

Managing the dependencies between data transformation steps is crucial, so I used Kestra to orchestrate the workflow.

Kestra provides a clean interface for tracking job execution and managing dependencies.

4. Data Visualization with Metabase

The final step was creating an intuitive dashboard for business users:

[Dashboard screenshot]

The dashboard includes:

  • Distribution of loans by status, grade, and purpose

  • Temporal trends in loan issuance and repayment

  • Default rate analysis by demographic segments (one such view is sketched after this list)

  • Loan score distribution and risk categorization
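As an illustration of the analytics-view step behind these charts, a default-rate view by state could be created through the BigQuery Python client as follows; the view, table, and column names reuse the placeholder names from the earlier sketches:

```python
# create_views.py -- sketch of one analytics view feeding the dashboard
from google.cloud import bigquery

PROJECT = "your-gcp-project"  # placeholder
DATASET = "lending"           # placeholder

client = bigquery.Client(project=PROJECT)

client.query(f"""
CREATE OR REPLACE VIEW `{PROJECT}.{DATASET}.default_rate_by_state` AS
SELECT
  c.addr_state,
  COUNT(*) AS total_loans,
  COUNTIF(l.loan_status IN ('Charged Off', 'Default')) / COUNT(*) AS default_rate
FROM `{PROJECT}.{DATASET}.loans_ext` AS l
JOIN `{PROJECT}.{DATASET}.customers_ext` AS c USING (member_id)
GROUP BY c.addr_state
""").result()  # wait for the DDL statement to finish
```

Metabase then queries views like this one directly, which keeps the analytical logic in the warehouse rather than in the BI tool.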

Challenges and Solutions

Challenge 1: Data Quality Issues

The original dataset contained many inconsistencies and missing values. I addressed this by implementing robust data cleaning in each PySpark job.

Challenge 2: Scaling Spark Jobs

Processing the large dataset efficiently required careful configuration of the Dataproc cluster and optimization of Spark parameters.
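For context, part of this kind of tuning can be expressed at the session level; the values below are placeholders rather than the settings used in the project:

```python
# Illustrative session-level Spark settings for the Dataproc jobs
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("lending-transformations")
    .config("spark.sql.shuffle.partitions", "96")   # roughly 2-3x the total executor cores
    .config("spark.sql.adaptive.enabled", "true")   # let AQE coalesce small shuffle partitions
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)
```

Executor memory and core counts are usually set when the job is submitted to Dataproc, so the session config only covers the query-level knobs.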

Challenge 3: Workflow Management

Managing dependencies between transformation steps was complex. Kestra provided a powerful solution for workflow orchestration.

Key Learnings

  1. Infrastructure as Code is Essential: Terraform made the deployment process repeatable and reliable.

  2. Separation of Concerns: Breaking the pipeline into domain-specific transformations improved maintainability.

  3. Workflow Orchestration: Managing dependencies between jobs is crucial for reliable data pipelines.

  4. Cloud-Native Architecture: Leveraging managed services like Dataproc and BigQuery reduced operational overhead.

Conclusion

Building this lending data analytics platform was a comprehensive application of the data engineering concepts I learned in DE Zoomcamp 2025. The project demonstrates how modern data engineering practices can transform raw data into business value.

The final solution provides lending institutions with actionable insights to make better-informed decisions, reduce default risks, and optimize their lending strategies.

What's Next?

Looking ahead, I'm considering several enhancements:

  • Adding real-time data processing for immediate insights

  • Implementing machine learning models for predictive analytics

  • Expanding the dashboard with more sophisticated visualizations

Resources

The complete source code and implementation details are available on GitHub.


Special thanks to Alexey Grigorev, DataTalksClub, Kestra, and the entire DE Zoomcamp community for the incredible learning journey!

#DataEngineering #CloudComputing #GCP #ApacheSpark #BigData #FinTech #DEZoomCamp
