Building an End-to-End Lending Data Analytics Platform: My DE Zoomcamp 2025 Final Project


Introduction
As part of my DE Zoomcamp 2025 journey, I developed a comprehensive data engineering solution for the lending industry. The project addresses a critical business need: helping financial institutions make data-driven lending decisions by analyzing patterns in loan data, creating risk profiles for borrowers, and providing actionable insights through interactive dashboards.
In this article, I'll walk you through the architecture, implementation details, and key learnings from building this end-to-end data platform.
The Business Problem
Financial institutions face several challenges when making lending decisions:
Processing large volumes of customer and loan data
Identifying patterns in defaults and repayments
Creating accurate risk profiles for borrowers
Optimizing lending strategies
My solution addresses these challenges by creating a scalable, cloud-based data pipeline that transforms raw lending data into valuable business insights.
Technical Architecture
I built the platform entirely on Google Cloud Platform (GCP) using modern data engineering practices:
Data Lake: Google Cloud Storage (GCS) stores raw and processed data
Processing Engine: Apache Spark on Dataproc handles transformation tasks
Data Warehouse: BigQuery stores processed data in an optimized schema
Workflow Orchestration: Kestra manages pipeline dependencies
Visualization: Metabase creates business intelligence dashboards
Infrastructure as Code: Terraform ensures reproducibility
The Data Pipeline
The data pipeline consists of multiple transformation steps:
Data Ingestion: Uploading the Lending Club dataset to GCS
Data Splitting: Separating the raw data into logical domains (customers, loans, repayments, defaulters)
Data Cleaning: Processing each domain with dedicated PySpark jobs
Data Warehouse Loading: Creating external tables in BigQuery (see the sketch after this list)
Analytics Views: Building unified views for visualization
Loan Scoring: Calculating risk scores for each loan
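To make steps 4 and 5 concrete, here is a minimal sketch using the BigQuery Python client. The project ID, bucket, dataset, table, and column names are placeholders rather than the project's actual identifiers, and the dataset is assumed to already exist (provisioned by Terraform).

```python
from google.cloud import bigquery

client = bigquery.Client(project="your-gcp-project")  # placeholder project ID

# Register the cleaned Parquet files in GCS as an external table
# (the customers table would be registered the same way).
external_config = bigquery.ExternalConfig("PARQUET")
external_config.source_uris = ["gs://your-lending-bucket/cleaned/loans/*.parquet"]

loans_table = bigquery.Table("your-gcp-project.lending.loans_external")
loans_table.external_data_configuration = external_config
client.create_table(loans_table, exists_ok=True)

# Build a unified view on top of the external tables for the dashboard.
view = bigquery.Table("your-gcp-project.lending.loan_analytics_view")
view.view_query = """
    SELECT l.*, c.annual_inc
    FROM `your-gcp-project.lending.loans_external` l
    JOIN `your-gcp-project.lending.customers_external` c USING (member_id)
"""
client.create_table(view, exists_ok=True)
```

Because the tables are external, BigQuery queries the Parquet files in GCS directly, so no separate load job is needed after each pipeline run.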
All of these steps are orchestrated with Kestra, which provides a clear visualization of job dependencies in its UI.
Key Implementation Details
1. Data Transformation with PySpark
The core of this project is a series of PySpark jobs that transform the raw lending data into analytical datasets. The first script, for example, breaks the original dataset into several domain-specific tables.
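The original script isn't reproduced in this post, but a minimal sketch of that splitting step could look like the following. Column names follow the public Lending Club schema, and the bucket path is a placeholder; the repository's actual code may differ.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("split_lending_data").getOrCreate()

# Read the raw Lending Club CSV uploaded to GCS.
raw = spark.read.csv("gs://your-lending-bucket/raw/lending_club.csv",
                     header=True, inferSchema=True)

# Project the raw columns into logical domains.
customers = raw.select("member_id", "emp_title", "emp_length",
                       "home_ownership", "annual_inc", "addr_state")
loans = raw.select("id", "member_id", "loan_amnt", "term", "int_rate",
                   "grade", "purpose", "issue_d", "loan_status")
repayments = raw.select("id", "total_pymnt", "total_rec_prncp",
                        "total_rec_int", "last_pymnt_d", "last_pymnt_amnt")
defaulters = raw.filter(raw.loan_status.isin("Charged Off", "Default")) \
                .select("id", "member_id", "loan_status")

# Write each domain out as Parquet so downstream jobs stay independent.
for name, df in [("customers", customers), ("loans", loans),
                 ("repayments", repayments), ("defaulters", defaulters)]:
    df.write.mode("overwrite").parquet(f"gs://your-lending-bucket/split/{name}")

spark.stop()
```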
Each subsequent script handles the cleaning and transformation of a specific domain.
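For instance, the cleaning job for the loans domain might look like the sketch below. The exact rules live in the repository; the transformations here are illustrative of the kind of work each job does.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("clean_loans").getOrCreate()

loans = spark.read.parquet("gs://your-lending-bucket/split/loans")

loans_clean = (
    loans.dropDuplicates(["id"])
         .filter(F.col("loan_amnt").isNotNull())
         # "int_rate" arrives as a string like "13.5%": strip the sign and cast.
         .withColumn("int_rate", F.regexp_replace("int_rate", "%", "").cast("double"))
         # "term" is a string like " 36 months": extract the number of months.
         .withColumn("term_months", F.regexp_extract("term", r"(\d+)", 1).cast("int"))
         # "issue_d" is formatted like "Dec-2015".
         .withColumn("issue_date", F.to_date("issue_d", "MMM-yyyy"))
         .fillna({"purpose": "unknown"})
)

loans_clean.write.mode("overwrite").parquet("gs://your-lending-bucket/cleaned/loans")
spark.stop()
```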
2. Infrastructure as Code with Terraform
One of my priorities was ensuring the entire solution could be easily reproduced, so I used Terraform to define all of the cloud resources: the GCS bucket, the BigQuery dataset, and the Dataproc cluster.
This approach allows anyone to spin up an identical infrastructure with just a few commands.
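The Terraform files themselves aren't shown here; deployment follows the standard Terraform workflow, with any variable values (project ID, region, bucket names) coming from the repository's configuration.

```bash
terraform init    # download the Google provider and initialize state
terraform plan    # preview the GCS, BigQuery, and Dataproc resources
terraform apply   # create the infrastructure
```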
3. Workflow Orchestration with Kestra
Managing the dependencies between data transformation steps is crucial, so I used Kestra to orchestrate the workflow end to end.
Kestra provides a clean interface for tracking job execution and managing dependencies.
4. Data Visualization with Metabase
The final step was creating an intuitive Metabase dashboard for business users.
The dashboard includes:
Distribution of loans by status, grade, and purpose
Temporal trends in loan issuance and repayment
Default rate analysis by demographic segments
Loan score distribution and risk categorization
Challenges and Solutions
Challenge 1: Data Quality Issues
The original dataset contained many inconsistencies and missing values. I addressed this by implementing robust data cleaning in each PySpark job.
Challenge 2: Scaling Spark Jobs
Processing the large dataset efficiently required careful configuration of the Dataproc clusters and optimization of Spark parameters.
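The exact cluster sizing isn't covered here, but as an illustration of the kind of tuning involved, a job can pin shuffle parallelism and executor resources when the Spark session is built. The values below are placeholders, not the project's actual configuration.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
        .appName("clean_loans")
        # Placeholder values; the right numbers depend on the Dataproc cluster size.
        .config("spark.sql.shuffle.partitions", "200")
        .config("spark.executor.memory", "8g")
        .config("spark.executor.cores", "4")
        .config("spark.dynamicAllocation.enabled", "true")
        .getOrCreate()
)
```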
Challenge 3: Workflow Management
Managing dependencies between transformation steps was complex. Kestra provided a powerful solution for workflow orchestration.
Key Learnings
Infrastructure as Code is Essential: Terraform made the deployment process repeatable and reliable.
Separation of Concerns: Breaking the pipeline into domain-specific transformations improved maintainability.
Workflow Orchestration: Managing dependencies between jobs is crucial for reliable data pipelines.
Cloud-Native Architecture: Leveraging managed services like Dataproc and BigQuery reduced operational overhead.
Conclusion
Building this lending data analytics platform was a comprehensive application of the data engineering concepts I learned in DE Zoomcamp 2025. The project demonstrates how modern data engineering practices can transform raw data into business value.
The final solution provides lending institutions with actionable insights to make better-informed decisions, reduce default risks, and optimize their lending strategies.
What's Next?
Looking ahead, I'm considering several enhancements:
Adding real-time data processing for immediate insights
Implementing machine learning models for predictive analytics
Expanding the dashboard with more sophisticated visualizations
Resources
The complete source code and implementation details are available on GitHub.
Special thanks to Alexey Grigorev, DataTalksClub, Kestra, and the entire DE Zoomcamp community for the incredible learning journey!
#DataEngineering #CloudComputing #GCP #ApacheSpark #BigData #FinTech #DEZoomCamp