"Building a Production-Grade Insurance Fraud Detection System with Streamlit, XGBoost, and MLOps" slug: insurance-fraud-detection-mlops

Sanjayram

excerpt: "Learn how to build a complete fraud detection pipeline using Python, XGBoost, joblib, Streamlit, and MLOps principles. From data preprocessing to real-time dashboards—this guide walks you through it all."
tags:

  • machine-learning

  • streamlit

  • data-science

  • xgboost

  • mlops


Introduction

Fraudulent insurance claims cost the industry billions annually. In this blog, we'll dive deep into designing and deploying a production-grade insurance fraud detection pipeline using:

  • XGBoost for robust classification

  • Streamlit for interactive dashboards

  • MLOps practices (modular structure, testing, Docker, CI/CD)

Whether you're an aspiring data scientist or a professional preparing for a FAANG role, this project showcases how to build real-world systems.


Project Structure

insurance-fraud-detection/
├── data/                  # Raw and processed datasets
├── models/                # Saved model + scaler (joblib)
├── src/                   # Modular Python scripts (optional if not using direct imports)
├── dashboards/            # Streamlit UI
├── api/                   # FastAPI for REST serving
├── tests/                 # Pytest unit tests
├── requirements.txt       # Dependencies
├── config.yaml            # Config file
├── pyproject.toml         # Linting & formatting
├── Makefile               # CLI workflow automation
├── .gitignore             # Ignore unnecessary files
└── README.md              # Full project overview

Dataset Overview

We use a simplified insurance dataset with anonymized features:

| Feature | Description |
| --- | --- |
| age | Customer's age |
| policy_sales_channel | Agent or online channel ID |
| gender | 0 = Female, 1 = Male |
| previously_insured | Has previous insurance? (0/1) |
| vehicle_age | 0 = <1 yr, 1 = 1–2 yr, 2 = >2 yr |
| vehicle_damage | 0 = No, 1 = Yes |
| is_fraud | Target variable (0 = legitimate, 1 = fraud) |
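To make the schema concrete, here is a tiny synthetic sample in pandas. The values below are purely illustrative stand-ins, not rows from the actual dataset:

```python
import pandas as pd

# Illustrative rows matching the schema above (not the real dataset)
df = pd.DataFrame([
    {"age": 44, "policy_sales_channel": 26, "gender": 1,
     "previously_insured": 0, "vehicle_age": 2, "vehicle_damage": 1, "is_fraud": 1},
    {"age": 31, "policy_sales_channel": 152, "gender": 0,
     "previously_insured": 1, "vehicle_age": 0, "vehicle_damage": 0, "is_fraud": 0},
    {"age": 52, "policy_sales_channel": 26, "gender": 1,
     "previously_insured": 0, "vehicle_age": 1, "vehicle_damage": 1, "is_fraud": 0},
])

# Fraud is usually the minority class -- check the balance before modeling
fraud_rate = df["is_fraud"].mean()
print(f"fraud rate: {fraud_rate:.2%}")  # prints "fraud rate: 33.33%"
```

Checking class balance early matters here: real fraud labels are heavily imbalanced, which influences metric choice and XGBoost's `scale_pos_weight`.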

Model Building (XGBoost + Preprocessing)

We use a scikit-learn Pipeline so that feature scaling and model inference always travel together:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from xgboost import XGBClassifier
import joblib

features = ['age', 'policy_sales_channel', 'gender', 'previously_insured',
            'vehicle_age', 'vehicle_damage']

pipeline = Pipeline([
    ("scaler", StandardScaler()),   # fitted on training data only
    ("model", XGBClassifier(n_estimators=100, max_depth=4, random_state=42))
])

# X_train / y_train come from your earlier train/test split
pipeline.fit(X_train[features], y_train)

# Persist scaler + model together as a single artifact
joblib.dump(pipeline, "models/final_model.pkl")

One file, one artifact: deployment stays clean.
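The single-artifact point is easy to verify: dump the fitted pipeline, load it back, and predictions match exactly. A minimal round-trip sketch, using LogisticRegression as a lightweight stand-in for XGBClassifier (the joblib mechanics are identical for any scikit-learn-compatible estimator):

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 6))            # six features, as in our schema
y = (X[:, 0] + X[:, 5] > 0).astype(int)  # synthetic stand-in label

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression()),
]).fit(X, y)

# Scaler + model serialize together as one file
path = os.path.join(tempfile.mkdtemp(), "final_model.pkl")
joblib.dump(pipe, path)
restored = joblib.load(path)

# Round-trip preserves behaviour exactly
assert (restored.predict(X) == pipe.predict(X)).all()
```

Because the scaler is inside the pipeline, there is no way for serving code to forget the preprocessing step.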


Streamlit Dashboard for Real-Time Use

We use Streamlit to:

  • Upload a CSV file or input data manually

  • Show predictions

  • Visualize fraud distribution

import streamlit as st
import joblib
import pandas as pd

# Path is relative to where `streamlit run` is launched (repo root, per the Makefile)
model = joblib.load("models/final_model.pkl")

user_input = {...}  # collected via st.selectbox, st.number_input, etc.
df = pd.DataFrame([user_input])
pred = model.predict(df)

You get a downloadable fraud probability CSV and a pie chart.
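The download piece reduces to a small helper the dashboard can call: score a DataFrame, attach a probability column, and hand Streamlit the CSV bytes. A self-contained sketch, where `StubModel`, `score_claims`, and the column names are illustrative assumptions (in the real app you would pass the loaded pipeline and wire the bytes into `st.download_button`):

```python
import numpy as np
import pandas as pd

class StubModel:
    """Stands in for the loaded pipeline; exposes the same predict_proba API."""
    def predict_proba(self, X):
        p = np.clip(X["age"].to_numpy() / 100.0, 0.0, 1.0)  # toy score
        return np.column_stack([1 - p, p])

def score_claims(model, df: pd.DataFrame):
    """Return df with a fraud_probability column, plus CSV bytes for download."""
    scored = df.copy()
    scored["fraud_probability"] = model.predict_proba(df)[:, 1]
    return scored, scored.to_csv(index=False).encode("utf-8")

claims = pd.DataFrame({"age": [25, 60], "vehicle_damage": [0, 1]})
scored, csv_bytes = score_claims(StubModel(), claims)
# In the app: st.download_button("Download scores", csv_bytes, "scores.csv")
```

Keeping the scoring logic out of the Streamlit callbacks also makes it trivially unit-testable, which pays off in the next section.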


Testing with PyTest

No project is production-grade without tests:

import joblib
import pandas as pd

def test_model_prediction():
    model = joblib.load("models/final_model.pkl")
    X_sample = pd.DataFrame([{...}])  # one row with the six feature columns
    pred = model.predict(X_sample)
    assert pred[0] in [0, 1]
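Loading the saved artifact couples tests to a prior training run. A common complement (an assumption on my part, not from the original repo) is a fast contract test that fits a small pipeline on synthetic data and checks the output invariants, again with LogisticRegression standing in for XGBClassifier:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression  # stand-in for XGBClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

FEATURES = ["age", "policy_sales_channel", "gender",
            "previously_insured", "vehicle_age", "vehicle_damage"]

def make_pipeline():
    # Synthetic data with the same feature contract as production
    rng = np.random.default_rng(0)
    X = pd.DataFrame(rng.normal(size=(100, len(FEATURES))), columns=FEATURES)
    y = rng.integers(0, 2, size=100)
    pipe = Pipeline([("scaler", StandardScaler()),
                     ("model", LogisticRegression())]).fit(X, y)
    return pipe, X

def test_predictions_are_binary_and_probabilities_bounded():
    pipe, X = make_pipeline()
    preds = pipe.predict(X)
    proba = pipe.predict_proba(X)[:, 1]
    assert set(preds) <= {0, 1}
    assert ((proba >= 0) & (proba <= 1)).all()
```

These run in milliseconds, so they can gate every commit in CI without retraining anything.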

Automation via Makefile

A Makefile gives every workflow a one-word command (note that recipe lines must be indented with a tab, not spaces):

install:
    pip install -r requirements.txt

train:
    python src/train_model.py

dashboard:
    streamlit run dashboards/streamlit_dashboard.py

Dockerization (Optional but Pro-Level)

Your Dockerfile:

FROM python:3.12
WORKDIR /app
COPY . .
RUN pip install -r requirements.txt
EXPOSE 8501
CMD ["streamlit", "run", "dashboards/streamlit_dashboard.py", "--server.address=0.0.0.0"]

Build and run:

docker build -t fraud-app .
docker run -p 8501:8501 fraud-app

What's Next? Advanced Enhancements

  • ✅ CI/CD with GitHub Actions

  • ✅ FastAPI backend for ML inference

  • ✅ Model versioning with MLflow

  • ✅ Data pipeline orchestration with Prefect/Airflow

  • ✅ Feature Store integration

  • ✅ Monitor predictions in production
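For the CI/CD item, a minimal GitHub Actions workflow might look like the sketch below. Filenames and versions here are assumptions, so adapt them to your repo:

```yaml
# .github/workflows/ci.yml
name: CI
on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install -r requirements.txt
      - run: pytest tests/ -v
```

With the fast, artifact-free tests from the previous section, this gate runs on every push without needing a trained model in the repo.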


Conclusion

You’ve now built a complete, production-ready ML system:

  • Clean code

  • Modular architecture

  • Tested and deployed with UI

This is the kind of project that grabs attention on your resume, portfolio, or even FAANG interviews.


Source Code

Everything is available on GitHub:

👉 https://github.com/SANJAYRAM-DS/fraud-detection-end2end

LinkedIn

👉 www.linkedin.com/in/sanjayram-data

If you liked this, follow me on Hashnode and stay tuned for more industry-level ML projects!

