🚀 How I Integrated MLOps into My Data Science Project (as a Fresher)


When I first started learning Data Science, I thought building a model was the main job. But when I worked on my Anime Recommender System, I realised the real challenge is making it production-ready — scalable, reproducible, and easy to maintain.
In this post, I’ll share how I built an end-to-end recommendation system with modern MLOps practices, deployed it on Google Cloud Platform, and optimised it for speed and cost — even as a fresher.
Dataset I used: Anime Recommendation Database 2020 on Kaggle
1. The Problem
I wanted to recommend anime shows to users based on their watching history and anime similarities. But instead of stopping at a Jupyter Notebook, I set out to:
Ingest & process massive datasets (~70M ratings) efficiently
Build a robust model using collaborative & content-based filtering
Deploy it in the cloud with CI/CD, version control, and monitoring
2. Tech Stack
ML & Backend: Python, TensorFlow/Keras, Pandas, Scikit-learn, Flask
MLOps: Jenkins, Docker, DVC, MLflow, CometML
Cloud: Google Cloud Storage, Google Artifact Registry, Google Kubernetes Engine
3. The Workflow
Data Ingestion & Processing
Pulled data from Google Cloud Storage
Used DVC to track datasets for reproducibility
Sampled 5M rows for training to optimise resources
Feature engineering: user-item interaction matrices, embeddings
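To make the ingestion step concrete, here's a minimal sketch of the sampling and encoding (file names, paths, and exact columns are my illustration here, not lifted from the repo):

```python
import pandas as pd

# Ratings file pulled from Google Cloud Storage (path/columns are illustrative;
# the Kaggle dataset stores interactions as user_id / anime_id / rating).
ratings = pd.read_csv("artifacts/raw/animelist.csv",
                      usecols=["user_id", "anime_id", "rating"])

# Sample 5M of the ~70M interactions to keep training affordable.
sample = ratings.sample(n=5_000_000, random_state=42)

# Map raw IDs to contiguous integer indices for the embedding layers.
sample["user"] = sample["user_id"].astype("category").cat.codes
sample["anime"] = sample["anime_id"].astype("category").cat.codes
sample.to_csv("artifacts/processed/ratings_sample.csv", index=False)
```

Tracking the processed file with `dvc add artifacts/processed/ratings_sample.csv` followed by `dvc push` pins the exact sample each training run saw, which is what makes the runs reproducible.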
Model Training
Implemented Neural Collaborative Filtering (NCF)
Used embedding layers for users & anime items
Early stopping & model checkpointing for efficiency
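For anyone curious what NCF looks like in Keras, here's a minimal sketch; the layer sizes and the implicit-feedback (sigmoid) framing are assumptions, not necessarily my exact architecture:

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_ncf(n_users: int, n_anime: int, dim: int = 128) -> keras.Model:
    user_in = layers.Input(shape=(1,), name="user")
    anime_in = layers.Input(shape=(1,), name="anime")

    # One embedding table per entity; these vectors get reused at serving time.
    user_vec = layers.Flatten()(layers.Embedding(n_users, dim)(user_in))
    anime_vec = layers.Flatten()(layers.Embedding(n_anime, dim)(anime_in))

    # An MLP over the concatenated embeddings: the "neural" part of NCF.
    x = layers.Concatenate()([user_vec, anime_vec])
    x = layers.Dense(128, activation="relu")(x)
    x = layers.Dense(64, activation="relu")(x)
    out = layers.Dense(1, activation="sigmoid")(x)  # predicted affinity

    model = keras.Model([user_in, anime_in], out)
    model.compile(optimizer="adam", loss="binary_crossentropy")
    return model

callbacks = [
    keras.callbacks.EarlyStopping(patience=3, restore_best_weights=True),
    keras.callbacks.ModelCheckpoint("artifacts/model/ncf.keras", save_best_only=True),
]
```

Early stopping keeps you from paying for epochs that no longer help, and checkpointing means a crashed run doesn't throw away the best weights.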
Model Serving
Generated recommendations using precomputed embeddings for <250ms inference time
Flask API + responsive web interface
Cached frequently requested images for speed
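The serving idea, roughly: load the precomputed item embeddings once at startup and answer each request with a single matrix product, so no model forward pass happens per request. Paths are illustrative, and I'm showing the caching idea with `lru_cache` on recommendation lookups (in the app itself it was the images that got cached):

```python
from functools import lru_cache

import numpy as np
from flask import Flask, jsonify

app = Flask(__name__)

# Precomputed at training time, loaded once at startup (path is illustrative).
anime_emb = np.load("artifacts/model/anime_embeddings.npy")    # shape: (n_anime, dim)
anime_emb /= np.linalg.norm(anime_emb, axis=1, keepdims=True)  # normalise once

@lru_cache(maxsize=4096)  # repeat queries skip the dot product entirely
def top_k_similar(anime_idx: int, k: int = 10) -> tuple:
    scores = anime_emb @ anime_emb[anime_idx]  # cosine similarity vs all items
    top = np.argsort(scores)[::-1][1 : k + 1]  # drop the item itself
    return tuple(int(i) for i in top)

@app.route("/recommend/<int:anime_idx>")
def recommend(anime_idx: int):
    return jsonify({"anime": anime_idx, "similar": list(top_k_similar(anime_idx))})
```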
4. MLOps Integration
CI/CD Pipeline (Jenkins)
Launched a VM instance of Jenkins in Google Cloud
Automated data checks, training, and deployment
Multi-stage Docker builds (reduced image size from 10GB → 600MB)
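A sketch of the multi-stage pattern, assuming an `app/` layout and the pyproject.toml dependency split I describe later in this post (paths are illustrative):

```dockerfile
# Stage 1: builder — has pip's build tooling, installs only deployment dependencies.
FROM python:3.10-slim AS builder
WORKDIR /app
COPY pyproject.toml .
COPY app/ app/
RUN pip install --no-cache-dir --prefix=/install .   # core deps only, no dev extras

# Stage 2: runtime — gets the installed packages and the app code, nothing else.
FROM python:3.10-slim
WORKDIR /app
COPY --from=builder /install /usr/local
COPY --from=builder /app/app app/
CMD ["python", "app/main.py"]
```

Everything the builder stage downloads to compile and install packages stays behind in stage 1; only the final site-packages and app code end up in the pushed image.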
Cloud-Native Deployment:
- Deployed on GKE with autoscaling, load balancing, and rolling updates
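For illustration, here's a trimmed manifest covering the rolling-update and autoscaling parts (names, ports, and thresholds are placeholders; the load balancing comes from exposing the Deployment through a LoadBalancer Service, which I've omitted):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: anime-recommender
spec:
  replicas: 2
  selector:
    matchLabels: {app: anime-recommender}
  strategy:
    type: RollingUpdate
    rollingUpdate: {maxSurge: 1, maxUnavailable: 0}  # zero-downtime rollouts
  template:
    metadata:
      labels: {app: anime-recommender}
    spec:
      containers:
        - name: web
          image: REGION-docker.pkg.dev/PROJECT/repo/anime-recommender:latest
          ports: [{containerPort: 5000}]
          resources:
            requests: {cpu: 250m, memory: 512Mi}
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: anime-recommender
spec:
  scaleTargetRef: {apiVersion: apps/v1, kind: Deployment, name: anime-recommender}
  minReplicas: 2
  maxReplicas: 5
  metrics:
    - type: Resource
      resource: {name: cpu, target: {type: Utilization, averageUtilization: 70}}
```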
Version Control:
- DVC for datasets & models, Git for source code
Monitoring:
- CometML for tracking metrics & experiments
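Wiring CometML in takes only a few lines; the project names and values below are placeholders:

```python
from comet_ml import Experiment  # import before TensorFlow so auto-logging hooks in

# Comet also picks up COMET_API_KEY from the environment.
experiment = Experiment(project_name="anime-recommender", workspace="your-workspace")

experiment.log_parameters({"embedding_dim": 128, "sample_rows": 5_000_000})
experiment.log_metric("val_loss", 0.31, step=5)  # illustrative value
experiment.end()
```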
5. Optimisation Wins
✅ 94% smaller Docker image (10GB → 600MB)
✅ 10x faster deployments
✅ <100ms recommendation latency
✅ Cost savings with efficient resource usage
6. Key Learnings
MLOps is not optional if you want production-grade ML
Cloud-native tools make scaling simple — if you design for it early
A smaller, optimised Docker image saves both time and money
Even as a fresher, you can build something industry-ready by following good practices
7. Problems I Faced and How I Overcame Them
Learning the whole tech stack took me two weeks, even though I studied non-stop for roughly 12 hours a day, and once I started building I ran into way too many problems. The most difficult one was:
Docker Image Size (~10 GB)
At first I was doing everything inside my Docker image: installing all dependencies for both development and deployment, running my full pipeline inside the image, and copying in all sorts of baggage, raw data, model, and weights included. When I checked Docker Desktop before publishing to the GCP Artifact Registry, I noticed the image was massive, and since GCP bills you for the resources you consume, that would have cost me my whole bank account.
How I overcame it: I started looking for a way to reduce the image size. I went through dozens of YouTube tutorials and asked ChatGPT, and all that gave me was the idea to use docker-compose: in essence, use two images, one for model training and one for hosting the web app. I tried it, wrote a docker-compose.yaml, and changed my workflow. But even though the two-image setup, with a volume mounted for shared data like embeddings and the processed dataset, did shrink things, the real problem remained cost: pushing both images to Google Cloud would still incur additional charges.
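For reference, the abandoned two-image setup looked roughly like this (a reconstruction with hypothetical file names, not the exact YAML I wrote):

```yaml
services:
  trainer:
    build: {context: ., dockerfile: Dockerfile.train}  # training image
    volumes: ["shared-artifacts:/app/artifacts"]
  web:
    build: {context: ., dockerfile: Dockerfile.web}    # web-app image
    ports: ["5000:5000"]
    volumes: ["shared-artifacts:/app/artifacts"]       # reads what the trainer wrote
    depends_on: [trainer]
volumes:
  shared-artifacts:
```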
After all this deliberation I arrived at a solution. I already had a Jenkins VM instance running, and during the image build I was effectively running training inside Jenkins anyway. So why not run the training in Jenkins itself and build only the web-app image that's actually needed? This was the breakthrough of my MLOps project. I now installed only the necessary libraries by splitting the dependencies in my pyproject.toml file in two:
Development (for Jenkins)
Deployment (for the Docker image)
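A minimal sketch of that split using optional dependencies in pyproject.toml (the exact package lists are illustrative):

```toml
[project]
name = "anime-recommender"
version = "0.1.0"
# Deployment dependencies: the only ones installed inside the Docker image.
dependencies = [
    "flask",
    "numpy",
    "pandas",
]

[project.optional-dependencies]
# Development/training dependencies: installed only on the Jenkins VM.
dev = [
    "tensorflow",
    "scikit-learn",
    "dvc",
    "comet-ml",
    "mlflow",
]
```

The Docker image then runs a plain `pip install .`, while Jenkins runs `pip install ".[dev]"` to pull in the full training stack.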
Just this small change made the biggest impact in the project, cutting the image size to ~700 MB. Later I used a multi-stage build to shave off an additional ~100 MB.
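The final Jenkins flow, as an illustrative declarative pipeline (stage names, script paths, and the image URI are placeholders):

```groovy
pipeline {
    agent any
    environment {
        IMAGE_URI = 'REGION-docker.pkg.dev/PROJECT/repo/anime-recommender:latest'
    }
    stages {
        stage('Install dev dependencies') {
            steps { sh 'pip install ".[dev]"' }        // full training stack, VM only
        }
        stage('Train model') {
            steps { sh 'python pipeline/train.py' }    // writes embeddings/weights to artifacts/
        }
        stage('Build web image') {
            steps { sh 'docker build -t $IMAGE_URI .' }
        }
        stage('Push to Artifact Registry') {
            steps { sh 'docker push $IMAGE_URI' }
        }
    }
}
```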
Conclusion
This project taught me more than any course could — not just about ML, but about building reliable, maintainable, and scalable systems.
If you’re a fresher learning Data Science, my advice: don’t just build models, learn how to deploy and maintain them.
If you have any suggestions, questions, or advice for me, I'd appreciate you reaching out, either in the comments or on LinkedIn.
Side note: after testing that everything was working as it should, I deleted my cloud resources, because they cost money that I don't have. If you're looking for the web app deployment, you can visit the Hugging Face Space where I deployed it for free.
🔗 Deployment: https://huggingface.co/spaces/aman-yadav-ds/anime-recommender
🔗 GitHub: https://github.com/aman-yadav-ds/anime-recommender
🔗 LinkedIn: https://www.linkedin.com/in/amandeep-yadav-2251b325a/