Infra Made Simple: For Data Scientists, ML Engineers & Research Scientists

Table of contents
- A Zero-to-Hero Guide for AI Practitioners Navigating Infrastructure
- 1. The Infra Mindset for Data Roles
- 2. From Code to Pod: A User Request's Journey
- 3. Core Infra Concepts You Should Know
- 4. Metrics That Matter in ML/AI Systems
- 5. Bottlenecks You’ll Hit — and How to Fix Them
- 6. Best Practices for Infra-Aware ML Engineers
- 7. CI/CD & Long Builds — Why They Hurt and How to Fix Them
- 8. Common Mistakes & Lessons Learned
- 9. Real-World Cheat Sheet
- 10. Final Words for Data-Focused Engineers

A Zero-to-Hero Guide for AI Practitioners Navigating Infrastructure
Why This Guide?
Have you ever:
- Struggled to understand CI/CD errors like `No space left on device`?
- Waited 40 minutes for a simple app to deploy?
- Asked "What’s a pod?" during a critical production incident?
This document is your go-to resource to:
- Understand how your code actually runs in production
- Collaborate effectively with DevOps and platform teams
- Troubleshoot infra-related issues without being a Kubernetes expert

Designed for non-DevOps professionals working in AI — including data scientists, ML engineers, and researchers.
1. The Infra Mindset for Data Roles
"Infra isn’t someone else’s job anymore. It’s part of building intelligent apps."
Modern ML systems are:
- Real-time (e.g., chatbots, APIs)
- Multi-service (e.g., RAG pipelines, ingestion workflows)
- GPU-dependent (LLMs, CV models)
Basic infra knowledge saves hours of debugging, enables faster iteration, and improves reliability.
2. From Code to Pod: A User Request's Journey
Here’s what happens when a user interacts with your deployed service:
1. The user sends a request (e.g., via Teams or an API call)
2. A load balancer routes it to a healthy pod
3. The pod (a containerized app instance) runs your logic
4. Your app queries data/models and returns a response
Knowing this lets you debug which layer broke (app? infra? scaling?).
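The pod's side of this journey can be sketched with a minimal service using only the standard library. The `/healthz` route stands in for what a load balancer or Kubernetes readiness probe would poll to decide whether the pod is healthy enough to receive traffic; the `/predict` route and its return value are placeholders, not from any real app.

```python
# Minimal sketch of the pod's side of the request journey (stdlib only).
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

class AppHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/healthz":
            # The load balancer / readiness probe checks this endpoint
            self._reply(200, {"status": "ok"})
        elif self.path == "/predict":
            # Your app logic: query data/models, return a response
            self._reply(200, {"prediction": 0.42})
        else:
            self._reply(404, {"error": "not found"})

    def _reply(self, code, payload):
        body = json.dumps(payload).encode()
        self.send_response(code)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep output quiet for this demo

server = HTTPServer(("127.0.0.1", 0), AppHandler)  # port 0 = pick a free port
threading.Thread(target=server.serve_forever, daemon=True).start()

port = server.server_address[1]
with urllib.request.urlopen(f"http://127.0.0.1:{port}/healthz") as r:
    health = json.loads(r.read())
server.shutdown()
print(health)  # {'status': 'ok'}
```

If `/healthz` stops returning 200, Kubernetes pulls the pod out of rotation, which is why "app works locally but gets no traffic" is often a probe problem, not a code problem.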
3. Core Infra Concepts You Should Know
| Term | What It Means |
| --- | --- |
| Image | Frozen code + dependencies (a Docker snapshot) |
| Container | A running instance of an image (isolated) |
| Pod | A Kubernetes unit running one or more containers |
| Node | A physical or virtual machine that hosts pods |
| Cluster | A group of nodes managed by Kubernetes |
| Deployment | YAML config that defines how pods are run |
| Service | A stable endpoint for accessing a set of pods |
4. Metrics That Matter in ML/AI Systems
| Metric | Description | Why You Should Care |
| --- | --- | --- |
| Latency | Time taken to serve one request | Impacts user satisfaction |
| RPS | Requests handled per second | Determines how scalable you are |
| Throughput | Maximum stable request rate | Indicates system limits |
| p95 Latency | Response time at the 95th percentile (the slowest 5% of requests take at least this long) | Highlights spikes/bottlenecks that averages hide |
| Memory | RAM usage of your pod | Prevents OOM kills and slowness |
5. Bottlenecks You’ll Hit — and How to Fix Them
| Symptom | Root Cause | Fix |
| --- | --- | --- |
| CI build takes 45 minutes | No caching, poorly ordered Dockerfile | Use `--cache-from`; avoid `COPY . .` early |
| `No space left on device` error | Image layers too large | Clean up old builds; use `.dockerignore` |
| Pod stuck in `Pending` | No compatible node (e.g., no free GPU) | Add GPU nodes or a queue fallback |
| Frequent OOM crashes | RAM underestimated | Set memory requests/limits in YAML |
| Logs are missing or unclear | Not printed or collected properly | Log to stdout; use `kubectl logs` |
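For the OOM row above, one hedged way to estimate how much headroom a step needs before it ever reaches a pod: `tracemalloc` (stdlib) tracks peak Python allocations. The million-element list is a stand-in for loading a real model or dataset.

```python
# Sketch: measuring peak memory of a step so you can size pod limits with headroom.
import tracemalloc

tracemalloc.start()
big = [0] * 1_000_000  # stand-in for loading a model or dataset
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

peak_mb = peak / 1_000_000
print(f"peak allocation: {peak_mb:.1f} MB")
```

Note that `tracemalloc` only sees Python-level allocations; native libraries (NumPy buffers, CUDA memory) need their own accounting, so treat this as a lower bound when setting limits.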
6. Best Practices for Infra-Aware ML Engineers
Docker
- Use `python:3.11-slim` or a similar lean base image
- Separate `COPY requirements.txt` + install steps so dependency layers cache
- `.dockerignore` = lifesaver for large repos
Kubernetes
- Always set `resources.requests` and `resources.limits`
- Use an HPA (Horizontal Pod Autoscaler)
- Define readiness/liveness probes
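A minimal sketch of what those settings look like inside a Deployment's container spec. The specific numbers, port, and `/healthz` path are illustrative placeholders; tune them to your app.

```yaml
resources:
  requests:
    cpu: "500m"
    memory: "1Gi"
  limits:
    cpu: "1"
    memory: "2Gi"
readinessProbe:
  httpGet:
    path: /healthz
    port: 8080
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 10
```

Requests are what the scheduler reserves for you; limits are where the kernel steps in (throttling CPU, OOM-killing on memory), so under-requesting memory is the usual cause of the "frequent OOM crashes" symptom above.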
Monitoring
- Emit logs in a structured format
- Capture latency, memory, and error rates
- Share dashboards with DevOps for visibility
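"Structured format" in practice usually means one JSON object per log line, so collectors can filter on fields instead of grepping free text. A minimal sketch with the stdlib `logging` module; the logger name and field names are illustrative.

```python
# Sketch: emitting structured (JSON) logs to stdout, one object per line.
import json
import logging
import sys
import time

class JsonFormatter(logging.Formatter):
    def format(self, record):
        return json.dumps({
            "ts": time.strftime("%Y-%m-%dT%H:%M:%S", time.gmtime(record.created)),
            "level": record.levelname,
            "msg": record.getMessage(),
            "latency_ms": getattr(record, "latency_ms", None),
        })

handler = logging.StreamHandler(sys.stdout)  # pods should log to stdout
handler.setFormatter(JsonFormatter())
log = logging.getLogger("model-api")
log.addHandler(handler)
log.setLevel(logging.INFO)

log.info("request served", extra={"latency_ms": 42})
```

Logging to stdout (not files inside the container) is what lets `kubectl logs` and the cluster's log collector see your output at all.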
Workflow Hygiene
- Test locally with mocks before pushing
- Run `docker build . && docker run` before pushing
- Document infra assumptions (RAM/CPU needed) in the README
7. CI/CD & Long Builds — Why They Hurt and How to Fix Them
Problem: change one line → the entire pipeline re-runs.
Why?
- You copied all the code too early in the Dockerfile
- No layer caching
- `pip install` re-runs every time
Better Dockerfile:

```dockerfile
FROM python:3.11-slim
WORKDIR /app

# Install dependencies first so this layer is cached across builds
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy application code last so code changes don't invalidate the dependency cache
COPY . .
CMD ["python", "app.py"]
```
Also:
- Use the build cache in GitHub/GitLab CI
- Avoid rebuilding on markdown/doc-only commits
8. Common Mistakes & Lessons Learned
| Mistake | What You Should Do Instead |
| --- | --- |
| Ignored memory limits in YAML | Set CPU/memory `requests`/`limits` |
| Pushed without mocking downstream services | Use stubs for fast validation |
| App crashed and logs were missing | Use `kubectl logs -f <pod>` |
| Copied the whole repo early in the Docker build | Copy only needed files; cache wisely |
| Didn’t know why a pod was stuck | Use `kubectl describe pod` |
| Image size ballooned | Use `.dockerignore` and a slim base image |
9. Real-World Cheat Sheet
| Goal | Command or Tip |
| --- | --- |
| See running pods | `kubectl get pods` |
| View logs for a pod | `kubectl logs <pod-name>` |
| Explain why a pod is stuck | `kubectl describe pod <pod-name>` |
| Free up Docker disk space | `docker system prune` |
| View memory usage | `top`, `free -h`, or the pod dashboard |
| Build & test locally | `docker build . && docker run` |
| Identify top memory consumers | `ps aux --sort=-%mem` |
10. Final Words for Data-Focused Engineers
Infra is no longer optional for ML teams. You don’t need to master Kubernetes, but you should:
Know how your code gets deployed
Track memory, latency, and pod health
Build with infra in mind (not as an afterthought)
Great models + poor infra = poor user experience
This doc is your on-ramp.
Written by

Sai Sandeep Kantareddy
Senior ML Engineer | GenAI + RAG Systems | Fine-tuning | MLOps | Conversational & Document AI