Infra Made Simple: For Data Scientists, ML Engineers & Research Scientists

A Zero-to-Hero Guide for AI Practitioners Navigating Infrastructure

Why This Guide?

Have you ever:

  • Struggled to understand CI/CD errors like No space left on device?

  • Waited 40 minutes for a simple app to deploy?

  • Asked "What’s a pod?" during a critical production incident?

This document is your go-to resource to:

  • Understand how your code actually runs in production

  • Collaborate effectively with DevOps and platform teams

  • Troubleshoot infra-related issues without being a Kubernetes expert

Designed for non-DevOps professionals working in AI — including data scientists, ML engineers, and researchers.

1. The Infra Mindset for Data Roles

"Infra isn’t someone else’s job anymore. It’s part of building intelligent apps."

Modern ML systems are:

  • Real-time (e.g., chatbots, APIs)

  • Multi-service (e.g., RAG pipelines, ingestion workflows)

  • GPU-dependent (LLMs, CV models)

Basic infra knowledge saves hours of debugging, enables faster iteration, and improves reliability.

2. From Code to Pod: A User Request's Journey

Here’s what happens when a user interacts with your deployed service:

  1. User sends a request (e.g., via Teams or API call)

  2. Load balancer routes it to a healthy pod

  3. Pod (a containerized app instance) runs your logic

  4. Your app queries data/models, returns response

Knowing this lets you debug which layer broke (app? infra? scaling?).

3. Core Infra Concepts You Should Know

| Term | What It Means |
| --- | --- |
| Image | Frozen code + dependencies (a Docker snapshot) |
| Container | A running, isolated instance of an image |
| Pod | The smallest K8s unit, running one or more containers |
| Node | A physical or virtual machine that hosts pods |
| Cluster | A group of nodes managed by Kubernetes |
| Deployment | YAML config that defines how pods are run |
| Service | A stable endpoint for accessing a set of pods |
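To make these terms concrete, here is a minimal, hypothetical Deployment plus Service for an app called my-ml-app — the name, image tag, and port are placeholders, not from this guide:

```yaml
# Hypothetical manifest: one Deployment (how pods run) plus one Service (stable endpoint).
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-ml-app
spec:
  replicas: 2                     # two pods for availability
  selector:
    matchLabels:
      app: my-ml-app
  template:
    metadata:
      labels:
        app: my-ml-app
    spec:
      containers:
        - name: my-ml-app
          image: my-ml-app:latest # built from your Docker image
          ports:
            - containerPort: 8000
---
apiVersion: v1
kind: Service
metadata:
  name: my-ml-app
spec:
  selector:
    app: my-ml-app               # routes traffic to any pod with this label
  ports:
    - port: 80
      targetPort: 8000
```

The Service never targets pods directly; it matches labels, which is why a crashed pod can be replaced without clients noticing.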

4. Metrics That Matter in ML/AI Systems

| Metric | Description | Why You Should Care |
| --- | --- | --- |
| Latency | Time taken to serve one request | Impacts user satisfaction |
| RPS | Requests handled per second | Shows how much load you can absorb |
| Throughput | Maximum stable request rate | Indicates system limits |
| p95 Latency | Response time that 95% of requests beat | Highlights spikes and bottlenecks |
| Memory | RAM usage of your pod | Prevents OOM kills and slowness |
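A quick way to see why p95 matters: compute it from a handful of made-up latencies with standard shell tools (nearest-rank method; the numbers are purely illustrative):

```shell
# Made-up per-request latencies in milliseconds, one value per line.
latencies="100
110
120
130
140
150
160
170
180
900"

n=$(printf '%s\n' "$latencies" | wc -l)
idx=$(( (n * 95 + 99) / 100 ))   # nearest-rank index for the 95th percentile
p95=$(printf '%s\n' "$latencies" | sort -n | sed -n "${idx}p")
echo "p95 latency: ${p95} ms"
```

The mean here is about 216 ms and hides the outlier; p95 surfaces the 900 ms tail that your slowest users actually experience.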

5. Bottlenecks You’ll Hit — and How to Fix Them

| Symptom | Root Cause | Fix |
| --- | --- | --- |
| CI build takes 45 minutes | No caching, poorly ordered Dockerfile | Use --cache-from; avoid COPY . . early |
| "No space left on device" error | Image layers too large | Clean up old builds; use .dockerignore |
| Pod stuck in Pending | No compatible node (e.g., no GPU available) | Add GPU nodes or a queue fallback |
| Frequent OOM crashes | RAM underestimated | Set memory requests/limits in YAML |
| Logs missing or unclear | Not printed or collected properly | Log to stdout; use kubectl logs |

6. Best Practices for Infra-Aware ML Engineers

Docker

  • Use python:3.11-slim or similar lean images

  • COPY requirements.txt and install dependencies before copying the rest of the code, so those layers stay cached

  • .dockerignore = lifesaver for large repos
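A starter .dockerignore for a typical ML repo — these entries are suggestions, not from this guide; keep anything your build actually needs:

```
.git
__pycache__/
*.pyc
.venv/
data/
models/       # large weights: mount or download at runtime instead
notebooks/
tests/
```

Every excluded path shrinks the build context, which speeds up both the docker build upload step and layer hashing.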

Kubernetes

  • Always set resources.requests and limits

  • Use HPA (horizontal pod autoscaler)

  • Define readiness/liveness probes
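The bullets above map to a fragment of the container spec; a sketch with illustrative values (tune them to your model's actual footprint):

```yaml
# Hypothetical fragment of a Deployment's container spec.
resources:
  requests:
    cpu: "500m"
    memory: "2Gi"       # what the scheduler reserves for the pod
  limits:
    cpu: "2"
    memory: "4Gi"       # exceeding this gets the container OOM-killed
readinessProbe:          # don't send traffic until the model is loaded
  httpGet:
    path: /health
    port: 8000
  initialDelaySeconds: 10
livenessProbe:           # restart the container if it stops responding
  httpGet:
    path: /health
    port: 8000
  periodSeconds: 30
```

For ML services the readiness probe matters most: model loading can take minutes, and without it Kubernetes will route requests to pods that are not ready.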

Monitoring

  • Emit logs in structured format

  • Capture latency, memory, error rates

  • Share dashboards with DevOps for visibility
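A minimal sketch of structured logging from a shell entrypoint — the JSON field names here are just one possible convention:

```shell
# Print one JSON object per log event so collectors can parse fields reliably.
log_json() {
  printf '{"ts":"%s","level":"%s","msg":"%s"}\n' \
    "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$1" "$2"
}

line=$(log_json INFO "model loaded")
echo "$line"
```

One event per line, always to stdout: that is what kubectl logs and most log collectors expect.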

Workflow Hygiene

  • Test locally with mocks before pushing

  • Run docker build and docker run locally before every push

  • Document infra assumptions (RAM/CPU needed) in README

7. CI/CD & Long Builds — Why They Hurt and How to Fix Them

Problem: Change one line → entire pipeline re-runs

Why?

  • You copied all code too early

  • No layer caching

  • Pip install reruns every time

Better Dockerfile:

FROM python:3.11-slim

WORKDIR /app

# Copy only the dependency list first so the install layer stays cached
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Code changes only invalidate the layers below this line
COPY . .
CMD ["python", "app.py"]

Also:

  • Use build cache in GitHub/GitLab CI

  • Avoid rebuilding on markdown/doc-only commits
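For example, GitHub Actions can skip doc-only commits with a paths-ignore trigger — the paths below are placeholders for your repo layout:

```yaml
# Hypothetical workflow trigger: pushes that only touch docs don't rebuild.
on:
  push:
    paths-ignore:
      - '**.md'
      - 'docs/**'
```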

8. Common Mistakes & Lessons Learned

| Mistake | What You Should Do Instead |
| --- | --- |
| Ignored memory limits in YAML | Set CPU/memory requests and limits |
| Pushed without mocking downstream services | Use stubs for fast validation |
| App crashed and logs were missing | Stream logs with kubectl logs -f <pod> |
| Copied the whole repo early in the Docker build | Copy only needed files; order layers for caching |
| Didn't know why a pod was stuck | Run kubectl describe pod <pod> |
| Image size ballooned | Use .dockerignore and a slim base image |

9. Real-World Cheat Sheet

| Goal | Command or Tip |
| --- | --- |
| See running pods | kubectl get pods |
| View logs for a pod | kubectl logs <pod-name> |
| Explain why a pod is stuck | kubectl describe pod <pod-name> |
| Free up Docker disk space | docker system prune |
| View memory usage | top, free -h, or your pod dashboard |
| Build & test locally | docker build -t app . && docker run app |
| Identify top memory consumers | ps aux --sort=-%mem |

10. Final Words for Data-Focused Engineers

Infra is no longer optional for ML teams. You don’t need to master Kubernetes, but you should:

  • Know how your code gets deployed

  • Track memory, latency, and pod health

  • Build with infra in mind (not as an afterthought)

Great models + poor infra = poor user experience

This doc is your on-ramp.

Written by Sai Sandeep Kantareddy