ML Experience: MLOps, ModelOps, DataOps & DevOps

Amit Sides

Undoubtedly, 2025 will be dedicated to disentangling the relationship between ModelOps, the way we manage the Model Development Life Cycle (MDLC), and DataOps, the way we manage data pipelines (ETL/ELT). In this article, this relationship is examined with DevOps in mind, along with the underlying cloud infrastructure and the set of tools we use in the complex process of giving our Data Scientists and ML Engineers a development experience and eventually deploying and serving model inference in production. As always, my claim remains that the challenge is in orchestration.

ML-DataOps Pipelines and Workflow: Raw→Features→Vectors→Inferences

Creatively Orchestrate Workflows

Draw a DAG Diamond in Python with Hera
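To make the DAG diamond concrete, here is a minimal sketch using Hera, the Python SDK for Argo Workflows (the task names and echo script are illustrative):

from hera.workflows import DAG, Workflow, script

@script()
def echo(message: str):
    # Each task simply prints its name
    print(message)

# A diamond: A fans out to B and C, which both fan in to D
with Workflow(generate_name="dag-diamond-", entrypoint="diamond") as w:
    with DAG(name="diamond"):
        A = echo(name="A", arguments={"message": "A"})
        B = echo(name="B", arguments={"message": "B"})
        C = echo(name="C", arguments={"message": "C"})
        D = echo(name="D", arguments={"message": "D"})
        A >> [B, C] >> D

# w.create() submits the workflow to your Argo Workflows server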

GCP Workflows

Data Ingestion: Streaming Data & Analytics

Apache Airflow (data lineage), AWS Glue, Apache Beam, Dataproc, Dataflow

Tip: Argo Workflows Explained

Where to start when considering future throughput? Apache Samza?

When designing our future high-scale infrastructure to handle high throughput, we can consider:

  1. Maintaining Apache Kafka on Kubernetes (see node-termination-handler).

  2. A managed Kafka service:

    1. Confluent

    2. GCP Pub/Sub

    3. Lenses (AWS MSK)

See schemas: CloudEvents JSON, Avro
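To tie the schema choice back to the pipeline, here is a minimal sketch (assuming the cloudevents and kafka-python packages; broker, topic, and event names are illustrative) that wraps a feature update in a CloudEvents JSON envelope and publishes it to Kafka:

from cloudevents.http import CloudEvent
from cloudevents.conversion import to_structured
from kafka import KafkaProducer

attributes = {
    "type": "com.example.features.updated",   # event type (illustrative)
    "source": "dataops/feature-pipeline",     # producing pipeline (illustrative)
}
data = {"entity_id": "user_42", "feature": "avg_session_len", "value": 12.7}

# Structured mode: the whole event, attributes included, becomes a JSON body
headers, body = to_structured(CloudEvent(attributes, data))

producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("feature-events", value=body)
producer.flush()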

Our data takes too many shapes (JSON, ONNX, HDFS), and there are even more tools we can use in our DataOps/data-lineage pipelines. We want to follow DevOps best practices and choose the right tool for each task.

Data Versioning & Preprocessing

[ Figure (from TensorFlow): raw data → prepared data → engineered features ]

[ Uber’s great article on feature engineering, orchestrating HDFS, Spark, Samza, Cassandra, MLlib, XGBoost, and TensorFlow for real business purposes, is inspiring. ]

Lineage: Data as Metadata

Ask yourself: where do data-lineage and data-metadata tools (OpenLineage, Amundsen) fit in with your data pipelines and data warehouses?
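As a concrete starting point, here is a minimal sketch using the openlineage-python client to emit a run event for a hypothetical job (the URL, namespace, and job name are illustrative assumptions):

import uuid
from datetime import datetime, timezone

from openlineage.client import OpenLineageClient
from openlineage.client.run import Job, Run, RunEvent, RunState

client = OpenLineageClient(url="http://localhost:5000")  # e.g. a Marquez backend

# Mark the start of a (hypothetical) daily feature-build job
event = RunEvent(
    eventType=RunState.START,
    eventTime=datetime.now(timezone.utc).isoformat(),
    run=Run(runId=str(uuid.uuid4())),
    job=Job(namespace="dataops", name="features.daily_build"),
    producer="https://github.com/example/pipeline",
)
client.emit(event)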

Knitfab

Repo: https://github.com/opst/knitfab

An MLOps system & tool that releases AI/ML engineers from trivial routines.

  • Automatic tracking of 'lineage'

  • Dynamic workflow based on 'tags'

  • Use of K8s and containers to isolate ML tasks

MLOps-Data: From Features to Vectors

Feature Engineering, Feature Store

Decoupling ML from data: feature stores and registries make for better ML data pipelines.

[ Feast Overview | Featuretools Overview | Databricks | Feast on Kubernetes ]

Feature pipelines (feature engineering + data-pipeline architecture) form the first stages before training our model.
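For example, a minimal Feast feature-view definition, assuming a Parquet file of driver statistics (entity, field, and path names are illustrative):

from datetime import timedelta

from feast import Entity, FeatureView, Field, FileSource
from feast.types import Float32, Int64

# The business entity our features describe
driver = Entity(name="driver", join_keys=["driver_id"])

# Offline source: a Parquet file with an event-timestamp column
stats_source = FileSource(
    path="data/driver_stats.parquet",
    timestamp_field="event_timestamp",
)

# A named, versionable group of features, decoupled from model code
driver_stats_fv = FeatureView(
    name="driver_hourly_stats",
    entities=[driver],
    ttl=timedelta(days=1),
    schema=[
        Field(name="conv_rate", dtype=Float32),
        Field(name="avg_daily_trips", dtype=Int64),
    ],
    source=stats_source,
)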

Data Science Experience or MLExp: Data Science, Notebooks, Jupyter & CUDA

The most important part of MLOps is the model-development experience: letting Data Scientists and ML Engineers jump right into running powerful notebooks on NVIDIA CUDA GPUs with K8s.

Clone this: Data Science Notebooks on K8s with NVIDIA CUDA GPUs, PyTorch, and Jupyter.
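Once a notebook pod lands on a GPU node, a quick PyTorch sanity check confirms CUDA is visible:

import torch

# Should print True and the GPU model if the pod was scheduled on a CUDA node
print(torch.cuda.is_available())
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))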

Data Science ToolSet

  1. TinyBird

  2. HopsWorks

  3. DataChain

  4. Prefect

  5. DVC

  6. H2O

  7. Dagster

  8. StoneBranch

  9. Peliqan

K8s + ModelOps: Pods of TensorFlow

[ MLflow Overview | Kubeflow Overview | MetaFlow’s Infrastructure ]

apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  generateName: tfjob-
  namespace: your-user-namespace
spec:
  tfReplicaSpecs:
    Worker:
      replicas: 2
      # pod template with your TensorFlow training container goes here

AI/ML Lifecycle Training Operator

Hyperparameter Libraries

  1. https://github.com/autonomio/talos

  2. https://github.com/kubeflow/katib
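As a flavor of what these libraries do, here is a rough Talos sketch over synthetic data (the parameter grid and model are illustrative, and the exact API may differ across Talos versions):

import numpy as np
import talos
from tensorflow import keras

# Synthetic binary-classification data, just for the demo
x = np.random.rand(100, 4)
y = np.random.randint(0, 2, 100)

def build_model(x_train, y_train, x_val, y_val, params):
    model = keras.Sequential([
        keras.layers.Dense(params["units"], activation="relu"),
        keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer=params["optimizer"],
                  loss="binary_crossentropy", metrics=["accuracy"])
    history = model.fit(x_train, y_train,
                        validation_data=(x_val, y_val),
                        epochs=params["epochs"], verbose=0)
    return history, model

# Talos scans every combination in the grid and logs the results
params = {"units": [8, 16, 32], "optimizer": ["adam", "sgd"], "epochs": [10]}
scan = talos.Scan(x=x, y=y, params=params, model=build_model,
                  experiment_name="hpo-demo")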

SOTA Model Zoos

AutoML [ ONNX | Modelzoo | AutoKeras | Hugging Face MTEB | Trax | Model Registry | ML Metadata ]

“AutoEmbedding” + VectorStore + VectorDB

“AutoEmbed” if you wish not to reinvent the wheel today: import pre-trained models with ready-to-use, state-of-the-art code snippets, wrap them up as a Docker container to easily generate embeddings at scale, and ship them into a VectorDB.
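For instance, a pre-trained sentence-transformers model (the model name is illustrative) gives ready-to-use embeddings in a few lines, which you can then containerize:

from sentence_transformers import SentenceTransformer

# Any pre-trained embedding model from the Hugging Face hub
model = SentenceTransformer("all-MiniLM-L6-v2")

# Batch-encode documents into dense vectors, ready for a VectorDB
embeddings = model.encode(["MLOps is orchestration.", "DataOps feeds MLOps."])
print(embeddings.shape)  # (2, 384) for this model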

AWS Bedrock Embeddings & LlamaIndex

# custom selection of integrations to work with llama-index-core
pip install llama-index-core
pip install llama-index-llms-openai
pip install llama-index-llms-replicate
pip install llama-index-embeddings-huggingface

import os
os.environ["OPENAI_API_KEY"] = "YOUR_OPENAI_API_KEY"

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

# Load documents from disk and index them into an in-memory vector store
documents = SimpleDirectoryReader("YOUR_DATA_DIRECTORY").load_data()
index = VectorStoreIndex.from_documents(documents)

TOP 5 - Massive Text Embedding Benchmark (MTEB)

from langchain_openai import OpenAIEmbeddings

# One of the strong MTEB performers at the time of writing
embed = OpenAIEmbeddings(
    model="text-embedding-3-large"
)

vector = embed.embed_query("What is MLOps?")  # a single dense vector

Choose your VectorDB

Vector DB   | Best For                           | Not Recommended For
Milvus 2.0  | Large-scale production deployments | Small datasets (<100k vectors)
Pinecone    | Cloud-native applications          | On-premise requirements
Weaviate    | Semantic search applications       | High-cardinality exact search
Qdrant      | Real-time applications             | Batch processing only
Chroma      | Prototyping and small datasets     | Large-scale deployments
pgvector    | PostgreSQL integration             | High-performance requirements

LangChain VectorStore
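A minimal LangChain vector-store sketch, assuming the langchain-chroma integration (the texts and query are illustrative):

from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings

# Embed a few documents and index them in a local Chroma collection
vectorstore = Chroma.from_texts(
    texts=["MLOps is about orchestration.", "Feature stores decouple ML from data."],
    embedding=OpenAIEmbeddings(model="text-embedding-3-large"),
)

# Nearest-neighbor search over the stored vectors
docs = vectorstore.similarity_search("what is MLOps about?", k=1)
print(docs[0].page_content)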

DevSecOps Note: Vectors with Data Poisoning?

Deployment & Serving: Production Inferences

I won’t discuss the actual “deployment(.yaml)” in this section because I covered it in detail in the previous article. Instead, I’ll discuss the relationships between VectorDBs, GPUs, and production model inference, or serving.

Philosophical note: if our embeddings are vector-stored and we query our VectorDB, isn’t our API’s JSON response just GenAI?

  1. HuggingFace Pipelines

  2. FastAPI

  3. ONNX Serving

git clone https://github.com/sayakpaul/ml-deployment-k8s-fastapi/tree/main
git clone https://github.com/kserve/kserve/tree/master
git clone https://github.com/DmitryBe/onnx-serving/tree/master
  4. Streaming inference pipelines

  5. vLLM TPU Dockerfile

https://github.com/vllm-project/vllm/blob/main/Dockerfile.tpu
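Items 1 and 2 combine naturally; here is a minimal sketch serving a Hugging Face pipeline behind FastAPI (the model choice and route are illustrative):

from fastapi import FastAPI
from pydantic import BaseModel
from transformers import pipeline

app = FastAPI()
classifier = pipeline("sentiment-analysis")  # any Hugging Face pipeline task

class PredictRequest(BaseModel):
    text: str

@app.post("/predict")
def predict(req: PredictRequest):
    # Run inference and return the top label + score as JSON
    return classifier(req.text)[0]

# Run with: uvicorn main:app --host 0.0.0.0 --port 8000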

Summary: Machine Learning Operations + Data Mesh

This article aimed to frame the challenge of orchestrating the many tools across the different aspects of ModelOps. On the one hand, we have many tools that help orchestrate by building flows and pipelines, giving shape to everything and “gluing” it together; on the other hand, we have too many tools :)

Personal Tip: Recommended Books to advance your knowledge

Aim to learn more about orchestrating MLOps with an event-driven data mesh in a cloud-native environment.

Check out my git repository MLPlay for code examples.

