Master Dataset and Model Tracking with Git and DVC for ML: Part 1

Digpal Singh

When I started building machine learning models with actual datasets—not the 10-line CSVs from tutorials—I realized Git was no longer enough. Every time I tried to commit a large file, Git would slow down, yell at me, or worse, just break the repo.

That’s when I discovered DVC (Data Version Control) — and I wish I had found it earlier.

In this first part, I’ll walk you through what DVC is, what it solved for me, and how it fits alongside Git, so you can start using it without feeling overwhelmed.


🧠 First, What Exactly is DVC?

Think of DVC as Git for your data: it lets you version control your datasets and models without actually storing them in Git. Instead, the files live in a local cache and in remote storage that you configure (like Azure Blob Storage, Google Drive, or S3), and Git tracks only the small metadata files that point to them.

So your repo stays clean, your teammates stay sane, and your experiments stay reproducible.

Specifically, DVC is built for managing:

  • Datasets

  • Trained models

  • Experiment artifacts

You still get version control, collaboration, and reproducibility — without clogging your Git repo.


💡 But How Is It Different From Git LFS?

Great question — I used to think Git LFS (Large File Storage) was enough. But:

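  • Git LFS still ties large files to your Git hosting: they live on the LFS server that goes with the repository, often with its own storage and bandwidth quotas

  • It’s not built for ML workflows like pipelines, caching, and experiment tracking

  • It can’t push straight to a storage bucket you already own (S3, Azure Blob, and so on) without a separate LFS server in front of it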

DVC, on the other hand, feels like a full upgrade — made for ML devs by ML devs.


🔧 How DVC Works (Without the Jargon)

Here’s a real-world analogy:

Imagine you’re working on a documentary. Your Git repo is the script and project plan — small and sharable. But the raw video files? They’re huge.

DVC lets you:

  • Track those raw files without copying them into Git

  • Save a small pointer for each version (like “this version used video_v2.mov”, identified by a hash of its contents)

  • Share that info with teammates while the actual files sit safely in cloud storage

Neat, right?
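
To make the pointer idea concrete, here is a toy model in plain Python. It is not DVC's actual code: the track function, the cache folder layout, and the .pointer.json file are all made up for illustration (real DVC writes a YAML .dvc file when you run dvc add). What it does share with DVC is the core trick: hash the file, stash it under that hash, and keep only a tiny text file next to your code.

```python
import hashlib
import json
import shutil
import tempfile
from pathlib import Path


def file_md5(path: Path) -> str:
    """Hash the file in chunks so a large file never has to fit in memory."""
    digest = hashlib.md5()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def track(data_file: Path, cache_dir: Path) -> Path:
    """Copy data_file into a content-addressed cache and write a small pointer file."""
    checksum = file_md5(data_file)
    # Hypothetical layout for this sketch: <cache>/<first 2 hex chars>/<remaining chars>
    cached = cache_dir / checksum[:2] / checksum[2:]
    cached.parent.mkdir(parents=True, exist_ok=True)
    if not cached.exists():
        shutil.copy2(data_file, cached)
    # The pointer is the only thing that would be committed to Git.
    pointer = data_file.with_name(data_file.name + ".pointer.json")
    pointer.write_text(json.dumps(
        {"md5": checksum, "size": data_file.stat().st_size, "path": data_file.name},
        indent=2,
    ))
    return pointer


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        workspace = Path(tmp)
        video = workspace / "video_v2.mov"
        video.write_bytes(b"pretend this is several gigabytes of footage")
        pointer = track(video, workspace / "cache")
        print(pointer.read_text())
```

The pointer stays a few lines long no matter how big the video gets, which is why committing it to Git is painless.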


🌩️ DVC + Cloud = 💖

This is the part that truly blew my mind.

With just a few lines of configuration, I connected my Azure Blob Storage account to DVC. Now, every time I run dvc push, my data is uploaded to the cloud, and I don’t have to worry about Git repo size limits, email attachments, or manual uploads to Google Drive.

And guess what? Teammates who have access to the same storage can clone the repo, run dvc pull, and start working. No emailing files around, no confusion, just results.
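
If you are curious what push and pull boil down to, here is a rough sketch in plain Python. It is a simplified model, not how DVC is implemented: the push, pull, and object_path helpers are hypothetical, and a local folder stands in for the Azure container. The idea it illustrates is that push uploads only the objects the remote is missing, and pull uses the checksum from the pointer to fetch and verify the file.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def object_path(root: Path, checksum: str) -> Path:
    """Where an object with this checksum lives (hypothetical layout for this sketch)."""
    return root / checksum[:2] / checksum[2:]


def push(cache: Path, remote: Path) -> int:
    """Copy every cached object the remote lacks; return how many were copied."""
    copied = 0
    for obj in cache.glob("*/*"):
        target = remote / obj.parent.name / obj.name
        if not target.exists():
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(obj, target)
            copied += 1
    return copied


def pull(checksum: str, remote: Path, dest: Path) -> None:
    """Restore the file a pointer refers to and check that it arrived intact."""
    shutil.copy2(object_path(remote, checksum), dest)
    if hashlib.md5(dest.read_bytes()).hexdigest() != checksum:
        raise ValueError(f"checksum mismatch for {dest}")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        cache, remote = root / "my_cache", root / "remote"  # "remote" stands in for cloud storage
        payload = b"training data"
        checksum = hashlib.md5(payload).hexdigest()
        obj = object_path(cache, checksum)
        obj.parent.mkdir(parents=True)
        obj.write_bytes(payload)

        print("uploaded:", push(cache, remote))  # first push sends the new object
        print("uploaded:", push(cache, remote))  # second push finds nothing new to send
        teammate_copy = root / "teammate_data.csv"
        pull(checksum, remote, teammate_copy)    # a teammate only needs the checksum
        print(teammate_copy.read_bytes())
```

Because objects are named by their hash, unchanged files are never uploaded twice, and a teammate can tell immediately if a download is corrupted.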


📦 What DVC Is Not

To be clear, DVC doesn’t replace:

  • Git (you’ll still use Git for code)

  • Model registries like the one in MLflow (though DVC does some of that too)

  • Cloud platforms (it works with them, not against them)

Instead, it’s the glue that holds everything together in a sane, reproducible, and shareable way.
