Track Your Data with DVC - DVC Tutorial
Data science projects can quickly become messy, with lots of data and code files to manage, and the potential for conflicts and errors as different team members work on the project. To keep everything organized and under control, you need a version control system that is specifically designed for data science projects.
That's where Data Version Control (DVC) comes in. DVC is an open-source tool that helps data scientists and machine learning engineers manage their data and code in a Git-like fashion. With DVC, you can track changes to your data and code, collaborate with team members, and reproduce your experiments with ease.
In this post, we will walk through a simple tutorial on how to use DVC to manage a data science project.
Prerequisites:
Before we dive into the tutorial, you should have the following installed on your machine:
Git
Python 3 (with the pandas, scikit-learn, and joblib packages, which the scripts below import)
DVC
Also, you should have some basic knowledge of Git and Python.
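A quick way to confirm the command-line prerequisites is to check that each tool is on your PATH. The sketch below uses only the Python standard library. The helper name find_missing_tools is hypothetical, and the executable names git and dvc are the usual ones, though your installation may differ.

```python
import shutil

def find_missing_tools(tools=("git", "dvc")):
    """Return the names of command-line tools that are not found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]

if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Missing tools:", ", ".join(missing))
    else:
        print("All required tools were found.")
```

An empty result means both tools can be launched from your shell; it does not check their versions.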
Tutorial:
In this tutorial, we will create a simple data science project that predicts the price of a house based on its size. We will use DVC to manage our data and code.
- Create a new directory for your project:
mkdir house-price-prediction
cd house-price-prediction
- Initialize a Git repository:
git init
- Initialize a DVC repository:
dvc init
- Create a new Python script called train.py:
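import os
import pandas as pd
from sklearn.linear_model import LinearRegression
import joblib
# Load data
data = pd.read_csv('data/housing.csv')
# Split data into features and target.
# This dataset has no 'Size' or 'Price' columns, so total_rooms stands in
# for size and median_house_value for price.
X = data[['total_rooms']]
y = data[['median_house_value']]
# Train model
model = LinearRegression()
model.fit(X, y)
# Save model (create the models directory first; joblib.dump does not create it)
os.makedirs('models', exist_ok=True)
joblib.dump(model, 'models/housing.pkl')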
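For a single feature, the linear regression fitted in train.py comes down to finding a slope and an intercept by ordinary least squares. The minimal sketch below shows that calculation with the standard library only. The function name fit_line and the toy numbers are made up for illustration and are not taken from the housing dataset.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope is covariance(x, y) divided by variance(x).
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Toy data that lies exactly on the line y = 2x + 1 (illustrative only).
sizes = [1.0, 2.0, 3.0, 4.0]
prices = [3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_line(sizes, prices)
print(slope, intercept)
```

scikit-learn's LinearRegression solves the same least-squares problem, generalized to any number of features.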
- Create a new directory called data and download a dataset for house prices:
mkdir data
cd data
wget https://raw.githubusercontent.com/ageron/handson-ml2/master/datasets/housing/housing.csv
cd ..
- Add the data directory to DVC:
dvc add data
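Conceptually, dvc add hashes the content of your data, stores a copy in a content-addressed cache, and writes a small pointer file (here data.dvc) for Git to track. The sketch below is a simplified illustration of the hashing idea using only the standard library. It is not DVC's actual implementation, and the helper names file_md5 and cache_relpath are hypothetical. The two-character directory prefix mirrors the layout DVC's cache commonly uses, but the exact cache path varies between DVC versions.

```python
import hashlib
import os
import tempfile

def file_md5(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def cache_relpath(md5_hex):
    """Split a digest into a two-character directory and the remaining file name."""
    return os.path.join(md5_hex[:2], md5_hex[2:])

# Demonstrate on a small temporary file rather than the real dataset.
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "sample.csv")
    with open(sample, "wb") as f:
        f.write(b"hello")
    sample_hash = file_md5(sample)
    print(sample_hash, cache_relpath(sample_hash))
```

Because the pointer file records a content hash, editing the data produces a different hash, which is how a changed dataset shows up as a small, reviewable change in Git.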
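- Configure a default DVC remote (dvc push needs one), then commit the changes to Git and push the data to the remote. The remote name localremote and the path /tmp/dvc-storage are example values; substitute your own storage location:
dvc remote add -d localremote /tmp/dvc-storage
git add .
git commit -m "Initial commit"
dvc push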
- Create a new Python script called predict.py:
import joblib
# Load model
model = joblib.load('models/housing.pkl')
# Predict house price
size = [[1650]]
price = model.predict(size)
print(price)
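- Stage the predict.py script in Git. It is source code, so it belongs in Git rather than DVC (dvc add is meant for data and model files):
git add predict.py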
- Commit the changes to Git and DVC:
git add .
git commit -m "Added predict script"
dvc push
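- Update the train.py script to load the data through DVC's Python API. The API only reads tracked files, so the model is still saved to a local file, which you can then track with dvc add:
import os
import pandas as pd
from sklearn.linear_model import LinearRegression
import joblib
import dvc.api
# Load data through DVC. dvc.api.open returns a file object for a tracked file;
# dvc.api.get_url only returns the file's location in remote storage.
data_path = 'data/housing.csv'
with dvc.api.open(data_path) as f:
    data = pd.read_csv(f)
# Split data into features and target.
# This dataset has no 'Size' or 'Price' columns, so total_rooms stands in
# for size and median_house_value for price.
X = data[['total_rooms']]
y = data[['median_house_value']]
# Train model
model = LinearRegression()
model.fit(X, y)
# Save model to a local file (DVC tracks it once it is added with dvc add)
os.makedirs('models', exist_ok=True)
model_path = 'models/housing.pkl'
joblib.dump(model, model_path)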
- Commit the changes to Git and DVC (git push assumes you have already added a Git remote such as origin):
git add .
git commit -m "Updated train script to use the DVC API"
git push
dvc push
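- Now let's train our model again, track the model file with DVC, and push it. dvc add writes a models/housing.pkl.dvc pointer file, which should be committed to Git:
python train.py
dvc add models/housing.pkl
dvc push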
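- Update the predict.py script to use DVC to load the model:
import dvc.api
import joblib
# Load model through DVC. dvc.api.open returns a file object for a tracked file;
# dvc.api.get_url only returns the file's location in remote storage.
model_path = 'models/housing.pkl'
with dvc.api.open(model_path, mode='rb') as f:
    model = joblib.load(f)
# Predict house price
size = [[1650]]
price = model.predict(size)
print(price)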
- Commit the changes to Git and DVC:
git add .
git commit -m "Updated predict script to use DVC to load model"
dvc push
- Now you can run the predict.py script to make predictions:
python predict.py
Congratulations!
That's it! You have successfully used DVC alongside Git to manage the data, model, and code for a simple data science project.
Conclusion:
In this tutorial, we have covered the basics of using DVC to manage a simple data science project. DVC is a powerful tool that can help you keep your data and code organized and under control. By using DVC, you can make your data science projects more reproducible and easier to collaborate on with team members. We hope that this tutorial has helped you get started with using DVC for your data science projects.
Written by
Aman kumar
MLOps Expert and Data Science Expert, Building LLM model, Open Source Contributor, and Done some Amazing Projects related to Data Science.