In this article we show how to use DVC for dataset versioning. If you are working on a machine learning project in production, your dataset usually evolves over time, and you need to track those changes to be able to deploy your model reliably.
To do this you need a tool that tracks the different versions of your dataset, so you can understand how your model performs over time as the dataset changes.
There are different tools available. For example:
- DVC
- Delta Lake
- Git-lfs
- …
DVC is easy to use and integrates with Git, so if you are familiar with Git, getting started with DVC will be very easy.
Machine Learning: DVC getting started
This tutorial assumes that you have already installed Git and DVC.
1. First of all, initialize a Git project and DVC
$ mkdir Project
$ cd Project
$ git init
$ dvc init
$ git commit -am "initialize repo"
dvc init initializes DVC inside your Git repository.
2. Add a remote to DVC to store the different dataset versions
$ dvc remote add -d dvc-remote /home/ubuntu/remoteDVC
dvc remote add registers a remote storage location where DVC stores the different versions of the dataset; the -d flag makes it the default remote.
In this case the remote is a folder on our local machine (/home/ubuntu/remoteDVC), but you can also store it on Google Drive or other cloud storage.
3. Track your dataset with DVC
# /Project
$ mkdir Dataset/ # add your data here
$ dvc add Dataset
This command creates a Dataset.dvc file, which is responsible for tracking the different versions of the dataset.
With DVC you can track images, CSV files, etc.
4. Add the DVC tracking file to Git
$ git add Dataset.dvc .gitignore
$ git commit -am "add dataset version 1"
By default, DVC automatically adds the Dataset folder to .gitignore. With git commit we keep track of Dataset.dvc inside our Git project.
5. Add a Git tag to easily retrieve your dataset version
$ git tag -a 'v1' -m 'raw data'
Using git tag, we can refer to this dataset version as v1 and retrieve it later. See DVC: Advanced Configuration below.
6. Push the dataset version to the remote DVC repository
$ dvc push
This command pushes your dataset to the remote repository. If you look at the remote folder, you will see your data there, stored under hash-based file names rather than in the original folder layout.
7. Add more data to your dataset
If you add new data and want to keep track of the new dataset version, repeat steps 3–6 and change the tag for the new dataset version, for example v2.
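Steps 3 to 6 can be bundled into a small shell function so that every new dataset version is recorded the same way. The following is a minimal sketch, not part of DVC itself: the function name snapshot_dataset and its two arguments (tag and tag message) are hypothetical, and it assumes that the folder is called Dataset and that a default DVC remote is already configured as in step 2.

```shell
# Hypothetical helper (not a DVC command): record the current contents of
# Dataset/ as a new version, tag it in Git, and push the data to the remote.
# Usage: snapshot_dataset <tag> <tag message>
snapshot_dataset() {
    # Step 3: update Dataset.dvc with the current state of the folder
    dvc add Dataset &&
    # Step 4: commit the tracking file to Git
    git add Dataset.dvc .gitignore &&
    git commit -m "add dataset version $1" &&
    # Step 5: tag the commit so the version can be retrieved by name
    git tag -a "$1" -m "$2" &&
    # Step 6: upload the data to the default DVC remote
    dvc push
}
```

For example, snapshot_dataset v2 "added more images" would record the current contents of Dataset as v2. The commands are chained with && so that a failure in one step (for instance, nothing to commit) stops the rest.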
DVC: Advanced Configuration
You can use DVC to change the dataset version while keeping the same code version.
$ git checkout v1 Dataset.dvc
$ dvc checkout
With these commands we switch the dataset to v1 using the tag we defined earlier, while keeping the same code version.
Of course, we can switch back to v2 with
$ git checkout v2 Dataset.dvc
$ dvc checkout
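Since switching versions is always the same pair of commands, it can be wrapped in a small shell function. This is only a sketch under assumptions: the name switch_dataset is hypothetical, the tracking file defaults to Dataset.dvc as in this tutorial, and the requested version is assumed to be available in the local DVC cache.

```shell
# Hypothetical helper (not a DVC command): switch only the dataset to the
# version stored behind a Git tag, leaving the rest of the code untouched.
# Usage: switch_dataset <tag> [dvc-file]
switch_dataset() {
    # Restore the tracking file as it was at the given tag
    git checkout "$1" -- "${2:-Dataset.dvc}" &&
    # Let DVC sync the workspace data with that tracking file
    dvc checkout "${2:-Dataset.dvc}"
}
```

With this in place, switch_dataset v1 and switch_dataset v2 move the dataset back and forth. If the data for a version is not in the local cache, running dvc pull downloads it from the remote.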
Warning
If you are using DVC with PyTorch ImageFolder, you may run into problems when switching from one version of your dataset to another.
Example
This is your dataset version v1. Let’s suppose we want to train without the class “street”.
Now, if you want to go back, you can check out v1 (git checkout v1 Dataset.dvc followed by dvc checkout) and recreate your first dataset.
If you then switch back to dataset v2, you will have some trouble because of what is left behind:
The street folder no longer contains any images, but the folder itself still exists, and this can cause problems if you are using PyTorch ImageFolder.
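ImageFolder builds its class list from the subfolders of the dataset root, so a leftover empty folder can be picked up as a class with no images. One possible workaround is to delete empty folders after each version switch. The sketch below is an assumption on our side, not something DVC provides: the function name prune_empty_class_dirs is hypothetical, and it relies on the -mindepth, -empty and -delete options available in GNU and BSD find.

```shell
# Hypothetical helper: remove empty directories left behind inside the
# dataset folder after switching versions. The dataset root itself is kept.
# Usage: prune_empty_class_dirs [dataset-root]
prune_empty_class_dirs() {
    # -mindepth 1 skips the root; -delete processes children before parents,
    # so nested empty folders are removed as well
    find "${1:-Dataset}" -mindepth 1 -type d -empty -delete
}
```

Running prune_empty_class_dirs Dataset after dvc checkout should leave only the class folders that still contain files.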