An overview of different techniques you should know to improve your AI models
Why
Splitting your dataset into training and validation sets helps you understand whether your model has high variance (overfitting) or high bias (underfitting).
This is extremely important because it tells you how well your model generalizes to new, unseen data. If your model is overfitting, it probably cannot generalize well to new data, so it will not make good predictions.
Having a proper validation strategy is the first step to producing good predictions, and therefore business value, with your AI models.
Simple Train/Val Split
With a simple train/val split you divide your dataset into a training set and a validation set.
Usually you use 80% of the data for training and 20% for validation. Libraries such as scikit-learn perform this split with random sampling.
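As a minimal sketch, here is how such a split looks with scikit-learn's train_test_split (the X and y below are toy placeholders):

```python
# Simple 80/20 train/val split with scikit-learn.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(100, 5)              # toy data: 100 samples, 5 features
y = np.random.randint(0, 2, size=100)   # toy binary labels

# test_size=0.2 gives 80% training / 20% validation;
# random_state fixes the seed so the split is reproducible.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```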
Things to keep in mind
- You need to fix the random seed, otherwise you cannot compare different training runs, because each run would use a different dataset split. Fixing the seed makes the random sampling reproducible.
- If you have an imbalanced dataset, a plain random split does not guarantee the same class ratio in training and validation. See Stratified K-Fold below.
- If you have a small dataset, there is no guarantee that your validation split is representative of your training split.
K-Fold Cross Validation
The idea is to split the dataset into k partitions (folds); for example, with k = 5 the dataset is divided into 5 partitions.
Each time you choose one partition as your validation set and use the remaining partitions as your training set, repeating this for each possible choice of validation partition.
In the end you have K different models.
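As a minimal sketch of this loop, assuming toy data and a logistic regression as a stand-in model:

```python
# 5-fold cross-validation: one model per fold, scored on its held-out fold.
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression

X = np.random.rand(100, 5)              # toy data
y = np.random.randint(0, 2, size=100)

kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = []
for train_idx, val_idx in kf.split(X):
    model = LogisticRegression()        # a new model for each fold
    model.fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[val_idx], y[val_idx]))

# The average of the fold scores estimates performance on unseen data.
print(f"mean accuracy: {np.mean(scores):.3f}")
```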
K is usually set to 3, 5, 7, 10, or 20.
- A higher K (e.g. 20) is used if you want a performance estimate with low bias, since each model is trained on almost the entire dataset.
- A lower K (e.g. 3 or 5) is used if you want to build models for variable selection, training on only part of the data; the resulting estimate has lower variance.
Advantages:
- By averaging the fold scores you can estimate the performance of your model on unseen data drawn from the same distribution.
- It is a widely used method to obtain good models for production.
- You can create a prediction for every sample in your dataset, each made by a model that never saw that sample during training. These are called OOF (out-of-fold) predictions, and you can use them with ensembling techniques such as blending and stacking.
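Here is a minimal sketch of how OOF predictions can be collected, reusing the same toy setup as above:

```python
# Out-of-fold (OOF) predictions: every sample is predicted by the one
# model that did not see it during training.
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression

X = np.random.rand(100, 5)              # toy data
y = np.random.randint(0, 2, size=100)

kf = KFold(n_splits=5, shuffle=True, random_state=42)
oof = np.zeros(len(y))
for train_idx, val_idx in kf.split(X):
    model = LogisticRegression()
    model.fit(X[train_idx], y[train_idx])
    oof[val_idx] = model.predict_proba(X[val_idx])[:, 1]

# `oof` can now serve as an input feature for blending or stacking.
```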
Problems:
- If you have an imbalanced dataset, plain K-Fold does not preserve the class ratio. Use Stratified K-Fold instead.
- If you retrain a model on the whole dataset, you cannot directly compare its performance with the models trained during K-Fold, because those models were each trained on k-1 folds, not on the entire dataset.
Stratified K-Fold
Stratified K-Fold is used to preserve the ratio between classes in every fold.
In practice, if you have an imbalanced dataset where, for example, class 1 has 10 examples and class 2 has 100 examples, Stratified K-Fold creates k folds where each fold has the same class ratio as the original dataset. The procedure is otherwise the same as K-Fold cross-validation.
If you use plain K-Fold cross-validation on an imbalanced dataset, the original ratio between classes is not preserved in each fold.
If your dataset is very big, plain K-Fold may still approximately preserve the class ratio, because each random fold is large enough to reflect the overall distribution.
Advantages:
- It preserves the class ratio even with a small dataset.
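A minimal sketch, assuming the 10-vs-100 toy imbalance from the example above:

```python
# Stratified K-Fold keeps the class ratio roughly constant in every fold.
import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.random.rand(110, 5)              # toy data
y = np.array([1] * 10 + [0] * 100)      # imbalanced: ~9% positives

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, val_idx in skf.split(X, y):
    # Each validation fold keeps roughly the original ~9% positive ratio.
    print(f"positive ratio in fold: {y[val_idx].mean():.2f}")
```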
Bootstrap and Subsampling
Bootstrap and subsampling are similar to K-Fold cross-validation, but they do not use fixed folds.
In practice, you draw a random sample of your data for training and use the remaining data for validation.
You repeat this n times.
Bootstrap = sampling with replacement; subsampling = sampling without replacement.
In machine learning, the golden rule is usually to use K-Fold cross-validation.
When to use
Bootstrap and subsampling can be useful when your evaluation metric shows a large standard error across folds, which can happen, for example, because of outliers in your dataset.
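As a minimal sketch of bootstrap validation, assuming toy data and a logistic regression as a stand-in model (subsampling would instead use replace=False with a smaller sample size):

```python
# Bootstrap validation: sample with replacement for training, evaluate on
# the out-of-bag (never sampled) rows, and repeat n times.
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.random.rand(100, 5)              # toy data
y = np.random.randint(0, 2, size=100)

rng = np.random.default_rng(42)
scores = []
for _ in range(20):                     # n = 20 bootstrap rounds
    train_idx = rng.choice(len(X), size=len(X), replace=True)
    oob_idx = np.setdiff1d(np.arange(len(X)), train_idx)   # out-of-bag rows
    model = LogisticRegression()
    model.fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[oob_idx], y[oob_idx]))

# The spread of the scores shows how stable the evaluation metric is.
print(f"mean accuracy: {np.mean(scores):.3f} +/- {np.std(scores):.3f}")
```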