1. Choose the right metric
That leads us to making sure we use an appropriate metric to measure the performance of our model. As we saw earlier, accuracy is not an appropriate metric for this use case precisely because so many of our dataset's entries fall into a single class. Let's take a look at a few alternatives.
Precision — Precision is the number of correctly classified positive (1 in our case) examples divided by the total number of examples that were classified as positive (1). If we use the classifier that always predicts 0, our precision will be 0 because we haven't classified any examples as positive.
Recall — Recall, also known as "sensitivity", is the number of positive examples that were correctly classified divided by the total number of positive examples. In our case, this would be the number of entries with value 1 that were correctly identified by our model. Using our dummy classifier, our recall would also be 0. This metric helps us see whether our model is correctly identifying the 1 entries in our dataset, but it still has a major flaw: if our classifier always predicted 1 for every example, we would get perfect recall! Our precision, on the other hand, would not be so great. We need some sort of balance between these two metrics in order to evaluate our model in a reasonable way.
F1 Score — The F1 score is a balance between precision and recall. It is given by the formula (2 * Precision * Recall) / (Precision + Recall). This metric is well suited to imbalanced datasets, eliminating the shortcomings of the plain accuracy metric, and gives us a better way to evaluate any of our models.
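To make this concrete, here is a minimal sketch of how these metrics can be computed with scikit-learn, using a tiny hand-made set of labels and predictions in place of a real model's output:

```python
from sklearn.metrics import precision_score, recall_score, f1_score, classification_report

# Toy example with an imbalanced target (mostly 0s, a few 1s);
# swap in your real y_test and model predictions.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

print("Precision:", precision_score(y_true, y_pred))  # 1 true positive / 2 predicted positives = 0.5
print("Recall:   ", recall_score(y_true, y_pred))     # 1 true positive / 2 actual positives   = 0.5
print("F1 score: ", f1_score(y_true, y_pred))         # (2 * 0.5 * 0.5) / (0.5 + 0.5)          = 0.5

# classification_report shows all three metrics per class at once
print(classification_report(y_true, y_pred))
```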
2. Set up a cross validation strategy
Rather than using the default train_test_split provided by scikit-learn, we should try to make sure our splits accurately represent the distribution of our target variable. A very simple way to do this is to use the stratify parameter when calling the train_test_split function.
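A minimal sketch of a stratified split, using a synthetic imbalanced dataset as a stand-in for our data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic imbalanced dataset standing in for ours (~95% class 0, ~5% class 1)
X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=42)

# stratify=y keeps the class proportions the same in the train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
```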
Making this small change ensures that the train and test sets follow the same distribution as our original dataset.
Another way to make your cross validation strategy more robust to class imbalances is to use several folds or train on different subsets of your data. For this we can use StratifiedKFold and StratifiedShuffleSplit to ensure that we still follow our target variable’s distribution.
StratifiedKFold will split our original dataset into several folds, with each fold having a distribution that is similar to the original. This means that we can train a model on each of these folds while being sure that our distribution stays consistent. StratifiedShuffleSplit also preserves our target variable's distribution, but instead of partitioning the data into non-overlapping folds it draws a new random stratified train/test split from the whole dataset on each iteration.
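Here is a rough sketch of both splitters, reusing the synthetic X and y from the previous snippet and a plain LogisticRegression as a placeholder model:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, StratifiedShuffleSplit
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

# 5 folds, each with roughly the same class proportions as y
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = []
for train_idx, test_idx in skf.split(X, y):
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])
    scores.append(f1_score(y[test_idx], model.predict(X[test_idx])))
print("Mean F1 over folds:", np.mean(scores))

# 5 independent stratified train/test splits drawn from the whole dataset
sss = StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=42)
for train_idx, test_idx in sss.split(X, y):
    pass  # train and evaluate exactly as in the loop above
```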
3. Change target weights in your model
By default, models will assign the same weight to every class in our target variable. This would be fine if our dataset had a relatively even distribution among its target classes. In our case, we will want to use different weights for each class depending on how skewed our dataset is.
How should we determine what weights to use for each class? Many classifiers have a class_weight parameter where you can pass in a string like balanced. This tells the model to compute the appropriate weights from the data that you pass in. If your classifier doesn't support this, scikit-learn also has a utility function, compute_class_weight, that computes the weights for us. Let's see how we can use class_weight in our model.
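As a minimal sketch, using LogisticRegression as a stand-in for whichever classifier you're training, and the X_train / y_train split from earlier:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

# Simplest option: let the model balance the classes itself
model = LogisticRegression(max_iter=1000, class_weight="balanced")
model.fit(X_train, y_train)

# Equivalent explicit computation: weight = n_samples / (n_classes * class_count)
classes = np.unique(y_train)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_train)
model = LogisticRegression(max_iter=1000, class_weight=dict(zip(classes, weights)))
model.fit(X_train, y_train)
```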
Conclusion
We’ve looked at 3 different ways to help us handle imbalanced datasets.
- Choosing the appropriate metric for our task. We have seen that sometimes, metrics like accuracy can be very misleading when evaluating a model.
- Using a good cross validation strategy. We can ensure our train and test sets follow a similar distribution using several methods like StratifiedKFold and StratifiedShuffleSplit.
- Setting class weights on your target classes in order to give more weight to the minority class. This strategy makes our model put more importance on the minority class, potentially helping classify it better.
Armed with these strategies, you should be able to tackle imbalanced datasets with ease in the future.
Thank you for reading!