Example with an Imbalanced Dataset
This sample data set is pulled from a text classification project of mine. I set out to classify hotel reviews by rating; the full details are on my GitHub. The data strongly favors positive reviews (or else hotels would need to re-examine their business model).
Class Distribution (%)

1     7.431961
2     8.695045
3    17.529658
4    33.091417
5    33.251919
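A table like the one above can be produced with a pandas value count. This is a minimal sketch; the toy ratings below stand in for the real target data, which is assumed to be a Series of review scores:

```python
import pandas as pd

# Toy stand-in for the real y_train (a Series of 1-5 review ratings)
y_train = pd.Series([5, 4, 4, 3, 5, 1, 2, 4, 5, 3])

# Normalized value counts, sorted by class label, expressed as percentages
distribution = y_train.value_counts(normalize=True).sort_index() * 100
print(distribution)
```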
Calculate Class Weights
Scikit-Learn has functions to calculate class weights and sample weights in its sklearn.utils.class_weight module. Custom weights can also be passed in as a dictionary with the format {class_label: weight}. I calculated balanced weights for the above case:
Class Weights: 5 classes

{1: 2.691079812206573,
 2: 2.3001605136436596,
 3: 1.140923566878981,
 4: 0.6043863348797975,
 5: 0.6014690451206716}
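These weights can be reproduced with compute_class_weight. The per-class counts below are an assumption, back-calculated from the distribution table (14330 total samples):

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Assumed class counts, back-calculated from the distribution percentages
counts = {1: 1065, 2: 1246, 3: 2512, 4: 4742, 5: 4765}
y = np.concatenate([np.full(n, label) for label, n in counts.items()])

# 'balanced' weights: n_samples / (n_classes * count_of_class)
classes = np.unique(y)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)
print(dict(zip(classes, weights)))
```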
As you can see, heavier weights are applied to the minority classes, telling the model to give these classes more importance, while lower weights are applied to the majority classes so they carry less. A weight of 0 would remove a class's influence entirely (if you needed to mute a class).
I've also combined the normalized distribution of each class with its calculated weight. The 'balanced' column is the weight multiplied by the distribution. We see the same number for each class, summing to 1, which is equivalent to an equal probability of seeing any class (1/5 = 0.2).
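That claim is easy to verify: each balanced weight times its class frequency comes out to 1/5. The distribution and weights below are copied from the values above:

```python
# Distribution (as fractions) and balanced weights from the tables above
dist = {1: 0.07431961, 2: 0.08695045, 3: 0.17529658,
        4: 0.33091417, 5: 0.33251919}
weights = {1: 2.691079812206573, 2: 2.3001605136436596, 3: 1.140923566878981,
           4: 0.6043863348797975, 5: 0.6014690451206716}

# Each product is ~0.2 (= 1/5), and the products sum to ~1.0
balanced = {c: dist[c] * weights[c] for c in dist}
print(balanced)
```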
Calculate Sample Weights
Balanced class weights can be calculated automatically by the sample weight function. Set class_weight = 'balanced' to adjust weights inversely proportional to class frequencies in the input data (as shown in the above table).
from sklearn.utils.class_weight import compute_sample_weight

sample_weights = compute_sample_weight(class_weight='balanced', y=y_train)
The sample weights are returned as an array with the class weight mapped to each sample in the target data (y_train). Example:
Sample Weights: 14330 samples

array([0.60146905, 2.30016051, 0.60438633, ..., 0.60438633, 1.14092357,
       1.14092357])
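In other words, compute_sample_weight simply looks up each sample's class weight and maps it onto the target array. A minimal sketch with toy labels (not the real y_train):

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight, compute_sample_weight

y = np.array([5, 2, 4, 4, 3, 5, 1])  # toy ratings, not the real y_train

# Balanced class weights, keyed by class label
classes = np.unique(y)
class_w = dict(zip(classes, compute_class_weight("balanced", classes=classes, y=y)))

# One weight per sample, mapped from that sample's class
sample_w = compute_sample_weight(class_weight="balanced", y=y)
assert np.allclose(sample_w, [class_w[label] for label in y])
```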
To use the sample weights in a Scikit-Learn Multinomial Naive Bayes pipeline, the weights must be passed in the fit step. For this demo I will not explore the NLP side; this is just a comparison of the singular effect of weighting samples. Example:
pipeline = Pipeline(steps=[("NLP", TfidfVectorizer()),
                           ("MNB", MultinomialNB())])

pipeline.fit(X_train, y_train, **{'MNB__sample_weight': sample_weights})
Non-Weighted Model Performance
For comparison, the above model was first trained without sample weights and reached 55% accuracy.
Predictions heavily favor the majority classes. This model almost completely ignores the minority classes.
Weighted Model Performance
The exact same model was trained with the addition of balanced sample weights in the fit step. This model reached 58% accuracy.
Along the True Positive diagonal (top-left to bottom-right), we can see that the model is much better at predicting the minority classes.
There is only a 3% difference in accuracy between the models, but vastly different predictive abilities. Accuracy is skewed because the test set has the same class distribution as the training data, so a model that guesses in those same proportions still hits the mark often enough on the majority classes. This is why accuracy alone is not a good metric for model success! But that is a conversation for a different day, or you can check out this article about the Failure of Classification Accuracy for Imbalanced Class Distributions.
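To put numbers on how far naive strategies get on this distribution, here is a quick baseline check using the percentages from the table above: always predicting the single most common class, and guessing randomly in proportion to the class frequencies.

```python
# Class distribution (as fractions) from the table above
dist = {1: 0.07431961, 2: 0.08695045, 3: 0.17529658,
        4: 0.33091417, 5: 0.33251919}

# Always predict class 5, the most common class (~33% accuracy)
majority_baseline = max(dist.values())

# Guess randomly in proportion to the class frequencies: expected
# accuracy is the sum of squared frequencies (~26%)
proportional_baseline = sum(p * p for p in dist.values())

print(round(majority_baseline, 3), round(proportional_baseline, 3))
```

The imbalance is what inflates these baselines: a model can ignore the minority classes almost entirely and still post a respectable-looking accuracy.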
Conclusion
It is important to train models on balanced data sets (unless there is a particular application that calls for weighting a certain class more heavily) to avoid distribution bias in predictive ability. Some Scikit-Learn models can automatically balance input classes with class_weight = 'balanced'. The naive Bayes models instead require an array of sample weights, which can be calculated with compute_sample_weight().