
How should you define success?

When I start a new classification project I always take some time to sit down with myself, the data, and my business case to ask an important question: what does it mean to have a “successful” model? By doing this I am trying to figure out how I will judge how well a particular model does at the task and which metric I will use when training the model. In this article I attempt to help you think through some different scoring metrics, but this is by no means an exhaustive list.
Oftentimes accuracy is thought of as the most basic or standard scoring metric. Accuracy is based on how many predictions the model got correct overall, regardless of class. The biggest downfall of accuracy is class imbalance. If your data has a class imbalance of 100:1, a model could achieve near-perfect accuracy by classifying every single entry as the more populous class. But obviously that’s not a successful model, even though it is technically predicting about 99% of the data correctly. Generally I will use accuracy as one of the metrics to keep an eye on when tuning or optimizing a model, but not necessarily as the ultimate goal. Be aware that accuracy is the default scoring metric for scikit-learn classifiers.
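To see this in action, here is a small sketch using scikit-learn’s accuracy_score on some made-up labels with roughly a 100:1 imbalance; the “model” simply predicts the majority class for everything.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up labels with roughly a 100:1 class imbalance:
# 100 majority-class (0) entries for every 1 minority-class (1) entry.
y_true = np.array([0] * 100 + [1])

# A lazy "model" that predicts the majority class for everything.
y_pred = np.zeros_like(y_true)

# Accuracy looks nearly perfect even though every positive case was missed.
print(accuracy_score(y_true, y_pred))  # ~0.99
```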
Precision is the measurement of true positives divided by true positives plus false positives: TP / (TP + FP). Since true positives appear in both the numerator and the denominator, the false positives have the real sway in this equation. If a model has many false positives, the precision score will decrease (get worse); if a model has few false positives, the score will increase (get better). A perfect score is 1, because if there were no false positives the formula would reduce to a number divided by itself.
Scoring by precision puts the focus on minimizing false positives in your model. If you are working on a classification problem where reducing false positives is very important, consider using precision as your scoring metric.
Recall is the measurement of true positives divided by true positives plus false negatives: TP / (TP + FN). This equation is similar to precision, but now it is all about the false negatives. Since false negatives are in the denominator, a large number of false negatives will decrease the recall score (get worse), and a small number of false negatives will increase it (get better). A perfect recall score is 1, since if there were no false negatives the formula would again reduce to a number divided by itself.
Scoring by recall puts the focus on minimizing false negatives in your model. If you are working on a classification problem where reducing false negatives is very important, consider using recall as your scoring metric.
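To make the TP / (TP + FP) and TP / (TP + FN) formulas concrete, here is a small sketch with scikit-learn’s confusion_matrix, precision_score, and recall_score; the labels are invented purely to produce a couple of false positives and a false negative.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Invented binary labels: 1 = positive class, 0 = negative class.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]  # one false negative, two false positives

# For binary labels, confusion_matrix returns [[TN, FP], [FN, TP]].
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)  # 3 / (3 + 2) = 0.6
recall = tp / (tp + fn)     # 3 / (3 + 1) = 0.75

# The manual formulas match scikit-learn's built-in scorers.
print(precision, precision_score(y_true, y_pred))  # 0.6 0.6
print(recall, recall_score(y_true, y_pred))        # 0.75 0.75
```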
Precision and recall are a trade-off: reducing false negatives often leads to more false positives, and vice versa. But what if you care about both false negatives and false positives? Well, lucky for you, there exists the F1 score. The F1 score is calculated as 2 * (precision * recall) / (precision + recall), which is the harmonic mean of precision and recall, and is considered the balancing of the two.
If false negatives and false positives are equally undesirable and you want a metric that will try to find the best harmony between the two, then consider using the F1 score.
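As a quick sanity check on the formula, here is a sketch (reusing the same invented labels from above) showing that scikit-learn’s f1_score matches the harmonic mean of precision and recall computed by hand.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Same invented labels as before.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

p = precision_score(y_true, y_pred)  # 0.6
r = recall_score(y_true, y_pred)     # 0.75

# F1 is the harmonic mean of precision and recall.
print(2 * (p * r) / (p + r))     # 0.666...
print(f1_score(y_true, y_pred))  # 0.666...
```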
The beauty of the F1 score is that it can be altered slightly to become the F-beta score. The F-beta score is similar to the F1 score but allows precision or recall to be weighted as more important. Scikit-learn has a function for computing the F-beta score, which makes this really easy to set up. A beta value greater than 1 puts more emphasis on recall; a beta value less than 1 puts more emphasis on precision.
If you want a harmony between reducing false positives and false negatives, but one is more important to minimize than the other, consider using F-beta.
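Here is a minimal sketch of what that might look like, using scikit-learn’s fbeta_score and wrapping it with make_scorer so it can be handed to something like cross_val_score or GridSearchCV; the beta values are arbitrary examples.

```python
from sklearn.metrics import fbeta_score, make_scorer

# Same invented labels as before.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

# beta > 1 weights recall more heavily; beta < 1 weights precision more heavily.
print(fbeta_score(y_true, y_pred, beta=2))    # leans toward recall
print(fbeta_score(y_true, y_pred, beta=0.5))  # leans toward precision

# make_scorer wraps the metric so it can be passed as the `scoring`
# argument of cross_val_score or GridSearchCV.
f2_scorer = make_scorer(fbeta_score, beta=2)
```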
If you work on a classification problem with more than two classes, the idea of micro and macro calculations comes up. Micro and macro refer to the way the selected metric is calculated for a multiclass problem, and you can combine either one with recall, precision, the F1 score, and so on.
Let’s use recall as an example. Macro recall calculates the recall score by first calculating recall for each class independently and then averaging all of those scores together. Micro recall calculates the recall score by collecting all the true positives and false negatives across all the classes and then computing them together in one recall formula. Macro gives every class equal weight, whereas micro gives every individual true positive and false negative equal weight.
Let’s think through what this all means. Say a dataset has 5 classes and one class has terrible recall while the rest are pretty good. The macro recall is still going to be pretty good, because the terrible class only has 1/5 of the power over the final score. The micro recall will be affected more dramatically by the poorly performing class, especially if that class is populous, because its false negatives add a lot of false negatives to the overall calculation. For this reason, imbalance among classes in a multiclass problem is better captured with a micro calculation than a macro one. There is also a weighted macro, however, which calculates the macro but weights each class’s score by its number of entries, which can produce something more similar to the micro result.
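Here is a small sketch of the difference, using scikit-learn’s recall_score with its average parameter on invented labels where the most populous class (class 2) is the poorly predicted one.

```python
from sklearn.metrics import recall_score

# Invented 3-class labels: classes 0 and 1 are predicted perfectly,
# while the most populous class (2) has poor recall.
y_true = [0, 0, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2]
y_pred = [0, 0, 1, 1, 2, 2, 0, 0, 1, 1, 0, 1]

# Macro: compute recall per class, then average (every class counts equally).
print(recall_score(y_true, y_pred, average='macro'))  # (1.0 + 1.0 + 0.25) / 3 = 0.75
# Micro: pool all true positives and false negatives, then compute once.
print(recall_score(y_true, y_pred, average='micro'))  # 6 correct out of 12 = 0.5
# average='weighted' gives the weighted-macro variant described above.
```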
In summary, use a micro or macro calculation of your selected metric for multiclass problems. If the classes are imbalanced, then it’s more appropriate to use micro or weighted macro.
As part of choosing a metric I recommend making sure you have thoroughly gotten to know your data, your problem, your goal, your constraints, etc. If you are having a hard time getting started, write out on paper what a false negative and false positive would be for your problem and what implications they would have. What is the worst case scenario for a false negative? A false positive?
I hope this article helped give you more confidence in choosing the right metric for your classification models. Good luck!