## Use Cases:

- You want to easily and granularly see where your model succeeds and where it fails
- You are building a presentation and want to visually convey your model’s results in a clear and customizable way

As nice-looking as our confusion matrix is, it can be hard to compare models with it. If we were trying to figure out whether a Logistic Regression, KNN, Decision Tree, or Random Forest was best for our problem, having to sort through and compare four different plots, each with four different categories, would be time-consuming.

Thankfully there are four scores that we can calculate based on the relationship between our true predictions and false predictions. Each one stresses a different facet of our results and which one we choose can be based on several different factors including a client’s needs, what we are predicting, and any imbalances in our data.
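All four scores are built from the same four cells of the confusion matrix. As a quick reference, here is one way to pull those counts out of Scikit-Learn's `confusion_matrix()` (the labels below are made up purely for illustration):

```python
from sklearn.metrics import confusion_matrix

# Fabricated labels for illustration only
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 1, 0, 1, 0, 1, 1, 0]

# For binary labels 0/1, confusion_matrix returns [[TN, FP], [FN, TP]],
# so ravel() flattens it in that order
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
```

With those four counts in hand, every score below is a one-line division.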

## Accuracy: (True Positives + True Negatives) / All Predictions

A balanced metric that rewards the model for correctly predicting both positive and negative results. This is the default for many models when you call their `.score()` method.

This can be misleading if there is an imbalance between the target class and the non-target class, i.e., if there were 5 members of our target class in a data set of 100 entries and our model predicted everything as negative, it would still receive an accuracy score of 95%.
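To make that pitfall concrete, here is a minimal sketch (with fabricated labels) of the 95% scenario:

```python
from sklearn.metrics import accuracy_score

# 5 positives hidden in 100 samples
y_true = [1] * 5 + [0] * 95

# A "model" that predicts everything as negative
y_pred = [0] * 100

# 95 of the 100 predictions are correct, so accuracy is 0.95
acc = accuracy_score(y_true, y_pred)
```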

The accuracy score metric is built into Scikit-Learn and just needs to be imported, with our model’s predictions passed in alongside the true results.

```python
from sklearn.metrics import accuracy_score

# Remember to place the true results first
accuracy = accuracy_score(y_test, y_pred)
```

## Precision: True Positives / (True Positives + False Positives)

A metric that prioritizes the positive guesses that our model makes while ignoring anything classified as negative.

This metric’s chief concern is that your model is correct when it predicts a data point is positive while forgiving any positives it might have missed. A good way to think about this is with the quote: *“It is better that ten guilty persons escape than that one innocent suffer.”*

This is best used when the consequences of a false positive far outweigh the consequences of a false negative. Like accuracy, precision can be imported from Scikit-Learn.

```python
from sklearn.metrics import precision_score

precision = precision_score(y_test, y_pred)
```

## Recall: True Positives / (True Positives + False Negatives)

Think of recall as precision’s mirror opposite. Instead of prioritizing positive predictions, recall’s priority is making sure that all of the target class were correctly identified, regardless of how many other data points were incorrectly lumped in with them.

Recall’s emphasis on identifying all of the target class is best captured with the hypothetical situation of a model needing to classify whether a person has a highly infectious disease. A False Positive results in a bit of a scare and the person being stuck in quarantine for a few weeks. A False Negative means the person will be free to spread the disease far and wide.

Like accuracy and precision, recall can be imported from Scikit-Learn.

```python
from sklearn.metrics import recall_score

recall = recall_score(y_test, y_pred)
```

## Specificity: True Negatives / (True Negatives + False Positives)

Specificity doesn’t get a lot of love and it’s easy to see why. When most people build models they have in mind how they’re going to best identify something, whether it’s a rare disease, loan defaults, or hot dog/not hot dog. Specificity rewards your model for correctly predicting that a data point was not in the target class, punishes it for false positives, and ignores true positives and false negatives entirely!

Specificity still has its place as a metric worth mentioning. If we consider a model’s specificity score as a complement to its precision score, we can, while taking any data imbalances into account, use the two together to see if a model is better at returning True Negatives or True Positives.

Unlike the other metrics, specificity is not included in Scikit-Learn. We can overcome this by writing our own function, since all confusion matrices generated by `confusion_matrix()` have identical indexing.

```python
def specificity_score(confusion_matrix):
    true_negatives = confusion_matrix[0][0]
    false_positives = confusion_matrix[0][1]
    return true_negatives / (true_negatives + false_positives)

specificity = specificity_score(con_mat)
```

## Use Cases:

- You need a way to quickly and easily compare different models
- Your client has a clear preference for how they want the model to perform which can be translated into prioritizing one of these 4 scores

When we bring up scores like precision and recall to a client, their response may logically be “those both sound nice, can we make sure the model is good at both?” While having our cake and eating it too can be a difficult task in classification, there is at least a way to score a model based on the relationship between its precision and recall: F-Scores.

## F1 Score: 2 * (Precision * Recall) / (Precision + Recall)

A model’s F1 Score is the harmonic mean of its precision and recall. If you want your model to be equally good at both without favoring one over the other, this is your metric.

F1 can be directly imported from Scikit-Learn.

```python
from sklearn.metrics import f1_score

f1 = f1_score(y_test, y_pred)
```

## Fβ Score: (1 + β²) * (Precision * Recall) / (β² * Precision + Recall)

Using an F1 score assumes that misclassifications of either False Negatives or False Positives incur equal costs, something which is hardly ever true in real life. For situations where we favor precision or recall but still want the input of the other, we can use the Fβ score.

Unlike the other scores we’ve seen so far, calculating our Fβ score requires an extra parameter besides our model’s predictions and the correct target values: the β parameter. An easy way to think of the β parameter is as how many times more we favor recall over precision, i.e., a β of 0.5 means that we favor recall half as much as precision, a β of 1 means we favor each equally, and a β of 2 means we favor recall twice as much as precision.

Thankfully, Scikit-Learn has a built-in Fβ score function we can import.

```python
from sklearn.metrics import fbeta_score

# Our score where we favor recall
fbeta_2 = fbeta_score(y_test, y_pred, beta=2)

# Our score where we favor precision
fbeta_half = fbeta_score(y_test, y_pred, beta=0.5)
```
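As a quick sanity check on the formula (using fabricated labels, not the article's data): with β = 1 the Fβ score collapses to F1, and with β = 2 it matches the formula above computed by hand.

```python
from sklearn.metrics import f1_score, fbeta_score, precision_score, recall_score

# Fabricated labels chosen so that precision and recall differ
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]

p = precision_score(y_true, y_pred)   # 2 TP / (2 TP + 1 FP) = 2/3
r = recall_score(y_true, y_pred)      # 2 TP / (2 TP + 2 FN) = 1/2

# beta=1 weights precision and recall equally, reproducing F1
f1 = f1_score(y_true, y_pred)
fbeta_1 = fbeta_score(y_true, y_pred, beta=1)

# beta=2 matches the hand-computed formula
manual = (1 + 2**2) * (p * r) / (2**2 * p + r)
fbeta_2 = fbeta_score(y_true, y_pred, beta=2)
```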

## Use Cases:

- You’ve talked with your client about different scores and they want a balance between recall and precision
- You have a clear ratio in mind of how much you want to favor one over the other

The final metric I’m going to cover is a deceptively simple one: how well does our model perform compared to simple rules? As things stand, we can confidently say that our model beats flipping a coin, but what if we just predicted everything as negative, or predicted all women as surviving and all men as dying? These are known as “Dummy” models and they can be incredibly helpful in measuring how successful our model actually is.

Scikit-Learn actually has its own dummy classifier we can import, but in this case, we’re going to create our own array of predictions based on the gender rule mentioned above. Our goal is to beat predictions made using this simple rule. If that isn’t possible, then we need to go back and do the following:

- Collect better data (if possible)
- Engineer our features better
- Try different models and/or different model parameters

```python
# A simple trick since Sex_female is already a binary array
dummy_y = X_test["Sex_female"]

# Generate a dummy confusion matrix
dummy_con_mat = confusion_matrix(y_test, dummy_y)

# Generate the dummy variant of each score
dummy_acc = accuracy_score(y_test, dummy_y)
dummy_prec = precision_score(y_test, dummy_y)
dummy_recall = recall_score(y_test, dummy_y)
dummy_spec = specificity_score(dummy_con_mat)
dummy_f1 = f1_score(y_test, dummy_y)
dummy_fbeta_2 = fbeta_score(y_test, dummy_y, beta=2)
dummy_fbeta_half = fbeta_score(y_test, dummy_y, beta=0.5)
```
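If you’d rather not hand-roll the baseline, here is a minimal sketch of the built-in `DummyClassifier` mentioned above (the toy arrays below are stand-ins for the real train/test split, not the article’s data):

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Toy stand-ins: the dummy ignores the features entirely
X_train = np.zeros((100, 1))
y_train = np.array([1] * 40 + [0] * 60)
X_test = np.zeros((10, 1))

# "most_frequent" always predicts the majority class seen in training,
# i.e. the "predict everything as negative" rule when negatives dominate
dummy = DummyClassifier(strategy="most_frequent")
dummy.fit(X_train, y_train)
preds = dummy.predict(X_test)
```

Its predictions can then be fed through the same scoring functions as any other model’s.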