A visual explanation + connections with probability
This is Part I of an article series on machine learning metrics.
Here, we’ll visually review the most popular supervised learning metrics for
- Classification — Accuracy, Precision, Recall, Fᵦ & AUC; and
- Regression — MAE, MSE and R².
We’ll also visually explore some connections between classification metrics and probability theory.
Finally, in this and this article, I apply these metrics to real-world ML problems in Jupyter notebooks using scikit-learn.
Confusion Matrices
from sklearn.metrics import confusion_matrix
A binary classification results in four outcomes that can be summarised in a confusion matrix. Metrics for classification are then computed using the figures from the matrix.
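As a minimal sketch (with made-up labels and predictions), scikit-learn's confusion_matrix gives you the four counts directly:

```python
from sklearn.metrics import confusion_matrix

# Made-up ground-truth labels and model predictions (1 = positive, 0 = negative)
y_true = [1, 0, 1, 1, 0, 0, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1, 0, 1]

# For binary labels {0, 1}, the matrix is laid out as [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)  # 3 1 1 3
```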
Accuracy
from sklearn.metrics import accuracy_score
The most basic metric is accuracy, which gives the fraction of all predictions that our model got right.
This metric can be misleading if
- our dataset is imbalanced; or
- the costs of false positives and false negatives (Type I and II errors) are very different.
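As a minimal sketch (same made-up labels as above), accuracy can be computed with scikit-learn or straight from the confusion-matrix counts:

```python
from sklearn.metrics import accuracy_score, confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1, 0, 1]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# accuracy = (TP + TN) / (TP + TN + FP + FN)
print((tp + tn) / (tp + tn + fp + fn))  # 0.75
print(accuracy_score(y_true, y_pred))   # 0.75
```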
Precision
from sklearn.metrics import precision_score
Precision gives the fraction of positive predictions we got right. This metric prioritises minimising Type I errors. A classic example problem that demands a high-precision model is classifying spam emails (more on this below). Here, a false positive means an important email is accidentally thrown out with the junk.
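A minimal sketch with the same made-up labels, using scikit-learn or the confusion-matrix counts directly:

```python
from sklearn.metrics import confusion_matrix, precision_score

y_true = [1, 0, 1, 1, 0, 0, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1, 0, 1]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# precision = TP / (TP + FP): the fraction of positive predictions we got right
print(tp / (tp + fp))                   # 0.75
print(precision_score(y_true, y_pred))  # 0.75
```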
Recall
from sklearn.metrics import recall_score
Recall gives the fraction of actual positives that we predicted right. This metric prioritises minimising Type II errors. Example problems that demand high-recall models include diagnostic tests for COVID-19, classifying fraudulent credit card transactions and predicting defective aircraft parts. In these problems, false negatives are either financially costly or even deadly.
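Again as a minimal sketch with the same made-up labels:

```python
from sklearn.metrics import confusion_matrix, recall_score

y_true = [1, 0, 1, 1, 0, 0, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1, 0, 1]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# recall = TP / (TP + FN): the fraction of actual positives we predicted right
print(tp / (tp + fn))                # 0.75
print(recall_score(y_true, y_pred))  # 0.75
```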
Examples using Accuracy, Precision & Recall
Take the following example on classifying fraudulent credit card transactions.
The confusion matrix tells us that this ‘dumb’ model classified every single transaction as legitimate. Because the dataset is imbalanced, our accuracy is a stellar 992/1000 = 99.2%, yet we missed every single fraudulent transaction!
To capture the importance of these Type II errors, we can use recall instead. Here, recall = 0/8 = 0%, highlighting that not a single fraudulent transaction was caught. Whoops!
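The full confusion matrix isn't reproduced here, but assuming the figures above (992 legitimate and 8 fraudulent transactions, all predicted legitimate), the contrast between accuracy and recall is easy to reproduce:

```python
from sklearn.metrics import accuracy_score, recall_score

# 992 legitimate (0) and 8 fraudulent (1) transactions;
# the 'dumb' model predicts legitimate (0) every time
y_true = [0] * 992 + [1] * 8
y_pred = [0] * 1000

print(accuracy_score(y_true, y_pred))  # 0.992 -- looks stellar
print(recall_score(y_true, y_pred))    # 0.0   -- every fraud was missed
```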
Check out the next example of filtering out spam email. Here, making Type I errors is much worse than making Type II errors, because accidentally throwing away an important email is more problematic than letting a few junk emails get through. A good metric here is therefore precision.
We have precision = 100/130 = 77%. This means that out of the 130 emails we automatically filtered into the spam folder, 100 were real spam messages. Not bad, but I’d say certainly not good enough for a production email server!
Finally, check out this example on diagnosing COVID-19. Here, committing a Type II error (a false negative test) means missing people who are actually sick. This could be disastrous for the individual and certainly disastrous for public health. Meanwhile, a Type I error (a false positive test) would at most inconvenience the individual. Better safe than sorry! A good metric here is thus recall.
We have recall = 100/120 = 83%. This means we found 100 of the 120 people who took the test and actually had COVID-19, and let 20 go home thinking they didn’t have the virus. Unfortunately, this test wouldn’t be good enough for general use.
F₁ and Fᵦ scores
from sklearn.metrics import f1_score, fbeta_score
Taking into account both precision and recall, the F₁-score lets you have the best of both worlds and is far more informative than accuracy on imbalanced datasets. The metric is defined as the harmonic mean of precision and recall:
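F₁ = 2 × (precision × recall) / (precision + recall)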
Alternatively, you can calculate the F₁-score straight from the confusion matrix:
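F₁ = TP / (TP + ½(FP + FN))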
For our COVID-19 model, the F₁-score = 100/(100+0.5(20+80)) = 67%.
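A quick check of that arithmetic in code (reconstructing the counts from the figures above, TP = 100, FN = 20, FP = 80; a sketch, not the original notebook):

```python
tp, fp, fn = 100, 80, 20  # counts from the COVID-19 example above

precision = tp / (tp + fp)  # 100 / 180 ≈ 0.56
recall = tp / (tp + fn)     # 100 / 120 ≈ 0.83

# Harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

# Equivalent form straight from the confusion matrix
f1_alt = tp / (tp + 0.5 * (fp + fn))

print(round(f1, 2), round(f1_alt, 2))  # 0.67 0.67
```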
More generally, the Fᵦ-score allows you to calibrate the importance of Type I and II errors more precisely. It does this by letting you tell the metric that you view recall as β times more important than precision.
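In terms of precision and recall: Fᵦ = (1 + β²) × (precision × recall) / (β² × precision + recall)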
If you prefer to calculate the Fᵦ-score straight from the confusion matrix:
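Fᵦ = (1 + β²) × TP / ((1 + β²) × TP + β² × FN + FP)

As a rough sketch, here is the COVID-19 example again with β = 2 (recall treated as twice as important as precision), cross-checked against scikit-learn's fbeta_score:

```python
from sklearn.metrics import fbeta_score

tp, fp, fn = 100, 80, 20  # counts from the COVID-19 example above
beta = 2                  # treat recall as twice as important as precision

# Straight from the confusion-matrix counts
f_beta = (1 + beta**2) * tp / ((1 + beta**2) * tp + beta**2 * fn + fp)
print(round(f_beta, 2))   # 0.76 -- higher than F1 = 0.67, since recall (0.83) is strong

# Cross-check with scikit-learn by rebuilding label arrays with the same counts
y_true = [1] * (tp + fn) + [0] * fp
y_pred = [1] * tp + [0] * fn + [1] * fp
print(round(fbeta_score(y_true, y_pred, beta=beta), 2))  # 0.76
```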
Area Under Curve (AUC)
from sklearn.metrics import roc_curve, roc_auc_score
This metric is the area under the Receiver Operating Characteristic (ROC) curve, which measures how well your classifier separates the two classes, i.e. how well it distinguishes signal from noise.
An AUC of 1 represents a perfect model, while an AUC of 0.5 is no better than random guessing. As a rule of thumb, a score of 0.9 is considered outstanding, 0.8 excellent and 0.7 acceptable.
The ROC curve is drawn by plotting the True Positive Rate (TPR, aka recall) against the False Positive Rate (FPR) at every possible classification threshold.
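A minimal sketch with scikit-learn (made-up labels and predicted probabilities):

```python
from sklearn.metrics import roc_curve, roc_auc_score

# Made-up ground-truth labels and predicted probabilities of the positive class
y_true = [0, 0, 1, 1, 0, 1, 0, 1]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.7]

# roc_curve sweeps the classification threshold, returning the FPR and TPR at each step
fpr, tpr, thresholds = roc_curve(y_true, y_score)

# roc_auc_score is the area under that curve
print(roc_auc_score(y_true, y_score))  # 0.875
```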
In the next section, we’ll examine TPR, FPR, TNR and FNR in detail.
In this article, I show how to generate a ROC curve and compute the AUC in Python using scikit-learn.
The four outcomes from the confusion matrix can also be represented as a probability tree, which gives you a new perspective on the same situation. A benefit of interpreting the confusion matrix as a tree is that we can attach probabilities to the entries in the matrix.
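As a rough sketch of that idea (same made-up labels as earlier), dividing the counts by the appropriate totals turns the matrix into probabilities:

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1, 0, 1]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
total = tn + fp + fn + tp

# Joint probabilities: each cell divided by the grand total
p_tp, p_fp, p_fn, p_tn = tp / total, fp / total, fn / total, tn / total

# Conditional probabilities along the branches of the tree
tpr = tp / (tp + fn)  # P(predicted positive | actually positive), aka recall
fpr = fp / (fp + tn)  # P(predicted positive | actually negative)

print(p_tp, p_fp, p_fn, p_tn)  # 0.375 0.125 0.125 0.375
print(tpr, fpr)                # 0.75 0.25
```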