## A visual explanation + connections with probability

This is **Part I** of an article series on machine learning metrics.

Here, we’ll visually review the most popular supervised learning metrics for

**Classification** — Accuracy, Precision, Recall, Fᵦ & AUC; and **Regression** — MAE, MSE and R².

We’ll also visually explore some connections between classification metrics and probability theory.

Finally, in two companion articles, I apply these metrics to real-world ML problems in Jupyter notebooks using scikit-learn.

## Confusion Matrices

`from sklearn.metrics import confusion_matrix`

A binary classification results in four outcomes that can be summarised in a **confusion matrix**. Metrics for classification are then computed using the figures from the matrix.
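As a minimal sketch (with made-up labels), here is how to pull the four outcomes out of scikit-learn's confusion matrix:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical ground-truth labels and model predictions (1 = positive class)
y_true = [0, 0, 0, 1, 1, 1]
y_pred = [0, 1, 0, 1, 1, 0]

# scikit-learn's convention: rows are true labels, columns are predictions,
# so for binary labels {0, 1} the matrix is [[TN, FP], [FN, TP]]
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()
print(tn, fp, fn, tp)  # → 2 1 1 2
```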

## Accuracy

`from sklearn.metrics import accuracy_score`

The most basic metric is **accuracy**, which gives the fraction of all predictions that our model got right.

This metric can be misleading if our dataset is **imbalanced**; it is most appropriate when the costs of **false positives** and **false negatives** (**Type I and II errors**) are similar.
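To see how imbalance distorts accuracy, here's a minimal sketch with made-up labels:

```python
from sklearn.metrics import accuracy_score

# Hypothetical imbalanced dataset: 9 negatives, 1 positive,
# and a 'dumb' model that always predicts the negative class
y_true = [0] * 9 + [1]
y_pred = [0] * 10

# 90% accuracy, despite the model catching zero positives
print(accuracy_score(y_true, y_pred))  # → 0.9
```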

## Precision

`from sklearn.metrics import precision_score`

**Precision** gives the fraction of positive predictions we got right. This metric prioritises **minimising Type I errors**. A classic example problem that demands a high-precision model is classifying spam emails (more on this below). Here, a false positive means an important email is accidentally thrown into the spam folder.
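A minimal sketch with made-up labels, computing precision as true positives over all positive predictions:

```python
from sklearn.metrics import precision_score

# Hypothetical spam classifier (1 = spam): 2 true positives, 1 false positive
y_true = [1, 1, 0, 0, 1]
y_pred = [1, 1, 1, 0, 0]

# precision = TP / (TP + FP) = 2 / 3
print(precision_score(y_true, y_pred))
```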

## Recall

`from sklearn.metrics import recall_score`

**Recall** gives the fraction of actual positives that we predicted right. This metric prioritises **minimising Type II errors**. Example problems that demand high-recall models include diagnostic tests for COVID-19, classifying fraudulent credit card transactions and predicting defective aircraft parts. In these problems, false negatives are either financially costly or even deadly.
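A minimal sketch with made-up labels, computing recall as true positives over all actual positives:

```python
from sklearn.metrics import recall_score

# Hypothetical fraud detector (1 = fraud): 3 true positives, 1 false negative
y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 0, 1]

# recall = TP / (TP + FN) = 3 / 4
print(recall_score(y_true, y_pred))  # → 0.75
```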

## Examples using Accuracy, Precision & Recall

Take the following example on classifying **fraudulent credit card transactions.**

The confusion matrix tells us that this ‘dumb’ model classified every single transaction as legitimate. This is a problem because the dataset is imbalanced: our **accuracy is a stellar 992/1000 = 99.2%** as a result, yet we **missed every single fraudulent transaction!**

To capture the importance of these Type II errors, we can use **recall** instead. Here, recall = 0/8 = 0%, highlighting that no fraudulent transactions were caught. Whoops!
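The numbers above can be reproduced with a short sketch (synthetic labels matching the example's counts):

```python
from sklearn.metrics import accuracy_score, recall_score

# 1000 transactions, 8 of them fraudulent (1), and a 'dumb' model
# that labels every transaction legitimate (0)
y_true = [0] * 992 + [1] * 8
y_pred = [0] * 1000

print(accuracy_score(y_true, y_pred))                  # → 0.992
print(recall_score(y_true, y_pred, zero_division=0))   # → 0.0: every fraud missed
```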

Check out the next example of **filtering out spam email**. Here, making Type I errors is much worse than making Type II errors. This is because accidentally throwing away an important email is more problematic than accidentally letting a few junk emails get through. A good metric here is therefore **precision**.

We have **precision = 100/130 = 77%**. This means out of the 130 emails we automatically filtered away into the spam folder, only 100 were real spam messages. Not bad, but I’d say certainly not good enough for a production email server!

Finally, check out this example on **diagnosing COVID-19**. Here, committing Type II errors (false negative test) means missing people who are actually sick. This could be disastrous for the individual and certainly disastrous to public health. Meanwhile, a Type I error (false positive test) would at most inconvenience the individual. Better safe than sorry! A good metric here is thus **recall**.

We have **recall = 100/120 = 83%**. This means we found 100 of the 120 people who took the test and actually had COVID-19, but let 20 go home thinking they didn’t have the virus. Unfortunately, this test wouldn’t be good enough for general use.

## F₁ and Fᵦ scores

`from sklearn.metrics import f1_score, fbeta_score`

Taking both precision and recall into account, the **F₁-score** is a more advanced metric that lets you have the best of both worlds *and* is robust against imbalanced datasets. The metric is defined to be the **harmonic mean** of precision and recall: