
Anyone beginning with machine learning usually has very little idea about the workflow of a typical machine learning project. I have seen people focus rigorously on the machine learning model they are creating while giving little time to the other aspects of the workflow.
Well, don’t get me wrong, having the correct model for the task is very important, but there are other aspects that are equally important: correct feature selection, choosing the right optimizer for the model, and using the correct metrics to properly interpret the results the model produces.
Today I will be talking solely about the different types of metrics used in classification problems. It is going to be a slightly long read, but bear with me; I will make the topics as interesting as I can, because choosing the correct metric plays an important role in interpreting the results.
Let’s say there is a dataset to be used for binary classification. It has 1000 examples: 900 from one class and 100 from the second class.
This is an example of an imbalanced dataset, since a model trained on it can be biased towards a particular class. Let’s say that you made a model that always gives the class A label as the result. If you calculate the accuracy score on this result you will get 90 percent. This is very misleading, since the model has not actually learned anything and is just giving the output in favor of class A. It is not much better than a simple for loop that always outputs A.
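To make this concrete, here is a minimal sketch (assuming scikit-learn is available, and using the made-up class counts from the example above) showing how a do-nothing predictor still scores 90 percent accuracy:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical imbalanced labels: 900 samples of class A (label 1) and 100 of class B (label 0)
y_true = np.array([1] * 900 + [0] * 100)

# A "model" that always predicts class A, regardless of the input
y_pred = np.ones_like(y_true)

print(accuracy_score(y_true, y_pred))  # 0.9, even though the model learned nothing
```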
So this might give you an idea of how important the correct choice of metric can be for a problem. I will give more examples as I explain each of these metrics.
So let’s begin now.
Accuracy is one of the most common metrics used when we have a balanced dataset.
Balanced dataset?
A balanced dataset is one that contains an equal or almost equal number of samples from the positive and negative classes. If the samples from one of the classes outnumber the other, the data is skewed in favor of that class, like the example I mentioned earlier.
So now let’s see the formula to calculate the accuracy score.
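Accuracy = (Tp + Tn) / (Tp + Tn + Fp + Fn)

where: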
Tp: True positive (correct class A prediction)
Tn: True negative (correct class B prediction)
Fp: False positive (wrong class A prediction)
Fn: False negative (wrong class B prediction)
Since the accuracy score is simply the ratio of correct predictions to the total number of predictions made, it works fine for balanced datasets.
Precision and recall are two different metric measures. The reason I put them together is that there are times when we have to choose between the two of them depending on the outcome we want from the model.
Mathematically
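Precision = Tp / (Tp + Fp)

Recall = Tp / (Tp + Fn)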
The meaning of the symbols is the same as mentioned before.
- Precision is true positives divided by the sum of true positives and false positives. By observing this we can see that precision is more interested in incorrect class A (positive) predictions. For example, let’s consider the common example of spam detection, where spam is the positive class. The model predicts a mail as spam (a false positive) but it actually isn’t spam. This is a critical case, since a user might miss some important mail. So we want the number of false positives to be very low. In this case we can allow more false negatives (spam mails marked as not spam), so we use precision.
- Recall is true positives divided by the sum of true positives and false negatives. Here let’s say we are dealing with a model that predicts whether a person has the COVID virus or not. The critical case here is a false negative (a person who actually has COVID is declared healthy), since it can be life threatening. So in this case we want the number of false negatives to be very low, and we use recall.
With these two examples it should now be clear when to use precision and when to use recall. This will also be helpful in understanding the precision-recall curve, which we will see in a while.
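Here is a small sketch of computing both metrics, again assuming scikit-learn and using made-up predictions for a spam-style task (1 = spam, the positive class):

```python
from sklearn.metrics import precision_score, recall_score

# Hypothetical spam-detection labels: 1 = spam (the positive class), 0 = not spam
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]  # 2 true positives, 1 false positive, 2 false negatives

print(precision_score(y_true, y_pred))  # 2 / (2 + 1) = 0.67
print(recall_score(y_true, y_pred))     # 2 / (2 + 2) = 0.50
```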
But there are tasks where we want to consider both cases, so let’s see what we can do there.
F-beta is a metric that combines both precision and recall. People often refer to it as the F1 score, but the F1 score is simply the F-beta score when the value of beta = 1.
The formula for the F-beta score is given as
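F-beta = (1 + beta²) × (precision × recall) / ((beta² × precision) + recall)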
When we keep the value of beta as 1 the formula becomes
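F1 = 2 × (precision × recall) / (precision + recall)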
If you observe carefully, this is the harmonic mean of precision and recall.
When we want to give more importance to recall we can increase the value of beta above 1; on the other hand, if we want to weight precision more, we can make the value of beta less than 1.
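A quick sketch of how beta shifts the score, assuming scikit-learn and reusing the hypothetical spam predictions from before (precision 0.67, recall 0.50):

```python
from sklearn.metrics import fbeta_score

# Same hypothetical spam predictions as above
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

# beta < 1 leans towards precision, beta > 1 leans towards recall
print(fbeta_score(y_true, y_pred, beta=0.5))  # pulled towards precision (0.67)
print(fbeta_score(y_true, y_pred, beta=1))    # the plain F1 score
print(fbeta_score(y_true, y_pred, beta=2))    # pulled towards recall (0.50)
```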
Let’s consider an example of binary classification. Usually the output of the model is a probability score, and depending on whether that probability crosses a chosen threshold we pick one of the two classes as the predicted output.
The ROC curve uses different thresholds to plot the relationship between the true positive rate and the false positive rate. That is, it plots these rates for different values of the threshold between 0 and 1.
The graph is similar to the one given below and varies a little depending on the model trained.
The area under the ROC curve is the AUC. This area should be greater than the area of the triangle formed by the random-guessing diagonal (0.5), because otherwise the model would be performing worse than random guessing, which is pretty bad.
The true positive rate is calculated as the number of true positives divided by the sum of the number of true positives and the number of false negatives. It describes how good the model is at predicting the positive class when the actual outcome is positive.
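Similarly, the false positive rate is the number of false positives divided by the sum of false positives and true negatives:

TPR = Tp / (Tp + Fn)

FPR = Fp / (Fp + Tn)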
If we try to interpret the graph, the closer we are to 1 on the Y-axis the better. But to get there we have to trade off against the false positive rate.
The correct place to set the threshold is a matter of domain expertise and varies from case to case.
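Here is a minimal sketch of plotting the curve, assuming scikit-learn and matplotlib and a set of made-up predicted probabilities:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

# Hypothetical true labels and predicted probabilities for the positive class
y_true = [0, 0, 1, 1, 0, 1, 0, 1, 1, 0]
y_prob = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.3, 0.9, 0.6, 0.5]

fpr, tpr, thresholds = roc_curve(y_true, y_prob)
print(roc_auc_score(y_true, y_prob))  # area under the ROC curve

plt.plot(fpr, tpr, label="model")
plt.plot([0, 1], [0, 1], linestyle="--", label="random guessing")  # the diagonal baseline
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()
```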
Like the name suggests, the precision-recall curve helps in creating a visualization of precision and recall at different thresholds.
It helps in making an informed decision about the trade-off between precision and recall.
Again, this depends on the use case, as I explained earlier in the precision and recall discussion.
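A short sketch of plotting it, again assuming scikit-learn and matplotlib and the same hypothetical probability scores as before:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve

# Same hypothetical labels and probability scores as before
y_true = [0, 0, 1, 1, 0, 1, 0, 1, 1, 0]
y_prob = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.3, 0.9, 0.6, 0.5]

precision, recall, thresholds = precision_recall_curve(y_true, y_prob)

plt.plot(recall, precision)
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.show()
```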
These are a few of the metrics used for classification in machine learning, and depending on the use case we have to choose the right one. I want to emphasize again that solving a problem using machine learning is more than just creating a model.
The main reason we use machine learning is that we want to solve a problem, and the metrics we use provide insights into how close we are to the desired result.