
Accuracy is probably one of the easiest evaluation metrics to understand, but it is also often the least relevant. Accuracy measures the proportion of predictions a classification model gets right, counting both true positives and true negatives. You can calculate it with the formula below:
Using our Covid test example, this would be the percentage of all tests where the test correctly predicts that a person does have Covid or correctly predicts that a person does not have Covid; in other words, the true positives plus the true negatives, divided by the total number of predictions.
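Accuracy = (True Positives + True Negatives) / (True Positives + True Negatives + False Positives + False Negatives)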
When is accuracy most important?
If you have a balanced classification problem, where the number of positive cases is roughly equal to the number of negative cases, accuracy can be a useful metric for gauging overall model performance. However, when the classes are imbalanced, accuracy can be a pretty meaningless metric.
For example, say 90% of a sample is actually negative and only 10% is actually positive. If the model simply predicts negative for every case, it will achieve 90% accuracy, which makes it seem like the model is performing much better than it actually is. This brings us to our next evaluation metrics.
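To see this concretely, here is a minimal sketch using scikit-learn's accuracy_score (the 90/10 split and the always-negative "model" are just the hypothetical numbers from above):

```python
from sklearn.metrics import accuracy_score

# Hypothetical imbalanced sample: 90 actual negatives and 10 actual positives
y_true = [0] * 90 + [1] * 10

# A "model" that simply predicts negative for every case
y_pred = [0] * 100

# Prints 0.9 -- 90% accuracy, even though the model never identifies a single positive
print(accuracy_score(y_true, y_pred))
```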
Precision measures the proportion of predicted positives that are actually positive, that is, the true positives out of all predicted positives. Conservative models often have very high precision scores because they set a high threshold for predicting a positive. Precision is calculated as:
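Precision = True Positives / (True Positives + False Positives)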
In terms of our Covid example, precision would be calculated as the number of times someone with the virus tested positive out of all of the positive test results.
When is precision most important?
Precision is most useful when false positives are costly and we therefore want a high threshold for predicting a positive. For example, in spam filtering, a low threshold for classifying an email as spam could cause important emails to get caught in the spam filter. In this case, we would want our precision to be high, so that legitimate emails are rarely flagged as spam. However, in medical cases, having a high precision value is not always the top priority.
Recall measures the proportion of actual positives (true positives plus false negatives) that the model correctly identifies as positive. You may also hear recall referred to as the true positive rate. It can be calculated as:
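Recall = True Positives / (True Positives + False Negatives)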
Recall for our Covid test answers the question: out of all the times a person being tested actually had the virus, how many times did our test correctly predict the person was positive?
When is recall most important?
In our Covid test example, recall is the metric we should focus on most, and it is the primary metric for most medical tests. When someone is being tested for Covid, the two incorrect outcomes are a false positive (a person who doesn't have the virus is told that they do) and a false negative (a person who does have the virus is told that they do not). Although false positives aren't ideal, someone unnecessarily quarantining is a much better societal outcome than someone spreading the disease because they don't think they have it. This is a good example of the precision-recall tradeoff: Covid tests need to meet a certain level of recall because a false negative has a worse outcome than a false positive. Therefore, we would trade more false positives for fewer false negatives and use a lower threshold for predicting a positive. The tradeoff between the true positive rate and the false positive rate can be visualized in the Receiver Operating Characteristic (ROC) curve of the model.
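As a rough illustration of this tradeoff, here is a minimal sketch (the labels and predicted probabilities below are made-up values, not real Covid test data) showing how lowering the decision threshold raises recall at the cost of precision:

```python
from sklearn.metrics import precision_score, recall_score

# Made-up true labels and predicted probabilities from a hypothetical test model
y_true = [1, 1, 1, 0, 0, 1, 0, 0, 1, 0]
y_prob = [0.9, 0.6, 0.4, 0.3, 0.2, 0.7, 0.55, 0.1, 0.35, 0.45]

for threshold in (0.5, 0.3):
    # Lowering the threshold predicts positive more often
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    print(f"threshold={threshold}: "
          f"precision={precision_score(y_true, y_pred):.2f}, "
          f"recall={recall_score(y_true, y_pred):.2f}")

# threshold=0.5: precision=0.75, recall=0.60
# threshold=0.3: precision=0.62, recall=1.00
```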
F1 score is the harmonic mean of precision and recall, or a kind of weighted average of the two metrics. F1 score can be calculated as:
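F1 Score = 2 * (Precision * Recall) / (Precision + Recall)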
Because the harmonic mean is pulled down by whichever value is lower, the F1 score penalizes models that skew too heavily towards either precision or recall, making it a good metric for measuring overall model performance.
When is F1 score most important?
Although recall is more important than precision in our Covid example, it is still important to have some level of precision; otherwise, our test would be pretty useless. The F1 score takes both measures into account.
Now that you understand each of these metrics conceptually, let’s take a look at how to calculate these for ourselves in Python.