Precision and recall are two of the most fundamental evaluation metrics we have at our disposal.
When performing classification tasks, it is imperative to compare your models to each other and pick the one that fits best. When you are estimating values in regression, it makes sense to speak about error as a deviation from the true values, i.e. how far off the predictions were. But in classification, you are either correct or incorrect when classifying a binary variable. Consequently, we prefer to think in terms of how many false positives and false negatives a model produces. There are a few basic metrics for assessing the output of a classification algorithm. In this blog post, I will describe the precision and recall metrics, explain what makes each one unique, and talk about the relationship between the two.
Precision
Precision measures how reliable a model's positive predictions are. Take the example of a model that predicts whether or not an individual has a certain illness. Precision answers the following question: out of all the times the model said someone had the disease, how many of those patients really had it?
Keep in mind that a high precision score can be a little deceptive. For example, let's assume that we train a model on a sample of 100,000 patients in order to make predictions. This model predicts that 50,000 patients have a particular illness, while in fact only 45,000 of them have it. Following the formula for precision, which is “Precision = True Positives / Predicted Positives”, the precision of this model is 45,000 / 50,000, or 90 percent.
Now let's say we develop a second model that only predicts illness when it is incredibly clear that an individual is sick (obvious signs and symptoms). Out of the same 100,000 patients, this model flags only 40 people as sick, and it is right for each of those 40 instances. The second model has a 100 percent precision score, even though it overlooked 44,960 patients who actually have the disease. Very conservative models can achieve a high precision score in this manner, but that does not always mean they are the right model to use.
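To make the formula concrete, here is a minimal sketch in Python using the counts from the two hypothetical models above (the function name and the numbers are purely for illustration):

```python
def precision(true_positives, predicted_positives):
    """Precision = True Positives / Predicted Positives."""
    return true_positives / predicted_positives

# Model 1: flags 50,000 patients, of whom 45,000 are actually sick
print(precision(45_000, 50_000))  # 0.9 -> 90 percent

# Model 2: flags only 40 patients, all of whom are actually sick
print(precision(40, 40))          # 1.0 -> 100 percent
```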
Recall
Recall reveals what proportion of the actual positive class the model has captured. Following the same example from above, recall asks: among all the people who really had the illness, what proportion of them did our model correctly classify as having the illness?
It's worth remembering that recall can be a tricky statistic, since a higher recall score doesn't necessarily mean a stronger model overall. For instance, by classifying every single patient who comes through the door as having the disease in question, our model would easily score 100 percent on recall. It would produce a huge number of false positives, but every infected person would still be correctly identified as having the illness. The formula for recall can be remembered as “Recall = True Positives / Actual Total Positives”.
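As a rough illustration, and assuming that 45,000 of the 100,000 patients in the example above actually have the illness, the recall of the two models could be computed like this:

```python
def recall(true_positives, actual_positives):
    """Recall = True Positives / Actual Total Positives."""
    return true_positives / actual_positives

# Assumption for this sketch: 45,000 patients in the sample are actually sick
ACTUAL_POSITIVES = 45_000

# Model 1: correctly flags all 45,000 sick patients (plus 5,000 false positives)
print(recall(45_000, ACTUAL_POSITIVES))  # 1.0 -> 100 percent recall

# Model 2: correctly flags only 40 of the sick patients
print(recall(40, ACTUAL_POSITIVES))      # ~0.0009 -> roughly 0.09 percent recall
```

Put side by side with the precision numbers, this shows why the second model's perfect precision is not the whole story.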
How Precision and Recall are Related
Precision and recall tend to pull in opposite directions: pushing recall up usually drags precision down, and vice versa. Let's look at this through the prism of our previous example. A doctor who is obsessed with recall has a greater propensity to call anyone sick, because all they care about is making sure every sick person is recognized as sick. Their precision will be very poor, since they categorize nearly everybody as ill and don't mind being wrong.
A doctor who is overly concerned with precision, on the other hand, sets a very high bar for declaring someone sick: they only do so when they are absolutely confident they will be right. Their precision will be very high, but their recall will be very poor, and patients who are sick but do not clear the doctor's threshold will be wrongly labeled as healthy.
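One way to see this trade-off in code is to take a single classifier's predicted probabilities and vary the decision threshold: raising the threshold plays the role of the cautious, precision-focused doctor, while lowering it plays the role of the recall-obsessed one. The labels and scores below are made up purely for demonstration:

```python
# Toy illustration of the precision/recall trade-off: the same predicted
# probabilities evaluated at different decision thresholds.
labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]          # 1 = actually sick
scores = [0.95, 0.80, 0.60, 0.40, 0.55, 0.35, 0.30, 0.20, 0.10, 0.05]

for threshold in (0.3, 0.5, 0.9):
    predicted = [1 if s >= threshold else 0 for s in scores]
    tp = sum(p == 1 and y == 1 for p, y in zip(predicted, labels))
    fp = sum(p == 1 and y == 0 for p, y in zip(predicted, labels))
    fn = sum(p == 0 and y == 1 for p, y in zip(predicted, labels))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn)
    print(f"threshold={threshold}: precision={precision:.2f}, recall={recall:.2f}")
```

With the low threshold, recall is perfect but precision suffers; with the high threshold, precision is perfect but most sick patients are missed.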
Which Metric Should I Follow?
Asking which is worse, false positives or false negatives, is a popular philosophical question in data science. The answer, as you probably guessed, is that it depends on the task. It is important to apply your critical reasoning skills when weighing precision and recall. Your problem may be one where false negatives are worse than false positives (missing a serious disease, for example); in that case, you would prioritize a high recall over a high precision.
I hope this helped clarify the differences between precision and recall in machine learning. Next week, I'll be writing about accuracy and F1 score. Thank you for reading!