Model selection and evaluation is a hugely important part of the machine learning workflow. This is the stage at which we analyse our model, look at more insightful statistics about its performance, and decide what actions to take to improve it. This step is often the difference between a model that performs well and one that performs very well. Evaluating our model gives us greater insight into what it predicts well and what it doesn’t, and that insight helps us turn a model that predicts our dataset with 65% accuracy into one closer to 80% or 90%.
Metrics and Scoring
Let’s say we have two hypotheses for a task, h(x) and h’(x). How would we know which one is better? From a high-level perspective, we might take the following steps:
- Measure the accuracy of both hypotheses
- Determine whether the difference between the two results is statistically significant. If it is, select the better-performing hypothesis. If not, we cannot say with any statistical certainty that either h(x) or h’(x) is better (a sketch of such a test follows this list).
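To make the second step concrete, here is a minimal sketch, assuming both hypotheses have been run on the same labelled test set. The arrays y_true, pred_h and pred_h_prime are purely illustrative, and a McNemar-style test on the instances where the two models disagree is one common way to check for statistical significance.

```python
import numpy as np
from scipy.stats import chi2

# Hypothetical predictions from h(x) and h'(x) on the same labelled test set.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 0])
pred_h = np.array([1, 0, 1, 0, 0, 1, 1, 0, 1, 0])
pred_h_prime = np.array([1, 0, 0, 0, 0, 1, 0, 0, 1, 1])

# b: instances h(x) got right but h'(x) got wrong; c: the reverse.
# (Assumes the two models disagree on at least one instance.)
b = np.sum((pred_h == y_true) & (pred_h_prime != y_true))
c = np.sum((pred_h != y_true) & (pred_h_prime == y_true))

# McNemar's test statistic (with continuity correction) and its p-value.
statistic = (abs(b - c) - 1) ** 2 / (b + c)
p_value = chi2.sf(statistic, df=1)

print(f"b={b}, c={c}, statistic={statistic:.3f}, p-value={p_value:.3f}")
# If p_value is below our chosen threshold (e.g. 0.05), we would say the
# difference between the two hypotheses is statistically significant.
```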
When we have a classification task, we consider the accuracy of our model in terms of its ability to assign an instance to its correct class. Consider this at a binary level: we have two classes, 1 and 0. A correct prediction is therefore one where the model classifies a class 1 instance as class 1, or a class 0 instance as class 0. Taking the 1 class to be the ‘Positive’ class and the 0 class to be the ‘Negative’ class, we can build a table that outlines all the possibilities our model might produce.
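That table, with the predicted class along one axis and the actual class along the other, looks like this:

|                   | Actual class 1 | Actual class 0 |
|-------------------|----------------|----------------|
| Predicted class 1 | True Positive  | False Positive |
| Predicted class 0 | False Negative | True Negative  |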
We also have names for these classifications. True Positive and True Negative are our correct classifications, since in both cases the actual class and the predicted class are the same. The other two outcomes, in which the model predicts incorrectly, can be explained as follows:
- False Positive — when the model predicts 1, but the actual class is 0, also known as Type I Error
- False Negative — when the model predicts 0, but the actual class is 1, also known as Type II Error
When we take a series of instances and populate the above table with the frequency of each outcome, we have produced what is known as a confusion matrix. This is a good way to begin evaluating a hypothesis, as it goes a little further than a simple accuracy rate. From the confusion matrix we can define the accuracy rate, along with a few other metrics that show how well our model is performing. We use the abbreviations True Positive (TP), True Negative (TN), False Positive (FP) and False Negative (FN).
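As a rough sketch of how these numbers are computed, assuming scikit-learn is available, the snippet below builds a confusion matrix from made-up labels (y_true and y_pred are purely illustrative) and applies the standard formulas shown in the comments:

```python
import numpy as np
from sklearn.metrics import (confusion_matrix, accuracy_score,
                             precision_score, recall_score, f1_score)

# Hypothetical ground-truth labels and model predictions for a binary task.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 1])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0, 0, 1])

# Rows are actual classes, columns are predicted classes.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp}, TN={tn}, FP={fp}, FN={fn}")

# Accuracy  = (TP + TN) / (TP + TN + FP + FN)
# Precision = TP / (TP + FP)
# Recall    = TP / (TP + FN)
# F1        = 2 * (precision * recall) / (precision + recall)
print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("f1       :", f1_score(y_true, y_pred))
```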
Below is an example of a 3-class confusion matrix and how we transform it into recall, precision and F1 figures for each class. It’s important to mention that when there are more than 2 classes, the metrics are calculated one class at a time: the class in question is treated as the 1 (‘Positive’) class, and every other class is grouped together as the 0 (‘Negative’) class.
An Example
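As a hedged sketch of how this works in practice, assuming scikit-learn is available, the snippet below builds a 3-class confusion matrix from made-up labels (y_true and y_pred are purely illustrative) and reports per-class precision, recall and F1, scoring each class against all other classes combined:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, classification_report

# Hypothetical labels for a 3-class problem (classes 0, 1 and 2).
y_true = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 2])
y_pred = np.array([0, 0, 1, 1, 1, 2, 2, 2, 0, 2])

# 3x3 confusion matrix: rows are actual classes, columns are predicted classes.
print(confusion_matrix(y_true, y_pred))

# Precision, recall and F1 for each class, treating that class as 'Positive'
# and every other class as 'Negative'.
print(classification_report(y_true, y_pred))
```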
These additional metrics are useful because they give us more insight into the behaviour of a classifier. For example, consider the following two classifiers:
With a simple accuracy measurement, we would be inclined to say that Classifier B is the better classifier, because its accuracy rate is higher at 0.75 vs 0.50. However, when we introduce our other metrics, we find that the precision and recall of Classifier B are both 0, while the precision and recall of Classifier A are 0.25 and 0.50 respectively. From this information we see a completely different picture: Classifier B is relatively useless, because it cannot predict any class 1 examples, and the precision and recall reflect this where the accuracy figure does not.
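As a sketch only, the counts below are one hypothetical pair of confusion matrices (200 test instances, 50 of them in class 1) that would produce the figures above; the code recomputes accuracy, precision and recall from the raw counts. Note that precision is strictly undefined when a classifier makes no positive predictions; the common convention of reporting it as 0 is used here.

```python
# One hypothetical set of counts consistent with the figures above.
classifiers = {
    "A": {"tp": 25, "fp": 75, "fn": 25, "tn": 75},   # accuracy 0.50
    "B": {"tp": 0,  "fp": 0,  "fn": 50, "tn": 150},  # accuracy 0.75
}

for name, c in classifiers.items():
    total = c["tp"] + c["fp"] + c["fn"] + c["tn"]
    accuracy = (c["tp"] + c["tn"]) / total
    # Report precision as 0 when there are no positive predictions.
    precision = c["tp"] / (c["tp"] + c["fp"]) if (c["tp"] + c["fp"]) else 0.0
    recall = c["tp"] / (c["tp"] + c["fn"]) if (c["tp"] + c["fn"]) else 0.0
    print(f"Classifier {name}: accuracy={accuracy:.2f}, "
          f"precision={precision:.2f}, recall={recall:.2f}")
```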