Learn how bias, variance and total error relate to model complexity and to overfitting/underfitting.
The bias-variance trade-off is an important concept that everyone in machine/deep learning should know. Regardless of the application area you plan to apply your models to, these concepts will help you diagnose why your model is not performing well on your training or test set.
The bias-variance trade-off will help us understand how models overfit and underfit the training data. It will also tell us whether our model can reduce its error below a particular value or not.
Before starting with the concepts of bias and variance, we will look at ML/DL models as a means to estimate an underlying function.
The general task in machine learning as well as deep learning is the estimation of the underlying function from the training data. We shall be using the following notation.
- f(x): true function
- f̂(x): estimated function
In classical estimation theory, the data needs to be modelled mathematically based on its characteristics and properties before estimating the actual model parameters. For signal processing, probability density functions (PDFs) are defined based on the unknown parameters of the mathematical model. For example, mean and variance are the parameters for a Gaussian process.
However, the specification of the PDF is a critical component in determining a good estimator. These PDFs aren't given to us; rather, they need to be chosen so that they are not only consistent with the problem constraints but also mathematically tractable, i.e. practically solvable in a reasonable amount of time. And every PDF comes with a set of underlying assumptions for the math to be valid. If those conditions aren't met, the estimator won't behave as intended.
For DL/ML models, these parameters are the weight and bias matrices. And rather than defining the type of PDF, we focus on the architecture type, initialization technique, loss function, etc. based on the problem definition. We then try to find the best possible values for these matrices from the training data, using gradient-descent-based optimization techniques to find the optimal, or best, function approximating the actual one. [You can read more about it here: Gradient Descent Unraveled]
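As a minimal sketch of that idea (using a hypothetical linear model y = w·x + b and made-up data, not anything from the article), gradient descent on an MSE loss might look like this:

```python
import numpy as np

# Hypothetical setup: the true function is f(x) = 2x + 1; we estimate
# a weight w and bias b from noisy samples of it.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 200)
y = 2.0 * x + 1.0 + rng.normal(0, 0.1, 200)

w, b = 0.0, 0.0   # initial parameter values
lr = 0.1          # learning rate

for _ in range(500):
    y_hat = w * x + b
    # Gradients of the MSE loss E[(y_hat - y)^2] w.r.t. w and b
    grad_w = 2 * np.mean((y_hat - y) * x)
    grad_b = 2 * np.mean(y_hat - y)
    w -= lr * grad_w
    b -= lr * grad_b

print(w, b)  # both approach the true values 2.0 and 1.0
```

The loop simply follows the negative gradient of the loss until the estimated parameters settle near the values that generated the data.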
Searching for optimal estimators usually requires an optimality criterion. For DL/ML models, this criterion is nothing but the loss function we use. Several loss functions are available out there; the choice depends on the problem you are trying to solve, e.g. classification, regression, segmentation, etc.
However, mean squared error (MSE) is the most natural one and is defined as follows:

MSE(f̂) = E[(f̂ − f)²]

The MSE of the estimated function is the expectation of the square of the difference between the estimated function and the actual function. I won't be covering what expectation is because that is beyond the scope of this article. However, if you want to read more about it, I would recommend the following.
MSE measures the mean squared deviation of the estimator from the true value. It has a nice convex surface, i.e. it is curved upwards and has only one optimum, so it is well suited to a wide range of optimization techniques.
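As a tiny numeric illustration (with made-up values for the true and estimated functions), the definition works out to:

```python
import numpy as np

# Hypothetical true values and estimates at three points
f_true = np.array([1.0, 2.0, 3.0])
f_hat = np.array([1.5, 1.5, 3.5])

# MSE = expectation (here: average) of the squared deviation
mse = np.mean((f_hat - f_true) ** 2)
print(mse)  # 0.25, the mean of (0.5², 0.5², 0.5²)
```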
The MSE error can be decomposed into two terms namely the bias and variance. I will show and explain the decomposition derivation below. For the derivation, f(x) shall be represented as f and f̂(x) as f̂.
But before that I’ll briefly give the mathematical definitions of bias and variance:
Bias
Bias is defined as: b(f̂) = E(f̂) − f
b(f̂) here is the bias of the estimator. It essentially measures the difference between the estimator's expected value and the true value of the parameter being estimated. [You can read more about it here: https://en.wikipedia.org/wiki/Bias_of_an_estimator]
Variance
Variance is defined as: Var(X) = E[(X − E(X))²]
The variance of a random variable X is the expected value of the square of the difference between the random variable and its expected value. So it is the mean squared deviation of a random variable from its own mean. A high variance means that the observed values of X tend to be far from the mean, and vice versa. [You can read about it here: https://www.stat.auckland.ac.nz/~fewster/325/notes/ch3.pdf]
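Both definitions can be checked empirically. As a sketch (using the two standard sample-variance estimators of a Gaussian as a hypothetical example, not one from the article), a Monte Carlo run makes the bias visible:

```python
import numpy as np

# Monte Carlo check of bias for two estimators of the variance of a
# Gaussian with true mean 0 and true variance 1.
rng = np.random.default_rng(0)
n, trials = 10, 100_000

samples = rng.normal(0.0, 1.0, size=(trials, n))
biased = samples.var(axis=1, ddof=0)    # divides by n     -> biased
unbiased = samples.var(axis=1, ddof=1)  # divides by n - 1 -> unbiased

# b(f_hat) = E(f_hat) - true value, estimated by averaging over trials
print(biased.mean() - 1.0)    # close to -1/n = -0.1
print(unbiased.mean() - 1.0)  # close to 0
```

Averaging each estimator over many independent datasets approximates its expectation, so the printed differences approximate the bias term b(f̂) defined above.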
Now we start the derivation from the definition of the MSE function, followed by a strategic addition and subtraction of the expectation of f̂. We then use a few properties and definitions to arrive at the bias and variance components.
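Spelled out in the article's notation (with f deterministic, so E(f̂) − f is a constant), the steps are:

MSE(f̂) = E[(f̂ − f)²]
        = E[(f̂ − E(f̂) + E(f̂) − f)²]
        = E[(f̂ − E(f̂))²] + 2·(E(f̂) − f)·E[f̂ − E(f̂)] + (E(f̂) − f)²
        = Var(f̂) + 0 + b(f̂)²

The cross term vanishes because E[f̂ − E(f̂)] = E(f̂) − E(f̂) = 0, leaving MSE(f̂) = b(f̂)² + Var(f̂).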
The bias-variance decomposition can therefore be summarized by the following relation: MSE(f̂) = b(f̂)² + Var(f̂)
Bias is the difference between the expected value of the estimator and the actual underlying function, while variance measures the variability of the model. In estimation theory, different types of estimators exist. In classical estimation theory, however, minimum MSE estimators often need to be abandoned from a practical point of view, because this criterion leads to unrealizable estimators. We then turn to Minimum Variance Unbiased (MVU) estimators, which have further specialized versions such as the Best Linear Unbiased Estimator (BLUE). If you are interested in estimation theory, Fundamentals of Statistical Signal Processing, Volume I: Estimation Theory is an amazing resource.
In the case of deep learning, the networks work as powerful estimators without any explicit model definition. All the analysis above is directly applicable in this case too.
Now let us see what the decomposition looks like when we add some noise to the random process. The observations Y of a random process that are available to us always contain some inherent noise, and can be written as follows:

Y = f(x) + ϵ

where ϵ has a normal distribution with zero mean and variance σ².
This assumption adds one term to the MSE decomposition. Since the derivation is almost the same as without the noise term, I'm writing the end equation directly. However, if you are interested in the derivation, kindly refer to this link.

E[(Y − f̂)²] = b(f̂)² + Var(f̂) + σ²

The σ² term represents the Bayes error.
Bayes error refers to the lowest possible error for any classifier and is analogous to the irreducible error. It is also known as the optimal error. Even if you build a perfect model, this error cannot be eliminated, because the training data itself is not perfect and contains noise.
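The noisy decomposition can also be verified numerically. Here is a sketch under assumed choices (a true function sin(2πx), a deliberately high-bias constant predictor, and σ = 0.3, all made up for illustration): averaging over many training sets, the measured error against fresh noisy observations should match bias² + variance + σ².

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(2 * np.pi * x)
sigma = 0.3                      # noise standard deviation
x0, n, trials = 0.25, 20, 50_000  # test point, training-set size, repeats

# Draw many training sets y = f(x) + eps and fit the constant c = mean(y),
# a deliberately simple (high-bias) estimator evaluated at x0.
x = rng.uniform(0, 1, (trials, n))
y = f(x) + rng.normal(0, sigma, (trials, n))
preds = y.mean(axis=1)

bias2 = (preds.mean() - f(x0)) ** 2   # squared bias at x0
var = preds.var()                     # variance of the estimator at x0

# Expected squared error against a fresh noisy observation Y at x0
y0 = f(x0) + rng.normal(0, sigma, trials)
mse = np.mean((y0 - preds) ** 2)

print(mse, bias2 + var + sigma**2)  # the two values nearly match
```

The left-hand number is the directly measured error; the right-hand number is the sum of the three decomposition terms, including the irreducible σ² that no choice of estimator can remove.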
So, the total error for your model is the addition of three kinds of errors:
- Error due to bias in the model
- Error due to the model variance
- Irreducible error (Bayes Error)
The relation between the bias, variance and total error can be explained by the following graph. The x-axis represents the complexity of our model and the y-axis is the error value.
- We see that as the complexity of the model increases, the bias decreases and the variance increases. This is because as the model becomes larger, its capacity to represent a function grows. In fact, if you make the model large enough, it can memorize the entire training data, driving the training error to zero (if the Bayes error is zero). However, an overly complex model will lead to poor generalization even though you will get good training performance. This is called overfitting.
- On the other hand, if your model is too simple, it will have very high bias and low variance. The error will be very high even for the training samples. If you observe that even after a lot of epochs your model still performs poorly on the training data, it likely means that either your data has corrupt labels or the model isn't complex enough to approximate the underlying function. This is what is called underfitting.
- As we see in the graph, the total error decreases until the optimal complexity point. This is where only the Bayes error is left and the model has the maximum performance. We achieve the right balance between bias and variance at this particular point.
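The shape of that curve can be reproduced in a few lines. As a sketch under assumed choices (polynomial models of increasing degree fit to noisy samples of a hypothetical true function sin(2πx), none of which come from the article):

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(2 * np.pi * x)

# Small noisy training set and a larger held-out test set
x_train = rng.uniform(0, 1, 15)
y_train = f(x_train) + rng.normal(0, 0.2, 15)
x_test = rng.uniform(0, 1, 200)
y_test = f(x_test) + rng.normal(0, 0.2, 200)

# Model complexity = polynomial degree; compare train vs. test MSE
errors = {}
for degree in (1, 3, 12):
    coeffs = np.polyfit(x_train, y_train, degree)  # least-squares fit
    errors[degree] = {
        "train": np.mean((np.polyval(coeffs, x_train) - y_train) ** 2),
        "test": np.mean((np.polyval(coeffs, x_test) - y_test) ** 2),
    }
    print(degree, errors[degree])
```

The degree-1 fit keeps both errors high (underfitting); the degree-12 fit drives the training error toward zero while the test error climbs back up (overfitting); a moderate degree sits near the optimal complexity point.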
The following are a few examples of what under-fitting, optimal-fitting and over-fitting look like.
We observe that models with high variance (rightmost column) also capture the underlying noise. This leads to excellent training performance but terrible test performance, since generalization is poorest in this case. Conversely, models with high bias (leftmost column) are unable to capture the underlying pattern in the data, so they perform poorly even on the training data. The optimal model is the most generalizable, since it has the right amount of bias and variance.
We saw how the bias-variance decomposition works, and how the total error, bias and variance are affected by model complexity. We talked about the Bayes error, the minimum error that is always present because of noisy observations. And finally, we saw what over-fitting, under-fitting and optimal-fitting look like. In one of my future posts, I plan to discuss techniques used in deep learning to overcome over-fitting/under-fitting. Thank you for reading my article. I hope you found it useful.
[1] Fundamentals of Statistical Signal Processing, Volume I: Estimation Theory by Steven M. Kay.
[2] https://en.wikipedia.org/wiki/Expected_value
[3] https://en.wikipedia.org/wiki/Bias_of_an_estimator
[4] https://www.stat.auckland.ac.nz/~fewster/325/notes/ch3.pdf