## Learn about bias, variance, and total error, and how they relate to model complexity and overfitting/underfitting.

The bias-variance trade-off is an important concept that everyone in the machine/deep learning field should know. Regardless of the application area you plan to apply your models to, these concepts will help you take steps to diagnose why your model is not performing well on your training or test set.

The bias-variance trade-off will help us understand the concepts of models over-fitting and under-fitting to the training data, and will let us tell whether our model can reduce its error below a particular value or not.

Before starting with the concepts of bias and variance, we will look at ML/DL models as a means to estimate an underlying function.

The general task in machine learning as well as deep learning is the estimation of the underlying function from the training data. We shall be using the following notation.

**f(x)**: true function

**f̂(x)**: estimated function
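To make this notation concrete, here is a minimal sketch (the sinusoidal true function, the Gaussian noise level, and the polynomial estimator are all illustrative choices, not from the original text) of sampling training data from f(x) and producing an estimate f̂(x):

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # The "true" underlying function (illustrative choice).
    return np.sin(2 * np.pi * x)

# Noisy training observations: y = f(x) + noise.
x_train = rng.uniform(0, 1, size=50)
y_train = f(x_train) + rng.normal(scale=0.1, size=50)

# Estimate f with a degree-3 polynomial fit; f_hat plays the role of f̂(x).
coeffs = np.polyfit(x_train, y_train, deg=3)
f_hat = np.poly1d(coeffs)

# Compare the estimate against the true function on held-out points.
x_test = np.linspace(0, 1, 100)
test_error = np.mean((f(x_test) - f_hat(x_test)) ** 2)
print(f"mean squared error of f_hat vs f: {test_error:.4f}")
```

Any other estimator family (a neural network, a spline, a higher-degree polynomial) would slot into the same picture: only the way f̂ is constructed changes.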

In classical estimation theory, the data needs to be modelled mathematically, based on its characteristics and properties, before the actual model parameters can be estimated. In signal processing, for example, probability density functions (PDFs) are defined in terms of the unknown parameters of the mathematical model; mean and variance, for instance, are the parameters of a Gaussian process.
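As a concrete instance of this classical setup, here is a small sketch (the "true" parameter values and sample size are illustrative) of estimating the two parameters of a Gaussian model, mean and variance, from observed samples via the usual maximum-likelihood estimators:

```python
import numpy as np

rng = np.random.default_rng(42)

# Unknown "true" parameters of the Gaussian process (chosen for illustration).
true_mean, true_var = 3.0, 4.0

# Observed data assumed to be drawn from that Gaussian.
samples = rng.normal(loc=true_mean, scale=np.sqrt(true_var), size=10_000)

# Maximum-likelihood estimates of the parameters.
est_mean = samples.mean()
est_var = samples.var()  # ML estimate divides by N, so it is slightly biased

print(f"estimated mean: {est_mean:.3f}, estimated variance: {est_var:.3f}")
```

Note that the whole procedure rests on the assumption that the data really is Gaussian; if that modelling assumption is wrong, these estimators lose their optimality guarantees.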

However, the specification of the PDF is a critical step in finding a good estimator. These PDFs aren't given to us; rather, they need to be chosen so that they are not only consistent with the problem constraints but also mathematically tractable, i.e. practically solvable in a reasonable amount of time. And every PDF comes with a set of underlying assumptions for the math to be valid. If those conditions aren't met, the estimator won't behave as intended.

For DL/ML models, these parameters are the weight and bias matrices. Rather than defining the type of PDF, we focus on the architecture type, initialization technique, loss function, etc., based on the problem definition, and we try to find the best possible values for these matrices from the training data. We use gradient-descent-based optimization techniques to find the optimal or best function approximating the actual one. [You can read more about it here: Gradient Descent Unraveled]
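As a minimal sketch of this idea (plain NumPy, with an illustrative learning rate and synthetic data), gradient descent on an MSE loss for a single-feature linear model looks like this:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data from a known linear relation y = 2x + 1 plus noise.
x = rng.uniform(-1, 1, size=200)
y = 2.0 * x + 1.0 + rng.normal(scale=0.05, size=200)

w, b = 0.0, 0.0  # the model's parameters: weight and bias
lr = 0.1         # learning rate (illustrative choice)

for _ in range(500):
    y_hat = w * x + b
    # Gradients of the MSE loss (1/N) * sum((y_hat - y)^2) w.r.t. w and b.
    grad_w = 2 * np.mean((y_hat - y) * x)
    grad_b = 2 * np.mean(y_hat - y)
    w -= lr * grad_w
    b -= lr * grad_b

print(f"w ≈ {w:.2f}, b ≈ {b:.2f}")  # should approach 2 and 1
```

Deep learning frameworks automate the gradient computation, but the update rule is the same: move each parameter a small step against its gradient.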

Searching for optimal estimators usually requires an optimality criterion. For DL/ML models, this criterion is nothing but the loss function that we use. Several loss functions are available; the choice depends on the problem you are trying to solve, e.g. classification, regression, segmentation, etc.
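To illustrate how the criterion follows the task, here is a small sketch (NumPy, with illustrative values) contrasting a typical regression loss with a typical binary-classification loss:

```python
import numpy as np

# Regression: mean squared error between targets and predictions.
y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.1, 1.9, 3.2])
mse = np.mean((y_true - y_pred) ** 2)

# Binary classification: cross-entropy between labels and predicted probabilities.
labels = np.array([1, 0, 1])
probs = np.array([0.9, 0.2, 0.8])
cross_entropy = -np.mean(labels * np.log(probs)
                         + (1 - labels) * np.log(1 - probs))

print(f"MSE: {mse:.4f}, cross-entropy: {cross_entropy:.4f}")
```

Whichever criterion is chosen, the optimizer's job is the same: drive it down over the training data.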

However, mean squared error (MSE) is the most natural one and is defined as follows.