You are not alone if you've had a hard time understanding what exactly regularization is and how it works. It can be a confusing term, and I'm attempting to clear up some of that confusion in this article.

In this article I’ll do three things: (a) define the problem that we want to tackle with regularization; then (b) examine how exactly regularization helps; and finally (c) explain how regularization works in action.

Data scientists take great care during the modeling process to make sure their models work well and they are neither under- nor overfit.

Let’s say you want to predict house prices based on some features. You start with one feature, floor area, and you build your first regression model.

*house_price = a + b1 * floor_area + e*
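As a quick sketch, here is what fitting that single-feature model might look like in scikit-learn. The numbers are made up for illustration; only the floor-area feature from the article is assumed:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: floor area (sq ft) vs. sale price -- values are invented
floor_area = np.array([[1000], [1500], [2000], [2500], [3000]])
house_price = np.array([200_000, 280_000, 350_000, 430_000, 500_000])

model = LinearRegression()
model.fit(floor_area, house_price)

# a (intercept) and b1 (slope) from the equation above
a, b1 = model.intercept_, model.coef_[0]
```

The fitted `a` and `b1` play the roles of the intercept and slope in the equation above; `e` is the leftover error the line cannot explain.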

But you know very well that floor space is just one criterion. There are many other factors, such as the number of bedrooms, garage condition, neighborhood characteristics, school district, and many more, that a buyer potentially considers before making a purchasing decision. So your first model is clearly underfit.

At the other extreme, you could end up selecting 200 different features that can potentially impact house prices. So you built a really complex model, tested it on the training data, and found that it performed great! However, when it comes to making predictions on unseen/test data, the model does poorly. Why is that? One reason is that the complex model you just built "learned" every bit of noise but missed the signal in the training data.

So how do you find the sweet spot where a model is NOT too complex, but complex enough to pick up the signal and perform relatively well on out-of-sample data?

**Regularization** finds that sweet spot.

Ideally, if we had a large number of features, we'd add features one by one and in different combinations, observe their impact on model performance, and choose the best model based on a performance metric.

*house_price = floor_area + garage_condition (model 1)*

*house_price = garage_condition + bedrooms (model 2)*

*house_price = floor_area + garage_condition + bedrooms (model 3)*

… and so on.

Do you see the problem here? If we did feature selection that way, we’d end up running thousands of models with different feature combinations and parameter values.
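"Thousands" is actually an understatement. A quick back-of-the-envelope count shows how fast the combinations explode with 200 candidate features:

```python
from math import comb

n_features = 200

# Every non-empty subset of 200 features is a candidate model
n_subsets = 2**n_features - 1

# Even restricting to models with exactly 3 features:
n_triples = comb(n_features, 3)  # 1,313,400 models
```

Exhaustively fitting even the three-feature models alone means over a million regressions, which is why we don't select features this way.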

It works differently in machine learning. We choose an algorithm, select all the features at once, run the model, and evaluate performance at the very end. But aren't we then overfitting with lots of features, which is exactly what we want to avoid?

That's where regularization comes in handy. Even if we have redundant features, regularization *controls* their effects, making the model less sensitive to them and sometimes dropping them entirely. It does this by shrinking model coefficients towards zero.

Let’s start with the **cost function** (a.k.a **objective function)** that we want to optimize in regression.

You know what a cost function is, right?

The difference between a predicted and an actual value is called the *error*. Every data point in a dataset creates such errors, and the role of a cost function is to quantify these errors. And in linear regression, the objective is to minimize those errors to find the best fit model.

There are several cost functions out there, such as Mean Squared Error (MSE), Mean Absolute Error (MAE), and Root Mean Squared Error (RMSE), but they all do the same thing: quantify errors.

As an example, this is how MSE works:

- it takes the difference between observed and predicted values (Y − Y-hat) for each data point (*i*),
- squares the difference,
- repeats the process for all points,
- sums them up, and finally
- takes an average by dividing by the number of data points (*n*).
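The steps above translate directly into a few lines of NumPy; the sample values below are made up to show the arithmetic:

```python
import numpy as np

def mse(y, y_hat):
    """Mean Squared Error, following the steps above."""
    diff = y - y_hat               # (Y - Y-hat) for each data point i
    squared = diff ** 2            # square each difference
    return squared.sum() / len(y)  # sum them up, divide by n

y = np.array([3.0, 5.0, 7.0])      # observed values
y_hat = np.array([2.0, 5.0, 9.0])  # predicted values
mse(y, y_hat)  # (1 + 0 + 4) / 3 = 5/3
```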

It is the value of MSE that we want to minimize to find the best model.