You are not alone if you had a hard time understanding what exactly Regularization is and how it works. Regularization can be a very confusing term and I’m attempting to clear up some of that in this article.
In this article I’ll do three things: (a) define the problem that we want to tackle with regularization; then (b) examine how exactly regularization helps; and finally (c) explain how regularization works in action.
Data scientists take great care during the modeling process to make sure their models work well and they are neither under- nor overfit.
Let’s say you want to predict house prices based on some features. You start with one feature, floor area, and you build your first regression model.
house_price = a + b1*floor_area + e
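As a quick sketch, here is what fitting that one-feature model might look like with scikit-learn. The floor areas and prices below are made-up numbers for illustration only:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical training data: floor area (sq ft) vs. sale price
floor_area = np.array([[1000], [1500], [2000], [2500]])
house_price = np.array([200_000, 290_000, 410_000, 500_000])

model = LinearRegression()
model.fit(floor_area, house_price)

# b1 (the slope) and a (the intercept) from the equation above
print(model.coef_[0], model.intercept_)
```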
But you know very well that floor area is just one criterion; there are many other factors that a buyer potentially considers before making a purchase decision, such as the number of bedrooms, garage condition, neighborhood characteristics, and school district. So your first model clearly underfits.
At the other extreme, you could end up selecting 200 different features that can potentially impact house prices. So you built a really complex model and tested it on the training data and found that it performed great! However, when it comes to making predictions on unseen/test data the model does poorly. Why is that? One reason is — the complex model that you just built “learned” every bit of noise but missed the signal in the training data.
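This train-versus-test gap is easy to reproduce. A minimal sketch, using synthetic data and polynomial features as a stand-in for "too many features" (degree 9 plays the role of the overly complex 200-feature model here):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)

# Synthetic data: a simple linear signal plus noise
X = rng.uniform(0, 1, size=(20, 1))
y = 2 * X.ravel() + rng.normal(0, 0.3, size=20)
X_test = rng.uniform(0, 1, size=(20, 1))
y_test = 2 * X_test.ravel() + rng.normal(0, 0.3, size=20)

results = {}
for degree in (1, 9):  # a simple model vs. a needlessly complex one
    poly = PolynomialFeatures(degree)
    model = LinearRegression().fit(poly.fit_transform(X), y)
    train_mse = mean_squared_error(y, model.predict(poly.transform(X)))
    test_mse = mean_squared_error(y_test, model.predict(poly.transform(X_test)))
    results[degree] = (train_mse, test_mse)
    print(f"degree {degree}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")
```

The complex model will always match the training data at least as closely as the simple one; the interesting part is how it fares on the held-out test set.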
So how to find the sweet spot where a model is NOT too complex but complex enough to pick up the signal and performs relatively well in out-of-sample data?
Regularization finds that sweet spot.
Ideally, if we had a large number of features, we’d add in features one by one and in different combinations to see their impacts on model performance and choose the best model based on the performance metric.
house_price = floor_area + garage_condition             (model 1)
house_price = garage_condition + bedrooms               (model 2)
house_price = floor_area + garage_condition + bedrooms  (model 3)
… and so on.
Do you see the problem here? If we did feature selection that way, we’d end up running thousands of models with different feature combinations and parameter values.
It works differently in machine learning. We choose an algorithm, then select all the features at once, run the model and evaluate model performance at the very end. But aren’t we overfitting with lots of features that we actually want to avoid?
That’s where regularization comes in handy. Even if we include redundant features, regularization dampens their effects, and in some cases removes them from the model entirely. It does this by shrinking model coefficients towards zero.
Let’s start with the cost function (a.k.a objective function) that we want to optimize in regression.
You know what a cost function is, right?
The difference between a predicted and an actual value is called the error. Every data point in a dataset creates such errors, and the role of a cost function is to quantify these errors. And in linear regression, the objective is to minimize those errors to find the best fit model.
There are several cost functions out there such as Mean Squared Error (MSE), Mean Absolute Error (MAE), Root Mean Squared Error (RMSE) etc. but they all do the same thing — quantifying errors.
As an example, this is how MSE works:
- it takes the difference between the observed and predicted value (Yᵢ − Ŷᵢ) for each data point (i),
- squares the difference,
- repeats the process for all points,
- sums them up, and finally
- takes an average by dividing by the number of data points (n).
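The steps above can be sketched in a few lines of Python (the observed and predicted values are made up for illustration):

```python
import numpy as np

# Hypothetical observed and predicted values for three data points
y_actual = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.5, 5.0, 8.0])

errors = y_actual - y_pred           # difference for each data point
squared = errors ** 2                # square each difference
mse = squared.sum() / len(y_actual)  # sum them up and divide by n
print(mse)
```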
It is the value of MSE that we want to minimize to find the best model.
So if our regression model is Ŷ = α + θᵢXᵢ (where θ is the coefficient of X), then the cost function following the MSE formulation above is:

Cost = (1/n) Σᵢ (Yᵢ − Ŷᵢ)²
So the purpose of regularization is to add a small bias in the error function:
Cost(x, y) = Error(x, y) + Regularization term
There are two kinds of regularization terms — L1 and L2. Depending on which term is used, a normal multiple regression is called by different names.
Ridge regression
A normal regression is called “Ridge regression” when it uses L2 regularization. The purpose of L2 is to shrink feature coefficients close to zero, but not exactly to zero (can you guess what happens if a coefficient is exactly zero? The answer comes next).
So the cost function we want to minimize with Ridge regression is:

Cost = (1/n) Σᵢ (Yᵢ − Ŷᵢ)² + λ Σⱼ θⱼ²
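A minimal sketch of this shrinkage effect, using scikit-learn’s `Ridge` on synthetic data (the feature values, true coefficients, and the penalty strength `alpha`, scikit-learn’s name for λ, are all illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(42)

# Synthetic data: 5 features, of which only 3 truly matter
X = rng.normal(size=(50, 5))
y = X @ np.array([3.0, 0.0, -2.0, 0.0, 1.0]) + rng.normal(0, 0.5, 50)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=100.0).fit(X, y)  # alpha is the lambda hyperparameter

# Ridge coefficients are pulled towards zero, but none is exactly zero
print(np.abs(ols.coef_))
print(np.abs(ridge.coef_))
```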
LASSO regression
LASSO regression goes a step further: through L1 regularization, it sets some feature coefficients exactly to zero. The cost function to minimize is:

Cost = (1/n) Σᵢ (Yᵢ − Ŷᵢ)² + λ Σⱼ |θⱼ|

This process effectively eliminates those features from the model instead of merely dampening their impact.
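Here is a sketch of that elimination effect with scikit-learn’s `Lasso`, again on synthetic data with assumed coefficients and an illustrative `alpha` (λ) value:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)

# Synthetic data: only the 1st and 3rd features carry signal
X = rng.normal(size=(100, 5))
y = X @ np.array([3.0, 0.0, -2.0, 0.0, 0.0]) + rng.normal(0, 0.5, 100)

lasso = Lasso(alpha=0.5).fit(X, y)  # alpha is the lambda hyperparameter

# Some coefficients are driven exactly to zero: those features drop out
print(lasso.coef_)
```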
Hyperparameter λ: So in both L1 and L2 regularization, we have a parameter λ, which is called a hyperparameter in machine learning lingo. This is the only parameter responsible for penalizing the features.
So the question is: what values does λ take, and how do we find the right one? One thing to remember is that λ adds bias to our model, so we don’t want too large a value. Instead, we want to get away with as small a bias as possible while still achieving our main goal of reducing feature sensitivity.
There are several ways to find the λ value. How each method works is a discussion for another day, but know that two popular methods are gradient descent and cross-validation.
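For a taste of the cross-validation approach, here is a sketch using scikit-learn’s `RidgeCV`, which tries a list of candidate λ (`alphas`) values and keeps the one that validates best; the data and candidate values are made up for illustration:

```python
import numpy as np
from sklearn.linear_model import RidgeCV

rng = np.random.default_rng(1)

# Synthetic data for the search
X = rng.normal(size=(80, 4))
y = X @ np.array([2.0, -1.0, 0.0, 0.5]) + rng.normal(0, 0.3, 80)

# Try several candidate lambda values; cross-validation picks the winner
search = RidgeCV(alphas=[0.01, 0.1, 1.0, 10.0, 100.0]).fit(X, y)
print(search.alpha_)  # the selected lambda
```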