Let’s see how regularization and the learning rate alpha affect model performance.

As you’ll see shortly, hyperparameter tuning affects a model’s accuracy and F1 score. Not sure what these metrics mean? See their definitions in my previous Titanic article.

## Effect of regularization

I used scikit-learn’s LogisticRegression classifier to fit and test my data. There are many solvers to choose from, and each solver has its own algorithm for convergence. For illustrative purposes, I chose the *“saga”* solver. It’s the only solver that supports L1, L2, and no regularization.

Note: scikit-learn’s LogisticRegression does not take the λ regularization parameter directly. Instead, it takes a “C”, which is the inverse of regularization strength. Think of it as 1/λ: a smaller C means stronger regularization.
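To make the inverse relationship concrete, here is a minimal pure-Python sketch. It is not scikit-learn’s implementation: it assumes a one-feature logistic model with no intercept, a made-up toy dataset, and an L2 penalty scaled as (1/C)·w²/(2n). The names `fit_weight`, `xs`, and `ys` are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_weight(xs, ys, C, lr=0.1, steps=2000):
    """Fit w for p(y=1|x) = sigmoid(w * x) with an L2 penalty of strength 1/C."""
    lam = 1.0 / C  # the "lambda" that C is the inverse of
    n = len(xs)
    w = 0.0
    for _ in range(steps):
        # Gradient of the mean log loss with respect to w.
        grad = sum((sigmoid(w * x) - y) * x for x, y in zip(xs, ys)) / n
        # Gradient of the penalty term (lam / (2 * n)) * w**2.
        grad += lam * w / n
        w -= lr * grad
    return w


# Toy, made-up data (not the Titanic set); it is not perfectly separable.
xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
ys = [0, 0, 1, 0, 1, 1]

w_strong = fit_weight(xs, ys, C=0.05)  # small C -> strong regularization
w_weak = fit_weight(xs, ys, C=5)       # large C -> weak regularization
print(f"C=0.05: w={w_strong:.3f}   C=5: w={w_weak:.3f}")
```

The smaller C pulls the weight closer to zero, which is what a larger λ would do.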

I used scikit-learn’s GridSearchCV to obtain the model’s score for every combination of `penalty = ["none", "l1", "l2"]` and `C = [0.05, 0.1, 0.5, 1, 5]`.

    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GridSearchCV

    clf = LogisticRegression(solver='saga', max_iter=5000, random_state=0)

    # Older scikit-learn releases spell "no penalty" as 'none'; newer ones use None.
    param_grid = {
        'penalty': ['none', 'l1', 'l2'],
        'C': [0.05, 0.1, 0.5, 1, 5],
    }

    grid_search = GridSearchCV(clf, param_grid=param_grid)
    grid_search.fit(X, y)
    result = grid_search.cv_results_

GridSearchCV runs 5-fold cross-validation internally by default. The average model score for each combination is:

L2 regularization with a C of 0.1 performed the best!
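If you want to rank the combinations yourself, here is a small sketch of one way to do it. It relies only on the `params` and `mean_test_score` entries of `cv_results_`. The helper name `rank_grid_results` is hypothetical, and the scores in `example_results` are placeholders for illustration, not my actual results.

```python
def rank_grid_results(cv_results):
    """Return (params, mean_test_score) pairs, best score first."""
    pairs = zip(cv_results["params"], cv_results["mean_test_score"])
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


# Placeholder values for illustration only; these are NOT the article's results.
example_results = {
    "params": [
        {"C": 0.1, "penalty": "l1"},
        {"C": 0.1, "penalty": "l2"},
        {"C": 1, "penalty": "l2"},
    ],
    "mean_test_score": [0.70, 0.80, 0.75],
}

for params, score in rank_grid_results(example_results):
    print(f"{score:.4f}  {params}")
```

With a real search, you would pass in `grid_search.cv_results_` instead of the placeholder dictionary.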

**Side note #1:** I also implemented a random search with scikit-learn’s RandomizedSearchCV. If you’re curious, you can find the example in my Jupyter Notebook.

**Side note #2:** I’m sure you noticed that no regularization performed better than L1, and in many cases, there was no difference between no regularization and L2. The best explanation I have is that scikit-learn’s LogisticRegression might already work well without regularization. Nevertheless, regularization did bring some improvement.

We’ll see later that regularization does play a big role in the SGDClassifier.

I then did a side-by-side comparison of several performance metrics without regularization and with L2 regularization.

    import pandas as pd
    from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

    # Same solver and seed for both models; only the regularization differs.
    tuned = LogisticRegression(solver='saga', penalty='l2', C=0.1, max_iter=5000, random_state=2)
    not_tuned = LogisticRegression(solver='saga', penalty='none', max_iter=5000, random_state=2)

    tuned.fit(X_train, y_train)
    not_tuned.fit(X_train, y_train)

    y_pred_tuned = tuned.predict(X_test)
    y_pred_not_tuned = not_tuned.predict(X_test)

    # One row per metric, one column per model.
    data = {
        'accuracy': [accuracy_score(y_test, y_pred_tuned), accuracy_score(y_test, y_pred_not_tuned)],
        'precision': [precision_score(y_test, y_pred_tuned), precision_score(y_test, y_pred_not_tuned)],
        'recall': [recall_score(y_test, y_pred_tuned), recall_score(y_test, y_pred_not_tuned)],
        'f1 score': [f1_score(y_test, y_pred_tuned), f1_score(y_test, y_pred_not_tuned)],
    }

    pd.DataFrame.from_dict(data, orient='index', columns=['tuned', 'not tuned'])

The tuned model performed better than the untuned one on every metric except recall. Again, read this blog post if you need a refresher on what these metrics mean.
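As a quick refresher, here is a pure-Python sketch of how the four metrics come out of the true/false positive and negative counts. The function name `binary_metrics` and the label lists are made up for illustration. They are not the Titanic predictions, and this is not how scikit-learn implements its metric functions.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up labels: the model never raises a false alarm but misses one positive.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 0, 1, 0]

metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

In this toy case precision is perfect while recall lags, which shows how a model can lead on most metrics and still trail on recall.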

## Effect of learning rate (and regularization)

To see how different learning rates affect model performance, I used scikit-learn’s SGDClassifier (stochastic gradient descent classifier). It lets me tweak the learning rate, whereas the LogisticRegression classifier does not.

There are three SGDClassifier parameters we could tweak: `alpha`, `learning_rate`, and `eta0`. The terminology is a bit confusing, so bear with me.

The `learning_rate` is the type of learning rate (“optimal” vs. “constant”).

The `eta0` is the algorithm’s learning rate when `learning_rate` is “constant”. Normally, I call `eta0` alpha.

The `alpha` is the constant that multiplies the regularization term. It’s also used to calculate the learning rate when `learning_rate` is “optimal”. `alpha` serves the purpose of what’s commonly referred to as lambda.

Thus, there are several ways to set the learning rate in SGDClassifier. If you want a constant learning rate, set `learning_rate='constant'` and `eta0=the_learning_rate_you_want`. If you want a dynamic learning rate (that depends on the step you’re at), set `learning_rate='optimal'`. In the case of “optimal”, `eta0` is not used, and `alpha` serves the dual purpose of regularization strength and a constant in computing the dynamic learning rate at each step.
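The two schedules can be sketched in a few lines of plain Python. The scikit-learn documentation gives the “optimal” schedule as `eta = 1.0 / (alpha * (t + t0))`, where `t0` is chosen by a heuristic. In the sketch below, the function name `sgd_learning_rate` is hypothetical and `t0` is just a stand-in constant that I pass in by hand, so treat the numbers as illustrative only.

```python
def sgd_learning_rate(schedule, t, eta0=0.0, alpha=0.0001, t0=1.0):
    """Step size at update number t for the two schedules discussed above.

    t0 is a stand-in constant here; scikit-learn derives its own value.
    """
    if schedule == "constant":
        return eta0  # alpha plays no part in the step size
    if schedule == "optimal":
        return 1.0 / (alpha * (t + t0))  # eta0 is ignored
    raise ValueError(f"unsupported schedule: {schedule}")


# Constant: the same step size at every update.
print([sgd_learning_rate("constant", t, eta0=0.01) for t in (0, 10, 100)])

# Optimal: the step size shrinks as t grows and depends on alpha.
print([sgd_learning_rate("optimal", t, alpha=0.1, t0=10.0) for t in (0, 10, 90)])
```

Note that a larger `alpha` makes every “optimal” step smaller, which is the dual purpose mentioned above.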

Below is a grid search for finding the best hyperparameters with a constant learning rate. I set the maximum number of iterations to 50,000.

    from sklearn.linear_model import SGDClassifier
    from sklearn.model_selection import GridSearchCV
    import matplotlib.pyplot as plt

    # loss="log" gives logistic regression; newer scikit-learn releases call it "log_loss".
    sgd = SGDClassifier(loss="log", penalty="l2", max_iter=50000, random_state=100)

    param_grid = {
        'eta0': [0.00001, 0.0001, 0.001, 0.01, 0.1, 1],   # constant learning rate
        'learning_rate': ['constant'],
        'alpha': [0.00001, 0.0001, 0.001, 0.01, 0.1, 1],  # regularization strength
    }

    grid_search = GridSearchCV(sgd, param_grid=param_grid)
    grid_search.fit(X, y)
    result = grid_search.cv_results_

The search gave an `alpha` (here it means regularization strength) of 0.1 and an `eta0` (learning rate) of 0.0001 as the best parameters, with a score of 0.7176.
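For a sense of how much work that search does, here is a small sketch that enumerates the grid by hand with `itertools.product`. The helper name `expand_grid` is hypothetical and this is not GridSearchCV’s internal code. The fold count of 5 is GridSearchCV’s default.

```python
from itertools import product

param_grid = {
    "eta0": [0.00001, 0.0001, 0.001, 0.01, 0.1, 1],
    "learning_rate": ["constant"],
    "alpha": [0.00001, 0.0001, 0.001, 0.01, 0.1, 1],
}


def expand_grid(grid):
    """List every parameter combination in the grid as a dict."""
    keys = list(grid)
    return [dict(zip(keys, values)) for values in product(*(grid[k] for k in keys))]


combinations = expand_grid(param_grid)
n_folds = 5  # default number of cross-validation folds

print(f"{len(combinations)} combinations, {len(combinations) * n_folds} cross-validation fits")
```

That is 6 × 1 × 6 = 36 candidates, each fitted once per fold, plus one final refit on the full data when `refit` is left at its default.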

I’ve plotted accuracy vs. learning rate (`eta0`) for a couple of different values of regularization strength (`alpha`). You can see that both learning rate and regularization strength have a significant effect on a model’s performance.

The accuracy is pretty low for a learning rate of 0.00001. This is likely due to the algorithm converging too slowly during gradient descent; after 50,000 iterations, we’re nowhere near the minimum. The accuracy is also low for high learning rates (0.1 and 1). This is likely due to overshooting. Below is a rescaled plot with all the alphas.

Regularization strength (alpha) plays a role in accuracy too. For any given learning rate (eta0), accuracy varies widely depending on the alpha value.
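The slow-convergence and overshooting behaviour of the learning rate described above is easy to reproduce on a toy problem. The sketch below runs plain gradient descent on f(w) = w², which is my own stand-in and not the logistic loss from the article, so the exact thresholds differ from the ones in my plots. The function name `gradient_descent` is hypothetical.

```python
def gradient_descent(lr, w0=1.0, steps=100):
    """Minimize f(w) = w**2 (gradient 2 * w) and return the final w."""
    w = w0
    for _ in range(steps):
        w -= lr * 2 * w
    return w


too_small = gradient_descent(lr=0.00001)  # barely moves away from w0
reasonable = gradient_descent(lr=0.1)     # settles close to the minimum at 0
too_large = gradient_descent(lr=1.1)      # each step overshoots; |w| grows

print(f"lr=0.00001 -> w={too_small:.6f}")
print(f"lr=0.1     -> w={reasonable:.3e}")
print(f"lr=1.1     -> w={too_large:.3e}")
```

With the tiny step the weight has hardly changed after 100 updates, and with the large step every update jumps past the minimum and lands farther away than it started.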

Learning rate and regularization are just two hyperparameters in machine learning models. Every machine learning algorithm has its own set of hyperparameters. Questions? Comments? Respond below.