Weight decay, aka L2 regularization, aka ridge regression… why does it have so many names? Your guess is as good as mine. Like many other deep learning concepts, it’s a fancy term for a simple idea (at least in practice). It’s also something that took me a very long time to really understand, because it was buried under all the math. If you’re having the same struggle I was, then I hope this article ends your search.

Let’s start off with what weight decay is. Weight decay is a regularization technique (another fancy term). What does that mean? Simply put, it helps our machine learning model avoid overfitting (overfitting is explained below).

The above is all good and well, but how does it work? It works by adding the squared sum of our weights (multiplied by a hyperparameter) to our loss function.
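Concretely, the regularized loss is the original loss plus `wd * sum(w**2)`, where `wd` is the weight-decay hyperparameter. Here's a minimal sketch of just that calculation (the weight values, base loss, and `wd` value are made up for illustration):

```python
import torch

# Made-up weights and base loss, purely for illustration
weights = torch.tensor([0.5, -1.0, 2.0])
base_loss = torch.tensor(0.25)

wd = 0.01  # the weight-decay hyperparameter

# Weight decay adds the squared sum of the weights, scaled by wd, to the loss
l2_penalty = wd * (weights ** 2).sum()
regularized_loss = base_loss + l2_penalty
```

Because larger weights inflate the penalty, the optimizer is nudged toward smaller weights, which tends to produce simpler models that generalize better.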

The above sounds a little complicated, so let’s go through it in code.

As always, let’s start off by importing the required libraries.

```python
import torch
import torch.nn as nn
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
```

Now that we have that out of the way, let’s get some data. For the purposes of this article, we can get away with making our own data using PyTorch’s randn function.

We’ll create a tensor having the shape 1000×1000 for the features (X) using PyTorch’s randn function. Along with that, we’ll also create our target tensor (y). The target tensor will be binary (i.e., only 2 values). To accomplish that, we’ll set half the tensor to 0 and the other half to 1.

```python
dim = 1000

X = torch.randn((dim, dim), dtype=float)
y = torch.cat((torch.zeros(dim // 2, dtype=float),
               torch.ones(dim // 2, dtype=float)))
```

Let’s use sklearn’s train_test_split function to shuffle our data and split it into a train and test set. We’ll allocate 33% of the data for testing and the remaining 67% for training.

`X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)`

Now that we have that out of the way, let’s create a simple model. It will consist of only 1 hidden layer. The hidden layer will be half the size of the input layer (i.e., 500). We’ll also use the sigmoid function as our activation function for the hidden layer and the last layer.

```python
class Model(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.hidden_layer = nn.Linear(dim, dim // 2)
        self.last_layer = nn.Linear(dim // 2, 1)
        self.sigmoid = nn.Sigmoid()

    def forward(self, X):
        out = self.sigmoid(self.hidden_layer(X))
        out = self.sigmoid(self.last_layer(out))
        return out
```

We’re almost ready to train our model. Let’s instantiate our Model class, set our loss to BCELoss (because our target is binary) and set our optimizer to SGD. We’ll train for 1000 epochs at a learning rate of 1e-1.

```python
model = Model(dim)
bce = nn.BCELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-1)
epochs = 1000
```
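As an aside, PyTorch optimizers support weight decay out of the box via the `weight_decay` argument. Enabling it for the SGD optimizer above would look like this (the value 1e-4 is just an example, not a recommendation):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)  # stand-in model, for illustration only

# weight_decay tells SGD to apply the L2 penalty for us,
# so we don't have to add it to the loss manually
optimizer = torch.optim.SGD(model.parameters(), lr=1e-1, weight_decay=1e-4)
```

We'll train without it first, so we can see what the penalty actually does.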

Let’s train our model! While training, we’ll print out our loss (i.e., BCE) every 100 epochs, giving us insight into our model’s performance as it trains.

```python
train_loss = []
test_loss = []

for epoch in range(epochs + 1):
    # Training step
    model.train()
    y_pred = model(X_train.float())
    loss = bce(y_pred, y_train.reshape(-1, 1).float())
    train_loss.append(loss.item())  # .item() detaches the value from the graph

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Every 100 epochs, report the train loss and evaluate on the test set
    if epoch % 100 == 0:
        print('Train: epoch: {0} - loss: {1:.5f}'.format(epoch, train_loss[-1]))

        model.eval()
        with torch.no_grad():
            y_pred = model(X_test.float())
            loss = bce(y_pred, y_test.reshape(-1, 1).float())
            test_loss.append(loss.item())
        print('Test: epoch: {0} - loss: {1:.5f}'.format(epoch, test_loss[-1]))
```
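To tie this back to the definition above: applying weight decay by hand means adding the scaled squared sum of all the weights to the loss inside each training step. A sketch of a single such step (the tiny model, random data, and `wd` value are stand-ins for illustration):

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 1)  # stand-in model
bce = nn.BCELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-1)

X = torch.randn(8, 4)
y = torch.randint(0, 2, (8, 1)).float()

wd = 1e-3  # example weight-decay coefficient

optimizer.zero_grad()
y_pred = torch.sigmoid(model(X))

# The weight-decay term: squared sum of every parameter, scaled by wd
penalty = wd * sum((p ** 2).sum() for p in model.parameters())
loss = bce(y_pred, y) + penalty

loss.backward()
optimizer.step()
```

Since the penalty is part of the loss, backpropagation pushes every weight toward zero in proportion to its size, which is exactly the "decay" in weight decay.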