The question you’re probably asking right now is, “what is batch gradient descent and how does it differ from normal gradient descent?” Batch gradient descent (you’ll also see it called mini-batch gradient descent) splits the training data into smaller chunks (batches) and performs a forward propagation and backpropagation for each batch. This allows us to update our weights multiple times in a single epoch.
Performing calculations on small batches of the data, rather than on all of it at once, is beneficial in a few ways:
- It’s easier on memory. Imagine we had a million 4K images; always holding all of them in memory at once is extremely taxing.
- Because we’re performing multiple weight updates in a single epoch, we’re able to converge (get to the bottom of our hill) in fewer epochs (see the quick sketch after this list).
- Splitting up our data into batches means our model only looks at a random sample of our data at each iteration, which helps it generalize better. Better generalization = less chance of overfitting.
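To make the second point concrete, here’s a minimal sketch (with hypothetical numbers) of how batching multiplies the number of weight updates we get per epoch:
import math

n_samples = 1_000_000  # hypothetical dataset size
batch_size = 64

# full-batch gradient descent: one weight update per epoch
# batch (mini-batch) gradient descent: one weight update per batch
updates_per_epoch = math.ceil(n_samples / batch_size)
print(updates_per_epoch)  # 15625 updates in a single epoch instead of 1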
One of the questions I had when I first came across batch gradient descent was, “we’re asked to gather as much data as we can only to break that data up into small chunks? I don’t get it… ”
I’m going to go over an example (with code) to show why breaking our data into smaller chunks actually works.
Before I show the example, we’re going to have to import a few libraries.
from sklearn.datasets import make_regression
import matplotlib.pyplot as plt
import random
from IPython import display
import time
Now that we’ve imported our libraries, we’re going to use sklearn to make an example dataset: 1000 points scattered around a regression line.
X, y = make_regression(n_samples=1000, n_features=1, bias=5, noise=10, random_state=762)
y = y.reshape(-1, 1)
Let’s look at our example dataset by using matplotlib to plot it.
plt.scatter(X, y)
plt.show()
Cool. It looks exactly like we expected it to look. 1000 points and a regression line.
Now, something I want to show is that when we take a random sample of 64 points (i.e., our batch size), our random sample is a good representation of our full dataset.
To see this in action, let’s plot 10 different sets each containing 64 different random samples.
for i in range(10):
    display.display(plt.gcf())
    display.clear_output(wait=True)
    rand_indices = random.sample(range(1000), k=64)
    plt.xlim(-4,4)
    plt.ylim(-200,200)
    plt.scatter(X[rand_indices], y[rand_indices])
    plt.show()
    time.sleep(0.5)
I hope it’s making sense. Although we’re only plotting 64 random points, those 64 points give us a very good sense of the shape and direction of the full 1000 points. The argument batch gradient descent makes is that, given a good representation of a problem (and we assume that representation is present when we have a lot of data), a small random batch (e.g., 64 data points) is a sufficient stand-in for the larger dataset.
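One quick way to sanity-check that claim (a minimal sketch, reusing the y we made above) is to compare the summary statistics of a single 64-point sample against the full dataset:
rand_indices = random.sample(range(1000), k=64)
# the batch's mean and spread should land close to the full dataset's
print('full dataset  : mean={:.2f}, std={:.2f}'.format(y.mean(), y.std()))
print('64-point batch: mean={:.2f}, std={:.2f}'.format(y[rand_indices].mean(), y[rand_indices].std()))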
Now that we’ve gone over the what and the why, let’s go over the how. We’ll end this article with how to implement batch gradient descent in code.
Let’s start off by importing a few useful libraries.
import pandas as pd
import torch
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import torch.nn as nn
Next, let’s import our dataset and do a little bit of preprocessing on it. The dataset we’ll be working with is the Pima Indians Diabetes dataset. We’ll import it, split it into a train and test set and then standardize both the train and the test sets, while converting them into PyTorch tensors.
df = pd.read_csv(r'https://raw.githubusercontent.com/a-coders-guide-to-ai/a-coders-guide-to-neural-networks/master/data/diabetes.csv')
X = df[df.columns[:-1]]
y = df['Outcome']
X = X.values
y = y.values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
scaler = StandardScaler()
scaler.fit(X_train)
X_train = torch.tensor(scaler.transform(X_train))
X_test = torch.tensor(scaler.transform(X_test))
# the targets need to be tensors too, since the training loop below calls .float() on them
y_train = torch.tensor(y_train)
y_test = torch.tensor(y_test)
Now, we’re going to need our neural network. We’ll build a feed-forward neural network with a single hidden layer consisting of 4 nodes.
class Model(nn.Module):
    def __init__(self):
        super().__init__()
        self.hidden_linear = nn.Linear(8, 4)   # 8 input features -> 4 hidden nodes
        self.output_linear = nn.Linear(4, 1)   # 4 hidden nodes -> 1 output
        self.sigmoid = nn.Sigmoid()

    def forward(self, X):
        hidden_output = self.sigmoid(self.hidden_linear(X))
        output = self.sigmoid(self.output_linear(hidden_output))
        return output
Let’s create a function to show accuracy as a metric (our loss is BCE). I like doing this because BCE isn’t really human readable, but accuracy is very human friendly. We’ll also set up a few variables to reuse.
def accuracy(y_pred, y):
    # threshold the sigmoid outputs at 0.5, then count how many predictions match the labels
    return torch.sum((((y_pred>=0.5)+0).reshape(1,-1)==y)+0).item()/y.shape[0]

epochs = 1000+1
print_epoch = 100
lr = 1e-2
Our print_epoch variable just tells our code how often we want to see our metrics (i.e., BCE and accuracy).
Let’s instantiate our Model class and set our loss (BCE) and optimizer.
model = Model()
BCE = nn.BCELoss()
optimizer = torch.optim.SGD(model.parameters(), lr = lr)
Awesome, we can finally train our model. Let’s first do it without batch gradient descent and then with. It’ll help us compare.
train_loss = []
test_loss = []

for epoch in range(epochs):
    model.train()
    y_pred = model(X_train.float())
    loss = BCE(y_pred, y_train.reshape(-1,1).float())
    train_loss.append(loss.item())

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    if(epoch % print_epoch == 0):
        print('Train: epoch: {0} - loss: {1:.5f}; acc: {2:.3f}'.format(epoch, train_loss[-1], accuracy(y_pred, y_train)))

    model.eval()
    y_pred = model(X_test.float())
    loss = BCE(y_pred, y_test.reshape(-1,1).float())
    test_loss.append(loss.item())

    if(epoch % print_epoch == 0):
        print('Test: epoch: {0} - loss: {1:.5f}; acc: {2:.3f}'.format(epoch, test_loss[-1], accuracy(y_pred, y_test)))
As expected, the results aren’t great. 1000 epochs aren’t that many for a dataset like this when we’re not using batch gradient descent.
Let’s rerun it, except this time, with batch gradient descent. We’ll reinstantiate our Model class and reset our loss (BCE) and optimizer. We’ll also set our batch size to 64.
model = Model()
BCE = nn.BCELoss()
optimizer = torch.optim.SGD(model.parameters(), lr = lr)
batch_size = 64
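As a quick sanity check (a small sketch reusing the variables above), we can count how many weight updates each epoch will now perform:
import math
# one update per batch, instead of one update per epoch
print(math.ceil(len(X_train) / batch_size))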
Great. Now that we have that done, let’s run it and see the difference.
train_loss = []
test_loss = []

for epoch in range(epochs):
    model.train()
    for i in range(0, len(X_train), batch_size):
        beg = i
        if(i+batch_size > len(X_train)-1):
            end = len(X_train)  # the last batch just runs to the end of the data
        else:
            end = i+batch_size
        y_pred = model(X_train[beg:end].float())
        loss = BCE(y_pred, y_train[beg:end].reshape(-1,1).float())

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    train_loss.append(loss.item())
    if(epoch % print_epoch == 0):
        print('Train: epoch: {0} - loss: {1:.5f}; acc: {2:.3f}'.format(epoch, train_loss[-1], accuracy(y_pred, y_train[beg:end])))

    model.eval()
    y_pred = model(X_test.float())
    loss = BCE(y_pred, y_test.reshape(-1,1).float())
    test_loss.append(loss.item())
    if(epoch % print_epoch == 0):
        print('Test: epoch: {0} - loss: {1:.5f}; acc: {2:.3f}'.format(epoch, test_loss[-1], accuracy(y_pred, y_test)))
Interesting. Before we get into the results, you’ll see that the code is similar, but I have an extra for loop. This loop is what allows us to iterate through our data, splitting it into batches of size 64.
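If the slicing is hard to picture, here’s a toy illustration of the same loop pattern on a made-up list of 10 samples with a batch size of 4:
data = list(range(10))
toy_batch_size = 4
for i in range(0, len(data), toy_batch_size):
    print(data[i:i+toy_batch_size])
# [0, 1, 2, 3]
# [4, 5, 6, 7]
# [8, 9]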
In terms of the result, you’ll see that it significantly outperforms training our model without batch gradient descent. In the same amount of epochs, our model jumped from 66% accuracy on the test set to 74% and our BCE went from 0.62 to 0.53.
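If you want to see the gap for yourself, one option (with hypothetical variable names; you’d need to copy each run’s test_loss list into its own variable before rerunning) is to plot the two test-loss curves side by side:
# full_batch_test_loss and mini_batch_test_loss are assumed to hold the
# test_loss lists saved from the first and second runs respectively
plt.plot(full_batch_test_loss, label='no batching')
plt.plot(mini_batch_test_loss, label='batch size 64')
plt.xlabel('epoch')
plt.ylabel('test BCE')
plt.legend()
plt.show()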
As always, you can run the code in Google Colab — https://cutt.ly/cg2ai-batch-gradient-descent-colab