Training an adaline network
The reasoning behind the training process is still the same as for the perceptron: each weight w(i,j) needs to be updated in such a way that it increases the number of correctly predicted outputs on the next iteration; we call this update value Δ(i,j). However, this update value is calculated in a different way, using an algorithm known as gradient descent.
Gradient Descent
Gradient descent is an algorithm used to find the set of weights W that minimises the overall prediction error of the network. To achieve that, we define what is called an error function (also known as a loss function or cost function), J(W), and we iteratively move towards the minimum of this function (for adaline's cost function, which is convex, this is also the global minimum).
Suppose we have a plot describing how the error function J(W) varies as a function of a single weight value, something like this:
In the figure above we currently have an error of roughly 20 when w is 9. However, one can see that the network’s minimum error is achieved when w is 5. The challenge that gradient descent tackles is how to get from the current point to the weight value that produces the lowest error.
The way gradient descent finds its way to the minimum is (to use the proper mathematical jargon) by computing the partial derivative of the error function with respect to the weights. In layman's terms, the intuition behind gradient descent is illustrated in the animation below:
At each iteration, we compute the partial derivative in order to know in which direction we should move. That direction is given by the green line, a tangent to the function at the current value of w; the slope of that line is the derivative at w. If the slope is positive, as in the figure above, w will be decreased on the next iteration, whereas if the slope is negative, w will be increased. The update value at each iteration is defined according to the formula below, where w represents the weight matrix, µ is the learning rate coefficient that determines the step size in the same way as in the perceptron, and the remaining part is the partial derivative of the cost function with respect to the weights:

$$\Delta w = -\mu \, \frac{\partial J(w)}{\partial w}$$
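To make this concrete, here is a minimal sketch of gradient descent in Python on a one-dimensional toy cost function. The parabola J(w) = 1.25(w − 5)² is a made-up example, chosen only to match the figure (minimum at w = 5, error of 20 at w = 9); the function names are mine, purely for illustration.

```python
# A toy cost function with its minimum at w = 5, chosen so that
# J(9) = 20, roughly matching the figure above.
def cost(w):
    return 1.25 * (w - 5) ** 2

# Its derivative with respect to w: dJ/dw = 2.5 * (w - 5).
def gradient(w):
    return 2.5 * (w - 5)

w = 9.0              # starting weight, as in the figure
learning_rate = 0.1  # µ: controls the step size

for iteration in range(30):
    slope = gradient(w)
    # Positive slope -> decrease w; negative slope -> increase w.
    w = w - learning_rate * slope
    print(f"iteration {iteration:2d}: w = {w:.4f}, J(w) = {cost(w):.4f}")
```

Running this, w walks from 9 down towards 5 in progressively smaller steps, since the slope (and therefore the update) shrinks as we approach the minimum.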
I will dedicate a complete article to the gradient descent algorithm and all its variations another time. For now, if you want an alternative explanation, Andrew Ng gives a great intuition for how it works in one of his Machine Learning Coursera classes.
Let’s contextualise gradient descent for adaline training. In the original adaline paper the authors state that:
All gains including the level are to be changed by the same absolute magnitude, such that the error is brought to zero. This is accomplished by changing each gain (which could be positive or negative) in the direction which will diminish the error by an amount which reduces the error magnitude by 1/17. (…) Convergence is indicated by small errors (before adaptation) with small fluctuations about a stable root-mean-square error.
So their goal was to reduce the error by finding the direction in which to change the gains, in order to reach the optimal gains (the terminology is just different: they say gains, we say weights). The cost function they used was the root mean squared error (RMSE). I will use the mean squared error (MSE) instead in this article, because it is very similar, the maths behind it is easier to explain, and it serves the same purpose.
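For reference, one common way of writing the MSE cost function over N training samples is the following; note the extra ½ factor, a widely used convention (and an assumption on my part here) that does not move the minimum but cancels out neatly when we differentiate:

$$J(W) = \frac{1}{2N} \sum_{i=1}^{N} \left( t_i - y_i \right)^2$$

where t_i is the expected output and y_i the predicted output for the i-th sample.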
Having a cost function, we can derive the weight update formula that needs to be applied to each weight at each iteration. The whole process of coming up with an update formula for the adaline network is summarised in the screenshot below, taken from my gradient descent notebook.
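In short, assuming adaline's linear activation and, for simplicity, the per-sample cost J = ½(t − y)² (a per-sample version of the MSE above), the chain rule gives:

$$\begin{aligned}
J(w) &= \tfrac{1}{2}\,(t - y)^2, \qquad y = \sum_j w_j x_j \\
\frac{\partial J}{\partial w_j} &= (t - y)\,\frac{\partial (t - y)}{\partial w_j} = -(t - y)\,x_j \\
\Delta w_j &= -\mu\,\frac{\partial J}{\partial w_j} = \mu\,(t - y)\,x_j
\end{aligned}$$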
The final weight update formula to be used in adaline training is then:

$$\Delta w = \mu \, (t - y) \, x$$
where w identifies a single weight, µ is the learning rate coefficient, y is the predicted output, t is the expected output, and x represents the single input value associated with that weight.
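Putting everything together, below is a minimal sketch of adaline training in Python using this update rule. The function name and the AND-gate toy data are my own illustration, not code from the notebook; targets are encoded as −1/+1, as is usual for adaline.

```python
import numpy as np

def train_adaline(X, t, learning_rate=0.1, epochs=30):
    """Train adaline weights with per-sample gradient descent updates."""
    rng = np.random.default_rng(seed=0)
    n_samples, n_features = X.shape
    w = rng.normal(scale=0.01, size=n_features)  # small random initial weights
    b = 0.0                                      # bias (the "level" in the original paper)

    for epoch in range(epochs):
        squared_error = 0.0
        for x_i, t_i in zip(X, t):
            y_i = np.dot(w, x_i) + b           # linear activation: no threshold during training
            error = t_i - y_i                  # (t - y)
            w += learning_rate * error * x_i   # Δw = µ (t - y) x, applied to each weight
            b += learning_rate * error         # same rule, with a constant input of 1
            squared_error += error ** 2
        print(f"epoch {epoch:2d}: MSE = {squared_error / n_samples:.4f}")
    return w, b

# Toy usage: learn the AND function with bipolar (-1/+1) targets.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
t = np.array([-1, -1, -1, 1], dtype=float)
w, b = train_adaline(X, t)
predictions = np.where(X @ w + b >= 0.0, 1, -1)  # threshold only at prediction time
print("predictions:", predictions)
```

Note the key difference from the perceptron: the weights are updated from the linear output y = w·x + b before any thresholding, so the MSE printed at each epoch decreases smoothly, which is exactly the "small fluctuations about a stable root-mean-square error" convergence behaviour described in the original paper.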