Gradient descent is —
→ an iterative optimization algorithm
→ for finding the local minimum of a function
→ by taking repeated steps
→ proportional to the negative of the gradient (opposite direction of the gradient) of the function at the current point.
Gradient Descent is performed by taking small steps from a randomly initialized point on the loss function J(w) to eventually reach its minimum.
Note: Here, J(w) denotes the loss (cost) function evaluated at the weights w.
We assume that the loss function is convex in nature (bowl-shaped). This lets us treat the minima computation as a convex optimization problem (details are beyond the scope of this write-up).
Let us see how the loss function looks for two parameters x and y. Here the vector v is a linear combination of the basis vectors x and y.
The first step in Gradient Descent is to initialize the parameter weights randomly. Here, θ denotes the initial parameter values, and Δθ is the step computed by Gradient Descent. θ + Δθ is the new θ reached after one iteration of Gradient Descent.
Δθ = [Δw, Δb]
Weight initialization impacts how quickly (and whether) the loss converges; several initialization techniques have been developed in deep learning to improve convergence (beyond the scope of this write-up).
Parameter Update and Learning Rate:
However, in practice we never apply Δθ directly; we always scale it by a step-size control factor called the Learning Rate, denoted by the Greek letter eta (η).
Equation of Learning Algorithm:
η → Learning Rate (the step size taken at each iteration)
θnew = θ + η·Δθ
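As a minimal sketch of one parameter update (the numeric values here are illustrative assumptions, not from the article):

```python
import numpy as np

eta = 0.01                      # learning rate (step-size control factor)
theta = np.array([0.5, -0.3])   # randomly initialized parameters [w, b]
grad = np.array([0.2, 0.1])     # gradient of the loss at theta (assumed values)

delta_theta = -grad             # change vector: opposite of the gradient
theta_new = theta + eta * delta_theta
print(theta_new)                # parameters after one update step
```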
How the parameter update works based on the computed gradient —
How to get Δθ in a principled manner?
Now the question is how to choose the incremental steps in a principled manner: we need a procedure that identifies a step size guaranteeing the loss is reduced at every iteration.
- The goal is to get Δθ at each step such that the loss only decreases incrementally.
- This is where the ‘Gradient Descent’ algorithm comes into the picture.
The gradient is also known as the slope; in other words, incremental rise over incremental run.
Gradient = Rise / Run
In order to calculate the incremental rise and incremental run, we make use of Taylor Series expansion.
- If you have a function and if you know the value of the function at a point x, then the function’s value at a new point x+Δx can be given as:
So we are going to take a small step Δx and ensure that the loss decreases after this small step.
f(x + Δx) = f(x) + Δx·f′(x) + (Δx²/2!)·f″(x) + (Δx³/3!)·f‴(x) + …
Let us rewrite our Loss function equation in the form of Taylor expansion.
The Taylor series above is written for a scalar variable; let us rewrite the equation for vectors, since gradient descent operates on vectors.
where Δθ = u; u is also known as the ‘change vector’.
Since the learning rate (η) is usually on the order of 0.01 to 0.001, the third through nth terms (those involving η² and higher powers) will be very small in value and can be ignored.
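The vector form of the expansion described above (a reconstruction; the original equation appeared as an image) can be written, with step Δθ = ηu, as:

```latex
\mathcal{L}(\theta + \eta u)
  = \mathcal{L}(\theta)
  + \eta\, u^{T} \nabla_{\theta}\mathcal{L}(\theta)
  + \frac{\eta^{2}}{2!}\, u^{T} \nabla_{\theta}^{2}\mathcal{L}(\theta)\, u
  + \cdots
  \;\approx\; \mathcal{L}(\theta) + \eta\, u^{T} \nabla_{\theta}\mathcal{L}(\theta)
```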
For the new loss to be lower than the previous loss, the quantity uᵀ∇θL(θ) must be negative.
The direction ‘u’ should be at 180 degrees with respect to the gradient vector.
β = 180° → move in the direction opposite to that of the gradient.
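The dot-product argument behind this can be made explicit: the decrease is most negative when the cosine of the angle β between u and the gradient is −1.

```latex
u^{T}\nabla_{\theta}\mathcal{L}(\theta)
  = \lVert u\rVert\,\lVert \nabla_{\theta}\mathcal{L}(\theta)\rVert \cos\beta,
\qquad
\cos\beta = -1 \text{ at } \beta = 180^{\circ}
\;\Rightarrow\; u \propto -\nabla_{\theta}\mathcal{L}(\theta)
```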
Let us quickly summarize the Gradient Descent & Parameter Update Rules.
The Complete Learning Algorithm:
How to find Partial Derivatives for Computing Gradients?
Gradients are computed by taking the partial derivative of the loss function with respect to the weights (∇w) and with respect to the bias (∇b) terms. Setting the first-order derivative to zero identifies a critical point, which may be a maximum or a minimum; since we assume the loss function is convex, the point at which the gradient is zero is the minimum.
The chain rule of differentiation is used to calculate the partial derivatives of the loss function with respect to the weight and bias terms.
For a sigmoid neuron, f(x) = 1 / (1 + e^−(wx+b)). Substituting this f(x) into the ∇w equation and computing the first-order derivative of the sigmoid, f′(x) = f(x)·(1 − f(x)), the ∇w equation reduces to:
∇w = (f(x) − y) · f(x) · (1 − f(x)) · x
which means that ∇w can be easily calculated using f(x), x, and y alone, without evaluating any derivatives at run time. Similarly, the ∇b equation reduces to the same expression without the x factor:
∇b = (f(x) − y) · f(x) · (1 − f(x))
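Assuming a sigmoid neuron with squared-error loss (consistent with the derivation above; the function names are my own), the gradient computations can be sketched as:

```python
import math

def f(w, b, x):
    # Sigmoid neuron output: 1 / (1 + e^-(wx + b))
    return 1.0 / (1.0 + math.exp(-(w * x + b)))

def grad_w(w, b, x, y):
    # d(loss)/dw for squared-error loss: (f - y) * f * (1 - f) * x
    fx = f(w, b, x)
    return (fx - y) * fx * (1 - fx) * x

def grad_b(w, b, x, y):
    # d(loss)/db: same expression as grad_w without the x factor
    fx = f(w, b, x)
    return (fx - y) * fx * (1 - fx)
```

For example, at w = b = 0 we have f = 0.5 for any x, so grad_b(0, 0, x, 0) = 0.5 · 0.5 · 0.5 = 0.125.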
Python Code for Gradient Descent for two parameters:
- Number of parameters: 2
- Number of Epochs: 1000
- X = [0.5, 2.5]; Y = [0.2, 0.9]
- eta: Learning Rate
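A self-contained sketch of the loop those bullets describe, assuming a sigmoid neuron with squared-error loss; the initial values w = b = −2 and η = 1.0 are illustrative assumptions, not from the article:

```python
import math

X = [0.5, 2.5]   # inputs
Y = [0.2, 0.9]   # targets

def f(w, b, x):
    # Sigmoid neuron: 1 / (1 + e^-(wx + b))
    return 1.0 / (1.0 + math.exp(-(w * x + b)))

def error(w, b):
    # Squared-error loss summed over the data
    return sum(0.5 * (f(w, b, x) - y) ** 2 for x, y in zip(X, Y))

def grad_w(w, b, x, y):
    fx = f(w, b, x)
    return (fx - y) * fx * (1 - fx) * x

def grad_b(w, b, x, y):
    fx = f(w, b, x)
    return (fx - y) * fx * (1 - fx)

def do_gradient_descent(eta=1.0, max_epochs=1000):
    w, b = -2.0, -2.0   # assumed random initialization
    for _ in range(max_epochs):
        # Accumulate gradients over all data points
        dw = sum(grad_w(w, b, x, y) for x, y in zip(X, Y))
        db = sum(grad_b(w, b, x, y) for x, y in zip(X, Y))
        # Parameter update: theta_new = theta - eta * gradient
        w = w - eta * dw
        b = b - eta * db
    return w, b

w, b = do_gradient_descent()
print(error(w, b))   # final loss, lower than the initial loss error(-2, -2)
```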
The article content is based on the course material and my understanding of the PadhAI course on Deep Learning by IIT Madras professors Prof. Mitesh Khapra and Prof. Pratyush Kumar.