

Note: This is just a basic, high-level explanation; it might not cover the topics in detail.
Imagine you are in a car on top of a hill, and your objective is to go down (reach the plains, aka the global minimum). So what do you do?
You use your 360-degree camera and see which path is the steepest. Why? Because you want to climb down as fast as possible (your GF/BF just called). So how do you find the steepest path? By calculating the slope, aka the “gradient”, and since you want to go down, it is a “descent”. Hence the name “Gradient descent”.
Okay, now you have chosen your path (the steepest). So what next? You have to drive your car. Sadly, your car drives a fixed distance every time you push the accelerator pedal, but luckily you can set that distance (in km) at the start.
However, there’s a catch: if you set the number of km very high, every step will be large, and since your car drives straight along that path and won’t stop in between, there’s a chance you might climb uphill again. Hence the number of km your car covers per step should be optimal! (If it’s too big, you may overshoot and go uphill; if it’s too small, it will take you forever to reach the plains.) This number of km is called the “Learning rate” or “Step size”.
After each step, you recalculate the steepest direction (the gradient) and repeat the process.
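To make that loop concrete, here’s a minimal sketch in Python. The hill, its slope, the starting point and the learning rate are all made up for illustration: the “hill” is f(x) = x², whose slope at any point is 2x, and the plains sit at x = 0.

```python
# A minimal gradient descent sketch (illustrative numbers throughout).
# The "hill" is f(x) = x**2, its slope (gradient) is 2*x,
# and the plains (global minimum) sit at x = 0.
def altitude(x):
    return x ** 2            # how high we are at position x

def gradient(x):
    return 2 * x             # the slope of the hill at x

x = 9.0                      # where the car starts on the hill
learning_rate = 0.1          # the fixed "number of km" per step

for step in range(50):
    x = x - learning_rate * gradient(x)   # drive one step downhill

print(x)   # ~0.0001, i.e. we've reached the plains
```

Try learning_rate = 1.1 and x flips sign and grows larger every step, which is exactly the “climbing uphill” problem above; try 0.0001 and you’ll be waiting forever.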
You have a 360-degree camera? Yes! But you are on a hill, so there will be lots of bushes, trees and rocks that might hinder your vision.
What do you ideally do in such a situation? You take a couple of steps in one random direction (“Initialization”) and check whether you are going downhill. If a particular path seems promising, you stick to it; otherwise, you change direction.
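In code terms, this “Initialization” is just picking a random starting spot before the descent loop begins; a tiny sketch, with the range chosen arbitrarily:

```python
import random

# Random initialization (illustrative): drop the car at a random
# spot on the hill before starting the descent loop above.
x = random.uniform(-10.0, 10.0)
```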
Now let’s add some spice to the story: it’s pitch dark and there are some dangerous animals out there, so you cannot step outside and check whether the path you picked makes sense. But no worries, this is not directed by Robert B. Weide. Your car has an altimeter that tells you your current altitude. If the altitude decreases as you move forward, you stick to the current path; otherwise, you need to change it! Pretty simple. Here, the altimeter is the “Cost function”.
Let’s say you are on a path and realize the altitude, aka the “Cost function”, isn’t changing much, so you now want to try a different path. How do you change your path? By changing direction (North, East, West and South) using your steering wheel. I know, pretty obvious, right?
Here, North, East, West & South are your “Input variables”.
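To see how the altimeter and the directions fit together in code, here’s a hedged sketch: the altitude function, its lowest point and the step size below are all invented for illustration.

```python
# A made-up "altimeter" (cost function) over the two map directions.
# Its lowest point, by construction, is at north = 3, east = -1.
def altitude(north, east):
    return (north - 3) ** 2 + (east + 1) ** 2

# Compare the cost before and after a small step north.
here = altitude(north=0.0, east=0.0)
after = altitude(north=0.5, east=0.0)
print(after < here)   # True: heading north lowers the cost, keep this path
```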
But hey, you don’t need to steer. Your car is much better than a Tesla; it will analyze the cost function and steer for you. You can probably open a can and chill.
However, for the sake of understanding the concept, let’s proceed anyway.
Your car is autonomous, and it has split the input variables into multiple simpler layers (simpler for the car, of course; for us it’s complex :/ ) to operate precisely. These layers are called “Hidden Layers”.
E.g., NE, NW, SE, SW (1st layer)
Mostly N + slightly E, mostly E + slightly N, etc. (2nd layer)
Mostly N + slightly E, also curved, etc. (3rd layer)
And many more layers like that. Each item in a layer is called a “Node”.
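If you want to see what “Hidden Layers” and “Nodes” look like numerically, here is a rough sketch; the layer sizes, random weights and sigmoid activation are all assumptions for illustration, not something the analogy prescribes.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# 4 input variables (N, E, W, S signals) flowing through two hidden
# layers into one steering output. Each weight matrix entry connects
# a node in one layer to a node in the next.
W1 = rng.normal(size=(4, 8))   # 1st hidden layer: 8 nodes (NE, NW, ...)
W2 = rng.normal(size=(8, 8))   # 2nd hidden layer: 8 finer-grained nodes
W3 = rng.normal(size=(8, 1))   # output layer: one steering decision

x = np.array([[1.0, 0.2, 0.0, 0.0]])   # mostly N, slightly E
h1 = sigmoid(x @ W1)                   # activations of the 1st layer's nodes
h2 = sigmoid(h1 @ W2)                  # activations of the 2nd layer's nodes
steering = sigmoid(h2 @ W3)
print(steering)                        # the car's steering output
```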
Now back to the story: the car passes the feedback from the altimeter (Cost function) to the multiple layers (Hidden layers) so that it can change direction based on the feedback. This is achieved by altering the importance (“Weights”) of each Node.
Since this feedback passes from the last layer to the front layer, by the time it reaches the front layer the other layers would’ve taken most of the feedback points, leaving the front layer with very little.
Here, the process of passing this feedback from the last layer to the first layer is called “Back propagation”.
And the phenomenon of the front layers receiving very little feedback is called the “Vanishing gradient problem”. This mostly happens when the hidden layers are large in number.
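A quick back-of-the-envelope illustration of why the front layers end up with so little feedback: with a sigmoid activation (assumed here, as in the sketch above), each layer multiplies the backward-flowing feedback by its local derivative, which is at most 0.25, so stacking many layers shrinks it toward zero.

```python
# Each sigmoid layer scales the back-propagated feedback by its local
# derivative, which never exceeds 0.25. Over many layers, the feedback
# (gradient) that reaches the front layer all but vanishes.
MAX_SIGMOID_DERIVATIVE = 0.25

feedback = 1.0
for layer in range(20):            # a 20-hidden-layer network
    feedback *= MAX_SIGMOID_DERIVATIVE
print(feedback)                    # ~9.1e-13: almost nothing reaches the front
```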
This Vanishing gradient problem can be addressed through Residual networks & ReLU, which is a story for another day 🙂
PS: This is my first attempt at writing; let me know your thoughts/inputs in the comment section. Thanks!
Reference (source of inspiration):
Gradient Descent: Simply Explained? by Koo Ping Shung
Link: https://towardsdatascience.com/gradient-descent-simply-explained-1d2baa65c757