- A model is a function that a neural network approximates through training.
- A loss function is used to measure how good the fit is.
- When the model and loss function are combined we get an optimization problem.
- The objective function is the loss function with the model plugged in, i.e., it is parameterized by the model parameters.
- Optimization involves minimizing an objective/cost function with respect to model parameters.
- Gradient descent is used to optimize neural networks.
- The gradient is obtained by differentiating the objective function with respect to the parameters.
- Minimization means updating the parameters in the direction opposite to the gradient.
- The learning rate is the size of the step taken toward a local/global minimum.
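To make the update rule θ ← θ − η·∇J(θ) concrete, here is a minimal sketch of batch gradient descent on a toy least-squares objective; the `gradient` helper, `X`, `y`, and the hyperparameter values are illustrative assumptions, not something prescribed by these notes.

```python
import numpy as np

# Toy objective: J(theta) = ||X @ theta - y||^2 / (2n), a least-squares loss.
def gradient(theta, X, y):
    n = len(y)
    return X.T @ (X @ theta - y) / n

# Vanilla (batch) gradient descent: step against the gradient, scaled by eta.
def gradient_descent(X, y, eta=0.1, epochs=100):
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        theta -= eta * gradient(theta, X, y)  # theta := theta - eta * dJ/dtheta
    return theta
```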
- SGD iteratively updates the parameters by computing the gradient on one training example at a time.
- It is fast and can be used for online learning.
- In mini-batch gradient descent, instead of updating on each example individually, the gradient is computed on a mini-batch of n examples.
- SGD's frequent, high-variance updates cause the objective to fluctuate heavily.
- These fluctuations can make the optimizer jump from one local minimum to another, which complicates convergence.
- However, it has been shown that when the learning rate is decreased slowly, SGD converges to a local or global minimum.
- Usually when SGD is mentioned, it means SGD using mini-batches.
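A rough sketch of mini-batch SGD under the same toy assumptions as the earlier snippet; `grad_fn` stands for any function returning the gradient on a batch (e.g. the `gradient` helper above), and setting `batch_size=1` recovers plain single-example SGD.

```python
import numpy as np

def minibatch_sgd(X, y, grad_fn, eta=0.01, batch_size=32, epochs=10):
    """Mini-batch SGD: shuffle each epoch, then update on one batch at a time."""
    theta = np.zeros(X.shape[1])
    n = len(y)
    for _ in range(epochs):
        idx = np.random.permutation(n)            # reshuffle the training data
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]
            theta -= eta * grad_fn(theta, X[batch], y[batch])
    return theta
```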
- A key challenge with SGD is choosing a proper learning rate.
- The learning rate can be reduced according to a pre-defined schedule, or when the change in the objective between epochs falls below a threshold; however, these schedules and thresholds have to be defined in advance and won't adapt to the dataset's characteristics.
- The same learning rate is applied to all parameters.
- Another challenge is escaping from suboptimal local minima and saddle-point traps.
- Momentum accelerates SGD in the relevant direction and dampens oscillations.
- NAG (Nesterov accelerated gradient) improves on Momentum.
- NAG's anticipatory update prevents the optimizer from going too fast and results in increased responsiveness.
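The Momentum and NAG update rules can be sketched roughly as follows; here `grad_fn(theta)` is assumed to return the gradient of the objective at `theta`, and the default values of γ and η are only illustrative.

```python
import numpy as np

def momentum_step(theta, v, grad_fn, eta=0.01, gamma=0.9):
    """Momentum: accumulate a velocity vector v and move along it."""
    v = gamma * v + eta * grad_fn(theta)
    return theta - v, v

def nag_step(theta, v, grad_fn, eta=0.01, gamma=0.9):
    """NAG: evaluate the gradient at the look-ahead point theta - gamma * v."""
    v = gamma * v + eta * grad_fn(theta - gamma * v)
    return theta - v, v
```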
- Adagrad adapts the updates to the slope of the error function, which speeds up learning compared to SGD.
- It uses a per-parameter learning rate: lower learning rates for parameters tied to frequently occurring features and higher learning rates for parameters tied to infrequent features.
- Suitable for dealing with sparse data.
- No manual tuning of learning rates is required.
- A common default learning rate is 0.01.
- The learning rate of this algorithm keeps shrinking, because the accumulated squared gradients in the denominator only grow.
Note: if the learning rate is infinitesimally small, the algorithm cannot learn.
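A rough sketch of one Adagrad step, again assuming a hypothetical `grad_fn(theta)`; note how the accumulator `G` only grows, which is exactly what makes the effective learning rate shrink.

```python
import numpy as np

def adagrad_step(theta, G, grad_fn, eta=0.01, eps=1e-8):
    """Adagrad: scale each parameter's step by the root of its accumulated
    squared gradients, so frequently updated parameters take smaller steps."""
    g = grad_fn(theta)
    G = G + g ** 2                                # per-parameter sum of squared gradients (only grows)
    theta = theta - eta * g / (np.sqrt(G) + eps)
    return theta, G
```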
- Adadelta is an extension of Adagrad that seeks to reduce its aggressive, monotonically diminishing learning rate.
- No need to set a default learning rate.
- RMSprop likewise solves Adagrad's radically diminishing learning rates, by using an exponentially decaying average of squared gradients.
- Suggested default values for RMSprop: decay term γ = 0.9, learning rate η = 0.001.
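A minimal sketch of the RMSprop update under the same assumptions; Adadelta goes one step further and replaces η with a decaying RMS of the previous parameter updates, which is why it needs no learning rate.

```python
import numpy as np

def rmsprop_step(theta, Eg2, grad_fn, eta=0.001, gamma=0.9, eps=1e-8):
    """RMSprop: replace Adagrad's ever-growing sum with an exponentially
    decaying average of squared gradients, so the step size stops shrinking."""
    g = grad_fn(theta)
    Eg2 = gamma * Eg2 + (1 - gamma) * g ** 2      # running average of squared gradients
    theta = theta - eta * g / (np.sqrt(Eg2) + eps)
    return theta, Eg2
```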
- Adam is another method that computes adaptive learning rates for each parameter.
- Adam can be viewed as a combination of RMSprop and momentum.
- Default values are 0.9 for β1, 0.999 for β2, and 10⁻⁸ for ϵ.
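A rough sketch of one Adam step with bias correction, again with a hypothetical `grad_fn(theta)`; `m` and `v` start at zero and the time step `t` starts at 1.

```python
import numpy as np

def adam_step(theta, m, v, t, grad_fn, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam: decaying averages of past gradients (m, like momentum) and of
    past squared gradients (v, like RMSprop), with bias correction."""
    g = grad_fn(theta)
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)                  # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)                  # bias-corrected second moment
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```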
- AdaMax replaces Adam's L2-norm-based second moment with an infinity-norm-based one, which makes the updates more stable.
- Suggested default values for AdaMax are η = 0.002, β1 = 0.9, and β2 = 0.999.
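A sketch of the corresponding AdaMax step; the small constant added to the denominator is only a guard against an all-zero gradient history and is an assumption of this sketch, not part of the published update rule.

```python
import numpy as np

def adamax_step(theta, m, u, t, grad_fn, eta=0.002, beta1=0.9, beta2=0.999):
    """AdaMax: like Adam, but the second moment is an infinity-norm accumulator,
    i.e. a running max of gradient magnitudes instead of an L2 average."""
    g = grad_fn(theta)
    m = beta1 * m + (1 - beta1) * g
    u = np.maximum(beta2 * u, np.abs(g))          # infinity-norm accumulator
    theta = theta - (eta / (1 - beta1 ** t)) * m / (u + 1e-8)
    return theta, m, u
```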
- Nadam combines Adam and NAG.
- Adaptive learning rate methods in some cases are outperformed by SGD with momentum.
- The exponentially decaying average of past squared gradients used by adaptive learning rate methods has the following disadvantages: a) it diminishes the influence of large and informative gradients, which leads to poor convergence; b) it gives the optimizer only a short-term memory of the gradients, which becomes an obstacle in other scenarios.
- For the above reasons, the following algorithms can show poor generalization behaviour: Adadelta, RMSprop, Adam, AdaMax, and Nadam.
- AMSGrad enforces a non-increasing step size, which leads to better generalization behaviour.
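A minimal sketch of one AMSGrad step, again with a hypothetical `grad_fn(theta)`; keeping the element-wise maximum `v_hat` is what guarantees the effective step size never increases (Adam's bias correction is omitted here for simplicity).

```python
import numpy as np

def amsgrad_step(theta, m, v, v_hat, grad_fn, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """AMSGrad: like Adam, but normalize by the running maximum of v,
    so the effective step size never increases."""
    g = grad_fn(theta)
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    v_hat = np.maximum(v_hat, v)                  # largest second-moment estimate seen so far
    theta = theta - eta * m / (np.sqrt(v_hat) + eps)
    return theta, m, v, v_hat
```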