Notes on selecting optimizers for your Deep Learning models, based on Sebastian Ruder's article

February 27, 2021, by Nisar Khan
Gradient descent

  1. A model is the function that a neural network approximates after training.
  2. A loss function is used to determine how good the fit is.
  3. Combining the model and the loss function gives an optimization problem.
  4. The objective function is the loss function with the model plugged in, i.e., it is parameterized by the model parameters.
  5. Optimization means minimizing this objective/cost function with respect to the model parameters.
  6. Gradient descent is used to optimize neural networks.
  7. The gradient is obtained by differentiating (calculus) the objective function with respect to the parameters.
  8. Minimization means updating the parameters in the direction opposite to the gradient.
  9. The learning rate is the size of the step taken toward a local/global minimum (see the sketch below).
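
A minimal NumPy sketch of points 5-9, assuming a toy objective J(theta) = theta^2; the objective, starting point, learning rate, and number of steps are illustrative choices, not taken from the notes:

    import numpy as np

    def grad(theta):
        # gradient of the toy objective J(theta) = theta^2
        return 2.0 * theta

    theta = np.array([5.0])   # initial model parameter
    eta = 0.1                 # learning rate: size of each step

    for step in range(100):
        # update in the direction opposite to the gradient
        theta = theta - eta * grad(theta)

    print(theta)  # approaches the global minimum at 0
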
SGD and mini-batch gradient descent

  1. SGD updates the parameters iteratively, computing the gradient on one training example at a time.
  2. It is fast and can be used for online learning.
  3. In mini-batch gradient descent, instead of iterating over single examples, the gradient is computed on a batch of n examples (see the sketch below).
  4. Frequent updates with high variance cause the objective to fluctuate heavily.
  5. These fluctuations can make SGD jump from one local minimum to another, complicating convergence.
  6. However, it has been shown that when the learning rate is decreased slowly, SGD converges to a local or global minimum.
  7. Usually, when SGD is mentioned, it means SGD with mini-batches.
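
A sketch of mini-batch SGD on a toy linear regression problem; the data, model, and hyperparameters below are illustrative assumptions (setting batch_size to 1 gives plain SGD):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 3))          # toy inputs
    y = X @ np.array([1.0, -2.0, 0.5])      # toy targets

    w = np.zeros(3)      # model parameters
    eta = 0.05           # learning rate
    batch_size = 32      # n examples per update

    for epoch in range(20):
        idx = rng.permutation(len(X))       # shuffle each epoch
        for start in range(0, len(X), batch_size):
            b = idx[start:start + batch_size]
            # gradient of the mean squared error on the mini-batch only
            g = 2.0 / len(b) * X[b].T @ (X[b] @ w - y[b])
            w = w - eta * g                 # cheap but noisy update

    print(w)  # close to the true weights [1.0, -2.0, 0.5]
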
Challenges

  1. It is difficult to choose a proper learning rate.
  2. The learning rate can be reduced according to a pre-defined schedule, or whenever the change in the objective between epochs falls below a threshold; these schedules and thresholds have to be defined in advance and won't adapt to the dataset's characteristics (a simple schedule is sketched below).
  3. The same learning rate is applied to all parameters.
  4. Escaping from traps such as suboptimal local minima and saddle points is difficult.
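
As an example of point 2, a pre-defined step-decay schedule might look like the following sketch; the decay factor and interval are arbitrary assumptions and do not react to the data at all:

    def step_decay(eta0, epoch, drop=0.5, every=10):
        # halve the learning rate every 10 epochs, regardless of
        # how the objective or the dataset is actually behaving
        return eta0 * drop ** (epoch // every)

    for epoch in (0, 9, 10, 25, 40):
        print(epoch, step_decay(0.1, epoch))
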
Momentum

  1. Momentum accelerates SGD in the relevant direction and dampens oscillations (see the sketch below).
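
A sketch of the momentum update (v_t = gamma * v_{t-1} + eta * gradient, theta = theta - v_t) on a toy ill-conditioned quadratic; the objective and hyperparameters are illustrative:

    import numpy as np

    def grad(theta):
        # gradient of a toy elongated quadratic that makes plain SGD oscillate
        return np.array([2.0, 40.0]) * theta

    theta = np.array([3.0, 1.0])
    v = np.zeros_like(theta)   # velocity vector
    eta, gamma = 0.01, 0.9     # learning rate and momentum term

    for step in range(200):
        v = gamma * v + eta * grad(theta)   # accumulate past gradients
        theta = theta - v                   # damped oscillations, faster progress

    print(theta)  # approaches the minimum at [0, 0]
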
Nesterov accelerated gradient (NAG)

  1. NAG works better than plain momentum.
  2. The anticipatory NAG update prevents us from going too fast and results in increased responsiveness.
  3. Updates are adapted to the slope of the error function, which makes NAG faster than plain SGD (see the sketch below).
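
The same toy problem with the Nesterov-style look-ahead; compared with plain momentum, only the point at which the gradient is evaluated changes:

    import numpy as np

    def grad(theta):
        return np.array([2.0, 40.0]) * theta   # toy quadratic gradient

    theta = np.array([3.0, 1.0])
    v = np.zeros_like(theta)
    eta, gamma = 0.01, 0.9

    for step in range(200):
        # "look ahead" to where the momentum term is about to take us
        lookahead = theta - gamma * v
        v = gamma * v + eta * grad(lookahead)   # anticipatory gradient
        theta = theta - v

    print(theta)  # approaches the minimum at [0, 0]
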
Adagrad

  1. Adagrad keeps a per-parameter learning rate: low learning rates for parameters tied to frequently occurring features and high learning rates for parameters tied to infrequent features.
  2. It is well suited to sparse data.
  3. No manual tuning of the learning rate is required.
  4. The default learning rate is 0.01.
  5. The learning rate shrinks over time, because the sum of squared gradients in the denominator keeps growing (see the sketch after the note below).

Note: once Adagrad's learning rate becomes infinitesimally small, the algorithm can no longer learn.
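
A sketch of the Adagrad update on the same toy quadratic; note how the accumulated sum G only ever grows, which is exactly what shrinks the learning rate over time:

    import numpy as np

    def grad(theta):
        return np.array([2.0, 40.0]) * theta   # toy quadratic gradient

    theta = np.array([3.0, 1.0])
    G = np.zeros_like(theta)    # per-parameter sum of squared gradients
    eta, eps = 0.01, 1e-8       # 0.01 is the default learning rate from the notes

    for step in range(500):
        g = grad(theta)
        G += g ** 2                                  # G is non-decreasing ...
        theta = theta - eta / np.sqrt(G + eps) * g   # ... so steps keep shrinking

    print(theta)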

Adadelta

  1. Adadelta is an extension of Adagrad that seeks to reduce its aggressive, monotonically diminishing learning rate.
  2. No default learning rate needs to be set, since the step size is derived from a running average of past parameter updates (see the sketch below).
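
A sketch of the Adadelta update; notice that no learning rate eta appears anywhere, because the ratio of two running RMS values plays that role (rho and eps here are illustrative defaults):

    import numpy as np

    def grad(theta):
        return np.array([2.0, 40.0]) * theta   # toy quadratic gradient

    theta = np.array([3.0, 1.0])
    Eg2 = np.zeros_like(theta)    # running average of squared gradients
    Edx2 = np.zeros_like(theta)   # running average of squared updates
    rho, eps = 0.9, 1e-6

    for step in range(2000):
        g = grad(theta)
        Eg2 = rho * Eg2 + (1 - rho) * g ** 2
        # RMS of past updates over RMS of gradients replaces the learning rate
        dx = -np.sqrt(Edx2 + eps) / np.sqrt(Eg2 + eps) * g
        Edx2 = rho * Edx2 + (1 - rho) * dx ** 2
        theta = theta + dx

    print(theta)
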
RMSprop

  1. RMSprop also solves Adagrad's radically diminishing learning rates, by dividing by an exponentially decaying average of squared gradients.
  2. Suggested default values: decay rate γ = 0.9, learning rate η = 0.001 (see the sketch below).
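
A sketch of the RMSprop update with the suggested defaults; the decaying average keeps the denominator from growing without bound, unlike Adagrad's sum:

    import numpy as np

    def grad(theta):
        return np.array([2.0, 40.0]) * theta   # toy quadratic gradient

    theta = np.array([3.0, 1.0])
    Eg2 = np.zeros_like(theta)           # decaying average of squared gradients
    eta, gamma, eps = 0.001, 0.9, 1e-8   # eta and gamma from the notes, eps assumed

    for step in range(5000):
        g = grad(theta)
        Eg2 = gamma * Eg2 + (1 - gamma) * g ** 2
        theta = theta - eta / np.sqrt(Eg2 + eps) * g

    print(theta)
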
Adam

  1. Adam (Adaptive Moment Estimation) is another method that computes adaptive learning rates for each parameter.
  2. Adam can be viewed as a combination of RMSprop and momentum.
  3. Suggested default values are 0.9 for β1, 0.999 for β2, and 10^-8 for ε (see the sketch below).
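
A sketch of the Adam update with the defaults above; it keeps a decaying average of gradients (momentum-like) and of squared gradients (RMSprop-like), both bias-corrected:

    import numpy as np

    def grad(theta):
        return np.array([2.0, 40.0]) * theta   # toy quadratic gradient

    theta = np.array([3.0, 1.0])
    m = np.zeros_like(theta)   # first moment: mean of gradients
    v = np.zeros_like(theta)   # second moment: uncentered variance of gradients
    eta, beta1, beta2, eps = 0.001, 0.9, 0.999, 1e-8   # eta assumed, rest from the notes

    for t in range(1, 5001):
        g = grad(theta)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
        m_hat = m / (1 - beta1 ** t)   # bias correction for the zero initialization
        v_hat = v / (1 - beta2 ** t)
        theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)

    print(theta)
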
AdaMax

  1. Updates are more stable because the infinity norm of past gradients is used instead of an L2-style average.
  2. Suggested default values are η = 0.002, β1 = 0.9, and β2 = 0.999 (see the sketch below).
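
A sketch of the AdaMax update with the defaults above; a running infinity norm of the gradients replaces Adam's squared-gradient average:

    import numpy as np

    def grad(theta):
        return np.array([2.0, 40.0]) * theta   # toy quadratic gradient

    theta = np.array([3.0, 1.0])
    m = np.zeros_like(theta)
    u = np.zeros_like(theta)                # infinity norm of past gradients
    eta, beta1, beta2 = 0.002, 0.9, 0.999   # defaults from the notes

    for t in range(1, 3001):
        g = grad(theta)
        m = beta1 * m + (1 - beta1) * g
        u = np.maximum(beta2 * u, np.abs(g))   # max instead of an L2-style average
        m_hat = m / (1 - beta1 ** t)
        theta = theta - eta * m_hat / u        # u stays positive on this toy problem

    print(theta)
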
Nadam

  1. Nadam (Nesterov-accelerated Adaptive Moment Estimation) combines Adam and NAG (see the sketch below).
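
A sketch of Nadam following the update rule given in Ruder's article: the bias-corrected first moment gets a Nesterov-style look-ahead before it is applied (the hyperparameter values are illustrative Adam-style assumptions):

    import numpy as np

    def grad(theta):
        return np.array([2.0, 40.0]) * theta   # toy quadratic gradient

    theta = np.array([3.0, 1.0])
    m = np.zeros_like(theta)
    v = np.zeros_like(theta)
    eta, beta1, beta2, eps = 0.001, 0.9, 0.999, 1e-8

    for t in range(1, 5001):
        g = grad(theta)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        # Nesterov-style combination of the current gradient and the momentum estimate
        m_nesterov = beta1 * m_hat + (1 - beta1) * g / (1 - beta1 ** t)
        theta = theta - eta * m_nesterov / (np.sqrt(v_hat) + eps)

    print(theta)
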
AMSGrad

  1. In some cases, adaptive learning rate methods are outperformed by SGD with momentum.
  2. The exponentially decaying average of past squared gradients used by these methods has two disadvantages: a) it diminishes the influence of large and informative gradients, which leads to poor convergence; b) it results in a short-term memory of the gradients, which becomes an obstacle in other scenarios.
  3. For these reasons, Adadelta, RMSprop, Adam, AdaMax, and Nadam can show poor generalization behaviour.
  4. AMSGrad keeps the maximum of past squared gradients, which results in a non-increasing step size and better generalization behaviour (see the sketch below).
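
A sketch of the AMSGrad modification to Adam: keeping the running maximum of the second moment means the denominator never shrinks, so the effective step size is non-increasing (the hyperparameter values are illustrative Adam-style assumptions):

    import numpy as np

    def grad(theta):
        return np.array([2.0, 40.0]) * theta   # toy quadratic gradient

    theta = np.array([3.0, 1.0])
    m = np.zeros_like(theta)
    v = np.zeros_like(theta)
    v_max = np.zeros_like(theta)   # running maximum of the second moment
    eta, beta1, beta2, eps = 0.001, 0.9, 0.999, 1e-8

    for t in range(1, 5001):
        g = grad(theta)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
        v_max = np.maximum(v_max, v)   # never let the denominator decrease
        theta = theta - eta * m / (np.sqrt(v_max) + eps)

    print(theta)
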
References

  1. https://ruder.io/optimizing-gradient-descent/index.html
  2. https://d2l.ai/d2l-en.pdf
  3. Lecture notes from https://www.cs.toronto.edu/~rgrosse/courses/csc421_2019/

Filed Under: Machine Learning
