How to Choose an Optimization Algorithm

Optimization is the problem of finding a set of inputs to an objective function that results in a maximum or minimum function evaluation.

It is the challenging problem that underlies many machine learning algorithms, from fitting logistic regression models to training artificial neural networks.

There are perhaps hundreds of popular optimization algorithms, and perhaps tens of algorithms to choose from in popular scientific code libraries. This can make it challenging to know which algorithms to consider for a given optimization problem.

In this tutorial, you will discover a guided tour of different optimization algorithms.

After completing this tutorial, you will know:

Optimization algorithms may be grouped into those that use derivatives and those that do not.
Classical algorithms use the first and sometimes second derivative of the objective function.
Direct search and stochastic algorithms are designed for objective functions where function derivatives are unavailable.

Let’s get started.

How to Choose an Optimization Algorithm
Photo by Matthewjs007, some rights reserved.

Tutorial Overview

This tutorial is divided into three parts; they are:

Optimization Algorithms
Differentiable Objective Function
Non-Differential Objective Function

Optimization Algorithms

Optimization refers to a procedure for finding the input parameters or arguments to a function that result in the minimum or maximum output of the function.

The most common type of optimization problems encountered in machine learning are continuous function optimization, where the input arguments to the function are real-valued numeric values, e.g. floating point values. The output from the function is also a real-valued evaluation of the input values.

We might refer to problems of this type as continuous function optimization, to distinguish from functions that take discrete variables and are referred to as combinatorial optimization problems.

There are many different types of optimization algorithms that can be used for continuous function optimization problems, and perhaps just as many ways to group and summarize them.

One approach to grouping optimization algorithms is based on the amount of information available about the target function that is being optimized that, in turn, can be used and harnessed by the optimization algorithm.

Generally, the more information that is available about the target function, the easier the function is to optimize if the information can effectively be used in the search.

Perhaps the major division in optimization algorithms is whether the objective function can be differentiated at a point or not. That is, whether the first derivative (gradient or slope) of the function can be calculated for a given candidate solution or not. This partitions algorithms into those that can make use of the calculated gradient information and those that do not.

Differentiable Target Function?
- Algorithms that use derivative information.
- Algorithms that do not use derivative information.

We will use this as the major division for grouping optimization algorithms in this tutorial and look at algorithms for differentiable and non-differentiable objective functions.

Note: this is not an exhaustive coverage of algorithms for continuous function optimization, although it does cover the major methods that you are likely to encounter as a regular practitioner.

Differentiable Objective Function

A differentiable function is a function where the derivative can be calculated for any given point in the input space.

The derivative of a function for a value is the rate or amount of change in the function at that point. It is often called the slope.

First-Order Derivative: Slope or rate of change of an objective function at a given point.

The derivative of the function with more than one input variable (e.g. multivariate inputs) is commonly referred to as the gradient.

Gradient: Derivative of a multivariate continuous objective function.

A derivative for a multivariate objective function is a vector, and each element in the vector is called a partial derivative, or the rate of change for a given variable at the point assuming all other variables are held constant.

Partial Derivative: Element of a derivative of a multivariate objective function.

We can calculate the derivative of the derivative of the objective function, that is the rate of change of the rate of change in the objective function. This is called the second derivative.

Second-Order Derivative: Rate at which the derivative of the objective function changes.

For a function that takes multiple input variables, this is a matrix and is referred to as the Hessian matrix.

Hessian matrix: Second derivative of a function with two or more input variables.

Simple differentiable functions can be optimized analytically using calculus. Typically, the objective functions that we are interested in cannot be solved analytically.

Optimization is significantly easier if the gradient of the objective function can be calculated, and as such, there has been a lot more research into optimization algorithms that use the derivative than those that do not.

Some groups of algorithms that use gradient information include:

Bracketing Algorithms
Local Descent Algorithms
First-Order Algorithms
Second-Order Algorithms

Note: this taxonomy is inspired by the 2019 book “Algorithms for Optimization.”

Let’s take a closer look at each in turn.

Bracketing Algorithms

Bracketing optimization algorithms are intended for optimization problems with one input variable where the optima is known to exist within a specific range.

Bracketing algorithms are able to efficiently navigate the known range and locate the optima, although they assume only a single optima is present (referred to as unimodal objective functions).

Some bracketing algorithms may be able to be used without derivative information if it is not available.

Examples of bracketing algorithms include:

Fibonacci Search
Golden Section Search
Bisection Method

Local Descent Algorithms

Local descent optimization algorithms are intended for optimization problems with more than one input variable and a single global optima (e.g. unimodal objective function).

Perhaps the most common example of a local descent algorithm is the line search algorithm.

There are many variations of the line search (e.g. the Brent-Dekker algorithm), but the procedure generally involves choosing a direction to move in the search space, then performing a bracketing type search in a line or hyperplane in the chosen direction.

This process is repeated until no further improvements can be made.

The limitation is that it is computationally expensive to optimize each directional move in the search space.

First-Order Algorithms

First-order optimization algorithms explicitly involve using the first derivative (gradient) to choose the direction to move in the search space.

The procedures involve first calculating the gradient of the function, then following the gradient in the opposite direction (e.g. downhill to the minimum for minimization problems) using a step size (also called the learning rate).

The step size is a hyperparameter that controls how far to move in the search space, unlike “local descent algorithms” that perform a full line search for each directional move.

A step size that is too small results in a search that takes a long time and can get stuck, whereas a step size that is too large will result in zig-zagging or bouncing around the search space, missing the optima completely.

First-order algorithms are generally referred to as gradient descent, with more specific names referring to minor extensions to the procedure, e.g.:

Gradient Descent
Momentum
Adagrad
RMSProp
Adam

The gradient descent algorithm also provides the template for the popular stochastic version of the algorithm, named Stochastic Gradient Descent (SGD) that is used to train artificial neural networks (deep learning) models.

The important difference is that the gradient is appropriated rather than calculated directly, using prediction error on training data, such as one sample (stochastic), all examples (batch), or a small subset of training data (mini-batch).

The extensions designed to accelerate the gradient descent algorithm (momentum, etc.) can be and are commonly used with SGD.

Stochastic Gradient Descent
Batch Gradient Descent
Mini-Batch Gradient Descent

Second-Order Algorithms

Second-order optimization algorithms explicitly involve using the second derivative (Hessian) to choose the direction to move in the search space.

These algorithms are only appropriate for those objective functions where the Hessian matrix can be calculated or approximated.

Examples of second-order optimization algorithms for univariate objective functions include:

Newton’s Method
Secant Method

Second-order methods for multivariate objective functions are referred to as Quasi-Newton Methods.

There are many Quasi-Newton Methods, and they are typically named for the developers of the algorithm, such as:

Davidson-Fletcher-Powell
Broyden-Fletcher-Goldfarb-Shanno (BFGS)
Limited-memory BFGS (L-BFGS)

Now that we are familiar with the so-called classical optimization algorithms, let’s look at algorithms used when the objective function is not differentiable.

Non-Differential Objective Function

Optimization algorithms that make use of the derivative of the objective function are fast and efficient.

Nevertheless, there are objective functions where the derivative cannot be calculated, typically because the function is complex for a variety of real-world reasons. Or the derivative can be calculated in some regions of the domain, but not all, or is not a good guide.

Some difficulties on objective functions for the classical algorithms described in the previous section include:

No analytical description of the function (e.g. simulation).
Multiple global optima (e.g. multimodal).
Stochastic function evaluation (e.g. noisy).
Discontinuous objective function (e.g. regions with invalid solutions).

As such, there are optimization algorithms that do not expect first- or second-order derivatives to be available.

These algorithms are sometimes referred to as black-box optimization algorithms as they assume little or nothing (relative to the classical methods) about the objective function.

A grouping of these algorithms include:

Direct Algorithms
Stochastic Algorithms
Population Algorithms

Let’s take a closer look at each in turn.

Direct Algorithms

Direct optimization algorithms are for objective functions for which derivatives cannot be calculated.

The algorithms are deterministic procedures and often assume the objective function has a single global optima, e.g. unimodal.

Direct search methods are also typically referred to as a “pattern search” as they may navigate the search space using geometric shapes or decisions, e.g. patterns.

Gradient information is approximated directly (hence the name) from the result of the objective function comparing the relative difference between scores for points in the search space. These direct estimates are then used to choose a direction to move in the search space and triangulate the region of the optima.

Examples of direct search algorithms include:

Cyclic Coordinate Search
Powell’s Method
Hooke-Jeeves Method
Nelder-Mead Simplex Search

Stochastic Algorithms

Stochastic optimization algorithms are algorithms that make use of randomness in the search procedure for objective functions for which derivatives cannot be calculated.

Unlike the deterministic direct search methods, stochastic algorithms typically involve a lot more sampling of the objective function, but are able to handle problems with deceptive local optima.

Stochastic optimization algorithms include:

Simulated Annealing
Evolution Strategy
Cross-Entropy Method

Population Algorithms

Population optimization algorithms are stochastic optimization algorithms that maintain a pool (a population) of candidate solutions that together are used to sample, explore, and hone in on an optima.

Algorithms of this type are intended for more challenging objective problems that may have noisy function evaluations and many global optima (multimodal), and finding a good or good enough solution is challenging or infeasible using other methods.

The pool of candidate solutions adds robustness to the search, increasing the likelihood of overcoming local optima.

Examples of population optimization algorithms include:

Genetic Algorithm
Differential Evolution
Particle Swarm Optimization

Summary

In this tutorial, you discovered a guided tour of different optimization algorithms.

Specifically, you learned:

Optimization algorithms may be grouped into those that use derivatives and those that do not.
Classical algorithms use the first and sometimes second derivative of the objective function.
Direct search and stochastic algorithms are designed for objective functions where function derivatives are unavailable.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

How to Choose an Optimization Algorithm

Tutorial Overview

Optimization Algorithms

Differentiable Objective Function

Bracketing Algorithms

Local Descent Algorithms

First-Order Algorithms

Second-Order Algorithms

Non-Differential Objective Function

Direct Algorithms

Stochastic Algorithms

Population Algorithms

Further Reading

Books

Articles

Summary

Tutorial Overview

Optimization Algorithms

Differentiable Objective Function

Bracketing Algorithms

Local Descent Algorithms

First-Order Algorithms

Second-Order Algorithms

Non-Differential Objective Function

Direct Algorithms

Stochastic Algorithms

Population Algorithms

Further Reading

Books

Articles

Summary

Footer