
Data science is a fascinating field. C-level executives are enamored with its promised impact on top-line revenue, and practitioners are intrigued by the rapid pace of innovation. There’s already so much to know, and every year seems to add a few more things to learn.
This article draws attention to a relatively novel idea that is probably controversial to most data scientists and maybe a handful of statisticians: the bias-variance tradeoff does not generalize and only applies to specific scenarios. In fact, at the time of this writing, the bias-variance tradeoff has been empirically contradicted in realistic scenarios across a breadth of models, including linear regression!
Obviously, this is not surprising to experienced deep-learning practitioners or the few thousand people avidly tracking the expanding body of relevant literature. This article summarizes and highlights findings from about a dozen “recent” (2018 through Dec. 2020) research papers on double descent out of ~70+ [2]. By virtue of what a summary is, and given the hundreds of pages of research being summarized, this article glosses over many details and presents just the main ideas.
The term for this *groundbreaking* discovery is the “double descent” phenomenon and this idea goes by several names: double descent, deep double descent, double descent phenomenon/curve/risk curve. There are at least two types of double descent: model-wise and sample-wise. This article discusses the former in greater depth than the latter, though the latter is covered briefly in my discussion of linear model double descent.
I have written this article to be accessible to a broader audience. To that end, I will restate the same idea with different terminology to help familiarize my audience with the jargon. For readers with relevant background, skip the Background section 🙂
Statisticians, data scientists and deep learning practitioners are aware of the classical statistics concept of bias-variance tradeoff:
The bias-variance trade-off implies that a model (i.e. an equation) “should balance under-fitting and over-fitting: rich enough to express underlying structure in data, simple enough to avoid fitting spurious patterns” [1]. Figure 2.11 shown at left is a classic bias-variance diagram illustrating the different impact of increasing model complexity on prediction error for training vs. test data.
A more complex model predicts values closer to the actual value or demonstrates fewer cases of misclassification. In other words, the model exhibits less bias. Figure 2.4 shown at left is a well-known visualization illustrating a linear model that predicts Income based on two variables: Years of Education and Seniority.
A model’s fit, or ability to predict a dependent variable for a dataset, increases with increasing model complexity. Examples of increasing a model’s complexity include: adding more terms, non-linear/polynomial terms (e.g. x², x³, etc.), or step-wise/piecewise-constant terms for multiple linear regression; increasing the number of training epochs (training time) for a neural network; increasing the number of decision nodes (“depth”) or the number of leaves at each decision node of a decision tree; and increasing the number of trees in the case of a random forest.
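To make the classic picture concrete, here is a minimal sketch (my own illustration, not taken from the referenced papers), assuming scikit-learn is available: a polynomial-degree sweep on synthetic data, where training error keeps shrinking with complexity while test error typically falls and then rises again.

```python
# Minimal sketch: the classic bias-variance picture via a polynomial degree sweep.
# The synthetic data and model choices are illustrative assumptions, not from the papers.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=(80, 1))
y = np.sin(x).ravel() + rng.normal(scale=0.3, size=80)   # true signal + noise
x_train, y_train, x_test, y_test = x[:50], y[:50], x[50:], y[50:]

for degree in [1, 3, 5, 10, 15]:                          # increasing model complexity
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(x_train, y_train)
    train_mse = mean_squared_error(y_train, model.predict(x_train))
    test_mse = mean_squared_error(y_test, model.predict(x_test))
    print(f"degree={degree:2d}  train MSE={train_mse:.3f}  test MSE={test_mse:.3f}")
```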
Variance is a measure of how much the approximated function, i.e. the model or equation, changes when it is fit to different training datasets. More flexible models generally exhibit greater variance because they adapt more readily to each particular dataset. In other words, variance refers to the amount by which the function, an equation relating a set of inputs to an output, changes when it is built using different training data sets [3, 5].
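To make variance concrete, here is a small simulation sketch (NumPy only; the data-generating process is an illustrative assumption): refit a rigid (degree-1) and a flexible (degree-9) polynomial on many independently drawn training sets and compare how much their predictions at a single point vary. The flexible model’s prediction typically varies far more.

```python
# Sketch: estimate "variance" as the spread of a model's prediction at one point
# when the model is refit on many training sets drawn from the same process.
import numpy as np

rng = np.random.default_rng(1)
x0 = 0.5                      # query point where predictions are compared
preds = {1: [], 9: []}        # degree 1 (rigid) vs. degree 9 (flexible)

for _ in range(200):          # 200 independent training sets
    x = rng.uniform(-1, 1, 30)
    y = np.sin(3 * x) + rng.normal(scale=0.3, size=30)    # assumed true signal + noise
    for degree in preds:
        coefs = np.polyfit(x, y, degree)                  # least-squares polynomial fit
        preds[degree].append(np.polyval(coefs, x0))

for degree, p in preds.items():
    print(f"degree={degree}  variance of prediction at x0: {np.var(p):.4f}")
```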
The bias-variance tradeoff is a statement about the relationship between interpolation and generalization. Conventional statistical wisdom holds that increasing model complexity beyond the point of interpolation, i.e. vanishing training error, is a recipe for overfitting and a model with poor generalization, meaning it will perform poorly on a different, unseen data set [3, 4]. However, subsets of the machine learning community regularly train models to fit training datasets perfectly, such that there is zero training error, and these models go on to perform well on unseen test data. This empirical contradiction has inspired feverish excitement about the “mystical” properties of neural networks and motivated the need for explanation [1a].
Starting in February 2018, Mikhail Belkin et al. wrote a series of articles seeking to do just that [1a, 1b, 1c, 1d]. In December 2018 [1a], Belkin et al. formalized an empirical observation, that the bias-variance tradeoff only holds in certain scenarios, and coined the term “double descent” to describe the phenomenon where it does not. Figure 1 from the first paper in their series on the topic helps build intuition:
The work of Belkin et al. spurred dozens of subsequent publications that corroborated the double descent phenomenon across multiple data sets and multiple model types.
The foundational text “ESL” (The Elements of Statistical Learning) is arguably most responsible for teaching the bias-variance concept to the majority of practitioners today, and one of its authors, Trevor Hastie, corroborated the double descent phenomenon with his and his collaborators’ results in March 2019 [4], with numerous revisions as recent as just a few weeks ago (7 Dec 2020). The point is, this is exciting stuff with a lot of eyes on new developments.
So why does double descent occur? No one knows.
“Fully understanding the mechanisms behind model-wise double descent in deep neural networks remains an important open question. However, an analog of model-wise double descent occurs even for linear models. A recent stream of theoretical works analyzes this setting (Bartlett et al. (2019); Muthukumar et al. (2019); Belkin et al. (2019); Mei & Montanari (2019); Hastie et al. (2019)). We believe similar mechanisms may be at work in deep neural networks” (Nakkiran, Preetum, et al., arxiv, December 2019, [8a]).
Nakkiran et al. [8a] posit potential explanations for double descent:
- At the interpolation threshold, i.e. when a model has about the same number of parameters, p, as there are observations, n, there is effectively only one model that fits the training data perfectly. This interpolating model is very sensitive to noise in the training set and is subject to model mis-specification, i.e. poor function approximation.
- Over-parameterized models give rise to many interpolating models. To review, an interpolating model is one that achieves near zero training error. An over-parameterized model can be illustrated with a toy example: given a true model Y = X + ϵ, we might try the following two models to explain/predict y using x:
Model 1: y = θ1·x
Model 2: y = θ1·x + θ2·x
The second model is over-parametrized. In this toy example, many different pairs of values for θ1 and θ2 (any pair with the same sum) fit equally well, resulting in multiple interpolating models that accurately predict Y. Stochastic gradient descent is able to find the model that best “memorizes” or “absorbs” the noise in the data, making it robust to new datasets. In other words, having more model parameters than observations means that there are multiple subsets of features that allow the over-parameterized model to fit the training data, and stochastic gradient descent implicitly selects among them. In statistical parlance, for an over-parameterized model, p > n, the least-squares objective does not have a unique minimizer (see the sketch after this list).
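To make the non-uniqueness concrete, here is a small NumPy sketch (random synthetic data, my own illustration): with p > n, many coefficient vectors interpolate the training data exactly, and the pseudoinverse picks the minimum-norm one, which is also the solution gradient descent converges to for least squares when initialized at zero.

```python
# Sketch: with more parameters (p) than observations (n), many coefficient
# vectors fit the training data exactly; np.linalg.pinv returns the
# minimum-norm interpolating solution.
import numpy as np

rng = np.random.default_rng(2)
n, p = 5, 20                              # n observations, p features (p > n)
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

theta_min_norm = np.linalg.pinv(X) @ y    # minimum-norm interpolating solution

# Any vector in the null space of X can be added without changing the fit.
_, _, Vt = np.linalg.svd(X)
null_direction = Vt[-1]                   # X @ null_direction is (numerically) zero
theta_other = theta_min_norm + 3.0 * null_direction

print("residual (min-norm):", np.linalg.norm(X @ theta_min_norm - y))  # ~0
print("residual (other):   ", np.linalg.norm(X @ theta_other - y))     # ~0 as well
print("norms:", np.linalg.norm(theta_min_norm), np.linalg.norm(theta_other))
```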
So why does double descent happen? There are hypotheses and I described two above, but no one knows…yet.
Double descent is a robust phenomenon demonstrated for a breadth of neural net architectures, random forests, ensemble methods and even linear regression for both popular and synthetic datasets.
Double descent can be described simply as a test error curve that exhibits two descents. Model-wise double descent is a test error curve that exhibits two descents with increasing model capacity/complexity/flexibility. Sample-wise double descent is a test error curve that exhibits two descents as the number of observations in the training dataset increases. The sketch below illustrates the model-wise case.
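As a rough illustration (my own sketch in the spirit of the random-features experiments in [1a], not a reproduction of any paper’s setup), the snippet below fits minimum-norm least squares on random ReLU features of a synthetic 1-D problem and sweeps the number of features past the interpolation threshold (features ≈ number of training points). In typical runs the test error spikes near that threshold and descends again as the model becomes heavily over-parameterized.

```python
# Sketch of model-wise double descent with random ReLU features and
# minimum-norm least squares. Data and feature choices are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(3)

def make_data(n):
    x = rng.uniform(-1, 1, size=(n, 1))
    y = np.cos(3 * x).ravel() + rng.normal(scale=0.1, size=n)
    return x, y

def relu_features(x, W, b):
    return np.maximum(x @ W + b, 0.0)      # random first layer, fixed after sampling

n_train = 40
x_tr, y_tr = make_data(n_train)
x_te, y_te = make_data(500)

for n_feats in [5, 10, 20, 35, 40, 45, 80, 200, 1000]:
    W = rng.normal(size=(1, n_feats))
    b = rng.uniform(-1, 1, size=n_feats)
    Phi_tr, Phi_te = relu_features(x_tr, W, b), relu_features(x_te, W, b)
    theta = np.linalg.pinv(Phi_tr) @ y_tr              # minimum-norm least squares
    test_mse = np.mean((Phi_te @ theta - y_te) ** 2)
    print(f"features={n_feats:5d}  test MSE={test_mse:.4f}")
```

Here the pseudoinverse plays the role attributed to SGD in the hypotheses above: among the many interpolating solutions in the over-parameterized regime, it picks one particular (minimum-norm) solution.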
Double descent scenarios can occur with:
- Two-layer neural networks with fixed weights in the first layer [1a], also known as a class of non-linear parametric models called Random Fourier Features (RFF). Demonstrated with zero-one and squared loss functions on black-and-white and color image classification datasets (MNIST, CIFAR-10, SVHN) with bootstrap re-sampling, and on the English speech recognition dataset TIMIT.
Demonstrated with and without regularization: the SVHN experiments used ridge regularization for numerical stability near interpolation, while the MNIST experiments had no regularization [1a, 9].
- Fully connected neural networks with RReLU activation and SGD for ERM [1a]. RReLU = randomized rectified linear unit [9], SGD = stochastic gradient descent, ERM = empirical risk minimization.
- ResNet-18 (a convolutional neural network that is 18 layers deep and trained on the 1 million+ image ImageNet dataset with 1,000+ object categories) [8a].