Data science is a fascinating field. C-level executives are enamored by its promised impact on top-line revenue and practitioners are intrigued by the rapid pace of innovation. There’s already so much to know, and it seems like every year there are a few more things to learn.

This article draws attention to a relatively novel idea that is probably controversial to most data scientists and *maybe* a handful of statisticians: **the bias-variance tradeoff generalization does not generalize and only applies to very specific scenarios**.

**In fact, at the time of this writing, the bias-variance tradeoff has been empirically contradicted in specific, realistic scenarios for every well-known model class, including linear regression!**

Obviously, this is not surprising to experienced deep-learning practitioners or the few thousand people avidly tracking the expanding body of relevant literature. This article summarizes and highlights findings from about a dozen “recent” (2018 through Dec. 2020) research papers on double descent, out of roughly 70+ [2].

Because it is a summary of hundreds of pages of research, this article necessarily glosses over many details and presents only the main ideas.

The term for this *groundbreaking* discovery is the “**double descent**” phenomenon, and the idea goes by several names: double descent, deep double descent, double descent phenomenon/curve/risk curve. There are at least two types of double descent: **model-wise** and **sample-wise**. This article discusses the former in greater depth than the latter, though the latter is covered briefly in my discussion of linear model double descent.

I have written this article to be accessible to a broader audience. To that end, I will restate the same idea with different terminology to help familiarize my audience with the jargon. Readers with the relevant background can skip the Background section 🙂

Statisticians, data scientists and deep learning practitioners are aware of the classical statistics concept of bias-variance tradeoff:

The **bias-variance trade-off** implies that a model (i.e. an equation) “should balance under-fitting and over-fitting: rich enough to express underlying structure in data, simple enough to avoid fitting spurious patterns” [1]. Figure 2.11 shown at left is a classic bias-variance diagram illustrating the different impact of increasing model complexity on prediction error for training vs. test data.

A more complex model predicts values closer to the actual values, or misclassifies fewer cases; in other words, the model exhibits less **bias**. Figure 2.4 shown at left is a well-known visualization illustrating a linear model that predicts Income based on two variables: Years of Education and Seniority.
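To make bias concrete, here is a small simulation of my own (not from the article): assuming a hypothetical true function f(x) = sin(x), bias is the gap between the *average* fitted prediction, taken over many training sets, and the true value. A rigid line misses systematically; a flexible polynomial does not.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical setup: true function f(x) = sin(x); measure bias at x0 = 1.5.
x0, f_x0 = 1.5, np.sin(1.5)

def mean_prediction(degree, n_datasets=300):
    """Average prediction at x0 over many freshly drawn training sets."""
    preds = []
    for _ in range(n_datasets):
        x = np.linspace(0, 3, 20)
        y = np.sin(x) + rng.normal(scale=0.3, size=x.size)  # noisy training set
        preds.append(np.polyval(np.polyfit(x, y, deg=degree), x0))
    return np.mean(preds)

bias_simple = abs(mean_prediction(1) - f_x0)  # rigid line: systematic miss
bias_flex = abs(mean_prediction(5) - f_x0)    # flexible polynomial: small bias
print(f"|bias| degree 1: {bias_simple:.3f},  degree 5: {bias_flex:.3f}")
```

The specific function, noise level, and evaluation point are all illustrative choices, not values from the article.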

A model’s fit, or ability to predict a dependent variable for a dataset, increases with increasing model complexity. Examples of increasing a model’s complexity include:

- adding more terms, non-linear/polynomial terms (e.g. x², x³, etc.), or step-wise/piecewise-constant terms to a multiple linear regression;
- increasing the number of training epochs/training time for a neural network;
- increasing the number of decision nodes (“depth”) or the number of leaves at each decision node of a decision tree;
- increasing the number of trees in the case of a random forest.
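As a quick sketch of the polynomial case (my own toy example, not from the article): fitting polynomials of increasing degree to the same small dataset shows training error falling with complexity. With 20 points, a degree-19 polynomial could in principle interpolate the data exactly, though such a fit becomes numerically ill-conditioned.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (hypothetical): y = sin(x) plus noise, 20 observations.
x = np.linspace(0, 3, 20)
y = np.sin(x) + rng.normal(scale=0.1, size=x.size)

# Training error shrinks as polynomial degree (model complexity) grows.
train_mse = {}
for degree in (1, 3, 9):
    coeffs = np.polyfit(x, y, deg=degree)  # least-squares polynomial fit
    train_mse[degree] = np.mean((np.polyval(coeffs, x) - y) ** 2)
    print(f"degree={degree:2d}  train MSE={train_mse[degree]:.6f}")
```

Training error is non-increasing in degree here because each lower-degree model is nested inside the higher-degree one.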

**Variance** is a measure of how much the approximation of the function, i.e. the model or equation, changes when it is fit to different training datasets. More flexible models generally exhibit greater variance because they adapt more readily to each particular dataset. In other words, variance refers to the amount by which the function, an equation relating a set of inputs to an output, changes when built using one or more different training data sets [3, 5].

The bias-variance tradeoff is a statement on the relationship between interpolation and generalization. Conventional statistical wisdom states that increasing model complexity beyond the point of **interpolation**, or **vanishing training error**, is a recipe for overfitting and a model with poor generalization; meaning it will perform poorly on a different, unseen data set [3, 4]. However, subsets of the machine learning community regularly train models to perfectly fit training datasets, such that there is zero training error and these models go on to perform well on unseen, test data. This empirical contradiction has inspired feverish excitement about the “mystical” properties of neural networks and motivated the need for explanation [1a].

Starting in February 2018, Belkin, Mikhail et al. wrote a series of articles seeking to do just that [1a, 1b, 1c, 1d]. In December 2018 [1a], Belkin et al. formalized an empirical observation: the bias-variance tradeoff only holds true for distinct scenarios, and coined the term “**double descent**” to describe the phenomenon where the bias-variance tradeoff doesn’t hold true. Figure 1 from the first paper in their series on the topic helps build intuition:

The work of Belkin et al. spurred dozens of subsequent publications corroborating the double descent phenomenon across multiple data sets and multiple model types.

The foundational text “ESL” (Elements of Statistical Learning) is potentially most responsible for teaching the bias-variance concept to the majority of practitioners today, and one of its authors, Trevor Hastie, corroborated the double descent phenomenon with his and his collaborators’ results in March 2019 [4], with numerous revisions as recent as just a few weeks ago (7 Dec 2020). The point is, this is exciting stuff with a lot of eyes on new developments.

**No one knows.**

“**Fully understanding the mechanisms behind model-wise double descent in deep neural networks remains an important open question**. However, an analog of model-wise double descent occurs even for linear models. A recent stream of theoretical works analyzes this setting (Bartlett et al. (2019); Muthukumar et al. (2019); Belkin et al. (2019); Mei & Montanari (2019); Hastie et al. (2019)). We believe similar mechanisms may be at work in deep neural networks” *(Nakkiran, Preetum, et al., arXiv, December 2019, [8a]).*

Nakkiran et al. [8a] posit potential explanations for double descent:

- A model trained to the interpolation threshold, i.e. a model with about the same number of parameters, p, as observations, n, admits essentially only one model that fits the training data perfectly. This interpolating model is very sensitive to noise in the training set and is subject to model mis-specification, i.e. poor function approximation.
- Over-parameterized models admit many interpolating models. To review, an interpolating model is one that achieves near-zero training error; an over-parameterized model is one with more parameters than observations (p > n). Given a true model Y = X + ϵ, we might try the following two models to explain/predict `y` using `x`:
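The linear-model version of the phenomenon described above can be sketched numerically. This is my own toy setup, not code from the article or the cited papers: minimum-norm least squares (what `np.linalg.lstsq` returns when the system is underdetermined) fit while sweeping the number of features p past the number of observations n. Training error hits zero at the interpolation threshold p = n, and the test-error curve typically spikes there before descending again for p > n.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical setup: n = 30 observations, d = 60 true features; the fit
# below uses only the first p features, sweeping p past n.
n, n_test, d = 30, 500, 60
beta = rng.normal(size=d)                        # true coefficients
X_full = rng.normal(size=(n, d))
y = X_full @ beta + rng.normal(scale=0.5, size=n)
X_test_full = rng.normal(size=(n_test, d))
y_test = X_test_full @ beta

train_mse, test_mse = {}, {}
for p in (10, 25, 30, 40, 60):
    X, X_test = X_full[:, :p], X_test_full[:, :p]
    # lstsq returns the minimum-norm solution when p > n (underdetermined)
    b_hat = np.linalg.lstsq(X, y, rcond=None)[0]
    train_mse[p] = np.mean((X @ b_hat - y) ** 2)
    test_mse[p] = np.mean((X_test @ b_hat - y_test) ** 2)
    print(f"p={p:2d}  train MSE={train_mse[p]:.2e}  test MSE={test_mse[p]:.2f}")
# For p >= n the training error is (numerically) zero: every such model
# interpolates. The test error typically peaks near p = n and falls again
# as p grows past n — the double descent curve.
```

All dimensions, noise scale, and the random seed here are illustrative choices; the shape of the curve, not the specific numbers, is the point.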