Double descent

{{Short description|Concept in machine learning}}
{{For|the concept of double descent in anthropology|Kinship#Descent rules}}
[[File:Double descent in a two-layer neural network (Figure 3a from Rocks et al. 2022).png|thumb|420px|An example of the double descent phenomenon in a two-layer neural network: as the ratio of parameters to data points increases, the test error first falls, then rises, then falls again.<ref>{{cite journal |last1=Rocks |first1=Jason W. |title=Memorizing without overfitting: Bias, variance, and interpolation in overparameterized models |journal=Physical Review Research |date=2022 |volume=4 |issue=1 |article-number=013201 |doi=10.1103/PhysRevResearch.4.013201 |pmid=36713351 |pmc=9879296 |arxiv=2010.13933 |bibcode=2022PhRvR...4a3201R }}</ref> The vertical line marks the "interpolation threshold" boundary between the underparameterized region (more data points than parameters) and the overparameterized region (more parameters than data points).]]
{{Machine learning bar}}
'''Double descent''' in statistics and machine learning is the phenomenon where a model's error rate on the test set first decreases as the number of parameters grows, then rises to a peak, and then decreases again.<ref>{{Cite web |date=2019-12-05 |title=Deep Double Descent |url=https://openai.com/blog/deep-double-descent/ |access-date=2022-08-12 |website=OpenAI |language=en}}</ref> This phenomenon has been considered surprising, as it contradicts assumptions about overfitting in classical machine learning.<ref name=":1">{{Cite arXiv |eprint=2303.14151v1 |class=cs.LG |first1=Rylan |last1=Schaeffer |first2=Mikail |last2=Khona |title=Double Descent Demystified: Identifying, Interpreting & Ablating the Sources of a Deep Learning Puzzle |date=2023-03-24 |language=en |last3=Robertson |first3=Zachary |last4=Boopathy |first4=Akhilan |last5=Pistunova |first5=Kateryna |last6=Rocks |first6=Jason W. |last7=Fiete |first7=Ila Rani |last8=Koyejo |first8=Oluwasanmi}}</ref>

The rise in test error usually occurs near the interpolation threshold, where the number of parameters equals the number of training data points (the model is ''just'' large enough to fit the training data). More precisely, the threshold is the maximum number of samples on which the model and training procedure achieve, on average, approximately zero training error.<ref>{{cite arXiv |last1=Nakkiran |first1=Preetum |title=Deep Double Descent: Where Bigger Models and More Data Hurt |date=2019-12-04 |eprint=1912.02292 |last2=Kaplun |first2=Gal |last3=Bansal |first3=Yamini |last4=Yang |first4=Tristan |last5=Barak |first5=Boaz |last6=Sutskever |first6=Ilya |class=cs.LG }}</ref>
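The interpolation threshold can be illustrated with a minimal sketch, assuming a plain least-squares fit on random Gaussian features (the function name <code>training_mse</code>, the sample count, and the random data are all hypothetical choices for illustration, not taken from the cited work). The training error should stay clearly above zero while there are fewer parameters than data points and drop to numerical zero once the parameter count reaches the number of data points.

```python
import numpy as np

def training_mse(n_samples, n_params, seed=0):
    """Training MSE of a least-squares fit on random Gaussian features.

    Illustrative assumption: both the features and the targets are random,
    so any zero training error comes purely from model capacity.
    """
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n_samples, n_params))
    y = rng.normal(size=n_samples)
    # lstsq returns the minimum-norm solution when n_params > n_samples
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(np.mean((X @ coef - y) ** 2))

n_samples = 20
train_errors = {p: training_mse(n_samples, p) for p in (5, 10, 20, 40)}
for p, mse in train_errors.items():
    print(f"parameters={p:3d}  training MSE={mse:.2e}")
```

In this toy setting the threshold sits at 20 parameters, matching the number of training points.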

== History ==
Early observations of what would later be called double descent in specific models date back to 1989.<ref>{{Cite journal |last1=Vallet |first1=F. |last2=Cailton |first2=J.-G. |last3=Refregier |first3=Ph |date=June 1989 |title=Linear and Nonlinear Extension of the Pseudo-Inverse Solution for Learning Boolean Functions |url=https://dx.doi.org/10.1209/0295-5075/9/4/003 |journal=Europhysics Letters |language=en |volume=9 |issue=4 |pages=315 |doi=10.1209/0295-5075/9/4/003 |bibcode=1989EL......9..315V |issn=0295-5075|url-access=subscription }}</ref><ref>{{Cite journal |last1=Loog |first1=Marco |last2=Viering |first2=Tom |last3=Mey |first3=Alexander |last4=Krijthe |first4=Jesse H. |last5=Tax |first5=David M. J. |date=2020-05-19 |title=A brief prehistory of double descent |journal=Proceedings of the National Academy of Sciences |language=en |volume=117 |issue=20 |pages=10625–10626 |doi=10.1073/pnas.2001875117 |doi-access=free |issn=0027-8424 |pmc=7245109 |pmid=32371495|arxiv=2004.04328 |bibcode=2020PNAS..11710625L }}</ref>

The term "double descent" was coined by Belkin et al.<ref name=":0">{{Cite journal |last1=Belkin |first1=Mikhail |last2=Hsu |first2=Daniel |last3=Ma |first3=Siyuan |last4=Mandal |first4=Soumik |date=2019-08-06 |title=Reconciling modern machine learning practice and the bias-variance trade-off |journal=Proceedings of the National Academy of Sciences |volume=116 |issue=32 |pages=15849–15854 |arxiv=1812.11118 |doi=10.1073/pnas.1903070116 |issn=0027-8424 |pmc=6689936 |pmid=31341078 |doi-access=free}}</ref> in 2019,<ref name=":1" /> when the phenomenon gained popularity as a broader concept exhibited by many models.<ref>{{Cite journal |last1=Spigler |first1=Stefano |last2=Geiger |first2=Mario |last3=d'Ascoli |first3=Stéphane |last4=Sagun |first4=Levent |last5=Biroli |first5=Giulio |last6=Wyart |first6=Matthieu |date=2019-11-22 |title=A jamming transition from under- to over-parametrization affects loss landscape and generalization |journal=Journal of Physics A: Mathematical and Theoretical |volume=52 |issue=47 |pages=474001 |doi=10.1088/1751-8121/ab4c8b |issn=1751-8113|arxiv=1810.09665 }}</ref><ref>{{Cite journal |last1=Viering |first1=Tom |last2=Loog |first2=Marco |date=2023-06-01 |title=The Shape of Learning Curves: A Review |journal=IEEE Transactions on Pattern Analysis and Machine Intelligence |volume=45 |issue=6 |pages=7799–7819 |doi=10.1109/TPAMI.2022.3220744 |pmid=36350870 |issn=0162-8828|arxiv=2103.10948 |bibcode=2023ITPAM..45.7799V }}</ref> The latter development was prompted by a perceived contradiction between the conventional wisdom that too many parameters in a model result in significant overfitting error (an extrapolation of the bias–variance tradeoff),<ref name="geman">{{cite journal |last1=Geman |first1=Stuart |author-link1=Stuart Geman |last2=Bienenstock |first2=Élie |last3=Doursat |first3=René |year=1992 |title=Neural networks and the bias/variance dilemma |url=http://web.mit.edu/6.435/www/Geman92.pdf |journal=Neural Computation |volume=4 |pages=1–58 |doi=10.1162/neco.1992.4.1.1 |s2cid=14215320}}</ref> and the empirical observations in the 2010s that some modern machine learning techniques tend to perform better with larger models.<ref name=":0" /><ref>{{cite journal |author1=Preetum Nakkiran |author2=Gal Kaplun |author3=Yamini Bansal |author4=Tristan Yang |author5=Boaz Barak |author6=Ilya Sutskever |date=29 December 2021 |title=Deep double descent: where bigger models and more data hurt |journal=Journal of Statistical Mechanics: Theory and Experiment |publisher=IOP Publishing Ltd and SISSA Medialab srl |volume=2021 |issue=12 |page=124003 |arxiv=1912.02292 |bibcode=2021JSMTE2021l4003N |doi=10.1088/1742-5468/ac3a74 |s2cid=207808916}}</ref>

== Theoretical models ==
Double descent occurs in linear regression with isotropic Gaussian covariates and isotropic Gaussian noise.<ref>{{Cite arXiv |eprint=1912.07242v1 |class=stat.ML |first=Preetum |last=Nakkiran |title=More Data Can Hurt for Linear Regression: Sample-wise Double Descent |date=2019-12-16 |language=en}}</ref>
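The linear-regression setting can be explored numerically. The following is a rough sketch rather than a reproduction of the cited analysis: it assumes a minimum-norm least-squares estimator (computed with the pseudo-inverse), a unit-norm true coefficient vector, and a fixed feature dimension with a varying number of samples, in the spirit of the sample-wise version of the phenomenon. The function name, noise level, and sizes are illustrative assumptions. The median over trials is reported because the error is heavy-tailed near the threshold.

```python
import numpy as np

def median_test_error(n_samples, n_features, noise_std=0.5, n_trials=50, seed=0):
    """Median test MSE of minimum-norm least squares on isotropic Gaussian data."""
    rng = np.random.default_rng(seed)
    errors = []
    for _ in range(n_trials):
        # Assumed ground truth: a random unit-norm coefficient vector
        beta = rng.normal(size=n_features)
        beta /= np.linalg.norm(beta)
        X = rng.normal(size=(n_samples, n_features))            # isotropic covariates
        y = X @ beta + noise_std * rng.normal(size=n_samples)   # isotropic noise
        beta_hat = np.linalg.pinv(X) @ y                        # minimum-norm solution
        # For isotropic test covariates, the expected test MSE is
        # ||beta_hat - beta||^2 plus the irreducible noise variance.
        errors.append(np.sum((beta_hat - beta) ** 2) + noise_std ** 2)
    return float(np.median(errors))

n_features = 50
sample_sizes = (10, 25, 40, 50, 60, 100, 200)
curve = {n: median_test_error(n, n_features) for n in sample_sizes}
for n, err in curve.items():
    print(f"samples={n:4d}  median test MSE={err:.3f}")
```

Under these assumptions the printed error should peak where the number of samples equals the number of features, and fall off on either side.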

A model of double descent at the thermodynamic limit has been analyzed using the replica trick, and the result has been confirmed numerically.<ref>{{Cite journal |last1=Advani |first1=Madhu S. |last2=Saxe |first2=Andrew M. |last3=Sompolinsky |first3=Haim |date=2020-12-01 |title=High-dimensional dynamics of generalization error in neural networks |url= |journal=Neural Networks |volume=132 |pages=428–446 |doi=10.1016/j.neunet.2020.08.022 |issn=0893-6080|doi-access=free |pmid=33022471 |pmc=7685244 }}</ref>

A number of works<ref>{{cite arXiv | last1 = Maddox | first1 = Wesley J. | last2 = Benton | first2 = Gregory W. | last3 = Wilson | first3 = Andrew Gordon | title = Rethinking Parameter Counting in Deep Models: Effective Dimensionality Revisited | date = 2020 | class = cs.LG | eprint = 2003.02139 }}</ref><ref>{{cite arXiv | last = Wilson | first = Andrew Gordon | title = Deep Learning is Not So Mysterious or Different | class = cs.LG | year = 2025 | eprint = 2503.02113 }}</ref> have suggested that double descent can be explained using the concept of effective dimension: while a network may have a large number of parameters, in practice only a subset of those parameters is relevant for generalization performance, as measured by the local Hessian curvature. This explanation is formalized through PAC-Bayes compression-based generalization bounds,<ref>{{cite conference | last1 = Lotfi | first1 = Sanae | last2 = Finzi | first2 = Marc | last3 = Kapoor | first3 = Sanyam | last4 = Potapczynski | first4 = Andres | last5 = Goldblum | first5 = Micah | last6 = Wilson | first6 = Andrew G. | title = PAC-Bayes Compression Bounds So Tight That They Can Explain Generalization | conference = Advances in Neural Information Processing Systems | volume = 35 | pages = 31459–31473 | year = 2022 | url = https://proceedings.neurips.cc/paper_files/paper/2022/file/cbeec55c50c3367024bafab2438a021b-Paper-Conference.pdf }}</ref> which show that less complex models are expected to generalize better under a Solomonoff prior.
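As a minimal sketch of the effective-dimension idea, one commonly used definition sums <code>λ / (λ + z)</code> over the eigenvalues <code>λ</code> of the Hessian, with a regularization constant <code>z > 0</code>; directions with curvature well above <code>z</code> count as roughly one dimension each, and flat directions count as roughly zero. The function name, the toy Hessian, and the choice <code>z = 1</code> below are assumptions for illustration only.

```python
import numpy as np

def effective_dimensionality(hessian, z=1.0):
    """Sum of lambda_i / (lambda_i + z) over the Hessian eigenvalues.

    `z` is an assumed regularization constant; eigenvalues are clipped at
    zero so that small negative values from round-off do not distort the sum.
    """
    eigvals = np.linalg.eigvalsh(hessian)   # symmetric matrix: real eigenvalues
    eigvals = np.clip(eigvals, 0.0, None)
    return float(np.sum(eigvals / (eigvals + z)))

# Toy Hessian (hypothetical): four parameters, only two with real curvature
toy_hessian = np.diag([100.0, 100.0, 0.0, 0.0])
n_eff = effective_dimensionality(toy_hessian, z=1.0)
print(f"parameters={toy_hessian.shape[0]}  effective dimensionality={n_eff:.3f}")
```

In this toy case the effective dimensionality is close to 2 even though the model has 4 parameters, which is the sense in which raw parameter counts can overstate model complexity.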

== See also ==

* [[Grokking (machine learning)]]

== References ==
{{Reflist}}

== Further reading ==
* {{cite journal|title=Two Models of Double Descent for Weak Features|author1=Mikhail Belkin|author2=Daniel Hsu|author3=Ji Xu|journal=SIAM Journal on Mathematics of Data Science|volume=2|issue=4|year=2020|pages=1167–1180 |doi=10.1137/20M1336072|doi-access=free|arxiv=1903.07571}}
* {{cite web|url=https://win-vector.com/2024/04/03/the-m-n-machine-learning-anomaly/|title=The m = n Machine Learning Anomaly|first=John|last=Mount|date=3 April 2024}}
* {{cite journal|title=Deep double descent: where bigger models and more data hurt|author1=Preetum Nakkiran|author2=Gal Kaplun|author3=Yamini Bansal|author4=Tristan Yang|author5=Boaz Barak|author6=Ilya Sutskever|journal=Journal of Statistical Mechanics: Theory and Experiment|volume=2021|date=29 December 2021|issue=12 |page=124003 |publisher=IOP Publishing Ltd and SISSA Medialab srl|arxiv=1912.02292|doi=10.1088/1742-5468/ac3a74|bibcode=2021JSMTE2021l4003N |s2cid=207808916 }}
* {{cite journal|title=The Generalization Error of Random Features Regression: Precise Asymptotics and the Double Descent Curve|author1=Song Mei|author2=Andrea Montanari|journal=Communications on Pure and Applied Mathematics|volume=75|issue=4|date=April 2022|pages=667–766 |doi=10.1002/cpa.22008|arxiv=1908.05355|s2cid=199668852 }}
* {{cite journal|title=Provable Benefits of Overparameterization in Model Compression: From Double Descent to Pruning Neural Networks|author1=Xiangyu Chang|author2=Yingcong Li|author3=Samet Oymak|author4=Christos Thrampoulidis|journal=Proceedings of the AAAI Conference on Artificial Intelligence|volume=35|issue=8|year=2021|arxiv=2012.08749}}
* [https://www.siam.org/publications/siam-news/articles/characterizations-of-double-descent/ Manuchehr Aminian: "Characterizations of Double Descent", SIAM News, Vol. 58, No. 10 (Dec. 2025).]

== External links ==

* {{cite web|url=https://mlu-explain.github.io/double-descent/|title=Double Descent: Part 1: A Visual Introduction|author1=Brent Werness|author2=Jared Wilber}}
* {{cite web|url=https://mlu-explain.github.io/double-descent2/|title=Double Descent: Part 2: A Mathematical Explanation|author1=Brent Werness|author2=Jared Wilber}}
* [https://www.lesswrong.com/posts/FRv7ryoqtvSuqBxuT/understanding-deep-double-descent Understanding "Deep Double Descent"] by evhub, on LessWrong.

[[Category:Model selection]]
[[Category:Machine learning]]
[[Category:Statistical classification]]
[[Category:Long stubs with short prose]]