<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>https://robustlybeneficial.org/wiki/index.php?action=history&amp;feed=atom&amp;title=Overfitting</id>
	<title>Overfitting - Revision history</title>
	<link rel="self" type="application/atom+xml" href="https://robustlybeneficial.org/wiki/index.php?action=history&amp;feed=atom&amp;title=Overfitting"/>
	<link rel="alternate" type="text/html" href="https://robustlybeneficial.org/wiki/index.php?title=Overfitting&amp;action=history"/>
	<updated>2026-04-28T13:30:06Z</updated>
	<subtitle>Revision history for this page on the wiki</subtitle>
	<generator>MediaWiki 1.34.0</generator>
	<entry>
		<id>https://robustlybeneficial.org/wiki/index.php?title=Overfitting&amp;diff=238&amp;oldid=prev</id>
		<title>Lê Nguyên Hoang: /* Details */</title>
		<link rel="alternate" type="text/html" href="https://robustlybeneficial.org/wiki/index.php?title=Overfitting&amp;diff=238&amp;oldid=prev"/>
		<updated>2020-02-26T17:09:11Z</updated>

		<summary type="html">&lt;p&gt;&lt;span dir=&quot;auto&quot;&gt;&lt;span class=&quot;autocomment&quot;&gt;Details&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;table class=&quot;diff diff-contentalign-left&quot; data-mw=&quot;interface&quot;&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;tr class=&quot;diff-title&quot; lang=&quot;en&quot;&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #222; text-align: center;&quot;&gt;← Older revision&lt;/td&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #222; text-align: center;&quot;&gt;Revision as of 17:09, 26 February 2020&lt;/td&gt;
				&lt;/tr&gt;&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l49&quot; &gt;Line 49:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 49:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[https://arxiv.org/pdf/1908.05355 MeiMontanari][https://scholar.google.ch/scholar?hl=en&amp;amp;as_sdt=0%2C5&amp;amp;q=The+generalization+error+of+random+features+regression%3A+Precise+asymptotics+and+double+descent+curve&amp;amp;btnG= 19] consider yet another data model, where &amp;lt;math&amp;gt;z&amp;lt;/math&amp;gt; is drawn uniformly at random on a &amp;lt;math&amp;gt;d&amp;lt;/math&amp;gt;-sphere, and &amp;lt;math&amp;gt;y = f(z)+\varepsilon&amp;lt;/math&amp;gt;, such that &amp;lt;math&amp;gt;\mathbb E[\varepsilon]=0&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;\varepsilon&amp;lt;/math&amp;gt; has finite fourth moment. The ridgeless linear regression is then performed over random features &amp;lt;math&amp;gt;x_i = \sigma(Wz_i)&amp;lt;/math&amp;gt;, where &amp;lt;math&amp;gt;\sigma&amp;lt;/math&amp;gt; is applied component-wise and &amp;lt;math&amp;gt;W&amp;lt;/math&amp;gt; has random rows of &amp;lt;math&amp;gt;\ell_2&amp;lt;/math&amp;gt;-norm equal to 1. They prove that this yields a double descent.&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[https://arxiv.org/pdf/1908.05355 MeiMontanari][https://scholar.google.ch/scholar?hl=en&amp;amp;as_sdt=0%2C5&amp;amp;q=The+generalization+error+of+random+features+regression%3A+Precise+asymptotics+and+double+descent+curve&amp;amp;btnG= 19] consider yet another data model, where &amp;lt;math&amp;gt;z&amp;lt;/math&amp;gt; is drawn uniformly at random on a &amp;lt;math&amp;gt;d&amp;lt;/math&amp;gt;-sphere, and &amp;lt;math&amp;gt;y = f(z)+\varepsilon&amp;lt;/math&amp;gt;, such that &amp;lt;math&amp;gt;\mathbb E[\varepsilon]=0&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;\varepsilon&amp;lt;/math&amp;gt; has finite fourth moment. The ridgeless linear regression is then performed over random features &amp;lt;math&amp;gt;x_i = \sigma(Wz_i)&amp;lt;/math&amp;gt;, where &amp;lt;math&amp;gt;\sigma&amp;lt;/math&amp;gt; is applied component-wise and &amp;lt;math&amp;gt;W&amp;lt;/math&amp;gt; has random rows of &amp;lt;math&amp;gt;\ell_2&amp;lt;/math&amp;gt;-norm equal to 1. They prove that this yields a double descent.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;−&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[https://arxiv.org/pdf/2002.08404.pdf JSSHG][https://scholar.google.ch/scholar?hl=en&amp;amp;as_sdt=0%2C5&amp;amp;q=Implicit+Regularization+of+Random+Feature+Models&amp;amp;btnG= 20] study a random feature model where the true function and the features are drawn from a Gaussian process. They prove an upper bound on the difference between a regularized Bayesian posterior prediction at a point &amp;lt;math&amp;gt;x&amp;lt;/math&amp;gt; and the expectation of the (slightly differently) regularized random feature prediction at this point, which can be argued to go to zero under reasonable assumptions as the number of parameters goes to infinity. Moreover, as the regularization goes to zero, the random feature linear regression comes closer to the Bayesian posterior prediction. They also prove a bound on the difference between the expected loss of the regularized random feature model and that of the regularized kernel model. They also show that the &amp;quot;effective ridge&amp;quot; &amp;lt;math&amp;gt;\tilde \lambda&amp;lt;/math&amp;gt; (i.e. the ridge of the kernel method equivalent to the ridge &amp;lt;math&amp;gt;\lambda&amp;lt;/math&amp;gt; of the random feature regression) is key to understanding the variance explosion. In particular, they relate it to the explosion of &amp;lt;math&amp;gt;\partial_\lambda \tilde \lambda&amp;lt;/math&amp;gt;.&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[https://arxiv.org/pdf/2002.08404.pdf JSSHG][https://scholar.google.ch/scholar?hl=en&amp;amp;as_sdt=0%2C5&amp;amp;q=Implicit+Regularization+of+Random+Feature+Models&amp;amp;btnG= 20] study a random feature model where the true function and the features are drawn from a Gaussian process. They prove an upper bound on the difference between a regularized Bayesian posterior prediction at a point &amp;lt;math&amp;gt;x&amp;lt;/math&amp;gt; and the expectation of the (slightly differently) regularized random feature prediction at this point, which can be argued to go to zero under reasonable assumptions as the number of parameters goes to infinity. Moreover, as the regularization goes to zero, the random feature linear regression comes closer to the Bayesian posterior prediction. They also prove a bound on the difference between the expected loss of the regularized random feature model and that of the regularized kernel model. They also show that the &amp;quot;effective ridge&amp;quot; &amp;lt;math&amp;gt;\tilde \lambda&amp;lt;/math&amp;gt; (i.e. the ridge of the kernel method equivalent to the ridge &amp;lt;math&amp;gt;\lambda&amp;lt;/math&amp;gt; of the random feature regression) is key to understanding the variance explosion. In particular, they relate it to the explosion of &amp;lt;math&amp;gt;\partial_\lambda \tilde \lambda&amp;lt;/math&amp;gt;&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;. An interesting insight is that the appropriate linear regression regularization should thus be smaller than (and can be exactly computed from) the kernel method ridge&lt;/ins&gt;.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/table&gt;</summary>
		<author><name>Lê Nguyên Hoang</name></author>
		
	</entry>
	<entry>
		<id>https://robustlybeneficial.org/wiki/index.php?title=Overfitting&amp;diff=237&amp;oldid=prev</id>
		<title>Lê Nguyên Hoang: /* Details */</title>
		<link rel="alternate" type="text/html" href="https://robustlybeneficial.org/wiki/index.php?title=Overfitting&amp;diff=237&amp;oldid=prev"/>
		<updated>2020-02-26T17:07:35Z</updated>

		<summary type="html">&lt;p&gt;&lt;span dir=&quot;auto&quot;&gt;&lt;span class=&quot;autocomment&quot;&gt;Details&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;table class=&quot;diff diff-contentalign-left&quot; data-mw=&quot;interface&quot;&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;tr class=&quot;diff-title&quot; lang=&quot;en&quot;&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #222; text-align: center;&quot;&gt;← Older revision&lt;/td&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #222; text-align: center;&quot;&gt;Revision as of 17:07, 26 February 2020&lt;/td&gt;
				&lt;/tr&gt;&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l49&quot; &gt;Line 49:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 49:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[https://arxiv.org/pdf/1908.05355 MeiMontanari][https://scholar.google.ch/scholar?hl=en&amp;amp;as_sdt=0%2C5&amp;amp;q=The+generalization+error+of+random+features+regression%3A+Precise+asymptotics+and+double+descent+curve&amp;amp;btnG= 19] consider yet another data model, where &amp;lt;math&amp;gt;z&amp;lt;/math&amp;gt; is drawn uniformly at random on a &amp;lt;math&amp;gt;d&amp;lt;/math&amp;gt;-sphere, and &amp;lt;math&amp;gt;y = f(z)+\varepsilon&amp;lt;/math&amp;gt;, such that &amp;lt;math&amp;gt;\mathbb E[\varepsilon]=0&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;\varepsilon&amp;lt;/math&amp;gt; has finite fourth moment. The ridgeless linear regression is then performed over random features &amp;lt;math&amp;gt;x_i = \sigma(Wz_i)&amp;lt;/math&amp;gt;, where &amp;lt;math&amp;gt;\sigma&amp;lt;/math&amp;gt; is applied component-wise and &amp;lt;math&amp;gt;W&amp;lt;/math&amp;gt; has random rows of &amp;lt;math&amp;gt;\ell_2&amp;lt;/math&amp;gt;-norm equal to 1. They prove that this yields a double descent.&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[https://arxiv.org/pdf/1908.05355 MeiMontanari][https://scholar.google.ch/scholar?hl=en&amp;amp;as_sdt=0%2C5&amp;amp;q=The+generalization+error+of+random+features+regression%3A+Precise+asymptotics+and+double+descent+curve&amp;amp;btnG= 19] consider yet another data model, where &amp;lt;math&amp;gt;z&amp;lt;/math&amp;gt; is drawn uniformly at random on a &amp;lt;math&amp;gt;d&amp;lt;/math&amp;gt;-sphere, and &amp;lt;math&amp;gt;y = f(z)+\varepsilon&amp;lt;/math&amp;gt;, such that &amp;lt;math&amp;gt;\mathbb E[\varepsilon]=0&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;\varepsilon&amp;lt;/math&amp;gt; has finite fourth moment. The ridgeless linear regression is then performed over random features &amp;lt;math&amp;gt;x_i = \sigma(Wz_i)&amp;lt;/math&amp;gt;, where &amp;lt;math&amp;gt;\sigma&amp;lt;/math&amp;gt; is applied component-wise and &amp;lt;math&amp;gt;W&amp;lt;/math&amp;gt; has random rows of &amp;lt;math&amp;gt;\ell_2&amp;lt;/math&amp;gt;-norm equal to 1. They prove that this yields a double descent.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;−&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[https://arxiv.org/pdf/2002.08404.pdf JSSHG][https://scholar.google.ch/scholar?hl=en&amp;amp;as_sdt=0%2C5&amp;amp;q=Implicit+Regularization+of+Random+Feature+Models&amp;amp;btnG= 20] study a random feature model where the true function and the features are drawn from a Gaussian process. They prove an upper bound on the difference between a regularized Bayesian posterior prediction at a point &amp;lt;math&amp;gt;x&amp;lt;/math&amp;gt; and the expectation of the (slightly differently) regularized random feature prediction at this point, which can be argued to go to zero under reasonable assumptions as the number of parameters goes to infinity. Moreover, as the regularization goes to zero, the random feature linear regression comes closer to the Bayesian posterior prediction. They also prove a bound on the difference between the expected loss of the regularized random feature model and that of the regularized kernel model. They also show that the &amp;quot;effective ridge&amp;quot; &amp;lt;math&amp;gt;\tilde \lambda&amp;lt;/math&amp;gt; (i.e. the ridge of the kernel method equivalent to the ridge &amp;lt;math&amp;gt;\lambda&amp;lt;/math&amp;gt; of the random feature regression) is key to understanding the variance explosion. In particular, they relate it to &amp;lt;math&amp;gt;\partial_\lambda \tilde \lambda&amp;lt;/math&amp;gt;.&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[https://arxiv.org/pdf/2002.08404.pdf JSSHG][https://scholar.google.ch/scholar?hl=en&amp;amp;as_sdt=0%2C5&amp;amp;q=Implicit+Regularization+of+Random+Feature+Models&amp;amp;btnG= 20] study a random feature model where the true function and the features are drawn from a Gaussian process. They prove an upper bound on the difference between a regularized Bayesian posterior prediction at a point &amp;lt;math&amp;gt;x&amp;lt;/math&amp;gt; and the expectation of the (slightly differently) regularized random feature prediction at this point, which can be argued to go to zero under reasonable assumptions as the number of parameters goes to infinity. Moreover, as the regularization goes to zero, the random feature linear regression comes closer to the Bayesian posterior prediction. They also prove a bound on the difference between the expected loss of the regularized random feature model and that of the regularized kernel model. They also show that the &amp;quot;effective ridge&amp;quot; &amp;lt;math&amp;gt;\tilde \lambda&amp;lt;/math&amp;gt; (i.e. the ridge of the kernel method equivalent to the ridge &amp;lt;math&amp;gt;\lambda&amp;lt;/math&amp;gt; of the random feature regression) is key to understanding the variance explosion. In particular, they relate it to &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;the explosion of &lt;/ins&gt;&amp;lt;math&amp;gt;\partial_\lambda \tilde \lambda&amp;lt;/math&amp;gt;.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/table&gt;</summary>
		<author><name>Lê Nguyên Hoang</name></author>
		
	</entry>
	<entry>
		<id>https://robustlybeneficial.org/wiki/index.php?title=Overfitting&amp;diff=236&amp;oldid=prev</id>
		<title>Lê Nguyên Hoang: /* Details */</title>
		<link rel="alternate" type="text/html" href="https://robustlybeneficial.org/wiki/index.php?title=Overfitting&amp;diff=236&amp;oldid=prev"/>
		<updated>2020-02-26T17:03:39Z</updated>

		<summary type="html">&lt;p&gt;&lt;span dir=&quot;auto&quot;&gt;&lt;span class=&quot;autocomment&quot;&gt;Details&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;table class=&quot;diff diff-contentalign-left&quot; data-mw=&quot;interface&quot;&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;tr class=&quot;diff-title&quot; lang=&quot;en&quot;&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #222; text-align: center;&quot;&gt;← Older revision&lt;/td&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #222; text-align: center;&quot;&gt;Revision as of 17:03, 26 February 2020&lt;/td&gt;
				&lt;/tr&gt;&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l49&quot; &gt;Line 49:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 49:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[https://arxiv.org/pdf/1908.05355 MeiMontanari][https://scholar.google.ch/scholar?hl=en&amp;amp;as_sdt=0%2C5&amp;amp;q=The+generalization+error+of+random+features+regression%3A+Precise+asymptotics+and+double+descent+curve&amp;amp;btnG= 19] consider yet another data model, where &amp;lt;math&amp;gt;z&amp;lt;/math&amp;gt; is drawn uniformly at random on a &amp;lt;math&amp;gt;d&amp;lt;/math&amp;gt;-sphere, and &amp;lt;math&amp;gt;y = f(z)+\varepsilon&amp;lt;/math&amp;gt;, such that &amp;lt;math&amp;gt;\mathbb E[\varepsilon]=0&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;\varepsilon&amp;lt;/math&amp;gt; has finite fourth moment. The ridgeless linear regression is then performed over random features &amp;lt;math&amp;gt;x_i = \sigma(Wz_i)&amp;lt;/math&amp;gt;, where &amp;lt;math&amp;gt;\sigma&amp;lt;/math&amp;gt; is applied component-wise and &amp;lt;math&amp;gt;W&amp;lt;/math&amp;gt; has random rows of &amp;lt;math&amp;gt;\ell_2&amp;lt;/math&amp;gt;-norm equal to 1. They prove that this yields a double descent.&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[https://arxiv.org/pdf/1908.05355 MeiMontanari][https://scholar.google.ch/scholar?hl=en&amp;amp;as_sdt=0%2C5&amp;amp;q=The+generalization+error+of+random+features+regression%3A+Precise+asymptotics+and+double+descent+curve&amp;amp;btnG= 19] consider yet another data model, where &amp;lt;math&amp;gt;z&amp;lt;/math&amp;gt; is drawn uniformly at random on a &amp;lt;math&amp;gt;d&amp;lt;/math&amp;gt;-sphere, and &amp;lt;math&amp;gt;y = f(z)+\varepsilon&amp;lt;/math&amp;gt;, such that &amp;lt;math&amp;gt;\mathbb E[\varepsilon]=0&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;\varepsilon&amp;lt;/math&amp;gt; has finite fourth moment. The ridgeless linear regression is then performed over random features &amp;lt;math&amp;gt;x_i = \sigma(Wz_i)&amp;lt;/math&amp;gt;, where &amp;lt;math&amp;gt;\sigma&amp;lt;/math&amp;gt; is applied component-wise and &amp;lt;math&amp;gt;W&amp;lt;/math&amp;gt; has random rows of &amp;lt;math&amp;gt;\ell_2&amp;lt;/math&amp;gt;-norm equal to 1. They prove that this yields a double descent.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;−&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[https://arxiv.org/pdf/2002.08404.pdf JSSHG][https://scholar.google.ch/scholar?hl=en&amp;amp;as_sdt=0%2C5&amp;amp;q=Implicit+Regularization+of+Random+Feature+Models&amp;amp;btnG= 20] study a random feature model where the true function and the features are drawn from a Gaussian process. They prove an upper bound on the difference between a regularized Bayesian posterior prediction at a point &amp;lt;math&amp;gt;x&amp;lt;/math&amp;gt; and the expectation of the (slightly differently) regularized random feature prediction at this point, which can be argued to go to zero under reasonable assumptions as the number of parameters goes to infinity. Moreover, as the regularization goes to zero, the random feature linear regression comes closer to the Bayesian posterior prediction. They also prove a bound on the difference between the expected loss of the regularized random feature model and that of the regularized kernel model.&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[https://arxiv.org/pdf/2002.08404.pdf JSSHG][https://scholar.google.ch/scholar?hl=en&amp;amp;as_sdt=0%2C5&amp;amp;q=Implicit+Regularization+of+Random+Feature+Models&amp;amp;btnG= 20] study a random feature model where the true function and the features are drawn from a Gaussian process. They prove an upper bound on the difference between a regularized Bayesian posterior prediction at a point &amp;lt;math&amp;gt;x&amp;lt;/math&amp;gt; and the expectation of the (slightly differently) regularized random feature prediction at this point, which can be argued to go to zero under reasonable assumptions as the number of parameters goes to infinity. Moreover, as the regularization goes to zero, the random feature linear regression comes closer to the Bayesian posterior prediction. They also prove a bound on the difference between the expected loss of the regularized random feature model and that of the regularized kernel model&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;. They also show that the &amp;quot;effective ridge&amp;quot; &amp;lt;math&amp;gt;\tilde \lambda&amp;lt;/math&amp;gt; (i.e. the ridge of the kernel method equivalent to the ridge &amp;lt;math&amp;gt;\lambda&amp;lt;/math&amp;gt; of the random feature regression) is key to understanding the variance explosion. In particular, they relate it to &amp;lt;math&amp;gt;\partial_\lambda \tilde \lambda&amp;lt;/math&amp;gt;&lt;/ins&gt;.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/table&gt;</summary>
		<author><name>Lê Nguyên Hoang</name></author>
		
	</entry>
	<entry>
		<id>https://robustlybeneficial.org/wiki/index.php?title=Overfitting&amp;diff=235&amp;oldid=prev</id>
		<title>Lê Nguyên Hoang: /* Details */</title>
		<link rel="alternate" type="text/html" href="https://robustlybeneficial.org/wiki/index.php?title=Overfitting&amp;diff=235&amp;oldid=prev"/>
		<updated>2020-02-26T16:58:40Z</updated>

		<summary type="html">&lt;p&gt;&lt;span dir=&quot;auto&quot;&gt;&lt;span class=&quot;autocomment&quot;&gt;Details&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;table class=&quot;diff diff-contentalign-left&quot; data-mw=&quot;interface&quot;&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;tr class=&quot;diff-title&quot; lang=&quot;en&quot;&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #222; text-align: center;&quot;&gt;← Older revision&lt;/td&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #222; text-align: center;&quot;&gt;Revision as of 16:58, 26 February 2020&lt;/td&gt;
				&lt;/tr&gt;&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l48&quot; &gt;Line 48:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 48:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[https://arxiv.org/pdf/1908.05355 MeiMontanari][https://scholar.google.ch/scholar?hl=en&amp;amp;as_sdt=0%2C5&amp;amp;q=The+generalization+error+of+random+features+regression%3A+Precise+asymptotics+and+double+descent+curve&amp;amp;btnG= 19] consider yet another data model, where &amp;lt;math&amp;gt;z&amp;lt;/math&amp;gt; is drawn uniformly at random on a &amp;lt;math&amp;gt;d&amp;lt;/math&amp;gt;-sphere, and &amp;lt;math&amp;gt;y = f(z)+\varepsilon&amp;lt;/math&amp;gt;, such that &amp;lt;math&amp;gt;\mathbb E[\varepsilon]=0&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;\varepsilon&amp;lt;/math&amp;gt; has finite fourth moment. The ridgeless linear regression is then performed over random features &amp;lt;math&amp;gt;x_i = \sigma(Wz_i)&amp;lt;/math&amp;gt;, where &amp;lt;math&amp;gt;\sigma&amp;lt;/math&amp;gt; is applied component-wise and &amp;lt;math&amp;gt;W&amp;lt;/math&amp;gt; has random rows of &amp;lt;math&amp;gt;\ell_2&amp;lt;/math&amp;gt;-norm equal to 1. They prove that this yields a double descent.&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[https://arxiv.org/pdf/1908.05355 MeiMontanari][https://scholar.google.ch/scholar?hl=en&amp;amp;as_sdt=0%2C5&amp;amp;q=The+generalization+error+of+random+features+regression%3A+Precise+asymptotics+and+double+descent+curve&amp;amp;btnG= 19] consider yet another data model, where &amp;lt;math&amp;gt;z&amp;lt;/math&amp;gt; is drawn uniformly at random on a &amp;lt;math&amp;gt;d&amp;lt;/math&amp;gt;-sphere, and &amp;lt;math&amp;gt;y = f(z)+\varepsilon&amp;lt;/math&amp;gt;, such that &amp;lt;math&amp;gt;\mathbb E[\varepsilon]=0&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;\varepsilon&amp;lt;/math&amp;gt; has finite fourth moment. The ridgeless linear regression is then performed over random features &amp;lt;math&amp;gt;x_i = \sigma(Wz_i)&amp;lt;/math&amp;gt;, where &amp;lt;math&amp;gt;\sigma&amp;lt;/math&amp;gt; is applied component-wise and &amp;lt;math&amp;gt;W&amp;lt;/math&amp;gt; has random rows of &amp;lt;math&amp;gt;\ell_2&amp;lt;/math&amp;gt;-norm equal to 1. They prove that this yields a double descent.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot;&gt; &lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;&lt;/ins&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot;&gt; &lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;[https://arxiv.org/pdf/2002.08404.pdf JSSHG][https://scholar.google.ch/scholar?hl=en&amp;amp;as_sdt=0%2C5&amp;amp;q=Implicit+Regularization+of+Random+Feature+Models&amp;amp;btnG= 20] study a random feature model where the true function and the features are drawn from a Gaussian process. They prove an upper bound on the difference between a regularized Bayesian posterior prediction at a point &amp;lt;math&amp;gt;x&amp;lt;/math&amp;gt; and the expectation of the (slightly differently) regularized random feature prediction at this point, which can be argued to go to zero under reasonable assumptions as the number of parameters goes to infinity. Moreover, as the regularization goes to zero, the random feature linear regression comes closer to the Bayesian posterior prediction. They also prove a bound on the difference between the expected loss of the regularized random feature model and that of the regularized kernel model.&lt;/ins&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/table&gt;</summary>
		<author><name>Lê Nguyên Hoang</name></author>
		
	</entry>
	<entry>
		<id>https://robustlybeneficial.org/wiki/index.php?title=Overfitting&amp;diff=234&amp;oldid=prev</id>
		<title>Lê Nguyên Hoang: /* Double descent */</title>
		<link rel="alternate" type="text/html" href="https://robustlybeneficial.org/wiki/index.php?title=Overfitting&amp;diff=234&amp;oldid=prev"/>
		<updated>2020-02-26T16:56:04Z</updated>

		<summary type="html">&lt;p&gt;&lt;span dir=&quot;auto&quot;&gt;&lt;span class=&quot;autocomment&quot;&gt;Double descent&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;table class=&quot;diff diff-contentalign-left&quot; data-mw=&quot;interface&quot;&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;tr class=&quot;diff-title&quot; lang=&quot;en&quot;&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #222; text-align: center;&quot;&gt;← Older revision&lt;/td&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #222; text-align: center;&quot;&gt;Revision as of 16:56, 26 February 2020&lt;/td&gt;
				&lt;/tr&gt;&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l26&quot; &gt;Line 26:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 26:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;Intriguingly, a &amp;lt;em&amp;gt;double descent&amp;lt;/em&amp;gt; phenomenon often occurs [https://www.pnas.org/content/pnas/116/32/15849.full.pdf BHMM][https://scholar.google.ch/scholar?hl=en&amp;amp;as_sdt=0%2C5&amp;amp;q=Reconciling+modern+machine-learning+practice+and+the+classical+bias%E2%80%93variance+trade-off&amp;amp;btnG= 19] [https://arxiv.org/pdf/1912.02292 NKBYBS][https://dblp.org/rec/bibtex/journals/corr/abs-1912-02292 19], where the performance on the test set first behaves as predicted by the bias-variance dilemma, but then improves and outperforms what would be advised by classical statistical learning.&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;Intriguingly, a &amp;lt;em&amp;gt;double descent&amp;lt;/em&amp;gt; phenomenon often occurs [https://www.pnas.org/content/pnas/116/32/15849.full.pdf BHMM][https://scholar.google.ch/scholar?hl=en&amp;amp;as_sdt=0%2C5&amp;amp;q=Reconciling+modern+machine-learning+practice+and+the+classical+bias%E2%80%93variance+trade-off&amp;amp;btnG= 19] [https://arxiv.org/pdf/1912.02292 NKBYBS][https://dblp.org/rec/bibtex/journals/corr/abs-1912-02292 19], where the performance on the test set first behaves as predicted by the bias-variance dilemma, but then improves and outperforms what would be advised by classical statistical learning.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;−&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;&lt;/del&gt;&lt;/div&gt;&lt;/td&gt;&lt;td colspan=&quot;2&quot;&gt; &lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;−&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;[https://arxiv.org/pdf/2002.08404.pdf JSSHG][https://scholar.google.ch/scholar?hl=en&amp;amp;as_sdt=0%2C5&amp;amp;q=Implicit+Regularization+of+Random+Feature+Models&amp;amp;btnG= 20] study a random feature model where the true function and the features are drawn from a Gaussian process. They prove an upper bound on the difference between a regularized Bayesian posterior prediction and the expectation of the (slightly differently) regularized random feature prediction, which can be argued to go to zero under reasonable assumptions as the number of parameters goes to infinity. Moreover, as the regularization goes to zero, the random feature linear regression comes closer to the Bayesian posterior prediction.&lt;/del&gt;&lt;/div&gt;&lt;/td&gt;&lt;td colspan=&quot;2&quot;&gt; &lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;All such results suggest that overfitting eventually disappears, which contradicts conventional wisdom.&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;All such results suggest that overfitting eventually disappears, which contradicts conventional wisdom.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/table&gt;</summary>
		<author><name>Lê Nguyên Hoang</name></author>
		
	</entry>
	<entry>
		<id>https://robustlybeneficial.org/wiki/index.php?title=Overfitting&amp;diff=233&amp;oldid=prev</id>
		<title>Lê Nguyên Hoang: /* Double descent */</title>
		<link rel="alternate" type="text/html" href="https://robustlybeneficial.org/wiki/index.php?title=Overfitting&amp;diff=233&amp;oldid=prev"/>
		<updated>2020-02-26T16:55:02Z</updated>

		<summary type="html">&lt;p&gt;&lt;span dir=&quot;auto&quot;&gt;&lt;span class=&quot;autocomment&quot;&gt;Double descent&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;table class=&quot;diff diff-contentalign-left&quot; data-mw=&quot;interface&quot;&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;tr class=&quot;diff-title&quot; lang=&quot;en&quot;&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #222; text-align: center;&quot;&gt;← Older revision&lt;/td&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #222; text-align: center;&quot;&gt;Revision as of 16:55, 26 February 2020&lt;/td&gt;
				&lt;/tr&gt;&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l23&quot; &gt;Line 23:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 23:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;== Double descent ==&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;== Double descent ==&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;−&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;However, the conventional wisdom is in sharp contradiction with today's success of deep neural networks [https://openreview.net/pdf?id=Sy8gdB9xx ZBHRV][https://dblp.org/rec/bibtex/conf/iclr/ZhangBHRV17 17], [https://www.pnas.org/content/pnas/116/32/15849.full.pdf BHMM][https://scholar.google.ch/scholar?hl=en&amp;amp;as_sdt=0%2C5&amp;amp;q=Reconciling+modern+machine-learning+practice+and+the+classical+bias%E2%80%93variance+trade-off&amp;amp;btnG= 19] [https://arxiv.org/pdf/1912.02292 NKBYBS][https://dblp.org/rec/bibtex/journals/corr/abs-1912-02292 19], but also kernel methods [http://proceedings.mlr.press/v80/belkin18a/belkin18a.pdf BMM][https://dblp.org/rec/bibtex/conf/icml/BelkinMM18 18] [http://proceedings.mlr.press/v89/belkin19a/belkin19a.pdf BRT][https://dblp.org/rec/bibtex/conf/aistats/BelkinRT19 19], ridgeless (random feature) linear regression [https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&amp;amp;arnumber=8849614 MVS][https://dblp.org/rec/bibtex/conf/isit/MuthukumarVS19 19] [https://arxiv.org/pdf/1906.11300 BLLT][https://dblp.org/rec/bibtex/journals/corr/abs-1906-11300 19] [https://arxiv.org/pdf/1908.05355 MeiMontanari][https://scholar.google.ch/scholar?hl=en&amp;amp;as_sdt=0%2C5&amp;amp;q=The+generalization+error+of+random+features+regression%3A+Precise+asymptotics+and+double+descent+curve&amp;amp;btnG= 19] [https://arxiv.org/pdf/1903.08560 HMRT][https://dblp.org/rec/bibtex/journals/corr/abs-1903-08560 19] [https://arxiv.org/pdf/1903.07571 BHX][https://dblp.org/rec/bibtex/journals/corr/abs-1903-07571 19] and even ensembles [https://www.pnas.org/content/pnas/116/32/15849.full.pdf BHMM][https://scholar.google.ch/scholar?hl=en&amp;amp;as_sdt=0%2C5&amp;amp;q=Reconciling+modern+machine-learning+practice+and+the+classical+bias%E2%80%93variance+trade-off&amp;amp;btnG= 19]. 
Learning algorithms seem to often achieve their best out-of-sample performance when they are massively overparameterized and perfectly fit the training data (called &amp;lt;em&amp;gt;interpolation&amp;lt;/em&amp;gt;).&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;However, the conventional wisdom is in sharp contradiction with today's success of deep neural networks [https://openreview.net/pdf?id=Sy8gdB9xx ZBHRV][https://dblp.org/rec/bibtex/conf/iclr/ZhangBHRV17 17], [https://www.pnas.org/content/pnas/116/32/15849.full.pdf BHMM][https://scholar.google.ch/scholar?hl=en&amp;amp;as_sdt=0%2C5&amp;amp;q=Reconciling+modern+machine-learning+practice+and+the+classical+bias%E2%80%93variance+trade-off&amp;amp;btnG= 19] [https://arxiv.org/pdf/1912.02292 NKBYBS][https://dblp.org/rec/bibtex/journals/corr/abs-1912-02292 19], but also kernel methods [http://proceedings.mlr.press/v80/belkin18a/belkin18a.pdf BMM][https://dblp.org/rec/bibtex/conf/icml/BelkinMM18 18] [http://proceedings.mlr.press/v89/belkin19a/belkin19a.pdf BRT][https://dblp.org/rec/bibtex/conf/aistats/BelkinRT19 19], ridgeless (random feature) linear regression [https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&amp;amp;arnumber=8849614 MVS][https://dblp.org/rec/bibtex/conf/isit/MuthukumarVS19 19] [https://arxiv.org/pdf/1906.11300 BLLT][https://dblp.org/rec/bibtex/journals/corr/abs-1906-11300 19] [https://arxiv.org/pdf/1908.05355 MeiMontanari][https://scholar.google.ch/scholar?hl=en&amp;amp;as_sdt=0%2C5&amp;amp;q=The+generalization+error+of+random+features+regression%3A+Precise+asymptotics+and+double+descent+curve&amp;amp;btnG= 19] [https://arxiv.org/pdf/1903.08560 HMRT][https://dblp.org/rec/bibtex/journals/corr/abs-1903-08560 19] [https://arxiv.org/pdf/1903.07571 BHX][https://dblp.org/rec/bibtex/journals/corr/abs-1903-07571 19&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;] [https://arxiv.org/pdf/2002.08404.pdf JSSHG][https://scholar.google.ch/scholar?hl=en&amp;amp;as_sdt=0%2C5&amp;amp;q=Implicit+Regularization+of+Random+Feature+Models&amp;amp;btnG= 20&lt;/ins&gt;] and even ensembles [https://www.pnas.org/content/pnas/116/32/15849.full.pdf BHMM][https://scholar.google.ch/scholar?hl=en&amp;amp;as_sdt=0%2C5&amp;amp;q=Reconciling+modern+machine-learning+practice+and+the+classical+bias%E2%80%93variance+trade-off&amp;amp;btnG= 19]. Learning algorithms seem to often achieve their best out-of-sample performance when they are massively overparameterized and perfectly fit the training data (called &amp;lt;em&amp;gt;interpolation&amp;lt;/em&amp;gt;).&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;Intriguingly, a &amp;lt;em&amp;gt;double descent&amp;lt;/em&amp;gt; phenomenon often occurs [https://www.pnas.org/content/pnas/116/32/15849.full.pdf BHMM][https://scholar.google.ch/scholar?hl=en&amp;amp;as_sdt=0%2C5&amp;amp;q=Reconciling+modern+machine-learning+practice+and+the+classical+bias%E2%80%93variance+trade-off&amp;amp;btnG= 19] [https://arxiv.org/pdf/1912.02292 NKBYBS][https://dblp.org/rec/bibtex/journals/corr/abs-1912-02292 19], where the performance on the test set first behaves as predicted by the bias-variance dilemma, but then improves and outperforms what would be advised by classical statistical learning.&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;Intriguingly, a &amp;lt;em&amp;gt;double descent&amp;lt;/em&amp;gt; phenomenon often occurs [https://www.pnas.org/content/pnas/116/32/15849.full.pdf BHMM][https://scholar.google.ch/scholar?hl=en&amp;amp;as_sdt=0%2C5&amp;amp;q=Reconciling+modern+machine-learning+practice+and+the+classical+bias%E2%80%93variance+trade-off&amp;amp;btnG= 19] [https://arxiv.org/pdf/1912.02292 NKBYBS][https://dblp.org/rec/bibtex/journals/corr/abs-1912-02292 19], where the performance on the test set first behaves as predicted by the bias-variance dilemma, but then improves and outperforms what would be advised by classical statistical learning.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot;&gt; &lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;&lt;/ins&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot;&gt; &lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;[https://arxiv.org/pdf/2002.08404.pdf JSSHG][https://scholar.google.ch/scholar?hl=en&amp;amp;as_sdt=0%2C5&amp;amp;q=Implicit+Regularization+of+Random+Feature+Models&amp;amp;btnG= 20] study a random feature model where the true function and the features are drawn from a Gaussian process. They prove an upper bound on the difference between a regularized Bayesian posterior prediction and the expectation of the (slightly differently) regularized random feature prediction, which can be argued to go to zero under reasonable assumptions as the number of parameters goes to infinity. Moreover, as the regularization goes to zero, the random feature linear regression comes closer to the Bayesian posterior prediction.&lt;/ins&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;All such results suggest that overfitting eventually disappears, which contradicts conventional wisdom.&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;All such results suggest that overfitting eventually disappears, which contradicts conventional wisdom.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/table&gt;</summary>
		<author><name>Lê Nguyên Hoang</name></author>
		
	</entry>
	<entry>
		<id>https://robustlybeneficial.org/wiki/index.php?title=Overfitting&amp;diff=227&amp;oldid=prev</id>
		<title>Lê Nguyên Hoang at 09:54, 24 February 2020</title>
		<link rel="alternate" type="text/html" href="https://robustlybeneficial.org/wiki/index.php?title=Overfitting&amp;diff=227&amp;oldid=prev"/>
		<updated>2020-02-24T09:54:39Z</updated>

		<summary type="html">&lt;p&gt;&lt;/p&gt;
&lt;table class=&quot;diff diff-contentalign-left&quot; data-mw=&quot;interface&quot;&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;tr class=&quot;diff-title&quot; lang=&quot;en&quot;&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #222; text-align: center;&quot;&gt;← Older revision&lt;/td&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #222; text-align: center;&quot;&gt;Revision as of 09:54, 24 February 2020&lt;/td&gt;
				&lt;/tr&gt;&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l16&quot; &gt;Line 16:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 16:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;This has been conventional wisdom for a while.&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;This has been conventional wisdom for a while.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot;&gt; &lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;&lt;/ins&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot;&gt; &lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;== Test set overfitting ==&lt;/ins&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot;&gt; &lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;&lt;/ins&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot;&gt; &lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;There have been concerns about overfitting of test sets, as these are used more and more to measure the performance of machine learning algorithms. But [https://arxiv.org/pdf/1902.10811.pdf RRSS][https://dblp.org/rec/bibtex/conf/icml/RechtRSS19 19] analyze statistical patterns in reported test set performances, and argue that there is still actual progress.&lt;/ins&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;== Double descent ==&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;== Double descent ==&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/table&gt;</summary>
		<author><name>Lê Nguyên Hoang</name></author>
		
	</entry>
	<entry>
		<id>https://robustlybeneficial.org/wiki/index.php?title=Overfitting&amp;diff=71&amp;oldid=prev</id>
		<title>Lê Nguyên Hoang at 12:43, 22 January 2020</title>
		<link rel="alternate" type="text/html" href="https://robustlybeneficial.org/wiki/index.php?title=Overfitting&amp;diff=71&amp;oldid=prev"/>
		<updated>2020-01-22T12:43:37Z</updated>

		<summary type="html">&lt;p&gt;&lt;/p&gt;
&lt;table class=&quot;diff diff-contentalign-left&quot; data-mw=&quot;interface&quot;&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;tr class=&quot;diff-title&quot; lang=&quot;en&quot;&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #222; text-align: center;&quot;&gt;← Older revision&lt;/td&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #222; text-align: center;&quot;&gt;Revision as of 12:43, 22 January 2020&lt;/td&gt;
				&lt;/tr&gt;&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l5&quot; &gt;Line 5:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 5:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[http://web.mit.edu/6.435/www/Geman92.pdf GBD][https://scholar.google.ch/scholar?hl=en&amp;amp;as_sdt=0%2C5&amp;amp;q=neural+networks+and+the+bias%2Fvariance+dilemma+geman+bienenstock+doursat&amp;amp;btnG= 92] identified the bias-variance tradeoff, which quantifies out-of-sample errors as a sum of inductive bias and model variance, &amp;lt;em&amp;gt;for random samples&amp;lt;/em&amp;gt; drawn from the true distribution.&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[http://web.mit.edu/6.435/www/Geman92.pdf GBD][https://scholar.google.ch/scholar?hl=en&amp;amp;as_sdt=0%2C5&amp;amp;q=neural+networks+and+the+bias%2Fvariance+dilemma+geman+bienenstock+doursat&amp;amp;btnG= 92] identified the bias-variance tradeoff, which quantifies out-of-sample errors as a sum of inductive bias and model variance, &amp;lt;em&amp;gt;for random samples&amp;lt;/em&amp;gt; drawn from the true distribution.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;−&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;Formally, let f(x,S) the prediction of the algorithm trained with sample S for feature x. Then &lt;del class=&quot;diffchange diffchange-inline&quot;&gt;E&lt;/del&gt;[(f(x,S)-y)&lt;del class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;sup&amp;gt;&lt;/del&gt;2&lt;del class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;/sup&amp;gt;&lt;/del&gt;] = bias&lt;del class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;sup&amp;gt;&lt;/del&gt;2&lt;del class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;/sup&amp;gt; &lt;/del&gt;+ variance + sigma&lt;del class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;sup&amp;gt;&lt;/del&gt;2&amp;lt;/&lt;del class=&quot;diffchange diffchange-inline&quot;&gt;sup&lt;/del&gt;&amp;gt;, where the expectation is over x, S and y, bias = &lt;del class=&quot;diffchange diffchange-inline&quot;&gt;E&lt;/del&gt;[f(x,S)-f&lt;del class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;sup&amp;gt;&lt;/del&gt;*&lt;del class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;/sup&amp;gt;&lt;/del&gt;(x)] is the bias with respect to the optimal prediction &lt;del class=&quot;diffchange diffchange-inline&quot;&gt;f&lt;/del&gt;&amp;lt;&lt;del class=&quot;diffchange diffchange-inline&quot;&gt;sup&lt;/del&gt;&amp;gt;*&amp;lt;/&lt;del class=&quot;diffchange diffchange-inline&quot;&gt;sup&lt;/del&gt;&amp;gt;, variance = &lt;del class=&quot;diffchange diffchange-inline&quot;&gt;E&lt;/del&gt;[f(x,S)&lt;del class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;sup&amp;gt;&lt;/del&gt;2&lt;del class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;/sup&amp;gt;&lt;/del&gt;] - &lt;del class=&quot;diffchange diffchange-inline&quot;&gt;E&lt;/del&gt;[f(x,S)]&lt;del class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;sup&amp;gt;&lt;/del&gt;2&amp;lt;/&lt;del class=&quot;diffchange diffchange-inline&quot;&gt;sup&lt;/del&gt;&amp;gt; is how much the prediction varies from one sample to the other and &lt;del class=&quot;diffchange diffchange-inline&quot;&gt;sigma&lt;/del&gt;&amp;lt;&lt;del class=&quot;diffchange diffchange-inline&quot;&gt;sup&lt;/del&gt;&amp;gt;2&lt;del class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;/sup&amp;gt; &lt;/del&gt;= &lt;del class=&quot;diffchange diffchange-inline&quot;&gt;E&lt;/del&gt;[(f&lt;del class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;sup&amp;gt;&lt;/del&gt;*&lt;del class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;/sup&amp;gt;&lt;/del&gt;(x)-y)^2] is the unpredictable components of y given x.&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;Formally, let &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;math&amp;gt;&lt;/ins&gt;f(x,S)&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;/math&amp;gt; &lt;/ins&gt;the &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;math&amp;gt;y&amp;lt;/math&amp;gt;-&lt;/ins&gt;prediction of the algorithm trained with sample &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;math&amp;gt;&lt;/ins&gt;S&lt;ins 
class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;/math&amp;gt; &lt;/ins&gt;for feature &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;math&amp;gt;&lt;/ins&gt;x&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;/math&amp;gt;&lt;/ins&gt;. Then &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;math&amp;gt;\mathbb E_{x,y,S}&lt;/ins&gt;[(f(x,S)-y)&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;^&lt;/ins&gt;2] = bias&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;^&lt;/ins&gt;2 + variance + &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;\&lt;/ins&gt;sigma&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;^&lt;/ins&gt;2&amp;lt;/&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;math&lt;/ins&gt;&amp;gt;, where the expectation is over &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;math&amp;gt;&lt;/ins&gt;x&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;/math&amp;gt;&lt;/ins&gt;, &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;math&amp;gt;&lt;/ins&gt;S&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;/math&amp;gt; &lt;/ins&gt;and &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;math&amp;gt;&lt;/ins&gt;y&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;/math&amp;gt;&lt;/ins&gt;, &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;where &amp;lt;math&amp;gt;&lt;/ins&gt;bias = &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;\mathbb E_{x,S}&lt;/ins&gt;[f(x,S)-f&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;^&lt;/ins&gt;*(x)]&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;/math&amp;gt; &lt;/ins&gt;is the bias with respect to the optimal prediction &amp;lt;&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;math&lt;/ins&gt;&amp;gt;&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;f^&lt;/ins&gt;*&amp;lt;/&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;math&lt;/ins&gt;&amp;gt;, &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;where &amp;lt;math&amp;gt;&lt;/ins&gt;variance = &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;\mathbb E_{x,S}&lt;/ins&gt;[f(x,S)&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;^&lt;/ins&gt;2] - &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;\mathbb E_{x,S}&lt;/ins&gt;[f(x,S)]&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;^&lt;/ins&gt;2&amp;lt;/&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;math&lt;/ins&gt;&amp;gt; is how much the prediction varies from one sample to the other and &amp;lt;&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;math&lt;/ins&gt;&amp;gt;&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;\sigma^&lt;/ins&gt;2 = &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;\mathbb E_{x,y}&lt;/ins&gt;[(f&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;^&lt;/ins&gt;*(x)-y)^2]&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;/math&amp;gt; &lt;/ins&gt;is the unpredictable components of &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;math&amp;gt;&lt;/ins&gt;y&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;/math&amp;gt; &lt;/ins&gt;given &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;math&amp;gt;&lt;/ins&gt;x&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;/math&amp;gt;&lt;/ins&gt;.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;−&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;It has been argued that overfitting is caused by increased variance when we consider a learning algorithm that is too sensitive to the randomness of sampling S. This sometimes occurs when the number of parameters of the learning algorithm is too large (but not necessarily!).&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;It has been argued that overfitting is caused by increased variance when we consider a learning algorithm that is too sensitive to the randomness of sampling &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;math&amp;gt;&lt;/ins&gt;S&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;/math&amp;gt;&lt;/ins&gt;. This sometimes occurs when the number of parameters of the learning algorithm is too large (but not necessarily!).&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;== PAC-learning ==&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;== PAC-learning ==&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l13&quot; &gt;Line 13:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 13:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;To prove theoretical guarantees of non-overfitting, [http://web.mit.edu/6.435/www/Valiant84.pdf Valiant][https://dblp.org/rec/bibtex/journals/cacm/Valiant84 84] introduced the concept of &amp;lt;em&amp;gt;probably approximately correct&amp;lt;/em&amp;gt; (PAC) learning. More explanations here: [https://www.youtube.com/watch?v=uB2X2OuD4Rg&amp;amp;list=PLie7a1OUTSagZB9mFZnVBgsNfBtcUGJWB&amp;amp;index=8 Wandida16a].&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;To prove theoretical guarantees of non-overfitting, [http://web.mit.edu/6.435/www/Valiant84.pdf Valiant][https://dblp.org/rec/bibtex/journals/cacm/Valiant84 84] introduced the concept of &amp;lt;em&amp;gt;probably approximately correct&amp;lt;/em&amp;gt; (PAC) learning. More explanations here: [https://www.youtube.com/watch?v=uB2X2OuD4Rg&amp;amp;list=PLie7a1OUTSagZB9mFZnVBgsNfBtcUGJWB&amp;amp;index=8 Wandida16a].&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;−&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;In particular, the fundamental theorem of statistical learning [https://www.amazon.com/Understanding-Machine-Learning-Theory-Algorithms/dp/1107057132/ ShalevshwartzBendavidBook][https://scholar.google.ch/scholar?hl=en&amp;amp;as_sdt=0%2C5&amp;amp;q=Understanding+Machine+Learning+Theory+Algorithms+shalev-shwartz+ben-david&amp;amp;btnG= 14] [https://www.youtube.com/watch?v=RkWuLtFPBKU&amp;amp;list=PLie7a1OUTSagZB9mFZnVBgsNfBtcUGJWB&amp;amp;index=14 Wandida16b] provides guarantees of PAC learning, when the number of data points sufficiently exceed the VC dimension of the set of learnable algorithms. Since this VC dimension is often essentially the number of parameters (assuming finite representation of the parameters as float or double), then this means that PAC learning is guaranteed when #data &lt;del class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;&amp;lt; &lt;/del&gt;#parameters.&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;In particular, the fundamental theorem of statistical learning [https://www.amazon.com/Understanding-Machine-Learning-Theory-Algorithms/dp/1107057132/ ShalevshwartzBendavidBook][https://scholar.google.ch/scholar?hl=en&amp;amp;as_sdt=0%2C5&amp;amp;q=Understanding+Machine+Learning+Theory+Algorithms+shalev-shwartz+ben-david&amp;amp;btnG= 14] [https://www.youtube.com/watch?v=RkWuLtFPBKU&amp;amp;list=PLie7a1OUTSagZB9mFZnVBgsNfBtcUGJWB&amp;amp;index=14 Wandida16b] provides guarantees of PAC learning, when the number of data points sufficiently exceed the VC dimension of the set of learnable algorithms. Since this VC dimension is often essentially the number of parameters (assuming finite representation of the parameters as float or double), then this means that PAC learning is guaranteed when &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;math&amp;gt;\&lt;/ins&gt;#data &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;\ll \&lt;/ins&gt;#parameters&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;/math&amp;gt;&lt;/ins&gt;.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;This has become a conventional wisdom for a while.&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;This has become a conventional wisdom for a while.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l27&quot; &gt;Line 27:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 27:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;== Details ==&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;== Details ==&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;−&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[https://openreview.net/pdf?id=Sy8gdB9xx ZBHRV][https://dblp.org/rec/bibtex/conf/iclr/ZhangBHRV17 17] showed that large interpolating neural networks generalize well, even for large noise in the data. Also, they showed that inductive bias likely plays a limited role, as neural networks still manage to learn quite efficiently data whose labels are completely shuffled. They also proved that a neural network with 2n+d parameters can interpolate n data points of dimension d.&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[https://openreview.net/pdf?id=Sy8gdB9xx ZBHRV][https://dblp.org/rec/bibtex/conf/iclr/ZhangBHRV17 17] showed that large interpolating neural networks generalize well, even for large noise in the data. Also, they showed that inductive bias likely plays a limited role, as neural networks still manage to learn quite efficiently data whose labels are completely shuffled. They also proved that a neural network with &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;math&amp;gt;&lt;/ins&gt;2n+d&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;/math&amp;gt; &lt;/ins&gt;parameters can interpolate &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;math&amp;gt;&lt;/ins&gt;n&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;/math&amp;gt; &lt;/ins&gt;data points of dimension &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;math&amp;gt;&lt;/ins&gt;d&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;/math&amp;gt;&lt;/ins&gt;.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[https://www.pnas.org/content/pnas/116/32/15849.full.pdf BHMM][https://scholar.google.ch/scholar?hl=en&amp;amp;as_sdt=0%2C5&amp;amp;q=Reconciling+modern+machine-learning+practice+and+the+classical+bias%E2%80%93variance+trade-off&amp;amp;btnG= 19] observed double descent for random Fourier features (which [http://papers.nips.cc/paper/3182-random-features-for-large-scale-kernel-machines.pdf RahimiRecht][https://dblp.org/rec/bibtex/conf/nips/RahimiR07 07] proved to be intimately connected to kernel methods), neural networks, decision tree and ensemble methods.&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[https://www.pnas.org/content/pnas/116/32/15849.full.pdf BHMM][https://scholar.google.ch/scholar?hl=en&amp;amp;as_sdt=0%2C5&amp;amp;q=Reconciling+modern+machine-learning+practice+and+the+classical+bias%E2%80%93variance+trade-off&amp;amp;btnG= 19] observed double descent for random Fourier features (which [http://papers.nips.cc/paper/3182-random-features-for-large-scale-kernel-machines.pdf RahimiRecht][https://dblp.org/rec/bibtex/conf/nips/RahimiR07 07] proved to be intimately connected to kernel methods), neural networks, decision tree and ensemble methods.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l33&quot; &gt;Line 33:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 33:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[https://arxiv.org/pdf/1912.02292 NKBYBS][https://dblp.org/rec/bibtex/journals/corr/abs-1912-02292 19] show that a very wide variety of deep neural networks exhibit a wide variety of double descent phenomenons. Not only is there double descent with respect to the number of parameters, but there also seems to be double descent with respect to the width of the neural networks, and weirdly also with respect to epochs of learning steps. They conjecture that &amp;quot;effective model complexity&amp;quot; (the number of data points for which the model is able to achieve small training loss) is a critical point where overfitting occurs. Before and beyond this, overfitting appears to vanish.&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[https://arxiv.org/pdf/1912.02292 NKBYBS][https://dblp.org/rec/bibtex/journals/corr/abs-1912-02292 19] show that a very wide variety of deep neural networks exhibit a wide variety of double descent phenomenons. Not only is there double descent with respect to the number of parameters, but there also seems to be double descent with respect to the width of the neural networks, and weirdly also with respect to epochs of learning steps. They conjecture that &amp;quot;effective model complexity&amp;quot; (the number of data points for which the model is able to achieve small training loss) is a critical point where overfitting occurs. Before and beyond this, overfitting appears to vanish.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;−&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[http://proceedings.mlr.press/v80/belkin18a/belkin18a.pdf BMM][https://dblp.org/rec/bibtex/conf/icml/BelkinMM18 18] present experiments that show that interpolating kernel methods also generalize well and are able to fit random labels (though in this paper, they do not exhibit double descent). They also show that, because norms of interpolaters grow superpolynomially in Hilbert space (in exp(theta(n^&lt;del class=&quot;diffchange diffchange-inline&quot;&gt;(&lt;/del&gt;1/d))&lt;del class=&quot;diffchange diffchange-inline&quot;&gt;)&lt;/del&gt;), usual bounds controlling overfitting are actually trivial for large datasets. This indicates the need for radically different approach to understand overfitting.&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[http://proceedings.mlr.press/v80/belkin18a/belkin18a.pdf BMM][https://dblp.org/rec/bibtex/conf/icml/BelkinMM18 18] present experiments that show that interpolating kernel methods also generalize well and are able to fit random labels (though in this paper, they do not exhibit double descent). They also show that, because norms of interpolaters grow superpolynomially in Hilbert space (in &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;math&amp;gt;&lt;/ins&gt;exp(&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;\&lt;/ins&gt;theta(n^&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;{&lt;/ins&gt;1/d&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;}&lt;/ins&gt;))&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;/math&amp;gt;&lt;/ins&gt;), usual bounds controlling overfitting are actually trivial for large datasets. This indicates the need for radically different approach to understand overfitting.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;−&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[http://proceedings.mlr.press/v89/belkin19a/belkin19a.pdf BRT][https://dblp.org/rec/bibtex/conf/aistats/BelkinRT19 19] show examples of singular kernel interpolators (K(x,y)&lt;del class=&quot;diffchange diffchange-inline&quot;&gt;~&lt;/del&gt;||x-y||^&lt;del class=&quot;diffchange diffchange-inline&quot;&gt;(&lt;/del&gt;-a&lt;del class=&quot;diffchange diffchange-inline&quot;&gt;)&lt;/del&gt;) that achieve optimal rates, even for improper learning (meaning that the true function to learn does not belong to the set of hypotheses).&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[http://proceedings.mlr.press/v89/belkin19a/belkin19a.pdf BRT][https://dblp.org/rec/bibtex/conf/aistats/BelkinRT19 19] show examples of singular kernel interpolators (&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;math&amp;gt;&lt;/ins&gt;K(x,y)&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;\sim&lt;/ins&gt;||x-y||^&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;{&lt;/ins&gt;-a&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;}&amp;lt;/math&amp;gt;&lt;/ins&gt;) that achieve optimal rates, even for improper learning (meaning that the true function to learn does not belong to the set of hypotheses).&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;Note that the connection between kernel methods and neural networks has been made, for instance by [http://papers.nips.cc/paper/3182-random-features-for-large-scale-kernel-machines.pdf RahimiRecht][https://dblp.org/rec/bibtex/conf/nips/RahimiR07 07] and [http://papers.nips.cc/paper/8076-neural-tangent-kernel-convergence-and-generalization-in-neural-networks.pdf JHG][https://dblp.org/rec/bibtex/conf/nips/JacotHG18 18]. Essentially, random features implement approximate kernel methods. And the first layers of neural networks with random (or even trained) weights can be regarded as computations of random features.&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;Note that the connection between kernel methods and neural networks has been made, for instance by [http://papers.nips.cc/paper/3182-random-features-for-large-scale-kernel-machines.pdf RahimiRecht][https://dblp.org/rec/bibtex/conf/nips/RahimiR07 07] and [http://papers.nips.cc/paper/8076-neural-tangent-kernel-convergence-and-generalization-in-neural-networks.pdf JHG][https://dblp.org/rec/bibtex/conf/nips/JacotHG18 18]. Essentially, random features implement approximate kernel methods. And the first layers of neural networks with random (or even trained) weights can be regarded as computations of random features.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;−&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[https://arxiv.org/pdf/1903.08560 HMRT][https://dblp.org/rec/bibtex/journals/corr/abs-1903-08560 19] use random matrix theory (to estimate eigenvalues of X^TX when X are random samples) to show that ridgless regression (regression with minimum &lt;del class=&quot;diffchange diffchange-inline&quot;&gt;l2&lt;/del&gt;-norm) features &amp;quot;infinite double descent&amp;quot; as the size n of the training data sets grows to infinity, along with the number of parameters p (they assume p/n &lt;del class=&quot;diffchange diffchange-inline&quot;&gt;= &lt;/del&gt;gamma, and show infinite overfitting for gamma&lt;del class=&quot;diffchange diffchange-inline&quot;&gt;~&lt;/del&gt;1). This is shown for both a linear model where X = &lt;del class=&quot;diffchange diffchange-inline&quot;&gt;∑&lt;/del&gt;^{1/2} Z, for some fixed &lt;del class=&quot;diffchange diffchange-inline&quot;&gt;∑ &lt;/del&gt;and some well-behaved Z of mean 0 and variance 1, and a &lt;del class=&quot;diffchange diffchange-inline&quot;&gt;componentwise &lt;/del&gt;linearity X = &lt;del class=&quot;diffchange diffchange-inline&quot;&gt;σ&lt;/del&gt;(WZ). In both cases, it is assumed that y=x^&lt;del class=&quot;diffchange diffchange-inline&quot;&gt;Tβ&lt;/del&gt;+&lt;del class=&quot;diffchange diffchange-inline&quot;&gt;ε&lt;/del&gt;, where E[&lt;del class=&quot;diffchange diffchange-inline&quot;&gt;ε&lt;/del&gt;]=0. There are also assumptions of finite fixed variance for &lt;del class=&quot;diffchange diffchange-inline&quot;&gt;ε&lt;/del&gt;, and finite fourth moment for &lt;del class=&quot;diffchange diffchange-inline&quot;&gt;z &lt;/del&gt;(needed for random matrix theory).&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[https://arxiv.org/pdf/1903.08560 HMRT][https://dblp.org/rec/bibtex/journals/corr/abs-1903-08560 19] use random matrix theory (to estimate eigenvalues of &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;math&amp;gt;&lt;/ins&gt;X^TX&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;/math&amp;gt; &lt;/ins&gt;when &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;math&amp;gt;&lt;/ins&gt;X&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;/math&amp;gt; &lt;/ins&gt;are random samples) to show that ridgless regression (regression with minimum &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;math&amp;gt;\ell_2&amp;lt;/math&amp;gt;&lt;/ins&gt;-norm) features &amp;quot;infinite double descent&amp;quot; as the size &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;math&amp;gt;&lt;/ins&gt;n&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;/math&amp;gt; &lt;/ins&gt;of the training data sets grows to infinity, along with the number of parameters &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;math&amp;gt;&lt;/ins&gt;p&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;/math&amp;gt; &lt;/ins&gt;(they assume &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;math&amp;gt;&lt;/ins&gt;p/n 
&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;\rightarrow &lt;/ins&gt;gamma&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;/math&amp;gt;&lt;/ins&gt;, and show infinite overfitting for &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;math&amp;gt;&lt;/ins&gt;gamma&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;\sim &lt;/ins&gt;1&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;/math&amp;gt;&lt;/ins&gt;). This is shown for both a linear model where &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;math&amp;gt;&lt;/ins&gt;X = &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;\Sigma&lt;/ins&gt;^{1/2} Z&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;/math&amp;gt;&lt;/ins&gt;, for some fixed &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;math&amp;gt;\Sigma&amp;lt;/math&amp;gt; &lt;/ins&gt;and some well-behaved &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;math&amp;gt;&lt;/ins&gt;Z&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;/math&amp;gt; &lt;/ins&gt;of mean 0 and variance 1, and a &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;component-wise &lt;/ins&gt;linearity &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;math&amp;gt;&lt;/ins&gt;X = &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;\sigma&lt;/ins&gt;(WZ)&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;/math&amp;gt;&lt;/ins&gt;. In both cases, it is assumed that &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;math&amp;gt;&lt;/ins&gt;y=x^&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;T\beta&lt;/ins&gt;+&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;\varepsilon&amp;lt;/math&amp;gt;&lt;/ins&gt;, where &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;math&amp;gt;\mathbb &lt;/ins&gt;E[&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;\varepsilon&lt;/ins&gt;]=0&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;/math&amp;gt;&lt;/ins&gt;. There are also assumptions of finite fixed variance for &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;math&amp;gt;\varepsilon&amp;lt;/math&amp;gt;&lt;/ins&gt;, and finite fourth moment for &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;math&amp;gt;Z&amp;lt;/math&amp;gt; &lt;/ins&gt;(needed for random matrix theory).&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;−&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[https://arxiv.org/pdf/1903.07571 BHX][https://dblp.org/rec/bibtex/journals/corr/abs-1903-07571 19] analyze two other data models. The former is a classical Gaussian linear regression with a huge space of features. But the regression is only made within a (random) subset T of features, in which case double descent is observed, and errors can be derived from the norms of the true regression for T-coordinates and non-T-coordinates. A similar analysis is then provided for a random Fourier feature model.&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[https://arxiv.org/pdf/1903.07571 BHX][https://dblp.org/rec/bibtex/journals/corr/abs-1903-07571 19] analyze two other data models. The former is a classical Gaussian linear regression with a huge space of features. But the regression is only made within a (random) subset &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;math&amp;gt;&lt;/ins&gt;T&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;/math&amp;gt; &lt;/ins&gt;of features, in which case double descent is observed, and errors can be derived from the norms of the true regression for &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;math&amp;gt;&lt;/ins&gt;T&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;/math&amp;gt;&lt;/ins&gt;-coordinates and non-&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;math&amp;gt;&lt;/ins&gt;T&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;/math&amp;gt;&lt;/ins&gt;-coordinates. A similar analysis is then provided for a random Fourier feature model.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;−&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[https://arxiv.org/pdf/1908.05355 MeiMontanari][https://scholar.google.ch/scholar?hl=en&amp;amp;as_sdt=0%2C5&amp;amp;q=The+generalization+error+of+random+features+regression%3A+Precise+asymptotics+and+double+descent+curve&amp;amp;btnG= 19] consider still another data model, where z is drawn uniformly randomly on a d-sphere, and y = f(z)+&lt;del class=&quot;diffchange diffchange-inline&quot;&gt;ε&lt;/del&gt;, such that E[&lt;del class=&quot;diffchange diffchange-inline&quot;&gt;ε&lt;/del&gt;]=0 and &lt;del class=&quot;diffchange diffchange-inline&quot;&gt;ε &lt;/del&gt;has finite fourth moment. The ridgeless linear regression is then over some random features &lt;del class=&quot;diffchange diffchange-inline&quot;&gt;x&lt;/del&gt;&amp;lt;&lt;del class=&quot;diffchange diffchange-inline&quot;&gt;sub&lt;/del&gt;&amp;gt;&lt;del class=&quot;diffchange diffchange-inline&quot;&gt;i&lt;/del&gt;&amp;lt;/&lt;del class=&quot;diffchange diffchange-inline&quot;&gt;sub&lt;/del&gt;&amp;gt;&lt;del class=&quot;diffchange diffchange-inline&quot;&gt;=σ(Wz&lt;/del&gt;&amp;lt;&lt;del class=&quot;diffchange diffchange-inline&quot;&gt;sub&lt;/del&gt;&amp;gt;&lt;del class=&quot;diffchange diffchange-inline&quot;&gt;i&lt;/del&gt;&amp;lt;/&lt;del class=&quot;diffchange diffchange-inline&quot;&gt;sub&lt;/del&gt;&amp;gt;&lt;del class=&quot;diffchange diffchange-inline&quot;&gt;), where σ &lt;/del&gt;applies &lt;del class=&quot;diffchange diffchange-inline&quot;&gt;componentwise &lt;/del&gt;and W has random rows of &lt;del class=&quot;diffchange diffchange-inline&quot;&gt;l&lt;/del&gt;&amp;lt;&lt;del class=&quot;diffchange diffchange-inline&quot;&gt;sub&lt;/del&gt;&amp;gt;&lt;del class=&quot;diffchange diffchange-inline&quot;&gt;2&lt;/del&gt;&amp;lt;/&lt;del class=&quot;diffchange diffchange-inline&quot;&gt;sub&lt;/del&gt;&amp;gt;-norm equal to 1. 
They prove that this yields a double descent.&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[https://arxiv.org/pdf/1908.05355 MeiMontanari][https://scholar.google.ch/scholar?hl=en&amp;amp;as_sdt=0%2C5&amp;amp;q=The+generalization+error+of+random+features+regression%3A+Precise+asymptotics+and+double+descent+curve&amp;amp;btnG= 19] consider still another data model, where &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;math&amp;gt;&lt;/ins&gt;z&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;/math&amp;gt; &lt;/ins&gt;is drawn uniformly randomly on a &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;math&amp;gt;&lt;/ins&gt;d&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;/math&amp;gt;&lt;/ins&gt;-sphere, and &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;math&amp;gt;&lt;/ins&gt;y = f(z)+&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;\varepsilon&amp;lt;/math&amp;gt;&lt;/ins&gt;, such that &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;math&amp;gt;\mathbb &lt;/ins&gt;E[&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;\varepsilon&lt;/ins&gt;]=0&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;/math&amp;gt; &lt;/ins&gt;and &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;math&amp;gt;\varepsilon&amp;lt;/math&amp;gt; &lt;/ins&gt;has finite fourth moment. The ridgeless linear regression is then over some random features &amp;lt;&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;math&lt;/ins&gt;&amp;gt;&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;x_i = \sigma(Wz_i)&lt;/ins&gt;&amp;lt;/&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;math&lt;/ins&gt;&amp;gt;&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;, where &lt;/ins&gt;&amp;lt;&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;math&lt;/ins&gt;&amp;gt;&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;\sigma&lt;/ins&gt;&amp;lt;/&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;math&lt;/ins&gt;&amp;gt; applies &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;component-wisely &lt;/ins&gt;and &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;math&amp;gt;&lt;/ins&gt;W&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;/math&amp;gt; &lt;/ins&gt;has random rows of &amp;lt;&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;math&lt;/ins&gt;&amp;gt;&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;\ell_2&lt;/ins&gt;&amp;lt;/&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;math&lt;/ins&gt;&amp;gt;-norm equal to 1. They prove that this yields a double descent.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/table&gt;</summary>
		<author><name>Lê Nguyên Hoang</name></author>
		
	</entry>
	<entry>
		<id>https://robustlybeneficial.org/wiki/index.php?title=Overfitting&amp;diff=19&amp;oldid=prev</id>
		<title>Lê Nguyên Hoang: Created page with &quot;Overfitting occurs where fitting training data too closely is counter-productive to out-of-sample predictions.  == Bias-variance tradeoff ==  [http://web.mit.edu/6.435/www/Gem...&quot;</title>
		<link rel="alternate" type="text/html" href="https://robustlybeneficial.org/wiki/index.php?title=Overfitting&amp;diff=19&amp;oldid=prev"/>
		<updated>2020-01-20T21:31:15Z</updated>

		<summary type="html">&lt;p&gt;Created page with &amp;quot;Overfitting occurs where fitting training data too closely is counter-productive to out-of-sample predictions.  == Bias-variance tradeoff ==  [http://web.mit.edu/6.435/www/Gem...&amp;quot;&lt;/p&gt;
&lt;p&gt;&lt;b&gt;New page&lt;/b&gt;&lt;/p&gt;&lt;div&gt;Overfitting occurs where fitting training data too closely is counter-productive to out-of-sample predictions.&lt;br /&gt;
&lt;br /&gt;
== Bias-variance tradeoff ==&lt;br /&gt;
&lt;br /&gt;
[http://web.mit.edu/6.435/www/Geman92.pdf GBD][https://scholar.google.ch/scholar?hl=en&amp;amp;as_sdt=0%2C5&amp;amp;q=neural+networks+and+the+bias%2Fvariance+dilemma+geman+bienenstock+doursat&amp;amp;btnG= 92] identified the bias-variance tradeoff, which quantifies out-of-sample errors as a sum of inductive bias and model variance, &amp;lt;em&amp;gt;for random samples&amp;lt;/em&amp;gt; drawn from the true distribution.&lt;br /&gt;
&lt;br /&gt;
Formally, let f(x,S) the prediction of the algorithm trained with sample S for feature x. Then E[(f(x,S)-y)&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;] = bias&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt; + variance + sigma&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;, where the expectation is over x, S and y, bias = E[f(x,S)-f&amp;lt;sup&amp;gt;*&amp;lt;/sup&amp;gt;(x)] is the bias with respect to the optimal prediction f&amp;lt;sup&amp;gt;*&amp;lt;/sup&amp;gt;, variance = E[f(x,S)&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;] - E[f(x,S)]&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt; is how much the prediction varies from one sample to the other and sigma&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt; = E[(f&amp;lt;sup&amp;gt;*&amp;lt;/sup&amp;gt;(x)-y)^2] is the unpredictable components of y given x.&lt;br /&gt;
&lt;br /&gt;
It has been argued that overfitting is caused by increased variance when we consider a learning algorithm that is too sensitive to the randomness of sampling S. This sometimes occurs when the number of parameters of the learning algorithm is too large (but not necessarily!).&lt;br /&gt;
&lt;br /&gt;
== PAC-learning ==&lt;br /&gt;
&lt;br /&gt;
To prove theoretical guarantees of non-overfitting, [http://web.mit.edu/6.435/www/Valiant84.pdf Valiant][https://dblp.org/rec/bibtex/journals/cacm/Valiant84 84] introduced the concept of &amp;lt;em&amp;gt;probably approximately correct&amp;lt;/em&amp;gt; (PAC) learning. More explanations here: [https://www.youtube.com/watch?v=uB2X2OuD4Rg&amp;amp;list=PLie7a1OUTSagZB9mFZnVBgsNfBtcUGJWB&amp;amp;index=8 Wandida16a].&lt;br /&gt;
&lt;br /&gt;
In particular, the fundamental theorem of statistical learning [https://www.amazon.com/Understanding-Machine-Learning-Theory-Algorithms/dp/1107057132/ ShalevshwartzBendavidBook][https://scholar.google.ch/scholar?hl=en&amp;amp;as_sdt=0%2C5&amp;amp;q=Understanding+Machine+Learning+Theory+Algorithms+shalev-shwartz+ben-david&amp;amp;btnG= 14] [https://www.youtube.com/watch?v=RkWuLtFPBKU&amp;amp;list=PLie7a1OUTSagZB9mFZnVBgsNfBtcUGJWB&amp;amp;index=14 Wandida16b] provides guarantees of PAC learning, when the number of data points sufficiently exceed the VC dimension of the set of learnable algorithms. Since this VC dimension is often essentially the number of parameters (assuming finite representation of the parameters as float or double), then this means that PAC learning is guaranteed when #data &amp;lt;&amp;lt; #parameters.&lt;br /&gt;
&lt;br /&gt;
This has become a conventional wisdom for a while.&lt;br /&gt;
&lt;br /&gt;
== Double descent ==&lt;br /&gt;
&lt;br /&gt;
However, the conventional wisdom is in sharp contradiction with today's success of deep neural networks [https://openreview.net/pdf?id=Sy8gdB9xx ZBHRV][https://dblp.org/rec/bibtex/conf/iclr/ZhangBHRV17 17], [https://www.pnas.org/content/pnas/116/32/15849.full.pdf BHMM][https://scholar.google.ch/scholar?hl=en&amp;amp;as_sdt=0%2C5&amp;amp;q=Reconciling+modern+machine-learning+practice+and+the+classical+bias%E2%80%93variance+trade-off&amp;amp;btnG= 19] [https://arxiv.org/pdf/1912.02292 NKBYBS][https://dblp.org/rec/bibtex/journals/corr/abs-1912-02292 19], but also kernel methods [http://proceedings.mlr.press/v80/belkin18a/belkin18a.pdf BMM][https://dblp.org/rec/bibtex/conf/icml/BelkinMM18 18] [http://proceedings.mlr.press/v89/belkin19a/belkin19a.pdf BRT][https://dblp.org/rec/bibtex/conf/aistats/BelkinRT19 19], ridgeless (random feature) linear regression [https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&amp;amp;arnumber=8849614 MVS][https://dblp.org/rec/bibtex/conf/isit/MuthukumarVS19 19] [https://arxiv.org/pdf/1906.11300 BLLT][https://dblp.org/rec/bibtex/journals/corr/abs-1906-11300 19] [https://arxiv.org/pdf/1908.05355 MeiMontanari][https://scholar.google.ch/scholar?hl=en&amp;amp;as_sdt=0%2C5&amp;amp;q=The+generalization+error+of+random+features+regression%3A+Precise+asymptotics+and+double+descent+curve&amp;amp;btnG= 19] [https://arxiv.org/pdf/1903.08560 HMRT][https://dblp.org/rec/bibtex/journals/corr/abs-1903-08560 19] [https://arxiv.org/pdf/1903.07571 BHX][https://dblp.org/rec/bibtex/journals/corr/abs-1903-07571 19] and even ensembles [https://www.pnas.org/content/pnas/116/32/15849.full.pdf BHMM][https://scholar.google.ch/scholar?hl=en&amp;amp;as_sdt=0%2C5&amp;amp;q=Reconciling+modern+machine-learning+practice+and+the+classical+bias%E2%80%93variance+trade-off&amp;amp;btnG= 19]. Learning algorithms seem to often achieve their best out-of-sample performance when they are massively overparameterized and perfectly fit the training data (called &amp;lt;em&amp;gt;interpolation&amp;lt;/em&amp;gt;).&lt;br /&gt;
&lt;br /&gt;
Intriguingly, a &amp;lt;em&amp;gt;double descent&amp;lt;/em&amp;gt; phenomenon often occurs [https://www.pnas.org/content/pnas/116/32/15849.full.pdf BHMM][https://scholar.google.ch/scholar?hl=en&amp;amp;as_sdt=0%2C5&amp;amp;q=Reconciling+modern+machine-learning+practice+and+the+classical+bias%E2%80%93variance+trade-off&amp;amp;btnG= 19] [https://arxiv.org/pdf/1912.02292 NKBYBS][https://dblp.org/rec/bibtex/journals/corr/abs-1912-02292 19], where the performance at the test set first behaves as predicted by the bias-variance dilemma, but then improves and outperforms what would be advised by classical statistical learning.&lt;br /&gt;
&lt;br /&gt;
All these results suggest that overfitting eventually disappears as models grow larger, which contradicts the conventional wisdom.&lt;br /&gt;
&lt;br /&gt;
== Details ==&lt;br /&gt;
&lt;br /&gt;
[https://openreview.net/pdf?id=Sy8gdB9xx ZBHRV][https://dblp.org/rec/bibtex/conf/iclr/ZhangBHRV17 17] showed that large interpolating neural networks generalize well, even when the data are very noisy. They also showed that inductive bias likely plays a limited role, as neural networks still manage to fit, quite efficiently, data whose labels have been completely shuffled. Finally, they proved that a neural network with 2n+d parameters can interpolate n data points of dimension d.&lt;br /&gt;
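&lt;br /&gt;
A minimal sketch of the random-label experiment, here with a small scikit-learn multilayer perceptron rather than the large architectures of the paper; the sizes and hyperparameters are illustrative assumptions.&lt;br /&gt;
&lt;pre&gt;
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))      # random inputs
y = rng.integers(0, 2, size=200)    # completely shuffled (random) labels

# A sufficiently overparameterized network can still drive training error to zero.
net = MLPClassifier(hidden_layer_sizes=(512,), alpha=0.0,
                    max_iter=5000, tol=1e-7, random_state=0)
net.fit(X, y)
print("training accuracy:", net.score(X, y))  # typically 1.0: pure memorization
&lt;/pre&gt;&lt;br /&gt;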
&lt;br /&gt;
[https://www.pnas.org/content/pnas/116/32/15849.full.pdf BHMM][https://scholar.google.ch/scholar?hl=en&amp;amp;as_sdt=0%2C5&amp;amp;q=Reconciling+modern+machine-learning+practice+and+the+classical+bias%E2%80%93variance+trade-off&amp;amp;btnG= 19] observed double descent for random Fourier features (which [http://papers.nips.cc/paper/3182-random-features-for-large-scale-kernel-machines.pdf RahimiRecht][https://dblp.org/rec/bibtex/conf/nips/RahimiR07 07] proved to be intimately connected to kernel methods), neural networks, decision trees and ensemble methods.&lt;br /&gt;
&lt;br /&gt;
[https://arxiv.org/pdf/1912.02292 NKBYBS][https://dblp.org/rec/bibtex/journals/corr/abs-1912-02292 19] show that a very wide variety of deep neural networks exhibit a wide variety of double descent phenomena. Not only is there double descent with respect to the number of parameters, but there also seems to be double descent with respect to the width of the neural networks and, more surprisingly, with respect to the number of training epochs. They conjecture that overfitting peaks when the &amp;quot;effective model complexity&amp;quot; (the largest number of data points on which the model can achieve near-zero training error) matches the size of the training set; before and beyond this critical regime, overfitting appears to vanish.&lt;br /&gt;
&lt;br /&gt;
[http://proceedings.mlr.press/v80/belkin18a/belkin18a.pdf BMM][https://dblp.org/rec/bibtex/conf/icml/BelkinMM18 18] present experiments showing that interpolating kernel methods also generalize well and are able to fit random labels (though in this paper, they do not exhibit double descent). They also show that, because the Hilbert-space norms of interpolants grow superpolynomially with the number n of data points (as exp(Θ(n&lt;sup&gt;1/d&lt;/sup&gt;))), the usual norm-based bounds controlling overfitting are actually trivial for large datasets. This indicates the need for a radically different approach to understanding overfitting.&lt;br /&gt;
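&lt;br /&gt;
A minimal sketch of how such norms can be computed: for a kernel matrix K on the training points, the minimum-norm kernel interpolant of labels y has squared RKHS norm y&lt;sup&gt;T&lt;/sup&gt;K&lt;sup&gt;-1&lt;/sup&gt;y, whose growth with n can be watched directly; the Gaussian kernel and data model below are illustrative assumptions.&lt;br /&gt;
&lt;pre&gt;
import numpy as np

def rkhs_norm_squared(X, y, bandwidth=1.0):
    """Squared RKHS norm y^T K^{-1} y of the min-norm Gaussian-kernel interpolant."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-sq_dists / (2.0 * bandwidth ** 2))
    alpha = np.linalg.solve(K + 1e-10 * np.eye(len(X)), y)  # tiny jitter for stability
    return float(y @ alpha)

rng = np.random.default_rng(0)
for n in [20, 40, 80, 160, 320]:
    X = rng.normal(size=(n, 2))
    y = rng.choice([-1.0, 1.0], size=n)   # random labels, as in the experiments
    print(n, rkhs_norm_squared(X, y))
&lt;/pre&gt;&lt;br /&gt;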
&lt;br /&gt;
[http://proceedings.mlr.press/v89/belkin19a/belkin19a.pdf BRT][https://dblp.org/rec/bibtex/conf/aistats/BelkinRT19 19] show examples of singular-kernel interpolators (with K(x,y) ~ ||x-y||&lt;sup&gt;-a&lt;/sup&gt; near x = y) that achieve statistically optimal rates, even for improper learning (meaning that the true function to be learned does not belong to the set of hypotheses).&lt;br /&gt;
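&lt;br /&gt;
For intuition on why singular kernels interpolate, here is a minimal Shepard-style (Nadaraya-Watson) sketch with weights ||x - x&lt;sub&gt;i&lt;/sub&gt;||&lt;sup&gt;-a&lt;/sup&gt;: as x approaches a training point, that point's weight blows up and the prediction converges to its label. This only illustrates the mechanism, not the precise estimators analyzed in the paper.&lt;br /&gt;
&lt;pre&gt;
import numpy as np

def singular_kernel_predict(x, X_train, y_train, a=2.0, eps=1e-12):
    """Nadaraya-Watson prediction with the singular kernel ||x - x_i||^(-a).

    Exactly at a training point the weight diverges, so we return that label;
    this is what makes the estimator an interpolator.
    """
    dists = np.linalg.norm(X_train - x, axis=1)
    hit = np.argmin(dists)
    if dists[hit] &amp;lt; eps:
        return float(y_train[hit])
    w = dists ** (-a)
    return float(w @ y_train / w.sum())

rng = np.random.default_rng(0)
X = rng.uniform(size=(30, 1))
y = np.sin(2 * np.pi * X[:, 0]) + 0.1 * rng.normal(size=30)

print(singular_kernel_predict(X[3], X, y))             # returns exactly y[3]
print(singular_kernel_predict(np.array([0.5]), X, y))  # smooth elsewhere
&lt;/pre&gt;&lt;br /&gt;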
&lt;br /&gt;
Note that the connection between kernel methods and neural networks has been made several times, for instance by [http://papers.nips.cc/paper/3182-random-features-for-large-scale-kernel-machines.pdf RahimiRecht][https://dblp.org/rec/bibtex/conf/nips/RahimiR07 07] and [http://papers.nips.cc/paper/8076-neural-tangent-kernel-convergence-and-generalization-in-neural-networks.pdf JHG][https://dblp.org/rec/bibtex/conf/nips/JacotHG18 18]. Essentially, random features implement approximate kernel methods, and the first layers of a neural network, with random (or even trained) weights, can be regarded as computing random features.&lt;br /&gt;
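&lt;br /&gt;
A minimal sketch of the Rahimi-Recht construction: random Fourier features whose inner products approximate the Gaussian (RBF) kernel. The dimensions and bandwidth below are illustrative assumptions.&lt;br /&gt;
&lt;pre&gt;
import numpy as np

rng = np.random.default_rng(0)
d, D, bandwidth = 5, 2000, 1.0

# Random Fourier features: z(x)^T z(x') approximates the Gaussian kernel
# exp(-||x - x'||^2 / (2 * bandwidth^2)), with accuracy improving in D.
W = rng.normal(scale=1.0 / bandwidth, size=(D, d))
b = rng.uniform(0.0, 2.0 * np.pi, size=D)

def rff(X):
    return np.sqrt(2.0 / D) * np.cos(X @ W.T + b)

X = rng.normal(size=(8, d))
Z = rff(X)
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K_exact = np.exp(-sq_dists / (2.0 * bandwidth ** 2))
print(np.max(np.abs(Z @ Z.T - K_exact)))  # small, and shrinking as D grows
&lt;/pre&gt;&lt;br /&gt;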
&lt;br /&gt;
[https://arxiv.org/pdf/1903.08560 HMRT][https://dblp.org/rec/bibtex/journals/corr/abs-1903-08560 19] use random matrix theory (to estimate the eigenvalues of X^TX when X consists of random samples) to show that ridgeless regression (least squares with the minimum-l&lt;sub&gt;2&lt;/sub&gt;-norm solution) features &amp;quot;infinite double descent&amp;quot; as the size n of the training set grows to infinity along with the number of parameters p (they assume p/n → γ, and show that the risk diverges for γ close to 1). This is shown both for a linear model where X = Σ^{1/2}Z, for some fixed covariance Σ and some well-behaved Z of mean 0 and variance 1, and for a nonlinear model X = σ(WZ), with σ applied componentwise. In both cases, it is assumed that y = x^Tβ + ε, where E[ε] = 0. There are also assumptions of finite fixed variance for ε, and of finite fourth moment for z (needed for random matrix theory).&lt;br /&gt;
&lt;br /&gt;
[https://arxiv.org/pdf/1903.07571 BHX][https://dblp.org/rec/bibtex/journals/corr/abs-1903-07571 19] analyze two other data models. The first is a classical Gaussian linear regression with a huge space of features, but where the regression is only performed within a (random) subset T of the features; in this case double descent is observed, and the errors can be derived from the norms of the true regression vector restricted to the T-coordinates and to the non-T-coordinates. A similar analysis is then provided for a random Fourier features model.&lt;br /&gt;
&lt;br /&gt;
[https://arxiv.org/pdf/1908.05355 MeiMontanari][https://scholar.google.ch/scholar?hl=en&amp;amp;as_sdt=0%2C5&amp;amp;q=The+generalization+error+of+random+features+regression%3A+Precise+asymptotics+and+double+descent+curve&amp;amp;btnG= 19] consider yet another data model, where z is drawn uniformly at random on a d-sphere, and y = f(z)+ε, such that E[ε]=0 and ε has finite fourth moment. The ridgeless linear regression is then performed over random features x&lt;sub&gt;i&lt;/sub&gt;=σ(Wz&lt;sub&gt;i&lt;/sub&gt;), where σ applies componentwise and W has random rows of l&lt;sub&gt;2&lt;/sub&gt;-norm equal to 1. They prove that this yields a double descent.&lt;br /&gt;
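&lt;br /&gt;
A minimal sketch of this data model; the activation σ, the dimensions and the noise law below are illustrative assumptions.&lt;br /&gt;
&lt;pre&gt;
import numpy as np

rng = np.random.default_rng(0)
n, d, p = 200, 20, 400

# z uniform on the d-sphere: normalize Gaussian vectors.
Z = rng.normal(size=(n, d))
Z /= np.linalg.norm(Z, axis=1, keepdims=True)

# W with random rows of unit l2-norm.
W = rng.normal(size=(p, d))
W /= np.linalg.norm(W, axis=1, keepdims=True)

# Noisy labels y = f(z) + eps with E[eps] = 0 (here f linear, eps Gaussian).
beta = rng.normal(size=d)
y = Z @ beta + 0.1 * rng.normal(size=n)

# Random features, then min-norm (ridgeless) regression over them.
X = np.maximum(Z @ W.T, 0.0)          # sigma = ReLU, applied componentwise
theta = np.linalg.pinv(X) @ y         # interpolates since p exceeds n
print(np.max(np.abs(X @ theta - y)))  # ~0: perfect fit of the training set
&lt;/pre&gt;&lt;/div&gt;</summary>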
		<author><name>Lê Nguyên Hoang</name></author>
		
	</entry>
</feed>