<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>https://robustlybeneficial.org/wiki/index.php?action=history&amp;feed=atom&amp;title=Overfitting</id>
	<title>Overfitting - Revision history</title>
	<link rel="self" type="application/atom+xml" href="https://robustlybeneficial.org/wiki/index.php?action=history&amp;feed=atom&amp;title=Overfitting"/>
	<link rel="alternate" type="text/html" href="https://robustlybeneficial.org/wiki/index.php?title=Overfitting&amp;action=history"/>
	<updated>2026-04-28T13:30:06Z</updated>
	<subtitle>Revision history for this page on the wiki</subtitle>
	<generator>MediaWiki 1.34.0</generator>
	<entry>
		<id>https://robustlybeneficial.org/wiki/index.php?title=Overfitting&amp;diff=238&amp;oldid=prev</id>
		<title>Lê Nguyên Hoang: /* Details */</title>
		<link rel="alternate" type="text/html" href="https://robustlybeneficial.org/wiki/index.php?title=Overfitting&amp;diff=238&amp;oldid=prev"/>
		<updated>2020-02-26T17:09:11Z</updated>

		<summary type="html">&lt;p&gt;&lt;span dir=&quot;auto&quot;&gt;&lt;span class=&quot;autocomment&quot;&gt;Details&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;table class=&quot;diff diff-contentalign-left&quot; data-mw=&quot;interface&quot;&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;tr class=&quot;diff-title&quot; lang=&quot;en&quot;&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #222; text-align: center;&quot;&gt;← Older revision&lt;/td&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #222; text-align: center;&quot;&gt;Revision as of 17:09, 26 February 2020&lt;/td&gt;
				&lt;/tr&gt;&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l49&quot; &gt;Line 49:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 49:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[https://arxiv.org/pdf/1908.05355 MeiMontanari][https://scholar.google.ch/scholar?hl=en&amp;amp;as_sdt=0%2C5&amp;amp;q=The+generalization+error+of+random+features+regression%3A+Precise+asymptotics+and+double+descent+curve&amp;amp;btnG= 19] consider yet another data model, where &amp;lt;math&amp;gt;z&amp;lt;/math&amp;gt; is drawn uniformly at random on a &amp;lt;math&amp;gt;d&amp;lt;/math&amp;gt;-sphere, and &amp;lt;math&amp;gt;y = f(z)+\varepsilon&amp;lt;/math&amp;gt;, such that &amp;lt;math&amp;gt;\mathbb E[\varepsilon]=0&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;\varepsilon&amp;lt;/math&amp;gt; has finite fourth moment. The ridgeless linear regression is then performed over random features &amp;lt;math&amp;gt;x_i = \sigma(Wz_i)&amp;lt;/math&amp;gt;, where &amp;lt;math&amp;gt;\sigma&amp;lt;/math&amp;gt; is applied component-wise and &amp;lt;math&amp;gt;W&amp;lt;/math&amp;gt; has random rows of &amp;lt;math&amp;gt;\ell_2&amp;lt;/math&amp;gt;-norm equal to 1. They prove that this yields a double descent.&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[https://arxiv.org/pdf/1908.05355 MeiMontanari][https://scholar.google.ch/scholar?hl=en&amp;amp;as_sdt=0%2C5&amp;amp;q=The+generalization+error+of+random+features+regression%3A+Precise+asymptotics+and+double+descent+curve&amp;amp;btnG= 19] consider yet another data model, where &amp;lt;math&amp;gt;z&amp;lt;/math&amp;gt; is drawn uniformly at random on a &amp;lt;math&amp;gt;d&amp;lt;/math&amp;gt;-sphere, and &amp;lt;math&amp;gt;y = f(z)+\varepsilon&amp;lt;/math&amp;gt;, such that &amp;lt;math&amp;gt;\mathbb E[\varepsilon]=0&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;\varepsilon&amp;lt;/math&amp;gt; has finite fourth moment. The ridgeless linear regression is then performed over random features &amp;lt;math&amp;gt;x_i = \sigma(Wz_i)&amp;lt;/math&amp;gt;, where &amp;lt;math&amp;gt;\sigma&amp;lt;/math&amp;gt; is applied component-wise and &amp;lt;math&amp;gt;W&amp;lt;/math&amp;gt; has random rows of &amp;lt;math&amp;gt;\ell_2&amp;lt;/math&amp;gt;-norm equal to 1. They prove that this yields a double descent.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;−&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[https://arxiv.org/pdf/2002.08404.pdf JSSHG][https://scholar.google.ch/scholar?hl=en&amp;amp;as_sdt=0%2C5&amp;amp;q=Implicit+Regularization+of+Random+Feature+Models&amp;amp;btnG= 20] study a random feature model where the true function and the features are drawn from a Gaussian process. They prove an upper bound on the difference between a regularized Bayesian posterior prediction at a point &amp;lt;math&amp;gt;x&amp;lt;/math&amp;gt; and the expectation of the (slightly differently) regularized random feature prediction at this point, which can be argued to go to zero under reasonable assumptions as the number of parameters goes to infinity. Moreover, as the regularization goes to zero, the random feature linear regression comes closer to the Bayesian posterior prediction. They also prove a bound on the difference between the expected loss of the regularized random feature model and that of the regularized kernel model. They also show that the &amp;quot;effective ridge&amp;quot; &amp;lt;math&amp;gt;\tilde \lambda&amp;lt;/math&amp;gt; (i.e. the ridge of the kernel method equivalent to the ridge &amp;lt;math&amp;gt;\lambda&amp;lt;/math&amp;gt; of the random feature regression) is key to understanding the variance explosion. In particular, they relate it to the explosion of &amp;lt;math&amp;gt;\partial_\lambda \tilde \lambda&amp;lt;/math&amp;gt;.&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[https://arxiv.org/pdf/2002.08404.pdf JSSHG][https://scholar.google.ch/scholar?hl=en&amp;amp;as_sdt=0%2C5&amp;amp;q=Implicit+Regularization+of+Random+Feature+Models&amp;amp;btnG= 20] study a random feature model where the true function and the features are drawn from a Gaussian process. They prove an upper bound on the difference between a regularized Bayesian posterior prediction at a point &amp;lt;math&amp;gt;x&amp;lt;/math&amp;gt; and the expectation of the (slightly differently) regularized random feature prediction at this point, which can be argued to go to zero under reasonable assumptions as the number of parameters goes to infinity. Moreover, as the regularization goes to zero, the random feature linear regression comes closer to the Bayesian posterior prediction. They also prove a bound on the difference between the expected loss of the regularized random feature model and that of the regularized kernel model. They also show that the &amp;quot;effective ridge&amp;quot; &amp;lt;math&amp;gt;\tilde \lambda&amp;lt;/math&amp;gt; (i.e. the ridge of the kernel method equivalent to the ridge &amp;lt;math&amp;gt;\lambda&amp;lt;/math&amp;gt; of the random feature regression) is key to understanding the variance explosion. In particular, they relate it to the explosion of &amp;lt;math&amp;gt;\partial_\lambda \tilde \lambda&amp;lt;/math&amp;gt;&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;. An interesting insight is that the appropriate linear regression regularization should thus be smaller than (and can be exactly computed from) the kernel method ridge&lt;/ins&gt;.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/table&gt;</summary>
		<author><name>Lê Nguyên Hoang</name></author>
		
	</entry>
	<entry>
		<id>https://robustlybeneficial.org/wiki/index.php?title=Overfitting&amp;diff=237&amp;oldid=prev</id>
		<title>Lê Nguyên Hoang: /* Details */</title>
		<link rel="alternate" type="text/html" href="https://robustlybeneficial.org/wiki/index.php?title=Overfitting&amp;diff=237&amp;oldid=prev"/>
		<updated>2020-02-26T17:07:35Z</updated>

		<summary type="html">&lt;p&gt;&lt;span dir=&quot;auto&quot;&gt;&lt;span class=&quot;autocomment&quot;&gt;Details&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;table class=&quot;diff diff-contentalign-left&quot; data-mw=&quot;interface&quot;&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;tr class=&quot;diff-title&quot; lang=&quot;en&quot;&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #222; text-align: center;&quot;&gt;← Older revision&lt;/td&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #222; text-align: center;&quot;&gt;Revision as of 17:07, 26 February 2020&lt;/td&gt;
				&lt;/tr&gt;&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l49&quot; &gt;Line 49:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 49:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[https://arxiv.org/pdf/1908.05355 MeiMontanari][https://scholar.google.ch/scholar?hl=en&amp;amp;as_sdt=0%2C5&amp;amp;q=The+generalization+error+of+random+features+regression%3A+Precise+asymptotics+and+double+descent+curve&amp;amp;btnG= 19] consider yet another data model, where &amp;lt;math&amp;gt;z&amp;lt;/math&amp;gt; is drawn uniformly at random on a &amp;lt;math&amp;gt;d&amp;lt;/math&amp;gt;-sphere, and &amp;lt;math&amp;gt;y = f(z)+\varepsilon&amp;lt;/math&amp;gt;, such that &amp;lt;math&amp;gt;\mathbb E[\varepsilon]=0&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;\varepsilon&amp;lt;/math&amp;gt; has finite fourth moment. The ridgeless linear regression is then performed over random features &amp;lt;math&amp;gt;x_i = \sigma(Wz_i)&amp;lt;/math&amp;gt;, where &amp;lt;math&amp;gt;\sigma&amp;lt;/math&amp;gt; is applied component-wise and &amp;lt;math&amp;gt;W&amp;lt;/math&amp;gt; has random rows of &amp;lt;math&amp;gt;\ell_2&amp;lt;/math&amp;gt;-norm equal to 1. They prove that this yields a double descent.&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[https://arxiv.org/pdf/1908.05355 MeiMontanari][https://scholar.google.ch/scholar?hl=en&amp;amp;as_sdt=0%2C5&amp;amp;q=The+generalization+error+of+random+features+regression%3A+Precise+asymptotics+and+double+descent+curve&amp;amp;btnG= 19] consider yet another data model, where &amp;lt;math&amp;gt;z&amp;lt;/math&amp;gt; is drawn uniformly at random on a &amp;lt;math&amp;gt;d&amp;lt;/math&amp;gt;-sphere, and &amp;lt;math&amp;gt;y = f(z)+\varepsilon&amp;lt;/math&amp;gt;, such that &amp;lt;math&amp;gt;\mathbb E[\varepsilon]=0&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;\varepsilon&amp;lt;/math&amp;gt; has finite fourth moment. The ridgeless linear regression is then performed over random features &amp;lt;math&amp;gt;x_i = \sigma(Wz_i)&amp;lt;/math&amp;gt;, where &amp;lt;math&amp;gt;\sigma&amp;lt;/math&amp;gt; is applied component-wise and &amp;lt;math&amp;gt;W&amp;lt;/math&amp;gt; has random rows of &amp;lt;math&amp;gt;\ell_2&amp;lt;/math&amp;gt;-norm equal to 1. They prove that this yields a double descent.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;−&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[https://arxiv.org/pdf/2002.08404.pdf JSSHG][https://scholar.google.ch/scholar?hl=en&amp;amp;as_sdt=0%2C5&amp;amp;q=Implicit+Regularization+of+Random+Feature+Models&amp;amp;btnG= 20] study a random feature model where the true function and the features are drawn from a Gaussian process. They prove an upper bound on the difference between a regularized Bayesian posterior prediction at a point &amp;lt;math&amp;gt;x&amp;lt;/math&amp;gt; and the expectation of the (slightly differently) regularized random feature prediction at this point, which can be argued to go to zero under reasonable assumptions as the number of parameters goes to infinity. Moreover, as the regularization goes to zero, the random feature linear regression comes closer to the Bayesian posterior prediction. They also prove a bound on the difference between the expected loss of the regularized random feature model and that of the regularized kernel model. They also show that the &amp;quot;effective ridge&amp;quot; &amp;lt;math&amp;gt;\tilde \lambda&amp;lt;/math&amp;gt; (i.e. the ridge of the kernel method equivalent to the ridge &amp;lt;math&amp;gt;\lambda&amp;lt;/math&amp;gt; of the random feature regression) is key to understanding the variance explosion. In particular, they relate it to &amp;lt;math&amp;gt;\partial_\lambda \tilde \lambda&amp;lt;/math&amp;gt;.&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[https://arxiv.org/pdf/2002.08404.pdf JSSHG][https://scholar.google.ch/scholar?hl=en&amp;amp;as_sdt=0%2C5&amp;amp;q=Implicit+Regularization+of+Random+Feature+Models&amp;amp;btnG= 20] study a random feature model where the true function and the features are drawn from a Gaussian process. They prove an upper bound on the difference between a regularized Bayesian posterior prediction at a point &amp;lt;math&amp;gt;x&amp;lt;/math&amp;gt; and the expectation of the (slightly differently) regularized random feature prediction at this point, which can be argued to go to zero under reasonable assumptions as the number of parameters goes to infinity. Moreover, as the regularization goes to zero, the random feature linear regression comes closer to the Bayesian posterior prediction. They also prove a bound on the difference between the expected loss of the regularized random feature model and that of the regularized kernel model. They also show that the &amp;quot;effective ridge&amp;quot; &amp;lt;math&amp;gt;\tilde \lambda&amp;lt;/math&amp;gt; (i.e. the ridge of the kernel method equivalent to the ridge &amp;lt;math&amp;gt;\lambda&amp;lt;/math&amp;gt; of the random feature regression) is key to understanding the variance explosion. In particular, they relate it to &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;the explosion of &lt;/ins&gt;&amp;lt;math&amp;gt;\partial_\lambda \tilde \lambda&amp;lt;/math&amp;gt;.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/table&gt;</summary>
		<author><name>Lê Nguyên Hoang</name></author>
		
	</entry>
	<entry>
		<id>https://robustlybeneficial.org/wiki/index.php?title=Overfitting&amp;diff=236&amp;oldid=prev</id>
		<title>Lê Nguyên Hoang: /* Details */</title>
		<link rel="alternate" type="text/html" href="https://robustlybeneficial.org/wiki/index.php?title=Overfitting&amp;diff=236&amp;oldid=prev"/>
		<updated>2020-02-26T17:03:39Z</updated>

		<summary type="html">&lt;p&gt;&lt;span dir=&quot;auto&quot;&gt;&lt;span class=&quot;autocomment&quot;&gt;Details&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;table class=&quot;diff diff-contentalign-left&quot; data-mw=&quot;interface&quot;&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;tr class=&quot;diff-title&quot; lang=&quot;en&quot;&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #222; text-align: center;&quot;&gt;← Older revision&lt;/td&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #222; text-align: center;&quot;&gt;Revision as of 17:03, 26 February 2020&lt;/td&gt;
				&lt;/tr&gt;&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l49&quot; &gt;Line 49:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 49:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[https://arxiv.org/pdf/1908.05355 MeiMontanari][https://scholar.google.ch/scholar?hl=en&amp;amp;as_sdt=0%2C5&amp;amp;q=The+generalization+error+of+random+features+regression%3A+Precise+asymptotics+and+double+descent+curve&amp;amp;btnG= 19] consider yet another data model, where &amp;lt;math&amp;gt;z&amp;lt;/math&amp;gt; is drawn uniformly at random on a &amp;lt;math&amp;gt;d&amp;lt;/math&amp;gt;-sphere, and &amp;lt;math&amp;gt;y = f(z)+\varepsilon&amp;lt;/math&amp;gt;, such that &amp;lt;math&amp;gt;\mathbb E[\varepsilon]=0&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;\varepsilon&amp;lt;/math&amp;gt; has finite fourth moment. The ridgeless linear regression is then performed over random features &amp;lt;math&amp;gt;x_i = \sigma(Wz_i)&amp;lt;/math&amp;gt;, where &amp;lt;math&amp;gt;\sigma&amp;lt;/math&amp;gt; is applied component-wise and &amp;lt;math&amp;gt;W&amp;lt;/math&amp;gt; has random rows of &amp;lt;math&amp;gt;\ell_2&amp;lt;/math&amp;gt;-norm equal to 1. They prove that this yields a double descent.&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[https://arxiv.org/pdf/1908.05355 MeiMontanari][https://scholar.google.ch/scholar?hl=en&amp;amp;as_sdt=0%2C5&amp;amp;q=The+generalization+error+of+random+features+regression%3A+Precise+asymptotics+and+double+descent+curve&amp;amp;btnG= 19] consider yet another data model, where &amp;lt;math&amp;gt;z&amp;lt;/math&amp;gt; is drawn uniformly at random on a &amp;lt;math&amp;gt;d&amp;lt;/math&amp;gt;-sphere, and &amp;lt;math&amp;gt;y = f(z)+\varepsilon&amp;lt;/math&amp;gt;, such that &amp;lt;math&amp;gt;\mathbb E[\varepsilon]=0&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;\varepsilon&amp;lt;/math&amp;gt; has finite fourth moment. The ridgeless linear regression is then performed over random features &amp;lt;math&amp;gt;x_i = \sigma(Wz_i)&amp;lt;/math&amp;gt;, where &amp;lt;math&amp;gt;\sigma&amp;lt;/math&amp;gt; is applied component-wise and &amp;lt;math&amp;gt;W&amp;lt;/math&amp;gt; has random rows of &amp;lt;math&amp;gt;\ell_2&amp;lt;/math&amp;gt;-norm equal to 1. They prove that this yields a double descent.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;−&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[https://arxiv.org/pdf/2002.08404.pdf JSSHG][https://scholar.google.ch/scholar?hl=en&amp;amp;as_sdt=0%2C5&amp;amp;q=Implicit+Regularization+of+Random+Feature+Models&amp;amp;btnG= 20] study a random feature model where the true function and the features are drawn from a Gaussian process. They prove an upper bound on the difference between a regularized Bayesian posterior prediction at a point &amp;lt;math&amp;gt;x&amp;lt;/math&amp;gt; and the expectation of the (slightly differently) regularized random feature prediction at this point, which can be argued to go to zero under reasonable assumptions as the number of parameters goes to infinity. Moreover, as the regularization goes to zero, the random feature linear regression comes closer to the Bayesian posterior prediction. They also prove a bound on the difference between the expected loss of the regularized random feature model and that of the regularized kernel model.&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[https://arxiv.org/pdf/2002.08404.pdf JSSHG][https://scholar.google.ch/scholar?hl=en&amp;amp;as_sdt=0%2C5&amp;amp;q=Implicit+Regularization+of+Random+Feature+Models&amp;amp;btnG= 20] study a random feature model where the true function and the features are drawn from a Gaussian process. They prove an upper bound on the difference between a regularized Bayesian posterior prediction at a point &amp;lt;math&amp;gt;x&amp;lt;/math&amp;gt; and the expectation of the (slightly differently) regularized random feature prediction at this point, which can be argued to go to zero under reasonable assumptions as the number of parameters goes to infinity. Moreover, as the regularization goes to zero, the random feature linear regression comes closer to the Bayesian posterior prediction. They also prove a bound on the difference between the expected loss of the regularized random feature model and that of the regularized kernel model&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;. They also show that the &amp;quot;effective ridge&amp;quot; &amp;lt;math&amp;gt;\tilde \lambda&amp;lt;/math&amp;gt; (i.e. the ridge of the kernel method equivalent to the ridge &amp;lt;math&amp;gt;\lambda&amp;lt;/math&amp;gt; of the random feature regression) is key to understanding the variance explosion. In particular, they relate it to &amp;lt;math&amp;gt;\partial_\lambda \tilde \lambda&amp;lt;/math&amp;gt;&lt;/ins&gt;.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/table&gt;</summary>
		<author><name>Lê Nguyên Hoang</name></author>
		
	</entry>
	<entry>
		<id>https://robustlybeneficial.org/wiki/index.php?title=Overfitting&amp;diff=235&amp;oldid=prev</id>
		<title>Lê Nguyên Hoang: /* Details */</title>
		<link rel="alternate" type="text/html" href="https://robustlybeneficial.org/wiki/index.php?title=Overfitting&amp;diff=235&amp;oldid=prev"/>
		<updated>2020-02-26T16:58:40Z</updated>

		<summary type="html">&lt;p&gt;&lt;span dir=&quot;auto&quot;&gt;&lt;span class=&quot;autocomment&quot;&gt;Details&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;table class=&quot;diff diff-contentalign-left&quot; data-mw=&quot;interface&quot;&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;tr class=&quot;diff-title&quot; lang=&quot;en&quot;&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #222; text-align: center;&quot;&gt;← Older revision&lt;/td&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #222; text-align: center;&quot;&gt;Revision as of 16:58, 26 February 2020&lt;/td&gt;
				&lt;/tr&gt;&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l48&quot; &gt;Line 48:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 48:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[https://arxiv.org/pdf/1908.05355 MeiMontanari][https://scholar.google.ch/scholar?hl=en&amp;amp;as_sdt=0%2C5&amp;amp;q=The+generalization+error+of+random+features+regression%3A+Precise+asymptotics+and+double+descent+curve&amp;amp;btnG= 19] consider yet another data model, where &amp;lt;math&amp;gt;z&amp;lt;/math&amp;gt; is drawn uniformly at random on a &amp;lt;math&amp;gt;d&amp;lt;/math&amp;gt;-sphere, and &amp;lt;math&amp;gt;y = f(z)+\varepsilon&amp;lt;/math&amp;gt;, such that &amp;lt;math&amp;gt;\mathbb E[\varepsilon]=0&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;\varepsilon&amp;lt;/math&amp;gt; has finite fourth moment. The ridgeless linear regression is then performed over random features &amp;lt;math&amp;gt;x_i = \sigma(Wz_i)&amp;lt;/math&amp;gt;, where &amp;lt;math&amp;gt;\sigma&amp;lt;/math&amp;gt; is applied component-wise and &amp;lt;math&amp;gt;W&amp;lt;/math&amp;gt; has random rows of &amp;lt;math&amp;gt;\ell_2&amp;lt;/math&amp;gt;-norm equal to 1. They prove that this yields a double descent.&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[https://arxiv.org/pdf/1908.05355 MeiMontanari][https://scholar.google.ch/scholar?hl=en&amp;amp;as_sdt=0%2C5&amp;amp;q=The+generalization+error+of+random+features+regression%3A+Precise+asymptotics+and+double+descent+curve&amp;amp;btnG= 19] consider yet another data model, where &amp;lt;math&amp;gt;z&amp;lt;/math&amp;gt; is drawn uniformly at random on a &amp;lt;math&amp;gt;d&amp;lt;/math&amp;gt;-sphere, and &amp;lt;math&amp;gt;y = f(z)+\varepsilon&amp;lt;/math&amp;gt;, such that &amp;lt;math&amp;gt;\mathbb E[\varepsilon]=0&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;\varepsilon&amp;lt;/math&amp;gt; has finite fourth moment. The ridgeless linear regression is then performed over random features &amp;lt;math&amp;gt;x_i = \sigma(Wz_i)&amp;lt;/math&amp;gt;, where &amp;lt;math&amp;gt;\sigma&amp;lt;/math&amp;gt; is applied component-wise and &amp;lt;math&amp;gt;W&amp;lt;/math&amp;gt; has random rows of &amp;lt;math&amp;gt;\ell_2&amp;lt;/math&amp;gt;-norm equal to 1. They prove that this yields a double descent.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot;&gt; &lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;&lt;/ins&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot;&gt; &lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;[https://arxiv.org/pdf/2002.08404.pdf JSSHG][https://scholar.google.ch/scholar?hl=en&amp;amp;as_sdt=0%2C5&amp;amp;q=Implicit+Regularization+of+Random+Feature+Models&amp;amp;btnG= 20] study a random feature model where the true function and the features are drawn from a Gaussian process. They prove an upper bound on the difference between a regularized Bayesian posterior prediction at a point &amp;lt;math&amp;gt;x&amp;lt;/math&amp;gt; and the expectation of the (slightly differently) regularized random feature prediction at this point, which can be argued to go to zero under reasonable assumptions as the number of parameters goes to infinity. Moreover, as the regularization goes to zero, the random feature linear regression comes closer to the Bayesian posterior prediction. They also prove a bound on the difference between the expected loss of the regularized random feature model and that of the regularized kernel model.&lt;/ins&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/table&gt;</summary>
		<author><name>Lê Nguyên Hoang</name></author>
		
	</entry>
	<entry>
		<id>https://robustlybeneficial.org/wiki/index.php?title=Overfitting&amp;diff=234&amp;oldid=prev</id>
		<title>Lê Nguyên Hoang: /* Double descent */</title>
		<link rel="alternate" type="text/html" href="https://robustlybeneficial.org/wiki/index.php?title=Overfitting&amp;diff=234&amp;oldid=prev"/>
		<updated>2020-02-26T16:56:04Z</updated>

		<summary type="html">&lt;p&gt;&lt;span dir=&quot;auto&quot;&gt;&lt;span class=&quot;autocomment&quot;&gt;Double descent&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;table class=&quot;diff diff-contentalign-left&quot; data-mw=&quot;interface&quot;&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;tr class=&quot;diff-title&quot; lang=&quot;en&quot;&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #222; text-align: center;&quot;&gt;← Older revision&lt;/td&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #222; text-align: center;&quot;&gt;Revision as of 16:56, 26 February 2020&lt;/td&gt;
				&lt;/tr&gt;&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l26&quot; &gt;Line 26:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 26:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;Intriguingly, a &amp;lt;em&amp;gt;double descent&amp;lt;/em&amp;gt; phenomenon often occurs [https://www.pnas.org/content/pnas/116/32/15849.full.pdf BHMM][https://scholar.google.ch/scholar?hl=en&amp;amp;as_sdt=0%2C5&amp;amp;q=Reconciling+modern+machine-learning+practice+and+the+classical+bias%E2%80%93variance+trade-off&amp;amp;btnG= 19] [https://arxiv.org/pdf/1912.02292 NKBYBS][https://dblp.org/rec/bibtex/journals/corr/abs-1912-02292 19], where the performance on the test set first behaves as predicted by the bias-variance dilemma, but then improves and outperforms what would be advised by classical statistical learning.&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;Intriguingly, a &amp;lt;em&amp;gt;double descent&amp;lt;/em&amp;gt; phenomenon often occurs [https://www.pnas.org/content/pnas/116/32/15849.full.pdf BHMM][https://scholar.google.ch/scholar?hl=en&amp;amp;as_sdt=0%2C5&amp;amp;q=Reconciling+modern+machine-learning+practice+and+the+classical+bias%E2%80%93variance+trade-off&amp;amp;btnG= 19] [https://arxiv.org/pdf/1912.02292 NKBYBS][https://dblp.org/rec/bibtex/journals/corr/abs-1912-02292 19], where the performance on the test set first behaves as predicted by the bias-variance dilemma, but then improves and outperforms what would be advised by classical statistical learning.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;−&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;&lt;/del&gt;&lt;/div&gt;&lt;/td&gt;&lt;td colspan=&quot;2&quot;&gt; &lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;−&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;[https://arxiv.org/pdf/2002.08404.pdf JSSHG][https://scholar.google.ch/scholar?hl=en&amp;amp;as_sdt=0%2C5&amp;amp;q=Implicit+Regularization+of+Random+Feature+Models&amp;amp;btnG= 20] study a random feature model where the true function and the features are drawn from a Gaussian process. They prove an upper bound on the difference between a regularized Bayesian posterior prediction and the expectation of the (slightly differently) regularized random feature prediction, which can be argued to go to zero under reasonable assumptions as the number of parameters goes to infinity. Moreover, as the regularization goes to zero, the random feature linear regression comes closer to the Bayesian posterior prediction.&lt;/del&gt;&lt;/div&gt;&lt;/td&gt;&lt;td colspan=&quot;2&quot;&gt; &lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;All such results suggest that overfitting eventually disappears, which contradicts conventional wisdom.&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;All such results suggest that overfitting eventually disappears, which contradicts conventional wisdom.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/table&gt;</summary>
		<author><name>Lê Nguyên Hoang</name></author>
		
	</entry>
	<entry>
		<id>https://robustlybeneficial.org/wiki/index.php?title=Overfitting&amp;diff=233&amp;oldid=prev</id>
		<title>Lê Nguyên Hoang: /* Double descent */</title>
		<link rel="alternate" type="text/html" href="https://robustlybeneficial.org/wiki/index.php?title=Overfitting&amp;diff=233&amp;oldid=prev"/>
		<updated>2020-02-26T16:55:02Z</updated>

		<summary type="html">&lt;p&gt;&lt;span dir=&quot;auto&quot;&gt;&lt;span class=&quot;autocomment&quot;&gt;Double descent&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;table class=&quot;diff diff-contentalign-left&quot; data-mw=&quot;interface&quot;&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;tr class=&quot;diff-title&quot; lang=&quot;en&quot;&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #222; text-align: center;&quot;&gt;← Older revision&lt;/td&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #222; text-align: center;&quot;&gt;Revision as of 16:55, 26 February 2020&lt;/td&gt;
				&lt;/tr&gt;&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l23&quot; &gt;Line 23:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 23:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;== Double descent ==&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;== Double descent ==&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;−&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;However, the conventional wisdom is in sharp contradiction with today's success of deep neural networks [https://openreview.net/pdf?id=Sy8gdB9xx ZBHRV][https://dblp.org/rec/bibtex/conf/iclr/ZhangBHRV17 17], [https://www.pnas.org/content/pnas/116/32/15849.full.pdf BHMM][https://scholar.google.ch/scholar?hl=en&amp;amp;as_sdt=0%2C5&amp;amp;q=Reconciling+modern+machine-learning+practice+and+the+classical+bias%E2%80%93variance+trade-off&amp;amp;btnG= 19] [https://arxiv.org/pdf/1912.02292 NKBYBS][https://dblp.org/rec/bibtex/journals/corr/abs-1912-02292 19], but also kernel methods [http://proceedings.mlr.press/v80/belkin18a/belkin18a.pdf BMM][https://dblp.org/rec/bibtex/conf/icml/BelkinMM18 18] [http://proceedings.mlr.press/v89/belkin19a/belkin19a.pdf BRT][https://dblp.org/rec/bibtex/conf/aistats/BelkinRT19 19], ridgeless (random feature) linear regression [https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&amp;amp;arnumber=8849614 MVS][https://dblp.org/rec/bibtex/conf/isit/MuthukumarVS19 19] [https://arxiv.org/pdf/1906.11300 BLLT][https://dblp.org/rec/bibtex/journals/corr/abs-1906-11300 19] [https://arxiv.org/pdf/1908.05355 MeiMontanari][https://scholar.google.ch/scholar?hl=en&amp;amp;as_sdt=0%2C5&amp;amp;q=The+generalization+error+of+random+features+regression%3A+Precise+asymptotics+and+double+descent+curve&amp;amp;btnG= 19] [https://arxiv.org/pdf/1903.08560 HMRT][https://dblp.org/rec/bibtex/journals/corr/abs-1903-08560 19] [https://arxiv.org/pdf/1903.07571 BHX][https://dblp.org/rec/bibtex/journals/corr/abs-1903-07571 19] and even ensembles [https://www.pnas.org/content/pnas/116/32/15849.full.pdf BHMM][https://scholar.google.ch/scholar?hl=en&amp;amp;as_sdt=0%2C5&amp;amp;q=Reconciling+modern+machine-learning+practice+and+the+classical+bias%E2%80%93variance+trade-off&amp;amp;btnG= 19]. 
Learning algorithms seem to often achieve their best out-of-sample performance when they are massively overparameterized and perfectly fit the training data (called &amp;lt;em&amp;gt;interpolation&amp;lt;/em&amp;gt;).&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;However, the conventional wisdom is in sharp contradiction with today's success of deep neural networks [https://openreview.net/pdf?id=Sy8gdB9xx ZBHRV][https://dblp.org/rec/bibtex/conf/iclr/ZhangBHRV17 17], [https://www.pnas.org/content/pnas/116/32/15849.full.pdf BHMM][https://scholar.google.ch/scholar?hl=en&amp;amp;as_sdt=0%2C5&amp;amp;q=Reconciling+modern+machine-learning+practice+and+the+classical+bias%E2%80%93variance+trade-off&amp;amp;btnG= 19] [https://arxiv.org/pdf/1912.02292 NKBYBS][https://dblp.org/rec/bibtex/journals/corr/abs-1912-02292 19], but also kernel methods [http://proceedings.mlr.press/v80/belkin18a/belkin18a.pdf BMM][https://dblp.org/rec/bibtex/conf/icml/BelkinMM18 18] [http://proceedings.mlr.press/v89/belkin19a/belkin19a.pdf BRT][https://dblp.org/rec/bibtex/conf/aistats/BelkinRT19 19], ridgeless (random feature) linear regression [https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&amp;amp;arnumber=8849614 MVS][https://dblp.org/rec/bibtex/conf/isit/MuthukumarVS19 19] [https://arxiv.org/pdf/1906.11300 BLLT][https://dblp.org/rec/bibtex/journals/corr/abs-1906-11300 19] [https://arxiv.org/pdf/1908.05355 MeiMontanari][https://scholar.google.ch/scholar?hl=en&amp;amp;as_sdt=0%2C5&amp;amp;q=The+generalization+error+of+random+features+regression%3A+Precise+asymptotics+and+double+descent+curve&amp;amp;btnG= 19] [https://arxiv.org/pdf/1903.08560 HMRT][https://dblp.org/rec/bibtex/journals/corr/abs-1903-08560 19] [https://arxiv.org/pdf/1903.07571 BHX][https://dblp.org/rec/bibtex/journals/corr/abs-1903-07571 19&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;] [https://arxiv.org/pdf/2002.08404.pdf JSSHG][https://scholar.google.ch/scholar?hl=en&amp;amp;as_sdt=0%2C5&amp;amp;q=Implicit+Regularization+of+Random+Feature+Models&amp;amp;btnG= 20&lt;/ins&gt;] and even ensembles [https://www.pnas.org/content/pnas/116/32/15849.full.pdf BHMM][https://scholar.google.ch/scholar?hl=en&amp;amp;as_sdt=0%2C5&amp;amp;q=Reconciling+modern+machine-learning+practice+and+the+classical+bias%E2%80%93variance+trade-off&amp;amp;btnG= 19]. Learning algorithms seem to often achieve their best out-of-sample performance when they are massively overparameterized and perfectly fit the training data (called &amp;lt;em&amp;gt;interpolation&amp;lt;/em&amp;gt;).&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;Intriguingly, a &amp;lt;em&amp;gt;double descent&amp;lt;/em&amp;gt; phenomenon often occurs [https://www.pnas.org/content/pnas/116/32/15849.full.pdf BHMM][https://scholar.google.ch/scholar?hl=en&amp;amp;as_sdt=0%2C5&amp;amp;q=Reconciling+modern+machine-learning+practice+and+the+classical+bias%E2%80%93variance+trade-off&amp;amp;btnG= 19] [https://arxiv.org/pdf/1912.02292 NKBYBS][https://dblp.org/rec/bibtex/journals/corr/abs-1912-02292 19], where the performance on the test set first behaves as predicted by the bias-variance dilemma, but then improves and outperforms what would be advised by classical statistical learning.&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;Intriguingly, a &amp;lt;em&amp;gt;double descent&amp;lt;/em&amp;gt; phenomenon often occurs [https://www.pnas.org/content/pnas/116/32/15849.full.pdf BHMM][https://scholar.google.ch/scholar?hl=en&amp;amp;as_sdt=0%2C5&amp;amp;q=Reconciling+modern+machine-learning+practice+and+the+classical+bias%E2%80%93variance+trade-off&amp;amp;btnG= 19] [https://arxiv.org/pdf/1912.02292 NKBYBS][https://dblp.org/rec/bibtex/journals/corr/abs-1912-02292 19], where the performance on the test set first behaves as predicted by the bias-variance dilemma, but then improves and outperforms what would be advised by classical statistical learning.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot;&gt; &lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;&lt;/ins&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot;&gt; &lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;[https://arxiv.org/pdf/2002.08404.pdf JSSHG][https://scholar.google.ch/scholar?hl=en&amp;amp;as_sdt=0%2C5&amp;amp;q=Implicit+Regularization+of+Random+Feature+Models&amp;amp;btnG= 20] study a random feature model where the true function and the features are drawn from a Gaussian process. They prove an upper bound on the difference between a regularized Bayesian posterior prediction and the expectation of the (slightly differently) regularized random feature prediction, which can be argued to go to zero under reasonable assumptions as the number of parameters goes to infinity. Moreover, as the regularization goes to zero, the random feature linear regression comes closer to the Bayesian posterior prediction.&lt;/ins&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;All such results suggest that overfitting eventually disappears, which contradicts conventional wisdom.&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;All such results suggest that overfitting eventually disappears, which contradicts conventional wisdom.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/table&gt;</summary>
		<author><name>Lê Nguyên Hoang</name></author>
		
	</entry>
	<entry>
		<id>https://robustlybeneficial.org/wiki/index.php?title=Overfitting&amp;diff=227&amp;oldid=prev</id>
		<title>Lê Nguyên Hoang at 09:54, 24 February 2020</title>
		<link rel="alternate" type="text/html" href="https://robustlybeneficial.org/wiki/index.php?title=Overfitting&amp;diff=227&amp;oldid=prev"/>
		<updated>2020-02-24T09:54:39Z</updated>

		<summary type="html">&lt;p&gt;&lt;/p&gt;
&lt;table class=&quot;diff diff-contentalign-left&quot; data-mw=&quot;interface&quot;&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;tr class=&quot;diff-title&quot; lang=&quot;en&quot;&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #222; text-align: center;&quot;&gt;← Older revision&lt;/td&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #222; text-align: center;&quot;&gt;Revision as of 09:54, 24 February 2020&lt;/td&gt;
				&lt;/tr&gt;&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l16&quot; &gt;Line 16:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 16:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;This has been conventional wisdom for a while.&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;This has been conventional wisdom for a while.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot;&gt; &lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;&lt;/ins&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot;&gt; &lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;== Test set overfitting ==&lt;/ins&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot;&gt; &lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;&lt;/ins&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot;&gt; &lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;There have been concerns about overfitting of test sets, as these are used more and more to measure the performance of machine learning algorithms. But [https://arxiv.org/pdf/1902.10811.pdf RRSS][https://dblp.org/rec/bibtex/conf/icml/RechtRSS19 19] analyze statistical patterns in reported test set performances, and argue that there is still actual progress.&lt;/ins&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;== Double descent ==&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;== Double descent ==&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/table&gt;</summary>
		<author><name>Lê Nguyên Hoang</name></author>
		
	</entry>
	<entry>
		<id>https://robustlybeneficial.org/wiki/index.php?title=Overfitting&amp;diff=71&amp;oldid=prev</id>
		<title>Lê Nguyên Hoang at 12:43, 22 January 2020</title>
		<link rel="alternate" type="text/html" href="https://robustlybeneficial.org/wiki/index.php?title=Overfitting&amp;diff=71&amp;oldid=prev"/>
		<updated>2020-01-22T12:43:37Z</updated>

		<summary type="html">&lt;p&gt;&lt;/p&gt;
&lt;table class=&quot;diff diff-contentalign-left&quot; data-mw=&quot;interface&quot;&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;tr class=&quot;diff-title&quot; lang=&quot;en&quot;&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #222; text-align: center;&quot;&gt;← Older revision&lt;/td&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #222; text-align: center;&quot;&gt;Revision as of 12:43, 22 January 2020&lt;/td&gt;
				&lt;/tr&gt;&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l5&quot; &gt;Line 5:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 5:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[http://web.mit.edu/6.435/www/Geman92.pdf GBD][https://scholar.google.ch/scholar?hl=en&amp;amp;as_sdt=0%2C5&amp;amp;q=neural+networks+and+the+bias%2Fvariance+dilemma+geman+bienenstock+doursat&amp;amp;btnG= 92] identified the bias-variance tradeoff, which quantifies out-of-sample errors as a sum of inductive bias and model variance, &amp;lt;em&amp;gt;for random samples&amp;lt;/em&amp;gt; drawn from the true distribution.&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[http://web.mit.edu/6.435/www/Geman92.pdf GBD][https://scholar.google.ch/scholar?hl=en&amp;amp;as_sdt=0%2C5&amp;amp;q=neural+networks+and+the+bias%2Fvariance+dilemma+geman+bienenstock+doursat&amp;amp;btnG= 92] identified the bias-variance tradeoff, which quantifies out-of-sample errors as a sum of inductive bias and model variance, &amp;lt;em&amp;gt;for random samples&amp;lt;/em&amp;gt; drawn from the true distribution.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;−&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;Formally, let f(x,S) the prediction of the algorithm trained with sample S for feature x. Then &lt;del class=&quot;diffchange diffchange-inline&quot;&gt;E&lt;/del&gt;[(f(x,S)-y)&lt;del class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;sup&amp;gt;&lt;/del&gt;2&lt;del class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;/sup&amp;gt;&lt;/del&gt;] = bias&lt;del class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;sup&amp;gt;&lt;/del&gt;2&lt;del class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;/sup&amp;gt; &lt;/del&gt;+ variance + sigma&lt;del class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;sup&amp;gt;&lt;/del&gt;2&amp;lt;/&lt;del class=&quot;diffchange diffchange-inline&quot;&gt;sup&lt;/del&gt;&amp;gt;, where the expectation is over x, S and y, bias = &lt;del class=&quot;diffchange diffchange-inline&quot;&gt;E&lt;/del&gt;[f(x,S)-f&lt;del class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;sup&amp;gt;&lt;/del&gt;*&lt;del class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;/sup&amp;gt;&lt;/del&gt;(x)] is the bias with respect to the optimal prediction &lt;del class=&quot;diffchange diffchange-inline&quot;&gt;f&lt;/del&gt;&amp;lt;&lt;del class=&quot;diffchange diffchange-inline&quot;&gt;sup&lt;/del&gt;&amp;gt;*&amp;lt;/&lt;del class=&quot;diffchange diffchange-inline&quot;&gt;sup&lt;/del&gt;&amp;gt;, variance = &lt;del class=&quot;diffchange diffchange-inline&quot;&gt;E&lt;/del&gt;[f(x,S)&lt;del class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;sup&amp;gt;&lt;/del&gt;2&lt;del class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;/sup&amp;gt;&lt;/del&gt;] - &lt;del class=&quot;diffchange diffchange-inline&quot;&gt;E&lt;/del&gt;[f(x,S)]&lt;del class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;sup&amp;gt;&lt;/del&gt;2&amp;lt;/&lt;del class=&quot;diffchange diffchange-inline&quot;&gt;sup&lt;/del&gt;&amp;gt; is how much the prediction varies from one sample to the other and &lt;del class=&quot;diffchange diffchange-inline&quot;&gt;sigma&lt;/del&gt;&amp;lt;&lt;del class=&quot;diffchange diffchange-inline&quot;&gt;sup&lt;/del&gt;&amp;gt;2&lt;del class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;/sup&amp;gt; &lt;/del&gt;= &lt;del class=&quot;diffchange diffchange-inline&quot;&gt;E&lt;/del&gt;[(f&lt;del class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;sup&amp;gt;&lt;/del&gt;*&lt;del class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;/sup&amp;gt;&lt;/del&gt;(x)-y)^2] is the unpredictable components of y given x.&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;Formally, let &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;math&amp;gt;&lt;/ins&gt;f(x,S)&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;/math&amp;gt; &lt;/ins&gt;the &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;math&amp;gt;y&amp;lt;/math&amp;gt;-&lt;/ins&gt;prediction of the algorithm trained with sample &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;math&amp;gt;&lt;/ins&gt;S&lt;ins 
class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;/math&amp;gt; &lt;/ins&gt;for feature &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;math&amp;gt;&lt;/ins&gt;x&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;/math&amp;gt;&lt;/ins&gt;. Then &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;math&amp;gt;\mathbb E_{x,y,S}&lt;/ins&gt;[(f(x,S)-y)&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;^&lt;/ins&gt;2] = bias&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;^&lt;/ins&gt;2 + variance + &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;\&lt;/ins&gt;sigma&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;^&lt;/ins&gt;2&amp;lt;/&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;math&lt;/ins&gt;&amp;gt;, where the expectation is over &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;math&amp;gt;&lt;/ins&gt;x&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;/math&amp;gt;&lt;/ins&gt;, &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;math&amp;gt;&lt;/ins&gt;S&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;/math&amp;gt; &lt;/ins&gt;and &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;math&amp;gt;&lt;/ins&gt;y&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;/math&amp;gt;&lt;/ins&gt;, &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;where &amp;lt;math&amp;gt;&lt;/ins&gt;bias = &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;\mathbb E_{x,S}&lt;/ins&gt;[f(x,S)-f&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;^&lt;/ins&gt;*(x)]&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;/math&amp;gt; &lt;/ins&gt;is the bias with respect to the optimal prediction &amp;lt;&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;math&lt;/ins&gt;&amp;gt;&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;f^&lt;/ins&gt;*&amp;lt;/&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;math&lt;/ins&gt;&amp;gt;, &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;where &amp;lt;math&amp;gt;&lt;/ins&gt;variance = &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;\mathbb E_{x,S}&lt;/ins&gt;[f(x,S)&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;^&lt;/ins&gt;2] - &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;\mathbb E_{x,S}&lt;/ins&gt;[f(x,S)]&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;^&lt;/ins&gt;2&amp;lt;/&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;math&lt;/ins&gt;&amp;gt; is how much the prediction varies from one sample to the other and &amp;lt;&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;math&lt;/ins&gt;&amp;gt;&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;\sigma^&lt;/ins&gt;2 = &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;\mathbb E_{x,y}&lt;/ins&gt;[(f&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;^&lt;/ins&gt;*(x)-y)^2]&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;/math&amp;gt; &lt;/ins&gt;is the unpredictable components of &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;math&amp;gt;&lt;/ins&gt;y&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;/math&amp;gt; &lt;/ins&gt;given &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;math&amp;gt;&lt;/ins&gt;x&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;/math&amp;gt;&lt;/ins&gt;.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;−&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;It has been argued that overfitting is caused by increased variance when we consider a learning algorithm that is too sensitive to the randomness of sampling S. This sometimes occurs when the number of parameters of the learning algorithm is too large (but not necessarily!).&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;It has been argued that overfitting is caused by increased variance when we consider a learning algorithm that is too sensitive to the randomness of sampling &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;math&amp;gt;&lt;/ins&gt;S&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;/math&amp;gt;&lt;/ins&gt;. This sometimes occurs when the number of parameters of the learning algorithm is too large (but not necessarily!).&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;== PAC-learning ==&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;== PAC-learning ==&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l13&quot; &gt;Line 13:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 13:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;To prove theoretical guarantees of non-overfitting, [http://web.mit.edu/6.435/www/Valiant84.pdf Valiant][https://dblp.org/rec/bibtex/journals/cacm/Valiant84 84] introduced the concept of &amp;lt;em&amp;gt;probably approximately correct&amp;lt;/em&amp;gt; (PAC) learning. More explanations here: [https://www.youtube.com/watch?v=uB2X2OuD4Rg&amp;amp;list=PLie7a1OUTSagZB9mFZnVBgsNfBtcUGJWB&amp;amp;index=8 Wandida16a].&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;To prove theoretical guarantees of non-overfitting, [http://web.mit.edu/6.435/www/Valiant84.pdf Valiant][https://dblp.org/rec/bibtex/journals/cacm/Valiant84 84] introduced the concept of &amp;lt;em&amp;gt;probably approximately correct&amp;lt;/em&amp;gt; (PAC) learning. More explanations here: [https://www.youtube.com/watch?v=uB2X2OuD4Rg&amp;amp;list=PLie7a1OUTSagZB9mFZnVBgsNfBtcUGJWB&amp;amp;index=8 Wandida16a].&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;−&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;In particular, the fundamental theorem of statistical learning [https://www.amazon.com/Understanding-Machine-Learning-Theory-Algorithms/dp/1107057132/ ShalevshwartzBendavidBook][https://scholar.google.ch/scholar?hl=en&amp;amp;as_sdt=0%2C5&amp;amp;q=Understanding+Machine+Learning+Theory+Algorithms+shalev-shwartz+ben-david&amp;amp;btnG= 14] [https://www.youtube.com/watch?v=RkWuLtFPBKU&amp;amp;list=PLie7a1OUTSagZB9mFZnVBgsNfBtcUGJWB&amp;amp;index=14 Wandida16b] provides guarantees of PAC learning, when the number of data points sufficiently exceed the VC dimension of the set of learnable algorithms. Since this VC dimension is often essentially the number of parameters (assuming finite representation of the parameters as float or double), then this means that PAC learning is guaranteed when #data &lt;del class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;&amp;lt; &lt;/del&gt;#parameters.&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;In particular, the fundamental theorem of statistical learning [https://www.amazon.com/Understanding-Machine-Learning-Theory-Algorithms/dp/1107057132/ ShalevshwartzBendavidBook][https://scholar.google.ch/scholar?hl=en&amp;amp;as_sdt=0%2C5&amp;amp;q=Understanding+Machine+Learning+Theory+Algorithms+shalev-shwartz+ben-david&amp;amp;btnG= 14] [https://www.youtube.com/watch?v=RkWuLtFPBKU&amp;amp;list=PLie7a1OUTSagZB9mFZnVBgsNfBtcUGJWB&amp;amp;index=14 Wandida16b] provides guarantees of PAC learning, when the number of data points sufficiently exceed the VC dimension of the set of learnable algorithms. Since this VC dimension is often essentially the number of parameters (assuming finite representation of the parameters as float or double), then this means that PAC learning is guaranteed when &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;math&amp;gt;\&lt;/ins&gt;#data &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;\ll \&lt;/ins&gt;#parameters&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;/math&amp;gt;&lt;/ins&gt;.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;This has become a conventional wisdom for a while.&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;This has become a conventional wisdom for a while.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l27&quot; &gt;Line 27:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 27:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;== Details ==&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;== Details ==&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;−&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[https://openreview.net/pdf?id=Sy8gdB9xx ZBHRV][https://dblp.org/rec/bibtex/conf/iclr/ZhangBHRV17 17] showed that large interpolating neural networks generalize well, even for large noise in the data. Also, they showed that inductive bias likely plays a limited role, as neural networks still manage to learn quite efficiently data whose labels are completely shuffled. They also proved that a neural network with 2n+d parameters can interpolate n data points of dimension d.&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[https://openreview.net/pdf?id=Sy8gdB9xx ZBHRV][https://dblp.org/rec/bibtex/conf/iclr/ZhangBHRV17 17] showed that large interpolating neural networks generalize well, even for large noise in the data. Also, they showed that inductive bias likely plays a limited role, as neural networks still manage to learn quite efficiently data whose labels are completely shuffled. They also proved that a neural network with &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;math&amp;gt;&lt;/ins&gt;2n+d&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;/math&amp;gt; &lt;/ins&gt;parameters can interpolate &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;math&amp;gt;&lt;/ins&gt;n&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;/math&amp;gt; &lt;/ins&gt;data points of dimension &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;math&amp;gt;&lt;/ins&gt;d&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;/math&amp;gt;&lt;/ins&gt;.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[https://www.pnas.org/content/pnas/116/32/15849.full.pdf BHMM][https://scholar.google.ch/scholar?hl=en&amp;amp;as_sdt=0%2C5&amp;amp;q=Reconciling+modern+machine-learning+practice+and+the+classical+bias%E2%80%93variance+trade-off&amp;amp;btnG= 19] observed double descent for random Fourier features (which [http://papers.nips.cc/paper/3182-random-features-for-large-scale-kernel-machines.pdf RahimiRecht][https://dblp.org/rec/bibtex/conf/nips/RahimiR07 07] proved to be intimately connected to kernel methods), neural networks, decision tree and ensemble methods.&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[https://www.pnas.org/content/pnas/116/32/15849.full.pdf BHMM][https://scholar.google.ch/scholar?hl=en&amp;amp;as_sdt=0%2C5&amp;amp;q=Reconciling+modern+machine-learning+practice+and+the+classical+bias%E2%80%93variance+trade-off&amp;amp;btnG= 19] observed double descent for random Fourier features (which [http://papers.nips.cc/paper/3182-random-features-for-large-scale-kernel-machines.pdf RahimiRecht][https://dblp.org/rec/bibtex/conf/nips/RahimiR07 07] proved to be intimately connected to kernel methods), neural networks, decision tree and ensemble methods.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l33&quot; &gt;Line 33:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 33:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[https://arxiv.org/pdf/1912.02292 NKBYBS][https://dblp.org/rec/bibtex/journals/corr/abs-1912-02292 19] show that a very wide variety of deep neural networks exhibit a wide variety of double descent phenomenons. Not only is there double descent with respect to the number of parameters, but there also seems to be double descent with respect to the width of the neural networks, and weirdly also with respect to epochs of learning steps. They conjecture that &amp;quot;effective model complexity&amp;quot; (the number of data points for which the model is able to achieve small training loss) is a critical point where overfitting occurs. Before and beyond this, overfitting appears to vanish.&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[https://arxiv.org/pdf/1912.02292 NKBYBS][https://dblp.org/rec/bibtex/journals/corr/abs-1912-02292 19] show that a very wide variety of deep neural networks exhibit a wide variety of double descent phenomenons. Not only is there double descent with respect to the number of parameters, but there also seems to be double descent with respect to the width of the neural networks, and weirdly also with respect to epochs of learning steps. They conjecture that &amp;quot;effective model complexity&amp;quot; (the number of data points for which the model is able to achieve small training loss) is a critical point where overfitting occurs. Before and beyond this, overfitting appears to vanish.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;−&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[http://proceedings.mlr.press/v80/belkin18a/belkin18a.pdf BMM][https://dblp.org/rec/bibtex/conf/icml/BelkinMM18 18] present experiments that show that interpolating kernel methods also generalize well and are able to fit random labels (though in this paper, they do not exhibit double descent). They also show that, because norms of interpolaters grow superpolynomially in Hilbert space (in exp(theta(n^&lt;del class=&quot;diffchange diffchange-inline&quot;&gt;(&lt;/del&gt;1/d))&lt;del class=&quot;diffchange diffchange-inline&quot;&gt;)&lt;/del&gt;), usual bounds controlling overfitting are actually trivial for large datasets. This indicates the need for radically different approach to understand overfitting.&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[http://proceedings.mlr.press/v80/belkin18a/belkin18a.pdf BMM][https://dblp.org/rec/bibtex/conf/icml/BelkinMM18 18] present experiments that show that interpolating kernel methods also generalize well and are able to fit random labels (though in this paper, they do not exhibit double descent). They also show that, because norms of interpolaters grow superpolynomially in Hilbert space (in &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;math&amp;gt;&lt;/ins&gt;exp(&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;\&lt;/ins&gt;theta(n^&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;{&lt;/ins&gt;1/d&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;}&lt;/ins&gt;))&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;/math&amp;gt;&lt;/ins&gt;), usual bounds controlling overfitting are actually trivial for large datasets. This indicates the need for radically different approach to understand overfitting.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;−&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[http://proceedings.mlr.press/v89/belkin19a/belkin19a.pdf BRT][https://dblp.org/rec/bibtex/conf/aistats/BelkinRT19 19] show examples of singular kernel interpolators (K(x,y)&lt;del class=&quot;diffchange diffchange-inline&quot;&gt;~&lt;/del&gt;||x-y||^&lt;del class=&quot;diffchange diffchange-inline&quot;&gt;(&lt;/del&gt;-a&lt;del class=&quot;diffchange diffchange-inline&quot;&gt;)&lt;/del&gt;) that achieve optimal rates, even for improper learning (meaning that the true function to learn does not belong to the set of hypotheses).&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[http://proceedings.mlr.press/v89/belkin19a/belkin19a.pdf BRT][https://dblp.org/rec/bibtex/conf/aistats/BelkinRT19 19] show examples of singular kernel interpolators (&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;math&amp;gt;&lt;/ins&gt;K(x,y)&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;\sim&lt;/ins&gt;||x-y||^&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;{&lt;/ins&gt;-a&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;}&amp;lt;/math&amp;gt;&lt;/ins&gt;) that achieve optimal rates, even for improper learning (meaning that the true function to learn does not belong to the set of hypotheses).&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;Note that the connection between kernel methods and neural networks has been made, for instance by [http://papers.nips.cc/paper/3182-random-features-for-large-scale-kernel-machines.pdf RahimiRecht][https://dblp.org/rec/bibtex/conf/nips/RahimiR07 07] and [http://papers.nips.cc/paper/8076-neural-tangent-kernel-convergence-and-generalization-in-neural-networks.pdf JHG][https://dblp.org/rec/bibtex/conf/nips/JacotHG18 18]. Essentially, random features implement approximate kernel methods. And the first layers of neural networks with random (or even trained) weights can be regarded as computations of random features.&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;Note that the connection between kernel methods and neural networks has been made, for instance by [http://papers.nips.cc/paper/3182-random-features-for-large-scale-kernel-machines.pdf RahimiRecht][https://dblp.org/rec/bibtex/conf/nips/RahimiR07 07] and [http://papers.nips.cc/paper/8076-neural-tangent-kernel-convergence-and-generalization-in-neural-networks.pdf JHG][https://dblp.org/rec/bibtex/conf/nips/JacotHG18 18]. Essentially, random features implement approximate kernel methods. And the first layers of neural networks with random (or even trained) weights can be regarded as computations of random features.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;−&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[https://arxiv.org/pdf/1903.08560 HMRT][https://dblp.org/rec/bibtex/journals/corr/abs-1903-08560 19] use random matrix theory (to estimate eigenvalues of X^TX when X are random samples) to show that ridgless regression (regression with minimum &lt;del class=&quot;diffchange diffchange-inline&quot;&gt;l2&lt;/del&gt;-norm) features &amp;quot;infinite double descent&amp;quot; as the size n of the training data sets grows to infinity, along with the number of parameters p (they assume p/n &lt;del class=&quot;diffchange diffchange-inline&quot;&gt;= &lt;/del&gt;gamma, and show infinite overfitting for gamma&lt;del class=&quot;diffchange diffchange-inline&quot;&gt;~&lt;/del&gt;1). This is shown for both a linear model where X = &lt;del class=&quot;diffchange diffchange-inline&quot;&gt;∑&lt;/del&gt;^{1/2} Z, for some fixed &lt;del class=&quot;diffchange diffchange-inline&quot;&gt;∑ &lt;/del&gt;and some well-behaved Z of mean 0 and variance 1, and a &lt;del class=&quot;diffchange diffchange-inline&quot;&gt;componentwise &lt;/del&gt;linearity X = &lt;del class=&quot;diffchange diffchange-inline&quot;&gt;σ&lt;/del&gt;(WZ). In both cases, it is assumed that y=x^&lt;del class=&quot;diffchange diffchange-inline&quot;&gt;Tβ&lt;/del&gt;+&lt;del class=&quot;diffchange diffchange-inline&quot;&gt;ε&lt;/del&gt;, where E[&lt;del class=&quot;diffchange diffchange-inline&quot;&gt;ε&lt;/del&gt;]=0. There are also assumptions of finite fixed variance for &lt;del class=&quot;diffchange diffchange-inline&quot;&gt;ε&lt;/del&gt;, and finite fourth moment for &lt;del class=&quot;diffchange diffchange-inline&quot;&gt;z &lt;/del&gt;(needed for random matrix theory).&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[https://arxiv.org/pdf/1903.08560 HMRT][https://dblp.org/rec/bibtex/journals/corr/abs-1903-08560 19] use random matrix theory (to estimate eigenvalues of &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;math&amp;gt;&lt;/ins&gt;X^TX&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;/math&amp;gt; &lt;/ins&gt;when &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;math&amp;gt;&lt;/ins&gt;X&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;/math&amp;gt; &lt;/ins&gt;are random samples) to show that ridgless regression (regression with minimum &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;math&amp;gt;\ell_2&amp;lt;/math&amp;gt;&lt;/ins&gt;-norm) features &amp;quot;infinite double descent&amp;quot; as the size &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;math&amp;gt;&lt;/ins&gt;n&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;/math&amp;gt; &lt;/ins&gt;of the training data sets grows to infinity, along with the number of parameters &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;math&amp;gt;&lt;/ins&gt;p&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;/math&amp;gt; &lt;/ins&gt;(they assume &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;math&amp;gt;&lt;/ins&gt;p/n 
&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;\rightarrow &lt;/ins&gt;gamma&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;/math&amp;gt;&lt;/ins&gt;, and show infinite overfitting for &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;math&amp;gt;&lt;/ins&gt;gamma&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;\sim &lt;/ins&gt;1&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;/math&amp;gt;&lt;/ins&gt;). This is shown for both a linear model where &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;math&amp;gt;&lt;/ins&gt;X = &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;\Sigma&lt;/ins&gt;^{1/2} Z&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;/math&amp;gt;&lt;/ins&gt;, for some fixed &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;math&amp;gt;\Sigma&amp;lt;/math&amp;gt; &lt;/ins&gt;and some well-behaved &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;math&amp;gt;&lt;/ins&gt;Z&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;/math&amp;gt; &lt;/ins&gt;of mean 0 and variance 1, and a &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;component-wise &lt;/ins&gt;linearity &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;math&amp;gt;&lt;/ins&gt;X = &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;\sigma&lt;/ins&gt;(WZ)&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;/math&amp;gt;&lt;/ins&gt;. In both cases, it is assumed that &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;math&amp;gt;&lt;/ins&gt;y=x^&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;T\beta&lt;/ins&gt;+&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;\varepsilon&amp;lt;/math&amp;gt;&lt;/ins&gt;, where &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;math&amp;gt;\mathbb &lt;/ins&gt;E[&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;\varepsilon&lt;/ins&gt;]=0&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;/math&amp;gt;&lt;/ins&gt;. There are also assumptions of finite fixed variance for &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;math&amp;gt;\varepsilon&amp;lt;/math&amp;gt;&lt;/ins&gt;, and finite fourth moment for &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;math&amp;gt;Z&amp;lt;/math&amp;gt; &lt;/ins&gt;(needed for random matrix theory).&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;−&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[https://arxiv.org/pdf/1903.07571 BHX][https://dblp.org/rec/bibtex/journals/corr/abs-1903-07571 19] analyze two other data models. The former is a classical Gaussian linear regression with a huge space of features. But the regression is only made within a (random) subset T of features, in which case double descent is observed, and errors can be derived from the norms of the true regression for T-coordinates and non-T-coordinates. A similar analysis is then provided for a random Fourier feature model.&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[https://arxiv.org/pdf/1903.07571 BHX][https://dblp.org/rec/bibtex/journals/corr/abs-1903-07571 19] analyze two other data models. The former is a classical Gaussian linear regression with a huge space of features. But the regression is only made within a (random) subset &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;math&amp;gt;&lt;/ins&gt;T&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;/math&amp;gt; &lt;/ins&gt;of features, in which case double descent is observed, and errors can be derived from the norms of the true regression for &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;math&amp;gt;&lt;/ins&gt;T&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;/math&amp;gt;&lt;/ins&gt;-coordinates and non-&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;math&amp;gt;&lt;/ins&gt;T&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;/math&amp;gt;&lt;/ins&gt;-coordinates. A similar analysis is then provided for a random Fourier feature model.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;−&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[https://arxiv.org/pdf/1908.05355 MeiMontanari][https://scholar.google.ch/scholar?hl=en&amp;amp;as_sdt=0%2C5&amp;amp;q=The+generalization+error+of+random+features+regression%3A+Precise+asymptotics+and+double+descent+curve&amp;amp;btnG= 19] consider still another data model, where z is drawn uniformly randomly on a d-sphere, and y = f(z)+&lt;del class=&quot;diffchange diffchange-inline&quot;&gt;ε&lt;/del&gt;, such that E[&lt;del class=&quot;diffchange diffchange-inline&quot;&gt;ε&lt;/del&gt;]=0 and &lt;del class=&quot;diffchange diffchange-inline&quot;&gt;ε &lt;/del&gt;has finite fourth moment. The ridgeless linear regression is then over some random features &lt;del class=&quot;diffchange diffchange-inline&quot;&gt;x&lt;/del&gt;&amp;lt;&lt;del class=&quot;diffchange diffchange-inline&quot;&gt;sub&lt;/del&gt;&amp;gt;&lt;del class=&quot;diffchange diffchange-inline&quot;&gt;i&lt;/del&gt;&amp;lt;/&lt;del class=&quot;diffchange diffchange-inline&quot;&gt;sub&lt;/del&gt;&amp;gt;&lt;del class=&quot;diffchange diffchange-inline&quot;&gt;=σ(Wz&lt;/del&gt;&amp;lt;&lt;del class=&quot;diffchange diffchange-inline&quot;&gt;sub&lt;/del&gt;&amp;gt;&lt;del class=&quot;diffchange diffchange-inline&quot;&gt;i&lt;/del&gt;&amp;lt;/&lt;del class=&quot;diffchange diffchange-inline&quot;&gt;sub&lt;/del&gt;&amp;gt;&lt;del class=&quot;diffchange diffchange-inline&quot;&gt;), where σ &lt;/del&gt;applies &lt;del class=&quot;diffchange diffchange-inline&quot;&gt;componentwise &lt;/del&gt;and W has random rows of &lt;del class=&quot;diffchange diffchange-inline&quot;&gt;l&lt;/del&gt;&amp;lt;&lt;del class=&quot;diffchange diffchange-inline&quot;&gt;sub&lt;/del&gt;&amp;gt;&lt;del class=&quot;diffchange diffchange-inline&quot;&gt;2&lt;/del&gt;&amp;lt;/&lt;del class=&quot;diffchange diffchange-inline&quot;&gt;sub&lt;/del&gt;&amp;gt;-norm equal to 1. 
They prove that this yields a double descent.&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[https://arxiv.org/pdf/1908.05355 MeiMontanari][https://scholar.google.ch/scholar?hl=en&amp;amp;as_sdt=0%2C5&amp;amp;q=The+generalization+error+of+random+features+regression%3A+Precise+asymptotics+and+double+descent+curve&amp;amp;btnG= 19] consider still another data model, where &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;math&amp;gt;&lt;/ins&gt;z&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;/math&amp;gt; &lt;/ins&gt;is drawn uniformly randomly on a &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;math&amp;gt;&lt;/ins&gt;d&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;/math&amp;gt;&lt;/ins&gt;-sphere, and &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;math&amp;gt;&lt;/ins&gt;y = f(z)+&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;\varepsilon&amp;lt;/math&amp;gt;&lt;/ins&gt;, such that &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;math&amp;gt;\mathbb &lt;/ins&gt;E[&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;\varepsilon&lt;/ins&gt;]=0&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;/math&amp;gt; &lt;/ins&gt;and &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;math&amp;gt;\varepsilon&amp;lt;/math&amp;gt; &lt;/ins&gt;has finite fourth moment. The ridgeless linear regression is then over some random features &amp;lt;&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;math&lt;/ins&gt;&amp;gt;&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;x_i = \sigma(Wz_i)&lt;/ins&gt;&amp;lt;/&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;math&lt;/ins&gt;&amp;gt;&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;, where &lt;/ins&gt;&amp;lt;&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;math&lt;/ins&gt;&amp;gt;&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;\sigma&lt;/ins&gt;&amp;lt;/&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;math&lt;/ins&gt;&amp;gt; applies &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;component-wisely &lt;/ins&gt;and &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;math&amp;gt;&lt;/ins&gt;W&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;/math&amp;gt; &lt;/ins&gt;has random rows of &amp;lt;&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;math&lt;/ins&gt;&amp;gt;&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;\ell_2&lt;/ins&gt;&amp;lt;/&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;math&lt;/ins&gt;&amp;gt;-norm equal to 1. They prove that this yields a double descent.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/table&gt;</summary>
		<author><name>Lê Nguyên Hoang</name></author>
		
	</entry>
	<entry>
		<id>https://robustlybeneficial.org/wiki/index.php?title=Overfitting&amp;diff=19&amp;oldid=prev</id>
		<title>Lê Nguyên Hoang: Created page with &quot;Overfitting occurs where fitting training data too closely is counter-productive to out-of-sample predictions.  == Bias-variance tradeoff ==  [http://web.mit.edu/6.435/www/Gem...&quot;</title>
		<link rel="alternate" type="text/html" href="https://robustlybeneficial.org/wiki/index.php?title=Overfitting&amp;diff=19&amp;oldid=prev"/>
		<updated>2020-01-20T21:31:15Z</updated>

		<summary type="html">&lt;p&gt;Created page with &amp;quot;Overfitting occurs where fitting training data too closely is counter-productive to out-of-sample predictions.  == Bias-variance tradeoff ==  [http://web.mit.edu/6.435/www/Gem...&amp;quot;&lt;/p&gt;
&lt;p&gt;&lt;b&gt;New page&lt;/b&gt;&lt;/p&gt;&lt;div&gt;Overfitting occurs where fitting training data too closely is counter-productive to out-of-sample predictions.&lt;br /&gt;
&lt;br /&gt;
== Bias-variance tradeoff ==&lt;br /&gt;
&lt;br /&gt;
[http://web.mit.edu/6.435/www/Geman92.pdf GBD][https://scholar.google.ch/scholar?hl=en&amp;amp;as_sdt=0%2C5&amp;amp;q=neural+networks+and+the+bias%2Fvariance+dilemma+geman+bienenstock+doursat&amp;amp;btnG= 92] identified the bias-variance tradeoff, which quantifies out-of-sample errors as a sum of inductive bias and model variance, &amp;lt;em&amp;gt;for random samples&amp;lt;/em&amp;gt; drawn from the true distribution.&lt;br /&gt;
&lt;br /&gt;
Formally, let f(x,S) the prediction of the algorithm trained with sample S for feature x. Then E[(f(x,S)-y)&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;] = bias&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt; + variance + sigma&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;, where the expectation is over x, S and y, bias = E[f(x,S)-f&amp;lt;sup&amp;gt;*&amp;lt;/sup&amp;gt;(x)] is the bias with respect to the optimal prediction f&amp;lt;sup&amp;gt;*&amp;lt;/sup&amp;gt;, variance = E[f(x,S)&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;] - E[f(x,S)]&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt; is how much the prediction varies from one sample to the other and sigma&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt; = E[(f&amp;lt;sup&amp;gt;*&amp;lt;/sup&amp;gt;(x)-y)^2] is the unpredictable components of y given x.&lt;br /&gt;
&lt;br /&gt;
It has been argued that overfitting is caused by increased variance when we consider a learning algorithm that is too sensitive to the randomness of sampling S. This sometimes occurs when the number of parameters of the learning algorithm is too large (but not necessarily!).&lt;br /&gt;
&lt;br /&gt;
== PAC-learning ==&lt;br /&gt;
&lt;br /&gt;
To prove theoretical guarantees of non-overfitting, [http://web.mit.edu/6.435/www/Valiant84.pdf Valiant][https://dblp.org/rec/bibtex/journals/cacm/Valiant84 84] introduced the concept of &amp;lt;em&amp;gt;probably approximately correct&amp;lt;/em&amp;gt; (PAC) learning. More explanations here: [https://www.youtube.com/watch?v=uB2X2OuD4Rg&amp;amp;list=PLie7a1OUTSagZB9mFZnVBgsNfBtcUGJWB&amp;amp;index=8 Wandida16a].&lt;br /&gt;
&lt;br /&gt;
In particular, the fundamental theorem of statistical learning [https://www.amazon.com/Understanding-Machine-Learning-Theory-Algorithms/dp/1107057132/ ShalevshwartzBendavidBook][https://scholar.google.ch/scholar?hl=en&amp;amp;as_sdt=0%2C5&amp;amp;q=Understanding+Machine+Learning+Theory+Algorithms+shalev-shwartz+ben-david&amp;amp;btnG= 14] [https://www.youtube.com/watch?v=RkWuLtFPBKU&amp;amp;list=PLie7a1OUTSagZB9mFZnVBgsNfBtcUGJWB&amp;amp;index=14 Wandida16b] provides guarantees of PAC learning, when the number of data points sufficiently exceed the VC dimension of the set of learnable algorithms. Since this VC dimension is often essentially the number of parameters (assuming finite representation of the parameters as float or double), then this means that PAC learning is guaranteed when #data &amp;lt;&amp;lt; #parameters.&lt;br /&gt;
&lt;br /&gt;
This has become a conventional wisdom for a while.&lt;br /&gt;
&lt;br /&gt;
== Double descent ==&lt;br /&gt;
&lt;br /&gt;
However, the conventional wisdom is in sharp contradiction with today's success of deep neural networks [https://openreview.net/pdf?id=Sy8gdB9xx ZBHRV][https://dblp.org/rec/bibtex/conf/iclr/ZhangBHRV17 17], [https://www.pnas.org/content/pnas/116/32/15849.full.pdf BHMM][https://scholar.google.ch/scholar?hl=en&amp;amp;as_sdt=0%2C5&amp;amp;q=Reconciling+modern+machine-learning+practice+and+the+classical+bias%E2%80%93variance+trade-off&amp;amp;btnG= 19] [https://arxiv.org/pdf/1912.02292 NKBYBS][https://dblp.org/rec/bibtex/journals/corr/abs-1912-02292 19], but also kernel methods [http://proceedings.mlr.press/v80/belkin18a/belkin18a.pdf BMM][https://dblp.org/rec/bibtex/conf/icml/BelkinMM18 18] [http://proceedings.mlr.press/v89/belkin19a/belkin19a.pdf BRT][https://dblp.org/rec/bibtex/conf/aistats/BelkinRT19 19], ridgeless (random feature) linear regression [https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&amp;amp;arnumber=8849614 MVS][https://dblp.org/rec/bibtex/conf/isit/MuthukumarVS19 19] [https://arxiv.org/pdf/1906.11300 BLLT][https://dblp.org/rec/bibtex/journals/corr/abs-1906-11300 19] [https://arxiv.org/pdf/1908.05355 MeiMontanari][https://scholar.google.ch/scholar?hl=en&amp;amp;as_sdt=0%2C5&amp;amp;q=The+generalization+error+of+random+features+regression%3A+Precise+asymptotics+and+double+descent+curve&amp;amp;btnG= 19] [https://arxiv.org/pdf/1903.08560 HMRT][https://dblp.org/rec/bibtex/journals/corr/abs-1903-08560 19] [https://arxiv.org/pdf/1903.07571 BHX][https://dblp.org/rec/bibtex/journals/corr/abs-1903-07571 19] and even ensembles [https://www.pnas.org/content/pnas/116/32/15849.full.pdf BHMM][https://scholar.google.ch/scholar?hl=en&amp;amp;as_sdt=0%2C5&amp;amp;q=Reconciling+modern+machine-learning+practice+and+the+classical+bias%E2%80%93variance+trade-off&amp;amp;btnG= 19]. Learning algorithms seem to often achieve their best out-of-sample performance when they are massively overparameterized and perfectly fit the training data (called &amp;lt;em&amp;gt;interpolation&amp;lt;/em&amp;gt;).&lt;br /&gt;
&lt;br /&gt;
Intriguingly, a &amp;lt;em&amp;gt;double descent&amp;lt;/em&amp;gt; phenomenon often occurs [https://www.pnas.org/content/pnas/116/32/15849.full.pdf BHMM][https://scholar.google.ch/scholar?hl=en&amp;amp;as_sdt=0%2C5&amp;amp;q=Reconciling+modern+machine-learning+practice+and+the+classical+bias%E2%80%93variance+trade-off&amp;amp;btnG= 19] [https://arxiv.org/pdf/1912.02292 NKBYBS][https://dblp.org/rec/bibtex/journals/corr/abs-1912-02292 19], where the performance at the test set first behaves as predicted by the bias-variance dilemma, but then improves and outperforms what would be advised by classical statistical learning.&lt;br /&gt;
&lt;br /&gt;
All these results suggest that overfitting eventually disappears as models grow larger, which contradicts the conventional wisdom.&lt;br /&gt;
&lt;br /&gt;
== Details ==&lt;br /&gt;
&lt;br /&gt;
[https://openreview.net/pdf?id=Sy8gdB9xx ZBHRV][https://dblp.org/rec/bibtex/conf/iclr/ZhangBHRV17 17] showed that large interpolating neural networks generalize well, even when the data are very noisy. They also showed that inductive bias likely plays a limited role, as neural networks still manage to fit, quite efficiently, data whose labels have been completely shuffled. Finally, they proved that a neural network with 2n+d parameters can interpolate n data points of dimension d.&lt;br /&gt;
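&lt;br /&gt;
A minimal sketch of the random-label experiment, here with a small scikit-learn multilayer perceptron rather than the large architectures of the paper; the sizes and hyperparameters are illustrative assumptions.&lt;br /&gt;
&lt;pre&gt;
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))      # random inputs
y = rng.integers(0, 2, size=200)    # completely shuffled (random) labels

# A sufficiently overparameterized network can still drive training error to zero.
net = MLPClassifier(hidden_layer_sizes=(512,), alpha=0.0,
                    max_iter=5000, tol=1e-7, random_state=0)
net.fit(X, y)
print("training accuracy:", net.score(X, y))  # typically 1.0: pure memorization
&lt;/pre&gt;&lt;br /&gt;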
&lt;br /&gt;
[https://www.pnas.org/content/pnas/116/32/15849.full.pdf BHMM][https://scholar.google.ch/scholar?hl=en&amp;amp;as_sdt=0%2C5&amp;amp;q=Reconciling+modern+machine-learning+practice+and+the+classical+bias%E2%80%93variance+trade-off&amp;amp;btnG= 19] observed double descent for random Fourier features (which [http://papers.nips.cc/paper/3182-random-features-for-large-scale-kernel-machines.pdf RahimiRecht][https://dblp.org/rec/bibtex/conf/nips/RahimiR07 07] proved to be intimately connected to kernel methods), neural networks, decision trees and ensemble methods.&lt;br /&gt;
&lt;br /&gt;
[https://arxiv.org/pdf/1912.02292 NKBYBS][https://dblp.org/rec/bibtex/journals/corr/abs-1912-02292 19] show that a very wide variety of deep neural networks exhibit a wide variety of double descent phenomena. Not only is there double descent with respect to the number of parameters, but there also seems to be double descent with respect to the width of the neural networks and, more surprisingly, with respect to the number of training epochs. They conjecture that overfitting peaks when the &amp;quot;effective model complexity&amp;quot; (the largest number of data points on which the model can achieve near-zero training error) matches the size of the training set; before and beyond this critical regime, overfitting appears to vanish.&lt;br /&gt;
&lt;br /&gt;
[http://proceedings.mlr.press/v80/belkin18a/belkin18a.pdf BMM][https://dblp.org/rec/bibtex/conf/icml/BelkinMM18 18] present experiments showing that interpolating kernel methods also generalize well and are able to fit random labels (though in this paper, they do not exhibit double descent). They also show that, because the Hilbert-space norms of interpolants grow superpolynomially with the number n of data points (as exp(Θ(n&lt;sup&gt;1/d&lt;/sup&gt;))), the usual norm-based bounds controlling overfitting are actually trivial for large datasets. This indicates the need for a radically different approach to understanding overfitting.&lt;br /&gt;
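&lt;br /&gt;
A minimal sketch of how such norms can be computed: for a kernel matrix K on the training points, the minimum-norm kernel interpolant of labels y has squared RKHS norm y&lt;sup&gt;T&lt;/sup&gt;K&lt;sup&gt;-1&lt;/sup&gt;y, whose growth with n can be watched directly; the Gaussian kernel and data model below are illustrative assumptions.&lt;br /&gt;
&lt;pre&gt;
import numpy as np

def rkhs_norm_squared(X, y, bandwidth=1.0):
    """Squared RKHS norm y^T K^{-1} y of the min-norm Gaussian-kernel interpolant."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-sq_dists / (2.0 * bandwidth ** 2))
    alpha = np.linalg.solve(K + 1e-10 * np.eye(len(X)), y)  # tiny jitter for stability
    return float(y @ alpha)

rng = np.random.default_rng(0)
for n in [20, 40, 80, 160, 320]:
    X = rng.normal(size=(n, 2))
    y = rng.choice([-1.0, 1.0], size=n)   # random labels, as in the experiments
    print(n, rkhs_norm_squared(X, y))
&lt;/pre&gt;&lt;br /&gt;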
&lt;br /&gt;
[http://proceedings.mlr.press/v89/belkin19a/belkin19a.pdf BRT][https://dblp.org/rec/bibtex/conf/aistats/BelkinRT19 19] show examples of singular-kernel interpolators (with K(x,y) ~ ||x-y||&lt;sup&gt;-a&lt;/sup&gt; near x = y) that achieve statistically optimal rates, even for improper learning (meaning that the true function to be learned does not belong to the set of hypotheses).&lt;br /&gt;
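&lt;br /&gt;
For intuition on why singular kernels interpolate, here is a minimal Shepard-style (Nadaraya-Watson) sketch with weights ||x - x&lt;sub&gt;i&lt;/sub&gt;||&lt;sup&gt;-a&lt;/sup&gt;: as x approaches a training point, that point's weight blows up and the prediction converges to its label. This only illustrates the mechanism, not the precise estimators analyzed in the paper.&lt;br /&gt;
&lt;pre&gt;
import numpy as np

def singular_kernel_predict(x, X_train, y_train, a=2.0, eps=1e-12):
    """Nadaraya-Watson prediction with the singular kernel ||x - x_i||^(-a).

    Exactly at a training point the weight diverges, so we return that label;
    this is what makes the estimator an interpolator.
    """
    dists = np.linalg.norm(X_train - x, axis=1)
    hit = np.argmin(dists)
    if dists[hit] &amp;lt; eps:
        return float(y_train[hit])
    w = dists ** (-a)
    return float(w @ y_train / w.sum())

rng = np.random.default_rng(0)
X = rng.uniform(size=(30, 1))
y = np.sin(2 * np.pi * X[:, 0]) + 0.1 * rng.normal(size=30)

print(singular_kernel_predict(X[3], X, y))             # returns exactly y[3]
print(singular_kernel_predict(np.array([0.5]), X, y))  # smooth elsewhere
&lt;/pre&gt;&lt;br /&gt;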
&lt;br /&gt;
Note that the connection between kernel methods and neural networks has been made several times, for instance by [http://papers.nips.cc/paper/3182-random-features-for-large-scale-kernel-machines.pdf RahimiRecht][https://dblp.org/rec/bibtex/conf/nips/RahimiR07 07] and [http://papers.nips.cc/paper/8076-neural-tangent-kernel-convergence-and-generalization-in-neural-networks.pdf JHG][https://dblp.org/rec/bibtex/conf/nips/JacotHG18 18]. Essentially, random features implement approximate kernel methods, and the first layers of a neural network, with random (or even trained) weights, can be regarded as computing random features.&lt;br /&gt;
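&lt;br /&gt;
A minimal sketch of the Rahimi-Recht construction: random Fourier features whose inner products approximate the Gaussian (RBF) kernel. The dimensions and bandwidth below are illustrative assumptions.&lt;br /&gt;
&lt;pre&gt;
import numpy as np

rng = np.random.default_rng(0)
d, D, bandwidth = 5, 2000, 1.0

# Random Fourier features: z(x)^T z(x') approximates the Gaussian kernel
# exp(-||x - x'||^2 / (2 * bandwidth^2)), with accuracy improving in D.
W = rng.normal(scale=1.0 / bandwidth, size=(D, d))
b = rng.uniform(0.0, 2.0 * np.pi, size=D)

def rff(X):
    return np.sqrt(2.0 / D) * np.cos(X @ W.T + b)

X = rng.normal(size=(8, d))
Z = rff(X)
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K_exact = np.exp(-sq_dists / (2.0 * bandwidth ** 2))
print(np.max(np.abs(Z @ Z.T - K_exact)))  # small, and shrinking as D grows
&lt;/pre&gt;&lt;br /&gt;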
&lt;br /&gt;
[https://arxiv.org/pdf/1903.08560 HMRT][https://dblp.org/rec/bibtex/journals/corr/abs-1903-08560 19] use random matrix theory (to estimate the eigenvalues of X^TX when X consists of random samples) to show that ridgeless regression (least squares with the minimum-l&lt;sub&gt;2&lt;/sub&gt;-norm solution) features &amp;quot;infinite double descent&amp;quot; as the size n of the training set grows to infinity along with the number of parameters p (they assume p/n → γ, and show that the risk diverges for γ close to 1). This is shown both for a linear model where X = Σ^{1/2}Z, for some fixed covariance Σ and some well-behaved Z of mean 0 and variance 1, and for a nonlinear model X = σ(WZ), with σ applied componentwise. In both cases, it is assumed that y = x^Tβ + ε, where E[ε] = 0. There are also assumptions of finite fixed variance for ε, and of finite fourth moment for z (needed for random matrix theory).&lt;br /&gt;
&lt;br /&gt;
[https://arxiv.org/pdf/1903.07571 BHX][https://dblp.org/rec/bibtex/journals/corr/abs-1903-07571 19] analyze two other data models. The first is a classical Gaussian linear regression with a huge space of features, but where the regression is only performed within a (random) subset T of the features; in this case double descent is observed, and the errors can be derived from the norms of the true regression vector restricted to the T-coordinates and to the non-T-coordinates. A similar analysis is then provided for a random Fourier features model.&lt;br /&gt;
&lt;br /&gt;
[https://arxiv.org/pdf/1908.05355 MeiMontanari][https://scholar.google.ch/scholar?hl=en&amp;amp;as_sdt=0%2C5&amp;amp;q=The+generalization+error+of+random+features+regression%3A+Precise+asymptotics+and+double+descent+curve&amp;amp;btnG= 19] consider yet another data model, where z is drawn uniformly at random on a d-sphere, and y = f(z)+ε, such that E[ε]=0 and ε has finite fourth moment. The ridgeless linear regression is then performed over random features x&lt;sub&gt;i&lt;/sub&gt;=σ(Wz&lt;sub&gt;i&lt;/sub&gt;), where σ applies componentwise and W has random rows of l&lt;sub&gt;2&lt;/sub&gt;-norm equal to 1. They prove that this yields a double descent.&lt;br /&gt;
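&lt;br /&gt;
A minimal sketch of this data model; the activation σ, the dimensions and the noise law below are illustrative assumptions.&lt;br /&gt;
&lt;pre&gt;
import numpy as np

rng = np.random.default_rng(0)
n, d, p = 200, 20, 400

# z uniform on the d-sphere: normalize Gaussian vectors.
Z = rng.normal(size=(n, d))
Z /= np.linalg.norm(Z, axis=1, keepdims=True)

# W with random rows of unit l2-norm.
W = rng.normal(size=(p, d))
W /= np.linalg.norm(W, axis=1, keepdims=True)

# Noisy labels y = f(z) + eps with E[eps] = 0 (here f linear, eps Gaussian).
beta = rng.normal(size=d)
y = Z @ beta + 0.1 * rng.normal(size=n)

# Random features, then min-norm (ridgeless) regression over them.
X = np.maximum(Z @ W.T, 0.0)          # sigma = ReLU, applied componentwise
theta = np.linalg.pinv(X) @ y         # interpolates since p exceeds n
print(np.max(np.abs(X @ theta - y)))  # ~0: perfect fit of the training set
&lt;/pre&gt;&lt;/div&gt;</summary>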
		<author><name>Lê Nguyên Hoang</name></author>
		
	</entry>
</feed>