# Overfitting

Overfitting occurs where fitting training data too closely is counter-productive to out-of-sample predictions.

## Bias-variance tradeoff

GBD92 identified the bias-variance tradeoff, which quantifies out-of-sample errors as a sum of inductive bias and model variance, *for random samples* drawn from the true distribution.

Formally, let [math]f(x,S)[/math] the [math]y[/math]-prediction of the algorithm trained with sample [math]S[/math] for feature [math]x[/math]. Then [math]\mathbb E_{x,y,S}[(f(x,S)-y)^2] = bias^2 + variance + \sigma^2[/math], where the expectation is over [math]x[/math], [math]S[/math] and [math]y[/math], where [math]bias = \mathbb E_{x,S}[f(x,S)-f^*(x)][/math] is the bias with respect to the optimal prediction [math]f^*[/math], where [math]variance = \mathbb E_{x,S}[f(x,S)^2] - \mathbb E_{x,S}[f(x,S)]^2[/math] is how much the prediction varies from one sample to the other and [math]\sigma^2 = \mathbb E_{x,y}[(f^*(x)-y)^2][/math] is the unpredictable components of [math]y[/math] given [math]x[/math].

It has been argued that overfitting is caused by increased variance when we consider a learning algorithm that is too sensitive to the randomness of sampling [math]S[/math]. This sometimes occurs when the number of parameters of the learning algorithm is too large (but not necessarily!).

## PAC-learning

To prove theoretical guarantees of non-overfitting, Valiant84 introduced the concept of *probably approximately correct* (PAC) learning. More explanations here: Wandida16a.

In particular, the fundamental theorem of statistical learning ShalevshwartzBendavidBook14 Wandida16b provides guarantees of PAC learning, when the number of data points sufficiently exceed the VC dimension of the set of learnable algorithms. Since this VC dimension is often essentially the number of parameters (assuming finite representation of the parameters as float or double), then this means that PAC learning is guaranteed when [math]\#data \ll \#parameters[/math].

This has become a conventional wisdom for a while.

## Test set overfitting

There has been concerns about overfitting of test sets, as these are used more and more to measure the performance of machine learning algorithms. But RRSS19 analyze statistical patterns on reported test set performances, and argue that there still is actual progress.

## Double descent

However, the conventional wisdom is in sharp contradiction with today's success of deep neural networks ZBHRV17, BHMM19 NKBYBS19, but also kernel methods BMM18 BRT19, ridgeless (random feature) linear regression MVS19 BLLT19 MeiMontanari19 HMRT19 BHX19 JSSHG20 and even ensembles BHMM19. Learning algorithms seem to often achieve their best out-of-sample performance when they are massively overparameterized and perfectly fit the training data (called *interpolation*).

Intriguingly, a *double descent* phenomenon often occurs BHMM19 NKBYBS19, where the performance at the test set first behaves as predicted by the bias-variance dilemma, but then improves and outperforms what would be advised by classical statistical learning.

All such results suggest that overfitting eventually disappears, which contradicts conventional wisdom.

## Details

ZBHRV17 showed that large interpolating neural networks generalize well, even for large noise in the data. Also, they showed that inductive bias likely plays a limited role, as neural networks still manage to learn quite efficiently data whose labels are completely shuffled. They also proved that a neural network with [math]2n+d[/math] parameters can interpolate [math]n[/math] data points of dimension [math]d[/math].

BHMM19 observed double descent for random Fourier features (which RahimiRecht07 proved to be intimately connected to kernel methods), neural networks, decision tree and ensemble methods.

NKBYBS19 show that a very wide variety of deep neural networks exhibit a wide variety of double descent phenomenons. Not only is there double descent with respect to the number of parameters, but there also seems to be double descent with respect to the width of the neural networks, and weirdly also with respect to epochs of learning steps. They conjecture that "effective model complexity" (the number of data points for which the model is able to achieve small training loss) is a critical point where overfitting occurs. Before and beyond this, overfitting appears to vanish.

BMM18 present experiments that show that interpolating kernel methods also generalize well and are able to fit random labels (though in this paper, they do not exhibit double descent). They also show that, because norms of interpolaters grow superpolynomially in Hilbert space (in [math]exp(\theta(n^{1/d}))[/math]), usual bounds controlling overfitting are actually trivial for large datasets. This indicates the need for radically different approach to understand overfitting.

BRT19 show examples of singular kernel interpolators ([math]K(x,y)\sim||x-y||^{-a}[/math]) that achieve optimal rates, even for improper learning (meaning that the true function to learn does not belong to the set of hypotheses).

Note that the connection between kernel methods and neural networks has been made, for instance by RahimiRecht07 and JHG18. Essentially, random features implement approximate kernel methods. And the first layers of neural networks with random (or even trained) weights can be regarded as computations of random features.

HMRT19 use random matrix theory (to estimate eigenvalues of [math]X^TX[/math] when [math]X[/math] are random samples) to show that ridgless regression (regression with minimum [math]\ell_2[/math]-norm) features "infinite double descent" as the size [math]n[/math] of the training data sets grows to infinity, along with the number of parameters [math]p[/math] (they assume [math]p/n \rightarrow gamma[/math], and show infinite overfitting for [math]gamma\sim 1[/math]). This is shown for both a linear model where [math]X = \Sigma^{1/2} Z[/math], for some fixed [math]\Sigma[/math] and some well-behaved [math]Z[/math] of mean 0 and variance 1, and a component-wise linearity [math]X = \sigma(WZ)[/math]. In both cases, it is assumed that [math]y=x^T\beta+\varepsilon[/math], where [math]\mathbb E[\varepsilon]=0[/math]. There are also assumptions of finite fixed variance for [math]\varepsilon[/math], and finite fourth moment for [math]Z[/math] (needed for random matrix theory).

BHX19 analyze two other data models. The former is a classical Gaussian linear regression with a huge space of features. But the regression is only made within a (random) subset [math]T[/math] of features, in which case double descent is observed, and errors can be derived from the norms of the true regression for [math]T[/math]-coordinates and non-[math]T[/math]-coordinates. A similar analysis is then provided for a random Fourier feature model.

MeiMontanari19 consider still another data model, where [math]z[/math] is drawn uniformly randomly on a [math]d[/math]-sphere, and [math]y = f(z)+\varepsilon[/math], such that [math]\mathbb E[\varepsilon]=0[/math] and [math]\varepsilon[/math] has finite fourth moment. The ridgeless linear regression is then over some random features [math]x_i = \sigma(Wz_i)[/math], where [math]\sigma[/math] applies component-wisely and [math]W[/math] has random rows of [math]\ell_2[/math]-norm equal to 1. They prove that this yields a double descent.

JSSHG20 study a random feature model where the true function and the features are drawn from a Gaussian process. They prove an upper-bound between a regularized Bayesian posterior prediction at a point [math]x[/math] and the expectation of the (slightly differently) regularized random feature prediction at this point, which can be argued to go to zero under reasonable assumptions, as the number of parameters goes to infinity. Moreover, as the regularization goes to zero, the random feature linear regression comes closer to the Bayesian posterior prediction. They also prove a bound between the expected loss of the regularized random feature model and that of the regularized kernel model. They also show that the "effective ridge" [math]\tilde \lambda[/math] (i.e. ridge of the kernel method equivalent to the ridge [math]\lambda[/math] of the random feature regression) is key to understanding the variance explosion. In particular, they relate it to the explosion of [math]\partial_\lambda \tilde \lambda[/math]. An interesting insight is that the adequate linear regression regularization should thus be smaller than (and can be exactly computed from) the kernel method ridge.