Overfitting

From RB Wiki
Revision as of 13:43, 22 January 2020 by Lê Nguyên Hoang (talk | contribs)

Overfitting occurs where fitting training data too closely is counter-productive to out-of-sample predictions.

Bias-variance tradeoff

GBD92 identified the bias-variance tradeoff, which quantifies out-of-sample errors as a sum of inductive bias and model variance, for random samples drawn from the true distribution.

Formally, let [math]f(x,S)[/math] the [math]y[/math]-prediction of the algorithm trained with sample [math]S[/math] for feature [math]x[/math]. Then [math]\mathbb E_{x,y,S}[(f(x,S)-y)^2] = bias^2 + variance + \sigma^2[/math], where the expectation is over [math]x[/math], [math]S[/math] and [math]y[/math], where [math]bias = \mathbb E_{x,S}[f(x,S)-f^*(x)][/math] is the bias with respect to the optimal prediction [math]f^*[/math], where [math]variance = \mathbb E_{x,S}[f(x,S)^2] - \mathbb E_{x,S}[f(x,S)]^2[/math] is how much the prediction varies from one sample to the other and [math]\sigma^2 = \mathbb E_{x,y}[(f^*(x)-y)^2][/math] is the unpredictable components of [math]y[/math] given [math]x[/math].

It has been argued that overfitting is caused by increased variance when we consider a learning algorithm that is too sensitive to the randomness of sampling [math]S[/math]. This sometimes occurs when the number of parameters of the learning algorithm is too large (but not necessarily!).

PAC-learning

To prove theoretical guarantees of non-overfitting, Valiant84 introduced the concept of probably approximately correct (PAC) learning. More explanations here: Wandida16a.

In particular, the fundamental theorem of statistical learning ShalevshwartzBendavidBook14 Wandida16b provides guarantees of PAC learning, when the number of data points sufficiently exceed the VC dimension of the set of learnable algorithms. Since this VC dimension is often essentially the number of parameters (assuming finite representation of the parameters as float or double), then this means that PAC learning is guaranteed when [math]\#data \ll \#parameters[/math].

This has become a conventional wisdom for a while.

Double descent

However, the conventional wisdom is in sharp contradiction with today's success of deep neural networks ZBHRV17, BHMM19 NKBYBS19, but also kernel methods BMM18 BRT19, ridgeless (random feature) linear regression MVS19 BLLT19 MeiMontanari19 HMRT19 BHX19 and even ensembles BHMM19. Learning algorithms seem to often achieve their best out-of-sample performance when they are massively overparameterized and perfectly fit the training data (called interpolation).

Intriguingly, a double descent phenomenon often occurs BHMM19 NKBYBS19, where the performance at the test set first behaves as predicted by the bias-variance dilemma, but then improves and outperforms what would be advised by classical statistical learning.

All such results suggest that overfitting eventually disappears, which contradicts conventional wisdom.

Details

ZBHRV17 showed that large interpolating neural networks generalize well, even for large noise in the data. Also, they showed that inductive bias likely plays a limited role, as neural networks still manage to learn quite efficiently data whose labels are completely shuffled. They also proved that a neural network with [math]2n+d[/math] parameters can interpolate [math]n[/math] data points of dimension [math]d[/math].

BHMM19 observed double descent for random Fourier features (which RahimiRecht07 proved to be intimately connected to kernel methods), neural networks, decision tree and ensemble methods.

NKBYBS19 show that a very wide variety of deep neural networks exhibit a wide variety of double descent phenomenons. Not only is there double descent with respect to the number of parameters, but there also seems to be double descent with respect to the width of the neural networks, and weirdly also with respect to epochs of learning steps. They conjecture that "effective model complexity" (the number of data points for which the model is able to achieve small training loss) is a critical point where overfitting occurs. Before and beyond this, overfitting appears to vanish.

BMM18 present experiments that show that interpolating kernel methods also generalize well and are able to fit random labels (though in this paper, they do not exhibit double descent). They also show that, because norms of interpolaters grow superpolynomially in Hilbert space (in [math]exp(\theta(n^{1/d}))[/math]), usual bounds controlling overfitting are actually trivial for large datasets. This indicates the need for radically different approach to understand overfitting.

BRT19 show examples of singular kernel interpolators ([math]K(x,y)\sim||x-y||^{-a}[/math]) that achieve optimal rates, even for improper learning (meaning that the true function to learn does not belong to the set of hypotheses).

Note that the connection between kernel methods and neural networks has been made, for instance by RahimiRecht07 and JHG18. Essentially, random features implement approximate kernel methods. And the first layers of neural networks with random (or even trained) weights can be regarded as computations of random features.

HMRT19 use random matrix theory (to estimate eigenvalues of [math]X^TX[/math] when [math]X[/math] are random samples) to show that ridgless regression (regression with minimum [math]\ell_2[/math]-norm) features "infinite double descent" as the size [math]n[/math] of the training data sets grows to infinity, along with the number of parameters [math]p[/math] (they assume [math]p/n \rightarrow gamma[/math], and show infinite overfitting for [math]gamma\sim 1[/math]). This is shown for both a linear model where [math]X = \Sigma^{1/2} Z[/math], for some fixed [math]\Sigma[/math] and some well-behaved [math]Z[/math] of mean 0 and variance 1, and a component-wise linearity [math]X = \sigma(WZ)[/math]. In both cases, it is assumed that [math]y=x^T\beta+\varepsilon[/math], where [math]\mathbb E[\varepsilon]=0[/math]. There are also assumptions of finite fixed variance for [math]\varepsilon[/math], and finite fourth moment for [math]Z[/math] (needed for random matrix theory).

BHX19 analyze two other data models. The former is a classical Gaussian linear regression with a huge space of features. But the regression is only made within a (random) subset [math]T[/math] of features, in which case double descent is observed, and errors can be derived from the norms of the true regression for [math]T[/math]-coordinates and non-[math]T[/math]-coordinates. A similar analysis is then provided for a random Fourier feature model.

MeiMontanari19 consider still another data model, where [math]z[/math] is drawn uniformly randomly on a [math]d[/math]-sphere, and [math]y = f(z)+\varepsilon[/math], such that [math]\mathbb E[\varepsilon]=0[/math] and [math]\varepsilon[/math] has finite fourth moment. The ridgeless linear regression is then over some random features [math]x_i = \sigma(Wz_i)[/math], where [math]\sigma[/math] applies component-wisely and [math]W[/math] has random rows of [math]\ell_2[/math]-norm equal to 1. They prove that this yields a double descent.