Stochastic gradient descent

From RB Wiki

Stochastic gradient descent (SGD) is the most widely used learning algorithm. For a very general perspective, SGD consists in iterating (1) draw some data point and (2) slightly tweak the algorithm's parameters to better fit the data point.

Formal basic model

Consider an algorithm with continuous parameters [math]\theta \in \mathbb R^d[/math]. Learning would then correspond to adjusting adequately the parameters [math]\theta[/math] of the algorithm. To do so, a loss function [math]\mathcal L(\theta,S)[/math], which usually depends on [math]\theta[/math] and on some large dataset [math]S[/math]. The gradient [math]\nabla_\theta \mathcal L(\theta,S) \in \mathbb R^d[/math] at parameters [math]\theta[/math] is the direction of change of [math]\theta[/math] along which the function [math]\mathcal L[/math] increases the most. This means that, conversely, for [math]\eta[/math] small enough, [math]\theta' = \theta - \eta \nabla_\theta \mathcal L(\theta,S)[/math] will improve upon [math]\theta[/math], in the sense that [math]\mathcal L(\theta',S) \leq \mathcal L(\theta,S)[/math].

Gradient descent starts by initializing the parameters at some value [math]\theta_0[/math]. Then, it iterates improvements of [math]\theta_{t+1} = \theta_t - \eta_t \nabla_\theta \mathcal L(\theta_t,S)[/math]. The hyperparameter [math]\eta_t[/math] is called the learning rate. See 3Blue1Brown17 for more insightful explanations.

In practice, and especially for neural networks, [math]\nabla_\theta \mathcal L(\theta_t,S)[/math] is both long to compute and unnecessary to compute too accurately at each learning step. Moreover it can nearly always be written as [math]\nabla_\theta \mathcal L(\theta_t,S) = \mathbb E_{x \leftarrow S} \left[ \nabla_\theta \mathcal L(\theta_t,x) \right][/math]. This motivated stochastic gradient descent, where the gradient [math]\nabla_\theta \mathcal L(\theta_t,S)[/math] is replaced by a stochastic estimate [math]\nabla_\theta \mathcal L(\theta_t,x)[/math], for some data point [math]x[/math] uniformly randomly drawn from [math]S[/math].

Numerous variants of this basic settings have been proposed, which we discuss below.

Classical guarantees