# Stochastic gradient descent

Stochastic gradient descent (SGD) is the most widely used learning algorithm in machine learning. From a very general perspective, SGD consists in iterating two steps: (1) draw a data point and (2) slightly tweak the algorithm's parameters to better fit that data point.

## Formal basic model

Consider an algorithm with continuous parameters [math]\theta \in \mathbb R^d[/math]. Learning then corresponds to adequately adjusting the parameters [math]\theta[/math] of the algorithm. To do so, one defines a *loss function* [math]\mathcal L(\theta,S)[/math], which usually depends on [math]\theta[/math] and on some large dataset [math]S[/math]. The gradient [math]\nabla_\theta \mathcal L(\theta,S) \in \mathbb R^d[/math] at parameters [math]\theta[/math] is the direction of change of [math]\theta[/math] along which the function [math]\mathcal L[/math] increases the most. Conversely, for [math]\eta[/math] small enough (and a nonzero gradient), [math]\theta' = \theta - \eta \nabla_\theta \mathcal L(\theta,S)[/math] improves upon [math]\theta[/math], in the sense that [math]\mathcal L(\theta',S) \leq \mathcal L(\theta,S)[/math].
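As a toy illustration (not from the text above, just a convenient example), take the quadratic loss [math]\mathcal L(\theta) = \tfrac{1}{2} \|\theta\|^2[/math], whose gradient is [math]\nabla_\theta \mathcal L(\theta) = \theta[/math]. A gradient step then gives [math]\theta' = \theta - \eta \theta = (1-\eta)\theta[/math], and since [math]\mathcal L(\theta') = \tfrac{1}{2}(1-\eta)^2 \|\theta\|^2[/math], the loss indeed decreases whenever [math]0 < \eta < 2[/math].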

Gradient descent starts by initializing the parameters at some value [math]\theta_0[/math]. It then iterates the update [math]\theta_{t+1} = \theta_t - \eta_t \nabla_\theta \mathcal L(\theta_t,S)[/math]. The hyperparameter [math]\eta_t[/math] is called the *learning rate*. See 3Blue1Brown17 for more insightful explanations.
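To make the update concrete, here is a minimal sketch of gradient descent in Python. It assumes a least-squares regression loss on a dataset [math]S[/math] of pairs stored as arrays `X` and `y`; the loss, function names, and default hyperparameters are illustrative assumptions, not fixed by the text.

```python
import numpy as np

def full_gradient(theta, X, y):
    """Gradient of the mean squared error loss L(theta, S) = mean((X @ theta - y)**2) / 2."""
    return X.T @ (X @ theta - y) / len(y)

def gradient_descent(X, y, eta=0.1, n_steps=1000):
    theta = np.zeros(X.shape[1])  # initialize parameters at theta_0 = 0
    for _ in range(n_steps):
        # theta_{t+1} = theta_t - eta * grad L(theta_t, S), with a constant learning rate
        theta -= eta * full_gradient(theta, X, y)
    return theta
```

Note that every step touches the whole dataset `X`, which is exactly the cost that the stochastic variant below avoids.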

In practice, and especially for neural networks, [math]\nabla_\theta \mathcal L(\theta_t,S)[/math] is both expensive to compute and unnecessary to compute with high accuracy at each learning step. Moreover, the loss can nearly always be written as an average over data points, so that [math]\nabla_\theta \mathcal L(\theta_t,S) = \mathbb E_{x \leftarrow S} \left[ \nabla_\theta \mathcal L(\theta_t,x) \right][/math]. This motivates *stochastic gradient descent*, where the gradient [math]\nabla_\theta \mathcal L(\theta_t,S)[/math] is replaced by a stochastic estimate [math]\nabla_\theta \mathcal L(\theta_t,x)[/math], for a data point [math]x[/math] drawn uniformly at random from [math]S[/math].
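Continuing the same illustrative least-squares setup as above, a minimal SGD sketch replaces the full gradient with the gradient at a single uniformly drawn point:

```python
import numpy as np

def point_gradient(theta, x, y):
    """Gradient of the per-example loss L(theta, (x, y)) = (x @ theta - y)**2 / 2."""
    return (x @ theta - y) * x

def sgd(X, y, eta=0.01, n_steps=10000, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    for _ in range(n_steps):
        i = rng.integers(len(y))  # draw a data point uniformly from S
        # Unbiased estimate: averaging point_gradient over all i recovers the full gradient.
        theta -= eta * point_gradient(theta, X[i], y[i])
    return theta
```

Each step now costs one data point rather than the whole dataset, at the price of added noise in the updates.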

Numerous variants of this basic setting have been proposed, which we discuss below.