# Stochastic gradient descent

Stochastic gradient descent (SGD) is the most widely used learning algorithm. For a very general perspective, SGD consists in iterating (1) draw some data point and (2) slightly tweak the algorithm's parameters to better fit the data point.

## Formal basic model

Consider an algorithm with continuous parameters $\theta \in \mathbb R^d$. Learning would then correspond to adjusting adequately the parameters $\theta$ of the algorithm. To do so, a loss function $\mathcal L(\theta,S)$, which usually depends on $\theta$ and on some large dataset $S$. The gradient $\nabla_\theta \mathcal L(\theta,S) \in \mathbb R^d$ at parameters $\theta$ is the direction of change of $\theta$ along which the function $\mathcal L$ increases the most. This means that, conversely, for $\eta$ small enough, $\theta' = \theta - \eta \nabla_\theta \mathcal L(\theta,S)$ will improve upon $\theta$, in the sense that $\mathcal L(\theta',S) \leq \mathcal L(\theta,S)$.

Gradient descent starts by initializing the parameters at some value $\theta_0$. Then, it iterates improvements of $\theta_{t+1} = \theta_t - \eta_t \nabla_\theta \mathcal L(\theta_t,S)$. The hyperparameter $\eta_t$ is called the learning rate. See 3Blue1Brown17 for more insightful explanations.

In practice, and especially for neural networks, $\nabla_\theta \mathcal L(\theta_t,S)$ is both long to compute and unnecessary to compute too accurately at each learning step. Moreover it can nearly always be written as $\nabla_\theta \mathcal L(\theta_t,S) = \mathbb E_{x \leftarrow S} \left[ \nabla_\theta \mathcal L(\theta_t,x) \right]$. This motivated stochastic gradient descent, where the gradient $\nabla_\theta \mathcal L(\theta_t,S)$ is replaced by a stochastic estimate $\nabla_\theta \mathcal L(\theta_t,x)$, for some data point $x$ uniformly randomly drawn from $S$.

Numerous variants of this basic settings have been proposed, which we discuss below.