Stochastic gradient descent


Stochastic gradient descent (SGD) is the most widely used learning algorithm. From a very general perspective, SGD consists of iterating two steps: (1) draw some data point and (2) slightly tweak the algorithm's parameters to better fit that data point.

Formal basic model

Consider an algorithm with continuous parameters [math]\theta \in \mathbb R^d[/math]. Learning then corresponds to adequately adjusting the parameters [math]\theta[/math] of the algorithm. To do so, one introduces a loss function [math]\mathcal L(\theta,S)[/math], which usually depends on [math]\theta[/math] and on some large dataset [math]S[/math]. The gradient [math]\nabla_\theta \mathcal L(\theta,S) \in \mathbb R^d[/math] at parameters [math]\theta[/math] is the direction of change of [math]\theta[/math] along which the function [math]\mathcal L[/math] increases the most. Conversely, for [math]\eta[/math] small enough, [math]\theta' = \theta - \eta \nabla_\theta \mathcal L(\theta,S)[/math] improves upon [math]\theta[/math], in the sense that [math]\mathcal L(\theta',S) \leq \mathcal L(\theta,S)[/math].
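
For instance, the following sketch (with an assumed toy quadratic loss, not taken from the article) checks numerically that a single small gradient step decreases the loss:

<syntaxhighlight lang="python">
import numpy as np

# Minimal sketch (assumed example): the quadratic loss L(theta) = 0.5 * ||theta||^2
# has gradient nabla L(theta) = theta, so a small step against the gradient
# decreases the loss.
def loss(theta):
    return 0.5 * np.dot(theta, theta)

def gradient(theta):
    return theta

theta = np.array([3.0, -4.0])
eta = 0.1                                   # a "small enough" learning rate
theta_new = theta - eta * gradient(theta)   # theta' = theta - eta * gradient

print(loss(theta), loss(theta_new))         # 12.5  10.125 : the loss decreased
</syntaxhighlight>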

Gradient descent starts by initializing the parameters at some value [math]\theta_0[/math]. Then, it iterates the update [math]\theta_{t+1} = \theta_t - \eta_t \nabla_\theta \mathcal L(\theta_t,S)[/math]. The hyperparameter [math]\eta_t[/math] is called the learning rate. See [https://www.youtube.com/watch?v=IHZwWFHWa-w 3Blue1Brown17] for more insightful explanations.
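
As an illustration, here is a minimal sketch of this iteration; the least-squares loss and the synthetic dataset are assumptions made for the example, not part of the article:

<syntaxhighlight lang="python">
import numpy as np

# Minimal sketch of gradient descent on an assumed least-squares loss
# L(theta, S) = mean over (x_i, y_i) in S of 0.5 * (x_i . theta - y_i)^2,
# with a synthetic dataset.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                     # dataset S: 100 input points
true_theta = np.array([1.0, -2.0, 0.5])
y = X @ true_theta + 0.1 * rng.normal(size=100)   # noisy targets

def full_gradient(theta, X, y):
    # nabla_theta L(theta, S): the gradient averaged over the whole dataset
    return X.T @ (X @ theta - y) / len(y)

theta = np.zeros(3)          # theta_0: initial parameters
eta = 0.1                    # learning rate eta_t, kept constant here
for t in range(500):
    theta = theta - eta * full_gradient(theta, X, y)

print(theta)                 # converges close to true_theta
</syntaxhighlight>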

In practice, and especially for neural networks, [math]\nabla_\theta \mathcal L(\theta_t,S)[/math] is both expensive to compute and unnecessary to compute very accurately at each learning step. Moreover, it can nearly always be written as [math]\nabla_\theta \mathcal L(\theta_t,S) = \mathbb E_{x \leftarrow S} \left[ \nabla_\theta \mathcal L(\theta_t,x) \right][/math]. This motivated stochastic gradient descent, where the gradient [math]\nabla_\theta \mathcal L(\theta_t,S)[/math] is replaced by the stochastic estimate [math]\nabla_\theta \mathcal L(\theta_t,x)[/math], for a data point [math]x[/math] drawn uniformly at random from [math]S[/math].
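
Continuing the sketch above (same assumed least-squares loss and synthetic data), the full gradient can be replaced by a single-point stochastic estimate:

<syntaxhighlight lang="python">
import numpy as np

# Sketch of SGD: nabla_theta L(theta, S) is replaced by nabla_theta L(theta, x_i)
# for a single data point x_i drawn uniformly at random from S.
# The least-squares loss and synthetic data are assumptions made for the example.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_theta = np.array([1.0, -2.0, 0.5])
y = X @ true_theta + 0.1 * rng.normal(size=100)

def single_point_gradient(theta, x_i, y_i):
    # stochastic estimate of the full gradient, computed on one data point only
    return x_i * (x_i @ theta - y_i)

theta = np.zeros(3)                       # theta_0
for t in range(5000):
    i = rng.integers(len(y))              # draw a data point uniformly at random
    eta_t = 1.0 / (10.0 + t)              # decreasing learning rate eta_t
    theta = theta - eta_t * single_point_gradient(theta, X[i], y[i])

print(theta)                              # noisy, but close to true_theta
</syntaxhighlight>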

Numerous variants of this basic setting have been proposed, which we discuss below.

Classical guarantees

Variants