<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>https://robustlybeneficial.org/wiki/index.php?action=history&amp;feed=atom&amp;title=Stochastic_gradient_descent</id>
	<title>Stochastic gradient descent - Revision history</title>
	<link rel="self" type="application/atom+xml" href="https://robustlybeneficial.org/wiki/index.php?action=history&amp;feed=atom&amp;title=Stochastic_gradient_descent"/>
	<link rel="alternate" type="text/html" href="https://robustlybeneficial.org/wiki/index.php?title=Stochastic_gradient_descent&amp;action=history"/>
	<updated>2026-04-28T15:20:24Z</updated>
	<subtitle>Revision history for this page on the wiki</subtitle>
	<generator>MediaWiki 1.34.0</generator>
	<entry>
		<id>https://robustlybeneficial.org/wiki/index.php?title=Stochastic_gradient_descent&amp;diff=147&amp;oldid=prev</id>
		<title>Lê Nguyên Hoang at 15:58, 28 January 2020</title>
		<link rel="alternate" type="text/html" href="https://robustlybeneficial.org/wiki/index.php?title=Stochastic_gradient_descent&amp;diff=147&amp;oldid=prev"/>
		<updated>2020-01-28T15:58:29Z</updated>

		<summary type="html">&lt;p&gt;&lt;/p&gt;
&lt;table class=&quot;diff diff-contentalign-left&quot; data-mw=&quot;interface&quot;&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;tr class=&quot;diff-title&quot; lang=&quot;en&quot;&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #222; text-align: center;&quot;&gt;← Older revision&lt;/td&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #222; text-align: center;&quot;&gt;Revision as of 15:58, 28 January 2020&lt;/td&gt;
				&lt;/tr&gt;&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l5&quot; &gt;Line 5:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 5:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;Consider an algorithm with continuous parameters &amp;lt;math&amp;gt;\theta \in \mathbb R^d&amp;lt;/math&amp;gt;. Learning would then correspond to adequately adjusting the parameters &amp;lt;math&amp;gt;\theta&amp;lt;/math&amp;gt; of the algorithm. To do so, a &amp;lt;em&amp;gt;loss function&amp;lt;/em&amp;gt; &amp;lt;math&amp;gt;\mathcal L(\theta,S)&amp;lt;/math&amp;gt; is introduced, which usually depends on &amp;lt;math&amp;gt;\theta&amp;lt;/math&amp;gt; and on some large dataset &amp;lt;math&amp;gt;S&amp;lt;/math&amp;gt;. The gradient &amp;lt;math&amp;gt;\nabla_\theta \mathcal L(\theta,S) \in \mathbb R^d&amp;lt;/math&amp;gt; at parameters &amp;lt;math&amp;gt;\theta&amp;lt;/math&amp;gt; is the direction of change of &amp;lt;math&amp;gt;\theta&amp;lt;/math&amp;gt; along which the function &amp;lt;math&amp;gt;\mathcal L&amp;lt;/math&amp;gt; increases the most. This means that, conversely, for &amp;lt;math&amp;gt;\eta&amp;lt;/math&amp;gt; small enough, &amp;lt;math&amp;gt;\theta' = \theta - \eta \nabla_\theta \mathcal L(\theta,S)&amp;lt;/math&amp;gt; will improve upon &amp;lt;math&amp;gt;\theta&amp;lt;/math&amp;gt;, in the sense that &amp;lt;math&amp;gt;\mathcal L(\theta',S) \leq \mathcal L(\theta,S)&amp;lt;/math&amp;gt;.&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;Consider an algorithm with continuous parameters &amp;lt;math&amp;gt;\theta \in \mathbb R^d&amp;lt;/math&amp;gt;. 
Learning would then correspond to adequately adjusting the parameters &amp;lt;math&amp;gt;\theta&amp;lt;/math&amp;gt; of the algorithm. To do so, a &amp;lt;em&amp;gt;loss function&amp;lt;/em&amp;gt; &amp;lt;math&amp;gt;\mathcal L(\theta,S)&amp;lt;/math&amp;gt; is introduced, which usually depends on &amp;lt;math&amp;gt;\theta&amp;lt;/math&amp;gt; and on some large dataset &amp;lt;math&amp;gt;S&amp;lt;/math&amp;gt;. The gradient &amp;lt;math&amp;gt;\nabla_\theta \mathcal L(\theta,S) \in \mathbb R^d&amp;lt;/math&amp;gt; at parameters &amp;lt;math&amp;gt;\theta&amp;lt;/math&amp;gt; is the direction of change of &amp;lt;math&amp;gt;\theta&amp;lt;/math&amp;gt; along which the function &amp;lt;math&amp;gt;\mathcal L&amp;lt;/math&amp;gt; increases the most. This means that, conversely, for &amp;lt;math&amp;gt;\eta&amp;lt;/math&amp;gt; small enough, &amp;lt;math&amp;gt;\theta' = \theta - \eta \nabla_\theta \mathcal L(\theta,S)&amp;lt;/math&amp;gt; will improve upon &amp;lt;math&amp;gt;\theta&amp;lt;/math&amp;gt;, in the sense that &amp;lt;math&amp;gt;\mathcal L(\theta',S) \leq \mathcal L(\theta,S)&amp;lt;/math&amp;gt;.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;−&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;Gradient descent starts by initializing the parameters at some value &amp;lt;math&amp;gt;\theta_0&amp;lt;/math&amp;gt;. Then, it iterates the update &amp;lt;math&amp;gt;\theta_{t+1} = \theta_t - \eta_t \nabla_\theta \mathcal L(\theta_t,S)&amp;lt;/math&amp;gt;. The hyperparameter &amp;lt;math&amp;gt;\eta_t&amp;lt;/math&amp;gt; is called the &amp;lt;em&amp;gt;learning rate&amp;lt;/em&amp;gt;.&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;Gradient descent starts by initializing the parameters at some value &amp;lt;math&amp;gt;\theta_0&amp;lt;/math&amp;gt;. Then, it iterates the update &amp;lt;math&amp;gt;\theta_{t+1} = \theta_t - \eta_t \nabla_\theta \mathcal L(\theta_t,S)&amp;lt;/math&amp;gt;. The hyperparameter &amp;lt;math&amp;gt;\eta_t&amp;lt;/math&amp;gt; is called the &amp;lt;em&amp;gt;learning rate&amp;lt;/em&amp;gt;&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;. See [https://www.youtube.com/watch?v=IHZwWFHWa-w 3Blue1Brown17] for more insightful explanations&lt;/ins&gt;.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;In practice, and especially for neural networks, &amp;lt;math&amp;gt;\nabla_\theta \mathcal L(\theta_t,S)&amp;lt;/math&amp;gt; is expensive to compute and need not be computed very accurately at each learning step. Moreover, it can nearly always be written as &amp;lt;math&amp;gt;\nabla_\theta \mathcal L(\theta_t,S) = \mathbb E_{x \leftarrow S} \left[ \nabla_\theta \mathcal L(\theta_t,x) \right]&amp;lt;/math&amp;gt;. This motivated &amp;lt;em&amp;gt;stochastic gradient descent&amp;lt;/em&amp;gt;, where the gradient &amp;lt;math&amp;gt;\nabla_\theta \mathcal L(\theta_t,S)&amp;lt;/math&amp;gt; is replaced by a stochastic estimate &amp;lt;math&amp;gt;\nabla_\theta \mathcal L(\theta_t,x)&amp;lt;/math&amp;gt;, for some data point &amp;lt;math&amp;gt;x&amp;lt;/math&amp;gt; drawn uniformly at random from &amp;lt;math&amp;gt;S&amp;lt;/math&amp;gt;.&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;In practice, and especially for neural networks, &amp;lt;math&amp;gt;\nabla_\theta \mathcal L(\theta_t,S)&amp;lt;/math&amp;gt; is expensive to compute and need not be computed very accurately at each learning step. Moreover, it can nearly always be written as &amp;lt;math&amp;gt;\nabla_\theta \mathcal L(\theta_t,S) = \mathbb E_{x \leftarrow S} \left[ \nabla_\theta \mathcal L(\theta_t,x) \right]&amp;lt;/math&amp;gt;. 
This motivated &amp;lt;em&amp;gt;stochastic gradient descent&amp;lt;/em&amp;gt;, where the gradient &amp;lt;math&amp;gt;\nabla_\theta \mathcal L(\theta_t,S)&amp;lt;/math&amp;gt; is replaced by a stochastic estimate &amp;lt;math&amp;gt;\nabla_\theta \mathcal L(\theta_t,x)&amp;lt;/math&amp;gt;, for some data point &amp;lt;math&amp;gt;x&amp;lt;/math&amp;gt; drawn uniformly at random from &amp;lt;math&amp;gt;S&amp;lt;/math&amp;gt;.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/table&gt;</summary>
		<author><name>Lê Nguyên Hoang</name></author>
		
	</entry>
	<entry>
		<id>https://robustlybeneficial.org/wiki/index.php?title=Stochastic_gradient_descent&amp;diff=146&amp;oldid=prev</id>
		<title>Lê Nguyên Hoang: Created page with &quot;Stochastic gradient descent (SGD) is the most widely used learning algorithm. For a very general perspective, SGD consists in iterating (1) draw some data point and (2) slight...&quot;</title>
		<link rel="alternate" type="text/html" href="https://robustlybeneficial.org/wiki/index.php?title=Stochastic_gradient_descent&amp;diff=146&amp;oldid=prev"/>
		<updated>2020-01-28T15:56:51Z</updated>

		<summary type="html">&lt;p&gt;Created page with &amp;quot;Stochastic gradient descent (SGD) is the most widely used learning algorithm. For a very general perspective, SGD consists in iterating (1) draw some data point and (2) slight...&amp;quot;&lt;/p&gt;
&lt;p&gt;&lt;b&gt;New page&lt;/b&gt;&lt;/p&gt;&lt;div&gt;Stochastic gradient descent (SGD) is the most widely used learning algorithm. For a very general perspective, SGD consists in iterating (1) draw some data point and (2) slightly tweak the algorithm's parameters to better fit the data point. &lt;br /&gt;
&lt;br /&gt;
== Formal basic model ==&lt;br /&gt;
&lt;br /&gt;
Consider an algorithm with continuous parameters &amp;lt;math&amp;gt;\theta \in \mathbb R^d&amp;lt;/math&amp;gt;. Learning would then correspond to adequately adjusting the parameters &amp;lt;math&amp;gt;\theta&amp;lt;/math&amp;gt; of the algorithm. To do so, a &amp;lt;em&amp;gt;loss function&amp;lt;/em&amp;gt; &amp;lt;math&amp;gt;\mathcal L(\theta,S)&amp;lt;/math&amp;gt; is introduced, which usually depends on &amp;lt;math&amp;gt;\theta&amp;lt;/math&amp;gt; and on some large dataset &amp;lt;math&amp;gt;S&amp;lt;/math&amp;gt;. The gradient &amp;lt;math&amp;gt;\nabla_\theta \mathcal L(\theta,S) \in \mathbb R^d&amp;lt;/math&amp;gt; at parameters &amp;lt;math&amp;gt;\theta&amp;lt;/math&amp;gt; is the direction of change of &amp;lt;math&amp;gt;\theta&amp;lt;/math&amp;gt; along which the function &amp;lt;math&amp;gt;\mathcal L&amp;lt;/math&amp;gt; increases the most. This means that, conversely, for &amp;lt;math&amp;gt;\eta&amp;lt;/math&amp;gt; small enough, &amp;lt;math&amp;gt;\theta' = \theta - \eta \nabla_\theta \mathcal L(\theta,S)&amp;lt;/math&amp;gt; will improve upon &amp;lt;math&amp;gt;\theta&amp;lt;/math&amp;gt;, in the sense that &amp;lt;math&amp;gt;\mathcal L(\theta',S) \leq \mathcal L(\theta,S)&amp;lt;/math&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
Gradient descent starts by initializing the parameters at some value &amp;lt;math&amp;gt;\theta_0&amp;lt;/math&amp;gt;. Then, it iterates the update &amp;lt;math&amp;gt;\theta_{t+1} = \theta_t - \eta_t \nabla_\theta \mathcal L(\theta_t,S)&amp;lt;/math&amp;gt;. The hyperparameter &amp;lt;math&amp;gt;\eta_t&amp;lt;/math&amp;gt; is called the &amp;lt;em&amp;gt;learning rate&amp;lt;/em&amp;gt;.&lt;br /&gt;
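The update rule above is easy to sketch in code. The following minimal illustration is not part of the original page: it assumes numpy and a hypothetical quadratic loss whose gradient is known in closed form.

```python
import numpy as np

# Hypothetical quadratic loss L(theta) = ||theta - target||^2,
# whose gradient is 2 * (theta - target). Both the loss and the
# target vector are illustrative assumptions, not from the page.
target = np.array([3.0, -1.0])

def grad(theta):
    return 2.0 * (theta - target)

theta = np.zeros(2)  # theta_0: initial parameter value
eta = 0.1            # constant learning rate eta_t
for t in range(100):
    # gradient step: theta_{t+1} = theta_t - eta_t * grad L(theta_t)
    theta = theta - eta * grad(theta)

print(np.round(theta, 4))  # approaches target, the minimizer of L
```

Each step decreases the loss because eta is small relative to the curvature of this loss; a larger eta could overshoot and diverge.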
&lt;br /&gt;
In practice, and especially for neural networks, &amp;lt;math&amp;gt;\nabla_\theta \mathcal L(\theta_t,S)&amp;lt;/math&amp;gt; is expensive to compute and need not be computed very accurately at each learning step. Moreover, it can nearly always be written as &amp;lt;math&amp;gt;\nabla_\theta \mathcal L(\theta_t,S) = \mathbb E_{x \leftarrow S} \left[ \nabla_\theta \mathcal L(\theta_t,x) \right]&amp;lt;/math&amp;gt;. This motivated &amp;lt;em&amp;gt;stochastic gradient descent&amp;lt;/em&amp;gt;, where the gradient &amp;lt;math&amp;gt;\nabla_\theta \mathcal L(\theta_t,S)&amp;lt;/math&amp;gt; is replaced by a stochastic estimate &amp;lt;math&amp;gt;\nabla_\theta \mathcal L(\theta_t,x)&amp;lt;/math&amp;gt;, for some data point &amp;lt;math&amp;gt;x&amp;lt;/math&amp;gt; drawn uniformly at random from &amp;lt;math&amp;gt;S&amp;lt;/math&amp;gt;.&lt;br /&gt;
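As a rough sketch of the stochastic variant (again assuming numpy; the dataset and the per-point loss below are hypothetical stand-ins), each step uses the gradient at a single uniformly drawn data point instead of the gradient over the full dataset:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dataset S: the per-point loss L(theta, x) = ||theta - x||^2
# has gradient 2 * (theta - x), and the full gradient over S is the average
# of these per-point gradients, matching the expectation form above.
S = rng.normal(loc=2.0, scale=1.0, size=(1000, 3))

theta = np.zeros(3)
for t in range(5000):
    eta = 1.0 / (t + 10)           # decaying learning rate eta_t
    x = S[rng.integers(len(S))]    # draw one data point uniformly from S
    theta = theta - eta * 2.0 * (theta - x)

# The minimizer of the average loss over S is the mean of S,
# so theta should end up close to it.
print(np.round(theta, 2))
print(np.round(S.mean(axis=0), 2))
```

Each stochastic step is noisy, but its expectation is a full gradient step; the decaying eta_t averages the noise out over time.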
&lt;br /&gt;
Numerous variants of this basic setting have been proposed, which we discuss below.&lt;br /&gt;
&lt;br /&gt;
== Classical guarantees ==&lt;br /&gt;
&lt;br /&gt;
== Variants ==&lt;/div&gt;</summary>
		<author><name>Lê Nguyên Hoang</name></author>
		
	</entry>
</feed>