<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>https://robustlybeneficial.org/wiki/index.php?action=history&amp;feed=atom&amp;title=Stochastic_gradient_descent</id>
	<title>Stochastic gradient descent - Revision history</title>
	<link rel="self" type="application/atom+xml" href="https://robustlybeneficial.org/wiki/index.php?action=history&amp;feed=atom&amp;title=Stochastic_gradient_descent"/>
	<link rel="alternate" type="text/html" href="https://robustlybeneficial.org/wiki/index.php?title=Stochastic_gradient_descent&amp;action=history"/>
	<updated>2026-04-28T15:20:24Z</updated>
	<subtitle>Revision history for this page on the wiki</subtitle>
	<generator>MediaWiki 1.34.0</generator>
	<entry>
		<id>https://robustlybeneficial.org/wiki/index.php?title=Stochastic_gradient_descent&amp;diff=147&amp;oldid=prev</id>
		<title>Lê Nguyên Hoang at 15:58, 28 January 2020</title>
		<link rel="alternate" type="text/html" href="https://robustlybeneficial.org/wiki/index.php?title=Stochastic_gradient_descent&amp;diff=147&amp;oldid=prev"/>
		<updated>2020-01-28T15:58:29Z</updated>

		<summary type="html">&lt;p&gt;&lt;/p&gt;
&lt;table class=&quot;diff diff-contentalign-left&quot; data-mw=&quot;interface&quot;&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;tr class=&quot;diff-title&quot; lang=&quot;en&quot;&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #222; text-align: center;&quot;&gt;← Older revision&lt;/td&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #222; text-align: center;&quot;&gt;Revision as of 15:58, 28 January 2020&lt;/td&gt;
				&lt;/tr&gt;&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l5&quot; &gt;Line 5:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 5:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;Consider an algorithm with continuous parameters &amp;lt;math&amp;gt;\theta \in \mathbb R^d&amp;lt;/math&amp;gt;. Learning would then correspond to adequately adjusting the parameters &amp;lt;math&amp;gt;\theta&amp;lt;/math&amp;gt; of the algorithm. To do so, a &amp;lt;em&amp;gt;loss function&amp;lt;/em&amp;gt; &amp;lt;math&amp;gt;\mathcal L(\theta,S)&amp;lt;/math&amp;gt; is introduced, which usually depends on &amp;lt;math&amp;gt;\theta&amp;lt;/math&amp;gt; and on some large dataset &amp;lt;math&amp;gt;S&amp;lt;/math&amp;gt;. The gradient &amp;lt;math&amp;gt;\nabla_\theta \mathcal L(\theta,S) \in \mathbb R^d&amp;lt;/math&amp;gt; at parameters &amp;lt;math&amp;gt;\theta&amp;lt;/math&amp;gt; is the direction of change of &amp;lt;math&amp;gt;\theta&amp;lt;/math&amp;gt; along which the function &amp;lt;math&amp;gt;\mathcal L&amp;lt;/math&amp;gt; increases the most. This means that, conversely, for &amp;lt;math&amp;gt;\eta&amp;lt;/math&amp;gt; small enough, &amp;lt;math&amp;gt;\theta' = \theta - \eta \nabla_\theta \mathcal L(\theta,S)&amp;lt;/math&amp;gt; will improve upon &amp;lt;math&amp;gt;\theta&amp;lt;/math&amp;gt;, in the sense that &amp;lt;math&amp;gt;\mathcal L(\theta',S) \leq \mathcal L(\theta,S)&amp;lt;/math&amp;gt;.&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;Consider an algorithm with continuous parameters &amp;lt;math&amp;gt;\theta \in \mathbb R^d&amp;lt;/math&amp;gt;. 
Learning would then correspond to adequately adjusting the parameters &amp;lt;math&amp;gt;\theta&amp;lt;/math&amp;gt; of the algorithm. To do so, a &amp;lt;em&amp;gt;loss function&amp;lt;/em&amp;gt; &amp;lt;math&amp;gt;\mathcal L(\theta,S)&amp;lt;/math&amp;gt; is introduced, which usually depends on &amp;lt;math&amp;gt;\theta&amp;lt;/math&amp;gt; and on some large dataset &amp;lt;math&amp;gt;S&amp;lt;/math&amp;gt;. The gradient &amp;lt;math&amp;gt;\nabla_\theta \mathcal L(\theta,S) \in \mathbb R^d&amp;lt;/math&amp;gt; at parameters &amp;lt;math&amp;gt;\theta&amp;lt;/math&amp;gt; is the direction of change of &amp;lt;math&amp;gt;\theta&amp;lt;/math&amp;gt; along which the function &amp;lt;math&amp;gt;\mathcal L&amp;lt;/math&amp;gt; increases the most. This means that, conversely, for &amp;lt;math&amp;gt;\eta&amp;lt;/math&amp;gt; small enough, &amp;lt;math&amp;gt;\theta' = \theta - \eta \nabla_\theta \mathcal L(\theta,S)&amp;lt;/math&amp;gt; will improve upon &amp;lt;math&amp;gt;\theta&amp;lt;/math&amp;gt;, in the sense that &amp;lt;math&amp;gt;\mathcal L(\theta',S) \leq \mathcal L(\theta,S)&amp;lt;/math&amp;gt;.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;−&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;Gradient descent starts by initializing the parameters at some value &amp;lt;math&amp;gt;\theta_0&amp;lt;/math&amp;gt;. Then, it iterates the update &amp;lt;math&amp;gt;\theta_{t+1} = \theta_t - \eta_t \nabla_\theta \mathcal L(\theta_t,S)&amp;lt;/math&amp;gt;. The hyperparameter &amp;lt;math&amp;gt;\eta_t&amp;lt;/math&amp;gt; is called the &amp;lt;em&amp;gt;learning rate&amp;lt;/em&amp;gt;.&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;Gradient descent starts by initializing the parameters at some value &amp;lt;math&amp;gt;\theta_0&amp;lt;/math&amp;gt;. Then, it iterates the update &amp;lt;math&amp;gt;\theta_{t+1} = \theta_t - \eta_t \nabla_\theta \mathcal L(\theta_t,S)&amp;lt;/math&amp;gt;. The hyperparameter &amp;lt;math&amp;gt;\eta_t&amp;lt;/math&amp;gt; is called the &amp;lt;em&amp;gt;learning rate&amp;lt;/em&amp;gt;&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;. See [https://www.youtube.com/watch?v=IHZwWFHWa-w 3Blue1Brown17] for more insightful explanations&lt;/ins&gt;.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;In practice, and especially for neural networks, &amp;lt;math&amp;gt;\nabla_\theta \mathcal L(\theta_t,S)&amp;lt;/math&amp;gt; is expensive to compute and need not be computed very accurately at each learning step. Moreover, it can nearly always be written as &amp;lt;math&amp;gt;\nabla_\theta \mathcal L(\theta_t,S) = \mathbb E_{x \leftarrow S} \left[ \nabla_\theta \mathcal L(\theta_t,x) \right]&amp;lt;/math&amp;gt;. This motivated &amp;lt;em&amp;gt;stochastic gradient descent&amp;lt;/em&amp;gt;, where the gradient &amp;lt;math&amp;gt;\nabla_\theta \mathcal L(\theta_t,S)&amp;lt;/math&amp;gt; is replaced by a stochastic estimate &amp;lt;math&amp;gt;\nabla_\theta \mathcal L(\theta_t,x)&amp;lt;/math&amp;gt;, for some data point &amp;lt;math&amp;gt;x&amp;lt;/math&amp;gt; drawn uniformly at random from &amp;lt;math&amp;gt;S&amp;lt;/math&amp;gt;.&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;In practice, and especially for neural networks, &amp;lt;math&amp;gt;\nabla_\theta \mathcal L(\theta_t,S)&amp;lt;/math&amp;gt; is expensive to compute and need not be computed very accurately at each learning step. Moreover, it can nearly always be written as &amp;lt;math&amp;gt;\nabla_\theta \mathcal L(\theta_t,S) = \mathbb E_{x \leftarrow S} \left[ \nabla_\theta \mathcal L(\theta_t,x) \right]&amp;lt;/math&amp;gt;. 
This motivated &amp;lt;em&amp;gt;stochastic gradient descent&amp;lt;/em&amp;gt;, where the gradient &amp;lt;math&amp;gt;\nabla_\theta \mathcal L(\theta_t,S)&amp;lt;/math&amp;gt; is replaced by a stochastic estimate &amp;lt;math&amp;gt;\nabla_\theta \mathcal L(\theta_t,x)&amp;lt;/math&amp;gt;, for some data point &amp;lt;math&amp;gt;x&amp;lt;/math&amp;gt; drawn uniformly at random from &amp;lt;math&amp;gt;S&amp;lt;/math&amp;gt;.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/table&gt;</summary>
		<author><name>Lê Nguyên Hoang</name></author>
		
	</entry>
	<entry>
		<id>https://robustlybeneficial.org/wiki/index.php?title=Stochastic_gradient_descent&amp;diff=146&amp;oldid=prev</id>
		<title>Lê Nguyên Hoang: Created page with &quot;Stochastic gradient descent (SGD) is the most widely used learning algorithm. For a very general perspective, SGD consists in iterating (1) draw some data point and (2) slight...&quot;</title>
		<link rel="alternate" type="text/html" href="https://robustlybeneficial.org/wiki/index.php?title=Stochastic_gradient_descent&amp;diff=146&amp;oldid=prev"/>
		<updated>2020-01-28T15:56:51Z</updated>

		<summary type="html">&lt;p&gt;Created page with &amp;quot;Stochastic gradient descent (SGD) is the most widely used learning algorithm. For a very general perspective, SGD consists in iterating (1) draw some data point and (2) slight...&amp;quot;&lt;/p&gt;
&lt;p&gt;&lt;b&gt;New page&lt;/b&gt;&lt;/p&gt;&lt;div&gt;Stochastic gradient descent (SGD) is the most widely used learning algorithm. For a very general perspective, SGD consists in iterating (1) draw some data point and (2) slightly tweak the algorithm's parameters to better fit the data point. &lt;br /&gt;
&lt;br /&gt;
== Formal basic model ==&lt;br /&gt;
&lt;br /&gt;
Consider an algorithm with continuous parameters &amp;lt;math&amp;gt;\theta \in \mathbb R^d&amp;lt;/math&amp;gt;. Learning would then correspond to adequately adjusting the parameters &amp;lt;math&amp;gt;\theta&amp;lt;/math&amp;gt; of the algorithm. To do so, a &amp;lt;em&amp;gt;loss function&amp;lt;/em&amp;gt; &amp;lt;math&amp;gt;\mathcal L(\theta,S)&amp;lt;/math&amp;gt; is introduced, which usually depends on &amp;lt;math&amp;gt;\theta&amp;lt;/math&amp;gt; and on some large dataset &amp;lt;math&amp;gt;S&amp;lt;/math&amp;gt;. The gradient &amp;lt;math&amp;gt;\nabla_\theta \mathcal L(\theta,S) \in \mathbb R^d&amp;lt;/math&amp;gt; at parameters &amp;lt;math&amp;gt;\theta&amp;lt;/math&amp;gt; is the direction of change of &amp;lt;math&amp;gt;\theta&amp;lt;/math&amp;gt; along which the function &amp;lt;math&amp;gt;\mathcal L&amp;lt;/math&amp;gt; increases the most. This means that, conversely, for &amp;lt;math&amp;gt;\eta&amp;lt;/math&amp;gt; small enough, &amp;lt;math&amp;gt;\theta' = \theta - \eta \nabla_\theta \mathcal L(\theta,S)&amp;lt;/math&amp;gt; will improve upon &amp;lt;math&amp;gt;\theta&amp;lt;/math&amp;gt;, in the sense that &amp;lt;math&amp;gt;\mathcal L(\theta',S) \leq \mathcal L(\theta,S)&amp;lt;/math&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
Gradient descent starts by initializing the parameters at some value &amp;lt;math&amp;gt;\theta_0&amp;lt;/math&amp;gt;. Then, it iterates the update &amp;lt;math&amp;gt;\theta_{t+1} = \theta_t - \eta_t \nabla_\theta \mathcal L(\theta_t,S)&amp;lt;/math&amp;gt;. The hyperparameter &amp;lt;math&amp;gt;\eta_t&amp;lt;/math&amp;gt; is called the &amp;lt;em&amp;gt;learning rate&amp;lt;/em&amp;gt;.&lt;br /&gt;
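The update rule above is easy to sketch in code. The following minimal illustration is not part of the original page: it assumes numpy and a hypothetical quadratic loss whose gradient is known in closed form.

```python
import numpy as np

# Hypothetical quadratic loss L(theta) = ||theta - target||^2,
# whose gradient is 2 * (theta - target). Both the loss and the
# target vector are illustrative assumptions, not from the page.
target = np.array([3.0, -1.0])

def grad(theta):
    return 2.0 * (theta - target)

theta = np.zeros(2)  # theta_0: initial parameter value
eta = 0.1            # constant learning rate eta_t
for t in range(100):
    # gradient step: theta_{t+1} = theta_t - eta_t * grad L(theta_t)
    theta = theta - eta * grad(theta)

print(np.round(theta, 4))  # approaches target, the minimizer of L
```

Each step decreases the loss because eta is small relative to the curvature of this loss; a larger eta could overshoot and diverge.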
&lt;br /&gt;
In practice, and especially for neural networks, &amp;lt;math&amp;gt;\nabla_\theta \mathcal L(\theta_t,S)&amp;lt;/math&amp;gt; is expensive to compute and need not be computed very accurately at each learning step. Moreover, it can nearly always be written as &amp;lt;math&amp;gt;\nabla_\theta \mathcal L(\theta_t,S) = \mathbb E_{x \leftarrow S} \left[ \nabla_\theta \mathcal L(\theta_t,x) \right]&amp;lt;/math&amp;gt;. This motivated &amp;lt;em&amp;gt;stochastic gradient descent&amp;lt;/em&amp;gt;, where the gradient &amp;lt;math&amp;gt;\nabla_\theta \mathcal L(\theta_t,S)&amp;lt;/math&amp;gt; is replaced by a stochastic estimate &amp;lt;math&amp;gt;\nabla_\theta \mathcal L(\theta_t,x)&amp;lt;/math&amp;gt;, for some data point &amp;lt;math&amp;gt;x&amp;lt;/math&amp;gt; drawn uniformly at random from &amp;lt;math&amp;gt;S&amp;lt;/math&amp;gt;.&lt;br /&gt;
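As a rough sketch of the stochastic variant (again assuming numpy; the dataset and the per-point loss below are hypothetical stand-ins), each step uses the gradient at a single uniformly drawn data point instead of the gradient over the full dataset:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dataset S: the per-point loss L(theta, x) = ||theta - x||^2
# has gradient 2 * (theta - x), and the full gradient over S is the average
# of these per-point gradients, matching the expectation form above.
S = rng.normal(loc=2.0, scale=1.0, size=(1000, 3))

theta = np.zeros(3)
for t in range(5000):
    eta = 1.0 / (t + 10)           # decaying learning rate eta_t
    x = S[rng.integers(len(S))]    # draw one data point uniformly from S
    theta = theta - eta * 2.0 * (theta - x)

# The minimizer of the average loss over S is the mean of S,
# so theta should end up close to it.
print(np.round(theta, 2))
print(np.round(S.mean(axis=0), 2))
```

Each stochastic step is noisy, but its expectation is a full gradient step; the decaying eta_t averages the noise out over time.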
&lt;br /&gt;
Numerous variants of this basic setting have been proposed, which we discuss below.&lt;br /&gt;
&lt;br /&gt;
== Classical guarantees ==&lt;br /&gt;
&lt;br /&gt;
== Variants ==&lt;/div&gt;</summary>
		<author><name>Lê Nguyên Hoang</name></author>
		
	</entry>
</feed>