<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>https://robustlybeneficial.org/wiki/index.php?action=history&amp;feed=atom&amp;title=Robust_statistics</id>
	<title>Robust statistics - Revision history</title>
	<link rel="self" type="application/atom+xml" href="https://robustlybeneficial.org/wiki/index.php?action=history&amp;feed=atom&amp;title=Robust_statistics"/>
	<link rel="alternate" type="text/html" href="https://robustlybeneficial.org/wiki/index.php?title=Robust_statistics&amp;action=history"/>
	<updated>2026-04-28T13:33:21Z</updated>
	<subtitle>Revision history for this page on the wiki</subtitle>
	<generator>MediaWiki 1.34.0</generator>
	<entry>
		<id>https://robustlybeneficial.org/wiki/index.php?title=Robust_statistics&amp;diff=232&amp;oldid=prev</id>
		<title>Lê Nguyên Hoang: /* Robustness to additive poisoning */</title>
		<link rel="alternate" type="text/html" href="https://robustlybeneficial.org/wiki/index.php?title=Robust_statistics&amp;diff=232&amp;oldid=prev"/>
		<updated>2020-02-26T07:00:25Z</updated>

		<summary type="html">&lt;p&gt;&lt;span dir=&quot;auto&quot;&gt;&lt;span class=&quot;autocomment&quot;&gt;Robustness to additive poisoning&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;table class=&quot;diff diff-contentalign-left&quot; data-mw=&quot;interface&quot;&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;tr class=&quot;diff-title&quot; lang=&quot;en&quot;&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #222; text-align: center;&quot;&gt;← Older revision&lt;/td&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #222; text-align: center;&quot;&gt;Revision as of 07:00, 26 February 2020&lt;/td&gt;
				&lt;/tr&gt;&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l33&quot; &gt;Line 33:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 33:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;On the other hand, assuming the true distribution is a spherical normal distribution &amp;lt;math&amp;gt;\mathcal N(0,I)&amp;lt;/math&amp;gt;, Tukey proposed another approach based on identifying the directions of largest variance, since these are likely to be the &amp;quot;attack line&amp;quot; of the adversary [https://scholar.google.ch/scholar?hl=en&amp;amp;as_sdt=0%2C5&amp;amp;q=Mathematics+and+the+picturing+of+data+tukey&amp;amp;btnG= Tukey75]. &amp;lt;em&amp;gt;Tukey's median&amp;lt;/em&amp;gt; yields &amp;lt;math&amp;gt;O(\varepsilon)&amp;lt;/math&amp;gt;-error with high probability &amp;lt;math&amp;gt;1-\tau&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;n=\Omega\left( \frac{d+\log(1/\tau)}{\varepsilon^2}\right)&amp;lt;/math&amp;gt; data points. Unfortunately, Tukey's median is NP-hard to compute, and its computation is typically exponential in d.&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;On the other hand, assuming the true distribution is a spherical normal distribution &amp;lt;math&amp;gt;\mathcal N(0,I)&amp;lt;/math&amp;gt;, Tukey proposed another approach based on identifying the directions of largest variance, since these are likely to be the &amp;quot;attack line&amp;quot; of the adversary [https://scholar.google.ch/scholar?hl=en&amp;amp;as_sdt=0%2C5&amp;amp;q=Mathematics+and+the+picturing+of+data+tukey&amp;amp;btnG= Tukey75]. &amp;lt;em&amp;gt;Tukey's median&amp;lt;/em&amp;gt; yields &amp;lt;math&amp;gt;O(\varepsilon)&amp;lt;/math&amp;gt;-error with high probability &amp;lt;math&amp;gt;1-\tau&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;n=\Omega\left( \frac{d+\log(1/\tau)}{\varepsilon^2}\right)&amp;lt;/math&amp;gt; data points. Unfortunately, Tukey's median is NP-hard to compute, and its computation is typically exponential in d.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;−&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[https://arxiv.org/pdf/1811.09380.pdf CDG][https://dblp.org/rec/bibtex/conf/soda/0002D019 19] [https://arxiv.org/pdf/1906.03058 DepersinLecué][https://scholar.google.ch/scholar?hl=en&amp;amp;as_sdt=0%2C5&amp;amp;q=Robust+subgaussian+estimation+of+a+mean+vector+in+nearly+linear+time&amp;amp;btnG= 19] proved that such a bound can be achieved in quasi-linear time, even for heavy-tailed distributions with bounded but unknown variance. [https://arxiv.org/pdf/1811.09380.pdf CDG][https://dblp.org/rec/bibtex/conf/soda/0002D019 19] proved that in the strong poisoning model, their quasi-linear-time algorithm achieves &amp;lt;math&amp;gt;O(\sqrt{\varepsilon})&amp;lt;/math&amp;gt; error, which is asymptotically optimal in terms of performance and computation time. On the other hand, [https://arxiv.org/pdf/1906.03058 DepersinLecué][https://scholar.google.ch/scholar?hl=en&amp;amp;as_sdt=0%2C5&amp;amp;q=Robust+subgaussian+estimation+of+a+mean+vector+in+nearly+linear+time&amp;amp;btnG= 19] used an approach akin to median-of-means: for &amp;lt;math&amp;gt;K = \Omega(\varepsilon n)&amp;lt;/math&amp;gt;, they designed an algorithm that achieves an error &amp;lt;math&amp;gt;O\left( \sqrt{\frac{Tr(\Sigma)}{n}} + \sqrt{\frac{||\Sigma||_{op} K}{n}} \right)&amp;lt;/math&amp;gt; in time &amp;lt;math&amp;gt;\tilde O(nd + uKd)&amp;lt;/math&amp;gt;. Here, &amp;lt;math&amp;gt;u&amp;lt;/math&amp;gt; is an integer parameter needed to guarantee a subgaussian decay rate of errors, holding with probability &amp;lt;math&amp;gt;1-\exp(-\Theta(K+u))&amp;lt;/math&amp;gt;. Note that &amp;lt;math&amp;gt;Tr(\Sigma)&amp;lt;/math&amp;gt; is essentially the &amp;quot;effective dimension&amp;quot; of the data points.&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[https://arxiv.org/pdf/1811.09380.pdf CDG][https://dblp.org/rec/bibtex/conf/soda/0002D019 19] [https://arxiv.org/pdf/1906.03058 DepersinLecué][https://scholar.google.ch/scholar?hl=en&amp;amp;as_sdt=0%2C5&amp;amp;q=Robust+subgaussian+estimation+of+a+mean+vector+in+nearly+linear+time&amp;amp;btnG= 19] proved that such a bound can be achieved in quasi-linear time, even for heavy-tailed distributions with bounded but unknown variance. [https://arxiv.org/pdf/1811.09380.pdf CDG][https://dblp.org/rec/bibtex/conf/soda/0002D019 19] proved that in the strong poisoning model, their quasi-linear-time algorithm achieves &amp;lt;math&amp;gt;O(&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;||\Sigma||_{op} &lt;/ins&gt;\sqrt{\varepsilon})&amp;lt;/math&amp;gt; error, which is asymptotically optimal in terms of performance and computation time. 
On the other hand, [https://arxiv.org/pdf/1906.03058 DepersinLecué][https://scholar.google.ch/scholar?hl=en&amp;amp;as_sdt=0%2C5&amp;amp;q=Robust+subgaussian+estimation+of+a+mean+vector+in+nearly+linear+time&amp;amp;btnG= 19] used an approach akin to median-of-means: for &amp;lt;math&amp;gt;K = \Omega(\varepsilon n)&amp;lt;/math&amp;gt;, they designed an algorithm that achieves an error &amp;lt;math&amp;gt;O\left( \sqrt{\frac{Tr(\Sigma)}{n}} + \sqrt{\frac{||\Sigma||_{op} K}{n}} \right)&amp;lt;/math&amp;gt; in time &amp;lt;math&amp;gt;\tilde O(nd + uKd)&amp;lt;/math&amp;gt;. Here, &amp;lt;math&amp;gt;u&amp;lt;/math&amp;gt; is an integer parameter needed to guarantee a subgaussian decay rate of errors, holding with probability &amp;lt;math&amp;gt;1-\exp(-\Theta(K+u))&amp;lt;/math&amp;gt;. Note that &amp;lt;math&amp;gt;Tr(\Sigma)&amp;lt;/math&amp;gt; is essentially the &amp;quot;effective dimension&amp;quot; of the data points.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;Their technique relies on partitioning the data into &amp;lt;math&amp;gt;K&amp;lt;/math&amp;gt; buckets, computing the mean of each bucket, and replacing the computation of the median of the means by a covering SDP that fits all configurations of bucket poisoning. It turns out that approximations of such a covering SDP can be found in quasi-linear time [https://arxiv.org/pdf/1201.5135 PTZ][https://dblp.org/rec/bibtex/conf/spaa/PengT12 12]. Rather than being used to directly compute a mean estimator, this is actually used to perform gradient descent, starting from the coordinate-wise median and then descending along the direction provided by the covering SDP. [https://arxiv.org/pdf/1906.03058 DepersinLecué][https://scholar.google.ch/scholar?hl=en&amp;amp;as_sdt=0%2C5&amp;amp;q=Robust+subgaussian+estimation+of+a+mean+vector+in+nearly+linear+time&amp;amp;btnG= 19] proved that only &amp;lt;math&amp;gt;O(\log d)&amp;lt;/math&amp;gt; such steps are needed to guarantee their bound.&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;Their technique relies on partitioning the data into &amp;lt;math&amp;gt;K&amp;lt;/math&amp;gt; buckets, computing the mean of each bucket, and replacing the computation of the median of the means by a covering SDP that fits all configurations of bucket poisoning. It turns out that approximations of such a covering SDP can be found in quasi-linear time [https://arxiv.org/pdf/1201.5135 PTZ][https://dblp.org/rec/bibtex/conf/spaa/PengT12 12]. Rather than being used to directly compute a mean estimator, this is actually used to perform gradient descent, starting from the coordinate-wise median and then descending along the direction provided by the covering SDP. [https://arxiv.org/pdf/1906.03058 DepersinLecué][https://scholar.google.ch/scholar?hl=en&amp;amp;as_sdt=0%2C5&amp;amp;q=Robust+subgaussian+estimation+of+a+mean+vector+in+nearly+linear+time&amp;amp;btnG= 19] proved that only &amp;lt;math&amp;gt;O(\log d)&amp;lt;/math&amp;gt; such steps are needed to guarantee their bound.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/table&gt;</summary>
		<author><name>Lê Nguyên Hoang</name></author>
		
	</entry>
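	<!-- The entry above walks through the DepersinLecué19 recipe: partition the data into K buckets, average each bucket, then aggregate the bucket means robustly (the paper replaces the plain median step by a covering SDP and O(log d) guided descent steps). A minimal illustrative sketch follows; it is not their algorithm: it assumes a simple coordinate-wise median as the aggregation step, which is enough to see why bucketing blunts additive poisoning.

```python
import numpy as np

def median_of_means(X, K, seed=0):
    """Median-of-means sketch: bucket the rows of X, average each bucket,
    then take the coordinate-wise median of the bucket means.

    NOTE: simplified stand-in for DepersinLecue19, who replace the median
    step by a covering-SDP-guided descent (O(log d) steps) to obtain
    their subgaussian guarantee.
    """
    n, _ = X.shape
    rng = np.random.default_rng(seed)
    buckets = np.array_split(rng.permutation(n), K)   # K random buckets
    bucket_means = np.stack([X[b].mean(axis=0) for b in buckets])
    return np.median(bucket_means, axis=0)            # robust aggregation

# Toy additive poisoning: eps*n outliers, true mean is 0.
rng = np.random.default_rng(1)
n, d, eps = 2000, 10, 0.05
X = rng.normal(size=(n, d))
X[: int(eps * n)] += 100.0                            # poisoned points
print(np.linalg.norm(X.mean(axis=0)))                 # ruined by outliers
print(np.linalg.norm(median_of_means(X, K=400)))      # stays near 0
```

	With K = Omega(eps*n) buckets, at most a constant fraction of bucket means can be corrupted, so any aggregator with a positive breakdown point survives; roughly speaking, the coordinate-wise median can lose a sqrt(d) factor in the worst case, which is the slack the SDP-guided descent is designed to remove. -->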
	<entry>
		<id>https://robustlybeneficial.org/wiki/index.php?title=Robust_statistics&amp;diff=231&amp;oldid=prev</id>
		<title>Lê Nguyên Hoang: /* Robustness to additive poisoning */</title>
		<link rel="alternate" type="text/html" href="https://robustlybeneficial.org/wiki/index.php?title=Robust_statistics&amp;diff=231&amp;oldid=prev"/>
		<updated>2020-02-26T06:59:30Z</updated>

		<summary type="html">&lt;p&gt;&lt;span dir=&quot;auto&quot;&gt;&lt;span class=&quot;autocomment&quot;&gt;Robustness to additive poisoning&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;table class=&quot;diff diff-contentalign-left&quot; data-mw=&quot;interface&quot;&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;tr class=&quot;diff-title&quot; lang=&quot;en&quot;&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #222; text-align: center;&quot;&gt;← Older revision&lt;/td&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #222; text-align: center;&quot;&gt;Revision as of 06:59, 26 February 2020&lt;/td&gt;
				&lt;/tr&gt;&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l33&quot; &gt;Line 33:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 33:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;On the other hand, assuming the true distribution is a spherical normal distribution &amp;lt;math&amp;gt;\mathcal N(0,I)&amp;lt;/math&amp;gt;, Tukey proposed another approach based on identifying the directions of largest variance, since these are likely to be the &amp;quot;attack line&amp;quot; of the adversary [https://scholar.google.ch/scholar?hl=en&amp;amp;as_sdt=0%2C5&amp;amp;q=Mathematics+and+the+picturing+of+data+tukey&amp;amp;btnG= Tukey75]. &amp;lt;em&amp;gt;Tukey's median&amp;lt;/em&amp;gt; yields &amp;lt;math&amp;gt;O(\varepsilon)&amp;lt;/math&amp;gt;-error with high probability &amp;lt;math&amp;gt;1-\tau&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;n=\Omega\left( \frac{d+\log(1/\tau)}{\varepsilon^2}\right)&amp;lt;/math&amp;gt; data points. Unfortunately, Tukey's median is NP-hard to compute, and its computation is typically exponential in d.&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;On the other hand, assuming the true distribution is a spherical normal distribution &amp;lt;math&amp;gt;\mathcal N(0,I)&amp;lt;/math&amp;gt;, Tukey proposed another approach based on identifying the directions of largest variance, since these are likely to be the &amp;quot;attack line&amp;quot; of the adversary [https://scholar.google.ch/scholar?hl=en&amp;amp;as_sdt=0%2C5&amp;amp;q=Mathematics+and+the+picturing+of+data+tukey&amp;amp;btnG= Tukey75]. &amp;lt;em&amp;gt;Tukey's median&amp;lt;/em&amp;gt; yields &amp;lt;math&amp;gt;O(\varepsilon)&amp;lt;/math&amp;gt;-error with high probability &amp;lt;math&amp;gt;1-\tau&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;n=\Omega\left( \frac{d+\log(1/\tau)}{\varepsilon^2}\right)&amp;lt;/math&amp;gt; data points. Unfortunately, Tukey's median is NP-hard to compute, and its computation is typically exponential in d.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;−&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[https://arxiv.org/pdf/1811.09380.pdf CDG][https://dblp.org/rec/bibtex/conf/soda/0002D019 19] [https://arxiv.org/pdf/1906.03058 DepersinLecué][https://scholar.google.ch/scholar?hl=en&amp;amp;as_sdt=0%2C5&amp;amp;q=Robust+subgaussian+estimation+of+a+mean+vector+in+nearly+linear+time&amp;amp;btnG= 19] proved that such a bound can be achieved in quasi-linear time, even for heavy-tailed distributions with bounded but unknown variance. [https://arxiv.org/pdf/1811.09380.pdf CDG][https://dblp.org/rec/bibtex/conf/soda/0002D019 19] proved that in the strong poisoning model, their quasi-linear-time algorithm achieves &amp;lt;math&amp;gt;O(\sqrt{\varepsilon}&amp;lt;/math&amp;gt; error, which is asymptotically optimal in terms of performance and computation time. On the other hand, [https://arxiv.org/pdf/1906.03058 DepersinLecué][https://scholar.google.ch/scholar?hl=en&amp;amp;as_sdt=0%2C5&amp;amp;q=Robust+subgaussian+estimation+of+a+mean+vector+in+nearly+linear+time&amp;amp;btnG= 19] used an approach akin to median-of-means: for &amp;lt;math&amp;gt;K = \Omega(\varepsilon n)&amp;lt;/math&amp;gt;, they designed an algorithm that achieves an error &amp;lt;math&amp;gt;O\left( \sqrt{\frac{Tr(\Sigma)}{n}} + \sqrt{\frac{||\Sigma||_{op} K}{n}} \right)&amp;lt;/math&amp;gt; in time &amp;lt;math&amp;gt;\tilde O(nd + uKd)&amp;lt;/math&amp;gt;. Here, &amp;lt;math&amp;gt;u&amp;lt;/math&amp;gt; is an integer parameter needed to guarantee a subgaussian decay rate of errors, holding with probability &amp;lt;math&amp;gt;1-\exp(-\Theta(K+u))&amp;lt;/math&amp;gt;. Note that &amp;lt;math&amp;gt;Tr(\Sigma)&amp;lt;/math&amp;gt; is essentially the &amp;quot;effective dimension&amp;quot; of the data points.&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[https://arxiv.org/pdf/1811.09380.pdf CDG][https://dblp.org/rec/bibtex/conf/soda/0002D019 19] [https://arxiv.org/pdf/1906.03058 DepersinLecué][https://scholar.google.ch/scholar?hl=en&amp;amp;as_sdt=0%2C5&amp;amp;q=Robust+subgaussian+estimation+of+a+mean+vector+in+nearly+linear+time&amp;amp;btnG= 19] proved that such a bound can be achieved in quasi-linear time, even for heavy-tailed distributions with bounded but unknown variance. [https://arxiv.org/pdf/1811.09380.pdf CDG][https://dblp.org/rec/bibtex/conf/soda/0002D019 19] proved that in the strong poisoning model, their quasi-linear-time algorithm achieves &amp;lt;math&amp;gt;O(\sqrt{\varepsilon}&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;)&lt;/ins&gt;&amp;lt;/math&amp;gt; error, which is asymptotically optimal in terms of performance and computation time. 
On the other hand, [https://arxiv.org/pdf/1906.03058 DepersinLecué][https://scholar.google.ch/scholar?hl=en&amp;amp;as_sdt=0%2C5&amp;amp;q=Robust+subgaussian+estimation+of+a+mean+vector+in+nearly+linear+time&amp;amp;btnG= 19] used an approach akin to median-of-means: for &amp;lt;math&amp;gt;K = \Omega(\varepsilon n)&amp;lt;/math&amp;gt;, they designed an algorithm that achieves an error &amp;lt;math&amp;gt;O\left( \sqrt{\frac{Tr(\Sigma)}{n}} + \sqrt{\frac{||\Sigma||_{op} K}{n}} \right)&amp;lt;/math&amp;gt; in time &amp;lt;math&amp;gt;\tilde O(nd + uKd)&amp;lt;/math&amp;gt;. Here, &amp;lt;math&amp;gt;u&amp;lt;/math&amp;gt; is an integer parameter needed to guarantee a subgaussian decay rate of errors, holding with probability &amp;lt;math&amp;gt;1-\exp(-\Theta(K+u))&amp;lt;/math&amp;gt;. Note that &amp;lt;math&amp;gt;Tr(\Sigma)&amp;lt;/math&amp;gt; is essentially the &amp;quot;effective dimension&amp;quot; of the data points.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;Their technique relies on partitioning the data into &amp;lt;math&amp;gt;K&amp;lt;/math&amp;gt; buckets, computing the mean of each bucket, and replacing the computation of the median of the means by a covering SDP that fits all configurations of bucket poisoning. It turns out that approximations of such a covering SDP can be found in quasi-linear time [https://arxiv.org/pdf/1201.5135 PTZ][https://dblp.org/rec/bibtex/conf/spaa/PengT12 12]. Rather than being used to directly compute a mean estimator, this is actually used to perform gradient descent, starting from the coordinate-wise median and then descending along the direction provided by the covering SDP. [https://arxiv.org/pdf/1906.03058 DepersinLecué][https://scholar.google.ch/scholar?hl=en&amp;amp;as_sdt=0%2C5&amp;amp;q=Robust+subgaussian+estimation+of+a+mean+vector+in+nearly+linear+time&amp;amp;btnG= 19] proved that only &amp;lt;math&amp;gt;O(\log d)&amp;lt;/math&amp;gt; such steps are needed to guarantee their bound.&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;Their technique relies on partitioning the data into &amp;lt;math&amp;gt;K&amp;lt;/math&amp;gt; buckets, computing the mean of each bucket, and replacing the computation of the median of the means by a covering SDP that fits all configurations of bucket poisoning. It turns out that approximations of such a covering SDP can be found in quasi-linear time [https://arxiv.org/pdf/1201.5135 PTZ][https://dblp.org/rec/bibtex/conf/spaa/PengT12 12]. Rather than being used to directly compute a mean estimator, this is actually used to perform gradient descent, starting from the coordinate-wise median and then descending along the direction provided by the covering SDP. [https://arxiv.org/pdf/1906.03058 DepersinLecué][https://scholar.google.ch/scholar?hl=en&amp;amp;as_sdt=0%2C5&amp;amp;q=Robust+subgaussian+estimation+of+a+mean+vector+in+nearly+linear+time&amp;amp;btnG= 19] proved that only &amp;lt;math&amp;gt;O(\log d)&amp;lt;/math&amp;gt; such steps are needed to guarantee their bound.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/table&gt;</summary>
		<author><name>Lê Nguyên Hoang</name></author>
		
	</entry>
	<entry>
		<id>https://robustlybeneficial.org/wiki/index.php?title=Robust_statistics&amp;diff=230&amp;oldid=prev</id>
		<title>Lê Nguyên Hoang at 06:58, 26 February 2020</title>
		<link rel="alternate" type="text/html" href="https://robustlybeneficial.org/wiki/index.php?title=Robust_statistics&amp;diff=230&amp;oldid=prev"/>
		<updated>2020-02-26T06:58:43Z</updated>

		<summary type="html">&lt;p&gt;&lt;/p&gt;
&lt;table class=&quot;diff diff-contentalign-left&quot; data-mw=&quot;interface&quot;&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;tr class=&quot;diff-title&quot; lang=&quot;en&quot;&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #222; text-align: center;&quot;&gt;← Older revision&lt;/td&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #222; text-align: center;&quot;&gt;Revision as of 06:58, 26 February 2020&lt;/td&gt;
				&lt;/tr&gt;&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l3&quot; &gt;Line 3:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 3:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;This question has arguably become crucial, as large-scale algorithms perform learning from users' data. Yet, clearly, if the algorithm is used by thousands, millions or billions of users, many of the data points will likely be corrupted, because of bugs [https://www.youtube.com/watch?v=yb2zkxHDfUE standupmaths20], or because some users will maliciously want to exploit or attack the algorithm. This latter case is known as a &amp;lt;em&amp;gt;poisoning attack&amp;lt;/em&amp;gt;.&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;This question has arguably become crucial, as large-scale algorithms perform learning from users' data. Yet, clearly, if the algorithm is used by thousands, millions or billions of users, many of the data points will likely be corrupted, because of bugs [https://www.youtube.com/watch?v=yb2zkxHDfUE standupmaths20], or because some users will maliciously want to exploit or attack the algorithm. This latter case is known as a &amp;lt;em&amp;gt;poisoning attack&amp;lt;/em&amp;gt;.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;−&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;Over the last three years, there have been fascinating advances, both for classical statistical tasks [https://arxiv.org/pdf/1911.05911.pdf DiakonikolasKane][https://dblp.org/rec/bibtex/journals/corr/abs-1911-05911 19] [https://arxiv.org/pdf/1906.03058 DepersinLecué][https://scholar.google.ch/scholar?hl=en&amp;amp;as_sdt=0%2C5&amp;amp;q=Robust+subgaussian+estimation+of+a+mean+vector+in+nearly+linear+time&amp;amp;btnG= 19] and [[modern machine learning]] [http://papers.nips.cc/paper/6617-machine-learning-with-adversaries-byzantine-tolerant-gradient-descent.pdf BEGS][https://dblp.org/rec/bibtex/conf/nips/BlanchardMGS17 17], especially in very high-dimensional settings such as [[training neural networks]] [http://proceedings.mlr.press/v80/mhamdi18a EGR]. We discussed robust statistics in [https://www.youtube.com/watch?v=QguWgfGsG-k RB2].&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;Over the last three years, there have been fascinating advances, both for classical statistical tasks [https://arxiv.org/pdf/1911.05911.pdf DiakonikolasKane][https://dblp.org/rec/bibtex/journals/corr/abs-1911-05911 &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;19] [https://arxiv.org/pdf/1811.09380.pdf CDG][https://dblp.org/rec/bibtex/conf/soda/0002D019 &lt;/ins&gt;19] [https://arxiv.org/pdf/1906.03058 DepersinLecué][https://scholar.google.ch/scholar?hl=en&amp;amp;as_sdt=0%2C5&amp;amp;q=Robust+subgaussian+estimation+of+a+mean+vector+in+nearly+linear+time&amp;amp;btnG= 19] and [[modern machine learning]] [http://papers.nips.cc/paper/6617-machine-learning-with-adversaries-byzantine-tolerant-gradient-descent.pdf BEGS][https://dblp.org/rec/bibtex/conf/nips/BlanchardMGS17 17], especially in very high-dimensional settings such as [[training neural networks]] [http://proceedings.mlr.press/v80/mhamdi18a EGR]. We discussed robust statistics in [https://www.youtube.com/watch?v=QguWgfGsG-k RB2].&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;== Example of the median ==&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;== Example of the median ==&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l33&quot; &gt;Line 33:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 33:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;On the other hand, assuming the true distribution is a spherical normal distribution &amp;lt;math&amp;gt;\mathcal N(0,I)&amp;lt;/math&amp;gt;, Tukey proposed another approach based on identifying the directions of largest variance, since these are likely to be the &amp;quot;attack line&amp;quot; of the adversary [https://scholar.google.ch/scholar?hl=en&amp;amp;as_sdt=0%2C5&amp;amp;q=Mathematics+and+the+picturing+of+data+tukey&amp;amp;btnG= Tukey75]. &amp;lt;em&amp;gt;Tukey's median&amp;lt;/em&amp;gt; yields &amp;lt;math&amp;gt;O(\varepsilon)&amp;lt;/math&amp;gt;-error with high probability &amp;lt;math&amp;gt;1-\tau&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;n=\Omega\left( \frac{d+\log(1/\tau)}{\varepsilon^2}\right)&amp;lt;/math&amp;gt; data points. Unfortunately, Tukey's median is NP-hard to compute, and its computation is typically exponential in d.&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;On the other hand, assuming the true distribution is a spherical normal distribution &amp;lt;math&amp;gt;\mathcal N(0,I)&amp;lt;/math&amp;gt;, Tukey proposed another approach based on identifying the directions of largest variance, since these are likely to be the &amp;quot;attack line&amp;quot; of the adversary [https://scholar.google.ch/scholar?hl=en&amp;amp;as_sdt=0%2C5&amp;amp;q=Mathematics+and+the+picturing+of+data+tukey&amp;amp;btnG= Tukey75]. &amp;lt;em&amp;gt;Tukey's median&amp;lt;/em&amp;gt; yields &amp;lt;math&amp;gt;O(\varepsilon)&amp;lt;/math&amp;gt;-error with high probability &amp;lt;math&amp;gt;1-\tau&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;n=\Omega\left( \frac{d+\log(1/\tau)}{\varepsilon^2}\right)&amp;lt;/math&amp;gt; data points. Unfortunately, Tukey's median is NP-hard to compute, and its computation is typically exponential in d.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;−&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[https://arxiv.org/pdf/1906.03058 DepersinLecué][https://scholar.google.ch/scholar?hl=en&amp;amp;as_sdt=0%2C5&amp;amp;q=Robust+subgaussian+estimation+of+a+mean+vector+in+nearly+linear+time&amp;amp;btnG= 19] proved that such a bound can be achieved &lt;del class=&quot;diffchange diffchange-inline&quot;&gt;for additive poisoning &lt;/del&gt;in quasi-linear time, even for heavy-tailed distributions with bounded but unknown variance. &lt;del class=&quot;diffchange diffchange-inline&quot;&gt;More precisely&lt;/del&gt;, &lt;del class=&quot;diffchange diffchange-inline&quot;&gt;by using &lt;/del&gt;an approach akin to median-of-means: for &amp;lt;math&amp;gt;K = \Omega(\varepsilon n)&amp;lt;/math&amp;gt;, they designed an algorithm that achieves an error &amp;lt;math&amp;gt;O\left( \sqrt{\frac{Tr(\Sigma)}{n}} + \sqrt{\frac{||\Sigma||_{op} K}{n}} \right)&amp;lt;/math&amp;gt; in time &amp;lt;math&amp;gt;\tilde O(nd + uKd)&amp;lt;/math&amp;gt;. Here, &amp;lt;math&amp;gt;u&amp;lt;/math&amp;gt; is an integer parameter needed to guarantee a subgaussian decay rate of errors, holding with probability &amp;lt;math&amp;gt;1-\exp(-\Theta(K+u))&amp;lt;/math&amp;gt;. Note that &amp;lt;math&amp;gt;Tr(\Sigma)&amp;lt;/math&amp;gt; is essentially the &amp;quot;effective dimension&amp;quot; of the data points.&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;[https://arxiv.org/pdf/1811.09380.pdf CDG][https://dblp.org/rec/bibtex/conf/soda/0002D019 19] &lt;/ins&gt;[https://arxiv.org/pdf/1906.03058 DepersinLecué][https://scholar.google.ch/scholar?hl=en&amp;amp;as_sdt=0%2C5&amp;amp;q=Robust+subgaussian+estimation+of+a+mean+vector+in+nearly+linear+time&amp;amp;btnG= 19] proved that such a bound can be achieved in quasi-linear time, even for heavy-tailed distributions with bounded but unknown variance. &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;[https://arxiv.org/pdf/1811.09380.pdf CDG][https://dblp.org/rec/bibtex/conf/soda/0002D019 19] proved that in the strong poisoning model, their quasi-linear-time algorithm achieves &amp;lt;math&amp;gt;O(\sqrt{\varepsilon}&amp;lt;/math&amp;gt; error, which is asymptotically optimal in terms of performance and computation time. On the other hand&lt;/ins&gt;, &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;[https://arxiv.org/pdf/1906.03058 DepersinLecué][https://scholar.google.ch/scholar?hl=en&amp;amp;as_sdt=0%2C5&amp;amp;q=Robust+subgaussian+estimation+of+a+mean+vector+in+nearly+linear+time&amp;amp;btnG= 19] used &lt;/ins&gt;an approach akin to median-of-means: for &amp;lt;math&amp;gt;K = \Omega(\varepsilon n)&amp;lt;/math&amp;gt;, they designed an algorithm that achieves an error &amp;lt;math&amp;gt;O\left( \sqrt{\frac{Tr(\Sigma)}{n}} + \sqrt{\frac{||\Sigma||_{op} K}{n}} \right)&amp;lt;/math&amp;gt; in time &amp;lt;math&amp;gt;\tilde O(nd + uKd)&amp;lt;/math&amp;gt;. 
Here, &amp;lt;math&amp;gt;u&amp;lt;/math&amp;gt; is an integer parameter needed to guarantee a subgaussian decay rate of errors, holding with probability &amp;lt;math&amp;gt;1-\exp(-\Theta(K+u))&amp;lt;/math&amp;gt;. Note that &amp;lt;math&amp;gt;Tr(\Sigma)&amp;lt;/math&amp;gt; is essentially the &amp;quot;effective dimension&amp;quot; of the data points.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;Their technique relies on partitioning the data into &amp;lt;math&amp;gt;K&amp;lt;/math&amp;gt; buckets, computing the mean of each bucket, and replacing the computation of the median of the means by a covering SDP that fits all configurations of bucket poisoning. It turns out that approximations of such a covering SDP can be found in quasi-linear time [https://arxiv.org/pdf/1201.5135 PTZ][https://dblp.org/rec/bibtex/conf/spaa/PengT12 12]. Rather than being used to directly compute a mean estimator, this is actually used to perform gradient descent, starting from the coordinate-wise median and then descending along the direction provided by the covering SDP. [https://arxiv.org/pdf/1906.03058 DepersinLecué][https://scholar.google.ch/scholar?hl=en&amp;amp;as_sdt=0%2C5&amp;amp;q=Robust+subgaussian+estimation+of+a+mean+vector+in+nearly+linear+time&amp;amp;btnG= 19] proved that only &amp;lt;math&amp;gt;O(\log d)&amp;lt;/math&amp;gt; such steps are needed to guarantee their bound.&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;Their technique relies on partitioning the data into &amp;lt;math&amp;gt;K&amp;lt;/math&amp;gt; buckets, computing the mean of each bucket, and replacing the computation of the median of the means by a covering SDP that fits all configurations of bucket poisoning. It turns out that approximations of such a covering SDP can be found in quasi-linear time [https://arxiv.org/pdf/1201.5135 PTZ][https://dblp.org/rec/bibtex/conf/spaa/PengT12 12]. Rather than being used to directly compute a mean estimator, this is actually used to perform gradient descent, starting from the coordinate-wise median and then descending along the direction provided by the covering SDP. [https://arxiv.org/pdf/1906.03058 DepersinLecué][https://scholar.google.ch/scholar?hl=en&amp;amp;as_sdt=0%2C5&amp;amp;q=Robust+subgaussian+estimation+of+a+mean+vector+in+nearly+linear+time&amp;amp;btnG= 19] proved that only &amp;lt;math&amp;gt;O(\log d)&amp;lt;/math&amp;gt; such steps are needed to guarantee their bound.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/table&gt;</summary>
		<author><name>Lê Nguyên Hoang</name></author>
		
	</entry>
	<entry>
		<id>https://robustlybeneficial.org/wiki/index.php?title=Robust_statistics&amp;diff=203&amp;oldid=prev</id>
		<title>Lê Nguyên Hoang: /* Poisoning models */</title>
		<link rel="alternate" type="text/html" href="https://robustlybeneficial.org/wiki/index.php?title=Robust_statistics&amp;diff=203&amp;oldid=prev"/>
		<updated>2020-02-11T08:50:26Z</updated>

		<summary type="html">&lt;p&gt;&lt;span dir=&quot;auto&quot;&gt;&lt;span class=&quot;autocomment&quot;&gt;Poisoning models&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;table class=&quot;diff diff-contentalign-left&quot; data-mw=&quot;interface&quot;&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;tr class=&quot;diff-title&quot; lang=&quot;en&quot;&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #222; text-align: center;&quot;&gt;← Older revision&lt;/td&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #222; text-align: center;&quot;&gt;Revision as of 08:50, 11 February 2020&lt;/td&gt;
				&lt;/tr&gt;&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l21&quot; &gt;Line 21:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 21:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;Other models include an adversary with only erasing power, or an adversary that must choose its &amp;lt;em&amp;gt;outliers&amp;lt;/em&amp;gt; without knowledge of the values of the &amp;lt;em&amp;gt;inliers&amp;lt;/em&amp;gt;. Evidently, any guarantee for stronger poisoning models will also hold for such weaker poisoning models [https://arxiv.org/pdf/1911.05911.pdf DiakonikolasKane][https://dblp.org/rec/bibtex/journals/corr/abs-1911-05911 19].&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;Other models include an adversary with only erasing power, or an adversary that must choose its &amp;lt;em&amp;gt;outliers&amp;lt;/em&amp;gt; without knowledge of the values of the &amp;lt;em&amp;gt;inliers&amp;lt;/em&amp;gt;. Evidently, any guarantee for stronger poisoning models will also hold for such weaker poisoning models [https://arxiv.org/pdf/1911.05911.pdf DiakonikolasKane][https://dblp.org/rec/bibtex/journals/corr/abs-1911-05911 19].&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;−&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;Perhaps the most general form of poisoning attack is the following. Consider a &amp;quot;true dataset&amp;quot; &amp;lt;math&amp;gt;D&amp;lt;/math&amp;gt;. However, the attacker gets to distort the dataset using a distortion function &amp;lt;math&amp;gt;f \in \mathcal F&amp;lt;/math&amp;gt;, thereby yielding &amp;lt;math&amp;gt;f(D)&amp;lt;/math&amp;gt;. Suppose we now have a best-possible machine learning algorithm &amp;lt;math&amp;gt;ML&amp;lt;/math&amp;gt; that learns from data. It would ideally compute &amp;lt;math&amp;gt;ML(D)&amp;lt;/math&amp;gt;. But we can only exploit &amp;lt;math&amp;gt;f(D)&amp;lt;/math&amp;gt;, via some hopefully robust machine learning algorithm &amp;lt;math&amp;gt;RML(f(D))&amp;lt;/math&amp;gt;. What we would like is to guarantee that &amp;lt;math&amp;gt;d(ML(D),RML(f(D))) &amp;lt; bound&amp;lt;/math&amp;gt;, for a suitable distance &amp;lt;math&amp;gt;d&amp;lt;/math&amp;gt; and any &amp;lt;math&amp;gt;f \in \mathcal F&amp;lt;/math&amp;gt;, where &amp;lt;math&amp;gt;RML&amp;lt;/math&amp;gt; is tractable. A further generalization of this could consist in assuming a prior probabilistic belief on the set &amp;lt;mathcal&lt;del class=&quot;diffchange diffchange-inline&quot;&gt;&amp;gt;&lt;/del&gt;F&amp;lt;/&lt;del class=&quot;diffchange diffchange-inline&quot;&gt;mathcal&lt;/del&gt;&amp;gt; of attack models that we need to defend against.&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;Perhaps the most general form of poisoning attack is the following. Consider a &amp;quot;true dataset&amp;quot; &amp;lt;math&amp;gt;D&amp;lt;/math&amp;gt;. However, the attacker gets to distort the dataset using a distortion function &amp;lt;math&amp;gt;f \in \mathcal F&amp;lt;/math&amp;gt;, thereby yielding &amp;lt;math&amp;gt;f(D)&amp;lt;/math&amp;gt;. Suppose we now have a best-possible machine learning algorithm &amp;lt;math&amp;gt;ML&amp;lt;/math&amp;gt; that learns from data. It would ideally compute &amp;lt;math&amp;gt;ML(D)&amp;lt;/math&amp;gt;. But we can only exploit &amp;lt;math&amp;gt;f(D)&amp;lt;/math&amp;gt;, via some hopefully robust machine learning algorithm &amp;lt;math&amp;gt;RML(f(D))&amp;lt;/math&amp;gt;. What we would like is to guarantee that &amp;lt;math&amp;gt;d(ML(D),RML(f(D))) &amp;lt; bound&amp;lt;/math&amp;gt;, for a suitable distance &amp;lt;math&amp;gt;d&amp;lt;/math&amp;gt; and any &amp;lt;math&amp;gt;f \in \mathcal F&amp;lt;/math&amp;gt;, where &amp;lt;math&amp;gt;RML&amp;lt;/math&amp;gt; is tractable.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot;&gt; &lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt; &lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot;&gt; &lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;A further generalization of this could consist in assuming a prior probabilistic belief on the set &amp;lt;&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;math&amp;gt;\&lt;/ins&gt;mathcal F&amp;lt;/&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;math&lt;/ins&gt;&amp;gt; of attack models that we need to defend against&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;. This would correspond to the study of Byzantine Bayesian learning, which we may model as &amp;lt;math&amp;gt;\max_{RML} \mathbb E_{\mathcal F} \left[ \min_{f \in \mathcal F} \mathbb E[u|D,RML(f(D))] \right]&amp;lt;/math&amp;gt;&lt;/ins&gt;.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;It is noteworthy that robustness to such attacks is useful, even if there is no adversary. Indeed, data may still get corrupted, because of bugs, crashes or misuse by a human operator (see for instance the case of gene mutations caused by Excel that plagued genetic research [https://www.sciencemag.org/news/2016/08/one-five-genetics-papers-contains-errors-thanks-microsoft-excel Boddy16]). Our algorithms need to remain performant despite such issues. An algorithm robust to strong attacks will be robust to such weaker flaws.&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;It is noteworthy that robustness to such attacks is useful, even if there is no adversary. Indeed, data may still get corrupted, because of bugs, crashes or misuse by a human operator (see for instance the case of gene mutations caused by Excel that plagued genetic research [https://www.sciencemag.org/news/2016/08/one-five-genetics-papers-contains-errors-thanks-microsoft-excel Boddy16]). Our algorithms need to remain performant despite such issues. An algorithm robust to strong attacks will be robust to such weaker flaws.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/table&gt;</summary>
		<author><name>Lê Nguyên Hoang</name></author>
		
	</entry>
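	<!-- The entry above formalizes poisoning as a worst-case requirement: d(ML(D), RML(f(D))) < bound for every distortion f in F. A one-dimensional toy sketch follows, assuming (purely for illustration) that F is additive poisoning appending eps*n arbitrary points, that ML is the empirical mean, and that the median plays the role of the robust algorithm RML.

```python
import numpy as np

def additive_poison(D, eps, value):
    """A distortion f in F: append eps*|D| adversarial points at `value`."""
    k = int(eps * len(D))
    return np.concatenate([D, np.full(k, value)])

rng = np.random.default_rng(0)
D = rng.normal(size=10_000)      # the "true dataset" D
ideal = D.mean()                 # ML(D): what we would ideally compute

for value in (1e3, 1e6, 1e9):    # ever stronger attacks f in F
    fD = additive_poison(D, eps=0.1, value=value)
    mean_err = abs(fD.mean() - ideal)        # grows without bound
    median_err = abs(np.median(fD) - ideal)  # stays O(eps)
    print(f"value={value:.0e}  mean err={mean_err:.3g}  median err={median_err:.3g}")
```

	No choice of f in F moves the median by more than the eps-quantile shift (about 0.13 here), so a uniform bound over all of F exists for it; the mean's error grows with the attack value, so no such bound exists for the mean. -->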
	<entry>
		<id>https://robustlybeneficial.org/wiki/index.php?title=Robust_statistics&amp;diff=163&amp;oldid=prev</id>
		<title>Lê Nguyên Hoang at 08:33, 2 February 2020</title>
		<link rel="alternate" type="text/html" href="https://robustlybeneficial.org/wiki/index.php?title=Robust_statistics&amp;diff=163&amp;oldid=prev"/>
		<updated>2020-02-02T08:33:28Z</updated>

		<summary type="html">&lt;p&gt;&lt;/p&gt;
&lt;table class=&quot;diff diff-contentalign-left&quot; data-mw=&quot;interface&quot;&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;tr class=&quot;diff-title&quot; lang=&quot;en&quot;&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #222; text-align: center;&quot;&gt;← Older revision&lt;/td&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #222; text-align: center;&quot;&gt;Revision as of 08:33, 2 February 2020&lt;/td&gt;
				&lt;/tr&gt;&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l3&quot; &gt;Line 3:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 3:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;This question has arguably become crucial, as large-scale algorithms perform learning from users' data. Yet, clearly, if the algorithm is used by thousands, millions or billions of users, many of the data points will likely be corrupted, because of bugs [https://www.youtube.com/watch?v=yb2zkxHDfUE standupmaths20], or because some users will maliciously want to exploit or attack the algorithm. This latter case is known as a &amp;lt;em&amp;gt;poisoning attack&amp;lt;/em&amp;gt;.&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;This question has arguably become crucial, as large-scale algorithms perform learning from users' data. Yet, clearly, if the algorithm is used by thousands, millions or billions of users, many of the data points will likely be corrupted, because of bugs [https://www.youtube.com/watch?v=yb2zkxHDfUE standupmaths20], or because some users will maliciously want to exploit or attack the algorithm. This latter case is known as a &amp;lt;em&amp;gt;poisoning attack&amp;lt;/em&amp;gt;.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;−&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;Over the last three years, there have been fascinating advances, both for classical statistical tasks [https://arxiv.org/pdf/1911.05911.pdf DiakonikolasKane][https://dblp.org/rec/bibtex/journals/corr/abs-1911-05911 19] [https://arxiv.org/pdf/1906.03058 DepersinLecué][https://scholar.google.ch/scholar?hl=en&amp;amp;as_sdt=0%2C5&amp;amp;q=Robust+subgaussian+estimation+of+a+mean+vector+in+nearly+linear+time&amp;amp;btnG= 19] and [[modern machine learning]] [http://papers.nips.cc/paper/6617-machine-learning-with-adversaries-byzantine-tolerant-gradient-descent.pdf BEGS][https://dblp.org/rec/bibtex/conf/nips/BlanchardMGS17 17], especially in very high-dimensional settings such as [[training neural networks]] [http://proceedings.mlr.press/v80/mhamdi18a EGR].&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;Over the last three years, there have been fascinating advances, both for classical statistical tasks [https://arxiv.org/pdf/1911.05911.pdf DiakonikolasKane][https://dblp.org/rec/bibtex/journals/corr/abs-1911-05911 19] [https://arxiv.org/pdf/1906.03058 DepersinLecué][https://scholar.google.ch/scholar?hl=en&amp;amp;as_sdt=0%2C5&amp;amp;q=Robust+subgaussian+estimation+of+a+mean+vector+in+nearly+linear+time&amp;amp;btnG= 19] and [[modern machine learning]] [http://papers.nips.cc/paper/6617-machine-learning-with-adversaries-byzantine-tolerant-gradient-descent.pdf BEGS][https://dblp.org/rec/bibtex/conf/nips/BlanchardMGS17 17], especially in very high-dimensional settings such as [[training neural networks]] [http://proceedings.mlr.press/v80/mhamdi18a EGR&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;]. We discussed robust statistics in [https://www.youtube.com/watch?v=QguWgfGsG-k RB2&lt;/ins&gt;].&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;== Example of the median ==&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;== Example of the median ==&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/table&gt;</summary>
		<author><name>Lê Nguyên Hoang</name></author>
		
	</entry>
	<entry>
		<id>https://robustlybeneficial.org/wiki/index.php?title=Robust_statistics&amp;diff=113&amp;oldid=prev</id>
		<title>El Mahdi El Mhamdi at 10:31, 27 January 2020</title>
		<link rel="alternate" type="text/html" href="https://robustlybeneficial.org/wiki/index.php?title=Robust_statistics&amp;diff=113&amp;oldid=prev"/>
		<updated>2020-01-27T10:31:46Z</updated>

		<summary type="html">&lt;p&gt;&lt;/p&gt;
&lt;table class=&quot;diff diff-contentalign-left&quot; data-mw=&quot;interface&quot;&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;tr class=&quot;diff-title&quot; lang=&quot;en&quot;&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #222; text-align: center;&quot;&gt;← Older revision&lt;/td&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #222; text-align: center;&quot;&gt;Revision as of 10:31, 27 January 2020&lt;/td&gt;
				&lt;/tr&gt;&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l3&quot; &gt;Line 3:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 3:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;This question has arguably become crucial, as large-scale algorithms perform learning from users' data. Yet, clearly, if the algorithm is used by thousands, millions or billions of users, many of the data will likely be corrupted, because of bugs [https://www.youtube.com/watch?v=yb2zkxHDfUE standupmaths20], or because some users will maliciously want to exploit or attack the algorithm. This latter case is known as a &amp;lt;em&amp;gt;poisoning attack&amp;lt;/em&amp;gt;.&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;This question has arguably become crucial, as large-scale algorithms perform learning from users' data. Yet, clearly, if the algorithm is used by thousands, millions or billions of users, many of the data will likely be corrupted, because of bugs [https://www.youtube.com/watch?v=yb2zkxHDfUE standupmaths20], or because some users will maliciously want to exploit or attack the algorithm. This latter case is known as a &amp;lt;em&amp;gt;poisoning attack&amp;lt;/em&amp;gt;.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;−&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;Over the last three years, there have been fascinating recent advances, both for classical statistical tasks [https://arxiv.org/pdf/1911.05911.pdf DiakonikolasKane][https://dblp.org/rec/bibtex/journals/corr/abs-1911-05911 19] [https://arxiv.org/pdf/1906.03058 DepersinLecué][https://scholar.google.ch/scholar?hl=en&amp;amp;as_sdt=0%2C5&amp;amp;q=Robust+subgaussian+estimation+of+a+mean+vector+in+nearly+linear+time&amp;amp;btnG= 19] and [[&lt;del class=&quot;diffchange diffchange-inline&quot;&gt;neural networks&lt;/del&gt;]] [http://papers.nips.cc/paper/6617-machine-learning-with-adversaries-byzantine-tolerant-gradient-descent.pdf &lt;del class=&quot;diffchange diffchange-inline&quot;&gt;BMGS&lt;/del&gt;][https://dblp.org/rec/bibtex/conf/nips/BlanchardMGS17 17].&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;Over the last three years, there have been fascinating recent advances, both for classical statistical tasks [https://arxiv.org/pdf/1911.05911.pdf DiakonikolasKane][https://dblp.org/rec/bibtex/journals/corr/abs-1911-05911 19] [https://arxiv.org/pdf/1906.03058 DepersinLecué][https://scholar.google.ch/scholar?hl=en&amp;amp;as_sdt=0%2C5&amp;amp;q=Robust+subgaussian+estimation+of+a+mean+vector+in+nearly+linear+time&amp;amp;btnG= 19] and [[&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;modern machine learning&lt;/ins&gt;]] [http://papers.nips.cc/paper/6617-machine-learning-with-adversaries-byzantine-tolerant-gradient-descent.pdf &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;BEGS&lt;/ins&gt;][https://dblp.org/rec/bibtex/conf/nips/BlanchardMGS17 17&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;] especially in very high dimensional settings such as [[training neural networks]] [http://proceedings.mlr.press/v80/mhamdi18a EGR&lt;/ins&gt;].&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;== Example of the median ==&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;== Example of the median ==&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l61&quot; &gt;Line 61:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 61:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;The main application of robust statistics (at least relevant to AI ethics) seems to be the aggregation of [[stochastic gradient descent|stochastic gradients]] for [[neural networks]]. In this setting, even a linear-time algorithm in &amp;lt;math&amp;gt;\Omega(nd)&amp;lt;/math&amp;gt; is impractical if we demand &amp;lt;math&amp;gt;n \geq d&amp;lt;/math&amp;gt; (which is necessary to have dimension-independent guarantees). In practice, this setting is often carried out with batches whose size &amp;lt;math&amp;gt;n&amp;lt;/math&amp;gt; is significantly smaller than &amp;lt;math&amp;gt;d&amp;lt;/math&amp;gt;. In fact, despite conventional wisdom and PAC-learning theory, it seems that &amp;lt;math&amp;gt;n \ll d&amp;lt;/math&amp;gt; may be a desirable setting to do neural network learning (see [[overfitting]] where we discuss double descent). For &amp;lt;math&amp;gt;n \ll d&amp;lt;/math&amp;gt;, is there a gain in using algorithms more complex than coordinate-wise median?&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;The main application of robust statistics (at least relevant to AI ethics) seems to be the aggregation of [[stochastic gradient descent|stochastic gradients]] for [[neural networks]]. In this setting, even a linear-time algorithm in &amp;lt;math&amp;gt;\Omega(nd)&amp;lt;/math&amp;gt; is impractical if we demand &amp;lt;math&amp;gt;n \geq d&amp;lt;/math&amp;gt; (which is necessary to have dimension-independent guarantees). In practice, this setting is often carried out with batches whose size &amp;lt;math&amp;gt;n&amp;lt;/math&amp;gt; is significantly smaller than &amp;lt;math&amp;gt;d&amp;lt;/math&amp;gt;. In fact, despite conventional wisdom and PAC-learning theory, it seems that &amp;lt;math&amp;gt;n \ll d&amp;lt;/math&amp;gt; may be a desirable setting to do neural network learning (see [[overfitting]] where we discuss double descent). For &amp;lt;math&amp;gt;n \ll d&amp;lt;/math&amp;gt;, is there a gain in using algorithms more complex than coordinate-wise median?&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;−&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[http://papers.nips.cc/paper/6617-machine-learning-with-adversaries-byzantine-tolerant-gradient-descent.pdf &lt;del class=&quot;diffchange diffchange-inline&quot;&gt;BMGS&lt;/del&gt;][https://dblp.org/rec/bibtex/conf/nips/BlanchardMGS17 17] proposed Krum and multi-Krum, aggregation algorithms for this setting that have weaker robustness guarantees but are more efficient. Is it possible to improve upon them? (A sketch of Krum and of the coordinate-wise median follows this table.)&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[http://papers.nips.cc/paper/6617-machine-learning-with-adversaries-byzantine-tolerant-gradient-descent.pdf &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;BEGS&lt;/ins&gt;][https://dblp.org/rec/bibtex/conf/nips/BlanchardMGS17 17] proposed Krum and multi-Krum, aggregation algorithms for this setting that have weaker robustness guarantees but are more efficient. Is it possible to improve upon them? (A sketch of Krum and of the coordinate-wise median follows this table.)&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot;&gt; &lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt; &lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot;&gt; &lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;== Robust statistics for agreement and multi-agent settings ==&lt;/ins&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot;&gt; &lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt; &lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot;&gt; &lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;Another avenue where robust statistics are needed is the increasingly multi-agent setting on which modern AI is built. [https://dl.acm.org/doi/10.1145/2488608.2488657 MH2013]&lt;/ins&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/table&gt;</summary>
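Below is a minimal Python/NumPy sketch of the two aggregation rules discussed in this revision: the coordinate-wise median and Krum. The Krum score follows the description in [BEGS17] (sum of squared distances to the n − f − 2 nearest neighbours); everything else, including the function names and the toy data, is our own illustration rather than code from the cited papers.

```python
# Hedged sketch of two Byzantine-robust aggregation rules; the Krum scoring
# rule follows [BEGS17], while names and toy data are illustrative only.
import numpy as np

def coordinate_wise_median(gradients: np.ndarray) -> np.ndarray:
    """Aggregate n gradients of shape (n, d) by taking the median
    independently along each of the d coordinates (O(nd) up to sorting)."""
    return np.median(gradients, axis=0)

def krum(gradients: np.ndarray, f: int) -> np.ndarray:
    """Krum with n gradients, at most f of them Byzantine: score each
    gradient by the sum of squared distances to its n - f - 2 nearest
    neighbours, and return the gradient with the lowest score."""
    n = len(gradients)
    diffs = gradients[:, None, :] - gradients[None, :, :]
    dist2 = (diffs ** 2).sum(axis=-1)          # pairwise squared distances
    scores = np.empty(n)
    for i in range(n):
        others = np.delete(dist2[i], i)        # drop the zero self-distance
        scores[i] = np.sort(others)[: n - f - 2].sum()
    return gradients[np.argmin(scores)]

# Toy batch: 10 honest gradients near the all-ones vector, 2 poisoned ones.
rng = np.random.default_rng(0)
honest = 1.0 + 0.1 * rng.standard_normal((10, 5))
poisoned = 100.0 * rng.standard_normal((2, 5))
batch = np.vstack([honest, poisoned])
print(coordinate_wise_median(batch))           # close to (1, ..., 1)
print(krum(batch, f=2))                        # one of the honest gradients
```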
		<author><name>El Mahdi El Mhamdi</name></author>
		
	</entry>
	<entry>
		<id>https://robustlybeneficial.org/wiki/index.php?title=Robust_statistics&amp;diff=111&amp;oldid=prev</id>
		<title>Lê Nguyên Hoang: /* What if there are more outliers than inliers? */</title>
		<link rel="alternate" type="text/html" href="https://robustlybeneficial.org/wiki/index.php?title=Robust_statistics&amp;diff=111&amp;oldid=prev"/>
		<updated>2020-01-26T20:52:03Z</updated>

		<summary type="html">&lt;p&gt;&lt;span dir=&quot;auto&quot;&gt;&lt;span class=&quot;autocomment&quot;&gt;What if there are more outliers than inliers?&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;table class=&quot;diff diff-contentalign-left&quot; data-mw=&quot;interface&quot;&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;tr class=&quot;diff-title&quot; lang=&quot;en&quot;&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #222; text-align: center;&quot;&gt;← Older revision&lt;/td&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #222; text-align: center;&quot;&gt;Revision as of 20:52, 26 January 2020&lt;/td&gt;
				&lt;/tr&gt;&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l55&quot; &gt;Line 55:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 55:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;Second, one might try to apply robust Bayesian inference to the data, which would yield a set of posterior beliefs. This framework has yet to be defined.&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;Second, one might try to apply robust Bayesian inference to the data, which would yield a set of posterior beliefs. This framework has yet to be defined.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;−&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;Third, we may assume that the data are more or less [[data certification|certified]]. This is a natural setting, as we humans often judge the reliability of raw data based on its source, and we usually consider a continuum of reliability, rather than a clear-cut binary classification. Algorithms should probably do the same at some point, but there does not yet seem to be an algorithmic framework to pose this problem. In particular, it could be interesting to analyze threat models. (An illustrative sketch follows this table.)&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;Third, we may assume that the data are more or less [[data certification|certified]]. This is a natural setting, as we humans often judge the reliability of raw data based on its source, and we usually consider a continuum of reliability, rather than a clear-cut binary classification. Algorithms should probably do the same at some point, but there does not yet seem to be an algorithmic framework to pose this problem. In particular, it could be interesting to analyze threat models &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;where different degrees or sorts of certification have different levels of liability&lt;/ins&gt;. (An illustrative sketch follows this table.)&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;== Robust statistics for neural networks ==&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;== Robust statistics for neural networks ==&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/table&gt;</summary>
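As noted above, no algorithmic framework for graded certification exists yet, so the following Python sketch is purely hypothetical: it merely illustrates one way a continuum of reliability could enter an estimator, by attaching a certification score in [0, 1] to each data point and replacing the plain median with a weighted median. All names and numbers are invented for illustration.

```python
# Purely hypothetical: certification scores in [0, 1] feed a weighted median.
import numpy as np

def weighted_median(values: np.ndarray, weights: np.ndarray) -> float:
    """Smallest value v such that the total weight of points <= v reaches
    half of the overall weight (reduces to the median for equal weights)."""
    order = np.argsort(values)
    values, weights = values[order], weights[order]
    cumulative = np.cumsum(weights)
    return float(values[np.searchsorted(cumulative, 0.5 * weights.sum())])

# Well-certified points (weight 1.0) near 0; dubious ones (weight 0.1) at 50.
values = np.array([0.1, -0.2, 0.0, 0.3, 50.0, 50.0, 50.0])
weights = np.array([1.0, 1.0, 1.0, 1.0, 0.1, 0.1, 0.1])
print(weighted_median(values, weights))  # stays near 0 despite the outliers
```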
		<author><name>Lê Nguyên Hoang</name></author>
		
	</entry>
	<entry>
		<id>https://robustlybeneficial.org/wiki/index.php?title=Robust_statistics&amp;diff=110&amp;oldid=prev</id>
		<title>Lê Nguyên Hoang: /* What if there are more outliers than inliers? */</title>
		<link rel="alternate" type="text/html" href="https://robustlybeneficial.org/wiki/index.php?title=Robust_statistics&amp;diff=110&amp;oldid=prev"/>
		<updated>2020-01-26T20:51:31Z</updated>

		<summary type="html">&lt;p&gt;&lt;span dir=&quot;auto&quot;&gt;&lt;span class=&quot;autocomment&quot;&gt;What if there are more outliers than inliers?&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;table class=&quot;diff diff-contentalign-left&quot; data-mw=&quot;interface&quot;&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;tr class=&quot;diff-title&quot; lang=&quot;en&quot;&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #222; text-align: center;&quot;&gt;← Older revision&lt;/td&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #222; text-align: center;&quot;&gt;Revision as of 20:51, 26 January 2020&lt;/td&gt;
				&lt;/tr&gt;&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l55&quot; &gt;Line 55:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 55:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;Second, one might try to apply robust Bayesian inference to the data, which would yield a set of posterior beliefs. This framework has yet to be defined.&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;Second, one might try to apply robust Bayesian inference to the data, which would yield a set of posterior beliefs. This framework has yet to be defined.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;−&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;Third, we may assume that the data are more or less certified. This is a natural setting, as we humans often judge the reliability of raw data based on its source, and we usually consider a continuum of reliability, rather than a clear-cut binary classification. Algorithms should probably do the same at some point, but there does not yet seem to be an algorithmic framework to pose this problem.&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;Third, we may assume that the data are more or less &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;[[data certification|&lt;/ins&gt;certified&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;]]&lt;/ins&gt;. This is a natural setting, as we humans often judge the reliability of raw data based on its source, and we usually consider a continuum of reliability, rather than a clear-cut binary classification. Algorithms should probably do the same at some point, but there does not yet seem to be an algorithmic framework to pose this problem&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;. In particular, it could be interesting to analyze threat models&lt;/ins&gt;.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;== Robust statistics for neural networks ==&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;== Robust statistics for neural networks ==&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/table&gt;</summary>
		<author><name>Lê Nguyên Hoang</name></author>
		
	</entry>
	<entry>
		<id>https://robustlybeneficial.org/wiki/index.php?title=Robust_statistics&amp;diff=99&amp;oldid=prev</id>
		<title>Lê Nguyên Hoang at 08:00, 25 January 2020</title>
		<link rel="alternate" type="text/html" href="https://robustlybeneficial.org/wiki/index.php?title=Robust_statistics&amp;diff=99&amp;oldid=prev"/>
		<updated>2020-01-25T08:00:14Z</updated>

		<summary type="html">&lt;p&gt;&lt;/p&gt;
&lt;table class=&quot;diff diff-contentalign-left&quot; data-mw=&quot;interface&quot;&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;tr class=&quot;diff-title&quot; lang=&quot;en&quot;&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #222; text-align: center;&quot;&gt;← Older revision&lt;/td&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #222; text-align: center;&quot;&gt;Revision as of 08:00, 25 January 2020&lt;/td&gt;
				&lt;/tr&gt;&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l1&quot; &gt;Line 1:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 1:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;Robust statistics is the problem of estimating parameters from unreliable empirical data. Typically, suppose that a fraction of the training dataset is compromised. Can we design algorithms that nevertheless succeed in learning adequately from such a partially compromised dataset?&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;Robust statistics is the problem of estimating parameters from unreliable empirical data. Typically, suppose that a fraction of the training dataset is compromised. Can we design algorithms that nevertheless succeed in learning adequately from such a partially compromised dataset?&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;−&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;This question has arguably become crucial, as large-scale algorithms perform learning from users' data. Yet, clearly, if the algorithm is used by thousands, millions or billions of users, many of the data will likely be corrupted, because of bugs [https://www.&lt;del class=&quot;diffchange diffchange-inline&quot;&gt;sciencemag&lt;/del&gt;.&lt;del class=&quot;diffchange diffchange-inline&quot;&gt;org&lt;/del&gt;/&lt;del class=&quot;diffchange diffchange-inline&quot;&gt;news/2016/08/one-five-genetics-papers-contains-errors-thanks-microsoft-excel Boddy16&lt;/del&gt;], or because some users will maliciously want to exploit or attack the algorithm. This latter case is known as a &amp;lt;em&amp;gt;poisoning attack&amp;lt;/em&amp;gt;.&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;This question has arguably become crucial, as large-scale algorithms perform learning from users' data. Yet, clearly, if the algorithm is used by thousands, millions or billions of users, many of the data will likely be corrupted, because of bugs [https://www.&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;youtube&lt;/ins&gt;.&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;com&lt;/ins&gt;/&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;watch?v=yb2zkxHDfUE standupmaths20&lt;/ins&gt;], or because some users will maliciously want to exploit or attack the algorithm. This latter case is known as a &amp;lt;em&amp;gt;poisoning attack&amp;lt;/em&amp;gt;.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;Over the last three years, there have been fascinating recent advances, both for classical statistical tasks [https://arxiv.org/pdf/1911.05911.pdf DiakonikolasKane][https://dblp.org/rec/bibtex/journals/corr/abs-1911-05911 19] [https://arxiv.org/pdf/1906.03058 DepersinLecué][https://scholar.google.ch/scholar?hl=en&amp;amp;as_sdt=0%2C5&amp;amp;q=Robust+subgaussian+estimation+of+a+mean+vector+in+nearly+linear+time&amp;amp;btnG= 19] and [[neural networks]] [http://papers.nips.cc/paper/6617-machine-learning-with-adversaries-byzantine-tolerant-gradient-descent.pdf BMGS][https://dblp.org/rec/bibtex/conf/nips/BlanchardMGS17 17].&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;Over the last three years, there have been fascinating recent advances, both for classical statistical tasks [https://arxiv.org/pdf/1911.05911.pdf DiakonikolasKane][https://dblp.org/rec/bibtex/journals/corr/abs-1911-05911 19] [https://arxiv.org/pdf/1906.03058 DepersinLecué][https://scholar.google.ch/scholar?hl=en&amp;amp;as_sdt=0%2C5&amp;amp;q=Robust+subgaussian+estimation+of+a+mean+vector+in+nearly+linear+time&amp;amp;btnG= 19] and [[neural networks]] [http://papers.nips.cc/paper/6617-machine-learning-with-adversaries-byzantine-tolerant-gradient-descent.pdf BMGS][https://dblp.org/rec/bibtex/conf/nips/BlanchardMGS17 17].&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/table&gt;</summary>
		<author><name>Lê Nguyên Hoang</name></author>
		
	</entry>
	<entry>
		<id>https://robustlybeneficial.org/wiki/index.php?title=Robust_statistics&amp;diff=96&amp;oldid=prev</id>
		<title>Lê Nguyên Hoang: /* Poisoning models */</title>
		<link rel="alternate" type="text/html" href="https://robustlybeneficial.org/wiki/index.php?title=Robust_statistics&amp;diff=96&amp;oldid=prev"/>
		<updated>2020-01-24T17:09:10Z</updated>

		<summary type="html">&lt;p&gt;&lt;span dir=&quot;auto&quot;&gt;&lt;span class=&quot;autocomment&quot;&gt;Poisoning models&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;table class=&quot;diff diff-contentalign-left&quot; data-mw=&quot;interface&quot;&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;tr class=&quot;diff-title&quot; lang=&quot;en&quot;&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #222; text-align: center;&quot;&gt;← Older revision&lt;/td&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #222; text-align: center;&quot;&gt;Revision as of 17:09, 24 January 2020&lt;/td&gt;
				&lt;/tr&gt;&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l21&quot; &gt;Line 21:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 21:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;Other models include an adversary with only erasing power, or an adversary that must choose its &amp;lt;em&amp;gt;outliers&amp;lt;/em&amp;gt; without knowledge of the values of the &amp;lt;em&amp;gt;inliers&amp;lt;/em&amp;gt;. Evidently, any guarantee for such weaker poisoning models will also hold for stronger poisoning models [https://arxiv.org/pdf/1911.05911.pdf DiakonikolasKane][https://dblp.org/rec/bibtex/journals/corr/abs-1911-05911 19].&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;Other models include an adversary with only erasing power, or an adversary that must choose its &amp;lt;em&amp;gt;outliers&amp;lt;/em&amp;gt; without knowledge of the values of the &amp;lt;em&amp;gt;inliers&amp;lt;/em&amp;gt;. Evidently, any guarantee for such weaker poisoning models will also hold for stronger poisoning models [https://arxiv.org/pdf/1911.05911.pdf DiakonikolasKane][https://dblp.org/rec/bibtex/journals/corr/abs-1911-05911 19].&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;−&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;Perhaps the most general form of poisoning attack is the following. Consider a &amp;quot;true dataset&amp;quot; &amp;lt;math&amp;gt;D&amp;lt;/math&amp;gt;. However, the attacker gets to distort the dataset using a distortion function &amp;lt;math&amp;gt;f \in \mathcal F&amp;lt;/math&amp;gt;, thereby yielding &amp;lt;math&amp;gt;f(D)&amp;lt;/math&amp;gt;. Suppose we now have a best-possible machine learning algorithm &amp;lt;math&amp;gt;ML&amp;lt;/math&amp;gt; that learns from data. It would ideally compute &amp;lt;math&amp;gt;ML(D)&amp;lt;/math&amp;gt;. But we can only exploit &amp;lt;math&amp;gt;f(D)&amp;lt;/math&amp;gt;, by some hopefully robust machine learning algorithm &amp;lt;math&amp;gt;RML(f(D))&amp;lt;/math&amp;gt;. What we would like is to guarantee that &amp;lt;math&amp;gt;d(ML(D),RML(f(D))) &amp;lt; bound&amp;lt;/math&amp;gt;, for a suitable distance &amp;lt;math&amp;gt;d&amp;lt;/math&amp;gt; and any &amp;lt;math&amp;gt;f \in \mathcal F&amp;lt;/math&amp;gt;, where &amp;lt;math&amp;gt;RML&amp;lt;/math&amp;gt; is tractable. (A numerical illustration follows this table.)&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;Perhaps the most general form of poisoning attack is the following. Consider a &amp;quot;true dataset&amp;quot; &amp;lt;math&amp;gt;D&amp;lt;/math&amp;gt;. However, the attacker gets to distort the dataset using a distortion function &amp;lt;math&amp;gt;f \in \mathcal F&amp;lt;/math&amp;gt;, thereby yielding &amp;lt;math&amp;gt;f(D)&amp;lt;/math&amp;gt;. Suppose we now have a best-possible machine learning algorithm &amp;lt;math&amp;gt;ML&amp;lt;/math&amp;gt; that learns from data. It would ideally compute &amp;lt;math&amp;gt;ML(D)&amp;lt;/math&amp;gt;. But we can only exploit &amp;lt;math&amp;gt;f(D)&amp;lt;/math&amp;gt;, by some hopefully robust machine learning algorithm &amp;lt;math&amp;gt;RML(f(D))&amp;lt;/math&amp;gt;. What we would like is to guarantee that &amp;lt;math&amp;gt;d(ML(D),RML(f(D))) &amp;lt; bound&amp;lt;/math&amp;gt;, for a suitable distance &amp;lt;math&amp;gt;d&amp;lt;/math&amp;gt; and any &amp;lt;math&amp;gt;f \in \mathcal F&amp;lt;/math&amp;gt;, where &amp;lt;math&amp;gt;RML&amp;lt;/math&amp;gt; is tractable&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;. A further generalization of this could consist in assuming a prior probabilistic belief on the set &amp;lt;math&amp;gt;\mathcal F&amp;lt;/math&amp;gt; of attack models that we need to defend against&lt;/ins&gt;. (A numerical illustration follows this table.)&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;It is noteworthy that robustness to such attacks is useful, even if there is no adversary. Indeed, data may still get corrupted, because of bugs, crashes or misuse by a human operator (see for instance the case of gene mutations caused by Excel that plagued genetic research [https://www.sciencemag.org/news/2016/08/one-five-genetics-papers-contains-errors-thanks-microsoft-excel Boddy16]). Our algorithms need to remain performant despite such issues. An algorithm robust to strong attacks will be robust to such weaker flaws.&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;It is noteworthy that robustness to such attacks is useful, even if there is no adversary. Indeed, data may still get corrupted, because of bugs, crashes or misuse by a human operator (see for instance the case of gene mutations caused by Excel that plagued genetic research [https://www.sciencemag.org/news/2016/08/one-five-genetics-papers-contains-errors-thanks-microsoft-excel Boddy16]). Our algorithms need to remain performant despite such issues. An algorithm robust to strong attacks will be robust to such weaker flaws.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/table&gt;</summary>
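To make the guarantee d(ML(D), RML(f(D))) &lt; bound concrete, here is a small numerical sketch (our own, not from the cited references), assuming D is drawn from N(0, 1), ML is the empirical mean, f replaces an ε-fraction of the points with a huge constant, and RML is the median. The naive mean on f(D) is shifted by roughly ε times the constant, while the median stays within O(ε) of ML(D).

```python
# Hypothetical illustration: mean vs. median under eps-fraction poisoning.
import numpy as np

rng = np.random.default_rng(1)
n, eps = 10_000, 0.1
D = rng.standard_normal(n)        # "true dataset" D ~ N(0, 1)
fD = D.copy()
fD[: int(eps * n)] = 1e6          # attacker's distortion f: corrupt eps*n points

ml_on_true = D.mean()             # ML(D): what we would ideally compute
naive = fD.mean()                 # mean on f(D): shifted by ~ eps * 1e6
robust = np.median(fD)            # RML(f(D)): off by O(eps) only

print(abs(naive - ml_on_true))    # ~ 1e5, grows with the poisoning constant
print(abs(robust - ml_on_true))   # ~ 0.14 here, i.e. O(eps)
```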
		<author><name>Lê Nguyên Hoang</name></author>
		
	</entry>
</feed>