# Difference between revisions of "Robust statistics"

Line 3: | Line 3: | ||

This question has arguably become crucial, as large-scale algorithms perform learning from users' data. Yet, clearly, if the algorithm is used by thousands, millions or billions of users, many of the data will likely be corrupted, because of bugs [https://www.youtube.com/watch?v=yb2zkxHDfUE standupmaths20], or because some users will maliciously want to exploit or attack the algorithm. This latter case is known as a <em>poisoning attack</em>. | This question has arguably become crucial, as large-scale algorithms perform learning from users' data. Yet, clearly, if the algorithm is used by thousands, millions or billions of users, many of the data will likely be corrupted, because of bugs [https://www.youtube.com/watch?v=yb2zkxHDfUE standupmaths20], or because some users will maliciously want to exploit or attack the algorithm. This latter case is known as a <em>poisoning attack</em>. | ||

− | Over the last three years, there have been fascinating recent advances, both for classical statistical tasks [https://arxiv.org/pdf/1911.05911.pdf DiakonikolasKane][https://dblp.org/rec/bibtex/journals/corr/abs-1911-05911 19] [https://arxiv.org/pdf/1906.03058 DepersinLecué][https://scholar.google.ch/scholar?hl=en&as_sdt=0%2C5&q=Robust+subgaussian+estimation+of+a+mean+vector+in+nearly+linear+time&btnG= 19] and [[ | + | Over the last three years, there have been fascinating recent advances, both for classical statistical tasks [https://arxiv.org/pdf/1911.05911.pdf DiakonikolasKane][https://dblp.org/rec/bibtex/journals/corr/abs-1911-05911 19] [https://arxiv.org/pdf/1906.03058 DepersinLecué][https://scholar.google.ch/scholar?hl=en&as_sdt=0%2C5&q=Robust+subgaussian+estimation+of+a+mean+vector+in+nearly+linear+time&btnG= 19] and [[modern machine learning]] [http://papers.nips.cc/paper/6617-machine-learning-with-adversaries-byzantine-tolerant-gradient-descent.pdf BEGS][https://dblp.org/rec/bibtex/conf/nips/BlanchardMGS17 17] especially in very high dimensional settings such as [[training neural networks]] [http://proceedings.mlr.press/v80/mhamdi18a EGR]. |

== Example of the median == | == Example of the median == | ||

Line 61: | Line 61: | ||

The main application of robust statistics (at least relevant to AI ethics) seems to be the aggregation of [[stochastic gradient descent|stochastic gradients]] for [[neural networks]]. In this setting, even a linear-time algorithm in <math>\Omega(nd)</math> is impractical if we demand <math>n \geq d</math> (which is necessary to have dimension-independent guarantees). In practice, this setting is often carried out with batches whose size <math>n</math> is significantly smaller than <math>d</math>. In fact, despite conventional wisdom and PAC-learning theory, it seems that <math>n \ll d</math> may be a desirable setting to do neural network learning (see [[overfitting]] where we discuss double descent). For <math>n \ll d</math>, is there a gain in using algorithms more complex than coordinate-wise median? | The main application of robust statistics (at least relevant to AI ethics) seems to be the aggregation of [[stochastic gradient descent|stochastic gradients]] for [[neural networks]]. In this setting, even a linear-time algorithm in <math>\Omega(nd)</math> is impractical if we demand <math>n \geq d</math> (which is necessary to have dimension-independent guarantees). In practice, this setting is often carried out with batches whose size <math>n</math> is significantly smaller than <math>d</math>. In fact, despite conventional wisdom and PAC-learning theory, it seems that <math>n \ll d</math> may be a desirable setting to do neural network learning (see [[overfitting]] where we discuss double descent). For <math>n \ll d</math>, is there a gain in using algorithms more complex than coordinate-wise median? | ||

− | [http://papers.nips.cc/paper/6617-machine-learning-with-adversaries-byzantine-tolerant-gradient-descent.pdf | + | [http://papers.nips.cc/paper/6617-machine-learning-with-adversaries-byzantine-tolerant-gradient-descent.pdf BEGS][https://dblp.org/rec/bibtex/conf/nips/BlanchardMGS17 17] proposed Krum and multi-Krum, aggregation algorithms for this setting that have weaker robustness guarantees but are more efficient. Is it possible to improve upon them? |

+ | |||

+ | == Robust statistics for agreement and multi-agent settings == | ||

+ | |||

+ | Another venue where robust statistics are needed is the increasingly multi-agent setting that modern AI is built on. [https://dl.acm.org/doi/10.1145/2488608.2488657 MH2013] |

## Revision as of 12:31, 27 January 2020

Robust statistics is the problem of estimating parameters from unreliable empirical data. Typically, suppose that a fraction of the training dataset is compromised. Can we design algorithms that nevertheless succeed in learning adequately from such a partially compromised dataset?

This question has arguably become crucial, as large-scale algorithms perform learning from users' data. Yet, clearly, if the algorithm is used by thousands, millions or billions of users, many of the data will likely be corrupted, because of bugs standupmaths20, or because some users will maliciously want to exploit or attack the algorithm. This latter case is known as a *poisoning attack*.

Over the last three years, there have been fascinating recent advances, both for classical statistical tasks DiakonikolasKane19 DepersinLecué19 and modern machine learning BEGS17 especially in very high dimensional settings such as training neural networks EGR.

## Contents

## Example of the median

Suppose the data are real numbers, and we want to estimate the mean of the true data (which we shall call *inliers*). Note that the naive empirical mean estimate would be a bad idea here, as a single malicious user could completely upset the empirical mean estimate. In fact, by choosing its input data adequately (called *outliers*), the malicious user can make the empirical mean estimate equal whatever the malicious user wants it to be.

It turns out that using the median of the dataset is a robust way to do so. Indeed, even if 45% of the data are *outliers*, the median will still be a quantile of the inliers, which should not be too far from the actual mean. The median is said to have a 0.5 statistical breakdown point RousseeuwLeroy05. No statistical method can achieve a better breakdown point, but other methods also achieve 0.5 statistical breakdown, like trimmed mean (we remove sufficiently many extreme values on both sides and take the mean of the rest).

Another way to quantify robustness is to compute a high-probability upper bound between the empirical median and the mean μ of the true distribution of inliers. Call ε the fraction of outliers. It turns out that, assuming the true distribution is a normal distribution [math]\mathcal N(\mu,1)[/math], given [math]n=\Omega\left( \frac{d+\log(1/\tau)}{\varepsilon^2}\right)[/math] data, we can guarantee [math]|median-\mu| = O(\varepsilon)[/math] with probability [math]1-\tau[/math]. This asymptotic bound is also best possible Huber92.

## Poisoning models

The above model holds for arguably the strongest poisoning model. This is one where an adversary gets to read the full dataset before we can, and is able to erase a fraction [math]\varepsilon[/math] of the data, and to replace them by any imaginable input. The dataset is then analyzed by our (robust) statistics algorithm.

A weaker, but still widespread, model is one where a fraction [math]1-\varepsilon[/math] comes from the true distribution, while the remaining [math]\varepsilon[/math] is chosen by the adversary BMGS17.

Other models include an adversary with only erasing power, or an adversary that must choose its *outliers* without knowledge of the values of the *inliers*. Evidently, any guarantee for such weaker poisoning models will also hold for stronger poisoning models DiakonikolasKane19.

Perhaps the most general form of poisoning attack is the following. Consider a "true dataset" [math]D[/math]. However, the attacker gets to distort the dataset using a distort function [math]f \in \mathcal F[/math], thereby yielding [math]f(D)[/math]. Suppose we now have a best-possible machine learning algorithm [math]ML[/math] that learns from data. It would ideally compute [math]ML(D)[/math]. But we can only exploit [math]f(D)[/math], by some hopefully robust machine learning algorithm [math]RML(f(D))[/math]. What we would like is to guarantee that [math]d(ML(D),RML(f(D))) \lt bound[/math], for a suitable distance [math]d[/math] and any [math]f \in \mathcal F[/math], where [math]RML[/math] is tractable. A further generalization of this could consist in assuming a prior probabilistic belief on the set <mathcal>F</mathcal> of attack models that we need to defend against.

It is noteworthy that robustness to such attacks are useful, even if there are no adversary. Indeed, data may still get corrupted, because of bugs, crashes or misuse by a human operator (see for instance the case of gene mutations caused by Excel that plagued genetic research Boddy16). Our algorithms need to remain performant despite such issues. An algorithm robust to strong attacks will be robust to such weaker flaws.

## Robustness to additive poisoning

Unfortunately, results that hold for small dimensions generalize poorly to high dimensions, either because of weak robustness guarantees or computational slowness. Typically, the *coordinate-wise median* and the geometric median both yield [math]\Omega(\varepsilon \sqrt{d})[/math]-error, even in the limit of infinite-size datasets and assuming normality for inliers. This is very bad, as today's neural networks often have [math]d\sim 10^6[/math], if not [math]10^9[/math] or [math]10^{12}[/math] parameters.

On the other hand, assuming the true distribution is a spherical normal distribution [math]\mathcal N(0,I)[/math], Tukey proposed another approach based on identifying the directions of largest variances, since these are likely to be the "attack line" of the adversary Tukey75. *Tukey's median* yields [math]O(\varepsilon)[/math]-error with high probability [math]1-\tau[/math] for [math]n=\Omega\left( \frac{d+\log(1/\tau)}{\varepsilon^2}\right)[/math] data points. Unfortunately, Tukey's median is NP-hard to compute, and is typically exponential in d.

DepersinLecué19 proved that such a bound can be achieved for additive poisoning in quasi-linear-time, even for heavy-tailed distribution with bounded but unknown variance. More precisely, by using an approach akin to median-of-means, for [math]K = \Omega(\varepsilon n)[/math], they designed an algorithm that achieves an error [math]O\left( \sqrt{\frac{Tr(\Sigma)}{n}} + \sqrt{\frac{||\Sigma||_{op} K}{n}} \right)[/math] in time [math]\tilde O(nd + uKd)[/math]. Here, [math]u[/math] is an integer parameter, which is needed to guarantee a subgaussian decay rate of errors, which will be in [math]1-\exp(-\Theta(K+u))[/math]. Note that [math]Tr(\Sigma)[/math] is essentially the "effective dimension" of the data points.

Their technique relies on partitioning data into [math]K[/math] buckets, computing the means for each bucket, and replacing the computation of median of the means by a covering SDP that fits all configuration of bucket poisoning. It turns out that approximations of such an covering SDP can be founded in quasi-linear-time PTZ12. It turns out that, rather than being used to directly compute a mean estimator, this is actually used to perform gradient descent, starting from the coordinate-wise median, and then descending along the direction provided by the covering SDP. DepersinLecué19 proved that only [math]O(\log d)[/math] such steps were needed to guarantee their bound.

## Robustness to strong poisoning

Note that all the papers of this section seem to strongly rely on the knowledge of the covariance matrix of inliers by the algorithm.

For robustness to strong poisoning, Tukey's ideas can be turned into an polynomial-time algorithm for robust statistics mean estimate. The trick is to identify worst-case "attack line" by computing the largest eigenvalue of the empirical covariance matrix, and to remove extremal points along such lines to reduce variance. DKKLMS16 DKKLMS17 show that, for [math]n=\Omega(d/\varepsilon^2)[/math], this yields [math]O(\varepsilon \sqrt{\log(1/\varepsilon)})[/math]-error with high probability for sub-Gaussian inliers, and [math]O(\sigma \sqrt{\varepsilon})[/math] for inliners whose true covariance matrix [math]\Sigma[/math] is such that [math]\sigma^2 I - \Sigma[/math] is semidefinite positive.

The asymptotical optimal bound [math]O(\varepsilon)[/math] has been achieved a more sophisticated filtering polynomial-time algorithm by DKKLM+18 for Gaussian distribution in the additive poisoning model, while DKS17 showed that no polynomial-time can achieve better than [math]O(\varepsilon \sqrt{\log(1/\varepsilon)})[/math] in the Statistical Query Model with strong poisoning.

Quasi-linear-time robust mean estimators have been designed by CDG19, i.e. with [math]\tilde O(nd)[/math] up to logarithmic factors, based on filtering methods of extremal points on variance-maximizing directions.

Note that all such results can be applied to robust linear regression, by applying robust mean estimator to gradient descent estimator (with the mean taken over data points), assuming that the covariance matrix of the distribution of features [math]x[/math] is known, and that the noise is of known mean and variance. Robust covariance matrix estimation can also be addressed by the framework, as it too can be regarded as a robust mean estimation problem (the fourth moment then needs to be assumed to be upper-bounded).

## What if there are more outliers than inliers?

Another puzzling question concerns the setting where outliers may be more numerous than inliers. A simple argument shows that methods cited above just won't work. What can be done for such a setting? Can we still learn something from the data, or do we have to throw away the dataset altogether? Let's cite three approaches that may be fruitful.

First, BBV08 introduced *list-decodable learning*, which consists in returning several hypothesis, and DKS18 provided polynomial-time for robust list-decodable mean estimation.

Second, one might try to a apply a robust Bayesian inference to the data, which would yield a set of posterior beliefs. This framework has yet to be defined.

Third, we may assume that the data are more or less certified. This is a natural setting, as we humans often judge the reliability of raw data depending on its source, and we usually consider a continuum of reliability, rather than a clear cut binary classification. Algorithms should probably do the same at some point, but there does not yet seem to be an algorithmic framework to pose this problem. In particular, it could be interesting to analyze threat models where different degrees or sorts of certification have different levels of liability.

## Robust statistics for neural networks

The main application of robust statistics (at least relevant to AI ethics) seems to be the aggregation of stochastic gradients for neural networks. In this setting, even a linear-time algorithm in [math]\Omega(nd)[/math] is impractical if we demand [math]n \geq d[/math] (which is necessary to have dimension-independent guarantees). In practice, this setting is often carried out with batches whose size [math]n[/math] is significantly smaller than [math]d[/math]. In fact, despite conventional wisdom and PAC-learning theory, it seems that [math]n \ll d[/math] may be a desirable setting to do neural network learning (see overfitting where we discuss double descent). For [math]n \ll d[/math], is there a gain in using algorithms more complex than coordinate-wise median?

BEGS17 proposed Krum and multi-Krum, aggregation algorithms for this setting that have weaker robustness guarantees but are more efficient. Is it possible to improve upon them?

## Robust statistics for agreement and multi-agent settings

Another venue where robust statistics are needed is the increasingly multi-agent setting that modern AI is built on. MH2013