Difference between revisions of "Robust statistics"
m |
m |
||
Line 21: | Line 21: | ||
Other models include an adversary with only erasing power, or an adversary that must choose its <em>outliers</em> without knowledge of the values of the <em>inliers</em>. Evidently, any guarantee for such weaker poisoning models will also hold for stronger poisoning models [https://arxiv.org/pdf/1911.05911.pdf DiakonikolasKane][https://dblp.org/rec/bibtex/journals/corr/abs-1911-05911 19]. | Other models include an adversary with only erasing power, or an adversary that must choose its <em>outliers</em> without knowledge of the values of the <em>inliers</em>. Evidently, any guarantee for such weaker poisoning models will also hold for stronger poisoning models [https://arxiv.org/pdf/1911.05911.pdf DiakonikolasKane][https://dblp.org/rec/bibtex/journals/corr/abs-1911-05911 19]. | ||
− | == | + | == Robustness to additive poisoning == |
[https://arxiv.org/pdf/1911.05911.pdf DiakonikolasKane][https://dblp.org/rec/bibtex/journals/corr/abs-1911-05911 19] is a great resource for this problem. Below we summarize briefly the key ideas and results. | [https://arxiv.org/pdf/1911.05911.pdf DiakonikolasKane][https://dblp.org/rec/bibtex/journals/corr/abs-1911-05911 19] is a great resource for this problem. Below we summarize briefly the key ideas and results. | ||
Line 29: | Line 29: | ||
On the other hand, assuming the true distribution is a spherical normal distribution <math>\mathcal N(0,I)</math>, Tukey proposed another approach based on identifying the directions of largest variances, since these are likely to be the "attack line" of the adversary [https://scholar.google.ch/scholar?hl=en&as_sdt=0%2C5&q=Mathematics+and+the+picturing+of+data+tukey&btnG= Tukey75]. <em>Tukey's median</em> yields <math>O(\varepsilon)</math>-error with high probability <math>1-\tau</math> for <math>n=\Omega\left( \frac{d+\log(1/\tau)}{\varepsilon^2}\right)</math> data points. Unfortunately, Tukey's median is NP-hard to compute, and is typically exponential in d. | On the other hand, assuming the true distribution is a spherical normal distribution <math>\mathcal N(0,I)</math>, Tukey proposed another approach based on identifying the directions of largest variances, since these are likely to be the "attack line" of the adversary [https://scholar.google.ch/scholar?hl=en&as_sdt=0%2C5&q=Mathematics+and+the+picturing+of+data+tukey&btnG= Tukey75]. <em>Tukey's median</em> yields <math>O(\varepsilon)</math>-error with high probability <math>1-\tau</math> for <math>n=\Omega\left( \frac{d+\log(1/\tau)}{\varepsilon^2}\right)</math> data points. Unfortunately, Tukey's median is NP-hard to compute, and is typically exponential in d. | ||
− | + | [https://arxiv.org/pdf/1906.03058 DepersinLecué][https://scholar.google.ch/scholar?hl=en&as_sdt=0%2C5&q=Robust+subgaussian+estimation+of+a+mean+vector+in+nearly+linear+time&btnG= 19] proved that such a bound can be achieved for additive poisoning in quasi-linear-time, even for heavy-tailed distribution with bounded but unknown variance. More precisely, by using an approach akin to median-of-means, for <math>K = \Omega(\varepsilon n</math>, they designed an algorithm that achieves an error <math>O\left( \sqrt{\frac{Tr \Sigma}{n}} + \sqrt{\frac{||\Sigma||_{op} K}{n}} \right)</math> (if the data spread in a subspace of dimension <math>d</math>, we retrieve the bound above). Their technique relies on partitioning data into <math>K</math> buckets, computing the means for each bucket, and replacing the computation of median of the means by a covering SDP that fits all configuration of bucket poisoning. It turns out that approximations of such an covering SDP can be founded in quasi-linear-time [https://arxiv.org/pdf/1201.5135 PTZ][https://dblp.org/rec/bibtex/conf/spaa/PengT12 12]. | |
− | + | == Robustness to strong poisoning == | |
− | + | For robustness to strong poisoning, Tukey's ideas can be turned into an polynomial-time algorithm for robust statistics mean estimate. The trick is to identify worst-case "attack line" by computing the largest eigenvalue of the empirical covariance matrix, and to remove extremal points along such lines to reduce variance. [https://arxiv.org/pdf/1604.06443.pdf DKKLMS][https://dblp.org/rec/bibtex/conf/focs/DiakonikolasKK016 16] [http://proceedings.mlr.press/v70/diakonikolas17a/diakonikolas17a.pdf DKKLMS][https://dblp.org/rec/bibtex/conf/icml/DiakonikolasKK017 17] show that, for <math>n=\Omega(d/\varepsilon^2)</math>, this yields <math>O(\varepsilon \sqrt{\log(1/\varepsilon)})</math>-error with high probability for sub-Gaussian inliers, and <math>O(\sigma \sqrt{\varepsilon})</math> for inliners whose true covariance matrix <math>\Sigma</math> is such that <math>\sigma^2 I - \Sigma</math> is semidefinite positive. | |
+ | |||
+ | The asymptotical optimal bound <math>O(\varepsilon)</math> has been achieved a more sophisticated filtering polynomial-time algorithm by [https://epubs.siam.org/doi/pdf/10.1137/1.9781611975031.171 DKKLM+][https://dblp.org/rec/bibtex/conf/soda/DiakonikolasKK018 18] for Gaussian distribution in the additive poisoning model, while [https://ieeexplore.ieee.org/document/8104048 DKS][https://dblp.org/rec/bibtex/conf/focs/DiakonikolasKS17 17] showed that no polynomial-time can achieve better than <math>O(\varepsilon \sqrt{\log(1/\varepsilon)})</math> in the Statistical Query Model with strong poisoning. | ||
+ | |||
+ | Quasi-linear-time robust mean estimators have been designed by [https://epubs.siam.org/doi/pdf/10.1137/1.9781611975482.171 CDG][https://dblp.org/rec/bibtex/conf/soda/0002D019 19], i.e. with <math>\tilde O(nd)</math> up to logarithmic factors, based on filtering methods of extremal points on variance-maximizing directions. | ||
Note that all such results can be applied to robust linear regression, by applying robust mean estimator to gradient descent estimator (with the mean taken over data points), assuming that the covariance matrix of the distribution of features <math>x</math> is known, and that the noise is of known mean and variance. Robust covariance matrix estimation can also be addressed by the framework, as it too can be regarded as a robust mean estimation problem (the fourth moment then needs to be assumed to be upper-bounded). | Note that all such results can be applied to robust linear regression, by applying robust mean estimator to gradient descent estimator (with the mean taken over data points), assuming that the covariance matrix of the distribution of features <math>x</math> is known, and that the noise is of known mean and variance. Robust covariance matrix estimation can also be addressed by the framework, as it too can be regarded as a robust mean estimation problem (the fourth moment then needs to be assumed to be upper-bounded). |
Revision as of 08:53, 23 January 2020
Robust statistics is the problem of estimating parameters from unreliable empirical data. Typically, suppose that a fraction of the training dataset is compromised. Can we design algorithms that nevertheless succeed in learning adequately from such a partially compromised dataset?
This question has arguably become crucial, as large-scale algorithms perform learning from users' data. Yet, clearly, if the algorithm is used by thousands, millions or billions of users, many of the data will likely be corrupted, because of bugs Boddy16, or because some users will maliciously want to exploit or attack the algorithm. This latter case is known as a poisoning attack.
Over the last three years, there have been fascinating recent advances, both for classical statistical tasks DiakonikolasKane19 and neural networks BMGS17.
Contents
Example of the median
Suppose the data are real numbers, and we want to estimate the mean of the true data (which we shall call inliers). Note that the naive empirical mean estimate would be a bad idea here, as a single malicious user could completely upset the empirical mean estimate. In fact, by choosing its input data adequately (called outliers), the malicious user can make the empirical mean estimate equal whatever the malicious user wants it to be.
It turns out that using the median of the dataset is a robust way to do so. Indeed, even if 45% of the data are outliers, the median will still be a quantile of the inliers, which should not be too far from the actual mean. The median is said to have a 0.5 statistical breakdown point RousseeuwLeroy05. No statistical method can achieve a better breakdown point, but other methods also achieve 0.5 statistical breakdown, like trimmed mean (we remove sufficiently many extreme values on both sides and take the mean of the rest).
Another way to quantify robustness is to compute a high-probability upper bound between the empirical median and the mean μ of the true distribution of inliers. Call ε the fraction of outliers. It turns out that, assuming the true distribution is a normal distribution [math]\mathcal N(\mu,1)[/math], given [math]n=\Omega\left( \frac{d+\log(1/\tau)}{\varepsilon^2}\right)[/math] data, we can guarantee [math]|median-\mu| = O(\varepsilon)[/math] with probability [math]1-\tau[/math]. This asymptotic bound is also best possible Huber92.
Poisoning models
The above model holds for arguably the strongest poisoning model. This is one where an adversary gets to read the full dataset before we can, and is able to erase a fraction [math]\varepsilon[/math] of the data, and to replace them by any imaginable input. The dataset is then analyzed by our (robust) statistics algorithm.
A weaker, but still widespread, model is one where a fraction [math]1-\varepsilon[/math] comes from the true distribution, while the remaining [math]\varepsilon[/math] is chosen by the adversary BMGS17.
Other models include an adversary with only erasing power, or an adversary that must choose its outliers without knowledge of the values of the inliers. Evidently, any guarantee for such weaker poisoning models will also hold for stronger poisoning models DiakonikolasKane19.
Robustness to additive poisoning
DiakonikolasKane19 is a great resource for this problem. Below we summarize briefly the key ideas and results.
Unfortunately, results that hold for small dimensions generalize poorly to high dimensions, either because of weak robustness guarantees or computational slowness. Typically, the coordinate-wise median and the geometric median both yield [math]\Omega(\varepsilon \sqrt{d})[/math]-error, even in the limit of infinite-size datasets and assuming normality for inliers. This is very bad, as today's neural networks often have [math]d\sim 10^6[/math], if not [math]10^9[/math] or [math]10^12[/math] parameters.
On the other hand, assuming the true distribution is a spherical normal distribution [math]\mathcal N(0,I)[/math], Tukey proposed another approach based on identifying the directions of largest variances, since these are likely to be the "attack line" of the adversary Tukey75. Tukey's median yields [math]O(\varepsilon)[/math]-error with high probability [math]1-\tau[/math] for [math]n=\Omega\left( \frac{d+\log(1/\tau)}{\varepsilon^2}\right)[/math] data points. Unfortunately, Tukey's median is NP-hard to compute, and is typically exponential in d.
DepersinLecué19 proved that such a bound can be achieved for additive poisoning in quasi-linear-time, even for heavy-tailed distribution with bounded but unknown variance. More precisely, by using an approach akin to median-of-means, for [math]K = \Omega(\varepsilon n[/math], they designed an algorithm that achieves an error [math]O\left( \sqrt{\frac{Tr \Sigma}{n}} + \sqrt{\frac{||\Sigma||_{op} K}{n}} \right)[/math] (if the data spread in a subspace of dimension [math]d[/math], we retrieve the bound above). Their technique relies on partitioning data into [math]K[/math] buckets, computing the means for each bucket, and replacing the computation of median of the means by a covering SDP that fits all configuration of bucket poisoning. It turns out that approximations of such an covering SDP can be founded in quasi-linear-time PTZ12.
Robustness to strong poisoning
For robustness to strong poisoning, Tukey's ideas can be turned into an polynomial-time algorithm for robust statistics mean estimate. The trick is to identify worst-case "attack line" by computing the largest eigenvalue of the empirical covariance matrix, and to remove extremal points along such lines to reduce variance. DKKLMS16 DKKLMS17 show that, for [math]n=\Omega(d/\varepsilon^2)[/math], this yields [math]O(\varepsilon \sqrt{\log(1/\varepsilon)})[/math]-error with high probability for sub-Gaussian inliers, and [math]O(\sigma \sqrt{\varepsilon})[/math] for inliners whose true covariance matrix [math]\Sigma[/math] is such that [math]\sigma^2 I - \Sigma[/math] is semidefinite positive.
The asymptotical optimal bound [math]O(\varepsilon)[/math] has been achieved a more sophisticated filtering polynomial-time algorithm by DKKLM+18 for Gaussian distribution in the additive poisoning model, while DKS17 showed that no polynomial-time can achieve better than [math]O(\varepsilon \sqrt{\log(1/\varepsilon)})[/math] in the Statistical Query Model with strong poisoning.
Quasi-linear-time robust mean estimators have been designed by CDG19, i.e. with [math]\tilde O(nd)[/math] up to logarithmic factors, based on filtering methods of extremal points on variance-maximizing directions.
Note that all such results can be applied to robust linear regression, by applying robust mean estimator to gradient descent estimator (with the mean taken over data points), assuming that the covariance matrix of the distribution of features [math]x[/math] is known, and that the noise is of known mean and variance. Robust covariance matrix estimation can also be addressed by the framework, as it too can be regarded as a robust mean estimation problem (the fourth moment then needs to be assumed to be upper-bounded).
In all such works though, the covariance matrix of inliers is assumed to be known. This is critical as part of the algorithm requires rescaling coordinates so that the rescaled covariance is the identity matrix. Can we design an efficient robust mean estimator algorithm for bounded but unknown covariance?
What if there are more outliers than inliers?
Another puzzling question concerns the setting where outliers may be more numerous than inliers. A simple argument shows that methods cited above just won't work. What can be done for such a setting? Can we still learn something from the data, or do we have to throw away the dataset altogether? Let's cite three approaches that may be fruitful.
First, BBV08 introduced list-decodable learning, which consists in returning several hypothesis, and DKS18 provided polynomial-time for robust list-decodable mean estimation.
Second, one might try to a apply a robust Bayesian inference to the data, which would yield a set of posterior beliefs. This framework has yet to be defined.
Third, we may assume that the data are more or less certified. This is a natural setting, as we humans often judge the reliability of raw data depending on its source, and we usually consider a continuum of reliability, rather than a clear cut binary classification. Algorithms should probably do the same at some point, but there does not yet seem to be an algorithmic framework to pose this problem.
Robust statistics for neural networks
The main application of robust statistics (at least relevant to AI ethics) seems to be the aggregation of stochastic gradients for neural networks. In this setting, even a linear-time algorithm in [math]O(nd)[/math] is impractical if we demand [math]n \geq d[/math] (which is necessary to have dimension-independent guarantees). In practice, this setting is often carried out with batches whose size [math]n[/math] is significantly smaller than [math]d[/math]. Is there a gain in using algorithms more complex than coordinate-wise median?
BMGS17 proposed Krum and multi-Krum, aggregation algorithms for this setting that have weaker robustness guarantees but are more efficient. Is it possible to improve upon them?