Difference between revisions of "Robust statistics"

From RB Wiki
m
m
Line 3: Line 3:
 
This question has arguably become crucial, as large-scale algorithms perform learning from users' data. Yet, clearly, if the algorithm is used by thousands, millions or billions of users, many of the data will likely be corrupted, because of bugs [https://www.sciencemag.org/news/2016/08/one-five-genetics-papers-contains-errors-thanks-microsoft-excel Boddy16], or because some users will maliciously want to exploit or attack the algorithm. This latter case is known as a <em>poisoning attack</em>.
 
This question has arguably become crucial, as large-scale algorithms perform learning from users' data. Yet, clearly, if the algorithm is used by thousands, millions or billions of users, many of the data will likely be corrupted, because of bugs [https://www.sciencemag.org/news/2016/08/one-five-genetics-papers-contains-errors-thanks-microsoft-excel Boddy16], or because some users will maliciously want to exploit or attack the algorithm. This latter case is known as a <em>poisoning attack</em>.
  
In very high dimensional, designing efficient algorithms with strong robustness guarantees is an open problem, though there are fascinating recent results, both for classical statistical tasks [https://arxiv.org/pdf/1911.05911.pdf DiakonikolasKane][https://dblp.org/rec/bibtex/journals/corr/abs-1911-05911 19] and [[neural networks]] [http://papers.nips.cc/paper/6617-machine-learning-with-adversaries-byzantine-tolerant-gradient-descent.pdf BMGS][https://dblp.org/rec/bibtex/conf/nips/BlanchardMGS17 17].
+
Over the last three years, there have been fascinating recent advances, both for classical statistical tasks [https://arxiv.org/pdf/1911.05911.pdf DiakonikolasKane][https://dblp.org/rec/bibtex/journals/corr/abs-1911-05911 19] and [[neural networks]] [http://papers.nips.cc/paper/6617-machine-learning-with-adversaries-byzantine-tolerant-gradient-descent.pdf BMGS][https://dblp.org/rec/bibtex/conf/nips/BlanchardMGS17 17].
  
 
== Example of the median ==
 
== Example of the median ==

Revision as of 08:29, 22 January 2020

Robust statistics is the problem of estimating parameters from unreliable empirical data. Typically, suppose that a fraction of the training dataset is compromised. Can we design algorithms that nevertheless succeed in learning adequately from such a partially compromised dataset?

This question has arguably become crucial, as large-scale algorithms perform learning from users' data. Yet, clearly, if the algorithm is used by thousands, millions or billions of users, many of the data will likely be corrupted, because of bugs Boddy16, or because some users will maliciously want to exploit or attack the algorithm. This latter case is known as a poisoning attack.

Over the last three years, there have been fascinating recent advances, both for classical statistical tasks DiakonikolasKane19 and neural networks BMGS17.

Example of the median

Suppose the data are real numbers, and we want to estimate the mean of the true data (which we shall call inliers). Note that the naive empirical mean estimate would be a bad idea here, as a single malicious user could completely upset the empirical mean estimate. In fact, by choosing its input data adequately (called outliers), the malicious user can make the empirical mean estimate equal whatever the malicious user wants it to be.

It turns out that using the median of the dataset is a robust way to do so. Indeed, even if 45% of the data are outliers, the median will still be a quantile of the inliers, which should not be too far from the actual mean. The median is said to have a 0.5 statistical breakdown point RousseeuwLeroy05. No statistical method can achieve a better breakdown point, but other methods also achieve 0.5 statistical breakdown, like trimmed mean (we remove sufficiently many extreme values on both sides and take the mean of the rest).

Another way to quantify robustness is to compute a high-probability upper bound between the empirical median and the mean μ of the true distribution of inliers. Call ε the fraction of outliers. It turns out that, assuming the true distribution is a normal distribution N(μ,1), given n=Ω((d+log(1/τ))/ε2) data, we can guarantee |mean-μ| < O(ε) with probability 1-τ. This asymptotic bound is also best possible Huber92.

Poisoning models

The above model holds for arguably the strongest poisoning model. This is one where an adversary gets to read the full dataset before we can, and is able to erase a fraction ε of the data, and to replace them by any imaginable input. The dataset is then analyzed by our (robust) statistics algorithm.

A weaker, but still widespread, model is one where a fraction 1-ε comes from the true distribution, while the remaining ε is chosen by the adversary BMGS17.

Other models include an adversary with only erasing power, or an adversary that must choose its outliers without knowledge of the values of the inliers. Evidently, any guarantee for such weaker poisoning models will also hold for stronger poisoning models DiakonikolasKane19.

High-dimensional robust mean estimates

DiakonikolasKane19 is a great resource for this problem. Below we summarize briefly the key ideas and results.

Unfortunately, results that hold for small dimensions generalize poorly to high dimensions, either because of weak robustness guarantees or computational slowness. Typically, the coordinate-wise median and the geometric median both yield Ω(ε√d)-error, even in the limit of infinite-size datasets and assuming normality for inliers. This is very bad, as today's neural networks often have d~106, if not 109 or 1012 parameters.

On the other hand, assuming the true distribution is a spherical normal distribution N(0,I), Tukey proposed another approach based on identifying the directions of largest variances, since these are likely to be the "attack line" of the adversary Tukey75. Tukey's median yields O(ε)-error with high probability 1-τ for n=Ω((d+log(1/τ))/ε2) data points. Unfortunately, Tukey's median is NP-hard to compute, and is typically exponential in d.

But its ideas can be turned into an polynomial-time algorithm for robust statistics mean estimate. The trick is to identify worst-case "attack line" by computing the largest eigenvalue of the empirical covariance matrix, and to remove extremal points along such lines to reduce variance. DKKLMS16 DKKLMS17 show that, for n=Ω(d/ε2), this yields O(ε√log(1/ε))-error with high probability for sub-Gaussian inliers, and O(σ√ε) for inliners whose true covariance matrix Σ is such that σ2I-Σ is semidefinite positive.

The asymptotical optimal bound O(ε) has been achieved a more sophisticated filtering polynomial-time algorithm by DKKLM+18 for Gaussian distribution in the additive contamination model, while DKS17 showed that no polynomial-time can achieve better than O(ε √log(1/ε)) in the Statistical Query Model with strong contamination.

Quasi-linear-time robust mean estimators have been designed by CDG19 and DepersinLecué19, i.e. with O(nd) up to logarithmic factors, based on filtering methods of extremal points on variance-maximizing directions.

Note that all such results can be applied to robust linear regression, by applying robust mean estimator to gradient descent estimator (with the mean taken over data points), assuming that the covariance matrix of the distribution of features x is known, and that the noise is of known mean and variance. Robust covariance matrix estimation can also be addressed by the framework, as it too can be regarded as a robust mean estimation problem (the fourth moment then needs to be assumed to be upper-bounded).

In all such works though, the covariance matrix of inliers is assumed to be known. This is critical as part of the algorithm requires rescaling coordinates so that the rescaled covariance is the identity matrix. Can we design an efficient robust mean estimator algorithm for bounded but unknown covariance?

Another puzzling question concerns the setting where outliers may be more numerous than inliers. What can be done for such a setting? Can we still learn something from the data, or do we have to throw away the dataset altogether? Let's cite three approaches that may be fruitful. First, BBV08 introduced list-decodable learning, which consists in returning several hypothesis, and DKS18 provided polynomial-time for robust list-decodable mean estimation. Second, one might try to a apply a robust Bayesian inference to the data, which would yield a set of posterior beliefs. This framework has yet to be defined. Third, we may assume that the data are more or less certified. This is a natural setting, as we humans often judge the reliability of raw data depending on its source, and we usually consider a continuum of reliability, rather than a clear cut binary classification. Algorithms should probably do the same at some point, but there does not yet seem to be an algorithmic framework to pose this problem.

Robust statistics for neural networks

The main application of robust statistics (at least relevant to AI ethics) seems to be the aggregation of stochastic gradients for neural networks. In this setting, even a linear-time algorithm in O(nd) is impractical if we demand n>d (which is necessary to have dimension-independent guarantees). In practice, this setting is often carried out with batches whose size n is significantly smaller than d. Is there a gain in using algorithms more complex than coordinate-wise median?

BMGS17 proposed Krum and multi-Krum, aggregation algorithms for this setting that have weaker robustness guarantees but are more efficient. Is it possible to improve upon them?