Robust statistics

From RB Wiki
Revision as of 09:02, 23 January 2020 by Lê Nguyên Hoang (talk | contribs)

Robust statistics is the problem of estimating parameters from unreliable empirical data. Typically, suppose that a fraction of the training dataset is compromised. Can we design algorithms that nevertheless succeed in learning adequately from such a partially compromised dataset?

This question has arguably become crucial, as large-scale algorithms perform learning from users' data. Yet, clearly, if the algorithm is used by thousands, millions or billions of users, many of the data will likely be corrupted, because of bugs Boddy16, or because some users will maliciously want to exploit or attack the algorithm. This latter case is known as a poisoning attack.

Over the last three years, there have been fascinating recent advances, both for classical statistical tasks DiakonikolasKane19 and neural networks BMGS17.

Example of the median

Suppose the data are real numbers, and we want to estimate the mean of the true data (which we shall call inliers). Note that the naive empirical mean estimate would be a bad idea here, as a single malicious user could completely upset the empirical mean estimate. In fact, by choosing its input data adequately (called outliers), the malicious user can make the empirical mean estimate equal whatever the malicious user wants it to be.

It turns out that using the median of the dataset is a robust way to do so. Indeed, even if 45% of the data are outliers, the median will still be a quantile of the inliers, which should not be too far from the actual mean. The median is said to have a 0.5 statistical breakdown point RousseeuwLeroy05. No statistical method can achieve a better breakdown point, but other methods also achieve 0.5 statistical breakdown, like trimmed mean (we remove sufficiently many extreme values on both sides and take the mean of the rest).

Another way to quantify robustness is to compute a high-probability upper bound between the empirical median and the mean μ of the true distribution of inliers. Call ε the fraction of outliers. It turns out that, assuming the true distribution is a normal distribution [math]\mathcal N(\mu,1)[/math], given [math]n=\Omega\left( \frac{d+\log(1/\tau)}{\varepsilon^2}\right)[/math] data, we can guarantee [math]|median-\mu| = O(\varepsilon)[/math] with probability [math]1-\tau[/math]. This asymptotic bound is also best possible Huber92.

Poisoning models

The above model holds for arguably the strongest poisoning model. This is one where an adversary gets to read the full dataset before we can, and is able to erase a fraction [math]\varepsilon[/math] of the data, and to replace them by any imaginable input. The dataset is then analyzed by our (robust) statistics algorithm.

A weaker, but still widespread, model is one where a fraction [math]1-\varepsilon[/math] comes from the true distribution, while the remaining [math]\varepsilon[/math] is chosen by the adversary BMGS17.

Other models include an adversary with only erasing power, or an adversary that must choose its outliers without knowledge of the values of the inliers. Evidently, any guarantee for such weaker poisoning models will also hold for stronger poisoning models DiakonikolasKane19.

Robustness to additive poisoning

DiakonikolasKane19 is a great resource for this problem. Below we summarize briefly the key ideas and results.

Unfortunately, results that hold for small dimensions generalize poorly to high dimensions, either because of weak robustness guarantees or computational slowness. Typically, the coordinate-wise median and the geometric median both yield [math]\Omega(\varepsilon \sqrt{d})[/math]-error, even in the limit of infinite-size datasets and assuming normality for inliers. This is very bad, as today's neural networks often have [math]d\sim 10^6[/math], if not [math]10^9[/math] or [math]10^{12}[/math] parameters.

On the other hand, assuming the true distribution is a spherical normal distribution [math]\mathcal N(0,I)[/math], Tukey proposed another approach based on identifying the directions of largest variances, since these are likely to be the "attack line" of the adversary Tukey75. Tukey's median yields [math]O(\varepsilon)[/math]-error with high probability [math]1-\tau[/math] for [math]n=\Omega\left( \frac{d+\log(1/\tau)}{\varepsilon^2}\right)[/math] data points. Unfortunately, Tukey's median is NP-hard to compute, and is typically exponential in d.

DepersinLecué19 proved that such a bound can be achieved for additive poisoning in quasi-linear-time, even for heavy-tailed distribution with bounded but unknown variance. More precisely, by using an approach akin to median-of-means, for [math]K = \Omega(\varepsilon n[/math], they designed an algorithm that achieves an error [math]O\left( \sqrt{\frac{Tr(\Sigma)}{n}} + \sqrt{\frac{||\Sigma||_{op} K}{n}} \right)[/math] in time [math]\tilde O(nd + uKd)[/math]. Here, [math]u[/math] is an integer parameter, which is needed to guarantee a subgaussian decay rate of errors, which will be in [math]1-\exp(-\Theta(K+u))[/math]. Note that [math]Tr(\Sigma)[/math] is essentially the "effective dimension" of the data points.

Their technique relies on partitioning data into [math]K[/math] buckets, computing the means for each bucket, and replacing the computation of median of the means by a covering SDP that fits all configuration of bucket poisoning. It turns out that approximations of such an covering SDP can be founded in quasi-linear-time PTZ12. It turns out that, rather than being used to directly compute a mean estimator, this is actually used to perform gradient descent, starting from the coordinate-wise median, and then descending along the direction provided by the covering SDP. DepersinLecué19 proved that only [math]O(\log d)[/math] such steps were needed to guarantee their bound.

Robustness to strong poisoning

For robustness to strong poisoning, Tukey's ideas can be turned into an polynomial-time algorithm for robust statistics mean estimate. The trick is to identify worst-case "attack line" by computing the largest eigenvalue of the empirical covariance matrix, and to remove extremal points along such lines to reduce variance. DKKLMS16 DKKLMS17 show that, for [math]n=\Omega(d/\varepsilon^2)[/math], this yields [math]O(\varepsilon \sqrt{\log(1/\varepsilon)})[/math]-error with high probability for sub-Gaussian inliers, and [math]O(\sigma \sqrt{\varepsilon})[/math] for inliners whose true covariance matrix [math]\Sigma[/math] is such that [math]\sigma^2 I - \Sigma[/math] is semidefinite positive. Note however, that all the papers of this section seem to strongly rely on the knowledge of the covariance matrix of inliers by the algorithm.

The asymptotical optimal bound [math]O(\varepsilon)[/math] has been achieved a more sophisticated filtering polynomial-time algorithm by DKKLM+18 for Gaussian distribution in the additive poisoning model, while DKS17 showed that no polynomial-time can achieve better than [math]O(\varepsilon \sqrt{\log(1/\varepsilon)})[/math] in the Statistical Query Model with strong poisoning.

Quasi-linear-time robust mean estimators have been designed by CDG19, i.e. with [math]\tilde O(nd)[/math] up to logarithmic factors, based on filtering methods of extremal points on variance-maximizing directions.

Note that all such results can be applied to robust linear regression, by applying robust mean estimator to gradient descent estimator (with the mean taken over data points), assuming that the covariance matrix of the distribution of features [math]x[/math] is known, and that the noise is of known mean and variance. Robust covariance matrix estimation can also be addressed by the framework, as it too can be regarded as a robust mean estimation problem (the fourth moment then needs to be assumed to be upper-bounded).

In all such works though, the covariance matrix of inliers is assumed to be known. This is critical as part of the algorithm requires rescaling coordinates so that the rescaled covariance is the identity matrix. Can we design an efficient robust mean estimator algorithm for bounded but unknown covariance?

What if there are more outliers than inliers?

Another puzzling question concerns the setting where outliers may be more numerous than inliers. A simple argument shows that methods cited above just won't work. What can be done for such a setting? Can we still learn something from the data, or do we have to throw away the dataset altogether? Let's cite three approaches that may be fruitful.

First, BBV08 introduced list-decodable learning, which consists in returning several hypothesis, and DKS18 provided polynomial-time for robust list-decodable mean estimation.

Second, one might try to a apply a robust Bayesian inference to the data, which would yield a set of posterior beliefs. This framework has yet to be defined.

Third, we may assume that the data are more or less certified. This is a natural setting, as we humans often judge the reliability of raw data depending on its source, and we usually consider a continuum of reliability, rather than a clear cut binary classification. Algorithms should probably do the same at some point, but there does not yet seem to be an algorithmic framework to pose this problem.

Robust statistics for neural networks

The main application of robust statistics (at least relevant to AI ethics) seems to be the aggregation of stochastic gradients for neural networks. In this setting, even a linear-time algorithm in [math]O(nd)[/math] is impractical if we demand [math]n \geq d[/math] (which is necessary to have dimension-independent guarantees). In practice, this setting is often carried out with batches whose size [math]n[/math] is significantly smaller than [math]d[/math]. Is there a gain in using algorithms more complex than coordinate-wise median?

BMGS17 proposed Krum and multi-Krum, aggregation algorithms for this setting that have weaker robustness guarantees but are more efficient. Is it possible to improve upon them?