Bayesianism

From RB Wiki
Revision as of 08:44, 4 February 2020 by Lê Nguyên Hoang (talk | contribs)
(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)

Bayesianism is an epistemology that consists of relying solely on the laws of probability to reason and to make predictions. Introduced by Laplace, this philosophy is discussed at length in Hoang20. Other great introductions to Bayesianism include McGrayne11, Bram15 and 3Blue1Brown19. Its ultimate form is often argued to be Solomonoff's induction.

Bayes rule

The most crucial equation of Bayesianism is Bayes rule. Bayes rule says that [math]\mathbb P[T|D] = \frac{\mathbb P[D|T] \mathbb P[T]}{\mathbb P[D|T] \mathbb P[T] + \sum_{A \neq T} \mathbb P[D|A] \mathbb P[A]}[/math].

In words, this corresponds to saying that the credence of a theory [math]T[/math] given data [math]D[/math] depends on three factors: (1) its prior credence [math]T[/math], (2) the ability of theory [math]T[/math] to predict data [math]D[/math] (confusingly called the likelihood), and (3) how it fares compared to alternative theories [math]A[/math].

Put differently, a theory [math]T[/math] is all the more credible if (1) it sounded credible before we learned about data [math]D[/math], (2) it fits the data [math]D[/math] well, and (3) no alternative does it as well.

Note that (2) implies that [math]T[/math] must be a probabilistically predictive theory. In math terms, it should be a probability measure over the set of data.

Controversy

Point (2) seems relatively straightforward. This is arguably why some statisticians prefer to focus solely on (2), and to ignore (1) and (3). Such statisticians are sometimes called frequentists.

Extensive discussions have dwelt on points (1) and (3). It has been argued that there is too much subjectivity regarding (1) and (3), which has often been used as an argument to reject Bayesianism altogether.

There are, however, two ways to defend Bayesianism. First, one could say that Bayesianism is not complete. It is just a way to move from prior assumptions to better states of knowledge. Interestingly, the update rules of Bayesianism have numerous desirable properties in this regard, such as the Dutch book argument, the absence of confirmation bias, Bayesian debating and statistical admissibility.

Another line of defense of Bayesianism consists essentially in building upon the Church-Turing thesis to restrict ourselves only to computable probability measures. It turns out that there is then a natural probability measure over (prefix-free) computable probability measures, called the Solomonoff universal prior. This prior solves problems (1) and (3).

Unfortunately, the prior is not quite canonical. It depends on a base universal Turing machine. But because of the existence of compilers, one can show that any two universal priors are quite "similar" (this is the same problem as that of Solomonoff-Kolmogorov complexity). Solomonoff's completeness theorem is then a very compelling argument in favor of Solomonoff Bayesianism.

Intuitive use

Unfortunately, Bayes rule is impossible to apply in practice, since it demands unreasonable amounts of computation (see Solomonoff's incompleteness). However, assuming that it is what we would want to apply, assuming we had infinite computational power, we can draw inspiration from it to design heuristic approximations of Bayes rule.

From a human rationality perspective, the key trick to perform Bayes rule approximation is to bet Galef15, Duke17. Bets force us to express and be aware of priors. Then, the results of the bets should force us to update our beliefs. In particular, surprising results call for potentially larger updates. Thus, as a second rule of thumb, to apply Bayesianism, one must pay great attention to their own surprise.

Bayesian approximations

There are numerous proposals to perform pragmatic efficient approximations of Bayes rule. They arguably fall in the following categories.

One can restrict oneself to a small number of theories that compete. Note that this is roughly what is done by multiplicative weights update AHK12 Wandida16a Wandida16b (and variants like boosting). Unfortunately, this only works if we consider at most [math]\sim 10^9[/math] theories. In practice, parameterized theories come with a combinatorial number of variants. Other methods are needed.

Another approach is to choose well-behaved classes of probability distributions, like conjugate priors RaiffaSchlaifer62. One can then fit a probability distribution to the most similar probability distribution of that class. This is known as a variational method. Unfortunately, in practice, such well-behaved classes fail to describe efficiently many datasets.

One possibility then is to only compute the most credible theory. This is known as maximum a posteriori. Interestingly, optimization problems are often nicely addressed by efficient algorithms (at least if the objective function is convex). However, by discarding all theories but one, maximum a posteriori exposes us to overfitting.

A fourth approach addresses the case where even the likelihood term cannot be computed efficiently, typically because the theory T is designed to be sampled (it is easy to draw data D that follow the distribution defined by T), but not to compute the probability of some data it is given. This case requires a likelihood-free method. The likelihood is then typically replaced by a proxy of it. A generative adversarial network (GAN) can be interpreted to work this way.

Finally, some people (cite!) have proposed to extend the definition of probability, and to work with pseudo-probability distributions. Such distributions may allow for more tractable algorithms, and return correct answers when the pseudo-probabilities turn out to also be probabilities.