Preference learning from comparisons

It has been argued that we humans are much more effective at comparing alternatives than at scoring them MaystrePhD18. Moreover, implicit observations of humans mostly provide such choice data, in the form of like/no-like, share/no-share, click/no-click, and so on. Preference learning from comparisons therefore seems to be an important approach to preference learning.

Classical models

The classical models for preference learning are due to Thurston27, Zermelo29, DavidBook63, BradleyTerry52, LuceBook59.

In these models, in a choice between alternatives 1 and 2, the human implicitly compares the scores [math]\theta_1[/math] and [math]\theta_2[/math] that she assigns to each alternative. But her computation of [math]\theta_1 - \theta_2[/math] is noisy: it is perturbed by some noise [math]\varepsilon[/math], yielding [math]x_{12} = \theta_1-\theta_2+\varepsilon[/math]. In the above models, the sign of [math]x_{12}[/math] then determines the choice of the human.

This approach makes it possible to explain some of the inconsistencies in humans' decision-making. Equivalently, it corresponds to saying that the probability that the human chooses 1 over 2 is a function of the intrinsic difference [math]\theta_1-\theta_2[/math], which we can write [math]\mathbb P[1 \succ 2] = \mathbb P[x_{12}\gt 0] = \phi(\theta_1-\theta_2)[/math]. Intuitively, the greater the intrinsic difference between 1 and 2, the less likely it is for the human to say she prefers 2 to 1.

Different models assume different noise models for [math]\varepsilon[/math], or, equivalently, different functions [math]\phi[/math]. Thurstone's model assumes that [math]\varepsilon[/math] is normally distributed, which is equivalent to saying that [math]\phi(z) = \Phi(z/\sigma)[/math], where [math]\sigma^2[/math] is the variance of [math]\varepsilon[/math] (which may depend on the choice of 1 and 2), and where [math]\Phi[/math] is the cumulative distribution function of the standard normal distribution.
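As an illustration, here is a minimal Python sketch of Thurstone's choice probability (the function name is our own, chosen only for illustration):

<syntaxhighlight lang="python">
# Minimal sketch (illustrative only): Thurstone choice probability
# P[1 > 2] = Phi((theta_1 - theta_2) / sigma), with Phi the standard normal CDF.
import math

def thurstone_prob(theta_1: float, theta_2: float, sigma: float = 1.0) -> float:
    """Probability of choosing alternative 1 over alternative 2 under Thurstone's model."""
    z = (theta_1 - theta_2) / sigma
    # Standard normal CDF expressed via the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# A larger intrinsic difference makes the stated preference more reliable.
print(thurstone_prob(1.0, 0.0))   # ~0.84
print(thurstone_prob(3.0, 0.0))   # ~0.999
</syntaxhighlight>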

The Bradley-Terry model assumes that each alternative's perceived score is perturbed by independent Gumbel noise, so that the noise [math]\varepsilon[/math] on the difference follows a logistic distribution. Equivalently, it sets [math]\phi(z)= \frac{1}{1+\exp(-z)}[/math] (up to variance scaling). Luce's model generalizes the Bradley-Terry model by considering choices among numerous alternatives and setting [math]\phi(z_2, z_3, \ldots)= \frac{1}{1+\exp(-z_2)+\exp(-z_3)+\ldots}[/math], where this quantity is the probability of choosing option 1, and where [math]z_i = \theta_1-\theta_i[/math]. Interestingly, Luce proved that this framework is equivalent to demanding independence of irrelevant alternatives.
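The corresponding choice probabilities can be computed directly; here is a minimal sketch (again with illustrative function names), which reduces to Bradley-Terry when there are only two alternatives:

<syntaxhighlight lang="python">
# Minimal sketch (illustrative only): Bradley-Terry and Luce choice probabilities.
import math

def bradley_terry_prob(theta_1: float, theta_2: float) -> float:
    """P[1 > 2] = 1 / (1 + exp(-(theta_1 - theta_2)))."""
    return 1.0 / (1.0 + math.exp(-(theta_1 - theta_2)))

def luce_prob(thetas: list[float], chosen: int) -> float:
    """Probability of choosing alternative `chosen` among all alternatives in `thetas`."""
    weights = [math.exp(t) for t in thetas]
    return weights[chosen] / sum(weights)

# With two alternatives, the Luce model reduces to Bradley-Terry.
print(bradley_terry_prob(1.0, 0.0))      # ~0.731
print(luce_prob([1.0, 0.0], chosen=0))   # same value
print(luce_prob([1.0, 0.0, 2.0], 0))     # choosing alternative 0 among three alternatives
</syntaxhighlight>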

The inference problem is then that of inferring the values of the parameters [math]\theta_i[/math] given observational data [math]\mathcal D[/math] of choices [math]i[/math] out of a set of alternatives [math]\mathcal A[/math]. Bayesian inference would suggest computing [math]\mathbb P[\theta|\mathcal D][/math]. But, as is often the case, this approach may be too computationally costly in practice.

Markov chain for preference learning

Note that in the Luce model, the log-likelihood is [math]\log \mathbb P[\mathcal D|\theta] = \sum_{(i, \mathcal A) \in \mathcal D} \left( \theta_i - \log \sum_{j \in \mathcal A} \exp \theta_j \right)[/math], where each observation in [math]\mathcal D[/math] records the choice of alternative [math]i[/math] out of a proposed set of alternatives [math]\mathcal A[/math]. This function is concave in [math]\theta[/math], and its maximizer is unique up to adding a common constant to all [math]\theta_i[/math] whenever the comparison graph is strongly connected (for any [math]i,j[/math], there exist [math]k_1, \ldots, k_m[/math] such that the data contains the comparisons [math]i \succ k_1 \succ k_2 \succ \ldots \succ k_m \succ j[/math]).
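For pairwise data, this log-likelihood can be evaluated directly. Below is a minimal sketch for the Bradley-Terry special case, assuming (for illustration only) that the data is given as a list of (winner, loser) index pairs:

<syntaxhighlight lang="python">
# Minimal sketch (illustrative only): log-likelihood of pairwise comparison data
# under the Bradley-Terry special case of the Luce model.
import math

def log_likelihood(theta: list[float], data: list[tuple[int, int]]) -> float:
    total = 0.0
    for winner, loser in data:
        # log P[winner > loser] = theta_winner - log(exp(theta_winner) + exp(theta_loser))
        total += theta[winner] - math.log(math.exp(theta[winner]) + math.exp(theta[loser]))
    return total

# Example: alternative 0 usually beats 1, which usually beats 2.
data = [(0, 1), (0, 1), (1, 0), (1, 2), (1, 2), (2, 1)]
print(log_likelihood([1.0, 0.5, 0.0], data))
</syntaxhighlight>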

One approach to computing the maximum-likelihood estimator ([math]\arg\max_\theta \mathbb P[\mathcal D|\theta][/math]) consists of computing the stationary distribution of a Markov chain on the comparison graph. This is the graph whose nodes are the alternatives, and which contains a directed edge from [math]i[/math] to [math]j[/math] if the user has at least once asserted that she prefers [math]i[/math] to [math]j[/math].
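Here is a minimal sketch of one such construction. It is an illustrative choice of chain, not necessarily the exact one used in the literature: probability mass flows from the loser to the winner of each observed comparison, with a small smoothing term to keep the chain irreducible, so that the stationary distribution concentrates on frequently preferred alternatives.

<syntaxhighlight lang="python">
# Minimal sketch (one possible construction, illustrative only): score alternatives
# via the stationary distribution of a Markov chain built from pairwise comparisons.
import numpy as np

def stationary_scores(n: int, data: list[tuple[int, int]], epsilon: float = 1e-2) -> np.ndarray:
    """n alternatives; data is a list of (winner, loser) pairs."""
    # Transition counts: from the loser to the winner of each observed comparison.
    counts = np.full((n, n), epsilon)   # small smoothing keeps the chain irreducible
    for winner, loser in data:
        counts[loser, winner] += 1.0
    transition = counts / counts.sum(axis=1, keepdims=True)   # row-stochastic matrix
    # Power iteration to approximate the stationary distribution.
    pi = np.full(n, 1.0 / n)
    for _ in range(1000):
        pi = pi @ transition
    return pi / pi.sum()

data = [(0, 1), (0, 1), (1, 0), (1, 2), (1, 2), (2, 1)]
print(stationary_scores(3, data))   # highest mass on the most frequently preferred alternative
</syntaxhighlight>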

Gaussian process

Connection to sports

These models underpin rating systems in sports, e.g. chess Elo78 and football MKFG16.
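For instance, the Elo rating system updates chess ratings online using a logistic win probability, much like the Bradley-Terry model rescaled to base 10. A minimal sketch of the update:

<syntaxhighlight lang="python">
# Minimal sketch (illustrative only): an Elo-style online update, which uses the
# same logistic choice probability as the Bradley-Terry model (rescaled to base 10).
def elo_update(rating_winner: float, rating_loser: float, k: float = 32.0) -> tuple[float, float]:
    expected_win = 1.0 / (1.0 + 10.0 ** ((rating_loser - rating_winner) / 400.0))
    rating_winner += k * (1.0 - expected_win)
    rating_loser -= k * (1.0 - expected_win)
    return rating_winner, rating_loser

print(elo_update(1600.0, 1500.0))   # the winner gains fewer points when already favored
</syntaxhighlight>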