Preference learning from comparisons


It has been argued that we humans are much more reliable at comparing alternatives than at scoring them MaystrePhD18. Moreover, implicit observations of humans mostly provide such choice data, in the form of like/no-like, share/no-share, click/no-click, and so on. Preference learning from comparisons therefore seems to be an important approach to preference learning.

Classical models

The classical models for preference learning are due to Thurston27, Zermelo29, DavidBook63, BradleyTerry52, LuceBook59.

In these models, in a choice between 1 and 2, the human implicitly compares the scores [math]\theta_1[/math] and [math]\theta_2[/math] that she assigns to each alternative. But her computation of [math]\theta_1 - \theta_2[/math] is noisy: it is perturbed by some noise [math]\varepsilon[/math], yielding [math]x_{12} = \theta_1-\theta_2+\varepsilon[/math]. In the above models, the sign of [math]x_{12}[/math] then determines the human's choice.

This approach explains some of the inconsistencies in human decision-making. Equivalently, it says that the probability that the human chooses 1 over 2 is a function of the intrinsic difference [math]\theta_1-\theta_2[/math], which we can write [math]\mathbb P[1 \succ 2] = \mathbb P[x_{12}\gt 0] = \phi(\theta_1-\theta_2)[/math]. Intuitively, the greater the intrinsic difference between 1 and 2, the less likely the human is to say she prefers 2 to 1.

Different models assume different noise distributions for [math]\varepsilon[/math] or, equivalently, different functions [math]\phi[/math]. Thurstone's model assumes that [math]\varepsilon[/math] is normally distributed, which is equivalent to saying that [math]\phi(z) = \Phi(z/\sigma)[/math], where [math]\sigma^2[/math] is the variance of [math]\varepsilon[/math] (which may depend on the choice of 1 and 2), and where [math]\Phi[/math] is the cumulative distribution function of the standard normal distribution.
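
To make this concrete, here is a minimal Python sketch (assuming numpy and scipy; the scores and [math]\sigma[/math] below are illustrative) that simulates Thurstonian comparisons and checks the empirical choice frequency against [math]\Phi((\theta_1-\theta_2)/\sigma)[/math].

<syntaxhighlight lang="python">
import numpy as np
from scipy.stats import norm

def thurstone_choice_prob(theta_1, theta_2, sigma=1.0):
    """P[1 > 2] under Thurstone's model with Gaussian noise of std sigma."""
    return norm.cdf((theta_1 - theta_2) / sigma)

rng = np.random.default_rng(0)
theta_1, theta_2, sigma = 1.0, 0.4, 1.0

# Simulate noisy score differences; the sign of x_12 determines the choice.
x_12 = theta_1 - theta_2 + sigma * rng.standard_normal(100_000)
print((x_12 > 0).mean())                        # empirical choice frequency
print(thurstone_choice_prob(theta_1, theta_2))  # Phi(0.6), about 0.726
</syntaxhighlight>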

The Bradley-Terry model assumes that [math]\varepsilon[/math] follows a logistic distribution (equivalently, that the score of each alternative is perturbed by independent Gumbel noise). This amounts to setting [math]\phi(z)= \frac{1}{1+\exp(-z)}[/math] (up to variance scaling). Luce's model generalizes the Bradley-Terry model by considering choices among numerous alternatives and setting [math]\phi(z_2, z_3, \ldots)= \frac{1}{1+\exp(-z_2)+\exp(-z_3)+\ldots}[/math], where this quantity is the probability of choosing option 1, and where [math]z_i = \theta_1-\theta_i[/math]. Interestingly, Luce proved that this framework is equivalent to demanding independence of irrelevant alternatives.
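
The corresponding choice probabilities are easy to compute directly. The following sketch (illustrative scores, numpy assumed) shows that the Bradley-Terry probability is the logistic function of the score difference, while Luce's choice probabilities amount to a softmax of the scores.

<syntaxhighlight lang="python">
import numpy as np

def bradley_terry_prob(theta_1, theta_2):
    """P[1 > 2] under Bradley-Terry: the logistic function of theta_1 - theta_2."""
    return 1.0 / (1.0 + np.exp(-(theta_1 - theta_2)))

def luce_choice_probs(theta):
    """Luce's choice probabilities over all alternatives: a softmax of the scores."""
    w = np.exp(theta - np.max(theta))  # subtract the max for numerical stability
    return w / w.sum()

theta = np.array([1.0, 0.4, -0.2])             # illustrative scores
print(bradley_terry_prob(theta[0], theta[1]))  # pairwise choice probability
print(luce_choice_probs(theta))                # entry 0 is P[choose 1]; sums to 1
</syntaxhighlight>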

Inference

The inference problem is then to infer the values of the parameters [math]\theta_i[/math] given observational data [math]\mathcal D[/math] of choices [math]i[/math] out of sets of alternatives [math]\mathcal A[/math]. Bayesian inference would suggest computing [math]\mathbb P[\theta|\mathcal D][/math]. But, as is often the case, this approach may be too computationally costly in practice, so approximate Bayesian methods are needed.
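
For illustration, here is a minimal random-walk Metropolis sketch, one such approximate method, under a Bradley-Terry likelihood with a standard normal prior on [math]\theta[/math]; the data, proposal scale, and iteration counts are all illustrative, not canonical.

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
data = [(0, 1), (0, 2), (1, 2), (2, 0)]  # (winner, loser) pairs; illustrative
n_items = 3

def log_posterior(theta):
    # Standard normal prior plus the Bradley-Terry log-likelihood.
    lp = -0.5 * np.sum(theta ** 2)
    for i, j in data:
        lp -= np.log1p(np.exp(-(theta[i] - theta[j])))
    return lp

# Random-walk Metropolis: a simple (if slow) approximate Bayesian method.
theta = np.zeros(n_items)
samples = []
for _ in range(20_000):
    proposal = theta + 0.3 * rng.standard_normal(n_items)
    if np.log(rng.random()) < log_posterior(proposal) - log_posterior(theta):
        theta = proposal
    samples.append(theta.copy())
print(np.mean(samples[5_000:], axis=0))  # posterior means after burn-in
</syntaxhighlight>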

Note that in the Luce model, the log-likelihood [math]\log \mathbb P[\mathcal D|\theta] = \sum_{(i, \mathcal A) \in \mathcal D} \left( \theta_i - \log \sum_{j \in \mathcal A} \exp \theta_j \right)[/math], where [math](i, \mathcal A) \in \mathcal D[/math] denotes an observed choice of [math]i[/math] out of the set of alternatives [math]\mathcal A[/math], is a concave function of [math]\theta[/math]. It is strictly concave on the subspace where the [math]\theta_i[/math] sum to zero (the likelihood only depends on score differences) if the comparison graph is strongly connected, i.e. for any [math]i,j[/math], there exist [math]k_1, \ldots, k_m[/math] such that the data contains the comparisons [math]i \succ k_1, k_1 \succ k_2, \ldots, k_m \succ j[/math]. This proves the existence and uniqueness (up to a common additive constant) of the maximum likelihood estimator under this assumption.
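
Since this log-likelihood is concave, the maximum likelihood estimator can be computed by simple gradient ascent. Below is a minimal Python sketch for the pairwise (Bradley-Terry) special case, assuming numpy; the data, learning rate, and iteration count are illustrative, and [math]\theta[/math] is re-centered to sum to zero to remove the additive degree of freedom.

<syntaxhighlight lang="python">
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bt_mle(data, n_items, lr=0.1, n_steps=2_000):
    """Gradient ascent on the concave Bradley-Terry log-likelihood."""
    theta = np.zeros(n_items)
    for _ in range(n_steps):
        grad = np.zeros(n_items)
        for i, j in data:                     # observation: i was chosen over j
            p = sigmoid(theta[i] - theta[j])  # current model P[i > j]
            grad[i] += 1.0 - p
            grad[j] -= 1.0 - p
        theta += lr * grad
        theta -= theta.mean()  # re-center: the likelihood only sees differences
    return theta

# (winner, loser) pairs forming a strongly connected comparison graph.
data = [(0, 1), (0, 2), (1, 2), (2, 0), (0, 1)]
print(bt_mle(data, n_items=3))
</syntaxhighlight>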

Markov chain for preference learning

NOS12 proposed a Markov chain approach to compute the maximum likelihood estimator, in the same vein as the PageRank algorithm PBMW99. Namely, by choosing the transition rates on the comparison graph adequately, the stationary distribution of the Markov chain thereby constructed turns out to be proportional to the vector [math](\exp \theta_i^\star)[/math], where [math]\theta^\star[/math] is the maximum likelihood estimator.
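
Here is a rough sketch in the spirit of this approach (it follows NOS12's Rank Centrality only loosely; the win counts and the normalization by the maximum comparison degree are illustrative): transitions from [math]i[/math] to [math]j[/math] are proportional to how often [math]j[/math] beat [math]i[/math], and scores are read off the stationary distribution.

<syntaxhighlight lang="python">
import numpy as np

def rank_centrality(wins):
    """Stationary distribution of a comparison Markov chain,
    in the spirit of NOS12. wins[i, j] = number of times i beat j."""
    n = wins.shape[0]
    totals = wins + wins.T  # total comparisons between each pair
    # frac[i, j]: fraction of the i-vs-j comparisons that j won.
    frac = np.divide(wins.T, totals,
                     out=np.zeros((n, n)), where=totals > 0)
    d_max = np.max((totals > 0).sum(axis=1))   # maximum comparison degree
    P = frac / d_max  # transition i -> j proportional to j's wins over i
    np.fill_diagonal(P, 1.0 - P.sum(axis=1))   # lazy self-loops
    # Stationary distribution: left eigenvector of P for eigenvalue 1.
    vals, vecs = np.linalg.eig(P.T)
    pi = np.real(vecs[:, np.argmax(np.real(vals))])
    return pi / pi.sum()

wins = np.array([[0, 8, 6],
                 [2, 0, 7],
                 [4, 3, 0]])    # illustrative win counts
print(rank_centrality(wins))    # scores proportional to the exp(theta) estimates
</syntaxhighlight>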

Note, however, that this approach requires too much data and computation when the set of alternatives is combinatorial. In fact, for most recommender systems, it may be inapplicable, since most users have never even been exposed to most of the video or music content of platforms like YouTube or Spotify.

Gaussian process

Connection to sports

Chess Elo78, football MKFG16.