Preference learning from comparisons


It has been argued that we humans are much more effective at comparing alternatives than at scoring them MaystrePhD18 RB5. Moreover, implicit observations of humans mostly provide such choice data, in the form of like/no-like, share/no-share, click/no-click, and so on. Preference learning from comparisons therefore seems to be an important approach to preference learning.

Classical models

The classical models for preference learning are due to Thurstone27, Zermelo29, Mosteller51, BradleyTerry52, LuceBook59, DavidBook63.

In these models, in a choice between 1 and 2, the human implicitly compares the scores [math]\theta_1[/math] and [math]\theta_2[/math] that they assign to each alternative. But their computation of [math]\theta_1 - \theta_2[/math] is noisy: it is perturbed by some noise [math]\varepsilon[/math], yielding [math]x_{12} = \theta_1-\theta_2+\varepsilon[/math]. In the above models, the sign of [math]x_{12}[/math] then determines the human's choice.

This approach explains some of the inconsistencies in human decision-making. Equivalently, it corresponds to saying that the probability that the human chooses 1 over 2 is a function of the intrinsic difference [math]\theta_1-\theta_2[/math], which we can write [math]\mathbb P[1 \succ 2] = \mathbb P[x_{12} > 0] = \phi(\theta_1-\theta_2)[/math]. Intuitively, the greater the intrinsic difference between 1 and 2, the less likely the human is to prefer 2 to 1.

Different models assume different noise models for [math]\varepsilon[/math] or, equivalently, different functions [math]\phi[/math]. Thurstone's model assumes that [math]\varepsilon[/math] is normally distributed, which is equivalent to saying that [math]\phi(z) = \Phi(z/\sigma)[/math], where [math]\sigma^2[/math] is the variance of [math]\varepsilon[/math] (which may depend on the choice of 1 and 2), and where [math]\Phi[/math] is the cumulative distribution function of the standard normal distribution.

The Bradley-Terry model assumes that [math]\varepsilon[/math] follows a logistic distribution (which arises when each score is perturbed by independent Gumbel noise). Equivalently, it sets [math]\phi(z)= \frac{1}{1+\exp(-z)}[/math] (up to variance scaling). Luce's model generalizes the Bradley-Terry model by considering choices among numerous alternatives and setting [math]\phi(z_2, z_3, \ldots)= \frac{1}{1+\exp(-z_2)+\exp(-z_3)+\ldots}[/math], where this quantity is the probability of choosing option 1, and where [math]z_i = \theta_1-\theta_i[/math]. Interestingly, Luce proved that this framework is equivalent to demanding independence of irrelevant alternatives.
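To make these definitions concrete, here is a minimal Python sketch of the three choice probabilities above (the score values are hypothetical, chosen only for illustration):

<syntaxhighlight lang="python">
import numpy as np
from scipy.stats import norm

def thurstone_prob(theta_1, theta_2, sigma=1.0):
    """P[1 > 2] under Thurstone's model: Phi((theta_1 - theta_2) / sigma)."""
    return norm.cdf((theta_1 - theta_2) / sigma)

def bradley_terry_prob(theta_1, theta_2):
    """P[1 > 2] under the Bradley-Terry model: logistic of the score difference."""
    return 1.0 / (1.0 + np.exp(-(theta_1 - theta_2)))

def luce_probs(thetas):
    """Choice probabilities among several alternatives under Luce's model (a softmax)."""
    z = np.exp(thetas - np.max(thetas))  # subtract the max for numerical stability
    return z / z.sum()

# Example: alternative 1 is intrinsically better by one unit.
print(thurstone_prob(1.0, 0.0))               # ~0.84
print(bradley_terry_prob(1.0, 0.0))           # ~0.73
print(luce_probs(np.array([1.0, 0.0, 0.0])))  # ~[0.58, 0.21, 0.21]
</syntaxhighlight>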

Inference

The inference problem is then that of inferring the values of the parameters [math]\theta_i[/math] given observational data [math]\mathcal D[/math] of choices [math]i[/math] out of sets of alternatives [math]\mathcal A[/math]. Bayesian inference would suggest computing [math]\mathbb P[\theta|\mathcal D][/math]. But, as is often the case, this approach might be too computationally costly in practice, so approximate Bayesian methods are needed.

Note that in the Luce model, the log-likelihood [math]\log \mathbb P[\mathcal D|\theta] = \sum_{(i, \mathcal A) \in \mathcal D} \left( \theta_i - \log \sum_{j \in \mathcal A} \exp \theta_j \right)[/math] is a concave function of [math]\theta[/math], and it is strictly concave (up to the additive degree of freedom, which can be removed by a normalization such as [math]\sum_i \theta_i = 0[/math]) if the comparison graph is strongly connected, i.e. for any [math]i,j[/math], there exist [math]k_1, \ldots, k_m[/math] such that the data contain the comparisons [math]i \succ k_1 \succ k_2 \succ \ldots \succ k_m \succ j[/math]. This proves the existence and uniqueness of the maximum likelihood estimator under this assumption.
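Since the objective is concave, the maximum likelihood estimator can be found by simple gradient ascent. Below is a minimal numpy sketch (the comparison data are hypothetical), which fixes the additive degree of freedom by centering [math]\theta[/math]:

<syntaxhighlight lang="python">
import numpy as np

def luce_log_likelihood(theta, data):
    """data: list of (i, A) pairs, the chosen index i and the tuple A of alternatives."""
    ll = 0.0
    for i, A in data:
        z = theta[list(A)]
        ll += theta[i] - (np.max(z) + np.log(np.sum(np.exp(z - np.max(z)))))
    return ll

def luce_mle(n, data, lr=0.1, n_iters=2000):
    """Gradient ascent on the (concave) Luce log-likelihood."""
    theta = np.zeros(n)
    for _ in range(n_iters):
        grad = np.zeros(n)
        for i, A in data:
            A = list(A)
            p = np.exp(theta[A] - np.max(theta[A]))
            p /= p.sum()          # Luce choice probabilities within the set A
            grad[i] += 1.0        # gradient of theta_i
            grad[A] -= p          # gradient of the log-sum-exp term
        theta += lr * grad / len(data)
        theta -= theta.mean()     # fix the additive degree of freedom
    return theta

# Hypothetical choice data; the comparison graph is strongly connected.
data = [(0, (0, 1)), (0, (0, 2)), (1, (0, 1)),
        (1, (1, 2)), (2, (1, 2)), (0, (0, 1, 2))]
theta_hat = luce_mle(3, data)
print(theta_hat, luce_log_likelihood(theta_hat, data))
</syntaxhighlight>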

Markov chain for preference learning

NOS12 proposed a Markov chain approach for computing the maximum likelihood estimator, in the same vein as the PageRank algorithm PBMW99. Namely, by choosing the transition rates of the comparison graph adequately, the stationary distribution of the Markov chain thereby constructed turns out to be proportional to the vector [math](\exp \theta_i^\star)[/math], where [math]\theta^\star[/math] is the maximum likelihood estimator.
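Below is a sketch of this construction in the spirit of NOS12 (the win counts are hypothetical, and details such as the degree normalization follow one common presentation rather than the paper's exact algorithm): transitions flow from losers to winners, and the stationary distribution is obtained by power iteration.

<syntaxhighlight lang="python">
import numpy as np

def rank_centrality(wins, n_iters=1000):
    """wins[i, j] = number of times i was chosen over j.
    Returns the stationary distribution of the comparison Markov chain."""
    n = wins.shape[0]
    totals = wins + wins.T  # number of comparisons between each pair
    d_max = np.count_nonzero(totals, axis=1).max()  # max comparison degree
    P = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i != j and totals[i, j] > 0:
                # move from i to j proportionally to how often j beat i
                P[i, j] = wins[j, i] / totals[i, j] / d_max
        P[i, i] = 1.0 - P[i].sum()  # self-loop absorbs the remaining mass
    pi = np.full(n, 1.0 / n)
    for _ in range(n_iters):  # power iteration: pi converges to pi @ P's fixed point
        pi = pi @ P
    return pi / pi.sum()

# Hypothetical win counts among 3 alternatives.
wins = np.array([[0, 8, 6],
                 [2, 0, 7],
                 [4, 3, 0]])
print(rank_centrality(wins))
</syntaxhighlight>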

Note, however, that this approach requires too much data and computation when the set of alternatives is combinatorial. In fact, for most recommender systems, it may be inapplicable, since most users have never even been exposed to most of the video or music contents of platforms like YouTube or Spotify.

Gaussian process

Perhaps one of the most promising avenues to scalable comparison-based preference learning consists of assuming that, a priori, the scoring function [math]\theta[/math] is a Gaussian process, so that the intrinsic preference [math]\theta(z_1)-\theta(z_2)[/math] between two alternatives [math]z_1[/math] and [math]z_2[/math] is itself a Gaussian process over pairs. By then exploiting the similarity between the dilemma [math]x = (z_1,z_2)[/math] and the dilemma [math]x' = (z_1',z_2')[/math], we can generalize from observed data by means of a kernel method.

To implement this approach, we first need a measure of the similarity between two choices [math]x[/math] and [math]x'[/math]. Intuitively, [math]x[/math] and [math]x'[/math] will be similar if both [math]z_1 \approx z_1'[/math] and [math]z_2 \approx z_2'[/math]. Note that [math]x[/math] and [math]x'[/math] can also be "anti-similar", if [math]z_1 \approx z_2'[/math] and [math]z_2 \approx z_1'[/math]. Let us call [math]K(x,x')[/math] the similarity between [math]x[/math] and [math]x'[/math]. A Gaussian process prior and Bayesian methods then make it possible to infer, from revealed preferences, a posterior over the preferences for other dilemmas ChuGhahramani05a ChuGhahramani05b.
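One natural choice, consistent with the Gaussian process assumption above, derives [math]K[/math] from an item-level kernel [math]k[/math]: the prior covariance of [math]\theta(z_1)-\theta(z_2)[/math] and [math]\theta(z_1')-\theta(z_2')[/math] is [math]k(z_1,z_1') + k(z_2,z_2') - k(z_1,z_2') - k(z_2,z_1')[/math], which is positive for similar dilemmas and negative for anti-similar ones. A minimal sketch (the RBF kernel and the toy alternatives are illustrative assumptions):

<syntaxhighlight lang="python">
import numpy as np

def rbf(z, zp, lengthscale=1.0):
    """Item-level RBF kernel: prior covariance between theta(z) and theta(z')."""
    d2 = np.sum((np.asarray(z) - np.asarray(zp)) ** 2)
    return np.exp(-d2 / (2 * lengthscale ** 2))

def preference_kernel(x, xp):
    """Covariance between the preferences theta(z1)-theta(z2) and theta(z1')-theta(z2'),
    when theta is a Gaussian process with kernel rbf."""
    (z1, z2), (z1p, z2p) = x, xp
    return rbf(z1, z1p) + rbf(z2, z2p) - rbf(z1, z2p) - rbf(z2, z1p)

# Two dilemmas over hypothetical 2-dimensional alternatives.
x      = ([0.0, 0.0], [1.0, 0.0])
x_sim  = ([0.1, 0.0], [0.9, 0.0])   # similar dilemma: K > 0
x_anti = ([0.9, 0.0], [0.1, 0.0])   # swapped dilemma: K < 0
print(preference_kernel(x, x_sim), preference_kernel(x, x_anti))
</syntaxhighlight>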

Note that such kernels can be approximately implemented by a vector representation, for instance one computed by a deep neural network [math]f[/math]. We could then have [math]K(x,x') = f(x)^T f(x')[/math] (anti-similarity is naturally enforced if [math]f(z_1,z_2) = - f(z_2,z_1)[/math], which is guaranteed if [math]f(z_1,z_2) = g(z_1)-g(z_2)[/math] for some other vector representation [math]g[/math]). Such vector representations can accelerate computation, in contrast with the kernel method, which requires going through all the training data to make a prediction.
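A minimal sketch of this construction follows (the random-feature map [math]g[/math] stands in for a trained network and is purely illustrative):

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(16, 3))  # a stand-in for a learned network's weights

def g(z):
    """Hypothetical vector representation of an alternative."""
    return np.tanh(W @ np.asarray(z))

def f(z1, z2):
    """Dilemma representation; antisymmetric by construction: f(z1,z2) = -f(z2,z1)."""
    return g(z1) - g(z2)

def K(x, xp):
    """Similarity between dilemmas as an inner product of representations."""
    return f(*x) @ f(*xp)

x = ([0.0, 0.0, 1.0], [1.0, 0.0, 0.0])
print(K(x, x))             # positive self-similarity
print(K(x, (x[1], x[0])))  # equals -K(x, x): anti-similarity under swapping
</syntaxhighlight>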

Connection to sports

These models have natural applications in sports, for instance to chess ratings Elo78 and to football MKFG16.
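As an illustration of this connection, here is a sketch of the Elo rating update (the K-factor and ratings are illustrative); the expected score is a Bradley-Terry-style probability written in base 10:

<syntaxhighlight lang="python">
def elo_update(r_a, r_b, score_a, k=32.0):
    """One Elo update after a game between A and B.
    score_a is 1 for a win of A, 0.5 for a draw, 0 for a loss."""
    expected_a = 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))  # Bradley-Terry in base 10
    r_a_new = r_a + k * (score_a - expected_a)
    r_b_new = r_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return r_a_new, r_b_new

# A 1600-rated player beats a 1500-rated player: ratings move by ~ +11.5 / -11.5.
print(elo_update(1600.0, 1500.0, 1.0))
</syntaxhighlight>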