Difference between revisions of "Preference learning from comparisons"

From RB Wiki
Line 17: Line 17:
 
== Markov chain for preference learning ==
 
== Markov chain for preference learning ==
  
Note that in the Luce model, the log-likelihood <math>\log \mathbb P[\mathcal D|\theta] = \sum \left( \theta_i - \log \sum_{(j \succ i) \in \mathcal D} \exp \theta_j \right)</math> is a strictly concave function in <math>\theta</math> if the comparison graph is strongly connected (for any <math>i,j</math>, there exists <math>k_1, ... k_m</math> such that there are some of the data that says <math>i \succ k_1 \succ k_2 \succ ... \succ k_m \succ j</math>).
+
[http://papers.nips.cc/paper/4701-iterative-ranking-from-pair-wise-comparisons.pdf NOS][https://dblp.org/rec/bibtex/conf/nips/NegahbanOS12 12] proposed a Markov chain approach to compute the maximum likelihood estimator, in the same vein as the PageRank algorithm [http://ilpubs.stanford.edu:8090/422/1/1999-66.pdf PBMW][https://scholar.google.ch/scholar?hl=en&as_sdt=0%2C5&q=The+PageRank+citation+ranking%3A+Bringing+order+to+the+Web+page+brin+motwani+winograd&btnG= 99]. Namely, by choosing adequately the transition rates of the comparison graph, the stationary distribution of the Markov chain thereby constructed turns out to equal the vector <math>(\exp \theta_i^\star)</math>, where <math>\theta^\star</math> is the maximum likelihood estimator.
  
One approach to compute the maximum likelihood estimator (<math>\arg\max_\theta \mathbb P[\mathcal D|\theta]</math>) consists of computing the stationary distribution of a Markov chain on the <em>comparison graph</em>. This is a graph whose nodes are alternatives, and which contains a directed arrow from <math>i</math> to <math>j</math> if the user has once asserted that he prefers <math>i</math> to <math>j</math>.
+
Note however, that this approach is too data and computationally expensive if the set of alternatives is combinatorial. In fact, for most recommender systems, it may be inapplicable, since most users have never even been exposed to most of the video or music contents of platforms like YouTube or Spotify.
  
 
== Gaussian process ==
 
== Gaussian process ==

Revision as of 08:45, 3 February 2020

It has been argued that we humans are much more effective at comparing alternatives than at scoring them MaystrePhD18. Besides, implicit observations of humans mostly provide such choice data, in the form of like/no-like, share/no-share, click/no-click, and so on. Therefore, preference learning from comparisons seems to be an important approach to preference learning.

Classical models

The classical models for preference learning are due to Thurston27, Zermelo29, DavidBook63, BradleyTerry52, LuceBook59.

In these models, in a choice between 1 and 2, the human implicitly the scores [math]\theta_1[/math] and [math]\theta_2[/math] that she assigns to each alternative. But his computation of [math]\theta_1 - \theta_2[/math] is noisy, and is accompanied with some noise [math]\varepsilon[/math], to yield [math]x_{12} = \theta_1-\theta_2+\varepsilon[/math]. In the above models, the sign of [math]x_{12}[/math] then determines the choice of the human.

This approach allows to explain some of the inconsistencies in humans' decision-making. Equivalently, this corresponds to saying that the probability that the human chooses 1 over 2 is a function of the intrinsic difference [math]\theta_1-\theta_2[/math], which we can write [math]\mathbb P[1 \succ 2] = \mathbb P[x_{12}\gt 0] = \phi(\theta_1-\theta_2)[/math]. Intuitively, the greater the intrinsic difference between 1 and 2, the less likely it is for the human to say he prefers 2 to 1.

Different models assume different noise models for [math]\varepsilon[/math], or, equivalently, for the function [math]\phi[/math]. Thurstone's model assumes that [math]\varepsilon[/math] is normally distributed, which is equivalent to saying that [math]\phi(z) = \Phi(z/\sigma)[/math], where [math]\sigma^2[/math] is the variance of [math]\varepsilon[/math] (which may depend on the choice of 1 and 2), and where [math]\Phi[/math] is the cumulative density function of the standard normal distribution.

The Bradley-Terry model assumes that [math]\varepsilon[/math] follows a Gumbel distribution. Equivalently, it sets [math]\phi(z)= \frac{1}{1+\exp(-z)}[/math] (up to variance scaling). Luce's model generalizes the Bradley-Terry models, by considering choices among numerous alternatives and setting [math]\phi(z_2, z_3, ...)= \frac{1}{1+\exp(-z_2)+\exp(-z_3)+...}[/math], where this quantity is the probability of choosing option 1, and where [math]z_i = \theta_1-\theta_i[/math]. Interestingly, Luce proved that this framework was equivalent to demanding independence of irrelevant alternatives.

The inference problem is then the problem of inferring the values of parameters [math]\theta_i[/math] given observational data [math]\mathcal D[/math] of choices [math]i[/math] out of a set of alternatives [math]\mathcal A[/math]. Bayesian inference would suggest computing [math]\mathbb P[\theta|\mathcal D][/math]. But as often, this approach might be too computationally costly in practice.

Markov chain for preference learning

NOS12 proposed a Markov chain approach to compute the maximum likelihood estimator, in the same vein as the PageRank algorithm PBMW99. Namely, by choosing adequately the transition rates of the comparison graph, the stationary distribution of the Markov chain thereby constructed turns out to equal the vector [math](\exp \theta_i^\star)[/math], where [math]\theta^\star[/math] is the maximum likelihood estimator.

Note however, that this approach is too data and computationally expensive if the set of alternatives is combinatorial. In fact, for most recommender systems, it may be inapplicable, since most users have never even been exposed to most of the video or music contents of platforms like YouTube or Spotify.

Gaussian process

Connection to sports

Chess Elo78, football MKFG16.