<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>https://robustlybeneficial.org/wiki/index.php?action=history&amp;feed=atom&amp;title=Preference_learning_from_comparisons</id>
	<title>Preference learning from comparisons - Revision history</title>
	<link rel="self" type="application/atom+xml" href="https://robustlybeneficial.org/wiki/index.php?action=history&amp;feed=atom&amp;title=Preference_learning_from_comparisons"/>
	<link rel="alternate" type="text/html" href="https://robustlybeneficial.org/wiki/index.php?title=Preference_learning_from_comparisons&amp;action=history"/>
	<updated>2026-04-28T13:28:20Z</updated>
	<subtitle>Revision history for this page on the wiki</subtitle>
	<generator>MediaWiki 1.34.0</generator>
	<entry>
		<id>https://robustlybeneficial.org/wiki/index.php?title=Preference_learning_from_comparisons&amp;diff=219&amp;oldid=prev</id>
		<title>Lê Nguyên Hoang at 22:34, 22 February 2020</title>
		<link rel="alternate" type="text/html" href="https://robustlybeneficial.org/wiki/index.php?title=Preference_learning_from_comparisons&amp;diff=219&amp;oldid=prev"/>
		<updated>2020-02-22T22:34:04Z</updated>

		<summary type="html">&lt;p&gt;&lt;/p&gt;
&lt;table class=&quot;diff diff-contentalign-left&quot; data-mw=&quot;interface&quot;&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;tr class=&quot;diff-title&quot; lang=&quot;en&quot;&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #222; text-align: center;&quot;&gt;← Older revision&lt;/td&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #222; text-align: center;&quot;&gt;Revision as of 22:34, 22 February 2020&lt;/td&gt;
				&lt;/tr&gt;&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l1&quot; &gt;Line 1:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 1:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;−&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;It has been argued that we humans are much more effective at comparing alternatives than at scoring them [https://infoscience.epfl.ch/record/255399/files/EPFL_TH8637.pdf MaystrePhD][https://scholar.google.ch/scholar?hl=en&amp;amp;as_sdt=0%2C5&amp;amp;q=Efficient+Learning+from+Comparisons+maystre&amp;amp;btnG= 18]. Besides, implicit observations of humans mostly provide such choice data, in the form of like/no-like, share/no-share, click/no-click, and so on. Therefore, preference learning from comparisons seems to be an important approach to preference learning.&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;It has been argued that we humans are much more effective at comparing alternatives than at scoring them [https://infoscience.epfl.ch/record/255399/files/EPFL_TH8637.pdf MaystrePhD][https://scholar.google.ch/scholar?hl=en&amp;amp;as_sdt=0%2C5&amp;amp;q=Efficient+Learning+from+Comparisons+maystre&amp;amp;btnG= 18&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;] [https://www.youtube.com/watch?v=bmD-myeu19Q RB5&lt;/ins&gt;]. Besides, implicit observations of humans mostly provide such choice data, in the form of like/no-like, share/no-share, click/no-click, and so on. Therefore, preference learning from comparisons seems to be an important approach to preference learning.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;== Classical models ==&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;== Classical models ==&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/table&gt;</summary>
		<author><name>Lê Nguyên Hoang</name></author>
		
	</entry>
	<entry>
		<id>https://robustlybeneficial.org/wiki/index.php?title=Preference_learning_from_comparisons&amp;diff=198&amp;oldid=prev</id>
		<title>Lê Nguyên Hoang: /* Gaussian process */</title>
		<link rel="alternate" type="text/html" href="https://robustlybeneficial.org/wiki/index.php?title=Preference_learning_from_comparisons&amp;diff=198&amp;oldid=prev"/>
		<updated>2020-02-04T14:21:56Z</updated>

		<summary type="html">&lt;p&gt;&lt;span dir=&quot;auto&quot;&gt;&lt;span class=&quot;autocomment&quot;&gt;Gaussian process&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;table class=&quot;diff diff-contentalign-left&quot; data-mw=&quot;interface&quot;&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;tr class=&quot;diff-title&quot; lang=&quot;en&quot;&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #222; text-align: center;&quot;&gt;← Older revision&lt;/td&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #222; text-align: center;&quot;&gt;Revision as of 14:21, 4 February 2020&lt;/td&gt;
				&lt;/tr&gt;&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l29&quot; &gt;Line 29:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 29:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;Perhaps one of the most promising avenues to scalable comparison-based preference learning consists of building upon the assumption that, a priori, the intrinsic preference &amp;lt;math&amp;gt;\theta_1-\theta_2&amp;lt;/math&amp;gt; between two alternatives &amp;lt;math&amp;gt;z_1&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;z_2&amp;lt;/math&amp;gt; is a Gaussian process. By then invoking the similarity between having to choose between &amp;lt;math&amp;gt;x = (z_1,z_2)&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;x' = (z_1',z_2')&amp;lt;/math&amp;gt;, we can then generalize from observed data by means of a [[kernel method]].&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;Perhaps one of the most promising avenues to scalable comparison-based preference learning consists of building upon the assumption that, a priori, the intrinsic preference &amp;lt;math&amp;gt;\theta_1-\theta_2&amp;lt;/math&amp;gt; between two alternatives &amp;lt;math&amp;gt;z_1&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;z_2&amp;lt;/math&amp;gt; is a Gaussian process. By then invoking the similarity between having to choose between &amp;lt;math&amp;gt;x = (z_1,z_2)&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;x' = (z_1',z_2')&amp;lt;/math&amp;gt;, we can then generalize from observed data by means of a [[kernel method]].&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;−&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;To implement this approach, we first need a measure of the similarity between two choices &amp;lt;math&amp;gt;x&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;x'&amp;lt;/math&amp;gt;. Intuitively, &amp;lt;math&amp;gt;x&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;x'&amp;lt;/math&amp;gt; will be similar if both &amp;lt;math&amp;gt;z_1 \approx z_1'&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;z_2 \approx z_2'&amp;lt;/math&amp;gt;. Note that &amp;lt;math&amp;gt;x&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;x'&amp;lt;/math&amp;gt; can also be &amp;quot;anti-similar&amp;quot;, if &amp;lt;math&amp;gt;z_1 \approx z_2'&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;z_2 \approx z_1'&amp;lt;/math&amp;gt;. Let us call &amp;lt;math&amp;gt;K(x,x')&amp;lt;/math&amp;gt; &lt;del class=&quot;diffchange diffchange-inline&quot;&gt;this property&lt;/del&gt;.&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;To implement this approach, we first need a measure of the similarity between two choices &amp;lt;math&amp;gt;x&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;x'&amp;lt;/math&amp;gt;. Intuitively, &amp;lt;math&amp;gt;x&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;x'&amp;lt;/math&amp;gt; will be similar if both &amp;lt;math&amp;gt;z_1 \approx z_1'&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;z_2 \approx z_2'&amp;lt;/math&amp;gt;. 
Note that &amp;lt;math&amp;gt;x&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;x'&amp;lt;/math&amp;gt; can also be &amp;quot;anti-similar&amp;quot;, if &amp;lt;math&amp;gt;z_1 \approx z_2'&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;z_2 \approx z_1'&amp;lt;/math&amp;gt;. Let us call &amp;lt;math&amp;gt;K(x,x')&amp;lt;/math&amp;gt; &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;the similarity between &amp;lt;math&amp;gt;x&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;x'&amp;lt;/math&amp;gt;. A Gaussian process prior and Bayesian inference then make it possible to infer, from revealed preferences, a posterior over the preferences for other dilemmas [https://dl.acm.org/doi/pdf/10.1145/1102351.1102369?download=true ChuGhahramani][https://dblp.org/rec/bibtex/conf/icml/ChuG05 05a] [http://www.jmlr.org/papers/volume6/chu05a/chu05a.pdf ChuGhahramani][https://dblp.org/rec/bibtex/journals/jmlr/ChuG05 05b]&lt;/ins&gt;.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;Note that such kernels can be approximately implemented by [[representational learning|vector representation]], for instance by some deep neural network &amp;lt;math&amp;gt;f&amp;lt;/math&amp;gt;. We could then have &amp;lt;math&amp;gt;K(x,x') = f(x)^Tf(x')&amp;lt;/math&amp;gt; (anti-similarity could be naturally enforced if &amp;lt;math&amp;gt;f(z_1,z_2) = - f(z_2,z_1)&amp;lt;/math&amp;gt;, which is guaranteed if &amp;lt;math&amp;gt;f(z_1,z_2) = g(z_1)-g(z_2)&amp;lt;/math&amp;gt; for some other vector representation &amp;lt;math&amp;gt;g&amp;lt;/math&amp;gt;). Such vector representations could help accelerate computation, as opposed to the kernel method, which requires going through all the training data to make a prediction.&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;Note that such kernels can be approximately implemented by [[representational learning|vector representation]], for instance by some deep neural network &amp;lt;math&amp;gt;f&amp;lt;/math&amp;gt;. We could then have &amp;lt;math&amp;gt;K(x,x') = f(x)^Tf(x')&amp;lt;/math&amp;gt; (anti-similarity could be naturally enforced if &amp;lt;math&amp;gt;f(z_1,z_2) = - f(z_2,z_1)&amp;lt;/math&amp;gt;, which is guaranteed if &amp;lt;math&amp;gt;f(z_1,z_2) = g(z_1)-g(z_2)&amp;lt;/math&amp;gt; for some other vector representation &amp;lt;math&amp;gt;g&amp;lt;/math&amp;gt;). 
Such vector representations could help accelerate computation, as opposed to the kernel method, which requires going through all the training data to make a prediction.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;−&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;&lt;/del&gt;&lt;/div&gt;&lt;/td&gt;&lt;td colspan=&quot;2&quot;&gt; &lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;−&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;NOTE: This section needs references.&lt;/del&gt;&lt;/div&gt;&lt;/td&gt;&lt;td colspan=&quot;2&quot;&gt; &lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;== Connection to sports ==&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;== Connection to sports ==&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;Chess [https://scholar.google.ch/scholar?hl=en&amp;amp;as_sdt=0%2C5&amp;amp;q=The+Rating+Of+ChessPlayers+past+and+present+Players+elo+arco&amp;amp;btnG= Elo78], football [http://ceur-ws.org/Vol-1842/paper_04.pdf MKFG][https://dblp.org/rec/bibtex/conf/pkdd/MaystreKFG16 16].&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;Chess [https://scholar.google.ch/scholar?hl=en&amp;amp;as_sdt=0%2C5&amp;amp;q=The+Rating+Of+ChessPlayers+past+and+present+Players+elo+arco&amp;amp;btnG= Elo78], football [http://ceur-ws.org/Vol-1842/paper_04.pdf MKFG][https://dblp.org/rec/bibtex/conf/pkdd/MaystreKFG16 16].&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/table&gt;</summary>
		<author><name>Lê Nguyên Hoang</name></author>
		
	</entry>
	<entry>
		<id>https://robustlybeneficial.org/wiki/index.php?title=Preference_learning_from_comparisons&amp;diff=197&amp;oldid=prev</id>
		<title>Lê Nguyên Hoang: /* Classical models */</title>
		<link rel="alternate" type="text/html" href="https://robustlybeneficial.org/wiki/index.php?title=Preference_learning_from_comparisons&amp;diff=197&amp;oldid=prev"/>
		<updated>2020-02-04T13:02:51Z</updated>

		<summary type="html">&lt;p&gt;&lt;span dir=&quot;auto&quot;&gt;&lt;span class=&quot;autocomment&quot;&gt;Classical models&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;table class=&quot;diff diff-contentalign-left&quot; data-mw=&quot;interface&quot;&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;tr class=&quot;diff-title&quot; lang=&quot;en&quot;&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #222; text-align: center;&quot;&gt;← Older revision&lt;/td&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #222; text-align: center;&quot;&gt;Revision as of 13:02, 4 February 2020&lt;/td&gt;
				&lt;/tr&gt;&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l3&quot; &gt;Line 3:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 3:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;== Classical models ==&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;== Classical models ==&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;−&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;The classical models for preference learning are due to [http://mlab.no/blog/wp-content/uploads/2009/07/thurstone94law.pdf Thurstone][https://scholar.google.ch/scholar?hl=en&amp;amp;as_sdt=0%2C5&amp;amp;q=A+law+of+comparative+judgment+thurstone&amp;amp;btnG= 27], [https://idp.springer.com/authorize/casa?redirect_uri=https://link.springer.com/content/pdf/10.1007/BF01180541.pdf&amp;amp;casa_token=dMSQZQpkjt4AAAAA:4HFh3rzi3S6VqH29Pg00S64qMuFB9b6VpF8pRikHmTNWf3REi10i_dj9_Eps3Ivp6l8c-ZETTU-Cx_-L6g Zermelo][https://scholar.google.ch/scholar?hl=en&amp;amp;as_sdt=0%2C5&amp;amp;q=Die+Berechnung+der+Turnier-Ergebnisse+als+ein+Maximumproblem+der+Wahrscheinlichkeitsrechnung.&amp;amp;btnG= 29], [https://&lt;del class=&quot;diffchange diffchange-inline&quot;&gt;apps&lt;/del&gt;.&lt;del class=&quot;diffchange diffchange-inline&quot;&gt;dtic&lt;/del&gt;.&lt;del class=&quot;diffchange diffchange-inline&quot;&gt;mil&lt;/del&gt;/&lt;del class=&quot;diffchange diffchange-inline&quot;&gt;dtic&lt;/del&gt;/&lt;del class=&quot;diffchange diffchange-inline&quot;&gt;tr&lt;/del&gt;/&lt;del class=&quot;diffchange diffchange-inline&quot;&gt;fulltext/u2&lt;/del&gt;/&lt;del class=&quot;diffchange diffchange-inline&quot;&gt;a417190&lt;/del&gt;.pdf&lt;del class=&quot;diffchange diffchange-inline&quot;&gt;#page=15 DavidBook&lt;/del&gt;][https://scholar.google.ch/scholar?hl=en&amp;amp;as_sdt=0%2C5&amp;amp;q=&lt;del class=&quot;diffchange diffchange-inline&quot;&gt;The&lt;/del&gt;+&lt;del class=&quot;diffchange diffchange-inline&quot;&gt;Method&lt;/del&gt;+of+&lt;del class=&quot;diffchange diffchange-inline&quot;&gt;Paired&lt;/del&gt;+&lt;del class=&quot;diffchange diffchange-inline&quot;&gt;Comparisons+david&lt;/del&gt;&amp;amp;btnG= &lt;del class=&quot;diffchange diffchange-inline&quot;&gt;63&lt;/del&gt;], [https://www.jstor.org/stable/pdf/2334029.pdf?casa_token=Q0ROYjxwOZEAAAAA:fuuqbCntxaNmZB0hlKk8iXfzOkLzQs9H677gh2UOQdMI4496B3YZHlUaiuMgD9zONaHYLHgtSmjMHziL29cx9d9tv2f_QwyqN0wxYUJrExI_eK8QR7sa BradleyTerry][https://scholar.google.ch/scholar?hl=en&amp;amp;as_sdt=0%2C5&amp;amp;q=Rank+analysis+of+incomplete+block+designs%3A+I.+The+method+of+paired+comparisons&amp;amp;btnG= 52], [https://scholar.google.ch/scholar?hl=en&amp;amp;as_sdt=0%2C5&amp;amp;q=Individual+Choice+Behavior%3A+A+Theoretical+Analysis+luce+wiley+1959&amp;amp;btnG= LuceBook59].&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;The classical models for preference learning are due to [http://mlab.no/blog/wp-content/uploads/2009/07/thurstone94law.pdf Thurstone][https://scholar.google.ch/scholar?hl=en&amp;amp;as_sdt=0%2C5&amp;amp;q=A+law+of+comparative+judgment+thurstone&amp;amp;btnG= 27], [https://idp.springer.com/authorize/casa?redirect_uri=https://link.springer.com/content/pdf/10.1007/BF01180541.pdf&amp;amp;casa_token=dMSQZQpkjt4AAAAA:4HFh3rzi3S6VqH29Pg00S64qMuFB9b6VpF8pRikHmTNWf3REi10i_dj9_Eps3Ivp6l8c-ZETTU-Cx_-L6g Zermelo][https://scholar.google.ch/scholar?hl=en&amp;amp;as_sdt=0%2C5&amp;amp;q=Die+Berechnung+der+Turnier-Ergebnisse+als+ein+Maximumproblem+der+Wahrscheinlichkeitsrechnung.&amp;amp;btnG= 29], [https://&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;link&lt;/ins&gt;.&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;springer&lt;/ins&gt;.&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;com&lt;/ins&gt;/&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;content&lt;/ins&gt;/&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;pdf&lt;/ins&gt;/&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;10.1007&lt;/ins&gt;/&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;BF02289115&lt;/ins&gt;.pdf &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;Mosteller&lt;/ins&gt;][https://scholar.google.ch/scholar?hl=en&amp;amp;as_sdt=0%2C5&amp;amp;q=&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;mosteller+Remarks+on+the&lt;/ins&gt;+&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;method&lt;/ins&gt;+of+&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;paired&lt;/ins&gt;+&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;comparisons&lt;/ins&gt;&amp;amp;btnG= &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;51&lt;/ins&gt;], [https://www.jstor.org/stable/pdf/2334029.pdf?casa_token=Q0ROYjxwOZEAAAAA:fuuqbCntxaNmZB0hlKk8iXfzOkLzQs9H677gh2UOQdMI4496B3YZHlUaiuMgD9zONaHYLHgtSmjMHziL29cx9d9tv2f_QwyqN0wxYUJrExI_eK8QR7sa BradleyTerry][https://scholar.google.ch/scholar?hl=en&amp;amp;as_sdt=0%2C5&amp;amp;q=Rank+analysis+of+incomplete+block+designs%3A+I.+The+method+of+paired+comparisons&amp;amp;btnG= 52], [https://scholar.google.ch/scholar?hl=en&amp;amp;as_sdt=0%2C5&amp;amp;q=Individual+Choice+Behavior%3A+A+Theoretical+Analysis+luce+wiley+1959&amp;amp;btnG= LuceBook59&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;], [https://apps.dtic.mil/dtic/tr/fulltext/u2/a417190.pdf#page=15 DavidBook][https://scholar.google.ch/scholar?hl=en&amp;amp;as_sdt=0%2C5&amp;amp;q=The+Method+of+Paired+Comparisons+david&amp;amp;btnG= 63&lt;/ins&gt;].&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;In these models, in a choice between 1 and 2, the human implicitly computes the scores &amp;lt;math&amp;gt;\theta_1&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;\theta_2&amp;lt;/math&amp;gt; that she assigns to each alternative. But her computation of &amp;lt;math&amp;gt;\theta_1 - \theta_2&amp;lt;/math&amp;gt; is perturbed by some noise &amp;lt;math&amp;gt;\varepsilon&amp;lt;/math&amp;gt;, yielding &amp;lt;math&amp;gt;x_{12} = \theta_1-\theta_2+\varepsilon&amp;lt;/math&amp;gt;. In the above models, the sign of &amp;lt;math&amp;gt;x_{12}&amp;lt;/math&amp;gt; then determines the human's choice.&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;In these models, in a choice between 1 and 2, the human implicitly computes the scores &amp;lt;math&amp;gt;\theta_1&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;\theta_2&amp;lt;/math&amp;gt; that she assigns to each alternative. But her computation of &amp;lt;math&amp;gt;\theta_1 - \theta_2&amp;lt;/math&amp;gt; is perturbed by some noise &amp;lt;math&amp;gt;\varepsilon&amp;lt;/math&amp;gt;, yielding &amp;lt;math&amp;gt;x_{12} = \theta_1-\theta_2+\varepsilon&amp;lt;/math&amp;gt;. In the above models, the sign of &amp;lt;math&amp;gt;x_{12}&amp;lt;/math&amp;gt; then determines the human's choice.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/table&gt;</summary>
		<author><name>Lê Nguyên Hoang</name></author>
		
	</entry>
	<entry>
		<id>https://robustlybeneficial.org/wiki/index.php?title=Preference_learning_from_comparisons&amp;diff=183&amp;oldid=prev</id>
		<title>Lê Nguyên Hoang: /* Gaussian process */</title>
		<link rel="alternate" type="text/html" href="https://robustlybeneficial.org/wiki/index.php?title=Preference_learning_from_comparisons&amp;diff=183&amp;oldid=prev"/>
		<updated>2020-02-03T08:25:09Z</updated>

		<summary type="html">&lt;p&gt;&lt;span dir=&quot;auto&quot;&gt;&lt;span class=&quot;autocomment&quot;&gt;Gaussian process&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;table class=&quot;diff diff-contentalign-left&quot; data-mw=&quot;interface&quot;&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;tr class=&quot;diff-title&quot; lang=&quot;en&quot;&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #222; text-align: center;&quot;&gt;← Older revision&lt;/td&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #222; text-align: center;&quot;&gt;Revision as of 08:25, 3 February 2020&lt;/td&gt;
				&lt;/tr&gt;&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l31&quot; &gt;Line 31:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 31:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;To implement this approach, we first need a measure of the similarity between two choices &amp;lt;math&amp;gt;x&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;x'&amp;lt;/math&amp;gt;. Intuitively, &amp;lt;math&amp;gt;x&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;x'&amp;lt;/math&amp;gt; will be similar if both &amp;lt;math&amp;gt;z_1 \approx z_1'&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;z_2 \approx z_2'&amp;lt;/math&amp;gt;. Note that &amp;lt;math&amp;gt;x&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;x'&amp;lt;/math&amp;gt; can also be &amp;quot;anti-similar&amp;quot;, if &amp;lt;math&amp;gt;z_1 \approx z_2'&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;z_2 \approx z_1'&amp;lt;/math&amp;gt;. Let us call &amp;lt;math&amp;gt;K(x,x')&amp;lt;/math&amp;gt; this property.&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;To implement this approach, we first need a measure of the similarity between two choices &amp;lt;math&amp;gt;x&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;x'&amp;lt;/math&amp;gt;. Intuitively, &amp;lt;math&amp;gt;x&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;x'&amp;lt;/math&amp;gt; will be similar if both &amp;lt;math&amp;gt;z_1 \approx z_1'&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;z_2 \approx z_2'&amp;lt;/math&amp;gt;. 
Note that &amp;lt;math&amp;gt;x&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;x'&amp;lt;/math&amp;gt; can also be &amp;quot;anti-similar&amp;quot;, if &amp;lt;math&amp;gt;z_1 \approx z_2'&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;z_2 \approx z_1'&amp;lt;/math&amp;gt;. Let us call &amp;lt;math&amp;gt;K(x,x')&amp;lt;/math&amp;gt; this property.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;−&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;Note that such kernels can be approximately implemented by [[representational learning|vector representation]], for instance by some deep neural network &amp;lt;math&amp;gt;f&amp;lt;/math&amp;gt;. We could then have &amp;lt;math&amp;gt;K(x,x') = f(x)^Tf(x')&amp;lt;/math&amp;gt; (anti-similarity could be naturally enforced if &amp;lt;math&amp;gt;f(z_1,z_2) = - f(z_2,z_1)&amp;lt;/math&amp;gt;). The use of such vector representations could be useful to accelerate computation, as opposed to the kernel method which requires going through all the training data to make a prediction.&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;Note that such kernels can be approximately implemented by [[representational learning|vector representation]], for instance by some deep neural network &amp;lt;math&amp;gt;f&amp;lt;/math&amp;gt;. We could then have &amp;lt;math&amp;gt;K(x,x') = f(x)^Tf(x')&amp;lt;/math&amp;gt; (anti-similarity could be naturally enforced if &amp;lt;math&amp;gt;f(z_1,z_2) = - f(z_2,z_1)&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;/math&amp;gt;, which is guaranteed if &amp;lt;math&amp;gt;f(z_1,z_2) = g(z_1)-g(z_2)&amp;lt;/math&amp;gt; for some other vector representation &amp;lt;math&amp;gt;g&lt;/ins&gt;&amp;lt;/math&amp;gt;). The use of such vector representations could be useful to accelerate computation, as opposed to the kernel method which requires going through all the training data to make a prediction.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;NOTE: This section needs references.&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;NOTE: This section needs references.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/table&gt;</summary>
		<author><name>Lê Nguyên Hoang</name></author>
		
	</entry>
	<entry>
		<id>https://robustlybeneficial.org/wiki/index.php?title=Preference_learning_from_comparisons&amp;diff=182&amp;oldid=prev</id>
		<title>Lê Nguyên Hoang: /* Gaussian process */</title>
		<link rel="alternate" type="text/html" href="https://robustlybeneficial.org/wiki/index.php?title=Preference_learning_from_comparisons&amp;diff=182&amp;oldid=prev"/>
		<updated>2020-02-03T08:03:55Z</updated>

		<summary type="html">&lt;p&gt;&lt;span dir=&quot;auto&quot;&gt;&lt;span class=&quot;autocomment&quot;&gt;Gaussian process&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;table class=&quot;diff diff-contentalign-left&quot; data-mw=&quot;interface&quot;&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;tr class=&quot;diff-title&quot; lang=&quot;en&quot;&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #222; text-align: center;&quot;&gt;← Older revision&lt;/td&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #222; text-align: center;&quot;&gt;Revision as of 08:03, 3 February 2020&lt;/td&gt;
				&lt;/tr&gt;&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l26&quot; &gt;Line 26:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 26:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;== Gaussian process ==&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;== Gaussian process ==&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot;&gt; &lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;&lt;/ins&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot;&gt; &lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;Perhaps one of the most promising avenues to scalable comparison-based preference learning consists of building upon the assumption that, a priori, the intrinsic preference &amp;lt;math&amp;gt;\theta_1-\theta_2&amp;lt;/math&amp;gt; between two alternatives &amp;lt;math&amp;gt;z_1&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;z_2&amp;lt;/math&amp;gt; is a Gaussian process. By then invoking the similarity between having to choose between &amp;lt;math&amp;gt;x = (z_1,z_2)&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;x' = (z_1',z_2')&amp;lt;/math&amp;gt;, we can then generalize from observed data by means of a [[kernel method]].&lt;/ins&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot;&gt; &lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;&lt;/ins&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot;&gt; &lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;To implement this approach, we first need a measure of the similarity between two choices &amp;lt;math&amp;gt;x&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;x'&amp;lt;/math&amp;gt;. Intuitively, &amp;lt;math&amp;gt;x&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;x'&amp;lt;/math&amp;gt; will be similar if both &amp;lt;math&amp;gt;z_1 \approx z_1'&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;z_2 \approx z_2'&amp;lt;/math&amp;gt;. Note that &amp;lt;math&amp;gt;x&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;x'&amp;lt;/math&amp;gt; can also be &amp;quot;anti-similar&amp;quot;, if &amp;lt;math&amp;gt;z_1 \approx z_2'&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;z_2 \approx z_1'&amp;lt;/math&amp;gt;. Let us call &amp;lt;math&amp;gt;K(x,x')&amp;lt;/math&amp;gt; this property.&lt;/ins&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot;&gt; &lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;&lt;/ins&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot;&gt; &lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;Note that such kernels can be approximately implemented by [[representational learning|vector representation]], for instance by some deep neural network &amp;lt;math&amp;gt;f&amp;lt;/math&amp;gt;. We could then have &amp;lt;math&amp;gt;K(x,x') = f(x)^Tf(x')&amp;lt;/math&amp;gt; (anti-similarity could be naturally enforced if &amp;lt;math&amp;gt;f(z_1,z_2) = - f(z_2,z_1)&amp;lt;/math&amp;gt;). The use of such vector representations could be useful to accelerate computation, as opposed to the kernel method which requires going through all the training data to make a prediction.&lt;/ins&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot;&gt; &lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;&lt;/ins&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot;&gt; &lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;NOTE: This section needs references.&lt;/ins&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;== Connection to sports ==&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;== Connection to sports ==&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;Chess [https://scholar.google.ch/scholar?hl=en&amp;amp;as_sdt=0%2C5&amp;amp;q=The+Rating+Of+ChessPlayers+past+and+present+Players+elo+arco&amp;amp;btnG= Elo78], football [http://ceur-ws.org/Vol-1842/paper_04.pdf MKFG][https://dblp.org/rec/bibtex/conf/pkdd/MaystreKFG16 16].&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;Chess [https://scholar.google.ch/scholar?hl=en&amp;amp;as_sdt=0%2C5&amp;amp;q=The+Rating+Of+ChessPlayers+past+and+present+Players+elo+arco&amp;amp;btnG= Elo78], football [http://ceur-ws.org/Vol-1842/paper_04.pdf MKFG][https://dblp.org/rec/bibtex/conf/pkdd/MaystreKFG16 16].&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/table&gt;</summary>
		<author><name>Lê Nguyên Hoang</name></author>
		
	</entry>
	<entry>
		<id>https://robustlybeneficial.org/wiki/index.php?title=Preference_learning_from_comparisons&amp;diff=181&amp;oldid=prev</id>
		<title>Lê Nguyên Hoang: /* Classical models */</title>
		<link rel="alternate" type="text/html" href="https://robustlybeneficial.org/wiki/index.php?title=Preference_learning_from_comparisons&amp;diff=181&amp;oldid=prev"/>
		<updated>2020-02-03T07:46:11Z</updated>

		<summary type="html">&lt;p&gt;&lt;span dir=&quot;auto&quot;&gt;&lt;span class=&quot;autocomment&quot;&gt;Classical models&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;table class=&quot;diff diff-contentalign-left&quot; data-mw=&quot;interface&quot;&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;tr class=&quot;diff-title&quot; lang=&quot;en&quot;&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #222; text-align: center;&quot;&gt;← Older revision&lt;/td&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #222; text-align: center;&quot;&gt;Revision as of 07:46, 3 February 2020&lt;/td&gt;
				&lt;/tr&gt;&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l12&quot; &gt;Line 12:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 12:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;The Bradley-Terry model assumes that &amp;lt;math&amp;gt;\varepsilon&amp;lt;/math&amp;gt; follows a [https://en.wikipedia.org/wiki/Gumbel_distribution Gumbel distribution]. Equivalently, it sets &amp;lt;math&amp;gt;\phi(z)= \frac{1}{1+\exp(-z)}&amp;lt;/math&amp;gt; (up to variance scaling). Luce's model generalizes the Bradley-Terry models, by considering choices among numerous alternatives and setting &amp;lt;math&amp;gt;\phi(z_2, z_3, ...)= \frac{1}{1+\exp(-z_2)+\exp(-z_3)+...}&amp;lt;/math&amp;gt;, where this quantity is the probability of choosing option 1, and where &amp;lt;math&amp;gt;z_i = \theta_1-\theta_i&amp;lt;/math&amp;gt;. Interestingly, Luce proved that this framework was equivalent to demanding independence of irrelevant alternatives.&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;The Bradley-Terry model assumes that &amp;lt;math&amp;gt;\varepsilon&amp;lt;/math&amp;gt; follows a [https://en.wikipedia.org/wiki/Gumbel_distribution Gumbel distribution]. Equivalently, it sets &amp;lt;math&amp;gt;\phi(z)= \frac{1}{1+\exp(-z)}&amp;lt;/math&amp;gt; (up to variance scaling). Luce's model generalizes the Bradley-Terry models, by considering choices among numerous alternatives and setting &amp;lt;math&amp;gt;\phi(z_2, z_3, ...)= \frac{1}{1+\exp(-z_2)+\exp(-z_3)+...}&amp;lt;/math&amp;gt;, where this quantity is the probability of choosing option 1, and where &amp;lt;math&amp;gt;z_i = \theta_1-\theta_i&amp;lt;/math&amp;gt;. 
Interestingly, Luce proved that this framework was equivalent to demanding independence of irrelevant alternatives.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot;&gt; &lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;&lt;/ins&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot;&gt; &lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;== Inference ==&lt;/ins&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;The inference problem is then the problem of inferring the values of parameters &amp;lt;math&amp;gt;\theta_i&amp;lt;/math&amp;gt; given observational data &amp;lt;math&amp;gt;\mathcal D&amp;lt;/math&amp;gt; of choices &amp;lt;math&amp;gt;i&amp;lt;/math&amp;gt; out of a set of alternatives &amp;lt;math&amp;gt;\mathcal A&amp;lt;/math&amp;gt;. Bayesian inference would suggest computing &amp;lt;math&amp;gt;\mathbb P[\theta|\mathcal D]&amp;lt;/math&amp;gt;. But as often, this approach might be too computationally costly in practice. Approximate Bayesian methods are needed.&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;The inference problem is then the problem of inferring the values of parameters &amp;lt;math&amp;gt;\theta_i&amp;lt;/math&amp;gt; given observational data &amp;lt;math&amp;gt;\mathcal D&amp;lt;/math&amp;gt; of choices &amp;lt;math&amp;gt;i&amp;lt;/math&amp;gt; out of a set of alternatives &amp;lt;math&amp;gt;\mathcal A&amp;lt;/math&amp;gt;. Bayesian inference would suggest computing &amp;lt;math&amp;gt;\mathbb P[\theta|\mathcal D]&amp;lt;/math&amp;gt;. But as often, this approach might be too computationally costly in practice. Approximate Bayesian methods are needed.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/table&gt;</summary>
		<author><name>Lê Nguyên Hoang</name></author>
		
	</entry>
	<entry>
		<id>https://robustlybeneficial.org/wiki/index.php?title=Preference_learning_from_comparisons&amp;diff=180&amp;oldid=prev</id>
		<title>Lê Nguyên Hoang: /* Classical models */</title>
		<link rel="alternate" type="text/html" href="https://robustlybeneficial.org/wiki/index.php?title=Preference_learning_from_comparisons&amp;diff=180&amp;oldid=prev"/>
		<updated>2020-02-03T07:45:51Z</updated>

		<summary type="html">&lt;p&gt;&lt;span dir=&quot;auto&quot;&gt;&lt;span class=&quot;autocomment&quot;&gt;Classical models&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;table class=&quot;diff diff-contentalign-left&quot; data-mw=&quot;interface&quot;&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;tr class=&quot;diff-title&quot; lang=&quot;en&quot;&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #222; text-align: center;&quot;&gt;← Older revision&lt;/td&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #222; text-align: center;&quot;&gt;Revision as of 07:45, 3 February 2020&lt;/td&gt;
				&lt;/tr&gt;&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l13&quot; &gt;Line 13:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 13:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;The Bradley-Terry model assumes that &amp;lt;math&amp;gt;\varepsilon&amp;lt;/math&amp;gt; follows a [https://en.wikipedia.org/wiki/Gumbel_distribution Gumbel distribution]. Equivalently, it sets &amp;lt;math&amp;gt;\phi(z)= \frac{1}{1+\exp(-z)}&amp;lt;/math&amp;gt; (up to variance scaling). Luce's model generalizes the Bradley-Terry models, by considering choices among numerous alternatives and setting &amp;lt;math&amp;gt;\phi(z_2, z_3, ...)= \frac{1}{1+\exp(-z_2)+\exp(-z_3)+...}&amp;lt;/math&amp;gt;, where this quantity is the probability of choosing option 1, and where &amp;lt;math&amp;gt;z_i = \theta_1-\theta_i&amp;lt;/math&amp;gt;. Interestingly, Luce proved that this framework was equivalent to demanding independence of irrelevant alternatives.&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;The Bradley-Terry model assumes that &amp;lt;math&amp;gt;\varepsilon&amp;lt;/math&amp;gt; follows a [https://en.wikipedia.org/wiki/Gumbel_distribution Gumbel distribution]. Equivalently, it sets &amp;lt;math&amp;gt;\phi(z)= \frac{1}{1+\exp(-z)}&amp;lt;/math&amp;gt; (up to variance scaling). Luce's model generalizes the Bradley-Terry models, by considering choices among numerous alternatives and setting &amp;lt;math&amp;gt;\phi(z_2, z_3, ...)= \frac{1}{1+\exp(-z_2)+\exp(-z_3)+...}&amp;lt;/math&amp;gt;, where this quantity is the probability of choosing option 1, and where &amp;lt;math&amp;gt;z_i = \theta_1-\theta_i&amp;lt;/math&amp;gt;. 
Interestingly, Luce proved that this framework was equivalent to demanding independence of irrelevant alternatives.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;−&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;The inference problem is then the problem of inferring the values of parameters &amp;lt;math&amp;gt;\theta_i&amp;lt;/math&amp;gt; given observational data &amp;lt;math&amp;gt;\mathcal D&amp;lt;/math&amp;gt; of choices &amp;lt;math&amp;gt;i&amp;lt;/math&amp;gt; out of a set of alternatives &amp;lt;math&amp;gt;\mathcal A&amp;lt;/math&amp;gt;. Bayesian inference would suggest computing &amp;lt;math&amp;gt;\mathbb P[\theta|\mathcal D]&amp;lt;/math&amp;gt;. But as often, this approach might be too computationally costly in practice.&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;The inference problem is then the problem of inferring the values of parameters &amp;lt;math&amp;gt;\theta_i&amp;lt;/math&amp;gt; given observational data &amp;lt;math&amp;gt;\mathcal D&amp;lt;/math&amp;gt; of choices &amp;lt;math&amp;gt;i&amp;lt;/math&amp;gt; out of a set of alternatives &amp;lt;math&amp;gt;\mathcal A&amp;lt;/math&amp;gt;. Bayesian inference would suggest computing &amp;lt;math&amp;gt;\mathbb P[\theta|\mathcal D]&amp;lt;/math&amp;gt;. But as often, this approach might be too computationally costly in practice&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;. Approximate Bayesian methods are needed&lt;/ins&gt;.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;Note that in the Luce model, the log-likelihood &amp;lt;math&amp;gt;\log \mathbb P[\mathcal D|\theta] = \sum \left( \theta_i - \log \sum_{(j \succ i) \in \mathcal D} \exp \theta_j \right)&amp;lt;/math&amp;gt; is a strictly concave function in &amp;lt;math&amp;gt;\theta&amp;lt;/math&amp;gt; if the comparison graph is strongly connected (for any &amp;lt;math&amp;gt;i,j&amp;lt;/math&amp;gt;, there exists &amp;lt;math&amp;gt;k_1, ... k_m&amp;lt;/math&amp;gt; such that there are some of the data that says &amp;lt;math&amp;gt;i \succ k_1 \succ k_2 \succ ... \succ k_m \succ j&amp;lt;/math&amp;gt;). This proves the existence and uniqueness of the maximum likelihood estimator under this assumption.&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;Note that in the Luce model, the log-likelihood &amp;lt;math&amp;gt;\log \mathbb P[\mathcal D|\theta] = \sum \left( \theta_i - \log \sum_{(j \succ i) \in \mathcal D} \exp \theta_j \right)&amp;lt;/math&amp;gt; is a strictly concave function in &amp;lt;math&amp;gt;\theta&amp;lt;/math&amp;gt; if the comparison graph is strongly connected (for any &amp;lt;math&amp;gt;i,j&amp;lt;/math&amp;gt;, there exists &amp;lt;math&amp;gt;k_1, ... k_m&amp;lt;/math&amp;gt; such that there are some of the data that says &amp;lt;math&amp;gt;i \succ k_1 \succ k_2 \succ ... \succ k_m \succ j&amp;lt;/math&amp;gt;). 
This proves the existence and uniqueness of the maximum likelihood estimator under this assumption.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/table&gt;</summary>
		<author><name>Lê Nguyên Hoang</name></author>
		
	</entry>
	<entry>
		<id>https://robustlybeneficial.org/wiki/index.php?title=Preference_learning_from_comparisons&amp;diff=179&amp;oldid=prev</id>
		<title>Lê Nguyên Hoang: /* Classical models */</title>
		<link rel="alternate" type="text/html" href="https://robustlybeneficial.org/wiki/index.php?title=Preference_learning_from_comparisons&amp;diff=179&amp;oldid=prev"/>
		<updated>2020-02-03T07:45:16Z</updated>

		<summary type="html">&lt;p&gt;&lt;span dir=&quot;auto&quot;&gt;&lt;span class=&quot;autocomment&quot;&gt;Classical models&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;table class=&quot;diff diff-contentalign-left&quot; data-mw=&quot;interface&quot;&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;tr class=&quot;diff-title&quot; lang=&quot;en&quot;&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #222; text-align: center;&quot;&gt;← Older revision&lt;/td&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #222; text-align: center;&quot;&gt;Revision as of 07:45, 3 February 2020&lt;/td&gt;
				&lt;/tr&gt;&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l14&quot; &gt;Line 14:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 14:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;The inference problem is then to infer the values of the parameters &amp;lt;math&amp;gt;\theta_i&amp;lt;/math&amp;gt; given observational data &amp;lt;math&amp;gt;\mathcal D&amp;lt;/math&amp;gt; of choices &amp;lt;math&amp;gt;i&amp;lt;/math&amp;gt; out of a set of alternatives &amp;lt;math&amp;gt;\mathcal A&amp;lt;/math&amp;gt;. Bayesian inference would suggest computing &amp;lt;math&amp;gt;\mathbb P[\theta|\mathcal D]&amp;lt;/math&amp;gt;. But, as is often the case, this approach may be too computationally costly in practice.&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;The inference problem is then to infer the values of the parameters &amp;lt;math&amp;gt;\theta_i&amp;lt;/math&amp;gt; given observational data &amp;lt;math&amp;gt;\mathcal D&amp;lt;/math&amp;gt; of choices &amp;lt;math&amp;gt;i&amp;lt;/math&amp;gt; out of a set of alternatives &amp;lt;math&amp;gt;\mathcal A&amp;lt;/math&amp;gt;. Bayesian inference would suggest computing &amp;lt;math&amp;gt;\mathbb P[\theta|\mathcal D]&amp;lt;/math&amp;gt;. But, as is often the case, this approach may be too computationally costly in practice.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot;&gt; &lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;&lt;/ins&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot;&gt; &lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;Note that in the Luce model, the log-likelihood &amp;lt;math&amp;gt;\log \mathbb P[\mathcal D|\theta] = \sum_{(i \succ j) \in \mathcal D} \left( \theta_i - \log \left( \exp \theta_i + \exp \theta_j \right) \right)&amp;lt;/math&amp;gt; is a concave function of &amp;lt;math&amp;gt;\theta&amp;lt;/math&amp;gt;, and it is strictly concave under the normalization &amp;lt;math&amp;gt;\sum_i \theta_i = 0&amp;lt;/math&amp;gt; whenever the comparison graph is strongly connected (for any &amp;lt;math&amp;gt;i,j&amp;lt;/math&amp;gt;, there exist &amp;lt;math&amp;gt;k_1, \ldots, k_m&amp;lt;/math&amp;gt; such that the data contains the comparisons &amp;lt;math&amp;gt;i \succ k_1 \succ k_2 \succ \ldots \succ k_m \succ j&amp;lt;/math&amp;gt;). This proves the existence and uniqueness of the maximum likelihood estimator (up to this normalization) under this assumption.&lt;/ins&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;== Markov chain for preference learning ==&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;== Markov chain for preference learning ==&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/table&gt;</summary>
		<author><name>Lê Nguyên Hoang</name></author>
		
	</entry>
	<entry>
		<id>https://robustlybeneficial.org/wiki/index.php?title=Preference_learning_from_comparisons&amp;diff=178&amp;oldid=prev</id>
		<title>Lê Nguyên Hoang: /* Markov chain for preference learning */</title>
		<link rel="alternate" type="text/html" href="https://robustlybeneficial.org/wiki/index.php?title=Preference_learning_from_comparisons&amp;diff=178&amp;oldid=prev"/>
		<updated>2020-02-03T07:45:04Z</updated>

		<summary type="html">&lt;p&gt;&lt;span dir=&quot;auto&quot;&gt;&lt;span class=&quot;autocomment&quot;&gt;Markov chain for preference learning&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;table class=&quot;diff diff-contentalign-left&quot; data-mw=&quot;interface&quot;&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;tr class=&quot;diff-title&quot; lang=&quot;en&quot;&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #222; text-align: center;&quot;&gt;← Older revision&lt;/td&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #222; text-align: center;&quot;&gt;Revision as of 07:45, 3 February 2020&lt;/td&gt;
				&lt;/tr&gt;&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l17&quot; &gt;Line 17:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 17:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;== Markov chain for preference learning ==&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;== Markov chain for preference learning ==&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;−&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;del class=&quot;diffchange diffchange-inline&quot;&gt;Note that &lt;/del&gt;in the &lt;del class=&quot;diffchange diffchange-inline&quot;&gt;Luce model&lt;/del&gt;, the &lt;del class=&quot;diffchange diffchange-inline&quot;&gt;log-likelihood &lt;/del&gt;&amp;lt;math&amp;gt;&lt;del class=&quot;diffchange diffchange-inline&quot;&gt;\log \mathbb P[\mathcal D|\theta] = \sum \left&lt;/del&gt;( &lt;del class=&quot;diffchange diffchange-inline&quot;&gt;\theta_i - \log \sum_{(j \succ i) \in \mathcal D} &lt;/del&gt;\exp \&lt;del class=&quot;diffchange diffchange-inline&quot;&gt;theta_j &lt;/del&gt;\&lt;del class=&quot;diffchange diffchange-inline&quot;&gt;right&lt;/del&gt;)&amp;lt;/math&amp;gt; &lt;del class=&quot;diffchange diffchange-inline&quot;&gt;is a strictly concave function in &lt;/del&gt;&amp;lt;math&amp;gt;\theta&amp;lt;/math&amp;gt; &lt;del class=&quot;diffchange diffchange-inline&quot;&gt;if the comparison graph &lt;/del&gt;is &lt;del class=&quot;diffchange diffchange-inline&quot;&gt;strongly connected (for any &amp;lt;math&amp;gt;i,j&amp;lt;/math&amp;gt;, there exists &amp;lt;math&amp;gt;k_1, ... k_m&amp;lt;/math&amp;gt; such that there are some of &lt;/del&gt;the &lt;del class=&quot;diffchange diffchange-inline&quot;&gt;data that says &amp;lt;math&amp;gt;i \succ k_1 \succ k_2 \succ ... \succ k_m \succ j&amp;lt;/math&amp;gt;)&lt;/del&gt;.&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;[http://papers.nips.cc/paper/4701-iterative-ranking-from-pair-wise-comparisons.pdf NOS][https://dblp.org/rec/bibtex/conf/nips/NegahbanOS12 12] proposed a Markov chain approach to compute the maximum likelihood estimator, &lt;/ins&gt;in the &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;same vein as the PageRank algorithm [http://ilpubs.stanford.edu:8090/422/1/1999-66.pdf PBMW][https://scholar.google.ch/scholar?hl=en&amp;amp;as_sdt=0%2C5&amp;amp;q=The+PageRank+citation+ranking%3A+Bringing+order+to+the+Web+page+brin+motwani+winograd&amp;amp;btnG= 99]. Namely, by adequately choosing the transition rates of the comparison graph&lt;/ins&gt;, the &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;stationary distribution of the Markov chain thereby constructed turns out to be proportional to the vector &lt;/ins&gt;&amp;lt;math&amp;gt;(\exp \&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;theta_i^&lt;/ins&gt;\&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;star&lt;/ins&gt;)&amp;lt;/math&amp;gt;&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;, where &lt;/ins&gt;&amp;lt;math&amp;gt;\theta&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;^\star&lt;/ins&gt;&amp;lt;/math&amp;gt; is the &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;maximum likelihood estimator&lt;/ins&gt;.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt;−&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;del class=&quot;diffchange diffchange-inline&quot;&gt;One &lt;/del&gt;approach &lt;del class=&quot;diffchange diffchange-inline&quot;&gt;to compute &lt;/del&gt;the &lt;del class=&quot;diffchange diffchange-inline&quot;&gt;maximum likelihood estimator (&amp;lt;math&amp;gt;\arg\max_\theta \mathbb P[\mathcal D|\theta]&amp;lt;/math&amp;gt;) consists &lt;/del&gt;of &lt;del class=&quot;diffchange diffchange-inline&quot;&gt;computing the stationary distribution of a Markov chain on the &amp;lt;em&amp;gt;comparison graph&amp;lt;/em&amp;gt;&lt;/del&gt;. &lt;del class=&quot;diffchange diffchange-inline&quot;&gt;This is a graph whose nodes are alternatives&lt;/del&gt;, &lt;del class=&quot;diffchange diffchange-inline&quot;&gt;and which contains a directed arrow from &amp;lt;math&amp;gt;i&amp;lt;/math&amp;gt; &lt;/del&gt;to &lt;del class=&quot;diffchange diffchange-inline&quot;&gt;&amp;lt;math&amp;gt;j&amp;lt;/math&amp;gt; if &lt;/del&gt;the &lt;del class=&quot;diffchange diffchange-inline&quot;&gt;user has once asserted that he prefers &amp;lt;math&amp;gt;i&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;j&amp;lt;/math&amp;gt;&lt;/del&gt;.&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;Note, however, that this &lt;/ins&gt;approach &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;is too data-intensive and computationally expensive if &lt;/ins&gt;the &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;set &lt;/ins&gt;of &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;alternatives is combinatorial&lt;/ins&gt;. &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;In fact, for most recommender systems, it may be inapplicable&lt;/ins&gt;, &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;since most users have never even been exposed &lt;/ins&gt;to &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;most of &lt;/ins&gt;the &lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;video or music content of platforms like YouTube or Spotify&lt;/ins&gt;.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;== Gaussian process ==&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;== Gaussian process ==&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/table&gt;</summary>
		<author><name>Lê Nguyên Hoang</name></author>
		
	</entry>
	<entry>
		<id>https://robustlybeneficial.org/wiki/index.php?title=Preference_learning_from_comparisons&amp;diff=177&amp;oldid=prev</id>
		<title>Lê Nguyên Hoang: /* Markov chain for preference learning */</title>
		<link rel="alternate" type="text/html" href="https://robustlybeneficial.org/wiki/index.php?title=Preference_learning_from_comparisons&amp;diff=177&amp;oldid=prev"/>
		<updated>2020-02-03T07:32:52Z</updated>

		<summary type="html">&lt;p&gt;&lt;span dir=&quot;auto&quot;&gt;&lt;span class=&quot;autocomment&quot;&gt;Markov chain for preference learning&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;table class=&quot;diff diff-contentalign-left&quot; data-mw=&quot;interface&quot;&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;tr class=&quot;diff-title&quot; lang=&quot;en&quot;&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #222; text-align: center;&quot;&gt;← Older revision&lt;/td&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #222; text-align: center;&quot;&gt;Revision as of 07:32, 3 February 2020&lt;/td&gt;
				&lt;/tr&gt;&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l17&quot; &gt;Line 17:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 17:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;== Markov chain for preference learning ==&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;== Markov chain for preference learning ==&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot;&gt; &lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;Note that in the Luce model, the log-likelihood &amp;lt;math&amp;gt;\log \mathbb P[\mathcal D|\theta] = \sum_{(i \succ j) \in \mathcal D} \left( \theta_i - \log \left( \exp \theta_i + \exp \theta_j \right) \right)&amp;lt;/math&amp;gt; is a concave function of &amp;lt;math&amp;gt;\theta&amp;lt;/math&amp;gt;, and it is strictly concave under the normalization &amp;lt;math&amp;gt;\sum_i \theta_i = 0&amp;lt;/math&amp;gt; whenever the comparison graph is strongly connected (for any &amp;lt;math&amp;gt;i,j&amp;lt;/math&amp;gt;, there exist &amp;lt;math&amp;gt;k_1, \ldots, k_m&amp;lt;/math&amp;gt; such that the data contains the comparisons &amp;lt;math&amp;gt;i \succ k_1 \succ k_2 \succ \ldots \succ k_m \succ j&amp;lt;/math&amp;gt;).&lt;/ins&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot;&gt; &lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;&lt;/ins&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot;&gt; &lt;/td&gt;&lt;td class='diff-marker'&gt;+&lt;/td&gt;&lt;td style=&quot;color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;One approach to compute the maximum likelihood estimator (&amp;lt;math&amp;gt;\arg\max_\theta \mathbb P[\mathcal D|\theta]&amp;lt;/math&amp;gt;) consists of computing the stationary distribution of a Markov chain on the &amp;lt;em&amp;gt;comparison graph&amp;lt;/em&amp;gt;. This is a graph whose nodes are alternatives, and which contains a directed arrow from &amp;lt;math&amp;gt;i&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;j&amp;lt;/math&amp;gt; if the user has at some point asserted that they prefer &amp;lt;math&amp;gt;i&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;j&amp;lt;/math&amp;gt;.&lt;/ins&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;== Gaussian process ==&lt;/div&gt;&lt;/td&gt;&lt;td class='diff-marker'&gt; &lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;== Gaussian process ==&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/table&gt;</summary>
		<author><name>Lê Nguyên Hoang</name></author>
		
	</entry>
</feed>