Learning to Rank by Optimizing NDCG Measure


Hamed Valizadegan, Rong Jin
Computer Science and Engineering, Michigan State University, East Lansing, MI 48824
{valizade,rongjin}@cse.msu.edu

Ruofei Zhang, Jianchang Mao
Advertising Sciences, Yahoo! Labs, 4401 Great America Parkway, Santa Clara, CA 95054
{rzhang,jmao}@yahoo-inc.com

Abstract

Learning to rank is a relatively new field of study, aiming to learn a ranking function from a set of training data with relevancy labels. The ranking algorithms are often evaluated using information retrieval measures, such as Normalized Discounted Cumulative Gain (NDCG) [1] and Mean Average Precision (MAP) [2]. Until recently, most learning to rank algorithms did not use a loss function related to the above-mentioned evaluation measures. The main difficulty in directly optimizing these measures is that they depend on the ranks of documents, not the numerical values output by the ranking function. We propose a probabilistic framework that addresses this challenge by optimizing the expectation of NDCG over all the possible permutations of documents. A relaxation strategy is used to approximate the average of NDCG over the space of permutations, and a bound optimization approach is proposed to make the computation efficient. Extensive experiments show that the proposed algorithm outperforms state-of-the-art ranking algorithms on several benchmark data sets.

1 Introduction

Learning to rank has attracted the focus of many machine learning researchers in the last decade because of its growing application in areas like information retrieval (IR) and recommender systems. In the simplest form, the so-called pointwise approaches, ranking can be treated as classification or regression by learning the numeric rank value of documents as an absolute quantity [3, 4]. The second group of algorithms, the pairwise approaches, treats pairs of documents as independent variables and learns a classification (or regression) model to correctly order the training pairs [5, 6, 7, 8, 9, 10, 11].
The main problem with these approaches is that their loss functions are related to individual documents, while most IR evaluation metrics measure the ranking quality for individual queries, not documents. This mismatch has motivated the so-called listwise approaches, which treat each ranking list of documents for a query as a training instance [2, 12, 13, 14, 15, 16, 17]. Unlike the pointwise or pairwise approaches, the listwise approaches aim to optimize the evaluation metrics such as NDCG and MAP directly. The main difficulty in optimizing these evaluation metrics is that they depend on the rank positions of documents induced by the ranking function, not the numerical values output by the ranking function. In past studies, this problem was addressed either by convex surrogates of the IR metrics or by heuristic optimization methods such as genetic algorithms. In this work, we address this challenge by a probabilistic framework that optimizes the expectation of NDCG over all the possible permutations of documents. To handle the computational difficulty, we present a relaxation strategy that approximates the expectation of NDCG in the space of permutations, and a bound optimization algorithm [18] for efficient optimization. Our experiments with several benchmark data sets show that our method performs better than several state-of-the-art ranking techniques.

The rest of this paper is organized as follows. Related work is presented in Section 2. The proposed framework and optimization strategy are presented in Section 3. We report our experimental study in Section 4 and conclude this work in Section 5.

2 Related Work

We focus on reviewing the listwise approaches that are closely related to the theme of this work. The listwise approaches can be classified into two categories. The first group of approaches directly optimizes the IR evaluation metrics. Most IR evaluation metrics, however, depend on the sorted order of documents, and are non-convex in the target ranking function. To avoid the computational difficulty, these approaches either approximate the metrics with some convex functions or deploy methods for non-convex optimization (e.g., genetic algorithms [19]). In [13], the authors introduced LambdaRank, which addresses the difficulty in optimizing IR metrics by defining a virtual gradient on each document after sorting. While [13] provided a simple test to determine whether there exists an implicit cost function for the virtual gradient, the theoretical justification for the relation between the implicit cost function and the IR evaluation metric is incomplete. This may partially explain why LambdaRank performs poorly when compared to McRank [3], a simple adaptation of classification for ranking (a pointwise approach). The authors of the McRank paper even claimed that a boosting model for regression produces better results than LambdaRank. Volkovs and Zemel [17] proposed optimizing the expectation of IR measures to overcome the sorting problem, similar to the approach taken in this paper. However, they use Monte Carlo sampling to address the intractable task of computing the expectation in the permutation space, which can be a poor approximation for queries with a large number of documents. AdaRank [20] uses boosting to optimize NDCG, similar to our optimization strategy.
However, AdaRank deploys heuristics to embed the IR evaluation metrics in computing the weights of queries and the importance of weak rankers; i.e., it uses the NDCG value of each query in the current iteration as the weight for that query in constructing the weak ranker (all documents of a query share the same weight). This is unlike our approach, in which the contribution of each single document to the final NDCG score is considered. Moreover, unlike our method, the convergence of AdaRank is conditional and not guaranteed. Sun et al. [21] reduced ranking, as measured by NDCG, to pairwise classification and applied an alternating optimization strategy to address the sorting problem by fixing the rank positions when computing the derivative. SVM-MAP [2] relaxes the MAP metric by incorporating it into the constraints of an SVM. Since SVM-MAP is designed to optimize MAP, it only considers binary relevancy and cannot be applied to data sets that have more than two levels of relevance judgments. The second group of listwise algorithms defines a listwise loss function as an indirect way to optimize the IR evaluation metrics. RankCosine [12] uses the cosine similarity between the ranking list and the ground truth as a query-level loss function. ListNet [14] adopts the KL divergence as its loss function by defining a probabilistic distribution in the space of permutations for learning to rank. FRank [9] uses a new loss function, called fidelity loss, on the probability framework introduced in ListNet. ListMLE [15] employs the likelihood loss as the surrogate for the IR evaluation metrics. The main problem with this group of approaches is that the connection between the listwise loss function and the targeted IR evaluation metric is unclear, and therefore optimizing the listwise loss function may not necessarily result in the optimization of the IR metrics.

3 Optimizing NDCG Measure

3.1 Notation

Assume that we have a collection of n queries for training, denoted by Q = \{q_1, \ldots, q_n\}.
For each query q_k, we have a collection of m_k documents D^k = \{d_i^k, i = 1, \ldots, m_k\}, whose relevance to q_k is given by a vector r^k = (r_1^k, \ldots, r_{m_k}^k) \in \mathbb{Z}^{m_k}. We denote by F(d, q) the ranking function that takes a document-query pair (d, q) and outputs a real-number score, and by j_i^k the rank of document d_i^k within the collection D^k for query q_k. The NDCG value for ranking function F(d, q) is then computed as follows:

L(Q, F) = \frac{1}{n} \sum_{k=1}^{n} \frac{1}{Z_k} \sum_{i=1}^{m_k} \frac{2^{r_i^k} - 1}{\log(1 + j_i^k)}    (1)

where Z_k is the normalization factor [1]. NDCG is usually truncated at a particular rank level (e.g., the first 10 retrieved documents) to emphasize the importance of the first retrieved documents.
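As a concrete illustration, Equation (1) can be computed per query as in the following sketch (the function names are ours; it uses the paper's gain/discount (2^r - 1)/log(1 + rank) and normalizes by the ideal DCG, i.e., the factor Z_k):

```python
import math

def dcg(rel, ranking, trunc=None):
    # DCG of Eq. (1): sum over positions of (2^r - 1) / log(1 + position)
    trunc = trunc or len(ranking)
    return sum((2 ** rel[d] - 1) / math.log(1 + pos)
               for pos, d in enumerate(ranking[:trunc], start=1))

def ndcg(rel, scores, trunc=None):
    # NDCG for one query: DCG of the score-induced ranking divided by the
    # ideal DCG (the normalization factor Z_k in Eq. (1))
    ranking = sorted(range(len(scores)), key=lambda d: -scores[d])
    ideal = sorted(range(len(rel)), key=lambda d: -rel[d])
    z = dcg(rel, ideal, trunc)
    return dcg(rel, ranking, trunc) / z if z > 0 else 0.0
```

A perfect ordering gives NDCG 1, and any misordering of documents with different relevance levels gives a value below 1.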

3.2 A Probabilistic Framework

One of the main challenges in optimizing the NDCG metric defined in Equation (1) is that the dependence of document ranks (i.e., j_i^k) on the ranking function F(d, q) is not explicitly expressed, which makes the optimization computationally challenging. To address this problem, we consider the expectation of L(Q, F) over all the possible rankings induced by the ranking function F(d, q), i.e.,

\bar{L}(Q, F) = \frac{1}{n} \sum_{k=1}^{n} \frac{1}{Z_k} \sum_{i=1}^{m_k} \left\langle \frac{2^{r_i^k} - 1}{\log(1 + \pi_k(i))} \right\rangle_F = \frac{1}{n} \sum_{k=1}^{n} \frac{1}{Z_k} \sum_{\pi_k \in S_{m_k}} \Pr(\pi_k | F, q_k) \sum_{i=1}^{m_k} \frac{2^{r_i^k} - 1}{\log(1 + \pi_k(i))}    (2)

where S_{m_k} stands for the group of permutations of m_k documents, and \pi_k is an instance of a permutation (or ranking). The notation \pi_k(i) stands for the rank position of the ith document under \pi_k. To this end, we first utilize the result in the following lemma to approximate the expectation of 1/\log(1 + \pi_k(i)) by the expectation of \pi_k(i).

Lemma 1. For any distribution \Pr(\pi_k | F, q_k), the inequality \bar{L}(Q, F) \geq H(Q, F) holds, where

H(Q, F) = \frac{1}{n} \sum_{k=1}^{n} \frac{1}{Z_k} \sum_{i=1}^{m_k} \frac{2^{r_i^k} - 1}{\log(1 + \langle \pi_k(i) \rangle_F)}    (3)

Proof. The proof follows from the facts that (a) 1/x is a convex function when x > 0 and therefore \langle 1/\log(1+x) \rangle \geq 1/\langle \log(1+x) \rangle, and (b) \log(1+x) is a concave function and therefore \langle \log(1+x) \rangle \leq \log(1 + \langle x \rangle). Combining these two facts together, we have the result stated in the lemma.

Since H(Q, F) provides a lower bound for \bar{L}(Q, F), in order to maximize \bar{L}(Q, F) we can alternatively maximize H(Q, F), which is substantially simpler than \bar{L}(Q, F). In the next step of simplification, we rewrite \pi_k(i) as

\pi_k(i) = 1 + \sum_{j=1, j \neq i}^{m_k} I(\pi_k(i) > \pi_k(j))    (4)

where I(x) outputs 1 when x is true and zero otherwise. Hence, \langle \pi_k(i) \rangle is written as

\langle \pi_k(i) \rangle = 1 + \sum_{j \neq i} \langle I(\pi_k(i) > \pi_k(j)) \rangle = 1 + \sum_{j \neq i} \Pr(\pi_k(i) > \pi_k(j))    (5)

As a result, to optimize H(Q, F), we only need to define \Pr(\pi_k(i) > \pi_k(j)), i.e., the marginal probability for document d_j^k to be ranked before document d_i^k. In the next section, we discuss how to define a probability model for \Pr(\pi_k | F, q_k), and derive the pairwise ranking probability \Pr(\pi_k(i) > \pi_k(j)) from the distribution \Pr(\pi_k | F, q_k).
3.3 Objective Function

We model \Pr(\pi_k | F, q_k) as follows:

\Pr(\pi_k | F, q_k) = \frac{1}{Z(F, q_k)} \exp\left( \sum_{i=1}^{m_k} \sum_{j: \pi_k(j) > \pi_k(i)} \left( F(d_i^k, q_k) - F(d_j^k, q_k) \right) \right) = \frac{1}{Z(F, q_k)} \exp\left( \sum_{i=1}^{m_k} \left( m_k - 2\pi_k(i) + 1 \right) F(d_i^k, q_k) \right)    (6)

where Z(F, q_k) is the partition function that ensures the probabilities sum to one. Equation (6) models each pair (d_i^k, d_j^k) of the ranking list \pi_k by the factor \exp(F(d_i^k, q_k) - F(d_j^k, q_k)) if d_i^k is ranked before d_j^k (i.e., \pi_k(d_i^k) < \pi_k(d_j^k)), and vice versa. This modeling choice is consistent with the idea of ranking the documents with the largest scores first; intuitively, the more documents in a permutation are in decreasing order of score, the larger the probability of the permutation. Using Equation (6) for \Pr(\pi_k | F, q_k), we can express H(Q, F) in terms of the ranking function F. By maximizing H(Q, F) over F, we can find the optimal solution for the ranking function F. As indicated by Equation (5), we only need to compute the marginal distribution \Pr(\pi_k(i) > \pi_k(j)). To approximate \Pr(\pi_k(i) > \pi_k(j)), we divide the group of permutations S_{m_k} into two sets:
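The two forms in Equation (6) agree because the coefficient that document d_i accumulates over all pairs is (number of documents ranked after it) minus (number ranked before it), i.e., (m - \pi(i)) - (\pi(i) - 1) = m - 2\pi(i) + 1. A small sketch (function names are ours) that checks this identity on the unnormalized log-probabilities:

```python
def log_unnorm_pairwise(pi, F):
    # log of the unnormalized Pr(pi|F,q): sum of F_i - F_j over all ordered
    # pairs (i, j) with pi(j) > pi(i), i.e. the first form of Eq. (6).
    # pi[i] is the rank position (1-based) of document i.
    m = len(F)
    return sum(F[i] - F[j]
               for i in range(m) for j in range(m)
               if pi[j] > pi[i])

def log_unnorm_closed(pi, F):
    # Equivalent closed form of Eq. (6): sum_i (m - 2*pi(i) + 1) * F_i
    m = len(F)
    return sum((m - 2 * pi[i] + 1) * F[i] for i in range(m))
```

For any permutation the two functions return the same value, which is the identity used to collapse the double sum in Equation (6).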

G_a(i, j) = \{\pi_k : \pi_k(i) > \pi_k(j)\} and G_b(i, j) = \{\pi_k : \pi_k(i) < \pi_k(j)\}. Notice that there is a one-to-one mapping between these two sets; namely, for any ranking \pi_k \in G_a(i, j), we can create a corresponding ranking \pi'_k \in G_b(i, j) by switching the rankings of documents d_i^k and d_j^k, and vice versa. The following lemma allows us to bound the marginal distribution \Pr(\pi_k(i) > \pi_k(j)).

Lemma 2. If F(d_i^k, q_k) > F(d_j^k, q_k), we have

\Pr(\pi_k(i) > \pi_k(j)) \leq \frac{1}{1 + \exp\left[ 2\left( F(d_i^k, q_k) - F(d_j^k, q_k) \right) \right]}    (7)

Proof.

1 = \sum_{\pi_k \in G_a(i,j)} \Pr(\pi_k | F, q_k) + \sum_{\pi_k \in G_b(i,j)} \Pr(\pi_k | F, q_k)
  = \sum_{\pi_k \in G_a(i,j)} \Pr(\pi_k | F, q_k) \left( 1 + \exp\left[ 2\left( \pi_k(i) - \pi_k(j) \right)\left( F(d_i^k, q_k) - F(d_j^k, q_k) \right) \right] \right)
  \geq \sum_{\pi_k \in G_a(i,j)} \Pr(\pi_k | F, q_k) \left( 1 + \exp\left[ 2\left( F(d_i^k, q_k) - F(d_j^k, q_k) \right) \right] \right)
  = \left( 1 + \exp\left[ 2\left( F(d_i^k, q_k) - F(d_j^k, q_k) \right) \right] \right) \Pr(\pi_k(i) > \pi_k(j))

We used the definition of \Pr(\pi_k | F, q_k) in Equation (6) to express G_b(i, j) as the dual of G_a(i, j) in the first step of the proof. The inequality holds because \pi_k(i) - \pi_k(j) \geq 1, and the last step holds because \Pr(\pi_k | F, q_k) is the only term dependent on \pi_k.

This lemma indicates that we can approximate \Pr(\pi_k(i) > \pi_k(j)) by a simple logistic model. The idea of using a logistic model for \Pr(\pi_k(i) > \pi_k(j)) is not new in learning to rank [7, 9]; however, it has been taken for granted, and no justification had been provided for its use in learning to rank. Using the logistic model approximation introduced in Lemma 2, we now write \langle \pi_k(i) \rangle as

\langle \pi_k(i) \rangle \leq 1 + \sum_{j \neq i} \frac{1}{1 + \exp\left[ 2\left( F(d_i^k, q_k) - F(d_j^k, q_k) \right) \right]}    (8)

To simplify our notation, we define \bar{F}_i^k = 2F(d_i^k, q_k) and rewrite the above expression as

\langle \pi_k(i) \rangle = 1 + \sum_{j \neq i} \Pr(\pi_k(i) > \pi_k(j)) \approx 1 + \sum_{j \neq i} \frac{1}{1 + \exp(\bar{F}_i^k - \bar{F}_j^k)}

Using the above approximation for \langle \pi_k(i) \rangle, we can write H in Equation (3) as

H(Q, F) \approx \frac{1}{n} \sum_{k=1}^{n} \frac{1}{Z_k} \sum_{i=1}^{m_k} \frac{2^{r_i^k} - 1}{\log(2 + A_i^k)}    (9)

where

A_i^k = \sum_{j=1}^{m_k} \frac{I(j \neq i)}{1 + \exp(\bar{F}_i^k - \bar{F}_j^k)}    (10)

We state the following proposition to further simplify the objective function:

Proposition 1.

\frac{1}{\log(2 + A_i^k)} \geq \frac{1}{\log 2} - \frac{A_i^k}{2 [\log 2]^2}

The proof follows from the Taylor expansion of the convex function 1/\log(2 + x), x > -1, around x = 0, noting that A_i^k > 0 (the convexity argument for 1/\log(1 + x) is given in Lemma 1).
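The expected-rank approximation after Lemma 2 is straightforward to compute directly; a minimal sketch (the function name is ours):

```python
import math

def expected_ranks(scores):
    # <pi(i)> ~ 1 + sum_{j != i} 1 / (1 + exp(Fbar_i - Fbar_j)),
    # with Fbar_i = 2 * F(d_i, q), following the approximation after Eq. (8)
    fbar = [2.0 * s for s in scores]
    m = len(fbar)
    return [1.0 + sum(1.0 / (1.0 + math.exp(fbar[i] - fbar[j]))
                      for j in range(m) if j != i)
            for i in range(m)]
```

A pleasant sanity check on this relaxation: since 1/(1+e^x) + 1/(1+e^{-x}) = 1 for every pair, the approximate ranks always sum to m(m+1)/2, exactly as true rank positions do.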
By plugging the result of this proposition into the objective function in Equation (9), the new objective is to minimize the following quantity:

M(Q, F) = \frac{1}{n} \sum_{k=1}^{n} \frac{1}{Z_k} \sum_{i=1}^{m_k} \left( 2^{r_i^k} - 1 \right) A_i^k    (11)

The objective function in Equation (11) is explicitly related to F via the term A_i^k. In the next section, we derive an algorithm that learns an effective ranking function by efficiently minimizing M. It is also important to note that although M is no longer a rigorous lower bound for the original objective function \bar{L}, our empirical study shows that this approximation is very effective in identifying an appropriate ranking function from the training data.

3.4 Algorithm

To minimize M(Q, F) in Equation (11), we employ the bound optimization strategy [18], which iteratively updates the solution for F. Let \bar{F}_i^k denote the value obtained so far for document d_i^k. Following the idea of AdaBoost, we restrict the new ranking value for document d_i^k, denoted \bar{F}_i^{k\prime}, to the following form:

\bar{F}_i^{k\prime} = \bar{F}_i^k + \alpha f_i^k    (12)

where \alpha > 0 is the combination weight and f_i^k = f(d_i^k, q_k) \in \{0, 1\} is a binary value. Note that in the above, we assume the ranking function F(d, q) is updated iteratively by the addition of a binary classification function f(d, q), which leads to efficient computation as well as effective exploitation of existing algorithms for data classification. To construct a bound for M(Q, F), we first handle the expression 1/[1 + \exp(\bar{F}_i^k - \bar{F}_j^k)], summarized by the following proposition.

Proposition 2.

\frac{1}{1 + \exp(\bar{F}_i^{k\prime} - \bar{F}_j^{k\prime})} \leq \frac{1}{1 + \exp(\bar{F}_i^k - \bar{F}_j^k)} + \gamma_{i,j}^k \left[ \exp\left( \alpha \left( f_j^k - f_i^k \right) \right) - 1 \right]    (13)

where

\gamma_{i,j}^k = \frac{\exp(\bar{F}_i^k - \bar{F}_j^k)}{\left( 1 + \exp(\bar{F}_i^k - \bar{F}_j^k) \right)^2}    (14)

The proof of this proposition can be found in Appendix A. This proposition separates the term related to \bar{F}_i^k from that related to \alpha f_i^k, and shows how the new weak ranker (i.e., the binary classification function f(d, q)) will affect the current ranking function F(d, q). Using the above proposition, we can derive a closed-form solution for \alpha given the solution for f (Theorem 1), as well as an upper bound for M (Theorem 2).

Theorem 1. Given the solution for the binary classifier f_i^k, the optimal \alpha that minimizes the objective function in Equation (11) is

\alpha = \frac{1}{2} \log \frac{ \sum_{k=1}^{n} \frac{1}{Z_k} \sum_{i,j=1}^{m_k} \left( 2^{r_i^k} - 1 \right) \theta_{i,j}^k \, I(f_j^k < f_i^k) }{ \sum_{k=1}^{n} \frac{1}{Z_k} \sum_{i,j=1}^{m_k} \left( 2^{r_i^k} - 1 \right) \theta_{i,j}^k \, I(f_j^k > f_i^k) }    (15)

where \theta_{i,j}^k = \gamma_{i,j}^k I(j \neq i).

Theorem 2.

M(Q, F') \leq M(Q, F) + \gamma(\alpha) - \frac{\exp(\alpha) - \exp(-\alpha)}{2} \cdot \frac{1}{n} \sum_{k=1}^{n} \frac{1}{Z_k} \sum_{i=1}^{m_k} f_i^k \sum_{j=1}^{m_k} \left( 2^{r_i^k} - 2^{r_j^k} \right) \theta_{i,j}^k

where \gamma(\alpha) is a function of \alpha alone, with \gamma(0) = 0.

The proofs of these theorems are provided in Appendix B and Appendix C, respectively. Note that the bound provided by Theorem 2 is tight: by setting \alpha = 0, the inequality reduces to the equality M(Q, F') = M(Q, F).
The importance of this theorem is that the optimal solution for f_i^k can be found without knowing the solution for \alpha. Algorithm 1 summarizes the procedure for minimizing the objective function in Equation (11). First, it computes \theta_{i,j}^k for every pair of documents of each query. Then, it computes w_i^k, a weight for each document, which can be positive or negative. A positive weight w_i^k indicates that the ranking position of d_i^k induced by the current ranking function F is lower than its true rank position, while a negative weight w_i^k shows that the ranking position of d_i^k induced by the current F is higher than its true rank position. Therefore, the sign of the weight w_i^k provides clear guidance for how to construct the next weak ranker, the binary classifier in our case: documents with a positive w_i^k should be labeled +1 by the binary classifier, and those with a negative w_i^k should be labeled -1. The magnitude of w_i^k shows how far the corresponding document is misplaced in the ranking; in other words, it shows the importance of correcting the ranking position of document d_i^k in terms of improving the NDCG value. This leads to maximizing the quantity \eta given in Equation (17), which can be considered a weighted classification accuracy. We use a sampling strategy to maximize \eta, because most binary classifiers do not support weighted training sets; that is, we first sample the documents according to |w_i^k| and then construct a binary classifier with the sampled documents. It can be shown that the proposed algorithm reduces the objective function M exponentially (the proof is omitted due to lack of space). Note that we write F(d_i^k) instead of F(d_i^k, q_k) to simplify the notation in the algorithm.
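Putting the pieces together, the boosting procedure described above can be sketched as follows. This is a toy implementation under simplifying assumptions of our own: one-dimensional features, a decision stump f(x) = I(x > t) as the weak ranker (chosen by maximizing the weighted quantity of Eq. (17)), Z_k set to 1, and a small epsilon guard in the closed-form combination weight of Eq. (15):

```python
import math

def ndcg_boost(queries, n_iter=10):
    # queries: list of (x, r) pairs; x = 1-D feature values, r = relevance labels.
    # Returns the learned document scores F for each query.
    F = [[0.0] * len(r) for _, r in queries]
    for _ in range(n_iter):
        # theta_{i,j} = gamma_{i,j} of Eq. (14) for i != j; weights w of Eq. (16)
        theta, W = [], []
        for k, (x, r) in enumerate(queries):
            m = len(r)
            th = [[0.0] * m for _ in range(m)]
            w = [0.0] * m
            for i in range(m):
                for j in range(m):
                    if i != j:
                        d = 2.0 * (F[k][i] - F[k][j])      # Fbar_i - Fbar_j
                        th[i][j] = math.exp(d) / (1.0 + math.exp(d)) ** 2
                        w[i] += (2 ** r[i] - 2 ** r[j]) * th[i][j]
            theta.append(th)
            W.append(w)
        # weak ranker: stump threshold maximizing eta = sum_i w_i f(d_i),
        # which equals Eq. (17) since y_i = sign(w_i)
        cands = sorted({v for x, _ in queries for v in x})
        best_eta, best_t = 0.0, None
        for t in cands:
            eta = sum(W[k][i] for k, (x, _) in enumerate(queries)
                      for i in range(len(x)) if x[i] > t)
            if eta > best_eta:
                best_eta, best_t = eta, t
        if best_t is None:
            break                                          # no stump improves eta
        f = [[1 if v > best_t else 0 for v in x] for x, _ in queries]
        # combination weight alpha of Eq. (15), with an epsilon guard
        num = den = 1e-8
        for k, (x, r) in enumerate(queries):
            for i in range(len(x)):
                for j in range(len(x)):
                    g = (2 ** r[i] - 1) * theta[k][i][j]
                    if f[k][j] < f[k][i]:
                        num += g
                    elif f[k][j] > f[k][i]:
                        den += g
        alpha = 0.5 * math.log(num / den)
        for k in range(len(queries)):
            for i in range(len(f[k])):
                F[k][i] += alpha * f[k][i]
    return F
```

On a toy query whose single feature is informative of relevance, a few iterations suffice for the learned scores to rank the documents in the order of their relevance labels.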

Algorithm 1 NDCG_Boost: A Boosting Algorithm for Maximizing NDCG

1: Initialize F(d_i^k) = 0 for all documents
2: repeat
3:   Compute \theta_{i,j}^k = \gamma_{i,j}^k I(j \neq i) for all document pairs of each query, where \gamma_{i,j}^k is given in Equation (14)
4:   Compute the weight for each document as

       w_i^k = \frac{1}{Z_k} \sum_{j=1}^{m_k} \left( 2^{r_i^k} - 2^{r_j^k} \right) \theta_{i,j}^k    (16)

5:   Assign each document the class label y_i^k = \mathrm{sign}(w_i^k)
6:   Train a classifier f(x): \mathbb{R}^d \to \{0, 1\} that maximizes the quantity

       \eta = \sum_{k=1}^{n} \sum_{i=1}^{m_k} |w_i^k| \, f(d_i^k) \, y_i^k    (17)

7:   Predict f_i^k for all documents in \{D^k, k = 1, \ldots, n\}
8:   Compute the combination weight \alpha as given in Equation (15)
9:   Update the ranking function as F_i^k \leftarrow F_i^k + \alpha f_i^k
10: until the maximum number of iterations is reached

4 Experiments

To study the performance of NDCG_Boost, we use the latest version (version 3.0) of the LETOR package provided by Microsoft Research Asia [22]. The LETOR package includes several benchmark data sets, baselines, and evaluation tools for research on learning to rank.

4.1 LETOR Data Sets

There are seven data sets provided in the LETOR package: OHSUMED, Topic Distillation 2003 (TD2003), Topic Distillation 2004 (TD2004), Homepage Finding 2003 (HP2003), Homepage Finding 2004 (HP2004), Named Page Finding 2003 (NP2003) and Named Page Finding 2004 (NP2004). (The experimental result for the last data set is not reported due to lack of space.) There are 106 queries in the OHSUMED data set, with a number of documents for each query. The relevancy of each document in the OHSUMED data set is scored 0 (irrelevant), 1 (possibly relevant) or 2 (definitely relevant). The total number of query-document relevancy judgments provided in the OHSUMED data set is 16,140, and there are 45 features for each query-document pair. For TD2003, TD2004, HP2003, HP2004 and NP2003, there are 50, 75, 75, 75 and 150 queries, respectively, with about 1,000 retrieved documents for each query. This amounts to a total of 49,171, 74,170, 74,409, 73,834 and 147,606 query-document pairs for TD2003, TD2004, HP2003, HP2004 and NP2003, respectively. For these data sets, 63 features are extracted for each query-document pair, and a binary relevancy judgment is provided for each pair.
For every data set in LETOR, five partitions are provided to conduct five-fold cross-validation, each including training, test and validation sets. The results of a number of state-of-the-art learning to rank algorithms are also provided in the LETOR package. Since these baselines include some of the most well-known learning to rank algorithms from each category (pointwise, pairwise and listwise), we use them to study the performance of NDCG_Boost. Here is the list of these baselines (details can be found on the LETOR web page):

Regression: a simple linear regression; a basic pointwise approach that can be considered a reference point.

RankSVM: a pairwise approach based on support vector machines [5].

FRank: a pairwise approach. It uses a probability model similar to that of RankNet [7] for the relative rank position of two documents, with a novel loss function called the fidelity loss [9]. Tsai et al. [9] showed that FRank performs much better than RankNet.

ListNet: a listwise learning to rank algorithm [14]. It uses the cross-entropy loss as its listwise loss function.

AdaRank: a listwise boosting algorithm that incorporates NDCG in computing the sample and combination weights [20].

SVM_MAP: a support vector machine with the MAP measure used in the constraints. It is a listwise approach [2].

Figure 1: The experimental results in terms of NDCG@n on the LETOR 3.0 data sets: (a) OHSUMED, (b) TD2003, (c) TD2004, (d) HP2003, (e) HP2004, (f) NP2003. The compared methods are Regression, FRank, ListNet, RankSVM, AdaRank, SVM_MAP and NDCG_Boost.

While the validation set is used for finding the best set of parameters for the baselines in LETOR, it is not used for NDCG_Boost in our experiments. For NDCG_Boost, we set the maximum number of iterations to 100 and use a decision stump as the weak ranker. Figure 1 provides the average results over the five folds for the different learning to rank algorithms, in terms of NDCG@n at each of the first 10 truncation levels, on the LETOR data sets. Notice that the relative performance of the algorithms varies from one data set to another; however, NDCG_Boost almost always performs the best. We would like to point out a few statistics. On the OHSUMED data set, NDCG_Boost achieves 0.50 at NDCG@1, a 4% increase in performance compared to FRank, the second best algorithm. On the TD2003 data set, this value for NDCG_Boost is 0.375, a 10% increase compared with RankSVM (0.34), the second best method. On the HP2004 data set, NDCG_Boost achieves 0.80 at NDCG@1, compared to 0.75 for SVM_MAP, the second best method, a 6% increase. Moreover, among all the methods in comparison, NDCG_Boost appears to be the most stable method across all the data sets. For example, FRank, which performs well on the OHSUMED and TD2004 data sets, yields poor performance on TD2003, HP2003 and HP2004.
Similarly, AdaRank achieves decent performance on the OHSUMED data set, but fails to deliver accurate ranking results on TD2003, HP2003 and NP2003. In fact, both AdaRank and FRank perform even worse than the simple Regression approach on TD2003, which further indicates their instability. As another example, ListNet and RankSVM, which perform well on TD2003, are not competitive with NDCG_Boost on the OHSUMED and TD2004 data sets.

5 Conclusion

The listwise approach is a relatively new approach to learning to rank. It aims to use a query-level loss function to optimize a given IR measure. The difficulty in optimizing an IR measure lies in the sort operation inherent in the measure. We address this challenge by a probabilistic framework that optimizes the expectation of NDCG over all the possible permutations of documents. We present a relaxation strategy to effectively approximate the expectation of NDCG, and a bound optimization strategy for efficient optimization. Our experiments on benchmark data sets show that our method is superior to state-of-the-art learning to rank algorithms in terms of performance and stability. (NDCG is commonly measured at the first few retrieved documents to emphasize their importance.)

6 Acknowledgements

The work was supported in part by Yahoo! Labs and the National Institutes of Health (R01GM079688-01). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of Yahoo! and NIH. (The first author has been supported as a part-time intern at Yahoo!)

A Proof of Proposition 2

\frac{1}{1 + \exp(\bar{F}_i^{k\prime} - \bar{F}_j^{k\prime})} = \frac{1}{1 + \exp\left( \bar{F}_i^k - \bar{F}_j^k + \alpha (f_i^k - f_j^k) \right)}
\leq \frac{1}{1 + \exp(\bar{F}_i^k - \bar{F}_j^k)} + \frac{\exp(\bar{F}_i^k - \bar{F}_j^k)}{\left( 1 + \exp(\bar{F}_i^k - \bar{F}_j^k) \right)^2} \left[ \exp\left( \alpha (f_j^k - f_i^k) \right) - 1 \right]
= \frac{1}{1 + \exp(\bar{F}_i^k - \bar{F}_j^k)} + \gamma_{i,j}^k \left[ \exp\left( \alpha (f_j^k - f_i^k) \right) - 1 \right]

The first step is a simple manipulation of the terms, and the second step follows from the convexity of the inverse function on \mathbb{R}_+.

B Proof of Theorem 1

To obtain the result of the theorem, we first plug Equation (13) into Equation (11). This leads to minimizing \sum_{k=1}^{n} \frac{1}{Z_k} \sum_{i,j=1}^{m_k} (2^{r_i^k} - 1) \theta_{i,j}^k \left[ \exp(\alpha(f_j^k - f_i^k)) - 1 \right], the only term related to \alpha. Since f_i^k takes the binary values 0 and 1, we have

\sum_{k=1}^{n} \frac{1}{Z_k} \sum_{i,j=1}^{m_k} \left( 2^{r_i^k} - 1 \right) \theta_{i,j}^k \exp\left( \alpha (f_j^k - f_i^k) \right) = \sum_{k=1}^{n} \frac{1}{Z_k} \sum_{i,j=1}^{m_k} \left( 2^{r_i^k} - 1 \right) \theta_{i,j}^k \left( \exp(\alpha) I(f_j^k > f_i^k) + \exp(-\alpha) I(f_j^k < f_i^k) + I(f_j^k = f_i^k) \right)

Setting the partial derivative of this term with respect to \alpha to zero yields the theorem.

C Proof of Theorem 2

First, we provide the following proposition to handle \exp(\alpha(f_j^k - f_i^k)).

Proposition 3. If x, y \in [0, 1], we have

\exp\left( \alpha (x - y) \right) \leq \frac{1 + x - y}{2} \exp(\alpha) + \frac{1 - x + y}{2} \exp(-\alpha)    (18)

Proof. Since x - y \in [-1, 1], \alpha(x - y) is a convex combination of \alpha and -\alpha, and the result follows from the convexity of the exp function.

Using the result in the above proposition, we can bound the last term in Equation (13) as follows:

\theta_{i,j}^k \left[ \exp\left( \alpha (f_j^k - f_i^k) \right) - 1 \right] \leq \theta_{i,j}^k \left( \frac{\exp(\alpha) - \exp(-\alpha)}{2} \left( f_j^k - f_i^k \right) + \frac{\exp(\alpha) + \exp(-\alpha)}{2} - 1 \right)    (19)

Using the results in Equations (19) and (13), we have M(Q, F') in Equation (11) bounded as

M(Q, F') \leq M(Q, F) + \gamma(\alpha) + \frac{\exp(\alpha) - \exp(-\alpha)}{2} \cdot \frac{1}{n} \sum_{k=1}^{n} \frac{1}{Z_k} \sum_{i=1}^{m_k} \sum_{j=1}^{m_k} \left( 2^{r_i^k} - 1 \right) \theta_{i,j}^k \left( f_j^k - f_i^k \right)
= M(Q, F) + \gamma(\alpha) - \frac{\exp(\alpha) - \exp(-\alpha)}{2} \cdot \frac{1}{n} \sum_{k=1}^{n} \frac{1}{Z_k} \sum_{i=1}^{m_k} f_i^k \sum_{j=1}^{m_k} \left( 2^{r_i^k} - 2^{r_j^k} \right) \theta_{i,j}^k

where the last step uses the symmetry \theta_{i,j}^k = \theta_{j,i}^k.

References

[1] Kalervo Järvelin and Jaana Kekäläinen. IR evaluation methods for retrieving highly relevant documents. In SIGIR 2000: Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 41-48, 2000.
[2] Yisong Yue, Thomas Finley, Filip Radlinski, and Thorsten Joachims. A support vector method for optimizing average precision. In SIGIR 2007: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 271-278, 2007.
[3] Ping Li, Christopher Burges, and Qiang Wu. McRank: Learning to rank using multiple classification and gradient boosting. In Neural Information Processing Systems 2007.
[4] Ramesh Nallapati. Discriminative models for information retrieval. In SIGIR 2004: Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 64-71, New York, NY, USA, 2004. ACM.
[5] Ralf Herbrich, Thore Graepel, and Klaus Obermayer. Support vector learning for ordinal regression. In International Conference on Artificial Neural Networks 1999, pages 97-102, 1999.
[6] Yoav Freund, Raj Iyer, Robert E. Schapire, and Yoram Singer. An efficient boosting algorithm for combining preferences. Journal of Machine Learning Research, 4:933-969, 2003.
[7] Chris Burges, Tal Shaked, Erin Renshaw, Ari Lazier, Matt Deeds, Nicole Hamilton, and Greg Hullender. Learning to rank using gradient descent. In International Conference on Machine Learning 2005.
[8] Yunbo Cao, Jun Xu, Tie-Yan Liu, Hang Li, Yalou Huang, and Hsiao-Wuen Hon. Adapting ranking SVM to document retrieval. In SIGIR 2006: Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 186-193, 2006.
[9] Ming-Feng Tsai, Tie-Yan Liu, Tao Qin, Hsin-Hsi Chen, and Wei-Ying Ma. FRank: A ranking method with fidelity loss. In SIGIR 2007: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2007.
[10] Rong Jin, Hamed Valizadegan, and Hang Li. Ranking refinement and its application to information retrieval. In WWW 2008: Proceedings of the 17th International Conference on World Wide Web.
[11] Steven C. H. Hoi and Rong Jin. Semi-supervised ensemble ranking. In Proceedings of the Association for the Advancement of Artificial Intelligence (AAAI 2008).
[12] Tao Qin, Tie-Yan Liu, Ming-Feng Tsai, Xu-Dong Zhang, and Hang Li. Learning to search web pages with query-level loss functions. Technical report, 2006.
[13] Christopher J. C. Burges, Robert Ragno, and Quoc V. Le. Learning to rank with nonsmooth cost functions. In Neural Information Processing Systems 2006.
[14] Zhe Cao and Tie-Yan Liu. Learning to rank: From pairwise approach to listwise approach. In International Conference on Machine Learning 2007, pages 129-136, 2007.
[15] Fen Xia, Tie-Yan Liu, Jue Wang, Wensheng Zhang, and Hang Li. Listwise approach to learning to rank: Theory and algorithm. In International Conference on Machine Learning 2008, pages 1192-1199, 2008.
[16] Michael Taylor, John Guiver, Stephen Robertson, and Tom Minka. SoftRank: Optimizing non-smooth rank metrics.
[17] Maksims N. Volkovs and Richard S. Zemel. BoltzRank: Learning to maximize expected ranking gain. In ICML 2009: Proceedings of the 26th Annual International Conference on Machine Learning, pages 1089-1096, New York, NY, USA, 2009. ACM.
[18] Ruslan Salakhutdinov, Sam Roweis, and Zoubin Ghahramani. On the convergence of bound optimization algorithms. In Proceedings of the 19th Conference on Uncertainty in Artificial Intelligence (UAI 2003).
[19] Jen-Yuan Yeh, Yung-Yi Lin, Hao-Ren Ke, and Wei-Pang Yang. Learning to rank for information retrieval using genetic programming. In SIGIR 2007 Workshop: Learning to Rank for Information Retrieval.
[20] Jun Xu and Hang Li. AdaRank: A boosting algorithm for information retrieval. In SIGIR 2007: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 391-398, 2007.
[21] Zhengya Sun, Tao Qin, Qing Tao, and Jue Wang. Robust sparse rank learning for non-smooth ranking measures. In SIGIR 2009: Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 259-266, New York, NY, USA, 2009. ACM.
[22] Tie-Yan Liu, Tao Qin, Jun Xu, Wenying Xiong, and Hang Li. LETOR: Benchmark dataset for research on learning to rank for information retrieval.
In SIGIR 07: Proceedings of the 0th annual international ACM SIGIR conference on Research and development in information retrieval, pages 9 98, 2007. [2] Zhengya Sun, Tao Qin, Qing Tao, and Jue Wang. Robust sparse ran learning for non-smooth raning measures. In SIGIR 09: Proceedings of the 2nd international ACM SIGIR conference on Research and development in information retrieval, pages 259 266, New Yor, NY, USA, 2009. ACM. [22] Tie-Yan Liu, Tao Qin, Jun Xu, Wenying Xiong, and Hang Li. Letor: Benchmar dataset for research on learning to ran for information retrieval. 9