Web Search and Text Mining. Learning from Preference Data

Similar documents
Semestrial Project - Expedia Hotel Ranking

Lecture 8. Instructor: Haipeng Luo

Robust Reductions from Ranking to Classification

Large-Margin Thresholded Ensembles for Ordinal Regression

Natural Language Processing. Classification. Features. Some Definitions. Classification. Feature Vectors. Classification I. Dan Klein UC Berkeley

The AdaBoost algorithm. w_i = 1/n for i = 1, ..., n. 1) At the m-th iteration we find (any) classifier h(x; θ̂_m) for which the weighted classification error ε_m

1 Review of Winnow Algorithm

6.036 midterm review. Wednesday, March 18, 15

CS-E4830 Kernel Methods in Machine Learning

CS 6375 Machine Learning

Learning Binary Classifiers for Multi-Class Problem

CSCI-567: Machine Learning (Spring 2019)

Large-scale Linear RankSVM

VC Dimension Review. The purpose of this document is to review VC dimension and PAC learning for infinite hypothesis spaces.

Multiclass Boosting with Repartitioning

Metric Embedding of Task-Specific Similarity. joint work with Trevor Darrell (MIT)

Binary Classification, Multi-label Classification and Ranking: A Decision-theoretic Approach

The definitions and notation are those introduced in the lecture slides. R = E_{x∼D}[h

MIRA, SVM, k-nn. Lirong Xia

Mehryar Mohri Foundations of Machine Learning Courant Institute of Mathematical Sciences Homework assignment 3 April 5, 2013 Due: April 19, 2013

1. Implement AdaBoost with boosting stumps and apply the algorithm to the. Solution:

Randomized Decision Trees

Linear Models for Regression CS534

CSE 151 Machine Learning. Instructor: Kamalika Chaudhuri

CS 188: Artificial Intelligence. Outline

Estimating the accuracy of a hypothesis Setting. Assume a binary classification setting

Decoupled Collaborative Ranking

Large margin optimization of ranking measures

Evaluation Metrics. Jaime Arguello. INLS 509: Information Retrieval. Monday, March 25, 2013

Stephen Scott.

Totally Corrective Boosting Algorithms that Maximize the Margin

Statistical Ranking Problem

COMS 4771 Lecture Boosting 1 / 16

Multi-label Active Learning with Auxiliary Learner

Evaluation. Andrea Passerini Machine Learning. Evaluation

Consistency of Nearest Neighbor Methods

CSE 417T: Introduction to Machine Learning. Final Review. Henry Chai 12/4/18

H_t(x) = Σ_{i=1}^t α_i h_i(x) = H_{t−1}(x) + α_t h_t(x)

Foundations of Machine Learning Multi-Class Classification. Mehryar Mohri Courant Institute and Google Research

Machine Learning. Ensemble Methods. Manfred Huber

Evaluation requires to define performance measures to be optimized

An Introduction to Machine Learning

Listwise Approach to Learning to Rank Theory and Algorithm

Decision Trees. Nicholas Ruozzi University of Texas at Dallas. Based on the slides of Vibhav Gogate and David Sontag

Lecture 7: DecisionTrees

Machine Learning Linear Models

Information Retrieval

Statistical Machine Learning from Data

VBM683 Machine Learning

How do we compare the relative performance among competing models?

ECE 5424: Introduction to Machine Learning

Linear, Binary SVM Classifiers

5/21/17. Machine learning for IR ranking? Introduction to Information Retrieval

AdaBoost. S. Sumitra Department of Mathematics Indian Institute of Space Science and Technology

Entropy-based data organization tricks for browsing logs and packet captures

CIS 520: Machine Learning Oct 09, Kernel Methods

A Randomized Approach for Crowdsourcing in the Presence of Multiple Views

CS281B/Stat241B. Statistical Learning Theory. Lecture 1.

Learning by constraints and SVMs (2)

Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen. Linear Classifiers. Blaine Nelson, Tobias Scheffer

Learning theory. Ensemble methods. Boosting. Boosting: history

Open Problem: A (missing) boosting-type convergence result for ADABOOST.MH with factorized multi-class classifiers

Logistic Regression. Machine Learning Fall 2018

TTIC An Introduction to the Theory of Machine Learning. Learning from noisy data, intro to SQ model

Foundations of Machine Learning Lecture 9. Mehryar Mohri Courant Institute and Google Research

How to learn from very few examples?

CS534 Machine Learning - Spring Final Exam

15-388/688 - Practical Data Science: Decision trees and interpretable models. J. Zico Kolter Carnegie Mellon University Spring 2018

Robotics 2 AdaBoost for People and Place Detection

COMS 4721: Machine Learning for Data Science Lecture 13, 3/2/2017

Boosting. CAP5610: Machine Learning Instructor: Guo-Jun Qi

Mark your answers ON THE EXAM ITSELF. If you are not sure of your answer you may wish to provide a brief explanation.

Classification: Analyzing Sentiment

CS6375: Machine Learning Gautam Kunapuli. Decision Trees

Discriminative Models

Models, Data, Learning Problems

Multiclass Classification-1

Numerical Learning Algorithms

1 Generalization bounds based on Rademacher complexity

Discriminative Direction for Kernel Classifiers

Machine Learning: Chenhao Tan University of Colorado Boulder LECTURE 9

Computational Game Theory Spring Semester, 2005/6. Lecturer: Yishay Mansour Scribe: Ilan Cohen, Natan Rubin, Ophir Bleiberg*

Active Learning and Optimized Information Gathering

Background. Adaptive Filters and Machine Learning. Bootstrap. Combining models. Boosting and Bagging. Poltayev Rassulzhan

Performance Metrics for Machine Learning. Sargur N. Srihari

ECE521 week 3: 23/26 January 2017

Decision Trees. Machine Learning CSEP546 Carlos Guestrin University of Washington. February 3, 2014

Regularization. CSCE 970 Lecture 3: Regularization. Stephen Scott and Vinod Variyam. Introduction. Outline

A Deep Interpretation of Classifier Chains

CS60021: Scalable Data Mining. Large Scale Machine Learning

Introduction to Boosting and Joint Boosting

Greedy function optimization in learning to rank

A MODIFIED ALGORITHM FOR RANKING PLAYERS OF A ROUND-ROBIN TOURNAMENT

ORIE 4741 Final Exam

CS229 Supplemental Lecture notes

Foundations of Machine Learning

Transcription:

Web Search and Text Mining Learning from Preference Data

Outline
Two-stage algorithm: learning preference functions, and finding a total order that best agrees with a preference function.
Learning ranking functions from preference data.
Learning ranking functions from combined labeled and preference data.

Ranking Problem and Preference Judgments
Ranking problem: ranking a list of items according to some underlying criterion.
Preference judgments: statements that one item should be ranked higher than another.
Problem set-up: 1) learn a preference function from a set of preference judgments (preference data); 2) for a new list of items, apply the learned preference function; 3) find a total order of the items that best agrees with the preference function.

Preference Functions
We assume each item is represented by a feature vector x ∈ X. A preference function is a map g : X × X → [0, 1].
Interpretation: 1) g(u, v) close to 1: u ranked higher than v; 2) g(u, v) close to 0: v ranked higher than u; 3) g(u, v) close to 1/2: no preference w.r.t. u and v.
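As a concrete illustration (not from the slides), here is a minimal preference function built from a scoring model; the logistic link and the linear score are illustrative assumptions.

```python
import numpy as np

def preference(u, v, score):
    """Toy preference function g(u, v) derived from a scoring model:
    a logistic of the score difference, so g is near 1 when u scores
    much higher than v, near 0 when v scores much higher, and near
    1/2 when the two scores are about equal."""
    return 1.0 / (1.0 + np.exp(-(score(u) - score(v))))

# Usage with an assumed linear scoring model on 2-d feature vectors.
w = np.array([1.0, -0.5])
score = lambda x: w @ x
print(preference(np.array([2.0, 0.0]), np.array([0.0, 1.0]), score))  # ~0.92
```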

Learning a Preference Function
Given preference data S = {⟨x_i, y_i⟩ : x_i ≻ y_i, i = 1, ..., N}, we can turn it into a binary classification problem:
{(⟨x_i, y_i⟩, +1), (⟨y_i, x_i⟩, −1), i = 1, ..., N}.
Many choices of classifier: SVM, AdaBoost, etc.
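A sketch of this reduction, assuming pairs are represented by feature differences u − v as in RankSVM; the slides leave the pair representation open, so `pairs_to_classification` and the toy data are illustrative.

```python
import numpy as np
from sklearn.svm import LinearSVC

def pairs_to_classification(pairs):
    """Turn preference pairs (x_i preferred over y_i) into a binary
    classification set: each pair contributes the two labeled examples
    (<x_i, y_i>, +1) and (<y_i, x_i>, -1), with the pair <u, v>
    represented here by the difference u - v."""
    X, labels = [], []
    for x, y in pairs:
        X.append(x - y); labels.append(+1)
        X.append(y - x); labels.append(-1)
    return np.array(X), np.array(labels)

# Toy preference data: (x_i, y_i) means x_i should rank above y_i.
rng = np.random.default_rng(0)
pairs = [(rng.normal(size=5), rng.normal(size=5)) for _ in range(100)]
X, labels = pairs_to_classification(pairs)
clf = LinearSVC().fit(X, labels)  # g(u, v) ~ 1 when clf.predict([u - v]) == 1
```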

From Preference Function to Total Order
Given a new list of items U, we run the binary classifier on each pair ⟨u, v⟩ ∈ U × U. In effect, we run a tournament on U and use the binary classifier to determine the outcome of each match between players u and v.
Problem: find a total order (linear order) on U that best agrees with the tournament results, for example by minimizing the number of mistakes. A mistake occurs if a lower-ranked player beats a higher-ranked player. This problem is NP-hard: it is the minimum feedback arc set problem in tournaments.

A Heuristic
Rank the players according to their number of wins, breaking ties arbitrarily. This algorithm provides a 5-approximation for the feedback arc set problem (Coppersmith, Fleischer, and Rudra, SODA 2006).
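A minimal sketch of this sort-by-wins heuristic; the `beats` callable (wrapping the learned binary classifier) is an assumed interface, not from the slides.

```python
def rank_by_wins(items, beats):
    """Rank tournament players by number of wins, ties broken arbitrarily.

    `beats(u, v)` should return True if u beats v, e.g. by applying the
    learned preference classifier to the pair <u, v>. Runs the full
    round-robin tournament (O(n^2) comparisons) and sorts by win count."""
    n = len(items)
    wins = [0] * n
    for i in range(n):
        for j in range(n):
            if i != j and beats(items[i], items[j]):
                wins[i] += 1
    order = sorted(range(n), key=lambda i: wins[i], reverse=True)
    return [items[i] for i in order]

# Toy usage: scalar items, "u beats v" when u > v recovers descending order.
print(rank_by_wins([3, 1, 4, 1, 5], lambda u, v: u > v))  # [5, 4, 3, 1, 1]
```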

Ranking Functions from Preference Data
Preference data S = {⟨x_i, y_i⟩ : x_i ≻ y_i, i = 1, ..., N}. Learn a function h ∈ H that matches the set of preferences as much as possible, i.e.,
h(x_i) ≥ h(y_i) if x_i ≻ y_i, i = 1, ..., N.
Objective function:
R(h) = (1/2) Σ_{i=1}^N (max{0, h(y_i) − h(x_i)})².

Interpretation: 1) if h matches the given preference, i.e., h(x_i) ≥ h(y_i), then h incurs no cost; 2) otherwise, the cost is (h(y_i) − h(x_i))². R(h) is a proxy for the number of mistakes made by h.

Functional Gradient Boosting Applied to R(h)
Consider R(h) = (1/2) Σ_{i=1}^N (max{0, h(y_i) − h(x_i)})², treat the values h(x_i), h(y_i), i = 1, ..., N, as the unknowns, and compute the gradient of R(h). The components of the negative gradient corresponding to h(x_i) and h(y_i), respectively, are
max{0, h(y_i) − h(x_i)} and −max{0, h(y_i) − h(x_i)}.
For a matched pair the components are zero; otherwise they are h(y_i) − h(x_i) and h(x_i) − h(y_i).
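Spelling out the differentiation step behind those components (a short derivation, assuming each x_i and y_i appears in only one pair):

```latex
% Gradient of R(h) with respect to the unknown function values.
\[
R(h) = \frac{1}{2}\sum_{i=1}^{N}\bigl(\max\{0,\,h(y_i)-h(x_i)\}\bigr)^2,
\qquad
\frac{\partial R}{\partial h(x_i)} = -\max\{0,\,h(y_i)-h(x_i)\},
\qquad
\frac{\partial R}{\partial h(y_i)} = \max\{0,\,h(y_i)-h(x_i)\}.
\]
% The negative gradient flips these signs: it pushes h(x_i) up and
% h(y_i) down by h(y_i) - h(x_i) on mis-ordered pairs, and is zero on
% pairs that h already orders correctly.
```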

With step size α along the negative gradient, we have new function values at x_i and y_i, respectively:
(x_i, h(x_i) + α(h(y_i) − h(x_i))), (y_i, h(y_i) + α(h(x_i) − h(y_i))).
If we set α = 1, we have (x_i, h(y_i)) and (y_i, h(x_i)); i.e., we just swap the function values at x_i and y_i.
One complication: if x_i appears in multiple preference pairs, we may have contradicting requirements for the new function value at x_i. One solution: let the data tell you what to do.

Algorithm (GBrank). Start with an initial guess h_0; for k = 1, 2, ...:
1) using h_{k−1} as the current approximation of h, separate S into two disjoint sets,
S+ = {⟨x_i, y_i⟩ ∈ S : h_{k−1}(x_i) ≥ h_{k−1}(y_i)} and
S− = {⟨x_i, y_i⟩ ∈ S : h_{k−1}(x_i) < h_{k−1}(y_i)};
2) fit g_k(x) to the following training data:
{(x_i, h_{k−1}(y_i)), (y_i, h_{k−1}(x_i)) : (x_i, y_i) ∈ S−};
3) form h_k(x) = h_{k−1}(x) + μ g_k(x).
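A compact sketch of GBrank under stated assumptions: h_0 = 0, regression trees as the base learners, and illustrative values for the shrinkage μ and tree depth (none of these are fixed by the slides).

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gbrank(pairs, n_iters=50, mu=0.1, max_depth=3):
    """GBrank sketch: at each step, fit a regression tree g_k to the
    swapped function values of the mis-ordered pairs (the set S-),
    then update h_k = h_{k-1} + mu * g_k."""
    trees = []

    def h(X):
        out = np.zeros(len(X))  # initial guess h_0 = 0
        for tree in trees:
            out += mu * tree.predict(X)
        return out

    for _ in range(n_iters):
        X_fit, targets = [], []
        for x, y in pairs:
            hx, hy = h(np.array([x, y]))
            if hx < hy:  # pair lies in S-: h mis-orders it
                X_fit.extend([x, y])
                targets.extend([hy, hx])  # swap the function values
        if not X_fit:  # every pair correctly ordered: done
            break
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(np.array(X_fit), np.array(targets))
        trees.append(tree)
    return h
```

Usage: `h = gbrank(pairs)` with `pairs` a list of (x_i, y_i) feature-vector tuples, x_i preferred; new items are then ranked by sorting on `h`.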

Some Experimental Results
A commercial search engine: 4,372 queries and 115,278 query-document pairs. A 0-4 grade is assigned to each query-document pair.
Converting labeled data to preference data: for a query q and two documents d_x and d_y, let the feature vectors for (q, d_x) and (q, d_y) be x and y. If d_x has a higher grade than d_y, we include the preference x ≻ y, while if d_y has a higher grade than d_x, we include the preference y ≻ x.
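A sketch of this labeled-to-preference conversion; the `query_docs` input layout is an assumed representation, not from the slides.

```python
from itertools import combinations

def grades_to_pairs(query_docs):
    """Convert graded query-document data into preference pairs.

    `query_docs` maps each query to a list of (feature_vector, grade)
    tuples, grades on the 0-4 scale. Only documents under the same
    query are compared; equal grades produce no preference."""
    pairs = []
    for docs in query_docs.values():
        for (x, gx), (y, gy) in combinations(docs, 2):
            if gx > gy:
                pairs.append((x, y))  # x preferred over y
            elif gy > gx:
                pairs.append((y, x))  # y preferred over x
    return pairs
```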

Evaluation Metrics
Number of contradicting pairs.
Precision at K%: for two documents x and y (w.r.t. the same query), it is reasonable to assume that x and y are easy to compare if |h(x) − h(y)| is large, and that x and y have about the same rank if h(x) is close to h(y). Sort all document pairs ⟨x, y⟩ according to |h(x) − h(y)|; precision at K% is the fraction of non-contradicting pairs in the top K% of the sorted list.
Discounted Cumulative Gain (DCG):
DCG_N = Σ_{i=1}^N G_i / log₂(i + 1).
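A small sketch of the DCG formula; the mapping from the 0-4 grade to the gain G_i is left open on the slide, so the `2**grade - 1` choice in the usage line is an assumption.

```python
import numpy as np

def dcg_at_n(gains, n=5):
    """DCG_N = sum_{i=1..N} G_i / log2(i + 1), over the top n results.

    `gains` lists the gain G_i of each document in ranked order."""
    gains = np.asarray(gains, dtype=float)[:n]
    discounts = np.log2(np.arange(1, len(gains) + 1) + 1)  # log2(i + 1)
    return float(np.sum(gains / discounts))

grades = [3, 2, 3, 0, 1]                      # 0-4 grades in ranked order
print(dcg_at_n([2**g - 1 for g in grades]))   # DCG-5 with G_i = 2^grade - 1
```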

[Figures: number of contradicting pairs in training data vs. iterations; number of contradicting pairs in test data vs. iterations; and DCG-5 vs. iterations.]

[Figures: number of contradicting test pairs vs. training-data size, and DCG-5 vs. training-set size, for GBrank and GBT at 10%-100% of the training data used.]

[Figure: DCG-5 for GBRank, GBT, and RankSVM in 5-fold cross validation, by fold number.]

[Figure: number of contradicting pairs for GBRank, GBT, and RankSVM in 5-fold cross validation, by fold number.]

Combined Labeled Data and Preference Data
Preference judgments S = {x_i ≻ y_i, i = 1, ..., N}. Additionally, there are also labeled data L = {(z_i, ℓ_i), i = 1, ..., n}, where z_i is the feature vector of an item and ℓ_i is the corresponding numerically coded label.

Objective Functions
Find a ranking function h to minimize
R(h, α, β) = (1/2) Σ_{i=1}^N (max{0, h(y_i) − h(x_i)})² + (1/2) Σ_{i=1}^n (α ℓ_i + β − h(z_i))².
Why α and β? The labels ℓ_i are fixed; it is not reasonable to require h(z_i) = ℓ_i, so α and β rescale and shift the labels onto the scale of h.
Optimization problem: {h*, α*, β*} = argmin_{h ∈ H, α ≥ 0, β} R(h, α, β).
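The labeled term is a linear least-squares problem in (α, β) for fixed h; here is a sketch of that fit, where clipping α to enforce α ≥ 0 is an illustrative choice not spelled out on the slides.

```python
import numpy as np

def fit_alpha_beta(labels, h_values):
    """Least-squares fit of (alpha, beta) minimizing
    (1/2) * sum_i (alpha * l_i + beta - h(z_i))^2 for the current h."""
    A = np.column_stack([labels, np.ones(len(labels))])
    (alpha, beta), *_ = np.linalg.lstsq(A, np.asarray(h_values, dtype=float),
                                        rcond=None)
    if alpha < 0:  # crude handling of the alpha >= 0 constraint
        alpha, beta = 0.0, float(np.mean(h_values))
    return float(alpha), float(beta)

# g_i = alpha * l_i + beta become the regression targets for labeled items.
alpha, beta = fit_alpha_beta([4, 3, 1, 0], [2.1, 1.6, 0.4, 0.1])
```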

Algorithm (combined Gradient Boosting Ranking, cgbrank). Start with an initial guess h_0; for m = 1, 2, ...:
1) compute α_m and β_m such that
{α_m, β_m} = argmin_{α, β} (1/2) Σ_{i=1}^n (α ℓ_i + β − h_{m−1}(z_i))²,
and let g_i^m = α_m ℓ_i + β_m, i = 1, ..., n;
2) using h_{m−1} as the current approximation of h, separate S into two disjoint sets,
S+ = {(x_i, y_i) ∈ S : h_{m−1}(x_i) ≥ h_{m−1}(y_i)}

and S− = {(x_i, y_i) ∈ S : h_{m−1}(x_i) < h_{m−1}(y_i)};
3) construct a training set for fitting g_m(x) by adding, for each (x_i, y_i) ∈ S−,
(x_i, h_{m−1}(y_i) + τ), (y_i, h_{m−1}(x_i) − τ),
together with {(z_i, g_i^m), i = 1, ..., n}. The fitting of g_m(x) is done using GBT with the above training set;
4) form h_m(x) = h_{m−1}(x) + μ g_m(x), where μ is a shrinkage factor.
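A sketch of the per-iteration training-set assembly in steps 2)-3); the margin value τ and the data layout are illustrative assumptions.

```python
import numpy as np

def cgbrank_training_set(h, pairs, labeled, alpha, beta, tau=0.1):
    """Build the regression set handed to GBT when fitting g_m.

    `h` is the current model h_{m-1} (callable on a feature matrix),
    `pairs` the preference data (x_i preferred over y_i), `labeled`
    the (z_i, l_i) data, and (alpha, beta) the scaling from step 1)."""
    X_fit, targets = [], []
    for x, y in pairs:
        hx, hy = h(np.array([x, y]))
        if hx < hy:  # pair in S-: mis-ordered by the current model
            X_fit.extend([x, y])
            targets.extend([hy + tau, hx - tau])  # swap, widened by tau
    for z, label in labeled:
        X_fit.append(z)
        targets.append(alpha * label + beta)  # target g_i^m
    return np.array(X_fit), np.array(targets)
```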