Web Search and Text Mining: Learning from Preference Data
Outline
- Two-stage algorithm: learning a preference function, then finding a total order that best agrees with it
- Learning ranking functions from preference data
- Learning ranking functions from combined labeled and preference data
Ranking Problem and Preference Judgments
Ranking problem: ranking a list of items according to some underlying criterion.
Preference judgments: statements that one item should be ranked higher than another.
Problem set-up:
1) learn a preference function from a set of preference judgments (preference data);
2) for a new list of items, apply the learned preference function;
3) find a total order of the items that best agrees with the preference function.
Preference Functions
We assume each item is represented by a feature vector x ∈ X.
A preference function is a map g : X × X → [0, 1].
Interpretation:
1) g(u, v) close to 1: u ranked higher than v;
2) g(u, v) close to 0: v ranked higher than u;
3) g(u, v) close to 1/2: no preference between u and v.
Learning a Preference Function
Given preference data S = {⟨x_i, y_i⟩ : x_i ≻ y_i, i = 1, ..., N},
we can turn it into a binary classification problem:
{(⟨x_i, y_i⟩, +1), (⟨y_i, x_i⟩, −1), i = 1, ..., N}.
Many choices of classifier: SVM, AdaBoost, etc.
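A minimal sketch of this reduction, under two assumptions not fixed by the slides: a pair ⟨u, v⟩ is represented by the feature difference u − v (a common choice), and a plain perceptron stands in for the SVM or AdaBoost classifier.

```python
# Sketch: reduce preference data to binary classification.
# Assumptions (not from the slides): pair <u, v> is encoded as the
# feature difference u - v; the classifier is a plain perceptron
# standing in for SVM / AdaBoost.

def make_classification_data(prefs):
    """prefs: list of (x, y) with x preferred over y (feature tuples)."""
    data = []
    for x, y in prefs:
        diff = tuple(a - b for a, b in zip(x, y))
        data.append((diff, +1))                     # <x, y> labeled +1
        data.append((tuple(-d for d in diff), -1))  # <y, x> labeled -1
    return data

def perceptron(data, epochs=20):
    """Train a linear classifier on the signed difference vectors."""
    w = [0.0] * len(data[0][0])
    for _ in range(epochs):
        for z, label in data:
            if label * sum(wi * zi for wi, zi in zip(w, z)) <= 0:
                w = [wi + label * zi for wi, zi in zip(w, z)]
    return w

def g(w, u, v):
    """Preference function: 1 means u above v, 0 means v above u."""
    s = sum(wi * (ui - vi) for wi, ui, vi in zip(w, u, v))
    return 1.0 if s > 0 else (0.0 if s < 0 else 0.5)
```

In practice any margin-based classifier can replace the perceptron; the reduction itself is unchanged.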
From Preference Function to Total Order
Given a new list of items U, we run the binary classifier on each pair ⟨u, v⟩ ∈ U × U. In effect, we run a tournament on U and use the binary classifier to determine the outcome of each match between players u and v.
Problem: find a total order (linear order) on U that agrees with the tournament results, for example one that minimizes the number of mistakes. A mistake occurs when a lower-ranked player beats a higher-ranked player.
This is NP-hard: it is the minimum feedback arc set problem in tournaments.
A Heuristic
Rank the players by their number of wins, breaking ties arbitrarily. This algorithm is a 5-approximation for the minimum feedback arc set problem in tournaments (Coppersmith, Fleischer, and Rudra, SODA 2006).
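The win-count heuristic and the mistake count it approximates can be sketched directly (here `beats` is any callable deciding each match, standing in for the learned classifier):

```python
# Sketch of the win-count heuristic: play all matches, rank players by
# number of wins, and count the mistakes of a total order against the
# tournament (a mistake: a lower-ranked player beats a higher-ranked one).

def rank_by_wins(players, beats):
    """beats(u, v) -> True if u wins the match against v."""
    wins = {u: 0 for u in players}
    for i, u in enumerate(players):
        for v in players[i + 1:]:
            if beats(u, v):
                wins[u] += 1
            else:
                wins[v] += 1
    # Sort by wins, descending; ties broken arbitrarily (here: input order).
    return sorted(players, key=lambda u: -wins[u])

def count_mistakes(order, beats):
    """Number of pairs where the lower-ranked player beats the higher-ranked."""
    return sum(beats(order[j], order[i])
               for i in range(len(order))
               for j in range(i + 1, len(order)))
```

On a transitive tournament this recovers the true order with zero mistakes; on a cyclic tournament at least one mistake is unavoidable, which is the feedback arc set cost.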
Ranking Functions from Preference Data
Preference data S = {⟨x_i, y_i⟩ : x_i ≻ y_i, i = 1, ..., N}.
Learn a function h ∈ H that matches the set of preferences as much as possible, i.e.,
h(x_i) ≥ h(y_i) if x_i ≻ y_i, i = 1, ..., N.
Objective function:
R(h) = (1/2) Σ_{i=1}^{N} (max{0, h(y_i) − h(x_i)})².
Interpretation:
1) if h matches the given preference, i.e., h(x_i) ≥ h(y_i), then h incurs no cost;
2) otherwise, the cost is (h(y_i) − h(x_i))².
R(h) is a smooth proxy for the number of mistakes made by h.
Functional Gradient Boosting Applied to R(h)
Consider R(h) = (1/2) Σ_{i=1}^{N} (max{0, h(y_i) − h(x_i)})², treating the function values h(x_i), h(y_i), i = 1, ..., N, as the unknowns, and compute the gradient of R(h). The components of the negative gradient corresponding to h(x_i) and h(y_i) are, respectively,
max{0, h(y_i) − h(x_i)},  −max{0, h(y_i) − h(x_i)}.
For a matched pair, both components are zero; otherwise they are
h(y_i) − h(x_i),  h(x_i) − h(y_i).
With step size α along the negative gradient, the new function values at x_i and y_i are, respectively,
(x_i, h(x_i) + α(h(y_i) − h(x_i))),  (y_i, h(y_i) + α(h(x_i) − h(y_i))).
If we set α = 1, we get (x_i, h(y_i)) and (y_i, h(x_i)), i.e., we simply swap the function values at x_i and y_i.
One complication: if x_i appears in multiple preference pairs, we may have contradicting requirements for the new function value at x_i. One solution: let the data tell you what to do.
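A small numeric check of this derivation (an illustration only, not part of the algorithm): take one negative-gradient step on a single pair and observe that α = 1 swaps the two values of a contradicting pair and leaves a matched pair untouched.

```python
# Numeric check of the gradient-step derivation for a single pair,
# with current values hx = h(x_i), hy = h(y_i) and objective
# R = 1/2 * max(0, hy - hx)**2.

def gradient_step(hx, hy, alpha):
    """One step of size alpha along the negative gradient."""
    viol = max(0.0, hy - hx)  # negative-gradient component for h(x_i)
    return hx + alpha * viol, hy - alpha * viol

# Contradicting pair (hx < hy): alpha = 1 returns the swapped values.
# Matched pair (hx >= hy): the step is zero and the values are unchanged.
```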
Algorithm (GBrank). Start with an initial guess h_0; for k = 1, 2, ...:
1) using h_{k−1} as the current approximation of h, separate S into two disjoint sets,
S+ = {⟨x_i, y_i⟩ ∈ S : h_{k−1}(x_i) ≥ h_{k−1}(y_i)}
and
S− = {⟨x_i, y_i⟩ ∈ S : h_{k−1}(x_i) < h_{k−1}(y_i)};
2) fit g_k(x) to the following training data:
{(x_i, h_{k−1}(y_i)), (y_i, h_{k−1}(x_i)) : (x_i, y_i) ∈ S−};
3) form h_k(x) = h_{k−1}(x) + µ g_k(x).
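A toy sketch of one possible reading of GBrank, with several simplifications that are assumptions rather than part of the algorithm: items are hashable ids; the base learner g_k is an averaging lookup table standing in for gradient-boosted trees; g_k is fitted to increments (swapped target minus current value) so the additive update h_k = h_{k−1} + µ g_k moves values toward the swapped targets; and a margin τ (which appears in the combined algorithm later) keeps corrected pairs strictly separated.

```python
# Toy GBrank sketch over a finite item set. Simplifying assumptions:
# items are ids; the base learner is an averaging lookup table instead
# of regression trees; targets are increments toward the swapped
# values, offset by a margin tau.
from collections import defaultdict

def gbrank(prefs, h0, rounds=100, mu=1.0, tau=1.0):
    """prefs: list of (x, y) meaning x should rank above y; h0: initial scores."""
    h = dict(h0)
    for _ in range(rounds):
        # 1) collect the contradicting pairs S-
        s_minus = [(x, y) for x, y in prefs if h[x] < h[y]]
        if not s_minus:
            break
        # 2) fit g_k: average increment toward the swapped, margined targets
        incr = defaultdict(list)
        for x, y in s_minus:
            incr[x].append(h[y] + tau - h[x])
            incr[y].append(h[x] - tau - h[y])
        # 3) h_k = h_{k-1} + mu * g_k (g_k is zero off its training points)
        for u, deltas in incr.items():
            h[u] += mu * sum(deltas) / len(deltas)
    return h

scores = gbrank([("a", "b"), ("b", "c")], {"a": 0.0, "b": 1.0, "c": 2.0})
```

Starting from a reversed initial guess, the sketch recovers the preferred order a above b above c.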
Some Experimental Results
Data from a commercial search engine: 4,372 queries and 115,278 query-document pairs, with a 0-4 grade assigned to each query-document pair.
From labeled data to preference data: for a query q and two documents d_x and d_y, let the feature vectors for (q, d_x) and (q, d_y) be x and y. If d_x has a higher grade than d_y, we include the preference x ≻ y; if d_y has a higher grade than d_x, we include the preference y ≻ x.
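The grade-to-preference conversion above can be sketched as follows (document ids and the `(query, doc, grade)` record layout are illustrative assumptions):

```python
# Sketch of the conversion from graded labels to preference pairs:
# within each query, every pair of documents with different grades
# yields one preference (higher grade preferred); equal grades yield
# no preference.
from itertools import combinations

def grades_to_preferences(judged):
    """judged: list of (query, doc, grade); returns (query, better, worse) triples."""
    by_query = {}
    for q, d, grade in judged:
        by_query.setdefault(q, []).append((d, grade))
    prefs = []
    for q, docs in by_query.items():
        for (d1, g1), (d2, g2) in combinations(docs, 2):
            if g1 > g2:
                prefs.append((q, d1, d2))
            elif g2 > g1:
                prefs.append((q, d2, d1))
    return prefs
```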
Evaluation Metrics
Number of contradicting pairs.
Precision at K%: for two documents x and y (w.r.t. the same query), it is reasonable to assume that x and y are easy to compare if h(x) − h(y) is large, and that x and y have about the same rank if h(x) is close to h(y). Sort all document pairs ⟨x, y⟩ by h(x) − h(y); precision at K% is the fraction of non-contradicting pairs in the top K% of the sorted list.
Discounted Cumulative Gain (DCG): DCG_N = Σ_{i=1}^{N} G_i / log₂(i + 1).
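The DCG formula, as code. The slides leave the gain G_i abstract, so plain per-position gains are passed in; a common choice for a 0-4 grade (an assumption here, not stated on the slide) is G_i = 2^grade − 1.

```python
# DCG_N = sum_{i=1}^{N} G_i / log2(i + 1), with gains listed in rank order.
from math import log2

def dcg(gains, N=None):
    """Discounted cumulative gain over the top N positions."""
    if N is None:
        N = len(gains)
    return sum(g / log2(i + 1) for i, g in enumerate(gains[:N], start=1))
```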
[Figures: number of contradicting pairs in training data vs. iterations; number of contradicting pairs in test data vs. iterations; DCG-5 vs. iterations.]
[Figures: number of contradicting test pairs vs. training data size, and DCG-5 vs. training set size (10%-100% of training data used), comparing GBrank and GBT.]
[Figure: DCG-5 for GBRank, GBT, and RankSVM in 5-fold cross validation.]
[Figure: number of contradicting pairs for GBRank, GBT, and RankSVM in 5-fold cross validation.]
Combined Labeled Data and Preference Data
Preference judgments: S = {x_i ≻ y_i, i = 1, ..., N}. Additionally, there are labeled data L = {(z_i, l_i), i = 1, ..., n}, where z_i is the feature vector of an item and l_i is the corresponding numerically coded label.
Objective Functions
Find a ranking function h minimizing
R(h, α, β) = (1/2) Σ_{i=1}^{N} (max{0, h(y_i) − h(x_i)})² + (1/2) Σ_{i=1}^{n} (α l_i + β − h(z_i))².
Why α and β? The labels l_i are fixed, so it is not reasonable to require h(z_i) = l_i exactly; α and β allow a learned linear rescaling of the labels.
Optimization problem: {h*, α*, β*} = argmin_{h ∈ H, α ≥ 0, β} R(h, α, β).
Algorithm (cGBrank, combined gradient boosting ranking). Start with an initial guess h_0; for m = 1, 2, ...:
1) compute α_m and β_m such that
{α_m, β_m} = argmin_{α, β} (1/2) Σ_{i=1}^{n} (α l_i + β − h_{m−1}(z_i))²,
and let g_i^m = α_m l_i + β_m, i = 1, ..., n;
2) using h_{m−1} as the current approximation of h, separate S into two disjoint sets,
S+ = {(x_i, y_i) ∈ S : h_{m−1}(x_i) ≥ h_{m−1}(y_i)}
and
S− = {(x_i, y_i) ∈ S : h_{m−1}(x_i) < h_{m−1}(y_i)};
3) construct a training set for fitting g_m(x) by adding, for each (x_i, y_i) ∈ S−,
(x_i, h_{m−1}(y_i) + τ), (y_i, h_{m−1}(x_i) − τ),
together with {(z_i, g_i^m), i = 1, ..., n}. The fitting of g_m(x) is done with GBT on this training set;
4) form h_m(x) = h_{m−1}(x) + µ g_m(x), where µ is a shrinkage factor.
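Step 1 of the combined algorithm in isolation: the optimal (α_m, β_m) is an ordinary least-squares fit of the current scores h_{m−1}(z_i) against the labels l_i, which has a closed form (simple linear regression). The α ≥ 0 constraint is handled by clamping, an assumption about how the constraint would be enforced.

```python
# Closed-form fit of (alpha_m, beta_m) minimizing
# sum_i (alpha * l_i + beta - h(z_i))**2, i.e. simple linear
# regression of the scores on the labels. The slides constrain
# alpha >= 0; clamping is one simple way to enforce that.

def fit_alpha_beta(labels, scores):
    """labels: l_i; scores: h(z_i). Returns (alpha, beta)."""
    n = len(labels)
    mean_l = sum(labels) / n
    mean_h = sum(scores) / n
    var_l = sum((l - mean_l) ** 2 for l in labels)
    cov = sum((l - mean_l) * (h - mean_h) for l, h in zip(labels, scores))
    alpha = cov / var_l if var_l else 0.0
    alpha = max(alpha, 0.0)          # enforce the alpha >= 0 constraint
    beta = mean_h - alpha * mean_l   # optimal intercept given alpha
    return alpha, beta
```

The resulting targets g_i^m = α_m l_i + β_m rescale the labels onto the current range of the ranking function, which is exactly why the raw labels l_i are not used directly.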