Unsupervised Rank Aggregation with Distance-Based Models


1 Unsupervised Rank Aggregation with Distance-Based Models
Kevin Small, Tufts University
Collaborators: Alex Klementiev (Johns Hopkins University), Ivan Titov (Saarland University), Dan Roth (University of Illinois)

2 Motivation
Consider a panel of judges, each of which independently generates (partial) rankings over objects (e.g. the results for a query) to the best of their ability.
- The need to meaningfully aggregate their output is a fundamental problem.
- Applications are plentiful in Information Retrieval and Natural Language Processing.

3 Multilingual Named Entity Discovery
Named Entity Discovery [Klementiev & Roth, ACL 06]: given a bilingual corpus, one side of which is annotated with Named Entities, find their counterparts in the other side.
(Figure: candidate rankings r1, r2, r3, … for the name "guimaraes".)
- NEs are often transliterated: rank according to a transliteration model score.
- NEs tend to co-occur across languages: rank according to temporal alignment.
- NEs tend to co-occur in similar contexts: rank according to contextual similarity.
- NEs tend to co-occur in similar topics: rank according to topic similarity.
- etc.

4 Overview of Our Approach
- We propose a formal framework for unsupervised structured label aggregation.
- Judges independently generate a (partial) labeling, attempting to reproduce the true underlying label based on their expertise in a given domain.
- We derive an EM-based algorithm treating the votes of individual judges and the true label as the observed and unobserved data, respectively.
- Intuition: experts in a given domain are better at generating votes close to the true ranking and will tend to agree with each other, while the non-experts will not.
- We instantiate the framework for combining permutations, combining top-k lists, and combining dependency parses.

5 Notation
- Permutation π over n objects x_1 … x_n; e = (1, 2, ..., n) is the identity permutation.
- S_n is the set of all n! permutations.
- Distance d : S_n × S_n → R+ between permutations, e.g. Kendall's tau distance d_K: the minimum number of adjacent transpositions needed to transform one permutation into the other.
- d is assumed to be invariant to arbitrary re-labeling of the n objects (right-invariance), so d(π, σ) = d(e, πσ⁻¹) = D(πσ⁻¹). If π is a random variable, so is D = D(π).
(The slide illustrates this with small worked examples, e.g. d_K between two example rankings equals 3.)
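To make the distance concrete, here is a small Python sketch (our illustration, not from the slides) that computes Kendall's tau distance by counting the pairs of objects the two permutations order differently, which equals the minimum number of adjacent transpositions:

```python
from itertools import combinations

def kendall_tau(pi, sigma):
    """Kendall's tau distance: the number of object pairs that pi and sigma
    order differently (equivalently, the minimum number of adjacent
    transpositions needed to turn one permutation into the other)."""
    pos_pi = {x: i for i, x in enumerate(pi)}        # rank of each object in pi
    pos_sigma = {x: i for i, x in enumerate(sigma)}  # rank of each object in sigma
    return sum(
        (pos_pi[x] < pos_pi[y]) != (pos_sigma[x] < pos_sigma[y])
        for x, y in combinations(pi, 2)
    )

assert kendall_tau((1, 2, 3, 4), (1, 2, 3, 4)) == 0   # identical rankings
assert kendall_tau((1, 2, 3), (3, 2, 1)) == 3         # complete reversal
```

Note the distance is right-invariant: relabeling the objects consistently in both permutations leaves it unchanged.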

6 Background: Mallows Models
p(π | θ, σ) = exp(θ · d(π, σ)) / Z(θ, σ),
where θ ∈ R, θ ≤ 0 is the dispersion parameter and σ ∈ S_n is the location parameter.
- The distribution is uniform when θ = 0 and peaky (concentrated around σ) when |θ| is large.
- Z(θ, σ) is expensive to compute in general; since d(·,·) is right-invariant, Z(θ, σ) does not depend on σ.
- If D can be decomposed as D(π) = Σ_{i=1}^{m} V_i(π) where the V_i are independent r.v.'s, then E_θ(D) may be efficient to compute [Fligner and Verducci 86].
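As a sanity check on the definition, a minimal sketch (ours, feasible only for small n) that evaluates the Mallows density with the normalizer Z computed by brute-force enumeration over S_n; right-invariance means Z can be computed with respect to the identity permutation:

```python
import math
from itertools import combinations, permutations

def kendall_tau(pi, sigma):
    pos_p = {x: i for i, x in enumerate(pi)}
    pos_s = {x: i for i, x in enumerate(sigma)}
    return sum((pos_p[x] < pos_p[y]) != (pos_s[x] < pos_s[y])
               for x, y in combinations(pi, 2))

def mallows_prob(pi, sigma, theta):
    """p(pi | theta, sigma) = exp(theta * d(pi, sigma)) / Z(theta), theta <= 0.
    Z is summed over all n! permutations, so this is only feasible for small n."""
    items = tuple(sorted(sigma))
    z = sum(math.exp(theta * kendall_tau(p, items)) for p in permutations(items))
    return math.exp(theta * kendall_tau(pi, sigma)) / z

print(mallows_prob((1, 2, 3, 4), (1, 2, 3, 4), theta=0.0))   # uniform: 1/24
print(mallows_prob((1, 2, 3, 4), (1, 2, 3, 4), theta=-2.0))  # mass concentrates on sigma
```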

7 Generative Story for Aggregation
- Generate the true ranking π according to a prior p(π).
- Draw σ_1, …, σ_K independently from K Mallows models p(σ_i | θ_i, π), all sharing the same location parameter π:
p(π, σ | θ) = p(π) ∏_{i=1}^{K} p(σ_i | θ_i, π)

8 Background: Extended Mallows Models
The associated conditional model (when the votes σ ∈ S_n^K of K judges are available) was proposed in [Lebanon and Lafferty 02]:
p(π | σ, θ) = p(π) · exp(Σ_{i=1}^{K} θ_i · d(π, σ_i)) / Z(θ, σ)
- The free parameters θ ∈ R^K, θ ≤ 0 represent the degree of expertise of the individual judges.
- It is straightforward to generalize both models to partial rankings by constructing appropriate distance functions.
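For small n the conditional can be handled by brute force, which is enough to see how the model aggregates votes. A hedged sketch (our illustration, assuming a uniform prior p(π)): score every candidate π and return the most likely aggregate ranking.

```python
import math
from itertools import combinations, permutations

def kendall_tau(pi, sigma):
    pos_p = {x: i for i, x in enumerate(pi)}
    pos_s = {x: i for i, x in enumerate(sigma)}
    return sum((pos_p[x] < pos_p[y]) != (pos_s[x] < pos_s[y])
               for x, y in combinations(pi, 2))

def extended_mallows_map(votes, thetas):
    """argmax_pi of p(pi | sigma, theta), which is proportional to
    exp(sum_i theta_i * d(pi, sigma_i)); uniform prior, brute force over S_n."""
    items = tuple(sorted(votes[0]))
    def log_score(pi):
        return sum(t * kendall_tau(pi, s) for t, s in zip(thetas, votes))
    return max(permutations(items), key=log_score)

# Two reliable judges (theta = -2) and one near-random judge (theta close to 0):
votes = [(1, 2, 3, 4), (1, 2, 4, 3), (4, 3, 2, 1)]
print(extended_mallows_map(votes, [-2.0, -2.0, -0.1]))  # follows the reliable judges
```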

9 Outline
Introduction and background:
- Motivation
- Problem statement
- Overview of our approach
- Background: Mallows models / extended Mallows models
Our contribution:
- Unsupervised learning and inference
- Incorporating domain-specific expertise
- Instantiations of the framework: combining permutations / top-k lists
- Experiments
- Dependency parsing
- Conclusions

10 Our Approach [ICML 2008]
- We propose a formal framework for unsupervised rank aggregation based on the extended Mallows model formalism.
- We derive an EM-based algorithm to estimate the model parameters.
- Observed data: the votes of the individual judges. Unobserved data: the true ranking.
(Figure: votes σ_i^(q) of judges i = 1, …, K over queries q = 1, …, Q.)

11 Learning
Denoting θ′ the value of the parameters from the previous iteration, the M-step for the i-th ranker is:
E_{θ_i}(D) = (1/Q) Σ_{j=1}^{Q} Σ_{π^(j) ∈ S_n} d(π^(j), σ_i^(j)) · p(π^(j) | σ^(j), θ′)
- LHS: the expected distance under the i-th ranker's Mallows model; in general > n! computations.
- RHS: the average distance between the votes of the i-th ranker and the marginal of the unobserved data π^(1..Q); naively this requires enumerating S_n for each of the Q queries.
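For intuition, a brute-force sketch of the RHS (our illustration, small n only, uniform prior over π): enumerate S_n to form the posterior for each query and average the posterior-expected distance to the i-th ranker's vote.

```python
import math
from itertools import combinations, permutations

def kendall_tau(pi, sigma):
    pos_p = {x: i for i, x in enumerate(pi)}
    pos_s = {x: i for i, x in enumerate(sigma)}
    return sum((pos_p[x] < pos_p[y]) != (pos_s[x] < pos_s[y])
               for x, y in combinations(pi, 2))

def mstep_rhs(all_votes, thetas, i):
    """(1/Q) * sum_j sum_pi d(pi, sigma_i^(j)) * p(pi | sigma^(j), theta'),
    with the posterior p(pi | sigma^(j), theta') computed by enumerating S_n."""
    total = 0.0
    for votes in all_votes:                        # votes = (sigma_1^(j), ..., sigma_K^(j))
        items = tuple(sorted(votes[0]))
        w = {p: math.exp(sum(t * kendall_tau(p, s) for t, s in zip(thetas, votes)))
             for p in permutations(items)}
        z = sum(w.values())
        total += sum(wp / z * kendall_tau(p, votes[i]) for p, wp in w.items())
    return total / len(all_votes)

all_votes = [((1, 2, 3), (3, 2, 1)), ((1, 3, 2), (2, 3, 1))]   # Q = 2, K = 2
print(mstep_rhs(all_votes, thetas=[-1.0, -0.5], i=0))
```

The M-step then sets θ_i so that the model's expected distance E_{θ_i}(D) matches this value (see slide 17 for the closed form).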

12 Learning and Inference
Learning (estimating θ): for the K constituent rankers, repeat:
- Estimate the RHS given the current parameter values: sample with Metropolis-Hastings (the sampler depends on the structure type) or use heuristics (more about this later).
- Solve the LHS to update θ: efficient estimation can be done for particular types of distance functions.
Inference (computing the most likely ranking): sample with Metropolis-Hastings or use heuristics.

13 Domain-specific expertise? [IJCAI 2009]
- Relative expertise may not stay the same: it may depend on the type of objects or on the type of query.
- Ranked supervised data for estimating the judges' expertise is typically very expensive to obtain, especially for multiple types.

14 Mallows Models with Domain-Specific Expertise
The associated conditional model (when the votes σ ∈ S_n^K of K judges are available) can be derived:
p(π, t | σ, θ, α) = α_t · exp(Σ_{i=1}^{K} θ_{t,i} · d(π, σ_i)) / Z(θ_t, σ)
- The free parameters θ ∈ R^{T×K}, θ ≤ 0 represent the degree of expertise of the individual judges for each of the T types; α ∈ R^T are the mixture weights.
- Note: it is straightforward to generalize these models to other structured labels (e.g. partial rankings) by constructing appropriate distance functions.

15 Learning
For each ranker i and type t:
(1) α_t = (1/Q) Σ_{j=1}^{Q} Σ_{π^(j) ∈ S_n} p(π^(j), t | σ^(j), θ′, α′)
(2) E_{θ_{t,i}}(D) = (1 / (α_t Q)) Σ_{j=1}^{Q} Σ_{π^(j) ∈ S_n} d(π^(j), σ_i^(j)) · p(π^(j), t | σ^(j), θ′, α′)
Estimate the sums on the right-hand sides given the current parameter values θ′ and α′; this updates α_t directly, and solving (2) updates θ_{t,i}. Repeat.

16 Instantiating the Framework
We have not yet committed to a particular type of structure. In order to instantiate the framework:
- Design a distance function appropriate for the setting; if the function is right-invariant and decomposable, the [LHS] estimation can be done quickly.
- Design a sampling procedure for learning [RHS] and inference.

17 Case 1: Combining Permutations [LHS]
Kendall's tau distance D_K is the minimum number of adjacent transpositions needed to transform one permutation into another. It can be decomposed into a sum of independent random variables:
D_K(π) = Σ_{i=1}^{n−1} V_i(π),  where V_i(π) = Σ_{j>i} I(π⁻¹(i) > π⁻¹(j)),
and the expected value can be shown to be:
E_θ(D_K) = n·e^θ / (1 − e^θ) − Σ_{j=1}^{n} j·e^{jθ} / (1 − e^{jθ})
E_θ(D_K) is monotonic (decreasing in |θ|), so the θ matching the RHS can be found quickly with a line search.
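Since the M-step reduces to matching E_θ(D_K) to the average distance computed on the RHS, a one-dimensional search suffices. A hedged sketch (ours, assuming the closed form above) using bisection:

```python
import math

def expected_kendall(theta, n):
    """E_theta(D_K) = n*e^t/(1-e^t) - sum_{j=1..n} j*e^{j*t}/(1-e^{j*t}), theta < 0.
    It increases from 0 (theta -> -inf) to n(n-1)/4 (theta -> 0, the uniform case)."""
    e = math.exp(theta)
    return (n * e / (1.0 - e)
            - sum(j * math.exp(j * theta) / (1.0 - math.exp(j * theta))
                  for j in range(1, n + 1)))

def solve_theta(target, n, lo=-50.0, hi=-1e-6, iters=80):
    """Line search (bisection) for the theta <= 0 whose expected distance
    matches the target average distance computed on the RHS of the M-step."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if expected_kendall(mid, n) < target:
            lo = mid          # expected distance too small: move theta toward 0
        else:
            hi = mid
    return 0.5 * (lo + hi)

theta = solve_theta(target=3.0, n=10)       # M-step update for one ranker
print(theta, expected_kendall(theta, 10))   # expected distance is ~3.0
```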

18 Case 1: Combining Permutations [RHS]
Sampling from the base chain of random transpositions:
- Start with a random permutation.
- If the chain is at π, randomly transpose two objects to form π′.
- If a = p(π′ | θ, σ) / p(π | θ, σ) ≥ 1, the chain moves to π′; otherwise, it moves to π′ with probability a.
- Note that the distance can be computed incrementally, i.e. by adding the change due to a single transposition.
- Convergence in O(n log n) steps if d is Cayley's distance [Diaconis 98], and likely similar for some other distances; there are no convergence results for the general case, but it works well in practice.
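A compact sketch of this sampler (our illustration; the normalizer cancels in the acceptance ratio, so only unnormalized probabilities are needed, and for simplicity the distance is recomputed from scratch rather than incrementally):

```python
import math
import random
from itertools import combinations

def kendall_tau(pi, sigma):
    pos_p = {x: i for i, x in enumerate(pi)}
    pos_s = {x: i for i, x in enumerate(sigma)}
    return sum((pos_p[x] < pos_p[y]) != (pos_s[x] < pos_s[y])
               for x, y in combinations(pi, 2))

def mh_sample(votes, thetas, steps=5000, seed=0):
    """Metropolis-Hastings over S_n with a random-transposition proposal,
    targeting p(pi | sigma, theta) proportional to exp(sum_i theta_i * d(pi, sigma_i))."""
    rng = random.Random(seed)
    pi = list(votes[0])
    rng.shuffle(pi)                                   # random starting permutation

    def log_p(p):
        return sum(t * kendall_tau(p, s) for t, s in zip(thetas, votes))

    cur = log_p(tuple(pi))
    for _ in range(steps):
        i, j = rng.sample(range(len(pi)), 2)          # propose transposing two objects
        pi[i], pi[j] = pi[j], pi[i]
        new = log_p(tuple(pi))
        if new >= cur or rng.random() < math.exp(new - cur):
            cur = new                                 # accept the move
        else:
            pi[i], pi[j] = pi[j], pi[i]               # reject: undo the transposition
    return tuple(pi)

votes = [(1, 2, 3, 4, 5), (1, 2, 3, 5, 4), (2, 1, 3, 4, 5)]
print(mh_sample(votes, thetas=[-1.5, -1.0, -1.0]))
```

During learning one would average d(π, σ_i) over many such samples to estimate the RHS; here only the final state is returned for brevity.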

19 Case 1: Combining Permutations [RHS]
An alternative heuristic: a weighted Borda count, i.e. linearly combine the ranks of each object and argsort. The model parameters represent relative expertise, so it makes sense to weight ranker i as w_i = e^(−θ_i):
score(x) = w_1·r_1(x) + w_2·r_2(x) + … + w_K·r_K(x), then argsort.
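A sketch of this heuristic (assuming the weighting w_i = e^(−θ_i) as read above, so that judges with larger |θ_i| count more):

```python
import math

def weighted_borda(votes, thetas):
    """Weighted Borda count: combine the rank each judge assigns to every object,
    scaling judge i by w_i = exp(-theta_i), then argsort the combined scores
    (smallest combined rank first)."""
    weights = [math.exp(-t) for t in thetas]
    combined = {}
    for w, vote in zip(weights, votes):
        for rank, obj in enumerate(vote):
            combined[obj] = combined.get(obj, 0.0) + w * rank
    return tuple(sorted(combined, key=combined.get))

# One expert judge and two near-random judges that happen to agree with each other:
votes = [(1, 2, 3, 4), (4, 3, 2, 1), (4, 3, 2, 1)]
print(weighted_borda(votes, thetas=[-2.0, -0.1, -0.1]))  # -> (1, 2, 3, 4), following the expert
```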

20 Case 2: Combining Top-k [LHS]
We extend Kendall's tau to top-k lists. Comparing two top-k lists, let Z be the set of elements appearing in both (the z white boxes) and let r be the number of elements appearing in only one of them (the grey boxes), with r + z = k:
D̃_K(π̃) = Σ_{i: π̃⁻¹(i) ∉ Z} Ũ_i(π̃) + r(r+1)/2 + Σ_{i: π̃⁻¹(i) ∈ Z} Ṽ_i(π̃)
Construction: bring the grey boxes to the bottom, switch them with the objects at positions (k+1) and beyond, and compute Kendall's tau for the k elements.

21 Case 2: Combining Top-k [LHS & RHS]
The r.v.'s Ũ_i and Ṽ_i are independent, so we can use the same trick to show that the [LHS] is:
E_θ(D̃_K) = k·e^θ / (1 − e^θ) − Σ_{j=r+1}^{k} j·e^{jθ} / (1 − e^{jθ}) + r(r+1)/2 − r(z+1)·e^{θ(z+1)} / (1 − e^{θ(z+1)})
- This is also monotonic (decreasing in |θ|), so a line search can again be used.
- Both D̃_K and E_θ(D̃_K) reduce to the Kendall tau results when the same elements are ranked in both lists, i.e. r = 0.
- Sampling / heuristics for the [RHS] and inference are similar to the permutation case.

22 Outline
Introduction and background:
- Motivation
- Problem statement
- Overview of our approach
- Background: Mallows models / extended Mallows models
Our contribution:
- Unsupervised learning and inference
- Incorporating domain-specific expertise
- Instantiations of the framework: combining permutations / top-k lists
- Experiments
- Dependency parsing
- Conclusions

23 Exp. 1 Combining permutations
- Judges: K = 10 (Mallows models); objects: n = 30; Q = 10 sets of votes.
- Three ways of estimating the RHS are compared: sampling, the weighted (e^(−θ_i)) Borda heuristic, and using the true rankings.
(Plot: average D_K to the true permutation vs. EM iteration, for the sampling, weighted, and true variants.)

24 Exp. 2 Meta-search dispersion parameters
- Judges: K = 4 search engines (S1, S2, S3, S4); documents: top k = 100; queries: Q = 50.
- Define Mean Reciprocal Page Rank (MRPR): the mean reciprocal rank of the results page containing the correct document. Our model achieves an MRPR of 0.92.
- The learned model parameters correspond to ranker quality.

25 Exp. 3 Top-k rankings: robustness to noise
- Judges: K = 38 TREC-3 ad-hoc retrieval shared task participants; documents: top k = 100; queries: Q = 50.
- We replaced K_r ∈ [0, K] randomly chosen participants with random rankers.
- Baseline: rank objects according to the score
CombMNZ_rank(x, q) = N_x · Σ_{i=1}^{K} (k − r_i(x, q)),
where r_i(x, q) is the rank of x returned by participant i for query q, and N_x is the number of participants with x in their top-k.
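A sketch of this baseline as written on the slide (our own code; ranks are taken 1-based within each top-k list):

```python
def comb_mnz_rank(result_lists, k):
    """CombMNZ_rank(x) = N_x * sum_i (k - r_i(x)): each participant returning x
    contributes (k - rank of x), and the sum is multiplied by N_x, the number
    of participants with x in their top-k. Documents are sorted by this score."""
    points, count = {}, {}
    for ranking in result_lists:
        for rank, doc in enumerate(ranking[:k], start=1):
            points[doc] = points.get(doc, 0) + (k - rank)
            count[doc] = count.get(doc, 0) + 1
    return sorted(points, key=lambda d: count[d] * points[d], reverse=True)

runs = [["d1", "d2", "d3"], ["d2", "d1", "d4"], ["d2", "d5", "d1"]]
print(comb_mnz_rank(runs, k=3))  # d2 first: returned near the top by all three runs
```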

26 Exp. 3 Top-k rankings: robustness to noise
The model learns to discard the random rankers without supervision.
(Plot: precision vs. the number of random rankers K_r, comparing Aggregation and CombMNZ_rank at Top-10 and Top-30.)

27 Outline
Introduction and background:
- Motivation
- Problem statement
- Overview of our approach
- Background: Mallows models / extended Mallows models
Our contribution:
- Unsupervised learning and inference
- Incorporating domain-specific expertise
- Instantiations of the framework: combining permutations / top-k lists
- Experiments
- Dependency parsing
- Conclusions

28 Dependency Parses
(Figure: two dependency parses, v and y, of the sentence "Buyers stepped in to the futures pit", with labels such as SBJ, ROOT, ADV, AMOD, NMOD, PMOD.)
- Let v(i) denote a pair of head offset and label, i.e. a link.
- The labeled attachment score is link accuracy, trivially computed from the Hamming distance:
d_H(v, y) = Σ_{i=1}^{n} I(v(i) ≠ y(i))
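A small sketch of this distance (ours; the example links are illustrative, not the parses from the figure), with each parse given as a list of (head index, label) pairs, one per word:

```python
def hamming(v, y):
    """d_H(v, y): the number of words whose (head, label) link differs between
    the two parses; labeled attachment score is then 1 - d_H / n."""
    assert len(v) == len(y)
    return sum(a != b for a, b in zip(v, y))

# "Buyers stepped in to the futures pit" -- two hypothetical, partly disagreeing parses
v = [(2, "SBJ"), (0, "ROOT"), (2, "ADV"), (3, "AMOD"), (7, "NMOD"), (7, "NMOD"), (4, "PMOD")]
y = [(2, "SBJ"), (0, "ROOT"), (2, "ADV"), (2, "ADV"),  (7, "NMOD"), (7, "NMOD"), (4, "PMOD")]
print(hamming(v, y))               # 1 link disagrees
print(1 - hamming(v, y) / len(v))  # labeled attachment score ~ 0.857
```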

29 Dependency Parses: Parameter Estimation
Parameter estimation for the type-agnostic model can be done directly. Assume there are exactly S possibilities for each link, and that the j-th of the Q sentences has n^(j) words, with Σ_{j=1}^{Q} n^(j) = N. Each round of training for the type-agnostic model is then equivalent to computing, for each ranker i:
R_i = (1/N) Σ_{j=1}^{Q} Σ_{l=1}^{n^(j)} [ Σ_{v ∈ S, v ≠ y_i^(j)(l)} exp(Σ_{k=1}^{K} θ_k · I(v ≠ y_k^(j)(l))) ] / [ Σ_{v ∈ S} exp(Σ_{k=1}^{K} θ_k · I(v ≠ y_k^(j)(l))) ]
θ_i = log R_i − log(1 − R_i) − log(S − 1)
With small S, parameter estimation can be done quickly.
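A sketch of one such round (our reading of the update above; `votes[j][l][k]` is judge k's link, encoded as an integer in range(S), for word l of sentence j):

```python
import math

def update_thetas(votes, thetas, S):
    """One round of the type-agnostic update: R_i is judge i's posterior-expected
    disagreement rate over all N links, and theta_i = log R_i - log(1 - R_i) - log(S - 1),
    which is negative whenever judge i disagrees with the posterior less often than chance."""
    K = len(thetas)
    N = sum(len(sent) for sent in votes)
    R = [0.0] * K
    for sent in votes:
        for link_votes in sent:
            # unnormalized posterior over the S candidate links for this word
            w = [math.exp(sum(thetas[k] * (v != link_votes[k]) for k in range(K)))
                 for v in range(S)]
            z = sum(w)
            for k in range(K):
                R[k] += sum(wv for v, wv in enumerate(w) if v != link_votes[k]) / z
    R = [r / N for r in R]
    return [math.log(r) - math.log(1.0 - r) - math.log(S - 1) for r in R]

# Toy data: Q = 2 sentences, K = 3 judges, S = 4 candidate links per word
votes = [[(0, 0, 1), (2, 2, 2)], [(1, 1, 3), (0, 2, 0)]]
print(update_thetas(votes, thetas=[-1.0, -1.0, -1.0], S=4))
```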

30 Dependency Parses: Aggregation
- The dependency parsers are CoNLL-2007 shared task participants.
- 10 languages: Arabic, Basque, Catalan, Chinese, Czech, English, Greek, Hungarian, Italian, and Turkish.
- 131 to 690 sentences and 4513 to 5390 words, depending on the language.
- Between 20 and 23 systems, depending on the language.
- We varied the number of participants, attempting to represent expertise in the entire pool.
- Baseline: majority vote on each link (ties broken randomly).
(Figure: participants split into groups 1, …, K by accuracy.)

31 Predicting Relative Performance
The estimated relative expertise correlates with the true relative expertise. Results for Italian (labeled attachment):
(Table: participants listed with their true rank and true performance; rows include sagae@is.s.u-tokyo.ac.jp, nakagawa378@oki.com, johan.hall@vxu.se, carreras@csail.mit.edu, chenwl@nict.go.jp, attardi@di.unipi.it, xyduan@nlpr.ia.ac.cn, ivan.titov@cui.unige.ch, dasmith@jhu.edu, michael.schiehlen@ims.uni-stuttgart.de, bcbb@db.csie.ncu.edu.tw, prashanth@research.iiit.ac.in, richard@cs.lth.se, nguyenml@jaist.ac.jp, joyce840205@gmail.com, s.v.m.canisius@uvt.nl, francis.maes@lip6.fr, zeman@ufal.mff.cuni.cz, svetoslav.marinov@his.se.)

32 Dependency Parses: Aggregation
- Performance is measured as the average accuracy over the 10 languages.
- The largest improvement comes when there are fewer experts (higher practical significance); as the number of good experts grows, the voted baseline becomes harder to beat.
(Plot: average labeled attachment score, roughly 79 to 83, vs. the number of participants, for the voted baseline and the aggregation model.)

33 Conclusions
- We propose a formal mathematical and algorithmic framework for aggregating (partial) structured labels without supervision.
- We show that learning can be made efficient for decomposable distance functions.
- We instantiate the framework for combining permutations, combining top-k lists, and combining dependency parses.
- We introduce novel distance functions for top-k lists and dependency parses.

34 Thanks!
