Devavrat Shah Laboratory for Information and Decision Systems Department of EECS Massachusetts Institute of Technology http://web.mit.edu/devavrat/www (list of relevant references in the last set of slides)
o Ideally
  o Graphical models
  o Belief propagation
  o Connections to probability, statistics, EE, CS, ...
o In reality
  o A set of very exciting (to me, maybe others) questions at the interface of all of the above and more
  o Seemingly unrelated to graphical models
  o However, they provide fertile ground to understand everything about graphical models (algorithms, analysis)
o Recommendations
  o What movie to watch
  o Which restaurant to eat at
  o ...
o Precisely,
  o Suggest what you may like
  o Given what others have liked
  o By finding others like you and what they had liked
o Ranking
  o Players and/or teams
    o Based on outcomes of games
  o Papers at a competitive conference
    o Using reviews
  o Graduate admissions
    o From feedback of professors
o Precisely,
  o Global ranking of objects from partial preferences
o Partial preferences are revealed in different forms
  o Sports: win and loss
  o Social: starred ratings
  o Conferences: scores
o All can be viewed as pair-wise comparisons
  o IND beats AUS: IND ≻ AUS
  o Clio ***** vs No 9 Park ****: Clio ≻ No 9 Park
  o Ranking paper 9/10 vs other paper 5/10: Ranking ≻ Other
o Revealed preferences lead to
  o A bag of pair-wise comparisons
o Questions of interest
  o Recommendations
    o Suggest what you may like given what others have liked
  o Ranking
    o Global ranking of objects given outcomes of games/...
o This requires understanding (computing) the choice model
  o What people like/dislike, from pair-wise comparisons
o Rational view: axiom of revealed preferences [Samuelson 37]
  o There is one ordering over all objects consistent across the population
  o Unlikely (lack of transitivity in people's preferences)
o Meaningful view: discrete choice model
  o Distribution over orderings of objects
  o Consistent with the population's revealed preferences
[Figure: Data (revealed orderings over A, B, C) → Choice Model (a distribution over orderings, e.g. mass 0.25 and 0.75 on two orderings) → Decision]
o Object tracking (cf. Huang, Guestrin, Guibas 08)
  o Noisy observations of locations
  o Feasible to maintain partial information only
  o Q = [Q_ij]: first-order information
[Figure: objects 1, 2, 3 and locations P1, P2, P3, with entries Q_11 = P(object 1 → P1), Q_12, Q_13, ...]
o Object tracking
  o Noisy observations of locations
  o Feasible to maintain partial information only
  o Q = [Q_ij]: first-order information
[Figure: Data (marginals Q_11, Q_12, Q_13 over objects 1–3 and locations P1–P3) → Choice Model → Decision]
o Recommendation
o Ranking
o Object tracking
o Policy making
o Business operations (assortment optimization)
o Display advertising
o Polling, ...
o Canonical question
  o Decision using a choice model learnt from partial preference data
[Figure: bipartite graph of objects 1–3 and locations P1–P3, with edge weights Q_11, Q_12, Q_13, ...]
[Figure: weighted bipartite graph with objects 1–3, positions P1–P3, and weights Q_11, Q_12, Q_13]
o Q. Given a weighted bipartite graph G = (V, E, Q)
  o Find the matching of objects to positions
  o That is `most likely'
[Figure: the matching 1–P1, 2–P2, 3–P3]
o Answer: maximum weight matching
  o The weight of a matching equals
  o the summation of the Q-entries of the edges participating in the matching
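As a concrete illustration of this definition, the following sketch finds the maximum weight matching by brute force over all assignments. The matrix entries are made-up stand-ins for the Q_ij marginals; a real solver would use the Hungarian algorithm or the belief propagation approach discussed later, since brute force is exponential in n.

```python
from itertools import permutations

def max_weight_matching(Q):
    """Brute-force maximum weight matching on a complete bipartite graph.

    Q[i][j] plays the role of the first-order marginal Q_ij (e.g. the
    probability that object i is at position j).  Returns the assignment
    object -> position maximizing the sum of participating edge weights.
    Exponential in n -- for illustration only.
    """
    n = len(Q)
    best, best_weight = None, float("-inf")
    for perm in permutations(range(n)):
        weight = sum(Q[i][perm[i]] for i in range(n))
        if weight > best_weight:
            best, best_weight = perm, weight
    return best, best_weight

# Hypothetical 3x3 first-order information matrix.
Q = [[0.7, 0.2, 0.1],
     [0.1, 0.6, 0.3],
     [0.2, 0.2, 0.6]]
match, w = max_weight_matching(Q)  # match[i] = position assigned to object i
```

Here the diagonal assignment (1–P1, 2–P2, 3–P3) wins with total weight 0.7 + 0.6 + 0.6 = 1.9.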
[Figure: comparison graph on nodes 1–6; edge (1, 2) labeled with A_12, A_21, where A_12 = # times 1 defeats 2]
o Q1. Given a weighted comparison graph G = (V, E, A)
  o Find a ranking of / scores associated with the objects
o Q2. When possible (e.g. conference reviewing / crowd-sourcing), choose G so as to
  o Minimize the number of comparisons required to find the ranking/scores
o Random walk on comparison graph G = (V, E, A)
  o d = max (undirected) vertex degree of G
  o For each edge (i, j): P_ij = (A_ji + 1)/(A_ij + A_ji + 2) × 1/d
  o For each node i: P_ii = 1 − Σ_{j ≠ i} P_ij
o Let G be connected
o Let s be the unique stationary distribution of the random walk P: s^T = s^T P
o Ranking:
  o Use s as scores of the objects
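A minimal sketch of this construction in pure Python: build the transition matrix P exactly as on the slide and find the stationary distribution s by power iteration (s^T ← s^T P). The comparison counts in the example are invented for illustration; the edge set is inferred from the nonzero counts.

```python
def rank_centrality(n, A, d, iters=1000):
    """Rank Centrality scores via power iteration.

    A[(i, j)] = number of times i defeated j (absent key means 0);
    d = max (undirected) vertex degree of the comparison graph.
    P_ij = (A_ji + 1)/(A_ij + A_ji + 2) * (1/d) on each compared pair,
    P_ii = 1 - sum of the off-diagonal row entries.
    """
    P = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j and A.get((i, j), 0) + A.get((j, i), 0) > 0:
                P[i][j] = (A.get((j, i), 0) + 1) / (A.get((i, j), 0) + A.get((j, i), 0) + 2) / d
        P[i][i] = 1.0 - sum(P[i])  # lazy self-loop keeps P stochastic
    s = [1.0 / n] * n
    for _ in range(iters):                      # s^T <- s^T P
        s = [sum(s[i] * P[i][j] for i in range(n)) for j in range(n)]
    return s

# Made-up counts on a complete graph over 3 objects: 0 beats 1 and 2 often.
A = {(0, 1): 8, (1, 0): 2, (1, 2): 8, (2, 1): 2, (0, 2): 9, (2, 0): 1}
s = rank_centrality(3, A, d=2)  # scores; higher = better-ranked
```

The walk leaves strong objects rarely, so their stationary mass (score) is largest; here object 0 receives the top score.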
o Random walk on comparison graph G = (V, E, A)
  o d = max (undirected) vertex degree of G
  o For each edge (i, j): P_ij = (A_ji + 1)/(A_ij + A_ji + 2) × 1/d
  o For each node i: P_ii = 1 − Σ_{j ≠ i} P_ij
o Ranking: use s as scores of the objects, where
  o s is the unique stationary distribution of the random walk P: s^T = s^T P
o Choice of graph G
  o Subject to constraints, choose G so that
  o the spectral gap of the natural random walk on G is maximized
  o SDP [Boyd, Diaconis, Xiao 04]
o Maximum weight matching
  o How to compute it?
    o Belief propagation
  o Why does it make sense?
    o Max-likelihood estimation w.r.t. an exponential family
o Rank Centrality
  o How to compute it?
    o Power iteration
  o Why does it make sense?
    o Mode for the Bradley-Terry-Luce (or MNL) model
(all of the below explained on the class board)
o Computation
  o Belief propagation
    o Algorithm
    o Why it works
o Model
  o Maximum entropy (max-ent) consistent distribution
    o Maximum likelihood in an exponential family
  o Maximum weight matching
    o First-order approximation of the mode of this distribution
o Exact computation of max-ent
  o Via dual gradient
  o Belief propagation/MCMC to the rescue
o Choice model (distribution over permutations) [Bradley-Terry-Luce (BTL) or MNL (cf. McFadden) model]
  o Each object i has an associated weight w_i > 0
  o When objects i and j are compared
    o P(i ≻ j) = w_i/(w_i + w_j)
o Sampling model
  o Edges E of graph G are selected
  o For each (i, j) ∈ E, sample k pair-wise comparisons
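This sampling model is easy to simulate, which is how plots like the ones on the next slide are produced. A small sketch, with invented weights and a hypothetical edge set: each edge generates k independent comparisons with P(i ≻ j) = w_i/(w_i + w_j).

```python
import random

def sample_btl_comparisons(w, edges, k, seed=0):
    """Sample k pair-wise comparisons per edge from a BTL/MNL model.

    w[i] > 0 is object i's weight; i beats j with probability
    w[i]/(w[i] + w[j]).  Returns A with A[(i, j)] = # times i beat j,
    the input format expected by Rank Centrality.
    """
    rng = random.Random(seed)
    A = {}
    for (i, j) in edges:
        for _ in range(k):
            if rng.random() < w[i] / (w[i] + w[j]):
                A[(i, j)] = A.get((i, j), 0) + 1
            else:
                A[(j, i)] = A.get((j, i), 0) + 1
    return A

# Made-up weights; object 0 is the strongest. Complete graph on 3 objects.
w = [4.0, 2.0, 1.0]
A = sample_btl_comparisons(w, [(0, 1), (0, 2), (1, 2)], k=10000)
```

With many samples per edge, the empirical win fraction A[(0, 1)]/k concentrates around w_0/(w_0 + w_1) = 2/3.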
o Error(s) = (1/‖w‖) ( Σ_{i,j} (w_i − w_j)² I{(s(i) − s(j))(w_i − w_j) < 0} )^{1/2}
o G: Erdős–Rényi graph with edge prob. d/n
[Plots: Error(s) vs k and vs d/n (log–log) for Ratio Matrix, L1 ranking, Rank Centrality, and the ML estimate]
o Theorem 1.
  o Let R = max_{i,j} w_i/w_j.
  o Let G be an Erdős–Rényi graph.
  o Under Rank Centrality, with d = Ω(log n),
    ‖s − w‖/‖w‖ ≤ C √( R^5 log n / (kd) )
o That is, it is sufficient to have O(R^5 n log n) samples
  o Optimal dependence on n for the ER graph
  o Dependence on R?
o Theorem 1.
  o Let R = max_{i,j} w_i/w_j.
  o Let G be an Erdős–Rényi graph.
  o Under Rank Centrality, with d = Ω(log n),
    ‖s − w‖/‖w‖ ≤ C √( R^5 log n / (kd) )
o Information-theoretic lower bound: for any algorithm,
    ‖s − w‖/‖w‖ ≥ C′ √( 1/(kd) )
o Theorem 2.
  o Let R = max_{i,j} w_i/w_j.
  o Let G be any connected graph:
    o L = D^{−1} E: natural random-walk transition matrix
    o Δ = 1 − λ_max(L)
    o κ = d_max/d_min
  o Under Rank Centrality, with kd = Ω(log n),
    ‖s − w‖/‖w‖ ≤ (C κ/Δ) √( R^5 log n / (kd) )
o That is, the number of samples required is O(R^5 κ² n log n × Δ^{−2})
  o Graph structure plays a role through its Laplacian
o Theorem 2.
  o Under Rank Centrality, with kd = Ω(log n),
    ‖s − w‖/‖w‖ ≤ (C κ/Δ) √( R^5 log n / (kd) )
o That is, the number of samples required is O(R^5 κ² n log n × Δ^{−2})
o Choice of graph G
  o Subject to constraints, choose G so that
  o the spectral gap Δ is maximized
  o SDP [Boyd, Diaconis, Xiao 04]
o Bound on
  o Use of comparison theorem [Diaconis, Saloff-Coste 94]++
o Bound on
  o Use of a (modified) concentration of measure inequality for matrices
o Finally, use these to further bound Error(s)
Washington Post: Allourideas
o Ground truth: the algorithm's result with complete data
o Error: average position discrepancy
[Plot: error (0–20) vs fraction of data (0–1) for L1 ranking and Rank Centrality]
o Input: complete preferences (not comparisons)
o Axiomatic impossibility [Arrow 51]
o Some algorithms
  o Kemeny optimal: minimize disagreements
    o Extended Condorcet criteria
    o NP-hard; 2-approx algorithm [Dwork et al 01]
  o Borda count: average position is the score
    o Simple
    o Useful axiomatic properties [Young 74]
[Figure: comparison graph on nodes 1–6 with edge labels A_12, A_21]
o Algorithms with partial data
  o Let pair-wise data be available for all pairs
  o The Kemeny distance depends on pair-wise marginals only:
    argmin_σ Σ_{i,j} A_ij I(σ(i) < σ(j))
o Data is consistent with a distribution on permutations
  o For example, obtained as the max-ent approximation
  o The Kemeny optimum of this distribution is the same as the above
    o NP-hard
    o A 2-approx for this distribution acts as a 2-approx for the above [Ammar-Shah 11, 12]
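To make the objective concrete, here is a brute-force Kemeny optimizer over all permutations; the counts are invented, and the disagreement convention (a ranking pays A[(i, j)] whenever it places i below j) is one natural reading of the objective above. The exponential enumeration is exactly what the NP-hardness on this slide rules out at scale; 2-approximations are used instead.

```python
from itertools import permutations

def kemeny_optimal(n, A):
    """Brute-force Kemeny-optimal ranking from pair-wise counts.

    A[(i, j)] = number of comparisons preferring i over j.  A ranking
    is charged A[(i, j)] for every pair it orders with i below j
    (a disagreement); we minimize the total.  Exponential in n --
    illustration only, since the problem is NP-hard in general.
    """
    best, best_cost = None, float("inf")
    for order in permutations(range(n)):        # order[0] ranked first
        pos = {obj: r for r, obj in enumerate(order)}
        cost = sum(c for (i, j), c in A.items() if pos[i] > pos[j])
        if cost < best_cost:
            best, best_cost = order, cost
    return best, best_cost

# Made-up counts: 0 usually beats 1, 1 usually beats 2, 0 always beats 2.
A = {(0, 1): 6, (1, 0): 1, (1, 2): 5, (2, 1): 2, (0, 2): 7}
order, cost = kemeny_optimal(3, A)
```

On this instance the ordering 0 ≻ 1 ≻ 2 is optimal, disagreeing only with the 1 + 2 = 3 upset comparisons.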
o Borda count
  o Average position
  o But comparisons do not carry position information
o Given pair-wise marginals p_ij for all i ≠ j
  o For any distribution consistent with the pair-wise marginals
  o the Borda count is given as c(i) ∝ Σ_j p_ij [Ammar-Shah 11, 12]
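The reduction above makes the comparison-based Borda count a one-liner: score each object by summing its pair-wise marginals. A sketch with made-up marginals:

```python
def borda_from_marginals(n, p):
    """Borda-style scores from pair-wise marginals.

    p[(i, j)] = probability that i is preferred to j.  Per the slide,
    for any distribution consistent with these marginals the Borda
    count reduces to c(i) proportional to sum_j p_ij.
    """
    return [sum(p.get((i, j), 0.0) for j in range(n) if j != i)
            for i in range(n)]

# Hypothetical consistent marginals over 3 objects (p_ij + p_ji = 1).
p = {(0, 1): 0.8, (1, 0): 0.2,
     (0, 2): 0.7, (2, 0): 0.3,
     (1, 2): 0.6, (2, 1): 0.4}
c = borda_from_marginals(3, p)  # higher score = better average position
```

Here c = [1.5, 0.8, 0.7], so the induced ranking is 0 ≻ 1 ≻ 2 without ever observing positions.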
o Finding the winner and the BTL choice model [Adler, Gemmell, Harchol-Balter, Karp, Kenyon 87]
  o O(log n)-iteration adaptive algorithm, O(n) total comparisons
o Noisy sorting and the Mallows model [Braverman, Mossel 09]
  o O(n log n) samples in total (complete ordering)
  o Average position (Borda count) algorithm++
  o Polynomial(n)-time algorithm
o Choice model
  o A powerful model to tackle a range of questions
  o Many are in their infancy (e.g. recommendations)
  o The challenge being computation + statistics
  o An excellent playground to resolve challenges of graphical models
o Two examples in this tutorial
  o Object tracking
    o Learning from first-order marginals
  o Ranking
    o Using pair-wise comparisons
o Open direction:
  o Learning graphical models efficiently
    o Computationally and statistically
o A concrete question:
  o Given pair-wise comparison data
  o When can we learn the choice model efficiently?
  o For example, if exact pair-wise comparison marginals are available
    o Then one can learn a `sparse' choice model up to o(log n) sparsity [Farias + Jagabathula + S 09, 12]
  o But what about the noisy setting?
  o Or max-ent learning?
o Part I:
  o A. Ammar, D. Shah, ``Efficient rank aggregation from partial data,'' Proceedings of ACM Sigmetrics, 2012.
  o M. Bayati, D. Shah, M. Sharma, ``Max-product for maximum weight matching: convergence, correctness and LP duality,'' IEEE Transactions on Information Theory, 2008.
  o S. Jagabathula, D. Shah, ``Inferring rankings using constrained sensing,'' IEEE Transactions on Information Theory, 2011.
  o S. Agrawal, Z. Wang, Y. Ye, ``Parimutuel betting on permutations,'' Internet and Network Economics, 2008.
  o J. Huang, C. Guestrin, L. Guibas, ``Fourier theoretic probabilistic inference over permutations,'' Journal of Machine Learning Research, 2009.
(Color coding: covered in tutorial / sparse choice model that is v. relevant / others, closely related and definitely worth a read)
o Part II:
  o S. Negahban, S. Oh, D. Shah, ``Iterative ranking using pair-wise comparisons,'' Proceedings of NIPS, 2012.
  o V. Farias, S. Jagabathula, D. Shah, ``Data driven approach to modeling choice,'' Proceedings of NIPS, 2009. Also Management Science, 2012 (and arXiv).
  o V. Farias, S. Jagabathula, D. Shah, ``Sparse choice model,'' available on arXiv, 2012.
  o A. Ammar, D. Shah, ``Compare, don't score,'' Proceedings of Allerton, 2011.
o At large:
  o H. Varian, ``Revealed preferences, Samuelsonian economics and the twenty-first century,'' 2006.
  o D. McFadden, ``Disaggregate Behavioral Travel Demand's RUM Side, A 30-year Retrospective,'' available online: http://emlab.berkeley.edu/pub/wp/mcfadden0300.pdf
  o P. Diaconis, ``Group representations in probability and statistics,'' Lecture Notes–Monograph Series, 1988.
  o M. Wainwright, M. Jordan, ``Graphical models, exponential families, and variational inference,'' Foundations and Trends in Machine Learning, 2008.