Devavrat Shah Laboratory for Information and Decision Systems Department of EECS Massachusetts Institute of Technology http://web.mit.edu/devavrat/www (list of relevant references in the last set of slides)
o Ideally
  o Graphical models
  o Belief propagation
  o Connections to probability, statistics, EE, CS, ...
o In reality
  o A set of very exciting (to me, maybe others) questions at the interface of all of the above and more
  o Seemingly unrelated to graphical models
  o However, they provide fertile ground to understand everything about graphical models (algorithms, analysis)
o Recommendations
  o What movie to watch
  o Which restaurant to eat at
  o ...
o Precisely,
  o Suggest what you may like
  o Given what others have liked
  o By finding others like you and what they had liked
o Ranking
  o Players and/or teams
    o Based on outcomes of games
  o Papers at a competitive conference
    o Using reviews
  o Graduate admissions
    o From feedback of professors
o Precisely,
  o Global ranking of objects from partial preferences
o Partial preferences are revealed in different forms
  o Sports: win and loss
  o Social: starred ratings
  o Conferences: scores
o All can be viewed as pair-wise comparisons
  o IND beats AUS: IND ≻ AUS
  o Clio ***** vs No 9 Park ****: Clio ≻ No 9 Park
  o Ranking paper 9/10 vs other paper 5/10: Ranking ≻ Other
o Revealed preferences lead to
  o A bag of pair-wise comparisons
o Questions of interest
  o Recommendations
    o Suggest what you may like given what others have liked
  o Ranking
    o Global ranking of objects given outcomes of games/...
o This requires understanding (computing) the choice model
  o What people like/dislike, from pair-wise comparisons
o Rational view: axiom of revealed preferences [Samuelson 37]
  o There is one ordering over all objects consistent across the population
  o Unlikely (lack of transitivity in people's preferences)
o Meaningful view: discrete choice model
  o Distribution over orderings of objects
  o Consistent with the population's revealed preferences
[Figure: Data (revealed orderings over A, B, C) → Choice Model (a distribution over orderings, e.g. mass 0.25 and 0.75 on two orderings) → Decision]
o Object tracking (cf. Huang, Guestrin, Guibas 08)
  o Noisy observations of locations
  o Feasible to maintain partial information only
  o Q = [Q_ij]: first-order information
[Figure: objects 1, 2, 3 and locations P1, P2, P3, with entries Q_11 = P(object 1 → P1), Q_12, Q_13, ...]
o Object tracking
  o Noisy observations of locations
  o Feasible to maintain partial information only
  o Q = [Q_ij]: first-order information
[Figure: Data (marginals Q_11, Q_12, Q_13 over objects 1–3 and locations P1–P3) → Choice Model → Decision]
o Recommendation
o Ranking
o Object tracking
o Policy making
o Business operations (assortment optimization)
o Display advertising
o Polling, ...
o Canonical question
  o Decision using a choice model learnt from partial preference data
[Figure: bipartite graph of objects 1–3 and locations P1–P3, with edge weights Q_11, Q_12, Q_13, ...]
[Figure: weighted bipartite graph with objects 1–3, positions P1–P3, and weights Q_11, Q_12, Q_13]
o Q. Given a weighted bipartite graph G = (V, E, Q)
  o Find the matching of objects to positions
  o That is `most likely'
[Figure: the matching 1–P1, 2–P2, 3–P3]
o Answer: maximum weight matching
  o The weight of a matching equals
  o the summation of the Q-entries of the edges participating in the matching
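As a concrete illustration of this definition, the following sketch finds the maximum weight matching by brute force over all assignments. The matrix entries are made-up stand-ins for the Q_ij marginals; a real solver would use the Hungarian algorithm or the belief propagation approach discussed later, since brute force is exponential in n.

```python
from itertools import permutations

def max_weight_matching(Q):
    """Brute-force maximum weight matching on a complete bipartite graph.

    Q[i][j] plays the role of the first-order marginal Q_ij (e.g. the
    probability that object i is at position j).  Returns the assignment
    object -> position maximizing the sum of participating edge weights.
    Exponential in n -- for illustration only.
    """
    n = len(Q)
    best, best_weight = None, float("-inf")
    for perm in permutations(range(n)):
        weight = sum(Q[i][perm[i]] for i in range(n))
        if weight > best_weight:
            best, best_weight = perm, weight
    return best, best_weight

# Hypothetical 3x3 first-order information matrix.
Q = [[0.7, 0.2, 0.1],
     [0.1, 0.6, 0.3],
     [0.2, 0.2, 0.6]]
match, w = max_weight_matching(Q)  # match[i] = position assigned to object i
```

Here the diagonal assignment (1–P1, 2–P2, 3–P3) wins with total weight 0.7 + 0.6 + 0.6 = 1.9.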
[Figure: comparison graph on nodes 1–6; edge (1, 2) labeled with A_12, A_21, where A_12 = # times 1 defeats 2]
o Q1. Given a weighted comparison graph G = (V, E, A)
  o Find a ranking of / scores associated with the objects
o Q2. When possible (e.g. conference reviewing / crowd-sourcing), choose G so as to
  o Minimize the number of comparisons required to find the ranking/scores
o Random walk on comparison graph G = (V, E, A)
  o d = max (undirected) vertex degree of G
  o For each edge (i, j): P_ij = (A_ji + 1)/(A_ij + A_ji + 2) × 1/d
  o For each node i: P_ii = 1 − Σ_{j ≠ i} P_ij
o Let G be connected
o Let s be the unique stationary distribution of the random walk P: s^T = s^T P
o Ranking:
  o Use s as scores of the objects
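A minimal sketch of this construction in pure Python: build the transition matrix P exactly as on the slide and find the stationary distribution s by power iteration (s^T ← s^T P). The comparison counts in the example are invented for illustration; the edge set is inferred from the nonzero counts.

```python
def rank_centrality(n, A, d, iters=1000):
    """Rank Centrality scores via power iteration.

    A[(i, j)] = number of times i defeated j (absent key means 0);
    d = max (undirected) vertex degree of the comparison graph.
    P_ij = (A_ji + 1)/(A_ij + A_ji + 2) * (1/d) on each compared pair,
    P_ii = 1 - sum of the off-diagonal row entries.
    """
    P = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j and A.get((i, j), 0) + A.get((j, i), 0) > 0:
                P[i][j] = (A.get((j, i), 0) + 1) / (A.get((i, j), 0) + A.get((j, i), 0) + 2) / d
        P[i][i] = 1.0 - sum(P[i])  # lazy self-loop keeps P stochastic
    s = [1.0 / n] * n
    for _ in range(iters):                      # s^T <- s^T P
        s = [sum(s[i] * P[i][j] for i in range(n)) for j in range(n)]
    return s

# Made-up counts on a complete graph over 3 objects: 0 beats 1 and 2 often.
A = {(0, 1): 8, (1, 0): 2, (1, 2): 8, (2, 1): 2, (0, 2): 9, (2, 0): 1}
s = rank_centrality(3, A, d=2)  # scores; higher = better-ranked
```

The walk leaves strong objects rarely, so their stationary mass (score) is largest; here object 0 receives the top score.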
o Random walk on comparison graph G = (V, E, A)
  o d = max (undirected) vertex degree of G
  o For each edge (i, j): P_ij = (A_ji + 1)/(A_ij + A_ji + 2) × 1/d
  o For each node i: P_ii = 1 − Σ_{j ≠ i} P_ij
o Ranking: use s as scores of the objects, where
  o s is the unique stationary distribution of the random walk P: s^T = s^T P
o Choice of graph G
  o Subject to constraints, choose G so that
  o the spectral gap of the natural random walk on G is maximized
  o SDP [Boyd, Diaconis, Xiao 04]
o Maximum weight matching
  o How to compute it?
    o Belief propagation
  o Why does it make sense?
    o Max-likelihood estimation w.r.t. an exponential family
o Rank Centrality
  o How to compute it?
    o Power iteration
  o Why does it make sense?
    o Mode for the Bradley-Terry-Luce (or MNL) model
(all of the below explained on the class board)
o Computation
  o Belief propagation
    o Algorithm
    o Why it works
o Model
  o Maximum entropy (max-ent) consistent distribution
    o Maximum likelihood in an exponential family
  o Maximum weight matching
    o First-order approximation of the mode of this distribution
o Exact computation of max-ent
  o Via dual gradient
  o Belief propagation/MCMC to the rescue
o Choice model (distribution over permutations) [Bradley-Terry-Luce (BTL) or MNL (cf. McFadden) model]
  o Each object i has an associated weight w_i > 0
  o When objects i and j are compared
    o P(i ≻ j) = w_i/(w_i + w_j)
o Sampling model
  o Edges E of graph G are selected
  o For each (i, j) ∈ E, sample k pair-wise comparisons
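This sampling model is easy to simulate, which is how plots like the ones on the next slide are produced. A small sketch, with invented weights and a hypothetical edge set: each edge generates k independent comparisons with P(i ≻ j) = w_i/(w_i + w_j).

```python
import random

def sample_btl_comparisons(w, edges, k, seed=0):
    """Sample k pair-wise comparisons per edge from a BTL/MNL model.

    w[i] > 0 is object i's weight; i beats j with probability
    w[i]/(w[i] + w[j]).  Returns A with A[(i, j)] = # times i beat j,
    the input format expected by Rank Centrality.
    """
    rng = random.Random(seed)
    A = {}
    for (i, j) in edges:
        for _ in range(k):
            if rng.random() < w[i] / (w[i] + w[j]):
                A[(i, j)] = A.get((i, j), 0) + 1
            else:
                A[(j, i)] = A.get((j, i), 0) + 1
    return A

# Made-up weights; object 0 is the strongest. Complete graph on 3 objects.
w = [4.0, 2.0, 1.0]
A = sample_btl_comparisons(w, [(0, 1), (0, 2), (1, 2)], k=10000)
```

With many samples per edge, the empirical win fraction A[(0, 1)]/k concentrates around w_0/(w_0 + w_1) = 2/3.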
o Error(s) = (1/‖w‖) ( Σ_{i,j} (w_i − w_j)² I{(s(i) − s(j))(w_i − w_j) < 0} )^{1/2}
o G: Erdős–Rényi graph with edge prob. d/n
[Plots: Error(s) vs k and vs d/n (log–log) for Ratio Matrix, L1 ranking, Rank Centrality, and the ML estimate]
o Theorem 1.
  o Let R = max_{i,j} w_i/w_j.
  o Let G be an Erdős–Rényi graph.
  o Under Rank Centrality, with d = Ω(log n),
    ‖s − w‖/‖w‖ ≤ C √( R^5 log n / (kd) )
o That is, it is sufficient to have O(R^5 n log n) samples
  o Optimal dependence on n for the ER graph
  o Dependence on R?
o Theorem 1.
  o Let R = max_{i,j} w_i/w_j.
  o Let G be an Erdős–Rényi graph.
  o Under Rank Centrality, with d = Ω(log n),
    ‖s − w‖/‖w‖ ≤ C √( R^5 log n / (kd) )
o Information-theoretic lower bound: for any algorithm,
    ‖s − w‖/‖w‖ ≥ C′ √( 1/(kd) )
o Theorem 2.
  o Let R = max_{i,j} w_i/w_j.
  o Let G be any connected graph:
    o L = D^{−1} E: natural random-walk transition matrix
    o Δ = 1 − λ_max(L)
    o κ = d_max/d_min
  o Under Rank Centrality, with kd = Ω(log n),
    ‖s − w‖/‖w‖ ≤ (C κ/Δ) √( R^5 log n / (kd) )
o That is, the number of samples required is O(R^5 κ² n log n × Δ^{−2})
  o Graph structure plays a role through its Laplacian
o Theorem 2.
  o Under Rank Centrality, with kd = Ω(log n),
    ‖s − w‖/‖w‖ ≤ (C κ/Δ) √( R^5 log n / (kd) )
o That is, the number of samples required is O(R^5 κ² n log n × Δ^{−2})
o Choice of graph G
  o Subject to constraints, choose G so that
  o the spectral gap Δ is maximized
  o SDP [Boyd, Diaconis, Xiao 04]
o Bound on
  o Use of comparison theorem [Diaconis, Saloff-Coste 94]++
o Bound on
  o Use of a (modified) concentration of measure inequality for matrices
o Finally, use these to further bound Error(s)
Washington Post: Allourideas
o Ground truth: the algorithm's result with complete data
o Error: average position discrepancy
[Plot: error (0–20) vs fraction of data (0–1) for L1 ranking and Rank Centrality]
o Input: complete preferences (not comparisons)
o Axiomatic impossibility [Arrow 51]
o Some algorithms
  o Kemeny optimal: minimize disagreements
    o Extended Condorcet criteria
    o NP-hard; 2-approx algorithm [Dwork et al 01]
  o Borda count: average position is the score
    o Simple
    o Useful axiomatic properties [Young 74]
[Figure: comparison graph on nodes 1–6 with edge labels A_12, A_21]
o Algorithms with partial data
  o Let pair-wise data be available for all pairs
  o The Kemeny distance depends on pair-wise marginals only:
    argmin_σ Σ_{i,j} A_ij I(σ(i) < σ(j))
o Data is consistent with a distribution on permutations
  o For example, obtained as the max-ent approximation
  o The Kemeny optimum of this distribution is the same as the above
    o NP-hard
    o A 2-approx for this distribution acts as a 2-approx for the above [Ammar-Shah 11, 12]
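To make the objective concrete, here is a brute-force Kemeny optimizer over all permutations; the counts are invented, and the disagreement convention (a ranking pays A[(i, j)] whenever it places i below j) is one natural reading of the objective above. The exponential enumeration is exactly what the NP-hardness on this slide rules out at scale; 2-approximations are used instead.

```python
from itertools import permutations

def kemeny_optimal(n, A):
    """Brute-force Kemeny-optimal ranking from pair-wise counts.

    A[(i, j)] = number of comparisons preferring i over j.  A ranking
    is charged A[(i, j)] for every pair it orders with i below j
    (a disagreement); we minimize the total.  Exponential in n --
    illustration only, since the problem is NP-hard in general.
    """
    best, best_cost = None, float("inf")
    for order in permutations(range(n)):        # order[0] ranked first
        pos = {obj: r for r, obj in enumerate(order)}
        cost = sum(c for (i, j), c in A.items() if pos[i] > pos[j])
        if cost < best_cost:
            best, best_cost = order, cost
    return best, best_cost

# Made-up counts: 0 usually beats 1, 1 usually beats 2, 0 always beats 2.
A = {(0, 1): 6, (1, 0): 1, (1, 2): 5, (2, 1): 2, (0, 2): 7}
order, cost = kemeny_optimal(3, A)
```

On this instance the ordering 0 ≻ 1 ≻ 2 is optimal, disagreeing only with the 1 + 2 = 3 upset comparisons.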
o Borda count
  o Average position
  o But comparisons do not carry position information
o Given pair-wise marginals p_ij for all i ≠ j
  o For any distribution consistent with the pair-wise marginals
  o the Borda count is given as c(i) ∝ Σ_j p_ij [Ammar-Shah 11, 12]
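The reduction above makes the comparison-based Borda count a one-liner: score each object by summing its pair-wise marginals. A sketch with made-up marginals:

```python
def borda_from_marginals(n, p):
    """Borda-style scores from pair-wise marginals.

    p[(i, j)] = probability that i is preferred to j.  Per the slide,
    for any distribution consistent with these marginals the Borda
    count reduces to c(i) proportional to sum_j p_ij.
    """
    return [sum(p.get((i, j), 0.0) for j in range(n) if j != i)
            for i in range(n)]

# Hypothetical consistent marginals over 3 objects (p_ij + p_ji = 1).
p = {(0, 1): 0.8, (1, 0): 0.2,
     (0, 2): 0.7, (2, 0): 0.3,
     (1, 2): 0.6, (2, 1): 0.4}
c = borda_from_marginals(3, p)  # higher score = better average position
```

Here c = [1.5, 0.8, 0.7], so the induced ranking is 0 ≻ 1 ≻ 2 without ever observing positions.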
o Finding the winner and the BTL choice model [Adler, Gemmell, Harchol-Balter, Karp, Kenyon 87]
  o O(log n)-iteration adaptive algorithm, O(n) total comparisons
o Noisy sorting and the Mallows model [Braverman, Mossel 09]
  o O(n log n) samples in total (complete ordering)
  o Average position (Borda count) algorithm++
  o Polynomial(n)-time algorithm
o Choice model
  o A powerful model to tackle a range of questions
  o Many are in their infancy (e.g. recommendations)
  o The challenge being computation + statistics
  o An excellent playground to resolve challenges of graphical models
o Two examples in this tutorial
  o Object tracking
    o Learning from first-order marginals
  o Ranking
    o Using pair-wise comparisons
o Open direction:
  o Learning graphical models efficiently
    o Computationally and statistically
o A concrete question:
  o Given pair-wise comparison data
  o When can we learn the choice model efficiently?
  o For example, if exact pair-wise comparison marginals are available
    o Then one can learn a `sparse' choice model up to o(log n) sparsity [Farias + Jagabathula + S 09, 12]
  o But what about the noisy setting?
  o Or max-ent learning?
o Part I:
  o A. Ammar, D. Shah, ``Efficient rank aggregation from partial data,'' Proceedings of ACM Sigmetrics, 2012.
  o M. Bayati, D. Shah, M. Sharma, ``Max-product for maximum weight matching: convergence, correctness and LP duality,'' IEEE Transactions on Information Theory, 2008.
  o S. Jagabathula, D. Shah, ``Inferring rankings using constrained sensing,'' IEEE Transactions on Information Theory, 2011.
  o S. Agrawal, Z. Wang, Y. Ye, ``Parimutuel betting on permutations,'' Internet and Network Economics, 2008.
  o J. Huang, C. Guestrin, L. Guibas, ``Fourier theoretic probabilistic inference over permutations,'' Journal of Machine Learning Research, 2009.
(Color coding: covered in tutorial / sparse choice model that is v. relevant / others, closely related and definitely worth a read)
o Part II:
  o S. Negahban, S. Oh, D. Shah, ``Iterative ranking using pair-wise comparisons,'' Proceedings of NIPS, 2012.
  o V. Farias, S. Jagabathula, D. Shah, ``Data driven approach to modeling choice,'' Proceedings of NIPS, 2009. Also Management Science, 2012 (and arXiv).
  o V. Farias, S. Jagabathula, D. Shah, ``Sparse choice model,'' available on arXiv, 2012.
  o A. Ammar, D. Shah, ``Compare, don't score,'' Proceedings of Allerton, 2011.
o At large:
  o H. Varian, ``Revealed preferences, Samuelsonian economics and the twenty-first century,'' 2006.
  o D. McFadden, ``Disaggregate Behavioral Travel Demand's RUM Side, A 30-year Retrospective,'' available online: http://emlab.berkeley.edu/pub/wp/mcfadden0300.pdf
  o P. Diaconis, ``Group representations in probability and statistics,'' Lecture Notes–Monograph Series, 1988.
  o M. Wainwright, M. Jordan, ``Graphical models, exponential families, and variational inference,'' Foundations and Trends in Machine Learning, 2008.