A physical model for efficient rankings in networks


1 A physical model for efficient rankings in networks Daniel Larremore Assistant Professor Dept. of Computer Science & BioFrontiers Institute March 5, 2018 CompleNet


2 The idea of rankings is pervasive!
Assumptions:
1. Competitors have some intrinsic quality (or vector of qualities).
2. Interactions can (stochastically) reveal differences in qualities.
3. Competitions or endorsements are pair-wise (e.g. Lee Sedol vs. AlphaGo).
In other words: outcomes are generated by a stochastic process, which is some function of the positions of the competitors.
[Figure: universities shown: Cornell, MIT, Caltech, Harvard, UC Berkeley, Stanford, Washington, Princeton, Yale, Carnegie Mellon.]
Hobson & DeDeo. PLOS Computational Biology 11(9), e (2015).
Clauset, Arbesman, Larremore. Science Advances 1, e (2015).

3 The idea of rankings is pervasive!
Latent position can be revealed by pair-wise dominance or endorsement interactions.


4 Ranking methods: many competing ideas!
Minimum Violations Rank finds an order of nodes that minimizes "upsets". No ties (but see Agony methods), many minima, and provably NP-hard.
PageRank defines scalar rank recursively: important pages are those that are linked to by important pages. Great at finding the top 3, but the PageRank scores have low interpretability.
Ball & Newman inferred ordinal rank via a random graph + ranks model: infer the parameters of people's attachment preferences and ranks. Found that in AddHealth data, teens link to others of nearby social status.
Niche Models embed species in a latent space based on feeding preferences: most species feed from a narrow range in a 1-dimensional space (~body size). Great for food webs. Inference models are very slow for all but small networks.
Bradley-Terry-Luce (BTL) models embed products in a 1D space. Outcome direction is simply: [equation not shown]. Provably avoids non-transitive properties. Great when there is lots of data per interaction. Honestly? Best to google "BTL model".
The rest of this talk: SpringRank. Fast, interpretable, predictive.
Williams & Martinez. Nature (2000).
Ball, Newman. Network Science 1, (2013).
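The Bradley-Terry-Luce outcome rule on slide 4 did not survive extraction. As a hedged illustration, the sketch below uses the textbook form of the model, P(i beats j) = γ_i / (γ_i + γ_j); the function name and the strengths are hypothetical, and the slide may have used a different parameterization.

```python
def btl_win_probability(gamma_i: float, gamma_j: float) -> float:
    """Textbook Bradley-Terry-Luce rule: P(i beats j) = gamma_i / (gamma_i + gamma_j).

    gamma_i and gamma_j are positive "strength" parameters (assumed names).
    """
    if gamma_i <= 0 or gamma_j <= 0:
        raise ValueError("BTL strengths must be positive")
    return gamma_i / (gamma_i + gamma_j)


# Hypothetical strengths for three competitors.
strengths = {"a": 4.0, "b": 2.0, "c": 1.0}

p_ab = btl_win_probability(strengths["a"], strengths["b"])  # a vs. b
p_ba = btl_win_probability(strengths["b"], strengths["a"])  # b vs. a
```

The two directions always sum to one, so the model only ever says which of the two competitors is more likely to win.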

5 SpringRank: each directed edge = directed spring
[Figure: nodes joined by directed springs, annotated with the spring energy and rest length ℓ = 1.]

6 Relax and let the springs decide the ranks
$$H(s) = \frac{1}{2} \sum_{i,j=1}^{N} A_{ij} \left(s_i - s_j - 1\right)^2$$
SpringRank Hamiltonian = energy of the system, given the node positions s. Because the springs are linear, the potential is quadratic. The SR Hamiltonian is convex in s. The solution is unique up to an additive constant (why?)
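A minimal sketch of the Hamiltonian on a toy network, assuming A[i][j] counts edges from i to j (so an edge i → j is relaxed when s_i sits exactly one unit above s_j). The function name and the example network are illustrative. It also hints at the "(why?)": H depends only on differences s_i − s_j, so shifting every rank by the same constant leaves the energy unchanged.

```python
def springrank_energy(A, s):
    """H(s) = 1/2 * sum over i, j of A[i][j] * (s[i] - s[j] - 1)^2."""
    n = len(s)
    return 0.5 * sum(
        A[i][j] * (s[i] - s[j] - 1.0) ** 2
        for i in range(n)
        for j in range(n)
    )


# Toy chain 0 -> 1 -> 2 (illustrative data).
A = [[0, 1, 0],
     [0, 0, 1],
     [0, 0, 0]]

relaxed = [2.0, 1.0, 0.0]                    # every spring at its rest length
shifted = [s_i + 5.0 for s_i in relaxed]     # same ranks plus a constant
reversed_order = [0.0, 1.0, 2.0]             # every spring stretched the wrong way

energy_relaxed = springrank_energy(A, relaxed)
energy_shifted = springrank_energy(A, shifted)
energy_reversed = springrank_energy(A, reversed_order)
```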

7 Derivatives work out
$$\frac{\partial H}{\partial s_i} = \sum_j \left[ A_{ij}\left(s_i - s_j - 1\right) - A_{ji}\left(s_j - s_i - 1\right) \right] = 0$$
Rewrite as a linear algebra problem:
$$\left[ D^{\text{out}} + D^{\text{in}} - \left(A + A^{T}\right) \right] s = \left[ D^{\text{out}} - D^{\text{in}} \right] \mathbf{1}$$
We know a priori that the matrix on the left is singular: translational invariance of H(s). [If s is a solution, then s + k is a solution for any constant k; eigenvalue 0, eigenvector 1.]
Uniqueness: set s_1 = 0, min(s) = 0, or mean(s) = 0. Or use a pseudoinverse. Or regularize.
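The linear system on slide 7 can be sketched with NumPy as follows. This is an illustration rather than the authors' code: the function name is made up, and the singular matrix is handled with a minimum-norm least-squares solve (equivalent to the pseudoinverse option on the slide), which for a connected network picks the solution with mean(s) = 0.

```python
import numpy as np


def springrank_dense(A):
    """Solve [D_out + D_in - (A + A^T)] s = (d_out - d_in) for the ranks s.

    A[i, j] is the number of edges from i to j. The matrix is singular, so
    lstsq returns the minimum-norm solution; the explicit rcond makes the
    zero singular value get discarded reliably.
    """
    A = np.asarray(A, dtype=float)
    d_out = A.sum(axis=1)                      # out-degrees
    d_in = A.sum(axis=0)                       # in-degrees
    L = np.diag(d_out + d_in) - (A + A.T)      # left-hand matrix
    rhs = d_out - d_in                         # (D_out - D_in) applied to the ones vector
    s, *_ = np.linalg.lstsq(L, rhs, rcond=1e-10)
    return s


# Toy chain 0 -> 1 -> 2 (illustrative data).
A = np.array([[0, 1, 0],
              [0, 0, 1],
              [0, 0, 0]])
s = springrank_dense(A)
```

For the chain, consecutive nodes end up exactly one unit apart, which is the zero-energy configuration of the springs.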

8 It works!
Real networks tend to be sparse, so our linear algebra problem is sparse, so we can use sparse iterative solvers: millions of edges in seconds.
[Figure: computer science faculty hiring network.] Note that node positions can be clumpy.
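A hedged sketch of the sparse route using SciPy's conjugate-gradient solver. The ridge-style term alpha·I is one assumed way to "regularize" so that the matrix becomes positive definite; the function name, the value of alpha, and the toy network are illustrative, and no timing claim is made for this snippet.

```python
import numpy as np
from scipy.sparse import csr_matrix, diags, identity
from scipy.sparse.linalg import cg


def springrank_sparse(A, alpha=1e-3):
    """Solve [D_out + D_in - (A + A^T) + alpha*I] s = (d_out - d_in) with CG.

    alpha > 0 is an assumed ridge regularizer; it removes the zero eigenvalue
    so that conjugate gradients applies to a positive definite system.
    """
    A = csr_matrix(A, dtype=float)
    n = A.shape[0]
    d_out = np.asarray(A.sum(axis=1)).ravel()
    d_in = np.asarray(A.sum(axis=0)).ravel()
    L = diags(d_out + d_in) - (A + A.T) + alpha * identity(n)
    s, info = cg(L.tocsr(), d_out - d_in)
    if info != 0:
        raise RuntimeError(f"CG did not converge (info={info})")
    return s


# Toy chain 0 -> 1 -> 2 (illustrative data).
A = np.array([[0, 1, 0],
              [0, 0, 1],
              [0, 0, 0]])
s_sparse = springrank_sparse(A)
```

With a small alpha the result stays close to the unregularized mean-zero solution, slightly shrunk toward zero.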

9 [Figure panels: rural social support; parakeets; CS hiring.]
Risk: stopping at "ours is faster + pretty pictures".


10 Generative models: create synthetic data
We like generative models because they open the door to inference:
GM + parameters + ranks → (stochastic draw) → Data
Data → (inference) → GM + parameters + ranks
Define a model: let the expected number of edges from i to j be: [equation not shown]
And let the actual number of edges be drawn from a Poisson distribution with that mean.
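The expected-edge formula on slide 10 was lost in extraction. The sketch below assumes the form c · exp(−(β/2)(s_i − s_j − 1)²), chosen because it matches the spring energy from the Hamiltonian; the function name, the parameter names c and beta, and the choice to forbid self-loops are all assumptions of this illustration.

```python
import numpy as np


def sample_network(s, beta, c, rng):
    """Draw A[i, j] ~ Poisson(mean_edges[i, j]).

    Assumed mean: c * exp(-beta/2 * (s_i - s_j - 1)^2), largest when i sits
    exactly one unit above j. Self-loops are disabled (an assumption).
    """
    s = np.asarray(s, dtype=float)
    diff = s[:, None] - s[None, :]                    # diff[i, j] = s_i - s_j
    mean_edges = c * np.exp(-0.5 * beta * (diff - 1.0) ** 2)
    np.fill_diagonal(mean_edges, 0.0)
    return rng.poisson(mean_edges), mean_edges


rng = np.random.default_rng(0)
s_true = np.array([2.0, 1.0, 0.0])                    # illustrative planted ranks
A, mean_edges = sample_network(s_true, beta=2.0, c=3.0, rng=rng)
```

Under this assumed form, c sets the overall edge density and beta controls how sharply edges concentrate on pairs one rank apart.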

11 Maximize the likelihood to find the ranks!
The usual tricks: maximize the log; take derivatives and set to zero; chuck out additive constants.
For low temperature (β large) or a sparse network (M small), the ML ranks are equal to the ranks minimizing the SR Hamiltonian! In other words: solving SpringRank is asymptotically equivalent to inferring s using the GM.
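The "usual tricks" can be written out as a short worked derivation. It assumes a Poisson model whose mean has the form $\lambda_{ij} = c\, e^{-\frac{\beta}{2}(s_i - s_j - 1)^2}$; that form, and the symbols $c$ and $\lambda_{ij}$, are assumptions of this sketch, since the slide's own formula is not shown here.

```latex
\begin{align}
\log P(A \mid s)
  &= \sum_{i,j} \left[ A_{ij} \log \lambda_{ij} - \lambda_{ij} - \log A_{ij}! \right] \\
  &= -\beta\, H(s) \;-\; \sum_{i,j} \lambda_{ij} \;+\; \text{const},
  \qquad H(s) = \frac{1}{2} \sum_{i,j} A_{ij} \left(s_i - s_j - 1\right)^2 \\
\frac{\partial \log P}{\partial s_i}
  &= -\beta \sum_j \left[ \left(A_{ij} - \lambda_{ij}\right)\left(s_i - s_j - 1\right)
     - \left(A_{ji} - \lambda_{ji}\right)\left(s_j - s_i - 1\right) \right] = 0
\end{align}
```

When the $\lambda_{ij}$ terms are negligible next to the observed $A_{ij}$, which is the regime the slide names, the last line reduces to $\sum_j \left[ A_{ij}(s_i - s_j - 1) - A_{ji}(s_j - s_i - 1) \right] = 0$, the stationarity condition of the SR Hamiltonian.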

12 Consistency: test SpringRank using GM synthetic data
Easy: 1. Plant s from a standard normal. 2. Infer ranks. 3. Compare to s.
[Figure, panels A and B: Spearman correlation vs. inverse temperature for SpringRank, BTL, Colley, David's Score, Minimum Violations, PageRank, and Eigenvector Centrality; panels labeled "=0.3" and "=2.1".]
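The three steps of the "easy" test (plant, infer, compare) can be sketched end to end. Everything here is illustrative: the generative form c · exp(−(β/2)(s_i − s_j − 1)²) is an assumption, the solver is a dense minimum-norm least-squares solve rather than the authors' implementation, and n, beta, c, and the seed are arbitrary choices.

```python
import numpy as np


def sample_network(s, beta, c, rng):
    """Poisson edges with assumed mean c * exp(-beta/2 * (s_i - s_j - 1)^2)."""
    diff = s[:, None] - s[None, :]
    mean_edges = c * np.exp(-0.5 * beta * (diff - 1.0) ** 2)
    np.fill_diagonal(mean_edges, 0.0)          # no self-loops (assumption)
    return rng.poisson(mean_edges)


def springrank_dense(A):
    """Minimum-norm solution of [D_out + D_in - (A + A^T)] s = d_out - d_in."""
    A = np.asarray(A, dtype=float)
    d_out, d_in = A.sum(axis=1), A.sum(axis=0)
    L = np.diag(d_out + d_in) - (A + A.T)
    s, *_ = np.linalg.lstsq(L, d_out - d_in, rcond=1e-10)
    return s


def spearman(x, y):
    """Spearman correlation as the Pearson correlation of ranks (ties not handled)."""
    rank_x = np.argsort(np.argsort(x))
    rank_y = np.argsort(np.argsort(y))
    return np.corrcoef(rank_x, rank_y)[0, 1]


rng = np.random.default_rng(42)
n = 60
s_planted = rng.standard_normal(n)                              # 1. plant s
A = sample_network(s_planted, beta=2.0, c=1.0, rng=rng)
s_inferred = springrank_dense(A)                                # 2. infer ranks
rho = spearman(s_planted, s_inferred)                           # 3. compare to s
```

Sweeping beta and repeating over seeds would produce one curve of the kind plotted on the slide; the other methods in the legend are not implemented here.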

13 Consistency: test SpringRank using GM synthetic data
Hard: 1. Plant s from three normals. 2. Infer ranks. 3. Compare to s.
[Figure, panels C and D: Spearman correlation vs. inverse temperature for SpringRank, BTL, Colley, David's Score, Minimum Violations, PageRank, and Eigenvector Centrality; panels labeled "=0.3" and "=2.1".]

14 How can we validate models using empirical data? [jokes]

15 Cross validation: train on 80%, predict 20%
In a linear hierarchy the key quantity to predict is edge direction, given edge existence. If i and j were to face off, who would win? I'll give you undirected(A), and you predict directed(A).
Setup: learn s from 80% of A. Then predict edge directions for the remaining 20% of A.
SpringRank predicts edge direction based on the relative direction probabilities: [equation not shown]
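A hedged sketch of one train/test split. The direction probability 1 / (1 + exp(−2β(s_i − s_j))) is what the ratio E[A_ij] / (E[A_ij] + E[A_ji]) works out to if the expected edge count has the assumed form c · exp(−(β/2)(s_i − s_j − 1)²); the slide's own formula is not shown, so treat this as an assumption. The function names, the 0.5 threshold, and the toy edge list are illustrative.

```python
import numpy as np


def fit_ranks(A):
    """Minimum-norm solution of [D_out + D_in - (A + A^T)] s = d_out - d_in."""
    d_out, d_in = A.sum(axis=1), A.sum(axis=0)
    L = np.diag(d_out + d_in) - (A + A.T)
    s, *_ = np.linalg.lstsq(L, d_out - d_in, rcond=1e-10)
    return s


def direction_probability(s_i, s_j, beta):
    """Assumed probability that an edge between i and j points from i to j."""
    return 1.0 / (1.0 + np.exp(-2.0 * beta * (s_i - s_j)))


def holdout_accuracy(edges, n, beta, rng, test_fraction=0.2):
    """Learn s from the training edges, then score direction on held-out edges."""
    edges = np.asarray(edges)
    order = rng.permutation(len(edges))
    n_test = max(1, int(round(test_fraction * len(edges))))
    test, train = edges[order[:n_test]], edges[order[n_test:]]
    A_train = np.zeros((n, n))
    np.add.at(A_train, (train[:, 0], train[:, 1]), 1.0)   # count training edges
    s = fit_ranks(A_train)
    # Each held-out edge truly points from test[:, 0] to test[:, 1];
    # it counts as correct when the model gives that direction p > 0.5.
    p = direction_probability(s[test[:, 0]], s[test[:, 1]], beta)
    return float(np.mean(p > 0.5))


# Toy perfect hierarchy on 6 nodes: i -> j whenever i < j (illustrative data).
edges = [(i, j) for i in range(6) for j in range(i + 1, 6)]
accuracy = holdout_accuracy(edges, n=6, beta=1.0, rng=np.random.default_rng(1))
```

Repeating the split (for example 5 folds over many random trials) and averaging gives the kind of accuracy figure the talk reports; exact ties in s are scored as misses in this sketch.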

16 Cross validation results: SR makes better predictions
Accuracy goal: maximize the number of correctly predicted edge directions.
[Figure: SpringRank prediction accuracy vs. BTL prediction accuracy, one panel each for Alakapurum, Business, Parakeet G1, and History; 50 independent trials of 5-fold cross validation (250 folds). Panel annotations as extracted: Alakapurum SR 0.99; Business SR 0.67; Parakeet G1 SR 0.71; History SR 0.98, BTL 0.02.]


17 Cross validation results: SR makes better predictions
Accuracy goal: maximize the number of correctly predicted edge directions.
[Figure, panel A: accuracy improvement over BTL for SpringRank and SpringRank (Regularized) on Alakapurum, Tenpatti, Parakeet G1, Parakeet G2, Business, Comp. Sci., History, Synthetic =1, and Synthetic =5; medians and quartiles; 50 independent trials of 5-fold cross validation (250 folds).]

18 Conclusions and extensions
1. SpringRank is a fast, scalable, predictive approach to ranking in directed networks. Works for dominance or endorsement.
2. It formalizes the idea that we can learn from not just the direction of edges, but the existence of edges as well.
3. There is a corresponding generative model whose ML solution is asymptotically equivalent. SpringRank approximates this ML solution with a convex Hamiltonian.
a. What if the hierarchy branches, like a Y?
b. What about other convex spring potential functions?
c. When might the spring constants depend on i or j?

19 Caterina De Bacco, Cris Moore
A physical model for efficient ranking in networks. De Bacco*, Larremore*, and Moore. In Revision.
{preprint, python, matlab, viz}: danlarremore.com

20 Thank you
