Purnamrita Sarkar (Carnegie Mellon) Deepayan Chakrabarti (Yahoo! Research) Andrew W. Moore (Google, Inc.)


2 Link prediction: which pair of nodes {i, j} should be connected? Variant: node i is given. Example graph: Alice, Bob, Charlie. Applications: friend suggestion in Facebook, movie recommendation in Netflix.

3 Predict a link between the nodes with the minimum number of hops, or with the most common neighbors (length-2 paths). Example with Alice, Bob (1000 followers) and Charlie (8 followers): prolific common friends give less evidence, while less prolific common friends give much more evidence. The Adamic/Adar score gives more weight to low-degree common neighbors.
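The two scores on this slide can be sketched as follows. This is a minimal illustration, assuming an undirected graph stored as a dictionary of neighbor sets; the toy graph and the node names (`u`, `v`, `w`, `prolific`, `quiet`) are hypothetical and not from the talk.

```python
import math
from collections import defaultdict

def build_adj(edges):
    """Build an undirected adjacency map {node: set of neighbors}."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    return adj

def common_neighbors(adj, i, j):
    """Nodes adjacent to both i and j (the midpoints of length-2 paths)."""
    return adj[i] & adj[j]

def adamic_adar(adj, i, j):
    """Sum of 1/log(degree) over common neighbors, so low-degree ones weigh more."""
    return sum(1.0 / math.log(len(adj[k]))
               for k in common_neighbors(adj, i, j)
               if len(adj[k]) > 1)  # guard against log(1) = 0

# Hypothetical toy graph: u and v share a high-degree friend,
# u and w share a low-degree friend.
edges = [("u", "prolific"), ("v", "prolific"),
         ("prolific", "x1"), ("prolific", "x2"), ("prolific", "x3"),
         ("u", "quiet"), ("w", "quiet")]
adj = build_adj(edges)
score_uv = adamic_adar(adj, "u", "v")  # via the degree-5 node
score_uw = adamic_adar(adj, "u", "w")  # via the degree-2 node
```

Both pairs have exactly one common neighbor, so the common-neighbors count cannot separate them, while the Adamic/Adar score ranks the pair joined through the low-degree node higher.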

4 Predict a link between the nodes: with the minimum number of hops; with more common neighbors (length-2 paths); with a larger Adamic/Adar score; with more short paths (e.g. length-3 paths).
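As a rough illustration of the last heuristic, the sketch below counts simple length-3 paths between two nodes. It assumes an undirected graph without self-loops, stored as a dictionary of neighbor sets; the function name and the toy graph are hypothetical.

```python
def count_paths_len3(adj, i, j):
    """Count simple paths i - a - b - j with four distinct nodes."""
    count = 0
    for a in adj[i]:
        if a == j:
            continue
        for b in adj[a]:
            if b == i or b == j:
                continue
            if j in adj[b]:
                count += 1
    return count

# Hypothetical sparse graph: i and j have no common neighbor,
# but two length-3 paths connect them.
adj3 = {
    "i": {"a", "c"},
    "a": {"i", "b"},
    "c": {"i", "b"},
    "b": {"a", "c", "j"},
    "j": {"b"},
}
n_common = len(adj3["i"] & adj3["j"])
n_len3 = count_paths_len3(adj3, "i", "j")
```

In this toy graph the common-neighbors count is zero, so only the longer paths carry any evidence about the pair.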

5 Link prediction accuracy* for Random, Shortest Path, Common Neighbors, Adamic/Adar, and Ensemble of short paths. How do we justify these observations, especially if the graph is sparse? *Liben-Nowell & Kleinberg, 2003; Brand, 2005; Sarkar & Moore, 2007

6 Raftery et al.'s model: nodes are uniformly distributed in a latent space (a unit volume universe). Points close in this space are more likely to be connected. The problem of link prediction is to find the nearest neighbor who is not currently linked to the node. This is equivalent to inferring distances in the latent space.

7 Two sources of randomness. Point positions: uniform in D-dimensional space. Linkage probability: logistic with parameters $\alpha$ and $r$; $\alpha$, $r$ and D are known. Points within radius $r$ have a higher probability of linking, and $\alpha$ determines the steepness.
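The two sources of randomness can be sketched as a small sampler. The logistic form $1/(1+e^{\alpha(d-r)})$ used here is an assumption about the exact parameterization (the slide names only the parameters), the unit hypercube stands in for the unit volume universe, and the function names are hypothetical.

```python
import math
import random

def link_probability(d, r, alpha):
    """Assumed logistic link probability 1 / (1 + exp(alpha * (d - r)))."""
    z = alpha * (d - r)
    if z >= 0:
        e = math.exp(-z)          # numerically stable branch for large z
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(z))

def sample_graph(n, dim, r, alpha, seed=0):
    """Sample positions uniformly in [0, 1]^dim, then sample each edge independently."""
    rng = random.Random(seed)
    points = [tuple(rng.random() for _ in range(dim)) for _ in range(n)]
    edges = set()
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(points[i], points[j])
            if rng.random() < link_probability(d, r, alpha):
                edges.add((i, j))
    return points, edges

points, edges = sample_graph(n=50, dim=2, r=0.2, alpha=1e6)
```

With a very large `alpha` the sampler behaves almost like a hard threshold at distance `r`, which is the limit the later slides analyze.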

8 Link prediction accuracy*, especially if the graph is sparse: Random, Shortest Path, Common Neighbors, Adamic/Adar, Ensemble of short paths. *Liben-Nowell & Kleinberg, 2003; Brand, 2005; Sarkar & Moore, 2007

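9 Common neighbors of i and j: $\Pr_2(i,j) = \Pr(\text{common neighbor} \mid d_{ij})$. This is a product of two logistic probabilities, integrated over a volume determined by $d_{ij}$. As $\alpha \to \infty$, the logistic becomes a step function, which is much easier to analyze!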
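One way to write out the quantity on this slide is sketched below. It assumes that links are independent given the node positions, that the position $x_k$ of a third node is uniform with density 1 on the unit volume universe, and that $A(r, r, d_{ij})$ denotes the volume of the intersection of two radius-$r$ balls whose centers are $d_{ij}$ apart.

```latex
\Pr\nolimits_2(i,j)
  = \int \Pr(i \sim k \mid d_{ik}) \, \Pr(k \sim j \mid d_{kj}) \, dx_k ,
\qquad
\Pr(i \sim k \mid d_{ik}) \xrightarrow{\ \alpha \to \infty\ } \mathbf{1}\{ d_{ik} \le r \}

\Pr\nolimits_2(i,j) \xrightarrow{\ \alpha \to \infty\ }
  \int \mathbf{1}\{ d_{ik} \le r \} \, \mathbf{1}\{ d_{kj} \le r \} \, dx_k
  = A(r, r, d_{ij})
```

Under these assumptions the step-function limit replaces the integral of a product of logistics with a purely geometric quantity that shrinks as $d_{ij}$ grows.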

10 Unit volume universe; everyone has the same radius $r$. $\eta$ = number of common neighbors of i and j. Empirical Bernstein bounds give bounds on the distance $d_{ij}$. $V(r)$ = volume of a ball of radius $r$ in D dimensions.
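For reference, $V(r)$ has a standard closed form, $V(r) = \pi^{D/2} r^D / \Gamma(D/2 + 1)$. The helper below is a small sketch of that formula; the function name is hypothetical.

```python
import math

def ball_volume(r, dim):
    """Volume of a Euclidean ball of radius r in `dim` dimensions."""
    return math.pi ** (dim / 2) / math.gamma(dim / 2 + 1) * r ** dim

v1 = ball_volume(0.5, 1)   # an interval of length 2r
v2 = ball_volume(1.0, 2)   # a disc of area pi * r^2
v3 = ball_volume(1.0, 3)   # a sphere of volume 4/3 * pi * r^3
```

In a unit volume universe, $V(r)$ is also the probability that a uniformly placed node falls within distance $r$ of a given point, ignoring boundary effects.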

11 OPT = node closest to i. MAX = node with the most common neighbors with i. Theorem: w.h.p. $d_{OPT} \le d_{MAX} \le d_{OPT} + 2[\ldots]$. Common neighbors is an asymptotically optimal heuristic as $N \to \infty$.
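The comparison between OPT and MAX can be explored with a small simulation. This is only a sketch under stated assumptions: the step-function model (two nodes are linked exactly when they are within distance `r`), a unit torus in place of the unit volume universe to avoid boundary effects, and candidates restricted to nodes not yet linked to i. All names and parameter values are hypothetical.

```python
import math
import random

def torus_dist(a, b):
    """Euclidean distance on the unit torus (wrap-around in every coordinate)."""
    return math.sqrt(sum(min(abs(x - y), 1.0 - abs(x - y)) ** 2
                         for x, y in zip(a, b)))

def opt_vs_max(n=400, dim=2, r=0.15, seed=0):
    rng = random.Random(seed)
    pts = [[rng.random() for _ in range(dim)] for _ in range(n)]
    # Step-function model: link iff distance <= r.
    nbrs = [{k for k in range(n)
             if k != i and torus_dist(pts[i], pts[k]) <= r}
            for i in range(n)]
    i = 0
    candidates = [j for j in range(1, n) if j not in nbrs[i]]
    d = {j: torus_dist(pts[i], pts[j]) for j in candidates}
    opt = min(candidates, key=d.get)                              # closest unlinked node
    best = max(candidates, key=lambda j: len(nbrs[i] & nbrs[j]))  # most common neighbors
    return d[opt], d[best], len(nbrs[i] & nbrs[best])

d_opt, d_max, n_common_max = opt_vs_max()
```

By construction `d_opt` never exceeds `d_max`; the interesting quantity is the gap between them, which can be inspected for growing `n`.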

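12 Node k has radius $r_k$. $i \to k$ if $d_{ik} \le r_k$ (directed graph), so $r_k$ captures the popularity of node k. Common neighbors k of i and j come in two types, Type 1 and Type 2, distinguished by edge direction. If k links to both i and j, k lies within $r_i$ of i and within $r_j$ of j, a region of volume $A(r_i, r_j, d_{ij})$. If i and j both link to k, k lies within $r_k$ of both, a region of volume $A(r_k, r_k, d_{ij})$.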

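13 Example graph: $N_1$ nodes of radius $r_1$ and $N_2$ nodes of radius $r_2$, with $r_1 \ll r_2$. $\eta_1 \sim \mathrm{Bin}[N_1, A(r_1, r_1, d_{ij})]$ and $\eta_2 \sim \mathrm{Bin}[N_2, A(r_2, r_2, d_{ij})]$. Maximize $\Pr[\eta_1, \eta_2 \mid d_{ij}]$, a product of two binomials. The maximizer $d^*$ satisfies $w(r_1) E[\eta_1 \mid d^*] + w(r_2) E[\eta_2 \mid d^*] = w(r_1) \eta_1 + w(r_2) \eta_2$. As the RHS increases, the LHS must increase, so $d^*$ decreases.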
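The maximization on this slide can be illustrated numerically. The sketch below assumes D = 2, where $A(r, r, d)$ is the area of the lens formed by two discs of radius $r$ at distance $d$, and replaces the analytic maximization with a plain grid search. The function names and the example counts are hypothetical.

```python
import math

def lens_area(r, d):
    """Area of the intersection of two radius-r discs with centers d apart (2-D)."""
    if d >= 2 * r:
        return 0.0
    ratio = min(1.0, d / (2 * r))
    chord = math.sqrt(max(0.0, 4 * r * r - d * d))
    return 2 * r * r * math.acos(ratio) - 0.5 * d * chord

def log_likelihood(d, groups):
    """groups: list of (radius, n_nodes, n_common). Binomial coefficients are
    dropped because they do not depend on d. Assumes pi * r^2 < 1."""
    total = 0.0
    for r, n, eta in groups:
        a = lens_area(r, d)
        if a <= 0.0:
            if eta > 0:
                return float("-inf")  # observed common neighbors are impossible at this d
            continue
        total += eta * math.log(a) + (n - eta) * math.log(1.0 - a)
    return total

def mle_distance(groups, steps=2000):
    """Grid-search estimate of the distance that maximizes the likelihood."""
    d_hi = 2 * max(r for r, _, _ in groups)
    grid = [d_hi * (s + 1) / steps for s in range(steps)]
    return max(grid, key=lambda d: log_likelihood(d, groups))

# Same count from the large-radius group, different counts from the small-radius group.
d_few = mle_distance([(0.05, 1000, 1), (0.2, 1000, 100)])
d_many = mle_distance([(0.05, 1000, 5), (0.2, 1000, 100)])
```

More common neighbors of small radius pull the estimate toward a smaller distance, which is the qualitative behavior the weighted sum on the slide describes.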

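14 The weight $w(r)$ combines a Jacobian term and a variance term. For small $r$: small variance, and the presence of a common neighbor is more surprising; the weight behaves like $1/r$, as in Adamic/Adar. For $r$ close to the max radius: small variance, and the absence of a common neighbor is more surprising. A bracketed range on the plot is labeled: real world graphs generally fall in this range.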

15 Link prediction accuracy*, especially if the graph is sparse: Random, Shortest Path, Common Neighbors, Adamic/Adar, Ensemble of short paths. *Liben-Nowell & Kleinberg, 2003; Brand, 2005; Sarkar & Moore, 2007

16 Common neighbors = 2-hop paths. Analysis of longer paths has two components. 1. Bounding $E(\eta_\ell \mid d_{ij})$, where $\eta_\ell$ = number of $\ell$-hop paths: bound $\Pr_\ell(i,j)$ by using the triangle inequality on a series of common neighbor probabilities. 2. Showing $\eta_\ell \approx E(\eta_\ell \mid d_{ij})$. Figure: triangulation.

17 Common neighbors = 2-hop paths. Analysis of longer paths has two components. 1. Bounding $E(\eta_\ell \mid d_{ij})$, where $\eta_\ell$ = number of $\ell$-hop paths: bound $\Pr_\ell(i,j)$ by using the triangle inequality on a series of common neighbor probabilities. 2. Showing $\eta_\ell \approx E(\eta_\ell \mid d_{ij})$: $\eta_\ell$ has bounded dependence on the position of each node, so McDiarmid's inequality can be used to bound $|\eta_\ell - E(\eta_\ell \mid d_{ij})|$.
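For reference, the standard statement of McDiarmid's inequality is given below in generic notation ($f$, $X_k$, $c_k$, $t$ are not symbols from the slides). In the setting above, $f$ would play the role of the path count and the $X_k$ the node positions.

```latex
\text{Let } X_1, \dots, X_n \text{ be independent, and let } f \text{ satisfy }
|f(x) - f(x')| \le c_k \text{ whenever } x \text{ and } x' \text{ differ only in coordinate } k.
\text{ Then for all } t > 0:

\Pr\bigl( |f(X_1, \dots, X_n) - E[f(X_1, \dots, X_n)]| \ge t \bigr)
  \le 2 \exp\!\left( - \frac{2 t^2}{\sum_{k=1}^{n} c_k^2} \right)
```

The bounded-dependence condition on the slide corresponds to the constants $c_k$: moving a single node can change the number of paths by only a limited amount.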

18 Bound $d_{ij}$ as a function of $\eta_\ell$ using McDiarmid's inequality. For $\ell' > \ell$ we need $\eta_{\ell'} \gg \eta_\ell$ to obtain similar bounds. Also, we can obtain much tighter bounds for long paths if shorter paths are known to exist.

19 For the logistic model, the bound is weaker by a constant factor. It can be made tighter as the logistic approaches the step function.

20 Three key ingredients. 1. Closer points are likelier to be linked (Small World Model: Watts & Strogatz, 1998; Kleinberg). 2. Triangle inequality holds: necessary to extend to $\ell$-hop paths. 3. Points are spread uniformly at random: otherwise properties will depend on location as well as distance.

21 Link prediction accuracy* (Random, Shortest Path, Common Neighbors, Adamic/Adar, Ensemble of short paths). Differentiating between different degrees is important. For large dense graphs, common neighbors are enough. The number of paths matters, not the length. In sparse graphs, paths of length 3 or more help in prediction. *Liben-Nowell & Kleinberg, 2003; Brand, 2005; Sarkar & Moore, 2007

23 Summary. Both the generative model and the link prediction heuristics answer the question "most likely neighbor of node i: node a or node b?". Comparing the two, using a few properties of the model, can justify the empirical observations. We also offer some new prediction algorithms.

24 Combine bounds from different radii. But there might not be enough data to obtain individual bounds from each radius. New sweep estimator: $Q_r$ = fraction of nodes with radius $\le r$ which are common neighbors. Higher $Q_r$ implies smaller $d_{ij}$ w.h.p.

25 $Q_r$ = fraction of nodes with radius $\le r$ which are common neighbors. Larger $Q_r$ implies smaller $d_{ij}$ w.h.p. $T_R$ = fraction of nodes with radius $\ge R$ which are common neighbors. Smaller $T_R$ implies larger $d_{ij}$ w.h.p.

26 Start from the number of common neighbors of a given radius $r$. $Q_r$ = fraction of nodes with radius $\le r$ which are common neighbors. $T_R$ = fraction of nodes with radius $\ge R$ which are common neighbors. Large $Q_r$ implies small $d_{ij}$; small $T_R$ implies large $d_{ij}$.
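The two sweep statistics can be computed from per-node data as sketched below. The sketch assumes the cumulative reading, in which $Q_r$ pools all nodes with radius at most $r$ and $T_R$ pools all nodes with radius at least $R$; the function names and the sample data are hypothetical.

```python
def q_fraction(nodes, r):
    """Fraction of nodes with radius <= r that are common neighbors.
    nodes: list of (radius, is_common_neighbor) pairs. Returns None if no node qualifies."""
    flags = [is_cn for radius, is_cn in nodes if radius <= r]
    return sum(flags) / len(flags) if flags else None

def t_fraction(nodes, big_r):
    """Fraction of nodes with radius >= big_r that are common neighbors."""
    flags = [is_cn for radius, is_cn in nodes if radius >= big_r]
    return sum(flags) / len(flags) if flags else None

# Hypothetical data for one pair (i, j): radius of each other node and
# whether that node is a common neighbor of i and j.
nodes = [(0.1, True), (0.1, False), (0.2, True), (0.3, False), (0.3, False)]
q_small = q_fraction(nodes, 0.1)
q_mid = q_fraction(nodes, 0.2)
t_large = t_fraction(nodes, 0.3)
```

Pooling nodes across a range of radii gives each statistic more data than a single radius would, which is the motivation the slides give for the sweep estimator.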
