Link Prediction. Eman Badr, Mohammed Saquib Akmal Khan


Link Prediction. Eman Badr, Mohammed Saquib Akmal Khan. 11-06-2013

Link Prediction Which pair of nodes should be connected?

Applications Facebook friend suggestion; recommendation systems; monitoring and controlling computer viruses that use email as a vector; predicting unobserved links in protein-protein interaction networks in biological systems.

Methods for Link Prediction Link prediction relies on homophily: similar nodes are more likely to be connected. All methods assign a connection weight score(i, j) to each pair of nodes i, j based on the input graph, and then produce a ranked list in decreasing order of score(i, j). This can be viewed as computing a measure of proximity or similarity between nodes i and j.

Typical Approaches Common neighbors; Jaccard coefficient; degree product; shortest path, where score(i, j) is 1 divided by the length of the shortest path through the network from i to j (or zero for pairs not connected by any path).

Typical Approaches (cont.) The Adamic/Adar link prediction heuristic gives more weight to low-degree common neighbors.
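To make these heuristics concrete, here is a minimal Python sketch of the scoring functions from the previous two slides; networkx and the karate-club toy graph are illustrative choices, not part of the original slides.

```python
# Classical link-prediction heuristics: score(i, j) for a node pair,
# then rank all unlinked pairs in decreasing order of score.
import math
import networkx as nx

def common_neighbors(G, i, j):
    return len(set(G[i]) & set(G[j]))

def jaccard(G, i, j):
    union = set(G[i]) | set(G[j])
    return len(set(G[i]) & set(G[j])) / len(union) if union else 0.0

def degree_product(G, i, j):
    return G.degree(i) * G.degree(j)

def shortest_path_score(G, i, j):
    # 1 / (shortest-path length), or 0 for disconnected pairs
    try:
        return 1.0 / nx.shortest_path_length(G, i, j)
    except nx.NetworkXNoPath:
        return 0.0

def adamic_adar(G, i, j):
    # low-degree common neighbors get more weight via 1/log(deg)
    return sum(1.0 / math.log(G.degree(k)) for k in set(G[i]) & set(G[j]))

G = nx.karate_club_graph()
ranked = sorted(nx.non_edges(G), key=lambda p: adamic_adar(G, *p), reverse=True)
print(ranked[:5])  # top-5 predicted links under Adamic/Adar
```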

Hierarchical Structure and Predicting Missing Links Consists of three parts: inferring hierarchical structure from network data; generating random graphs that are statistically similar to real ones; predicting missing links using the hierarchical structure.

Inferring hierarchical structure from network data. Nature 453, 98-101 (2008). Aaron Clauset, Cristopher Moore, M. E. J. Newman

Hierarchical Structure Recent studies suggest that networks often exhibit hierarchical organization, in which vertices divide into groups that further subdivide into groups of groups, and so on over multiple scales.

Clustering and Hierarchy

Hierarchical Random Graph A binary tree (dendrogram) $T$: leaves are the original vertices; internal nodes represent communities. Each internal node $i$ carries a probability $p_i$, and two vertices are connected with probability $p_i$, where $i$ is their lowest common ancestor.
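A minimal generative sketch of this model, assuming an illustrative nested-tuple dendrogram representation (the labels and probabilities are made up for the example):

```python
# Hierarchical random graph: each internal node of the binary
# dendrogram carries a probability p_i, and each pair of leaves is
# connected with the p_i of its lowest common ancestor.
import random

# ('internal', p_i, left_subtree, right_subtree); leaves are labels
dendrogram = ('internal', 0.1,
              ('internal', 0.8, 'a', 'b'),
              ('internal', 0.6, 'c', 'd'))

def leaves(t):
    return [t] if not isinstance(t, tuple) else leaves(t[2]) + leaves(t[3])

def sample_edges(t, rng=random.Random(0)):
    """Draw one graph from the HRG defined by dendrogram t."""
    if not isinstance(t, tuple):
        return []
    _, p, left, right = t
    edges = sample_edges(left, rng) + sample_edges(right, rng)
    # pairs split between the two subtrees have this node as their LCA
    for u in leaves(left):
        for v in leaves(right):
            if rng.random() < p:
                edges.append((u, v))
    return edges

print(sample_edges(dendrogram))
```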

Maximum Likelihood For each internal node $i$: $L_i$ and $R_i$ are the numbers of leaf descendants of its left and right subtrees, and $E_i$ is the number of edges between them. The likelihood that exactly these edges exist, and not others, is $\mathcal{L}_i = p_i^{E_i} (1 - p_i)^{L_i R_i - E_i}$.

Maximum Likelihood (cont.) Each $\mathcal{L}_i$ is maximized by $\bar{p}_i = \frac{E_i}{L_i R_i}$. The likelihood of the entire tree is then $\mathcal{L}(D) = \prod_i \mathcal{L}_i$, and the log-likelihood is $\ln \mathcal{L}(D) = -\sum_i L_i R_i \, h\!\left(\frac{E_i}{L_i R_i}\right)$, where $h(p) = -p \ln p - (1 - p) \ln(1 - p)$.
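A small sketch of this computation; the $(E_i, L_i, R_i)$ triples below are illustrative:

```python
# Log-likelihood of a dendrogram at the MLE p_i = E_i / (L_i R_i),
# using ln L(D) = -sum_i L_i R_i h(E_i / (L_i R_i)).
import math

def h(p):
    # entropy term; h(0) = h(1) = 0 by convention
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log(p) - (1 - p) * math.log(1 - p)

def log_likelihood(internal_nodes):
    # internal_nodes: iterable of (E_i, L_i, R_i) triples
    return -sum(L * R * h(E / (L * R)) for E, L, R in internal_nodes)

print(log_likelihood([(2, 2, 2), (1, 1, 4)]))  # toy dendrogram
```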

Maximum Likelihood (cont.) Worked example for two candidate dendrograms of the same small graph: $\mathcal{L}(T_1) = \frac{1}{3}\left(\frac{2}{3}\right)^2 \cdot \left(\frac{2}{8}\right)^2 \left(\frac{6}{8}\right)^6 \approx 0.0016$ and $\mathcal{L}(T_2) = \frac{1}{9}\left(\frac{8}{9}\right)^8 \approx 0.0433$, so $T_2$ is the more plausible hierarchy.

Markov Chain A Markov chain Monte Carlo method samples dendrograms $D$ with probability proportional to their likelihood $\mathcal{L}(D)$. We accept a proposed transition $D \to D'$ whenever $\Delta \log \mathcal{L} = \log \mathcal{L}(D') - \log \mathcal{L}(D)$ is nonnegative, so that $D'$ is at least as likely as $D$; otherwise the move is accepted with probability $e^{\Delta \log \mathcal{L}}$, the standard Metropolis rule.
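A sketch of one sampling step, assuming `propose` (a local rearrangement of the dendrogram) and `log_L` are supplied by surrounding code; accepting downhill moves with probability $e^{\Delta \log \mathcal{L}}$ is the standard Metropolis rule:

```python
# One MCMC step over dendrograms: propose a local rearrangement
# D -> D_new, always accept uphill moves, accept downhill moves with
# probability exp(delta).
import math
import random

def metropolis_step(D, log_L, propose, rng=random):
    D_new = propose(D)
    delta = log_L(D_new) - log_L(D)
    if delta >= 0 or rng.random() < math.exp(delta):
        return D_new  # accepted: D_new is at least as likely, or lucky
    return D          # rejected: keep the current dendrogram
```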

Generating random graphs which are statistically similar to real ones

Resampling from the hierarchical random graph
1. Initialize the Markov chain by choosing a random starting dendrogram.
2. Run the Monte Carlo algorithm until equilibrium is reached.
3. Sample dendrograms at regular intervals from those generated by the Markov chain.
4. For each sampled dendrogram $D$, create a resampled graph $G$ with $n$ vertices by placing an edge between each of the $n(n-1)/2$ vertex pairs $(i, j)$ with independent probability $p_r$, where $r$ is the lowest common ancestor of $i$ and $j$ in $D$.

Resampling from the hierarchical random graph (cont.) [Figure: a real food web alongside a hierarchy inferred from it; vertices are labeled herbivore, plant, and parasite.]

Network Statistics

Consensus Dendrograms Sample many trees instead of one. From phylogeny construction: combine these into a consensus dendrogram.
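As an illustration of the consensus idea, a majority-rule sketch in the spirit of phylogenetics: keep every cluster of leaves that appears in more than half of the sampled dendrograms. The nested-tuple format carries over from the earlier sketch and is an assumption, not the paper's data structure.

```python
# Majority-rule consensus over sampled dendrograms: keep every leaf
# cluster (the leaf set under some internal node) that appears in more
# than half of the samples.
from collections import Counter

def clusters(t):
    # nested-tuple dendrogram as in the earlier sketches
    if not isinstance(t, tuple):
        return [], frozenset([t])
    _, _, left, right = t
    cl, sl = clusters(left)
    cr, sr = clusters(right)
    s = sl | sr
    return cl + cr + [s], s

def majority_consensus(dendrograms):
    counts = Counter()
    for t in dendrograms:
        cs, _ = clusters(t)
        counts.update(set(cs))
    return [c for c, n in counts.items() if n > len(dendrograms) / 2]

t1 = ('internal', 0.1, ('internal', 0.8, 'a', 'b'), ('internal', 0.6, 'c', 'd'))
t2 = ('internal', 0.2, ('internal', 0.7, 'a', 'b'), ('internal', 0.5, 'd', 'c'))
print(majority_consensus([t1, t2]))  # {'a','b'}, {'c','d'}, {'a','b','c','d'}
```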

Link Prediction

Predicting missing connections
1. Initialize the Markov chain by choosing a random starting dendrogram.
2. Run the Monte Carlo algorithm until equilibrium is reached.
3. Sample dendrograms at regular intervals thereafter from those generated by the Markov chain.
4. For each pair of vertices $i, j$ for which there is not already a known connection, calculate the mean probability $\langle p_{ij} \rangle$ that they are connected, by averaging over the corresponding probabilities $p_{ij}$ in each of the sampled dendrograms $D$.
5. Sort these pairs in decreasing order of $\langle p_{ij} \rangle$ and predict that the highest-ranked ones have missing connections (see the sketch below).
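A sketch of steps 4 and 5, assuming helpers `sample_dendrograms` (not shown) and `lca_probability`, which returns the $p_{ij}$ of a pair's lowest common ancestor in a given dendrogram:

```python
# Steps 4 and 5: average each unconnected pair's connection probability
# over the sampled dendrograms, then rank pairs by the mean.
from itertools import combinations

def predict_missing_links(nodes, known_edges, sampled_dendrograms,
                          lca_probability):
    known = {frozenset(e) for e in known_edges}
    scores = {}
    for i, j in combinations(nodes, 2):
        if frozenset((i, j)) in known:
            continue  # only score pairs with no observed connection
        probs = [lca_probability(D, i, j) for D in sampled_dendrograms]
        scores[(i, j)] = sum(probs) / len(probs)  # mean p_ij
    # highest-ranked pairs are predicted to be missing connections
    return sorted(scores, key=scores.get, reverse=True)
```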

Comparison with other link prediction algorithms

Theoretical Justification of Popular Link Prediction Heuristics. IJCAI'11: Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence, Vol. 3. Purnamrita Sarkar, Deepayan Chakrabarti, Andrew W. Moore

Theoretical Justification of Popular Link Prediction Heuristics Some of the most popular heuristics are counting common neighbors, Adamic/Adar, and short paths. These heuristics have proven to perform very well on the link prediction problem. Why?

Approach (Latent Space Model) Raftery et al.'s model: nodes are uniformly distributed in a latent space (a unit-volume universe), and points close in this space are more likely to be connected. The link prediction problem is then to find the nearest neighbor not currently linked to the node, which is equivalent to inferring distances in the latent space.

Link Prediction Generative Model $$P(i \sim j \mid d_{ij}) = \frac{1}{1 + e^{\alpha (d_{ij} - r)}}$$ The probability of linking is higher at small distances; $\alpha$ determines the steepness of the logistic curve, and $r$ is the radius at which the probability falls to $\tfrac{1}{2}$.
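A direct transcription of this link probability into code, with illustrative values of $\alpha$ and $r$:

```python
# The logistic link probability: alpha sets the steepness of the curve
# and r the distance at which the probability crosses 1/2.
import math

def link_probability(d_ij, alpha=5.0, r=1.0):
    return 1.0 / (1.0 + math.exp(alpha * (d_ij - r)))

for d in (0.5, 1.0, 1.5):
    print(d, round(link_probability(d), 3))  # 0.924, 0.5, 0.076
```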

Common Neighbors $$\Pr(k \text{ is a common neighbor of } i, j \mid d_{ij}) = \Pr(i \sim k \sim j \mid d_{ij}) = \iint \Pr(i \sim k \mid d_{ik}) \, \Pr(j \sim k \mid d_{jk}) \, P(d_{ik}, d_{jk} \mid d_{ij}) \, d(d_{ik}) \, d(d_{jk})$$ A product of two logistic probabilities, integrated over a volume determined by $d_{ij}$.

Common Neighbors $\eta$ = number of common neighbors, which concentrates around $N \cdot A(r, r, d_{ij})$, where $A(r, r, d)$ is the volume of intersection of two radius-$r$ balls at distance $d$ and $V(r)$ is the volume of a radius-$r$ ball in $D$ dimensions. Empirical Bernstein bounds give, with high probability, $\left|\frac{\eta}{N} - A(r, r, d_{ij})\right| \le \varepsilon$, which sandwiches the distance: $$2r\left(1 - \left(\frac{\eta/N + \varepsilon}{V(r)}\right)^{1/D}\right) \le d_{ij} \le 2r\left(1 - \left(\frac{\eta/N - \varepsilon}{V(r)}\right)^{2/D}\right)^{1/2}$$

Common Neighbors (cont.) OPT = the node closest to $i$; MAX = the node with the most common neighbors with $i$. Theorem: $d_{OPT} \le d_{MAX} \le d_{OPT} + 2\,[\varepsilon / V(1)]^{1/D}$, with $\varepsilon = c_1 (\mathrm{var}_N / N)^{1/2} + c_2 / (N - 1)$.

Deterministic Model with Distinct Radii Consider nodes $i$, $j$ and $k$ with radii $r_i$, $r_j$ and $r_k$ respectively. Connectivity model: $i \to j$ iff $d_{ij} \le r_j$ (directed graph).

Common Neighbors, Distinct Radii Type 1: $i \leftarrow k \to j$ (the common neighbor $k$ links to both $i$ and $j$), confining $k$ to a region of volume $A(r_i, r_j, d_{ij})$. Type 2: $i \to k \leftarrow j$ (both $i$ and $j$ link to $k$), volume $A(r_k, r_k, d_{ij})$.

Type 2 Common Neighbors Constraint: $d_{ij} \le d_{ik} + d_{jk} \le 2 r_k$. Intuition: weight common neighbors differently, depending on their radii.

Type 2 Common Neighbors: Example Toy network: $N_1$ nodes of radius $r_1$ and $N_2$ nodes of radius $r_2$, with $r_1 < r_2$; $\eta_1$ and $\eta_2$ are the numbers of common neighbors of each type. Mixture of regions: $A(r_1, r_1, d_{ij})$ and $A(r_2, r_2, d_{ij})$.

Type 2 Common Neighbors: Example (cont.) $\eta_1 \sim \mathrm{Bin}(N_1, A(r_1, r_1, d_{ij}))$ and $\eta_2 \sim \mathrm{Bin}(N_2, A(r_2, r_2, d_{ij}))$. Maximize $P(\eta_1, \eta_2 \mid d_{ij})$, a product of two binomials; the maximizing distance $d^*$ satisfies $w(r_1)\,E[\eta_1 \mid d^*] + w(r_2)\,E[\eta_2 \mid d^*] = w(r_1)\,\eta_1 + w(r_2)\,\eta_2$.
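An illustrative sketch of this estimation by grid search; the co-neighborhood volume `A(r, d)` below is a hypothetical linear stand-in, not the paper's exact intersection-volume formula:

```python
# Grid-search MLE for d_ij from two binomial observations. A(r, d) is a
# HYPOTHETICAL co-neighborhood volume (shrinking linearly until d = 2r),
# standing in for the true intersection-volume formula.
from math import comb

def A(r, d):
    return max(0.0, 1.0 - d / (2 * r)) * 0.5  # hypothetical proxy

def binom_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

def mle_distance(eta1, N1, r1, eta2, N2, r2, steps=200):
    grid = [2 * max(r1, r2) * t / steps for t in range(1, steps + 1)]
    # maximize the product of the two binomial likelihoods over d
    return max(grid, key=lambda d: binom_pmf(eta1, N1, A(r1, d)) *
                                   binom_pmf(eta2, N2, A(r2, d)))

# many small-radius common neighbors, few large-radius ones
print(mle_distance(eta1=8, N1=50, r1=1.0, eta2=2, N2=20, r2=2.0))
```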

Type 2 Common Neighbors: Example (cont.) The weights reflect surprise: when the variance is small, the presence of a common neighbor is the more surprising (and hence more informative) event; when $r$ is close to the maximum radius, its absence is the more surprising event.

Summary The simple heuristic of counting common neighbors often outperforms more complicated heuristics. Weighted counts of common neighbors outperform the unweighted count.

Supervised Random Walks. WSDM '11: Proceedings of the Fourth ACM International Conference on Web Search and Data Mining. Lars Backstrom, Jure Leskovec

Link Prediction / Link Recommendation Problem Friendship is central to social networks. How do we predict future interactions? How do we recommend potential friends to a new user?

Motivation Predicting future interactions has direct business consequences (Facebook, MySpace): possible collaborations, joint ventures. Beyond social networks: predicting co-authorships, collaborations, protein-protein interactions, security applications.

Challenges Real networks are extremely sparse. Modeling social networks using intrinsic features of the network (node- and edge-level attributes).

General Approaches Classification; ranking nodes (PageRank, Personalized PageRank, Random Walk with Restarts).

Our Approach A combination of classification and node ranking, based on Supervised Random Walks: combine the network structure with the characteristics of nodes and edges. Develop an algorithm that estimates edge strengths and uses them to bias a PageRank-like random walk so that it visits given nodes more often.

Random Walk with Restarts (RWR) Measures the proximity of nodes in a graph (Pan et al., 2004). Start a random walk at a node s and compute the proximity of every other node to s; at each step the walk returns to s with restart probability α.
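A minimal power-iteration sketch of RWR, $p \leftarrow (1 - \alpha)\, P^{\top} p + \alpha\, e_s$, on a toy adjacency dict (both illustrative choices):

```python
# Random Walk with Restarts by power iteration:
# p <- (1 - alpha) * P^T p + alpha * e_s, with restart probability alpha.
def rwr(adj, s, alpha=0.15, iters=100):
    nodes = list(adj)
    p = {u: 1.0 if u == s else 0.0 for u in nodes}
    for _ in range(iters):
        nxt = {u: (alpha if u == s else 0.0) for u in nodes}
        for u in nodes:
            for v in adj[u]:
                # spread the walker's mass along out-edges uniformly
                nxt[v] += (1 - alpha) * p[u] / len(adj[u])
        p = nxt
    return p  # proximity of every node to the seed s

adj = {'s': ['a', 'b'], 'a': ['s', 'b'], 'b': ['s', 'a', 'c'], 'c': ['b']}
print(rwr(adj, 's'))
```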

Problem Formulation Given $G(V, E)$, a start node $s$, and learning candidates $C = \{c_i\}$: destination nodes $D = \{d_1, \ldots, d_k\}$ and no-link nodes $L = \{l_1, \ldots, l_n\}$, with $C = D \cup L$. For each edge $(u, v)$ we compute the strength $a_{uv} = f_w(\psi_{uv})$, where $\psi_{uv}$ is the edge's feature vector and $w$ the parameters to learn.

Optimization Eq. (1) is the hard version of the optimization problem (every destination node is constrained to score higher than every no-link node) and Eq. (2) is the soft version, which replaces the constraints with a penalty on violations; a sketch of the soft version follows.
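A sketch of the soft objective under the stated assumption: an $\ell_2$ regularizer plus a loss $h(p_l - p_d)$ on violated constraints. The squared hinge used for $h$ is one common choice, not necessarily the paper's.

```python
# Soft optimization objective: minimize ||w||^2 plus a penalty whenever
# a no-link node l gets a higher stationary probability than a
# destination node d. Here p maps each candidate node to its
# random-walk score under the current parameters w (computed elsewhere,
# e.g. by the RWR sketch above with edge strengths a_uv = f_w(psi_uv)).
def soft_objective(w, p, D, L, lam=1.0):
    def h(x):
        # squared hinge: zero when the constraint p_l < p_d holds
        return x * x if x > 0 else 0.0
    reg = sum(wi * wi for wi in w)
    return reg + lam * sum(h(p[l] - p[d]) for d in D for l in L)
```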

Algorithm

Experiments on Synthetic data A scale-free graph G with 10,000 nodes, evaluated by classification accuracy. Strength function: [formula shown on slide, not transcribed].

Experiments on Synthetic data Note: AUC = area under the ROC curve; 1.0 means perfect classification and 0.5 means random guessing.

Experimental setup Real datasets: co-authorship networks from the arXiv database; the Facebook network of Iceland.

Experiments on Real datasets Four co-authorship networks and the Facebook network of Iceland

Learning curve for Facebook data Performance of Supervised Random Walks as a function of the number of steps of the parameter estimation procedure.

Results Table 1: Hep-Ph co-authorship network. Table 2: results for the Facebook dataset. Supervised Random Walks: AUC 0.7 to 0.8, precision at top 20 from 4.2 to 7.6.

Results (cont.) Table 3: results for all datasets.

[Figure: ROC curves on the Astro-Ph test data for Supervised Random Walks and Random Walk with Restarts; x-axis: fraction of negatives, y-axis: fraction of positives.]

Conclusion Supervised Random Walks improves substantially over Random Walks and outperforms standard supervised machine learning techniques. It combines rich node and edge features with the structure of the network. Applications: recommendations, anomaly detection, missing-link prediction, expertise search and ranking.

Questions???

References
1. A. Clauset, C. Moore, M. E. J. Newman. Hierarchical structure and the prediction of missing links in networks. Nature 453, 98-101, 2008. http://www-personal.umich.edu/~mejn/papers/cmn08.pdf
2. P. Sarkar, D. Chakrabarti, A. W. Moore. Theoretical Justification of Popular Link Prediction Heuristics. In Proc. IJCAI, 2011 (invited). http://www.cs.cmu.edu/~deepay/mywww/papers/ijcai11-linkpred.pdf
3. L. Backstrom, J. Leskovec. Supervised Random Walks: Predicting and Recommending Links in Social Networks. In Proc. WSDM, 2011. http://cs.stanford.edu/people/jure/pubs/linkpred-wsdm11.pdf

Thank You