Efficient Subgraph Matching by Postponing Cartesian Products. Matthias Barkowsky


1 Efficient Subgraph Matching by Postponing Cartesian Products (Matthias Barkowsky)

2 Subgraph Isomorphism - Approaches
- Search problem: find all embeddings of a query graph in a data graph
- NP-hard, but can often be solved efficiently in practice
- General approach: iteratively map query vertices to data vertices -> the ordering is very important
- But finding the optimal order is NP-hard as well

3-4 Query Vertex Ordering
[Figure: query graph with vertices u1..u6 labeled A-F, and a larger data graph with many repeated labels]
QuickSI matching order: (u1, u2, u3, u4, u5, u6)

5 Core-Forest-Leaf Decomposition
- Core: dense subgraph of the query graph (here: u1, u2, u5)
- Forest: a number of trees connected to the core via one vertex each (here containing u3)
- Leaves: leaf vertices of the trees in the forest (here: u4, u6)

6 Core-Forest-Leaf Decomposition
[Figure: decomposition of the query graph into core (u1, u2, u5), forest (u3), and leaves (u4, u6)]
Matching order: core, then forest, then leaves

7 Compact Path Index - Idea
[Figure: the query core (u1: A, u2: B, u5: C) over a data graph in which u5 has very many candidate vertices (v4, v5, ..., v1004); the index stores, per query vertex, its candidate data vertices organized along query paths: u1 -> {v1}, u2 -> {v2, v3}, u5 -> {v4, v5, ...}]

8 Compact Path Index - Construction
Criteria for data vertices in the candidate set of a query vertex:
1. matching label
2. a neighbor in the candidate set of each already processed neighboring query vertex
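A minimal sketch of this candidate filtering (illustrative only; `adj`, `label`, and `candidates` are assumed data structures, not names from the paper):

```python
def candidate_set(query_label, processed_neighbors, candidates, adj, label):
    """Keep data vertices that (1) carry the query vertex's label and (2) have
    a neighbor in the candidate set of every already processed neighboring
    query vertex (criteria 1 and 2 above)."""
    result = []
    for v in adj:
        if label[v] != query_label:
            continue
        if all(any(w in candidates[q] for w in adj[v]) for q in processed_neighbors):
            result.append(v)
    return result
```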

9-18 Compact Path Index - Construction (worked example over several slides)
[Figure sequence: the index is built for the core u1 (A), u2 (B), u5 (C) over the data graph. First u1 -> {v1}; then u2 -> {v2, v3}, with v3 later removed; then u5 -> {v4, v5, ..., v1004}, which is pruned down to u5 -> {v4, v5}]

19 Compact Path Index - Application
Index: u1 -> {v1}, u2 -> {v2}, u5 -> {v4, v5}
Query paths: P1 = (u1, u2) with 1 embedding, P2 = (u1, u5) with 2 embeddings

20 Compact Path Index - Application
- Matching sequence: start with P1 = (u1, u2), then add P2 = (u1, u5), giving S = (u1, u2, u5)
- Iteratively map query vertices to candidates in the index and check the data graph for non-tree edges
- P1: 1 embedding, P2: 2 embeddings
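A hedged sketch of this enumeration step, assuming `order` is the matching sequence, `candidates` holds the CPI candidate lists, and `query_adj`/`data_adj` are adjacency sets; for simplicity it checks every query edge against the data graph rather than only the non-tree edges:

```python
def enumerate_embeddings(order, candidates, query_adj, data_adj, mapping=None):
    """Backtracking over the matching sequence: extend a partial mapping of
    query vertices to candidate data vertices, verifying edges in the data graph."""
    mapping = {} if mapping is None else mapping
    if len(mapping) == len(order):
        yield dict(mapping)
        return
    u = order[len(mapping)]
    for v in candidates[u]:
        if v in mapping.values():
            continue  # keep the embedding injective
        if all(mapping[q] in data_adj[v] for q in query_adj[u] if q in mapping):
            mapping[u] = v
            yield from enumerate_embeddings(order, candidates, query_adj, data_adj, mapping)
            del mapping[u]
```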

21 Improved Query Vertex Ordering
[Figure: the same query and data graphs as before]
CFL matching order: (u1, u2, u5, u3, u4, u6)

22 Evaluation
- Compared to QuickSI and TurboISO
- Several datasets of varying size (in number of nodes): real protein interaction networks and synthetic graphs
- Several generated queries of varying size (in number of nodes)

23 Evaluation
Faster by multiple orders of magnitude

24 References
Bi, Fei, et al. "Efficient Subgraph Matching by Postponing Cartesian Products." Proceedings of the 2016 International Conference on Management of Data (SIGMOD). ACM, 2016.

25 Supervised Random Walks: Predicting and Recommending Links in Social Networks
Ronny Pachel, Uni Potsdam, Feb 7th 2017

26 Contents
1 Introduction
2 Supervised Random Walks
3 Experimental Setup
4 Experiments on Facebook Data
5 Conclusion

27 Link Prediction (Introduction)
- A network consists of nodes and edges
- Between which two nodes will new edges (links) be created, within a specific timeframe?

28 Challenges (Introduction)
Motivation:
- The author works at Facebook, so the motivation is obvious: every new link (friendship, etc.) increases the network's value
Challenges of real networks:
- Extremely sparse: Facebook has nearly 2 billion users, but how many friends do you have? Predicting that no new links will be created is nearly 100% accurate!
- Which network features have an influence on the creation of new links?

29 Present Work (Introduction)
- A method for link prediction and recommendation with supervised random walks
- Combines network structure and network properties
- PageRank-like random walk that visits given nodes (positive training examples) more often
- LEARN the edge strengths (transition probabilities)

30 Contents (section: Supervised Random Walks)

31 Random Walks
[Figure: a small example graph (nodes A, B, C) with transition probabilities; one edge carries probability 0.4]

32 Transition Matrix (Supervised Random Walks)
- Probability distribution after one step: p' = Q^T p
- Stationary probability: p = Q^T p
[Worked example with a 3-node graph; its stationary distribution is (1/3, 1/3, 1/3)]
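As a small illustration of these two formulas (not from the slides), power iteration converges to the stationary distribution; adding a restart probability gives the random walk with restarts used later:

```python
import numpy as np

def stationary_distribution(Q, alpha=0.0, restart=None, iters=100):
    """Power iteration: Q is a row-stochastic transition matrix; with restart
    probability alpha the walk jumps back to the 'restart' distribution."""
    n = Q.shape[0]
    restart = np.full(n, 1.0 / n) if restart is None else restart
    p = np.full(n, 1.0 / n)
    for _ in range(iters):
        p = (1 - alpha) * (Q.T @ p) + alpha * restart
    return p

# 3-node example: the uniform vector (1/3, 1/3, 1/3) is stationary here.
Q = np.array([[0.0, 0.5, 0.5],
              [0.5, 0.0, 0.5],
              [0.5, 0.5, 0.0]])
print(stationary_distribution(Q))
```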

33 Overview (Supervised Random Walks)
- Given is a graph G
- Start from a single node s with already existing links, for which new links should be predicted
- Also required: a set of positive and negative training nodes
- General approach - can also be used for link prediction, recommendation, link anomaly detection, missing link prediction, etc.

34 Approaches (Supervised Random Walks)
Two general approaches to this problem:
As a classification task:
- Each candidate node is either a created link in the training data (positive example) or not (negative example)
- But: high class imbalance makes learning very hard! How to take the structure of the network into account?
- And: choosing and extracting useful features is very hard; there are countless ways to describe node proximity alone!
As a ranking task:
- Give nodes with newly created links a higher score
- For example, PageRank and Random Walks with Restarts (RWR) are used
- The stationary distribution of a random walk gives a measure of closeness
- Takes the structure into account, but not the properties

35 This Approach (Supervised Random Walks)
- Combination of both approaches: uses the network structure and the node and edge properties
- Idea: control an RWR via its transition probabilities so that it visits positive training nodes more often than negative training nodes
- A trivial approach that sets the transition probabilities manually would massively overfit
- Solution: find transition probabilities depending on node and edge properties via a learning algorithm!

36 Problem Formulation (Supervised Random Walks)
Given:
- Directed graph G(V, E)
- Start node s
- Set of candidate nodes C = {c_i} = D ∪ L
- Set of destination nodes (positive training examples) D = {d_i}
- Set of no-link nodes (negative training examples) L = {l_i}
- Feature vector ψ_uv for each edge (u, v), including the properties of both nodes and of the edge itself
Find:
- Edge strength a_uv = f_w(ψ_uv), used as transition probability; the function f_w is learned in the training phase, with w a vector of weights

37 Predicting Process (Supervised Random Walks)
- Take a source node for which we want to predict future links
- For each edge (u, v) in G, compute the strength a_uv = f_w(ψ_uv)
- Run an RWR; its stationary distribution assigns every node u a probability value p_u
- Order nodes by their p_u values; the highest values are predicted as destinations of future links
- The weights w must be learned!

38 Optimization Problem (Supervised Random Walks)
min_w F(w) = ||w||^2 + λ Σ_{d ∈ D, l ∈ L} h(p_l - p_d)
- Minimization problem: p_d should be higher than p_l
- h(): loss function; measures how bad the current prediction is, in the form of a penalty
- ||w||^2: squared norm of the weight vector; smaller values are preferred
- λ: trade-off between the loss and the weight vector; a high λ means the loss is more important than small weights. The regularization term is used to prevent overfitting to the data
- But how do we get from the input (w) to the measurable output (p)?

39 Optimization Problem (Supervised Random Walks)
- Set a start value of w
- For each edge (u, v), calculate the edge strength a_uv = f_w(ψ_uv)
- Build the random walk transition probability matrix Q
- Incorporate the restart probability α into Q
- Find the vector p of node probabilities (also known as Personalized PageRank) via an eigenvalue problem (or as the stationary distribution)
- This process can be written as a single formula; its derivative is used in an iterative calculation that converges to the p vector (gradient descent): "if I change the weight values like this, does the final result F(w) get better (smaller)?"
- Note: this problem is not generally convex - no global minimum is guaranteed!
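A rough sketch of this training loop, using a numerical gradient for brevity; the paper instead derives the gradient of p with respect to w in closed form, and `node_probabilities` is a hypothetical helper that runs the RWR with the given edge strengths:

```python
import numpy as np

def objective(w, edges, features, D, L, lam, loss, node_probabilities):
    """F(w) = ||w||^2 + lam * sum_{d in D, l in L} loss(p_l - p_d)."""
    strengths = {e: np.exp(features[e] @ w) for e in edges}  # a_uv = f_w(psi_uv)
    p = node_probabilities(strengths)                        # personalized PageRank
    return w @ w + lam * sum(loss(p[l] - p[d]) for d in D for l in L)

def train(w0, step=0.01, eps=1e-4, iters=200, **problem):
    """Plain gradient descent on F(w) with a finite-difference gradient."""
    w = np.array(w0, dtype=float)
    for _ in range(iters):
        grad = np.zeros_like(w)
        for i in range(len(w)):
            delta = np.zeros_like(w)
            delta[i] = eps
            grad[i] = (objective(w + delta, **problem) -
                       objective(w - delta, **problem)) / (2 * eps)
        w -= step * grad
    return w
```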

40 Contents (section: Experimental Setup)

41 Network (Experimental Setup)
- 4 physics co-authorship networks, plus the complete Facebook network of Iceland
- Iceland has a very high Facebook penetration, and relatively few edges link to users from other countries
- Timeframe from Nov 1st 2009 (t1) to Jun 13th 2010 (t2)
- More than users (about 55% of Iceland's population); on average 168 friends at t1 and 26 new friends until t2
- Focus on predicting links to nodes that are 2 hops away: empirically, most new links close a triangle

42 Preprocessing (Experimental Setup)
- Only active nodes with more than 20 links
- Candidate nodes with fewer than 4 common friends have been pruned; it is empirically very unlikely that they would become friends
- Nodes with too many links have been pruned; even after just 2 hops, some nodes can have several million neighbours
- These measures speed up the computation but do not influence the prediction performance much

43 Learner Setup (Experimental Setup)
- From these users, 200 have been randomly selected: 100 as a training set to learn the algorithm, and 100 as a test set to measure performance
- The node and edge properties have been used to generate 7 features: edge age, edge initiator, communication and observation features, number of common friends
- Two performance metrics: AUC (area under the ROC curve) and Prec@20 (precision at the top 20)

44 Contents (section: Experiments on Facebook Data)

45 Algorithm (Experiments on Facebook Data)
Four aspects of the algorithm need to be chosen:
- Loss function
- Edge strength function
- RWR restart parameter α
- Regularization parameter λ

46 Loss Function (Experiments on Facebook Data)
The loss must be continuous and differentiable to optimize over. Candidates:
- Squared loss with margin b: h(x) = max{x + b, 0}^2
- Huber loss with margin b and window z ≥ b:
  h(x) = 0 if x ≤ -b
  h(x) = (x + b)^2 / (2z) if -b < x ≤ z - b
  h(x) = (x + b) - z/2 if x > z - b
- Wilcoxon-Mann-Whitney (WMW) loss with width b: h(x) = 1 / (1 + exp(-x/b))
Squared loss and Huber loss are easier to compute; WMW has slightly better performance.
But: only WMW increases the AUC and Prec@20!
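The three candidate losses written out as plain functions (the margin and window values here are arbitrary defaults, not the ones used in the paper):

```python
import math

def squared_loss(x, b=0.1):
    """h(x) = max(x + b, 0)^2"""
    return max(x + b, 0.0) ** 2

def huber_loss(x, b=0.1, z=0.5):
    """h(x) = 0 for x <= -b, (x+b)^2/(2z) for -b < x <= z-b, (x+b) - z/2 otherwise."""
    if x <= -b:
        return 0.0
    if x <= z - b:
        return (x + b) ** 2 / (2 * z)
    return (x + b) - z / 2

def wmw_loss(x, b=0.1):
    """h(x) = 1 / (1 + exp(-x / b))"""
    return 1.0 / (1.0 + math.exp(-x / b))
```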

47 Strength Function (Experiments on Facebook Data)
- f_w(ψ_uv) must be non-negative and differentiable
- Two candidates, both based on the inner product of the weight vector and the feature vector:
  - Exponential edge strength: a_uv = exp(ψ_uv · w)
  - Logistic edge strength: a_uv = (1 + exp(-ψ_uv · w))^-1
- Experiments show no significant difference in performance, but the exponential function can underflow or overflow double precision numbers
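The two strength candidates as functions of the feature vector ψ_uv and the weights w (a direct transcription of the formulas above):

```python
import numpy as np

def exponential_strength(psi, w):
    """a_uv = exp(psi_uv . w); always positive, but may over- or underflow."""
    return np.exp(psi @ w)

def logistic_strength(psi, w):
    """a_uv = (1 + exp(-psi_uv . w))^-1; bounded in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-(psi @ w)))
```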

48 Choice of α (Experiments on Facebook Data)
- α controls how far a walk wanders before it is reset
- α = 0: the PageRank of a node is its degree
- α = 1: after one hop the walk is reset
- Large values give short, local walks; small values give long walks
- α has an influence on unweighted graphs, but with edge strengths assigned its influence diminishes
- Empirically, 0.3 < α < 0.7 gives the same performance

49 Choice of λ (Experiments on Facebook Data)
- Overfitting is not an issue, and the number of features is small
- A λ of 1 was chosen

50 Results (Experiments on Facebook Data)
- SRW with all weights w = 0 as the starting point
- Compared to two unsupervised methods (RWR, node degree) and two supervised machine learning methods (decision trees and logistic regression)

51 Contents (section: Conclusion)

52 Conclusions
- Supervised Random Walks: a new learning algorithm for link prediction and recommendation
- Combines two approaches that see the problem either as a classification task or as a ranking task
- Demonstrated on real Facebook data
- Predictions show large improvements over RWR, and the method is simpler to apply than classical machine learning approaches
- SRW can be adapted to many other problems

53 Social Influence Computation and Maximization in Signed Networks with Competing Cascades
David Schumann, Hasso Plattner Institute, Graph Mining, Tuesday 7th February 2017

54 What is this paper about?
- Domain: a network of susceptible nodes with trust and distrust relationships
- Diffusion model: Independent Cascade
- Problem: find a starting seed-set of size m with the highest expected number of nodes colored red
- Signed Network Influence Maximization (SiNiMax) problem; NP-hard

55 What is this paper about? [Illustrative figure]

56 Signed Networks with Competing Cascades
- Weighted, directed and signed graph G = (V, E, W)
- Weight matrix W: the sign represents a trust or distrust relationship; the absolute value denotes the strength of influence

57-60 Signed Networks with Competing Cascades (example, built up over several slides)
- Two influence diffusion processes spread in discrete time steps
- Using the standard Independent Cascade Model (ICM)
- An infected node v gets one chance to infect a neighbour w, with probability p_{v,w}; on success, w becomes infected at step t + 1
- Infected nodes never return to the susceptible state
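A minimal Monte-Carlo sketch of one ICM run for a single cascade (illustrative; `succ` and `p` are assumed inputs, and the slides' competing red/blue cascades would run two such processes over the same graph):

```python
import random

def independent_cascade(succ, p, seeds, rng=None):
    """One run of the Independent Cascade Model: succ[v] lists the
    out-neighbours of v, p[(v, w)] is the infection probability of edge (v, w)."""
    rng = rng or random.Random(0)
    infected = set(seeds)
    frontier = list(seeds)
    while frontier:
        nxt = []
        for v in frontier:
            for w in succ.get(v, []):
                # each infected node gets exactly one chance per neighbour
                if w not in infected and rng.random() < p[(v, w)]:
                    infected.add(w)      # w becomes infected at step t + 1
                    nxt.append(w)
        frontier = nxt                   # infected nodes never recover
    return infected
```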

61 Help of the Unified Model
- Reduces the computational cost for commonly used diffusion models
- Uses individual and collective influence
- For ICM, with p_{(v,u)}(t) the edge probability and N_i(u) the infected in-neighbours of u:
  r_{u,t} = 1 - P(no infected neighbour succeeds) = 1 - Π_{v ∈ N_i(u)} (1 - p_{(v,u)} B_{v,t-1})

62 Unified Model of Competing Cascades
- r+_{u,t}: collective influence on u to be red
- r-_{u,t}: collective influence on u to be blue
- B+_{u,t}: probability of u being red at or before time t
- B-_{u,t}: probability of u being blue at or before time t
- B+_{u,t}(k, +): probability of node u being colored red at time t if node k is infected with red at time t - 1

63 Online Seed-set Selection using the Unified Model for Signed Networks: OSSUM±
Algorithm 1: pseudocode for OSSUM±
Input: G, m
Output: S
1: S <- {}
2: for t = 1 to m do
3:   (k, c) = arg max_{(k, c)} Σ_u (B+_{u,t}(k, c) - B+_{u,t})
4:   S <- S ∪ {(k, c)}
5: end for
Runtime complexity: O(m(|V| + |E|))
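The greedy loop of the pseudocode, sketched in Python; `marginal_gain` is a hypothetical callback standing in for the unified-model quantity Σ_u (B+_{u,t}(k, c) - B+_{u,t}), which the paper computes efficiently:

```python
def greedy_seed_selection(nodes, m, colors, marginal_gain):
    """Pick, in each of m rounds, the (node, color) pair with the largest
    estimated marginal gain and add it to the seed set S."""
    S = []
    for _ in range(m):
        best = max(((k, c) for k in nodes for c in colors if (k, c) not in S),
                   key=lambda kc: marginal_gain(S, *kc))
        S.append(best)
    return S
```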

64 Experiment Results
Comparison of OSSUM± to three other heuristics:
- Positive Degree: nodes with the highest positive degree are selected
- Positive Degree Discount: a variation of Positive Degree
- Effective Degree: nodes with the highest effective degree are selected
Data sets:
                    Epinions   Slashdot
  # nodes           131k       82k
  # edges           840k       550k
  # positive edges  710k       425k
  # negative edges  130k       125k

65-66 Experiment Results [result plots]

67 Review
Positive:
- Introduces the novel influence maximization problem SiNiMax
- OSSUM± outperforms state-of-the-art heuristics for influence maximization
- Applied to real-world data
Negative:
- Inaccuracies in the description of the method (missing edge cases)
- Unproven claim about the extensibility of the approach
- The real-world data had to be modified to show the advantage of OSSUM±

68 Top-k Link Recommendation in Social Graphs Sören Tietböhl


73-75 X = UV [factorization figures]

76-82 Predictions for user 4 (built up over several slides)
Ground truth: (1,4),(2,4),(3,4),(4,5),(4,6)
Example prediction 1: (2,4),(3,4),(4,5),(1,4),(4,6)
Example prediction 2: (2,4),(4,5),(3,4),(1,4),(4,6)
The loss is 1 in both cases.
TODO list:
1. Define a new loss function that penalizes mistakes at the top more than at the bottom
2. Derive an algorithm that minimizes that function
3. Evaluate the algorithm using various datasets and metrics
4. ???
5. Profit

83-88 [Figure legend: one matrix is the training adjacency matrix, the other the ranking score matrix; example ranking: (2,4),(4,5),(3,4),(1,4),(4,6)]

89-92 Predictions for user 4, revisited:
Ground truth: (1,4),(2,4),(3,4),(4,5),(4,6)
Example prediction 1: (2,4),(3,4),(4,5),(1,4),(4,6)
Example prediction 2: (2,4),(4,5),(3,4),(1,4),(4,6)
[Figures: the two predictions are evaluated separately]

93-98 Sigmoid function: with the sigmoid, the loss is now differentiable.
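A small sketch of the idea, assuming the non-differentiable part of the ranking loss is an indicator of "a negative pair is scored above a positive pair"; replacing the indicator by a sigmoid gives a smooth surrogate:

```python
import math

def sigmoid(x, scale=1.0):
    return 1.0 / (1.0 + math.exp(-x / scale))

def smooth_ranking_loss(scores, positives, negatives):
    """Each violation 'negative ranked above positive' contributes
    sigmoid(score_neg - score_pos) instead of a hard 0/1 indicator."""
    return sum(sigmoid(scores[n] - scores[p])
               for p in positives for n in negatives)
```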


104-105 Predicted ranking: (2,4),(3,4),(4,5),(1,4),(4,6), compared against the ground truth: (1,4),(2,4),(3,4),(4,5),(4,6)


110 Sources
- Dongjin Song, David A. Meyer, Dacheng Tao. "Top-k Link Recommendation in Social Networks." 2015 IEEE International Conference on Data Mining (ICDM), 2015.
- Graph Mining course slides
- quicklatex.com

111 pSCAN: Fast and Exact Structural Graph Clustering
Lukas Wenzel, Graph Mining WT 2016/17

112 Structural Graph Clustering
[Figure: example graph v1..v10 with two clusters, a hub vertex, and an outlier vertex]

113 Agenda
- Principles of SCAN
- Improvements of pSCAN
- Experimental Results

114 Principles of SCAN
- Graph G = (V, E)
- Structural neighborhood: N_u = {v ∈ V | (u, v) ∈ E} ∪ {u}, with degree d_u = |N_u|
[Figure: example graph highlighting a vertex u and its neighborhood N_u]

115 Principles of SCAN
- Structural similarity: σ(u, v) = |N_u ∩ N_v| / sqrt(d_u * d_v)
[Figure: the neighborhoods N_u and N_v of v6 and v5 overlap; their intersection determines σ(v6, v5)]
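The similarity in code, assuming `adj[u]` is the set of neighbours of u (the structural neighbourhood then adds u itself):

```python
from math import sqrt

def structural_similarity(adj, u, v):
    """sigma(u, v) = |N_u ∩ N_v| / sqrt(d_u * d_v), with u in N_u and v in N_v."""
    Nu = adj[u] | {u}
    Nv = adj[v] | {v}
    return len(Nu & Nv) / sqrt(len(Nu) * len(Nv))
```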

116 Principles of SCAN
- Similarity threshold and core vertices:
  N_u^ε = {v ∈ N_u | σ(u, v) ≥ ε}
  V_C = {v ∈ V | |N_v^ε| ≥ µ}
[Figure: example with ε = 0.7 and µ = 4; core vertices and ε-edges are highlighted]

117 Principles of SCAN
- Structure reachability: u is structure-reachable from v if there is a path of ε-edges and intermediate core vertices
- Cluster: a set C ⊆ V such that all vertices v ∈ C are mutually structure-reachable and no vertex x ∈ V \ C is structure-reachable from any vertex in C
[Figure: example clusters with ε = 0.7 and µ = 4]

118 Principles of SCAN - Algorithm
- SCAN requires a structural similarity calculation along all edges in G
- This is equivalent to enumerating all triangles in G
- Complexity O(|E|^1.5), which is worst-case optimal

119 Agenda (section: Improvements of pSCAN)

120 Improvements of pSCAN
Paradigm:
1. Cluster core vertices
2. Expand clusters by the core vertices' structural neighborhoods
Effective degree and similar degree (for a vertex u with incomplete similarity information in its neighborhood):
- Effective degree: d_u^e = d_u - x, where x is the number of neighbors v with σ(u, v) < ε already known
- Similar degree: d_u^s = number of neighbors v with σ(u, v) ≥ ε already known
- d_u^e and d_u^s provide an upper and a lower bound on |N_u^ε|
- Calculate structural neighborhoods for vertices in non-increasing d^e order
- Update d^e and d^s for every calculated σ; omit the neighborhood calculation if d^e or d^s already suffices to decide whether u is a core vertex
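A simplified sketch of how the two bounds can cut similarity computations short when deciding whether a single vertex is a core (the real pSCAN additionally shares computed similarities across vertices and orders the work globally); `sigma` can be the structural_similarity function above:

```python
def is_core(u, adj, eps, mu, sigma):
    """Maintain a lower bound (similar degree) and an upper bound (effective
    degree) on |N_u^eps| and stop as soon as either bound settles the question."""
    neighborhood = adj[u] | {u}
    similar, effective = 0, len(neighborhood)
    for v in neighborhood:
        if similar >= mu:
            return True          # lower bound already reaches mu
        if effective < mu:
            return False         # upper bound can no longer reach mu
        if sigma(adj, u, v) >= eps:
            similar += 1
        else:
            effective -= 1
    return similar >= mu
```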

121-128 Improvements of pSCAN (worked example)
[Figure sequence: the table of similar degrees d_u^s and effective degrees d_u^e for v1..v10 is updated step by step as similarities are computed (ε = 0.7, µ = 4), until the bounds identify the core vertices and the clusters]

129 Agenda (section: Experimental Results)

130 Experimental Results
Algorithms:
- SCAN
- SCAN++ (an approximate version of SCAN that omits some similarity calculations)
- pSCAN (5 variants with different degrees of optimization)
Data:
- 13 real-world networks (13k to 1.2G edges)
- 11 generated networks with varying characteristics

131 Experimental Results
About one order of magnitude faster than either SCAN or SCAN++

132 Clustering Large Probabilistic Graphs Marianne Thieffry

133 Agenda
1. Definition of pclusteredit
2. Algorithms for pclusteredit
3. Generalizations of pclusteredit

134 Definition of pclusteredit

135-137 Probabilistic Graphs [example figures]

138 Cluster graphs

139-142 Edit distance and pclusteredit
- Edit distance: number of edges added + number of edges removed (between all deterministic instances of a given probabilistic graph and a given deterministic graph)
- pclusteredit: find a cluster graph while minimizing the edit distance

143-145 CorrelationClustering [example figures]

146 Algorithms for pclusteredit

147-152 pKwikCluster
[Worked example over several slides; an edge probability of 0.8 appears in the example]
Complexity: O(m)

153-157 Furthest
[Worked example over several slides]
Complexity: O(nk²)

158 Agglomerative
- Start with singleton clusters
- Repeatedly merge the pair of clusters with the highest average edge probability, until the highest average edge probability is at most 0.5
- Complexity: O(kn log n)
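A naive sketch of this heuristic (it recomputes average probabilities from scratch, so it does not reach the stated O(kn log n) bound; `prob[(u, v)]` is the edge existence probability, 0 if the edge is absent):

```python
from itertools import combinations

def agglomerative(nodes, prob, threshold=0.5):
    """Merge the pair of clusters with the highest average edge probability
    until that average drops to the threshold or below."""
    clusters = [{u} for u in nodes]

    def avg_prob(a, b):
        total = sum(prob.get((u, v), prob.get((v, u), 0.0)) for u in a for v in b)
        return total / (len(a) * len(b))

    while len(clusters) > 1:
        (i, j), best = max(
            (((i, j), avg_prob(clusters[i], clusters[j]))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda pair: pair[1])
        if best <= threshold:
            break
        clusters[i] |= clusters[j]
        del clusters[j]
    return clusters
```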

159-163 pKwikCluster (continued)
[Further example slides]
Complexity shown: O(kn log n)

164 Significance

165 Evaluation - CORE dataset
Protein-protein interaction network: 2,708 nodes, 7,123 edges

166-169 Evaluation - CORE dataset [result figures]

170 Generalizations of pclusteredit

171 Noisy cluster graphs

172 p2clusteredit & minα
- p2clusteredit: find an α-cluster with minimal edit distance
- minα: find the α that produces the minimum α-cluster with minimal edit distance

173 The Shortest Path is not Always a Straight Line
(Vasiliki Kalavri, Tiago Simas & Dionysios Logothetis)
Presented by Philippe Zou, Graph Mining WS 2016/2017 (Davide Mottin & Konstantina Lazaridou)

174 Agenda
1. Overview
2. Implementation
3. Evaluation
4. Conclusion

175 Overview

176 Motivations
- Leverage the backbone metric: the backbone is the minimum subgraph that preserves the shortest paths of a weighted graph
- Improve graph analysis tools and computing performance
- For large graphs, reduce running time and storage

177 Backbone Metric
[Example graph from the paper]

178 Concept of semi-metricity
An edge is semi-metric if there is an indirect, shorter path between its endpoints. Consider the previous graph.

179 Concept of semi-metricity
An edge is semi-metric if there is an indirect, shorter path between its endpoints.
In the example, the indirect path has total weight 6 < 10, so the edge AD is semi-metric (the same holds for the edge CE).

180 Concept of semi-metricity
Semi-metricity degree: the number of edges to add, compared to the original (direct) edge, to reach the shortest path.
AD is of 2nd order (and CE of 1st order).

181 Concept of semi-metricity
The authors performed an analysis of semi-metricity on several data sets [results table]

182 Implementation

183-184 Backbone algorithm: 3 phases
Phase 1: find 1st-order semi-metric edges; only triangles need to be inspected
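A sketch of phase 1, assuming an undirected graph given as neighbour sets `adj` and symmetric edge weights `weight[(u, v)]`: an edge is 1st-order semi-metric when some triangle provides a shorter two-edge path.

```python
def first_order_semimetric_edges(adj, weight):
    """Return the edges (u, v) for which a common neighbour w gives a shorter
    indirect path: weight(u, w) + weight(w, v) < weight(u, v)."""
    semi_metric = set()
    for u in adj:
        for v in adj[u]:
            if u < v:                             # visit each undirected edge once
                for w in adj[u] & adj[v]:         # common neighbours form triangles
                    if weight[(u, w)] + weight[(w, v)] < weight[(u, v)]:
                        semi_metric.add((u, v))
                        break
    return semi_metric
```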

185-186 Backbone algorithm: 3 phases
Phase 2: focus on 2nd-order semi-metric edges; the lowest-weight edge of every node is metric

187-190 Backbone algorithm: 3 phases
Phase 3: breadth-first search from the remaining edges; if the target is reached via a shorter path, the edge is semi-metric

191 Backbone algorithm: 3 phases - end of the algorithm

192 Evaluation

193 Execution time for phase 1
Execution time for finding triangles and removing 1st-order semi-metric edges [plot]

194 Execution time for phase 1
The execution time increases linearly: about 15 minutes for 4 billion edges.

195 Storage
Size reduction of the network after removal [plot]

196 Conclusion

197 Conclusion
- Introduces a new concept for graphs: semi-metricity
- An algorithm to lower the number of edges, in order to improve computation and reduce running time
- Shows that direct edges are not necessarily the shortest paths
- Addresses the challenge posed by large graphs


Variations of Logistic Regression with Stochastic Gradient Descent Variations of Logistic Regression with Stochastic Gradient Descent Panqu Wang(pawang@ucsd.edu) Phuc Xuan Nguyen(pxn002@ucsd.edu) January 26, 2012 Abstract In this paper, we extend the traditional logistic

More information

Multiclass Classification-1

Multiclass Classification-1 CS 446 Machine Learning Fall 2016 Oct 27, 2016 Multiclass Classification Professor: Dan Roth Scribe: C. Cheng Overview Binary to multiclass Multiclass SVM Constraint classification 1 Introduction Multiclass

More information

Support vector machines Lecture 4

Support vector machines Lecture 4 Support vector machines Lecture 4 David Sontag New York University Slides adapted from Luke Zettlemoyer, Vibhav Gogate, and Carlos Guestrin Q: What does the Perceptron mistake bound tell us? Theorem: The

More information

Data Mining Part 5. Prediction

Data Mining Part 5. Prediction Data Mining Part 5. Prediction 5.5. Spring 2010 Instructor: Dr. Masoud Yaghini Outline How the Brain Works Artificial Neural Networks Simple Computing Elements Feed-Forward Networks Perceptrons (Single-layer,

More information

Using Local Spectral Methods in Theory and in Practice

Using Local Spectral Methods in Theory and in Practice Using Local Spectral Methods in Theory and in Practice Michael W. Mahoney ICSI and Dept of Statistics, UC Berkeley ( For more info, see: http: // www. stat. berkeley. edu/ ~ mmahoney or Google on Michael

More information

A Gradient-based Adaptive Learning Framework for Efficient Personal Recommendation

A Gradient-based Adaptive Learning Framework for Efficient Personal Recommendation A Gradient-based Adaptive Learning Framework for Efficient Personal Recommendation Yue Ning 1 Yue Shi 2 Liangjie Hong 2 Huzefa Rangwala 3 Naren Ramakrishnan 1 1 Virginia Tech 2 Yahoo Research. Yue Shi

More information

Nearest Neighbor. Machine Learning CSE546 Kevin Jamieson University of Washington. October 26, Kevin Jamieson 2

Nearest Neighbor. Machine Learning CSE546 Kevin Jamieson University of Washington. October 26, Kevin Jamieson 2 Nearest Neighbor Machine Learning CSE546 Kevin Jamieson University of Washington October 26, 2017 2017 Kevin Jamieson 2 Some data, Bayes Classifier Training data: True label: +1 True label: -1 Optimal

More information

6.207/14.15: Networks Lecture 7: Search on Networks: Navigation and Web Search

6.207/14.15: Networks Lecture 7: Search on Networks: Navigation and Web Search 6.207/14.15: Networks Lecture 7: Search on Networks: Navigation and Web Search Daron Acemoglu and Asu Ozdaglar MIT September 30, 2009 1 Networks: Lecture 7 Outline Navigation (or decentralized search)

More information

2018 EE448, Big Data Mining, Lecture 4. (Part I) Weinan Zhang Shanghai Jiao Tong University

2018 EE448, Big Data Mining, Lecture 4. (Part I) Weinan Zhang Shanghai Jiao Tong University 2018 EE448, Big Data Mining, Lecture 4 Supervised Learning (Part I) Weinan Zhang Shanghai Jiao Tong University http://wnzhang.net http://wnzhang.net/teaching/ee448/index.html Content of Supervised Learning

More information

Generative v. Discriminative classifiers Intuition

Generative v. Discriminative classifiers Intuition Logistic Regression (Continued) Generative v. Discriminative Decision rees Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University January 31 st, 2007 2005-2007 Carlos Guestrin 1 Generative

More information

Introduction to Machine Learning. Introduction to ML - TAU 2016/7 1

Introduction to Machine Learning. Introduction to ML - TAU 2016/7 1 Introduction to Machine Learning Introduction to ML - TAU 2016/7 1 Course Administration Lecturers: Amir Globerson (gamir@post.tau.ac.il) Yishay Mansour (Mansour@tau.ac.il) Teaching Assistance: Regev Schweiger

More information

A Randomized Approach for Crowdsourcing in the Presence of Multiple Views

A Randomized Approach for Crowdsourcing in the Presence of Multiple Views A Randomized Approach for Crowdsourcing in the Presence of Multiple Views Presenter: Yao Zhou joint work with: Jingrui He - 1 - Roadmap Motivation Proposed framework: M2VW Experimental results Conclusion

More information

MODULE -4 BAYEIAN LEARNING

MODULE -4 BAYEIAN LEARNING MODULE -4 BAYEIAN LEARNING CONTENT Introduction Bayes theorem Bayes theorem and concept learning Maximum likelihood and Least Squared Error Hypothesis Maximum likelihood Hypotheses for predicting probabilities

More information

Maximizing the Spread of Influence through a Social Network. David Kempe, Jon Kleinberg, Éva Tardos SIGKDD 03

Maximizing the Spread of Influence through a Social Network. David Kempe, Jon Kleinberg, Éva Tardos SIGKDD 03 Maximizing the Spread of Influence through a Social Network David Kempe, Jon Kleinberg, Éva Tardos SIGKDD 03 Influence and Social Networks Economics, sociology, political science, etc. all have studied

More information