On Top-k Structural. Similarity Search. Pei Lee, Laks V.S. Lakshmanan University of British Columbia Vancouver, BC, Canada

On Top-k Structural 1 Similarity Search Pei Lee, Laks V.S. Lakshmanan University of British Columbia Vancouver, BC, Canada Jeffrey Xu Yu Chinese University of Hong Kong Hong Kong, China 2014/10/14 Pei Lee, ICDE 2012, 2012-04-03

What s structural similarity? 2 Graph structures are ubiquitous Social networks, citation networks, web graphs, etc Problem Statement

What s structural similarity? 3 Structural similarity: the pairwise similarity between nodes in a graph How f to quantify d b the g similarity between u and v? e u c a v h Problem Definition: Input: G( V, E), u V, v V Output: S( u, v) Intuition: two nodes are similar, if their neighbors are similar Do you remember PageRank s intuition? Problem Statement A node is important, if this node is referenced by many other important nodes

What s top-k structural similarity search? 4 I am a node in a huge graph with millions of nodes I want to find top-k similar nodes with me But I definitely do not want to compare with every node I hope the accuracy of results is guaranteed. Problem Definition: Input: G( V, E), v V, k Output: Top- k similar nodes for Problem Statement v

Existing Structural Similarity Measures 5 Neighbor-based approaches Jaccard Coefficient, Cosine Similarity, Pearson correlation, Co-citation, etc Cons: no neighbors, no similarity! Meeting-based approaches SimRank (Jeh & Widom, KDD 02) P-Rank (Zhao et.al, CIKM 09) (by extending SimRank) Cons: high computational cost Not designed for top-k similarity search Related Work

SimRank & P-Rank 6 SimRank: two nodes are similar, if they are referenced by similar nodes b u a c v S( a, a) 1 S( b, c) 0.5 0 S( u, v) 0.25 0 Pairwise iterative form: 0 < C < 1 C Sn 1( u, v) Sn( i, j) I( u) I( v) Matrix form: S T n1 CWSnW In-neighbors ii ( u) ji ( v) Correction matrix Related Work Transition matrix P-Rank: two nodes are similar, if they are related with similar nodes 0 < λ < 1 S ( u, v) C S ( i, j) (1 ) C S ( i, j) n1 n n I( u) I( v) ii ( u) ji ( v) O( u) O( v) io ( u) jo ( v) SimRank Reversed SimRank

Top-k similarity search: challenges 7 Matrix-based approach: Offline: compute a V -by- V similarity matrix SimRank/P-Rank takes O( E 2 ) time, which degenerate to O( V 4 ) in the worst case Space cost: hard to store this huge similarity matrix Vector-based approach: Offline: compute a vector with length V Takes O( V D 2n ) time in the worst case, where n is the iteration number, D is the average edge degree All these approaches need to access the whole graph to find the exact top-k similar nodes Challenges

Contributions 8 Transform the computation of pairwise similarity on graph G to the computation of authority on G G, based on a propagation & aggregation process; Propose TopSim, a local top-k structural similarity search algorithm that avoids accessing the whole graph while the accuracy is guaranteed. Propose Trun-TopSim-SM and Prio-TopSim-SM, which are two approximations allowing us to trade accuracy for speed. Contributions

Structural similarity computation 9 Similarity Score Propagation & Aggregation Similarity Path Single random walk on G G Coupling random walk on G

Product of graphs: G G 10 Given G(V, E), G G is defined as For node u and v in G, uv is a node in G G For edge (u, u ) and (v, v ) in G, (uv, u v ) is an edge in G G b a cb cc uu ec vu c d v bd ce uv ee ea u e G da aa vv dd ae G G Each node pair in G will be a node in G G Each edge pair in G will be an edge in G G No need to materialize G G: only conceptually exists to facilitate analysis

Coupling random walk 11 Coupling random walk: two random surfers walk simultaneously and follow the same edge direction Surf1, Surf2 SimRank: S(u, v) is the first meeting probability of two random surfers starting from u and v respectively and following backward links. Coupling random walk on G can be equivalently transformed as a single random walk on G G b a cc uu ec cb vu c d v bd ce uv ee ea u e G da aa vv dd ae G G

Compute similarity based on coupling random walk 12 We actually transform a similarity ranking problem on G into an authority ranking problem on G G R(uv) = S(u, v) Initialization: R(uv) = 1 is fixed if u = v (uv is a source node) R(uv) = 0 if u v and R(uv) will be updated (uv is a target node) A propagation & Aggregation process Propagation: nodes propagate their authority to their neighbors following random walk steps Aggregation: nodes receive and aggregate the authorities that are propagated-in from their neighbors.

Compute S(u,v)? 13 Similarity path: a path from source node to target node without going by source nodes cc uu ec cb vu bd ce uv ee ea vv da aa dd ae Probability of a transition step: Similarity:

Compute S(u,v): example 14 If we only consider 3 steps C = 0.5 1 1 cc uu ec cb vu Path 1: (ee, uv) P(ee, uv) = 0.5 bd ce uv 1 ee ea da aa 1 vv 1 dd 1 ae Path 2: (aa, bd, ce, uv) P(aa, bd, ce, uv) = 0.5*1*0.5 = 0.25 S 3 (u,v) = P(ee, uv)*c + P(aa, bd, ce, uv)*c 3 = 0.281

Length bound for similarity paths 15 How many steps should take to guarantee the accuracy? Accuracy loss upper bound: TopSim: find top k similar nodes for a node v Start from v and explore neighbors in G as candidates step by step; Stop until the following condition is satisfied: TopSim can guarantee all theoretical top k similar nodes are explored.

16 TopSim

SimMap 17 Observation: many similarity paths are overlapped 3 c c Example: 2 1 0 f d e u b a g v h b a u Similarity paths v Start from c SM(b) = {(d, 1/2), (f, 1/2)} SM(a) = {(e, 1/8)} SM(v) = {(u, 1/32)} SimMap SM(u) = {(key, value)} key is the node visited by Surf2 on step i when Surf1 visits the node u value = S i (key, u)

TopSim based on SimMaps 18 Start from v and find source nodes at each step From level n-1 to 0 Let Surf1 start from source node and walk to node v Let Surf2 start from the same source node and put the visited nodes into SimMaps When level=0, Surf1 visits v, Surf2 will exactly visits the similar nodes of v in the same step 3 c 2 f d b g 1 e a h 0 u v

Algorithms 19 Algrithms Quality Performance TopSim Accurate Slow if not sparse TopSim-SM Accurate More efficient than TopSim Trun-TopSim-SM Trade accuracy for speed More efficient than TopSim-SM Prio-TopSim-SM Trade accuracy for speed More efficient than TopSim-SM

TopSim approximations for Scalefree graphs 20 Scale-free graphs Some nodes have very high degree Web graphs, citation networks, etc Random surfers will be trapped by high degree nodes The size of SimMaps will be exploded Revisit the transition probabilities: a n

TopSim approximations 21 Basic idea: Only consider similarity paths with higher probability Truncated TopSim-SM If P(u 0 u 0,, u i v i ) < η, stop and ignore this path Prioritized TopSim-SM Set a buffer size H Only expand top H nodes in SimMaps: If SM(u) > H, set SM(u) = H. See accuracy and complexity analysis in paper

Experiments 22 Datasets Arxiv High Energy Physics paper citation network, including 34,546 nodes and 421,578 edges DBLP co-author graph, with 0.92M nodes, 6.1M edges DBLP citation network, with 1.5M papers and 2.1M citations Live Journal social network, with 4.84M users and 68.99M friendship ties C = 0.5, η = 0.001, H = 100

Accuracy of similarity scores 23 Accuracy ratio Accuracy loss

24 Precision@k

25 Kendall Tau distance

Running time with different node sizes and node degrees 26 TopSim algorithms are not sensitive to the graph size TopSim approximations are not sensitive to high degree nodes

27 Running time and accessed nodes

Conclusion 28 We transform a similarity problem on graph G into an authority ranking problem on the product graph G G; We propose a family of TopSim methods that produce both the exact and approximate top-k results while accessing a small portion of the graph; Our algorithms work with both SimRank and P-Rank under the same top k framework. Questions?