On Top-k Structural Similarity Search. Pei Lee, Laks V.S. Lakshmanan, University of British Columbia, Vancouver, BC, Canada


Slide 1: On Top-k Structural Similarity Search
Pei Lee, Laks V.S. Lakshmanan, University of British Columbia, Vancouver, BC, Canada
Jeffrey Xu Yu, Chinese University of Hong Kong, Hong Kong, China
2014/10/14
Pei Lee, ICDE 2012, 2012-04-03

Slide 2: What's structural similarity? (Problem Statement)
Graph structures are ubiquitous: social networks, citation networks, web graphs, etc.

Slide 3: What's structural similarity?
Structural similarity: the pairwise similarity between nodes in a graph.
How to quantify the similarity between u and v?
[Figure: example graph with nodes a, b, c, d, e, f, g, h, u, v]
Problem definition:
Input: G(V, E), u ∈ V, v ∈ V
Output: S(u, v)
Intuition: two nodes are similar if their neighbors are similar.
Do you remember PageRank's intuition? A node is important if it is referenced by many other important nodes.

Slide 4: What's top-k structural similarity search?
I am a node in a huge graph with millions of nodes.
I want to find the top-k nodes most similar to me.
But I definitely do not want to compare myself with every node.
I hope the accuracy of the results is guaranteed.
Problem definition:
Input: G(V, E), v ∈ V, k
Output: the top-k similar nodes for v

Slide 5: Existing Structural Similarity Measures (Related Work)
Neighbor-based approaches:
Jaccard coefficient, cosine similarity, Pearson correlation, co-citation, etc.
Cons: no neighbors, no similarity!
Meeting-based approaches:
SimRank (Jeh & Widom, KDD '02)
P-Rank (Zhao et al., CIKM '09), which extends SimRank
Cons: high computational cost; not designed for top-k similarity search
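The "no neighbors, no similarity" drawback can be illustrated with a minimal sketch. The helper `jaccard` and the small in-neighbor table are hypothetical and not taken from the slides; the graph is an assumed chain a→b, a→c, b→u, c→v.

```python
def jaccard(neighbors, u, v):
    """Jaccard coefficient of the neighbor sets of u and v."""
    a, b = neighbors.get(u, set()), neighbors.get(v, set())
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# In-neighbor sets for the assumed graph a->b, a->c, b->u, c->v.
in_neighbors = {"b": {"a"}, "c": {"a"}, "u": {"b"}, "v": {"c"}}

score_bc = jaccard(in_neighbors, "b", "c")  # b and c share the in-neighbor a
score_uv = jaccard(in_neighbors, "u", "v")  # u and v share no in-neighbor
print(score_bc, score_uv)
```

Here u and v get a score of zero even though their in-neighbors b and c are themselves similar, which is the gap that meeting-based measures such as SimRank are meant to close.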

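Slide 6: SimRank & P-Rank
SimRank: two nodes are similar if they are referenced by similar nodes.
[Figure: example graph with nodes a, b, c, u, v; S(a, a) = 1, S(b, c) = 0.5, S(u, v) = 0.25]
Pairwise iterative form, with 0 < C < 1 and I(u) the in-neighbors of u:
$$S_{n+1}(u, v) = \frac{C}{|I(u)|\,|I(v)|} \sum_{i \in I(u)} \sum_{j \in I(v)} S_n(i, j)$$
Matrix form: $S_{n+1} = C\,W^{T} S_n W$ plus a correction matrix, where $W$ is the transition matrix.
P-Rank: two nodes are similar if they are related with similar nodes. With 0 < λ < 1 and O(u) the out-neighbors of u:
$$S_{n+1}(u, v) = \lambda \frac{C}{|I(u)|\,|I(v)|} \sum_{i \in I(u)} \sum_{j \in I(v)} S_n(i, j) + (1 - \lambda) \frac{C}{|O(u)|\,|O(v)|} \sum_{i \in O(u)} \sum_{j \in O(v)} S_n(i, j)$$
The first term is SimRank; the second term is reversed SimRank.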
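A minimal sketch of the pairwise iteration follows, assuming the example graph is the chain a→b, a→c, b→u, c→v (the figure itself is not preserved, so this edge list is a guess that reproduces the quoted scores). The function name `p_rank` and its parameters are illustrative; setting `lam=1.0` reduces it to plain SimRank. This is the naive all-pairs method, not the TopSim algorithm.

```python
from collections import defaultdict
from itertools import product

def p_rank(edges, C=0.5, lam=1.0, n_iter=5):
    """Naive pairwise P-Rank iteration; lam=1.0 gives plain SimRank."""
    nodes = sorted({x for edge in edges for x in edge})
    ins, outs = defaultdict(list), defaultdict(list)
    for src, dst in edges:
        outs[src].append(dst)
        ins[dst].append(src)

    # S_0: each node is fully similar to itself and to nothing else.
    S = {(u, v): 1.0 if u == v else 0.0 for u, v in product(nodes, repeat=2)}

    def avg(S, A, B):
        # Average similarity over all neighbor pairs; 0 if either side is empty.
        if not A or not B:
            return 0.0
        return sum(S[i, j] for i in A for j in B) / (len(A) * len(B))

    for _ in range(n_iter):
        nxt = {}
        for u, v in S:
            if u == v:
                nxt[u, v] = 1.0  # diagonal stays fixed at 1
            else:
                nxt[u, v] = C * (lam * avg(S, ins[u], ins[v])
                                 + (1 - lam) * avg(S, outs[u], outs[v]))
        S = nxt
    return S

edges = [("a", "b"), ("a", "c"), ("b", "u"), ("c", "v")]  # assumed example graph
S = p_rank(edges, C=0.5, lam=1.0, n_iter=3)
print(S["b", "c"], S["u", "v"])
```

The dictionary holds one entry per node pair, so memory grows quadratically with the number of nodes, which is the cost the following slides set out to avoid.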

Slide 7: Top-k similarity search: challenges
Matrix-based approach:
Offline: compute a |V|-by-|V| similarity matrix.
SimRank/P-Rank takes $O(|E|^2)$ time, which degenerates to $O(|V|^4)$ in the worst case.
Space cost: it is hard to store this huge similarity matrix.
Vector-based approach:
Offline: compute a vector of length |V|.
Takes $O(|V| D^{2n})$ time in the worst case, where n is the number of iterations and D is the average degree.
All these approaches need to access the whole graph to find the exact top-k similar nodes.

Slide 8: Contributions
Transform the computation of pairwise similarity on graph G into the computation of authority on G × G, based on a propagation & aggregation process.
Propose TopSim, a local top-k structural similarity search algorithm that avoids accessing the whole graph while the accuracy is guaranteed.
Propose Trun-TopSim-SM and Prio-TopSim-SM, two approximations that allow us to trade accuracy for speed.

Slide 9: Structural similarity computation
[Diagram relating: similarity score; propagation & aggregation; similarity path; single random walk on G × G; coupling random walk on G]

Slide 10: Product of graphs: G × G
Given G(V, E), G × G is defined as follows:
For nodes u and v in G, uv is a node in G × G.
For edges (u, u') and (v, v') in G, (uv, u'v') is an edge in G × G.
[Figure: graph G with nodes a, b, c, d, e, u, v, and part of G × G with nodes cb, cc, uu, ec, vu, bd, ce, uv, ee, ea, da, aa, vv, dd, ae]
Each node pair in G is a node in G × G.
Each edge pair in G is an edge in G × G.
No need to materialize G × G: it exists only conceptually to facilitate analysis.
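Because G × G never needs to be materialized, its edges can be generated on demand from G. The sketch below assumes an adjacency-list representation; the function `product_successors` and the small graph are hypothetical and do not reproduce the graph in the figure.

```python
def product_successors(out_edges, pair):
    """Yield the successors of pair node (u, v) in G x G without building G x G."""
    u, v = pair
    for u2 in out_edges.get(u, ()):
        for v2 in out_edges.get(v, ()):
            # (u, u2) and (v, v2) are edges in G, so ((u, v), (u2, v2)) is an edge in G x G.
            yield (u2, v2)

# Hypothetical graph G as out-neighbor lists.
out_edges = {"a": ["b", "d"], "b": ["c"], "d": ["e"], "c": ["u"], "e": ["u", "v"]}

from_aa = list(product_successors(out_edges, ("a", "a")))
from_bd = list(product_successors(out_edges, ("b", "d")))
print(from_aa)
print(from_bd)
```

Only the pair nodes that a search actually reaches are ever generated, so the quadratic number of pair nodes stays implicit.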

Slide 11: Coupling random walk
Coupling random walk: two random surfers (Surf1, Surf2) walk simultaneously and follow the same edge direction.
SimRank: S(u, v) is the first-meeting probability of two random surfers starting from u and v respectively and following backward links.
A coupling random walk on G can be equivalently transformed into a single random walk on G × G.
[Figure: graph G and product graph G × G, as on the previous slide]
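The coupled walk can be simulated directly. The sketch below is a Monte Carlo estimate under the standard SimRank random-surfer reading, in which a first meeting after t backward steps contributes C^t; the function name, its parameters, and the chain graph a→b, a→c, b→u, c→v are assumptions for illustration, not the authors' algorithm.

```python
import random

def simrank_monte_carlo(in_edges, u, v, C=0.5, max_steps=10, trials=10_000, seed=0):
    """Estimate S(u, v) by simulating two surfers that walk backward in lockstep."""
    if u == v:
        return 1.0
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        x, y = u, v
        for step in range(1, max_steps + 1):
            px, py = in_edges.get(x), in_edges.get(y)
            if not px or not py:
                break  # one surfer has no in-neighbor, so the pair can never meet
            x, y = rng.choice(px), rng.choice(py)
            if x == y:
                total += C ** step  # first meeting after `step` steps
                break
    return total / trials

# In-neighbor lists for the assumed chain graph a->b, a->c, b->u, c->v.
in_edges = {"b": ["a"], "c": ["a"], "u": ["b"], "v": ["c"]}
estimate_uv = simrank_monte_carlo(in_edges, "u", "v")
estimate_bc = simrank_monte_carlo(in_edges, "b", "c")
print(estimate_uv, estimate_bc)
```

On this graph every backward step is forced, so the estimate has no sampling noise; on a general graph it only converges as the number of trials grows.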

Slide 12: Compute similarity based on coupling random walk
We actually transform a similarity ranking problem on G into an authority ranking problem on G × G: R(uv) = S(u, v).
Initialization:
R(uv) = 1 is fixed if u = v (uv is a source node).
R(uv) = 0 if u ≠ v, and R(uv) will be updated (uv is a target node).
A propagation & aggregation process:
Propagation: nodes propagate their authority to their neighbors following random walk steps.
Aggregation: nodes receive and aggregate the authorities that are propagated in from their neighbors.

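Slide 13: Compute S(u, v)?
Similarity path: a path from a source node to a target node that does not pass through any other source node.
[Figure: product graph G × G with nodes cc, uu, ec, cb, vu, bd, ce, uv, ee, ea, vv, da, aa, dd, ae]
Probability of a transition step: [formula not preserved in the transcript]
Similarity: [formula not preserved in the transcript]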

Slide 14: Compute S(u, v): example
If we only consider 3 steps, with C = 0.5:
[Figure: product graph G × G with transition probabilities on the edges]
Path 1: (ee, uv), P(ee, uv) = 0.5
Path 2: (aa, bd, ce, uv), P(aa, bd, ce, uv) = 0.5 × 1 × 0.5 = 0.25
$S_3(u, v) = P(ee, uv) \cdot C + P(aa, bd, ce, uv) \cdot C^3 = 0.281$
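The arithmetic in this example can be checked with a short sketch that sums path probability times C raised to the path length, which is the pattern the example follows. The helper `similarity_from_paths` is a hypothetical name; the two (probability, length) pairs are the ones given on the slide.

```python
def similarity_from_paths(paths, C):
    """Sum P(path) * C**length over the given similarity paths."""
    return sum(prob * C ** length for prob, length in paths)

paths = [
    (0.5, 1),            # Path 1: (ee, uv)
    (0.5 * 1 * 0.5, 3),  # Path 2: (aa, bd, ce, uv)
]
s3 = similarity_from_paths(paths, C=0.5)
print(s3)
```

The exact value is 0.25 + 0.03125 = 0.28125, which the slide reports as 0.281.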

Slide 15: Length bound for similarity paths
How many steps should we take to guarantee the accuracy?
Accuracy loss upper bound: [formula not preserved in the transcript]
TopSim: find the top-k similar nodes for a node v.
Start from v and explore neighbors in G as candidates step by step.
Stop when the following condition is satisfied: [condition not preserved in the transcript]
TopSim can guarantee that all theoretical top-k similar nodes are explored.

Slide 16: TopSim
[Slide content not preserved in the transcript]

Slide 17: SimMap
Observation: many similarity paths overlap.
Example:
[Figure: similarity paths starting from c, drawn by level: level 3: c; level 2: f, d, b, g; level 1: e, a, h; level 0: u, v]
Start from c:
SM(b) = {(d, 1/2), (f, 1/2)}
SM(a) = {(e, 1/8)}
SM(v) = {(u, 1/32)}
SimMap: SM(u) = {(key, value)}
key is the node visited by Surf2 on step i when Surf1 visits the node u.
value = $S_i(\text{key}, u)$

Slide 18: TopSim based on SimMaps
Start from v and find source nodes at each step.
From level n-1 down to 0:
Let Surf1 start from a source node and walk to node v.
Let Surf2 start from the same source node and put the visited nodes into SimMaps.
When level = 0, Surf1 visits v, and Surf2 visits exactly the nodes similar to v in the same step.
[Figure: level 3: c; level 2: f, d, b, g; level 1: e, a, h; level 0: u, v]

Slide 19: Algorithms
Algorithm | Quality | Performance
TopSim | Accurate | Slow if the graph is not sparse
TopSim-SM | Accurate | More efficient than TopSim
Trun-TopSim-SM | Trade accuracy for speed | More efficient than TopSim-SM
Prio-TopSim-SM | Trade accuracy for speed | More efficient than TopSim-SM

Slide 20: TopSim approximations for scale-free graphs
Scale-free graphs:
Some nodes have very high degree.
Examples: web graphs, citation networks, etc.
Random surfers will be trapped by high-degree nodes.
The size of SimMaps will explode.
Revisit the transition probabilities: [figure not preserved in the transcript]

Slide 21: TopSim approximations
Basic idea: only consider similarity paths with higher probability.
Truncated TopSim-SM:
If $P(u_0u_0, \ldots, u_iv_i) < \eta$, stop and ignore this path.
Prioritized TopSim-SM:
Set a buffer size H.
Only expand the top H nodes in SimMaps: if |SM(u)| > H, set |SM(u)| = H.
See the accuracy and complexity analysis in the paper.
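The two pruning rules can be sketched on a SimMap stored as a dictionary from node to score. This is a simplification under stated assumptions: the slide applies the threshold η to path probabilities, whereas the sketch applies it to whatever score is stored in the map, and the helper names `truncate` and `prioritize` are hypothetical.

```python
import heapq

def truncate(sim_map, eta):
    """Drop entries whose score is below the threshold eta."""
    return {node: score for node, score in sim_map.items() if score >= eta}

def prioritize(sim_map, H):
    """Keep only the H highest-scoring entries when the map is larger than H."""
    if len(sim_map) <= H:
        return dict(sim_map)
    return dict(heapq.nlargest(H, sim_map.items(), key=lambda item: item[1]))

# Hypothetical SimMap; the node x stands for a very unlikely path.
sim_map = {"d": 0.5, "f": 0.5, "e": 0.125, "x": 0.0005}
kept_by_eta = truncate(sim_map, eta=0.001)
kept_by_h = prioritize(sim_map, H=2)
print(kept_by_eta)
print(kept_by_h)
```

Truncation bounds how unlikely a retained entry may be, while prioritization bounds how many entries are expanded, which is what limits the blow-up around high-degree nodes.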

Slide 22: Experiments
Datasets:
Arxiv High Energy Physics paper citation network, with 34,546 nodes and 421,578 edges
DBLP co-author graph, with 0.92M nodes and 6.1M edges
DBLP citation network, with 1.5M papers and 2.1M citations
LiveJournal social network, with 4.84M users and 68.99M friendship ties
Parameters: C = 0.5, η = 0.001, H = 100

Slide 23: Accuracy of similarity scores
[Plots: accuracy ratio; accuracy loss]

Slide 24: Precision@k
[Plot not preserved in the transcript]
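For reference, Precision@k is commonly computed as the fraction of the approximate top-k list that also appears in the exact top-k list. The sketch below uses that common definition, which may differ in detail from the one used in the paper; the function name and the sample rankings are hypothetical.

```python
def precision_at_k(predicted, exact, k):
    """Fraction of the first k predicted items that are in the first k exact items."""
    return len(set(predicted[:k]) & set(exact[:k])) / k

predicted = ["a", "b", "c", "d"]  # hypothetical approximate ranking
exact = ["a", "c", "e", "b"]      # hypothetical exact ranking
p3 = precision_at_k(predicted, exact, k=3)
print(p3)
```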

Slide 25: Kendall Tau distance
[Plot not preserved in the transcript]
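The Kendall Tau distance between two rankings counts the pairs of items that appear in opposite order. The sketch below is the textbook normalized version for two rankings over the same items, not necessarily the exact top-k variant used in the paper; the function name is hypothetical.

```python
from itertools import combinations

def kendall_tau_distance(rank_a, rank_b):
    """Normalized count of item pairs ordered differently in the two rankings.

    Assumes both rankings contain the same items and at least two of them.
    """
    pos_b = {item: i for i, item in enumerate(rank_b)}
    # combinations() yields (x, y) with x ranked before y in rank_a.
    discordant = sum(1 for x, y in combinations(rank_a, 2) if pos_b[x] > pos_b[y])
    n = len(rank_a)
    return discordant / (n * (n - 1) / 2)

distance = kendall_tau_distance(["a", "b", "c"], ["a", "c", "b"])
print(distance)
```

A value of 0 means the two rankings agree on every pair, and 1 means one ranking is the reverse of the other.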

Slide 26: Running time with different node sizes and node degrees
TopSim algorithms are not sensitive to the graph size.
TopSim approximations are not sensitive to high-degree nodes.

Slide 27: Running time and accessed nodes
[Plots not preserved in the transcript]

Slide 28: Conclusion
We transform a similarity problem on graph G into an authority ranking problem on the product graph G × G.
We propose a family of TopSim methods that produce both exact and approximate top-k results while accessing a small portion of the graph.
Our algorithms work with both SimRank and P-Rank under the same top-k framework.
Questions?