Heat Kernel Based Community Detection


Heat Kernel Based Community Detection. Kyle Kloster, Purdue University. Joint work with David F. Gleich (Purdue); supported by NSF CAREER award CCF-1149756.

Local Community Detection. Given seed(s) S in G, find a community that contains S: a set with high internal and low external connectivity.

Low-conductance sets are communities. conductance(T) = (# edges leaving T) / (# edge endpoints in T) = the chance a random step exits T. In the example community, conductance(comm) = 39/381 = 0.102. How do we find these sets?
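The conductance of a set can be sketched in a few lines of Python; this is a minimal illustration (the adjacency-dict representation and the example graph are assumptions, not from the talk):

```python
# Conductance of a node set T in an undirected graph given as an
# adjacency-list dict {node: [neighbors]}.
def conductance(adj, T):
    T = set(T)
    leaving = sum(1 for u in T for v in adj[u] if v not in T)  # edges leaving T
    volume = sum(len(adj[u]) for u in T)                       # edge endpoints in T
    return leaving / volume

# Two triangles joined by a single edge: each triangle is a
# low-conductance set (1 edge leaves, 7 edge endpoints inside).
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3],
       3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
print(conductance(adj, {0, 1, 2}))  # 1/7, about 0.143
```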

Graph diffusions find low-conductance sets. A diffusion propagates rank from a seed across a graph. [Figure: diffusion values are high near the seed and low far away; the high-value region forms a local community, i.e. a low-conductance set.] Okay, so how does this work?

Graph Diffusion. A diffusion models how a mass (green dye, money, popularity) spreads from a seed across a network, through stages p_0, p_1, p_2, p_3, ... Once mass reaches a node, it propagates to the neighbors, with some decay: dye dilutes, money is taxed, popularity fades.

Diffusion score. The diffusion score of a node is a weighted sum of the mass at that node during the different stages: c_0 p_0 + c_1 p_1 + c_2 p_2 + c_3 p_3 + ... The diffusion score vector is f = sum_{k=0}^{infinity} c_k P^k s, where P is the random-walk transition matrix, s is the normalized seed vector, and c_k is the weight on stage k.
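As a sanity check, the truncated sum f ≈ sum_{k=0}^{K} c_k P^k s can be computed directly; a small sketch with plain dicts (the 4-cycle example and the geometric weights c_k = 0.5^k are illustrative assumptions):

```python
# Truncated diffusion score vector f ~ sum_{k=0}^{K} c_k P^k s,
# with the graph as an adjacency dict and vectors as plain dicts.
def diffusion_scores(adj, seed, weights):
    p = {seed: 1.0}                          # stage-0 mass: the seed vector s
    f = {}
    for c in weights:
        for u, mass in p.items():            # f += c_k * p_k
            f[u] = f.get(u, 0.0) + c * mass
        nxt = {}
        for u, mass in p.items():            # p_{k+1} = P p_k: each node sends
            share = mass / len(adj[u])       # an equal share to every neighbor
            for v in adj[u]:
                nxt[v] = nxt.get(v, 0.0) + share
        p = nxt
    return f

# Usage: a 4-cycle with geometric weights c_k = 0.5^k (a PageRank-style decay).
adj = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [0, 2]}
f = diffusion_scores(adj, seed=0, weights=[0.5 ** k for k in range(20)])
# Mass is conserved at every stage, so sum(f.values()) equals sum(weights),
# and the seed keeps the largest score.
```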

Heat Kernel vs. PageRank Diffusions. The heat kernel uses weight t^k/k! at stage k; our work is new analysis for this diffusion: (t^0/0!) p_0 + (t^1/1!) p_1 + (t^2/2!) p_2 + (t^3/3!) p_3 + ... PageRank uses weight α^k at stage k; it is the standard, widely used diffusion we use for comparison: α^0 p_0 + α^1 p_1 + α^2 p_2 + α^3 p_3 + ...

Heat Kernel vs. PageRank Behavior. HK emphasizes earlier stages of the diffusion. [Plot: stage weight vs. walk length for PR with α = 0.85 and α = 0.99, and for HK with t = 1, 5, 15; the HK weights t^k/k! fall off much faster than the PR weights α^k.] → HK involves shorter walks from the seed, → so HK looks at smaller sets than PR.
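The two weight profiles in the plot are easy to reproduce numerically; a small sketch (the normalizations (1 - alpha) and e^-t, which make each sequence sum to 1, are conventions added here, not from the slide):

```python
import math

def pr_weight(alpha, k):
    # PageRank stage weight: geometric decay, normalized to sum to 1
    return (1 - alpha) * alpha ** k

def hk_weight(t, k):
    # Heat-kernel stage weight: t^k / k!, normalized by e^-t (Poisson decay)
    return math.exp(-t) * t ** k / math.factorial(k)

# At walk length 40, the heat kernel (t = 5) assigns essentially no weight,
# while PageRank (alpha = 0.85) still does: HK emphasizes shorter walks.
print(pr_weight(0.85, 40))   # about 2e-4
print(hk_weight(5, 40))      # about 8e-23
```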

Heat Kernel vs. PageRank Theory.
PR: good conductance via a Local Cheeger Inequality (PR finds a set of near-optimal conductance), and a fast algorithm (PPR-push is O(1/(ε(1-α))) in theory, fast in practice) [Andersen Chung Lang 06].
HK: a Local Cheeger Inequality [Chung 07]; a fast algorithm is our work!

Our work on Heat Kernel: theory. THEOREM. Our algorithm for a relative ε-accuracy in a degree-weighted norm has runtime <= O(e^t (log(1/ε) + log(t)) / ε), which is constant regardless of graph size. COROLLARY. HK is local: O(1) runtime → the diffusion vector has O(1) nonzero entries.

Our work on Heat Kernel: results. The first efficient, deterministic HK algorithm. Determinism is important for comparing the behaviors of HK and PR experimentally. Our key findings: HK more accurately describes ground-truth communities in real-world networks; it identifies smaller sets → better precision; its speed and conductance are comparable with PR.

Python demo. Twitter graph: 41.6 M nodes, 2.4 B edges; un-optimized Python code on a laptop. Available for download: https://gist.github.com/dgleich/cf170a226aa848240cf4

Algorithm Outline. Computing HK: 1. Pre-compute push thresholds. 2. Do a push on all entries above threshold.

Algorithm Intuition. Computing HK given parameters t, ε, and seed s. Starting from the stages p_0, p_1, p_2, p_3 produced by the seed, how do we end up at the weighted sum (t^0/0!) p_0 + (t^1/1!) p_1 + (t^2/2!) p_2 + (t^3/3!) p_3 + ... ?

Algorithm Intuition. Begin with mass at the seed(s) in a residual staging area, r_0. The residuals r_0, r_1, r_2, r_3, ... hold mass that is unprocessed; it's like error.

Push Operation. push: (1) remove an entry from r_k, (2) put it in p, (3) then scale it and spread it to the neighbors in the next residual r_{k+1} (with the stage factors t^1/1!, t^2/2!, t^3/3!, ...); repeat.
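The push loop can then be sketched end to end. This is a simplified illustration under stated assumptions (one uniform threshold, truncation after K stages, unnormalized weights t^k/k!), not the authors' hk_grow code:

```python
import math

def hk_push(adj, seed, t=5.0, threshold=1e-4, K=30):
    f = {}                                   # accumulated heat-kernel scores
    r = {seed: 1.0}                          # residual r_0: unprocessed mass
    for k in range(K + 1):
        w = t ** k / math.factorial(k)       # stage weight t^k / k!
        nxt = {}
        for u, mass in r.items():
            if mass < threshold and k > 0:
                continue                     # left unprocessed: becomes error
            f[u] = f.get(u, 0.0) + w * mass  # (1)-(2): move the entry into p
            share = mass / len(adj[u])       # (3): scale and spread to the
            for v in adj[u]:                 # neighbors in the next residual
                nxt[v] = nxt.get(v, 0.0) + share
        r = nxt
    return f

# Usage: two triangles joined by one edge, seed in the left triangle;
# the scores concentrate on the seed's side of the bridge.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3],
       3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
f = hk_push(adj, seed=0, t=3.0)
```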

Thresholds. Entries below the threshold are left in place. The ERROR equals a weighted sum of the entries left in the residuals r_k → set the threshold so the leftovers sum to < ε. The threshold for stage r_k is determined by the sum of the remaining scale factors and the prior scaling factor.

Algorithm Outline. Computing HK: 1. Pre-compute push thresholds. 2. Do a push on all entries above threshold.

Communities in Real-world Networks. Given a seed in an unidentified real-world community, how well can HK and PR describe that community? We measure quality using the F1-measure: the harmonic mean of precision = (# correct guesses) / (# total guesses) and recall = (# of the community's members you find) / (# of members there are).

Datasets from the SNAP collection [Leskovec]:

Graph       |V|     |E|
amazon      330 K   930 K
dblp        320 K   1 M
youtube     1.1 M   3 M
lj          4 M     35 M
orkut       3.1 M   120 M
friendster  66 M    1.8 B
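The scoring itself is a one-liner; a minimal sketch with set semantics (the example sets are made up):

```python
# F1-measure of a guessed community against the ground-truth community.
def f1_score(guess, truth):
    correct = len(guess & truth)
    if correct == 0:
        return 0.0
    precision = correct / len(guess)   # correct guesses / total guesses
    recall = correct / len(truth)      # members found / members there are
    return 2 * precision * recall / (precision + recall)

print(f1_score({1, 2, 3, 4}, {2, 3, 4, 5, 6}))  # precision 0.75, recall 0.6
```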

Results (comm size is the ground-truth community size):

data        F1: HK / PR     precision: HK / PR   set size: HK / PR   comm size
amazon      0.325 / 0.140   0.244 / 0.107        193 / 15293         495
dblp        0.257 / 0.115   0.208 / 0.081        44 / 16026          1429
youtube     0.177 / 0.136   0.135 / 0.098        1010 / 6079         1615
lj          0.131 / 0.107   0.102 / 0.086        283 / 738           662
orkut       0.055 / 0.044   0.036 / 0.031        537 / 1989          4526
friendster  0.078 / 0.090   0.066 / 0.075        229 / 333           724

PR achieves high recall by guessing a huge set; HK identifies a tighter cluster, so it attains better precision.

Runtime & Conductance. HK is comparable to PR in runtime and conductance. As graphs scale, the two diffusions' performance becomes even more similar. [Plots: runtime in seconds and log10(conductance) vs. log10(|V| + |E|), showing the 25th, 50th, and 75th percentiles for hkgrow and pprgrow.]

Code, references, future work. Code available at http://www.cs.purdue.edu/homes/dgleich/codes/hkgrow. Ongoing work: generalizing to other diffusions; simultaneously computing multiple diffusions. Questions or suggestions? Email Kyle Kloster at kkloste-at-purdue-dot-edu.