Heat Kernel Based Community Detection

Size: px

Start display at page:

Download "Heat Kernel Based Community Detection"

Brian Arnold
6 years ago
Views:

1 Heat Kernel Based Community Detection Joint with David F. Gleich, (Purdue), supported by" NSF CAREER CCF Kyle Kloster! Purdue University!

2 Local Community Detection Given seed(s) S in G, find a community that contains S. seed Community?

3 Local Community Detection Given seed(s) S in G, find a community that contains S. seed Community? high internal, low external connectivity!

4 Low-conductance sets are communities # edges leaving T conductance( T ) = # edge endpoints in T = chance a random step exits T

5 Low-conductance sets are communities # edges leaving T conductance( T ) = # edge endpoints in T = chance a random step exits T conductance(comm) = 39/381 =.102 How to find these?!

6 Graph diffusions find low-conductance sets A diffusion propagates rank from a seed across a graph. seed = high! = low! diffusion value!

7 Graph diffusions find low-conductance sets A diffusion propagates rank from a seed across a graph. seed = high! = low! diffusion value! = local community /! low-conductance set! Okay " how does this work?

8 Graph Diffusion A diffusion models how a mass (green dye, money, popularity) spreads from a seed across a network. seed p 0 p 1 p 2 p 3

9 Graph Diffusion diffuse A diffusion models how a mass (green dye, money, popularity) spreads from a seed across a network. seed p 0 p 1 p 2 p 3

10 Graph Diffusion diffuse A diffusion models how a mass seed (green dye, money, popularity) spreads from a seed across a p 0 p 1 p 2 p 3 network." Once mass reaches a node, it propagates to the neighbors, with some decay. decay : dye dilutes, money is taxed, popularity fades

11 Graph Diffusion A diffusion models how a mass seed (green dye, money, popularity) spreads from a seed across a network. " Once mass reaches a node, it propagates to the neighbors, diffuse with some decay. decay : dye dilutes, money is taxed, popularity fades p 0 p 1 p 2 p 3

12 Graph Diffusion A diffusion models how a mass seed (green dye, money, popularity) spreads from a seed across a network." Once mass reaches a node, it propagates to the neighbors, with some decay. decay : dye dilutes, money is taxed, popularity fades p 0 p 1 p 2 p 3

13 Diffusion score diffusion score of a node = " weighted sum of the mass at that node during different stages. c 0 p 0 + c 1 p 1 + c 2 p 2 + c p 3 + 3

14 Diffusion score diffusion score of a node = " weighted sum of the mass at that node during different stages. c 0 p 0 + c 1 p 1 + c 2 p 2 + c p diffusion score vector = f! 1X f = c k P k s k=0 P = random-walk s = c k = transition matrix normalized seed vector weight on stage k

15 Heat Kernel vs. PageRank Diffusions Heat Kernel uses t k /k! " Our work is new analysis for this diffusion. t 0 p 0 0! t ! p 1 t 2 p 2 p 3 2! t 3 3! + PageRank uses α k at stage k." Standard, widely-used diffusion we use for comparison. α 0 p 0 + α 1 p 1 + α 2 p 2 + α3 p 3 +

16 Heat Kernel vs. PageRank Behavior HK emphasizes earlier stages of diffusion =0.99 PR, α k Weight 10 5 t=1 t=5 t=15 =0.85 HK, t k /k! Length à involve shorter walks from seed, à so HK looks at smaller sets than PR

17 Heat Kernel vs. PageRank Theory PR good conductance Local Cheeger Inequality:" PR finds set of nearoptimal conductance fast algorithm PPR-push is O(1/(ε(1-α))) in theory, fast in practice [Andersen Chung Lang 06] HK

18 Heat Kernel vs. PageRank Theory PR good conductance Local Cheeger Inequality:" PR finds set of nearoptimal conductance fast algorithm PPR-push is O(1/(ε(1-α))) in theory, fast in practice [Andersen Chung Lang 06] HK Local Cheeger Inequality [Chung 07]

19 Heat Kernel vs. PageRank Theory PR good conductance Local Cheeger Inequality:" PR finds set of nearoptimal conductance fast algorithm PPR-push is O(1/(ε(1-α))) in theory, fast in practice [Andersen Chung Lang 06] HK Local Cheeger Inequality [Chung 07] Our work!

20 Our work on Heat Kernel: theory THEOREM Our algorithm for a relative" ε-accuracy in a degree-weighted norm has runtime <= O( e t (log(1/ε) + log(t)) / ε) (which is constant, regardless of graph size)

21 Our work on Heat Kernel: theory THEOREM Our algorithm for a relative" ε-accuracy in a degree-weighted norm has runtime <= O( e t (log(1/ε) + log(t)) / ε) (which is constant, regardless of graph size) COROLLARY HK is local! (O(1) runtime à diffusion vector has O(1) entries)!

22 Our work on Heat Kernel: results First efficient, deterministic HK algorithm. Deterministic is important to be able to compare the behaviors of HK and PR experimentally: Our key findings! HK more accurately describes ground-truth communities in real-world networks! identifies smaller sets à better precision speed & conductance comparable with PR

23 Python" demo Twitter graph 41.6 M nodes 2.4 B edges un-optimized Python code on a laptop Available for download:!

24 Algorithm Outline Computing HK! 1. Pre-compute push thresholds 2. Do push on all entries above threshold

25 Algorithm Intuition Computing HK given parameters t, ε, seed s! Starting from here How to end up here? seed p 0 p 1 p 2 p 3 t 0 p 0 0! t ! p 1 t 2 p 2 p 3 2! t 3 3! +

26 Algorithm Intuition Begin with mass at seed(s) in a residual staging area, r 0 The residuals r k hold mass that is unprocessed it s like error seed r 0 r 1 r 2 r 3 t 0 p 0 0! t ! p 1 t 2 p 2 p 3 2! t 3 3! +

27 Push Operation push (1) remove entry in r k," (2) put in p, r 0 r 1 r 2 r 3 t 0 p 0 0! t ! p 1 t 2 p 2 p 3 2! t 3 3! +

28 Push Operation push (1) remove entry in r k," (2) put in p, (3) then scale and spread to neighbors in next r! t 1 1! r 0 r 1 r 2 r 3 t 0 p 0 0! t ! p 1 t 2 p 2 p 3 2! t 3 3! +

29 Push Operation push (1) remove entry in r k," (2) put in p, (3) then scale and spread to neighbors in next r (repeat) t 2 2! r 0 r 1 r 2 r 3 t 0 p 0 0! t ! p 1 t 2 p 2 p 3 2! t 3 3! +

30 Push Operation push (1) remove entry in r k," (2) put in p, (3) then scale and spread to neighbors in next r (repeat) r 0 r 1 r 2 r 3 t 2 2! t 0 p 0 0! t ! p 1 t 2 p 2 p 3 2! t 3 3! +

31 Push Operation push (1) remove entry in r k," (2) put in p, (3) then scale and spread to neighbors in next r (repeat) r 0 r 1 r 2 r 3 t 2 2! t 3 3! t 0 p 0 0! t ! p 1 t 2 p 2 p 3 2! t 3 3! +

32 Thresholds entries < threshold ERROR equals weighted sum of entries left in r k à Set threshold so leftovers sum to < ε r 0 r 1 r 2 r 3 t 0 p 0 0! t ! p 1 t 2 p 2 p 3 2! t 3 3! +

33 Thresholds entries < threshold ERROR equals weighted sum of entries left in r k à Set threshold so leftovers sum to < ε Threshold for stage r k is" sum of remaining scale factors" prior scaling factor t 0 p 0 0! r 0 r 1 r 2 r 3 t ! p 1 t 2 p 2 p 3 2! t 3 3! +

34 Algorithm Outline Computing HK! 1. Pre-compute push thresholds 2. Do push on all entries above threshold

35 Communities in Real-world Networks Given a seed in an unidentified real-world community, how well can HK and PR describe that community? Measure quality using F 1 -measure. Graph V E amazon 330 K 930 K dblp 320 K 1 M youtube 1.1 M 3 M lj 4 M 35 M orkut 3.1 M 120 M friendster 66 M 1.8 B Datasets from SNAP collection [Leskovec] F 1 -measure is the harmonic mean of " # correct guesses precision = # total guesses and recall = # answers you get # answers there are

37 data F 1 precision set size comm HK PR HK PR HK PR size amazon dblp youtube lj orkut friendster

38 data F 1 precision set size comm HK PR HK PR HK PR size amazon dblp youtube lj orkut friendster PR achieves high recall by guessing a huge set HK identifies a tighter cluster, so attains better precision

39 Runtime &" Conductance" " HK is comparable in runtime and conductance." " " As graphs scale," the diffusions performance becomes even more similar. Runtime (s) log10(conductances) Runtime: hk vs. ppr hkgrow 50% 25% 75% pprgrow 50% 25% 75% log10( V + E ) 10 0 Conductances: hk vs. ppr hkgrow 50% 25% 75% pprgrow 50% 25% 75% log10( V + E )

40 Code, references, future work Code available at!! Ongoing work! - generalizing to other diffusions - simultaneously compute multiple diffusions Questions or suggestions? Kyle Kloster at kkloste-at-purdue-dot-edu!

Facebook Friends! and Matrix Functions

Facebook Friends! and Matrix Functions! Graduate Research Day Joint with David F. Gleich, (Purdue), supported by" NSF CAREER 1149756-CCF Kyle Kloster! Purdue University! Network Analysis Use linear algebra