ISIT Tutorial: Information Theory and Machine Learning, Part II
1 ISIT Tutorial Information theory and machine learning Part II Emmanuel Abbe Princeton University Martin Wainwright UC Berkeley 1
2 Inverse problems on graphs. A large variety of machine learning and data-mining problems are about inferring global properties of a collection of agents by observing local noisy interactions of these agents. Examples:
- community detection in social networks
- image segmentation
- data classification and information retrieval
- object matching, synchronization
- page sorting
- protein-to-protein interactions
- haplotype assembly
6 Inverse problems on graphs. In each case: we observe information on the edges of a network that has been generated from hidden attributes on the nodes, and we try to infer these attributes back. Dual to the graphical model learning problem (previous part).
7 What about graph-based codes? [Diagram: codeword bits X_1, ..., X_n from a code C pass through a channel W to outputs Y_1, ..., Y_N] Different: the code is a design parameter and typically involves specific non-local interactions of the bits (e.g., random, LDPC, polar codes). (1) What are the relevant types of codes and channels behind machine learning problems? (2) What are the fundamental limits for these?
8 Outline of the talk 1. Community detection and clustering 2. Stochastic block models : fundamental limits and capacity-achieving algorithms 3. Open problems 4. Graphical channels and low-rank matrix recovery 8
9 Community detection and clustering 9
10 Networks provide local interactions among agents; one often wants to infer global similarity classes. Social networks: friendship. Call graphs: calls. Biological networks: protein interactions. Genome HiC networks: DNA contacts.
12 The challenges of community detection. A long-studied and notoriously hard problem. What is a good clustering? Assortative and disassortative relations; work with models. How to get a good clustering? Computationally hard; many heuristics... Tutorial motto: can one establish a clear line-of-sight, as in communications with the Shannon capacity? [Plot: data efficiency vs. SNR for WAN wireless technologies (HSDPA, EV-DO) against the Shannon capacity and its infeasible region]
13 The Stochastic Block Model 13
14 The stochastic block model. SBM(n, p, W): p = (p_1, ..., p_k) is a probability vector giving the relative sizes of the k communities, and W is a symmetric k x k matrix with entries in [0,1], W_ij being the probability of connecting a node of community i to a node of community j; write P = diag(p). The DMC of clustering..? [Diagram: k = 4 communities of sizes np_1, ..., np_4 with connection probabilities W_11, W_12, ..., W_44]
15 The (exact) recovery problem. Let X^n = [X_1, ..., X_n] represent the community variables of the nodes (drawn under p). Definition. An algorithm X̂^n(·) solves (exact) recovery in the SBM if, for a random graph G under the model, lim_{n→∞} P(X^n = X̂^n(G)) = 1. We will see weaker recovery requirements later. Starting point: progress in science often comes from understanding special cases...
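Before the formal limits, it helps to have the generative model in executable form. A minimal sampler sketch, assuming nothing beyond the standard library (function and variable names are ours, not from the tutorial):

```python
import random

def sample_sbm(n, p, W, seed=0):
    """Sample from SBM(n, p, W): draw each node's community from the
    probability vector p, then connect each pair u < v independently
    with probability W[community[u]][community[v]]."""
    rng = random.Random(seed)
    k = len(p)
    community = [rng.choices(range(k), weights=p)[0] for _ in range(n)]
    edges = [(u, v) for u in range(n) for v in range(u + 1, n)
             if rng.random() < W[community[u]][community[v]]]
    return community, edges

# 2-SBM: two balanced communities, intra-probability 0.8, inter 0.1
community, edges = sample_sbm(60, [0.5, 0.5], [[0.8, 0.1], [0.1, 0.8]])
```

Exact recovery asks whether the `community` list can be reconstructed from `edges` alone (up to a global relabeling of the communities).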
16 SBM with 2 symmetric communities: 2-SBM 16
17 2-SBM: two communities of n/2 nodes each, i.e., p_1 = p_2 = 1/2, with intra-community edge probability W_11 = W_22 = p and inter-community edge probability W_12 = q. [Diagram: two groups of n/2 nodes connected with probability p within and q across]
18 Some history for 2-SBM. Recovery problem: P(X̂^n = X^n) → 1.
- Bui, Chaudhuri, Leighton, Sipser '84: maxflow-mincut, p = Ω(1/n), q = o(n^{-1-4/((p+q)n)})
- Boppana '87: spectral method, (p-q)/√(p+q) = Ω(√(log(n)/n))
- Dyer, Frieze '89: min-cut via degrees, p-q = Ω(1)
- Snijders, Nowicki '97: EM algorithm, p-q = Ω(1)
- Jerrum, Sorkin '98: Metropolis algorithm, p-q = Ω(n^{-1/6+ε})
- Condon, Karp '99: augmentation algorithm, p-q = Ω(n^{-1/2+ε})
- Carson, Impagliazzo '01: hill-climbing algorithm, p-q = Ω(n^{-1/2} log^4(n))
- McSherry '01: spectral method, (p-q)/√p ≥ Ω(√(log(n)/n))
- Bickel, Chen '09: N-G modularity, (p-q)/√(p+q) = Ω(log(n)/√n)
- Rohe, Chatterjee, Yu '11: spectral method, p-q = Ω(1)
Algorithms-driven... Instead of how, when can we recover the clusters (IT)?
19 Information-theoretic view of clustering. Classical coding: a codeword X^n from a code C is sent at rate R = n/N through a memoryless channel W = BSC(ε), giving Y^N; reliable communication iff R < 1 - H(ε). 2-SBM: an unorthodox code! The community bits X^n determine the N = n(n-1)/2 edge observations Y^N (rate R ≈ 2/n), each pair being observed through the kernel W with P(edge | same community) = p and P(edge | different communities) = q; reliable communication (= exact recovery) iff ???
20 Some history for 2-SBM (continued). [Same history table as slide 18] Sharp answer for p = a log(n)/n, q = b log(n)/n [Abbe-Bandeira-Hall]: recovery iff (a+b)/2 ≥ 1 + √(ab), and this is efficiently achievable. [Plot: feasible vs. infeasible regions in the (a, b) plane]
21 Some history for 2-SBM (1983-). Recovery problem: P(X̂^n = X^n) → 1. Detection problem: ∃ε > 0 : P(d(X̂^n, X^n)/n < 1/2 - ε) → 1. [Same history table as slide 18] Recovery, p = a log(n)/n, q = b log(n)/n: solvable iff (a+b)/2 ≥ 1 + √(ab), efficiently achievable [Abbe-Bandeira-Hall]. Detection, p = a/n, q = b/n: solvable iff (a-b)^2 > 2(a+b) [conjectured by Decelle-Krzakala-Moore-Zdeborova; proved by Massoulié and Mossel-Neeman-Sly; see also Coja-Oghlan]. What about multiple/asymmetric communities? Conjecture: detection changes with 5 or more communities.
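The two thresholds summarized on this slide are closed-form conditions on (a, b); a small sketch checking them (helper names are ours):

```python
import math

def exact_recovery_solvable(a, b):
    """Exact recovery in the symmetric 2-SBM with p = a*log(n)/n,
    q = b*log(n)/n is solvable iff (a+b)/2 >= 1 + sqrt(a*b)
    [Abbe-Bandeira-Hall]."""
    return (a + b) / 2 >= 1 + math.sqrt(a * b)

def detection_solvable(a, b):
    """Detection in the 2-SBM with p = a/n, q = b/n is solvable iff
    (a-b)^2 > 2*(a+b) [Massoulie, Mossel-Neeman-Sly]."""
    return (a - b) ** 2 > 2 * (a + b)
```

For example (a, b) = (5, 1) is above the detection threshold (16 > 12) but below the exact-recovery one (3 < 1 + √5).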
22 Recovery in the 2-SBM: IT limit. Take p = a log(n)/n, q = b log(n)/n, two communities of size n/2. Converse [Abbe-Bandeira-Hall]: if (a+b)/2 - √(ab) < 1 then ML fails w.h.p. What is ML? Min-bisection. ML fails if two nodes can be swapped to reduce the cut. Writing B_i for the number of neighbors of node i inside its own community and R_i for the number across: P(∃i : B_i ≤ R_i) ≈ n P(B_1 ≤ R_1) (weak correlations), and P(B_1 ≤ R_1) = n^{-((a+b)/2 - √(ab)) + o(1)}. [Diagram: a node on each side with in-community degree B_i and cross-community degree R_i]
23 Recovery in the 2-SBM: efficient algorithms.
- spectral: max x^T A x s.t. ||x|| = 1, 1^T x = 0
- ML-decoding: max x^T A x s.t. x_i = ±1, 1^T x = 0 (NP-hard)
- lifting: X = x x^T
- SDP: max tr(AX) s.t. X_ii = 1, 1^T X 1 = 0, X ⪰ 0 (dropping rank(X) = 1)
[Abbe-Bandeira-Hall 14]
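The spectral relaxation above takes a few lines with numpy; a sketch (our naming) on an idealized two-block adjacency matrix, where the eigenvector for the second-largest eigenvalue is exactly the ±1 bisection vector:

```python
import numpy as np

def spectral_bisection(A):
    """Spectral relaxation of min-bisection: drop x_i = +/-1, keep
    ||x|| = 1 and 1^T x = 0. When the top eigenvector is (close to) the
    all-ones direction, the solution is the eigenvector of the
    second-largest eigenvalue; rounding its signs gives two clusters."""
    vals, vecs = np.linalg.eigh(A)   # eigenvalues in ascending order
    x = vecs[:, -2]                  # second-largest eigenvalue's eigenvector
    return (x > 0).astype(int)

# Expected adjacency of a 2-SBM with intra-probability 1 and inter 0.2
m = 6
J = np.ones((m, m))
A = np.block([[J - np.eye(m), 0.2 * J],
              [0.2 * J, J - np.eye(m)]])
labels = spectral_bisection(A)   # constant on each block, up to a flip
```

On a random SBM sample (rather than this expectation matrix) the same rounding works once the parameters are above the thresholds discussed on the surrounding slides.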
24 Recovery in the 2-SBM: efficient algorithms. Theorem. The SDP solves recovery if 2 L_SBM + 11^T + I_n ⪰ 0, where L_SBM = D_{G+} - D_{G-} - A. -> Analyze the spectral norm of a random matrix. [Abbe-Bandeira-Hall 14]: Bernstein, slightly loose; [Hajek-Wu-Xu 15]: Seginer bound; [Bandeira, Bandeira-Van Handel 15]: tight bound. Note that SDP can be expensive...
25 The general SBM 25
26 SBM(n, p, W). Quiz: if a node is in community i, how many neighbors does it have in expectation in community j? 1. np_j 2. np_j W_ij 3. np_i W_ij. Answer: np_j W_ij, so each community has a degree profile given by a row of n WP (equal to n PW for symmetric W): a degree profile matrix. [Diagram: k = 4 communities of sizes np_1, ..., np_4 with connection probabilities W_ij]
27 Back to the information-theoretic view of clustering. Coding: reliable communication iff R < 1 - H(ε) for the BSC, and in general iff R < max_p I(p, W) (a KL-divergence characterization). Clustering: reliable communication iff 1 < (a+b)/2 - √(ab) for the 2-SBM, and for the general SBM iff 1 < J(p, W)???
28 Main results. Theorem 1 [Abbe-Sandon 15]. Recovery is solvable in SBM(n, p, Q log(n)/n) if and only if J(p, Q) := min_{i<j} D_+((PQ)_i, (PQ)_j) ≥ 1, where D_+(μ, ν) := max_{t∈[0,1]} Σ_{ℓ∈[k]} [ t μ_ℓ + (1-t) ν_ℓ - μ_ℓ^t ν_ℓ^{1-t} ] =: max_{t∈[0,1]} D_t(μ, ν). D_t is an f-divergence Σ_ℓ ν_ℓ f(μ_ℓ/ν_ℓ) with f(x) = 1 - t + tx - x^t; D_{1/2}(μ, ν) = (1/2) ||√μ - √ν||_2^2 is the Hellinger divergence (distance), and the maximization over t, as in max_t -log Σ_ℓ μ_ℓ^t ν_ℓ^{1-t}, is that of the Chernoff divergence. We call D_+ the CH-divergence. For two symmetric communities this reduces to (1/2)(√a - √b)^2 ≥ 1 [Abbe-Bandeira-Hall 14].
29 Main results. Theorem 1. Recovery is solvable in SBM(n, p, Q log(n)/n) if and only if J(p, Q) := min_{i<j} D_+((PQ)_i, (PQ)_j) ≥ 1. Is recovery in the general SBM solvable efficiently down to the information-theoretic threshold? YES! Theorem 2. The degree-profiling algorithm achieves the threshold and runs in quasi-linear time. [Abbe-Sandon 15]
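The CH-divergence and J(p, Q) are computable by a direct grid search over t; a numerical sketch (the grid resolution and the profile convention — rows of PQ, valid for symmetric Q — are our simplifications), which reproduces the two-community special case (1/2)(√a - √b)^2:

```python
def D_t(mu, nu, t):
    """D_t(mu, nu) = sum_l [t*mu_l + (1-t)*nu_l - mu_l^t * nu_l^(1-t)]."""
    return sum(t * m + (1 - t) * v - (m ** t) * (v ** (1 - t))
               for m, v in zip(mu, nu))

def ch_divergence(mu, nu, grid=1000):
    """CH-divergence D_+(mu, nu) = max_{t in [0,1]} D_t(mu, nu),
    approximated by a grid search over t."""
    return max(D_t(mu, nu, i / grid) for i in range(grid + 1))

def J(p, Q, grid=1000):
    """J(p, Q) = min over community pairs i < j of D_+ between their
    degree profiles; recovery in SBM(n, p, Q*log(n)/n) iff J(p, Q) >= 1."""
    k = len(p)
    prof = [[p[j] * Q[i][j] for j in range(k)] for i in range(k)]
    return min(ch_divergence(prof[i], prof[j], grid)
               for i in range(k) for j in range(i + 1, k))

# 2 symmetric communities with a = 9, b = 1: J = (1/2)*(3 - 1)^2 = 2 >= 1
val = J([0.5, 0.5], [[9.0, 1.0], [1.0, 9.0]])
```

For symmetric two-community parameters the maximum over t sits at t = 1/2 (the Hellinger point), which is why the grid search lands exactly on the closed-form value.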
30 When can we extract a specific community? Theorem [Abbe-Sandon 15]. If community i has a profile (PQ)_i at D_+-distance at least 1 from all other profiles (PQ)_j, j ≠ i, then it can be extracted w.h.p. What if we do not know the parameters? We can learn them on the fly: [Abbe-Sandon 15] (second paper).
31 Proof techniques and algorithms 31
32 A key step: testing degree profiles. Hypothesis 1: the degree profile d_v of node v is ~ P(log(n)(PQ)_1); Hypothesis 2: d_v ~ P(log(n)(PQ)_2). Theorem [Abbe-Sandon 15]. For any θ_1, θ_2 ∈ (R_+ \ {0})^k with θ_1 ≠ θ_2 and priors p_1, p_2 ∈ R_+ \ {0}, Σ_{x ∈ Z_+^k} min(P_{ln(n)θ_1}(x) p_1, P_{ln(n)θ_2}(x) p_2) = n^{-D_+(θ_1, θ_2) - o(1)}, where P_λ is the multivariate Poisson distribution with mean λ and D_+ is the CH-divergence.
33 How to use this? Plan: put effort in recovering most of the nodes and then finish greedily with local improvements [Abbe-Sandon 15] 33
34 The degree-profiling algorithm (capacity-achieving). (1) Split the edges of G into two graphs. (2) Run Sphere-comparison on the first graph -> classifies a fraction 1-o(1) of the nodes (see next). (3) In the second graph, reclassify each node v given that clustering by testing its degree profile d_v: Hypothesis i: d_v ~ P(log(n)(PQ)_i); the test error is P_e = n^{-D_+((PQ)_1, (PQ)_2) + o(1)}. [Abbe-Sandon 15]
35 How do we get most nodes correctly? 35
36 Other recovery requirements. Partial recovery: an algorithm solves partial recovery in the SBM with accuracy c if it produces a clustering which is correct on a fraction c of the nodes with high probability. Weak recovery, or detection (for the symmetric k-SBM): c = 1/k + ε for some ε > 0. Almost exact recovery: c = 1 - o(1). Exact recovery: c = 1. For all the above: what are the efficient vs. information-theoretic fundamental limits?
37 Partial recovery in SBM(n, p, Q/n). What is a good notion of SNR? Proposed notion: SNR = λ_min^2 / λ_max, the squared smallest over the largest eigenvalue of PQ (think of separating Bin(n/2, a/n) from Bin(n/2, b/n)). This gives (a-b)^2 / (2(a+b)) for 2 symmetric communities and (a-b)^2 / (k(a+(k-1)b)) for k symmetric communities. Theorem (informal) [Abbe-Sandon 15]. In the sparse SBM(n, p, Q/n), the Sphere-comparison algorithm recovers a fraction of nodes which approaches 1 when the SNR diverges. Note that the SNR scales if Q scales!
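The proposed SNR is a closed-form function of the parameters; a short sketch (our naming), also illustrating the last remark that the SNR scales when Q scales:

```python
def snr_symmetric(a, b, k):
    """SNR for the symmetric k-SBM(n, a/n, b/n): the squared smallest
    over the largest eigenvalue of PQ, i.e. (a-b)^2 / (k*(a+(k-1)*b));
    for k = 2 this is (a-b)^2 / (2*(a+b))."""
    return (a - b) ** 2 / (k * (a + (k - 1) * b))

# Scaling Q (i.e. both a and b) by s scales the SNR by s
s = 3.0
ratio = snr_symmetric(s * 5, s * 1, 2) / snr_symmetric(5, 1, 2)
```

The scaling property is immediate from homogeneity: the numerator is quadratic in (a, b) while the denominator is linear.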
38 A node neighborhood in SBM(n, p, Q/n): Sphere-comparison. For a node v in community i, the expected number of depth-1 neighbors in community k is about n p_k Q_ik, so the expected community profile of the neighborhood is (PQ)_i, and that of the depth-r sphere N_r(v) grows like ((PQ)^r)_i. [Abbe-Sandon 15]
39 Sphere-comparison: take now two nodes, v in community i and v' in community j. Compare v and v' from N_r(v) ∩ N_{r'}(v')? Hard to analyze... [Abbe-Sandon 15]
40 Sphere-comparison, decorrelated. Subsample the edges of G with probability c to get E, and compare v and v' through N_{r,r'}[E](v · v') := the number of edges of E crossing between N_{r[G\E]}(v) and N_{r'[G\E]}(v'). Then E[N_{r,r'}[E](v · v')] ≈ N_{r[G\E]}(v)^T (cQ/n) N_{r'[G\E]}(v') ≈ (((1-c)PQ)^r e_v)^T (cQ/n) (((1-c)PQ)^{r'} e_{v'}) = c(1-c)^{r+r'} e_v^T Q (PQ)^{r+r'} e_{v'} / n. Additional steps: 1. look at several depths -> Vandermonde system; 2. use anchor nodes. [Abbe-Sandon 15]
41 A real data example 41
42 The political blogs network: 1222 blogs (left- and right-leaning) [Adamic and Glance 05]; edge = hyperlink between blogs. The CH-divergence is close to 1; we can recover 95% of the nodes correctly.
43 Some open problems in community detection 43
44 Some open problems in community detection. I. The SBM: a. Recovery. Growing number of communities? Sub-linear communities? [Abbe-Sandon 15] should extend to k = o(log(n)); [Chen-Xu 14] let k, p, q scale with n polynomially.
45 Some open problems in community detection. I. The SBM: a. Recovery; b. Detection and broadcasting on trees. Converse for detection in the 2-SBM with p = a/n, q = b/n [Mossel-Neeman-Sly 13]: the local neighborhood of a node is a Galton-Watson tree with Poisson(c) offspring, c = (a+b)/2, on which the community bit is broadcast through BSC(b/(a+b)) edges. If (a+b)/2 <= 1 the tree dies w.p. 1; if (a+b)/2 > 1 the tree survives w.p. > 0. Unorthodox broadcasting problem: when can we detect the root bit X_r? If and only if c(1-2ε)^2 > 1 [Evans-Kenyon-Peres-Schulman 00], which with c = (a+b)/2 and ε = b/(a+b) is equivalent to (a-b)^2 / (2(a+b)) > 1.
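The equivalence at the end of the slide — the Kesten-Stigum condition c(1-2ε)^2 > 1 specializing to (a-b)^2/(2(a+b)) > 1 — can be checked mechanically (function names are ours):

```python
def kesten_stigum(c, eps):
    """Root-bit detection on a Poisson(c) Galton-Watson tree with BSC(eps)
    broadcast on the edges is possible iff c*(1 - 2*eps)^2 > 1
    [Evans-Kenyon-Peres-Schulman 00]."""
    return c * (1 - 2 * eps) ** 2 > 1

def detection_2sbm(a, b):
    """Detection condition for the 2-SBM via its local tree approximation:
    offspring mean c = (a+b)/2, flip probability eps = b/(a+b)."""
    return kesten_stigum((a + b) / 2, b / (a + b))
```

Expanding c(1-2ε)^2 with these substitutions gives (a+b)/2 · ((a-b)/(a+b))^2 = (a-b)^2/(2(a+b)), matching the detection threshold from the earlier slides.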
46 Some open problems in community detection. I. The SBM: b. Detection and broadcasting on trees. For k clusters? SNR = (a-b)^2 / (k(a+(k-1)b)). [Diagram: impossible / possible / efficient regimes on the SNR axis, with thresholds c_k and 1] Conjecture [Decelle-Krzakala-Zdeborova-Moore 11]. For the symmetric k-SBM(n, a, b), there exists c_k s.t. (1) if SNR < c_k, then detection cannot be solved; (2) if c_k < SNR < 1, then detection can be solved information-theoretically but not efficiently; (3) if SNR > 1, then detection can be solved efficiently. Moreover c_k = 1 for k ∈ {2, 3, 4} and c_k < 1 for k ≥ 5.
47 Some open problems in community detection. I. The SBM: c. Partial recovery and the SNR-distortion curve. Conjecture. For the symmetric k-SBM(n, a, b) and accuracy α ∈ (1/k, 1), there exist thresholds (an information-theoretic one and a computational one) s.t. partial recovery of accuracy α is solvable if and only if SNR exceeds the first, and efficiently solvable iff SNR exceeds the second. [Plot: accuracy α vs. SNR (the distortion curve), from α = 1/2 at low SNR up to α = 1, sketched for k = 2]
48 Some open problems in community detection. II. Other block models: censored block models, labelled block models. 2-CBM(n, p, ε) [Abbe-Bandeira-Bracher-Singer 14; see also Xu-Lelarge-Massoulié, Saad-Krzakala-Lelarge-Zdeborova, Chin-Rao-Vu, Hajek-Wu-Xu]: each node pair is observed independently with probability p, and the observed binary label is flipped with probability ε relative to the community relation of the pair. Related: correlation clustering [Bansal, Blum, Chawla 04]; LDGM codes [Kumar, Pakzad, Salavati, Shokrollahi 12]; labelled block model [Heimlicher, Lelarge, Massoulié 12]; soft CSPs [Abbe, Montanari 13]; pairwise measurements [Chen, Suh, Goldsmith 14 and 15]; bounded-size correlation clustering [Puleo, Milenkovic 14].
49 Some open problems in community detection. II. Other block models: censored block models, labelled block models, degree-corrected block models [Karrer-Newman 11], mixed-membership block models [Airoldi-Blei-Fienberg-Xing 08], overlapping block models [Abbe-Sandon 15]. OSBM(n, p, f): p is a distribution on {0,1}^s; connect (u, v) with probability f(X_u, X_v), where f : {0,1}^s × {0,1}^s -> [0,1].
50 Some open problems in community detection. II. Other block models: censored block models, labelled block models, degree-corrected block models, mixed-membership block models, overlapping block models, planted community [Deshpande-Montanari 14, Montanari 15].
51 Some open problems in community detection II. Other block models - Censored block models - Labelled block models - Degree-corrected block models - Mixed-membership block models - Overlapping block models - Planted community - Hypergraph models 51
52 Some open problems in community detection II. Other block models - Censored block models - Labelled block models - Degree-corrected block models - Mixed-membership block models - Overlapping block models - Planted community - Hypergraph models For all the above: Is there a CH-divergence behind? A generalized notion of SNR? Detection gaps? Efficient algorithms? 52
53 Some open problems in community detection. III. Beyond block models: a. Exchangeable random arrays and graphons: w : [0,1]^2 -> [0,1], P(E_ij = 1 | X_i = x_i, X_j = x_j) = w(x_i, x_j) [Lovász]. SBMs can approximate graphons [Choi-Wolfe, Airoldi-Costa-Chan]. b. Graphical channels: a general information-theoretic model?
54 Graphical channels [Abbe-Montanari 13]. A family of channels motivated by inference-on-graph problems: G = (V, E) a k-hypergraph and Q : X^k -> Y a channel (kernel); for x ∈ X^V, y ∈ Y^E, P(y|x) = Π_{e ∈ E(G)} Q(y_e | x[e]). How much information do graphical channels carry? Theorem. lim_{n→∞} (1/n) I(X^n; G) exists for ER graphs and some kernels. -> Why not always? What is the limit?
55 Connection: sparse PCA and clustering 55
56 SBM and low-rank Gaussian model. Spiked Wigner model: Y = √(λ/n) X X^T + Z, with Z Gaussian symmetric. With X = (X_1, ..., X_n) i.i.d. Bernoulli(ε) this is sparse-PCA [Amini-Wainwright 09, Deshpande-Montanari 14]; with X i.i.d. Rademacher(1/2) it mirrors the SBM. Theorem [Deshpande-Abbe-Montanari 15]. If λ_n = n(p_n - q_n)^2 / (2(p_n + q_n)) -> λ (finite SNR) with np_n, nq_n -> ∞ (large degrees), then lim_{n→∞} (1/n) I(X; G(n, p_n, q_n)) = lim_{n→∞} (1/n) I(X; Y), and this limit has a single-letter expression in terms of the fixed point γ* solving γ = λ(1 - MMSE(γ)), where MMSE(γ) is the minimum mean-square error of estimating X_1 from Y_1(γ) = √γ X_1 + Z_1 (via I-MMSE [Guo-Shamai-Verdú]).
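The fixed-point equation γ = λ(1 - MMSE(γ)) can be iterated numerically. A sketch for the Rademacher(1/2) prior (the scalar MMSE is written via the standard Gaussian-channel identity MMSE(γ) = 1 - E[tanh(γ + √γ Z)]; the integration grid and iteration count are our choices), showing the transition at λ = 1:

```python
import math

def mmse_rademacher(g, grid=2001, zmax=8.0):
    """Scalar MMSE for X ~ Rademacher(1/2) observed through
    Y = sqrt(g)*X + Z, Z ~ N(0,1):
    MMSE(g) = 1 - E[tanh(g + sqrt(g)*Z)] (Nishimori identity).
    The expectation is a Riemann sum against the Gaussian density."""
    if g <= 0:
        return 1.0
    dz = 2 * zmax / (grid - 1)
    e = sum(math.tanh(g + math.sqrt(g) * z) * math.exp(-z * z / 2)
            for z in (-zmax + i * dz for i in range(grid)))
    return 1.0 - e * dz / math.sqrt(2 * math.pi)

def state_evolution(lam, iters=60):
    """Iterate gamma <- lam * (1 - MMSE(gamma)) starting from gamma = lam:
    the iterates collapse to 0 for lam <= 1 and settle at a positive
    fixed point for lam > 1."""
    g = lam
    for _ in range(iters):
        g = max(lam * (1.0 - mmse_rademacher(g)), 0.0)
    return g
```

Below λ = 1 the only fixed point is γ = 0 (trivial estimation); above it a positive γ* appears, and it is this γ* that enters the single-letter formula for lim (1/n) I(X; Y).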
57 Conclusion. Community detection couples naturally with the channel view of information theory, and more specifically with graph-based codes, f-divergences, broadcasting problems, I-MMSE, ... (all in unorthodox versions). More generally, the problem of inferring global similarity classes in data sets from noisy local interactions is at the center of many problems in ML, and an information-theoretic view of these problems seems needed and powerful.
58 Questions?
More information5. Density evolution. Density evolution 5-1
5. Density evolution Density evolution 5-1 Probabilistic analysis of message passing algorithms variable nodes factor nodes x1 a x i x2 a(x i ; x j ; x k ) x3 b x4 consider factor graph model G = (V ;
More informationFinal Exam, Machine Learning, Spring 2009
Name: Andrew ID: Final Exam, 10701 Machine Learning, Spring 2009 - The exam is open-book, open-notes, no electronics other than calculators. - The maximum possible score on this exam is 100. You have 3
More information11. Learning graphical models
Learning graphical models 11-1 11. Learning graphical models Maximum likelihood Parameter learning Structural learning Learning partially observed graphical models Learning graphical models 11-2 statistical
More informationAdvanced Machine Learning & Perception
Advanced Machine Learning & Perception Instructor: Tony Jebara Topic 6 Standard Kernels Unusual Input Spaces for Kernels String Kernels Probabilistic Kernels Fisher Kernels Probability Product Kernels
More informationCS369N: Beyond Worst-Case Analysis Lecture #4: Probabilistic and Semirandom Models for Clustering and Graph Partitioning
CS369N: Beyond Worst-Case Analysis Lecture #4: Probabilistic and Semirandom Models for Clustering and Graph Partitioning Tim Roughgarden April 25, 2010 1 Learning Mixtures of Gaussians This lecture continues
More informationTHE PHYSICS OF COUNTING AND SAMPLING ON RANDOM INSTANCES. Lenka Zdeborová
THE PHYSICS OF COUNTING AND SAMPLING ON RANDOM INSTANCES Lenka Zdeborová (CEA Saclay and CNRS, France) MAIN CONTRIBUTORS TO THE PHYSICS UNDERSTANDING OF RANDOM INSTANCES Braunstein, Franz, Kabashima, Kirkpatrick,
More informationECE 8201: Low-dimensional Signal Models for High-dimensional Data Analysis
ECE 8201: Low-dimensional Signal Models for High-dimensional Data Analysis Lecture 7: Matrix completion Yuejie Chi The Ohio State University Page 1 Reference Guaranteed Minimum-Rank Solutions of Linear
More informationSHARED INFORMATION. Prakash Narayan with. Imre Csiszár, Sirin Nitinawarat, Himanshu Tyagi, Shun Watanabe
SHARED INFORMATION Prakash Narayan with Imre Csiszár, Sirin Nitinawarat, Himanshu Tyagi, Shun Watanabe 2/40 Acknowledgement Praneeth Boda Himanshu Tyagi Shun Watanabe 3/40 Outline Two-terminal model: Mutual
More informationEE229B - Final Project. Capacity-Approaching Low-Density Parity-Check Codes
EE229B - Final Project Capacity-Approaching Low-Density Parity-Check Codes Pierre Garrigues EECS department, UC Berkeley garrigue@eecs.berkeley.edu May 13, 2005 Abstract The class of low-density parity-check
More informationAlgorithms for Graph Partitioning on the Planted Partition Model
Algorithms for Graph Partitioning on the Planted Partition Model Anne Condon, 1,* Richard M. Karp 2, 1 Computer Sciences Department, University of Wisconsin, 1210 West Dayton St., Madison, WI 53706; e-mail:
More informationPhase transitions in low-rank matrix estimation
Phase transitions in low-rank matrix estimation Florent Krzakala http://krzakala.github.io/lowramp/ arxiv:1503.00338 & 1507.03857 Thibault Lesieur Lenka Zeborová Caterina Debacco Cris Moore Jess Banks
More informationTwo-sample hypothesis testing for random dot product graphs
Two-sample hypothesis testing for random dot product graphs Minh Tang Department of Applied Mathematics and Statistics Johns Hopkins University JSM 2014 Joint work with Avanti Athreya, Vince Lyzinski,
More informationImpact of regularization on Spectral Clustering
Impact of regularization on Spectral Clustering Antony Joseph and Bin Yu December 5, 2013 Abstract The performance of spectral clustering is considerably improved via regularization, as demonstrated empirically
More informationELE539A: Optimization of Communication Systems Lecture 15: Semidefinite Programming, Detection and Estimation Applications
ELE539A: Optimization of Communication Systems Lecture 15: Semidefinite Programming, Detection and Estimation Applications Professor M. Chiang Electrical Engineering Department, Princeton University March
More informationSHARED INFORMATION. Prakash Narayan with. Imre Csiszár, Sirin Nitinawarat, Himanshu Tyagi, Shun Watanabe
SHARED INFORMATION Prakash Narayan with Imre Csiszár, Sirin Nitinawarat, Himanshu Tyagi, Shun Watanabe 2/41 Outline Two-terminal model: Mutual information Operational meaning in: Channel coding: channel
More informationLearning Gaussian Graphical Models with Unknown Group Sparsity
Learning Gaussian Graphical Models with Unknown Group Sparsity Kevin Murphy Ben Marlin Depts. of Statistics & Computer Science Univ. British Columbia Canada Connections Graphical models Density estimation
More informationA Potpourri of Nonlinear Algebra
a a 2 a 3 + c c 2 c 3 =2, a a 3 b 2 + c c 3 d 2 =0, a 2 a 3 b + c 2 c 3 d =0, a 3 b b 2 + c 3 d d 2 = 4, a a 2 b 3 + c c 2 d 3 =0, a b 2 b 3 + c d 2 d 3 = 4, a 2 b b 3 + c 2 d 3 d =4, b b 2 b 3 + d d 2
More informationLinear and conic programming relaxations: Graph structure and message-passing
Linear and conic programming relaxations: Graph structure and message-passing Martin Wainwright UC Berkeley Departments of EECS and Statistics Banff Workshop Partially supported by grants from: National
More informationMixed Membership Stochastic Blockmodels
Mixed Membership Stochastic Blockmodels Journal of Machine Learning Research, 2008 by E.M. Airoldi, D.M. Blei, S.E. Fienberg, E.P. Xing as interpreted by Ted Westling STAT 572 Intro Talk April 22, 2014
More informationMixed Membership Stochastic Blockmodels
Mixed Membership Stochastic Blockmodels (2008) Edoardo M. Airoldi, David M. Blei, Stephen E. Fienberg and Eric P. Xing Herrissa Lamothe Princeton University Herrissa Lamothe (Princeton University) Mixed
More informationStatistical Methods for SVM
Statistical Methods for SVM Support Vector Machines Here we approach the two-class classification problem in a direct way: We try and find a plane that separates the classes in feature space. If we cannot,
More informationLecture 14 February 28
EE/Stats 376A: Information Theory Winter 07 Lecture 4 February 8 Lecturer: David Tse Scribe: Sagnik M, Vivek B 4 Outline Gaussian channel and capacity Information measures for continuous random variables
More informationOn the low-rank approach for semidefinite programs arising in synchronization and community detection
JMLR: Workshop and Conference Proceedings vol 49:1, 016 On the low-rank approach for semidefinite programs arising in synchronization and community detection Afonso S. Bandeira Department of Mathematics,
More informationPart 1: Les rendez-vous symétriques
Part 1: Les rendez-vous symétriques Cris Moore, Santa Fe Institute Joint work with Alex Russell, University of Connecticut A walk in the park L AND NL-COMPLETENESS 313 there are n locations in a park you
More informationConvex Relaxation Methods for Community Detection
Convex Relaxation Methods for Community Detection Xiaodong Li, Yudong Chen and Jiaming Xu University of California, Davis, Cornell University and Duke University arxiv:1810.00315v1 [math.st] 30 Sep 2018
More informationSpectral Method and Regularized MLE Are Both Optimal for Top-K Ranking
Spectral Method and Regularized MLE Are Both Optimal for Top-K Ranking Yuxin Chen Electrical Engineering, Princeton University Joint work with Jianqing Fan, Cong Ma and Kaizheng Wang Ranking A fundamental
More informationLinear inverse problems on Erdős-Rényi graphs: Information-theoretic limits and efficient recovery
Linear inverse probles on Erdős-Rényi graphs: Inforation-theoretic liits and efficient recovery Eanuel Abbe, Afonso S. Bandeira, Annina Bracher, Ait Singer Princeton University Abstract This paper considers
More informationDraft. On Limiting Expressions for the Capacity Region of Gaussian Interference Channels. Mojtaba Vaezi and H. Vincent Poor
Preliminaries Counterexample Better Use On Limiting Expressions for the Capacity Region of Gaussian Interference Channels Mojtaba Vaezi and H. Vincent Poor Department of Electrical Engineering Princeton
More informationRapid, Robust, and Reliable Blind Deconvolution via Nonconvex Optimization
Rapid, Robust, and Reliable Blind Deconvolution via Nonconvex Optimization Shuyang Ling Department of Mathematics, UC Davis Oct.18th, 2016 Shuyang Ling (UC Davis) 16w5136, Oaxaca, Mexico Oct.18th, 2016
More informationCS6220: DATA MINING TECHNIQUES
CS6220: DATA MINING TECHNIQUES Matrix Data: Classification: Part 2 Instructor: Yizhou Sun yzsun@ccs.neu.edu September 21, 2014 Methods to Learn Matrix Data Set Data Sequence Data Time Series Graph & Network
More informationDATA MINING LECTURE 9. Minimum Description Length Information Theory Co-Clustering
DATA MINING LECTURE 9 Minimum Description Length Information Theory Co-Clustering MINIMUM DESCRIPTION LENGTH Occam s razor Most data mining tasks can be described as creating a model for the data E.g.,
More information18.2 Continuous Alphabet (discrete-time, memoryless) Channel
0-704: Information Processing and Learning Spring 0 Lecture 8: Gaussian channel, Parallel channels and Rate-distortion theory Lecturer: Aarti Singh Scribe: Danai Koutra Disclaimer: These notes have not
More informationMatrix Completion: Fundamental Limits and Efficient Algorithms
Matrix Completion: Fundamental Limits and Efficient Algorithms Sewoong Oh PhD Defense Stanford University July 23, 2010 1 / 33 Matrix completion Find the missing entries in a huge data matrix 2 / 33 Example
More informationLecture 16 Oct 21, 2014
CS 395T: Sublinear Algorithms Fall 24 Prof. Eric Price Lecture 6 Oct 2, 24 Scribe: Chi-Kit Lam Overview In this lecture we will talk about information and compression, which the Huffman coding can achieve
More informationDeciphering and modeling heterogeneity in interaction networks
Deciphering and modeling heterogeneity in interaction networks (using variational approximations) S. Robin INRA / AgroParisTech Mathematical Modeling of Complex Systems December 2013, Ecole Centrale de
More informationCoding into a source: an inverse rate-distortion theorem
Coding into a source: an inverse rate-distortion theorem Anant Sahai joint work with: Mukul Agarwal Sanjoy K. Mitter Wireless Foundations Department of Electrical Engineering and Computer Sciences University
More informationProbability Models of Information Exchange on Networks Lecture 3
Probability Models of Information Exchange on Networks Lecture 3 Elchanan Mossel UC Berkeley All Right Reserved The Bayesian View of the Jury Theorem Recall: we assume 0/1 with prior probability (0.5,0.5).
More informationSVD, Power method, and Planted Graph problems (+ eigenvalues of random matrices)
Chapter 14 SVD, Power method, and Planted Graph problems (+ eigenvalues of random matrices) Today we continue the topic of low-dimensional approximation to datasets and matrices. Last time we saw the singular
More informationInformation-theoretic Limits for Community Detection in Network Models
Information-theoretic Limits for Community Detection in Network Models Chuyang Ke Department of Computer Science Purdue University West Lafayette, IN 47907 cke@purdue.edu Jean Honorio Department of Computer
More informationCommunity Detection by L 0 -penalized Graph Laplacian
Community Detection by L 0 -penalized Graph Laplacian arxiv:1706.10273v1 [stat.me] 30 Jun 2017 Chong Chen School of Mathematical Science, Peking University and Ruibin Xi School of Mathematical Science,
More informationSTA 4273H: Statistical Machine Learning
STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 11 Project
More informationMixed Membership Stochastic Blockmodels
Mixed Membership Stochastic Blockmodels Journal of Machine Learning Research, 2008 by E.M. Airoldi, D.M. Blei, S.E. Fienberg, E.P. Xing as interpreted by Ted Westling STAT 572 Final Talk May 8, 2014 Ted
More informationCSCE 478/878 Lecture 6: Bayesian Learning
Bayesian Methods Not all hypotheses are created equal (even if they are all consistent with the training data) Outline CSCE 478/878 Lecture 6: Bayesian Learning Stephen D. Scott (Adapted from Tom Mitchell
More information