ELE 538B: Mathematics of High-Dimensional Data. Spectral methods. Yuxin Chen Princeton University, Fall 2018
2 Outline
- A motivating application: graph clustering
- Distance and angles between two subspaces
- Eigen-space perturbation theory
- Extension: singular subspaces
- Extension: eigen-space for asymmetric transition matrices
Spectral methods 2-2
3 A motivating application: graph clustering
4 Graph clustering / community detection
Community structures are common in many social networks (figure credit: The Future Buzz; figure credit: S. Papadopoulos)
Goal: partition users into several clusters based on their friendships / similarities
Spectral methods 2-4
5 A simple model: stochastic block model (SBM)
x_i = 1: 1st community; x_i = −1: 2nd community
- n nodes {1, ..., n}, 2 communities
- n unknown variables: x_1, ..., x_n ∈ {1, −1} encode community memberships
Spectral methods 2-5
6 A simple model: stochastic block model (SBM)
Observe a graph G: each pair (i, j) ∈ G with prob. p if i and j are from the same community, and with prob. q else. Here, p > q and p, q ≳ (log n)/n.
Goal: recover the community memberships of all nodes, i.e. {x_i}
Spectral methods 2-6
7 Adjacency matrix
Consider the adjacency matrix A ∈ {0, 1}^{n×n} of G:
A_{i,j} = 1 if (i, j) ∈ G, and A_{i,j} = 0 else
WLOG, suppose x_1 = ... = x_{n/2} = 1 and x_{n/2+1} = ... = x_n = −1
Spectral methods 2-7
8 Adjacency matrix
A = E[A] (rank 2) + (A − E[A])
E[A] = [p 11ᵀ, q 11ᵀ; q 11ᵀ, p 11ᵀ] = ((p+q)/2) 11ᵀ (uninformative bias) + ((p−q)/2) xxᵀ, where x := [x_i]_{1≤i≤n}
Spectral methods 2-8
9 Spectral clustering
A = E[A] (rank 2) + (A − E[A])
1. compute the leading eigenvector û = [û_i]_{1≤i≤n} of A − ((p+q)/2) 11ᵀ
2. rounding: output x̂_i = 1 if û_i > 0, and x̂_i = −1 if û_i < 0
Spectral methods 2-9
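The two-step procedure above can be sketched numerically. This is a minimal illustration, not the lecture's code; the parameters n, p, q below are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, q = 200, 0.5, 0.1          # illustrative SBM parameters

# ground-truth memberships: first half +1, second half -1
x = np.concatenate([np.ones(n // 2), -np.ones(n // 2)])

# sample a symmetric adjacency matrix: edge prob. p within, q across communities
prob = np.where(np.outer(x, x) > 0, p, q)
A = np.triu(rng.random((n, n)) < prob, 1).astype(float)
A = A + A.T

# step 1: leading eigenvector of A - (p+q)/2 * 11^T
B = A - (p + q) / 2 * np.ones((n, n))
vals, vecs = np.linalg.eigh(B)
u_hat = vecs[:, np.argmax(np.abs(vals))]

# step 2: rounding by sign
x_hat = np.where(u_hat > 0, 1.0, -1.0)

# fraction correctly clustered, up to a global sign flip
acc = max(np.mean(x_hat == x), np.mean(-x_hat == x))
```

With this eigen-gap ((p−q)n/2 = 40, much larger than the typical perturbation norm), the rounded labels recover nearly all memberships.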
10 Spectral clustering
Rationale: recovery is reliable if the perturbation A − E[A] is sufficiently small
- if ‖A − E[A]‖ = 0, then û = ±(1/√n) x ⟹ perfect clustering
Question: how to quantify the effect of the perturbation A − E[A] on û
Spectral methods 2-10
11 Distance and angles between two subspaces
12 Setup and notation
Consider 2 symmetric matrices M, M̂ = M + H ∈ R^{n×n} with eigen-decompositions
M = Σ_{i=1}^n λ_i u_i u_iᵀ and M̂ = Σ_{i=1}^n λ̂_i û_i û_iᵀ
where λ_1 ≥ ... ≥ λ_n and λ̂_1 ≥ ... ≥ λ̂_n. For simplicity, write
M = [U_0, U_1] diag(Λ_0, Λ_1) [U_0, U_1]ᵀ and M̂ = [Û_0, Û_1] diag(Λ̂_0, Λ̂_1) [Û_0, Û_1]ᵀ
Here, U_0 = [u_1, ..., u_r] and Λ_0 = diag([λ_1, ..., λ_r])
Spectral methods 2-12
13 Setup and notation
M = [U_0, U_1] [Λ_0, 0; 0, Λ_1] [U_0, U_1]ᵀ, with U_0 = [u_1, ..., u_r], U_1 = [u_{r+1}, ..., u_n], Λ_0 = diag(λ_1, ..., λ_r), Λ_1 = diag(λ_{r+1}, ..., λ_n)
Spectral methods 2-13
14 Setup and notation
- ‖M‖: spectral norm (largest singular value of M)
- ‖M‖_F: Frobenius norm (‖M‖_F = √tr(MᵀM) = √(Σ_{i,j} M_{i,j}²))
Spectral methods 2-14
15 Eigen-space perturbation theory
Main focus: how does the perturbation H affect the distance between U_0 and Û_0?
Question #0: how to define the distance between two subspaces
‖U_0 − Û_0‖_F and ‖U_0 − Û_0‖ are not appropriate, since they fall short of accounting for global orthonormal transformations: for any orthonormal R ∈ R^{r×r}, U_0 and U_0 R represent the same subspace
Spectral methods 2-15
16 Distance between two eigen-spaces
One metric that takes care of global orthonormal transformations is
dist(X_0, Z_0) := ‖X_0 X_0ᵀ − Z_0 Z_0ᵀ‖ (2.1)
This metric has several equivalent expressions:
Lemma 2.1. Suppose X := [X_0, X_1] and Z := [Z_0, Z_1] are orthonormal matrices (X_1 and Z_1 span the complement subspaces). Then
dist(X_0, Z_0) = ‖X_0ᵀ Z_1‖ = ‖Z_0ᵀ X_1‖
- sanity check: if X_0 = Z_0, then dist(X_0, Z_0) = ‖X_0ᵀ Z_1‖ = 0
Spectral methods 2-16
17 Principal angles between two eigen-spaces
In addition to distance, one might also be interested in angles.
We can quantify the similarity between two lines (represented resp. by unit vectors x_0 and z_0) by the angle between them: θ = arccos⟨x_0, z_0⟩
Spectral methods 2-17
18 Principal angles between two eigen-spaces
For r-dimensional subspaces, one needs r angles. Specifically, given ‖X_0ᵀ Z_0‖ ≤ 1, write the SVD of X_0ᵀ Z_0 ∈ R^{r×r} as
X_0ᵀ Z_0 = U diag(cos θ_1, ..., cos θ_r) Vᵀ := U cos Θ Vᵀ
where {θ_1, ..., θ_r} are called the principal angles between X_0 and Z_0
Spectral methods 2-18
19 Relations between principal angles and dist(·, ·)
As expected, principal angles and distance are closely related.
Lemma 2.2. Suppose X := [X_0, X_1] and Z := [Z_0, Z_1] are orthonormal matrices. Then
‖X_0ᵀ Z_1‖ = ‖sin Θ‖ = max{sin θ_1, ..., sin θ_r}
Lemmas 2.1 and 2.2 taken collectively give
dist(X_0, Z_0) = max{sin θ_1, ..., sin θ_r} (2.2)
Spectral methods 2-19
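Lemmas 2.1 and 2.2 are easy to verify numerically. The sketch below draws two random orthonormal bases (the dimensions n, r are arbitrary) and checks that the three expressions agree with definition (2.1):

```python
import numpy as np

rng = np.random.default_rng(1)
n, r = 8, 3

# random orthonormal bases X = [X0, X1] and Z = [Z0, Z1]
X, _ = np.linalg.qr(rng.standard_normal((n, n)))
Z, _ = np.linalg.qr(rng.standard_normal((n, n)))
X0, X1 = X[:, :r], X[:, r:]
Z0, Z1 = Z[:, :r], Z[:, r:]

# definition (2.1): spectral norm of the projector difference
dist = np.linalg.norm(X0 @ X0.T - Z0 @ Z0.T, 2)

# Lemma 2.1: equals ||X0^T Z1|| and ||Z0^T X1||
d1 = np.linalg.norm(X0.T @ Z1, 2)
d2 = np.linalg.norm(Z0.T @ X1, 2)

# Lemma 2.2: equals the largest principal sine, from the SVD of X0^T Z0
cos_theta = np.linalg.svd(X0.T @ Z0, compute_uv=False)
d3 = np.sqrt(1.0 - np.min(cos_theta) ** 2)   # max sin = sqrt(1 - min cos^2)
```

All four quantities coincide up to floating-point error.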
20 Proof of Lemma 2.2
‖X_0ᵀ Z_1‖ = ‖X_0ᵀ Z_1 Z_1ᵀ X_0‖^{1/2} = ‖X_0ᵀ (I − Z_0 Z_0ᵀ) X_0‖^{1/2} = ‖I − X_0ᵀ Z_0 Z_0ᵀ X_0‖^{1/2}
= ‖I − U cos² Θ Uᵀ‖^{1/2} (since X_0ᵀ Z_0 = U cos Θ Vᵀ)
= ‖I − cos² Θ‖^{1/2} = ‖sin Θ‖
Spectral methods 2-20
21 Proof of Lemma 2.1
We first claim that the SVD of X_1ᵀ Z_0 can be written as
X_1ᵀ Z_0 = Ũ sin Θ Vᵀ (2.3)
for some orthonormal Ũ (to be proved later). With this claim in place, one has
Z_0 = [X_0, X_1] [X_0ᵀ Z_0; X_1ᵀ Z_0] = [X_0, X_1] [U cos Θ Vᵀ; Ũ sin Θ Vᵀ]
⟹ Z_0 Z_0ᵀ = [X_0, X_1] [U cos² Θ Uᵀ, U cos Θ sin Θ Ũᵀ; Ũ cos Θ sin Θ Uᵀ, Ũ sin² Θ Ũᵀ] [X_0ᵀ; X_1ᵀ]
As a consequence,
X_0 X_0ᵀ − Z_0 Z_0ᵀ = [X_0, X_1] [I − U cos² Θ Uᵀ, −U cos Θ sin Θ Ũᵀ; −Ũ cos Θ sin Θ Uᵀ, −Ũ sin² Θ Ũᵀ] [X_0ᵀ; X_1ᵀ]
Spectral methods 2-21
22 Proof of Lemma 2.1 (cont.)
This further gives
‖X_0 X_0ᵀ − Z_0 Z_0ᵀ‖ = ‖[U, 0; 0, Ũ] [sin² Θ, −cos Θ sin Θ; −cos Θ sin Θ, −sin² Θ] [U, 0; 0, Ũ]ᵀ‖
= ‖[sin² Θ, −cos Θ sin Θ; −cos Θ sin Θ, −sin² Θ]‖ (‖·‖ is rotationally invariant; each block is a diagonal matrix)
= max_{1≤i≤r} ‖[sin² θ_i, −cos θ_i sin θ_i; −cos θ_i sin θ_i, −sin² θ_i]‖
= max_{1≤i≤r} sin θ_i ‖[sin θ_i, −cos θ_i; −cos θ_i, −sin θ_i]‖
= max_{1≤i≤r} sin θ_i = ‖sin Θ‖
where the last line holds since each 2×2 matrix [sin θ_i, −cos θ_i; −cos θ_i, −sin θ_i] is orthogonal.
Spectral methods 2-22
23 Proof of Lemma 2.1 (cont.)
It remains to justify (2.3). To this end, observe that
Z_0ᵀ X_1 X_1ᵀ Z_0 = Z_0ᵀ Z_0 − Z_0ᵀ X_0 X_0ᵀ Z_0 = I − V cos² Θ Vᵀ = V sin² Θ Vᵀ
and hence the right singular space (resp. singular values) of X_1ᵀ Z_0 is given by V (resp. sin Θ)
Spectral methods 2-23
24 Eigen-space perturbation theory
25 Davis-Kahan sin Θ Theorem: a simple case
Theorem 2.3 (Chandler Davis, William Kahan). Suppose M ⪰ 0 and has rank r. If ‖H‖ < λ_r(M), then
dist(Û_0, U_0) ≤ ‖H U_0‖ / (λ_r(M) − ‖H‖) ≤ ‖H‖ / (λ_r(M) − ‖H‖)
depends on the eigen-gap between λ_r (the smallest non-zero eigenvalue of M) and λ_{r+1} = 0, and on the perturbation size
Spectral methods 2-25
26 Proof of Theorem 2.3
We intend to control Û_1ᵀ U_0 by studying their interactions through H:
Û_1ᵀ H U_0 = Û_1ᵀ (Û Λ̂ Ûᵀ − U Λ Uᵀ) U_0 = Λ̂_1 Û_1ᵀ U_0 − Û_1ᵀ U_0 Λ_0 (since U_1ᵀ U_0 = Û_1ᵀ Û_0 = 0)
⟹ ‖Û_1ᵀ H U_0‖ ≥ ‖Û_1ᵀ U_0 Λ_0‖ − ‖Λ̂_1 Û_1ᵀ U_0‖ (triangle inequality)
≥ λ_r ‖Û_1ᵀ U_0‖ − ‖Û_1ᵀ U_0‖ ‖Λ̂_1‖ (2.4)
In view of Weyl's theorem, ‖Λ̂_1‖ ≤ ‖H‖, which combined with (2.4) gives
‖Û_1ᵀ U_0‖ ≤ ‖Û_1ᵀ H U_0‖ / (λ_r − ‖H‖) ≤ ‖H U_0‖ / (λ_r − ‖H‖)
This together with Lemma 2.1 completes the proof.
Spectral methods 2-26
27 Davis-Kahan sin Θ Theorem: more general case
Theorem 2.4 (Davis-Kahan sin Θ Theorem). Suppose λ_r(M) ≥ a and λ_{r+1}(M̂) ≤ a − Δ for some Δ > 0. Then
dist(Û_0, U_0) ≤ ‖H U_0‖ / Δ ≤ ‖H‖ / Δ
immediate consequence: if λ_r(M) > λ_{r+1}(M) + ‖H‖, then
dist(Û_0, U_0) ≤ ‖H‖ / (λ_r(M) − λ_{r+1}(M) − ‖H‖) (2.5)
where λ_r(M) − λ_{r+1}(M) is the spectral gap
Spectral methods 2-27
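As a sanity check, one can compare dist(Û_0, U_0) against the Davis-Kahan bound of Theorem 2.3 on a random instance. This is a sketch with arbitrary choices of n, r, and the perturbation scale:

```python
import numpy as np

rng = np.random.default_rng(2)
n, r = 50, 2

# a rank-r PSD matrix M plus a small symmetric perturbation H
G = rng.standard_normal((n, r))
M = G @ G.T                       # lambda_r(M) > 0, lambda_{r+1}(M) = 0
H = rng.standard_normal((n, n))
H = (H + H.T) / np.sqrt(n)        # scale so ||H|| = O(1) << lambda_r(M)

vals, vecs = np.linalg.eigh(M)
U0 = vecs[:, -r:]                 # top-r eigenvectors of M
vals_h, vecs_h = np.linalg.eigh(M + H)
U0_hat = vecs_h[:, -r:]           # top-r eigenvectors of M + H

dist = np.linalg.norm(U0 @ U0.T - U0_hat @ U0_hat.T, 2)
lam_r = vals[-r]
H_norm = np.linalg.norm(H, 2)

# Theorem 2.3: dist <= ||H|| / (lambda_r(M) - ||H||), valid since ||H|| < lambda_r(M)
bound = H_norm / (lam_r - H_norm)
```

The empirical distance sits below the bound, as the theorem guarantees whenever ‖H‖ < λ_r(M).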
28 Back to the stochastic block model...
Let M = E[A] − ((p+q)/2) 11ᵀ = ((p−q)/2) xxᵀ, M̂ = A − ((p+q)/2) 11ᵀ, and u = (1/√n) x with x = [1, ..., 1, −1, ..., −1]ᵀ. Then the Davis-Kahan sin Θ theorem yields
dist(û, u) ≤ ‖M̂ − M‖ / (λ_1(M) − ‖M̂ − M‖) = ‖A − E[A]‖ / ((p−q)n/2 − ‖A − E[A]‖) (2.6)
Question: how to bound ‖A − E[A]‖
Spectral methods 2-28
29 A hammer: matrix Bernstein inequality
Consider a sequence of independent random matrices {X_l ∈ R^{d_1×d_2}}:
- E[X_l] = 0 and ‖X_l‖ ≤ B for each l
- variance statistic: v := max{‖Σ_l E[X_l X_lᵀ]‖, ‖Σ_l E[X_lᵀ X_l]‖}
Theorem 2.5 (Matrix Bernstein inequality). For all τ ≥ 0,
P{‖Σ_l X_l‖ ≥ τ} ≤ (d_1 + d_2) exp(−(τ²/2) / (v + Bτ/3))
Spectral methods 2-29
30 A hammer: matrix Bernstein inequality
P{‖Σ_l X_l‖ ≥ τ} ≤ (d_1 + d_2) exp(−(τ²/2) / (v + Bτ/3))
- moderate-deviation regime (τ is small): sub-Gaussian tail behavior exp(−τ²/(2v))
- large-deviation regime (τ is large): sub-exponential tail behavior exp(−3τ/(2B)) (slower decay)
user-friendly form (exercise): with prob. 1 − O((d_1 + d_2)^{−10}),
‖Σ_l X_l‖ ≲ √(v log(d_1 + d_2)) + B log(d_1 + d_2) (2.7)
Spectral methods 2-30
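To get a feel for (2.7), the experiment below sums m independent, zero-mean random sign matrices placed at uniformly random positions (an illustrative setup; the constants are arbitrary) and compares the empirical spectral norm with the user-friendly bound √(v log(d₁+d₂)) + B log(d₁+d₂):

```python
import numpy as np

rng = np.random.default_rng(3)
d, m = 100, 500   # square case: d1 = d2 = d

# X_l = eps_l * e_i e_j^T with uniform (i, j) and a random sign eps_l,
# so E[X_l] = 0, ||X_l|| = 1 = B, and sum_l E[X_l X_l^T] = (m/d) I
norms = []
for _ in range(20):
    S = np.zeros((d, d))
    for _ in range(m):
        i, j = rng.integers(d, size=2)
        S[i, j] += rng.choice([-1.0, 1.0])
    norms.append(np.linalg.norm(S, 2))

B = 1.0
v = m / d
bound = np.sqrt(v * np.log(2 * d)) + B * np.log(2 * d)   # form (2.7), constants dropped
```

Since (2.7) hides an absolute constant, the comparison is only up to a modest factor, but the empirical norms concentrate well below a small multiple of the bound.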
31 Bounding ‖A − E[A]‖
The matrix Bernstein inequality yields:
Lemma 2.6. Consider the SBM with p > q ≳ (log n)/n. Then with high prob.,
‖A − E[A]‖ ≲ √(np log n) (2.8)
Spectral methods 2-31
32 Statistical accuracy of spectral clustering
Substitute (2.8) into (2.6) to reach
dist(û, u) ≤ ‖A − E[A]‖ / ((p−q)n/2 − ‖A − E[A]‖) ≲ √(np log n) / ((p−q)n)
provided that (p−q)n ≳ √(np log n). Thus, under the condition p − q ≳ √(p log n / n), with high prob. one has
dist(û, u) ≪ 1 ⟹ nearly perfect clustering
Spectral methods 2-32
33 Statistical accuracy of spectral clustering
p − q ≳ √(p log n / n) ⟹ nearly perfect clustering
- dense regime: if p ≍ q ≍ 1, then this condition reads p − q ≳ √(log n / n)
- sparse regime: if p = a log n / n and q = b log n / n for a, b ≍ 1, then this condition reads a − b ≳ √a
This condition is information-theoretically optimal (up to log factor) (Mossel, Neeman, Sly '15; Abbe '18)
Spectral methods 2-33
34 Proof of Lemma 2.6 To simplify presentation, assume A i,j and A j,i are independent (check: why this assumption does not change our bounds) Spectral methods 2-34
35 Proof of Lemma 2.6
Write A − E[A] = Σ_{i,j} X_{i,j}, where X_{i,j} = (A_{i,j} − E[A_{i,j}]) e_i e_jᵀ
Since Var(A_{i,j}) ≤ p, one has E[X_{i,j} X_{i,j}ᵀ] ⪯ p e_i e_iᵀ, which gives
Σ_{i,j} E[X_{i,j} X_{i,j}ᵀ] ⪯ Σ_{i,j} p e_i e_iᵀ ⪯ np I
Similarly, Σ_{i,j} E[X_{i,j}ᵀ X_{i,j}] ⪯ np I. As a result,
v = max{‖Σ_{i,j} E[X_{i,j} X_{i,j}ᵀ]‖, ‖Σ_{i,j} E[X_{i,j}ᵀ X_{i,j}]‖} ≤ np
In addition, ‖X_{i,j}‖ ≤ 1 := B
Take the matrix Bernstein inequality to conclude that with high prob.,
‖A − E[A]‖ ≲ √(v log n) + B log n ≲ √(np log n) (since p ≳ (log n)/n)
Spectral methods 2-35
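The scaling in Lemma 2.6 can also be eyeballed empirically: across a few values of n, the ratio ‖A − E[A]‖ / √(np log n) stays bounded. A rough illustration with arbitrary p, q:

```python
import numpy as np

rng = np.random.default_rng(6)
p, q = 0.3, 0.1   # illustrative edge probabilities

ratios = []
for n in (100, 200, 400):
    x = np.concatenate([np.ones(n // 2), -np.ones(n // 2)])
    prob = np.where(np.outer(x, x) > 0, p, q)
    A = np.triu(rng.random((n, n)) < prob, 1).astype(float)
    A = A + A.T
    EA = prob.copy()
    np.fill_diagonal(EA, 0.0)            # the sampled graph has no self-loops
    noise = np.linalg.norm(A - EA, 2)    # ||A - E[A]||
    ratios.append(noise / np.sqrt(n * p * np.log(n)))
```

The ratios remain of order one as n grows, consistent with (2.8).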
36 Extension: singular subspaces
37 Singular value decomposition
Consider two matrices M, M̂ = M + H with SVDs
M = [U_0, U_1] [Σ_0, 0; 0, Σ_1; 0, 0] [V_0, V_1]ᵀ
M̂ = [Û_0, Û_1] [Σ̂_0, 0; 0, Σ̂_1; 0, 0] [V̂_0, V̂_1]ᵀ
where U_0 (resp. Û_0) and V_0 (resp. V̂_0) represent the top-r singular subspaces of M (resp. M̂)
Spectral methods 2-37
38 Wedin sin Θ Theorem
The Davis-Kahan theorem generalizes to singular subspace perturbation:
Theorem 2.7 (Wedin sin Θ Theorem). Suppose σ_r(M) (the rth singular value) ≥ a and σ_{r+1}(M̂) ≤ a − Δ for some Δ > 0. Then
max{dist(Û_0, U_0), dist(V̂_0, V_0)} ≤ max{‖H V_0‖, ‖Hᵀ U_0‖} / Δ (two-sided interactions) ≤ ‖H‖ / Δ
Spectral methods 2-38
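A quick numerical check of Theorem 2.7: a random low-rank M plus small noise, with both singular subspaces compared against the bound ‖H‖/Δ. All dimensions and scales below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(4)
n1, n2, r = 60, 40, 2

# rank-r M with singular values 30 and 20, plus small noise H
U = np.linalg.qr(rng.standard_normal((n1, r)))[0]
V = np.linalg.qr(rng.standard_normal((n2, r)))[0]
M = 10.0 * U @ np.diag([3.0, 2.0]) @ V.T    # sigma_r(M) = 20, sigma_{r+1}(M) = 0
H = 0.5 * rng.standard_normal((n1, n2)) / np.sqrt(n2)

Uh, sh, Vht = np.linalg.svd(M + H)
U0_hat, V0_hat = Uh[:, :r], Vht[:r, :].T

dist_u = np.linalg.norm(U @ U.T - U0_hat @ U0_hat.T, 2)
dist_v = np.linalg.norm(V @ V.T - V0_hat @ V0_hat.T, 2)

# Wedin bound with a = sigma_r(M) and Delta = a - sigma_{r+1}(M + H)
Delta = 20.0 - sh[r]
bound = np.linalg.norm(H, 2) / Delta
```

Both subspace distances fall below the bound, since σ_{r+1}(M̂) ≤ ‖H‖ ≪ σ_r(M) here.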
39 Example: low-rank matrix completion
Netflix challenge: Netflix provides highly incomplete ratings from ≈ 0.5 million users for ≈ 17,770 movies
How to predict unseen user ratings for movies?
Spectral methods 2-39
40 Example: low-rank matrix completion In general, we cannot infer missing ratings this is an underdetermined system (more unknowns than observations) Spectral methods 2-40
41 Example: low-rank matrix completion... unless rating matrix has other structure A few factors explain most of the data Spectral methods 2-41
42 Example: low-rank matrix completion... unless rating matrix has other structure A few factors explain most of the data low-rank approximation How to exploit (approx.) low-rank structure in prediction Spectral methods 2-41
43 Model for low-rank matrix completion
(figure credit: Candès)
- consider a low-rank matrix M
- each entry M_{i,j} is observed independently with prob. p
- goal: fill in missing entries
Spectral methods 2-42
44 Spectral estimate for matrix completion
1. set M̂ ∈ R^{n×n} as
M̂_{i,j} = (1/p) M_{i,j} if M_{i,j} is observed, and 0 else
rationale for rescaling: E[M̂] = M
2. compute the rank-r SVD Û Σ̂ V̂ᵀ of M̂, and return (Û, Σ̂, V̂)
Spectral methods 2-43
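A sketch of the two-step estimator for a rank-1 example (the sampling rate p and size n are illustrative; with r = 1 the subspace distance reduces to |sin ∠(u, û)|):

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 600, 0.7

# rank-1 ground truth M = u v^T with random unit vectors
u = rng.standard_normal(n); u /= np.linalg.norm(u)
v = rng.standard_normal(n); v /= np.linalg.norm(v)
M = np.outer(u, v)

# step 1: observe each entry independently with prob. p, rescale by 1/p
mask = rng.random((n, n)) < p
M_hat = np.where(mask, M / p, 0.0)   # unbiased: E[M_hat] = M

# step 2: rank-1 SVD of M_hat
U, s, Vt = np.linalg.svd(M_hat)
u_hat, v_hat = U[:, 0], Vt[0, :]

# subspace distances (|sin angle|, insensitive to the SVD sign ambiguity)
dist_u = np.sqrt(1.0 - (u @ u_hat) ** 2)
dist_v = np.sqrt(1.0 - (v @ v_hat) ** 2)
```

At this sampling rate both singular vectors are recovered to nontrivial accuracy, in line with (2.9).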
45 Statistical accuracy of the spectral estimate
Let's analyze a simple case where M = uvᵀ with
u = ũ/‖ũ‖_2, v = ṽ/‖ṽ‖_2, ũ, ṽ ~ N(0, I_n)
From Wedin's theorem: if p ≳ log³ n / n, then with high prob.
max{dist(û, u), dist(v̂, v)} ≤ ‖M̂ − M‖ / (σ_1(M) − ‖M̂ − M‖) ≍ ‖M̂ − M‖ (controlled by matrix Bernstein) ≪ 1 (nearly accurate estimates) (2.9)
Spectral methods 2-44
46 Sample complexity
For rank-1 matrix completion, p ≳ log³ n / n ⟹ nearly accurate estimates
Sample complexity needed for obtaining reliable spectral estimates is
n² p ≳ n log³ n (optimal up to log factor)
Spectral methods 2-45
47 Proof of (2.9)
First, write M̂ − M = Σ_{i,j} X_{i,j}, where X_{i,j} = (M̂_{i,j} − M_{i,j}) e_i e_jᵀ. Then
‖X_{i,j}‖ ≤ (1/p) max_{i,j} |M_{i,j}| ≲ (log n)/(pn) := B (check)
Next, E[X_{i,j} X_{i,j}ᵀ] = Var(M̂_{i,j}) e_i e_iᵀ, and hence
‖Σ_{i,j} E[X_{i,j} X_{i,j}ᵀ]‖ = ‖Σ_i (Σ_j Var(M̂_{i,j})) e_i e_iᵀ‖ ≤ (n/p) max_{i,j} M_{i,j}² ≲ (log² n)/(np) (check)
Similar bounds hold for ‖Σ_{i,j} E[X_{i,j}ᵀ X_{i,j}]‖. Therefore,
v := max{‖Σ_{i,j} E[X_{i,j} X_{i,j}ᵀ]‖, ‖Σ_{i,j} E[X_{i,j}ᵀ X_{i,j}]‖} ≲ (log² n)/(np)
Take the matrix Bernstein inequality to yield: if p ≳ log³ n / n, then
‖M̂ − M‖ ≲ √(v log n) + B log n ≪ 1
Spectral methods 2-46
48 Extension: eigen-space for asymmetric transition matrices
49 Eigen-decomposition for asymmetric matrices
Eigen-decomposition for asymmetric matrices is much trickier; for example:
1. both eigenvalues and eigenvectors might be complex-valued
2. eigenvectors might not be orthogonal to each other
This lecture focuses on a special case: probability transition matrices
Spectral methods 2-48
50 Probability transition matrices
Consider a Markov chain {X_t}_{t≥0} over n states:
- transition probability P{X_{t+1} = j | X_t = i} = P_{i,j}; transition matrix P = [P_{i,j}]_{1≤i,j≤n}
- stationary distribution π = [π_1, ..., π_n] (with π_1 + ... + π_n = 1) is the 1st left eigenvector of P: πP = π
- {X_t}_{t≥0} is said to be reversible if π_i P_{i,j} = π_j P_{j,i} for all i, j
Spectral methods 2-49
51 Eigenvector perturbation for transition matrices
Define ‖a‖_π := √(Σ_{i=1}^n π_i a_i²)
Theorem 2.8 (Chen, Fan, Ma, Wang '17). Suppose P, P̂ are transition matrices with stationary distributions π, π̂, respectively. Assume P induces a reversible Markov chain. If 1 − max{λ_2(P), −λ_n(P)} > ‖P̂ − P‖_π, then
‖π̂ − π‖_π ≤ ‖π(P̂ − P)‖_π / (1 − max{λ_2(P), −λ_n(P)} (spectral gap) − ‖P̂ − P‖_π (perturbation))
- P̂ does not need to induce a reversible Markov chain
Spectral methods 2-50
52 Example: ranking from pairwise comparisons pairwise comparisons for ranking tennis players figure credit: Bozóki, Csató, Temesi Spectral methods 2-51
53 Bradley-Terry-Luce (logistic) model
n items to be ranked; w_i: preference score of item i
- assign a latent score {w_i}_{1≤i≤n} to each item, so that item i ≻ item j if w_i > w_j
- each pair of items (i, j) is compared independently, with
P{item j beats item i} = w_j / (w_i + w_j)
Spectral methods 2-52
54 Bradley-Terry-Luce (logistic) model
n items to be ranked; w_i: preference score of item i
- assign a latent score {w_i}_{1≤i≤n} to each item, so that item i ≻ item j if w_i > w_j
- each pair of items (i, j) is compared independently, with outcome
y_{i,j} = 1 with prob. w_j/(w_i + w_j), and y_{i,j} = 0 else
Spectral methods 2-52
55 Spectral ranking method
construct a probability transition matrix P̂ obeying
P̂_{i,j} = (1/(2n)) y_{i,j} if i ≠ j, and P̂_{i,i} = 1 − Σ_{l: l≠i} P̂_{i,l}
return the score estimate as the leading left eigenvector π̂ of P̂
- closely related to PageRank!
Spectral methods 2-53
56 Rationale behind the spectral method
E[P̂_{i,j}] = (1/(2n)) · w_j/(w_i + w_j) for i ≠ j
P := E[P̂] obeys w_i P_{i,j} = w_j P_{j,i} (detailed balance)
Thus, the stationary distribution π of P obeys π = (1/Σ_l w_l) w (reveals the true scores)
Spectral methods 2-54
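The construction above can be simulated end to end. This is a minimal sketch: the scores w are drawn arbitrarily with bounded ratio, and each ordered pair's outcome y_{i,j} is sampled independently (the same simplifying independence assumption used for Lemma 2.6):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200
w = rng.uniform(1.0, 2.0, size=n)    # latent scores, max w_i / w_j bounded

# BTL comparison outcomes: y[i, j] = 1 iff item j beats item i
Pwin = w[None, :] / (w[:, None] + w[None, :])
y = (rng.random((n, n)) < Pwin).astype(float)

# transition matrix: off-diagonal y/(2n), diagonal completes each row to 1
P_hat = y / (2 * n)
np.fill_diagonal(P_hat, 0.0)
np.fill_diagonal(P_hat, 1.0 - P_hat.sum(axis=1))

# leading left eigenvector = stationary distribution of P_hat
vals, vecs = np.linalg.eig(P_hat.T)
pi_hat = np.real(vecs[:, np.argmax(np.real(vals))])
pi_hat = pi_hat / pi_hat.sum()       # normalize (also fixes the eigenvector sign)

pi = w / w.sum()                     # ground truth, by detailed balance
rel_err = np.linalg.norm(pi_hat - pi) / np.linalg.norm(pi)
```

Even with a single comparison per pair, the stationary distribution of P̂ tracks the true score vector up to a small relative error.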
57 Statistical guarantees for spectral ranking
Theorem (Negahban, Oh, Shah '16; Chen, Fan, Ma, Wang '17). Suppose max_{i,j} w_i/w_j ≲ 1. Then with high prob.,
‖π̂ − π‖_2 / ‖π‖_2 ≍ ‖π̂ − π‖_π / ‖π‖_π ≲ √(1/n) → 0 (nearly perfect estimate)
- consequence of Theorem 2.8 and the matrix Bernstein inequality (exercise)
Spectral methods 2-55
58 Reference
[1] C. Davis, W. Kahan, "The rotation of eigenvectors by a perturbation," SIAM Journal on Numerical Analysis, 1970.
[2] P. Wedin, "Perturbation bounds in connection with singular value decomposition," BIT Numerical Mathematics, 1972.
[3] A. Montanari, "Inference, estimation, and information processing," EE 378B lecture notes, Stanford University.
[4] D. Hsu, COMS 4772 lecture notes, Columbia University.
[5] M. Wainwright, "High-dimensional statistics: a non-asymptotic viewpoint," Cambridge University Press, 2019.
[6] J. Fan, Q. Sun, W. Zhou, Z. Zhu, "Principal component analysis for big data," arXiv preprint.
Spectral methods 2-56
59 Reference
[7] E. Abbe, "Community detection and stochastic block models," Foundations and Trends in Communications and Information Theory, 2018.
[8] E. Mossel, J. Neeman, A. Sly, "Consistency thresholds for the planted bisection model," ACM Symposium on Theory of Computing, 2015.
[9] R. Keshavan, A. Montanari, S. Oh, "Matrix completion from a few entries," IEEE Transactions on Information Theory, 2010.
[10] L. Page, S. Brin, R. Motwani, T. Winograd, "The PageRank citation ranking: bringing order to the web," 1999.
[11] S. Negahban, S. Oh, D. Shah, "Rank centrality: ranking from pairwise comparisons," Operations Research.
[12] Y. Chen, J. Fan, C. Ma, K. Wang, "Spectral method and regularized MLE are both optimal for top-K ranking," accepted to Annals of Statistics.
Spectral methods 2-57
More informationDimension reduction, PCA & eigenanalysis Based in part on slides from textbook, slides of Susan Holmes. October 3, Statistics 202: Data Mining
Dimension reduction, PCA & eigenanalysis Based in part on slides from textbook, slides of Susan Holmes October 3, 2012 1 / 1 Combinations of features Given a data matrix X n p with p fairly large, it can
More informationConditions for Robust Principal Component Analysis
Rose-Hulman Undergraduate Mathematics Journal Volume 12 Issue 2 Article 9 Conditions for Robust Principal Component Analysis Michael Hornstein Stanford University, mdhornstein@gmail.com Follow this and
More informationStructure in Data. A major objective in data analysis is to identify interesting features or structure in the data.
Structure in Data A major objective in data analysis is to identify interesting features or structure in the data. The graphical methods are very useful in discovering structure. There are basically two
More informationHigh-dimensional graphical model selection: Practical and information-theoretic limits
1 High-dimensional graphical model selection: Practical and information-theoretic limits Martin Wainwright Departments of Statistics, and EECS UC Berkeley, California, USA Based on joint work with: John
More informationAnalysis of Robust PCA via Local Incoherence
Analysis of Robust PCA via Local Incoherence Huishuai Zhang Department of EECS Syracuse University Syracuse, NY 3244 hzhan23@syr.edu Yi Zhou Department of EECS Syracuse University Syracuse, NY 3244 yzhou35@syr.edu
More informationDonald Goldfarb IEOR Department Columbia University UCLA Mathematics Department Distinguished Lecture Series May 17 19, 2016
Optimization for Tensor Models Donald Goldfarb IEOR Department Columbia University UCLA Mathematics Department Distinguished Lecture Series May 17 19, 2016 1 Tensors Matrix Tensor: higher-order matrix
More informationLab 8: Measuring Graph Centrality - PageRank. Monday, November 5 CompSci 531, Fall 2018
Lab 8: Measuring Graph Centrality - PageRank Monday, November 5 CompSci 531, Fall 2018 Outline Measuring Graph Centrality: Motivation Random Walks, Markov Chains, and Stationarity Distributions Google
More informationGaussian Graphical Models and Graphical Lasso
ELE 538B: Sparsity, Structure and Inference Gaussian Graphical Models and Graphical Lasso Yuxin Chen Princeton University, Spring 2017 Multivariate Gaussians Consider a random vector x N (0, Σ) with pdf
More informationAlgebraic Representation of Networks
Algebraic Representation of Networks 0 1 2 1 1 0 0 1 2 0 0 1 1 1 1 1 Hiroki Sayama sayama@binghamton.edu Describing networks with matrices (1) Adjacency matrix A matrix with rows and columns labeled by
More informationSpectra of Adjacency and Laplacian Matrices
Spectra of Adjacency and Laplacian Matrices Definition: University of Alicante (Spain) Matrix Computing (subject 3168 Degree in Maths) 30 hours (theory)) + 15 hours (practical assignment) Contents 1. Spectra
More informationLink Analysis Ranking
Link Analysis Ranking How do search engines decide how to rank your query results? Guess why Google ranks the query results the way it does How would you do it? Naïve ranking of query results Given query
More informationData Mining and Analysis: Fundamental Concepts and Algorithms
: Fundamental Concepts and Algorithms dataminingbook.info Mohammed J. Zaki 1 Wagner Meira Jr. 2 1 Department of Computer Science Rensselaer Polytechnic Institute, Troy, NY, USA 2 Department of Computer
More informationUsing SVD to Recommend Movies
Michael Percy University of California, Santa Cruz Last update: December 12, 2009 Last update: December 12, 2009 1 / Outline 1 Introduction 2 Singular Value Decomposition 3 Experiments 4 Conclusion Last
More informationLecture 9: Low Rank Approximation
CSE 521: Design and Analysis of Algorithms I Fall 2018 Lecture 9: Low Rank Approximation Lecturer: Shayan Oveis Gharan February 8th Scribe: Jun Qi Disclaimer: These notes have not been subjected to the
More information6.207/14.15: Networks Lectures 4, 5 & 6: Linear Dynamics, Markov Chains, Centralities
6.207/14.15: Networks Lectures 4, 5 & 6: Linear Dynamics, Markov Chains, Centralities 1 Outline Outline Dynamical systems. Linear and Non-linear. Convergence. Linear algebra and Lyapunov functions. Markov
More informationSTA141C: Big Data & High Performance Statistical Computing
STA141C: Big Data & High Performance Statistical Computing Lecture 5: Numerical Linear Algebra Cho-Jui Hsieh UC Davis April 20, 2017 Linear Algebra Background Vectors A vector has a direction and a magnitude
More informationSpectral Clustering. Spectral Clustering? Two Moons Data. Spectral Clustering Algorithm: Bipartioning. Spectral methods
Spectral Clustering Seungjin Choi Department of Computer Science POSTECH, Korea seungjin@postech.ac.kr 1 Spectral methods Spectral Clustering? Methods using eigenvectors of some matrices Involve eigen-decomposition
More informationJeffrey D. Ullman Stanford University
Jeffrey D. Ullman Stanford University 2 Often, our data can be represented by an m-by-n matrix. And this matrix can be closely approximated by the product of two matrices that share a small common dimension
More informationDissertation Defense
Clustering Algorithms for Random and Pseudo-random Structures Dissertation Defense Pradipta Mitra 1 1 Department of Computer Science Yale University April 23, 2008 Mitra (Yale University) Dissertation
More informationhttps://goo.gl/kfxweg KYOTO UNIVERSITY Statistical Machine Learning Theory Sparsity Hisashi Kashima kashima@i.kyoto-u.ac.jp DEPARTMENT OF INTELLIGENCE SCIENCE AND TECHNOLOGY 1 KYOTO UNIVERSITY Topics:
More informationU.C. Berkeley CS294: Beyond Worst-Case Analysis Handout 12 Luca Trevisan October 3, 2017
U.C. Berkeley CS94: Beyond Worst-Case Analysis Handout 1 Luca Trevisan October 3, 017 Scribed by Maxim Rabinovich Lecture 1 In which we begin to prove that the SDP relaxation exactly recovers communities
More informationA hybrid reordered Arnoldi method to accelerate PageRank computations
A hybrid reordered Arnoldi method to accelerate PageRank computations Danielle Parker Final Presentation Background Modeling the Web The Web The Graph (A) Ranks of Web pages v = v 1... Dominant Eigenvector
More informationMAE 298, Lecture 8 Feb 4, Web search and decentralized search on small-worlds
MAE 298, Lecture 8 Feb 4, 2008 Web search and decentralized search on small-worlds Search for information Assume some resource of interest is stored at the vertices of a network: Web pages Files in a file-sharing
More informationSparse PCA in High Dimensions
Sparse PCA in High Dimensions Jing Lei, Department of Statistics, Carnegie Mellon Workshop on Big Data and Differential Privacy Simons Institute, Dec, 2013 (Based on joint work with V. Q. Vu, J. Cho, and
More informationUnsupervised Machine Learning and Data Mining. DS 5230 / DS Fall Lecture 7. Jan-Willem van de Meent
Unsupervised Machine Learning and Data Mining DS 5230 / DS 4420 - Fall 2018 Lecture 7 Jan-Willem van de Meent DIMENSIONALITY REDUCTION Borrowing from: Percy Liang (Stanford) Dimensionality Reduction Goal:
More informationAnnounce Statistical Motivation Properties Spectral Theorem Other ODE Theory Spectral Embedding. Eigenproblems I
Eigenproblems I CS 205A: Mathematical Methods for Robotics, Vision, and Graphics Doug James (and Justin Solomon) CS 205A: Mathematical Methods Eigenproblems I 1 / 33 Announcements Homework 1: Due tonight
More informationIntroduction to Machine Learning. Introduction to ML - TAU 2016/7 1
Introduction to Machine Learning Introduction to ML - TAU 2016/7 1 Course Administration Lecturers: Amir Globerson (gamir@post.tau.ac.il) Yishay Mansour (Mansour@tau.ac.il) Teaching Assistance: Regev Schweiger
More informationStatistical and Computational Phase Transitions in Planted Models
Statistical and Computational Phase Transitions in Planted Models Jiaming Xu Joint work with Yudong Chen (UC Berkeley) Acknowledgement: Prof. Bruce Hajek November 4, 203 Cluster/Community structure in
More informationMTH 2032 SemesterII
MTH 202 SemesterII 2010-11 Linear Algebra Worked Examples Dr. Tony Yee Department of Mathematics and Information Technology The Hong Kong Institute of Education December 28, 2011 ii Contents Table of Contents
More informationOnline Social Networks and Media. Link Analysis and Web Search
Online Social Networks and Media Link Analysis and Web Search How to Organize the Web First try: Human curated Web directories Yahoo, DMOZ, LookSmart How to organize the web Second try: Web Search Information
More informationData Mining Techniques
Data Mining Techniques CS 6220 - Section 3 - Fall 2016 Lecture 12 Jan-Willem van de Meent (credit: Yijun Zhao, Percy Liang) DIMENSIONALITY REDUCTION Borrowing from: Percy Liang (Stanford) Linear Dimensionality
More informationDATA MINING LECTURE 13. Link Analysis Ranking PageRank -- Random walks HITS
DATA MINING LECTURE 3 Link Analysis Ranking PageRank -- Random walks HITS How to organize the web First try: Manually curated Web Directories How to organize the web Second try: Web Search Information
More informationCS246: Mining Massive Datasets Jure Leskovec, Stanford University
CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu 2/7/2012 Jure Leskovec, Stanford C246: Mining Massive Datasets 2 Web pages are not equally important www.joe-schmoe.com
More informationMachine Learning. Principal Components Analysis. Le Song. CSE6740/CS7641/ISYE6740, Fall 2012
Machine Learning CSE6740/CS7641/ISYE6740, Fall 2012 Principal Components Analysis Le Song Lecture 22, Nov 13, 2012 Based on slides from Eric Xing, CMU Reading: Chap 12.1, CB book 1 2 Factor or Component
More informationCertifying the Global Optimality of Graph Cuts via Semidefinite Programming: A Theoretic Guarantee for Spectral Clustering
Certifying the Global Optimality of Graph Cuts via Semidefinite Programming: A Theoretic Guarantee for Spectral Clustering Shuyang Ling Courant Institute of Mathematical Sciences, NYU Aug 13, 2018 Joint
More informationStochastic optimization Markov Chain Monte Carlo
Stochastic optimization Markov Chain Monte Carlo Ethan Fetaya Weizmann Institute of Science 1 Motivation Markov chains Stationary distribution Mixing time 2 Algorithms Metropolis-Hastings Simulated Annealing
More information