Random Matrices for Big Data Signal Processing and Machine Learning
(ICASSP 2017 Tutorial, New Orleans)
Romain COUILLET and Hafiz TIOMOKO ALI
CentraleSupélec, France. March 2017.
Outline

Basics of Random Matrix Theory
  Motivation: Large Sample Covariance Matrices
  The Stieltjes Transform Method
  Spiked Models
  Other Common Random Matrix Models
Applications
  Random Matrices and Robust Estimation
  Spectral Clustering Methods and Random Matrices
    Community Detection on Graphs
    Kernel Spectral Clustering
    Kernel Spectral Clustering: Subspace Clustering
    Semi-supervised Learning
    Support Vector Machines
    Neural Networks: Extreme Learning Machines
Perspectives
Basics of Random Matrix Theory

Motivation: Large Sample Covariance Matrices
Context

Baseline scenario: $x_1, \ldots, x_n \in \mathbb{C}^N$ (or $\mathbb{R}^N$) i.i.d. with $E[x_1] = 0$, $E[x_1 x_1^*] = C_N$.

If $x_1 \sim \mathcal{N}(0, C_N)$, the ML estimator for $C_N$ is the sample covariance matrix (SCM)
$$\hat{C}_N = \frac{1}{n} \sum_{i=1}^n x_i x_i^*.$$

If $n \to \infty$, then, by the strong law of large numbers,
$$\hat{C}_N \xrightarrow{a.s.} C_N,$$
or equivalently, in spectral norm,
$$\|\hat{C}_N - C_N\| \xrightarrow{a.s.} 0.$$

Random Matrix Regime

No longer valid if $N, n \to \infty$ with $N/n \to c \in (0, \infty)$:
$$\|\hat{C}_N - C_N\| \not\to 0.$$
For practical $N, n$ with $N \simeq n$, this leads to dramatically wrong conclusions. Even for $N = n/100$.
The Large Dimensional Fallacies

Setting: $x_i \in \mathbb{C}^N$ i.i.d., $x_1 \sim \mathcal{CN}(0, I_N)$; assume $N = N(n)$ such that $N/n \to c > 1$.

Then, joint entry-wise convergence: writing $X = [x_1, \ldots, x_n] \in \mathbb{C}^{N \times n}$ with rows $X_{i\cdot}$,
$$\max_{1 \le i,j \le N} \left| \left[ \hat{C}_N - I_N \right]_{ij} \right| = \max_{1 \le i,j \le N} \left| \frac{1}{n} \langle X_{j\cdot}, X_{i\cdot} \rangle - \delta_{ij} \right| \xrightarrow{a.s.} 0.$$

However, eigenvalue mismatch (since $N > n$, $\hat{C}_N$ is rank deficient):
$$0 = \lambda_1(\hat{C}_N) = \cdots = \lambda_{N-n}(\hat{C}_N) \le \lambda_{N-n+1}(\hat{C}_N) \le \cdots \le \lambda_N(\hat{C}_N)$$
$$1 = \lambda_1(I_N) = \cdots = \lambda_N(I_N),$$
hence no convergence in spectral norm.
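This discrepancy between entry-wise and spectral-norm convergence is easy to reproduce numerically. A minimal sketch (not from the slides; the dimensions, seed, and variable names are arbitrary choices), using real Gaussian samples with $N/n = 2 > 1$:

```python
import numpy as np

# Sketch: with N/n = 2 > 1, the SCM of i.i.d. N(0, I_N) samples is close to
# I_N entry-wise, yet far from it in spectral norm (it is even rank deficient).
rng = np.random.default_rng(0)
N, n = 800, 400
X = rng.standard_normal((N, n))          # columns are the samples x_1, ..., x_n
C_hat = X @ X.T / n                      # sample covariance matrix

entrywise = np.abs(C_hat - np.eye(N)).max()       # max_{i,j} |[C_hat - I]_{ij}|
spectral = np.linalg.norm(C_hat - np.eye(N), 2)   # operator norm

# entrywise is small (of order sqrt(log N / n)) while spectral stays of
# order 1: C_hat has N - n = 400 zero eigenvalues, so ||C_hat - I|| >= 1.
```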
The Marčenko-Pastur law

Figure: Histogram of the eigenvalues of $\hat{C}_N$ for $N = 500$, $n = 2000$, $C_N = I_N$, against the Marčenko-Pastur law.
Definition (Empirical Spectral Density)
The empirical spectral density (e.s.d.) $\mu_N$ of a Hermitian matrix $A_N \in \mathbb{C}^{N \times N}$ is
$$\mu_N = \frac{1}{N} \sum_{i=1}^N \delta_{\lambda_i(A_N)}.$$

Theorem (Marčenko-Pastur Law [Marčenko, Pastur '67])
Let $X_N \in \mathbb{C}^{N \times n}$ have i.i.d. zero mean, unit variance entries. As $N, n \to \infty$ with $N/n \to c \in (0, \infty)$, the e.s.d. $\mu_N$ of $\frac{1}{n} X_N X_N^*$ satisfies
$$\mu_N \xrightarrow{a.s.} \mu_c$$
weakly, where
- $\mu_c(\{0\}) = \max\{0, 1 - c^{-1}\}$;
- on $(0, \infty)$, $\mu_c$ has continuous density $f_c$ supported on $[(1-\sqrt{c})^2, (1+\sqrt{c})^2]$,
$$f_c(x) = \frac{1}{2\pi c x} \sqrt{(x - (1-\sqrt{c})^2)((1+\sqrt{c})^2 - x)}.$$
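A quick numerical check of the theorem (a sketch; sizes, seed, and grid resolution are arbitrary choices): the empirical eigenvalues concentrate on the predicted support, and the density $f_c$ integrates to one when $c < 1$ (no atom at zero).

```python
import numpy as np

# Sketch: empirical eigenvalues of (1/n) X X^T against the Marcenko-Pastur
# density f_c.
rng = np.random.default_rng(1)
N, n = 500, 2000
c = N / n                                    # c = 1/4 < 1: no atom at 0
X = rng.standard_normal((N, n))
eigs = np.linalg.eigvalsh(X @ X.T / n)

a, b = (1 - np.sqrt(c))**2, (1 + np.sqrt(c))**2     # support edges
x = np.linspace(a, b, 4001)
f = np.sqrt(np.maximum((x - a) * (b - x), 0.0)) / (2 * np.pi * c * x)
mass = np.sum((f[1:] + f[:-1]) / 2 * np.diff(x))    # trapezoidal integral of f_c
```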
Figure: Marčenko-Pastur law for limit ratios $c = \lim_N N/n \in \{0.1, 0.2, 0.5\}$.
The Stieltjes Transform Method
The Stieltjes transform

Definition (Stieltjes Transform)
For $\mu$ a real probability measure with support $\mathrm{supp}(\mu)$, the Stieltjes transform $m_\mu$ is defined, for $z \in \mathbb{C} \setminus \mathrm{supp}(\mu)$, as
$$m_\mu(z) = \int \frac{1}{t - z} \, \mu(dt).$$

Property (Inverse Stieltjes Transform)
For $a < b$ continuity points of $\mu$,
$$\mu([a, b]) = \lim_{\varepsilon \to 0} \frac{1}{\pi} \int_a^b \Im[m_\mu(x + \imath\varepsilon)] \, dx.$$
Besides, if $\mu$ has a density $f$ at $x$,
$$f(x) = \lim_{\varepsilon \to 0} \frac{1}{\pi} \Im[m_\mu(x + \imath\varepsilon)].$$
Property (Relation to e.s.d.)
If $\mu$ is the e.s.d. of a Hermitian $A \in \mathbb{C}^{N \times N}$ (i.e., $\mu = \frac{1}{N} \sum_{i=1}^N \delta_{\lambda_i(A)}$),
$$m_\mu(z) = \frac{1}{N} \mathrm{tr}\, (A - zI_N)^{-1}.$$

Proof:
$$m_\mu(z) = \int \frac{\mu(dt)}{t - z} = \frac{1}{N} \sum_{i=1}^N \frac{1}{\lambda_i(A) - z} = \frac{1}{N} \mathrm{tr}\, (\mathrm{diag}\{\lambda_i(A)\} - zI_N)^{-1} = \frac{1}{N} \mathrm{tr}\, (A - zI_N)^{-1}.$$
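The identity is exact for any fixed Hermitian matrix; a minimal sketch checking it at one point $z$ (matrix, size, and evaluation point are arbitrary choices):

```python
import numpy as np

# Sketch: m_mu(z) computed from the eigenvalues equals (1/N) tr (A - z I)^{-1}.
rng = np.random.default_rng(2)
N = 50
G = rng.standard_normal((N, N))
A = (G + G.T) / 2                        # Hermitian (real symmetric) A
z = 1.0 + 1.0j

eigs = np.linalg.eigvalsh(A)
m_from_eigs = np.mean(1.0 / (eigs - z))
m_from_trace = np.trace(np.linalg.inv(A - z * np.eye(N))) / N
```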
Property (Stieltjes transform of Gram matrices)
For $X \in \mathbb{C}^{N \times n}$, let $\mu$ be the e.s.d. of $XX^*$ and $\bar{\mu}$ the e.s.d. of $X^*X$. Then
$$m_\mu(z) = \frac{n}{N} m_{\bar{\mu}}(z) - \frac{N - n}{N} \frac{1}{z}.$$

Proof:
$$m_\mu(z) = \frac{1}{N} \sum_{i=1}^N \frac{1}{\lambda_i(XX^*) - z} = \frac{1}{N} \sum_{i=1}^n \frac{1}{\lambda_i(X^*X) - z} + \frac{1}{N}(N - n) \frac{1}{0 - z}.$$
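Since $XX^*$ and $X^*X$ share their nonzero eigenvalues, the relation holds exactly at finite $N, n$; a sketch verifying it numerically (dimensions, seed, and $z$ are arbitrary choices):

```python
import numpy as np

# Sketch: exact finite-size check of the relation between the Stieltjes
# transforms of the e.s.d. of XX* (N x N) and X*X (n x n).
rng = np.random.default_rng(3)
N, n = 60, 40
X = rng.standard_normal((N, n))
z = 0.5 + 0.5j

m_mu = np.mean(1.0 / (np.linalg.eigvalsh(X @ X.T) - z))      # e.s.d. of XX*
m_mu_bar = np.mean(1.0 / (np.linalg.eigvalsh(X.T @ X) - z))  # e.s.d. of X*X
rhs = (n / N) * m_mu_bar - ((N - n) / N) / z
```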
The Stieltjes transform: three fundamental lemmas appearing in all proofs.

Lemma (Resolvent Identity)
For invertible $A, B \in \mathbb{C}^{N \times N}$,
$$A^{-1} - B^{-1} = A^{-1}(B - A)B^{-1}.$$

Corollary
For $t \in \mathbb{C}$, $x \in \mathbb{C}^N$, $A \in \mathbb{C}^{N \times N}$, with $A$ and $A + txx^*$ invertible,
$$(A + txx^*)^{-1} x = \frac{A^{-1} x}{1 + t x^* A^{-1} x}.$$
Lemma (Rank-one perturbation)
For $A, B \in \mathbb{C}^{N \times N}$ Hermitian nonnegative definite, $\mu$ the e.s.d. of $A$, $t > 0$, $x \in \mathbb{C}^N$, $z \in \mathbb{C} \setminus \mathrm{supp}(\mu)$,
$$\left| \frac{1}{N} \mathrm{tr}\, B \left( A + txx^* - zI_N \right)^{-1} - \frac{1}{N} \mathrm{tr}\, B \left( A - zI_N \right)^{-1} \right| \le \frac{1}{N} \frac{\|B\|}{\mathrm{dist}(z, \mathrm{supp}(\mu))}.$$
In particular, as $N \to \infty$, if $\limsup_N \|B\| < \infty$,
$$\frac{1}{N} \mathrm{tr}\, B \left( A + txx^* - zI_N \right)^{-1} - \frac{1}{N} \mathrm{tr}\, B \left( A - zI_N \right)^{-1} \to 0.$$
Lemma (Trace Lemma)
Let $x \in \mathbb{C}^N$ have i.i.d. entries with zero mean, unit variance, and finite moment of order $2p$, and let $A \in \mathbb{C}^{N \times N}$ be deterministic (or independent of $x$). Then
$$E\left[ \left| \frac{1}{N} x^* A x - \frac{1}{N} \mathrm{tr}\, A \right|^p \right] \le \frac{K \|A\|^p}{N^{p/2}}.$$
In particular, if $\limsup_N \|A\| < \infty$ and $x$ has entries with finite eighth-order moment,
$$\frac{1}{N} x^* A x - \frac{1}{N} \mathrm{tr}\, A \xrightarrow{a.s.} 0$$
(by the Markov inequality and the Borel-Cantelli lemma).
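A sketch of the concentration at one draw (the choice of $A$ as a bounded diagonal matrix, the size, and the seed are arbitrary; the deviation is of order $N^{-1/2}$):

```python
import numpy as np

# Sketch: concentration of the quadratic form (1/N) x* A x around (1/N) tr A
# for x with i.i.d. entries independent of A.
rng = np.random.default_rng(4)
N = 2000
a_diag = rng.uniform(0.5, 1.5, N)        # fixed bounded A = diag(a_diag)
x = rng.standard_normal(N)               # independent of A

quad = (x * a_diag) @ x / N              # (1/N) x^T A x
trace = a_diag.sum() / N                 # (1/N) tr A
```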
Proof of the Marčenko-Pastur law

Theorem (Marčenko-Pastur Law [Marčenko, Pastur '67], restated)
$X_N \in \mathbb{C}^{N \times n}$ with i.i.d. zero mean, unit variance entries. As $N, n \to \infty$ with $N/n \to c \in (0, \infty)$, the e.s.d. $\mu_N$ of $\frac{1}{n} X_N X_N^*$ satisfies $\mu_N \xrightarrow{a.s.} \mu_c$ weakly, where $\mu_c(\{0\}) = \max\{0, 1 - c^{-1}\}$ and, on $(0, \infty)$, $\mu_c$ has continuous density $f_c$ supported on $[(1-\sqrt{c})^2, (1+\sqrt{c})^2]$,
$$f_c(x) = \frac{1}{2\pi c x} \sqrt{(x - (1-\sqrt{c})^2)((1+\sqrt{c})^2 - x)}.$$
Stieltjes transform approach.

Proof: With $\mu_N$ the e.s.d. of $\frac{1}{n} X_N X_N^*$,
$$m_{\mu_N}(z) = \frac{1}{N} \mathrm{tr}\left( \frac{1}{n} X_N X_N^* - zI_N \right)^{-1} = \frac{1}{N} \sum_{i=1}^N \left[ \left( \frac{1}{n} X_N X_N^* - zI_N \right)^{-1} \right]_{ii}.$$
Write
$$X_N = \begin{bmatrix} y^* \\ Y_{N-1} \end{bmatrix} \in \mathbb{C}^{N \times n}$$
so that, for $\Im[z] > 0$,
$$\frac{1}{n} X_N X_N^* - zI_N = \begin{pmatrix} \frac{1}{n} y^* y - z & \frac{1}{n} y^* Y_{N-1}^* \\ \frac{1}{n} Y_{N-1} y & \frac{1}{n} Y_{N-1} Y_{N-1}^* - zI_{N-1} \end{pmatrix}.$$
Proof (continued): From the block matrix inverse formula
$$\begin{pmatrix} A & B \\ C & D \end{pmatrix}^{-1} = \begin{pmatrix} (A - BD^{-1}C)^{-1} & -A^{-1}B(D - CA^{-1}B)^{-1} \\ -(A - BD^{-1}C)^{-1} C A^{-1} & (D - CA^{-1}B)^{-1} \end{pmatrix},$$
we have
$$\left[ \left( \frac{1}{n} X_N X_N^* - zI_N \right)^{-1} \right]_{11} = \frac{1}{-z - z \frac{1}{n} y^* \left( \frac{1}{n} Y_{N-1}^* Y_{N-1} - zI_n \right)^{-1} y}.$$
By the Trace Lemma, as $N, n \to \infty$,
$$\left[ \left( \frac{1}{n} X_N X_N^* - zI_N \right)^{-1} \right]_{11} - \frac{1}{-z - z \frac{1}{n} \mathrm{tr}\left( \frac{1}{n} Y_{N-1}^* Y_{N-1} - zI_n \right)^{-1}} \xrightarrow{a.s.} 0.$$
Proof (continued): By the Rank-One Perturbation Lemma ($X_N^* X_N = Y_{N-1}^* Y_{N-1} + yy^*$), as $N, n \to \infty$,
$$\left[ \left( \frac{1}{n} X_N X_N^* - zI_N \right)^{-1} \right]_{11} - \frac{1}{-z - z \frac{1}{n} \mathrm{tr}\left( \frac{1}{n} X_N^* X_N - zI_n \right)^{-1}} \xrightarrow{a.s.} 0.$$
Since
$$\frac{1}{n} \mathrm{tr}\left( \frac{1}{n} X_N^* X_N - zI_n \right)^{-1} = \frac{1}{n} \mathrm{tr}\left( \frac{1}{n} X_N X_N^* - zI_N \right)^{-1} - \frac{n - N}{n} \frac{1}{z},$$
we get
$$\left[ \left( \frac{1}{n} X_N X_N^* - zI_N \right)^{-1} \right]_{11} - \frac{1}{1 - \frac{N}{n} - z - z \frac{1}{n} \mathrm{tr}\left( \frac{1}{n} X_N X_N^* - zI_N \right)^{-1}} \xrightarrow{a.s.} 0.$$
Repeating for entries $(2,2), \ldots, (N,N)$, and averaging, we get (for $\Im[z] > 0$)
$$m_{\mu_N}(z) - \frac{1}{1 - \frac{N}{n} - z - z \frac{N}{n} m_{\mu_N}(z)} \xrightarrow{a.s.} 0.$$
Proof (continued): Then $m_{\mu_N}(z) \xrightarrow{a.s.} m(z)$, the solution to
$$m(z) = \frac{1}{1 - c - z - czm(z)},$$
i.e. (with the branch of the square root chosen such that $m(z) \to 0$ as $|z| \to \infty$),
$$m(z) = \frac{1 - c - z}{2cz} + \frac{\sqrt{\left( z - (1+\sqrt{c})^2 \right)\left( z - (1-\sqrt{c})^2 \right)}}{2cz}.$$
Finally, by the inverse Stieltjes transform, for $x > 0$,
$$\lim_{\varepsilon \to 0} \frac{1}{\pi} \Im[m(x + \imath\varepsilon)] = \frac{1}{2\pi c x} \sqrt{\left( (1+\sqrt{c})^2 - x \right)\left( x - (1-\sqrt{c})^2 \right)} \, \mathbf{1}_{\{x \in [(1-\sqrt{c})^2, (1+\sqrt{c})^2]\}},$$
and for $x = 0$,
$$\lim_{\varepsilon \to 0} -\imath\varepsilon\, m(\imath\varepsilon) = \left( 1 - c^{-1} \right) \mathbf{1}_{\{c > 1\}}.$$
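The fixed-point characterization can be checked directly: a sketch (sizes, seed, and the evaluation point $z$ are arbitrary choices) iterating $m \leftarrow 1/(1 - c - z - czm)$ at a point outside the support and comparing with the empirical Stieltjes transform:

```python
import numpy as np

# Sketch: fixed point m(z) = 1 / (1 - c - z - c z m(z)) versus the empirical
# Stieltjes transform (1/N) tr((1/n) X X* - z I)^{-1}.
rng = np.random.default_rng(5)
N, n = 500, 1000
c = N / n
X = rng.standard_normal((N, n))
eigs = np.linalg.eigvalsh(X @ X.T / n)

z = -1.0                                  # real z < 0, outside supp(mu_c)
m_emp = np.mean(1.0 / (eigs - z))

m = 0.0
for _ in range(200):                      # simple fixed-point iteration
    m = 1.0 / (1.0 - c - z - c * z * m)
```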
Sample Covariance Matrices

Theorem (Sample Covariance Matrix Model [Silverstein, Bai '95])
Let $Y_N = C_N^{1/2} X_N \in \mathbb{C}^{N \times n}$, with
- $C_N \in \mathbb{C}^{N \times N}$ nonnegative definite with e.s.d. $\nu_N \to \nu$ weakly,
- $X_N \in \mathbb{C}^{N \times n}$ with i.i.d. entries of zero mean and unit variance.
As $N, n \to \infty$ with $N/n \to c \in (0, \infty)$, the e.s.d. $\mu_N$ of $\frac{1}{n} Y_N^* Y_N \in \mathbb{C}^{n \times n}$ satisfies $\mu_N \xrightarrow{a.s.} \mu$ weakly, with $m_\mu(z)$, $\Im[z] > 0$, the unique solution with $\Im[m_\mu(z)] > 0$ of
$$m_\mu(z) = \left( -z + c \int \frac{t}{1 + t m_\mu(z)} \, \nu(dt) \right)^{-1}.$$
Moreover, $\mu$ is continuous on $\mathbb{R}^+$ and real analytic wherever its density is positive.

Immediate corollary: For $\underline{\mu}_N$ the e.s.d. of $\frac{1}{n} Y_N Y_N^* = \frac{1}{n} \sum_{i=1}^n C_N^{1/2} x_i x_i^* C_N^{1/2} \in \mathbb{C}^{N \times N}$,
$$\underline{\mu}_N \xrightarrow{a.s.} \underline{\mu}$$
weakly, where $\mu = c \underline{\mu} + (1 - c) \delta_0$.
Figure: Histogram of the eigenvalues of $\frac{1}{n} Y_N Y_N^*$ against the limiting density $f$, $n = 3000$, $N = 300$, with $C_N$ diagonal with evenly weighted masses at (i) 1, 3, 7 and (ii) 1, 3, 4.
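The implicit equation can be solved numerically; a sketch (a damped fixed-point iteration at one complex point $z$, with the three-mass $C_N$ of the figure; the damping factor, seed, and $z$ are arbitrary choices) comparing against the empirical transform of the $n \times n$ Gram matrix:

```python
import numpy as np

# Sketch: solve the Silverstein-Bai equation for m_mu at one point z by damped
# fixed-point iteration, for C_N with evenly weighted eigenvalues {1, 3, 7},
# and compare with the empirical Stieltjes transform of (1/n) Y* Y.
rng = np.random.default_rng(6)
N, n = 300, 3000
c = N / n
t = np.repeat([1.0, 3.0, 7.0], N // 3)    # eigenvalues of diagonal C_N
X = rng.standard_normal((N, n))
Y = np.sqrt(t)[:, None] * X               # Y = C_N^{1/2} X

z = 1.0 + 1.0j
lam = np.linalg.eigvalsh(Y.T @ Y / n)     # n x n Gram matrix
m_emp = np.mean(1.0 / (lam - z))

m = -1.0 / z                              # initialization with Im[m] > 0
for _ in range(500):                      # damped iteration for stability
    m = 0.5 * m + 0.5 / (-z + c * np.mean(t / (1 + t * m)))
```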
Further Models and Deterministic Equivalents

Theorem (Doubly-correlated i.i.d. matrices)
Let $B_N = C_N^{1/2} X_N T_N X_N^* C_N^{1/2}$, with e.s.d. $\mu_N$, where
- $X_N \in \mathbb{C}^{N \times n}$ has i.i.d. entries of zero mean and variance $1/n$,
- $C_N$ is Hermitian nonnegative definite, $T_N$ is diagonal nonnegative,
- $\limsup_N \max(\|C_N\|, \|T_N\|) < \infty$.
Denote $c = N/n$. Then, as $N, n \to \infty$ with bounded ratio $c$, for $z \in \mathbb{C} \setminus \mathbb{R}$,
$$m_{\mu_N}(z) - m_N(z) \xrightarrow{a.s.} 0, \qquad m_N(z) = \frac{1}{N} \mathrm{tr}\left( -zI_N + \bar{e}_N(z) C_N \right)^{-1},$$
with $(e_N(z), \bar{e}_N(z))$ the unique solution, with $e_N(z) \in \mathbb{C}^+$ for $z \in \mathbb{C}^+$ (or $e_N(z) \in \mathbb{R}^+$ for $z \in \mathbb{R}^-$), of the system
$$e_N(z) = \frac{1}{N} \mathrm{tr}\, C_N \left( -zI_N + \bar{e}_N(z) C_N \right)^{-1}, \qquad \bar{e}_N(z) = \frac{1}{n} \mathrm{tr}\, T_N \left( I_n + c\, e_N(z) T_N \right)^{-1}.$$
Other Refined Sample Covariance Models

Side note: similar results hold for multiple other matrix models:
- Information-plus-noise: $Y_N = A_N + X_N$, $A_N$ deterministic
- Variance profile: $Y_N = P_N \odot X_N$ (entry-wise product)
- Per-column covariance: $Y_N = [y_1, \ldots, y_n]$, $y_i = C_{N,i}^{1/2} x_i$
- etc.
Spiked Models
No Eigenvalue Outside the Support

Theorem (No Eigenvalue Outside the Support [Silverstein, Bai '98])
Let $Y_N = C_N^{1/2} X_N \in \mathbb{C}^{N \times n}$, with
- $C_N \in \mathbb{C}^{N \times N}$ nonnegative definite with e.s.d. $\nu_N \to \nu$ weakly,
- $X_N \in \mathbb{C}^{N \times n}$ with i.i.d. entries of zero mean, unit variance, and $E[|X_{N,ij}|^4] < \infty$,
- $\max_i \mathrm{dist}(\lambda_i(C_N), \mathrm{supp}(\nu)) \to 0$.
Let $\mu$ be the limiting e.s.d. of $\frac{1}{n} Y_N^* Y_N$ as before, and let $[a, b] \subset \mathbb{R} \setminus \mathrm{supp}(\mu)$. Then, for all large $n$, almost surely,
$$\left\{ \lambda_i\left( \frac{1}{n} Y_N^* Y_N \right) \right\}_{i=1}^n \cap [a, b] = \emptyset.$$

In practice: this means that eigenvalues of $\frac{1}{n} Y_N^* Y_N$ cannot be found at macroscopic distance from the bulk, for $N, n$ large.
Spiked Models: breaking the rules.

If we break Rule 1 ($E[|X_{N,ij}|^4] < \infty$): infinitely many eigenvalues may wander away from $\mathrm{supp}(\mu)$.

Figure: eigenvalues $\{\lambda_i\}_{i=1}^n$ versus $\mu$, for $E[X_{ij}^4] < \infty$ (left) and $E[X_{ij}^4] = \infty$ (right).
If we break Rule 2 ($\max_i \mathrm{dist}(\lambda_i(C_N), \mathrm{supp}(\nu)) \to 0$): $C_N$ may create isolated eigenvalues in $\frac{1}{n} Y_N Y_N^*$, called spikes.

Figure: Eigenvalues $\{\lambda_i\}_{i=1}^n$ of $\frac{1}{n} Y_N Y_N^*$ versus $\mu$, for $C_N = \mathrm{diag}(1, \ldots, 1, 2, 2, 3, 3)$ ($N - 4$ ones), $N = 500$, $n = 1500$; isolated eigenvalues appear near $1 + \omega_1 + c\frac{1 + \omega_1}{\omega_1}$ and $1 + \omega_2 + c\frac{1 + \omega_2}{\omega_2}$ (with $\omega \in \{1, 2\}$).
Spiked Models

Theorem (Eigenvalues [Baik, Silverstein '06])
Let $Y_N = C_N^{1/2} X_N$, with
- $X_N$ with i.i.d. zero mean, unit variance entries, $E[|X_{N,ij}|^4] < \infty$,
- $C_N = I_N + P$, $P = U\Omega U^*$, where, for $K$ fixed, $\Omega = \mathrm{diag}(\omega_1, \ldots, \omega_K) \in \mathbb{R}^{K \times K}$, with $\omega_1 \ge \cdots \ge \omega_K > 0$.
Then, as $N, n \to \infty$ with $N/n \to c \in (0, \infty)$, denoting $\lambda_m$ the $m$-th largest eigenvalue of $\frac{1}{n} Y_N Y_N^*$:
- if $\omega_m > \sqrt{c}$,
$$\lambda_m \xrightarrow{a.s.} 1 + \omega_m + c\frac{1 + \omega_m}{\omega_m} > (1 + \sqrt{c})^2;$$
- if $\omega_m \in (0, \sqrt{c}]$,
$$\lambda_m \xrightarrow{a.s.} (1 + \sqrt{c})^2.$$
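A sketch of the spike location for a single spike above the phase transition (dimensions, seed, and the spike strength are arbitrary choices; the diagonal construction $C_N = I_N + \omega e_1 e_1^T$ is one convenient instance of the model):

```python
import numpy as np

# Sketch: largest eigenvalue of the spiked model versus its limit
# 1 + w + c (1 + w) / w, for one spike w > sqrt(c).
rng = np.random.default_rng(8)
N, n = 500, 1500
c = N / n                                 # c = 1/3, sqrt(c) ~ 0.577
w = 2.0                                   # spike above the phase transition
d = np.ones(N); d[0] = 1 + w              # C_N = I_N + w e_1 e_1^T
X = rng.standard_normal((N, n))
Y = np.sqrt(d)[:, None] * X               # Y = C_N^{1/2} X

lam_max = np.linalg.eigvalsh(Y @ Y.T / n).max()
limit = 1 + w + c * (1 + w) / w           # predicted isolated eigenvalue
bulk_edge = (1 + np.sqrt(c))**2           # right edge of the bulk
```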
Proof: Two ingredients: algebraic calculus + trace lemma.

Find eigenvalues away from those of $\frac{1}{n} X_N X_N^*$:
$$0 = \det\left( \frac{1}{n} Y_N Y_N^* - \lambda I_N \right) = \det(C_N) \det\left( \frac{1}{n} X_N X_N^* - \lambda C_N^{-1} \right)$$
$$= \det(C_N) \det\left( \frac{1}{n} X_N X_N^* - \lambda I_N + \lambda(I_N - C_N^{-1}) \right)$$
$$= \det(C_N) \det\left( \frac{1}{n} X_N X_N^* - \lambda I_N \right) \det\left( I_N + \lambda(I_N - C_N^{-1}) \left( \frac{1}{n} X_N X_N^* - \lambda I_N \right)^{-1} \right).$$
Use the low-rank property:
$$I_N - C_N^{-1} = I_N - (I_N + U\Omega U^*)^{-1} = U(I_K + \Omega^{-1})^{-1} U^*, \quad \Omega \in \mathbb{C}^{K \times K}.$$
Hence
$$0 = \det\left( \frac{1}{n} X_N X_N^* - \lambda I_N \right) \det\left( I_N + \lambda U (I_K + \Omega^{-1})^{-1} U^* \left( \frac{1}{n} X_N X_N^* - \lambda I_N \right)^{-1} \right).$$
Proof (2): By Sylvester's identity ($\det(I + AB) = \det(I + BA)$),
$$0 = \det\left( \frac{1}{n} X_N X_N^* - \lambda I_N \right) \det\left( I_K + \lambda (I_K + \Omega^{-1})^{-1} U^* \left( \frac{1}{n} X_N X_N^* - \lambda I_N \right)^{-1} U \right).$$
No eigenvalue outside the support [Bai, Silverstein '98]: $\det(\frac{1}{n} X_N X_N^* - \lambda I_N)$ has no zero beyond $(1 + \sqrt{c})^2$, for all large $n$ a.s.
Extension of the Trace Lemma: for each $z \in \mathbb{C} \setminus \mathrm{supp}(\mu)$,
$$U^* \left( \frac{1}{n} X_N X_N^* - zI_N \right)^{-1} U \xrightarrow{a.s.} m_\mu(z) I_K$$
($X_N$ being almost-unitarily invariant, $U$ can be seen as formed of random i.i.d.-like vectors).
As a result, for all large $n$ a.s., writing $k_m$ for the multiplicities of the $M$ distinct values among $\omega_1, \ldots, \omega_K$,
$$0 = \det\left( I_K + \lambda (I_K + \Omega^{-1})^{-1} U^* \left( \frac{1}{n} X_N X_N^* - \lambda I_N \right)^{-1} U \right) \simeq \prod_{m=1}^M \left( 1 + \frac{\lambda}{1 + \omega_m^{-1}} m_\mu(\lambda) \right)^{k_m} = \prod_{m=1}^M \left( 1 + \frac{\lambda \omega_m}{1 + \omega_m} m_\mu(\lambda) \right)^{k_m}.$$
Proof (3): Limiting solutions: zeros (with multiplicity) of
$$1 + \frac{\lambda \omega_m}{1 + \omega_m} m_\mu(\lambda) = 0.$$
Using the Marčenko-Pastur law properties ($m_\mu(z) = (1 - c - z - czm_\mu(z))^{-1}$),
$$\lambda \in \left\{ 1 + \omega_m + c \frac{1 + \omega_m}{\omega_m} \right\}_{m=1}^M.$$
Theorem (Eigenvectors [Paul '07])
Let $Y_N = C_N^{1/2} X_N$, with
- $X_N$ with i.i.d. zero mean, unit variance, finite fourth-order moment entries,
- $C_N = I_N + P$, $P = \sum_{i=1}^K \omega_i u_i u_i^*$, $\omega_1 > \cdots > \omega_K > 0$.
Then, as $N, n \to \infty$ with $N/n \to c \in (0, \infty)$, for $a, b \in \mathbb{C}^N$ deterministic and $\hat{u}_i$ the eigenvector associated with $\lambda_i(\frac{1}{n} Y_N Y_N^*)$,
$$a^* \hat{u}_i \hat{u}_i^* b - \frac{1 - c\omega_i^{-2}}{1 + c\omega_i^{-1}} \, a^* u_i u_i^* b \; \mathbf{1}_{\{\omega_i > \sqrt{c}\}} \xrightarrow{a.s.} 0.$$
In particular,
$$|\hat{u}_i^* u_i|^2 \xrightarrow{a.s.} \frac{1 - c\omega_i^{-2}}{1 + c\omega_i^{-1}} \, \mathbf{1}_{\{\omega_i > \sqrt{c}\}}.$$
Proof: based on the Cauchy integral formula + the same ingredients as in the eigenvalue proof:
$$a^* \hat{u}_i \hat{u}_i^* b = -\frac{1}{2\pi\imath} \oint_{\mathcal{C}_i} a^* \left( \frac{1}{n} Y_N Y_N^* - zI_N \right)^{-1} b \, dz$$
for $\mathcal{C}_i$ a contour circling around $\lambda_i$ only.
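A sketch of the alignment result for one spike (dimensions, seed, and spike strength are arbitrary choices; taking $u = e_1$ makes the alignment a single squared coordinate):

```python
import numpy as np

# Sketch: alignment |u_hat_1^* u_1|^2 versus the limit
# (1 - c w^{-2}) / (1 + c w^{-1}) for a single spike w > sqrt(c).
rng = np.random.default_rng(9)
N, n = 500, 1500
c = N / n
w = 2.0                                   # w > sqrt(c) ~ 0.577
d = np.ones(N); d[0] = 1 + w              # C_N = I_N + w u u^*, u = e_1
X = rng.standard_normal((N, n))
Y = np.sqrt(d)[:, None] * X

vals, vecs = np.linalg.eigh(Y @ Y.T / n)
u_hat = vecs[:, -1]                       # eigenvector of the largest eigenvalue
align = u_hat[0] ** 2                     # |u_hat^* u|^2 since u = e_1
limit = (1 - c / w**2) / (1 + c / w)
```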
Figure: Simulated versus limiting $|\hat{u}_1^* u_1|^2$ for $Y_N = C_N^{1/2} X_N$, $C_N = I_N + \omega_1 u_1 u_1^*$, $N/n = 1/3$, $N \in \{100, 200, 400\}$, varying the population spike $\omega_1$, against the limit $\frac{1 - c/\omega_1^2}{1 + c/\omega_1}$.
Basics of Random Matrix Theory/Spiked Models 36/153 Tracy–Widom Theorem

Theorem (Phase Transition [Baik, Ben Arous, Péché 05]). Let Y_N = C_N^{1/2} X_N, with X_N having i.i.d. complex Gaussian zero mean, unit variance entries, and C_N = I_N + P, P = Σ_{i=1}^K ω_i u_i u_i^*, ω_1 > ... > ω_K > 0 (K ≥ 0). Then, as N, n → ∞ with N/n → c < 1:

If ω_1 < √c (or K = 0),

N^{2/3} [λ_1 − (1 + √c)²] / [(1 + √c)^{4/3} c^{1/2}] →_L T_2 (complex Tracy–Widom law).

If ω_1 > √c,

√N [(1 + ω_1)² − c (1 + ω_1)²/ω_1²]^{−1/2} [λ_1 − (1 + ω_1 + c (1 + ω_1)/ω_1)] →_L N(0, 1).
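Above the transition, the centering term 1 + ω_1 + c(1 + ω_1)/ω_1 = (1 + ω_1)(1 + c/ω_1) locates the isolated eigenvalue. A quick numerical check (illustrative sketch with real Gaussian entries and arbitrary sizes, which suffices for the first-order location):

```python
import numpy as np

rng = np.random.default_rng(6)
N, n = 300, 900                 # c = 1/3
c = N / n
omega = 1.5                     # above the threshold sqrt(c) ~ 0.577

u = np.zeros(N)
u[0] = 1.0
C_sqrt = np.eye(N) + (np.sqrt(1 + omega) - 1) * np.outer(u, u)
Y = C_sqrt @ rng.standard_normal((N, n))

lam1 = np.linalg.eigvalsh(Y @ Y.T / n)[-1]
predicted = 1 + omega + c * (1 + omega) / omega   # centering term of the CLT
print(lam1, predicted)
```

The largest sample eigenvalue lands close to the predicted location, well separated from the bulk edge (1 + √c)².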
Basics of Random Matrix Theory/Spiked Models 37/153 Tracy–Widom Theorem

Figure: distribution of the centered-scaled largest eigenvalue N^{2/3} c^{−1/2} (1 + √c)^{−4/3} [λ_1((1/n) X_N X_N^*) − (1 + √c)²] versus the Tracy–Widom law T_2, for N = 500, n = 1500.
Basics of Random Matrix Theory/Spiked Models 38/153 Other Spiked Models

Similar results hold for multiple matrix models:
- additive spiked model: Y_N = (1/n) XX^* + P, P deterministic and low rank
- Y_N = (1/n) X^*(I + P)X
- Y_N = (1/n) (X + P)^*(X + P)
- Y_N = (1/n) T X^*(I + P)X T
- etc.
Basics of Random Matrix Theory/Other Common Random Matrix Models 39/153 Outline
Basics of Random Matrix Theory/Other Common Random Matrix Models 40/153 The Semi-circle Law

Theorem. Let X_N ∈ C^{N×N} be Hermitian with e.s.d. µ_N, such that the entries [√N X_N]_{i>j} are i.i.d. with zero mean and unit variance. Then, as N → ∞, µ_N → µ a.s., with

µ(dt) = (1/2π) √((4 − t²)⁺) dt.

In particular, m_µ satisfies m_µ(z) = (−z − m_µ(z))^{−1}.
Basics of Random Matrix Theory/Other Common Random Matrix Models 41/153 The Semi-circle Law

Figure: histogram of the eigenvalues of a Wigner matrix versus the semi-circle law, for N = 500.
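Two predictions of the theorem are easy to check numerically: the spectrum concentrates on [−2, 2], and the semicircle mass of [−1, 1] equals 1/3 + √3/(2π) ≈ 0.609. An illustrative sketch (arbitrary seed and size):

```python
import numpy as np

rng = np.random.default_rng(1)
N = 1000
G = rng.standard_normal((N, N))
# Wigner matrix: i.i.d. unit-variance entries above the diagonal, scaled by 1/sqrt(N)
X = (G + G.T) / np.sqrt(2 * N)
eigs = np.linalg.eigvalsh(X)

edge = np.abs(eigs).max()               # should approach the bulk edge 2
frac = np.mean(np.abs(eigs) <= 1.0)     # should approach 1/3 + sqrt(3)/(2 pi)
print(edge, frac)
```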
Basics of Random Matrix Theory/Other Common Random Matrix Models 42/153 The Circular Law

Theorem. Let X_N ∈ C^{N×N} with e.s.d. µ_N be such that the entries [√N X_N]_{ij} are i.i.d. with zero mean and unit variance. Then, as N → ∞, µ_N → µ a.s., with µ the uniform measure on the complex unit disk,

µ(dz) = (1/π) 1_{|z| ≤ 1} dz.
Basics of Random Matrix Theory/Other Common Random Matrix Models 43/153 The Circular Law

Figure: eigenvalues (real versus imaginary parts) of X_N with i.i.d. standard Gaussian entries, for N = 500, against the circular law.
Basics of Random Matrix Theory/Other Common Random Matrix Models 44/153 Bibliographical References: Maths Book and Tutorial References I

From most accessible to least:
- Couillet, R., & Debbah, M. (2011). Random Matrix Methods for Wireless Communications. Cambridge University Press.
- Tao, T. (2012). Topics in Random Matrix Theory (Vol. 132). Providence, RI: American Mathematical Society.
- Bai, Z., & Silverstein, J. W. (2010). Spectral Analysis of Large Dimensional Random Matrices (Vol. 20). New York: Springer.
- Pastur, L. A., & Shcherbina, M. (2011). Eigenvalue Distribution of Large Random Matrices (Vol. 171). Providence, RI: American Mathematical Society.
- Anderson, G. W., Guionnet, A., & Zeitouni, O. (2010). An Introduction to Random Matrices (Vol. 118). Cambridge University Press.
Applications/ 45/153 Outline
Applications/Random Matrices and Robust Estimation 46/153 Outline
Applications/Random Matrices and Robust Estimation 47/153 Context

Baseline scenario: x_1, ..., x_n ∈ C^N (or R^N) i.i.d. with E[x_1] = 0, E[x_1 x_1^*] = C_N:

- If x_1 ~ N(0, C_N), the ML estimator for C_N is the sample covariance matrix (SCM)
  Ĉ_N = (1/n) Σ_{i=1}^n x_i x_i^*.

- [Huber 67] If x_1 ~ (1 − ε) N(0, C_N) + ε G, G unknown, robust estimator (n > N):
  Ĉ_N = (1/n) Σ_{i=1}^n max{ l_1, l_2 / ((1/N) x_i^* Ĉ_N^{−1} x_i) } x_i x_i^*, for some l_1, l_2 > 0.

- [Maronna 76] If x_1 is elliptical (and n > N), the ML estimator for C_N is given by
  Ĉ_N = (1/n) Σ_{i=1}^n u((1/N) x_i^* Ĉ_N^{−1} x_i) x_i x_i^*, for some non-increasing u.

- [Pascal 13; Chen 11] If N > n, x_1 elliptical or with outliers, shrinkage extensions:
  Ĉ_N(ρ) = (1 − ρ) (1/n) Σ_{i=1}^n x_i x_i^* / ((1/N) x_i^* Ĉ_N^{−1}(ρ) x_i) + ρ I_N
  Č_N(ρ) = B̌_N(ρ) / ((1/N) tr B̌_N(ρ)), with B̌_N(ρ) = (1 − ρ) (1/n) Σ_{i=1}^n x_i x_i^* / ((1/N) x_i^* Č_N^{−1}(ρ) x_i) + ρ I_N.
Applications/Random Matrices and Robust Estimation 48/153 Context

Results only known for N fixed and n → ∞: not appropriate in the settings of interest today (BigData, array processing, MIMO).

We study such Ĉ_N in the regime N, n → ∞, N/n → c ∈ (0, ∞).

Math interest:
- limiting eigenvalue distribution of Ĉ_N
- limiting values and fluctuations of functionals f(Ĉ_N)

Application interest:
- comparison between SCM and robust estimators
- performance of robust/non-robust estimation methods
- improvement thereof (by proper parametrization)
Applications/Random Matrices and Robust Estimation 49/153 Model Description

Definition (Maronna's Estimator). For x_1, ..., x_n ∈ C^N with n > N, Ĉ_N is the solution (upon existence and uniqueness) of

Ĉ_N = (1/n) Σ_{i=1}^n u((1/N) x_i^* Ĉ_N^{−1} x_i) x_i x_i^*

where u : [0, ∞) → (0, ∞) is non-increasing and such that φ(x) ≜ x u(x) is increasing with supremum φ_∞ satisfying 1 < φ_∞ < c^{−1}, c ∈ (0, 1).
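In practice the defining equation is solved by plain fixed-point iteration. The following is an illustrative sketch (not from the slides): the weight u(x) = (1+a)/(a+x) is an assumed example choice, for which φ(x) = x u(x) increases to φ_∞ = 1 + a, so the condition 1 < φ_∞ < c^{−1} holds for the sizes below (c = 1/4, a = 1/2).

```python
import numpy as np

def maronna(X, a=0.5, n_iter=100):
    """Fixed-point iteration for Maronna's M-estimator with the
    assumed weight u(x) = (1+a)/(a+x)."""
    N, n = X.shape
    C = np.eye(N)
    for _ in range(n_iter):
        # d_i = (1/N) x_i^* C^{-1} x_i, for all i at once
        d = np.einsum('ij,ji->i', X.T, np.linalg.solve(C, X)) / N
        C = (X * ((1 + a) / (a + d))) @ X.T / n
    return C

rng = np.random.default_rng(3)
N, n = 50, 200                                # c = 0.25
tau = rng.gamma(0.5, 2.0, size=n)             # impulsive amplitudes
W = rng.standard_normal((N, n))
W *= np.sqrt(N) / np.linalg.norm(W, axis=0)   # ||w_i|| = sqrt(N)
X = W * np.sqrt(tau)                          # elliptical-type samples

C_hat = maronna(X)
```

After enough iterations, re-applying the map leaves C_hat essentially unchanged, confirming it solves the fixed-point equation.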
Applications/Random Matrices and Robust Estimation 50/153 The Results in a Nutshell

For various models of the x_i's:

- First-order convergence: ||Ĉ_N − Ŝ_N|| → 0 a.s. for some tractable random matrices Ŝ_N. We only discuss this result here.
- Second-order results: N^{1−ε} (a^* Ĉ_N^k b − a^* Ŝ_N^k b) → 0 a.s., allowing transfer of CLT results.
- Applications: improved robust covariance matrix estimation; improved robust tests/estimators; specific examples in statistics at large, array processing, statistical finance, etc.
Applications/Random Matrices and Robust Estimation 51/153 (Elliptical) Scenario

Theorem (Large dimensional behavior, elliptical case). For x_i = √τ_i w_i, τ_i impulsive (random or not), w_i unitarily invariant with ||w_i|| = √N, we have ||Ĉ_N − Ŝ_N|| → 0 a.s. where, for some v related to u (v = u ∘ g^{−1}, g(x) = x (1 − c φ(x))^{−1}),

Ĉ_N ≜ (1/n) Σ_{i=1}^n u((1/N) x_i^* Ĉ_N^{−1} x_i) x_i x_i^*, Ŝ_N ≜ (1/n) Σ_{i=1}^n v(τ_i γ_N) x_i x_i^*

and γ_N is the unique solution of

1 = (1/n) Σ_{j=1}^n γ v(τ_j γ) / (1 + c γ v(τ_j γ)).

Corollaries:
- Spectral measure: µ_N^{Ĉ_N} − µ_N^{Ŝ_N} → 0 weakly, a.s. (with µ_N^X ≜ (1/N) Σ_{i=1}^N δ_{λ_i(X)})
- Local convergence: max_{1≤i≤N} |λ_i(Ĉ_N) − λ_i(Ŝ_N)| → 0 a.s.
- Norm boundedness: lim sup_N ||Ĉ_N|| < ∞, i.e. bounded spectrum (unlike the SCM!)
Applications/Random Matrices and Robust Estimation 52/153 Large Dimensional Behavior

Figure: histogram of the eigenvalues of Ĉ_N, of Ŝ_N, and the approximate limiting density; n = 2500, N = 500, C_N = diag(I_125, 3 I_125, 10 I_250), τ_i ~ Γ(.5, 2) i.i.d.
Applications/Random Matrices and Robust Estimation 53/153 Elements of Proof

Definition (v and ψ). Letting g(x) = x (1 − c φ(x))^{−1} (on R_+):
- v(x) ≜ (u ∘ g^{−1})(x), non-increasing
- ψ(x) ≜ x v(x), increasing and bounded by ψ_∞.

Lemma (Rewriting Ĉ_N). It holds (with C_N = I_N) that

Ĉ_N = (1/n) Σ_{i=1}^n τ_i v(τ_i d_i) w_i w_i^*

with (d_1, ..., d_n) ∈ R_+^n the a.s. unique solution to

d_i = (1/N) w_i^* Ĉ_(i)^{−1} w_i = (1/N) w_i^* ((1/n) Σ_{j≠i} τ_j v(τ_j d_j) w_j w_j^*)^{−1} w_i, i = 1, ..., n.
Applications/Random Matrices and Robust Estimation 54/153 Elements of Proof

Remark (Quadratic Form close to Trace). Random matrix insight: ((1/n) Σ_{j≠i} τ_j v(τ_j d_j) w_j w_j^*)^{−1} is almost independent of w_i, so

d_i = (1/N) w_i^* ((1/n) Σ_{j≠i} τ_j v(τ_j d_j) w_j w_j^*)^{−1} w_i ≃ (1/N) tr ((1/n) Σ_{j≠i} τ_j v(τ_j d_j) w_j w_j^*)^{−1} ≃ γ_N

for some deterministic sequence (γ_N)_{N=1}^∞, irrespective of i.

Lemma (Key Lemma). Letting e_i ≜ v(τ_i d_i) / v(τ_i γ_N), with γ_N the unique solution to

1 = (1/n) Σ_{k=1}^n ψ(τ_k γ_N) / (1 + c ψ(τ_k γ_N)),

we have max_{1≤i≤n} |e_i − 1| → 0 a.s.
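The scalar equation defining γ_N is easy to solve numerically, since ψ is increasing so the right-hand side is increasing in γ. An illustrative sketch (assumptions: the example weight u(x) = (1+a)/(a+x), for which g, and hence v = u ∘ g^{−1}, has a closed form because g(x) = y reduces to a quadratic in x):

```python
import numpy as np

def v(y, a, c):
    """v = u o g^{-1} for u(x) = (1+a)/(a+x): with b = 1 - c(1+a),
    g(x) = x(a+x)/(a + b x), so g(x) = y gives x^2 + (a - b y)x - a y = 0."""
    b = 1.0 - c * (1.0 + a)
    x = (-(a - b * y) + np.sqrt((a - b * y) ** 2 + 4 * a * y)) / 2
    return (1.0 + a) / (a + x)

def gamma_N(tau, c, a=0.5):
    """Bisection on g |-> (1/n) sum_k psi(tau_k g)/(1 + c psi(tau_k g)),
    psi(x) = x v(x), which increases in g; returns the root gamma_N."""
    def F(g):
        psi = tau * g * v(tau * g, a, c)
        return np.mean(psi / (1 + c * psi))
    lo, hi = 1e-8, 1e8
    for _ in range(200):
        mid = np.sqrt(lo * hi)       # geometric bisection
        if F(mid) < 1.0:
            lo = mid
        else:
            hi = mid
    return mid

rng = np.random.default_rng(7)
n, c = 1000, 0.25
tau = rng.gamma(0.5, 2.0, size=n)    # impulsive amplitudes
gamma = gamma_N(tau, c)
print(gamma)
```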
Applications/Random Matrices and Robust Estimation 55/153 Proof of the Key Lemma: max_i |e_i − 1| → 0 a.s., e_i = v(τ_i d_i)/v(τ_i γ_N)

Property (Quadratic form and γ_N).

max_{1≤i≤n} | (1/N) w_i^* ((1/n) Σ_{j≠i} τ_j v(τ_j γ_N) w_j w_j^*)^{−1} w_i − γ_N | → 0 a.s.

Proof of the Property: uniformity is easy (moments of all orders for [w_i]_j). By a quadratic-form-close-to-trace approach, we get

max_{1≤i≤n} | (1/N) w_i^* ((1/n) Σ_{j≠i} τ_j v(τ_j γ_N) w_j w_j^*)^{−1} w_i − m(0) | → 0 a.s.

with m(0) the unique positive solution to [MarPas 67; BaiSil 95]

m(0) = ( (1/n) Σ_{i=1}^n τ_i v(τ_i γ_N) / (1 + c τ_i v(τ_i γ_N) m(0)) )^{−1}.

γ_N precisely solves this equation, thus m(0) = γ_N.
Applications/Random Matrices and Robust Estimation 56/153 Proof of the Key Lemma (continued)

Substitution Trick (case τ_i ∈ [a, b] ⊂ (0, ∞)): up to relabelling e_1 ≤ ... ≤ e_n, use v(τ_i d_i) = v(τ_i γ_N) e_i ≤ v(τ_i γ_N) e_n to get

v(τ_n γ_N) e_n = v(τ_n d_n) = v( τ_n (1/N) w_n^* ((1/n) Σ_{i<n} τ_i v(τ_i d_i) w_i w_i^*)^{−1} w_n )
≤ v( τ_n e_n^{−1} (1/N) w_n^* ((1/n) Σ_{i<n} τ_i v(τ_i γ_N) w_i w_i^*)^{−1} w_n )
≤ v( τ_n e_n^{−1} (γ_N − ε_n) ) a.s., with ε_n → 0 (slowly).

Use the properties of ψ to get

ψ(τ_n γ_N) ≤ ψ(τ_n e_n^{−1} γ_N) (1 − ε_n γ_N^{−1})^{−1}.

Conclusion: if e_n > 1 + l infinitely often, then, as τ_n ∈ [a, b], on a subsequence

τ_n → τ_0 > 0, γ_N → γ_0 > 0, so that ψ(τ_0 γ_0) ≤ ψ(τ_0 γ_0 / (1 + l)),

a contradiction (ψ is increasing).
Applications/Random Matrices and Robust Estimation 57/153 Outlier Data

Theorem (Outlier Rejection). Observation set X = [x_1, ..., x_{(1−ε_n)n}, a_1, ..., a_{ε_n n}], where x_i ~ CN(0, C_N) and a_1, ..., a_{ε_n n} ∈ C^N are deterministic outliers. Then ||Ĉ_N − Ŝ_N|| → 0 a.s., where

Ŝ_N ≜ v(γ_N) (1/n) Σ_{i=1}^{(1−ε_n)n} x_i x_i^* + (1/n) Σ_{i=1}^{ε_n n} v(α_{i,n}) a_i a_i^*

with γ_N and α_{1,n}, ..., α_{ε_n n,n} the unique positive solutions to

γ_N = (1/N) tr C_N ( (1−ε) v(γ_N)/(1 + c v(γ_N) γ_N) C_N + (1/n) Σ_{i=1}^{ε_n n} v(α_{i,n}) a_i a_i^* )^{−1}

α_{i,n} = (1/N) a_i^* ( (1−ε) v(γ_N)/(1 + c v(γ_N) γ_N) C_N + (1/n) Σ_{j≠i} v(α_{j,n}) a_j a_j^* )^{−1} a_i, i = 1, ..., ε_n n.
Applications/Random Matrices and Robust Estimation 58/153 Outlier Data

For ε_n n = 1,

Ŝ_N = v(φ^{−1}(1)/(1 − c)) (1/n) Σ_{i=1}^{n−1} x_i x_i^* + ( v( φ^{−1}(1)/(1 − c) · (1/N) a_1^* C_N^{−1} a_1 ) + o(1) ) a_1 a_1^*.

Outlier rejection relies on the size of (1/N) a_1^* C_N^{−1} a_1 relative to 1.

For a_i ~ CN(0, D_N), ε_n → ε ≥ 0,

Ŝ_N = v(γ_n) (1/n) Σ_{i=1}^{(1−ε_n)n} x_i x_i^* + v(α_n) (1/n) Σ_{i=1}^{ε_n n} a_i a_i^*

γ_n = (1/N) tr C_N ( (1−ε) v(γ_n)/(1 + c v(γ_n) γ_n) C_N + ε v(α_n)/(1 + c v(α_n) α_n) D_N )^{−1}

α_n = (1/N) tr D_N ( (1−ε) v(γ_n)/(1 + c v(γ_n) γ_n) C_N + ε v(α_n)/(1 + c v(α_n) α_n) D_N )^{−1}.

For ε_n → 0,

Ŝ_N = v(φ^{−1}(1)/(1 − c)) (1/n) Σ_{i=1}^{(1−ε_n)n} x_i x_i^* + (1/n) Σ_{i=1}^{ε_n n} v( φ^{−1}(1)/(1 − c) · (1/N) tr D_N C_N^{−1} ) a_i a_i^*.

Outlier rejection relies on the size of (1/N) tr D_N C_N^{−1} relative to 1.
Applications/Random Matrices and Robust Estimation 59/153 Outlier Data

Figure: limiting (deterministic equivalent) eigenvalue distributions for (1/n) Σ_{i=1}^{(1−ε_n)n} x_i x_i^* (clean data), for (1/n) XX^* (or per-input normalized data), and for Ĉ_N; [C_N]_ij = .9^{|i−j|}, D_N = I_N, ε = .05.
Applications/Random Matrices and Robust Estimation 60/153 Example of Application to Statistical Finance

Robust matrix-optimized portfolio allocation. Figure: annualized standard deviation versus time t for the estimators Ĉ_ST, Ĉ_P, Ĉ_C, Ĉ_C2, Ĉ_LW, Ĉ_R (N = 45, n = 300).
Applications/Spectral Clustering Methods and Random Matrices 61/153 Outline
Applications/Spectral Clustering Methods and Random Matrices 62/153 Reminder on Spectral Clustering Methods

Context: two-step classification of n objects based on a similarity matrix A ∈ R^{n×n}:
1. extraction of the eigenvectors U = [u_1, ..., u_l] associated with the dominant eigenvalues
2. classification of the rows U_{1,·}, ..., U_{n,·} ∈ R^l using k-means/EM.

(Figure: spectrum of A with isolated spikes, and the corresponding eigenvectors; in practice, the entries are shuffled!!)
Applications/Spectral Clustering Methods and Random Matrices 63/153 Reminder on Spectral Clustering Methods

(Figure: eigenvectors 1 and 2; their l-dimensional representation, where shuffling no longer matters; then EM or k-means clustering.)
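The two steps above can be sketched end-to-end on toy data. This is an illustrative example only (Gaussian-kernel similarity, two synthetic Gaussian classes, and a median split standing in for k-means; all parameters are assumptions, not the slides' setting):

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 200, 5
mu = np.zeros(p)
mu[0] = 3.0
# two well-separated Gaussian classes in R^p
X = np.vstack([rng.standard_normal((n // 2, p)) - mu,
               rng.standard_normal((n // 2, p)) + mu])

# similarity matrix A: Gaussian kernel
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
A = np.exp(-sq / (2 * p))

# step 1: dominant eigenvectors (the second one carries the class structure here)
vals, vecs = np.linalg.eigh(A)
u2 = vecs[:, -2]

# step 2: 1-d clustering of the eigenvector entries (median split for simplicity)
labels = (u2 > np.median(u2)).astype(int)
truth = np.repeat([0, 1], n // 2)
acc = max(np.mean(labels == truth), np.mean(labels != truth))
print(acc)
```

With this separation the eigenvector entries form two clear "plateaus", so even the crude 1-d split recovers the classes almost perfectly.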
Applications/Spectral Clustering Methods and Random Matrices 64/153 The Random Matrix Approach

A two-step method:
1. If A_n is not a standard random matrix, retrieve Ã_n such that ||A_n − Ã_n|| → 0 a.s. in operator norm as n → ∞. This transfers crucial properties from A_n to Ã_n:
   - limiting eigenvalue distribution
   - spikes
   - eigenvectors of isolated eigenvalues.
2. From Ã_n, perform a spiked model analysis:
   - exhibit the phase transition phenomenon
   - read the content of the isolated eigenvectors of Ã_n.
Applications/Spectral Clustering Methods and Random Matrices 65/153 The Random Matrix Approach

The Spike Analysis: for the "noisy plateaus"-looking isolated eigenvectors u_1, ..., u_l of Ã_n, write

u_i = Σ_{a=1}^k α_i^a j_a/√n_a + σ_i^a w_i^a

with j_a ∈ R^n the canonical vector of class C_a and w_i^a noise orthogonal to j_a, and evaluate

α_i^a = (1/√n_a) u_i^T j_a, (σ_i^a)² = ||u_i − α_i^a j_a/√n_a||².

This can be done using complex analysis calculus, e.g.,

(α_i^a)² = (1/n_a) j_a^T u_i u_i^T j_a = −(1/2πı) ∮_{γ} (1/n_a) j_a^T (Ã_n − z I_n)^{−1} j_a dz

for γ a contour circling around the isolated eigenvalue associated with u_i only.
Applications/Community Detection on Graphs 66/153 Outline
Applications/Community Detection on Graphs 67/153 System Setting

Assume an n-node, m-edge undirected graph G, with:
- intrinsic average connectivity q_1, ..., q_n ~ µ i.i.d.
- k classes C_1, ..., C_k, independent of {q_i}, of (large) sizes n_1, ..., n_k, with preferential attachment C_ab between C_a and C_b
- induced edge probability, for nodes i ∈ C_a, j ∈ C_b: P(i ~ j) = q_i q_j C_ab
- adjacency matrix A with A_ij ~ Bernoulli(q_i q_j C_ab).
Applications/Community Detection on Graphs 68/153 Objective

Study of spectral methods:
- standard methods based on the adjacency A, the modularity A − dd^T/(2m), the normalized adjacency D^{−1} A D^{−1}, etc. (adapted to dense nets)
- refined methods based on the Bethe Hessian (r² − 1) I_n − r A + D (adapted to sparse nets!)

Improvement to realistic graphs:
- observation of failure of the standard methods above
- improvement by new methods.
Applications/Community Detection on Graphs 69/153 Limitations of Adjacency/Modularity Approach

Scenario: 3 classes with µ bi-modal (e.g., µ = (3/4) δ_{0.1} + (1/4) δ_{0.5}).

The leading eigenvectors of A (or of the modularity A − dd^T/(2m)) are biased by the q_i distribution. Similar behavior for the Bethe Hessian. (Figures: dominant eigenvectors for the modularity and Bethe Hessian approaches.)
Applications/Community Detection on Graphs 70/153 Regularized Modularity Approach

Connectivity Model: P(i ~ j) = q_i q_j C_ab for i ∈ C_a, j ∈ C_b.

Dense Regime Assumptions: non-trivial regime when, as n → ∞,

C_ab = 1 + M_ab/√n, M_ab = O(1).

Community information is weak but highly REDUNDANT!

Considered Matrix: for α ∈ [0, 1] (and with D = diag(A 1_n) = diag(d) the degree matrix, m = (1/2) d^T 1_n the number of edges),

L_α = (2m)^α (1/√n) D^{−α} [A − dd^T/(2m)] D^{−α}.

Our results in a nutshell:
- we find the optimal α_opt having the best phase transition
- we find a consistent estimator α̂_opt from A alone
- we claim optimal eigenvector regularization D^{α−1} u, u an eigenvector of L_α.
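The matrix L_α is straightforward to form from an adjacency matrix, and it satisfies an exact algebraic identity worth checking: since (A − dd^T/(2m)) 1_n = d − d = 0, the vector D^α 1_n (for α = 1, simply d) is a null vector of L_α. An illustrative sketch, where the DC-SBM parameters (q values, class sizes, affinity M) are assumptions loosely mirroring the example figure of the next slide:

```python
import numpy as np

rng = np.random.default_rng(5)
n, k = 300, 3
q = rng.choice([0.4, 0.9], size=n)                   # bimodal intrinsic connectivity
labels = np.repeat(np.arange(k), n // k)
M = np.full((k, k), 4.0)
np.fill_diagonal(M, 12.0)                            # assumed affinity matrix M
C = 1 + M[labels][:, labels] / np.sqrt(n)            # C_ab = 1 + M_ab / sqrt(n)
P = np.clip(np.outer(q, q) * C, 0, 1)
A = (rng.random((n, n)) < P).astype(float)
A = np.triu(A, 1)
A = A + A.T                                          # undirected, no self-loops

d = A.sum(1)
m = d.sum() / 2

def L(alpha):
    """L_alpha = (2m)^alpha n^{-1/2} D^{-alpha} [A - dd^T/(2m)] D^{-alpha}."""
    Da = d ** (-alpha)
    B = A - np.outer(d, d) / (2 * m)                 # modularity-like matrix
    return (2 * m) ** alpha / np.sqrt(n) * (Da[:, None] * B * Da[None, :])

L1 = L(1.0)
residual = np.linalg.norm(L1 @ d) / np.linalg.norm(d)
print(residual)
```

The informative spikes then live in the remaining spectrum of L_α, orthogonal to this trivial direction.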
Applications/Community Detection on Graphs 71/153 Asymptotic Equivalence
Theorem (Limiting Random Matrix Equivalent)
For each α ∈ [0, 1], as n → ∞, ‖L_α - L̃_α‖ → 0 almost surely, where
L_α = (2m)^α (1/√n) D^{-α} [A - dd^T/(2m)] D^{-α}
L̃_α = (1/√n) D_q^{-α} X D_q^{-α} + U Λ U^T
with D_q = diag({q_i}), X a zero-mean random matrix,
U = [ D_q^{1-α} J/√n , D_q^{-α} X 1_n/n ]  (rank k + 1),
Λ = [ (I_k - 1_k c^T) M (I_k - c 1_k^T)   -1_k ; -1_k^T   0 ]
and J = [j_1, …, j_k], j_a = (0, …, 0, 1_{n_a}^T, 0, …, 0)^T ∈ R^n the canonical vector of class C_a.
Consequences:
- isolated eigenvalues beyond the phase transition λ(M) > spectrum edge ⇒ optimal choice α_opt of α from the study of the noise spectrum
- eigenvectors correlated to D_q^{1-α} J ⇒ natural regularization by D^{α-1}!
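A minimal sketch of the regularization the theorem suggests (names are ours): take the dominant eigenvectors of L_α, rescale their entries by D^{α-1} to compensate the D_q^{1-α} factor in the informative subspace, and feed the rows to a classifier (here a simple median threshold stands in for k-means).

```python
import numpy as np

def regularized_eigenvectors(L_alpha, d, alpha, k):
    """Dominant k eigenvectors of L_alpha, rescaled entrywise by D^{alpha-1}."""
    vals, vecs = np.linalg.eigh(L_alpha)
    U = vecs[:, np.argsort(vals)[::-1][:k]]      # k largest eigenvalues
    U = (d ** (alpha - 1))[:, None] * U          # entrywise D^{alpha-1} u
    return U / np.linalg.norm(U, axis=0)         # renormalized columns

# Toy usage on a two-community graph.
rng = np.random.default_rng(0)
n = 200
labels = np.repeat([0, 1], n // 2)
P = np.where(labels[:, None] == labels[None, :], 0.30, 0.10)
A = np.triu((rng.random((n, n)) < P).astype(float), 1); A = A + A.T
d = A.sum(axis=1)
# L_{1/2} = (2m)^{1/2} n^{-1/2} D^{-1/2} (A - dd^T/(2m)) D^{-1/2}
L = (d.sum() ** 0.5 / np.sqrt(n)) * (A - np.outer(d, d) / d.sum()) \
    / np.sqrt(np.outer(d, d))
U = regularized_eigenvectors(L, d, alpha=0.5, k=2)
pred = (U[:, 0] > np.median(U[:, 0])).astype(int)
acc = max((pred == labels).mean(), (pred != labels).mean())
```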
Applications/Community Detection on Graphs 72/153 Eigenvalue Spectrum
Figure: Eigenvalues of L_1, limiting law, and spikes; K = 3, n = 2000, c_1 = 0.3, c_2 = 0.3, c_3 = 0.4, μ = (1/2)δ_{q(1)} + (1/2)δ_{q(2)}, q(1) = 0.4, q(2) = 0.9, M defined by M_ii = 12, M_ij = 4 for i ≠ j.
Applications/Community Detection on Graphs 73/153 Phase Transition
Theorem (Phase Transition)
For α ∈ [0, 1], L_α has an isolated eigenvalue λ_i(L_α) if λ_i(M̄) > τ_α, where
M̄ = (D(c) - cc^T) M,
τ_α = lim_{x↓S_α^+} 1/e_2^α(x)  (phase transition threshold),
with [S_α^-, S_α^+] the limiting eigenvalue support of L_α and (e_1^α(x), e_2^α(x)), x > S_α^+, solutions of
e_1^α(x) = ∫ q^{1-2α} / ( q^{1-2α} e_1^α(x) + q^{2-2α} e_2^α(x) - x ) μ(dq)
e_2^α(x) = ∫ q^{2-2α} / ( q^{1-2α} e_1^α(x) + q^{2-2α} e_2^α(x) - x ) μ(dq).
In this case, 1/e_2^α(λ_i(L_α)) = λ_i(M̄).
- Clustering is still possible when λ_i(M̄) = (min_α τ_α) + ε. Optimal α = α_opt: α_opt = argmin_{α∈[0,1]} τ_α.
- From max_i | d_i/√(d^T 1_n) - q_i | → 0 a.s., we obtain a consistent estimator α̂_opt of α_opt.
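The last point is easy to check numerically. A hedged sketch under the stated model (with C_ab ≈ 1, i.e., no community signal needed for this check; the two-mass law for q_i mirrors the figures that follow): the normalized degrees d_i/√(d^T 1_n) recover the intrinsic connection weights q_i.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000
q = rng.choice([0.3, 0.7], size=n)           # two-mass distribution of q_i
P = np.outer(q, q)                           # P(i ~ j) = q_i q_j
A = np.triu((rng.random((n, n)) < P).astype(float), 1)
A = A + A.T
d = A.sum(axis=1)
q_hat = d / np.sqrt(d.sum())                 # the slide's degree estimator
err = np.max(np.abs(q_hat - q))              # vanishes as n grows
```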
Applications/Community Detection on Graphs 74/153 Simulated Performance Results (2 masses of q_i)
Figure: Two dominant eigenvectors (x-y axes) for, from left to right, modularity, Bethe Hessian, the algorithm with α = 1, and the algorithm with α_opt; n = 2000, K = 3, μ = (3/4)δ_{q(1)} + (1/4)δ_{q(2)}, q(1) = 0.1, q(2) = 0.5, c_1 = c_2 = 1/4, c_3 = 1/2, M = 100 I_3.
Applications/Community Detection on Graphs 75/153 Simulated Performance Results (2 masses for q_i)
Figure: Largest eigenvalue λ of L_α, normalized by S_α^+ (y axis), as a function of the largest eigenvalue ℓ of (D(c) - cc^T)M, for μ = (3/4)δ_{q(1)} + (1/4)δ_{q(2)} with q(1) = 0.1 and q(2) = 0.5, for α ∈ {0, 1/4, 1/2, 3/4, 1, α_opt} (indicated below the graph). Here, α_opt = 0.07. Circles indicate the phase transition. Beyond the phase transition, ℓ = 1/e_2^α(λ).
Applications/Community Detection on Graphs 76/153 Simulated Performance Results (2 masses for q_i)
Figure: Overlap performance (y axis) of α = 0, α = 1/2, α = 1, α = α_opt, and the Bethe Hessian, with the phase transition marked, for n = 3000, K = 3, c_i = 1/3, μ = (3/4)δ_{q(1)} + (1/4)δ_{q(2)} with q(1) = 0.1 and q(2) = 0.5, M proportional to I_3 with the factor ranging over [5, 50] (x axis). Here α_opt = 0.07.
Applications/Community Detection on Graphs 77/153 Simulated Performance Results (2 masses for q_i)
Figure: Overlap performance (y axis) of α = 0, α = 0.5, α = 1, α = α_opt, and the Bethe Hessian, for n = 3000, K = 3, μ = (3/4)δ_{q(1)} + (1/4)δ_{q(2)} with q(1) = 0.1 and q(2) ∈ [0.1, 0.9] (x axis), M = 10(2I_3 - 1_3 1_3^T), c_i = 1/3.
Applications/Community Detection on Graphs 78/153 Theoretical Performance
Analysis of the eigenvectors reveals:
- eigenvectors are noisy staircase vectors
- conjectured Gaussian fluctuations of the eigenvector entries
- for q_i ≡ q_0 (homogeneous case), same variance for all entries
- in the non-homogeneous case, we can compute the average variance per class
⇒ Heuristic asymptotic performance upper bound using EM.
Applications/Community Detection on Graphs 79/153 Theoretical Performance Results (uniform distribution for q_i)
Figure: Theoretical probability of correct recovery (y axis) for α ∈ {1, 0.75, 0.5, 0.25, 0, α_opt}, for n = 2000, K = 2, c_1 = 0.6, c_2 = 0.4, μ uniformly distributed in [0.2, 0.8], M proportional to I_2 with the factor ranging over [0, 20] (x axis).
Applications/Community Detection on Graphs 80/153 Some Takeaway Messages
Main findings:
- Degree heterogeneity breaks the community structure in the eigenvectors. Compensated by the D^{α-1} normalization of the eigenvectors.
- The classical debate over the best normalization of the adjacency (or modularity) matrix A is not trivial to settle. With heterogeneous degrees, we found a good online method.
- Simulations support good performance even in rather sparse settings.
But strong limitations:
- Key assumption: C_ab = 1 + M_ab/√n. Everything collapses in a different regime.
- Simulations on small networks in fact give arbitrary, unreliable results.
Applications/Kernel Spectral Clustering 81/153 Outline Basics of Random Matrix Theory Motivation: Large Sample Covariance Matrices The Stieltjes Transform Method Spiked Models Other Common Random Matrix Models Applications Random Matrices and Robust Estimation Spectral Clustering Methods and Random Matrices Community Detection on Graphs Kernel Spectral Clustering Kernel Spectral Clustering: Subspace Clustering Semi-supervised Learning Support Vector Machines Neural Networks: Extreme Learning Machines Perspectives 81 / 153
Applications/Kernel Spectral Clustering 82/153 Kernel Spectral Clustering
Problem Statement
- Dataset x_1, …, x_n ∈ R^p
- Objective: cluster the data into k similarity classes S_1, …, S_k.
- Typical metric to optimize:
(RatioCut)  argmin_{S_1 ∪ … ∪ S_k = {1,…,n}} Σ_{i=1}^k Σ_{j ∈ S_i, j′ ∉ S_i} κ(x_j, x_{j′}) / |S_i|
for some similarity kernel κ(x, y) ≥ 0 (large if x is similar to y).
- Can be shown equivalent to
(RatioCut)  argmin_{M ∈ 𝓜} tr M^T (D - K) M
where 𝓜 = { M ∈ R^{n×k} ; M_ij ∈ {0, |S_j|^{-1/2}} } (in particular, M^T M = I_k) and
K = {κ(x_i, x_j)}_{i,j=1}^n,  D_ii = Σ_{j=1}^n K_ij.
- But an integer problem! NP-hard in general.
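The claimed equivalence is easy to verify numerically. A sketch (the function names are ours): the combinatorial RatioCut sum and the trace form tr M^T(D - K)M coincide for any partition.

```python
import numpy as np

def ratio_cut(K, classes, k):
    """Combinatorial RatioCut: sum over classes of (cut weight out of S_i) / |S_i|."""
    total = 0.0
    for i in range(k):
        inside = classes == i
        total += K[np.ix_(inside, ~inside)].sum() / inside.sum()
    return total

def ratio_cut_trace(K, classes, k):
    """Same value via tr M^T (D - K) M with M_ij in {0, |S_j|^{-1/2}}."""
    n = len(classes)
    M = np.zeros((n, k))
    for i in range(k):
        idx = classes == i
        M[idx, i] = idx.sum() ** -0.5      # column j constant on S_j
    D = np.diag(K.sum(axis=1))
    return np.trace(M.T @ (D - K) @ M)

rng = np.random.default_rng(2)
X = rng.standard_normal((60, 5))
sq = ((X[:, None] - X[None]) ** 2).sum(-1)
K = np.exp(-sq / 5)                        # a Gaussian similarity kernel
classes = np.repeat(np.arange(3), 20)      # an arbitrary 3-class partition
```

The agreement holds because the diagonal terms of K cancel between D and K inside each class, leaving exactly the cut weight leaving each S_i.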
Applications/Kernel Spectral Clustering 83/153 Kernel Spectral Clustering
Towards kernel spectral clustering
Kernel spectral clustering: discrete-to-continuous relaxations of such metrics
(RatioCut)  argmin_{M, M^T M = I_k} tr M^T (D - K) M
i.e., an eigenvector problem:
1. find the eigenvectors attached to the smallest eigenvalues
2. retrieve the classes from the eigenvector components
Refinements:
- working on K, D - K, I_n - D^{-1}K, I_n - D^{-1/2}KD^{-1/2}, etc.
- several-step algorithms: Ng-Jordan-Weiss, Shi-Malik, etc.
Applications/Kernel Spectral Clustering 84/153 Kernel Spectral Clustering
Figure: Leading four eigenvectors of D^{-1/2} K D^{-1/2} for MNIST data.
Applications/Kernel Spectral Clustering 85/153 Methodology and Objectives
Current state:
- Algorithms derived from ad-hoc procedures (e.g., relaxation).
- Little understanding of performance, even for Gaussian mixtures!
- Let alone when both p and n are large (BigData setting).
Objectives and roadmap:
- Develop a mathematical analysis framework for BigData kernel spectral clustering (p, n → ∞)
- Understand:
1. Phase transition effects (i.e., when is clustering possible?)
2. Content of each eigenvector
3. Influence of the kernel function
4. Performance comparison of clustering algorithms
Methodology:
- Use statistical assumptions (Gaussian mixture)
- Benefit from doubly-infinite independence and random matrix tools
Applications/Kernel Spectral Clustering 86/153 Model and Assumptions
Gaussian mixture model: x_1, …, x_n ∈ R^p, k classes C_1, …, C_k,
x_1, …, x_{n_1} ∈ C_1, …, x_{n-n_k+1}, …, x_n ∈ C_k,  C_a = {x | x ~ N(μ_a, C_a)}.
Then, for x_i ∈ C_a, x_i = μ_a + w_i with w_i ~ N(0, C_a).
Assumption (Convergence Rate): as n, p → ∞,
1. Data scaling: p/n → c_0 ∈ (0, ∞)
2. Class scaling: n_a/n → c_a ∈ (0, 1)
3. Mean scaling: with μ° = Σ_{a=1}^k (n_a/n) μ_a and μ_a° = μ_a - μ°, then ‖μ_a°‖ = O(1)
4. Covariance scaling: with C° = Σ_{a=1}^k (n_a/n) C_a and C_a° = C_a - C°, then
‖C_a‖ = O(1),  (1/√p) tr C_a° = O(1),  tr C_a° C_b° = O(p)
Applications/Kernel Spectral Clustering 87/153 Model and Assumptions
Kernel Matrix: the kernel matrix of interest is
K = { f( (1/p) ‖x_i - x_j‖² ) }_{i,j=1}^n
for some sufficiently smooth nonnegative f.
We study the normalized Laplacian:
L = n D^{-1/2} K D^{-1/2},  with d = K1_n, D = diag(d).
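One structural property of this Laplacian, used two slides later, can be checked exactly: L always has the eigenvalue n, with eigenvector D^{1/2}1_n, since L D^{1/2}1_n = n D^{-1/2}K1_n = n D^{-1/2}d = n D^{1/2}1_n. A small numpy sketch:

```python
import numpy as np

rng = np.random.default_rng(5)
p, n = 100, 50
X = rng.standard_normal((n, p))            # rows are the data points x_i
G = X @ X.T
sq = (np.diag(G)[:, None] + np.diag(G)[None, :] - 2 * G) / p  # (1/p)||x_i - x_j||^2
K = np.exp(-sq / 2)                        # f(x) = exp(-x/2)
d = K.sum(axis=1)
L = n * K / np.sqrt(np.outer(d, d))        # n D^{-1/2} K D^{-1/2}
v = np.sqrt(d)                             # the eigenvector D^{1/2} 1_n
```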
Applications/Kernel Spectral Clustering 88/153 Model and Assumptions
Difficulty: L is a very intractable random matrix
- non-linear f
- non-trivial dependence between the entries of L
Strategy:
1. Find a random equivalent L̂ (i.e., ‖L - L̂‖ → 0 a.s. as n, p → ∞) based on:
- concentration: K_ij → constant as n, p → ∞ (for all i ≠ j)
- Taylor expansion around the limit point
2. Apply the spiked random matrix approach to study:
- existence of isolated eigenvalues in L̂: phase transition
- eigenvector projections on the canonical class basis
Applications/Kernel Spectral Clustering 89/153 Random Matrix Equivalent
Results on K:
Key Remark: under our assumptions, uniformly in i, j ∈ {1, …, n},
(1/p) ‖x_i - x_j‖² → τ  a.s.
for some common limit τ.
⇒ large dimensional approximation for K:
K = f(τ)1_n1_n^T  [O(n)]  +  √n A_1  [low rank, O(√n)]  +  A_2  [informative terms, O(1)]
⇒ difficult to handle (3 orders to manipulate!)
Observation: spectrum of L = nD^{-1/2}KD^{-1/2}:
- dominant eigenvalue n with eigenvector D^{1/2}1_n
- all other eigenvalues of order O(1).
Naturally leads to studying the projected normalized Laplacian (or "modularity"-type Laplacian):
L = nD^{-1/2}KD^{-1/2} - n (D^{1/2}1_n 1_n^T D^{1/2}) / (1_n^T D 1_n) = nD^{-1/2} ( K - dd^T / (1_n^T d) ) D^{-1/2},
with the dominant (normalized) eigenvector D^{1/2}1_n / √(1_n^T D 1_n) projected out.
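The key remark can be illustrated numerically (a hedged sketch under the slide-86 scaling, with identity covariance so that τ = (2/p) tr C° = 2): even with distinct class means, all pairwise normalized squared distances crowd around the single value τ.

```python
import numpy as np

rng = np.random.default_rng(4)
p, n = 800, 400
mu = np.zeros((2, p)); mu[0, 0], mu[1, 0] = 2.0, -2.0    # ||mu_a|| = O(1)
X = np.vstack([mu[a] + rng.standard_normal((n // 2, p)) for a in range(2)])

G = X @ X.T                                # Gram trick avoids the n x n x p tensor
sq = (np.diag(G)[:, None] + np.diag(G)[None, :] - 2 * G) / p  # (1/p)||x_i - x_j||^2
off = sq[~np.eye(n, dtype=bool)]           # off-diagonal entries only
tau = 2.0                                  # (2/p) tr I_p
spread = np.abs(off - tau).max()           # shrinks as p grows, like O(p^{-1/2})
```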
Applications/Kernel Spectral Clustering 90/153 Random Matrix Equivalent
Theorem (Random Matrix Equivalent)
As n, p → ∞, in operator norm, ‖L - L̂‖ → 0 a.s., where
L̂ = -2 (f′(τ)/f(τ)) [ (1/p) P W^T W P + U B U^T ] + α(τ) I_n
and τ = (2/p) tr C°, W = [w_1, …, w_n] ∈ R^{p×n} (x_i = μ_a + w_i), P = I_n - (1/n)1_n1_n^T,
U = [ (1/√p) J, Φ, ψ ] ∈ R^{n×(2k+4)},
with B ∈ R^{(2k+4)×(2k+4)} a symmetric block matrix whose upper-left k×k block is
B_11 = M°^T M° + ( 5f′(τ)/(8f(τ)) - f″(τ)/(2f′(τ)) ) t t^T - (f″(τ)/f′(τ)) T + (p/n) (f(τ)α(τ)/(2f′(τ))) 1_k 1_k^T ∈ R^{k×k},
its remaining nonzero blocks combining (I_k - 1_k c^T)t, t^T(I_k - c1_k^T), and the same coefficient 5f′(τ)/(8f(τ)) - f″(τ)/(2f′(τ)), where
- J = [j_1, …, j_k] ∈ R^{n×k}, j_a the canonical vector of class C_a,
- M° = [μ_1°, …, μ_k°], μ_a° = μ_a - Σ_{b=1}^k (n_b/n) μ_b,
- t = ( (1/√p) tr C_1°, …, (1/√p) tr C_k° )^T ∈ R^k,
- T = { (1/p) tr C_a° C_b° }_{a,b=1}^k ∈ R^{k×k}, C_a° = C_a - Σ_{b=1}^k (n_b/n) C_b.
Applications/Kernel Spectral Clustering 91/153 Random Matrix Equivalent
Some consequences:
- L̂ is a spiked model: UBU^T seen as a low-rank perturbation of (1/p) P W^T W P
- If f′(τ) = 0, L is asymptotically deterministic! Only t and T can be discriminated upon
- If f″(τ) = 0 (e.g., f(x) = x), T is unused
- If 5f′(τ)/(8f(τ)) = f″(τ)/(2f′(τ)), t is (seemingly) unused
Further analysis:
- Determine the separability condition for the eigenvalues
- Evaluate the eigenvalue positions when separable
- Evaluate the eigenvector projections onto the canonical basis j_1, …, j_k
- Evaluate the fluctuations of the eigenvectors.
Applications/Kernel Spectral Clustering 92/153 Isolated Eigenvalues: Gaussian Inputs
Figure: Eigenvalues of L and L̂, k = 3, p = 2048, n = 512, c_1 = c_2 = 1/4, c_3 = 1/2, [μ_a]_j = 4δ_{aj}, C_a = (1 + 2(a-1)/√p) I_p, f(x) = exp(-x/2).
Applications/Kernel Spectral Clustering 93/153 Theoretical Findings versus MNIST
Figure: Eigenvalues of L (red) and of L̂, as if for the equivalent Gaussian model (white), MNIST data, p = 784, n = 192.
Applications/Kernel Spectral Clustering 94/153 Theoretical Findings versus MNIST
Figure: Leading four eigenvectors of D^{-1/2} K D^{-1/2} for MNIST data (red) and theoretical findings (blue).