Random Matrices for Big Data Signal Processing and Machine Learning
Random Matrices for Big Data Signal Processing and Machine Learning (ICASSP 2017, New Orleans). Romain Couillet and Hafiz Tiomoko Ali, CentraleSupélec, France. March 2017.
Outline

- Basics of Random Matrix Theory
  - Motivation: Large Sample Covariance Matrices
  - The Stieltjes Transform Method
  - Spiked Models
  - Other Common Random Matrix Models
- Applications
  - Random Matrices and Robust Estimation
  - Spectral Clustering Methods and Random Matrices
    - Community Detection on Graphs
    - Kernel Spectral Clustering
    - Kernel Spectral Clustering: Subspace Clustering
    - Semi-supervised Learning
    - Support Vector Machines
    - Neural Networks: Extreme Learning Machines
- Perspectives
Basics of Random Matrix Theory / Motivation: Large Sample Covariance Matrices
Context

Baseline scenario: x_1, ..., x_n ∈ C^N (or R^N) i.i.d. with E[x_1] = 0, E[x_1 x_1*] = C_N.

- If x_1 ~ N(0, C_N), the maximum-likelihood estimator of C_N is the sample covariance matrix (SCM)
  Ĉ_N = (1/n) Σ_{i=1}^n x_i x_i*.
- If n → ∞, then, by the strong law of large numbers, Ĉ_N → C_N a.s., or equivalently, in spectral norm, ‖Ĉ_N − C_N‖ → 0 a.s.

Random Matrix Regime

- No longer valid if N, n → ∞ with N/n → c ∈ (0, ∞): ‖Ĉ_N − C_N‖ does not converge to 0.
- For practical N, n with N ~ n, this leads to dramatically wrong conclusions.
- Even for N = n/…
The Large Dimensional Fallacies

Setting: x_i ∈ C^N i.i.d., x_1 ~ CN(0, I_N); assume N = N(n) such that N/n → c > 1.

- Then, joint point-wise convergence:
  max_{1≤i,j≤N} |[Ĉ_N − I_N]_{ij}| = max_{1≤i,j≤N} |(1/n)⟨X_{i,·}, X_{j,·}⟩ − δ_{ij}| → 0 a.s.
- However, eigenvalue mismatch: since Ĉ_N has rank at most n < N,
  0 = λ_1(Ĉ_N) = ... = λ_{N−n}(Ĉ_N) ≤ λ_{N−n+1}(Ĉ_N) ≤ ... ≤ λ_N(Ĉ_N)
  1 = λ_1(I_N) = ... = λ_N(I_N)
- Hence, no convergence in spectral norm.
The Marčenko-Pastur law

Figure: Histogram of the eigenvalues of Ĉ_N for N = 500, n = 2000, C_N = I_N, against the Marčenko-Pastur law.
The Marčenko-Pastur law

Definition (Empirical Spectral Density). The empirical spectral density (e.s.d.) μ_N of a Hermitian matrix A_N ∈ C^{N×N} is
  μ_N = (1/N) Σ_{i=1}^N δ_{λ_i(A_N)}.

Theorem (Marčenko-Pastur Law [Marčenko, Pastur 67]). Let X_N ∈ C^{N×n} have i.i.d. zero mean, unit variance entries. As N, n → ∞ with N/n → c ∈ (0, ∞), the e.s.d. μ_N of (1/n) X_N X_N* satisfies
  μ_N → μ_c a.s., weakly,
where
- μ_c({0}) = max{0, 1 − c^{−1}},
- on (0, ∞), μ_c has continuous density f_c supported on [(1 − √c)², (1 + √c)²],
  f_c(x) = (1/(2πcx)) √((x − (1 − √c)²)((1 + √c)² − x)).
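The statement above is easy to check numerically. The sketch below (an illustrative simulation, not part of the slides; sizes, seed, and tolerances are free choices) compares the spectrum of (1/n) X_N X_N* with the Marčenko-Pastur support and mass.

```python
# Quick numerical check of the Marchenko-Pastur law (illustrative sketch).
import numpy as np

rng = np.random.default_rng(0)
N, n = 500, 2000                      # ratio c = N/n = 1/4
c = N / n
X = rng.standard_normal((N, n))       # i.i.d. entries, zero mean, unit variance
eigs = np.linalg.eigvalsh(X @ X.T / n)

# Support edges predicted by the MP law
a, b = (1 - np.sqrt(c))**2, (1 + np.sqrt(c))**2

# MP density f_c on (a, b)
def f_c(x):
    return np.sqrt(np.maximum((x - a) * (b - x), 0)) / (2 * np.pi * c * x)

# Compare the empirical fraction of eigenvalues in [a, m] with the MP mass
m = (a + b) / 2
grid = np.linspace(a, m, 10_000)
mp_mass = f_c(grid).sum() * (grid[1] - grid[0])   # simple Riemann sum
emp_mass = np.mean(eigs <= m)
print(emp_mass, mp_mass)              # close for large N
```

All eigenvalues also fall near [(1 − √c)², (1 + √c)²], as the "no eigenvalue outside the support" result of a later section predicts.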
The Marčenko-Pastur law

Figure: Marčenko-Pastur density f_c for different limit ratios c = lim_N N/n, including c = 0.1.
Basics of Random Matrix Theory / The Stieltjes Transform Method
The Stieltjes transform

Definition (Stieltjes Transform). For μ a real probability measure with support supp(μ), the Stieltjes transform m_μ is defined, for z ∈ C \ supp(μ), as
  m_μ(z) = ∫ 1/(t − z) μ(dt).

Property (Inverse Stieltjes Transform). For a < b continuity points of μ,
  μ([a, b]) = lim_{ε↓0} (1/π) ∫_a^b ℑ[m_μ(x + iε)] dx.
Besides, if μ has a density f at x,
  f(x) = lim_{ε↓0} (1/π) ℑ[m_μ(x + iε)].
The Stieltjes transform

Property (Relation to e.s.d.). If μ is the e.s.d. of a Hermitian matrix A ∈ C^{N×N} (i.e., μ = (1/N) Σ_{i=1}^N δ_{λ_i(A)}), then
  m_μ(z) = (1/N) tr (A − z I_N)^{−1}.

Proof:
  m_μ(z) = ∫ μ(dt)/(t − z) = (1/N) Σ_{i=1}^N 1/(λ_i(A) − z) = (1/N) tr (diag{λ_i(A)} − z I_N)^{−1} = (1/N) tr (A − z I_N)^{−1}.
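This identity can be verified directly on any Hermitian matrix; the sketch below (illustrative sizes, not from the slides) compares the normalized trace of the resolvent with the Stieltjes transform of the e.s.d.

```python
# Sanity check: (1/N) tr (A - zI)^{-1} equals the Stieltjes transform
# of the empirical spectral distribution of A.
import numpy as np

rng = np.random.default_rng(1)
N = 200
G = rng.standard_normal((N, N))
A = (G + G.T) / 2                     # Hermitian (real symmetric) test matrix
z = 1.0 + 1.0j                        # any z off the real axis

m_resolvent = np.trace(np.linalg.inv(A - z * np.eye(N))) / N
m_esd = np.mean(1 / (np.linalg.eigvalsh(A) - z))
print(abs(m_resolvent - m_esd))       # ~ 0 up to numerical error
```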
The Stieltjes transform

Property (Stieltjes transform of Gram matrices). For X ∈ C^{N×n}, let
- μ be the e.s.d. of XX*,
- μ̄ be the e.s.d. of X*X.
Then
  m_μ(z) = (n/N) m_μ̄(z) − ((N − n)/N) (1/z).

Proof:
  m_μ(z) = (1/N) Σ_{i=1}^N 1/(λ_i(XX*) − z) = (1/N) Σ_{i=1}^n 1/(λ_i(X*X) − z) + (1/N)(N − n) · 1/(0 − z).
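Since XX* and X*X share their nonzero eigenvalues, the relation above is an exact algebraic identity; a quick numerical confirmation (sizes and seed are illustrative choices):

```python
# Check of the Gram-matrix relation between the Stieltjes transforms of
# XX* (N x N) and X*X (n x n): m_mu(z) = (n/N) m_mubar(z) - ((N - n)/N) / z.
import numpy as np

rng = np.random.default_rng(2)
N, n = 120, 80
X = rng.standard_normal((N, n))
z = 0.5 + 1.0j

m_mu = np.mean(1 / (np.linalg.eigvalsh(X @ X.T) - z))      # e.s.d. of XX*
m_mubar = np.mean(1 / (np.linalg.eigvalsh(X.T @ X) - z))   # e.s.d. of X*X
print(abs(m_mu - (n / N) * m_mubar + ((N - n) / N) / z))   # ~ 0
```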
The Stieltjes transform

Three fundamental lemmas used in all proofs.

Lemma (Resolvent Identity). For A, B ∈ C^{N×N} invertible,
  A^{−1} − B^{−1} = A^{−1} (B − A) B^{−1}.

Corollary. For t ∈ C, x ∈ C^N, A ∈ C^{N×N}, with A and A + t x x* invertible,
  (A + t x x*)^{−1} x = A^{−1} x / (1 + t x* A^{−1} x).
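Both identities are finite-dimensional linear algebra and can be checked on random matrices; a sketch (with positive definite test matrices chosen for guaranteed invertibility, an assumption of this illustration):

```python
# Numerical check of the resolvent identity and its Sherman-Morrison-type corollary.
import numpy as np

rng = np.random.default_rng(3)
N = 50
G, H = rng.standard_normal((2, N, N))
A = G @ G.T + np.eye(N)               # positive definite, hence invertible
B = H @ H.T + np.eye(N)
x = rng.standard_normal(N)
t = 0.7

# Resolvent identity: A^{-1} - B^{-1} = A^{-1} (B - A) B^{-1}
lhs = np.linalg.inv(A) - np.linalg.inv(B)
rhs = np.linalg.inv(A) @ (B - A) @ np.linalg.inv(B)
print(np.abs(lhs - rhs).max())

# Corollary: (A + t x x*)^{-1} x = A^{-1} x / (1 + t x* A^{-1} x)
lhs2 = np.linalg.solve(A + t * np.outer(x, x), x)
Ainv_x = np.linalg.solve(A, x)
rhs2 = Ainv_x / (1 + t * x @ Ainv_x)
print(np.abs(lhs2 - rhs2).max())
```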
The Stieltjes transform

Lemma (Rank-one perturbation). For A, B ∈ C^{N×N} Hermitian nonnegative definite, μ the e.s.d. of A, t > 0, x ∈ C^N, z ∈ C \ supp(μ),
  |(1/N) tr B (A + t x x* − z I_N)^{−1} − (1/N) tr B (A − z I_N)^{−1}| ≤ (1/N) ‖B‖ / dist(z, supp(μ)).
In particular, as N → ∞, if lim sup_N ‖B‖ < ∞,
  (1/N) tr B (A + t x x* − z I_N)^{−1} − (1/N) tr B (A − z I_N)^{−1} → 0.
The Stieltjes transform

Lemma (Trace Lemma). For
- x ∈ C^N with i.i.d. entries of zero mean, unit variance, and finite 2p-th order moment,
- A ∈ C^{N×N} deterministic (or independent of x),
then
  E[ |(1/N) x* A x − (1/N) tr A|^p ] ≤ K ‖A‖^p / N^{p/2}.
In particular, if lim sup_N ‖A‖ < ∞ and x has entries with finite eighth-order moment,
  (1/N) x* A x − (1/N) tr A → 0 a.s.
(by the Markov inequality and the Borel-Cantelli lemma).
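The concentration of the quadratic form can be seen empirically; the sketch below (illustrative sizes, one draw per size) shows the error shrinking as N grows, at the N^{−1/2} rate the lemma suggests.

```python
# Illustration of the trace lemma: (1/N) x* A x concentrates around (1/N) tr A
# when x is independent of A and A has bounded spectral norm.
import numpy as np

rng = np.random.default_rng(4)
errs = []
for N in (100, 400, 1600):
    G = rng.standard_normal((N, N))
    A = (G + G.T) / (2 * np.sqrt(N))  # Hermitian, spectral norm bounded in N
    x = rng.standard_normal(N)
    errs.append(abs(x @ A @ x / N - np.trace(A) / N))
print(errs)                           # decays roughly like N^{-1/2}
```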
Proof of the Marčenko-Pastur law

Theorem (Marčenko-Pastur Law [Marčenko, Pastur 67], restated). Let X_N ∈ C^{N×n} have i.i.d. zero mean, unit variance entries. As N, n → ∞ with N/n → c ∈ (0, ∞), the e.s.d. μ_N of (1/n) X_N X_N* satisfies μ_N → μ_c a.s., weakly, where
- μ_c({0}) = max{0, 1 − c^{−1}},
- on (0, ∞), μ_c has continuous density f_c supported on [(1 − √c)², (1 + √c)²],
  f_c(x) = (1/(2πcx)) √((x − (1 − √c)²)((1 + √c)² − x)).
Proof of the Marčenko-Pastur law

Stieltjes transform approach.

Proof. With μ_N the e.s.d. of (1/n) X_N X_N*,
  m_{μ_N}(z) = (1/N) tr ((1/n) X_N X_N* − z I_N)^{−1} = (1/N) Σ_{i=1}^N [((1/n) X_N X_N* − z I_N)^{−1}]_{ii}.
Write
  X_N = [ y* ; Y_{N−1} ] ∈ C^{N×n}   (y ∈ C^n, Y_{N−1} ∈ C^{(N−1)×n})
so that, for ℑ[z] > 0,
  (1/n) X_N X_N* − z I_N = [ (1/n) y* y − z , (1/n) y* Y_{N−1}* ; (1/n) Y_{N−1} y , (1/n) Y_{N−1} Y_{N−1}* − z I_{N−1} ].
Proof of the Marčenko-Pastur law

Proof (continued). From the block matrix inverse formula
  [A B; C D]^{−1} = [ (A − B D^{−1} C)^{−1} , −A^{−1} B (D − C A^{−1} B)^{−1} ; −D^{−1} C (A − B D^{−1} C)^{−1} , (D − C A^{−1} B)^{−1} ]
we have
  [((1/n) X_N X_N* − z I_N)^{−1}]_{11} = 1 / ( −z − z (1/n) y* ((1/n) Y_{N−1}* Y_{N−1} − z I_n)^{−1} y ).
By the Trace Lemma, as N, n → ∞,
  [((1/n) X_N X_N* − z I_N)^{−1}]_{11} − 1 / ( −z − z (1/n) tr ((1/n) Y_{N−1}* Y_{N−1} − z I_n)^{−1} ) → 0 a.s.
Proof of the Marčenko-Pastur law

Proof (continued). By the Rank-1 Perturbation Lemma (X_N* X_N = Y_{N−1}* Y_{N−1} + y y*), as N, n → ∞,
  [((1/n) X_N X_N* − z I_N)^{−1}]_{11} − 1 / ( −z − z (1/n) tr ((1/n) X_N* X_N − z I_n)^{−1} ) → 0 a.s.
Since
  (1/n) tr ((1/n) X_N* X_N − z I_n)^{−1} = (1/n) tr ((1/n) X_N X_N* − z I_N)^{−1} + ((N − n)/n) (1/z),
this becomes
  [((1/n) X_N X_N* − z I_N)^{−1}]_{11} − 1 / ( 1 − N/n − z − z (1/n) tr ((1/n) X_N X_N* − z I_N)^{−1} ) → 0 a.s.
Repeating for entries (2,2), ..., (N,N), and averaging, we get (for ℑ[z] > 0)
  m_{μ_N}(z) − 1 / ( 1 − N/n − z − z (N/n) m_{μ_N}(z) ) → 0 a.s.
Proof of the Marčenko-Pastur law

Proof (continued). Then m_{μ_N}(z) → m(z) a.s., with m(z) the solution to
  m(z) = 1 / (1 − c − z − c z m(z)),
i.e. (with the branch of the square root chosen such that m(z) → 0 as |z| → ∞),
  m(z) = (1 − c)/(2cz) − 1/(2c) + √((z − (1 − √c)²)(z − (1 + √c)²)) / (2cz).
Finally, by the inverse Stieltjes transform, for x > 0,
  lim_{ε↓0} (1/π) ℑ[m(x + iε)] = (1/(2πcx)) √(((1 + √c)² − x)(x − (1 − √c)²)) · 1_{x ∈ [(1−√c)², (1+√c)²]},
and for x = 0,
  lim_{ε↓0} ε ℑ[m(iε)] = (1 − c^{−1}) 1_{c > 1}.
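The self-consistent equation obtained in the proof can be solved numerically and compared with the empirical Stieltjes transform; a sketch (illustrative sizes, seed, and choice of z, not from the slides):

```python
# The empirical Stieltjes transform of (1/n) X X* approximately satisfies the
# Marchenko-Pastur fixed point m = 1 / (1 - c - z - c z m).
import numpy as np

rng = np.random.default_rng(5)
N, n = 400, 1200
c = N / n
X = rng.standard_normal((N, n))
z = 1.0 + 1.0j                        # any z in the upper half plane

m_emp = np.mean(1 / (np.linalg.eigvalsh(X @ X.T / n) - z))

# Solve the fixed point by simple iteration started in C^+
m = 1j
for _ in range(500):
    m = 1 / (1 - c - z - c * z * m)
print(abs(m_emp - m))                 # small for large N
```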
Sample Covariance Matrices

Theorem (Sample Covariance Matrix Model [Silverstein, Bai 95]). Let Y_N = C_N^{1/2} X_N ∈ C^{N×n}, where
- C_N ∈ C^{N×N} is nonnegative definite with e.s.d. ν_N → ν weakly,
- X_N ∈ C^{N×n} has i.i.d. entries of zero mean and unit variance.
As N, n → ∞, N/n → c ∈ (0, ∞), the e.s.d. μ̄_N of (1/n) Y_N* Y_N ∈ C^{n×n} satisfies
  μ̄_N → μ̄ a.s., weakly,
with m_μ̄(z), ℑ[z] > 0, the unique solution with ℑ[m_μ̄(z)] > 0 of
  m_μ̄(z) = −( z − c ∫ t/(1 + t m_μ̄(z)) ν(dt) )^{−1}.
Moreover, μ̄ is continuous on R_+ and real analytic wherever positive.

Immediate corollary: for μ_N the e.s.d. of (1/n) Y_N Y_N* = (1/n) Σ_{i=1}^n C_N^{1/2} x_i x_i* C_N^{1/2},
  μ_N → μ a.s., weakly, with μ̄ = c μ + (1 − c) δ_0.

Figure: Histogram of the eigenvalues of (1/n) Y_N* Y_N, n = 3000, N = 300, with C_N diagonal with evenly weighted masses at (i) 1, 3, 7 and (ii) 1, 3, 4.
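The implicit equation can be solved by fixed-point iteration for a discrete ν; the sketch below uses the setting of the figure (evenly weighted masses at 1, 3, 7), with reduced, illustrative dimensions for speed, and compares the solution with the empirical Stieltjes transform of (1/n) Y_N* Y_N.

```python
# Solving the Silverstein-Bai fixed point for nu with equal masses at 1, 3, 7
# and comparing with the empirical Stieltjes transform of (1/n) Y* Y.
import numpy as np

rng = np.random.default_rng(6)
N, n = 150, 1500
c = N / n
ts = np.array([1.0, 3.0, 7.0])        # support of nu, equal weights 1/3

C_sqrt = np.sqrt(np.repeat(ts, N // 3))
Y = C_sqrt[:, None] * rng.standard_normal((N, n))   # Y = C^{1/2} X
z = 2.0 + 1.0j

m_emp = np.mean(1 / (np.linalg.eigvalsh(Y.T @ Y / n) - z))

# Fixed-point iteration for m(z), started in C^+
m = 1j
for _ in range(2000):
    m = -1 / (z - c * np.mean(ts / (1 + ts * m)))
print(abs(m_emp - m))                 # small for large N, n
```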
Further Models and Deterministic Equivalents

Theorem (Doubly-correlated i.i.d. matrices). Let B_N = C_N^{1/2} X_N T_N X_N* C_N^{1/2}, with e.s.d. μ_N, where
- X_N ∈ C^{N×n} has i.i.d. entries of zero mean and variance 1/n,
- C_N is Hermitian nonnegative definite, T_N is diagonal nonnegative,
- lim sup_N max(‖C_N‖, ‖T_N‖) < ∞.
Denote c = N/n. Then, as N, n → ∞ with bounded ratio c, for z ∈ C \ R,
  m_{μ_N}(z) − m_N(z) → 0 a.s.,  m_N(z) = (1/N) tr ( ē_N(z) C_N − z I_N )^{−1},
with ē_N(z) the unique solution in {z ∈ C^+, ē_N(z) ∈ C^+} or {z ∈ R_−, ē_N(z) ∈ R_+} of
  e_N(z) = (1/N) tr C_N ( ē_N(z) C_N − z I_N )^{−1}
  ē_N(z) = (1/n) tr T_N ( I_n + c e_N(z) T_N )^{−1}.
Other Refined Sample Covariance Models

Side note on other models. Similar results hold for multiple matrix models:
- Information-plus-noise: Y_N = A_N + X_N, A_N deterministic
- Variance profile: Y_N = P_N ⊙ X_N (entry-wise product)
- Per-column covariance: Y_N = [y_1, ..., y_n], y_i = C_{N,i}^{1/2} x_i
- etc.
Basics of Random Matrix Theory / Spiked Models
No Eigenvalue Outside the Support

Theorem (No Eigenvalue Outside the Support [Silverstein, Bai 98]). Let Y_N = C_N^{1/2} X_N ∈ C^{N×n}, where
- C_N ∈ C^{N×N} is nonnegative definite with e.s.d. ν_N → ν weakly,
- E[|X_N,ij|^4] < ∞,
- X_N ∈ C^{N×n} has i.i.d. entries of zero mean and unit variance,
- max_i dist(λ_i(C_N), supp(ν)) → 0.
Let μ̄ be the limiting e.s.d. of (1/n) Y_N* Y_N as before, and let [a, b] ⊂ R \ supp(μ̄). Then, for all large n,
  { λ_i((1/n) Y_N* Y_N) }_{i=1}^n ∩ [a, b] = ∅
almost surely.

In practice: this means that, for N, n large, eigenvalues of (1/n) Y_N* Y_N cannot be found at macroscopic distance from the bulk.
Spiked Models

Breaking the rules. If we break Rule 1 (E[|X_ij|^4] < ∞): infinitely many eigenvalues may wander away from supp(μ).

Figure: Eigenvalues {λ_i}_{i=1}^n against μ, for E[X_ij^4] < ∞ and for E[X_ij^4] = ∞.
Spiked Models

If we break Rule 2: C_N may create isolated eigenvalues of (1/n) Y_N Y_N*, called spikes.

Figure: Eigenvalues of (1/n) Y_N Y_N* for C_N = diag(1, ..., 1, 2, 2, 3, 3) with N − 4 ones, N = 500; isolated eigenvalues appear near 1 + ω_1 + c(1 + ω_1)/ω_1 and 1 + ω_2 + c(1 + ω_2)/ω_2.
Spiked Models

Theorem (Eigenvalues [Baik, Silverstein 06]). Let Y_N = C_N^{1/2} X_N, with
- X_N with i.i.d. zero mean, unit variance entries, E[|X_N,ij|^4] < ∞,
- C_N = I_N + P, P = U Ω U*, where, for K fixed, Ω = diag(ω_1, ..., ω_K) ∈ R^{K×K}, with ω_1 ≥ ... ≥ ω_K > 0.
Then, as N, n → ∞, N/n → c ∈ (0, ∞), denoting λ_m = λ_m((1/n) Y_N Y_N*),
- if ω_m > √c,  λ_m → 1 + ω_m + c (1 + ω_m)/ω_m > (1 + √c)² a.s.;
- if ω_m ∈ (0, √c],  λ_m → (1 + √c)² a.s.
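The phase transition is visible in simulation; the sketch below (illustrative sizes and a rank-one spike carried by u = e_1, choices made for this illustration only) places one eigenvalue at the predicted position above the bulk edge.

```python
# Illustration of the spike phase transition: an isolated sample eigenvalue
# appears near 1 + w + c (1 + w)/w when w > sqrt(c).
import numpy as np

rng = np.random.default_rng(7)
N, n = 600, 1800
c = N / n                              # c = 1/3, threshold sqrt(c) ~ 0.577
w = 2.0                                # spike strength above the threshold
d = np.ones(N); d[0] = 1 + w           # C_N = I_N + w u u*, with u = e_1
Y = np.sqrt(d)[:, None] * rng.standard_normal((N, n))
lam = np.linalg.eigvalsh(Y @ Y.T / n)  # ascending order

bulk_edge = (1 + np.sqrt(c))**2        # right edge of the Marchenko-Pastur bulk
spike_limit = 1 + w + c * (1 + w) / w  # asymptotic position of the spike
print(lam[-1], spike_limit, bulk_edge)
```

The largest eigenvalue sits near spike_limit while the second largest stays at the bulk edge.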
Spiked Models

Proof. Two ingredients: algebraic calculus + trace lemma.

Find the eigenvalues away from those of (1/n) X_N X_N*:
  0 = det((1/n) Y_N Y_N* − λ I_N)
    = det(C_N) det((1/n) X_N X_N* − λ C_N^{−1})
    = det(C_N) det((1/n) X_N X_N* − λ I_N + λ(I_N − C_N^{−1}))
    = det(C_N) det((1/n) X_N X_N* − λ I_N) det( I_N + λ (I_N − C_N^{−1}) ((1/n) X_N X_N* − λ I_N)^{−1} ).

Use the low rank property: I_N − C_N^{−1} = I_N − (I_N + U Ω U*)^{−1} = U (I_K + Ω^{−1})^{−1} U*, Ω ∈ C^{K×K}. Hence
  0 = det((1/n) X_N X_N* − λ I_N) det( I_N + λ U (I_K + Ω^{−1})^{−1} U* ((1/n) X_N X_N* − λ I_N)^{−1} ).
Spiked Models

Proof (continued). By Sylvester's identity (det(I + AB) = det(I + BA)),
  0 = det((1/n) X_N X_N* − λ I_N) det( I_K + λ (I_K + Ω^{−1})^{−1} U* ((1/n) X_N X_N* − λ I_N)^{−1} U ).
- No eigenvalue outside the support [Bai, Sil 98]: det((1/n) X_N X_N* − λ I_N) has no zero beyond (1 + √c)² for all large n a.s.
- Extension of the Trace Lemma: for each z ∈ C \ supp(μ),
  U* ((1/n) X_N X_N* − z I_N)^{−1} U → m_μ(z) I_K a.s.
  (X_N being almost unitarily invariant, U can be seen as formed of random i.i.d.-like vectors).
As a result, for all large n a.s., the zeros of
  det( I_K + λ (I_K + Ω^{−1})^{−1} U* ((1/n) X_N X_N* − λ I_N)^{−1} U )
approach those of
  ∏_{m=1}^M ( 1 + λ (1 + ω_m^{−1})^{−1} m_μ(λ) )^{k_m} = ∏_{m=1}^M ( 1 + λ ω_m m_μ(λ)/(1 + ω_m) )^{k_m},
with ω_1, ..., ω_M the distinct values on the diagonal of Ω and k_1, ..., k_M their multiplicities.
Spiked Models

Proof (continued). The limiting solutions are the zeros (with multiplicity) of
  1 + (λ ω_m/(1 + ω_m)) m_μ(λ) = 0.
Using the Marčenko-Pastur law properties (m_μ(z) = (1 − c − z − c z m_μ(z))^{−1}), this yields
  λ ∈ { 1 + ω_m + c (1 + ω_m)/ω_m }_{m=1}^M.
Spiked Models

Theorem (Eigenvectors [Paul 07]). Let Y_N = C_N^{1/2} X_N, with
- X_N with i.i.d. zero mean, unit variance, finite fourth-order moment entries,
- C_N = I_N + P, P = Σ_{i=1}^K ω_i u_i u_i*, ω_1 > ... > ω_K > 0.
Then, as N, n → ∞, N/n → c ∈ (0, ∞), for a, b ∈ C^N deterministic and û_i the eigenvector of λ_i((1/n) Y_N Y_N*),
  a* û_i û_i* b − ((1 − c ω_i^{−2})/(1 + c ω_i^{−1})) a* u_i u_i* b · 1_{ω_i > √c} → 0 a.s.
In particular,
  |û_i* u_i|² → ((1 − c ω_i^{−2})/(1 + c ω_i^{−1})) 1_{ω_i > √c} a.s.

Proof: based on a Cauchy integral + similar ingredients as in the eigenvalue proof,
  a* û_i û_i* b = −(1/(2πi)) ∮_{C_i} a* ((1/n) Y_N Y_N* − z I_N)^{−1} b dz
for C_i a contour circling around λ_i only.
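The alignment formula can also be observed directly; a sketch in the rank-one case (illustrative sizes, u_1 = e_1 chosen for convenience):

```python
# Illustration of the eigenvector result: past the phase transition, the
# alignment |u1* uhat1|^2 concentrates around (1 - c w^{-2}) / (1 + c w^{-1}).
import numpy as np

rng = np.random.default_rng(8)
N, n = 600, 1800
c = N / n
w = 2.0                                # spike above the threshold sqrt(c)
d = np.ones(N); d[0] = 1 + w           # C_N = I_N + w e1 e1*
Y = np.sqrt(d)[:, None] * rng.standard_normal((N, n))
vals, vecs = np.linalg.eigh(Y @ Y.T / n)

alignment = vecs[0, -1]**2             # |e1* uhat_1|^2, uhat_1 = top eigenvector
limit = (1 - c / w**2) / (1 + c / w)
print(alignment, limit)
```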
Spiked Models

Figure: Simulated versus limiting |û_1* u_1|² for Y_N = C_N^{1/2} X_N, C_N = I_N + ω_1 u_1 u_1*, N/n = 1/3, N = 100, 200, 500, varying the population spike ω_1; the phase transition occurs at ω_1 = √c.
84 Basics of Random Matrix Theory/Spiked Models 36/153 Tracy Widom Theorem Theorem (Phase Transition [Baik,BenArous,Péché 05]) Let Y N = C 1 2 N X N, with X N with i.i.d. complex Gaussian zero mean, unit variance entries, C N = I N + P, P = K i=1 ω iu i u i, ω 1 >... > ω K > 0 (K 0). 36 / 153
85 Basics of Random Matrix Theory/Spiked Models 36/153 Tracy Widom Theorem Theorem (Phase Transition [Baik,BenArous,Péché 05]) Let Y N = C 1 2 N X N, with X N with i.i.d. complex Gaussian zero mean, unit variance entries, C N = I N + P, P = K i=1 ω iu i u i, ω 1 >... > ω K > 0 (K 0). Then, as N, n, N/n c < 1, If ω 1 < c (or K = 0), N 2 3 λ 1 (1 + c) 2 (1 + c) 4 3 c 1 2 L T 2, (complex Tracy Widom law) 36 / 153
86 Basics of Random Matrix Theory/Spiked Models 36/153

Tracy Widom Theorem

Theorem (Phase Transition [Baik, Ben Arous, Péché 05])
Let Y_N = C_N^{1/2} X_N, with X_N having i.i.d. complex Gaussian zero mean, unit variance entries, C_N = I_N + P, P = Σ_{i=1}^K ω_i u_i u_i^*, ω_1 > ... > ω_K > 0 (K ≥ 0). Then, as N, n → ∞ with N/n → c < 1:

If ω_1 < √c (or K = 0),

  N^{2/3} (λ_1 − (1 + √c)^2) / ((1 + √c)^{4/3} c^{1/2}) →_L T_2  (complex Tracy Widom law).

If ω_1 > √c,

  ((1 + ω_1)^2 − c (1 + ω_1)^2 / ω_1^2)^{−1/2} √N [ λ_1 − (1 + ω_1 + c (1 + ω_1)/ω_1) ] →_L N(0, 1).
87 Basics of Random Matrix Theory/Spiked Models 37/153

Tracy Widom Theorem

[Plot: histogram of the centered and scaled λ_1 against the Tracy Widom density T_2.]

Figure: Distribution of N^{2/3} c^{−1/2} (1 + √c)^{−4/3} [λ_1((1/n) X_N X_N^*) − (1 + √c)^2] versus Tracy Widom (T_2), N = 500, n = …
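The phase transition can be observed directly on the largest eigenvalue; below is a sketch (real Gaussian entries, so only the location of λ_1 is checked, not the Tracy Widom fluctuations, and all parameters are illustrative):

```python
import numpy as np

# BBP phase transition: below omega = sqrt(c) the largest eigenvalue sticks to
# the bulk edge (1 + sqrt(c))^2; above it, lambda_1 -> 1 + omega + c(1+omega)/omega.
rng = np.random.default_rng(1)
N, n = 900, 3600                 # c = 1/4, threshold sqrt(c) = 0.5
c = N / n

def lambda1(omega):
    X = rng.standard_normal((N, n))
    X[0, :] *= np.sqrt(1 + omega)        # C_N = I_N + omega e_1 e_1^T
    return np.linalg.eigvalsh(X @ X.T / n)[-1]

edge = (1 + np.sqrt(c)) ** 2             # bulk edge, 2.25 here
below = lambda1(0.1)                     # omega < sqrt(c): no isolated eigenvalue
above = lambda1(2.0)                     # omega > sqrt(c): isolated spike
spike_limit = 1 + 2.0 + c * (1 + 2.0) / 2.0   # 3.375 here
print(below, edge)
print(above, spike_limit)
```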
88 Basics of Random Matrix Theory/Spiked Models 38/153

Other Spiked Models

Similar results hold for multiple matrix models (P deterministic and low rank):
- Additive spiked model: Y_N = (1/n) X X^* + P
- Y_N = (1/n) X^* (I + P) X
- Y_N = (1/n) (X + P)^* (X + P)
- Y_N = (1/n) T^* X^* (I + P) X T
- etc.
89 Basics of Random Matrix Theory/Other Common Random Matrix Models 39/153 Outline Basics of Random Matrix Theory Motivation: Large Sample Covariance Matrices The Stieltjes Transform Method Spiked Models Other Common Random Matrix Models Applications Random Matrices and Robust Estimation Spectral Clustering Methods and Random Matrices Community Detection on Graphs Kernel Spectral Clustering Kernel Spectral Clustering: Subspace Clustering Semi-supervised Learning Support Vector Machines Neural Networks: Extreme Learning Machines Perspectives 39 / 153
90 Basics of Random Matrix Theory/Other Common Random Matrix Models 40/153

The Semi-circle law

Theorem
Let (1/√N) X_N ∈ C^{N×N} be Hermitian with e.s.d. µ_N, such that the [X_N]_{i>j} are i.i.d. with zero mean and unit variance. Then, as N → ∞, µ_N → µ almost surely (weakly), with

  µ(dt) = (1/2π) √((4 − t^2)^+) dt.

In particular, m_µ satisfies

  m_µ(z) = 1 / (−z − m_µ(z)).
91 Basics of Random Matrix Theory/Other Common Random Matrix Models 41/153

The Semi-circle law

[Plot: empirical eigenvalue histogram against the semi-circle density.]

Figure: Histogram of the eigenvalues of a Wigner matrix versus the semi-circle law, for N = …
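A quick numerical sanity check of the semi-circle law (a sketch with Gaussian entries; the interval [−1, 1] and dimension are arbitrary choices):

```python
import numpy as np

# Eigenvalues of a (1/sqrt(N))-scaled Wigner matrix concentrate on [-2, 2]
# with density sqrt(4 - t^2)/(2*pi).
rng = np.random.default_rng(2)
N = 1000
G = rng.standard_normal((N, N))
W = (G + G.T) / np.sqrt(2 * N)          # symmetric, off-diagonal variance 1/N
eigs = np.linalg.eigvalsh(W)

# Compare the empirical fraction of eigenvalues in [-1, 1] with the
# semi-circle mass of that interval (simple Riemann sum).
emp = np.mean(np.abs(eigs) <= 1)
t = np.linspace(-1, 1, 2001)
theory = (np.sqrt(4 - t**2) / (2 * np.pi)).sum() * (t[1] - t[0])
print(emp, theory)                       # both close to 0.61
```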
92 Basics of Random Matrix Theory/Other Common Random Matrix Models 42/153

The Circular law

Theorem
Let X_N ∈ C^{N×N} with e.s.d. µ_N be such that the (1/√N)[X_N]_{ij} are i.i.d. entries with zero mean and unit variance. Then, as N → ∞, µ_N → µ almost surely, with µ the uniform measure on the complex unit disk,

  µ(dz) = (1/π) 1_{|z| ≤ 1} dz.
93 Basics of Random Matrix Theory/Other Common Random Matrix Models 43/153

The Circular law

[Plot: empirical eigenvalues of X_N in the complex plane (real versus imaginary parts), uniformly filling the unit circle.]

Figure: Eigenvalues of X_N with i.i.d. standard Gaussian entries, for N = …
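The circular law is just as easy to visualize numerically; a minimal sketch (real Gaussian entries, so a small excess of real eigenvalues is expected at finite N):

```python
import numpy as np

# Eigenvalues of X/sqrt(N), X with i.i.d. standard entries, fill the unit
# disk uniformly as N grows.
rng = np.random.default_rng(3)
N = 1000
X = rng.standard_normal((N, N)) / np.sqrt(N)
eigs = np.linalg.eigvals(X)

r = np.abs(eigs)
half = np.mean(r <= np.sqrt(0.5))        # uniform on the disk => area fraction 1/2
print(r.max(), half)                     # spectral radius near 1, half near 0.5
```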
94 Basics of Random Matrix Theory/Other Common Random Matrix Models 44/153

Bibliographical references: Maths Book and Tutorial References I

From most accessible to least:
- Couillet, R., & Debbah, M. (2011). Random matrix methods for wireless communications. Cambridge University Press.
- Tao, T. (2012). Topics in random matrix theory (Vol. 132). Providence, RI: American Mathematical Society.
- Bai, Z., & Silverstein, J. W. (2010). Spectral analysis of large dimensional random matrices (Vol. 20). New York: Springer.
- Pastur, L. A., & Shcherbina, M. (2011). Eigenvalue distribution of large random matrices (Vol. 171). Providence, RI: American Mathematical Society.
- Anderson, G. W., Guionnet, A., & Zeitouni, O. (2010). An introduction to random matrices (Vol. 118). Cambridge University Press.
95 Applications/ 45/153 Outline Basics of Random Matrix Theory Motivation: Large Sample Covariance Matrices The Stieltjes Transform Method Spiked Models Other Common Random Matrix Models Applications Random Matrices and Robust Estimation Spectral Clustering Methods and Random Matrices Community Detection on Graphs Kernel Spectral Clustering Kernel Spectral Clustering: Subspace Clustering Semi-supervised Learning Support Vector Machines Neural Networks: Extreme Learning Machines Perspectives 45 / 153
96 Applications/Random Matrices and Robust Estimation 46/153 Outline Basics of Random Matrix Theory Motivation: Large Sample Covariance Matrices The Stieltjes Transform Method Spiked Models Other Common Random Matrix Models Applications Random Matrices and Robust Estimation Spectral Clustering Methods and Random Matrices Community Detection on Graphs Kernel Spectral Clustering Kernel Spectral Clustering: Subspace Clustering Semi-supervised Learning Support Vector Machines Neural Networks: Extreme Learning Machines Perspectives 46 / 153
101 Applications/Random Matrices and Robust Estimation 47/153

Context

Baseline scenario: x_1, ..., x_n ∈ C^N (or R^N) i.i.d. with E[x_1] = 0, E[x_1 x_1^*] = C_N:

If x_1 ~ N(0, C_N), the ML estimator for C_N is the sample covariance matrix (SCM)

  Ĉ_N = (1/n) Σ_{i=1}^n x_i x_i^*.

[Huber 67] If x_1 ~ (1 − ε) N(0, C_N) + ε G, G unknown, robust estimator (n > N)

  Ĉ_N = (1/n) Σ_{i=1}^n max{ l_1, l_2 / ((1/N) x_i^* Ĉ_N^{−1} x_i) } x_i x_i^*   for some l_1, l_2 > 0.

[Maronna 76] If x_1 is elliptical (and n > N), the ML estimator for C_N is given by

  Ĉ_N = (1/n) Σ_{i=1}^n u( (1/N) x_i^* Ĉ_N^{−1} x_i ) x_i x_i^*   for some non-increasing u.

[Pascal 13; Chen 11] If N > n, x_1 elliptical or with outliers, shrinkage extensions

  Ĉ_N(ρ) = (1 − ρ) (1/n) Σ_{i=1}^n x_i x_i^* / ((1/N) x_i^* Ĉ_N^{−1}(ρ) x_i) + ρ I_N
  Č_N(ρ) = B̌_N(ρ) / ((1/N) tr B̌_N(ρ)),  B̌_N(ρ) = (1 − ρ) (1/n) Σ_{i=1}^n x_i x_i^* / ((1/N) x_i^* Č_N^{−1}(ρ) x_i) + ρ I_N.
105 Applications/Random Matrices and Robust Estimation 48/153

Context

Results only known for N fixed and n → ∞: not appropriate in settings of interest today (BigData, array processing, MIMO).

We study such Ĉ_N in the regime N, n → ∞, N/n → c ∈ (0, ∞).

Math interest:
- limiting eigenvalue distribution of Ĉ_N
- limiting values and fluctuations of functionals f(Ĉ_N)

Application interest:
- comparison between SCM and robust estimators
- performance of robust/non-robust estimation methods
- improvement thereof (by proper parametrization)
107 Applications/Random Matrices and Robust Estimation 49/153

Model Description

Definition (Maronna's Estimator)
For x_1, ..., x_n ∈ C^N with n > N, Ĉ_N is the solution (upon existence and uniqueness) of

  Ĉ_N = (1/n) Σ_{i=1}^n u( (1/N) x_i^* Ĉ_N^{−1} x_i ) x_i x_i^*

where u : [0, ∞) → (0, ∞) is
- non-increasing
- such that φ(x) ≜ x u(x) is increasing, with supremum φ_∞ satisfying 1 < φ_∞ < c^{−1}, c ∈ (0, 1).
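The defining equation is a fixed point and is usually solved by plain iteration; the following is a minimal sketch, with u(x) = (1 + t)/(t + x) as one classical weight meeting the stated conditions (u non-increasing, φ(x) = x u(x) increasing to φ_∞ = 1 + t, and 1 < φ_∞ < c^{−1} for the chosen t, N, n):

```python
import numpy as np

# Fixed-point iteration for Maronna's estimator (identity scatter, illustrative
# parameters; t = 0.5 gives phi_inf = 1.5 < c^{-1} = n/N = 10).
rng = np.random.default_rng(4)
N, n = 50, 500
t = 0.5
u = lambda x: (1 + t) / (t + x)

# Elliptical-like samples x_i = sqrt(tau_i) w_i with impulsive tau_i
tau = rng.gamma(shape=0.5, scale=2.0, size=n)
X = np.sqrt(tau) * rng.standard_normal((N, n))

C_hat = np.eye(N)
for _ in range(500):
    d = np.sum(X * np.linalg.solve(C_hat, X), axis=0) / N   # (1/N) x_i' C^{-1} x_i
    C_new = (X * u(d)) @ X.T / n
    delta = np.linalg.norm(C_new - C_hat) / np.linalg.norm(C_hat)
    C_hat = C_new
    if delta < 1e-12:
        break

# Residual of the defining fixed-point equation at the returned C_hat
d = np.sum(X * np.linalg.solve(C_hat, X), axis=0) / N
resid = np.linalg.norm((X * u(d)) @ X.T / n - C_hat) / np.linalg.norm(C_hat)
print(resid)
```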
110 Applications/Random Matrices and Robust Estimation 50/153

The Results in a Nutshell

For various models of the x_i's:

First order convergence: ‖Ĉ_N − Ŝ_N‖ → 0 almost surely, for some tractable random matrices Ŝ_N. We only discuss this result here.

Second order results: N^{1−ε} ( a^* Ĉ_N^k b − a^* Ŝ_N^k b ) → 0 almost surely, allowing transfer of CLT results.

Applications:
- improved robust covariance matrix estimation
- improved robust tests / estimators
- specific examples in statistics at large, array processing, statistical finance, etc.
114 Applications/Random Matrices and Robust Estimation 51/153

(Elliptical) scenario

Theorem (Large dimensional behavior, elliptical case)
For x_i = √τ_i w_i, τ_i impulsive (random or not), w_i unitarily invariant with ‖w_i‖ = √N, we have ‖Ĉ_N − Ŝ_N‖ → 0 almost surely, with, for some v related to u (v = u ∘ g^{−1}, g(x) = x(1 − cφ(x))^{−1}),

  Ĉ_N ≡ (1/n) Σ_{i=1}^n u( (1/N) x_i^* Ĉ_N^{−1} x_i ) x_i x_i^*,   Ŝ_N ≡ (1/n) Σ_{i=1}^n v(τ_i γ_N) x_i x_i^*

and γ_N the unique solution of

  1 = (1/n) Σ_{j=1}^n γ v(τ_j γ) / (1 + c γ v(τ_j γ)).

Corollaries
- Spectral measure: µ^{Ĉ_N}_N − µ^{Ŝ_N}_N → 0 a.s. (weakly), where µ^X_N ≜ (1/N) Σ_{i=1}^N δ_{λ_i(X)}
- Local convergence: max_{1 ≤ i ≤ N} |λ_i(Ĉ_N) − λ_i(Ŝ_N)| → 0 a.s.
- Norm boundedness: lim sup_N ‖Ĉ_N‖ < ∞. Bounded spectrum (unlike the SCM!)
117 Applications/Random Matrices and Robust Estimation 52/153

Large dimensional behavior

[Plot: eigenvalue histograms of Ĉ_N and Ŝ_N, together with the approximating density.]

Figure: n = 2500, N = 500, C_N = diag(I_125, 3 I_125, 10 I_250), τ_i ~ Γ(.5, 2) i.i.d.
119 Applications/Random Matrices and Robust Estimation 53/153

Elements of Proof

Definition (v and ψ)
Letting g(x) = x(1 − cφ(x))^{−1} (on R^+), define
- v(x) ≜ (u ∘ g^{−1})(x), non-increasing
- ψ(x) ≜ x v(x), increasing and bounded by ψ_∞.

Lemma (Rewriting Ĉ_N)
It holds (with C_N = I_N) that

  Ĉ_N = (1/n) Σ_{i=1}^n τ_i v(τ_i d_i) w_i w_i^*

with (d_1, ..., d_n) ∈ R^n_+ the a.s. unique solution to

  d_i = (1/N) w_i^* Ĉ_{(i)}^{−1} w_i = (1/N) w_i^* ( (1/n) Σ_{j≠i} τ_j v(τ_j d_j) w_j w_j^* )^{−1} w_i,  i = 1, ..., n.
122 Applications/Random Matrices and Robust Estimation 54/153

Elements of Proof

Remark (Quadratic Form close to Trace)
Random matrix insight: ( (1/n) Σ_{j≠i} τ_j v(τ_j d_j) w_j w_j^* )^{−1} is almost independent of w_i, so

  d_i = (1/N) w_i^* ( (1/n) Σ_{j≠i} τ_j v(τ_j d_j) w_j w_j^* )^{−1} w_i ≈ (1/N) tr ( (1/n) Σ_{j≠i} τ_j v(τ_j d_j) w_j w_j^* )^{−1} ≈ γ_N

for some deterministic sequence (γ_N)_{N=1}^∞, irrespective of i.

Lemma (Key Lemma)
Letting e_i ≜ v(τ_i d_i) / v(τ_i γ_N), with γ_N the unique solution to

  1 = (1/n) Σ_{k=1}^n ψ(τ_k γ_N) / (1 + c ψ(τ_k γ_N)),

we have

  max_{1 ≤ i ≤ n} |e_i − 1| → 0 a.s.
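The "quadratic form close to trace" step can be illustrated numerically; a minimal sketch with an arbitrary well-conditioned random matrix in place of the weighted sum (all parameters illustrative):

```python
import numpy as np

# For a vector w with i.i.d. entries independent of A,
# (1/N) w^T A^{-1} w concentrates around (1/N) tr A^{-1}.
rng = np.random.default_rng(5)
N, n = 1000, 3000
W = rng.standard_normal((N, n))
A = W @ W.T / n + 0.1 * np.eye(N)       # random positive definite matrix
w = rng.standard_normal(N)              # fresh vector, independent of A

Ainv = np.linalg.inv(A)
qf = w @ Ainv @ w / N
tr = np.trace(Ainv) / N
print(qf, tr)                           # close, fluctuations of order 1/sqrt(N)
```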
124 Applications/Random Matrices and Robust Estimation 55/153

Proof of the Key Lemma: max_i |e_i − 1| → 0 a.s., e_i = v(τ_i d_i)/v(τ_i γ_N)

Property (Quadratic form and γ_N)

  max_{1 ≤ i ≤ n} | (1/N) w_i^* ( (1/n) Σ_{j≠i} τ_j v(τ_j γ_N) w_j w_j^* )^{−1} w_i − γ_N | → 0 a.s.

Proof of the Property
Uniformity is easy (moments of all orders for the [w_i]_j). By a quadratic-form-close-to-trace approach, we get

  max_{1 ≤ i ≤ n} | (1/N) w_i^* ( (1/n) Σ_{j≠i} τ_j v(τ_j γ_N) w_j w_j^* )^{−1} w_i − m(0) | → 0 a.s.

with m(0) the unique positive solution to [MarPas 67; BaiSil 95]

  m(0) = ( (1/n) Σ_{i=1}^n τ_i v(τ_i γ_N) / (1 + c τ_i v(τ_i γ_N) m(0)) )^{−1}.

γ_N precisely solves this equation, thus m(0) = γ_N.
127 Applications/Random Matrices and Robust Estimation 56/153

Proof of the Key Lemma: max_i |e_i − 1| → 0 a.s., e_i = v(τ_i d_i)/v(τ_i γ_N)

Substitution Trick (case τ_i ∈ [a, b] ⊂ (0, ∞))
Up to relabelling e_1 ≤ ... ≤ e_n, use (recall v(τ_i d_i) = v(τ_i γ_N) e_i)

  v(τ_n γ_N) e_n = v(τ_n d_n)
                = v( τ_n (1/N) w_n^* ( (1/n) Σ_{i<n} τ_i v(τ_i d_i) w_i w_i^* )^{−1} w_n )
                ≤ v( τ_n e_n^{−1} (1/N) w_n^* ( (1/n) Σ_{i<n} τ_i v(τ_i γ_N) w_i w_i^* )^{−1} w_n )
                ≤ v( τ_n e_n^{−1} (γ_N − ε_n) ) a.s.,  ε_n → 0 (slowly).

Use the properties of ψ to get

  ψ(τ_n γ_N) ≤ ψ( τ_n e_n^{−1} γ_N ) (1 − ε_n γ_N^{−1})^{−1}.

Conclusion: if e_n > 1 + l infinitely often, since τ_n ∈ [a, b], on a subsequence

  τ_n → τ_0 > 0,  γ_N → γ_0 > 0,  ψ(τ_0 γ_0) ≤ ψ( τ_0 γ_0 / (1 + l) ),

a contradiction.
128 Applications/Random Matrices and Robust Estimation 57/153

Outlier Data

Theorem (Outlier Rejection)
Observation set

  X = [ x_1, ..., x_{(1−ε_n)n}, a_1, ..., a_{ε_n n} ]

where x_i ~ CN(0, C_N) and a_1, ..., a_{ε_n n} ∈ C^N are deterministic outliers. Then ‖Ĉ_N − Ŝ_N‖ → 0 a.s., where

  Ŝ_N ≜ v(γ_N) (1/n) Σ_{i=1}^{(1−ε_n)n} x_i x_i^* + (1/n) Σ_{i=1}^{ε_n n} v(α_{i,N}) a_i a_i^*

with γ_N and α_{1,N}, ..., α_{ε_n n,N} the unique positive solutions to

  γ_N = (1/N) tr C_N ( (1−ε) v(γ_N) / (1 + c v(γ_N) γ_N) · C_N + (1/n) Σ_{i=1}^{ε_n n} v(α_{i,N}) a_i a_i^* )^{−1}
  α_{i,N} = (1/N) a_i^* ( (1−ε) v(γ_N) / (1 + c v(γ_N) γ_N) · C_N + (1/n) Σ_{j≠i} v(α_{j,N}) a_j a_j^* )^{−1} a_i,  i = 1, ..., ε_n n.
131 Applications/Random Matrices and Robust Estimation 58/153

Outlier Data

For ε_n n = 1,

  Ŝ_N = v( φ^{−1}(1)/(1 − c) ) (1/n) Σ_{i=1}^{n−1} x_i x_i^* + (1/n) v( φ^{−1}(1)/(1 − c) · (1/N) a_1^* C_N^{−1} a_1 ) a_1 a_1^* + o(1).

Outlier rejection relies on (1/N) a_1^* C_N^{−1} a_1 ≷ 1.

For a_i ~ CN(0, D_N), ε_n → ε > 0,

  Ŝ_N = v(γ_n) (1/n) Σ_{i=1}^{(1−ε_n)n} x_i x_i^* + v(α_n) (1/n) Σ_{i=1}^{ε_n n} a_i a_i^*
  γ_n = (1/N) tr C_N ( (1−ε) v(γ_n)/(1 + c v(γ_n) γ_n) · C_N + ε v(α_n)/(1 + c v(α_n) α_n) · D_N )^{−1}
  α_n = (1/N) tr D_N ( (1−ε) v(γ_n)/(1 + c v(γ_n) γ_n) · C_N + ε v(α_n)/(1 + c v(α_n) α_n) · D_N )^{−1}.

For ε_n → 0,

  Ŝ_N = v( φ^{−1}(1)/(1 − c) ) (1/n) Σ_{i=1}^{(1−ε_n)n} x_i x_i^* + (1/n) Σ_{i=1}^{ε_n n} v( φ^{−1}(1)/(1 − c) · (1/N) tr D_N C_N^{−1} ) a_i a_i^*.

Outlier rejection relies on (1/N) tr D_N C_N^{−1} ≷ 1.
134 Applications/Random Matrices and Robust Estimation 59/153

Outlier Data

[Plot: deterministic equivalent eigenvalue distributions of (i) the clean-data SCM (1/((1−ε_n)n)) Σ_{i=1}^{(1−ε_n)n} x_i x_i^*, (ii) (1/n) X X^* or its per-input normalized version, and (iii) Ĉ_N.]

Figure: Limiting eigenvalue distributions. [C_N]_{ij} = .9^{|i−j|}, D_N = I_N, ε = …
135 Applications/Random Matrices and Robust Estimation 60/153

Example of application to statistical finance

Robust matrix-optimized portfolio allocation.

[Plot: annualized standard deviation over time t of portfolios built from Ĉ_ST, Ĉ_P, Ĉ_C, Ĉ_C2, Ĉ_LW, Ĉ_R (N = 45, n = 300).]
136 Applications/Spectral Clustering Methods and Random Matrices 61/153 Outline Basics of Random Matrix Theory Motivation: Large Sample Covariance Matrices The Stieltjes Transform Method Spiked Models Other Common Random Matrix Models Applications Random Matrices and Robust Estimation Spectral Clustering Methods and Random Matrices Community Detection on Graphs Kernel Spectral Clustering Kernel Spectral Clustering: Subspace Clustering Semi-supervised Learning Support Vector Machines Neural Networks: Extreme Learning Machines Perspectives 61 / 153
140 Applications/Spectral Clustering Methods and Random Matrices 62/153

Reminder on Spectral Clustering Methods

Context: Two-step classification of n objects based on a similarity matrix A ∈ R^{n×n}:
1. extraction of the eigenvectors U = [u_1, ..., u_l] associated with the dominant eigenvalues
2. classification of the vectors U_{1,·}, ..., U_{n,·} ∈ R^l using k-means/EM.

[Plot: spectrum of A, a bulk around 0 with isolated spikes, and the corresponding eigenvectors (in practice, shuffled!!).]
143 Applications/Spectral Clustering Methods and Random Matrices 63/153

Reminder on Spectral Clustering Methods

[Plots: Eigenvector 1 and Eigenvector 2 against the object index; then the l-dimensional representation (Eigenvector 1 versus Eigenvector 2), in which the shuffling no longer matters; finally, EM or k-means clustering of that representation.]
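The two-step method can be run end to end on a toy spiked similarity matrix; the following sketch uses rank-one class structure plus Wigner noise, with a simple sign rule standing in for the k-means/EM step (all parameters illustrative):

```python
import numpy as np

# Two-step spectral clustering on A = signal + noise: extract the dominant
# eigenvector, then cluster its entries.
rng = np.random.default_rng(6)
n = 200
labels = np.repeat([0, 1], n // 2)
s = 1.0 - 2.0 * labels                   # +/-1 class signs

G = rng.standard_normal((n, n))
noise = (G + G.T) / np.sqrt(2 * n)       # Wigner noise, bulk in [-2, 2]
A = 5.0 * np.outer(s, s) / n + noise     # isolated spike near 5 + 1/5

u = np.linalg.eigh(A)[1][:, -1]          # step 1: dominant eigenvector
pred = (u > 0).astype(int)               # step 2: clustering of the entries
acc = max(np.mean(pred == labels), np.mean(pred != labels))  # up to label swap
print(acc)
```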
151 Applications/Spectral Clustering Methods and Random Matrices 64/153

The Random Matrix Approach

A two-step method:
1. If A_n is not a standard random matrix, retrieve Ã_n such that ‖A_n − Ã_n‖ → 0 almost surely in operator norm as n → ∞.
   This transfers the crucial properties from A_n to Ã_n:
   - limiting eigenvalue distribution
   - spikes
   - eigenvectors of isolated eigenvalues.
2. From Ã_n, perform a spiked model analysis:
   - exhibit the phase transition phenomenon
   - read the content of the isolated eigenvectors of Ã_n.
154 Applications/Spectral Clustering Methods and Random Matrices 65/153

The Random Matrix Approach

The Spike Analysis: For the noisy-plateau-looking isolated eigenvectors u_1, ..., u_l of Ã_n, write

  u_i = Σ_{a=1}^k ( α_i^a j_a/√(n_a) + σ_i^a w_i^a )

with j_a ∈ R^n the canonical vector of class C_a, w_i^a noise orthogonal to j_a, and evaluate

  α_i^a = (1/√(n_a)) u_i^T j_a,   (σ_i^a)^2 = ‖u_i − α_i^a j_a/√(n_a)‖^2.

⇒ Can be done using complex analysis calculus, e.g.,

  (α_i^a)^2 = (1/n_a) j_a^T u_i u_i^T j_a = −(1/2πı) ∮_{γ_a} (1/n_a) j_a^T (Ã_n − z I_n)^{−1} j_a dz.
155 Applications/Community Detection on Graphs 66/153 Outline Basics of Random Matrix Theory Motivation: Large Sample Covariance Matrices The Stieltjes Transform Method Spiked Models Other Common Random Matrix Models Applications Random Matrices and Robust Estimation Spectral Clustering Methods and Random Matrices Community Detection on Graphs Kernel Spectral Clustering Kernel Spectral Clustering: Subspace Clustering Semi-supervised Learning Support Vector Machines Neural Networks: Extreme Learning Machines Perspectives 66 / 153
159 Applications/Community Detection on Graphs 67/153

System Setting

Assume an n-node, m-edge undirected graph G, with:
- intrinsic average connectivities q_1, ..., q_n ~ µ i.i.d.
- k classes C_1, ..., C_k, independent of {q_i}, of (large) sizes n_1, ..., n_k, with preferential attachment C_ab between C_a and C_b
- induced edge probability for nodes i ∈ C_a, j ∈ C_b: P(i ∼ j) = q_i q_j C_ab
- adjacency matrix A with A_ij ~ Bernoulli(q_i q_j C_ab).
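Sampling this degree-corrected model is straightforward; a sketch with illustrative values of µ and C_ab (chosen so that all edge probabilities stay below 1):

```python
import numpy as np

# Node i in class C_a and node j in class C_b are linked with probability
# q_i q_j C_ab, i.e. A_ij ~ Bernoulli(q_i q_j C_ab).
rng = np.random.default_rng(7)
n, k = 600, 3
labels = np.repeat(np.arange(k), n // k)         # classes C_1, C_2, C_3
q = rng.choice([0.4, 0.9], size=n)               # bi-modal intrinsic connectivity
C = np.where(np.eye(k, dtype=bool), 1.2, 0.9)    # preferential attachment C_ab

P = np.outer(q, q) * C[labels][:, labels]        # P(i ~ j), all entries <= 1
A = np.triu((rng.random((n, n)) < P).astype(int), k=1)
A = A + A.T                                      # undirected adjacency, no self-loops

same = labels[:, None] == labels[None, :]
off_diag = ~np.eye(n, dtype=bool)
within = A[same & off_diag].mean()
across = A[~same].mean()
print(within, across)                            # within-class density is higher
```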
161 Applications/Community Detection on Graphs 68/153

Objective

Study of spectral methods:
- standard methods, based on the adjacency A, the modularity A − d d^T/(2m), the normalized adjacency D^{−1} A D^{−1}, etc. (adapted to dense nets)
- refined methods, based on the Bethe Hessian (r^2 − 1) I_n − r A + D (adapted to sparse nets!)

Improvement to realistic graphs:
- observation of the failure of the standard methods above
- improvement by new methods.
163 Applications/Community Detection on Graphs 69/153

Limitations of Adjacency/Modularity Approach

Scenario: 3 classes with µ bi-modal (e.g., µ = (3/4) δ_{q^{(1)}} + (1/4) δ_{q^{(2)}}, q^{(2)} = 0.5).

[Plots: leading eigenvectors for the modularity and the Bethe Hessian approaches.]

Leading eigenvectors of A (or of the modularity A − d d^T/(2m)) are biased by the q_i distribution. Similar behavior for the Bethe Hessian.
169 Applications/Community Detection on Graphs 70/153

Regularized Modularity Approach

Connectivity Model: P(i ∼ j) = q_i q_j C_ab for i ∈ C_a, j ∈ C_b.

Dense Regime Assumptions: non-trivial regime when, as n → ∞,

  C_ab = 1 + M_ab/√n,  M_ab = O(1).

Community information is weak but highly REDUNDANT!

Considered Matrix: For α ∈ [0, 1] (and with D = diag(A 1_n) = diag(d) the degree matrix, m = (1/2) d^T 1_n the number of edges),

  L_α = (2m)^α (1/√n) D^{−α} [ A − d d^T/(2m) ] D^{−α}.

Our results in a nutshell:
- we find the optimal α_opt having the best phase transition
- we find a consistent estimator α̂_opt from A alone
- we claim optimal eigenvector regularization D^{α−1} u, u an eigenvector of L_α.
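Building L_α from a sampled dense-regime graph is a few lines; the sketch below uses α = 1/2 and an illustrative affinity M (not the optimal α_opt or its estimator from the slides), and exploits that d^α is an exact null vector of L_α since (A − d d^T/(2m)) 1_n = 0:

```python
import numpy as np

# L_alpha for a two-class dense-regime graph; the top eigenvector, after the
# D^{alpha-1} regularization, carries the class information.
rng = np.random.default_rng(8)
n = 1000
labels = np.repeat([0, 1], n // 2)
q = rng.choice([0.4, 0.9], size=n)
M = np.where(labels[:, None] == labels[None, :], 10.0, -10.0)   # M_ab = O(1)
P = np.clip(np.outer(q, q) * (1 + M / np.sqrt(n)), 0, 1)        # C_ab = 1 + M_ab/sqrt(n)
A = np.triu((rng.random((n, n)) < P).astype(int), k=1)
A = A + A.T

d = A.sum(axis=1).astype(float)
m = d.sum() / 2
alpha = 0.5
Da = np.diag(d ** -alpha)
L = (2 * m) ** alpha / np.sqrt(n) * Da @ (A - np.outer(d, d) / (2 * m)) @ Da

u = np.linalg.eigh(L)[1][:, -1]                  # isolated top eigenvector
pred = (d ** (alpha - 1) * u > 0).astype(int)    # regularized, then sign rule
acc = max(np.mean(pred == labels), np.mean(pred != labels))
null_resid = np.abs(L @ d ** alpha).max()        # d^alpha lies in the kernel
print(acc, null_resid)
```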
170 Applications/Community Detection on Graphs 71/153

Asymptotic Equivalence

Theorem (Limiting Random Matrix Equivalent). For each α ∈ [0, 1], as n → ∞, ‖L_α − L̄_α‖ → 0 almost surely, where
L_α = (2m)^α (1/n) D^{-α} [ A − dd^T/(2m) ] D^{-α}
L̄_α = (1/√n) D_q^{-α} X D_q^{-α} + U Λ U^T
with D_q = diag({q_i}), X a zero-mean random matrix,
U = [ (1/√n) D_q^{1−α} J , D_q^{-α} X 1_n ], of rank k + 1,
Λ = [ (I_k − 1_k c^T) M (I_k − c 1_k^T) , −1_k ; −1_k^T , 0 ]
and J = [j_1, ..., j_k], j_a = [0, ..., 0, 1_{n_a}^T, 0, ..., 0]^T ∈ R^n the canonical vector of class C_a.

Consequences:
- isolated eigenvalues beyond the phase transition λ(M) > spectrum edge
- optimal choice α_opt of α from the study of the noise spectrum
- eigenvectors correlated to D_q^{1−α} J → natural regularization by D^{α−1}!

71 / 153
173 Applications/Community Detection on Graphs 72/153

Eigenvalue Spectrum

[Plot: eigenvalues of L_1, the limiting law, and the isolated spikes, with support edges S_α^−, S_α^+.]

Figure: Eigenvalues of L_1, K = 3, n = 2000, c_1 = 0.3, c_2 = 0.3, c_3 = 0.4, µ = (1/2)δ_{q(1)} + (1/2)δ_{q(2)}, q(1) = 0.4, q(2) = 0.9, M defined by M_ii = 12, M_ij = 4, i ≠ j.

72 / 153
174 Applications/Community Detection on Graphs 73/153

Phase Transition

Theorem (Phase Transition). For α ∈ [0, 1], there exists an isolated eigenvalue λ_i(L_α) if λ_i(M̄) > τ_α, where
M̄ = (D(c) − cc^T) M,
τ_α = lim_{x↓S_α^+} 1/e_2^α(x), the phase transition threshold,
with [S_α^−, S_α^+] the limiting eigenvalue support of L_α and e_1^α(x), e_2^α(x) (x > S_α^+) the solutions of
e_1^α(x) = ∫ q^{1−2α} / ( x − [ q^{1−2α} e_1^α(x) + q^{2−2α} e_2^α(x) ] ) µ(dq)
e_2^α(x) = ∫ q^{2−2α} / ( x − [ q^{1−2α} e_1^α(x) + q^{2−2α} e_2^α(x) ] ) µ(dq).
In this case, 1/e_2^α(λ_i(L_α)) = λ_i(M̄).

- Clustering is still possible when λ_i(M̄) = (min_α τ_α) + ε. Optimal α = α_opt:
  α_opt = argmin_{α∈[0,1]} {τ_α}.
- From max_i | d_i/√(d^T 1_n) − q_i | → 0 almost surely, we obtain a consistent estimator α̂_opt of α_opt.

73 / 153
177 Applications/Community Detection on Graphs 74/153

Simulated Performance Results (2 masses of q_i)

[Four panels of dominant-eigenvector scatter plots: (Modularity), (Bethe Hessian), (Algo with α = 1), (Algo with α = α_opt).]

Figure: Two dominant eigenvectors (x-y axes) for n = 2000, K = 3, µ = (3/4)δ_{q(1)} + (1/4)δ_{q(2)}, q(1) = 0.1, q(2) = 0.5, c_1 = c_2 = 1/4, c_3 = 1/2, M = 100 I_3.

74 / 153
180 Applications/Community Detection on Graphs 75/153

Simulated Performance Results (2 masses for q_i)

[Plot: normalized spike λ − S_α^+ versus the eigenvalue ℓ, one curve per value of α, with α_opt highlighted.]

Figure: Largest eigenvalue λ of L_α as a function of the largest eigenvalue ℓ of (D(c) − cc^T)M, for µ = (3/4)δ_{q(1)} + (1/4)δ_{q(2)} with q(1) = 0.1 and q(2) = 0.5, for α ∈ {0, 1/4, 1/2, 3/4, 1, α_opt} (indicated below the graph). Here, α_opt = …. Circles indicate the phase transition. Beyond the phase transition, ℓ = 1/e_2^α(λ).

75 / 153
183 Applications/Community Detection on Graphs 76/153

Simulated Performance Results (2 masses for q_i)

[Plot: overlap versus community strength, curves for α = 0, α = 1/2, α = 1, α = α_opt and the Bethe Hessian; the phase transition is marked.]

Figure: Overlap performance for n = 3000, K = 3, c_i = 1/3, µ = (3/4)δ_{q(1)} + (1/4)δ_{q(2)} with q(1) = 0.1 and q(2) = 0.5, M = I_3, for [5, 50]. Here, α_opt = ….

76 / 153
187 Applications/Community Detection on Graphs 77/153

Simulated Performance Results (2 masses for q_i)

[Plot: overlap versus q(2), curves for α = 0, α = 0.5, α = 1, α = α_opt and the Bethe Hessian.]

Figure: Overlap performance for n = 3000, K = 3, µ = (3/4)δ_{q(1)} + (1/4)δ_{q(2)} with q(1) = 0.1 and q(2) ∈ [0.1, 0.9], M = 10(2I_3 − 1_3 1_3^T), c_i = ….

77 / 153
188 Applications/Community Detection on Graphs 78/153

Theoretical Performance

Analysis of the eigenvectors reveals:
- eigenvectors are noisy staircase vectors
- conjectured Gaussian fluctuations of the eigenvector entries
- for q_i = q_0 (homogeneous case), same variance for all entries
- in the non-homogeneous case, we can compute the average variance per class
→ Heuristic asymptotic performance upper-bound using EM.

78 / 153
191 Applications/Community Detection on Graphs 79/153

Theoretical Performance Results (uniform distribution for q_i)

[Plot: correct classification rate, curves for α = 1, α = 0.75, α = 0.5, α = 0.25, α = 0 and α_opt.]

Figure: Theoretical probability of correct recovery for n = 2000, K = 2, c_1 = 0.6, c_2 = 0.4, µ uniformly distributed in [0.2, 0.8], M = I_2, for [0, 20].

79 / 153
192 Applications/Community Detection on Graphs 80/153

Some Takeaway Messages

Main findings:
- Degree heterogeneity breaks community structures in eigenvectors. Compensation by the D^{α−1} normalization of eigenvectors.
- The classical debate over the best normalization of the adjacency (or modularity) matrix A is not trivial to settle. With heterogeneous degrees, we found a good on-line method.
- Simulations support good performances even for rather sparse settings.

But strong limitations:
- Key assumption: C_{ab} = 1 + M_{ab}/√n. Everything collapses in a different regime.
- Simulations on small networks in fact give arbitrary, unreliable results.

80 / 153
200 Applications/Kernel Spectral Clustering 81/153 Outline Basics of Random Matrix Theory Motivation: Large Sample Covariance Matrices The Stieltjes Transform Method Spiked Models Other Common Random Matrix Models Applications Random Matrices and Robust Estimation Spectral Clustering Methods and Random Matrices Community Detection on Graphs Kernel Spectral Clustering Kernel Spectral Clustering: Subspace Clustering Semi-supervised Learning Support Vector Machines Neural Networks: Extreme Learning Machines Perspectives 81 / 153
201 Applications/Kernel Spectral Clustering 82/153

Kernel Spectral Clustering

Problem statement:
- Dataset x_1, ..., x_n ∈ R^p.
- Objective: cluster the data into k similarity classes S_1, ..., S_k.

Typical metric to optimize:
(RatioCut) argmin_{S_1 ∪ ... ∪ S_k = {1,...,n}} Σ_{i=1}^k Σ_{j ∈ S_i, j′ ∉ S_i} κ(x_j, x_{j′}) / |S_i|
for some similarity kernel κ(x, y) ≥ 0 (large if x similar to y).

Can be shown equivalent to
(RatioCut) argmin_{M ∈ M} tr M^T (D − K) M
where M = { M ∈ R^{n×k} ; M_ij ∈ {0, |S_j|^{−1/2}} } (in particular, M^T M = I_k) and
K = {κ(x_i, x_j)}_{i,j=1}^n, D_ii = Σ_{j=1}^n K_ij.

But an integer problem! Usually NP-hard.

82 / 153
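The stated equivalence can be checked numerically. The snippet below is our own sanity check, with an arbitrary symmetric similarity matrix and a fixed two-class partition (all names and numbers are illustrative):

```python
import numpy as np

# RatioCut identity check: sum_i cut(S_i, complement)/|S_i| = tr M^T (D - K) M
rng = np.random.default_rng(0)
n = 6
K = rng.random((n, n))
K = (K + K.T) / 2                      # symmetric similarity matrix
np.fill_diagonal(K, 0)
D = np.diag(K.sum(axis=1))             # degree matrix D_ii = sum_j K_ij
parts = [[0, 1, 2], [3, 4, 5]]         # a fixed partition S_1, S_2

ratiocut = sum(K[np.ix_(S, [j for j in range(n) if j not in S])].sum() / len(S)
               for S in parts)

M = np.zeros((n, 2))                   # indicator matrix with M_ij in {0, |S_j|^{-1/2}}
for a, S in enumerate(parts):
    M[S, a] = 1 / np.sqrt(len(S))

print(np.isclose(ratiocut, np.trace(M.T @ (D - K) @ M)))
```

The identity holds exactly: expanding tr M^T (D − K) M per class reproduces the cut weight leaving each S_i, divided by |S_i|.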
205 Applications/Kernel Spectral Clustering 83/153

Kernel Spectral Clustering

Towards kernel spectral clustering: discrete-to-continuous relaxations of such metrics,
(RatioCut) argmin_{M, M^T M = I_k} tr M^T (D − K) M
i.e., an eigenvector problem:
1. find the eigenvectors of the smallest eigenvalues
2. retrieve the classes from the eigenvector components

Refinements:
- working on K, D − K, I_n − D^{−1}K, I_n − D^{−1/2}KD^{−1/2}, etc.
- several-step algorithms: Ng-Jordan-Weiss, Shi-Malik, etc.

83 / 153
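The two relaxation steps above can be condensed into a short numpy sketch. This is a generic illustration, not the authors' code: the Gaussian kernel, the farthest-point k-means initialization, and all names are our choices.

```python
import numpy as np

def kernel_spectral_clustering(X, k, f=lambda x: np.exp(-x / 2)):
    """Relaxed RatioCut: kernel -> normalized Laplacian -> eigenvectors -> k-means."""
    n, p = X.shape
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1) / p  # (1/p)||x_i - x_j||^2
    K = f(sq)                                                # kernel matrix
    d = K.sum(axis=1)
    L = np.eye(n) - K / np.sqrt(np.outer(d, d))              # I_n - D^{-1/2} K D^{-1/2}
    U = np.linalg.eigh(L)[1][:, :k]                          # eigenvectors, smallest eigenvalues
    # plain k-means on the rows of U, deterministic farthest-point initialization
    centers = [U[0]]
    for _ in range(1, k):
        dist = np.min([np.linalg.norm(U - c, axis=1) for c in centers], axis=0)
        centers.append(U[dist.argmax()])
    centers = np.array(centers)
    for _ in range(50):
        labels = ((U[:, None, :] - centers[None]) ** 2).sum(-1).argmin(1)
        centers = np.array([U[labels == j].mean(0) if (labels == j).any() else centers[j]
                            for j in range(k)])
    return labels
```

On two well-separated Gaussian blobs the bottom eigenvectors are near-indicator vectors of the classes, so the final k-means step recovers the partition.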
210 Applications/Kernel Spectral Clustering 84/153

Kernel Spectral Clustering

Figure: Leading four eigenvectors of D^{−1/2} K D^{−1/2} for the MNIST data.

84 / 153
211 Applications/Kernel Spectral Clustering 85/153

Methodology and Objectives

Current state:
- Algorithms derived from ad-hoc procedures (e.g., relaxation).
- Little understanding of performance, even for Gaussian mixtures!
- Let alone when both p and n are large (the BigData setting).

Objectives and roadmap:
- Develop a mathematical analysis framework for BigData kernel spectral clustering (p, n → ∞).
- Understand:
  1. phase transition effects (i.e., when is clustering possible?)
  2. the content of each eigenvector
  3. the influence of the kernel function
  4. the performance comparison of clustering algorithms

Methodology:
- Use statistical assumptions (Gaussian mixture).
- Benefit from doubly-infinite independence and random matrix tools.

85 / 153
215 Applications/Kernel Spectral Clustering 86/153

Model and Assumptions

Gaussian mixture model: x_1, ..., x_n ∈ R^p, k classes C_1, ..., C_k,
x_1, ..., x_{n_1} ∈ C_1, ..., x_{n−n_k+1}, ..., x_n ∈ C_k, C_a = {x | x ∼ N(µ_a, C_a)}.
Then, for x_i ∈ C_a, x_i = µ_a + w_i with w_i ∼ N(0, C_a).

Assumption (Convergence Rate). As n, p → ∞,
1. Data scaling: p/n → c_0 ∈ (0, ∞).
2. Class scaling: n_a/n → c_a ∈ (0, 1).
3. Mean scaling: with µ = Σ_{a=1}^k (n_a/n) µ_a and µ°_a = µ_a − µ, then ‖µ°_a‖ = O(1).
4. Covariance scaling: with C = Σ_{a=1}^k (n_a/n) C_a and C°_a = C_a − C, then ‖C_a‖ = O(1), (1/√p) tr C°_a = O(1), tr C°_a C°_b = O(p).

86 / 153
217 Applications/Kernel Spectral Clustering 87/153

Model and Assumptions

Kernel matrix of interest:
K = { f( (1/p) ‖x_i − x_j‖² ) }_{i,j=1}^n
for some sufficiently smooth nonnegative f.

We study the normalized Laplacian:
L = n D^{−1/2} K D^{−1/2}
with d = K 1_n, D = diag(d).

87 / 153
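One structural property of this normalization can be checked exactly: L = nD^{−1/2}KD^{−1/2} always admits the eigenvalue n with eigenvector D^{1/2}1_n, since L D^{1/2}1_n = nD^{−1/2}K1_n = nD^{1/2}1_n. A quick numerical confirmation (our own illustration, with synthetic data and the kernel f(x) = exp(−x/2)):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 40, 30
X = rng.normal(size=(n, p))                      # arbitrary data, for illustration only
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1) / p
K = np.exp(-sq / 2)                              # kernel matrix, f(x) = exp(-x/2)
d = K.sum(axis=1)                                # d = K 1_n
L = n * K / np.sqrt(np.outer(d, d))              # L = n D^{-1/2} K D^{-1/2}
v = np.sqrt(d)                                   # the vector D^{1/2} 1_n
print(np.allclose(L @ v, n * v))                 # (n, D^{1/2} 1_n) is an exact eigenpair
```

Since D^{−1/2}KD^{−1/2} is similar to the row-stochastic matrix D^{−1}K, this eigenvalue n is in fact the largest one, which motivates projecting it away.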
219 Applications/Kernel Spectral Clustering 88/153

Model and Assumptions

Difficulty: L is a very intractable random matrix
- non-linear f
- non-trivial dependence between the entries of L

Strategy:
1. Find a random equivalent ˆL (i.e., ‖L − ˆL‖ → 0 almost surely as n, p → ∞) based on:
   - concentration: K_ij → constant as n, p → ∞ (for all i ≠ j)
   - Taylor expansion around the limit point
2. Apply the spiked random matrix approach to study:
   - the existence of isolated eigenvalues in ˆL: phase transition
   - the eigenvector projections on the canonical class basis

88 / 153
223 Applications/Kernel Spectral Clustering 89/153

Random Matrix Equivalent

Key remark: under our assumptions, uniformly in i ≠ j ∈ {1, ..., n},
(1/p) ‖x_i − x_j‖² → τ almost surely,
for some common limit τ.

→ large dimensional approximation for K:
K = f(τ) 1_n 1_n^T [O(n)] + √n A_1 [low rank, O(√n)] + A_2 [informative terms, O(1)]
→ difficult to handle (3 orders to manipulate!)

Observation: spectrum of L = n D^{−1/2} K D^{−1/2}:
- dominant eigenvalue n with eigenvector D^{1/2} 1_n
- all other eigenvalues of order O(1).

Naturally leads to studying the projected normalized Laplacian (or "modularity"-type Laplacian):
L = n D^{−1/2} K D^{−1/2} − n (D^{1/2} 1_n 1_n^T D^{1/2}) / (1_n^T D 1_n) = n D^{−1/2} ( K − dd^T / (1_n^T d) ) D^{−1/2}.
Dominant (normalized) eigenvector: D^{1/2} 1_n / √(1_n^T D 1_n).

89 / 153
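The key remark is easy to visualize numerically. In the sketch below (our own toy instance: two Gaussian classes with an O(1) mean separation, so that τ = (2/p) tr C = 2), all off-diagonal normalized distances fall in a narrow band around τ:

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 2048, 64
mu = np.zeros(p)
mu[0] = 3.0                                          # O(1) mean separation between classes
X = np.vstack([rng.normal(0, 1, (n // 2, p)) + mu,   # class 1
               rng.normal(0, 1, (n // 2, p)) - mu])  # class 2
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1) / p
off = sq[~np.eye(n, dtype=bool)]                     # off-diagonal entries only
print(round(off.min(), 2), round(off.max(), 2))      # narrow band around tau = 2
```

The class information (a shift of order ‖2µ‖²/p ≈ 0.02 here) is buried inside O(p^{−1/2}) fluctuations, which is exactly why the Taylor expansion around τ must track several orders.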
228 Applications/Kernel Spectral Clustering 90/153

Random Matrix Equivalent

Theorem (Random Matrix Equivalent). As n, p → ∞, in operator norm, ‖L − ˆL‖ → 0 almost surely, where
ˆL = −2 (f′(τ)/f(τ)) [ (1/p) P W^T W P + U B U^T + α(τ) I_n ]
and τ = (2/p) tr C, W = [w_1, ..., w_n] ∈ R^{p×n} (x_i = µ_a + w_i), P = I_n − (1/n) 1_n 1_n^T,
U = [ (1/√p) J, Φ, ψ ] ∈ R^{n×(2k+4)},
B ∈ R^{(2k+4)×(2k+4)} a symmetric matrix whose blocks are built from M^T M, t, T, c and the derivatives of f at τ; in particular its leading k × k block reads
B_11 = M^T M + ( 5f′(τ)/(8f(τ)) − f″(τ)/(2f′(τ)) ) t t^T − (f″(τ)/f′(τ)) T + (p/n) ( f(τ)α(τ)/(2f′(τ)) ) 1_k 1_k^T ∈ R^{k×k}.

Notations:
- J = [j_1, ..., j_k] ∈ R^{n×k}, j_a the canonical vector of class C_a.
- M = [µ°_1, ..., µ°_k], µ°_a = µ_a − Σ_{b=1}^k (n_b/n) µ_b.
- t = [ (1/√p) tr C°_1, ..., (1/√p) tr C°_k ]^T ∈ R^k, C°_a = C_a − Σ_{b=1}^k (n_b/n) C_b.
- T = { (1/p) tr C°_a C°_b }_{a,b=1}^k ∈ R^{k×k}.

90 / 153
233 Applications/Kernel Spectral Clustering 91/153

Random Matrix Equivalent

Some consequences:
- ˆL is a spiked model: U B U^T is seen as a low-rank perturbation of (1/p) P W^T W P.
- If f′(τ) = 0, L is asymptotically deterministic! Only t and T can be discriminated upon.
- If f″(τ) = 0 (e.g., f(x) = x), T is unused.
- If 5f′(τ)/(8f(τ)) = f″(τ)/(2f′(τ)), t is (seemingly) unused.

Further analysis:
- Determine the separability condition for the eigenvalues.
- Evaluate the eigenvalue positions when separable.
- Evaluate the eigenvector projections on the canonical basis j_1, ..., j_k.
- Evaluate the fluctuations of the eigenvectors.

91 / 153
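Which statistics a given kernel can exploit thus reduces to the values of f′ and f″ at τ. A small finite-difference check of the two cases above (the kernels and the value of τ are our own illustrative choices):

```python
import numpy as np

def f_derivs(f, tau, h=1e-4):
    """Central finite differences for f'(tau) and f''(tau)."""
    f1 = (f(tau + h) - f(tau - h)) / (2 * h)
    f2 = (f(tau + h) - 2 * f(tau) + f(tau - h)) / h ** 2
    return f1, f2

tau = 2.0
f1, f2 = f_derivs(lambda x: x, tau)               # linear kernel f(x) = x
print(abs(f2) < 1e-4)                             # f'' = 0: T cannot be exploited
g1, g2 = f_derivs(lambda x: np.exp(-x / 2), tau)  # Gaussian-type kernel
print(abs(g2) > 1e-3)                             # f'' != 0: T enters the equivalent
```

In practice τ is itself estimated from the data (e.g., as the mean off-diagonal normalized distance), so such a check can be run on any candidate kernel before clustering.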
236 Applications/Kernel Spectral Clustering 92/153

Isolated Eigenvalues: Gaussian Inputs

[Histogram: eigenvalues of L versus eigenvalues of ˆL.]

Figure: Eigenvalues of L and ˆL, k = 3, p = 2048, n = 512, c_1 = c_2 = 1/4, c_3 = 1/2, [µ_a]_j = 4δ_{aj}, C_a = (1 + 2(a−1)/√p) I_p, f(x) = exp(−x/2).

92 / 153
237 Applications/Kernel Spectral Clustering 93/153

Theoretical Findings versus MNIST

[Histogram: eigenvalues of L (red) and of ˆL, as if the Gaussian model held (white).]

Figure: Eigenvalues of L (red) and of the equivalent Gaussian model ˆL (white), MNIST data, p = 784, n = ….

93 / 153
239 Applications/Kernel Spectral Clustering 94/153

Theoretical Findings versus MNIST

Figure: Leading four eigenvectors of D^{−1/2} K D^{−1/2} for the MNIST data (red) and theoretical findings (blue).

94 / 153
More informationStrong Convergence of the Empirical Distribution of Eigenvalues of Large Dimensional Random Matrices
Strong Convergence of the Empirical Distribution of Eigenvalues of Large Dimensional Random Matrices by Jack W. Silverstein* Department of Mathematics Box 8205 North Carolina State University Raleigh,
More informationRandom Matrices: Invertibility, Structure, and Applications
Random Matrices: Invertibility, Structure, and Applications Roman Vershynin University of Michigan Colloquium, October 11, 2011 Roman Vershynin (University of Michigan) Random Matrices Colloquium 1 / 37
More informationApplications and fundamental results on random Vandermon
Applications and fundamental results on random Vandermonde matrices May 2008 Some important concepts from classical probability Random variables are functions (i.e. they commute w.r.t. multiplication)
More informationConcentration Inequalities for Random Matrices
Concentration Inequalities for Random Matrices M. Ledoux Institut de Mathématiques de Toulouse, France exponential tail inequalities classical theme in probability and statistics quantify the asymptotic
More informationRobust covariance matrices estimation and applications in signal processing
Robust covariance matrices estimation and applications in signal processing F. Pascal SONDRA/Supelec GDR ISIS Journée Estimation et traitement statistique en grande dimension May 16 th, 2013 FP (SONDRA/Supelec)
More informationHomework 1. Yuan Yao. September 18, 2011
Homework 1 Yuan Yao September 18, 2011 1. Singular Value Decomposition: The goal of this exercise is to refresh your memory about the singular value decomposition and matrix norms. A good reference to
More informationThe Matrix Dyson Equation in random matrix theory
The Matrix Dyson Equation in random matrix theory László Erdős IST, Austria Mathematical Physics seminar University of Bristol, Feb 3, 207 Joint work with O. Ajanki, T. Krüger Partially supported by ERC
More informationResearch Program. Romain COUILLET. 11 février CentraleSupélec Université Paris-Sud 11 1 / 38
Research Program Romain COUILLET CentraleSupélec Université Paris-Sud 11 11 février 2016 1 / 38 Outline Curriculum Vitae Research Project : Learning in Large Dimensions Axis 1 : Robust Estimation in Large
More informationExponential tail inequalities for eigenvalues of random matrices
Exponential tail inequalities for eigenvalues of random matrices M. Ledoux Institut de Mathématiques de Toulouse, France exponential tail inequalities classical theme in probability and statistics quantify
More informationEigenvalue variance bounds for Wigner and covariance random matrices
Eigenvalue variance bounds for Wigner and covariance random matrices S. Dallaporta University of Toulouse, France Abstract. This work is concerned with finite range bounds on the variance of individual
More informationMarkov operators, classical orthogonal polynomial ensembles, and random matrices
Markov operators, classical orthogonal polynomial ensembles, and random matrices M. Ledoux, Institut de Mathématiques de Toulouse, France 5ecm Amsterdam, July 2008 recent study of random matrix and random
More informationSupelec Randomness in Wireless Networks: how to deal with it?
Supelec Randomness in Wireless Networks: how to deal with it? Mérouane Debbah Alcatel-Lucent Chair on Flexible Radio merouane.debbah@supelec.fr The performance dilemma... La théorie, c est quand on sait
More informationMIMO Capacities : Eigenvalue Computation through Representation Theory
MIMO Capacities : Eigenvalue Computation through Representation Theory Jayanta Kumar Pal, Donald Richards SAMSI Multivariate distributions working group Outline 1 Introduction 2 MIMO working model 3 Eigenvalue
More informationDISTRIBUTION OF EIGENVALUES OF REAL SYMMETRIC PALINDROMIC TOEPLITZ MATRICES AND CIRCULANT MATRICES
DISTRIBUTION OF EIGENVALUES OF REAL SYMMETRIC PALINDROMIC TOEPLITZ MATRICES AND CIRCULANT MATRICES ADAM MASSEY, STEVEN J. MILLER, AND JOHN SINSHEIMER Abstract. Consider the ensemble of real symmetric Toeplitz
More informationFluctuations from the Semicircle Law Lecture 4
Fluctuations from the Semicircle Law Lecture 4 Ioana Dumitriu University of Washington Women and Math, IAS 2014 May 23, 2014 Ioana Dumitriu (UW) Fluctuations from the Semicircle Law Lecture 4 May 23, 2014
More informationEXTENDED GLRT DETECTORS OF CORRELATION AND SPHERICITY: THE UNDERSAMPLED REGIME. Xavier Mestre 1, Pascal Vallet 2
EXTENDED GLRT DETECTORS OF CORRELATION AND SPHERICITY: THE UNDERSAMPLED REGIME Xavier Mestre, Pascal Vallet 2 Centre Tecnològic de Telecomunicacions de Catalunya, Castelldefels, Barcelona (Spain) 2 Institut
More information1 Intro to RMT (Gene)
M705 Spring 2013 Summary for Week 2 1 Intro to RMT (Gene) (Also see the Anderson - Guionnet - Zeitouni book, pp.6-11(?) ) We start with two independent families of R.V.s, {Z i,j } 1 i
More informationOn the concentration of eigenvalues of random symmetric matrices
On the concentration of eigenvalues of random symmetric matrices Noga Alon Michael Krivelevich Van H. Vu April 23, 2012 Abstract It is shown that for every 1 s n, the probability that the s-th largest
More informationConcentration inequalities: basics and some new challenges
Concentration inequalities: basics and some new challenges M. Ledoux University of Toulouse, France & Institut Universitaire de France Measure concentration geometric functional analysis, probability theory,
More informationSemicircle law on short scales and delocalization for Wigner random matrices
Semicircle law on short scales and delocalization for Wigner random matrices László Erdős University of Munich Weizmann Institute, December 2007 Joint work with H.T. Yau (Harvard), B. Schlein (Munich)
More informationNORMS ON SPACE OF MATRICES
NORMS ON SPACE OF MATRICES. Operator Norms on Space of linear maps Let A be an n n real matrix and x 0 be a vector in R n. We would like to use the Picard iteration method to solve for the following system
More informationOn corrections of classical multivariate tests for high-dimensional data
On corrections of classical multivariate tests for high-dimensional data Jian-feng Yao with Zhidong Bai, Dandan Jiang, Shurong Zheng Overview Introduction High-dimensional data and new challenge in statistics
More informationAPPENDIX A. Background Mathematics. A.1 Linear Algebra. Vector algebra. Let x denote the n-dimensional column vector with components x 1 x 2.
APPENDIX A Background Mathematics A. Linear Algebra A.. Vector algebra Let x denote the n-dimensional column vector with components 0 x x 2 B C @. A x n Definition 6 (scalar product). The scalar product
More informationRandom regular digraphs: singularity and spectrum
Random regular digraphs: singularity and spectrum Nick Cook, UCLA Probability Seminar, Stanford University November 2, 2015 Universality Circular law Singularity probability Talk outline 1 Universality
More information2 Two-Point Boundary Value Problems
2 Two-Point Boundary Value Problems Another fundamental equation, in addition to the heat eq. and the wave eq., is Poisson s equation: n j=1 2 u x 2 j The unknown is the function u = u(x 1, x 2,..., x
More informationPart II Probability and Measure
Part II Probability and Measure Theorems Based on lectures by J. Miller Notes taken by Dexter Chua Michaelmas 2016 These notes are not endorsed by the lecturers, and I have modified them (often significantly)
More informationData Analysis and Manifold Learning Lecture 9: Diffusion on Manifolds and on Graphs
Data Analysis and Manifold Learning Lecture 9: Diffusion on Manifolds and on Graphs Radu Horaud INRIA Grenoble Rhone-Alpes, France Radu.Horaud@inrialpes.fr http://perception.inrialpes.fr/ Outline of Lecture
More informationA new method to bound rate of convergence
A new method to bound rate of convergence Arup Bose Indian Statistical Institute, Kolkata, abose@isical.ac.in Sourav Chatterjee University of California, Berkeley Empirical Spectral Distribution Distribution
More informationRandom Matrix: From Wigner to Quantum Chaos
Random Matrix: From Wigner to Quantum Chaos Horng-Tzer Yau Harvard University Joint work with P. Bourgade, L. Erdős, B. Schlein and J. Yin 1 Perhaps I am now too courageous when I try to guess the distribution
More informationLectures 6 7 : Marchenko-Pastur Law
Fall 2009 MATH 833 Random Matrices B. Valkó Lectures 6 7 : Marchenko-Pastur Law Notes prepared by: A. Ganguly We will now turn our attention to rectangular matrices. Let X = (X 1, X 2,..., X n ) R p n
More informationarxiv: v1 [math-ph] 19 Oct 2018
COMMENT ON FINITE SIZE EFFECTS IN THE AVERAGED EIGENVALUE DENSITY OF WIGNER RANDOM-SIGN REAL SYMMETRIC MATRICES BY G.S. DHESI AND M. AUSLOOS PETER J. FORRESTER AND ALLAN K. TRINH arxiv:1810.08703v1 [math-ph]
More informationFree Probability, Sample Covariance Matrices and Stochastic Eigen-Inference
Free Probability, Sample Covariance Matrices and Stochastic Eigen-Inference Alan Edelman Department of Mathematics, Computer Science and AI Laboratories. E-mail: edelman@math.mit.edu N. Raj Rao Deparment
More informationINVERSE EIGENVALUE STATISTICS FOR RAYLEIGH AND RICIAN MIMO CHANNELS
INVERSE EIGENVALUE STATISTICS FOR RAYLEIGH AND RICIAN MIMO CHANNELS E. Jorswieck, G. Wunder, V. Jungnickel, T. Haustein Abstract Recently reclaimed importance of the empirical distribution function of
More informationEcon 2148, fall 2017 Gaussian process priors, reproducing kernel Hilbert spaces, and Splines
Econ 2148, fall 2017 Gaussian process priors, reproducing kernel Hilbert spaces, and Splines Maximilian Kasy Department of Economics, Harvard University 1 / 37 Agenda 6 equivalent representations of the
More informationx. Figure 1: Examples of univariate Gaussian pdfs N (x; µ, σ 2 ).
.8.6 µ =, σ = 1 µ = 1, σ = 1 / µ =, σ =.. 3 1 1 3 x Figure 1: Examples of univariate Gaussian pdfs N (x; µ, σ ). The Gaussian distribution Probably the most-important distribution in all of statistics
More informationEE/ACM Applications of Convex Optimization in Signal Processing and Communications Lecture 17
EE/ACM 150 - Applications of Convex Optimization in Signal Processing and Communications Lecture 17 Andre Tkacenko Signal Processing Research Group Jet Propulsion Laboratory May 29, 2012 Andre Tkacenko
More informationMATH 423 Linear Algebra II Lecture 33: Diagonalization of normal operators.
MATH 423 Linear Algebra II Lecture 33: Diagonalization of normal operators. Adjoint operator and adjoint matrix Given a linear operator L on an inner product space V, the adjoint of L is a transformation
More informationBasic Concepts in Matrix Algebra
Basic Concepts in Matrix Algebra An column array of p elements is called a vector of dimension p and is written as x p 1 = x 1 x 2. x p. The transpose of the column vector x p 1 is row vector x = [x 1
More informationAn introduction to G-estimation with sample covariance matrices
An introduction to G-estimation with sample covariance matrices Xavier estre xavier.mestre@cttc.cat Centre Tecnològic de Telecomunicacions de Catalunya (CTTC) "atrices aleatoires: applications aux communications
More informationOn the Spectrum of Random Features Maps of High Dimensional Data
On the Spectrum of Random Features Maps of High Dimensional Data ICML 018, Stockholm, Sweden Zhenyu Liao, Romain Couillet LS, CentraleSupélec, Université Paris-Saclay, France GSTATS IDEX DataScience Chair,
More informationKnowledge Discovery and Data Mining 1 (VO) ( )
Knowledge Discovery and Data Mining 1 (VO) (707.003) Review of Linear Algebra Denis Helic KTI, TU Graz Oct 9, 2014 Denis Helic (KTI, TU Graz) KDDM1 Oct 9, 2014 1 / 74 Big picture: KDDM Probability Theory
More informationRandom Matrix Theory for Neural Networks
Random Matrix Theory for Neural Networks Ph.D. Mid-Term Evaluation Zhenyu Liao Laboratoire des Signaux et Systèmes CentraleSupélec Université Paris-Saclay Salle sd.207, Bâtiment Bouygues Gif-sur-Yvette,
More informationDensity of States for Random Band Matrices in d = 2
Density of States for Random Band Matrices in d = 2 via the supersymmetric approach Mareike Lager Institute for applied mathematics University of Bonn Joint work with Margherita Disertori ZiF Summer School
More informationMATH 205C: STATIONARY PHASE LEMMA
MATH 205C: STATIONARY PHASE LEMMA For ω, consider an integral of the form I(ω) = e iωf(x) u(x) dx, where u Cc (R n ) complex valued, with support in a compact set K, and f C (R n ) real valued. Thus, I(ω)
More informationGaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008
Gaussian processes Chuong B Do (updated by Honglak Lee) November 22, 2008 Many of the classical machine learning algorithms that we talked about during the first half of this course fit the following pattern:
More informationMath 118, Fall 2014 Final Exam
Math 8, Fall 4 Final Exam True or false Please circle your choice; no explanation is necessary True There is a linear transformation T such that T e ) = e and T e ) = e Solution Since T is linear, if T
More informationHigh Dimensional Minimum Risk Portfolio Optimization
High Dimensional Minimum Risk Portfolio Optimization Liusha Yang Department of Electronic and Computer Engineering Hong Kong University of Science and Technology Centrale-Supélec June 26, 2015 Yang (HKUST)
More informationELEMENTS OF PROBABILITY THEORY
ELEMENTS OF PROBABILITY THEORY Elements of Probability Theory A collection of subsets of a set Ω is called a σ algebra if it contains Ω and is closed under the operations of taking complements and countable
More informationDef. The euclidian distance between two points x = (x 1,...,x p ) t and y = (y 1,...,y p ) t in the p-dimensional space R p is defined as
MAHALANOBIS DISTANCE Def. The euclidian distance between two points x = (x 1,...,x p ) t and y = (y 1,...,y p ) t in the p-dimensional space R p is defined as d E (x, y) = (x 1 y 1 ) 2 + +(x p y p ) 2
More informationLINEAR ALGEBRA BOOT CAMP WEEK 4: THE SPECTRAL THEOREM
LINEAR ALGEBRA BOOT CAMP WEEK 4: THE SPECTRAL THEOREM Unless otherwise stated, all vector spaces in this worksheet are finite dimensional and the scalar field F is R or C. Definition 1. A linear operator
More informationCompressibility of Infinite Sequences and its Interplay with Compressed Sensing Recovery
Compressibility of Infinite Sequences and its Interplay with Compressed Sensing Recovery Jorge F. Silva and Eduardo Pavez Department of Electrical Engineering Information and Decision Systems Group Universidad
More informationArchive of past papers, solutions and homeworks for. MATH 224, Linear Algebra 2, Spring 2013, Laurence Barker
Archive of past papers, solutions and homeworks for MATH 224, Linear Algebra 2, Spring 213, Laurence Barker version: 4 June 213 Source file: archfall99.tex page 2: Homeworks. page 3: Quizzes. page 4: Midterm
More informationRandom Matrix Theory Lecture 1 Introduction, Ensembles and Basic Laws. Symeon Chatzinotas February 11, 2013 Luxembourg
Random Matrix Theory Lecture 1 Introduction, Ensembles and Basic Laws Symeon Chatzinotas February 11, 2013 Luxembourg Outline 1. Random Matrix Theory 1. Definition 2. Applications 3. Asymptotics 2. Ensembles
More informationEEL 5544 Noise in Linear Systems Lecture 30. X (s) = E [ e sx] f X (x)e sx dx. Moments can be found from the Laplace transform as
L30-1 EEL 5544 Noise in Linear Systems Lecture 30 OTHER TRANSFORMS For a continuous, nonnegative RV X, the Laplace transform of X is X (s) = E [ e sx] = 0 f X (x)e sx dx. For a nonnegative RV, the Laplace
More informationLecture Notes 1: Vector spaces
Optimization-based data analysis Fall 2017 Lecture Notes 1: Vector spaces In this chapter we review certain basic concepts of linear algebra, highlighting their application to signal processing. 1 Vector
More informationBrownian motion. Samy Tindel. Purdue University. Probability Theory 2 - MA 539
Brownian motion Samy Tindel Purdue University Probability Theory 2 - MA 539 Mostly taken from Brownian Motion and Stochastic Calculus by I. Karatzas and S. Shreve Samy T. Brownian motion Probability Theory
More information1 Math 241A-B Homework Problem List for F2015 and W2016
1 Math 241A-B Homework Problem List for F2015 W2016 1.1 Homework 1. Due Wednesday, October 7, 2015 Notation 1.1 Let U be any set, g be a positive function on U, Y be a normed space. For any f : U Y let
More informationInvertibility of random matrices
University of Michigan February 2011, Princeton University Origins of Random Matrix Theory Statistics (Wishart matrices) PCA of a multivariate Gaussian distribution. [Gaël Varoquaux s blog gael-varoquaux.info]
More informationLecture I: Asymptotics for large GUE random matrices
Lecture I: Asymptotics for large GUE random matrices Steen Thorbjørnsen, University of Aarhus andom Matrices Definition. Let (Ω, F, P) be a probability space and let n be a positive integer. Then a random
More informationApplications of Robust Optimization in Signal Processing: Beamforming and Power Control Fall 2012
Applications of Robust Optimization in Signal Processing: Beamforg and Power Control Fall 2012 Instructor: Farid Alizadeh Scribe: Shunqiao Sun 12/09/2012 1 Overview In this presentation, we study the applications
More information5 Operations on Multiple Random Variables
EE360 Random Signal analysis Chapter 5: Operations on Multiple Random Variables 5 Operations on Multiple Random Variables Expected value of a function of r.v. s Two r.v. s: ḡ = E[g(X, Y )] = g(x, y)f X,Y
More informationCompound matrices and some classical inequalities
Compound matrices and some classical inequalities Tin-Yau Tam Mathematics & Statistics Auburn University Dec. 3, 04 We discuss some elegant proofs of several classical inequalities of matrices by using
More information2. Matrix Algebra and Random Vectors
2. Matrix Algebra and Random Vectors 2.1 Introduction Multivariate data can be conveniently display as array of numbers. In general, a rectangular array of numbers with, for instance, n rows and p columns
More informationOR MSc Maths Revision Course
OR MSc Maths Revision Course Tom Byrne School of Mathematics University of Edinburgh t.m.byrne@sms.ed.ac.uk 15 September 2017 General Information Today JCMB Lecture Theatre A, 09:30-12:30 Mathematics revision
More informationThe Multivariate Normal Distribution. In this case according to our theorem
The Multivariate Normal Distribution Defn: Z R 1 N(0, 1) iff f Z (z) = 1 2π e z2 /2. Defn: Z R p MV N p (0, I) if and only if Z = (Z 1,..., Z p ) T with the Z i independent and each Z i N(0, 1). In this
More informationOperator norm convergence for sequence of matrices and application to QIT
Operator norm convergence for sequence of matrices and application to QIT Benoît Collins University of Ottawa & AIMR, Tohoku University Cambridge, INI, October 15, 2013 Overview Overview Plan: 1. Norm
More informationCS281 Section 4: Factor Analysis and PCA
CS81 Section 4: Factor Analysis and PCA Scott Linderman At this point we have seen a variety of machine learning models, with a particular emphasis on models for supervised learning. In particular, we
More informationFree deconvolution for signal processing applications
IEEE TRANSACTIONS ON INFORMATION THEORY, VOL., NO., JANUARY 27 Free deconvolution for signal processing applications Øyvind Ryan, Member, IEEE, Mérouane Debbah, Member, IEEE arxiv:cs.it/725v 4 Jan 27 Abstract
More informationMinimum variance portfolio optimization in the spiked covariance model
Minimum variance portfolio optimization in the spiked covariance model Liusha Yang, Romain Couillet, Matthew R Mckay To cite this version: Liusha Yang, Romain Couillet, Matthew R Mckay Minimum variance
More informationLecture 4: Analysis of MIMO Systems
Lecture 4: Analysis of MIMO Systems Norms The concept of norm will be extremely useful for evaluating signals and systems quantitatively during this course In the following, we will present vector norms
More informationConvergence of Eigenspaces in Kernel Principal Component Analysis
Convergence of Eigenspaces in Kernel Principal Component Analysis Shixin Wang Advanced machine learning April 19, 2016 Shixin Wang Convergence of Eigenspaces April 19, 2016 1 / 18 Outline 1 Motivation
More informationAsymptotic Statistics-III. Changliang Zou
Asymptotic Statistics-III Changliang Zou The multivariate central limit theorem Theorem (Multivariate CLT for iid case) Let X i be iid random p-vectors with mean µ and and covariance matrix Σ. Then n (
More informationMatrix Theory. A.Holst, V.Ufnarovski
Matrix Theory AHolst, VUfnarovski 55 HINTS AND ANSWERS 9 55 Hints and answers There are two different approaches In the first one write A as a block of rows and note that in B = E ij A all rows different
More information1 Tridiagonal matrices
Lecture Notes: β-ensembles Bálint Virág Notes with Diane Holcomb 1 Tridiagonal matrices Definition 1. Suppose you have a symmetric matrix A, we can define its spectral measure (at the first coordinate
More informationP (A G) dp G P (A G)
First homework assignment. Due at 12:15 on 22 September 2016. Homework 1. We roll two dices. X is the result of one of them and Z the sum of the results. Find E [X Z. Homework 2. Let X be a r.v.. Assume
More informationPart IA. Vectors and Matrices. Year
Part IA Vectors and Matrices Year 2018 2017 2016 2015 2014 2013 2012 2011 2010 2009 2008 2018 Paper 1, Section I 1C Vectors and Matrices For z, w C define the principal value of z w. State de Moivre s
More informationOn the principal components of sample covariance matrices
On the principal components of sample covariance matrices Alex Bloemendal Antti Knowles Horng-Tzer Yau Jun Yin February 4, 205 We introduce a class of M M sample covariance matrices Q which subsumes and
More information8.1 Concentration inequality for Gaussian random matrix (cont d)
MGMT 69: Topics in High-dimensional Data Analysis Falll 26 Lecture 8: Spectral clustering and Laplacian matrices Lecturer: Jiaming Xu Scribe: Hyun-Ju Oh and Taotao He, October 4, 26 Outline Concentration
More information