A Random Matrix Framework for BigData Machine Learning


1 A Random Matrix Framework for BigData Machine Learning (Groupe Deep Learning, DigiCosme) Romain COUILLET, CentraleSupélec, France, June 2017. 1 / 63

2 Outline Basics of Random Matrix Theory Motivation: Large Sample Covariance Matrices Spiked Models Applications Reminder on Spectral Clustering Methods Kernel Spectral Clustering Semi-supervised Learning Random Feature Maps, Extreme Learning Machines, and Neural Networks Perspectives 2 / 63

3 Basics of Random Matrix Theory/ 3/63 Outline Basics of Random Matrix Theory Motivation: Large Sample Covariance Matrices Spiked Models Applications Reminder on Spectral Clustering Methods Kernel Spectral Clustering Semi-supervised Learning Random Feature Maps, Extreme Learning Machines, and Neural Networks Perspectives 3 / 63

4 Basics of Random Matrix Theory/Motivation: Large Sample Covariance Matrices 4/63 Outline Basics of Random Matrix Theory Motivation: Large Sample Covariance Matrices Spiked Models Applications Reminder on Spectral Clustering Methods Kernel Spectral Clustering Semi-supervised Learning Random Feature Maps, Extreme Learning Machines, and Neural Networks Perspectives 4 / 63

5 Basics of Random Matrix Theory/Motivation: Large Sample Covariance Matrices 5/63 Context
Baseline scenario: y_1, ..., y_n ∈ C^p (or R^p) i.i.d. with E[y_1] = 0, E[y_1 y_1^*] = C_p. 5 / 63

6 Basics of Random Matrix Theory/Motivation: Large Sample Covariance Matrices 5/63 Context
Baseline scenario: y_1, ..., y_n ∈ C^p (or R^p) i.i.d. with E[y_1] = 0, E[y_1 y_1^*] = C_p.
If y_1 ∼ N(0, C_p), the ML estimator for C_p is the sample covariance matrix (SCM)
Ĉ_p = (1/n) Y_p Y_p^* = (1/n) Σ_{i=1}^n y_i y_i^*   (Y_p = [y_1, ..., y_n] ∈ C^{p×n}). 5 / 63

7 Basics of Random Matrix Theory/Motivation: Large Sample Covariance Matrices 5/63 Context
Baseline scenario: y_1, ..., y_n ∈ C^p (or R^p) i.i.d. with E[y_1] = 0, E[y_1 y_1^*] = C_p.
If y_1 ∼ N(0, C_p), the ML estimator for C_p is the sample covariance matrix (SCM)
Ĉ_p = (1/n) Y_p Y_p^* = (1/n) Σ_{i=1}^n y_i y_i^*   (Y_p = [y_1, ..., y_n] ∈ C^{p×n}).
If n → ∞, then, by the strong law of large numbers, Ĉ_p → C_p a.s., or equivalently, in spectral norm, ||Ĉ_p − C_p|| → 0 a.s. 5 / 63

8 Basics of Random Matrix Theory/Motivation: Large Sample Covariance Matrices 5/63 Context
Baseline scenario: y_1, ..., y_n ∈ C^p (or R^p) i.i.d. with E[y_1] = 0, E[y_1 y_1^*] = C_p.
If y_1 ∼ N(0, C_p), the ML estimator for C_p is the sample covariance matrix (SCM)
Ĉ_p = (1/n) Y_p Y_p^* = (1/n) Σ_{i=1}^n y_i y_i^*   (Y_p = [y_1, ..., y_n] ∈ C^{p×n}).
If n → ∞, then, by the strong law of large numbers, Ĉ_p → C_p a.s., or equivalently, in spectral norm, ||Ĉ_p − C_p|| → 0 a.s.
Random Matrix Regime: no longer valid if p, n → ∞ with p/n → c ∈ (0, ∞): ||Ĉ_p − C_p|| ↛ 0. 5 / 63

9 Basics of Random Matrix Theory/Motivation: Large Sample Covariance Matrices 5/63 Context
Baseline scenario: y_1, ..., y_n ∈ C^p (or R^p) i.i.d. with E[y_1] = 0, E[y_1 y_1^*] = C_p.
If y_1 ∼ N(0, C_p), the ML estimator for C_p is the sample covariance matrix (SCM)
Ĉ_p = (1/n) Y_p Y_p^* = (1/n) Σ_{i=1}^n y_i y_i^*   (Y_p = [y_1, ..., y_n] ∈ C^{p×n}).
If n → ∞, then, by the strong law of large numbers, Ĉ_p → C_p a.s., or equivalently, in spectral norm, ||Ĉ_p − C_p|| → 0 a.s.
Random Matrix Regime: no longer valid if p, n → ∞ with p/n → c ∈ (0, ∞): ||Ĉ_p − C_p|| ↛ 0.
For practical p, n with p ≃ n, this leads to dramatically wrong conclusions. 5 / 63
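To make the random matrix regime above concrete, here is a minimal numerical sketch (mine, not from the slides), assuming C_p = I_p and a fixed ratio p/n = 1/2: the spectral-norm error of the SCM stabilizes instead of vanishing as p and n grow together.

```python
import numpy as np

rng = np.random.default_rng(0)
c = 0.5  # aspect ratio p/n, kept fixed as p and n grow

for p in [100, 400, 1600]:
    n = int(p / c)
    Y = rng.standard_normal((p, n))             # C_p = I_p, so Y_p = X_p
    C_hat = Y @ Y.T / n                         # sample covariance matrix
    err = np.linalg.norm(C_hat - np.eye(p), 2)  # spectral-norm error
    print(f"p={p:5d}, n={n:5d}, ||C_hat - C_p|| = {err:.2f}")
# The error stays near c + 2*sqrt(c) ~ 1.91 instead of vanishing.
```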

10 Basics of Random Matrix Theory/Motivation: Large Sample Covariance Matrices 6/63 The Marčenko–Pastur law
Figure: Histogram of the eigenvalues of Ĉ_p for p = 500, n = 2000, C_p = I_p, against the Marčenko–Pastur law. 6 / 63

11 Basics of Random Matrix Theory/Motivation: Large Sample Covariance Matrices 7/63 The Marčenko–Pastur law
Definition (Empirical Spectral Density). The empirical spectral density (e.s.d.) µ_p of a Hermitian matrix A_p ∈ C^{p×p} is µ_p = (1/p) Σ_{i=1}^p δ_{λ_i(A_p)}. 7 / 63

12 Basics of Random Matrix Theory/Motivation: Large Sample Covariance Matrices 7/63 The Marčenko–Pastur law
Definition (Empirical Spectral Density). The empirical spectral density (e.s.d.) µ_p of a Hermitian matrix A_p ∈ C^{p×p} is µ_p = (1/p) Σ_{i=1}^p δ_{λ_i(A_p)}.
Theorem (Marčenko–Pastur Law [Marčenko, Pastur '67]). Let X_p ∈ C^{p×n} have i.i.d. zero mean, unit variance entries. As p, n → ∞ with p/n → c ∈ (0, ∞), the e.s.d. µ_p of (1/n) X_p X_p^* satisfies µ_p → µ_c weakly, almost surely, where µ_c({0}) = max{0, 1 − c^{−1}}. 7 / 63

13 Basics of Random Matrix Theory/Motivation: Large Sample Covariance Matrices 7/63 The Marčenko–Pastur law
Definition (Empirical Spectral Density). The empirical spectral density (e.s.d.) µ_p of a Hermitian matrix A_p ∈ C^{p×p} is µ_p = (1/p) Σ_{i=1}^p δ_{λ_i(A_p)}.
Theorem (Marčenko–Pastur Law [Marčenko, Pastur '67]). Let X_p ∈ C^{p×n} have i.i.d. zero mean, unit variance entries. As p, n → ∞ with p/n → c ∈ (0, ∞), the e.s.d. µ_p of (1/n) X_p X_p^* satisfies µ_p → µ_c weakly, almost surely, where µ_c({0}) = max{0, 1 − c^{−1}} and, on (0, ∞), µ_c has a continuous density f_c supported on [(1 − √c)², (1 + √c)²],
f_c(x) = (1/(2πcx)) √((x − (1 − √c)²)((1 + √c)² − x)). 7 / 63
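The theorem can be checked numerically. The sketch below (my illustration, reusing p = 500, n = 2000 from the earlier histogram) overlays the empirical eigenvalues of (1/n) X_p X_p^* with the Marčenko–Pastur density f_c.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
p, n = 500, 2000
c = p / n
X = rng.standard_normal((p, n))
eigs = np.linalg.eigvalsh(X @ X.T / n)

# Marcenko-Pastur density on its support [(1-sqrt(c))^2, (1+sqrt(c))^2]
lm, lp = (1 - np.sqrt(c))**2, (1 + np.sqrt(c))**2
x = np.linspace(lm, lp, 500)
f_c = np.sqrt(np.maximum((x - lm) * (lp - x), 0)) / (2 * np.pi * c * x)

plt.hist(eigs, bins=50, density=True, label="eigenvalues of C_hat")
plt.plot(x, f_c, label="Marcenko-Pastur density")
plt.legend(); plt.show()
```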

14 Basics of Random Matrix Theory/Motivation: Large Sample Covariance Matrices 8/63 The Marčenko–Pastur law
Figure: Marčenko–Pastur density f_c(x) for the limit ratio c = lim p/n = 0.1. 8 / 63

15 Basics of Random Matrix Theory/Motivation: Large Sample Covariance Matrices 8/63 The Marčenko–Pastur law
Figure: Marčenko–Pastur density f_c(x) for the limit ratios c = lim p/n = 0.1 and 0.2. 8 / 63

16 Basics of Random Matrix Theory/Motivation: Large Sample Covariance Matrices 8/63 The Marčenko–Pastur law
Figure: Marčenko–Pastur law for different limit ratios c = lim p/n = 0.1, 0.2, 0.5. 8 / 63

17 Basics of Random Matrix Theory/Spiked Models 9/63 Outline Basics of Random Matrix Theory Motivation: Large Sample Covariance Matrices Spiked Models Applications Reminder on Spectral Clustering Methods Kernel Spectral Clustering Semi-supervised Learning Random Feature Maps, Extreme Learning Machines, and Neural Networks Perspectives 9 / 63

18 Basics of Random Matrix Theory/Spiked Models 10/63 Spiked Models
If we break C_p = I_p: small-rank perturbation C_p = I_p + P, P of low rank.
Figure: Eigenvalues {λ_i}_{i=1}^n of (1/n) Y_p Y_p^* for C_p = diag(1, ..., 1, 2, 2, 3, 3) (the first p − 4 entries equal to 1), p = 500, against the Marčenko–Pastur law µ; isolated eigenvalues appear near 1 + ω_1 + c(1 + ω_1)/ω_1 and 1 + ω_2 + c(1 + ω_2)/ω_2. 10 / 63

19 Basics of Random Matrix Theory/Spiked Models 11/63 Spiked Models
Theorem (Eigenvalues [Baik, Silverstein '06]). Let Y_p = C_p^{1/2} X_p, with X_p having i.i.d. zero mean, unit variance entries, E[|X_p,ij|^4] < ∞, and
C_p = I_p + P, P = U Ω U^*, where, for K fixed, Ω = diag(ω_1, ..., ω_K) ∈ R^{K×K}, with ω_1 ≥ ... ≥ ω_K > 0. 11 / 63

20 Basics of Random Matrix Theory/Spiked Models 11/63 Spiked Models
Theorem (Eigenvalues [Baik, Silverstein '06]). Let Y_p = C_p^{1/2} X_p, with X_p having i.i.d. zero mean, unit variance entries, E[|X_p,ij|^4] < ∞, and
C_p = I_p + P, P = U Ω U^*, where, for K fixed, Ω = diag(ω_1, ..., ω_K) ∈ R^{K×K}, with ω_1 ≥ ... ≥ ω_K > 0.
Then, as p, n → ∞ with p/n → c ∈ (0, ∞), denoting λ_m = λ_m((1/n) Y_p Y_p^*) (with λ_m ≥ λ_{m+1}),
λ_m → 1 + ω_m + c (1 + ω_m)/ω_m > (1 + √c)²  a.s., if ω_m > √c,
λ_m → (1 + √c)²  a.s., if ω_m ∈ (0, √c]. 11 / 63

21 Basics of Random Matrix Theory/Spiked Models 12/63 Spiked Models
Theorem (Eigenvectors [Paul '07]). Let Y_p = C_p^{1/2} X_p, with X_p having i.i.d. zero mean, unit variance entries, E[|X_p,ij|^4] < ∞, and
C_p = I_p + P, P = U Ω U^* = Σ_{i=1}^K ω_i u_i u_i^*, ω_1 > ... > ω_K > 0. 12 / 63

22 Basics of Random Matrix Theory/Spiked Models 12/63 Spiked Models
Theorem (Eigenvectors [Paul '07]). Let Y_p = C_p^{1/2} X_p, with X_p having i.i.d. zero mean, unit variance entries, E[|X_p,ij|^4] < ∞, and
C_p = I_p + P, P = U Ω U^* = Σ_{i=1}^K ω_i u_i u_i^*, ω_1 > ... > ω_K > 0.
Then, as p, n → ∞ with p/n → c ∈ (0, ∞), for deterministic a, b ∈ C^p and û_i the eigenvector associated with λ_i((1/n) Y_p Y_p^*),
a^* û_i û_i^* b − ((1 − c ω_i^{−2})/(1 + c ω_i^{−1})) a^* u_i u_i^* b · 1_{ω_i > √c} → 0  a.s.
In particular, |û_i^* u_i|² → ((1 − c ω_i^{−2})/(1 + c ω_i^{−1})) · 1_{ω_i > √c}  a.s. 12 / 63

23 Basics of Random Matrix Theory/Spiked Models 13/63 Spiked Models
Figure: Simulated versus limiting |û_1^* u_1|² for Y_p = C_p^{1/2} X_p, C_p = I_p + ω_1 u_1 u_1^*, p/n = 1/3, varying the population spike ω_1 (p = 100, p = 200, ...); the limit is (1 − c/ω_1²)/(1 + c/ω_1) beyond the phase transition ω_1 = √c. 13 / 63
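A minimal simulation of the two spiked-model theorems above (my own sketch, with a single spike ω_1 = 2 placed along the first canonical axis and p/n = 1/3) compares the largest empirical eigenvalue and the eigenvector alignment with their predicted limits.

```python
import numpy as np

rng = np.random.default_rng(1)
p, n = 400, 1200                 # c = 1/3
c, omega = p / n, 2.0            # spike above the threshold sqrt(c)

u = np.zeros(p); u[0] = 1.0                                      # population spike direction
C_half = np.eye(p) + (np.sqrt(1 + omega) - 1) * np.outer(u, u)   # C_p^{1/2}, C_p = I + omega u u^T
Y = C_half @ rng.standard_normal((p, n))
vals, vecs = np.linalg.eigh(Y @ Y.T / n)

lam_max, u_hat = vals[-1], vecs[:, -1]
print("largest eigenvalue:", lam_max,
      " predicted:", 1 + omega + c * (1 + omega) / omega)
print("|u_hat' u|^2      :", (u_hat @ u) ** 2,
      " predicted:", (1 - c / omega**2) / (1 + c / omega))
```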

24 Basics of Random Matrix Theory/Spiked Models 14/63 Other Spiked Models
Similar results hold for multiple matrix models:
Y_p = (1/n) X_p X_p^* + P
Y_p = (1/n) X_p^* (I + P) X_p
Y_p = (1/n) (X_p + P)^* (X_p + P)
Y_p = (1/n) T X_p^* (I + P) X_p T
etc. 14 / 63

25 Applications/ 15/63 Outline Basics of Random Matrix Theory Motivation: Large Sample Covariance Matrices Spiked Models Applications Reminder on Spectral Clustering Methods Kernel Spectral Clustering Semi-supervised Learning Random Feature Maps, Extreme Learning Machines, and Neural Networks Perspectives 15 / 63

26 Applications/Reminder on Spectral Clustering Methods 16/63 Outline Basics of Random Matrix Theory Motivation: Large Sample Covariance Matrices Spiked Models Applications Reminder on Spectral Clustering Methods Kernel Spectral Clustering Semi-supervised Learning Random Feature Maps, Extreme Learning Machines, and Neural Networks Perspectives 16 / 63

27 Applications/Reminder on Spectral Clustering Methods 17/63 Reminder on Spectral Clustering Methods
Context: two-step classification of n objects based on a similarity matrix A ∈ R^{n×n}:
1. extraction of the eigenvectors U = [u_1, ..., u_l] associated with the dominant eigenvalues. 17 / 63

28 Applications/Reminder on Spectral Clustering Methods 17/63 Reminder on Spectral Clustering Methods
Context: two-step classification of n objects based on a similarity matrix A ∈ R^{n×n}:
1. extraction of the eigenvectors U = [u_1, ..., u_l] associated with the dominant eigenvalues;
2. classification of the n rows U_{1,·}, ..., U_{n,·} ∈ R^l using k-means/EM. 17 / 63

29 Applications/Reminder on Spectral Clustering Methods 17/63 Reminder on Spectral Clustering Methods
Context: two-step classification of n objects based on a similarity matrix A ∈ R^{n×n}:
1. extraction of the eigenvectors U = [u_1, ..., u_l] associated with the dominant eigenvalues;
2. classification of the n rows U_{1,·}, ..., U_{n,·} ∈ R^l using k-means/EM.
[Eigenvalue plot: bulk near 0, isolated spikes.] 17 / 63

30 Applications/Reminder on Spectral Clustering Methods 17/63 Reminder on Spectral Clustering Methods
Context: two-step classification of n objects based on a similarity matrix A ∈ R^{n×n}:
1. extraction of the eigenvectors U = [u_1, ..., u_l] associated with the dominant eigenvalues;
2. classification of the n rows U_{1,·}, ..., U_{n,·} ∈ R^l using k-means/EM.
[Eigenvalue plot: bulk near 0, isolated spikes.]
[Plot: entries of the dominant eigenvectors Eigenv. 1 and Eigenv. 2 (in practice, shuffled).] 17 / 63

31 Applications/Reminder on Spectral Clustering Methods 18/63 Reminder on Spectral Clustering Methods
[Plot: entries of the dominant eigenvectors Eigenv. 1 and Eigenv. 2.] 18 / 63

32 Applications/Reminder on Spectral Clustering Methods 18/63 Reminder on Spectral Clustering Methods
[Plot: entries of the dominant eigenvectors Eigenv. 1 and Eigenv. 2.]
l-dimensional representation (shuffling no longer matters): [scatter plot of Eigenvector 1 versus Eigenvector 2]. 18 / 63

33 Applications/Reminder on Spectral Clustering Methods 18/63 Reminder on Spectral Clustering Methods
[Plot: entries of the dominant eigenvectors Eigenv. 1 and Eigenv. 2.]
l-dimensional representation (shuffling no longer matters): [scatter plot of Eigenvector 1 versus Eigenvector 2].
EM or k-means clustering. 18 / 63
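As a reference point for the two-step procedure just described, here is a minimal sketch (not from the slides; the use of scikit-learn's KMeans and the function name are my choices) of spectral clustering from a given similarity matrix A.

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_clustering(A, k):
    """Two-step clustering: dominant eigenvectors of the similarity A, then k-means on the rows."""
    _, vecs = np.linalg.eigh(A)        # eigenvalues returned in ascending order
    U = vecs[:, -k:]                   # n x k matrix of dominant eigenvectors
    return KMeans(n_clusters=k, n_init=10).fit_predict(U)
```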

34 Applications/Kernel Spectral Clustering 19/63 Outline Basics of Random Matrix Theory Motivation: Large Sample Covariance Matrices Spiked Models Applications Reminder on Spectral Clustering Methods Kernel Spectral Clustering Semi-supervised Learning Random Feature Maps, Extreme Learning Machines, and Neural Networks Perspectives 19 / 63

35 Applications/Kernel Spectral Clustering 20/63 Kernel Spectral Clustering
Problem Statement. Dataset x_1, ..., x_n ∈ R^p. Objective: cluster the data into k similarity classes C_1, ..., C_k. 20 / 63

36 Applications/Kernel Spectral Clustering 20/63 Kernel Spectral Clustering
Problem Statement. Dataset x_1, ..., x_n ∈ R^p. Objective: cluster the data into k similarity classes C_1, ..., C_k.
Kernel spectral clustering is based on the kernel matrix K = {κ(x_i, x_j)}_{i,j=1}^n. 20 / 63

37 Applications/Kernel Spectral Clustering 20/63 Kernel Spectral Clustering
Problem Statement. Dataset x_1, ..., x_n ∈ R^p. Objective: cluster the data into k similarity classes C_1, ..., C_k.
Kernel spectral clustering is based on the kernel matrix K = {κ(x_i, x_j)}_{i,j=1}^n.
Usually, κ(x, y) = f(x^T y) or κ(x, y) = f(||x − y||²). 20 / 63

38 Applications/Kernel Spectral Clustering 20/63 Kernel Spectral Clustering
Problem Statement. Dataset x_1, ..., x_n ∈ R^p. Objective: cluster the data into k similarity classes C_1, ..., C_k.
Kernel spectral clustering is based on the kernel matrix K = {κ(x_i, x_j)}_{i,j=1}^n.
Usually, κ(x, y) = f(x^T y) or κ(x, y) = f(||x − y||²).
Refinements: instead of K, use D − K, I_n − D^{−1} K, I_n − D^{−1/2} K D^{−1/2}, etc.; multi-step algorithms: Ng–Jordan–Weiss, Shi–Malik, etc. 20 / 63

39 Applications/Kernel Spectral Clustering 20/63 Kernel Spectral Clustering
Problem Statement. Dataset x_1, ..., x_n ∈ R^p. Objective: cluster the data into k similarity classes C_1, ..., C_k.
Kernel spectral clustering is based on the kernel matrix K = {κ(x_i, x_j)}_{i,j=1}^n.
Usually, κ(x, y) = f(x^T y) or κ(x, y) = f(||x − y||²).
Refinements: instead of K, use D − K, I_n − D^{−1} K, I_n − D^{−1/2} K D^{−1/2}, etc.; multi-step algorithms: Ng–Jordan–Weiss, Shi–Malik, etc.
Intuition (from small dimensions): K is essentially low rank, with class structure in its eigenvectors. 20 / 63

40 Applications/Kernel Spectral Clustering 21/63 Kernel Spectral Clustering 21 / 63

41 Applications/Kernel Spectral Clustering 21/63 Kernel Spectral Clustering 21 / 63

42 Applications/Kernel Spectral Clustering 21/63 Kernel Spectral Clustering 21 / 63

43 Applications/Kernel Spectral Clustering 21/63 Kernel Spectral Clustering
Figure: Leading four eigenvectors of D^{−1/2} K D^{−1/2} for MNIST data. 21 / 63

44 Applications/Kernel Spectral Clustering 22/63 Model and Assumptions
Gaussian mixture model: x_1, ..., x_n ∈ R^p, k classes C_1, ..., C_k,
x_1, ..., x_{n_1} ∈ C_1, ..., x_{n−n_k+1}, ..., x_n ∈ C_k, with x_i ∼ N(µ_{g_i}, C_{g_i}). 22 / 63

45 Applications/Kernel Spectral Clustering 22/63 Model and Assumptions
Gaussian mixture model: x_1, ..., x_n ∈ R^p, k classes C_1, ..., C_k,
x_1, ..., x_{n_1} ∈ C_1, ..., x_{n−n_k+1}, ..., x_n ∈ C_k, with x_i ∼ N(µ_{g_i}, C_{g_i}).
Assumption (Convergence Rate). As n, p → ∞,
1. Data scaling: p/n → c_0 ∈ (0, ∞),
2. Class scaling: n_a/n → c_a ∈ (0, 1),
3. Mean scaling: with µ° = Σ_{a=1}^k (n_a/n) µ_a and µ_a° = µ_a − µ°, then ||µ_a°|| = O(1),
4. Covariance scaling: with C° = Σ_{a=1}^k (n_a/n) C_a and C_a° = C_a − C°, then ||C_a|| = O(1), tr C_a° = O(√p), tr C_a° C_b° = O(p). 22 / 63

46 Applications/Kernel Spectral Clustering 22/63 Model and Assumptions
Gaussian mixture model: x_1, ..., x_n ∈ R^p, k classes C_1, ..., C_k,
x_1, ..., x_{n_1} ∈ C_1, ..., x_{n−n_k+1}, ..., x_n ∈ C_k, with x_i ∼ N(µ_{g_i}, C_{g_i}).
Assumption (Convergence Rate). As n, p → ∞,
1. Data scaling: p/n → c_0 ∈ (0, ∞),
2. Class scaling: n_a/n → c_a ∈ (0, 1),
3. Mean scaling: with µ° = Σ_{a=1}^k (n_a/n) µ_a and µ_a° = µ_a − µ°, then ||µ_a°|| = O(1),
4. Covariance scaling: with C° = Σ_{a=1}^k (n_a/n) C_a and C_a° = C_a − C°, then ||C_a|| = O(1), tr C_a° = O(√p), tr C_a° C_b° = O(p).
Remark: for 2 classes, this reads ||µ_1 − µ_2|| = O(1), tr(C_1 − C_2) = O(√p), ||C_i|| = O(1), tr([C_1 − C_2]²) = O(p). 22 / 63

47 Applications/Kernel Spectral Clustering 23/63 Model and Assumptions
Kernel Matrix: kernel matrix of interest
K = { f( (1/p) ||x_i − x_j||² ) }_{i,j=1}^n
for some sufficiently smooth nonnegative f (the case f((1/p) x_i^T x_j) is simpler). 23 / 63

48 Applications/Kernel Spectral Clustering 23/63 Model and Assumptions
Kernel Matrix: kernel matrix of interest
K = { f( (1/p) ||x_i − x_j||² ) }_{i,j=1}^n
for some sufficiently smooth nonnegative f (the case f((1/p) x_i^T x_j) is simpler).
We study the normalized Laplacian:
L = n D^{−1/2} ( K − d d^T / (d^T 1_n) ) D^{−1/2}, with d = K 1_n, D = diag(d). 23 / 63
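A direct transcription of these two definitions into code (my sketch; the default Gaussian kernel f(t) = exp(−t/2) is an assumption borrowed from the later figures) could look as follows.

```python
import numpy as np

def normalized_laplacian(X, f=lambda t: np.exp(-t / 2)):
    """Kernel matrix K_ij = f(||x_i - x_j||^2 / p) and normalized Laplacian
    L = n D^{-1/2} (K - d d^T / (d^T 1_n)) D^{-1/2}, for data X of size p x n."""
    p, n = X.shape
    sq = np.sum(X**2, axis=0)
    dist2 = sq[:, None] + sq[None, :] - 2 * X.T @ X   # ||x_i - x_j||^2
    K = f(dist2 / p)
    d = K.sum(axis=1)                                 # d = K 1_n
    L = n * (K - np.outer(d, d) / d.sum()) / np.sqrt(np.outer(d, d))
    return K, L
```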

49 Applications/Kernel Spectral Clustering 24/63 Random Matrix Equivalent
Key Remark: under our assumptions, uniformly over i, j ∈ {1, ..., n},
(1/p) ||x_i − x_j||² → τ > 0  a.s. 24 / 63

50 Applications/Kernel Spectral Clustering 24/63 Random Matrix Equivalent
Key Remark: under our assumptions, uniformly over i, j ∈ {1, ..., n},
(1/p) ||x_i − x_j||² → τ > 0  a.s.
This allows a Taylor expansion of K:
K = f(τ) 1_n 1_n^T + √n K_1 + K_2,
with ||f(τ) 1_n 1_n^T|| = O(n), √n K_1 low rank of norm O(√n), and K_2 the informative terms, of norm O(1). 24 / 63

51 Applications/Kernel Spectral Clustering 24/63 Random Matrix Equivalent
Key Remark: under our assumptions, uniformly over i, j ∈ {1, ..., n},
(1/p) ||x_i − x_j||² → τ > 0  a.s.
This allows a Taylor expansion of K:
K = f(τ) 1_n 1_n^T + √n K_1 + K_2,
with ||f(τ) 1_n 1_n^T|| = O(n), √n K_1 low rank of norm O(√n), and K_2 the informative terms, of norm O(1).
However, this is not the (small dimension) intuitive behavior. 24 / 63
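The concentration (1/p)||x_i − x_j||² → τ that drives this expansion is easy to visualize. The sketch below (mine, with two Gaussian classes of identity covariance and a mean shift of norm 2, so that τ = 2) prints the range of the normalized pairwise distances as p grows.

```python
import numpy as np

rng = np.random.default_rng(2)
for p in [64, 256, 1024]:
    n = 2 * p
    X = rng.standard_normal((p, n))
    X[0, n // 2:] += 2.0                    # second class: mean shift of norm 2
    sq = np.sum(X**2, axis=0)
    D2 = (sq[:, None] + sq[None, :] - 2 * X.T @ X) / p
    off = D2[~np.eye(n, dtype=bool)]        # off-diagonal normalized distances
    print(f"p={p:5d}:  1/p ||x_i - x_j||^2 in [{off.min():.2f}, {off.max():.2f}]  (tau = 2)")
```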

52 Applications/Kernel Spectral Clustering 25/63 Random Matrix Equivalent
Theorem (Random Matrix Equivalent [Couillet, Benaych 2015]). As n, p → ∞, ||L − L̂|| → 0 almost surely, where
L = n D^{−1/2} ( K − d d^T / (d^T 1_n) ) D^{−1/2}, with K_ij = f( (1/p) ||x_i − x_j||² ),
L̂ = −2 (f′(τ)/f(τ)) [ (1/p) Π W^T W Π + (1/p) J B J^T + ∗ ]
and W = [w_1, ..., w_n] ∈ R^{p×n} (x_i = µ_{g_i} + w_i), Π = I_n − (1/n) 1_n 1_n^T. 25 / 63

53 Applications/Kernel Spectral Clustering 25/63 Random Matrix Equivalent
Theorem (Random Matrix Equivalent [Couillet, Benaych 2015]). As n, p → ∞, ||L − L̂|| → 0 almost surely, where
L = n D^{−1/2} ( K − d d^T / (d^T 1_n) ) D^{−1/2}, with K_ij = f( (1/p) ||x_i − x_j||² ),
L̂ = −2 (f′(τ)/f(τ)) [ (1/p) Π W^T W Π + (1/p) J B J^T + ∗ ]
and W = [w_1, ..., w_n] ∈ R^{p×n} (x_i = µ_{g_i} + w_i), Π = I_n − (1/n) 1_n 1_n^T,
J = [j_1, ..., j_k], j_a^T = (0, ..., 0, 1_{n_a}^T, 0, ..., 0),
B = M^T M + ( 5 f′(τ)/(8 f(τ)) − f″(τ)/(2 f′(τ)) ) t t^T − (f″(τ)/f′(τ)) T + ∗.
Recall M = [µ_1°, ..., µ_k°], t = [ (1/√p) tr C_1°, ..., (1/√p) tr C_k° ]^T, T = { (1/p) tr C_a° C_b° }_{a,b=1}^k. 25 / 63

54 Applications/Kernel Spectral Clustering 26/63 Isolated eigenvalues: Gaussian inputs
Figure: Eigenvalues of L and of L̂, k = 3, p = 2048, n = 512, c_1 = c_2 = 1/4, c_3 = 1/2, [µ_a]_j = 4δ_{aj}, C_a = (1 + 2(a − 1)/√p) I_p, f(x) = exp(−x/2). 26 / 63

55 Applications/Kernel Spectral Clustering 27/63 Theoretical Findings versus MNIST
Figure: Eigenvalues of L (red), MNIST data, p = 784. 27 / 63

56 Applications/Kernel Spectral Clustering 27/63 Theoretical Findings versus MNIST
Figure: Eigenvalues of L (red) and of L̂ as if the data followed the equivalent Gaussian model (white), MNIST data, p = 784. 27 / 63

57 Applications/Kernel Spectral Clustering 28/63 Theoretical Findings versus MNIST
Figure: Leading four eigenvectors of D^{−1/2} K D^{−1/2} for MNIST data (red) and theoretical findings (blue). 28 / 63

58 Applications/Kernel Spectral Clustering 28/63 Theoretical Findings versus MNIST
Figure: Leading four eigenvectors of D^{−1/2} K D^{−1/2} for MNIST data (red) and theoretical findings (blue). 28 / 63

59 Applications/Kernel Spectral Clustering 29/63 Theoretical Findings versus MNIST
Figure: 2D representation of the eigenvectors of L for the MNIST dataset (Eigenvector 2 vs. Eigenvector 1, Eigenvector 3 vs. Eigenvector 2). Theoretical means and 1- and 2-standard deviations in blue. Class 1 in red, Class 2 in black, Class 3 in green. 29 / 63

60 Applications/Kernel Spectral Clustering 30/63 The surprising f′(τ) = 0 case
Figure: Classification error as a function of f′(τ), for p = 512 and p = 1024, against theory; polynomial kernel with f(τ) = 4, f″(τ) = 2, x_i ∼ N(0, C_a), with C_1 = I_p, [C_2]_{i,j} = .4^{|i−j|}. 30 / 63

61 Applications/Semi-supervised Learning 31/63 Outline Basics of Random Matrix Theory Motivation: Large Sample Covariance Matrices Spiked Models Applications Reminder on Spectral Clustering Methods Kernel Spectral Clustering Semi-supervised Learning Random Feature Maps, Extreme Learning Machines, and Neural Networks Perspectives 31 / 63

62 Applications/Semi-supervised Learning 32/63 Problem Statement
Context: similar to clustering: classify x_1, ..., x_n ∈ R^p into k classes, with n_l labelled and n_u unlabelled data. 32 / 63

63 Applications/Semi-supervised Learning 32/63 Problem Statement
Context: similar to clustering: classify x_1, ..., x_n ∈ R^p into k classes, with n_l labelled and n_u unlabelled data.
Problem statement: assign scores F_ia (with d_i = [K 1_n]_i):
F = argmin_{F ∈ R^{n×k}} Σ_{a=1}^k Σ_{i,j} K_ij ( F_ia d_i^{α−1} − F_ja d_j^{α−1} )²
such that F_ia = δ_{x_i ∈ C_a} for all labelled x_i. 32 / 63

64 Applications/Semi-supervised Learning 32/63 Problem Statement
Context: similar to clustering: classify x_1, ..., x_n ∈ R^p into k classes, with n_l labelled and n_u unlabelled data.
Problem statement: assign scores F_ia (with d_i = [K 1_n]_i):
F = argmin_{F ∈ R^{n×k}} Σ_{a=1}^k Σ_{i,j} K_ij ( F_ia d_i^{α−1} − F_ja d_j^{α−1} )²
such that F_ia = δ_{x_i ∈ C_a} for all labelled x_i.
Solution: with F^(u) ∈ R^{n_u×k}, F^(l) ∈ R^{n_l×k} the scores of the unlabelled/labelled data,
F^(u) = ( I_{n_u} − D_(u)^{−α} K_(u,u) D_(u)^{α−1} )^{−1} D_(u)^{−α} K_(u,l) D_(l)^{α−1} F^(l)
where we naturally decompose
K = [ K_(l,l)  K_(l,u) ; K_(u,l)  K_(u,u) ],  D = [ D_(l)  0 ; 0  D_(u) ] = diag(K 1_n). 32 / 63
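A minimal implementation of this closed-form solution (my sketch; the function name and the one-hot encoding of F^(l) are my own choices) is given below.

```python
import numpy as np

def ssl_scores(K, y_labeled, n_l, k, alpha=0.0):
    """Scores F^(u) for the unlabelled points, following the closed form above.
    K: n x n kernel matrix whose first n_l points are labelled;
    y_labeled: labels of those points, in {0, ..., k-1}."""
    d = K.sum(axis=1)                               # d = K 1_n
    F_l = np.eye(k)[y_labeled]                      # F^(l)_ia = 1 if x_i in class a
    K_uu, K_ul = K[n_l:, n_l:], K[n_l:, :n_l]
    D_u, D_l = d[n_l:], d[:n_l]
    A = np.eye(len(D_u)) - (D_u**-alpha)[:, None] * K_uu * (D_u**(alpha - 1))[None, :]
    B = (D_u**-alpha)[:, None] * K_ul * (D_l**(alpha - 1))[None, :]
    return np.linalg.solve(A, B @ F_l)              # F^(u), of size n_u x k
```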

65 Applications/Semi-supervised Learning 33/63 MNIST Data Example
Figure: Vector [F^(u)]_{·,1} (Zeros), for 3-class MNIST data (zeros, ones, twos), n = 192, p = 784, n_l/n = 1/16, Gaussian kernel. 33 / 63

66 Applications/Semi-supervised Learning 33/63 MNIST Data Example
Figure: Vectors [F^(u)]_{·,1} (Zeros) and [F^(u)]_{·,2} (Ones), for 3-class MNIST data (zeros, ones, twos), n = 192, p = 784, n_l/n = 1/16, Gaussian kernel. 33 / 63

67 Applications/Semi-supervised Learning 33/63 MNIST Data Example
Figure: Vectors [F^(u)]_{·,a}, a = 1, 2, 3 (Zeros, Ones, Twos), for 3-class MNIST data (zeros, ones, twos), n = 192, p = 784, n_l/n = 1/16, Gaussian kernel. 33 / 63

68 Applications/Semi-supervised Learning 34/63 MNIST Data Example
Figure: Centered vector [F°^(u)]_{·,1} = [F^(u) − (1/k) F^(u) 1_k 1_k^T]_{·,1} (Zeros), 3-class MNIST data (zeros, ones, twos), α = 0, n = 192, p = 784, n_l/n = 1/16, Gaussian kernel. 34 / 63

69 Applications/Semi-supervised Learning 34/63 MNIST Data Example
Figure: Centered vectors [F°^(u)]_{·,a} = [F^(u) − (1/k) F^(u) 1_k 1_k^T]_{·,a}, a = 1, 2 (Zeros, Ones), 3-class MNIST data (zeros, ones, twos), α = 0, n = 192, p = 784, n_l/n = 1/16, Gaussian kernel. 34 / 63

70 Applications/Semi-supervised Learning 34/63 MNIST Data Example
Figure: Centered vectors [F°^(u)]_{·,a} = [F^(u) − (1/k) F^(u) 1_k 1_k^T]_{·,a}, a = 1, 2, 3 (Zeros, Ones, Twos), 3-class MNIST data (zeros, ones, twos), α = 0, n = 192, p = 784, n_l/n = 1/16, Gaussian kernel. 34 / 63

71 Applications/Semi-supervised Learning 35/63 Main Results
Results: assuming n_l/n → c_l ∈ (0, 1), by the previous Taylor expansion, to first order,
F^(u)_{·,a} = (n_{l,a}/n) [ v + α t_a 1_{n_u}/√n ] + O(n^{−1})
where v = O(1) is a random vector (entry-wise), the term α t_a 1_{n_u}/√n is O(n^{−1/2}), the O(n^{−1}) terms are the informative ones, and t_a = (1/√p) tr C_a°. 35 / 63

72 Applications/Semi-supervised Learning 35/63 Main Results
Results: assuming n_l/n → c_l ∈ (0, 1), by the previous Taylor expansion, to first order,
F^(u)_{·,a} = (n_{l,a}/n) [ v + α t_a 1_{n_u}/√n ] + O(n^{−1})
where v = O(1) is a random vector (entry-wise), the term α t_a 1_{n_u}/√n is O(n^{−1/2}), the O(n^{−1}) terms are the informative ones, and t_a = (1/√p) tr C_a°.
Consequences: 35 / 63

73 Applications/Semi-supervised Learning 35/63 Main Results
Results: assuming n_l/n → c_l ∈ (0, 1), by the previous Taylor expansion, to first order,
F^(u)_{·,a} = (n_{l,a}/n) [ v + α t_a 1_{n_u}/√n ] + O(n^{−1})
where v = O(1) is a random vector (entry-wise), the term α t_a 1_{n_u}/√n is O(n^{−1/2}), the O(n^{−1}) terms are the informative ones, and t_a = (1/√p) tr C_a°.
Consequences:
random non-informative bias v. 35 / 63

74 Applications/Semi-supervised Learning 35/63 Main Results
Results: assuming n_l/n → c_l ∈ (0, 1), by the previous Taylor expansion, to first order,
F^(u)_{·,a} = (n_{l,a}/n) [ v + α t_a 1_{n_u}/√n ] + O(n^{−1})
where v = O(1) is a random vector (entry-wise), the term α t_a 1_{n_u}/√n is O(n^{−1/2}), the O(n^{−1}) terms are the informative ones, and t_a = (1/√p) tr C_a°.
Consequences:
random non-informative bias v;
strong impact of n_{l,a}: F^(u)_{·,a} must be scaled by n_{l,a}. 35 / 63

75 Applications/Semi-supervised Learning 35/63 Main Results
Results: assuming n_l/n → c_l ∈ (0, 1), by the previous Taylor expansion, to first order,
F^(u)_{·,a} = (n_{l,a}/n) [ v + α t_a 1_{n_u}/√n ] + O(n^{−1})
where v = O(1) is a random vector (entry-wise), the term α t_a 1_{n_u}/√n is O(n^{−1/2}), the O(n^{−1}) terms are the informative ones, and t_a = (1/√p) tr C_a°.
Consequences:
random non-informative bias v;
strong impact of n_{l,a}: F^(u)_{·,a} must be scaled by n_{l,a};
additional per-class bias α t_a 1_{n_u}: take α = 0 + β/√p. 35 / 63

76 Applications/Semi-supervised Learning 36/63 Main Results
As a consequence of the remarks above, we take α = β/√p and define F̂^(u)_{i,a} = (np/n_{l,a}) F^(u)_{ia}. 36 / 63

77 Applications/Semi-supervised Learning 36/63 Main Results
As a consequence of the remarks above, we take α = β/√p and define F̂^(u)_{i,a} = (np/n_{l,a}) F^(u)_{ia}.
Theorem. For x_i ∈ C_b unlabelled, F̂_{i,·} − G_b → 0, with G_b ∼ N(m_b, Σ_b), where m_b ∈ R^k and Σ_b ∈ R^{k×k} are given by
(m_b)_a = −2 (f′(τ)/f(τ)) M_ab + (f″(τ)/f(τ)) t_a t_b + 2 (f″(τ)/f(τ)) T_ab − (f′(τ)²/f(τ)²) t_a t_b + β (n/n_l) (f′(τ)/f(τ)) t_a + B_b
(Σ_b)_{a_1 a_2} = 2 (tr C_b²/p) ( f′(τ)²/f(τ)² − f″(τ)/f(τ) )² t_{a_1} t_{a_2} + 4 (f′(τ)²/f(τ)²) ( [M^T C_b M]_{a_1 a_2} + δ_{a_1}^{a_2} p T_{b a_1}/n_{l,a_1} )
with t, T, M as before, X°_a = X_a − Σ_{d=1}^k (n_{l,d}/n_l) X_d, and B_b a bias independent of a. 36 / 63

78 Applications/Semi-supervised Learning 37/63 Main Results
Corollary (Asymptotic Classification Error). For k = 2 classes and a ≠ b,
P( F̂_{i,a} > F̂_{i,b} | x_i ∈ C_b ) − Q( ((m_b)_b − (m_b)_a) / √([1, −1] Σ_b [1, −1]^T) ) → 0. 37 / 63

79 Applications/Semi-supervised Learning 37/63 Main Results
Corollary (Asymptotic Classification Error). For k = 2 classes and a ≠ b,
P( F̂_{i,a} > F̂_{i,b} | x_i ∈ C_b ) − Q( ((m_b)_b − (m_b)_a) / √([1, −1] Σ_b [1, −1]^T) ) → 0.
Some consequences:
non-obvious choices of appropriate kernels;
non-obvious choice of the optimal β (it induces a possibly beneficial bias);
importance of n_l versus n_u. 37 / 63
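For illustration, the corollary can be evaluated with a few lines of code (my sketch; the values of m_b and Σ_b below are hypothetical placeholders, not taken from the theory).

```python
import numpy as np
from scipy.stats import norm

def misclassification_prob(m_b, Sigma_b, a=0, b=1):
    """Asymptotic P(F_hat[i,a] > F_hat[i,b] | x_i in C_b) = Q(.) as in the corollary."""
    diff = m_b[b] - m_b[a]
    e = np.zeros(len(m_b)); e[a], e[b] = 1.0, -1.0
    return norm.sf(diff / np.sqrt(e @ Sigma_b @ e))   # Q(x) = 1 - Phi(x)

# hypothetical mean and covariance, for illustration only
print(misclassification_prob(np.array([0.2, 1.0]),
                             np.array([[1.0, 0.3], [0.3, 1.2]])))
```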

80 Applications/Semi-supervised Learning 38/63 MNIST Data Example
Figure: Probability of correct classification as a function of α (simulations), for 3-class MNIST data (zeros, ones, twos), n = 192, p = 784, n_l/n = 1/16, Gaussian kernel. 38 / 63

81 Applications/Semi-supervised Learning 38/63 MNIST Data Example
Figure: Probability of correct classification as a function of α (simulations versus theory as if Gaussian), for 3-class MNIST data (zeros, ones, twos), n = 192, p = 784, n_l/n = 1/16, Gaussian kernel. 38 / 63

82 Applications/Semi-supervised Learning 39/63 MNIST Data Example
Figure: Probability of correct classification as a function of α (simulations), for 2-class MNIST data (zeros, ones), n = 1568, p = 784, n_l/n = 1/16, Gaussian kernel. 39 / 63

83 Applications/Semi-supervised Learning 39/63 MNIST Data Example
Figure: Probability of correct classification as a function of α (simulations versus theory as if Gaussian), for 2-class MNIST data (zeros, ones), n = 1568, p = 784, n_l/n = 1/16, Gaussian kernel. 39 / 63

84 Applications/Random Feature Maps, Extreme Learning Machines, and Neural Networks 40/63 Outline Basics of Random Matrix Theory Motivation: Large Sample Covariance Matrices Spiked Models Applications Reminder on Spectral Clustering Methods Kernel Spectral Clustering Semi-supervised Learning Random Feature Maps, Extreme Learning Machines, and Neural Networks Perspectives 40 / 63

85 Applications/Random Feature Maps, Extreme Learning Machines, and Neural Networks 41/63 Random Feature Maps and Extreme Learning Machines
Context: Random Feature Map:
(large) input x_1, ..., x_T ∈ R^p;
random W = [w_1, ..., w_n]^T ∈ R^{n×p};
non-linear activation function σ. 41 / 63

86 Applications/Random Feature Maps, Extreme Learning Machines, and Neural Networks 41/63 Random Feature Maps and Extreme Learning Machines
Context: Random Feature Map:
(large) input x_1, ..., x_T ∈ R^p;
random W = [w_1, ..., w_n]^T ∈ R^{n×p};
non-linear activation function σ.
Neural Network Model (extreme learning machine): ridge-regression learning:
(small) output y_1, ..., y_T ∈ R^d;
ridge-regression output β ∈ R^{n×d}. 41 / 63

87 Applications/Random Feature Maps, Extreme Learning Machines, and Neural Networks 42/63 Random Feature Maps and Extreme Learning Machines
Objectives: evaluate the training and testing MSE performance as n, p, T → ∞. 42 / 63

88 Applications/Random Feature Maps, Extreme Learning Machines, and Neural Networks 42/63 Random Feature Maps and Extreme Learning Machines
Objectives: evaluate the training and testing MSE performance as n, p, T → ∞.
Training MSE:
E_train = (1/T) Σ_{i=1}^T ||y_i − β^T σ(W x_i)||² = (1/T) ||Y − β^T Σ||_F²
with Y = [y_1, ..., y_T] ∈ R^{d×T}, Σ = σ(W X) = { σ(w_i^T x_j) }_{1≤i≤n, 1≤j≤T}, and
β = (1/T) Σ ( (1/T) Σ^T Σ + γ I_T )^{−1} Y^T. 42 / 63

89 Applications/Random Feature Maps, Extreme Learning Machines, and Neural Networks 42/63 Random Feature Maps and Extreme Learning Machines
Objectives: evaluate the training and testing MSE performance as n, p, T → ∞.
Training MSE:
E_train = (1/T) Σ_{i=1}^T ||y_i − β^T σ(W x_i)||² = (1/T) ||Y − β^T Σ||_F²
with Y = [y_1, ..., y_T] ∈ R^{d×T}, Σ = σ(W X) = { σ(w_i^T x_j) }_{1≤i≤n, 1≤j≤T}, and
β = (1/T) Σ ( (1/T) Σ^T Σ + γ I_T )^{−1} Y^T.
Testing MSE: on a new pair (X̂, Ŷ) of length T̂,
E_test = (1/T̂) ||Ŷ − β^T Σ̂||_F², where Σ̂ = σ(W X̂). 42 / 63
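A small end-to-end sketch of this extreme learning machine (mine, on synthetic two-class Gaussian data rather than MNIST, with ReLU activation and my own choices of p, n, T, γ) computes E_train and E_test exactly as defined above.

```python
import numpy as np

rng = np.random.default_rng(3)
p, n, T, T_hat, gamma = 64, 256, 512, 512, 1e-2

def make_data(T):
    """Toy 2-class Gaussian data: X of size p x T, labels Y of size 1 x T in {-1, +1}."""
    y = np.sign(rng.standard_normal(T))
    X = rng.standard_normal((p, T))
    X[0, :] += 1.5 * y                       # mean shift on the first coordinate
    return X, y[None, :]

X, Y = make_data(T)
X_hat, Y_hat = make_data(T_hat)

W = rng.standard_normal((n, p))              # random (untrained) first layer
Sig, Sig_hat = np.maximum(W @ X, 0), np.maximum(W @ X_hat, 0)   # sigma(t) = max(t, 0)

# ridge-regression readout: beta = (1/T) Sigma (Sigma^T Sigma / T + gamma I_T)^{-1} Y^T
beta = Sig @ np.linalg.solve(Sig.T @ Sig / T + gamma * np.eye(T), Y.T) / T

E_train = np.sum((Y - beta.T @ Sig) ** 2) / T
E_test = np.sum((Y_hat - beta.T @ Sig_hat) ** 2) / T_hat
print(f"E_train = {E_train:.3f},  E_test = {E_test:.3f}")
```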

90 Applications/Random Feature Maps, Extreme Learning Machines, and Neural Networks 43/63 Technical Aspects
Preliminary observations: link to the resolvent of (1/T) Σ^T Σ:
E_train = (γ²/T) tr Y^T Y Q² = −γ² (∂/∂γ) [ (1/T) tr Y^T Y Q ]
where Q = Q(γ) = ( (1/T) Σ^T Σ + γ I_T )^{−1} is the resolvent, with Σ_ij = σ(w_i^T x_j). 43 / 63

91 Applications/Random Feature Maps, Extreme Learning Machines, and Neural Networks 43/63 Technical Aspects
Preliminary observations: link to the resolvent of (1/T) Σ^T Σ:
E_train = (γ²/T) tr Y^T Y Q² = −γ² (∂/∂γ) [ (1/T) tr Y^T Y Q ]
where Q = Q(γ) = ( (1/T) Σ^T Σ + γ I_T )^{−1} is the resolvent, with Σ_ij = σ(w_i^T x_j).
Central object: the resolvent E[Q]. 43 / 63

92 Applications/Random Feature Maps, Extreme Learning Machines, and Neural Networks 44/63 Main Technical Result
Theorem [Asymptotic Equivalent for E[Q]]. For Lipschitz σ, bounded X, Y, and W = f(Z) (entry-wise) with Z standard Gaussian, we have, for all ε > 0 and some C > 0,
||E[Q] − Q̄|| < C n^{ε − 1/2}
where Q̄ = ( (n/T) Φ/(1 + δ) + γ I_T )^{−1}, Φ = E[ σ(X^T w) σ(w^T X) ] with w = f(z), z ∼ N(0, I_p),
and δ > 0 is the unique positive solution of δ = (1/T) tr Φ Q̄. 44 / 63

93 Applications/Random Feature Maps, Extreme Learning Machines, and Neural Networks 44/63 Main Technical Result
Theorem [Asymptotic Equivalent for E[Q]]. For Lipschitz σ, bounded X, Y, and W = f(Z) (entry-wise) with Z standard Gaussian, we have, for all ε > 0 and some C > 0,
||E[Q] − Q̄|| < C n^{ε − 1/2}
where Q̄ = ( (n/T) Φ/(1 + δ) + γ I_T )^{−1}, Φ = E[ σ(X^T w) σ(w^T X) ] with w = f(z), z ∼ N(0, I_p),
and δ > 0 is the unique positive solution of δ = (1/T) tr Φ Q̄.
Proof arguments:
σ(W X) has independent rows but dependent columns; this breaks the trace lemma argument (i.e., (1/p) w^T X A X^T w ≃ (1/p) tr X A X^T). 44 / 63

94 Applications/Random Feature Maps, Extreme Learning Machines, and Neural Networks 44/63 Main Technical Result
Theorem [Asymptotic Equivalent for E[Q]]. For Lipschitz σ, bounded X, Y, and W = f(Z) (entry-wise) with Z standard Gaussian, we have, for all ε > 0 and some C > 0,
||E[Q] − Q̄|| < C n^{ε − 1/2}
where Q̄ = ( (n/T) Φ/(1 + δ) + γ I_T )^{−1}, Φ = E[ σ(X^T w) σ(w^T X) ] with w = f(z), z ∼ N(0, I_p),
and δ > 0 is the unique positive solution of δ = (1/T) tr Φ Q̄.
Proof arguments:
σ(W X) has independent rows but dependent columns; this breaks the trace lemma argument (i.e., (1/p) w^T X A X^T w ≃ (1/p) tr X A X^T).
Concentration of measure lemma: (1/p) σ(w^T X) A σ(X^T w) ≃ (1/p) tr Φ A. 44 / 63

95 Applications/Random Feature Maps, Extreme Learning Machines, and Neural Networks 45/63 Main Technical Result
Values of Φ(a, b) for w ∼ N(0, I_p), where ∠(a, b) = a^T b / (||a|| ||b||):
σ(t) = max(t, 0):  Φ(a, b) = (1/(2π)) ||a|| ||b|| ( ∠(a, b) acos(−∠(a, b)) + √(1 − ∠(a, b)²) )
σ(t) = |t|:        Φ(a, b) = (2/π) ||a|| ||b|| ( ∠(a, b) asin(∠(a, b)) + √(1 − ∠(a, b)²) )
σ(t) = erf(t):     Φ(a, b) = (2/π) asin( 2 a^T b / √((1 + 2||a||²)(1 + 2||b||²)) )
σ(t) = 1_{t>0}:    Φ(a, b) = 1/2 − (1/(2π)) acos(∠(a, b))
σ(t) = sign(t):    Φ(a, b) = 1 − (2/π) acos(∠(a, b))
σ(t) = cos(t):     Φ(a, b) = exp(−(||a||² + ||b||²)/2) cosh(a^T b). 45 / 63

96 Applications/Random Feature Maps, Extreme Learning Machines, and Neural Networks 45/63 Main Technical Result
Values of Φ(a, b) for w ∼ N(0, I_p), where ∠(a, b) = a^T b / (||a|| ||b||):
σ(t) = max(t, 0):  Φ(a, b) = (1/(2π)) ||a|| ||b|| ( ∠(a, b) acos(−∠(a, b)) + √(1 − ∠(a, b)²) )
σ(t) = |t|:        Φ(a, b) = (2/π) ||a|| ||b|| ( ∠(a, b) asin(∠(a, b)) + √(1 − ∠(a, b)²) )
σ(t) = erf(t):     Φ(a, b) = (2/π) asin( 2 a^T b / √((1 + 2||a||²)(1 + 2||b||²)) )
σ(t) = 1_{t>0}:    Φ(a, b) = 1/2 − (1/(2π)) acos(∠(a, b))
σ(t) = sign(t):    Φ(a, b) = 1 − (2/π) acos(∠(a, b))
σ(t) = cos(t):     Φ(a, b) = exp(−(||a||² + ||b||²)/2) cosh(a^T b).
Value of Φ(a, b) for w_i i.i.d. with E[w_i^k] = m_k (m_1 = 0) and σ(t) = ζ_2 t² + ζ_1 t + ζ_0:
Φ(a, b) = ζ_2² [ m_2² ( 2(a^T b)² + ||a||² ||b||² ) + (m_4 − 3 m_2²) (a²)^T (b²) ] + ζ_1² m_2 a^T b + ζ_2 ζ_1 m_3 [ (a²)^T b + a^T (b²) ] + ζ_2 ζ_0 m_2 [ ||a||² + ||b||² ] + ζ_0²
where (a²) = [a_1², ..., a_p²]^T. 45 / 63
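The first row of the table (σ(t) = max(t, 0)) can be sanity-checked by Monte Carlo; the sketch below (mine) compares the empirical average of σ(w^T a) σ(w^T b) over Gaussian draws of w with the closed form.

```python
import numpy as np

rng = np.random.default_rng(4)
p, N = 50, 200000
a, b = rng.standard_normal(p), rng.standard_normal(p)

# Monte Carlo estimate of Phi(a, b) = E[ relu(w^T a) relu(w^T b) ], w ~ N(0, I_p)
W = rng.standard_normal((N, p))
mc = np.mean(np.maximum(W @ a, 0) * np.maximum(W @ b, 0))

# closed form from the table, with angle(a, b) = a^T b / (||a|| ||b||)
na, nb = np.linalg.norm(a), np.linalg.norm(b)
ang = a @ b / (na * nb)
cf = na * nb / (2 * np.pi) * (ang * np.arccos(-ang) + np.sqrt(1 - ang**2))

print(mc, cf)   # the two values should be close
```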

97 Applications/Random Feature Maps, Extreme Learning Machines, and Neural Networks 46/63 Main Results
Theorem [Asymptotic E_train]. For all ε > 0, almost surely, n^{1/2 − ε} ( E_train − Ē_train ) → 0, where
E_train = (1/T) ||Y^T − Σ^T β||_F² = (γ²/T) tr Y^T Y Q²
Ē_train = (γ²/T) tr Y^T Y Q̄ [ ( (1/n) tr(Ψ Q̄²) / (1 − (1/n) tr((Ψ Q̄)²)) ) Ψ + I_T ] Q̄
with Ψ = (n/T) Φ/(1 + δ). 46 / 63

98 Applications/Random Feature Maps, Extreme Learning Machines, and Neural Networks 47/63 Main Results
Letting X̂ ∈ R^{p×T̂}, Ŷ ∈ R^{d×T̂} satisfy similar properties as (X, Y),
Claim [Asymptotic E_test]. For all ε > 0, almost surely, n^{1/2 − ε} ( E_test − Ē_test ) → 0, where
E_test = (1/T̂) ||Ŷ^T − Σ̂^T β||_F²
Ē_test = (1/T̂) ||Ŷ^T − Ψ_{X X̂}^T Q̄ Y^T||_F² + ( (1/n) tr Y^T Y Q̄ Ψ Q̄ / (1 − (1/n) tr((Ψ Q̄)²)) ) [ (1/T̂) tr Ψ_{X̂ X̂} − (1/T̂) tr (I_T + γ Q̄)(Ψ_{X X̂} Ψ_{X̂ X} Q̄) ]
with Ψ_{AB} = (n/T) Φ_{AB}/(1 + δ), Φ_{AB} = E[ σ(A^T w) σ(w^T B) ]. 47 / 63

99 Applications/Random Feature Maps, Extreme Learning Machines, and Neural Networks 48/63 Simulations on MNIST: Lipschitz σ(·)
Figure: Neural network performance (E_train, E_test and limits Ē_train, Ē_test) as a function of γ, for σ(t) = t; 2-class MNIST data (sevens, nines), n = 512, T = T̂ = 1024, p = 784. 48 / 63

100 Applications/Random Feature Maps, Extreme Learning Machines, and Neural Networks 48/63 Simulations on MNIST: Lipschitz σ(·)
Figure: Neural network performance (E_train, E_test and limits Ē_train, Ē_test) as a function of γ, for σ(t) = t; 2-class MNIST data (sevens, nines), n = 512, T = T̂ = 1024, p = 784. 48 / 63

101 Applications/Random Feature Maps, Extreme Learning Machines, and Neural Networks 48/63 Simulations on MNIST: Lipschitz σ(·)
Figure: Neural network performance (E_train, E_test and limits Ē_train, Ē_test) as a function of γ, for σ(t) = t and σ(t) = erf(t); 2-class MNIST data (sevens, nines), n = 512, T = T̂ = 1024, p = 784. 48 / 63

102 Applications/Random Feature Maps, Extreme Learning Machines, and Neural Networks 48/63 Simulations on MNIST: Lipschitz σ(·)
Figure: Neural network performance (E_train, E_test and limits Ē_train, Ē_test) as a function of γ, for σ(t) = t, σ(t) = |t| and σ(t) = erf(t); 2-class MNIST data (sevens, nines), n = 512, T = T̂ = 1024, p = 784. 48 / 63

103 Applications/Random Feature Maps, Extreme Learning Machines, and Neural Networks 48/63 Simulations on MNIST: Lipschitz σ(·)
Figure: Neural network performance (E_train, E_test and limits Ē_train, Ē_test) as a function of γ, for σ(t) = t, σ(t) = |t|, σ(t) = erf(t) and σ(t) = max(t, 0); 2-class MNIST data (sevens, nines), n = 512, T = T̂ = 1024, p = 784. 48 / 63

104 Applications/Random Feature Maps, Extreme Learning Machines, and Neural Networks 49/63 Simulations on MNIST: non-Lipschitz σ(·)
Figure: Neural network performance (E_train, E_test and limits Ē_train, Ē_test) for σ(·) either discontinuous or non-Lipschitz (σ(t) = t², σ(t) = 1_{t>0}, σ(t) = sign(t)), as a function of γ, for 2-class MNIST data (sevens, nines), n = 512, T = T̂ = 1024, p = 784. 49 / 63

105 Applications/Random Feature Maps, Extreme Learning Machines, and Neural Networks 50/63 Simulations on tuned Gaussian mixture
Gaussian mixture classification: X = [X_1, X_2], with columns {X_1}_{·,i} ∼ N(0, C_1), {X_2}_{·,i} ∼ N(0, C_2), tr C_1 = tr C_2. 50 / 63

106 Applications/Random Feature Maps, Extreme Learning Machines, and Neural Networks 50/63 Simulations on tuned Gaussian mixture
Gaussian mixture classification: X = [X_1, X_2], with columns {X_1}_{·,i} ∼ N(0, C_1), {X_2}_{·,i} ∼ N(0, C_2), tr C_1 = tr C_2.
We can prove that, for σ(t) = ζ_2 t² + ζ_1 t + ζ_0 and E[W_ij^k] = m_k, classification is only possible if m_4 ≠ m_2². 50 / 63

107 Applications/Random Feature Maps, Extreme Learning Machines, and Neural Networks 50/63 Simulations on tuned Gaussian mixture
Gaussian mixture classification: X = [X_1, X_2], with columns {X_1}_{·,i} ∼ N(0, C_1), {X_2}_{·,i} ∼ N(0, C_2), tr C_1 = tr C_2.
We can prove that, for σ(t) = ζ_2 t² + ζ_1 t + ζ_0 and E[W_ij^k] = m_k, classification is only possible if m_4 ≠ m_2².
[Figure: MSE (E_train, E_test and limits Ē_train, Ē_test) as a function of γ, for W_ij ∼ Bern, W_ij ∼ N(0, 1), W_ij ∼ Stud.] 50 / 63

108 Applications/Random Feature Maps, Extreme Learning Machines, and Neural Networks 51/63 Simulations on tuned Gaussian mixture
Interpretation in the eigenstructure of Φ: no information is carried in the dominant eigenmodes if m_4 = m_2².
[Figure: spectrum of Φ: close spike for W_ij ∼ N(0, 1), no spike for W_ij ∼ Bern, far spike for W_ij ∼ Stud.] 51 / 63

109 Perspectives/ 52/63 Outline Basics of Random Matrix Theory Motivation: Large Sample Covariance Matrices Spiked Models Applications Reminder on Spectral Clustering Methods Kernel Spectral Clustering Semi-supervised Learning Random Feature Maps, Extreme Learning Machines, and Neural Networks Perspectives 52 / 63

110 Perspectives/ 53/63 Summary of Results and Perspectives I Random Neural Networks. Extreme learning machines (one-layer random NN) Linear echo-state networks (ESN) Logistic regression and classification error in extreme learning machines (ELM) Further random feature maps characterization Generalized random NN (multiple layers, multiple activations) Random convolutional networks for image processing Non-linear ESN Deep Neural Networks (DNN). Backpropagation in NN (σ(w X) for random X, backprop. on W ) Statistical physics-inspired approaches (spin-glass models, Hamiltonian-based models) Non-linear ESN DNN performance of physics-realistic models (4th-order Hamiltonian, locality) 53 / 63

111 Perspectives/ 54/63 Summary of Results and Perspectives II References. H. W. Lin, M. Tegmark, Why does deep and cheap learning work so well?, arXiv: v2. C. Williams, Computation with infinite neural networks, Neural Computation, 10(5). Herbert Jaeger, Short term memory in echo state networks, GMD-Forschungszentrum Informationstechnik. Guang-Bin Huang, Qin-Yu Zhu, Chee-Kheong Siew, Extreme learning machine: theory and applications, Neurocomputing, 70(1):489–501. N. El Karoui, Concentration of measure and spectra of random matrices: applications to correlation matrices, elliptical distributions and beyond, The Annals of Applied Probability, 19(6). C. Louart, Z. Liao, R. Couillet, A Random Matrix Approach to Neural Networks, (submitted to) Annals of Applied Probability. R. Couillet, G. Wainrib, H. Sevi, H. Tiomoko Ali, The asymptotic performance of linear echo state neural networks, Journal of Machine Learning Research, vol. 17, no. 178, pp. 1-35. Choromanska, Anna, et al., The Loss Surfaces of Multilayer Networks, AISTATS. 54 / 63

112 Perspectives/ 55/63 Summary of Results and Perspectives III Rahimi, Ali, and Benjamin Recht. Random Features for Large-Scale Kernel Machines. NIPS. Vol. 3. No / 63

113 Perspectives/ 56/63 Summary of Results and Perspectives I Kernel methods. Spectral clustering Subspace spectral clustering (f (τ) = 0) Spectral clustering with outer product kernel f(x T y) Semi-supervised learning, kernel approaches. Least square support vector machines (LS-SVM). Support vector machines (SVM). Kernel matrices based on Kendall τ, Spearman ρ. Applications. Massive MIMO user subspace clustering (patent proposed) Kernel correlation matrices for biostats, heterogeneous datasets. Kernel PCA. Kendall τ in biostats. References. N. El Karoui, The spectrum of kernel random matrices, The Annals of Statistics, 38(1), 1-50, / 63

114 Perspectives/ 57/63 Summary of Results and Perspectives II R. Couillet, F. Benaych-Georges, Kernel Spectral Clustering of Large Dimensional Data, Electronic Journal of Statistics, vol. 10, no. 1, pp , R. Couillet, A. Kammoun, Random Matrix Improved Subspace Clustering, Asilomar Conference on Signals, Systems, and Computers, Pacific Grove, CA, USA, Z. Liao, R. Couillet, A Large Dimensional Analysis of Least Squares Support Vector Machines, (submitted to) Journal of Machine Learning Research, X. Mai, R. Couillet, The counterintuitive mechanism of graph-based semi-supervised learning in the big data regime, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 17), New Orleans, USA, / 63

115 Perspectives/ 58/63 Summary of Results and Perspectives I Community detection. Heterogeneous dense network clustering. Semi-supervised clustering. Sparse network extensions. Beyond community detection (hub detection). Applications. Improved methods for community detection. Applications to distributed optimization (network diffusion, graph signal processing). References. H. Tiomoko Ali, R. Couillet, Spectral community detection in heterogeneous large networks, (submitted to) Journal of Multivariate Analysis, F. Krzakala, C. Moore, E. Mossel, J. Neeman, A. Sly, L. Zdeborová, P. Zhang, Spectral redemption in clustering sparse networks. Proceedings of the National Academy of Sciences, 110(52), , C. Bordenave, M. Lelarge, L. Massoulié, Non-backtracking spectrum of random graphs: community detection and non-regular Ramanujan graphs, Foundations of Computer Science (FOCS), 2015 IEEE 56th Annual Symposium on, pp , 2015 A. Saade, F. Krzakala, L. Zdeborová, Spectral clustering of graphs with the Bethe Hessian, In Advances in Neural Information Processing Systems, pp , / 63

116 Perspectives/ 59/63 Summary of Results and Perspectives I Robust statistics. Tyler, Maronna (and regularized) estimators Elliptical data setting, deterministic outlier setting Central limit theorem extensions Joint mean and covariance robust estimation Robust regression (preliminary works exist already using strikingly different approaches) Applications. Statistical finance (portfolio estimation) Localisation in array processing (robust GMUSIC) Detectors in space time array processing Correlation matrices in biostatistics, human science datasets, etc. References. R. Couillet, F. Pascal, J. W. Silverstein, Robust Estimates of Covariance Matrices in the Large Dimensional Regime, IEEE Transactions on Information Theory, vol. 60, no. 11, pp , / 63

117 Perspectives/ 60/63 Summary of Results and Perspectives II R. Couillet, F. Pascal, J. W. Silverstein, The Random Matrix Regime of Maronna s M-estimator with elliptically distributed samples, Elsevier Journal of Multivariate Analysis, vol. 139, pp , T. Zhang, X. Cheng, A. Singer, Marchenko-Pastur Law for Tyler s and Maronna s M-estimators, arxiv: , R. Couillet, M. McKay, Large Dimensional Analysis and Optimization of Robust Shrinkage Covariance Matrix Estimators, Elsevier Journal of Multivariate Analysis, vol. 131, pp , D. Morales-Jimenez, R. Couillet, M. McKay, Large Dimensional Analysis of Robust M-Estimators of Covariance with Outliers, IEEE Transactions on Signal Processing, vol. 63, no. 21, pp , L. Yang, R. Couillet, M. McKay, A Robust Statistics Approach to Minimum Variance Portfolio Optimization, IEEE Transactions on Signal Processing, vol. 63, no. 24, pp , R. Couillet, Robust spiked random matrices and a robust G-MUSIC estimator, Elsevier Journal of Multivariate Analysis, vol. 140, pp , A. Kammoun, R. Couillet, F. Pascal, M.-S. Alouini, Optimal Design of the Adaptive Normalized Matched Filter Detector, (submitted to) IEEE Transactions on Information Theory, 2016, arxiv Preprint / 63

118 Perspectives/ 61/63 Summary of Results and Perspectives III R. Couillet, A. Kammoun, F. Pascal, Second order statistics of robust estimators of scatter. Application to GLRT detection for elliptical signals, Elsevier Journal of Multivariate Analysis, vol. 143, pp , D. Donoho, A. Montanari, High dimensional robust m-estimation: Asymptotic variance via approximate message passing, Probability Theory and Related Fields, 1-35, N. El Karoui, Asymptotic behavior of unregularized and ridge-regularized high-dimensional robust regression estimators: rigorous results. arxiv preprint arxiv: , / 63

119 Perspectives/ 62/63 Summary of Results and Perspectives I Other works and ideas. Spike random matrix sparse PCA Non-linear shrinkage methods Sparse kernel PCA Random signal processing on graph methods. Random matrix analysis of diffusion networks performance. Applications. Spike factor models in portfolio optimization Non-linear shrinkage in portfolio optimization, biostats References. R. Couillet, M. McKay, Optimal block-sparse PCA for high dimensional correlated samples, (submitted to) Journal of Multivariate Analysis, J. Bun, J. P. Bouchaud, M. Potters, On the overlaps between eigenvectors of correlated random matrices, arxiv preprint arxiv: (2016). Ledoit, O. and Wolf, M., Nonlinear shrinkage estimation of large-dimensional covariance matrices, / 63

120 Perspectives/ 63/63 The End Thank you. 63 / 63


EE613 Machine Learning for Engineers. Kernel methods Support Vector Machines. jean-marc odobez 2015 EE613 Machine Learning for Engineers Kernel methods Support Vector Machines jean-marc odobez 2015 overview Kernel methods introductions and main elements defining kernels Kernelization of k-nn, K-Means,

More information

A note on a Marčenko-Pastur type theorem for time series. Jianfeng. Yao

A note on a Marčenko-Pastur type theorem for time series. Jianfeng. Yao A note on a Marčenko-Pastur type theorem for time series Jianfeng Yao Workshop on High-dimensional statistics The University of Hong Kong, October 2011 Overview 1 High-dimensional data and the sample covariance

More information

Learning the Semantic Correlation: An Alternative Way to Gain from Unlabeled Text

Learning the Semantic Correlation: An Alternative Way to Gain from Unlabeled Text Learning the Semantic Correlation: An Alternative Way to Gain from Unlabeled Text Yi Zhang Machine Learning Department Carnegie Mellon University yizhang1@cs.cmu.edu Jeff Schneider The Robotics Institute

More information

Community Detection. fundamental limits & efficient algorithms. Laurent Massoulié, Inria

Community Detection. fundamental limits & efficient algorithms. Laurent Massoulié, Inria Community Detection fundamental limits & efficient algorithms Laurent Massoulié, Inria Community Detection From graph of node-to-node interactions, identify groups of similar nodes Example: Graph of US

More information

Bits of Machine Learning Part 1: Supervised Learning

Bits of Machine Learning Part 1: Supervised Learning Bits of Machine Learning Part 1: Supervised Learning Alexandre Proutiere and Vahan Petrosyan KTH (The Royal Institute of Technology) Outline of the Course 1. Supervised Learning Regression and Classification

More information

Spectral Partitiong in a Stochastic Block Model

Spectral Partitiong in a Stochastic Block Model Spectral Graph Theory Lecture 21 Spectral Partitiong in a Stochastic Block Model Daniel A. Spielman November 16, 2015 Disclaimer These notes are not necessarily an accurate representation of what happened

More information

Kernel Methods. Barnabás Póczos

Kernel Methods. Barnabás Póczos Kernel Methods Barnabás Póczos Outline Quick Introduction Feature space Perceptron in the feature space Kernels Mercer s theorem Finite domain Arbitrary domain Kernel families Constructing new kernels

More information

Learning Bound for Parameter Transfer Learning

Learning Bound for Parameter Transfer Learning Learning Bound for Parameter Transfer Learning Wataru Kumagai Faculty of Engineering Kanagawa University kumagai@kanagawa-u.ac.jp Abstract We consider a transfer-learning problem by using the parameter

More information

The circular law. Lewis Memorial Lecture / DIMACS minicourse March 19, Terence Tao (UCLA)

The circular law. Lewis Memorial Lecture / DIMACS minicourse March 19, Terence Tao (UCLA) The circular law Lewis Memorial Lecture / DIMACS minicourse March 19, 2008 Terence Tao (UCLA) 1 Eigenvalue distributions Let M = (a ij ) 1 i n;1 j n be a square matrix. Then one has n (generalised) eigenvalues

More information

arxiv: v1 [stat.ml] 3 Mar 2019

arxiv: v1 [stat.ml] 3 Mar 2019 Classification via local manifold approximation Didong Li 1 and David B Dunson 1,2 Department of Mathematics 1 and Statistical Science 2, Duke University arxiv:1903.00985v1 [stat.ml] 3 Mar 2019 Classifiers

More information

Reminders. Thought questions should be submitted on eclass. Please list the section related to the thought question

Reminders. Thought questions should be submitted on eclass. Please list the section related to the thought question Linear regression Reminders Thought questions should be submitted on eclass Please list the section related to the thought question If it is a more general, open-ended question not exactly related to a

More information

A PROBABILISTIC INTERPRETATION OF SAMPLING THEORY OF GRAPH SIGNALS. Akshay Gadde and Antonio Ortega

A PROBABILISTIC INTERPRETATION OF SAMPLING THEORY OF GRAPH SIGNALS. Akshay Gadde and Antonio Ortega A PROBABILISTIC INTERPRETATION OF SAMPLING THEORY OF GRAPH SIGNALS Akshay Gadde and Antonio Ortega Department of Electrical Engineering University of Southern California, Los Angeles Email: agadde@usc.edu,

More information

Minimum variance portfolio optimization in the spiked covariance model

Minimum variance portfolio optimization in the spiked covariance model Minimum variance portfolio optimization in the spiked covariance model Liusha Yang, Romain Couillet, Matthew R Mckay To cite this version: Liusha Yang, Romain Couillet, Matthew R Mckay Minimum variance

More information

what can deep learning learn from linear regression? Benjamin Recht University of California, Berkeley

what can deep learning learn from linear regression? Benjamin Recht University of California, Berkeley what can deep learning learn from linear regression? Benjamin Recht University of California, Berkeley Collaborators Joint work with Samy Bengio, Moritz Hardt, Michael Jordan, Jason Lee, Max Simchowitz,

More information

CS 195-5: Machine Learning Problem Set 1

CS 195-5: Machine Learning Problem Set 1 CS 95-5: Machine Learning Problem Set Douglas Lanman dlanman@brown.edu 7 September Regression Problem Show that the prediction errors y f(x; ŵ) are necessarily uncorrelated with any linear function of

More information

DISCUSSION OF INFLUENTIAL FEATURE PCA FOR HIGH DIMENSIONAL CLUSTERING. By T. Tony Cai and Linjun Zhang University of Pennsylvania

DISCUSSION OF INFLUENTIAL FEATURE PCA FOR HIGH DIMENSIONAL CLUSTERING. By T. Tony Cai and Linjun Zhang University of Pennsylvania Submitted to the Annals of Statistics DISCUSSION OF INFLUENTIAL FEATURE PCA FOR HIGH DIMENSIONAL CLUSTERING By T. Tony Cai and Linjun Zhang University of Pennsylvania We would like to congratulate the

More information

Finding normalized and modularity cuts by spectral clustering. Ljubjana 2010, October

Finding normalized and modularity cuts by spectral clustering. Ljubjana 2010, October Finding normalized and modularity cuts by spectral clustering Marianna Bolla Institute of Mathematics Budapest University of Technology and Economics marib@math.bme.hu Ljubjana 2010, October Outline Find

More information

Multi-View Regression via Canonincal Correlation Analysis

Multi-View Regression via Canonincal Correlation Analysis Multi-View Regression via Canonincal Correlation Analysis Dean P. Foster U. Pennsylvania (Sham M. Kakade of TTI) Model and background p ( i n) y i = X ij β j + ɛ i ɛ i iid N(0, σ 2 ) i=j Data mining and

More information

TUM 2016 Class 3 Large scale learning by regularization

TUM 2016 Class 3 Large scale learning by regularization TUM 2016 Class 3 Large scale learning by regularization Lorenzo Rosasco UNIGE-MIT-IIT July 25, 2016 Learning problem Solve min w E(w), E(w) = dρ(x, y)l(w x, y) given (x 1, y 1 ),..., (x n, y n ) Beyond

More information

ECE 521. Lecture 11 (not on midterm material) 13 February K-means clustering, Dimensionality reduction

ECE 521. Lecture 11 (not on midterm material) 13 February K-means clustering, Dimensionality reduction ECE 521 Lecture 11 (not on midterm material) 13 February 2017 K-means clustering, Dimensionality reduction With thanks to Ruslan Salakhutdinov for an earlier version of the slides Overview K-means clustering

More information

9.2 Support Vector Machines 159

9.2 Support Vector Machines 159 9.2 Support Vector Machines 159 9.2.3 Kernel Methods We have all the tools together now to make an exciting step. Let us summarize our findings. We are interested in regularized estimation problems of

More information

Learning from Labeled and Unlabeled Data: Semi-supervised Learning and Ranking p. 1/31

Learning from Labeled and Unlabeled Data: Semi-supervised Learning and Ranking p. 1/31 Learning from Labeled and Unlabeled Data: Semi-supervised Learning and Ranking Dengyong Zhou zhou@tuebingen.mpg.de Dept. Schölkopf, Max Planck Institute for Biological Cybernetics, Germany Learning from

More information

Cheng Soon Ong & Christian Walder. Canberra February June 2018

Cheng Soon Ong & Christian Walder. Canberra February June 2018 Cheng Soon Ong & Christian Walder Research Group and College of Engineering and Computer Science Canberra February June 2018 Outlines Overview Introduction Linear Algebra Probability Linear Regression

More information

Tutorial on Machine Learning for Advanced Electronics

Tutorial on Machine Learning for Advanced Electronics Tutorial on Machine Learning for Advanced Electronics Maxim Raginsky March 2017 Part I (Some) Theory and Principles Machine Learning: estimation of dependencies from empirical data (V. Vapnik) enabling

More information

Unsupervised dimensionality reduction

Unsupervised dimensionality reduction Unsupervised dimensionality reduction Guillaume Obozinski Ecole des Ponts - ParisTech SOCN course 2014 Guillaume Obozinski Unsupervised dimensionality reduction 1/30 Outline 1 PCA 2 Kernel PCA 3 Multidimensional

More information

8.1 Concentration inequality for Gaussian random matrix (cont d)

8.1 Concentration inequality for Gaussian random matrix (cont d) MGMT 69: Topics in High-dimensional Data Analysis Falll 26 Lecture 8: Spectral clustering and Laplacian matrices Lecturer: Jiaming Xu Scribe: Hyun-Ju Oh and Taotao He, October 4, 26 Outline Concentration

More information

Estimation of large dimensional sparse covariance matrices

Estimation of large dimensional sparse covariance matrices Estimation of large dimensional sparse covariance matrices Department of Statistics UC, Berkeley May 5, 2009 Sample covariance matrix and its eigenvalues Data: n p matrix X n (independent identically distributed)

More information

Information-theoretically Optimal Sparse PCA

Information-theoretically Optimal Sparse PCA Information-theoretically Optimal Sparse PCA Yash Deshpande Department of Electrical Engineering Stanford, CA. Andrea Montanari Departments of Electrical Engineering and Statistics Stanford, CA. Abstract

More information

Tractable Upper Bounds on the Restricted Isometry Constant

Tractable Upper Bounds on the Restricted Isometry Constant Tractable Upper Bounds on the Restricted Isometry Constant Alex d Aspremont, Francis Bach, Laurent El Ghaoui Princeton University, École Normale Supérieure, U.C. Berkeley. Support from NSF, DHS and Google.

More information

Classification. Sandro Cumani. Politecnico di Torino

Classification. Sandro Cumani. Politecnico di Torino Politecnico di Torino Outline Generative model: Gaussian classifier (Linear) discriminative model: logistic regression (Non linear) discriminative model: neural networks Gaussian Classifier We want to

More information

The Laplacian PDF Distance: A Cost Function for Clustering in a Kernel Feature Space

The Laplacian PDF Distance: A Cost Function for Clustering in a Kernel Feature Space The Laplacian PDF Distance: A Cost Function for Clustering in a Kernel Feature Space Robert Jenssen, Deniz Erdogmus 2, Jose Principe 2, Torbjørn Eltoft Department of Physics, University of Tromsø, Norway

More information

Data Spectroscopy: Learning Mixture Models using Eigenspaces of Convolution Operators

Data Spectroscopy: Learning Mixture Models using Eigenspaces of Convolution Operators Data Spectroscopy: Learning Mixture Models using Eigenspaces of Convolution Operators Tao Shi Department of Statistics, Ohio State University Mikhail Belkin Department of Computer Science and Engineering,

More information

Future Random Matrix Tools for Large Dimensional Signal Processing

Future Random Matrix Tools for Large Dimensional Signal Processing /42 Future Random Matrix Tools for Large Dimensional Signal Processing EUSIPCO 204, Lisbon, Portugal. Abla KAMMOUN and Romain COUILLET 2 King s Abdullah University of Technology and Science, Saudi Arabia

More information

Short Term Memory Quantifications in Input-Driven Linear Dynamical Systems

Short Term Memory Quantifications in Input-Driven Linear Dynamical Systems Short Term Memory Quantifications in Input-Driven Linear Dynamical Systems Peter Tiňo and Ali Rodan School of Computer Science, The University of Birmingham Birmingham B15 2TT, United Kingdom E-mail: {P.Tino,

More information

On corrections of classical multivariate tests for high-dimensional data. Jian-feng. Yao Université de Rennes 1, IRMAR

On corrections of classical multivariate tests for high-dimensional data. Jian-feng. Yao Université de Rennes 1, IRMAR Introduction a two sample problem Marčenko-Pastur distributions and one-sample problems Random Fisher matrices and two-sample problems Testing cova On corrections of classical multivariate tests for high-dimensional

More information

Gaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008

Gaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008 Gaussian processes Chuong B Do (updated by Honglak Lee) November 22, 2008 Many of the classical machine learning algorithms that we talked about during the first half of this course fit the following pattern:

More information

Semi-Supervised Learning in Gigantic Image Collections. Rob Fergus (New York University) Yair Weiss (Hebrew University) Antonio Torralba (MIT)

Semi-Supervised Learning in Gigantic Image Collections. Rob Fergus (New York University) Yair Weiss (Hebrew University) Antonio Torralba (MIT) Semi-Supervised Learning in Gigantic Image Collections Rob Fergus (New York University) Yair Weiss (Hebrew University) Antonio Torralba (MIT) Gigantic Image Collections What does the world look like? High

More information

Random Matrix Theory Lecture 1 Introduction, Ensembles and Basic Laws. Symeon Chatzinotas February 11, 2013 Luxembourg

Random Matrix Theory Lecture 1 Introduction, Ensembles and Basic Laws. Symeon Chatzinotas February 11, 2013 Luxembourg Random Matrix Theory Lecture 1 Introduction, Ensembles and Basic Laws Symeon Chatzinotas February 11, 2013 Luxembourg Outline 1. Random Matrix Theory 1. Definition 2. Applications 3. Asymptotics 2. Ensembles

More information

c 4, < y 2, 1 0, otherwise,

c 4, < y 2, 1 0, otherwise, Fundamentals of Big Data Analytics Univ.-Prof. Dr. rer. nat. Rudolf Mathar Problem. Probability theory: The outcome of an experiment is described by three events A, B and C. The probabilities Pr(A) =,

More information

On corrections of classical multivariate tests for high-dimensional data

On corrections of classical multivariate tests for high-dimensional data On corrections of classical multivariate tests for high-dimensional data Jian-feng Yao with Zhidong Bai, Dandan Jiang, Shurong Zheng Overview Introduction High-dimensional data and new challenge in statistics

More information

A Magiv CV Theory for Large-Margin Classifiers

A Magiv CV Theory for Large-Margin Classifiers A Magiv CV Theory for Large-Margin Classifiers Hui Zou School of Statistics, University of Minnesota June 30, 2018 Joint work with Boxiang Wang Outline 1 Background 2 Magic CV formula 3 Magic support vector

More information

Introduction to Machine Learning Midterm Exam

Introduction to Machine Learning Midterm Exam 10-701 Introduction to Machine Learning Midterm Exam Instructors: Eric Xing, Ziv Bar-Joseph 17 November, 2015 There are 11 questions, for a total of 100 points. This exam is open book, open notes, but

More information

Less is More: Computational Regularization by Subsampling

Less is More: Computational Regularization by Subsampling Less is More: Computational Regularization by Subsampling Lorenzo Rosasco University of Genova - Istituto Italiano di Tecnologia Massachusetts Institute of Technology lcsl.mit.edu joint work with Alessandro

More information

Statistical Pattern Recognition

Statistical Pattern Recognition Statistical Pattern Recognition Feature Extraction Hamid R. Rabiee Jafar Muhammadi, Alireza Ghasemi, Payam Siyari Spring 2014 http://ce.sharif.edu/courses/92-93/2/ce725-2/ Agenda Dimensionality Reduction

More information

A spectral clustering algorithm based on Gram operators

A spectral clustering algorithm based on Gram operators A spectral clustering algorithm based on Gram operators Ilaria Giulini De partement de Mathe matiques et Applications ENS, Paris Joint work with Olivier Catoni 1 july 2015 Clustering task of grouping

More information

Linear Regression with Strongly Correlated Designs Using Ordered Weigthed l 1

Linear Regression with Strongly Correlated Designs Using Ordered Weigthed l 1 Linear Regression with Strongly Correlated Designs Using Ordered Weigthed l 1 ( OWL ) Regularization Mário A. T. Figueiredo Instituto de Telecomunicações and Instituto Superior Técnico, Universidade de

More information

Network Newton. Aryan Mokhtari, Qing Ling and Alejandro Ribeiro. University of Pennsylvania, University of Science and Technology (China)

Network Newton. Aryan Mokhtari, Qing Ling and Alejandro Ribeiro. University of Pennsylvania, University of Science and Technology (China) Network Newton Aryan Mokhtari, Qing Ling and Alejandro Ribeiro University of Pennsylvania, University of Science and Technology (China) aryanm@seas.upenn.edu, qingling@mail.ustc.edu.cn, aribeiro@seas.upenn.edu

More information

A GENERAL FORMULATION FOR SUPPORT VECTOR MACHINES. Wei Chu, S. Sathiya Keerthi, Chong Jin Ong

A GENERAL FORMULATION FOR SUPPORT VECTOR MACHINES. Wei Chu, S. Sathiya Keerthi, Chong Jin Ong A GENERAL FORMULATION FOR SUPPORT VECTOR MACHINES Wei Chu, S. Sathiya Keerthi, Chong Jin Ong Control Division, Department of Mechanical Engineering, National University of Singapore 0 Kent Ridge Crescent,

More information

CSCI-567: Machine Learning (Spring 2019)

CSCI-567: Machine Learning (Spring 2019) CSCI-567: Machine Learning (Spring 2019) Prof. Victor Adamchik U of Southern California Mar. 19, 2019 March 19, 2019 1 / 43 Administration March 19, 2019 2 / 43 Administration TA3 is due this week March

More information

Random Correlation Matrices, Top Eigenvalue with Heavy Tails and Financial Applications

Random Correlation Matrices, Top Eigenvalue with Heavy Tails and Financial Applications Random Correlation Matrices, Top Eigenvalue with Heavy Tails and Financial Applications J.P Bouchaud with: M. Potters, G. Biroli, L. Laloux, M. A. Miceli http://www.cfm.fr Portfolio theory: Basics Portfolio

More information

Random Matrices: Beyond Wigner and Marchenko-Pastur Laws

Random Matrices: Beyond Wigner and Marchenko-Pastur Laws Random Matrices: Beyond Wigner and Marchenko-Pastur Laws Nathan Noiry Modal X, Université Paris Nanterre May 3, 2018 Wigner s Matrices Wishart s Matrices Generalizations Wigner s Matrices ij, (n, i, j

More information

Information-theoretic bounds and phase transitions in clustering, sparse PCA, and submatrix localization

Information-theoretic bounds and phase transitions in clustering, sparse PCA, and submatrix localization Information-theoretic bounds and phase transitions in clustering, sparse PCA, and submatrix localization Jess Banks Cristopher Moore Roman Vershynin Nicolas Verzelen Jiaming Xu Abstract We study the problem

More information

De-biasing the Lasso: Optimal Sample Size for Gaussian Designs

De-biasing the Lasso: Optimal Sample Size for Gaussian Designs De-biasing the Lasso: Optimal Sample Size for Gaussian Designs Adel Javanmard USC Marshall School of Business Data Science and Operations department Based on joint work with Andrea Montanari Oct 2015 Adel

More information

Probabilistic Machine Learning. Industrial AI Lab.

Probabilistic Machine Learning. Industrial AI Lab. Probabilistic Machine Learning Industrial AI Lab. Probabilistic Linear Regression Outline Probabilistic Classification Probabilistic Clustering Probabilistic Dimension Reduction 2 Probabilistic Linear

More information

Manifold Coarse Graining for Online Semi-supervised Learning

Manifold Coarse Graining for Online Semi-supervised Learning for Online Semi-supervised Learning Mehrdad Farajtabar, Amirreza Shaban, Hamid R. Rabiee, Mohammad H. Rohban Digital Media Lab, Department of Computer Engineering, Sharif University of Technology, Tehran,

More information