A Random Matrix Framework for BigData Machine Learning and Applications to Wireless Communications
Romain COUILLET, CentraleSupélec, France (EURECOM). June.
Outline

- Basics of Random Matrix Theory
  - Motivation: Large Sample Covariance Matrices
  - Spiked Models
- Applications
  - Reminder on Spectral Clustering Methods
  - Kernel Spectral Clustering
  - The case f'(τ) = 0: an application to massive MIMO user clustering
  - Semi-supervised Learning
  - Random Feature Maps, Extreme Learning Machines, and Neural Networks
- Perspectives
Basics of Random Matrix Theory / Motivation: Large Sample Covariance Matrices

Context

Baseline scenario: $y_1, \ldots, y_n \in \mathbb{C}^p$ (or $\mathbb{R}^p$) i.i.d. with $E[y_1] = 0$, $E[y_1 y_1^*] = C_p$.

If $y_1 \sim \mathcal{N}(0, C_p)$, the ML estimator for $C_p$ is the sample covariance matrix (SCM)
$$\hat C_p = \frac{1}{n} Y_p Y_p^* = \frac{1}{n} \sum_{i=1}^n y_i y_i^*, \qquad Y_p = [y_1, \ldots, y_n] \in \mathbb{C}^{p \times n}.$$

If $n \to \infty$ with $p$ fixed, the strong law of large numbers gives $\hat C_p \xrightarrow{\text{a.s.}} C_p$, or equivalently, in spectral norm, $\|\hat C_p - C_p\| \xrightarrow{\text{a.s.}} 0$.

Random Matrix Regime. This is no longer valid if $p, n \to \infty$ with $p/n \to c \in (0, \infty)$: then $\|\hat C_p - C_p\| \not\to 0$. For practical $p, n$ with $p \simeq n$, this leads to dramatically wrong conclusions.
The Marčenko-Pastur law

Figure: Histogram of the eigenvalues of $\hat C_p$ for $p = 500$, $n = 2000$, $C_p = I_p$, against the Marčenko-Pastur law.
The Marčenko-Pastur law

Definition (Empirical Spectral Density). The empirical spectral density (e.s.d.) $\mu_p$ of a Hermitian matrix $A_p \in \mathbb{C}^{p \times p}$ is
$$\mu_p = \frac{1}{p} \sum_{i=1}^p \delta_{\lambda_i(A_p)}.$$

Theorem (Marčenko-Pastur Law [Marčenko, Pastur 1967]). Let $X_p \in \mathbb{C}^{p \times n}$ have i.i.d. zero mean, unit variance entries. As $p, n \to \infty$ with $p/n \to c \in (0, \infty)$, the e.s.d. $\mu_p$ of $\frac{1}{n} X_p X_p^*$ satisfies $\mu_p \xrightarrow{\text{a.s.}} \mu_c$ weakly, where
- $\mu_c(\{0\}) = \max\{0, 1 - c^{-1}\}$;
- on $(0, \infty)$, $\mu_c$ has a continuous density $f_c$ supported on $[(1-\sqrt c)^2, (1+\sqrt c)^2]$,
$$f_c(x) = \frac{1}{2\pi c x} \sqrt{\left(x - (1-\sqrt c)^2\right)\left((1+\sqrt c)^2 - x\right)}.$$
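As a minimal numerical sketch of the theorem (not part of the original slides; the parameters $p = 500$, $n = 2000$ are chosen for illustration), one can check that the eigenvalues of $\frac1n X_p X_p^*$ fall inside the Marčenko-Pastur support and that the density integrates to one:

```python
import numpy as np

# Sketch: eigenvalues of the sample covariance model (1/n) X X^T versus
# the Marchenko-Pastur support for c = p/n (illustrative sizes).
rng = np.random.default_rng(0)
p, n = 500, 2000
c = p / n
X = rng.standard_normal((p, n))
eigs = np.linalg.eigvalsh(X @ X.T / n)

def mp_density(x, c):
    """Marchenko-Pastur density on [(1 - sqrt(c))^2, (1 + sqrt(c))^2]."""
    lm, lp = (1 - np.sqrt(c)) ** 2, (1 + np.sqrt(c)) ** 2
    out = np.zeros_like(x)
    inside = (x > lm) & (x < lp)
    out[inside] = np.sqrt((x[inside] - lm) * (lp - x[inside])) / (2 * np.pi * c * x[inside])
    return out

lm, lp = (1 - np.sqrt(c)) ** 2, (1 + np.sqrt(c)) ** 2
print(lm, lp)  # support edges: 0.25 and 2.25 for c = 1/4
```

With $c = 1/4$ the support is $[0.25, 2.25]$, matching the histogram of the previous figure.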
The Marčenko-Pastur law

Figure: Marčenko-Pastur density $f_c$ for different limit ratios $c = \lim_p p/n$ (e.g. $c = 0.1$).
Spiked Models

Small rank perturbation: $C_p = I_p + P$, $P$ of low rank.

Figure: Eigenvalues $\{\lambda_i\}_{i=1}^n$ of $\frac1n Y_p Y_p^*$ for $C_p = \mathrm{diag}(1, \ldots, 1, 2, 2, 3, 3)$ (with $p - 4$ ones), $p = 500$; isolated spikes appear at $1 + \omega_k + c \frac{1 + \omega_k}{\omega_k}$ outside the Marčenko-Pastur bulk $\mu$.
Spiked Models

Theorem (Eigenvalues [Baik, Silverstein 2006]). Let $Y_p = C_p^{1/2} X_p$, with $X_p$ having i.i.d. zero mean, unit variance entries, $E[|X_{p,ij}|^4] < \infty$, and
$$C_p = I_p + P, \qquad P = U \Omega U^*,$$
where, for $K$ fixed, $\Omega = \mathrm{diag}(\omega_1, \ldots, \omega_K) \in \mathbb{R}^{K \times K}$ with $\omega_1 \geq \ldots \geq \omega_K > 0$. Then, as $p, n \to \infty$ with $p/n \to c \in (0, \infty)$, denoting $\lambda_m = \lambda_m(\frac1n Y_p Y_p^*)$ (with $\lambda_m \geq \lambda_{m+1}$),
$$\lambda_m \xrightarrow{\text{a.s.}} \begin{cases} 1 + \omega_m + c \dfrac{1 + \omega_m}{\omega_m} > (1 + \sqrt c)^2, & \omega_m > \sqrt c, \\ (1 + \sqrt c)^2, & \omega_m \in (0, \sqrt c]. \end{cases}$$
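A short numerical sketch of the phase-transition formula (not from the slides; the rank-one spike, sizes, and seed are illustrative assumptions): for $\omega > \sqrt c$ the top sample eigenvalue should sit near $1 + \omega + c(1+\omega)/\omega$, well outside the bulk edge $(1+\sqrt c)^2$.

```python
import numpy as np

# Sketch of the Baik-Silverstein spike location: C = I + w u u^T, one spike.
rng = np.random.default_rng(1)
p, n = 800, 3200              # c = 1/4, so sqrt(c) = 1/2
c = p / n
omega = 2.0                   # above the threshold sqrt(c)
u = np.zeros(p); u[0] = 1.0
X = rng.standard_normal((p, n))
# C^{1/2} = I + (sqrt(1 + w) - 1) u u^T for the rank-one spike
Y = (np.eye(p) + (np.sqrt(1 + omega) - 1) * np.outer(u, u)) @ X
top = np.linalg.eigvalsh(Y @ Y.T / n)[-1]
predicted = 1 + omega + c * (1 + omega) / omega   # = 3.375 here
print(top, predicted)
```

The agreement improves as $p, n$ grow at fixed ratio.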
Spiked Models

Theorem (Eigenvectors [Paul 2007]). Let $Y_p = C_p^{1/2} X_p$, with $X_p$ having i.i.d. zero mean, unit variance entries, $E[|X_{p,ij}|^4] < \infty$, and $C_p = I_p + P$, $P = U \Omega U^* = \sum_{i=1}^K \omega_i u_i u_i^*$, $\omega_1 > \ldots > \omega_K > 0$. Then, as $p, n \to \infty$ with $p/n \to c \in (0, \infty)$, for $a, b \in \mathbb{C}^p$ deterministic and $\hat u_i$ the eigenvector associated with $\lambda_i(\frac1n Y_p Y_p^*)$,
$$a^* \hat u_i \hat u_i^* b - \frac{1 - c\,\omega_i^{-2}}{1 + c\,\omega_i^{-1}}\, a^* u_i u_i^* b \; 1_{\omega_i > \sqrt c} \xrightarrow{\text{a.s.}} 0.$$
In particular,
$$|\hat u_i^* u_i|^2 \xrightarrow{\text{a.s.}} \frac{1 - c\,\omega_i^{-2}}{1 + c\,\omega_i^{-1}}\, 1_{\omega_i > \sqrt c}.$$
Spiked Models

Figure: Simulated versus limiting $|\hat u_1^* u_1|^2$ for $Y_p = C_p^{1/2} X_p$, $C_p = I_p + \omega_1 u_1 u_1^*$, $p/n = 1/3$, varying $\omega_1$; curves for $p = 100, 200, 400$ against the limit $\frac{1 - c/\omega_1^2}{1 + c/\omega_1}$.
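The eigenvector alignment can be checked numerically as well (a sketch under the same illustrative rank-one model as above, not from the slides):

```python
import numpy as np

# Sketch of Paul's result: alignment |u_hat^T u|^2 of the top sample
# eigenvector with the population spike approaches (1 - c/w^2)/(1 + c/w).
rng = np.random.default_rng(2)
p, n = 600, 1800              # c = 1/3
c = p / n
omega = 2.0
u = np.zeros(p); u[0] = 1.0
X = rng.standard_normal((p, n))
Y = (np.eye(p) + (np.sqrt(1 + omega) - 1) * np.outer(u, u)) @ X
_, vecs = np.linalg.eigh(Y @ Y.T / n)
alignment = vecs[:, -1] @ u   # eigh sorts ascending: top eigenvector is last
theory = (1 - c / omega ** 2) / (1 + c / omega)   # = 11/14 here
print(alignment ** 2, theory)
```

Note the limit is strictly below 1: even far above the threshold, the sample eigenvector never fully aligns with the population one.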
Other Spiked Models

Similar results hold for multiple matrix models:
$$Y_p = \frac1n X_p X_p^* + P, \quad Y_p = \frac1n X_p^* (I + P) X_p, \quad Y_p = \frac1n (X_p + P)^* (X_p + P), \quad Y_p = \frac1n T^* X_p^* (I + P) X_p T, \quad \text{etc.}$$
Applications / Reminder on Spectral Clustering Methods

Reminder on Spectral Clustering Methods

Context: two-step classification of $n$ objects based on a similarity matrix $A \in \mathbb{R}^{n \times n}$:
1. Extract the eigenvectors associated with the isolated eigenvalues ("spikes") of $A$; in practice their entries are shuffled along with the data.
2. Build the $\ell$-dimensional representation whose rows are the entries of these leading eigenvectors (the shuffling no longer matters), then apply EM or k-means clustering.
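The two steps above can be sketched numerically (toy data, not from the slides; for $k = 2$ classes a sign threshold on the one-dimensional representation stands in for k-means):

```python
import numpy as np

# Sketch: noisy rank-one similarity A = s s^T + noise; the leading
# eigenvector recovers the class signs, then we "cluster" by sign.
rng = np.random.default_rng(3)
n = 200
labels_true = np.repeat([0, 1], n // 2)
s = np.where(labels_true == 0, 1.0, -1.0)
noise = rng.standard_normal((n, n))
A = np.outer(s, s) + 2.0 * (noise + noise.T) / np.sqrt(2 * n)
v = np.linalg.eigh(A)[1][:, -1]          # leading eigenvector of A
lab = (v > 0).astype(int)
# class labels are recovered up to a global sign flip
accuracy = max((lab == labels_true).mean(), (lab != labels_true).mean())
print(accuracy)
```

Here the spike ($\|ss^T\| = n$) dominates the noise bulk, so the leading eigenvector is essentially noiseless.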
Kernel Spectral Clustering

Problem Statement
- Dataset $x_1, \ldots, x_n \in \mathbb{R}^p$.
- Objective: cluster the data into $k$ similarity classes $\mathcal{C}_1, \ldots, \mathcal{C}_k$.
- Kernel spectral clustering is based on the kernel matrix $K = \{\kappa(x_i, x_j)\}_{i,j=1}^n$.
- Usually, $\kappa(x, y) = f(x^T y)$ or $\kappa(x, y) = f(\|x - y\|^2)$.
- Refinements: instead of $K$, use $D - K$, $I_n - D^{-1} K$, $I_n - D^{-1/2} K D^{-1/2}$, etc.; several-step algorithms: Ng-Jordan-Weiss, Shi-Malik, etc.

Intuition (from small dimensions): $K$ is essentially low rank, with the class structure carried by its eigenvectors.
Kernel Spectral Clustering

Figure: Leading four eigenvectors of $D^{-1/2} K D^{-1/2}$ for MNIST data.
Model and Assumptions

Gaussian mixture model: $x_1, \ldots, x_n \in \mathbb{R}^p$ in $k$ classes $\mathcal{C}_1, \ldots, \mathcal{C}_k$, with $x_1, \ldots, x_{n_1} \in \mathcal{C}_1$, ..., $x_{n - n_k + 1}, \ldots, x_n \in \mathcal{C}_k$, and $x_i \sim \mathcal{N}(\mu_{g_i}, C_{g_i})$.

Assumption (Convergence Rate). As $n \to \infty$:
1. Data scaling: $\frac pn \to c_0 \in (0, \infty)$;
2. Class scaling: $\frac{n_a}{n} \to c_a \in (0, 1)$;
3. Mean scaling: with $\mu \triangleq \sum_{a=1}^k \frac{n_a}{n} \mu_a$ and $\mu_a^\circ \triangleq \mu_a - \mu$, then $\|\mu_a^\circ\| = O(1)$;
4. Covariance scaling: with $C \triangleq \sum_{a=1}^k \frac{n_a}{n} C_a$ and $C_a^\circ \triangleq C_a - C$, then $\|C_a\| = O(1)$, $\operatorname{tr} C_a^\circ = O(\sqrt p)$, $\operatorname{tr} C_a^\circ C_b^\circ = O(p)$.

Remark: for 2 classes, this reads
$$\|\mu_1 - \mu_2\| = O(1), \quad \operatorname{tr}(C_1 - C_2) = O(\sqrt p), \quad \|C_i\| = O(1), \quad \operatorname{tr}\left([C_1 - C_2]^2\right) = O(p).$$
Model and Assumptions

Kernel matrix of interest:
$$K = \left\{ f\left( \frac1p \|x_i - x_j\|^2 \right) \right\}_{i,j=1}^n$$
for some sufficiently smooth nonnegative $f$ (the case $f(\frac1p x_i^T x_j)$ is simpler).

We study the normalized Laplacian
$$L = n D^{-1/2} \left( K - \frac{d d^T}{d^T 1_n} \right) D^{-1/2}, \qquad d = K 1_n, \quad D = \mathrm{diag}(d).$$
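A minimal sketch of these two objects (illustrative Gaussian data, $f(t) = e^{-t/2}$ as an assumed kernel choice, not a prescription from the slides); by construction $D^{1/2} 1_n$ lies in the kernel of $L$, which gives a quick sanity check:

```python
import numpy as np

# Sketch: kernel matrix K_ij = f(||x_i - x_j||^2 / p) with f(t) = exp(-t/2),
# and the normalized Laplacian L = n D^{-1/2} (K - d d^T / (d^T 1)) D^{-1/2}.
rng = np.random.default_rng(4)
p, n = 100, 60
X = rng.standard_normal((p, n))          # columns are the data points
sq = (X * X).sum(0)
dist2 = sq[:, None] + sq[None, :] - 2 * X.T @ X   # pairwise ||x_i - x_j||^2
K = np.exp(-dist2 / (2 * p))
d = K @ np.ones(n)
D_inv_sqrt = np.diag(1 / np.sqrt(d))
L = n * D_inv_sqrt @ (K - np.outer(d, d) / d.sum()) @ D_inv_sqrt
# L annihilates D^{1/2} 1_n exactly (up to floating-point roundoff)
residual = np.abs(L @ np.sqrt(d)).max()
print(residual)
```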
Random Matrix Equivalent

Key Remark: under our assumptions, uniformly over $i \neq j \in \{1, \ldots, n\}$,
$$\frac1p \|x_i - x_j\|^2 \xrightarrow{\text{a.s.}} \tau > 0.$$

This allows for a Taylor expansion of $K$:
$$K = \underbrace{f(\tau) 1_n 1_n^T}_{O_{\|\cdot\|}(n)} + \underbrace{\sqrt n\, K_1}_{\text{low rank, } O_{\|\cdot\|}(\sqrt n)} + \underbrace{K_2}_{\text{informative terms, } O_{\|\cdot\|}(1)}.$$

This is, however, not the (small dimension) intuitive behavior.
Random Matrix Equivalent

Theorem (Random Matrix Equivalent [Couillet, Benaych-Georges 2015]). As $n, p \to \infty$, $\|L - \hat L\| \xrightarrow{\text{a.s.}} 0$, where
$$L = n D^{-1/2} \left( K - \frac{d d^T}{d^T 1_n} \right) D^{-1/2}, \qquad K_{ij} = f\left( \frac1p \|x_i - x_j\|^2 \right),$$
and
$$\hat L = 2 \frac{f'(\tau)}{f(\tau)} \left[ \frac1p \Pi W^T W \Pi + \frac1p J B J^T + * \right]$$
with $W = [w_1, \ldots, w_n] \in \mathbb{R}^{p \times n}$ ($x_i = \mu_{g_i} + w_i$), $\Pi = I_n - \frac1n 1_n 1_n^T$,
$$J = [j_1, \ldots, j_k], \quad j_a^T = (0, \ldots, 0, 1_{n_a}^T, 0, \ldots, 0),$$
$$B = M^T M + \left( \frac{5 f'(\tau)}{8 f(\tau)} - \frac{f''(\tau)}{2 f'(\tau)} \right) t t^T - \frac{f''(\tau)}{f'(\tau)} T + *.$$
Recall $M = [\mu_1^\circ, \ldots, \mu_k^\circ]$, $t = \left[ \frac{1}{\sqrt p} \operatorname{tr} C_1^\circ, \ldots, \frac{1}{\sqrt p} \operatorname{tr} C_k^\circ \right]^T$, $T = \left\{ \frac1p \operatorname{tr} C_a^\circ C_b^\circ \right\}_{a,b=1}^k$.
Isolated eigenvalues: Gaussian inputs

Figure: Eigenvalues of $L$ and $\hat L$, $k = 3$, $p = 2048$, $n = 512$, $c_1 = c_2 = 1/4$, $c_3 = 1/2$, $[\mu_a]_j = 4\delta_{aj}$, $C_a = (1 + 2(a-1)/\sqrt p) I_p$, $f(x) = \exp(-x/2)$.
Theoretical Findings versus MNIST

Figure: Eigenvalues of $L$ (red) and of the equivalent Gaussian model $\hat L$ (white), as if the data were Gaussian; MNIST data, $p = 784$.
Theoretical Findings versus MNIST

Figure: Leading four eigenvectors of $D^{-1/2} K D^{-1/2}$ for MNIST data (red) and theoretical findings (blue).
Theoretical Findings versus MNIST

Figure: 2D representation of the eigenvectors of $L$ for the MNIST dataset (eigenvector 2 vs. eigenvector 1, and eigenvector 3 vs. eigenvector 2). Theoretical means and 1- and 2-standard deviations in blue. Class 1 in red, Class 2 in black, Class 3 in green.
The surprising f'(τ) = 0 case

Figure: Classification error for a polynomial kernel with $f(\tau) = 4$, $f'(\tau) = 2$, as a function of $f''(\tau)$; $x_i \sim \mathcal{N}(0, C_a)$ with $C_1 = I_p$, $[C_2]_{i,j} = .4^{|i-j|}$; curves for $p = 512$ and $p = 1024$ against theory.
Applications / The case f'(τ) = 0: an application to massive MIMO user clustering

Position of the Problem

Problem: cluster large data $x_1, \ldots, x_n \in \mathbb{R}^p$ based on their spanned subspaces.

Method:
- Still assume $x_1, \ldots, x_n$ belong to $k$ classes $\mathcal{C}_1, \ldots, \mathcal{C}_k$.
- Zero-mean Gaussian model for the data: for $x_i \in \mathcal{C}_a$, $x_i \sim \mathcal{N}(0, C_a)$.
- Study, in the regime $n, p \to \infty$, the performance of
$$L = n D^{-1/2} \left( K - \frac{d d^T}{d^T 1_n} \right) D^{-1/2}, \qquad K = \left\{ f\left( \|\breve x_i - \breve x_j\|^2 \right) \right\}_{i,j=1}^n, \quad \breve x = \frac{x}{\|x\|}.$$
Model and Reminders

Assumption 1 [Classes]. Vectors $x_1, \ldots, x_n \in \mathbb{R}^p$ are i.i.d. from a $k$-class Gaussian mixture, with $x_i \in \mathcal{C}_a \Leftrightarrow x_i \sim \mathcal{N}(0, C_a)$ (sorted by class for simplicity).

Assumption 2a [Growth Rates]. As $n \to \infty$, for each $a \in \{1, \ldots, k\}$:
1. $\frac np \to c_0 \in (0, \infty)$;
2. $\frac{n_a}{n} \to c_a \in (0, \infty)$;
3. $\frac1p \operatorname{tr} C_a = 1$ and $\operatorname{tr} C_a^\circ C_b^\circ = O(p)$, with $C_a^\circ = C_a - C$, $C = \sum_{b=1}^k c_b C_b$.

Theorem (Corollary of the Previous Section). Let $f$ be smooth with $f'(2) \neq 0$. Then, under Assumptions 1-2a,
$$L = n D^{-1/2} \left( K - \frac{d d^T}{d^T 1_n} \right) D^{-1/2}, \qquad K = \left\{ f\left( \|\breve x_i - \breve x_j\|^2 \right) \right\}_{i,j=1}^n \quad (\breve x = x/\|x\|),$$
exhibits a phase transition phenomenon: the leading eigenvectors of $L$ asymptotically contain structural information about $\mathcal{C}_1, \ldots, \mathcal{C}_k$ if and only if
$$T = \left\{ \frac1p \operatorname{tr} C_a^\circ C_b^\circ \right\}_{a,b=1}^k$$
has sufficiently large eigenvalues.
The case f'(2) = 0

Assumption 2b [Growth Rates]. As $n \to \infty$, for each $a \in \{1, \ldots, k\}$:
1. $\frac np \to c_0 \in (0, \infty)$;
2. $\frac{n_a}{n} \to c_a \in (0, \infty)$;
3. $\frac1p \operatorname{tr} C_a = 1$ and $\operatorname{tr} C_a^\circ C_b^\circ = O(\sqrt p)$, with $C_a^\circ = C_a - C$, $C = \sum_{b=1}^k c_b C_b$.

(In this regime, the previous kernels clearly fail.)

Theorem (Random Equivalent for f'(2) = 0). Let $f$ be smooth with $f'(2) = 0$ and
$$L' \triangleq \frac{p\, f(2)}{2 f''(2)} \left[ L - \frac{f(0) - f(2)}{f(2)} P \right], \qquad P = I_n - \frac1n 1_n 1_n^T.$$
Then, under Assumptions 1-2b,
$$L' = \frac{1}{\sqrt p} \left[ P \Phi P + \left\{ \frac1p \operatorname{tr}(C_a^\circ C_b^\circ)\, 1_{n_a} 1_{n_b}^T \right\}_{a,b=1}^k \right] + o_{\|\cdot\|}(1)$$
where $\Phi_{ij} = \delta_{i \neq j}\, p \left[ (\breve x_i^T \breve x_j)^2 - E[(\breve x_i^T \breve x_j)^2] \right]$.
The case f'(2) = 0

Figure: Eigenvalues of $L'$ (isolated eigenvalues $\lambda_1(L')$, $\lambda_2(L')$ marked), $p = 1000$, $n = 2000$, $k = 3$, $c_1 = c_2 = 1/4$, $c_3 = 1/2$, $C_i \propto I_p + (p/8)^{-5/4} W_i W_i^T$, $W_i \in \mathbb{R}^{p \times (p/8)}$ with i.i.d. $\mathcal{N}(0, 1)$ entries, $f(t) = \exp(-(t-2)^2)$.

No longer a Marčenko-Pastur-like bulk, but rather a semi-circle bulk!
The case f'(2) = 0

Roadmap. We now need to:
- study the spectrum of $\Phi$;
- study the isolated eigenvalues of $L'$ (and the phase transition);
- retrieve information from the eigenvectors.

Theorem (Semi-circle Law for $\Phi$). Let $\mu_n = \frac1n \sum_{i=1}^n \delta_{\lambda_i(L')}$. Then, under Assumptions 1-2b, $\mu_n \xrightarrow{\text{a.s.}} \mu$, with $\mu$ the semi-circle distribution
$$\mu(dt) = \frac{1}{2\pi c_0 \omega^2} \sqrt{(4 c_0 \omega^2 - t^2)^+}\, dt, \qquad \omega = \lim_p \sqrt{\frac2p \operatorname{tr}(C^2)}.$$
Figure: Eigenvalues of $L'$ against the semi-circle law; $p = 1000$, $n = 2000$, $k = 3$, $c_1 = c_2 = 1/4$, $c_3 = 1/2$, $C_i \propto I_p + (p/8)^{-5/4} W_i W_i^T$, $W_i \in \mathbb{R}^{p \times (p/8)}$ with i.i.d. $\mathcal{N}(0, 1)$ entries, $f(t) = \exp(-(t-2)^2)$.
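A reduced numerical sketch of the semi-circle bulk (not from the slides; it takes the simplest case $C = I_p$, one class, for which $\omega^2 = 2$ and the bulk edge is $2\sqrt{2n/p}$, and the normalizations follow the reconstructed equations above):

```python
import numpy as np

# Sketch: Phi_ij = p[(x_i^T x_j)^2 - 1/p] on unit-norm vectors (zero diagonal);
# P Phi P / sqrt(p) exhibits a semi-circle-like bulk of radius ~ 2*sqrt(2n/p).
rng = np.random.default_rng(8)
p, n = 800, 400
X = rng.standard_normal((p, n))
Xn = X / np.linalg.norm(X, axis=0)       # x breve = x / ||x||
G = Xn.T @ Xn
Phi = p * (G ** 2 - 1 / p)               # E[(x_i^T x_j)^2] = 1/p for i != j
np.fill_diagonal(Phi, 0)
P = np.eye(n) - np.ones((n, n)) / n
M = P @ Phi @ P / np.sqrt(p)
eigs = np.linalg.eigvalsh(M)
radius = 2 * np.sqrt(2 * n / p)          # = 2 sqrt(c0 w^2) with w^2 = 2
print(eigs.min(), eigs.max(), radius)
```

With $n/p = 1/2$ the predicted edge is $2$; the empirical extremes should lie near $\pm 2$, far from a Marčenko-Pastur shape.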
The case f'(2) = 0

Denote now
$$T \triangleq \left\{ \sqrt{c_a c_b} \lim_p \frac{1}{\sqrt p} \operatorname{tr}(C_a^\circ C_b^\circ) \right\}_{a,b=1}^k.$$

Theorem (Isolated Eigenvalues). Let $\nu_1 \geq \ldots \geq \nu_k$ be the eigenvalues of $T$. Then, if $\sqrt{c_0}\, \nu_i > \omega$, $L'$ has an isolated eigenvalue $\lambda_i$ satisfying
$$\lambda_i \xrightarrow{\text{a.s.}} \rho_i \triangleq c_0 \nu_i + \frac{\omega^2}{\nu_i}.$$
The case f'(2) = 0

Theorem (Isolated Eigenvectors). For each isolated eigenpair $(\lambda_i, u_i)$ of $L'$ corresponding to $(\nu_i, v_i)$ of $T$, write
$$u_i = \sum_{a=1}^k \alpha_i^a \frac{j_a}{\sqrt{n_a}} + \sigma_i^a w_i^a$$
with $j_a = [0_{n_1}^T, \ldots, 1_{n_a}^T, \ldots, 0_{n_k}^T]^T$, $(w_i^a)^T j_a = 0$, $\mathrm{supp}(w_i^a) = \mathrm{supp}(j_a)$, $\|w_i^a\| = 1$. Then, under Assumptions 1-2b,
$$\alpha_i^a \alpha_i^b \xrightarrow{\text{a.s.}} \left( 1 - \frac{\omega^2}{c_0 \nu_i^2} \right) [v_i v_i^T]_{ab}, \qquad (\sigma_i^a)^2 \xrightarrow{\text{a.s.}} \frac{c_a\, \omega^2}{c_0 \nu_i^2},$$
and the fluctuations of $u_i, u_j$, $i \neq j$, are asymptotically uncorrelated.
The case f'(2) = 0

Figure: Leading two eigenvectors of $L'$ (or equivalently of $L$) versus the deterministic approximations $\alpha_i^a \pm \sigma_i^a$ (e.g. $\frac{1}{\sqrt{n_1}}(\alpha_1^1 \pm \sigma_1^1)$, $\frac{1}{\sqrt{n_2}}(\alpha_1^2 \pm \sigma_1^2)$, $\frac{1}{\sqrt{n_2}}\alpha_1^2$).
The case f'(2) = 0

Figure: 2D representation of the leading two eigenvectors of $L'$ (or equivalently of $L$) versus the deterministic approximations $\alpha_i^a \pm \sigma_i^a$.
Application to Massive MIMO UE Clustering
Massive MIMO UE Clustering

Setting:
- Massive MIMO cell with $p$ antenna elements;
- $n$ user equipments (UEs) with channels $x_1, \ldots, x_n \in \mathbb{R}^p$;
- UEs belong to solid-angle groups, i.e., $E[x_i] = 0$, $E[x_i x_i^T] = C_a \triangleq C(\Theta_a)$;
- $T$ independent channel observations $x_i^{(1)}, \ldots, x_i^{(T)}$ for each UE $i$.

Objective: cluster users into the same solid-angle groups (for scheduling reasons, to avoid pilot contamination).

Algorithm:
1. Build the kernel matrix $K$, then $L$, based on the $nT$ vectors $x_1^{(1)}, \ldots, x_n^{(T)}$ (as if there were $nT$ values to cluster).
2. Extract the dominant isolated eigenvectors $u_1, \ldots, u_\kappa$.
3. For each $i$, create $\tilde u_i = \frac1T (I_n \otimes 1_T^T) u_i$, i.e., average the eigenvectors along time.
4. Perform $k$-class clustering on the vectors $\tilde u_1, \ldots, \tilde u_\kappa$.
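Step 3 of the algorithm is a simple linear operation; the following sketch (toy sizes, not from the slides) shows that $\frac1T (I_n \otimes 1_T^T) u$ is exactly the per-user time average when the $nT$ entries are ordered user by user:

```python
import numpy as np

# Sketch of the T-averaging step: entries of u ordered as
# (user 1: obs 1..T, user 2: obs 1..T, ...); averaging reduces noise by 1/T.
n, T = 5, 4
rng = np.random.default_rng(7)
signal = rng.standard_normal(n)                  # per-user component
u = np.repeat(signal, T) + 0.1 * rng.standard_normal(n * T)
avg_op = np.kron(np.eye(n), np.ones((1, T))) / T  # (1/T)(I_n kron 1_T^T)
u_tilde = avg_op @ u
print(u_tilde.shape)   # (n,)
```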
Massive MIMO UE Clustering

Figure: Leading two eigenvectors before (left) and after (right) $T$-averaging. Setting: $p = 400$, $n = 40$, $T = 10$, $k = 3$, $c_1 = c_3 = 1/4$, $c_2 = 1/2$, angular spread model with angles $-\pi/30 \pm \pi/20$, $0 \pm \pi/20$, and $\pi/30 \pm \pi/20$. Kernel function $f(t) = \exp(-(t-2)^2)$.
Massive MIMO UE Clustering

Figure: Correct clustering probability (overlap) for different $T$, using k-means or EM, starting either from the actual centroid solutions (oracle) or randomly, compared with random guessing.
Massive MIMO UE Clustering

Figure: Correct clustering probability for the optimal kernel $f(t)$ (here $f(t) = \exp(-(t-2)^2)$) and the Gaussian kernel $f(t) = \exp(-t^2)$, for different $T$, using k-means or EM.
94 Applications/Semi-supervised Learning 48/80 Outline Basics of Random Matrix Theory Motivation: Large Sample Covariance Matrices Spiked Models Applications Reminder on Spectral Clustering Methods Kernel Spectral Clustering The case f′(τ) = 0: an application to massive MIMO user clustering Semi-supervised Learning Random Feature Maps, Extreme Learning Machines, and Neural Networks Perspectives 48 / 80
97 Applications/Semi-supervised Learning 49/80 Problem Statement
Context: Similar to clustering: classify x_1, ..., x_n ∈ R^p in k classes, with n_l labelled and n_u unlabelled data.
Problem statement: give scores F_ia (with d_i = [K1_n]_i),
F = argmin_{F ∈ R^{n×k}} Σ_{a=1}^k Σ_{i,j} K_ij (F_ia d_i^{α−1} − F_ja d_j^{α−1})²
such that F_ia = δ_{x_i ∈ C_a} for all labelled x_i.
Solution: for F^(u) ∈ R^{n_u×k}, F^(l) ∈ R^{n_l×k} the scores of unlabelled/labelled data,
F^(u) = (I_{n_u} − D_(u)^{−α} K_(u,u) D_(u)^{α−1})^{−1} D_(u)^{−α} K_(u,l) D_(l)^{α−1} F^(l)
where we naturally decompose
K = [ K_(l,l) K_(l,u) ; K_(u,l) K_(u,u) ],  D = [ D_(l) 0 ; 0 D_(u) ] = diag{K1_n}.
49 / 80
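The closed-form solution maps directly to code. A NumPy sketch, under the assumption that the labelled samples occupy the first n_l rows/columns of K; the helper name is hypothetical:

```python
import numpy as np

def ssl_unlabelled_scores(K, F_l, alpha=0.0):
    """F^(u) = (I - D_u^{-a} K_uu D_u^{a-1})^{-1} D_u^{-a} K_ul D_l^{a-1} F^(l).

    K: full n x n kernel matrix (labelled samples first),
    F_l: n_l x k one-hot labelled scores."""
    n_l = F_l.shape[0]
    d = K.sum(axis=1)                      # degrees d_i = [K 1_n]_i
    d_ma = d ** (-alpha)                   # diagonal of D^{-alpha}
    d_am1 = d ** (alpha - 1.0)             # diagonal of D^{alpha-1}
    K_uu, K_ul = K[n_l:, n_l:], K[n_l:, :n_l]
    n_u = K.shape[0] - n_l
    A = np.eye(n_u) - d_ma[n_l:, None] * K_uu * d_am1[None, n_l:]
    b = (d_ma[n_l:, None] * K_ul * d_am1[None, :n_l]) @ F_l
    return np.linalg.solve(A, b)
```

For a positive kernel, the matrix A is invertible since the degrees d_i are computed over the full kernel (labelled columns included), keeping the spectral radius of the propagated block below one.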
100 Applications/Semi-supervised Learning 50/80 MNIST Data Example
Figure: Vectors [F^(u)]_{·,a}, a = 1, 2, 3 (zeros, ones, twos), for 3-class MNIST data, n = 192, p = 784, n_l/n = 1/16, Gaussian kernel.
50 / 80
103 Applications/Semi-supervised Learning 51/80 MNIST Data Example
Figure: Centered vectors [F°^(u)]_{·,a} = [F^(u) − (1/k) F^(u) 1_k 1_k^T]_{·,a}, a = 1, 2, 3 (zeros, ones, twos), for 3-class MNIST data, α = 0, n = 192, p = 784, n_l/n = 1/16, Gaussian kernel.
51 / 80
108 Applications/Semi-supervised Learning 52/80 Main Results
Results: Assuming n_l/n → c_l ∈ (0, 1), by the previous Taylor expansion, to first order,
F^(u)_{·,a} = (n_{l,a}/n) [ v + α t_a 1_{n_u}/√n ] + O(n^{−1})
where v is an O(1) (entry-wise) random vector, the term α t_a 1_{n_u}/√n is O(n^{−1/2}), the O(n^{−1}) terms carry the information, and t_a = (1/p) tr C_a.
Consequences:
Random non-informative bias v
Strong impact of n_{l,a}: F^(u)_{·,a} is to be rescaled by n_{l,a}
Additional per-class bias α t_a 1_{n_u} ⇒ take α = 0 + β/√p.
52 / 80
110 Applications/Semi-supervised Learning 53/80 Main Results
As a consequence of the remarks above, we take α = β/√p and define F̂^(u)_{i,a} = (n√p/n_{l,a}) F^(u)_{ia}.
Theorem
For x_i ∈ C_b unlabelled, F̂_{i,·} − G_b → 0, with G_b ∼ N(m_b, Σ_b), where m_b ∈ R^k, Σ_b ∈ R^{k×k} are given by
(m_b)_a = (2f′(τ)/f(τ)) M̄_ab + (f″(τ)/f(τ)) t̄_a t̄_b + (2f″(τ)/f(τ)) T̄_ab − (f′(τ)²/f(τ)²) t̄_a t̄_b + β (n/n_l)(f′(τ)/f(τ)) t̄_a + B_b
(Σ_b)_{a1 a2} = (2 tr C_b²/p) (f″(τ)/f(τ) − f′(τ)²/f(τ)²)² t̄_{a1} t̄_{a2} + (4f′(τ)²/f(τ)²) ( [M̄^T C_b M̄]_{a1 a2} + δ_{a1}^{a2} p T̄_{b a1}/n_{l,a1} )
with t, T, M as before, X̄_a = X_a − Σ_{d=1}^k (n_{l,d}/n_l) X_d, and B_b a bias independent of a.
53 / 80
112 Applications/Semi-supervised Learning 54/80 Main Results
Corollary (Asymptotic Classification Error)
For k = 2 classes and a ≠ b,
P( F̂_{i,a} > F̂_{i,b} | x_i ∈ C_b ) − Q( ((m_b)_b − (m_b)_a) / √([1, −1] Σ_b [1, −1]^T) ) → 0.
Some consequences:
non obvious choice of appropriate kernels
non obvious choice of the optimal β (it induces a possibly beneficial bias)
importance of n_l versus n_u.
54 / 80
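For k = 2 the asymptotic error is just a Gaussian tail probability, which a tiny helper makes explicit. The m_b and Σ_b values used in the example are purely illustrative, not computed from the theorem:

```python
from math import erf, sqrt

def gaussian_Q(x):
    """Gaussian tail Q(x) = P(N(0, 1) > x) = (1 - erf(x / sqrt(2))) / 2."""
    return 0.5 * (1.0 - erf(x / sqrt(2.0)))

def asymptotic_error(m_b, Sigma_b, a, b):
    """Limit of P(F_hat_{i,a} > F_hat_{i,b} | x_i in C_b).

    gap = (m_b)_b - (m_b)_a, variance = [1, -1] Sigma_b [1, -1]^T."""
    gap = m_b[b] - m_b[a]
    w = (1.0, -1.0)
    var = sum(w[i] * Sigma_b[i][j] * w[j] for i in range(2) for j in range(2))
    return gaussian_Q(gap / sqrt(var))
```

With an illustrative gap of 1 and Σ_b = I_2, the error is Q(1/√2) ≈ 0.24, showing directly how the mean separation and the score variances trade off.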
114 Applications/Semi-supervised Learning 55/80 MNIST Data Example
Figure: Probability of correct classification as a function of α (simulations versus theory "as if Gaussian"), for 3-class MNIST data (zeros, ones, twos), n = 192, p = 784, n_l/n = 1/16, Gaussian kernel.
55 / 80
116 Applications/Semi-supervised Learning 56/80 MNIST Data Example
Figure: Probability of correct classification as a function of α (simulations versus theory "as if Gaussian"), for 2-class MNIST data (zeros, ones), n = 1568, p = 784, n_l/n = 1/16, Gaussian kernel.
56 / 80
117 Applications/Random Feature Maps, Extreme Learning Machines, and Neural Networks 57/80 Outline Basics of Random Matrix Theory Motivation: Large Sample Covariance Matrices Spiked Models Applications Reminder on Spectral Clustering Methods Kernel Spectral Clustering The case f′(τ) = 0: an application to massive MIMO user clustering Semi-supervised Learning Random Feature Maps, Extreme Learning Machines, and Neural Networks Perspectives 57 / 80
119 Applications/Random Feature Maps, Extreme Learning Machines, and Neural Networks 58/80 Random Feature Maps and Extreme Learning Machines
Context: Random Feature Map
(large) input x_1, ..., x_T ∈ R^p
random W = [w_1^T; ...; w_n^T] ∈ R^{n×p}
non-linear activation function σ.
Neural Network Model (extreme learning machine): Ridge-regression learning
small output y_1, ..., y_T ∈ R^d
ridge-regression output β ∈ R^{n×d}
58 / 80
122 Applications/Random Feature Maps, Extreme Learning Machines, and Neural Networks 59/80 Random Feature Maps and Extreme Learning Machines
Objectives: evaluate the training and testing MSE performance as n, p, T → ∞.
Training MSE:
E_train = (1/T) Σ_{i=1}^T ||y_i − β^T σ(W x_i)||² = (1/T) ||Y − β^T Σ||_F²
with Σ = σ(WX) = {σ(w_i^T x_j)}_{1≤i≤n, 1≤j≤T} and β = (1/T) Σ ((1/T) Σ^T Σ + γ I_T)^{−1} Y^T.
Testing MSE: upon a new pair (X̂, Ŷ) of length T̂,
E_test = (1/T̂) ||Ŷ − β^T Σ̂||_F²
where Σ̂ = σ(W X̂).
59 / 80
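These two MSEs can be reproduced verbatim in a few lines. A NumPy sketch: the choice σ = tanh and the standard Gaussian W are illustrative assumptions.

```python
import numpy as np

def elm_train_test_mse(X, Y, X_hat, Y_hat, n, gamma, sigma=np.tanh, seed=0):
    """Random-feature ridge regression (extreme learning machine)."""
    p, T = X.shape
    W = np.random.default_rng(seed).standard_normal((n, p))
    S = sigma(W @ X)                                 # Sigma = sigma(WX), n x T
    # beta = (1/T) Sigma ((1/T) Sigma^T Sigma + gamma I_T)^{-1} Y^T
    beta = S @ np.linalg.solve(S.T @ S / T + gamma * np.eye(T), Y.T) / T
    E_train = np.linalg.norm(Y - beta.T @ S) ** 2 / T
    S_hat = sigma(W @ X_hat)                         # Sigma_hat = sigma(W X_hat)
    E_test = np.linalg.norm(Y_hat - beta.T @ S_hat) ** 2 / X_hat.shape[1]
    return E_train, E_test
```

Working with the T × T Gram matrix Σ^T Σ (rather than the n × n one) matches the resolvent Q introduced next.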
124 Applications/Random Feature Maps, Extreme Learning Machines, and Neural Networks 60/80 Technical Aspects
Preliminary observations:
Link to the resolvent of (1/T) Σ^T Σ:
E_train = (γ²/T) tr Y^T Y Q² = −γ² (∂/∂γ) (1/T) tr Y^T Y Q
where Q = Q(γ) is the resolvent
Q = ( (1/T) Σ^T Σ + γ I_T )^{−1}
with Σ_ij = σ(w_i^T x_j).
Central object: the resolvent E[Q].
60 / 80
127 Applications/Random Feature Maps, Extreme Learning Machines, and Neural Networks 61/80 Main Technical Result
Theorem [Asymptotic Equivalent for E[Q]]
For Lipschitz σ, bounded X, Y, and W = f(Z) (entry-wise) with Z standard Gaussian, we have, for all ε > 0 and some C > 0,
||E[Q] − Q̄|| < C n^{ε−1/2}
where
Q̄ = ( (n/T) Φ/(1 + δ) + γ I_T )^{−1},  Φ = E[σ(X^T w) σ(w^T X)]
with w = f(z), z ∼ N(0, I_p), and δ > 0 the unique positive solution to δ = (1/T) tr Φ Q̄.
Proof arguments:
σ(WX) has independent rows but dependent columns: this breaks the trace lemma argument (i.e., (1/p) w^T X A X^T w ≈ (1/p) tr X A X^T)
Concentration of measure lemma: (1/p) σ(w^T X) A σ(X^T w) ≈ (1/p) tr Φ A
61 / 80
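Given Φ, the deterministic equivalent Q̄ only requires solving the scalar fixed point in δ, e.g. by plain iteration. A sketch; starting the iteration from δ = 0 is an assumption here (it converges for γ > 0 since the map is increasing and bounded):

```python
import numpy as np

def qbar(Phi, n, T, gamma, tol=1e-12, max_iter=10_000):
    """Solve delta = (1/T) tr(Phi Qbar), Qbar = ((n/T) Phi/(1+delta) + gamma I)^{-1}."""
    delta = 0.0
    for _ in range(max_iter):
        Q = np.linalg.inv((n / T) * Phi / (1.0 + delta) + gamma * np.eye(T))
        delta_new = np.trace(Phi @ Q) / T
        if abs(delta_new - delta) < tol:
            delta = delta_new
            break
        delta = delta_new
    # rebuild Qbar at the converged delta so the returned pair is consistent
    Q = np.linalg.inv((n / T) * Phi / (1.0 + delta) + gamma * np.eye(T))
    return Q, delta
```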
129 Applications/Random Feature Maps, Extreme Learning Machines, and Neural Networks 62/80 Main Technical Result
Values of Φ(a, b) for w ∼ N(0, I_p), where ∠(a, b) = a^T b / (||a|| ||b||):
σ(t) = max(t, 0): Φ(a, b) = (1/(2π)) ||a|| ||b|| ( ∠(a, b) acos(−∠(a, b)) + √(1 − ∠(a, b)²) )
σ(t) = |t|: Φ(a, b) = (2/π) ||a|| ||b|| ( ∠(a, b) asin(∠(a, b)) + √(1 − ∠(a, b)²) )
σ(t) = erf(t): Φ(a, b) = (2/π) asin( 2 a^T b / √((1 + 2||a||²)(1 + 2||b||²)) )
σ(t) = 1_{t>0}: Φ(a, b) = 1/2 − (1/(2π)) acos(∠(a, b))
σ(t) = sign(t): Φ(a, b) = 1 − (2/π) acos(∠(a, b))
σ(t) = cos(t): Φ(a, b) = exp(−(||a||² + ||b||²)/2) cosh(a^T b)
Value of Φ(a, b) for w_i i.i.d. with E[w_i^k] = m_k (m_1 = 0) and σ(t) = ζ_2 t² + ζ_1 t + ζ_0:
Φ(a, b) = ζ_2² [ m_2² (2(a^T b)² + ||a||² ||b||²) + (m_4 − 3m_2²)(a²)^T(b²) ] + ζ_1² m_2 a^T b + ζ_2 ζ_1 m_3 [ (a²)^T b + a^T(b²) ] + ζ_2 ζ_0 m_2 [ ||a||² + ||b||² ] + ζ_0²
where (a²) = [a_1², ..., a_p²]^T.
62 / 80
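Each closed form is straightforward to sanity-check by Monte Carlo; here the ReLU entry (the sample size and tolerance are arbitrary choices of this sketch):

```python
import numpy as np

def phi_relu(a, b):
    """Phi(a, b) = E[max(w^T a, 0) max(w^T b, 0)], w ~ N(0, I_p)."""
    na, nb = np.linalg.norm(a), np.linalg.norm(b)
    ang = a @ b / (na * nb)                    # angle(a, b)
    return na * nb / (2.0 * np.pi) * (ang * np.arccos(-ang)
                                      + np.sqrt(1.0 - ang ** 2))

rng = np.random.default_rng(0)
a, b = rng.standard_normal(5), rng.standard_normal(5)
W = rng.standard_normal((1_000_000, 5))        # one million samples of w
mc = np.mean(np.maximum(W @ a, 0.0) * np.maximum(W @ b, 0.0))
# mc and phi_relu(a, b) agree up to the O(1/sqrt(10^6)) Monte Carlo error
```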
130 Applications/Random Feature Maps, Extreme Learning Machines, and Neural Networks 63/80 Main Results
Theorem [Asymptotic E_train]
For all ε > 0, almost surely, n^{1/2−ε} (E_train − Ē_train) → 0, where
E_train = (1/T) ||Y^T − Σ^T β||_F² = (γ²/T) tr Y^T Y Q²
Ē_train = (γ²/T) tr Y^T Y Q̄ [ ( (1/n) tr Ψ Q̄² / (1 − (1/n) tr(Ψ Q̄)²) ) Ψ + I_T ] Q̄
with Ψ = (n/T) Φ/(1 + δ).
63 / 80
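Ē_train is fully computable from Φ alone. A sketch combining the fixed point in δ with the formula above; the plain fixed-point iteration and the product-trace reading of tr(ΨQ̄)² are assumptions of this sketch:

```python
import numpy as np

def etrain_bar(Phi, Y, n, T, gamma, iters=2000):
    """Deterministic equivalent of the training MSE."""
    delta = 0.0
    for _ in range(iters):                 # fixed point: delta = (1/T) tr(Phi Qbar)
        Qb = np.linalg.inv((n / T) * Phi / (1.0 + delta) + gamma * np.eye(T))
        delta = np.trace(Phi @ Qb) / T
    Psi = (n / T) * Phi / (1.0 + delta)
    coef = (np.trace(Psi @ Qb @ Qb) / n) \
        / (1.0 - np.trace(Psi @ Qb @ Psi @ Qb) / n)
    M = Qb @ (coef * Psi + np.eye(T)) @ Qb
    return gamma ** 2 / T * np.trace(Y.T @ Y @ M)
```

As a quick consistency check, for γ → ∞ the estimator β vanishes and Ē_train tends to (1/T)||Y||_F².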
131 Applications/Random Feature Maps, Extreme Learning Machines, and Neural Networks 64/80 Main Results
Letting X̂ ∈ R^{p×T̂}, Ŷ ∈ R^{d×T̂} satisfy similar properties as (X, Y),
Claim [Asymptotic E_test]
For all ε > 0, almost surely, n^{1/2−ε} (E_test − Ē_test) → 0, where
E_test = (1/T̂) ||Ŷ^T − Σ̂^T β||_F²
Ē_test = (1/T̂) ||Ŷ^T − Ψ_{X X̂}^T Q̄ Y^T||_F² + ( (1/n) tr Y^T Y Q̄ Ψ Q̄ / (1 − (1/n) tr(Ψ Q̄)²) ) [ (1/T̂) tr Ψ_{X̂ X̂} − (1/T̂) tr (I_T + γ Q̄)(Ψ_{X X̂} Ψ_{X̂ X} Q̄) ]
with Ψ_{AB} = (n/T) Φ_{AB}/(1 + δ), Φ_{AB} = E[σ(A^T w) σ(w^T B)].
64 / 80
136 Applications/Random Feature Maps, Extreme Learning Machines, and Neural Networks 65/80 Simulations on MNIST: Lipschitz σ(·)
Figure: Neural network performance (MSE: Ē_train, Ē_test versus E_train, E_test) for Lipschitz continuous σ(·), with σ(t) ∈ {t, |t|, erf(t), max(t, 0)}, as a function of γ, for 2-class MNIST data (sevens, nines), n = 512, T = T̂ = 1024, p = 784.
65 / 80
137 Applications/Random Feature Maps, Extreme Learning Machines, and Neural Networks 66/80 Simulations on MNIST: non-Lipschitz σ(·)
Figure: Neural network performance (MSE: Ē_train, Ē_test versus E_train, E_test) for σ(·) either discontinuous or non-Lipschitz, with σ(t) ∈ {t², 1_{t>0}, sign(t)}, as a function of γ, for 2-class MNIST data (sevens, nines), n = 512, T = T̂ = 1024, p = 784.
66 / 80
140 Applications/Random Feature Maps, Extreme Learning Machines, and Neural Networks 67/80 Simulations on tuned Gaussian mixture
Gaussian mixture classification
X = [X_1, X_2], with {X_1}_{·,i} ∼ N(0, C_1), {X_2}_{·,i} ∼ N(0, C_2), tr C_1 = tr C_2.
We can prove that, for σ(t) = ζ_2 t² + ζ_1 t + ζ_0 and E[W_ij^k] = m_k:
Classification is only possible if m_4 ≠ 3m_2².
Figure: MSE performance (Ē_train, Ē_test versus E_train, E_test) as a function of γ, for W_ij Bernoulli, W_ij ∼ N(0, 1), and W_ij Student-t.
67 / 80
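The moment condition m_4 ≠ 3m_2² is a non-zero excess-kurtosis requirement on the entries of W, matching the (m_4 − 3m_2²) coefficient of the informative term in Φ. It is easy to check for the three weight distributions of the figure (the Student-t degrees of freedom below are an arbitrary illustrative choice):

```python
def info_coefficient(m2, m4):
    """Coefficient m4 - 3*m2^2 of the informative (a^2)^T (b^2) term in Phi."""
    return m4 - 3.0 * m2 ** 2

# Bernoulli +/-1:  m2 = 1, m4 = 1 -> coefficient -2 (classification possible)
# Gaussian N(0,1): m2 = 1, m4 = 3 -> coefficient  0 (classification impossible)
# Student-t, nu = 7, normalized to unit variance: m4 = 3(nu - 2)/(nu - 4) = 5
nu = 7.0
checks = [info_coefficient(1.0, 1.0),
          info_coefficient(1.0, 3.0),
          info_coefficient(1.0, 3.0 * (nu - 2.0) / (nu - 4.0))]
# checks == [-2.0, 0.0, 2.0]
```

This is consistent with the figure: Gaussian weights are exactly the case where the informative term vanishes.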
Eigenvector stability: Random Matrix Theory and Financial Applications J.P Bouchaud with: R. Allez, M. Potters http://www.cfm.fr Portfolio theory: Basics Portfolio weights w i, Asset returns X t i If expected/predicted
More informationGeneralization theory
Generalization theory Daniel Hsu Columbia TRIPODS Bootcamp 1 Motivation 2 Support vector machines X = R d, Y = { 1, +1}. Return solution ŵ R d to following optimization problem: λ min w R d 2 w 2 2 + 1
More informationFree deconvolution for signal processing applications
IEEE TRANSACTIONS ON INFORMATION THEORY, VOL., NO., JANUARY 27 Free deconvolution for signal processing applications Øyvind Ryan, Member, IEEE, Mérouane Debbah, Member, IEEE arxiv:cs.it/725v 4 Jan 27 Abstract
More informationFast learning rates for plug-in classifiers under the margin condition
Fast learning rates for plug-in classifiers under the margin condition Jean-Yves Audibert 1 Alexandre B. Tsybakov 2 1 Certis ParisTech - Ecole des Ponts, France 2 LPMA Université Pierre et Marie Curie,
More informationA Magiv CV Theory for Large-Margin Classifiers
A Magiv CV Theory for Large-Margin Classifiers Hui Zou School of Statistics, University of Minnesota June 30, 2018 Joint work with Boxiang Wang Outline 1 Background 2 Magic CV formula 3 Magic support vector
More informationStatistical Machine Learning
Statistical Machine Learning Christoph Lampert Spring Semester 2015/2016 // Lecture 12 1 / 36 Unsupervised Learning Dimensionality Reduction 2 / 36 Dimensionality Reduction Given: data X = {x 1,..., x
More informationLearning the Semantic Correlation: An Alternative Way to Gain from Unlabeled Text
Learning the Semantic Correlation: An Alternative Way to Gain from Unlabeled Text Yi Zhang Machine Learning Department Carnegie Mellon University yizhang1@cs.cmu.edu Jeff Schneider The Robotics Institute
More informationAn indicator for the number of clusters using a linear map to simplex structure
An indicator for the number of clusters using a linear map to simplex structure Marcus Weber, Wasinee Rungsarityotin, and Alexander Schliep Zuse Institute Berlin ZIB Takustraße 7, D-495 Berlin, Germany
More informationClustering VS Classification
MCQ Clustering VS Classification 1. What is the relation between the distance between clusters and the corresponding class discriminability? a. proportional b. inversely-proportional c. no-relation Ans:
More informationSTAT 992 Paper Review: Sure Independence Screening in Generalized Linear Models with NP-Dimensionality J.Fan and R.Song
STAT 992 Paper Review: Sure Independence Screening in Generalized Linear Models with NP-Dimensionality J.Fan and R.Song Presenter: Jiwei Zhao Department of Statistics University of Wisconsin Madison April
More informationLearning Task Grouping and Overlap in Multi-Task Learning
Learning Task Grouping and Overlap in Multi-Task Learning Abhishek Kumar Hal Daumé III Department of Computer Science University of Mayland, College Park 20 May 2013 Proceedings of the 29 th International
More informationEmpirical properties of large covariance matrices in finance
Empirical properties of large covariance matrices in finance Ex: RiskMetrics Group, Geneva Since 2010: Swissquote, Gland December 2009 Covariance and large random matrices Many problems in finance require
More informationClassification 2: Linear discriminant analysis (continued); logistic regression
Classification 2: Linear discriminant analysis (continued); logistic regression Ryan Tibshirani Data Mining: 36-462/36-662 April 4 2013 Optional reading: ISL 4.4, ESL 4.3; ISL 4.3, ESL 4.4 1 Reminder:
More informationMore Spectral Clustering and an Introduction to Conjugacy
CS8B/Stat4B: Advanced Topics in Learning & Decision Making More Spectral Clustering and an Introduction to Conjugacy Lecturer: Michael I. Jordan Scribe: Marco Barreno Monday, April 5, 004. Back to spectral
More informationLecture 7 Introduction to Statistical Decision Theory
Lecture 7 Introduction to Statistical Decision Theory I-Hsiang Wang Department of Electrical Engineering National Taiwan University ihwang@ntu.edu.tw December 20, 2016 1 / 55 I-Hsiang Wang IT Lecture 7
More informationLinear Algebra and Dirac Notation, Pt. 2
Linear Algebra and Dirac Notation, Pt. 2 PHYS 500 - Southern Illinois University February 1, 2017 PHYS 500 - Southern Illinois University Linear Algebra and Dirac Notation, Pt. 2 February 1, 2017 1 / 14
More informationExamples include: (a) the Lorenz system for climate and weather modeling (b) the Hodgkin-Huxley system for neuron modeling
1 Introduction Many natural processes can be viewed as dynamical systems, where the system is represented by a set of state variables and its evolution governed by a set of differential equations. Examples
More informationConvergence of Eigenspaces in Kernel Principal Component Analysis
Convergence of Eigenspaces in Kernel Principal Component Analysis Shixin Wang Advanced machine learning April 19, 2016 Shixin Wang Convergence of Eigenspaces April 19, 2016 1 / 18 Outline 1 Motivation
More informationFinding normalized and modularity cuts by spectral clustering. Ljubjana 2010, October
Finding normalized and modularity cuts by spectral clustering Marianna Bolla Institute of Mathematics Budapest University of Technology and Economics marib@math.bme.hu Ljubjana 2010, October Outline Find
More informationEstimation of large dimensional sparse covariance matrices
Estimation of large dimensional sparse covariance matrices Department of Statistics UC, Berkeley May 5, 2009 Sample covariance matrix and its eigenvalues Data: n p matrix X n (independent identically distributed)
More informationUnsupervised dimensionality reduction
Unsupervised dimensionality reduction Guillaume Obozinski Ecole des Ponts - ParisTech SOCN course 2014 Guillaume Obozinski Unsupervised dimensionality reduction 1/30 Outline 1 PCA 2 Kernel PCA 3 Multidimensional
More informationModel Specification Testing in Nonparametric and Semiparametric Time Series Econometrics. Jiti Gao
Model Specification Testing in Nonparametric and Semiparametric Time Series Econometrics Jiti Gao Department of Statistics School of Mathematics and Statistics The University of Western Australia Crawley
More informationUnlabeled Data: Now It Helps, Now It Doesn t
institution-logo-filena A. Singh, R. D. Nowak, and X. Zhu. In NIPS, 2008. 1 Courant Institute, NYU April 21, 2015 Outline institution-logo-filena 1 Conflicting Views in Semi-supervised Learning The Cluster
More informationThe Informativeness of k-means for Learning Mixture Models
The Informativeness of k-means for Learning Mixture Models Vincent Y. F. Tan (Joint work with Zhaoqiang Liu) National University of Singapore June 18, 2018 1/35 Gaussian distribution For F dimensions,
More information1 EM algorithm: updating the mixing proportions {π k } ik are the posterior probabilities at the qth iteration of EM.
Université du Sud Toulon - Var Master Informatique Probabilistic Learning and Data Analysis TD: Model-based clustering by Faicel CHAMROUKHI Solution The aim of this practical wor is to show how the Classification
More informationECE 521. Lecture 11 (not on midterm material) 13 February K-means clustering, Dimensionality reduction
ECE 521 Lecture 11 (not on midterm material) 13 February 2017 K-means clustering, Dimensionality reduction With thanks to Ruslan Salakhutdinov for an earlier version of the slides Overview K-means clustering
More informationExponential tail inequalities for eigenvalues of random matrices
Exponential tail inequalities for eigenvalues of random matrices M. Ledoux Institut de Mathématiques de Toulouse, France exponential tail inequalities classical theme in probability and statistics quantify
More informationUnsupervised Learning
2018 EE448, Big Data Mining, Lecture 7 Unsupervised Learning Weinan Zhang Shanghai Jiao Tong University http://wnzhang.net http://wnzhang.net/teaching/ee448/index.html ML Problem Setting First build and
More informationAssumptions for the theoretical properties of the
Statistica Sinica: Supplement IMPUTED FACTOR REGRESSION FOR HIGH- DIMENSIONAL BLOCK-WISE MISSING DATA Yanqing Zhang, Niansheng Tang and Annie Qu Yunnan University and University of Illinois at Urbana-Champaign
More informationConvex Optimization M2
Convex Optimization M2 Lecture 8 A. d Aspremont. Convex Optimization M2. 1/57 Applications A. d Aspremont. Convex Optimization M2. 2/57 Outline Geometrical problems Approximation problems Combinatorial
More informationKnowledge Discovery and Data Mining 1 (VO) ( )
Knowledge Discovery and Data Mining 1 (VO) (707.003) Review of Linear Algebra Denis Helic KTI, TU Graz Oct 9, 2014 Denis Helic (KTI, TU Graz) KDDM1 Oct 9, 2014 1 / 74 Big picture: KDDM Probability Theory
More informationWhat is semi-supervised learning?
What is semi-supervised learning? In many practical learning domains, there is a large supply of unlabeled data but limited labeled data, which can be expensive to generate text processing, video-indexing,
More informationCheng Soon Ong & Christian Walder. Canberra February June 2018
Cheng Soon Ong & Christian Walder Research Group and College of Engineering and Computer Science Canberra February June 2018 Outlines Overview Introduction Linear Algebra Probability Linear Regression
More informationThe circular law. Lewis Memorial Lecture / DIMACS minicourse March 19, Terence Tao (UCLA)
The circular law Lewis Memorial Lecture / DIMACS minicourse March 19, 2008 Terence Tao (UCLA) 1 Eigenvalue distributions Let M = (a ij ) 1 i n;1 j n be a square matrix. Then one has n (generalised) eigenvalues
More informationDissertation Defense
Clustering Algorithms for Random and Pseudo-random Structures Dissertation Defense Pradipta Mitra 1 1 Department of Computer Science Yale University April 23, 2008 Mitra (Yale University) Dissertation
More informationCovariance function estimation in Gaussian process regression
Covariance function estimation in Gaussian process regression François Bachoc Department of Statistics and Operations Research, University of Vienna WU Research Seminar - May 2015 François Bachoc Gaussian
More informationLecture 8 : Eigenvalues and Eigenvectors
CPS290: Algorithmic Foundations of Data Science February 24, 2017 Lecture 8 : Eigenvalues and Eigenvectors Lecturer: Kamesh Munagala Scribe: Kamesh Munagala Hermitian Matrices It is simpler to begin with
More informationPROBABILITY: LIMIT THEOREMS II, SPRING HOMEWORK PROBLEMS
PROBABILITY: LIMIT THEOREMS II, SPRING 218. HOMEWORK PROBLEMS PROF. YURI BAKHTIN Instructions. You are allowed to work on solutions in groups, but you are required to write up solutions on your own. Please
More informationVector spaces. DS-GA 1013 / MATH-GA 2824 Optimization-based Data Analysis.
Vector spaces DS-GA 1013 / MATH-GA 2824 Optimization-based Data Analysis http://www.cims.nyu.edu/~cfgranda/pages/obda_fall17/index.html Carlos Fernandez-Granda Vector space Consists of: A set V A scalar
More information6-1. Canonical Correlation Analysis
6-1. Canonical Correlation Analysis Canonical Correlatin analysis focuses on the correlation between a linear combination of the variable in one set and a linear combination of the variables in another
More informationThe University of Texas at Austin Department of Electrical and Computer Engineering. EE381V: Large Scale Learning Spring 2013.
The University of Texas at Austin Department of Electrical and Computer Engineering EE381V: Large Scale Learning Spring 2013 Assignment 1 Caramanis/Sanghavi Due: Thursday, Feb. 7, 2013. (Problems 1 and
More information