A Random Matrix Framework for BigData Machine Learning and Applications to Wireless Communications


1 A Random Matrix Framework for BigData Machine Learning and Applications to Wireless Communications (EURECOM) Romain COUILLET, CentraleSupélec, France. June. 1 / 80

2 Outline Basics of Random Matrix Theory Motivation: Large Sample Covariance Matrices Spiked Models Applications Reminder on Spectral Clustering Methods Kernel Spectral Clustering The case f (τ) = 0: an application to massive MIMO user clustering Semi-supervised Learning Random Feature Maps, Extreme Learning Machines, and Neural Networks Perspectives 2 / 80

3 Basics of Random Matrix Theory/ 3/80 Outline Basics of Random Matrix Theory Motivation: Large Sample Covariance Matrices Spiked Models Applications Reminder on Spectral Clustering Methods Kernel Spectral Clustering The case f (τ) = 0: an application to massive MIMO user clustering Semi-supervised Learning Random Feature Maps, Extreme Learning Machines, and Neural Networks Perspectives 3 / 80

4 Basics of Random Matrix Theory/Motivation: Large Sample Covariance Matrices 4/80 Outline Basics of Random Matrix Theory Motivation: Large Sample Covariance Matrices Spiked Models Applications Reminder on Spectral Clustering Methods Kernel Spectral Clustering The case f (τ) = 0: an application to massive MIMO user clustering Semi-supervised Learning Random Feature Maps, Extreme Learning Machines, and Neural Networks Perspectives 4 / 80

5 Basics of Random Matrix Theory/Motivation: Large Sample Covariance Matrices 5/80 Context
Baseline scenario: $y_1,\dots,y_n\in\mathbb{C}^p$ (or $\mathbb{R}^p$) i.i.d. with $E[y_1]=0$, $E[y_1y_1^*]=C_p$:
If $y_1\sim\mathcal N(0,C_p)$, the ML estimator for $C_p$ is the sample covariance matrix (SCM)
$\hat C_p = \frac1n Y_pY_p^* = \frac1n\sum_{i=1}^n y_iy_i^*$  ($Y_p=[y_1,\dots,y_n]\in\mathbb{C}^{p\times n}$).
If $n\to\infty$, then, by the strong law of large numbers, $\hat C_p\xrightarrow{\rm a.s.}C_p$, or equivalently, in spectral norm, $\|\hat C_p-C_p\|\xrightarrow{\rm a.s.}0$.
Random Matrix Regime
No longer valid if $p,n\to\infty$ with $p/n\to c\in(0,\infty)$: $\|\hat C_p-C_p\|\not\to 0$.
For practical $p,n$ with $p\simeq n$, this leads to dramatically wrong conclusions.
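As a quick numerical illustration of this regime (a minimal numpy sketch, not part of the original slides; the dimensions are illustrative only), one can check that $\|\hat C_p - C_p\|$ vanishes when $n$ grows with $p$ fixed, but stalls when $p$ and $n$ grow proportionally:

```python
import numpy as np

rng = np.random.default_rng(0)

def scm_error(p, n):
    # C_p = I_p, y_i ~ N(0, I_p); sample covariance hat(C)_p = (1/n) Y Y^T
    Y = rng.standard_normal((p, n))
    C_hat = Y @ Y.T / n
    # spectral-norm distance to the true covariance I_p
    return np.linalg.norm(C_hat - np.eye(p), 2)

# n -> infinity with p fixed: the error vanishes (law of large numbers)
print([round(scm_error(50, n), 3) for n in (500, 5000, 50000)])
# p, n -> infinity with p/n = 1/4 fixed: the error does not vanish
print([round(scm_error(p, 4 * p), 3) for p in (125, 500, 2000)])
```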

10 Basics of Random Matrix Theory/Motivation: Large Sample Covariance Matrices 6/80 The Marčenko–Pastur law
Figure: Histogram of the eigenvalues of $\hat C_p$ (empirical eigenvalue distribution versus the Marčenko–Pastur law) for $p=500$, $n=2000$, $C_p=I_p$.

11 Basics of Random Matrix Theory/Motivation: Large Sample Covariance Matrices 7/80 The Marčenko–Pastur law
Definition (Empirical Spectral Density). The empirical spectral density (e.s.d.) $\mu_p$ of a Hermitian matrix $A_p\in\mathbb{C}^{p\times p}$ is $\mu_p = \frac1p\sum_{i=1}^p \delta_{\lambda_i(A_p)}$.
Theorem (Marčenko–Pastur Law [Marčenko, Pastur 67]). Let $X_p\in\mathbb{C}^{p\times n}$ have i.i.d. zero mean, unit variance entries. As $p,n\to\infty$ with $p/n\to c\in(0,\infty)$, the e.s.d. $\mu_p$ of $\frac1n X_pX_p^*$ satisfies $\mu_p\xrightarrow{\rm a.s.}\mu_c$ weakly, where
$\mu_c(\{0\}) = \max\{0,\,1-c^{-1}\}$
and, on $(0,\infty)$, $\mu_c$ has a continuous density $f_c$ supported on $[(1-\sqrt c)^2,(1+\sqrt c)^2]$,
$f_c(x) = \frac{1}{2\pi c x}\sqrt{(x-(1-\sqrt c)^2)\,((1+\sqrt c)^2-x)}.$
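The density above is easy to compare against a simulated histogram; the following sketch (not from the slides, with the same illustrative values $p=500$, $n=2000$ as the preceding figure) evaluates $f_c$ at the histogram bin centres:

```python
import numpy as np

p, n = 500, 2000
c = p / n
rng = np.random.default_rng(1)

X = rng.standard_normal((p, n))            # i.i.d. zero mean, unit variance entries
eigs = np.linalg.eigvalsh(X @ X.T / n)     # eigenvalues of (1/n) X_p X_p^*

# Marcenko-Pastur density on its support [(1-sqrt(c))^2, (1+sqrt(c))^2]
lo, hi = (1 - np.sqrt(c)) ** 2, (1 + np.sqrt(c)) ** 2
f_c = lambda x: np.sqrt((x - lo) * (hi - x)) / (2 * np.pi * c * x)

# histogram heights should match f_c at the bin centres
heights, edges = np.histogram(eigs, bins=20, range=(lo, hi), density=True)
centres = (edges[:-1] + edges[1:]) / 2
print(np.round(np.c_[centres, heights, f_c(centres)], 3))
```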

14 Basics of Random Matrix Theory/Motivation: Large Sample Covariance Matrices 8/80 The Marčenko–Pastur law
Figure: Marčenko–Pastur density $f_c(x)$ for different limit ratios $c = \lim_{p\to\infty} p/n$ (e.g., $c = 0.1$).

17 Basics of Random Matrix Theory/Spiked Models 9/80 Outline Basics of Random Matrix Theory Motivation: Large Sample Covariance Matrices Spiked Models Applications Reminder on Spectral Clustering Methods Kernel Spectral Clustering The case f (τ) = 0: an application to massive MIMO user clustering Semi-supervised Learning Random Feature Maps, Extreme Learning Machines, and Neural Networks Perspectives 9 / 80

18 Basics of Random Matrix Theory/Spiked Models 10/80 Spiked Models
Small rank perturbation: $C_p = I_p + P$, $P$ of low rank.
Figure: Eigenvalues $\{\lambda_i\}_{i=1}^n$ of $\frac1n Y_pY_p^*$ versus the limiting law $\mu$; the isolated eigenvalues appear near $1+\omega_1+c\frac{1+\omega_1}{\omega_1}$ and $1+\omega_2+c\frac{1+\omega_2}{\omega_2}$. Here $C_p = \mathrm{diag}(\underbrace{1,\dots,1}_{p-4},2,2,3,3)$, $p = 500$.

19 Basics of Random Matrix Theory/Spiked Models 11/80 Spiked Models
Theorem (Eigenvalues [Baik, Silverstein 06]). Let $Y_p = C_p^{1/2}X_p$, with $X_p$ having i.i.d. zero mean, unit variance entries, $E[|X_p|_{ij}^4]<\infty$, and $C_p = I_p+P$, $P = U\Omega U^*$, where, for $K$ fixed, $\Omega = \mathrm{diag}(\omega_1,\dots,\omega_K)\in\mathbb{R}^{K\times K}$, with $\omega_1\ge\dots\ge\omega_K>0$. Then, as $p,n\to\infty$ with $p/n\to c\in(0,\infty)$, denoting $\lambda_m = \lambda_m(\frac1n Y_pY_p^*)$ ($\lambda_m\ge\lambda_{m+1}$),
$\lambda_m \xrightarrow{\rm a.s.} \begin{cases} 1+\omega_m+c\,\frac{1+\omega_m}{\omega_m} \;>\; (1+\sqrt c)^2, & \omega_m > \sqrt c,\\ (1+\sqrt c)^2, & \omega_m\in(0,\sqrt c]. \end{cases}$
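A minimal numerical check of this phase transition (a sketch under the same Gaussian setting; the spike values and dimensions are illustrative, not those of the slides):

```python
import numpy as np

p, n = 500, 2000
c = p / n
rng = np.random.default_rng(2)

for omega in (0.2, 1.0, 3.0):              # below and above the threshold sqrt(c) = 0.5
    C = np.eye(p)
    C[0, 0] = 1 + omega                    # C_p = I_p + omega * u u^T, with u = e_1
    Y = np.linalg.cholesky(C) @ rng.standard_normal((p, n))
    top = np.linalg.eigvalsh(Y @ Y.T / n)[-1]
    predicted = 1 + omega + c * (1 + omega) / omega if omega > np.sqrt(c) else (1 + np.sqrt(c)) ** 2
    print(f"omega={omega}: largest eigenvalue={top:.3f}, asymptotic value={predicted:.3f}")
```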

21 Basics of Random Matrix Theory/Spiked Models 12/80 Spiked Models
Theorem (Eigenvectors [Paul 07]). Let $Y_p = C_p^{1/2}X_p$, with $X_p$ having i.i.d. zero mean, unit variance entries, $E[|X_p|_{ij}^4]<\infty$, and $C_p = I_p+P$, $P = U\Omega U^* = \sum_{i=1}^K\omega_iu_iu_i^*$, $\omega_1>\dots>\omega_K>0$. Then, as $p,n\to\infty$ with $p/n\to c\in(0,\infty)$, for $a,b\in\mathbb{C}^p$ deterministic and $\hat u_i$ the eigenvector associated with $\lambda_i(\frac1n Y_pY_p^*)$,
$a^*\hat u_i\hat u_i^* b \;-\; \frac{1-c\,\omega_i^{-2}}{1+c\,\omega_i^{-1}}\,a^*u_iu_i^*b\;1_{\omega_i>\sqrt c} \xrightarrow{\rm a.s.} 0.$
In particular, $|\hat u_i^*u_i|^2 \xrightarrow{\rm a.s.} \frac{1-c\,\omega_i^{-2}}{1+c\,\omega_i^{-1}}\,1_{\omega_i>\sqrt c}$.

23 Basics of Random Matrix Theory/Spiked Models 13/80 Spiked Models
Figure: Simulated versus limiting $|\hat u_1^*u_1|^2$ for $Y_p = C_p^{1/2}X_p$, $C_p = I_p+\omega_1u_1u_1^*$, $p/n = 1/3$, varying the population spike $\omega_1$ (curves for $p = 100$, $p = 200$, and the limit $\frac{1-c/\omega_1^2}{1+c/\omega_1}$).

24 Basics of Random Matrix Theory/Spiked Models 14/80 Other Spiked Models
Similar results hold for multiple matrix models:
$Y_p = \frac1n X_pX_p^* + P$
$Y_p = \frac1n X_p^*(I+P)X_p$
$Y_p = \frac1n (X_p+P)^*(X_p+P)$
$Y_p = \frac1n T X_p^*(I+P)X_pT$
etc.

25 Applications/ 15/80 Outline Basics of Random Matrix Theory Motivation: Large Sample Covariance Matrices Spiked Models Applications Reminder on Spectral Clustering Methods Kernel Spectral Clustering The case f (τ) = 0: an application to massive MIMO user clustering Semi-supervised Learning Random Feature Maps, Extreme Learning Machines, and Neural Networks Perspectives 15 / 80

26 Applications/Reminder on Spectral Clustering Methods 16/80 Outline Basics of Random Matrix Theory Motivation: Large Sample Covariance Matrices Spiked Models Applications Reminder on Spectral Clustering Methods Kernel Spectral Clustering The case f (τ) = 0: an application to massive MIMO user clustering Semi-supervised Learning Random Feature Maps, Extreme Learning Machines, and Neural Networks Perspectives 16 / 80

27 Applications/Reminder on Spectral Clustering Methods 17/80 Reminder on Spectral Clustering Methods
Context: Two-step classification of $n$ objects based on a similarity matrix $A\in\mathbb{R}^{n\times n}$ (see the sketch below):
- isolated (spiked) eigenvalues of $A$ → associated eigenvectors (in practice, the data come shuffled);
- $\ell$-dimensional representation of the data by these eigenvectors, e.g., (Eigenvector 1, Eigenvector 2) (shuffling no longer matters);
- EM or k-means clustering.
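The two-step procedure can be summarized by the short sketch below (a generic implementation, not the specific algorithm of the slides; it assumes a precomputed symmetric similarity matrix A and uses k-means from scikit-learn for the last step):

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_clustering(A, k):
    """Cluster n objects from a symmetric similarity matrix A (n x n) into k classes."""
    eigval, eigvec = np.linalg.eigh(A)
    V = eigvec[:, -k:]                 # eigenvectors of the k dominant (spiked) eigenvalues
    # rows of V give the low-dimensional representation; shuffling of the data no longer matters
    return KMeans(n_clusters=k, n_init=10).fit_predict(V)
```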

32 Applications/Kernel Spectral Clustering 19/80 Outline Basics of Random Matrix Theory Motivation: Large Sample Covariance Matrices Spiked Models Applications Reminder on Spectral Clustering Methods Kernel Spectral Clustering The case f (τ) = 0: an application to massive MIMO user clustering Semi-supervised Learning Random Feature Maps, Extreme Learning Machines, and Neural Networks Perspectives 19 / 80

33 Applications/Kernel Spectral Clustering 20/80 Kernel Spectral Clustering
Problem Statement
Dataset $x_1,\dots,x_n\in\mathbb{R}^p$. Objective: cluster the data into $k$ similarity classes $\mathcal C_1,\dots,\mathcal C_k$.
Kernel spectral clustering is based on the kernel matrix $K = \{\kappa(x_i,x_j)\}_{i,j=1}^n$.
Usually, $\kappa(x,y) = f(x^Ty)$ or $\kappa(x,y) = f(\|x-y\|^2)$.
Refinements: instead of $K$, use $D-K$, $I_n-D^{-1}K$, $I_n-D^{-1/2}KD^{-1/2}$, etc.; several-step algorithms: Ng–Jordan–Weiss, Shi–Malik, etc.
Intuition (from small dimensions): $K$ is essentially low rank, with the class structure in its eigenvectors.

38 Applications/Kernel Spectral Clustering 21/80 Kernel Spectral Clustering
Figure: Leading four eigenvectors of $D^{-1/2}KD^{-1/2}$ for MNIST data.

42 Applications/Kernel Spectral Clustering 22/80 Model and Assumptions
Gaussian mixture model: $x_1,\dots,x_n\in\mathbb{R}^p$, $k$ classes $\mathcal C_1,\dots,\mathcal C_k$, $x_1,\dots,x_{n_1}\in\mathcal C_1$, ..., $x_{n-n_k+1},\dots,x_n\in\mathcal C_k$, $x_i\sim\mathcal N(\mu_{g_i},C_{g_i})$.
Assumption (Convergence Rate). As $n\to\infty$,
1. Data scaling: $\frac pn\to c_0\in(0,\infty)$,
2. Class scaling: $\frac{n_a}n\to c_a\in(0,1)$,
3. Mean scaling: with $\mu\triangleq\sum_{a=1}^k\frac{n_a}n\mu_a$ and $\mu_a^\circ\triangleq\mu_a-\mu$, $\|\mu_a^\circ\| = O(1)$,
4. Covariance scaling: with $C\triangleq\sum_{a=1}^k\frac{n_a}n C_a$ and $C_a^\circ\triangleq C_a-C$, $\|C_a\| = O(1)$, $\mathrm{tr}\,C_a^\circ = O(\sqrt p)$, $\mathrm{tr}\,C_a^\circ C_b^\circ = O(p)$.
Remark: for 2 classes, this reads $\|\mu_1-\mu_2\| = O(1)$, $\mathrm{tr}(C_1-C_2) = O(\sqrt p)$, $\|C_i\| = O(1)$, $\mathrm{tr}([C_1-C_2]^2) = O(p)$.

45 Applications/Kernel Spectral Clustering 23/80 Model and Assumptions
Kernel Matrix. Kernel matrix of interest:
$K = \left\{ f\!\left(\frac1p\|x_i-x_j\|^2\right)\right\}_{i,j=1}^n$
for some sufficiently smooth nonnegative $f$ (the case $f(\frac1p x_i^Tx_j)$ is simpler).
We study the normalized Laplacian:
$L = nD^{-1/2}\left(K-\frac{dd^T}{d^T1_n}\right)D^{-1/2}$, with $d = K1_n$, $D = \mathrm{diag}(d)$.
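For concreteness, the kernel matrix and the matrix $L$ above can be formed as follows (a direct transcription of the two displayed formulas into numpy; the Gaussian-like choice of $f$ is only an example):

```python
import numpy as np

def normalized_laplacian(X, f=lambda t: np.exp(-t / 2)):
    """X is p x n (one column per data point).
    Returns K = {f(||x_i - x_j||^2 / p)} and L = n D^{-1/2}(K - d d^T / (d^T 1_n)) D^{-1/2}."""
    p, n = X.shape
    sq = np.sum(X ** 2, axis=0)
    dist2 = sq[:, None] + sq[None, :] - 2 * X.T @ X     # ||x_i - x_j||^2
    K = f(dist2 / p)
    d = K @ np.ones(n)
    D_inv_sqrt = 1 / np.sqrt(d)
    L = n * D_inv_sqrt[:, None] * (K - np.outer(d, d) / d.sum()) * D_inv_sqrt[None, :]
    return K, L
```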

47 Applications/Kernel Spectral Clustering 24/80 Random Matrix Equivalent
Key Remark: Under our assumptions, uniformly over $i\neq j\in\{1,\dots,n\}$,
$\frac1p\|x_i-x_j\|^2\xrightarrow{\rm a.s.}\tau>0.$
This allows a Taylor expansion of $K$:
$K = \underbrace{f(\tau)1_n1_n^T}_{O_{\|\cdot\|}(n)} + \underbrace{\sqrt n\,K_1}_{\text{low rank},\ O_{\|\cdot\|}(\sqrt n)} + \underbrace{K_2}_{\text{informative terms},\ O_{\|\cdot\|}(1)}$
However, this is not the (small dimension) intuitive behavior.

50 Applications/Kernel Spectral Clustering 25/80 Random Matrix Equivalent
Theorem (Random Matrix Equivalent [Couillet, Benaych 2015]). As $n,p\to\infty$, $\|L-\hat L\|\xrightarrow{\rm a.s.}0$, with
$L = nD^{-1/2}\left(K-\frac{dd^T}{d^T1_n}\right)D^{-1/2}$, $K_{ij} = f\!\left(\frac1p\|x_i-x_j\|^2\right)$,
$\hat L = -2\frac{f'(\tau)}{f(\tau)}\left[\frac1p\Pi W^TW\Pi + \frac1p JBJ^T + \ast\right]$
and $W = [w_1,\dots,w_n]\in\mathbb{R}^{p\times n}$ ($x_i = \mu_{g_i}+w_i$), $\Pi = I_n-\frac1n1_n1_n^T$, $J = [j_1,\dots,j_k]$, $j_a^T = (0,\dots,0,1_{n_a}^T,0,\dots,0)$,
$B = M^TM + \left(\frac{5f'(\tau)}{8f(\tau)} - \frac{f''(\tau)}{2f'(\tau)}\right)tt^T + \frac{f''(\tau)}{f'(\tau)}T + \ast.$
Recall $M = [\mu_1^\circ,\dots,\mu_k^\circ]$, $t = [\frac1{\sqrt p}\mathrm{tr}\,C_1^\circ,\dots,\frac1{\sqrt p}\mathrm{tr}\,C_k^\circ]^T$, $T = \left\{\frac1p\mathrm{tr}\,C_a^\circ C_b^\circ\right\}_{a,b=1}^k$.

52 Applications/Kernel Spectral Clustering 26/80 Isolated eigenvalues: Gaussian inputs
Figure: Eigenvalues of $L$ and of $\hat L$, $k = 3$, $p = 2048$, $n = 512$, $c_1 = c_2 = 1/4$, $c_3 = 1/2$, $[\mu_a]_j = 4\delta_{aj}$, $C_a = (1+2(a-1)/\sqrt p)I_p$, $f(x) = \exp(-x/2)$.

53 Applications/Kernel Spectral Clustering 27/80 Theoretical Findings versus MNIST
Figure: Eigenvalues of $L$ (red) and of $\hat L$ as if the data followed the equivalent Gaussian model (white), MNIST data, $p = 784$.

55 Applications/Kernel Spectral Clustering 28/80 Theoretical Findings versus MNIST
Figure: Leading four eigenvectors of $D^{-1/2}KD^{-1/2}$ for MNIST data (red) and theoretical findings (blue).

57 Applications/Kernel Spectral Clustering 29/80 Theoretical Findings versus MNIST
Figure: 2D representation of the eigenvectors of $L$ (Eigenvector 2 vs. Eigenvector 1, Eigenvector 3 vs. Eigenvector 2) for the MNIST dataset. Theoretical means and 1- and 2-standard deviations in blue. Class 1 in red, Class 2 in black, Class 3 in green.

58 Applications/Kernel Spectral Clustering 30/80 The surprising $f'(\tau) = 0$ case
Figure: Classification error as a function of $f'(\tau)$, polynomial kernel with $f(\tau) = 4$, $f''(\tau) = 2$, $x_i\sim\mathcal N(0,C_a)$, with $C_1 = I_p$, $[C_2]_{i,j} = .4^{|i-j|}$, $c_0 = 1/4$; curves for $p = 512$, $p = 1024$, and theory.

59 Applications/The case f (τ) = 0: an application to massive MIMO user clustering 31/80 Outline Basics of Random Matrix Theory Motivation: Large Sample Covariance Matrices Spiked Models Applications Reminder on Spectral Clustering Methods Kernel Spectral Clustering The case f (τ) = 0: an application to massive MIMO user clustering Semi-supervised Learning Random Feature Maps, Extreme Learning Machines, and Neural Networks Perspectives 31 / 80

60 Applications/The case $f'(\tau) = 0$: an application to massive MIMO user clustering 32/80 Position of the Problem
Problem: Cluster large data $x_1,\dots,x_n\in\mathbb{R}^p$ based on the subspaces they span.
Method: Still assume $x_1,\dots,x_n$ belong to $k$ classes $\mathcal C_1,\dots,\mathcal C_k$, with a zero-mean Gaussian model for the data: for $x_i\in\mathcal C_a$, $x_i\sim\mathcal N(0,C_a)$.
Study the performance, in the regime $n,p\to\infty$, of
$L = nD^{-1/2}\left(K-\frac{D1_n1_n^TD}{1_n^TD1_n}\right)D^{-1/2}$, with $K = \left\{f\!\left(\|\underline x_i-\underline x_j\|^2\right)\right\}_{i,j=1}^n$, $\underline x = \frac{x}{\|x\|}$.

63 Applications/The case $f'(\tau) = 0$: an application to massive MIMO user clustering 33/80 Model and Reminders
Assumption 1 [Classes]. The vectors $x_1,\dots,x_n\in\mathbb{R}^p$ are drawn i.i.d. from a $k$-class Gaussian mixture, with $x_i\in\mathcal C_a \Leftrightarrow x_i\sim\mathcal N(0,C_a)$ (sorted by class for simplicity).
Assumption 2a [Growth Rates]. As $n\to\infty$, for each $a\in\{1,\dots,k\}$,
1. $\frac np\to c_0\in(0,\infty)$,
2. $\frac{n_a}n\to c_a\in(0,1)$,
3. $\frac1p\mathrm{tr}\,C_a = 1$ and $\mathrm{tr}\,C_a^\circ C_b^\circ = O(p)$, with $C_a^\circ = C_a-C$, $C = \sum_{b=1}^k c_bC_b$.
Theorem (Corollary of Previous Section). Let $f$ be smooth with $f'(2)\neq 0$. Then, under Assumption 2a,
$L = nD^{-1/2}\left(K-\frac{D1_n1_n^TD}{1_n^TD1_n}\right)D^{-1/2}$, with $K = \left\{f\!\left(\|\underline x_i-\underline x_j\|^2\right)\right\}_{i,j=1}^n$ ($\underline x = x/\|x\|$),
exhibits a phase transition phenomenon, i.e., the leading eigenvectors of $L$ asymptotically contain structural information about $\mathcal C_1,\dots,\mathcal C_k$ if and only if
$T = \left\{\frac1p\mathrm{tr}\,C_a^\circ C_b^\circ\right\}_{a,b=1}^k$
has sufficiently large eigenvalues.

67 Applications/The case $f'(\tau) = 0$: an application to massive MIMO user clustering 34/80 The case $f'(2) = 0$
Assumption 2b [Growth Rates]. As $n\to\infty$, for each $a\in\{1,\dots,k\}$,
1. $\frac np\to c_0\in(0,\infty)$,
2. $\frac{n_a}n\to c_a\in(0,1)$,
3. $\frac1p\mathrm{tr}\,C_a = 1$ and $\mathrm{tr}\,C_a^\circ C_b^\circ = O(\sqrt p)$, with $C_a^\circ = C_a-C$, $C = \sum_{b=1}^k c_bC_b$.
(In this regime, the previous kernels clearly fail.)
Theorem (Random Equivalent for $f'(2) = 0$). Let $f$ be smooth with $f'(2) = 0$ and let
$L' \triangleq \frac{\sqrt p\,f(2)}{2f''(2)}\left[L - \frac{f(0)-f(2)}{f(2)}\,P\right]$, $P = I_n-\frac1n1_n1_n^T$.
Then, under Assumption 2b,
$L' = P\Phi P + \left\{\frac1p\mathrm{tr}(C_a^\circ C_b^\circ)\,\frac{1_{n_a}1_{n_b}^T}{\sqrt p}\right\}_{a,b=1}^k + o_{\|\cdot\|}(1)$
where $\Phi_{ij} = \delta_{i\neq j}\,\sqrt p\left[(\underline x_i^T\underline x_j)^2 - E[(\underline x_i^T\underline x_j)^2]\right]$.

70 Applications/The case $f'(\tau) = 0$: an application to massive MIMO user clustering 35/80 The case $f'(2) = 0$
Figure: Eigenvalues of $L'$ (the two isolated eigenvalues $\lambda_1(L')$, $\lambda_2(L')$ detach from the bulk), $p = 1000$, $n = 2000$, $k = 3$, $c_1 = c_2 = 1/4$, $c_3 = 1/2$, $C_i\propto I_p+(p/8)^{-5/4}W_iW_i^T$, $W_i\in\mathbb{R}^{p\times(p/8)}$ with i.i.d. $\mathcal N(0,1)$ entries, $f(t) = \exp(-(t-2)^2)$.
No longer a Marčenko–Pastur-like bulk, but rather a semicircle bulk!

71 Applications/The case $f'(\tau) = 0$: an application to massive MIMO user clustering 36/80 The case $f'(2) = 0$ 36 / 80

72 Applications/The case $f'(\tau) = 0$: an application to massive MIMO user clustering 37/80 The case $f'(2) = 0$
Roadmap. We now need to:
- study the spectrum of $\Phi$;
- study the isolated eigenvalues of $L'$ (and the phase transition);
- retrieve information from the eigenvectors.
Theorem (Semicircle law for $\Phi$). Let $\mu_n = \frac1n\sum_{i=1}^n\delta_{\lambda_i(L')}$. Then, under Assumption 2b, $\mu_n\xrightarrow{\rm a.s.}\mu$, with $\mu$ the semicircle distribution
$\mu(dt) = \frac{1}{2\pi c_0\omega^2}\sqrt{(4c_0\omega^2-t^2)^+}\,dt$, $\omega = \lim_{p\to\infty}\frac{\sqrt 2}{p}\,\mathrm{tr}(C^2)$.

76 Applications/The case $f'(\tau) = 0$: an application to massive MIMO user clustering 38/80 The case $f'(2) = 0$
Figure: Eigenvalues of $L'$ versus the semicircle law (isolated eigenvalues $\lambda_1(L')$, $\lambda_2(L')$ marked), $p = 1000$, $n = 2000$, $k = 3$, $c_1 = c_2 = 1/4$, $c_3 = 1/2$, $C_i\propto I_p+(p/8)^{-5/4}W_iW_i^T$, $W_i\in\mathbb{R}^{p\times(p/8)}$ with i.i.d. $\mathcal N(0,1)$ entries, $f(t) = \exp(-(t-2)^2)$.

77 Applications/The case $f'(\tau) = 0$: an application to massive MIMO user clustering 39/80 The case $f'(2) = 0$
Denote now $T \triangleq \left\{\sqrt{c_ac_b}\,\lim_{p\to\infty}\frac1{\sqrt p}\mathrm{tr}\,C_a^\circ C_b^\circ\right\}_{a,b=1}^k$.
Theorem (Isolated Eigenvalues). Let $\nu_1\ge\dots\ge\nu_k$ be the eigenvalues of $T$. Then, if $\sqrt{c_0}\,\nu_i>\omega$, $L'$ has an isolated eigenvalue $\lambda_i$ satisfying
$\lambda_i\xrightarrow{\rm a.s.}\rho_i \triangleq c_0\nu_i + \frac{\omega^2}{\nu_i}.$

79 Applications/The case $f'(\tau) = 0$: an application to massive MIMO user clustering 40/80 The case $f'(2) = 0$
Theorem (Isolated Eigenvectors). For each isolated eigenpair $(\lambda_i,u_i)$ of $L'$ corresponding to $(\nu_i,v_i)$ of $T$, write
$u_i = \sum_{a=1}^k\left(\alpha_i^a\,\frac{j_a}{\sqrt{n_a}} + \sigma_i^a\,w_i^a\right)$
with $j_a = [0_{n_1}^T,\dots,1_{n_a}^T,\dots,0_{n_k}^T]^T$, $(w_i^a)^Tj_a = 0$, $\mathrm{supp}(w_i^a) = \mathrm{supp}(j_a)$, $\|w_i^a\| = 1$. Then, under Assumptions 1–2b,
$\alpha_i^a\alpha_i^b\xrightarrow{\rm a.s.}\left(1-\frac1{c_0}\frac{\omega^2}{\nu_i^2}\right)[v_iv_i^T]_{ab}$, $(\sigma_i^a)^2\xrightarrow{\rm a.s.}\frac{c_a}{c_0}\frac{\omega^2}{\nu_i^2}$,
and the fluctuations of $u_i$, $u_j$, $i\neq j$, are asymptotically uncorrelated.

80 Applications/The case $f'(\tau) = 0$: an application to massive MIMO user clustering 41/80 The case $f'(2) = 0$
Figure: Leading two eigenvectors of $L'$ (or equivalently of $L$) versus the deterministic approximations $\frac1{\sqrt{n_a}}(\alpha_i^a\pm\sigma_i^a)$ of their per-class levels and fluctuations.

83 Applications/The case $f'(\tau) = 0$: an application to massive MIMO user clustering 43/80 Application to Massive MIMO UE Clustering 43 / 80

84 Applications/The case $f'(\tau) = 0$: an application to massive MIMO user clustering 44/80 Massive MIMO UE Clustering
Setting.
- Massive MIMO cell with $p$ antenna elements.
- $n$ user equipments (UEs) with channels $x_1,\dots,x_n\in\mathbb{R}^p$.
- UEs belong to solid-angle groups, i.e., $E[x_i] = 0$, $E[x_ix_i^T] = C_a \triangleq C(\Theta_a)$.
- $T$ independent channel observations $x_i^{(1)},\dots,x_i^{(T)}$ for UE $i$.
Objective. Cluster the users lying in the same solid-angle groups (for scheduling reasons, to avoid pilot contamination).
Algorithm (see the sketch below).
1. Build the kernel matrix $K$, then $L$, based on the $nT$ vectors $x_1^{(1)},\dots,x_n^{(T)}$ (as if there were $nT$ values to cluster).
2. Extract the dominant isolated eigenvectors $u_1,\dots,u_\kappa$.
3. For each $i$, create $\tilde u_i = \frac1T(I_n\otimes 1_T^T)u_i$, i.e., average the eigenvectors along time.
4. Perform $k$-class clustering on the vectors $\tilde u_1,\dots,\tilde u_\kappa$.
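A compact sketch of these four steps (hypothetical helper code, not the authors' implementation; it assumes the observations are ordered user by user, uses the same Laplacian construction as before on norm-normalized channel vectors, and delegates step 4 to scikit-learn's k-means):

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_ues(X_obs, n, T, k, kappa, f=lambda t: np.exp(-(t - 2) ** 2)):
    """X_obs: p x (n*T) matrix of channel observations, T consecutive columns per UE."""
    Xn = X_obs / np.linalg.norm(X_obs, axis=0, keepdims=True)    # x_i / ||x_i||
    sq = np.sum(Xn ** 2, axis=0)
    K = f(sq[:, None] + sq[None, :] - 2 * Xn.T @ Xn)             # step 1: K = {f(||x_i - x_j||^2)}
    d = K @ np.ones(n * T)
    Dm = 1 / np.sqrt(d)
    L = (n * T) * Dm[:, None] * (K - np.outer(d, d) / d.sum()) * Dm[None, :]
    _, vecs = np.linalg.eigh(L)
    U = vecs[:, -kappa:]                                         # step 2: dominant eigenvectors
    U_tilde = U.reshape(n, T, kappa).mean(axis=1)                # step 3: average along time
    return KMeans(n_clusters=k, n_init=10).fit_predict(U_tilde)  # step 4: k-class clustering
```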

91 Applications/The case $f'(\tau) = 0$: an application to massive MIMO user clustering 45/80 Massive MIMO UE Clustering
Figure: Leading two eigenvectors before (left) and after (right) $T$-averaging. Setting: $p = 400$, $n = 40$, $T = 10$, $k = 3$, $c_1 = c_3 = 1/4$, $c_2 = 1/2$, angular spread model with angles $-\pi/30\pm\pi/20$, $0\pm\pi/20$, and $\pi/30\pm\pi/20$. Kernel function $f(t) = \exp(-(t-2)^2)$.

92 Applications/The case $f'(\tau) = 0$: an application to massive MIMO user clustering 46/80 Massive MIMO UE Clustering
Figure: Correct clustering probability (overlap) for different $T$, using k-means or EM, started either from the actual centroid solutions (oracle) or randomly; random guess shown for reference.

93 Applications/The case $f'(\tau) = 0$: an application to massive MIMO user clustering 47/80 Massive MIMO UE Clustering
Figure: Correct clustering probability (overlap) for the optimal kernel $f(t)$ (here $f(t) = \exp(-(t-2)^2)$) and the Gaussian kernel $f(t) = \exp(-t^2)$, for different $T$, using k-means or EM; random guess shown for reference.

94 Applications/Semi-supervised Learning 48/80 Outline Basics of Random Matrix Theory Motivation: Large Sample Covariance Matrices Spiked Models Applications Reminder on Spectral Clustering Methods Kernel Spectral Clustering The case f (τ) = 0: an application to massive MIMO user clustering Semi-supervised Learning Random Feature Maps, Extreme Learning Machines, and Neural Networks Perspectives 48 / 80

95 Applications/Semi-supervised Learning 49/80 Problem Statement
Context: similar to clustering: classify $x_1,\dots,x_n\in\mathbb{R}^p$ into $k$ classes, with $n_l$ labelled and $n_u$ unlabelled data.
Problem statement: assign scores $F_{ia}$ ($d_i = [K1_n]_i$):
$F = \underset{F\in\mathbb{R}^{n\times k}}{\mathrm{argmin}}\ \sum_{a=1}^k\sum_{i,j}K_{ij}\left(F_{ia}d_i^{\alpha-1}-F_{ja}d_j^{\alpha-1}\right)^2$
such that $F_{ia} = \delta_{\{x_i\in\mathcal C_a\}}$ for all labelled $x_i$.
Solution: for $F^{(u)}\in\mathbb{R}^{n_u\times k}$, $F^{(l)}\in\mathbb{R}^{n_l\times k}$ the scores of the unlabelled/labelled data,
$F^{(u)} = \left(I_{n_u} - D_{(u)}^{-\alpha}K_{(u,u)}D_{(u)}^{\alpha-1}\right)^{-1}D_{(u)}^{-\alpha}K_{(u,l)}D_{(l)}^{\alpha-1}F^{(l)}$
where we naturally decompose
$K = \begin{bmatrix}K_{(l,l)} & K_{(l,u)}\\ K_{(u,l)} & K_{(u,u)}\end{bmatrix}$, $D = \begin{bmatrix}D_{(l)} & 0\\ 0 & D_{(u)}\end{bmatrix} = \mathrm{diag}\{K1_n\}$.
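A direct transcription of this closed form (a numpy sketch; it assumes the labelled points are listed first in the kernel matrix, and the function name is only illustrative):

```python
import numpy as np

def semi_supervised_scores(K, y_labels, n_l, k, alpha=0.0):
    """K: n x n kernel matrix (labelled points first); y_labels: class index (0..k-1) of the n_l labelled points."""
    n = K.shape[0]
    d = K @ np.ones(n)
    F_l = np.eye(k)[y_labels]                        # F_{ia} = 1 if labelled x_i belongs to class a
    D_u, D_l = d[n_l:], d[:n_l]
    K_uu, K_ul = K[n_l:, n_l:], K[n_l:, :n_l]
    A = np.eye(n - n_l) - (D_u ** -alpha)[:, None] * K_uu * (D_u ** (alpha - 1))[None, :]
    B = (D_u ** -alpha)[:, None] * K_ul * (D_l ** (alpha - 1))[None, :]
    return np.linalg.solve(A, B @ F_l)               # F^{(u)}: scores of the unlabelled points
```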

98 Applications/Semi-supervised Learning 50/80 MNIST Data Example
Figure: Vectors $[F^{(u)}]_{\cdot,a}$, $a = 1,2,3$ (zeros, ones, twos), for 3-class MNIST data, $n = 192$, $p = 784$, $n_l/n = 1/16$, Gaussian kernel.

101 Applications/Semi-supervised Learning 51/80 MNIST Data Example
Figure: Centered vectors $[F^{(u)\circ}]_{\cdot,a} = [F^{(u)}-\frac1kF^{(u)}1_k1_k^T]_{\cdot,a}$, $a = 1,2,3$ (zeros, ones, twos), for 3-class MNIST data, $\alpha = 0$, $n = 192$, $p = 784$, $n_l/n = 1/16$, Gaussian kernel.

104 Applications/Semi-supervised Learning 52/80 Main Results
Results: assuming $n_l/n\to c_l\in(0,1)$, by the previous Taylor expansion, to first order,
$F^{(u)}_{\cdot,a} = C\,\frac{n_{l,a}}{n}\Big[\underbrace{v}_{O(1)} + \underbrace{\alpha\,t_a1_{n_u}}_{O(n^{-1/2})}\Big] + \underbrace{O(n^{-1})}_{\text{informative terms}}$
where $v$ is an $O(1)$ random vector (entry-wise) and $t_a = \frac1p\mathrm{tr}\,C_a^\circ$.
Consequences:
- random non-informative bias $v$;
- strong impact of $n_{l,a}$: $F^{(u)}_{\cdot,a}$ has to be rescaled by $n_{l,a}$;
- additional per-class bias $\alpha t_a1_{n_u}$: take $\alpha = 0+\frac{\beta}{\sqrt p}$.

109 Applications/Semi-supervised Learning 53/80 Main Results
As a consequence of the remarks above, we take $\alpha = \frac{\beta}{\sqrt p}$ and define $\hat F^{(u)}_{i,a} = \frac{n\sqrt p}{n_{l,a}}F^{(u)}_{ia}$.
Theorem. For $x_i\in\mathcal C_b$ unlabelled, $\hat F_{i,\cdot}-G_b\to 0$, $G_b\sim\mathcal N(m_b,\Sigma_b)$, where $m_b\in\mathbb{R}^k$ and $\Sigma_b\in\mathbb{R}^{k\times k}$ are given by
$(m_b)_a = -2\frac{f'(\tau)}{f(\tau)}[M^TM]_{ab} + \left(\frac{f''(\tau)}{f(\tau)}-\frac{f'(\tau)^2}{f(\tau)^2}\right)t_at_b + 2\frac{f''(\tau)}{f(\tau)}T_{ab} + \beta\,\frac{n}{n_l}\frac{f'(\tau)}{f(\tau)}t_a + B_b$
$(\Sigma_b)_{a_1a_2} = \frac{2\,\mathrm{tr}(C_b^2)}{p}\left(\frac{f''(\tau)}{f(\tau)}-\frac{f'(\tau)^2}{f(\tau)^2}\right)^2 t_{a_1}t_{a_2} + \frac{4f'(\tau)^2}{f(\tau)^2}\left([M^TC_bM]_{a_1a_2} + \delta_{a_1}^{a_2}\,\frac{p\,T_{ba_1}}{n_{l,a_1}}\right)$
with $t$, $T$, $M$ as before, $\mathring X_a = X_a - \sum_{d=1}^k\frac{n_{l,d}}{n_l}X_d$, and $B_b$ a bias independent of $a$.

111 Applications/Semi-supervised Learning 54/80 Main Results
Corollary (Asymptotic Classification Error). For $k = 2$ classes and $a\neq b$,
$P\left(\hat F_{i,a}>\hat F_{i,b}\,\middle|\,x_i\in\mathcal C_b\right) - Q\left(\frac{(m_b)_b-(m_b)_a}{\sqrt{[1,-1]\,\Sigma_b\,[1,-1]^T}}\right)\to 0.$
Some consequences:
- non-obvious choices of appropriate kernels;
- non-obvious choice of the optimal $\beta$ (it induces a possibly beneficial bias);
- importance of $n_l$ versus $n_u$.
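For $k = 2$ this error probability reduces to a single Gaussian tail, which can be evaluated directly (a minimal sketch; m_b and Sigma_b are assumed to be given, e.g. computed from the theorem above):

```python
import numpy as np
from scipy.stats import norm

def misclassification_probability(m_b, Sigma_b, a, b):
    """Asymptotic P(F_hat[i,a] > F_hat[i,b] | x_i in class b), for k = 2 classes (a != b)."""
    u = np.array([1.0, -1.0])
    # Q(x) = 1 - Phi(x) is the standard Gaussian tail function
    return norm.sf((m_b[b] - m_b[a]) / np.sqrt(u @ Sigma_b @ u))
```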

113 Applications/Semi-supervised Learning 55/80 MNIST Data Example
Figure: Probability of correct classification as a function of $\alpha$, for 3-class MNIST data (zeros, ones, twos), $n = 192$, $p = 784$, $n_l/n = 1/16$, Gaussian kernel; simulations versus theory as if the data were Gaussian.

115 Applications/Semi-supervised Learning 56/80 MNIST Data Example
Figure: Probability of correct classification as a function of $\alpha$, for 2-class MNIST data (zeros, ones), $n = 1568$, $p = 784$, $n_l/n = 1/16$, Gaussian kernel; simulations versus theory as if the data were Gaussian.

117 Applications/Random Feature Maps, Extreme Learning Machines, and Neural Networks 57/80 Outline Basics of Random Matrix Theory Motivation: Large Sample Covariance Matrices Spiked Models Applications Reminder on Spectral Clustering Methods Kernel Spectral Clustering The case f (τ) = 0: an application to massive MIMO user clustering Semi-supervised Learning Random Feature Maps, Extreme Learning Machines, and Neural Networks Perspectives 57 / 80

118 Applications/Random Feature Maps, Extreme Learning Machines, and Neural Networks 58/80 Random Feature Maps and Extreme Learning Machines
Context:
Random feature map:
- (large) input $x_1,\dots,x_T\in\mathbb{R}^p$;
- random $W = [w_1,\dots,w_n]^T\in\mathbb{R}^{n\times p}$;
- non-linear activation function $\sigma$.
Neural network model (extreme learning machine): ridge-regression learning:
- small output $y_1,\dots,y_T\in\mathbb{R}^d$;
- ridge-regression output $\beta\in\mathbb{R}^{n\times d}$.

120 Applications/Random Feature Maps, Extreme Learning Machines, and Neural Networks 59/80 Random Feature Maps and Extreme Learning Machines
Objectives: evaluate the training and testing MSE performance as $n,p,T\to\infty$.
Training MSE:
$E_{\rm train} = \frac1T\sum_{i=1}^T\|y_i-\beta^T\sigma(Wx_i)\|^2 = \frac1T\|Y-\beta^T\Sigma\|_F^2$
with $\Sigma = \sigma(WX) = \{\sigma(w_i^Tx_j)\}_{1\le i\le n,\,1\le j\le T}$, $Y = [y_1,\dots,y_T]\in\mathbb{R}^{d\times T}$, and
$\beta = \frac1T\Sigma\left(\frac1T\Sigma^T\Sigma+\gamma I_T\right)^{-1}Y^T.$
Testing MSE: upon a new pair $(\hat X,\hat Y)$ of length $\hat T$,
$E_{\rm test} = \frac1{\hat T}\|\hat Y-\beta^T\hat\Sigma\|_F^2$, where $\hat\Sigma = \sigma(W\hat X)$.
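A minimal random-feature ridge regression matching these definitions (a sketch; the activation, the regularization $\gamma$, and the dimensions are free parameters, and the data layout is one column per sample):

```python
import numpy as np

def elm_mse(X, Y, X_test, Y_test, n_features, gamma, sigma=np.tanh, seed=0):
    """X: p x T inputs, Y: d x T targets (one column per sample); returns (E_train, E_test)."""
    p, T = X.shape
    W = np.random.default_rng(seed).standard_normal((n_features, p))
    S, S_hat = sigma(W @ X), sigma(W @ X_test)                     # Sigma = sigma(W X), hat(Sigma) = sigma(W X_hat)
    Q = np.linalg.inv(S.T @ S / T + gamma * np.eye(T))             # resolvent (1/T Sigma^T Sigma + gamma I_T)^{-1}
    beta = S @ Q @ Y.T / T                                         # ridge-regression output weights
    E_train = np.mean(np.sum((Y - beta.T @ S) ** 2, axis=0))
    E_test = np.mean(np.sum((Y_test - beta.T @ S_hat) ** 2, axis=0))
    return E_train, E_test
```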

123 Applications/Random Feature Maps, Extreme Learning Machines, and Neural Networks 60/80 Technical Aspects
Preliminary observations: link to the resolvent of $\frac1T\Sigma^T\Sigma$:
$E_{\rm train} = \frac{\gamma^2}{T}\mathrm{tr}\,Y^TYQ^2 = -\gamma^2\frac{\partial}{\partial\gamma}\left[\frac1T\mathrm{tr}\,Y^TYQ\right]$
where $Q = Q(\gamma)$ is the resolvent
$Q \triangleq \left(\frac1T\Sigma^T\Sigma+\gamma I_T\right)^{-1}$
with $\Sigma_{ij} = \sigma(w_i^Tx_j)$.
Central object: the resolvent $E[Q]$.

125 Applications/Random Feature Maps, Extreme Learning Machines, and Neural Networks 61/80 Main Technical Result
Theorem [Asymptotic Equivalent for $E[Q]$]. For Lipschitz $\sigma$, bounded $X$, $Y$, and $W = f(Z)$ (entry-wise) with $Z$ standard Gaussian, we have, for all $\varepsilon>0$ and some $C>0$,
$\|E[Q]-\bar Q\| < Cn^{\varepsilon-\frac12}$
where
$\bar Q = \left(\frac nT\frac{\Phi}{1+\delta}+\gamma I_T\right)^{-1}$, $\Phi \triangleq E\left[\sigma(X^Tw)\sigma(w^TX)\right]$
with $w = f(z)$, $z\sim\mathcal N(0,I_p)$, and $\delta>0$ the unique positive solution of $\delta = \frac1T\mathrm{tr}\,\Phi\bar Q$.
Proof arguments:
- $\sigma(WX)$ has independent rows but dependent columns; this breaks the trace lemma argument (i.e., $\frac1p w^TXAX^Tw\simeq\frac1p\mathrm{tr}\,XAX^T$);
- concentration of measure lemma: $\frac1p\sigma(w^TX)A\sigma(X^Tw)\simeq\frac1p\mathrm{tr}\,\Phi A$.

128 Applications/Random Feature Maps, Extreme Learning Machines, and Neural Networks 62/80 Main Technical Result
Values of $\Phi(a,b)$ for $w\sim\mathcal N(0,I_p)$, with $\angle(a,b)\triangleq\frac{a^Tb}{\|a\|\,\|b\|}$:
$\sigma(t) = \max(t,0)$: $\Phi(a,b) = \frac1{2\pi}\|a\|\|b\|\left(\angle(a,b)\,\mathrm{acos}(-\angle(a,b))+\sqrt{1-\angle(a,b)^2}\right)$
$\sigma(t) = |t|$: $\Phi(a,b) = \frac2\pi\|a\|\|b\|\left(\angle(a,b)\,\mathrm{asin}(\angle(a,b))+\sqrt{1-\angle(a,b)^2}\right)$
$\sigma(t) = \mathrm{erf}(t)$: $\Phi(a,b) = \frac2\pi\,\mathrm{asin}\!\left(\frac{2a^Tb}{\sqrt{(1+2\|a\|^2)(1+2\|b\|^2)}}\right)$
$\sigma(t) = 1_{\{t>0\}}$: $\Phi(a,b) = \frac12-\frac1{2\pi}\mathrm{acos}(\angle(a,b))$
$\sigma(t) = \mathrm{sign}(t)$: $\Phi(a,b) = 1-\frac2\pi\,\mathrm{acos}(\angle(a,b))$
$\sigma(t) = \cos(t)$: $\Phi(a,b) = \exp\!\left(-\tfrac12(\|a\|^2+\|b\|^2)\right)\cosh(a^Tb)$
Value of $\Phi(a,b)$ for $w_i$ i.i.d. with $E[w_i^k] = m_k$ ($m_1 = 0$) and $\sigma(t) = \zeta_2t^2+\zeta_1t+\zeta_0$:
$\Phi(a,b) = \zeta_2^2\left[m_2^2\left(2(a^Tb)^2+\|a\|^2\|b\|^2\right)+(m_4-3m_2^2)\,(a^2)^T(b^2)\right]+\zeta_1^2m_2\,a^Tb+\zeta_2\zeta_1m_3\left[(a^2)^Tb+a^T(b^2)\right]+\zeta_2\zeta_0m_2\left[\|a\|^2+\|b\|^2\right]+\zeta_0^2$
where $(a^2)\triangleq[a_1^2,\dots,a_p^2]^T$.

130 Applications/Random Feature Maps, Extreme Learning Machines, and Neural Networks 63/80 Main Results
Theorem [Asymptotic $E_{\rm train}$]. For all $\varepsilon>0$, almost surely,
$n^{\frac12-\varepsilon}\left(E_{\rm train}-\bar E_{\rm train}\right)\to 0$
where
$E_{\rm train} = \frac1T\|Y-\beta^T\Sigma\|_F^2 = \frac{\gamma^2}{T}\mathrm{tr}\,Y^TYQ^2$
$\bar E_{\rm train} = \frac{\gamma^2}{T}\mathrm{tr}\,Y^TY\bar Q\left[\frac{\frac1n\mathrm{tr}(\Psi\bar Q^2)}{1-\frac1n\mathrm{tr}\big((\Psi\bar Q)^2\big)}\Psi+I_T\right]\bar Q$
with $\Psi\triangleq\frac nT\frac{\Phi}{1+\delta}$.

131 Applications/Random Feature Maps, Extreme Learning Machines, and Neural Networks 64/80 Main Results
Letting $\hat X\in\mathbb{R}^{p\times\hat T}$, $\hat Y\in\mathbb{R}^{d\times\hat T}$ satisfy properties similar to $(X,Y)$,
Claim [Asymptotic $E_{\rm test}$]. For all $\varepsilon>0$, almost surely,
$n^{\frac12-\varepsilon}\left(E_{\rm test}-\bar E_{\rm test}\right)\to 0$
where
$E_{\rm test} = \frac1{\hat T}\|\hat Y-\beta^T\hat\Sigma\|_F^2$
$\bar E_{\rm test} = \frac1{\hat T}\left\|\hat Y^T-\Psi_{\hat XX}\bar QY^T\right\|_F^2 + \frac{\frac1n\mathrm{tr}\left(Y^TY\bar Q\Psi\bar Q\right)}{1-\frac1n\mathrm{tr}\big((\Psi\bar Q)^2\big)}\left[\frac1{\hat T}\mathrm{tr}\,\Psi_{\hat X\hat X}-\frac1{\hat T}\mathrm{tr}\left((I_T+\gamma\bar Q)(\Psi_{X\hat X}\Psi_{\hat XX}\bar Q)\right)\right]$
with $\Psi_{AB}\triangleq\frac nT\frac{\Phi_{AB}}{1+\delta}$, $\Phi_{AB} = E[\sigma(A^Tw)\sigma(w^TB)]$.

132 Applications/Random Feature Maps, Extreme Learning Machines, and Neural Networks 65/80 Simulations on MNIST: Lipschitz $\sigma(\cdot)$
Figure: Neural network performance ($E_{\rm train}$, $E_{\rm test}$ versus the asymptotics $\bar E_{\rm train}$, $\bar E_{\rm test}$) for Lipschitz continuous $\sigma(\cdot)$ ($\sigma(t) = t$, $\sigma(t) = |t|$, $\sigma(t) = \mathrm{erf}(t)$, $\sigma(t) = \max(t,0)$), as a function of $\gamma$, for 2-class MNIST data (sevens, nines), $n = 512$, $T = \hat T = 1024$, $p = 784$.

137 Applications/Random Feature Maps, Extreme Learning Machines, and Neural Networks 66/80 Simulations on MNIST: non-Lipschitz $\sigma(\cdot)$
Figure: Neural network performance ($E_{\rm train}$, $E_{\rm test}$ versus $\bar E_{\rm train}$, $\bar E_{\rm test}$) for $\sigma(\cdot)$ either discontinuous or non-Lipschitz ($\sigma(t) = t^2$, $\sigma(t) = 1_{\{t>0\}}$, $\sigma(t) = \mathrm{sign}(t)$), as a function of $\gamma$, for 2-class MNIST data (sevens, nines), $n = 512$, $T = \hat T = 1024$, $p = 784$.

138 Applications/Random Feature Maps, Extreme Learning Machines, and Neural Networks 67/80 Simulations on tuned Gaussian mixture
Gaussian mixture classification: $X = [X_1,X_2]$, with $[X_1]_{\cdot i}\sim\mathcal N(0,C_1)$, $[X_2]_{\cdot i}\sim\mathcal N(0,C_2)$, $\mathrm{tr}\,C_1 = \mathrm{tr}\,C_2$.
We can prove that, for $\sigma(t) = \zeta_2t^2+\zeta_1t+\zeta_0$ and $E[W_{ij}^k] = m_k$, classification is only possible if $m_4\neq m_2^2$.
Figure: MSE ($E_{\rm train}$, $E_{\rm test}$ versus $\bar E_{\rm train}$, $\bar E_{\rm test}$) as a function of $\gamma$, for $W_{ij}$ Bernoulli, $W_{ij}\sim\mathcal N(0,1)$, and $W_{ij}$ Student-t distributed.


More information

Gaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008

Gaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008 Gaussian processes Chuong B Do (updated by Honglak Lee) November 22, 2008 Many of the classical machine learning algorithms that we talked about during the first half of this course fit the following pattern:

More information

MATH JORDAN FORM

MATH JORDAN FORM MATH 53 JORDAN FORM Let A,, A k be square matrices of size n,, n k, respectively with entries in a field F We define the matrix A A k of size n = n + + n k as the block matrix A 0 0 0 0 A 0 0 0 0 A k It

More information

Free Probability, Sample Covariance Matrices and Stochastic Eigen-Inference

Free Probability, Sample Covariance Matrices and Stochastic Eigen-Inference Free Probability, Sample Covariance Matrices and Stochastic Eigen-Inference Alan Edelman Department of Mathematics, Computer Science and AI Laboratories. E-mail: edelman@math.mit.edu N. Raj Rao Deparment

More information

Random Correlation Matrices, Top Eigenvalue with Heavy Tails and Financial Applications

Random Correlation Matrices, Top Eigenvalue with Heavy Tails and Financial Applications Random Correlation Matrices, Top Eigenvalue with Heavy Tails and Financial Applications J.P Bouchaud with: M. Potters, G. Biroli, L. Laloux, M. A. Miceli http://www.cfm.fr Portfolio theory: Basics Portfolio

More information

Inference For High Dimensional M-estimates. Fixed Design Results

Inference For High Dimensional M-estimates. Fixed Design Results : Fixed Design Results Lihua Lei Advisors: Peter J. Bickel, Michael I. Jordan joint work with Peter J. Bickel and Noureddine El Karoui Dec. 8, 2016 1/57 Table of Contents 1 Background 2 Main Results and

More information

Introduction to Machine Learning

Introduction to Machine Learning 10-701 Introduction to Machine Learning PCA Slides based on 18-661 Fall 2018 PCA Raw data can be Complex, High-dimensional To understand a phenomenon we measure various related quantities If we knew what

More information

Convex Optimization in Classification Problems

Convex Optimization in Classification Problems New Trends in Optimization and Computational Algorithms December 9 13, 2001 Convex Optimization in Classification Problems Laurent El Ghaoui Department of EECS, UC Berkeley elghaoui@eecs.berkeley.edu 1

More information

Robust covariance matrices estimation and applications in signal processing

Robust covariance matrices estimation and applications in signal processing Robust covariance matrices estimation and applications in signal processing F. Pascal SONDRA/Supelec GDR ISIS Journée Estimation et traitement statistique en grande dimension May 16 th, 2013 FP (SONDRA/Supelec)

More information

STATS306B STATS306B. Discriminant Analysis. Jonathan Taylor Department of Statistics Stanford University. June 3, 2010

STATS306B STATS306B. Discriminant Analysis. Jonathan Taylor Department of Statistics Stanford University. June 3, 2010 STATS306B Discriminant Analysis Jonathan Taylor Department of Statistics Stanford University June 3, 2010 Spring 2010 Classification Given K classes in R p, represented as densities f i (x), 1 i K classify

More information

MATH 5640: Functions of Diagonalizable Matrices

MATH 5640: Functions of Diagonalizable Matrices MATH 5640: Functions of Diagonalizable Matrices Hung Phan, UMass Lowell November 27, 208 Spectral theorem for diagonalizable matrices Definition Let V = X Y Every v V is uniquely decomposed as u = x +

More information

Channel capacity estimation using free probability theory

Channel capacity estimation using free probability theory Channel capacity estimation using free probability theory January 008 Problem at hand The capacity per receiving antenna of a channel with n m channel matrix H and signal to noise ratio ρ = 1 σ is given

More information

The University of Texas at Austin Department of Electrical and Computer Engineering. EE381V: Large Scale Learning Spring 2013.

The University of Texas at Austin Department of Electrical and Computer Engineering. EE381V: Large Scale Learning Spring 2013. The University of Texas at Austin Department of Electrical and Computer Engineering EE381V: Large Scale Learning Spring 2013 Assignment Two Caramanis/Sanghavi Due: Tuesday, Feb. 19, 2013. Computational

More information

8.1 Concentration inequality for Gaussian random matrix (cont d)

8.1 Concentration inequality for Gaussian random matrix (cont d) MGMT 69: Topics in High-dimensional Data Analysis Falll 26 Lecture 8: Spectral clustering and Laplacian matrices Lecturer: Jiaming Xu Scribe: Hyun-Ju Oh and Taotao He, October 4, 26 Outline Concentration

More information

Tensor Methods for Feature Learning

Tensor Methods for Feature Learning Tensor Methods for Feature Learning Anima Anandkumar U.C. Irvine Feature Learning For Efficient Classification Find good transformations of input for improved classification Figures used attributed to

More information

Homework 1. Yuan Yao. September 18, 2011

Homework 1. Yuan Yao. September 18, 2011 Homework 1 Yuan Yao September 18, 2011 1. Singular Value Decomposition: The goal of this exercise is to refresh your memory about the singular value decomposition and matrix norms. A good reference to

More information

9.520: Class 20. Bayesian Interpretations. Tomaso Poggio and Sayan Mukherjee

9.520: Class 20. Bayesian Interpretations. Tomaso Poggio and Sayan Mukherjee 9.520: Class 20 Bayesian Interpretations Tomaso Poggio and Sayan Mukherjee Plan Bayesian interpretation of Regularization Bayesian interpretation of the regularizer Bayesian interpretation of quadratic

More information

Applications and fundamental results on random Vandermon

Applications and fundamental results on random Vandermon Applications and fundamental results on random Vandermonde matrices May 2008 Some important concepts from classical probability Random variables are functions (i.e. they commute w.r.t. multiplication)

More information

Solving Corrupted Quadratic Equations, Provably

Solving Corrupted Quadratic Equations, Provably Solving Corrupted Quadratic Equations, Provably Yuejie Chi London Workshop on Sparse Signal Processing September 206 Acknowledgement Joint work with Yuanxin Li (OSU), Huishuai Zhuang (Syracuse) and Yingbin

More information

4 Bias-Variance for Ridge Regression (24 points)

4 Bias-Variance for Ridge Regression (24 points) Implement Ridge Regression with λ = 0.00001. Plot the Squared Euclidean test error for the following values of k (the dimensions you reduce to): k = {0, 50, 100, 150, 200, 250, 300, 350, 400, 450, 500,

More information

Lecture 2: Linear Algebra Review

Lecture 2: Linear Algebra Review EE 227A: Convex Optimization and Applications January 19 Lecture 2: Linear Algebra Review Lecturer: Mert Pilanci Reading assignment: Appendix C of BV. Sections 2-6 of the web textbook 1 2.1 Vectors 2.1.1

More information

Kernel Method: Data Analysis with Positive Definite Kernels

Kernel Method: Data Analysis with Positive Definite Kernels Kernel Method: Data Analysis with Positive Definite Kernels 2. Positive Definite Kernel and Reproducing Kernel Hilbert Space Kenji Fukumizu The Institute of Statistical Mathematics. Graduate University

More information

Comparison Method in Random Matrix Theory

Comparison Method in Random Matrix Theory Comparison Method in Random Matrix Theory Jun Yin UW-Madison Valparaíso, Chile, July - 2015 Joint work with A. Knowles. 1 Some random matrices Wigner Matrix: H is N N square matrix, H : H ij = H ji, EH

More information

Classifier Complexity and Support Vector Classifiers

Classifier Complexity and Support Vector Classifiers Classifier Complexity and Support Vector Classifiers Feature 2 6 4 2 0 2 4 6 8 RBF kernel 10 10 8 6 4 2 0 2 4 6 Feature 1 David M.J. Tax Pattern Recognition Laboratory Delft University of Technology D.M.J.Tax@tudelft.nl

More information

A Least Squares Formulation for Canonical Correlation Analysis

A Least Squares Formulation for Canonical Correlation Analysis A Least Squares Formulation for Canonical Correlation Analysis Liang Sun, Shuiwang Ji, and Jieping Ye Department of Computer Science and Engineering Arizona State University Motivation Canonical Correlation

More information

Kernel Methods. Barnabás Póczos

Kernel Methods. Barnabás Póczos Kernel Methods Barnabás Póczos Outline Quick Introduction Feature space Perceptron in the feature space Kernels Mercer s theorem Finite domain Arbitrary domain Kernel families Constructing new kernels

More information

High-resolution Parametric Subspace Methods

High-resolution Parametric Subspace Methods High-resolution Parametric Subspace Methods The first parametric subspace-based method was the Pisarenko method,, which was further modified, leading to the MUltiple SIgnal Classification (MUSIC) method.

More information

PCA with random noise. Van Ha Vu. Department of Mathematics Yale University

PCA with random noise. Van Ha Vu. Department of Mathematics Yale University PCA with random noise Van Ha Vu Department of Mathematics Yale University An important problem that appears in various areas of applied mathematics (in particular statistics, computer science and numerical

More information

Gaussian Processes. Le Song. Machine Learning II: Advanced Topics CSE 8803ML, Spring 2012

Gaussian Processes. Le Song. Machine Learning II: Advanced Topics CSE 8803ML, Spring 2012 Gaussian Processes Le Song Machine Learning II: Advanced Topics CSE 8803ML, Spring 01 Pictorial view of embedding distribution Transform the entire distribution to expected features Feature space Feature

More information

Eigenvector stability: Random Matrix Theory and Financial Applications

Eigenvector stability: Random Matrix Theory and Financial Applications Eigenvector stability: Random Matrix Theory and Financial Applications J.P Bouchaud with: R. Allez, M. Potters http://www.cfm.fr Portfolio theory: Basics Portfolio weights w i, Asset returns X t i If expected/predicted

More information

Generalization theory

Generalization theory Generalization theory Daniel Hsu Columbia TRIPODS Bootcamp 1 Motivation 2 Support vector machines X = R d, Y = { 1, +1}. Return solution ŵ R d to following optimization problem: λ min w R d 2 w 2 2 + 1

More information

Free deconvolution for signal processing applications

Free deconvolution for signal processing applications IEEE TRANSACTIONS ON INFORMATION THEORY, VOL., NO., JANUARY 27 Free deconvolution for signal processing applications Øyvind Ryan, Member, IEEE, Mérouane Debbah, Member, IEEE arxiv:cs.it/725v 4 Jan 27 Abstract

More information

Fast learning rates for plug-in classifiers under the margin condition

Fast learning rates for plug-in classifiers under the margin condition Fast learning rates for plug-in classifiers under the margin condition Jean-Yves Audibert 1 Alexandre B. Tsybakov 2 1 Certis ParisTech - Ecole des Ponts, France 2 LPMA Université Pierre et Marie Curie,

More information

A Magiv CV Theory for Large-Margin Classifiers

A Magiv CV Theory for Large-Margin Classifiers A Magiv CV Theory for Large-Margin Classifiers Hui Zou School of Statistics, University of Minnesota June 30, 2018 Joint work with Boxiang Wang Outline 1 Background 2 Magic CV formula 3 Magic support vector

More information

Statistical Machine Learning

Statistical Machine Learning Statistical Machine Learning Christoph Lampert Spring Semester 2015/2016 // Lecture 12 1 / 36 Unsupervised Learning Dimensionality Reduction 2 / 36 Dimensionality Reduction Given: data X = {x 1,..., x

More information

Learning the Semantic Correlation: An Alternative Way to Gain from Unlabeled Text

Learning the Semantic Correlation: An Alternative Way to Gain from Unlabeled Text Learning the Semantic Correlation: An Alternative Way to Gain from Unlabeled Text Yi Zhang Machine Learning Department Carnegie Mellon University yizhang1@cs.cmu.edu Jeff Schneider The Robotics Institute

More information

An indicator for the number of clusters using a linear map to simplex structure

An indicator for the number of clusters using a linear map to simplex structure An indicator for the number of clusters using a linear map to simplex structure Marcus Weber, Wasinee Rungsarityotin, and Alexander Schliep Zuse Institute Berlin ZIB Takustraße 7, D-495 Berlin, Germany

More information

Clustering VS Classification

Clustering VS Classification MCQ Clustering VS Classification 1. What is the relation between the distance between clusters and the corresponding class discriminability? a. proportional b. inversely-proportional c. no-relation Ans:

More information

STAT 992 Paper Review: Sure Independence Screening in Generalized Linear Models with NP-Dimensionality J.Fan and R.Song

STAT 992 Paper Review: Sure Independence Screening in Generalized Linear Models with NP-Dimensionality J.Fan and R.Song STAT 992 Paper Review: Sure Independence Screening in Generalized Linear Models with NP-Dimensionality J.Fan and R.Song Presenter: Jiwei Zhao Department of Statistics University of Wisconsin Madison April

More information

Learning Task Grouping and Overlap in Multi-Task Learning

Learning Task Grouping and Overlap in Multi-Task Learning Learning Task Grouping and Overlap in Multi-Task Learning Abhishek Kumar Hal Daumé III Department of Computer Science University of Mayland, College Park 20 May 2013 Proceedings of the 29 th International

More information

Empirical properties of large covariance matrices in finance

Empirical properties of large covariance matrices in finance Empirical properties of large covariance matrices in finance Ex: RiskMetrics Group, Geneva Since 2010: Swissquote, Gland December 2009 Covariance and large random matrices Many problems in finance require

More information

Classification 2: Linear discriminant analysis (continued); logistic regression

Classification 2: Linear discriminant analysis (continued); logistic regression Classification 2: Linear discriminant analysis (continued); logistic regression Ryan Tibshirani Data Mining: 36-462/36-662 April 4 2013 Optional reading: ISL 4.4, ESL 4.3; ISL 4.3, ESL 4.4 1 Reminder:

More information

More Spectral Clustering and an Introduction to Conjugacy

More Spectral Clustering and an Introduction to Conjugacy CS8B/Stat4B: Advanced Topics in Learning & Decision Making More Spectral Clustering and an Introduction to Conjugacy Lecturer: Michael I. Jordan Scribe: Marco Barreno Monday, April 5, 004. Back to spectral

More information

Lecture 7 Introduction to Statistical Decision Theory

Lecture 7 Introduction to Statistical Decision Theory Lecture 7 Introduction to Statistical Decision Theory I-Hsiang Wang Department of Electrical Engineering National Taiwan University ihwang@ntu.edu.tw December 20, 2016 1 / 55 I-Hsiang Wang IT Lecture 7

More information

Linear Algebra and Dirac Notation, Pt. 2

Linear Algebra and Dirac Notation, Pt. 2 Linear Algebra and Dirac Notation, Pt. 2 PHYS 500 - Southern Illinois University February 1, 2017 PHYS 500 - Southern Illinois University Linear Algebra and Dirac Notation, Pt. 2 February 1, 2017 1 / 14

More information

Examples include: (a) the Lorenz system for climate and weather modeling (b) the Hodgkin-Huxley system for neuron modeling

Examples include: (a) the Lorenz system for climate and weather modeling (b) the Hodgkin-Huxley system for neuron modeling 1 Introduction Many natural processes can be viewed as dynamical systems, where the system is represented by a set of state variables and its evolution governed by a set of differential equations. Examples

More information

Convergence of Eigenspaces in Kernel Principal Component Analysis

Convergence of Eigenspaces in Kernel Principal Component Analysis Convergence of Eigenspaces in Kernel Principal Component Analysis Shixin Wang Advanced machine learning April 19, 2016 Shixin Wang Convergence of Eigenspaces April 19, 2016 1 / 18 Outline 1 Motivation

More information

Finding normalized and modularity cuts by spectral clustering. Ljubjana 2010, October

Finding normalized and modularity cuts by spectral clustering. Ljubjana 2010, October Finding normalized and modularity cuts by spectral clustering Marianna Bolla Institute of Mathematics Budapest University of Technology and Economics marib@math.bme.hu Ljubjana 2010, October Outline Find

More information

Estimation of large dimensional sparse covariance matrices

Estimation of large dimensional sparse covariance matrices Estimation of large dimensional sparse covariance matrices Department of Statistics UC, Berkeley May 5, 2009 Sample covariance matrix and its eigenvalues Data: n p matrix X n (independent identically distributed)

More information

Unsupervised dimensionality reduction

Unsupervised dimensionality reduction Unsupervised dimensionality reduction Guillaume Obozinski Ecole des Ponts - ParisTech SOCN course 2014 Guillaume Obozinski Unsupervised dimensionality reduction 1/30 Outline 1 PCA 2 Kernel PCA 3 Multidimensional

More information

Model Specification Testing in Nonparametric and Semiparametric Time Series Econometrics. Jiti Gao

Model Specification Testing in Nonparametric and Semiparametric Time Series Econometrics. Jiti Gao Model Specification Testing in Nonparametric and Semiparametric Time Series Econometrics Jiti Gao Department of Statistics School of Mathematics and Statistics The University of Western Australia Crawley

More information

Unlabeled Data: Now It Helps, Now It Doesn t

Unlabeled Data: Now It Helps, Now It Doesn t institution-logo-filena A. Singh, R. D. Nowak, and X. Zhu. In NIPS, 2008. 1 Courant Institute, NYU April 21, 2015 Outline institution-logo-filena 1 Conflicting Views in Semi-supervised Learning The Cluster

More information

The Informativeness of k-means for Learning Mixture Models

The Informativeness of k-means for Learning Mixture Models The Informativeness of k-means for Learning Mixture Models Vincent Y. F. Tan (Joint work with Zhaoqiang Liu) National University of Singapore June 18, 2018 1/35 Gaussian distribution For F dimensions,

More information

1 EM algorithm: updating the mixing proportions {π k } ik are the posterior probabilities at the qth iteration of EM.

1 EM algorithm: updating the mixing proportions {π k } ik are the posterior probabilities at the qth iteration of EM. Université du Sud Toulon - Var Master Informatique Probabilistic Learning and Data Analysis TD: Model-based clustering by Faicel CHAMROUKHI Solution The aim of this practical wor is to show how the Classification

More information

ECE 521. Lecture 11 (not on midterm material) 13 February K-means clustering, Dimensionality reduction

ECE 521. Lecture 11 (not on midterm material) 13 February K-means clustering, Dimensionality reduction ECE 521 Lecture 11 (not on midterm material) 13 February 2017 K-means clustering, Dimensionality reduction With thanks to Ruslan Salakhutdinov for an earlier version of the slides Overview K-means clustering

More information

Exponential tail inequalities for eigenvalues of random matrices

Exponential tail inequalities for eigenvalues of random matrices Exponential tail inequalities for eigenvalues of random matrices M. Ledoux Institut de Mathématiques de Toulouse, France exponential tail inequalities classical theme in probability and statistics quantify

More information

Unsupervised Learning

Unsupervised Learning 2018 EE448, Big Data Mining, Lecture 7 Unsupervised Learning Weinan Zhang Shanghai Jiao Tong University http://wnzhang.net http://wnzhang.net/teaching/ee448/index.html ML Problem Setting First build and

More information

Assumptions for the theoretical properties of the

Assumptions for the theoretical properties of the Statistica Sinica: Supplement IMPUTED FACTOR REGRESSION FOR HIGH- DIMENSIONAL BLOCK-WISE MISSING DATA Yanqing Zhang, Niansheng Tang and Annie Qu Yunnan University and University of Illinois at Urbana-Champaign

More information

Convex Optimization M2

Convex Optimization M2 Convex Optimization M2 Lecture 8 A. d Aspremont. Convex Optimization M2. 1/57 Applications A. d Aspremont. Convex Optimization M2. 2/57 Outline Geometrical problems Approximation problems Combinatorial

More information

Knowledge Discovery and Data Mining 1 (VO) ( )

Knowledge Discovery and Data Mining 1 (VO) ( ) Knowledge Discovery and Data Mining 1 (VO) (707.003) Review of Linear Algebra Denis Helic KTI, TU Graz Oct 9, 2014 Denis Helic (KTI, TU Graz) KDDM1 Oct 9, 2014 1 / 74 Big picture: KDDM Probability Theory

More information

What is semi-supervised learning?

What is semi-supervised learning? What is semi-supervised learning? In many practical learning domains, there is a large supply of unlabeled data but limited labeled data, which can be expensive to generate text processing, video-indexing,

More information

Cheng Soon Ong & Christian Walder. Canberra February June 2018

Cheng Soon Ong & Christian Walder. Canberra February June 2018 Cheng Soon Ong & Christian Walder Research Group and College of Engineering and Computer Science Canberra February June 2018 Outlines Overview Introduction Linear Algebra Probability Linear Regression

More information

The circular law. Lewis Memorial Lecture / DIMACS minicourse March 19, Terence Tao (UCLA)

The circular law. Lewis Memorial Lecture / DIMACS minicourse March 19, Terence Tao (UCLA) The circular law Lewis Memorial Lecture / DIMACS minicourse March 19, 2008 Terence Tao (UCLA) 1 Eigenvalue distributions Let M = (a ij ) 1 i n;1 j n be a square matrix. Then one has n (generalised) eigenvalues

More information

Dissertation Defense

Dissertation Defense Clustering Algorithms for Random and Pseudo-random Structures Dissertation Defense Pradipta Mitra 1 1 Department of Computer Science Yale University April 23, 2008 Mitra (Yale University) Dissertation

More information

Covariance function estimation in Gaussian process regression

Covariance function estimation in Gaussian process regression Covariance function estimation in Gaussian process regression François Bachoc Department of Statistics and Operations Research, University of Vienna WU Research Seminar - May 2015 François Bachoc Gaussian

More information

Lecture 8 : Eigenvalues and Eigenvectors

Lecture 8 : Eigenvalues and Eigenvectors CPS290: Algorithmic Foundations of Data Science February 24, 2017 Lecture 8 : Eigenvalues and Eigenvectors Lecturer: Kamesh Munagala Scribe: Kamesh Munagala Hermitian Matrices It is simpler to begin with

More information

PROBABILITY: LIMIT THEOREMS II, SPRING HOMEWORK PROBLEMS

PROBABILITY: LIMIT THEOREMS II, SPRING HOMEWORK PROBLEMS PROBABILITY: LIMIT THEOREMS II, SPRING 218. HOMEWORK PROBLEMS PROF. YURI BAKHTIN Instructions. You are allowed to work on solutions in groups, but you are required to write up solutions on your own. Please

More information

Vector spaces. DS-GA 1013 / MATH-GA 2824 Optimization-based Data Analysis.

Vector spaces. DS-GA 1013 / MATH-GA 2824 Optimization-based Data Analysis. Vector spaces DS-GA 1013 / MATH-GA 2824 Optimization-based Data Analysis http://www.cims.nyu.edu/~cfgranda/pages/obda_fall17/index.html Carlos Fernandez-Granda Vector space Consists of: A set V A scalar

More information

6-1. Canonical Correlation Analysis

6-1. Canonical Correlation Analysis 6-1. Canonical Correlation Analysis Canonical Correlatin analysis focuses on the correlation between a linear combination of the variable in one set and a linear combination of the variables in another

More information

The University of Texas at Austin Department of Electrical and Computer Engineering. EE381V: Large Scale Learning Spring 2013.

The University of Texas at Austin Department of Electrical and Computer Engineering. EE381V: Large Scale Learning Spring 2013. The University of Texas at Austin Department of Electrical and Computer Engineering EE381V: Large Scale Learning Spring 2013 Assignment 1 Caramanis/Sanghavi Due: Thursday, Feb. 7, 2013. (Problems 1 and

More information