A Random Matrix Framework for BigData Machine Learning and Applications to Wireless Communications
Romain COUILLET, CentraleSupélec, France (EURECOM). June.
Outline

- Basics of Random Matrix Theory
  - Motivation: Large Sample Covariance Matrices
  - Spiked Models
- Applications
  - Reminder on Spectral Clustering Methods
  - Kernel Spectral Clustering
  - The case f'(τ) = 0: an application to massive MIMO user clustering
  - Semi-supervised Learning
  - Random Feature Maps, Extreme Learning Machines, and Neural Networks
- Perspectives
Basics of Random Matrix Theory / Motivation: Large Sample Covariance Matrices

Context

Baseline scenario: $y_1, \ldots, y_n \in \mathbb{C}^p$ (or $\mathbb{R}^p$) i.i.d. with $E[y_1] = 0$, $E[y_1 y_1^*] = C_p$.

If $y_1 \sim \mathcal{N}(0, C_p)$, the ML estimator for $C_p$ is the sample covariance matrix (SCM)
$$\hat C_p = \frac{1}{n} Y_p Y_p^* = \frac{1}{n} \sum_{i=1}^n y_i y_i^*, \qquad Y_p = [y_1, \ldots, y_n] \in \mathbb{C}^{p \times n}.$$

If $n \to \infty$ with $p$ fixed, the strong law of large numbers gives $\hat C_p \xrightarrow{\text{a.s.}} C_p$, or equivalently, in spectral norm, $\|\hat C_p - C_p\| \xrightarrow{\text{a.s.}} 0$.

Random Matrix Regime. This is no longer valid if $p, n \to \infty$ with $p/n \to c \in (0, \infty)$: then $\|\hat C_p - C_p\| \not\to 0$. For practical $p, n$ with $p \simeq n$, this leads to dramatically wrong conclusions.
The Marčenko-Pastur law

Figure: Histogram of the eigenvalues of $\hat C_p$ for $p = 500$, $n = 2000$, $C_p = I_p$, against the Marčenko-Pastur law.
The Marčenko-Pastur law

Definition (Empirical Spectral Density). The empirical spectral density (e.s.d.) $\mu_p$ of a Hermitian matrix $A_p \in \mathbb{C}^{p \times p}$ is
$$\mu_p = \frac{1}{p} \sum_{i=1}^p \delta_{\lambda_i(A_p)}.$$

Theorem (Marčenko-Pastur Law [Marčenko, Pastur 1967]). Let $X_p \in \mathbb{C}^{p \times n}$ have i.i.d. zero mean, unit variance entries. As $p, n \to \infty$ with $p/n \to c \in (0, \infty)$, the e.s.d. $\mu_p$ of $\frac{1}{n} X_p X_p^*$ satisfies $\mu_p \xrightarrow{\text{a.s.}} \mu_c$ weakly, where
- $\mu_c(\{0\}) = \max\{0, 1 - c^{-1}\}$;
- on $(0, \infty)$, $\mu_c$ has a continuous density $f_c$ supported on $[(1-\sqrt c)^2, (1+\sqrt c)^2]$,
$$f_c(x) = \frac{1}{2\pi c x} \sqrt{\left(x - (1-\sqrt c)^2\right)\left((1+\sqrt c)^2 - x\right)}.$$
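As a minimal numerical sketch of the theorem (not part of the original slides; the parameters $p = 500$, $n = 2000$ are chosen for illustration), one can check that the eigenvalues of $\frac1n X_p X_p^*$ fall inside the Marčenko-Pastur support and that the density integrates to one:

```python
import numpy as np

# Sketch: eigenvalues of the sample covariance model (1/n) X X^T versus
# the Marchenko-Pastur support for c = p/n (illustrative sizes).
rng = np.random.default_rng(0)
p, n = 500, 2000
c = p / n
X = rng.standard_normal((p, n))
eigs = np.linalg.eigvalsh(X @ X.T / n)

def mp_density(x, c):
    """Marchenko-Pastur density on [(1 - sqrt(c))^2, (1 + sqrt(c))^2]."""
    lm, lp = (1 - np.sqrt(c)) ** 2, (1 + np.sqrt(c)) ** 2
    out = np.zeros_like(x)
    inside = (x > lm) & (x < lp)
    out[inside] = np.sqrt((x[inside] - lm) * (lp - x[inside])) / (2 * np.pi * c * x[inside])
    return out

lm, lp = (1 - np.sqrt(c)) ** 2, (1 + np.sqrt(c)) ** 2
print(lm, lp)  # support edges: 0.25 and 2.25 for c = 1/4
```

With $c = 1/4$ the support is $[0.25, 2.25]$, matching the histogram of the previous figure.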
The Marčenko-Pastur law

Figure: Marčenko-Pastur density $f_c$ for different limit ratios $c = \lim_p p/n$ (e.g. $c = 0.1$).
Spiked Models

Small rank perturbation: $C_p = I_p + P$, $P$ of low rank.

Figure: Eigenvalues $\{\lambda_i\}_{i=1}^n$ of $\frac1n Y_p Y_p^*$ for $C_p = \mathrm{diag}(1, \ldots, 1, 2, 2, 3, 3)$ (with $p - 4$ ones), $p = 500$; isolated spikes appear at $1 + \omega_k + c \frac{1 + \omega_k}{\omega_k}$ outside the Marčenko-Pastur bulk $\mu$.
Spiked Models

Theorem (Eigenvalues [Baik, Silverstein 2006]). Let $Y_p = C_p^{1/2} X_p$, with $X_p$ having i.i.d. zero mean, unit variance entries, $E[|X_{p,ij}|^4] < \infty$, and
$$C_p = I_p + P, \qquad P = U \Omega U^*,$$
where, for $K$ fixed, $\Omega = \mathrm{diag}(\omega_1, \ldots, \omega_K) \in \mathbb{R}^{K \times K}$ with $\omega_1 \geq \ldots \geq \omega_K > 0$. Then, as $p, n \to \infty$ with $p/n \to c \in (0, \infty)$, denoting $\lambda_m = \lambda_m(\frac1n Y_p Y_p^*)$ (with $\lambda_m \geq \lambda_{m+1}$),
$$\lambda_m \xrightarrow{\text{a.s.}} \begin{cases} 1 + \omega_m + c \dfrac{1 + \omega_m}{\omega_m} > (1 + \sqrt c)^2, & \omega_m > \sqrt c, \\ (1 + \sqrt c)^2, & \omega_m \in (0, \sqrt c]. \end{cases}$$
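A short numerical sketch of the phase-transition formula (not from the slides; the rank-one spike, sizes, and seed are illustrative assumptions): for $\omega > \sqrt c$ the top sample eigenvalue should sit near $1 + \omega + c(1+\omega)/\omega$, well outside the bulk edge $(1+\sqrt c)^2$.

```python
import numpy as np

# Sketch of the Baik-Silverstein spike location: C = I + w u u^T, one spike.
rng = np.random.default_rng(1)
p, n = 800, 3200              # c = 1/4, so sqrt(c) = 1/2
c = p / n
omega = 2.0                   # above the threshold sqrt(c)
u = np.zeros(p); u[0] = 1.0
X = rng.standard_normal((p, n))
# C^{1/2} = I + (sqrt(1 + w) - 1) u u^T for the rank-one spike
Y = (np.eye(p) + (np.sqrt(1 + omega) - 1) * np.outer(u, u)) @ X
top = np.linalg.eigvalsh(Y @ Y.T / n)[-1]
predicted = 1 + omega + c * (1 + omega) / omega   # = 3.375 here
print(top, predicted)
```

The agreement improves as $p, n$ grow at fixed ratio.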
Spiked Models

Theorem (Eigenvectors [Paul 2007]). Let $Y_p = C_p^{1/2} X_p$, with $X_p$ having i.i.d. zero mean, unit variance entries, $E[|X_{p,ij}|^4] < \infty$, and $C_p = I_p + P$, $P = U \Omega U^* = \sum_{i=1}^K \omega_i u_i u_i^*$, $\omega_1 > \ldots > \omega_K > 0$. Then, as $p, n \to \infty$ with $p/n \to c \in (0, \infty)$, for $a, b \in \mathbb{C}^p$ deterministic and $\hat u_i$ the eigenvector associated with $\lambda_i(\frac1n Y_p Y_p^*)$,
$$a^* \hat u_i \hat u_i^* b - \frac{1 - c\,\omega_i^{-2}}{1 + c\,\omega_i^{-1}}\, a^* u_i u_i^* b \; 1_{\omega_i > \sqrt c} \xrightarrow{\text{a.s.}} 0.$$
In particular,
$$|\hat u_i^* u_i|^2 \xrightarrow{\text{a.s.}} \frac{1 - c\,\omega_i^{-2}}{1 + c\,\omega_i^{-1}}\, 1_{\omega_i > \sqrt c}.$$
Spiked Models

Figure: Simulated versus limiting $|\hat u_1^* u_1|^2$ for $Y_p = C_p^{1/2} X_p$, $C_p = I_p + \omega_1 u_1 u_1^*$, $p/n = 1/3$, varying $\omega_1$; curves for $p = 100, 200, 400$ against the limit $\frac{1 - c/\omega_1^2}{1 + c/\omega_1}$.
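The eigenvector alignment can be checked numerically as well (a sketch under the same illustrative rank-one model as above, not from the slides):

```python
import numpy as np

# Sketch of Paul's result: alignment |u_hat^T u|^2 of the top sample
# eigenvector with the population spike approaches (1 - c/w^2)/(1 + c/w).
rng = np.random.default_rng(2)
p, n = 600, 1800              # c = 1/3
c = p / n
omega = 2.0
u = np.zeros(p); u[0] = 1.0
X = rng.standard_normal((p, n))
Y = (np.eye(p) + (np.sqrt(1 + omega) - 1) * np.outer(u, u)) @ X
_, vecs = np.linalg.eigh(Y @ Y.T / n)
alignment = vecs[:, -1] @ u   # eigh sorts ascending: top eigenvector is last
theory = (1 - c / omega ** 2) / (1 + c / omega)   # = 11/14 here
print(alignment ** 2, theory)
```

Note the limit is strictly below 1: even far above the threshold, the sample eigenvector never fully aligns with the population one.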
Other Spiked Models

Similar results hold for multiple matrix models:
$$Y_p = \frac1n X_p X_p^* + P, \quad Y_p = \frac1n X_p^* (I + P) X_p, \quad Y_p = \frac1n (X_p + P)^* (X_p + P), \quad Y_p = \frac1n T^* X_p^* (I + P) X_p T, \quad \text{etc.}$$
Applications / Reminder on Spectral Clustering Methods

Reminder on Spectral Clustering Methods

Context: two-step classification of $n$ objects based on a similarity matrix $A \in \mathbb{R}^{n \times n}$:
1. Extract the eigenvectors associated with the isolated eigenvalues ("spikes") of $A$; in practice their entries are shuffled along with the data.
2. Build the $\ell$-dimensional representation whose rows are the entries of these leading eigenvectors (the shuffling no longer matters), then apply EM or k-means clustering.
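The two steps above can be sketched numerically (toy data, not from the slides; for $k = 2$ classes a sign threshold on the one-dimensional representation stands in for k-means):

```python
import numpy as np

# Sketch: noisy rank-one similarity A = s s^T + noise; the leading
# eigenvector recovers the class signs, then we "cluster" by sign.
rng = np.random.default_rng(3)
n = 200
labels_true = np.repeat([0, 1], n // 2)
s = np.where(labels_true == 0, 1.0, -1.0)
noise = rng.standard_normal((n, n))
A = np.outer(s, s) + 2.0 * (noise + noise.T) / np.sqrt(2 * n)
v = np.linalg.eigh(A)[1][:, -1]          # leading eigenvector of A
lab = (v > 0).astype(int)
# class labels are recovered up to a global sign flip
accuracy = max((lab == labels_true).mean(), (lab != labels_true).mean())
print(accuracy)
```

Here the spike ($\|ss^T\| = n$) dominates the noise bulk, so the leading eigenvector is essentially noiseless.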
Kernel Spectral Clustering

Problem Statement
- Dataset $x_1, \ldots, x_n \in \mathbb{R}^p$.
- Objective: cluster the data into $k$ similarity classes $\mathcal{C}_1, \ldots, \mathcal{C}_k$.
- Kernel spectral clustering is based on the kernel matrix $K = \{\kappa(x_i, x_j)\}_{i,j=1}^n$.
- Usually, $\kappa(x, y) = f(x^T y)$ or $\kappa(x, y) = f(\|x - y\|^2)$.
- Refinements: instead of $K$, use $D - K$, $I_n - D^{-1} K$, $I_n - D^{-1/2} K D^{-1/2}$, etc.; several-step algorithms: Ng-Jordan-Weiss, Shi-Malik, etc.

Intuition (from small dimensions): $K$ is essentially low rank, with the class structure carried by its eigenvectors.
Kernel Spectral Clustering

Figure: Leading four eigenvectors of $D^{-1/2} K D^{-1/2}$ for MNIST data.
Model and Assumptions

Gaussian mixture model: $x_1, \ldots, x_n \in \mathbb{R}^p$ in $k$ classes $\mathcal{C}_1, \ldots, \mathcal{C}_k$, with $x_1, \ldots, x_{n_1} \in \mathcal{C}_1$, ..., $x_{n - n_k + 1}, \ldots, x_n \in \mathcal{C}_k$, and $x_i \sim \mathcal{N}(\mu_{g_i}, C_{g_i})$.

Assumption (Convergence Rate). As $n \to \infty$:
1. Data scaling: $\frac pn \to c_0 \in (0, \infty)$;
2. Class scaling: $\frac{n_a}{n} \to c_a \in (0, 1)$;
3. Mean scaling: with $\mu \triangleq \sum_{a=1}^k \frac{n_a}{n} \mu_a$ and $\mu_a^\circ \triangleq \mu_a - \mu$, then $\|\mu_a^\circ\| = O(1)$;
4. Covariance scaling: with $C \triangleq \sum_{a=1}^k \frac{n_a}{n} C_a$ and $C_a^\circ \triangleq C_a - C$, then $\|C_a\| = O(1)$, $\operatorname{tr} C_a^\circ = O(\sqrt p)$, $\operatorname{tr} C_a^\circ C_b^\circ = O(p)$.

Remark: for 2 classes, this reads
$$\|\mu_1 - \mu_2\| = O(1), \quad \operatorname{tr}(C_1 - C_2) = O(\sqrt p), \quad \|C_i\| = O(1), \quad \operatorname{tr}\left([C_1 - C_2]^2\right) = O(p).$$
Model and Assumptions

Kernel matrix of interest:
$$K = \left\{ f\left( \frac1p \|x_i - x_j\|^2 \right) \right\}_{i,j=1}^n$$
for some sufficiently smooth nonnegative $f$ (the case $f(\frac1p x_i^T x_j)$ is simpler).

We study the normalized Laplacian
$$L = n D^{-1/2} \left( K - \frac{d d^T}{d^T 1_n} \right) D^{-1/2}, \qquad d = K 1_n, \quad D = \mathrm{diag}(d).$$
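A minimal sketch of these two objects (illustrative Gaussian data, $f(t) = e^{-t/2}$ as an assumed kernel choice, not a prescription from the slides); by construction $D^{1/2} 1_n$ lies in the kernel of $L$, which gives a quick sanity check:

```python
import numpy as np

# Sketch: kernel matrix K_ij = f(||x_i - x_j||^2 / p) with f(t) = exp(-t/2),
# and the normalized Laplacian L = n D^{-1/2} (K - d d^T / (d^T 1)) D^{-1/2}.
rng = np.random.default_rng(4)
p, n = 100, 60
X = rng.standard_normal((p, n))          # columns are the data points
sq = (X * X).sum(0)
dist2 = sq[:, None] + sq[None, :] - 2 * X.T @ X   # pairwise ||x_i - x_j||^2
K = np.exp(-dist2 / (2 * p))
d = K @ np.ones(n)
D_inv_sqrt = np.diag(1 / np.sqrt(d))
L = n * D_inv_sqrt @ (K - np.outer(d, d) / d.sum()) @ D_inv_sqrt
# L annihilates D^{1/2} 1_n exactly (up to floating-point roundoff)
residual = np.abs(L @ np.sqrt(d)).max()
print(residual)
```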
Random Matrix Equivalent

Key Remark: under our assumptions, uniformly over $i \neq j \in \{1, \ldots, n\}$,
$$\frac1p \|x_i - x_j\|^2 \xrightarrow{\text{a.s.}} \tau > 0.$$

This allows for a Taylor expansion of $K$:
$$K = \underbrace{f(\tau) 1_n 1_n^T}_{O_{\|\cdot\|}(n)} + \underbrace{\sqrt n\, K_1}_{\text{low rank, } O_{\|\cdot\|}(\sqrt n)} + \underbrace{K_2}_{\text{informative terms, } O_{\|\cdot\|}(1)}.$$

This is, however, not the (small dimension) intuitive behavior.
Random Matrix Equivalent

Theorem (Random Matrix Equivalent [Couillet, Benaych-Georges 2015]). As $n, p \to \infty$, $\|L - \hat L\| \xrightarrow{\text{a.s.}} 0$, where
$$L = n D^{-1/2} \left( K - \frac{d d^T}{d^T 1_n} \right) D^{-1/2}, \qquad K_{ij} = f\left( \frac1p \|x_i - x_j\|^2 \right),$$
and
$$\hat L = 2 \frac{f'(\tau)}{f(\tau)} \left[ \frac1p \Pi W^T W \Pi + \frac1p J B J^T + * \right]$$
with $W = [w_1, \ldots, w_n] \in \mathbb{R}^{p \times n}$ ($x_i = \mu_{g_i} + w_i$), $\Pi = I_n - \frac1n 1_n 1_n^T$,
$$J = [j_1, \ldots, j_k], \quad j_a^T = (0, \ldots, 0, 1_{n_a}^T, 0, \ldots, 0),$$
$$B = M^T M + \left( \frac{5 f'(\tau)}{8 f(\tau)} - \frac{f''(\tau)}{2 f'(\tau)} \right) t t^T - \frac{f''(\tau)}{f'(\tau)} T + *.$$
Recall $M = [\mu_1^\circ, \ldots, \mu_k^\circ]$, $t = \left[ \frac{1}{\sqrt p} \operatorname{tr} C_1^\circ, \ldots, \frac{1}{\sqrt p} \operatorname{tr} C_k^\circ \right]^T$, $T = \left\{ \frac1p \operatorname{tr} C_a^\circ C_b^\circ \right\}_{a,b=1}^k$.
Isolated eigenvalues: Gaussian inputs

Figure: Eigenvalues of $L$ and $\hat L$, $k = 3$, $p = 2048$, $n = 512$, $c_1 = c_2 = 1/4$, $c_3 = 1/2$, $[\mu_a]_j = 4\delta_{aj}$, $C_a = (1 + 2(a-1)/\sqrt p) I_p$, $f(x) = \exp(-x/2)$.
Theoretical Findings versus MNIST

Figure: Eigenvalues of $L$ (red) and of the equivalent Gaussian model $\hat L$ (white), as if the data were Gaussian; MNIST data, $p = 784$.
Theoretical Findings versus MNIST

Figure: Leading four eigenvectors of $D^{-1/2} K D^{-1/2}$ for MNIST data (red) and theoretical findings (blue).
Theoretical Findings versus MNIST

Figure: 2D representation of the eigenvectors of $L$ for the MNIST dataset (eigenvector 2 vs. eigenvector 1, and eigenvector 3 vs. eigenvector 2). Theoretical means and 1- and 2-standard deviations in blue. Class 1 in red, Class 2 in black, Class 3 in green.
The surprising f'(τ) = 0 case

Figure: Classification error for a polynomial kernel with $f(\tau) = 4$, $f'(\tau) = 2$, as a function of $f''(\tau)$; $x_i \sim \mathcal{N}(0, C_a)$ with $C_1 = I_p$, $[C_2]_{i,j} = .4^{|i-j|}$; curves for $p = 512$ and $p = 1024$ against theory.
Applications / The case f'(τ) = 0: an application to massive MIMO user clustering

Position of the Problem

Problem: cluster large data $x_1, \ldots, x_n \in \mathbb{R}^p$ based on their spanned subspaces.

Method:
- Still assume $x_1, \ldots, x_n$ belong to $k$ classes $\mathcal{C}_1, \ldots, \mathcal{C}_k$.
- Zero-mean Gaussian model for the data: for $x_i \in \mathcal{C}_a$, $x_i \sim \mathcal{N}(0, C_a)$.
- Study, in the regime $n, p \to \infty$, the performance of
$$L = n D^{-1/2} \left( K - \frac{d d^T}{d^T 1_n} \right) D^{-1/2}, \qquad K = \left\{ f\left( \|\breve x_i - \breve x_j\|^2 \right) \right\}_{i,j=1}^n, \quad \breve x = \frac{x}{\|x\|}.$$
Model and Reminders

Assumption 1 [Classes]. Vectors $x_1, \ldots, x_n \in \mathbb{R}^p$ are i.i.d. from a $k$-class Gaussian mixture, with $x_i \in \mathcal{C}_a \Leftrightarrow x_i \sim \mathcal{N}(0, C_a)$ (sorted by class for simplicity).

Assumption 2a [Growth Rates]. As $n \to \infty$, for each $a \in \{1, \ldots, k\}$:
1. $\frac np \to c_0 \in (0, \infty)$;
2. $\frac{n_a}{n} \to c_a \in (0, \infty)$;
3. $\frac1p \operatorname{tr} C_a = 1$ and $\operatorname{tr} C_a^\circ C_b^\circ = O(p)$, with $C_a^\circ = C_a - C$, $C = \sum_{b=1}^k c_b C_b$.

Theorem (Corollary of the Previous Section). Let $f$ be smooth with $f'(2) \neq 0$. Then, under Assumptions 1-2a,
$$L = n D^{-1/2} \left( K - \frac{d d^T}{d^T 1_n} \right) D^{-1/2}, \qquad K = \left\{ f\left( \|\breve x_i - \breve x_j\|^2 \right) \right\}_{i,j=1}^n \quad (\breve x = x/\|x\|),$$
exhibits a phase transition phenomenon: the leading eigenvectors of $L$ asymptotically contain structural information about $\mathcal{C}_1, \ldots, \mathcal{C}_k$ if and only if
$$T = \left\{ \frac1p \operatorname{tr} C_a^\circ C_b^\circ \right\}_{a,b=1}^k$$
has sufficiently large eigenvalues.
The case f'(2) = 0

Assumption 2b [Growth Rates]. As $n \to \infty$, for each $a \in \{1, \ldots, k\}$:
1. $\frac np \to c_0 \in (0, \infty)$;
2. $\frac{n_a}{n} \to c_a \in (0, \infty)$;
3. $\frac1p \operatorname{tr} C_a = 1$ and $\operatorname{tr} C_a^\circ C_b^\circ = O(\sqrt p)$, with $C_a^\circ = C_a - C$, $C = \sum_{b=1}^k c_b C_b$.

(In this regime, the previous kernels clearly fail.)

Theorem (Random Equivalent for f'(2) = 0). Let $f$ be smooth with $f'(2) = 0$ and
$$L' \triangleq \frac{p\, f(2)}{2 f''(2)} \left[ L - \frac{f(0) - f(2)}{f(2)} P \right], \qquad P = I_n - \frac1n 1_n 1_n^T.$$
Then, under Assumptions 1-2b,
$$L' = \frac{1}{\sqrt p} \left[ P \Phi P + \left\{ \frac1p \operatorname{tr}(C_a^\circ C_b^\circ)\, 1_{n_a} 1_{n_b}^T \right\}_{a,b=1}^k \right] + o_{\|\cdot\|}(1)$$
where $\Phi_{ij} = \delta_{i \neq j}\, p \left[ (\breve x_i^T \breve x_j)^2 - E[(\breve x_i^T \breve x_j)^2] \right]$.
The case f'(2) = 0

Figure: Eigenvalues of $L'$ (isolated eigenvalues $\lambda_1(L')$, $\lambda_2(L')$ marked), $p = 1000$, $n = 2000$, $k = 3$, $c_1 = c_2 = 1/4$, $c_3 = 1/2$, $C_i \propto I_p + (p/8)^{-5/4} W_i W_i^T$, $W_i \in \mathbb{R}^{p \times (p/8)}$ with i.i.d. $\mathcal{N}(0, 1)$ entries, $f(t) = \exp(-(t-2)^2)$.

No longer a Marčenko-Pastur-like bulk, but rather a semi-circle bulk!
The case f'(2) = 0

Roadmap. We now need to:
- study the spectrum of $\Phi$;
- study the isolated eigenvalues of $L'$ (and the phase transition);
- retrieve information from the eigenvectors.

Theorem (Semi-circle Law for $\Phi$). Let $\mu_n = \frac1n \sum_{i=1}^n \delta_{\lambda_i(L')}$. Then, under Assumptions 1-2b, $\mu_n \xrightarrow{\text{a.s.}} \mu$, with $\mu$ the semi-circle distribution
$$\mu(dt) = \frac{1}{2\pi c_0 \omega^2} \sqrt{(4 c_0 \omega^2 - t^2)^+}\, dt, \qquad \omega = \lim_p \sqrt{\frac2p \operatorname{tr}(C^2)}.$$
Figure: Eigenvalues of $L'$ against the semi-circle law; $p = 1000$, $n = 2000$, $k = 3$, $c_1 = c_2 = 1/4$, $c_3 = 1/2$, $C_i \propto I_p + (p/8)^{-5/4} W_i W_i^T$, $W_i \in \mathbb{R}^{p \times (p/8)}$ with i.i.d. $\mathcal{N}(0, 1)$ entries, $f(t) = \exp(-(t-2)^2)$.
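A reduced numerical sketch of the semi-circle bulk (not from the slides; it takes the simplest case $C = I_p$, one class, for which $\omega^2 = 2$ and the bulk edge is $2\sqrt{2n/p}$, and the normalizations follow the reconstructed equations above):

```python
import numpy as np

# Sketch: Phi_ij = p[(x_i^T x_j)^2 - 1/p] on unit-norm vectors (zero diagonal);
# P Phi P / sqrt(p) exhibits a semi-circle-like bulk of radius ~ 2*sqrt(2n/p).
rng = np.random.default_rng(8)
p, n = 800, 400
X = rng.standard_normal((p, n))
Xn = X / np.linalg.norm(X, axis=0)       # x breve = x / ||x||
G = Xn.T @ Xn
Phi = p * (G ** 2 - 1 / p)               # E[(x_i^T x_j)^2] = 1/p for i != j
np.fill_diagonal(Phi, 0)
P = np.eye(n) - np.ones((n, n)) / n
M = P @ Phi @ P / np.sqrt(p)
eigs = np.linalg.eigvalsh(M)
radius = 2 * np.sqrt(2 * n / p)          # = 2 sqrt(c0 w^2) with w^2 = 2
print(eigs.min(), eigs.max(), radius)
```

With $n/p = 1/2$ the predicted edge is $2$; the empirical extremes should lie near $\pm 2$, far from a Marčenko-Pastur shape.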
The case f'(2) = 0

Denote now
$$T \triangleq \left\{ \sqrt{c_a c_b} \lim_p \frac{1}{\sqrt p} \operatorname{tr}(C_a^\circ C_b^\circ) \right\}_{a,b=1}^k.$$

Theorem (Isolated Eigenvalues). Let $\nu_1 \geq \ldots \geq \nu_k$ be the eigenvalues of $T$. Then, if $\sqrt{c_0}\, \nu_i > \omega$, $L'$ has an isolated eigenvalue $\lambda_i$ satisfying
$$\lambda_i \xrightarrow{\text{a.s.}} \rho_i \triangleq c_0 \nu_i + \frac{\omega^2}{\nu_i}.$$
The case f'(2) = 0

Theorem (Isolated Eigenvectors). For each isolated eigenpair $(\lambda_i, u_i)$ of $L'$ corresponding to $(\nu_i, v_i)$ of $T$, write
$$u_i = \sum_{a=1}^k \alpha_i^a \frac{j_a}{\sqrt{n_a}} + \sigma_i^a w_i^a$$
with $j_a = [0_{n_1}^T, \ldots, 1_{n_a}^T, \ldots, 0_{n_k}^T]^T$, $(w_i^a)^T j_a = 0$, $\mathrm{supp}(w_i^a) = \mathrm{supp}(j_a)$, $\|w_i^a\| = 1$. Then, under Assumptions 1-2b,
$$\alpha_i^a \alpha_i^b \xrightarrow{\text{a.s.}} \left( 1 - \frac{\omega^2}{c_0 \nu_i^2} \right) [v_i v_i^T]_{ab}, \qquad (\sigma_i^a)^2 \xrightarrow{\text{a.s.}} \frac{c_a\, \omega^2}{c_0 \nu_i^2},$$
and the fluctuations of $u_i, u_j$, $i \neq j$, are asymptotically uncorrelated.
The case f'(2) = 0

Figure: Leading two eigenvectors of $L'$ (or equivalently of $L$) versus the deterministic approximations $\alpha_i^a \pm \sigma_i^a$ (e.g. $\frac{1}{\sqrt{n_1}}(\alpha_1^1 \pm \sigma_1^1)$, $\frac{1}{\sqrt{n_2}}(\alpha_1^2 \pm \sigma_1^2)$, $\frac{1}{\sqrt{n_2}}\alpha_1^2$).
The case f'(2) = 0

Figure: 2D representation of the leading two eigenvectors of $L'$ (or equivalently of $L$) versus the deterministic approximations $\alpha_i^a \pm \sigma_i^a$.
Application to Massive MIMO UE Clustering
Massive MIMO UE Clustering

Setting:
- Massive MIMO cell with $p$ antenna elements;
- $n$ user equipments (UEs) with channels $x_1, \ldots, x_n \in \mathbb{R}^p$;
- UEs belong to solid-angle groups, i.e., $E[x_i] = 0$, $E[x_i x_i^T] = C_a \triangleq C(\Theta_a)$;
- $T$ independent channel observations $x_i^{(1)}, \ldots, x_i^{(T)}$ for each UE $i$.

Objective: cluster users into the same solid-angle groups (for scheduling reasons, to avoid pilot contamination).

Algorithm:
1. Build the kernel matrix $K$, then $L$, based on the $nT$ vectors $x_1^{(1)}, \ldots, x_n^{(T)}$ (as if there were $nT$ values to cluster).
2. Extract the dominant isolated eigenvectors $u_1, \ldots, u_\kappa$.
3. For each $i$, create $\tilde u_i = \frac1T (I_n \otimes 1_T^T) u_i$, i.e., average the eigenvectors along time.
4. Perform $k$-class clustering on the vectors $\tilde u_1, \ldots, \tilde u_\kappa$.
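Step 3 of the algorithm is a simple linear operation; the following sketch (toy sizes, not from the slides) shows that $\frac1T (I_n \otimes 1_T^T) u$ is exactly the per-user time average when the $nT$ entries are ordered user by user:

```python
import numpy as np

# Sketch of the T-averaging step: entries of u ordered as
# (user 1: obs 1..T, user 2: obs 1..T, ...); averaging reduces noise by 1/T.
n, T = 5, 4
rng = np.random.default_rng(7)
signal = rng.standard_normal(n)                  # per-user component
u = np.repeat(signal, T) + 0.1 * rng.standard_normal(n * T)
avg_op = np.kron(np.eye(n), np.ones((1, T))) / T  # (1/T)(I_n kron 1_T^T)
u_tilde = avg_op @ u
print(u_tilde.shape)   # (n,)
```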
Massive MIMO UE Clustering

Figure: Leading two eigenvectors before (left) and after (right) $T$-averaging. Setting: $p = 400$, $n = 40$, $T = 10$, $k = 3$, $c_1 = c_3 = 1/4$, $c_2 = 1/2$, angular spread model with angles $-\pi/30 \pm \pi/20$, $0 \pm \pi/20$, and $\pi/30 \pm \pi/20$. Kernel function $f(t) = \exp(-(t-2)^2)$.
Massive MIMO UE Clustering

Figure: Correct clustering probability (overlap) for different $T$, using k-means or EM, starting either from the actual centroid solutions (oracle) or randomly, compared with random guessing.
Massive MIMO UE Clustering

Figure: Correct clustering probability for the optimal kernel $f(t)$ (here $f(t) = \exp(-(t-2)^2)$) and the Gaussian kernel $f(t) = \exp(-t^2)$, for different $T$, using k-means or EM.
94 Applications/Semi-supervised Learning 48/80 Outline Basics of Random Matrix Theory Motivation: Large Sample Covariance Matrices Spiked Models Applications Reminder on Spectral Clustering Methods Kernel Spectral Clustering The case f′(τ) = 0: an application to massive MIMO user clustering Semi-supervised Learning Random Feature Maps, Extreme Learning Machines, and Neural Networks Perspectives 48 / 80
97 Applications/Semi-supervised Learning 49/80 Problem Statement
Context: Similar to clustering: classify x_1, ..., x_n ∈ R^p in k classes, with n_l labelled and n_u unlabelled data.
Problem statement: give scores F_ia (with d_i = [K1_n]_i),
F = argmin_{F ∈ R^{n×k}} Σ_{a=1}^k Σ_{i,j} K_ij (F_ia d_i^{α−1} − F_ja d_j^{α−1})²
such that F_ia = δ_{x_i ∈ C_a} for all labelled x_i.
Solution: for F^(u) ∈ R^{n_u×k}, F^(l) ∈ R^{n_l×k} the scores of unlabelled/labelled data,
F^(u) = (I_{n_u} − D_(u)^{−α} K_(u,u) D_(u)^{α−1})^{−1} D_(u)^{−α} K_(u,l) D_(l)^{α−1} F^(l)
where we naturally decompose
K = [ K_(l,l) K_(l,u) ; K_(u,l) K_(u,u) ],  D = [ D_(l) 0 ; 0 D_(u) ] = diag{K1_n}.
49 / 80
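The closed-form solution maps directly to code. A NumPy sketch, under the assumption that the labelled samples occupy the first n_l rows/columns of K; the helper name is hypothetical:

```python
import numpy as np

def ssl_unlabelled_scores(K, F_l, alpha=0.0):
    """F^(u) = (I - D_u^{-a} K_uu D_u^{a-1})^{-1} D_u^{-a} K_ul D_l^{a-1} F^(l).

    K: full n x n kernel matrix (labelled samples first),
    F_l: n_l x k one-hot labelled scores."""
    n_l = F_l.shape[0]
    d = K.sum(axis=1)                      # degrees d_i = [K 1_n]_i
    d_ma = d ** (-alpha)                   # diagonal of D^{-alpha}
    d_am1 = d ** (alpha - 1.0)             # diagonal of D^{alpha-1}
    K_uu, K_ul = K[n_l:, n_l:], K[n_l:, :n_l]
    n_u = K.shape[0] - n_l
    A = np.eye(n_u) - d_ma[n_l:, None] * K_uu * d_am1[None, n_l:]
    b = (d_ma[n_l:, None] * K_ul * d_am1[None, :n_l]) @ F_l
    return np.linalg.solve(A, b)
```

For a positive kernel, the matrix A is invertible since the degrees d_i are computed over the full kernel (labelled columns included), keeping the spectral radius of the propagated block below one.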
100 Applications/Semi-supervised Learning 50/80 MNIST Data Example
Figure: Vectors [F^(u)]_{·,a}, a = 1, 2, 3 (zeros, ones, twos), for 3-class MNIST data, n = 192, p = 784, n_l/n = 1/16, Gaussian kernel.
50 / 80
103 Applications/Semi-supervised Learning 51/80 MNIST Data Example
Figure: Centered vectors [F°^(u)]_{·,a} = [F^(u) − (1/k) F^(u) 1_k 1_k^T]_{·,a}, a = 1, 2, 3 (zeros, ones, twos), for 3-class MNIST data, α = 0, n = 192, p = 784, n_l/n = 1/16, Gaussian kernel.
51 / 80
108 Applications/Semi-supervised Learning 52/80 Main Results
Results: Assuming n_l/n → c_l ∈ (0, 1), by the previous Taylor expansion, to first order,
F^(u)_{·,a} = (n_{l,a}/n) [ v + α t_a 1_{n_u}/√n ] + O(n^{−1})
where v is an O(1) (entry-wise) random vector, the term α t_a 1_{n_u}/√n is O(n^{−1/2}), the O(n^{−1}) terms carry the information, and t_a = (1/p) tr C_a.
Consequences:
Random non-informative bias v
Strong impact of n_{l,a}: F^(u)_{·,a} is to be rescaled by n_{l,a}
Additional per-class bias α t_a 1_{n_u} ⇒ take α = 0 + β/√p.
52 / 80
110 Applications/Semi-supervised Learning 53/80 Main Results
As a consequence of the remarks above, we take α = β/√p and define F̂^(u)_{i,a} = (n√p/n_{l,a}) F^(u)_{ia}.
Theorem
For x_i ∈ C_b unlabelled, F̂_{i,·} − G_b → 0, with G_b ∼ N(m_b, Σ_b), where m_b ∈ R^k, Σ_b ∈ R^{k×k} are given by
(m_b)_a = (2f′(τ)/f(τ)) M̄_ab + (f″(τ)/f(τ)) t̄_a t̄_b + (2f″(τ)/f(τ)) T̄_ab − (f′(τ)²/f(τ)²) t̄_a t̄_b + β (n/n_l)(f′(τ)/f(τ)) t̄_a + B_b
(Σ_b)_{a1 a2} = (2 tr C_b²/p) (f″(τ)/f(τ) − f′(τ)²/f(τ)²)² t̄_{a1} t̄_{a2} + (4f′(τ)²/f(τ)²) ( [M̄^T C_b M̄]_{a1 a2} + δ_{a1}^{a2} p T̄_{b a1}/n_{l,a1} )
with t, T, M as before, X̄_a = X_a − Σ_{d=1}^k (n_{l,d}/n_l) X_d, and B_b a bias independent of a.
53 / 80
112 Applications/Semi-supervised Learning 54/80 Main Results
Corollary (Asymptotic Classification Error)
For k = 2 classes and a ≠ b,
P( F̂_{i,a} > F̂_{i,b} | x_i ∈ C_b ) − Q( ((m_b)_b − (m_b)_a) / √([1, −1] Σ_b [1, −1]^T) ) → 0.
Some consequences:
non obvious choice of appropriate kernels
non obvious choice of the optimal β (it induces a possibly beneficial bias)
importance of n_l versus n_u.
54 / 80
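For k = 2 the asymptotic error is just a Gaussian tail probability, which a tiny helper makes explicit. The m_b and Σ_b values used in the example are purely illustrative, not computed from the theorem:

```python
from math import erf, sqrt

def gaussian_Q(x):
    """Gaussian tail Q(x) = P(N(0, 1) > x) = (1 - erf(x / sqrt(2))) / 2."""
    return 0.5 * (1.0 - erf(x / sqrt(2.0)))

def asymptotic_error(m_b, Sigma_b, a, b):
    """Limit of P(F_hat_{i,a} > F_hat_{i,b} | x_i in C_b).

    gap = (m_b)_b - (m_b)_a, variance = [1, -1] Sigma_b [1, -1]^T."""
    gap = m_b[b] - m_b[a]
    w = (1.0, -1.0)
    var = sum(w[i] * Sigma_b[i][j] * w[j] for i in range(2) for j in range(2))
    return gaussian_Q(gap / sqrt(var))
```

With an illustrative gap of 1 and Σ_b = I_2, the error is Q(1/√2) ≈ 0.24, showing directly how the mean separation and the score variances trade off.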
114 Applications/Semi-supervised Learning 55/80 MNIST Data Example
Figure: Probability of correct classification as a function of α (simulations versus theory "as if Gaussian"), for 3-class MNIST data (zeros, ones, twos), n = 192, p = 784, n_l/n = 1/16, Gaussian kernel.
55 / 80
116 Applications/Semi-supervised Learning 56/80 MNIST Data Example
Figure: Probability of correct classification as a function of α (simulations versus theory "as if Gaussian"), for 2-class MNIST data (zeros, ones), n = 1568, p = 784, n_l/n = 1/16, Gaussian kernel.
56 / 80
117 Applications/Random Feature Maps, Extreme Learning Machines, and Neural Networks 57/80 Outline Basics of Random Matrix Theory Motivation: Large Sample Covariance Matrices Spiked Models Applications Reminder on Spectral Clustering Methods Kernel Spectral Clustering The case f′(τ) = 0: an application to massive MIMO user clustering Semi-supervised Learning Random Feature Maps, Extreme Learning Machines, and Neural Networks Perspectives 57 / 80
119 Applications/Random Feature Maps, Extreme Learning Machines, and Neural Networks 58/80 Random Feature Maps and Extreme Learning Machines
Context: Random Feature Map
(large) input x_1, ..., x_T ∈ R^p
random W = [w_1^T; ...; w_n^T] ∈ R^{n×p}
non-linear activation function σ.
Neural Network Model (extreme learning machine): Ridge-regression learning
small output y_1, ..., y_T ∈ R^d
ridge-regression output β ∈ R^{n×d}
58 / 80
122 Applications/Random Feature Maps, Extreme Learning Machines, and Neural Networks 59/80 Random Feature Maps and Extreme Learning Machines
Objectives: evaluate the training and testing MSE performance as n, p, T → ∞.
Training MSE:
E_train = (1/T) Σ_{i=1}^T ||y_i − β^T σ(W x_i)||² = (1/T) ||Y − β^T Σ||_F²
with Σ = σ(WX) = {σ(w_i^T x_j)}_{1≤i≤n, 1≤j≤T} and β = (1/T) Σ ((1/T) Σ^T Σ + γ I_T)^{−1} Y^T.
Testing MSE: upon a new pair (X̂, Ŷ) of length T̂,
E_test = (1/T̂) ||Ŷ − β^T Σ̂||_F²
where Σ̂ = σ(W X̂).
59 / 80
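These two MSEs can be reproduced verbatim in a few lines. A NumPy sketch: the choice σ = tanh and the standard Gaussian W are illustrative assumptions.

```python
import numpy as np

def elm_train_test_mse(X, Y, X_hat, Y_hat, n, gamma, sigma=np.tanh, seed=0):
    """Random-feature ridge regression (extreme learning machine)."""
    p, T = X.shape
    W = np.random.default_rng(seed).standard_normal((n, p))
    S = sigma(W @ X)                                 # Sigma = sigma(WX), n x T
    # beta = (1/T) Sigma ((1/T) Sigma^T Sigma + gamma I_T)^{-1} Y^T
    beta = S @ np.linalg.solve(S.T @ S / T + gamma * np.eye(T), Y.T) / T
    E_train = np.linalg.norm(Y - beta.T @ S) ** 2 / T
    S_hat = sigma(W @ X_hat)                         # Sigma_hat = sigma(W X_hat)
    E_test = np.linalg.norm(Y_hat - beta.T @ S_hat) ** 2 / X_hat.shape[1]
    return E_train, E_test
```

Working with the T × T Gram matrix Σ^T Σ (rather than the n × n one) matches the resolvent Q introduced next.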
124 Applications/Random Feature Maps, Extreme Learning Machines, and Neural Networks 60/80 Technical Aspects
Preliminary observations:
Link to the resolvent of (1/T) Σ^T Σ:
E_train = (γ²/T) tr Y^T Y Q² = −γ² (∂/∂γ) (1/T) tr Y^T Y Q
where Q = Q(γ) is the resolvent
Q = ( (1/T) Σ^T Σ + γ I_T )^{−1}
with Σ_ij = σ(w_i^T x_j).
Central object: the resolvent E[Q].
60 / 80
127 Applications/Random Feature Maps, Extreme Learning Machines, and Neural Networks 61/80 Main Technical Result
Theorem [Asymptotic Equivalent for E[Q]]
For Lipschitz σ, bounded X, Y, and W = f(Z) (entry-wise) with Z standard Gaussian, we have, for all ε > 0 and some C > 0,
||E[Q] − Q̄|| < C n^{ε−1/2}
where
Q̄ = ( (n/T) Φ/(1 + δ) + γ I_T )^{−1},  Φ = E[σ(X^T w) σ(w^T X)]
with w = f(z), z ∼ N(0, I_p), and δ > 0 the unique positive solution to δ = (1/T) tr Φ Q̄.
Proof arguments:
σ(WX) has independent rows but dependent columns: this breaks the trace lemma argument (i.e., (1/p) w^T X A X^T w ≈ (1/p) tr X A X^T)
Concentration of measure lemma: (1/p) σ(w^T X) A σ(X^T w) ≈ (1/p) tr Φ A
61 / 80
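Given Φ, the deterministic equivalent Q̄ only requires solving the scalar fixed point in δ, e.g. by plain iteration. A sketch; starting the iteration from δ = 0 is an assumption here (it converges for γ > 0 since the map is increasing and bounded):

```python
import numpy as np

def qbar(Phi, n, T, gamma, tol=1e-12, max_iter=10_000):
    """Solve delta = (1/T) tr(Phi Qbar), Qbar = ((n/T) Phi/(1+delta) + gamma I)^{-1}."""
    delta = 0.0
    for _ in range(max_iter):
        Q = np.linalg.inv((n / T) * Phi / (1.0 + delta) + gamma * np.eye(T))
        delta_new = np.trace(Phi @ Q) / T
        if abs(delta_new - delta) < tol:
            delta = delta_new
            break
        delta = delta_new
    # rebuild Qbar at the converged delta so the returned pair is consistent
    Q = np.linalg.inv((n / T) * Phi / (1.0 + delta) + gamma * np.eye(T))
    return Q, delta
```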
129 Applications/Random Feature Maps, Extreme Learning Machines, and Neural Networks 62/80 Main Technical Result
Values of Φ(a, b) for w ∼ N(0, I_p), where ∠(a, b) = a^T b / (||a|| ||b||):
σ(t) = max(t, 0): Φ(a, b) = (1/(2π)) ||a|| ||b|| ( ∠(a, b) acos(−∠(a, b)) + √(1 − ∠(a, b)²) )
σ(t) = |t|: Φ(a, b) = (2/π) ||a|| ||b|| ( ∠(a, b) asin(∠(a, b)) + √(1 − ∠(a, b)²) )
σ(t) = erf(t): Φ(a, b) = (2/π) asin( 2 a^T b / √((1 + 2||a||²)(1 + 2||b||²)) )
σ(t) = 1_{t>0}: Φ(a, b) = 1/2 − (1/(2π)) acos(∠(a, b))
σ(t) = sign(t): Φ(a, b) = 1 − (2/π) acos(∠(a, b))
σ(t) = cos(t): Φ(a, b) = exp(−(||a||² + ||b||²)/2) cosh(a^T b)
Value of Φ(a, b) for w_i i.i.d. with E[w_i^k] = m_k (m_1 = 0) and σ(t) = ζ_2 t² + ζ_1 t + ζ_0:
Φ(a, b) = ζ_2² [ m_2² (2(a^T b)² + ||a||² ||b||²) + (m_4 − 3m_2²)(a²)^T(b²) ] + ζ_1² m_2 a^T b + ζ_2 ζ_1 m_3 [ (a²)^T b + a^T(b²) ] + ζ_2 ζ_0 m_2 [ ||a||² + ||b||² ] + ζ_0²
where (a²) = [a_1², ..., a_p²]^T.
62 / 80
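Each closed form is straightforward to sanity-check by Monte Carlo; here the ReLU entry (the sample size and tolerance are arbitrary choices of this sketch):

```python
import numpy as np

def phi_relu(a, b):
    """Phi(a, b) = E[max(w^T a, 0) max(w^T b, 0)], w ~ N(0, I_p)."""
    na, nb = np.linalg.norm(a), np.linalg.norm(b)
    ang = a @ b / (na * nb)                    # angle(a, b)
    return na * nb / (2.0 * np.pi) * (ang * np.arccos(-ang)
                                      + np.sqrt(1.0 - ang ** 2))

rng = np.random.default_rng(0)
a, b = rng.standard_normal(5), rng.standard_normal(5)
W = rng.standard_normal((1_000_000, 5))        # one million samples of w
mc = np.mean(np.maximum(W @ a, 0.0) * np.maximum(W @ b, 0.0))
# mc and phi_relu(a, b) agree up to the O(1/sqrt(10^6)) Monte Carlo error
```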
130 Applications/Random Feature Maps, Extreme Learning Machines, and Neural Networks 63/80 Main Results
Theorem [Asymptotic E_train]
For all ε > 0, almost surely, n^{1/2−ε} (E_train − Ē_train) → 0, where
E_train = (1/T) ||Y^T − Σ^T β||_F² = (γ²/T) tr Y^T Y Q²
Ē_train = (γ²/T) tr Y^T Y Q̄ [ ( (1/n) tr Ψ Q̄² / (1 − (1/n) tr(Ψ Q̄)²) ) Ψ + I_T ] Q̄
with Ψ = (n/T) Φ/(1 + δ).
63 / 80
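Ē_train is fully computable from Φ alone. A sketch combining the fixed point in δ with the formula above; the plain fixed-point iteration and the product-trace reading of tr(ΨQ̄)² are assumptions of this sketch:

```python
import numpy as np

def etrain_bar(Phi, Y, n, T, gamma, iters=2000):
    """Deterministic equivalent of the training MSE."""
    delta = 0.0
    for _ in range(iters):                 # fixed point: delta = (1/T) tr(Phi Qbar)
        Qb = np.linalg.inv((n / T) * Phi / (1.0 + delta) + gamma * np.eye(T))
        delta = np.trace(Phi @ Qb) / T
    Psi = (n / T) * Phi / (1.0 + delta)
    coef = (np.trace(Psi @ Qb @ Qb) / n) \
        / (1.0 - np.trace(Psi @ Qb @ Psi @ Qb) / n)
    M = Qb @ (coef * Psi + np.eye(T)) @ Qb
    return gamma ** 2 / T * np.trace(Y.T @ Y @ M)
```

As a quick consistency check, for γ → ∞ the estimator β vanishes and Ē_train tends to (1/T)||Y||_F².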
131 Applications/Random Feature Maps, Extreme Learning Machines, and Neural Networks 64/80 Main Results
Letting X̂ ∈ R^{p×T̂}, Ŷ ∈ R^{d×T̂} satisfy similar properties as (X, Y),
Claim [Asymptotic E_test]
For all ε > 0, almost surely, n^{1/2−ε} (E_test − Ē_test) → 0, where
E_test = (1/T̂) ||Ŷ^T − Σ̂^T β||_F²
Ē_test = (1/T̂) ||Ŷ^T − Ψ_{X X̂}^T Q̄ Y^T||_F² + ( (1/n) tr Y^T Y Q̄ Ψ Q̄ / (1 − (1/n) tr(Ψ Q̄)²) ) [ (1/T̂) tr Ψ_{X̂ X̂} − (1/T̂) tr (I_T + γ Q̄)(Ψ_{X X̂} Ψ_{X̂ X} Q̄) ]
with Ψ_{AB} = (n/T) Φ_{AB}/(1 + δ), Φ_{AB} = E[σ(A^T w) σ(w^T B)].
64 / 80
136 Applications/Random Feature Maps, Extreme Learning Machines, and Neural Networks 65/80 Simulations on MNIST: Lipschitz σ(·)
Figure: Neural network performance (MSE: Ē_train, Ē_test versus E_train, E_test) for Lipschitz continuous σ(·), with σ(t) ∈ {t, |t|, erf(t), max(t, 0)}, as a function of γ, for 2-class MNIST data (sevens, nines), n = 512, T = T̂ = 1024, p = 784.
65 / 80
137 Applications/Random Feature Maps, Extreme Learning Machines, and Neural Networks 66/80 Simulations on MNIST: non-Lipschitz σ(·)
Figure: Neural network performance (MSE: Ē_train, Ē_test versus E_train, E_test) for σ(·) either discontinuous or non-Lipschitz, with σ(t) ∈ {t², 1_{t>0}, sign(t)}, as a function of γ, for 2-class MNIST data (sevens, nines), n = 512, T = T̂ = 1024, p = 784.
66 / 80
140 Applications/Random Feature Maps, Extreme Learning Machines, and Neural Networks 67/80 Simulations on tuned Gaussian mixture
Gaussian mixture classification
X = [X_1, X_2], with {X_1}_{·,i} ∼ N(0, C_1), {X_2}_{·,i} ∼ N(0, C_2), tr C_1 = tr C_2.
We can prove that, for σ(t) = ζ_2 t² + ζ_1 t + ζ_0 and E[W_ij^k] = m_k:
Classification is only possible if m_4 ≠ 3m_2².
Figure: MSE performance (Ē_train, Ē_test versus E_train, E_test) as a function of γ, for W_ij Bernoulli, W_ij ∼ N(0, 1), and W_ij Student-t.
67 / 80
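The moment condition m_4 ≠ 3m_2² is a non-zero excess-kurtosis requirement on the entries of W, matching the (m_4 − 3m_2²) coefficient of the informative term in Φ. It is easy to check for the three weight distributions of the figure (the Student-t degrees of freedom below are an arbitrary illustrative choice):

```python
def info_coefficient(m2, m4):
    """Coefficient m4 - 3*m2^2 of the informative (a^2)^T (b^2) term in Phi."""
    return m4 - 3.0 * m2 ** 2

# Bernoulli +/-1:  m2 = 1, m4 = 1 -> coefficient -2 (classification possible)
# Gaussian N(0,1): m2 = 1, m4 = 3 -> coefficient  0 (classification impossible)
# Student-t, nu = 7, normalized to unit variance: m4 = 3(nu - 2)/(nu - 4) = 5
nu = 7.0
checks = [info_coefficient(1.0, 1.0),
          info_coefficient(1.0, 3.0),
          info_coefficient(1.0, 3.0 * (nu - 2.0) / (nu - 4.0))]
# checks == [-2.0, 0.0, 2.0]
```

This is consistent with the figure: Gaussian weights are exactly the case where the informative term vanishes.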
Eigenvector stability: Random Matrix Theory and Financial Applications J.P Bouchaud with: R. Allez, M. Potters http://www.cfm.fr Portfolio theory: Basics Portfolio weights w i, Asset returns X t i If expected/predicted
More informationGeneralization theory
Generalization theory Daniel Hsu Columbia TRIPODS Bootcamp 1 Motivation 2 Support vector machines X = R d, Y = { 1, +1}. Return solution ŵ R d to following optimization problem: λ min w R d 2 w 2 2 + 1
More informationFree deconvolution for signal processing applications
IEEE TRANSACTIONS ON INFORMATION THEORY, VOL., NO., JANUARY 27 Free deconvolution for signal processing applications Øyvind Ryan, Member, IEEE, Mérouane Debbah, Member, IEEE arxiv:cs.it/725v 4 Jan 27 Abstract
More informationFast learning rates for plug-in classifiers under the margin condition
Fast learning rates for plug-in classifiers under the margin condition Jean-Yves Audibert 1 Alexandre B. Tsybakov 2 1 Certis ParisTech - Ecole des Ponts, France 2 LPMA Université Pierre et Marie Curie,
More informationA Magiv CV Theory for Large-Margin Classifiers
A Magiv CV Theory for Large-Margin Classifiers Hui Zou School of Statistics, University of Minnesota June 30, 2018 Joint work with Boxiang Wang Outline 1 Background 2 Magic CV formula 3 Magic support vector
More informationStatistical Machine Learning
Statistical Machine Learning Christoph Lampert Spring Semester 2015/2016 // Lecture 12 1 / 36 Unsupervised Learning Dimensionality Reduction 2 / 36 Dimensionality Reduction Given: data X = {x 1,..., x
More informationLearning the Semantic Correlation: An Alternative Way to Gain from Unlabeled Text
Learning the Semantic Correlation: An Alternative Way to Gain from Unlabeled Text Yi Zhang Machine Learning Department Carnegie Mellon University yizhang1@cs.cmu.edu Jeff Schneider The Robotics Institute
More informationAn indicator for the number of clusters using a linear map to simplex structure
An indicator for the number of clusters using a linear map to simplex structure Marcus Weber, Wasinee Rungsarityotin, and Alexander Schliep Zuse Institute Berlin ZIB Takustraße 7, D-495 Berlin, Germany
More informationClustering VS Classification
MCQ Clustering VS Classification 1. What is the relation between the distance between clusters and the corresponding class discriminability? a. proportional b. inversely-proportional c. no-relation Ans:
More informationSTAT 992 Paper Review: Sure Independence Screening in Generalized Linear Models with NP-Dimensionality J.Fan and R.Song
STAT 992 Paper Review: Sure Independence Screening in Generalized Linear Models with NP-Dimensionality J.Fan and R.Song Presenter: Jiwei Zhao Department of Statistics University of Wisconsin Madison April
More informationLearning Task Grouping and Overlap in Multi-Task Learning
Learning Task Grouping and Overlap in Multi-Task Learning Abhishek Kumar Hal Daumé III Department of Computer Science University of Mayland, College Park 20 May 2013 Proceedings of the 29 th International
More informationEmpirical properties of large covariance matrices in finance
Empirical properties of large covariance matrices in finance Ex: RiskMetrics Group, Geneva Since 2010: Swissquote, Gland December 2009 Covariance and large random matrices Many problems in finance require
More informationClassification 2: Linear discriminant analysis (continued); logistic regression
Classification 2: Linear discriminant analysis (continued); logistic regression Ryan Tibshirani Data Mining: 36-462/36-662 April 4 2013 Optional reading: ISL 4.4, ESL 4.3; ISL 4.3, ESL 4.4 1 Reminder:
More informationMore Spectral Clustering and an Introduction to Conjugacy
CS8B/Stat4B: Advanced Topics in Learning & Decision Making More Spectral Clustering and an Introduction to Conjugacy Lecturer: Michael I. Jordan Scribe: Marco Barreno Monday, April 5, 004. Back to spectral
More informationLecture 7 Introduction to Statistical Decision Theory
Lecture 7 Introduction to Statistical Decision Theory I-Hsiang Wang Department of Electrical Engineering National Taiwan University ihwang@ntu.edu.tw December 20, 2016 1 / 55 I-Hsiang Wang IT Lecture 7
More informationLinear Algebra and Dirac Notation, Pt. 2
Linear Algebra and Dirac Notation, Pt. 2 PHYS 500 - Southern Illinois University February 1, 2017 PHYS 500 - Southern Illinois University Linear Algebra and Dirac Notation, Pt. 2 February 1, 2017 1 / 14
More informationExamples include: (a) the Lorenz system for climate and weather modeling (b) the Hodgkin-Huxley system for neuron modeling
1 Introduction Many natural processes can be viewed as dynamical systems, where the system is represented by a set of state variables and its evolution governed by a set of differential equations. Examples
More informationConvergence of Eigenspaces in Kernel Principal Component Analysis
Convergence of Eigenspaces in Kernel Principal Component Analysis Shixin Wang Advanced machine learning April 19, 2016 Shixin Wang Convergence of Eigenspaces April 19, 2016 1 / 18 Outline 1 Motivation
More informationFinding normalized and modularity cuts by spectral clustering. Ljubjana 2010, October
Finding normalized and modularity cuts by spectral clustering Marianna Bolla Institute of Mathematics Budapest University of Technology and Economics marib@math.bme.hu Ljubjana 2010, October Outline Find
More informationEstimation of large dimensional sparse covariance matrices
Estimation of large dimensional sparse covariance matrices Department of Statistics UC, Berkeley May 5, 2009 Sample covariance matrix and its eigenvalues Data: n p matrix X n (independent identically distributed)
More informationUnsupervised dimensionality reduction
Unsupervised dimensionality reduction Guillaume Obozinski Ecole des Ponts - ParisTech SOCN course 2014 Guillaume Obozinski Unsupervised dimensionality reduction 1/30 Outline 1 PCA 2 Kernel PCA 3 Multidimensional
More informationModel Specification Testing in Nonparametric and Semiparametric Time Series Econometrics. Jiti Gao
Model Specification Testing in Nonparametric and Semiparametric Time Series Econometrics Jiti Gao Department of Statistics School of Mathematics and Statistics The University of Western Australia Crawley
More informationUnlabeled Data: Now It Helps, Now It Doesn t
institution-logo-filena A. Singh, R. D. Nowak, and X. Zhu. In NIPS, 2008. 1 Courant Institute, NYU April 21, 2015 Outline institution-logo-filena 1 Conflicting Views in Semi-supervised Learning The Cluster
More informationThe Informativeness of k-means for Learning Mixture Models
The Informativeness of k-means for Learning Mixture Models Vincent Y. F. Tan (Joint work with Zhaoqiang Liu) National University of Singapore June 18, 2018 1/35 Gaussian distribution For F dimensions,
More information1 EM algorithm: updating the mixing proportions {π k } ik are the posterior probabilities at the qth iteration of EM.
Université du Sud Toulon - Var Master Informatique Probabilistic Learning and Data Analysis TD: Model-based clustering by Faicel CHAMROUKHI Solution The aim of this practical wor is to show how the Classification
More informationECE 521. Lecture 11 (not on midterm material) 13 February K-means clustering, Dimensionality reduction
ECE 521 Lecture 11 (not on midterm material) 13 February 2017 K-means clustering, Dimensionality reduction With thanks to Ruslan Salakhutdinov for an earlier version of the slides Overview K-means clustering
More informationExponential tail inequalities for eigenvalues of random matrices
Exponential tail inequalities for eigenvalues of random matrices M. Ledoux Institut de Mathématiques de Toulouse, France exponential tail inequalities classical theme in probability and statistics quantify
More informationUnsupervised Learning
2018 EE448, Big Data Mining, Lecture 7 Unsupervised Learning Weinan Zhang Shanghai Jiao Tong University http://wnzhang.net http://wnzhang.net/teaching/ee448/index.html ML Problem Setting First build and
More informationAssumptions for the theoretical properties of the
Statistica Sinica: Supplement IMPUTED FACTOR REGRESSION FOR HIGH- DIMENSIONAL BLOCK-WISE MISSING DATA Yanqing Zhang, Niansheng Tang and Annie Qu Yunnan University and University of Illinois at Urbana-Champaign
More informationConvex Optimization M2
Convex Optimization M2 Lecture 8 A. d Aspremont. Convex Optimization M2. 1/57 Applications A. d Aspremont. Convex Optimization M2. 2/57 Outline Geometrical problems Approximation problems Combinatorial
More informationKnowledge Discovery and Data Mining 1 (VO) ( )
Knowledge Discovery and Data Mining 1 (VO) (707.003) Review of Linear Algebra Denis Helic KTI, TU Graz Oct 9, 2014 Denis Helic (KTI, TU Graz) KDDM1 Oct 9, 2014 1 / 74 Big picture: KDDM Probability Theory
More informationWhat is semi-supervised learning?
What is semi-supervised learning? In many practical learning domains, there is a large supply of unlabeled data but limited labeled data, which can be expensive to generate text processing, video-indexing,
More informationCheng Soon Ong & Christian Walder. Canberra February June 2018
Cheng Soon Ong & Christian Walder Research Group and College of Engineering and Computer Science Canberra February June 2018 Outlines Overview Introduction Linear Algebra Probability Linear Regression
More informationThe circular law. Lewis Memorial Lecture / DIMACS minicourse March 19, Terence Tao (UCLA)
The circular law Lewis Memorial Lecture / DIMACS minicourse March 19, 2008 Terence Tao (UCLA) 1 Eigenvalue distributions Let M = (a ij ) 1 i n;1 j n be a square matrix. Then one has n (generalised) eigenvalues
More informationDissertation Defense
Clustering Algorithms for Random and Pseudo-random Structures Dissertation Defense Pradipta Mitra 1 1 Department of Computer Science Yale University April 23, 2008 Mitra (Yale University) Dissertation
More informationCovariance function estimation in Gaussian process regression
Covariance function estimation in Gaussian process regression François Bachoc Department of Statistics and Operations Research, University of Vienna WU Research Seminar - May 2015 François Bachoc Gaussian
More informationLecture 8 : Eigenvalues and Eigenvectors
CPS290: Algorithmic Foundations of Data Science February 24, 2017 Lecture 8 : Eigenvalues and Eigenvectors Lecturer: Kamesh Munagala Scribe: Kamesh Munagala Hermitian Matrices It is simpler to begin with
More informationPROBABILITY: LIMIT THEOREMS II, SPRING HOMEWORK PROBLEMS
PROBABILITY: LIMIT THEOREMS II, SPRING 218. HOMEWORK PROBLEMS PROF. YURI BAKHTIN Instructions. You are allowed to work on solutions in groups, but you are required to write up solutions on your own. Please
More informationVector spaces. DS-GA 1013 / MATH-GA 2824 Optimization-based Data Analysis.
Vector spaces DS-GA 1013 / MATH-GA 2824 Optimization-based Data Analysis http://www.cims.nyu.edu/~cfgranda/pages/obda_fall17/index.html Carlos Fernandez-Granda Vector space Consists of: A set V A scalar
More information6-1. Canonical Correlation Analysis
6-1. Canonical Correlation Analysis Canonical Correlatin analysis focuses on the correlation between a linear combination of the variable in one set and a linear combination of the variables in another
More informationThe University of Texas at Austin Department of Electrical and Computer Engineering. EE381V: Large Scale Learning Spring 2013.
The University of Texas at Austin Department of Electrical and Computer Engineering EE381V: Large Scale Learning Spring 2013 Assignment 1 Caramanis/Sanghavi Due: Thursday, Feb. 7, 2013. (Problems 1 and
More information