A Random Matrix Framework for BigData Machine Learning


1 A Random Matrix Framework for BigData Machine Learning (Groupe Deep Learning, DigiCosme) Romain COUILLET, CentraleSupélec, France, June 2017. 1 / 63

2 Outline Basics of Random Matrix Theory Motivation: Large Sample Covariance Matrices Spiked Models Applications Reminder on Spectral Clustering Methods Kernel Spectral Clustering Semi-supervised Learning Random Feature Maps, Extreme Learning Machines, and Neural Networks Perspectives 2 / 63

3 Basics of Random Matrix Theory/ 3/63 Outline Basics of Random Matrix Theory Motivation: Large Sample Covariance Matrices Spiked Models Applications Reminder on Spectral Clustering Methods Kernel Spectral Clustering Semi-supervised Learning Random Feature Maps, Extreme Learning Machines, and Neural Networks Perspectives 3 / 63

4 Basics of Random Matrix Theory/Motivation: Large Sample Covariance Matrices 4/63 Outline Basics of Random Matrix Theory Motivation: Large Sample Covariance Matrices Spiked Models Applications Reminder on Spectral Clustering Methods Kernel Spectral Clustering Semi-supervised Learning Random Feature Maps, Extreme Learning Machines, and Neural Networks Perspectives 4 / 63

5 Basics of Random Matrix Theory/Motivation: Large Sample Covariance Matrices 5/63 Context
Baseline scenario: y_1, ..., y_n ∈ C^p (or R^p) i.i.d. with E[y_1] = 0, E[y_1 y_1^*] = C_p. 5 / 63

6 Basics of Random Matrix Theory/Motivation: Large Sample Covariance Matrices 5/63 Context
Baseline scenario: y_1, ..., y_n ∈ C^p (or R^p) i.i.d. with E[y_1] = 0, E[y_1 y_1^*] = C_p.
If y_1 ∼ N(0, C_p), the ML estimator for C_p is the sample covariance matrix (SCM)
Ĉ_p = (1/n) Y_p Y_p^* = (1/n) Σ_{i=1}^n y_i y_i^*   (Y_p = [y_1, ..., y_n] ∈ C^{p×n}). 5 / 63

7 Basics of Random Matrix Theory/Motivation: Large Sample Covariance Matrices 5/63 Context
Baseline scenario: y_1, ..., y_n ∈ C^p (or R^p) i.i.d. with E[y_1] = 0, E[y_1 y_1^*] = C_p.
If y_1 ∼ N(0, C_p), the ML estimator for C_p is the sample covariance matrix (SCM)
Ĉ_p = (1/n) Y_p Y_p^* = (1/n) Σ_{i=1}^n y_i y_i^*   (Y_p = [y_1, ..., y_n] ∈ C^{p×n}).
If n → ∞, then, by the strong law of large numbers, Ĉ_p → C_p a.s., or equivalently, in spectral norm, ||Ĉ_p − C_p|| → 0 a.s. 5 / 63

8 Basics of Random Matrix Theory/Motivation: Large Sample Covariance Matrices 5/63 Context
Baseline scenario: y_1, ..., y_n ∈ C^p (or R^p) i.i.d. with E[y_1] = 0, E[y_1 y_1^*] = C_p.
If y_1 ∼ N(0, C_p), the ML estimator for C_p is the sample covariance matrix (SCM)
Ĉ_p = (1/n) Y_p Y_p^* = (1/n) Σ_{i=1}^n y_i y_i^*   (Y_p = [y_1, ..., y_n] ∈ C^{p×n}).
If n → ∞, then, by the strong law of large numbers, Ĉ_p → C_p a.s., or equivalently, in spectral norm, ||Ĉ_p − C_p|| → 0 a.s.
Random Matrix Regime: no longer valid if p, n → ∞ with p/n → c ∈ (0, ∞): ||Ĉ_p − C_p|| ↛ 0. 5 / 63

9 Basics of Random Matrix Theory/Motivation: Large Sample Covariance Matrices 5/63 Context
Baseline scenario: y_1, ..., y_n ∈ C^p (or R^p) i.i.d. with E[y_1] = 0, E[y_1 y_1^*] = C_p.
If y_1 ∼ N(0, C_p), the ML estimator for C_p is the sample covariance matrix (SCM)
Ĉ_p = (1/n) Y_p Y_p^* = (1/n) Σ_{i=1}^n y_i y_i^*   (Y_p = [y_1, ..., y_n] ∈ C^{p×n}).
If n → ∞, then, by the strong law of large numbers, Ĉ_p → C_p a.s., or equivalently, in spectral norm, ||Ĉ_p − C_p|| → 0 a.s.
Random Matrix Regime: no longer valid if p, n → ∞ with p/n → c ∈ (0, ∞): ||Ĉ_p − C_p|| ↛ 0.
For practical p, n with p ≃ n, this leads to dramatically wrong conclusions. 5 / 63
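To make the random matrix regime above concrete, here is a minimal numerical sketch (mine, not from the slides), assuming C_p = I_p and a fixed ratio p/n = 1/2: the spectral-norm error of the SCM stabilizes instead of vanishing as p and n grow together.

```python
import numpy as np

rng = np.random.default_rng(0)
c = 0.5  # aspect ratio p/n, kept fixed as p and n grow

for p in [100, 400, 1600]:
    n = int(p / c)
    Y = rng.standard_normal((p, n))             # C_p = I_p, so Y_p = X_p
    C_hat = Y @ Y.T / n                         # sample covariance matrix
    err = np.linalg.norm(C_hat - np.eye(p), 2)  # spectral-norm error
    print(f"p={p:5d}, n={n:5d}, ||C_hat - C_p|| = {err:.2f}")
# The error stays near c + 2*sqrt(c) ~ 1.91 instead of vanishing.
```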

10 Basics of Random Matrix Theory/Motivation: Large Sample Covariance Matrices 6/63 The Marčenko–Pastur law
Figure: Histogram of the eigenvalues of Ĉ_p for p = 500, n = 2000, C_p = I_p, against the Marčenko–Pastur law. 6 / 63

11 Basics of Random Matrix Theory/Motivation: Large Sample Covariance Matrices 7/63 The Marčenko–Pastur law
Definition (Empirical Spectral Density). The empirical spectral density (e.s.d.) µ_p of a Hermitian matrix A_p ∈ C^{p×p} is µ_p = (1/p) Σ_{i=1}^p δ_{λ_i(A_p)}. 7 / 63

12 Basics of Random Matrix Theory/Motivation: Large Sample Covariance Matrices 7/63 The Marčenko–Pastur law
Definition (Empirical Spectral Density). The empirical spectral density (e.s.d.) µ_p of a Hermitian matrix A_p ∈ C^{p×p} is µ_p = (1/p) Σ_{i=1}^p δ_{λ_i(A_p)}.
Theorem (Marčenko–Pastur Law [Marčenko, Pastur '67]). Let X_p ∈ C^{p×n} have i.i.d. zero mean, unit variance entries. As p, n → ∞ with p/n → c ∈ (0, ∞), the e.s.d. µ_p of (1/n) X_p X_p^* satisfies µ_p → µ_c weakly, almost surely, where µ_c({0}) = max{0, 1 − c^{−1}}. 7 / 63

13 Basics of Random Matrix Theory/Motivation: Large Sample Covariance Matrices 7/63 The Marčenko–Pastur law
Definition (Empirical Spectral Density). The empirical spectral density (e.s.d.) µ_p of a Hermitian matrix A_p ∈ C^{p×p} is µ_p = (1/p) Σ_{i=1}^p δ_{λ_i(A_p)}.
Theorem (Marčenko–Pastur Law [Marčenko, Pastur '67]). Let X_p ∈ C^{p×n} have i.i.d. zero mean, unit variance entries. As p, n → ∞ with p/n → c ∈ (0, ∞), the e.s.d. µ_p of (1/n) X_p X_p^* satisfies µ_p → µ_c weakly, almost surely, where µ_c({0}) = max{0, 1 − c^{−1}} and, on (0, ∞), µ_c has a continuous density f_c supported on [(1 − √c)², (1 + √c)²],
f_c(x) = (1/(2πcx)) √((x − (1 − √c)²)((1 + √c)² − x)). 7 / 63
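The theorem can be checked numerically. The sketch below (my illustration, reusing p = 500, n = 2000 from the earlier histogram) overlays the empirical eigenvalues of (1/n) X_p X_p^* with the Marčenko–Pastur density f_c.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
p, n = 500, 2000
c = p / n
X = rng.standard_normal((p, n))
eigs = np.linalg.eigvalsh(X @ X.T / n)

# Marcenko-Pastur density on its support [(1-sqrt(c))^2, (1+sqrt(c))^2]
lm, lp = (1 - np.sqrt(c))**2, (1 + np.sqrt(c))**2
x = np.linspace(lm, lp, 500)
f_c = np.sqrt(np.maximum((x - lm) * (lp - x), 0)) / (2 * np.pi * c * x)

plt.hist(eigs, bins=50, density=True, label="eigenvalues of C_hat")
plt.plot(x, f_c, label="Marcenko-Pastur density")
plt.legend(); plt.show()
```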

14 Basics of Random Matrix Theory/Motivation: Large Sample Covariance Matrices 8/63 The Marčenko–Pastur law
Figure: Marčenko–Pastur density f_c(x) for the limit ratio c = lim p/n = 0.1. 8 / 63

15 Basics of Random Matrix Theory/Motivation: Large Sample Covariance Matrices 8/63 The Marčenko–Pastur law
Figure: Marčenko–Pastur density f_c(x) for the limit ratios c = lim p/n = 0.1 and 0.2. 8 / 63

16 Basics of Random Matrix Theory/Motivation: Large Sample Covariance Matrices 8/63 The Marčenko–Pastur law
Figure: Marčenko–Pastur law for different limit ratios c = lim p/n = 0.1, 0.2, 0.5. 8 / 63

17 Basics of Random Matrix Theory/Spiked Models 9/63 Outline Basics of Random Matrix Theory Motivation: Large Sample Covariance Matrices Spiked Models Applications Reminder on Spectral Clustering Methods Kernel Spectral Clustering Semi-supervised Learning Random Feature Maps, Extreme Learning Machines, and Neural Networks Perspectives 9 / 63

18 Basics of Random Matrix Theory/Spiked Models 10/63 Spiked Models
If we break C_p = I_p: small-rank perturbation C_p = I_p + P, P of low rank.
Figure: Eigenvalues {λ_i}_{i=1}^n of (1/n) Y_p Y_p^* for C_p = diag(1, ..., 1, 2, 2, 3, 3) (the first p − 4 entries equal to 1), p = 500, against the Marčenko–Pastur law µ; isolated eigenvalues appear near 1 + ω_1 + c(1 + ω_1)/ω_1 and 1 + ω_2 + c(1 + ω_2)/ω_2. 10 / 63

19 Basics of Random Matrix Theory/Spiked Models 11/63 Spiked Models
Theorem (Eigenvalues [Baik, Silverstein '06]). Let Y_p = C_p^{1/2} X_p, with X_p having i.i.d. zero mean, unit variance entries, E[|X_p,ij|^4] < ∞, and
C_p = I_p + P, P = U Ω U^*, where, for K fixed, Ω = diag(ω_1, ..., ω_K) ∈ R^{K×K}, with ω_1 ≥ ... ≥ ω_K > 0. 11 / 63

20 Basics of Random Matrix Theory/Spiked Models 11/63 Spiked Models
Theorem (Eigenvalues [Baik, Silverstein '06]). Let Y_p = C_p^{1/2} X_p, with X_p having i.i.d. zero mean, unit variance entries, E[|X_p,ij|^4] < ∞, and
C_p = I_p + P, P = U Ω U^*, where, for K fixed, Ω = diag(ω_1, ..., ω_K) ∈ R^{K×K}, with ω_1 ≥ ... ≥ ω_K > 0.
Then, as p, n → ∞ with p/n → c ∈ (0, ∞), denoting λ_m = λ_m((1/n) Y_p Y_p^*) (with λ_m ≥ λ_{m+1}),
λ_m → 1 + ω_m + c (1 + ω_m)/ω_m > (1 + √c)²  a.s., if ω_m > √c,
λ_m → (1 + √c)²  a.s., if ω_m ∈ (0, √c]. 11 / 63

21 Basics of Random Matrix Theory/Spiked Models 12/63 Spiked Models
Theorem (Eigenvectors [Paul '07]). Let Y_p = C_p^{1/2} X_p, with X_p having i.i.d. zero mean, unit variance entries, E[|X_p,ij|^4] < ∞, and
C_p = I_p + P, P = U Ω U^* = Σ_{i=1}^K ω_i u_i u_i^*, ω_1 > ... > ω_K > 0. 12 / 63

22 Basics of Random Matrix Theory/Spiked Models 12/63 Spiked Models
Theorem (Eigenvectors [Paul '07]). Let Y_p = C_p^{1/2} X_p, with X_p having i.i.d. zero mean, unit variance entries, E[|X_p,ij|^4] < ∞, and
C_p = I_p + P, P = U Ω U^* = Σ_{i=1}^K ω_i u_i u_i^*, ω_1 > ... > ω_K > 0.
Then, as p, n → ∞ with p/n → c ∈ (0, ∞), for deterministic a, b ∈ C^p and û_i the eigenvector associated with λ_i((1/n) Y_p Y_p^*),
a^* û_i û_i^* b − ((1 − c ω_i^{−2})/(1 + c ω_i^{−1})) a^* u_i u_i^* b · 1_{ω_i > √c} → 0  a.s.
In particular, |û_i^* u_i|² → ((1 − c ω_i^{−2})/(1 + c ω_i^{−1})) · 1_{ω_i > √c}  a.s. 12 / 63

23 Basics of Random Matrix Theory/Spiked Models 13/63 Spiked Models
Figure: Simulated versus limiting |û_1^* u_1|² for Y_p = C_p^{1/2} X_p, C_p = I_p + ω_1 u_1 u_1^*, p/n = 1/3, varying the population spike ω_1 (p = 100, p = 200, ...); the limit is (1 − c/ω_1²)/(1 + c/ω_1) beyond the phase transition ω_1 = √c. 13 / 63
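A minimal simulation of the two spiked-model theorems above (my own sketch, with a single spike ω_1 = 2 placed along the first canonical axis and p/n = 1/3) compares the largest empirical eigenvalue and the eigenvector alignment with their predicted limits.

```python
import numpy as np

rng = np.random.default_rng(1)
p, n = 400, 1200                 # c = 1/3
c, omega = p / n, 2.0            # spike above the threshold sqrt(c)

u = np.zeros(p); u[0] = 1.0                                      # population spike direction
C_half = np.eye(p) + (np.sqrt(1 + omega) - 1) * np.outer(u, u)   # C_p^{1/2}, C_p = I + omega u u^T
Y = C_half @ rng.standard_normal((p, n))
vals, vecs = np.linalg.eigh(Y @ Y.T / n)

lam_max, u_hat = vals[-1], vecs[:, -1]
print("largest eigenvalue:", lam_max,
      " predicted:", 1 + omega + c * (1 + omega) / omega)
print("|u_hat' u|^2      :", (u_hat @ u) ** 2,
      " predicted:", (1 - c / omega**2) / (1 + c / omega))
```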

24 Basics of Random Matrix Theory/Spiked Models 14/63 Other Spiked Models
Similar results hold for multiple matrix models:
Y_p = (1/n) X_p X_p^* + P
Y_p = (1/n) X_p^* (I + P) X_p
Y_p = (1/n) (X_p + P)^* (X_p + P)
Y_p = (1/n) T X_p^* (I + P) X_p T
etc. 14 / 63

25 Applications/ 15/63 Outline Basics of Random Matrix Theory Motivation: Large Sample Covariance Matrices Spiked Models Applications Reminder on Spectral Clustering Methods Kernel Spectral Clustering Semi-supervised Learning Random Feature Maps, Extreme Learning Machines, and Neural Networks Perspectives 15 / 63

26 Applications/Reminder on Spectral Clustering Methods 16/63 Outline Basics of Random Matrix Theory Motivation: Large Sample Covariance Matrices Spiked Models Applications Reminder on Spectral Clustering Methods Kernel Spectral Clustering Semi-supervised Learning Random Feature Maps, Extreme Learning Machines, and Neural Networks Perspectives 16 / 63

27 Applications/Reminder on Spectral Clustering Methods 17/63 Reminder on Spectral Clustering Methods
Context: two-step classification of n objects based on a similarity matrix A ∈ R^{n×n}:
1. extraction of the eigenvectors U = [u_1, ..., u_l] associated with the dominant eigenvalues. 17 / 63

28 Applications/Reminder on Spectral Clustering Methods 17/63 Reminder on Spectral Clustering Methods
Context: two-step classification of n objects based on a similarity matrix A ∈ R^{n×n}:
1. extraction of the eigenvectors U = [u_1, ..., u_l] associated with the dominant eigenvalues;
2. classification of the n rows U_{1,·}, ..., U_{n,·} ∈ R^l using k-means/EM. 17 / 63

29 Applications/Reminder on Spectral Clustering Methods 17/63 Reminder on Spectral Clustering Methods
Context: two-step classification of n objects based on a similarity matrix A ∈ R^{n×n}:
1. extraction of the eigenvectors U = [u_1, ..., u_l] associated with the dominant eigenvalues;
2. classification of the n rows U_{1,·}, ..., U_{n,·} ∈ R^l using k-means/EM.
[Eigenvalue plot: bulk near 0, isolated spikes.] 17 / 63

30 Applications/Reminder on Spectral Clustering Methods 17/63 Reminder on Spectral Clustering Methods
Context: two-step classification of n objects based on a similarity matrix A ∈ R^{n×n}:
1. extraction of the eigenvectors U = [u_1, ..., u_l] associated with the dominant eigenvalues;
2. classification of the n rows U_{1,·}, ..., U_{n,·} ∈ R^l using k-means/EM.
[Eigenvalue plot: bulk near 0, isolated spikes.]
[Plot: entries of the dominant eigenvectors Eigenv. 1 and Eigenv. 2 (in practice, shuffled).] 17 / 63

31 Applications/Reminder on Spectral Clustering Methods 18/63 Reminder on Spectral Clustering Methods
[Plot: entries of the dominant eigenvectors Eigenv. 1 and Eigenv. 2.] 18 / 63

32 Applications/Reminder on Spectral Clustering Methods 18/63 Reminder on Spectral Clustering Methods
[Plot: entries of the dominant eigenvectors Eigenv. 1 and Eigenv. 2.]
l-dimensional representation (shuffling no longer matters): [scatter plot of Eigenvector 1 versus Eigenvector 2]. 18 / 63

33 Applications/Reminder on Spectral Clustering Methods 18/63 Reminder on Spectral Clustering Methods
[Plot: entries of the dominant eigenvectors Eigenv. 1 and Eigenv. 2.]
l-dimensional representation (shuffling no longer matters): [scatter plot of Eigenvector 1 versus Eigenvector 2].
EM or k-means clustering. 18 / 63
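As a reference point for the two-step procedure just described, here is a minimal sketch (not from the slides; the use of scikit-learn's KMeans and the function name are my choices) of spectral clustering from a given similarity matrix A.

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_clustering(A, k):
    """Two-step clustering: dominant eigenvectors of the similarity A, then k-means on the rows."""
    _, vecs = np.linalg.eigh(A)        # eigenvalues returned in ascending order
    U = vecs[:, -k:]                   # n x k matrix of dominant eigenvectors
    return KMeans(n_clusters=k, n_init=10).fit_predict(U)
```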

34 Applications/Kernel Spectral Clustering 19/63 Outline Basics of Random Matrix Theory Motivation: Large Sample Covariance Matrices Spiked Models Applications Reminder on Spectral Clustering Methods Kernel Spectral Clustering Semi-supervised Learning Random Feature Maps, Extreme Learning Machines, and Neural Networks Perspectives 19 / 63

35 Applications/Kernel Spectral Clustering 20/63 Kernel Spectral Clustering
Problem Statement. Dataset x_1, ..., x_n ∈ R^p. Objective: cluster the data into k similarity classes C_1, ..., C_k. 20 / 63

36 Applications/Kernel Spectral Clustering 20/63 Kernel Spectral Clustering
Problem Statement. Dataset x_1, ..., x_n ∈ R^p. Objective: cluster the data into k similarity classes C_1, ..., C_k.
Kernel spectral clustering is based on the kernel matrix K = {κ(x_i, x_j)}_{i,j=1}^n. 20 / 63

37 Applications/Kernel Spectral Clustering 20/63 Kernel Spectral Clustering
Problem Statement. Dataset x_1, ..., x_n ∈ R^p. Objective: cluster the data into k similarity classes C_1, ..., C_k.
Kernel spectral clustering is based on the kernel matrix K = {κ(x_i, x_j)}_{i,j=1}^n.
Usually, κ(x, y) = f(x^T y) or κ(x, y) = f(||x − y||²). 20 / 63

38 Applications/Kernel Spectral Clustering 20/63 Kernel Spectral Clustering
Problem Statement. Dataset x_1, ..., x_n ∈ R^p. Objective: cluster the data into k similarity classes C_1, ..., C_k.
Kernel spectral clustering is based on the kernel matrix K = {κ(x_i, x_j)}_{i,j=1}^n.
Usually, κ(x, y) = f(x^T y) or κ(x, y) = f(||x − y||²).
Refinements: instead of K, use D − K, I_n − D^{−1} K, I_n − D^{−1/2} K D^{−1/2}, etc.; multi-step algorithms: Ng–Jordan–Weiss, Shi–Malik, etc. 20 / 63

39 Applications/Kernel Spectral Clustering 20/63 Kernel Spectral Clustering
Problem Statement. Dataset x_1, ..., x_n ∈ R^p. Objective: cluster the data into k similarity classes C_1, ..., C_k.
Kernel spectral clustering is based on the kernel matrix K = {κ(x_i, x_j)}_{i,j=1}^n.
Usually, κ(x, y) = f(x^T y) or κ(x, y) = f(||x − y||²).
Refinements: instead of K, use D − K, I_n − D^{−1} K, I_n − D^{−1/2} K D^{−1/2}, etc.; multi-step algorithms: Ng–Jordan–Weiss, Shi–Malik, etc.
Intuition (from small dimensions): K is essentially low rank, with class structure in its eigenvectors. 20 / 63

40 Applications/Kernel Spectral Clustering 21/63 Kernel Spectral Clustering 21 / 63

41 Applications/Kernel Spectral Clustering 21/63 Kernel Spectral Clustering 21 / 63

42 Applications/Kernel Spectral Clustering 21/63 Kernel Spectral Clustering 21 / 63

43 Applications/Kernel Spectral Clustering 21/63 Kernel Spectral Clustering
Figure: Leading four eigenvectors of D^{−1/2} K D^{−1/2} for MNIST data. 21 / 63

44 Applications/Kernel Spectral Clustering 22/63 Model and Assumptions
Gaussian mixture model: x_1, ..., x_n ∈ R^p, k classes C_1, ..., C_k,
x_1, ..., x_{n_1} ∈ C_1, ..., x_{n−n_k+1}, ..., x_n ∈ C_k, with x_i ∼ N(µ_{g_i}, C_{g_i}). 22 / 63

45 Applications/Kernel Spectral Clustering 22/63 Model and Assumptions
Gaussian mixture model: x_1, ..., x_n ∈ R^p, k classes C_1, ..., C_k,
x_1, ..., x_{n_1} ∈ C_1, ..., x_{n−n_k+1}, ..., x_n ∈ C_k, with x_i ∼ N(µ_{g_i}, C_{g_i}).
Assumption (Convergence Rate). As n, p → ∞,
1. Data scaling: p/n → c_0 ∈ (0, ∞),
2. Class scaling: n_a/n → c_a ∈ (0, 1),
3. Mean scaling: with µ° = Σ_{a=1}^k (n_a/n) µ_a and µ_a° = µ_a − µ°, then ||µ_a°|| = O(1),
4. Covariance scaling: with C° = Σ_{a=1}^k (n_a/n) C_a and C_a° = C_a − C°, then ||C_a|| = O(1), tr C_a° = O(√p), tr C_a° C_b° = O(p). 22 / 63

46 Applications/Kernel Spectral Clustering 22/63 Model and Assumptions
Gaussian mixture model: x_1, ..., x_n ∈ R^p, k classes C_1, ..., C_k,
x_1, ..., x_{n_1} ∈ C_1, ..., x_{n−n_k+1}, ..., x_n ∈ C_k, with x_i ∼ N(µ_{g_i}, C_{g_i}).
Assumption (Convergence Rate). As n, p → ∞,
1. Data scaling: p/n → c_0 ∈ (0, ∞),
2. Class scaling: n_a/n → c_a ∈ (0, 1),
3. Mean scaling: with µ° = Σ_{a=1}^k (n_a/n) µ_a and µ_a° = µ_a − µ°, then ||µ_a°|| = O(1),
4. Covariance scaling: with C° = Σ_{a=1}^k (n_a/n) C_a and C_a° = C_a − C°, then ||C_a|| = O(1), tr C_a° = O(√p), tr C_a° C_b° = O(p).
Remark: for 2 classes, this reads ||µ_1 − µ_2|| = O(1), tr(C_1 − C_2) = O(√p), ||C_i|| = O(1), tr([C_1 − C_2]²) = O(p). 22 / 63

47 Applications/Kernel Spectral Clustering 23/63 Model and Assumptions
Kernel Matrix: kernel matrix of interest
K = { f( (1/p) ||x_i − x_j||² ) }_{i,j=1}^n
for some sufficiently smooth nonnegative f (the case f((1/p) x_i^T x_j) is simpler). 23 / 63

48 Applications/Kernel Spectral Clustering 23/63 Model and Assumptions
Kernel Matrix: kernel matrix of interest
K = { f( (1/p) ||x_i − x_j||² ) }_{i,j=1}^n
for some sufficiently smooth nonnegative f (the case f((1/p) x_i^T x_j) is simpler).
We study the normalized Laplacian:
L = n D^{−1/2} ( K − d d^T / (d^T 1_n) ) D^{−1/2}, with d = K 1_n, D = diag(d). 23 / 63
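A direct transcription of these two definitions into code (my sketch; the default Gaussian kernel f(t) = exp(−t/2) is an assumption borrowed from the later figures) could look as follows.

```python
import numpy as np

def normalized_laplacian(X, f=lambda t: np.exp(-t / 2)):
    """Kernel matrix K_ij = f(||x_i - x_j||^2 / p) and normalized Laplacian
    L = n D^{-1/2} (K - d d^T / (d^T 1_n)) D^{-1/2}, for data X of size p x n."""
    p, n = X.shape
    sq = np.sum(X**2, axis=0)
    dist2 = sq[:, None] + sq[None, :] - 2 * X.T @ X   # ||x_i - x_j||^2
    K = f(dist2 / p)
    d = K.sum(axis=1)                                 # d = K 1_n
    L = n * (K - np.outer(d, d) / d.sum()) / np.sqrt(np.outer(d, d))
    return K, L
```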

49 Applications/Kernel Spectral Clustering 24/63 Random Matrix Equivalent
Key Remark: under our assumptions, uniformly over i, j ∈ {1, ..., n},
(1/p) ||x_i − x_j||² → τ > 0  a.s. 24 / 63

50 Applications/Kernel Spectral Clustering 24/63 Random Matrix Equivalent
Key Remark: under our assumptions, uniformly over i, j ∈ {1, ..., n},
(1/p) ||x_i − x_j||² → τ > 0  a.s.
This allows a Taylor expansion of K:
K = f(τ) 1_n 1_n^T + √n K_1 + K_2,
with ||f(τ) 1_n 1_n^T|| = O(n), √n K_1 low rank of norm O(√n), and K_2 the informative terms, of norm O(1). 24 / 63

51 Applications/Kernel Spectral Clustering 24/63 Random Matrix Equivalent
Key Remark: under our assumptions, uniformly over i, j ∈ {1, ..., n},
(1/p) ||x_i − x_j||² → τ > 0  a.s.
This allows a Taylor expansion of K:
K = f(τ) 1_n 1_n^T + √n K_1 + K_2,
with ||f(τ) 1_n 1_n^T|| = O(n), √n K_1 low rank of norm O(√n), and K_2 the informative terms, of norm O(1).
However, this is not the (small dimension) intuitive behavior. 24 / 63
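The concentration (1/p)||x_i − x_j||² → τ that drives this expansion is easy to visualize. The sketch below (mine, with two Gaussian classes of identity covariance and a mean shift of norm 2, so that τ = 2) prints the range of the normalized pairwise distances as p grows.

```python
import numpy as np

rng = np.random.default_rng(2)
for p in [64, 256, 1024]:
    n = 2 * p
    X = rng.standard_normal((p, n))
    X[0, n // 2:] += 2.0                    # second class: mean shift of norm 2
    sq = np.sum(X**2, axis=0)
    D2 = (sq[:, None] + sq[None, :] - 2 * X.T @ X) / p
    off = D2[~np.eye(n, dtype=bool)]        # off-diagonal normalized distances
    print(f"p={p:5d}:  1/p ||x_i - x_j||^2 in [{off.min():.2f}, {off.max():.2f}]  (tau = 2)")
```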

52 Applications/Kernel Spectral Clustering 25/63 Random Matrix Equivalent
Theorem (Random Matrix Equivalent [Couillet, Benaych 2015]). As n, p → ∞, ||L − L̂|| → 0 almost surely, where
L = n D^{−1/2} ( K − d d^T / (d^T 1_n) ) D^{−1/2}, with K_ij = f( (1/p) ||x_i − x_j||² ),
L̂ = −2 (f′(τ)/f(τ)) [ (1/p) Π W^T W Π + (1/p) J B J^T + ∗ ]
and W = [w_1, ..., w_n] ∈ R^{p×n} (x_i = µ_{g_i} + w_i), Π = I_n − (1/n) 1_n 1_n^T. 25 / 63

53 Applications/Kernel Spectral Clustering 25/63 Random Matrix Equivalent
Theorem (Random Matrix Equivalent [Couillet, Benaych 2015]). As n, p → ∞, ||L − L̂|| → 0 almost surely, where
L = n D^{−1/2} ( K − d d^T / (d^T 1_n) ) D^{−1/2}, with K_ij = f( (1/p) ||x_i − x_j||² ),
L̂ = −2 (f′(τ)/f(τ)) [ (1/p) Π W^T W Π + (1/p) J B J^T + ∗ ]
and W = [w_1, ..., w_n] ∈ R^{p×n} (x_i = µ_{g_i} + w_i), Π = I_n − (1/n) 1_n 1_n^T,
J = [j_1, ..., j_k], j_a^T = (0, ..., 0, 1_{n_a}^T, 0, ..., 0),
B = M^T M + ( 5 f′(τ)/(8 f(τ)) − f″(τ)/(2 f′(τ)) ) t t^T − (f″(τ)/f′(τ)) T + ∗.
Recall M = [µ_1°, ..., µ_k°], t = [ (1/√p) tr C_1°, ..., (1/√p) tr C_k° ]^T, T = { (1/p) tr C_a° C_b° }_{a,b=1}^k. 25 / 63

54 Applications/Kernel Spectral Clustering 26/63 Isolated eigenvalues: Gaussian inputs
Figure: Eigenvalues of L and of L̂, k = 3, p = 2048, n = 512, c_1 = c_2 = 1/4, c_3 = 1/2, [µ_a]_j = 4δ_{aj}, C_a = (1 + 2(a − 1)/√p) I_p, f(x) = exp(−x/2). 26 / 63

55 Applications/Kernel Spectral Clustering 27/63 Theoretical Findings versus MNIST
Figure: Eigenvalues of L (red), MNIST data, p = 784. 27 / 63

56 Applications/Kernel Spectral Clustering 27/63 Theoretical Findings versus MNIST
Figure: Eigenvalues of L (red) and of L̂ as if the data followed the equivalent Gaussian model (white), MNIST data, p = 784. 27 / 63

57 Applications/Kernel Spectral Clustering 28/63 Theoretical Findings versus MNIST
Figure: Leading four eigenvectors of D^{−1/2} K D^{−1/2} for MNIST data (red) and theoretical findings (blue). 28 / 63

58 Applications/Kernel Spectral Clustering 28/63 Theoretical Findings versus MNIST
Figure: Leading four eigenvectors of D^{−1/2} K D^{−1/2} for MNIST data (red) and theoretical findings (blue). 28 / 63

59 Applications/Kernel Spectral Clustering 29/63 Theoretical Findings versus MNIST
Figure: 2D representation of the eigenvectors of L for the MNIST dataset (Eigenvector 2 vs. Eigenvector 1, Eigenvector 3 vs. Eigenvector 2). Theoretical means and 1- and 2-standard deviations in blue. Class 1 in red, Class 2 in black, Class 3 in green. 29 / 63

60 Applications/Kernel Spectral Clustering 30/63 The surprising f′(τ) = 0 case
Figure: Classification error as a function of f′(τ), for p = 512 and p = 1024, against theory; polynomial kernel with f(τ) = 4, f″(τ) = 2, x_i ∼ N(0, C_a), with C_1 = I_p, [C_2]_{i,j} = .4^{|i−j|}. 30 / 63

61 Applications/Semi-supervised Learning 31/63 Outline Basics of Random Matrix Theory Motivation: Large Sample Covariance Matrices Spiked Models Applications Reminder on Spectral Clustering Methods Kernel Spectral Clustering Semi-supervised Learning Random Feature Maps, Extreme Learning Machines, and Neural Networks Perspectives 31 / 63

62 Applications/Semi-supervised Learning 32/63 Problem Statement
Context: similar to clustering: classify x_1, ..., x_n ∈ R^p into k classes, with n_l labelled and n_u unlabelled data. 32 / 63

63 Applications/Semi-supervised Learning 32/63 Problem Statement
Context: similar to clustering: classify x_1, ..., x_n ∈ R^p into k classes, with n_l labelled and n_u unlabelled data.
Problem statement: assign scores F_ia (with d_i = [K 1_n]_i):
F = argmin_{F ∈ R^{n×k}} Σ_{a=1}^k Σ_{i,j} K_ij ( F_ia d_i^{α−1} − F_ja d_j^{α−1} )²
such that F_ia = δ_{x_i ∈ C_a} for all labelled x_i. 32 / 63

64 Applications/Semi-supervised Learning 32/63 Problem Statement
Context: similar to clustering: classify x_1, ..., x_n ∈ R^p into k classes, with n_l labelled and n_u unlabelled data.
Problem statement: assign scores F_ia (with d_i = [K 1_n]_i):
F = argmin_{F ∈ R^{n×k}} Σ_{a=1}^k Σ_{i,j} K_ij ( F_ia d_i^{α−1} − F_ja d_j^{α−1} )²
such that F_ia = δ_{x_i ∈ C_a} for all labelled x_i.
Solution: with F^(u) ∈ R^{n_u×k}, F^(l) ∈ R^{n_l×k} the scores of the unlabelled/labelled data,
F^(u) = ( I_{n_u} − D_(u)^{−α} K_(u,u) D_(u)^{α−1} )^{−1} D_(u)^{−α} K_(u,l) D_(l)^{α−1} F^(l)
where we naturally decompose
K = [ K_(l,l)  K_(l,u) ; K_(u,l)  K_(u,u) ],  D = [ D_(l)  0 ; 0  D_(u) ] = diag(K 1_n). 32 / 63
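A minimal implementation of this closed-form solution (my sketch; the function name and the one-hot encoding of F^(l) are my own choices) is given below.

```python
import numpy as np

def ssl_scores(K, y_labeled, n_l, k, alpha=0.0):
    """Scores F^(u) for the unlabelled points, following the closed form above.
    K: n x n kernel matrix whose first n_l points are labelled;
    y_labeled: labels of those points, in {0, ..., k-1}."""
    d = K.sum(axis=1)                               # d = K 1_n
    F_l = np.eye(k)[y_labeled]                      # F^(l)_ia = 1 if x_i in class a
    K_uu, K_ul = K[n_l:, n_l:], K[n_l:, :n_l]
    D_u, D_l = d[n_l:], d[:n_l]
    A = np.eye(len(D_u)) - (D_u**-alpha)[:, None] * K_uu * (D_u**(alpha - 1))[None, :]
    B = (D_u**-alpha)[:, None] * K_ul * (D_l**(alpha - 1))[None, :]
    return np.linalg.solve(A, B @ F_l)              # F^(u), of size n_u x k
```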

65 Applications/Semi-supervised Learning 33/63 MNIST Data Example
Figure: Vector [F^(u)]_{·,1} (Zeros), for 3-class MNIST data (zeros, ones, twos), n = 192, p = 784, n_l/n = 1/16, Gaussian kernel. 33 / 63

66 Applications/Semi-supervised Learning 33/63 MNIST Data Example
Figure: Vectors [F^(u)]_{·,1} (Zeros) and [F^(u)]_{·,2} (Ones), for 3-class MNIST data (zeros, ones, twos), n = 192, p = 784, n_l/n = 1/16, Gaussian kernel. 33 / 63

67 Applications/Semi-supervised Learning 33/63 MNIST Data Example
Figure: Vectors [F^(u)]_{·,a}, a = 1, 2, 3 (Zeros, Ones, Twos), for 3-class MNIST data (zeros, ones, twos), n = 192, p = 784, n_l/n = 1/16, Gaussian kernel. 33 / 63

68 Applications/Semi-supervised Learning 34/63 MNIST Data Example
Figure: Centered vector [F°^(u)]_{·,1} = [F^(u) − (1/k) F^(u) 1_k 1_k^T]_{·,1} (Zeros), 3-class MNIST data (zeros, ones, twos), α = 0, n = 192, p = 784, n_l/n = 1/16, Gaussian kernel. 34 / 63

69 Applications/Semi-supervised Learning 34/63 MNIST Data Example
Figure: Centered vectors [F°^(u)]_{·,a} = [F^(u) − (1/k) F^(u) 1_k 1_k^T]_{·,a}, a = 1, 2 (Zeros, Ones), 3-class MNIST data (zeros, ones, twos), α = 0, n = 192, p = 784, n_l/n = 1/16, Gaussian kernel. 34 / 63

70 Applications/Semi-supervised Learning 34/63 MNIST Data Example
Figure: Centered vectors [F°^(u)]_{·,a} = [F^(u) − (1/k) F^(u) 1_k 1_k^T]_{·,a}, a = 1, 2, 3 (Zeros, Ones, Twos), 3-class MNIST data (zeros, ones, twos), α = 0, n = 192, p = 784, n_l/n = 1/16, Gaussian kernel. 34 / 63

71 Applications/Semi-supervised Learning 35/63 Main Results
Results: assuming n_l/n → c_l ∈ (0, 1), by the previous Taylor expansion, to first order,
F^(u)_{·,a} = (n_{l,a}/n) [ v + α t_a 1_{n_u}/√n ] + O(n^{−1})
where v = O(1) is a random vector (entry-wise), the term α t_a 1_{n_u}/√n is O(n^{−1/2}), the O(n^{−1}) terms are the informative ones, and t_a = (1/√p) tr C_a°. 35 / 63

72 Applications/Semi-supervised Learning 35/63 Main Results
Results: assuming n_l/n → c_l ∈ (0, 1), by the previous Taylor expansion, to first order,
F^(u)_{·,a} = (n_{l,a}/n) [ v + α t_a 1_{n_u}/√n ] + O(n^{−1})
where v = O(1) is a random vector (entry-wise), the term α t_a 1_{n_u}/√n is O(n^{−1/2}), the O(n^{−1}) terms are the informative ones, and t_a = (1/√p) tr C_a°.
Consequences: 35 / 63

73 Applications/Semi-supervised Learning 35/63 Main Results
Results: assuming n_l/n → c_l ∈ (0, 1), by the previous Taylor expansion, to first order,
F^(u)_{·,a} = (n_{l,a}/n) [ v + α t_a 1_{n_u}/√n ] + O(n^{−1})
where v = O(1) is a random vector (entry-wise), the term α t_a 1_{n_u}/√n is O(n^{−1/2}), the O(n^{−1}) terms are the informative ones, and t_a = (1/√p) tr C_a°.
Consequences:
random non-informative bias v. 35 / 63

74 Applications/Semi-supervised Learning 35/63 Main Results
Results: assuming n_l/n → c_l ∈ (0, 1), by the previous Taylor expansion, to first order,
F^(u)_{·,a} = (n_{l,a}/n) [ v + α t_a 1_{n_u}/√n ] + O(n^{−1})
where v = O(1) is a random vector (entry-wise), the term α t_a 1_{n_u}/√n is O(n^{−1/2}), the O(n^{−1}) terms are the informative ones, and t_a = (1/√p) tr C_a°.
Consequences:
random non-informative bias v;
strong impact of n_{l,a}: F^(u)_{·,a} must be scaled by n_{l,a}. 35 / 63

75 Applications/Semi-supervised Learning 35/63 Main Results
Results: assuming n_l/n → c_l ∈ (0, 1), by the previous Taylor expansion, to first order,
F^(u)_{·,a} = (n_{l,a}/n) [ v + α t_a 1_{n_u}/√n ] + O(n^{−1})
where v = O(1) is a random vector (entry-wise), the term α t_a 1_{n_u}/√n is O(n^{−1/2}), the O(n^{−1}) terms are the informative ones, and t_a = (1/√p) tr C_a°.
Consequences:
random non-informative bias v;
strong impact of n_{l,a}: F^(u)_{·,a} must be scaled by n_{l,a};
additional per-class bias α t_a 1_{n_u}: take α = 0 + β/√p. 35 / 63

76 Applications/Semi-supervised Learning 36/63 Main Results
As a consequence of the remarks above, we take α = β/√p and define F̂^(u)_{i,a} = (np/n_{l,a}) F^(u)_{ia}. 36 / 63

77 Applications/Semi-supervised Learning 36/63 Main Results
As a consequence of the remarks above, we take α = β/√p and define F̂^(u)_{i,a} = (np/n_{l,a}) F^(u)_{ia}.
Theorem. For x_i ∈ C_b unlabelled, F̂_{i,·} − G_b → 0, with G_b ∼ N(m_b, Σ_b), where m_b ∈ R^k and Σ_b ∈ R^{k×k} are given by
(m_b)_a = −2 (f′(τ)/f(τ)) M_ab + (f″(τ)/f(τ)) t_a t_b + 2 (f″(τ)/f(τ)) T_ab − (f′(τ)²/f(τ)²) t_a t_b + β (n/n_l) (f′(τ)/f(τ)) t_a + B_b
(Σ_b)_{a_1 a_2} = 2 (tr C_b²/p) ( f′(τ)²/f(τ)² − f″(τ)/f(τ) )² t_{a_1} t_{a_2} + 4 (f′(τ)²/f(τ)²) ( [M^T C_b M]_{a_1 a_2} + δ_{a_1}^{a_2} p T_{b a_1}/n_{l,a_1} )
with t, T, M as before, X°_a = X_a − Σ_{d=1}^k (n_{l,d}/n_l) X_d, and B_b a bias independent of a. 36 / 63

78 Applications/Semi-supervised Learning 37/63 Main Results
Corollary (Asymptotic Classification Error). For k = 2 classes and a ≠ b,
P( F̂_{i,a} > F̂_{i,b} | x_i ∈ C_b ) − Q( ((m_b)_b − (m_b)_a) / √([1, −1] Σ_b [1, −1]^T) ) → 0. 37 / 63

79 Applications/Semi-supervised Learning 37/63 Main Results
Corollary (Asymptotic Classification Error). For k = 2 classes and a ≠ b,
P( F̂_{i,a} > F̂_{i,b} | x_i ∈ C_b ) − Q( ((m_b)_b − (m_b)_a) / √([1, −1] Σ_b [1, −1]^T) ) → 0.
Some consequences:
non-obvious choices of appropriate kernels;
non-obvious choice of the optimal β (it induces a possibly beneficial bias);
importance of n_l versus n_u. 37 / 63
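For illustration, the corollary can be evaluated with a few lines of code (my sketch; the values of m_b and Σ_b below are hypothetical placeholders, not taken from the theory).

```python
import numpy as np
from scipy.stats import norm

def misclassification_prob(m_b, Sigma_b, a=0, b=1):
    """Asymptotic P(F_hat[i,a] > F_hat[i,b] | x_i in C_b) = Q(.) as in the corollary."""
    diff = m_b[b] - m_b[a]
    e = np.zeros(len(m_b)); e[a], e[b] = 1.0, -1.0
    return norm.sf(diff / np.sqrt(e @ Sigma_b @ e))   # Q(x) = 1 - Phi(x)

# hypothetical mean and covariance, for illustration only
print(misclassification_prob(np.array([0.2, 1.0]),
                             np.array([[1.0, 0.3], [0.3, 1.2]])))
```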

80 Applications/Semi-supervised Learning 38/63 MNIST Data Example
Figure: Probability of correct classification as a function of α (simulations), for 3-class MNIST data (zeros, ones, twos), n = 192, p = 784, n_l/n = 1/16, Gaussian kernel. 38 / 63

81 Applications/Semi-supervised Learning 38/63 MNIST Data Example
Figure: Probability of correct classification as a function of α (simulations versus theory as if Gaussian), for 3-class MNIST data (zeros, ones, twos), n = 192, p = 784, n_l/n = 1/16, Gaussian kernel. 38 / 63

82 Applications/Semi-supervised Learning 39/63 MNIST Data Example
Figure: Probability of correct classification as a function of α (simulations), for 2-class MNIST data (zeros, ones), n = 1568, p = 784, n_l/n = 1/16, Gaussian kernel. 39 / 63

83 Applications/Semi-supervised Learning 39/63 MNIST Data Example
Figure: Probability of correct classification as a function of α (simulations versus theory as if Gaussian), for 2-class MNIST data (zeros, ones), n = 1568, p = 784, n_l/n = 1/16, Gaussian kernel. 39 / 63

84 Applications/Random Feature Maps, Extreme Learning Machines, and Neural Networks 40/63 Outline Basics of Random Matrix Theory Motivation: Large Sample Covariance Matrices Spiked Models Applications Reminder on Spectral Clustering Methods Kernel Spectral Clustering Semi-supervised Learning Random Feature Maps, Extreme Learning Machines, and Neural Networks Perspectives 40 / 63

85 Applications/Random Feature Maps, Extreme Learning Machines, and Neural Networks 41/63 Random Feature Maps and Extreme Learning Machines
Context: Random Feature Map:
(large) input x_1, ..., x_T ∈ R^p;
random W = [w_1, ..., w_n]^T ∈ R^{n×p};
non-linear activation function σ. 41 / 63

86 Applications/Random Feature Maps, Extreme Learning Machines, and Neural Networks 41/63 Random Feature Maps and Extreme Learning Machines
Context: Random Feature Map:
(large) input x_1, ..., x_T ∈ R^p;
random W = [w_1, ..., w_n]^T ∈ R^{n×p};
non-linear activation function σ.
Neural Network Model (extreme learning machine): ridge-regression learning:
(small) output y_1, ..., y_T ∈ R^d;
ridge-regression output β ∈ R^{n×d}. 41 / 63

87 Applications/Random Feature Maps, Extreme Learning Machines, and Neural Networks 42/63 Random Feature Maps and Extreme Learning Machines
Objectives: evaluate the training and testing MSE performance as n, p, T → ∞. 42 / 63

88 Applications/Random Feature Maps, Extreme Learning Machines, and Neural Networks 42/63 Random Feature Maps and Extreme Learning Machines
Objectives: evaluate the training and testing MSE performance as n, p, T → ∞.
Training MSE:
E_train = (1/T) Σ_{i=1}^T ||y_i − β^T σ(W x_i)||² = (1/T) ||Y − β^T Σ||_F²
with Y = [y_1, ..., y_T] ∈ R^{d×T}, Σ = σ(W X) = { σ(w_i^T x_j) }_{1≤i≤n, 1≤j≤T}, and
β = (1/T) Σ ( (1/T) Σ^T Σ + γ I_T )^{−1} Y^T. 42 / 63

89 Applications/Random Feature Maps, Extreme Learning Machines, and Neural Networks 42/63 Random Feature Maps and Extreme Learning Machines
Objectives: evaluate the training and testing MSE performance as n, p, T → ∞.
Training MSE:
E_train = (1/T) Σ_{i=1}^T ||y_i − β^T σ(W x_i)||² = (1/T) ||Y − β^T Σ||_F²
with Y = [y_1, ..., y_T] ∈ R^{d×T}, Σ = σ(W X) = { σ(w_i^T x_j) }_{1≤i≤n, 1≤j≤T}, and
β = (1/T) Σ ( (1/T) Σ^T Σ + γ I_T )^{−1} Y^T.
Testing MSE: on a new pair (X̂, Ŷ) of length T̂,
E_test = (1/T̂) ||Ŷ − β^T Σ̂||_F², where Σ̂ = σ(W X̂). 42 / 63
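A small end-to-end sketch of this extreme learning machine (mine, on synthetic two-class Gaussian data rather than MNIST, with ReLU activation and my own choices of p, n, T, γ) computes E_train and E_test exactly as defined above.

```python
import numpy as np

rng = np.random.default_rng(3)
p, n, T, T_hat, gamma = 64, 256, 512, 512, 1e-2

def make_data(T):
    """Toy 2-class Gaussian data: X of size p x T, labels Y of size 1 x T in {-1, +1}."""
    y = np.sign(rng.standard_normal(T))
    X = rng.standard_normal((p, T))
    X[0, :] += 1.5 * y                       # mean shift on the first coordinate
    return X, y[None, :]

X, Y = make_data(T)
X_hat, Y_hat = make_data(T_hat)

W = rng.standard_normal((n, p))              # random (untrained) first layer
Sig, Sig_hat = np.maximum(W @ X, 0), np.maximum(W @ X_hat, 0)   # sigma(t) = max(t, 0)

# ridge-regression readout: beta = (1/T) Sigma (Sigma^T Sigma / T + gamma I_T)^{-1} Y^T
beta = Sig @ np.linalg.solve(Sig.T @ Sig / T + gamma * np.eye(T), Y.T) / T

E_train = np.sum((Y - beta.T @ Sig) ** 2) / T
E_test = np.sum((Y_hat - beta.T @ Sig_hat) ** 2) / T_hat
print(f"E_train = {E_train:.3f},  E_test = {E_test:.3f}")
```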

90 Applications/Random Feature Maps, Extreme Learning Machines, and Neural Networks 43/63 Technical Aspects
Preliminary observations: link to the resolvent of (1/T) Σ^T Σ:
E_train = (γ²/T) tr Y^T Y Q² = −γ² (∂/∂γ) [ (1/T) tr Y^T Y Q ]
where Q = Q(γ) = ( (1/T) Σ^T Σ + γ I_T )^{−1} is the resolvent, with Σ_ij = σ(w_i^T x_j). 43 / 63

91 Applications/Random Feature Maps, Extreme Learning Machines, and Neural Networks 43/63 Technical Aspects
Preliminary observations: link to the resolvent of (1/T) Σ^T Σ:
E_train = (γ²/T) tr Y^T Y Q² = −γ² (∂/∂γ) [ (1/T) tr Y^T Y Q ]
where Q = Q(γ) = ( (1/T) Σ^T Σ + γ I_T )^{−1} is the resolvent, with Σ_ij = σ(w_i^T x_j).
Central object: the resolvent E[Q]. 43 / 63

92 Applications/Random Feature Maps, Extreme Learning Machines, and Neural Networks 44/63 Main Technical Result
Theorem [Asymptotic Equivalent for E[Q]]. For Lipschitz σ, bounded X, Y, and W = f(Z) (entry-wise) with Z standard Gaussian, we have, for all ε > 0 and some C > 0,
||E[Q] − Q̄|| < C n^{ε − 1/2}
where Q̄ = ( (n/T) Φ/(1 + δ) + γ I_T )^{−1}, Φ = E[ σ(X^T w) σ(w^T X) ] with w = f(z), z ∼ N(0, I_p),
and δ > 0 is the unique positive solution of δ = (1/T) tr Φ Q̄. 44 / 63

93 Applications/Random Feature Maps, Extreme Learning Machines, and Neural Networks 44/63 Main Technical Result
Theorem [Asymptotic Equivalent for E[Q]]. For Lipschitz σ, bounded X, Y, and W = f(Z) (entry-wise) with Z standard Gaussian, we have, for all ε > 0 and some C > 0,
||E[Q] − Q̄|| < C n^{ε − 1/2}
where Q̄ = ( (n/T) Φ/(1 + δ) + γ I_T )^{−1}, Φ = E[ σ(X^T w) σ(w^T X) ] with w = f(z), z ∼ N(0, I_p),
and δ > 0 is the unique positive solution of δ = (1/T) tr Φ Q̄.
Proof arguments:
σ(W X) has independent rows but dependent columns; this breaks the trace lemma argument (i.e., (1/p) w^T X A X^T w ≃ (1/p) tr X A X^T). 44 / 63

94 Applications/Random Feature Maps, Extreme Learning Machines, and Neural Networks 44/63 Main Technical Result
Theorem [Asymptotic Equivalent for E[Q]]. For Lipschitz σ, bounded X, Y, and W = f(Z) (entry-wise) with Z standard Gaussian, we have, for all ε > 0 and some C > 0,
||E[Q] − Q̄|| < C n^{ε − 1/2}
where Q̄ = ( (n/T) Φ/(1 + δ) + γ I_T )^{−1}, Φ = E[ σ(X^T w) σ(w^T X) ] with w = f(z), z ∼ N(0, I_p),
and δ > 0 is the unique positive solution of δ = (1/T) tr Φ Q̄.
Proof arguments:
σ(W X) has independent rows but dependent columns; this breaks the trace lemma argument (i.e., (1/p) w^T X A X^T w ≃ (1/p) tr X A X^T).
Concentration of measure lemma: (1/p) σ(w^T X) A σ(X^T w) ≃ (1/p) tr Φ A. 44 / 63

95 Applications/Random Feature Maps, Extreme Learning Machines, and Neural Networks 45/63 Main Technical Result
Values of Φ(a, b) for w ∼ N(0, I_p), where ∠(a, b) = a^T b / (||a|| ||b||):
σ(t) = max(t, 0):  Φ(a, b) = (1/(2π)) ||a|| ||b|| ( ∠(a, b) acos(−∠(a, b)) + √(1 − ∠(a, b)²) )
σ(t) = |t|:        Φ(a, b) = (2/π) ||a|| ||b|| ( ∠(a, b) asin(∠(a, b)) + √(1 − ∠(a, b)²) )
σ(t) = erf(t):     Φ(a, b) = (2/π) asin( 2 a^T b / √((1 + 2||a||²)(1 + 2||b||²)) )
σ(t) = 1_{t>0}:    Φ(a, b) = 1/2 − (1/(2π)) acos(∠(a, b))
σ(t) = sign(t):    Φ(a, b) = 1 − (2/π) acos(∠(a, b))
σ(t) = cos(t):     Φ(a, b) = exp(−(||a||² + ||b||²)/2) cosh(a^T b). 45 / 63

96 Applications/Random Feature Maps, Extreme Learning Machines, and Neural Networks 45/63 Main Technical Result
Values of Φ(a, b) for w ∼ N(0, I_p), where ∠(a, b) = a^T b / (||a|| ||b||):
σ(t) = max(t, 0):  Φ(a, b) = (1/(2π)) ||a|| ||b|| ( ∠(a, b) acos(−∠(a, b)) + √(1 − ∠(a, b)²) )
σ(t) = |t|:        Φ(a, b) = (2/π) ||a|| ||b|| ( ∠(a, b) asin(∠(a, b)) + √(1 − ∠(a, b)²) )
σ(t) = erf(t):     Φ(a, b) = (2/π) asin( 2 a^T b / √((1 + 2||a||²)(1 + 2||b||²)) )
σ(t) = 1_{t>0}:    Φ(a, b) = 1/2 − (1/(2π)) acos(∠(a, b))
σ(t) = sign(t):    Φ(a, b) = 1 − (2/π) acos(∠(a, b))
σ(t) = cos(t):     Φ(a, b) = exp(−(||a||² + ||b||²)/2) cosh(a^T b).
Value of Φ(a, b) for w_i i.i.d. with E[w_i^k] = m_k (m_1 = 0) and σ(t) = ζ_2 t² + ζ_1 t + ζ_0:
Φ(a, b) = ζ_2² [ m_2² ( 2(a^T b)² + ||a||² ||b||² ) + (m_4 − 3 m_2²) (a²)^T (b²) ] + ζ_1² m_2 a^T b + ζ_2 ζ_1 m_3 [ (a²)^T b + a^T (b²) ] + ζ_2 ζ_0 m_2 [ ||a||² + ||b||² ] + ζ_0²
where (a²) = [a_1², ..., a_p²]^T. 45 / 63
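The first row of the table (σ(t) = max(t, 0)) can be sanity-checked by Monte Carlo; the sketch below (mine) compares the empirical average of σ(w^T a) σ(w^T b) over Gaussian draws of w with the closed form.

```python
import numpy as np

rng = np.random.default_rng(4)
p, N = 50, 200000
a, b = rng.standard_normal(p), rng.standard_normal(p)

# Monte Carlo estimate of Phi(a, b) = E[ relu(w^T a) relu(w^T b) ], w ~ N(0, I_p)
W = rng.standard_normal((N, p))
mc = np.mean(np.maximum(W @ a, 0) * np.maximum(W @ b, 0))

# closed form from the table, with angle(a, b) = a^T b / (||a|| ||b||)
na, nb = np.linalg.norm(a), np.linalg.norm(b)
ang = a @ b / (na * nb)
cf = na * nb / (2 * np.pi) * (ang * np.arccos(-ang) + np.sqrt(1 - ang**2))

print(mc, cf)   # the two values should be close
```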

97 Applications/Random Feature Maps, Extreme Learning Machines, and Neural Networks 46/63 Main Results
Theorem [Asymptotic E_train]. For all ε > 0, almost surely, n^{1/2 − ε} ( E_train − Ē_train ) → 0, where
E_train = (1/T) ||Y^T − Σ^T β||_F² = (γ²/T) tr Y^T Y Q²
Ē_train = (γ²/T) tr Y^T Y Q̄ [ ( (1/n) tr(Ψ Q̄²) / (1 − (1/n) tr((Ψ Q̄)²)) ) Ψ + I_T ] Q̄
with Ψ = (n/T) Φ/(1 + δ). 46 / 63

98 Applications/Random Feature Maps, Extreme Learning Machines, and Neural Networks 47/63 Main Results
Letting X̂ ∈ R^{p×T̂}, Ŷ ∈ R^{d×T̂} satisfy similar properties as (X, Y),
Claim [Asymptotic E_test]. For all ε > 0, almost surely, n^{1/2 − ε} ( E_test − Ē_test ) → 0, where
E_test = (1/T̂) ||Ŷ^T − Σ̂^T β||_F²
Ē_test = (1/T̂) ||Ŷ^T − Ψ_{X X̂}^T Q̄ Y^T||_F² + ( (1/n) tr Y^T Y Q̄ Ψ Q̄ / (1 − (1/n) tr((Ψ Q̄)²)) ) [ (1/T̂) tr Ψ_{X̂ X̂} − (1/T̂) tr (I_T + γ Q̄)(Ψ_{X X̂} Ψ_{X̂ X} Q̄) ]
with Ψ_{AB} = (n/T) Φ_{AB}/(1 + δ), Φ_{AB} = E[ σ(A^T w) σ(w^T B) ]. 47 / 63

99 Applications/Random Feature Maps, Extreme Learning Machines, and Neural Networks 48/63 Simulations on MNIST: Lipschitz σ(·)
Figure: Neural network performance (E_train, E_test and limits Ē_train, Ē_test) as a function of γ, for σ(t) = t; 2-class MNIST data (sevens, nines), n = 512, T = T̂ = 1024, p = 784. 48 / 63

100 Applications/Random Feature Maps, Extreme Learning Machines, and Neural Networks 48/63 Simulations on MNIST: Lipschitz σ(·)
Figure: Neural network performance (E_train, E_test and limits Ē_train, Ē_test) as a function of γ, for σ(t) = t; 2-class MNIST data (sevens, nines), n = 512, T = T̂ = 1024, p = 784. 48 / 63

101 Applications/Random Feature Maps, Extreme Learning Machines, and Neural Networks 48/63 Simulations on MNIST: Lipschitz σ(·)
Figure: Neural network performance (E_train, E_test and limits Ē_train, Ē_test) as a function of γ, for σ(t) = t and σ(t) = erf(t); 2-class MNIST data (sevens, nines), n = 512, T = T̂ = 1024, p = 784. 48 / 63

102 Applications/Random Feature Maps, Extreme Learning Machines, and Neural Networks 48/63 Simulations on MNIST: Lipschitz σ(·)
Figure: Neural network performance (E_train, E_test and limits Ē_train, Ē_test) as a function of γ, for σ(t) = t, σ(t) = |t| and σ(t) = erf(t); 2-class MNIST data (sevens, nines), n = 512, T = T̂ = 1024, p = 784. 48 / 63

103 Applications/Random Feature Maps, Extreme Learning Machines, and Neural Networks 48/63 Simulations on MNIST: Lipschitz σ(·)
Figure: Neural network performance (E_train, E_test and limits Ē_train, Ē_test) as a function of γ, for σ(t) = t, σ(t) = |t|, σ(t) = erf(t) and σ(t) = max(t, 0); 2-class MNIST data (sevens, nines), n = 512, T = T̂ = 1024, p = 784. 48 / 63

104 Applications/Random Feature Maps, Extreme Learning Machines, and Neural Networks 49/63 Simulations on MNIST: non-Lipschitz σ(·)
Figure: Neural network performance (E_train, E_test and limits Ē_train, Ē_test) for σ(·) either discontinuous or non-Lipschitz (σ(t) = t², σ(t) = 1_{t>0}, σ(t) = sign(t)), as a function of γ, for 2-class MNIST data (sevens, nines), n = 512, T = T̂ = 1024, p = 784. 49 / 63

105 Applications/Random Feature Maps, Extreme Learning Machines, and Neural Networks 50/63 Simulations on tuned Gaussian mixture
Gaussian mixture classification: X = [X_1, X_2], with columns {X_1}_{·,i} ∼ N(0, C_1), {X_2}_{·,i} ∼ N(0, C_2), tr C_1 = tr C_2. 50 / 63

106 Applications/Random Feature Maps, Extreme Learning Machines, and Neural Networks 50/63 Simulations on tuned Gaussian mixture
Gaussian mixture classification: X = [X_1, X_2], with columns {X_1}_{·,i} ∼ N(0, C_1), {X_2}_{·,i} ∼ N(0, C_2), tr C_1 = tr C_2.
We can prove that, for σ(t) = ζ_2 t² + ζ_1 t + ζ_0 and E[W_ij^k] = m_k, classification is only possible if m_4 ≠ m_2². 50 / 63

107 Applications/Random Feature Maps, Extreme Learning Machines, and Neural Networks 50/63 Simulations on tuned Gaussian mixture
Gaussian mixture classification: X = [X_1, X_2], with columns {X_1}_{·,i} ∼ N(0, C_1), {X_2}_{·,i} ∼ N(0, C_2), tr C_1 = tr C_2.
We can prove that, for σ(t) = ζ_2 t² + ζ_1 t + ζ_0 and E[W_ij^k] = m_k, classification is only possible if m_4 ≠ m_2².
[Figure: MSE (E_train, E_test and limits Ē_train, Ē_test) as a function of γ, for W_ij ∼ Bern, W_ij ∼ N(0, 1), W_ij ∼ Stud.] 50 / 63

108 Applications/Random Feature Maps, Extreme Learning Machines, and Neural Networks 51/63 Simulations on tuned Gaussian mixture
Interpretation in the eigenstructure of Φ: no information is carried in the dominant eigenmodes if m_4 = m_2².
[Figure: spectrum of Φ: close spike for W_ij ∼ N(0, 1), no spike for W_ij ∼ Bern, far spike for W_ij ∼ Stud.] 51 / 63

109 Perspectives/ 52/63 Outline Basics of Random Matrix Theory Motivation: Large Sample Covariance Matrices Spiked Models Applications Reminder on Spectral Clustering Methods Kernel Spectral Clustering Semi-supervised Learning Random Feature Maps, Extreme Learning Machines, and Neural Networks Perspectives 52 / 63

110 Perspectives/ 53/63 Summary of Results and Perspectives I Random Neural Networks. Extreme learning machines (one-layer random NN) Linear echo-state networks (ESN) Logistic regression and classification error in extreme learning machines (ELM) Further random feature maps characterization Generalized random NN (multiple layers, multiple activations) Random convolutional networks for image processing Non-linear ESN Deep Neural Networks (DNN). Backpropagation in NN (σ(w X) for random X, backprop. on W ) Statistical physics-inspired approaches (spin-glass models, Hamiltonian-based models) Non-linear ESN DNN performance of physics-realistic models (4th-order Hamiltonian, locality) 53 / 63

111 Perspectives/ 54/63 Summary of Results and Perspectives II References. H. W. Lin, M. Tegmark, Why does deep and cheap learning work so well?, arXiv: v2. C. Williams, Computation with infinite neural networks, Neural Computation, 10(5). Herbert Jaeger, Short term memory in echo state networks, GMD-Forschungszentrum Informationstechnik. Guang-Bin Huang, Qin-Yu Zhu, Chee-Kheong Siew, Extreme learning machine: theory and applications, Neurocomputing, 70(1):489–501. N. El Karoui, Concentration of measure and spectra of random matrices: applications to correlation matrices, elliptical distributions and beyond, The Annals of Applied Probability, 19(6). C. Louart, Z. Liao, R. Couillet, A Random Matrix Approach to Neural Networks, (submitted to) Annals of Applied Probability. R. Couillet, G. Wainrib, H. Sevi, H. Tiomoko Ali, The asymptotic performance of linear echo state neural networks, Journal of Machine Learning Research, vol. 17, no. 178, pp. 1-35. Choromanska, Anna, et al., The Loss Surfaces of Multilayer Networks, AISTATS. 54 / 63

112 Perspectives/ 55/63 Summary of Results and Perspectives III Rahimi, Ali, and Benjamin Recht. Random Features for Large-Scale Kernel Machines. NIPS. Vol. 3. No / 63

113 Perspectives/ 56/63 Summary of Results and Perspectives I Kernel methods. Spectral clustering Subspace spectral clustering (f (τ) = 0) Spectral clustering with outer product kernel f(x T y) Semi-supervised learning, kernel approaches. Least square support vector machines (LS-SVM). Support vector machines (SVM). Kernel matrices based on Kendall τ, Spearman ρ. Applications. Massive MIMO user subspace clustering (patent proposed) Kernel correlation matrices for biostats, heterogeneous datasets. Kernel PCA. Kendall τ in biostats. References. N. El Karoui, The spectrum of kernel random matrices, The Annals of Statistics, 38(1), 1-50, / 63

114 Perspectives/ 57/63 Summary of Results and Perspectives II R. Couillet, F. Benaych-Georges, Kernel Spectral Clustering of Large Dimensional Data, Electronic Journal of Statistics, vol. 10, no. 1, pp , R. Couillet, A. Kammoun, Random Matrix Improved Subspace Clustering, Asilomar Conference on Signals, Systems, and Computers, Pacific Grove, CA, USA, Z. Liao, R. Couillet, A Large Dimensional Analysis of Least Squares Support Vector Machines, (submitted to) Journal of Machine Learning Research, X. Mai, R. Couillet, The counterintuitive mechanism of graph-based semi-supervised learning in the big data regime, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 17), New Orleans, USA, / 63

115 Perspectives/ 58/63 Summary of Results and Perspectives I Community detection. Heterogeneous dense network clustering. Semi-supervised clustering. Sparse network extensions. Beyond community detection (hub detection). Applications. Improved methods for community detection. Applications to distributed optimization (network diffusion, graph signal processing). References. H. Tiomoko Ali, R. Couillet, Spectral community detection in heterogeneous large networks, (submitted to) Journal of Multivariate Analysis, F. Krzakala, C. Moore, E. Mossel, J. Neeman, A. Sly, L. Zdeborová, P. Zhang, Spectral redemption in clustering sparse networks. Proceedings of the National Academy of Sciences, 110(52), , C. Bordenave, M. Lelarge, L. Massoulié, Non-backtracking spectrum of random graphs: community detection and non-regular Ramanujan graphs, Foundations of Computer Science (FOCS), 2015 IEEE 56th Annual Symposium on, pp , 2015 A. Saade, F. Krzakala, L. Zdeborová, Spectral clustering of graphs with the Bethe Hessian, In Advances in Neural Information Processing Systems, pp , / 63

116 Perspectives/ 59/63 Summary of Results and Perspectives I Robust statistics. Tyler, Maronna (and regularized) estimators Elliptical data setting, deterministic outlier setting Central limit theorem extensions Joint mean and covariance robust estimation Robust regression (preliminary works exist already using strikingly different approaches) Applications. Statistical finance (portfolio estimation) Localisation in array processing (robust GMUSIC) Detectors in space time array processing Correlation matrices in biostatistics, human science datasets, etc. References. R. Couillet, F. Pascal, J. W. Silverstein, Robust Estimates of Covariance Matrices in the Large Dimensional Regime, IEEE Transactions on Information Theory, vol. 60, no. 11, pp , / 63

117 Perspectives/ 60/63 Summary of Results and Perspectives II R. Couillet, F. Pascal, J. W. Silverstein, The Random Matrix Regime of Maronna s M-estimator with elliptically distributed samples, Elsevier Journal of Multivariate Analysis, vol. 139, pp , T. Zhang, X. Cheng, A. Singer, Marchenko-Pastur Law for Tyler s and Maronna s M-estimators, arxiv: , R. Couillet, M. McKay, Large Dimensional Analysis and Optimization of Robust Shrinkage Covariance Matrix Estimators, Elsevier Journal of Multivariate Analysis, vol. 131, pp , D. Morales-Jimenez, R. Couillet, M. McKay, Large Dimensional Analysis of Robust M-Estimators of Covariance with Outliers, IEEE Transactions on Signal Processing, vol. 63, no. 21, pp , L. Yang, R. Couillet, M. McKay, A Robust Statistics Approach to Minimum Variance Portfolio Optimization, IEEE Transactions on Signal Processing, vol. 63, no. 24, pp , R. Couillet, Robust spiked random matrices and a robust G-MUSIC estimator, Elsevier Journal of Multivariate Analysis, vol. 140, pp , A. Kammoun, R. Couillet, F. Pascal, M.-S. Alouini, Optimal Design of the Adaptive Normalized Matched Filter Detector, (submitted to) IEEE Transactions on Information Theory, 2016, arxiv Preprint / 63

118 Perspectives/ 61/63 Summary of Results and Perspectives III R. Couillet, A. Kammoun, F. Pascal, Second order statistics of robust estimators of scatter. Application to GLRT detection for elliptical signals, Elsevier Journal of Multivariate Analysis, vol. 143, pp , D. Donoho, A. Montanari, High dimensional robust m-estimation: Asymptotic variance via approximate message passing, Probability Theory and Related Fields, 1-35, N. El Karoui, Asymptotic behavior of unregularized and ridge-regularized high-dimensional robust regression estimators: rigorous results. arxiv preprint arxiv: , / 63

119 Perspectives/ 62/63 Summary of Results and Perspectives I Other works and ideas. Spike random matrix sparse PCA Non-linear shrinkage methods Sparse kernel PCA Random signal processing on graph methods. Random matrix analysis of diffusion networks performance. Applications. Spike factor models in portfolio optimization Non-linear shrinkage in portfolio optimization, biostats References. R. Couillet, M. McKay, Optimal block-sparse PCA for high dimensional correlated samples, (submitted to) Journal of Multivariate Analysis, J. Bun, J. P. Bouchaud, M. Potters, On the overlaps between eigenvectors of correlated random matrices, arxiv preprint arxiv: (2016). Ledoit, O. and Wolf, M., Nonlinear shrinkage estimation of large-dimensional covariance matrices, / 63

120 Perspectives/ 63/63 The End Thank you. 63 / 63


EE613 Machine Learning for Engineers. Kernel methods Support Vector Machines. jean-marc odobez 2015 EE613 Machine Learning for Engineers Kernel methods Support Vector Machines jean-marc odobez 2015 overview Kernel methods introductions and main elements defining kernels Kernelization of k-nn, K-Means,

More information

A note on a Marčenko-Pastur type theorem for time series. Jianfeng. Yao

A note on a Marčenko-Pastur type theorem for time series. Jianfeng. Yao A note on a Marčenko-Pastur type theorem for time series Jianfeng Yao Workshop on High-dimensional statistics The University of Hong Kong, October 2011 Overview 1 High-dimensional data and the sample covariance

More information

Learning the Semantic Correlation: An Alternative Way to Gain from Unlabeled Text

Learning the Semantic Correlation: An Alternative Way to Gain from Unlabeled Text Learning the Semantic Correlation: An Alternative Way to Gain from Unlabeled Text Yi Zhang Machine Learning Department Carnegie Mellon University yizhang1@cs.cmu.edu Jeff Schneider The Robotics Institute

More information

Community Detection. fundamental limits & efficient algorithms. Laurent Massoulié, Inria

Community Detection. fundamental limits & efficient algorithms. Laurent Massoulié, Inria Community Detection fundamental limits & efficient algorithms Laurent Massoulié, Inria Community Detection From graph of node-to-node interactions, identify groups of similar nodes Example: Graph of US

More information

Bits of Machine Learning Part 1: Supervised Learning

Bits of Machine Learning Part 1: Supervised Learning Bits of Machine Learning Part 1: Supervised Learning Alexandre Proutiere and Vahan Petrosyan KTH (The Royal Institute of Technology) Outline of the Course 1. Supervised Learning Regression and Classification

More information

Spectral Partitiong in a Stochastic Block Model

Spectral Partitiong in a Stochastic Block Model Spectral Graph Theory Lecture 21 Spectral Partitiong in a Stochastic Block Model Daniel A. Spielman November 16, 2015 Disclaimer These notes are not necessarily an accurate representation of what happened

More information

Kernel Methods. Barnabás Póczos

Kernel Methods. Barnabás Póczos Kernel Methods Barnabás Póczos Outline Quick Introduction Feature space Perceptron in the feature space Kernels Mercer s theorem Finite domain Arbitrary domain Kernel families Constructing new kernels

More information

Learning Bound for Parameter Transfer Learning

Learning Bound for Parameter Transfer Learning Learning Bound for Parameter Transfer Learning Wataru Kumagai Faculty of Engineering Kanagawa University kumagai@kanagawa-u.ac.jp Abstract We consider a transfer-learning problem by using the parameter

More information

The circular law. Lewis Memorial Lecture / DIMACS minicourse March 19, Terence Tao (UCLA)

The circular law. Lewis Memorial Lecture / DIMACS minicourse March 19, Terence Tao (UCLA) The circular law Lewis Memorial Lecture / DIMACS minicourse March 19, 2008 Terence Tao (UCLA) 1 Eigenvalue distributions Let M = (a ij ) 1 i n;1 j n be a square matrix. Then one has n (generalised) eigenvalues

More information

arxiv: v1 [stat.ml] 3 Mar 2019

arxiv: v1 [stat.ml] 3 Mar 2019 Classification via local manifold approximation Didong Li 1 and David B Dunson 1,2 Department of Mathematics 1 and Statistical Science 2, Duke University arxiv:1903.00985v1 [stat.ml] 3 Mar 2019 Classifiers

More information

Reminders. Thought questions should be submitted on eclass. Please list the section related to the thought question

Reminders. Thought questions should be submitted on eclass. Please list the section related to the thought question Linear regression Reminders Thought questions should be submitted on eclass Please list the section related to the thought question If it is a more general, open-ended question not exactly related to a

More information

A PROBABILISTIC INTERPRETATION OF SAMPLING THEORY OF GRAPH SIGNALS. Akshay Gadde and Antonio Ortega

A PROBABILISTIC INTERPRETATION OF SAMPLING THEORY OF GRAPH SIGNALS. Akshay Gadde and Antonio Ortega A PROBABILISTIC INTERPRETATION OF SAMPLING THEORY OF GRAPH SIGNALS Akshay Gadde and Antonio Ortega Department of Electrical Engineering University of Southern California, Los Angeles Email: agadde@usc.edu,

More information

Minimum variance portfolio optimization in the spiked covariance model

Minimum variance portfolio optimization in the spiked covariance model Minimum variance portfolio optimization in the spiked covariance model Liusha Yang, Romain Couillet, Matthew R Mckay To cite this version: Liusha Yang, Romain Couillet, Matthew R Mckay Minimum variance

More information

what can deep learning learn from linear regression? Benjamin Recht University of California, Berkeley

what can deep learning learn from linear regression? Benjamin Recht University of California, Berkeley what can deep learning learn from linear regression? Benjamin Recht University of California, Berkeley Collaborators Joint work with Samy Bengio, Moritz Hardt, Michael Jordan, Jason Lee, Max Simchowitz,

More information

CS 195-5: Machine Learning Problem Set 1

CS 195-5: Machine Learning Problem Set 1 CS 95-5: Machine Learning Problem Set Douglas Lanman dlanman@brown.edu 7 September Regression Problem Show that the prediction errors y f(x; ŵ) are necessarily uncorrelated with any linear function of

More information

DISCUSSION OF INFLUENTIAL FEATURE PCA FOR HIGH DIMENSIONAL CLUSTERING. By T. Tony Cai and Linjun Zhang University of Pennsylvania

DISCUSSION OF INFLUENTIAL FEATURE PCA FOR HIGH DIMENSIONAL CLUSTERING. By T. Tony Cai and Linjun Zhang University of Pennsylvania Submitted to the Annals of Statistics DISCUSSION OF INFLUENTIAL FEATURE PCA FOR HIGH DIMENSIONAL CLUSTERING By T. Tony Cai and Linjun Zhang University of Pennsylvania We would like to congratulate the

More information

Finding normalized and modularity cuts by spectral clustering. Ljubjana 2010, October

Finding normalized and modularity cuts by spectral clustering. Ljubjana 2010, October Finding normalized and modularity cuts by spectral clustering Marianna Bolla Institute of Mathematics Budapest University of Technology and Economics marib@math.bme.hu Ljubjana 2010, October Outline Find

More information

Multi-View Regression via Canonincal Correlation Analysis

Multi-View Regression via Canonincal Correlation Analysis Multi-View Regression via Canonincal Correlation Analysis Dean P. Foster U. Pennsylvania (Sham M. Kakade of TTI) Model and background p ( i n) y i = X ij β j + ɛ i ɛ i iid N(0, σ 2 ) i=j Data mining and

More information

TUM 2016 Class 3 Large scale learning by regularization

TUM 2016 Class 3 Large scale learning by regularization TUM 2016 Class 3 Large scale learning by regularization Lorenzo Rosasco UNIGE-MIT-IIT July 25, 2016 Learning problem Solve min w E(w), E(w) = dρ(x, y)l(w x, y) given (x 1, y 1 ),..., (x n, y n ) Beyond

More information

ECE 521. Lecture 11 (not on midterm material) 13 February K-means clustering, Dimensionality reduction

ECE 521. Lecture 11 (not on midterm material) 13 February K-means clustering, Dimensionality reduction ECE 521 Lecture 11 (not on midterm material) 13 February 2017 K-means clustering, Dimensionality reduction With thanks to Ruslan Salakhutdinov for an earlier version of the slides Overview K-means clustering

More information

9.2 Support Vector Machines 159

9.2 Support Vector Machines 159 9.2 Support Vector Machines 159 9.2.3 Kernel Methods We have all the tools together now to make an exciting step. Let us summarize our findings. We are interested in regularized estimation problems of

More information

Learning from Labeled and Unlabeled Data: Semi-supervised Learning and Ranking p. 1/31

Learning from Labeled and Unlabeled Data: Semi-supervised Learning and Ranking p. 1/31 Learning from Labeled and Unlabeled Data: Semi-supervised Learning and Ranking Dengyong Zhou zhou@tuebingen.mpg.de Dept. Schölkopf, Max Planck Institute for Biological Cybernetics, Germany Learning from

More information

Cheng Soon Ong & Christian Walder. Canberra February June 2018

Cheng Soon Ong & Christian Walder. Canberra February June 2018 Cheng Soon Ong & Christian Walder Research Group and College of Engineering and Computer Science Canberra February June 2018 Outlines Overview Introduction Linear Algebra Probability Linear Regression

More information

Tutorial on Machine Learning for Advanced Electronics

Tutorial on Machine Learning for Advanced Electronics Tutorial on Machine Learning for Advanced Electronics Maxim Raginsky March 2017 Part I (Some) Theory and Principles Machine Learning: estimation of dependencies from empirical data (V. Vapnik) enabling

More information

Unsupervised dimensionality reduction

Unsupervised dimensionality reduction Unsupervised dimensionality reduction Guillaume Obozinski Ecole des Ponts - ParisTech SOCN course 2014 Guillaume Obozinski Unsupervised dimensionality reduction 1/30 Outline 1 PCA 2 Kernel PCA 3 Multidimensional

More information

8.1 Concentration inequality for Gaussian random matrix (cont d)

8.1 Concentration inequality for Gaussian random matrix (cont d) MGMT 69: Topics in High-dimensional Data Analysis Falll 26 Lecture 8: Spectral clustering and Laplacian matrices Lecturer: Jiaming Xu Scribe: Hyun-Ju Oh and Taotao He, October 4, 26 Outline Concentration

More information

Estimation of large dimensional sparse covariance matrices

Estimation of large dimensional sparse covariance matrices Estimation of large dimensional sparse covariance matrices Department of Statistics UC, Berkeley May 5, 2009 Sample covariance matrix and its eigenvalues Data: n p matrix X n (independent identically distributed)

More information

Information-theoretically Optimal Sparse PCA

Information-theoretically Optimal Sparse PCA Information-theoretically Optimal Sparse PCA Yash Deshpande Department of Electrical Engineering Stanford, CA. Andrea Montanari Departments of Electrical Engineering and Statistics Stanford, CA. Abstract

More information

Tractable Upper Bounds on the Restricted Isometry Constant

Tractable Upper Bounds on the Restricted Isometry Constant Tractable Upper Bounds on the Restricted Isometry Constant Alex d Aspremont, Francis Bach, Laurent El Ghaoui Princeton University, École Normale Supérieure, U.C. Berkeley. Support from NSF, DHS and Google.

More information

Classification. Sandro Cumani. Politecnico di Torino

Classification. Sandro Cumani. Politecnico di Torino Politecnico di Torino Outline Generative model: Gaussian classifier (Linear) discriminative model: logistic regression (Non linear) discriminative model: neural networks Gaussian Classifier We want to

More information

The Laplacian PDF Distance: A Cost Function for Clustering in a Kernel Feature Space

The Laplacian PDF Distance: A Cost Function for Clustering in a Kernel Feature Space The Laplacian PDF Distance: A Cost Function for Clustering in a Kernel Feature Space Robert Jenssen, Deniz Erdogmus 2, Jose Principe 2, Torbjørn Eltoft Department of Physics, University of Tromsø, Norway

More information

Data Spectroscopy: Learning Mixture Models using Eigenspaces of Convolution Operators

Data Spectroscopy: Learning Mixture Models using Eigenspaces of Convolution Operators Data Spectroscopy: Learning Mixture Models using Eigenspaces of Convolution Operators Tao Shi Department of Statistics, Ohio State University Mikhail Belkin Department of Computer Science and Engineering,

More information

Future Random Matrix Tools for Large Dimensional Signal Processing

Future Random Matrix Tools for Large Dimensional Signal Processing /42 Future Random Matrix Tools for Large Dimensional Signal Processing EUSIPCO 204, Lisbon, Portugal. Abla KAMMOUN and Romain COUILLET 2 King s Abdullah University of Technology and Science, Saudi Arabia

More information

Short Term Memory Quantifications in Input-Driven Linear Dynamical Systems

Short Term Memory Quantifications in Input-Driven Linear Dynamical Systems Short Term Memory Quantifications in Input-Driven Linear Dynamical Systems Peter Tiňo and Ali Rodan School of Computer Science, The University of Birmingham Birmingham B15 2TT, United Kingdom E-mail: {P.Tino,

More information

On corrections of classical multivariate tests for high-dimensional data. Jian-feng. Yao Université de Rennes 1, IRMAR

On corrections of classical multivariate tests for high-dimensional data. Jian-feng. Yao Université de Rennes 1, IRMAR Introduction a two sample problem Marčenko-Pastur distributions and one-sample problems Random Fisher matrices and two-sample problems Testing cova On corrections of classical multivariate tests for high-dimensional

More information

Gaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008

Gaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008 Gaussian processes Chuong B Do (updated by Honglak Lee) November 22, 2008 Many of the classical machine learning algorithms that we talked about during the first half of this course fit the following pattern:

More information

Semi-Supervised Learning in Gigantic Image Collections. Rob Fergus (New York University) Yair Weiss (Hebrew University) Antonio Torralba (MIT)

Semi-Supervised Learning in Gigantic Image Collections. Rob Fergus (New York University) Yair Weiss (Hebrew University) Antonio Torralba (MIT) Semi-Supervised Learning in Gigantic Image Collections Rob Fergus (New York University) Yair Weiss (Hebrew University) Antonio Torralba (MIT) Gigantic Image Collections What does the world look like? High

More information

Random Matrix Theory Lecture 1 Introduction, Ensembles and Basic Laws. Symeon Chatzinotas February 11, 2013 Luxembourg

Random Matrix Theory Lecture 1 Introduction, Ensembles and Basic Laws. Symeon Chatzinotas February 11, 2013 Luxembourg Random Matrix Theory Lecture 1 Introduction, Ensembles and Basic Laws Symeon Chatzinotas February 11, 2013 Luxembourg Outline 1. Random Matrix Theory 1. Definition 2. Applications 3. Asymptotics 2. Ensembles

More information

c 4, < y 2, 1 0, otherwise,

c 4, < y 2, 1 0, otherwise, Fundamentals of Big Data Analytics Univ.-Prof. Dr. rer. nat. Rudolf Mathar Problem. Probability theory: The outcome of an experiment is described by three events A, B and C. The probabilities Pr(A) =,

More information

On corrections of classical multivariate tests for high-dimensional data

On corrections of classical multivariate tests for high-dimensional data On corrections of classical multivariate tests for high-dimensional data Jian-feng Yao with Zhidong Bai, Dandan Jiang, Shurong Zheng Overview Introduction High-dimensional data and new challenge in statistics

More information

A Magiv CV Theory for Large-Margin Classifiers

A Magiv CV Theory for Large-Margin Classifiers A Magiv CV Theory for Large-Margin Classifiers Hui Zou School of Statistics, University of Minnesota June 30, 2018 Joint work with Boxiang Wang Outline 1 Background 2 Magic CV formula 3 Magic support vector

More information

Introduction to Machine Learning Midterm Exam

Introduction to Machine Learning Midterm Exam 10-701 Introduction to Machine Learning Midterm Exam Instructors: Eric Xing, Ziv Bar-Joseph 17 November, 2015 There are 11 questions, for a total of 100 points. This exam is open book, open notes, but

More information

Less is More: Computational Regularization by Subsampling

Less is More: Computational Regularization by Subsampling Less is More: Computational Regularization by Subsampling Lorenzo Rosasco University of Genova - Istituto Italiano di Tecnologia Massachusetts Institute of Technology lcsl.mit.edu joint work with Alessandro

More information

Statistical Pattern Recognition

Statistical Pattern Recognition Statistical Pattern Recognition Feature Extraction Hamid R. Rabiee Jafar Muhammadi, Alireza Ghasemi, Payam Siyari Spring 2014 http://ce.sharif.edu/courses/92-93/2/ce725-2/ Agenda Dimensionality Reduction

More information

A spectral clustering algorithm based on Gram operators

A spectral clustering algorithm based on Gram operators A spectral clustering algorithm based on Gram operators Ilaria Giulini De partement de Mathe matiques et Applications ENS, Paris Joint work with Olivier Catoni 1 july 2015 Clustering task of grouping

More information

Linear Regression with Strongly Correlated Designs Using Ordered Weigthed l 1

Linear Regression with Strongly Correlated Designs Using Ordered Weigthed l 1 Linear Regression with Strongly Correlated Designs Using Ordered Weigthed l 1 ( OWL ) Regularization Mário A. T. Figueiredo Instituto de Telecomunicações and Instituto Superior Técnico, Universidade de

More information

Network Newton. Aryan Mokhtari, Qing Ling and Alejandro Ribeiro. University of Pennsylvania, University of Science and Technology (China)

Network Newton. Aryan Mokhtari, Qing Ling and Alejandro Ribeiro. University of Pennsylvania, University of Science and Technology (China) Network Newton Aryan Mokhtari, Qing Ling and Alejandro Ribeiro University of Pennsylvania, University of Science and Technology (China) aryanm@seas.upenn.edu, qingling@mail.ustc.edu.cn, aribeiro@seas.upenn.edu

More information

A GENERAL FORMULATION FOR SUPPORT VECTOR MACHINES. Wei Chu, S. Sathiya Keerthi, Chong Jin Ong

A GENERAL FORMULATION FOR SUPPORT VECTOR MACHINES. Wei Chu, S. Sathiya Keerthi, Chong Jin Ong A GENERAL FORMULATION FOR SUPPORT VECTOR MACHINES Wei Chu, S. Sathiya Keerthi, Chong Jin Ong Control Division, Department of Mechanical Engineering, National University of Singapore 0 Kent Ridge Crescent,

More information

CSCI-567: Machine Learning (Spring 2019)

CSCI-567: Machine Learning (Spring 2019) CSCI-567: Machine Learning (Spring 2019) Prof. Victor Adamchik U of Southern California Mar. 19, 2019 March 19, 2019 1 / 43 Administration March 19, 2019 2 / 43 Administration TA3 is due this week March

More information

Random Correlation Matrices, Top Eigenvalue with Heavy Tails and Financial Applications

Random Correlation Matrices, Top Eigenvalue with Heavy Tails and Financial Applications Random Correlation Matrices, Top Eigenvalue with Heavy Tails and Financial Applications J.P Bouchaud with: M. Potters, G. Biroli, L. Laloux, M. A. Miceli http://www.cfm.fr Portfolio theory: Basics Portfolio

More information

Random Matrices: Beyond Wigner and Marchenko-Pastur Laws

Random Matrices: Beyond Wigner and Marchenko-Pastur Laws Random Matrices: Beyond Wigner and Marchenko-Pastur Laws Nathan Noiry Modal X, Université Paris Nanterre May 3, 2018 Wigner s Matrices Wishart s Matrices Generalizations Wigner s Matrices ij, (n, i, j

More information

Information-theoretic bounds and phase transitions in clustering, sparse PCA, and submatrix localization

Information-theoretic bounds and phase transitions in clustering, sparse PCA, and submatrix localization Information-theoretic bounds and phase transitions in clustering, sparse PCA, and submatrix localization Jess Banks Cristopher Moore Roman Vershynin Nicolas Verzelen Jiaming Xu Abstract We study the problem

More information

De-biasing the Lasso: Optimal Sample Size for Gaussian Designs

De-biasing the Lasso: Optimal Sample Size for Gaussian Designs De-biasing the Lasso: Optimal Sample Size for Gaussian Designs Adel Javanmard USC Marshall School of Business Data Science and Operations department Based on joint work with Andrea Montanari Oct 2015 Adel

More information

Probabilistic Machine Learning. Industrial AI Lab.

Probabilistic Machine Learning. Industrial AI Lab. Probabilistic Machine Learning Industrial AI Lab. Probabilistic Linear Regression Outline Probabilistic Classification Probabilistic Clustering Probabilistic Dimension Reduction 2 Probabilistic Linear

More information

Manifold Coarse Graining for Online Semi-supervised Learning

Manifold Coarse Graining for Online Semi-supervised Learning for Online Semi-supervised Learning Mehrdad Farajtabar, Amirreza Shaban, Hamid R. Rabiee, Mohammad H. Rohban Digital Media Lab, Department of Computer Engineering, Sharif University of Technology, Tehran,

More information