Random Matrices for Big Data Signal Processing and Machine Learning


Random Matrices for Big Data Signal Processing and Machine Learning (ICASSP 2017, New Orleans) Romain COUILLET and Hafiz TIOMOKO ALI CentraleSupélec, France March, 2017 1 / 153

Outline
- Basics of Random Matrix Theory
  - Motivation: Large Sample Covariance Matrices
  - The Stieltjes Transform Method
  - Spiked Models
  - Other Common Random Matrix Models
- Applications
  - Random Matrices and Robust Estimation
  - Spectral Clustering Methods and Random Matrices
  - Community Detection on Graphs
  - Kernel Spectral Clustering
  - Kernel Spectral Clustering: Subspace Clustering
  - Semi-supervised Learning
  - Support Vector Machines
  - Neural Networks: Extreme Learning Machines
- Perspectives
2 / 153

Basics of Random Matrix Theory/ 3/153 Outline Basics of Random Matrix Theory Motivation: Large Sample Covariance Matrices The Stieltjes Transform Method Spiked Models Other Common Random Matrix Models Applications Random Matrices and Robust Estimation Spectral Clustering Methods and Random Matrices Community Detection on Graphs Kernel Spectral Clustering Kernel Spectral Clustering: Subspace Clustering Semi-supervised Learning Support Vector Machines Neural Networks: Extreme Learning Machines Perspectives 3 / 153

Basics of Random Matrix Theory/Motivation: Large Sample Covariance Matrices 4/153 Outline Basics of Random Matrix Theory Motivation: Large Sample Covariance Matrices The Stieltjes Transform Method Spiked Models Other Common Random Matrix Models Applications Random Matrices and Robust Estimation Spectral Clustering Methods and Random Matrices Community Detection on Graphs Kernel Spectral Clustering Kernel Spectral Clustering: Subspace Clustering Semi-supervised Learning Support Vector Machines Neural Networks: Extreme Learning Machines Perspectives 4 / 153

Basics of Random Matrix Theory/Motivation: Large Sample Covariance Matrices 5/153 Context
Baseline scenario: x_1, ..., x_n ∈ C^N (or R^N) i.i.d. with E[x_1] = 0, E[x_1 x_1^*] = C_N:
If x_1 ~ N(0, C_N), the ML estimator for C_N is the sample covariance matrix (SCM)
\hat{C}_N = \frac{1}{n} \sum_{i=1}^n x_i x_i^*.
If n → ∞, then, by the strong law of large numbers,
\hat{C}_N \overset{a.s.}{\longrightarrow} C_N,
or equivalently, in spectral norm, \| \hat{C}_N - C_N \| \overset{a.s.}{\longrightarrow} 0.
Random Matrix Regime
No longer valid if N, n → ∞ with N/n → c ∈ (0, ∞): \| \hat{C}_N - C_N \| does not converge to 0.
For practical N, n with N ~ n, this leads to dramatically wrong conclusions. Even for N = n/100.
5 / 153
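
A minimal numerical sketch of this last point (not part of the original slides; Gaussian data with C_N = I_N, illustrative dimensions, and a helper scm_error of our own): the spectral-norm error of the SCM stays roughly constant when N and n grow at fixed ratio N/n = 1/100, but vanishes when n grows with N fixed.

import numpy as np

rng = np.random.default_rng(0)

def scm_error(N, n):
    # Spectral-norm error of the SCM of n i.i.d. N(0, I_N) samples
    X = rng.standard_normal((N, n))
    C_hat = X @ X.T / n
    return np.linalg.norm(C_hat - np.eye(N), 2)

# Fixed ratio N/n = 1/100: the error does not vanish as both dimensions grow
for N in (50, 100, 200):
    print("N =", N, " n =", 100 * N, " error =", round(scm_error(N, 100 * N), 3))

# Fixed N, growing n (classical regime): the error does vanish
for n in (10**3, 10**4, 10**5):
    print("N = 50  n =", n, " error =", round(scm_error(50, n), 3))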

Basics of Random Matrix Theory/Motivation: Large Sample Covariance Matrices 6/153 The Large Dimensional Fallacies
Setting: x_i ∈ C^N i.i.d., x_1 ~ CN(0, I_N); assume N = N(n) such that N/n → c > 1.
Then, joint point-wise convergence of the entries:
\max_{1 \le i,j \le N} \left| [\hat{C}_N - I_N]_{ij} \right| = \max_{1 \le i,j \le N} \left| \frac{1}{n} \langle X_{j,\cdot}, X_{i,\cdot} \rangle - \delta_{ij} \right| \overset{a.s.}{\longrightarrow} 0.
However, eigenvalue mismatch:
0 = \lambda_1(\hat{C}_N) = \dots = \lambda_{N-n}(\hat{C}_N) \le \lambda_{N-n+1}(\hat{C}_N) \le \dots \le \lambda_N(\hat{C}_N)
1 = \lambda_1(I_N) = \dots = \lambda_N(I_N)
Hence no convergence in spectral norm.
6 / 153

Basics of Random Matrix Theory/Motivation: Large Sample Covariance Matrices 7/153 The Marčenko–Pastur law
Figure: Histogram of the eigenvalues of \hat{C}_N versus the Marčenko–Pastur law, for N = 500, n = 2000, C_N = I_N.
7 / 153

Basics of Random Matrix Theory/Motivation: Large Sample Covariance Matrices 8/153 The Marčenko–Pastur law
Definition (Empirical Spectral Density). The empirical spectral density (e.s.d.) μ_N of a Hermitian matrix A_N ∈ C^{N×N} is
\mu_N = \frac{1}{N} \sum_{i=1}^N \delta_{\lambda_i(A_N)}.
Theorem (Marčenko–Pastur Law [Marčenko, Pastur 67]). Let X_N ∈ C^{N×n} have i.i.d. zero mean, unit variance entries. As N, n → ∞ with N/n → c ∈ (0, ∞), the e.s.d. μ_N of \frac{1}{n} X_N X_N^* satisfies
\mu_N \overset{a.s.}{\longrightarrow} \mu_c
weakly, where
- \mu_c(\{0\}) = \max\{0, 1 - c^{-1}\},
- on (0, ∞), μ_c has a continuous density f_c supported on [(1-\sqrt{c})^2, (1+\sqrt{c})^2],
f_c(x) = \frac{1}{2\pi c x} \sqrt{(x - (1-\sqrt{c})^2)\,((1+\sqrt{c})^2 - x)}.
8 / 153
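
A quick sanity check of the theorem, as a hedged sketch (Gaussian entries, N = 500, n = 2000 chosen for illustration): the proportion of eigenvalues of (1/n) X_N X_N^* in a few intervals is compared with the corresponding Marčenko–Pastur mass obtained by numerically integrating f_c.

import numpy as np

rng = np.random.default_rng(0)
N, n = 500, 2000                     # c = N/n = 0.25
c = N / n
X = rng.standard_normal((N, n))
eigs = np.linalg.eigvalsh(X @ X.T / n)

# Marcenko-Pastur density on its support [(1-sqrt(c))^2, (1+sqrt(c))^2]
a, b = (1 - np.sqrt(c))**2, (1 + np.sqrt(c))**2
f_c = lambda x: np.sqrt(np.maximum((x - a) * (b - x), 0)) / (2 * np.pi * c * x)

grid = np.linspace(a, b, 10001)
dx = grid[1] - grid[0]
for lo, hi in [(a, 1.0), (1.0, 1.5), (1.5, b)]:
    emp = np.mean((eigs >= lo) & (eigs <= hi))          # empirical proportion
    mp = np.sum(f_c(grid[(grid >= lo) & (grid <= hi)])) * dx   # MP mass (Riemann sum)
    print(f"[{lo:.2f},{hi:.2f}]  empirical {emp:.3f}  MP {mp:.3f}")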

Basics of Random Matrix Theory/Motivation: Large Sample Covariance Matrices 9/153 The Marčenko–Pastur law
Figure: Marčenko–Pastur density f_c for different limit ratios c = lim_N N/n (c = 0.1, 0.2, 0.5).
9 / 153

Basics of Random Matrix Theory/The Stieltjes Transform Method 10/153 Outline Basics of Random Matrix Theory Motivation: Large Sample Covariance Matrices The Stieltjes Transform Method Spiked Models Other Common Random Matrix Models Applications Random Matrices and Robust Estimation Spectral Clustering Methods and Random Matrices Community Detection on Graphs Kernel Spectral Clustering Kernel Spectral Clustering: Subspace Clustering Semi-supervised Learning Support Vector Machines Neural Networks: Extreme Learning Machines Perspectives 10 / 153

Basics of Random Matrix Theory/The Stieltjes Transform Method 11/153 The Stieltjes transform
Definition (Stieltjes Transform). For μ a real probability measure with support supp(μ), the Stieltjes transform m_μ is defined, for z ∈ C \ supp(μ), as
m_\mu(z) = \int \frac{1}{t - z}\, \mu(dt).
Property (Inverse Stieltjes Transform). For a < b continuity points of μ,
\mu([a, b]) = \lim_{\varepsilon \downarrow 0} \frac{1}{\pi} \int_a^b \Im[m_\mu(x + \imath\varepsilon)]\, dx.
Besides, if μ has a density f at x,
f(x) = \lim_{\varepsilon \downarrow 0} \frac{1}{\pi} \Im[m_\mu(x + \imath\varepsilon)].
11 / 153

Basics of Random Matrix Theory/The Stieltjes Transform Method 12/153 The Stieltjes transform
Property (Relation to e.s.d.). If μ is the e.s.d. of a Hermitian matrix A ∈ C^{N×N} (i.e., \mu = \frac{1}{N}\sum_{i=1}^N \delta_{\lambda_i(A)}), then
m_\mu(z) = \frac{1}{N} \operatorname{tr}(A - z I_N)^{-1}.
Proof:
m_\mu(z) = \int \frac{\mu(dt)}{t - z} = \frac{1}{N} \sum_{i=1}^N \frac{1}{\lambda_i(A) - z} = \frac{1}{N} \operatorname{tr}(\operatorname{diag}\{\lambda_i(A)\} - z I_N)^{-1} = \frac{1}{N} \operatorname{tr}(A - z I_N)^{-1}.
12 / 153
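
The two properties above can be illustrated numerically; the sketch below (not in the slides; illustrative sizes, and ε chosen somewhat larger than the typical eigenvalue spacing) computes m_μ(z) = (1/N) tr(A − zI_N)^{-1} from the eigenvalues and recovers an approximate density via (1/π) Im m_μ(x + iε), compared with the Marčenko–Pastur density.

import numpy as np

rng = np.random.default_rng(1)
N, n = 400, 1600
c = N / n
A = rng.standard_normal((N, n))
A = A @ A.T / n                        # Hermitian matrix whose e.s.d. we probe
eigs = np.linalg.eigvalsh(A)

def m_emp(z):
    # m_mu(z) = (1/N) tr (A - z I)^{-1}, computed from the eigenvalues
    return np.mean(1.0 / (eigs - z))

eps = 0.05                             # small, but larger than the eigenvalue spacing ~ 1/N
a, b = (1 - np.sqrt(c))**2, (1 + np.sqrt(c))**2
f_mp = lambda x: np.sqrt((x - a) * (b - x)) / (2 * np.pi * c * x)
for x in (0.5, 1.0, 1.5, 2.0):
    approx = m_emp(x + 1j * eps).imag / np.pi
    print(f"x={x}: (1/pi) Im m(x+i eps) = {approx:.3f}   MP density = {f_mp(x):.3f}")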

Basics of Random Matrix Theory/The Stieltjes Transform Method 13/153 The Stieltjes transform
Property (Stieltjes transform of Gram matrices). For X ∈ C^{N×n}, let μ be the e.s.d. of XX^* and \tilde\mu the e.s.d. of X^*X. Then
m_\mu(z) = \frac{n}{N} m_{\tilde\mu}(z) - \frac{N - n}{N}\,\frac{1}{z}.
Proof:
m_\mu(z) = \frac{1}{N} \sum_{i=1}^N \frac{1}{\lambda_i(XX^*) - z} = \frac{1}{N} \sum_{i=1}^n \frac{1}{\lambda_i(X^*X) - z} + \frac{1}{N}(N - n)\frac{1}{0 - z}.
13 / 153

Basics of Random Matrix Theory/The Stieltjes Transform Method 14/153 The Stieltjes transform
Three fundamental lemmas used in all proofs.
Lemma (Resolvent Identity). For A, B ∈ C^{N×N} invertible,
A^{-1} - B^{-1} = A^{-1}(B - A)B^{-1}.
Corollary. For t ∈ C, x ∈ C^N, A ∈ C^{N×N}, with A and A + t x x^* invertible,
(A + t x x^*)^{-1} x = \frac{A^{-1} x}{1 + t\, x^* A^{-1} x}.
14 / 153

Basics of Random Matrix Theory/The Stieltjes Transform Method 15/153 The Stieltjes transform
Lemma (Rank-one perturbation). For A, B ∈ C^{N×N} Hermitian nonnegative definite, μ the e.s.d. of A, t > 0, x ∈ C^N, z ∈ C \ supp(μ),
\left| \frac{1}{N} \operatorname{tr} B (A + t x x^* - z I_N)^{-1} - \frac{1}{N} \operatorname{tr} B (A - z I_N)^{-1} \right| \le \frac{1}{N}\,\frac{\|B\|}{\operatorname{dist}(z, \operatorname{supp}(\mu))}.
In particular, as N → ∞, if \limsup_N \|B\| < ∞,
\frac{1}{N} \operatorname{tr} B (A + t x x^* - z I_N)^{-1} - \frac{1}{N} \operatorname{tr} B (A - z I_N)^{-1} \to 0.
15 / 153

Basics of Random Matrix Theory/The Stieltjes Transform Method 16/153 The Stieltjes transform
Lemma (Trace Lemma). For x ∈ C^N with i.i.d. entries of zero mean, unit variance and finite moment of order 2p, and A ∈ C^{N×N} deterministic (or independent of x),
E\left[ \left| \frac{1}{N} x^* A x - \frac{1}{N} \operatorname{tr} A \right|^p \right] \le \frac{K \|A\|^p}{N^{p/2}}.
In particular, if \limsup_N \|A\| < ∞ and x has entries with finite eighth-order moment,
\frac{1}{N} x^* A x - \frac{1}{N} \operatorname{tr} A \overset{a.s.}{\longrightarrow} 0
(by Markov's inequality and the Borel–Cantelli lemma).
16 / 153
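
A small numerical illustration of the Trace Lemma (a sketch of ours, with a normalized symmetric Gaussian matrix playing the role of A, drawn independently of x): the gap between the quadratic form and the normalized trace decays roughly like 1/sqrt(N).

import numpy as np

rng = np.random.default_rng(2)
for N in (100, 400, 1600):
    A = rng.standard_normal((N, N))
    A = (A + A.T) / np.sqrt(2 * N)     # bounded spectral norm, independent of x
    x = rng.standard_normal(N)         # i.i.d. entries, zero mean, unit variance
    quad = x @ A @ x / N
    trace = np.trace(A) / N
    print("N =", N, " |quadratic form - trace| =", round(abs(quad - trace), 4))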

Basics of Random Matrix Theory/The Stieltjes Transform Method 17/153 Proof of the Marčenko–Pastur law
Theorem (Marčenko–Pastur Law [Marčenko, Pastur 67]). Let X_N ∈ C^{N×n} have i.i.d. zero mean, unit variance entries. As N, n → ∞ with N/n → c ∈ (0, ∞), the e.s.d. μ_N of \frac{1}{n} X_N X_N^* satisfies \mu_N \overset{a.s.}{\longrightarrow} \mu_c weakly, where
- \mu_c(\{0\}) = \max\{0, 1 - c^{-1}\},
- on (0, ∞), μ_c has a continuous density f_c supported on [(1-\sqrt{c})^2, (1+\sqrt{c})^2],
f_c(x) = \frac{1}{2\pi c x} \sqrt{(x - (1-\sqrt{c})^2)\,((1+\sqrt{c})^2 - x)}.
17 / 153

Basics of Random Matrix Theory/The Stieltjes Transform Method 18/153 Proof of the Marčenko–Pastur law
Stieltjes transform approach.
Proof. With μ_N the e.s.d. of \frac{1}{n} X_N X_N^*,
m_{\mu_N}(z) = \frac{1}{N} \operatorname{tr}\left( \tfrac{1}{n} X_N X_N^* - z I_N \right)^{-1} = \frac{1}{N} \sum_{i=1}^N \left[ \left( \tfrac{1}{n} X_N X_N^* - z I_N \right)^{-1} \right]_{ii}.
Write
X_N = \begin{bmatrix} y^* \\ Y_{N-1} \end{bmatrix} \in C^{N \times n}
so that, for \Im[z] > 0,
\frac{1}{n} X_N X_N^* - z I_N = \begin{pmatrix} \frac{1}{n} y^* y - z & \frac{1}{n} y^* Y_{N-1}^* \\ \frac{1}{n} Y_{N-1} y & \frac{1}{n} Y_{N-1} Y_{N-1}^* - z I_{N-1} \end{pmatrix}.
18 / 153

Basics of Random Matrix Theory/The Stieltjes Transform Method 19/153 Proof of the Marčenko–Pastur law
Proof (continued). From the block matrix inversion formula
\begin{pmatrix} A & B \\ C & D \end{pmatrix}^{-1} = \begin{pmatrix} (A - BD^{-1}C)^{-1} & -A^{-1}B(D - CA^{-1}B)^{-1} \\ -(A - BD^{-1}C)^{-1}CA^{-1} & (D - CA^{-1}B)^{-1} \end{pmatrix},
we have
\left[ \left( \tfrac{1}{n} X_N X_N^* - z I_N \right)^{-1} \right]_{11} = \frac{1}{-z - z \frac{1}{n} y^* \left( \frac{1}{n} Y_{N-1}^* Y_{N-1} - z I_n \right)^{-1} y}.
By the Trace Lemma, as N, n → ∞,
\left[ \left( \tfrac{1}{n} X_N X_N^* - z I_N \right)^{-1} \right]_{11} - \frac{1}{-z - z \frac{1}{n} \operatorname{tr}\left( \frac{1}{n} Y_{N-1}^* Y_{N-1} - z I_n \right)^{-1}} \overset{a.s.}{\longrightarrow} 0.
19 / 153

Basics of Random Matrix Theory/The Stieltjes Transform Method 20/153 Proof of the Marčenko–Pastur law
Proof (continued). By the Rank-1 Perturbation Lemma (X_N^* X_N = Y_{N-1}^* Y_{N-1} + y y^*), as N, n → ∞,
\left[ \left( \tfrac{1}{n} X_N X_N^* - z I_N \right)^{-1} \right]_{11} - \frac{1}{-z - z \frac{1}{n} \operatorname{tr}\left( \frac{1}{n} X_N^* X_N - z I_n \right)^{-1}} \overset{a.s.}{\longrightarrow} 0.
Since
\frac{1}{n} \operatorname{tr}\left( \tfrac{1}{n} X_N^* X_N - z I_n \right)^{-1} = \frac{1}{n} \operatorname{tr}\left( \tfrac{1}{n} X_N X_N^* - z I_N \right)^{-1} - \frac{n - N}{n}\,\frac{1}{z},
we get
\left[ \left( \tfrac{1}{n} X_N X_N^* - z I_N \right)^{-1} \right]_{11} - \frac{1}{1 - \frac{N}{n} - z - z \frac{1}{n} \operatorname{tr}\left( \frac{1}{n} X_N X_N^* - z I_N \right)^{-1}} \overset{a.s.}{\longrightarrow} 0.
Repeating for entries (2,2), ..., (N,N), and averaging, we obtain (for \Im[z] > 0)
m_{\mu_N}(z) - \frac{1}{1 - \frac{N}{n} - z - z \frac{N}{n} m_{\mu_N}(z)} \overset{a.s.}{\longrightarrow} 0.
20 / 153

Basics of Random Matrix Theory/The Stieltjes Transform Method 21/153 Proof of the Marčenko–Pastur law
Proof (continued). Then m_{\mu_N}(z) \overset{a.s.}{\longrightarrow} m(z), the solution to
m(z) = \frac{1}{1 - c - z - c z m(z)},
i.e. (with the branch of the square root such that m(z) → 0 as |z| → ∞)
m(z) = \frac{1 - c}{2 c z} - \frac{1}{2c} + \frac{\sqrt{(z - (1+\sqrt{c})^2)(z - (1-\sqrt{c})^2)}}{2 c z}.
Finally, by the inverse Stieltjes transform, for x > 0,
\lim_{\varepsilon \downarrow 0} \frac{1}{\pi}\Im[m(x + \imath\varepsilon)] = \frac{1}{2\pi c x}\sqrt{((1+\sqrt{c})^2 - x)(x - (1-\sqrt{c})^2)}\; \mathbf{1}_{\{x \in [(1-\sqrt{c})^2, (1+\sqrt{c})^2]\}}.
And for x = 0,
\lim_{\varepsilon \downarrow 0} -\imath\varepsilon\, m(\imath\varepsilon) = \left(1 - c^{-1}\right)\mathbf{1}_{\{c > 1\}}.
21 / 153
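
The closed-form expression for m(z) can be checked against the empirical Stieltjes transform (1/N) tr((1/n) X_N X_N^* − zI_N)^{-1}. The sketch below (ours, with illustrative sizes and test points) selects the branch by solving the equivalent quadratic c z m^2 + (z + c − 1) m + 1 = 0 and keeping the root with Im m > 0 when Im z > 0, or m > 0 when z is real negative.

import numpy as np

rng = np.random.default_rng(3)
N, n = 500, 1500
c = N / n

def m_mp(z):
    # Roots of c*z*m^2 + (z + c - 1)*m + 1 = 0 (the MP self-consistent equation)
    roots = np.roots([c * z, z + c - 1, 1])
    if np.imag(z) > 0:
        return roots[np.imag(roots) > 0][0]    # Stieltjes transforms map C+ to C+
    return roots[np.real(roots) > 0][0]        # for real z < 0, m(z) > 0

X = rng.standard_normal((N, n))
eigs = np.linalg.eigvalsh(X @ X.T / n)
for z in (1.0 + 0.3j, 3.0 + 0.1j, -0.5 + 0.0j):
    emp = np.mean(1.0 / (eigs - z))            # (1/N) tr of the resolvent
    print(f"z = {z}: empirical {emp:.4f}   closed form {m_mp(z):.4f}")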

Basics of Random Matrix Theory/The Stieltjes Transform Method 22/153 Sample Covariance Matrices
Theorem (Sample Covariance Matrix Model [Silverstein, Bai 95]). Let Y_N = C_N^{1/2} X_N ∈ C^{N×n}, with
- C_N ∈ C^{N×N} nonnegative definite with e.s.d. ν_N → ν weakly,
- X_N ∈ C^{N×n} with i.i.d. entries of zero mean and unit variance.
As N, n → ∞ with N/n → c ∈ (0, ∞), the e.s.d. μ_N of \frac{1}{n} Y_N^* Y_N ∈ C^{n×n} satisfies μ_N → μ a.s. (weakly), with m_μ(z), \Im[z] > 0, the unique solution with \Im[m_μ(z)] > 0 of
m_\mu(z) = \left( -z + c \int \frac{t}{1 + t\, m_\mu(z)}\, \nu(dt) \right)^{-1}.
Moreover, μ is continuous on R_+ and real analytic wherever positive.
Immediate corollary: for \bar\mu_N the e.s.d. of \frac{1}{n} Y_N Y_N^* = \frac{1}{n}\sum_{i=1}^n C_N^{1/2} x_i x_i^* C_N^{1/2},
\bar\mu_N \overset{a.s.}{\longrightarrow} \bar\mu
weakly, the two limits being related by \mu = c\,\bar\mu + (1 - c)\,\delta_0.
22 / 153
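
The fixed-point equation of the theorem can be solved numerically for a given ν. The sketch below (ours; a damped fixed-point iteration whose convergence is not formally guaranteed, with ν put on evenly weighted masses at 1, 3, 7 and illustrative sizes) compares the solution with the empirical co-resolvent trace (1/n) tr((1/n) Y_N^* Y_N − zI_n)^{-1}.

import numpy as np

rng = np.random.default_rng(4)
N, n = 150, 1500
c = N / n
t_vals = np.array([1.0, 3.0, 7.0])            # population spectrum (masses of nu)
nu = np.ones(3) / 3

def m_sb(z, iters=2000):
    # Damped fixed-point iteration for m(z) = ( -z + c * int t/(1 + t m) nu(dt) )^{-1}
    m = -1.0 / z
    for _ in range(iters):
        m_new = 1.0 / (-z + c * np.sum(nu * t_vals / (1 + t_vals * m)))
        m = 0.5 * m + 0.5 * m_new
    return m

C_sqrt = np.diag(np.sqrt(np.repeat(t_vals, N // 3)))
Y = C_sqrt @ rng.standard_normal((N, n))
eigs = np.linalg.eigvalsh(Y.T @ Y / n)        # n x n Gram matrix, as in the theorem
for z in (2.0 + 0.5j, 5.0 + 0.5j):
    emp = np.mean(1.0 / (eigs - z))
    print(f"z = {z}: empirical {emp:.4f}   fixed point {m_sb(z):.4f}")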

Basics of Random Matrix Theory/The Stieltjes Transform Method 23/153 Sample Covariance Matrices
Figure: Histogram of the eigenvalues of \frac{1}{n} Y_N Y_N^* versus the limiting density f, n = 3000, N = 300, with C_N diagonal with evenly weighted masses at (i) 1, 3, 7 and (ii) 1, 3, 4.
23 / 153

Basics of Random Matrix Theory/The Stieltjes Transform Method 24/153 Further Models and Deterministic Equivalents
Theorem (Doubly-correlated i.i.d. matrices). Let B_N = C_N^{1/2} X_N T_N X_N^* C_N^{1/2}, with e.s.d. μ_N, where
- X_N ∈ C^{N×n} has i.i.d. entries of zero mean and variance 1/n,
- C_N is Hermitian nonnegative definite, T_N is diagonal nonnegative,
- \limsup_N \max(\|C_N\|, \|T_N\|) < ∞.
Denote c = N/n. Then, as N, n → ∞ with bounded ratio c, for z ∈ C \ R,
m_{\mu_N}(z) - m_N(z) \overset{a.s.}{\longrightarrow} 0, \qquad m_N(z) = \frac{1}{N}\operatorname{tr}\left( -z I_N + \bar{e}_N(z) C_N \right)^{-1},
with \bar{e}_N(z) the unique solution in \{z ∈ C^+, \bar{e}_N(z) ∈ C^+\} or \{z ∈ R_-, \bar{e}_N(z) ∈ R_+\} of the system
e_N(z) = \frac{1}{N}\operatorname{tr} C_N \left( -z I_N + \bar{e}_N(z) C_N \right)^{-1}
\bar{e}_N(z) = \frac{1}{n}\operatorname{tr} T_N \left( I_n + c\, e_N(z) T_N \right)^{-1}.
24 / 153

Basics of Random Matrix Theory/The Stieltjes Transform Method 25/153 Other Refined Sample Covariance Models
Side note on other models. Similar results hold for several other matrix models:
- Information-plus-noise: Y_N = A_N + X_N, A_N deterministic
- Variance profile: Y_N = P_N ⊙ X_N (entry-wise product)
- Per-column covariance: Y_N = [y_1, ..., y_n], y_i = C_{N,i}^{1/2} x_i
- etc.
25 / 153

Basics of Random Matrix Theory/Spiked Models 26/153 Outline Basics of Random Matrix Theory Motivation: Large Sample Covariance Matrices The Stieltjes Transform Method Spiked Models Other Common Random Matrix Models Applications Random Matrices and Robust Estimation Spectral Clustering Methods and Random Matrices Community Detection on Graphs Kernel Spectral Clustering Kernel Spectral Clustering: Subspace Clustering Semi-supervised Learning Support Vector Machines Neural Networks: Extreme Learning Machines Perspectives 26 / 153

Basics of Random Matrix Theory/Spiked Models 27/153 No Eigenvalue Outside the Support
Theorem (No Eigenvalue Outside the Support [Silverstein, Bai 98]). Let Y_N = C_N^{1/2} X_N ∈ C^{N×n}, with
- C_N ∈ C^{N×N} nonnegative definite with e.s.d. ν_N → ν weakly,
- X_N ∈ C^{N×n} with i.i.d. entries of zero mean and unit variance, E[|X_{N,ij}|^4] < ∞,
- \max_i \operatorname{dist}(\lambda_i(C_N), \operatorname{supp}(\nu)) → 0.
Let \bar\mu be the limiting e.s.d. of \frac{1}{n} Y_N Y_N^* as before, and let [a, b] ⊂ R \ supp(\bar\mu). Then, for all large n, almost surely,
\left\{ \lambda_i\left( \tfrac{1}{n} Y_N Y_N^* \right) \right\}_{i=1}^N \cap [a, b] = \emptyset.
In practice: this means that eigenvalues of \frac{1}{n} Y_N Y_N^* cannot be found at macroscopic distance from the bulk, for N, n large.
27 / 153

Basics of Random Matrix Theory/Spiked Models 28/153 Spiked Models
Breaking the rules. If we break Rule 1 (the finite fourth-moment assumption): infinitely many eigenvalues may wander away from supp(μ).
Figure: eigenvalues \{\lambda_i\}_{i=1}^n versus μ, for E[|X_{ij}|^4] < ∞ (left) and E[|X_{ij}|^4] = ∞ (right).
28 / 153

Basics of Random Matrix Theory/Spiked Models 29/153 Spiked Models
If we break Rule 2: C_N may create isolated eigenvalues of \frac{1}{n} Y_N Y_N^*, called spikes.
Figure: Eigenvalues \{\lambda_i\}_{i=1}^n of \frac{1}{n} Y_N Y_N^* versus μ, with C_N = \operatorname{diag}(1, \dots, 1, 2, 2, 3, 3) (N − 4 ones), N = 500, n = 1500; the isolated eigenvalues appear near 1 + \omega_1 + c\frac{1 + \omega_1}{\omega_1} and 1 + \omega_2 + c\frac{1 + \omega_2}{\omega_2}.
29 / 153

Basics of Random Matrix Theory/Spiked Models 30/153 Spiked Models
Theorem (Eigenvalues [Baik, Silverstein 06]). Let Y_N = C_N^{1/2} X_N, with
- X_N with i.i.d. zero mean, unit variance entries, E[|X_{N,ij}|^4] < ∞,
- C_N = I_N + P, P = U \Omega U^*, where, for K fixed, \Omega = \operatorname{diag}(\omega_1, \dots, \omega_K) ∈ R^{K×K}, with \omega_1 \ge \dots \ge \omega_K > 0.
Then, as N, n → ∞ with N/n → c ∈ (0, ∞), denoting \lambda_m = \lambda_m(\frac{1}{n} Y_N Y_N^*) (eigenvalues in decreasing order),
- if \omega_m > \sqrt{c}: \lambda_m \overset{a.s.}{\longrightarrow} 1 + \omega_m + c\,\frac{1 + \omega_m}{\omega_m} > (1 + \sqrt{c})^2,
- if \omega_m ∈ (0, \sqrt{c}]: \lambda_m \overset{a.s.}{\longrightarrow} (1 + \sqrt{c})^2.
30 / 153
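
A hedged numerical check of this phase transition (ours; real Gaussian entries and a single spike along e_1 are used purely for illustration): a spike ω above sqrt(c) produces an isolated largest eigenvalue near 1 + ω + c(1 + ω)/ω, while a spike below sqrt(c) leaves the largest eigenvalue at the bulk edge (1 + sqrt(c))^2.

import numpy as np

rng = np.random.default_rng(5)
N, n = 500, 1500
c = N / n                                     # c = 1/3, sqrt(c) ~ 0.577
for omega in (2.0, 0.3):                      # one spike above, one below the threshold
    C = np.eye(N)
    C[0, 0] = 1 + omega                       # C_N = I_N + omega * e_1 e_1^T
    Y = np.sqrt(np.diag(C))[:, None] * rng.standard_normal((N, n))
    lam1 = np.linalg.eigvalsh(Y @ Y.T / n)[-1]
    if omega > np.sqrt(c):
        predicted = 1 + omega + c * (1 + omega) / omega
    else:
        predicted = (1 + np.sqrt(c))**2
    print(f"omega={omega}: largest eigenvalue {lam1:.3f}, predicted limit {predicted:.3f}")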

Basics of Random Matrix Theory/Spiked Models 31/153 Spiked Models
Proof. Two ingredients: algebraic calculus + trace lemma.
Find eigenvalues away from the eigenvalues of \frac{1}{n} X_N X_N^*:
0 = \det\left( \tfrac{1}{n} Y_N Y_N^* - \lambda I_N \right)
  = \det(C_N)\, \det\left( \tfrac{1}{n} X_N X_N^* - \lambda C_N^{-1} \right)
  = \det(C_N)\, \det\left( \tfrac{1}{n} X_N X_N^* - \lambda I_N + \lambda(I_N - C_N^{-1}) \right)
  = \det(C_N)\, \det\left( \tfrac{1}{n} X_N X_N^* - \lambda I_N \right) \det\left( I_N + \lambda (I_N - C_N^{-1}) \left( \tfrac{1}{n} X_N X_N^* - \lambda I_N \right)^{-1} \right).
Use the low-rank property:
I_N - C_N^{-1} = I_N - (I_N + U \Omega U^*)^{-1} = U (I_K + \Omega^{-1})^{-1} U^*, \qquad \Omega ∈ C^{K×K}.
Hence
0 = \det\left( \tfrac{1}{n} X_N X_N^* - \lambda I_N \right) \det\left( I_N + \lambda\, U (I_K + \Omega^{-1})^{-1} U^* \left( \tfrac{1}{n} X_N X_N^* - \lambda I_N \right)^{-1} \right).
31 / 153

Basics of Random Matrix Theory/Spiked Models 32/153 Spiked Models
Proof (2). By Sylvester's identity (\det(I + AB) = \det(I + BA)),
0 = \det\left( \tfrac{1}{n} X_N X_N^* - \lambda I_N \right) \det\left( I_K + \lambda (I_K + \Omega^{-1})^{-1} U^* \left( \tfrac{1}{n} X_N X_N^* - \lambda I_N \right)^{-1} U \right).
No eigenvalue outside the support [Bai, Sil 98]: \det(\frac{1}{n} X_N X_N^* - \lambda I_N) has no zero beyond (1 + \sqrt{c})^2 for all large n a.s.
Extension of the Trace Lemma: for each z ∈ C \ supp(μ),
U^* \left( \tfrac{1}{n} X_N X_N^* - z I_N \right)^{-1} U \overset{a.s.}{\longrightarrow} m_\mu(z)\, I_K.
(X_N being almost unitarily invariant, U can be seen as formed of random i.i.d.-like vectors.)
As a result, for all large n a.s.,
0 = \det\left( I_K + \lambda (I_K + \Omega^{-1})^{-1} U^* \left( \tfrac{1}{n} X_N X_N^* - \lambda I_N \right)^{-1} U \right) \simeq \prod_{m=1}^{M} \left( 1 + \frac{\lambda}{1 + \omega_m^{-1}}\, m_\mu(\lambda) \right)^{k_m} = \prod_{m=1}^{M} \left( 1 + \frac{\lambda \omega_m\, m_\mu(\lambda)}{1 + \omega_m} \right)^{k_m}
(with k_m the multiplicity of the m-th distinct value ω_m, M distinct values).
32 / 153

Basics of Random Matrix Theory/Spiked Models 33/153 Spiked Models
Proof (3). Limiting solutions: the zeros (with multiplicities) of
1 + \frac{\lambda \omega_m}{1 + \omega_m}\, m_\mu(\lambda) = 0.
Using the Marčenko–Pastur law properties (m_\mu(z) = (1 - c - z - c z m_\mu(z))^{-1}), this gives
\lambda \in \left\{ 1 + \omega_m + c\,\frac{1 + \omega_m}{\omega_m} \right\}_{m=1}^{M}.
33 / 153

Basics of Random Matrix Theory/Spiked Models 34/153 Spiked Models
Theorem (Eigenvectors [Paul 07]). Let Y_N = C_N^{1/2} X_N, with
- X_N with i.i.d. zero mean, unit variance, finite fourth-order moment entries,
- C_N = I_N + P, P = \sum_{i=1}^K \omega_i u_i u_i^*, \omega_1 > \dots > \omega_K > 0.
Then, as N, n → ∞ with N/n → c ∈ (0, ∞), for a, b ∈ C^N deterministic and \hat{u}_i the eigenvector associated with \lambda_i(\frac{1}{n} Y_N Y_N^*),
a^* \hat{u}_i \hat{u}_i^* b - \frac{1 - c \omega_i^{-2}}{1 + c \omega_i^{-1}}\, a^* u_i u_i^* b \; \mathbf{1}_{\{\omega_i > \sqrt{c}\}} \overset{a.s.}{\longrightarrow} 0.
In particular,
|\hat{u}_i^* u_i|^2 \overset{a.s.}{\longrightarrow} \frac{1 - c \omega_i^{-2}}{1 + c \omega_i^{-1}}\, \mathbf{1}_{\{\omega_i > \sqrt{c}\}}.
Proof: based on a Cauchy integral + ingredients similar to the eigenvalue proof:
a^* \hat{u}_i \hat{u}_i^* b = -\frac{1}{2\pi\imath} \oint_{\mathcal{C}_i} a^* \left( \tfrac{1}{n} Y_N Y_N^* - z I_N \right)^{-1} b\, dz
for \mathcal{C}_i a contour circling around λ_i only.
34 / 153
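
The eigenvector alignment can be checked the same way (an illustrative sketch of ours: real Gaussian entries, a single spike omega_1 = 2 along e_1): the squared projection of the top sample eigenvector onto u_1 is compared with (1 − c/omega_1^2)/(1 + c/omega_1).

import numpy as np

rng = np.random.default_rng(6)
N, n = 500, 1500
c = N / n
omega = 2.0
u1 = np.zeros(N); u1[0] = 1.0                 # population spike direction u_1 = e_1
C_sqrt = np.eye(N); C_sqrt[0, 0] = np.sqrt(1 + omega)
Y = C_sqrt @ rng.standard_normal((N, n))
_, V = np.linalg.eigh(Y @ Y.T / n)
u_hat = V[:, -1]                              # eigenvector of the largest eigenvalue
alignment = (u_hat @ u1) ** 2
predicted = (1 - c / omega**2) / (1 + c / omega) if omega > np.sqrt(c) else 0.0
print(f"|u_hat^* u_1|^2 = {alignment:.3f}, predicted {predicted:.3f}")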

Basics of Random Matrix Theory/Spiked Models 35/153 Spiked Models
Figure: Simulated versus limiting |\hat{u}_1^* u_1|^2 for Y_N = C_N^{1/2} X_N, C_N = I_N + \omega_1 u_1 u_1^*, N/n = 1/3, N ∈ {100, 200, 400}, varying \omega_1; the limit is \frac{1 - c/\omega_1^2}{1 + c/\omega_1}.
35 / 153

Basics of Random Matrix Theory/Spiked Models 36/153 Tracy–Widom Theorem
Theorem (Phase Transition [Baik, Ben Arous, Péché 05]). Let Y_N = C_N^{1/2} X_N, with
- X_N with i.i.d. complex Gaussian zero mean, unit variance entries,
- C_N = I_N + P, P = \sum_{i=1}^K \omega_i u_i u_i^*, \omega_1 > \dots > \omega_K > 0 (K ≥ 0).
Then, as N, n → ∞ with N/n → c < 1:
- If \omega_1 < \sqrt{c} (or K = 0),
N^{\frac{2}{3}}\, \frac{\lambda_1 - (1 + \sqrt{c})^2}{(1 + \sqrt{c})^{\frac{4}{3}} c^{\frac{1}{2}}} \overset{\mathcal{L}}{\longrightarrow} T_2 \quad \text{(complex Tracy–Widom law)}.
- If \omega_1 > \sqrt{c},
\left( (1 + \omega_1)^2 - c\,\frac{(1 + \omega_1)^2}{\omega_1^2} \right)^{-\frac{1}{2}} \sqrt{N} \left[ \lambda_1 - \left( 1 + \omega_1 + c\,\frac{1 + \omega_1}{\omega_1} \right) \right] \overset{\mathcal{L}}{\longrightarrow} \mathcal{N}(0, 1).
36 / 153

Basics of Random Matrix Theory/Spiked Models 37/153 Tracy–Widom Theorem
Figure: Distribution of the centered and scaled largest eigenvalue N^{\frac{2}{3}} c^{-\frac{1}{2}} (1 + \sqrt{c})^{-\frac{4}{3}} \left[ \lambda_1(\frac{1}{n} X_N X_N^*) - (1 + \sqrt{c})^2 \right] versus the Tracy–Widom law (T_2), N = 500, n = 1500.
37 / 153

Basics of Random Matrix Theory/Spiked Models 38/153 Other Spiked Models
Similar results hold for multiple matrix models:
- Additive spiked model: Y_N = \frac{1}{n} X X^* + P, with P deterministic and low rank
- Y_N = \frac{1}{n} X^* (I + P) X
- Y_N = \frac{1}{n} (X + P)^* (X + P)
- Y_N = \frac{1}{n} T X^* (I + P) X T
- etc.
38 / 153

Basics of Random Matrix Theory/Other Common Random Matrix Models 39/153 Outline Basics of Random Matrix Theory Motivation: Large Sample Covariance Matrices The Stieltjes Transform Method Spiked Models Other Common Random Matrix Models Applications Random Matrices and Robust Estimation Spectral Clustering Methods and Random Matrices Community Detection on Graphs Kernel Spectral Clustering Kernel Spectral Clustering: Subspace Clustering Semi-supervised Learning Support Vector Machines Neural Networks: Extreme Learning Machines Perspectives 39 / 153

Basics of Random Matrix Theory/Other Common Random Matrix Models 40/153 The Semi-circle law
Theorem. Let X_N ∈ C^{N×N} be Hermitian with e.s.d. μ_N, such that the entries [\sqrt{N} X_N]_{i>j} are i.i.d. with zero mean and unit variance. Then, as N → ∞,
\mu_N \overset{a.s.}{\longrightarrow} \mu, \qquad \mu(dt) = \frac{1}{2\pi}\sqrt{(4 - t^2)^+}\, dt.
In particular, m_μ satisfies
m_\mu(z) = \frac{1}{-z - m_\mu(z)}.
40 / 153
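
A short numerical check (ours; real symmetric Gaussian entries, N = 1000 for illustration): the proportion of eigenvalues in a few intervals is compared with the semi-circle mass.

import numpy as np

rng = np.random.default_rng(7)
N = 1000
W = rng.standard_normal((N, N))
X = (W + W.T) / np.sqrt(2 * N)                # Hermitian, off-diagonal variance 1/N
eigs = np.linalg.eigvalsh(X)

semi = lambda t: np.sqrt(np.maximum(4 - t**2, 0)) / (2 * np.pi)
grid = np.linspace(-2, 2, 4001)
dx = grid[1] - grid[0]
for a, b in [(-2, -1), (-1, 1), (1, 2)]:
    emp = np.mean((eigs >= a) & (eigs <= b))
    mass = np.sum(semi(grid[(grid >= a) & (grid <= b)])) * dx
    print(f"[{a},{b}]  empirical {emp:.3f}  semicircle {mass:.3f}")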

Basics of Random Matrix Theory/Other Common Random Matrix Models 41/153 The Semi-circle law
Figure: Histogram of the eigenvalues of a Wigner matrix versus the semi-circle law, for N = 500.
41 / 153

Basics of Random Matrix Theory/Other Common Random Matrix Models 42/153 The Circular law
Theorem. Let X_N ∈ C^{N×N} with e.s.d. μ_N be such that the entries [\sqrt{N} X_N]_{ij} are i.i.d. with zero mean and unit variance. Then, as N → ∞,
\mu_N \overset{a.s.}{\longrightarrow} \mu,
with μ the uniform measure on the complex unit disc, \mu(dz) = \frac{1}{\pi}\mathbf{1}_{\{|z| \le 1\}}\, dz.
42 / 153
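
A quick check of the circular law (ours; real Gaussian entries normalized by sqrt(N), illustrative size): under the limiting uniform law on the unit disc, the fraction of eigenvalues with |lambda| <= r should approach r^2.

import numpy as np

rng = np.random.default_rng(8)
N = 1000
X = rng.standard_normal((N, N)) / np.sqrt(N)
eigs = np.linalg.eigvals(X)                   # complex eigenvalues
for r in (0.5, 0.8, 1.0):
    print(f"r = {r}: empirical {np.mean(np.abs(eigs) <= r):.3f}   circular law {r**2:.3f}")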

Basics of Random Matrix Theory/Other Common Random Matrix Models 43/153 The Circular law
Figure: Eigenvalues (real and imaginary parts) of X_N with i.i.d. standard Gaussian entries versus the circular law, for N = 500.
43 / 153

Basics of Random Matrix Theory/Other Common Random Matrix Models 44/153 Bibliographical references: Maths Book and Tutorial References
From most accessible to least:
- Couillet, R., & Debbah, M. (2011). Random matrix methods for wireless communications. Cambridge University Press.
- Tao, T. (2012). Topics in random matrix theory (Vol. 132). Providence, RI: American Mathematical Society.
- Bai, Z., & Silverstein, J. W. (2010). Spectral analysis of large dimensional random matrices (Vol. 20). New York: Springer.
- Pastur, L. A., & Shcherbina, M. (2011). Eigenvalue distribution of large random matrices (Vol. 171). Providence, RI: American Mathematical Society.
- Anderson, G. W., Guionnet, A., & Zeitouni, O. (2010). An introduction to random matrices (Vol. 118). Cambridge University Press.
44 / 153

Applications/ 45/153 Outline Basics of Random Matrix Theory Motivation: Large Sample Covariance Matrices The Stieltjes Transform Method Spiked Models Other Common Random Matrix Models Applications Random Matrices and Robust Estimation Spectral Clustering Methods and Random Matrices Community Detection on Graphs Kernel Spectral Clustering Kernel Spectral Clustering: Subspace Clustering Semi-supervised Learning Support Vector Machines Neural Networks: Extreme Learning Machines Perspectives 45 / 153

Applications/Random Matrices and Robust Estimation 46/153 Outline Basics of Random Matrix Theory Motivation: Large Sample Covariance Matrices The Stieltjes Transform Method Spiked Models Other Common Random Matrix Models Applications Random Matrices and Robust Estimation Spectral Clustering Methods and Random Matrices Community Detection on Graphs Kernel Spectral Clustering Kernel Spectral Clustering: Subspace Clustering Semi-supervised Learning Support Vector Machines Neural Networks: Extreme Learning Machines Perspectives 46 / 153

Applications/Random Matrices and Robust Estimation 47/153 Context
Baseline scenario: x_1, ..., x_n ∈ C^N (or R^N) i.i.d. with E[x_1] = 0, E[x_1 x_1^*] = C_N:
- If x_1 ~ N(0, C_N), the ML estimator for C_N is the sample covariance matrix (SCM)
\hat{C}_N = \frac{1}{n} \sum_{i=1}^n x_i x_i^*.
- [Huber 67] If x_1 ~ (1 - ε) N(0, C_N) + ε G, G unknown, robust estimator (n > N):
\hat{C}_N = \frac{1}{n} \sum_{i=1}^n \frac{l_2}{\max\left( l_1,\, \frac{1}{N} x_i^* \hat{C}_N^{-1} x_i \right)}\, x_i x_i^* \quad \text{for some } l_1, l_2 > 0.
- [Maronna 76] If x_1 is elliptical (and n > N), the ML estimator for C_N is given by
\hat{C}_N = \frac{1}{n} \sum_{i=1}^n u\left( \frac{1}{N} x_i^* \hat{C}_N^{-1} x_i \right) x_i x_i^* \quad \text{for some non-increasing } u.
- [Pascal 13; Chen 11] If N > n, or x_1 is elliptical or contains outliers, shrinkage extensions:
\hat{C}_N(\rho) = (1 - \rho)\, \frac{1}{n} \sum_{i=1}^n \frac{x_i x_i^*}{\frac{1}{N} x_i^* \hat{C}_N^{-1}(\rho) x_i} + \rho I_N
\check{C}_N(\rho) = \frac{\check{B}_N(\rho)}{\frac{1}{N}\operatorname{tr}\check{B}_N(\rho)}, \qquad \check{B}_N(\rho) = (1 - \rho)\, \frac{1}{n} \sum_{i=1}^n \frac{x_i x_i^*}{\frac{1}{N} x_i^* \check{C}_N^{-1}(\rho) x_i} + \rho I_N.
47 / 153

Applications/Random Matrices and Robust Estimation 48/153 Context
Results are only known for N fixed and n → ∞: not appropriate in the settings of interest today (BigData, array processing, MIMO).
We study such \hat{C}_N in the regime N, n → ∞, N/n → c ∈ (0, ∞).
Math interest:
- limiting eigenvalue distribution of \hat{C}_N
- limiting values and fluctuations of functionals f(\hat{C}_N)
Application interest:
- comparison between SCM and robust estimators
- performance of robust/non-robust estimation methods
- improvement thereof (by proper parametrization)
48 / 153

Applications/Random Matrices and Robust Estimation 49/153 Model Description
Definition (Maronna's Estimator). For x_1, ..., x_n ∈ C^N with n > N, \hat{C}_N is the solution (upon existence and uniqueness) of
\hat{C}_N = \frac{1}{n} \sum_{i=1}^n u\left( \frac{1}{N} x_i^* \hat{C}_N^{-1} x_i \right) x_i x_i^*
where u : [0, ∞) → (0, ∞) is
- non-increasing,
- such that \phi(x) \triangleq x\, u(x) is increasing with supremum \phi_\infty satisfying 1 < \phi_\infty < c^{-1}, c ∈ (0, 1).
49 / 153
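
A minimal sketch of how such an estimator can be computed in practice (an assumption-laden illustration of ours, not part of the slides): u(t) = (1 + alpha)/(alpha + t) is one common choice satisfying the conditions above when 1 + alpha < c^{-1}, the data follow an illustrative elliptical model with impulsive tau_i, and the fixed point is obtained by plain iteration. The robust estimate keeps a bounded spectrum while the SCM's largest eigenvalues are inflated by the tau_i.

import numpy as np

rng = np.random.default_rng(9)
N, n = 200, 1000
c = N / n
alpha = 0.2
u = lambda t: (1 + alpha) / (alpha + t)       # phi_inf = 1 + alpha, with 1 < 1 + alpha < 1/c

# Illustrative elliptical samples: x_i = sqrt(tau_i) * C^{1/2} w_i, heavy-tailed tau_i
C = np.diag(np.concatenate([np.ones(N // 2), 3 * np.ones(N // 2)]))
tau = rng.gamma(0.5, 2.0, size=n)
X = np.sqrt(C) @ rng.standard_normal((N, n)) * np.sqrt(tau)

# Fixed-point iteration for Maronna's estimator
C_hat = np.eye(N)
for _ in range(50):
    d = np.einsum('ij,ji->i', X.T @ np.linalg.inv(C_hat), X) / N   # (1/N) x_i^* C_hat^{-1} x_i
    C_hat = (X * u(d)) @ X.T / n
print("top eigenvalues of C_hat:", np.round(np.sort(np.linalg.eigvalsh(C_hat))[-3:], 3))
print("top eigenvalues of SCM:  ", np.round(np.sort(np.linalg.eigvalsh(X @ X.T / n))[-3:], 3))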

Applications/Random Matrices and Robust Estimation 50/153 The Results in a Nutshell
For various models of the x_i's:
- First-order convergence: \| \hat{C}_N - \hat{S}_N \| \overset{a.s.}{\longrightarrow} 0 for some tractable random matrices \hat{S}_N. We only discuss this result here.
- Second-order results: N^{1-\varepsilon}\left( a^* \hat{C}_N^k b - a^* \hat{S}_N^k b \right) \overset{a.s.}{\longrightarrow} 0, allowing transfer of CLT results.
- Applications: improved robust covariance matrix estimation; improved robust tests/estimators; specific examples in statistics at large, array processing, statistical finance, etc.
50 / 153

Applications/Random Matrices and Robust Estimation 51/153 (Elliptical) scenario
Theorem (Large dimensional behavior, elliptical case). For x_i = \sqrt{\tau_i}\, w_i, with \tau_i impulsive (random or not) and w_i unitarily invariant with \|w_i\|^2 = N,
\| \hat{C}_N - \hat{S}_N \| \overset{a.s.}{\longrightarrow} 0
with, for some v related to u (v = u \circ g^{-1}, g(x) = x(1 - c\phi(x))^{-1}),
\hat{C}_N \triangleq \frac{1}{n} \sum_{i=1}^n u\left( \frac{1}{N} x_i^* \hat{C}_N^{-1} x_i \right) x_i x_i^*, \qquad \hat{S}_N \triangleq \frac{1}{n} \sum_{i=1}^n v(\tau_i \gamma_N)\, x_i x_i^*,
and \gamma_N the unique solution of
1 = \frac{1}{n} \sum_{j=1}^n \frac{\tau_j \gamma\, v(\tau_j \gamma)}{1 + c\, \tau_j \gamma\, v(\tau_j \gamma)}.
Corollaries:
- Spectral measure: \mu_N^{\hat{C}_N} - \mu_N^{\hat{S}_N} \overset{\mathcal{L}}{\longrightarrow} 0 a.s. (with \mu_N^X \triangleq \frac{1}{N}\sum_{i=1}^N \delta_{\lambda_i(X)}).
- Local convergence: \max_{1 \le i \le N} | \lambda_i(\hat{C}_N) - \lambda_i(\hat{S}_N) | \overset{a.s.}{\longrightarrow} 0.
- Norm boundedness: \limsup_N \|\hat{C}_N\| < ∞, i.e. bounded spectrum (unlike the SCM!).
51 / 153

Applications/Random Matrices and Robust Estimation 52/153 Large dimensional behavior
Figure: Histogram of the eigenvalues of \hat{C}_N and of \hat{S}_N versus the approximate (limiting) density; n = 2500, N = 500, C_N = \operatorname{diag}(I_{125}, 3 I_{125}, 10 I_{250}), \tau_i \sim \Gamma(0.5, 2) i.i.d.
52 / 153

Applications/Random Matrices and Robust Estimation 53/153 Elements of Proof
Definition (v and ψ). Letting g(x) = x(1 - c\phi(x))^{-1} (on R_+), define
- v(x) \triangleq (u \circ g^{-1})(x), non-increasing,
- \psi(x) \triangleq x\, v(x), increasing and bounded by \psi_\infty.
Lemma (Rewriting \hat{C}_N). It holds (with C_N = I_N) that
\hat{C}_N = \frac{1}{n} \sum_{i=1}^n \tau_i v(\tau_i d_i)\, w_i w_i^*
with (d_1, ..., d_n) ∈ R_+^n the a.s. unique solution to
d_i = \frac{1}{N} w_i^* \hat{C}_{(i)}^{-1} w_i = \frac{1}{N} w_i^* \left( \frac{1}{n} \sum_{j \ne i} \tau_j v(\tau_j d_j)\, w_j w_j^* \right)^{-1} w_i, \qquad i = 1, \dots, n.
53 / 153

Applications/Random Matrices and Robust Estimation 54/153 Elements of Proof
Remark (Quadratic Form close to Trace). Random matrix insight: \left( \frac{1}{n} \sum_{j \ne i} \tau_j v(\tau_j d_j)\, w_j w_j^* \right)^{-1} is almost independent of w_i, so
d_i = \frac{1}{N} w_i^* \left( \frac{1}{n} \sum_{j \ne i} \tau_j v(\tau_j d_j)\, w_j w_j^* \right)^{-1} w_i \simeq \frac{1}{N} \operatorname{tr}\left( \frac{1}{n} \sum_{j \ne i} \tau_j v(\tau_j d_j)\, w_j w_j^* \right)^{-1} \simeq \gamma_N
for some deterministic sequence (\gamma_N)_{N=1}^\infty, irrespective of i.
Lemma (Key Lemma). Letting e_i \triangleq \frac{v(\tau_i d_i)}{v(\tau_i \gamma_N)}, with \gamma_N the unique solution to
1 = \frac{1}{n} \sum_{k=1}^n \frac{\psi(\tau_k \gamma_N)}{1 + c\, \psi(\tau_k \gamma_N)},
we have \max_{1 \le i \le n} |e_i - 1| \overset{a.s.}{\longrightarrow} 0.
54 / 153

Applications/Random Matrices and Robust Estimation 55/153 Proof of the Key Lemma: \max_i |e_i - 1| \overset{a.s.}{\to} 0, e_i = \frac{v(\tau_i d_i)}{v(\tau_i \gamma_N)}
Property (Quadratic form and γ_N).
\max_{1 \le i \le n} \left| \frac{1}{N} w_i^* \left( \frac{1}{n} \sum_{j \ne i} \tau_j v(\tau_j \gamma_N)\, w_j w_j^* \right)^{-1} w_i - \gamma_N \right| \overset{a.s.}{\longrightarrow} 0.
Proof of the Property. Uniformity is easy (moments of all orders for [w_i]_j). By a quadratic-form-close-to-trace approach, we get
\max_{1 \le i \le n} \left| \frac{1}{N} w_i^* \left( \frac{1}{n} \sum_{j \ne i} \tau_j v(\tau_j \gamma_N)\, w_j w_j^* \right)^{-1} w_i - m(0) \right| \overset{a.s.}{\longrightarrow} 0
with m(0) the unique positive solution to [MarPas 67; BaiSil 95]
m(0) = \left( \frac{1}{n} \sum_{i=1}^n \frac{\tau_i v(\tau_i \gamma_N)}{1 + c\, \tau_i v(\tau_i \gamma_N)\, m(0)} \right)^{-1}.
\gamma_N precisely solves this equation, thus m(0) = \gamma_N.
55 / 153

Applications/Random Matrices and Robust Estimation 56/153 Proof of the Key Lemma: \max_i |e_i - 1| \overset{a.s.}{\to} 0, e_i = \frac{v(\tau_i d_i)}{v(\tau_i \gamma_N)}
Substitution Trick (case \tau_i ∈ [a, b] ⊂ (0, ∞)). Up to relabelling so that e_1 \le \dots \le e_n, use
v(\tau_n \gamma_N)\, e_n = v(\tau_n d_n) = v\left( \tau_n \frac{1}{N} w_n^* \left( \frac{1}{n} \sum_{i < n} \underbrace{\tau_i v(\tau_i d_i)}_{= \tau_i v(\tau_i \gamma_N) e_i}\, w_i w_i^* \right)^{-1} w_n \right)
\le v\left( \tau_n e_n^{-1} \frac{1}{N} w_n^* \left( \frac{1}{n} \sum_{i < n} \tau_i v(\tau_i \gamma_N)\, w_i w_i^* \right)^{-1} w_n \right)
\le v\left( \tau_n e_n^{-1} (\gamma_N - \varepsilon_n) \right) \quad \text{a.s., } \varepsilon_n \to 0 \text{ (slowly)}.
Use the properties of ψ to get
\psi(\tau_n \gamma_N) \le \psi\left( \tau_n e_n^{-1} \gamma_N \right) \left( 1 - \varepsilon_n \gamma_N^{-1} \right)^{-1}.
Conclusion: if e_n > 1 + l infinitely often, since \tau_n ∈ [a, b], on a subsequence \{\tau_n \to \tau_0 > 0, \gamma_N \to \gamma_0 > 0\},
\psi(\tau_0 \gamma_0) \le \psi\left( \frac{\tau_0 \gamma_0}{1 + l} \right),
a contradiction.
56 / 153

Applications/Random Matrices and Robust Estimation 57/153 Outlier Data
Theorem (Outlier Rejection). Observation set
X = \left[ x_1, \dots, x_{(1-\varepsilon_n)n}, a_1, \dots, a_{\varepsilon_n n} \right]
where x_i \sim \mathcal{CN}(0, C_N) and a_1, \dots, a_{\varepsilon_n n} ∈ C^N are deterministic outliers. Then \| \hat{C}_N - \hat{S}_N \| \overset{a.s.}{\longrightarrow} 0, where
\hat{S}_N \triangleq v(\gamma_N)\, \frac{1}{n} \sum_{i=1}^{(1-\varepsilon_n)n} x_i x_i^* + \frac{1}{n} \sum_{i=1}^{\varepsilon_n n} v(\alpha_{i,n})\, a_i a_i^*
with \gamma_N and \alpha_{1,n}, \dots, \alpha_{\varepsilon_n n, n} the unique positive solutions to
\gamma_N = \frac{1}{N} \operatorname{tr} C_N \left( \frac{(1-\varepsilon) v(\gamma_N)}{1 + c\, v(\gamma_N)\gamma_N} C_N + \frac{1}{n} \sum_{i=1}^{\varepsilon_n n} v(\alpha_{i,n})\, a_i a_i^* \right)^{-1}
\alpha_{i,n} = \frac{1}{N} a_i^* \left( \frac{(1-\varepsilon) v(\gamma_N)}{1 + c\, v(\gamma_N)\gamma_N} C_N + \frac{1}{n} \sum_{j \ne i} v(\alpha_{j,n})\, a_j a_j^* \right)^{-1} a_i, \qquad i = 1, \dots, \varepsilon_n n.
57 / 153

Outlier Data

For ε_n n = 1 (a single outlier),
  Ŝ_N = v( φ^{-1}(1)/(1−c) ) (1/n) Σ_{i=1}^{n−1} x_i x_i^* + [ v( φ^{-1}(1)/(1−c) · (1/N) a_1^* C_N^{-1} a_1 ) + o(1) ] a_1 a_1^*
⇒ Outlier rejection relies on (1/N) a_1^* C_N^{-1} a_1 ≷ 1.

For a_i ~ CN(0, D_N), ε_n → ε ≥ 0,
  Ŝ_N = v(γ_n) (1/n) Σ_{i=1}^{(1−ε_n)n} x_i x_i^* + v(α_n) (1/n) Σ_{i=1}^{ε_n n} a_i a_i^*
  γ_n = (1/N) tr C_N ( (1−ε) v(γ_n)/(1 + c v(γ_n) γ_n) · C_N + ε v(α_n)/(1 + c v(α_n) α_n) · D_N )^{-1}
  α_n = (1/N) tr D_N ( (1−ε) v(γ_n)/(1 + c v(γ_n) γ_n) · C_N + ε v(α_n)/(1 + c v(α_n) α_n) · D_N )^{-1}.

For ε_n → 0,
  Ŝ_N = v( φ^{-1}(1)/(1−c) ) (1/n) Σ_{i=1}^{(1−ε_n)n} x_i x_i^* + (1/n) Σ_{i=1}^{ε_n n} v( φ^{-1}(1)/(1−c) · (1/N) tr D_N C_N^{-1} ) a_i a_i^*
⇒ Outlier rejection relies on (1/N) tr D_N C_N^{-1} ≷ 1.

Outlier Data

[Figure: Limiting eigenvalue distributions — deterministic equivalent, (1/n) Σ_{i=1}^{(1−ε_n)n} x_i x_i^* (clean data), (1/n) XX^* or per-input normalized, and Ĉ_N. Parameters: [C_N]_ij = .9^|i−j|, D_N = I_N, ε = .05.]

Example of Application to Statistical Finance

Robust matrix-optimized portfolio allocation.

[Figure: Annualized standard deviation over time t (N = 45, n = 300) for portfolios built from Ĉ_ST, Ĉ_P, Ĉ_C, Ĉ_C2, Ĉ_LW, Ĉ_R.]

Outline
- Basics of Random Matrix Theory
  - Motivation: Large Sample Covariance Matrices
  - The Stieltjes Transform Method
  - Spiked Models
  - Other Common Random Matrix Models
- Applications
  - Random Matrices and Robust Estimation
  - Spectral Clustering Methods and Random Matrices
  - Community Detection on Graphs
  - Kernel Spectral Clustering
  - Kernel Spectral Clustering: Subspace Clustering
  - Semi-supervised Learning
  - Support Vector Machines
  - Neural Networks: Extreme Learning Machines
- Perspectives

Reminder on Spectral Clustering Methods

Context: two-step classification of n objects based on a similarity matrix A ∈ R^{n×n}:
1. extraction of the eigenvectors U = [u_1, ..., u_l] associated with the dominant eigenvalues;
2. classification of the rows U_{1,·}, ..., U_{n,·} ∈ R^l using k-means/EM.

[Illustration: eigenvalue spectrum with isolated spikes; the corresponding eigenvectors (in practice, shuffled!).]

Reminder on Spectral Clustering Methods

[Illustration: entries of eigenvectors 1 and 2; their joint l-dimensional representation (shuffling no longer matters!); EM or k-means clustering of the resulting points.]

The Random Matrix Approach

A two-step method:
1. If A_n is not a standard random matrix, retrieve Ã_n such that ||A_n − Ã_n|| → 0 a.s. in operator norm as n → ∞.
   This transfers crucial properties from A_n to Ã_n:
   - limiting eigenvalue distribution
   - spikes
   - eigenvectors of isolated eigenvalues.
2. From Ã_n, perform a spiked model analysis:
   - exhibit the phase transition phenomenon
   - read the content of the isolated eigenvectors of Ã_n.

The Random Matrix Approach

The Spike Analysis: for the noisy plateau-looking isolated eigenvectors u_1, ..., u_l of Ã_n, write
  u_i = Σ_{a=1}^k [ α_i^a j_a/√n_a + σ_i^a w_i^a ]
with j_a ∈ R^n the canonical vector of class C_a and w_i^a noise orthogonal to j_a, and evaluate
  α_i^a = (1/√n_a) u_i^T j_a,   (σ_i^a)^2 = || u_i − α_i^a j_a/√n_a ||^2.

⇒ This can be done using complex analysis calculus, e.g.,
  (α_i^a)^2 = (1/n_a) j_a^T u_i u_i^T j_a = −(1/(2πı)) ∮_{γ_a} (1/n_a) j_a^T (Ã_n − z I_n)^{-1} j_a dz.
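A small synthetic illustration of this decomposition (a sketch under assumptions, not from the slides): a Wigner-type noise matrix plus a rank-one class "spike" is built, and the class alignments α^a and the within-class residual energy of the dominant eigenvector are read off directly.

```python
import numpy as np

rng = np.random.default_rng(2)
n, sizes = 600, [300, 300]                      # two classes
labels = np.repeat([0, 1], sizes)
J = np.eye(2)[labels]                           # J = [j_1, j_2], class indicator vectors

# Wigner-like noise plus a rank-one "plateau" signal aligned with the classes
A = rng.standard_normal((n, n)) / np.sqrt(n)
A = (A + A.T) / np.sqrt(2)
sig = (J[:, 0] / np.sqrt(sizes[0]) - J[:, 1] / np.sqrt(sizes[1])) / np.sqrt(2)
A += 3.0 * np.outer(sig, sig)                   # isolated eigenvalue beyond the bulk

vals, vecs = np.linalg.eigh(A)
u = vecs[:, -1]                                 # dominant (plateau-looking) eigenvector
for a in range(2):
    ja = J[:, a]
    alpha = ja @ u / np.sqrt(sizes[a])          # alpha^a = u^T j_a / sqrt(n_a)
    resid = np.sum((u * ja - alpha * ja / np.sqrt(sizes[a])) ** 2)
    print(f"class {a}: alpha = {alpha:+.3f}, within-class residual energy = {resid:.3f}")
```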

Outline — now turning to Applications / Community Detection on Graphs.

System Setting

Assume an n-node, m-edge undirected graph G, with:
- intrinsic average connectivities q_1, ..., q_n ~ µ, i.i.d.
- k classes C_1, ..., C_k, independent of {q_i}, of (large) sizes n_1, ..., n_k, with preferential attachment C_ab between C_a and C_b
- induced edge probability for nodes i ∈ C_a, j ∈ C_b:  P(i ~ j) = q_i q_j C_ab
- adjacency matrix A with A_ij ~ Bernoulli(q_i q_j C_ab).
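A minimal sampling sketch of this graph model (not from the slides; the class sizes, the two-mass law µ and the attachment coefficients C_ab below are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 2000, 3
sizes = [600, 600, 800]
labels = np.repeat(np.arange(k), sizes)

# intrinsic connectivities q_i drawn i.i.d. from a two-mass law mu
q = rng.choice([0.1, 0.5], size=n, p=[0.75, 0.25])

# preferential attachment coefficients C_ab (illustrative values)
Cmat = np.full((k, k), 1.0) + 0.4 * np.eye(k)

# P(i ~ j) = q_i q_j C_ab and symmetric Bernoulli adjacency matrix (no self-loops)
P = np.outer(q, q) * Cmat[labels][:, labels]
upper = np.triu(rng.random((n, n)) < P, k=1)
A = (upper + upper.T).astype(float)
print("average degree:", A.sum(axis=1).mean())
```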

Objective

Study of spectral methods:
- standard methods based on the adjacency A, the modularity A − dd^T/(2m), the normalized adjacency D^{-1} A D^{-1}, etc. (adapted to dense networks)
- refined methods based on the Bethe Hessian (r^2 − 1) I_n − r A + D (adapted to sparse networks!)

Improvement for realistic graphs:
- observation of the failure of the standard methods above
- improvement by new methods.

Limitations of the Adjacency/Modularity Approach

Scenario: 3 classes with µ bi-modal (e.g., µ = (3/4) δ_{0.1} + (1/4) δ_{0.5}).

[Figure: two dominant eigenvectors for the modularity matrix and for the Bethe Hessian.]

⇒ Leading eigenvectors of A (or of the modularity A − dd^T/(2m)) are biased by the q_i distribution. Similar behavior for the Bethe Hessian.

Regularized Modularity Approach

Connectivity model: P(i ~ j) = q_i q_j C_ab for i ∈ C_a, j ∈ C_b.

Dense regime assumptions: non-trivial regime when, as n → ∞,
  C_ab = 1 + M_ab/√n,  M_ab = O(1).
⇒ Community information is weak but highly REDUNDANT!

Considered matrix: for α ∈ [0, 1] (and with D = diag(A 1_n) = diag(d) the degree matrix and m = (1/2) d^T 1_n the number of edges),
  L_α = (2m)^α (1/√n) D^{-α} [ A − dd^T/(2m) ] D^{-α}.

Our results in a nutshell:
- we find the optimal α_opt achieving the best phase transition
- we find a consistent estimator α̂_opt from A alone
- we claim the optimal eigenvector regularization is D^{α−1} u, with u an eigenvector of L_α.
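The matrix L_α is straightforward to form from the adjacency matrix alone; below is a short sketch (the Erdős–Rényi-like test graph and the choice α = 1/2 are only there to exercise the code):

```python
import numpy as np

def L_alpha(A, alpha):
    """L_alpha = (2m)^alpha / sqrt(n) * D^{-alpha} (A - d d^T / (2m)) D^{-alpha}."""
    n = A.shape[0]
    d = A.sum(axis=1)
    two_m = d.sum()
    Dp = np.diag(d ** (-alpha))
    return (two_m ** alpha / np.sqrt(n)) * Dp @ (A - np.outer(d, d) / two_m) @ Dp

rng = np.random.default_rng(4)
n = 500
A = np.triu(rng.random((n, n)) < 0.1, k=1)
A = (A + A.T).astype(float)

L = L_alpha(A, alpha=0.5)
vals, vecs = np.linalg.eigh(L)
d = A.sum(axis=1)
u_reg = d ** (0.5 - 1.0) * vecs[:, -1]      # regularized eigenvector D^{alpha-1} u
print("top eigenvalues of L_alpha:", vals[-3:])
```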

Asymptotic Equivalence

Theorem (Limiting Random Matrix Equivalent)
For each α ∈ [0, 1], as n → ∞, ||L_α − L̃_α|| → 0 almost surely, where
  L_α = (2m)^α (1/√n) D^{-α} [ A − dd^T/(2m) ] D^{-α}
  L̃_α = (1/√n) D_q^{-α} X D_q^{-α} + U Λ U^T
with D_q = diag({q_i}), X a zero-mean random matrix,
  U = [ D_q^{1−α} J/√n ,  D_q^{-α} X 1_n ]   (rank k + 1),
  Λ = [ (I_k − 1_k c^T) M (I_k − c 1_k^T)   −1_k ]
      [             −1_k^T                    0  ]
and J = [j_1, ..., j_k], j_a = [0, ..., 0, 1_{n_a}^T, 0, ..., 0]^T ∈ R^n the canonical vector of class C_a.

Consequences:
- isolated eigenvalues beyond a phase transition (λ(M) larger than the spectrum edge)
- optimal choice α_opt of α from the study of the noise spectrum
- eigenvectors correlated to D_q^{1−α} J ⇒ natural regularization by D^{α−1}!

Eigenvalue Spectrum

[Figure: Eigenvalues of L_1 (histogram), limiting law, and isolated spikes; the support edges S_α^± are marked. Parameters: K = 3, n = 2000, c_1 = 0.3, c_2 = 0.3, c_3 = 0.4, µ = (1/2) δ_{q(1)} + (1/2) δ_{q(2)}, q(1) = 0.4, q(2) = 0.9, M with M_ii = 12, M_ij = 4 for i ≠ j.]

Phase Transition

Theorem (Phase Transition)
For α ∈ [0, 1], L_α has an isolated eigenvalue λ_i(L_α) if λ_i(M̄) > τ_α, where
  M̄ = (D(c) − cc^T) M,
  τ_α = lim_{x↓S_α^+} 1/e_2^α(x), the phase transition threshold,
with [S_α^−, S_α^+] the limiting eigenvalue support of L_α and e_1^α(x), e_2^α(x) (for x > S_α^+) the solutions of
  e_1^α(x) = ∫ q^{1−2α} / ( x − [ q^{1−2α} e_1^α(x) + q^{2−2α} e_2^α(x) ] ) µ(dq)
  e_2^α(x) = ∫ q^{2−2α} / ( x − [ q^{1−2α} e_1^α(x) + q^{2−2α} e_2^α(x) ] ) µ(dq).
In this case, 1/e_2^α(λ_i(L_α)) = λ_i(M̄).

- Clustering is still possible when λ_i(M̄) = (min_α τ_α) + ε. Optimal α = α_opt:
    α_opt = argmin_{α ∈ [0,1]} τ_α.
- From max_i | d_i/√(d^T 1_n) − q_i | → 0 a.s., we obtain a consistent estimator α̂_opt of α_opt.
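The consistency of the degree-based estimate of the intrinsic connectivities (the ingredient behind α̂_opt) can be illustrated numerically; the sketch below uses the same dense-regime model with arbitrary parameter values and compares q̂_i = d_i/√(d^T 1_n) to q_i.

```python
import numpy as np

rng = np.random.default_rng(5)
n, k = 4000, 3
labels = rng.integers(0, k, size=n)
q = rng.choice([0.1, 0.5], size=n, p=[0.75, 0.25])
Cmat = 1 + 10 * np.eye(k) / np.sqrt(n)            # dense regime: C_ab = 1 + M_ab/sqrt(n)

P = np.outer(q, q) * Cmat[labels][:, labels]
upper = np.triu(rng.random((n, n)) < P, k=1)
A = (upper + upper.T).astype(float)

d = A.sum(axis=1)
q_hat = d / np.sqrt(d.sum())                      # q_hat_i = d_i / sqrt(d^T 1_n)
print("max_i |q_hat_i - q_i| =", np.abs(q_hat - q).max())
```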

Simulated Performance Results (2 masses of q_i)

[Figure: Two dominant eigenvectors (x–y axes) for the modularity matrix, the Bethe Hessian, the algorithm with α = 1, and the algorithm with α_opt. Parameters: n = 2000, K = 3, µ = (3/4) δ_{q(1)} + (1/4) δ_{q(2)}, q(1) = 0.1, q(2) = 0.5, c_1 = c_2 = 1/4, c_3 = 1/2, M = 100 I_3.]

Simulated Performance Results (2 masses for q_i)

[Figure: Largest eigenvalue λ of L_α, normalized by S_α^+ (y-axis), as a function of the largest eigenvalue l of (D(c) − cc^T)M (x-axis), for α ∈ {0, 1/4, 1/2, 3/4, 1, α_opt}. Parameters: µ = (3/4) δ_{q(1)} + (1/4) δ_{q(2)} with q(1) = 0.1 and q(2) = 0.5; here α_opt = 0.07. Circles indicate the phase transition. Beyond the phase transition, l = 1/e_2^α(λ).]

Simulated Performance Results (2 masses for q_i)

[Figure: Overlap performance for α = 0, α = 1/2, α = 1, α = α_opt, and the Bethe Hessian, with the phase transition marked. Parameters: n = 3000, K = 3, c_i = 1/3, µ = (3/4) δ_{q(1)} + (1/4) δ_{q(2)} with q(1) = 0.1 and q(2) = 0.5, M proportional to I_3 with the factor ranging over [5, 50]; here α_opt = 0.07.]

Simulated Performance Results (2 masses for q_i)

[Figure: Overlap performance versus q(2) (with q(1) = 0.1), for α = 0, α = 0.5, α = 1, α = α_opt, and the Bethe Hessian. Parameters: n = 3000, K = 3, µ = (3/4) δ_{q(1)} + (1/4) δ_{q(2)} with q(2) ∈ [0.1, 0.9], M = 10(2 I_3 − 1_3 1_3^T), c_i = 1/3.]

Theoretical Performance

Analysis of the eigenvectors reveals:
- eigenvectors are noisy staircase vectors
- conjectured Gaussian fluctuations of the eigenvector entries
- for q_i = q_0 (homogeneous case), the same variance for all entries
- in the non-homogeneous case, we can compute the average variance per class
⇒ Heuristic asymptotic performance upper-bound using EM.

Theoretical Performance Results (uniform distribution for q_i)

[Figure: Theoretical probability of correct recovery (correct classification rate) for α ∈ {1, 0.75, 0.5, 0.25, 0, α_opt}. Parameters: n = 2000, K = 2, c_1 = 0.6, c_2 = 0.4, µ uniform on [0.2, 0.8], M proportional to I_2 with the factor ranging over [0, 20].]

Some Takeaway Messages

Main findings:
- Degree heterogeneity breaks the community structure in the eigenvectors; this is compensated by the D^{α−1} normalization of the eigenvectors.
- The classical debate over the best normalization of the adjacency (or modularity) matrix A is not trivial to settle. With heterogeneous degrees, we found a good on-line method.
- Simulations support good performance even in rather sparse settings.

But strong limitations:
- Key assumption: C_ab = 1 + M_ab/√n. Everything collapses in a different regime.
- Simulations on small networks in fact give arbitrary, unreliable results.

Outline — now turning to Applications / Kernel Spectral Clustering.

Kernel Spectral Clustering

Problem statement:
- Dataset x_1, ..., x_n ∈ R^p.
- Objective: cluster the data into k similarity classes S_1, ..., S_k.

Typical metric to optimize:
  (RatioCut)  argmin_{S_1 ∪ ... ∪ S_k = {1,...,n}} Σ_{i=1}^k Σ_{j ∈ S_i, j' ∉ S_i} κ(x_j, x_{j'}) / |S_i|
for some similarity kernel κ(x, y) ≥ 0 (large if x is similar to y).

Can be shown equivalent to
  (RatioCut)  argmin_{M ∈ M} tr M^T (D − K) M
where M = { M ∈ R^{n×k} ; M_ij ∈ {0, |S_j|^{-1/2}} } (in particular, M^T M = I_k) and
  K = {κ(x_i, x_j)}_{i,j=1}^n,   D_ii = Σ_{j=1}^n K_ij.

⇒ But this is an integer problem! Usually NP-complete.

Kernel Spectral Clustering

Towards kernel spectral clustering: discrete-to-continuous relaxations of such metrics,
  (RatioCut)  argmin_{M, M^T M = I_k} tr M^T (D − K) M,
i.e., an eigenvector problem:
1. find the eigenvectors associated with the smallest eigenvalues;
2. retrieve the classes from the eigenvector components.

Refinements:
- working on K, D − K, I_n − D^{-1} K, I_n − D^{-1/2} K D^{-1/2}, etc.
- several-step algorithms: Ng–Jordan–Weiss, Shi–Malik, etc.
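For concreteness, a bare-bones sketch of the relaxed RatioCut pipeline on a toy two-cloud dataset (Gaussian kernel, eigenvectors of D − K with smallest eigenvalues, then a tiny k-means on their rows); everything below is an illustrative assumption rather than a specific algorithm from the slides.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 200
X = np.vstack([rng.normal(0, 1, (n // 2, 2)), rng.normal(5, 1, (n // 2, 2))])

# Gaussian similarity kernel and degree matrix
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq / 2)
D = np.diag(K.sum(axis=1))

# relaxed RatioCut: eigenvectors of D - K associated with the smallest eigenvalues
vals, vecs = np.linalg.eigh(D - K)
U = vecs[:, :2]                                  # rows give the l-dimensional representation

# tiny k-means (k = 2) on the rows of U
centers = U[[0, -1]]
for _ in range(20):
    assign = np.argmin(((U[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
    centers = np.array([U[assign == c].mean(axis=0) for c in range(2)])
print("cluster sizes:", np.bincount(assign))
```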

Kernel Spectral Clustering

[Figure: Leading four eigenvectors of D^{-1/2} K D^{-1/2} for MNIST data.]

Methodology and Objectives

Current state:
- Algorithms derived from ad-hoc procedures (e.g., relaxation).
- Little understanding of performance, even for Gaussian mixtures!
- Let alone when both p and n are large (BigData setting).

Objectives and roadmap:
- Develop a mathematical analysis framework for BigData kernel spectral clustering (p, n → ∞).
- Understand:
  1. phase transition effects (i.e., when is clustering possible?)
  2. the content of each eigenvector
  3. the influence of the kernel function
  4. the performance comparison of clustering algorithms.

Methodology:
- Use statistical assumptions (Gaussian mixture).
- Benefit from doubly-infinite independence and random matrix tools.

Model and Assumptions

Gaussian mixture model: x_1, ..., x_n ∈ R^p, k classes C_1, ..., C_k,
  x_1, ..., x_{n_1} ∈ C_1,  ...,  x_{n−n_k+1}, ..., x_n ∈ C_k,   C_a = { x | x ~ N(µ_a, C_a) }.
Then, for x_i ∈ C_a, x_i = µ_a + w_i with w_i ~ N(0, C_a).

Assumption (Convergence Rate): as n → ∞,
1. Data scaling: p/n → c_0 ∈ (0, ∞).
2. Class scaling: n_a/n → c_a ∈ (0, 1).
3. Mean scaling: with µ° = Σ_{a=1}^k (n_a/n) µ_a and µ_a° = µ_a − µ°, then ||µ_a°|| = O(1).
4. Covariance scaling: with C° = Σ_{a=1}^k (n_a/n) C_a and C_a° = C_a − C°, then ||C_a|| = O(1), (1/√p) tr C_a° = O(1), and tr C_a° C_b° = O(p).
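A data generator matching these growth rates (a sketch; the specific numbers are arbitrary): two classes in dimension p, means O(1) apart, and covariances C_a = scale_a · I_p chosen so that ||C_a|| = O(1) while (1/√p) tr C_a° = O(1).

```python
import numpy as np

rng = np.random.default_rng(7)
p, n, k = 400, 800, 2                 # p/n -> c_0 = 1/2
n_a = [n // 2, n // 2]                # n_a/n -> c_a = 1/2

mu = np.zeros((k, p)); mu[1, 0] = 4.0              # ||mu_1 - mu_2|| = O(1)
scales = [1.0, 1.0 + 3.0 / np.sqrt(p)]             # C_a = scales[a] * I_p

# p x n data matrix, class a occupying its own block of columns
X = np.hstack([mu[a][:, None] + np.sqrt(scales[a]) * rng.standard_normal((p, n_a[a]))
               for a in range(k)])
print("data matrix shape:", X.shape)
```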

Model and Assumptions

Kernel matrix of interest:
  K = { f( (1/p) ||x_i − x_j||^2 ) }_{i,j=1}^n
for some sufficiently smooth nonnegative f.

We study the normalized Laplacian:
  L = n D^{-1/2} K D^{-1/2},   with d = K 1_n, D = diag(d).
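Building K and L is a couple of lines; the sketch below (not from the slides) uses the Gaussian kernel f(t) = exp(−t/2) on a toy two-class Gaussian mixture and checks that the top eigenvalue of L equals n, with eigenvector D^{1/2} 1_n.

```python
import numpy as np

rng = np.random.default_rng(8)
p, n = 400, 800
mu = np.zeros((2, p)); mu[1, 0] = 4.0
X = np.hstack([mu[a][:, None] + rng.standard_normal((p, n // 2)) for a in range(2)])

f = lambda t: np.exp(-t / 2)                          # kernel function
G = X.T @ X
nrm = np.diag(G)
K = f((nrm[:, None] + nrm[None, :] - 2 * G) / p)      # K_ij = f(||x_i - x_j||^2 / p)

d = K @ np.ones(n)
Dm = np.diag(d ** -0.5)
L = n * Dm @ K @ Dm                                   # L = n D^{-1/2} K D^{-1/2}
vals = np.linalg.eigvalsh(L)
print("largest eigenvalue (should be n =", n, "):", vals[-1])
```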

Model and Assumptions

Difficulty: L is a very intractable random matrix
- non-linear f
- non-trivial dependence between the entries of L.

Strategy:
1. Find a random equivalent L̂ (i.e., ||L − L̂|| → 0 a.s. as n, p → ∞) based on:
   - concentration: K_ij → constant as n, p → ∞ (for all i ≠ j)
   - Taylor expansion around the limit point.
2. Apply the spiked random matrix approach to study:
   - the existence of isolated eigenvalues in L̂: phase transition
   - the eigenvector projections onto the canonical class basis.

Random Matrix Equivalent

Results on K:
Key remark: under our assumptions, uniformly over i ≠ j in {1, ..., n},
  (1/p) ||x_i − x_j||^2 → τ a.s.
for some common limit τ.

⇒ Large dimensional approximation for K:
  K = f(τ) 1_n 1_n^T  [O(n)]  +  √n A_1  [low rank, O(√n)]  +  A_2  [informative terms, O(1)]
⇒ difficult to handle (3 orders to manipulate!).

Observation: spectrum of L = n D^{-1/2} K D^{-1/2}:
- dominant eigenvalue n with eigenvector D^{1/2} 1_n
- all other eigenvalues of order O(1).

⇒ Naturally leads to studying the projected normalized Laplacian (or "modularity"-type Laplacian):
  L = n D^{-1/2} K D^{-1/2} − n D^{1/2} 1_n 1_n^T D^{1/2} / (1_n^T D 1_n) = n D^{-1/2} ( K − dd^T/(1_n^T d) ) D^{-1/2},
where D^{1/2} 1_n / √(1_n^T D 1_n) is the dominant (normalized) eigenvector being projected out.
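The concentration behind the key remark is easy to visualize numerically (a sketch with a single class and C = I_p, so that the common limit is τ = 2): the maximal deviation of (1/p)||x_i − x_j||^2 from τ over all pairs shrinks as p grows.

```python
import numpy as np

rng = np.random.default_rng(9)
tau = 2.0                                            # limit of (1/p)||x_i - x_j||^2 for C = I_p
for p in [100, 400, 1600]:
    n = p
    X = rng.standard_normal((p, n))                  # single class, mu = 0, C = I_p
    G = X.T @ X
    nrm = np.diag(G)
    sq = (nrm[:, None] + nrm[None, :] - 2 * G) / p   # (1/p) ||x_i - x_j||^2
    off = sq[~np.eye(n, dtype=bool)]
    print(f"p = {p:5d}: max deviation from tau = {np.abs(off - tau).max():.3f}")
```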

Random Matrix Equivalent

Theorem (Random Matrix Equivalent)
As n, p → ∞, in operator norm, ||L − L̂|| → 0 a.s., where
  L̂ = −2 (f'(τ)/f(τ)) [ (1/p) P W^T W P + U B U^T ] + α(τ) I_n
and τ = (2/p) tr C°, W = [w_1, ..., w_n] ∈ R^{p×n} (x_i = µ_a + w_i), P = I_n − (1/n) 1_n 1_n^T,
  U = [ (1/√p) J, Φ, ψ ] ∈ R^{n×(2k+4)},
and B ∈ R^{(2k+4)×(2k+4)} is a symmetric block matrix whose upper-left k×k block is
  B_11 = M^T M + ( 5f'(τ)/(8f(τ)) − f''(τ)/(2f'(τ)) ) t t^T − ( f''(τ)/f'(τ) ) T + (p/n) ( f(τ) α(τ)/(2f'(τ)) ) 1_k 1_k^T ∈ R^{k×k},
its other non-zero blocks involving the factor 5f'(τ)/(8f(τ)) − f''(τ)/(2f'(τ)) and the centered vector (I_k − 1_k c^T) t, the remaining blocks (0_{k×k}, 0_{k×1}, ...) being zero.

Notation:
- J = [j_1, ..., j_k] ∈ R^{n×k}, with j_a the canonical (indicator) vector of class C_a.
- M = [µ_1°, ..., µ_k°] ∈ R^{p×k}, with µ_a° = µ_a − Σ_{b=1}^k (n_b/n) µ_b.
- t = [ (1/√p) tr C_1°, ..., (1/√p) tr C_k° ]^T ∈ R^k, with C_a° = C_a − Σ_{b=1}^k (n_b/n) C_b.
- T = { (1/p) tr C_a° C_b° }_{a,b=1}^k ∈ R^{k×k}.

Random Matrix Equivalent

Some consequences:
- L̂ is a spiked model: U B U^T is seen as a low-rank perturbation of (1/p) P W^T W P.
- If f'(τ) = 0, L is asymptotically deterministic! ⇒ only t and T can be discriminated upon.
- If f''(τ) = 0 (e.g., f(x) = x), T is unused.
- If 5f'(τ)/(8f(τ)) = f''(τ)/(2f'(τ)), t is (seemingly) unused.

Further analysis:
- determine the separability condition for the eigenvalues
- evaluate the eigenvalue positions when separable
- evaluate the eigenvector projections onto the canonical basis j_1, ..., j_k
- evaluate the fluctuations of the eigenvectors.

Isolated Eigenvalues: Gaussian Inputs

[Figure: Eigenvalues of L and of L̂, k = 3, p = 2048, n = 512, c_1 = c_2 = 1/4, c_3 = 1/2, [µ_a]_j = 4δ_{aj}, C_a = (1 + 2(a−1)/√p) I_p, f(x) = exp(−x/2).]

Theoretical Findings versus MNIST

[Figure: Eigenvalues of L (red) and of L̂ as if the data followed the equivalent Gaussian model (white), MNIST data, p = 784, n = 192.]

Theoretical Findings versus MNIST

[Figure: Leading four eigenvectors of D^{-1/2} K D^{-1/2} for MNIST data (red) and theoretical findings (blue).]