On the Spectrum of Random Features Maps of High Dimensional Data
1 On the Spectrum of Random Features Maps of High Dimensional Data
ICML 2018, Stockholm, Sweden
Zhenyu Liao, Romain Couillet
L2S, CentraleSupélec, Université Paris-Saclay, France
GSTATS IDEX DataScience Chair, GIPSA-lab, Université Grenoble-Alpes, France
Z. Liao, R. Couillet (CentraleSupélec & UG-A), On the Spectrum of RFM of High Dimensional Data, ICML 2018, Stockholm, Sweden, 1 / 18
2 Outline
1. Problem Statement
2. Main Results
3. Summary
3 Problem Setup
Random projection / random feature maps for feature extraction:
data vectors X = [x_1, ..., x_T] ∈ R^{p×T}; random matrix W ∈ R^{n×p}; σ(·) applied entry-wise; feature vectors Σ = σ(WX) ∈ R^{n×T}.
(Figure: illustration of random feature maps.)
Objective: study the Gram matrix of random features G ≡ (1/n) Σ^T Σ (the sample covariance matrix in feature space):
- what kind of data information is extracted?
- what is the impact of different nonlinearities?
- how can clustering be performed with G, and what do its eigenvectors look like?
With random matrix theory (RMT): for large n, p, T, the eigenspectrum of G is determined only by¹
- the average kernel matrix Φ, with Φ_ij ≡ E_w[G_ij] = E_w[σ(w^T x_i) σ(w^T x_j)] (a function of X), and
- the ratios between n, p and T.
¹ Cosme Louart, Zhenyu Liao, and Romain Couillet. "A Random Matrix Approach to Neural Networks." The Annals of Applied Probability 28(2), 2018.
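The setup above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the dimensions p, T, n and the choice of ReLU as the entry-wise nonlinearity are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
p, T, n = 64, 128, 256   # data dimension, number of samples, number of random features

X = rng.standard_normal((p, T))      # data matrix X = [x_1, ..., x_T] in R^{p x T}
W = rng.standard_normal((n, p))      # random projection matrix W in R^{n x p}
Sigma = np.maximum(W @ X, 0.0)       # entry-wise nonlinearity (here ReLU): Sigma = sigma(WX)
G = Sigma.T @ Sigma / n              # Gram matrix of random features, in R^{T x T}

# G is symmetric positive semi-definite, like any sample covariance matrix
eigvals = np.linalg.eigvalsh(G)
```

Its eigenvalues and eigenvectors are precisely the objects whose large-n, p, T behavior the talk characterizes.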
4 Some Known Facts
Objective: spectral characterization of Φ, with Φ_ij = E_w[σ(w^T x_i) σ(w^T x_j)].
For standard Gaussian W, this is an integral over R^p.
Table: Φ_ij for commonly used σ(·), with ∠ ≡ x_i^T x_j / (‖x_i‖ ‖x_j‖).
- σ(t) = t: Φ_ij = x_i^T x_j
- σ(t) = max(t, 0): Φ_ij = (1/2π) ‖x_i‖ ‖x_j‖ (∠ arccos(−∠) + √(1 − ∠²))
- σ(t) = |t|: Φ_ij = (2/π) ‖x_i‖ ‖x_j‖ (∠ arcsin(∠) + √(1 − ∠²))
- σ(t) = ς₊ max(t, 0) + ς₋ max(−t, 0): Φ_ij = ((ς₊ − ς₋)²/4) x_i^T x_j + ((ς₊ + ς₋)²/4) (2/π) ‖x_i‖ ‖x_j‖ (∠ arcsin(∠) + √(1 − ∠²))
- σ(t) = 1_{t>0}: Φ_ij = 1/2 − (1/2π) arccos(∠)
- σ(t) = sign(t): Φ_ij = (2/π) arcsin(∠)
- σ(t) = ς₂ t² + ς₁ t + ς₀: Φ_ij = ς₂² (2 (x_i^T x_j)² + ‖x_i‖² ‖x_j‖²) + ς₁² x_i^T x_j + ς₂ ς₀ (‖x_i‖² + ‖x_j‖²) + ς₀²
- σ(t) = cos(t): Φ_ij = exp(−(‖x_i‖² + ‖x_j‖²)/2) cosh(x_i^T x_j)
- σ(t) = sin(t): Φ_ij = exp(−(‖x_i‖² + ‖x_j‖²)/2) sinh(x_i^T x_j)
- σ(t) = erf(t): Φ_ij = (2/π) arcsin( 2 x_i^T x_j / √((1 + 2‖x_i‖²)(1 + 2‖x_j‖²)) )
- σ(t) = exp(−t²/2): Φ_ij = 1 / √((1 + ‖x_i‖²)(1 + ‖x_j‖²) − (x_i^T x_j)²)
These are (still) highly nonlinear functions of the data X!
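Closed forms of this kind can be checked by Monte Carlo integration over w ~ N(0, I_p). A sketch for the ReLU entry, where the sample count and the two test vectors are arbitrary choices made for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.array([1.0, 0.5, -0.3, 0.2])
y = np.array([-0.2, 1.0, 0.4, 0.1])

# Monte Carlo estimate of Phi = E_w[ max(w^T x, 0) max(w^T y, 0) ], w ~ N(0, I_4)
W = rng.standard_normal((500_000, 4))
phi_mc = np.mean(np.maximum(W @ x, 0) * np.maximum(W @ y, 0))

# closed form: (1/2pi) ||x|| ||y|| ( ang * arccos(-ang) + sqrt(1 - ang^2) ),
# with ang the cosine of the angle between x and y
nx, ny = np.linalg.norm(x), np.linalg.norm(y)
ang = x @ y / (nx * ny)
phi_cf = nx * ny / (2 * np.pi) * (ang * np.arccos(-ang) + np.sqrt(1 - ang ** 2))
```

With half a million samples, the two values agree to a few decimal places.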
5 Dig Deeper into the Average Kernel Φ
Data model: consider data drawn from a K-class Gaussian mixture model:
x_i ∈ C_a ⇔ x_i = µ_a/√p + ω_i, with ω_i ~ N(0, C_a/p), a = 1, ..., K,
where class C_a has statistical mean µ_a and covariance C_a.
Non-trivial classification regime: for p large, ‖µ_a − µ_b‖ = O(1), ‖C_a‖ = O(1) and tr(C_a − C_b)/√p = O(1).
As a consequence,
‖x_i‖² = ‖ω_i‖² [O(1)] + ‖µ_a‖²/p + 2 µ_a^T ω_i/√p [O(p⁻¹)]
       = tr(C_a)/p [O(1)] + (‖ω_i‖² − tr(C_a)/p) [O(p^{−1/2})] + ‖µ_a‖²/p + 2 µ_a^T ω_i/√p [O(p⁻¹)].
If these rates are relaxed, classification becomes too easy: it would suffice to compare the norms ‖x_i‖ and ‖x_j‖!
This in fact reveals a more intrinsic property of high dimensional data:
Curse of dimensionality: little difference in Euclidean distance between pairs!
Denote C° ≡ Σ_{a=1}^K (T_a/T) C_a and C°_a ≡ C_a − C° for a = 1, ..., K. Then ‖x_i‖² = τ + O(p^{−1/2}) with τ ≡ tr(C°)/p, and
‖x_i − x_j‖² = ‖x_i‖² + ‖x_j‖² − 2 x_i^T x_j ≈ 2τ:
almost constant distance, whether x_i and x_j come from the same class or from different ones!
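This distance-concentration phenomenon is easy to reproduce numerically. Below is a small sketch under the growth rates above, with two classes and scalar covariances; all constants (p, T, the mean separation, the trace gap) are illustrative choices, not values from the talk.

```python
import numpy as np

rng = np.random.default_rng(2)
p, T = 4096, 64                        # large dimension, moderate sample count

mu1 = np.zeros(p)
mu2 = np.zeros(p); mu2[0] = 4.0        # ||mu1 - mu2|| = 4 = O(1)
c1 = 1.0
c2 = 1.0 + 3.0 / np.sqrt(p)            # tr(C1 - C2)/sqrt(p) = -3 = O(1)

# x_i = mu_a/sqrt(p) + omega_i, omega_i ~ N(0, c_a/p * I)
X1 = mu1[:, None] / np.sqrt(p) + np.sqrt(c1 / p) * rng.standard_normal((p, T // 2))
X2 = mu2[:, None] / np.sqrt(p) + np.sqrt(c2 / p) * rng.standard_normal((p, T // 2))
X = np.concatenate([X1, X2], axis=1)

# all squared pairwise distances concentrate around 2*tau, tau = tr(C°)/p
tau = (c1 + c2) / 2
n2 = (X ** 2).sum(axis=0)
sq = n2[:, None] + n2[None, :] - 2.0 * (X.T @ X)   # squared distances
off = sq[~np.eye(T, dtype=bool)]                   # off-diagonal entries only
rel_spread = np.abs(off / (2 * tau) - 1).max()
```

Despite the two classes differing in both mean and covariance, every pairwise distance sits within a few percent of 2τ.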
6 Dig Deeper into the Average Kernel Φ
Why does classification still work? The statistical information is hidden in smaller-order terms!
‖x_i − x_j‖² = ‖x_i‖² + ‖x_j‖² − 2 x_i^T x_j = 2τ + {−2 ω_i^T ω_j and norm fluctuations} [O(p^{−1/2})] + {µ_a^T µ_b/p, µ_a^T ω_j/√p, µ_b^T ω_i/√p cross terms} [O(p⁻¹)].
Small entry-wise does not mean small in matrix (operator) norm: repeated across a large T×T matrix, these small terms accumulate, and spectral clustering works!
Moreover, concentration brings simplifications: for Φ_ij = E_w[σ(w^T x_i) σ(w^T x_j)] with the ReLU nonlinearity,
Φ_ij = (1/2π) ‖x_i‖ ‖x_j‖ (∠ arccos(−∠) + √(1 − ∠²)), with ∠ = x_i^T x_j/(‖x_i‖ ‖x_j‖).
Concentration: ∠ ≈ 0 and ‖x_i‖² ≈ ‖x_j‖² ≈ τ, so Φ_ij concentrates around a τ-dependent constant plus the information terms in (µ_a, C_a)!
Blessing of dimensionality: high dimensional concentration + Taylor expansion linearize Φ!
7 Main Results
Asymptotic equivalent of Φ: for all σ(·) listed in the table above, as n, p, T → ∞ at the same rate,
‖Φ − Φ̃‖ → 0 almost surely, with
Φ̃ ≡ d₁ (Ω + M J^T/√p)^T (Ω + M J^T/√p) + d₂ U B U^T + d₀ I_T,
U ≡ [J/√p, φ], B ≡ [ t t^T + 2S, t ; t^T, 1 ].
Table: coefficients (d₁, d₂) in Φ̃ for different σ(·).
- t: d₁ = 1, d₂ = 0
- max(t, 0): d₁ = 1/4, d₂ = 1/(8πτ)
- |t|: d₁ = 0, d₂ = 1/(2πτ)
- ς₊ max(t, 0) + ς₋ max(−t, 0): d₁ = (ς₊ − ς₋)²/4, d₂ = (ς₊ + ς₋)²/(8πτ)
- 1_{t>0}: d₁ = 1/(2πτ), d₂ = 0
- sign(t): d₁ = 2/(πτ), d₂ = 0
- ς₂ t² + ς₁ t + ς₀: d₁ = ς₁², d₂ = ς₂²
- cos(t): d₁ = 0, d₂ = e^{−τ}/4
- sin(t): d₁ = e^{−τ}, d₂ = 0
- erf(t): d₁ = 4/(π(2τ + 1)), d₂ = 0
- exp(−t²/2): d₁ = 0, d₂ = 1/(4(τ + 1)³)
Here J ≡ [j₁, ..., j_K], with j_a the canonical (indicator) vector of class C_a, i.e. (j_a)_i = δ_{x_i ∈ C_a} (useful for clustering); Ω and φ collect the random fluctuations of the data; M ≡ [µ₁, ..., µ_K], t ≡ {tr(C°_a)/√p}_{a=1}^K and S ≡ {tr(C_a C_b)/p}_{a,b=1}^K carry the statistical information of the data distribution.
8 Consequences
(Coefficients d₁ and d₂ as in the table of the previous slide.)
A natural classification of σ(·):
- mean-oriented (d₁ ≠ 0, d₂ = 0): t, 1_{t>0}, sign(t), sin(t) and erf(t) separate classes through differences in means M;
- covariance-oriented (d₁ = 0, d₂ ≠ 0): |t|, cos(t) and exp(−t²/2) track differences in covariances through t and S;
- balanced (both d₁ ≠ 0 and d₂ ≠ 0): the ReLU function max(t, 0), the Leaky ReLU function ς₊ max(t, 0) + ς₋ max(−t, 0) and the quadratic function ς₂t² + ς₁t + ς₀ make use of both statistics!
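The mean- versus covariance-oriented distinction has a simple deterministic illustration: sign features are invariant to the scale of the input, so they are blind to information carried only by norms (hence by covariance traces), while absolute-value features retain it. A quick sketch with arbitrary dimensions:

```python
import numpy as np

rng = np.random.default_rng(4)
p, n = 16, 1000
W = rng.standard_normal((n, p))
x = rng.standard_normal(p)

# sign(w^T (2x)) = sign(w^T x): the features cannot distinguish x from 2x,
# so sign(t) is blind to norm (covariance-trace) differences -> mean-oriented
feat_sign_x = np.sign(W @ x)
feat_sign_2x = np.sign(W @ (2 * x))

# |w^T (2x)| = 2 |w^T x|: absolute-value features keep the scale information
feat_abs_x = np.abs(W @ x)
feat_abs_2x = np.abs(W @ (2 * x))
```

This is the entry-wise counterpart of d₂ = 0 for sign(t) and d₁ = 0 for |t| in the coefficient table.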
9 Numerical Validations: Gaussian Data
Example: Gaussian mixture data from four classes N(µ₁, C₁), N(µ₁, C₂), N(µ₂, C₁) and N(µ₂, C₂), with the Leaky ReLU function ς₊ max(t, 0) + ς₋ max(−t, 0).
Case 1: ς₊ = −ς₋ = 1 (equivalent to the linear map σ(t) = t).
(Figure: leading two eigenvectors of G; classes C₁–C₄.)
Case 2: ς₊ = ς₋ = 1 (equivalent to σ(t) = |t|).
(Figure: leading two eigenvectors of G; classes C₁–C₄.)
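The three parameterizations used in these experiments reduce to familiar maps, which is easy to verify as a sanity check (the grid of test points is arbitrary):

```python
import numpy as np

def lrelu(t, s_plus, s_minus):
    """Leaky ReLU family: sigma(t) = s_plus * max(t, 0) + s_minus * max(-t, 0)."""
    return s_plus * np.maximum(t, 0) + s_minus * np.maximum(-t, 0)

t = np.linspace(-3.0, 3.0, 101)
case1 = lrelu(t, 1, -1)   # Case 1: equals the linear map t   -> purely mean-oriented
case2 = lrelu(t, 1, 1)    # Case 2: equals |t|                -> purely covariance-oriented
case3 = lrelu(t, 1, 0)    # Case 3: equals ReLU max(t, 0)     -> balanced
```

So a single two-parameter family interpolates between the three behaviors of the previous slide.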
10 Numerical Validations: Gaussian Data
Case 3: ς₊ = 1, ς₋ = 0 (the ReLU function).
(Figure: leading two eigenvectors of G and their two-dimensional projection; classes C₁–C₄.)
11 Numerical Validations: Real Datasets
(Figure: the MNIST image database. Figure: the epileptic EEG time-series datasets.)
Reproducibility: codes available at
12 Numerical Validations: Real Datasets
Table: empirical estimates of the differences in means (M^T M) and in covariances (t t^T + 2S) of the MNIST and epileptic EEG datasets (numerical entries not legible).
Table: clustering accuracies on the MNIST dataset (T = 64 / T = 128).
- mean-oriented: t: 88.94% / 87.30%; 1_{t>0}: 82.94% / 85.56%; sign(t): 83.34% / 85.2%; sin(t): 87.81% / 87.50%; erf(t): 87.8% / 86.59%
- covariance-oriented: |t|: 60.41% / 57.81%; cos(t): 59.56% / 57.7%; exp(−t²/2): 60.44% / 58.67%
- balanced: ReLU(t): 85.7% / 82.7%
Table: clustering accuracies on the EEG dataset (T = 64 / T = 128).
- mean-oriented: t: 70.31% / 69.58%; 1_{t>0}: (illegible) / 63.47%; sign(t): 64.63% / 63.03%; sin(t): 70.34% / 68.2%; erf(t): 70.59% / 67.70%
- covariance-oriented: |t|: 99.69% / 99.50%; cos(t): 99.38% / 99.36%; exp(−t²/2): 99.81% / 99.77%
- balanced: ReLU(t): 87.91% / 90.97%
13 Numerical Validations: Real Datasets
(Figure: leading eigenvector of Φ for the MNIST (top) and EEG (bottom) datasets, with simulated mean/standard deviation and the theoretical prediction for Gaussian mixture data of the same statistics, shown with a width of ±1 standard deviation; classes C₁, C₂.)
14 Summary
Take-away messages:
- concentration of high dimensional data is what lets us handle the nonlinearity;
- nonlinearities fall into three families: mean-oriented, covariance-oriented and balanced;
- the choice of nonlinearity can be optimized as a function of the data (quadratic and Leaky ReLU);
- novel insight into the understanding of neural networks for high dimensional data.
Future work:
- study of the eigenvalue distribution;
- the (asymptotic) behavior of the leading eigenvectors;
- combinations of different types of nonlinearities, e.g., sin + cos → Gaussian kernel;
- directly linking σ(·) to the coefficients d₀, d₁ and d₂.
15 Thank You
Thank you! Poster #6.
Techniques for Dimensionality Reduction PCA and Other Matrix Factorization Methods Outline Principle Compoments Analysis (PCA) Example (Bishop, ch 12) PCA as a mixture model variant With a continuous latent
More informationDimension Reduction Techniques. Presented by Jie (Jerry) Yu
Dimension Reduction Techniques Presented by Jie (Jerry) Yu Outline Problem Modeling Review of PCA and MDS Isomap Local Linear Embedding (LLE) Charting Background Advances in data collection and storage
More informationEcon 2148, fall 2017 Gaussian process priors, reproducing kernel Hilbert spaces, and Splines
Econ 2148, fall 2017 Gaussian process priors, reproducing kernel Hilbert spaces, and Splines Maximilian Kasy Department of Economics, Harvard University 1 / 37 Agenda 6 equivalent representations of the
More informationStatistical Convergence of Kernel CCA
Statistical Convergence of Kernel CCA Kenji Fukumizu Institute of Statistical Mathematics Tokyo 106-8569 Japan fukumizu@ism.ac.jp Francis R. Bach Centre de Morphologie Mathematique Ecole des Mines de Paris,
More informationAnalysis of Spectral Kernel Design based Semi-supervised Learning
Analysis of Spectral Kernel Design based Semi-supervised Learning Tong Zhang IBM T. J. Watson Research Center Yorktown Heights, NY 10598 Rie Kubota Ando IBM T. J. Watson Research Center Yorktown Heights,
More informationLearning SVM Classifiers with Indefinite Kernels
Learning SVM Classifiers with Indefinite Kernels Suicheng Gu and Yuhong Guo Dept. of Computer and Information Sciences Temple University Support Vector Machines (SVMs) (Kernel) SVMs are widely used in
More informationData Analysis and Manifold Learning Lecture 6: Probabilistic PCA and Factor Analysis
Data Analysis and Manifold Learning Lecture 6: Probabilistic PCA and Factor Analysis Radu Horaud INRIA Grenoble Rhone-Alpes, France Radu.Horaud@inrialpes.fr http://perception.inrialpes.fr/ Outline of Lecture
More informationConvex Optimization M2
Convex Optimization M2 Lecture 8 A. d Aspremont. Convex Optimization M2. 1/57 Applications A. d Aspremont. Convex Optimization M2. 2/57 Outline Geometrical problems Approximation problems Combinatorial
More informationJordan normal form notes (version date: 11/21/07)
Jordan normal form notes (version date: /2/7) If A has an eigenbasis {u,, u n }, ie a basis made up of eigenvectors, so that Au j = λ j u j, then A is diagonal with respect to that basis To see this, let
More informationMultisurface Proximal Support Vector Machine Classification via Generalized Eigenvalues
Multisurface Proximal Support Vector Machine Classification via Generalized Eigenvalues O. L. Mangasarian and E. W. Wild Presented by: Jun Fang Multisurface Proximal Support Vector Machine Classification
More information