Random Matrix Theory for Neural Networks
1 Random Matrix Theory for Neural Networks
Ph.D. Mid-Term Evaluation
Zhenyu Liao
Laboratoire des Signaux et Systèmes (L2S), CentraleSupélec, Université Paris-Saclay
Salle sd.207, Bâtiment Bouygues, Gif-sur-Yvette, France
Z. Liao (L2S, CentraleSupélec), RMT for Neural Networks, Ph.D. Mid-Term Evaluation, 2018, 1 / 26
2 Outline
1 Curriculum Vitae
2 Introduction
3 Summary of main results
4 Conclusion
3 Curriculum Vitae
Education
- Ph.D. in Statistics and Signal Processing, L2S, CentraleSupélec, France, 2016-present. Thesis: Random Matrix Theory for Neural Networks. Supervisors: Prof. Romain Couillet, Prof. Yacine Chitour.
- M.Sc. in Signal and Image Processing, CentraleSupélec/Université Paris-Sud, France.
- B.Sc. in Electronic Engineering, Université Paris-Sud, France.
- B.Sc. in Optical & Electronic Information, HUST, Wuhan, China.
Ph.D. training
- Scientific training [completed]: Random matrix theory and machine learning applications: 27 hours. Summer school of signal and image processing in Peyresq: 2 hours.
- Professional training [completed]: Law and intellectual property: 8 hours. European projects Horizon 2020: 8 hours. Techniques for scientific writing and associated software: 5 hours.
Teaching
- Lab work of Signals and Systems, with Prof. Laurent Le Brusquet, Department of Signal and Statistics, CentraleSupélec: 54 hours.
4 Curriculum Vitae
Review activities
- IEEE Transactions on Signal Processing
- Neural Processing Letters
Publications
Conferences
- Z. Liao, R. Couillet, "The Dynamics of Learning: A Random Matrix Approach," (submitted to) The 35th International Conference on Machine Learning (ICML 2018), Stockholm, Sweden, 2018.
- Z. Liao, R. Couillet, "On the Spectrum of Random Features Maps of High Dimensional Data," (submitted to) The 35th International Conference on Machine Learning (ICML 2018), Stockholm, Sweden, 2018.
- Z. Liao, R. Couillet, "Une Analyse des Méthodes de Projections Aléatoires par la Théorie des Matrices Aléatoires" (in French; "An Analysis of Random Projection Methods via Random Matrix Theory"), Colloque GRETSI 2017, Juan-les-Pins, France, 2017.
- Z. Liao, R. Couillet, "Random Matrices Meet Machine Learning: A Large Dimensional Analysis of LS-SVM," IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2017), New Orleans, USA, 2017.
Journals
- C. Louart, Z. Liao, R. Couillet, "A Random Matrix Approach to Neural Networks," (in press) Annals of Applied Probability, 2017.
- Z. Liao, R. Couillet, "A Large Dimensional Analysis of Least Squares Support Vector Machines," (submitted to) Journal of Machine Learning Research, 2017.
5 Motivation: features in machine learning
Learning = Representation + Evaluation + Optimization.
Features: representations of the data that contain crucial information.
Various methods for feature extraction:
- features selected by hand
- features learned via backpropagation
- random feature maps
How to study and understand these features?
Sample Covariance Matrix (SCM) $\frac{1}{T} X X^T$ of some data $X = [x_1, \ldots, x_T] \in \mathbb{R}^{p \times T}$.
SCM in feature space → feature Gram matrix $G$: $G \equiv \frac{1}{T} F^T F$, with $F = [f(x_1), \ldots, f(x_T)]$ the feature matrix of the data $X = [x_1, \ldots, x_T]$.
Domingos, Pedro. "A few useful things to know about machine learning." Communications of the ACM 55.10 (2012).
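In code, the SCM and the feature Gram matrix are each one line. A minimal numpy sketch, where the dimensions, the Gaussian data, and the ReLU random feature map are illustrative choices rather than those of the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
p, n, T = 50, 200, 100                        # data dim, feature dim, sample count

X = rng.standard_normal((p, T))               # data matrix X = [x_1, ..., x_T]
W = rng.standard_normal((n, p)) / np.sqrt(p)  # random projection of the feature map
F = np.maximum(W @ X, 0.0)                    # feature matrix F = [f(x_1), ..., f(x_T)]

SCM = X @ X.T / T                             # sample covariance matrix (1/T) X X^T
G = F.T @ F / T                               # feature Gram matrix G = (1/T) F^T F
```

Both matrices are symmetric positive semi-definite; $G$ is the object whose eigenspectrum the rest of the talk studies.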
6 Motivation: in random feature maps (RFM)
SCM in feature space → feature Gram matrix $G$: $G \equiv \frac{1}{T} F^T F$, with $F \in \mathbb{R}^{n \times T}$ the feature matrix of the data $X$.
In RFM, $G$ determines training and test performance via its resolvent $Q(z) \equiv (G - z I_T)^{-1}$.
Example 1: MSE of RFM-based ridge regression (also called extreme learning machines):
$E_{\rm train} = \frac{1}{T} \| y - \beta^T F \|_F^2 = \frac{\gamma^2}{T} y^T Q^2(-\gamma) y, \qquad E_{\rm test} = \frac{1}{\hat T} \| \hat y - \beta^T \hat F \|_F^2$
with ridge regressor $\beta \equiv \frac{1}{T} F (G + \gamma I_T)^{-1} y = \frac{1}{T} F Q(-\gamma) y$ and regularization $\gamma > 0$; $y$ denotes the targets associated with the training data $X$, and $\hat y$ the targets of the test data $\hat X$.
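The identity $E_{\rm train} = \frac{\gamma^2}{T} y^T Q^2(-\gamma) y$ follows from $y - F^T \beta = \gamma (G + \gamma I_T)^{-1} y$ and is easy to check numerically. A hedged numpy sketch (all sizes, the ReLU feature map, and the $\pm 1$ targets are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(1)
p, n, T, gamma = 30, 100, 80, 1e-1

X = rng.standard_normal((p, T))
W = rng.standard_normal((n, p)) / np.sqrt(p)
F = np.maximum(W @ X, 0.0)                    # ReLU random features, F in R^{n x T}
y = rng.choice([-1.0, 1.0], size=T)           # +-1 training targets

G = F.T @ F / T                               # feature Gram matrix
Q = np.linalg.inv(G + gamma * np.eye(T))      # resolvent Q(-gamma)

beta = F @ Q @ y / T                          # ridge regressor beta
E_train_direct = np.sum((y - F.T @ beta) ** 2) / T
E_train_resolvent = gamma**2 * (y @ Q @ Q @ y) / T

print(E_train_direct, E_train_resolvent)      # the two expressions coincide
```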
7 Motivation: gradient descent dynamics
Example 2: for a feature matrix $F$ with associated labels $y$, a classifier trained by GD minimizes $L(w) = \frac{1}{2T} \| y^T - w^T F \|^2$. With a constant small learning rate $\alpha$, the GD dynamics yields
$\frac{\partial w(t)}{\partial t} = -\alpha \nabla_w L = \frac{\alpha}{T} F \left( y - F^T w(t) \right)$
$\Rightarrow \quad w(t) = e^{-\frac{\alpha t}{T} F F^T} w_0 + \left( I_p - e^{-\frac{\alpha t}{T} F F^T} \right) (F F^T)^{-1} F y.$
The generalization performance on a new feature $\hat f$ is then related to
$w(t)^T \hat f = -\frac{1}{2\pi i} \oint_\gamma f_t(z) \, \hat f^T Q(z) w_0 \, dz - \frac{1}{2\pi i} \oint_\gamma \frac{1 - f_t(z)}{z} \, \frac{\hat f^T Q(z) F y}{T} \, dz$
with $f_t(z) \equiv \exp(-\alpha t z)$, $Q$ the resolvent of $\tilde G \equiv \frac{1}{T} F F^T$, and $\gamma$ a contour enclosing all eigenvalues of $\tilde G$.
Important remark
Both examples reduce to the study of the eigenspectrum, through the resolvent, of SCM-like matrices ($G$ or $\tilde G$), especially in the high dimensional regime of today where $n$, $p$, $T$ are comparably large.
Random Matrix Theory (RMT) is the answer!
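The closed-form gradient-flow solution above can be checked against an Euler discretization of GD. A sketch under illustrative assumptions (dimensions, step size, and horizon are arbitrary; since $\frac{1}{T} F F^T$ is symmetric, `np.linalg.eigh` is used in place of a generic matrix exponential):

```python
import numpy as np

rng = np.random.default_rng(2)
p, T, alpha, t_end = 20, 100, 0.01, 50.0

F = rng.standard_normal((p, T))               # feature matrix F in R^{p x T}
y = rng.choice([-1.0, 1.0], size=T)           # labels
w0 = rng.standard_normal(p)                   # initialization

K = F @ F.T / T                               # (1/T) F F^T, symmetric PSD
lam, U = np.linalg.eigh(K)
E = U @ np.diag(np.exp(-alpha * t_end * lam)) @ U.T   # e^{-(alpha t / T) F F^T}
w_inf = np.linalg.solve(F @ F.T, F @ y)       # least-squares limit (F F^T)^{-1} F y
w_closed = E @ w0 + (np.eye(p) - E) @ w_inf   # closed-form gradient-flow solution

# Euler discretization of the gradient flow with a small step dt
w, dt = w0.copy(), 1e-2
for _ in range(int(t_end / dt)):
    w += dt * alpha / T * F @ (y - F.T @ w)
```

The discretized trajectory lands (up to $O(dt)$ error) on the closed-form $w(t)$.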
8 Motivation: difficulty from nonlinearity
However, the highly nonlinear and abstract nature of real-world problems means that:
- classical RMT results cannot apply directly
- new tools are needed to adapt to nonlinear models
Example: random feature maps.
Figure: Illustration of random feature maps: data $x \in \mathbb{R}^p$ is mapped, through a random matrix $W \in \mathbb{R}^{n \times p}$ with i.i.d. entries and a (nonlinear) activation function $\sigma(\cdot)$ applied entry-wise, to features $\sigma(Wx) \in \mathbb{R}^n$.
Classical RMT results are essentially based on the trace lemma: for $A \in \mathbb{R}^{n \times n}$ of bounded operator norm and a random vector $w \in \mathbb{R}^n$ with i.i.d. entries,
$\frac{1}{n} w^T A w - \frac{1}{n} \operatorname{tr}(A) \to 0$ almost surely as $n \to \infty$.
9 Motivation: difficulty from nonlinearity
Example: random feature maps: data $x \in \mathbb{R}^p$ mapped through $W \in \mathbb{R}^{n \times p}$ and an entry-wise $\sigma(\cdot)$ to features $\sigma(Wx) \in \mathbb{R}^n$.
Classical RMT results are essentially based on the trace lemma: for $A \in \mathbb{R}^{n \times n}$ of bounded operator norm and a random vector $w \in \mathbb{R}^n$ with i.i.d. entries, $\frac{1}{n} w^T A w - \frac{1}{n} \operatorname{tr}(A) \to 0$ almost surely as $n \to \infty$.
However, the object under study here is $\frac{1}{n} \sigma(w^T X) A \sigma(X^T w)$:
- loss of independence between entries
- behavior more elusive due to $\sigma(\cdot)$
One major objective of the thesis: handle nonlinearity in RMT.
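The trace lemma is easy to observe numerically: the deviation of $\frac{1}{n} w^T A w$ from $\frac{1}{n} \operatorname{tr}(A)$ shrinks as $n$ grows. A sketch using a diagonal $A$ of bounded operator norm (an illustrative choice that keeps the quadratic form cheap to evaluate):

```python
import numpy as np

rng = np.random.default_rng(3)
deviations = {}
for n in (100, 1000, 10000):
    a = rng.uniform(0.0, 2.0, size=n)   # spectrum of a diagonal A, operator norm <= 2
    w = rng.standard_normal(n)          # i.i.d. entries, zero mean, unit variance
    quad = (a * w) @ w / n              # (1/n) w^T A w for diagonal A
    tr = a.sum() / n                    # (1/n) tr(A)
    deviations[n] = abs(quad - tr)
    print(f"n={n:6d}  |quad - trace| = {deviations[n]:.4f}")
```

The deviation decays at the expected $O(1/\sqrt{n})$ rate.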
10 Handle nonlinearity in RMT: concentration
Lemma (Concentration of quadratic forms)
For $w \sim \mathcal{N}(0, I_p)$, $X$, $A$ bounded and $\sigma(\cdot)$ Lipschitz, as $n, p \to \infty$ with $p/n \to c$,
$P\left( \left| \frac{1}{n} \sigma(w^T X) A \sigma(X^T w) - \frac{1}{n} \operatorname{tr}(\Phi A) \right| > t \right) \le C e^{-c n \min(t, t^2)}$
where $\Phi \equiv E_w\left[ \sigma(X^T w) \sigma(w^T X) \right]$.
$\Phi(a, b) \equiv E_w\left[ \sigma(w^T a) \sigma(w^T b) \right] = (2\pi)^{-\frac{p}{2}} \int_{\mathbb{R}^p} \sigma(w^T a) \sigma(w^T b) e^{-\frac{1}{2}\|w\|^2} dw = \frac{1}{2\pi} \int_{\mathbb{R}^2} \sigma(\tilde w^T \tilde a) \sigma(\tilde w^T \tilde b) e^{-\frac{1}{2}\|\tilde w\|^2} d\tilde w$ (projection on $\operatorname{span}(a, b)$).
For $\sigma(t) = \max(t, 0) \equiv \operatorname{ReLU}(t)$,
$\Phi(a, b) = \frac{1}{2\pi} \int_S \tilde w^T \tilde a \, \tilde w^T \tilde b \, e^{-\frac{1}{2}\|\tilde w\|^2} d\tilde w = \frac{1}{2\pi} \|a\| \|b\| \left( \angle(a,b) \arccos(-\angle(a,b)) + \sqrt{1 - \angle(a,b)^2} \right)$
where $S \equiv \left\{ \tilde w : \min(\tilde w^T \tilde a, \tilde w^T \tilde b) > 0 \right\}$ and $\angle(a,b) \equiv \frac{a^T b}{\|a\| \|b\|}$.
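The ReLU expression for $\Phi(a, b)$ can be verified by Monte Carlo integration over $w \sim \mathcal{N}(0, I_p)$. A sketch where the dimension, the two vectors, and the sample count are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(4)
a = np.array([1.0, 2.0, 0.0])
b = np.array([2.0, 0.5, 1.0])
p = a.size

# Monte Carlo estimate of Phi(a, b) = E_w[ReLU(w^T a) ReLU(w^T b)], w ~ N(0, I_p)
W = rng.standard_normal((1_000_000, p))
mc = np.mean(np.maximum(W @ a, 0.0) * np.maximum(W @ b, 0.0))

# closed form for sigma(t) = max(t, 0)
na, nb = np.linalg.norm(a), np.linalg.norm(b)
ang = a @ b / (na * nb)                       # angle(a, b) = a^T b / (||a|| ||b||)
closed = na * nb / (2 * np.pi) * (ang * np.arccos(-ang) + np.sqrt(1 - ang**2))

print(mc, closed)                             # agree to Monte Carlo accuracy
```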
11 Handle nonlinearity in RMT: concentration
Behavior of $\Phi$ as a function of $X$: concentration phenomenon for large $n, p, T$.
For the ReLU function, $\Phi(x_i, x_j) = \frac{1}{2\pi} \|x_i\| \|x_j\| \left( \sqrt{1 - \angle(x_i, x_j)^2} + \angle(x_i, x_j) \arccos(-\angle(x_i, x_j)) \right)$.
Assumption: GMM data
For $i = 1, \ldots, T$, $x_i \sim \mathcal{N}_p\left( \frac{\mu_a}{\sqrt{p}}, \frac{C_a}{p} \right)$, i.e., $x_i = \frac{\mu_a}{\sqrt{p}} + \omega_i$ with $\omega_i \sim \mathcal{N}\left(0, \frac{C_a}{p}\right)$; class $\mathcal{C}_a$ has cardinality $T_a$, for $a = 1, \ldots, K$.
$x_i^T x_j = \left( \frac{\mu_a^T}{\sqrt{p}} + \omega_i^T \right) \left( \frac{\mu_b}{\sqrt{p}} + \omega_j \right) = \frac{\mu_a^T \mu_b}{p} + \frac{\mu_a^T \omega_j + \mu_b^T \omega_i}{\sqrt{p}} + \omega_i^T \omega_j$
$x_i^T x_i = \left( \frac{\mu_a^T}{\sqrt{p}} + \omega_i^T \right) \left( \frac{\mu_a}{\sqrt{p}} + \omega_i \right) = \frac{\|\mu_a\|^2}{p} + \frac{2 \mu_a^T \omega_i}{\sqrt{p}} + \|\omega_i\|^2$
Growth rate [Neyman–Pearson minimal]
As $n, p, T \to \infty$, let $C^\circ \equiv \sum_{a=1}^K \frac{T_a}{T} C_a$ and, for $a = 1, \ldots, K$, $C_a^\circ \equiv C_a - C^\circ$:
- the Euclidean norm $\|\mu_a\| = O(1)$
- the operator norm $\|C_a\| = O(1)$ and $\operatorname{tr}(C_a^\circ) = O(\sqrt{p})$
12 Handle nonlinearity in RMT: concentration
Growth rate [Neyman–Pearson minimal]: as $n, p, T \to \infty$, with $C^\circ \equiv \sum_{a=1}^K \frac{T_a}{T} C_a$ and $C_a^\circ \equiv C_a - C^\circ$ for $a = 1, \ldots, K$: $\|\mu_a\| = O(1)$, $\|C_a\| = O(1)$ and $\operatorname{tr}(C_a^\circ) = O(\sqrt{p})$.
Then
$x_i^T x_j = \left( \frac{\mu_a}{\sqrt{p}} + \omega_i \right)^T \left( \frac{\mu_b}{\sqrt{p}} + \omega_j \right) = \underbrace{\frac{\mu_a^T \mu_b}{p}}_{O(p^{-1})} + \underbrace{\frac{\mu_a^T \omega_j + \mu_b^T \omega_i}{\sqrt{p}}}_{O(p^{-1})} + \underbrace{\omega_i^T \omega_j}_{O(p^{-1/2})}$
$x_i^T x_i = \underbrace{\frac{\|\mu_a\|^2}{p} + \frac{2 \mu_a^T \omega_i}{\sqrt{p}}}_{O(p^{-1})} + \underbrace{\|\omega_i\|^2 - \frac{\operatorname{tr}(C_a)}{p}}_{O(p^{-1/2})} + \underbrace{\frac{\operatorname{tr}(C_a^\circ)}{p}}_{O(p^{-1/2})} + \underbrace{\tau}_{O(1)}$
with $E\left[ \|\omega_i\|^2 \right] = \operatorname{tr}(C_a)/p$ and $\tau \equiv \operatorname{tr}(C^\circ)/p$.
⇒ $x_i^T x_i$ concentrates around $\tau$, so $g(x_i^T x_i)$ concentrates around $g(\tau)$: Taylor expansion.
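Under this growth rate, the concentration of $x_i^T x_i$ around $\tau$ is directly observable. A sketch with a single class, whose mean and covariance profile are illustrative choices satisfying the assumptions ($\|\mu\| = 1$, covariance spectrum bounded around 1):

```python
import numpy as np

rng = np.random.default_rng(5)
max_dev = {}
for p in (100, 400, 1600):
    T = 2 * p
    mu = np.ones(p) / np.sqrt(p)                     # Euclidean norm ||mu|| = 1 = O(1)
    spec = 1.0 + rng.uniform(-1, 1, p) / np.sqrt(p)  # spectrum of C, ||C|| = O(1)
    tau = spec.sum() / p                             # tau = tr(C)/p
    # x_i = mu / sqrt(p) + omega_i, with omega_i ~ N(0, C/p), C diagonal here
    X = mu[None, :] / np.sqrt(p) + rng.standard_normal((T, p)) * np.sqrt(spec / p)
    max_dev[p] = np.abs(np.sum(X**2, axis=1) - tau).max()
    print(f"p={p:5d}  max_i |x_i^T x_i - tau| = {max_dev[p]:.4f}")
```

Even the worst deviation over all $T = 2p$ samples shrinks as $p$ grows, which is what licenses the Taylor expansion of $g$ around $\tau$.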
13 Main results
Theorem (Asymptotic equivalent of $\Phi_c$)
Recenter $\Phi \equiv E_w\left[ \sigma(X^T w) \sigma(w^T X) \right]$ to get $\Phi_c \equiv P \Phi P$, with $P \equiv I_T - \frac{1}{T} 1_T 1_T^T$. As $T \to \infty$, $\| \Phi_c - \tilde\Phi_c \| \to 0$ almost surely, with $\tilde\Phi_c \equiv P \tilde\Phi P$ and
$\tilde\Phi \equiv d_1 \left( \Omega + M \frac{J^T}{\sqrt{p}} \right)^T \left( \Omega + M \frac{J^T}{\sqrt{p}} \right) + d_2 U B U^T + d_0 I_T$
as well as $U \equiv \left[ \frac{J}{\sqrt{p}}, \phi \right]$, $B \equiv \begin{bmatrix} t t^T + 2S & t \\ t^T & 1 \end{bmatrix}$, where $\Omega \equiv [\omega_1, \ldots, \omega_T]$, $\phi \equiv \left\{ \|\omega_i\|^2 - E\left[ \|\omega_i\|^2 \right] \right\}_{i=1}^T$, and the data statistics² are $M \equiv [\mu_1, \ldots, \mu_K]$, $t \equiv \left\{ \operatorname{tr}(C_a^\circ)/\sqrt{p} \right\}_{a=1}^K$, $S \equiv \left\{ \operatorname{tr}(C_a C_b)/p \right\}_{a,b=1}^K$, $J \equiv [j_1, \ldots, j_K]$, where $j_a \in \mathbb{R}^T$ denotes the canonical vector of class $\mathcal{C}_a$ such that $(j_a)_i = \delta_{x_i \in \mathcal{C}_a}$.
² M for the means, t for the (differences in) traces, and S for the shapes of the covariances.
14 Coefficients $d_i$ for different activation functions
Table: $\Phi(a, b)$ for different $\sigma(\cdot)$, with $\angle(a, b) \equiv \frac{a^T b}{\|a\| \|b\|}$.
- $\sigma(t) = t$: $\Phi(a, b) = a^T b$
- $\sigma(t) = \max(t, 0) \equiv \operatorname{ReLU}(t)$: $\Phi(a, b) = \frac{1}{2\pi} \|a\| \|b\| \left( \angle(a,b) \arccos(-\angle(a,b)) + \sqrt{1 - \angle(a,b)^2} \right)$
- $\sigma(t) = |t|$: $\Phi(a, b) = \frac{2}{\pi} \|a\| \|b\| \left( \angle(a,b) \arcsin(\angle(a,b)) + \sqrt{1 - \angle(a,b)^2} \right)$
- $\sigma(t) = 1_{t > 0}$: $\Phi(a, b) = \frac{1}{2} - \frac{1}{2\pi} \arccos(\angle(a,b))$
- $\sigma(t) = \operatorname{sign}(t)$: $\Phi(a, b) = \frac{2}{\pi} \arcsin(\angle(a,b))$
- $\sigma(t) = \varsigma_2 t^2 + \varsigma_1 t + \varsigma_0$: $\Phi(a, b) = \varsigma_2^2 \left( 2 (a^T b)^2 + \|a\|^2 \|b\|^2 \right) + \varsigma_1^2 a^T b + \varsigma_2 \varsigma_0 \left( \|a\|^2 + \|b\|^2 \right) + \varsigma_0^2$
- $\sigma(t) = \cos(t)$: $\Phi(a, b) = \exp\left( -\frac{1}{2} \left( \|a\|^2 + \|b\|^2 \right) \right) \cosh(a^T b)$
- $\sigma(t) = \sin(t)$: $\Phi(a, b) = \exp\left( -\frac{1}{2} \left( \|a\|^2 + \|b\|^2 \right) \right) \sinh(a^T b)$
- $\sigma(t) = \operatorname{erf}(t)$: $\Phi(a, b) = \frac{2}{\pi} \arcsin\left( \frac{2 a^T b}{\sqrt{(1 + 2\|a\|^2)(1 + 2\|b\|^2)}} \right)$
- $\sigma(t) = \exp(-\frac{t^2}{2})$: $\Phi(a, b) = \frac{1}{\sqrt{(1 + \|a\|^2)(1 + \|b\|^2) - (a^T b)^2}}$
15 Coefficients $d_i$ for different activation functions
Table: Coefficients $d_i$ in $\tilde\Phi$ for different $\sigma(\cdot)$, in the order $(d_0, d_1, d_2)$.
- $\sigma(t) = t$: $\left( 0, \; 1, \; 0 \right)$
- $\sigma(t) = \max(t, 0) \equiv \operatorname{ReLU}(t)$: $\left( \left( \frac{1}{4} - \frac{1}{2\pi} \right) \tau, \; \frac{1}{4}, \; \frac{1}{8\pi\tau} \right)$
- $\sigma(t) = |t|$: $\left( \left( 1 - \frac{2}{\pi} \right) \tau, \; 0, \; \frac{1}{2\pi\tau} \right)$
- $\sigma(t) = 1_{t > 0}$: $\left( \frac{1}{4} - \frac{1}{2\pi}, \; \frac{1}{2\pi\tau}, \; 0 \right)$
- $\sigma(t) = \operatorname{sign}(t)$: $\left( 1 - \frac{2}{\pi}, \; \frac{2}{\pi\tau}, \; 0 \right)$
- $\sigma(t) = \varsigma_2 t^2 + \varsigma_1 t + \varsigma_0$: $\left( 2\tau^2 \varsigma_2^2, \; \varsigma_1^2, \; \varsigma_2^2 \right)$
- $\sigma(t) = \cos(t)$: $\left( \frac{1}{2} + \frac{e^{-2\tau}}{2} - e^{-\tau}, \; 0, \; \frac{e^{-\tau}}{4} \right)$
- $\sigma(t) = \sin(t)$: $\left( \frac{1 - e^{-2\tau}}{2} - \tau e^{-\tau}, \; e^{-\tau}, \; 0 \right)$
- $\sigma(t) = \operatorname{erf}(t)$: $\left( \frac{2}{\pi} \left( \arcsin\frac{2\tau}{2\tau + 1} - \frac{2\tau}{2\tau + 1} \right), \; \frac{4}{\pi(2\tau + 1)}, \; 0 \right)$
- $\sigma(t) = \exp(-\frac{t^2}{2})$: $\left( \frac{1}{\sqrt{2\tau + 1}} - \frac{1}{\tau + 1}, \; 0, \; \frac{1}{4(\tau + 1)^3} \right)$
Three types of activation function $\sigma(\cdot)$:
- mean-oriented, $d_1 \neq 0$ while $d_2 = 0$: $t$, $1_{t > 0}$, $\operatorname{sign}(t)$, $\sin(t)$ and $\operatorname{erf}(t)$
- covariance-oriented, $d_1 = 0$ while $d_2 \neq 0$: $|t|$, $\cos(t)$, $\exp(-t^2/2)$
- balanced, $d_1, d_2 \neq 0$: the ReLU function and the quadratic function
16 Applications
Example 1: MSE of random feature-based ridge regression:
$E_{\rm train} = \frac{1}{T} \| y - \beta^T F \|_F^2 = \frac{\gamma^2}{T} y^T Q^2(-\gamma) y, \qquad E_{\rm test} = \frac{1}{\hat T} \| \hat y - \beta^T \hat F \|_F^2$
with the ridge regressor $\beta \equiv \frac{1}{T} F (G + \gamma I_T)^{-1} y = \frac{1}{T} F Q(-\gamma) y$.
Figure: Training and test MSE as a function of $\gamma$ (theory vs. simulation) for $\sigma(t) = t$, $\sigma(t) = \operatorname{erf}(t)$ and $\sigma(t) = \max(t, 0)$, on MNIST data (digits 7 and 9), $n = 512$, $T = \hat T = 1024$, $p = 784$.
17 Applications
Example 2: Random feature-based spectral clustering.
Figure: Eigenvalue distributions of $\Phi_c$ and $\tilde\Phi_c$, and the leading eigenvector for MNIST data (classes $\mathcal{C}_1$ and $\mathcal{C}_2$), with ±1 standard deviation (from 500 trials): simulation mean/std for MNIST data vs. theory mean/std for Gaussian data. With the ReLU function, $p = 784$, $T = 128$ and $c_1 = c_2 = 1/2$.
18 Applications
Example 2: Random feature-based spectral clustering.
Table: Empirical estimation of the (normalized) statistics $\| M^T M \|$ and $\| t t^T + 2S \|$ of the MNIST and epileptic EEG datasets.
Table: Classification accuracies for random feature-based spectral clustering with different $\sigma(t)$ on the MNIST dataset.
σ(t)            | T = 64  | T = 128
mean-oriented:
t               | 88.94%  | 87.30%
1_{t>0}         |         | 85.56%
sign(t)         | 83.34%  | 85.22%
sin(t)          | 87.8%   | 87.50%
erf(t)          | 87.28%  | 86.59%
covariance-oriented:
|t|             | 60.4%   | 57.8%
cos(t)          | 59.56%  | 57.72%
exp(−t²/2)      | 60.44%  | 58.67%
balanced:
ReLU(t)         | 85.72%  | 82.27%
Table: Classification accuracies for random feature-based spectral clustering with different $\sigma(t)$ on the epileptic EEG dataset.
σ(t)            | T = 64  | T = 128
mean-oriented:
t               | 70.3%   | 69.58%
1_{t>0}         |         | 63.47%
sign(t)         | 64.63%  | 63.03%
sin(t)          | 70.34%  | 68.22%
erf(t)          | 70.59%  | 67.70%
covariance-oriented:
|t|             | 99.69%  | 99.50%
cos(t)          | 99.38%  | 99.36%
exp(−t²/2)      | 99.8%   | 99.77%
balanced:
ReLU(t)         | 87.9%   | 90.97%
The family matched to the data (mean-oriented on MNIST, covariance-oriented on EEG) performs much better than ReLU in both cases.
19 Applications
Example 3: Temporal evolution of training and generalization performance.
Figure: Misclassification rate vs. epoch $t$: training and generalization performance (theory vs. simulation) for MNIST data (digits 1 and 7) with $n = p = 784$, $c_1 = c_2 = 1/2$, $\alpha = 0.01$ and $\sigma^2 = 0.1$. Simulation results obtained by averaging over 100 runs.
20 Conclusion
Take-away messages:
- SCMs or Gram-like matrices naturally appear in neural nets
- RMT is a powerful tool in the double-asymptotic regime
- the difficulty arising from the nonlinear structure is handled with the concentration approach
- performance analyses and improvements of different algorithms: fast tuning of hyper-parameters, wise choice of activation vis-à-vis the data structure, more insights into the training of neural nets
Future work:
- deeper study of inner-product kernels and activations in neural nets (relating $\sigma(\cdot)$ to $d_1$, $d_2$)
- the loss landscape of deep networks (beyond single-layer)
21 References
On random feature maps and neural nets:
- C. Louart, Z. Liao, R. Couillet, "A Random Matrix Approach to Neural Networks," (in press) Annals of Applied Probability, 2017.
- Z. Liao, R. Couillet, "On the Spectrum of Random Features Maps of High Dimensional Data," (submitted to) The 35th International Conference on Machine Learning (ICML 2018), Stockholm, Sweden, 2018.
On the learning dynamics of neural nets:
- Z. Liao, R. Couillet, "The Dynamics of Learning: A Random Matrix Approach," (submitted to) The 35th International Conference on Machine Learning (ICML 2018), Stockholm, Sweden, 2018.
- a journal article in preparation with co-supervisor Prof. Yacine Chitour
On kernel methods:
- Z. Liao, R. Couillet, "Random Matrices Meet Machine Learning: A Large Dimensional Analysis of LS-SVM," IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2017), New Orleans, USA, 2017.
- Z. Liao, R. Couillet, "A Large Dimensional Analysis of Least Squares Support Vector Machines," (submitted to) Journal of Machine Learning Research, 2017.
22 Merci!
More informationConvex Optimization M2
Convex Optimization M2 Lecture 8 A. d Aspremont. Convex Optimization M2. 1/57 Applications A. d Aspremont. Convex Optimization M2. 2/57 Outline Geometrical problems Approximation problems Combinatorial
More informationStochastic optimization in Hilbert spaces
Stochastic optimization in Hilbert spaces Aymeric Dieuleveut Aymeric Dieuleveut Stochastic optimization Hilbert spaces 1 / 48 Outline Learning vs Statistics Aymeric Dieuleveut Stochastic optimization Hilbert
More informationKernel Methods. Machine Learning A W VO
Kernel Methods Machine Learning A 708.063 07W VO Outline 1. Dual representation 2. The kernel concept 3. Properties of kernels 4. Examples of kernel machines Kernel PCA Support vector regression (Relevance
More informationBeyond stochastic gradient descent for large-scale machine learning
Beyond stochastic gradient descent for large-scale machine learning Francis Bach INRIA - Ecole Normale Supérieure, Paris, France Joint work with Eric Moulines - October 2014 Big data revolution? A new
More informationMachine Learning. Support Vector Machines. Fabio Vandin November 20, 2017
Machine Learning Support Vector Machines Fabio Vandin November 20, 2017 1 Classification and Margin Consider a classification problem with two classes: instance set X = R d label set Y = { 1, 1}. Training
More informationLecture 10: A brief introduction to Support Vector Machine
Lecture 10: A brief introduction to Support Vector Machine Advanced Applied Multivariate Analysis STAT 2221, Fall 2013 Sungkyu Jung Department of Statistics, University of Pittsburgh Xingye Qiao Department
More informationLinear Models for Classification
Linear Models for Classification Oliver Schulte - CMPT 726 Bishop PRML Ch. 4 Classification: Hand-written Digit Recognition CHINE INTELLIGENCE, VOL. 24, NO. 24, APRIL 2002 x i = t i = (0, 0, 0, 1, 0, 0,
More informationKriging models with Gaussian processes - covariance function estimation and impact of spatial sampling
Kriging models with Gaussian processes - covariance function estimation and impact of spatial sampling François Bachoc former PhD advisor: Josselin Garnier former CEA advisor: Jean-Marc Martinez Department
More informationAdaptive Gradient Methods AdaGrad / Adam
Case Study 1: Estimating Click Probabilities Adaptive Gradient Methods AdaGrad / Adam Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade 1 The Problem with GD (and SGD)
More informationNONLINEAR CLASSIFICATION AND REGRESSION. J. Elder CSE 4404/5327 Introduction to Machine Learning and Pattern Recognition
NONLINEAR CLASSIFICATION AND REGRESSION Nonlinear Classification and Regression: Outline 2 Multi-Layer Perceptrons The Back-Propagation Learning Algorithm Generalized Linear Models Radial Basis Function
More informationMachine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function.
Bayesian learning: Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function. Let y be the true label and y be the predicted
More informationESANN'2001 proceedings - European Symposium on Artificial Neural Networks Bruges (Belgium), April 2001, D-Facto public., ISBN ,
Sparse Kernel Canonical Correlation Analysis Lili Tan and Colin Fyfe 2, Λ. Department of Computer Science and Engineering, The Chinese University of Hong Kong, Hong Kong. 2. School of Information and Communication
More informationSlide credit from Hung-Yi Lee & Richard Socher
Slide credit from Hung-Yi Lee & Richard Socher 1 Review Recurrent Neural Network 2 Recurrent Neural Network Idea: condition the neural network on all previous words and tie the weights at each time step
More informationFitting Linear Statistical Models to Data by Least Squares I: Introduction
Fitting Linear Statistical Models to Data by Least Squares I: Introduction Brian R. Hunt and C. David Levermore University of Maryland, College Park Math 420: Mathematical Modeling January 25, 2012 version
More informationBrief Introduction to Machine Learning
Brief Introduction to Machine Learning Yuh-Jye Lee Lab of Data Science and Machine Intelligence Dept. of Applied Math. at NCTU August 29, 2016 1 / 49 1 Introduction 2 Binary Classification 3 Support Vector
More informationCPSC 540: Machine Learning
CPSC 540: Machine Learning Matrix Notation Mark Schmidt University of British Columbia Winter 2017 Admin Auditting/registration forms: Submit them at end of class, pick them up end of next class. I need
More informationECE G: Special Topics in Signal Processing: Sparsity, Structure, and Inference
ECE 18-898G: Special Topics in Signal Processing: Sparsity, Structure, and Inference Sparse Recovery using L1 minimization - algorithms Yuejie Chi Department of Electrical and Computer Engineering Spring
More informationSupport Vector Machines.
Support Vector Machines www.cs.wisc.edu/~dpage 1 Goals for the lecture you should understand the following concepts the margin slack variables the linear support vector machine nonlinear SVMs the kernel
More informationECE521 week 3: 23/26 January 2017
ECE521 week 3: 23/26 January 2017 Outline Probabilistic interpretation of linear regression - Maximum likelihood estimation (MLE) - Maximum a posteriori (MAP) estimation Bias-variance trade-off Linear
More informationc 4, < y 2, 1 0, otherwise,
Fundamentals of Big Data Analytics Univ.-Prof. Dr. rer. nat. Rudolf Mathar Problem. Probability theory: The outcome of an experiment is described by three events A, B and C. The probabilities Pr(A) =,
More informationSupport Vector Machines
Wien, June, 2010 Paul Hofmarcher, Stefan Theussl, WU Wien Hofmarcher/Theussl SVM 1/21 Linear Separable Separating Hyperplanes Non-Linear Separable Soft-Margin Hyperplanes Hofmarcher/Theussl SVM 2/21 (SVM)
More informationA summary of Deep Learning without Poor Local Minima
A summary of Deep Learning without Poor Local Minima by Kenji Kawaguchi MIT oral presentation at NIPS 2016 Learning Supervised (or Predictive) learning Learn a mapping from inputs x to outputs y, given
More informationFeedforward Neural Networks. Michael Collins, Columbia University
Feedforward Neural Networks Michael Collins, Columbia University Recap: Log-linear Models A log-linear model takes the following form: p(y x; v) = exp (v f(x, y)) y Y exp (v f(x, y )) f(x, y) is the representation
More informationA GENERAL FORMULATION FOR SUPPORT VECTOR MACHINES. Wei Chu, S. Sathiya Keerthi, Chong Jin Ong
A GENERAL FORMULATION FOR SUPPORT VECTOR MACHINES Wei Chu, S. Sathiya Keerthi, Chong Jin Ong Control Division, Department of Mechanical Engineering, National University of Singapore 0 Kent Ridge Crescent,
More informationEE613 Machine Learning for Engineers. Kernel methods Support Vector Machines. jean-marc odobez 2015
EE613 Machine Learning for Engineers Kernel methods Support Vector Machines jean-marc odobez 2015 overview Kernel methods introductions and main elements defining kernels Kernelization of k-nn, K-Means,
More informationLarge sample covariance matrices and the T 2 statistic
Large sample covariance matrices and the T 2 statistic EURANDOM, the Netherlands Joint work with W. Zhou Outline 1 2 Basic setting Let {X ij }, i, j =, be i.i.d. r.v. Write n s j = (X 1j,, X pj ) T and
More informationDiscussion About Nonlinear Time Series Prediction Using Least Squares Support Vector Machine
Commun. Theor. Phys. (Beijing, China) 43 (2005) pp. 1056 1060 c International Academic Publishers Vol. 43, No. 6, June 15, 2005 Discussion About Nonlinear Time Series Prediction Using Least Squares Support
More informationCheng Soon Ong & Christian Walder. Canberra February June 2018
Cheng Soon Ong & Christian Walder Research Group and College of Engineering and Computer Science Canberra February June 2018 (Many figures from C. M. Bishop, "Pattern Recognition and ") 1of 305 Part VII
More informationAdvanced Introduction to Machine Learning
10-715 Advanced Introduction to Machine Learning Homework Due Oct 15, 10.30 am Rules Please follow these guidelines. Failure to do so, will result in loss of credit. 1. Homework is due on the due date
More informationIntroduction to (Convolutional) Neural Networks
Introduction to (Convolutional) Neural Networks Philipp Grohs Summer School DL and Vis, Sept 2018 Syllabus 1 Motivation and Definition 2 Universal Approximation 3 Backpropagation 4 Stochastic Gradient
More informationDay 3 Lecture 3. Optimizing deep networks
Day 3 Lecture 3 Optimizing deep networks Convex optimization A function is convex if for all α [0,1]: f(x) Tangent line Examples Quadratics 2-norms Properties Local minimum is global minimum x Gradient
More information