The Dynamics of Learning: A Random Matrix Approach

Size: px

Start display at page:

Download "The Dynamics of Learning: A Random Matrix Approach"

Tobias Welch
5 years ago
Views:

The Dynamics of Learning: A Random Matrix Approach ICML 2018, Stockholm, Sweden Zhenyu Liao, Romain Couillet L2S, CentraleSupélec, Université Paris-Saclay, France GSTATS IDEX

1 The Dynamics of Learning: A Random Matrix Approach ICML 2018, Stockholm, Sweden Zhenyu Liao, Romain Couillet L2S, CentraleSupélec, Université Paris-Saclay, France GSTATS IDEX DataScience Chair, GIPSA-lab, Université Grenoble-Alpes, France. Z. Liao, R. Couillet (CentraleSupélec & UG-A) The Dynamics of Learning: A RMT Approach ICML 2018, Stockholm, SWEDEN 1 / 16

2 Outline 1 Motivation 2 Problem Statement 3 Main Results 4 Summary Z. Liao, R. Couillet (CentraleSupélec & UG-A) The Dynamics of Learning: A RMT Approach ICML 2018, Stockholm, SWEDEN 2 / 16

3 Motivation About deep learning: Some known facts: trained with backpropagation (gradient decent) has achieved superhuman performance in many applications highly over-parameterized, but some still generalize remarkably well in practice! and some (more) mysteries: how do neural networks learn from training data? what features are learned? why they generalize without overfitting? memorize or generalize? can the network performance be guaranteed or... even predicted? The learning dynamics of neural networks! In particular: under so-called double asymptotic regime (RMT regime): number of network parameters and number of data instances comparably large! In this work: A general RMT framework for studying learning dynamics of a single-layer network! As a consequence, more insights on: random initialization of training overfitting in neural networks (explicit or implicit) regularization: early stopping, l 2 -penalization Z. Liao, R. Couillet (CentraleSupélec & UG-A) The Dynamics of Learning: A RMT Approach ICML 2018, Stockholm, SWEDEN 4 / 16

4 Problem Setup A toy model of binary classification: Gaussian mixture data Consider data x i drawn from a two-class Gaussian mixture model: for a = 1, 2 x i C a x i = ( 1) a µ + z i with z i N (0 p, I p). With label y i = 1 for C 1 and +1 for C 2. Objective: Learning dynamics Gradient descent on loss L(w) = 1 2n yt w T X 2 with X = [x 1,..., x n]. rate α, with continuous-time approximation: dw(t) dt = α L(w) w = α n X ( y X T w(t) ) For small learning of explicit solution w(t) = e αt n XXT w 0 + ( I p e αt n XXT ) (XX T ) 1 Xy if XX T invertible and w 0 the initialization of gradient descent. projection of eigenvector weighted by exp( αtλ) of eigenvalue λ functional of sample covariance matrix 1 n XXT : Random Matrix Theory is the answer! Z. Liao, R. Couillet (CentraleSupélec & UG-A) The Dynamics of Learning: A RMT Approach ICML 2018, Stockholm, SWEDEN 6 / 16

5 Problem Setup Objective: Test performance Test performance for a new ˆx: Since ˆx Gaussian and independent of w(t): P(w(t) Tˆx > 0 ˆx C 1 ), P(w(t) Tˆx < 0 ˆx C 2 ). w(t) Tˆx N (±w(t) T µ, w(t) 2 ) ( recall w(t) = e αt n XXT w 0 + I p e αt XXT) n (XX T ) 1 Xy. With RMT: although X random: w(t) T µ and w(t) 2 have asymptotically deterministic behavior (only depends on data statistics and dimensions): the technique of deterministic equivalent Cauchy s integral formula to express the functional exp( ) via contour integration Network performance at any time is in fact deterministic and predictable! Z. Liao, R. Couillet (CentraleSupélec & UG-A) The Dynamics of Learning: A RMT Approach ICML 2018, Stockholm, SWEDEN 7 / 16

6 Proposed analysis framework Resolvent and deterministic equivalents Consider an n n Hermitian random matrix M. Define its resolvent Q M (z), for z C not eigenvalue of M Q M (z) = (M zi n) 1. For certain simple distributions of M, define a so-called deterministic equivalent Q M of Q M : a deterministic matrix such that 1 n tr (AQ M) 1 n tr ( A Q M ) 0 a T ( Q M Q M ) b 0 almost surely as n, with A, a, b of bounded norm (operator and Euclidean). Study Q M instead of the random Q M for n large! However, for more sophisticated functionals of M: Cauchy s integral formula Example: for f(m) = a T e M b, f(m) = 1 exp(z)a T Q M (z)bdz 1 exp(z)a T Q M (z)bdz. 2πi 2πi γ γ with γ a positively oriented path circling around all the eigenvalues of M. Z. Liao, R. Couillet (CentraleSupélec & UG-A) The Dynamics of Learning: A RMT Approach ICML 2018, Stockholm, SWEDEN 9 / 16

7 Test performance To evaluate test performance: ( w(t) Tˆx N (±w(t) T µ, w(t) 2 ) with ) w(t) = e αt n XXT w 0 + I p e αt n XXT (XX T ) 1 Xy. For w(t) T µ: Cauchy s integral formula: for f t(x) exp( αtx), ( ) µ T w(t) = 1 2πi γ µt 1 1 ( n XXT zi p ft(z)w0 + 1 f t(z) 1 Xy) dz. z n replace the random ( 1 n XXT zi p ) 1 by its deterministic equivalent. Theorem (Test Performance) Let p/n c (0, ) and the initialization w 0 be a random vector with i.i.d. entries of zero mean, variance σ 2 /p. Then, as n, with probability one P(w(t) Tˆx > 0 ˆx C 1 ) Q for E 1 1 f t (z) µ 2 m(z) dz 2πi γ z, V 1 ( µ 2 +c)m(z)+1 2πi ( E V ) 0, P(w(t) Tˆx < 0 ˆx C 2 ) Q γ ( E V ) 0 [ 1 z 2 (1 f t(z)) 2 ( µ 2 +c)m(z)+1 σ2 f 2 t (z)m(z) ] dz. γ a closed positively oriented path that contains all eigenvalues of 1 n XXT and the origin, Q(x) = 1 2π x exp( u2 /2)du and m(z) given by the popular Marčenko Pastur equation. Not really understandable, nor interpretable... Z. Liao, R. Couillet (CentraleSupélec & UG-A) The Dynamics of Learning: A RMT Approach ICML 2018, Stockholm, SWEDEN 10 / 16

8 Simplification: break the contour integration Eigenvalues of 1 n XXT Marčenko Pastur distribution Isolated eigenvalue: λ s Figure: Eigenvalue distribution of 1 n XXT for µ = [1.5; 0 p 1], p = 512, n = and c 1 = c 2 = 1/2. Main bulk ([λ, λ + ]): sum of real line integrals; isolated eigenvalue (λ s): residue theorem. (Simplified) test performance 1 ft(x) E = µ(dx), V = µ 2 + c (1 ft(x)) 2 µ(dx) x µ 2 x 2 + σ 2 f 2 t (x)ν(dx) Z. Liao, R. Couillet (CentraleSupélec & UG-A) The Dynamics of Learning: A RMT Approach ICML 2018, Stockholm, SWEDEN 11 / 16

9 Discussions (Simplified) test performance 1 ft(x) E = µ(dx), V = µ 2 + c (1 ft(x)) 2 µ(dx) x µ 2 x 2 + σ 2 ft 2 (x)ν(dx) where we recall f t(x) exp( αtx) and the popular Marčenko Pastur distribution ν(dx) (x λ ) + (λ + x) + ( ) dx πcx c δ(x) with λ (1 c) 2, λ + (1 + c) 2 and (x λ ) + (λ + x) + µ(dx) dx + ( µ 4 c) + 2π(λ s x) µ 2 δ λs (x) with λ s = c µ 2 + c/ µ 2. Some remarks: 1 µ(dx): continuous distribution [λ, λ + ] vs. Dirac measure at λ s: comparable information! 2 µ(dx) = µ 2 together with Cauchy Schwarz inequality: E 2 (1 f t (x)) 2 x 2 dµ(x) dµ(x) µ 4 µ 2 V, with equality if and only if the +c (initialization) variance σ 2 = 0. in fact performance drop due to large σ 2! 3 How much we over-fit? As t, the performance drop by a factor 1 min(c, c 1 ), with p/n c (0, ). Z. Liao, R. Couillet (CentraleSupélec & UG-A) The Dynamics of Learning: A RMT Approach ICML 2018, Stockholm, SWEDEN 12 / 16

10 Numerical validations Optimal error rates σ Optimal stopping time 400 Misclassification rate Simulation: training performance Theory: training performance Simulation: generalization performance Theory: generalization performance σ Training time (t) Figure: Training and generalization performance for MNIST data (number 1 and 7) with n = p = 784, c 1 = c 2 = 1/2, α = 0.01 and σ 2 = 0.1. Results averaged over 100 runs. Figure: Optimal performance and stopping time as function of σ 2 with c = 1/2, µ 2 = 4 and α = Z. Liao, R. Couillet (CentraleSupélec & UG-A) The Dynamics of Learning: A RMT Approach ICML 2018, Stockholm, SWEDEN 13 / 16

11 Summary Take-away messages: RMT framework to understand and predict learning dynamics: Cauchy s integral formula + technique of deterministic equivalent easily extended to more elaborate data models: e.g., Gaussian mixture model with different means and covariances a byproduct: choose the initialization variance σ 2 even smaller! Z. Liao, R. Couillet (CentraleSupélec & UG-A) The Dynamics of Learning: A RMT Approach ICML 2018, Stockholm, SWEDEN 15 / 16

12 Thank you Thank you! Any question? Poster # 189! Z. Liao, R. Couillet (CentraleSupélec & UG-A) The Dynamics of Learning: A RMT Approach ICML 2018, Stockholm, SWEDEN 16 / 16

On the Spectrum of Random Features Maps of High Dimensional Data

On the Spectrum of Random Features Maps of High Dimensional Data ICML 018, Stockholm, Sweden Zhenyu Liao, Romain Couillet LS, CentraleSupélec, Université Paris-Saclay, France GSTATS IDEX DataScience Chair,