Oslo Class 4 Early Stopping and Spectral Regularization

1 Oslo Class 4 Early Stopping and Spectral Regularization Lorenzo Rosasco UNIGE-MIT-IIT June 28, 2016

2 Learning problem
Solve $\min_w E(w)$, where $E(w) = \int d\rho(x, y)\, \ell(w^\top x, y)$, given $(x_1, y_1), \dots, (x_n, y_n)$.
Beyond linear models: non-linear features and kernels.

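3 Regularization by penalization
Replace $\min_w E(w)$ by
$$\min_w \underbrace{\hat E(w) + \lambda \|w\|^2}_{\hat E_\lambda(w)}, \qquad \hat E(w) = \frac{1}{n} \sum_{i=1}^n \ell(w^\top x_i, y_i),$$
where $\lambda > 0$ is the regularization parameter.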

4 Loss functions and computational methods
Logistic loss: $\log(1 + e^{-y w^\top x})$
Hinge loss: $|1 - y w^\top x|_+$
$$w_{t+1} = w_t - \gamma_t \nabla \hat E_\lambda(w_t) \ \dots$$

5 Square loss
$$(y - w^\top x)^2 = (1 - y w^\top x)^2 \quad \text{for } y \in \{-1, 1\}$$
[Figure: the square loss plotted alongside the hinge loss and the logistic loss.]

6-7 Ridge regression / Tikhonov regression
$$\hat E_\lambda(w) = \underbrace{\frac{1}{n} \|\hat X w - \hat y\|^2 + \lambda \|w\|^2}_{\text{smooth and strongly convex}}$$
where $\hat X$ is the $n \times d$ data matrix and $\hat y$ is the $n \times 1$ output vector.
$$\nabla \hat E_\lambda(w) = \frac{2}{n} \hat X^\top (\hat X w - \hat y) + 2 \lambda w = 0 \quad \Longrightarrow \quad (\hat X^\top \hat X + \lambda n I) w = \hat X^\top \hat y$$

8 Linear systems
$$(\hat X^\top \hat X + \lambda n I) w = \hat X^\top \hat y$$
- $n d^2$ operations to form $\hat X^\top \hat X$
- roughly $d^3$ operations to solve the linear system
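The normal equations above can be solved directly with a dense linear solver. The following is a minimal NumPy sketch, not a reference implementation; the function name `ridge_fit` and the synthetic data are assumptions made for illustration.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Solve (X^T X + lam * n * I) w = X^T y for w (hypothetical helper)."""
    n, d = X.shape
    A = X.T @ X + lam * n * np.eye(d)    # forming X^T X costs about n d^2
    return np.linalg.solve(A, X.T @ y)   # solving costs roughly d^3

# Synthetic data, assumed only for the demonstration.
rng = np.random.default_rng(0)
n, d, lam = 50, 5, 1e-3
X = rng.standard_normal((n, d))
y = X @ np.arange(1.0, d + 1.0) + 0.1 * rng.standard_normal(n)

w_hat = ridge_fit(X, y, lam)

# The gradient of the regularized empirical risk should vanish at w_hat.
grad = (2.0 / n) * X.T @ (X @ w_hat - y) + 2.0 * lam * w_hat
```

Checking that `grad` is numerically zero confirms that `w_hat` satisfies the first-order optimality condition.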

9-10 Representer theorem for square loss
$$f(x) = x^\top w \quad \Longrightarrow \quad f(x) = \sum_{i=1}^n x^\top x_i\, c_i$$
Proof: using the SVD of $\hat X$,
$$w = (\hat X^\top \hat X + \lambda n I)^{-1} \hat X^\top \hat y = \hat X^\top \underbrace{(\hat X \hat X^\top + \lambda n I)^{-1} \hat y}_{c} \quad \Longrightarrow \quad w = \hat X^\top c = \sum_{i=1}^n x_i c_i$$
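The identity in the proof can be checked numerically. This sketch compares the $d \times d$ (primal) and $n \times n$ (dual) solves on random data; the data and variable names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, lam = 8, 3, 0.1
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

# Primal form: w = (X^T X + lam n I)^{-1} X^T y  (a d x d system).
w_primal = np.linalg.solve(X.T @ X + lam * n * np.eye(d), X.T @ y)

# Dual form: c = (X X^T + lam n I)^{-1} y, then w = X^T c  (an n x n system).
c = np.linalg.solve(X @ X.T + lam * n * np.eye(n), y)
w_dual = X.T @ c

# Prediction at a new point, written as a sum of inner products x^T x_i c_i.
x_new = rng.standard_normal(d)
f_primal = x_new @ w_primal
f_dual = sum((x_new @ X[i]) * c[i] for i in range(n))
```

Both forms give the same function; which one is cheaper depends on whether $d$ or $n$ is smaller.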

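11 Beyond linear models
$$f(x) = x^\top w = \sum_{i=1}^n x^\top x_i\, c_i, \qquad w = (\hat X^\top \hat X + \lambda n I)^{-1} \hat X^\top \hat y, \qquad c = (\hat X \hat X^\top + \lambda n I)^{-1} \hat y$$
- non-linear features: $x \mapsto \phi(x) = (\phi_1(x), \dots, \phi_p(x))$, $f(x) = \phi(x)^\top w$
- non-linear kernels: $\hat X \hat X^\top = \hat K$, $f(x) = \sum_{i=1}^n K(x, x_i)\, c_i$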
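Replacing inner products by a kernel gives kernel ridge regression. The sketch below assumes a Gaussian kernel, which the slide does not specify; the helper names (`gaussian_kernel`, `kernel_ridge_fit`) and the toy data are hypothetical.

```python
import numpy as np

def gaussian_kernel(A, B, width=1.0):
    """Gaussian kernel matrix between rows of A and rows of B (assumed kernel)."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * width**2))

def kernel_ridge_fit(K, y, lam):
    """Solve (K + lam * n * I) c = y for the coefficients c."""
    n = K.shape[0]
    return np.linalg.solve(K + lam * n * np.eye(n), y)

# Toy one-dimensional regression problem, assumed for illustration.
rng = np.random.default_rng(4)
n, lam = 40, 1e-3
X = rng.uniform(-3.0, 3.0, size=(n, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(n)

K = gaussian_kernel(X, X)
c = kernel_ridge_fit(K, y, lam)

# f(x) = sum_i K(x, x_i) c_i evaluated at a few test points.
X_test = np.linspace(-3.0, 3.0, 5)[:, None]
f_test = gaussian_kernel(X_test, X) @ c
```

With the linear kernel $\hat K = \hat X \hat X^\top$ the same code reproduces the linear ridge predictions.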

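12 Interlude: linear systems and stability
$$A w = y, \qquad A = \operatorname{diag}(a_1, \dots, a_d), \qquad c = \frac{a_1}{a_d} < \infty$$
$$w = A^{-1} y, \qquad A^{-1} = \operatorname{diag}(a_1^{-1}, \dots, a_d^{-1})$$
More generally,
$$A = U \Sigma U^\top, \quad \Sigma = \operatorname{diag}(\sigma_1, \dots, \sigma_d), \qquad A^{-1} = U \Sigma^{-1} U^\top, \quad \Sigma^{-1} = \operatorname{diag}(\sigma_1^{-1}, \dots, \sigma_d^{-1})$$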
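A small numerical example may help show why the ratio $a_1 / a_d$ matters. In this sketch a diagonal system with one tiny entry amplifies a small perturbation of $y$; the specific numbers are chosen only for illustration.

```python
import numpy as np

a = np.array([1.0, 1e-6])           # diagonal entries, a_1 >= a_d > 0
A = np.diag(a)
cond = a[0] / a[-1]                 # c = a_1 / a_d

y = np.array([1.0, 1.0])
delta = np.array([0.0, 1e-3])       # small perturbation of the data

w = y / a                           # w = A^{-1} y for a diagonal A
w_perturbed = (y + delta) / a

# Ratio between the change in the solution and the change in the data.
amplification = np.linalg.norm(w_perturbed - w) / np.linalg.norm(delta)
```

Here the perturbation is aligned with the smallest entry, so the amplification equals $1 / a_d$, which coincides with $c$ because $a_1 = 1$.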

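13 Tikhonov regularization
$$\frac{1}{n} \sum_{i=1}^n (y_i - w^\top x_i)^2 \quad \longrightarrow \quad \frac{1}{n} \sum_{i=1}^n (y_i - w^\top x_i)^2 + \lambda \|w\|^2$$
$$\hat X^\top \hat X w = \hat X^\top \hat y \quad \longrightarrow \quad (\hat X^\top \hat X + \lambda n I) w = \hat X^\top \hat y$$
Overfitting and numerical stability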

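14 Beyond Tikhonov: TSVD
$$\hat X^\top \hat X = V \Sigma V^\top, \qquad w_M = (\hat X^\top \hat X)^{-1}_M \hat X^\top \hat y$$
$$(\hat X^\top \hat X)^{-1}_M = V \Sigma^{-1}_M V^\top, \qquad \Sigma^{-1}_M = \operatorname{diag}(\sigma_1^{-1}, \dots, \sigma_M^{-1}, 0, \dots, 0)$$
Also known as principal component regression (PCR).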
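The truncated inverse can be sketched with an eigendecomposition of $\hat X^\top \hat X$. The function name `tsvd_fit` and the random data are assumptions for illustration, and the example assumes $\hat X$ has full column rank so that all eigenvalues are positive.

```python
import numpy as np

def tsvd_fit(X, y, M):
    """Return w_M = V Sigma_M^{-1} V^T X^T y, keeping the M largest eigenvalues."""
    sigma, V = np.linalg.eigh(X.T @ X)       # eigenvalues in ascending order
    sigma, V = sigma[::-1], V[:, ::-1]       # reorder to descending
    inv = np.zeros_like(sigma)
    inv[:M] = 1.0 / sigma[:M]                # invert only the top M eigenvalues
    return V @ (inv * (V.T @ (X.T @ y)))

rng = np.random.default_rng(0)
n, d = 60, 6
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

# One solution per truncation level M = 1, ..., d.
solutions = [tsvd_fit(X, y, M) for M in range(1, d + 1)]
norms = [np.linalg.norm(w) for w in solutions]
```

With $M = d$ nothing is truncated and the result is the ordinary least squares solution; smaller $M$ gives solutions of smaller norm.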

15 Principal component analysis (PCA)
Dimensionality reduction: $\hat X^\top \hat X = V \Sigma V^\top$
Eigenfunctions are directions of
- maximum variance
- best reconstruction

16 TSVD and PCA
TSVD $\Leftrightarrow$ PCA + ERM
Regularization by projection
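The correspondence between TSVD and PCA followed by ERM can be checked on a small example. This sketch projects the data onto the top $M$ eigenvectors of $\hat X^\top \hat X$ (uncentered, matching the formulas in these slides), solves least squares in the projected coordinates, and compares with a truncated SVD of $\hat X$; the data are assumed for illustration.

```python
import numpy as np

rng = np.random.default_rng(6)
n, d, M = 50, 5, 2
# Columns with different scales, so the principal directions are well separated.
X = rng.standard_normal((n, d)) * np.array([5.0, 3.0, 1.0, 0.5, 0.1])
y = rng.standard_normal(n)

# PCA step: top-M eigenvectors of X^T X.
sigma, V = np.linalg.eigh(X.T @ X)           # ascending order
V_M = V[:, ::-1][:, :M]                      # d x M, largest eigenvalues first
Z = X @ V_M                                  # projected data, n x M

# ERM step: unpenalized least squares on the projected data.
beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
w_pcr = V_M @ beta                           # map back to the original coordinates

# TSVD via the SVD of X: w_M = V_M S_M^{-1} U_M^T y.
U, s, Vt = np.linalg.svd(X, full_matrices=False)   # singular values descending
w_tsvd = Vt[:M].T @ ((U[:, :M].T @ y) / s[:M])
```

The two vectors agree up to numerical error, which is the sense in which projection acts as a regularizer here.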

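17 TSVD/PCR beyond linearity
Non-linear function
$$f(x) = \sum_{i=1}^p w_i \phi_i(x) = \Phi(x)^\top w, \qquad \text{with } w = (\hat\Phi^\top \hat\Phi)^{-1}_M \hat\Phi^\top \hat y$$
Let $\hat\Phi = (\Phi(x_1), \dots, \Phi(x_n))^\top \in \mathbb{R}^{n \times p}$.
$$\hat\Phi^\top \hat\Phi = V \Sigma V^\top, \qquad \Sigma = \operatorname{diag}(\sigma_1, \dots, \sigma_p)$$
$$(\hat\Phi^\top \hat\Phi)^{-1}_M = V \Sigma^{-1}_M V^\top, \qquad \Sigma^{-1}_M = \operatorname{diag}(\sigma_1^{-1}, \dots, \sigma_M^{-1}, 0, \dots)$$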

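18 TSVD/PCR with kernels
$$f(x) = \sum_{i=1}^n K(x, x_i)\, c_i, \qquad c = (\hat K)^{-1}_M \hat y$$
$$\hat K_{ij} = K(x_i, x_j), \qquad \hat K = U \Sigma U^\top, \qquad \Sigma = \operatorname{diag}(\sigma_1, \dots, \sigma_n)$$
$$\hat K^{-1}_M = U \Sigma^{-1}_M U^\top, \qquad \Sigma^{-1}_M = \operatorname{diag}(\sigma_1^{-1}, \dots, \sigma_M^{-1}, 0, \dots)$$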

19 Early stopping regularization
Another example of regularization: early stopping of an iterative procedure applied to noisy data.

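20 Gradient descent for square loss
$$w_{t+1} = w_t - \gamma \hat X^\top (\hat X w_t - \hat y), \qquad \sum_{i=1}^n (y_i - w^\top x_i)^2 = \|\hat X w - \hat y\|^2$$
- no penalty
- step size chosen a priori: $\gamma = \dfrac{2}{\|\hat X^\top \hat X\|}$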
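The iteration can be sketched in a few lines of NumPy. As an assumption, this example starts from $w_0 = 0$ and uses the step size $\gamma = 1 / \|\hat X^\top \hat X\|$ (spectral norm), which lies strictly inside the range where the iteration is stable; the function name and data are hypothetical.

```python
import numpy as np

def gd_least_squares(X, y, gamma, t):
    """Run t steps of w <- w - gamma * X^T (X w - y) from w = 0; return all iterates."""
    w = np.zeros(X.shape[1])
    path = [w.copy()]
    for _ in range(t):
        w = w - gamma * X.T @ (X @ w - y)    # each step costs O(n d)
        path.append(w.copy())
    return path

rng = np.random.default_rng(0)
X = rng.standard_normal((30, 10))
y = X @ rng.standard_normal(10) + 0.5 * rng.standard_normal(30)

# Assumed step size: 1 / ||X^T X||, with ||.|| the spectral norm.
gamma = 1.0 / np.linalg.norm(X.T @ X, 2)
path = gd_least_squares(X, y, gamma, t=200)

# Training error ||X w_t - y||^2 along the path.
train_err = [np.sum((X @ w - y) ** 2) for w in path]
```

The training error decreases at every step; early stopping consists in returning an intermediate iterate from `path` rather than the last one, with the stopping time typically chosen on held-out data.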

21-24 Early stopping at work
[Figure sequence: "Fitting on the training set", shown as the iteration number increases.]

25 Semi-convergence
$$\min_w E(w) \quad \text{vs} \quad \min_w \hat E(w)$$

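26 Connection to Tikhonov or TSVD
$$w_{t+1} = w_t - \gamma \hat X^\top (\hat X w_t - \hat y) = (I - \gamma \hat X^\top \hat X) w_t + \gamma \hat X^\top \hat y$$
By induction (with $w_0 = 0$),
$$w_t = \underbrace{\gamma \sum_{j=0}^{t-1} (I - \gamma \hat X^\top \hat X)^j}_{\text{truncated power series}} \hat X^\top \hat y$$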
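The closed form obtained by induction can be compared against the iteration itself. This sketch assumes $w_0 = 0$ and the step size $\gamma = 1 / \|\hat X^\top \hat X\|$; the data are random and serve only as an illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, t = 20, 4, 15
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)
gamma = 1.0 / np.linalg.norm(X.T @ X, 2)     # assumed step size

# Iterative form: t gradient steps starting from w = 0.
w_iter = np.zeros(d)
for _ in range(t):
    w_iter = w_iter - gamma * X.T @ (X @ w_iter - y)

# Closed form: gamma * sum_{j=0}^{t-1} (I - gamma X^T X)^j X^T y.
B = np.eye(d) - gamma * X.T @ X
S = sum(np.linalg.matrix_power(B, j) for j in range(t))
w_series = gamma * S @ (X.T @ y)
```

The two vectors coincide up to rounding, although the iterative form never builds a $d \times d$ matrix power.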

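27-29 Neumann series
$$\gamma \sum_{j=0}^{t-1} (I - \gamma \hat X^\top \hat X)^j$$
For a scalar $a$ with $|a| < 1$,
$$(1 - a)^{-1} = \sum_{j=0}^{\infty} a^j \quad \Longrightarrow \quad a^{-1} = \sum_{j=0}^{\infty} (1 - a)^j,$$
where the second series is obtained by substituting $1 - a$ for $a$ and therefore converges when $|1 - a| < 1$.
For $A \in \mathbb{R}^{d \times d}$ with $\|A\| < 1$, invertible,
$$A^{-1} = \sum_{j=0}^{\infty} (I - A)^j$$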

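30 Stable matrix inversion
Truncated Neumann series:
$$(\hat X^\top \hat X)^{-1} = \gamma \sum_{j=0}^{\infty} (I - \gamma \hat X^\top \hat X)^j \approx \gamma \sum_{j=0}^{t-1} (I - \gamma \hat X^\top \hat X)^j$$
Note that
$$\gamma \sum_{j=0}^{t-1} (I - \gamma \hat X^\top \hat X)^j = \left(I - (I - \gamma \hat X^\top \hat X)^t\right) (\hat X^\top \hat X)^{-1}$$
and compare to $(\hat X^\top \hat X + \lambda n I)^{-1}$.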
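The comparison with Tikhonov becomes more concrete on a single eigenvalue. The following worked equations are a sketch for one eigenvalue $\sigma > 0$ of $\hat X^\top \hat X$, assuming $|1 - \gamma\sigma| < 1$; the small-eigenvalue approximation is heuristic.

```latex
% Finite geometric sum for one eigenvalue sigma of \hat X^\top \hat X.
\gamma \sum_{j=0}^{t-1} (1 - \gamma\sigma)^j
  = \gamma \, \frac{1 - (1 - \gamma\sigma)^t}{1 - (1 - \gamma\sigma)}
  = \frac{1 - (1 - \gamma\sigma)^t}{\sigma}
  \;\xrightarrow[t \to \infty]{}\; \frac{1}{\sigma}.

% Small eigenvalues (\gamma\sigma t \ll 1): (1 - \gamma\sigma)^t \approx 1 - \gamma\sigma t, so
\frac{1 - (1 - \gamma\sigma)^t}{\sigma} \approx \gamma t,
\qquad \text{while} \qquad
\frac{1}{\sigma + \lambda n} \approx \frac{1}{\lambda n}
\quad (\sigma \ll \lambda n).
```

Both filters leave large eigenvalues close to $1/\sigma$ and cap the inverse of small ones, so heuristically $1/(\gamma t)$ plays the role of $\lambda n$: fewer iterations correspond to stronger regularization.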

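31 Spectral filtering
Different instances of the same principle.
- Tikhonov: $w_\lambda = (\hat X^\top \hat X + \lambda n I)^{-1} \hat X^\top \hat y$
- Early stopping: $w_t = \gamma \sum_{j=0}^{t-1} (I - \gamma \hat X^\top \hat X)^j \hat X^\top \hat y$
- TSVD: $w_M = (\hat X^\top \hat X)^{-1}_M \hat X^\top \hat y$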
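One way to see the common principle is to write each method as a scalar function applied to the eigenvalues of $\hat X^\top \hat X$. The sketch below does this for the three estimators; the helper `filtered_solution`, the filter function names, and the parameter values are assumptions for illustration, and $\hat X$ is assumed to have full column rank.

```python
import numpy as np

def filtered_solution(X, y, g):
    """Return V g(Sigma) V^T X^T y, where X^T X = V Sigma V^T."""
    sigma, V = np.linalg.eigh(X.T @ X)
    return V @ (g(sigma) * (V.T @ (X.T @ y)))

rng = np.random.default_rng(3)
n, d = 40, 6
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)
lam, t, M = 0.1, 25, 3
gamma = 1.0 / np.linalg.norm(X.T @ X, 2)     # assumed step size

def tikhonov_filter(s):
    return 1.0 / (s + lam * n)

def early_stopping_filter(s):
    # Closed form of gamma * sum_{j<t} (1 - gamma s)^j.
    return (1.0 - (1.0 - gamma * s) ** t) / s

def tsvd_filter(s):
    keep = s >= np.sort(s)[-M]               # keep the M largest eigenvalues
    return np.where(keep, 1.0 / s, 0.0)

w_tik = filtered_solution(X, y, tikhonov_filter)
w_gd = filtered_solution(X, y, early_stopping_filter)
w_tsvd = filtered_solution(X, y, tsvd_filter)
```

Each filter approximates $1/\sigma$ on large eigenvalues and damps or discards small ones; only the shape of the damping differs.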

32 Statistics and optimization
$$w_t = \gamma \sum_{j=0}^{t-1} (I - \gamma \hat X^\top \hat X)^j \hat X^\top \hat y$$
The difference is in the computations:
$$w_{t+1} = w_t - \gamma \hat X^\top (\hat X w_t - \hat y)$$
- Tikhonov: $O(n d^2 + d^3)$
- TSVD: $O(n d^2 + d^2 M)$
- GD: $O(n d t)$

33 Regularization path and warm restart
$$\min_w E(w) \quad \text{vs} \quad \min_w \hat E(w)$$

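34 Beyond linear models
Non-linear function
$$f(x) = \sum_{i=1}^p w_i \phi_i(x) = \Phi(x)^\top w$$
- Replace $x$ by $\Phi(x) = (\phi_1(x), \dots, \phi_p(x))$
- Replace $\hat X$ by $\hat\Phi = (\Phi(x_1), \dots, \Phi(x_n))^\top \in \mathbb{R}^{n \times p}$
$$w_{t+1} = w_t - \gamma \hat\Phi^\top (\hat\Phi w_t - \hat y)$$
Computational cost: $O(n p t)$.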

35 Early stopping and kernels
By induction,
$$f(x) = \sum_{i=1}^n K(x_i, x)\, c_i, \qquad c_{t+1} = c_t - \gamma (\hat K c_t - \hat y), \qquad \hat K_{ij} = K(x_i, x_j)$$
Computational complexity: $O(n^2 t)$.
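The coefficient update can be sketched directly on the kernel matrix. To make the link with the linear iteration visible, this example uses the linear kernel $\hat K = \hat X \hat X^\top$, for which $w_t = \hat X^\top c_t$; the function name, the step size $\gamma = 1 / \|\hat K\|$, and the starting point $c_0 = 0$ are assumptions.

```python
import numpy as np

def kernel_gd(K, y, gamma, t):
    """Run t steps of c <- c - gamma * (K c - y), starting from c = 0."""
    c = np.zeros(K.shape[0])
    for _ in range(t):
        c = c - gamma * (K @ c - y)          # one n x n matrix-vector product per step
    return c

rng = np.random.default_rng(7)
n, d, t = 25, 4, 30
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

K = X @ X.T                                  # linear kernel, used only for the comparison
gamma = 1.0 / np.linalg.norm(K, 2)           # assumed step size
c_t = kernel_gd(K, y, gamma, t)

# Linear gradient descent with the same step size and number of steps.
w_t = np.zeros(d)
for _ in range(t):
    w_t = w_t - gamma * X.T @ (X @ w_t - y)
```

For a non-linear kernel only the construction of `K` changes; the update and its $O(n^2)$ cost per step stay the same.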

36 What about other loss functions?
- PCA + ERM
- Gradient / subgradient descent
Iterations for regularization, not only optimization!

37 Going big...
Bottleneck of kernel methods: memory. $\hat K$ is $O(n^2)$.

38-40 Approaches to large scale
- (Random) features: find $\Phi : X \to \mathbb{R}^M$, with $M \ll n$, such that $K(x, x') \approx \Phi(x)^\top \Phi(x')$
- Subsampling (Nyström): replace $f(x) = \sum_{i=1}^n K(x, x_i)\, c_i$ by $f(x) = \sum_{i=1}^M K(x, x_i)\, c_i$, with the $x_i$ subsampled from the training set, $M \ll n$
- Greedy!
- Neural nets
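As one concrete instance of the random features idea, the sketch below uses random Fourier features for a Gaussian kernel. The slides do not name a specific construction, so the kernel, the feature map, the helper name `make_random_fourier_features`, and all parameter values are assumptions for illustration.

```python
import numpy as np

def make_random_fourier_features(d, M, width, rng):
    """Return a map Phi: R^d -> R^M with Phi(x)^T Phi(x') ~ exp(-||x - x'||^2 / (2 width^2))."""
    W = rng.standard_normal((d, M)) / width          # random frequencies
    b = rng.uniform(0.0, 2.0 * np.pi, size=M)        # random phases
    def phi(X):
        return np.sqrt(2.0 / M) * np.cos(X @ W + b)
    return phi

rng = np.random.default_rng(5)
n, d, M, width = 20, 3, 2000, 1.5
X = rng.standard_normal((n, d))

phi = make_random_fourier_features(d, M, width, rng)
Z = phi(X)                                           # n x M feature matrix

# Exact Gaussian kernel matrix, for comparison.
sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K_exact = np.exp(-sq / (2.0 * width**2))

max_err = np.max(np.abs(Z @ Z.T - K_exact))
```

In this small demo $M$ is larger than $n$ so that the approximation error is easy to see; the approach pays off when $n$ is much larger than $M$, since linear methods can then be run on `Z` at cost depending on $M$ instead of storing the $n \times n$ kernel matrix.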

41 This class
- Regularization beyond penalization
- Regularization by projection
- Regularization by early stopping

42 Next class
- Multioutput learning
- Multitask learning
- Vector valued learning
- Multiclass learning
