TUM 2016 Class 3 Large scale learning by regularization


1 TUM 2016 Class 3 Large scale learning by regularization Lorenzo Rosasco UNIGE-MIT-IIT July 25, 2016

2 Learning problem Solve $\min_w \mathcal{E}(w)$, $\mathcal{E}(w) = \int d\rho(x, y)\, L(w^\top x, y)$, given $(x_1, y_1), \dots, (x_n, y_n)$. Beyond linear models: non-linear features and kernels.

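3 Regularization by penalization Replace $\min_w \mathcal{E}(w)$ by $\min_w \underbrace{\hat{\mathcal{E}}(w) + \lambda \|w\|^2}_{\hat{\mathcal{E}}_\lambda(w)}$, where $\hat{\mathcal{E}}(w) = \frac{1}{n} \sum_{i=1}^n L(w^\top x_i, y_i)$ and $\lambda > 0$ is the regularization parameter.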

4 Early stopping regularization Another example of regularization: Early stopping of an iterative procedure applied to noisy data.

5 Gradient descent for square loss $w_{t+1} = w_t - \gamma \hat X^\top (\hat X w_t - \hat y)$, where $\sum_{i=1}^n (y_i - w^\top x_i)^2 = \|\hat X w - \hat y\|^2$. No penalty. Stepsize chosen a priori: $\gamma = 2 / \|\hat X^\top \hat X\|$.
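The iteration above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the lecturer's code: the function name, the synthetic data, the starting point $w_0 = 0$ and the conservative stepsize $\gamma = 1 / \|\hat X^\top \hat X\|$ (strictly below the $2 / \|\hat X^\top \hat X\|$ bound) are all assumptions made for the example.

```python
import numpy as np

def gradient_descent_square_loss(X, y, n_iter, gamma=None):
    """Iterate w_{t+1} = w_t - gamma * X^T (X w_t - y), starting from w_0 = 0."""
    n, d = X.shape
    if gamma is None:
        # Assumption: gamma = 1 / ||X^T X|| (spectral norm), below 2 / ||X^T X||.
        gamma = 1.0 / np.linalg.norm(X.T @ X, 2)
    w = np.zeros(d)
    path = []
    for _ in range(n_iter):
        w = w - gamma * (X.T @ (X @ w - y))
        path.append(w.copy())
    return np.array(path)  # row t-1 holds the iterate w_t

# Synthetic noisy linear data (hypothetical, for illustration only).
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 5))
w_true = rng.standard_normal(5)
y = X @ w_true + 0.1 * rng.standard_normal(50)

path = gradient_descent_square_loss(X, y, n_iter=200)
train_err = [np.sum((X @ w - y) ** 2) for w in path]
```

Each row of `path` is one candidate solution, so stopping early amounts to picking a row, typically by checking the error on held-out data.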

6 Early stopping at work [Figure: fitting on the training set, shown by iteration]

7 Semi-convergence $\min_w \mathcal{E}(w)$ vs $\min_w \hat{\mathcal{E}}(w)$

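8 Connection to Tikhonov $w_{t+1} = w_t - \gamma \hat X^\top (\hat X w_t - \hat y) = (I - \gamma \hat X^\top \hat X) w_t + \gamma \hat X^\top \hat y$. By induction (taking $w_0 = 0$), $w_t = \underbrace{\gamma \sum_{j=0}^{t-1} (I - \gamma \hat X^\top \hat X)^j}_{\text{truncated power series}} \hat X^\top \hat y$.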
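The induction step can be checked numerically. The sketch below assumes $w_0 = 0$, synthetic data and the stepsize $\gamma = 1 / \|\hat X^\top \hat X\|$ (none of which come from the slides), and compares $t$ gradient steps with the truncated power series.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((30, 4))
y = rng.standard_normal(30)
A = X.T @ X
gamma = 1.0 / np.linalg.norm(A, 2)  # assumed stepsize
t = 25

# t steps of gradient descent from w_0 = 0
w_gd = np.zeros(4)
for _ in range(t):
    w_gd = w_gd - gamma * (X.T @ (X @ w_gd - y))

# Closed form: gamma * sum_{j=0}^{t-1} (I - gamma A)^j X^T y
B = np.eye(4) - gamma * A
power = np.eye(4)          # holds B^j
series = np.zeros((4, 4))  # accumulates sum_j B^j
for _ in range(t):
    series += power
    power = power @ B
w_series = gamma * (series @ (X.T @ y))
```

The two vectors `w_gd` and `w_series` agree up to floating point error.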

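9 Neumann series $\gamma \sum_{j=0}^{t-1} (I - \gamma \hat X^\top \hat X)^j$. For $|a| < 1$: $(1 - a)^{-1} = \sum_{j=0}^{\infty} a^j$; replacing $a$ by $1 - a$ (so that $|1 - a| < 1$), $a^{-1} = \sum_{j=0}^{\infty} (1 - a)^j$. For $A \in \mathbb{R}^{d \times d}$ invertible with $\|I - A\| < 1$ (which holds when $A$ is symmetric positive definite with $\|A\| < 1$): $A^{-1} = \sum_{j=0}^{\infty} (I - A)^j$.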
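A short worked derivation, not taken from the slides, makes the role of the truncation level $t$ explicit. It assumes $A = \hat X^\top \hat X$ is invertible with eigenvalues $\sigma$ satisfying $0 < \gamma \sigma \le 1$; the filter notation $g_t$ is introduced here only for illustration.

```latex
% Geometric sum with B = I - \gamma A, so that I - B = \gamma A:
\gamma A \sum_{j=0}^{t-1} (I - \gamma A)^j = I - (I - \gamma A)^t
\quad\Longrightarrow\quad
\gamma \sum_{j=0}^{t-1} (I - \gamma A)^j = A^{-1}\bigl(I - (I - \gamma A)^t\bigr).

% On an eigendirection of A with eigenvalue \sigma:
g_t(\sigma) = \frac{1 - (1 - \gamma\sigma)^t}{\sigma} \le \gamma t,
\qquad
g_t(\sigma) \to \frac{1}{\sigma} \ \text{as } t \to \infty.

% Tikhonov applies instead:
g_\lambda(\sigma) = \frac{1}{\sigma + \lambda n} \le \frac{1}{\lambda n}.
```

Both filters approximate $1/\sigma$ on large eigenvalues and stay bounded on small ones, which suggests reading $1/(\gamma t)$ as playing the role of $\lambda n$.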

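10 Stable matrix inversion Truncated Neumann series: $(\hat X^\top \hat X)^{-1} = \gamma \sum_{j=0}^{\infty} (I - \gamma \hat X^\top \hat X)^j \approx \gamma \sum_{j=0}^{t-1} (I - \gamma \hat X^\top \hat X)^j$. Compare to $(\hat X^\top \hat X)^{-1} \approx (\hat X^\top \hat X + \lambda n I)^{-1}$.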
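As a hedged illustration of the approximation above, the following sketch builds the truncated series for a small, well-conditioned matrix and measures how far it is from the exact inverse. The helper name `truncated_neumann_inverse`, the data and the stepsize are assumptions of the example.

```python
import numpy as np

def truncated_neumann_inverse(A, gamma, t):
    """Return gamma * sum_{j=0}^{t-1} (I - gamma A)^j, an approximation of A^{-1}."""
    d = A.shape[0]
    B = np.eye(d) - gamma * A
    power, total = np.eye(d), np.zeros((d, d))
    for _ in range(t):
        total += power
        power = power @ B
    return gamma * total

rng = np.random.default_rng(0)
X = rng.standard_normal((40, 3))
A = X.T @ X
gamma = 1.0 / np.linalg.norm(A, 2)  # assumed stepsize, so that ||I - gamma A|| < 1
A_inv = np.linalg.inv(A)

# Spectral-norm error of the truncated series for growing t
errors = [np.linalg.norm(truncated_neumann_inverse(A, gamma, t) - A_inv, 2)
          for t in (1, 10, 100, 1000)]
```

Small $t$ gives a crude but stable inverse; large $t$ recovers $(\hat X^\top \hat X)^{-1}$, including its sensitivity to small eigenvalues.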

11 Early-stopping: extensions Early stopping regularization, so far analogous to $\min_w \frac{1}{n} \sum_{i=1}^n (w^\top x_i - y_i)^2 + \lambda \|w\|^2$. Extensions: early stopping regularization analogous to $\min_w \frac{1}{n} \sum_{i=1}^n V(w^\top x_i, y_i) + \lambda \|w\|^2$, or $\min_w \frac{1}{n} \sum_{i=1}^n (w^\top x_i - y_i)^2 + \lambda R(w)$, or both.

12 Early-stopping: why? Regularization path; warm-restart; computational regularization.

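13 Beyond Tikhonov: TSVD $\hat X^\top \hat X = V \Sigma V^\top$, $w_M = (\hat X^\top \hat X)^{-1}_M \hat X^\top \hat y$, where $(\hat X^\top \hat X)^{-1}_M = V \Sigma^{-1}_M V^\top$ and $\Sigma^{-1}_M = \mathrm{diag}(\sigma_1^{-1}, \dots, \sigma_M^{-1}, 0, \dots, 0)$. Also known as principal component regression (PCR).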
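A minimal sketch of the truncated inverse, assuming the eigenvalues are sorted in decreasing order so that the first $M$ are the largest. The function name `tsvd_regression` and the synthetic data are hypothetical.

```python
import numpy as np

def tsvd_regression(X, y, M):
    """w_M = V diag(1/s_1, ..., 1/s_M, 0, ..., 0) V^T X^T y."""
    evals, V = np.linalg.eigh(X.T @ X)   # eigh returns ascending eigenvalues
    order = np.argsort(evals)[::-1]      # reorder to descending
    evals, V = evals[order], V[:, order]
    inv_M = np.zeros_like(evals)
    inv_M[:M] = 1.0 / evals[:M]          # invert only the top-M eigenvalues
    return V @ (inv_M * (V.T @ (X.T @ y)))

rng = np.random.default_rng(0)
X = rng.standard_normal((60, 6))
y = X @ rng.standard_normal(6) + 0.1 * rng.standard_normal(60)

w_3 = tsvd_regression(X, y, M=3)     # regularized: 3 components kept
w_full = tsvd_regression(X, y, M=6)  # no truncation: ordinary least squares
```

Here $M$ plays the role of the regularization parameter: fewer components means a smaller-norm, more stable solution.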

14 Principal component analysis (PCA) Dimensionality reduction: $\hat X^\top \hat X = V \Sigma V^\top$. Eigenfunctions are directions of maximum variance and of best reconstruction.

15 TSVD and PCA TSVD $\Leftrightarrow$ PCA + ERM. Regularization by projection.
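The equivalence can be illustrated numerically. The sketch below assumes uncentered PCA (eigenvectors of $\hat X^\top \hat X$, as in the slides' notation) and least squares as the ERM step; the variable names and data are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((60, 6))
y = X @ rng.standard_normal(6) + 0.1 * rng.standard_normal(60)
M = 3

# PCA step: top-M eigenvectors of X^T X (no centering, an assumption)
evals, V = np.linalg.eigh(X.T @ X)
order = np.argsort(evals)[::-1]
evals, V = evals[order], V[:, order]
V_M = V[:, :M]

# ERM step: least squares on the projected data Z = X V_M
Z = X @ V_M
beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
w_pca_erm = V_M @ beta

# TSVD: w_M = V diag(1/s_1, ..., 1/s_M, 0, ...) V^T X^T y
w_tsvd = V_M @ ((V_M.T @ (X.T @ y)) / evals[:M])
```

Both routes give the same weight vector, which is the sense in which TSVD regularizes by projection.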

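16 TSVD/PCR beyond linearity Non-linear function $f(x) = \sum_{i=1}^p w_i \phi_i(x) = \Phi(x)^\top w$ with $w = (\hat\Phi^\top \hat\Phi)^{-1}_M \hat\Phi^\top \hat y$. Let $\hat\Phi = (\Phi(x_1), \dots, \Phi(x_n))^\top \in \mathbb{R}^{n \times p}$. Then $\hat\Phi^\top \hat\Phi = V \Sigma V^\top$, $(\hat\Phi^\top \hat\Phi)^{-1}_M = V \Sigma^{-1}_M V^\top$, $\Sigma = \mathrm{diag}(\sigma_1, \dots, \sigma_p)$, $\Sigma^{-1}_M = \mathrm{diag}(\sigma_1^{-1}, \dots, \sigma_M^{-1}, 0, \dots)$.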

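17 TSVD/PCR with kernels $f(x) = \sum_{i=1}^n K(x, x_i) c_i$, $c = (\hat K)^{-1}_M \hat y$, where $\hat K_{ij} = K(x_i, x_j)$, $\hat K = U \Sigma U^\top$, $\Sigma = \mathrm{diag}(\sigma_1, \dots, \sigma_n)$, $\hat K^{-1}_M = U \Sigma^{-1}_M U^\top$, $\Sigma^{-1}_M = \mathrm{diag}(\sigma_1^{-1}, \dots, \sigma_M^{-1}, 0, \dots)$.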
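A possible NumPy sketch of the kernel version, assuming a Gaussian kernel $K(x, x') = e^{-\gamma \|x - x'\|^2}$ and synthetic data. The helper names `gaussian_kernel` and `kernel_tsvd_fit` are hypothetical.

```python
import numpy as np

def gaussian_kernel(A, B, gamma):
    """K(x, x') = exp(-gamma * ||x - x'||^2) for all pairs of rows of A and B."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))  # clip tiny negative round-off

def kernel_tsvd_fit(K, y, M):
    """c = U diag(1/s_1, ..., 1/s_M, 0, ...) U^T y for K = U Sigma U^T."""
    evals, U = np.linalg.eigh(K)
    order = np.argsort(evals)[::-1]  # largest eigenvalues first
    evals, U = evals[order], U[:, order]
    U_M = U[:, :M]
    return U_M @ ((U_M.T @ y) / evals[:M])

rng = np.random.default_rng(0)
X = rng.standard_normal((80, 3))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(80)

K = gaussian_kernel(X, X, gamma=0.5)
c = kernel_tsvd_fit(K, y, M=10)
f_train = K @ c  # f(x_i) = sum_j K(x_i, x_j) c_j
```

The full eigendecomposition of the $n \times n$ matrix is what makes this approach costly for large $n$.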

18 Complexity of nonparametric learning Time: $O(n^3)$ or $O(n^2 T)$ or $O(n^2 M)$. Space: $O(n^2)$.

19 Going big... Bottleneck of non-linear learning with kernel methods: memory. $\hat K$ is $O(n^2)$.

20 An intuition PCR/spectral filtering: first compute, then discard. Since we know we need only part of the information in the data: can we compute less?

21 Approaches to large scale (Random) features: find $\Phi : X \to \mathbb{R}^M$, with $M \ll n$, s.t. $K(x, x') \approx \Phi(x)^\top \Phi(x')$. Subsampling (Nyström): replace $f(x) = \sum_{i=1}^n K(x, x_i) c_i$ by $f(x) = \sum_{i=1}^M K(x, \tilde x_i) c_i$, with $\tilde x_i$ subsampled from the training set, $M \ll n$.

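22 Random features: Gaussian kernel It holds (using the Fourier transform) that $K(x, x') = e^{-\|x - x'\|^2 \gamma} = \int \underbrace{d\omega\, c\, e^{-\|\omega\|^2 / (4\gamma)}}_{dp(\omega)}\, e^{i \omega^\top x} e^{-i \omega^\top x'}$, with normalizing constant $c = (4\pi\gamma)^{-d/2}$. Consider $K(x, x') \approx \underbrace{\frac{1}{M} \sum_{j=1}^M e^{i \omega_j^\top x} e^{-i \omega_j^\top x'}}_{\Phi(x)^\top \Phi(x')}$ with $\omega_1, \dots, \omega_M$ i.i.d. samples from $p$.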

23 Random features: Gaussian kernel (cont.) Then $e^{-\|x - x'\|^2 \gamma} \approx \Phi(x)^\top \Phi(x')$ with $\Phi(x) = (e^{i \omega_1^\top x}, \dots, e^{i \omega_M^\top x})$. Alternatively consider $\Phi(x) = (\cos(\omega_1^\top x + b_1), \dots, \cos(\omega_M^\top x + b_M))$.
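The cosine features can be tried out as follows. This sketch follows the standard random Fourier features recipe and makes choices the slides leave implicit, all of them assumptions here: $\omega_j \sim N(0, 2\gamma I)$ (the spectral density of $e^{-\gamma \|x - x'\|^2}$), $b_j$ uniform on $[0, 2\pi]$, and a $\sqrt{2/M}$ scaling so that $\Phi(x)^\top \Phi(x')$ approximates the kernel directly.

```python
import numpy as np

def random_fourier_features(X, omegas, b):
    """Phi(x)_j = sqrt(2 / M) * cos(omega_j^T x + b_j), j = 1..M."""
    M = omegas.shape[0]
    return np.sqrt(2.0 / M) * np.cos(X @ omegas.T + b)

rng = np.random.default_rng(0)
d, M, gamma = 3, 5000, 0.5

# Assumed sampling scheme: omega ~ N(0, 2 * gamma * I), b ~ U[0, 2 * pi]
omegas = rng.normal(scale=np.sqrt(2 * gamma), size=(M, d))
b = rng.uniform(0.0, 2 * np.pi, size=M)

X = rng.standard_normal((20, d))
Phi = random_fourier_features(X, omegas, b)
K_approx = Phi @ Phi.T

# Exact Gaussian kernel for comparison
sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K_exact = np.exp(-gamma * sq)
max_err = np.max(np.abs(K_approx - K_exact))
```

The entrywise error shrinks roughly like $1/\sqrt{M}$, so `max_err` is small but not zero for finite $M$.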

24 Other examples of random features Translation invariant kernels: $K(x, x') = H(x - x')$, $\Phi(x)_j = e^{i \omega_j^\top x}$, $\omega_j \sim \pi = \mathcal{F}(H)$. Infinite neural nets kernels: $\Phi(x)_j = |\omega_j^\top x + b_j|_+$, $(\omega_j, b_j) \sim \pi = U[S^{d+1}]$. Infinite dot product kernels. Homogeneous additive kernels. Group invariant kernels. ... Note: connections with hashing and sketching techniques.

25 Learning with random features Let $f(x) = w^\top \Phi(x)$ with coefficients solving $\min_{w \in \mathbb{R}^M} \frac{1}{n} \|\hat\Phi_n w - \hat y\|^2 + \lambda \|w\|^2$, where $\hat\Phi_n$ is the $n$ by $M$ matrix with rows $\Phi(x_i)$.

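26 Complexity of learning with random features $\min_{w \in \mathbb{R}^M} \frac{1}{n} \|\hat\Phi_n w - \hat y\|^2 + \lambda \|w\|^2$, solved by $\underbrace{(\hat\Phi_n^\top \hat\Phi_n + \lambda n I)}_{M \times M}\, w = \hat\Phi_n^\top \hat y$. Computations: Time: $O(n^3) \to O(n M^2)$. Space: $O(n^2) \to O(n M)$.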
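A minimal sketch of the $M \times M$ solve, assuming cosine random features for a Gaussian kernel (sampling scheme and scaling as in the standard random Fourier features recipe, which is an assumption here) and synthetic data. The function name `rf_ridge_fit` is hypothetical.

```python
import numpy as np

def rf_ridge_fit(Phi, y, lam):
    """Solve (Phi^T Phi + lam * n * I) w = Phi^T y, an M x M linear system."""
    n, M = Phi.shape
    return np.linalg.solve(Phi.T @ Phi + lam * n * np.eye(M), Phi.T @ y)

rng = np.random.default_rng(0)
n, d, M, gamma, lam = 200, 2, 50, 0.5, 1e-3
X = rng.standard_normal((n, d))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(n)

# Assumed feature map: sqrt(2/M) * cos(omega^T x + b)
omegas = rng.normal(scale=np.sqrt(2 * gamma), size=(M, d))
b = rng.uniform(0.0, 2 * np.pi, size=M)
Phi = np.sqrt(2.0 / M) * np.cos(X @ omegas.T + b)  # n x M feature matrix

w = rf_ridge_fit(Phi, y, lam)
f_train = Phi @ w  # f(x_i) = w^T Phi(x_i)
```

Forming $\hat\Phi_n^\top \hat\Phi_n$ costs $O(n M^2)$ and storing $\hat\Phi_n$ costs $O(n M)$; the $n \times n$ kernel matrix is never built.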

27 RF as data independent subsampling Consider $f(x) = \sum_{j=1}^d \cos(\omega_j^\top x + b_j) w_j$, or more generally $f(x) = \sum_{j=1}^d q(x, \omega_j) w_j$, with $w_j$ optimized and $\omega_j$ randomized independently of data. What about data dependent sampling?

28 Nyström methods $f(x) = \sum_{i=1}^n K(x, x_i) c_i \;\to\; f(x) = \sum_{i=1}^M K(x, \tilde x_i) c_i$, with $\tilde x_i$ centers subsampled from the training set, $M \ll n$. Note: keep all data! (just use fewer to parameterize functions)

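29 Nyström ridge regression $\min_{c \in \mathbb{R}^M} \frac{1}{n} \|\hat K_{n,M} c - \hat y\|^2 + \lambda c^\top \hat K_{M,M} c$, where $(\hat K_{n,M})_{i,j} = K(x_i, \tilde x_j)$ and $(\hat K_{M,M})_{i,j} = K(\tilde x_i, \tilde x_j)$.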

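30 Complexity of Nyström ridge regression $\min_{c \in \mathbb{R}^M} \frac{1}{n} \|\hat K_{n,M} c - \hat y\|^2 + \lambda c^\top \hat K_{M,M} c$, solved by $\underbrace{(\hat K_{n,M}^\top \hat K_{n,M} + \lambda n \hat K_{M,M})}_{M \times M}\, c = \hat K_{n,M}^\top \hat y$. Computations: Time: $O(n^3) \to O(n M^2)$. Space: $O(n^2) \to O(n M)$.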
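The linear system above can be sketched as follows, assuming a Gaussian kernel, centers drawn uniformly at random from the training points, and synthetic data. The helper names are hypothetical and the subsampling scheme is only one possible choice.

```python
import numpy as np

def gaussian_kernel(A, B, gamma):
    """K(x, x') = exp(-gamma * ||x - x'||^2) for all pairs of rows of A and B."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def nystrom_ridge_fit(X, y, centers, lam, gamma):
    """Solve (K_nM^T K_nM + lam * n * K_MM) c = K_nM^T y."""
    n = X.shape[0]
    K_nM = gaussian_kernel(X, centers, gamma)        # n x M
    K_MM = gaussian_kernel(centers, centers, gamma)  # M x M
    lhs = K_nM.T @ K_nM + lam * n * K_MM
    return np.linalg.solve(lhs, K_nM.T @ y)

def nystrom_predict(X_new, centers, c, gamma):
    """f(x) = sum_i K(x, center_i) c_i."""
    return gaussian_kernel(X_new, centers, gamma) @ c

rng = np.random.default_rng(0)
n, M, gamma, lam = 300, 20, 1.0, 1e-3
X = rng.uniform(-2, 2, size=(n, 2))
y = np.sin(X[:, 0]) * np.cos(X[:, 1]) + 0.1 * rng.standard_normal(n)

centers = X[rng.choice(n, size=M, replace=False)]  # uniform subsampling (assumed)
c = nystrom_ridge_fit(X, y, centers, lam, gamma)
y_hat = nystrom_predict(X, centers, c, gamma)
```

All $n$ points enter the data-fit term through $\hat K_{n,M}$, while only the $M$ centers parameterize the function.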

31 Subsampling and regularization Random features: $f(x) = \sum_{i=1}^M q(x, \omega_i) w_i$. Nyström: $f(x) = \sum_{i=1}^M K(x, \tilde x_i) c_i$. [Figure: validation error]

32 Subsampling as stochastic regularization The subsampling level $M$ can be seen as a regularization parameter! [Figure: validation error] $M$ controls: statistics, space and time complexity!
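One way to look at $M$ as a tuning knob is to hold $\lambda$ fixed at a small value and record the validation error for several values of $M$. The sketch below does this with cosine random features on a synthetic one-dimensional problem; the data, the train/validation split, the grid of $M$ values and the helper names are all assumptions of the example, and no particular error curve is claimed.

```python
import numpy as np

def rf_features(X, omegas, b):
    """sqrt(2 / M) * cos(omega_j^T x + b_j) features (assumed scaling)."""
    return np.sqrt(2.0 / omegas.shape[0]) * np.cos(X @ omegas.T + b)

def rf_validation_error(M, X_tr, y_tr, X_va, y_va, gamma, lam, rng):
    """Fit ridge regression on M random features, return validation MSE."""
    n, d = X_tr.shape
    omegas = rng.normal(scale=np.sqrt(2 * gamma), size=(M, d))
    b = rng.uniform(0.0, 2 * np.pi, size=M)
    Phi_tr = rf_features(X_tr, omegas, b)
    Phi_va = rf_features(X_va, omegas, b)
    w = np.linalg.solve(Phi_tr.T @ Phi_tr + lam * n * np.eye(M), Phi_tr.T @ y_tr)
    return np.mean((Phi_va @ w - y_va) ** 2)

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(400, 1))
y = np.sin(2 * X[:, 0]) + 0.2 * rng.standard_normal(400)
X_tr, y_tr, X_va, y_va = X[:300], y[:300], X[300:], y[300:]

Ms = [2, 8, 32, 128]
val_errors = [rf_validation_error(M, X_tr, y_tr, X_va, y_va,
                                  gamma=1.0, lam=1e-6, rng=rng) for M in Ms]
```

Choosing $M$ by validation error also fixes the time and memory budget, since both scale with $M$.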

33 An incremental approach Algorithm: 1. Pick a center + compute solution. 2. Pick another center + rank one update. 3. Pick another center...

34 Computational regularization Computational regularization idea: use computations to regularize. Iterative and subsampling regularization can be seen as instances.

35 Approaches to large scale non-linear learning Consider $f(x) = \sum_{j=1}^d Q(x, \omega_j) w_j$, with $Q$ a feature or kernel, $w_j$ optimized and $\omega_j$ randomized.

36 Shallow nets Consider $f(x) = \sum_{j=1}^d Q(x, \omega_j) w_j$. Neural nets: $w_j$ optimized, $\omega_j$ optimized (rather than randomized). This is a one layer neural net!

37 From shallow to deep nets Shallow nets: $f(x) = w^\top \Phi_W(x)$, $\Phi_W(x) = Q(W x)$, with $Q$ the activation function. Deep nets: $f(x) = w^\top \Phi(x)$, $\Phi = \Phi_{W_L} \circ \dots \circ \Phi_{W_1}$, $\Phi_{W_j}(x) = Q(W_j x)$.

38 Deep nets $f(x) = w^\top \Phi(x)$, $\Phi = \Phi_{W_L} \circ \dots \circ \Phi_{W_1}$, $\Phi_{W_j}(x) = Q(W_j x)$. Neural nets: $w_j$, $W_j$ optimized, learning data representation (?)
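The composition of feature maps can be written as a short forward pass. This is only a sketch: the ReLU activation for $Q$, the layer sizes and the random weights are assumptions, and no training is shown.

```python
import numpy as np

def relu(z):
    """Assumed activation function Q."""
    return np.maximum(z, 0.0)

def shallow_net(x, W, w, Q=relu):
    """f(x) = w^T Q(W x)."""
    return w @ Q(W @ x)

def deep_net(x, weights, w, Q=relu):
    """f(x) = w^T Phi(x), with Phi = Phi_{W_L} o ... o Phi_{W_1}."""
    h = x
    for W_j in weights:  # Phi_{W_1} is applied first, Phi_{W_L} last
        h = Q(W_j @ h)
    return w @ h

rng = np.random.default_rng(0)
x = rng.standard_normal(4)
W1, W2 = rng.standard_normal((8, 4)), rng.standard_normal((6, 8))
w_shallow, w_deep = rng.standard_normal(8), rng.standard_normal(6)

f_shallow = shallow_net(x, W1, w_shallow)
f_deep = deep_net(x, [W1, W2], w_deep)
```

With a single weight matrix the deep net reduces to the shallow one; with random, fixed `W1` and only `w_shallow` fitted it is a random features model.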

39 This class: early stopping; projection regularization; subsampling & regularization.

40 Next class a practical experience

