TUM 2016, Class 3: Large scale learning by regularization. Lorenzo Rosasco (UNIGE-MIT-IIT). July 25, 2016.
Learning problem. Solve $\min_w \mathcal{E}(w)$, $\mathcal{E}(w) = \int d\rho(x,y)\,\ell(w^\top x, y)$, given $(x_1, y_1), \dots, (x_n, y_n)$. Beyond linear models: non-linear features and kernels.
Regularization by penalization. Replace $\min_w \mathcal{E}(w)$ by $\min_w \underbrace{\hat{\mathcal{E}}(w) + \lambda \|w\|^2}_{\hat{\mathcal{E}}_\lambda(w)}$, with $\hat{\mathcal{E}}(w) = \frac{1}{n}\sum_{i=1}^n \ell(w^\top x_i, y_i)$ and $\lambda > 0$ the regularization parameter.
Early stopping regularization. Another example of regularization: early stopping of an iterative procedure applied to noisy data.
Gradient descent for square loss. $w_{t+1} = w_t - \gamma \hat{X}^\top(\hat{X} w_t - \hat{y})$, applied to $\sum_{i=1}^n (y_i - w^\top x_i)^2 = \|\hat{X}w - \hat{y}\|^2$: no penalty, stepsize chosen a priori, $\gamma = 1/\|\hat{X}^\top \hat{X}\|$.
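A minimal numpy sketch of this iteration, where the stopping time $t$ plays the role of the regularization parameter; the function name `gd_least_squares` and the spectral-norm stepsize are illustrative choices consistent with the slide, not part of it.

```python
import numpy as np

def gd_least_squares(X, y, t_max):
    """Gradient descent w_{t+1} = w_t - gamma * X^T (X w_t - y), started at w_0 = 0.
    Returns the whole iterate path, so early stopping amounts to picking one entry."""
    n, d = X.shape
    gamma = 1.0 / np.linalg.norm(X.T @ X, 2)   # a priori stepsize (spectral norm)
    w = np.zeros(d)
    path = [w.copy()]
    for _ in range(t_max):
        w = w - gamma * X.T @ (X @ w - y)      # gradient step for ||Xw - y||^2 / 2
        path.append(w.copy())
    return path
```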
Early stopping at work. [Figure: fit on the training set at increasing iteration numbers, starting from Iteration #1.]
Semi-convergence: $\min_w \mathcal{E}(w)$ vs. $\min_w \hat{\mathcal{E}}(w)$.
Connection to Tikhonov. $w_{t+1} = w_t - \gamma \hat{X}^\top(\hat{X}w_t - \hat{y}) = (I - \gamma \hat{X}^\top\hat{X})w_t + \gamma \hat{X}^\top\hat{y}$. By induction (with $w_0 = 0$), $w_t = \gamma \sum_{j=0}^{t-1} (I - \gamma \hat{X}^\top\hat{X})^j \,\hat{X}^\top\hat{y}$, a truncated power series.
Neumann series. Compare $\gamma \sum_{j=0}^{t-1}(I - \gamma\hat{X}^\top\hat{X})^j$ with: for $|a| < 1$, $(1-a)^{-1} = \sum_{j=0}^\infty a^j$, i.e. $a^{-1} = \sum_{j=0}^\infty (1-a)^j$; for $A \in \mathbb{R}^{d\times d}$ invertible with $\|A\| < 1$ (e.g. $A$ symmetric positive definite), $A^{-1} = \sum_{j=0}^\infty (I - A)^j$.
Stable matrix inversion via truncated Neumann series: $(\hat{X}^\top\hat{X})^{-1} = \gamma \sum_{j=0}^{\infty}(I - \gamma\hat{X}^\top\hat{X})^j \approx \gamma \sum_{j=0}^{t-1}(I - \gamma\hat{X}^\top\hat{X})^j$; compare to $(\hat{X}^\top\hat{X})^{-1} \approx (\hat{X}^\top\hat{X} + \lambda n I)^{-1}$.
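A small numerical check of the identity above: the gradient descent iterate $w_t$ (started at $w_0 = 0$) coincides with the truncated Neumann series applied to $\hat{X}^\top\hat{y}$. Data, dimensions, and the stopping time are arbitrary, chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = rng.normal(size=50)
gamma = 1.0 / np.linalg.norm(X.T @ X, 2)
t = 20

# gradient descent iterate w_t
w = np.zeros(5)
for _ in range(t):
    w = w - gamma * X.T @ (X @ w - y)

# truncated Neumann series: gamma * sum_{j<t} (I - gamma X^T X)^j, applied to X^T y
A = np.eye(5) - gamma * X.T @ X
S = sum(np.linalg.matrix_power(A, j) for j in range(t))
w_neumann = gamma * S @ (X.T @ y)

print(np.allclose(w, w_neumann))   # True: early stopping = truncated power series
```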
Early stopping: extensions. So far, early stopping regularization is analogous to $\min_w \frac{1}{n}\sum_{i=1}^n (w^\top x_i - y_i)^2 + \lambda\|w\|^2$. Extensions: early stopping regularization analogous to $\min_w \frac{1}{n}\sum_{i=1}^n V(w^\top x_i, y_i) + \lambda\|w\|^2$, or $\min_w \frac{1}{n}\sum_{i=1}^n (w^\top x_i - y_i)^2 + \lambda R(w)$, or both.
Early stopping: why? Regularization path, warm restart, computational regularization.
Beyond Tikhonov: TSVD. $\hat{X}^\top\hat{X} = V\Sigma V^\top$, $w_M = (\hat{X}^\top\hat{X})^{-1}_M \hat{X}^\top\hat{y}$, where $(\hat{X}^\top\hat{X})^{-1}_M = V\Sigma^{-1}_M V^\top$ and $\Sigma^{-1}_M = \mathrm{diag}(\sigma_1^{-1}, \dots, \sigma_M^{-1}, 0, \dots, 0)$. Also known as principal component regression (PCR).
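A minimal sketch of TSVD/PCR in the linear case, using an eigendecomposition of $\hat{X}^\top\hat{X}$; the name `tsvd_regression` is illustrative.

```python
import numpy as np

def tsvd_regression(X, y, M):
    """Principal component regression: invert X^T X only on its top-M eigendirections."""
    C = X.T @ X
    sigma, V = np.linalg.eigh(C)                     # eigenvalues in ascending order
    idx = np.argsort(sigma)[::-1][:M]                # keep the M largest
    inv_M = (V[:, idx] / sigma[idx]) @ V[:, idx].T   # V Sigma_M^{-1} V^T
    return inv_M @ X.T @ y                           # w_M
```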
Principal component analysis (PCA): dimensionality reduction. $\hat{X}^\top\hat{X} = V\Sigma V^\top$; the eigenvectors are the directions of maximum variance / best reconstruction.
TSVD and PCA: TSVD = PCA + ERM, i.e. regularization by projection.
TSVD/PCR beyond linearity. Non-linear function $f(x) = \sum_{i=1}^p w_i\,\varphi_i(x) = \Phi(x)^\top w$, with $w = (\hat\Phi^\top\hat\Phi)^{-1}_M \hat\Phi^\top\hat{y}$. Let $\hat\Phi = (\Phi(x_1), \dots, \Phi(x_n))^\top \in \mathbb{R}^{n\times p}$. Then $\hat\Phi^\top\hat\Phi = V\Sigma V^\top$, $(\hat\Phi^\top\hat\Phi)^{-1}_M = V\Sigma^{-1}_M V^\top$, $\Sigma = \mathrm{diag}(\sigma_1, \dots, \sigma_p)$, $\Sigma^{-1}_M = \mathrm{diag}(\sigma_1^{-1}, \dots, \sigma_M^{-1}, 0, \dots)$.
TSVD/PCR with kernels. $f(x) = \sum_{i=1}^n K(x, x_i)c_i$, with $c = \hat{K}^{-1}_M \hat{y}$, where $\hat{K}_{ij} = K(x_i, x_j)$, $\hat{K} = U\Sigma U^\top$, $\Sigma = \mathrm{diag}(\sigma_1, \dots, \sigma_n)$, $\hat{K}^{-1}_M = U\Sigma^{-1}_M U^\top$, $\Sigma^{-1}_M = \mathrm{diag}(\sigma_1^{-1}, \dots, \sigma_M^{-1}, 0, \dots)$.
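The same spectral filter applied directly to the kernel matrix; a sketch assuming $\hat{K}$ is precomputed and symmetric, with `kernel_pcr_fit` an illustrative name. Prediction then uses $f(x) = \sum_i K(x, x_i)c_i$.

```python
import numpy as np

def kernel_pcr_fit(K, y, M):
    """Kernel PCR: c = K_M^{-1} y, inverting K only on its top-M eigendirections."""
    sigma, U = np.linalg.eigh(K)             # K = U Sigma U^T, ascending eigenvalues
    idx = np.argsort(sigma)[::-1][:M]        # keep the M largest
    return (U[:, idx] / sigma[idx]) @ (U[:, idx].T @ y)

# prediction on a test point x: f(x) = k_x @ c, with (k_x)_i = K(x, x_i)
```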
Complexity of nonparametric learning. Time: $O(n^3)$ or $O(n^2 T)$ or $O(n^2 M)$; space: $O(n^2)$.
Going big... The bottleneck of non-linear learning with kernel methods is memory: storing $\hat{K}$ is $O(n^2)$.
An intuition. PCR/spectral filtering: first compute, then discard. Since we know we need only part of the information in the data: can we compute less?
Approaches to large scale. (Random) features: find $\Phi : X \to \mathbb{R}^M$, with $M \ll n$, such that $K(x, x') \approx \Phi(x)^\top\Phi(x')$. Subsampling (Nyström): replace $f(x) = \sum_{i=1}^n K(x, x_i)c_i$ by $f(x) = \sum_{i=1}^M K(x, \tilde{x}_i)c_i$, with $\tilde{x}_i$ subsampled from the training set and $M \ll n$.
Random features: Gaussian kernel. It holds (using the Fourier transform) that $K(x, x') = e^{-\|x - x'\|^2\gamma} = \int \underbrace{c\, e^{-\|\omega\|^2}\, d\omega}_{dp(\omega)}\; e^{i\omega^\top x}\, e^{-i\omega^\top x'}$. Consider $K(x, x') \approx \frac{1}{M}\sum_{j=1}^M e^{i\omega_j^\top x}\, e^{-i\omega_j^\top x'}$, with $\omega_1, \dots, \omega_M$ i.i.d. samples from $p$.
Random features: Gaussian kernel (cont.). Then $e^{-\|x - x'\|^2\gamma} \approx \Phi(x)^\top\overline{\Phi(x')}$, with $\Phi(x) = (e^{i\omega_1^\top x}, \dots, e^{i\omega_M^\top x})$ (up to a $1/\sqrt{M}$ normalization). Alternatively, consider the real-valued features $\Phi(x) = (\cos(\omega_1^\top x + b_1), \dots, \cos(\omega_M^\top x + b_M))$.
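A numpy sketch of the cosine random features, assuming the kernel $e^{-\gamma\|x-x'\|^2}$, so that $\omega \sim \mathcal{N}(0, 2\gamma I)$ and $b \sim U[0, 2\pi]$; the $\sqrt{2/M}$ normalization makes $\Phi(x)^\top\Phi(x')$ an unbiased estimate of the kernel. The function name and the final check are illustrative.

```python
import numpy as np

def gaussian_random_features(X, M, gamma, rng):
    """Random Fourier features z(x) with z(x) @ z(x') ~= exp(-gamma * ||x - x'||^2)."""
    d = X.shape[1]
    W = rng.normal(scale=np.sqrt(2 * gamma), size=(d, M))  # spectral density of the kernel
    b = rng.uniform(0, 2 * np.pi, size=M)
    return np.sqrt(2.0 / M) * np.cos(X @ W + b)

# quick check of the approximation on random data
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
gamma = 0.5
D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-gamma * D2)                       # exact Gaussian kernel matrix
Z = gaussian_random_features(X, M=2000, gamma=gamma, rng=rng)
print(np.abs(Z @ Z.T - K).max())              # small for large M
```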
Other examples of random features.
- Translation invariant kernels: $K(x, x') = H(x - x')$, $\Phi(x)_j = e^{i\omega_j^\top x}$, $\omega_j \sim \pi = \mathcal{F}(H)$.
- Infinite neural nets kernels: $\Phi(x)_j = |\omega_j^\top x + b_j|_+$, $(\omega_j, b_j) \sim \pi = U[\mathbb{S}^{d+1}]$.
- Infinite dot product kernels, homogeneous additive kernels, group invariant kernels...
Note: connections with hashing and sketching techniques.
Learning with random features. Let $f(x) = w^\top\Phi(x)$, with coefficients solving $\min_{w\in\mathbb{R}^M} \frac{1}{n}\|\hat\Phi_n w - \hat{y}\|^2 + \lambda\|w\|^2$, where $\hat\Phi_n$ is the $n\times M$ matrix with rows $\Phi(x_i)^\top$.
Complexity of learning with random features. $\min_{w\in\mathbb{R}^M}\frac{1}{n}\|\hat\Phi_n w - \hat{y}\|^2 + \lambda\|w\|^2 \;\Rightarrow\; \underbrace{(\hat\Phi_n^\top\hat\Phi_n + \lambda n I)}_{M\times M}\, w = \hat\Phi_n^\top\hat{y}$. Computations: time $O(n^3)\to O(nM^2)$, space $O(n^2)\to O(nM)$.
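A sketch of the resulting $M\times M$ linear system, assuming a precomputed feature matrix $\hat\Phi_n$ (e.g. from the random features above); `rf_ridge_fit` is an illustrative name.

```python
import numpy as np

def rf_ridge_fit(Phi_n, y, lam):
    """Solve (Phi_n^T Phi_n + lam * n * I) w = Phi_n^T y: O(n M^2) time, O(n M) space."""
    n, M = Phi_n.shape
    A = Phi_n.T @ Phi_n + lam * n * np.eye(M)
    return np.linalg.solve(A, Phi_n.T @ y)

def rf_ridge_predict(Phi_test, w):
    """Predictions f(x) = w^T Phi(x) on precomputed test features."""
    return Phi_test @ w
```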
RF as data-independent subsampling. Consider $f(x) = \sum_{j=1}^M \cos(\omega_j^\top x + b_j)\, w_j$, or more generally $f(x) = \sum_{j=1}^M q(x, \omega_j)\, w_j$, with $w_j$ optimized and $\omega_j$ randomized independently of the data. What about data-dependent sampling?
Nyström methods. $f(x) = \sum_{i=1}^n K(x, x_i)c_i \;\to\; f(x) = \sum_{i=1}^M K(x, \tilde{x}_i)c_i$, with centers $\tilde{x}_i$ subsampled from the training set, $M \ll n$. Note: keep all the data! (just use fewer points to parameterize functions).
Nyström ridge regression. $\min_{c\in\mathbb{R}^M} \frac{1}{n}\|\hat{K}_{n,M}\, c - \hat{y}\|^2 + \lambda\, c^\top\hat{K}_{M,M}\, c$, where $(\hat{K}_{n,M})_{ij} = K(x_i, \tilde{x}_j)$ and $(\hat{K}_{M,M})_{ij} = K(\tilde{x}_i, \tilde{x}_j)$.
Complexity of Nyström ridge regression. $\min_{c\in\mathbb{R}^M}\frac{1}{n}\|\hat{K}_{n,M}\, c - \hat{y}\|^2 + \lambda\, c^\top\hat{K}_{M,M}\, c \;\Rightarrow\; \underbrace{(\hat{K}_{n,M}^\top\hat{K}_{n,M} + \lambda n \hat{K}_{M,M})}_{M\times M}\, c = \hat{K}_{n,M}^\top\hat{y}$. Computations: time $O(n^3)\to O(nM^2)$, space $O(n^2)\to O(nM)$.
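A sketch of Nyström ridge regression with uniformly subsampled centers, using a Gaussian kernel only as an example; function names are illustrative, and in practice the $M\times M$ system is typically solved via Cholesky with a small jitter.

```python
import numpy as np

def gaussian_kernel(A, B, gamma):
    D2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * D2)

def nystrom_ridge_fit(X, y, M, lam, gamma, rng):
    """Fit c solving (K_nM^T K_nM + lam*n*K_MM) c = K_nM^T y with M random centers."""
    centers = X[rng.choice(len(X), size=M, replace=False)]
    Knm = gaussian_kernel(X, centers, gamma)          # n x M
    Kmm = gaussian_kernel(centers, centers, gamma)    # M x M
    n = len(X)
    A = Knm.T @ Knm + lam * n * Kmm                   # M x M system
    c = np.linalg.solve(A, Knm.T @ y)
    return centers, c

def nystrom_predict(Xtest, centers, c, gamma):
    """f(x) = sum_i K(x, center_i) c_i."""
    return gaussian_kernel(Xtest, centers, gamma) @ c
```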
Subsampling and regularization. Random features: $f(x) = \sum_{i=1}^M q(x, \omega_i)w_i$; Nyström: $f(x) = \sum_{i=1}^M K(x, \tilde{x}_i)c_i$. [Figure: validation error as a function of M.]
Subsampling as stochastic regularization. The subsampling level $M$ can be seen as a regularization parameter! [Figure: validation error as a function of M.] $M$ controls statistics, space and time complexity!
An incremental approach. Algorithm: 1. Pick a center and compute the solution. 2. Pick another center and perform a rank-one update. 3. Pick another center...
Computational regularization. Idea: use computations to regularize. Iterative and subsampling regularization can be seen as instances.
Approaches to large scale non-linear learning. Consider $f(x) = \sum_{j=1}^M Q(x, \omega_j)w_j$, with $Q$ a feature or kernel, $w_j$ optimized, $\omega_j$ randomized.
Shallow nets. Consider $f(x) = \sum_{j=1}^M Q(x, \omega_j)w_j$. Neural nets: $w_j$ optimized, $\omega_j$ optimized rather than randomized. This is a one-layer neural net!
From shallow to deep nets. Shallow nets: $Q$ an activation function, $f(x) = w^\top\Phi_W(x)$, $\Phi_W(x) = Q(Wx)$. Deep nets: $f(x) = w^\top\Phi(x)$, $\Phi = \Phi_{W_L}\circ\dots\circ\Phi_{W_1}$, $\Phi_{W_j}(x) = Q(W_j x)$.
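A tiny numpy sketch of the shallow vs. deep parameterization, using ReLU as an illustrative choice of the activation $Q$; layer shapes are arbitrary.

```python
import numpy as np

def relu(z):                       # Q: elementwise activation (illustrative choice)
    return np.maximum(z, 0.0)

def shallow_net(x, W, w):
    """f(x) = w^T Q(W x): a single hidden layer."""
    return w @ relu(W @ x)

def deep_net(x, Ws, w):
    """f(x) = w^T (Phi_{W_L} o ... o Phi_{W_1})(x), each layer Phi_{W_j}(x) = Q(W_j x)."""
    h = x
    for Wj in Ws:
        h = relu(Wj @ h)
    return w @ h

rng = np.random.default_rng(0)
x = rng.normal(size=4)
Ws = [rng.normal(size=(8, 4)), rng.normal(size=(6, 8))]   # L = 2 layers
print(shallow_net(x, Ws[0], rng.normal(size=8)))
print(deep_net(x, Ws, rng.normal(size=6)))
```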
Deep nets. $f(x) = w^\top\Phi(x)$, $\Phi = \Phi_{W_L}\circ\dots\circ\Phi_{W_1}$, $\Phi_{W_j}(x) = Q(W_j x)$. Neural nets: $w$ and the $W_j$ optimized, learning a data representation (?).
This class: early stopping; projection regularization; subsampling & regularization.
Next class: a practical experience.