TUM 2016 Class 3: Large scale learning by regularization
Lorenzo Rosasco, UNIGE-MIT-IIT
July 25, 2016

Learning problem
Solve
$$\min_w \mathcal{E}(w), \qquad \mathcal{E}(w) = \int d\rho(x,y)\, \ell(w^\top x, y)$$
given $(x_1, y_1), \dots, (x_n, y_n)$.
Beyond linear models: non-linear features and kernels.

Regularization by penalization
Replace $\min_w \mathcal{E}(w)$ by
$$\min_w \underbrace{\hat{\mathcal{E}}(w) + \lambda \|w\|^2}_{\hat{\mathcal{E}}_\lambda(w)}, \qquad \hat{\mathcal{E}}(w) = \frac{1}{n}\sum_{i=1}^n \ell(w^\top x_i, y_i),$$
with $\lambda > 0$ the regularization parameter.

Early stopping regularization
Another example of regularization: early stopping of an iterative procedure applied to noisy data.

Gradient descent for square loss
$$w_{t+1} = w_t - \gamma\, \hat{X}^\top (\hat{X} w_t - \hat{y}), \qquad \sum_{i=1}^n (y_i - w^\top x_i)^2 = \|\hat{X} w - \hat{y}\|^2$$
No penalty; step size chosen a priori, $\gamma = \frac{2}{\|\hat{X}^\top \hat{X}\|}$.
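A minimal NumPy sketch of this iteration (the function name and the conservative step size $\gamma = 1/\|\hat{X}^\top \hat{X}\|$ are illustrative choices, not from the slides):

```python
import numpy as np

def gd_least_squares(Xhat, yhat, t):
    """Run t gradient-descent steps on ||Xhat w - yhat||^2 (no penalty)."""
    # step size chosen a priori from the spectral norm of Xhat^T Xhat
    gamma = 1.0 / np.linalg.norm(Xhat.T @ Xhat, 2)
    w = np.zeros(Xhat.shape[1])
    for _ in range(t):
        # w_{t+1} = w_t - gamma * Xhat^T (Xhat w_t - yhat)
        w = w - gamma * Xhat.T @ (Xhat @ w - yhat)
    return w
```

The number of iterations t plays the role of the regularization parameter: few iterations give a heavily regularized solution, many iterations approach the unregularized least squares fit.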

Early stopping at work
[Figure: fit on the training set at increasing iteration numbers, "Fitting on the training set, Iteration #1", ...]

Semi-convergence
$\min_w \mathcal{E}(w)$ vs. $\min_w \hat{\mathcal{E}}(w)$

Connection to Tikhonov
$$w_{t+1} = w_t - \gamma\, \hat{X}^\top(\hat{X} w_t - \hat{y}) = (I - \gamma \hat{X}^\top \hat{X})\, w_t + \gamma\, \hat{X}^\top \hat{y}$$
By induction,
$$w_t = \gamma \underbrace{\sum_{j=0}^{t-1} (I - \gamma \hat{X}^\top \hat{X})^j}_{\text{truncated power series}} \hat{X}^\top \hat{y}$$

Neumann series
$$\gamma \sum_{j=0}^{t-1} (I - \gamma \hat{X}^\top \hat{X})^j$$
For $|a| < 1$, $(1-a)^{-1} = \sum_{j=0}^{\infty} a^j$; equivalently, for $|1-a| < 1$, $a^{-1} = \sum_{j=0}^{\infty} (1-a)^j$.
Analogously, for $A \in \mathbb{R}^{d\times d}$ invertible with $\|I - A\| < 1$,
$$A^{-1} = \sum_{j=0}^{\infty} (I - A)^j.$$

Stable matrix inversion: truncated Neumann series
$$(\hat{X}^\top \hat{X})^{-1} = \gamma \sum_{j=0}^{\infty} (I - \gamma \hat{X}^\top \hat{X})^j \;\approx\; \gamma \sum_{j=0}^{t-1} (I - \gamma \hat{X}^\top \hat{X})^j$$
Compare to Tikhonov:
$$(\hat{X}^\top \hat{X})^{-1} \;\longrightarrow\; (\hat{X}^\top \hat{X} + \lambda n I)^{-1}$$
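A small numerical check (a sketch under the notation above, not from the slides): t gradient-descent steps from $w_0 = 0$ coincide with the truncated Neumann series applied to $\hat{X}^\top \hat{y}$, and can be compared with the Tikhonov solution, which replaces truncation by an explicit shift $\lambda n I$.

```python
import numpy as np

rng = np.random.default_rng(0)
Xhat = rng.standard_normal((50, 5))
yhat = rng.standard_normal(50)

A = Xhat.T @ Xhat
gamma = 1.0 / np.linalg.norm(A, 2)
t = 200

# truncated Neumann series: gamma * sum_{j<t} (I - gamma A)^j applied to Xhat^T yhat
S, P = np.zeros_like(A), np.eye(A.shape[0])
for _ in range(t):
    S, P = S + P, P @ (np.eye(A.shape[0]) - gamma * A)
w_neumann = gamma * S @ (Xhat.T @ yhat)

# t gradient-descent steps from w_0 = 0 give the same estimator
w = np.zeros(A.shape[0])
for _ in range(t):
    w = w - gamma * Xhat.T @ (Xhat @ w - yhat)
print(np.allclose(w, w_neumann))   # True

# Tikhonov: explicit shift instead of truncation (lambda*n chosen arbitrarily here)
w_tikhonov = np.linalg.solve(A + 1e-2 * np.eye(A.shape[0]), Xhat.T @ yhat)
```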

Early stopping: extensions
Early stopping regularization, so far, is analogous to
$$\min_w \frac{1}{n}\sum_{i=1}^n (w^\top x_i - y_i)^2 + \lambda \|w\|^2.$$
Extensions: early stopping regularization analogous to
$$\min_w \frac{1}{n}\sum_{i=1}^n V(w^\top x_i, y_i) + \lambda \|w\|^2 \qquad\text{or}\qquad \min_w \frac{1}{n}\sum_{i=1}^n (w^\top x_i - y_i)^2 + \lambda R(w),$$
or both.

Early stopping: why?
- regularization path
- warm restart
- computational regularization

Beyond Tikhonov: TSVD
$$\hat{X}^\top \hat{X} = V \Sigma V^\top, \qquad w_M = (\hat{X}^\top \hat{X})^{-1}_M \hat{X}^\top \hat{y}$$
$$(\hat{X}^\top \hat{X})^{-1}_M = V \Sigma^{-1}_M V^\top, \qquad \Sigma^{-1}_M = \mathrm{diag}(\sigma_1^{-1}, \dots, \sigma_M^{-1}, 0, \dots, 0)$$
Also known as principal component regression (PCR).
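A possible NumPy sketch of TSVD/PCR; the eigendecomposition-based implementation is an illustrative choice, only the formula $w_M = (\hat{X}^\top \hat{X})^{-1}_M \hat{X}^\top \hat{y}$ comes from the slide.

```python
import numpy as np

def tsvd_regression(Xhat, yhat, M):
    """Principal component regression keeping the top-M eigendirections."""
    evals, V = np.linalg.eigh(Xhat.T @ Xhat)      # Xhat^T Xhat = V Sigma V^T
    order = np.argsort(evals)[::-1]               # eigh returns ascending order
    evals, V = evals[order], V[:, order]
    inv_trunc = np.zeros_like(evals)
    inv_trunc[:M] = 1.0 / evals[:M]               # Sigma_M^{-1}: invert top M, zero the rest
    return V @ (inv_trunc * (V.T @ (Xhat.T @ yhat)))   # w_M
```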

Principal component analysis (PCA)
Dimensionality reduction: $\hat{X}^\top \hat{X} = V \Sigma V^\top$. The eigenvectors are the directions of maximum variance / best reconstruction.

TSVD and PCA
TSVD = PCA + ERM: regularization by projection.

TSVD/PCR beyond linearity
Non-linear functions
$$f(x) = \sum_{i=1}^p w_i\, \varphi_i(x) = \Phi(x)^\top w, \qquad w = (\hat{\Phi}^\top \hat{\Phi})^{-1}_M \hat{\Phi}^\top \hat{y}$$
Let $\hat{\Phi} = (\Phi(x_1), \dots, \Phi(x_n))^\top \in \mathbb{R}^{n\times p}$. Then
$$\hat{\Phi}^\top \hat{\Phi} = V \Sigma V^\top, \qquad (\hat{\Phi}^\top \hat{\Phi})^{-1}_M = V \Sigma^{-1}_M V^\top,$$
$$\Sigma = \mathrm{diag}(\sigma_1, \dots, \sigma_p), \qquad \Sigma^{-1}_M = \mathrm{diag}(\sigma_1^{-1}, \dots, \sigma_M^{-1}, 0, \dots).$$

TSVD/PCR with kernels
$$f(x) = \sum_{i=1}^n K(x, x_i)\, c_i, \qquad c = \hat{K}^{-1}_M \hat{y}$$
$$\hat{K}_{ij} = K(x_i, x_j), \quad \hat{K} = U \Sigma U^\top, \quad \Sigma = \mathrm{diag}(\sigma_1, \dots, \sigma_n),$$
$$\hat{K}^{-1}_M = U \Sigma^{-1}_M U^\top, \qquad \Sigma^{-1}_M = \mathrm{diag}(\sigma_1^{-1}, \dots, \sigma_M^{-1}, 0, \dots).$$
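A sketch of the kernel version, assuming NumPy; the Gaussian kernel helper and its bandwidth parameter are illustrative, the slide leaves the kernel generic.

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def kernel_pcr_fit(X, y, M, sigma=1.0):
    K = gaussian_kernel(X, X, sigma)              # K_ij = K(x_i, x_j)
    evals, U = np.linalg.eigh(K)                  # K = U Sigma U^T
    order = np.argsort(evals)[::-1]
    evals, U = evals[order], U[:, order]
    inv_trunc = np.zeros_like(evals)
    inv_trunc[:M] = 1.0 / evals[:M]               # keep the top-M spectral components
    return U @ (inv_trunc * (U.T @ y))            # c = K_M^{-1} yhat

def kernel_pcr_predict(Xnew, X, c, sigma=1.0):
    return gaussian_kernel(Xnew, X, sigma) @ c    # f(x) = sum_i K(x, x_i) c_i
```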

Complexity of nonparametric learning
Time: $O(n^3)$ or $O(n^2 T)$ or $O(n^2 M)$. Space: $O(n^2)$.

Going big...
Bottleneck of non-linear learning with kernel methods: memory, since $\hat{K}$ is $O(n^2)$.

An intuition
PCR/spectral filtering: first compute, then discard. Since we know we need only part of the information in the data, can we compute less?

Approaches to large scale
(Random) features: find $\tilde{\Phi}: X \to \mathbb{R}^M$, with $M \ll n$, such that $K(x, x') \approx \tilde{\Phi}(x)^\top \tilde{\Phi}(x')$.
Subsampling (Nyström): replace
$$f(x) = \sum_{i=1}^n K(x, x_i)\, c_i \qquad\text{by}\qquad f(x) = \sum_{i=1}^M K(x, \tilde{x}_i)\, c_i,$$
with $\tilde{x}_i$ subsampled from the training set and $M \ll n$.

Random features: Gaussian kernel
It holds (using the Fourier transform)
$$K(x, x') = e^{-\frac{\|x - x'\|^2}{2\gamma}} = \int \underbrace{d\omega\, e^{-\|\omega\|^2 c}}_{dp(\omega)}\, e^{i\omega^\top x}\, e^{-i\omega^\top x'}.$$
Consider
$$\tilde{K}(x, x') = \underbrace{\frac{1}{M}\sum_{j=1}^M e^{i\omega_j^\top x}\, e^{-i\omega_j^\top x'}}_{\tilde{\Phi}(x)^\top \tilde{\Phi}(x')}$$
with $\omega_1, \dots, \omega_M$ i.i.d. samples from $p$.

Random features: Gaussian kernel (cont.)
Then
$$e^{-\frac{\|x - x'\|^2}{2\gamma}} \approx \tilde{\Phi}(x)^\top \tilde{\Phi}(x'), \qquad \tilde{\Phi}(x) = \left(e^{i\omega_1^\top x}, \dots, e^{i\omega_M^\top x}\right).$$
Alternatively, consider
$$\tilde{\Phi}(x) = \left(\cos(\omega_1^\top x + b_1), \dots, \cos(\omega_M^\top x + b_M)\right).$$
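A sketch of the cosine variant, assuming $K(x,x') = e^{-\|x-x'\|^2/(2\gamma)}$; the $\sqrt{2/M}$ scaling and the sampling of $b_j$ uniformly in $[0, 2\pi]$ are the standard Rahimi-Recht choices, not spelled out on the slide.

```python
import numpy as np

def random_fourier_features(X, M, gamma=1.0, seed=0):
    """Map X (n x d) to an n x M cosine feature matrix approximating the Gaussian kernel."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    Omega = rng.normal(scale=1.0 / np.sqrt(gamma), size=(d, M))   # omega_j ~ p
    b = rng.uniform(0.0, 2.0 * np.pi, size=M)
    return np.sqrt(2.0 / M) * np.cos(X @ Omega + b)

# Phi(x)^T Phi(x') approximates exp(-||x - x'||^2 / (2*gamma)) as M grows
```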

Other examples of random features
- translation invariant kernels: $K(x, x') = H(x - x')$, $\tilde{\Phi}(x)_j = e^{i\omega_j^\top x}$, $\omega_j \sim \pi = \mathcal{F}(H)$
- infinite neural nets kernels: $\tilde{\Phi}(x)_j = |\omega_j^\top x + b_j|_+$, $(\omega_j, b_j) \sim \pi = U[\mathbb{S}^{d+1}]$
- infinite dot product kernels
- homogeneous additive kernels
- group invariant kernels
- ...
Note: connections with hashing and sketching techniques.

Learning with random features
Let $f(x) = w^\top \tilde{\Phi}(x)$, with coefficients solving
$$\min_{w \in \mathbb{R}^M} \frac{1}{n}\|\tilde{\Phi}_n w - \hat{y}\|^2 + \lambda \|w\|^2,$$
where $\tilde{\Phi}_n$ is the $n \times M$ matrix with rows $\tilde{\Phi}(x_i)^\top$.

Complexity of learning with random features
$$\min_{w \in \mathbb{R}^M} \frac{1}{n}\|\tilde{\Phi}_n w - \hat{y}\|^2 + \lambda \|w\|^2 \quad\Longrightarrow\quad \underbrace{(\tilde{\Phi}_n^\top \tilde{\Phi}_n + \lambda n I)}_{M \times M}\, w = \tilde{\Phi}_n^\top \hat{y}$$
Computations: Time $O(n^3) \to O(nM^2)$, Space $O(n^2) \to O(nM)$.
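A sketch of ridge regression on top of these features, reusing the random_fourier_features helper above; the same seed must be used at training and prediction time so the feature map matches.

```python
import numpy as np

def rf_ridge_fit(X, y, M, lam, gamma=1.0, seed=0):
    Phi = random_fourier_features(X, M, gamma, seed)        # n x M, O(nM) memory
    A = Phi.T @ Phi + lam * X.shape[0] * np.eye(M)          # M x M system, O(nM^2) time
    return np.linalg.solve(A, Phi.T @ y)                    # w in R^M

def rf_ridge_predict(Xnew, w, M, gamma=1.0, seed=0):
    return random_fourier_features(Xnew, M, gamma, seed) @ w
```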

RF as data-independent subsampling
Consider
$$f(x) = \sum_{j=1}^d \cos(\omega_j^\top x + b_j)\, w_j,$$
or, more generally,
$$f(x) = \sum_{j=1}^d q(x, \omega_j)\, w_j,$$
with $w_j$ optimized and $\omega_j$ randomized independently of the data. What about data-dependent sampling?

Nyström methods
$$f(x) = \sum_{i=1}^n K(x, x_i)\, c_i \quad\longrightarrow\quad f(x) = \sum_{i=1}^M K(x, \tilde{x}_i)\, c_i$$
$\tilde{x}_i$ centers subsampled from the training set, $M \ll n$.
Note: keep all the data! (just use fewer points to parameterize functions)

Nyström ridge regression
$$\min_{c \in \mathbb{R}^M} \frac{1}{n}\|\hat{K}_{n,M}\, c - \hat{y}\|^2 + \lambda\, c^\top \hat{K}_{M,M}\, c$$
$$(\hat{K}_{n,M})_{i,j} = K(x_i, \tilde{x}_j), \qquad (\hat{K}_{M,M})_{i,j} = K(\tilde{x}_i, \tilde{x}_j)$$

Complexity of Nyström ridge regression
$$\min_{c \in \mathbb{R}^M} \frac{1}{n}\|\hat{K}_{n,M}\, c - \hat{y}\|^2 + \lambda\, c^\top \hat{K}_{M,M}\, c \quad\Longrightarrow\quad \underbrace{(\hat{K}_{n,M}^\top \hat{K}_{n,M} + \lambda n \hat{K}_{M,M})}_{M \times M}\, c = \hat{K}_{n,M}^\top \hat{y}$$
Computations: Time $O(n^3) \to O(nM^2)$, Space $O(n^2) \to O(nM)$.
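A sketch of the Nyström estimator, reusing the gaussian_kernel helper above; uniform subsampling of the centers and the absence of any jitter on the $M \times M$ block are simplifying choices.

```python
import numpy as np

def nystrom_ridge_fit(X, y, M, lam, sigma=1.0, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.choice(X.shape[0], size=M, replace=False)      # subsample M centers
    centers = X[idx]
    K_nM = gaussian_kernel(X, centers, sigma)                 # n x M
    K_MM = gaussian_kernel(centers, centers, sigma)           # M x M
    n = X.shape[0]
    c = np.linalg.solve(K_nM.T @ K_nM + lam * n * K_MM, K_nM.T @ y)
    return centers, c

def nystrom_predict(Xnew, centers, c, sigma=1.0):
    return gaussian_kernel(Xnew, centers, sigma) @ c          # f(x) = sum_j K(x, x~_j) c_j
```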

Subsampling and regularization
Random features: $f(x) = \sum_{i=1}^M q(x, \omega_i)\, w_i$. Nyström: $f(x) = \sum_{i=1}^M K(x, \tilde{x}_i)\, c_i$.
[Figure: validation error as a function of M]

Subsampling as stochastic regularization
The subsampling level M can be seen as a regularization parameter!
[Figure: validation error as a function of M]
M controls statistics, space, and time complexity!

An incremental approach
Algorithm:
1. Pick a center and compute the solution.
2. Pick another center and perform a rank-one update.
3. Pick another center ...
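A rough sketch of this idea, not the slides' actual algorithm: for simplicity it re-solves the Nyström system each time a center is added and tracks the validation error, whereas an efficient implementation would update a factorization with rank-one updates.

```python
import numpy as np

def incremental_nystrom_path(X, y, Xval, yval, max_M, lam, sigma=1.0, seed=0):
    n = X.shape[0]
    order = np.random.default_rng(seed).permutation(n)
    errors = []
    for M in range(1, max_M + 1):
        centers = X[order[:M]]                                # grow the center set by one
        K_nM = gaussian_kernel(X, centers, sigma)
        K_MM = gaussian_kernel(centers, centers, sigma)
        c = np.linalg.solve(K_nM.T @ K_nM + lam * n * K_MM, K_nM.T @ y)
        pred = gaussian_kernel(Xval, centers, sigma) @ c
        errors.append(np.mean((pred - yval) ** 2))            # validation error vs M
    return errors
```

Monitoring the error along this path is what allows M to be tuned like a regularization parameter, as in the previous slides.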

Computational regularization
Idea: use computations to regularize. Iterative and subsampling regularization can be seen as instances.

Approaches to large scale non-linear learning
Consider
$$f(x) = \sum_{j=1}^d Q(x, \omega_j)\, w_j,$$
with Q a feature or kernel, $w_j$ optimized, and $\omega_j$ randomized.

Shallow nets
Consider
$$f(x) = \sum_{j=1}^d Q(x, \omega_j)\, w_j.$$
Neural nets: $w_j$ optimized, $\omega_j$ randomized → optimized. This is a one layer neural net!

From shallow to deep nets
Shallow nets (Q an activation function):
$$f(x) = w^\top \Phi_W(x), \qquad \Phi_W(x) = Q(Wx)$$
Deep nets:
$$f(x) = w^\top \Phi(x), \qquad \Phi = \Phi_{W_L} \circ \dots \circ \Phi_{W_1}, \qquad \Phi_{W_j}(x) = Q(W_j x)$$

Deep nets
$$f(x) = w^\top \Phi(x), \qquad \Phi = \Phi_{W_L} \circ \dots \circ \Phi_{W_1}, \qquad \Phi_{W_j}(x) = Q(W_j x)$$
Neural nets: $w$, $W_j$ optimized; learning a data representation (?)
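A tiny sketch contrasting the two parameterizations above; the names and the choice of tanh as the activation Q are purely illustrative, the slides leave Q generic.

```python
import numpy as np

Q = np.tanh   # the slides leave the activation Q generic

def shallow_net(x, W, w):
    return w @ Q(W @ x)               # f(x) = w^T Q(W x)

def deep_net(x, Ws, w):
    h = x
    for W in Ws:                      # Phi = Phi_{W_L} o ... o Phi_{W_1}
        h = Q(W @ h)
    return w @ h                      # f(x) = w^T Phi(x)
```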

This class
- early stopping
- projection regularization
- subsampling & regularization

Next class
A practical experience.

References
Frank Bauer, Sergei Pereverzev, and Lorenzo Rosasco. On regularization algorithms in learning theory. Journal of Complexity, 23(1):52-72, 2007.
Raffaello Camoriano, Tomás Angles, Alessandro Rudi, and Lorenzo Rosasco. NYTRO: When subsampling meets early stopping. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, pages 1403-1411, 2016.
L. Lo Gerfo, Lorenzo Rosasco, Francesca Odone, Ernesto De Vito, and Alessandro Verri. Spectral algorithms for supervised learning. Neural Computation, 20(7):1873-1897, 2008.
Junhong Lin, Lorenzo Rosasco, and Ding-Xuan Zhou. Iterative regularization for learning with convex loss functions. Journal of Machine Learning Research, 17(77):1-38, 2016.
Sofia Mosci, Lorenzo Rosasco, and Alessandro Verri. Dimensionality reduction and generalization. In Proceedings of the 24th International Conference on Machine Learning, pages 657-664. ACM, 2007.

Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems, pages 1177-1184, 2007.
Alessandro Rudi, Raffaello Camoriano, and Lorenzo Rosasco. Less is more: Nyström computational regularization. In Advances in Neural Information Processing Systems, pages 1657-1665, 2015.
Alessandro Rudi, Raffaello Camoriano, and Lorenzo Rosasco. Generalization properties of learning with random features. arXiv preprint arXiv:1602.04474, 2016.
Alex J. Smola and Bernhard Schölkopf. Sparse greedy matrix approximation for machine learning. 2000.