Oslo Class 2 Tikhonov regularization and kernels

RegML 2017 @ SIMULA, Oslo. Class 2: Tikhonov regularization and kernels. Lorenzo Rosasco (UNIGE-MIT-IIT). May 3, 2017

Learning problem. Problem: for $H \subseteq \{f \mid f : X \to Y\}$, solve $\min_{f \in H} \mathcal{E}(f)$, with $\mathcal{E}(f) = \int d\rho(x, y)\, \ell(f(x), y)$, given $S_n = (x_i, y_i)_{i=1}^n$ ($\rho$ fixed, unknown). How can we design reliable algorithms?

This class: regularization by penalization, logistic regression, representer theorems and kernels, support vector machines.

Empirical Risk Minimization (ERM): replace $\min_{f \in H} \mathcal{E}(f)$ by $\min_{f \in H} \hat{\mathcal{E}}(f)$, a proxy for $\mathcal{E}$, with $\hat{\mathcal{E}}(f) = \frac{1}{n} \sum_{i=1}^n L(f(x_i), y_i)$.

From ERM to regularization. ERM can be a bad idea if $n$ is small and $H$ is big. Penalization: replace $\min_{f \in H} \hat{\mathcal{E}}(f)$ by $\min_{f \in H} \hat{\mathcal{E}}(f) + \lambda R(f)$, where $R(f)$ is the regularization term and $\lambda$ the regularization parameter.

Examples of regularizers. Let $f(x) = \sum_{j=1}^p \phi_j(x) w_j$. $\ell_2$: $R(f) = \|w\|^2 = \sum_{j=1}^p w_j^2$. $\ell_1$: $R(f) = \|w\|_1 = \sum_{j=1}^p |w_j|$. Differential operators: $R(f) = \int_X \|\nabla f(x)\|^2 \, d\rho(x)$, ...

From statistics to optimization. Problem: solve $\min_{w \in \mathbb{R}^p} \hat{\mathcal{E}}(w) + \lambda \|w\|^2$ with $\hat{\mathcal{E}}(w) = \frac{1}{n} \sum_{i=1}^n L(w^\top x_i, y_i)$.

Minimization. $\min_{w} \hat{\mathcal{E}}(w) + \lambda \|w\|^2$ is the minimization of a strongly convex, continuous, coercive functional. The computation depends on the chosen loss function.

Logistic loss. [Figure: the 0-1, square, hinge and logistic losses as functions of $yf(x)$.] Logistic loss: $\log(1 + e^{-yf(x)})$. Linear or non-linear models; solution by gradient descent.

Logistic regression. $\hat{\mathcal{E}}_\lambda(w) = \frac{1}{n} \sum_{i=1}^n \log(1 + e^{-y_i w^\top x_i}) + \lambda \|w\|^2$ is smooth and strongly convex, with gradient $\nabla \hat{\mathcal{E}}_\lambda(w) = -\frac{1}{n} \sum_{i=1}^n \frac{x_i y_i}{1 + e^{y_i w^\top x_i}} + 2\lambda w$.

Gradient descent. Let $F : \mathbb{R}^d \to \mathbb{R}$ be differentiable, (strictly) convex and such that $\|\nabla F(w) - \nabla F(w')\| \le L \|w - w'\|$ (e.g. $\sup_w \|H(w)\| \le L$, with $H$ the Hessian). Then $w_0 = 0$, $w_{t+1} = w_t - \frac{1}{L} \nabla F(w_t)$ converges to the minimizer of $F$.

Gradient descent for LR. For $\min_{w \in \mathbb{R}^p} \frac{1}{n} \sum_{i=1}^n \log(1 + e^{-y_i w^\top x_i}) + \lambda \|w\|^2$, consider $w_{t+1} = w_t - \frac{1}{L}\left[ -\frac{1}{n} \sum_{i=1}^n \frac{x_i y_i}{1 + e^{y_i w_t^\top x_i}} + 2\lambda w_t \right]$. Exercise: compute a good step size!
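
As a concrete illustration, here is a minimal NumPy sketch of this iteration (the function name gd_logistic is ours; the step size uses the standard Hessian bound $\|X\|^2/(4n) + 2\lambda$, which is one possible answer to the exercise):

```python
import numpy as np

def gd_logistic(X, y, lam, T):
    """Gradient descent for l2-regularized logistic regression (sketch).

    X : (n, d) data matrix, y : (n,) labels in {-1, +1},
    lam : regularization parameter lambda, T : number of iterations.
    """
    n, d = X.shape
    # Lipschitz bound on the gradient: ||Hessian|| <= ||X||^2 / (4n) + 2*lam
    L = np.linalg.norm(X, 2) ** 2 / (4 * n) + 2 * lam
    w = np.zeros(d)
    for _ in range(T):
        margins = y * (X @ w)                                   # y_i w^T x_i
        grad = -(X * (y / (1 + np.exp(margins)))[:, None]).mean(axis=0) + 2 * lam * w
        w = w - grad / L                                        # step size 1/L
    return w
```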

Remarks. Complexity: $O(ndT)$, where $n$ is the number of examples, $d$ the dimensionality, $T$ the number of steps. What about non-linear functions?

Non-linear features. $f(x) = \sum_{j=1}^d w_j x^j \;\longrightarrow\; f(x) = \sum_{j=1}^p w_j \phi_j(x)$, with $x^j$ the $j$-th component of $x$. Model: $\Phi(x) = (\phi_1(x), \ldots, \phi_p(x))$.

Gradient descent for LR. Same up to the change $x \to \Phi(x)$: $\min_{w \in \mathbb{R}^p} \frac{1}{n} \sum_{i=1}^n \log(1 + e^{-y_i w^\top \Phi(x_i)}) + \lambda \|w\|^2$, with iteration $w_{t+1} = w_t - \frac{1}{L}\left[ -\frac{1}{n} \sum_{i=1}^n \frac{\Phi(x_i) y_i}{1 + e^{y_i w_t^\top \Phi(x_i)}} + 2\lambda w_t \right]$. Complexity is $O(npT)$.
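
The same solver applies unchanged to explicitly mapped data. As a sketch, a hypothetical degree-2 feature map (illustrative only, reusing the gd_logistic sketch above):

```python
import numpy as np

def poly2_features(X):
    """Explicit degree-2 map: Phi(x) = (x, all products x_j * x_k). Illustrative helper."""
    n, d = X.shape
    pairs = np.einsum('ij,ik->ijk', X, X).reshape(n, d * d)     # all pairwise products
    return np.hstack([X, pairs])                                # p = d + d^2 features

# w = gd_logistic(poly2_features(X), y, lam=0.1, T=100)         # same solver as above
```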

Representer theorem and kernels. Nonparametrics? $f(x) = \sum_{j=1}^d w_j x^j \;\longrightarrow\; f(x) = \sum_{j} w_j \phi_j(x)$. An equivalent formulation of gradient descent provides a way in...

Representer theorem for GD & LR. By induction, $c_{t+1} = c_t - \frac{1}{L}\left[ -\frac{1}{n} \sum_{i=1}^n e_i \frac{y_i}{1 + e^{y_i f_t(x_i)}} + 2\lambda c_t \right]$, with $e_i$ the $i$-th element of the canonical basis and $f_t(x) = \sum_{i=1}^n x^\top x_i\, (c_t)_i$. Show as an exercise!

Representer theorems. Idea: show that $f(x) = w^\top x = \sum_{i=1}^n x_i^\top x\, c_i$, $c_i \in \mathbb{R}$, i.e. $w = \sum_{i=1}^n x_i c_i$. Then: replace the inner product $x^\top x'$ by $K(x, x') = \Phi(x)^\top \Phi(x')$, and compute $c = (c_1, \ldots, c_n) \in \mathbb{R}^n$ rather than $w \in \mathbb{R}^d$.

From features to kernels. $f(x) = \sum_{i=1}^n x_i^\top x\, c_i \;\longrightarrow\; f(x) = \sum_{i=1}^n \Phi(x_i)^\top \Phi(x)\, c_i$. Kernels: $\Phi(x)^\top \Phi(x') = K(x, x')$, so that $f(x) = \sum_{i=1}^n K(x_i, x)\, c_i$.

LR with kernels. GD-LR becomes $c_{t+1} = c_t - \frac{1}{L}\left[ -\frac{1}{n} \sum_{i=1}^n e_i \frac{y_i}{1 + e^{y_i f_t(x_i)}} + 2\lambda c_t \right]$, where $f_t(x) = \sum_{i=1}^n K(x, x_i)\, (c_t)_i$. What are (good) kernels?
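
A minimal sketch of this kernelized iteration, assuming the Gram matrix K is precomputed (function names are ours; the Lipschitz bound $\|K\|/(4n) + 2\lambda$ is taken as the kernel analogue of the bound used in the linear case):

```python
import numpy as np

def gd_kernel_logistic(K, y, lam, T):
    """Gradient descent for kernel logistic regression in the coefficients c (sketch).

    K : (n, n) Gram matrix K[i, j] = K(x_i, x_j), y : (n,) labels in {-1, +1}.
    """
    n = K.shape[0]
    L = np.linalg.norm(K, 2) / (4 * n) + 2 * lam     # Lipschitz bound (assumption)
    c = np.zeros(n)
    for _ in range(T):
        f = K @ c                                    # f_t(x_i) on the training points
        c = c - (-(y / (1 + np.exp(y * f))) / n + 2 * lam * c) / L
    return c

# Prediction: f(x) = sum_i K(x, x_i) c_i, i.e. K_test @ c for an (m, n) test Gram matrix.
```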

Examples of kernels. Linear: $K(x, x') = x^\top x'$. Polynomial: $K(x, x') = (1 + x^\top x')^p$, with $p \in \mathbb{N}$. Gaussian: $K(x, x') = e^{-\gamma \|x - x'\|^2}$, with $\gamma > 0$. In all cases, $f(x) = \sum_{i=1}^n c_i K(x_i, x)$.
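
For illustration, the three kernels as Gram-matrix routines in NumPy (a sketch; the default values p = 3 and gamma = 0.5 are arbitrary):

```python
import numpy as np

def linear_kernel(X, Z):
    return X @ Z.T

def poly_kernel(X, Z, p=3):
    return (1 + X @ Z.T) ** p

def gaussian_kernel(X, Z, gamma=0.5):
    sq_dists = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)   # pairwise ||x - z||^2
    return np.exp(-gamma * sq_dists)
```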

Kernel engineering Kernels for Strings, Graphs, Histograms, Sets,...

What is a kernel? $K(x, x')$: a similarity measure, an inner product, a positive definite function.

Positive definite function. $K : X \times X \to \mathbb{R}$ is positive definite when, for any $n \in \mathbb{N}$ and $x_1, \ldots, x_n \in X$, the matrix $K_n \in \mathbb{R}^{n \times n}$ with $(K_n)_{ij} = K(x_i, x_j)$ is positive semidefinite (eigenvalues $\ge 0$).
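
A quick numerical check of this definition, building a Gaussian Gram matrix on random points and inspecting its smallest eigenvalue (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 3))
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # pairwise squared distances
K_n = np.exp(-0.5 * sq_dists)                               # Gaussian kernel, gamma = 0.5
print(np.linalg.eigvalsh(K_n).min())                        # non-negative up to round-off
```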

PD functions, RKHS and Gaussian processes. Each PD kernel defines: a function space called a reproducing kernel Hilbert space (RKHS), $H = \overline{\mathrm{span}}\{K(\cdot, x) \mid x \in X\}$, and a Gaussian process with covariance function $K$.

Nonparametrics and kernels. The number of parameters is automatically determined by the number of points: $f(x) = \sum_{i=1}^n K(x_i, x)\, c_i$. Compare to $f(x) = \sum_{j=1}^p \phi_j(x) w_j$.

Hinge loss. [Figure: the 0-1, square, hinge and logistic losses as functions of $yf(x)$.] Hinge loss: $|1 - yf(x)|_+$. Linear or non-linear models; solution by the subgradient method.

Support vector machine (SVM). $\hat{\mathcal{E}}_\lambda(w) = \frac{1}{n} \sum_{i=1}^n |1 - y_i w^\top x_i|_+ + \lambda \|w\|^2$: strongly convex but non-smooth!

Subgradient. Let $F : \mathbb{R}^p \to \mathbb{R}$ be convex. The subdifferential $\partial F(w_0)$ is the set of vectors $v \in \mathbb{R}^p$ such that, for every $w \in \mathbb{R}^p$, $F(w) - F(w_0) \ge (w - w_0)^\top v$. In one dimension, $\partial F(w_0) = [F'_-(w_0), F'_+(w_0)]$.

Subgradient method. Let $F : \mathbb{R}^p \to \mathbb{R}$ be strictly convex with bounded subdifferential, and let $\gamma_t = 1/\sqrt{t}$. Then $w_{t+1} = w_t - \gamma_t v_t$, with $v_t \in \partial F(w_t)$, converges to the minimizer of $F$.

Subgradient method for SVM. For $\min_{w \in \mathbb{R}^p} \frac{1}{n} \sum_{i=1}^n |1 - y_i w^\top x_i|_+ + \lambda \|w\|^2$, consider (using the left derivative) $w_{t+1} = w_t - \frac{1}{\sqrt{t}} \left( \frac{1}{n} \sum_{i=1}^n S_i(w_t) + 2\lambda w_t \right)$, where $S_i(w) = -y_i x_i$ if $y_i w^\top x_i \le 1$, and $0$ otherwise.
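
A minimal NumPy sketch of this subgradient iteration for the linear SVM (function name ours; no averaging or stopping rule, just the update above):

```python
import numpy as np

def subgradient_svm(X, y, lam, T):
    """Subgradient method for the l2-regularized hinge loss (sketch).

    X : (n, d) data matrix, y : (n,) labels in {-1, +1}.
    """
    n, d = X.shape
    w = np.zeros(d)
    for t in range(1, T + 1):
        margins = y * (X @ w)
        active = margins <= 1                        # points where the hinge is active
        subgrad = -(X[active] * y[active][:, None]).sum(axis=0) / n + 2 * lam * w
        w = w - subgrad / np.sqrt(t)                 # step size 1/sqrt(t)
    return w
```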

Remarks. The traditional solution is solving a quadratic program (QP)... but it doesn't scale (and is more complicated!). Easy extension to stochastic gradient. Can be seen as a robust, regularized perceptron.

Non-linear subgradient SVM. Same up to the change $x \to \Phi(x)$: $w_{t+1} = w_t - \frac{1}{\sqrt{t}} \left( \frac{1}{n} \sum_{i=1}^n S_i(w_t) + 2\lambda w_t \right)$, where $S_i(w) = -y_i \Phi(x_i)$ if $y_i w^\top \Phi(x_i) \le 1$, and $0$ otherwise.

Representer theorem for SVM. By induction (exercise), $c_{t+1} = c_t - \frac{1}{\sqrt{t}} \left( \frac{1}{n} \sum_{i=1}^n S_i(c_t) + 2\lambda c_t \right)$, with $f_t(x) = \sum_{i=1}^n x^\top x_i\, (c_t)_i$ and $S_i(c_t) = -y_i e_i$ if $y_i f_t(x_i) < 1$, $0$ otherwise, where $e_i$ is the $i$-th element of the canonical basis.

Kernel SVM by subgradient. Given a kernel $K$, $c_{t+1} = c_t - \frac{1}{\sqrt{t}} \left( \frac{1}{n} \sum_{i=1}^n S_i(c_t) + 2\lambda c_t \right)$, with $f_t(x) = \sum_{i=1}^n K(x, x_i)\, (c_t)_i$ and $S_i(c_t) = -y_i e_i$ if $y_i f_t(x_i) < 1$, $0$ otherwise, where $e_i$ is the $i$-th element of the canonical basis.
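
And the corresponding sketch for the kernel case, working directly in the coefficients c with a precomputed Gram matrix (illustrative, names ours):

```python
import numpy as np

def subgradient_kernel_svm(K, y, lam, T):
    """Kernel SVM by the subgradient iteration in the coefficients c (sketch).

    K : (n, n) Gram matrix, y : (n,) labels in {-1, +1}.
    """
    n = K.shape[0]
    c = np.zeros(n)
    for t in range(1, T + 1):
        f = K @ c                                    # f_t(x_i) on the training points
        S = np.where(y * f < 1, -y, 0.0)             # sum_i S_i(c_t): -y_i on the active set
        c = c - (S / n + 2 * lam * c) / np.sqrt(t)   # step size 1/sqrt(t)
    return c
```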

Complexity. Without the representer theorem: logistic $O(ndT)$, SVM $O(ndT)$. With the representer theorem: logistic $O(n^2(d + T))$, SVM $O(n^2(d + T))$. Here $n$ is the number of examples, $d$ the dimensionality, $T$ the number of steps.

But why are these called support vector machines???

Optimality condition for SVM. Smooth convex: $\nabla F(w^*) = 0$. Non-smooth convex: $0 \in \partial F(w^*)$. Here, $0 \in \partial\left(\frac{1}{n}\sum_{i=1}^n |1 - y_i w^\top x_i|_+\right) + 2\lambda w \;\Longleftrightarrow\; w \in -\frac{1}{2\lambda}\, \partial\left(\frac{1}{n}\sum_{i=1}^n |1 - y_i w^\top x_i|_+\right)$.

Optimality condition for SVM (cont.). The optimality condition can be rewritten as $0 = \frac{1}{n}\sum_{i=1}^n (-y_i x_i c_i) + 2\lambda w$, i.e. $w = \sum_{i=1}^n x_i \frac{y_i c_i}{2\lambda n}$, where $c_i = c_i(w) \in [V'_-(-y_i w^\top x_i), V'_+(-y_i w^\top x_i)]$ (one-sided derivatives of the loss). A direct computation gives: $c_i = 1$ if $y_i f(x_i) < 1$; $0 \le c_i \le 1$ if $y_i f(x_i) = 1$; $c_i = 0$ if $y_i f(x_i) > 1$.

Support vectors. $c_i = 1$ if $y_i f(x_i) < 1$; $0 \le c_i \le 1$ if $y_i f(x_i) = 1$; $c_i = 0$ if $y_i f(x_i) > 1$. Only the points with $c_i \neq 0$ enter the solution: these are the support vectors.

Are loss functions all the same? In $\min_w \hat{\mathcal{E}}(w) + \lambda \|w\|^2$, each loss has a different target function... and different computations. The choice of the loss is problem dependent.

This class: learning and regularization (logistic regression and SVM), optimization with first-order methods, linear and non-linear parametric models, non-parametric models and kernels.

Next class: beyond penalization, regularization by projection, early stopping.