MIT 9.520/6.860, Fall 2018 Statistical Learning Theory and Applications. Class 04: Features and Kernels. Lorenzo Rosasco

Linear functions. Let $H_{\mathrm{lin}}$ be the space of linear functions $f(x) = w^\top x$. The map $f \mapsto w$ is one to one; inner product $\langle f, \bar f \rangle_H := w^\top \bar w$; norm/metric $\|f - \bar f\|_H := \|w - \bar w\|$.

An observation. The function norm controls point-wise convergence: since $|f(x) - \bar f(x)| \le \|x\| \, \|w - \bar w\|$ for all $x \in X$, then $w_j \to w$ implies $f_j(x) \to f(x)$ for all $x \in X$.

ERM. $\min_{w \in \mathbb{R}^d} \frac{1}{n}\sum_{i=1}^n (y_i - w^\top x_i)^2 + \lambda \|w\|^2$, with $\lambda \ge 0$. For $\lambda \to 0$: ordinary least squares (bias towards the minimal norm solution); for $\lambda > 0$: ridge regression (stable).

Computations. Let $\hat X_n \in \mathbb{R}^{n \times d}$ and $\hat Y \in \mathbb{R}^n$. The ridge regression solution is $\hat w_\lambda = (\hat X_n^\top \hat X_n + n\lambda I)^{-1} \hat X_n^\top \hat Y$, time $O(nd^2 \vee d^3)$, memory $O(nd \vee d^2)$; but also $\hat w_\lambda = \hat X_n^\top (\hat X_n \hat X_n^\top + n\lambda I)^{-1} \hat Y$, time $O(dn^2 \vee n^3)$, memory $O(nd \vee n^2)$.
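
As a numerical sanity check (a sketch of my own, not from the slides; the data is synthetic and names like `Xn`, `Y`, `lam` are mine), the two formulas give the same solution:

```python
import numpy as np

n, d, lam = 50, 10, 0.1
rng = np.random.default_rng(0)
Xn = rng.standard_normal((n, d))      # data matrix, one point per row
Y = rng.standard_normal(n)            # targets

# d x d solve: w = (Xn^T Xn + n*lam*I)^{-1} Xn^T Y
w_primal = np.linalg.solve(Xn.T @ Xn + n * lam * np.eye(d), Xn.T @ Y)

# n x n solve: w = Xn^T (Xn Xn^T + n*lam*I)^{-1} Y
w_dual = Xn.T @ np.linalg.solve(Xn @ Xn.T + n * lam * np.eye(n), Y)

assert np.allclose(w_primal, w_dual)  # identical up to round-off
```

Whichever of $d$ and $n$ is smaller dictates which of the two solves is cheaper, as the complexities above indicate.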

Representer theorem in disguise. We noted that $\hat w_\lambda = \hat X_n^\top c = \sum_{i=1}^n x_i c_i$, so that $\hat f_\lambda(x) = \sum_{i=1}^n x^\top x_i \, c_i$, with $c = (\hat X_n \hat X_n^\top + n\lambda I)^{-1} \hat Y$ and $(\hat X_n \hat X_n^\top)_{ij} = x_i^\top x_j$.
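
A related check (same kind of synthetic setup, names of my own): the prediction computed from $\hat w_\lambda$ coincides with the one computed from the coefficients $c$ and the inner products $x^\top x_i$.

```python
import numpy as np

n, d, lam = 50, 10, 0.1
rng = np.random.default_rng(0)
Xn, Y = rng.standard_normal((n, d)), rng.standard_normal(n)

c = np.linalg.solve(Xn @ Xn.T + n * lam * np.eye(n), Y)  # c = (Xn Xn^T + n*lam*I)^{-1} Y
w = Xn.T @ c                                             # w_lambda = Xn^T c = sum_i x_i c_i

x = rng.standard_normal(d)
assert np.isclose(x @ w, (Xn @ x) @ c)  # w^T x  ==  sum_i (x^T x_i) c_i
```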

Limits of linear functions: regression.

Limits of linear functions: classification.

Nonlinear functions. Two main possibilities: $f(x) = w^\top \Phi(x)$ or $f(x) = \Phi(w^\top x)$, where $\Phi$ is a nonlinear map. The former choice leads to linear spaces of functions¹. The latter choice can be iterated: $f(x) = \Phi(w_L^\top \Phi(w_{L-1}^\top \cdots \Phi(w_1^\top x)))$. ¹The spaces are linear, NOT the functions!

Features and feature maps. $f(x) = w^\top \Phi(x)$, where $\Phi : X \to \mathbb{R}^p$, $\Phi(x) = (\varphi_1(x), \dots, \varphi_p(x))$, and $\varphi_j : X \to \mathbb{R}$ for $j = 1, \dots, p$. Note that $X$ need not be $\mathbb{R}^d$. We can also write $f(x) = \sum_{j=1}^p w_j \varphi_j(x)$.
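
For instance (an illustrative example of mine, not from the slides), a degree-two polynomial feature map on $X = \mathbb{R}^2$ could look as follows:

```python
import numpy as np

def feature_map(x):
    """Phi: R^2 -> R^6, Phi(x) = (1, x1, x2, x1^2, x2^2, x1*x2)."""
    x1, x2 = x
    return np.array([1.0, x1, x2, x1**2, x2**2, x1 * x2])

w = np.array([0.5, -1.0, 2.0, 0.3, 0.0, 1.5])  # some weights
x = np.array([1.0, 2.0])
f_x = w @ feature_map(x)  # f(x) = w^T Phi(x): nonlinear in x, linear in w
```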

Geometric view: $f(x) = w^\top \Phi(x)$.

An example

More examples. The equation $f(x) = w^\top \Phi(x) = \sum_{j=1}^p w_j \varphi_j(x)$ suggests thinking of features as some form of basis. Indeed we can consider the Fourier basis, wavelets and their variations, ...

And even more examples. Any set of functions $\varphi_j : X \to \mathbb{R}$, $j = 1, \dots, p$, can be considered. Feature design/engineering: vision: SIFT, HOG; audio: MFCC; ...

Nonlinear functions using features. Let $H_\Phi$ be the space of linear functions $f(x) = w^\top \Phi(x)$. The map $f \mapsto w$ is one to one if the $(\varphi_j)_j$ are linearly independent; inner product $\langle f, \bar f \rangle_{H_\Phi} := w^\top \bar w$; norm/metric $\|f - \bar f\|_{H_\Phi} := \|w - \bar w\|$. In this case $|f(x) - \bar f(x)| \le \|\Phi(x)\| \, \|w - \bar w\|$ for all $x \in X$.

Back to ERM. $\min_{w \in \mathbb{R}^p} \frac{1}{n}\sum_{i=1}^n (y_i - w^\top \Phi(x_i))^2 + \lambda \|w\|^2$, $\lambda \ge 0$, which is equivalent to $\min_{f \in H_\Phi} \frac{1}{n}\sum_{i=1}^n (y_i - f(x_i))^2 + \lambda \|f\|^2_{H_\Phi}$, $\lambda \ge 0$.

Computations using features. Let $\hat\Phi \in \mathbb{R}^{n \times p}$ with $(\hat\Phi)_{ij} = \varphi_j(x_i)$. The ridge regression solution is $\hat w_\lambda = (\hat\Phi^\top \hat\Phi + n\lambda I)^{-1} \hat\Phi^\top \hat Y$, time $O(np^2 \vee p^3)$, memory $O(np \vee p^2)$; but also $\hat w_\lambda = \hat\Phi^\top (\hat\Phi \hat\Phi^\top + n\lambda I)^{-1} \hat Y$, time $O(pn^2 \vee n^3)$, memory $O(np \vee n^2)$.
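
A small sketch of the two formulas with an explicit feature matrix (synthetic data and a hypothetical degree-two feature map of my choosing):

```python
import numpy as np

n, d, lam = 40, 3, 0.1
rng = np.random.default_rng(1)
X, Y = rng.standard_normal((n, d)), rng.standard_normal(n)

def feature_map(x):
    # monomials up to degree two (with cross terms): p = 1 + d + d*(d+1)/2
    quad = np.outer(x, x)[np.triu_indices(len(x))]
    return np.concatenate(([1.0], x, quad))

Phi = np.stack([feature_map(x) for x in X])  # n x p feature matrix
p = Phi.shape[1]

# p x p solve vs. n x n solve: both give the same w_lambda
w1 = np.linalg.solve(Phi.T @ Phi + n * lam * np.eye(p), Phi.T @ Y)
w2 = Phi.T @ np.linalg.solve(Phi @ Phi.T + n * lam * np.eye(n), Y)
assert np.allclose(w1, w2)
```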

Representer theorem a little less in disguise. Analogously to before, $\hat w_\lambda = \hat\Phi^\top c = \sum_{i=1}^n \Phi(x_i) c_i$, so that $\hat f_\lambda(x) = \sum_{i=1}^n \Phi(x)^\top \Phi(x_i) \, c_i$, with $c = (\hat\Phi \hat\Phi^\top + n\lambda I)^{-1} \hat Y$, $(\hat\Phi \hat\Phi^\top)_{ij} = \Phi(x_i)^\top \Phi(x_j)$, and $\Phi(x)^\top \Phi(\bar x) = \sum_{s=1}^p \varphi_s(x)\varphi_s(\bar x)$.

Unleash the features. Can we consider linearly dependent features? Can we consider $p = \infty$?

An observation. For $X = \mathbb{R}$ consider $\varphi_j(x) = x^{j-1} e^{-x^2\gamma} \sqrt{\tfrac{(2\gamma)^{j-1}}{(j-1)!}}$, $j = 1, 2, \dots$ (so that $\varphi_1(x) = e^{-x^2\gamma}$). Then
$$\sum_{j=1}^\infty \varphi_j(x)\varphi_j(\bar x) = \sum_{j=1}^\infty x^{j-1} e^{-x^2\gamma}\sqrt{\tfrac{(2\gamma)^{j-1}}{(j-1)!}} \; \bar x^{j-1} e^{-\bar x^2\gamma}\sqrt{\tfrac{(2\gamma)^{j-1}}{(j-1)!}} = e^{-x^2\gamma} e^{-\bar x^2\gamma} \sum_{j=1}^\infty \tfrac{(2\gamma)^{j-1}}{(j-1)!}(x\bar x)^{j-1} = e^{-x^2\gamma} e^{-\bar x^2\gamma} e^{2x\bar x\gamma} = e^{-(x-\bar x)^2\gamma}.$$
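
A quick numerical check (my own sketch) that truncating this series recovers the closed form $e^{-(x-\bar x)^2\gamma}$:

```python
import numpy as np
from math import factorial

gamma, x, x_bar = 0.5, 0.7, -0.3

def phi(j, t):
    # phi_j(t) = t^(j-1) * exp(-t^2 * gamma) * sqrt((2*gamma)^(j-1) / (j-1)!)
    return t ** (j - 1) * np.exp(-t**2 * gamma) * np.sqrt((2 * gamma) ** (j - 1) / factorial(j - 1))

series = sum(phi(j, x) * phi(j, x_bar) for j in range(1, 30))  # truncated series
closed_form = np.exp(-(x - x_bar) ** 2 * gamma)                # Gaussian kernel
assert np.isclose(series, closed_form)
```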

From features to kernels. $\Phi(x)^\top \Phi(\bar x) = \sum_{j=1}^\infty \varphi_j(x)\varphi_j(\bar x) = k(x, \bar x)$. We might be able to compute the series in closed form. The function $k$ is called a kernel. Can we run ridge regression?

Kernel ridge regression. We have $\hat f_\lambda(x) = \sum_{i=1}^n \Phi(x)^\top \Phi(x_i) \, c_i = \sum_{i=1}^n k(x, x_i) c_i$, with $c = (\hat K + n\lambda I)^{-1} \hat Y$ and $(\hat K)_{ij} = \Phi(x_i)^\top \Phi(x_j) = k(x_i, x_j)$. Here $\hat K$ is the kernel matrix, the Gram (inner products) matrix of the data. This is the kernel trick.
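
Putting the pieces together, a minimal kernel ridge regression sketch with the Gaussian kernel (synthetic one-dimensional data; names such as `gamma` and `lam` are my own choices):

```python
import numpy as np

rng = np.random.default_rng(2)
n, lam, gamma = 60, 1e-3, 2.0
X = rng.uniform(-3, 3, size=(n, 1))                  # training inputs
Y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(n)   # noisy targets

def gaussian_kernel(A, B):
    # k(x, x') = exp(-||x - x'||^2 * gamma), evaluated for all pairs of rows
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq_dists * gamma)

K = gaussian_kernel(X, X)                            # n x n kernel matrix
c = np.linalg.solve(K + n * lam * np.eye(n), Y)      # c = (K + n*lam*I)^{-1} Y

X_test = np.linspace(-3, 3, 200)[:, None]
f_test = gaussian_kernel(X_test, X) @ c              # f(x) = sum_i k(x, x_i) c_i
```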

Kernels. Can we start from kernels instead of features? Which functions $k : X \times X \to \mathbb{R}$ define kernels we can use?

Positive definite kernels. A function $k : X \times X \to \mathbb{R}$ is called positive definite if the matrix $\hat K$, $(\hat K)_{ij} = k(x_i, x_j)$, is positive semidefinite for every choice of points $x_1, \dots, x_n \in X$, i.e. $a^\top \hat K a \ge 0$ for all $a \in \mathbb{R}^n$. Equivalently, $\sum_{i,j=1}^n k(x_i, x_j) a_i a_j \ge 0$ for any $a_1, \dots, a_n \in \mathbb{R}$ and $x_1, \dots, x_n \in X$.
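
An empirical sanity check of this condition (a sketch of mine, using the Gaussian kernel that appears below): sample some points, build $\hat K$, and verify that its eigenvalues are non-negative up to round-off.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.standard_normal((100, 5))

sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq_dists)                 # Gaussian kernel matrix (gamma = 1)

eigvals = np.linalg.eigvalsh(K)       # K is symmetric, so eigvalsh applies
assert eigvals.min() > -1e-8          # positive semidefinite up to round-off
```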

Inner product kernels are pos. def. Assume $\Phi : X \to \mathbb{R}^p$, $p \le \infty$, and $k(x, \bar x) = \Phi(x)^\top \Phi(\bar x)$. Note that $\sum_{i,j=1}^n k(x_i, x_j) a_i a_j = \sum_{i,j=1}^n \Phi(x_i)^\top \Phi(x_j) a_i a_j = \big\| \sum_{i=1}^n \Phi(x_i) a_i \big\|^2 \ge 0$. Clearly $k$ is symmetric.

But there are many pos. def. kernels. Classic examples: linear $k(x, \bar x) = x^\top \bar x$; polynomial $k(x, \bar x) = (x^\top \bar x + 1)^s$; Gaussian $k(x, \bar x) = e^{-\|x - \bar x\|^2 \gamma}$. But one can also consider kernels on probability distributions, kernels on strings, kernels on functions, kernels on groups, kernels on graphs, ... It is natural to think of a kernel as a measure of similarity.
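
Written out as code (a small sketch of mine; `s` and `gamma` are hyperparameters), the three classic examples read:

```python
import numpy as np

def linear_kernel(x, x_bar):
    return x @ x_bar

def polynomial_kernel(x, x_bar, s=3):
    return (x @ x_bar + 1) ** s

def gaussian_kernel(x, x_bar, gamma=1.0):
    return np.exp(-np.sum((x - x_bar) ** 2) * gamma)
```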

From pos. def. kernels to functions. Let $X$ be any set. Given a pos. def. kernel $k$, consider the space $H_k$ of functions $f(x) = \sum_{i=1}^N k(x, x_i) a_i$ for any $a_1, \dots, a_N \in \mathbb{R}$, $x_1, \dots, x_N \in X$ and any $N \in \mathbb{N}$. Define an inner product on $H_k$: for $f = \sum_{i=1}^N a_i k(\cdot, x_i)$ and $\bar f = \sum_{j=1}^{\bar N} \bar a_j k(\cdot, \bar x_j)$, set $\langle f, \bar f \rangle_{H_k} = \sum_{i=1}^N \sum_{j=1}^{\bar N} k(x_i, \bar x_j) a_i \bar a_j$. $H_k$ can be completed to a Hilbert space.
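
For a finite expansion $f = \sum_i a_i k(\cdot, x_i)$ this inner product is a quadratic form in the kernel matrix; a small numerical illustration of mine (Gaussian kernel, random centers):

```python
import numpy as np

rng = np.random.default_rng(4)
x_pts = rng.standard_normal((5, 2))   # centers x_1, ..., x_5
a = rng.standard_normal(5)            # coefficients a_1, ..., a_5

sq_dists = ((x_pts[:, None, :] - x_pts[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq_dists)                 # K_ij = k(x_i, x_j), Gaussian kernel

norm_sq = a @ K @ a                   # ||f||_{H_k}^2 = sum_ij k(x_i, x_j) a_i a_j
assert norm_sq >= 0                   # non-negative because k is pos. def.
```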

An illustration: functions defined by Gaussian kernels with large and small widths.

A key result. Theorem: given a pos. def. $k$ there exists $\Phi$ s.t. $k(x, \bar x) = \langle \Phi(x), \Phi(\bar x) \rangle_{H_k}$ and $H_\Phi \simeq H_k$. Roughly speaking, $f(x) = w^\top \Phi(x) \;\Leftrightarrow\; f(x) = \sum_{i=1}^N k(x, x_i) a_i$.

From features and kernels to RKHS and beyond. $H_k$ and $H_\Phi$ have many properties, characterizations and connections: the reproducing property, reproducing kernel Hilbert spaces (RKHS), the Mercer theorem (Karhunen-Loève expansion), Gaussian processes, Cameron-Martin spaces.

Reproducing property. Note that by definition of $H_k$, $k_x = k(x, \cdot)$ is a function in $H_k$, and for all $f \in H_k$, $x \in X$, $f(x) = \langle f, k_x \rangle_{H_k}$; this is called the reproducing property. Indeed, for $f = \sum_i a_i k(\cdot, x_i)$ the definition of the inner product gives $\langle f, k_x \rangle_{H_k} = \sum_i a_i k(x_i, x) = f(x)$. Note that $|f(x) - \bar f(x)| \le \|k_x\|_{H_k} \|f - \bar f\|_{H_k}$ for all $x \in X$. The above observations have a converse.

RKHS. Definition: a RKHS $H$ is a Hilbert space with a function $k : X \times X \to \mathbb{R}$ s.t. $k_x = k(x, \cdot) \in H$ and $f(x) = \langle f, k_x \rangle_H$. Theorem: if $H$ is a RKHS then $k$ is pos. def.

Evaluation functionals in a RKHS. If $H$ is a RKHS then the evaluation functionals $e_x(f) = f(x)$ are continuous, i.e. $|e_x(f) - e_x(\bar f)| \le \|k_x\|_H \|f - \bar f\|_H$ for all $x \in X$, since $e_x(f) = \langle f, k_x \rangle_H$. Note that $L^2(\mathbb{R}^d)$ or $C(\mathbb{R}^d)$ don't have this property!

Alternative RKHS definition. It turns out the previous property also characterizes a RKHS. Theorem: a Hilbert space with continuous evaluation functionals is a RKHS.

Summing up. From linear to nonlinear functions: using features, using kernels; plus pos. def. functions, the reproducing property, RKHS.