Reproducing Kernel Hilbert Spaces

Lorenzo Rosasco, 9.520 Class 03, February 12, 2007

About this class
Goal: to introduce a particularly useful family of hypothesis spaces called Reproducing Kernel Hilbert Spaces (RKHS) and to derive the general solution of Tikhonov regularization in RKHS.

Function Approximation from Samples
Here is a graphical example for generalization: given a certain number of samples... [plot of the sample points, f(x) vs. x]

Function Approximation from Samples (cont.)
Suppose this is the true solution... [plot of the true function, f(x) vs. x]

The Problem is Ill-Posed
... but suppose ERM gives this solution! [plot of the ERM solution, f(x) vs. x]

Regularization
The basic idea of regularization (originally introduced independently of the learning problem) is to restore well-posedness of ERM by constraining the hypothesis space H.
Penalized Minimization. A possible way to do this is to consider penalized empirical risk minimization, that is, we look for solutions minimizing a two-term functional
    ERR(f) + λ pen(f)
where ERR(f) is the empirical error, pen(f) is the penalization term, and the regularization parameter λ trades off the two terms.

Tikhonov Regularization
Tikhonov regularization amounts to minimizing
    (1/n) Σ_{i=1}^n V(f(x_i), y_i) + λ ||f||²_H,   λ > 0
where V(f(x), y) is the loss function, that is, the price we pay when we predict f(x) in place of y, and ||·||_H is the norm in the function space H. Such a penalization term should encode some notion of smoothness of f.
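To make the two terms concrete, here is a minimal numerical sketch with the square loss and a linear model f(x) = ⟨w, x⟩, for which the RKHS norm of f is ||w||. It assumes NumPy; the function name and toy data are illustrative and not part of the class material.

```python
import numpy as np

def tikhonov_objective(w, X, y, lam):
    """Penalized empirical risk for a linear model f(x) = <w, x> with the
    square loss: (1/n) * sum_i (f(x_i) - y_i)^2 + lam * ||w||^2."""
    residuals = X @ w - y                      # f(x_i) - y_i, i = 1, ..., n
    empirical_error = np.mean(residuals ** 2)  # ERR(f)
    penalty = np.dot(w, w)                     # pen(f) = ||f||_H^2 for a linear f
    return empirical_error + lam * penalty

# Toy data: a larger lam pushes the minimizer toward smaller-norm (flatter) solutions.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=20)
print(tikhonov_objective(np.zeros(3), X, y, lam=0.1))
```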

The "Ingredients" of Tikhonov Regularization The scheme we just described is very general and by choosing different loss functions V(f(x), y) we can recover different algorithms The main point we want to discuss is how to choose a norm encoding some notion of smoothness/complexity of the solution Reproducing Kernel Hilbert Spaces allow us to do this in a very powerful way

Some Functional Analysis
A function space F is a space whose elements are functions f, for example f : R^d → R.
A norm is a nonnegative function ||·|| such that, for all f, g ∈ F and α ∈ R,
  1. ||f|| ≥ 0 and ||f|| = 0 iff f = 0;
  2. ||f + g|| ≤ ||f|| + ||g||;
  3. ||αf|| = |α| ||f||.
A norm can be defined via a dot product, ||f|| = √⟨f, f⟩.
A Hilbert space (besides other technical conditions) is a (possibly) infinite dimensional linear space endowed with a dot product.

Examples
Continuous functions C[a, b]: a norm can be established by defining (not a Hilbert space!)
    ||f|| = max_{a ≤ x ≤ b} |f(x)|.
Square integrable functions L²[a, b]: it is a Hilbert space where the norm is induced by the dot product
    ⟨f, g⟩ = ∫_a^b f(x) g(x) dx.

RKHS
A linear evaluation functional over the Hilbert space of functions H is a linear functional F_t : H → R that evaluates each function in the space at the point t, or F_t[f] = f(t).
Definition. A Hilbert space H is a reproducing kernel Hilbert space (RKHS) if the evaluation functionals are bounded, i.e. if there exists an M such that
    |F_t[f]| = |f(t)| ≤ M ||f||_H    for all f ∈ H.

Evaluation functionals
Evaluation functionals are not always bounded. Consider L²[a, b]: each element of the space is an equivalence class of functions with the same integral ∫ |f(x)|² dx. The integral remains the same if we change the function on a countable set of points, so the value at a single point is not even well defined for an element of L²[a, b].

Reproducing kernel (rk)
If H is a RKHS, then for each t ∈ X there exists, by the Riesz representation theorem, a function K_t of H (called representer) with the reproducing property
    F_t[f] = ⟨K_t, f⟩_H = f(t).
Since K_t is a function in H, by the reproducing property, for each x ∈ X
    K_t(x) = ⟨K_t, K_x⟩_H.
The reproducing kernel (rk) of H is K(t, x) := K_t(x).
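Combining the reproducing property with the Cauchy-Schwarz inequality makes the bound in the RKHS definition explicit; this short derivation is added here for completeness and is not in the original slides.

```latex
|F_t[f]| = |f(t)| = |\langle K_t, f \rangle_{\mathcal{H}}|
\le \|K_t\|_{\mathcal{H}} \, \|f\|_{\mathcal{H}}
= \sqrt{\langle K_t, K_t \rangle_{\mathcal{H}}} \, \|f\|_{\mathcal{H}}
= \sqrt{K(t,t)} \, \|f\|_{\mathcal{H}}
```

So for each t one can take M = √K(t, t), which is finite whenever the kernel is bounded on the diagonal.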

Positive definite kernels
Let X be some set, for example a subset of R^d or R^d itself. A kernel is a symmetric function K : X × X → R.
Definition. A kernel K(t, s) is positive definite (pd) if
    Σ_{i,j=1}^n c_i c_j K(t_i, t_j) ≥ 0
for any n ∈ N and any choice of t_1, ..., t_n ∈ X and c_1, ..., c_n ∈ R.
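A quick numerical sanity check of this definition (a sketch assuming NumPy, using the Gaussian kernel that appears a few slides below; not part of the original class): on any finite set of points, a pd kernel produces a symmetric Gram matrix whose eigenvalues are all nonnegative.

```python
import numpy as np

def gram_matrix(kernel, points):
    """Gram matrix with entries K_ij = kernel(t_i, t_j)."""
    n = len(points)
    return np.array([[kernel(points[i], points[j]) for j in range(n)]
                     for i in range(n)])

def gaussian_kernel(s, t, sigma=1.0):
    """K(s, t) = exp(-||s - t||^2 / sigma^2)."""
    return np.exp(-np.sum((s - t) ** 2) / sigma ** 2)

rng = np.random.default_rng(0)
points = [rng.normal(size=2) for _ in range(10)]
K = gram_matrix(gaussian_kernel, points)

# A pd kernel gives a positive semidefinite Gram matrix,
# so all eigenvalues should be (numerically) nonnegative.
print(np.linalg.eigvalsh(K).min() >= -1e-10)
```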

RKHS and kernels
The following theorem relates pd kernels and RKHS.
Theorem.
a) For every RKHS the reproducing kernel is a positive definite kernel.
b) Conversely, for every positive definite kernel K on X × X there is a unique RKHS on X with K as its reproducing kernel.

Sketch of proof
a) We must prove that the rk K(t, x) = ⟨K_t, K_x⟩_H is symmetric and pd.
Symmetry follows from the symmetry property of dot products,
    ⟨K_t, K_x⟩_H = ⟨K_x, K_t⟩_H.
K is pd because
    Σ_{i,j=1}^n c_i c_j K(t_i, t_j) = Σ_{i,j=1}^n c_i c_j ⟨K_{t_i}, K_{t_j}⟩_H = ||Σ_j c_j K_{t_j}||²_H ≥ 0.

Sketch of proof (cont.)
b) Conversely, given K one can construct the RKHS H as the completion of the space of functions spanned by the set {K_x | x ∈ X}, with an inner product defined as follows.
The dot product of two functions f and g in span{K_x | x ∈ X},
    f(x) = Σ_{i=1}^s α_i K_{x_i}(x),
    g(x) = Σ_{i=1}^s β_i K_{x_i}(x),
is by definition
    ⟨f, g⟩_H = Σ_{i,j=1}^s α_i β_j K(x_i, x_j).
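In coordinates, this inner product is just a bilinear form in the coefficient vectors: ⟨f, g⟩_H = αᵀ K β, where K is the Gram matrix on the points x_1, ..., x_s. A minimal sketch assuming NumPy; the kernel, points, and coefficients are illustrative.

```python
import numpy as np

def linear_kernel(s, t):
    """K(s, t) = s . t (a pd kernel)."""
    return np.dot(s, t)

def rkhs_inner_product(alpha, beta, K):
    """<f, g>_H = sum_{i,j} alpha_i beta_j K(x_i, x_j) = alpha^T K beta,
    for f = sum_i alpha_i K_{x_i} and g = sum_j beta_j K_{x_j}."""
    return alpha @ K @ beta

x = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([1.0, 1.0])]
K = np.array([[linear_kernel(xi, xj) for xj in x] for xi in x])

alpha = np.array([1.0, -0.5, 0.2])   # coefficients of f
beta = np.array([0.3, 0.3, 0.3])     # coefficients of g
print(rkhs_inner_product(alpha, beta, K))               # <f, g>_H
print(np.sqrt(rkhs_inner_product(alpha, alpha, K)))     # ||f||_H
```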

Examples of pd kernels
Very common examples of symmetric pd kernels are:
Linear kernel: K(x, x') = x · x'
Gaussian kernel: K(x, x') = exp(−||x − x'||² / σ²), σ > 0
Polynomial kernel: K(x, x') = (x · x' + 1)^d, d ∈ N
For specific applications, designing an effective kernel is a challenging problem.
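For concreteness, the three kernels might be written as follows (a minimal sketch assuming NumPy; σ and d are the kernel parameters from the slide, with illustrative default values).

```python
import numpy as np

def linear_kernel(x, z):
    """K(x, x') = x . x'"""
    return np.dot(x, z)

def gaussian_kernel(x, z, sigma=1.0):
    """K(x, x') = exp(-||x - x'||^2 / sigma^2), sigma > 0"""
    return np.exp(-np.sum((x - z) ** 2) / sigma ** 2)

def polynomial_kernel(x, z, d=2):
    """K(x, x') = (x . x' + 1)^d, d a positive integer"""
    return (np.dot(x, z) + 1.0) ** d

x = np.array([1.0, 2.0])
z = np.array([0.5, -1.0])
print(linear_kernel(x, z), gaussian_kernel(x, z), polynomial_kernel(x, z))
```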

Historical Remarks
RKHS were explicitly introduced in learning theory by Girosi (1997). Poggio and Girosi (1989) introduced Tikhonov regularization in learning theory and worked with RKHS only implicitly, because they dealt mainly with hypothesis spaces on unbounded domains, which we will not discuss here. Of course, RKHS were used much earlier in approximation theory (e.g. Wahba, 1990, ...) and computer vision (e.g. Bertero, Torre, Poggio, 1988, ...).

Norms in RKHS and Smoothness
Choosing different kernels, one can show that the norm in the corresponding RKHS encodes different notions of smoothness.
Sobolev kernel: consider f : [0, 1] → R with f(0) = f(1) = 0. The norm
    ||f||²_H = ∫ (f'(x))² dx = ∫ ω² |F(ω)|² dω
is induced by the kernel
    K(x, y) = Θ(y − x)(1 − y)x + Θ(x − y)(1 − x)y.
Gaussian kernel: the norm can be written as
    ||f||²_H = (1 / (2π)^d) ∫ |F(ω)|² exp(σ²ω²/2) dω
where F(ω) = F{f}(ω) = ∫ f(t) e^{−iωt} dt is the Fourier transform of f.

Linear Case
Our function space consists of 1-dimensional lines, f(x) = wx, with kernel K(x, x_i) = x x_i, where the RKHS norm is simply
    ||f||²_H = ⟨f, f⟩_H = ⟨K_w, K_w⟩_H = K(w, w) = w²,
so that our measure of complexity is the slope of the line. We want to separate two classes using lines and see how the magnitude of the slope corresponds to a measure of complexity. We will look at three examples and see that each example requires more "complicated" functions, functions with greater slopes, to separate the positive examples from the negative examples.

Linear case (cont.)
Here are three datasets: a linear function should be used to separate the classes. Notice that as the class distinction becomes finer, a larger slope is required to separate the classes. [three plots of f(x) vs. x, with separating lines of increasing slope]

Again Tikhonov Regularization
The algorithms (Regularization Networks) that we want to study are defined by an optimization problem over an RKHS,
    f^λ_S = argmin_{f ∈ H} (1/n) Σ_{i=1}^n V(f(x_i), y_i) + λ ||f||²_H
where the regularization parameter λ is a positive number, H is the RKHS as defined by the pd kernel K(·, ·), and V(·, ·) is a loss function. Note that H is possibly infinite dimensional!

Existence and uniqueness of minimum
If the positive loss function V(·, ·) is convex with respect to its first entry, the functional
    Φ[f] = (1/n) Σ_{i=1}^n V(f(x_i), y_i) + λ ||f||²_H
is strictly convex and coercive, hence it has exactly one local (global) minimum.
Both the squared loss and the hinge loss are convex. On the contrary, the 0-1 loss
    V = Θ(−f(x)y),
where Θ(·) is the Heaviside step function, is not convex.

The Representer Theorem
An important result: the minimizer over the RKHS H, f^λ_S, of the regularized empirical functional
    I_S[f] + λ ||f||²_H
can be represented by the expression
    f^λ_S(x) = Σ_{i=1}^n c_i K(x_i, x),
for some n-tuple (c_1, ..., c_n) ∈ R^n. Hence, minimizing over the (possibly infinite dimensional) Hilbert space boils down to minimizing over R^n.
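As a preview of how this result is used (the full derivation with the square loss is given in a later class): restricting f to the representer form f(x) = Σ_i c_i K(x_i, x) turns Tikhonov regularization with the square loss into a finite-dimensional problem whose solution satisfies the linear system (K + λnI)c = y, with K the kernel matrix on the training points. A minimal sketch assuming NumPy and a Gaussian kernel; the function names and toy data are illustrative.

```python
import numpy as np

def gaussian_kernel_matrix(A, B, sigma=1.0):
    """Matrix with entries exp(-||a_i - b_j||^2 / sigma^2)."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dists / sigma ** 2)

def rls_fit(X, y, lam, sigma=1.0):
    """Coefficients c solving (K + lam * n * I) c = y (regularized least squares)."""
    n = X.shape[0]
    K = gaussian_kernel_matrix(X, X, sigma)
    return np.linalg.solve(K + lam * n * np.eye(n), y)

def rls_predict(X_train, c, X_test, sigma=1.0):
    """f(x) = sum_i c_i K(x_i, x) evaluated at the test points."""
    return gaussian_kernel_matrix(X_test, X_train, sigma) @ c

# Toy 1-d regression problem.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(30, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=30)
c = rls_fit(X, y, lam=0.01)
print(rls_predict(X, c, np.array([[0.0], [1.5]])))
```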

Sketch of proof
Define the linear subspace of H
    H_0 = span({K_{x_i}}_{i=1,...,n})
and let H_0^⊥ be the linear subspace of H
    H_0^⊥ = {f ∈ H | f(x_i) = 0, i = 1, ..., n}.
From the reproducing property of H, for all f ∈ H_0^⊥,
    ⟨f, Σ_i c_i K_{x_i}⟩_H = Σ_i c_i ⟨f, K_{x_i}⟩_H = Σ_i c_i f(x_i) = 0.
Hence H_0^⊥ is the orthogonal complement of H_0.

Sketch of proof (cont.)
Every f ∈ H can be uniquely decomposed into components along and perpendicular to H_0:
    f = f_0 + f_0^⊥.
Since, by orthogonality,
    ||f_0 + f_0^⊥||² = ||f_0||² + ||f_0^⊥||²,
and, by the reproducing property,
    I_S[f_0 + f_0^⊥] = I_S[f_0],
then
    I_S[f_0] + λ ||f_0||²_H ≤ I_S[f_0 + f_0^⊥] + λ ||f_0 + f_0^⊥||²_H.
Hence the minimizer f^λ_S = f_0 must belong to the linear space H_0.

Common Loss Functions
The following two important learning techniques are implemented by different choices for the loss function V(·, ·):
Regularized least squares (RLS): V = (y − f(x))²
Support vector machines for classification (SVMC): V = (1 − yf(x))_+
where (k)_+ := max(k, 0).
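In code, the two losses can be written directly (a small sketch assuming NumPy; both act elementwise on vectors of predictions f(x) and labels y).

```python
import numpy as np

def square_loss(f_x, y):
    """RLS loss: V(f(x), y) = (y - f(x))^2."""
    return (y - f_x) ** 2

def hinge_loss(f_x, y):
    """SVMC loss: V(f(x), y) = (1 - y * f(x))_+ with (k)_+ = max(k, 0)."""
    return np.maximum(1.0 - y * f_x, 0.0)

f_x = np.array([0.8, -0.3, 2.0])   # predictions
y = np.array([1.0, 1.0, -1.0])     # labels (+1 / -1 for classification)
print(square_loss(f_x, y))   # [0.04 1.69 9.  ]
print(hinge_loss(f_x, y))    # [0.2 1.3 3. ]
```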

The Square Loss
For regression, a natural choice of loss function is the square loss
    V(f(x), y) = (f(x) − y)².
[plot of the L2 loss as a function of y − f(x)]

The Hinge Loss
[plot of the hinge loss V = (1 − yf(x))_+ as a function of y·f(x)]

Tikhonov Regularization for RLS and SVMs
In the next two classes we will study Tikhonov regularization with different loss functions for both regression and classification. We will start with the square loss and then consider SVM loss functions.