Reproducing Kernel Hilbert Spaces 9.520 Class 03, 15 February 2006 Andrea Caponnetto

About this class. Goal: to introduce a particularly useful family of hypothesis spaces, called Reproducing Kernel Hilbert Spaces (RKHS), and to derive the general solution of Tikhonov regularization in RKHS.

Here is a graphical example for generalization: given a certain number of samples... [figure: sample points plotted as f(x) vs. x]

suppose this is the true solution... [figure: the true function f(x) vs. x]

... but suppose ERM gives this solution! [figure: the ERM solution f(x) vs. x]

Regularization. The basic idea of regularization (originally introduced independently of the learning problem) is to restore well-posedness of ERM by constraining the hypothesis space H. The direct way, minimizing the empirical error subject to f lying in a ball of an appropriate normed function space H, is called Ivanov regularization. The indirect way is Tikhonov regularization (which is not ERM).

Ivanov regularization over normed spaces. ERM finds the function in H which minimizes $\frac{1}{n}\sum_{i=1}^n V(f(x_i), y_i)$, which in general, for an arbitrary hypothesis space H, is ill-posed. Ivanov regularizes ERM by finding the function that minimizes $\frac{1}{n}\sum_{i=1}^n V(f(x_i), y_i)$ while satisfying $\|f\|_H^2 \le A$, with $\|\cdot\|$ the norm in the normed function space H.

Function space. A function space is a space made of functions. Each function in the space can be thought of as a point. Examples: 1. $C[a,b]$, the set of all real-valued continuous functions on the interval $[a,b]$; 2. $L^1[a,b]$, the set of all real-valued functions whose absolute value is integrable on $[a,b]$; 3. $L^2[a,b]$, the set of all real-valued functions that are square integrable on $[a,b]$.

Normed space. A normed space is a linear (vector) space N in which a norm is defined. A nonnegative function $\|\cdot\|$ is a norm iff, for all $f, g \in N$ and $\alpha \in \mathbb{R}$: 1. $\|f\| \ge 0$, and $\|f\| = 0$ iff $f = 0$; 2. $\|f + g\| \le \|f\| + \|g\|$; 3. $\|\alpha f\| = |\alpha|\,\|f\|$. Note: if all conditions are satisfied except "$\|f\| = 0$ iff $f = 0$", then the space has a seminorm instead of a norm.

Examples. 1. A norm in $C[a,b]$ can be established by defining $\|f\| = \max_{a \le t \le b} |f(t)|$. 2. A norm in $L^1[a,b]$ can be established by defining $\|f\| = \int_a^b |f(t)|\,dt$. 3. A norm in $L^2[a,b]$ can be established by defining $\|f\| = \left(\int_a^b f^2(t)\,dt\right)^{1/2}$.

From Ivanov to Tikhonov regularization. Alternatively, by the Lagrange multiplier technique, Tikhonov regularization minimizes over the whole normed function space H, for a fixed positive parameter λ, the regularized functional $\frac{1}{n}\sum_{i=1}^n V(f(x_i), y_i) + \lambda \|f\|_H^2$. (1) In practice, the normed function space H that we will consider is a Reproducing Kernel Hilbert Space (RKHS).

Lagrange multiplier technique. The Lagrange multiplier technique allows the reduction of the constrained minimization problem "minimize $I(x)$ subject to $\Phi(x) \le A$" (for some A) to the unconstrained minimization problem "minimize $J(x) = I(x) + \lambda \Phi(x)$" (for some $\lambda \ge 0$).

Hilbert space. A Hilbert space is a normed space whose norm is induced by a dot product $\langle f, g\rangle$ through the relation $\|f\| = \sqrt{\langle f, f\rangle}$. A Hilbert space must also be complete and separable. Hilbert spaces generalize the finite-dimensional Euclidean spaces $\mathbb{R}^d$ and are generally infinite dimensional. Separability implies that Hilbert spaces have countable orthonormal bases.

Examples. Euclidean d-space: the set of d-tuples $x = (x_1, \dots, x_d)$ endowed with the dot product $\langle x, y\rangle = \sum_{i=1}^d x_i y_i$. The corresponding norm is $\|x\| = \sqrt{\sum_{i=1}^d x_i^2}$. The vectors $e_1 = (1,0,\dots,0)$, $e_2 = (0,1,\dots,0)$, ..., $e_d = (0,0,\dots,1)$ form an orthonormal basis, that is, $\langle e_i, e_j\rangle = \delta_{ij}$.

Examples (cont.). The function space $L^2[a,b]$, consisting of square integrable functions. The norm is induced by the dot product $\langle f, g\rangle = \int_a^b f(t)\,g(t)\,dt$. It can be proved that this space is complete and separable. An important example of an orthogonal basis in this space is the following set of functions: $1,\ \cos\frac{2\pi n t}{b-a},\ \sin\frac{2\pi n t}{b-a}$ $(n = 1, 2, \dots)$. $C[a,b]$ and $L^1[a,b]$ are not Hilbert spaces.

Evaluation functionals. A linear evaluation functional over a Hilbert space of functions H is a linear functional $F_t: H \to \mathbb{R}$ that evaluates each function in the space at the point t, i.e. $F_t[f] = f(t)$. The functional is bounded if there exists an M such that $|F_t[f]| = |f(t)| \le M \|f\|_H$ for all $f \in H$, where $\|\cdot\|_H$ is the norm in the Hilbert space of functions. We don't like the space $L^2[a,b]$ because its evaluation functionals are unbounded.

Evaluation functionals in Hilbert space. The evaluation functional is not bounded in the familiar Hilbert space $L^2([0,1])$: no such M exists, and in fact elements of $L^2([0,1])$ are not even defined pointwise. [figure: plot of f(x) vs. x]

RKHS. Definition: A (real) RKHS is a Hilbert space of real-valued functions on a domain X (a closed, bounded subset of $\mathbb{R}^d$) with the property that for each $t \in X$ the evaluation functional $F_t$ is a bounded linear functional.

Reproducing kernel (rk). If H is an RKHS, then for each $t \in X$ there exists, by the Riesz representation theorem, a function $K_t$ in H (called the representer of evaluation) with the property, called by Aronszajn the reproducing property, $F_t[f] = \langle K_t, f\rangle_K = f(t)$. Since $K_t$ is a function in H, by the reproducing property, for each $x \in X$, $K_t(x) = \langle K_t, K_x\rangle_K$. The reproducing kernel (rk) of H is $K(t, x) := K_t(x)$.

Positive definite kernels. Let X be some set, for example a subset of $\mathbb{R}^d$ or $\mathbb{R}^d$ itself. A kernel is a symmetric function $K: X \times X \to \mathbb{R}$. Definition: A kernel $K(t,s)$ is positive definite (pd) if $\sum_{i,j=1}^n c_i c_j K(t_i, t_j) \ge 0$ for any $n \in \mathbb{N}$ and any choice of $t_1, \dots, t_n \in X$ and $c_1, \dots, c_n \in \mathbb{R}$.
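As a quick numerical sanity check (a sketch of my own, not from the slides, assuming NumPy): for any finite set of points the Gram matrix $G_{ij} = K(t_i, t_j)$ of a pd kernel is symmetric positive semidefinite, so its eigenvalues should be nonnegative up to floating-point error.

```python
import numpy as np

def gaussian_kernel(s, t, sigma=1.0):
    # K(s, t) = exp(-||s - t||^2 / sigma^2)
    return np.exp(-np.linalg.norm(s - t) ** 2 / sigma ** 2)

# Random points t_1, ..., t_n in R^3
rng = np.random.default_rng(0)
T = rng.normal(size=(20, 3))

# Gram matrix G_ij = K(t_i, t_j)
G = np.array([[gaussian_kernel(ti, tj) for tj in T] for ti in T])

# Positive definiteness: c^T G c = sum_ij c_i c_j K(t_i, t_j) >= 0 for all c,
# equivalently the symmetric matrix G has no negative eigenvalues.
print("smallest eigenvalue:", np.linalg.eigvalsh(G).min())
```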

RKHS and kernels. The following theorem relates pd kernels and RKHS. Theorem: a) For every RKHS, the rk is a positive definite kernel on $X \times X$. b) Conversely, for every positive definite kernel K on $X \times X$ there is a unique RKHS on X with K as its rk.

Sketch of proof. a) We must prove that the rk $K(t, x) = \langle K_t, K_x\rangle_K$ is symmetric and pd. Symmetry follows from the symmetry property of dot products: $\langle K_t, K_x\rangle_K = \langle K_x, K_t\rangle_K$. K is pd because $\sum_{i,j=1}^n c_i c_j K(t_i, t_j) = \sum_{i,j=1}^n c_i c_j \langle K_{t_i}, K_{t_j}\rangle_K = \left\|\sum_j c_j K_{t_j}\right\|_K^2 \ge 0$.

Sketch of proof (cont.). b) Conversely, given K one can construct the RKHS H as the completion of the space of functions spanned by the set $\{K_x \mid x \in X\}$, with an inner product defined as follows. The dot product of two functions $f(x) = \sum_{i=1}^s \alpha_i K_{x_i}(x)$ and $g(x) = \sum_{i=1}^{s'} \beta_i K_{x'_i}(x)$ in $\mathrm{span}\{K_x \mid x \in X\}$ is by definition $\langle f, g\rangle_K = \sum_{i=1}^{s} \sum_{j=1}^{s'} \alpha_i \beta_j K(x_i, x'_j)$.
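To make the abstract construction concrete, here is a small sketch (my own illustration, not part of the slides): for $f = \sum_i \alpha_i K_{x_i}$ and $g = \sum_j \beta_j K_{x'_j}$ the inner product reduces to the bilinear form $\alpha^\top G \beta$ with $G_{ij} = K(x_i, x'_j)$.

```python
import numpy as np

def linear_kernel(x, xp):
    # K(x, x') = x . x'
    return float(np.dot(x, xp))

def rkhs_inner_product(alpha, X, beta, Xp, kernel):
    # <f, g>_K = sum_i sum_j alpha_i beta_j K(x_i, x'_j)
    G = np.array([[kernel(xi, xj) for xj in Xp] for xi in X])
    return alpha @ G @ beta

# f = 2 K_{x_1} - K_{x_2} and g = K_{x'_1} + 3 K_{x'_2}
X = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
Xp = [np.array([1.0, 1.0]), np.array([2.0, -1.0])]
alpha = np.array([2.0, -1.0])
beta = np.array([1.0, 3.0])

print("<f, g>_K  =", rkhs_inner_product(alpha, X, beta, Xp, linear_kernel))
print("||f||_K^2 =", rkhs_inner_product(alpha, X, alpha, X, linear_kernel))
```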

Examples of pd kernels. Very common examples of symmetric pd kernels are: the linear kernel $K(x, x') = x \cdot x'$; the Gaussian kernel $K(x, x') = e^{-\frac{\|x - x'\|^2}{\sigma^2}}$, $\sigma > 0$; the polynomial kernel $K(x, x') = (x \cdot x' + 1)^d$, $d \in \mathbb{N}$. For specific applications, designing an effective kernel is a challenging problem.
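These kernels are easy to compute on a whole dataset at once; the sketch below (a NumPy illustration of my own, not from the slides) builds the corresponding Gram matrices.

```python
import numpy as np

def linear_gram(X, Y):
    # K(x, x') = x . x'
    return X @ Y.T

def gaussian_gram(X, Y, sigma=1.0):
    # K(x, x') = exp(-||x - x'||^2 / sigma^2)
    sq_dists = (np.sum(X ** 2, axis=1)[:, None]
                + np.sum(Y ** 2, axis=1)[None, :]
                - 2 * X @ Y.T)
    return np.exp(-sq_dists / sigma ** 2)

def polynomial_gram(X, Y, d=3):
    # K(x, x') = (x . x' + 1)^d
    return (X @ Y.T + 1) ** d

X = np.random.default_rng(1).normal(size=(5, 2))
for gram in (linear_gram, gaussian_gram, polynomial_gram):
    G = gram(X, X)
    print(gram.__name__, G.shape, "min eigenvalue:", np.linalg.eigvalsh(G).min())
```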

Historical Remarks. RKHS were explicitly introduced in learning theory by Girosi (1997). Poggio and Girosi (1989) introduced Tikhonov regularization in learning theory and worked with RKHS only implicitly, because they dealt mainly with hypothesis spaces on unbounded domains, which we will not discuss here. Of course, RKHS were used much earlier in approximation theory (e.g. Wahba, 1990, ...) and computer vision (e.g. Bertero, Torre, and Poggio, 1988, ...).

Back to Tikhonov Regularization. The algorithms (Regularization Networks) that we want to study are defined by an optimization problem over an RKHS, $f_S = \arg\min_{f \in H} \frac{1}{n}\sum_{i=1}^n V(f(x_i), y_i) + \lambda \|f\|_K^2$, where the regularization parameter λ is a positive number, H is the RKHS as defined by the pd kernel $K(\cdot,\cdot)$, and $V(\cdot,\cdot)$ is a loss function.

Norms in RKHS, Complexity, and Smoothness. We measure the complexity of a hypothesis space using the RKHS norm, $\|f\|_K$. The next result illustrates how bounding the RKHS norm corresponds to enforcing some kind of simplicity or smoothness of the functions.

Regularity of functions in RKHS. Functions over X in the RKHS H induced by a pd kernel K fulfill a Lipschitz-like condition, with Lipschitz constant given by the norm $\|f\|_K$. In fact, by the Cauchy-Schwarz inequality, we get, for all $x, x' \in X$, $|f(x) - f(x')| = |\langle f, K_x - K_{x'}\rangle_K| \le \|f\|_K\, d(x, x')$, with the distance d over X defined by $d^2(x, x') = K(x,x) - 2K(x,x') + K(x',x')$.
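A quick numerical illustration (a sketch of my own, assuming a 1-D Gaussian kernel; not from the slides): build a function $f = \sum_i c_i K_{x_i}$ and check that $|f(x) - f(x')| \le \|f\|_K\, d(x, x')$.

```python
import numpy as np

def K(x, xp, sigma=1.0):
    # 1-D Gaussian kernel (broadcasts over NumPy arrays)
    return np.exp(-(x - xp) ** 2 / sigma ** 2)

rng = np.random.default_rng(2)
centers = rng.normal(size=6)    # x_1, ..., x_6
c = rng.normal(size=6)          # coefficients

def f(x):
    return sum(ci * K(xi, x) for ci, xi in zip(c, centers))

# ||f||_K^2 = c^T G c with G_ij = K(x_i, x_j)
G = K(centers[:, None], centers[None, :])
f_norm = np.sqrt(c @ G @ c)

x, xp = 0.3, -0.7
d = np.sqrt(K(x, x) - 2 * K(x, xp) + K(xp, xp))
print(abs(f(x) - f(xp)), "<=", f_norm * d)
```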

A linear example. Our function space is 1-dimensional lines, $f(x) = w x$, and $K(x, x_i) \equiv x\, x_i$. For this kernel $d^2(x, x') = K(x,x) - 2K(x,x') + K(x',x') = |x - x'|^2$, and using the RKHS norm $\|f\|_K^2 = \langle f, f\rangle_K = \langle K_w, K_w\rangle_K = K(w, w) = w^2$, so our measure of complexity is the slope of the line. We want to separate two classes using lines and see how the magnitude of the slope corresponds to a measure of complexity. We will look at three examples and see that each example requires more complicated functions, functions with greater slopes, to separate the positive examples from the negative examples.
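A tiny check of this computation (a sketch of my own, not from the slides), using the 1-D linear kernel defined above: the line $f(x) = w x$ coincides with the kernel section $K_w$, so its RKHS norm is $|w|$.

```python
def K(x, xp):
    # 1-D linear kernel: K(x, x') = x * x'
    return x * xp

w = 2.5
# f(x) = w * x coincides with K_w(x) = K(w, x), hence
# ||f||_K^2 = <K_w, K_w>_K = K(w, w) = w^2.
print(K(w, w), "==", w ** 2)

# The induced distance is the usual Euclidean one on the line:
x, xp = 1.2, -0.4
print(K(x, x) - 2 * K(x, xp) + K(xp, xp), "==", (x - xp) ** 2)
```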

A linear example (cont.). Here are three datasets; a linear function should be used to separate the classes. Notice that as the class distinction becomes finer, a larger slope is required to separate the classes. [figure: three scatter plots of f(x) vs. x]

Again Tikhonov Regularization. The algorithms (Regularization Networks) that we want to study are defined by an optimization problem over an RKHS, $f_S = \arg\min_{f \in H} \frac{1}{n}\sum_{i=1}^n V(f(x_i), y_i) + \lambda \|f\|_K^2$, where the regularization parameter λ is a positive number, H is the RKHS as defined by the pd kernel $K(\cdot,\cdot)$, and $V(\cdot,\cdot)$ is a loss function.

Common loss functions. The following two important learning techniques are implemented by different choices for the loss function $V(\cdot,\cdot)$: Regularized least squares (RLS), $V = (y - f(x))^2$; Support vector machines for classification (SVMC), $V = (1 - y f(x))_+$, where $(k)_+ \equiv \max(k, 0)$.
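A minimal sketch of the two losses in code (my own illustration, not from the slides):

```python
import numpy as np

def square_loss(fx, y):
    # RLS: V(f(x), y) = (y - f(x))^2
    return (y - fx) ** 2

def hinge_loss(fx, y):
    # SVMC: V(f(x), y) = (1 - y f(x))_+ = max(1 - y f(x), 0)
    return np.maximum(1 - y * fx, 0.0)

fx = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])   # predictions f(x)
y = np.ones_like(fx)                          # labels of the positive class
print("square:", square_loss(fx, y))
print("hinge: ", hinge_loss(fx, y))
```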

The Square Loss. For regression, a natural choice of loss function is the square loss $V(f(x), y) = (f(x) - y)^2$. [figure: the square loss plotted against y - f(x)]

The Hinge Loss. [figure: the hinge loss plotted against y·f(x)]

Existence and uniqueness of minimum. If the positive loss function $V(\cdot,\cdot)$ is convex with respect to its first entry, the functional $\Phi[f] = \frac{1}{n}\sum_{i=1}^n V(f(x_i), y_i) + \lambda \|f\|_K^2$ is strictly convex and coercive, hence it has exactly one local (global) minimum. Both the square loss and the hinge loss are convex. On the contrary, the 0-1 loss $V = \Theta(-f(x)\,y)$, where $\Theta(\cdot)$ is the Heaviside step function, is not convex.

The Representer Theorem. The minimizer over the RKHS H, $f_S$, of the regularized empirical functional $I_S[f] + \lambda \|f\|_K^2$ can be represented by the expression $f_S(x) = \sum_{i=1}^n c_i K(x_i, x)$, for some n-tuple $(c_1, \dots, c_n) \in \mathbb{R}^n$. Hence, minimizing over the (possibly infinite dimensional) Hilbert space boils down to minimizing over $\mathbb{R}^n$.
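In code, the theorem says every candidate solution is parametrized by the coefficient vector $c \in \mathbb{R}^n$; a sketch (assuming a Gaussian kernel and arbitrary illustrative coefficients, not a fitted model):

```python
import numpy as np

def gaussian_kernel(x, xp, sigma=1.0):
    return np.exp(-np.linalg.norm(x - xp) ** 2 / sigma ** 2)

def f_S(x, c, X_train, kernel=gaussian_kernel):
    # Representer theorem expansion: f_S(x) = sum_i c_i K(x_i, x)
    return sum(ci * kernel(xi, x) for ci, xi in zip(c, X_train))

rng = np.random.default_rng(3)
X_train = rng.normal(size=(10, 2))   # training inputs x_1, ..., x_n
c = rng.normal(size=10)              # hypothetical coefficients c_1, ..., c_n
print(f_S(np.zeros(2), c, X_train))
```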

Sketch of proof. Define the linear subspace of H, $H_0 = \mathrm{span}(\{K_{x_i}\}_{i=1,\dots,n})$. Let $H_0^\perp$ be the linear subspace of H, $H_0^\perp = \{f \in H \mid f(x_i) = 0,\ i = 1, \dots, n\}$. From the reproducing property of H, for all $f \in H_0^\perp$: $\langle f, \sum_i c_i K_{x_i}\rangle_K = \sum_i c_i \langle f, K_{x_i}\rangle_K = \sum_i c_i f(x_i) = 0$. $H_0^\perp$ is the orthogonal complement of $H_0$.

Sketch of proof (cont.). Every $f \in H$ can be uniquely decomposed into components along and perpendicular to $H_0$: $f = f_0 + f_0^\perp$. Since by orthogonality $\|f_0 + f_0^\perp\|^2 = \|f_0\|^2 + \|f_0^\perp\|^2$, and by the reproducing property $I_S[f_0 + f_0^\perp] = I_S[f_0]$, then $I_S[f_0] + \lambda \|f_0\|_K^2 \le I_S[f_0 + f_0^\perp] + \lambda \|f_0 + f_0^\perp\|_K^2$. Hence the minimizer $f_S^\lambda = f_0$ must belong to the linear subspace $H_0$.

Applying the Representer Theorem. Using the representer theorem, the minimization problem over H, $\min_{f \in H} I_S[f] + \lambda \|f\|_K^2$, can be easily reduced to a minimization problem over $\mathbb{R}^n$, $\min_{c \in \mathbb{R}^n} g(c)$, for a suitable function $g(\cdot)$. If the loss function is convex, then g is convex, and finding the minimum reduces to computing a subgradient of g. In particular, if the loss function is differentiable (e.g. the square loss), we just have to compute (and set to zero) the gradient of g.
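For the square loss the reduction is explicit and gives the well-known RLS solution (anticipating the next class; a sketch of my own, not part of these slides): with $f(x) = \sum_j c_j K(x_j, x)$ the functional becomes $g(c) = \frac{1}{n}\|y - Kc\|^2 + \lambda\, c^\top K c$, and setting the gradient to zero leads to the linear system $(K + \lambda n I)c = y$.

```python
import numpy as np

def gaussian_gram(X, sigma=1.0):
    sq = np.sum(X ** 2, axis=1)
    return np.exp(-(sq[:, None] + sq[None, :] - 2 * X @ X.T) / sigma ** 2)

def rls_fit(X, y, lam, sigma=1.0):
    # Square loss: g(c) = (1/n) ||y - K c||^2 + lam * c^T K c.
    # Zero gradient  =>  (K + lam * n * I) c = y.
    n = len(y)
    K = gaussian_gram(X, sigma)
    c = np.linalg.solve(K + lam * n * np.eye(n), y)
    return c, K

rng = np.random.default_rng(5)
X = rng.uniform(-1, 1, size=(30, 1))
y = np.sin(3 * X[:, 0]) + 0.1 * rng.normal(size=30)
c, K = rls_fit(X, y, lam=1e-3)
print("training residual:", np.linalg.norm(K @ c - y))
```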

Tikhonov Regularization for RLS and SVMs In the next two classes we will study Tikhonov regularization with different loss functions for both regression and classification. We will start with the square loss and then consider SVM loss functions.