The Representor Theorem, Kernels, and Hilbert Spaces

We will now work with infinite-dimensional feature vectors and parameter vectors. The space $\ell^2$ is defined to be the set of sequences $f_1, f_2, f_3, \ldots$ which have finite norm, i.e., where we have the following.

$$\|f\|^2 \equiv \sum_i f_i^2 < \infty \qquad (1)$$

We are now interested in regression and classification with infinite-dimensional feature vectors and weight parameters. In other words we have $\Phi(x) \in \ell^2$ and $\beta \in \ell^2$. In practice there is essentially no difference between the infinite-dimensional case and the finite-dimensional case with $\Phi(x), \beta \in \mathbb{R}^d$ but where $d \gg T$, i.e., where the dimension is large compared to the size of the training data.

It is possible to prove an infinite-dimensional version of the Cauchy-Schwarz inequality:

$$\left|\sum_i f_i g_i\right| \le \|f\|\,\|g\|$$

This inequality also implies that we have the following for any two vectors $f$ and $g$ in $\ell^2$.

$$\sum_i |f_i g_i| < \infty$$

In other words, sums of products of features are absolutely convergent. Absolutely convergent sums have the property that the terms in the sum can be rearranged in any order while preserving the value of the sum.

Now consider the general regularized regression equation (see the notes on regularized regression).

$$\beta^* = \operatorname*{argmin}_\beta \sum_t L(m_t(\beta)) + \lambda\|\beta\|^2 \qquad (2)$$

$$m_t(\beta) = y_t\,(\beta \cdot \Phi(x_t))$$

This regression equation is well defined in the infinite-dimensional case except that $\beta^*$ may have infinite norm. We will assume in these notes that $\beta^*$ has finite norm. We can minimize $\|\beta\|^2$ while holding $m_t(\beta)$ constant for all $t$ by removing the component of $\beta$ orthogonal to all of the vectors $\Phi(x_t)$. Without loss of generality we can therefore assume that $\beta^*$ is in the span of the vectors $\Phi(x_t)$.
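As a numerical sanity check on this span argument (a sketch added here, not part of the original notes), the following NumPy snippet minimizes a finite-dimensional stand-in for objective (2) with a smooth logistic loss $L(m) = \log(1 + e^{-m})$ and $d \gg T$ random feature vectors, then verifies that the minimizer lies in the span of the training feature vectors. The random features, loss choice, step size, and iteration count are all illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
T, d = 5, 20                         # T training points, d-dimensional features with d >> T
Phi = rng.normal(size=(T, d))        # row t stands in for Phi(x_t)
y = rng.choice([-1.0, 1.0], size=T)  # labels y_t in {-1, +1}
lam = 0.1

def grad(beta):
    m = y * (Phi @ beta)                     # margins m_t(beta) = y_t (beta . Phi(x_t))
    dL = -y / (1.0 + np.exp(m))              # derivative of the logistic loss log(1 + e^{-m})
    return Phi.T @ dL + 2.0 * lam * beta     # gradient of sum_t L(m_t(beta)) + lam ||beta||^2

beta = np.zeros(d)
for _ in range(20000):                       # plain gradient descent on the convex objective
    beta -= 0.01 * grad(beta)

alpha, _, _, _ = np.linalg.lstsq(Phi.T, beta, rcond=None)  # best coefficients in span{Phi(x_t)}
print(np.linalg.norm(beta - Phi.T @ alpha))                # ~0: the minimizer lies in the span

The printed residual is numerically zero, which is exactly the fact that the representor theorem below turns into a reparameterization.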

$$\beta = \sum_t \alpha_t \Phi(x_t) \qquad (3)$$

Equation (3) is called the representor theorem. The representor theorem yields the following.

$$\beta \cdot \Phi(x) = \sum_t \alpha_t\,(\Phi(x_t) \cdot \Phi(x)) \qquad (4)$$

$$= \sum_t \alpha_t K(x_t, x) \qquad (5)$$

$$K(x_1, x_2) \equiv \Phi(x_1) \cdot \Phi(x_2) \qquad (6)$$

Equation (6) introduces $K(x_1, x_2)$ as an abbreviation for $\Phi(x_1) \cdot \Phi(x_2)$. However, it is often possible to compute $K(x_1, x_2)$ efficiently without computing the infinite feature vectors $\Phi(x_1)$ or $\Phi(x_2)$. We will consider a variety of easily computed functions $K$. We now have the following definition.

Definition: A function $K$ on $X \times X$ is called a kernel function if there exists a function $\Phi$ mapping $X$ into $\ell^2$ such that for any $x_1, x_2 \in X$ we have that $K(x_1, x_2) = \Phi(x_1) \cdot \Phi(x_2)$.

We will show below that for $x_1, x_2 \in \mathbb{R}^q$ the functions $(x_1 \cdot x_2 + 1)^p$ and $\exp\left(-\frac{1}{2}(x_1 - x_2)^\top \Sigma^{-1}(x_1 - x_2)\right)$ are both kernels. The first is called a polynomial kernel and the second is called a Gaussian kernel. The Gaussian kernel is particularly widely used. For the Gaussian kernel we have that $K(x_1, x_2) \le 1$, where equality is achieved when $x_1 = x_2$. In this case $K(x_1, x_2)$ expresses a nearness of $x_1$ to $x_2$. When $K$ is a Gaussian kernel, equation (5) can be viewed as classifying $x$ using a weighted nearest-neighbor rule where $K(x_t, x)$ gives the nearness of $x_t$ to $x$.
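The following Python sketch (with assumed toy inputs, not from the notes) implements these two kernels directly and evaluates a predictor through equation (5); note that no feature vector $\Phi(x)$ is ever materialized. The helper names and the choice $\Sigma^{-1} = I$ are illustrative.

import numpy as np

def polynomial_kernel(x1, x2, p=3):
    # (x1 . x2 + 1)^p
    return (np.dot(x1, x2) + 1.0) ** p

def gaussian_kernel(x1, x2, Sigma_inv):
    # exp(-1/2 (x1 - x2)^T Sigma^{-1} (x1 - x2)); equals 1 exactly when x1 == x2
    diff = x1 - x2
    return np.exp(-0.5 * diff @ Sigma_inv @ diff)

def predict(x, train_xs, alphas, kernel):
    # Equation (5): beta . Phi(x) = sum_t alpha_t K(x_t, x)
    return sum(a * kernel(xt, x) for a, xt in zip(alphas, train_xs))

# assumed toy usage
train_xs = [np.array([0.0, 0.0]), np.array([1.0, 1.0])]
alphas = [1.0, -1.0]
Sigma_inv = np.eye(2)
print(predict(np.array([0.2, 0.1]), train_xs, alphas,
              lambda a, b: gaussian_kernel(a, b, Sigma_inv)))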

In order to use equation (2) with a Gaussian kernel we need an expression for $\lambda\|\beta\|^2$ in terms of the parameters $\alpha_t$. This can be done as follows.

$$\|\beta\|^2 = \beta \cdot \beta = \left(\sum_{t=1}^T \alpha_t \Phi(x_t)\right) \cdot \left(\sum_{s=1}^T \alpha_s \Phi(x_s)\right)$$

$$= \sum_{t,s} \alpha_t\,(\Phi(x_t) \cdot \Phi(x_s))\,\alpha_s$$

$$= \sum_{t,s} \alpha_t K(x_t, x_s)\,\alpha_s$$

$$= \alpha^\top K \alpha, \qquad K_{t,s} = K(x_t, x_s) \qquad (7)$$

The matrix $K$ defined by (7) is called the kernel matrix, or sometimes the Gram matrix. Equation (2) can now be rewritten in terms of $\alpha$ as follows.

$$\alpha^* = \operatorname*{argmin}_\alpha \sum_t L(m_t(\alpha)) + \lambda\alpha^\top K\alpha \qquad (8)$$

In (8) the margin $m_t(\alpha)$ is computed using (5). The significance of (8) is that the feature vectors $\Phi(x)$ and the parameter vector $\beta$ are never computed. Instead, the learning is specified by a kernel function $K$, such as a Gaussian kernel, and a loss function $L$, with no explicit reference to features. When $L$ is taken to be hinge loss, the resulting optimization problem in $\alpha$ is a convex quadratic program. This is the kernel form of a support vector machine (SVM). Equation (8) can also be viewed as a way of setting the weights $\alpha$ in a nearest-neighbor rule. Empirically, (8) works better than other weight-setting heuristics.
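The notes leave $L$ generic. As one concrete instance with a closed-form solution (a hedged illustration, not the hinge-loss SVM just mentioned), take square loss $L(m) = (1 - m)^2$ with labels $y_t \in \{-1, +1\}$, so that $(1 - m_t(\alpha))^2 = (y_t - f(x_t))^2$; equation (8) is then minimized by $\alpha = (K + \lambda I)^{-1} y$, which is kernel ridge regression. The sketch below uses assumed toy data.

import numpy as np

def gaussian_gram(X, sigma=1.0):
    # K_{t,s} = K(x_t, x_s) for a Gaussian kernel with Sigma = sigma^2 I
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-0.5 * sq / sigma ** 2)

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))                           # assumed toy inputs
y = np.where(X[:, 0] + X[:, 1] > 0, 1.0, -1.0)         # labels in {-1, +1}

K = gaussian_gram(X)
lam = 0.5
alpha = np.linalg.solve(K + lam * np.eye(len(y)), y)   # minimizer of (8) under square loss

f = K @ alpha                       # f(x_t) = sum_s alpha_s K(x_s, x_t), via equation (5)
print(np.mean(np.sign(f) == y))     # training accuracy of the kernelized predictor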

1 Some Closure Properties on Kernels

Note that any kernel function $K$ must be symmetric, i.e., $K(x_1, x_2) = K(x_2, x_1)$. It must also be positive semidefinite, i.e., $K(x, x) \ge 0$.

If $K$ is a kernel and $\alpha > 0$ then $\alpha K$ is also a kernel. To see this let $\Phi$ be a feature map for $K$. Define $\Phi'$ so that $\Phi'(x) = \sqrt{\alpha}\,\Phi(x)$. We then have that $\Phi'(x_1) \cdot \Phi'(x_2) = \alpha K(x_1, x_2)$. Note that for $\alpha < 0$ we have that $\alpha K$ is not positive semidefinite and hence cannot be a kernel.

If $K_1$ and $K_2$ are kernels then $K_1 + K_2$ is a kernel. To see this let $\Phi_1$ be a feature map for $K_1$ and let $\Phi_2$ be a feature map for $K_2$. Let $\Phi_3$ be the feature map defined as follows.

$$\Phi_3(x) = (f_1(x),\, g_1(x),\, f_2(x),\, g_2(x),\, f_3(x),\, g_3(x),\, \ldots)$$

$$\Phi_1(x) = (f_1(x),\, f_2(x),\, f_3(x),\, \ldots)$$

$$\Phi_2(x) = (g_1(x),\, g_2(x),\, g_3(x),\, \ldots)$$

We then have that $\Phi_3(x_1) \cdot \Phi_3(x_2)$ equals $\Phi_1(x_1) \cdot \Phi_1(x_2) + \Phi_2(x_1) \cdot \Phi_2(x_2)$, and hence $\Phi_3$ is the desired feature map for $K_1 + K_2$.

If $K_1$ and $K_2$ are kernels then so is the product $K_1 K_2$. To see this let $\Phi_1$ be a feature map for $K_1$ and let $\Phi_2$ be a feature map for $K_2$. Let $f_i(x)$ be the $i$th feature value under the feature map $\Phi_1$ and let $g_i(x)$ be the $i$th feature value under the feature map $\Phi_2$. We now have the following.

$$K_1(x_1, x_2)\,K_2(x_1, x_2) = (\Phi_1(x_1) \cdot \Phi_1(x_2))\,(\Phi_2(x_1) \cdot \Phi_2(x_2))$$

$$= \left(\sum_i f_i(x_1) f_i(x_2)\right)\left(\sum_j g_j(x_1) g_j(x_2)\right)$$

$$= \sum_{i,j} f_i(x_1) f_i(x_2)\,g_j(x_1) g_j(x_2)$$

$$= \sum_{i,j} (f_i(x_1) g_j(x_1))\,(f_i(x_2) g_j(x_2))$$

We can now define a feature map $\Phi_3$ with a feature $h_{i,j}(x)$ for each pair $(i, j)$ defined as follows: $h_{i,j}(x) = f_i(x)\,g_j(x)$. We then have that $K_1(x_1, x_2)\,K_2(x_1, x_2)$ is $\Phi_3(x_1) \cdot \Phi_3(x_2)$ where the inner product sums over all pairs $(i, j)$. Since the number of such pairs is countable, we can enumerate the pairs in a linear sequence to get $\Phi_3(x) \in \ell^2$.

It follows from these closure properties that if $p$ is a polynomial with positive coefficients and $K$ is a kernel, then $p(K(x_1, x_2))$ is also a kernel. This proves that polynomial kernels are kernels. One can also give a direct proof that if $K$ is a kernel and $p$ is a convergent infinite power series with positive coefficients (a convergent infinite polynomial), then $p(K(x_1, x_2))$ is a kernel. The proof is similar to the proof that a product of kernels is a kernel but uses a countable set of higher-order moments as features. The result for infinite power series can then be used to prove that a Gaussian kernel is a kernel. These proofs are homework problems for these notes. Unlike most proofs in the literature, we do not require compactness of the set $X$ on which the Gaussian kernel is defined.
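The following numerical sanity check (added here; not a proof, and not from the notes) builds the Gram matrices of a polynomial kernel and a Gaussian kernel on assumed random points and confirms that their entrywise sum and product are still positive semidefinite, as the closure arguments above predict.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))        # assumed sample points

def gram(kernel):
    # Gram matrix K_{t,s} = kernel(x_t, x_s) over the sample
    return np.array([[kernel(a, b) for b in X] for a in X])

K1 = gram(lambda a, b: (a @ b + 1.0) ** 2)                     # polynomial kernel, p = 2
K2 = gram(lambda a, b: np.exp(-0.5 * np.sum((a - b) ** 2)))    # Gaussian kernel, Sigma = I

for name, K in [("K1 + K2", K1 + K2), ("K1 * K2", K1 * K2)]:
    # the Gram matrix of any kernel has nonnegative eigenvalues (up to round-off)
    print(name, np.linalg.eigvalsh(K).min() >= -1e-8)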

2 Hilbert Space

The set $\ell^2$ is an infinite-dimensional Hilbert space. In fact, all Hilbert spaces with a countable basis are isomorphic to $\ell^2$. So $\ell^2$ is really the only Hilbert space we need to consider. But different feature maps yield different interpretations of the space $\ell^2$ as functions on $X$. A particularly interesting feature map is the following.

$$\Phi(x) = \left(1,\; x,\; \frac{x^2}{\sqrt{2!}},\; \frac{x^3}{\sqrt{3!}},\; \ldots,\; \frac{x^n}{\sqrt{n!}},\; \ldots\right)$$

Now consider any function $f$ all of whose derivatives exist at $0$. Define $\beta(f)$ to be the following infinite sequence.

$$\beta(f) = \left(f(0),\; f'(0),\; \frac{f''(0)}{\sqrt{2!}},\; \ldots,\; \frac{f^{(k)}(0)}{\sqrt{k!}},\; \ldots\right)$$

For any $f$ with $\beta(f) \in \ell^2$, which covers many familiar functions, we have the following.

$$f(x) = \beta(f) \cdot \Phi(x) \qquad (9)$$

So under this feature map, the parameter vectors $\beta$ in $\ell^2$ represent essentially all functions whose Taylor series converges.

For any given feature map $\Phi$ on $X$, define $H(\Phi)$ to be the set of functions $f$ from $X$ to $\mathbb{R}$ such that there exists a parameter vector $\beta(f) \in \ell^2$ satisfying (9). Equation (2) can then be written as follows, where $\|f\|$ abbreviates $\|\beta(f)\|$.

$$f^* = \operatorname*{argmin}_{f \in H(\Phi)} \sum_t L(y_t f(x_t)) + \lambda\|f\|^2$$

This way of writing the equation emphasizes that with a rich feature map, selecting $\beta$ is equivalent to selecting a function from a rich space of functions.
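As a concrete check of equation (9) under this feature map (a sketch with an assumed truncation length, not from the notes), take $f = \exp$, so $f^{(k)}(0) = 1$ and the $k$th entry of $\beta(f)$ is $1/\sqrt{k!}$. The truncated inner product $\beta(f) \cdot \Phi(x)$ reproduces $e^x$, and $\|\beta(f)\|^2 = \sum_k 1/k! = e < \infty$, so $\beta(\exp)$ indeed lies in $\ell^2$.

import numpy as np
from math import exp, factorial, sqrt

def Phi(x, n=30):
    # truncation of Phi(x) = (1, x, x^2/sqrt(2!), ..., x^k/sqrt(k!), ...)
    return np.array([x ** k / sqrt(factorial(k)) for k in range(n)])

def beta_exp(n=30):
    # beta(f) for f = exp: f^(k)(0) = 1 for every k, so the k-th entry is 1/sqrt(k!)
    return np.array([1.0 / sqrt(factorial(k)) for k in range(n)])

x = 1.7
print(beta_exp() @ Phi(x), exp(x))   # the two values agree, illustrating equation (9)
print(beta_exp() @ beta_exp())       # ~e, so beta(exp) has finite norm and lies in l^2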