Representer theorem and kernel examples

CS281B/Stat241B (Spring 2008)   Statistical Learning Theory   Lecture 8: Representer theorem and kernel examples
Lecturer: Peter Bartlett   Scribe: Howard Lei

1 Representer Theorem

Recall that the SVM optimization problem can be expressed as follows:

$$J(f^*) = \min_{f \in H} J(f), \qquad \text{where } J(f) = \frac{C}{n}\sum_{i=1}^{n} \text{hingeloss}(f(x_i), y_i) + \|f\|_H^2$$

and $H$ is a Reproducing Kernel Hilbert Space (RKHS).

Theorem 1.1. Fix a kernel $k$, and let $H$ be the corresponding RKHS. Then, for a function $L\colon \mathbb{R}^n \to \mathbb{R}$ and a non-decreasing $\Omega\colon \mathbb{R} \to \mathbb{R}$, if the optimization problem can be expressed as

$$J(f^*) = \min_{f \in H} J(f) = \min_{f \in H}\Big( L\big(f(x_1), \ldots, f(x_n)\big) + \Omega\big(\|f\|_H\big) \Big),$$

then the solution can be expressed as

$$f^* = \sum_{i=1}^{n} \alpha_i\, k(x_i, \cdot).$$

Furthermore, if $\Omega$ is strictly increasing, then all solutions have this form.

This shows that to solve the SVM optimization problem, we only need to solve for the $\alpha_i$, which agrees with the solution obtained via the Lagrangian formulation of the problem. Furthermore, the solution lies in the span of the kernel functions evaluated at the data points.

To see why, suppose we project $f$ onto the subspace

$$\text{span}\{k(x_i, \cdot) \colon 1 \le i \le n\},$$

obtaining $f_s$ (the component along the subspace) and $f_\perp$ (the component perpendicular to the subspace). We have

$$f = f_s + f_\perp, \qquad \|f\|^2 = \|f_s\|^2 + \|f_\perp\|^2 \ge \|f_s\|^2.$$

Since $\Omega$ is non-decreasing,

$$\Omega(\|f\|_H) \ge \Omega(\|f_s\|_H),$$

implying that $\Omega(\cdot)$ is minimized if $f$ lies in the subspace. Furthermore, since the kernel $k$ has the reproducing property, we have

$$f(x_i) = \langle f, k(x_i, \cdot)\rangle = \langle f_s, k(x_i, \cdot)\rangle + \langle f_\perp, k(x_i, \cdot)\rangle = \langle f_s, k(x_i, \cdot)\rangle = f_s(x_i),$$

implying that

$$L\big(f(x_1), \ldots, f(x_n)\big) = L\big(f_s(x_1), \ldots, f_s(x_n)\big).$$

Hence $L(\cdot)$ depends only on the component of $f$ lying in the subspace $\text{span}\{k(x_i, \cdot)\colon 1 \le i \le n\}$, and $\Omega(\cdot)$ is minimized if $f$ lies in that subspace. Hence $J(f)$ is minimized by an $f$ in that subspace, and we can express the minimizer as

$$f^*(\cdot) = \sum_{i=1}^{n} \alpha_i\, k(x_i, \cdot).$$

Note that if $\Omega(\cdot)$ is strictly increasing, then $f_\perp$ must necessarily be zero for $f$ to be a minimizer of $J(f)$, implying that every minimizer must lie in the subspace $\text{span}\{k(x_i, \cdot)\colon 1 \le i \le n\}$.
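To make the theorem concrete, here is a minimal sketch in Python (not part of the original notes; all function names are illustrative). It replaces the hinge loss with the squared loss, i.e. kernel ridge regression, because there the coefficients have the closed form $\alpha = (K + \lambda I)^{-1} y$; the fitted function $f(x) = \sum_i \alpha_i k(x_i, x)$ is exactly the finite expansion guaranteed by Theorem 1.1.

```python
# Minimal sketch of the representer theorem in action, using squared loss
# (kernel ridge regression) rather than the hinge loss from the lecture,
# since squared loss yields a closed form for the coefficients alpha.
import numpy as np

def gaussian_kernel(u, v, sigma=1.0):
    # k(u, v) = exp(-||u - v||^2 / (2 sigma^2)), construction 7 below
    return np.exp(-np.sum((u - v) ** 2) / (2 * sigma ** 2))

def gram_matrix(X, kernel):
    n = len(X)
    return np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])

def fit_kernel_ridge(X, y, kernel, lam=0.1):
    # Minimize sum_i (f(x_i) - y_i)^2 + lam * ||f||_H^2.
    # By the representer theorem, f = sum_i alpha_i k(x_i, .), and the
    # optimal coefficients solve (K + lam * I) alpha = y.
    K = gram_matrix(X, kernel)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

def predict(X_train, alpha, kernel, x_new):
    # f(x_new) = sum_i alpha_i k(x_i, x_new): the solution lives in
    # span{k(x_i, .) : 1 <= i <= n}.
    return sum(a * kernel(xi, x_new) for a, xi in zip(alpha, X_train))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.uniform(-3, 3, size=(30, 1))
    y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=30)
    alpha = fit_kernel_ridge(X, y, gaussian_kernel, lam=0.1)
    print(predict(X, alpha, gaussian_kernel, np.array([0.5])))
```

An SVM would instead obtain $\alpha$ by solving the hinge-loss problem (typically through its dual), but the form of the solution is the same.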

2 Constructing Kernels

In this section, we discuss ways to construct new kernels from previously defined kernels. Suppose $k_1$ and $k_2$ are valid (symmetric, positive definite) kernels on $X$. Then the following are valid kernels:

1. $k(u, v) = \alpha k_1(u, v) + \beta k_2(u, v)$, for $\alpha, \beta \ge 0$.

Since $\alpha k_1(u, v) = \langle \sqrt{\alpha}\,\Phi_1(u), \sqrt{\alpha}\,\Phi_1(v)\rangle$ and $\beta k_2(u, v) = \langle \sqrt{\beta}\,\Phi_2(u), \sqrt{\beta}\,\Phi_2(v)\rangle$, we have

$$k(u, v) = \alpha k_1(u, v) + \beta k_2(u, v) = \langle \sqrt{\alpha}\,\Phi_1(u), \sqrt{\alpha}\,\Phi_1(v)\rangle + \langle \sqrt{\beta}\,\Phi_2(u), \sqrt{\beta}\,\Phi_2(v)\rangle = \big\langle [\sqrt{\alpha}\,\Phi_1(u), \sqrt{\beta}\,\Phi_2(u)],\; [\sqrt{\alpha}\,\Phi_1(v), \sqrt{\beta}\,\Phi_2(v)]\big\rangle,$$

and we see that $k(u, v)$ can be expressed as an inner product.

2. $k(u, v) = k_1(u, v)\, k_2(u, v)$.

Note that the Gram matrix $K$ for $k$ is the Hadamard product (element-by-element product) of $K_1$ and $K_2$, that is, $K = K_1 \circ K_2$. Suppose that $K_1$ and $K_2$ are the covariance matrices of independent zero-mean random vectors $(X_1, \ldots, X_n)$ and $(Y_1, \ldots, Y_n)$ respectively. Then $K$ is simply the covariance matrix of $(X_1 Y_1, \ldots, X_n Y_n)$, implying that it is symmetric and positive semi-definite.

3. $k(u, v) = k_1(f(u), f(v))$, where $f\colon X \to X$.

Since $f$ is a transformation within the same domain, $k$ is simply a different kernel on that domain:

$$k(u, v) = k_1(f(u), f(v)) = \langle \Phi(f(u)), \Phi(f(v))\rangle = \langle \Phi_f(u), \Phi_f(v)\rangle.$$

4. $k(u, v) = g(u)\,g(v)$, for $g\colon X \to \mathbb{R}$.

We can express the Gram matrix $K$ as the outer product of the vector $\gamma = (g(x_1), \ldots, g(x_n))^\top$. Hence $K = \gamma\gamma^\top$ is symmetric and positive semi-definite with rank 1. (It is positive semi-definite because the only non-zero eigenvalue of $\gamma\gamma^\top$ is the trace of $\gamma\gamma^\top$, which equals the trace of $\gamma^\top\gamma = \|\gamma\|^2 \ge 0$.)

5. $k(u, v) = f(k_1(u, v))$, where $f$ is a polynomial with positive coefficients.

Since each polynomial term is a product of kernels with a positive coefficient, the proof follows by applying 1 and 2.

6. $k(u, v) = \exp(k_1(u, v))$.

Since

$$\exp(x) = \lim_{i \to \infty}\Big(1 + x + \frac{x^2}{2!} + \cdots + \frac{x^i}{i!}\Big),$$

the proof follows from 5 and the fact that

$$k(u, v) = \lim_{i \to \infty} k_i(u, v),$$

where each partial sum $k_i$ is a kernel by 5, and a pointwise limit of kernels is a kernel (the limit of positive semi-definite Gram matrices is positive semi-definite).

7. $k(u, v) = \exp\big(-\|u - v\|^2/(2\sigma^2)\big)$ (the Gaussian kernel).

$$k(u, v) = \exp\Big(-\frac{\|u - v\|^2}{2\sigma^2}\Big) = \exp\Big(-\frac{\|u\|^2 + \|v\|^2 - 2\langle u, v\rangle}{2\sigma^2}\Big) = \exp\Big(-\frac{\|u\|^2}{2\sigma^2}\Big)\exp\Big(-\frac{\|v\|^2}{2\sigma^2}\Big)\exp\Big(\frac{\langle u, v\rangle}{\sigma^2}\Big) = g(u)\,g(v)\,\exp(k_1(u, v)),$$

with $g(u) = \exp(-\|u\|^2/(2\sigma^2))$ and $k_1(u, v) = \langle u, v\rangle/\sigma^2$. Here $g(u)g(v)$ is a kernel according to 4, and $\exp(k_1(u, v))$ is a kernel according to 6. According to 2, the product of two kernels is a valid kernel.

Note that the Gaussian kernel is translation-invariant: $k(u, v)$ can be expressed as $f(u - v) = f(x)$.
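As a quick numerical sanity check (a sketch, not part of the notes), the snippet below builds Gram matrices corresponding to constructions 1, 2, 4, and 7 on random points and reports their smallest eigenvalues, which should be non-negative up to floating-point error if the matrices are positive semi-definite.

```python
# Numerical illustration of kernel constructions 1, 2, 4, and 7:
# build Gram matrices on random points and check the smallest eigenvalue.
import numpy as np

def min_eig(K):
    # Smallest eigenvalue of a symmetric matrix; values around -1e-10 or
    # larger indicate positive semi-definiteness up to numerical error.
    return np.linalg.eigvalsh(K).min()

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))

# Two base kernels: linear k1(u, v) = <u, v> and quadratic k2(u, v) = <u, v>^2
K1 = X @ X.T
K2 = (X @ X.T) ** 2

# Construction 1: non-negative combination alpha * k1 + beta * k2
K_sum = 0.5 * K1 + 2.0 * K2

# Construction 2: pointwise (Hadamard) product of two kernels
K_prod = K1 * K2

# Construction 4: k(u, v) = g(u) g(v) with g(u) = exp(-||u||^2)
g = np.exp(-np.sum(X ** 2, axis=1))
K_rank1 = np.outer(g, g)

# Construction 7: Gaussian kernel (sigma = 1)
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K_gauss = np.exp(-sq_dists / 2.0)

for name, K in [("sum", K_sum), ("product", K_prod),
                ("rank-1", K_rank1), ("gaussian", K_gauss)]:
    print(name, min_eig(K))
```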

Example: Translation-invariant kernels

Consider a function $f\colon [-\pi, \pi] \to \mathbb{R}$, and suppose that $f$ is continuous and even (i.e. $f(x) = f(-x)$). Then we can express $f$ via its Fourier expansion as

$$f(x) = \sum_{n=0}^{\infty} a_n \cos(nx),$$

where $a_n \ge 0$. If we let $x$ be the difference of $u$ and $v$, then since $\cos(n(u - v)) = \sin(nu)\sin(nv) + \cos(nu)\cos(nv)$, we have

$$f(x) = f(u - v) = a_0 + \sum_{n=1}^{\infty} a_n\big(\sin(nu)\sin(nv) + \cos(nu)\cos(nv)\big) = \sum_{i=0}^{\infty} \lambda_i\, \Psi_i(u)\,\Psi_i(v),$$

where $\{\Psi_i\} = \{\sin(n\,\cdot)\colon n \ge 1\} \cup \{\cos(n\,\cdot)\colon n \ge 0\}$. We see that $f(u - v)$ is a valid kernel that is translation-invariant. This example shows that we can choose the kernel by choosing the $a_i$ coefficients, which is equivalent to choosing a filter.

Example: Bag-of-words kernel

Suppose that $\Phi_w(d)$ is the number of times word $w$ appears in document $d$. If we want to classify documents by their word counts, we can use the kernel $k(d_1, d_2) = \langle \Phi(d_1), \Phi(d_2)\rangle$. (In practice, these counts are weighted to take into account the relative frequency of different words.)

Example: Marginalized kernel

Given a probability distribution $p(x, h)$ (and hence $p(h \mid x)$) and a kernel $k((x, h), (x', h'))$ defined on $(x, h)$ pairs, we can obtain a kernel on the $x$'s alone as follows:

$$k_m(x, x') = \sum_{h, h'} k\big((x, h), (x', h')\big)\, p(h \mid x)\, p(h' \mid x').$$

Exercise: Prove that this is a valid kernel!

Example: Convolution kernel (or string kernel)

Let $a_i$ denote a letter of the alphabet $\Sigma$, $s = (s_1, \ldots, s_l)$ a string of letters, and $\Sigma^*$ the space of all possible letter sequences. We say that $s$ has $a = (a_1, \ldots, a_n)$ as a subsequence if there exists a sequence of indices $I = (i_1, \ldots, i_n)$ with $i_1 < i_2 < \cdots < i_n$ such that $s_{i_j} = a_j$ for $j = 1, \ldots, n$. Define the length of the index set $(i_1, \ldots, i_n)$ forming the subsequence as $l(I) = i_n - i_1 + 1$. For simplicity, we write $s[I] = a$. Define, for fixed $n$, the feature map for a particular sequence $a$ and string $s$:

$$\Phi_a(s) = \sum_{I \colon s[I] = a} \lambda^{l(I)},$$

where $\lambda \in (0, 1)$. To compare two strings $s$ and $s'$, we can use the following kernel:

$$k(s, s') = \sum_{a \in \Sigma^n} \Phi_a(s)\,\Phi_a(s').$$
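Before turning to an alternative, convolution-based derivation of this kernel, here is a brute-force sketch of the definition above (illustrative code, not from the lecture). It enumerates every increasing length-$n$ index tuple with itertools.combinations, accumulates $\Phi_a(s)$, and then takes the inner product of the two feature maps; this is only practical for short strings and small $n$.

```python
# Brute-force computation of the subsequence (string) kernel defined above:
#   Phi_a(s) = sum over index tuples I with s[I] = a of lambda^{l(I)},
#   k(s, s') = sum over a of Phi_a(s) * Phi_a(s').
from itertools import combinations
from collections import defaultdict

def subsequence_features(s, n, lam):
    # Map each length-n subsequence a to Phi_a(s).
    phi = defaultdict(float)
    for I in combinations(range(len(s)), n):    # increasing index tuples
        a = "".join(s[i] for i in I)
        length = I[-1] - I[0] + 1               # l(I) = i_n - i_1 + 1
        phi[a] += lam ** length
    return phi

def string_kernel(s, t, n, lam=0.5):
    phi_s = subsequence_features(s, n, lam)
    phi_t = subsequence_features(t, n, lam)
    # Only subsequences appearing in both strings contribute to the sum.
    return sum(phi_s[a] * phi_t[a] for a in phi_s if a in phi_t)

if __name__ == "__main__":
    print(string_kernel("cat", "cart", n=2, lam=0.5))
```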

We can also derive the above kernel via convolution. Define the base kernel

$$k_0\big((s, i), (s', i')\big) = \mathbf{1}\big[s(i) = s'(i')\big],$$

and set

$$k_n\big((s, i), (s', i')\big) = k_0\big((s, i), (s', i')\big)\,\big(h \star k_{n-1}\big)\big((s, i), (s', i')\big),$$

where $h(i - j) = \mathbf{1}[i - j > 0]\,\lambda^{(i - j)}$ and $\star$ is the convolution operator:

$$\big(h \star k_{n-1}\big)\big((s, i), (s', i')\big) = \sum_{j, j'} h(i - j)\, h(i' - j')\, k_{n-1}\big((s, j), (s', j')\big).$$

Then

$$k(s, s') = \sum_{i, i'} k_n\big((s, i), (s', i')\big).$$
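The recursion can be run directly with memoization, as in the sketch below (illustrative code, not from the lecture). Depending on whether $k_m$ is read as matching subsequences of length $m$ or $m + 1$, and on the $l(I) = i_n - i_1 + 1$ length convention used earlier, the result can differ from the brute-force enumeration by a fixed power of $\lambda$; the point here is only to make the recursion executable.

```python
# Memoized evaluation of the convolution form of the string kernel:
#   k_0((s,i),(t,i')) = 1[s(i) = t(i')]
#   k_m((s,i),(t,i')) = k_0((s,i),(t,i')) * sum_{j<i, j'<i'}
#                         lam^(i-j) * lam^(i'-j') * k_{m-1}((s,j),(t,j'))
#   k(s,t) = sum_{i,i'} k_n((s,i),(t,i'))
from functools import lru_cache

def convolution_string_kernel(s, t, n, lam=0.5):
    @lru_cache(maxsize=None)
    def k(m, i, ip):
        if s[i] != t[ip]:           # the k_0 factor kills non-matching ends
            return 0.0
        if m == 0:
            return 1.0
        total = 0.0
        for j in range(i):          # h(i - j) = 1[i - j > 0] * lam**(i - j)
            for jp in range(ip):
                total += lam ** (i - j) * lam ** (ip - jp) * k(m - 1, j, jp)
        return total

    return sum(k(n, i, ip) for i in range(len(s)) for ip in range(len(t)))

if __name__ == "__main__":
    print(convolution_string_kernel("cat", "cart", n=1, lam=0.5))
```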