EECS 598: Statistical Learning Theory, Winter 2014

Topic 11: Kernels

Lecturer: Clayton Scott
Scribes: Jun Guo, Soumik Chatterjee

Disclaimer: These notes have not been subjected to the usual scrutiny reserved for formal publications. They may be distributed outside this class only with the permission of the Instructor.

1 Introduction

These notes introduce kernels, which are the building blocks of modern kernel methods in machine learning.

2 Review of Hilbert Spaces

Definition 1. A real inner product space (IPS) is a pair $(V, \langle \cdot, \cdot \rangle)$, where $V$ is a real vector space and $\langle \cdot, \cdot \rangle : V \times V \to \mathbb{R}$ satisfies

(a) $\langle u, u \rangle \ge 0 \; \forall u \in V$, and $\langle u, u \rangle = 0 \implies u = 0$;

(b) $\langle u, v \rangle = \langle v, u \rangle \; \forall u, v \in V$;

(c) $\forall a_1, a_2 \in \mathbb{R}$ and $u_1, u_2, v \in V$, $\langle a_1 u_1 + a_2 u_2, v \rangle = a_1 \langle u_1, v \rangle + a_2 \langle u_2, v \rangle$.

Inner product spaces satisfy the Cauchy-Schwarz inequality:

Proposition 1. Suppose $(V, \langle \cdot, \cdot \rangle)$ is an IPS. Then $\forall u, v \in V$,
$$|\langle u, v \rangle| \le \|u\| \, \|v\|,$$
where $\|u\| = \sqrt{\langle u, u \rangle}$.

Proof. Consider
$$G = \begin{bmatrix} \langle u, u \rangle & \langle u, v \rangle \\ \langle v, u \rangle & \langle v, v \rangle \end{bmatrix}.$$
Now
$$G \text{ is a Gram matrix} \implies G \text{ is PSD} \implies \det(G) \ge 0 \implies \|u\|^2 \|v\|^2 - \langle u, v \rangle^2 \ge 0 \implies |\langle u, v \rangle| \le \|u\| \, \|v\|.$$
To show that $G$ is a PSD matrix, observe that $\forall \begin{bmatrix} a \\ b \end{bmatrix} \in \mathbb{R}^2$,
$$\begin{bmatrix} a & b \end{bmatrix} G \begin{bmatrix} a \\ b \end{bmatrix} = \begin{bmatrix} a & b \end{bmatrix} \begin{bmatrix} a \langle u, u \rangle + b \langle u, v \rangle \\ a \langle v, u \rangle + b \langle v, v \rangle \end{bmatrix} = \langle au + bv, \, au + bv \rangle \ge 0. \qquad \square$$
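As a quick numerical companion to this proof (our own sketch, not part of the original notes), the following snippet forms the $2 \times 2$ Gram matrix of two random vectors under the Euclidean inner product and confirms both that it is PSD and that Cauchy-Schwarz holds.

```python
import numpy as np

rng = np.random.default_rng(0)
u, v = rng.standard_normal(5), rng.standard_normal(5)

# Gram matrix of (u, v) under the Euclidean inner product.
G = np.array([[u @ u, u @ v],
              [v @ u, v @ v]])

# G is symmetric, so PSD <=> all eigenvalues are nonnegative.
assert np.all(np.linalg.eigvalsh(G) >= -1e-12)

# det(G) >= 0 is exactly the Cauchy-Schwarz inequality.
assert np.linalg.det(G) >= -1e-12
assert abs(u @ v) <= np.linalg.norm(u) * np.linalg.norm(v)
```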

Observation: The proof of the Cauchy-Schwarz inequality used all properties of inner products except definiteness: $\langle u, u \rangle = 0 \implies u = 0$. This will be important below.

Definition 2. A metric space is a pair $(M, d)$ where $M$ is a set and $d : M \times M \to \mathbb{R}$ satisfies:

(a) $d(x, y) \ge 0 \; \forall x, y \in M$, and $d(x, y) = 0 \iff x = y$;

(b) $\forall x, y \in M$, $d(x, y) = d(y, x)$;

(c) $\forall x, y, z \in M$, $d(x, z) \le d(x, y) + d(y, z)$.

An IPS is a metric space with induced metric $d(u, v) := \|u - v\| := \sqrt{\langle u - v, u - v \rangle}$. Verification of (a) and (b) is straightforward. To verify (c), it suffices to show
$$\forall u, v \in V, \quad \|u + v\| \le \|u\| + \|v\|, \tag{1}$$
for then $d(x, z) = \|x - z\| = \|(x - y) + (y - z)\| \le \|x - y\| + \|y - z\| = d(x, y) + d(y, z)$. To see (1), observe
$$(\|w\| - \|v\|)^2 = \|w\|^2 - 2\|w\|\|v\| + \|v\|^2 \le \|w\|^2 - 2\langle w, v \rangle + \|v\|^2 = \langle w - v, w - v \rangle = \|w - v\|^2,$$
where the inequality is Cauchy-Schwarz. So $\|w\| - \|v\| \le \|w - v\|$. Letting $w = u + v$, we have $\|u + v\| \le \|u\| + \|v\|$.

Metric spaces allow us to study convergence of sequences and continuity. We make the following definitions.

Definition 3. Let $(M, d)$ be a metric space. We say a sequence $(x_n)$ converges iff $\exists x^* \in M$ such that $d(x_n, x^*) \to 0$ as $n \to \infty$. Suppose $(\tilde{M}, \tilde{d})$ is another metric space, and $f : M \to \tilde{M}$. We say $f$ is continuous at $x \in M$ iff for all sequences $(x_n)$ in $M$ converging to $x$, the sequence $(f(x_n))$ converges to $f(x)$ in $\tilde{M}$. We say $f$ is continuous if it is continuous at all $x \in M$. A sequence $(x_n)$ is a Cauchy sequence iff $\forall \epsilon > 0$, $\exists N \in \mathbb{N}$ such that $m, n \ge N \implies d(x_m, x_n) < \epsilon$. A metric space $(M, d)$ is said to be complete iff every Cauchy sequence of points in $M$ converges to a point in $M$.

Remark. Every convergent sequence is a Cauchy sequence, but not conversely. This is easy to check.

Inner products are continuous in one argument when the other argument is held fixed.

Lemma 1. Let $(V, \langle \cdot, \cdot \rangle)$ be an IPS. Suppose $u_n \to u \in V$, i.e., $\lim_{n \to \infty} \|u_n - u\| = 0$, and let $v \in V$. Then
$$\lim_{n \to \infty} \langle u_n, v \rangle = \langle u, v \rangle,$$
i.e., the inner product is continuous in its first argument.

Proof. We need to show $\lim_{n \to \infty} |\langle u_n, v \rangle - \langle u, v \rangle| = 0$. This follows from Cauchy-Schwarz:
$$|\langle u_n, v \rangle - \langle u, v \rangle| = |\langle u_n - u, v \rangle| \le \|u_n - u\| \, \|v\| \to 0$$
as $n \to \infty$, since $\|u_n - u\| \to 0$. $\square$
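To see Lemma 1 in action numerically (a small illustration we added), take $u_n = u + \frac{1}{n}\mathbf{1}$ in $\mathbb{R}^3$; the gap $|\langle u_n, v \rangle - \langle u, v \rangle|$ stays below the Cauchy-Schwarz bound $\|u_n - u\| \|v\|$ from the proof as both shrink.

```python
import numpy as np

u = np.array([0.3, 0.7, -1.1])
v = np.array([1.0, -2.0, 0.5])

for n in [1, 10, 100, 1000]:
    u_n = u + np.ones(3) / n                  # u_n -> u as n -> infinity
    gap = abs(u_n @ v - u @ v)                # |<u_n, v> - <u, v>|
    bound = np.linalg.norm(u_n - u) * np.linalg.norm(v)
    assert gap <= bound + 1e-12               # the Cauchy-Schwarz bound from the proof
    print(f"n = {n:5d}: gap = {gap:.2e} <= bound = {bound:.2e}")
```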

Every metric space $(M, d)$ has a completion $(M', d')$, which is a complete metric space for which there exists $\varphi : M \to M'$ satisfying:

(a) $\varphi(M)$ is dense in $M'$;

(b) $\varphi$ is an isometry, i.e., $\varphi$ is injective and $\forall x, y \in M$, $d'(\varphi(x), \varphi(y)) = d(x, y)$.

Furthermore, any two completions of $(M, d)$ are isometric, i.e., there exists a bijective, distance-preserving map between them. There is a standard construction of a completion; it is well worth understanding [1]. In the standard completion, it is not true that $M \subseteq M'$, since $M'$ consists of equivalence classes of Cauchy sequences of points in $M$. However, in many cases there is a completion such that $M \subseteq M'$.

Example. Consider $(\mathbb{Q}, |\cdot|)$, the rational numbers with the natural metric. This is not complete, since one can take the sequence of decimal truncations of some irrational number, e.g., $3, 3.1, 3.14, 3.141, 3.1415, \ldots$. Such a sequence is Cauchy, but does not converge to a point in $\mathbb{Q}$. $(\mathbb{R}, |\cdot|)$ is a completion of $(\mathbb{Q}, |\cdot|)$, and in this case $\varphi$ is just the identity map.

Definition 4. A Hilbert space is a complete inner product space, where completeness is with respect to the induced metric.

Hilbert spaces satisfy nice properties like the projection theorem and the Riesz representation theorem. We'll use these later.

3 Kernels

Let $X$ be a set.

Definition 5. Let $k : X \times X \to \mathbb{R}$. We say $k$ is symmetric if and only if $k(x, x') = k(x', x) \; \forall x, x' \in X$. We say $k$ is positive definite (PD) if and only if $\forall n \in \mathbb{N}$ and $x_1, \ldots, x_n \in X$, the $n \times n$ matrix $[k(x_i, x_j)]_{i,j=1}^n$ is positive semi-definite (PSD).

Theorem 1. Let $k : X \times X \to \mathbb{R}$. The following are equivalent:

(a) $k$ is positive definite and symmetric;

(b) there exist a Hilbert space $(F, \langle \cdot, \cdot \rangle)$ and a function $\Phi : X \to F$ such that $\forall x, x' \in X$,
$$k(x, x') = \langle \Phi(x), \Phi(x') \rangle.$$

Definition 6. If $k$ satisfies the conditions of the previous theorem, we say $k$ is a kernel on $X$.

Remarks and terminology. Kernels are sometimes called positive definite symmetric kernels or inner product kernels, depending on which condition is being emphasized. $\Phi$ is called a feature map and $F$ is called the feature space (we'll refer to $X$ as the input space to avoid confusion). The feature map/space is not unique.
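Definition 5 is straightforward to probe empirically for a candidate kernel. The sketch below (an illustration we added, not from the notes) samples random points in $\mathbb{R}^3$ and checks that the kernel matrix of the dot-product kernel $k(x, x') = x^\top x'$ is symmetric and PSD, consistent with Theorem 1.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((20, 3))    # n = 20 points in R^3

K = X @ X.T                         # K[i, j] = <x_i, x_j>, the dot-product kernel

assert np.allclose(K, K.T)                          # symmetric
assert np.all(np.linalg.eigvalsh(K) >= -1e-10)      # PSD up to roundoff
```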

Proof of Theorem 1. ((b) $\implies$ (a)): Suppose (b) holds. Clearly $k$ is symmetric because $\langle \cdot, \cdot \rangle$ is. Furthermore, for $x_1, \ldots, x_n \in X$, $[k(x_i, x_j)]_{i,j=1}^n = [\langle \Phi(x_i), \Phi(x_j) \rangle]_{i,j=1}^n$, which is a Gram matrix, and therefore PSD. So $k$ is PD.

((a) $\implies$ (b)): Assume $k$ is positive definite and symmetric. Define
$$F_0 = \left\{ \sum_{i=1}^n \alpha_i k(\cdot, x_i) \;\middle|\; n \in \mathbb{N},\; x_1, \ldots, x_n \in X,\; \alpha_1, \ldots, \alpha_n \in \mathbb{R} \right\}.$$
For $f = \sum_{i=1}^n \alpha_i k(\cdot, x_i)$ and $g = \sum_{j=1}^m \beta_j k(\cdot, x'_j)$ in $F_0$, define $\langle f, g \rangle = \sum_{i,j} \alpha_i \beta_j k(x_i, x'_j)$.

To check that this inner product is well-defined, i.e., independent of the particular representations of $f$ and $g$, observe
$$\sum_{i,j} \alpha_i \beta_j k(x_i, x'_j) = \sum_{j=1}^m \beta_j f(x'_j) = \sum_{i=1}^n \alpha_i g(x_i).$$
So $\langle \cdot, \cdot \rangle$ only depends on the functions $f$ and $g$, and not on their representations.

Next, we show that the properties of an inner product are satisfied.

Symmetry: $\langle f, g \rangle = \sum_{i,j} \alpha_i \beta_j k(x_i, x'_j) = \sum_{j,i} \beta_j \alpha_i k(x'_j, x_i) = \langle g, f \rangle$, by symmetry of $k$.

Linearity in the first argument: also an easy exercise.

Nonnegativity: $\langle f, f \rangle = \sum_{i,j} \alpha_i \alpha_j k(x_i, x_j) \ge 0$, because $k$ is PD. Therefore the Cauchy-Schwarz inequality holds.

To establish definiteness, suppose $\langle f, f \rangle = 0$. Then $\forall x \in X$,
$$|f(x)| = \left| \sum_{i=1}^n \alpha_i k(x_i, x) \right| = |\langle f, k(\cdot, x) \rangle| \quad (\text{since } k(\cdot, x) \in F_0)$$
$$\le \sqrt{\langle f, f \rangle} \sqrt{\langle k(\cdot, x), k(\cdot, x) \rangle} = 0. \quad (\text{Cauchy-Schwarz})$$
Thus $f(x) = 0 \; \forall x \in X$, i.e., $f$ is the zero function. The axioms of an inner product are established.

Let $(F, \langle \cdot, \cdot \rangle_F)$ be a completion of $F_0$ with $\varphi : F_0 \to F$ the completion isometry. Define $\Phi : X \to F$ by $\Phi(x) = \varphi(k(\cdot, x))$. Then
$$\langle \Phi(x), \Phi(x') \rangle_F = \langle k(\cdot, x), k(\cdot, x') \rangle_{F_0} = k(x, x').$$
This completes the proof. $\square$

4 Examples and properties of kernels

Let $(F, \langle \cdot, \cdot \rangle)$ be any Hilbert space. Then $k(x, x') = \langle x, x' \rangle$ is a kernel on $F$, which can be seen by taking $\Phi$ to be the identity map.

Example. $k(x, x') = x^\top x'$ (dot product) is a kernel on $\mathbb{R}^d$.

Lemma 2. If $k_1, k_2$ are kernels on $X$ and $\alpha \ge 0$, then $\alpha k_1$ and $k_1 + k_2$ are kernels.

Proof. The proof is left as an exercise.

Lemma 3. If $k_1$ is a kernel on $X_1$ and $k_2$ is a kernel on $X_2$, then $k_1 k_2$ is a kernel on $X_1 \times X_2$, where
$$(k_1 k_2)((x_1, x_2), (x'_1, x'_2)) := k_1(x_1, x'_1) \, k_2(x_2, x'_2).$$

Proof sketch. Denote $k = k_1 k_2$. Let $F_i, \Phi_i$ be the Hilbert spaces and feature maps corresponding to $k_i$, $i = 1, 2$. Let $F_1 \otimes F_2$ be the tensor product of $F_1$ and $F_2$. This is an inner product space whose underlying set is $F_1 \times F_2$, but the vector space structure is different from the usual one given by the direct sum. The inner product on the tensor product is such that
$$\langle (f_1, f_2), (f'_1, f'_2) \rangle_{F_1 \otimes F_2} = \langle f_1, f'_1 \rangle_{F_1} \langle f_2, f'_2 \rangle_{F_2}.$$
Let $F$ be a completion of $F_1 \otimes F_2$, and $\varphi : F_1 \otimes F_2 \to F$ an associated completion embedding. Define $\Phi : X_1 \times X_2 \to F$ by $\Phi(x_1, x_2) = \varphi(\Phi_1(x_1), \Phi_2(x_2))$. Denoting $x = (x_1, x_2)$ and $x' = (x'_1, x'_2)$,
$$k(x, x') = k_1(x_1, x'_1) \, k_2(x_2, x'_2) = \langle \Phi_1(x_1), \Phi_1(x'_1) \rangle_{F_1} \langle \Phi_2(x_2), \Phi_2(x'_2) \rangle_{F_2}$$
$$= \langle (\Phi_1(x_1), \Phi_2(x_2)), (\Phi_1(x'_1), \Phi_2(x'_2)) \rangle_{F_1 \otimes F_2} = \langle \varphi(\Phi_1(x_1), \Phi_2(x_2)), \varphi(\Phi_1(x'_1), \Phi_2(x'_2)) \rangle_F$$
$$= \langle \Phi(x_1, x_2), \Phi(x'_1, x'_2) \rangle_F = \langle \Phi(x), \Phi(x') \rangle_F. \qquad \square$$
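A numerical spot-check of Lemma 3 (our own sketch, not from the notes): for points in $X_1 \times X_2 = \mathbb{R}^2 \times \mathbb{R}^3$ with the dot-product kernel on each factor, the kernel matrix of $k_1 k_2$ is the entrywise (Hadamard) product of the two factor kernel matrices, and it is PSD as the lemma guarantees.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 25
X1 = rng.standard_normal((n, 2))   # first components, in R^2
X2 = rng.standard_normal((n, 3))   # second components, in R^3

K1 = X1 @ X1.T                     # [k_1(x_1^i, x_1^j)]
K2 = X2 @ X2.T                     # [k_2(x_2^i, x_2^j)]
K = K1 * K2                        # [(k_1 k_2)((x_1^i, x_2^i), (x_1^j, x_2^j))]

assert np.all(np.linalg.eigvalsh(K) >= -1e-10)   # PSD, as Lemma 3 guarantees
```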

Lemma 4. If $A : \tilde{X} \to X$ and $k$ is a kernel on $X$, then $\tilde{k}$ defined by $\tilde{k}(x, x') = k(A(x), A(x'))$ is a kernel on $\tilde{X}$.

Proof. The proof is left as an exercise.

Corollary 1. If $k$ is a kernel on $X$, then so is $k^2$, where $k^2(x, x') := k(x, x')^2$.

Proof. By Lemma 3, $k k$ is a kernel on $X \times X$. Let $A : X \to X \times X$ be defined by $A(x) = (x, x)$. By Lemma 4, $(k k)(A(x), A(x'))$ defines a kernel on $X$, as
$$(k k)(A(x), A(x')) = (k k)((x, x), (x', x')) = k(x, x') \, k(x, x') = k(x, x')^2. \qquad \square$$

Example. Let $p(t) = \sum_{j=0}^n \alpha_j t^j$ with $\alpha_j \ge 0$. Then $k(x, x') = p(\langle x, x' \rangle)$ is a kernel on $\mathbb{R}^d$. Taking $p(t) = (t + c)^m$ with $c \ge 0$ gives the inhomogeneous polynomial kernel. An explicit feature map for $k(x, x') = (\langle x, x' \rangle + c)^m$ can be derived via the multinomial theorem:
$$k(x, x') = (\langle x, x' \rangle + c)^m = (x_1 x'_1 + x_2 x'_2 + \cdots + x_d x'_d + c)^m$$
$$= \sum_{(j_1, \ldots, j_d) \in J(m)} \binom{m}{j_1, \ldots, j_d} (x_1 x'_1)^{j_1} \cdots (x_d x'_d)^{j_d} \, c^{\,m - \sum_i j_i} = \langle \Phi(x), \Phi(x') \rangle,$$
where $J(m) = \{(j_1, \ldots, j_d) : j_i \ge 0, \; \sum_i j_i \le m\}$, $\binom{m}{j_1, \ldots, j_d} = \frac{m!}{j_1! \cdots j_d! \, (m - \sum_i j_i)!}$, and $\Phi(x)$ is the finite-dimensional vector
$$\Phi(x) = \left( \sqrt{\binom{m}{j_1, \ldots, j_d} c^{\,m - \sum_i j_i}} \; x_1^{j_1} \cdots x_d^{j_d} \right)_{(j_1, \ldots, j_d) \in J(m)}.$$
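The expansion above can be verified numerically for small $d$ and $m$. In the sketch below (ours, not from the notes), `poly_features` is a hypothetical helper that enumerates $J(m)$ and builds $\Phi(x)$ explicitly; its inner product should match $(\langle x, x' \rangle + c)^m$.

```python
import itertools
from math import factorial

import numpy as np

def poly_features(x, c, m):
    """Explicit feature map for k(x, x') = (<x, x'> + c)^m (small d, m only)."""
    d = len(x)
    feats = []
    # Enumerate J(m): multi-indices (j_1, ..., j_d) with j_i >= 0 and sum <= m.
    for j in itertools.product(range(m + 1), repeat=d):
        s = sum(j)
        if s > m:
            continue
        multinom = factorial(m) // (np.prod([factorial(ji) for ji in j]) * factorial(m - s))
        monomial = np.prod([x[i] ** j[i] for i in range(d)])
        feats.append(np.sqrt(multinom * c ** (m - s)) * monomial)
    return np.array(feats)

rng = np.random.default_rng(3)
x, xp = rng.standard_normal(3), rng.standard_normal(3)
c, m = 2.0, 3

assert np.isclose((x @ xp + c) ** m,
                  poly_features(x, c, m) @ poly_features(xp, c, m))
```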

Lemma 5. Suppose the series $f(z) = \sum_{n=0}^\infty a_n z^n$ converges for $z \in (-r, r)$, where $r \in (0, \infty]$. If $a_n \ge 0 \; \forall n$, then
$$k(x, x') := \sum_{n=0}^\infty a_n \langle x, x' \rangle^n$$
is a kernel on $X = \{x \in \mathbb{R}^d : \|x\|^2 < r\}$.

Example. For any $\beta > 0$, $e^{\beta \langle x, x' \rangle}$ is a kernel on $\mathbb{R}^d$: $e^{\beta z} = \sum_{n=0}^\infty a_n z^n$ with $a_n = \frac{\beta^n}{n!} \ge 0$, and this holds for all $z \in \mathbb{R}$ ($r = \infty$).

Proof. If $\|x\|^2 < r$ and $\|x'\|^2 < r$, then $|\langle x, x' \rangle| \le \|x\| \|x'\| < r$, and so $\sum_n a_n \langle x, x' \rangle^n$ converges, meaning $k$ is well-defined. Since $f(z)$ converges on $(-r, r)$, it converges absolutely there. Hence we can rearrange terms without affecting the limit:
$$k(x, x') = \sum_{n=0}^\infty a_n \langle x, x' \rangle^n = \sum_{n=0}^\infty a_n (x_1 x'_1 + \cdots + x_d x'_d)^n = \sum_{n=0}^\infty a_n \sum_{\substack{(j_1, \ldots, j_d) : j_i \ge 0 \\ \sum_i j_i = n}} \frac{(\sum_i j_i)!}{\prod_i j_i!} \prod_{i=1}^d (x_i x'_i)^{j_i} = \langle \Phi(x), \Phi(x') \rangle,$$
where
$$\Phi(x) = \left( \sqrt{a_{j_1 + \cdots + j_d} \frac{(\sum_i j_i)!}{\prod_i j_i!}} \; \prod_{i=1}^d x_i^{j_i} \right)_{(j_1, \ldots, j_d) : j_i \ge 0}$$
and the Hilbert space is $\ell^2 := \{(c_1, c_2, \ldots) : \sum_i c_i^2 < \infty\}$. (Note that $\|\Phi(x)\|^2 = f(\|x\|^2) < \infty$, so indeed $\Phi(x) \in \ell^2$.) $\square$

Lemma 6 (Normalized kernels). If $k$ is a kernel on $X$ such that $k(x, x) > 0$ for all $x$, then $\bar{k}$ is also a kernel on $X$, where
$$\bar{k}(x, x') := \frac{k(x, x')}{\sqrt{k(x, x) \, k(x', x')}}.$$
You are asked to remove the positivity condition in an exercise.

Proof. Let $\Phi : X \to F$ be a feature map for $k$. Define $\bar{\Phi}(x) = \frac{\Phi(x)}{\|\Phi(x)\|}$. Then
$$\langle \bar{\Phi}(x), \bar{\Phi}(x') \rangle = \frac{\langle \Phi(x), \Phi(x') \rangle}{\sqrt{\langle \Phi(x), \Phi(x) \rangle} \sqrt{\langle \Phi(x'), \Phi(x') \rangle}} = \bar{k}(x, x'),$$
so $\bar{k}$ is a kernel. $\square$

Example. If $k(x, x') = e^{2\gamma \langle x, x' \rangle}$ with $\gamma > 0$, then
$$\bar{k}(x, x') = \frac{e^{2\gamma \langle x, x' \rangle}}{e^{\gamma \|x\|^2} \, e^{\gamma \|x'\|^2}} = e^{-\gamma(\|x\|^2 - 2\langle x, x' \rangle + \|x'\|^2)} = e^{-\gamma \|x - x'\|^2}$$
is the well-known Gaussian kernel.

The above arguments emphasize the inner product characterization of kernels. For proofs of the above properties that leverage the positive definite symmetric characterization, see [2]. A good reference for additional theoretical properties of kernels is [4].
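The normalization identity above is easy to confirm numerically. The following sketch (ours, not from the notes) normalizes the exponential kernel $e^{2\gamma \langle x, x' \rangle}$ per Lemma 6 and compares it against the Gaussian kernel computed directly.

```python
import numpy as np

rng = np.random.default_rng(4)
gamma = 0.5
x, xp = rng.standard_normal(4), rng.standard_normal(4)

def k(a, b):
    return np.exp(2 * gamma * (a @ b))            # exponential kernel e^{2 gamma <a, b>}

k_bar = k(x, xp) / np.sqrt(k(x, x) * k(xp, xp))   # normalized kernel (Lemma 6)
gauss = np.exp(-gamma * np.sum((x - xp) ** 2))    # Gaussian kernel e^{-gamma ||x - x'||^2}

assert np.isclose(k_bar, gauss)
```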

Exercises

1. Verify the linearity-in-the-first-argument axiom for the inner product defined in the proof of Theorem 1.

2. Properties of kernels.

(a) Prove Lemma 2.

(b) Prove Lemma 4.

(c) Suppose $(k_n)$ is a sequence of kernels on $X$ that converges pointwise to the function $k$, i.e., for all $x, x' \in X$, $\lim_{n \to \infty} k_n(x, x') = k(x, x')$. Show that $k$ is a kernel on $X$.

(d) Use the previous property to generalize Lemma 5 by replacing the dot product with an arbitrary kernel. Provide a revised theorem statement.

(e) When we introduced normalized kernels above, we assumed $k(x, x) > 0$ for all $x$. Let's remove this restriction. Show that if $k$ is a kernel, then so is
$$\bar{k}(x, x') := \begin{cases} \dfrac{k(x, x')}{\sqrt{k(x, x) \, k(x', x')}}, & \text{if } k(x, x) > 0 \text{ and } k(x', x') > 0, \\ 0, & \text{otherwise.} \end{cases}$$

(f) Give an alternate proof that the Gaussian kernel is a kernel without invoking normalized kernels.

3. A radial kernel on $\mathbb{R}^d$ is a kernel of the form $k(x, x') = g(\|x - x'\|)$ for some function $g$, where the norm is the Euclidean norm.

(a) Show that if $p \ge 0$, then
$$k(x, x') = \int_0^\infty e^{-u \|x - x'\|^2} p(u) \, du$$
is a radial kernel. Note: it can be shown that if $g(\|x - x'\|)$ is a kernel for every dimension $d \ge 1$, then it must have the above form, where $p(u) \, du$ is generalized to $d\mu(u)$ for a finite measure $\mu$ [3].

(b) (Laplacian kernel) Show that for $\alpha > 0$, $k(x, x') = e^{-\alpha \|x - x'\|}$ is a kernel. Major hint:
$$e^{-\alpha \sqrt{s}} = \int_0^\infty e^{-su} \, \frac{\alpha}{2\sqrt{\pi u^3}} \, e^{-\alpha^2 / (4u)} \, du.$$

(c) (Multivariate Student-type kernel) Show that for $\alpha, \beta > 0$, $k(x, x') = \left(1 + \frac{\|x - x'\|^2}{\beta}\right)^{-\alpha}$ is a kernel. Hint: consider the MGF of a gamma distribution.

(d) Show that $k(x, x') = e^{-\alpha \|x - x'\|_1}$ is a kernel, where $\|\cdot\|_1$ is the 1-norm. This kernel could be considered radial with respect to this alternative norm.

The Laplacian and multivariate Student kernels are alternatives to the Gaussian kernel. They do not decay to zero as rapidly as the Gaussian kernel, and are therefore less likely to encounter numerical problems: sometimes the entries of a Gaussian kernel matrix are all (numerically) zeros and ones, or the kernel matrix is near-singular (see the sketch following these exercises).

4. Learn about tensor products, including both the vector space and the inner product space structure.
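To illustrate the numerical remark in Exercise 3 (a demonstration we added, not an exercise solution): with moderately spread data, the off-diagonal entries of a Gaussian kernel matrix can underflow toward zero, so the matrix looks like the identity, while a Laplacian kernel matrix on the same data retains usable off-diagonal structure.

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.standard_normal((30, 10))                      # 30 points in R^10

# Pairwise Euclidean distance matrix.
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)

K_gauss = np.exp(-D ** 2)                              # Gaussian kernel, gamma = 1
K_lap = np.exp(-D)                                     # Laplacian kernel, alpha = 1

off = ~np.eye(30, dtype=bool)
# The Gaussian entries are many orders of magnitude smaller off the diagonal.
print("largest off-diagonal entry, Gaussian :", K_gauss[off].max())
print("largest off-diagonal entry, Laplacian:", K_lap[off].max())
```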

References

[1] http://www.math.columbia.edu/~nironi/completion.pdf

[2] M. Mohri, A. Rostamizadeh, and A. Talwalkar, Foundations of Machine Learning, MIT Press, 2012.

[3] C. Scovel, D. Hush, I. Steinwart, and J. Theiler, "Radial kernels and their reproducing kernel Hilbert spaces," Journal of Complexity, vol. 26, pp. 641-660, 2010.

[4] I. Steinwart and A. Christmann, Support Vector Machines, Springer, 2008.