
Part 5 (MA 751): Statistical machine learning and kernel methods. Primary references: John Shawe-Taylor and Nello Cristianini, Kernel Methods for Pattern Analysis; Christopher Burges, "A tutorial on support vector machines for pattern recognition," Data Mining and Knowledge Discovery 2, 121-167 (1998).

Other references: N. Aronszajn, "Theory of reproducing kernels," Transactions of the American Mathematical Society 68, 337-404 (1950). Felipe Cucker and Steve Smale, "On the mathematical foundations of learning," Bulletin of the American Mathematical Society, 2002. Theodoros Evgeniou, Massimiliano Pontil and Tomaso Poggio, "Regularization Networks and Support Vector Machines," Advances in Computational Mathematics, 2000.

1. Linear functionals

Given a vector space $V$, we define a map $f : V \to \mathbb{R}$ from $V$ to the real numbers to be a functional. If $f$ is linear, i.e., if for real $a, b$ we have
$$f(a\mathbf{x} + b\mathbf{y}) = a f(\mathbf{x}) + b f(\mathbf{y}),$$
then we say $f$ is a linear functional. If $V$ is an inner product space (so each $\mathbf{v}$ has a length $\|\mathbf{v}\|$), we say that $f$ is bounded if $|f(\mathbf{x})| \le C\|\mathbf{x}\|$ for some number $C > 0$ and all $\mathbf{x} \in V$.
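A concrete example (added for illustration): on $V = \mathbb{R}^n$ with the standard inner product, the map $f(\mathbf{v}) = \langle \mathbf{w}, \mathbf{v} \rangle$ for a fixed vector $\mathbf{w}$ is a linear functional, and by the Cauchy-Schwarz inequality $|f(\mathbf{v})| \le \|\mathbf{w}\| \, \|\mathbf{v}\|$, so it is bounded with $C = \|\mathbf{w}\|$.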

Reproducing kernel Hilbert spaces

2. Reproducing kernel Hilbert spaces: Def. 1. An $n \times n$ matrix $M$ is symmetric if $M_{ij} = M_{ji}$ for all $i, j$. A symmetric $M$ is positive if all of its eigenvalues are nonnegative.

Reproducing kernel Hilbert spaces

Equivalently, $M$ is positive if
$$\langle \mathbf{a}, M\mathbf{a} \rangle = \mathbf{a}^T M \mathbf{a} \ge 0$$
for all vectors $\mathbf{a} = (a_1, \ldots, a_n)^T$, with $\langle \cdot, \cdot \rangle$ the standard inner product on $\mathbb{R}^n$. Above, $\mathbf{a}^T = (a_1, \ldots, a_n)$ is the transpose of $\mathbf{a}$.
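To make the two characterizations of positivity concrete, here is a minimal numerical sketch (added here, not part of the original slides) that checks a small symmetric matrix both ways, via its eigenvalues and via the quadratic form; the matrix M is an arbitrary example.

```python
import numpy as np

# A symmetric matrix M is positive (Def. 1) iff all eigenvalues are >= 0,
# equivalently (as above) a^T M a >= 0 for every vector a.
M = np.array([[2.0, 1.0],
              [1.0, 2.0]])

print(np.linalg.eigvalsh(M))         # [1. 3.] -- all nonnegative

rng = np.random.default_rng(0)
for _ in range(5):
    a = rng.standard_normal(2)
    print(a @ M @ a >= 0)            # True for each random vector a
```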

Reproducing kernel Hilbert spaces

Definition 2: Let $X \subset \mathbb{R}^p$ be compact (i.e., a closed bounded subset). A (real) reproducing kernel Hilbert space (RKHS) $\mathcal{H}$ on $X$ is a Hilbert space of functions on $X$ (i.e., a complete collection of functions which is closed under addition and scalar multiplication, and for which an inner product is defined). $\mathcal{H}$ also needs the property: for any fixed $\mathbf{x} \in X$, the evaluation functional $E_{\mathbf{x}} : \mathcal{H} \to \mathbb{R}$ defined by $E_{\mathbf{x}}(f) = f(\mathbf{x})$ is a bounded linear functional on $\mathcal{H}$.

Reproducing kernel Hilbert spaces

Definition 3: We define a kernel to be a function $K : X \times X \to \mathbb{R}$ which is symmetric, i.e., $K(\mathbf{x}, \mathbf{y}) = K(\mathbf{y}, \mathbf{x})$ for $\mathbf{x}, \mathbf{y} \in X$. We say that $K$ is positive if for any fixed collection $\{\mathbf{x}_1, \ldots, \mathbf{x}_n\} \subset X$, the $n \times n$ matrix $K = (K_{ij}) = (K(\mathbf{x}_i, \mathbf{x}_j))$ is positive (i.e., nonnegative).
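A small numerical illustration of Definition 3 (added here, not in the original slides), using the Gaussian kernel $K(\mathbf{x}, \mathbf{y}) = e^{-\|\mathbf{x}-\mathbf{y}\|^2 / 2\sigma^2}$ as a standard example of a symmetric positive kernel: for a handful of points the Gram matrix $K_{ij} = K(\mathbf{x}_i, \mathbf{x}_j)$ is symmetric and has nonnegative eigenvalues.

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    """K(x, y) = exp(-||x - y||^2 / (2 sigma^2)), a symmetric positive kernel."""
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

rng = np.random.default_rng(1)
points = rng.standard_normal((6, 2))     # six points x_i in R^2
n = len(points)
K = np.array([[gaussian_kernel(points[i], points[j]) for j in range(n)]
              for i in range(n)])

print(np.allclose(K, K.T))               # symmetric
print(np.linalg.eigvalsh(K).min())       # smallest eigenvalue >= 0 (up to round-off)
```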

Kernel existence

We now have the reason these are called RKHS:

Theorem 1: Given a reproducing kernel Hilbert space $\mathcal{H}$ of functions on $X$, there exists a unique symmetric positive kernel function $K(\mathbf{x}, \mathbf{y})$ such that for all $f \in \mathcal{H}$,
$$f(\mathbf{x}) = \langle f(\cdot), K(\cdot, \mathbf{x}) \rangle_{\mathcal{H}}$$
(the inner product above is in the variable $\cdot$; $\mathbf{x}$ is fixed). Note this means that evaluation of $f$ at $\mathbf{x}$ is equivalent to taking the inner product of $f$ with the fixed function $K(\cdot, \mathbf{x})$.

Kernel existence

Proof (please look at this on your own): For any fixed $\mathbf{x} \in X$, recall $E_{\mathbf{x}}$ is a bounded linear functional on $\mathcal{H}$. By the Riesz representation theorem¹ there exists a fixed function, call it $K_{\mathbf{x}}(\cdot)$, such that for all $f \in \mathcal{H}$ (recall $\mathbf{x}$ is fixed, now $f$ is varying)
$$f(\mathbf{x}) = E_{\mathbf{x}}(f) = \langle f(\cdot), K_{\mathbf{x}}(\cdot) \rangle. \qquad (1)$$
(All inner products are in $\mathcal{H}$, not in $L^2$, i.e., $\langle f, g \rangle = \langle f, g \rangle_{\mathcal{H}}$.)

¹ Riesz Representation Theorem: If $\varphi : \mathcal{H} \to \mathbb{R}$ is a bounded linear functional on $\mathcal{H}$, there exists a unique $\mathbf{y} \in \mathcal{H}$ such that for all $\mathbf{x} \in \mathcal{H}$, $\varphi(\mathbf{x}) = \langle \mathbf{y}, \mathbf{x} \rangle$.

Kernel existence

That is, evaluation of $f$ at $\mathbf{x}$ is equivalent to an inner product with the function $K_{\mathbf{x}}$. Define $K(\mathbf{x}, \mathbf{y}) = K_{\mathbf{x}}(\mathbf{y})$. Note that by (1), the functions $K_{\mathbf{x}}(\cdot)$ and $K_{\mathbf{y}}(\cdot)$ satisfy
$$\langle K_{\mathbf{x}}(\cdot), K_{\mathbf{y}}(\cdot) \rangle = K_{\mathbf{y}}(\mathbf{x}) = K_{\mathbf{x}}(\mathbf{y}),$$
so $K(\mathbf{x}, \mathbf{y})$ is symmetric.

Kernel existence

To prove $K(\mathbf{x}, \mathbf{y})$ is positive: let $\{\mathbf{x}_1, \ldots, \mathbf{x}_n\}$ be a fixed collection. If $K_{ij} = K(\mathbf{x}_i, \mathbf{x}_j)$, then with $K = (K_{ij})$ the corresponding matrix and $\mathbf{c} = (c_1, c_2, \ldots, c_n)^T$,

Kernel existence

$$\langle \mathbf{c}, K\mathbf{c} \rangle = \mathbf{c}^T K \mathbf{c} = \sum_{i,j=1}^n c_i c_j K(\mathbf{x}_i, \mathbf{x}_j) = \sum_{i,j=1}^n c_i c_j \langle K_{\mathbf{x}_i}(\cdot), K_{\mathbf{x}_j}(\cdot) \rangle = \Big\langle \sum_{i=1}^n c_i K_{\mathbf{x}_i}(\cdot), \sum_{j=1}^n c_j K_{\mathbf{x}_j}(\cdot) \Big\rangle = \Big\| \sum_{i=1}^n c_i K_{\mathbf{x}_i}(\cdot) \Big\|_{\mathcal{H}}^2 \ge 0.$$

Kernel existence

Definition 4: We call the above kernel $K(\mathbf{x}, \mathbf{y})$ the reproducing kernel of $\mathcal{H}$.

Definition 5: A Mercer kernel is a positive definite kernel $K(\mathbf{x}, \mathbf{y})$ which is also continuous as a function of $\mathbf{x}$ and $\mathbf{y}$, and bounded.

Def. 6: For a continuous function $f$ on a compact set $X$, we define $\|f\|_\infty = \max_{\mathbf{x} \in X} |f(\mathbf{x})|$.

Kernel existence

Theorem 2: (i) For every Mercer kernel $K : X \times X \to \mathbb{R}$, there exists a unique Hilbert space $\mathcal{H}$ of functions on $X$ such that $K$ is its reproducing kernel. (ii) Moreover, this $\mathcal{H}$ consists of continuous functions, and for any $f \in \mathcal{H}$,
$$\|f\|_\infty \le M_K \|f\|_{\mathcal{H}}, \quad \text{where } M_K = \max_{\mathbf{x}, \mathbf{y} \in X} \sqrt{K(\mathbf{x}, \mathbf{y})}.$$

Kernel existence

Proof (please look at this on your own): Let $K(\mathbf{x}, \mathbf{y}) : X \times X \to \mathbb{R}$ be a Mercer kernel. We will construct a reproducing kernel Hilbert space $\mathcal{H}$ with reproducing kernel $K$ as follows. Define
$$\mathcal{H}_0 = \operatorname{span}\{K_{\mathbf{x}}(\cdot) : \mathbf{x} \in X\} = \Big\{ \sum_i c_i K_{\mathbf{x}_i}(\cdot) : \{\mathbf{x}_i\}_i \subset X \text{ is any finite subset}; \ c_i \in \mathbb{R} \Big\}.$$

Kernel existence

Now we define the inner product $\langle f, g \rangle$ for $f, g \in \mathcal{H}_0$. Assume
$$f(\cdot) = \sum_{i=1}^{\ell} a_i K_{\mathbf{x}_i}(\cdot), \qquad g(\cdot) = \sum_{i=1}^{\ell} b_i K_{\mathbf{x}_i}(\cdot).$$
[Note we may assume $f, g$ both use the same set $\{\mathbf{x}_i\}$, since if not we may take a union without loss.] [Note again that here $\langle \cdot, \cdot \rangle = \langle \cdot, \cdot \rangle_{\mathcal{H}}$.]

Kernel existence

Then define $\langle K_{\mathbf{x}}(\cdot), K_{\mathbf{y}}(\cdot) \rangle = K(\mathbf{x}, \mathbf{y})$, so that
$$\langle f(\cdot), g(\cdot) \rangle = \Big\langle \sum_{i=1}^{\ell} a_i K(\mathbf{x}_i, \cdot), \sum_{j=1}^{\ell} b_j K(\mathbf{x}_j, \cdot) \Big\rangle = \sum_{i,j=1}^{\ell} a_i b_j \langle K(\mathbf{x}_i, \cdot), K(\mathbf{x}_j, \cdot) \rangle = \sum_{i,j=1}^{\ell} a_i b_j K(\mathbf{x}_i, \mathbf{x}_j).$$
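In coordinates, this inner product is just a quadratic form in the coefficient vectors: $\langle f, g \rangle = \mathbf{a}^T K \mathbf{b}$ with $K_{ij} = K(\mathbf{x}_i, \mathbf{x}_j)$. A minimal sketch follows (added here; the Gaussian kernel is again used only as a convenient Mercer kernel).

```python
import numpy as np

# f = sum_i a_i K(x_i, .) and g = sum_j b_j K(x_j, .) built on the same points:
# <f, g>_H = sum_{i,j} a_i b_j K(x_i, x_j) = a^T K b, and ||f||_H = sqrt(a^T K a).
def gram(points, sigma=1.0):
    diffs = points[:, None, :] - points[None, :, :]
    return np.exp(-np.sum(diffs ** 2, axis=-1) / (2 * sigma ** 2))

points = np.array([[0.0], [0.5], [2.0]])   # x_1, x_2, x_3 in R
a = np.array([1.0, -2.0, 0.5])             # coefficients of f
b = np.array([0.3, 0.0, 1.0])              # coefficients of g

K = gram(points)
print(a @ K @ b)                           # <f, g>_H
print(np.sqrt(a @ K @ a))                  # ||f||_H (nonnegative since K is positive)
```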

Kernel existence

Easy to check that with the above inner product $\mathcal{H}_0$ is an inner product space (i.e., satisfies properties (a)-(d)). Now form the completion² of this space into the Hilbert space $\mathcal{H}$. Note that for $f = \sum_i a_i K_{\mathbf{x}_i}(\cdot)$ as above,
$$|f(\mathbf{x})| = |\langle f(\cdot), K(\mathbf{x}, \cdot) \rangle| \le \|f(\cdot)\| \, \|K(\mathbf{x}, \cdot)\| = \|f\| \sqrt{\langle K(\mathbf{x}, \cdot), K(\mathbf{x}, \cdot) \rangle} = \|f\| \sqrt{K(\mathbf{x}, \mathbf{x})} \le M_K \|f\|_{\mathcal{H}}.$$
[Note again that here we write $\|f\| = \|f\|_{\mathcal{H}}$ by definition; similarly $\langle f, g \rangle = \langle f, g \rangle_{\mathcal{H}}$.]

The above shows that the imbedding $I : \mathcal{H}_0 \to C(X)$ (the latter is the space of continuous functions on $X$) is bounded. By this we mean that $I$ maps the function $f$ as a function in $\mathcal{H}_0$ to itself as a function in $C(X)$; in $C(X)$ the norm of $f$ is $\|f\|_\infty = \sup_{\mathbf{x} \in X} |f(\mathbf{x})|$. By bounded we mean that $\|I f\|_\infty = \|f\|_\infty \le d \, \|f\|_{\mathcal{H}}$ for some constant $d > 0$.

² The completion of a non-complete inner product space $\mathcal{H}_0$ is the (unique) smallest complete inner product (Hilbert) space $\mathcal{H}$ which contains $\mathcal{H}_0$. That is, $\mathcal{H}_0 \subset \mathcal{H}$, the inner product on $\mathcal{H}_0$ is the same as on $\mathcal{H}$, and there is no smaller complete Hilbert space which contains $\mathcal{H}_0$.
Example 1: $\mathcal{H} = \ell^2 = \{ \mathbf{a} = (a_1, a_2, \ldots) : \sum_i |a_i|^2 < \infty \}$ with inner product $\langle \mathbf{a}, \mathbf{b} \rangle = \sum_{i=1}^\infty a_i b_i$ was discussed in class. The inner product space $\mathcal{H}_0 = \{ (a_1, a_2, \ldots) \in \mathcal{H} : \text{all but a finite number of the } a_i \text{ are } 0 \}$ is an example of an incomplete space. $\mathcal{H}$ is its completion.
Example 2: $\mathcal{H} = L^2(-\pi, \pi)$ with the standard inner product for functions. We know that if $f(x) \in \mathcal{H}$ then $f(x) = \frac{a_0}{2} + \sum_{k=1}^{\infty} (a_k \cos kx + b_k \sin kx)$. Define $\mathcal{H}_0$ to be all $f \in \mathcal{H}$ for which the above sum is finite (i.e., all but a finite number of terms are 0). Then $\mathcal{H}$ is the completion of $\mathcal{H}_0$.
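A quick numerical spot-check (added; not in the slides) of the pointwise bound just derived, $|f(\mathbf{x})| \le \|f\|_{\mathcal{H}} \sqrt{K(\mathbf{x}, \mathbf{x})}$, for an $f \in \mathcal{H}_0$ built from the Gaussian kernel:

```python
import numpy as np

def k(u, v, sigma=1.0):
    return np.exp(-(u - v) ** 2 / (2 * sigma ** 2))

centers = np.array([0.0, 0.7, 1.5])        # the x_i defining f = sum_i a_i K(x_i, .)
a = np.array([2.0, -1.0, 0.5])

K = np.array([[k(u, v) for v in centers] for u in centers])
norm_f = np.sqrt(a @ K @ a)                # ||f||_H = sqrt(a^T K a)

for x in np.linspace(-3.0, 3.0, 7):
    f_x = sum(a_i * k(x_i, x) for a_i, x_i in zip(a, centers))
    assert abs(f_x) <= norm_f * np.sqrt(k(x, x)) + 1e-12
print("pointwise bound holds at all test points")
```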

Kernel existence

Thus any Cauchy sequence in $\mathcal{H}_0$ is also Cauchy in $C(X)$, and so it follows easily that the completion $\mathcal{H}$ of $\mathcal{H}_0$ exists as a subset of $C(X)$. That $K$ is a reproducing kernel for $\mathcal{H}$ follows by approximation from $\mathcal{H}_0$.

Regularization methods

3. Regularization methods for choosing $f$

Finding $f$ from $Nf = \mathcal{D} = \{(\mathbf{x}_i, y_i)\}_{i=1}^n$ is an ill-posed problem: the inverse operator $N^{-1}$ does not exist, because $N$ is not one-to-one. We need to combine both:
(a) the data $Nf$;
(b) a priori information, e.g. "$f$ is smooth", expressing a preference for smooth over wiggly solutions as seen earlier.
How do we incorporate both? Using Tikhonov regularization methods.

Regularization methods

We introduce a regularization loss functional $L(f)$ representing a penalty for the choice of an "unrealistic" $f$, such as that in (a) above. Assume we want to find the correct function $f_0(\mathbf{x})$ from the data
$$N f_0 = ((\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_n, y_n)) = \mathcal{D}.$$

Regularization methods

Suppose we are given $f(\mathbf{x})$ as a candidate for approximating $f_0(\mathbf{x})$ from the information in $\mathcal{D}$. We score $f$ as a good or bad approximation based on a combination of its error on the known points $\{\mathbf{x}_i\}_{i=1}^n$, together with its "plausibility", i.e., how low the Lagrangian
$$\mathcal{L} = \frac{1}{n} \sum_{i=1}^n V(f(\mathbf{x}_i), y_i) + L(f)$$
is. Here $V(f(\mathbf{x}_i), y_i)$ is a measure of the loss whenever $f(\mathbf{x}_i)$ is far from $y_i$, e.g.
$$V(f(\mathbf{x}_i), y_i) = |f(\mathbf{x}_i) - y_i|^2.$$

Regularization methods

And $L(f)$ measures the a priori loss, i.e., it is a measure of the discrepancy between the prospective choice $f$ and our prior expectations about $f$.
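A minimal sketch (added; not in the slides) of scoring candidate functions this way: the square loss is used for $V$, and as a stand-in for $L(f)$ we use a hypothetical smoothness penalty, the mean squared second derivative on a grid, with a weight lam folded into the penalty term. A wiggly candidate scores much worse than a smooth one on the same data.

```python
import numpy as np

def score(f, xs, ys, grid, lam=0.1):
    data_term = np.mean((f(xs) - ys) ** 2)                 # (1/n) sum_i V(f(x_i), y_i)
    f_pp = np.gradient(np.gradient(f(grid), grid), grid)   # crude second derivative
    penalty = lam * np.mean(f_pp ** 2)                     # stand-in for L(f)
    return data_term + penalty

xs = np.array([0.0, 1.0, 2.0, 3.0])
ys = np.array([0.1, 0.9, 2.1, 2.9])
grid = np.linspace(-1.0, 4.0, 200)

print(score(lambda x: x, xs, ys, grid))                    # smooth candidate: small score
print(score(lambda x: x + np.sin(20 * x), xs, ys, grid))   # wiggly candidate: much larger score
```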

Examples: Regularization methods

Example:
$$L(f) = \|A f\|_{L^2}^2 = \int d\mathbf{x} \, |A f(\mathbf{x})|^2,$$
where $A f = -\Delta f + f$; here $\Delta f = \frac{\partial^2 f}{\partial x_1^2} + \cdots + \frac{\partial^2 f}{\partial x_p^2}$. Note that $\Delta f$, and thus $L(f)$, measures the degree of nonsmoothness that $f$ has (i.e., we prefer smoother functions a priori).

Examples: Regularization methods

Example 7: Consider the case $L(f) = \|A f\|_{L^2}^2$ above. The norm $\|f\|_{\mathcal{H}} = \|A f\|_{L^2}$ is a reproducing kernel Hilbert space norm (at least if the dimension is small). That is, it comes from an inner product $\langle f, g \rangle = \int_X (A f)(\mathbf{x}) (A g)(\mathbf{x}) \, d\mathbf{x}$, and with this inner product $\mathcal{H}$ is an RKHS. If this is the case, in general things become easier.

Examples: Regularization methods

Example 8: Consider the case $f = f(x)$, $x \in \mathbb{R}^1$. Suppose we choose
$$A f = f - \frac{d^2 f}{dx^2} = \Big(1 - \frac{d^2}{dx^2}\Big) f;$$
we have
$$L(f) = \|A f\|^2 = \int \Big( \Big(1 - \frac{d^2}{dx^2}\Big) f \Big)^2 dx,$$
and $\|A f\|^2$ is a measure of the "lack of smoothness" of $f$.
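A finite-difference sketch (added; not in the slides) of this penalty for a function sampled on a grid, approximating the integral by a Riemann sum; as expected, a highly oscillatory function is penalized far more than a slowly varying one.

```python
import numpy as np

def smoothness_penalty(f_values, x):
    """Approximate L(f) = int ((1 - d^2/dx^2) f)^2 dx for samples f_values on grid x."""
    f_pp = np.gradient(np.gradient(f_values, x), x)    # crude second derivative
    integrand = (f_values - f_pp) ** 2                  # ((1 - d^2/dx^2) f)^2
    return np.sum(integrand) * (x[1] - x[0])            # Riemann sum

x = np.linspace(0.0, 2 * np.pi, 2000)
print(smoothness_penalty(np.sin(x), x))        # (1 - d^2/dx^2) sin = 2 sin: penalty about 4*pi
print(smoothness_penalty(np.sin(10 * x), x))   # 101 * sin(10x): penalty larger by roughly 2500x
```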

Examples: Regularization methods

4. More about using the Laplacian to measure smoothness (Sobolev smoothness)

Basic definitions: Recall that the Laplacian operator $\Delta$ on a function $f$ on $\mathbb{R}^p$, $f(\mathbf{x}) = f(x_1, \ldots, x_p)$, is defined by
$$\Delta f = \frac{\partial^2 f}{\partial x_1^2} + \cdots + \frac{\partial^2 f}{\partial x_p^2}.$$

Using the Laplacian for kernels

For $s > 0$ an even integer, we can define the Sobolev space $H^s$ by
$$H^s = \{ f \in L^2(\mathbb{R}^p) : (1 - \Delta)^{s/2} f \in L^2(\mathbb{R}^p) \},$$
i.e., the functions in $L^2(\mathbb{R}^p)$ which are still in $L^2$ after applying the derivative operation $(1 - \Delta)^{s/2}$, i.e., $(I - \Delta)$ repeated $s/2$ times (the operator $1$ is always the identity). For $f, g \in H^s$ define the new inner product

Using the Laplacian for kernels

$$\langle f, g \rangle_{H^s} = \langle (1 - \Delta)^{s/2} f, (1 - \Delta)^{s/2} g \rangle_{L^2};$$
[note $\langle h(\mathbf{x}), k(\mathbf{x}) \rangle_{L^2} = \int_X h(\mathbf{x}) k(\mathbf{x}) \, d\mathbf{x}$]. One can show that $H^s$ is an RKHS with reproducing kernel
$$K(\mathbf{z}) = \mathcal{F}^{-1}\Big( \frac{1}{(|\boldsymbol{\omega}|^2 + 1)^s} \Big), \qquad (3)$$

Using the Laplacian for kernels

where $\mathcal{F}^{-1}$ denotes the inverse Fourier transform. The function $\frac{1}{(|\boldsymbol{\omega}|^2 + 1)^s}$ is a function of $\boldsymbol{\omega} = (\omega_1, \ldots, \omega_p)$, where $|\boldsymbol{\omega}|^2 = \omega_1^2 + \cdots + \omega_p^2$.
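A numerical sketch (added; not in the slides) of (3) in one dimension with $s = 1$, using the convention $\mathcal{F}^{-1}g(z) = \frac{1}{2\pi}\int g(\omega) e^{i\omega z}\, d\omega$ (conventions differ by constants): the inverse transform of $1/(\omega^2+1)$ is, analytically, $\tfrac{1}{2} e^{-|z|}$, and a simple Riemann-sum approximation of the integral reproduces this.

```python
import numpy as np

w = np.linspace(-400.0, 400.0, 800001)   # frequency grid (wide and fine)
dw = w[1] - w[0]
g = 1.0 / (w ** 2 + 1.0)                 # (|w|^2 + 1)^(-s) with s = 1

for z in [0.0, 0.5, 1.0, 2.0]:
    # inverse Fourier integral, approximated by a Riemann sum
    K_z = np.real(np.sum(g * np.exp(1j * w * z)) * dw) / (2 * np.pi)
    print(z, K_z, 0.5 * np.exp(-abs(z)))  # numerical kernel vs. closed form (agree to roughly 1e-3)
```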

Using the Laplacian for kernels

[Figure 1: $K(z)$ in one dimension - a smooth kernel.]

Using the Laplacian for kernels

$K(\mathbf{z})$ is called a radial basis function. Note: the kernel $K(\mathbf{x}, \mathbf{y})$ (as a function of 2 variables) is defined in terms of the above $K$ by $K(\mathbf{x}, \mathbf{y}) = K(\mathbf{x} - \mathbf{y})$.

Using the Laplacian for kernels

The Representer Theorem for RKHS

1. An application: using RKHS for regularization

Assume again that we have an unknown function $f$ on $X$, with only the data $Nf = ((\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_n, y_n)) = \mathcal{D}$. To find the best guess $\hat{f}$ for $f$, approximate it by the minimizer

RKHS and regularization

$$\hat{f} = \arg\min_{f \in H^s} \Big[ \frac{1}{n} \sum_{i=1}^n \|f(\mathbf{x}_i) - y_i\|^2 + \lambda \|f\|_{H^s}^2 \Big], \qquad (1)$$
where $\lambda$ is some constant. Note we are finding an $f$ which balances minimizing $\sum_{i=1}^n \|f(\mathbf{x}_i) - y_i\|^2$, i.e., the data error, with minimizing $\|f\|_{H^s}^2$, i.e., maximizing the smoothness. The solution to such a problem will look like this:

RKHS and regularization It will compromise between fitting the data (which may have error) and trying to be smooth.

RKHS and regularization

The amazing thing: $\hat{f}$ can be found explicitly using radial basis functions.

2. Solving the minimization

Now consider the optimization problem (1). We claim that we can solve it explicitly. To see how this works in general for an RKHS, return to the general problem. Given an unknown $f$ in an RKHS $\mathcal{H}$, try to find the "best" approximation $\hat{f}$ to $f$ fitting the data

RKHS and regularization

$Nf = ((\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_n, y_n))$, but ALSO satisfying the a priori knowledge that $\|f_0\|_{\mathcal{H}}$ is small. Specifically, we want to find
$$\hat{f} = \arg\min_{f \in \mathcal{H}} \Big[ \frac{1}{n} \sum_{i=1}^n V(f(\mathbf{x}_i), y_i) + \lambda \|f\|_{\mathcal{H}}^2 \Big]. \qquad (2)$$
Note we can have, e.g., $V(f(\mathbf{x}_i), y_i) = (f(\mathbf{x}_i) - y_i)^2$.

RKHS and regularization

In that case
$$\sum_{i=1}^n V(f(\mathbf{x}_i), y_i) = \sum_{i=1}^n (f(\mathbf{x}_i) - y_i)^2.$$
Consider the general case (2), with an arbitrary error measure $V$. We have the

RKHS and regularization

Representer Theorem: A solution of the Tikhonov optimization problem (2) can be written
$$\hat{f}(\mathbf{x}) = \sum_{i=1}^n a_i K(\mathbf{x}, \mathbf{x}_i), \qquad (3)$$
where $K$ is the reproducing kernel of the RKHS $\mathcal{H}$.

This is an important theorem: it says we only need to find a set of $n$ numbers $a_i$ to optimize the infinite-dimensional problem (2) above.
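For the square loss $V(f(\mathbf{x}_i), y_i) = (f(\mathbf{x}_i) - y_i)^2$, substituting the form (3) into (2) turns the problem into a finite linear system for the coefficients; a minimal sketch follows (added; the Gaussian kernel is used here only as one convenient choice of reproducing kernel, and the system $(K + n\lambda I)\mathbf{a} = \mathbf{y}$ is the standard kernel ridge regression solution for this loss).

```python
import numpy as np

def gaussian_gram(X, Z, sigma=1.0):
    d2 = np.sum((X[:, None, :] - Z[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2 * sigma ** 2))

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(30, 1))                  # sample points x_i
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(30)   # noisy observations y_i
lam, n = 0.1, len(X)

K = gaussian_gram(X, X)
a = np.linalg.solve(K + n * lam * np.eye(n), y)        # representer coefficients a_i

# The fitted function f_hat(x) = sum_i a_i K(x, x_i), evaluated at new points
X_new = np.linspace(-3, 3, 5)[:, None]
print(gaussian_gram(X_new, X) @ a)
```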

RKHS and regularization

Proof: Use the calculus of variations. If a minimizer $f_1$ exists, then for all $g \in \mathcal{H}$, assuming that the derivatives with respect to $\varepsilon$ exist:

Representer theorem proof

$$0 = \frac{d}{d\varepsilon} \Bigg[ \frac{1}{n} \sum_{i=1}^n V\big((f_1 + \varepsilon g)(\mathbf{x}_i), y_i\big) + \lambda \|f_1 + \varepsilon g\|_{\mathcal{H}}^2 \Bigg] \Bigg|_{\varepsilon = 0}$$
$$= \frac{1}{n} \sum_{i=1}^n \frac{\partial V(f_1(\mathbf{x}_i), y_i)}{\partial f_1(\mathbf{x}_i)} \, g(\mathbf{x}_i) + \lambda \frac{d}{d\varepsilon} \Big[ \langle f_1, f_1 \rangle + 2\varepsilon \langle f_1, g \rangle + \varepsilon^2 \langle g, g \rangle \Big] \Bigg|_{\varepsilon = 0}$$

Representer theorem proof

$$= \frac{1}{n} \sum_{i=1}^n V_1(f_1(\mathbf{x}_i), y_i) \, g(\mathbf{x}_i) + 2\lambda \langle f_1, g \rangle,$$
where $V_1(a, b) = \frac{\partial}{\partial a} V(a, b)$ and all inner products are in $\mathcal{H}$. Since the above is true for all $g \in \mathcal{H}$, it follows that if we let $g = K_{\mathbf{x}}$ we get:

Representer theorem proof

$$0 = \frac{1}{n} \sum_{i=1}^n V_1(f_1(\mathbf{x}_i), y_i) K_{\mathbf{x}}(\mathbf{x}_i) + 2\lambda \langle f_1, K_{\mathbf{x}} \rangle = \frac{1}{n} \sum_{i=1}^n V_1(f_1(\mathbf{x}_i), y_i) K_{\mathbf{x}}(\mathbf{x}_i) + 2\lambda f_1(\mathbf{x}),$$
or
$$f_1(\mathbf{x}) = -\frac{1}{2\lambda n} \sum_{i=1}^n V_1(f_1(\mathbf{x}_i), y_i) K(\mathbf{x}, \mathbf{x}_i).$$

Representer theorem proof

Thus if a minimizer $\hat{f} = f_1$ exists for (2), it can be written in the form (3) as claimed, with
$$a_i = -\frac{1}{2\lambda n} V_1(f_1(\mathbf{x}_i), y_i).$$
Note that this does not by itself solve the problem, since the $a_i$ are expressed in terms of the solution itself. But it does reduce the possibilities for what a solution looks like.
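A worked special case (added for illustration): for the square loss $V(a, b) = (a - b)^2$ we have $V_1(a, b) = 2(a - b)$, so the formula above gives
$$a_i = -\frac{1}{2\lambda n} \cdot 2\big(f_1(\mathbf{x}_i) - y_i\big) = \frac{y_i - f_1(\mathbf{x}_i)}{\lambda n}.$$
Writing $f_1(\mathbf{x}_i) = \sum_j a_j K(\mathbf{x}_i, \mathbf{x}_j)$ and rearranging gives $\lambda n \, a_i + \sum_j K(\mathbf{x}_i, \mathbf{x}_j) a_j = y_i$, i.e., the linear system $(K + \lambda n I)\mathbf{a} = \mathbf{y}$ used in the earlier code sketch; in this case the representer form does determine the $a_i$ explicitly.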