
Part 5 (MA 751): Statistical machine learning and kernel methods. Primary references: John Shawe-Taylor and Nello Cristianini, Kernel Methods for Pattern Analysis; Christopher Burges, "A tutorial on support vector machines for pattern recognition," Data Mining and Knowledge Discovery 2, 121-167 (1998).

Other references: N. Aronszajn, "Theory of reproducing kernels," Transactions of the American Mathematical Society 68, 337-404 (1950). Felipe Cucker and Steve Smale, "On the mathematical foundations of learning," Bulletin of the American Mathematical Society, 2002. Theodoros Evgeniou, Massimiliano Pontil and Tomaso Poggio, "Regularization Networks and Support Vector Machines," Advances in Computational Mathematics, 2000.

1. Linear functionals

Given a vector space $V$, we define a map $f : V \to \mathbb{R}$ from $V$ to the real numbers to be a functional. If $f$ is linear, i.e., if for real $a, b$ we have
$$f(a\mathbf{x} + b\mathbf{y}) = a f(\mathbf{x}) + b f(\mathbf{y}),$$
then we say $f$ is a linear functional. If $V$ is an inner product space (so each $\mathbf{v}$ has a length $\|\mathbf{v}\|$), we say that $f$ is bounded if $|f(\mathbf{x})| \le C\|\mathbf{x}\|$ for some number $C > 0$ and all $\mathbf{x} \in V$.
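A concrete example (added for illustration): on $V = \mathbb{R}^n$ with the standard inner product, the map $f(\mathbf{v}) = \langle \mathbf{w}, \mathbf{v} \rangle$ for a fixed vector $\mathbf{w}$ is a linear functional, and by the Cauchy-Schwarz inequality $|f(\mathbf{v})| \le \|\mathbf{w}\| \, \|\mathbf{v}\|$, so it is bounded with $C = \|\mathbf{w}\|$.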

Reproducing kernel Hilbert spaces

2. Reproducing kernel Hilbert spaces: Def. 1. An $n \times n$ matrix $M$ is symmetric if $M_{ij} = M_{ji}$ for all $i, j$. A symmetric $M$ is positive if all of its eigenvalues are nonnegative.

Reproducing kernel Hilbert spaces

Equivalently, $M$ is positive if
$$\langle \mathbf{a}, M\mathbf{a} \rangle = \mathbf{a}^T M \mathbf{a} \ge 0$$
for all vectors $\mathbf{a} = (a_1, \ldots, a_n)^T$, with $\langle \cdot, \cdot \rangle$ the standard inner product on $\mathbb{R}^n$. Above, $\mathbf{a}^T = (a_1, \ldots, a_n)$ is the transpose of $\mathbf{a}$.
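To make the two characterizations of positivity concrete, here is a minimal numerical sketch (added here, not part of the original slides) that checks a small symmetric matrix both ways, via its eigenvalues and via the quadratic form; the matrix M is an arbitrary example.

```python
import numpy as np

# A symmetric matrix M is positive (Def. 1) iff all eigenvalues are >= 0,
# equivalently (as above) a^T M a >= 0 for every vector a.
M = np.array([[2.0, 1.0],
              [1.0, 2.0]])

print(np.linalg.eigvalsh(M))         # [1. 3.] -- all nonnegative

rng = np.random.default_rng(0)
for _ in range(5):
    a = rng.standard_normal(2)
    print(a @ M @ a >= 0)            # True for each random vector a
```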

Reproducing kernel Hilbert spaces

Definition 2: Let $X \subset \mathbb{R}^p$ be compact (i.e., a closed bounded subset). A (real) reproducing kernel Hilbert space (RKHS) $\mathcal{H}$ on $X$ is a Hilbert space of functions on $X$ (i.e., a complete collection of functions which is closed under addition and scalar multiplication, and for which an inner product is defined). $\mathcal{H}$ also needs the property: for any fixed $\mathbf{x} \in X$, the evaluation functional $E_{\mathbf{x}} : \mathcal{H} \to \mathbb{R}$ defined by $E_{\mathbf{x}}(f) = f(\mathbf{x})$ is a bounded linear functional on $\mathcal{H}$.

Reproducing kernel Hilbert spaces

Definition 3: We define a kernel to be a function $K : X \times X \to \mathbb{R}$ which is symmetric, i.e., $K(\mathbf{x}, \mathbf{y}) = K(\mathbf{y}, \mathbf{x})$ for $\mathbf{x}, \mathbf{y} \in X$. We say that $K$ is positive if for any fixed collection $\{\mathbf{x}_1, \ldots, \mathbf{x}_n\} \subset X$, the $n \times n$ matrix $K = (K_{ij}) = (K(\mathbf{x}_i, \mathbf{x}_j))$ is positive (i.e., nonnegative).
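A small numerical illustration of Definition 3 (added here, not in the original slides), using the Gaussian kernel $K(\mathbf{x}, \mathbf{y}) = e^{-\|\mathbf{x}-\mathbf{y}\|^2 / 2\sigma^2}$ as a standard example of a symmetric positive kernel: for a handful of points the Gram matrix $K_{ij} = K(\mathbf{x}_i, \mathbf{x}_j)$ is symmetric and has nonnegative eigenvalues.

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    """K(x, y) = exp(-||x - y||^2 / (2 sigma^2)), a symmetric positive kernel."""
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

rng = np.random.default_rng(1)
points = rng.standard_normal((6, 2))     # six points x_i in R^2
n = len(points)
K = np.array([[gaussian_kernel(points[i], points[j]) for j in range(n)]
              for i in range(n)])

print(np.allclose(K, K.T))               # symmetric
print(np.linalg.eigvalsh(K).min())       # smallest eigenvalue >= 0 (up to round-off)
```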

Kernel existence

We now have the reason these are called RKHS:

Theorem 1: Given a reproducing kernel Hilbert space $\mathcal{H}$ of functions on $X$, there exists a unique symmetric positive kernel function $K(\mathbf{x}, \mathbf{y})$ such that for all $f \in \mathcal{H}$,
$$f(\mathbf{x}) = \langle f(\cdot), K(\cdot, \mathbf{x}) \rangle_{\mathcal{H}}$$
(the inner product above is in the variable $\cdot$; $\mathbf{x}$ is fixed). Note this means that evaluation of $f$ at $\mathbf{x}$ is equivalent to taking the inner product of $f$ with the fixed function $K(\cdot, \mathbf{x})$.

Kernel existence

Proof (please look at this on your own): For any fixed $\mathbf{x} \in X$, recall $E_{\mathbf{x}}$ is a bounded linear functional on $\mathcal{H}$. By the Riesz representation theorem¹ there exists a fixed function, call it $K_{\mathbf{x}}(\cdot)$, such that for all $f \in \mathcal{H}$ (recall $\mathbf{x}$ is fixed, now $f$ is varying)
$$f(\mathbf{x}) = E_{\mathbf{x}}(f) = \langle f(\cdot), K_{\mathbf{x}}(\cdot) \rangle. \qquad (1)$$
(All inner products are in $\mathcal{H}$, not in $L^2$, i.e., $\langle f, g \rangle = \langle f, g \rangle_{\mathcal{H}}$.)

¹ Riesz Representation Theorem: If $\varphi : \mathcal{H} \to \mathbb{R}$ is a bounded linear functional on $\mathcal{H}$, there exists a unique $\mathbf{y} \in \mathcal{H}$ such that for all $\mathbf{x} \in \mathcal{H}$, $\varphi(\mathbf{x}) = \langle \mathbf{y}, \mathbf{x} \rangle$.

Kernel existence

That is, evaluation of $f$ at $\mathbf{x}$ is equivalent to an inner product with the function $K_{\mathbf{x}}$. Define $K(\mathbf{x}, \mathbf{y}) = K_{\mathbf{x}}(\mathbf{y})$. Note that by (1), the functions $K_{\mathbf{x}}(\cdot)$ and $K_{\mathbf{y}}(\cdot)$ satisfy
$$\langle K_{\mathbf{x}}(\cdot), K_{\mathbf{y}}(\cdot) \rangle = K_{\mathbf{y}}(\mathbf{x}) = K_{\mathbf{x}}(\mathbf{y}),$$
so $K(\mathbf{x}, \mathbf{y})$ is symmetric.

Kernel existence

To prove $K(\mathbf{x}, \mathbf{y})$ is positive: let $\{\mathbf{x}_1, \ldots, \mathbf{x}_n\}$ be a fixed collection. If $K_{ij} = K(\mathbf{x}_i, \mathbf{x}_j)$, then with $K = (K_{ij})$ the corresponding matrix and $\mathbf{c} = (c_1, c_2, \ldots, c_n)^T$,

Kernel existence

$$\langle \mathbf{c}, K\mathbf{c} \rangle = \mathbf{c}^T K \mathbf{c} = \sum_{i,j=1}^n c_i c_j K(\mathbf{x}_i, \mathbf{x}_j) = \sum_{i,j=1}^n c_i c_j \langle K_{\mathbf{x}_i}(\cdot), K_{\mathbf{x}_j}(\cdot) \rangle = \Big\langle \sum_{i=1}^n c_i K_{\mathbf{x}_i}(\cdot), \sum_{j=1}^n c_j K_{\mathbf{x}_j}(\cdot) \Big\rangle = \Big\| \sum_{i=1}^n c_i K_{\mathbf{x}_i}(\cdot) \Big\|_{\mathcal{H}}^2 \ge 0.$$

Kernel existence

Definition 4: We call the above kernel $K(\mathbf{x}, \mathbf{y})$ the reproducing kernel of $\mathcal{H}$.

Definition 5: A Mercer kernel is a positive definite kernel $K(\mathbf{x}, \mathbf{y})$ which is also continuous as a function of $\mathbf{x}$ and $\mathbf{y}$, and bounded.

Def. 6: For a continuous function $f$ on a compact set $X$, we define $\|f\|_\infty = \max_{\mathbf{x} \in X} |f(\mathbf{x})|$.

Kernel existence

Theorem 2: (i) For every Mercer kernel $K : X \times X \to \mathbb{R}$, there exists a unique Hilbert space $\mathcal{H}$ of functions on $X$ such that $K$ is its reproducing kernel. (ii) Moreover, this $\mathcal{H}$ consists of continuous functions, and for any $f \in \mathcal{H}$,
$$\|f\|_\infty \le M_K \|f\|_{\mathcal{H}}, \quad \text{where } M_K = \max_{\mathbf{x}, \mathbf{y} \in X} \sqrt{K(\mathbf{x}, \mathbf{y})}.$$

Kernel existence

Proof (please look at this on your own): Let $K(\mathbf{x}, \mathbf{y}) : X \times X \to \mathbb{R}$ be a Mercer kernel. We will construct a reproducing kernel Hilbert space $\mathcal{H}$ with reproducing kernel $K$ as follows. Define
$$\mathcal{H}_0 = \operatorname{span}\{K_{\mathbf{x}}(\cdot) : \mathbf{x} \in X\} = \Big\{ \sum_i c_i K_{\mathbf{x}_i}(\cdot) : \{\mathbf{x}_i\}_i \subset X \text{ is any finite subset}; \ c_i \in \mathbb{R} \Big\}.$$

Kernel existence

Now we define the inner product $\langle f, g \rangle$ for $f, g \in \mathcal{H}_0$. Assume
$$f(\cdot) = \sum_{i=1}^{\ell} a_i K_{\mathbf{x}_i}(\cdot), \qquad g(\cdot) = \sum_{i=1}^{\ell} b_i K_{\mathbf{x}_i}(\cdot).$$
[Note we may assume $f, g$ both use the same set $\{\mathbf{x}_i\}$, since if not we may take a union without loss.] [Note again that here $\langle \cdot, \cdot \rangle = \langle \cdot, \cdot \rangle_{\mathcal{H}}$.]

Kernel existence

Then define $\langle K_{\mathbf{x}}(\cdot), K_{\mathbf{y}}(\cdot) \rangle = K(\mathbf{x}, \mathbf{y})$, so that
$$\langle f(\cdot), g(\cdot) \rangle = \Big\langle \sum_{i=1}^{\ell} a_i K(\mathbf{x}_i, \cdot), \sum_{j=1}^{\ell} b_j K(\mathbf{x}_j, \cdot) \Big\rangle = \sum_{i,j=1}^{\ell} a_i b_j \langle K(\mathbf{x}_i, \cdot), K(\mathbf{x}_j, \cdot) \rangle = \sum_{i,j=1}^{\ell} a_i b_j K(\mathbf{x}_i, \mathbf{x}_j).$$
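In coordinates, this inner product is just a quadratic form in the coefficient vectors: $\langle f, g \rangle = \mathbf{a}^T K \mathbf{b}$ with $K_{ij} = K(\mathbf{x}_i, \mathbf{x}_j)$. A minimal sketch follows (added here; the Gaussian kernel is again used only as a convenient Mercer kernel).

```python
import numpy as np

# f = sum_i a_i K(x_i, .) and g = sum_j b_j K(x_j, .) built on the same points:
# <f, g>_H = sum_{i,j} a_i b_j K(x_i, x_j) = a^T K b, and ||f||_H = sqrt(a^T K a).
def gram(points, sigma=1.0):
    diffs = points[:, None, :] - points[None, :, :]
    return np.exp(-np.sum(diffs ** 2, axis=-1) / (2 * sigma ** 2))

points = np.array([[0.0], [0.5], [2.0]])   # x_1, x_2, x_3 in R
a = np.array([1.0, -2.0, 0.5])             # coefficients of f
b = np.array([0.3, 0.0, 1.0])              # coefficients of g

K = gram(points)
print(a @ K @ b)                           # <f, g>_H
print(np.sqrt(a @ K @ a))                  # ||f||_H (nonnegative since K is positive)
```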

Kernel existence

Easy to check that with the above inner product $\mathcal{H}_0$ is an inner product space (i.e., satisfies properties (a)-(d)). Now form the completion² of this space into the Hilbert space $\mathcal{H}$. Note that for $f = \sum_i a_i K_{\mathbf{x}_i}(\cdot)$ as above,
$$|f(\mathbf{x})| = |\langle f(\cdot), K(\mathbf{x}, \cdot) \rangle| \le \|f(\cdot)\| \, \|K(\mathbf{x}, \cdot)\| = \|f\| \sqrt{\langle K(\mathbf{x}, \cdot), K(\mathbf{x}, \cdot) \rangle} = \|f\| \sqrt{K(\mathbf{x}, \mathbf{x})} \le M_K \|f\|_{\mathcal{H}}.$$
[Note again that here we write $\|f\| = \|f\|_{\mathcal{H}}$ by definition; similarly $\langle f, g \rangle = \langle f, g \rangle_{\mathcal{H}}$.]

The above shows that the imbedding $I : \mathcal{H}_0 \to C(X)$ (the latter is the space of continuous functions on $X$) is bounded. By this we mean that $I$ maps the function $f$ as a function in $\mathcal{H}_0$ to itself as a function in $C(X)$; in $C(X)$ the norm of $f$ is $\|f\|_\infty = \sup_{\mathbf{x} \in X} |f(\mathbf{x})|$. By bounded we mean that $\|I f\|_\infty = \|f\|_\infty \le d \, \|f\|_{\mathcal{H}}$ for some constant $d > 0$.

² The completion of a non-complete inner product space $\mathcal{H}_0$ is the (unique) smallest complete inner product (Hilbert) space $\mathcal{H}$ which contains $\mathcal{H}_0$. That is, $\mathcal{H}_0 \subset \mathcal{H}$, the inner product on $\mathcal{H}_0$ is the same as on $\mathcal{H}$, and there is no smaller complete Hilbert space which contains $\mathcal{H}_0$.
Example 1: $\mathcal{H} = \ell^2 = \{ \mathbf{a} = (a_1, a_2, \ldots) : \sum_i |a_i|^2 < \infty \}$ with inner product $\langle \mathbf{a}, \mathbf{b} \rangle = \sum_{i=1}^\infty a_i b_i$ was discussed in class. The inner product space $\mathcal{H}_0 = \{ (a_1, a_2, \ldots) \in \mathcal{H} : \text{all but a finite number of the } a_i \text{ are } 0 \}$ is an example of an incomplete space. $\mathcal{H}$ is its completion.
Example 2: $\mathcal{H} = L^2(-\pi, \pi)$ with the standard inner product for functions. We know that if $f(x) \in \mathcal{H}$ then $f(x) = \frac{a_0}{2} + \sum_{k=1}^{\infty} (a_k \cos kx + b_k \sin kx)$. Define $\mathcal{H}_0$ to be all $f \in \mathcal{H}$ for which the above sum is finite (i.e., all but a finite number of terms are 0). Then $\mathcal{H}$ is the completion of $\mathcal{H}_0$.
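A quick numerical spot-check (added; not in the slides) of the pointwise bound just derived, $|f(\mathbf{x})| \le \|f\|_{\mathcal{H}} \sqrt{K(\mathbf{x}, \mathbf{x})}$, for an $f \in \mathcal{H}_0$ built from the Gaussian kernel:

```python
import numpy as np

def k(u, v, sigma=1.0):
    return np.exp(-(u - v) ** 2 / (2 * sigma ** 2))

centers = np.array([0.0, 0.7, 1.5])        # the x_i defining f = sum_i a_i K(x_i, .)
a = np.array([2.0, -1.0, 0.5])

K = np.array([[k(u, v) for v in centers] for u in centers])
norm_f = np.sqrt(a @ K @ a)                # ||f||_H = sqrt(a^T K a)

for x in np.linspace(-3.0, 3.0, 7):
    f_x = sum(a_i * k(x_i, x) for a_i, x_i in zip(a, centers))
    assert abs(f_x) <= norm_f * np.sqrt(k(x, x)) + 1e-12
print("pointwise bound holds at all test points")
```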

Kernel existence

Thus any Cauchy sequence in $\mathcal{H}_0$ is also Cauchy in $C(X)$, and so it follows easily that the completion $\mathcal{H}$ of $\mathcal{H}_0$ exists as a subset of $C(X)$. That $K$ is a reproducing kernel for $\mathcal{H}$ follows by approximation from $\mathcal{H}_0$.

Regularization methods

3. Regularization methods for choosing $f$

Finding $f$ from $Nf = \mathcal{D} = \{(\mathbf{x}_i, y_i)\}_{i=1}^n$ is an ill-posed problem: the inverse operator $N^{-1}$ does not exist, because $N$ is not one-to-one. We need to combine both:
(a) the data $Nf$;
(b) a priori information, e.g. "$f$ is smooth", expressing a preference for smooth over wiggly solutions as seen earlier.
How do we incorporate both? Using Tikhonov regularization methods.

Regularization methods

We introduce a regularization loss functional $L(f)$ representing a penalty for the choice of an "unrealistic" $f$, such as that in (a) above. Assume we want to find the correct function $f_0(\mathbf{x})$ from the data
$$N f_0 = ((\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_n, y_n)) = \mathcal{D}.$$

Regularization methods

Suppose we are given $f(\mathbf{x})$ as a candidate for approximating $f_0(\mathbf{x})$ from the information in $\mathcal{D}$. We score $f$ as a good or bad approximation based on a combination of its error on the known points $\{\mathbf{x}_i\}_{i=1}^n$, together with its "plausibility", i.e., how low the Lagrangian
$$\mathcal{L} = \frac{1}{n} \sum_{i=1}^n V(f(\mathbf{x}_i), y_i) + L(f)$$
is. Here $V(f(\mathbf{x}_i), y_i)$ is a measure of the loss whenever $f(\mathbf{x}_i)$ is far from $y_i$, e.g.
$$V(f(\mathbf{x}_i), y_i) = |f(\mathbf{x}_i) - y_i|^2.$$

Regularization methods

And $L(f)$ measures the a priori loss, i.e., it is a measure of the discrepancy between the prospective choice $f$ and our prior expectations about $f$.
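A minimal sketch (added; not in the slides) of scoring candidate functions this way: the square loss is used for $V$, and as a stand-in for $L(f)$ we use a hypothetical smoothness penalty, the mean squared second derivative on a grid, with a weight lam folded into the penalty term. A wiggly candidate scores much worse than a smooth one on the same data.

```python
import numpy as np

def score(f, xs, ys, grid, lam=0.1):
    data_term = np.mean((f(xs) - ys) ** 2)                 # (1/n) sum_i V(f(x_i), y_i)
    f_pp = np.gradient(np.gradient(f(grid), grid), grid)   # crude second derivative
    penalty = lam * np.mean(f_pp ** 2)                     # stand-in for L(f)
    return data_term + penalty

xs = np.array([0.0, 1.0, 2.0, 3.0])
ys = np.array([0.1, 0.9, 2.1, 2.9])
grid = np.linspace(-1.0, 4.0, 200)

print(score(lambda x: x, xs, ys, grid))                    # smooth candidate: small score
print(score(lambda x: x + np.sin(20 * x), xs, ys, grid))   # wiggly candidate: much larger score
```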

Examples: Regularization methods

Example:
$$L(f) = \|A f\|_{L^2}^2 = \int d\mathbf{x} \, |A f(\mathbf{x})|^2,$$
where $A f = -\Delta f + f$; here $\Delta f = \frac{\partial^2 f}{\partial x_1^2} + \cdots + \frac{\partial^2 f}{\partial x_p^2}$. Note that $\Delta f$, and thus $L(f)$, measures the degree of nonsmoothness that $f$ has (i.e., we prefer smoother functions a priori).

Examples: Regularization methods

Example 7: Consider the case $L(f) = \|A f\|_{L^2}^2$ above. The norm $\|f\|_{\mathcal{H}} = \|A f\|_{L^2}$ is a reproducing kernel Hilbert space norm (at least if the dimension is small). That is, it comes from an inner product $\langle f, g \rangle = \int_X (A f)(\mathbf{x}) (A g)(\mathbf{x}) \, d\mathbf{x}$, and with this inner product $\mathcal{H}$ is an RKHS. If this is the case, in general things become easier.

Examples: Regularization methods

Example 8: Consider the case $f = f(x)$, $x \in \mathbb{R}^1$. Suppose we choose
$$A f = f - \frac{d^2 f}{dx^2} = \Big(1 - \frac{d^2}{dx^2}\Big) f;$$
we have
$$L(f) = \|A f\|^2 = \int \Big( \Big(1 - \frac{d^2}{dx^2}\Big) f \Big)^2 dx,$$
and $\|A f\|^2$ is a measure of the "lack of smoothness" of $f$.
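A finite-difference sketch (added; not in the slides) of this penalty for a function sampled on a grid, approximating the integral by a Riemann sum; as expected, a highly oscillatory function is penalized far more than a slowly varying one.

```python
import numpy as np

def smoothness_penalty(f_values, x):
    """Approximate L(f) = int ((1 - d^2/dx^2) f)^2 dx for samples f_values on grid x."""
    f_pp = np.gradient(np.gradient(f_values, x), x)    # crude second derivative
    integrand = (f_values - f_pp) ** 2                  # ((1 - d^2/dx^2) f)^2
    return np.sum(integrand) * (x[1] - x[0])            # Riemann sum

x = np.linspace(0.0, 2 * np.pi, 2000)
print(smoothness_penalty(np.sin(x), x))        # (1 - d^2/dx^2) sin = 2 sin: penalty about 4*pi
print(smoothness_penalty(np.sin(10 * x), x))   # 101 * sin(10x): penalty larger by roughly 2500x
```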

Examples: Regularization methods

4. More about using the Laplacian to measure smoothness (Sobolev smoothness)

Basic definitions: Recall that the Laplacian operator $\Delta$ on a function $f$ on $\mathbb{R}^p$, $f(\mathbf{x}) = f(x_1, \ldots, x_p)$, is defined by
$$\Delta f = \frac{\partial^2 f}{\partial x_1^2} + \cdots + \frac{\partial^2 f}{\partial x_p^2}.$$

Using the Laplacian for kernels

For $s > 0$ an even integer, we can define the Sobolev space $H^s$ by
$$H^s = \{ f \in L^2(\mathbb{R}^p) : (1 - \Delta)^{s/2} f \in L^2(\mathbb{R}^p) \},$$
i.e., the functions in $L^2(\mathbb{R}^p)$ which are still in $L^2$ after applying the derivative operation $(1 - \Delta)^{s/2}$, i.e., $(I - \Delta)$ repeated $s/2$ times (the operator $1$ is always the identity). For $f, g \in H^s$ define the new inner product

Using the Laplacian for kernels

$$\langle f, g \rangle_{H^s} = \langle (1 - \Delta)^{s/2} f, (1 - \Delta)^{s/2} g \rangle_{L^2};$$
[note $\langle h(\mathbf{x}), k(\mathbf{x}) \rangle_{L^2} = \int_X h(\mathbf{x}) k(\mathbf{x}) \, d\mathbf{x}$]. One can show that $H^s$ is an RKHS with reproducing kernel
$$K(\mathbf{z}) = \mathcal{F}^{-1}\Big( \frac{1}{(|\boldsymbol{\omega}|^2 + 1)^s} \Big), \qquad (3)$$

Using the Laplacian for kernels

where $\mathcal{F}^{-1}$ denotes the inverse Fourier transform. The function $\frac{1}{(|\boldsymbol{\omega}|^2 + 1)^s}$ is a function of $\boldsymbol{\omega} = (\omega_1, \ldots, \omega_p)$, where $|\boldsymbol{\omega}|^2 = \omega_1^2 + \cdots + \omega_p^2$.
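A numerical sketch (added; not in the slides) of (3) in one dimension with $s = 1$, using the convention $\mathcal{F}^{-1}g(z) = \frac{1}{2\pi}\int g(\omega) e^{i\omega z}\, d\omega$ (conventions differ by constants): the inverse transform of $1/(\omega^2+1)$ is, analytically, $\tfrac{1}{2} e^{-|z|}$, and a simple Riemann-sum approximation of the integral reproduces this.

```python
import numpy as np

w = np.linspace(-400.0, 400.0, 800001)   # frequency grid (wide and fine)
dw = w[1] - w[0]
g = 1.0 / (w ** 2 + 1.0)                 # (|w|^2 + 1)^(-s) with s = 1

for z in [0.0, 0.5, 1.0, 2.0]:
    # inverse Fourier integral, approximated by a Riemann sum
    K_z = np.real(np.sum(g * np.exp(1j * w * z)) * dw) / (2 * np.pi)
    print(z, K_z, 0.5 * np.exp(-abs(z)))  # numerical kernel vs. closed form (agree to roughly 1e-3)
```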

Using the Laplacian for kernels

[Figure 1: $K(z)$ in one dimension - a smooth kernel.]

Using the Laplacian for kernels

$K(\mathbf{z})$ is called a radial basis function. Note: the kernel $K(\mathbf{x}, \mathbf{y})$ (as a function of 2 variables) is defined in terms of the above $K$ by $K(\mathbf{x}, \mathbf{y}) = K(\mathbf{x} - \mathbf{y})$.

Using the Laplacian for kernels

The Representer Theorem for RKHS

1. An application: using RKHS for regularization

Assume again that we have an unknown function $f$ on $X$, with only the data $Nf = ((\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_n, y_n)) = \mathcal{D}$. To find the best guess $\hat{f}$ for $f$, approximate it by the minimizer

RKHS and regularization

$$\hat{f} = \arg\min_{f \in H^s} \Big[ \frac{1}{n} \sum_{i=1}^n \|f(\mathbf{x}_i) - y_i\|^2 + \lambda \|f\|_{H^s}^2 \Big], \qquad (1)$$
where $\lambda$ is some constant. Note we are finding an $f$ which balances minimizing $\sum_{i=1}^n \|f(\mathbf{x}_i) - y_i\|^2$, i.e., the data error, with minimizing $\|f\|_{H^s}^2$, i.e., maximizing the smoothness. The solution to such a problem will look like this:

RKHS and regularization It will compromise between fitting the data (which may have error) and trying to be smooth.

RKHS and regularization

The amazing thing: $\hat{f}$ can be found explicitly using radial basis functions.

2. Solving the minimization

Now consider the optimization problem (1). We claim that we can solve it explicitly. To see how this works in general for an RKHS, return to the general problem. Given an unknown $f$ in an RKHS $\mathcal{H}$, try to find the "best" approximation $\hat{f}$ to $f$ fitting the data

RKHS and regularization

$Nf = ((\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_n, y_n))$, but ALSO satisfying the a priori knowledge that $\|f_0\|_{\mathcal{H}}$ is small. Specifically, we want to find
$$\hat{f} = \arg\min_{f \in \mathcal{H}} \Big[ \frac{1}{n} \sum_{i=1}^n V(f(\mathbf{x}_i), y_i) + \lambda \|f\|_{\mathcal{H}}^2 \Big]. \qquad (2)$$
Note we can have, e.g., $V(f(\mathbf{x}_i), y_i) = (f(\mathbf{x}_i) - y_i)^2$.

RKHS and regularization

In that case
$$\sum_{i=1}^n V(f(\mathbf{x}_i), y_i) = \sum_{i=1}^n (f(\mathbf{x}_i) - y_i)^2.$$
Consider the general case (2), with an arbitrary error measure $V$. We have the

RKHS and regularization

Representer Theorem: A solution of the Tikhonov optimization problem (2) can be written
$$\hat{f}(\mathbf{x}) = \sum_{i=1}^n a_i K(\mathbf{x}, \mathbf{x}_i), \qquad (3)$$
where $K$ is the reproducing kernel of the RKHS $\mathcal{H}$.

This is an important theorem: it says we only need to find a set of $n$ numbers $a_i$ to optimize the infinite-dimensional problem (2) above.
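For the square loss $V(f(\mathbf{x}_i), y_i) = (f(\mathbf{x}_i) - y_i)^2$, substituting the form (3) into (2) turns the problem into a finite linear system for the coefficients; a minimal sketch follows (added; the Gaussian kernel is used here only as one convenient choice of reproducing kernel, and the system $(K + n\lambda I)\mathbf{a} = \mathbf{y}$ is the standard kernel ridge regression solution for this loss).

```python
import numpy as np

def gaussian_gram(X, Z, sigma=1.0):
    d2 = np.sum((X[:, None, :] - Z[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2 * sigma ** 2))

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(30, 1))                  # sample points x_i
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(30)   # noisy observations y_i
lam, n = 0.1, len(X)

K = gaussian_gram(X, X)
a = np.linalg.solve(K + n * lam * np.eye(n), y)        # representer coefficients a_i

# The fitted function f_hat(x) = sum_i a_i K(x, x_i), evaluated at new points
X_new = np.linspace(-3, 3, 5)[:, None]
print(gaussian_gram(X_new, X) @ a)
```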

RKHS and regularization

Proof: Use the calculus of variations. If a minimizer $f_1$ exists, then for all $g \in \mathcal{H}$, assuming that the derivatives with respect to $\varepsilon$ exist:

Representer theorem proof

$$0 = \frac{d}{d\varepsilon} \Bigg[ \frac{1}{n} \sum_{i=1}^n V\big((f_1 + \varepsilon g)(\mathbf{x}_i), y_i\big) + \lambda \|f_1 + \varepsilon g\|_{\mathcal{H}}^2 \Bigg] \Bigg|_{\varepsilon = 0}$$
$$= \frac{1}{n} \sum_{i=1}^n \frac{\partial V(f_1(\mathbf{x}_i), y_i)}{\partial f_1(\mathbf{x}_i)} \, g(\mathbf{x}_i) + \lambda \frac{d}{d\varepsilon} \Big[ \langle f_1, f_1 \rangle + 2\varepsilon \langle f_1, g \rangle + \varepsilon^2 \langle g, g \rangle \Big] \Bigg|_{\varepsilon = 0}$$

Representer theorem proof

$$= \frac{1}{n} \sum_{i=1}^n V_1(f_1(\mathbf{x}_i), y_i) \, g(\mathbf{x}_i) + 2\lambda \langle f_1, g \rangle,$$
where $V_1(a, b) = \frac{\partial}{\partial a} V(a, b)$ and all inner products are in $\mathcal{H}$. Since the above is true for all $g \in \mathcal{H}$, it follows that if we let $g = K_{\mathbf{x}}$ we get:

Representer theorem proof

$$0 = \frac{1}{n} \sum_{i=1}^n V_1(f_1(\mathbf{x}_i), y_i) K_{\mathbf{x}}(\mathbf{x}_i) + 2\lambda \langle f_1, K_{\mathbf{x}} \rangle = \frac{1}{n} \sum_{i=1}^n V_1(f_1(\mathbf{x}_i), y_i) K_{\mathbf{x}}(\mathbf{x}_i) + 2\lambda f_1(\mathbf{x}),$$
or
$$f_1(\mathbf{x}) = -\frac{1}{2\lambda n} \sum_{i=1}^n V_1(f_1(\mathbf{x}_i), y_i) K(\mathbf{x}, \mathbf{x}_i).$$

Representer theorem proof

Thus if a minimizer $\hat{f} = f_1$ exists for (2), it can be written in the form (3) as claimed, with
$$a_i = -\frac{1}{2\lambda n} V_1(f_1(\mathbf{x}_i), y_i).$$
Note that this does not by itself solve the problem, since the $a_i$ are expressed in terms of the solution itself. But it does reduce the possibilities for what a solution looks like.
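A worked special case (added for illustration): for the square loss $V(a, b) = (a - b)^2$ we have $V_1(a, b) = 2(a - b)$, so the formula above gives
$$a_i = -\frac{1}{2\lambda n} \cdot 2\big(f_1(\mathbf{x}_i) - y_i\big) = \frac{y_i - f_1(\mathbf{x}_i)}{\lambda n}.$$
Writing $f_1(\mathbf{x}_i) = \sum_j a_j K(\mathbf{x}_i, \mathbf{x}_j)$ and rearranging gives $\lambda n \, a_i + \sum_j K(\mathbf{x}_i, \mathbf{x}_j) a_j = y_i$, i.e., the linear system $(K + \lambda n I)\mathbf{a} = \mathbf{y}$ used in the earlier code sketch; in this case the representer form does determine the $a_i$ explicitly.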