Kernel Methods and SVMs

Statistical Machine Learning, Notes 7
Instructor: Justin Domke

Contents
1  Introduction
2  Kernel Ridge Regression
3  The Kernel Trick
4  Support Vector Machines
5  Examples
6  Kernel Theory
   6.1  Kernel algebra
   6.2  Understanding Polynomial Kernels via Kernel Algebra
   6.3  Mercer's Theorem
7  Our Story so Far
8  Discussion
   8.1  SVMs as Template Methods
   8.2  Theoretical Issues

1 Introduction

Support Vector Machines (SVMs) are a very successful and popular set of techniques for classification. Historically, SVMs emerged after the neural network boom of the 80s and early 90s. People were surprised to see that SVMs with little to no tweaking could compete with neural networks involving a great deal of manual engineering. It remains true today that SVMs are among the best off-the-shelf classification methods: if you want to get good results with a minimum of messing around, SVMs are a very good choice.

Unlike the other classification methods we discuss, it is not convenient to begin with a concise definition of SVMs, or even to say what exactly a support vector is. There is a set of ideas that must be understood first. Most of these you have already seen in the notes on linear methods, basis expansions, and template methods. The biggest remaining concept is known as the kernel trick. In fact, this idea is so fundamental that many people have advocated that SVMs be renamed "kernel machines".

It is worth mentioning that the standard presentation of SVMs is based on the concept of margin. For lack of time, this perspective on SVMs will not be presented here. If you will be working seriously with SVMs, you should familiarize yourself with the margin perspective to enjoy a full understanding.

(Warning: these notes are probably the most technically challenging in this course, particularly if you don't have a strong background in linear algebra, Lagrange multipliers, and optimization. Kernel methods simply use more mathematical machinery than most of the other techniques we cover, so you should be prepared to put in some extra effort. Enjoy!)

2 Kernel Ridge Regression

We begin by not talking about SVMs, or even about classification. Instead, we revisit ridge regression, with a slight change of notation. Let the set of inputs be $\{(x_i, y_i)\}$, where $i$ indexes the samples. The problem is to minimize

$$\sum_i (x_i \cdot w - y_i)^2 + \lambda \, w \cdot w.$$

If we take the derivative with respect to $w$ and set it to zero, we get

$$0 = \sum_i 2 x_i (x_i^T w - y_i) + 2\lambda w
\quad\Longrightarrow\quad
w = \Big( \sum_i x_i x_i^T + \lambda I \Big)^{-1} \sum_i x_i y_i.$$

Now, let's consider a different derivation, making use of some Lagrange duality. If we introduce a new variable $z_i$, and constrain it to be the difference between $w \cdot x_i$ and $y_i$, we have

$$\min_{w,z} \; \frac{1}{2} \sum_i z_i^2 + \frac{1}{2} \lambda \, w \cdot w \qquad (2.1)$$
$$\text{s.t.} \quad z_i = x_i \cdot w - y_i.$$

Using $\alpha_i$ to denote the Lagrange multipliers, this has the Lagrangian

$$L = \frac{1}{2} \sum_i z_i^2 + \frac{1}{2} \lambda \, w \cdot w + \sum_i \alpha_i (x_i \cdot w - y_i - z_i).$$

Recall our foray into Lagrange duality. We can solve the original problem by doing

$$\max_\alpha \min_{w,z} L(w, z, \alpha).$$

To begin, we attack the inner minimization: for a fixed $\alpha$, we would like to solve for the minimizing $w$ and $z$. We can do this by setting the derivatives of $L$ with respect to $z$ and $w$ to zero.¹ Doing this, we find

$$z_i = \alpha_i, \qquad w = -\frac{1}{\lambda} \sum_i \alpha_i x_i. \qquad (2.2)$$

So we can solve the problem by maximizing the Lagrangian (with respect to $\alpha$), where we substitute the above expressions for $z$ and $w$. Thus, we have an unconstrained maximization

$$\max_\alpha L(w^*(\alpha), z^*(\alpha), \alpha).$$

¹ $0 = dL/dz_i = z_i - \alpha_i$ and $0 = dL/dw = \lambda w + \sum_i \alpha_i x_i$.

Before diving into the details of that, we can already notice something very interesting happening here: $w$ is given by a sum of the input vectors $x_i$, weighted by $-\alpha_i/\lambda$. If we were so inclined, we could avoid explicitly computing $w$, and predict a new point $x$ directly from the data as

$$f(x) = x \cdot w = -\frac{1}{\lambda} \sum_i \alpha_i \, x \cdot x_i.$$

Now, let $k(x, x_i) = x \cdot x_i$ be the kernel function. For now, just think of this as a change of notation. Using this, we can again write the ridge regression predictions as

$$f(x) = -\frac{1}{\lambda} \sum_i \alpha_i \, k(x, x_i).$$

Thus, all we really need is the inner product of $x$ with each of the training elements $x_i$. We will return to why this might be useful later. First, let's return to doing the maximization over the Lagrange multipliers $\alpha$, to see if anything similar happens there. The math below looks really complicated. However, all we are doing is substituting the expressions for $z$ and $w$ from Eq. 2.2, then doing a lot of manipulation.

$$\max_\alpha \min_{w,z} L
= \max_\alpha \; \frac{1}{2}\sum_i \alpha_i^2 + \frac{\lambda}{2} \Big\| \frac{1}{\lambda}\sum_i \alpha_i x_i \Big\|^2 + \sum_i \alpha_i \Big( -\frac{1}{\lambda} \sum_j \alpha_j \, x_i \cdot x_j - y_i - \alpha_i \Big)$$
$$= \max_\alpha \; \frac{1}{2}\sum_i \alpha_i^2 + \frac{1}{2\lambda} \sum_{ij} \alpha_i \alpha_j \, x_i \cdot x_j - \frac{1}{\lambda} \sum_{ij} \alpha_i \alpha_j \, x_i \cdot x_j - \sum_i \alpha_i (y_i + \alpha_i)$$
$$= \max_\alpha \; -\frac{1}{2}\sum_i \alpha_i^2 - \frac{1}{2\lambda} \sum_{ij} \alpha_i \alpha_j \, x_i \cdot x_j - \sum_i \alpha_i y_i$$
$$= \max_\alpha \; -\frac{1}{2}\sum_i \alpha_i^2 - \frac{1}{2\lambda} \sum_{ij} \alpha_i \alpha_j \, k(x_i, x_j) - \sum_i \alpha_i y_i.$$

Again, we only need inner products. If we define the matrix $K$ by $K_{ij} = k(x_i, x_j)$, then we can rewrite this in a punchier vector notation as

$$\max_\alpha \min_{w,z} L = \max_\alpha \; -\frac{1}{2} \alpha \cdot \alpha - \frac{1}{2\lambda} \alpha^T K \alpha - \alpha \cdot y.$$

Here, we use a capital $K$ to denote the matrix with entries $K_{ij}$ and a lowercase $k$ to denote the kernel function $k(\cdot,\cdot)$. Note that most literature on kernel machines mildly abuses notation by using the capital letter $K$ for both.

The thing on the right is just a quadratic in $\alpha$. As such, we can find the optimum as the solution of a linear system.² What is important is the observation that, again, we only need the inner products of the data $k(x_i, x_j) = x_i \cdot x_j$ to do the optimization over $\alpha$. Then, once we have solved for $\alpha$, we can predict $f(x)$ for a new $x$ again using only inner products. If someone tells us all the inner products, we don't need the original data $\{x_i\}$ at all!

² It is easy to show (by taking the gradient) that the optimum is at $\alpha = -(\frac{1}{\lambda} K + I)^{-1} y$.
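To make the recipe concrete, here is a minimal NumPy sketch of kernel ridge regression in this dual form: solve the linear system from the footnote, then predict with $f(x) = -\frac{1}{\lambda}\sum_i \alpha_i k(x, x_i)$. The function and variable names are our own, not from any particular library.

```python
import numpy as np

def linear_kernel(A, B):
    """k(x, v) = x . v, computed for all pairs of rows of A and B."""
    return A @ B.T

def fit_kernel_ridge(X, y, lam, kernel=linear_kernel):
    """Dual solution from the footnote: alpha = -(K/lam + I)^{-1} y."""
    K = kernel(X, X)
    return -np.linalg.solve(K / lam + np.eye(len(y)), y)

def predict_kernel_ridge(X_train, alpha, lam, X_new, kernel=linear_kernel):
    """f(x) = -(1/lam) * sum_i alpha_i k(x, x_i)."""
    return -(kernel(X_new, X_train) @ alpha) / lam

# Tiny usage example on random data.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=20)
alpha = fit_kernel_ridge(X, y, lam=1.0)
print(predict_kernel_ridge(X, alpha, 1.0, X[:3]))
```

With the linear kernel this reproduces ordinary ridge regression exactly; swapping in a different kernel function changes nothing else in the code.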

3 The Kernel Trick

So we can work completely with inner products, rather than the vectors themselves. So what?

One way of looking at things is that we can implicitly use basis expansions. If we want to take $x_i$, and transform it into some fancy feature space $\phi(x_i)$, we can replace the kernel function by

$$K_{ij} = k(x_i, x_j) = \phi(x_i) \cdot \phi(x_j).$$

The point of talking about this is that for certain basis expansions, we can compute $k$ very cheaply without ever explicitly forming $\phi(x_i)$ or $\phi(x_j)$. This can mean a huge computational savings. A nice example of this is the kernel function

$$k(x, v) = (x \cdot v)^2.$$

We can see that

$$k(x, v) = \Big( \sum_i x_i v_i \Big)^2 = \Big( \sum_i x_i v_i \Big)\Big( \sum_j x_j v_j \Big) = \sum_{ij} x_i x_j \, v_i v_j.$$

It is not hard to see that $k(x, v) = \phi(x) \cdot \phi(v)$, where $\phi$ is a quadratic basis expansion with one feature $\phi_m(x) = x_i x_j$ for each pair $(i, j)$. For example, in two dimensions,

$$k(x, v) = (x_1 v_1 + x_2 v_2)^2 = x_1 x_1 v_1 v_1 + x_1 x_2 v_1 v_2 + x_2 x_1 v_2 v_1 + x_2 x_2 v_2 v_2,$$

while the basis expansions are

$$\phi(x) = (x_1 x_1, \; x_1 x_2, \; x_2 x_1, \; x_2 x_2), \qquad \phi(v) = (v_1 v_1, \; v_1 v_2, \; v_2 v_1, \; v_2 v_2).$$

It is not hard to work out that $k(x, v) = \phi(x) \cdot \phi(v)$. However, notice that we can compute $k(x, v)$ in time $O(d)$, rather than the $O(d^2)$ time it would take to explicitly compute $\phi(x) \cdot \phi(v)$.

This is the "kernel trick": getting around the computational expense of computing large basis expansions by directly computing kernel functions. Notice, however, that the kernel trick changes nothing, nada, zero about the statistical issues with huge basis expansions. We get exactly the same predictions as if we computed the basis expansion explicitly and used traditional linear methods. We just compute the predictions in a different way.

In fact, we can invent a new kernel function $k(x, v)$ and, as long as it obeys certain rules, use it in the above algorithm without explicitly thinking about basis expansions at all. Some common examples are:

    name                     k(x, v)
    Linear                   $x \cdot v$
    Polynomial               $(r + x \cdot v)^d$, for some $r, d > 0$
    Radial Basis Function    $\exp(-\gamma \|x - v\|^2)$, $\gamma > 0$
    Gaussian                 $\exp(-\|x - v\|^2 / (2\sigma^2))$
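As a quick sanity check of the $O(d)$ vs. $O(d^2)$ claim, here is a small NumPy illustration of our own (not from the notes) that confirms $(x \cdot v)^2 = \phi(x) \cdot \phi(v)$ for the quadratic feature map above, once by materializing the features and once via the kernel.

```python
import numpy as np

def quadratic_features(x):
    """Explicit feature map phi_m(x) = x_i * x_j for all pairs (i, j): O(d^2) values."""
    return np.outer(x, x).ravel()

def quadratic_kernel(x, v):
    """k(x, v) = (x . v)^2, computed in O(d) time."""
    return (x @ v) ** 2

rng = np.random.default_rng(0)
x = rng.normal(size=500)
v = rng.normal(size=500)

explicit = quadratic_features(x) @ quadratic_features(v)   # forms 250,000 features
implicit = quadratic_kernel(x, v)                           # never forms them
print(np.allclose(explicit, implicit))                      # True: same value, far less work
```

The same identity is what lets the dual algorithms above run without ever materializing the expanded features.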

We will return below to the question of what kernel functions are "legal", meaning there is some feature space $\phi$ such that $k(x, v) = \phi(x) \cdot \phi(v)$.

Now, what exactly was it about ridge regression that let us get away with working entirely with inner products? How much could we change the problem and preserve this? We really need two things to happen:

1. When we take $dL/dw = 0$, we need to be able to solve for $w$, and the solution needs to be a linear combination of the input vectors $x_i$.

2. When we substitute this solution back into the Lagrangian, we need to get an expression that simplifies down into inner products only.

Notice that this leaves us a great deal of flexibility. For example, we could replace the least-squares criterion $\sum_i z_i^2$ with an alternative (convex) measure. We could also change the way in which we measure errors, from $z_i = w \cdot x_i - y_i$ to something else (although with some restrictions).

4 Support Vector Machines

Now, we turn to the binary classification problem. Support Vector Machines result from minimizing the hinge loss $(1 - y_i \, w \cdot x_i)_+$ with ridge regularization:

$$\min_w \; \sum_i (1 - y_i \, w \cdot x_i)_+ + \lambda \|w\|^2.$$

This is equivalent to (for $c = 1/(2\lambda)$)

$$\min_w \; c \sum_i (1 - y_i \, w \cdot x_i)_+ + \frac{1}{2} \|w\|^2.$$

Because the hinge loss is non-differentiable, we introduce new variables $z_i$, creating a constrained optimization

$$\min_{z,w} \; c \sum_i z_i + \frac{1}{2} \|w\|^2 \qquad (4.1)$$
$$\text{s.t.} \quad z_i \ge 1 - y_i \, w \cdot x_i, \quad z_i \ge 0.$$

Introducing new constraints to simplify an objective like this seems strange at first, but isn't too hard to understand. Notice the constraints are exactly equivalent to forcing that $z_i \ge (1 - y_i \, w \cdot x_i)_+$. But since we are minimizing the sum of all the $z_i$, the optimization will make each one as small as possible, and so $z_i$ will be the hinge loss for example $i$, no more, no less.

Introducing Lagrange multipliers $\alpha_i \ge 0$ to enforce that $z_i \ge 1 - y_i \, w \cdot x_i$ and $\mu_i \ge 0$ to enforce that $z_i \ge 0$, we get the Lagrangian

$$L = c \sum_i z_i + \frac{1}{2}\|w\|^2 + \sum_i \alpha_i (1 - y_i \, w \cdot x_i - z_i) + \sum_i \mu_i (-z_i).$$

A bunch of manipulation changes this to

$$L = \sum_i z_i (c - \mu_i - \alpha_i) + \frac{1}{2}\|w\|^2 + \sum_i \alpha_i - w \cdot \sum_i \alpha_i y_i x_i.$$

As ever, Lagrangian duality states that we can solve our original problem by doing

$$\max_{\alpha, \mu \ge 0} \; \min_{z, w} \; L.$$

For now, we work on the inner minimization. For a particular $\alpha$ and $\mu$, we want to minimize with respect to $z$ and $w$. By setting $dL/dw = 0$, we find that $w = \sum_i \alpha_i y_i x_i$. Meanwhile, setting $dL/dz_i = 0$ gives that $\alpha_i = c - \mu_i$. If we substitute these expressions, we find that $\mu$ disappears. However, notice that since $\mu_i \ge 0$ we must have that $\alpha_i \le c$.

$$\max_{\alpha,\mu} \min_{z,w} L
= \max_{0 \le \alpha_i \le c} \; \sum_i \alpha_i + \frac{1}{2} \sum_{ij} \alpha_i \alpha_j y_i y_j \, x_i \cdot x_j - \sum_{ij} \alpha_i \alpha_j y_i y_j \, x_i \cdot x_j$$
$$= \max_{0 \le \alpha_i \le c} \; \sum_i \alpha_i - \frac{1}{2} \sum_{ij} \alpha_i \alpha_j y_i y_j \, x_i \cdot x_j.$$

This is a maximization of a quadratic objective under linear constraints. That is, this is a quadratic program. Historically, QP solvers were first used to solve SVM problems. However, as these scale poorly to large problems, a huge amount of effort has been devoted to faster solvers (often based on coordinate ascent and/or online optimization). This area is still evolving. However, software is now widely available for solvers that are quite fast in practice.

Now, as we saw above that $w = \sum_i \alpha_i y_i x_i$, we can classify new points $x$ by

$$f(x) = \sum_i \alpha_i y_i \, x_i \cdot x.$$

Clearly, this can be kernelized. If we do so, we can compute the Lagrange multipliers by the SVM optimization

$$\max_{0 \le \alpha_i \le c} \; \sum_i \alpha_i - \frac{1}{2} \sum_{ij} \alpha_i \alpha_j y_i y_j \, k(x_i, x_j), \qquad (4.2)$$

which is again a quadratic program. We can classify new points by the SVM classification rule

$$f(x) = \sum_i \alpha_i y_i \, k(x, x_i). \qquad (4.3)$$

Since we have kernelized both the learning optimization and the classification rule, we are again free to replace $k$ with any of the variety of kernel functions we saw before.
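As an illustration of how simple a (slow but correct) solver for Eq. 4.2 can be, here is a NumPy sketch of our own that maximizes the dual by projected gradient ascent onto the box $0 \le \alpha_i \le c$, then classifies with Eq. 4.3. The kernel, learning rate, iteration count, and toy data are arbitrary illustrative choices; real SVM packages use much faster specialized solvers.

```python
import numpy as np

def rbf_kernel(A, B, gamma=2.0):
    """k(x, v) = exp(-gamma * ||x - v||^2) for all pairs of rows."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def fit_svm_dual(X, y, c, kernel, steps=5000, lr=1e-3):
    """Projected gradient ascent on Eq. 4.2: max sum(a) - 0.5 a'Ma  s.t. 0 <= a <= c."""
    M = (y[:, None] * y[None, :]) * kernel(X, X)   # M = diag(y) K diag(y)
    alpha = np.zeros(len(y))
    for _ in range(steps):
        grad = 1.0 - M @ alpha                     # gradient of the dual objective
        alpha = np.clip(alpha + lr * grad, 0.0, c) # ascent step + projection to the box
    return alpha

def svm_decision(X_train, y_train, alpha, X_new, kernel):
    """Eq. 4.3: f(x) = sum_i alpha_i y_i k(x, x_i)."""
    return kernel(X_new, X_train) @ (alpha * y_train)

# Toy usage: two Gaussian blobs with labels in {-1, +1}.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 0.5, (20, 2)), rng.normal(+1, 0.5, (20, 2))])
y = np.array([-1] * 20 + [+1] * 20)
alpha = fit_svm_dual(X, y, c=10.0, kernel=rbf_kernel)
pred = np.sign(svm_decision(X, y, alpha, X, kernel=rbf_kernel))
print("training accuracy:", (pred == y).mean())
print("nonzero multipliers:", int((alpha > 1e-6).sum()), "of", len(y))
```

Even on a toy problem like this you can see the effect discussed next: many of the $\alpha_i$ end up exactly at zero.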

Now, finally, we can define what a support vector is: the support vectors are exactly those training points $x_i$ with $\alpha_i \neq 0$. Notice that Eq. 4.2 is the maximization of a quadratic function of $\alpha$, under the box constraints that $0 \le \alpha_i \le c$. It often happens that $\alpha_i$ "wants" to be negative (in terms of the quadratic function), but is prevented from this by the constraints. Thus, $\alpha$ is often sparse.

This has some interesting consequences. First of all, clearly if $\alpha_i = 0$, we don't need to include the corresponding term in Eq. 4.3. This is potentially a big savings. If all $\alpha_i$ are nonzero, then we would need to explicitly compute the kernel function with all inputs, and our time complexity is similar to a nearest-neighbor method. If we only have a few nonzero $\alpha_i$, then we only have to compute a few kernel functions, and our complexity is similar to that of a normal linear method.

Another interesting property of the sparsity of $\alpha$ is that non-support vectors don't affect the solution. Let's see why. What does it mean if $\alpha_i = 0$? Well, recall that the multiplier $\alpha_i$ is enforcing the constraint that

$$z_i \ge 1 - y_i \, w \cdot x_i. \qquad (4.4)$$

If $\alpha_i = 0$ at the solution, then this means, informally speaking, that we didn't really need to enforce this constraint at all: if we threw it out of the optimization, it would still automatically be obeyed. How could this be? Recall that the original optimization in Eq. 4.1 is trying to minimize all the $z_i$. There are two things stopping $z_i$ from flying down to $-\infty$: the constraint in Eq. 4.4 above, and the constraint that $z_i \ge 0$. If the constraint above can be removed without changing the solution, then it must be that $z_i = 0$. Thus, $\alpha_i = 0$ implies that $1 - y_i \, w \cdot x_i \le 0$, or, equivalently, that $y_i \, w \cdot x_i \ge 1$. Thus non-support vectors are points that are very well classified, that are comfortably on the right side of the linear boundary.

Now, imagine we take some $x_i$ with $z_i = 0$, and remove it from the training set. It is pretty easy to see that this is equivalent to taking the optimization

$$\min_{z,w} \; c \sum_j z_j + \frac{1}{2}\|w\|^2$$
$$\text{s.t.} \quad z_j \ge 1 - y_j \, w \cdot x_j, \quad z_j \ge 0,$$

and just dropping the constraint that $z_i \ge 1 - y_i \, w \cdot x_i$, meaning that $z_i$ decouples from the other variables, and the optimization will pick $z_i = 0$. But, as we saw above, this has no effect. Thus, removing a non-support vector from the training set has no impact on the resulting classification rule.

5 Examples

In class, we saw some examples of running SVMs. Here are many more.
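The datasets A, B, and C used in the figures below are not distributed with these notes. As a rough stand-in, here is a hypothetical sketch of how one might run experiments in the same spirit with scikit-learn. Note that sklearn's SVC solves a closely related dual that also includes a bias term (and the accompanying equality constraint), so its solutions will not match the bias-free formulation above exactly; the dataset and parameter values are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
y = 2 * y - 1  # relabel to {-1, +1}

for kernel_args in [dict(kernel="linear"),
                    dict(kernel="poly", degree=5, coef0=1.0),
                    dict(kernel="rbf", gamma=2.0)]:
    clf = SVC(C=1e3, **kernel_args).fit(X, y)
    print(kernel_args,
          "support vectors:", clf.support_.shape[0],
          "training accuracy:", clf.score(X, y))
```

Inspecting `clf.support_` and `clf.dual_coef_` shows the same sparsity pattern in the multipliers that the plots below illustrate.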

[Figures: for each experiment, the left panel shows the learned predictions and the right panel shows the Lagrange multipliers $\alpha_i$ plotted against sorted indices.
- Dataset A with the linear kernel $k(x,v) = x \cdot v$, at three increasing values of $c$.
- Dataset A with the affine kernel $k(x,v) = 1 + x \cdot v$, at the same three values of $c$.
- Dataset B with the kernels $1 + x \cdot v$, $(1 + x \cdot v)^5$, and $(1 + x \cdot v)^{10}$.
- Dataset C (Dataset B with noise) with the kernels $1 + x \cdot v$, $(1 + x \cdot v)^5$, and $(1 + x \cdot v)^{10}$.
- Dataset C with the RBF kernel $\exp(-\gamma \|x - v\|^2)$.]

6 Kernel Theory

We now return to the issue of what makes a valid kernel $k(x, v)$, where "valid" means there exists some feature space $\phi$ such that $k(x, v) = \phi(x) \cdot \phi(v)$.

6.1 Kernel algebra

We can construct complex kernel functions from simple ones, using an algebra of composition rules.³ Interestingly, these rules can be understood from parallel compositions in feature space.

To take an example, suppose we have two valid kernel functions $k_a$ and $k_b$. If we define a new kernel function by

$$k(x, v) = k_a(x, v) + k_b(x, v),$$

$k$ will be valid. To see why, consider the feature spaces $\phi_a$ and $\phi_b$ corresponding to $k_a$ and $k_b$. If we define $\phi$ by just concatenating $\phi_a$ and $\phi_b$,

$$\phi(x) = (\phi_a(x), \phi_b(x)),$$

then $\phi$ is the feature space corresponding to $k$. To see this, note

$$\phi(x) \cdot \phi(v) = (\phi_a(x), \phi_b(x)) \cdot (\phi_a(v), \phi_b(v)) = \phi_a(x) \cdot \phi_a(v) + \phi_b(x) \cdot \phi_b(v) = k_a(x, v) + k_b(x, v) = k(x, v).$$

We can make a table of kernel composition rules, along with the dual feature space composition rules.

        kernel composition                                   feature composition
    a)  $k(x,v) = k_a(x,v) + k_b(x,v)$                       $\phi(x) = (\phi_a(x), \phi_b(x))$
    b)  $k(x,v) = f \, k_a(x,v)$, $f > 0$                    $\phi(x) = \sqrt{f} \, \phi_a(x)$
    c)  $k(x,v) = k_a(x,v) \, k_b(x,v)$                      $\phi_m(x) = \phi_{ai}(x) \, \phi_{bj}(x)$ (all pairs $(i,j)$)
    d)  $k(x,v) = x^T A v$, $A$ positive semi-definite       $\phi(x) = L^T x$, where $A = LL^T$
    e)  $k(x,v) = x^T M^T M v$, $M$ arbitrary                $\phi(x) = Mx$

³ This material is based on class notes from Michael Jordan.
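To preview rule (d) from the table (proved below), here is a small numerical check of our own: build a positive semi-definite $A$, factor it with a Cholesky decomposition $A = LL^T$, and verify that $x^T A v$ equals the inner product of the explicit features $\phi(x) = L^T x$.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
M = rng.normal(size=(d, d))
A = M @ M.T + 1e-9 * np.eye(d)   # positive semi-definite (tiny jitter keeps Cholesky happy)
L = np.linalg.cholesky(A)        # A = L @ L.T, with L lower-triangular

x = rng.normal(size=d)
v = rng.normal(size=d)

phi_x = L.T @ x                  # feature map phi(x) = L^T x
phi_v = L.T @ v
print(np.allclose(x @ A @ v, phi_x @ phi_v))   # True: k(x, v) = x^T A v = phi(x) . phi(v)
```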

We have already proven rule (a). Let's prove some of the others. Rule (b) is quite easy to understand:

$$\phi(x) \cdot \phi(v) = \sqrt{f}\,\phi_a(x) \cdot \sqrt{f}\,\phi_a(v) = f \, \phi_a(x) \cdot \phi_a(v) = f \, k_a(x, v) = k(x, v).$$

Rule (c) is more complex. It is important to understand the notation. If

$$\phi_a(x) = (\phi_{a1}(x), \phi_{a2}(x), \phi_{a3}(x)), \qquad \phi_b(x) = (\phi_{b1}(x), \phi_{b2}(x)),$$

then $\phi$ contains all six pairs:

$$\phi(x) = \big( \phi_{a1}(x)\phi_{b1}(x), \; \phi_{a2}(x)\phi_{b1}(x), \; \phi_{a3}(x)\phi_{b1}(x), \; \phi_{a1}(x)\phi_{b2}(x), \; \phi_{a2}(x)\phi_{b2}(x), \; \phi_{a3}(x)\phi_{b2}(x) \big).$$

With that understanding, we can prove rule (c) via

$$\phi(x) \cdot \phi(v) = \sum_m \phi_m(x) \, \phi_m(v) = \sum_{ij} \phi_{ai}(x)\phi_{bj}(x) \, \phi_{ai}(v)\phi_{bj}(v) = \Big( \sum_i \phi_{ai}(x)\phi_{ai}(v) \Big)\Big( \sum_j \phi_{bj}(x)\phi_{bj}(v) \Big) = \big( \phi_a(x) \cdot \phi_a(v) \big)\big( \phi_b(x) \cdot \phi_b(v) \big) = k_a(x, v) \, k_b(x, v) = k(x, v).$$

Rule (d) follows from the well-known result in linear algebra that a symmetric positive semi-definite matrix $A$ can be factored as $A = LL^T$. With that known, clearly

$$\phi(x) \cdot \phi(v) = (L^T x) \cdot (L^T v) = x^T L L^T v = x^T A v = k(x, v).$$

We can alternatively think of rule (d) as saying that $k(x, v) = x^T M^T M v$ corresponds to the basis expansion $\phi(x) = Mx$ for any $M$. That gives rule (e).

6.2 Understanding Polynomial Kernels via Kernel Algebra

So, we have all these rules for combining kernels. What do they tell us? Rules (a), (b), and (c) essentially tell us that polynomial combinations of valid kernels are valid kernels. Using this, we can understand the meaning of polynomial kernels.

First off, for a scalar variable $x$, consider a polynomial kernel of the form $k(x, v) = (xv)^d$. To what basis expansion does this kernel correspond? We can build this up stage by stage:

$$k(x, v) = xv \qquad\quad \phi(x) = (x)$$
$$k(x, v) = (xv)^2 \qquad \phi(x) = (x^2) \quad \text{by rule (c)}$$
$$k(x, v) = (xv)^3 \qquad \phi(x) = (x^3) \quad \text{by rule (c)}.$$

If we work with vectors, we find that $k(x, v) = (x \cdot v)$ corresponds to $\phi(x) = x$, while (by rule (c)) $k(x, v) = (x \cdot v)^2$ corresponds to a feature space with all pairwise terms

$$\phi_m(x) = x_i x_j, \quad 1 \le i, j \le n.$$

Similarly, $k(x, v) = (x \cdot v)^3$ corresponds to a feature space with all triplets

$$\phi_m(x) = x_i x_j x_k, \quad 1 \le i, j, k \le n.$$

More generally, $k(x, v) = (x \cdot v)^d$ corresponds to a feature space with terms

$$\phi_m(x) = x_{i_1} x_{i_2} \cdots x_{i_d}, \quad 1 \le i_1, \dots, i_d \le n. \qquad (6.1)$$

Thus, a polynomial kernel is equivalent to a polynomial basis expansion, with all terms of order $d$. This is pretty surprising, even though the word "polynomial" is in front of both of these terms!

Again, we should reiterate the computational savings here. In general, computing a polynomial basis expansion will take time $O(n^d)$. However, computing a polynomial kernel only takes time $O(n)$. Again, though, we have only defeated the computational issue with high-degree polynomial basis expansions. The statistical properties are unchanged.

Now, consider the kernel $k(x, v) = (r + x \cdot v)^d$. What is the impact of adding the constant $r$? Notice that this is equivalent to simply taking the vectors $x$ and $v$ and prepending a constant of $\sqrt{r}$ to them. Thus, this kernel corresponds to a polynomial expansion with constant terms added. One way to write this would be

$$\phi_m(x) = x_{i_1} x_{i_2} \cdots x_{i_d}, \quad 0 \le i_1, \dots, i_d \le n, \qquad (6.2)$$

where we consider $x_0$ to be equal to $\sqrt{r}$. Thus, this kernel is equivalent to a polynomial basis expansion with all terms of order less than or equal to $d$.

An interesting question is the impact of the constant $r$. Should we set it large or small? What is the impact of this choice? Notice that the lower-order terms in the basis expansion in Eq. 6.2 will contain many factors of $x_0$, and so get multiplied by $\sqrt{r}$ to a high power. Meanwhile, high-order terms will have few or no factors of $x_0$, and so get multiplied by $\sqrt{r}$ to a low power. Thus, a large value of $r$ has the effect of making the low-order terms larger, relative to the high-order terms.

Recall that if we make the basis expansion larger, this has the effect of reducing the regularization penalty, since the same classification rule can be accomplished with a smaller weight. Thus, if we make part of a basis expansion larger, those parts of the basis expansion will tend to play a larger role in the final classification rule. So using a larger constant $r$ has the effect of making the low-order parts of the polynomial expansion in Eq. 6.2 tend to have more impact.

6.3 Mercer's Theorem

One thing we might worry about is whether the SVM optimization is a well-behaved (convex) problem in $\alpha$. The concern is whether

$$\sum_{ij} \alpha_i \alpha_j y_i y_j \, k(x_i, x_j) \qquad (6.3)$$

is convex with respect to $\alpha$. We can show that if $k$ is a valid kernel function, then the kernel matrix $K$ must be positive semi-definite:

$$z^T K z = \sum_{ij} z_i K_{ij} z_j = \sum_{ij} z_i \, \phi(x_i) \cdot \phi(x_j) \, z_j = \Big( \sum_i z_i \phi(x_i) \Big) \cdot \Big( \sum_j z_j \phi(x_j) \Big) = \Big\| \sum_i z_i \phi(x_i) \Big\|^2 \ge 0.$$

We can also show that, if $K$ is positive semi-definite, then the SVM optimization is concave. The thing to see is that

$$\sum_{ij} \alpha_i \alpha_j y_i y_j \, k(x_i, x_j) = \alpha^T \mathrm{diag}(y) \, K \, \mathrm{diag}(y) \, \alpha = \alpha^T M \alpha,$$

where $M = \mathrm{diag}(y) \, K \, \mathrm{diag}(y)$. It is not hard to show that $M$ is positive semi-definite.

So this is very nice: if we use any valid kernel function, we can be assured that the optimization that we need to solve in order to recover the Lagrange multipliers $\alpha$ will be concave. (The equivalent of convex when we are doing a maximization instead of a minimization.)

Now, we still face the question: do there exist invalid kernel functions that also yield positive semi-definite kernel matrices? It turns out that the answer is no. This result is known as Mercer's theorem.

A kernel function is valid if and only if the corresponding kernel matrix is positive semi-definite for all training sets $\{x_i\}$.

This is very convenient: the valid kernel functions are exactly those that yield optimization problems that we can reliably solve. However, notice that Mercer's theorem refers to all sets of points $\{x_i\}$. An invalid kernel can yield a positive semi-definite kernel matrix for some particular training set. All we know is that, for an invalid kernel, there is some training set that yields a non-positive semi-definite kernel matrix.
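A practical (though, per the discussion above, only one-directional) numerical check is to build the Gram matrix of a candidate kernel on a sample of points and inspect its smallest eigenvalue. Here is a small sketch of this idea in NumPy (our own illustration, with made-up example kernels): a negative eigenvalue on any sample proves the kernel is invalid, while non-negative eigenvalues on one particular sample prove nothing.

```python
import numpy as np

def gram_matrix(kernel, X):
    n = len(X)
    return np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])

def min_gram_eigenvalue(kernel, X):
    """Smallest eigenvalue of the (symmetric) Gram matrix K_ij = k(x_i, x_j)."""
    return np.linalg.eigvalsh(gram_matrix(kernel, X)).min()

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))

poly = lambda x, v: (1 + x @ v) ** 3             # polynomial kernel: valid
suspect = lambda x, v: -np.linalg.norm(x - v)    # negative distance: not a valid kernel

print("poly kernel, min eigenvalue:", min_gram_eigenvalue(poly, X))       # >= 0 (up to round-off)
print("suspect kernel, min eigenvalue:", min_gram_eigenvalue(suspect, X)) # typically < 0
```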

7 Our Story so Far

There were a lot of technical details here. It is worth taking a look back to do a conceptual overview and make sure we haven't missed the big picture. The starting point for SVMs is minimizing the hinge loss with ridge regularization, i.e.

$$w^* = \arg\min_w \; c \sum_i (1 - y_i \, w \cdot x_i)_+ + \frac{1}{2}\|w\|^2.$$

Fundamentally, SVMs are just fitting an optimization like this. The difference is that they perform the optimization in a different way, and they allow you to work efficiently with powerful basis expansions / kernel functions.

Act 1. We proved that if $w^*$ is the vector of weights that results from this optimization, then we could alternatively calculate $w^*$ as

$$w^* = \sum_i \alpha_i y_i x_i,$$

where the $\alpha_i$ are given by the optimization

$$\max_{0 \le \alpha_i \le c} \; \sum_i \alpha_i - \frac{1}{2} \sum_{ij} \alpha_i \alpha_j y_i y_j \, x_i \cdot x_j. \qquad (7.1)$$

With that optimization solved, we can classify a new point $x$ by

$$f(x) = x \cdot w^* = \sum_i \alpha_i y_i \, x_i \cdot x. \qquad (7.2)$$

Thus, if we want to, we can think of the variables $\alpha_i$ as being the main things that we are fitting, rather than the weights $w$.

Act 2. Next, we noticed that the above optimization (Eq. 7.1) and classifier (Eq. 7.2) only depend on the inner products of the data elements. Thus, we could replace the inner products in these expressions with kernel evaluations, giving the optimization and the classification rule

$$\max_{0 \le \alpha_i \le c} \; \sum_i \alpha_i - \frac{1}{2} \sum_{ij} \alpha_i \alpha_j y_i y_j \, k(x_i, x_j), \qquad (7.3)$$

$$f(x) = \sum_i \alpha_i y_i \, k(x, x_i), \qquad (7.4)$$

where $k(x, v) = x \cdot v$.

Act 3. Now, imagine that instead of directly working with the data, we wanted to work with some basis expansion. This would be easy to accomplish just by switching the kernel function to $k(x, v) = \phi(x) \cdot \phi(v)$. However, we also noticed that for some basis expansions, like polynomials, we could compute $k(x, v)$ much more efficiently than by explicitly forming the basis expansions and then taking the inner product. We called this computational trick the kernel trick.

Act 4. Finally, we developed a kernel algebra, which allowed us to understand how we can combine different kernel functions, and what this means in feature space. We also saw Mercer's theorem, which tells us which kernel functions are and are not legal. Happily, the legal kernel functions are exactly those for which the SVM optimization problem is concave, and hence reliably solvable.

8 Discussion

8.1 SVMs as Template Methods

Regardless of what we say, at the end of the day, support vector machines make their predictions through the classification rule

$$f(x) = \sum_i \alpha_i y_i \, k(x, x_i).$$

Intuitively, $k(x, x_i)$ measures how similar $x$ is to training example $x_i$. This bears a strong resemblance to K-NN classification, where we use $k$ as the distance (or rather similarity) measure, rather than something like the Euclidean distance. Thus, SVMs can be seen as glorified template methods, where the amount that each point $x_i$ participates in predictions is reweighted in the learning stage. This is a view usually espoused by SVM skeptics, but a reasonable one. Remember, however, that there is absolutely nothing wrong with template methods.

8.2 Theoretical Issues

An advantage of SVMs is that rigorous theoretical guarantees can often be given for their performance. It is possible to use these theoretical bounds to do model selection, rather than, e.g., cross validation. However, at the moment, these theoretical guarantees are rather loose in practice, meaning that SVMs perform significantly better than the bounds can show. As such, one can often get better practical results by using more heuristic model selection procedures like cross validation. We will see this when we get to learning theory.