CSCI567 Machine Learning (Fall 2014)


CSCI567 Machine Learning (Fall 2014)
Drs. Sha & Liu ({feisha,yaliu.cs}@usc.edu)
October 9, 2014

Outline
1 Administration
2 Review of last lecture
3 Support vector machines
4 Geometric Understanding of SVM

Administration: Quiz #1
Tuesday Oct 21, 6-8pm, TTH301. Some exceptions are handled case by case.

Outline
1 Administration
2 Review of last lecture: kernel methods; kernelized machine learning methods
3 Support vector machines
4 Geometric Understanding of SVM

Review of last lecture: Kernel methods

How do we do nonlinear prediction without specifying nonlinear basis functions?

Definition of kernel function: a (positive semidefinite) kernel function k(·, ·) is a bivariate function that satisfies the following properties. For any x_m and x_n, k(x_m, x_n) = k(x_n, x_m) and k(x_m, x_n) = φ(x_m)^T φ(x_n) for some function φ(·).

Examples we have seen:
k(x_m, x_n) = (x_m^T x_n)^2
k(x_m, x_n) = 2 · [sin(2π(x_{m1} − x_{n1})) / (x_{m1} − x_{n1})] · [sin(2π(x_{m2} − x_{n2})) / (x_{m2} − x_{n2})]

Conditions for being a positive semidefinite kernel function

Mercer theorem (loosely): a bivariate function k(·, ·) is a positive semidefinite kernel function if and only if, for any N and any x_1, x_2, ..., x_N, the matrix

K = \begin{pmatrix} k(x_1, x_1) & k(x_1, x_2) & \cdots & k(x_1, x_N) \\ k(x_2, x_1) & k(x_2, x_2) & \cdots & k(x_2, x_N) \\ \vdots & \vdots & \ddots & \vdots \\ k(x_N, x_1) & k(x_N, x_2) & \cdots & k(x_N, x_N) \end{pmatrix}

is positive semidefinite. We also refer to k(·, ·) as a positive semidefinite kernel.

Flashback: why use kernel functions?

Without specifying φ(·), the kernel matrix

K = \begin{pmatrix} k(x_1, x_1) & \cdots & k(x_1, x_N) \\ \vdots & \ddots & \vdots \\ k(x_N, x_1) & \cdots & k(x_N, x_N) \end{pmatrix}

is exactly the same as

K = ΦΦ^T = \begin{pmatrix} φ(x_1)^T φ(x_1) & \cdots & φ(x_1)^T φ(x_N) \\ \vdots & \ddots & \vdots \\ φ(x_N)^T φ(x_1) & \cdots & φ(x_N)^T φ(x_N) \end{pmatrix}

Examples that are not kernels

k(x_m, x_n) = ||x_m − x_n||_2^2 is not our desired kernel function, as it cannot be written as an inner product between two vectors.

Examples of kernel functions

Polynomial kernel function with degree d:

k(x_m, x_n) = (x_m^T x_n + c)^d

for c ≥ 0 and d a positive integer.

Gaussian kernel, RBF kernel, or Gaussian RBF kernel:

k(x_m, x_n) = e^{−||x_m − x_n||_2^2 / (2σ^2)}

Most of these kernels have parameters to be tuned: d, c, σ^2, etc. They are hyperparameters and are often tuned on holdout data or with cross-validation.
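Below is a minimal numpy sketch of how these two kernel matrices might be computed; the function names, default parameters, and toy data are illustrative, not from the lecture.

```python
import numpy as np

def polynomial_kernel(X, c=1.0, d=2):
    """K[m, n] = (x_m^T x_n + c)^d for rows x_m, x_n of X."""
    return (X @ X.T + c) ** d

def rbf_kernel(X, sigma2=1.0):
    """K[m, n] = exp(-||x_m - x_n||^2 / (2 sigma^2))."""
    sq_norms = (X ** 2).sum(axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2 * X @ X.T
    return np.exp(-sq_dists / (2 * sigma2))

X = np.random.randn(5, 3)  # five points in R^3
# Per the Mercer condition, both matrices should be PSD
# (smallest eigenvalue >= 0 up to round-off).
print(np.linalg.eigvalsh(polynomial_kernel(X)).min())
print(np.linalg.eigvalsh(rbf_kernel(X)).min())
```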

Why is ||x_m − x_n||_2^2 not a positive semidefinite kernel?

Use the definition of a positive semidefinite kernel function. We choose N = 2 and compute the matrix

K = \begin{pmatrix} 0 & ||x_1 − x_2||_2^2 \\ ||x_1 − x_2||_2^2 & 0 \end{pmatrix}

This matrix cannot be positive semidefinite, as it has both negative and positive eigenvalues. (The sum of the diagonal elements is called the trace of a matrix, which equals the sum of the matrix's eigenvalues. In our case the trace is zero, so as long as x_1 ≠ x_2 the matrix must have one positive and one negative eigenvalue.)
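A quick numeric check of this argument (a sketch; the two points are arbitrary):

```python
import numpy as np

x1, x2 = np.array([0.0, 0.0]), np.array([1.0, 2.0])
d2 = np.sum((x1 - x2) ** 2)           # ||x1 - x2||^2 = 5
K = np.array([[0.0, d2], [d2, 0.0]])  # the N = 2 "kernel" matrix
print(np.linalg.eigvalsh(K))          # [-5.  5.]: a negative eigenvalue, so not PSD
```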

There are infinitely many kernels to use!

Rules for composing kernels (this is just a partial list):
- if k(x_m, x_n) is a kernel, then ck(x_m, x_n) is also a kernel if c > 0.
- if both k_1(x_m, x_n) and k_2(x_m, x_n) are kernels, then αk_1(x_m, x_n) + βk_2(x_m, x_n) is also a kernel if α, β ≥ 0.
- if both k_1(x_m, x_n) and k_2(x_m, x_n) are kernels, then k_1(x_m, x_n)k_2(x_m, x_n) is also a kernel.
- if k(x_m, x_n) is a kernel, then e^{k(x_m, x_n)} is also a kernel.

In practice, which kernel to use, or which kernels to compose into a new kernel, remains somewhat of a black art, though most people start with polynomial and Gaussian RBF kernels. A small numeric sanity check of two of these rules follows.
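This sketch checks, on arbitrary random data, that a nonnegative combination and an elementwise product of two kernel matrices remain positive semidefinite (the kernels and constants are illustrative):

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(6, 2)
lin = X @ X.T                                                  # linear kernel matrix
rbf = np.exp(-((X[:, None] - X[None, :]) ** 2).sum(-1) / 2.0)  # RBF kernel matrix

min_eig = lambda K: np.linalg.eigvalsh(K).min()
print(min_eig(2.0 * lin + 3.0 * rbf) >= -1e-9)  # nonnegative combination: PSD
print(min_eig(lin * rbf) >= -1e-9)              # elementwise (Schur) product: PSD
```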

Kernelization trick

Many learning methods depend on computing inner products between features; we have seen the example of regularized least squares. For those methods, we can use a kernel function in place of the inner products, i.e., kernelize the methods, thus introducing nonlinear features/basis. We will present one more example to illustrate this trick by kernelizing the nearest neighbor classifier. When we talk about support vector machines next lecture, we will see the trick one more time.

Kernelized nearest neighbor classifier

In the nearest neighbor classifier, the most important quantity to compute is the (squared) distance between two data points x_m and x_n:

d(x_m, x_n) = ||x_m − x_n||_2^2 = x_m^T x_m + x_n^T x_n − 2x_m^T x_n

We replace all the inner products in the distance with a kernel function k(·, ·), arriving at the kernelized distance

d_kernel(x_m, x_n) = k(x_m, x_m) + k(x_n, x_n) − 2k(x_m, x_n)

This distance is equivalent to computing the distance between φ(x_m) and φ(x_n):

d_kernel(x_m, x_n) = d(φ(x_m), φ(x_n))

where φ(·) is the nonlinear mapping function implied by the kernel function. The nearest neighbor of a point x is thus found with

arg min_n d_kernel(x_n, x)
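A minimal sketch of a kernelized 1-nearest-neighbor classifier, assuming a Gaussian RBF kernel; all names and the toy data are made up for illustration:

```python
import numpy as np

def rbf(a, b, sigma2=1.0):
    """Gaussian RBF kernel k(a, b) for single vectors a, b."""
    return np.exp(-np.sum((a - b) ** 2) / (2 * sigma2))

def kernelized_1nn(X_train, y_train, x, kernel=rbf):
    """Label of x's nearest neighbor under the kernelized distance
    d_kernel(x_n, x) = k(x_n, x_n) + k(x, x) - 2 k(x_n, x)."""
    dists = [kernel(xn, xn) + kernel(x, x) - 2 * kernel(xn, x) for xn in X_train]
    return y_train[int(np.argmin(dists))]

X_train = np.array([[0.0, 0.0], [1.0, 1.0], [3.0, 3.0]])
y_train = np.array([-1, -1, 1])
print(kernelized_1nn(X_train, y_train, np.array([2.5, 2.6])))  # 1
```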

Take-home exercise

You have seen examples of kernelizing linear regression and nearest neighbor. But can you kernelize the following?
- Decision tree
- Logistic (or multinomial logistic) regression


Outline
1 Administration
2 Review of last lecture
3 Support vector machines: hinge loss; primal formulation of SVM; basic Lagrange duality theory; dual formulation of SVM; a very simple example
4 Geometric Understanding of SVM

Support vector machines

One of the most commonly used machine learning algorithms. It is convex optimization for classification and regression, it incorporates kernel tricks to define nonlinear decision boundaries or regression functions, and it provides theoretical guarantees on generalization errors.

Hinge loss

Definition: assuming the label y ∈ {−1, 1} and the decision rule h(x) = sign(f(x)) with f(x) = w^T φ(x) + b,

ℓ^hinge(f(x), y) = 0 if yf(x) ≥ 1, and 1 − yf(x) otherwise

[figure: hinge loss plotted against yf(x) over [−2, 2]]

Intuition: penalize more if incorrectly classified (the left branch up to the kink point).

Convenient shorthand: ℓ^hinge(f(x), y) = max(0, 1 − yf(x))
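The shorthand translates directly into one line of numpy; a tiny sketch with made-up scores:

```python
import numpy as np

def hinge_loss(f, y):
    """max(0, 1 - y f(x)), elementwise over scores f and labels y in {-1, +1}."""
    return np.maximum(0.0, 1.0 - y * f)

f = np.array([2.0, 0.3, -1.5])  # scores f(x) = w^T phi(x) + b
y = np.array([1, 1, 1])
print(hinge_loss(f, y))         # [0.  0.7 2.5]
```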

Properties

[figure: hinge loss upper-bounding the 0/1 loss (black line)]

The hinge loss upper-bounds the 0/1 loss function; optimizing it leads to reduced classification errors. Namely, we use the hinge loss function as a surrogate for the true error function we care about. Note that this function is not differentiable at the kink point!

Primal formulation of support vector machines (SVM)

Minimize the total hinge loss on all the training data:

min_{w,b} Σ_n max(0, 1 − y_n[w^T φ(x_n) + b]) + (λ/2) ||w||_2^2

which is analogous to regularized least squares: it balances two terms (the loss and the regularizer).

Conventionally, we rewrite the objective function as

min_{w,b} C Σ_n max(0, 1 − y_n[w^T φ(x_n) + b]) + (1/2) ||w||_2^2

where C is identified as 1/λ.

We further rewrite it into another equivalent form:

min_{w,b,{ξ_n}} C Σ_n ξ_n + (1/2) ||w||_2^2
s.t. max(0, 1 − y_n[w^T φ(x_n) + b]) = ξ_n, ∀n

Primal formulation of SVM

min_{w,b,{ξ_n}} C Σ_n ξ_n + (1/2) ||w||_2^2
s.t. 1 − y_n[w^T φ(x_n) + b] ≤ ξ_n, ∀n
     ξ_n ≥ 0, ∀n

where the ξ_n are called slack variables.

Remarks:
- This is a convex quadratic program: the objective function is quadratic in w and the constraints are linear (inequality) constraints in w and ξ_n.
- Given φ(·), we can solve the optimization problem efficiently as it is convex, for example using Matlab's quadprog() function.
- However, there are more efficient algorithms for solving this problem that take advantage of the special structure of the objective function and the constraints. (We will not discuss them. Most existing SVM implementations/packages implement such efficient algorithms.) A simple alternative to a QP solver is sketched below.
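Besides a QP solver, the unconstrained form C Σ_n max(0, 1 − y_n[w^T x_n + b]) + (1/2)||w||_2^2 can be minimized directly by subgradient descent. This is a sketch only, assuming identity features φ(x) = x and a made-up step size; it is not the QP approach the slides describe:

```python
import numpy as np

def svm_primal_subgrad(X, y, C=1.0, lr=0.01, iters=2000):
    """Subgradient descent on C * sum(hinge) + 0.5 * ||w||^2 with phi(x) = x."""
    N, D = X.shape
    w, b = np.zeros(D), 0.0
    for _ in range(iters):
        margins = y * (X @ w + b)
        viol = margins < 1  # points with nonzero hinge loss
        # Subgradients: the regularizer contributes w; each violating point
        # contributes -C * y_n x_n (and -C * y_n for b).
        gw = w - C * (y[viol, None] * X[viol]).sum(axis=0)
        gb = -C * y[viol].sum()
        w, b = w - lr * gw, b - lr * gb
    return w, b
```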

Basic Lagrange duality theory

Key concepts you should know:
- What do primal and dual mean?
- How SVM exploits the dual formulation, which results in using kernel functions for nonlinear classification.
- What do support vectors mean?

Our roadmap: we will tell you what the dual looks like, then show you how it is derived.

Dual formulation of SVM

The dual is also a convex quadratic program:

max_α Σ_n α_n − (1/2) Σ_{m,n} y_m y_n α_m α_n φ(x_m)^T φ(x_n)
s.t. 0 ≤ α_n ≤ C, ∀n
     Σ_n α_n y_n = 0

Remarks:
- The optimization is convex, as the objective function is concave. (Take-home exercise: please verify.)
- There are N dual variables α_n, one for each constraint 1 − y_n[w^T φ(x_n) + b] ≤ ξ_n in the primal formulation.

Kernelized SVM

We replace the inner products φ(x_m)^T φ(x_n) with a kernel function:

max_α Σ_n α_n − (1/2) Σ_{m,n} y_m y_n α_m α_n k(x_m, x_n)
s.t. 0 ≤ α_n ≤ C, ∀n
     Σ_n α_n y_n = 0

as in kernelized linear regression and kernelized nearest neighbor. We only need to define a kernel function, and we automatically get the (nonlinearly) mapped features and the support vector machine constructed with those features.
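The slides mention Matlab's quadprog(); as a sketch of the same idea in Python, the dual can be handed to cvxopt's QP solver (cvxopt is an assumption here, not part of the lecture; K is a precomputed kernel matrix):

```python
import numpy as np
from cvxopt import matrix, solvers

def svm_dual(K, y, C):
    """Solve max_a 1^T a - 1/2 a^T Q a with 0 <= a_n <= C and y^T a = 0,
    where Q[m, n] = y_m y_n K[m, n]. Returns the dual variables alpha."""
    N = len(y)
    Q = np.outer(y, y) * K
    # cvxopt minimizes 1/2 x^T P x + q^T x  s.t.  G x <= h, A x = b,
    # so we negate the objective and stack the box constraints into G, h.
    P = matrix(Q)
    q = matrix(-np.ones(N))
    G = matrix(np.vstack([-np.eye(N), np.eye(N)]))
    h = matrix(np.hstack([np.zeros(N), C * np.ones(N)]))
    A = matrix(y.reshape(1, -1).astype(float))
    b = matrix(0.0)
    solvers.options['show_progress'] = False
    sol = solvers.qp(P, q, G, h, A, b)
    return np.array(sol['x']).ravel()
```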

Recovering the solution to the primal formulation

Weights:

w = Σ_n y_n α_n φ(x_n)

A linear combination of the input features!

Offset:

b = y_n − w^T φ(x_n) = y_n − Σ_m y_m α_m k(x_m, x_n), for any n with C > α_n > 0

Making a prediction on a test point x:

h(x) = sign(w^T φ(x) + b) = sign(Σ_n y_n α_n k(x_n, x) + b)

Again, to make a prediction, it suffices to know the kernel function.
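Continuing the hypothetical svm_dual sketch above, recovering b and predicting could look like this (function names are made up; tol guards against round-off in the condition 0 < α_n < C):

```python
import numpy as np

def recover_b(alpha, y, K, C, tol=1e-6):
    """b = y_n - sum_m y_m alpha_m K[m, n] for any n with 0 < alpha_n < C."""
    n = np.where((alpha > tol) & (alpha < C - tol))[0][0]
    return y[n] - np.sum(y * alpha * K[:, n])

def predict(alpha, y, b, k_col):
    """h(x) = sign(sum_n y_n alpha_n k(x_n, x) + b), where k_col[n] = k(x_n, x)."""
    return np.sign(np.sum(y * alpha * k_col) + b)
```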

Derivation of the dual

We will derive the dual formulation, as the process reveals some interesting and important properties of SVM. In particular, why is it called a support vector machine?

Recipe:
- Formulate a Lagrangian function that incorporates the constraints, through introducing dual variables.
- Minimize the Lagrangian function to solve for the primal variables.
- Put the primal variables back into the Lagrangian and express it in terms of the dual variables.
- Maximize the Lagrangian with respect to the dual variables.
- Recover the solution (for the primal variables) from the dual variables.

A simple example

Consider this example of convex quadratic programming:

min (1/2) x^2
s.t. −x ≤ 0
     2x − 3 ≤ 0

The Lagrangian is (note that we do not have equality constraints)

L(x, µ) = (1/2) x^2 + µ_1(−x) + µ_2(2x − 3) = (1/2) x^2 + (2µ_2 − µ_1) x − 3µ_2

under the constraint that µ_1 ≥ 0 and µ_2 ≥ 0.

Its dual problem is

max_{µ_1 ≥ 0, µ_2 ≥ 0} min_x L(x, µ) = max_{µ_1 ≥ 0, µ_2 ≥ 0} min_x (1/2) x^2 + (2µ_2 − µ_1) x − 3µ_2

Example (cont'd)

We solve min_x L(x, µ) first; now it is unconstrained. The optimal x is attained by

∂/∂x [(1/2) x^2 + (2µ_2 − µ_1) x − 3µ_2] = 0  ⟹  x = −(2µ_2 − µ_1)

This gives us the dual objective function, by substituting the solution into the objective function:

g(µ) = min_x (1/2) x^2 + (2µ_2 − µ_1) x − 3µ_2 = −(1/2)(2µ_2 − µ_1)^2 − 3µ_2

We get our dual problem:

max_{µ_1 ≥ 0, µ_2 ≥ 0} −(1/2)(2µ_2 − µ_1)^2 − 3µ_2

We will solve the dual next.

Solving the dual

Note that g(µ) = −(1/2)(2µ_2 − µ_1)^2 − 3µ_2 ≤ 0 for all µ_1 ≥ 0, µ_2 ≥ 0. Thus, to maximize the function, the optimal solution is µ*_1 = 0, µ*_2 = 0. This brings us back to the optimal solution for x:

x* = −(2µ*_2 − µ*_1) = 0

Namely, we have arrived at the same solution as the one we could guess from the primal formulation.
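A quick numeric check of the toy problem with an off-the-shelf solver (scipy is an assumption, not part of the lecture; the two constraints reduce to the box 0 ≤ x ≤ 1.5):

```python
from scipy.optimize import minimize

# min 1/2 x^2  s.t.  -x <= 0 and 2x - 3 <= 0, i.e., 0 <= x <= 1.5
res = minimize(lambda x: 0.5 * x[0] ** 2, x0=[1.0], bounds=[(0.0, 1.5)])
print(res.x)  # [0.]: matches the solution recovered from the dual
```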

Deriving the dual for SVM

The Lagrangian is

L(w, b, {ξ_n}, {α_n}, {λ_n}) = C Σ_n ξ_n + (1/2) ||w||_2^2 − Σ_n λ_n ξ_n + Σ_n α_n {1 − y_n[w^T φ(x_n) + b] − ξ_n}

under the constraint that α_n ≥ 0 and λ_n ≥ 0.

Minimizing the Lagrangian

Take derivatives with respect to the primal variables:

∂L/∂w = w − Σ_n y_n α_n φ(x_n) = 0
∂L/∂b = −Σ_n α_n y_n = 0
∂L/∂ξ_n = C − λ_n − α_n = 0

This gives rise to equations linking the primal variables to the dual variables, as well as new constraints on the dual variables:

w = Σ_n y_n α_n φ(x_n)
Σ_n α_n y_n = 0
C − λ_n − α_n = 0

Rewriting the Lagrangian in terms of dual variables

Substitute the solution for the primal variables back into the Lagrangian:

g({α_n}, {λ_n}) = L(w, b, {ξ_n}, {α_n}, {λ_n})
= Σ_n (C − α_n − λ_n) ξ_n + (1/2) ||Σ_n y_n α_n φ(x_n)||_2^2 + Σ_n α_n − b Σ_n α_n y_n − Σ_n α_n y_n (Σ_m y_m α_m φ(x_m))^T φ(x_n)
= Σ_n α_n + (1/2) ||Σ_n y_n α_n φ(x_n)||_2^2 − Σ_{m,n} α_m α_n y_m y_n φ(x_m)^T φ(x_n)
= Σ_n α_n − (1/2) Σ_{m,n} α_m α_n y_m y_n φ(x_m)^T φ(x_n)

Several terms vanish because of the constraints Σ_n α_n y_n = 0 and C − λ_n − α_n = 0.

The dual problem

Maximize the dual under the constraints:

max g({α_n}, {λ_n}) = Σ_n α_n − (1/2) Σ_{m,n} y_m y_n α_m α_n k(x_m, x_n)
s.t. α_n ≥ 0, ∀n
     Σ_n α_n y_n = 0
     C − λ_n − α_n = 0, λ_n ≥ 0, ∀n

We can simplify: since the objective function does not depend on λ_n, we can convert the equality constraint involving λ_n into an inequality constraint α_n ≤ C:

C − λ_n − α_n = 0, λ_n ≥ 0  ⟹  λ_n = C − α_n ≥ 0  ⟹  α_n ≤ C

Final form

max_α Σ_n α_n − (1/2) Σ_{m,n} y_m y_n α_m α_n φ(x_m)^T φ(x_n)
s.t. 0 ≤ α_n ≤ C, ∀n
     Σ_n α_n y_n = 0

Recovering the solution

The primal variable w is identified as

w = Σ_n α_n y_n φ(x_n)

To identify b, we need something else.

Complementary slackness and support vectors

At the optimal solution to both the primal and the dual, the following must be satisfied for every inequality constraint (these are the KKT conditions):

λ_n ξ_n = 0
α_n {1 − ξ_n − y_n[w^T φ(x_n) + b]} = 0

From the first condition: if α_n < C, then λ_n = C − α_n > 0, so ξ_n = 0. Thus, in conjunction with the second condition, we know that if C > α_n > 0, then

1 − y_n[w^T φ(x_n) + b] = 0  ⟹  b = y_n − w^T φ(x_n)

as y_n ∈ {−1, 1}. The training samples whose α_n > 0 are called support vectors. (We will discuss their geometric interpretation later.)

Outline
1 Administration
2 Review of last lecture
3 Support vector machines
4 Geometric Understanding of SVM

Intuition: where to put the decision boundary?

Consider the binary classification problem in the figure below. We have assumed, for convenience, that the training dataset is separable: there is a decision boundary that separates the two classes perfectly.

[figure: separable two-class data with several candidate decision boundaries H]

There are infinitely many ways of placing the decision boundary H: w^T φ(x) + b = 0! Our intuition, however, is to put the decision boundary in the middle of the two classes as much as possible. In other words, we want the decision boundary to be as far from every point as possible, as long as it classifies every point correctly.

Distances

The distance from a point φ(x) to the decision boundary is

d_H(φ(x)) = |w^T φ(x) + b| / ||w||_2

(We derived this in the recitation/quiz 0. Please re-verify it as a take-home exercise.) We can remove the absolute value by exploiting the fact that the decision boundary classifies every point in the training dataset correctly: namely, w^T φ(x) + b and x's label y have the same sign. The distance is now

d_H(φ(x)) = y[w^T φ(x) + b] / ||w||_2

Maximizing the margin

The margin is defined as the smallest distance over all the training points:

margin = min_n y_n[w^T φ(x_n) + b] / ||w||_2

Since we are interested in finding a w that puts all points as far as possible from the decision boundary, we maximize the margin:

max_w min_n y_n[w^T φ(x_n) + b] / ||w||_2 = max_w (1/||w||_2) min_n y_n[w^T φ(x_n) + b]

[figure: a point's distance (w^T φ(x) + b)/||w||_2 to the boundary H: w^T φ(x) + b = 0]

Rescaled margin

Since the margin does not change if we scale (w, b) by a constant factor c (as w^T φ(x) + b = 0 and (cw)^T φ(x) + (cb) = 0 are the same decision boundary), we fix the scale by forcing

min_n y_n[w^T φ(x_n) + b] = 1

In this case, our margin becomes

margin = 1 / ||w||_2

Precisely: the closest point to the decision boundary has a distance of exactly that.

[figure: boundary H: w^T φ(x) + b = 0 with the parallel hyperplanes w^T φ(x) + b = 1 and w^T φ(x) + b = −1 at distance 1/||w||_2]

Primal formulation

Combining everything we have, for a separable training dataset, we aim to

max_w 1/||w||_2  such that  y_n[w^T φ(x_n) + b] ≥ 1, ∀n

This is equivalent to

min_w (1/2) ||w||_2^2  s.t.  y_n[w^T φ(x_n) + b] ≥ 1, ∀n

This starts to look like our first formulation of SVM. For this geometric intuition, SVM is called a max margin (or large margin) classifier, and the constraints are called large margin constraints.

SVM for non-separable data

Suppose there are training data points that cannot be classified correctly no matter how we choose w. For those data points, y_n[w^T φ(x_n) + b] ≤ 0 for any w. Thus, the previous constraint

y_n[w^T φ(x_n) + b] ≥ 1, ∀n

is no longer feasible.

To deal with this issue, we introduce slack variables ξ_n to help:

y_n[w^T φ(x_n) + b] ≥ 1 − ξ_n, ∀n

where we also require ξ_n ≥ 0. Note that even for hard points that cannot be classified correctly, the slack variables make the above constraint satisfiable (we can keep increasing ξ_n until the inequality is met).

SVM primal formulation with slack variables

We obviously do not want the ξ_n to go to infinity, so we balance their sizes by penalizing them toward zero as much as possible:

min_{w,b,{ξ_n}} (1/2) ||w||_2^2 + C Σ_n ξ_n
s.t. y_n[w^T φ(x_n) + b] ≥ 1 − ξ_n, ∀n
     ξ_n ≥ 0, ∀n

where C is our tradeoff (hyper)parameter. This is precisely the primal formulation we first derived for SVM.

Meaning of support vectors in SVMs

Complementary slackness: at the optimum, we must have

α_n {1 − ξ_n − y_n[w^T φ(x_n) + b]} = 0, ∀n

That means α_n = 0 for some n. Additionally, our optimal solution is given by

w = Σ_n α_n y_n φ(x_n) = Σ_{n: α_n > 0} α_n y_n φ(x_n)

In words, our solution is determined only by those training samples whose corresponding α_n is strictly positive. Those samples are called support vectors. Non-support vectors, whose α_n = 0, can be removed from the training dataset; this removal will not affect the optimal solution (i.e., after the removal, if we construct another SVM classifier on the reduced dataset, the optimal solution is the same as the one on the original dataset).
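This removal property can be observed empirically; a sketch using scikit-learn (an assumption, not part of the lecture) that refits on the support vectors alone and compares decision values:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(20, 2) - 2, rng.randn(20, 2) + 2])
y = np.array([-1] * 20 + [1] * 20)

clf = SVC(C=1.0, kernel='rbf', gamma=0.5).fit(X, y)
sv = clf.support_  # indices n with alpha_n > 0
print(len(sv), "support vectors out of", len(X))

# Refit on the support vectors alone: the decision function should
# agree up to the solver's numerical tolerance.
clf_sv = SVC(C=1.0, kernel='rbf', gamma=0.5).fit(X[sv], y[sv])
x_test = rng.randn(5, 2)
print(np.allclose(clf.decision_function(x_test),
                  clf_sv.decision_function(x_test), atol=1e-3))
```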

Who are the support vectors? A case analysis

For a support vector we have 1 − ξ_n − y_n[w^T φ(x_n) + b] = 0, so there are three cases:
- ξ_n = 0. This implies y_n[w^T φ(x_n) + b] = 1: these are points exactly 1/||w||_2 away from the decision boundary.
- 0 < ξ_n < 1. These are points that are classified correctly but do not satisfy the large margin constraint; they have smaller distances to the decision boundary.
- ξ_n > 1. These are points that are misclassified.

Visualization of how training data points are categorized

[figure: decision boundary H: w^T φ(x) + b = 0 with the margin hyperplanes w^T φ(x) + b = ±1; points marked ξ = 0 (on the margin), ξ < 1 (inside the margin), and ξ > 1 (misclassified)]

Support vectors are those circled with the orange line.