Lecture 10: Support Vector Machines II
22 February 2016

Taylor B. Arnold
Yale Statistics STAT 365/665
Notes: Problem 3 is posted and due this upcoming Friday. There was an early bug in the fake-test data; fixed as of 2016-02-20.
Today: optimization theory behind support vector machines, and more examples.
Recall that we settled on the following definition for the support vector machine:

\max_{\|\beta\|_2 = 1} M
\quad \text{s.t.} \quad y_i (x_i^t \beta + \beta_0) \geq M(1 - \xi_i), \quad i = 1, \ldots, n,
\qquad \xi_i \geq 0, \quad \sum_i \xi_i \leq \text{Constant}.

This defines a margin of width M around the linear decision plane and tries to minimize the errors \xi_i for points that are on the wrong side of the margin.
We then re-parameterized this by setting the margin to 1 but allowing the size of \beta to grow:

\min \frac{1}{2} \|\beta\|_2^2
\quad \text{s.t.} \quad y_i (x_i^t \beta + \beta_0) \geq 1 - \xi_i, \quad i = 1, \ldots, n,
\qquad \xi_i \geq 0, \quad \sum_i \xi_i \leq \text{Constant}.

This defines a margin of width 1 / \|\beta\| around the linear decision plane and tries to minimize the errors \xi_i for points that are on the wrong side of the margin.
Notice that we can rewrite

\min \frac{1}{2} \|\beta\|_2^2
\quad \text{s.t.} \quad y_i (x_i^t \beta + \beta_0) \geq 1 - \xi_i, \quad i = 1, \ldots, n,
\qquad \xi_i \geq 0, \quad \sum_i \xi_i \leq \text{Constant},

with a constant C > 0, which depends only on the constant in the original formulation, as:

\min \frac{1}{2} \|\beta\|_2^2 + C \sum_i \xi_i
\quad \text{s.t.} \quad y_i (x_i^t \beta + \beta_0) \geq 1 - \xi_i, \quad \xi_i \geq 0, \quad i = 1, \ldots, n.

This follows by noticing that, for the corresponding value of the constant, the second form will find a \beta that minimizes \|\beta\|_2^2 while keeping \sum_i \xi_i within the constraint.
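One further step, implicit in the slides but worth making explicit: at the optimum each slack variable takes its smallest feasible value,

\xi_i = \max\left(0, \; 1 - y_i (x_i^t \beta + \beta_0)\right),

so the penalized form above is equivalent to the unconstrained hinge-loss problem

\min_{\beta, \beta_0} \; \frac{1}{2} \|\beta\|_2^2 + C \sum_i \max\left(0, \; 1 - y_i (x_i^t \beta + \beta_0)\right).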
The Lagrangian

Given a constrained optimization problem:

\min f(x) \quad \text{s.t.} \quad g_j(x) = 0, \quad j = 1, \ldots, K,

we can define the primal Lagrangian function as:

L_P = f(x) - \sum_{j=1}^{K} \lambda_j g_j(x)

What does this function look like?
The Lagrangian Dual

The Lagrangian dual function is then given as the infimum of L_P, as a function of the \lambda_j, over values of x:

L_D(\lambda) = \inf_x L_P(x, \lambda) = \inf_x \left\{ f(x) - \sum_{j=1}^{K} \lambda_j g_j(x) \right\}.

And the dual problem is to find the maximum of the dual function over all choices of \lambda:

\lambda^* = \arg\max_\lambda L_D(\lambda).

The optimal value of the primal problem, x^*, can be reconstructed by working backwards:

x^* = \arg\min_x L_P(x, \lambda^*).
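As a tiny worked example of this machinery (not from the slides): take f(x) = x^2 with the single constraint g(x) = x - 1 = 0. Then

L_P = x^2 - \lambda (x - 1), \qquad \frac{\partial L_P}{\partial x} = 2x - \lambda = 0 \implies x = \lambda / 2,

so the dual function is

L_D(\lambda) = \frac{\lambda^2}{4} - \lambda \left( \frac{\lambda}{2} - 1 \right) = -\frac{\lambda^2}{4} + \lambda,

which is maximized at \lambda^* = 2, giving x^* = \lambda^* / 2 = 1. This matches the obvious primal solution, with L_D(\lambda^*) = f(x^*) = 1.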
For a better understanding of the dual problem we can visualize the Lagrangian solution as a saddle point. [Figure: saddle-point visualization of the Lagrangian, from www.convexoptimization.com]
It turns out that this is a very good framework for working with support vector machines. We can define the Lagrangian function as:

L_P = \frac{1}{2} \|\beta\|_2^2 + C \sum_i \xi_i - \sum_i \alpha_i \left[ y_i (x_i^t \beta + \beta_0) - (1 - \xi_i) \right] - \sum_i \mu_i \xi_i

where the \alpha_i and \mu_i are the Lagrangian multipliers.
Aside: technically, the theory of Lagrangian multipliers only applies when the constraints on the solution are equality constraints rather than inequality constraints. The larger theory needed for the general case uses the Karush-Kuhn-Tucker (KKT) conditions. These add additional constraints on top of those presented here. Following the Elements of Statistical Learning, we will not worry about those details here, as they are more an annoyance than an interesting conceptual difference.
To construct the dual function, we need to take partial derivatives with respect to the primal variables: \beta, \beta_0, and \xi_i. Setting these derivatives to zero and plugging the results back into the primal problem gives the dual function.
For \beta_j:

\frac{\partial L_P}{\partial \beta_j}
= \frac{\partial}{\partial \beta_j} \left\{ \frac{1}{2} \|\beta\|_2^2 + C \sum_i \xi_i - \sum_i \alpha_i \left[ y_i (x_i^t \beta + \beta_0) - (1 - \xi_i) \right] - \sum_i \mu_i \xi_i \right\}
= \frac{\partial}{\partial \beta_j} \left\{ \frac{1}{2} \|\beta\|_2^2 - \sum_i \alpha_i y_i (x_i^t \beta) \right\}
= \beta_j - \sum_i \alpha_i y_i x_{i,j}

Setting this equal to zero, and writing the equation simultaneously for all \beta_j, we get:

\beta = \sum_i \alpha_i y_i x_i
This necessary condition for the solution of the support vector machine is of independent interest. It says that \beta can be written as a linear combination of the data points x_i. Any x_i such that \alpha_i is non-zero is called a support vector.
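A quick numerical check of this identity (a sketch, not part of the slides; it assumes scikit-learn and a toy dataset): sklearn.svm.SVC stores the products \alpha_i y_i for the support vectors in dual_coef_, the support vectors themselves in support_vectors_, and, for a linear kernel, the resulting \beta in coef_.

    # Sketch: verify beta = sum_i alpha_i * y_i * x_i on a toy problem.
    import numpy as np
    from sklearn.datasets import make_blobs
    from sklearn.svm import SVC

    # Two well-separated Gaussian blobs as stand-in data.
    X, y = make_blobs(n_samples=100, centers=2, random_state=0)

    clf = SVC(kernel="linear", C=1.0).fit(X, y)

    # dual_coef_ holds alpha_i * y_i for the support vectors (labels
    # internally coded as -1/+1); every other alpha_i is exactly zero.
    beta_from_dual = clf.dual_coef_ @ clf.support_vectors_

    print(np.allclose(beta_from_dual, clf.coef_))  # True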
For β 0, the dervatve s gven as: { L P = 1 β 0 β 0 2 β 2 2 + C ξ { = } α y β 0 β 0 α [ y (x t β + β 0 ) (1 ξ ) ] µ ξ } = α y Whch when set to zero gves: 0 = α y Ths explans why the term β 0 s often called the bas of the support vector machne. 16/28
Finally, the derivative with respect to \xi_i is given as:

\frac{\partial L_P}{\partial \xi_i}
= \frac{\partial}{\partial \xi_i} \left\{ \frac{1}{2} \|\beta\|_2^2 + C \sum_i \xi_i - \sum_i \alpha_i \left[ y_i (x_i^t \beta + \beta_0) - (1 - \xi_i) \right] - \sum_i \mu_i \xi_i \right\}
= \frac{\partial}{\partial \xi_i} \left\{ C \sum_i \xi_i + \sum_i \alpha_i (1 - \xi_i) - \sum_i \mu_i \xi_i \right\}
= C - \alpha_i - \mu_i

Setting this equal to zero we see that:

\alpha_i = C - \mu_i.
We now want to plug this into the Lagrangian primal function. We first use the fact that \alpha_i = C - \mu_i to unite the trailing terms with respect to \alpha_i:

L_D = \frac{1}{2} \|\beta\|_2^2 + C \sum_i \xi_i - \sum_i \alpha_i \left[ y_i (x_i^t \beta + \beta_0) - (1 - \xi_i) \right] - \sum_i \mu_i \xi_i
= \frac{1}{2} \|\beta\|_2^2 + \sum_i (C - \mu_i) \xi_i - \sum_i \alpha_i y_i x_i^t \beta - \sum_i \alpha_i y_i \beta_0 + \sum_i \alpha_i (1 - \xi_i)
= \frac{1}{2} \|\beta\|_2^2 + \sum_i \alpha_i \xi_i - \sum_i \alpha_i y_i x_i^t \beta - \sum_i \alpha_i y_i \beta_0 + \sum_i \alpha_i (1 - \xi_i)
= \frac{1}{2} \|\beta\|_2^2 + \sum_i \alpha_i - \sum_i \alpha_i y_i x_i^t \beta - \sum_i \alpha_i y_i \beta_0

And the last term drops out, since \sum_i \alpha_i y_i = 0:

L_D = \frac{1}{2} \|\beta\|_2^2 + \sum_i \alpha_i - \sum_i \alpha_i y_i x_i^t \beta - \beta_0 \sum_i \alpha_i y_i
= \frac{1}{2} \|\beta\|_2^2 + \sum_i \alpha_i - \sum_i \alpha_i y_i x_i^t \beta
Now, notice that since \beta = \sum_i \alpha_i y_i x_i, we have that:

\|\beta\|_2^2 = \sum_j \beta_j^2 = \left( \sum_i \alpha_i y_i x_i \right)^t \left( \sum_{i'} \alpha_{i'} y_{i'} x_{i'} \right) = \sum_{i, i'} \alpha_i \alpha_{i'} y_i y_{i'} x_i^t x_{i'}

Also see that we can rewrite the last term in our dual function as:

\sum_i \alpha_i y_i x_i^t \beta = \sum_{i, i'} \alpha_i \alpha_{i'} y_i y_{i'} x_i^t x_{i'} = \|\beta\|_2^2.
Finally, the dual function can be written as:

L_D = \frac{1}{2} \|\beta\|_2^2 + \sum_i \alpha_i - \sum_i \alpha_i y_i x_i^t \beta
= \sum_i \alpha_i - \frac{1}{2} \|\beta\|_2^2
= \sum_i \alpha_i - \frac{1}{2} \sum_{i, i'} \alpha_i \alpha_{i'} y_i y_{i'} x_i^t x_{i'},

which we want to maximize under the constraints (the first is from the KKT conditions, the second from the partial derivative of the bias):

0 \leq \alpha_i \leq C,
\qquad \sum_i \alpha_i y_i = 0.
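Restated in vector form (a restatement, not on the original slides), with G the n \times n matrix of inner products G_{i i'} = x_i^t x_{i'} and \odot the elementwise product, this is

L_D(\alpha) = \mathbf{1}^t \alpha - \frac{1}{2} \alpha^t \left( y y^t \odot G \right) \alpha, \qquad 0 \leq \alpha_i \leq C, \quad \sum_i \alpha_i y_i = 0,

the quadratic program we return to at the end of the lecture.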
To what extent can we make some sense of this equation?

L_D = \sum_i \alpha_i - \frac{1}{2} \sum_{i, i'} \alpha_i \alpha_{i'} y_i y_{i'} x_i^t x_{i'}

First notice the duality of the problem if we flip the \pm 1 labeling of the classes y_i: the function only depends on the sign of y_i y_{i'}. Also note that it only depends on the data points that serve as support vectors, since \alpha_i = 0 for every other point.
Perhaps most importantly, though, notice that only the outer product XX^t affects the final results:

L_D = \sum_i \alpha_i - \frac{1}{2} \sum_{i, i'} \alpha_i \alpha_{i'} y_i y_{i'} x_i^t x_{i'},

as x_i^t x_{i'} is the (i, i')th element of XX^t. This is a measurement of how similar x_i and x_{i'} are to one another (if scaled to both have length one, it is the cosine of the angle between them).

So, we can see that:
1. There is a penalty for including two similar x_i's with the same class label
2. There is a benefit for including two similar x_i's with different class labels
Both of which actually make sense for a classification algorithm.
Taking a step back now, how do logistic regression and support vector machines compare?
1. Both separate the plane into two half-spaces which attempt to split the classes as well as possible
2. However, logistic regression is (primarily) concerned with the correlation matrix X^t X between the variables, while support vector machines only care about the similarity matrix X X^t between observations (the sketch below makes the dimensions concrete)
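A minimal numpy sketch (illustrative only, not from the slides): for an n \times p data matrix X, X^t X is p \times p and indexed by variables, while X X^t is n \times n and indexed by observations.

    # Contrast the two matrices for an n x p data matrix X.
    import numpy as np

    n, p = 100, 5
    rng = np.random.default_rng(0)
    X = rng.normal(size=(n, p))

    print((X.T @ X).shape)  # (5, 5): variable by variable, as in regression
    print((X @ X.T).shape)  # (100, 100): observation by observation, as in SVMs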
The Kernel Trick

In the case of logistic and linear regression I have shown how basis expansion can be used to add non-linear effects into a linear model. One observation that makes support vector machines attractive is that it is possible to mimic basis expansion without ever having to actually project into a higher dimensional space.
The Kernel Trick, cont.

Assume that we have a mapping h of the samples x_i into a higher dimensional space. We can re-write the dual function using inner product notation:

L_D = \sum_i \alpha_i - \frac{1}{2} \sum_{i, i'} \alpha_i \alpha_{i'} y_i y_{i'} \langle h(x_i), h(x_{i'}) \rangle.

It quickly becomes apparent that we only need a fast way of calculating inner products in the space of h, which may not require actually determining and calculating h itself.
The Kernel Trick, cont.

The projected inner product \langle h(x_i), h(x_{i'}) \rangle is usually written directly as K(x_i, x_{i'}) for a function K called the kernel. Popular choices include:

1. Linear: K(x_i, x_{i'}) = \langle x_i, x_{i'} \rangle
2. Polynomial: K(x_i, x_{i'}) = (1 + \langle x_i, x_{i'} \rangle)^d
3. Radial: K(x_i, x_{i'}) = \exp(-\gamma \|x_i - x_{i'}\|^2)
4. Sigmoid: K(x_i, x_{i'}) = \tanh(\kappa_1 \langle x_i, x_{i'} \rangle + \kappa_2)

Notice that these all require approximately the same effort to calculate as the linear kernel (a short computational sketch follows).
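Here is that sketch (assuming scikit-learn; the parameter values d = 3, \gamma = 0.5, \kappa_1 = 0.01, \kappa_2 = 1 are arbitrary choices for illustration), computing each kernel matrix directly from the data without ever forming h(x):

    # Each call returns the full n x n kernel matrix.
    import numpy as np
    from sklearn.metrics.pairwise import (
        linear_kernel, polynomial_kernel, rbf_kernel, sigmoid_kernel,
    )

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 5))

    K_lin = linear_kernel(X)                                  # <x_i, x_i'>
    K_pol = polynomial_kernel(X, degree=3, gamma=1, coef0=1)  # (1 + <x_i, x_i'>)^3
    K_rbf = rbf_kernel(X, gamma=0.5)                          # exp(-0.5 ||x_i - x_i'||^2)
    K_sig = sigmoid_kernel(X, gamma=0.01, coef0=1)            # tanh(0.01 <x_i, x_i'> + 1)

All four cost on the order of n^2 p operations to form, the same order as the linear kernel itself.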
Finishing the optimization

Now, we can re-write the optimization problem as:

\max_\alpha \; \mathbf{1}^t \alpha - \alpha^t K \alpha
\quad \text{s.t.} \quad 0 \leq \alpha_i \leq C

for a suitable matrix K, called the kernel matrix (here K absorbs the factor of 1/2 and the products y_i y_{i'} from the dual function above). This is a quadratic program with box constraints, and can be solved fairly efficiently by general purpose solvers.
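To close the loop, here is a toy end-to-end solve of this quadratic program with a general-purpose solver (a pedagogical sketch, not from the slides; production implementations such as libsvm use specialized methods like SMO instead):

    # Solving the SVM dual with a general-purpose solver (pedagogical only).
    import numpy as np
    from scipy.optimize import minimize
    from sklearn.datasets import make_blobs
    from sklearn.metrics.pairwise import linear_kernel

    X, y = make_blobs(n_samples=40, centers=2, random_state=0)
    y = 2 * y - 1          # recode labels to {-1, +1}
    C = 1.0

    Q = np.outer(y, y) * linear_kernel(X)   # Q_ii' = y_i y_i' <x_i, x_i'>

    def neg_dual(alpha):
        # Negative of L_D, since scipy minimizes.
        return 0.5 * alpha @ Q @ alpha - alpha.sum()

    res = minimize(
        neg_dual,
        x0=np.zeros(len(y)),
        method="SLSQP",
        bounds=[(0, C)] * len(y),                            # 0 <= alpha_i <= C
        constraints={"type": "eq", "fun": lambda a: a @ y},  # sum_i alpha_i y_i = 0
    )

    alpha = res.x
    beta = (alpha * y) @ X                   # beta = sum_i alpha_i y_i x_i
    print(np.sum(alpha > 1e-6), "support vectors")

Only the points with \alpha_i > 0, the support vectors, contribute to the recovered \beta.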
More information

I try to provide additional references for all of my lectures on the class website. For today's material (and Wednesday's) I would like to make a particular point to mention two references:

- Elements of Statistical Learning, Sections 12.1-12.4
- Convex Optimization, S. Boyd, Chapter 5 (5.5 in particular)

These contain many more details than I have time to cover, and assume a deeper background in statistics / convex calculus.