Linear Support Vector Machines

David S. Rosenberg

1 The Support Vector Machine

For a linear support vector machine (SVM), we use the hypothesis space of affine functions

$$\mathcal{F} = \left\{ f(x) = w^T x + b \mid w \in \mathbb{R}^d,\ b \in \mathbb{R} \right\}$$

and evaluate them with respect to the SVM loss function, also known as the hinge loss. The hinge loss is a margin loss defined as $\ell(m) = (1 - m)_+$, where $m = y f(x)$ is the margin for the prediction function $f$ on the example $(x, y)$, and $(x)_+ = x\,\mathbb{1}[x \ge 0]$ denotes the positive part of $x$. The SVM traditionally uses an $\ell_2$ regularization term, and the objective function is written as

$$J(w, b) = \frac{1}{2}\|w\|^2 + \frac{c}{n} \sum_{i=1}^n \left(1 - y_i\left[w^T x_i + b\right]\right)_+.$$

Note that the $w$ parameter is regularized, while the bias term $b$ is not regularized. An alternative approach (which saves some writing) is to drop the $b$ and add a constant feature, say with the value $1$, to the representation of $x$. With this approach, the bias term will be regularized along with the rest of the parameters. Rather than the typical $\lambda$ regularization parameter attached to the $\ell_2$ penalty, for SVMs it's traditional to have a $c$ parameter attached to the empirical risk component. The larger $c$ is, the more relative importance we attach to minimizing the empirical risk, compared to finding a simple hypothesis with small $\ell_2$-norm.
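To make the objective concrete, here is a minimal numpy sketch of $J(w, b)$; the function name `svm_objective` and its signature are illustrative, not part of the original notes.

```python
import numpy as np

def svm_objective(w, b, X, y, c):
    """J(w, b) = 0.5 * ||w||^2 + (c/n) * sum_i (1 - y_i [w^T x_i + b])_+.

    X is an (n, d) array of inputs; y is an (n,) array of +/-1 labels.
    """
    n = X.shape[0]
    margins = y * (X @ w + b)               # m_i = y_i f(x_i)
    hinge = np.maximum(0.0, 1.0 - margins)  # hinge loss (1 - m_i)_+
    return 0.5 * w @ w + (c / n) * hinge.sum()
```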
2 Formulating SVM as a QP

The SVM optimization problem is

$$\min_{w \in \mathbb{R}^d,\, b \in \mathbb{R}} \frac{1}{2}\|w\|^2 + \frac{c}{n} \sum_{i=1}^n \left(1 - y_i\left[w^T x_i + b\right]\right)_+. \tag{2.1}$$

This is an unconstrained optimization problem (which is nice), but the objective function is not differentiable, which makes it difficult to work with. We can formulate an equivalent problem with a differentiable objective, but we'll have to add new constraints to do so. Note that (2.1) is equivalent to

$$\begin{aligned} \text{minimize} \quad & \frac{1}{2}\|w\|^2 + \frac{c}{n}\sum_{i=1}^n \xi_i \\ \text{subject to} \quad & \xi_i \ge \left(1 - y_i\left[w^T x_i + b\right]\right)_+ \quad \text{for } i = 1, \ldots, n, \end{aligned}$$

since the minimization will always drive down $\xi_i$ until $\xi_i = \left(1 - y_i\left[w^T x_i + b\right]\right)_+$. We can now break up the inequality into two parts:

$$\begin{aligned} \text{minimize} \quad & \frac{1}{2}\|w\|^2 + \frac{c}{n}\sum_{i=1}^n \xi_i \\ \text{subject to} \quad & \xi_i \ge 0 \quad \text{for } i = 1, \ldots, n \\ & \xi_i \ge 1 - y_i\left[w^T x_i + b\right] \quad \text{for } i = 1, \ldots, n. \end{aligned}$$

We now have a differentiable objective function in $d + 1 + n$ variables with $2n$ affine constraints. This is a quadratic program that can be solved by any off-the-shelf QP solver; a small sketch follows.
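For instance, here is a minimal sketch of this QP using the cvxpy modeling library (an assumption of this example; any QP solver would do). The name `solve_svm_primal` is made up for illustration.

```python
import cvxpy as cp

def solve_svm_primal(X, y, c):
    """Solve the slack-variable QP: d + 1 + n variables, 2n affine constraints."""
    n, d = X.shape
    w = cp.Variable(d)
    b = cp.Variable()
    xi = cp.Variable(n)
    objective = cp.Minimize(0.5 * cp.sum_squares(w) + (c / n) * cp.sum(xi))
    constraints = [
        xi >= 0,                              # xi_i >= 0
        xi >= 1 - cp.multiply(y, X @ w + b),  # xi_i >= 1 - y_i [w^T x_i + b]
    ]
    cp.Problem(objective, constraints).solve()
    return w.value, b.value, xi.value
```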
3 Compute the Lagrangian Dual

The Lagrangian for this formulation is

$$\begin{aligned} L(w, b, \xi, \alpha, \lambda) &= \frac{1}{2}\|w\|^2 + \frac{c}{n}\sum_{i=1}^n \xi_i + \sum_{i=1}^n \alpha_i\left(1 - y_i\left[w^T x_i + b\right] - \xi_i\right) - \sum_{i=1}^n \lambda_i \xi_i \\ &= \frac{1}{2} w^T w + \sum_{i=1}^n \xi_i\left(\frac{c}{n} - \alpha_i - \lambda_i\right) + \sum_{i=1}^n \alpha_i\left(1 - y_i\left[w^T x_i + b\right]\right). \end{aligned}$$

From our study of Lagrangian duality, we know that the original problem can now be expressed as

$$\inf_{w, b, \xi}\ \sup_{\alpha, \lambda \succeq 0}\ L(w, b, \xi, \alpha, \lambda).$$

Since our constraints are affine, by Slater's condition we have strong duality so long as the problem is feasible (i.e. so long as there is at least one point in the feasible set). The constraints are satisfied by $w = 0$, $b = 0$, and $\xi_i = 1$ for $i = 1, \ldots, n$, so we have strong duality. Thus we get the same result if we solve the following dual problem:

$$\sup_{\alpha, \lambda \succeq 0}\ \inf_{w, b, \xi}\ L(w, b, \xi, \alpha, \lambda).$$

As usual, we capture the inner optimization in the Lagrange dual objective:

$$g(\alpha, \lambda) = \inf_{w, b, \xi} L(w, b, \xi, \alpha, \lambda).$$

Note that if $\frac{c}{n} - \alpha_i - \lambda_i \ne 0$, then the Lagrangian is unbounded below (by taking $\xi_i \to \pm\infty$) and thus the infimum is $-\infty$. For any given $(\alpha, \lambda)$, the function $(w, b, \xi) \mapsto L(w, b, \xi, \alpha, \lambda)$ is convex and differentiable, thus we have an optimal point if and only if all partial derivatives of $L$ with respect to $w$, $b$, and $\xi$ are $0$:

$$\partial_w L = 0 \iff w - \sum_{i=1}^n \alpha_i y_i x_i = 0 \iff w = \sum_{i=1}^n \alpha_i y_i x_i \tag{3.1}$$

$$\partial_b L = 0 \iff \sum_{i=1}^n \alpha_i y_i = 0$$

$$\partial_{\xi_i} L = 0 \iff \frac{c}{n} - \alpha_i - \lambda_i = 0 \iff \alpha_i + \lambda_i = \frac{c}{n} \tag{3.2}$$

Note that one of the conditions is $\alpha_i + \lambda_i = \frac{c}{n}$, which agrees with our previous observation that if $\alpha_i + \lambda_i \ne \frac{c}{n}$ then $L$ is unbounded below. Substituting these conditions back into $L$, the second term disappears, while the first and third terms become

$$\frac{1}{2} w^T w = \frac{1}{2}\sum_{i,j=1}^n \alpha_i \alpha_j y_i y_j x_i^T x_j$$

and

$$\sum_{i=1}^n \alpha_i\left(1 - y_i\left[w^T x_i + b\right]\right) = \sum_{i=1}^n \alpha_i - \sum_{i,j=1}^n \alpha_i \alpha_j y_i y_j x_j^T x_i - b \underbrace{\sum_{i=1}^n \alpha_i y_i}_{=0}.$$
Putting it together, the dual function is

$$g(\alpha, \lambda) = \begin{cases} \sum_{i=1}^n \alpha_i - \frac{1}{2}\sum_{i,j=1}^n \alpha_i \alpha_j y_i y_j x_j^T x_i & \text{if } \sum_{i=1}^n \alpha_i y_i = 0 \text{ and } \alpha_i + \lambda_i = \frac{c}{n}, \text{ all } i \\ -\infty & \text{otherwise.} \end{cases}$$

Thus we can write the dual problem as

$$\begin{aligned} \sup_{\alpha, \lambda} \quad & \sum_{i=1}^n \alpha_i - \frac{1}{2}\sum_{i,j=1}^n \alpha_i \alpha_j y_i y_j x_j^T x_i \\ \text{s.t.} \quad & \sum_{i=1}^n \alpha_i y_i = 0 \\ & \alpha_i + \lambda_i = \frac{c}{n}, \quad i = 1, \ldots, n \\ & \alpha_i, \lambda_i \ge 0, \quad i = 1, \ldots, n. \end{aligned}$$

We can actually eliminate the $\lambda$ variables, replacing the last three constraints by $\alpha_i \in \left[0, \frac{c}{n}\right]$:

$$\begin{aligned} \sup_{\alpha} \quad & \sum_{i=1}^n \alpha_i - \frac{1}{2}\sum_{i,j=1}^n \alpha_i \alpha_j y_i y_j x_j^T x_i \\ \text{s.t.} \quad & \sum_{i=1}^n \alpha_i y_i = 0 \\ & \alpha_i \in \left[0, \tfrac{c}{n}\right], \quad i = 1, \ldots, n. \end{aligned}$$

When written in standard form, this has a quadratic objective in $n$ unknowns and $2n + 1$ constraints. Note that the constraints $\alpha_i \in \left[0, \frac{c}{n}\right]$ have a particularly simple form: they are called box constraints. If $\alpha^*$ is a solution to the dual problem, then by strong duality and (3.1), the optimal solution to the primal problem is given by

$$w^* = \sum_{i=1}^n \alpha_i^* y_i x_i.$$

Note that $w^* = \sum_{i=1}^n \alpha_i^* y_i x_i$ only depends on those examples for which $\alpha_i^* > 0$ (recall that $\alpha_i^* \ge 0$ by constraint). These examples are called support vectors.
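Continuing the cvxpy sketch from Section 2, here is one way to solve the box-constrained dual and recover $w^*$ and the support vectors. The tiny ridge added to the Gram-style matrix is only there to keep numerical PSD checks happy, and `solve_svm_dual` is again an illustrative name, not anything from the original notes.

```python
import cvxpy as cp
import numpy as np

def solve_svm_dual(X, y, c):
    """Solve the dual QP: n unknowns, box constraints plus one equality."""
    n = X.shape[0]
    G = np.outer(y, y) * (X @ X.T)   # G_ij = y_i y_j x_i^T x_j
    G += 1e-9 * np.eye(n)            # small ridge for numerical PSD-ness
    alpha = cp.Variable(n)
    objective = cp.Maximize(cp.sum(alpha) - 0.5 * cp.quad_form(alpha, G))
    constraints = [
        cp.sum(cp.multiply(y, alpha)) == 0,  # sum_i alpha_i y_i = 0
        alpha >= 0,                          # box constraints:
        alpha <= c / n,                      #   alpha_i in [0, c/n]
    ]
    cp.Problem(objective, constraints).solve()
    w = X.T @ (alpha.value * y)      # w* = sum_i alpha_i^* y_i x_i
    support = alpha.value > 1e-8     # support vectors: alpha_i^* > 0
    return alpha.value, w, support
```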
Since $\alpha_i \in \left[0, \frac{c}{n}\right]$, we see that $c$ controls the amount of weight we can put on any single example. Note that we still don't have an expression for the optimal bias term $b^*$. We'll derive this below using the complementary slackness conditions.

4 Consequences of Complementary Slackness

Let $(w^*, b^*, \xi^*)$ and $(\alpha^*, \lambda^*)$ be optimal solutions to the primal and dual problems, respectively. For notational convenience, let's define $f^*(x) = x^T w^* + b^*$. By strong duality, we have the following complementary slackness conditions:

$$\alpha_i^*\left(1 - y_i f^*(x_i) - \xi_i^*\right) = 0 \tag{4.1}$$

$$\lambda_i^* \xi_i^* = \left(\tfrac{c}{n} - \alpha_i^*\right)\xi_i^* = 0 \tag{4.2}$$

We now draw many straightforward conclusions:

- As we noted above, $\xi_i^*$ is the hinge loss on example $i$. When $\xi_i^* = 0$, we're either at the margin (i.e. $y_i f^*(x_i) = 1$) or on the good side of the margin ($y_i f^*(x_i) > 1$). That is,
  $$\xi_i^* = 0 \implies y_i f^*(x_i) \ge 1. \tag{4.3}$$
- By (4.2), $\alpha_i^* = 0$ implies $\xi_i^* = 0$, which by (4.3) implies $y_i f^*(x_i) \ge 1$.
- $\alpha_i^* \in \left(0, \frac{c}{n}\right)$ implies $\xi_i^* = 0$, by (4.2). Then by (4.1) we get $y_i f^*(x_i) = 1$. So the prediction is right on the margin.
- If $y_i f^*(x_i) < 1$, then the margin loss is $\xi_i^* > 0$, and (4.2) implies that $\alpha_i^* = \frac{c}{n}$.
- If $y_i f^*(x_i) > 1$, then the margin loss is $\xi_i^* = 0$, and (4.1) implies $\alpha_i^* = 0$.
- The contrapositive of the previous result is that $\alpha_i^* > 0$ implies $y_i f^*(x_i) \le 1$. This seems to be all we can say for the specific case $\alpha_i^* = \frac{c}{n}$. We also can't draw any extra information about $\alpha_i^*$ for points exactly on the margin ($y_i f^*(x_i) = 1$).

We summarize these results below:

$$\begin{aligned} \alpha_i^* = 0 &\implies y_i f^*(x_i) \ge 1 & \qquad y_i f^*(x_i) < 1 &\implies \alpha_i^* = \tfrac{c}{n} \\ \alpha_i^* \in \left(0, \tfrac{c}{n}\right) &\implies y_i f^*(x_i) = 1 & y_i f^*(x_i) = 1 &\implies \alpha_i^* \in \left[0, \tfrac{c}{n}\right] \\ \alpha_i^* = \tfrac{c}{n} &\implies y_i f^*(x_i) \le 1 & y_i f^*(x_i) > 1 &\implies \alpha_i^* = 0 \end{aligned}$$
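As a numerical sanity check of this summary (not part of the original notes), one can bucket training points by $\alpha_i^*$ after solving the dual, e.g. with the hypothetical `solve_svm_dual` sketch above; the tolerances are ad hoc and solver-dependent.

```python
import numpy as np

def check_slackness(alpha, X, y, w, b, c, tol=1e-6):
    """Check the alpha -> margin implications from the summary table."""
    m = y * (X @ w + b)                                # m_i = y_i f*(x_i)
    zero = alpha <= tol                                # alpha_i^* = 0
    interior = (alpha > tol) & (alpha < c / len(y) - tol)  # in (0, c/n)
    at_cap = alpha >= c / len(y) - tol                 # alpha_i^* = c/n
    assert np.all(m[zero] >= 1 - tol)                  # expect y_i f* >= 1
    assert np.allclose(m[interior], 1.0, atol=1e-4)    # expect y_i f* = 1
    assert np.all(m[at_cap] <= 1 + tol)                # expect y_i f* <= 1
```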
4.1 Determining $b^*$

Finally, let's determine $b^*$. Suppose there exists an $i$ such that $\alpha_i^* \in \left(0, \frac{c}{n}\right)$. Then $\xi_i^* = 0$ by (4.2), and by (4.1) we get $y_i\left[x_i^T w^* + b^*\right] = 1$. Since $y_i \in \{-1, 1\}$, we can conclude that

$$b^* = y_i - x_i^T w^*.$$

With exact calculations, we would get the same $b^*$ for any choice of $i$ with $\alpha_i^* \in \left(0, \frac{c}{n}\right)$. With numerical error, however, it will be more robust to average over all eligible $i$'s:

$$b^* = \operatorname{mean}\left\{ y_i - x_i^T w^* \mid \alpha_i^* \in \left(0, \tfrac{c}{n}\right) \right\}.$$

If there are no $\alpha_i^* \in \left(0, \frac{c}{n}\right)$, then we have a degenerate SVM training problem, for which $w^* = 0$, and we always predict the majority class. This is shown in Rifkin et al.'s "A Note on Support Vector Machine Degeneracy", an MIT AI Lab Technical Report.
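A short sketch of this recipe, consistent with the numpy/cvxpy examples above (the name `recover_bias` and the tolerance are illustrative):

```python
import numpy as np

def recover_bias(alpha, X, y, w, c, tol=1e-6):
    """Average y_i - x_i^T w* over the i with alpha_i^* strictly inside (0, c/n)."""
    interior = (alpha > tol) & (alpha < c / len(y) - tol)
    if not interior.any():
        # Degenerate case: w* = 0 and we predict the majority class,
        # so no bias can be recovered from the margin conditions.
        return None
    return float(np.mean(y[interior] - X[interior] @ w))
```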