Support Vector Machines

MIT 15.097 Course Notes
Cynthia Rudin

Credit: Ng, Hastie, Tibshirani, Friedman
Thanks: Şeyda Ertekin

Let's start with some intuition about margins. The margin of an example x_i is its distance from the decision boundary:

    margin of x_i = distance from example to decision boundary = y_i f(x_i).

The margin is positive if the example is on the correct side of the decision boundary, otherwise it's negative.

Here's the intuition for SVMs: we want all examples to have large margins, that is, we want them to be as far from the decision boundary as possible. That way, the decision boundary is more stable, and we are confident in all decisions.
Most other algorithms (logistic regression, decision trees, perceptron) don't generally produce large margins. (AdaBoost generally produces large margins.)

As in logistic regression and AdaBoost, the function f is linear:

    f(x) = Σ_j λ^(j) x^(j) + λ_0.

Note that the intercept term can get swept into x by adding a 1 as the last component of each x. Then f(x) would be just λ^T x, but for this lecture we'll keep the intercept term separate, because the SVM handles that term differently than if you put the intercept in as a separate feature.

We classify x using sign(f(x)). If x_i has a large margin, we are confident that we classified it correctly. So we're essentially suggesting to use the margin y_i f(x_i) to measure the confidence in our prediction.

But there is a problem with using y_i f(x_i) to measure confidence in prediction: there is some arbitrariness about it. How? What should we do about it?
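The arbitrariness is easy to see numerically. The following sketch (not part of the original notes; the weights and examples are made up) shows that rescaling λ and λ_0 by a positive constant leaves every prediction sign(f(x)) unchanged while rescaling every value y_i f(x_i):

```python
import numpy as np

# Made-up weights and examples, purely to illustrate the scaling issue.
lam = np.array([1.0, -2.0])
lam0 = 0.5
X = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 1.0]])
y = np.array([1.0, -1.0, 1.0])

f = X @ lam + lam0            # f(x_i) = lam^T x_i + lam_0
margins = y * f               # positive iff example i is classified correctly

# Rescaling (lam, lam0) by any c > 0 leaves every prediction sign(f(x))
# unchanged, but rescales every value y_i f(x_i) by c as well, so the raw
# value of y_i f(x_i) is an arbitrary measure of confidence.
c = 2.0
f_scaled = X @ (c * lam) + c * lam0
assert np.array_equal(np.sign(f), np.sign(f_scaled))   # same classifier
assert np.allclose(y * f_scaled, c * margins)          # margins all doubled
```

This is exactly the arbitrariness the normalization below removes.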
SVMs maximize the distance from the decision boundary to the nearest training example: they maximize the minimum margin.

There is a geometric perspective too. I set the intercept to zero for this picture (so the decision boundary passes through the origin): the decision boundary is the set of x's where λ^T x = 0. That means the unit vector λ/∥λ∥ must be perpendicular to those x's that lie on the decision boundary.

Now that you have the intuition, we'll put the intercept back, and we have to translate the decision boundary: it's really the set of x's where λ^T x = -λ_0.

The margin of example i is denoted γ_i. Let B be the point on the decision boundary closest to the positive example x_i. Then

    B = x_i - γ_i (λ/∥λ∥),
since we moved γ_i units along the unit vector to get from the example to B. Since B lies on the decision boundary, it obeys λ^T x + λ_0 = 0, where x is B. (I wrote the intercept there explicitly.) So,

    λ^T (x_i - γ_i (λ/∥λ∥)) + λ_0 = 0.

Simplifying,

    λ^T x_i - γ_i (λ^T λ)/∥λ∥ + λ_0 = 0

    γ_i = (λ/∥λ∥)^T x_i + λ_0/∥λ∥ =: f̃(x_i)   (this is the normalized version of f)
        = y_i f̃(x_i)   since y_i = 1.

Note that here we normalized so we wouldn't have the arbitrariness in the meaning of the margin. If the example is negative, the same calculation works, with a few sign flips (we'd need to move -γ_i units rather than γ_i units). So the geometric margin from the picture is the same as the functional margin y_i f̃(x_i).

Maximize the Minimum Margin

Support vector machines maximize the minimum margin. They would like to have all examples be far from the decision boundary. So they'll choose f this way:

    max_{γ,λ,λ_0} γ   s.t.  y_i f̃(x_i) ≥ γ,  i = 1, ..., m

    max_{γ,λ,λ_0} γ   s.t.  y_i (λ^T x_i + λ_0)/∥λ∥ ≥ γ,  i = 1, ..., m

    max_{γ,λ,λ_0} γ   s.t.  y_i (λ^T x_i + λ_0) ≥ γ∥λ∥,  i = 1, ..., m.
For any λ and λ_0 that satisfy this, any positively scaled multiple satisfies them too, so we can arbitrarily set ∥λ∥ = 1/γ so that the right side is 1. Now when we maximize γ, we're maximizing γ = 1/∥λ∥. So we have

    max_{λ,λ_0} 1/∥λ∥   s.t.  y_i (λ^T x_i + λ_0) ≥ 1,  i = 1, ..., m.

Equivalently,

    min_{λ,λ_0} (1/2)∥λ∥²   s.t.  y_i (λ^T x_i + λ_0) ≥ 1,  i = 1, ..., m   (1)

(the 1/2 and the square are just for convenience), which is the same as:

    min_{λ,λ_0} (1/2)∥λ∥²   s.t.  -y_i (λ^T x_i + λ_0) + 1 ≤ 0,  i = 1, ..., m,

leading to the Lagrangian

    L([λ, λ_0], α) = (1/2) Σ_j (λ^(j))² + Σ_{i=1}^m α_i [-y_i (λ^T x_i + λ_0) + 1].

Writing the KKT conditions, starting with Lagrangian stationarity, where we need to find the gradient w.r.t. λ and the derivative w.r.t. λ_0:

    ∇_λ L([λ, λ_0], α) = λ - Σ_{i=1}^m α_i y_i x_i = 0   ⟹   λ = Σ_{i=1}^m α_i y_i x_i.

    ∂L([λ, λ_0], α)/∂λ_0 = -Σ_{i=1}^m α_i y_i = 0   ⟹   Σ_{i=1}^m α_i y_i = 0.

    α_i ≥ 0  ∀i                                      (dual feasibility)
    α_i [-y_i (λ^T x_i + λ_0) + 1] = 0  ∀i           (complementary slackness)
    -y_i (λ^T x_i + λ_0) + 1 ≤ 0  ∀i                 (primal feasibility)
Using the KKT conditions, we can simplify the Lagrangian:

    L([λ, λ_0], α) = (1/2)∥λ∥² - λ^T (Σ_{i=1}^m α_i y_i x_i) - (Σ_{i=1}^m α_i y_i) λ_0 + Σ_{i=1}^m α_i.

(We just expanded terms. Now we'll plug in the first KKT condition.)

    = (1/2)∥λ∥² - ∥λ∥² - λ_0 Σ_{i=1}^m α_i y_i + Σ_{i=1}^m α_i

(Plug in the second KKT condition.)

    = -(1/2) Σ_j (λ^(j))² + 0 + Σ_i α_i.   (2)

Again using the first KKT condition, we can rewrite the first term:

    (1/2) Σ_j (λ^(j))² = (1/2) Σ_j (Σ_{i=1}^m α_i y_i x_i^(j))²
                       = (1/2) Σ_j Σ_{i=1}^m Σ_{k=1}^m α_i α_k y_i y_k x_i^(j) x_k^(j)
                       = (1/2) Σ_{i=1}^m Σ_{k=1}^m α_i α_k y_i y_k x_i^T x_k.

Plugging back into the Lagrangian (2), which now only depends on α, and putting in the second and third KKT conditions gives us the dual problem:

    max_α L(α)   where   L(α) = Σ_{i=1}^m α_i - (1/2) Σ_{i,k} α_i α_k y_i y_k x_i^T x_k
    s.t.  α_i ≥ 0,  i = 1, ..., m,   and   Σ_{i=1}^m α_i y_i = 0.   (3)

We'll use the last two KKT conditions in what follows, for instance to get conditions on λ_0, but what we've already done is enough to define the dual problem for α. We can solve this dual problem. Either (i) we'd use a generic quadratic programming solver, or (ii) use another algorithm, like SMO, which I will discuss
later. For now, assume we solved it. So we have α*_1, ..., α*_m.

We can use the solution of the dual problem to get the solution of the primal problem. We can plug α* into the first KKT condition to get

    λ* = Σ_{i=1}^m α*_i y_i x_i.   (4)

We still need to get λ*_0, but we can see something cool in the process.

Support Vectors

Look at the complementary slackness KKT condition and the primal and dual feasibility conditions:

    α*_i > 0   ⟹   -y_i (λ*^T x_i + λ*_0) + 1 = 0,  i.e.,  y_i (λ*^T x_i + λ*_0) = 1
    α*_i < 0   (can't happen, by dual feasibility)
    -y_i (λ*^T x_i + λ*_0) + 1 < 0   ⟹   α*_i = 0
    -y_i (λ*^T x_i + λ*_0) + 1 > 0   (can't happen, by primal feasibility)

Define the optimal (scaled) scoring function f*(x_i) = λ*^T x_i + λ*_0. Then

    α*_i > 0   ⟹   y_i f*(x_i) = 1   (the scaled margin is 1)
    1 < y_i f*(x_i)   ⟹   α*_i = 0.

The examples in the first category, for which the scaled margin is 1 and the constraints are active, are called support vectors. They are the closest to the decision boundary.
[Figure: the support vectors lie on the margin, closest to the decision boundary. Image by MIT OpenCourseWare.]

Finish What We Were Doing Earlier

To get λ*_0, use the complementarity condition for any of the support vectors (in other words, use the fact that the unnormalized margin of the support vectors is one):

    1 = y_i (λ*^T x_i + λ*_0).

If you take a positive support vector, y_i = 1, then λ*_0 = 1 - λ*^T x_i. Written another way, since the support vectors have the smallest margins,

    λ*_0 = 1 - min_{i : y_i = 1} λ*^T x_i.

So that's the solution! Just to recap: to get the scoring function f* for the SVM, you'd compute α* from the dual problem (3), plug it into (4) to get λ*, plug that into the equation above to get λ*_0, and that's the solution to the primal problem and the coefficients for f*.
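The recap can be made concrete with route (i), a generic quadratic programming solver. The sketch below (not from the original notes; it uses SciPy's SLSQP solver on a tiny made-up separable dataset) solves the dual (3), then recovers λ* via (4) and λ*_0 via the equation above:

```python
import numpy as np
from scipy.optimize import minimize

# Tiny, made-up, linearly separable dataset (labels in {-1, +1}).
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
m = len(y)

# Q[i,k] = y_i y_k x_i^T x_k, the quadratic part of the dual (3).
Yx = y[:, None] * X
Q = Yx @ Yx.T

def neg_dual(a):
    # Minimize the negative of L(alpha) = sum_i a_i - (1/2) a^T Q a.
    return -(a.sum() - 0.5 * a @ Q @ a)

res = minimize(
    neg_dual,
    x0=np.zeros(m),
    bounds=[(0.0, None)] * m,                              # alpha_i >= 0
    constraints=[{"type": "eq", "fun": lambda a: a @ y}],  # sum_i alpha_i y_i = 0
)
alpha = res.x

lam = (alpha * y) @ X                  # eq. (4): lambda* = sum_i alpha_i y_i x_i
lam0 = 1.0 - np.min(X[y == 1] @ lam)   # lambda*_0 from a positive support vector

# The resulting scoring function separates the training data.
assert np.array_equal(np.sign(X @ lam + lam0), y)
```

Only a few of the recovered α_i come out nonzero: those are the support vectors.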
Because of the form of the solution

    λ* = Σ_{i=1}^m α*_i y_i x_i,

it is possible that λ* is very fast to calculate. Why is that? Think support vectors.

The Nonseparable Case

If there is no separating hyperplane, there is no feasible solution to the problem we wrote above. Most real problems are nonseparable. Let's fix our SVM so it can accommodate the nonseparable case. The new formulation will penalize mistakes the farther they are from the decision boundary. So we are allowed to make mistakes now, but we pay a price.
Let's change our primal problem (1) to this new primal problem:

    min_{λ,λ_0,ξ} (1/2)∥λ∥² + C Σ_{i=1}^m ξ_i   s.t.  y_i (λ^T x_i + λ_0) ≥ 1 - ξ_i  ∀i,   ξ_i ≥ 0  ∀i.   (5)

So the constraints allow some slack of size ξ_i, but we pay a price for it in the objective. That is, if y_i f(x_i) ≥ 1, then ξ_i gets set to 0 and the penalty is 0. Otherwise, if y_i f(x_i) = 1 - ξ_i, we pay price ξ_i.

The parameter C trades off between the twin goals of making ∥λ∥ small (making what-was-the-minimum-margin 1/∥λ∥ large) and ensuring that most examples have margin at least 1/∥λ∥.

Going on a Little Tangent

Rewrite the penalty another way: if y_i f(x_i) ≥ 1, zero penalty; else, pay price ξ_i = 1 - y_i f(x_i). Third time's the charm: pay price ξ_i = [1 - y_i f(x_i)]_+, where the notation [z]_+ means the maximum of z and 0. Equation (5) becomes:

    min_{λ,λ_0} (1/2)∥λ∥² + C Σ_{i=1}^m [1 - y_i f(x_i)]_+.

Does that look familiar?
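The penalty term [1 - y_i f(x_i)]_+ is straightforward to compute; a minimal sketch (not from the original notes, with made-up scores and labels):

```python
import numpy as np

# Hinge penalty [1 - y_i f(x_i)]_+ : zero when the margin is at least 1,
# otherwise equal to the slack xi_i = 1 - y_i f(x_i) from (5).
def hinge(scores, y):
    return np.maximum(0.0, 1.0 - y * scores)

# Made-up scores f(x_i) and labels: the margins y_i f(x_i) are 2.0, 0.5,
# and -1.0, so the slacks should come out 0.0, 0.5, and 2.0.
scores = np.array([2.0, -0.5, -1.0])
y = np.array([1.0, -1.0, 1.0])
slacks = hinge(scores, y)

C = 1.0
penalty = C * slacks.sum()   # the C * sum_i xi_i term of the objective
```

Note how a correctly classified example (margin 0.5) still pays a price for being inside the margin, while a misclassified one (margin -1.0) pays more.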
The Dual for the Nonseparable Case

The Lagrangian of (5) is:

    L(λ, λ_0, ξ, α, r) = (1/2)∥λ∥² + C Σ_{i=1}^m ξ_i - Σ_{i=1}^m α_i [y_i (λ^T x_i + λ_0) - 1 + ξ_i] - Σ_{i=1}^m r_i ξ_i,

where the α_i's and r_i's are Lagrange multipliers (constrained to be ≥ 0). The dual turns out to be (after some work)

    max_α Σ_{i=1}^m α_i - (1/2) Σ_{i,k=1}^m α_i α_k y_i y_k x_i^T x_k
    s.t.  0 ≤ α_i ≤ C,  i = 1, ..., m,   and   Σ_{i=1}^m α_i y_i = 0.   (6)

So the only difference from the original problem's dual (3) is that 0 ≤ α_i was changed to 0 ≤ α_i ≤ C. Neat!

Solving the Dual Problem with SMO

SMO (Sequential Minimal Optimization) is a type of coordinate ascent algorithm, but adapted to the SVM so that the solution always stays within the feasible region.

Start with (6). Let's say you want to hold α_2, ..., α_m fixed and take a coordinate step in the first direction. That is, change α_1 to maximize the objective in (6). Can we make any progress? Can we get a better feasible solution by doing this?

Turns out, no. Look at the constraint in (6), Σ_{i=1}^m α_i y_i = 0. This means

    α_1 y_1 = -Σ_{i=2}^m α_i y_i,   or, multiplying by y_1,   α_1 = -y_1 Σ_{i=2}^m α_i y_i.

So, since α_2, ..., α_m are fixed, α_1 is also fixed.
So, if we want to update any of the α_i's, we need to update at least 2 of them simultaneously to keep the solution feasible (i.e., to keep the constraints satisfied).

Start with a feasible vector α. Let's update α_1 and α_2, holding α_3, ..., α_m fixed. What values of α_1 and α_2 are we allowed to choose? Again, the constraint is:

    α_1 y_1 + α_2 y_2 = -Σ_{i=3}^m α_i y_i =: ζ   (a fixed constant).

We are only allowed to choose α_1, α_2 on this line, so when we pick α_1, we get α_2 automatically, from

    α_2 = (1/y_2)(ζ - α_1 y_1) = y_2 (ζ - α_1 y_1)

(1/y_2 = y_2 since y_2 ∈ {-1, +1}). Also, the other constraints in (6) say 0 ≤ α_1, α_2 ≤ C. So, α_1 needs to be within [L, H] on the figure (in order for α_2 to stay within [0, C]), where we will always have 0 ≤ L, H ≤ C. To do the coordinate ascent step, we will optimize the objective over α_1, keeping it within [L, H]. Intuitively, (6) becomes:

    max_{α_1 ∈ [L,H]}  α_1 + α_2 - (1/2) Σ_{i,k} α_i α_k y_i y_k x_i^T x_k + constants,
    where α_2 = y_2 (ζ - α_1 y_1).   (7)

The objective is quadratic in α_1. This means we can just set its derivative to 0 to optimize it, and get α_1 for the next iteration of SMO. If the optimal value is
outside of [L, H], just choose α_1 to be either L or H for the next iteration. For instance, if this is a plot of (7)'s objective (sometimes it doesn't look like this; sometimes it's upside-down), then we'll choose the nearer endpoint:

[Figure: the objective of (7) as a function of α_1, with the feasible interval [L, H] marked.]

Note: there are heuristics to choose the order of the α_i's chosen to update.
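The two mechanical pieces of this step, computing the interval [L, H] and clipping the maximizer of a quadratic into it, can be sketched as follows. This is not the full SMO algorithm; the helper names, the box constant C, and the quadratic coefficients are made up for illustration:

```python
# Feasible interval [L, H] for alpha_1 when alpha_1 y_1 + alpha_2 y_2 is held
# constant and alpha_2 must stay inside [0, C], following the two cases
# y_1 = y_2 (the sum alpha_1 + alpha_2 is fixed) and y_1 != y_2 (the
# difference alpha_1 - alpha_2 is fixed).
def alpha1_bounds(a1, a2, y1, y2, C):
    if y1 == y2:
        s = a1 + a2                       # alpha_2 = s - alpha_1 in [0, C]
        return max(0.0, s - C), min(C, s)
    d = a1 - a2                           # alpha_2 = alpha_1 - d in [0, C]
    return max(0.0, d), min(C, C + d)

# One coordinate step: maximize a quadratic g(t) = a t^2 + b t + const over
# [L, H].  In the usual concave case (a < 0), set the derivative to zero and
# clip; otherwise the maximum sits at an endpoint.
def clipped_argmax(a, b, L, H):
    if a < 0:
        t = -b / (2.0 * a)            # unconstrained maximizer of g
        return min(H, max(L, t))      # clip into the feasible interval
    g = lambda t: a * t * t + b * t
    return L if g(L) >= g(H) else H

L, H = alpha1_bounds(0.25, 0.5, 1, 1, C=1.0)       # same labels
assert (L, H) == (0.0, 0.75)
L2, H2 = alpha1_bounds(0.5, 0.25, 1, -1, C=1.0)    # opposite labels
assert (L2, H2) == (0.25, 1.0)

assert clipped_argmax(-1.0, 0.5, 0.0, 1.0) == 0.25  # interior maximum kept
assert clipped_argmax(-1.0, 4.0, 0.0, 1.0) == 1.0   # maximum clipped to H
```

Once α_1 is chosen this way, α_2 follows from α_2 = y_2(ζ - α_1 y_1), and the pair update repeats on other pairs.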
MIT OpenCourseWare
http://ocw.mit.edu

15.097 Prediction: Machine Learning and Statistics
Spring 2012

For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.