LECTURE 7
Support vector machines

SVMs have been used in a multitude of applications and are one of the most popular machine learning algorithms. We will derive the SVM algorithm from two perspectives: Tikhonov regularization, and the more common geometric perspective. We will focus on the linear SVM.

7.1. SVMs from Tikhonov regularization

We start with Tikhonov regularization
$$\min_{\beta \in \mathbb{R}^p} \left[ \frac{1}{n} \sum_{i=1}^n V(y_i, \beta^T x_i) + \lambda \|\beta\|^2 \right]$$
and use the hinge loss functional
$$V(f, z_i) := (1 - y_i \beta^T x_i)_+,$$
where $(k)_+ := \max(k, 0)$.

Figure 1. The hinge loss $(1 - y f(x))_+$, plotted against $y \cdot f(x)$.

The resulting optimization problem is
$$(7.1) \qquad \min_{\beta \in \mathbb{R}^p} \left[ \frac{1}{n} \sum_{i=1}^n (1 - y_i \beta^T x_i)_+ + \lambda \|\beta\|^2 \right].$$
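To make the objective in (7.1) concrete, here is a minimal numerical sketch, assuming NumPy; the function name `hinge_risk` and the toy data are illustrative choices, not from these notes.

```python
import numpy as np

def hinge_risk(beta, X, y, lam):
    """Tikhonov-regularized empirical hinge risk, as in (7.1):
    (1/n) * sum_i (1 - y_i beta^T x_i)_+ + lam * ||beta||^2."""
    margins = 1.0 - y * (X @ beta)          # 1 - y_i beta^T x_i
    hinge = np.maximum(margins, 0.0)        # (k)_+ := max(k, 0)
    return hinge.mean() + lam * beta @ beta

# Toy check: two separable points and a direction with margin >= 1,
# so only the regularization term contributes.
X = np.array([[1.0, 0.0], [-1.0, 0.0]])
y = np.array([1.0, -1.0])
print(hinge_risk(np.array([2.0, 0.0]), X, y, lam=0.1))  # -> 0.4
```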
The objective in (7.1) is non-differentiable at $(1 - y_i f(x_i)) = 0$, so we introduce slack variables and write the following constrained optimization problem:
$$\min_{\beta \in \mathbb{R}^p, \, \xi \in \mathbb{R}^n} \frac{1}{n} \sum_{i=1}^n \xi_i + \lambda \|\beta\|^2$$
$$\text{subject to:} \quad y_i \beta^T x_i \ge 1 - \xi_i, \quad i = 1, \ldots, n$$
$$\xi_i \ge 0, \quad i = 1, \ldots, n.$$

The SVM contains an unregularized bias term $b$ so the separating hyperplane need not go through the origin. Plugging this form into the above constrained quadratic problem results in the primal SVM
$$\min_{\beta \in \mathbb{R}^p, \, \xi \in \mathbb{R}^n, \, b \in \mathbb{R}} \frac{1}{n} \sum_{i=1}^n \xi_i + \lambda \|\beta\|^2$$
$$\text{subject to:} \quad y_i (\beta^T x_i + b) \ge 1 - \xi_i, \quad i = 1, \ldots, n$$
$$\xi_i \ge 0, \quad i = 1, \ldots, n.$$

Note the following trick: one can reparameterize
$$\beta = \sum_{j=1}^n c_j x_j,$$
which is an advantageous representation since one now only needs $n$ variables to parameterize $\beta$ rather than $p$. So we now rewrite the optimization problem as
$$\min_{c \in \mathbb{R}^n, \, \xi \in \mathbb{R}^n, \, b \in \mathbb{R}} \frac{1}{n} \sum_{i=1}^n \xi_i + \lambda \sum_{i,j} c_i c_j x_i^T x_j$$
$$\text{subject to:} \quad y_i \Big( \sum_j c_j x_j^T x_i + b \Big) \ge 1 - \xi_i, \quad i = 1, \ldots, n$$
$$\xi_i \ge 0, \quad i = 1, \ldots, n.$$

We now derive the Wolfe dual quadratic program using Lagrange multiplier techniques:
$$L(c, \xi, b, \alpha, \zeta) = \frac{1}{n} \sum_{i=1}^n \xi_i + \lambda \sum_{i,j} c_i c_j x_i^T x_j - \sum_{i=1}^n \alpha_i \Big( y_i \Big\{ \sum_j c_j x_j^T x_i + b \Big\} - 1 + \xi_i \Big) - \sum_{i=1}^n \zeta_i \xi_i.$$
We want to minimize $L$ with respect to $c$, $b$, and $\xi$, and maximize $L$ with respect to $\alpha$ and $\zeta$, subject to the constraints of the primal problem and nonnegativity constraints on $\alpha$ and $\zeta$. We first eliminate $b$ and $\xi$ by taking partial derivatives:
$$\frac{\partial L}{\partial b} = 0 \implies \sum_{i=1}^n \alpha_i y_i = 0$$
$$\frac{\partial L}{\partial \xi_i} = 0 \implies \frac{1}{n} - \alpha_i - \zeta_i = 0 \implies 0 \le \alpha_i \le \frac{1}{n}.$$
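As an aside, the non-differentiable objective (7.1) can also be minimized directly by subgradient descent, without introducing slack variables; at the optimum, $\xi_i = (1 - y_i \beta^T x_i)_+$ recovers the slacks of the constrained form. A minimal sketch, assuming NumPy; the step size, iteration count, and toy data are made up and not from these notes.

```python
import numpy as np

def svm_subgradient(X, y, lam, steps=2000, eta=0.01):
    """Minimize (1/n) sum_i (1 - y_i beta^T x_i)_+ + lam ||beta||^2
    by subgradient descent on the hinge loss."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(steps):
        active = (y * (X @ beta)) < 1.0        # points with positive hinge loss
        # Subgradient: -(1/n) sum_{active} y_i x_i + 2 lam beta
        g = -(y[active, None] * X[active]).sum(axis=0) / n + 2 * lam * beta
        beta -= eta * g
    return beta

X = np.array([[1.5, 0.2], [1.0, -0.3], [-1.2, 0.1], [-0.8, 0.4]])
y = np.array([1.0, 1.0, -1.0, -1.0])
beta = svm_subgradient(X, y, lam=0.05)
print(np.sign(X @ beta))  # should match y on this separable toy set
```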
The two stationarity conditions derived above will be constraints that have to be satisfied at optimality. This results in a reduced Lagrangian:
$$L_R(c, \alpha) = \lambda \sum_{i,j} c_i c_j x_i^T x_j - \sum_{i=1}^n \alpha_i \Big( y_i \sum_j c_j x_j^T x_i - 1 \Big).$$
We now eliminate $c$:
$$\frac{\partial L_R}{\partial c_i} = 0 \implies c_i = \frac{\alpha_i y_i}{2\lambda}.$$
Substituting the above expression for $c$ into the reduced Lagrangian, we are left with the following dual program:
$$\max_{\alpha \in \mathbb{R}^n} \sum_{i=1}^n \alpha_i - \frac{1}{4\lambda} \alpha^T Q \alpha$$
$$\text{subject to:} \quad \sum_{i=1}^n y_i \alpha_i = 0$$
$$0 \le \alpha_i \le \frac{1}{n}, \quad i = 1, \ldots, n,$$
where $Q$ is the matrix defined by
$$Q_{ij} = y_i y_j x_i^T x_j.$$

In most of the SVM literature, instead of the regularization parameter $\lambda$, regularization is controlled via a parameter $C$, defined using the relationship
$$C = \frac{1}{2\lambda n}.$$
Like $\lambda$, the parameter $C$ also controls the trade-off between classification accuracy and the norm of the function. The primal and dual problems become respectively:
$$\min_{c \in \mathbb{R}^n, \, \xi \in \mathbb{R}^n, \, b \in \mathbb{R}} C \sum_{i=1}^n \xi_i + \frac{1}{2} \sum_{i,j} c_i c_j x_i^T x_j$$
$$\text{subject to:} \quad y_i \Big( \sum_{j=1}^n c_j x_j^T x_i + b \Big) \ge 1 - \xi_i, \quad i = 1, \ldots, n$$
$$\xi_i \ge 0, \quad i = 1, \ldots, n$$
and
$$\max_{\alpha \in \mathbb{R}^n} \sum_{i=1}^n \alpha_i - \frac{1}{2} \alpha^T Q \alpha$$
$$\text{subject to:} \quad \sum_{i=1}^n y_i \alpha_i = 0$$
$$0 \le \alpha_i \le C, \quad i = 1, \ldots, n.$$

7.2. SVMs from a geometric perspective

The traditional approach to developing the mathematics of SVM is to start with the concepts of separating hyperplanes and margin. The theory is usually developed in a linear space, beginning with the idea of a perceptron, a linear hyperplane that separates the positive and the negative examples. Defining the margin as the distance from the hyperplane to the nearest example, the basic observation is that intuitively, we expect a hyperplane with larger margin to generalize better than one with smaller margin. We denote our hyperplane by $w$, and we will classify a new point $x$ via the function
$$(7.2) \qquad f(x) = \operatorname{sign}[\langle w, x \rangle].$$
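Before developing the geometric view, here is a sketch of solving the $C$-parameterized dual above with an off-the-shelf solver. The notes do not prescribe a solver; SciPy's SLSQP is one plausible choice for small problems, and the toy data is made up.

```python
import numpy as np
from scipy.optimize import minimize

def solve_svm_dual(X, y, C):
    """Maximize sum_i alpha_i - (1/2) alpha^T Q alpha
    subject to sum_i y_i alpha_i = 0 and 0 <= alpha_i <= C,
    where Q_ij = y_i y_j x_i^T x_j."""
    n = X.shape[0]
    Q = (y[:, None] * X) @ (y[:, None] * X).T      # Q_ij = y_i y_j x_i^T x_j
    objective = lambda a: 0.5 * a @ Q @ a - a.sum()  # negate to minimize
    res = minimize(objective, np.zeros(n), method="SLSQP",
                   bounds=[(0.0, C)] * n,
                   constraints=[{"type": "eq", "fun": lambda a: a @ y}])
    return res.x

X = np.array([[2.0, 0.0], [1.0, 1.0], [-1.0, -1.0], [-2.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
alpha = solve_svm_dual(X, y, C=1.0)
print(np.round(alpha, 3))   # nonzero entries mark the support vectors
```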
Figure 2. Two hyperplanes (a) and (b) perfectly separate the data. However, hyperplane (b) has a larger margin and intuitively would be expected to be more accurate on new observations.

Given a separating hyperplane $w$, we let $x$ be a datapoint closest to $w$, and we let $x_w$ be the unique point on $w$ that is closest to $x$. Obviously, finding a maximum margin $w$ is equivalent to maximizing $\|x - x_w\|$. So for some $k$ (assume $k > 0$ for convenience),
$$\langle w, x \rangle = k$$
$$\langle w, x_w \rangle = 0$$
$$\langle w, (x - x_w) \rangle = k.$$
Noting that the vector $x - x_w$ is parallel to the normal vector $w$,
$$\langle w, (x - x_w) \rangle = \Big\langle w, \frac{\|x - x_w\|}{\|w\|} \, w \Big\rangle = \|w\|^2 \, \frac{\|x - x_w\|}{\|w\|} = \|w\| \, \|x - x_w\|,$$
so
$$\|w\| \, \|x - x_w\| = k \implies \|x - x_w\| = \frac{k}{\|w\|}.$$
Here $k$ is a nuisance parameter, and without any loss of generality we fix $k$ to 1, and see that maximizing $\|x - x_w\|$ is equivalent to maximizing $\frac{1}{\|w\|}$, which in turn is equivalent to minimizing $\|w\|$ or $\|w\|^2$. We can now define the margin as the distance between the hyperplanes $\langle w, x \rangle = 0$ and $\langle w, x \rangle = 1$. So if the data is linearly separable and the hyperplane runs through the origin, the maximum margin hyperplane is the one for which
$$\min_{w \in \mathbb{R}^n} \|w\|^2$$
$$\text{subject to:} \quad y_i \langle w, x_i \rangle \ge 1, \quad i = 1, \ldots, n.$$
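The distance calculation above is easy to check numerically; a small sketch assuming NumPy, with a made-up normal vector $w$ and point $x$.

```python
import numpy as np

w = np.array([3.0, 4.0])          # normal vector of the hyperplane <w, x> = 0
x = np.array([1.0, 2.0])          # a point off the hyperplane

k = w @ x                          # <w, x> = k
x_w = x - (k / (w @ w)) * w        # orthogonal projection of x onto the hyperplane
print(np.linalg.norm(x - x_w))     # distance ||x - x_w||   -> 2.2
print(k / np.linalg.norm(w))       # equals k / ||w||, as derived above -> 2.2
```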
The SVM introduced by Vapnik includes an unregularized bias term $b$, leading to classification via a function of the form:
$$f(x) = \operatorname{sign}[\langle w, x \rangle + b].$$
In addition, we need to work with datasets that are not linearly separable, so we introduce slack variables $\xi_i$, just as before. We can still define the margin as the distance between the hyperplanes $\langle w, x \rangle = 0$ and $\langle w, x \rangle = 1$, but the geometric intuition is no longer as clear or compelling. With the bias term and slack variables the primal SVM problem becomes
$$\min_{w \in \mathbb{R}^n, \, b \in \mathbb{R}, \, \xi \in \mathbb{R}^n} C \sum_{i=1}^n \xi_i + \frac{1}{2} \|w\|^2$$
$$\text{subject to:} \quad y_i (\langle w, x_i \rangle + b) \ge 1 - \xi_i, \quad i = 1, \ldots, n$$
$$\xi_i \ge 0, \quad i = 1, \ldots, n.$$
Using Lagrange multipliers we can derive the same dual form as in the previous section.

Historically, most developments began with the geometric form, derived a dual program identical to the dual we derived above, and only then observed that the dual program required only dot products and that these dot products could be replaced with a kernel function. In the linearly separable case, we can also derive the separating hyperplane as a vector parallel to the vector connecting the closest two points in the positive and negative classes, passing through the perpendicular bisector of this vector. This was the Method of Portraits, derived by Vapnik in the 1970s, and recently rediscovered (with non-separable extensions) by Keerthi.

7.3. Optimality conditions

The primal and the dual are both feasible convex quadratic programs. Therefore, they both have optimal solutions, and optimal solutions to the primal and the dual have the same objective value. We derived the dual from the primal using the (now reparameterized) Lagrangian:
$$L(c, \xi, b, \alpha, \zeta) = C \sum_{i=1}^n \xi_i + \frac{1}{2} \sum_{i,j} c_i c_j x_i^T x_j - \sum_{i=1}^n \alpha_i \Big( y_i \Big\{ \sum_{j=1}^n c_j x_j^T x_i + b \Big\} - 1 + \xi_i \Big) - \sum_{i=1}^n \zeta_i \xi_i.$$
We now consider the dual variables associated with the primal constraints:
$$\alpha_i \implies y_i \Big\{ \sum_{j=1}^n c_j x_j^T x_i + b \Big\} - 1 + \xi_i \ge 0$$
$$\zeta_i \implies \xi_i \ge 0.$$
Complementary slackness tells us that at optimality, either the primal inequality is satisfied at equality or the dual variable is zero. In other words, if $c$, $\xi$, $b$, $\alpha$, and $\zeta$
are optimal solutions to the primal and dual, then
$$\alpha_i \Big( y_i \Big\{ \sum_{j=1}^n c_j x_j^T x_i + b \Big\} - 1 + \xi_i \Big) = 0$$
$$\zeta_i \xi_i = 0.$$
All optimal solutions must satisfy:
$$\sum_{j=1}^n c_j x_i^T x_j - \sum_{j=1}^n y_j \alpha_j x_i^T x_j = 0 \qquad i = 1, \ldots, n$$
$$\sum_{i=1}^n \alpha_i y_i = 0$$
$$C - \alpha_i - \zeta_i = 0 \qquad i = 1, \ldots, n$$
$$y_i \Big( \sum_{j=1}^n y_j \alpha_j x_i^T x_j + b \Big) - 1 + \xi_i \ge 0 \qquad i = 1, \ldots, n$$
$$\alpha_i \Big[ y_i \Big( \sum_{j=1}^n y_j \alpha_j x_i^T x_j + b \Big) - 1 + \xi_i \Big] = 0 \qquad i = 1, \ldots, n$$
$$\zeta_i \xi_i = 0 \qquad i = 1, \ldots, n$$
$$\xi_i, \alpha_i, \zeta_i \ge 0 \qquad i = 1, \ldots, n.$$
The above optimality conditions are both necessary and sufficient. If we have $c$, $\xi$, $b$, $\alpha$, and $\zeta$ satisfying the above conditions, we know that they represent optimal solutions to the primal and dual problems. These optimality conditions are also known as the Karush-Kuhn-Tucker (KKT) conditions.

Suppose we have the optimal $\alpha_i$'s. Also suppose (this always happens in practice) that there exists an $i$ satisfying $0 < \alpha_i < C$. Then
$$\alpha_i < C \implies \zeta_i > 0 \implies \xi_i = 0 \implies y_i \Big( \sum_{j=1}^n y_j \alpha_j x_i^T x_j + b \Big) - 1 = 0 \implies b = y_i - \sum_{j=1}^n y_j \alpha_j x_i^T x_j.$$
So if we know the optimal $\alpha$'s, we can determine $b$. Defining our classification function $f(x)$ as
$$f(x) = \sum_{i=1}^n y_i \alpha_i \, x^T x_i + b,$$
we can derive reduced optimality conditions.
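Recovering $b$ from any margin support vector ($0 < \alpha_i < C$) and evaluating $f$ follow the conditions above directly. A sketch assuming NumPy, reusing `solve_svm_dual` from the earlier sketch; the tolerance `tol` is a made-up numerical guard.

```python
import numpy as np

def recover_b_and_f(X, y, alpha, C, tol=1e-6):
    """b = y_i - sum_j y_j alpha_j x_i^T x_j for any i with 0 < alpha_i < C;
    then f(x) = sum_i y_i alpha_i x^T x_i + b.
    Assumes a free support vector exists, as the notes assume."""
    free = np.where((alpha > tol) & (alpha < C - tol))[0]
    i = free[0]                                   # any margin support vector works
    b = y[i] - (y * alpha) @ (X @ X[i])
    f = lambda x: (y * alpha) @ (X @ x) + b
    return b, f

# Usage with the dual solution from the earlier sketch:
# b, f = recover_b_and_f(X, y, alpha, C=1.0)
# print(np.sign(f(X[0])))  # should equal y[0]
```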
For example, consider an $i$ such that $y_i f(x_i) < 1$:
$$y_i f(x_i) < 1 \implies \xi_i > 0 \implies \zeta_i = 0 \implies \alpha_i = C.$$
Conversely, suppose $\alpha_i = C$:
$$\alpha_i = C \implies y_i f(x_i) - 1 + \xi_i = 0 \implies y_i f(x_i) \le 1.$$

Figure 3. A geometric interpretation of the reduced optimality conditions. The open squares and circles correspond to cases where $\alpha_i = 0$. The dark circles and squares correspond to cases where $y_i f(x_i) = 1$ and $\alpha_i \le C$; these are samples at the margin. The grey circles and squares correspond to cases where $y_i f(x_i) < 1$ and $\alpha_i = C$.

7.4. Solving the SVM optimization problem

Our plan will be to solve the dual problem to find the $\alpha$'s, and use them to find $b$ and our function $f$. The dual problem is easier to solve than the primal problem: it has simple box constraints and a single equality constraint, and, even better, we will see that the problem can be decomposed into a sequence of smaller problems.

We can solve QPs using standard software, and many codes are available. The main problem is that the $Q$ matrix is dense and is $n \times n$, so for large datasets we cannot even write it down. Standard QP software requires the $Q$ matrix, so it is not suitable for large problems. To get around this memory issue we partition the dataset into a working set $W$ and the remaining points $R$. We can rewrite the dual problem as:
$$\max_{\alpha_W \in \mathbb{R}^{|W|}, \, \alpha_R \in \mathbb{R}^{|R|}} \; \sum_{i \in W} \alpha_i + \sum_{i \in R} \alpha_i - \frac{1}{2} \, [\alpha_W \; \alpha_R] \begin{bmatrix} Q_{WW} & Q_{WR} \\ Q_{RW} & Q_{RR} \end{bmatrix} \begin{bmatrix} \alpha_W \\ \alpha_R \end{bmatrix}$$
$$\text{subject to:} \quad \sum_{i \in W} y_i \alpha_i + \sum_{i \in R} y_i \alpha_i = 0$$
$$0 \le \alpha_i \le C, \quad \forall i.$$
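The decomposition developed next repeatedly needs to test the reduced optimality conditions from Section 7.3. A sketch of such a checker, assuming NumPy; the tolerance `tol` is a made-up numerical guard, not from the notes.

```python
import numpy as np

def kkt_violations(X, y, alpha, b, C, tol=1e-3):
    """Flag points violating the reduced optimality conditions:
    alpha_i = 0      requires y_i f(x_i) >= 1,
    0 < alpha_i < C  requires y_i f(x_i) == 1,
    alpha_i = C      requires y_i f(x_i) <= 1."""
    yf = y * ((y * alpha) @ (X @ X.T) + b)        # y_i f(x_i) for all i
    lower = (alpha < tol) & (yf < 1 - tol)        # alpha = 0 but margin violated
    free  = (alpha > tol) & (alpha < C - tol) & (np.abs(yf - 1) > tol)
    upper = (alpha > C - tol) & (yf > 1 + tol)    # alpha = C but margin exceeded
    return lower | free | upper
```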
Suppose we have a feasible solution $\alpha$. We can get a better solution by treating $\alpha_W$ as variable and $\alpha_R$ as constant. We can solve the reduced dual problem:
$$\max_{\alpha_W \in \mathbb{R}^{|W|}} \; (1 - Q_{WR} \alpha_R)^T \alpha_W - \frac{1}{2} \alpha_W^T Q_{WW} \alpha_W$$
$$\text{subject to:} \quad \sum_{i \in W} y_i \alpha_i = - \sum_{i \in R} y_i \alpha_i$$
$$0 \le \alpha_i \le C, \quad \forall i \in W.$$
The reduced problems are of fixed size and can be solved using a standard QP code. Convergence proofs are difficult, but this approach seems to always converge to an optimal solution in practice.

An important issue in the decomposition is selecting the working set, and there are many different approaches. The basic idea is to examine points not in the working set, find points which violate the reduced optimality conditions, and add them to the working set, while removing points which are in the working set but are far from violating the optimality conditions. A sketch of this loop follows.
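Putting the pieces together, here is a skeleton of the decomposition loop under the heuristics just described. It reuses `kkt_violations` from the earlier sketch and solves each reduced dual with SciPy's SLSQP; the working-set size `q`, the crude "first violators" selection rule, and the tolerances are illustrative assumptions only (practical implementations use far more careful working-set heuristics).

```python
import numpy as np
from scipy.optimize import minimize

def decomposition_svm(X, y, C, q=2, max_iter=50):
    """Decomposition loop: repeatedly optimize over a small working set W
    chosen among violators of the reduced optimality conditions."""
    n = X.shape[0]
    Q = (y[:, None] * X) @ (y[:, None] * X).T
    alpha, b = np.zeros(n), 0.0
    for _ in range(max_iter):
        viol = kkt_violations(X, y, alpha, b, C)       # sketch from Section 7.3
        if not viol.any():
            break                                       # all KKT conditions hold
        W = np.where(viol)[0][:q]                       # crude working-set choice
        R = np.setdiff1d(np.arange(n), W)
        lin = 1.0 - Q[np.ix_(W, R)] @ alpha[R]          # (1 - Q_WR alpha_R)
        obj = lambda aW: 0.5 * aW @ Q[np.ix_(W, W)] @ aW - lin @ aW
        con = {"type": "eq", "fun": lambda aW: aW @ y[W] + alpha[R] @ y[R]}
        res = minimize(obj, alpha[W], method="SLSQP",
                       bounds=[(0.0, C)] * len(W), constraints=[con])
        alpha[W] = res.x
        # Re-estimate b from any free support vector, if one exists.
        free = np.where((alpha > 1e-6) & (alpha < C - 1e-6))[0]
        if free.size:
            i = free[0]
            b = y[i] - (y * alpha) @ (X @ X[i])
    return alpha, b
```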