Chapter 1 Decomposition methods for Support Vector Machines

Support Vector Machines (SVM) are widely used as a simple and efficient tool for linear and nonlinear classification as well as for regression problems. The basic training principle of SVM, motivated by statistical learning theory, is that the expected classification error for unseen test samples is minimized, so that SVM define good predictive models. In this chapter we first analyze linear classifiers, and we introduce the concept of optimal separating hyperplane, which characterizes SVM models. By Wolfe's duality theory we show that training a linear SVM for classification leads to solving a convex quadratic programming problem with one linear constraint and box constraints. Then we extend the approach to nonlinear SVM for classification. This extension requires the introduction of the so-called kernel functions. Both linear and nonlinear SVM training require the solution of a convex quadratic programming problem. We present optimization methods based on decomposition techniques. The adoption of decomposition techniques is motivated by the fact that, in real applications, the number of training data may be huge, so that the Hessian matrix cannot be stored. We focus on the widely used Sequential Minimal Optimization (SMO) algorithms, which update two variables at each iteration, and we state theoretical convergence results.

1.1 Notation

The Training Set (TS) is a set of observations:

$$TS = \{(x^i, y_i),\ x^i \in X \subseteq R^n,\ y_i \in Y \subseteq R,\ i = 1,\ldots,\ell\}.$$

The vectors $x^i$ are the patterns belonging to the input space. The scalars $y_i$ are the labels (targets). In a classification problem we have $y_i \in \{-1, 1\}$; in a regression problem $y_i \in R$.

1.2 Linear classifiers and the optimal separating hyperplane

Let us consider two disjoint sets A and B of points in $R^n$ to be classified. Assume that A and B are linearly separable, that is, there exists a hyperplane $H = \{x \in R^n : w^T x + b = 0\}$ such that the points $x^i \in A$ belong to one half space and the points $x^j \in B$ belong to the other half space. Then, there exist a vector $w \in R^n$ and a scalar $b \in R$ such that

$$w^T x^i + b \ge \varepsilon \ \ \forall x^i \in A, \qquad w^T x^j + b \le -\varepsilon \ \ \forall x^j \in B, \qquad (1.1)$$

where $\varepsilon > 0$. Dividing by $\varepsilon$ we can write

$$w^T x^i + b \ge 1 \ \ \forall x^i \in A, \qquad w^T x^j + b \le -1 \ \ \forall x^j \in B. \qquad (1.2)$$

A hyperplane will be indicated by $H(w,b)$. We say that $H(w,b)$ is a separating hyperplane if the pair $(w,b)$ satisfies (1.2). The decision function of a linear classifier associated with a separating hyperplane is $f(x) = \mathrm{sgn}(w^T x + b)$.

We introduce the concept of margin of a separating hyperplane (see Fig. 1.1).

Definition 1.1. Let $H(w,b)$ be a separating hyperplane. The margin of $H(w,b)$ is the minimum distance $\rho$ between the points in $A \cup B$ and the hyperplane $H(w,b)$, that is

$$\rho(w,b) = \min_{x^i \in A \cup B} \left\{ \frac{|w^T x^i + b|}{\|w\|} \right\}.$$

[Fig. 1.1: Margin of a separating hyperplane]

It is quite intuitive that the margin of a given separating hyperplane is related to the generalization capability of the corresponding linear classifier. For instance, observing Figure 1.2, we may expect that the hyperplane $H(w,b)$ leads to a better linear classifier than the one associated with the hyperplane $H(\hat w, \hat b)$.
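As a minimal numerical sketch of Definition 1.1 (the data and the hyperplane below are illustrative, not taken from the text):

```python
import numpy as np

def margin(w, b, points):
    """Margin of the separating hyperplane H(w, b) over a set of points
    (Definition 1.1): the minimum of |w^T x + b| / ||w||."""
    w = np.asarray(w, dtype=float)
    return min(abs(w @ x + b) / np.linalg.norm(w) for x in points)

# Illustrative linearly separable sets in R^2.
A = [np.array([2.0, 2.0]), np.array([3.0, 1.0])]
B = [np.array([-1.0, -1.0]), np.array([-2.0, 0.0])]

# H(w, b) with w = (1, 1), b = 0 separates A and B; the margin is
# attained at the points of B closest to the hyperplane.
print(margin([1.0, 1.0], 0.0, A + B))  # sqrt(2) ~ 1.414
```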

[Fig. 1.2: Two separating hyperplanes with different margins]

The relationship between the margin and the generalization capability of linear classifiers is analyzed by statistical learning theory, which theoretically motivates the importance of defining the hyperplane with maximum margin, the so-called optimal separating hyperplane (see Fig. 1.3).

[Fig. 1.3: The optimal hyperplane]

Definition 1.2. Given two linearly separable sets A and B, the optimal separating hyperplane is a separating hyperplane $H(w^*, b^*)$ having maximum margin.

It can be proved that the optimal hyperplane exists and is unique (the formal proof is reported in Appendix A). From the above definition we get that the optimal hyperplane is a solution of the following problem

$$\max_{w \in R^n,\, b \in R} \ \min_{x^i \in A \cup B} \left\{ \frac{|w^T x^i + b|}{\|w\|} \right\} \qquad (1.3)$$

The idea underlying the proof of existence and uniqueness of the optimal hyperplane is based on the following steps:

- for each separating hyperplane $H(w, b)$, there exists a separating hyperplane $H(\hat w, \hat b)$ such that

$$\frac{1}{\|w\|} \le \rho(w,b) \le \frac{1}{\|\hat w\|};$$

- the above condition implies that problem (1.3) admits a solution provided that the following problem

$$\max_{w \in R^n,\, b \in R} \ \frac{1}{\|w\|} \quad \text{s.t.} \quad w^T x^i + b \ge 1 \ \forall x^i \in A, \quad w^T x^j + b \le -1 \ \forall x^j \in B \qquad (1.4)$$

admits a solution;

- problem (1.4) is obviously equivalent to

$$\min_{w \in R^n,\, b \in R} \ \|w\|^2 \quad \text{s.t.} \quad w^T x^i + b \ge 1 \ \forall x^i \in A, \quad w^T x^j + b \le -1 \ \forall x^j \in B, \qquad (1.5)$$

and we can prove that it admits a unique solution, which is also the unique solution of (1.3).

1.3 Linear SVM

We present the linear classifiers defined by SVM in the case both of linearly separable and of not linearly separable sets.

1.3.1 The case of linearly separable sets

Given two linearly separable sets A and B of points in $R^n$, among the infinitely many linear classifiers corresponding to the infinitely many separating hyperplanes, a linear SVM corresponds to the linear classifier whose decision surface is the optimal separating hyperplane. Then, as already seen, the training of a linear SVM requires determining the optimal separating hyperplane, that is, solving the problem

$$\max \ \rho(w,b) \quad \text{s.t.} \quad w^T x^i + b \ge 1 \ \forall x^i \in A, \quad w^T x^j + b \le -1 \ \forall x^j \in B, \qquad (1.6)$$

where $\rho(w,b)$ is the margin. We have shown in Appendix A that problem (1.6) is equivalent to the following convex quadratic programming problem

$$\min \ f(w) = \frac{1}{2}\|w\|^2 \quad \text{s.t.} \quad w^T x^i + b \ge 1 \ \forall x^i \in A, \quad w^T x^j + b \le -1 \ \forall x^j \in B. \qquad (1.7)$$

By using the label $y_i = +1$ for the vectors $x^i \in A$, and the label $y_j = -1$ for the vectors $x^j \in B$, problem (1.7) takes the form

$$\min \ f(w) = \frac{1}{2}\|w\|^2 \quad \text{s.t.} \quad y_i\left[w^T x^i + b\right] - 1 \ge 0, \quad i = 1,\ldots,\ell, \qquad (1.8)$$

where $\ell$ is the total number of training points. We will consider the Wolfe dual formulation (see Appendix B) of problem (1.8) for the following reasons:

- the constraints of (1.8) will be replaced by simpler constraints on the Lagrange multipliers;
- in the dual formulation we have inner products between the training vectors, and this will allow us to easily extend the training procedure to the case of nonseparable sets.

The Lagrangian of (1.8) is the following function

$$L(w,b,\lambda) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{\ell} \lambda_i\left[y_i(w^T x^i + b) - 1\right]. \qquad (1.9)$$

The Wolfe dual problem of (1.8) takes the form

$$\max \ L(w,b,\lambda) \quad \text{s.t.} \quad \nabla_w L(w,b,\lambda) = 0, \quad \frac{\partial L(w,b,\lambda)}{\partial b} = 0, \quad \lambda \ge 0,$$

that is

$$\begin{aligned} \max \ & L(w,b,\lambda) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{\ell} \lambda_i\left[y_i(w^T x^i + b) - 1\right] \\ \text{s.t.}\ & w = \sum_{i=1}^{\ell} \lambda_i y_i x^i \\ & \sum_{i=1}^{\ell} \lambda_i y_i = 0 \\ & \lambda_i \ge 0, \quad i = 1,\ldots,\ell. \end{aligned} \qquad (1.10)$$

The maximization problem (1.10) can be written as follows

$$\max \ S(\lambda) = -\frac{1}{2} \sum_{i=1}^{\ell}\sum_{j=1}^{\ell} y_i y_j (x^i)^T x^j \lambda_i \lambda_j + \sum_{i=1}^{\ell} \lambda_i \quad \text{s.t.} \quad \sum_{i=1}^{\ell} \lambda_i y_i = 0, \quad \lambda_i \ge 0, \ i=1,\ldots,\ell, \qquad (1.11)$$

or, equivalently, as a minimization problem of the form

$$\min \ \Gamma(\lambda) = \frac{1}{2} \sum_{i=1}^{\ell}\sum_{j=1}^{\ell} y_i y_j (x^i)^T x^j \lambda_i \lambda_j - \sum_{i=1}^{\ell} \lambda_i \quad \text{s.t.} \quad \sum_{i=1}^{\ell} \lambda_i y_i = 0, \quad \lambda_i \ge 0, \ i=1,\ldots,\ell. \qquad (1.12)$$

We observe that:

- the existence of the optimal solution $(w^*, b^*)$ of (1.8), the linearity of the constraints and Proposition 1.7 ensure that problem (1.11) admits at least a solution $\lambda^*$;
- from (1.10) we get that the vector $w^*$ can be determined as follows
$$w^* = \sum_{i=1}^{\ell} \lambda_i^* y_i x^i;$$
- $w^*$ depends only on the so-called support vectors, i.e., the vectors $x^i$ whose corresponding multipliers $\lambda_i^*$ are not null;

- assertion (iii) of Proposition 1.8 ensures that $(w^*, b^*, \lambda^*)$ is a pair (optimal solution, vector of Lagrange multipliers), and hence satisfies the following complementarity conditions
$$\lambda_i^*\left[y_i\left((w^*)^T x^i + b^*\right) - 1\right] = 0, \quad i = 1,\ldots,\ell; \qquad (1.13)$$
- once $w^*$ has been computed, by considering any multiplier $\lambda_i^* \ne 0$, the scalar $b^*$ can be determined by means of the corresponding complementarity condition defined by (1.13);
- problem (1.12) is a convex quadratic programming problem; indeed, setting $X = \left[y_1 x^1, \ldots, y_\ell x^\ell\right]$ and $\lambda^T = \left[\lambda_1, \ldots, \lambda_\ell\right]$, the problem takes the form
$$\min \ \Gamma(\lambda) = \frac{1}{2}\lambda^T X^T X \lambda - e^T\lambda \quad \text{s.t.} \quad \sum_{i=1}^{\ell}\lambda_i y_i = 0, \quad \lambda_i \ge 0, \ i = 1,\ldots,\ell,$$
where $e^T = [1,\ldots,1]$;
- the decision function is
$$f(x) = \mathrm{sgn}\left((w^*)^T x + b^*\right) = \mathrm{sgn}\left(\sum_{i=1}^{\ell} \lambda_i^* y_i (x^i)^T x + b^*\right).$$

[Figures: the optimal hyperplane; the optimal hyperplane and its support vectors]
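As a sketch, assuming the multipliers $\lambda^*$ have been obtained from any QP solver applied to (1.12) (function names below are ours, not from the text):

```python
import numpy as np

def primal_from_dual(lam, X, y):
    """Recover (w*, b*) from a dual solution of (1.12).
    lam: optimal multipliers lambda*, shape (l,)
    X:   training patterns as rows, shape (l, n)
    y:   labels in {-1, +1}, shape (l,)"""
    w = (lam * y) @ X                 # w* = sum_i lambda*_i y_i x^i
    sv = np.argmax(lam)               # any index with lambda*_i != 0
    b = y[sv] - w @ X[sv]             # from (1.13): y_i((w*)^T x^i + b*) = 1
    return w, b

def decision(x, w, b):
    """Linear SVM decision function f(x) = sgn((w*)^T x + b*)."""
    return np.sign(w @ x + b)
```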

[Fig. 1.4: Nonseparable sets: the point $x^a$ is a misclassified point with $\xi_a > 1$; the point $x^b$ is a correctly classified point, but belongs to the separation zone, and hence $0 < \xi_b < 1$]

1.3.2 The case of nonlinearly separable sets

Now assume that the two sets A and B are not linearly separable. This means that the system of linear inequalities

$$w^T x^i + b \ge 1 \ \forall x^i \in A, \quad w^T x^j + b \le -1 \ \forall x^j \in B \qquad (1.14)$$

does not admit a solution. Let us introduce the slack variables $\xi_h$, with $h = 1,\ldots,\ell$:

$$w^T x^i + b \ge 1 - \xi_i \ \forall x^i \in A, \quad w^T x^j + b \le -1 + \xi_j \ \forall x^j \in B, \quad \xi_h \ge 0, \ h = 1,\ldots,\ell. \qquad (1.15)$$

Note that whenever a vector $x^i$ is not correctly classified the corresponding variable $\xi_i$ is greater than 1. The variables $\xi_i$ corresponding to vectors correctly classified but belonging to the separation zone (see Fig. 1.4) are such that $0 < \xi_i < 1$. Therefore, the term $\sum_{i=1}^{\ell}\xi_i$ is an upper bound on the number of classification errors on the training vectors. Therefore, it is quite natural to add to the objective function of problem (1.8) the term $C\sum_{i=1}^{\ell}\xi_i$, where $C > 0$ is a parameter weighting the training error. The primal

problem becomes

$$\begin{aligned} \min \ & f(w,\xi) = \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{\ell}\xi_i \\ \text{s.t.}\ & y_i\left[w^T x^i + b\right] - 1 + \xi_i \ge 0, \quad i=1,\ldots,\ell \\ & \xi_i \ge 0, \quad i=1,\ldots,\ell. \end{aligned} \qquad (1.16)$$

The Wolfe dual of (1.16) is

$$\begin{aligned} \max \ & L(w,b,\xi,\lambda,\mu) = \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{\ell}\xi_i - \sum_{i=1}^{\ell}\lambda_i\left[y_i(w^T x^i + b) - 1 + \xi_i\right] - \sum_{i=1}^{\ell}\mu_i\xi_i \\ \text{s.t.}\ & w = \sum_{i=1}^{\ell}\lambda_i y_i x^i, \quad \sum_{i=1}^{\ell}\lambda_i y_i = 0, \quad C - \lambda_i - \mu_i = 0, \ i=1,\ldots,\ell, \quad \lambda \ge 0, \ \mu \ge 0, \end{aligned}$$

which can be equivalently written in the form

$$\min \ \Gamma(\lambda) = \frac{1}{2}\sum_{i=1}^{\ell}\sum_{j=1}^{\ell} y_i y_j (x^i)^T x^j \lambda_i\lambda_j - \sum_{i=1}^{\ell}\lambda_i \quad \text{s.t.} \quad \sum_{i=1}^{\ell}\lambda_i y_i = 0, \quad 0 \le \lambda_i \le C, \ i=1,\ldots,\ell. \qquad (1.17)$$

Note that the constraints $\lambda_i \le C$, for $i = 1,\ldots,\ell$, follow from the constraints $\lambda_i = C - \mu_i$, $\mu_i \ge 0$. Again we observe that:

- the vector $w^*$ is $w^* = \sum_{i=1}^{\ell}\lambda_i^* y_i x^i$;
- for the optimal solution $(w^*, b^*, \xi^*, \lambda^*, \mu^*)$ the following complementarity conditions hold

$$\lambda_i^*\left[y_i\left((w^*)^T x^i + b^*\right) - 1 + \xi_i^*\right] = 0, \quad i=1,\ldots,\ell, \qquad (1.18)$$
$$\mu_i^*\,\xi_i^* = 0, \quad i=1,\ldots,\ell; \qquad (1.19)$$

- given $w^*$ and any multiplier $\lambda_i^*$ such that $0 < \lambda_i^* < C$, the scalar $b^*$ can be determined by means of the corresponding condition defined by (1.18);
- whenever $\lambda_i^* \in \{0, C\}$ for all $i = 1,\ldots,\ell$, we say that the solution is degenerate;
- problem (1.17) is a convex quadratic programming problem;
- the decision function of the classifier is
$$f(x) = \mathrm{sgn}\left((w^*)^T x + b^*\right) = \mathrm{sgn}\left(\sum_{i=1}^{\ell}\lambda_i^* y_i (x^i)^T x + b^*\right).$$

1.4 Nonlinear SVM

Linear models may not be rich enough to capture nonlinear patterns in the data. The motivation for introducing nonlinear SVM is to obtain a nonlinear decision boundary for problems where the data distributions are inherently nonlinear. The idea underlying nonlinear SVM is that of mapping the data of the input space onto a higher dimensional space, called feature space, and of defining a linear classifier in the feature space (see Fig. 1.5).

[Fig. 1.5: Mapping the data from the input space onto the feature space, where $w^T\phi(x) + b = 0$]

Let us consider a mapping $\phi : R^n \to H$ where $H$ is a Euclidean space (the feature space) whose dimension is greater than $n$ (the dimension can even be infinite). The input training vectors $x^i$ are mapped onto $\phi(x^i)$, with $i = 1,\ldots,\ell$. We can then define a linear SVM in the feature space by replacing $x^i$ with $\phi(x^i)$. Then we have

- the dual problem (1.17) is replaced by the following problem

$$\min \ \Gamma(\lambda) = \frac{1}{2}\sum_{i=1}^{\ell}\sum_{j=1}^{\ell} y_i y_j\,\phi(x^i)^T\phi(x^j)\,\lambda_i\lambda_j - \sum_{i=1}^{\ell}\lambda_i \quad \text{s.t.} \quad \sum_{i=1}^{\ell}\lambda_i y_i = 0, \quad 0 \le \lambda_i \le C, \ i=1,\ldots,\ell; \qquad (1.20)$$

- the vector $w^*$ is $w^* = \sum_{i=1}^{\ell}\lambda_i^* y_i\,\phi(x^i)$;
- given $w^*$ and any $0 < \lambda_i^* < C$, the scalar $b^*$ can be determined using the complementarity conditions

$$y_i\left(\sum_{j=1}^{\ell}\lambda_j^* y_j\,\phi(x^j)^T\phi(x^i) + b^*\right) = 1; \qquad (1.21)$$

- the decision function takes the form

$$f(x) = \mathrm{sgn}\left((w^*)^T\phi(x) + b^*\right). \qquad (1.22)$$

From (1.22) we get that the separation surface is:
- linear in the feature space;
- nonlinear in the input space.

It is important to observe that both in the dual formulation (1.20) and in formula (1.22) concerning the decision function it is not necessary to explicitly know the mapping $\phi$; it is sufficient to know the inner product $\phi(x)^T\phi(z)$ of the feature space. This leads to the fundamental concept of kernel function.

Definition 1.3. Given a set $X \subseteq R^n$, a function $K : X \times X \to R$ is a kernel if

$$K(x,y) = \langle \phi(x), \phi(y) \rangle \quad \forall x, y \in X, \qquad (1.23)$$

where $\phi$ is a mapping $X \to H$ and $H$ is a Euclidean space, that is, a linear space with a fixed inner product.

We observe that a kernel is necessarily a symmetric function. It can be proved that $K(x,z)$ is a kernel if and only if the matrix

$$\left(K(x^i,x^j)\right)_{i,j=1}^{\ell} = \begin{pmatrix} K(x^1,x^1) & \ldots & K(x^1,x^\ell) \\ \vdots & & \vdots \\ K(x^\ell,x^1) & \ldots & K(x^\ell,x^\ell) \end{pmatrix}$$

is positive semidefinite for any set of training vectors $\{x^1,\ldots,x^\ell\}$. In the literature such kernels are often called Mercer kernels.

Proposition 1.1. Let $K : X \times X \to R$ be a symmetric function. Then $K$ is a kernel if and only if, for any choice of the vectors $x^1,\ldots,x^\ell$ in $X$, the matrix

$$K = [K(x^i,x^j)]_{i,j=1,\ldots,\ell}$$

is positive semidefinite.

Using the definition of kernel, problem (1.20) can be written as follows

$$\min \ \Gamma(\lambda) = \frac{1}{2}\sum_{i=1}^{\ell}\sum_{j=1}^{\ell} y_i y_j K(x^i,x^j)\lambda_i\lambda_j - \sum_{i=1}^{\ell}\lambda_i \quad \text{s.t.} \quad \sum_{i=1}^{\ell}\lambda_i y_i = 0, \quad 0 \le \lambda_i \le C, \ i=1,\ldots,\ell. \qquad (1.24)$$

By Proposition 1.1 it follows that problem (1.24) is a convex quadratic programming problem. Examples of kernel functions are:

- $K(x,z) = (x^T z + 1)^p$, polynomial kernel ($p$ integer $\ge 1$);
- $K(x,z) = e^{-\|x - z\|^2/2\sigma^2}$, Gaussian kernel ($\sigma > 0$);
- $K(x,z) = \tanh(\beta x^T z + \gamma)$, hyperbolic tangent kernel (for suitable values of $\beta$ and $\gamma$).

Using the definition of kernel function, the decision function is

$$f(x) = \mathrm{sgn}\left(\sum_{i=1}^{\ell}\lambda_i^* y_i K(x, x^i) + b^*\right).$$

Remark 1.1. On the Gaussian kernel. It is possible to show that the Gaussian kernel is an inner product in an infinite dimensional space. As a consequence, for sufficiently large values of the parameter $C$ the training error is zero.
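A minimal numerical sketch (illustrative, not part of the original text) of two of the kernels listed above, together with the positive semidefiniteness check of Proposition 1.1; the hyperbolic tangent kernel is omitted from the check since it is a kernel only for suitable values of $\beta$ and $\gamma$:

```python
import numpy as np

def poly_kernel(x, z, p=2):
    """Polynomial kernel (x^T z + 1)^p."""
    return (x @ z + 1.0) ** p

def gaussian_kernel(x, z, sigma=1.0):
    """Gaussian kernel exp(-||x - z||^2 / (2 sigma^2))."""
    return np.exp(-np.linalg.norm(x - z) ** 2 / (2.0 * sigma ** 2))

def gram_matrix(kernel, X):
    """Gram matrix [K(x^i, x^j)] of Proposition 1.1 (patterns are rows of X)."""
    l = X.shape[0]
    return np.array([[kernel(X[i], X[j]) for j in range(l)] for i in range(l)])

# Numerical check of Proposition 1.1 on random data: all eigenvalues of the
# symmetric Gram matrix should be nonnegative (up to rounding).
X = np.random.randn(20, 3)
for k in (poly_kernel, gaussian_kernel):
    print(k.__name__, "min eigenvalue:", np.linalg.eigvalsh(gram_matrix(k, X)).min())
```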

Remark 1.2. On the polynomial kernel. Let us assume that the dimension $n$ of the input space is 2, and let us consider the homogeneous polynomial kernel $K(x,z) = (x^T z)^2$. We show that there are different feature spaces $H$ such that $\phi : R^2 \to H$ and $\phi(x)^T\phi(z) = K(x,z)$. Indeed, both the mapping

$$\phi(x) = \begin{pmatrix} x_1^2 \\ \sqrt{2}\,x_1 x_2 \\ x_2^2 \end{pmatrix} \qquad (1.25)$$

and the mapping

$$\phi(x) = \begin{pmatrix} x_1^2 \\ x_1 x_2 \\ x_1 x_2 \\ x_2^2 \end{pmatrix} \qquad (1.26)$$

are such that $\phi(x)^T\phi(z) = \left(x^T z\right)^2$. Note that in (1.25) we have $\phi : R^2 \to R^3$, while in (1.26) we have $\phi : R^2 \to R^4$. It is possible to show that, for the homogeneous polynomial kernel $K(x,z) = \left(x^T z\right)^p$, the dimension of the minimum embedding space is $\binom{n+p-1}{p}$. As a consequence, for sufficiently large values of the exponent $p$ and of the parameter $C$ the training error is zero.

Finally, we observe that an SVM with Gaussian kernel corresponds to an RBF neural network where the number of basis functions and their centres are automatically determined by the number of support vectors and their values. In a similar way, an SVM with hyperbolic tangent kernel corresponds to an MLP neural network with one hidden layer, where the number of neurons and the values of the weights are automatically determined by the number of support vectors and their values.
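A quick numerical check of Remark 1.2, verifying that the two mappings (1.25) and (1.26) both reproduce the homogeneous polynomial kernel (illustrative sketch):

```python
import numpy as np

def phi3(x):
    # Mapping (1.25): R^2 -> R^3
    return np.array([x[0]**2, np.sqrt(2.0) * x[0] * x[1], x[1]**2])

def phi4(x):
    # Mapping (1.26): R^2 -> R^4
    return np.array([x[0]**2, x[0]*x[1], x[0]*x[1], x[1]**2])

x, z = np.array([1.0, 2.0]), np.array([3.0, -1.0])
k = (x @ z) ** 2                                # kernel (x^T z)^2
print(k, phi3(x) @ phi3(z), phi4(x) @ phi4(z))  # all three values coincide
```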

1.5 A general decomposition scheme for SVM training

Let us consider the convex quadratic programming problem for SVM training in the case of classification problems:

$$\min \ f(\alpha) = \frac{1}{2}\alpha^T Q\alpha - e^T\alpha \quad \text{s.t.} \quad y^T\alpha = 0, \quad 0 \le \alpha \le C, \qquad (1.27)$$

where $\alpha \in R^\ell$, $\ell$ is the number of training data, $Q$ is a symmetric positive semidefinite $\ell\times\ell$ matrix, $e \in R^\ell$ is the vector of ones, $y \in \{-1,1\}^\ell$, and $C$ is a positive scalar. The generic element $q_{ij}$ of $Q$ is $y_i y_j K(x^i, x^j)$, where $K(x,z) = \phi(x)^T\phi(z)$ is the kernel function related to the nonlinear function $\phi$ that maps the data from the input space into the feature space.

We assume that the number $\ell$ of training data is huge and that the Hessian matrix $Q$, which is dense, cannot be fully stored, so that standard methods for quadratic programming cannot be used. Hence, the strategy adopted to solve the SVM problem is usually based on the decomposition of the original problem into a sequence of smaller subproblems obtained by fixing subsets of variables.

In a general decomposition framework, at each iteration $k$, the vector of variables $\alpha^k$ is partitioned into two subvectors $(\alpha^k_W, \alpha^k_{\overline W})$, where the index set $W \subset \{1,\ldots,\ell\}$ identifies the variables of the subproblem to be solved and is called the working set, and $\overline W = \{1,\ldots,\ell\}\setminus W$ (for notational convenience, we omit the dependence on $k$). Starting from the current solution $\alpha^k = (\alpha^k_W, \alpha^k_{\overline W})$, which is a feasible point, the subvector $\alpha^{k+1}_W$ is computed as the solution of the subproblem

$$\min_{\alpha_W} \ f(\alpha_W, \alpha^k_{\overline W}) \quad \text{s.t.} \quad y_W^T\alpha_W = -y_{\overline W}^T\alpha^k_{\overline W}, \quad 0 \le \alpha_W \le C. \qquad (1.28)$$

The variables corresponding to $\overline W$ are unchanged, that is, $\alpha^{k+1}_{\overline W} = \alpha^k_{\overline W}$, and the current solution is updated by setting $\alpha^{k+1} = (\alpha^{k+1}_W, \alpha^{k+1}_{\overline W})$. The cardinality $q$ of the working set, namely the dimension of the subproblem, must be greater than or equal to 2, otherwise we would have $\alpha^{k+1} = \alpha^k$. Indeed, assuming $q = 1$ and $W = \{i\}$, if $\alpha^k$ is a feasible point then we have $y_i\alpha^k_i = -y_{\overline W}^T\alpha^k_{\overline W}$. In order to guarantee that $\alpha^{k+1}$ is a feasible point we must have $y_i\alpha^{k+1}_i = -y_{\overline W}^T\alpha^k_{\overline W}$, and hence $\alpha^{k+1} = \alpha^k$.

A general decomposition scheme is described below.

Decomposition algorithm

Data. A feasible point $\alpha^0$ (usually $\alpha^0 = 0$).
Initialization. Set $k = 0$.
While (the stopping criterion is not satisfied)
1. select the working set $W^k$;
2. set $W = W^k$ and compute a solution $\alpha^*_W$ of subproblem (1.28);
3. set $\alpha^{k+1}_i = \begin{cases}\alpha^*_i & \text{if } i \in W \\ \alpha^k_i & \text{otherwise;}\end{cases}$
4. set $\nabla f(\alpha^{k+1}) = \nabla f(\alpha^k) + Q\left(\alpha^{k+1} - \alpha^k\right)$;
5. set $k = k + 1$.
end while
Return $\alpha^* = \alpha^k$

The selection rule of the working set strongly affects both the speed of the algorithm and its convergence properties. In computational terms, the most expensive step at each iteration of a decomposition method is the evaluation of the kernel to compute the columns of the Hessian matrix corresponding to the indices in the working set $W$. In the sequel we will mainly focus on algorithms using working sets of cardinality two, since they are the most widely used algorithms to solve the large quadratic programs arising in SVM training.

1.6 Sequential Minimal Optimization (SMO) algorithms

The decomposition methods usually adopted are the so-called Sequential Minimal Optimization (SMO) algorithms, which update at each iteration the minimum number of variables, that is, two. At each iteration, an SMO algorithm requires the solution of a convex quadratic programming problem in two variables:

$$\begin{aligned} \min \ & q(\alpha_i,\alpha_j) = \frac{1}{2}\begin{pmatrix}\alpha_i & \alpha_j\end{pmatrix}\begin{pmatrix} q_{ii} & q_{ij} \\ q_{ji} & q_{jj}\end{pmatrix}\begin{pmatrix}\alpha_i \\ \alpha_j\end{pmatrix} + c_i\alpha_i + c_j\alpha_j \\ \text{s.t.}\ & y_i\alpha_i + y_j\alpha_j = -y_{\overline W}^T\alpha^k_{\overline W} \\ & 0 \le \alpha_h \le C, \quad h = i, j, \end{aligned} \qquad (1.29)$$

where the coefficients $c_i = \sum_{h\notin W} q_{ih}\alpha^k_h - 1$ and $c_j = \sum_{h\notin W} q_{jh}\alpha^k_h - 1$ collect the linear terms coming from $-e^T\alpha$ and from the fixed variables.

We first show that the solution of a subproblem in two variables of the form (1.29) can be analytically determined (and this is one of the reasons motivating the interest in defining SMO algorithms). To this aim, given a feasible point $\bar\alpha$ and a feasible direction $d$, we indicate by $\bar\beta$ the maximum feasible step length along $d$ starting from $\bar\alpha$, i.e.,

- if $d_1 > 0$ and $d_2 > 0$, $\bar\beta = \min\{C - \bar\alpha_1, C - \bar\alpha_2\}$;
- if $d_1 < 0$ and $d_2 < 0$, $\bar\beta = \min\{\bar\alpha_1, \bar\alpha_2\}$;
- if $d_1 > 0$ and $d_2 < 0$, $\bar\beta = \min\{C - \bar\alpha_1, \bar\alpha_2\}$;
- if $d_1 < 0$ and $d_2 > 0$, $\bar\beta = \min\{\bar\alpha_1, C - \bar\alpha_2\}$.

We set $d^+ = \begin{pmatrix} 1/y_1 \\ -1/y_2 \end{pmatrix}$ and we report below the scheme for the analytical computation of the solution of problem (1.29).

Analytical computation of the solution of the two-dimensional problem

1. If $\nabla f(\bar\alpha)^T d^+ = 0$ set $\alpha^* = \bar\alpha$ and stop;
2. If $\nabla f(\bar\alpha)^T d^+ < 0$ set $d^* = d^+$;
3. If $\nabla f(\bar\alpha)^T d^+ > 0$ set $d^* = -d^+$;
4. Let $\bar\beta$ be the maximum feasible step length along $d^*$:
4a. if $\bar\beta = 0$ set $\alpha^* = \bar\alpha$ and stop;
4b. if $(d^*)^T Q d^* = 0$ set $\beta^* = \bar\beta$, otherwise compute $\beta_{nv} = -\dfrac{\nabla f(\bar\alpha)^T d^*}{(d^*)^T Q d^*}$ and set $\beta^* = \min\{\bar\beta, \beta_{nv}\}$;
4c. set $\alpha^* = \bar\alpha + \beta^* d^*$ and stop.
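A minimal sketch of the analytic scheme above (our own naming; inputs are the two current variables, the corresponding gradient components, the 2x2 block of $Q$ and the two labels; exact zero tests stand in for tolerance checks):

```python
import numpy as np

def solve_two_var(alpha_bar, grad, Q2, y, C):
    """Analytic solution of the two-variable subproblem (1.29).
    alpha_bar: current feasible values of the two variables, shape (2,)
    grad:      the two gradient components of f at the current point
    Q2:        the 2x2 block [[q11, q12], [q21, q22]]
    y:         the two labels (+1 or -1)
    C:         upper bound of the box constraints"""
    d_plus = np.array([1.0 / y[0], -1.0 / y[1]])
    g = grad @ d_plus
    if g == 0.0:                              # step 1
        return alpha_bar.copy()
    d = d_plus if g < 0.0 else -d_plus        # steps 2-3
    # Maximum feasible step length beta_bar along d (the four cases above).
    beta_bar = min(C - alpha_bar[h] if d[h] > 0 else alpha_bar[h]
                   for h in range(2))
    if beta_bar == 0.0:                       # step 4a
        return alpha_bar.copy()
    curv = d @ Q2 @ d
    beta = beta_bar if curv == 0.0 else min(beta_bar, -(grad @ d) / curv)
    return alpha_bar + beta * d               # step 4c
```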

Now, let us consider the selection rule for choosing, at each iteration $k$, the two variables of subproblem (1.29). The selection rule should ensure a strict decrease of the objective function. The new point $\alpha^{k+1}$ is obtained by updating two variables, indicated by $\alpha_i$ and $\alpha_j$, so that we have

$$\alpha^{k+1} = \left(\alpha^k_1, \ldots, \alpha^{k+1}_i, \ldots, \alpha^{k+1}_j, \ldots, \alpha^k_\ell\right)^T. \qquad (1.30)$$

We observe that:

(i) the new point $\alpha^{k+1}$ must be feasible;
(ii) if $\alpha^k$ is not a solution, then we require $f(\alpha^{k+1}) < f(\alpha^k)$.

According to (i), (ii) and (1.30), we focus on feasible and descent directions (see (i) and (ii)) with only two nonzero elements (see (1.30)). Therefore, in the sequel we will analyze directions having these features, which play a crucial role in the decomposition strategies for the considered problem.

Let us consider the feasible set of problem (1.27), indicated by $F$, namely

$$F = \{\alpha \in R^\ell : y^T\alpha = 0, \ 0 \le \alpha \le C\}.$$

Given any feasible point $\alpha$, we indicate as follows the index sets of the active box (lower and upper) constraints:

$$L(\alpha) = \{i : \alpha_i = 0\}, \qquad U(\alpha) = \{i : \alpha_i = C\}.$$

The set of feasible directions at a point $\alpha \in F$ is the cone

$$D(\alpha) = \{d \in R^\ell : y^T d = 0, \ d_i \ge 0 \ \forall i \in L(\alpha), \ \text{and} \ d_i \le 0 \ \forall i \in U(\alpha)\}.$$

Indeed, if $d$ is a feasible direction at $\alpha \in F$, then for $t > 0$ sufficiently small we must have

$$y^T(\alpha + td) = 0, \quad \alpha + td \ge 0, \quad \alpha + td \le C,$$

from which it necessarily follows

$$y^T d = 0, \quad d_i \ge 0 \ \text{if} \ \alpha_i = 0, \quad d_i \le 0 \ \text{if} \ \alpha_i = C.$$

Given a point $\bar\alpha \in F$, we define feasible directions at $\bar\alpha$ with only two nonzero components $d_i$ and $d_j$. We indicate by $d^{i,j}$ the direction

$$d^{i,j} = (0,\ldots,0,\, d_i,\, 0,\ldots,0,\, d_j,\, 0,\ldots,0)^T.$$

Since we must have

$$y^T d^{i,j} = y_i d_i + y_j d_j = 0,$$

we set

$$d_i = \frac{1}{y_i}, \qquad d_j = -\frac{1}{y_j}.$$

Furthermore, we remark that:

- if $i \in L(\bar\alpha)$, namely $\bar\alpha_i = 0$, we must have $d_i \ge 0$, and hence $y_i > 0$;
- if $i \in U(\bar\alpha)$, namely $\bar\alpha_i = C$, we must have $d_i \le 0$, and hence $y_i < 0$;
- if $j \in L(\bar\alpha)$, namely $\bar\alpha_j = 0$, we must have $d_j \ge 0$, and hence $y_j < 0$;
- if $j \in U(\bar\alpha)$, namely $\bar\alpha_j = C$, we must have $d_j \le 0$, and hence $y_j > 0$.

Note that, whenever $0 < \bar\alpha_i < C$, there are no constraints on the sign of $d_i$ (and hence on $y_i$). In the same way, whenever $0 < \bar\alpha_j < C$, there are no constraints on the sign of $d_j$ (and hence on $y_j$). On the basis of the preceding considerations we partition the sets $L$ and $U$ into the subsets $L^-$, $L^+$ and $U^-$, $U^+$ respectively, where

$$L^+(\alpha) = \{i \in L(\alpha) : y_i > 0\}, \qquad L^-(\alpha) = \{i \in L(\alpha) : y_i < 0\},$$
$$U^+(\alpha) = \{i \in U(\alpha) : y_i > 0\}, \qquad U^-(\alpha) = \{i \in U(\alpha) : y_i < 0\}.$$

We observe that if:

- $i$ belongs to $L^+$ or to $U^-$, and
- $j$ belongs to $L^-$ or to $U^+$,

then the corresponding direction $d^{i,j}$ is a feasible direction at $\bar\alpha$. In order to characterize the feasible directions (with only two nonzero components) at $\bar\alpha$, let us define the following index sets

$$R(\bar\alpha) = L^+(\bar\alpha) \cup U^-(\bar\alpha) \cup \{i : 0 < \bar\alpha_i < C\}, \qquad S(\bar\alpha) = L^-(\bar\alpha) \cup U^+(\bar\alpha) \cup \{i : 0 < \bar\alpha_i < C\}. \qquad (1.31)$$

Note that

$$R(\bar\alpha) \cap S(\bar\alpha) = \{i : 0 < \bar\alpha_i < C\}, \qquad R(\bar\alpha) \cup S(\bar\alpha) = \{1,\ldots,\ell\}.$$

Moreover, it is easy to see that both $R(\bar\alpha)$ and $S(\bar\alpha)$ are nonempty. The two index sets $R$ and $S$ allow us to define all the feasible and descent directions with only two nonzero components. This is shown in the next proposition.

Proposition 1.2. Let $\bar\alpha$ be a feasible point and let $(i, j) \in \{1,\ldots,\ell\}$, $i \ne j$, be a pair of indices. Then the direction $d^{i,j} \in R^\ell$ such that

$$d^{i,j}_h = \begin{cases} 1/y_i & \text{if } h = i \\ -1/y_j & \text{if } h = j \\ 0 & \text{otherwise} \end{cases}$$

(i) is a feasible direction at the point $\bar\alpha$ if and only if $i \in R(\bar\alpha)$ and $j \in S(\bar\alpha)$;
(ii) is a descent direction for $f$ at $\bar\alpha$ if and only if

$$\frac{(\nabla f(\bar\alpha))_i}{y_i} - \frac{(\nabla f(\bar\alpha))_j}{y_j} < 0. \qquad (1.32)$$

Proof. (ia) Assume that $d^{i,j}$ is a feasible direction. We show that $i \in R(\bar\alpha)$ and $j \in S(\bar\alpha)$. By contradiction, assume that $i \in R(\bar\alpha)$ and $j \notin S(\bar\alpha)$, that is, $j \in L^+(\bar\alpha) \cup U^-(\bar\alpha)$. If $j \in L^+(\bar\alpha)$ then $\bar\alpha_j = 0$ and $y_j = 1$, so that $d_j < 0$, and hence $d^{i,j}$ is not a feasible direction since $\bar\alpha_j + td_j < 0$ for all $t > 0$. In the same way, if $j \in U^-(\bar\alpha)$ then $\bar\alpha_j = C$ and $y_j = -1$, so that $d_j > 0$, and hence $d^{i,j}$ is not a feasible direction.

(ib) Assume that $i \in R(\bar\alpha)$ and $j \in S(\bar\alpha)$. We must show that $d^{i,j}$ is such that

$$y^T d^{i,j} = 0, \quad d^{i,j}_h \ge 0 \ \forall h \in L(\bar\alpha), \quad \text{and} \quad d^{i,j}_h \le 0 \ \forall h \in U(\bar\alpha).$$

From the definition of $d^{i,j}$ it follows that $y^T d^{i,j} = y_i d^{i,j}_i + y_j d^{i,j}_j = 0$. Moreover, we have $i \in R(\bar\alpha)$, and hence, if $i \in L(\bar\alpha)$, then (1.31) implies $i \in L^+(\bar\alpha)$, that is, $d_i = 1/y_i > 0$. Analogously, we have $j \in S(\bar\alpha)$, and hence, if $j \in U(\bar\alpha)$, then $j \in U^+(\bar\alpha)$, that is, $d_j = -1/y_j < 0$. The same conclusions can be drawn in the case that $i \in U(\bar\alpha)$ and $j \in L(\bar\alpha)$, and hence we can conclude that $d^{i,j}$ is a feasible direction.

(ii) Since $f$ is a convex and continuously differentiable function, the condition

$$\nabla f(\bar\alpha)^T d^{i,j} = \frac{(\nabla f(\bar\alpha))_i}{y_i} - \frac{(\nabla f(\bar\alpha))_j}{y_j} < 0$$

is necessary and sufficient to ensure that $d^{i,j}$ is a descent direction for $f$ at $\bar\alpha$.

If the pair of indices $(i, j)$ defines a feasible descent direction $d^{i,j}$ at the current point $\alpha^k$, then the minimization with respect to the pair of variables $\alpha_i$ and $\alpha_j$ will produce a strict decrease of the objective function. Then, taking into account Proposition 1.2, we define the formal scheme of an SMO algorithm.

SMO Algorithm

Data. The starting point $\alpha^0 = 0$ and the gradient $\nabla f(\alpha^0) = -e$.
Initialization. Set $k = 0$.
While (the stopping criterion is not satisfied)
1. select $i \in R(\alpha^k)$, $j \in S(\alpha^k)$ such that $\nabla f(\alpha^k)^T d^{i,j} < 0$, and set $W = \{i, j\}$;
2. compute the analytic solution $\alpha^* = \begin{pmatrix}\alpha_i^* & \alpha_j^*\end{pmatrix}^T$ of (1.29);
3. set $\alpha^{k+1}_h = \begin{cases}\alpha_i^* & \text{for } h = i \\ \alpha_j^* & \text{for } h = j \\ \alpha^k_h & \text{otherwise;}\end{cases}$
4. set $\nabla f(\alpha^{k+1}) = \nabla f(\alpha^k) + (\alpha^{k+1}_i - \alpha^k_i)Q_i + (\alpha^{k+1}_j - \alpha^k_j)Q_j$;
5. set $k = k + 1$.
end while
Return $\alpha^* = \alpha^k$

We observe that the above scheme generates a sequence $\{\alpha^k\}$ such that

$$f(\alpha^{k+1}) < f(\alpha^k). \qquad (1.33)$$

The scheme requires storing a vector of size $\ell$ (the gradient $\nabla f(\alpha^k)$) and computing two columns, $Q_i$ and $Q_j$, of the matrix $Q$. The choice $\alpha^0 = 0$ for the starting point is motivated by the fact that this point is feasible and that the computation of the gradient $\nabla f(\alpha^0)$ does not require any element of the matrix $Q$, since $\nabla f(0) = -e$.

We remark that condition (1.33) is not sufficient to guarantee convergence towards solutions of the problem. In order to guarantee global convergence properties of the generated sequence, suitable working set selection rules must be adopted: for instance, Gauss-Southwell rules based on the violation of the optimality conditions analyzed in the next section.
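A minimal sketch of the SMO Algorithm above, under the assumptions that the columns of $Q$ are produced on demand by a user-supplied function, that the working-set rule (for instance the most violating pair rule of the next section) returns None when the stopping test holds, and that solve_two_var from the earlier sketch is in scope; all names are ours:

```python
import numpy as np

def smo(Q_col, y, C, l, select_pair, max_iter=10**6):
    """Sketch of the SMO Algorithm.
    Q_col(i):               returns column i of Q (from the kernel, on demand)
    select_pair(alpha, g):  working-set rule; returns (i, j), or None when
                            the stopping criterion is satisfied
    y, shape (l,):          labels in {-1, +1} as a numpy array"""
    alpha = np.zeros(l)        # feasible starting point alpha^0 = 0
    grad = -np.ones(l)         # grad f(0) = -e
    for _ in range(max_iter):
        pair = select_pair(alpha, grad)                   # step 1
        if pair is None:
            break
        i, j = pair
        Qi, Qj = Q_col(i), Q_col(j)
        Q2 = np.array([[Qi[i], Qi[j]], [Qi[j], Qj[j]]])   # 2x2 block (Q symmetric)
        new = solve_two_var(alpha[[i, j]], grad[[i, j]],
                            Q2, y[[i, j]], C)             # step 2
        # Step 4: the gradient update needs only the two columns Q_i and Q_j.
        grad += (new[0] - alpha[i]) * Qi + (new[1] - alpha[j]) * Qj
        alpha[i], alpha[j] = new[0], new[1]               # step 3
    return alpha
```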

1.7 Convergent SMO algorithms using first order information

As already seen, SMO-type algorithms select at each iteration a working set of size exactly two, so that the updated point can be analytically computed, and this eliminates the need for optimization software. In order to design convergent SMO algorithms we need to state the optimality conditions in a form useful for defining suitable working set selection rules. For the sake of generality, in the sequel we will assume that $f$ is a convex continuously differentiable function.

Let $F$ be the feasible set of problem (1.27), that is

$$F = \{\alpha \in R^\ell : y^T\alpha = 0, \ 0 \le \alpha \le C\}.$$

Given a feasible point $\alpha$, we have already introduced the index sets of the active box (lower and upper) constraints

$$L(\alpha) = \{i : \alpha_i = 0\}, \qquad U(\alpha) = \{i : \alpha_i = C\},$$

and the index sets

$$R(\alpha) = L^+(\alpha) \cup U^-(\alpha) \cup \{i : 0 < \alpha_i < C\}, \qquad S(\alpha) = L^-(\alpha) \cup U^+(\alpha) \cup \{i : 0 < \alpha_i < C\},$$

where

$$L^+(\alpha) = \{i \in L(\alpha) : y_i > 0\}, \quad L^-(\alpha) = \{i \in L(\alpha) : y_i < 0\}, \quad U^+(\alpha) = \{i \in U(\alpha) : y_i > 0\}, \quad U^-(\alpha) = \{i \in U(\alpha) : y_i < 0\}.$$

The introduction of the index sets $R(\alpha)$ and $S(\alpha)$ allows us to state the optimality conditions in the following form (the proof of the proposition can be found in Appendix C).

Proposition 1.3. A feasible point $\alpha^*$ is a solution of (1.27) if and only if

$$\max_{i \in R(\alpha^*)} \left\{-\frac{(\nabla f(\alpha^*))_i}{y_i}\right\} \le \min_{j \in S(\alpha^*)} \left\{-\frac{(\nabla f(\alpha^*))_j}{y_j}\right\}. \qquad (1.34)$$

Given a feasible point $\bar\alpha$ which is not a solution of problem (1.27), a pair $i \in R(\bar\alpha)$, $j \in S(\bar\alpha)$ such that

$$-\frac{(\nabla f(\bar\alpha))_i}{y_i} > -\frac{(\nabla f(\bar\alpha))_j}{y_j}$$

is said to be a violating pair. It can be shown (see Proposition 1.2) that SMO-type algorithms attain a strict decrease of the objective function if and only if the working set is a violating pair.

However, as already said, the use of generic violating pairs as working sets is not sufficient to guarantee convergence properties of the sequence generated by a decomposition algorithm. A convergent SMO algorithm can be defined using as indices of the working set those corresponding to the maximal violation of the KKT conditions. More specifically, given again a feasible point $\alpha$ which is not a solution of problem (1.27), let us define

$$I(\alpha) = \left\{ i : i \in \arg\max_{i \in R(\alpha)} \left\{-\frac{(\nabla f(\alpha))_i}{y_i}\right\} \right\}, \qquad J(\alpha) = \left\{ j : j \in \arg\min_{j \in S(\alpha)} \left\{-\frac{(\nabla f(\alpha))_j}{y_j}\right\} \right\}.$$

Taking into account the KKT conditions as stated in (1.34), a pair $i \in I(\alpha)$, $j \in J(\alpha)$ most violates the optimality conditions and, therefore, is said to be a maximal violating pair. Note that the selection of a maximal violating pair involves $O(\ell)$ operations. An SMO-type algorithm using maximal violating pairs as working sets is usually called a most violating pair (MVP) algorithm; it is formally described below.

Most Violating Pair (MVP) Algorithm

Data. The starting point $\alpha^0 = 0$ and the gradient $\nabla f(\alpha^0) = -e$.
Initialization. Set $k = 0$.
While (the stopping criterion is not satisfied)
1. select $i \in I(\alpha^k)$, $j \in J(\alpha^k)$, and set $W = \{i, j\}$;
2. compute the analytic solution $\alpha^* = \begin{pmatrix}\alpha_i^* & \alpha_j^*\end{pmatrix}^T$ of (1.29);
3. set $\alpha^{k+1}_h = \begin{cases}\alpha_i^* & \text{for } h = i \\ \alpha_j^* & \text{for } h = j \\ \alpha^k_h & \text{otherwise;}\end{cases}$
4. set $\nabla f(\alpha^{k+1}) = \nabla f(\alpha^k) + (\alpha^{k+1}_i - \alpha^k_i)Q_i + (\alpha^{k+1}_j - \alpha^k_j)Q_j$;
5. set $k = k + 1$.
end while
Return $\alpha^* = \alpha^k$
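A sketch of the maximal violating pair selection (our naming), which can serve as the select_pair rule in the earlier SMO loop; it assumes both $R(\alpha)$ and $S(\alpha)$ are nonempty (both classes present), and its tolerance test anticipates the stopping criterion (1.46) of Section 1.9:

```python
import numpy as np

def mvp_pair(alpha, grad, y, C, eps=1e-3):
    """Most violating pair selection: i in I(alpha), j in J(alpha).
    Returns None when m(alpha) <= M(alpha) + eps, i.e. when the stopping
    criterion (1.46) is met."""
    v = -grad * y          # -(grad f(alpha))_h / y_h, since y_h is +1 or -1
    free = (alpha > 0) & (alpha < C)
    R = free | ((alpha <= 0) & (y > 0)) | ((alpha >= C) & (y < 0))
    S = free | ((alpha <= 0) & (y < 0)) | ((alpha >= C) & (y > 0))
    i = np.where(R)[0][np.argmax(v[R])]    # m(alpha) is attained at i
    j = np.where(S)[0][np.argmin(v[S])]    # M(alpha) is attained at j
    if v[i] <= v[j] + eps:                 # optimality up to the tolerance
        return None
    return i, j
```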

We can state the following convergence result.

Proposition 1.4. Suppose that the symmetric matrix $Q$ is positive semidefinite, and let $\{\alpha^k\}$ be the sequence generated by the MVP Algorithm. Then $\{\alpha^k\}$ admits limit points, and each limit point is a solution of problem (1.27).

A usual requirement to establish convergence properties in the context of a decomposition strategy is that

$$\lim_{k\to\infty} \left(\alpha^{k+1} - \alpha^k\right) = 0. \qquad (1.35)$$

Indeed, in a decomposition method, at the end of each iteration $k$, only the satisfaction of the optimality conditions with respect to the variables associated with $W^k$ is ensured. Therefore, to get convergence towards KKT points, it may be necessary to ensure that consecutive points, which are solutions of the corresponding subproblems, tend to the same limit point. It can be proved that SMO algorithms guarantee property (1.35) (the proof fully exploits the fact that the subproblems are convex quadratic problems in two variables).

The convergence result of Proposition 1.4 can be obtained even using working set rules different from the one selecting the maximal violating pair. For instance, the so-called constant-factor violating pair rule guarantees global convergence properties of the SMO algorithm adopting it, and requires selecting any violating pair $u \in R(\alpha^k)$, $v \in S(\alpha^k)$ such that

$$-\frac{(\nabla f(\alpha^k))_u}{y_u} + \frac{(\nabla f(\alpha^k))_v}{y_v} \ge \sigma\left(-\frac{(\nabla f(\alpha^k))_i}{y_i} + \frac{(\nabla f(\alpha^k))_j}{y_j}\right), \qquad (1.36)$$

where $0 < \sigma \le 1$ and $(i, j)$ is a maximal violating pair.

1.8 Convergent SMO algorithms using second order information

It can be proved that the direction $d^{i,j}$, $(i,j)$ being the maximal violating pair, is the solution of the following problem

$$\begin{aligned} \min_d \ & \nabla f(\alpha^k)^T d \\ & y^T d = 0 \\ & d_t \ge 0 \quad \text{if } \alpha^k_t = 0 \\ & d_t \le 0 \quad \text{if } \alpha^k_t = C \\ & -1 \le d_t \le 1 \\ & |\{h : d_h \ne 0\}| = 2 \end{aligned} \qquad (1.37)$$

The inequalities $-1 \le d_t \le 1$ prevent the objective function from tending to $-\infty$. Hence, the maximal violating pair is related to the minimization of the first order approximation

$$f(\alpha^k + d) \approx f(\alpha^k) + \nabla f(\alpha^k)^T d.$$

As $f$ is quadratic, we can write

$$f(\alpha^k + d) = f(\alpha^k) + \nabla f(\alpha^k)^T d + \frac{1}{2} d^T Q d,$$

and we can consider, instead of problem (1.37), the following problem

$$\begin{aligned} \min_d \ & q(d) = \nabla f(\alpha^k)^T d + \frac{1}{2} d^T Q d \\ & y^T d = 0 \\ & d_t \ge 0 \quad \text{if } \alpha^k_t = 0 \\ & d_t \le 0 \quad \text{if } \alpha^k_t = C \\ & |\{h : d_h \ne 0\}| = 2 \end{aligned} \qquad (1.38)$$

Note that the constraints $-1 \le d_t \le 1$ have been removed since, under suitable assumptions on $Q$, the objective function is bounded below. In particular, we assume that $Q$ is positive definite. Solving (1.38) exactly would require $O(\ell^2)$ operations, and this could be impractical whenever the number $\ell$ of training data is huge. In order to deal with this issue, a suitable working set selection rule using second order information has been designed, which requires $O(\ell)$ operations. More specifically, one index is selected as in the maximal violating pair rule, say $i$ such that

$$i \in I(\alpha^k) = \arg\max_{h \in R(\alpha^k)} \left\{-\frac{\nabla_h f(\alpha^k)}{y_h}\right\}, \qquad (1.39)$$

and the other index $j$ as the solution of a min-min problem, i.e.,

$$j \in \arg\min_{\substack{t \in S(\alpha^k):\\ y_i\nabla_i f(\alpha^k) - y_t\nabla_t f(\alpha^k) < 0}} \left\{ \min_{d_i, d_t} \ \nabla_i f(\alpha^k)\, d_i + \nabla_t f(\alpha^k)\, d_t + \frac{1}{2}\begin{pmatrix} d_i & d_t \end{pmatrix}\begin{pmatrix} q_{ii} & q_{it} \\ q_{it} & q_{tt} \end{pmatrix}\begin{pmatrix} d_i \\ d_t \end{pmatrix} \ \text{s.t.} \ y_i d_i + y_t d_t = 0 \right\}. \qquad (1.40)$$

Note that the condition $t \in S(\alpha^k)$ with $y_i\nabla_i f(\alpha^k) - y_t\nabla_t f(\alpha^k) < 0$ implies that the selected pair $(i, j)$, with $j$ a solution of (1.40), is a violating pair. We have omitted the constraints

$$d_t \ge 0 \ \text{if } \alpha^k_t = 0, \qquad d_t \le 0 \ \text{if } \alpha^k_t = C, \qquad (1.41)$$

since, as shown later, they are satisfied at the optimal solution. Given $t \in S(\alpha^k)$ such that $y_i\nabla_i f(\alpha^k) - y_t\nabla_t f(\alpha^k) < 0$, let us consider the inner problem

$$\min_{d_i, d_t} \ \nabla_i f(\alpha^k)\, d_i + \nabla_t f(\alpha^k)\, d_t + \frac{1}{2}\begin{pmatrix} d_i & d_t \end{pmatrix}\begin{pmatrix} q_{ii} & q_{it} \\ q_{it} & q_{tt} \end{pmatrix}\begin{pmatrix} d_i \\ d_t \end{pmatrix} \quad \text{s.t.} \quad y_i d_i + y_t d_t = 0. \qquad (1.42)$$

From the constraint $y_i d_i + y_t d_t = 0$ we get $d_i = -\dfrac{y_t}{y_i} d_t$. By substitution and simple calculations, recalling that $q_{ii} = K_{ii}$ and $q_{it} = y_i y_t K_{it}$, we obtain the equivalent problem

$$\min_{d_t} \ \frac{1}{2}(K_{ii} + K_{tt} - 2K_{it})\, y_t^2\, d_t^2 + \left(-\frac{\nabla_i f(\alpha^k)}{y_i} + \frac{\nabla_t f(\alpha^k)}{y_t}\right) y_t\, d_t. \qquad (1.43)$$

The assumption that $Q$ is positive definite implies $K_{ii} + K_{tt} - 2K_{it} > 0$, and hence subproblem (1.43) admits a solution. In particular, the solution is such that

$$y_t d_t = -y_i d_i = -\frac{b_{it}}{a_{it}} < 0, \qquad (1.44)$$

where

$$a_{it} = K_{ii} + K_{tt} - 2K_{it} > 0, \qquad b_{it} = -y_i\nabla_i f(\alpha^k) + y_t\nabla_t f(\alpha^k) > 0.$$

Note that the constraints (1.41) are satisfied. Indeed, suppose, for instance, that $\alpha^k_i = C$ and $\alpha^k_t = 0$. Since $i \in R(\alpha^k)$ and $t \in S(\alpha^k)$, by definition we have $y_i = -1$ and $y_t = -1$. From (1.44) it follows $d_i < 0$ and $d_t > 0$. The other cases are similar.

The optimal value of problem (1.42) is $-\dfrac{b_{it}^2}{2a_{it}}$. Therefore, the index $j$ corresponding to the solution of problem (1.40) can be computed as follows

$$j \in \arg\min_t \left\{ -\frac{b_{it}^2}{a_{it}} : t \in S(\alpha^k), \ y_i\nabla_i f(\alpha^k) - y_t\nabla_t f(\alpha^k) < 0 \right\}. \qquad (1.45)$$

It can be proved that the pair $(i, j)$, with $i$ satisfying (1.39) and $j$ satisfying (1.45), is a constant-factor violating pair. Indeed, let $(i, j^*)$ be a maximal violating pair. As $j^* \in S(\alpha^k)$ is feasible for (1.45), we can write

$$\frac{b_{ij}^2}{a_{ij}} \ge \frac{b_{ij^*}^2}{a_{ij^*}},$$

from which it follows

$$b_{ij} \ge \left(\frac{a_{ij}}{a_{ij^*}}\right)^{1/2} b_{ij^*} \ge \left(\frac{\min_{s,t} a_{st}}{\max_{s,t} a_{st}}\right)^{1/2} b_{ij^*}.$$

Then, condition (1.36) holds with

$$\sigma = \left(\frac{\min_{s,t} a_{st}}{\max_{s,t} a_{st}}\right)^{1/2}.$$

Therefore, an SMO algorithm using the second order working set rule defined by (1.39) and (1.45) is a convergent algorithm. We remark that the selection of the above pair $(i, j)$ requires $O(\ell)$ operations.
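A sketch of the second order selection rule (our naming), assuming for simplicity that the full Gram matrix $K$ is available (only its diagonal and row $i$ are actually used) and that $Q$ is positive definite, so that $a_{it} > 0$:

```python
import numpy as np

def second_order_pair(alpha, grad, y, C, K, eps=1e-3):
    """Second order working set selection: i by the first order rule (1.39),
    j by the closed-form min-min rule (1.45)."""
    v = -grad * y                                    # -(grad f)_h / y_h
    free = (alpha > 0) & (alpha < C)
    R = free | ((alpha <= 0) & (y > 0)) | ((alpha >= C) & (y < 0))
    S = free | ((alpha <= 0) & (y < 0)) | ((alpha >= C) & (y > 0))
    i = np.where(R)[0][np.argmax(v[R])]              # rule (1.39)
    cand = np.where(S & (v < v[i] - eps))[0]         # violating t's in S
    if cand.size == 0:                               # stopping test met
        return None
    b = v[i] - v[cand]                               # b_it > 0
    a = K[i, i] + K[cand, cand] - 2.0 * K[i, cand]   # a_it > 0
    j = cand[np.argmin(-(b ** 2) / a)]               # rule (1.45)
    return i, j
```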

1.9 On the stopping criterion

Let us introduce the functions $m(\alpha), M(\alpha) : F \to R$:

$$m(\alpha) = \begin{cases} \max_{h \in R(\alpha)} \left\{-\dfrac{(\nabla f(\alpha))_h}{y_h}\right\} & \text{if } R(\alpha) \ne \emptyset \\ -\infty & \text{otherwise,} \end{cases} \qquad M(\alpha) = \begin{cases} \min_{h \in S(\alpha)} \left\{-\dfrac{(\nabla f(\alpha))_h}{y_h}\right\} & \text{if } S(\alpha) \ne \emptyset \\ +\infty & \text{otherwise,} \end{cases}$$

where $R(\alpha)$ and $S(\alpha)$ are the index sets previously defined. From the definitions of $m(\alpha)$ and $M(\alpha)$, and using Proposition 1.3, it follows that $\bar\alpha$ is a solution of (1.27) if and only if $m(\bar\alpha) \le M(\bar\alpha)$.

Let us consider a sequence of feasible points $\{\alpha^k\}$ convergent to a solution $\bar\alpha$. At each iteration $k$, if $\alpha^k$ is not a solution then (using again Proposition 1.3) we have $m(\alpha^k) > M(\alpha^k)$. Therefore, a commonly adopted stopping criterion is

$$m(\alpha^k) \le M(\alpha^k) + \varepsilon, \qquad (1.46)$$

where $\varepsilon > 0$. Note that the functions $m(\alpha)$ and $M(\alpha)$ are not continuous. Indeed, even assuming $\alpha^k \to \bar\alpha$ for $k \to \infty$, it may happen that $R(\alpha^k) \ne R(\bar\alpha)$ or $S(\alpha^k) \ne S(\bar\alpha)$ for $k$ sufficiently large. As a consequence, in general we cannot write

$$\lim_{k\to\infty} m(\alpha^k) = m(\bar\alpha) \quad \text{or} \quad \lim_{k\to\infty} M(\alpha^k) = M(\bar\alpha).$$

It can be proved that an SMO Algorithm using the constant-factor violating pair rule generates a sequence $\{\alpha^k\}$ such that $m(\alpha^k) - M(\alpha^k) \to 0$ for $k \to \infty$. Hence, for any $\varepsilon > 0$, an SMO algorithm of this type satisfies the stopping criterion (1.46) in a finite number of iterations.

1.10 Appendix A: Proof of existence and uniqueness of the optimal hyperplane

In this appendix we formally prove that the optimal hyperplane exists and is unique. To this aim we need some preliminary results.

Lemma 1.1. Let $H(\hat w, \hat b)$ be a separating hyperplane. Then

$$\rho(\hat w, \hat b) \ge \frac{1}{\|\hat w\|}.$$

Proof. Since $|\hat w^T x + \hat b| \ge 1$ for all $x \in A \cup B$, it follows that

$$\rho(\hat w, \hat b) = \min_{x \in A\cup B}\left\{\frac{|\hat w^T x + \hat b|}{\|\hat w\|}\right\} \ge \frac{1}{\|\hat w\|}.$$

Lemma 1.2. Given any separating hyperplane $H(\hat w, \hat b)$, there exists a separating hyperplane $H(\bar w, \bar b)$ such that

$$\rho(\hat w, \hat b) \le \rho(\bar w, \bar b) = \frac{1}{\|\bar w\|}. \qquad (1.47)$$

Moreover, there exist two points $x^+ \in A$ and $x^- \in B$ such that

$$\bar w^T x^+ + \bar b = 1, \qquad \bar w^T x^- + \bar b = -1. \qquad (1.48)$$

Proof. Let $\hat x^i \in A$ and $\hat x^j \in B$ be the closest points to $H(\hat w, \hat b)$, that is, the two points such that

$$\hat d_i = \frac{\hat w^T\hat x^i + \hat b}{\|\hat w\|} \le \frac{\hat w^T x^i + \hat b}{\|\hat w\|} \ \forall x^i \in A, \qquad \hat d_j = \frac{-(\hat w^T\hat x^j + \hat b)}{\|\hat w\|} \le \frac{-(\hat w^T x^j + \hat b)}{\|\hat w\|} \ \forall x^j \in B, \qquad (1.49)$$

from which it follows

$$\rho(\hat w, \hat b) = \min\{\hat d_i, \hat d_j\} \le \frac{1}{2}(\hat d_i + \hat d_j) = \frac{\hat w^T(\hat x^i - \hat x^j)}{2\|\hat w\|}. \qquad (1.50)$$

Let us consider the numbers $\alpha$ and $\beta$ such that

$$\alpha\,\hat w^T\hat x^i + \beta = 1, \qquad \alpha\,\hat w^T\hat x^j + \beta = -1, \qquad (1.51)$$

that is, the numbers

$$\alpha = \frac{2}{\hat w^T(\hat x^i - \hat x^j)}, \qquad \beta = -\frac{\hat w^T(\hat x^i + \hat x^j)}{\hat w^T(\hat x^i - \hat x^j)}.$$

It can be easily verified that $0 < \alpha \le 1$. We will show that the hyperplane $H(\bar w, \bar b) \equiv H(\alpha\hat w, \beta)$ is a separating hyperplane for the sets A and B, and that it is such that (1.47) holds. Indeed, using (1.49), we have

$$\hat w^T x^i \ge \hat w^T\hat x^i \ \forall x^i \in A, \qquad \hat w^T x^j \le \hat w^T\hat x^j \ \forall x^j \in B.$$

As $\alpha > 0$, we can write

$$\alpha\hat w^T x^i + \beta \ge \alpha\hat w^T\hat x^i + \beta = 1 \ \forall x^i \in A, \qquad \alpha\hat w^T x^j + \beta \le \alpha\hat w^T\hat x^j + \beta = -1 \ \forall x^j \in B, \qquad (1.52)$$

from which we get that $\bar w$ and $\bar b$ satisfy (1.2), and hence that $H(\bar w, \bar b)$ is a separating hyperplane for the sets A and B. Furthermore, taking into account (1.52) and the value of $\alpha$, we have

$$\rho(\bar w, \bar b) = \min_{x \in A\cup B}\left\{\frac{|\bar w^T x + \bar b|}{\|\bar w\|}\right\} = \frac{1}{\|\bar w\|} = \frac{1}{\alpha\|\hat w\|} = \frac{\hat w^T(\hat x^i - \hat x^j)}{2\|\hat w\|}.$$

Condition (1.47) follows from the above equality and (1.50). Using (1.51) we obtain that (1.48) holds with $x^+ = \hat x^i$ and $x^- = \hat x^j$.

Proposition 1.5. The following problem

$$\min \ \|w\|^2 \quad \text{s.t.} \quad w^T x^i + b \ge 1 \ \forall x^i \in A, \quad w^T x^j + b \le -1 \ \forall x^j \in B \qquad (1.53)$$

admits a unique solution $(w^*, b^*)$.

Proof. Let $F$ be the feasible set, that is,

$$F = \{(w, b) \in R^n\times R : w^T x^i + b \ge 1 \ \forall x^i \in A, \ w^T x^j + b \le -1 \ \forall x^j \in B\}.$$

Given any $(w_o, b_o) \in F$, let us consider the level set

$$L_o = \{(w, b) \in F : \|w\|^2 \le \|w_o\|^2\}.$$

The set $L_o$ is closed, and we will show that it is also bounded. To this aim, assume by contradiction that there exists an unbounded sequence $\{(w_k, b_k)\}$ belonging to $L_o$. Since $\|w_k\| \le \|w_o\|$ for all $k$, we must have $|b_k| \to \infty$. For any $k$ we can write

$$w_k^T x^i + b_k \ge 1 \ \forall x^i \in A, \qquad w_k^T x^j + b_k \le -1 \ \forall x^j \in B,$$

and hence, as $|b_k| \to \infty$, for $k$ sufficiently large we would have $\|w_k\|^2 > \|w_o\|^2$, and this contradicts the fact that $\{(w_k, b_k)\}$ belongs to $L_o$. Then $L_o$ is a compact set. Weierstrass' theorem implies that the function $\|w\|^2$ admits a minimum $(w^*, b^*)$ on $L_o$, and hence on $F$. As a consequence, $(w^*, b^*)$ is a solution of (1.53).

In order to prove that $(w^*, b^*)$ is the unique solution, assume by contradiction that there exists a pair $(\bar w, \bar b) \in F$, $(\bar w, \bar b) \ne (w^*, b^*)$, such that $\|\bar w\|^2 = \|w^*\|^2$. Suppose $\bar w \ne w^*$. The set $F$ is convex, so that

$$\lambda(w^*, b^*) + (1-\lambda)(\bar w, \bar b) \in F \quad \forall \lambda \in [0,1].$$

Since $\|w\|^2$ is a strictly convex function, for any $\lambda \in (0,1)$ it follows that

$$\|\lambda w^* + (1-\lambda)\bar w\|^2 < \lambda\|w^*\|^2 + (1-\lambda)\|\bar w\|^2.$$

Taking $\lambda = 1/2$, which corresponds to considering the pair $(\tilde w, \tilde b) \equiv \left(\frac{1}{2}w^* + \frac{1}{2}\bar w, \ \frac{1}{2}b^* + \frac{1}{2}\bar b\right)$, we have $(\tilde w, \tilde b) \in F$ and

$$\|\tilde w\|^2 < \frac{1}{2}\|w^*\|^2 + \frac{1}{2}\|\bar w\|^2 = \|w^*\|^2,$$

and this contradicts the fact that $(w^*, b^*)$ is a global minimum. Therefore, we must have $\bar w = w^*$. Assume $b^* > \bar b$ (the case $b^* < \bar b$ is analogous), and consider the point $\hat x^i \in A$ such that $(w^*)^T\hat x^i + b^* = 1$ (the existence of such a point follows from (1.48) of Lemma 1.2). We have

$$1 = (w^*)^T\hat x^i + b^* = \bar w^T\hat x^i + b^* > \bar w^T\hat x^i + \bar b,$$

and this contradicts the fact that $\bar w^T x^i + \bar b \ge 1$ for all $x^i \in A$. As a consequence, we must have $b^* = \bar b$, and hence the uniqueness of the solution is proved.

Proposition 1.6. Let $(w^*, b^*)$ be the solution of (1.53). Then $(w^*, b^*)$ is the unique solution of the following problem

$$\max \ \rho(w, b) \quad \text{s.t.} \quad w^T x^i + b \ge 1 \ \forall x^i \in A, \quad w^T x^j + b \le -1 \ \forall x^j \in B, \qquad (1.54)$$

and hence it is the optimal hyperplane.

Proof. We observe that $(w^*, b^*)$ is the unique solution of the problem

$$\max \ \frac{1}{\|w\|} \quad \text{s.t.} \quad w^T x^i + b \ge 1 \ \forall x^i \in A, \quad w^T x^j + b \le -1 \ \forall x^j \in B.$$

Lemma 1.1 and Lemma 1.2 imply that, for any separating hyperplane $H(w, b)$, we have

$$\frac{1}{\|w\|} \le \rho(w, b) \le \frac{1}{\|w^*\|},$$

and hence, for the separating hyperplane $H(w^*, b^*)$ we obtain $\rho(w^*, b^*) = \dfrac{1}{\|w^*\|}$, which implies that $H(w^*, b^*)$ is the optimal separating hyperplane.

1.11 Appendix B: the Wolfe dual

The idea underlying duality theory is that of associating with a given minimum problem, called the primal problem,

$$\min_{x \in S} f(x),$$

a maximum problem, called the dual problem,

$$\max_{u \in U} \psi(u),$$

in such a way that (at least) the following weak duality property holds:

$$\inf_{x \in S} f(x) \ge \sup_{u \in U} \psi(u).$$

Whenever the above property holds, it is possible to get useful information on the solutions of the primal problem by means of the analysis of the dual problem. For some classes of problems it is possible to state, under suitable assumptions, the following strong duality property:

$$\inf_{x \in S} f(x) = \sup_{u \in U} \psi(u).$$

Let us consider the problem

$$\begin{aligned}\min \ & f(x) \\ & g_i(x) \le 0, \quad i = 1,\ldots,m \\ & c_j^T x - d_j = 0, \quad j = 1,\ldots,p,\end{aligned} \qquad (1.55)$$

where $f : R^n \to R$ and $g_i : R^n \to R$, $i = 1,\ldots,m$, are convex, continuously differentiable functions. Let

$$L(x,\lambda,\mu) = f(x) + \sum_{i=1}^m \lambda_i g_i(x) + \sum_{j=1}^p \mu_j(c_j^T x - d_j)$$

be the Lagrangian.

Proposition 1.7. Assume that problem (1.55) admits a solution $x^*$ and that there exists a pair of Lagrange multipliers $(\lambda^*, \mu^*)$. Then $(x^*, \lambda^*, \mu^*)$ is a solution of the following problem

$$\begin{aligned}\max_{x,\lambda,\mu} \ & L(x,\lambda,\mu) \\ & \nabla_x L(x,\lambda,\mu) = 0 \\ & \lambda \ge 0.\end{aligned} \qquad (1.56)$$

Moreover, the duality gap is null, that is, $f(x^*) = L(x^*, \lambda^*, \mu^*)$.

Proof. The KKT conditions imply

$$\nabla_x L(x^*, \lambda^*, \mu^*) = 0, \qquad (1.57)$$
$$(\lambda^*)^T g(x^*) = 0, \quad \lambda^* \ge 0.$$

The point $(x^*, \lambda^*, \mu^*)$ is a feasible point of problem (1.56), and we have $f(x^*) = L(x^*, \lambda^*, \mu^*)$. We now show that $(x^*, \lambda^*, \mu^*)$ is a solution of problem (1.56). Let $(x, \lambda, \mu)$ be a feasible point of problem (1.56), that is, such that $\nabla_x L(x,\lambda,\mu) = 0$ and $\lambda \ge 0$. For each $\lambda \ge 0$ and for each $\mu$, the function of $x$

$$L(x, \lambda, \mu) = f(x) + \sum_{i=1}^m \lambda_i g_i(x) + \sum_{j=1}^p \mu_j(c_j^T x - d_j)$$

is a convex function. Indeed, $f$ is a convex function; the second term is a linear combination, with nonnegative coefficients, of convex functions, so that it is a convex function; the third term is an affine function. Therefore, $L(x,\lambda,\mu)$ is the sum of convex functions and hence is convex.

From (1.57), recalling that $g(x^*) \le 0$ and $\lambda \ge 0$, taking into account that $L$, as a function of $x$, is convex (so that $L(y) \ge L(x) + (y-x)^T\nabla_x L(x)$), and using the condition $\nabla_x L = 0$, for each $\lambda \ge 0$ and for each $\mu$ we can write

$$L(x^*,\lambda^*,\mu^*) = f(x^*) \ge f(x^*) + \sum_{i=1}^m \lambda_i g_i(x^*) = L(x^*,\lambda,\mu) \ge L(x,\lambda,\mu) + (x^* - x)^T\nabla_x L(x,\lambda,\mu) = L(x,\lambda,\mu),$$

which proves the thesis.

Problem (1.56) is usually referred to as the Wolfe dual. We observe that, in the general case, given a solution $(\bar x, \bar\lambda, \bar\mu)$ of the Wolfe dual, we cannot state that $\bar x$ is a solution of the primal problem and that $(\bar\lambda, \bar\mu)$ is a pair of Lagrange multipliers.

1.11.1 The Wolfe dual for quadratic programming

Let us consider the following quadratic programming problem

$$\min \ f(x) = \frac{1}{2}x^T Qx + c^T x \quad \text{s.t.} \quad Ax - b \le 0, \qquad (1.58)$$

where $Q \in R^{n\times n}$ is symmetric, $A \in R^{m\times n}$, $c \in R^n$, $b \in R^m$. Letting $L(x,\lambda) = f(x) + \lambda^T(Ax - b)$, the Wolfe dual is defined as follows

$$\begin{aligned}\max_{x,\lambda} \ & L(x,\lambda) \\ & \nabla_x L(x,\lambda) = 0 \\ & \lambda \ge 0.\end{aligned} \qquad (1.59)$$

We can state the following result.

Proposition 1.8. Assume that $Q$ is a symmetric positive semidefinite matrix. Let $(\bar x, \bar\lambda)$ be a solution of the Wolfe dual (1.59). Then there exists a vector $x^*$ (not necessarily equal to $\bar x$) such that

(i) $Q(x^* - \bar x) = 0$;
(ii) $x^*$ is a solution of problem (1.58);
(iii) $(x^*, \bar\lambda)$ is a pair (global minimum, vector of Lagrange multipliers).

Proof. Let us consider the dual problem (1.59):

$$\begin{aligned}\max_{x,\lambda} \ & \frac{1}{2}x^T Qx + c^T x + \lambda^T(Ax - b) \\ & Qx + A^T\lambda + c = 0 \\ & \lambda \ge 0.\end{aligned}$$

The constraint $Qx + A^T\lambda + c = 0$ implies

$$x^T Qx + c^T x + \lambda^T Ax = 0. \qquad (1.60)$$

Using (1.60), the dual problem can be rewritten in the form

$$\begin{aligned}\min_{x,\lambda} \ & \frac{1}{2}x^T Qx + \lambda^T b \\ & Qx + A^T\lambda + c = 0 \\ & \lambda \ge 0.\end{aligned} \qquad (1.61)$$

Let $(\bar x, \bar\lambda)$ be a solution of (1.61). Consider the Lagrangian of problem (1.61):

$$W(x,\lambda,v,z) = \frac{1}{2}x^T Qx + \lambda^T b - v^T(Qx + A^T\lambda + c) - z^T\lambda.$$

Since $(\bar x, \bar\lambda)$ is a solution of the Wolfe dual (1.59), from the KKT conditions we get that there exist vectors $\bar v \in R^n$ and $\bar z \in R^m$ such that

$$\begin{aligned} & \nabla_x W = Q\bar x - Q\bar v = 0 \\ & \nabla_\lambda W = b - A\bar v - \bar z = 0 \\ & Q\bar x + A^T\bar\lambda + c = 0 \\ & \bar z^T\bar\lambda = 0 \\ & \bar z \ge 0, \quad \bar\lambda \ge 0. \end{aligned} \qquad (1.62)$$

From the second and fifth conditions we have $\bar z = b - A\bar v \ge 0$, and hence the above conditions can be rewritten as follows

$$\begin{aligned} & Q\bar x - Q\bar v = 0 \\ & -b + A\bar v \le 0 \\ & Q\bar x + A^T\bar\lambda + c = 0 \\ & -\bar\lambda^T b + \bar\lambda^T A\bar v = 0 \\ & \bar\lambda \ge 0. \end{aligned} \qquad (1.63)$$

By subtracting the first condition from the third condition we obtain

$$Q\bar v + A^T\bar\lambda + c = 0. \qquad (1.64)$$

As the matrix $Q$ is positive semidefinite, the function $f$ is convex. From the optimality conditions in the case of linear constraints it follows that the conditions

$$A\bar v - b \le 0, \qquad Q\bar v + A^T\bar\lambda + c = 0, \qquad \bar\lambda^T(A\bar v - b) = 0, \qquad \bar\lambda \ge 0$$

are sufficient to ensure that $\bar v$ is a solution of problem (1.58). Therefore, by definition, $(\bar v, \bar\lambda)$ is a pair (global minimum, vector of Lagrange multipliers). Letting $x^* = \bar v$, we have that $x^*$ is a solution of problem (1.58); moreover, using the first condition of (1.63), we obtain

$$Qx^* = Q\bar v = Q\bar x.$$

Hence, assertions (i)-(iii) are proved.

1.12 Appendix C: Optimality conditions

We prove here Proposition 1.3. To this aim we state some results based on the manipulation of the KKT conditions. Since $f$ is convex and the constraints are linear, a feasible point $\alpha^*$ is a solution of problem (1.27) if and only if the KKT conditions hold. Let us introduce the Lagrangian

$$L(\alpha,\lambda,\xi,\hat\xi) = \frac{1}{2}\alpha^T Q\alpha - e^T\alpha + \lambda\, y^T\alpha - \xi^T\alpha + \hat\xi^T(\alpha - Ce),$$

where $\alpha \in R^\ell$, $\lambda \in R$, $\xi, \hat\xi \in R^\ell$.

Proposition 1.9. A feasible point $\alpha^*$ is a solution of problem (1.27) if and only if there exists a scalar $\lambda^*$ such that

$$(\nabla f(\alpha^*))_i + \lambda^* y_i \begin{cases} \ge 0 & \text{if } i \in L(\alpha^*) \\ \le 0 & \text{if } i \in U(\alpha^*) \\ = 0 & \text{if } i \notin L(\alpha^*) \cup U(\alpha^*). \end{cases} \qquad (1.65)$$

Proof. Since $f$ is convex and the constraints are linear, a feasible point $\alpha^*$ is a solution of problem (1.27) if and only if there exist Lagrange multipliers $\lambda^* \in R$, $\xi^*, \hat\xi^* \in R^\ell$ such that

$$\nabla f(\alpha^*) + \lambda^* y - \xi^* + \hat\xi^* = 0, \qquad (1.66)$$
$$(\xi^*)^T\alpha^* = 0, \qquad (1.67)$$
$$(\hat\xi^*)^T(\alpha^* - Ce) = 0, \qquad (1.68)$$
$$\xi^*, \hat\xi^* \ge 0. \qquad (1.69)$$

We will show that conditions (1.66)-(1.69) are satisfied if and only if (1.65) holds.

(a) Suppose that $\alpha^*$ is a feasible point and that (1.66)-(1.69) hold. Let $\alpha_i^* = 0$, so that $i \in L(\alpha^*)$. The complementarity condition (1.68) implies $\hat\xi_i^* = 0$, and hence, using (1.66) and (1.69), we obtain

$$(\nabla f(\alpha^*))_i + \lambda^* y_i = \xi_i^* \ge 0.$$

In the same way, let $\alpha_i^* = C$, so that $i \in U(\alpha^*)$. The complementarity condition (1.67) implies $\xi_i^* = 0$, and hence, using (1.66) and (1.69), we obtain

$$(\nabla f(\alpha^*))_i + \lambda^* y_i = -\hat\xi_i^* \le 0.$$

Finally, assume $0 < \alpha_i^* < C$, so that $i \notin L(\alpha^*)\cup U(\alpha^*)$. The complementarity conditions (1.67) and (1.68) imply $\xi_i^* = \hat\xi_i^* = 0$, and hence, using (1.66), we obtain

$$(\nabla f(\alpha^*))_i + \lambda^* y_i = 0.$$

(b) Suppose that $\alpha^*$ is a feasible point and that (1.65) holds. For $i = 1,\ldots,\ell$:

- if $\alpha_i^* = 0$ then set $\hat\xi_i^* = 0$ and $\xi_i^* = (\nabla f(\alpha^*))_i + \lambda^* y_i$;
- if $\alpha_i^* = C$ then set $\xi_i^* = 0$ and $\hat\xi_i^* = -\left[(\nabla f(\alpha^*))_i + \lambda^* y_i\right]$;
- if $0 < \alpha_i^* < C$ then set $\xi_i^* = 0$ and $\hat\xi_i^* = 0$.

It can be easily verified that conditions (1.66)-(1.69) are satisfied.

From Proposition 1.9 the next result follows.

Proposition 1.10. A point $\alpha^* \in F$ is a solution of problem (1.27) if and only if there exists a scalar $\lambda^*$ such that

$$\begin{aligned} \lambda^* &\ge -\frac{(\nabla f(\alpha^*))_i}{y_i} \quad \forall i \in L^+(\alpha^*)\cup U^-(\alpha^*) \\ \lambda^* &\le -\frac{(\nabla f(\alpha^*))_i}{y_i} \quad \forall i \in L^-(\alpha^*)\cup U^+(\alpha^*) \\ \lambda^* &= -\frac{(\nabla f(\alpha^*))_i}{y_i} \quad \forall i \notin L(\alpha^*)\cup U(\alpha^*). \end{aligned} \qquad (1.70)$$

We are now ready to prove Proposition 1.3.

Proof of Proposition 1.3. (a) Assume that the feasible point $\alpha^*$ is a solution. Proposition 1.10 implies that there exists a multiplier $\lambda^*$ such that the pair $(\alpha^*, \lambda^*)$ satisfies conditions (1.70). The latter can be rewritten as follows

$$\max_{i \in L^+(\alpha^*)\cup U^-(\alpha^*)} \left\{-\frac{(\nabla f(\alpha^*))_i}{y_i}\right\} \le \lambda^* \le \min_{i \in L^-(\alpha^*)\cup U^+(\alpha^*)} \left\{-\frac{(\nabla f(\alpha^*))_i}{y_i}\right\}, \qquad \lambda^* = -\frac{(\nabla f(\alpha^*))_i}{y_i} \ \forall i \notin L(\alpha^*)\cup U(\alpha^*).$$

From the definition of the sets $R(\alpha^*)$ and $S(\alpha^*)$ we obtain

$$\max_{i \in R(\alpha^*)} \left\{-\frac{(\nabla f(\alpha^*))_i}{y_i}\right\} \le \min_{j \in S(\alpha^*)} \left\{-\frac{(\nabla f(\alpha^*))_j}{y_j}\right\},$$

and hence (1.34) is verified.

(b) Assume that (1.34) holds. We can define a multiplier $\lambda^*$ such that

$$\max_{i \in R(\alpha^*)} \left\{-\frac{(\nabla f(\alpha^*))_i}{y_i}\right\} \le \lambda^* \le \min_{i \in S(\alpha^*)} \left\{-\frac{(\nabla f(\alpha^*))_i}{y_i}\right\}. \qquad (1.71)$$

Then the inequalities of (1.70) hold. The definition of $R(\alpha^*)$ and $S(\alpha^*)$, and the choice of the multiplier $\lambda^*$ (which allows us to satisfy (1.71)), imply

$$\max_{\{i : 0 < \alpha^*_i < C\}} \left\{-\frac{(\nabla f(\alpha^*))_i}{y_i}\right\} \le \lambda^* \le \min_{\{i : 0 < \alpha^*_i < C\}} \left\{-\frac{(\nabla f(\alpha^*))_i}{y_i}\right\}.$$

Therefore, the equalities of (1.70) hold and the thesis is proved.


Course 2BA1, Section 11: Periodic Functions and Fourier Series Course BA, 8 9 Section : Periodic Functions and Fourier Series David R. Wikins Copyright c David R. Wikins 9 Contents Periodic Functions and Fourier Series 74. Fourier Series of Even and Odd Functions...........

More information

A proposed nonparametric mixture density estimation using B-spline functions

A proposed nonparametric mixture density estimation using B-spline functions A proposed nonparametric mixture density estimation using B-spine functions Atizez Hadrich a,b, Mourad Zribi a, Afif Masmoudi b a Laboratoire d Informatique Signa et Image de a Côte d Opae (LISIC-EA 4491),

More information

Semidefinite relaxation and Branch-and-Bound Algorithm for LPECs

Semidefinite relaxation and Branch-and-Bound Algorithm for LPECs Semidefinite reaxation and Branch-and-Bound Agorithm for LPECs Marcia H. C. Fampa Universidade Federa do Rio de Janeiro Instituto de Matemática e COPPE. Caixa Posta 68530 Rio de Janeiro RJ 21941-590 Brasi

More information

FRIEZE GROUPS IN R 2

FRIEZE GROUPS IN R 2 FRIEZE GROUPS IN R 2 MAXWELL STOLARSKI Abstract. Focusing on the Eucidean pane under the Pythagorean Metric, our goa is to cassify the frieze groups, discrete subgroups of the set of isometries of the

More information

Active Learning & Experimental Design

Active Learning & Experimental Design Active Learning & Experimenta Design Danie Ting Heaviy modified, of course, by Lye Ungar Origina Sides by Barbara Engehardt and Aex Shyr Lye Ungar, University of Pennsyvania Motivation u Data coection

More information

Lecture 6: Moderately Large Deflection Theory of Beams

Lecture 6: Moderately Large Deflection Theory of Beams Structura Mechanics 2.8 Lecture 6 Semester Yr Lecture 6: Moderatey Large Defection Theory of Beams 6.1 Genera Formuation Compare to the cassica theory of beams with infinitesima deformation, the moderatey

More information

14 Separation of Variables Method

14 Separation of Variables Method 14 Separation of Variabes Method Consider, for exampe, the Dirichet probem u t = Du xx < x u(x, ) = f(x) < x < u(, t) = = u(, t) t > Let u(x, t) = T (t)φ(x); now substitute into the equation: dt

More information

BALANCING REGULAR MATRIX PENCILS

BALANCING REGULAR MATRIX PENCILS BALANCING REGULAR MATRIX PENCILS DAMIEN LEMONNIER AND PAUL VAN DOOREN Abstract. In this paper we present a new diagona baancing technique for reguar matrix pencis λb A, which aims at reducing the sensitivity

More information

MA 201: Partial Differential Equations Lecture - 10

MA 201: Partial Differential Equations Lecture - 10 MA 201: Partia Differentia Equations Lecture - 10 Separation of Variabes, One dimensiona Wave Equation Initia Boundary Vaue Probem (IBVP) Reca: A physica probem governed by a PDE may contain both boundary

More information

Approximated MLC shape matrix decomposition with interleaf collision constraint

Approximated MLC shape matrix decomposition with interleaf collision constraint Approximated MLC shape matrix decomposition with intereaf coision constraint Thomas Kainowski Antje Kiese Abstract Shape matrix decomposition is a subprobem in radiation therapy panning. A given fuence

More information

Multicategory Classification by Support Vector Machines

Multicategory Classification by Support Vector Machines Muticategory Cassification by Support Vector Machines Erin J Bredensteiner Department of Mathematics University of Evansvie 800 Lincon Avenue Evansvie, Indiana 47722 eb6@evansvieedu Kristin P Bennett Department

More information

Expectation-Maximization for Estimating Parameters for a Mixture of Poissons

Expectation-Maximization for Estimating Parameters for a Mixture of Poissons Expectation-Maximization for Estimating Parameters for a Mixture of Poissons Brandon Maone Department of Computer Science University of Hesini February 18, 2014 Abstract This document derives, in excrutiating

More information

SEMINAR 2. PENDULUMS. V = mgl cos θ. (2) L = T V = 1 2 ml2 θ2 + mgl cos θ, (3) d dt ml2 θ2 + mgl sin θ = 0, (4) θ + g l

SEMINAR 2. PENDULUMS. V = mgl cos θ. (2) L = T V = 1 2 ml2 θ2 + mgl cos θ, (3) d dt ml2 θ2 + mgl sin θ = 0, (4) θ + g l Probem 7. Simpe Penduum SEMINAR. PENDULUMS A simpe penduum means a mass m suspended by a string weightess rigid rod of ength so that it can swing in a pane. The y-axis is directed down, x-axis is directed

More information

A. Distribution of the test statistic

A. Distribution of the test statistic A. Distribution of the test statistic In the sequentia test, we first compute the test statistic from a mini-batch of size m. If a decision cannot be made with this statistic, we keep increasing the mini-batch

More information

Asymptotic Properties of a Generalized Cross Entropy Optimization Algorithm

Asymptotic Properties of a Generalized Cross Entropy Optimization Algorithm 1 Asymptotic Properties of a Generaized Cross Entropy Optimization Agorithm Zijun Wu, Michae Koonko, Institute for Appied Stochastics and Operations Research, Caustha Technica University Abstract The discrete

More information

Lecture Notes 4: Fourier Series and PDE s

Lecture Notes 4: Fourier Series and PDE s Lecture Notes 4: Fourier Series and PDE s 1. Periodic Functions A function fx defined on R is caed a periodic function if there exists a number T > such that fx + T = fx, x R. 1.1 The smaest number T for

More information

Approximated MLC shape matrix decomposition with interleaf collision constraint

Approximated MLC shape matrix decomposition with interleaf collision constraint Agorithmic Operations Research Vo.4 (29) 49 57 Approximated MLC shape matrix decomposition with intereaf coision constraint Antje Kiese and Thomas Kainowski Institut für Mathematik, Universität Rostock,

More information

ORTHOGONAL MULTI-WAVELETS FROM MATRIX FACTORIZATION

ORTHOGONAL MULTI-WAVELETS FROM MATRIX FACTORIZATION J. Korean Math. Soc. 46 2009, No. 2, pp. 281 294 ORHOGONAL MLI-WAVELES FROM MARIX FACORIZAION Hongying Xiao Abstract. Accuracy of the scaing function is very crucia in waveet theory, or correspondingy,

More information

Discrete Techniques. Chapter Introduction

Discrete Techniques. Chapter Introduction Chapter 3 Discrete Techniques 3. Introduction In the previous two chapters we introduced Fourier transforms of continuous functions of the periodic and non-periodic (finite energy) type, we as various

More information

Separation of Variables and a Spherical Shell with Surface Charge

Separation of Variables and a Spherical Shell with Surface Charge Separation of Variabes and a Spherica She with Surface Charge In cass we worked out the eectrostatic potentia due to a spherica she of radius R with a surface charge density σθ = σ cos θ. This cacuation

More information

Available online at ScienceDirect. IFAC PapersOnLine 50-1 (2017)

Available online at   ScienceDirect. IFAC PapersOnLine 50-1 (2017) Avaiabe onine at www.sciencedirect.com ScienceDirect IFAC PapersOnLine 50-1 (2017 3412 3417 Stabiization of discrete-time switched inear systems: Lyapunov-Metzer inequaities versus S-procedure characterizations

More information

Approximate Bandwidth Allocation for Fixed-Priority-Scheduled Periodic Resources (WSU-CS Technical Report Version)

Approximate Bandwidth Allocation for Fixed-Priority-Scheduled Periodic Resources (WSU-CS Technical Report Version) Approximate Bandwidth Aocation for Fixed-Priority-Schedued Periodic Resources WSU-CS Technica Report Version) Farhana Dewan Nathan Fisher Abstract Recent research in compositiona rea-time systems has focused

More information

Asynchronous Control for Coupled Markov Decision Systems

Asynchronous Control for Coupled Markov Decision Systems INFORMATION THEORY WORKSHOP (ITW) 22 Asynchronous Contro for Couped Marov Decision Systems Michae J. Neey University of Southern Caifornia Abstract This paper considers optima contro for a coection of

More information

Appendix of the Paper The Role of No-Arbitrage on Forecasting: Lessons from a Parametric Term Structure Model

Appendix of the Paper The Role of No-Arbitrage on Forecasting: Lessons from a Parametric Term Structure Model Appendix of the Paper The Roe of No-Arbitrage on Forecasting: Lessons from a Parametric Term Structure Mode Caio Ameida cameida@fgv.br José Vicente jose.vaentim@bcb.gov.br June 008 1 Introduction In this

More information

Indirect Optimal Control of Dynamical Systems

Indirect Optimal Control of Dynamical Systems Computationa Mathematics and Mathematica Physics, Vo. 44, No. 3, 24, pp. 48 439. Transated from Zhurna Vychisite noi Matematiki i Matematicheskoi Fiziki, Vo. 44, No. 3, 24, pp. 444 466. Origina Russian

More information

Robust Sensitivity Analysis for Linear Programming with Ellipsoidal Perturbation

Robust Sensitivity Analysis for Linear Programming with Ellipsoidal Perturbation Robust Sensitivity Anaysis for Linear Programming with Eipsoida Perturbation Ruotian Gao and Wenxun Xing Department of Mathematica Sciences Tsinghua University, Beijing, China, 100084 September 27, 2017

More information

$, (2.1) n="# #. (2.2)

$, (2.1) n=# #. (2.2) Chapter. Eectrostatic II Notes: Most of the materia presented in this chapter is taken from Jackson, Chap.,, and 4, and Di Bartoo, Chap... Mathematica Considerations.. The Fourier series and the Fourier

More information

Sequential Decoding of Polar Codes with Arbitrary Binary Kernel

Sequential Decoding of Polar Codes with Arbitrary Binary Kernel Sequentia Decoding of Poar Codes with Arbitrary Binary Kerne Vera Miosavskaya, Peter Trifonov Saint-Petersburg State Poytechnic University Emai: veram,petert}@dcn.icc.spbstu.ru Abstract The probem of efficient

More information

Discrete Techniques. Chapter Introduction

Discrete Techniques. Chapter Introduction Chapter 3 Discrete Techniques 3. Introduction In the previous two chapters we introduced Fourier transforms of continuous functions of the periodic and non-periodic (finite energy) type, as we as various

More information

Restricted weak type on maximal linear and multilinear integral maps.

Restricted weak type on maximal linear and multilinear integral maps. Restricted weak type on maxima inear and mutiinear integra maps. Oscar Basco Abstract It is shown that mutiinear operators of the form T (f 1,..., f k )(x) = R K(x, y n 1,..., y k )f 1 (y 1 )...f k (y

More information

Nonlinear Optimization and Support Vector Machines

Nonlinear Optimization and Support Vector Machines Noname manuscript No. (will be inserted by the editor) Nonlinear Optimization and Support Vector Machines Veronica Piccialli Marco Sciandrone Received: date / Accepted: date Abstract Support Vector Machine

More information

#A48 INTEGERS 12 (2012) ON A COMBINATORIAL CONJECTURE OF TU AND DENG

#A48 INTEGERS 12 (2012) ON A COMBINATORIAL CONJECTURE OF TU AND DENG #A48 INTEGERS 12 (2012) ON A COMBINATORIAL CONJECTURE OF TU AND DENG Guixin Deng Schoo of Mathematica Sciences, Guangxi Teachers Education University, Nanning, P.R.China dengguixin@ive.com Pingzhi Yuan

More information

DIFFERENCE-OF-CONVEX LEARNING: DIRECTIONAL STATIONARITY, OPTIMALITY, AND SPARSITY

DIFFERENCE-OF-CONVEX LEARNING: DIRECTIONAL STATIONARITY, OPTIMALITY, AND SPARSITY SIAM J. OPTIM. Vo. 7, No. 3, pp. 1637 1665 c 017 Society for Industria and Appied Mathematics DIFFERENCE-OF-CONVEX LEARNING: DIRECTIONAL STATIONARITY, OPTIMALITY, AND SPARSITY MIJU AHN, JONG-SHI PANG,

More information

SUPPLEMENTARY MATERIAL TO INNOVATED SCALABLE EFFICIENT ESTIMATION IN ULTRA-LARGE GAUSSIAN GRAPHICAL MODELS

SUPPLEMENTARY MATERIAL TO INNOVATED SCALABLE EFFICIENT ESTIMATION IN ULTRA-LARGE GAUSSIAN GRAPHICAL MODELS ISEE 1 SUPPLEMENTARY MATERIAL TO INNOVATED SCALABLE EFFICIENT ESTIMATION IN ULTRA-LARGE GAUSSIAN GRAPHICAL MODELS By Yingying Fan and Jinchi Lv University of Southern Caifornia This Suppementary Materia

More information

Minimizing Total Weighted Completion Time on Uniform Machines with Unbounded Batch

Minimizing Total Weighted Completion Time on Uniform Machines with Unbounded Batch The Eighth Internationa Symposium on Operations Research and Its Appications (ISORA 09) Zhangiaie, China, September 20 22, 2009 Copyright 2009 ORSC & APORC, pp. 402 408 Minimizing Tota Weighted Competion

More information

Tight Approximation Algorithms for Maximum Separable Assignment Problems

Tight Approximation Algorithms for Maximum Separable Assignment Problems MATHEMATICS OF OPERATIONS RESEARCH Vo. 36, No. 3, August 011, pp. 416 431 issn 0364-765X eissn 156-5471 11 3603 0416 10.187/moor.1110.0499 011 INFORMS Tight Approximation Agorithms for Maximum Separabe

More information

1D Heat Propagation Problems

1D Heat Propagation Problems Chapter 1 1D Heat Propagation Probems If the ambient space of the heat conduction has ony one dimension, the Fourier equation reduces to the foowing for an homogeneous body cρ T t = T λ 2 + Q, 1.1) x2

More information

Universal Consistency of Multi-Class Support Vector Classification

Universal Consistency of Multi-Class Support Vector Classification Universa Consistency of Muti-Cass Support Vector Cassification Tobias Gasmachers Dae Moe Institute for rtificia Inteigence IDSI, 6928 Manno-Lugano, Switzerand tobias@idsia.ch bstract Steinwart was the

More information

2M2. Fourier Series Prof Bill Lionheart

2M2. Fourier Series Prof Bill Lionheart M. Fourier Series Prof Bi Lionheart 1. The Fourier series of the periodic function f(x) with period has the form f(x) = a 0 + ( a n cos πnx + b n sin πnx ). Here the rea numbers a n, b n are caed the Fourier

More information

An approximate method for solving the inverse scattering problem with fixed-energy data

An approximate method for solving the inverse scattering problem with fixed-energy data J. Inv. I-Posed Probems, Vo. 7, No. 6, pp. 561 571 (1999) c VSP 1999 An approximate method for soving the inverse scattering probem with fixed-energy data A. G. Ramm and W. Scheid Received May 12, 1999

More information

Mat 1501 lecture notes, penultimate installment

Mat 1501 lecture notes, penultimate installment Mat 1501 ecture notes, penutimate instament 1. bounded variation: functions of a singe variabe optiona) I beieve that we wi not actuay use the materia in this section the point is mainy to motivate the

More information

Math 124B January 31, 2012

Math 124B January 31, 2012 Math 124B January 31, 212 Viktor Grigoryan 7 Inhomogeneous boundary vaue probems Having studied the theory of Fourier series, with which we successfuy soved boundary vaue probems for the homogeneous heat

More information

An explicit resolution of the equity-efficiency tradeoff in the random allocation of an indivisible good

An explicit resolution of the equity-efficiency tradeoff in the random allocation of an indivisible good An expicit resoution of the equity-efficiency tradeoff in the random aocation of an indivisibe good Stergios Athanassogou, Gauthier de Maere d Aertrycke January 2015 Abstract Suppose we wish to randomy

More information

SINGLE BASEPOINT SUBDIVISION SCHEMES FOR MANIFOLD-VALUED DATA: TIME-SYMMETRY WITHOUT SPACE-SYMMETRY

SINGLE BASEPOINT SUBDIVISION SCHEMES FOR MANIFOLD-VALUED DATA: TIME-SYMMETRY WITHOUT SPACE-SYMMETRY SINGLE BASEPOINT SUBDIVISION SCHEMES FOR MANIFOLD-VALUED DATA: TIME-SYMMETRY WITHOUT SPACE-SYMMETRY TOM DUCHAMP, GANG XIE, AND THOMAS YU Abstract. This paper estabishes smoothness resuts for a cass of

More information

Math 220B - Summer 2003 Homework 1 Solutions

Math 220B - Summer 2003 Homework 1 Solutions Math 0B - Summer 003 Homework Soutions Consider the eigenvaue probem { X = λx 0 < x < X satisfies symmetric BCs x = 0, Suppose f(x)f (x) x=b x=a 0 for a rea-vaued functions f(x) which satisfy the boundary

More information

Statistics for Applications. Chapter 7: Regression 1/43

Statistics for Applications. Chapter 7: Regression 1/43 Statistics for Appications Chapter 7: Regression 1/43 Heuristics of the inear regression (1) Consider a coud of i.i.d. random points (X i,y i ),i =1,...,n : 2/43 Heuristics of the inear regression (2)

More information

More Scattering: the Partial Wave Expansion

More Scattering: the Partial Wave Expansion More Scattering: the Partia Wave Expansion Michae Fower /7/8 Pane Waves and Partia Waves We are considering the soution to Schrödinger s equation for scattering of an incoming pane wave in the z-direction

More information

Inductive Bias: How to generalize on novel data. CS Inductive Bias 1

Inductive Bias: How to generalize on novel data. CS Inductive Bias 1 Inductive Bias: How to generaize on nove data CS 478 - Inductive Bias 1 Overfitting Noise vs. Exceptions CS 478 - Inductive Bias 2 Non-Linear Tasks Linear Regression wi not generaize we to the task beow

More information

An explicit Jordan Decomposition of Companion matrices

An explicit Jordan Decomposition of Companion matrices An expicit Jordan Decomposition of Companion matrices Fermín S V Bazán Departamento de Matemática CFM UFSC 88040-900 Forianópois SC E-mai: fermin@mtmufscbr S Gratton CERFACS 42 Av Gaspard Coriois 31057

More information

6.434J/16.391J Statistics for Engineers and Scientists May 4 MIT, Spring 2006 Handout #17. Solution 7

6.434J/16.391J Statistics for Engineers and Scientists May 4 MIT, Spring 2006 Handout #17. Solution 7 6.434J/16.391J Statistics for Engineers and Scientists May 4 MIT, Spring 2006 Handout #17 Soution 7 Probem 1: Generating Random Variabes Each part of this probem requires impementation in MATLAB. For the

More information

Some Measures for Asymmetry of Distributions

Some Measures for Asymmetry of Distributions Some Measures for Asymmetry of Distributions Georgi N. Boshnakov First version: 31 January 2006 Research Report No. 5, 2006, Probabiity and Statistics Group Schoo of Mathematics, The University of Manchester

More information

Sample Problems for Third Midterm March 18, 2013

Sample Problems for Third Midterm March 18, 2013 Mat 30. Treibergs Sampe Probems for Tird Midterm Name: Marc 8, 03 Questions 4 appeared in my Fa 000 and Fa 00 Mat 30 exams (.)Let f : R n R n be differentiabe at a R n. (a.) Let g : R n R be defined by

More information

Stochastic Variational Inference with Gradient Linearization

Stochastic Variational Inference with Gradient Linearization Stochastic Variationa Inference with Gradient Linearization Suppementa Materia Tobias Pötz * Anne S Wannenwetsch Stefan Roth Department of Computer Science, TU Darmstadt Preface In this suppementa materia,

More information

Akaike Information Criterion for ANOVA Model with a Simple Order Restriction

Akaike Information Criterion for ANOVA Model with a Simple Order Restriction Akaike Information Criterion for ANOVA Mode with a Simpe Order Restriction Yu Inatsu * Department of Mathematics, Graduate Schoo of Science, Hiroshima University ABSTRACT In this paper, we consider Akaike

More information

Uniprocessor Feasibility of Sporadic Tasks with Constrained Deadlines is Strongly conp-complete

Uniprocessor Feasibility of Sporadic Tasks with Constrained Deadlines is Strongly conp-complete Uniprocessor Feasibiity of Sporadic Tasks with Constrained Deadines is Strongy conp-compete Pontus Ekberg and Wang Yi Uppsaa University, Sweden Emai: {pontus.ekberg yi}@it.uu.se Abstract Deciding the feasibiity

More information

18-660: Numerical Methods for Engineering Design and Optimization

18-660: Numerical Methods for Engineering Design and Optimization 8-660: Numerica Methods for Engineering esign and Optimization in i epartment of ECE Carnegie Meon University Pittsburgh, PA 523 Side Overview Conjugate Gradient Method (Part 4) Pre-conditioning Noninear

More information

A NOTE ON QUASI-STATIONARY DISTRIBUTIONS OF BIRTH-DEATH PROCESSES AND THE SIS LOGISTIC EPIDEMIC

A NOTE ON QUASI-STATIONARY DISTRIBUTIONS OF BIRTH-DEATH PROCESSES AND THE SIS LOGISTIC EPIDEMIC (January 8, 2003) A NOTE ON QUASI-STATIONARY DISTRIBUTIONS OF BIRTH-DEATH PROCESSES AND THE SIS LOGISTIC EPIDEMIC DAMIAN CLANCY, University of Liverpoo PHILIP K. POLLETT, University of Queensand Abstract

More information

THE THREE POINT STEINER PROBLEM ON THE FLAT TORUS: THE MINIMAL LUNE CASE

THE THREE POINT STEINER PROBLEM ON THE FLAT TORUS: THE MINIMAL LUNE CASE THE THREE POINT STEINER PROBLEM ON THE FLAT TORUS: THE MINIMAL LUNE CASE KATIE L. MAY AND MELISSA A. MITCHELL Abstract. We show how to identify the minima path network connecting three fixed points on

More information

Statistical Learning Theory: a Primer

Statistical Learning Theory: a Primer ??,??, 1 6 (??) c?? Kuwer Academic Pubishers, Boston. Manufactured in The Netherands. Statistica Learning Theory: a Primer THEODOROS EVGENIOU AND MASSIMILIANO PONTIL Center for Bioogica and Computationa

More information

T.C. Banwell, S. Galli. {bct, Telcordia Technologies, Inc., 445 South Street, Morristown, NJ 07960, USA

T.C. Banwell, S. Galli. {bct, Telcordia Technologies, Inc., 445 South Street, Morristown, NJ 07960, USA ON THE SYMMETRY OF THE POWER INE CHANNE T.C. Banwe, S. Gai {bct, sgai}@research.tecordia.com Tecordia Technoogies, Inc., 445 South Street, Morristown, NJ 07960, USA Abstract The indoor power ine network

More information

Determining The Degree of Generalization Using An Incremental Learning Algorithm

Determining The Degree of Generalization Using An Incremental Learning Algorithm Determining The Degree of Generaization Using An Incrementa Learning Agorithm Pabo Zegers Facutad de Ingeniería, Universidad de os Andes San Caros de Apoquindo 22, Las Condes, Santiago, Chie pzegers@uandes.c

More information

Maximizing Sum Rate and Minimizing MSE on Multiuser Downlink: Optimality, Fast Algorithms and Equivalence via Max-min SIR

Maximizing Sum Rate and Minimizing MSE on Multiuser Downlink: Optimality, Fast Algorithms and Equivalence via Max-min SIR 1 Maximizing Sum Rate and Minimizing MSE on Mutiuser Downink: Optimaity, Fast Agorithms and Equivaence via Max-min SIR Chee Wei Tan 1,2, Mung Chiang 2 and R. Srikant 3 1 Caifornia Institute of Technoogy,

More information

Completion. is dense in H. If V is complete, then U(V) = H.

Completion. is dense in H. If V is complete, then U(V) = H. Competion Theorem 1 (Competion) If ( V V ) is any inner product space then there exists a Hibert space ( H H ) and a map U : V H such that (i) U is 1 1 (ii) U is inear (iii) UxUy H xy V for a xy V (iv)

More information