6.867 Machine learning, lecture 8 (Jaakkola)

Lecture topics: support vector machine and kernels; kernel optimization, selection

Support vector machine revisited

Our task here is to first turn the support vector machine into its dual form, where the examples only appear in inner products. To this end, assume we have mapped the examples into feature vectors φ(x) of dimension d and that the resulting training set (φ(x_1), y_1), ..., (φ(x_n), y_n) is linearly separable. Finding the maximum margin linear separator in the feature space now corresponds to solving

minimize ‖θ‖²/2 subject to y_t(θᵀφ(x_t) + θ_0) ≥ 1, t = 1, ..., n  (1)

We will discuss later on how slack variables affect the resulting kernel (dual) form. They merely complicate the derivation without changing the procedure. Optimization problems of the above type (convex objective, linear constraints) can be turned into their dual form by means of Lagrange multipliers. Specifically, we introduce a non-negative scalar parameter α_t for each inequality constraint and cast the estimation problem in terms of θ, θ_0, and α = {α_1, ..., α_n}:

J(θ, θ_0; α) = ‖θ‖²/2 − Σ_{t=1}^{n} α_t [ y_t(θᵀφ(x_t) + θ_0) − 1 ]  (2)

The original minimization problem for θ and θ_0 is recovered by maximizing J(θ, θ_0; α) with respect to α. In other words,

J(θ, θ_0) = max_{α ≥ 0} J(θ, θ_0; α)  (3)

where α ≥ 0 means that all the components α_t are non-negative. Let's first try to see that J(θ, θ_0) really is equivalent to the original problem. Suppose we set θ and θ_0 such that at least one of the constraints, say the one corresponding to (x_i, y_i), is violated. In that case

−[ y_i(θᵀφ(x_i) + θ_0) − 1 ] > 0  (4)

for any α_i > 0, the corresponding term of Eq. (2) is positive and grows with α_i. We can then let α_i → ∞ to obtain J(θ, θ_0) = ∞. You can think of the Lagrange multipliers as playing an adversarial role to enforce the margin constraints.

Cite as: Tommi Jaakkola, course materials for 6.867 Machine Learning, Fall 2006. MIT OpenCourseWare (http://ocw.mit.edu/), Massachusetts Institute of Technology. Downloaded on [DD Month YYYY].
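The adversarial role of the multipliers can be checked numerically. Below is a minimal sketch (the 1-D data and the specific θ values are hypothetical, chosen only to illustrate Eqs. (2)-(4)): for a feasible θ the maximizing α is 0 and J reduces to ‖θ‖²/2, while for an infeasible θ the Lagrangian grows without bound as the multiplier on a violated constraint increases.

```python
import numpy as np

# Hypothetical 1-D toy data with phi(x) = x, used only to illustrate Eqs. (2)-(4).
X = np.array([[2.0], [-1.0]])
y = np.array([1.0, -1.0])

def J(theta, theta0, alpha):
    # Eq. (2): ||theta||^2/2 - sum_t alpha_t [ y_t(theta^T phi(x_t) + theta0) - 1 ]
    slack = y * (X @ theta + theta0) - 1.0
    return 0.5 * theta @ theta - alpha @ slack

# Feasible (theta, theta0): both margin constraints hold, so the maximizing
# alpha is 0 and max_alpha J(theta, theta0; alpha) equals ||theta||^2/2.
theta, theta0 = np.array([1.0]), 0.0
print(y * (X @ theta + theta0))       # [2. 1.] -> both constraints satisfied
print(J(theta, theta0, np.zeros(2)))  # 0.5, i.e. ||theta||^2/2

# Infeasible theta: both margin constraints are violated, so J grows without
# bound as the multiplier on a violated constraint increases (Eq. (4)).
theta_bad = np.array([0.1])
for a in [1.0, 10.0, 100.0]:
    print(J(theta_bad, 0.0, np.array([0.0, a])))
```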
More formally,

J(θ, θ_0) = { ‖θ‖²/2  if y_t(θᵀφ(x_t) + θ_0) ≥ 1, t = 1, ..., n
            { ∞       otherwise  (5)

The minimizing θ and θ_0 are therefore those that satisfy the constraints. On the basis of a general set of criteria governing optimality when dealing with Lagrange multipliers, criteria known as Slater's conditions, we can actually switch the maximization over α and the minimization over {θ, θ_0} and get the same answer:

min_{θ,θ_0} max_{α ≥ 0} J(θ, θ_0; α) = max_{α ≥ 0} min_{θ,θ_0} J(θ, θ_0; α)  (6)

The left hand side, equivalent to minimizing Eq. (5), is known as the primal form, while the right hand side is the dual form. Let's solve the right hand side by first obtaining θ and θ_0 as a function of the Lagrange multipliers (and the data). To this end,

d/dθ_0 J(θ, θ_0; α) = −Σ_{t=1}^{n} α_t y_t = 0  (7)

d/dθ J(θ, θ_0; α) = θ − Σ_{t=1}^{n} α_t y_t φ(x_t) = 0  (8)

So, again, the solution for θ lies in the span of the feature vectors corresponding to the training examples. Substituting this form of the solution for θ back into the objective, and taking into account the constraint corresponding to the optimal θ_0, we get

J(α) = min_{θ,θ_0} J(θ, θ_0; α)  (9)
     = { Σ_{t=1}^{n} α_t − (1/2) Σ_{i=1}^{n} Σ_{j=1}^{n} α_i α_j y_i y_j [φ(x_i)ᵀφ(x_j)]  if Σ_{t=1}^{n} α_t y_t = 0
       { −∞  otherwise  (10)

The dual form of the solution is therefore obtained by maximizing

Σ_{t=1}^{n} α_t − (1/2) Σ_{i=1}^{n} Σ_{j=1}^{n} α_i α_j y_i y_j [φ(x_i)ᵀφ(x_j)]  (11)

subject to α_t ≥ 0, Σ_{t=1}^{n} α_t y_t = 0  (12)

This is the dual or kernel form of the support vector machine, and it is also a quadratic optimization problem. The constraints are simpler, however. Moreover, the dimension of
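The dual (11)-(12) can be solved numerically for a tiny dataset as a sanity check. The sketch below assumes numpy and scipy are available; the toy data and the use of a general-purpose SLSQP solver (rather than a dedicated QP or SMO solver) are illustrative choices, not part of the lecture.

```python
import numpy as np
from scipy.optimize import minimize

# Toy linearly separable data (hypothetical, for illustration); phi(x) = x,
# so the Gram matrix is simply X X^T.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
n = len(y)

K = X @ X.T
Q = (y[:, None] * y[None, :]) * K

def neg_dual(alpha):
    # Negative of the dual objective (11); we minimize it to maximize the dual.
    return -(alpha.sum() - 0.5 * alpha @ Q @ alpha)

res = minimize(neg_dual, np.zeros(n),
               bounds=[(0.0, None)] * n,                            # alpha_t >= 0
               constraints={"type": "eq", "fun": lambda a: a @ y})  # sum_t alpha_t y_t = 0
alpha = res.x

theta = (alpha * y) @ X   # Eq. (8): theta = sum_t alpha_t y_t phi(x_t)
print(theta)              # approximately [0.25 0.25] for this data
```

For this particular dataset the optimum can also be found geometrically (the closest points between the two classes are (2, 2) and (−2, −2)), which gives θ = (1/4, 1/4) and confirms the numerical solution.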
the input vectors does not appear explicitly as part of the optimization problem. It is formulated solely on the basis of the Gram matrix

    ⎡ φ(x_1)ᵀφ(x_1)  ···  φ(x_1)ᵀφ(x_n) ⎤
K = ⎢       ⋮                   ⋮       ⎥  (13)
    ⎣ φ(x_n)ᵀφ(x_1)  ···  φ(x_n)ᵀφ(x_n) ⎦

We have already seen that the maximum margin hyperplane can be constructed on the basis of only a subset of the training examples. This should hold in terms of the feature vectors as well. How will this be manifested in the α̂_t's? Many of them will be exactly zero as a result of the optimization. In fact, they are non-zero only for examples (feature vectors) that are support vectors. Once we have solved for α̂_t, we can classify any new example according to the discriminant function

ŷ(x) = θ̂ᵀφ(x) + θ̂_0  (14)
     = Σ_{t=1}^{n} α̂_t y_t [φ(x_t)ᵀφ(x)] + θ̂_0  (15)
     = Σ_{t ∈ SV} α̂_t y_t [φ(x_t)ᵀφ(x)] + θ̂_0  (16)

where SV is the set of support vectors, corresponding to the non-zero values of α̂_t. We don't know which examples (feature vectors) become support vectors until we have solved the optimization problem. Moreover, the identity of the support vectors depends on the feature mapping or the kernel function.

But what is θ̂_0? It appeared to drop out of the optimization problem. We can set θ̂_0 after solving for α̂_t by looking at the support vectors. Indeed, for all i ∈ SV we should have

y_i (θ̂ᵀφ(x_i) + θ̂_0) = y_i Σ_{t ∈ SV} α̂_t y_t [φ(x_t)ᵀφ(x_i)] + y_i θ̂_0 = 1  (17)

from which we can easily solve for θ̂_0. In principle, selecting any single support vector would suffice, but since we typically solve the quadratic program over the α_t's only up to some resolution, these constraints may not be satisfied exactly with equality. It is therefore advisable to construct θ̂_0 as the median of the values implied by the individual support vectors.

What is the geometric margin we attain with some kernel function K(x, x′) = φ(x)ᵀφ(x′)?
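Equations (16) and (17) can be made concrete with a two-point example small enough to solve by hand. The data below are hypothetical: for x_1 = (1, 1), y_1 = +1 and x_2 = (−1, −1), y_2 = −1 with a linear kernel, the dual (11)-(12) reduces to maximizing 2α − 4α² with α_1 = α_2 = α, giving α̂_1 = α̂_2 = 1/4. The sketch then recovers θ̂_0 as the median over support vectors and evaluates the kernelized discriminant.

```python
import numpy as np

# Hypothetical two-point training set; the dual solution alpha = (1/4, 1/4)
# is worked out by hand in the lead-in above.
X = np.array([[1.0, 1.0], [-1.0, -1.0]])
y = np.array([1.0, -1.0])
alpha = np.array([0.25, 0.25])

def k(a, b):
    return a @ b                      # linear kernel K(x, x') = x^T x'

sv = np.where(alpha > 1e-8)[0]        # support vectors: non-zero alpha_t

# Eq. (17): each support vector yields an estimate of theta_0; take the median.
theta0_candidates = [y[i] - sum(alpha[t] * y[t] * k(X[t], X[i]) for t in sv)
                     for i in sv]
theta0 = np.median(theta0_candidates)

# Eq. (16): kernelized discriminant function.
def y_hat(x):
    return sum(alpha[t] * y[t] * k(X[t], x) for t in sv) + theta0

print(theta0)                         # 0.0 for this symmetric data
print(y_hat(np.array([2.0, 0.0])))    # positive side of the separator
```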
It is still 1/‖θ̂‖. In kernel form, since ‖θ̂‖² = Σ_i Σ_j α̂_i α̂_j y_i y_j K(x_i, x_j),

γ̂_geom = ( Σ_{i=1}^{n} Σ_{j=1}^{n} α̂_i α̂_j y_i y_j K(x_i, x_j) )^{−1/2}  (18)

Would it make sense to compare the geometric margins we attain with different kernels? We could perhaps use the margin as a criterion for selecting the best kernel function. Unfortunately this won't work without some care. For example, if we multiply all the feature vectors by 2, then the resulting geometric margin will also be twice as large (we have just expanded the space; the relations between the points remain the same). It is necessary to perform some normalization before any comparison makes sense.

We have so far assumed that the examples in their feature representations are linearly separable. We'd also like to have the kernel form of the relaxed support vector machine formulation

minimize ‖θ‖²/2 + C Σ_{t=1}^{n} ξ_t  (19)

subject to y_t(θᵀφ(x_t) + θ_0) ≥ 1 − ξ_t, ξ_t ≥ 0, t = 1, ..., n  (20)

The resulting dual form is very similar to the simple one we derived above. In fact, the only difference is that the Lagrange multipliers α_t are now also bounded from above by C (the same C as in the above primal formulation). Intuitively, the Lagrange multipliers α_t serve to enforce the classification constraints and take larger values for constraints that are harder to satisfy. Without any upper limit, they would simply tend to ∞ for any constraint that cannot be satisfied. The limit C specifies the point at which we should stop trying to satisfy such constraints. More formally, the dual form is to maximize

Σ_{t=1}^{n} α_t − (1/2) Σ_{i=1}^{n} Σ_{j=1}^{n} α_i α_j y_i y_j [φ(x_i)ᵀφ(x_j)]  (21)

subject to 0 ≤ α_t ≤ C, Σ_{t=1}^{n} α_t y_t = 0  (22)

The resulting discriminant function has the same form except that the α̂_t values can be different. What about θ̂_0? To solve for θ̂_0 we need to identify classification constraints that are satisfied with equality. These are no longer simply the ones for which α̂_t > 0 but rather those corresponding to 0 < α̂_t < C. In other words, we have to exclude points that violate the margin constraints; these are the ones for which α̂_t = C.
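The effect of the box constraint in (22) can be seen on a non-separable toy set. The sketch below (data hypothetical; numpy/scipy and the SLSQP solver are illustrative choices) adds a point (2.5, 2.5) labeled −1 that lies between the two positive examples, so no hyperplane separates the data, and that point's multiplier is driven to the bound α_t = C. It also evaluates the geometric margin in the kernel form of Eq. (18).

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical toy data: (2.5, 2.5) with label -1 lies on the segment between
# the two positive examples, so the set is not linearly separable.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0], [2.5, 2.5]])
y = np.array([1.0, 1.0, -1.0, -1.0, -1.0])
n, C = len(y), 1.0

K = X @ X.T
Q = (y[:, None] * y[None, :]) * K

res = minimize(lambda a: -(a.sum() - 0.5 * a @ Q @ a), np.zeros(n),
               bounds=[(0.0, C)] * n,                               # 0 <= alpha_t <= C
               constraints={"type": "eq", "fun": lambda a: a @ y})  # sum_t alpha_t y_t = 0
alpha = res.x

# Margin violators end up at the box bound alpha_t = C; theta_0 would be read
# off only from support vectors with 0 < alpha_t < C.
print(alpha)

# Geometric margin in kernel form, Eq. (18).
gamma = (alpha @ Q @ alpha) ** -0.5
print(gamma)
```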
Kernel optimization

Whether we are interested in (linear) classification or regression, we are faced with the problem of selecting an appropriate kernel function. A step in this direction might be to tailor a particular kernel a bit better to the available data. We could, for example, introduce additional parameters in the kernel and optimize those parameters so as to improve performance. These parameters could be as simple as the β parameter in the radial basis kernel, weights on each dimension of the input vectors, or something more flexible, such as the best convex combination of basic (fixed) kernels. Key to such an approach is the measure we would optimize. Ideally, this measure would be the generalization error, but we obviously have to settle for a surrogate measure. The surrogate measure could be cross-validation or an alternative criterion related to the generalization error (e.g., the margin).

Kernel selection

We can also explicitly select among possible kernels and cast the problem as a model selection problem. By choosing a kernel we specify the feature vectors on the basis of which linear predictions are made. Each model¹ (class) refers to a set of linear functions (classifiers) based on the chosen feature representation. In many cases the models are nested in the sense that the more complex model contains the simpler one. We will continue from this point in the next lecture.

¹ In statistics, a model is a family/set of distributions or, as here, a family/set of linear separators.
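As a concrete instance of a tunable kernel parameter, here is a minimal sketch of the radial basis kernel with width parameter β (the parameterization K(x, x′) = exp(−β‖x − x′‖²) is an assumption; other conventions use 1/(2σ²) in place of β, and the three input points are hypothetical). Any valid choice of β yields a symmetric positive semi-definite Gram matrix; larger β makes the kernel more "local" by shrinking the off-diagonal entries.

```python
import numpy as np

# Radial basis kernel with tunable width beta: K(x, x') = exp(-beta * ||x - x'||^2).
def rbf_gram(X, beta):
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-beta * sq_dists)

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])  # hypothetical inputs

for beta in (0.1, 1.0, 10.0):
    K = rbf_gram(X, beta)
    # Every valid Gram matrix is symmetric PSD with unit diagonal here;
    # larger beta drives the off-diagonal similarities toward 0.
    eigs = np.linalg.eigvalsh(K)
    print(beta, K[0, 1], eigs.min() >= -1e-10)
```

In practice one would pick β (or other kernel parameters) by optimizing a surrogate such as cross-validation error or the normalized margin, as discussed above.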