Kernel Methods and SVMs Extension

The purpose of this document is to review material covered in Machine Learning 1 Supervised Learning regarding support vector machines (SVMs). This document also provides a general overview of some extensions to what was described in the course, including non-binary classification and support vector regression.

We will introduce the concept of SVMs using the simplest case for application. Consider a scenario where we have data that must be classified into two different groups. If the data are linearly separable, or in other words, can be separated completely into their groups by a dividing hyperplane, then our goal is to find the equation of the hyperplane that best divides the groups.

To be more formal with the problem description, we label the classes for each of the data points x_i as being 1 or -1, i.e. y_i ∈ {-1, 1}. Our hyperplane function has the equation w^T x + b and is defined such that

w^T x_i + b ≥ 1 for all points that have a class y_i = 1, and
w^T x_i + b ≤ -1 for points with a class y_i = -1.

In our training data, we should have no points in between the hyperplanes w^T x + b = 1 and w^T x + b = -1, a region called the margin. The dividing plane is the function w^T x + b = 0, and we classify new points by their sign: ŷ = sgn(w^T x + b).
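To make the decision rule concrete, here is a minimal sketch in Python with NumPy (the document itself prescribes no language); the weight vector and bias below are hypothetical values chosen for illustration, not fitted parameters.

```python
import numpy as np

# Hypothetical learned parameters for a 2-D example (not from the text).
w = np.array([2.0, -1.0])   # normal vector of the dividing hyperplane
b = -0.5                    # bias/offset term

def classify(X, w, b):
    """Label points by which side of the hyperplane w^T x + b = 0 they fall on."""
    scores = X @ w + b              # signed score; sign gives the class
    return np.where(scores >= 0, 1, -1)

X_new = np.array([[1.0, 0.0],
                  [0.0, 2.0]])
print(classify(X_new, w, b))        # [ 1 -1 ]
```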

(Figure: the separable data with the dividing hyperplane w^T x + b = 0 and its margin hyperplanes. Note that w does not look perpendicular due to the difference in x- and y-axis scaling.)

There are many choices of our parameter vector w that allow us to separate the data, but some are clearly better than others. Ideally, we want to select parameters for the hyperplane that maximize the size of the margin. Consider two points that lie on opposite margins, x_+ and x_-, that are as close as possible to one another. In this case, the vector connecting these two points will be perpendicular to the hyperplanes defining the margin. These points satisfy

w^T x_+ + b = 1 and w^T x_- + b = -1,

and subtracting the two equations generates w^T (x_+ - x_-) = 2. Since the vectors w and (x_+ - x_-) are parallel, ||w|| · ||x_+ - x_-|| = 2, where ||v|| is the magnitude/length of a vector v. Dividing by ||w|| on both sides gives the distance between the hyperplanes, ||x_+ - x_-||, which is equal to 2 / ||w||.

From here, we observe that maximizing the size of the margin is equivalent to finding the minimum ||w|| that maintains the relationship y_i (w^T x_i + b) ≥ 1 for all points in the training data. (Recall that w^T x_i + b ≥ 1 when y_i = 1 and w^T x_i + b ≤ -1 when y_i = -1.) We approach solving the problem by noting that minimizing ||w|| is equivalent to minimizing ½ ||w||², converting the problem into a quadratic programming optimization problem.
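As a quick numerical check of the margin geometry (a sketch with made-up values of w and b, not taken from the text), the distance between the hyperplanes w^T x + b = ±1 comes out to 2 / ||w||:

```python
import numpy as np

# Hypothetical separator for illustration (not derived from a dataset).
w = np.array([3.0, 4.0])    # ||w|| = 5
b = 1.0

margin_width = 2.0 / np.linalg.norm(w)   # analytic width between the two margin hyperplanes

# Numerical check: start on the margin w^T x + b = -1 and step to the other
# margin along the unit normal w / ||w||; the step length equals the width.
x_minus = np.array([0.0, -0.5])                 # satisfies w^T x + b = -1
x_plus = x_minus + margin_width * w / np.linalg.norm(w)
print(margin_width, w @ x_plus + b)             # 0.4 and 1.0 (lands on the +1 margin)
```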

The Lagrange multipliers α_i transform our optimization problem into one of maximizing the output of

W(α) = Σ_i α_i − ½ Σ_{i,j} α_i α_j y_i y_j x_i^T x_j

while satisfying the constraints that all α_i ≥ 0 and Σ_i α_i y_i = 0. To provide some context for interpreting this, think of the multipliers α_i as weights on the data points. From the constraint Σ_i α_i y_i = 0, the sum of the weights on the points categorized as y = 1 should be equal to the sum on those categorized as y = -1. As for W(α), the second term keeps the summed weights in the first term from getting too large. The second term takes into account the categories of each pair of points (y_i y_j = 1 if they are in the same class, -1 if they differ) and a measure of similarity (evoked by x_i^T x_j).

When we obtain the optimal Lagrange multipliers, it turns out that most of the weights α_i are equal to zero. The points that have non-zero weight are the only points that contribute to the calculation of w, and all in fact fall on the margin, satisfying y_i (w^T x_i + b) = 1. These points are the support vectors for the model. We obtain the parameter values for our dividing hyperplane from

w = Σ_i α_i y_i x_i and b = y_i − w^T x_i

for some point x_i that lies on the margin.
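As a small worked illustration (a two-point dataset chosen by hand, not an example from the course), the primal parameters can be recovered from the optimal multipliers via w = Σ_i α_i y_i x_i and b = y_i − w^T x_i:

```python
import numpy as np

# Tiny hand-picked example: one point per class, both of which become support vectors.
X = np.array([[1.0, 1.0],
              [-1.0, -1.0]])
y = np.array([1.0, -1.0])

# For this symmetric pair the dual optimum can be found by hand:
# alpha_1 = alpha_2 = 1/4 satisfies sum_i alpha_i y_i = 0 and maximizes W(alpha) = 2a - 4a^2.
alpha = np.array([0.25, 0.25])

# Recover the primal parameters from the support vectors.
w = (alpha * y) @ X                      # w = sum_i alpha_i y_i x_i
b = y[0] - w @ X[0]                      # b = y_i - w^T x_i for a margin point

# Every support vector satisfies y_i (w^T x_i + b) = 1.
print(w, b, y * (X @ w + b))             # [0.5 0.5] 0.0 [1. 1.]
```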

The above describes the general process for computing SVMs for linearly separable data, but real-life datasets do not normally allow themselves to be divided so easily. Here, we discuss two ways to deal with non-linearly separable datasets and move beyond hard margin SVMs. If we have data that is mostly linearly separable, we can consider using soft margin SVMs, relaxing the criterion that all points are correctly classified. If we have data that is separable in a nonlinear fashion, we can consider using kernel functions to capture a nonlinear dividing curve between classes. Typically, we choose both a kernel function and a value of the soft margin parameter to perform classification tasks.

In a soft margin SVM, we do not require the data to be completely linearly separable and allow for some points to be classified incorrectly. We provide for each point a non-negative slack variable ξ_i that illustrates to what degree each point is misclassified:

y_i (w^T x_i + b) ≥ 1 − ξ_i.

If a point is classified correctly on its side of the margin, then ξ_i = 0. If a point gets placed within the margin or in the wrong class's region, then ξ_i takes on a positive value proportional to the point's distance from its desired marginal hyperplane. Our optimization problem now has to balance the size of the errors we make: our goal is to minimize

½ ||w||² + C Σ_i ξ_i,

where C is a regularization parameter that tells us the weight we want to put on misclassification errors. With smaller values of C, we punish errors less, thus increasing the size of the margin. Larger values of C result in narrower margins; in the limit as C tends towards infinity, any misclassification error is punished to such an extent that we effectively recover our original hard margin SVM.

When we convert the optimization problem into the form of maximizing the output of

W(α) = Σ_i α_i − ½ Σ_{i,j} α_i α_j y_i y_j x_i^T x_j,

the constraint that Σ_i α_i y_i = 0 remains the same, while the other constraint now has an upper bound: 0 ≤ α_i ≤ C. With the soft margin SVM, our support vectors (points that have weight α_i > 0) include not just points on the marginal hyperplanes, but also those points that are within the margin or are misclassified.
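To see the effect of C in practice, here is a hedged sketch using scikit-learn's SVC with a linear kernel (the library choice and the synthetic data are assumptions, not part of the original material); smaller C tolerates more violations and widens the margin, while large C approaches the hard margin behaviour.

```python
import numpy as np
from sklearn.svm import SVC   # scikit-learn is an assumption; the text names no library

rng = np.random.default_rng(0)
# Two noisy, mostly separable clusters (synthetic data for illustration only).
X = np.vstack([rng.normal(loc=[-2.0, 0.0], scale=1.0, size=(50, 2)),
               rng.normal(loc=[2.0, 0.0], scale=1.0, size=(50, 2))])
y = np.array([-1] * 50 + [1] * 50)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    w = clf.coef_[0]
    # Smaller C punishes errors less, giving a wider margin (2 / ||w||) and
    # typically more support vectors; large C approaches the hard margin SVM.
    print(f"C={C:>6}: margin width = {2 / np.linalg.norm(w):.3f}, "
          f"#support vectors = {len(clf.support_vectors_)}")
```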

For data that is separable, but not linearly, we can use a kernel function to capture a nonlinear dividing curve. The kernel function should capture some aspect of similarity in our data; it also represents domain knowledge regarding the structure of the data. In general, we can write the function we want to maximize as

W(α) = Σ_i α_i − ½ Σ_{i,j} α_i α_j y_i y_j k(x_i, x_j).

In our original, linear SVM, our kernel function was k(x_i, x_j) = x_i^T x_j and suggested a dividing hyperplane. The kernel function k(x_i, x_j) = (x_i^T x_j)² generates a dividing hypersphere, while k(x_i, x_j) = (x_i^T x_j + c)^d is the general form for polynomial kernels. With kernel functions, we can project the data into a transformed space where a dividing hyperplane can be found, which, when plotted in the original feature space, ends up being a nonlinear dividing curve. In order to compute the class of a new instance x, we now use the sign of the output

Σ_i α_i y_i k(x_i, x) + b.

Since most of the weights are equal to zero, this is still a fairly quick computation compared to the linear case.

It is important to note that the natural task for SVMs lies in binary classification. For classification tasks involving more than two groups, a common strategy is to use multiple binary classifiers to decide on a single best class for new instances. For example, we may create one classifier for each class in a one-versus-all fashion and then, for new points, classify them based on the classifier function that produces the largest value.
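The kernelized decision rule can be written directly as a sum over the support vectors. The following sketch uses a polynomial kernel and placeholder support vectors, weights, and bias (none of which come from the text) to show the computation sgn(Σ_i α_i y_i k(x_i, x) + b):

```python
import numpy as np

def poly_kernel(a, b, c=1.0, d=2):
    """Polynomial kernel k(a, b) = (a^T b + c)^d."""
    return (a @ b + c) ** d

def predict(x, support_X, support_y, support_alpha, b, kernel):
    """sgn( sum_i alpha_i y_i k(x_i, x) + b ), summed over support vectors only."""
    score = sum(a_i * y_i * kernel(x_i, x)
                for x_i, y_i, a_i in zip(support_X, support_y, support_alpha))
    return int(np.sign(score + b))

# Hypothetical support vectors, weights, and bias (placeholders, not fitted values).
support_X = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
support_y = np.array([1, 1, -1])
support_alpha = np.array([0.3, 0.3, 0.6])
b = -0.2

print(predict(np.array([0.5, 0.5]), support_X, support_y, support_alpha, b, poly_kernel))  # 1
```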

Alternatively, we can set up classifiers for all pairwise comparisons and select the class that wins the most pairwise matchups for new points.

(Figure depicts the pairwise matchups approach. Gray lines indicate where a binary classifier has no effect. Note the central area where no class has dominance.)

We can also extend SVMs to regression tasks, or support vector regression (SVR). As with SVMs, we project the data in an SVR task using a kernel function so that they can be fit by a hyperplane. Instead of dividing the data into classes, however, the hyperplane now provides an estimate of the data's output value. In addition, the margin and error are treated differently. A parameter ε is specified such that small deviations from the regression hyperplane do not contribute to error costs, i.e. when we attempt to minimize ½ ||w||² + C Σ_i ξ_i, we set ξ_i = 0 when a point lies within the margin. Non-zero slack variable values are instead the (linear) distance beyond the ε region that a point lies. Compare this to the quadratic error function found in standard linear regression tasks, where all deviations from the estimate count against the function's fit and errors are penalized by the quadratic difference from the estimate.
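The contrast between the two error treatments can be made concrete with a short sketch (the ε value and residuals below are arbitrary illustrations): deviations inside the ε-tube cost nothing under the SVR-style loss, while the squared loss charges for every deviation.

```python
import numpy as np

def eps_insensitive_loss(y_true, y_pred, eps=0.5):
    """SVR-style loss: zero inside the eps-tube, linear beyond it."""
    return np.maximum(np.abs(y_true - y_pred) - eps, 0.0)

def squared_loss(y_true, y_pred):
    """Ordinary least-squares loss: every deviation is penalized quadratically."""
    return (y_true - y_pred) ** 2

# Residuals of a few hypothetical predictions around the regression hyperplane.
residuals = np.array([0.1, 0.4, 0.6, 2.0])
y_true = np.zeros_like(residuals)
print(eps_insensitive_loss(y_true, residuals))  # [0.  0.  0.1 1.5]  -- small errors are free
print(squared_loss(y_true, residuals))          # [0.01 0.16 0.36 4.]  -- all errors count
```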

Essentially, however, SVR operates in much the same way as SVM does. For each point in the training data, we instead have two slack variables, ξ_i and ξ_i*, one for positive deviations and one for negative deviations from the regression hyperplane. This results in two Lagrangian multipliers associated with each point, 0 ≤ α_i, α_i* ≤ C, and a respecified constraint on the weight values, Σ_i (α_i − α_i*) = 0. When solved, the regression function takes the form

f(x) = Σ_i (α_i − α_i*) k(x_i, x) + b.

As before, most of the weights take a value of zero, and for points with non-zero weights, at most one of α_i, α_i* will be non-zero.
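As a closing sketch (again leaning on scikit-learn, which is an assumption rather than part of the original notes), the fitted SVR prediction can be reproduced by hand from the stored coefficients, which play the role of (α_i − α_i*) in the regression function above:

```python
import numpy as np
from sklearn.svm import SVR   # scikit-learn is an assumption; the text names no library
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(-3, 3, size=(40, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=40)   # noisy synthetic target

svr = SVR(kernel="rbf", C=1.0, epsilon=0.1, gamma=0.5).fit(X, y)

# Reconstruct predictions by hand: f(x) = sum_i (alpha_i - alpha_i*) k(x_i, x) + b,
# where sklearn stores (alpha_i - alpha_i*) for the support vectors in dual_coef_.
K = rbf_kernel(svr.support_vectors_, X, gamma=0.5)
manual = svr.dual_coef_ @ K + svr.intercept_
print(np.allclose(manual.ravel(), svr.predict(X)))       # True
```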