7 LINEAR MODELS

    The essence of mathematics is not to make simple things complicated, but to make complicated things simple.  -- Stanley Gudder

Learning Objectives:
- Define and plot four surrogate loss functions: squared loss, logistic loss, exponential loss and hinge loss.
- Compare and contrast the optimization of 0/1 loss and surrogate loss functions.
- Solve the optimization problem for squared loss with a quadratic regularizer in closed form.
- Implement and debug gradient descent and subgradient descent.

In Chapter 4, you learned about the perceptron algorithm for linear classification. This was both a model (linear classifier) and algorithm (the perceptron update rule) in one. In this section, we will separate these two, and consider general ways for optimizing linear models. This will lead us into some aspects of optimization (aka mathematical programming), but not very far. At the end of this chapter, there are pointers to more literature on optimization for those who are interested.

The basic idea of the perceptron is to run a particular algorithm until a linear separator is found. You might ask: are there better algorithms for finding such a linear separator? We will follow this idea and formulate a learning problem as an explicit optimization problem: find me a linear separator that is not too complicated. We will see that finding an optimal separator is actually computationally prohibitive, and so we will need to relax the optimality requirement. This will lead us to a convex objective that combines a loss function (how well are we doing on the training data) and a regularizer (how complicated is our learned model). This learning framework is known as both Tikhonov regularization and structural risk minimization.

7.1 The Optimization Framework for Linear Models

You have already seen the perceptron as a way of finding a weight vector w and bias b that do a good job of separating positive training examples from negative training examples. The perceptron is a model and algorithm in one. Here, we are interested in separating these issues. We will focus on linear models, like the perceptron. But we will think about other, more generic ways of finding good parameters of these models.

The goal of the perceptron was to find a separating hyperplane for some training data set. For simplicity, you can ignore the issue of overfitting (but just for now!). Not all data sets are linearly separable.

In the case that your training data isn't linearly separable, you might want to find the hyperplane that makes the fewest errors on the training data. We can write this down as a formal mathematical optimization problem as follows:

    min_{w,b}  Σ_n 1[ y_n (w·x_n + b) ≤ 0 ]                    (7.1)

In this expression, you are optimizing over two variables, w and b. The objective function is the thing you are trying to minimize. In this case, the objective function is simply the error rate (or 0/1 loss) of the linear classifier parameterized by w, b. In this expression, 1[·] is the indicator function: it is one when (·) is true and zero otherwise. (You should remember the y_n w·x_n trick from the perceptron discussion. If not, re-convince yourself that this is doing the right thing.)

We know that the perceptron algorithm is guaranteed to find parameters for this model if the data is linearly separable. In other words, if the optimum of Eq (7.1) is zero, then the perceptron will efficiently find parameters for this model. The notion of efficiency depends on the margin of the data for the perceptron.

You might ask: what happens if the data is not linearly separable? Is there an efficient algorithm for finding an optimal setting of the parameters? Unfortunately, the answer is no. There is no polynomial time algorithm for solving Eq (7.1), unless P = NP. In other words, this problem is NP-hard. Sadly, the proof of this is quite complicated and beyond the scope of this book, but it relies on a reduction from a variant of satisfiability. The key idea is to turn a satisfiability problem into an optimization problem where a clause is satisfied exactly when the hyperplane correctly separates the data.

You might then come back and say: okay, well I don't really need an exact solution. I'm willing to have a solution that makes one or two more errors than it has to. Unfortunately, the situation is really bad. Zero/one loss is NP-hard to even approximately minimize. In other words, there is no efficient algorithm for even finding a solution that's a small constant worse than optimal.
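To make Eq (7.1) concrete, here is a minimal sketch, assuming NumPy; the tiny dataset and the candidate parameters are made up purely for illustration. It also hints at why this objective is hard to work with: the count of errors is piecewise constant in w and b, so small parameter changes usually do not change it at all.

    import numpy as np

    def zero_one_objective(w, b, X, y):
        # 0/1 loss objective of Eq (7.1): how many training points does the
        # linear classifier parameterized by (w, b) get wrong (or put on the boundary)?
        margins = y * (X @ w + b)          # y_n (w . x_n + b) for each example n
        return int(np.sum(margins <= 0))

    # Toy data: two 2-d points per class (illustrative only).
    X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
    y = np.array([+1, +1, -1, -1])

    w, b = np.array([0.5, 0.5]), 0.0
    print(zero_one_objective(w, b, X, y))            # 0: this (w, b) separates the toy data
    print(zero_one_objective(w, b + 0.001, X, y))    # still 0: a tiny change rarely moves the count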

However, before getting too disillusioned about this whole enterprise (remember: there's an entire chapter about this framework, so it must be going somewhere!), you should remember that optimizing Eq (7.1) perhaps isn't even what you want to do! In particular, all it says is that you will get minimal training error. It says nothing about what your test error will be like. In order to try to find a solution that will generalize well to test data, you need to ensure that you do not overfit the data. To do this, you can introduce a regularizer over the parameters of the model. For now, we will be vague about what this regularizer looks like, and simply call it an arbitrary function R(w, b). This leads to the following, regularized objective:

    min_{w,b}  Σ_n 1[ y_n (w·x_n + b) ≤ 0 ] + λ R(w, b)        (7.2)

In Eq (7.2), we are now trying to optimize a trade-off between a solution that gives low training error (the first term) and a solution that is simple (the second term). You can think of the maximum depth hyperparameter of a decision tree as a form of regularization for trees. Here, R is a form of regularization for hyperplanes. In this formulation, λ becomes a hyperparameter for the optimization. (Assuming R does the right thing, what value(s) of λ will lead to overfitting? What value(s) will lead to underfitting?)

The key remaining questions, given this formalism, are:
- How can we adjust the optimization problem so that there are efficient algorithms for solving it?
- What are good regularizers R(w, b) for hyperplanes?
- Assuming we can adjust the optimization problem appropriately, what algorithms exist for efficiently solving this regularized optimization problem?

We will address these three questions in the next sections.

7.2 Convex Surrogate Loss Functions

You might ask: why is optimizing zero/one loss so hard? Intuitively, one reason is that small changes to w, b can have a large impact on the value of the objective function. For instance, if there is a positive training example whose prediction w·x + b is a tiny negative number, then adjusting b upwards by slightly more than that amount will decrease your error rate by 1, but adjusting it upwards by slightly less will have no effect. This makes it really difficult to figure out good ways to adjust the parameters.

To see this more clearly, it is useful to look at plots that relate margin to loss. Such a plot for zero/one loss is shown in Figure 7.1. In this plot, the horizontal axis measures the margin of a data point and the vertical axis measures the loss associated with that margin. For zero/one loss, the story is simple. If you get a positive margin (i.e., y(w·x + b) > 0) then you get a loss of zero. Otherwise you get a loss of one. By thinking about this plot, you can see how changes to the parameters that change the margin just a little bit can have an enormous effect on the overall loss.

[Figure 7.1: plot of zero/one loss versus margin]

You might decide that a reasonable way to address this problem is to replace the non-smooth zero/one loss with a smooth approximation. With a bit of effort, you could probably concoct an "S"-shaped function like that shown in Figure 7.2. The benefit of using such an S-function is that it is smooth, and potentially easier to optimize. The difficulty is that it is not convex.

[Figure 7.2: plot of zero/one loss versus margin, with an "S"-shaped smooth version of it]

If you remember from calculus, a convex function is one that looks like a happy face (a smile). (On the other hand, a concave function is one that looks like a sad face (a frown); an easy mnemonic is that you can hide under a concave function.) There are two equivalent definitions of a convex function. The first is that its second derivative is always non-negative (for twice-differentiable functions). The second, more geometric, definition is that any chord of the function lies above it. This is shown in Figure 7.3. There you can see a convex function and a non-convex function, both with two chords drawn in. In the case of the convex function, the chords lie above the function. In the case of the non-convex function, there are parts of the chord that lie below the function.

[Figure 7.3: plot of convex and non-convex functions, with two chords each]

Convex functions are nice because they are easy to minimize. Intuitively, if you drop a ball anywhere in a convex function, it will eventually get to the minimum. This is not true for non-convex functions. For example, if you drop a ball on the very left end of the S-function from Figure 7.2, it will not go anywhere.

This leads to the idea of convex surrogate loss functions. Since zero/one loss is hard to optimize, you want to optimize something else, instead. Since convex functions are easy to optimize, we want to approximate zero/one loss with a convex function. This approximating function will be called a surrogate loss. The surrogate losses we construct will always be upper bounds on the true loss function: this guarantees that if you minimize the surrogate loss, you are also pushing down the real loss.

There are four common surrogate loss functions, each with their own properties: hinge loss, logistic loss, exponential loss and squared loss. These are shown in Figure 7.4 and defined below. They are defined in terms of the true label y (which is just {−1, +1}) and the predicted value ŷ = w·x + b.

    Zero/one:     ℓ^(0/1)(y, ŷ) = 1[ y ŷ ≤ 0 ]                        (7.3)
    Hinge:        ℓ^(hin)(y, ŷ) = max{ 0, 1 − y ŷ }                   (7.4)
    Logistic:     ℓ^(log)(y, ŷ) = (1 / log 2) log(1 + exp[−y ŷ])      (7.5)
    Exponential:  ℓ^(exp)(y, ŷ) = exp[−y ŷ]                           (7.6)
    Squared:      ℓ^(sqr)(y, ŷ) = (y − ŷ)²                            (7.7)

[Figure 7.4: the surrogate loss functions]

In the definition of logistic loss, the 1/log 2 term out front is there simply to ensure that ℓ^(log)(y, 0) = 1. This ensures, like all the other surrogate loss functions, that logistic loss upper bounds the zero/one loss. (In practice, people typically omit this constant since it does not affect the optimization.)
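As a quick illustration, here is a minimal sketch (assuming NumPy) that evaluates the four surrogate losses of Eqs (7.4)-(7.7) on an arbitrary grid of predicted values for a positive example, and checks numerically that each one upper bounds the zero/one loss.

    import numpy as np

    def hinge(y, yhat):       return np.maximum(0.0, 1.0 - y * yhat)          # Eq (7.4)
    def logistic(y, yhat):    return np.log1p(np.exp(-y * yhat)) / np.log(2)  # Eq (7.5)
    def exponential(y, yhat): return np.exp(-y * yhat)                        # Eq (7.6)
    def squared(y, yhat):     return (y - yhat) ** 2                          # Eq (7.7)

    y, yhat = 1.0, np.linspace(-3, 3, 13)          # a positive example, arbitrary predictions
    zero_one = (y * yhat <= 0).astype(float)       # Eq (7.3)
    for name, loss in [("hinge", hinge), ("logistic", logistic),
                       ("exponential", exponential), ("squared", squared)]:
        values = loss(y, yhat)
        assert np.all(values >= zero_one)          # every surrogate upper bounds 0/1 loss
        print(f"{name:12s}", np.round(values, 2))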

There are two big differences in these loss functions. The first difference is how upset they get by erroneous predictions. In the case of hinge loss and logistic loss, the growth of the function as ŷ goes negative is linear. For squared loss and exponential loss, it is super-linear. This means that exponential loss would rather get a few examples a little wrong than one example really wrong. The other difference is how they deal with very confident correct predictions. Once y ŷ > 1, hinge loss does not care any more, but logistic and exponential still think you can do better. On the other hand, squared loss thinks it's just as bad to predict +3 on a positive example as it is to predict −1 on a positive example.

7.3 Weight Regularization

In our learning objective, Eq (7.2), we had a term corresponding to the zero/one loss on the training data, plus a regularizer whose goal was to ensure that the learned function didn't get too crazy. (Or, more formally, to ensure that the function did not overfit.) If you replace the zero/one loss with a surrogate loss, you obtain the following objective:

    min_{w,b}  Σ_n ℓ(y_n, w·x_n + b) + λ R(w, b)                (7.8)

The question is: what should R(w, b) look like? From the discussion of surrogate loss functions, we would like to ensure that R is convex. Otherwise, we will be back to the point where optimization becomes difficult. Beyond that, a common desire is that the components of the weight vector (i.e., the w_d's) should be small (close to zero). This is a form of inductive bias.

Why are small values of w_d good? Or, more precisely, why do small values of w_d correspond to simple functions? Suppose that we have an example x with label +1. We might believe that other examples x′ that are nearby x should also have label +1. For example, if I obtain x′ by taking x and changing the first component by some small value ε and leaving the rest the same, you might think that the classification would be the same. If you do this, the difference between ŷ and ŷ′ will be exactly εw_1. So if w_1 is reasonably small, this is unlikely to have much of an effect on the classification decision. On the other hand, if w_1 is large, this could have a large effect. Another way of saying the same thing is to look at the derivative of the prediction as a function of the input x_1. The derivative of w·x + b with respect to x_1 is:

    ∂[w·x + b] / ∂x_1 = ∂[ Σ_d w_d x_d + b ] / ∂x_1 = w_1        (7.9)

Interpreting the derivative as the rate of change, we can see that the rate of change of the prediction function is proportional to the individual weights.

So if you want the function to change slowly, you want to ensure that the weights stay small. One way to accomplish this is to simply use the norm of the weight vector, namely R^(norm)(w, b) = ‖w‖ = √(Σ_d w_d²). This function is convex and smooth, which makes it easy to minimize. In practice, it's often easier to use the squared norm, namely R^(sqr)(w, b) = ‖w‖² = Σ_d w_d², because it removes the ugly square root term and remains convex. An alternative to using the sum of squared weights is to use the sum of absolute weights: R^(abs)(w, b) = Σ_d |w_d|. Both of these norms are convex. (Why do we not regularize the bias term b?)

In addition to small weights being good, you could argue that zero weights are better. If a weight w_d goes to zero, then this means that feature d is not used at all in the classification decision. If there are a large number of irrelevant features, you might want as many weights to go to zero as possible. This suggests an alternative regularizer that simply counts the nonzero weights: R^(cnt)(w, b) = Σ_d 1[w_d ≠ 0]. (Why might you not want to use R^(cnt) as a regularizer?)

This line of thinking leads to the general concept of p-norms. (Technically these are called ℓ_p (or "ell p") norms, but this notation clashes with the use of ℓ for "loss.") This is a family of norms that all have the same general flavor. We write ‖w‖_p to denote the p-norm of w:

    ‖w‖_p = ( Σ_d |w_d|^p )^(1/p)                                (7.10)

You can check that the 2-norm exactly corresponds to the usual Euclidean norm, and that the 1-norm corresponds to the absolute regularizer described above. (You can actually identify the R^(cnt) regularizer with a p-norm as well. Which value of p gives it to you? Hint: you may have to take a limit.)

When p-norms are used to regularize weight vectors, the interesting aspect is how they trade off multiple features. To see the behavior of p-norms in two dimensions, we can plot their contours (or level-sets). Figure 7.5 shows the contours for several p-norms in two dimensions. Each line denotes the two-dimensional vectors to which that norm assigns a total value of 1. By changing the value of p, you can interpolate between a square (the so-called "max norm"), down to a circle (2-norm), diamond (1-norm) and pointy-star-shaped-thing (p < 1 "norm"). (The max norm corresponds to the limit p → ∞. Why is it called the max norm?)

[Figure 7.5: level sets of several p-norms in two dimensions]

In general, smaller values of p prefer sparser vectors. You can see this by noticing that the contours of small p-norms stretch out along the axes. It is for this reason that small p-norms tend to yield weight vectors with many zero entries (aka sparse weight vectors). Unfortunately, for p < 1 the "norm" becomes non-convex. As you might guess, this means that the 1-norm is a popular choice for sparsity-seeking applications.
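Here is a minimal sketch, assuming NumPy, of Eq (7.10) for a made-up weight vector; np.linalg.norm provides the same values, including the max norm as the limiting case.

    import numpy as np

    def p_norm(w, p):
        # ||w||_p = (sum_d |w_d|^p)^(1/p), Eq (7.10).
        return np.sum(np.abs(w) ** p) ** (1.0 / p)

    w = np.array([3.0, -4.0, 0.0])                     # a made-up weight vector
    print(p_norm(w, 2), np.linalg.norm(w, 2))          # 5.0 both ways: the Euclidean norm
    print(p_norm(w, 1), np.linalg.norm(w, 1))          # 7.0: the sum of absolute values
    print(np.linalg.norm(w, np.inf))                   # 4.0: the max norm, the limit as p grows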

MATH REVIEW: GRADIENTS

A gradient is a multidimensional generalization of a derivative. Suppose you have a function f : R^D → R that takes a vector x = ⟨x_1, x_2, ..., x_D⟩ as input and produces a scalar value as output. You can differentiate this function according to any one of the inputs; for instance, you can compute ∂f/∂x_5 to get the derivative with respect to the fifth input. The gradient of f is just the vector consisting of the derivatives of f with respect to each of its input coordinates independently, and is denoted ∇f, or, when the input to f is ambiguous, ∇_x f. This is defined as:

    ∇_x f = ⟨ ∂f/∂x_1, ∂f/∂x_2, ..., ∂f/∂x_D ⟩                   (7.11)

For example, consider the function f(x_1, x_2, x_3) = x_1³ + 5 x_1 x_2 − 3 x_2 x_3². The gradient is:

    ∇_x f = ⟨ 3x_1² + 5x_2,  5x_1 − 3x_3²,  −6x_2 x_3 ⟩          (7.12)

Note that if f : R^D → R, then ∇f : R^D → R^D. If you evaluate ∇f(x), this will give you the gradient at x, a vector in R^D. This vector can be interpreted as the direction of steepest ascent: namely, if you were to travel an infinitesimal amount in the direction of the gradient, you would go uphill (i.e., increase f) the most.
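As a sanity check on the worked example above, here is a minimal sketch (assuming NumPy) that compares the analytic gradient of Eq (7.12) against central finite differences; the evaluation point is arbitrary.

    import numpy as np

    def f(x):
        x1, x2, x3 = x
        return x1**3 + 5*x1*x2 - 3*x2*x3**2

    def grad_f(x):
        # Analytic gradient from Eq (7.12).
        x1, x2, x3 = x
        return np.array([3*x1**2 + 5*x2, 5*x1 - 3*x3**2, -6*x2*x3])

    def numerical_grad(f, x, eps=1e-6):
        # Central finite differences, one coordinate at a time.
        g = np.zeros_like(x)
        for d in range(len(x)):
            e = np.zeros_like(x)
            e[d] = eps
            g[d] = (f(x + e) - f(x - e)) / (2 * eps)
        return g

    x = np.array([1.0, 2.0, -1.0])        # an arbitrary evaluation point
    print(grad_f(x))                      # [13.  2. 12.]
    print(numerical_grad(f, x))           # agrees to about six decimal places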

7.4 Optimization with Gradient Descent

Envision the following problem. You're taking up a new hobby: blindfolded mountain climbing. Someone blindfolds you and drops you on the side of a mountain. Your goal is to get to the peak of the mountain as quickly as possible. All you can do is feel the mountain where you are standing, and take steps. How would you get to the top of the mountain? Perhaps you would feel to find out what direction feels the most "upward" and take a step in that direction. If you do this repeatedly, you might hope to get to the top of the mountain. (Actually, if your friend promises always to drop you on purely concave mountains, you will eventually get to the peak!)

The idea of gradient-based methods of optimization is exactly the same. Suppose you are trying to find the maximum of a function f(x). The optimizer maintains a current estimate of the parameter of interest, x. At each step, it measures the gradient of the function it is trying to optimize. This measurement occurs at the current location, x. Call the gradient g. It then takes a step in the direction of the gradient, where the size of the step is controlled by a parameter η (eta). The complete step is x ← x + ηg. This is the basic idea of gradient ascent. The opposite of gradient ascent is gradient descent. All of our learning problems will be framed as minimization problems (trying to reach the bottom of a ditch, rather than the top of a hill). Therefore, descent is the primary approach you will use. One of the major conditions for gradient descent being able to find the true, global minimum of its objective function is convexity. Without convexity, all is lost.

The gradient descent algorithm is sketched in Algorithm 21. The function takes as arguments the function F to be minimized, the number of iterations K to run, and a sequence of learning rates η_1, ..., η_K. (This is to address the case that you might want to start your mountain climbing taking large steps, but only take small steps when you are close to the peak.)

Algorithm 21  GradientDescent(F, K, η_1, ...)
 1: z^(0) ← ⟨0, 0, ..., 0⟩                   // initialize variable we are optimizing
 2: for k = 1 ... K do
 3:   g^(k) ← ∇_z F |_{z^(k−1)}              // compute gradient at current location
 4:   z^(k) ← z^(k−1) − η^(k) g^(k)          // take a step down the gradient
 5: end for
 6: return z^(K)

The only real work you need to do to apply a gradient descent method is be able to compute derivatives. For concreteness, suppose that you choose exponential loss as a loss function and the 2-norm as a regularizer. Then, the regularized objective function is:

    L(w, b) = Σ_n exp[ −y_n (w·x_n + b) ] + (λ/2) ‖w‖²           (7.13)

The only strange thing in this objective is that we have replaced λ with λ/2. The reason for this change is just to make the gradients cleaner. We can first compute derivatives with respect to b:

    ∂L/∂b = ∂/∂b Σ_n exp[ −y_n (w·x_n + b) ] + ∂/∂b (λ/2) ‖w‖²                   (7.14)
          = Σ_n ∂/∂b exp[ −y_n (w·x_n + b) ] + 0                                  (7.15)
          = Σ_n ( ∂/∂b [ −y_n (w·x_n + b) ] ) exp[ −y_n (w·x_n + b) ]             (7.16)
          = − Σ_n y_n exp[ −y_n (w·x_n + b) ]                                     (7.17)

Before proceeding, it is worth thinking about what this says. From a practical perspective, the optimization will operate by updating b ← b − η ∂L/∂b. Consider positive examples: examples with y_n = +1. We would hope for these examples that the current prediction, w·x_n + b, is as large as possible. As this value tends toward +∞, the term in the exp[·] goes to zero. Thus, such points will not contribute to the step. However, if the current prediction is small, then the exp[·] term will be positive and non-zero. This means that the bias term b will be increased, which is exactly what you would want. Moreover, once all points are very well classified, the derivative goes to zero. (This considered the case of positive examples. What happens with negative examples?)
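A minimal sketch of the loop in Algorithm 21, assuming NumPy; the quadratic test function is made up just to check that the iterates approach a known minimizer.

    import numpy as np

    def gradient_descent(grad_F, dim, K, etas):
        # Generic gradient descent in the spirit of Algorithm 21.
        # grad_F maps z to the gradient of F at z; etas is the sequence of step sizes.
        z = np.zeros(dim)               # line 1: initialize the variable we are optimizing
        for k in range(K):
            g = grad_F(z)               # line 3: compute gradient at the current location
            z = z - etas[k] * g         # line 4: take a step down the gradient
        return z

    # Sanity check on F(z) = ||z - c||^2, whose unique minimum is at c.
    c = np.array([1.0, -2.0, 3.0])
    grad_F = lambda z: 2.0 * (z - c)
    print(gradient_descent(grad_F, dim=3, K=100, etas=[0.1] * 100))   # approximately [1, -2, 3]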

Now that we have done the easy case, let's do the gradient with respect to w:

    ∇_w L = ∇_w Σ_n exp[ −y_n (w·x_n + b) ] + ∇_w (λ/2) ‖w‖²                      (7.18)
          = Σ_n ( ∇_w [ −y_n (w·x_n + b) ] ) exp[ −y_n (w·x_n + b) ] + λw         (7.19)
          = − Σ_n y_n x_n exp[ −y_n (w·x_n + b) ] + λw                            (7.20)

Now you can repeat the previous exercise. The update is of the form w ← w − η ∇_w L. For well classified points (ones for which y_n (w·x_n + b) → +∞), the gradient is near zero. For poorly classified points, the gradient points in the direction −y_n x_n, so the update is of the form w ← w + c y_n x_n, where c is some constant. This is just like the perceptron update! Note that c is large for very poorly classified points and small for relatively well classified points. By looking at the part of the gradient related to the regularizer, the update says: w ← w − ηλw = (1 − ηλ)w. This has the effect of shrinking the weights toward zero. This is exactly what we expect the regularizer to be doing!

The success of gradient descent hinges on appropriate choices for the step size. Figure 7.7 shows what can happen with gradient descent with poorly chosen step sizes. If the step size is too big, you can accidentally step over the optimum and end up oscillating. If the step size is too small, it will take way too long to get to the optimum. For a well-chosen step size, you can show that gradient descent will approach the optimal value at a fast rate. The notion of convergence here is that the objective value converges to the true minimum.

[Figure 7.7: good and bad step sizes]

Theorem 8 (Gradient Descent Convergence). Under suitable conditions¹, for an appropriately chosen constant step size (i.e., η_1 = η_2 = ··· = η), the convergence rate of gradient descent is O(1/k). More specifically, letting z* be the global minimum of F, we have:

    F(z^(k)) − F(z*) ≤ 2 ‖z^(0) − z*‖² / (η k)

¹ Specifically, the function to be optimized needs to be strongly convex. This is true for all our problems, provided λ > 0. For λ = 0 the rate could be as bad as O(1/√k).

(A naive reading of this theorem seems to say that you should choose huge values of η. It should be obvious that this cannot be right. What is missing?)

The proof of this theorem is a bit complicated because it makes heavy use of some linear algebra. The key is to set the learning rate to 1/L, where L is the maximum curvature of the function that is being optimized. The curvature is simply the size of the second derivative. Functions with high curvature have gradients that change quickly, which means that you need to take small steps to avoid overstepping the optimum.
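Putting Eq (7.17) and Eq (7.20) into the descent loop gives a complete trainer. Below is a minimal sketch, assuming NumPy, with a made-up toy dataset and an arbitrary small constant step size, chosen conservatively in line with the curvature discussion above.

    import numpy as np

    def exp_loss_gradients(w, b, X, y, lam):
        # Gradients of Eq (7.13): sum_n exp[-y_n (w.x_n + b)] + (lam/2) ||w||^2
        e = np.exp(-y * (X @ w + b))        # exp[-y_n (w.x_n + b)] for every n
        grad_w = -(y * e) @ X + lam * w     # Eq (7.20)
        grad_b = -np.sum(y * e)             # Eq (7.17)
        return grad_w, grad_b

    def train_exp_loss(X, y, lam=0.1, eta=0.01, iters=500):
        w, b = np.zeros(X.shape[1]), 0.0
        for _ in range(iters):
            gw, gb = exp_loss_gradients(w, b, X, y, lam)
            w, b = w - eta * gw, b - eta * gb    # descend on both parameters
        return w, b

    # Toy, linearly separable data (illustrative only).
    X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
    y = np.array([+1.0, +1.0, -1.0, -1.0])
    w, b = train_exp_loss(X, y)
    print(np.sign(X @ w + b))               # matches y on this toy data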

This convergence result suggests a simple approach to deciding when to stop optimizing: wait until the objective function stops changing by much. An alternative is to wait until the parameters stop changing by much. A final example is to do what you did for the perceptron: early stopping. Every iteration, you can check the performance of the current model on some held-out data, and stop optimizing when performance plateaus.

7.5 From Gradients to Subgradients

As a good exercise, you should try deriving gradient descent update rules for the different loss functions and different regularizers you've learned about. However, if you do this, you might notice that hinge loss and the 1-norm regularizer are not differentiable everywhere! In particular, the 1-norm is not differentiable at w_d = 0, and the hinge loss is not differentiable at y ŷ = 1.

The solution to this is to use subgradient optimization. One way to think about subgradients is to not think about them too much: you essentially ignore the fact that your function isn't differentiable everywhere, and just try to apply gradient descent anyway.

To be more concrete, consider the hinge function f(z) = max{0, 1 − z}. This function is differentiable for z > 1 and differentiable for z < 1, but not differentiable at z = 1. You can compute its derivative by treating the two differentiable regions separately:

    ∂f/∂z = ∂/∂z { 0 if z > 1 ; 1 − z if z < 1 }                 (7.21)
          = { ∂(0)/∂z if z > 1 ; ∂(1 − z)/∂z if z < 1 }          (7.22)
          = { 0 if z > 1 ; −1 if z < 1 }                         (7.23)

Thus, the derivative is zero for z > 1 and −1 for z < 1, matching intuition from the figure. At the non-differentiable point, z = 1, we can use a subderivative: a generalization of derivatives to non-differentiable functions. Intuitively, you can think of the derivative of f at z as the tangent line: namely, it is the line that touches f at z and is always below f (for convex functions). The subderivative, denoted ∂f, is the set of all such lines. At differentiable positions, this set consists of just the actual derivative. At non-differentiable positions, it contains all slopes that define lines that always lie under the function and make contact at the operating point. This is shown pictorially in Figure 7.8, where example subderivatives are shown for the hinge loss function.

[Figure 7.8: hinge loss with example subderivatives]

In the particular case of hinge loss, any value between −1 and 0 is a valid subderivative at z = 1. In fact, the subderivative is always a closed set of the form [a, b], where a and b can be derived by looking at limits from the left and right. This gives you a way of computing derivative-like things for non-differentiable functions. Take hinge loss as an example. For a given example n, the subgradient of hinge loss can be computed as:

    ∂_w max{ 0, 1 − y_n (w·x_n + b) }                                              (7.24)
    = ∂_w { 0 if y_n (w·x_n + b) > 1 ; 1 − y_n (w·x_n + b) otherwise }             (7.25)
    = { ∂_w 0 if y_n (w·x_n + b) > 1 ; ∂_w [ 1 − y_n (w·x_n + b) ] otherwise }     (7.26)
    = { 0 if y_n (w·x_n + b) > 1 ; −y_n x_n otherwise }                            (7.27)

If you plug this subgradient form into the generic loop of Algorithm 21, you obtain Algorithm 22. This is subgradient descent for regularized hinge loss (with a 2-norm regularizer).

Algorithm 22  HingeRegularizedGD(D, λ, MaxIter)
 1: w ← ⟨0, 0, ..., 0⟩, b ← 0                    // initialize weights and bias
 2: for iter = 1 ... MaxIter do
 3:   g ← ⟨0, 0, ..., 0⟩, g_b ← 0                // initialize gradient of weights and bias
 4:   for all (x, y) ∈ D do
 5:     if y(w·x + b) ≤ 1 then
 6:       g ← g + y x                            // update weight gradient
 7:       g_b ← g_b + y                          // update bias derivative
 8:     end if
 9:   end for
10:   g ← g − λw                                 // add in regularization term
11:   w ← w + ηg                                 // update weights
12:   b ← b + η g_b                              // update bias
13: end for
14: return w, b
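Here is a minimal sketch of Algorithm 22 in NumPy; the step size η, the regularization strength, and the toy dataset are all made-up illustrative choices (the pseudocode above leaves η unspecified).

    import numpy as np

    def hinge_regularized_gd(X, y, lam=0.1, eta=0.01, max_iter=200):
        # Subgradient descent for regularized hinge loss, mirroring Algorithm 22.
        w, b = np.zeros(X.shape[1]), 0.0
        for _ in range(max_iter):
            g, g_b = np.zeros_like(w), 0.0       # accumulators for the negative (sub)gradient
            for x_n, y_n in zip(X, y):
                if y_n * (x_n @ w + b) <= 1:     # margin violated or met exactly
                    g += y_n * x_n               # update weight (sub)gradient
                    g_b += y_n                   # update bias (sub)gradient
            g -= lam * w                         # add in the regularization term
            w += eta * g                         # g holds the negative gradient, so this descends
            b += eta * g_b
        return w, b

    X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
    y = np.array([+1, +1, -1, -1])
    w, b = hinge_regularized_gd(X, y)
    print(np.sign(X @ w + b))                    # matches y on this toy data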

7.6 Closed-form Optimization for Squared Loss

Although gradient descent is a good, generic optimization algorithm, there are cases when you can do better. An example is the case of a 2-norm regularizer and squared error loss function. For this, you can actually obtain a closed form solution for the optimal weights. However, to obtain this, you need to rewrite the optimization problem in terms of matrix operations. For simplicity, we will only consider the unbiased version (no bias term b); the extension is left as an exercise.

MATH REVIEW: MATRIX MULTIPLICATION AND INVERSION

If A and B are matrices, and A is N×K and B is K×M (the inner dimensions must match), then the matrix product AB is a matrix C that is N×M, with C_{n,m} = Σ_k A_{n,k} B_{k,m}. If v is a vector in R^D, we will treat it as a column vector, or a matrix of size D×1. Thus, Av is well defined if A is M×D, and the resulting product is a vector u with u_m = Σ_d A_{m,d} v_d.

Aside from the matrix product, a fundamental matrix operation is inversion. We will often encounter a form like Ax = y, where A and y are known and we want to solve for x. If A is square of size N×N, then the inverse of A, denoted A⁻¹, is also a square matrix of size N×N, such that A A⁻¹ = I_N = A⁻¹ A. I.e., multiplying a matrix by its inverse (on either side) gives back the identity matrix. Using this, we can solve Ax = y by multiplying both sides by A⁻¹ on the left (recall that order matters in matrix multiplication), yielding A⁻¹ A x = A⁻¹ y, from which we can conclude x = A⁻¹ y. Note that not all square matrices are invertible. For instance, the all-zeros matrix does not have an inverse (in the same way that 1/0 is not defined for scalars). However, there are other matrices that do not have inverses either; such matrices are called singular.

This setup is precisely the linear regression setting. You can think of the training data as a large matrix X of size N×D, where X_{n,d} is the value of the d-th feature on the n-th example. You can think of the labels as a column ("tall") vector Y of dimension N. Finally, you can think of the weights as a column vector w of size D. Thus, the matrix-vector product a = Xw has dimension N. In particular:

    a_n = [Xw]_n = Σ_d X_{n,d} w_d                               (7.28)

This means, in particular, that a is actually the vector of predictions of the model. Instead of calling this a, we will call it Ŷ. The squared error says that we should minimize (1/2) Σ_n (Ŷ_n − Y_n)², which can be written in vector form as a minimization of (1/2) ‖Ŷ − Y‖². This can be expanded visually as:

    [ x_{1,1}  x_{1,2}  ...  x_{1,D} ] [ w_1 ]     [ Σ_d x_{1,d} w_d ]     [ y_1 ]
    [ x_{2,1}  x_{2,2}  ...  x_{2,D} ] [ w_2 ]  =  [ Σ_d x_{2,d} w_d ]  ≈  [ y_2 ]       (7.29)
    [   ...      ...    ...    ...   ] [ ... ]     [       ...       ]     [ ... ]
    [ x_{N,1}  x_{N,2}  ...  x_{N,D} ] [ w_D ]     [ Σ_d x_{N,d} w_d ]     [ y_N ]
              X                          w                  Ŷ                 Y

(Verify that the squared error can actually be written as this vector norm.)
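The margin note above asks you to verify the norm form of the squared error; here is a quick numerical spot-check of that identity (a sketch assuming NumPy, with random made-up data), not a substitute for the short algebraic argument.

    import numpy as np

    rng = np.random.default_rng(0)
    N, D = 6, 3
    X = rng.normal(size=(N, D))       # training data, one row per example
    Y = rng.normal(size=N)            # labels, a "tall" vector of dimension N
    w = rng.normal(size=D)            # some candidate weight vector

    Y_hat = X @ w                                          # predictions, Eq (7.28)
    loss_sum  = 0.5 * np.sum((Y_hat - Y) ** 2)             # (1/2) sum_n (Yhat_n - Y_n)^2
    loss_norm = 0.5 * np.linalg.norm(X @ w - Y) ** 2       # (1/2) ||Xw - Y||^2
    print(np.isclose(loss_sum, loss_norm))                 # True: the two forms agree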

So, compactly, our optimization problem can be written as:

    min_w  L(w) = (1/2) ‖Xw − Y‖² + (λ/2) ‖w‖²                   (7.30)

If you recall from calculus, you can minimize a function by setting its derivative to zero. We start with the weights w and take gradients:

    ∇_w L(w) = Xᵀ (Xw − Y) + λw                                  (7.31)
             = Xᵀ X w − Xᵀ Y + λw                                (7.32)
             = ( Xᵀ X + λ I_D ) w − Xᵀ Y                         (7.33)

We can equate this to zero and solve, yielding:

    ( Xᵀ X + λ I_D ) w − Xᵀ Y = 0                                (7.34)
    ( Xᵀ X + λ I_D ) w = Xᵀ Y                                    (7.35)
    w = ( Xᵀ X + λ I_D )⁻¹ Xᵀ Y                                  (7.36)

Thus, the optimal solution of the weights can be computed by a few matrix multiplications and a matrix inversion. As a sanity check, you can make sure that the dimensions match. The matrix Xᵀ X has dimension D×D, and therefore so does the inverse term. The inverse is D×D and Xᵀ is D×N, so that product is D×N. Multiplying through by the N×1 vector Y yields a D×1 vector, which is precisely what we want for the weights. (For those who are keen on linear algebra, you might be worried that the matrix you must invert might not be invertible. Is this actually a problem?)
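A minimal sketch of Eq (7.36), assuming NumPy (using np.linalg.solve rather than forming the inverse explicitly, which is the numerically safer route), compared against gradient descent on the same objective; the synthetic data, λ, step size and iteration count are arbitrary.

    import numpy as np

    def ridge_closed_form(X, Y, lam):
        # w = (X^T X + lam I_D)^{-1} X^T Y, Eq (7.36); solve() avoids forming the inverse.
        D = X.shape[1]
        return np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ Y)

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 5))                         # synthetic data, N=100, D=5
    Y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.normal(size=100)
    lam = 0.5

    w_exact = ridge_closed_form(X, Y, lam)
    w = np.zeros(5)
    for _ in range(2000):                                 # gradient descent on Eq (7.30)
        w -= 0.001 * (X.T @ (X @ w - Y) + lam * w)        # gradient from Eq (7.31)
    print(np.max(np.abs(w - w_exact)))                    # tiny: both reach the same optimum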

Note that this gives an exact solution, modulo numerical inaccuracies with computing matrix inverses. In contrast, gradient descent will give you progressively better solutions and will eventually converge to the optimum at a rate of 1/k. This means that if you want an answer that's within an accuracy of ε = 10⁻⁴, you will need something on the order of ten thousand steps.

The question is whether getting this exact solution is always more efficient. To run gradient descent for one step will take O(ND) time, with a relatively small constant. You will have to run K iterations, yielding an overall runtime of O(KND). On the other hand, the closed form solution requires constructing Xᵀ X, which takes O(D²N) time. The inversion takes O(D³) time using standard matrix inversion routines. The final multiplications take O(ND) time. Thus, the overall runtime is on the order of O(D³ + D²N). In most standard cases (though this is becoming less true over time), N > D, so this is dominated by O(D²N).

Thus, the overall question is whether you will need to run more than D-many iterations of gradient descent. If so, then the matrix inversion will be (roughly) faster. Otherwise, gradient descent will be (roughly) faster. For low- and medium-dimensional problems (say, D ≤ 100), it is probably faster to do the closed form solution via matrix inversion. For high-dimensional problems (D ≥ 10,000), it is probably faster to do gradient descent. For things in the middle, it's hard to say for sure.

7.7 Support Vector Machines

At the beginning of this chapter, you may have looked at the convex surrogate loss functions and asked yourself: where did these come from?! They are all derived from different underlying principles, which essentially correspond to different inductive biases.

Let's start by thinking back to the original goal of linear classifiers: to find a hyperplane that separates the positive training examples from the negative ones. Figure 7.10 shows some data and three potential hyperplanes: red, green and blue. Which one do you like best? Most likely you chose the green hyperplane. And most likely you chose it because it was furthest away from the closest training points. In other words, it had a large margin.

[Figure 7.10: data points with three candidate hyperplanes (red, green, blue), with green the best]

The desire for hyperplanes with large margins is a perfect example of an inductive bias. The data does not tell us which of the three hyperplanes is best: we have to choose one using some other source of information. Following this line of thinking leads us to the support vector machine (SVM). This is simply a way of setting up an optimization problem that attempts to find a separating hyperplane with as large a margin as possible. It is written as a constrained optimization problem:

    min_{w,b}  1 / γ(w, b)
    subj. to   y_n (w·x_n + b) ≥ 1   (∀n)                        (7.37)

In this optimization, you are trying to find parameters that maximize the margin, denoted γ (i.e., minimize the reciprocal of the margin), subject to the constraint that all training examples are correctly classified. The odd thing about this optimization problem is that we require the classification of each point to be greater than one rather than simply greater than zero. However, the problem doesn't fundamentally change if you replace the 1 with any other positive constant (see the exercises). As shown in Figure 7.11, the constant one can be interpreted visually as ensuring that there is a non-trivial margin between the positive points and negative points.

[Figure 7.11: hyperplane with margins on either side]

The difficulty with the optimization problem in Eq (7.37) is what happens with data that is not linearly separable. In that case, there is no set of parameters w, b that can simultaneously satisfy all the constraints.

In optimization terms, you would say that the feasible region is empty. (The feasible region is simply the set of all parameters that satisfy the constraints.) For this reason, this is referred to as the hard-margin SVM, because enforcing the margin is a hard constraint. The question is: how to modify this optimization problem so that it can handle inseparable data?

The key idea is the use of slack parameters. The intuition behind slack parameters is the following. Suppose we find a set of parameters w, b that do a really good job on 9999 data points. The points are perfectly classified and you achieve a large margin. But there's one pesky data point left that cannot be put on the proper side of the margin: perhaps it is noisy. (See Figure 7.12.) You want to be able to pretend that you can move that point across the hyperplane onto the proper side. You will have to pay a little bit to do so, but as long as you aren't moving a lot of points around, it should be a good idea to do this. In this picture, the amount that you move the point is denoted ξ (xi).

[Figure 7.12: one bad point with slack]

By introducing one slack parameter for each training example, and penalizing yourself for having to use slack, you can create an objective function like the following, the soft-margin SVM:

    min_{w,b,ξ}  1 / γ(w, b)    +   C Σ_n ξ_n
                 [large margin]     [small slack]
    subj. to     y_n (w·x_n + b) ≥ 1 − ξ_n   (∀n)                (7.38)
                 ξ_n ≥ 0                     (∀n)

The goal of this objective function is to ensure that all points are correctly classified (the first constraint). But if a point cannot be correctly classified, then you can set the slack ξ_n to something greater than zero to "move" it in the correct direction. However, for all non-zero slacks, you have to pay in the objective function proportional to the amount of slack. The hyperparameter C > 0 controls overfitting versus underfitting. The second constraint simply says that you must not have negative slack. (What values of C will lead to overfitting? What values will lead to underfitting?)

One major advantage of the soft-margin SVM over the original hard-margin SVM is that the feasible region is never empty. That is, there is always going to be some solution, regardless of whether your training data is linearly separable or not. (Suppose I give you a data set. Without even looking at the data, construct for me a feasible solution to the soft-margin SVM. What is the value of the objective for this solution?)

It's one thing to write down an optimization problem. It's another thing to try to solve it. There are a very large number of ways to optimize SVMs, essentially because they are such a popular learning model. Here, we will talk about just one, very simple way. More complex methods will be discussed later in this book once you have a bit more background.

To make progress, you need to be able to measure the size of the margin. Suppose someone gives you parameters w, b that optimize the hard-margin SVM. We wish to measure the size of the margin. The first observation is that the hyperplane will lie exactly halfway between the nearest positive point and nearest negative point. If not, the margin could be made bigger by simply sliding it one way or the other by adjusting the bias b.

By this observation, there is some positive example that lies exactly 1 unit from the hyperplane. Call it x⁺, so that w·x⁺ + b = 1. Similarly, there is some negative example, x⁻, that lies exactly on the other side of the margin: for which w·x⁻ + b = −1. These two points, x⁺ and x⁻, give us a way to measure the size of the margin. As shown in Figure 7.13, we can measure the size of the margin by looking at the difference between the lengths of the projections of x⁺ and x⁻ onto the direction of w. Since projection requires a normalized vector, we can measure the signed distances as:

    d⁺ = (1 / ‖w‖) (w·x⁺ + b)                                    (7.39)
    d⁻ = (1 / ‖w‖) (w·x⁻ + b)                                    (7.40)

[Figure 7.13: margin geometry; copy of a figure from p. 5 of the CS544 SVM tutorial]

We can then compute the margin by algebra:

    γ = (1/2) [ d⁺ − d⁻ ]                                        (7.41)
      = (1/2) [ (1/‖w‖)(w·x⁺ + b) − (1/‖w‖)(w·x⁻ + b) ]          (7.42)
      = (1/2) [ (1/‖w‖)(+1) − (1/‖w‖)(−1) ]                      (7.43)
      = (1/2) [ 2 / ‖w‖ ]                                        (7.44)
      = 1 / ‖w‖                                                  (7.45)

This is a remarkable conclusion: the size of the margin is inversely proportional to the norm of the weight vector. Thus, maximizing the margin is equivalent to minimizing ‖w‖! This serves as an additional justification of the 2-norm regularizer: having small weights means having large margins!
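As a quick numerical check of γ = 1/‖w‖ (a sketch assuming NumPy): pick an arbitrary hyperplane and two points placed exactly on its two margins, then compare Eq (7.41) against 1/‖w‖. The specific numbers are made up.

    import numpy as np

    w, b = np.array([3.0, 4.0]), -2.0            # an arbitrary hyperplane; ||w|| = 5
    x_plus  = np.array([1.0, 0.0])               # chosen so that w.x_plus  + b = +1
    x_minus = np.array([1.0 / 3.0, 0.0])         # chosen so that w.x_minus + b = -1

    d_plus  = (w @ x_plus  + b) / np.linalg.norm(w)
    d_minus = (w @ x_minus + b) / np.linalg.norm(w)
    gamma = 0.5 * (d_plus - d_minus)             # Eq (7.41)
    print(gamma, 1.0 / np.linalg.norm(w))        # both 0.2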

However, our goal wasn't to justify the regularizer: it was to understand hinge loss. So let us go back to the soft-margin SVM and plug in our new knowledge about margins:

    min_{w,b,ξ}  (1/2) ‖w‖²    +   C Σ_n ξ_n                     (7.46)
                 [large margin]    [small slack]
    subj. to     y_n (w·x_n + b) ≥ 1 − ξ_n   (∀n)
                 ξ_n ≥ 0                     (∀n)

Now, let's play a thought experiment. Suppose someone handed you a solution to this optimization problem that consisted of weights (w) and a bias (b), but they forgot to give you the slacks. Could you recover the slacks from the information you have? In fact, the answer is yes!

For simplicity, let's consider positive examples. Suppose that you look at some positive example x_n. You need to figure out what the slack, ξ_n, would have been. There are two cases. Either w·x_n + b is at least 1 or it is not. If it's large enough, then you want to set ξ_n = 0. Why? It cannot be less than zero by the second constraint. Moreover, if you set it greater than zero, you will pay unnecessarily in the objective. So in this case, ξ_n = 0. Next, suppose that w·x_n + b = 0.2, so it is not big enough. In order to satisfy the first constraint, you'll need to set ξ_n ≥ 0.8. But because of the objective, you'll not want to set it any larger than necessary, so you'll set ξ_n = 0.8 exactly. Following this argument through for both positive and negative points, if someone gives you solutions for w, b, you can automatically compute the optimal ξ variables as:

    ξ_n = { 0                     if y_n (w·x_n + b) ≥ 1
          { 1 − y_n (w·x_n + b)   otherwise                      (7.47)

In other words, the optimal value for a slack variable is exactly the hinge loss on the corresponding example! Thus, we can write the SVM objective as an unconstrained optimization problem:

    min_{w,b}  (1/2) ‖w‖²    +   C Σ_n ℓ^(hin)(y_n, w·x_n + b)   (7.48)
               [large margin]    [small slack]

Multiplying this objective through by λ/C, we obtain exactly the regularized objective from Eq (7.8) with hinge loss as the loss function and the 2-norm as the regularizer!
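A minimal sketch (assuming NumPy) of Eq (7.47) and the unconstrained objective of Eq (7.48); the toy data and the candidate (w, b) are made up.

    import numpy as np

    def optimal_slacks(w, b, X, y):
        # Recover the slack variables from (w, b), Eq (7.47):
        # each xi_n is exactly the hinge loss on example n.
        return np.maximum(0.0, 1.0 - y * (X @ w + b))

    def svm_objective(w, b, X, y, C):
        # Unconstrained soft-margin SVM objective, Eq (7.48).
        return 0.5 * np.dot(w, w) + C * np.sum(optimal_slacks(w, b, X, y))

    X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
    y = np.array([+1, +1, -1, -1])
    w, b = np.array([0.5, 0.5]), 0.0                  # a made-up candidate solution
    print(optimal_slacks(w, b, X, y))                 # all zero: every point clears the margin
    print(svm_objective(w, b, X, y, C=1.0))           # 0.25 = (1/2) ||w||^2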
7.8 Further Reading

TODO further reading

More information

Math 113, Calculus II Winter 2007 Final Exam Solutions

Math 113, Calculus II Winter 2007 Final Exam Solutions Math, Calculus II Witer 7 Fial Exam Solutios (5 poits) Use the limit defiitio of the defiite itegral ad the sum formulas to compute x x + dx The check your aswer usig the Evaluatio Theorem Solutio: I this

More information

Zeros of Polynomials

Zeros of Polynomials Math 160 www.timetodare.com 4.5 4.6 Zeros of Polyomials I these sectios we will study polyomials algebraically. Most of our work will be cocered with fidig the solutios of polyomial equatios of ay degree

More information

18.657: Mathematics of Machine Learning

18.657: Mathematics of Machine Learning 8.657: Mathematics of Machie Learig Lecturer: Philippe Rigollet Lecture 0 Scribe: Ade Forrow Oct. 3, 05 Recall the followig defiitios from last time: Defiitio: A fuctio K : X X R is called a positive symmetric

More information

Posted-Price, Sealed-Bid Auctions

Posted-Price, Sealed-Bid Auctions Posted-Price, Sealed-Bid Auctios Professors Greewald ad Oyakawa 207-02-08 We itroduce the posted-price, sealed-bid auctio. This auctio format itroduces the idea of approximatios. We describe how well this

More information

REGRESSION WITH QUADRATIC LOSS

REGRESSION WITH QUADRATIC LOSS REGRESSION WITH QUADRATIC LOSS MAXIM RAGINSKY Regressio with quadratic loss is aother basic problem studied i statistical learig theory. We have a radom couple Z = X, Y ), where, as before, X is a R d

More information

Regression with quadratic loss

Regression with quadratic loss Regressio with quadratic loss Maxim Ragisky October 13, 2015 Regressio with quadratic loss is aother basic problem studied i statistical learig theory. We have a radom couple Z = X,Y, where, as before,

More information

ACCESS TO SCIENCE, ENGINEERING AND AGRICULTURE: MATHEMATICS 1 MATH00030 SEMESTER / Statistics

ACCESS TO SCIENCE, ENGINEERING AND AGRICULTURE: MATHEMATICS 1 MATH00030 SEMESTER / Statistics ACCESS TO SCIENCE, ENGINEERING AND AGRICULTURE: MATHEMATICS 1 MATH00030 SEMESTER 1 018/019 DR. ANTHONY BROWN 8. Statistics 8.1. Measures of Cetre: Mea, Media ad Mode. If we have a series of umbers the

More information

September 2012 C1 Note. C1 Notes (Edexcel) Copyright - For AS, A2 notes and IGCSE / GCSE worksheets 1

September 2012 C1 Note. C1 Notes (Edexcel) Copyright   - For AS, A2 notes and IGCSE / GCSE worksheets 1 September 0 s (Edecel) Copyright www.pgmaths.co.uk - For AS, A otes ad IGCSE / GCSE worksheets September 0 Copyright www.pgmaths.co.uk - For AS, A otes ad IGCSE / GCSE worksheets September 0 Copyright

More information

Lecture 7: Fourier Series and Complex Power Series

Lecture 7: Fourier Series and Complex Power Series Math 1d Istructor: Padraic Bartlett Lecture 7: Fourier Series ad Complex Power Series Week 7 Caltech 013 1 Fourier Series 1.1 Defiitios ad Motivatio Defiitio 1.1. A Fourier series is a series of fuctios

More information

Ray-triangle intersection

Ray-triangle intersection Ray-triagle itersectio ria urless October 2006 I this hadout, we explore the steps eeded to compute the itersectio of a ray with a triagle, ad the to compute the barycetric coordiates of that itersectio.

More information

OPTIMAL ALGORITHMS -- SUPPLEMENTAL NOTES

OPTIMAL ALGORITHMS -- SUPPLEMENTAL NOTES OPTIMAL ALGORITHMS -- SUPPLEMENTAL NOTES Peter M. Maurer Why Hashig is θ(). As i biary search, hashig assumes that keys are stored i a array which is idexed by a iteger. However, hashig attempts to bypass

More information

CSCI567 Machine Learning (Fall 2014)

CSCI567 Machine Learning (Fall 2014) CSCI567 Machie Learig (Fall 2014) Drs. Sha & Liu {feisha,yaliu.cs}@usc.edu October 14, 2014 Drs. Sha & Liu ({feisha,yaliu.cs}@usc.edu) CSCI567 Machie Learig (Fall 2014) October 14, 2014 1 / 49 Outlie Admiistratio

More information

WHAT IS THE PROBABILITY FUNCTION FOR LARGE TSUNAMI WAVES? ABSTRACT

WHAT IS THE PROBABILITY FUNCTION FOR LARGE TSUNAMI WAVES? ABSTRACT WHAT IS THE PROBABILITY FUNCTION FOR LARGE TSUNAMI WAVES? Harold G. Loomis Hoolulu, HI ABSTRACT Most coastal locatios have few if ay records of tsuami wave heights obtaied over various time periods. Still

More information

Once we have a sequence of numbers, the next thing to do is to sum them up. Given a sequence (a n ) n=1

Once we have a sequence of numbers, the next thing to do is to sum them up. Given a sequence (a n ) n=1 . Ifiite Series Oce we have a sequece of umbers, the ext thig to do is to sum them up. Give a sequece a be a sequece: ca we give a sesible meaig to the followig expressio? a = a a a a While summig ifiitely

More information

Statistical Inference (Chapter 10) Statistical inference = learn about a population based on the information provided by a sample.

Statistical Inference (Chapter 10) Statistical inference = learn about a population based on the information provided by a sample. Statistical Iferece (Chapter 10) Statistical iferece = lear about a populatio based o the iformatio provided by a sample. Populatio: The set of all values of a radom variable X of iterest. Characterized

More information

NUMERICAL METHODS COURSEWORK INFORMAL NOTES ON NUMERICAL INTEGRATION COURSEWORK

NUMERICAL METHODS COURSEWORK INFORMAL NOTES ON NUMERICAL INTEGRATION COURSEWORK NUMERICAL METHODS COURSEWORK INFORMAL NOTES ON NUMERICAL INTEGRATION COURSEWORK For this piece of coursework studets must use the methods for umerical itegratio they meet i the Numerical Methods module

More information

Math 113 Exam 3 Practice

Math 113 Exam 3 Practice Math Exam Practice Exam 4 will cover.-., 0. ad 0.. Note that eve though. was tested i exam, questios from that sectios may also be o this exam. For practice problems o., refer to the last review. This

More information

3.2 Properties of Division 3.3 Zeros of Polynomials 3.4 Complex and Rational Zeros of Polynomials

3.2 Properties of Division 3.3 Zeros of Polynomials 3.4 Complex and Rational Zeros of Polynomials Math 60 www.timetodare.com 3. Properties of Divisio 3.3 Zeros of Polyomials 3.4 Complex ad Ratioal Zeros of Polyomials I these sectios we will study polyomials algebraically. Most of our work will be cocered

More information

Lesson 10: Limits and Continuity

Lesson 10: Limits and Continuity www.scimsacademy.com Lesso 10: Limits ad Cotiuity SCIMS Academy 1 Limit of a fuctio The cocept of limit of a fuctio is cetral to all other cocepts i calculus (like cotiuity, derivative, defiite itegrals

More information

Axis Aligned Ellipsoid

Axis Aligned Ellipsoid Machie Learig for Data Sciece CS 4786) Lecture 6,7 & 8: Ellipsoidal Clusterig, Gaussia Mixture Models ad Geeral Mixture Models The text i black outlies high level ideas. The text i blue provides simple

More information

Recitation 4: Lagrange Multipliers and Integration

Recitation 4: Lagrange Multipliers and Integration Math 1c TA: Padraic Bartlett Recitatio 4: Lagrage Multipliers ad Itegratio Week 4 Caltech 211 1 Radom Questio Hey! So, this radom questio is pretty tightly tied to today s lecture ad the cocept of cotet

More information

REGRESSION (Physics 1210 Notes, Partial Modified Appendix A)

REGRESSION (Physics 1210 Notes, Partial Modified Appendix A) REGRESSION (Physics 0 Notes, Partial Modified Appedix A) HOW TO PERFORM A LINEAR REGRESSION Cosider the followig data poits ad their graph (Table I ad Figure ): X Y 0 3 5 3 7 4 9 5 Table : Example Data

More information

Polynomial Functions and Their Graphs

Polynomial Functions and Their Graphs Polyomial Fuctios ad Their Graphs I this sectio we begi the study of fuctios defied by polyomial expressios. Polyomial ad ratioal fuctios are the most commo fuctios used to model data, ad are used extesively

More information

TEACHER CERTIFICATION STUDY GUIDE

TEACHER CERTIFICATION STUDY GUIDE COMPETENCY 1. ALGEBRA SKILL 1.1 1.1a. ALGEBRAIC STRUCTURES Kow why the real ad complex umbers are each a field, ad that particular rigs are ot fields (e.g., itegers, polyomial rigs, matrix rigs) Algebra

More information

A sequence of numbers is a function whose domain is the positive integers. We can see that the sequence

A sequence of numbers is a function whose domain is the positive integers. We can see that the sequence Sequeces A sequece of umbers is a fuctio whose domai is the positive itegers. We ca see that the sequece,, 2, 2, 3, 3,... is a fuctio from the positive itegers whe we write the first sequece elemet as

More information

Chapter 3. Strong convergence. 3.1 Definition of almost sure convergence

Chapter 3. Strong convergence. 3.1 Definition of almost sure convergence Chapter 3 Strog covergece As poited out i the Chapter 2, there are multiple ways to defie the otio of covergece of a sequece of radom variables. That chapter defied covergece i probability, covergece i

More information

( 1) n (4x + 1) n. n=0

( 1) n (4x + 1) n. n=0 Problem 1 (10.6, #). Fid the radius of covergece for the series: ( 1) (4x + 1). For what values of x does the series coverge absolutely, ad for what values of x does the series coverge coditioally? Solutio.

More information

Linear regression. Daniel Hsu (COMS 4771) (y i x T i β)2 2πσ. 2 2σ 2. 1 n. (x T i β y i ) 2. 1 ˆβ arg min. β R n d

Linear regression. Daniel Hsu (COMS 4771) (y i x T i β)2 2πσ. 2 2σ 2. 1 n. (x T i β y i ) 2. 1 ˆβ arg min. β R n d Liear regressio Daiel Hsu (COMS 477) Maximum likelihood estimatio Oe of the simplest liear regressio models is the followig: (X, Y ),..., (X, Y ), (X, Y ) are iid radom pairs takig values i R d R, ad Y

More information

( ) (( ) ) ANSWERS TO EXERCISES IN APPENDIX B. Section B.1 VECTORS AND SETS. Exercise B.1-1: Convex sets. are convex, , hence. and. (a) Let.

( ) (( ) ) ANSWERS TO EXERCISES IN APPENDIX B. Section B.1 VECTORS AND SETS. Exercise B.1-1: Convex sets. are convex, , hence. and. (a) Let. Joh Riley 8 Jue 03 ANSWERS TO EXERCISES IN APPENDIX B Sectio B VECTORS AND SETS Exercise B-: Covex sets (a) Let 0 x, x X, X, hece 0 x, x X ad 0 x, x X Sice X ad X are covex, x X ad x X The x X X, which

More information

Pattern recognition systems Lab 10 Linear Classifiers and the Perceptron Algorithm

Pattern recognition systems Lab 10 Linear Classifiers and the Perceptron Algorithm Patter recogitio systems Lab 10 Liear Classifiers ad the Perceptro Algorithm 1. Objectives his lab sessio presets the perceptro learig algorithm for the liear classifier. We will apply gradiet descet ad

More information

ECE 8527: Introduction to Machine Learning and Pattern Recognition Midterm # 1. Vaishali Amin Fall, 2015

ECE 8527: Introduction to Machine Learning and Pattern Recognition Midterm # 1. Vaishali Amin Fall, 2015 ECE 8527: Itroductio to Machie Learig ad Patter Recogitio Midterm # 1 Vaishali Ami Fall, 2015 tue39624@temple.edu Problem No. 1: Cosider a two-class discrete distributio problem: ω 1 :{[0,0], [2,0], [2,2],

More information

Lecture 20. Brief Review of Gram-Schmidt and Gauss s Algorithm

Lecture 20. Brief Review of Gram-Schmidt and Gauss s Algorithm 8.409 A Algorithmist s Toolkit Nov. 9, 2009 Lecturer: Joatha Keler Lecture 20 Brief Review of Gram-Schmidt ad Gauss s Algorithm Our mai task of this lecture is to show a polyomial time algorithm which

More information

The Method of Least Squares. To understand least squares fitting of data.

The Method of Least Squares. To understand least squares fitting of data. The Method of Least Squares KEY WORDS Curve fittig, least square GOAL To uderstad least squares fittig of data To uderstad the least squares solutio of icosistet systems of liear equatios 1 Motivatio Curve

More information

(A sequence also can be thought of as the list of function values attained for a function f :ℵ X, where f (n) = x n for n 1.) x 1 x N +k x N +4 x 3

(A sequence also can be thought of as the list of function values attained for a function f :ℵ X, where f (n) = x n for n 1.) x 1 x N +k x N +4 x 3 MATH 337 Sequeces Dr. Neal, WKU Let X be a metric space with distace fuctio d. We shall defie the geeral cocept of sequece ad limit i a metric space, the apply the results i particular to some special

More information

P1 Chapter 8 :: Binomial Expansion

P1 Chapter 8 :: Binomial Expansion P Chapter 8 :: Biomial Expasio jfrost@tiffi.kigsto.sch.uk www.drfrostmaths.com @DrFrostMaths Last modified: 6 th August 7 Use of DrFrostMaths for practice Register for free at: www.drfrostmaths.com/homework

More information

NICK DUFRESNE. 1 1 p(x). To determine some formulas for the generating function of the Schröder numbers, r(x) = a(x) =

NICK DUFRESNE. 1 1 p(x). To determine some formulas for the generating function of the Schröder numbers, r(x) = a(x) = AN INTRODUCTION TO SCHRÖDER AND UNKNOWN NUMBERS NICK DUFRESNE Abstract. I this article we will itroduce two types of lattice paths, Schröder paths ad Ukow paths. We will examie differet properties of each,

More information