7 LINEAR MODELS

    The essence of mathematics is not to make simple things complicated, but to make complicated things simple.  -- Stanley Gudder

Learning Objectives:
- Define and plot four surrogate loss functions: squared loss, logistic loss, exponential loss and hinge loss.
- Compare and contrast the optimization of 0/1 loss and surrogate loss functions.
- Solve the optimization problem for squared loss with a quadratic regularizer in closed form.
- Implement and debug gradient descent and subgradient descent.

In Chapter 4, you learned about the perceptron algorithm for linear classification. This was both a model (linear classifier) and algorithm (the perceptron update rule) in one. In this section, we will separate these two, and consider general ways for optimizing linear models. This will lead us into some aspects of optimization (aka mathematical programming), but not very far. At the end of this chapter, there are pointers to more literature on optimization for those who are interested.

The basic idea of the perceptron is to run a particular algorithm until a linear separator is found. You might ask: are there better algorithms for finding such a linear separator? We will follow this idea and formulate a learning problem as an explicit optimization problem: find me a linear separator that is not too complicated. We will see that finding an optimal separator is actually computationally prohibitive, and so we will need to relax the optimality requirement. This will lead us to a convex objective that combines a loss function (how well are we doing on the training data) and a regularizer (how complicated is our learned model). This learning framework is known as both Tikhonov regularization and structural risk minimization.

7.1 The Optimization Framework for Linear Models

You have already seen the perceptron as a way of finding a weight vector w and bias b that do a good job of separating positive training examples from negative training examples. The perceptron is a model and algorithm in one. Here, we are interested in separating these issues. We will focus on linear models, like the perceptron. But we will think about other, more generic ways of finding good parameters of these models.

The goal of the perceptron was to find a separating hyperplane for some training data set. For simplicity, you can ignore the issue of overfitting (but just for now!). Not all data sets are linearly separable.

In the case that your training data isn't linearly separable, you might want to find the hyperplane that makes the fewest errors on the training data. We can write this down as a formal mathematical optimization problem as follows:

    min_{w,b}  Σ_n 1[ y_n (w·x_n + b) ≤ 0 ]                    (7.1)

In this expression, you are optimizing over two variables, w and b. The objective function is the thing you are trying to minimize. In this case, the objective function is simply the error rate (or 0/1 loss) of the linear classifier parameterized by w, b. In this expression, 1[·] is the indicator function: it is one when (·) is true and zero otherwise. (You should remember the y_n w·x_n trick from the perceptron discussion. If not, re-convince yourself that this is doing the right thing.)

We know that the perceptron algorithm is guaranteed to find parameters for this model if the data is linearly separable. In other words, if the optimum of Eq (7.1) is zero, then the perceptron will efficiently find parameters for this model. The notion of efficiency depends on the margin of the data for the perceptron.

You might ask: what happens if the data is not linearly separable? Is there an efficient algorithm for finding an optimal setting of the parameters? Unfortunately, the answer is no. There is no polynomial time algorithm for solving Eq (7.1), unless P = NP. In other words, this problem is NP-hard. Sadly, the proof of this is quite complicated and beyond the scope of this book, but it relies on a reduction from a variant of satisfiability. The key idea is to turn a satisfiability problem into an optimization problem where a clause is satisfied exactly when the hyperplane correctly separates the data.

You might then come back and say: okay, well I don't really need an exact solution. I'm willing to have a solution that makes one or two more errors than it has to. Unfortunately, the situation is really bad. Zero/one loss is NP-hard to even approximately minimize. In other words, there is no efficient algorithm for even finding a solution that's a small constant worse than optimal.
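To make Eq (7.1) concrete, here is a minimal sketch, assuming NumPy; the tiny dataset and the candidate parameters are made up purely for illustration. It also hints at why this objective is hard to work with: the count of errors is piecewise constant in w and b, so small parameter changes usually do not change it at all.

    import numpy as np

    def zero_one_objective(w, b, X, y):
        # 0/1 loss objective of Eq (7.1): how many training points does the
        # linear classifier parameterized by (w, b) get wrong (or put on the boundary)?
        margins = y * (X @ w + b)          # y_n (w . x_n + b) for each example n
        return int(np.sum(margins <= 0))

    # Toy data: two 2-d points per class (illustrative only).
    X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
    y = np.array([+1, +1, -1, -1])

    w, b = np.array([0.5, 0.5]), 0.0
    print(zero_one_objective(w, b, X, y))            # 0: this (w, b) separates the toy data
    print(zero_one_objective(w, b + 0.001, X, y))    # still 0: a tiny change rarely moves the count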

However, before getting too disillusioned about this whole enterprise (remember: there's an entire chapter about this framework, so it must be going somewhere!), you should remember that optimizing Eq (7.1) perhaps isn't even what you want to do! In particular, all it says is that you will get minimal training error. It says nothing about what your test error will be like. In order to try to find a solution that will generalize well to test data, you need to ensure that you do not overfit the data. To do this, you can introduce a regularizer over the parameters of the model. For now, we will be vague about what this regularizer looks like, and simply call it an arbitrary function R(w, b). This leads to the following, regularized objective:

    min_{w,b}  Σ_n 1[ y_n (w·x_n + b) ≤ 0 ] + λ R(w, b)        (7.2)

In Eq (7.2), we are now trying to optimize a trade-off between a solution that gives low training error (the first term) and a solution that is simple (the second term). You can think of the maximum depth hyperparameter of a decision tree as a form of regularization for trees. Here, R is a form of regularization for hyperplanes. In this formulation, λ becomes a hyperparameter for the optimization. (Assuming R does the right thing, what value(s) of λ will lead to overfitting? What value(s) will lead to underfitting?)

The key remaining questions, given this formalism, are:
- How can we adjust the optimization problem so that there are efficient algorithms for solving it?
- What are good regularizers R(w, b) for hyperplanes?
- Assuming we can adjust the optimization problem appropriately, what algorithms exist for efficiently solving this regularized optimization problem?

We will address these three questions in the next sections.

7.2 Convex Surrogate Loss Functions

You might ask: why is optimizing zero/one loss so hard? Intuitively, one reason is that small changes to w, b can have a large impact on the value of the objective function. For instance, if there is a positive training example whose prediction w·x + b is a tiny negative number, then adjusting b upwards by slightly more than that amount will decrease your error rate by 1, but adjusting it upwards by slightly less will have no effect. This makes it really difficult to figure out good ways to adjust the parameters.

To see this more clearly, it is useful to look at plots that relate margin to loss. Such a plot for zero/one loss is shown in Figure 7.1. In this plot, the horizontal axis measures the margin of a data point and the vertical axis measures the loss associated with that margin. For zero/one loss, the story is simple. If you get a positive margin (i.e., y(w·x + b) > 0) then you get a loss of zero. Otherwise you get a loss of one. By thinking about this plot, you can see how changes to the parameters that change the margin just a little bit can have an enormous effect on the overall loss.

[Figure 7.1: plot of zero/one loss versus margin]

You might decide that a reasonable way to address this problem is to replace the non-smooth zero/one loss with a smooth approximation. With a bit of effort, you could probably concoct an "S"-shaped function like that shown in Figure 7.2. The benefit of using such an S-function is that it is smooth, and potentially easier to optimize. The difficulty is that it is not convex.

[Figure 7.2: plot of zero/one loss versus margin, with an "S"-shaped smooth version of it]

If you remember from calculus, a convex function is one that looks like a happy face (a smile). (On the other hand, a concave function is one that looks like a sad face (a frown); an easy mnemonic is that you can hide under a concave function.) There are two equivalent definitions of a convex function. The first is that its second derivative is always non-negative (for twice-differentiable functions). The second, more geometric, definition is that any chord of the function lies above it. This is shown in Figure 7.3. There you can see a convex function and a non-convex function, both with two chords drawn in. In the case of the convex function, the chords lie above the function. In the case of the non-convex function, there are parts of the chord that lie below the function.

[Figure 7.3: plot of convex and non-convex functions, with two chords each]

Convex functions are nice because they are easy to minimize. Intuitively, if you drop a ball anywhere in a convex function, it will eventually get to the minimum. This is not true for non-convex functions. For example, if you drop a ball on the very left end of the S-function from Figure 7.2, it will not go anywhere.

This leads to the idea of convex surrogate loss functions. Since zero/one loss is hard to optimize, you want to optimize something else, instead. Since convex functions are easy to optimize, we want to approximate zero/one loss with a convex function. This approximating function will be called a surrogate loss. The surrogate losses we construct will always be upper bounds on the true loss function: this guarantees that if you minimize the surrogate loss, you are also pushing down the real loss.

There are four common surrogate loss functions, each with their own properties: hinge loss, logistic loss, exponential loss and squared loss. These are shown in Figure 7.4 and defined below. They are defined in terms of the true label y (which is just {−1, +1}) and the predicted value ŷ = w·x + b.

    Zero/one:     ℓ^(0/1)(y, ŷ) = 1[ y ŷ ≤ 0 ]                        (7.3)
    Hinge:        ℓ^(hin)(y, ŷ) = max{ 0, 1 − y ŷ }                   (7.4)
    Logistic:     ℓ^(log)(y, ŷ) = (1 / log 2) log(1 + exp[−y ŷ])      (7.5)
    Exponential:  ℓ^(exp)(y, ŷ) = exp[−y ŷ]                           (7.6)
    Squared:      ℓ^(sqr)(y, ŷ) = (y − ŷ)²                            (7.7)

[Figure 7.4: the surrogate loss functions]

In the definition of logistic loss, the 1/log 2 term out front is there simply to ensure that ℓ^(log)(y, 0) = 1. This ensures, like all the other surrogate loss functions, that logistic loss upper bounds the zero/one loss. (In practice, people typically omit this constant since it does not affect the optimization.)
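As a quick illustration, here is a minimal sketch (assuming NumPy) that evaluates the four surrogate losses of Eqs (7.4)-(7.7) on an arbitrary grid of predicted values for a positive example, and checks numerically that each one upper bounds the zero/one loss.

    import numpy as np

    def hinge(y, yhat):       return np.maximum(0.0, 1.0 - y * yhat)          # Eq (7.4)
    def logistic(y, yhat):    return np.log1p(np.exp(-y * yhat)) / np.log(2)  # Eq (7.5)
    def exponential(y, yhat): return np.exp(-y * yhat)                        # Eq (7.6)
    def squared(y, yhat):     return (y - yhat) ** 2                          # Eq (7.7)

    y, yhat = 1.0, np.linspace(-3, 3, 13)          # a positive example, arbitrary predictions
    zero_one = (y * yhat <= 0).astype(float)       # Eq (7.3)
    for name, loss in [("hinge", hinge), ("logistic", logistic),
                       ("exponential", exponential), ("squared", squared)]:
        values = loss(y, yhat)
        assert np.all(values >= zero_one)          # every surrogate upper bounds 0/1 loss
        print(f"{name:12s}", np.round(values, 2))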

There are two big differences in these loss functions. The first difference is how upset they get by erroneous predictions. In the case of hinge loss and logistic loss, the growth of the function as ŷ goes negative is linear. For squared loss and exponential loss, it is super-linear. This means that exponential loss would rather get a few examples a little wrong than one example really wrong. The other difference is how they deal with very confident correct predictions. Once y ŷ > 1, hinge loss does not care any more, but logistic and exponential still think you can do better. On the other hand, squared loss thinks it's just as bad to predict +3 on a positive example as it is to predict −1 on a positive example.

7.3 Weight Regularization

In our learning objective, Eq (7.2), we had a term corresponding to the zero/one loss on the training data, plus a regularizer whose goal was to ensure that the learned function didn't get too crazy. (Or, more formally, to ensure that the function did not overfit.) If you replace the zero/one loss with a surrogate loss, you obtain the following objective:

    min_{w,b}  Σ_n ℓ(y_n, w·x_n + b) + λ R(w, b)                (7.8)

The question is: what should R(w, b) look like? From the discussion of surrogate loss functions, we would like to ensure that R is convex. Otherwise, we will be back to the point where optimization becomes difficult. Beyond that, a common desire is that the components of the weight vector (i.e., the w_d's) should be small (close to zero). This is a form of inductive bias.

Why are small values of w_d good? Or, more precisely, why do small values of w_d correspond to simple functions? Suppose that we have an example x with label +1. We might believe that other examples x′ that are nearby x should also have label +1. For example, if I obtain x′ by taking x and changing the first component by some small value ε and leaving the rest the same, you might think that the classification would be the same. If you do this, the difference between ŷ and ŷ′ will be exactly εw_1. So if w_1 is reasonably small, this is unlikely to have much of an effect on the classification decision. On the other hand, if w_1 is large, this could have a large effect. Another way of saying the same thing is to look at the derivative of the prediction as a function of the input x_1. The derivative of w·x + b with respect to x_1 is:

    ∂[w·x + b] / ∂x_1 = ∂[ Σ_d w_d x_d + b ] / ∂x_1 = w_1        (7.9)

Interpreting the derivative as the rate of change, we can see that the rate of change of the prediction function is proportional to the individual weights.

So if you want the function to change slowly, you want to ensure that the weights stay small. One way to accomplish this is to simply use the norm of the weight vector, namely R^(norm)(w, b) = ‖w‖ = √(Σ_d w_d²). This function is convex and smooth, which makes it easy to minimize. In practice, it's often easier to use the squared norm, namely R^(sqr)(w, b) = ‖w‖² = Σ_d w_d², because it removes the ugly square root term and remains convex. An alternative to using the sum of squared weights is to use the sum of absolute weights: R^(abs)(w, b) = Σ_d |w_d|. Both of these norms are convex. (Why do we not regularize the bias term b?)

In addition to small weights being good, you could argue that zero weights are better. If a weight w_d goes to zero, then this means that feature d is not used at all in the classification decision. If there are a large number of irrelevant features, you might want as many weights to go to zero as possible. This suggests an alternative regularizer that simply counts the nonzero weights: R^(cnt)(w, b) = Σ_d 1[w_d ≠ 0]. (Why might you not want to use R^(cnt) as a regularizer?)

This line of thinking leads to the general concept of p-norms. (Technically these are called ℓ_p (or "ell p") norms, but this notation clashes with the use of ℓ for "loss.") This is a family of norms that all have the same general flavor. We write ‖w‖_p to denote the p-norm of w:

    ‖w‖_p = ( Σ_d |w_d|^p )^(1/p)                                (7.10)

You can check that the 2-norm exactly corresponds to the usual Euclidean norm, and that the 1-norm corresponds to the absolute regularizer described above. (You can actually identify the R^(cnt) regularizer with a p-norm as well. Which value of p gives it to you? Hint: you may have to take a limit.)

When p-norms are used to regularize weight vectors, the interesting aspect is how they trade off multiple features. To see the behavior of p-norms in two dimensions, we can plot their contours (or level-sets). Figure 7.5 shows the contours for several p-norms in two dimensions. Each line denotes the two-dimensional vectors to which that norm assigns a total value of 1. By changing the value of p, you can interpolate between a square (the so-called "max norm"), down to a circle (2-norm), diamond (1-norm) and pointy-star-shaped-thing (p < 1 "norm"). (The max norm corresponds to the limit p → ∞. Why is it called the max norm?)

[Figure 7.5: level sets of several p-norms in two dimensions]

In general, smaller values of p prefer sparser vectors. You can see this by noticing that the contours of small p-norms stretch out along the axes. It is for this reason that small p-norms tend to yield weight vectors with many zero entries (aka sparse weight vectors). Unfortunately, for p < 1 the "norm" becomes non-convex. As you might guess, this means that the 1-norm is a popular choice for sparsity-seeking applications.
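Here is a minimal sketch, assuming NumPy, of Eq (7.10) for a made-up weight vector; np.linalg.norm provides the same values, including the max norm as the limiting case.

    import numpy as np

    def p_norm(w, p):
        # ||w||_p = (sum_d |w_d|^p)^(1/p), Eq (7.10).
        return np.sum(np.abs(w) ** p) ** (1.0 / p)

    w = np.array([3.0, -4.0, 0.0])                     # a made-up weight vector
    print(p_norm(w, 2), np.linalg.norm(w, 2))          # 5.0 both ways: the Euclidean norm
    print(p_norm(w, 1), np.linalg.norm(w, 1))          # 7.0: the sum of absolute values
    print(np.linalg.norm(w, np.inf))                   # 4.0: the max norm, the limit as p grows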

MATH REVIEW: GRADIENTS

A gradient is a multidimensional generalization of a derivative. Suppose you have a function f : R^D → R that takes a vector x = ⟨x_1, x_2, ..., x_D⟩ as input and produces a scalar value as output. You can differentiate this function according to any one of the inputs; for instance, you can compute ∂f/∂x_5 to get the derivative with respect to the fifth input. The gradient of f is just the vector consisting of the derivatives of f with respect to each of its input coordinates independently, and is denoted ∇f, or, when the input to f is ambiguous, ∇_x f. This is defined as:

    ∇_x f = ⟨ ∂f/∂x_1, ∂f/∂x_2, ..., ∂f/∂x_D ⟩                   (7.11)

For example, consider the function f(x_1, x_2, x_3) = x_1³ + 5 x_1 x_2 − 3 x_2 x_3². The gradient is:

    ∇_x f = ⟨ 3x_1² + 5x_2,  5x_1 − 3x_3²,  −6x_2 x_3 ⟩          (7.12)

Note that if f : R^D → R, then ∇f : R^D → R^D. If you evaluate ∇f(x), this will give you the gradient at x, a vector in R^D. This vector can be interpreted as the direction of steepest ascent: namely, if you were to travel an infinitesimal amount in the direction of the gradient, you would go uphill (i.e., increase f) the most.
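As a sanity check on the worked example above, here is a minimal sketch (assuming NumPy) that compares the analytic gradient of Eq (7.12) against central finite differences; the evaluation point is arbitrary.

    import numpy as np

    def f(x):
        x1, x2, x3 = x
        return x1**3 + 5*x1*x2 - 3*x2*x3**2

    def grad_f(x):
        # Analytic gradient from Eq (7.12).
        x1, x2, x3 = x
        return np.array([3*x1**2 + 5*x2, 5*x1 - 3*x3**2, -6*x2*x3])

    def numerical_grad(f, x, eps=1e-6):
        # Central finite differences, one coordinate at a time.
        g = np.zeros_like(x)
        for d in range(len(x)):
            e = np.zeros_like(x)
            e[d] = eps
            g[d] = (f(x + e) - f(x - e)) / (2 * eps)
        return g

    x = np.array([1.0, 2.0, -1.0])        # an arbitrary evaluation point
    print(grad_f(x))                      # [13.  2. 12.]
    print(numerical_grad(f, x))           # agrees to about six decimal places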

7.4 Optimization with Gradient Descent

Envision the following problem. You're taking up a new hobby: blindfolded mountain climbing. Someone blindfolds you and drops you on the side of a mountain. Your goal is to get to the peak of the mountain as quickly as possible. All you can do is feel the mountain where you are standing, and take steps. How would you get to the top of the mountain? Perhaps you would feel to find out what direction feels the most "upward" and take a step in that direction. If you do this repeatedly, you might hope to get to the top of the mountain. (Actually, if your friend promises always to drop you on purely concave mountains, you will eventually get to the peak!)

The idea of gradient-based methods of optimization is exactly the same. Suppose you are trying to find the maximum of a function f(x). The optimizer maintains a current estimate of the parameter of interest, x. At each step, it measures the gradient of the function it is trying to optimize. This measurement occurs at the current location, x. Call the gradient g. It then takes a step in the direction of the gradient, where the size of the step is controlled by a parameter η (eta). The complete step is x ← x + ηg. This is the basic idea of gradient ascent. The opposite of gradient ascent is gradient descent. All of our learning problems will be framed as minimization problems (trying to reach the bottom of a ditch, rather than the top of a hill). Therefore, descent is the primary approach you will use. One of the major conditions for gradient descent being able to find the true, global minimum of its objective function is convexity. Without convexity, all is lost.

The gradient descent algorithm is sketched in Algorithm 21. The function takes as arguments the function F to be minimized, the number of iterations K to run, and a sequence of learning rates η_1, ..., η_K. (This is to address the case that you might want to start your mountain climbing taking large steps, but only take small steps when you are close to the peak.)

Algorithm 21  GradientDescent(F, K, η_1, ...)
 1: z^(0) ← ⟨0, 0, ..., 0⟩                   // initialize variable we are optimizing
 2: for k = 1 ... K do
 3:   g^(k) ← ∇_z F |_{z^(k−1)}              // compute gradient at current location
 4:   z^(k) ← z^(k−1) − η^(k) g^(k)          // take a step down the gradient
 5: end for
 6: return z^(K)

The only real work you need to do to apply a gradient descent method is be able to compute derivatives. For concreteness, suppose that you choose exponential loss as a loss function and the 2-norm as a regularizer. Then, the regularized objective function is:

    L(w, b) = Σ_n exp[ −y_n (w·x_n + b) ] + (λ/2) ‖w‖²           (7.13)

The only strange thing in this objective is that we have replaced λ with λ/2. The reason for this change is just to make the gradients cleaner. We can first compute derivatives with respect to b:

    ∂L/∂b = ∂/∂b Σ_n exp[ −y_n (w·x_n + b) ] + ∂/∂b (λ/2) ‖w‖²                   (7.14)
          = Σ_n ∂/∂b exp[ −y_n (w·x_n + b) ] + 0                                  (7.15)
          = Σ_n ( ∂/∂b [ −y_n (w·x_n + b) ] ) exp[ −y_n (w·x_n + b) ]             (7.16)
          = − Σ_n y_n exp[ −y_n (w·x_n + b) ]                                     (7.17)

Before proceeding, it is worth thinking about what this says. From a practical perspective, the optimization will operate by updating b ← b − η ∂L/∂b. Consider positive examples: examples with y_n = +1. We would hope for these examples that the current prediction, w·x_n + b, is as large as possible. As this value tends toward +∞, the term in the exp[·] goes to zero. Thus, such points will not contribute to the step. However, if the current prediction is small, then the exp[·] term will be positive and non-zero. This means that the bias term b will be increased, which is exactly what you would want. Moreover, once all points are very well classified, the derivative goes to zero. (This considered the case of positive examples. What happens with negative examples?)
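A minimal sketch of the loop in Algorithm 21, assuming NumPy; the quadratic test function is made up just to check that the iterates approach a known minimizer.

    import numpy as np

    def gradient_descent(grad_F, dim, K, etas):
        # Generic gradient descent in the spirit of Algorithm 21.
        # grad_F maps z to the gradient of F at z; etas is the sequence of step sizes.
        z = np.zeros(dim)               # line 1: initialize the variable we are optimizing
        for k in range(K):
            g = grad_F(z)               # line 3: compute gradient at the current location
            z = z - etas[k] * g         # line 4: take a step down the gradient
        return z

    # Sanity check on F(z) = ||z - c||^2, whose unique minimum is at c.
    c = np.array([1.0, -2.0, 3.0])
    grad_F = lambda z: 2.0 * (z - c)
    print(gradient_descent(grad_F, dim=3, K=100, etas=[0.1] * 100))   # approximately [1, -2, 3]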

Now that we have done the easy case, let's do the gradient with respect to w:

    ∇_w L = ∇_w Σ_n exp[ −y_n (w·x_n + b) ] + ∇_w (λ/2) ‖w‖²                      (7.18)
          = Σ_n ( ∇_w [ −y_n (w·x_n + b) ] ) exp[ −y_n (w·x_n + b) ] + λw         (7.19)
          = − Σ_n y_n x_n exp[ −y_n (w·x_n + b) ] + λw                            (7.20)

Now you can repeat the previous exercise. The update is of the form w ← w − η ∇_w L. For well classified points (ones for which y_n (w·x_n + b) → +∞), the gradient is near zero. For poorly classified points, the gradient points in the direction −y_n x_n, so the update is of the form w ← w + c y_n x_n, where c is some constant. This is just like the perceptron update! Note that c is large for very poorly classified points and small for relatively well classified points. By looking at the part of the gradient related to the regularizer, the update says: w ← w − ηλw = (1 − ηλ)w. This has the effect of shrinking the weights toward zero. This is exactly what we expect the regularizer to be doing!

The success of gradient descent hinges on appropriate choices for the step size. Figure 7.7 shows what can happen with gradient descent with poorly chosen step sizes. If the step size is too big, you can accidentally step over the optimum and end up oscillating. If the step size is too small, it will take way too long to get to the optimum. For a well-chosen step size, you can show that gradient descent will approach the optimal value at a fast rate. The notion of convergence here is that the objective value converges to the true minimum.

[Figure 7.7: good and bad step sizes]

Theorem 8 (Gradient Descent Convergence). Under suitable conditions¹, for an appropriately chosen constant step size (i.e., η_1 = η_2 = ··· = η), the convergence rate of gradient descent is O(1/k). More specifically, letting z* be the global minimum of F, we have:

    F(z^(k)) − F(z*) ≤ 2 ‖z^(0) − z*‖² / (η k)

¹ Specifically, the function to be optimized needs to be strongly convex. This is true for all our problems, provided λ > 0. For λ = 0 the rate could be as bad as O(1/√k).

(A naive reading of this theorem seems to say that you should choose huge values of η. It should be obvious that this cannot be right. What is missing?)

The proof of this theorem is a bit complicated because it makes heavy use of some linear algebra. The key is to set the learning rate to 1/L, where L is the maximum curvature of the function that is being optimized. The curvature is simply the size of the second derivative. Functions with high curvature have gradients that change quickly, which means that you need to take small steps to avoid overstepping the optimum.
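Putting Eq (7.17) and Eq (7.20) into the descent loop gives a complete trainer. Below is a minimal sketch, assuming NumPy, with a made-up toy dataset and an arbitrary small constant step size, chosen conservatively in line with the curvature discussion above.

    import numpy as np

    def exp_loss_gradients(w, b, X, y, lam):
        # Gradients of Eq (7.13): sum_n exp[-y_n (w.x_n + b)] + (lam/2) ||w||^2
        e = np.exp(-y * (X @ w + b))        # exp[-y_n (w.x_n + b)] for every n
        grad_w = -(y * e) @ X + lam * w     # Eq (7.20)
        grad_b = -np.sum(y * e)             # Eq (7.17)
        return grad_w, grad_b

    def train_exp_loss(X, y, lam=0.1, eta=0.01, iters=500):
        w, b = np.zeros(X.shape[1]), 0.0
        for _ in range(iters):
            gw, gb = exp_loss_gradients(w, b, X, y, lam)
            w, b = w - eta * gw, b - eta * gb    # descend on both parameters
        return w, b

    # Toy, linearly separable data (illustrative only).
    X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
    y = np.array([+1.0, +1.0, -1.0, -1.0])
    w, b = train_exp_loss(X, y)
    print(np.sign(X @ w + b))               # matches y on this toy data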

This convergence result suggests a simple approach to deciding when to stop optimizing: wait until the objective function stops changing by much. An alternative is to wait until the parameters stop changing by much. A final example is to do what you did for the perceptron: early stopping. Every iteration, you can check the performance of the current model on some held-out data, and stop optimizing when performance plateaus.

7.5 From Gradients to Subgradients

As a good exercise, you should try deriving gradient descent update rules for the different loss functions and different regularizers you've learned about. However, if you do this, you might notice that hinge loss and the 1-norm regularizer are not differentiable everywhere! In particular, the 1-norm is not differentiable at w_d = 0, and the hinge loss is not differentiable at y ŷ = 1.

The solution to this is to use subgradient optimization. One way to think about subgradients is to not think about them too much: you essentially ignore the fact that your function isn't differentiable everywhere, and just try to apply gradient descent anyway.

To be more concrete, consider the hinge function f(z) = max{0, 1 − z}. This function is differentiable for z > 1 and differentiable for z < 1, but not differentiable at z = 1. You can compute its derivative by treating the two differentiable regions separately:

    ∂f/∂z = ∂/∂z { 0 if z > 1 ; 1 − z if z < 1 }                 (7.21)
          = { ∂(0)/∂z if z > 1 ; ∂(1 − z)/∂z if z < 1 }          (7.22)
          = { 0 if z > 1 ; −1 if z < 1 }                         (7.23)

Thus, the derivative is zero for z > 1 and −1 for z < 1, matching intuition from the figure. At the non-differentiable point, z = 1, we can use a subderivative: a generalization of derivatives to non-differentiable functions. Intuitively, you can think of the derivative of f at z as the tangent line: namely, it is the line that touches f at z and is always below f (for convex functions). The subderivative, denoted ∂f, is the set of all such lines. At differentiable positions, this set consists of just the actual derivative. At non-differentiable positions, it contains all slopes that define lines that always lie under the function and make contact at the operating point. This is shown pictorially in Figure 7.8, where example subderivatives are shown for the hinge loss function.

[Figure 7.8: hinge loss with example subderivatives]

In the particular case of hinge loss, any value between −1 and 0 is a valid subderivative at z = 1. In fact, the subderivative is always a closed set of the form [a, b], where a and b can be derived by looking at limits from the left and right. This gives you a way of computing derivative-like things for non-differentiable functions. Take hinge loss as an example. For a given example n, the subgradient of hinge loss can be computed as:

    ∂_w max{ 0, 1 − y_n (w·x_n + b) }                                              (7.24)
    = ∂_w { 0 if y_n (w·x_n + b) > 1 ; 1 − y_n (w·x_n + b) otherwise }             (7.25)
    = { ∂_w 0 if y_n (w·x_n + b) > 1 ; ∂_w [ 1 − y_n (w·x_n + b) ] otherwise }     (7.26)
    = { 0 if y_n (w·x_n + b) > 1 ; −y_n x_n otherwise }                            (7.27)

If you plug this subgradient form into the generic loop of Algorithm 21, you obtain Algorithm 22. This is subgradient descent for regularized hinge loss (with a 2-norm regularizer).

Algorithm 22  HingeRegularizedGD(D, λ, MaxIter)
 1: w ← ⟨0, 0, ..., 0⟩, b ← 0                    // initialize weights and bias
 2: for iter = 1 ... MaxIter do
 3:   g ← ⟨0, 0, ..., 0⟩, g_b ← 0                // initialize gradient of weights and bias
 4:   for all (x, y) ∈ D do
 5:     if y(w·x + b) ≤ 1 then
 6:       g ← g + y x                            // update weight gradient
 7:       g_b ← g_b + y                          // update bias derivative
 8:     end if
 9:   end for
10:   g ← g − λw                                 // add in regularization term
11:   w ← w + ηg                                 // update weights
12:   b ← b + η g_b                              // update bias
13: end for
14: return w, b
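Here is a minimal sketch of Algorithm 22 in NumPy; the step size η, the regularization strength, and the toy dataset are all made-up illustrative choices (the pseudocode above leaves η unspecified).

    import numpy as np

    def hinge_regularized_gd(X, y, lam=0.1, eta=0.01, max_iter=200):
        # Subgradient descent for regularized hinge loss, mirroring Algorithm 22.
        w, b = np.zeros(X.shape[1]), 0.0
        for _ in range(max_iter):
            g, g_b = np.zeros_like(w), 0.0       # accumulators for the negative (sub)gradient
            for x_n, y_n in zip(X, y):
                if y_n * (x_n @ w + b) <= 1:     # margin violated or met exactly
                    g += y_n * x_n               # update weight (sub)gradient
                    g_b += y_n                   # update bias (sub)gradient
            g -= lam * w                         # add in the regularization term
            w += eta * g                         # g holds the negative gradient, so this descends
            b += eta * g_b
        return w, b

    X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
    y = np.array([+1, +1, -1, -1])
    w, b = hinge_regularized_gd(X, y)
    print(np.sign(X @ w + b))                    # matches y on this toy data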

7.6 Closed-form Optimization for Squared Loss

Although gradient descent is a good, generic optimization algorithm, there are cases when you can do better. An example is the case of a 2-norm regularizer and squared error loss function. For this, you can actually obtain a closed form solution for the optimal weights. However, to obtain this, you need to rewrite the optimization problem in terms of matrix operations. For simplicity, we will only consider the unbiased version (no bias term b); the extension is left as an exercise.

MATH REVIEW: MATRIX MULTIPLICATION AND INVERSION

If A and B are matrices, and A is N×K and B is K×M (the inner dimensions must match), then the matrix product AB is a matrix C that is N×M, with C_{n,m} = Σ_k A_{n,k} B_{k,m}. If v is a vector in R^D, we will treat it as a column vector, or a matrix of size D×1. Thus, Av is well defined if A is M×D, and the resulting product is a vector u with u_m = Σ_d A_{m,d} v_d.

Aside from the matrix product, a fundamental matrix operation is inversion. We will often encounter a form like Ax = y, where A and y are known and we want to solve for x. If A is square of size N×N, then the inverse of A, denoted A⁻¹, is also a square matrix of size N×N, such that A A⁻¹ = I_N = A⁻¹ A. I.e., multiplying a matrix by its inverse (on either side) gives back the identity matrix. Using this, we can solve Ax = y by multiplying both sides by A⁻¹ on the left (recall that order matters in matrix multiplication), yielding A⁻¹ A x = A⁻¹ y, from which we can conclude x = A⁻¹ y. Note that not all square matrices are invertible. For instance, the all-zeros matrix does not have an inverse (in the same way that 1/0 is not defined for scalars). However, there are other matrices that do not have inverses either; such matrices are called singular.

This setup is precisely the linear regression setting. You can think of the training data as a large matrix X of size N×D, where X_{n,d} is the value of the d-th feature on the n-th example. You can think of the labels as a column ("tall") vector Y of dimension N. Finally, you can think of the weights as a column vector w of size D. Thus, the matrix-vector product a = Xw has dimension N. In particular:

    a_n = [Xw]_n = Σ_d X_{n,d} w_d                               (7.28)

This means, in particular, that a is actually the vector of predictions of the model. Instead of calling this a, we will call it Ŷ. The squared error says that we should minimize (1/2) Σ_n (Ŷ_n − Y_n)², which can be written in vector form as a minimization of (1/2) ‖Ŷ − Y‖². This can be expanded visually as:

    [ x_{1,1}  x_{1,2}  ...  x_{1,D} ] [ w_1 ]     [ Σ_d x_{1,d} w_d ]     [ y_1 ]
    [ x_{2,1}  x_{2,2}  ...  x_{2,D} ] [ w_2 ]  =  [ Σ_d x_{2,d} w_d ]  ≈  [ y_2 ]       (7.29)
    [   ...      ...    ...    ...   ] [ ... ]     [       ...       ]     [ ... ]
    [ x_{N,1}  x_{N,2}  ...  x_{N,D} ] [ w_D ]     [ Σ_d x_{N,d} w_d ]     [ y_N ]
              X                          w                  Ŷ                 Y

(Verify that the squared error can actually be written as this vector norm.)
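The margin note above asks you to verify the norm form of the squared error; here is a quick numerical spot-check of that identity (a sketch assuming NumPy, with random made-up data), not a substitute for the short algebraic argument.

    import numpy as np

    rng = np.random.default_rng(0)
    N, D = 6, 3
    X = rng.normal(size=(N, D))       # training data, one row per example
    Y = rng.normal(size=N)            # labels, a "tall" vector of dimension N
    w = rng.normal(size=D)            # some candidate weight vector

    Y_hat = X @ w                                          # predictions, Eq (7.28)
    loss_sum  = 0.5 * np.sum((Y_hat - Y) ** 2)             # (1/2) sum_n (Yhat_n - Y_n)^2
    loss_norm = 0.5 * np.linalg.norm(X @ w - Y) ** 2       # (1/2) ||Xw - Y||^2
    print(np.isclose(loss_sum, loss_norm))                 # True: the two forms agree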

So, compactly, our optimization problem can be written as:

    min_w  L(w) = (1/2) ‖Xw − Y‖² + (λ/2) ‖w‖²                   (7.30)

If you recall from calculus, you can minimize a function by setting its derivative to zero. We start with the weights w and take gradients:

    ∇_w L(w) = Xᵀ (Xw − Y) + λw                                  (7.31)
             = Xᵀ X w − Xᵀ Y + λw                                (7.32)
             = ( Xᵀ X + λ I_D ) w − Xᵀ Y                         (7.33)

We can equate this to zero and solve, yielding:

    ( Xᵀ X + λ I_D ) w − Xᵀ Y = 0                                (7.34)
    ( Xᵀ X + λ I_D ) w = Xᵀ Y                                    (7.35)
    w = ( Xᵀ X + λ I_D )⁻¹ Xᵀ Y                                  (7.36)

Thus, the optimal solution of the weights can be computed by a few matrix multiplications and a matrix inversion. As a sanity check, you can make sure that the dimensions match. The matrix Xᵀ X has dimension D×D, and therefore so does the inverse term. The inverse is D×D and Xᵀ is D×N, so that product is D×N. Multiplying through by the N×1 vector Y yields a D×1 vector, which is precisely what we want for the weights. (For those who are keen on linear algebra, you might be worried that the matrix you must invert might not be invertible. Is this actually a problem?)
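A minimal sketch of Eq (7.36), assuming NumPy (using np.linalg.solve rather than forming the inverse explicitly, which is the numerically safer route), compared against gradient descent on the same objective; the synthetic data, λ, step size and iteration count are arbitrary.

    import numpy as np

    def ridge_closed_form(X, Y, lam):
        # w = (X^T X + lam I_D)^{-1} X^T Y, Eq (7.36); solve() avoids forming the inverse.
        D = X.shape[1]
        return np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ Y)

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 5))                         # synthetic data, N=100, D=5
    Y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.normal(size=100)
    lam = 0.5

    w_exact = ridge_closed_form(X, Y, lam)
    w = np.zeros(5)
    for _ in range(2000):                                 # gradient descent on Eq (7.30)
        w -= 0.001 * (X.T @ (X @ w - Y) + lam * w)        # gradient from Eq (7.31)
    print(np.max(np.abs(w - w_exact)))                    # tiny: both reach the same optimum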

Note that this gives an exact solution, modulo numerical inaccuracies with computing matrix inverses. In contrast, gradient descent will give you progressively better solutions and will eventually converge to the optimum at a rate of 1/k. This means that if you want an answer that's within an accuracy of ε = 10⁻⁴, you will need something on the order of ten thousand steps.

The question is whether getting this exact solution is always more efficient. To run gradient descent for one step will take O(ND) time, with a relatively small constant. You will have to run K iterations, yielding an overall runtime of O(KND). On the other hand, the closed form solution requires constructing Xᵀ X, which takes O(D²N) time. The inversion takes O(D³) time using standard matrix inversion routines. The final multiplications take O(ND) time. Thus, the overall runtime is on the order of O(D³ + D²N). In most standard cases (though this is becoming less true over time), N > D, so this is dominated by O(D²N).

Thus, the overall question is whether you will need to run more than D-many iterations of gradient descent. If so, then the matrix inversion will be (roughly) faster. Otherwise, gradient descent will be (roughly) faster. For low- and medium-dimensional problems (say, D ≤ 100), it is probably faster to do the closed form solution via matrix inversion. For high-dimensional problems (D ≥ 10,000), it is probably faster to do gradient descent. For things in the middle, it's hard to say for sure.

7.7 Support Vector Machines

At the beginning of this chapter, you may have looked at the convex surrogate loss functions and asked yourself: where did these come from?! They are all derived from different underlying principles, which essentially correspond to different inductive biases.

Let's start by thinking back to the original goal of linear classifiers: to find a hyperplane that separates the positive training examples from the negative ones. Figure 7.10 shows some data and three potential hyperplanes: red, green and blue. Which one do you like best? Most likely you chose the green hyperplane. And most likely you chose it because it was furthest away from the closest training points. In other words, it had a large margin.

[Figure 7.10: data points with three candidate hyperplanes (red, green, blue), with green the best]

The desire for hyperplanes with large margins is a perfect example of an inductive bias. The data does not tell us which of the three hyperplanes is best: we have to choose one using some other source of information. Following this line of thinking leads us to the support vector machine (SVM). This is simply a way of setting up an optimization problem that attempts to find a separating hyperplane with as large a margin as possible. It is written as a constrained optimization problem:

    min_{w,b}  1 / γ(w, b)
    subj. to   y_n (w·x_n + b) ≥ 1   (∀n)                        (7.37)

In this optimization, you are trying to find parameters that maximize the margin, denoted γ (i.e., minimize the reciprocal of the margin), subject to the constraint that all training examples are correctly classified. The odd thing about this optimization problem is that we require the classification of each point to be greater than one rather than simply greater than zero. However, the problem doesn't fundamentally change if you replace the 1 with any other positive constant (see the exercises). As shown in Figure 7.11, the constant one can be interpreted visually as ensuring that there is a non-trivial margin between the positive points and negative points.

[Figure 7.11: hyperplane with margins on either side]

The difficulty with the optimization problem in Eq (7.37) is what happens with data that is not linearly separable. In that case, there is no set of parameters w, b that can simultaneously satisfy all the constraints.

In optimization terms, you would say that the feasible region is empty. (The feasible region is simply the set of all parameters that satisfy the constraints.) For this reason, this is referred to as the hard-margin SVM, because enforcing the margin is a hard constraint. The question is: how to modify this optimization problem so that it can handle inseparable data?

The key idea is the use of slack parameters. The intuition behind slack parameters is the following. Suppose we find a set of parameters w, b that do a really good job on 9999 data points. The points are perfectly classified and you achieve a large margin. But there's one pesky data point left that cannot be put on the proper side of the margin: perhaps it is noisy. (See Figure 7.12.) You want to be able to pretend that you can move that point across the hyperplane onto the proper side. You will have to pay a little bit to do so, but as long as you aren't moving a lot of points around, it should be a good idea to do this. In this picture, the amount that you move the point is denoted ξ (xi).

[Figure 7.12: one bad point with slack]

By introducing one slack parameter for each training example, and penalizing yourself for having to use slack, you can create an objective function like the following, the soft-margin SVM:

    min_{w,b,ξ}  1 / γ(w, b)    +   C Σ_n ξ_n
                 [large margin]     [small slack]
    subj. to     y_n (w·x_n + b) ≥ 1 − ξ_n   (∀n)                (7.38)
                 ξ_n ≥ 0                     (∀n)

The goal of this objective function is to ensure that all points are correctly classified (the first constraint). But if a point cannot be correctly classified, then you can set the slack ξ_n to something greater than zero to "move" it in the correct direction. However, for all non-zero slacks, you have to pay in the objective function proportional to the amount of slack. The hyperparameter C > 0 controls overfitting versus underfitting. The second constraint simply says that you must not have negative slack. (What values of C will lead to overfitting? What values will lead to underfitting?)

One major advantage of the soft-margin SVM over the original hard-margin SVM is that the feasible region is never empty. That is, there is always going to be some solution, regardless of whether your training data is linearly separable or not. (Suppose I give you a data set. Without even looking at the data, construct for me a feasible solution to the soft-margin SVM. What is the value of the objective for this solution?)

It's one thing to write down an optimization problem. It's another thing to try to solve it. There are a very large number of ways to optimize SVMs, essentially because they are such a popular learning model. Here, we will talk about just one, very simple way. More complex methods will be discussed later in this book once you have a bit more background.

To make progress, you need to be able to measure the size of the margin. Suppose someone gives you parameters w, b that optimize the hard-margin SVM. We wish to measure the size of the margin. The first observation is that the hyperplane will lie exactly halfway between the nearest positive point and nearest negative point. If not, the margin could be made bigger by simply sliding it one way or the other by adjusting the bias b.

By this observation, there is some positive example that lies exactly 1 unit from the hyperplane. Call it x⁺, so that w·x⁺ + b = 1. Similarly, there is some negative example, x⁻, that lies exactly on the other side of the margin: for which w·x⁻ + b = −1. These two points, x⁺ and x⁻, give us a way to measure the size of the margin. As shown in Figure 7.13, we can measure the size of the margin by looking at the difference between the lengths of the projections of x⁺ and x⁻ onto the direction of w. Since projection requires a normalized vector, we can measure the signed distances as:

    d⁺ = (1 / ‖w‖) (w·x⁺ + b)                                    (7.39)
    d⁻ = (1 / ‖w‖) (w·x⁻ + b)                                    (7.40)

[Figure 7.13: margin geometry; copy of a figure from p. 5 of the CS544 SVM tutorial]

We can then compute the margin by algebra:

    γ = (1/2) [ d⁺ − d⁻ ]                                        (7.41)
      = (1/2) [ (1/‖w‖)(w·x⁺ + b) − (1/‖w‖)(w·x⁻ + b) ]          (7.42)
      = (1/2) [ (1/‖w‖)(+1) − (1/‖w‖)(−1) ]                      (7.43)
      = (1/2) [ 2 / ‖w‖ ]                                        (7.44)
      = 1 / ‖w‖                                                  (7.45)

This is a remarkable conclusion: the size of the margin is inversely proportional to the norm of the weight vector. Thus, maximizing the margin is equivalent to minimizing ‖w‖! This serves as an additional justification of the 2-norm regularizer: having small weights means having large margins!
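As a quick numerical check of γ = 1/‖w‖ (a sketch assuming NumPy): pick an arbitrary hyperplane and two points placed exactly on its two margins, then compare Eq (7.41) against 1/‖w‖. The specific numbers are made up.

    import numpy as np

    w, b = np.array([3.0, 4.0]), -2.0            # an arbitrary hyperplane; ||w|| = 5
    x_plus  = np.array([1.0, 0.0])               # chosen so that w.x_plus  + b = +1
    x_minus = np.array([1.0 / 3.0, 0.0])         # chosen so that w.x_minus + b = -1

    d_plus  = (w @ x_plus  + b) / np.linalg.norm(w)
    d_minus = (w @ x_minus + b) / np.linalg.norm(w)
    gamma = 0.5 * (d_plus - d_minus)             # Eq (7.41)
    print(gamma, 1.0 / np.linalg.norm(w))        # both 0.2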

However, our goal wasn't to justify the regularizer: it was to understand hinge loss. So let us go back to the soft-margin SVM and plug in our new knowledge about margins:

    min_{w,b,ξ}  (1/2) ‖w‖²    +   C Σ_n ξ_n                     (7.46)
                 [large margin]    [small slack]
    subj. to     y_n (w·x_n + b) ≥ 1 − ξ_n   (∀n)
                 ξ_n ≥ 0                     (∀n)

Now, let's play a thought experiment. Suppose someone handed you a solution to this optimization problem that consisted of weights (w) and a bias (b), but they forgot to give you the slacks. Could you recover the slacks from the information you have? In fact, the answer is yes!

For simplicity, let's consider positive examples. Suppose that you look at some positive example x_n. You need to figure out what the slack, ξ_n, would have been. There are two cases. Either w·x_n + b is at least 1 or it is not. If it's large enough, then you want to set ξ_n = 0. Why? It cannot be less than zero by the second constraint. Moreover, if you set it greater than zero, you will pay unnecessarily in the objective. So in this case, ξ_n = 0. Next, suppose that w·x_n + b = 0.2, so it is not big enough. In order to satisfy the first constraint, you'll need to set ξ_n ≥ 0.8. But because of the objective, you'll not want to set it any larger than necessary, so you'll set ξ_n = 0.8 exactly. Following this argument through for both positive and negative points, if someone gives you solutions for w, b, you can automatically compute the optimal ξ variables as:

    ξ_n = { 0                     if y_n (w·x_n + b) ≥ 1
          { 1 − y_n (w·x_n + b)   otherwise                      (7.47)

In other words, the optimal value for a slack variable is exactly the hinge loss on the corresponding example! Thus, we can write the SVM objective as an unconstrained optimization problem:

    min_{w,b}  (1/2) ‖w‖²    +   C Σ_n ℓ^(hin)(y_n, w·x_n + b)   (7.48)
               [large margin]    [small slack]

Multiplying this objective through by λ/C, we obtain exactly the regularized objective from Eq (7.8) with hinge loss as the loss function and the 2-norm as the regularizer!
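A minimal sketch (assuming NumPy) of Eq (7.47) and the unconstrained objective of Eq (7.48); the toy data and the candidate (w, b) are made up.

    import numpy as np

    def optimal_slacks(w, b, X, y):
        # Recover the slack variables from (w, b), Eq (7.47):
        # each xi_n is exactly the hinge loss on example n.
        return np.maximum(0.0, 1.0 - y * (X @ w + b))

    def svm_objective(w, b, X, y, C):
        # Unconstrained soft-margin SVM objective, Eq (7.48).
        return 0.5 * np.dot(w, w) + C * np.sum(optimal_slacks(w, b, X, y))

    X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
    y = np.array([+1, +1, -1, -1])
    w, b = np.array([0.5, 0.5]), 0.0                  # a made-up candidate solution
    print(optimal_slacks(w, b, X, y))                 # all zero: every point clears the margin
    print(svm_objective(w, b, X, y, C=1.0))           # 0.25 = (1/2) ||w||^2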
7.8 Further Reading

TODO further reading

More information

Math 113, Calculus II Winter 2007 Final Exam Solutions

Math 113, Calculus II Winter 2007 Final Exam Solutions Math, Calculus II Witer 7 Fial Exam Solutios (5 poits) Use the limit defiitio of the defiite itegral ad the sum formulas to compute x x + dx The check your aswer usig the Evaluatio Theorem Solutio: I this

More information

Zeros of Polynomials

Zeros of Polynomials Math 160 www.timetodare.com 4.5 4.6 Zeros of Polyomials I these sectios we will study polyomials algebraically. Most of our work will be cocered with fidig the solutios of polyomial equatios of ay degree

More information

18.657: Mathematics of Machine Learning

18.657: Mathematics of Machine Learning 8.657: Mathematics of Machie Learig Lecturer: Philippe Rigollet Lecture 0 Scribe: Ade Forrow Oct. 3, 05 Recall the followig defiitios from last time: Defiitio: A fuctio K : X X R is called a positive symmetric

More information

Posted-Price, Sealed-Bid Auctions

Posted-Price, Sealed-Bid Auctions Posted-Price, Sealed-Bid Auctios Professors Greewald ad Oyakawa 207-02-08 We itroduce the posted-price, sealed-bid auctio. This auctio format itroduces the idea of approximatios. We describe how well this

More information

REGRESSION WITH QUADRATIC LOSS

REGRESSION WITH QUADRATIC LOSS REGRESSION WITH QUADRATIC LOSS MAXIM RAGINSKY Regressio with quadratic loss is aother basic problem studied i statistical learig theory. We have a radom couple Z = X, Y ), where, as before, X is a R d

More information

Regression with quadratic loss

Regression with quadratic loss Regressio with quadratic loss Maxim Ragisky October 13, 2015 Regressio with quadratic loss is aother basic problem studied i statistical learig theory. We have a radom couple Z = X,Y, where, as before,

More information

ACCESS TO SCIENCE, ENGINEERING AND AGRICULTURE: MATHEMATICS 1 MATH00030 SEMESTER / Statistics

ACCESS TO SCIENCE, ENGINEERING AND AGRICULTURE: MATHEMATICS 1 MATH00030 SEMESTER / Statistics ACCESS TO SCIENCE, ENGINEERING AND AGRICULTURE: MATHEMATICS 1 MATH00030 SEMESTER 1 018/019 DR. ANTHONY BROWN 8. Statistics 8.1. Measures of Cetre: Mea, Media ad Mode. If we have a series of umbers the

More information

September 2012 C1 Note. C1 Notes (Edexcel) Copyright - For AS, A2 notes and IGCSE / GCSE worksheets 1

September 2012 C1 Note. C1 Notes (Edexcel) Copyright   - For AS, A2 notes and IGCSE / GCSE worksheets 1 September 0 s (Edecel) Copyright www.pgmaths.co.uk - For AS, A otes ad IGCSE / GCSE worksheets September 0 Copyright www.pgmaths.co.uk - For AS, A otes ad IGCSE / GCSE worksheets September 0 Copyright

More information

Lecture 7: Fourier Series and Complex Power Series

Lecture 7: Fourier Series and Complex Power Series Math 1d Istructor: Padraic Bartlett Lecture 7: Fourier Series ad Complex Power Series Week 7 Caltech 013 1 Fourier Series 1.1 Defiitios ad Motivatio Defiitio 1.1. A Fourier series is a series of fuctios

More information

Ray-triangle intersection

Ray-triangle intersection Ray-triagle itersectio ria urless October 2006 I this hadout, we explore the steps eeded to compute the itersectio of a ray with a triagle, ad the to compute the barycetric coordiates of that itersectio.

More information

OPTIMAL ALGORITHMS -- SUPPLEMENTAL NOTES

OPTIMAL ALGORITHMS -- SUPPLEMENTAL NOTES OPTIMAL ALGORITHMS -- SUPPLEMENTAL NOTES Peter M. Maurer Why Hashig is θ(). As i biary search, hashig assumes that keys are stored i a array which is idexed by a iteger. However, hashig attempts to bypass

More information

CSCI567 Machine Learning (Fall 2014)

CSCI567 Machine Learning (Fall 2014) CSCI567 Machie Learig (Fall 2014) Drs. Sha & Liu {feisha,yaliu.cs}@usc.edu October 14, 2014 Drs. Sha & Liu ({feisha,yaliu.cs}@usc.edu) CSCI567 Machie Learig (Fall 2014) October 14, 2014 1 / 49 Outlie Admiistratio

More information

WHAT IS THE PROBABILITY FUNCTION FOR LARGE TSUNAMI WAVES? ABSTRACT

WHAT IS THE PROBABILITY FUNCTION FOR LARGE TSUNAMI WAVES? ABSTRACT WHAT IS THE PROBABILITY FUNCTION FOR LARGE TSUNAMI WAVES? Harold G. Loomis Hoolulu, HI ABSTRACT Most coastal locatios have few if ay records of tsuami wave heights obtaied over various time periods. Still

More information

Once we have a sequence of numbers, the next thing to do is to sum them up. Given a sequence (a n ) n=1

Once we have a sequence of numbers, the next thing to do is to sum them up. Given a sequence (a n ) n=1 . Ifiite Series Oce we have a sequece of umbers, the ext thig to do is to sum them up. Give a sequece a be a sequece: ca we give a sesible meaig to the followig expressio? a = a a a a While summig ifiitely

More information

Statistical Inference (Chapter 10) Statistical inference = learn about a population based on the information provided by a sample.

Statistical Inference (Chapter 10) Statistical inference = learn about a population based on the information provided by a sample. Statistical Iferece (Chapter 10) Statistical iferece = lear about a populatio based o the iformatio provided by a sample. Populatio: The set of all values of a radom variable X of iterest. Characterized

More information

NUMERICAL METHODS COURSEWORK INFORMAL NOTES ON NUMERICAL INTEGRATION COURSEWORK

NUMERICAL METHODS COURSEWORK INFORMAL NOTES ON NUMERICAL INTEGRATION COURSEWORK NUMERICAL METHODS COURSEWORK INFORMAL NOTES ON NUMERICAL INTEGRATION COURSEWORK For this piece of coursework studets must use the methods for umerical itegratio they meet i the Numerical Methods module

More information

Math 113 Exam 3 Practice

Math 113 Exam 3 Practice Math Exam Practice Exam 4 will cover.-., 0. ad 0.. Note that eve though. was tested i exam, questios from that sectios may also be o this exam. For practice problems o., refer to the last review. This

More information

3.2 Properties of Division 3.3 Zeros of Polynomials 3.4 Complex and Rational Zeros of Polynomials

3.2 Properties of Division 3.3 Zeros of Polynomials 3.4 Complex and Rational Zeros of Polynomials Math 60 www.timetodare.com 3. Properties of Divisio 3.3 Zeros of Polyomials 3.4 Complex ad Ratioal Zeros of Polyomials I these sectios we will study polyomials algebraically. Most of our work will be cocered

More information

Lesson 10: Limits and Continuity

Lesson 10: Limits and Continuity www.scimsacademy.com Lesso 10: Limits ad Cotiuity SCIMS Academy 1 Limit of a fuctio The cocept of limit of a fuctio is cetral to all other cocepts i calculus (like cotiuity, derivative, defiite itegrals

More information

Axis Aligned Ellipsoid

Axis Aligned Ellipsoid Machie Learig for Data Sciece CS 4786) Lecture 6,7 & 8: Ellipsoidal Clusterig, Gaussia Mixture Models ad Geeral Mixture Models The text i black outlies high level ideas. The text i blue provides simple

More information

Recitation 4: Lagrange Multipliers and Integration

Recitation 4: Lagrange Multipliers and Integration Math 1c TA: Padraic Bartlett Recitatio 4: Lagrage Multipliers ad Itegratio Week 4 Caltech 211 1 Radom Questio Hey! So, this radom questio is pretty tightly tied to today s lecture ad the cocept of cotet

More information

REGRESSION (Physics 1210 Notes, Partial Modified Appendix A)

REGRESSION (Physics 1210 Notes, Partial Modified Appendix A) REGRESSION (Physics 0 Notes, Partial Modified Appedix A) HOW TO PERFORM A LINEAR REGRESSION Cosider the followig data poits ad their graph (Table I ad Figure ): X Y 0 3 5 3 7 4 9 5 Table : Example Data

More information

Polynomial Functions and Their Graphs

Polynomial Functions and Their Graphs Polyomial Fuctios ad Their Graphs I this sectio we begi the study of fuctios defied by polyomial expressios. Polyomial ad ratioal fuctios are the most commo fuctios used to model data, ad are used extesively

More information

TEACHER CERTIFICATION STUDY GUIDE

TEACHER CERTIFICATION STUDY GUIDE COMPETENCY 1. ALGEBRA SKILL 1.1 1.1a. ALGEBRAIC STRUCTURES Kow why the real ad complex umbers are each a field, ad that particular rigs are ot fields (e.g., itegers, polyomial rigs, matrix rigs) Algebra

More information

A sequence of numbers is a function whose domain is the positive integers. We can see that the sequence

A sequence of numbers is a function whose domain is the positive integers. We can see that the sequence Sequeces A sequece of umbers is a fuctio whose domai is the positive itegers. We ca see that the sequece,, 2, 2, 3, 3,... is a fuctio from the positive itegers whe we write the first sequece elemet as

More information

Chapter 3. Strong convergence. 3.1 Definition of almost sure convergence

Chapter 3. Strong convergence. 3.1 Definition of almost sure convergence Chapter 3 Strog covergece As poited out i the Chapter 2, there are multiple ways to defie the otio of covergece of a sequece of radom variables. That chapter defied covergece i probability, covergece i

More information

( 1) n (4x + 1) n. n=0

( 1) n (4x + 1) n. n=0 Problem 1 (10.6, #). Fid the radius of covergece for the series: ( 1) (4x + 1). For what values of x does the series coverge absolutely, ad for what values of x does the series coverge coditioally? Solutio.

More information

Linear regression. Daniel Hsu (COMS 4771) (y i x T i β)2 2πσ. 2 2σ 2. 1 n. (x T i β y i ) 2. 1 ˆβ arg min. β R n d

Linear regression. Daniel Hsu (COMS 4771) (y i x T i β)2 2πσ. 2 2σ 2. 1 n. (x T i β y i ) 2. 1 ˆβ arg min. β R n d Liear regressio Daiel Hsu (COMS 477) Maximum likelihood estimatio Oe of the simplest liear regressio models is the followig: (X, Y ),..., (X, Y ), (X, Y ) are iid radom pairs takig values i R d R, ad Y

More information

( ) (( ) ) ANSWERS TO EXERCISES IN APPENDIX B. Section B.1 VECTORS AND SETS. Exercise B.1-1: Convex sets. are convex, , hence. and. (a) Let.

( ) (( ) ) ANSWERS TO EXERCISES IN APPENDIX B. Section B.1 VECTORS AND SETS. Exercise B.1-1: Convex sets. are convex, , hence. and. (a) Let. Joh Riley 8 Jue 03 ANSWERS TO EXERCISES IN APPENDIX B Sectio B VECTORS AND SETS Exercise B-: Covex sets (a) Let 0 x, x X, X, hece 0 x, x X ad 0 x, x X Sice X ad X are covex, x X ad x X The x X X, which

More information

Pattern recognition systems Lab 10 Linear Classifiers and the Perceptron Algorithm

Pattern recognition systems Lab 10 Linear Classifiers and the Perceptron Algorithm Patter recogitio systems Lab 10 Liear Classifiers ad the Perceptro Algorithm 1. Objectives his lab sessio presets the perceptro learig algorithm for the liear classifier. We will apply gradiet descet ad

More information

ECE 8527: Introduction to Machine Learning and Pattern Recognition Midterm # 1. Vaishali Amin Fall, 2015

ECE 8527: Introduction to Machine Learning and Pattern Recognition Midterm # 1. Vaishali Amin Fall, 2015 ECE 8527: Itroductio to Machie Learig ad Patter Recogitio Midterm # 1 Vaishali Ami Fall, 2015 tue39624@temple.edu Problem No. 1: Cosider a two-class discrete distributio problem: ω 1 :{[0,0], [2,0], [2,2],

More information

Lecture 20. Brief Review of Gram-Schmidt and Gauss s Algorithm

Lecture 20. Brief Review of Gram-Schmidt and Gauss s Algorithm 8.409 A Algorithmist s Toolkit Nov. 9, 2009 Lecturer: Joatha Keler Lecture 20 Brief Review of Gram-Schmidt ad Gauss s Algorithm Our mai task of this lecture is to show a polyomial time algorithm which

More information

The Method of Least Squares. To understand least squares fitting of data.

The Method of Least Squares. To understand least squares fitting of data. The Method of Least Squares KEY WORDS Curve fittig, least square GOAL To uderstad least squares fittig of data To uderstad the least squares solutio of icosistet systems of liear equatios 1 Motivatio Curve

More information

(A sequence also can be thought of as the list of function values attained for a function f :ℵ X, where f (n) = x n for n 1.) x 1 x N +k x N +4 x 3

(A sequence also can be thought of as the list of function values attained for a function f :ℵ X, where f (n) = x n for n 1.) x 1 x N +k x N +4 x 3 MATH 337 Sequeces Dr. Neal, WKU Let X be a metric space with distace fuctio d. We shall defie the geeral cocept of sequece ad limit i a metric space, the apply the results i particular to some special

More information

P1 Chapter 8 :: Binomial Expansion

P1 Chapter 8 :: Binomial Expansion P Chapter 8 :: Biomial Expasio jfrost@tiffi.kigsto.sch.uk www.drfrostmaths.com @DrFrostMaths Last modified: 6 th August 7 Use of DrFrostMaths for practice Register for free at: www.drfrostmaths.com/homework

More information

NICK DUFRESNE. 1 1 p(x). To determine some formulas for the generating function of the Schröder numbers, r(x) = a(x) =

NICK DUFRESNE. 1 1 p(x). To determine some formulas for the generating function of the Schröder numbers, r(x) = a(x) = AN INTRODUCTION TO SCHRÖDER AND UNKNOWN NUMBERS NICK DUFRESNE Abstract. I this article we will itroduce two types of lattice paths, Schröder paths ad Ukow paths. We will examie differet properties of each,

More information