CSCI567 Machine Learning (Fall 2014)


1 CSCI567 Machine Learning (Fall 2014), Drs. Sha & Liu, October 14, 2014

2 Outline: 1 Administration 2 Review of last lecture 3 Geometric Understanding of SVM 4 Boosting

3 Quiz 1 (Administration): Tuesday Oct :20pm, THH 301. Please arrive on time.

4 Lecture schedule (Administration): Oct 15 or Oct 16, the TA will lead a session on Pragmatics. Oct 20 or Oct 21, no lecture in the scheduled time; prepare for the quiz.

5 Homework #1 (Administration): grade statistics (Count 136; Min; Max; Average; Median; STD).

6 Outline (Review of last lecture): 1 Administration 2 Review of last lecture (Support vector machines; Basic Lagrange duality theory) 3 Geometric Understanding of SVM 4 Boosting

7 Support vector machines: Hinge loss (Review of last lecture)
Assuming the label y ∈ {−1, 1} and the decision rule is h(x) = sign(f(x)) with f(x) = w^T φ(x) + b,
ℓ_hinge(f(x), y) = 0 if y f(x) ≥ 1, and 1 − y f(x) otherwise; or equivalently ℓ_hinge(f(x), y) = max(0, 1 − y f(x)).
Intuition: penalize more the more a point is incorrectly classified (the left branch up to the kink point).
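
A quick numeric check of the hinge loss definition above (a sketch; the function name hinge_loss is ours, labels assumed in {−1, +1}):

```python
import numpy as np

def hinge_loss(f_x, y):
    # elementwise hinge loss max(0, 1 - y f(x)), labels y in {-1, +1}
    return np.maximum(0.0, 1.0 - y * f_x)

# points with y f(x) >= 1 incur no loss; the loss grows linearly past the kink
print(hinge_loss(np.array([2.0, 0.5, -1.0]), np.array([1, 1, 1])))  # [0.  0.5 2. ]
```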

8 Primal formulation of support vector machines (SVM) (Review of last lecture)
Minimizing the total hinge loss on all the training data:
min_{w,b} Σ_n max(0, 1 − y_n [w^T φ(x_n) + b]) + (λ/2) ||w||_2^2
Equivalently,
min_{w,b,{ξ_n}} C Σ_n ξ_n + (1/2) ||w||_2^2
s.t. 1 − y_n [w^T φ(x_n) + b] ≤ ξ_n, ξ_n ≥ 0, ∀n
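
For concreteness, a minimal subgradient-descent sketch of the first (unconstrained) primal objective above; the slides do not prescribe an optimizer, and the function name, step size, and epoch count are arbitrary choices:

```python
import numpy as np

def primal_svm_subgradient(Phi, y, lam=0.1, lr=0.01, epochs=200):
    """Minimize  sum_n max(0, 1 - y_n (w^T phi_n + b)) + lam/2 ||w||^2  by subgradient
    descent. Phi: (N, d) matrix whose rows are phi(x_n); y in {-1, +1}. A rough sketch."""
    N, d = Phi.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        margins = y * (Phi @ w + b)
        active = margins < 1                        # points contributing hinge loss
        grad_w = lam * w - (y[active, None] * Phi[active]).sum(axis=0)
        grad_b = -y[active].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b
```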

9 Basic Lagrange duality theory (Review of last lecture)
Key concepts you should know: What do primal and dual mean? How does SVM exploit the dual formulation, which results in using kernel functions for nonlinear classification? What do support vectors mean?
Our roadmap: We will tell you what the dual looks like. We will show you how it is derived.

10 Derivation of the dual (Review of last lecture)
We will derive the dual formulation, as the process will reveal some interesting and important properties of SVM. In particular, why is it called "support vector"?
Recipe: Formulate a Lagrangian function that incorporates the constraints, through introducing dual variables. Minimize the Lagrangian function to solve for the primal variables. Put the primal variables into the Lagrangian and express it in terms of the dual variables. Maximize the Lagrangian with respect to the dual variables. Recover the solution (for the primal variables) from the dual variables.

11 Deriving the dual for SVM (Review of last lecture)
Lagrangian:
L(w, b, {ξ_n}, {α_n}, {λ_n}) = C Σ_n ξ_n + (1/2) ||w||_2^2 − Σ_n λ_n ξ_n + Σ_n α_n {1 − y_n [w^T φ(x_n) + b] − ξ_n}
under the constraints that α_n ≥ 0 and λ_n ≥ 0.

12 Minimizing the Lagrangian (Review of last lecture)
Taking derivatives with respect to the primal variables:
∂L/∂w = w − Σ_n y_n α_n φ(x_n) = 0

13 Minimizing the Lagrangian (Review of last lecture)
Taking derivatives with respect to the primal variables:
∂L/∂w = w − Σ_n y_n α_n φ(x_n) = 0
∂L/∂b = Σ_n α_n y_n = 0

14 Minimizing the Lagrangian (Review of last lecture)
Taking derivatives with respect to the primal variables:
∂L/∂w = w − Σ_n y_n α_n φ(x_n) = 0
∂L/∂b = Σ_n α_n y_n = 0
∂L/∂ξ_n = C − λ_n − α_n = 0

15 Minimizing the Lagrangian (Review of last lecture)
Taking derivatives with respect to the primal variables:
∂L/∂w = w − Σ_n y_n α_n φ(x_n) = 0
∂L/∂b = Σ_n α_n y_n = 0
∂L/∂ξ_n = C − λ_n − α_n = 0
This gives rise to equations linking the primal and dual variables, as well as new constraints on the dual variables:
w = Σ_n α_n y_n φ(x_n)
Σ_n α_n y_n = 0
C − λ_n − α_n = 0

16 Rewrite the Lagrangian in terms of dual variables (Review of last lecture)
Substitute the solution to the primal back into the Lagrangian:
g({α_n}, {λ_n}) = L(w, b, {ξ_n}, {α_n}, {λ_n})

17 Rewrite the Lagrangian in terms of dual variables (Review of last lecture)
Substitute the solution to the primal back into the Lagrangian:
g({α_n}, {λ_n}) = L(w, b, {ξ_n}, {α_n}, {λ_n})
= Σ_n (C − α_n − λ_n) ξ_n + (1/2) ||Σ_n y_n α_n φ(x_n)||_2^2 + Σ_n α_n − (Σ_n α_n y_n) b − Σ_n α_n y_n (Σ_m y_m α_m φ(x_m))^T φ(x_n)

18 Rewrite the Lagrangian in terms of dual variables (Review of last lecture)
g({α_n}, {λ_n}) = L(w, b, {ξ_n}, {α_n}, {λ_n})
= Σ_n (C − α_n − λ_n) ξ_n + (1/2) ||Σ_n y_n α_n φ(x_n)||_2^2 + Σ_n α_n − (Σ_n α_n y_n) b − Σ_n α_n y_n (Σ_m y_m α_m φ(x_m))^T φ(x_n)
= Σ_n α_n + (1/2) ||Σ_n y_n α_n φ(x_n)||_2^2 − Σ_{m,n} α_m α_n y_m y_n φ(x_m)^T φ(x_n)

19 Rewrite the Lagrangian in terms of dual variables (Review of last lecture)
g({α_n}, {λ_n}) = L(w, b, {ξ_n}, {α_n}, {λ_n})
= Σ_n α_n + (1/2) ||Σ_n y_n α_n φ(x_n)||_2^2 − Σ_{m,n} α_m α_n y_m y_n φ(x_m)^T φ(x_n)
= Σ_n α_n − (1/2) Σ_{m,n} α_m α_n y_m y_n φ(x_m)^T φ(x_n)
Several terms vanish because of the constraints Σ_n α_n y_n = 0 and C − λ_n − α_n = 0.

20 The dual problem (Review of last lecture)
Maximizing the dual under the constraints:
max_{{α_n},{λ_n}} g({α_n}, {λ_n}) = Σ_n α_n − (1/2) Σ_{m,n} y_m y_n α_m α_n k(x_m, x_n)
s.t. α_n ≥ 0, ∀n
Σ_n α_n y_n = 0
C − λ_n − α_n = 0, λ_n ≥ 0, ∀n

21 The dual problem (Review of last lecture)
max_{{α_n},{λ_n}} g({α_n}, {λ_n}) = Σ_n α_n − (1/2) Σ_{m,n} y_m y_n α_m α_n k(x_m, x_n)
s.t. α_n ≥ 0, Σ_n α_n y_n = 0, C − λ_n − α_n = 0, λ_n ≥ 0, ∀n
We can simplify: the objective function does not depend on λ_n, so we can convert the equality constraint involving λ_n into an inequality constraint α_n ≤ C:
C − λ_n − α_n = 0, λ_n ≥ 0  ⟺  λ_n = C − α_n ≥ 0  ⟺  α_n ≤ C

22 Final form (Review of last lecture)
max_α Σ_n α_n − (1/2) Σ_{m,n} y_m y_n α_m α_n φ(x_m)^T φ(x_n)
s.t. 0 ≤ α_n ≤ C, ∀n
Σ_n α_n y_n = 0
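
As an illustration (not from the slides), this dual can be handed to a generic constrained solver; below is a rough sketch using scipy's SLSQP with box constraints 0 ≤ α_n ≤ C and the equality constraint Σ_n α_n y_n = 0. Real SVM libraries use specialized solvers (e.g. SMO), and the helper names here are ours:

```python
import numpy as np
from scipy.optimize import minimize

def svm_dual(X, y, C=1.0, kernel=lambda A, B: A @ B.T):
    """Solve the SVM dual with a generic solver (sketch only, not production code).
    X: (N, d) features, y: (N,) labels in {-1, +1}; the default kernel is linear."""
    N = len(y)
    K = kernel(X, X)                       # Gram matrix, K[m, n] = k(x_m, x_n)
    YKY = np.outer(y, y) * K               # y_m y_n k(x_m, x_n)

    def neg_dual(alpha):                   # minimize the negative dual objective
        return -(alpha.sum() - 0.5 * alpha @ YKY @ alpha)

    def neg_dual_grad(alpha):
        return -(np.ones(N) - YKY @ alpha)

    constraints = [{"type": "eq", "fun": lambda a: a @ y}]   # sum_n alpha_n y_n = 0
    bounds = [(0.0, C)] * N                                   # 0 <= alpha_n <= C
    res = minimize(neg_dual, np.zeros(N), jac=neg_dual_grad,
                   bounds=bounds, constraints=constraints, method="SLSQP")
    return res.x                           # the dual variables alpha
```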

23 Recover the solution (Review of last lecture)
The primal variable w is identified as w = Σ_n α_n y_n φ(x_n).

24 Recover the solution (Review of last lecture)
The primal variable w is identified as w = Σ_n α_n y_n φ(x_n). To identify b, we need something else.

25 Complementary slackness and support vectors (Review of last lecture)
At the optimal solution to both primal and dual, the following must be satisfied for every inequality constraint (these are called the KKT conditions):
λ_n ξ_n = 0
α_n {1 − ξ_n − y_n [w^T φ(x_n) + b]} = 0

26 Complementary slackness and support vectors (Review of last lecture)
λ_n ξ_n = 0
α_n {1 − ξ_n − y_n [w^T φ(x_n) + b]} = 0
From the first condition, if α_n < C, then λ_n = C − α_n > 0, which forces ξ_n = 0. Thus, in conjunction with the second condition, we know that if C > α_n > 0, then
1 − y_n [w^T φ(x_n) + b] = 0  ⟹  b = y_n − w^T φ(x_n)
as y_n ∈ {−1, 1}. Training samples whose α_n > 0 are called support vectors. (We will discuss their geometric interpretation later.)

27 Dual formulation and kernelized SVM (Review of last lecture)
The dual is also a convex quadratic program:
max_α Σ_n α_n − (1/2) Σ_{m,n} y_m y_n α_m α_n φ(x_m)^T φ(x_n)
s.t. 0 ≤ α_n ≤ C, ∀n; Σ_n α_n y_n = 0
We replace the inner products φ(x_m)^T φ(x_n) with a kernel function:
max_α Σ_n α_n − (1/2) Σ_{m,n} y_m y_n α_m α_n k(x_m, x_n)
s.t. 0 ≤ α_n ≤ C, ∀n; Σ_n α_n y_n = 0

28 Recovering the solution to the primal formulation (Review of last lecture)
Weights: w = Σ_n y_n α_n φ(x_n), a linear combination of the input features!
Bias: b = y_n − w^T φ(x_n) = y_n − Σ_m y_m α_m k(x_m, x_n), for any n with C > α_n > 0.
Making a prediction on a test point x:
h(x) = sign(w^T φ(x) + b) = sign(Σ_n y_n α_n k(x_n, x) + b)
Again, to make a prediction, it suffices to know the kernel function.
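
Continuing the sketch started after the final dual form (hypothetical helper names): recovering b from a support vector with 0 < α_n < C and predicting with the kernel only.

```python
import numpy as np

def recover_b_and_predict(alpha, X, y, X_test, C, kernel):
    """Given dual variables alpha (e.g. from the QP sketch earlier), recover b from a
    margin support vector (0 < alpha_n < C) and predict on test points."""
    K = kernel(X, X)
    # pick any n strictly inside the box; the tolerance 1e-6 is an arbitrary choice
    on_margin = np.where((alpha > 1e-6) & (alpha < C - 1e-6))[0]
    n = on_margin[0]
    b = y[n] - np.sum(y * alpha * K[:, n])     # b = y_n - sum_m y_m alpha_m k(x_m, x_n)

    K_test = kernel(X, X_test)                 # K_test[m, j] = k(x_m, x_test_j)
    scores = (y * alpha) @ K_test + b          # sum_n y_n alpha_n k(x_n, x) + b
    return b, np.sign(scores)
```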

29 Things you need to know about deriving the dual (Review of last lecture)
Make sure you can follow the recipe: Formulate a Lagrangian function that incorporates the constraints, through introducing dual variables. Minimize the Lagrangian function to solve for the primal variables. Put the primal variables into the Lagrangian and express it in terms of the dual variables. Maximize the Lagrangian with respect to the dual variables. Recover the solution (for the primal variables) from the dual variables.

30 Outline (Geometric Understanding of SVM): 1 Administration 2 Review of last lecture 3 Geometric Understanding of SVM 4 Boosting

31 Intuition: where to put the decision boundary? (Geometric Understanding of SVM)
Consider the binary classification problem in the following figure. We have assumed, for convenience, that the training dataset is separable: there is a decision boundary that separates the two classes perfectly. [Figure: several candidate decision boundaries H.]
There are infinitely many ways of placing the decision boundary H: w^T φ(x) + b = 0! Our intuition, however, is to put the decision boundary in the middle of the two classes as much as possible. In other words, we want the decision boundary to be as far from every point as possible, as long as it classifies every point correctly.

32 Distances (Geometric Understanding of SVM)
The distance from a point φ(x) to the decision boundary is
d_H(φ(x)) = |w^T φ(x) + b| / ||w||_2
(We derived this in the recitation/quiz 0; please re-verify it as a take-home exercise.)
We can remove the absolute value by exploiting the fact that the decision boundary classifies every point in the training dataset correctly: namely, (w^T φ(x) + b) and x's label y have the same sign. The distance is now
d_H(φ(x)) = y [w^T φ(x) + b] / ||w||_2
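
A one-line numeric version of this distance formula (a sketch with assumed names; Phi stacks the feature vectors φ(x_n) as rows):

```python
import numpy as np

def distances_to_boundary(w, b, Phi, y):
    """Signed distances y_n (w^T phi(x_n) + b) / ||w||_2 for each training point.
    Positive values mean the point is on the correct side of the boundary."""
    return y * (Phi @ w + b) / np.linalg.norm(w)

# the (geometric) margin defined on the next slide is the smallest of these distances:
# margin = distances_to_boundary(w, b, Phi, y).min()
```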

33 Maximizing the margin (Geometric Understanding of SVM)
Margin: the margin is defined as the smallest distance over all the training points:
margin = min_n y_n [w^T φ(x_n) + b] / ||w||_2

34 Maximizing the margin (Geometric Understanding of SVM)
margin = min_n y_n [w^T φ(x_n) + b] / ||w||_2
Since we are interested in finding a (w, b) that puts all points as far as possible from the decision boundary, we maximize the margin:
max_{w,b} min_n y_n [w^T φ(x_n) + b] / ||w||_2 = max_{w,b} (1/||w||_2) min_n y_n [w^T φ(x_n) + b]
[Figure: the hyperplane H: w^T φ(x) + b = 0 and a point's distance |w^T φ(x) + b| / ||w||_2.]

35 Rescaled margin (Geometric Understanding of SVM)
Since the margin does not change if we scale (w, b) by a constant factor c (as w^T φ(x) + b = 0 and (cw)^T φ(x) + (cb) = 0 are the same decision boundary), we fix the scale by forcing
min_n y_n [w^T φ(x_n) + b] = 1

36 Rescaled margin (Geometric Understanding of SVM)
With the scale fixed so that min_n y_n [w^T φ(x_n) + b] = 1, our margin becomes
margin = 1 / ||w||_2
Precisely, the closest point to the decision boundary is at exactly that distance.
[Figure: the hyperplanes w^T φ(x) + b = 1, w^T φ(x) + b = 0, and w^T φ(x) + b = −1, with the closest points at distance 1/||w||_2.]

37 Primal formulation (Geometric Understanding of SVM)
Combining everything we have, for a separable training dataset we aim to
max_{w,b} 1/||w||_2 such that y_n [w^T φ(x_n) + b] ≥ 1, ∀n
This is equivalent to
min_{w,b} (1/2) ||w||_2^2 s.t. y_n [w^T φ(x_n) + b] ≥ 1, ∀n
This starts to look like our first formulation for SVMs. For this geometric intuition, SVM is called a max-margin (or large-margin) classifier. The constraints are called large-margin constraints.

38 SVM for non-separable data (Geometric Understanding of SVM)
Suppose there are training data points that cannot be classified correctly no matter how we choose w. For those data points, y_n [w^T φ(x_n) + b] ≤ 0 for any w. Thus, the previous constraint y_n [w^T φ(x_n) + b] ≥ 1, ∀n, is no longer feasible.

39 SVM for non-separable data (Geometric Understanding of SVM)
To deal with this issue, we introduce slack variables ξ_n to help:
y_n [w^T φ(x_n) + b] ≥ 1 − ξ_n, where we also require ξ_n ≥ 0.
Note that even for hard points that cannot be classified correctly, the slack variable will be able to make them satisfy the above constraint (we can keep increasing ξ_n until the inequality is met).

40 SVM primal formulation with slack variables (Geometric Understanding of SVM)
We obviously do not want ξ_n to go to infinity, so we balance their sizes by penalizing them toward zero as much as possible:
min_{w,b,{ξ_n}} (1/2) ||w||_2^2 + C Σ_n ξ_n
s.t. y_n [w^T φ(x_n) + b] ≥ 1 − ξ_n, ξ_n ≥ 0, ∀n
where C is our trade-off (hyper)parameter. This is precisely the primal formulation we first got for SVM.

41 Meaning of support vectors in SVMs (Geometric Understanding of SVM)
Complementary slackness: at the optimum we must have
α_n {1 − ξ_n − y_n [w^T φ(x_n) + b]} = 0, ∀n
That means α_n = 0 for some n. Additionally, our optimal solution is given by
w = Σ_n α_n y_n φ(x_n) = Σ_{n: α_n > 0} α_n y_n φ(x_n)
In words, our solution is determined only by those training samples whose corresponding α_n is strictly positive. Those samples are called support vectors. Non-support vectors, whose α_n = 0, can be removed from the training dataset; this removal will not affect the optimal solution (i.e., after the removal, if we construct another SVM classifier on the reduced dataset, the optimal solution is the same as the one on the original dataset).

42 Who are the support vectors? Case analysis (Geometric Understanding of SVM)
Since α_n > 0 for a support vector, we have 1 − ξ_n − y_n [w^T φ(x_n) + b] = 0. We distinguish the following cases:
ξ_n = 0: this implies y_n [w^T φ(x_n) + b] = 1; these are points exactly 1/||w||_2 away from the decision boundary.
0 < ξ_n < 1: these are points that are classified correctly but do not satisfy the large-margin constraint; they have smaller distances to the decision boundary.
ξ_n > 1: these are points that are misclassified.
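
A small, self-contained illustration of this case analysis using scikit-learn's SVC (the toy data and tolerances are ours, not from the slides); the slack of each point is recovered as ξ_n = max(0, 1 − y_n f(x_n)):

```python
import numpy as np
from sklearn.svm import SVC

# toy 2-class data with labels in {-1, +1}
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (20, 2)), rng.normal(+1, 1, (20, 2))])
y = np.array([-1] * 20 + [+1] * 20)

clf = SVC(kernel="linear", C=1.0).fit(X, y)
f = clf.decision_function(X)               # w^T phi(x_n) + b
xi = np.maximum(0.0, 1.0 - y * f)          # slack xi_n = max(0, 1 - y_n f(x_n))

sv = clf.support_                           # indices of points with alpha_n > 0
print("on margin    (xi = 0):    ", sv[np.isclose(xi[sv], 0.0)])
print("inside margin (0 < xi < 1):", sv[(xi[sv] > 0) & (xi[sv] < 1)])
print("misclassified (xi > 1):    ", sv[xi[sv] > 1])
```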

43 Visualization of how training data points are categorized (Geometric Understanding of SVM)
[Figure: training points relative to the hyperplanes w^T φ(x) + b = 1, 0, −1, labeled ξ = 0, ξ < 1, and ξ > 1.] Support vectors are those circled with the orange line.

44 Outline (Boosting): 1 Administration 2 Review of last lecture 3 Geometric Understanding of SVM 4 Boosting (AdaBoost; Derivation of AdaBoost; Boosting as learning nonlinear basis)

45 Boosting
High-level idea: combine a lot of classifiers. Sequentially construct those classifiers one at a time. Use weak classifiers to arrive at complex decision boundaries.
Our plan: describe the AdaBoost algorithm, then derive the algorithm.

46 How the Boosting algorithm works (Boosting: AdaBoost)
Given: N samples {x_n, y_n}, where y_n ∈ {+1, −1}, and some way of constructing weak (or base) classifiers.
Initialize weights w_1(n) = 1/N for every training sample.
For t = 1 to T:
1 Train a weak classifier h_t(x) based on the current weights w_t(n), by minimizing the weighted classification error
ɛ_t = Σ_n w_t(n) I[y_n ≠ h_t(x_n)]
2 Calculate the weight for combining classifiers
β_t = (1/2) log[(1 − ɛ_t)/ɛ_t]
3 Update the weights
w_{t+1}(n) ∝ w_t(n) e^{−β_t y_n h_t(x_n)}
and normalize them such that Σ_n w_{t+1}(n) = 1.
Output the final classifier h(x) = sign[Σ_{t=1}^T β_t h_t(x)]
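
A direct transcription of this pseudocode into Python (a sketch; make_weak_learner is a placeholder for any weighted weak learner, such as the decision-stump learner sketched near the end of this section):

```python
import numpy as np

def adaboost(X, y, make_weak_learner, T):
    """y in {-1, +1}. make_weak_learner(X, y, w) must return a function h with
    h(X) in {-1, +1}, trained to minimize the weighted error under weights w.
    Assumes every round achieves 0 < epsilon_t < 1."""
    N = len(y)
    w = np.full(N, 1.0 / N)                   # w_1(n) = 1/N
    classifiers, betas = [], []

    for t in range(T):
        h = make_weak_learner(X, y, w)        # step 1: train weighted weak learner
        pred = h(X)
        eps = np.sum(w * (pred != y))         # weighted error epsilon_t
        beta = 0.5 * np.log((1 - eps) / eps)  # step 2: combination weight beta_t
        w = w * np.exp(-beta * y * pred)      # step 3: reweight ...
        w = w / w.sum()                       # ... and normalize
        classifiers.append(h)
        betas.append(beta)

    def final_classifier(X_new):              # h(x) = sign[sum_t beta_t h_t(x)]
        scores = sum(b * h(X_new) for b, h in zip(betas, classifiers))
        return np.sign(scores)
    return final_classifier
```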

47 Example (Boosting: AdaBoost)
10 data points. Base classifier h(·): either horizontal or vertical lines (these are called decision stumps, classifying data based on a single attribute). The data points are clearly not linearly separable. In the beginning, all data points have equal weights (shown by the size of the data markers + or −). [Figure: initial weight distribution D_1.]

48 Round 1: t = 1 (Boosting: AdaBoost)
[Figure: weak classifier h_1 and updated weights D_2.]
3 points misclassified (circled): ɛ_1 = 0.3, β_1 = (1/2) log(0.7/0.3) ≈ 0.42.
Weights recomputed; the 3 misclassified data points receive larger weights.

49 Round 2: t = 2 (Boosting: AdaBoost)
[Figure: weak classifier h_2 and updated weights D_3.]
3 points misclassified (circled): ɛ_2 = 0.21, β_2 = (1/2) log(0.79/0.21) ≈ 0.66.
Note that ɛ_2 < 0.3, as those 3 data points have weights less than 1/10.
Weights recomputed; the 3 misclassified data points receive larger weights. Note that the data points classified correctly in round t = 1 receive much smaller weights, as they have been consistently classified correctly.

50 Round 3: t = 3 (Boosting: AdaBoost)
[Figure: weak classifier h_3.]
3 points misclassified (circled): ɛ_3 = 0.14, β_3 = (1/2) log(0.86/0.14) ≈ 0.91.
Note that some previously correctly classified data points are now misclassified; however, we might be lucky on this, as if they have been consistently classified correctly, then this round's mistake is probably not a big deal.

51 Final classifier: combining the 3 classifiers (Boosting: AdaBoost)
H_final(x) = sign[β_1 h_1(x) + β_2 h_2(x) + β_3 h_3(x)]
[Figure: the combined decision boundary.] All data points are now classified correctly!

52 Why does AdaBoost work? (Boosting: Derivation of AdaBoost)
We will show next that it minimizes a loss function related to the classification error.
Classification loss: suppose we want to have a classifier
h(x) = sign[f(x)] = 1 if f(x) > 0, and −1 if f(x) < 0
Our loss function is thus
ℓ(h(x), y) = 0 if y f(x) > 0, and 1 if y f(x) < 0
Namely, the function f(x) and the target label y should have the same sign to avoid a loss of 1.

53 Exponential loss (Boosting: Derivation of AdaBoost)
The previous loss function ℓ(h(x), y) is difficult to optimize. Instead, we will use the following loss function:
ℓ_exp(h(x), y) = e^{−y f(x)}
This loss function serves as a surrogate for the true loss function ℓ(h(x), y). However, ℓ_exp(h(x), y) is easier to handle numerically as it is differentiable; see the contrast between the red and black curves. [Figure: ℓ(h(x), y) and the exponential loss plotted against y f(x).]
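
A tiny numeric comparison of the 0/1 loss and its exponential surrogate at a few margin values y f(x) (illustration only; note the surrogate upper-bounds the 0/1 loss everywhere):

```python
import numpy as np

yf = np.linspace(-2, 2, 9)                  # values of the margin y * f(x)
zero_one = (yf < 0).astype(float)           # 0/1 classification loss
exp_loss = np.exp(-yf)                      # exponential surrogate
for m, l01, le in zip(yf, zero_one, exp_loss):
    print(f"y f(x) = {m:+.1f}   0/1 loss = {l01:.0f}   exp loss = {le:.2f}")
```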

54 Choosing the t-th classifier (Boosting: Derivation of AdaBoost)
Suppose we have built a classifier f_{t−1}(x), and we want to improve it by adding a new classifier h_t(x) to construct a new classifier
f(x) = f_{t−1}(x) + β_t h_t(x)
How can we optimally choose the new classifier h_t(x) and the combination coefficient β_t? The strategy we will use is to greedily minimize the exponential loss function:
(h_t*(x), β_t*) = argmin_{(h_t(x), β_t)} Σ_n e^{−y_n f(x_n)} = argmin_{(h_t(x), β_t)} Σ_n e^{−y_n [f_{t−1}(x_n) + β_t h_t(x_n)]}

55 Choosing the t-th classifier (Boosting: Derivation of AdaBoost)
(h_t*(x), β_t*) = argmin_{(h_t(x), β_t)} Σ_n e^{−y_n [f_{t−1}(x_n) + β_t h_t(x_n)]}
= argmin_{(h_t(x), β_t)} Σ_n w_t(n) e^{−y_n β_t h_t(x_n)}
where we have used w_t(n) as a shorthand for e^{−y_n f_{t−1}(x_n)}.

56 The new classifier (Boosting: Derivation of AdaBoost)
We decompose the weighted loss function (weighted by w_t(n)) into two parts:
Σ_n w_t(n) e^{−y_n β_t h_t(x_n)} = Σ_n w_t(n) e^{β_t} I[y_n ≠ h_t(x_n)] + Σ_n w_t(n) e^{−β_t} I[y_n = h_t(x_n)]

57 The new classifier (Boosting: Derivation of AdaBoost)
Σ_n w_t(n) e^{−y_n β_t h_t(x_n)} = Σ_n w_t(n) e^{β_t} I[y_n ≠ h_t(x_n)] + Σ_n w_t(n) e^{−β_t} I[y_n = h_t(x_n)]
= Σ_n w_t(n) e^{β_t} I[y_n ≠ h_t(x_n)] + Σ_n w_t(n) e^{−β_t} (1 − I[y_n ≠ h_t(x_n)])

58 The new classifier (Boosting: Derivation of AdaBoost)
Σ_n w_t(n) e^{−y_n β_t h_t(x_n)} = Σ_n w_t(n) e^{β_t} I[y_n ≠ h_t(x_n)] + Σ_n w_t(n) e^{−β_t} I[y_n = h_t(x_n)]
= Σ_n w_t(n) e^{β_t} I[y_n ≠ h_t(x_n)] + Σ_n w_t(n) e^{−β_t} (1 − I[y_n ≠ h_t(x_n)])
= (e^{β_t} − e^{−β_t}) Σ_n w_t(n) I[y_n ≠ h_t(x_n)] + e^{−β_t} Σ_n w_t(n)
We have used the following properties to derive the above: y_n h_t(x_n) is either 1 or −1, as h_t(x_n) is the output of a binary classifier; the indicator function I[y_n = h_t(x_n)] is binary, either 0 or 1, thus it equals 1 − I[y_n ≠ h_t(x_n)].

59 Minimizing the weighted classification error (Boosting: Derivation of AdaBoost)
Thus, we would want to choose h_t(x) such that
h_t(x) = argmin_{h_t(x)} ɛ_t = Σ_n w_t(n) I[y_n ≠ h_t(x_n)]
Namely, the weighted classification error is minimized, which is precisely "train a weak classifier based on the current weights w_t(n)" on the slide "How the Boosting algorithm works".

60 Minimizing the weighted classification error (Boosting: Derivation of AdaBoost)
h_t(x) = argmin_{h_t(x)} ɛ_t = Σ_n w_t(n) I[y_n ≠ h_t(x_n)]
Remarks: We can safely assume that w_t(n) is normalized so that Σ_n w_t(n) = 1. This normalization requirement can easily be maintained by changing the weights to
w_t(n) ← w_t(n) / Σ_{n'} w_t(n')
This change does not affect how we choose h_t(x), as the term Σ_{n'} w_t(n') is a constant with respect to n.

61 How to choose β_t? (Boosting: Derivation of AdaBoost)
We will select β_t to minimize
(e^{β_t} − e^{−β_t}) Σ_n w_t(n) I[y_n ≠ h_t(x_n)] + e^{−β_t} Σ_n w_t(n)
We assume Σ_n w_t(n) is now 1 (cf. the previous slide's remarks). We take the derivative with respect to β_t, set it to zero, and derive the optimal β_t as
β_t = (1/2) log[(1 − ɛ_t)/ɛ_t]
which is precisely what is on the slide "How the Boosting algorithm works". Take-home exercise: verify the solution.
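
One way to check the take-home exercise numerically (a sketch, assuming Σ_n w_t(n) = 1 and a particular ɛ_t): grid-search over β and compare with the closed form.

```python
import numpy as np

# Numerical check that beta_t = 0.5 * log((1 - eps)/eps) minimizes
#   g(beta) = (e^beta - e^-beta) * eps + e^-beta,  assuming sum_n w_t(n) = 1.
eps = 0.3
betas = np.linspace(0.01, 2.0, 2000)
g = (np.exp(betas) - np.exp(-betas)) * eps + np.exp(-betas)
beta_numeric = betas[np.argmin(g)]
beta_closed_form = 0.5 * np.log((1 - eps) / eps)
print(beta_numeric, beta_closed_form)       # both are approximately 0.42
```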

62 Updating the weights (Boosting: Derivation of AdaBoost)
Now that we have improved our classifier into f(x) = f_{t−1}(x) + β_t h_t(x), for the next iteration we will need to compute the weights under the above classifier, which are
w_{t+1}(n) = e^{−y_n f(x_n)} = e^{−y_n [f_{t−1}(x_n) + β_t h_t(x_n)]}

63 Updating the weights (Boosting: Derivation of AdaBoost)
w_{t+1}(n) = e^{−y_n f(x_n)} = e^{−y_n [f_{t−1}(x_n) + β_t h_t(x_n)]} = w_t(n) e^{−y_n β_t h_t(x_n)}

64 Updating the weights (Boosting: Derivation of AdaBoost)
w_{t+1}(n) = w_t(n) e^{−y_n β_t h_t(x_n)} = w_t(n) e^{β_t} if y_n ≠ h_t(x_n), and w_t(n) e^{−β_t} if y_n = h_t(x_n)

65 Updating the weights (Boosting: Derivation of AdaBoost)
w_{t+1}(n) = w_t(n) e^{β_t} if y_n ≠ h_t(x_n), and w_t(n) e^{−β_t} if y_n = h_t(x_n)
Remarks: the key point is that a misclassified data point will get its weight increased, while a correctly classified data point will get its weight decreased.

66 Remarks (Boosting: Derivation of AdaBoost)
Note that the AdaBoost algorithm itself never specifies how we would get h_t(x), as long as it minimizes the weighted classification error
ɛ_t = Σ_n w_t(n) I[y_n ≠ h_t(x_n)]
In this respect, the AdaBoost algorithm is a meta-algorithm and can be used with any classifier for which we can do the above.

67 Remarks (Boosting: Derivation of AdaBoost)
Ex. How do we choose the decision stump classifier given the weights at the second round of the following distribution? [Figure: weak classifier h_1 and weight distribution D_2.]
We can simply enumerate all possible ways of putting vertical and horizontal lines to separate the data points into two classes, and find the one with the smallest weighted classification error! A stump learner along these lines is sketched below.
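
A brute-force decision-stump learner along the lines described above (a sketch; fit_stump and its threshold convention are ours). It has the (X, y, w) signature expected by the AdaBoost sketch shown earlier, so it can be passed in as make_weak_learner:

```python
import numpy as np

def fit_stump(X, y, w):
    """Exhaustively search axis-aligned thresholds (decision stumps) and return the
    stump minimizing the weighted error sum_n w(n) I[y_n != h(x_n)]."""
    N, d = X.shape
    best = (np.inf, None)                        # (weighted error, stump parameters)
    for j in range(d):                           # which attribute: vertical vs. horizontal split
        for thr in np.unique(X[:, j]):
            for sign in (+1, -1):                # which side of the threshold predicts +1
                pred = np.where(X[:, j] > thr, sign, -sign)
                err = np.sum(w * (pred != y))
                if err < best[0]:
                    best = (err, (j, thr, sign))
    j, thr, sign = best[1]
    return lambda Xnew: np.where(Xnew[:, j] > thr, sign, -sign)
```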

68 Nonlinear basis learned by boosting (Boosting: Boosting as learning nonlinear basis)
Two-stage process: Get sign[f_1(x)], sign[f_2(x)], ... Combine them into a linear classification model:
y = sign{Σ_t β_t sign[f_t(x)]}
Equivalently, each stage learns a nonlinear basis φ_t(x) = sign[f_t(x)]. This relates to neural networks, which we might discuss next week.
