CSCI567 Machine Learning (Fall 2014)


1 CSCI567 Machine Learning (Fall 2014), Drs. Sha & Liu, October 14, 2014

2 Outline: 1 Administration 2 Review of last lecture 3 Geometric Understanding of SVM 4 Boosting

3 Quiz 1 (Administration): Tuesday Oct :20pm, THH 301. Please arrive on time.

4 Lecture schedule (Administration): Oct 15 or Oct 16, the TA will lead a session on Pragmatics. Oct 20 or Oct 21, no lecture in the scheduled time; prepare for the quiz.

5 Homework #1 (Administration): grade statistics (Count 136; Min; Max; Average; Median; STD).

6 Outline (Review of last lecture): 1 Administration 2 Review of last lecture (Support vector machines; Basic Lagrange duality theory) 3 Geometric Understanding of SVM 4 Boosting

7 Support vector machines: Hinge loss (Review of last lecture)
Assuming the label y ∈ {−1, 1} and the decision rule is h(x) = sign(f(x)) with f(x) = w^T φ(x) + b,
ℓ_hinge(f(x), y) = 0 if y f(x) ≥ 1, and 1 − y f(x) otherwise; or equivalently ℓ_hinge(f(x), y) = max(0, 1 − y f(x)).
Intuition: penalize more the more a point is incorrectly classified (the left branch up to the kink point).
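
A quick numeric check of the hinge loss definition above (a sketch; the function name hinge_loss is ours, labels assumed in {−1, +1}):

```python
import numpy as np

def hinge_loss(f_x, y):
    # elementwise hinge loss max(0, 1 - y f(x)), labels y in {-1, +1}
    return np.maximum(0.0, 1.0 - y * f_x)

# points with y f(x) >= 1 incur no loss; the loss grows linearly past the kink
print(hinge_loss(np.array([2.0, 0.5, -1.0]), np.array([1, 1, 1])))  # [0.  0.5 2. ]
```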

8 Primal formulation of support vector machines (SVM) (Review of last lecture)
Minimizing the total hinge loss on all the training data:
min_{w,b} Σ_n max(0, 1 − y_n [w^T φ(x_n) + b]) + (λ/2) ||w||_2^2
Equivalently,
min_{w,b,{ξ_n}} C Σ_n ξ_n + (1/2) ||w||_2^2
s.t. 1 − y_n [w^T φ(x_n) + b] ≤ ξ_n, ξ_n ≥ 0, ∀n
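
For concreteness, a minimal subgradient-descent sketch of the first (unconstrained) primal objective above; the slides do not prescribe an optimizer, and the function name, step size, and epoch count are arbitrary choices:

```python
import numpy as np

def primal_svm_subgradient(Phi, y, lam=0.1, lr=0.01, epochs=200):
    """Minimize  sum_n max(0, 1 - y_n (w^T phi_n + b)) + lam/2 ||w||^2  by subgradient
    descent. Phi: (N, d) matrix whose rows are phi(x_n); y in {-1, +1}. A rough sketch."""
    N, d = Phi.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        margins = y * (Phi @ w + b)
        active = margins < 1                        # points contributing hinge loss
        grad_w = lam * w - (y[active, None] * Phi[active]).sum(axis=0)
        grad_b = -y[active].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b
```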

9 Basic Lagrange duality theory (Review of last lecture)
Key concepts you should know: What do primal and dual mean? How does SVM exploit the dual formulation, which results in using kernel functions for nonlinear classification? What do support vectors mean?
Our roadmap: We will tell you what the dual looks like. We will show you how it is derived.

10 Derivation of the dual (Review of last lecture)
We will derive the dual formulation, as the process will reveal some interesting and important properties of SVM. In particular, why is it called "support vector"?
Recipe: Formulate a Lagrangian function that incorporates the constraints, through introducing dual variables. Minimize the Lagrangian function to solve for the primal variables. Put the primal variables into the Lagrangian and express it in terms of the dual variables. Maximize the Lagrangian with respect to the dual variables. Recover the solution (for the primal variables) from the dual variables.

11 Deriving the dual for SVM (Review of last lecture)
Lagrangian:
L(w, b, {ξ_n}, {α_n}, {λ_n}) = C Σ_n ξ_n + (1/2) ||w||_2^2 − Σ_n λ_n ξ_n + Σ_n α_n {1 − y_n [w^T φ(x_n) + b] − ξ_n}
under the constraints that α_n ≥ 0 and λ_n ≥ 0.

12 Minimizing the Lagrangian (Review of last lecture)
Taking derivatives with respect to the primal variables:
∂L/∂w = w − Σ_n y_n α_n φ(x_n) = 0

13 Minimizing the Lagrangian (Review of last lecture)
Taking derivatives with respect to the primal variables:
∂L/∂w = w − Σ_n y_n α_n φ(x_n) = 0
∂L/∂b = Σ_n α_n y_n = 0

14 Minimizing the Lagrangian (Review of last lecture)
Taking derivatives with respect to the primal variables:
∂L/∂w = w − Σ_n y_n α_n φ(x_n) = 0
∂L/∂b = Σ_n α_n y_n = 0
∂L/∂ξ_n = C − λ_n − α_n = 0

15 Minimizing the Lagrangian (Review of last lecture)
Taking derivatives with respect to the primal variables:
∂L/∂w = w − Σ_n y_n α_n φ(x_n) = 0
∂L/∂b = Σ_n α_n y_n = 0
∂L/∂ξ_n = C − λ_n − α_n = 0
This gives rise to equations linking the primal and dual variables, as well as new constraints on the dual variables:
w = Σ_n α_n y_n φ(x_n)
Σ_n α_n y_n = 0
C − λ_n − α_n = 0

16 Rewrite the Lagrangian in terms of dual variables (Review of last lecture)
Substitute the solution to the primal back into the Lagrangian:
g({α_n}, {λ_n}) = L(w, b, {ξ_n}, {α_n}, {λ_n})

17 Rewrite the Lagrangian in terms of dual variables (Review of last lecture)
Substitute the solution to the primal back into the Lagrangian:
g({α_n}, {λ_n}) = L(w, b, {ξ_n}, {α_n}, {λ_n})
= Σ_n (C − α_n − λ_n) ξ_n + (1/2) ||Σ_n y_n α_n φ(x_n)||_2^2 + Σ_n α_n − (Σ_n α_n y_n) b − Σ_n α_n y_n (Σ_m y_m α_m φ(x_m))^T φ(x_n)

18 Rewrite the Lagrangian in terms of dual variables (Review of last lecture)
g({α_n}, {λ_n}) = L(w, b, {ξ_n}, {α_n}, {λ_n})
= Σ_n (C − α_n − λ_n) ξ_n + (1/2) ||Σ_n y_n α_n φ(x_n)||_2^2 + Σ_n α_n − (Σ_n α_n y_n) b − Σ_n α_n y_n (Σ_m y_m α_m φ(x_m))^T φ(x_n)
= Σ_n α_n + (1/2) ||Σ_n y_n α_n φ(x_n)||_2^2 − Σ_{m,n} α_m α_n y_m y_n φ(x_m)^T φ(x_n)

19 Rewrite the Lagrangian in terms of dual variables (Review of last lecture)
g({α_n}, {λ_n}) = L(w, b, {ξ_n}, {α_n}, {λ_n})
= Σ_n α_n + (1/2) ||Σ_n y_n α_n φ(x_n)||_2^2 − Σ_{m,n} α_m α_n y_m y_n φ(x_m)^T φ(x_n)
= Σ_n α_n − (1/2) Σ_{m,n} α_m α_n y_m y_n φ(x_m)^T φ(x_n)
Several terms vanish because of the constraints Σ_n α_n y_n = 0 and C − λ_n − α_n = 0.

20 The dual problem (Review of last lecture)
Maximizing the dual under the constraints:
max_{{α_n},{λ_n}} g({α_n}, {λ_n}) = Σ_n α_n − (1/2) Σ_{m,n} y_m y_n α_m α_n k(x_m, x_n)
s.t. α_n ≥ 0, ∀n
Σ_n α_n y_n = 0
C − λ_n − α_n = 0, λ_n ≥ 0, ∀n

21 The dual problem (Review of last lecture)
max_{{α_n},{λ_n}} g({α_n}, {λ_n}) = Σ_n α_n − (1/2) Σ_{m,n} y_m y_n α_m α_n k(x_m, x_n)
s.t. α_n ≥ 0, Σ_n α_n y_n = 0, C − λ_n − α_n = 0, λ_n ≥ 0, ∀n
We can simplify: the objective function does not depend on λ_n, so we can convert the equality constraint involving λ_n into an inequality constraint α_n ≤ C:
C − λ_n − α_n = 0, λ_n ≥ 0  ⟺  λ_n = C − α_n ≥ 0  ⟺  α_n ≤ C

22 Final form (Review of last lecture)
max_α Σ_n α_n − (1/2) Σ_{m,n} y_m y_n α_m α_n φ(x_m)^T φ(x_n)
s.t. 0 ≤ α_n ≤ C, ∀n
Σ_n α_n y_n = 0
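
As an illustration (not from the slides), this dual can be handed to a generic constrained solver; below is a rough sketch using scipy's SLSQP with box constraints 0 ≤ α_n ≤ C and the equality constraint Σ_n α_n y_n = 0. Real SVM libraries use specialized solvers (e.g. SMO), and the helper names here are ours:

```python
import numpy as np
from scipy.optimize import minimize

def svm_dual(X, y, C=1.0, kernel=lambda A, B: A @ B.T):
    """Solve the SVM dual with a generic solver (sketch only, not production code).
    X: (N, d) features, y: (N,) labels in {-1, +1}; the default kernel is linear."""
    N = len(y)
    K = kernel(X, X)                       # Gram matrix, K[m, n] = k(x_m, x_n)
    YKY = np.outer(y, y) * K               # y_m y_n k(x_m, x_n)

    def neg_dual(alpha):                   # minimize the negative dual objective
        return -(alpha.sum() - 0.5 * alpha @ YKY @ alpha)

    def neg_dual_grad(alpha):
        return -(np.ones(N) - YKY @ alpha)

    constraints = [{"type": "eq", "fun": lambda a: a @ y}]   # sum_n alpha_n y_n = 0
    bounds = [(0.0, C)] * N                                   # 0 <= alpha_n <= C
    res = minimize(neg_dual, np.zeros(N), jac=neg_dual_grad,
                   bounds=bounds, constraints=constraints, method="SLSQP")
    return res.x                           # the dual variables alpha
```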

23 Recover the solution (Review of last lecture)
The primal variable w is identified as w = Σ_n α_n y_n φ(x_n).

24 Recover the solution (Review of last lecture)
The primal variable w is identified as w = Σ_n α_n y_n φ(x_n). To identify b, we need something else.

25 Complementary slackness and support vectors (Review of last lecture)
At the optimal solution to both primal and dual, the following must be satisfied for every inequality constraint (these are called the KKT conditions):
λ_n ξ_n = 0
α_n {1 − ξ_n − y_n [w^T φ(x_n) + b]} = 0

26 Complementary slackness and support vectors (Review of last lecture)
λ_n ξ_n = 0
α_n {1 − ξ_n − y_n [w^T φ(x_n) + b]} = 0
From the first condition, if α_n < C, then λ_n = C − α_n > 0, which forces ξ_n = 0. Thus, in conjunction with the second condition, we know that if C > α_n > 0, then
1 − y_n [w^T φ(x_n) + b] = 0  ⟹  b = y_n − w^T φ(x_n)
as y_n ∈ {−1, 1}. Training samples whose α_n > 0 are called support vectors. (We will discuss their geometric interpretation later.)

27 Dual formulation and kernelized SVM (Review of last lecture)
The dual is also a convex quadratic program:
max_α Σ_n α_n − (1/2) Σ_{m,n} y_m y_n α_m α_n φ(x_m)^T φ(x_n)
s.t. 0 ≤ α_n ≤ C, ∀n; Σ_n α_n y_n = 0
We replace the inner products φ(x_m)^T φ(x_n) with a kernel function:
max_α Σ_n α_n − (1/2) Σ_{m,n} y_m y_n α_m α_n k(x_m, x_n)
s.t. 0 ≤ α_n ≤ C, ∀n; Σ_n α_n y_n = 0

28 Recovering the solution to the primal formulation (Review of last lecture)
Weights: w = Σ_n y_n α_n φ(x_n), a linear combination of the input features!
Bias: b = y_n − w^T φ(x_n) = y_n − Σ_m y_m α_m k(x_m, x_n), for any n with C > α_n > 0.
Making a prediction on a test point x:
h(x) = sign(w^T φ(x) + b) = sign(Σ_n y_n α_n k(x_n, x) + b)
Again, to make a prediction, it suffices to know the kernel function.
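
Continuing the sketch started after the final dual form (hypothetical helper names): recovering b from a support vector with 0 < α_n < C and predicting with the kernel only.

```python
import numpy as np

def recover_b_and_predict(alpha, X, y, X_test, C, kernel):
    """Given dual variables alpha (e.g. from the QP sketch earlier), recover b from a
    margin support vector (0 < alpha_n < C) and predict on test points."""
    K = kernel(X, X)
    # pick any n strictly inside the box; the tolerance 1e-6 is an arbitrary choice
    on_margin = np.where((alpha > 1e-6) & (alpha < C - 1e-6))[0]
    n = on_margin[0]
    b = y[n] - np.sum(y * alpha * K[:, n])     # b = y_n - sum_m y_m alpha_m k(x_m, x_n)

    K_test = kernel(X, X_test)                 # K_test[m, j] = k(x_m, x_test_j)
    scores = (y * alpha) @ K_test + b          # sum_n y_n alpha_n k(x_n, x) + b
    return b, np.sign(scores)
```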

29 Things you need to know about deriving the dual (Review of last lecture)
Make sure you can follow the recipe: Formulate a Lagrangian function that incorporates the constraints, through introducing dual variables. Minimize the Lagrangian function to solve for the primal variables. Put the primal variables into the Lagrangian and express it in terms of the dual variables. Maximize the Lagrangian with respect to the dual variables. Recover the solution (for the primal variables) from the dual variables.

30 Outline (Geometric Understanding of SVM): 1 Administration 2 Review of last lecture 3 Geometric Understanding of SVM 4 Boosting

31 Intuition: where to put the decision boundary? (Geometric Understanding of SVM)
Consider the binary classification problem in the following figure. We have assumed, for convenience, that the training dataset is separable: there is a decision boundary that separates the two classes perfectly. [Figure: several candidate decision boundaries H.]
There are infinitely many ways of placing the decision boundary H: w^T φ(x) + b = 0! Our intuition, however, is to put the decision boundary in the middle of the two classes as much as possible. In other words, we want the decision boundary to be as far from every point as possible, as long as it classifies every point correctly.

32 Distances (Geometric Understanding of SVM)
The distance from a point φ(x) to the decision boundary is
d_H(φ(x)) = |w^T φ(x) + b| / ||w||_2
(We derived this in the recitation/quiz 0; please re-verify it as a take-home exercise.)
We can remove the absolute value by exploiting the fact that the decision boundary classifies every point in the training dataset correctly: namely, (w^T φ(x) + b) and x's label y have the same sign. The distance is now
d_H(φ(x)) = y [w^T φ(x) + b] / ||w||_2
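
A one-line numeric version of this distance formula (a sketch with assumed names; Phi stacks the feature vectors φ(x_n) as rows):

```python
import numpy as np

def distances_to_boundary(w, b, Phi, y):
    """Signed distances y_n (w^T phi(x_n) + b) / ||w||_2 for each training point.
    Positive values mean the point is on the correct side of the boundary."""
    return y * (Phi @ w + b) / np.linalg.norm(w)

# the (geometric) margin defined on the next slide is the smallest of these distances:
# margin = distances_to_boundary(w, b, Phi, y).min()
```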

33 Maximizing the margin (Geometric Understanding of SVM)
Margin: the margin is defined as the smallest distance over all the training points:
margin = min_n y_n [w^T φ(x_n) + b] / ||w||_2

34 Maximizing the margin (Geometric Understanding of SVM)
margin = min_n y_n [w^T φ(x_n) + b] / ||w||_2
Since we are interested in finding a (w, b) that puts all points as far as possible from the decision boundary, we maximize the margin:
max_{w,b} min_n y_n [w^T φ(x_n) + b] / ||w||_2 = max_{w,b} (1/||w||_2) min_n y_n [w^T φ(x_n) + b]
[Figure: the hyperplane H: w^T φ(x) + b = 0 and a point's distance |w^T φ(x) + b| / ||w||_2.]

35 Rescaled margin (Geometric Understanding of SVM)
Since the margin does not change if we scale (w, b) by a constant factor c (as w^T φ(x) + b = 0 and (cw)^T φ(x) + (cb) = 0 are the same decision boundary), we fix the scale by forcing
min_n y_n [w^T φ(x_n) + b] = 1

36 Rescaled margin (Geometric Understanding of SVM)
With the scale fixed so that min_n y_n [w^T φ(x_n) + b] = 1, our margin becomes
margin = 1 / ||w||_2
Precisely, the closest point to the decision boundary is at exactly that distance.
[Figure: the hyperplanes w^T φ(x) + b = 1, w^T φ(x) + b = 0, and w^T φ(x) + b = −1, with the closest points at distance 1/||w||_2.]

37 Primal formulation (Geometric Understanding of SVM)
Combining everything we have, for a separable training dataset we aim to
max_{w,b} 1/||w||_2 such that y_n [w^T φ(x_n) + b] ≥ 1, ∀n
This is equivalent to
min_{w,b} (1/2) ||w||_2^2 s.t. y_n [w^T φ(x_n) + b] ≥ 1, ∀n
This starts to look like our first formulation for SVMs. For this geometric intuition, SVM is called a max-margin (or large-margin) classifier. The constraints are called large-margin constraints.

38 SVM for non-separable data (Geometric Understanding of SVM)
Suppose there are training data points that cannot be classified correctly no matter how we choose w. For those data points, y_n [w^T φ(x_n) + b] ≤ 0 for any w. Thus, the previous constraint y_n [w^T φ(x_n) + b] ≥ 1, ∀n, is no longer feasible.

39 SVM for non-separable data (Geometric Understanding of SVM)
To deal with this issue, we introduce slack variables ξ_n to help:
y_n [w^T φ(x_n) + b] ≥ 1 − ξ_n, where we also require ξ_n ≥ 0.
Note that even for hard points that cannot be classified correctly, the slack variable will be able to make them satisfy the above constraint (we can keep increasing ξ_n until the inequality is met).

40 SVM primal formulation with slack variables (Geometric Understanding of SVM)
We obviously do not want ξ_n to go to infinity, so we balance their sizes by penalizing them toward zero as much as possible:
min_{w,b,{ξ_n}} (1/2) ||w||_2^2 + C Σ_n ξ_n
s.t. y_n [w^T φ(x_n) + b] ≥ 1 − ξ_n, ξ_n ≥ 0, ∀n
where C is our trade-off (hyper)parameter. This is precisely the primal formulation we first got for SVM.

41 Meaning of support vectors in SVMs (Geometric Understanding of SVM)
Complementary slackness: at the optimum we must have
α_n {1 − ξ_n − y_n [w^T φ(x_n) + b]} = 0, ∀n
That means α_n = 0 for some n. Additionally, our optimal solution is given by
w = Σ_n α_n y_n φ(x_n) = Σ_{n: α_n > 0} α_n y_n φ(x_n)
In words, our solution is determined only by those training samples whose corresponding α_n is strictly positive. Those samples are called support vectors. Non-support vectors, whose α_n = 0, can be removed from the training dataset; this removal will not affect the optimal solution (i.e., after the removal, if we construct another SVM classifier on the reduced dataset, the optimal solution is the same as the one on the original dataset).

42 Who are the support vectors? Case analysis (Geometric Understanding of SVM)
Since α_n > 0 for a support vector, we have 1 − ξ_n − y_n [w^T φ(x_n) + b] = 0. We distinguish the following cases:
ξ_n = 0: this implies y_n [w^T φ(x_n) + b] = 1; these are points exactly 1/||w||_2 away from the decision boundary.
0 < ξ_n < 1: these are points that are classified correctly but do not satisfy the large-margin constraint; they have smaller distances to the decision boundary.
ξ_n > 1: these are points that are misclassified.
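
A small, self-contained illustration of this case analysis using scikit-learn's SVC (the toy data and tolerances are ours, not from the slides); the slack of each point is recovered as ξ_n = max(0, 1 − y_n f(x_n)):

```python
import numpy as np
from sklearn.svm import SVC

# toy 2-class data with labels in {-1, +1}
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (20, 2)), rng.normal(+1, 1, (20, 2))])
y = np.array([-1] * 20 + [+1] * 20)

clf = SVC(kernel="linear", C=1.0).fit(X, y)
f = clf.decision_function(X)               # w^T phi(x_n) + b
xi = np.maximum(0.0, 1.0 - y * f)          # slack xi_n = max(0, 1 - y_n f(x_n))

sv = clf.support_                           # indices of points with alpha_n > 0
print("on margin    (xi = 0):    ", sv[np.isclose(xi[sv], 0.0)])
print("inside margin (0 < xi < 1):", sv[(xi[sv] > 0) & (xi[sv] < 1)])
print("misclassified (xi > 1):    ", sv[xi[sv] > 1])
```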

43 Visualization of how training data points are categorized (Geometric Understanding of SVM)
[Figure: training points relative to the hyperplanes w^T φ(x) + b = 1, 0, −1, labeled ξ = 0, ξ < 1, and ξ > 1.] Support vectors are those circled with the orange line.

44 Outline (Boosting): 1 Administration 2 Review of last lecture 3 Geometric Understanding of SVM 4 Boosting (AdaBoost; Derivation of AdaBoost; Boosting as learning nonlinear basis)

45 Boosting
High-level idea: combine a lot of classifiers. Sequentially construct those classifiers one at a time. Use weak classifiers to arrive at complex decision boundaries.
Our plan: describe the AdaBoost algorithm, then derive the algorithm.

46 How the Boosting algorithm works (Boosting: AdaBoost)
Given: N samples {x_n, y_n}, where y_n ∈ {+1, −1}, and some way of constructing weak (or base) classifiers.
Initialize weights w_1(n) = 1/N for every training sample.
For t = 1 to T:
1 Train a weak classifier h_t(x) based on the current weights w_t(n), by minimizing the weighted classification error
ɛ_t = Σ_n w_t(n) I[y_n ≠ h_t(x_n)]
2 Calculate the weight for combining classifiers
β_t = (1/2) log[(1 − ɛ_t)/ɛ_t]
3 Update the weights
w_{t+1}(n) ∝ w_t(n) e^{−β_t y_n h_t(x_n)}
and normalize them such that Σ_n w_{t+1}(n) = 1.
Output the final classifier h(x) = sign[Σ_{t=1}^T β_t h_t(x)]
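
A direct transcription of this pseudocode into Python (a sketch; make_weak_learner is a placeholder for any weighted weak learner, such as the decision-stump learner sketched near the end of this section):

```python
import numpy as np

def adaboost(X, y, make_weak_learner, T):
    """y in {-1, +1}. make_weak_learner(X, y, w) must return a function h with
    h(X) in {-1, +1}, trained to minimize the weighted error under weights w.
    Assumes every round achieves 0 < epsilon_t < 1."""
    N = len(y)
    w = np.full(N, 1.0 / N)                   # w_1(n) = 1/N
    classifiers, betas = [], []

    for t in range(T):
        h = make_weak_learner(X, y, w)        # step 1: train weighted weak learner
        pred = h(X)
        eps = np.sum(w * (pred != y))         # weighted error epsilon_t
        beta = 0.5 * np.log((1 - eps) / eps)  # step 2: combination weight beta_t
        w = w * np.exp(-beta * y * pred)      # step 3: reweight ...
        w = w / w.sum()                       # ... and normalize
        classifiers.append(h)
        betas.append(beta)

    def final_classifier(X_new):              # h(x) = sign[sum_t beta_t h_t(x)]
        scores = sum(b * h(X_new) for b, h in zip(betas, classifiers))
        return np.sign(scores)
    return final_classifier
```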

47 Example (Boosting: AdaBoost)
10 data points. Base classifier h(·): either horizontal or vertical lines (these are called decision stumps, classifying data based on a single attribute). The data points are clearly not linearly separable. In the beginning, all data points have equal weights (shown by the size of the data markers + or −). [Figure: initial weight distribution D_1.]

48 Round 1: t = 1 (Boosting: AdaBoost)
[Figure: weak classifier h_1 and updated weights D_2.]
3 points misclassified (circled): ɛ_1 = 0.3, β_1 = (1/2) log(0.7/0.3) ≈ 0.42.
Weights recomputed; the 3 misclassified data points receive larger weights.

49 Round 2: t = 2 (Boosting: AdaBoost)
[Figure: weak classifier h_2 and updated weights D_3.]
3 points misclassified (circled): ɛ_2 = 0.21, β_2 = (1/2) log(0.79/0.21) ≈ 0.66.
Note that ɛ_2 < 0.3, as those 3 data points have weights less than 1/10.
Weights recomputed; the 3 misclassified data points receive larger weights. Note that the data points classified correctly in round t = 1 receive much smaller weights, as they have been consistently classified correctly.

50 Round 3: t = 3 (Boosting: AdaBoost)
[Figure: weak classifier h_3.]
3 points misclassified (circled): ɛ_3 = 0.14, β_3 = (1/2) log(0.86/0.14) ≈ 0.91.
Note that some previously correctly classified data points are now misclassified; however, we might be lucky on this, as if they have been consistently classified correctly, then this round's mistake is probably not a big deal.

51 Final classifier: combining the 3 classifiers (Boosting: AdaBoost)
H_final(x) = sign[β_1 h_1(x) + β_2 h_2(x) + β_3 h_3(x)]
[Figure: the combined decision boundary.] All data points are now classified correctly!

52 Why does AdaBoost work? (Boosting: Derivation of AdaBoost)
We will show next that it minimizes a loss function related to the classification error.
Classification loss: suppose we want to have a classifier
h(x) = sign[f(x)] = 1 if f(x) > 0, and −1 if f(x) < 0
Our loss function is thus
ℓ(h(x), y) = 0 if y f(x) > 0, and 1 if y f(x) < 0
Namely, the function f(x) and the target label y should have the same sign to avoid a loss of 1.

53 Exponential loss (Boosting: Derivation of AdaBoost)
The previous loss function ℓ(h(x), y) is difficult to optimize. Instead, we will use the following loss function:
ℓ_exp(h(x), y) = e^{−y f(x)}
This loss function serves as a surrogate for the true loss function ℓ(h(x), y). However, ℓ_exp(h(x), y) is easier to handle numerically as it is differentiable; see the contrast between the red and black curves. [Figure: ℓ(h(x), y) and the exponential loss plotted against y f(x).]
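
A tiny numeric comparison of the 0/1 loss and its exponential surrogate at a few margin values y f(x) (illustration only; note the surrogate upper-bounds the 0/1 loss everywhere):

```python
import numpy as np

yf = np.linspace(-2, 2, 9)                  # values of the margin y * f(x)
zero_one = (yf < 0).astype(float)           # 0/1 classification loss
exp_loss = np.exp(-yf)                      # exponential surrogate
for m, l01, le in zip(yf, zero_one, exp_loss):
    print(f"y f(x) = {m:+.1f}   0/1 loss = {l01:.0f}   exp loss = {le:.2f}")
```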

54 Choosing the t-th classifier (Boosting: Derivation of AdaBoost)
Suppose we have built a classifier f_{t−1}(x), and we want to improve it by adding a new classifier h_t(x) to construct a new classifier
f(x) = f_{t−1}(x) + β_t h_t(x)
How can we optimally choose the new classifier h_t(x) and the combination coefficient β_t? The strategy we will use is to greedily minimize the exponential loss function:
(h_t*(x), β_t*) = argmin_{(h_t(x), β_t)} Σ_n e^{−y_n f(x_n)} = argmin_{(h_t(x), β_t)} Σ_n e^{−y_n [f_{t−1}(x_n) + β_t h_t(x_n)]}

55 Choosing the t-th classifier (Boosting: Derivation of AdaBoost)
(h_t*(x), β_t*) = argmin_{(h_t(x), β_t)} Σ_n e^{−y_n [f_{t−1}(x_n) + β_t h_t(x_n)]}
= argmin_{(h_t(x), β_t)} Σ_n w_t(n) e^{−y_n β_t h_t(x_n)}
where we have used w_t(n) as a shorthand for e^{−y_n f_{t−1}(x_n)}.

56 The new classifier (Boosting: Derivation of AdaBoost)
We decompose the weighted loss function (weighted by w_t(n)) into two parts:
Σ_n w_t(n) e^{−y_n β_t h_t(x_n)} = Σ_n w_t(n) e^{β_t} I[y_n ≠ h_t(x_n)] + Σ_n w_t(n) e^{−β_t} I[y_n = h_t(x_n)]

57 The new classifier (Boosting: Derivation of AdaBoost)
Σ_n w_t(n) e^{−y_n β_t h_t(x_n)} = Σ_n w_t(n) e^{β_t} I[y_n ≠ h_t(x_n)] + Σ_n w_t(n) e^{−β_t} I[y_n = h_t(x_n)]
= Σ_n w_t(n) e^{β_t} I[y_n ≠ h_t(x_n)] + Σ_n w_t(n) e^{−β_t} (1 − I[y_n ≠ h_t(x_n)])

58 The new classifier (Boosting: Derivation of AdaBoost)
Σ_n w_t(n) e^{−y_n β_t h_t(x_n)} = Σ_n w_t(n) e^{β_t} I[y_n ≠ h_t(x_n)] + Σ_n w_t(n) e^{−β_t} I[y_n = h_t(x_n)]
= Σ_n w_t(n) e^{β_t} I[y_n ≠ h_t(x_n)] + Σ_n w_t(n) e^{−β_t} (1 − I[y_n ≠ h_t(x_n)])
= (e^{β_t} − e^{−β_t}) Σ_n w_t(n) I[y_n ≠ h_t(x_n)] + e^{−β_t} Σ_n w_t(n)
We have used the following properties to derive the above: y_n h_t(x_n) is either 1 or −1, as h_t(x_n) is the output of a binary classifier; the indicator function I[y_n = h_t(x_n)] is binary, either 0 or 1, thus it equals 1 − I[y_n ≠ h_t(x_n)].

59 Minimizing the weighted classification error (Boosting: Derivation of AdaBoost)
Thus, we would want to choose h_t(x) such that
h_t(x) = argmin_{h_t(x)} ɛ_t = Σ_n w_t(n) I[y_n ≠ h_t(x_n)]
Namely, the weighted classification error is minimized, which is precisely "train a weak classifier based on the current weights w_t(n)" on the slide "How the Boosting algorithm works".

60 Minimizing the weighted classification error (Boosting: Derivation of AdaBoost)
h_t(x) = argmin_{h_t(x)} ɛ_t = Σ_n w_t(n) I[y_n ≠ h_t(x_n)]
Remarks: We can safely assume that w_t(n) is normalized so that Σ_n w_t(n) = 1. This normalization requirement can easily be maintained by changing the weights to
w_t(n) ← w_t(n) / Σ_{n'} w_t(n')
This change does not affect how we choose h_t(x), as the term Σ_{n'} w_t(n') is a constant with respect to n.

61 How to choose β_t? (Boosting: Derivation of AdaBoost)
We will select β_t to minimize
(e^{β_t} − e^{−β_t}) Σ_n w_t(n) I[y_n ≠ h_t(x_n)] + e^{−β_t} Σ_n w_t(n)
We assume Σ_n w_t(n) is now 1 (cf. the previous slide's remarks). We take the derivative with respect to β_t, set it to zero, and derive the optimal β_t as
β_t = (1/2) log[(1 − ɛ_t)/ɛ_t]
which is precisely what is on the slide "How the Boosting algorithm works". Take-home exercise: verify the solution.
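
One way to check the take-home exercise numerically (a sketch, assuming Σ_n w_t(n) = 1 and a particular ɛ_t): grid-search over β and compare with the closed form.

```python
import numpy as np

# Numerical check that beta_t = 0.5 * log((1 - eps)/eps) minimizes
#   g(beta) = (e^beta - e^-beta) * eps + e^-beta,  assuming sum_n w_t(n) = 1.
eps = 0.3
betas = np.linspace(0.01, 2.0, 2000)
g = (np.exp(betas) - np.exp(-betas)) * eps + np.exp(-betas)
beta_numeric = betas[np.argmin(g)]
beta_closed_form = 0.5 * np.log((1 - eps) / eps)
print(beta_numeric, beta_closed_form)       # both are approximately 0.42
```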

62 Updating the weights (Boosting: Derivation of AdaBoost)
Now that we have improved our classifier into f(x) = f_{t−1}(x) + β_t h_t(x), for the next iteration we will need to compute the weights under the above classifier, which are
w_{t+1}(n) = e^{−y_n f(x_n)} = e^{−y_n [f_{t−1}(x_n) + β_t h_t(x_n)]}

63 Updating the weights (Boosting: Derivation of AdaBoost)
w_{t+1}(n) = e^{−y_n f(x_n)} = e^{−y_n [f_{t−1}(x_n) + β_t h_t(x_n)]} = w_t(n) e^{−y_n β_t h_t(x_n)}

64 Updating the weights (Boosting: Derivation of AdaBoost)
w_{t+1}(n) = w_t(n) e^{−y_n β_t h_t(x_n)} = w_t(n) e^{β_t} if y_n ≠ h_t(x_n), and w_t(n) e^{−β_t} if y_n = h_t(x_n)

65 Updating the weights (Boosting: Derivation of AdaBoost)
w_{t+1}(n) = w_t(n) e^{β_t} if y_n ≠ h_t(x_n), and w_t(n) e^{−β_t} if y_n = h_t(x_n)
Remarks: the key point is that a misclassified data point will get its weight increased, while a correctly classified data point will get its weight decreased.

66 Remarks (Boosting: Derivation of AdaBoost)
Note that the AdaBoost algorithm itself never specifies how we would get h_t(x), as long as it minimizes the weighted classification error
ɛ_t = Σ_n w_t(n) I[y_n ≠ h_t(x_n)]
In this respect, the AdaBoost algorithm is a meta-algorithm and can be used with any classifier for which we can do the above.

67 Remarks (Boosting: Derivation of AdaBoost)
Ex. How do we choose the decision stump classifier given the weights at the second round of the following distribution? [Figure: weak classifier h_1 and weight distribution D_2.]
We can simply enumerate all possible ways of putting vertical and horizontal lines to separate the data points into two classes, and find the one with the smallest weighted classification error! A stump learner along these lines is sketched below.
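
A brute-force decision-stump learner along the lines described above (a sketch; fit_stump and its threshold convention are ours). It has the (X, y, w) signature expected by the AdaBoost sketch shown earlier, so it can be passed in as make_weak_learner:

```python
import numpy as np

def fit_stump(X, y, w):
    """Exhaustively search axis-aligned thresholds (decision stumps) and return the
    stump minimizing the weighted error sum_n w(n) I[y_n != h(x_n)]."""
    N, d = X.shape
    best = (np.inf, None)                        # (weighted error, stump parameters)
    for j in range(d):                           # which attribute: vertical vs. horizontal split
        for thr in np.unique(X[:, j]):
            for sign in (+1, -1):                # which side of the threshold predicts +1
                pred = np.where(X[:, j] > thr, sign, -sign)
                err = np.sum(w * (pred != y))
                if err < best[0]:
                    best = (err, (j, thr, sign))
    j, thr, sign = best[1]
    return lambda Xnew: np.where(Xnew[:, j] > thr, sign, -sign)
```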

68 Nonlinear basis learned by boosting (Boosting: Boosting as learning nonlinear basis)
Two-stage process: Get sign[f_1(x)], sign[f_2(x)], ... Combine them into a linear classification model:
y = sign{Σ_t β_t sign[f_t(x)]}
Equivalently, each stage learns a nonlinear basis φ_t(x) = sign[f_t(x)]. This relates to neural networks, which we might discuss next week.
