Boosting. Professor Ameet Talwalkar. CS260 Machine Learning Algorithms, March 1.


1 Boosting. Professor Ameet Talwalkar. CS260 Machine Learning Algorithms, March 1.

2 Outline: 1 Administration. 2 Review of last lecture. 3 Boosting.

3 Grade Policy and Final Exam. Final grades: HWs (30%), midterm (30%), and final exam (40%) of the final grade. The final grades will be curved so that the median grade is either a B or B+ (I have not yet decided). I may increase the weight of the final for students who do much better on the final than on the midterm (I don't have a strict policy in place, though). Final exam: cumulative, but with more emphasis on new material. On the last day of class (Wednesday, 3/15).

4 Outline: 1 Administration. 2 Review of last lecture. 3 Boosting.

5 Support vector machines (SVM). Primal formulation: min_{w,b,{ξ_n}} (1/2)‖w‖²_2 + C Σ_n ξ_n, s.t. y_n [w^T φ(x_n) + b] ≥ 1 − ξ_n and ξ_n ≥ 0, ∀n. Two equivalent interpretations.

6 Support vector machines (SVM). Primal formulation: min_{w,b,{ξ_n}} (1/2)‖w‖²_2 + C Σ_n ξ_n, s.t. y_n [w^T φ(x_n) + b] ≥ 1 − ξ_n and ξ_n ≥ 0, ∀n. Two equivalent interpretations: Geometric: maximizing the (soft) margin. Optimization: minimizing hinge loss with L2 regularization.
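The second interpretation can be made concrete in a few lines of code. Below is a minimal sketch (my own illustration, not from the slides) of full-batch subgradient descent on the regularized hinge-loss objective above, with an identity feature map; the function name, step size, and iteration count are illustrative choices.

```python
import numpy as np

def svm_primal_subgradient(X, y, C=1.0, lr=1e-3, iters=500):
    """Subgradient descent on (1/2)||w||^2 + C * sum_n max(0, 1 - y_n (w^T x_n + b)).
    X: (N, d) features, y: (N,) labels in {-1, +1}; phi(x) = x for simplicity."""
    N, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(iters):
        margins = y * (X @ w + b)
        viol = margins < 1                       # points inside the margin or misclassified
        grad_w = w - C * (y[viol, None] * X[viol]).sum(axis=0)
        grad_b = -C * y[viol].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b
```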

7 Hinge Loss. Upper bound for the 0/1 loss function (black line). Convex surrogate to the 0/1 loss.

8 Hinge Loss. Upper bound for the 0/1 loss function (black line). Convex surrogate to the 0/1 loss, though others exist as well. Hinge loss is less sensitive to outliers than exponential (or logistic) loss. Logistic loss has a natural probabilistic interpretation. We can optimize exponential loss efficiently in a greedy manner (AdaBoost).
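To make the comparison among these surrogates concrete, here is a small sketch (not part of the slides) that evaluates the 0/1 loss and the three surrogates as functions of the margin y f(x); the base-2 scaling of the logistic loss is one common convention that keeps it an upper bound on the 0/1 loss.

```python
import numpy as np

# Losses as functions of the margin z = y * f(x); all surrogates upper-bound the 0/1 loss.
z = np.linspace(-2.0, 2.0, 9)
zero_one    = (z < 0).astype(float)
hinge       = np.maximum(0.0, 1.0 - z)           # SVM
exponential = np.exp(-z)                         # AdaBoost
logistic    = np.log2(1.0 + np.exp(-z))          # logistic regression (base-2 scaling)
print(np.column_stack([z, zero_one, hinge, exponential, logistic]))
```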

9 Constrained Optimization. Primal formulation: min_{w,b,{ξ_n}} (1/2)‖w‖²_2 + C Σ_n ξ_n, s.t. y_n [w^T φ(x_n) + b] ≥ 1 − ξ_n and ξ_n ≥ 0, ∀n. When working with constrained optimization problems with inequality constraints, we can write down primal and dual problems.

10 Constrained Optimization. Primal formulation: min_{w,b,{ξ_n}} (1/2)‖w‖²_2 + C Σ_n ξ_n, s.t. y_n [w^T φ(x_n) + b] ≥ 1 − ξ_n and ξ_n ≥ 0, ∀n. When working with constrained optimization problems with inequality constraints, we can write down primal and dual problems. The dual solution is always a lower bound on the primal solution (weak duality). The duality gap equals 0 under certain conditions (strong duality), and in such cases we can solve either the primal or the dual problem. Strong duality holds for the SVM problem, and in particular the KKT conditions are necessary and sufficient for the optimal solution.

11 Dual formulation of SVM and Kernel SVMs. max_α Σ_n α_n − (1/2) Σ_{m,n} y_m y_n α_m α_n φ(x_m)^T φ(x_n), s.t. 0 ≤ α_n ≤ C, Σ_n α_n y_n = 0. The dual problem is also a convex quadratic program.

12 Dual formulation of SVM and Kernel SVMs. max_α Σ_n α_n − (1/2) Σ_{m,n} y_m y_n α_m α_n φ(x_m)^T φ(x_n), s.t. 0 ≤ α_n ≤ C, Σ_n α_n y_n = 0. The dual problem is also a convex quadratic program involving N dual variables α_n. Kernel SVM:

13 Dual formulation of SVM and Kernel SVMs. max_α Σ_n α_n − (1/2) Σ_{m,n} y_m y_n α_m α_n φ(x_m)^T φ(x_n), s.t. 0 ≤ α_n ≤ C, Σ_n α_n y_n = 0. The dual problem is also a convex quadratic program involving N dual variables α_n. Kernel SVM: Replace the inner products φ(x_m)^T φ(x_n) with a kernel function k(x_m, x_n) when solving the dual problem. Show that we can recover primal predictions at test time without relying explicitly on φ(·).

14 Recovering the primal solution using dual variables. Why do we care? Using solely a kernel function, we can solve the dual optimization problem and make predictions at test time! Prediction only depends on support vectors, i.e., points with α_n > 0!

15 Recovering the primal solution using dual variables. Why do we care? Using solely a kernel function, we can solve the dual optimization problem and make predictions at test time! Prediction only depends on support vectors, i.e., points with α_n > 0! Weights: w = Σ_n y_n α_n φ(x_n), a linear combination of the input features. Offset: b = [y_n − w^T φ(x_n)]

16 Recovering the primal solution using dual variables. Why do we care? Using solely a kernel function, we can solve the dual optimization problem and make predictions at test time! Prediction only depends on support vectors, i.e., points with α_n > 0! Weights: w = Σ_n y_n α_n φ(x_n), a linear combination of the input features. Offset: b = [y_n − w^T φ(x_n)] = [y_n − Σ_m y_m α_m k(x_m, x_n)], for any n with C > α_n > 0. Prediction on a test point x: h(x) = sign(w^T φ(x) + b) = sign(Σ_n y_n α_n k(x_n, x) + b).
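As a concrete illustration of these formulas, here is a minimal sketch (not from the slides) of kernel SVM prediction using only the dual variables; the RBF kernel and the numerical tolerance are illustrative assumptions.

```python
import numpy as np

def rbf_kernel(a, b, gamma=1.0):
    """Gaussian RBF kernel k(a, b) = exp(-gamma * ||a - b||^2)."""
    return np.exp(-gamma * np.sum((a - b) ** 2))

def svm_predict(x, X, y, alpha, C, kernel=rbf_kernel, tol=1e-8):
    """h(x) = sign(sum_n y_n alpha_n k(x_n, x) + b), with b recovered from any
    margin support vector (0 < alpha_n < C). Only support vectors contribute."""
    sv = np.where(alpha > tol)[0]                          # support vectors
    decision = sum(y[n] * alpha[n] * kernel(X[n], x) for n in sv)
    m = np.where((alpha > tol) & (alpha < C - tol))[0][0]  # one margin support vector
    b = y[m] - sum(y[n] * alpha[n] * kernel(X[n], X[m]) for n in sv)
    return np.sign(decision + b)
```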

17 Deriving the dual for SVM. Primal SVM: min_{w,b,{ξ_n}} (1/2)‖w‖²_2 + C Σ_n ξ_n, s.t. y_n [w^T φ(x_n) + b] ≥ 1 − ξ_n, ξ_n ≥ 0, ∀n.

18 Deriving the dual for SVM. Primal SVM: min_{w,b,{ξ_n}} (1/2)‖w‖²_2 + C Σ_n ξ_n, s.t. y_n [w^T φ(x_n) + b] ≥ 1 − ξ_n, ξ_n ≥ 0, ∀n. Lagrangian: L(w, b, {ξ_n}, {α_n}, {λ_n}) = C Σ_n ξ_n + (1/2)‖w‖²_2 − Σ_n λ_n ξ_n + Σ_n α_n {1 − y_n [w^T φ(x_n) + b] − ξ_n}, under the constraints that α_n ≥ 0 and λ_n ≥ 0.

19 Minimizing the Lagrangian. Taking derivatives with respect to the primal variables: ∂L/∂w = w − Σ_n y_n α_n φ(x_n) = 0; ∂L/∂b = −Σ_n α_n y_n = 0; ∂L/∂ξ_n = C − λ_n − α_n = 0.

20 Minimizing the Lagrangian. Taking derivatives with respect to the primal variables: ∂L/∂w = w − Σ_n y_n α_n φ(x_n) = 0; ∂L/∂b = −Σ_n α_n y_n = 0; ∂L/∂ξ_n = C − λ_n − α_n = 0. These equations link the primal variables and the dual variables and provide new constraints on the dual variables: w = Σ_n y_n α_n φ(x_n); Σ_n α_n y_n = 0; C − λ_n − α_n = 0.

21 Rearrange the Lagrangian and incorporate these constraints. Recall: L(·) = C Σ_n ξ_n + (1/2)‖w‖²_2 − Σ_n λ_n ξ_n + Σ_n α_n {1 − y_n [w^T φ(x_n) + b] − ξ_n}, where α_n ≥ 0 and λ_n ≥ 0. Constraints from the partial derivatives: Σ_n α_n y_n = 0 and C − λ_n − α_n = 0. g({α_n}, {λ_n}) = L(w, b, {ξ_n}, {α_n}, {λ_n})

22 Rearrange the Lagrangian and incorporate these constraints. Recall: L(·) = C Σ_n ξ_n + (1/2)‖w‖²_2 − Σ_n λ_n ξ_n + Σ_n α_n {1 − y_n [w^T φ(x_n) + b] − ξ_n}, where α_n ≥ 0 and λ_n ≥ 0. Constraints from the partial derivatives: Σ_n α_n y_n = 0 and C − λ_n − α_n = 0. g({α_n}, {λ_n}) = L(w, b, {ξ_n}, {α_n}, {λ_n}) = Σ_n (C − α_n − λ_n) ξ_n

23 Rearrange the Lagrangian and incorporate these constraints. Recall: L(·) = C Σ_n ξ_n + (1/2)‖w‖²_2 − Σ_n λ_n ξ_n + Σ_n α_n {1 − y_n [w^T φ(x_n) + b] − ξ_n}, where α_n ≥ 0 and λ_n ≥ 0. Constraints from the partial derivatives: Σ_n α_n y_n = 0 and C − λ_n − α_n = 0. g({α_n}, {λ_n}) = L(w, b, {ξ_n}, {α_n}, {λ_n}) = Σ_n (C − α_n − λ_n) ξ_n + (1/2)‖Σ_n y_n α_n φ(x_n)‖²_2

24 Rearrange the Lagrangian and incorporate these constraints. Recall: L(·) = C Σ_n ξ_n + (1/2)‖w‖²_2 − Σ_n λ_n ξ_n + Σ_n α_n {1 − y_n [w^T φ(x_n) + b] − ξ_n}, where α_n ≥ 0 and λ_n ≥ 0. Constraints from the partial derivatives: Σ_n α_n y_n = 0 and C − λ_n − α_n = 0. g({α_n}, {λ_n}) = L(w, b, {ξ_n}, {α_n}, {λ_n}) = Σ_n (C − α_n − λ_n) ξ_n + (1/2)‖Σ_n y_n α_n φ(x_n)‖²_2 + Σ_n α_n − (Σ_n α_n y_n) b

25 Rearrange the Lagrangian and incorporate these constraints. Recall: L(·) = C Σ_n ξ_n + (1/2)‖w‖²_2 − Σ_n λ_n ξ_n + Σ_n α_n {1 − y_n [w^T φ(x_n) + b] − ξ_n}, where α_n ≥ 0 and λ_n ≥ 0. Constraints from the partial derivatives: Σ_n α_n y_n = 0 and C − λ_n − α_n = 0. g({α_n}, {λ_n}) = L(w, b, {ξ_n}, {α_n}, {λ_n}) = Σ_n (C − α_n − λ_n) ξ_n + (1/2)‖Σ_n y_n α_n φ(x_n)‖²_2 + Σ_n α_n − (Σ_n α_n y_n) b − Σ_n α_n y_n (Σ_m y_m α_m φ(x_m))^T φ(x_n)

26 Rearrange the Lagrangian and incorporate these constraints. Recall: L(·) = C Σ_n ξ_n + (1/2)‖w‖²_2 − Σ_n λ_n ξ_n + Σ_n α_n {1 − y_n [w^T φ(x_n) + b] − ξ_n}, where α_n ≥ 0 and λ_n ≥ 0. Constraints from the partial derivatives: Σ_n α_n y_n = 0 and C − λ_n − α_n = 0. g({α_n}, {λ_n}) = L(w, b, {ξ_n}, {α_n}, {λ_n}) = Σ_n (C − α_n − λ_n) ξ_n + (1/2)‖Σ_n y_n α_n φ(x_n)‖²_2 + Σ_n α_n − (Σ_n α_n y_n) b − Σ_n α_n y_n (Σ_m y_m α_m φ(x_m))^T φ(x_n) = Σ_n α_n + (1/2)‖Σ_n y_n α_n φ(x_n)‖²_2 − Σ_{m,n} α_m α_n y_m y_n φ(x_m)^T φ(x_n)

27 Rearrange the Lagrangian and incorporate these constraints. Recall: L(·) = C Σ_n ξ_n + (1/2)‖w‖²_2 − Σ_n λ_n ξ_n + Σ_n α_n {1 − y_n [w^T φ(x_n) + b] − ξ_n}, where α_n ≥ 0 and λ_n ≥ 0. Constraints from the partial derivatives: Σ_n α_n y_n = 0 and C − λ_n − α_n = 0. g({α_n}, {λ_n}) = L(w, b, {ξ_n}, {α_n}, {λ_n}) = Σ_n (C − α_n − λ_n) ξ_n + (1/2)‖Σ_n y_n α_n φ(x_n)‖²_2 + Σ_n α_n − (Σ_n α_n y_n) b − Σ_n α_n y_n (Σ_m y_m α_m φ(x_m))^T φ(x_n) = Σ_n α_n + (1/2)‖Σ_n y_n α_n φ(x_n)‖²_2 − Σ_{m,n} α_m α_n y_m y_n φ(x_m)^T φ(x_n) = Σ_n α_n − (1/2) Σ_{m,n} α_m α_n y_m y_n φ(x_m)^T φ(x_n)

28 The dual problem. Maximizing the dual under the constraints: max_α g({α_n}, {λ_n}) = Σ_n α_n − (1/2) Σ_{m,n} y_m y_n α_m α_n k(x_m, x_n), s.t. α_n ≥ 0, Σ_n α_n y_n = 0, C − λ_n − α_n = 0, λ_n ≥ 0, ∀n.

29 The dual problem. Maximizing the dual under the constraints: max_α g({α_n}, {λ_n}) = Σ_n α_n − (1/2) Σ_{m,n} y_m y_n α_m α_n k(x_m, x_n), s.t. α_n ≥ 0, Σ_n α_n y_n = 0, C − λ_n − α_n = 0, λ_n ≥ 0, ∀n. We can simplify, as the objective function does not depend on λ_n: C − λ_n − α_n = 0, λ_n ≥ 0 ⇒ λ_n = C − α_n ≥ 0

30 The dual problem. Maximizing the dual under the constraints: max_α g({α_n}, {λ_n}) = Σ_n α_n − (1/2) Σ_{m,n} y_m y_n α_m α_n k(x_m, x_n), s.t. α_n ≥ 0, Σ_n α_n y_n = 0, C − λ_n − α_n = 0, λ_n ≥ 0, ∀n. We can simplify, as the objective function does not depend on λ_n: C − λ_n − α_n = 0, λ_n ≥ 0 ⇒ λ_n = C − α_n ≥ 0 ⇒ 0 ≤ α_n ≤ C

31 Simplified Dual. max_α Σ_n α_n − (1/2) Σ_{m,n} y_m y_n α_m α_n φ(x_m)^T φ(x_n), s.t. 0 ≤ α_n ≤ C, Σ_n α_n y_n = 0.
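For small problems, this simplified dual can be handed directly to a generic constrained solver. Below is a minimal sketch (my own illustration, not from the slides) using scipy's SLSQP on a precomputed kernel matrix; a real implementation would use a dedicated QP or SMO solver.

```python
import numpy as np
from scipy.optimize import minimize

def solve_svm_dual(K, y, C):
    """Solve max_a sum_n a_n - 1/2 sum_{m,n} a_m a_n y_m y_n K[m,n]
    subject to 0 <= a_n <= C and sum_n a_n y_n = 0."""
    N = len(y)
    Q = (y[:, None] * y[None, :]) * K
    obj = lambda a: 0.5 * a @ Q @ a - a.sum()          # minimize the negated dual
    grad = lambda a: Q @ a - np.ones(N)
    cons = [{"type": "eq", "fun": lambda a: a @ y, "jac": lambda a: y}]
    res = minimize(obj, np.zeros(N), jac=grad, bounds=[(0.0, C)] * N,
                   constraints=cons, method="SLSQP")
    return res.x                                       # the dual variables alpha_n
```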

32 Outline: 1 Administration. 2 Review of last lecture. 3 Boosting: AdaBoost; Derivation of AdaBoost.

33 Boosting. High-level idea: combine a lot of classifiers. Sequentially construct / identify these classifiers, h_t(·), one at a time. Use weak classifiers to arrive at a complex decision boundary (strong classifier), where β_t is the contribution of each weak classifier.

34 Boosting. High-level idea: combine a lot of classifiers. Sequentially construct / identify these classifiers, h_t(·), one at a time. Use weak classifiers to arrive at a complex decision boundary (strong classifier), where β_t is the contribution of each weak classifier: h[x] = sign[Σ_{t=1}^T β_t h_t(x)]

35 Boosting. High-level idea: combine a lot of classifiers. Sequentially construct / identify these classifiers, h_t(·), one at a time. Use weak classifiers to arrive at a complex decision boundary (strong classifier), where β_t is the contribution of each weak classifier: h[x] = sign[Σ_{t=1}^T β_t h_t(x)]. Our plan: Describe the AdaBoost algorithm. Derive the algorithm.

36 AdaBoost Algorithm. Given: N samples {x_n, y_n}, where y_n ∈ {+1, −1}, and some way of constructing weak (or base) classifiers.

37 AdaBoost Algorithm. Given: N samples {x_n, y_n}, where y_n ∈ {+1, −1}, and some way of constructing weak (or base) classifiers. Initialize weights w_1(n) = 1/N for every training sample.

38 AdaBoost Algorithm. Given: N samples {x_n, y_n}, where y_n ∈ {+1, −1}, and some way of constructing weak (or base) classifiers. Initialize weights w_1(n) = 1/N for every training sample. For t = 1 to T: 1. Train a weak classifier h_t(x) using the current weights w_t(n), by minimizing ε_t = Σ_n w_t(n) I[y_n ≠ h_t(x_n)]

39 AdaBoost Algorithm. Given: N samples {x_n, y_n}, where y_n ∈ {+1, −1}, and some way of constructing weak (or base) classifiers. Initialize weights w_1(n) = 1/N for every training sample. For t = 1 to T: 1. Train a weak classifier h_t(x) using the current weights w_t(n), by minimizing ε_t = Σ_n w_t(n) I[y_n ≠ h_t(x_n)] (the weighted classification error)

40 AdaBoost Algorithm. Given: N samples {x_n, y_n}, where y_n ∈ {+1, −1}, and some way of constructing weak (or base) classifiers. Initialize weights w_1(n) = 1/N for every training sample. For t = 1 to T: 1. Train a weak classifier h_t(x) using the current weights w_t(n), by minimizing ε_t = Σ_n w_t(n) I[y_n ≠ h_t(x_n)] (the weighted classification error) 2. Compute the contribution for this classifier

41 AdaBoost Algorithm. Given: N samples {x_n, y_n}, where y_n ∈ {+1, −1}, and some way of constructing weak (or base) classifiers. Initialize weights w_1(n) = 1/N for every training sample. For t = 1 to T: 1. Train a weak classifier h_t(x) using the current weights w_t(n), by minimizing ε_t = Σ_n w_t(n) I[y_n ≠ h_t(x_n)] (the weighted classification error) 2. Compute the contribution for this classifier: β_t = (1/2) log((1 − ε_t)/ε_t)

42 AdaBoost Algorithm. Given: N samples {x_n, y_n}, where y_n ∈ {+1, −1}, and some way of constructing weak (or base) classifiers. Initialize weights w_1(n) = 1/N for every training sample. For t = 1 to T: 1. Train a weak classifier h_t(x) using the current weights w_t(n), by minimizing ε_t = Σ_n w_t(n) I[y_n ≠ h_t(x_n)] (the weighted classification error) 2. Compute the contribution for this classifier: β_t = (1/2) log((1 − ε_t)/ε_t) 3. Update weights on the training points

43 AdaBoost Algorithm. Given: N samples {x_n, y_n}, where y_n ∈ {+1, −1}, and some way of constructing weak (or base) classifiers. Initialize weights w_1(n) = 1/N for every training sample. For t = 1 to T: 1. Train a weak classifier h_t(x) using the current weights w_t(n), by minimizing ε_t = Σ_n w_t(n) I[y_n ≠ h_t(x_n)] (the weighted classification error) 2. Compute the contribution for this classifier: β_t = (1/2) log((1 − ε_t)/ε_t) 3. Update weights on the training points: w_{t+1}(n) ∝ w_t(n) e^{−β_t y_n h_t(x_n)}

44 AdaBoost Algorithm. Given: N samples {x_n, y_n}, where y_n ∈ {+1, −1}, and some way of constructing weak (or base) classifiers. Initialize weights w_1(n) = 1/N for every training sample. For t = 1 to T: 1. Train a weak classifier h_t(x) using the current weights w_t(n), by minimizing ε_t = Σ_n w_t(n) I[y_n ≠ h_t(x_n)] (the weighted classification error) 2. Compute the contribution for this classifier: β_t = (1/2) log((1 − ε_t)/ε_t) 3. Update weights on the training points: w_{t+1}(n) ∝ w_t(n) e^{−β_t y_n h_t(x_n)} and normalize them such that Σ_n w_{t+1}(n) = 1

45 AdaBoost Algorithm. Given: N samples {x_n, y_n}, where y_n ∈ {+1, −1}, and some way of constructing weak (or base) classifiers. Initialize weights w_1(n) = 1/N for every training sample. For t = 1 to T: 1. Train a weak classifier h_t(x) using the current weights w_t(n), by minimizing ε_t = Σ_n w_t(n) I[y_n ≠ h_t(x_n)] (the weighted classification error) 2. Compute the contribution for this classifier: β_t = (1/2) log((1 − ε_t)/ε_t) 3. Update weights on the training points: w_{t+1}(n) ∝ w_t(n) e^{−β_t y_n h_t(x_n)} and normalize them such that Σ_n w_{t+1}(n) = 1. Output the final classifier: h[x] = sign[Σ_{t=1}^T β_t h_t(x)]
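The algorithm above translates almost line by line into code. Below is a minimal sketch (not from the slides); train_weak is an assumed helper that returns a classifier h with h(X) in {−1, +1} approximately minimizing the weighted error under the given weights.

```python
import numpy as np

def adaboost(X, y, train_weak, T):
    """AdaBoost as described above. X: (N, d) data, y: (N,) labels in {-1, +1}."""
    N = len(y)
    w = np.full(N, 1.0 / N)                      # initialize uniform weights w_1(n) = 1/N
    classifiers, betas = [], []
    for t in range(T):
        h = train_weak(X, y, w)                  # step 1: weak learner on weighted data
        pred = h(X)
        eps = np.sum(w * (pred != y))            # weighted classification error
        eps = np.clip(eps, 1e-12, 1 - 1e-12)     # guard against division by zero (sketch)
        beta = 0.5 * np.log((1.0 - eps) / eps)   # step 2: contribution beta_t
        w = w * np.exp(-beta * y * pred)         # step 3: reweight the training points...
        w /= w.sum()                             # ...and normalize to sum to 1
        classifiers.append(h)
        betas.append(beta)
    def final(Xnew):                             # output: sign of the weighted vote
        return np.sign(sum(b * h(Xnew) for b, h in zip(betas, classifiers)))
    return final
```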

46 Example. 10 data points and 2 features. [Figure: initial distribution D_1] The data points are clearly not linearly separable. In the beginning, all data points have equal weights (the size of the data markers "+" or "−").

47 Example. 10 data points and 2 features. [Figure: initial distribution D_1] The data points are clearly not linearly separable. In the beginning, all data points have equal weights (the size of the data markers "+" or "−"). Base classifier h(·): horizontal or vertical lines ("decision stumps"). Depth-1 decision trees, i.e., classify data based on a single attribute.

48 Round 1: t = 1. [Figure: weak classifier h_1 and the reweighted distribution D_2]

49 Round 1: t = 1. [Figure: weak classifier h_1 and the reweighted distribution D_2] 3 misclassified (with circles): ε_1 = 0.3, β_1 = 0.42. Weights recomputed; the 3 misclassified data points receive larger weights.

50 Round 2: t = 2. [Figure: weak classifier h_2 and the reweighted distribution D_3]

51 Round 2: t = 2. [Figure: weak classifier h_2 and the reweighted distribution D_3] 3 misclassified (with circles): ε_2 = 0.21, β_2 = 0.65. Note that ε_2 < 0.3, as those 3 data points have weights less than 1/10. The 3 misclassified data points get larger weights. Data points classified correctly in both rounds have small weights.

52 Round 3: t = 3. [Figure: weak classifier h_3] ε_3 = 0.14, β_3 = 0.92.

53 Round 3: t = 3. [Figure: weak classifier h_3] 3 misclassified (with circles): ε_3 = 0.14, β_3 = 0.92. Previously correctly classified data points are now misclassified; since they carry small weights, the weighted error is low. What's the intuition?

54 Round 3: t = 3. [Figure: weak classifier h_3] 3 misclassified (with circles): ε_3 = 0.14, β_3 = 0.92. Previously correctly classified data points are now misclassified; since they carry small weights, the weighted error is low. What's the intuition? Since they have been consistently classified correctly, this round's mistake will hopefully not have a huge impact on the overall prediction.

55 Final classifier: combining 3 classifiers. H_final(x) = sign[β_1 h_1(x) + β_2 h_2(x) + β_3 h_3(x)]. All data points are now classified correctly!

56 Why does AdaBoost work? It minimizes a loss function related to classification error. Classification loss: Suppose we want to have a classifier h(x) = sign[f(x)] = {1 if f(x) > 0; −1 if f(x) < 0}. One seemingly natural loss function is the 0-1 loss: ℓ(h(x), y) = {0 if y f(x) > 0; 1 if y f(x) < 0}. Namely, the function f(x) and the target label y should have the same sign to avoid a loss of 1.

57 Surrogate loss. The 0-1 loss function ℓ(h(x), y) is non-convex and difficult to optimize. We can instead use a surrogate loss; what are examples?

58 Surrogate loss. The 0-1 loss function ℓ(h(x), y) is non-convex and difficult to optimize. We can instead use a surrogate loss; what are examples? Exponential loss: ℓ_exp(h(x), y) = e^{−y f(x)}. [Figure: exponential loss ℓ(h(x), y) as a function of y f(x)]

59 Choosing the t-th classifier. Suppose we have a classifier f_{t−1}(x) and want to add a weak learner h_t(x): f(x) = f_{t−1}(x) + β_t h_t(x). Note: h_t(·) outputs 1 or −1, as does sign[f_{t−1}(·)].

60 Choosing the t-th classifier. Suppose we have a classifier f_{t−1}(x) and want to add a weak learner h_t(x): f(x) = f_{t−1}(x) + β_t h_t(x). Note: h_t(·) outputs 1 or −1, as does sign[f_{t−1}(·)]. How can we optimally choose h_t(x) and the combination coefficient β_t? AdaBoost greedily minimizes the exponential loss function! (h_t*(x), β_t*) = arg min_{(h_t(x), β_t)} Σ_n e^{−y_n f(x_n)}

61 Choosing the t-th classifier. Suppose we have a classifier f_{t−1}(x) and want to add a weak learner h_t(x): f(x) = f_{t−1}(x) + β_t h_t(x). Note: h_t(·) outputs 1 or −1, as does sign[f_{t−1}(·)]. How can we optimally choose h_t(x) and the combination coefficient β_t? AdaBoost greedily minimizes the exponential loss function! (h_t*(x), β_t*) = arg min_{(h_t(x), β_t)} Σ_n e^{−y_n f(x_n)} = arg min_{(h_t(x), β_t)} Σ_n e^{−y_n [f_{t−1}(x_n) + β_t h_t(x_n)]}

62 Choosing the t-th classifier. Suppose we have a classifier f_{t−1}(x) and want to add a weak learner h_t(x): f(x) = f_{t−1}(x) + β_t h_t(x). Note: h_t(·) outputs 1 or −1, as does sign[f_{t−1}(·)]. How can we optimally choose h_t(x) and the combination coefficient β_t? AdaBoost greedily minimizes the exponential loss function! (h_t*(x), β_t*) = arg min_{(h_t(x), β_t)} Σ_n e^{−y_n f(x_n)} = arg min_{(h_t(x), β_t)} Σ_n e^{−y_n [f_{t−1}(x_n) + β_t h_t(x_n)]} = arg min_{(h_t(x), β_t)} Σ_n w_t(n) e^{−y_n β_t h_t(x_n)}, where we have used w_t(n) as a shorthand for e^{−y_n f_{t−1}(x_n)}.

63 The new classifier. We can decompose the weighted loss function into two parts: Σ_n w_t(n) e^{−y_n β_t h_t(x_n)} = Σ_n w_t(n) e^{β_t} I[y_n ≠ h_t(x_n)] + Σ_n w_t(n) e^{−β_t} I[y_n = h_t(x_n)]

64 The new classifier. We can decompose the weighted loss function into two parts: Σ_n w_t(n) e^{−y_n β_t h_t(x_n)} = Σ_n w_t(n) e^{β_t} I[y_n ≠ h_t(x_n)] + Σ_n w_t(n) e^{−β_t} I[y_n = h_t(x_n)] = Σ_n w_t(n) e^{β_t} I[y_n ≠ h_t(x_n)] + Σ_n w_t(n) e^{−β_t} (1 − I[y_n ≠ h_t(x_n)])

65 The new classifier. We can decompose the weighted loss function into two parts: Σ_n w_t(n) e^{−y_n β_t h_t(x_n)} = Σ_n w_t(n) e^{β_t} I[y_n ≠ h_t(x_n)] + Σ_n w_t(n) e^{−β_t} I[y_n = h_t(x_n)] = Σ_n w_t(n) e^{β_t} I[y_n ≠ h_t(x_n)] + Σ_n w_t(n) e^{−β_t} (1 − I[y_n ≠ h_t(x_n)]) = (e^{β_t} − e^{−β_t}) Σ_n w_t(n) I[y_n ≠ h_t(x_n)] + e^{−β_t} Σ_n w_t(n). We have used the following properties to derive the above: y_n h_t(x_n) is either 1 or −1, as h_t(x_n) is the output of a binary classifier; the indicator function I[y_n = h_t(x_n)] is either 0 or 1, so it equals 1 − I[y_n ≠ h_t(x_n)].

66 Finding the optimal weak learner. Summary: (h_t*(x), β_t*) = arg min_{(h_t(x), β_t)} Σ_n w_t(n) e^{−y_n β_t h_t(x_n)} = arg min_{(h_t(x), β_t)} (e^{β_t} − e^{−β_t}) Σ_n w_t(n) I[y_n ≠ h_t(x_n)] + e^{−β_t} Σ_n w_t(n). What term(s) must we optimize to choose h_t(x)?

67 Finding the optimal weak learner. Summary: (h_t*(x), β_t*) = arg min_{(h_t(x), β_t)} Σ_n w_t(n) e^{−y_n β_t h_t(x_n)} = arg min_{(h_t(x), β_t)} (e^{β_t} − e^{−β_t}) Σ_n w_t(n) I[y_n ≠ h_t(x_n)] + e^{−β_t} Σ_n w_t(n). What term(s) must we optimize to choose h_t(x)? h_t*(x) = arg min_{h_t(x)} ε_t = Σ_n w_t(n) I[y_n ≠ h_t(x_n)]

68 Finding the optimal weak learner. Summary: (h_t*(x), β_t*) = arg min_{(h_t(x), β_t)} Σ_n w_t(n) e^{−y_n β_t h_t(x_n)} = arg min_{(h_t(x), β_t)} (e^{β_t} − e^{−β_t}) Σ_n w_t(n) I[y_n ≠ h_t(x_n)] + e^{−β_t} Σ_n w_t(n). What term(s) must we optimize to choose h_t(x)? h_t*(x) = arg min_{h_t(x)} ε_t = Σ_n w_t(n) I[y_n ≠ h_t(x_n)]. Minimize the weighted classification error, as noted in step 1 of AdaBoost!

69 How to choose β_t? Summary: (h_t*(x), β_t*) = arg min_{(h_t(x), β_t)} Σ_n w_t(n) e^{−y_n β_t h_t(x_n)} = arg min_{(h_t(x), β_t)} (e^{β_t} − e^{−β_t}) Σ_n w_t(n) I[y_n ≠ h_t(x_n)] + e^{−β_t} Σ_n w_t(n). What term(s) must we optimize?

70 How to choose β_t? Summary: (h_t*(x), β_t*) = arg min_{(h_t(x), β_t)} Σ_n w_t(n) e^{−y_n β_t h_t(x_n)} = arg min_{(h_t(x), β_t)} (e^{β_t} − e^{−β_t}) Σ_n w_t(n) I[y_n ≠ h_t(x_n)] + e^{−β_t} Σ_n w_t(n). What term(s) must we optimize? We need to minimize the entire objective function with respect to β_t!

71 How to choose β_t? Summary: (h_t*(x), β_t*) = arg min_{(h_t(x), β_t)} Σ_n w_t(n) e^{−y_n β_t h_t(x_n)} = arg min_{(h_t(x), β_t)} (e^{β_t} − e^{−β_t}) Σ_n w_t(n) I[y_n ≠ h_t(x_n)] + e^{−β_t} Σ_n w_t(n). What term(s) must we optimize? We need to minimize the entire objective function with respect to β_t! We can do this by taking the derivative with respect to β_t, setting it to zero, and solving for β_t. After some calculation, and using Σ_n w_t(n) = 1, we find: β_t = (1/2) log((1 − ε_t)/ε_t), which is precisely step 2 of AdaBoost! (Exercise: verify the solution.)
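Carrying out the suggested exercise (my own worked steps, not from the slides), with ε_t = Σ_n w_t(n) I[y_n ≠ h_t(x_n)] and Σ_n w_t(n) = 1, the objective becomes a function of β_t alone:

```latex
\begin{aligned}
J(\beta_t) &= \left(e^{\beta_t} - e^{-\beta_t}\right)\epsilon_t + e^{-\beta_t},\\
\frac{dJ}{d\beta_t} &= \left(e^{\beta_t} + e^{-\beta_t}\right)\epsilon_t - e^{-\beta_t} = 0
\;\Longrightarrow\; \left(e^{2\beta_t} + 1\right)\epsilon_t = 1
\;\Longrightarrow\; e^{2\beta_t} = \frac{1-\epsilon_t}{\epsilon_t},\\
\beta_t &= \frac{1}{2}\log\frac{1-\epsilon_t}{\epsilon_t}.
\end{aligned}
```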

72 Updating the weights. Once we find the optimal weak learner, we can update our classifier: f(x) = f_{t−1}(x) + β_t* h_t*(x)

73 Updating the weights. Once we find the optimal weak learner, we can update our classifier: f(x) = f_{t−1}(x) + β_t* h_t*(x). We then need to compute the weights for the above classifier as: w_{t+1}(n) = e^{−y_n f(x_n)} = e^{−y_n [f_{t−1}(x_n) + β_t* h_t*(x_n)]}

74 Updating the weights. Once we find the optimal weak learner, we can update our classifier: f(x) = f_{t−1}(x) + β_t* h_t*(x). We then need to compute the weights for the above classifier as: w_{t+1}(n) = e^{−y_n f(x_n)} = e^{−y_n [f_{t−1}(x_n) + β_t* h_t*(x_n)]} = w_t(n) e^{−y_n β_t* h_t*(x_n)}

75 Updating the weights. Once we find the optimal weak learner, we can update our classifier: f(x) = f_{t−1}(x) + β_t* h_t*(x). We then need to compute the weights for the above classifier as: w_{t+1}(n) = e^{−y_n f(x_n)} = e^{−y_n [f_{t−1}(x_n) + β_t* h_t*(x_n)]} = w_t(n) e^{−y_n β_t* h_t*(x_n)} = {w_t(n) e^{β_t*} if y_n ≠ h_t*(x_n); w_t(n) e^{−β_t*} if y_n = h_t*(x_n)}. Intuition: misclassified data points will get their weights increased, while correctly classified data points will get their weights decreased.

76 Meta-Algorithm. Note that the AdaBoost algorithm itself never specifies how we would get h_t(x), as long as it minimizes the weighted classification error ε_t = Σ_n w_t(n) I[y_n ≠ h_t(x_n)]. In this respect, the AdaBoost algorithm is a meta-algorithm and can be used with any type of classifier.

77 E.g., Decision Stumps. How do we choose the decision stump classifier given the weights at the second round of the following distribution? [Figure: weak classifier h_1 and the reweighted distribution D_2]

78 E.g., Decision Stumps. How do we choose the decision stump classifier given the weights at the second round of the following distribution? [Figure: weak classifier h_1 and the reweighted distribution D_2] We can simply enumerate all possible ways of putting vertical and horizontal lines to separate the data points into two classes and find the one with the smallest weighted classification error! Runtime?

79 E.g., Decision Stumps. How do we choose the decision stump classifier given the weights at the second round of the following distribution? [Figure: weak classifier h_1 and the reweighted distribution D_2] We can simply enumerate all possible ways of putting vertical and horizontal lines to separate the data points into two classes and find the one with the smallest weighted classification error! Runtime? Presort the data by each feature in O(dN log N) time. Evaluate N + 1 thresholds for each feature at each round in O(dN) time. In total, O(dN log N + dNT) time; this efficiency is an attractive quality of boosting!
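Here is a minimal sketch (not from the slides) of a weighted decision stump that can serve as train_weak in the AdaBoost sketch above. For clarity it re-scores every threshold from scratch, which is O(dN^2) per round; the O(dN)-per-round version in the runtime analysis instead sweeps the presorted points once while maintaining running weighted sums.

```python
import numpy as np

def train_stump(X, y, w):
    """Pick the feature j, threshold theta, and sign s minimizing the weighted error
    of the stump h(x) = s * sign(x_j - theta). X: (N, d), y in {-1, +1}, w sums to 1."""
    N, d = X.shape
    best_err, best = np.inf, None
    for j in range(d):
        xs = np.sort(X[:, j])                                    # presort this feature
        thresholds = np.concatenate(([xs[0] - 1.0],
                                     (xs[:-1] + xs[1:]) / 2.0,
                                     [xs[-1] + 1.0]))            # N + 1 candidate cuts
        for theta in thresholds:
            base = np.where(X[:, j] > theta, 1, -1)
            for s in (1, -1):
                err = np.sum(w * (s * base != y))                # weighted error
                if err < best_err:
                    best_err, best = err, (j, theta, s)
    j, theta, s = best
    return lambda Xnew: s * np.where(Xnew[:, j] > theta, 1, -1)
```

With the earlier AdaBoost sketch, it can be invoked as strong = adaboost(X, y, train_stump, T=3).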

80 Interpreting boosting as learning a nonlinear basis. Two-stage process: Get sign[h_1(x)], sign[h_2(x)], ... Combine into a linear classification model: y = sign{Σ_t β_t sign[h_t(x)]} = sign{β^T φ(x)}

81 Interpreting boosting as learning a nonlinear basis. Two-stage process: Get sign[h_1(x)], sign[h_2(x)], ... Combine into a linear classification model: y = sign{Σ_t β_t sign[h_t(x)]} = sign{β^T φ(x)}. In other words, each stage learns a nonlinear basis φ_t(x) = sign[h_t(x)]. This is an alternative way to introduce non-linearity, aside from kernel methods. We could also try to learn the basis functions and the classifier at the same time, as we'll talk about with neural networks next class.
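As a small illustration of the two-stage view (my own sketch, not from the slides), the weak learners define a feature map, and a linear model can then be applied on top of it; classifiers and betas are assumed to come from a trained booster such as the sketches above.

```python
import numpy as np

def boosted_basis(X, classifiers):
    """phi(x) = (sign[h_1(x)], ..., sign[h_T(x)]): one +/-1 feature per weak learner."""
    return np.column_stack([np.sign(h(X)) for h in classifiers])

# Using beta_t as the linear weights recovers the boosted classifier exactly;
# alternatively the linear stage could be re-fit, e.g. by least squares on phi(X).
def boosted_predict(X, classifiers, betas):
    return np.sign(boosted_basis(X, classifiers) @ np.asarray(betas))
```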
