Ensemble Methods: Boosting
Nicholas Ruozzi
University of Texas at Dallas
Based on the slides of Vibhav Gogate and Rob Schapire
Last Time
Variance reduction via bagging:
Generate new training data sets by sampling with replacement from the empirical distribution
Learn a classifier for each of the newly sampled sets
Combine the classifiers for prediction
Today: how to reduce bias
Boosting
How to translate rules of thumb (i.e., good heuristics) into good learning algorithms
For example, if we are trying to classify email as spam or not spam, a good rule of thumb may be that emails containing "Nigerian prince" or "Viagra" are likely to be spam most of the time
Boosting
Freund & Schapire: theory for weak learners in the late 80s
Weak learner: performance on any training set is slightly better than chance prediction
Intended to answer a theoretical question, not as a practical way to improve learning
Tested in the mid 90s using not-so-weak learners. Works anyway!
PAC Learning
Given i.i.d. samples from an unknown, arbitrary distribution:
Strong PAC learning algorithm: for any distribution, with high probability, given polynomially many samples (and polynomial time), can find a classifier with arbitrarily small error
Weak PAC learning algorithm: same, but the error only needs to be slightly better than random guessing (e.g., accuracy only needs to exceed 50% for binary classification)
Does weak learnability imply strong learnability?
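In symbols (a sketch following the standard PAC framework; the explicit quantifiers below are not spelled out on the slide):

```latex
\text{Strong PAC: } \forall\, \mathcal{D},\ \forall\, \varepsilon,\delta \in (0,1):\quad
  \Pr\big[\operatorname{err}_{\mathcal{D}}(h) \le \varepsilon\big] \ge 1-\delta
  \ \text{ using } \mathrm{poly}(1/\varepsilon, 1/\delta) \text{ samples}

\text{Weak PAC: } \exists\, \gamma > 0\ \forall\, \mathcal{D},\ \forall\, \delta \in (0,1):\quad
  \Pr\big[\operatorname{err}_{\mathcal{D}}(h) \le \tfrac{1}{2}-\gamma\big] \ge 1-\delta
```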
Boosting
1. Weight all training samples equally
2. Train a model on the training set
3. Compute the error of the model on the training set
4. Increase the weights on the training cases the model gets wrong
5. Train a new model on the re-weighted training set
6. Re-compute the errors on the weighted training set
7. Increase the weights again on the cases the model gets wrong
Repeat until tired (100+ iterations)
Final model: weighted prediction of each model
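A minimal sketch of this recipe in Python; `train_weak_learner(X, y, w)` is a hypothetical placeholder for any learner that accepts example weights, and the fixed up-weighting factor is arbitrary here (AdaBoost's principled choice comes next):

```python
import numpy as np

def generic_boost(X, y, train_weak_learner, n_rounds=100, upweight=2.0):
    """The generic boosting recipe above; train_weak_learner(X, y, w) is a
    hypothetical placeholder returning a fitted model with a .predict(X) method."""
    n = len(y)
    w = np.ones(n) / n                       # step 1: weight all samples equally
    models = []
    for _ in range(n_rounds):                # "repeat until tired"
        h = train_weak_learner(X, y, w)      # steps 2/5: train on the weighted set
        mistakes = h.predict(X) != y         # steps 3/6: errors on the weighted set
        w[mistakes] *= upweight              # steps 4/7: increase weights on mistakes
        w /= w.sum()
        models.append(h)
    return models                            # final model: a weighted vote (next slides)
```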
Boosting: Graphical Illustration
[Figure: weak classifiers $h_1(x), h_2(x), \ldots, h_M(x)$ combined into the final classifier]
$h(x) = \operatorname{sign}\Big(\sum_m \alpha_m h_m(x)\Big)$
AdaBoost
1. Initialize the data weights $w_1^{(1)}, \ldots, w_N^{(1)}$ for the first round as $w_i^{(1)} = \frac{1}{N}$
2. For $m = 1, \ldots, M$:
   a) Select a classifier $h_m$ for the $m$-th round by minimizing the weighted error
      $\varepsilon_m = \sum_i w_i^{(m)} \, \mathbb{1}\big[h_m(x^{(i)}) \neq y^{(i)}\big]$
      (the weighted number of incorrect classifications of the $m$-th classifier)
   b) Compute
      $\alpha_m = \frac{1}{2} \ln \frac{1 - \varepsilon_m}{\varepsilon_m}$
      As $\varepsilon_m \to 0$, $\alpha_m \to +\infty$; as $\varepsilon_m \to 0.5$, $\alpha_m \to 0$; as $\varepsilon_m \to 1$, $\alpha_m \to -\infty$
   c) Update the weights
      $w_i^{(m+1)} = \dfrac{w_i^{(m)} \exp\big(-y^{(i)} \alpha_m h_m(x^{(i)})\big)}{2\sqrt{\varepsilon_m (1 - \varepsilon_m)}}$
      where $2\sqrt{\varepsilon_m (1 - \varepsilon_m)}$ is the normalization factor
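A compact NumPy sketch of these exact updates; the decision-stump weak learner from scikit-learn is an assumption for illustration (the slides leave the weak learner abstract), and labels are assumed to be in {-1, +1}:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost(X, y, M=100):
    """AdaBoost as on the slide; labels y are assumed to be in {-1, +1}."""
    N = len(y)
    w = np.full(N, 1.0 / N)                      # 1. uniform initial weights
    learners, alphas = [], []
    for m in range(M):                           # 2.
        h = DecisionTreeClassifier(max_depth=1)  # weak learner: a decision stump
        h.fit(X, y, sample_weight=w)             # a) fit to the weighted data
        pred = h.predict(X)
        eps = np.sum(w * (pred != y))            # weighted error eps_m
        if eps == 0 or eps >= 0.5:               # degenerate rounds: stop early
            break
        alpha = 0.5 * np.log((1 - eps) / eps)    # b) alpha_m
        w = w * np.exp(-y * alpha * pred)        # c) reweight ...
        w /= 2 * np.sqrt(eps * (1 - eps))        # ... dividing by the normalizer Z_m
        learners.append(h)
        alphas.append(alpha)
    return learners, alphas

def adaboost_predict(learners, alphas, X):
    """Final hypothesis: h(x) = sign(sum_m alpha_m h_m(x))."""
    scores = sum(a * h.predict(X) for a, h in zip(alphas, learners))
    return np.sign(scores)
```

Dividing by $2\sqrt{\varepsilon_m(1-\varepsilon_m)}$ is exactly the normalization from step (c); replacing that line with `w /= w.sum()` gives the same weights.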
Example
Consider a classification problem where vertical and horizontal lines (and their corresponding half spaces) are the weak learners.
[Figure: dataset $D$ of + and - points; the weak learner $h_1$ is chosen in round 1, $h_2$ in round 2, and $h_3$ in round 3]
Round 1: $\varepsilon_1 = 0.3$, $\alpha_1 = 0.42$
Round 2: $\varepsilon_2 = 0.21$, $\alpha_2 = 0.65$
Round 3: $\varepsilon_3 = 0.14$, $\alpha_3 = 0.92$
Final Hypothesis
$h(x) = \operatorname{sign}\big(0.42\, h_1(x) + 0.65\, h_2(x) + 0.92\, h_3(x)\big)$
[Figure: the resulting combined decision regions]
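As a quick sanity check of the round-1 numbers, plugging $\varepsilon_1 = 0.3$ into the formula for $\alpha_m$:

```latex
\alpha_1 = \frac{1}{2}\ln\frac{1-\varepsilon_1}{\varepsilon_1}
         = \frac{1}{2}\ln\frac{0.7}{0.3}
         \approx \frac{1}{2}(0.847)
         \approx 0.42
```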
Boosting
Theorem: Let $Z_m = 2\sqrt{\varepsilon_m (1 - \varepsilon_m)}$ and $\gamma_m = \frac{1}{2} - \varepsilon_m$. Then
$\frac{1}{N} \sum_i \mathbb{1}\big[h(x^{(i)}) \neq y^{(i)}\big] \;\le\; \prod_{m=1}^{M} Z_m \;=\; \prod_{m=1}^{M} \sqrt{1 - 4\gamma_m^2}$
So, even if all of the $\gamma$'s are small positive numbers (i.e., every learner is a weak learner), the training error goes to zero as $M$ increases.
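The equality of the two products follows directly from the definitions, and a standard one-line bound (not stated on the slide) makes the "goes to zero" claim concrete:

```latex
Z_m = 2\sqrt{\varepsilon_m(1-\varepsilon_m)}
    = 2\sqrt{\left(\tfrac{1}{2}-\gamma_m\right)\left(\tfrac{1}{2}+\gamma_m\right)}
    = \sqrt{1-4\gamma_m^2}
    \;\le\; e^{-2\gamma_m^2},
\qquad\text{so}\qquad
\prod_{m=1}^{M} Z_m \;\le\; \exp\!\Big(-2\sum_{m=1}^{M}\gamma_m^2\Big).
```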
Margins & Boosting
We can see that the training error goes down, but what about the test error? That is, does boosting help us generalize better?
To answer this question, we need to look at how confident we are in our predictions. How can we measure this? Margins!
Margins & Boosting
Intuition: larger margins lead to better generalization (same as SVMs)
Theorem: with high probability, boosting increases the size of the margins
Note: boosting does NOT maximize the margin, so it can still have poor generalization performance
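The slide does not define the margin explicitly; the usual quantity in the boosting margin analysis (Schapire et al.) is the normalized signed vote:

```latex
\operatorname{margin}\big(x^{(i)}, y^{(i)}\big)
  \;=\; \frac{y^{(i)} \sum_{m} \alpha_m h_m\big(x^{(i)}\big)}{\sum_{m} \alpha_m}
  \;\in\; [-1, 1]
```

(assuming each $\alpha_m \ge 0$, which holds whenever $\varepsilon_m < 1/2$); it is positive exactly when the ensemble classifies the example correctly, and close to 1 when the weighted vote is nearly unanimous.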
Boosting Performance
[Figure]
Boosting as Optimization
AdaBoost can actually be interpreted as a coordinate descent method for a specific loss function!
Let $h_1, \ldots, h_T$ be the set of all weak learners.
Exponential loss: $\ell(\alpha_1, \ldots, \alpha_T) = \sum_i \exp\Big(-y^{(i)} \sum_t \alpha_t h_t(x^{(i)})\Big)$
Convex in $\alpha_t$
AdaBoost minimizes this exponential loss
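A small NumPy sketch of this loss; stacking the weak learners' ±1 predictions into an $N \times T$ matrix `H` is a layout chosen here for illustration:

```python
import numpy as np

def exponential_loss(alpha, H, y):
    """l(alpha) = sum_i exp(-y_i * sum_t alpha_t * h_t(x_i)).

    alpha : (T,) weights on the weak learners
    H     : (N, T) matrix with H[i, t] = h_t(x_i) in {-1, +1}
    y     : (N,) labels in {-1, +1}
    """
    margins = y * (H @ alpha)          # signed vote for each example
    return np.sum(np.exp(-margins))
```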
Coordinate Descent
Minimize the loss with respect to a single component of $\alpha$; let's pick $\alpha_t$:
$\frac{d\ell}{d\alpha_t} = \sum_i -y^{(i)} h_t(x^{(i)}) \exp\Big(-y^{(i)} \sum_{t'} \alpha_{t'} h_{t'}(x^{(i)})\Big)$
$= -\sum_{i:\, h_t(x^{(i)}) = y^{(i)}} \exp(-\alpha_t) \exp\Big(-y^{(i)} \sum_{t' \neq t} \alpha_{t'} h_{t'}(x^{(i)})\Big) \;+\; \sum_{i:\, h_t(x^{(i)}) \neq y^{(i)}} \exp(\alpha_t) \exp\Big(-y^{(i)} \sum_{t' \neq t} \alpha_{t'} h_{t'}(x^{(i)})\Big) = 0$
Coordinate Descent
Solving for $\alpha_t$:
$\alpha_t = \frac{1}{2} \ln \dfrac{\sum_{i:\, h_t(x^{(i)}) = y^{(i)}} \exp\big(-y^{(i)} \sum_{t' \neq t} \alpha_{t'} h_{t'}(x^{(i)})\big)}{\sum_{i:\, h_t(x^{(i)}) \neq y^{(i)}} \exp\big(-y^{(i)} \sum_{t' \neq t} \alpha_{t'} h_{t'}(x^{(i)})\big)}$
This is similar to the AdaBoost update! The only difference is that AdaBoost tells us in which order we should update the variables.
Coordinate Descent
Start with $\alpha = 0$. Then $r^{(i)} = \exp\big(-y^{(i)} \sum_{t' \neq t} \alpha_{t'} h_{t'}(x^{(i)})\big) = 1$ for all $i$.
Choose $t$ to minimize
$\sum_{i:\, h_t(x^{(i)}) \neq y^{(i)}} r^{(i)} = N \sum_i w_i^{(1)} \, \mathbb{1}\big[h_t(x^{(i)}) \neq y^{(i)}\big]$
For this choice of $t$, minimizing the objective with respect to $\alpha_t$ gives
$\alpha_t = \frac{1}{2} \ln \dfrac{N \sum_i w_i^{(1)} \, \mathbb{1}\big[h_t(x^{(i)}) = y^{(i)}\big]}{N \sum_i w_i^{(1)} \, \mathbb{1}\big[h_t(x^{(i)}) \neq y^{(i)}\big]} = \frac{1}{2} \ln \frac{1 - \varepsilon_1}{\varepsilon_1}$
Repeating this procedure with the new values of $\alpha$ yields AdaBoost.
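A sketch of one such coordinate step in NumPy, under the assumption (as in the derivation above) that the coordinate being set currently has $\alpha_t = 0$, so the residual weights $r^{(i)}$ can be computed from the full current $\alpha$; the matrix layout `H` is the same illustrative one as before:

```python
import numpy as np

def coordinate_descent_step(H, y, alpha):
    """One coordinate step on the exponential loss, AdaBoost-style.

    H     : (N, T) matrix with H[i, t] = h_t(x_i) in {-1, +1}
    y     : (N,) labels in {-1, +1}
    alpha : (T,) current weights; the chosen coordinate is assumed to be 0
    Returns (t, alpha_t): the coordinate chosen greedily and its new value.
    """
    r = np.exp(-y * (H @ alpha))                 # residual weights r_i
    wrong = (H != y[:, None])                    # wrong[i, t]: h_t misclassifies x_i
    weighted_wrong = wrong.T @ r                 # sum of r_i over mistakes, per learner
    t = np.argmin(weighted_wrong)                # greedy coordinate choice (AdaBoost's order)
    correct_mass = r.sum() - weighted_wrong[t]
    alpha_t = 0.5 * np.log(correct_mass / weighted_wrong[t])
    return t, alpha_t
```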
AdaBoost as Optimization
We could derive an AdaBoost-style algorithm for other types of loss functions!
Important to note:
Exponential loss is convex, but may have multiple global optima.
In practice, AdaBoost can perform quite differently than other methods for minimizing this loss (e.g., gradient descent).
Boosting in Practice
Our description of the algorithm assumed that a set of possible hypotheses was given.
In practice, the set of hypotheses can be built as the algorithm progresses.
Example: build a new decision tree at each iteration for the data set in which the $i$-th example has weight $w_i^{(m)}$. When computing the information gain, compute the empirical probabilities using the weights (see the sketch below).
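A sketch of those weighted empirical probabilities inside an information-gain computation, written for a simple threshold split on a single feature (the threshold-split framing is an assumption for illustration):

```python
import numpy as np

def weighted_entropy(y, w):
    """Entropy of labels y using weighted empirical probabilities p(c) = total weight of class c."""
    w = w / w.sum()
    probs = np.array([w[y == c].sum() for c in np.unique(y)])
    probs = probs[probs > 0]                     # ignore empty classes
    return -np.sum(probs * np.log2(probs))

def weighted_info_gain(x, y, w, threshold):
    """Information gain of the split x <= threshold, using example weights w."""
    w = w / w.sum()
    left = x <= threshold
    gain = weighted_entropy(y, w)                # parent entropy
    for mask in (left, ~left):                   # subtract weighted child entropies
        if mask.any():
            gain -= w[mask].sum() * weighted_entropy(y[mask], w[mask])
    return gain
```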
Boosting vs. Bagging
Bagging doesn't work so well with stable models; boosting might still help.
Boosting might hurt performance on noisy datasets; bagging doesn't have this problem.
On average, boosting helps more than bagging, but it is also more common for boosting to hurt performance.
Bagging is easier to parallelize.
Other Approaches
Mixture of Experts (see Bishop, Chapter 14)
Cascading Classifiers
...and many others