BBM406 - Introduction to ML, Spring 2014. Ensemble Methods. Aykut Erdem, Dept. of Computer Engineering, Hacettepe University
Slides adapted from David Sontag, Mehryar Mohri, Ziv Bar-Joseph, Arvind Rao, Greg Shakhnarovich, Rob Schapire, Tommi Jaakkola, Jan Sochman and Jiri Matas.
Today: Ensemble Methods
- Bagging
- Boosting
Bias/Variance Tradeoff
Hastie, Tibshirani & Friedman, The Elements of Statistical Learning, 2001
Bias/Variance Tradeoff
Graphical illustration of bias and variance. http://scott.fortmann-roe.com/docs/biasvariance.html
Fighting the bias-variance tradeoff
Simple (a.k.a. weak) learners are good:
- e.g., naive Bayes, logistic regression, decision stumps (or shallow decision trees)
- low variance, don't usually overfit
Simple (a.k.a. weak) learners are bad:
- high bias, can't solve hard learning problems
Reduce Variance Without Increasing Bias
Averaging reduces variance: $\mathrm{Var}(\bar{X}) = \mathrm{Var}(X)/N$ (when predictions are independent).
Average models to reduce model variance.
One problem:
- there is only one training set
- where do multiple models come from?
Bagging (Bootstrap Aggregating) [Leo Breiman, 1994]
Take repeated bootstrap samples from the training set D.
Bootstrap sampling: given a set D containing N training examples, create D' by drawing N examples at random with replacement from D.
Bagging:
- Create k bootstrap samples D_1, ..., D_k.
- Train a distinct classifier on each D_i.
- Classify new instances by majority vote / average, as in the sketch below.
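A minimal Python sketch of this procedure (mine, not from the slides); it assumes scikit-learn's DecisionTreeClassifier as the base learner, NumPy arrays for X and y, and non-negative integer class labels for the majority vote:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, k=100, seed=0):
    """Train k trees, each on a bootstrap sample of (X, y)."""
    rng = np.random.default_rng(seed)
    n = len(X)
    models = []
    for _ in range(k):
        idx = rng.integers(0, n, size=n)  # draw N examples with replacement
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    """Classify each test point by majority vote over the k trees."""
    votes = np.stack([m.predict(X) for m in models])  # shape (k, n_test)
    # assumes non-negative integer class labels
    return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
```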
Bagging
Best case: averaging N independent models reduces variance by a factor of N, i.e. $\mathrm{Var}(\bar{X}) = \mathrm{Var}(X)/N$.
In practice:
- models are correlated, so the reduction is smaller than 1/N (see the derivation below)
- the variance of models trained on fewer training cases is usually somewhat larger
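For reference, a short derivation behind both claims (standard results, not on the original slides); here $\sigma^2$ is each model's prediction variance and $\rho$ the pairwise correlation between models:

```latex
% Independent models: variance of the average drops by 1/N (the best case).
\mathrm{Var}\!\left(\tfrac{1}{N}\sum_{i=1}^{N} h_i(x)\right)
  = \tfrac{1}{N^2}\sum_{i=1}^{N}\mathrm{Var}\big(h_i(x)\big)
  = \tfrac{\sigma^2}{N}
% Correlated models (pairwise correlation \rho): the reduction is smaller,
% and a floor of \rho\sigma^2 remains no matter how many models we average.
\mathrm{Var}\!\left(\tfrac{1}{N}\sum_{i=1}^{N} h_i(x)\right)
  = \rho\,\sigma^2 + \tfrac{1-\rho}{N}\,\sigma^2
```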
Bagging Example
CART* decision boundary
*A decision tree learning algorithm, very similar to ID3
100 bagged trees
Shades of blue/red indicate the strength of the vote for a particular classification.
Reduce Bias² and Decrease Variance?
Bagging reduces variance by averaging.
Bagging has little effect on bias.
Can we average and reduce bias? Yes: Boosting
Boosting Ideas
Main idea: use a weak learner to create a strong learner.
Ensemble method: combine base classifiers returned by the weak learner.
Finding simple, relatively accurate base classifiers is often not hard.
But how should base classifiers be combined?
Example: How May I Help You? [Gorin et al.]
Goal: automatically categorize the type of call requested by a phone customer (Collect, CallingCard, PersonToPerson, etc.)
- "yes I'd like to place a collect call long distance please" (Collect)
- "operator I need to make a call but I need to bill it to my office" (ThirdNumber)
- "yes I'd like to place a call on my master card please" (CallingCard)
- "I just called a number in sioux city and I musta rang the wrong number because I got the wrong party and I would like to have that taken off of my bill" (BillingCredit)
Observation:
- it is easy to find rules of thumb that are often correct, e.g.: IF "card" occurs in the utterance THEN predict CallingCard
- it is hard to find a single highly accurate prediction rule
Boosting: Intuition
Instead of learning a single (weak) classifier, learn many weak classifiers that are good at different parts of the input space.
Output class: (weighted) vote of each classifier
- classifiers that are most sure will vote with more conviction
- classifiers will be most sure about a particular part of the space
- on average, do better than a single classifier
But how do you
- force classifiers to learn about different parts of the input space?
- weigh the votes of different classifiers?
Boosting [Schapire, 1989]
Idea: given a weak learner, run it multiple times on (reweighted) training data, then let the learned classifiers vote.
On each iteration t:
- weight each training example by how incorrectly it was classified
- learn a hypothesis h_t
- and a strength α_t for this hypothesis
Final classifier: a linear combination of the votes of the different classifiers, weighted by their strengths.
Practically useful. Theoretically interesting.
Boosting: Intuition
Want to pick weak classifiers that contribute something to the ensemble.
Greedy algorithm: for m = 1, ..., M:
- pick a weak classifier h_m
- adjust weights: misclassified examples get heavier
- set α_m according to the weighted error of h_m
First Boosting Algorithms
[Schapire '89]:
- first provable boosting algorithm
[Freund '90]:
- optimal algorithm that boosts by majority
[Drucker, Schapire & Simard '92]:
- first experiments using boosting
- limited by practical drawbacks
[Freund & Schapire '95]:
- introduced the AdaBoost algorithm
- strong practical advantages over previous boosting algorithms
The AdaBoost Algorithm
Toy Example
Minimize the error.
Weak hypotheses = vertical or horizontal half-planes.
Round 1: weak classifier h_1, new weights D_2; ε_1 = 0.30, α_1 = 0.42
Round 2: weak classifier h_2, new weights D_3; ε_2 = 0.21, α_2 = 0.65
Round 3: weak classifier h_3; ε_3 = 0.14, α_3 = 0.92
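As a quick check (my arithmetic, not on the slides), these vote weights follow from the AdaBoost rule $\alpha_t = \frac{1}{2}\log\frac{1-\epsilon_t}{\epsilon_t}$ introduced below, matching the slides' values up to rounding of the reported error rates:

```latex
\alpha_1 = \tfrac{1}{2}\ln\tfrac{1-0.30}{0.30} \approx 0.42, \qquad
\alpha_2 = \tfrac{1}{2}\ln\tfrac{1-0.21}{0.21} \approx 0.66, \qquad
\alpha_3 = \tfrac{1}{2}\ln\tfrac{1-0.14}{0.14} \approx 0.91
```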
Final Hypothesis
$H_{\mathrm{final}}(x) = \mathrm{sign}\big(0.42\, h_1(x) + 0.65\, h_2(x) + 0.92\, h_3(x)\big)$
Voted combination of classifiers
The general problem here is to try to combine many simple weak classifiers into a single strong classifier.
We consider voted combinations of simple binary ±1 component classifiers
$h_m(x) = \alpha_1 h(x; \theta_1) + \cdots + \alpha_m h(x; \theta_m)$
where the (non-negative) votes $\alpha_i$ can be used to emphasize component classifiers that are more reliable than others.
Components: Decision stumps
Consider the following simple family of component classifiers generating ±1 labels:
$h(x; \theta) = \mathrm{sign}(w_1 x_k - w_0)$, where $\theta = \{k, w_1, w_0\}$.
These are called decision stumps.
Each decision stump pays attention to only a single component of the input vector, as in the sketch below.
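A minimal sketch of fitting a decision stump to weighted data (function names are mine, not the slides'); it scans each coordinate and threshold and keeps the stump with the lowest weighted error:

```python
import numpy as np

def fit_stump(X, y, w):
    """Pick (feature k, threshold t, sign s) minimizing weighted error.
    X: (n, d) array; y: labels in {-1, +1}; w: weights summing to 1."""
    n, d = X.shape
    best = (np.inf, None)  # (weighted error, (k, t, s))
    for k in range(d):
        for t in np.unique(X[:, k]):
            for s in (+1, -1):
                pred = s * np.sign(X[:, k] - t)
                pred[pred == 0] = s              # break ties consistently
                err = np.sum(w * (pred != y))    # weighted misclassification
                if err < best[0]:
                    best = (err, (k, t, s))
    return best

def stump_predict(params, X):
    """Apply a fitted stump: sign of (x_k - t), flipped by s."""
    k, t, s = params
    pred = s * np.sign(X[:, k] - t)
    pred[pred == 0] = s
    return pred
```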
Voted combinations (cont'd.)
We need to define a loss function for the combination so we can determine which new component $h(x; \theta)$ to add and how many votes it should receive.
While there are many options for the loss function, we consider here only a simple exponential loss: $\mathrm{Loss}(y, h(x)) = \exp(-y\, h(x))$.
Modularity, errors, and loss
Consider adding the m-th component:
$\sum_{i=1}^n \exp\big(-y_i [h_{m-1}(x_i) + \alpha_m h(x_i; \theta_m)]\big) = \sum_{i=1}^n W_i^{(m-1)} \exp\big(-y_i \alpha_m h(x_i; \theta_m)\big)$
where the weights $W_i^{(m-1)} = \exp(-y_i h_{m-1}(x_i))$ are fixed by the first $m-1$ components.
So at the m-th iteration the new component (and the votes) should optimize a weighted loss (weighted towards mistakes).
Empirical exponential loss (cont'd.)
To increase modularity we'd like to further decouple the optimization of $h(x; \theta_m)$ from the associated votes $\alpha_m$.
To this end we select the $h(x; \theta_m)$ that optimizes the rate at which the loss would decrease as a function of $\alpha_m$.
Empirical exponential loss (cont'd.)
We find the $h(x; \theta_m)$ that minimizes the derivative of the loss at $\alpha_m = 0$:
$-\sum_{i=1}^n W_i^{(m-1)}\, y_i\, h(x_i; \theta_m)$
We can also normalize the weights: $\tilde{W}_i^{(m-1)} = W_i^{(m-1)} / \sum_{j=1}^n W_j^{(m-1)}$, so that $\sum_{i=1}^n \tilde{W}_i^{(m-1)} = 1$.
Empirical exponential loss (cont'd.)
We find the $\hat{\theta}_m$ that minimizes $-\sum_{i=1}^n \tilde{W}_i^{(m-1)}\, y_i\, h(x_i; \theta_m)$, which is equivalent to minimizing the weighted classification error, where $\hat{\alpha}_m$ is subsequently chosen to minimize $\sum_{i=1}^n \tilde{W}_i^{(m-1)} \exp\big(-y_i \hat{\alpha}_m h(x_i; \hat{\theta}_m)\big)$.
The AdaBoost Algorithm
Given: $(x_1, y_1), \ldots, (x_m, y_m)$; $x_i \in X$, $y_i \in \{-1, +1\}$
Initialise weights $D_1(i) = 1/m$.
For $t = 1, \ldots, T$:
- Find $h_t = \arg\min_{h_j \in H} \epsilon_j$, where $\epsilon_j = \sum_{i=1}^m D_t(i)\,[\![y_i \neq h_j(x_i)]\!]$
- If $\epsilon_t \geq 1/2$ then stop
- Set $\alpha_t = \frac{1}{2} \log\big(\frac{1 - \epsilon_t}{\epsilon_t}\big)$
- Update $D_{t+1}(i) = \frac{D_t(i)\, \exp(-\alpha_t y_i h_t(x_i))}{Z_t}$, where $Z_t$ is a normalisation factor
Output the final classifier: $H(x) = \mathrm{sign}\big(\sum_{t=1}^T \alpha_t h_t(x)\big)$
[Figure: training error versus boosting step t, falling from about 0.35 towards 0 over 40 steps.]
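Putting the pseudocode above into runnable form, here is a minimal AdaBoost sketch (mine, not from the slides), reusing the hypothetical fit_stump / stump_predict helpers from the decision-stump sketch as the weak learner:

```python
import numpy as np

def adaboost(X, y, T=40):
    """Labels y in {-1, +1}. Returns a list of (alpha_t, stump params)."""
    n = len(X)
    D = np.full(n, 1.0 / n)                   # D_1(i) = 1/m
    ensemble = []
    for _ in range(T):
        eps, params = fit_stump(X, y, D)      # weak learner on weighted data
        if eps >= 0.5:                        # no better than chance: stop
            break
        eps = max(eps, 1e-12)                 # guard against log(1/0)
        alpha = 0.5 * np.log((1 - eps) / eps)
        pred = stump_predict(params, X)
        D = D * np.exp(-alpha * y * pred)     # up-weight the mistakes
        D = D / D.sum()                       # the Z_t normalisation
        ensemble.append((alpha, params))
    return ensemble

def adaboost_predict(ensemble, X):
    """H(x) = sign of the alpha-weighted vote of the weak classifiers."""
    score = sum(a * stump_predict(p, X) for a, p in ensemble)
    return np.sign(score)  # note: a score of exactly 0 maps to 0
```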
Reweighting
Boosting results: Digit recognition
[Figure: training and test error (%) versus # rounds (10 to 1000); test error keeps falling after training error reaches zero.]
Boosting is often (but not always)
- robust to overfitting
- test set error decreases even after training error is zero [Schapire, 1989]
Application: Detecting Faces [Viola & Jones]
Problem: find faces in a photograph or movie.
Weak classifiers: detect light/dark rectangles in an image.
Many clever tricks make it extremely fast and accurate.
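To make "light/dark rectangle" concrete, here is a sketch (helper names are mine) of the integral-image trick Viola & Jones use so that any rectangle sum costs only four array lookups, regardless of its size:

```python
import numpy as np

def integral_image(img):
    """ii[r, c] = sum of img[:r, :c]; zero-padded so indexing stays simple."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1))
    ii[1:, 1:] = img.cumsum(axis=0).cumsum(axis=1)
    return ii

def rect_sum(ii, r0, c0, r1, c1):
    """Sum of img[r0:r1, c0:c1] in four lookups."""
    return ii[r1, c1] - ii[r0, c1] - ii[r1, c0] + ii[r0, c0]

def two_rect_feature(ii, r0, c0, r1, c1):
    """One illustrative two-rectangle feature: left half minus right half,
    large when the left is bright and the right is dark."""
    cm = (c0 + c1) // 2
    return rect_sum(ii, r0, c0, r1, cm) - rect_sum(ii, r0, cm, r1, c1)
```

Thresholding such a feature value gives a ±1 weak classifier, which boosting then combines.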
Boosting vs. Logistic Regression
Logistic regression:
- Minimize log loss: $\sum_i \ln\big(1 + \exp(-y_i f(x_i))\big)$
- Define $f(x) = w_0 + \sum_j w_j x_j$, where the features $x_j$ are predefined (a linear classifier)
- Jointly optimize over all weights $w_0, w_1, w_2, \ldots$
Boosting:
- Minimize exp loss: $\sum_i \exp(-y_i f(x_i))$
- Define $f(x) = \sum_t \alpha_t h_t(x)$, where the $h_t(x)$ are defined dynamically to fit the data (not a linear classifier)
- Weights $\alpha_t$ learned per iteration $t$, incrementally
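A tiny numeric illustration of the two losses as functions of the margin $y \cdot f(x)$ (standard definitions, not slide content): exp loss penalizes confident mistakes far more aggressively than log loss.

```python
import numpy as np

margins = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])   # y * f(x)
log_loss = np.log(1 + np.exp(-margins))           # logistic regression
exp_loss = np.exp(-margins)                       # boosting
for m, ll, el in zip(margins, log_loss, exp_loss):
    print(f"margin {m:+.1f}: log loss {ll:.3f}, exp loss {el:.3f}")
```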
Boosting vs. Bagging
Bagging:
- Resamples data points
- Weight of each classifier is the same
- Only variance reduction
Boosting:
- Reweights data points (modifies their distribution)
- Weight depends on each classifier's accuracy
- Both bias and variance reduced: the learning rule becomes more complex with iterations