BBM406 - Introduction to ML, Spring 2014. Ensemble Methods. Aykut Erdem, Dept. of Computer Engineering, Hacettepe University
Slides adapted from David Sontag, Mehryar Mohri, Ziv Bar-Joseph, Arvind Rao, Greg Shakhnarovich, Rob Schapire, Tommi Jaakkola, Jan Sochman and Jiri Matas.
Today: Ensemble Methods
- Bagging
- Boosting
Bias/Variance Tradeoff
Hastie, Tibshirani & Friedman, The Elements of Statistical Learning, 2001
Bias/Variance Tradeoff
Graphical illustration of bias and variance. http://scott.fortmann-roe.com/docs/biasvariance.html
Fighting the bias-variance tradeoff
Simple (a.k.a. weak) learners are good:
- e.g., naive Bayes, logistic regression, decision stumps (or shallow decision trees)
- low variance, don't usually overfit
Simple (a.k.a. weak) learners are bad:
- high bias, can't solve hard learning problems
Reduce Variance Without Increasing Bias
Averaging reduces variance: $\mathrm{Var}(\bar{X}) = \mathrm{Var}(X)/N$ (when predictions are independent).
Average models to reduce model variance.
One problem:
- there is only one training set
- where do multiple models come from?
Bagging (Bootstrap Aggregating) [Leo Breiman, 1994]
Take repeated bootstrap samples from the training set D.
Bootstrap sampling: given a set D containing N training examples, create D' by drawing N examples at random with replacement from D.
Bagging:
- Create k bootstrap samples D_1, ..., D_k.
- Train a distinct classifier on each D_i.
- Classify new instances by majority vote / average, as in the sketch below.
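A minimal Python sketch of this procedure (mine, not from the slides); it assumes scikit-learn's DecisionTreeClassifier as the base learner, NumPy arrays for X and y, and non-negative integer class labels for the majority vote:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, k=100, seed=0):
    """Train k trees, each on a bootstrap sample of (X, y)."""
    rng = np.random.default_rng(seed)
    n = len(X)
    models = []
    for _ in range(k):
        idx = rng.integers(0, n, size=n)  # draw N examples with replacement
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    """Classify each test point by majority vote over the k trees."""
    votes = np.stack([m.predict(X) for m in models])  # shape (k, n_test)
    # assumes non-negative integer class labels
    return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
```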
Bagging
Best case: averaging N independent models reduces variance by a factor of N, i.e. $\mathrm{Var}(\bar{X}) = \mathrm{Var}(X)/N$.
In practice:
- models are correlated, so the reduction is smaller than 1/N (see the derivation below)
- the variance of models trained on fewer training cases is usually somewhat larger
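For reference, a short derivation behind both claims (standard results, not on the original slides); here $\sigma^2$ is each model's prediction variance and $\rho$ the pairwise correlation between models:

```latex
% Independent models: variance of the average drops by 1/N (the best case).
\mathrm{Var}\!\left(\tfrac{1}{N}\sum_{i=1}^{N} h_i(x)\right)
  = \tfrac{1}{N^2}\sum_{i=1}^{N}\mathrm{Var}\big(h_i(x)\big)
  = \tfrac{\sigma^2}{N}
% Correlated models (pairwise correlation \rho): the reduction is smaller,
% and a floor of \rho\sigma^2 remains no matter how many models we average.
\mathrm{Var}\!\left(\tfrac{1}{N}\sum_{i=1}^{N} h_i(x)\right)
  = \rho\,\sigma^2 + \tfrac{1-\rho}{N}\,\sigma^2
```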
Bagging Example
CART* decision boundary
*A decision tree learning algorithm, very similar to ID3
100 bagged trees
Shades of blue/red indicate the strength of the vote for a particular classification.
Reduce Bias² and Decrease Variance?
Bagging reduces variance by averaging.
Bagging has little effect on bias.
Can we average and reduce bias? Yes: Boosting
Boosting Ideas
Main idea: use a weak learner to create a strong learner.
Ensemble method: combine base classifiers returned by the weak learner.
Finding simple, relatively accurate base classifiers is often not hard.
But how should base classifiers be combined?
Example: How May I Help You? [Gorin et al.]
Goal: automatically categorize the type of call requested by a phone customer (Collect, CallingCard, PersonToPerson, etc.)
- "yes I'd like to place a collect call long distance please" (Collect)
- "operator I need to make a call but I need to bill it to my office" (ThirdNumber)
- "yes I'd like to place a call on my master card please" (CallingCard)
- "I just called a number in sioux city and I musta rang the wrong number because I got the wrong party and I would like to have that taken off of my bill" (BillingCredit)
Observation:
- it is easy to find rules of thumb that are often correct, e.g.: IF "card" occurs in the utterance THEN predict CallingCard
- it is hard to find a single highly accurate prediction rule
Boosting: Intuition
Instead of learning a single (weak) classifier, learn many weak classifiers that are good at different parts of the input space.
Output class: (weighted) vote of each classifier
- classifiers that are most sure will vote with more conviction
- classifiers will be most sure about a particular part of the space
- on average, do better than a single classifier
But how do you
- force classifiers to learn about different parts of the input space?
- weigh the votes of different classifiers?
Boosting [Schapire, 1989]
Idea: given a weak learner, run it multiple times on (reweighted) training data, then let the learned classifiers vote.
On each iteration t:
- weight each training example by how incorrectly it was classified
- learn a hypothesis h_t
- and a strength α_t for this hypothesis
Final classifier: a linear combination of the votes of the different classifiers, weighted by their strengths.
Practically useful. Theoretically interesting.
Boosting: Intuition
Want to pick weak classifiers that contribute something to the ensemble.
Greedy algorithm: for m = 1, ..., M:
- pick a weak classifier h_m
- adjust weights: misclassified examples get heavier
- set α_m according to the weighted error of h_m
First Boosting Algorithms
[Schapire '89]:
- first provable boosting algorithm
[Freund '90]:
- optimal algorithm that boosts by majority
[Drucker, Schapire & Simard '92]:
- first experiments using boosting
- limited by practical drawbacks
[Freund & Schapire '95]:
- introduced the AdaBoost algorithm
- strong practical advantages over previous boosting algorithms
The AdaBoost Algorithm
Toy Example
Minimize the error.
Weak hypotheses = vertical or horizontal half-planes.
Round 1: weak classifier h_1, new weights D_2; ε_1 = 0.30, α_1 = 0.42
Round 2: weak classifier h_2, new weights D_3; ε_2 = 0.21, α_2 = 0.65
Round 3: weak classifier h_3; ε_3 = 0.14, α_3 = 0.92
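As a quick check (my arithmetic, not on the slides), these vote weights follow from the AdaBoost rule $\alpha_t = \frac{1}{2}\log\frac{1-\epsilon_t}{\epsilon_t}$ introduced below, matching the slides' values up to rounding of the reported error rates:

```latex
\alpha_1 = \tfrac{1}{2}\ln\tfrac{1-0.30}{0.30} \approx 0.42, \qquad
\alpha_2 = \tfrac{1}{2}\ln\tfrac{1-0.21}{0.21} \approx 0.66, \qquad
\alpha_3 = \tfrac{1}{2}\ln\tfrac{1-0.14}{0.14} \approx 0.91
```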
Final Hypothesis
$H_{\mathrm{final}}(x) = \mathrm{sign}\big(0.42\, h_1(x) + 0.65\, h_2(x) + 0.92\, h_3(x)\big)$
Voted combination of classifiers
The general problem here is to try to combine many simple weak classifiers into a single strong classifier.
We consider voted combinations of simple binary ±1 component classifiers
$h_m(x) = \alpha_1 h(x; \theta_1) + \cdots + \alpha_m h(x; \theta_m)$
where the (non-negative) votes $\alpha_i$ can be used to emphasize component classifiers that are more reliable than others.
Components: Decision stumps
Consider the following simple family of component classifiers generating ±1 labels:
$h(x; \theta) = \mathrm{sign}(w_1 x_k - w_0)$, where $\theta = \{k, w_1, w_0\}$.
These are called decision stumps.
Each decision stump pays attention to only a single component of the input vector, as in the sketch below.
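A minimal sketch of fitting a decision stump to weighted data (function names are mine, not the slides'); it scans each coordinate and threshold and keeps the stump with the lowest weighted error:

```python
import numpy as np

def fit_stump(X, y, w):
    """Pick (feature k, threshold t, sign s) minimizing weighted error.
    X: (n, d) array; y: labels in {-1, +1}; w: weights summing to 1."""
    n, d = X.shape
    best = (np.inf, None)  # (weighted error, (k, t, s))
    for k in range(d):
        for t in np.unique(X[:, k]):
            for s in (+1, -1):
                pred = s * np.sign(X[:, k] - t)
                pred[pred == 0] = s              # break ties consistently
                err = np.sum(w * (pred != y))    # weighted misclassification
                if err < best[0]:
                    best = (err, (k, t, s))
    return best

def stump_predict(params, X):
    """Apply a fitted stump: sign of (x_k - t), flipped by s."""
    k, t, s = params
    pred = s * np.sign(X[:, k] - t)
    pred[pred == 0] = s
    return pred
```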
Voted combinations (cont'd.)
We need to define a loss function for the combination so we can determine which new component $h(x; \theta)$ to add and how many votes it should receive.
While there are many options for the loss function, we consider here only a simple exponential loss: $\mathrm{Loss}(y, h(x)) = \exp(-y\, h(x))$.
Modularity, errors, and loss
Consider adding the m-th component:
$\sum_{i=1}^n \exp\big(-y_i [h_{m-1}(x_i) + \alpha_m h(x_i; \theta_m)]\big) = \sum_{i=1}^n W_i^{(m-1)} \exp\big(-y_i \alpha_m h(x_i; \theta_m)\big)$
where the weights $W_i^{(m-1)} = \exp(-y_i h_{m-1}(x_i))$ are fixed by the first $m-1$ components.
So at the m-th iteration the new component (and the votes) should optimize a weighted loss (weighted towards mistakes).
Empirical exponential loss (cont'd.)
To increase modularity we'd like to further decouple the optimization of $h(x; \theta_m)$ from the associated votes $\alpha_m$.
To this end we select the $h(x; \theta_m)$ that optimizes the rate at which the loss would decrease as a function of $\alpha_m$.
Empirical exponential loss (cont'd.)
We find the $h(x; \theta_m)$ that minimizes the derivative of the loss at $\alpha_m = 0$:
$-\sum_{i=1}^n W_i^{(m-1)}\, y_i\, h(x_i; \theta_m)$
We can also normalize the weights: $\tilde{W}_i^{(m-1)} = W_i^{(m-1)} / \sum_{j=1}^n W_j^{(m-1)}$, so that $\sum_{i=1}^n \tilde{W}_i^{(m-1)} = 1$.
Empirical exponential loss (cont'd.)
We find the $\hat{\theta}_m$ that minimizes $-\sum_{i=1}^n \tilde{W}_i^{(m-1)}\, y_i\, h(x_i; \theta_m)$, which is equivalent to minimizing the weighted classification error, where $\hat{\alpha}_m$ is subsequently chosen to minimize $\sum_{i=1}^n \tilde{W}_i^{(m-1)} \exp\big(-y_i \hat{\alpha}_m h(x_i; \hat{\theta}_m)\big)$.
The AdaBoost Algorithm
Given: $(x_1, y_1), \ldots, (x_m, y_m)$; $x_i \in X$, $y_i \in \{-1, +1\}$
Initialise weights $D_1(i) = 1/m$.
For $t = 1, \ldots, T$:
- Find $h_t = \arg\min_{h_j \in H} \epsilon_j$, where $\epsilon_j = \sum_{i=1}^m D_t(i)\,[\![y_i \neq h_j(x_i)]\!]$
- If $\epsilon_t \geq 1/2$ then stop
- Set $\alpha_t = \frac{1}{2} \log\big(\frac{1 - \epsilon_t}{\epsilon_t}\big)$
- Update $D_{t+1}(i) = \frac{D_t(i)\, \exp(-\alpha_t y_i h_t(x_i))}{Z_t}$, where $Z_t$ is a normalisation factor
Output the final classifier: $H(x) = \mathrm{sign}\big(\sum_{t=1}^T \alpha_t h_t(x)\big)$
[Figure: training error versus boosting step t, falling from about 0.35 towards 0 over 40 steps.]
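Putting the pseudocode above into runnable form, here is a minimal AdaBoost sketch (mine, not from the slides), reusing the hypothetical fit_stump / stump_predict helpers from the decision-stump sketch as the weak learner:

```python
import numpy as np

def adaboost(X, y, T=40):
    """Labels y in {-1, +1}. Returns a list of (alpha_t, stump params)."""
    n = len(X)
    D = np.full(n, 1.0 / n)                   # D_1(i) = 1/m
    ensemble = []
    for _ in range(T):
        eps, params = fit_stump(X, y, D)      # weak learner on weighted data
        if eps >= 0.5:                        # no better than chance: stop
            break
        eps = max(eps, 1e-12)                 # guard against log(1/0)
        alpha = 0.5 * np.log((1 - eps) / eps)
        pred = stump_predict(params, X)
        D = D * np.exp(-alpha * y * pred)     # up-weight the mistakes
        D = D / D.sum()                       # the Z_t normalisation
        ensemble.append((alpha, params))
    return ensemble

def adaboost_predict(ensemble, X):
    """H(x) = sign of the alpha-weighted vote of the weak classifiers."""
    score = sum(a * stump_predict(p, X) for a, p in ensemble)
    return np.sign(score)  # note: a score of exactly 0 maps to 0
```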
Reweighting
Boosting results: Digit recognition
[Figure: training and test error (%) versus # rounds (10 to 1000); test error keeps falling after training error reaches zero.]
Boosting is often (but not always)
- robust to overfitting
- test set error decreases even after training error is zero [Schapire, 1989]
Application: Detecting Faces [Viola & Jones]
Problem: find faces in a photograph or movie.
Weak classifiers: detect light/dark rectangles in an image.
Many clever tricks make it extremely fast and accurate.
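To make "light/dark rectangle" concrete, here is a sketch (helper names are mine) of the integral-image trick Viola & Jones use so that any rectangle sum costs only four array lookups, regardless of its size:

```python
import numpy as np

def integral_image(img):
    """ii[r, c] = sum of img[:r, :c]; zero-padded so indexing stays simple."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1))
    ii[1:, 1:] = img.cumsum(axis=0).cumsum(axis=1)
    return ii

def rect_sum(ii, r0, c0, r1, c1):
    """Sum of img[r0:r1, c0:c1] in four lookups."""
    return ii[r1, c1] - ii[r0, c1] - ii[r1, c0] + ii[r0, c0]

def two_rect_feature(ii, r0, c0, r1, c1):
    """One illustrative two-rectangle feature: left half minus right half,
    large when the left is bright and the right is dark."""
    cm = (c0 + c1) // 2
    return rect_sum(ii, r0, c0, r1, cm) - rect_sum(ii, r0, cm, r1, c1)
```

Thresholding such a feature value gives a ±1 weak classifier, which boosting then combines.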
Boosting vs. Logistic Regression
Logistic regression:
- Minimize log loss: $\sum_i \ln\big(1 + \exp(-y_i f(x_i))\big)$
- Define $f(x) = w_0 + \sum_j w_j x_j$, where the features $x_j$ are predefined (a linear classifier)
- Jointly optimize over all weights $w_0, w_1, w_2, \ldots$
Boosting:
- Minimize exp loss: $\sum_i \exp(-y_i f(x_i))$
- Define $f(x) = \sum_t \alpha_t h_t(x)$, where the $h_t(x)$ are defined dynamically to fit the data (not a linear classifier)
- Weights $\alpha_t$ learned per iteration $t$, incrementally
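A tiny numeric illustration of the two losses as functions of the margin $y \cdot f(x)$ (standard definitions, not slide content): exp loss penalizes confident mistakes far more aggressively than log loss.

```python
import numpy as np

margins = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])   # y * f(x)
log_loss = np.log(1 + np.exp(-margins))           # logistic regression
exp_loss = np.exp(-margins)                       # boosting
for m, ll, el in zip(margins, log_loss, exp_loss):
    print(f"margin {m:+.1f}: log loss {ll:.3f}, exp loss {el:.3f}")
```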
Boosting vs. Bagging
Bagging:
- Resamples data points
- Weight of each classifier is the same
- Only variance reduction
Boosting:
- Reweights data points (modifies their distribution)
- Weight depends on each classifier's accuracy
- Both bias and variance reduced: the learning rule becomes more complex with iterations