1 BBM406 - Introduction to ML, Spring 2014. Ensemble Methods. Aykut Erdem, Dept. of Computer Engineering, Hacettepe University
2 Slides adopted from David Sontag, Mehryar Mohri, Ziv Bar-Joseph, Arvind Rao, Greg Shakhnarovich, Rob Schapire, Tommi Jaakkola, Jan Sochman and Jiri Matas. Today: Ensemble Methods - Bagging - Boosting
3 Bias/Variance Tradeoff [figure from Hastie, Tibshirani, Friedman, The Elements of Statistical Learning]
4 Bias/Variance Tradeoff. Graphical illustration of bias and variance. roe.com/docs/biasvariance.html
5 Fighting the bias-variance tradeoff. Simple (a.k.a. weak) learners are good - e.g., naïve Bayes, logistic regression, decision stumps (or shallow decision trees): low variance, don't usually overfit. Simple (a.k.a. weak) learners are bad: high bias, can't solve hard learning problems.
6 Reduce Variance Without Increasing Bias. Averaging reduces variance (when the predictions are independent): average models to reduce model variance. One problem: there is only one training set - where do multiple models come from?
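A one-line justification of the claim, using a standard identity rather than anything shown on the slide: for N independent predictions with common variance \sigma^2,

\mathrm{Var}\!\left(\frac{1}{N}\sum_{i=1}^{N}\hat y_i\right) \;=\; \frac{1}{N^2}\sum_{i=1}^{N}\mathrm{Var}(\hat y_i) \;=\; \frac{\sigma^2}{N},

while the bias of the average equals the (common) bias of the individual predictors, so averaging cuts variance without touching bias.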
7 Bagging (Bootstrap Aggregating) [Leo Breiman, 1994]. Take repeated bootstrap samples from training set D. Bootstrap sampling: given a set D containing N training examples, create D' by drawing N examples at random with replacement from D. Bagging:
- Create k bootstrap samples D_1, ..., D_k.
- Train a distinct classifier on each D_i.
- Classify new instances by majority vote / average. (A code sketch follows below.)
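A minimal sketch of this procedure, assuming scikit-learn decision trees as the base classifiers and binary labels coded as -1/+1; the helper names bagging_fit and bagging_predict are illustrative, not from the slides.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, k=100, seed=0):
    """Train k classifiers, each on a bootstrap sample of (X, y)."""
    rng = np.random.default_rng(seed)
    n = len(X)
    models = []
    for _ in range(k):
        idx = rng.integers(0, n, size=n)          # draw N examples with replacement
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    """Classify new instances by majority vote (labels assumed to be -1/+1)."""
    votes = np.stack([m.predict(X) for m in models])   # shape (k, n_test)
    return np.where(votes.sum(axis=0) >= 0, 1, -1)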
8 Bagging. Best case: the variance of the averaged prediction is reduced by a factor of 1/N relative to a single model. In practice: - the models are correlated, so the reduction is smaller than 1/N - the variance of models trained on fewer distinct training cases is usually somewhat larger.
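The gap between the best case and practice can be quantified with another standard identity (the pairwise correlation ρ is an assumption introduced here, not a quantity from the slide): for N models each with variance σ² and pairwise correlation ρ,

\mathrm{Var}\!\left(\frac{1}{N}\sum_{i=1}^{N}\hat y_i\right) \;=\; \rho\,\sigma^2 + \frac{(1-\rho)\,\sigma^2}{N},

so as the models become more correlated the benefit of averaging shrinks toward no reduction at all.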
9 Bagging Example [figure]
10 CART* decision boundary. (*A decision tree learning algorithm; very similar to ID3.)
11 100 bagged trees. Shades of blue/red indicate the strength of the vote for a particular classification.
12 Reduce Bias² and Decrease Variance? Bagging reduces variance by averaging. Bagging has little effect on bias. Can we average and reduce bias? Yes: Boosting.
13 Boosting Ideas. Main idea: use a weak learner to create a strong learner. Ensemble method: combine base classifiers returned by the weak learner. Finding simple, relatively accurate base classifiers is often not hard. But how should the base classifiers be combined?
14 Example: How May I Help You? Goal: automatically categorize the type of call requested by a phone customer (Collect, CallingCard, PersonToPerson, etc.)
- "yes I'd like to place a collect call long distance please" (Collect)
- "operator I need to make a call but I need to bill it to my office" (ThirdNumber)
- "yes I'd like to place a call on my master card please" (CallingCard)
- "I just called a number in sioux city and I musta rang the wrong number because I got the wrong party and I would like to have that taken off of my bill" (BillingCredit)
Observation:
- easy to find rules of thumb that are often correct, e.g.: IF "card" occurs in the utterance THEN predict CallingCard
- hard to find a single highly accurate prediction rule. [Gorin et al.]
15 Boosting: Intuition. Instead of learning a single (weak) classifier, learn many weak classifiers that are good at different parts of the input space. Output class: (weighted) vote of each classifier. Classifiers that are most sure will vote with more conviction; classifiers will be most sure about a particular part of the space. On average, do better than a single classifier. But how do you - force classifiers to learn about different parts of the input space? - weigh the votes of different classifiers?
16 Boosting [Schapire, 1989]. Idea: given a weak learner, run it multiple times on (reweighted) training data, then let the learned classifiers vote. On each iteration t:
- weight each training example by how incorrectly it was classified
- learn a hypothesis h_t
- and a strength for this hypothesis, α_t
Final classifier: a linear combination of the votes of the different classifiers, weighted by their strength. Practically useful; theoretically interesting.
17-23 Boosting: Intuition. Want to pick weak classifiers that contribute something to the ensemble. Greedy algorithm: for m = 1, ..., M:
- pick a weak classifier h_m
- adjust the weights: misclassified examples get heavier
- set α_m according to the weighted error of h_m
(Slides 17-23 repeat this summary alongside figures showing successive iterations of the procedure.)
24 First Boosting Algorithms.
- [Schapire 89]: first provable boosting algorithm
- [Freund 90]: optimal algorithm that boosts by majority
- [Drucker, Schapire & Simard 92]: first experiments using boosting; limited by practical drawbacks
- [Freund & Schapire 95]: introduced the AdaBoost algorithm; strong practical advantages over previous boosting algorithms
25 The AdaBoost Algorithm
26 Toy Example. Minimize the error; weak hypotheses = vertical or horizontal half-planes.
27 Round 1: weak hypothesis h_1 and reweighted distribution D_2; ε_1 = 0.30 (α_1 is worked out below).
28 Round 2: weak hypothesis h_2 and reweighted distribution D_3; ε_2 = 0.21 (α_2 is worked out below).
29 Round 3: weak hypothesis h_3; ε_3 = 0.14 (α_3 is worked out below).
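The α values cut off above can be recovered from the AdaBoost rule α_t = ½ log((1 − ε_t)/ε_t) stated later in these slides; a worked computation from the three errors:

\alpha_1 = \tfrac{1}{2}\ln\frac{1-0.30}{0.30} \approx 0.42, \qquad
\alpha_2 = \tfrac{1}{2}\ln\frac{1-0.21}{0.21} \approx 0.66, \qquad
\alpha_3 = \tfrac{1}{2}\ln\frac{1-0.14}{0.14} \approx 0.91.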
30 Final Hypothesis: H_final(x) = sign(α_1 h_1(x) + α_2 h_2(x) + α_3 h_3(x)).
31 Voted combination of classifiers. The general problem here is to try to combine many simple weak classifiers into a single strong classifier. We consider voted combinations of simple binary ±1 component classifiers, f_m(x) = α_1 h(x; θ_1) + ... + α_m h(x; θ_m), predicting the label sign(f_m(x)), where the (non-negative) votes α_i can be used to emphasize component classifiers that are more reliable than others.
32 Components: Decision stumps. Consider the following simple family of component classifiers generating ±1 labels, e.g. h(x; θ) = sign(w_1 x_k + w_0) with parameters θ = (k, w_0, w_1). These are called decision stumps. Each decision stump pays attention to only a single component of the input vector.
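A sketch of how such a stump can be fit to weighted data, anticipating the weighted errors used below; the exhaustive search and the function names are my own illustration, not the slides' notation.

import numpy as np

def fit_stump(X, y, w):
    """Find the stump h(x) = s if x[:, k] > theta else -s that minimizes the
    weighted error sum_i w_i * [y_i != h(x_i)], by exhaustive search over
    coordinates k, thresholds theta and signs s.  Labels y are -1/+1."""
    _, d = X.shape
    best = (np.inf, 0, 0.0, 1)                  # (error, k, theta, s)
    for k in range(d):
        for theta in np.unique(X[:, k]):
            pred = np.where(X[:, k] > theta, 1, -1)
            for s in (1, -1):
                err = np.sum(w * (y != s * pred))
                if err < best[0]:
                    best = (err, k, theta, s)
    return best

def stump_predict(X, k, theta, s):
    return s * np.where(X[:, k] > theta, 1, -1)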
33 Voted combinations (cont'd.). We need to define a loss function for the combination so we can determine which new component h(x; θ) to add and how many votes it should receive. While there are many options for the loss function, we consider here only a simple exponential loss, Loss(y, f(x)) = exp(−y f(x)).
34-36 Modularity, errors, and loss. Consider adding the m-th component to the current combination f_{m-1}(x):
J(α_m, θ_m) = Σ_i exp(−y_i [f_{m-1}(x_i) + α_m h(x_i; θ_m)])
            = Σ_i W_i^{(m-1)} exp(−y_i α_m h(x_i; θ_m)),
where W_i^{(m-1)} = exp(−y_i f_{m-1}(x_i)) depends only on the components already chosen. So at the m-th iteration the new component (and the votes) should optimize a weighted loss (weighted towards mistakes).
37 Empirical exponential loss (cont'd.). To increase modularity we'd like to further decouple the optimization of h(x; θ_m) from the associated votes α_m. To this end we select the h(x; θ_m) that optimizes the rate at which the loss would decrease as a function of α_m.
38-39 Empirical exponential loss (cont'd.). We find that the best component h(x; θ_m) is the one that minimizes the weighted classification error ε_m = Σ_i W̃_i^{(m-1)} [y_i ≠ h(x_i; θ_m)]. We can also normalize the weights, W̃_i^{(m-1)} = W_i^{(m-1)} / Σ_j W_j^{(m-1)}, so that they sum to one. The vote α_m is subsequently chosen to minimize the weighted exponential loss of the updated combination.
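Setting the derivative of that weighted exponential loss with respect to α_m to zero gives the closed form used by AdaBoost on the following slides (a standard derivation; ε_m is the normalized weighted error of the chosen component):

\frac{\partial}{\partial \alpha_m}\sum_i \tilde W_i^{(m-1)} e^{-y_i \alpha_m h(x_i;\theta_m)} = 0
\;\Longrightarrow\;
\alpha_m = \frac{1}{2}\log\frac{1-\epsilon_m}{\epsilon_m}.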
40 The AdaBoost Algorithm
41-46 The AdaBoost Algorithm (built up one line per slide):
Given: (x_1, y_1), ..., (x_m, y_m); x_i ∈ X, y_i ∈ {−1, +1}
Initialise weights D_1(i) = 1/m
For t = 1, ..., T:
- Find h_t = arg min_{h_j ∈ H} ε_j, where ε_j = Σ_{i=1}^m D_t(i) [y_i ≠ h_j(x_i)]
- If ε_t ≥ 1/2 then stop
- Set α_t = ½ log((1 − ε_t) / ε_t)
- Update D_{t+1}(i) = D_t(i) exp(−α_t y_i h_t(x_i)) / Z_t, where Z_t is a normalisation factor
Output the final classifier: H(x) = sign(Σ_{t=1}^T α_t h_t(x))
47-53 [The same slide shown at iterations t = 2, ..., 7 and t = 40, together with a plot of the training error against the boosting step.]
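A compact implementation of the algorithm above; it assumes scikit-learn depth-1 trees as the weak learners (a weighted tree fit stands in for the exact arg min over H) and binary labels coded as -1/+1, and the function names are my own.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, T=50):
    """AdaBoost with decision stumps as weak learners; y must be in {-1, +1}."""
    m = len(X)
    D = np.full(m, 1.0 / m)                         # D_1(i) = 1/m
    stumps, alphas = [], []
    for t in range(T):
        h = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=D)
        pred = h.predict(X)
        eps = np.sum(D * (pred != y))               # weighted training error
        if eps >= 0.5:                              # no better than chance: stop
            break
        alpha = 0.5 * np.log((1 - eps) / max(eps, 1e-12))
        D = D * np.exp(-alpha * y * pred)           # mistakes get heavier
        D /= D.sum()                                # divide by Z_t (normalise)
        stumps.append(h)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(stumps, alphas, X):
    """H(x) = sign(sum_t alpha_t h_t(x))."""
    scores = sum(a * h.predict(X) for a, h in zip(alphas, stumps))
    return np.where(scores >= 0, 1, -1)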
54-56 Reweighting [figure slides]
57 Boosting results: digit recognition [plot of training and test error against the number of rounds]. Boosting is often (but not always): - robust to overfitting - test set error decreases even after the training error is zero. [Schapire, 1989]
58 Application: Detecting Faces. Problem: find faces in a photograph or movie. Weak classifiers: detect light/dark rectangles in the image. Many clever tricks make it extremely fast and accurate. [Viola & Jones]
59 Boosting vs. Logistic Regression.
Logistic regression:
- minimize log loss
- define f(x) = Σ_j w_j x_j, where the features x_j are predefined (a linear classifier)
- jointly optimize over all weights w_0, w_1, w_2, ...
Boosting:
- minimize exp loss
- define f(x) = Σ_t α_t h_t(x), where each h_t(x) is defined dynamically to fit the data (not a linear classifier)
- weights α_t learned incrementally, one per iteration t
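For concreteness, the two losses compared above, written in their standard forms in terms of the margin y f(x) (not reproduced from the slide):

\text{log loss: } \log\!\left(1 + e^{-y f(x)}\right), \qquad \text{exp loss: } e^{-y f(x)}.

Both decrease as the margin grows; the exponential loss punishes large negative margins far more aggressively, which is one reason boosting concentrates its weight on badly misclassified examples.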
60 Boosting vs. Bagging.
Bagging:
- resamples data points
- the weight of each classifier is the same
- only reduces variance
Boosting:
- reweights data points (modifies their distribution)
- each classifier's weight depends on its accuracy
- reduces both bias and variance; the learned rule becomes more complex with iterations
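As a practical footnote (not part of the slides), both methods are available off the shelf; a minimal sketch with scikit-learn on a synthetic dataset, where the data generation and parameter choices are illustrative assumptions:

import numpy as np
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import train_test_split

# toy data: replace with a real dataset
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = np.where(X[:, 0] + X[:, 1] ** 2 > 1, 1, -1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

bag = BaggingClassifier(n_estimators=100).fit(X_tr, y_tr)    # resamples data, equal-weight vote
ada = AdaBoostClassifier(n_estimators=100).fit(X_tr, y_tr)   # reweights data, accuracy-weighted vote
print("bagging accuracy:", bag.score(X_te, y_te))
print("boosting accuracy:", ada.score(X_te, y_te))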