Voting (Ensemble Methods)

Damian Potter
5 years ago

3 Voting (Ensemble Methods)
Instead of learning a single classifier, learn many weak classifiers that are good at different parts of the data.
Output class: (weighted) vote of each classifier
  Classifiers that are most sure will vote with more conviction
  Classifiers will be most sure about a particular part of the space
  On average, do better than a single classifier!
But how?
  Force classifiers to learn about different parts of the input space? Different subsets of the data?
  Weigh the votes of different classifiers?

4 BAGGing = Bootstrap AGGregation (Breiman, 1996)
for $i = 1, 2, \ldots, K$:
  $T_i \leftarrow$ randomly select $M$ training instances with replacement
  $h_i \leftarrow \mathrm{learn}(T_i)$ [Decision Tree, Naive Bayes, ...]
Now combine the $h_i$ together with uniform voting ($w_i = 1/K$ for all $i$)
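
The bagging loop above can be sketched in pure Python. This is only an illustration under stated assumptions: the base learner is a one-dimensional threshold stump (the slide allows any learner), and the names `learn_stump` and `bagging` as well as the toy data are hypothetical.

```python
import random
from collections import Counter

def learn_stump(data):
    """Fit a 1-D threshold stump on (x, y) pairs with y in {-1, +1}.

    Tries every observed x as a threshold, in both orientations,
    and keeps the rule with the fewest training mistakes.
    """
    best = None
    for thr in sorted({x for x, _ in data}):
        for sign in (+1, -1):
            errors = sum(1 for x, y in data
                         if (sign if x > thr else -sign) != y)
            if best is None or errors < best[0]:
                best = (errors, thr, sign)
    _, thr, sign = best
    return lambda x: sign if x > thr else -sign

def bagging(data, K, M, learn=learn_stump, rng=None):
    """Train K classifiers, each on M instances drawn with replacement."""
    rng = rng or random.Random(0)
    models = []
    for _ in range(K):
        sample = [rng.choice(data) for _ in range(M)]  # bootstrap sample T_i
        models.append(learn(sample))                   # h_i <- learn(T_i)

    def vote(x):
        # Uniform voting: every h_i has weight 1/K, so take the majority label.
        counts = Counter(h(x) for h in models)
        return counts.most_common(1)[0][0]

    return vote

# Hypothetical toy data: negatives on the left, positives on the right.
toy = [(0.1, -1), (0.4, -1), (0.5, -1), (0.9, +1), (1.2, +1), (1.5, +1)]
ensemble = bagging(toy, K=25, M=len(toy))
```

Using an odd `K` avoids tied votes for two classes. Each stump sees a slightly different bootstrap sample, so the individual thresholds differ and the vote averages them out.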

6 Decision tree learning algorithm; very similar to the version in earlier slides

7 Shades of blue/red indicate the strength of the vote for a particular classification

9 Fighting the bias-variance tradeoff
Simple (a.k.a. weak) learners are good
  e.g., naïve Bayes, logistic regression, decision stumps (or shallow decision trees)
  Low variance, don't usually overfit
Simple (a.k.a. weak) learners are bad
  High bias, can't solve hard learning problems
Can we make weak learners always good? No! But often yes.

10 Boosting [Schapire, 1989]
Idea: given a weak learner, run it multiple times on (reweighted) training data, then let the learned classifiers vote
On each iteration $t$:
  Weight each training example by how incorrectly it was classified
  Learn a hypothesis $h_t$
  A strength for this hypothesis $\alpha_t$
Final classifier: $h(x) = \mathrm{sign}\left(\sum_i \alpha_i h_i(x)\right)$
Practically useful
Theoretically interesting

11 time = 0; blue/red = class; size of dot = weight; weak learner = decision stump: horizontal or vertical

12 time = 1: this hypothesis has 15% error, and so does this ensemble, since the ensemble contains just this one hypothesis

13 time = 2

14 time = 3

15 time = 13

16 time = 100

17 time = 300: overfitting

18 Learning from weighted data
Consider a weighted dataset
  $D(i)$: weight of the $i$-th training example $(x_i, y_i)$
  Interpretations:
    The $i$-th training example counts as if it occurred $D(i)$ times
    If I were to resample data, I would get more samples of heavier data points
Now, always do weighted calculations:
  e.g., for the MLE for naïve Bayes, redefine Count(Y = y) to be a weighted count:
  $\mathrm{Count}(Y = y) = \sum_{j=1}^{n} D(j)\,\delta(Y_j = y)$
  Setting $D(j) = 1$ (or any constant value!) for all $j$ recreates the unweighted case
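
A minimal sketch of the weighted count, assuming labels and weights are given as parallel lists. The function names and the spam/ham labels are made up for illustration.

```python
def weighted_count(labels, weights, y):
    """Count(Y = y) = sum_j D(j) * [Y_j == y]."""
    return sum(d for label, d in zip(labels, weights) if label == y)

def weighted_prior(labels, weights, y):
    """Weighted MLE of P(Y = y): weighted count divided by total weight."""
    return weighted_count(labels, weights, y) / sum(weights)

labels = ["spam", "ham", "spam", "ham"]
uniform = [1.0, 1.0, 1.0, 1.0]   # D(j) = 1 recovers the ordinary count
constant = [5.0, 5.0, 5.0, 5.0]  # any constant weight gives the same estimate
skewed = [3.0, 1.0, 1.0, 1.0]    # first example counts as if it occurred 3 times
```

With `uniform` or `constant` weights the estimated prior for "spam" is 0.5 either way, since the constant cancels in the ratio. With `skewed` it rises to 4/6.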

19 Given: $(x_1, y_1), \ldots, (x_m, y_m)$ where $x_i \in \mathbb{R}^n$, $y_i \in \{-1, +1\}$
Initialize: $D_1(i) = 1/m$, for $i = 1, \ldots, m$
For $t = 1 \ldots T$:
  Train base classifier $h_t(x)$ using $D_t$
  Choose $\alpha_t$ (How? Many possibilities. Will see one shortly!)
  Update, for $i = 1 \ldots m$: $D_{t+1}(i) \propto D_t(i)\exp(-\alpha_t y_i h_t(x_i))$
    with normalization constant: $Z_t = \sum_{i=1}^{m} D_t(i)\exp(-\alpha_t y_i h_t(x_i))$
Output final classifier: $H(x) = \mathrm{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right)$
Why? Reweight the data: examples $i$ that are misclassified will have higher weights!
  $y_i h_t(x_i) > 0$ → $h_t$ correct
  $y_i h_t(x_i) < 0$ → $h_t$ wrong
  $h_t$ correct, $\alpha_t > 0$ → $D_{t+1}(i) < D_t(i)$
  $h_t$ wrong, $\alpha_t > 0$ → $D_{t+1}(i) > D_t(i)$
Final result: linear sum of "base" or "weak" classifier outputs.

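20 Given: $(x_1, y_1), \ldots, (x_m, y_m)$ where $x_i \in \mathbb{R}^n$, $y_i \in \{-1, +1\}$
Initialize: $D_1(i) = 1/m$, for $i = 1, \ldots, m$
For $t = 1 \ldots T$:
  Train base classifier $h_t(x)$ using $D_t$
  Choose $\alpha_t$
  Update, for $i = 1 \ldots m$: $D_{t+1}(i) \propto D_t(i)\exp(-\alpha_t y_i h_t(x_i))$
$\epsilon_t = \sum_{i=1}^{m} D_t(i)\,\delta(h_t(x_i) \neq y_i)$
$\epsilon_t$: error of $h_t$, weighted by $D_t$; $0 \le \epsilon_t \le 1$
$\alpha_t$:
  No errors: $\epsilon_t = 0$ → $\alpha_t = \infty$
  All errors: $\epsilon_t = 1$ → $\alpha_t = -\infty$
  Random: $\epsilon_t = 0.5$ → $\alpha_t = 0$
[Plot: $\alpha_t$ as a function of $\epsilon_t$]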

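21 What $\alpha_t$ to choose for hypothesis $h_t$? [Schapire, 1989]
Idea: choose $\alpha_t$ to minimize a bound on training error!
$\frac{1}{m}\sum_{i=1}^{m} \delta(H(x_i) \neq y_i) \le \frac{1}{m}\sum_{i=1}^{m} \exp(-y_i f(x_i))$
Where $f(x) = \sum_t \alpha_t h_t(x)$ and $H(x) = \mathrm{sign}(f(x))$
[Plot: $\exp(-y_i f(x_i))$ and $\delta(H(x_i) \neq y_i)$ as functions of $y_i f(x_i)$]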

22 What $\alpha_t$ to choose for hypothesis $h_t$? [Schapire, 1989]
Idea: choose $\alpha_t$ to minimize a bound on training error!
$\frac{1}{m}\sum_{i=1}^{m} \delta(H(x_i) \neq y_i) \le \frac{1}{m}\sum_{i=1}^{m} \exp(-y_i f(x_i)) = \prod_t Z_t$
Where $f(x) = \sum_t \alpha_t h_t(x)$ and $H(x) = \mathrm{sign}(f(x))$
And $Z_t = \sum_{i=1}^{m} D_t(i)\exp(-\alpha_t y_i h_t(x_i))$
This equality isn't obvious! Can be shown with algebra (telescoping sums)!
If we minimize $\prod_t Z_t$, we minimize an upper bound on our training error!
We can tighten this bound greedily, by choosing $\alpha_t$ and $h_t$ on each iteration to minimize $Z_t$.
$h_t$ is estimated as a black box, but can we solve for $\alpha_t$?
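
One way to see the equality, sketched here using only the definitions on these slides ($D_1(i) = 1/m$, the update normalized by $Z_t$, and $f(x) = \sum_t \alpha_t h_t(x)$, the argument of the sign in the final classifier), is to unroll the weight update over all $T$ rounds and then use the fact that the final weights sum to one:

```latex
\begin{align*}
D_{T+1}(i)
  &= D_1(i)\prod_{t=1}^{T}\frac{\exp(-\alpha_t y_i h_t(x_i))}{Z_t}
   = \frac{1}{m}\,\frac{\exp\!\big(-y_i \sum_{t=1}^{T}\alpha_t h_t(x_i)\big)}{\prod_{t=1}^{T} Z_t}
   = \frac{1}{m}\,\frac{\exp(-y_i f(x_i))}{\prod_{t=1}^{T} Z_t} \\
1 = \sum_{i=1}^{m} D_{T+1}(i)
  &\;\Longrightarrow\;
  \frac{1}{m}\sum_{i=1}^{m}\exp(-y_i f(x_i)) = \prod_{t=1}^{T} Z_t
\end{align*}
```

The per-round factors $1/Z_t$ accumulate into the product, which is presumably the telescoping the slide refers to.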

23 Summary: choose $\alpha_t$ to minimize error bound [Schapire, 1989]
We can squeeze this bound by choosing $\alpha_t$ on each iteration to minimize $Z_t$:
$Z_t = \sum_{i=1}^{m} D_t(i)\exp(-\alpha_t y_i h_t(x_i))$
$\epsilon_t = \sum_{i=1}^{m} D_t(i)\,\delta(h_t(x_i) \neq y_i)$
For boolean Y: differentiate, set equal to 0, and there is a closed-form solution [Freund & Schapire 97]:
$\alpha_t = \frac{1}{2}\ln\left(\frac{1 - \epsilon_t}{\epsilon_t}\right)$
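
As a sanity check on the closed form, the sketch below compares $\alpha_t = \frac{1}{2}\ln((1-\epsilon_t)/\epsilon_t)$ with a brute-force minimizer of $Z_t$. It assumes a classifier with outputs in $\{-1, +1\}$, so that $Z_t$ splits into the correctly classified weight $(1-\epsilon_t)$ scaled by $e^{-\alpha}$ and the misclassified weight $\epsilon_t$ scaled by $e^{\alpha}$. The function names and grid settings are arbitrary choices.

```python
import math

def z(alpha, eps):
    """Z_t for a +/-1 classifier with weighted error eps.

    Correct examples (total weight 1 - eps) are scaled by exp(-alpha),
    misclassified ones (total weight eps) by exp(+alpha).
    """
    return (1.0 - eps) * math.exp(-alpha) + eps * math.exp(alpha)

def alpha_closed_form(eps):
    """Minimizer of Z_t: alpha_t = 1/2 * ln((1 - eps) / eps)."""
    return 0.5 * math.log((1.0 - eps) / eps)

def alpha_grid_search(eps, lo=-5.0, hi=5.0, steps=100001):
    """Brute-force minimizer of z(., eps) on an evenly spaced grid."""
    grid = (lo + (hi - lo) * k / (steps - 1) for k in range(steps))
    return min(grid, key=lambda a: z(a, eps))

for eps in (0.1, 0.25, 1.0 / 3.0, 0.5):
    print(eps, round(alpha_closed_form(eps), 3), round(alpha_grid_search(eps), 3))
```

The two columns should agree up to the grid resolution, and a classifier at chance ($\epsilon_t = 0.5$) gets weight zero.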

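24 Given: $(x_1, y_1), \ldots, (x_m, y_m)$ where $x_i \in \mathbb{R}^n$, $y_i \in \{-1, +1\}$
Initialize: $D_1(i) = 1/m$, for $i = 1, \ldots, m$
For $t = 1 \ldots T$:
  Train base classifier $h_t(x)$ using $D_t$
  Choose $\alpha_t$, using $\epsilon_t = \sum_{i=1}^{m} D_t(i)\,\delta(h_t(x_i) \neq y_i)$
  Update, for $i = 1 \ldots m$: $D_{t+1}(i) \propto D_t(i)\exp(-\alpha_t y_i h_t(x_i))$
    with normalization constant: $Z_t = \sum_{i=1}^{m} D_t(i)\exp(-\alpha_t y_i h_t(x_i))$
Output final classifier: $H(x) = \mathrm{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right)$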
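
The loop above can be sketched in pure Python as follows. Several things here are assumptions rather than part of the slides: "train base classifier" is replaced by picking the lowest-weighted-error member of a fixed pool of 1-D threshold stumps, $\alpha_t$ uses the closed form $\frac{1}{2}\ln((1-\epsilon_t)/\epsilon_t)$, the early exits for $\epsilon_t = 0$ and $\epsilon_t \ge 0.5$ are practical guards, and the ten-point dataset is a toy example.

```python
import math

def weighted_error(h, D, X, y):
    """eps = sum_i D(i) * [h(x_i) != y_i]."""
    return sum(d for d, xi, yi in zip(D, X, y) if h(xi) != yi)

def adaboost(X, y, candidates, T):
    """Return a list of (alpha_t, h_t) pairs; labels y must be in {-1, +1}."""
    m = len(X)
    D = [1.0 / m] * m                                   # D_1(i) = 1/m
    ensemble = []
    for _ in range(T):
        # Stand-in for "train base classifier h_t using D_t".
        h_t = min(candidates, key=lambda h: weighted_error(h, D, X, y))
        eps = weighted_error(h_t, D, X, y)
        if eps == 0.0:
            return [(1.0, h_t)]      # perfect stump: alpha would be infinite
        if eps >= 0.5:
            break                    # no candidate beats random guessing
        alpha = 0.5 * math.log((1.0 - eps) / eps)
        ensemble.append((alpha, h_t))
        # D_{t+1}(i) is proportional to D_t(i) * exp(-alpha_t * y_i * h_t(x_i)).
        D = [d * math.exp(-alpha * yi * h_t(xi)) for d, xi, yi in zip(D, X, y)]
        Z = sum(D)                                      # normalization constant Z_t
        D = [d / Z for d in D]
    return ensemble

def predict(ensemble, x):
    """H(x) = sign(sum_t alpha_t * h_t(x))."""
    score = sum(alpha * h(x) for alpha, h in ensemble)
    return 1 if score > 0 else -1

def make_stumps(thresholds):
    """Both orientations of a threshold test for every threshold."""
    stumps = []
    for v in thresholds:
        stumps.append(lambda x, v=v: 1 if x < v else -1)
        stumps.append(lambda x, v=v: 1 if x > v else -1)
    return stumps

# Toy 1-D data that no single stump can separate.
X = list(range(10))
y = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
stumps = make_stumps([k - 0.5 for k in range(11)])
model = adaboost(X, y, stumps, T=3)
train_errors = sum(predict(model, xi) != yi for xi, yi in zip(X, y))
```

On this toy set the best single stump misclassifies three of the ten points, while the three-round weighted vote classifies all of them correctly.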

25 Use decision stumps as base classifier
Initialize: $D_1(i) = 1/m$, for $i = 1, \ldots, m$
For $t = 1 \ldots T$:
  Train base classifier $h_t(x)$ using $D_t$
  Choose $\alpha_t$, using $\epsilon_t = \sum_{i=1}^{m} D_t(i)\,\delta(h_t(x_i) \neq y_i)$
  Update, for $i = 1 \ldots m$: $D_{t+1}(i) \propto D_t(i)\exp(-\alpha_t y_i h_t(x_i))$
Output final classifier: $H(x) = \mathrm{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right)$
[Table of the three training points ($x_1$, $y$) and plot over $x_1$]
Initial: $D_1 = [D_1(1), D_1(2), D_1(3)] = [.33, .33, .33]$
t=1: Train stump [work omitted, breaking ties randomly]
  $h_1(x) = +1$ if $x_1 > 0.5$, $-1$ otherwise
  $\epsilon_1 = \sum_i D_1(i)\,\delta(h_1(x_i) \neq y_i) = 0.33$
  $\alpha_1 = (1/2)\ln((1-\epsilon_1)/\epsilon_1) = 0.5\ln(2) = 0.35$
  $D_2(1) \propto D_1(1)\exp(-\alpha_1 y_1 h_1(x_1)) = 0.33\exp(0.35) = 0.47$
  $D_2(2) \propto D_1(2)\exp(-\alpha_1 y_2 h_1(x_2)) = 0.33\exp(-0.35) = 0.23$
  $D_2(3) \propto D_1(3)\exp(-\alpha_1 y_3 h_1(x_3)) = 0.33\exp(-0.35) = 0.23$
  After normalization: $D_2 = [D_2(1), D_2(2), D_2(3)] = [0.5, 0.25, 0.25]$
  $H(x) = \mathrm{sign}(0.35\,h_1(x))$
t=2: Continues on next slide!

26 $D_2 = [D_2(1), D_2(2), D_2(3)] = [0.5, 0.25, 0.25]$
t=2: Train stump [work omitted; different stump because of new data weights D; breaking ties opportunistically (will discuss at end)]
  $h_2(x) = +1$ if $x_1 < 1.5$, $-1$ otherwise
  $\epsilon_2 = \sum_i D_2(i)\,\delta(h_2(x_i) \neq y_i) = 0.25$
  $\alpha_2 = (1/2)\ln((1-\epsilon_2)/\epsilon_2) = 0.5\ln(3) = 0.55$
  $D_3(1) \propto D_2(1)\exp(-\alpha_2 y_1 h_2(x_1)) = 0.5\exp(-0.55) = 0.29$
  $D_3(2) \propto D_2(2)\exp(-\alpha_2 y_2 h_2(x_2)) = 0.25\exp(0.55) = 0.43$
  $D_3(3) \propto D_2(3)\exp(-\alpha_2 y_3 h_2(x_3)) = 0.25\exp(-0.55) = 0.14$
  After normalization: $D_3 = [D_3(1), D_3(2), D_3(3)] = [0.33, 0.5, 0.17]$
  $H(x) = \mathrm{sign}(0.35\,h_1(x) + 0.55\,h_2(x))$
    $h_1(x) = +1$ if $x_1 > 0.5$, $-1$ otherwise
    $h_2(x) = +1$ if $x_1 < 1.5$, $-1$ otherwise
t=3: Continues on next slide!

27 $D_3 = [D_3(1), D_3(2), D_3(3)] = [0.33, 0.5, 0.17]$
t=3: Train stump [work omitted; different stump because of new data weights D; breaking ties opportunistically (will discuss at end)]
  $h_3(x) = +1$ if $x_1 < -0.5$, $-1$ otherwise
  $\epsilon_3 = \sum_i D_3(i)\,\delta(h_3(x_i) \neq y_i) = 0.17$
  $\alpha_3 = (1/2)\ln((1-\epsilon_3)/\epsilon_3) = 0.5\ln(4.88) = 0.79$
Stop! How did we know to stop?
$H(x) = \mathrm{sign}(0.35\,h_1(x) + 0.55\,h_2(x) + 0.79\,h_3(x))$
  $h_1(x) = +1$ if $x_1 > 0.5$, $-1$ otherwise
  $h_2(x) = +1$ if $x_1 < 1.5$, $-1$ otherwise
  $h_3(x) = +1$ if $x_1 < -0.5$, $-1$ otherwise
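
The three rounds can be replayed numerically. The data table did not survive in this transcript, so the three points below are an assumption: they are one dataset consistent with the stumps and weighted errors quoted on the slides, not necessarily the original one. The stumps are hard-coded from the slides rather than learned.

```python
import math

# Hypothetical data, chosen only to be consistent with the quoted stumps/errors.
X = [-1.0, 0.0, 1.0]
y = [+1, -1, +1]

# The three stumps exactly as stated on the slides.
stumps = [
    lambda x: +1 if x > 0.5 else -1,    # h_1
    lambda x: +1 if x < 1.5 else -1,    # h_2
    lambda x: +1 if x < -0.5 else -1,   # h_3
]

D = [1.0 / 3.0] * 3                     # D_1
alphas = []
history = []                            # (eps_t, alpha_t, D_t) per round
for h in stumps:
    eps = sum(d for d, xi, yi in zip(D, X, y) if h(xi) != yi)
    alpha = 0.5 * math.log((1.0 - eps) / eps)
    alphas.append(alpha)
    history.append((eps, alpha, list(D)))
    # Reweight and normalize to obtain D_{t+1}.
    D = [d * math.exp(-alpha * yi * h(xi)) for d, xi, yi in zip(D, X, y)]
    Z = sum(D)
    D = [d / Z for d in D]

def H(x):
    """Final classifier: sign of the alpha-weighted vote."""
    score = sum(a * h(x) for a, h in zip(alphas, stumps))
    return 1 if score > 0 else -1

for t, (eps, alpha, D_t) in enumerate(history, start=1):
    print(f"t={t}: eps={eps:.2f} alpha={alpha:.2f} D={[round(d, 2) for d in D_t]}")
```

Under this assumed dataset the weights follow the slides ([0.33, 0.33, 0.33], then [0.5, 0.25, 0.25], then [0.33, 0.5, 0.17]) and the final vote classifies all three points correctly. Unrounded arithmetic gives $\epsilon_3 = 1/6$ and $\alpha_3 = 0.5\ln 5 \approx 0.80$; the slide's 0.79 comes from plugging in the rounded value 0.17.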

28 Strong, weak classifiers
If each classifier is (at least slightly) better than random: $\epsilon_t < 0.5$
Another bound on error:
$\frac{1}{m}\sum_{i=1}^{m} \delta(H(x_i) \neq y_i) \le \prod_t Z_t \le \exp\left(-2\sum_{t=1}^{T} (1/2 - \epsilon_t)^2\right)$
What does this imply about the training error?
  Will reach zero!
  Will get there exponentially fast!
Is it hard to achieve better than random training error?
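
A small numeric sketch of this bound. It assumes each $\alpha_t$ is set to $\frac{1}{2}\ln((1-\epsilon_t)/\epsilon_t)$, in which case $Z_t = 2\sqrt{\epsilon_t(1-\epsilon_t)}$. The helper `rounds_needed` is a hypothetical name; it uses the fact that the training error is a multiple of $1/m$, so a bound below $1/m$ forces it to zero.

```python
import math

def z_at_optimal_alpha(eps):
    """Z_t when alpha_t = 1/2 ln((1-eps)/eps): Z_t = 2*sqrt(eps*(1-eps))."""
    return 2.0 * math.sqrt(eps * (1.0 - eps))

def product_bound(errors):
    """prod_t Z_t for a sequence of weighted errors eps_t."""
    return math.prod(z_at_optimal_alpha(e) for e in errors)

def exponential_bound(errors):
    """exp(-2 * sum_t (1/2 - eps_t)^2)."""
    return math.exp(-2.0 * sum((0.5 - e) ** 2 for e in errors))

def rounds_needed(eps, m):
    """Smallest T with exp(-2*T*(1/2 - eps)^2) < 1/m, assuming every round
    has the same weighted error eps < 0.5."""
    gamma = 0.5 - eps
    return math.floor(math.log(m) / (2.0 * gamma ** 2)) + 1

errors = [0.33, 0.25, 0.17]          # the weighted errors from the worked example
print(product_bound(errors), exponential_bound(errors))
print(rounds_needed(0.4, 1000))      # weak learner only 10 points better than chance
```

Even a learner with 40% weighted error on every round drives the bound below $1/m$ for $m = 1000$ in a few hundred rounds, because the bound shrinks geometrically in $T$.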

29 Boosting results: digit recognition [Schapire, 1989]
[Plot: test error and training error curves]
Boosting:
  Seems to be robust to overfitting
  Test error can decrease even after training error is zero!

30 Boosting generalization error bound [Freund & Schapire, 1996]
Constants:
  T: number of boosting rounds
    Higher T → looser bound
  d: measures complexity of classifiers
    Higher d → bigger hypothesis space → looser bound
  m: number of training examples
    More data → tighter bound

31 Boosting generalization error bound [Freund & Schapire, 1996]
Constants:
  T: number of boosting rounds
    Higher T → looser bound. What does this imply?
  d: VC dimension of weak learner, measures complexity of classifier
    Higher d → bigger hypothesis space → looser bound
  m: number of training examples
    More data → tighter bound
Theory does not match practice:
  Robust to overfitting
  Test set error decreases even after training error is zero

32 Boosting: Experimental Results [Freund & Schapire, 1996]
Comparison of C4.5, boosting C4.5, and boosting decision stumps (depth-1 trees) on 27 benchmark datasets
[Plots: error comparisons; axes labeled "error"]

33 Boosting and Logistic Regression
Logistic regression is equivalent to minimizing log loss: $\sum_{i=1}^{m} \ln(1 + \exp(-y_i f(x_i)))$
Boosting minimizes a similar loss function: $\sum_{i=1}^{m} \exp(-y_i f(x_i))$
Both are smooth approximations of 0/1 loss!
[Plot: $\exp(-y_i f(x_i))$ and $\delta(H(x_i) \neq y_i)$ as functions of $y_i f(x_i)$]
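
The three losses can be compared directly as functions of the margin $y_i f(x_i)$. This is a minimal sketch; the function names are illustrative, and counting a margin of exactly zero as an error is a convention chosen here.

```python
import math

def zero_one_loss(margin):
    """delta(H(x) != y): 1 when the margin y*f(x) is not positive."""
    return 1.0 if margin <= 0 else 0.0

def exp_loss(margin):
    """Boosting's loss: exp(-y*f(x))."""
    return math.exp(-margin)

def log_loss(margin):
    """Logistic regression's loss: ln(1 + exp(-y*f(x)))."""
    return math.log(1.0 + math.exp(-margin))

margins = (-2.0, -1.0, 0.0, 1.0, 2.0)
for margin in margins:
    print(f"{margin:+.1f}  0/1={zero_one_loss(margin):.0f}  "
          f"exp={exp_loss(margin):.3f}  log={log_loss(margin):.3f}")
```

Both smooth losses decrease as the margin grows. The exponential loss penalizes badly misclassified points far more heavily than the log loss does, and it is an upper bound on the 0/1 loss at every margin, which is what the training-error bound relies on.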

34 Logistic regression and Boosting
Logistic regression:
  Minimize loss fn $\sum_{i=1}^{m} \ln(1 + \exp(-y_i f(x_i)))$
  Define $f(x) = w_0 + \sum_j w_j x_j$, where each feature $x_j$ is predefined
  Jointly optimize parameters $w_0, w_1, \ldots, w_n$ via gradient descent on this loss
Boosting:
  Minimize loss fn $\sum_{i=1}^{m} \exp(-y_i f(x_i))$
  Define $f(x) = \sum_t \alpha_t h_t(x)$, where $h_t(x)$ is learned to fit the data
  Weights $\alpha_t$ learned incrementally (a new one for each training pass)

35 What you need to know about Boosting
Combine weak classifiers to get a very strong classifier
  Weak classifier: slightly better than random on training data
  Resulting very strong classifier can get zero training error
AdaBoost algorithm
Boosting v. Logistic Regression
  Both are linear models; boosting "learns" features
  Similar loss functions
  Single optimization (LR) v. incrementally improving classification (B)
Most popular application of Boosting: boosted decision stumps!
  Very simple to implement, very effective classifier
