COMS 4771 Lecture 12: Boosting
What is boosting?

Boosting: using a learning algorithm that provides rough rules of thumb to construct a very accurate predictor.

Motivation: it is easy to construct classification rules that are correct more often than not (e.g., "If 5% of the e-mail's characters are dollar signs, then it's spam."), but hard to find a single rule that is almost always correct.

Basic idea:
- Input: training data $S$, weak learning algorithm $A$.
- For $t = 1, 2, \ldots, T$:
  1. Choose a subset of examples $S_t \subseteq S$ (or a distribution over $S$).
  2. Call the weak learning algorithm to get a classifier $f_t := A(S_t)$.
- Return a weighted majority vote over $f_1, f_2, \ldots, f_T$.
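To make the loop concrete, here is a minimal Python sketch of the generic scheme. This is not yet AdaBoost: resampling by $D$ and doubling the weight of mistakes are placeholder choices, and the vote is unweighted; `weak_learner(X, y)` is an assumed interface returning a predictor `h` with `h(X)` in $\{\pm 1\}^n$.

```python
import numpy as np

def boost(X, y, weak_learner, T, seed=0):
    """Generic boosting loop (a sketch, not AdaBoost yet).
    weak_learner(X, y) is an assumed interface returning a callable
    predictor h with h(X) giving labels in {-1, +1}."""
    rng = np.random.default_rng(seed)
    n = len(y)
    D = np.full(n, 1.0 / n)                  # distribution over S
    fs = []
    for t in range(T):
        idx = rng.choice(n, size=n, p=D)     # S_t: resample S according to D
        f_t = weak_learner(X[idx], y[idx])
        # Placeholder reweighting: double the weight of mistakes.
        # (AdaBoost derives the principled factor; see below.)
        D = np.where(f_t(X) != y, 2.0 * D, D)
        D /= D.sum()
        fs.append(f_t)
    def f_final(Xq):                         # unweighted majority vote here;
        return np.sign(sum(f(Xq) for f in fs))  # AdaBoost supplies the weights
    return f_final
```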
Boosting: history

1984  Valiant and Kearns ask whether boosting is theoretically possible (formalized in the PAC learning model).
1989  Schapire creates the first boosting algorithm, solving the open problem of Valiant and Kearns.
1990  Freund creates an optimal boosting algorithm (Boost-by-Majority).
1992  Drucker, Schapire, and Simard empirically observe practical limitations of early boosting algorithms.
1995  Freund and Schapire create AdaBoost, a boosting algorithm with practical advantages over early boosting algorithms.

Winner of the 2004 ACM Paris Kanellakis Award: "For their seminal work and distinguished contributions [...] to the development of the theory and practice of boosting, a general and provably effective method of producing arbitrarily accurate prediction rules by combining weak learning rules"; specifically, for AdaBoost, which can be used to significantly reduce the error of algorithms used in statistical analysis, spam filtering, fraud detection, optical character recognition, and market segmentation, among other applications.
AdaBoost

Input:
- Training data $S$ from $X \times \{\pm 1\}$.
- Weak learning algorithm $A$ (for importance-weighted classification).

1: Initialize $D_1(x, y) := 1/|S|$ for each $(x, y) \in S$ (a probability distribution).
2: for $t = 1, 2, \ldots, T$ do
3:   Give $D_t$-weighted examples to $A$; get back $f_t : X \to \{\pm 1\}$.
4:   Update weights:
       $z_t := \sum_{(x,y) \in S} D_t(x, y)\, y f_t(x) \in [-1, +1]$,
       $\alpha_t := \frac{1}{2} \ln \frac{1 + z_t}{1 - z_t} \in \mathbb{R}$  (weight of $f_t$),
       $D_{t+1}(x, y) \propto D_t(x, y) \exp(-\alpha_t\, y f_t(x))$ for each $(x, y) \in S$.
5: end for
6: return Final classifier $f_{\text{final}}(x) := \mathrm{sign}\big(\sum_{t=1}^{T} \alpha_t f_t(x)\big)$.
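A runnable sketch of the reweighting version of AdaBoost as stated above. The interface `weak_fit(X, y, D)` is an assumption: it trains on importance weights $D$ and returns a callable predictor.

```python
import numpy as np

def adaboost(X, y, weak_fit, T):
    """AdaBoost as on this slide. X: (n, d) array; y: labels in {-1, +1}.
    weak_fit(X, y, D) is an assumed interface: it trains using importance
    weights D and returns a callable f with f(X) in {-1, +1}^n."""
    n = len(y)
    D = np.full(n, 1.0 / n)                # step 1: uniform distribution
    fs, alphas = [], []
    for t in range(T):                     # step 2
        f_t = weak_fit(X, y, D)            # step 3
        # step 4: z_t = D-weighted correlation of predictions with labels
        z = float(np.sum(D * y * f_t(X)))
        z = np.clip(z, -1 + 1e-12, 1 - 1e-12)  # guard: f_t perfect/anti-perfect
        alpha = 0.5 * np.log((1 + z) / (1 - z))
        D = D * np.exp(-alpha * y * f_t(X))
        D /= D.sum()                       # renormalize to a distribution
        fs.append(f_t)
        alphas.append(alpha)
    def f_final(Xq):                       # step 6: weighted majority vote
        return np.sign(sum(a * f(Xq) for a, f in zip(alphas, fs)))
    return f_final, fs, alphas
```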
Interpretation

Interpreting $z_t$: if $\Pr_{(X,Y) \sim D_t}[f_t(X) = Y] = \frac{1}{2} + \gamma_t$ for some $\gamma_t \in [-1/2, +1/2]$, then
$$z_t = \sum_{(x,y) \in S} D_t(x, y)\, y f_t(x) = 2\gamma_t \in [-1, +1].$$

- $z_t = 0$: random guessing w.r.t. $D_t$.
- $z_t > 0$: better than random guessing w.r.t. $D_t$.
- $z_t < 0$: better off using the opposite of $f_t$'s predictions.
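To see why $z_t = 2\gamma_t$: since $y f_t(x)$ is $+1$ on correct predictions and $-1$ on mistakes,

```latex
z_t = \sum_{(x,y) \in S} D_t(x,y)\, y f_t(x)
    = \Pr_{D_t}[f_t(X) = Y] - \Pr_{D_t}[f_t(X) \neq Y]
    = \Big(\tfrac{1}{2} + \gamma_t\Big) - \Big(\tfrac{1}{2} - \gamma_t\Big)
    = 2\gamma_t .
```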
Interpretation

Classifier weights: $\alpha_t = \frac{1}{2} \ln \frac{1 + z_t}{1 - z_t}$.

[Plot: $\alpha_t$ as a function of $z_t \in (-1, +1)$; $\alpha_t = 0$ at $z_t = 0$, and $\alpha_t \to \pm\infty$ as $z_t \to \pm 1$.]

Example weights: $D_{t+1}(x, y) \propto D_t(x, y) \exp(-\alpha_t\, y f_t(x))$.
Example: AdaBoost with decision stumps

Weak learning algorithm $A$: ERM with $F$ = decision stumps on $\mathbb{R}^2$, i.e., axis-aligned threshold functions $x \mapsto \mathrm{sign}(v \cdot (x_i - t))$ for a coordinate $i$, threshold $t$, and sign $v \in \{\pm 1\}$. It is straightforward to handle importance weights in ERM. (Example from Figures 1.1 and 1.2 of the Schapire & Freund text.)
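A minimal weighted-ERM stump learner compatible with the `adaboost` sketch above; a brute-force sketch over coordinates, midpoint thresholds, and signs, not an optimized implementation.

```python
import numpy as np

def stump_fit(X, y, D):
    """Weighted ERM over decision stumps sign(v * (x[i] - t)):
    return the stump with least D-weighted training error."""
    n, d = X.shape
    best = (np.inf, 0, 0.0, 1)                # (weighted error, i, t, v)
    for i in range(d):
        vals = np.sort(np.unique(X[:, i]))
        thresholds = np.concatenate(([vals[0] - 1.0],
                                     (vals[:-1] + vals[1:]) / 2.0,
                                     [vals[-1] + 1.0]))
        for t in thresholds:
            for v in (+1, -1):
                pred = v * np.sign(X[:, i] - t)
                pred[pred == 0] = v           # break ties at the threshold
                err = np.sum(D[pred != y])
                if err < best[0]:
                    best = (err, i, t, v)
    _, i, t, v = best
    def h(Xq):
        p = v * np.sign(Xq[:, i] - t)
        p[p == 0] = v
        return p
    return h
```

With these two pieces, `f_final, fs, alphas = adaboost(X, y, stump_fit, T=3)` runs the kind of example shown on the next slides.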
Example: execution of AdaBoost

Round 1: distribution $D_1$, classifier $f_1$; $z_1 = 0.40$, $\alpha_1 = 0.42$.
Round 2: distribution $D_2$, classifier $f_2$; $z_2 = 0.58$, $\alpha_2 = 0.65$.
Round 3: distribution $D_3$, classifier $f_3$; $z_3 = 0.72$, $\alpha_3 = 0.92$.

[Figures: the training set under each distribution $D_t$ with the chosen stump $f_t$; misclassified points receive larger weight in $D_{t+1}$.]
Example: final classifier from AdaBoost

[Figure: the three stumps $f_1, f_2, f_3$ and the resulting combined decision boundary.]

Final classifier: $f_{\text{final}}(x) = \mathrm{sign}(0.42 f_1(x) + 0.65 f_2(x) + 0.92 f_3(x))$. (Zero training error!)
Empirical results

Test error rates of C4.5 and AdaBoost on several classification problems. Each point represents a single classification problem/dataset from the UCI repository.

[Scatter plots: C4.5 test error (%) vs. AdaBoost test error (%), both axes 0 to 30; one panel for AdaBoost+stumps, one for AdaBoost+C4.5.]

C4.5 is a popular algorithm for learning decision trees. (Figure 1.3 from the Schapire & Freund text.)
Training error of final classifier

Recall $\gamma_t := \Pr_{(X,Y) \sim D_t}[f_t(X) = Y] - 1/2 = z_t/2$.

Training error of the final classifier from AdaBoost:
$$\mathrm{err}(f_{\text{final}}, S) \;\le\; \exp\Big(-2 \sum_{t=1}^{T} \gamma_t^2\Big).$$

If the average $\bar{\gamma}^2 := \frac{1}{T} \sum_{t=1}^{T} \gamma_t^2 > 0$, then the training error is $\le \exp(-2 \bar{\gamma}^2 T)$.

AdaBoost = Adaptive Boosting: some $\gamma_t$ could be small (or even negative!); only the overall average $\bar{\gamma}^2$ matters.

What about the true error?
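To get a feel for the rate, suppose every round achieves edge $\gamma_t = 0.1$ (an illustrative number, not from the slide). Then:

```latex
\mathrm{err}(f_{\text{final}}, S)
  \;\le\; \exp\Big(-2 \sum_{t=1}^{T} (0.1)^2\Big)
  \;=\; e^{-0.02\,T},
\qquad \text{e.g. } T = 500 \;\Longrightarrow\; e^{-10} \approx 4.5 \times 10^{-5}.
```

Since the training error is a multiple of $1/|S|$, once this bound drops below $1/|S|$ the training error is exactly zero.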
Combining classifiers

Let $F$ be the function class used by the weak learning algorithm $A$. The function class used by AdaBoost is
$$F_T := \Big\{ x \mapsto \mathrm{sign}\Big(\sum_{t=1}^{T} \alpha_t f_t(x)\Big) : f_1, f_2, \ldots, f_T \in F,\ \alpha_1, \alpha_2, \ldots, \alpha_T \in \mathbb{R} \Big\},$$
i.e., linear combinations of $T$ functions from $F$. The complexity of $F_T$ grows linearly with $T$.

Theoretical guarantee: with high probability over the choice of the i.i.d. sample $S$,
$$\mathrm{err}(f) \;\le\; \mathrm{err}(f, S) + O\Bigg(\sqrt{\frac{T \log |F|}{|S|}}\Bigg) \quad \text{for all } f \in F_T.$$

Theory suggests a danger of over-fitting when $T$ is very large. Indeed, this does happen sometimes... but often not!
A typical run of boosting

AdaBoost+C4.5 on the "letters" dataset.

[Plot: error (%) vs. number of rounds $T$, from 10 to 1000 on a log scale; C4.5 test error is a flat baseline, AdaBoost test error falls below it, and AdaBoost training error reaches zero. The number of nodes across all decision trees in $f_{\text{final}}$ is more than $2 \times 10^6$.]

Training error is zero after just five rounds, but test error continues to decrease, even up to 1000 rounds! (Figure 1.7 from the Schapire & Freund text.)
Boosting the margin

Final classifier from AdaBoost:
$$f_{\text{final}}(x) = \mathrm{sign}\Bigg( \underbrace{\frac{\sum_{t=1}^{T} \alpha_t f_t(x)}{\sum_{t=1}^{T} \alpha_t}}_{g(x) \,\in\, [-1,+1]} \Bigg).$$

Call $y\, g(x) \in [-1, +1]$ the margin achieved on the example $(x, y)$.

New theory [Schapire, Freund, Bartlett, and Lee, 1998]:
- Larger margins imply better generalization error, independent of $T$.
- AdaBoost tends to increase the margins on training examples.
- (Similar to, but not the same as, SVM margins.)

On the "letters" dataset:

                        T = 5    T = 100    T = 1000
  training error         0.0%      0.0%       0.0%
  test error             8.4%      3.3%       3.1%
  % margins <= 0.5       7.7%      0.0%       0.0%
  min. margin            0.14      0.52       0.55
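Computing these margins is direct given the per-round classifiers and weights; a sketch matching the `adaboost` function above, where `fs` and `alphas` are the lists it returns:

```python
import numpy as np

def margins(X, y, fs, alphas):
    """Normalized margin y*g(x) in [-1, +1] for each example, with
    g(x) = sum_t alpha_t f_t(x) / sum_t alpha_t."""
    A = float(np.sum(alphas))
    g = sum(a * f(X) for a, f in zip(alphas, fs)) / A
    return y * g

# e.g. track np.min(margins(X, y, fs, alphas)) as T grows:
# AdaBoost tends to push the minimum margin up, as in the table above.
```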
More on boosting

Many variants of boosting:
- AdaBoost.L and LogitBoost.
- Forward-{step,stage}wise regression.
- Boosted decision trees = boosting + decision trees (see ESL Chapter 10).
- Boosting algorithms for ranking and multi-class problems.
- Boosting algorithms that are robust to certain kinds of noise.
- ...

Many connections between boosting and other subjects:
- Game theory, online learning.
- Information geometry.
- Computational complexity.
- ...