Introduction to Machine Learning
Lecture 11
Mehryar Mohri
Courant Institute and Google Research
mohri@cims.nyu.edu
Boosting
Boosting Ideas

Main idea: use a weak learner to create a strong learner.

Ensemble method: combine base classifiers returned by the weak learner.

Finding simple, relatively accurate base classifiers is often not hard. But how should the base classifiers be combined?
AdaBoost (Freund and Schapire, 1997)

$H \subseteq \{-1,+1\}^X$.

AdaBoost($S = ((x_1, y_1), \ldots, (x_m, y_m))$)
    for $i \leftarrow 1$ to $m$ do
        $D_1(i) \leftarrow \frac{1}{m}$
    for $t \leftarrow 1$ to $T$ do
        $h_t \leftarrow$ base classifier in $H$ with small error $\epsilon_t = \Pr_{i \sim D_t}[h_t(x_i) \neq y_i]$
        $\alpha_t \leftarrow \frac{1}{2} \log \frac{1 - \epsilon_t}{\epsilon_t}$
        $Z_t \leftarrow 2[\epsilon_t (1 - \epsilon_t)]^{1/2}$    (normalization factor)
        for $i \leftarrow 1$ to $m$ do
            $D_{t+1}(i) \leftarrow \frac{D_t(i) \exp(-\alpha_t y_i h_t(x_i))}{Z_t}$
    $f \leftarrow \sum_{t=1}^{T} \alpha_t h_t$
    return $h = \mathrm{sgn}(f)$
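A minimal NumPy sketch of the pseudocode above, assuming binary labels in $\{-1,+1\}$ and a user-supplied base_learner(X, y, D) returning a classifier with small weighted error under D (these names are illustrative, not from the slides).

import numpy as np

def adaboost(X, y, base_learner, T):
    m = X.shape[0]
    D = np.full(m, 1.0 / m)              # D_1(i) = 1/m
    hs, alphas = [], []
    for t in range(T):
        h = base_learner(X, y, D)        # base classifier with small weighted error
        pred = h(X)                      # predictions in {-1, +1}
        eps = D[pred != y].sum()         # eps_t = Pr_{D_t}[h_t(x_i) != y_i]
        alpha = 0.5 * np.log((1.0 - eps) / eps)
        D = D * np.exp(-alpha * y * pred)
        D /= D.sum()                     # divide by the normalization factor Z_t
        hs.append(h)
        alphas.append(alpha)

    def f(Xnew):                         # f = sum_t alpha_t h_t
        return sum(a * h(Xnew) for a, h in zip(alphas, hs))

    return lambda Xnew: np.sign(f(Xnew)), f   # classifier h = sgn(f), and f itself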
Notes

Distributions $D_t$ over the training sample:
- originally uniform.
- at each round, the weight of a misclassified example is increased.
- observation: $D_{t+1}(i) = \frac{e^{-y_i f_t(x_i)}}{m \prod_{s=1}^{t} Z_s}$, with $f_t = \sum_{s=1}^{t} \alpha_s h_s$, since
  $D_{t+1}(i) = \frac{D_t(i) e^{-\alpha_t y_i h_t(x_i)}}{Z_t} = \frac{D_{t-1}(i) e^{-\alpha_{t-1} y_i h_{t-1}(x_i)} e^{-\alpha_t y_i h_t(x_i)}}{Z_{t-1} Z_t} = \cdots = \frac{1}{m} \frac{e^{-y_i \sum_{s=1}^{t} \alpha_s h_s(x_i)}}{\prod_{s=1}^{t} Z_s}$
  (verified numerically in the sketch below).

Weight assigned to base classifier $h_t$: $\alpha_t$ directly depends on the accuracy of $h_t$ at round $t$.
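A small check that the incremental update and the closed-form expression for $D_{t+1}$ agree. The identity is purely algebraic, so arbitrary stand-in values for the $\alpha_t$ and $h_t(x_i)$ suffice here; none of these numbers come from the slides.

import numpy as np

rng = np.random.default_rng(0)
m, T = 8, 5
y = rng.choice([-1.0, 1.0], size=m)
H = rng.choice([-1.0, 1.0], size=(T, m))        # stand-ins for h_t(x_i)
alphas = rng.uniform(0.1, 1.0, size=T)          # stand-ins for alpha_t

D = np.full(m, 1.0 / m)
prod_Z = 1.0
for t in range(T):
    w = D * np.exp(-alphas[t] * y * H[t])
    Z = w.sum()                                  # normalization factor Z_t
    prod_Z *= Z
    D = w / Z                                    # incremental update D_{t+1}

f = (alphas[:, None] * H).sum(axis=0)            # f_T(x_i) = sum_t alpha_t h_t(x_i)
closed_form = np.exp(-y * f) / (m * prod_Z)      # e^{-y_i f(x_i)} / (m prod_s Z_s)
assert np.allclose(D, closed_form)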
Illustration

[figures: base classifiers and reweighted training points at rounds $t = 1$, $t = 2$, $t = 3$, and the final classifier obtained from the weighted combination $\alpha_1 h_1 + \alpha_2 h_2 + \alpha_3 h_3$.]
Bound on Empirical Error (Freund and Schapire, 1997)

Theorem: The empirical error of the classifier output by AdaBoost verifies:
$\widehat{R}(h) \leq \exp\Big(-2 \sum_{t=1}^{T} \big(\tfrac{1}{2} - \epsilon_t\big)^2\Big)$.

If further for all $t \in [1, T]$, $\gamma \leq \big(\tfrac{1}{2} - \epsilon_t\big)$, then
$\widehat{R}(h) \leq \exp(-2 \gamma^2 T)$.

$\gamma$ does not need to be known in advance: adaptive boosting.
Proof: Since, as we saw, $D_{T+1}(i) = \frac{e^{-y_i f(x_i)}}{m \prod_{s=1}^{T} Z_s}$,

$\widehat{R}(h) = \frac{1}{m} \sum_{i=1}^{m} 1_{y_i f(x_i) \leq 0} \leq \frac{1}{m} \sum_{i=1}^{m} \exp(-y_i f(x_i)) = \frac{1}{m} \sum_{i=1}^{m} \Big[ m \prod_{t=1}^{T} Z_t \Big] D_{T+1}(i) = \prod_{t=1}^{T} Z_t$.

Now, since $Z_t$ is a normalization factor,

$Z_t = \sum_{i=1}^{m} D_t(i) e^{-\alpha_t y_i h_t(x_i)} = \sum_{i: y_i h_t(x_i) \geq 0} D_t(i) e^{-\alpha_t} + \sum_{i: y_i h_t(x_i) < 0} D_t(i) e^{\alpha_t} = (1 - \epsilon_t) e^{-\alpha_t} + \epsilon_t e^{\alpha_t} = (1 - \epsilon_t) \sqrt{\tfrac{\epsilon_t}{1 - \epsilon_t}} + \epsilon_t \sqrt{\tfrac{1 - \epsilon_t}{\epsilon_t}} = 2 \sqrt{\epsilon_t (1 - \epsilon_t)}$.
Thus, using $1 - x \leq e^{-x}$,

$\prod_{t=1}^{T} Z_t = \prod_{t=1}^{T} 2 \sqrt{\epsilon_t (1 - \epsilon_t)} = \prod_{t=1}^{T} \sqrt{1 - 4\big(\tfrac{1}{2} - \epsilon_t\big)^2} \leq \prod_{t=1}^{T} \exp\Big(-2 \big(\tfrac{1}{2} - \epsilon_t\big)^2\Big) = \exp\Big(-2 \sum_{t=1}^{T} \big(\tfrac{1}{2} - \epsilon_t\big)^2\Big)$.

Notes:
- $\alpha_t$ is the minimizer of $\alpha \mapsto (1 - \epsilon_t) e^{-\alpha} + \epsilon_t e^{\alpha}$.
- since $(1 - \epsilon_t) e^{-\alpha_t} = \epsilon_t e^{\alpha_t}$, at each round, AdaBoost assigns the same probability mass to correctly classified and misclassified instances.
- for base classifiers taking values in $[-1, +1]$, $\alpha_t$ can be similarly chosen to minimize $Z_t$.
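A numerical sanity check of the two facts the proof rests on: with $\alpha_t = \frac{1}{2}\log\frac{1-\epsilon_t}{\epsilon_t}$, the normalizer equals $2\sqrt{\epsilon_t(1-\epsilon_t)}$ and is bounded by $\exp(-2(\tfrac12-\epsilon_t)^2)$. The grid of error values is illustrative.

import numpy as np

for eps in np.linspace(0.05, 0.45, 9):
    alpha = 0.5 * np.log((1 - eps) / eps)
    Z = (1 - eps) * np.exp(-alpha) + eps * np.exp(alpha)
    assert np.isclose(Z, 2 * np.sqrt(eps * (1 - eps)))          # Z_t = 2 sqrt(eps (1-eps))
    assert Z <= np.exp(-2 * (0.5 - eps) ** 2) + 1e-12           # Z_t <= exp(-2 (1/2 - eps)^2)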
AdaBoost = Coordinate Descent

Objective function: convex and differentiable,
$F(\alpha) = \sum_{i=1}^{m} e^{-y_i f(x_i)} = \sum_{i=1}^{m} e^{-y_i \sum_{t=1}^{T} \alpha_t h_t(x_i)}$.

[figure: the exponential loss $x \mapsto e^{-x}$ upper-bounds the 0-1 loss.]
Direction: unit vector $e_t$ with
$e_t = \operatorname{argmin}_t \left. \frac{dF(\alpha_{t-1} + \eta e_t)}{d\eta} \right|_{\eta=0}$.

Since $F(\alpha_{t-1} + \eta e_t) = \sum_{i=1}^{m} e^{-y_i \sum_{s=1}^{t-1} \alpha_s h_s(x_i)} e^{-y_i \eta h_t(x_i)}$,

$\left. \frac{dF(\alpha_{t-1} + \eta e_t)}{d\eta} \right|_{\eta=0}
= -\sum_{i=1}^{m} y_i h_t(x_i) \exp\Big(-y_i \sum_{s=1}^{t-1} \alpha_s h_s(x_i)\Big)
= -\sum_{i=1}^{m} y_i h_t(x_i) D_t(i) \Big[ m \prod_{s=1}^{t-1} Z_s \Big]
= -\big[(1 - \epsilon_t) - \epsilon_t\big] \, m \prod_{s=1}^{t-1} Z_s
= \big[2 \epsilon_t - 1\big] \, m \prod_{s=1}^{t-1} Z_s$.

Thus, the direction corresponds to the base classifier with smallest error.
Step size: obtained via $\frac{dF(\alpha_{t-1} + \eta e_t)}{d\eta} = 0$:

$-\sum_{i=1}^{m} y_i h_t(x_i) \exp\Big(-y_i \sum_{s=1}^{t-1} \alpha_s h_s(x_i)\Big) e^{-y_i h_t(x_i) \eta} = 0$
$\Leftrightarrow -\sum_{i=1}^{m} y_i h_t(x_i) D_t(i) \Big[ m \prod_{s=1}^{t-1} Z_s \Big] e^{-y_i h_t(x_i) \eta} = 0$
$\Leftrightarrow -\sum_{i=1}^{m} y_i h_t(x_i) D_t(i) e^{-y_i h_t(x_i) \eta} = 0$
$\Leftrightarrow -\big[(1 - \epsilon_t) e^{-\eta} - \epsilon_t e^{\eta}\big] = 0$
$\Leftrightarrow \eta = \frac{1}{2} \log \frac{1 - \epsilon_t}{\epsilon_t}$.

Thus, the step size matches the base classifier weight of AdaBoost.
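An illustrative numerical check of the coordinate-descent view: along the coordinate of a base classifier $h_t$, the derivative of $F$ at $\eta = 0$ is proportional to $(2\epsilon_t - 1)$, and the exact line search recovers $\eta = \frac{1}{2}\log\frac{1-\epsilon_t}{\epsilon_t}$. The data below are random stand-ins, not from the slides.

import numpy as np

rng = np.random.default_rng(1)
m = 50
y = rng.choice([-1.0, 1.0], size=m)
f_prev = rng.normal(size=m)              # current scores sum_s alpha_s h_s(x_i)
w = np.exp(-y * f_prev)                  # = m (prod_s Z_s) D_t(i), unnormalized weights
h = rng.choice([-1.0, 1.0], size=m)      # a candidate base classifier h_t(x_i)

eps = w[h != y].sum() / w.sum()          # weighted error under D_t
F = lambda eta: np.exp(-y * (f_prev + eta * h)).sum()

# derivative of F at eta = 0 equals (2 eps_t - 1) * sum_i w_i
deriv0 = -(y * h * w).sum()
assert np.isclose(deriv0, (2 * eps - 1) * w.sum())

# the closed-form step size minimizes F along this coordinate (grid line search)
eta_star = 0.5 * np.log((1 - eps) / eps)
grid = np.linspace(eta_star - 2, eta_star + 2, 2001)
vals = np.array([F(g) for g in grid])
assert np.isclose(grid[vals.argmin()], eta_star, atol=2e-3)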
Alternative Loss Functions

- boosting loss: $x \mapsto e^{-x}$
- logistic loss: $x \mapsto \log_2(1 + e^{-x})$
- square loss: $x \mapsto (1 - x)^2 \, 1_{x \leq 1}$
- hinge loss: $x \mapsto \max(1 - x, 0)$
- zero-one loss: $x \mapsto 1_{x < 0}$
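The same loss functions written as functions of the margin $u = y f(x)$, a direct transcription useful for plotting or comparison.

import numpy as np

boosting_loss = lambda u: np.exp(-u)
logistic_loss = lambda u: np.log2(1 + np.exp(-u))
square_loss   = lambda u: np.where(u <= 1, (1 - u) ** 2, 0.0)
hinge_loss    = lambda u: np.maximum(1 - u, 0.0)
zero_one_loss = lambda u: (u < 0).astype(float)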
Standard Use in Practice

Base learners: decision trees, quite often just decision stumps (trees of depth one).

Boosting stumps:
- data in $\mathbb{R}^N$, e.g., $N = 2$, $(\mathrm{height}(x), \mathrm{weight}(x))$.
- associate a stump to each component.
- pre-sort each component: $O(N m \log m)$.
- at each round, find the best component and threshold (see the sketch below).
- total complexity: $O((m \log m) N + m N T)$.
- stumps not weak learners: think XOR example!
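A sketch of a weighted decision-stump learner of this kind: each feature is pre-sorted once ($O(N m \log m)$), and each round scans the sorted columns with a running weighted error ($O(mN)$ per round). Names and structure are illustrative, not from the slides; ties between equal feature values are ignored for simplicity.

import numpy as np

def presort(X):
    # column-wise argsort, computed once before boosting
    return np.argsort(X, axis=0)

def best_stump(X, y, D, order):
    # search over (feature j, threshold theta, sign s) for the stump
    # h(x) = s if x_j > theta else -s, with smallest weighted error under D
    m, N = X.shape
    best_err, best_params = np.inf, None
    for j in range(N):
        idx = order[:, j]
        xs, ys, ws = X[idx, j], y[idx], D[idx]
        err = ws[ys == -1.0].sum()      # error of "predict +1 everywhere"
        for k in range(m):
            # move the threshold just past the k-th sorted point:
            # its prediction flips from +1 to -1
            err += ws[k] if ys[k] == 1.0 else -ws[k]
            for sign, e in ((1.0, err), (-1.0, D.sum() - err)):
                if e < best_err:
                    best_err, best_params = e, (j, xs[k], sign)
    j, theta, sign = best_params
    h = lambda Xnew: sign * np.where(Xnew[:, j] > theta, 1.0, -1.0)
    return h, best_err

With order = presort(X) computed once, a wrapper such as lambda X_, y_, D_: best_stump(X_, y_, D_, order)[0] can play the role of the base learner in the AdaBoost sketch given earlier.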
Overfitting?

We could expect that AdaBoost would overfit for large values of $T$, and that is in fact observed in some cases, but in various others it is not!

Several empirical observations (not all): AdaBoost does not seem to overfit; furthermore:

[figure: training and test error (%) of boosted C4.5 decision trees as a function of the number of rounds, 10 to 1000 (Schapire et al., 1998).]
L1 Margin Definitions

Definition: the margin of a point $x$ with label $y$ is the algebraic distance of $\bar{x} = [h_1(x), \ldots, h_T(x)]^\top$ to the hyperplane $\alpha \cdot \bar{x} = 0$:
$\rho(x) = \frac{y f(x)}{\sum_{t=1}^{T} |\alpha_t|} = \frac{y \sum_{t=1}^{T} \alpha_t h_t(x)}{\|\alpha\|_1} = \frac{y \, \alpha \cdot \bar{x}}{\|\alpha\|_1}$.

Definition: the margin of the classifier for a sample $S = (x_1, \ldots, x_m)$ is the minimum margin of the points in that sample:
$\rho = \min_{i \in [1, m]} \frac{y_i \, \alpha \cdot \bar{x}_i}{\|\alpha\|_1}$.
Note:
- SVM margin: $\rho = \min_{i \in [1, m]} \frac{y_i \, w \cdot \Phi(x_i)}{\|w\|_2}$.
- Boosting margin: $\rho = \min_{i \in [1, m]} \frac{y_i \, \alpha \cdot H(x_i)}{\|\alpha\|_1}$, with $H(x) = [h_1(x), \ldots, h_T(x)]^\top$.
- Distances: $\ell_q$ distance of a point $x$ to the hyperplane $w \cdot x + b = 0$: $\frac{|w \cdot x + b|}{\|w\|_p}$, with $\frac{1}{p} + \frac{1}{q} = 1$.
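A small helper, as a sketch, computing the $L_1$-normalized boosting margins $\rho(x_i) = y_i \sum_t \alpha_t h_t(x_i) / \|\alpha\|_1$ and the sample margin, given the weights and the matrix of base-classifier outputs H[t, i] = $h_t(x_i)$ (names are illustrative).

import numpy as np

def boosting_margins(alphas, H, y):
    # alphas: shape (T,); H: shape (T, m) with H[t, i] = h_t(x_i); y: shape (m,)
    alphas = np.asarray(alphas, dtype=float)
    margins = y * (alphas @ H) / np.abs(alphas).sum()
    return margins, margins.min()        # per-point margins, sample margin rho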
Convex Hull of Hypothesis Set

Definition: Let $H$ be a set of functions mapping from $X$ to $\mathbb{R}$. The convex hull of $H$ is defined by
$\operatorname{conv}(H) = \Big\{ \sum_{k=1}^{p} \mu_k h_k : p \geq 1, \; \mu_k \geq 0, \; \sum_{k=1}^{p} \mu_k \leq 1, \; h_k \in H \Big\}$.

Ensemble methods are often based on such convex combinations of hypotheses.
Margin Bound - Ensemble Methods (Koltchinskii and Panchenko, 2002)

Theorem: Let $H$ be a set of real-valued functions. Fix $\rho > 0$. For any $\delta > 0$, with probability at least $1 - \delta$, the following holds for all $h \in \operatorname{conv}(H)$:
$R(h) \leq \widehat{R}_\rho(h) + \frac{2}{\rho} \mathfrak{R}_m(H) + \sqrt{\frac{\log \frac{1}{\delta}}{2m}}$,
where $\mathfrak{R}_m(H)$ is a measure of the complexity of $H$.
Notes

For AdaBoost, the bound applies to the functions
$x \mapsto \frac{f(x)}{\|\alpha\|_1} = \frac{\sum_{t=1}^{T} \alpha_t h_t(x)}{\|\alpha\|_1} \in \operatorname{conv}(H)$.

Note that $T$ does not appear in the bound.
But, Does AdaBoost Maximize the Margin?

No: AdaBoost may converge to a margin that is significantly below the maximum margin (Rudin et al., 2004) (e.g., 1/3 instead of 3/8)!

Lower bound: AdaBoost can achieve asymptotically a margin that is at least $\frac{\rho_{\max}}{2}$ if the data are separable and under some conditions on the base learners (Rätsch and Warmuth, 2002).

Several boosting-type margin-maximization algorithms: but performance in practice not clear or not reported.
Outliers

AdaBoost assigns larger weights to harder examples.

Application: detecting mislabeled examples.

Dealing with noisy data: regularization based on the average weight assigned to a point (soft margin idea for boosting) (Meir and Rätsch, 2003).
Advantages of AdaBoost

Simple: straightforward implementation.

Efficient: complexity $O(mNT)$ for stumps; when $N$ and $T$ are not too large, the algorithm is quite fast.

Theoretical guarantees: but still many questions:
- AdaBoost not designed to maximize margin.
- regularized versions of AdaBoost.
Weaker Aspects

Parameters:
- need to determine $T$, the number of rounds of boosting: stopping criterion.
- need to determine the base learners: risk of overfitting or low margins.

Noise:
- severely damages the accuracy of AdaBoost (Dietterich, 2000).
- boosting algorithms based on convex potentials do not tolerate even low levels of random noise, even with $L_1$ regularization or early stopping (Long and Servedio, 2010).
References

Thomas G. Dietterich. An experimental comparison of three methods for constructing ensembles of decision trees: bagging, boosting, and randomization. Machine Learning, 40(2):139-158, 2000.

Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119-139, 1997.

Guy Lebanon and John Lafferty. Boosting and maximum likelihood for exponential models. In NIPS, pages 447-454, 2001.

Ron Meir and Gunnar Rätsch. An introduction to boosting and leveraging. In Advanced Lectures on Machine Learning (LNAI 2600), 2003.

John von Neumann. Zur Theorie der Gesellschaftsspiele. Mathematische Annalen, 100:295-320, 1928.

Gunnar Rätsch and Manfred K. Warmuth. Maximizing the margin with boosting. In Proceedings of the 15th Annual Conference on Computational Learning Theory (COLT 2002), Sydney, Australia, pages 334-350, July 2002.

Cynthia Rudin, Ingrid Daubechies, and Robert E. Schapire. The dynamics of AdaBoost: cyclic behavior and convergence of margins. Journal of Machine Learning Research, 5:1557-1595, 2004.

Robert E. Schapire. The boosting approach to machine learning: an overview. In D. D. Denison, M. H. Hansen, C. Holmes, B. Mallick, and B. Yu, editors, Nonlinear Estimation and Classification. Springer, 2003.

Robert E. Schapire, Yoav Freund, Peter Bartlett, and Wee Sun Lee. Boosting the margin: a new explanation for the effectiveness of voting methods. The Annals of Statistics, 26(5):1651-1686, 1998.