Introduction to Machine Learning, Lecture 11. Mehryar Mohri, Courant Institute and Google Research


2 Boosting

3 Boosting Ideas
Main idea: use a weak learner to create a strong learner.
Ensemble method: combine the base classifiers returned by the weak learner.
Finding simple, relatively accurate base classifiers is often not hard. But how should the base classifiers be combined?

4 AdaBoost (Freund and Schapire, 1997)
Hypothesis set H ⊆ {−1, +1}^X.

AdaBoost(S = ((x_1, y_1), ..., (x_m, y_m)))
 1  for i ← 1 to m do
 2      D_1(i) ← 1/m
 3  for t ← 1 to T do
 4      h_t ← base classifier in H with small error ε_t = Pr_{i∼D_t}[h_t(x_i) ≠ y_i]
 5      α_t ← (1/2) log((1 − ε_t)/ε_t)
 6      Z_t ← 2[ε_t(1 − ε_t)]^{1/2}    (normalization factor)
 7      for i ← 1 to m do
 8          D_{t+1}(i) ← D_t(i) exp(−α_t y_i h_t(x_i)) / Z_t
 9  f ← Σ_{t=1}^T α_t h_t
10  return h = sgn(f)
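To make the pseudocode concrete, here is a minimal NumPy sketch of the algorithm above. The stump-based weak learner and all names are illustrative assumptions, not part of the slides; the sketch assumes 0 < ε_t < 1 at every round.

    import numpy as np

    def adaboost(X, y, T, base_learner):
        """AdaBoost as in the pseudocode; labels y take values in {-1, +1}."""
        m = len(y)
        D = np.full(m, 1.0 / m)                    # lines 1-2: D_1(i) = 1/m
        alphas, hs = [], []
        for t in range(T):                         # line 3
            h = base_learner(X, y, D)              # line 4: small weighted error
            eps = D[h(X) != y].sum()               # eps_t = Pr_{D_t}[h_t(x_i) != y_i]
            alpha = 0.5 * np.log((1 - eps) / eps)  # line 5
            D = D * np.exp(-alpha * y * h(X))      # line 8, numerator
            D = D / D.sum()                        # dividing by Z_t renormalizes
            alphas.append(alpha)
            hs.append(h)
        f = lambda Z: sum(a * h(Z) for a, h in zip(alphas, hs))  # line 9
        return lambda Z: np.sign(f(Z))             # line 10: h = sgn(f)

    def stump_learner(X, y, D):
        """A simple, hypothetical weak learner: exhaustive search over
        axis-aligned threshold stumps for the smallest D-weighted error."""
        best_err, best_h = np.inf, None
        for j in range(X.shape[1]):
            for thr in X[:, j]:
                for s in (1.0, -1.0):
                    h = lambda Z, j=j, thr=thr, s=s: s * np.where(Z[:, j] <= thr, 1.0, -1.0)
                    err = D[h(X) != y].sum()
                    if err < best_err:
                        best_err, best_h = err, h
        return best_h

For instance, adaboost(X, y, T=10, base_learner=stump_learner) on a small two-dimensional sample returns the sgn(f) classifier of line 10. (A more efficient pre-sorted stump search is sketched on the "Standard Use in Practice" slide below.)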

5 Notes
Distributions D_t over the training sample: originally uniform; at each round, the weight of a misclassified example is increased.
Observation: the distribution can be unwound down to the first round, since
D_{t+1}(i) = D_t(i) e^{−α_t y_i h_t(x_i)} / Z_t
           = D_{t−1}(i) e^{−α_{t−1} y_i h_{t−1}(x_i)} e^{−α_t y_i h_t(x_i)} / (Z_{t−1} Z_t)
           = (1/m) e^{−y_i Σ_{s=1}^t α_s h_s(x_i)} / ∏_{s=1}^t Z_s
           = e^{−y_i f_t(x_i)} / (m ∏_{s=1}^t Z_s),    where f_t = Σ_{s=1}^t α_s h_s.
Weight assigned to base classifier h_t: α_t directly depends on the accuracy of h_t at round t.
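This unwinding is easy to confirm numerically. The sketch below (all data synthetic and hypothetical) runs the update for a few rounds and checks D_{T+1}(i) against the closed form e^{−y_i f_T(x_i)} / (m ∏_{s=1}^T Z_s).

    import numpy as np

    rng = np.random.default_rng(0)
    m, T = 8, 5
    y = rng.choice([-1, 1], size=m)
    H = np.tile(y, (T, 1))                      # H[t, i] stands in for h_t(x_i):
    for t in range(T):                          # each h_t errs on exactly 3 points,
        H[t, rng.choice(m, size=3, replace=False)] *= -1   # so 0 < eps_t < 1

    D = np.full(m, 1.0 / m)                     # D_1 uniform
    Zs, f = [], np.zeros(m)
    for t in range(T):
        eps = D[H[t] != y].sum()
        alpha = 0.5 * np.log((1 - eps) / eps)
        u = D * np.exp(-alpha * y * H[t])       # unnormalized D_{t+1}
        Zs.append(u.sum())                      # Z_t is the normalizer
        D = u / Zs[-1]
        f += alpha * H[t]                       # f_t(x_i) accumulated so far

    assert np.allclose(D, np.exp(-y * f) / (m * np.prod(Zs)))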

6 Illustration
[figure: rounds t = 1 and t = 2]

7 [figure: illustration, continued]

8 [figure: the final classifier as the weighted combination α_1·(stump 1) + α_2·(stump 2) + α_3·(stump 3)]

9 Bound on Empirical Error (Freund and Schapire, 1997)
Theorem: the empirical error of the classifier output by AdaBoost verifies
R̂(h) ≤ exp(−2 Σ_{t=1}^T (1/2 − ε_t)²).
If, furthermore, γ ≤ (1/2 − ε_t) for all t ∈ [1, T], then
R̂(h) ≤ exp(−2γ²T).
γ does not need to be known in advance: adaptive boosting.
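To get a feel for the rate, with edge γ = 0.1 the bound exp(−2γ²T) already falls below 1/m for m = 1000 well before T = 500; the numbers below are purely illustrative.

    import numpy as np

    gamma = 0.1                                # assumed edge: eps_t <= 1/2 - gamma
    for T in (100, 200, 300, 400, 500):
        print(T, np.exp(-2 * gamma**2 * T))    # e.g., T = 500 gives ~4.5e-05 < 1/1000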

10 Proof: since, as we saw, D_{T+1}(i) = e^{−y_i f(x_i)} / (m ∏_{t=1}^T Z_t),
R̂(h) = (1/m) Σ_{i=1}^m 1_{y_i f(x_i) ≤ 0}
     ≤ (1/m) Σ_{i=1}^m exp(−y_i f(x_i))
     = (1/m) Σ_{i=1}^m [m ∏_{t=1}^T Z_t] D_{T+1}(i)
     = ∏_{t=1}^T Z_t.
Now, since Z_t is a normalization factor,
Z_t = Σ_{i=1}^m D_t(i) e^{−α_t y_i h_t(x_i)}
    = Σ_{i: y_i h_t(x_i) ≥ 0} D_t(i) e^{−α_t} + Σ_{i: y_i h_t(x_i) < 0} D_t(i) e^{α_t}
    = (1 − ε_t) e^{−α_t} + ε_t e^{α_t}
    = (1 − ε_t) √(ε_t/(1 − ε_t)) + ε_t √((1 − ε_t)/ε_t)
    = 2 √(ε_t(1 − ε_t)).

11 Thus,
∏_{t=1}^T Z_t = ∏_{t=1}^T 2√(ε_t(1 − ε_t)) = ∏_{t=1}^T √(1 − 4(1/2 − ε_t)²) ≤ ∏_{t=1}^T exp(−2(1/2 − ε_t)²) = exp(−2 Σ_{t=1}^T (1/2 − ε_t)²).
Notes:
α_t is the minimizer of α ↦ (1 − ε_t)e^{−α} + ε_t e^{α}.
Since (1 − ε_t)e^{−α_t} = ε_t e^{α_t}, at each round AdaBoost assigns the same probability mass to correctly classified and misclassified instances.
For base classifiers taking values in [−1, +1], α_t can be similarly chosen to minimize Z_t.
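Both the identity 2√(ε(1 − ε)) = √(1 − 4(1/2 − ε)²) used in the second step and the final inequality can be spot-checked numerically; the error rates below are arbitrary illustrative values.

    import numpy as np

    eps = np.array([0.10, 0.25, 0.40, 0.45, 0.49])   # hypothetical eps_t values
    Z = 2 * np.sqrt(eps * (1 - eps))                 # Z_t = 2[eps_t(1 - eps_t)]^(1/2)

    # identity behind the second equality of the chain
    assert np.allclose(Z, np.sqrt(1 - 4 * (0.5 - eps) ** 2))

    # product of the Z_t versus the exponential upper bound
    lhs, rhs = np.prod(Z), np.exp(-2 * np.sum((0.5 - eps) ** 2))
    assert lhs <= rhs
    print(lhs, rhs)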

12 AdaBoost = Coordinate Descent
Objective function: convex and differentiable upper bound on the zero-one loss,
F(α) = Σ_{i=1}^m e^{−y_i f(x_i)} = Σ_{i=1}^m e^{−y_i Σ_{t=1}^T α_t h_t(x_i)}.
[figure: the exponential loss x ↦ e^{−x} upper-bounding the 0-1 loss]

13 Direction: unit vector e_t with
e_t = argmin_t [dF(α_{t−1} + η e_t)/dη]|_{η=0}.
Since F(α_{t−1} + η e_t) = Σ_{i=1}^m e^{−y_i Σ_{s=1}^{t−1} α_s h_s(x_i)} e^{−y_i η h_t(x_i)},
[dF(α_{t−1} + η e_t)/dη]|_{η=0}
  = −Σ_{i=1}^m y_i h_t(x_i) exp(−y_i Σ_{s=1}^{t−1} α_s h_s(x_i))
  = −Σ_{i=1}^m y_i h_t(x_i) D_t(i) [m ∏_{s=1}^{t−1} Z_s]
  = −[(1 − ε_t) − ε_t] m ∏_{s=1}^{t−1} Z_s
  = [2ε_t − 1] m ∏_{s=1}^{t−1} Z_s.
Thus, the direction chosen corresponds to the base classifier with the smallest error.
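The closed form of this directional derivative can be verified against a finite difference on tiny synthetic data; everything below (sample size, weights, predictions) is made up for the check.

    import numpy as np

    rng = np.random.default_rng(3)
    m, tm1 = 6, 4                                 # tm1 = t - 1 completed rounds
    y = rng.choice([-1, 1], size=m)
    Hprev = rng.choice([-1, 1], size=(tm1, m))    # h_1..h_{t-1} on the sample
    aprev = rng.uniform(0.1, 1.0, size=tm1)       # alpha_1..alpha_{t-1}
    ht = rng.choice([-1, 1], size=m)              # candidate direction h_t

    F = lambda eta: np.sum(np.exp(-y * (aprev @ Hprev + eta * ht)))

    # closed form: dF/deta|_0 = (2 eps_t - 1) m prod_s Z_s, where the weights
    # w_i = m (prod_s Z_s) D_t(i) come from the unwinding of slide 5
    w = np.exp(-y * (aprev @ Hprev))
    eps = (w / w.sum())[ht != y].sum()
    closed = (2 * eps - 1) * w.sum()

    d = 1e-6
    assert np.isclose((F(d) - F(-d)) / (2 * d), closed, rtol=1e-4, atol=1e-6)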

14 Step size: obtained by setting dF(α_{t−1} + η e_t)/dη = 0:
−Σ_{i=1}^m y_i h_t(x_i) exp(−y_i Σ_{s=1}^{t−1} α_s h_s(x_i)) e^{−y_i h_t(x_i) η} = 0
⟺ −Σ_{i=1}^m y_i h_t(x_i) D_t(i) [m ∏_{s=1}^{t−1} Z_s] e^{−y_i h_t(x_i) η} = 0
⟺ −Σ_{i=1}^m y_i h_t(x_i) D_t(i) e^{−y_i h_t(x_i) η} = 0
⟺ −[(1 − ε_t)e^{−η} − ε_t e^{η}] = 0
⟺ η = (1/2) log((1 − ε_t)/ε_t).
Thus, the step size matches the base classifier weight of AdaBoost.
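A quick numeric check, with an arbitrary error rate, that this closed-form step size minimizes φ(η) = (1 − ε_t)e^{−η} + ε_t e^{η}, and that at the optimum the mass on correct and incorrect points coincides, as noted on slide 11:

    import numpy as np

    eps = 0.3                                       # hypothetical eps_t
    phi = lambda eta: (1 - eps) * np.exp(-eta) + eps * np.exp(eta)

    eta_star = 0.5 * np.log((1 - eps) / eps)        # AdaBoost's alpha_t
    grid = np.linspace(-3.0, 3.0, 100001)
    assert abs(grid[np.argmin(phi(grid))] - eta_star) < 1e-4

    # equal probability mass on correctly and incorrectly classified points
    assert np.isclose((1 - eps) * np.exp(-eta_star), eps * np.exp(eta_star))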

15 Alternative Loss Functions
boosting loss: x ↦ e^{−x}
logistic loss: x ↦ log_2(1 + e^{−x})
square loss: x ↦ (1 − x)² 1_{x ≤ 1}
hinge loss: x ↦ max(1 − x, 0)
zero-one loss: x ↦ 1_{x < 0}
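For reference, the five losses written as functions of the margin value x = y f(x) and evaluated at a few arbitrary points:

    import numpy as np

    losses = {
        "boosting": lambda x: np.exp(-x),
        "logistic": lambda x: np.log2(1 + np.exp(-x)),
        "square":   lambda x: (1 - x) ** 2 * (x <= 1),
        "hinge":    lambda x: np.maximum(1 - x, 0.0),
        "zero-one": lambda x: float(x < 0),
    }
    for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
        print(x, {name: round(float(loss(x)), 3) for name, loss in losses.items()})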

16 Standard Use in Practice
Base learners: decision trees, quite often just decision stumps (trees of depth one).
Boosting stumps: data in R^N, e.g., N = 2, (height(x), weight(x)):
associate a stump with each component;
pre-sort each component: O(Nm log m);
at each round, find the best component and threshold (see the sketch below);
total complexity: O((m log m)N + mNT).
Stumps are not always weak learners: think of the XOR example!
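A sketch of the per-round search matching the complexity claims above (function names are made up): each component is argsorted once, and a round then sweeps every threshold of every component in O(mN) by updating the weighted error incrementally.

    import numpy as np

    def presort(X):
        return np.argsort(X, axis=0)                # once: O(N m log m)

    def best_stump(X, y, D, order):
        """One round, O(mN): stump predicts +1 iff x_j > threshold (times sign).
        Ties between equal feature values are ignored in this sketch."""
        best_err, best = np.inf, None
        for j in range(X.shape[1]):
            err = D[y == -1].sum()                  # threshold below all points
            for i in order[:, j]:                   # sweep thresholds in sorted order
                err += D[i] if y[i] == 1 else -D[i] # point i now predicted -1
                for sign, e in ((1, err), (-1, 1.0 - err)):  # flipped stump: 1 - err
                    if e < best_err:
                        best_err, best = e, (j, X[i, j], sign)
        return best, best_err

    # toy usage on hypothetical data
    rng = np.random.default_rng(1)
    X = rng.normal(size=(100, 2))
    y = np.where(X[:, 0] > 0.3, 1, -1)
    print(best_stump(X, y, np.full(100, 0.01), presort(X)))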

17 Overfitting?
We could expect AdaBoost to overfit for large values of T, and that is in fact observed in some cases, but in various others it is not! Several empirical observations (not all): AdaBoost does not seem to overfit; furthermore, the test error keeps decreasing after the training error reaches zero.
[figure: training and test error versus the number of boosting rounds, C4.5 decision trees (Schapire et al., 1998)]

18 L1 Margin Definitions
Definition: the margin of a point x with label y is the algebraic distance of x̄ = [h_1(x), ..., h_T(x)]⊤ to the hyperplane α·x̄ = 0:
ρ(x) = y f(x)/‖α‖_1 = y Σ_{t=1}^T α_t h_t(x)/‖α‖_1 = y (α·x̄)/‖α‖_1.
Definition: the margin of the classifier for a sample S = (x_1, ..., x_m) is the minimum margin of the points in that sample:
ρ = min_{i∈[1,m]} y_i (α·x̄_i)/‖α‖_1.

19 Note:
SVM margin: ρ = min_{i∈[1,m]} y_i (w·Φ(x_i))/‖w‖_2.
Boosting margin: ρ = min_{i∈[1,m]} y_i (α·H(x_i))/‖α‖_1, with H(x) = [h_1(x), ..., h_T(x)]⊤.
Distances: the ℓ_q distance of a point x to the hyperplane w·x + b = 0 is |w·x + b|/‖w‖_p, with 1/p + 1/q = 1.
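The two definitions differ only in how the weight vector is normalized, which a toy computation with made-up base-classifier outputs makes explicit:

    import numpy as np

    H = np.array([[ 1, -1,  1],   # row i holds H(x_i) = (h_1(x_i), ..., h_T(x_i))
                  [ 1,  1, -1],
                  [-1,  1,  1]])
    y = np.array([1, 1, -1])
    alpha = np.array([0.6, 0.3, 0.1])

    rho_boost = np.min(y * (H @ alpha) / np.abs(alpha).sum())   # L1-normalized
    rho_svm = np.min(y * (H @ alpha) / np.linalg.norm(alpha))   # L2-normalized
    print(rho_boost, rho_svm)                                   # 0.2 vs ~0.295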

20 Convex Hull of Hypothesis Set
Definition: let H be a set of functions mapping from X to R. The convex hull of H is defined by
conv(H) = {Σ_{k=1}^p μ_k h_k : p ≥ 1, μ_k ≥ 0, Σ_{k=1}^p μ_k ≤ 1, h_k ∈ H}.
Ensemble methods are often based on such convex combinations of hypotheses.

21 Margin Bound - Ensemble Methods (Koltchinskii and Panchenko, 2002)
Theorem: let H be a set of real-valued functions. Fix ρ > 0. For any δ > 0, with probability at least 1 − δ, the following holds for all h ∈ conv(H):
R(h) ≤ R̂_ρ(h) + (2/ρ) R_m(H) + √(log(1/δ)/(2m)),
where R_m(H) is a measure of the complexity of H.

22 Notes
For AdaBoost, the bound applies to the functions
x ↦ f(x)/‖α‖_1 = Σ_{t=1}^T α_t h_t(x)/‖α‖_1 ∈ conv(H).
Note that T does not appear in the bound.
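Concretely, since the α_t are nonnegative, dividing by ‖α‖_1 turns them into coefficients of a convex combination, so the normalized score function indeed lies in conv(H); a tiny check with made-up weights:

    import numpy as np

    alpha = np.array([0.7, 0.2, 0.4, 0.1])   # hypothetical alpha_t >= 0
    mu = alpha / alpha.sum()                 # mu_t >= 0 and sum mu_t = 1 <= 1
    assert np.all(mu >= 0) and np.isclose(mu.sum(), 1.0)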

23 But, Does AdaBoost Maximize the Margin?
No: AdaBoost may converge to a margin that is significantly below the maximum margin (Rudin et al., 2004) (e.g., 1/3 instead of 3/8)!
Lower bound: AdaBoost can asymptotically achieve a margin of at least ρ_max/2 if the data are separable, under some conditions on the base learners (Rätsch and Warmuth, 2002).
Several boosting-type margin-maximization algorithms exist, but their performance in practice is not clear or not reported.

24 Outliers
AdaBoost assigns larger weights to harder examples.
Application: detecting mislabeled examples.
Dealing with noisy data: regularization based on the average weight assigned to a point (soft margin idea for boosting) (Meir and Rätsch, 2003).

25 Advantages of AdaBoost
Simple: straightforward implementation.
Efficient: complexity O(mNT) for stumps; when N and T are not too large, the algorithm is quite fast.
Theoretical guarantees: but still many questions. AdaBoost was not designed to maximize the margin, hence the regularized versions of AdaBoost.

26 Weaker Aspects
Parameters:
need to determine T, the number of rounds of boosting: stopping criterion;
need to determine the base learners: risk of overfitting or low margins.
Noise:
severely damages the accuracy of AdaBoost (Dietterich, 2000);
boosting algorithms based on convex potentials do not tolerate even low levels of random noise, even with L1 regularization or early stopping (Long and Servedio, 2010).

27 References
Thomas G. Dietterich. An experimental comparison of three methods for constructing ensembles of decision trees: bagging, boosting, and randomization. Machine Learning, 40(2):139-157, 2000.
Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119-139, 1997.
G. Lebanon and J. Lafferty. Boosting and maximum likelihood for exponential models. In NIPS, 2001.
Ron Meir and Gunnar Rätsch. An introduction to boosting and leveraging. In Advanced Lectures on Machine Learning (LNAI 2600), 2003.
J. von Neumann. Zur Theorie der Gesellschaftsspiele. Mathematische Annalen, 100:295-320, 1928.
Cynthia Rudin, Ingrid Daubechies, and Robert E. Schapire. The dynamics of AdaBoost: cyclic behavior and convergence of margins. Journal of Machine Learning Research, 5:1557-1595, 2004.

28 References
Gunnar Rätsch and Manfred K. Warmuth. Maximizing the margin with boosting. In Proceedings of the 15th Annual Conference on Computational Learning Theory (COLT 2002), Sydney, Australia, July 2002.
Robert E. Schapire. The boosting approach to machine learning: an overview. In D. D. Denison, M. H. Hansen, C. Holmes, B. Mallick, and B. Yu, editors, Nonlinear Estimation and Classification. Springer, 2003.
Robert E. Schapire, Yoav Freund, Peter Bartlett, and Wee Sun Lee. Boosting the margin: a new explanation for the effectiveness of voting methods. The Annals of Statistics, 26(5):1651-1686, 1998.
