Foundations of Machine Learning Boosting. Mehryar Mohri Courant Institute and Google Research
2 Weak Learning. Definition: a concept class C is weakly PAC-learnable if there exists a (weak) learning algorithm L such that, for a fixed γ > 0: for all δ > 0, for all c ∈ C and all distributions D, for samples S of size m = poly(1/δ),
Pr_{S∼D}[R(h_S) ≤ 1/2 − γ] ≥ 1 − δ.
(Kearns and Valiant, 1994)
3 Boosting Ideas. Finding simple, relatively accurate base classifiers is often not hard: a weak learner. Main ideas: use the weak learner to create a strong learner; combine the base classifiers returned by the weak learner (ensemble method). But how should the base classifiers be combined?
4 AdaBoost (Freund and Schapire, 1997), with H ⊆ {−1, +1}^X.

AdaBoost(S = ((x_1, y_1), ..., (x_m, y_m)))
 1  for i ← 1 to m do
 2      D_1(i) ← 1/m
 3  for t ← 1 to T do
 4      h_t ← base classifier in H with small error ε_t = Pr_{i∼D_t}[h_t(x_i) ≠ y_i]
 5      α_t ← (1/2) log((1 − ε_t)/ε_t)
 6      Z_t ← 2[ε_t(1 − ε_t)]^{1/2}    ▷ normalization factor
 7      for i ← 1 to m do
 8          D_{t+1}(i) ← D_t(i) exp(−α_t y_i h_t(x_i)) / Z_t
 9  f_T ← Σ_{t=1}^T α_t h_t
10  return h = sgn(f_T)
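As a concrete illustration, here is a minimal NumPy sketch of this pseudocode. The exhaustive threshold-stump weak learner is an assumption made for the example, not part of the algorithm, and each round's weighted error is assumed to lie in (0, 1/2):

```python
import numpy as np

def apply_stump(h, X):
    """Evaluate a threshold stump h = (feature, threshold, sign) on X."""
    j, thr, s = h
    return s * np.where(X[:, j] <= thr, 1.0, -1.0)

def best_stump(X, y, D):
    """Exhaustive weak learner: the stump with smallest weighted error (line 4)."""
    best, best_err = None, np.inf
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for s in (1.0, -1.0):
                pred = s * np.where(X[:, j] <= thr, 1.0, -1.0)
                err = D[pred != y].sum()
                if err < best_err:
                    best, best_err = (j, thr, s), err
    return best

def adaboost(X, y, T):
    """AdaBoost as in the pseudocode; assumes every round's error is in (0, 1/2)."""
    m = X.shape[0]
    D = np.full(m, 1.0 / m)                     # line 2: D_1(i) = 1/m
    hs, alphas = [], []
    for _ in range(T):
        h = best_stump(X, y, D)                 # line 4
        pred = apply_stump(h, X)
        eps = D[pred != y].sum()                # weighted error eps_t
        alpha = 0.5 * np.log((1 - eps) / eps)   # line 5
        Z = 2.0 * np.sqrt(eps * (1 - eps))      # line 6: normalization factor
        D = D * np.exp(-alpha * y * pred) / Z   # line 8
        hs.append(h)
        alphas.append(alpha)
    def predict(Xn):
        f = sum(a * apply_stump(h, Xn) for a, h in zip(alphas, hs))
        return np.sign(f)                       # line 10: h = sgn(f_T)
    return predict
```

On a small 1D sample such as X = [[0], [1], ..., [5]] with labels [1, 1, −1, −1, 1, 1], a few dozen rounds suffice to reach zero training error, since the target is realizable by a convex combination of stumps with positive margin.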
5 Notes. The distribution D_t is initially uniform over the training sample. At each round, the weight of a misclassified example is increased. Observation: D_{t+1}(i) = e^{−y_i f_t(x_i)} / (m ∏_{s=1}^t Z_s), since
D_{t+1}(i) = D_t(i) e^{−α_t y_i h_t(x_i)} / Z_t = D_{t−1}(i) e^{−α_{t−1} y_i h_{t−1}(x_i)} e^{−α_t y_i h_t(x_i)} / (Z_{t−1} Z_t) = (1/m) e^{−y_i Σ_{s=1}^t α_s h_s(x_i)} / ∏_{s=1}^t Z_s.
The weight α_t assigned to base classifier h_t directly depends on the accuracy of h_t at round t.
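The closed-form expression for D_{t+1} can be checked numerically against the round-by-round update. A sketch with arbitrary, randomly generated base-classifier predictions; the telescoping identity holds for any choice of α_t, so the α_t here need not come from an actual weak learner:

```python
import numpy as np

rng = np.random.default_rng(1)
m, T = 8, 5
y = rng.choice([-1.0, 1.0], size=m)
preds = rng.choice([-1.0, 1.0], size=(T, m))    # h_t(x_i) for T arbitrary rounds

D = np.full(m, 1.0 / m)                         # D_1 uniform
f = np.zeros(m)                                 # f_t(x_i) = sum_s alpha_s h_s(x_i)
Zs = []
for t in range(T):
    eps = min(max(D[preds[t] != y].sum(), 1e-9), 1 - 1e-9)  # guard degenerate rounds
    alpha = 0.5 * np.log((1 - eps) / eps)
    w = D * np.exp(-alpha * y * preds[t])
    Z = w.sum()                                 # normalization factor Z_t
    D = w / Z
    f += alpha * preds[t]
    Zs.append(Z)

# closed form from the slide: D_{T+1}(i) = exp(-y_i f_T(x_i)) / (m * prod_s Z_s)
closed = np.exp(-y * f) / (m * np.prod(Zs))
```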
6-8 Illustration. [Figures: example weights and base classifiers at rounds t = 1, 2, 3, and the combined classifier α_1 h_1 + α_2 h_2 + α_3 h_3.]
9 Bound on Empirical Error (Freund and Schapire, 1997). Theorem: the empirical error of the classifier output by AdaBoost verifies:
R̂(h) ≤ exp(−2 Σ_{t=1}^T (1/2 − ε_t)²).
If further, for all t ∈ [1, T], γ ≤ (1/2 − ε_t), then R̂(h) ≤ exp(−2γ²T). γ does not need to be known in advance: adaptive boosting.
10 Proof: since, as we saw, D_{T+1}(i) = e^{−y_i f(x_i)} / (m ∏_{s=1}^T Z_s),
R̂(h) = (1/m) Σ_{i=1}^m 1_{y_i f(x_i) ≤ 0} ≤ (1/m) Σ_{i=1}^m exp(−y_i f(x_i)) = (1/m) Σ_{i=1}^m [m ∏_{t=1}^T Z_t] D_{T+1}(i) = ∏_{t=1}^T Z_t.
Now, since Z_t is a normalization factor,
Z_t = Σ_{i=1}^m D_t(i) e^{−α_t y_i h_t(x_i)} = Σ_{i: y_i h_t(x_i) ≥ 0} D_t(i) e^{−α_t} + Σ_{i: y_i h_t(x_i) < 0} D_t(i) e^{α_t} = (1 − ε_t) e^{−α_t} + ε_t e^{α_t} = (1 − ε_t) √(ε_t/(1 − ε_t)) + ε_t √((1 − ε_t)/ε_t) = 2√(ε_t(1 − ε_t)).
11 Thus,
∏_{t=1}^T Z_t = ∏_{t=1}^T 2√(ε_t(1 − ε_t)) = ∏_{t=1}^T √(1 − 4(1/2 − ε_t)²) ≤ ∏_{t=1}^T exp(−2(1/2 − ε_t)²) = exp(−2 Σ_{t=1}^T (1/2 − ε_t)²).
Notes: α_t is the minimizer of α ↦ (1 − ε_t)e^{−α} + ε_t e^{α}. Since (1 − ε_t)e^{−α_t} = ε_t e^{α_t}, at each round AdaBoost assigns the same probability mass to correctly classified and misclassified instances. For base classifiers taking values in [−1, +1], α_t can be similarly chosen to minimize Z_t.
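The two per-round facts used here, the identity 2√(ε_t(1 − ε_t)) = √(1 − 4(1/2 − ε_t)²) and the upper bound by exp(−2(1/2 − ε_t)²), can be sanity-checked numerically (a small sketch; the sample error values are arbitrary):

```python
import math

def Z(eps):
    """Round-t normalization factor for weighted error eps."""
    return 2.0 * math.sqrt(eps * (1 - eps))

def bound_factor(eps):
    """Per-round factor exp(-2 (1/2 - eps)^2) from the theorem."""
    return math.exp(-2.0 * (0.5 - eps) ** 2)
```

The inequality is the elementary fact √(1 − u) ≤ e^{−u/2}, applied with u = 4(1/2 − ε_t)².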
12 AdaBoost = Coordinate Descent. Objective function: a convex and differentiable upper bound on the zero-one loss (x ↦ e^{−x} vs. the 0-1 loss x ↦ 1_{x ≤ 0}):
F(ᾱ) = (1/m) Σ_{i=1}^m e^{−y_i f(x_i)} = (1/m) Σ_{i=1}^m e^{−y_i Σ_{j=1}^N ᾱ_j h_j(x_i)}.
13 Direction: the unit vector e_k with the best directional derivative:
F′(ᾱ_{t−1}, e_k) = lim_{η→0} [F(ᾱ_{t−1} + η e_k) − F(ᾱ_{t−1})] / η.
Since F(ᾱ_{t−1} + η e_k) = (1/m) Σ_{i=1}^m e^{−y_i Σ_{j=1}^N ᾱ_{t−1,j} h_j(x_i)} e^{−η y_i h_k(x_i)},
F′(ᾱ_{t−1}, e_k) = −(1/m) Σ_{i=1}^m y_i h_k(x_i) e^{−y_i Σ_{j=1}^N ᾱ_{t−1,j} h_j(x_i)} = −Σ_{i=1}^m y_i h_k(x_i) D_t(i) [∏_{s=1}^{t−1} Z_s]
= −[Σ_i D_t(i) 1_{y_i h_k(x_i) = +1} − Σ_i D_t(i) 1_{y_i h_k(x_i) = −1}] ∏_{s=1}^{t−1} Z_s = −[(1 − ε_{t,k}) − ε_{t,k}] ∏_{s=1}^{t−1} Z_s = [2ε_{t,k} − 1] ∏_{s=1}^{t−1} Z_s.
Thus, the best direction corresponds to the base classifier with the smallest error.
14 Step size: chosen to minimize F(ᾱ_{t−1} + η e_k):
dF(ᾱ_{t−1} + η e_k)/dη = 0
⇔ −Σ_{i=1}^m y_i h_k(x_i) e^{−y_i Σ_{j=1}^N ᾱ_{t−1,j} h_j(x_i)} e^{−η y_i h_k(x_i)} = 0
⇔ −Σ_{i=1}^m y_i h_k(x_i) D_t(i) [∏_{s=1}^{t−1} Z_s] e^{−η y_i h_k(x_i)} = 0
⇔ −Σ_{i=1}^m y_i h_k(x_i) D_t(i) e^{−η y_i h_k(x_i)} = 0
⇔ (1 − ε_{t,k}) e^{−η} − ε_{t,k} e^{η} = 0
⇔ η = (1/2) log((1 − ε_{t,k})/ε_{t,k}).
Thus, the step size matches the base-classifier weight of AdaBoost.
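That the closed-form step size minimizes the per-round objective, and that its minimum value is exactly Z_t = 2√(ε(1 − ε)), can be verified numerically (a small sketch; the comparison points below are arbitrary):

```python
import math

def phi(eta, eps):
    """Per-round objective (1 - eps) e^{-eta} + eps e^{eta}, minimized by alpha_t."""
    return (1 - eps) * math.exp(-eta) + eps * math.exp(eta)
```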
15 Alternative Loss Functions:
- boosting loss: x ↦ e^{−x}
- logistic loss: x ↦ log₂(1 + e^{−x})
- square loss: x ↦ (1 − x)² 1_{x ≤ 1}
- hinge loss: x ↦ max(1 − x, 0)
- zero-one loss: x ↦ 1_{x < 0}
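Each of the convex surrogates above upper-bounds the zero-one loss pointwise, which is what drives empirical-error bounds of the kind shown on the previous slides; a quick check:

```python
import math

# the five losses from the slide, as functions of the margin x = y f(x)
losses = {
    "boosting": lambda x: math.exp(-x),
    "logistic": lambda x: math.log2(1 + math.exp(-x)),
    "square":   lambda x: (1 - x) ** 2 if x <= 1 else 0.0,
    "hinge":    lambda x: max(1 - x, 0.0),
    "zero-one": lambda x: 1.0 if x < 0 else 0.0,
}
```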
16 Standard Use in Practice. Base learners: decision trees, quite often just decision stumps (trees of depth one). Boosting stumps: data in R^N, e.g., N = 2, (height(x), weight(x)); associate a stump with each component; pre-sort each component: O(m log m); at each round, find the best component and threshold; total complexity: O((m log m)N + mNT); stumps are not always weak learners: think of the XOR example!
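The pre-sorting trick can be sketched as follows: sort one component once, then sweep the candidate thresholds in order while updating the weighted error incrementally, so each round costs O(m) per component after the O(m log m) sort. The stump convention "predict +1 if x ≤ θ" is an assumption of this sketch:

```python
import numpy as np

def best_threshold(x, y, D):
    """Best stump on one component: returns (threshold, sign, weighted error).

    Sort once, then sweep thresholds; moving a point to the '<= theta' side
    changes the error of the sign=+1 stump by -y_i D_i.
    """
    order = np.argsort(x)
    x_s, yD = x[order], (y * D)[order]
    err = D[y == 1].sum()               # theta below all points: positives wrong
    best_err = min(err, 1 - err)        # sign flip gives error 1 - err
    best_thr, best_sign = -np.inf, (1.0 if err <= 1 - err else -1.0)
    for i in range(len(x_s)):
        err -= yD[i]                    # point i crosses to the '<= theta' side
        if i + 1 < len(x_s) and x_s[i + 1] == x_s[i]:
            continue                    # only cut between distinct values
        e = min(err, 1 - err)
        if e < best_err:
            best_err, best_thr = e, x_s[i]
            best_sign = 1.0 if err <= 1 - err else -1.0
    return best_thr, best_sign, best_err
```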
17 Overfitting? Assume VCdim(H) = d and, for a fixed T, define
F_T = {sgn(Σ_{t=1}^T α_t h_t + b) : α_t, b ∈ R, h_t ∈ H}.
F_T can form a very rich family of classifiers. It can be shown (Freund and Schapire, 1997) that
VCdim(F_T) ≤ 2(d + 1)(T + 1) log₂((T + 1)e).
This suggests that AdaBoost could overfit for large values of T, and that is in fact observed in some cases, but in various others it is not!
18 Empirical Observations. Several empirical observations (not all): AdaBoost does not seem to overfit; furthermore, the test error keeps decreasing with the number of rounds even after the training error reaches zero. [Figure: training and test error vs. number of rounds, for C4.5 decision trees (Schapire et al., 1998).]
19 Rademacher Complexity of Convex Hulls. Theorem: let H be a set of functions mapping from X to R, and define the convex hull of H as
conv(H) = {Σ_{k=1}^p μ_k h_k : p ≥ 1, μ_k ≥ 0, Σ_{k=1}^p μ_k ≤ 1, h_k ∈ H}.
Then, for any sample S, R̂_S(conv(H)) = R̂_S(H).
Proof:
R̂_S(conv(H)) = (1/m) E_σ[sup_{h_k ∈ H, μ ≥ 0, ||μ||_1 ≤ 1} Σ_{i=1}^m σ_i Σ_{k=1}^p μ_k h_k(x_i)]
= (1/m) E_σ[sup_{h_k ∈ H} sup_{μ ≥ 0, ||μ||_1 ≤ 1} Σ_{k=1}^p μ_k Σ_{i=1}^m σ_i h_k(x_i)]
= (1/m) E_σ[sup_{h_k ∈ H} max_{k ∈ [1,p]} Σ_{i=1}^m σ_i h_k(x_i)]
= (1/m) E_σ[sup_{h ∈ H} Σ_{i=1}^m σ_i h(x_i)] = R̂_S(H).
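The crux of the proof is that the linear functional Σ_i σ_i g(x_i) over the convex hull attains its supremum at one of the base hypotheses, so mixing never helps. This can be illustrated numerically (the random ±1 hypotheses and random convex weights below are assumptions of the sketch):

```python
import numpy as np

rng = np.random.default_rng(0)
m, p = 50, 8
H_vals = rng.choice([-1.0, 1.0], size=(p, m))   # p base hypotheses on m points
sigma = rng.choice([-1.0, 1.0], size=m)         # one draw of Rademacher variables

per_h = H_vals @ sigma                          # sum_i sigma_i h_k(x_i), per h_k
best_h = per_h.max()                            # supremum over the base class H

mu = rng.dirichlet(np.ones(p), size=1000)       # random convex weights (mu >= 0, sum = 1)
best_hull = (mu @ per_h).max()                  # same functional over conv(H) points
```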
20 Margin Bound - Ensemble Methods (Koltchinskii and Panchenko, 2002). Corollary: let H be a set of real-valued functions. Fix ρ > 0. For any δ > 0, with probability at least 1 − δ, the following holds for all h ∈ conv(H):
R(h) ≤ R̂_ρ(h) + (2/ρ) R_m(H) + √(log(1/δ)/(2m))
R(h) ≤ R̂_ρ(h) + (2/ρ) R̂_S(H) + 3√(log(2/δ)/(2m)).
Proof: direct consequence of the margin bound of Lecture 4 and R̂_S(conv(H)) = R̂_S(H).
21 Margin Bound - Ensemble Methods (Koltchinskii and Panchenko, 2002); see also (Schapire et al., 1998). Corollary: let H be a family of functions taking values in {−1, +1} with VC dimension d. Fix ρ > 0. For any δ > 0, with probability at least 1 − δ, the following holds for all h ∈ conv(H):
R(h) ≤ R̂_ρ(h) + (2/ρ) √(2d log(em/d) / m) + √(log(1/δ)/(2m)).
Proof: follows directly from the previous corollary and the VC-dimension bound on Rademacher complexity (see Lecture 3).
22 Notes. All of these bounds can be generalized to hold uniformly for all ρ ∈ (0, 1), at the cost of an additional term √((log log₂(2/ρ))/m) and other minor constant factor changes (Koltchinskii and Panchenko, 2002). For AdaBoost, the bound applies to the functions
x ↦ f(x)/||α||_1 = Σ_{t=1}^T α_t h_t(x) / ||α||_1 ∈ conv(H).
Note that T does not appear in the bound.
23 Margin Distribution. Theorem: for any ρ > 0, the following holds:
Pr_{(x,y)∼S}[y f(x)/||α||_1 ≤ ρ] ≤ 2^T ∏_{t=1}^T √(ε_t^{1−ρ} (1 − ε_t)^{1+ρ}).
Proof: using the identity D_{T+1}(i) = e^{−y_i f(x_i)} / (m ∏_{t=1}^T Z_t),
(1/m) Σ_{i=1}^m 1_{y_i f(x_i) ≤ ρ||α||_1} ≤ (1/m) Σ_{i=1}^m exp(−y_i f(x_i) + ρ||α||_1)
= (1/m) e^{ρ||α||_1} Σ_{i=1}^m [m ∏_{t=1}^T Z_t] D_{T+1}(i) = e^{ρ||α||_1} ∏_{t=1}^T Z_t
= ∏_{t=1}^T e^{ρα_t} Z_t = 2^T ∏_{t=1}^T √(ε_t^{1−ρ} (1 − ε_t)^{1+ρ}).
24 Notes. If, for all t ∈ [1, T], γ ≤ (1/2 − ε_t), then the upper bound can be bounded by
Pr_{(x,y)∼S}[y f(x)/||α||_1 ≤ ρ] ≤ [(1 − 2γ)^{1−ρ} (1 + 2γ)^{1+ρ}]^{T/2}.
For ρ < γ, (1 − 2γ)^{1−ρ}(1 + 2γ)^{1+ρ} < 1 and the bound decreases exponentially in T. For the bound to be convergent, ρ must be at least of order 1/√m; thus, γ = Ω(1/√m) is roughly the condition on the edge value.
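A quick numerical check of the threshold behavior of the per-round base for one concrete edge value (γ = 0.2 is an arbitrary choice for the sketch):

```python
import math

def per_round_base(rho, gamma):
    """(1 - 2g)^(1-r) (1 + 2g)^(1+r): the quantity raised to the power T/2."""
    return (1 - 2 * gamma) ** (1 - rho) * (1 + 2 * gamma) ** (1 + rho)
```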
25 Outliers. AdaBoost assigns larger weights to harder examples. Application: detecting mislabeled examples. Dealing with noisy data: regularization based on the average weight assigned to a point (soft margin idea for boosting) (Meir and Rätsch, 2003).
26 L1-Geometric Margin. Definition: the L_1-margin ρ_f(x) of a linear function f = Σ_{t=1}^T α_t h_t with α ≠ 0 at a point x ∈ X is defined by
ρ_f(x) = |f(x)| / ||α||_1 = |Σ_{t=1}^T α_t h_t(x)| / ||α||_1 = |α · h(x)| / ||α||_1.
The L_1-margin of f over a sample S = (x_1, ..., x_m) is its minimum margin at the points in that sample:
ρ_f = min_{i ∈ [1,m]} ρ_f(x_i) = min_{i ∈ [1,m]} |α · h(x_i)| / ||α||_1.
27 SVM vs AdaBoost.
- features or base hypotheses: SVM: Φ(x) = (Φ_1(x), ..., Φ_N(x))ᵀ; AdaBoost: h(x) = (h_1(x), ..., h_N(x))ᵀ.
- predictor: SVM: x ↦ w · Φ(x); AdaBoost: x ↦ α · h(x).
- geometric margin: SVM: |w · Φ(x)| / ||w||_2 = d_2(Φ(x), hyperplane); AdaBoost: |α · h(x)| / ||α||_1 = d_∞(h(x), hyperplane).
- confidence margin: SVM: y(w · Φ(x)); AdaBoost: y(α · h(x)).
- regularization: SVM: ||w||_2; AdaBoost: ||α||_1 (L1-AdaBoost).
28 Maximum-Margin Solutions. [Figure: maximum-margin separating hyperplanes under norm ||·||_2 (SVM) and norm ||·||_∞ (boosting).]
29 But, Does AdaBoost Maximize the Margin? No: AdaBoost may converge to a margin that is significantly below the maximum margin (Rudin et al., 2004) (e.g., 1/3 instead of 3/8)! Lower bound: AdaBoost can asymptotically achieve a margin that is at least ρ_max/2 if the data is separable and some conditions on the base learners hold (Rätsch and Warmuth, 2002). Several boosting-type margin-maximization algorithms exist, but their performance in practice is not clear or not reported.
30 AdaBoost's Weak Learning Condition. Definition: the edge of a base classifier h_t for a distribution D over the training sample is
γ(t) = 1/2 − ε_t = (1/2) Σ_{i=1}^m y_i h_t(x_i) D(i).
Condition: there exists γ > 0 such that, for any distribution D over the training sample and any base classifier h_t, γ(t) ≥ γ.
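A small sketch of the identity between the edge and the weighted error (the toy predictions below are arbitrary):

```python
import numpy as np

def edge(pred, y, D):
    """Edge gamma(t) = (1/2) sum_i y_i h_t(x_i) D(i)."""
    return 0.5 * float(np.sum(y * pred * D))

def weighted_error(pred, y, D):
    """Weighted error eps_t = sum over misclassified points of D(i)."""
    return float(D[pred != y].sum())
```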
31 Zero-Sum Games. Definition: payoff matrix M = (M_ij) ∈ R^{m×n}: m possible actions (pure strategies) for the row player; n possible actions for the column player; M_ij is the payoff for the row player (= loss for the column player) when row plays i and column plays j.
Example (rock-paper-scissors):
            rock   paper  scissors
rock          0     -1      +1
paper        +1      0      -1
scissors     -1     +1       0
32 Mixed Strategies (von Neumann, 1928). Definition: the row player selects a distribution p over the rows, the column player a distribution q over the columns. The expected payoff for the row player is
E_{i∼p, j∼q}[M_ij] = Σ_{i=1}^m Σ_{j=1}^n p_i M_ij q_j = pᵀMq.
von Neumann's minimax theorem:
max_p min_q pᵀMq = min_q max_p pᵀMq,
equivalent to:
max_p min_{j ∈ [1,n]} pᵀM e_j = min_q max_{i ∈ [1,m]} e_iᵀ M q.
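For the rock-paper-scissors example, the uniform mixed strategy is optimal for both players and the value of the game is 0, which the expected-payoff formula pᵀMq lets us check directly:

```python
import numpy as np

# payoff matrix for the row player in rock-paper-scissors
M = np.array([[ 0., -1.,  1.],
              [ 1.,  0., -1.],
              [-1.,  1.,  0.]])

p = np.full(3, 1 / 3)    # row player's mixed strategy
q = np.full(3, 1 / 3)    # column player's mixed strategy
value = p @ M @ q        # expected payoff p^T M q
```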
33 John von Neumann (1903-1957). [Photograph.]
34 AdaBoost and Game Theory. Game: Player A selects a point x_i, i ∈ [1, m]; Player B selects a base learner h_t, t ∈ [1, T]. Payoff matrix M ∈ {−1, +1}^{m×T}: M_it = y_i h_t(x_i).
von Neumann's theorem (assume H finite):
2γ = min_D max_{h ∈ H} Σ_{i=1}^m D(i) y_i h(x_i) = max_α min_{i ∈ [1,m]} y_i Σ_{t=1}^T α_t h_t(x_i) / ||α||_1 = ρ.
35 Consequences. Weak learning condition = non-zero margin: ρ = 2γ > 0; thus, it is possible to search for a non-zero margin. AdaBoost: a (suboptimal) search for the corresponding α; it achieves at least half of the maximum margin. Weak learning = strong condition: the condition implies linear separability with margin 2γ > 0.
36 Linear Programming Problem. Maximizing the margin:
ρ = max_α min_{i ∈ [1,m]} y_i (α · x_i) / ||α||_1.
This is equivalent to the following convex optimization (LP) problem:
max ρ subject to: y_i (α · x_i) ≥ ρ, i ∈ [1, m]; ||α||_1 = 1.
Note that, for ||α||_1 = 1, |α · x| = d_∞(x, H) with H = {x : α · x = 0}.
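This LP can be solved directly, for instance with scipy.optimize.linprog. The sketch below uses a small hand-built instance (the matrix A of values y_i h_t(x_i) is an assumption of the example) and restricts to α ≥ 0, so that ||α||_1 = Σ_t α_t becomes a linear constraint; by symmetry, the optimal margin for this instance is exactly 1/3:

```python
import numpy as np
from scipy.optimize import linprog

# y_i h_t(x_i) for 6 points and 3 base hypotheses (hypothetical instance)
A = np.array([[ 1., -1.,  1.],
              [ 1., -1.,  1.],
              [ 1.,  1., -1.],
              [ 1.,  1., -1.],
              [-1.,  1.,  1.],
              [-1.,  1.,  1.]])
m, N = A.shape

# variables: (alpha_1, ..., alpha_N, rho); maximize rho == minimize -rho
c = np.zeros(N + 1)
c[-1] = -1.0
# margin constraints: rho - sum_t alpha_t y_i h_t(x_i) <= 0 for each point i
A_ub = np.hstack([-A, np.ones((m, 1))])
b_ub = np.zeros(m)
# simplex constraint: sum_t alpha_t = 1 (with alpha >= 0, this is ||alpha||_1 = 1)
A_eq = np.hstack([np.ones((1, N)), np.zeros((1, 1))])
b_eq = np.array([1.0])
bounds = [(0, None)] * N + [(None, None)]   # alpha >= 0, rho free

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
rho_star = -res.fun                          # maximum L1-margin
```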
37 Advantages of AdaBoost. Simple: straightforward implementation. Efficient: complexity O(mN log m + mNT) for stumps; when m and N are not too large, the algorithm is quite fast. Theoretical guarantees: but still many questions; AdaBoost is not designed to maximize the margin; regularized versions of AdaBoost.
38 Weaker Aspects. Parameters: need to determine T, the number of rounds of boosting: stopping criterion; need to determine the base learners: risk of overfitting or low margins. Noise: severely damages the accuracy of AdaBoost (Dietterich, 2000).
39 Other Boosting Algorithms. arc-gv (Breiman, 1996): designed to maximize the margin, but outperformed by AdaBoost in experiments (Reyzin and Schapire, 2006). L1-regularized AdaBoost (Rätsch et al., 2001): outperforms AdaBoost in experiments (Cortes et al., 2014). DeepBoost (Cortes et al., 2014): more favorable learning guarantees; outperforms both AdaBoost and L1-regularized AdaBoost in experiments.
40 References.
Corinna Cortes, Mehryar Mohri, and Umar Syed. Deep boosting. In ICML, 2014.
Leo Breiman. Bagging predictors. Machine Learning, 24(2), 1996.
Thomas G. Dietterich. An experimental comparison of three methods for constructing ensembles of decision trees: bagging, boosting, and randomization. Machine Learning, 40(2), 2000.
Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1), 1997.
G. Lebanon and J. Lafferty. Boosting and maximum likelihood for exponential models. In NIPS.
Ron Meir and Gunnar Rätsch. An introduction to boosting and leveraging. In Advanced Lectures on Machine Learning (LNAI 2600), 2003.
J. von Neumann. Zur Theorie der Gesellschaftsspiele. Mathematische Annalen, 100, 1928.
41 References.
Cynthia Rudin, Ingrid Daubechies, and Robert E. Schapire. The dynamics of AdaBoost: cyclic behavior and convergence of margins. Journal of Machine Learning Research, 5, 2004.
Gunnar Rätsch and Manfred K. Warmuth. Maximizing the margin with boosting. In Proceedings of the 15th Annual Conference on Computational Learning Theory (COLT 02), Sydney, Australia, July 2002.
Lev Reyzin and Robert E. Schapire. How boosting the margin can also boost classifier complexity. In ICML, 2006.
Robert E. Schapire. The boosting approach to machine learning: an overview. In D. D. Denison, M. H. Hansen, C. Holmes, B. Mallick, and B. Yu, editors, Nonlinear Estimation and Classification. Springer.
Robert E. Schapire and Yoav Freund. Boosting: Foundations and Algorithms. The MIT Press.
Robert E. Schapire, Yoav Freund, Peter Bartlett, and Wee Sun Lee. Boosting the margin: a new explanation for the effectiveness of voting methods. The Annals of Statistics, 26(5), 1998.
More informationNyström Method vs Random Fourier Features: A Theoretical and Empirical Comparison
yströ Method vs : A Theoretical and Epirical Coparison Tianbao Yang, Yu-Feng Li, Mehrdad Mahdavi, Rong Jin, Zhi-Hua Zhou Machine Learning Lab, GE Global Research, San Raon, CA 94583 Michigan State University,
More informationEstimating Parameters for a Gaussian pdf
Pattern Recognition and achine Learning Jaes L. Crowley ENSIAG 3 IS First Seester 00/0 Lesson 5 7 Noveber 00 Contents Estiating Paraeters for a Gaussian pdf Notation... The Pattern Recognition Proble...3
More informationList Scheduling and LPT Oliver Braun (09/05/2017)
List Scheduling and LPT Oliver Braun (09/05/207) We investigate the classical scheduling proble P ax where a set of n independent jobs has to be processed on 2 parallel and identical processors (achines)
More informationarxiv: v1 [cs.ds] 3 Feb 2014
arxiv:40.043v [cs.ds] 3 Feb 04 A Bound on the Expected Optiality of Rando Feasible Solutions to Cobinatorial Optiization Probles Evan A. Sultani The Johns Hopins University APL evan@sultani.co http://www.sultani.co/
More informationConsistent Multiclass Algorithms for Complex Performance Measures. Supplementary Material
Consistent Multiclass Algoriths for Coplex Perforance Measures Suppleentary Material Notations. Let λ be the base easure over n given by the unifor rando variable (say U over n. Hence, for all easurable
More informationConvex Programming for Scheduling Unrelated Parallel Machines
Convex Prograing for Scheduling Unrelated Parallel Machines Yossi Azar Air Epstein Abstract We consider the classical proble of scheduling parallel unrelated achines. Each job is to be processed by exactly
More informationMultiple Instance Learning with Query Bags
Multiple Instance Learning with Query Bags Boris Babenko UC San Diego bbabenko@cs.ucsd.edu Piotr Dollár California Institute of Technology pdollar@caltech.edu Serge Belongie UC San Diego sjb@cs.ucsd.edu
More informationBoosting. Acknowledgment Slides are based on tutorials from Robert Schapire and Gunnar Raetsch
. Machine Learning Boosting Prof. Dr. Martin Riedmiller AG Maschinelles Lernen und Natürlichsprachliche Systeme Institut für Informatik Technische Fakultät Albert-Ludwigs-Universität Freiburg riedmiller@informatik.uni-freiburg.de
More informationInteractive Markov Models of Evolutionary Algorithms
Cleveland State University EngagedScholarship@CSU Electrical Engineering & Coputer Science Faculty Publications Electrical Engineering & Coputer Science Departent 2015 Interactive Markov Models of Evolutionary
More informationLearnability and Stability in the General Learning Setting
Learnability and Stability in the General Learning Setting Shai Shalev-Shwartz TTI-Chicago shai@tti-c.org Ohad Shair The Hebrew University ohadsh@cs.huji.ac.il Nathan Srebro TTI-Chicago nati@uchicago.edu
More informationNew Bounds for Learning Intervals with Implications for Semi-Supervised Learning
JMLR: Workshop and Conference Proceedings vol (1) 1 15 New Bounds for Learning Intervals with Iplications for Sei-Supervised Learning David P. Helbold dph@soe.ucsc.edu Departent of Coputer Science, University
More informationarxiv: v1 [cs.lg] 8 Jan 2019
Data Masking with Privacy Guarantees Anh T. Pha Oregon State University phatheanhbka@gail.co Shalini Ghosh Sasung Research shalini.ghosh@gail.co Vinod Yegneswaran SRI international vinod@csl.sri.co arxiv:90.085v
More informationBounds on the Minimax Rate for Estimating a Prior over a VC Class from Independent Learning Tasks
Bounds on the Miniax Rate for Estiating a Prior over a VC Class fro Independent Learning Tasks Liu Yang Steve Hanneke Jaie Carbonell Deceber 01 CMU-ML-1-11 School of Coputer Science Carnegie Mellon University
More informationResearch in Area of Longevity of Sylphon Scraies
IOP Conference Series: Earth and Environental Science PAPER OPEN ACCESS Research in Area of Longevity of Sylphon Scraies To cite this article: Natalia Y Golovina and Svetlana Y Krivosheeva 2018 IOP Conf.
More informationFoundations of Machine Learning Lecture 5. Mehryar Mohri Courant Institute and Google Research
Foundations of Machine Learning Lecture 5 Mehryar Mohri Courant Institute and Google Research ohri@cis.nyu.edu Kernel Methods Motivation Non-linear decision boundary. Efficient coputation of inner products
More informationA Self-Organizing Model for Logical Regression Jerry Farlow 1 University of Maine. (1900 words)
1 A Self-Organizing Model for Logical Regression Jerry Farlow 1 University of Maine (1900 words) Contact: Jerry Farlow Dept of Matheatics Univeristy of Maine Orono, ME 04469 Tel (07) 866-3540 Eail: farlow@ath.uaine.edu
More informationLecture 21. Interior Point Methods Setup and Algorithm
Lecture 21 Interior Point Methods In 1984, Kararkar introduced a new weakly polynoial tie algorith for solving LPs [Kar84a], [Kar84b]. His algorith was theoretically faster than the ellipsoid ethod and
More informationSymmetrization and Rademacher Averages
Stat 928: Statistical Learning Theory Lecture: Syetrization and Radeacher Averages Instructor: Sha Kakade Radeacher Averages Recall that we are interested in bounding the difference between epirical and
More informationA Simple Regression Problem
A Siple Regression Proble R. M. Castro March 23, 2 In this brief note a siple regression proble will be introduced, illustrating clearly the bias-variance tradeoff. Let Y i f(x i ) + W i, i,..., n, where
More informationCOMS 4771 Lecture Boosting 1 / 16
COMS 4771 Lecture 12 1. Boosting 1 / 16 Boosting What is boosting? Boosting: Using a learning algorithm that provides rough rules-of-thumb to construct a very accurate predictor. 3 / 16 What is boosting?
More informationPrecise Statements of Convergence for AdaBoost and arc-gv
Contemporary Mathematics Precise Statements of Convergence for AdaBoost and arc-gv Cynthia Rudin, Robert E Schapire, and Ingrid Daubechies We wish to dedicate this paper to Leo Breiman Abstract We present
More informationPredictive Vaccinology: Optimisation of Predictions Using Support Vector Machine Classifiers
Predictive Vaccinology: Optiisation of Predictions Using Support Vector Machine Classifiers Ivana Bozic,2, Guang Lan Zhang 2,3, and Vladiir Brusic 2,4 Faculty of Matheatics, University of Belgrade, Belgrade,
More information