Deep Boosting
Mehryar Mohri (Courant Institute & Google Research)
Joint work with Corinna Cortes (Google Research) and Umar Syed (Google Research).
Ensemble Methods in ML
Combining several base classifiers to create a more accurate one:
- Bagging (Breiman, 1996).
- AdaBoost (Freund and Schapire, 1997).
- Stacking (Smyth and Wolpert, 1999).
- Bayesian averaging (MacKay, 1996).
- Other averaging schemes, e.g., (Freund et al., 2004).
Often very effective in practice. Benefit of favorable learning guarantees.
Convex Combinations
Base classifier set H:
- boosting stumps;
- decision trees with limited depth or number of leaves.
Ensemble combinations: convex hull of the base classifier set,
conv(H) = \{ \sum_{t=1}^T \alpha_t h_t : \alpha_t \geq 0, \sum_{t=1}^T \alpha_t \leq 1, \forall t, h_t \in H \}.
Rademacher Complexity
Definitions: let H be a family of functions mapping from X to R and let S = (x_1, \ldots, x_m) be a sample of size m.
- Empirical Rademacher complexity of H:
  \hat{R}_S(H) = \frac{1}{m} E_\sigma [ \sup_{h \in H} \sum_{i=1}^m \sigma_i h(x_i) ].
- Rademacher complexity of H:
  R_m(H) = E_{S \sim D^m} [ \hat{R}_S(H) ].
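As an illustration (not part of the original slides), for a *finite* hypothesis set the empirical Rademacher complexity can be estimated by Monte Carlo: sample the \sigma_i's, take the supremum over hypotheses, and average. The function name and setup below are my own assumptions.

```python
import numpy as np

def empirical_rademacher(preds, n_trials=2000, seed=0):
    """Monte Carlo estimate of the empirical Rademacher complexity
    R_S(H) = (1/m) E_sigma[ sup_{h in H} sum_i sigma_i h(x_i) ]
    for a finite hypothesis set given by its prediction matrix.

    preds: array of shape (n_hypotheses, m), row j holding h_j(x_1..x_m).
    """
    rng = np.random.default_rng(seed)
    n_h, m = preds.shape
    total = 0.0
    for _ in range(n_trials):
        sigma = rng.choice([-1.0, 1.0], size=m)  # Rademacher variables
        total += np.max(preds @ sigma)           # sup over h of sum_i sigma_i h(x_i)
    return total / (n_trials * m)
```

For a symmetric pair {h, -h} on a single point with h(x_1) = 1, the supremum is |sigma_1| = 1 in every trial, so the estimate is exactly 1.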
Ensembles - Margin Bound
Theorem (Koltchinskii and Panchenko, 2002): let H be a family of real-valued functions. Fix \rho > 0. Then, for any \delta > 0, with probability at least 1 - \delta, the following holds for all f = \sum_{t=1}^T \alpha_t h_t \in conv(H):
R(f) \leq \hat{R}_{S,\rho}(f) + \frac{2}{\rho} R_m(H) + \sqrt{ \frac{\log(1/\delta)}{2m} },
where \hat{R}_{S,\rho}(f) = \frac{1}{m} \sum_{i=1}^m 1_{y_i f(x_i) \leq \rho}.
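As a quick sketch (the function name is my own), the empirical margin loss \hat{R}_{S,\rho}(f) appearing in the bound is just the fraction of training points whose margin y_i f(x_i) is at most \rho:

```python
import numpy as np

def margin_loss(y, f_values, rho):
    """Empirical rho-margin loss: fraction of points with y_i * f(x_i) <= rho."""
    return float(np.mean(y * f_values <= rho))
```

For example, with labels (1, -1, 1), scores (0.5, -0.2, -0.3), and rho = 0.3, the margins are (0.5, 0.2, -0.3), so two of three points violate the margin.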
Questions
Can we use a much richer or deeper base classifier set?
- Richer families are needed for difficult tasks in speech and image processing.
- But the generalization bound indicates a risk of overfitting.
AdaBoost (Freund and Schapire, 1997)
Description: coordinate descent applied to
F(\alpha) = \sum_{i=1}^m e^{-y_i f(x_i)} = \sum_{i=1}^m \exp( -y_i \sum_{t=1}^T \alpha_t h_t(x_i) ).
Guarantees: ensemble margin bound.
- But AdaBoost does not maximize the margin!
- Some margin-maximizing algorithms such as arc-gv are outperformed by AdaBoost! (Reyzin and Schapire, 2006)
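A minimal sketch of AdaBoost viewed as coordinate descent on F(\alpha), for a finite pool of base classifiers; the helper names and the early-exit handling are my own simplifications, not the talk's pseudocode.

```python
import numpy as np

def adaboost(X, y, base, n_rounds=50):
    """Minimal AdaBoost over a finite pool of base classifiers.

    base: list of callables h(X) -> array of {-1, +1} predictions.
    Returns a list of (weight, classifier) pairs.
    """
    m = len(y)
    D = np.full(m, 1.0 / m)                      # distribution over examples
    ensemble = []
    for _ in range(n_rounds):
        errs = [np.sum(D * (h(X) != y)) for h in base]
        j = int(np.argmin(errs))                 # best coordinate (direction)
        eps = errs[j]
        if eps == 0:                             # perfect base classifier found
            ensemble.append((1.0, base[j]))
            break
        if eps >= 0.5:                           # no better than random: stop
            break
        alpha = 0.5 * np.log((1 - eps) / eps)    # closed-form step size
        D = D * np.exp(-alpha * y * base[j](X))  # exponential-loss reweighting
        D = D / D.sum()
        ensemble.append((alpha, base[j]))
    return ensemble

def predict(ensemble, X):
    return np.sign(sum(a * h(X) for a, h in ensemble))
```

On a toy separable 1-D problem, the returned ensemble classifies the training sample correctly.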
Suspicions
- Complexity of the hypotheses used: arc-gv tends to use deeper decision trees to achieve a larger margin.
- Notion of margin: the minimal margin is perhaps not the appropriate notion; the margin distribution is key.
Can we shed more light on these questions?
This Talk
Main question: how can we design ensemble algorithms that can succeed even with very deep decision trees or other complex sets?
- Theory.
- Practical algorithms.
- Experimental results.
Theory
Base Classifier Set
Decomposition of H in terms of sub-families H_1, H_2, \ldots, H_p, or their union H_1 \cup \cdots \cup H_p.
Ensemble Family
Non-negative linear ensembles F = conv( \cup_{k=1}^p H_k ):
f = \sum_{t=1}^T \alpha_t h_t, with \alpha_t \geq 0, \sum_{t=1}^T \alpha_t \leq 1, h_t \in H_{k_t}.
Ideas
Use hypotheses drawn from the H_k's with larger k, but allocate more weight to hypotheses drawn from those with smaller k.
- How can we determine quantitatively the amounts of mixture weight apportioned to the different families?
- Can we provide learning guarantees guiding these choices?
- Connection with SRM (structural risk minimization)?
New Learning Guarantee
Theorem: Fix \rho > 0. Then, for any \delta > 0, with probability at least 1 - \delta, the following holds for all f = \sum_{t=1}^T \alpha_t h_t \in F:
R(f) \leq \hat{R}_{S,\rho}(f) + \frac{4}{\rho} \sum_{t=1}^T \alpha_t R_m(H_{k_t}) + O\left( \sqrt{ \frac{\log p}{\rho^2 m} } \right).
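To make the role of each term concrete, here is a small numeric sketch of the bound's right-hand side. The constant inside the O(·) is set to 1 purely as an illustrative assumption, and the function name is mine.

```python
import numpy as np

def deep_boosting_bound(emp_margin_loss, alpha, r, rho, m, p):
    """Sketch of the right-hand side of the bound:
    R(f) <= R_hat_{S,rho}(f) + (4/rho) sum_t alpha_t R_m(H_{k_t})
            + O(sqrt(log p / (rho^2 m))).
    The O(.) term is evaluated with constant 1 (assumption for illustration).
    """
    complexity = (4.0 / rho) * float(np.dot(alpha, r))  # weight-dependent term
    slack = np.sqrt(np.log(p) / (rho**2 * m))
    return emp_margin_loss + complexity + slack
```

The complexity term is a weighted sum: hypotheses from richer sub-families (larger R_m(H_{k_t})) should receive smaller weights \alpha_t to keep the bound small.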
Consequences
Complexity term with explicit dependency on the mixture weights:
- quantitative guide for controlling the weights assigned to more complex sub-families;
- the bound can be used to inspire, or directly define, an ensemble algorithm.
Algorithms
Set-Up
H_1, \ldots, H_p: disjoint sub-families of functions taking values in [-1, +1].
Further assumption (not necessary): symmetric sub-families, i.e., h \in H_k \Rightarrow -h \in H_k.
Notation: for any h \in \cup_{k=1}^p H_k, d(h) denotes the index of the sub-family containing h (h \in H_{d(h)}), and r_t = R_m(H_{d(h_t)}).
Derivation (1)
The learning bound suggests seeking \alpha \geq 0 with \sum_{t=1}^T \alpha_t \leq 1 to minimize
\frac{1}{m} \sum_{i=1}^m 1_{y_i \sum_{t=1}^T \alpha_t h_t(x_i) \leq \rho} + \frac{4}{\rho} \sum_{t=1}^T \alpha_t r_t.
Derivation (2)
Since f = \sum_{t=1}^T \alpha_t h_t and f/\rho have the same error, we can instead search for \alpha \geq 0 with \sum_{t=1}^T \alpha_t \leq 1/\rho to minimize
\frac{1}{m} \sum_{i=1}^m 1_{y_i \sum_{t=1}^T \alpha_t h_t(x_i) \leq 1} + 4 \sum_{t=1}^T \alpha_t r_t.
Derivation (3)
Let u \mapsto \Phi(-u) be a decreasing convex function upper bounding u \mapsto 1_{u \leq 0}, with \Phi differentiable. Minimizing a convex upper bound (convex optimization):
\min_{\alpha \geq 0} \frac{1}{m} \sum_{i=1}^m \Phi\left( 1 - y_i \sum_{t=1}^T \alpha_t h_t(x_i) \right) + \lambda \sum_{t=1}^T \alpha_t r_t
s.t. \sum_{t=1}^T \alpha_t \leq 1/\rho,
where the parameter \lambda \geq 0 is introduced.
Derivation (4)
Moving the constraint to the objective and using the fact that the sub-families are symmetric leads to:
\min_{\alpha \geq 0} \frac{1}{m} \sum_{i=1}^m \Phi\left( 1 - y_i \sum_{j=1}^N \alpha_j h_j(x_i) \right) + \sum_{j=1}^N (\lambda r_j + \beta) \alpha_j,
where \lambda, \beta \geq 0, and for each hypothesis we keep either h or -h.
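For concreteness, a sketch of evaluating this objective with the exponential surrogate \Phi(-u) = \exp(-u); the variable names and matrix layout are my own assumptions.

```python
import numpy as np

def deepboost_objective(alpha, preds, y, r, lam, beta):
    """Regularized objective from the derivation (exponential surrogate):
    (1/m) sum_i Phi(1 - y_i sum_j alpha_j h_j(x_i)) + sum_j (lam*r_j + beta) alpha_j.

    preds: (N, m) matrix of base-classifier predictions h_j(x_i).
    r:     per-hypothesis complexity estimates r_j.
    """
    margins = y * (alpha @ preds)                 # y_i * f(x_i) for each i
    surrogate = np.mean(np.exp(1.0 - margins))    # Phi(-u) = exp(-u), at u = margin - 1
    penalty = np.sum((lam * np.asarray(r) + beta) * alpha)
    return surrogate + penalty
```

At \alpha = 0 all margins vanish and the objective equals \exp(1), which gives a simple sanity check.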
Convex Surrogates
Two principal choices:
- Exponential loss: \Phi(-u) = \exp(-u).
- Logistic loss: \Phi(-u) = \log_2(1 + \exp(-u)).
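Both surrogates upper bound the indicator 1_{u \leq 0} and take the value 1 at u = 0; a direct transcription (function names are mine):

```python
import numpy as np

def exp_loss(u):
    """Exponential surrogate: Phi(-u) = exp(-u)."""
    return np.exp(-u)

def logistic_loss(u):
    """Logistic surrogate: Phi(-u) = log_2(1 + exp(-u))."""
    return np.log2(1.0 + np.exp(-u))
```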
DeepBoost Algorithm
Coordinate descent applied to the convex objective:
- non-differentiable function;
- definition of maximum coordinate descent.
Direction & Step
Maximum direction: definition based on the weighted error
\epsilon_{t,j} = \frac{1}{2} \left[ 1 - E_{i \sim D_t} [ y_i h_j(x_i) ] \right],
where D_t is the distribution over the sample at iteration t.
Step: closed-form expressions for the exponential and logistic losses; general case: line search.
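The weighted errors for all coordinates can be computed in one matrix product; a sketch with assumed shapes (N base classifiers by m examples; the function name is mine):

```python
import numpy as np

def weighted_errors(D, preds, y):
    """eps_{t,j} = (1/2) * (1 - E_{i ~ D_t}[ y_i h_j(x_i) ]) for every j.

    D:     distribution over the m examples at iteration t.
    preds: (N, m) matrix of base-classifier predictions h_j(x_i) in {-1, +1}.
    """
    return 0.5 * (1.0 - preds @ (D * y))
```

A classifier agreeing with y on every point has error 0; one agreeing on exactly half the mass has error 1/2.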
Pseudocode
\Lambda_j = \lambda r_j + \beta.
Notes
- Straightforward updates.
- Sparsity step.
- Parallel implementation.
Connections with Previous Work
For \lambda = \beta = 0, DeepBoost coincides with:
- AdaBoost (Freund and Schapire, 1997), run with the union of the sub-families, for the exponential loss;
- the algorithm of (Friedman et al., 1998), run with the union of the sub-families, for the logistic loss.
For \lambda = 0 and \beta \neq 0, DeepBoost coincides with L1-regularized AdaBoost (Duchi and Singer, 2009):
- close to unnormalized Arcing (Breiman, 1999);
- also related to AdaBoost (Rätsch and Warmuth, 2005).
Experiments
Rad. Complexity Estimates
Benefit of the data-dependent analysis: empirical estimates of each \hat{R}_S(H_k).
- Example: for a kernel function K_k, \hat{R}_S(H_k) \leq \frac{\sqrt{Tr[K_k]}}{m}.
- Alternatively, upper bounds in terms of growth functions: R_m(H_k) \leq \sqrt{ \frac{2 \log \Pi_{H_k}(m)}{m} }.
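A sketch of the kernel-based estimate, for the unit-ball hypothesis class of the RKHS of K_k; the function name is my own.

```python
import numpy as np

def kernel_rademacher_bound(K):
    """Data-dependent estimate R_S(H_k) <= sqrt(Tr[K_k]) / m, where K is the
    m x m kernel matrix on the sample (unit-norm RKHS hypothesis class)."""
    m = K.shape[0]
    return np.sqrt(np.trace(K)) / m
```

For a normalized kernel (K_ii = 1, e.g. Gaussian), the trace is m and the estimate reduces to 1/sqrt(m).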
Experiments (1)
Family of base classifiers defined by boosting stumps:
- H_1^{stumps}: boosting stumps (threshold functions). In dimension d, \Pi_{H_1^{stumps}}(m) \leq 2md, thus
  R_m(H_1^{stumps}) \leq \sqrt{ \frac{2 \log(2md)}{m} }.
- H_2^{stumps}: decision trees of depth 2, with the same question at the internal nodes of depth 1. In dimension d, \Pi_{H_2^{stumps}}(m) \leq (2m)^2 \frac{d(d-1)}{2}, thus
  R_m(H_2^{stumps}) \leq \sqrt{ \frac{2 \log(2m^2 d(d-1))}{m} }.
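Plugging numbers into these growth-function bounds shows how modest the complexity increase is from depth-1 to depth-2 stumps; a direct transcription of the two formulas (function names are mine):

```python
import numpy as np

def stump1_bound(m, d):
    """R_m(H1_stumps) <= sqrt(2 log(2 m d) / m), from growth function <= 2md."""
    return np.sqrt(2 * np.log(2 * m * d) / m)

def stump2_bound(m, d):
    """R_m(H2_stumps) <= sqrt(2 log(2 m^2 d (d-1)) / m),
    from growth function <= (2m)^2 d(d-1)/2."""
    return np.sqrt(2 * np.log(2 * m**2 * d * (d - 1)) / m)
```

Since the growth function only enters through its logarithm, the richer family H_2^{stumps} has only a slightly larger complexity estimate than H_1^{stumps}.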
Experiments (1)
Base classifier set: H_1^{stumps} \cup H_2^{stumps}.
Data sets:
- same UCI Irvine data sets as (Breiman, 1999) and (Reyzin and Schapire, 2006);
- OCR data sets used by (Reyzin and Schapire, 2006): ocr17, ocr49;
- MNIST data sets: ocr17-mnist, ocr49-mnist.
Experiments with the exponential loss. Comparison with AdaBoost and AdaBoost-L1.
Data Statistics
Data set          Examples  Attributes
breastcancer      699       9
ionosphere        351       34
german (numeric)  1000      24
diabetes          768       8
ocr17             2000      196
ocr49             2000      196
ocr17-mnist       15170     400
ocr49-mnist       13782     400
Experiments - Stumps, Exponential Loss
Experiments (2)
Family of base classifiers defined by decision trees of depth k.
Since VC-dim(H_k^{trees}) \leq (2^k + 1) \log_2(d + 1),
R_m(H_k^{trees}) \leq \sqrt{ \frac{(2^{k+1} + 2) \log_2(d + 1) \log(m)}{m} }.
Base classifier set: \cup_{k=1}^K H_k^{trees}.
Same data sets as in Experiments (1). Both exponential and logistic losses. Comparison with AdaBoost and AdaBoost-L1.
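The per-depth complexity estimates r_k used for trees follow directly from this bound; a transcription (the function name is mine). The exponential growth in k is what penalizes deeper trees in the DeepBoost objective.

```python
import numpy as np

def tree_rademacher_bound(k, m, d):
    """R_m(H_k_trees) <= sqrt((2^(k+1) + 2) * log2(d + 1) * log(m) / m),
    from VC-dim(H_k_trees) <= (2^k + 1) * log2(d + 1) for depth-k trees
    over d-dimensional inputs."""
    return np.sqrt((2**(k + 1) + 2) * np.log2(d + 1) * np.log(m) / m)
```

The estimate increases monotonically with the tree depth k, so deeper sub-families receive larger penalties r_k.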
Experiments - Trees, Exponential Loss
Experiments - Trees, Logistic Loss
Margin Distribution
Extension to Multi-Class
Similar data-dependent learning guarantee proven for the multi-class setting:
- bound depending on the mixture weights and the complexity of the sub-families.
Deep Boosting algorithm for multi-class:
- similar extension taking into account the complexities of the sub-families;
- several variants depending on the number of classes;
- different possible loss functions for each variant.
Multi-Class Learning Guarantee
Theorem: Fix \rho > 0. Then, for any \delta > 0, with probability at least 1 - \delta, the following holds for all f = \sum_{t=1}^T \alpha_t h_t \in F:
R(f) \leq \hat{R}_{S,\rho}(f) + \frac{4c^2}{\rho} \sum_{t=1}^T \alpha_t R_m(\Pi_1(H_{k_t})) + O\left( \sqrt{ \frac{\log p}{\rho^2 m} \log\left( \frac{\rho^2 m}{\log p} \right) } \right),
where c is the number of classes and \Pi_1(H_k) = \{ x \mapsto h(x, y) : y \in Y, h \in H_k \}.
Conclusion
Deep Boosting: ensemble learning with increasingly complex families.
- Data-dependent theoretical analysis.
- Algorithm grounded in theory.
- Extension to multi-class; ranking and other losses.
- Enhancement of many existing algorithms.
- Compares favorably to AdaBoost and AdaBoost-L1 in preliminary experiments.