Deep Boosting


Corinna Cortes, Google Research, 111 8th Avenue, New York, NY 10011. CORINNA@GOOGLE.COM
Mehryar Mohri, Courant Institute and Google Research, 251 Mercer Street, New York, NY 10012. MOHRI@CIMS.NYU.EDU
Umar Syed, Google Research, 111 8th Avenue, New York, NY 10011. USYED@GOOGLE.COM

Proceedings of the 31st International Conference on Machine Learning, Beijing, China, 2014. JMLR: W&CP volume 32. Copyright 2014 by the authors.

Abstract

We present a new ensemble learning algorithm, DeepBoost, which can use as base classifiers a hypothesis set containing deep decision trees, or members of other rich or complex families, and succeed in achieving high accuracy without overfitting the data. The key to the success of the algorithm is a capacity-conscious criterion for the selection of the hypotheses. We give new data-dependent learning bounds for convex ensembles expressed in terms of the Rademacher complexities of the sub-families composing the base classifier set, and the mixture weight assigned to each sub-family. Our algorithm directly benefits from these guarantees since it seeks to minimize the corresponding learning bound. We give a full description of our algorithm, including the details of its derivation, and report the results of several experiments showing that its performance compares favorably to that of AdaBoost and Logistic Regression and their L1-regularized variants.

1. Introduction

Ensemble methods are general techniques in machine learning for combining several predictors or experts to create a more accurate one. In the batch learning setting, techniques such as bagging, boosting, stacking, error-correction techniques, Bayesian averaging, or other averaging schemes are prominent instances of these methods (Breiman, 1996; Freund & Schapire, 1997; Smyth & Wolpert, 1999; MacKay, 1991; Freund et al., 2004). Ensemble methods often significantly improve performance in practice (Quinlan, 1996; Bauer & Kohavi, 1999; Caruana et al., 2004; Dietterich, 2000; Schapire, 2003) and benefit from favorable learning guarantees. In particular, AdaBoost and its variants are based on a rich theoretical analysis, with performance guarantees in terms of the margins of the training samples (Schapire et al., 1997; Koltchinskii & Panchenko, 2002).

Standard ensemble algorithms such as AdaBoost combine functions selected from a base classifier hypothesis set H. In many successful applications of AdaBoost, H is reduced to the so-called boosting stumps, that is, decision trees of depth one. For some difficult tasks in speech or image processing, simple boosting stumps are not sufficient to achieve a high level of accuracy. It is tempting then to use a more complex hypothesis set, for example the set of all decision trees with depth bounded by some relatively large number. But existing learning guarantees for AdaBoost depend not only on the margin and the number of training examples, but also on the complexity of H, measured in terms of its VC-dimension or its Rademacher complexity (Schapire et al., 1997; Koltchinskii & Panchenko, 2002). These learning bounds become looser when too complex a base classifier set H is used. They suggest a risk of overfitting, which indeed can be observed in some experiments with AdaBoost (Grove & Schuurmans, 1998; Schapire, 1999; Dietterich, 2000; Rätsch et al., 2001b).

This paper explores the design of alternative ensemble algorithms using as base classifiers a hypothesis set H that may contain very deep decision trees, or members of some other very rich or complex families, and that can yet succeed in achieving a higher performance level.

Assume that the set of base classifiers H can be decomposed as the union of p disjoint families H_1, ..., H_p ordered by increasing complexity, where H_k, k in [1, p], could be, for example, the set of decision trees of depth k, or a set of functions based on monomials of degree k. Figure 1 shows a pictorial illustration. Of course, if we strictly confine ourselves to using hypotheses belonging only to families H_k with small k, then we are effectively using a smaller base classifier set H with favorable guarantees. But, to succeed in some challenging tasks, the use of a few more complex hypotheses could be needed.

Figure 1. Base classifier set H decomposed in terms of sub-families H_1, ..., H_p or their unions.

The main idea behind the design of our algorithms is that an ensemble based on hypotheses drawn from H_1, ..., H_p can achieve a higher accuracy by making use of hypotheses drawn from H_k's with large k if it allocates more weight to hypotheses drawn from H_k's with small k. But can we determine quantitatively the amounts of mixture weights apportioned to different families? Can we provide learning guarantees for such algorithms? Note that our objective is somewhat reminiscent of that of model selection, in particular Structural Risk Minimization (SRM) (Vapnik, 1998), but it differs from it in that we do not wish to limit our base classifier set to some optimal H_q = union of H_1, ..., H_q. Rather, we seek the freedom of using as base hypotheses even relatively deep trees from rich H_k's, with the promise of doing so infrequently, or that of reserving them a somewhat small weight contribution. This provides the flexibility of learning with deep hypotheses.

We present a new algorithm, DeepBoost, whose design is precisely guided by the ideas just discussed. Our algorithm is grounded in a solid theoretical analysis that we present in Section 2. We give new data-dependent learning bounds for convex ensembles. These guarantees are expressed in terms of the Rademacher complexities of the sub-families H_k and the mixture weight assigned to each H_k, in addition to the familiar margin terms and sample size. Our capacity-conscious algorithm is derived via the application of a coordinate descent technique seeking to minimize such learning bounds. We give a full description of our algorithm, including the details of its derivation and its pseudocode (Section 3), and discuss its connection with previous boosting-style algorithms. We also report the results of several experiments (Section 4) demonstrating that its performance compares favorably to that of AdaBoost, which is known to be one of the most competitive binary classification algorithms.

2. Data-dependent learning guarantees for convex ensembles with multiple hypothesis sets

Non-negative linear combination ensembles such as boosting or bagging typically assume that base functions are selected from the same hypothesis set H. Margin-based generalization bounds were given for ensembles of base functions taking values in {-1, +1} by Schapire et al. (1997) in terms of the VC-dimension of H. Tighter margin bounds with simpler proofs were later given by Koltchinskii & Panchenko (2002), see also (Bartlett & Mendelson, 2002), for the more general case of a family H taking arbitrary real values, in terms of the Rademacher complexity of H. Here, we also consider base hypotheses taking arbitrary real values, but assume that they can be selected from several distinct hypothesis sets H_1, ..., H_p, with p >= 1, and present margin-based learning guarantees in terms of the Rademacher complexities of these sets. Remarkably, the complexity term of these bounds admits an explicit dependency on the mixture coefficients defining the ensembles. Thus, the ensemble family we consider is F = conv(H_1 ∪ ... ∪ H_p), that is the family of functions f of the form f = Σ_{t=1}^T α_t h_t, where α = (α_1, ..., α_T) is in the simplex and where, for each t in [1, T], h_t is in H_{k_t} for some k_t in [1, p].

Let X denote the input space. H_1, ..., H_p are thus families of functions mapping from X to R. We consider the familiar supervised learning scenario and assume that training and test points are drawn i.i.d. according to some distribution D over X × {-1, +1}, and denote by S = ((x_1, y_1), ..., (x_m, y_m)) a training sample of size m drawn according to D^m.

Let ρ > 0. For a function f taking values in R, we denote by R(f) its binary classification error, by R_ρ(f) its ρ-margin error, and by \hat R_{S,ρ}(f) its empirical margin error:

$$R(f) = \mathop{\mathrm E}_{(x,y)\sim D}\big[1_{yf(x)\le 0}\big],\qquad R_\rho(f) = \mathop{\mathrm E}_{(x,y)\sim D}\big[1_{yf(x)\le \rho}\big],\qquad \widehat R_{S,\rho}(f) = \mathop{\mathrm E}_{(x,y)\sim S}\big[1_{yf(x)\le \rho}\big],$$

where the notation (x, y) ~ S indicates that (x, y) is drawn according to the empirical distribution defined by S.

The following theorem gives a margin-based Rademacher complexity bound for learning with such functions in the binary classification case. As with other Rademacher complexity learning guarantees, our bound is data-dependent, which is an important and favorable characteristic of our results. For p = 1, that is for the special case of a single hypothesis set, the analysis coincides with that of the standard ensemble margin bounds (Koltchinskii & Panchenko, 2002).

Theorem 1. Assume p > 1. Fix ρ > 0. Then, for any δ > 0, with probability at least 1 − δ over the choice of a sample S of size m drawn i.i.d. according to D^m, the following inequality holds for all f = Σ_{t=1}^T α_t h_t in F:

$$R(f) \le \widehat R_{S,\rho}(f) + \frac{4}{\rho}\sum_{t=1}^T \alpha_t R_m(H_{k_t}) + \frac{2}{\rho}\sqrt{\frac{\log p}{m}} + \sqrt{\bigg\lceil \frac{4}{\rho^2}\log\Big(\frac{\rho^2 m}{\log p}\Big)\bigg\rceil \frac{\log p}{m} + \frac{\log\frac{2}{\delta}}{2m}}.$$

Thus, R(f) ≤ \hat R_{S,ρ}(f) + (4/ρ) Σ_{t=1}^T α_t R_m(H_{k_t}) + C(m, p), with C(m, p) = O((1/ρ) sqrt((log p)/m · log[(ρ² m)/(log p)])).
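To make the role of each term concrete, here is a minimal Python sketch that evaluates the right-hand side of Theorem 1 numerically under our reading of the bound. All input values below (weights, complexity estimates, ρ, m, p, δ) are hypothetical placeholders, not quantities from the paper.

```python
import math

def theorem1_bound(emp_margin_loss, alphas, rademachers, rho, m, p, delta):
    """Evaluate the right-hand side of the margin bound of Theorem 1.

    emp_margin_loss -- empirical rho-margin error R_{S,rho}(f)
    alphas          -- mixture weights alpha_t (non-negative, summing to at most 1)
    rademachers     -- estimates of R_m(H_{k_t}), one per alpha_t
    """
    complexity = (4.0 / rho) * sum(a * r for a, r in zip(alphas, rademachers))
    slack = (2.0 / rho) * math.sqrt(math.log(p) / m)
    # assumes rho**2 * m > log(p) so that the logarithm below is positive
    n_star = math.ceil((4.0 / rho**2) * math.log(rho**2 * m / math.log(p)))
    tail = math.sqrt(n_star * math.log(p) / m + math.log(2.0 / delta) / (2.0 * m))
    return emp_margin_loss + complexity + slack + tail

# Deeper families (larger Rademacher estimates) are given smaller weights.
print(theorem1_bound(0.05, [0.7, 0.2, 0.1], [0.01, 0.05, 0.2],
                     rho=0.5, m=100000, p=5, delta=0.05))
```

The sketch makes explicit that the complexity term is a weighted average: shifting weight from a complex family to a simpler one lowers the bound even if the number of complex hypotheses used is unchanged.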

This result is remarkable since the complexity term in the right-hand side of the bound admits an explicit dependency on the mixture coefficients α_t: it is a weighted average of Rademacher complexities with mixture weights α_t, t in [1, T]. Thus, the second term of the bound suggests that, while some hypothesis sets H_k used for learning could have a large Rademacher complexity, this may not be detrimental to generalization if the corresponding total mixture weight (the sum of the α_t's corresponding to that hypothesis set) is relatively small. Such complex families offer the potential of achieving a better margin on the training sample.

The theorem cannot be proven via a standard Rademacher complexity analysis such as that of Koltchinskii & Panchenko (2002), since the complexity term of the bound would then be the Rademacher complexity of the family of hypotheses F = conv(H_1 ∪ ... ∪ H_p) and would not depend on the specific weights α_t defining a given function f. Furthermore, the complexity term of a standard Rademacher complexity analysis is always lower bounded by the complexity term appearing in our bound. Indeed, since R_m(conv(H_1 ∪ ... ∪ H_p)) = R_m(H_1 ∪ ... ∪ H_p), the following lower bound holds for any choice of the non-negative mixture weights α_t summing to one:

$$R_m(F) = R_m\Big(\bigcup_{k=1}^p H_k\Big) \ge \max_{k\in[1,p]} R_m(H_k) \ge \sum_{t=1}^T \alpha_t R_m(H_{k_t}). \qquad (1)$$

Thus, Theorem 1 provides a finer learning bound than the one obtained via a standard Rademacher complexity analysis. The full proof of the theorem is given in Appendix A. Our proof technique exploits standard tools used to derive Rademacher complexity learning bounds (Koltchinskii & Panchenko, 2002) as well as a technique used by Schapire, Freund, Bartlett, and Lee (1997) to derive early VC-dimension margin bounds. Using other standard techniques as in (Koltchinskii & Panchenko, 2002; Mohri et al., 2012), Theorem 1 can be straightforwardly generalized to hold uniformly for all ρ > 0 at the price of an additional term in O(sqrt((log log_2(2/ρ))/m)).

3. Algorithm

In this section, we use the learning guarantees of Section 2 to derive a capacity-conscious ensemble algorithm for binary classification.

3.1. Optimization problem

Let H_1, ..., H_p be p disjoint families of functions taking values in [-1, +1], with increasing Rademacher complexities R_m(H_k), k in [1, p]. We will assume that the hypothesis sets H_k are symmetric, that is, for any h in H_k, we also have -h in H_k, which holds for most hypothesis sets typically considered in practice. This assumption is not necessary, but it helps simplify the presentation of our algorithm. For any hypothesis h in H_1 ∪ ... ∪ H_p, we denote by d(h) the index of the hypothesis set it belongs to, that is h in H_{d(h)}.

The bound of Theorem 1 holds uniformly for all ρ > 0 and functions f in conv(H_1 ∪ ... ∪ H_p). Since the last term of the bound does not depend on α, it suggests selecting α to minimize

$$G(\alpha) = \frac{1}{m}\sum_{i=1}^m 1_{y_i \sum_{t=1}^T \alpha_t h_t(x_i) \le \rho} + \frac{4}{\rho}\sum_{t=1}^T \alpha_t r_t,$$

where r_t = R_m(H_{d(h_t)}). Since for any ρ > 0, f and f/ρ admit the same generalization error, we can instead search for α ≥ 0 with Σ_{t=1}^T α_t ≤ 1/ρ, which leads to

$$\min_{\alpha \ge 0}\ \frac{1}{m}\sum_{i=1}^m 1_{y_i \sum_{t=1}^T \alpha_t h_t(x_i) \le 1} + 4\sum_{t=1}^T \alpha_t r_t \quad \text{s.t.}\quad \sum_{t=1}^T \alpha_t \le \frac{1}{\rho}.$$

The first term of the objective is not a convex function of α and its minimization is known to be computationally hard. Thus, we will consider instead a convex upper bound. Let u → Φ(−u) be a non-increasing convex function upper bounding u → 1_{u≤0} over R, with Φ differentiable over R and Φ'(u) ≠ 0 for all u. Φ may be selected to be, for example, the exponential function as in AdaBoost (Freund & Schapire, 1997) or the logistic function. Using such an upper bound, we obtain the following convex optimization problem:

$$\min_{\alpha \ge 0}\ \frac{1}{m}\sum_{i=1}^m \Phi\Big(1 - y_i \sum_{t=1}^T \alpha_t h_t(x_i)\Big) + \lambda \sum_{t=1}^T \alpha_t r_t \quad \text{s.t.}\quad \sum_{t=1}^T \alpha_t \le \frac{1}{\rho}, \qquad (2)$$

where we introduced a parameter λ ≥ 0 controlling the balance between the magnitude of the values taken by the function Φ and the second term. Introducing a Lagrange variable β ≥ 0 associated to the constraint in (2), the problem can be equivalently written as

$$\min_{\alpha \ge 0}\ \frac{1}{m}\sum_{i=1}^m \Phi\Big(1 - y_i \sum_{t=1}^T \alpha_t h_t(x_i)\Big) + \sum_{t=1}^T (\lambda r_t + \beta)\,\alpha_t.$$

Here, β is a parameter that can be freely selected by the algorithm, since any choice of its value is equivalent to a choice of ρ in (2). (The condition Σ_{t=1}^T α_t = 1 of Theorem 1 can be relaxed to Σ_{t=1}^T α_t ≤ 1. To see this, use for example a null hypothesis, h_t = 0 for some t.)
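For reference, here is a minimal NumPy sketch of this Lagrangian objective, assuming the base-hypothesis predictions have been precomputed in a matrix H (one column per hypothesis). It is an illustration under the exponential surrogate and our own variable names, not the authors' implementation.

```python
import numpy as np

def lagrangian_objective(alpha, H, y, r, lam, beta):
    """(1/m) sum_i Phi(1 - y_i sum_t alpha_t h_t(x_i)) + sum_t (lam*r_t + beta)*alpha_t,
    with the exponential surrogate Phi(u) = exp(u) and alpha >= 0.

    H : (m, T) array of base-hypothesis predictions h_t(x_i) in [-1, +1]
    y : (m,) array of labels in {-1, +1}
    r : (T,) Rademacher complexity estimates of the family containing each h_t
    """
    u = 1.0 - y * (H @ alpha)                 # 1 - y_i f(x_i)
    surrogate = np.exp(u).mean()              # swap in np.log2(1 + np.exp(u)) for the logistic loss
    penalty = np.sum((lam * r + beta) * alpha)
    return surrogate + penalty
```

Because the penalty coefficient λ r_t + β grows with the complexity estimate r_t, hypotheses from deeper families are retained only when they reduce the surrogate loss by more than their capacity cost.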

Let {h_1, ..., h_N} be the set of distinct base functions, and let G be the objective function based on that collection:

$$G(\alpha) = \frac{1}{m}\sum_{i=1}^m \Phi\Big(1 - y_i \sum_{j=1}^N \alpha_j h_j(x_i)\Big) + \sum_{j=1}^N (\lambda r_j + \beta)\,\alpha_j,$$

with α = (α_1, ..., α_N) in R^N. Note that we can drop the requirement α ≥ 0, since the hypothesis sets are symmetric and α_t h_t = (−α_t)(−h_t). For each hypothesis h, we keep either h or −h in {h_1, ..., h_N}. Using the notation

$$\Lambda_j = \lambda r_j + \beta, \qquad (3)$$

for all j in [1, N], our optimization problem can then be rewritten as min_α F(α) with

$$F(\alpha) = \frac{1}{m}\sum_{i=1}^m \Phi\Big(1 - y_i \sum_{j=1}^N \alpha_j h_j(x_i)\Big) + \sum_{j=1}^N \Lambda_j |\alpha_j|, \qquad (4)$$

with no non-negativity constraint on α. The function F is convex as a sum of convex functions and admits a subdifferential at every α in R^N. We can design a boosting-style algorithm by applying coordinate descent to F(α). Let α_t = (α_{t,1}, ..., α_{t,N}) denote the vector obtained after t ≥ 1 iterations and let α_0 = 0. Let e_k denote the kth unit vector in R^N, k in [1, N]. The direction e_k and the step η selected at the tth round are those minimizing F(α_{t−1} + η e_k), that is

$$F(\alpha_{t-1} + \eta e_k) = \frac{1}{m}\sum_{i=1}^m \Phi\big(1 - y_i f_{t-1}(x_i) - \eta\, y_i h_k(x_i)\big) + \sum_{j\ne k}\Lambda_j|\alpha_{t-1,j}| + \Lambda_k|\alpha_{t-1,k} + \eta|,$$

where f_{t−1} = Σ_{j=1}^N α_{t−1,j} h_j. For any t in [1, T], we denote by D_t the distribution defined by

$$D_t(i) = \frac{\Phi'\big(1 - y_i f_{t-1}(x_i)\big)}{S_t}, \qquad (5)$$

where S_t is a normalization factor, S_t = Σ_{i=1}^m Φ'(1 − y_i f_{t−1}(x_i)). For any s in [1, T] and j in [1, N], we denote by ε_{s,j} the weighted error of hypothesis h_j for the distribution D_s:

$$\epsilon_{s,j} = \frac{1}{2}\Big[1 - \mathop{\mathrm E}_{i\sim D_s}[y_i h_j(x_i)]\Big]. \qquad (6)$$

3.2. DeepBoost

Figure 2 shows the pseudocode of the algorithm DeepBoost derived by applying coordinate descent to the objective function (4). The details of the derivation are given in Appendix B.

DEEPBOOST(S = ((x_1, y_1), ..., (x_m, y_m)))
 1  for i ← 1 to m do
 2      D_1(i) ← 1/m
 3  for t ← 1 to T do
 4      for j ← 1 to N do
 5          if (α_{t-1,j} ≠ 0) then
 6              d_j ← (ε_{t,j} − 1/2) + sgn(α_{t-1,j}) Λ_j m / (2 S_t)
 7          elseif (|ε_{t,j} − 1/2| ≤ Λ_j m / (2 S_t)) then
 8              d_j ← 0
 9          else d_j ← (ε_{t,j} − 1/2) − sgn(ε_{t,j} − 1/2) Λ_j m / (2 S_t)
10      k ← argmax_{j in [1,N]} |d_j|
11      ε_t ← ε_{t,k}
12      if (|(1 − ε_t) e^{α_{t-1,k}} − ε_t e^{−α_{t-1,k}}| ≤ Λ_k m / S_t) then
13          η_t ← −α_{t-1,k}
14      elseif ((1 − ε_t) e^{α_{t-1,k}} − ε_t e^{−α_{t-1,k}} > Λ_k m / S_t) then
15          η_t ← log[ −Λ_k m / (2 ε_t S_t) + sqrt( (Λ_k m / (2 ε_t S_t))² + (1 − ε_t)/ε_t ) ]
16      else η_t ← log[ +Λ_k m / (2 ε_t S_t) + sqrt( (Λ_k m / (2 ε_t S_t))² + (1 − ε_t)/ε_t ) ]
17      α_t ← α_{t-1} + η_t e_k
18      S_{t+1} ← Σ_{i=1}^m Φ'(1 − y_i Σ_{j=1}^N α_{t,j} h_j(x_i))
19      for i ← 1 to m do
20          D_{t+1}(i) ← Φ'(1 − y_i Σ_{j=1}^N α_{t,j} h_j(x_i)) / S_{t+1}
21  f ← Σ_{j=1}^N α_{T,j} h_j
22  return f

Figure 2. Pseudocode of the DeepBoost algorithm for both the exponential loss and the logistic loss. The expression of the weighted error ε_{t,j} is given in (6). In the generic case of a surrogate loss Φ different from the exponential or logistic losses, η_t is found instead via a line search or other numerical methods from η_t = argmin_η F(α_{t−1} + η e_k).

In the special cases of the exponential loss (Φ(−u) = exp(−u)) or the logistic loss (Φ(−u) = log_2(1 + exp(−u))), a closed-form expression is given for the step size (lines 12-16), which is the same in both cases (see Sections B.4 and B.5). In the generic case, the step size η_t can be found using a line search or other numerical methods.

Note that when the condition of line 12 is satisfied, the step taken by the algorithm cancels out the coordinate along the direction k, thereby leading to a sparser result. This is consistent with the fact that the objective function contains a second term based on a weighted L1-norm, which favors sparsity.
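The direction-selection step (lines 4-10 of Figure 2) can be written densely as follows. This is a small NumPy sketch of our reading of the pseudocode, with argument names chosen here for illustration.

```python
import numpy as np

def select_direction(D, H, y, alpha, Lam, S_t):
    """Return the coordinate k maximizing |d_j| (lines 4-10 of Figure 2) and eps_{t,k}.

    D     : (m,) current distribution D_t over the training points
    H     : (m, N) base-hypothesis predictions h_j(x_i)
    y     : (m,) labels in {-1, +1}
    alpha : (N,) current weight vector alpha_{t-1}
    Lam   : (N,) penalties Lambda_j = lam * r_j + beta
    S_t   : normalization factor sum_i Phi'(1 - y_i f_{t-1}(x_i))
    """
    m = len(y)
    eps = 0.5 * (1.0 - (D * y) @ H)          # weighted errors eps_{t,j}, eq. (6)
    shift = Lam * m / (2.0 * S_t)
    d = np.where(alpha != 0,
                 (eps - 0.5) + np.sign(alpha) * shift,
                 np.where(np.abs(eps - 0.5) <= shift, 0.0,
                          (eps - 0.5) - np.sign(eps - 0.5) * shift))
    k = int(np.argmax(np.abs(d)))
    return k, eps[k]
```

The shift Λ_j m/(2 S_t) acts as a per-hypothesis soft threshold: a hypothesis from a complex family must have an edge exceeding its penalty before it can be selected at all.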

Our algorithm is related to several other boosting-type algorithms devised in the past. For λ = 0 and β = 0 and using the exponential surrogate loss, it coincides with AdaBoost (Freund & Schapire, 1997), with precisely the same direction and the same step (1/2) log[(1 − ε_t)/ε_t], using H = H_1 ∪ ... ∪ H_p as the hypothesis set for base learners. This corresponds to ignoring the complexity term of our bound as well as the control of the sum of the mixture weights via β. For λ = 0 and β = 0 and using the logistic surrogate loss, our algorithm also coincides with additive logistic regression (Friedman et al., 1998). In the special case where λ = 0 and β ≠ 0 and for the exponential surrogate loss, our algorithm matches L1-norm regularized AdaBoost (e.g., see Rätsch et al., 2001a). For the same choice of the parameters and for the logistic surrogate loss, our algorithm matches the L1-norm regularized additive Logistic Regression studied by Duchi & Singer (2009), using the base learner hypothesis set H = H_1 ∪ ... ∪ H_p. H may in general be very rich. The key foundation of our algorithm and analysis is instead to take into account the relative complexity of the sub-families H_k. Also, note that L1-norm regularized AdaBoost and Logistic Regression can be viewed as algorithms minimizing the learning bound obtained via the standard Rademacher complexity analysis (Koltchinskii & Panchenko, 2002), using the exponential or logistic surrogate losses. Instead, the objective function minimized by our algorithm is based on the generalization bound of Theorem 1, which as discussed earlier is a finer bound (see (1)). For λ = 0 but β ≠ 0, our algorithm is also close to the so-called unnormalized Arcing (Breiman, 1999) or AdaBoost_ρ (Rätsch & Warmuth, 2002) using H as a hypothesis set. AdaBoost_ρ coincides with AdaBoost modulo the step size, which is more conservative than that of AdaBoost and depends on ρ. Rätsch & Warmuth (2005) give another variant of the algorithm that does not require knowing the best ρ; see also the related work of Kivinen & Warmuth (1999) and Warmuth et al. (2006).

Our algorithm directly benefits from the learning guarantees given in Section 2 since it seeks to minimize the bound of Theorem 1. In the next section, we report the results of our experiments with DeepBoost. Let us mention that we have also designed an alternative deep boosting algorithm that we briefly describe and discuss in Appendix C.

4. Experiments

An additional benefit of the learning bounds presented in Section 2 is that they are data-dependent. They are based on the Rademacher complexity of the base hypothesis sets H_k, which in some cases can be well estimated from the training sample. The algorithm DeepBoost directly inherits this advantage. For example, if the hypothesis set H_k is based on a positive definite kernel with sample matrix K_k, it is known that its empirical Rademacher complexity can be upper bounded by sqrt(Tr[K_k])/m and lower bounded by sqrt(Tr[K_k])/(sqrt(2) m). In other cases, when H_k is a family of functions taking binary values, we can use an upper bound on the Rademacher complexity in terms of the growth function Π_{H_k}(m) of H_k: R_m(H_k) ≤ sqrt(2 log Π_{H_k}(m) / m). Thus, for the family H_1^stumps of boosting stumps in dimension d, Π_{H_1^stumps}(m) ≤ 2md, since there are 2m distinct threshold functions for each dimension with m points. Thus, the following inequality holds:

$$R_m(H_1^{\mathrm{stumps}}) \le \sqrt{\frac{2\log(2md)}{m}}. \qquad (7)$$

Similarly, we consider the family H_2^stumps of decision trees of depth 2 with the same question at both internal nodes of depth 1. We have Π_{H_2^stumps}(m) ≤ (2m)² d(2d − 1), since there are d(2d − 1)/2 distinct trees of this type and each induces at most 2(2m)² labelings. Thus, we can write

$$R_m(H_2^{\mathrm{stumps}}) \le \sqrt{\frac{2\log\big((2m)^2 d(2d-1)\big)}{m}}. \qquad (8)$$

More generally, we also consider the family H_k^trees of all binary decision trees of depth k. For this family it is known that VCdim(H_k^trees) ≤ (2^k + 1) log_2(d + 2) (Mansour, 1997). More generally, the VC-dimension of T_n, the family of decision trees with n nodes in dimension d, can be bounded by (2n + 1) log_2(d + 2) (see for example Mohri et al., 2012). Since R_m(H) ≤ sqrt(2 VCdim(H) log(m + 1) / m) for any hypothesis class H, we have

$$R_m(T_n) \le \sqrt{\frac{(4n + 2)\log_2(d + 2)\log(m + 1)}{m}}. \qquad (9)$$

The experiments with DeepBoost described below use either H^stumps = H_1^stumps ∪ H_2^stumps or H_K^trees = H_1^trees ∪ ... ∪ H_K^trees, for some K > 0, as the base hypothesis sets. For any hypothesis in these sets, DeepBoost uses the upper bounds given above as a proxy for the Rademacher complexity of the set to which it belongs. We leave it to the future to experiment with finer data-dependent estimates or upper bounds on the Rademacher complexity, which could further improve the performance of our algorithm.

Recall that each iteration of DeepBoost searches for the base hypothesis that is optimal with respect to a certain criterion (see lines 5-10 of Figure 2). While an exhaustive search is feasible for H^stumps, it would be far too expensive to visit all trees in H_K^trees when K is large. Therefore, when using H_K^trees (and also H^stumps) as the base hypotheses, we use the following heuristic search procedure in each iteration t: First, the optimal tree h_1 in H_1^trees is found via exhaustive search. Next, for all 1 < k ≤ K, a locally optimal tree h_k in H_k^trees is found by considering only trees that can be obtained by adding a single layer of leaves to h_{k−1}. Finally, we select the best hypothesis in the set {h_1, ..., h_K, h'_1, ..., h'_{t−1}}, where h'_1, ..., h'_{t−1} are the hypotheses selected in previous iterations.
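The complexity proxies of equations (7) and (9) are simple closed-form quantities; the following sketch computes them for a hypothetical dataset size and feature dimension (the helper names are ours).

```python
import math

def rademacher_stumps(m, d):
    """Upper bound (7) on the Rademacher complexity of depth-1 stumps."""
    return math.sqrt(2.0 * math.log(2 * m * d) / m)

def rademacher_tree(n_nodes, m, d):
    """Upper bound (9) for binary decision trees with n_nodes nodes in dimension d."""
    return math.sqrt((4 * n_nodes + 2) * math.log2(d + 2) * math.log(m + 1) / m)

# Proxies r_k for depth-k trees; a depth-k binary tree has at most 2**k - 1 internal nodes.
m, d, K = 5000, 20, 4
r = [rademacher_tree(2 ** k - 1, m, d) for k in range(1, K + 1)]
```

The proxy grows roughly as the square root of the number of tree nodes and only logarithmically in m and d, which is why moderately deep trees can still receive nonzero weight.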

Table 1. Results for boosted decision stumps and the exponential loss function. [For each dataset (breastcancer, ionosphere, german, diabetes, ocr17, ocr49, ocr17-mnist, ocr49-mnist), the table reports Error, std dev, Avg tree size, and Avg no. of trees for AdaBoost with H_1^stumps, AdaBoost with H_2^stumps, AdaBoost-L1, and DeepBoost; the numerical entries are not reproduced here.]

Breiman (1999) and Reyzin & Schapire (2006) extensively investigated the relationship between the complexity of decision trees in an ensemble learned by AdaBoost and the generalization error of the ensemble. We tested DeepBoost on the same UCI datasets used by these authors, archive.ics.uci.edu/ml/datasets.html, specifically breastcancer, ionosphere, german (numeric) and diabetes. We also experimented with two optical character recognition datasets used by Reyzin & Schapire (2006), ocr17 and ocr49, which contain the handwritten digits 1 and 7, and 4 and 9, respectively. Finally, because these OCR datasets are fairly small, we also constructed the analogous datasets from all of MNIST, lecun.com/exdb/mnist/, which we call ocr17-mnist and ocr49-mnist. More details on all the datasets are given in Table 4, Appendix D.1.

As we discussed in Section 3.2, by fixing the parameters β and λ to certain values, we recover some known algorithms as special cases of DeepBoost. Our experiments compared DeepBoost to AdaBoost (β = λ = 0 with exponential loss), to Logistic Regression (β = λ = 0 with logistic loss), which we abbreviate as LogReg, to L1-norm regularized AdaBoost (e.g., see Rätsch et al., 2001a), abbreviated as AdaBoost-L1, and also to the L1-norm regularized additive Logistic Regression algorithm studied by Duchi & Singer (2009) (β > 0, λ = 0), abbreviated as LogReg-L1.

In the first set of experiments, reported in Table 1, we compared AdaBoost, AdaBoost-L1, and DeepBoost with the exponential loss Φ(−u) = exp(−u) and base hypotheses H^stumps. We tested standard AdaBoost with base hypotheses H_1^stumps and H_2^stumps. For AdaBoost-L1, we optimized over β in {2^{-i} : i = 6, ..., 10}, and for DeepBoost, we optimized over β in the same range and λ in {0.0001, 0.005, 0.01, 0.05, 0.1, 0.5}. The exact parameter optimization procedure is described below. In the second set of experiments, reported in Table 2, we used base hypotheses H_K^trees instead of H^stumps, where the maximum tree depth K was an additional parameter to be optimized. Specifically, for AdaBoost we optimized over K in {1, ..., 6}, for AdaBoost-L1 we optimized over those same values for K and over β in {10^{-i} : i = 3, ..., 7}, and for DeepBoost we optimized over those same values for K and β and over λ in {10^{-i} : i = 3, ..., 7}. The last set of experiments, reported in Table 3, is identical to the experiments reported in Table 2, except that we used the logistic loss Φ(−u) = log_2(1 + exp(−u)).

We used the following parameter optimization procedure in all experiments: Each dataset was randomly partitioned into 10 folds, and each algorithm was run 10 times, with a different assignment of folds to the training set, validation set and test set for each run. Specifically, for each run i in {0, ..., 9}, fold i was used for testing, fold i + 1 (mod 10) was used for validation, and the remaining folds were used for training. For each run, we selected the parameters that had the lowest error on the validation set and then measured the error of those parameters on the test set. The average error and the standard deviation of the error over all 10 runs are reported in Tables 1, 2 and 3, as are the average number of trees and the average size of the trees in the ensembles. In all of our experiments, the number of iterations was set to 100.
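The fold rotation described above can be summarized by the following small sketch (a paraphrase of the protocol, not code from the authors).

```python
def fold_roles(run, num_folds=10):
    """Fold assignment for one run: fold `run` is the test set, fold (run+1) mod 10
    is the validation set, and the remaining eight folds form the training set."""
    test = run
    valid = (run + 1) % num_folds
    train = [f for f in range(num_folds) if f not in (test, valid)]
    return train, valid, test

# Ten runs, each with a different rotation of the folds.
splits = [fold_roles(run) for run in range(10)]
```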

Table 2. Results for boosted decision trees and the exponential loss function. [Same datasets as in Table 1; Error, std dev, Avg tree size, and Avg no. of trees are reported for AdaBoost, AdaBoost-L1, and DeepBoost; the numerical entries are not reproduced here.]

Table 3. Results for boosted decision trees and the logistic loss function. [Same layout as Table 2, with LogReg, LogReg-L1, and DeepBoost as the compared algorithms.]

We also experimented with running each algorithm for up to 1,000 iterations, but observed that the test errors did not change significantly and, more importantly, that the ordering of the algorithms by their test errors was unchanged from 100 iterations to 1,000 iterations.

Observe that with the exponential loss, DeepBoost has a smaller test error than AdaBoost and AdaBoost-L1 on every dataset and for every set of base hypotheses, except for the ocr49-mnist dataset with decision trees, where its performance matches that of AdaBoost-L1. Similarly, with the logistic loss, DeepBoost always performs at least as well as LogReg or LogReg-L1. For the small-sized UCI datasets it is difficult to obtain statistically significant results but, for the larger ocr17-mnist and ocr49-mnist datasets, our results with DeepBoost are statistically significantly better (one-sided paired t-tests) in all three sets of experiments (three tables), except for ocr49-mnist in Table 3, where this holds only for the comparison with LogReg.

This across-the-board improvement is the result of DeepBoost's complexity-conscious ability to dynamically tune the sizes of the decision trees selected in each boosting round, trading off between training error and hypothesis class complexity. The selected tree sizes should depend on properties of the training set, and this is borne out by our experiments: for some datasets, such as breastcancer, DeepBoost selects trees that are smaller on average than the trees selected by AdaBoost-L1 or LogReg-L1, while for other datasets, such as german, the average tree size is larger. Note that AdaBoost and AdaBoost-L1 produce ensembles of trees that have a constant depth, since neither algorithm penalizes tree size (except for imposing a maximum tree depth K), while for DeepBoost the trees in one ensemble typically vary in size. Figure 3 plots the distribution of tree sizes for one run of DeepBoost.

Figure 3. Distribution of tree sizes when DeepBoost is run on the ionosphere dataset. [Histogram of tree sizes (frequency vs. tree size) for one fold.]

It should be noted that the columns for AdaBoost in Table 1 simply list the number of stumps to be the same as the number of boosting rounds; a careful examination of the ensembles for 100 rounds of boosting typically reveals a 5% duplication of stumps in the ensembles.

Theorem 1 is a margin-based generalization guarantee, and it is also the basis for the derivation of DeepBoost, so we should expect DeepBoost to induce large margins on the training set. Figure 4 shows the margin distributions for AdaBoost, AdaBoost-L1 and DeepBoost on the same subset of the ionosphere dataset.

Figure 4. Distribution of normalized margins for AdaBoost (upper right), AdaBoost-L1 (upper left) and DeepBoost (lower left) on the same subset of ionosphere. The cumulative margin distributions (lower right) illustrate that DeepBoost (red) induces larger margins on the training set than either AdaBoost (black) or AdaBoost-L1 (blue).

5. Conclusion

We presented a theoretical analysis of learning with a base hypothesis set composed of increasingly complex sub-families, including very deep or complex ones, and derived an algorithm, DeepBoost, which is precisely based on those guarantees. We also reported the results of experiments with this algorithm and compared its performance with that of AdaBoost and additive Logistic Regression, and their L1-norm regularized counterparts, in several tasks.

We have derived similar theoretical guarantees in the multi-class setting and used them to derive a family of new multi-class deep boosting algorithms that we will present and discuss elsewhere. Our theoretical analysis and algorithmic design could also be extended to ranking and to a broad class of loss functions. This should also lead to the generalization of several existing algorithms and their use with a richer hypothesis set structured as a union of families with different Rademacher complexities. In particular, the broad family of maximum entropy models and conditional maximum entropy models and their many variants, which includes the already discussed logistic regression, could all be extended in a similar way. The resulting DeepMaxent models, or their conditional versions, may admit an alternative theoretical justification that we will discuss elsewhere. Our algorithm can also be extended by considering non-differentiable convex surrogate losses such as the hinge loss. When used with kernel base classifiers, this leads to an algorithm we have named DeepSVM. The theory we developed could perhaps be further generalized to encompass the analysis of other learning techniques such as multi-layer neural networks.

Our analysis and algorithm also shed some new light on some remaining questions about the theory underlying AdaBoost. The primary theoretical justification for AdaBoost is a margin guarantee (Schapire et al., 1997; Koltchinskii & Panchenko, 2002). However, AdaBoost does not precisely maximize the minimum margin, while other algorithms such as arc-gv (Breiman, 1999) that are designed to do so tend not to outperform AdaBoost (Reyzin & Schapire, 2006). Two main reasons are suspected for this observation: (1) in order to achieve a better margin, algorithms such as arc-gv may tend to select deeper decision trees or, in general, more complex hypotheses, which may then affect their generalization; (2) while those algorithms achieve a better minimum margin, they do not achieve a better margin distribution. Our theory may help better understand and evaluate the effect of factor (1), since our learning bounds explicitly depend on the mixture weights and the contribution of each hypothesis set H_k to the definition of the ensemble function. However, our guarantees also suggest a better algorithm, DeepBoost.

Acknowledgments

We thank Vitaly Kuznetsov for his comments on an earlier draft of this paper. The work of M. Mohri was partly funded by an NSF IIS award.

References

Bartlett, Peter L. and Mendelson, Shahar. Rademacher and Gaussian complexities: Risk bounds and structural results. JMLR, 3, 2002.

Bauer, Eric and Kohavi, Ron. An empirical comparison of voting classification algorithms: Bagging, boosting, and variants. Machine Learning, 36(1-2):105-139, 1999.

Breiman, Leo. Bagging predictors. Machine Learning, 24(2):123-140, 1996.

Breiman, Leo. Prediction games and arcing algorithms. Neural Computation, 11(7):1493-1517, 1999.

Caruana, Rich, Niculescu-Mizil, Alexandru, Crew, Geoff, and Ksikes, Alex. Ensemble selection from libraries of models. In ICML, 2004.

Dietterich, Thomas G. An experimental comparison of three methods for constructing ensembles of decision trees: Bagging, boosting, and randomization. Machine Learning, 40(2):139-157, 2000.

Duchi, John C. and Singer, Yoram. Boosting with structural sparsity. In ICML, 2009.

Freund, Yoav and Schapire, Robert E. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119-139, 1997.

Freund, Yoav, Mansour, Yishay, and Schapire, Robert E. Generalization bounds for averaged classifiers. The Annals of Statistics, 32:1698-1722, 2004.

Friedman, Jerome, Hastie, Trevor, and Tibshirani, Robert. Additive logistic regression: a statistical view of boosting. Annals of Statistics, 28:2000, 1998.

Grove, Adam J. and Schuurmans, Dale. Boosting in the limit: Maximizing the margin of learned ensembles. In AAAI/IAAI, 1998.

Kivinen, Jyrki and Warmuth, Manfred K. Boosting as entropy projection. In COLT, 1999.

Koltchinskii, Vladimir and Panchenko, Dmitry. Empirical margin distributions and bounding the generalization error of combined classifiers. Annals of Statistics, 30, 2002.

MacKay, David J. C. Bayesian methods for adaptive models. PhD thesis, California Institute of Technology, 1991.

Mansour, Yishay. Pessimistic decision tree pruning based on tree size. In Proceedings of ICML, pp. 195-201, 1997.

Mohri, Mehryar, Rostamizadeh, Afshin, and Talwalkar, Ameet. Foundations of Machine Learning. The MIT Press, 2012.

Quinlan, J. Ross. Bagging, boosting, and C4.5. In AAAI/IAAI, Vol. 1, 1996.

Rätsch, Gunnar and Warmuth, Manfred K. Maximizing the margin with boosting. In COLT, 2002.

Rätsch, Gunnar and Warmuth, Manfred K. Efficient margin maximizing with boosting. Journal of Machine Learning Research, 6:2131-2152, 2005.

Rätsch, Gunnar, Mika, Sebastian, and Warmuth, Manfred K. On the convergence of leveraging. In NIPS, 2001a.

Rätsch, Gunnar, Onoda, Takashi, and Müller, Klaus-Robert. Soft margins for AdaBoost. Machine Learning, 42(3):287-320, 2001b.

Reyzin, Lev and Schapire, Robert E. How boosting the margin can also boost classifier complexity. In ICML, 2006.

Schapire, Robert E. Theoretical views of boosting and applications. In Proceedings of ALT 1999, volume 1720 of Lecture Notes in Computer Science. Springer, 1999.

Schapire, Robert E. The boosting approach to machine learning: An overview. In Nonlinear Estimation and Classification. Springer, 2003.

Schapire, Robert E., Freund, Yoav, Bartlett, Peter, and Lee, Wee Sun. Boosting the margin: A new explanation for the effectiveness of voting methods. In ICML, 1997.

Smyth, Padhraic and Wolpert, David. Linearly combining density estimators via stacking. Machine Learning, 36:59-83, July 1999.

Vapnik, Vladimir N. Statistical Learning Theory. Wiley-Interscience, 1998.

Warmuth, Manfred K., Liao, Jun, and Rätsch, Gunnar. Totally corrective boosting algorithms that maximize the margin. In ICML, 2006.

A. Proof of Theorem 1

Theorem 1. Assume p > 1. Fix ρ > 0. Then, for any δ > 0, with probability at least 1 − δ over the choice of a sample S of size m drawn i.i.d. according to D^m, the following inequality holds for all f = Σ_{t=1}^T α_t h_t in F:

$$R(f) \le \widehat R_{S,\rho}(f) + \frac{4}{\rho}\sum_{t=1}^T \alpha_t R_m(H_{k_t}) + \frac{2}{\rho}\sqrt{\frac{\log p}{m}} + \sqrt{\bigg\lceil \frac{4}{\rho^2}\log\Big(\frac{\rho^2 m}{\log p}\Big)\bigg\rceil \frac{\log p}{m} + \frac{\log\frac{2}{\delta}}{2m}}.$$

Thus, R(f) ≤ \hat R_{S,ρ}(f) + (4/ρ) Σ_{t=1}^T α_t R_m(H_{k_t}) + C(m, p), with C(m, p) = O((1/ρ) sqrt((log p)/m · log[(ρ² m)/(log p)])).

Proof. For a fixed h = (h_1, ..., h_T), any α in the simplex defines a distribution over {h_1, ..., h_T}. Sampling from {h_1, ..., h_T} according to α and averaging leads to functions g of the form g = (1/n) Σ_{t=1}^T n_t h_t for some n = (n_1, ..., n_T), with Σ_{t=1}^T n_t = n and h_t in H_{k_t}. For any N = (N_1, ..., N_p) with |N| = n, we consider the family of functions

$$G_{F,N} = \Big\{\frac{1}{n}\sum_{k=1}^p\sum_{j=1}^{N_k} h_{k,j} \;\Big|\; \forall (k,j)\in[p]\times[N_k],\ h_{k,j}\in H_k\Big\},$$

and the union of all such families, G_{F,n} = ∪_{|N|=n} G_{F,N}. Fix ρ > 0. For a fixed N, the Rademacher complexity of G_{F,N} can be bounded as follows for any m ≥ 1:

$$R_m(G_{F,N}) \le \frac{1}{n}\sum_{k=1}^p N_k\,R_m(H_k).$$

Thus, the following standard margin-based Rademacher complexity bound holds (Koltchinskii & Panchenko, 2002): for any δ > 0, with probability at least 1 − δ, for all g in G_{F,N},

$$R_\rho(g) \le \widehat R_{S,\rho}(g) + \frac{2}{\rho}\cdot\frac{1}{n}\sum_{k=1}^p N_k\,R_m(H_k) + \sqrt{\frac{\log\frac{1}{\delta}}{2m}}.$$

Since there are at most p^n possible p-tuples N with |N| = n, by the union bound, for any δ > 0, with probability at least 1 − δ, for all g in G_{F,n},

$$R_\rho(g) \le \widehat R_{S,\rho}(g) + \frac{2}{\rho}\cdot\frac{1}{n}\sum_{k=1}^p N_k\,R_m(H_k) + \sqrt{\frac{\log\frac{p^n}{\delta}}{2m}}.$$

Thus, with probability at least 1 − δ, for all functions g = (1/n)Σ_{t=1}^T n_t h_t with h_t in H_{k_t}, the following inequality holds:

$$R_\rho(g) \le \widehat R_{S,\rho}(g) + \frac{2}{\rho}\cdot\frac{1}{n}\sum_{k=1}^p\sum_{t:\,k_t = k} n_t\,R_m(H_{k_t}) + \sqrt{\frac{\log\frac{p^n}{\delta}}{2m}}.$$

Taking the expectation with respect to α and using E_α[n_t/n] = α_t, we obtain that for any δ > 0, with probability at least 1 − δ, for all h,

$$\mathop{\mathrm E}_\alpha\big[R_\rho(g) - \widehat R_{S,\rho}(g)\big] \le \frac{2}{\rho}\sum_{t=1}^T \alpha_t\,R_m(H_{k_t}) + \sqrt{\frac{\log\frac{p^n}{\delta}}{2m}}.$$

Fix n ≥ 1. Then, for any δ_n > 0, with probability at least 1 − δ_n,

$$\mathop{\mathrm E}_\alpha\big[R_{\rho/2}(g) - \widehat R_{S,\rho/2}(g)\big] \le \frac{4}{\rho}\sum_{t=1}^T \alpha_t\,R_m(H_{k_t}) + \sqrt{\frac{\log\frac{p^n}{\delta_n}}{2m}}.$$

Choose δ_n = δ/(2 p^{n−1}) for some δ > 0; then, for p ≥ 2, Σ_{n≥1} δ_n = δ/(2(1 − 1/p)) ≤ δ. Thus, for any δ > 0 and any n ≥ 1, with probability at least 1 − δ, the following holds for all h:

$$\mathop{\mathrm E}_\alpha\big[R_{\rho/2}(g) - \widehat R_{S,\rho/2}(g)\big] \le \frac{4}{\rho}\sum_{t=1}^T \alpha_t\,R_m(H_{k_t}) + \sqrt{\frac{2n\log p + \log\frac{2}{\delta}}{2m}}. \qquad (10)$$

Now, for any f = Σ_{t=1}^T α_t h_t in F and any g = (1/n)Σ_{t=1}^T n_t h_t, we can upper bound R(f) = Pr_{(x,y)~D}[yf(x) ≤ 0], the generalization error of f, as follows:

$$R(f) = \Pr[yf(x) - yg(x) + yg(x) \le 0] \le \Pr[yf(x) - yg(x) < -\rho/2] + \Pr[yg(x) \le \rho/2] = \Pr[yf(x) - yg(x) < -\rho/2] + R_{\rho/2}(g).$$

We can also write

$$\widehat R_{S,\rho/2}(g) = \widehat R_{S,\rho/2}(g - f + f) \le \Pr_S[yg(x) - yf(x) < -\rho/2] + \widehat R_{S,\rho}(f).$$

Combining these inequalities yields

$$\Pr_{(x,y)\sim D}[yf(x)\le 0] - \widehat R_{S,\rho}(f) \le \Pr[yf(x) - yg(x) < -\rho/2] + \Pr_S[yg(x) - yf(x) < -\rho/2] + R_{\rho/2}(g) - \widehat R_{S,\rho/2}(g).$$

Taking the expectation with respect to α yields

$$R(f) - \widehat R_{S,\rho}(f) \le \mathop{\mathrm E}_{x\sim D,\alpha}\big[1_{yf(x)-yg(x)<-\rho/2}\big] + \mathop{\mathrm E}_{x\sim S,\alpha}\big[1_{yg(x)-yf(x)<-\rho/2}\big] + \mathop{\mathrm E}_\alpha\big[R_{\rho/2}(g) - \widehat R_{S,\rho/2}(g)\big].$$

Since f = E_α[g], by Hoeffding's inequality, for any x,

$$\mathop{\mathrm E}_\alpha\big[1_{yf(x)-yg(x)<-\rho/2}\big] = \Pr_\alpha[yf(x)-yg(x)<-\rho/2] \le e^{-n\rho^2/8},\qquad \mathop{\mathrm E}_\alpha\big[1_{yg(x)-yf(x)<-\rho/2}\big] = \Pr_\alpha[yg(x)-yf(x)<-\rho/2] \le e^{-n\rho^2/8}.$$

Thus, for any fixed f in F, we can write

$$R(f) - \widehat R_{S,\rho}(f) \le 2e^{-n\rho^2/8} + \mathop{\mathrm E}_\alpha\big[R_{\rho/2}(g) - \widehat R_{S,\rho/2}(g)\big].$$

Thus, the following inequality holds:

$$\sup_{f\in F}\big(R(f) - \widehat R_{S,\rho}(f)\big) \le 2e^{-n\rho^2/8} + \sup_{h}\,\mathop{\mathrm E}_\alpha\big[R_{\rho/2}(g) - \widehat R_{S,\rho/2}(g)\big].$$

Therefore, in view of (10), for any δ > 0 and any n ≥ 1, with probability at least 1 − δ, the following holds for all f in F:

$$R(f) - \widehat R_{S,\rho}(f) \le \frac{4}{\rho}\sum_{t=1}^T \alpha_t R_m(H_{k_t}) + 2e^{-n\rho^2/8} + \sqrt{\frac{2n\log p + \log\frac{2}{\delta}}{2m}}.$$

To select n, we seek to minimize the function φ: n → 2e^{−nu} + sqrt(nv), with u = ρ²/8 and v = (log p)/m. φ is differentiable and, for all n, φ'(n) = −2u e^{−nu} + v/(2 sqrt(nv)). The minimum of φ is thus attained at a point n such that

$$\varphi'(n) = 0 \iff 2u\,e^{-nu} = \frac{1}{2}\sqrt{\frac{v}{n}} \iff (2nu)\,e^{-2nu} = \frac{v}{8u} \iff n = -\frac{1}{2u}\,W_{-1}\Big(-\frac{v}{8u}\Big),$$

where W_{−1} is the second branch of the Lambert function (the inverse of x → x e^x). It is not hard to verify that the following inequalities hold for all x in (0, 1/e]: log(1/x) ≤ −W_{−1}(−x) ≤ 2 log(1/x). Bounding −W_{−1} using the lower bound leads to the following choice for n:

$$n = \bigg\lceil\frac{1}{2u}\log\frac{8u}{v}\bigg\rceil = \bigg\lceil\frac{4}{\rho^2}\log\Big(\frac{\rho^2 m}{\log p}\Big)\bigg\rceil.$$

Plugging in this value of n, and using 2e^{−nρ²/8} ≤ (2/ρ) sqrt((log p)/m) for this choice, yields the following bound:

$$R(f) \le \widehat R_{S,\rho}(f) + \frac{4}{\rho}\sum_{t=1}^T \alpha_t R_m(H_{k_t}) + \frac{2}{\rho}\sqrt{\frac{\log p}{m}} + \sqrt{\bigg\lceil\frac{4}{\rho^2}\log\Big(\frac{\rho^2 m}{\log p}\Big)\bigg\rceil\frac{\log p}{m} + \frac{\log\frac{2}{\delta}}{2m}},$$

which concludes the proof.

Figure 5. Illustration of the directional derivatives in the three cases of definition (11) (panels (a), (b) and (c)).

B. Coordinate descent

B.1. Maximum descent coordinate

For a differentiable convex function, the definition of coordinate descent along the direction with maximal descent is standard: the direction selected is the one maximizing the absolute value of the directional derivative. Here, we clarify the definition of the maximal descent strategy for a non-differentiable convex function. For any function Q: R^N → R, we denote by Q'_+(α, e) the right directional derivative of Q at α in R^N and by Q'_−(α, e) its left directional derivative at α along the direction e in R^N, ||e|| = 1, when they exist:

$$Q'_+(\alpha, e) = \lim_{\eta\to 0^+}\frac{Q(\alpha + \eta e) - Q(\alpha)}{\eta},\qquad Q'_-(\alpha, e) = \lim_{\eta\to 0^-}\frac{Q(\alpha + \eta e) - Q(\alpha)}{\eta}.$$

For the remainder of this section, we will assume that Q is a convex function. It is known that in that case these quantities always exist and that Q'_−(α, e) ≤ Q'_+(α, e) for all α and e. The left and right directional derivatives coincide with the directional derivative Q'(α, e) of Q along the direction e when Q is differentiable at α along the direction e: Q'(α, e) = Q'_+(α, e) = Q'_−(α, e). For any j in [1, N], let e_j denote the jth unit vector in R^N. For any α in R^N and j in [1, N], we define the descent gradient δQ(α, e_j) of Q along the direction e_j as follows:

$$\delta Q(\alpha, e_j) = \begin{cases} 0 & \text{if } Q'_-(\alpha, e_j) \le 0 \le Q'_+(\alpha, e_j)\\ Q'_+(\alpha, e_j) & \text{if } Q'_-(\alpha, e_j) \le Q'_+(\alpha, e_j) \le 0\\ Q'_-(\alpha, e_j) & \text{if } 0 \le Q'_-(\alpha, e_j) \le Q'_+(\alpha, e_j).\end{cases} \qquad (11)$$

δQ(α, e_j) is the element of the subgradient along e_j that is the closest to 0. Figure 5 illustrates the three cases of this definition. Note that when Q is differentiable along e_j, then Q'_+(α, e_j) = Q'_−(α, e_j) and δQ(α, e_j) = Q'(α, e_j). The maximum descent coordinate can then be defined by

$$k = \mathop{\mathrm{argmax}}_{j\in[1,N]}\ \big|\delta Q(\alpha, e_j)\big|. \qquad (12)$$

This coincides with the standard definition when Q is convex and differentiable.
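For clarity, the case analysis of definition (11) can be expressed as a tiny Python helper (a sketch, taking left and right directional derivatives computed elsewhere).

```python
def descent_gradient(left_deriv, right_deriv):
    """delta Q(alpha, e_j): the element of the subgradient along e_j closest to 0.
    Assumes left_deriv <= right_deriv, which holds for a convex Q."""
    if right_deriv <= 0:      # both derivatives non-positive: closest to 0 is the right one
        return right_deriv
    if left_deriv >= 0:       # both derivatives non-negative: closest to 0 is the left one
        return left_deriv
    return 0.0                # 0 lies between them, i.e. 0 is in the subgradient
```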

B.2. Direction

In view of (12), at each iteration t, the direction e_k selected by coordinate descent with maximum descent is k = argmax_{j in [1,N]} |δF(α_{t−1}, e_j)|. To determine k, we compute δF(α_{t−1}, e_j) for all j in [1, N] by distinguishing two cases: α_{t−1,j} ≠ 0 and α_{t−1,j} = 0.

Assume first that α_{t−1,j} ≠ 0 and let s denote the sign of α_{t−1,j}. For η sufficiently small, α_{t−1,j} + η has the sign of α_{t−1,j}, that is s, and

$$F(\alpha_{t-1} + \eta e_j) = \frac{1}{m}\sum_{i=1}^m \Phi\big(1 - y_i f_{t-1}(x_i) - \eta\,y_i h_j(x_i)\big) + \sum_{p\ne j}\Lambda_p|\alpha_{t-1,p}| + s\,\Lambda_j(\alpha_{t-1,j} + \eta).$$

Thus, when α_{t−1,j} ≠ 0, F admits a directional derivative along e_j given by

$$F'(\alpha_{t-1}, e_j) = -\frac{1}{m}\sum_{i=1}^m y_i h_j(x_i)\,\Phi'\big(1 - y_i f_{t-1}(x_i)\big) + s\Lambda_j = -\frac{S_t}{m}\sum_{i=1}^m y_i h_j(x_i)\,D_t(i) + s\Lambda_j = (2\epsilon_{t,j} - 1)\frac{S_t}{m} + s\Lambda_j,$$

and δF(α_{t−1}, e_j) = (2ε_{t,j} − 1) S_t/m + sgn(α_{t−1,j}) Λ_j. When α_{t−1,j} = 0, we find similarly that

$$F'_+(\alpha_{t-1}, e_j) = (2\epsilon_{t,j} - 1)\frac{S_t}{m} + \Lambda_j,\qquad F'_-(\alpha_{t-1}, e_j) = (2\epsilon_{t,j} - 1)\frac{S_t}{m} - \Lambda_j.$$

The condition F'_−(α_{t−1}, e_j) ≤ 0 ≤ F'_+(α_{t−1}, e_j) is then equivalent to |2ε_{t,j} − 1| S_t/m ≤ Λ_j, that is |ε_{t,j} − 1/2| ≤ Λ_j m/(2 S_t). Thus, in summary, we can write, for all j in [1, N],

$$\delta F(\alpha_{t-1}, e_j) = \begin{cases}(2\epsilon_{t,j} - 1)\frac{S_t}{m} + \mathrm{sgn}(\alpha_{t-1,j})\,\Lambda_j & \text{if } \alpha_{t-1,j} \ne 0\\ 0 & \text{else if } \big|\epsilon_{t,j} - \tfrac12\big| \le \tfrac{\Lambda_j m}{2 S_t}\\ (2\epsilon_{t,j} - 1)\frac{S_t}{m} + \Lambda_j & \text{else if } \epsilon_{t,j} - \tfrac12 < -\tfrac{\Lambda_j m}{2 S_t}\\ (2\epsilon_{t,j} - 1)\frac{S_t}{m} - \Lambda_j & \text{otherwise.}\end{cases}$$

This can be simplified by unifying the last two cases and observing that the sign of ε_{t,j} − 1/2 suffices to distinguish between them:

$$\delta F(\alpha_{t-1}, e_j) = \begin{cases}(2\epsilon_{t,j} - 1)\frac{S_t}{m} + \mathrm{sgn}(\alpha_{t-1,j})\,\Lambda_j & \text{if } \alpha_{t-1,j} \ne 0\\ 0 & \text{else if } \big|\epsilon_{t,j} - \tfrac12\big| \le \tfrac{\Lambda_j m}{2 S_t}\\ (2\epsilon_{t,j} - 1)\frac{S_t}{m} - \mathrm{sgn}\big(\epsilon_{t,j} - \tfrac12\big)\,\Lambda_j & \text{otherwise.}\end{cases}$$

Up to multiplication by the positive factor m/(2 S_t), these are the quantities d_j computed on lines 5-9 of Figure 2, so the argmax is unchanged.

B.3. Step

Given the direction e_k, the optimal step value η* is given by argmin_η F(α_{t−1} + η e_k). In the most general case, η* can be found via a line search or other numerical methods. In some special cases, we can derive a closed-form solution for the step by minimizing an upper bound on F(α_{t−1} + η e_k). For convenience, in what follows, we use the shorthand ε_t for ε_{t,k}. Since y_i h_k(x_i) = ((1 + y_i h_k(x_i))/2)(+1) + ((1 − y_i h_k(x_i))/2)(−1) with y_i h_k(x_i) in [−1, +1], by the convexity of u → Φ(1 − y_i f_{t−1}(x_i) − ηu), the following holds for all η in R:

$$\Phi\big(1 - y_i f_{t-1}(x_i) - \eta\,y_i h_k(x_i)\big) \le \frac{1 + y_i h_k(x_i)}{2}\,\Phi\big(1 - y_i f_{t-1}(x_i) - \eta\big) + \frac{1 - y_i h_k(x_i)}{2}\,\Phi\big(1 - y_i f_{t-1}(x_i) + \eta\big). \qquad (13)$$

Thus, we can write

$$F(\alpha_{t-1} + \eta e_k) \le \sum_{j\ne k}\Lambda_j|\alpha_{t-1,j}| + \frac{1}{m}\sum_{i=1}^m\bigg[\frac{1 + y_i h_k(x_i)}{2}\,\Phi\big(1 - y_i f_{t-1}(x_i) - \eta\big) + \frac{1 - y_i h_k(x_i)}{2}\,\Phi\big(1 - y_i f_{t-1}(x_i) + \eta\big)\bigg] + \Lambda_k|\alpha_{t-1,k} + \eta|.$$

Let J(η) denote the η-dependent part of this upper bound, that is the upper bound minus the constant term Σ_{j≠k} Λ_j|α_{t−1,j}|. We can select η to minimize J(η). J is convex and admits a subdifferential at all points. Thus, η* is a minimizer of J iff 0 is in ∂J(η*), where ∂J(η) denotes the subdifferential of J at η. (Note that when the functions in H_k take values in {−1, +1}, (13) is in fact an equality and J(η) coincides with F(α_{t−1} + η e_k) − Σ_{j≠k} Λ_j|α_{t−1,j}|.)

B.4. Exponential loss

In the case of the exponential loss, Φ(−u) = e^{−u}, J(η) can be expressed as follows:

$$J(\eta) = \frac{1}{m}\sum_{i=1}^m\bigg[\frac{1 + y_i h_k(x_i)}{2}\,e^{1 - y_i f_{t-1}(x_i)}\,e^{-\eta} + \frac{1 - y_i h_k(x_i)}{2}\,e^{1 - y_i f_{t-1}(x_i)}\,e^{\eta}\bigg] + \Lambda_k|\alpha_{t-1,k} + \eta|,$$

and e^{1 − y_i f_{t−1}(x_i)} = Φ'(1 − y_i f_{t−1}(x_i)) = S_t D_t(i). Thus, J can be rewritten as follows:

$$J(\eta) = (1 - \epsilon_t)\frac{S_t}{m}\,e^{-\eta} + \epsilon_t\frac{S_t}{m}\,e^{\eta} + \Lambda_k|\alpha_{t-1,k} + \eta|.$$

Recall that we use the shorthand ε_t = ε_{t,k}, where k is the index of the direction e_k selected.

If α_{t−1,k} + η* = 0, then the subdifferential of η → |α_{t−1,k} + η| at η* is the set {ν : ν in [−1, +1]}. Thus, 0 is in ∂J(η*) iff there exists ν in [−1, +1] such that

$$-(1-\epsilon_t)\frac{S_t}{m}\,e^{-\eta^*} + \epsilon_t\frac{S_t}{m}\,e^{\eta^*} + \Lambda_k\nu = 0 \iff -(1-\epsilon_t)\frac{S_t}{m}\,e^{\alpha_{t-1,k}} + \epsilon_t\frac{S_t}{m}\,e^{-\alpha_{t-1,k}} + \Lambda_k\nu = 0.$$

This is equivalent to the condition

$$\big|(1-\epsilon_t)\,e^{\alpha_{t-1,k}} - \epsilon_t\,e^{-\alpha_{t-1,k}}\big| \le \frac{\Lambda_k m}{S_t}. \qquad (14)$$

If α_{t−1,k} + η* > 0, then the subdifferential of η → |α_{t−1,k} + η| at η* is reduced to {+1}, and 0 is in ∂J(η*) iff

$$-(1-\epsilon_t)\frac{S_t}{m}\,e^{-\eta^*} + \epsilon_t\frac{S_t}{m}\,e^{\eta^*} + \Lambda_k = 0 \iff \epsilon_t\,e^{2\eta^*} + \frac{\Lambda_k m}{S_t}\,e^{\eta^*} - (1-\epsilon_t) = 0. \qquad (15)$$

Solving the resulting second-degree equation in e^{η*} gives

$$e^{\eta^*} = -\frac{\Lambda_k m}{2\epsilon_t S_t} + \sqrt{\Big(\frac{\Lambda_k m}{2\epsilon_t S_t}\Big)^2 + \frac{1-\epsilon_t}{\epsilon_t}},\quad\text{that is}\quad \eta^* = \log\Bigg[-\frac{\Lambda_k m}{2\epsilon_t S_t} + \sqrt{\Big(\frac{\Lambda_k m}{2\epsilon_t S_t}\Big)^2 + \frac{1-\epsilon_t}{\epsilon_t}}\Bigg].$$

Let P be the second-degree polynomial of (15) whose positive root is e^{η*}. P is convex, has one negative root and one positive root, and the positive root is e^{η*}. Since e^{−α_{t−1,k}} is positive, the condition α_{t−1,k} + η* > 0, that is −α_{t−1,k} < η*, is then equivalent to P(e^{−α_{t−1,k}}) < 0 (see Figure 6), that is

$$\epsilon_t\,e^{-2\alpha_{t-1,k}} + \frac{\Lambda_k m}{S_t}\,e^{-\alpha_{t-1,k}} - (1-\epsilon_t) < 0 \iff (1-\epsilon_t)\,e^{\alpha_{t-1,k}} - \epsilon_t\,e^{-\alpha_{t-1,k}} > \frac{\Lambda_k m}{S_t}, \qquad (16)$$

and leads to the step size η_t = η* given above. Note that this step size satisfies η_t ≤ η_0, where η_0 = (1/2) log[(1 − ε_t)/ε_t] is the step size used by AdaBoost.

The case α_{t−1,k} + η* < 0 can be treated similarly. It is equivalent to the condition

$$(1-\epsilon_t)\,e^{\alpha_{t-1,k}} - \epsilon_t\,e^{-\alpha_{t-1,k}} < -\frac{\Lambda_k m}{S_t}, \qquad (17)$$

and leads to the step size

$$\eta_t = \log\Bigg[\frac{\Lambda_k m}{2\epsilon_t S_t} + \sqrt{\Big(\frac{\Lambda_k m}{2\epsilon_t S_t}\Big)^2 + \frac{1-\epsilon_t}{\epsilon_t}}\Bigg].$$

Figure 6. Plot of the second-degree polynomial function P.
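Putting the three cases together, the exponential-loss step (lines 12-16 of Figure 2) can be sketched as follows in Python; this is our rendering of the closed form above, offered as an illustration rather than the authors' code.

```python
import math

def step_size_exp(eps_t, alpha_k, Lambda_k, S_t, m):
    """Closed-form coordinate-descent step for the exponential loss.

    eps_t    -- weighted error eps_{t,k} of the selected hypothesis, in (0, 1)
    alpha_k  -- current weight alpha_{t-1,k} of that hypothesis
    Lambda_k -- penalty lam * r_k + beta
    S_t, m   -- normalization factor and sample size
    """
    edge = (1.0 - eps_t) * math.exp(alpha_k) - eps_t * math.exp(-alpha_k)
    bound = Lambda_k * m / S_t
    if abs(edge) <= bound:                     # condition (14): cancel the coordinate
        return -alpha_k
    c = Lambda_k * m / (2.0 * eps_t * S_t)
    root = math.sqrt(c * c + (1.0 - eps_t) / eps_t)
    if edge > bound:                           # condition (16): alpha_k + eta > 0
        return math.log(-c + root)
    return math.log(c + root)                  # condition (17): alpha_k + eta < 0
```

With Λ_k = 0, all branches reduce to the AdaBoost step (1/2) log((1 − ε_t)/ε_t).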

B.5. Logistic loss

In the case of the logistic loss, for any u in R, Φ(−u) = log_2(1 + e^{−u}) and Φ'(−u) = 1/(log 2 (1 + e^{u})). To determine the step size, we use the following general upper bound:

$$\Phi(-u - v) - \Phi(-u) = \log_2\Big[\frac{1 + e^{-u-v}}{1 + e^{-u}}\Big] = \log_2\Big[1 + \frac{e^{-u}(e^{-v} - 1)}{1 + e^{-u}}\Big] \le \frac{1}{\log 2}\cdot\frac{e^{-v} - 1}{1 + e^{u}} \le \frac{e^{-v}}{\log 2\,(1 + e^{u})} = \Phi'(-u)\,e^{-v}.$$

Thus, we can write

$$F(\alpha_{t-1} + \eta e_k) - F(\alpha_{t-1}) \le \frac{1}{m}\sum_{i=1}^m \Phi'\big(1 - y_i f_{t-1}(x_i)\big)\,e^{-\eta y_i h_k(x_i)} + \Lambda_k\big(|\alpha_{t-1,k} + \eta| - |\alpha_{t-1,k}|\big) = \frac{S_t}{m}\sum_{i=1}^m D_t(i)\,e^{-\eta y_i h_k(x_i)} + \Lambda_k\big(|\alpha_{t-1,k} + \eta| - |\alpha_{t-1,k}|\big).$$

To determine η, we can minimize this upper bound, or equivalently the following:

$$\frac{S_t}{m}\sum_{i=1}^m D_t(i)\,e^{-\eta y_i h_k(x_i)} + \Lambda_k|\alpha_{t-1,k} + \eta|.$$

This expression is syntactically the same as the one considered in the case of the exponential loss, with only the distribution weights D_t(i) and the normalization factor S_t being different. Indeed, in the case of the exponential loss Φ(−u) = e^{−u}, we can write

$$F(\alpha_{t-1} + \eta e_k) - \sum_{j\ne k}\Lambda_j|\alpha_{t-1,j}| = \frac{1}{m}\sum_{i=1}^m \Phi\big(1 - y_i f_{t-1}(x_i) - \eta y_i h_k(x_i)\big) + \Lambda_k|\alpha_{t-1,k} + \eta| = \frac{1}{m}\sum_{i=1}^m \Phi'\big(1 - y_i f_{t-1}(x_i)\big)\,e^{-\eta y_i h_k(x_i)} + \Lambda_k|\alpha_{t-1,k} + \eta| = \frac{S_t}{m}\sum_{i=1}^m D_t(i)\,e^{-\eta y_i h_k(x_i)} + \Lambda_k|\alpha_{t-1,k} + \eta|,$$

since Φ = Φ' for the exponential loss. Thus, we obtain immediately the same expressions for the step size in the case of the logistic loss, with the same three cases, but with

$$S_t = \sum_{i=1}^m \frac{1}{1 + e^{-1 + y_i f_{t-1}(x_i)}} \qquad\text{and}\qquad D_t(i) = \frac{1}{S_t}\cdot\frac{1}{1 + e^{-1 + y_i f_{t-1}(x_i)}}.$$

C. Alternative DeepBoost_γ algorithm

We also devised and implemented an alternative algorithm, DeepBoost_γ, which is inspired by the learning bound of Theorem 1 but does not seek to minimize it. The algorithm admits a parameter γ > 0 representing the edge value demanded at each boosting round; this is the amount by which we require the error ε_t of the base hypothesis h_t selected at round t to be better than 1/2, that is 1/2 − ε_t > γ. We assume given p distinct hypothesis sets with increasing degrees of complexity, H_1, ..., H_p. DeepBoost_γ proceeds as if we were running AdaBoost using only H_1 as the base hypothesis set. But, at each round, if the edge achieved by the best hypothesis found in H_1 is not sufficient, that is if it is not larger than the demanded edge γ, then it selects instead the hypothesis in H_2 with the smallest error on the sample weighted by D_t. If the edge of that hypothesis is also not sufficient, it proceeds with the next hypothesis set, and so forth. If the edge is insufficient even with the best hypothesis in H_p, then it just uses the best hypothesis found in H = H_1 ∪ ... ∪ H_p. The edge parameter γ is determined via cross-validation.

DeepBoost_γ is inspired by the bound of Theorem 1 since it seeks to use as much as possible hypotheses from H_1 or other low-complexity families, and only when necessary functions from more complex families. Since it tends to rarely choose hypotheses from more complex H_k's, the complexity term of the bound of Theorem 1 remains close to the one obtained using only H_1. On the other hand, DeepBoost_γ can achieve a smaller empirical margin loss (the first term of the bound) by selecting, when needed, more powerful hypotheses than those accessible using H_1 alone. We carried out some early experiments on several datasets with DeepBoost_γ using boosting stumps, in which the performance of the algorithm was found to be superior to that of AdaBoost. A more extensive study of the theoretical and empirical properties of this algorithm is left to the future.

D. Additional empirical information

D.1. Dataset sizes and attributes

The size and the number of attributes for the datasets used in our experiments are indicated in Table 4.

Table 4. Dataset statistics. "german" refers more specifically to the german numeric dataset. [For each dataset (breastcancer, ionosphere, german, diabetes, ocr17, ocr49, ocr17-mnist, ocr49-mnist), the table lists the number of examples and the number of attributes; the numerical entries are not reproduced here.]
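To make the selection rule of the DeepBoost_γ variant of Appendix C concrete, here is a minimal Python sketch. The per-family search helpers passed in as `best_in_family` are assumptions of this illustration; the paper does not specify an implementation.

```python
def deepboost_gamma_select(best_in_family, D, gamma):
    """One round of base-hypothesis selection for DeepBoost_gamma.

    best_in_family -- list of callables, ordered by increasing family complexity;
                      each takes the current distribution D and returns
                      (hypothesis, weighted_error) for the best hypothesis in H_k.
    gamma          -- demanded edge: accept a family as soon as 1/2 - error > gamma.
    """
    overall_best, overall_err = None, 0.5
    for search_k in best_in_family:
        h, err = search_k(D)
        if err < overall_err:
            overall_best, overall_err = h, err
        if 0.5 - err > gamma:            # sufficient edge: stop at this family
            return h, err
    # No family achieved the demanded edge: fall back to the best over the union.
    return overall_best, overall_err
```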


More information

Computational and Statistical Learning Theory

Computational and Statistical Learning Theory Coputational and Statistical Learning Theory Proble sets 5 and 6 Due: Noveber th Please send your solutions to learning-subissions@ttic.edu Notations/Definitions Recall the definition of saple based Radeacher

More information

Computable Shell Decomposition Bounds

Computable Shell Decomposition Bounds Coputable Shell Decoposition Bounds John Langford TTI-Chicago jcl@cs.cu.edu David McAllester TTI-Chicago dac@autoreason.co Editor: Leslie Pack Kaelbling and David Cohn Abstract Haussler, Kearns, Seung

More information

CS Lecture 13. More Maximum Likelihood

CS Lecture 13. More Maximum Likelihood CS 6347 Lecture 13 More Maxiu Likelihood Recap Last tie: Introduction to axiu likelihood estiation MLE for Bayesian networks Optial CPTs correspond to epirical counts Today: MLE for CRFs 2 Maxiu Likelihood

More information

Grafting: Fast, Incremental Feature Selection by Gradient Descent in Function Space

Grafting: Fast, Incremental Feature Selection by Gradient Descent in Function Space Journal of Machine Learning Research 3 (2003) 1333-1356 Subitted 5/02; Published 3/03 Grafting: Fast, Increental Feature Selection by Gradient Descent in Function Space Sion Perkins Space and Reote Sensing

More information

1 Generalization bounds based on Rademacher complexity

1 Generalization bounds based on Rademacher complexity COS 5: Theoretical Machine Learning Lecturer: Rob Schapire Lecture #0 Scribe: Suqi Liu March 07, 08 Last tie we started proving this very general result about how quickly the epirical average converges

More information

Introduction to Machine Learning Lecture 13. Mehryar Mohri Courant Institute and Google Research

Introduction to Machine Learning Lecture 13. Mehryar Mohri Courant Institute and Google Research Introduction to Machine Learning Lecture 13 Mehryar Mohri Courant Institute and Google Research mohri@cims.nyu.edu Multi-Class Classification Mehryar Mohri - Introduction to Machine Learning page 2 Motivation

More information

Computable Shell Decomposition Bounds

Computable Shell Decomposition Bounds Journal of Machine Learning Research 5 (2004) 529-547 Subitted 1/03; Revised 8/03; Published 5/04 Coputable Shell Decoposition Bounds John Langford David McAllester Toyota Technology Institute at Chicago

More information

arxiv: v4 [cs.lg] 4 Apr 2016

arxiv: v4 [cs.lg] 4 Apr 2016 e-publication 3 3-5 Relative Deviation Learning Bounds and Generalization with Unbounded Loss Functions arxiv:35796v4 cslg 4 Apr 6 Corinna Cortes Google Research, 76 Ninth Avenue, New York, NY Spencer

More information

This article appeared in a journal published by Elsevier. The attached copy is furnished to the author for internal non-commercial research and

This article appeared in a journal published by Elsevier. The attached copy is furnished to the author for internal non-commercial research and This article appeared in a ournal published by Elsevier. The attached copy is furnished to the author for internal non-coercial research and education use, including for instruction at the authors institution

More information

Foundations of Machine Learning Multi-Class Classification. Mehryar Mohri Courant Institute and Google Research

Foundations of Machine Learning Multi-Class Classification. Mehryar Mohri Courant Institute and Google Research Foundations of Machine Learning Multi-Class Classification Mehryar Mohri Courant Institute and Google Research mohri@cims.nyu.edu Motivation Real-world problems often have multiple classes: text, speech,

More information

PAC-Bayesian Learning of Linear Classifiers

PAC-Bayesian Learning of Linear Classifiers Pascal Gerain Pascal.Gerain.@ulaval.ca Alexandre Lacasse Alexandre.Lacasse@ift.ulaval.ca François Laviolette Francois.Laviolette@ift.ulaval.ca Mario Marchand Mario.Marchand@ift.ulaval.ca Départeent d inforatique

More information

Algorithms for parallel processor scheduling with distinct due windows and unit-time jobs

Algorithms for parallel processor scheduling with distinct due windows and unit-time jobs BULLETIN OF THE POLISH ACADEMY OF SCIENCES TECHNICAL SCIENCES Vol. 57, No. 3, 2009 Algoriths for parallel processor scheduling with distinct due windows and unit-tie obs A. JANIAK 1, W.A. JANIAK 2, and

More information

Structured Prediction Theory Based on Factor Graph Complexity

Structured Prediction Theory Based on Factor Graph Complexity Structured Prediction Theory Based on Factor Graph Coplexity Corinna Cortes Google Research New York, NY 00 corinna@googleco Mehryar Mohri Courant Institute and Google New York, NY 00 ohri@cisnyuedu Vitaly

More information

e-companion ONLY AVAILABLE IN ELECTRONIC FORM

e-companion ONLY AVAILABLE IN ELECTRONIC FORM OPERATIONS RESEARCH doi 10.1287/opre.1070.0427ec pp. ec1 ec5 e-copanion ONLY AVAILABLE IN ELECTRONIC FORM infors 07 INFORMS Electronic Copanion A Learning Approach for Interactive Marketing to a Custoer

More information

Stability Bounds for Non-i.i.d. Processes

Stability Bounds for Non-i.i.d. Processes tability Bounds for Non-i.i.d. Processes Mehryar Mohri Courant Institute of Matheatical ciences and Google Research 25 Mercer treet New York, NY 002 ohri@cis.nyu.edu Afshin Rostaiadeh Departent of Coputer

More information

Learning with Rejection

Learning with Rejection Learning with Rejection Corinna Cortes 1, Giulia DeSalvo 2, and Mehryar Mohri 2,1 1 Google Research, 111 8th Avenue, New York, NY 2 Courant Institute of Matheatical Sciences, 251 Mercer Street, New York,

More information

Block designs and statistics

Block designs and statistics Bloc designs and statistics Notes for Math 447 May 3, 2011 The ain paraeters of a bloc design are nuber of varieties v, bloc size, nuber of blocs b. A design is built on a set of v eleents. Each eleent

More information

13.2 Fully Polynomial Randomized Approximation Scheme for Permanent of Random 0-1 Matrices

13.2 Fully Polynomial Randomized Approximation Scheme for Permanent of Random 0-1 Matrices CS71 Randoness & Coputation Spring 018 Instructor: Alistair Sinclair Lecture 13: February 7 Disclaier: These notes have not been subjected to the usual scrutiny accorded to foral publications. They ay

More information

Machine Learning Basics: Estimators, Bias and Variance

Machine Learning Basics: Estimators, Bias and Variance Machine Learning Basics: Estiators, Bias and Variance Sargur N. srihari@cedar.buffalo.edu This is part of lecture slides on Deep Learning: http://www.cedar.buffalo.edu/~srihari/cse676 1 Topics in Basics

More information

Randomized Recovery for Boolean Compressed Sensing

Randomized Recovery for Boolean Compressed Sensing Randoized Recovery for Boolean Copressed Sensing Mitra Fatei and Martin Vetterli Laboratory of Audiovisual Counication École Polytechnique Fédéral de Lausanne (EPFL) Eail: {itra.fatei, artin.vetterli}@epfl.ch

More information

New Bounds for Learning Intervals with Implications for Semi-Supervised Learning

New Bounds for Learning Intervals with Implications for Semi-Supervised Learning JMLR: Workshop and Conference Proceedings vol (1) 1 15 New Bounds for Learning Intervals with Iplications for Sei-Supervised Learning David P. Helbold dph@soe.ucsc.edu Departent of Coputer Science, University

More information

Support Vector Machines. Machine Learning Series Jerry Jeychandra Blohm Lab

Support Vector Machines. Machine Learning Series Jerry Jeychandra Blohm Lab Support Vector Machines Machine Learning Series Jerry Jeychandra Bloh Lab Outline Main goal: To understand how support vector achines (SVMs) perfor optial classification for labelled data sets, also a

More information

A MESHSIZE BOOSTING ALGORITHM IN KERNEL DENSITY ESTIMATION

A MESHSIZE BOOSTING ALGORITHM IN KERNEL DENSITY ESTIMATION A eshsize boosting algorith in kernel density estiation A MESHSIZE BOOSTING ALGORITHM IN KERNEL DENSITY ESTIMATION C.C. Ishiekwene, S.M. Ogbonwan and J.E. Osewenkhae Departent of Matheatics, University

More information

Domain-Adversarial Neural Networks

Domain-Adversarial Neural Networks Doain-Adversarial Neural Networks Hana Ajakan, Pascal Gerain 2, Hugo Larochelle 3, François Laviolette 2, Mario Marchand 2,2 Départeent d inforatique et de génie logiciel, Université Laval, Québec, Canada

More information

Foundations of Machine Learning Lecture 9. Mehryar Mohri Courant Institute and Google Research

Foundations of Machine Learning Lecture 9. Mehryar Mohri Courant Institute and Google Research Foundations of Machine Learning Lecture 9 Mehryar Mohri Courant Institute and Google Research mohri@cims.nyu.edu Multi-Class Classification page 2 Motivation Real-world problems often have multiple classes:

More information

On the Impact of Kernel Approximation on Learning Accuracy

On the Impact of Kernel Approximation on Learning Accuracy On the Ipact of Kernel Approxiation on Learning Accuracy Corinna Cortes Mehryar Mohri Aeet Talwalkar Google Research New York, NY corinna@google.co Courant Institute and Google Research New York, NY ohri@cs.nyu.edu

More information

COS 424: Interacting with Data. Written Exercises

COS 424: Interacting with Data. Written Exercises COS 424: Interacting with Data Hoework #4 Spring 2007 Regression Due: Wednesday, April 18 Written Exercises See the course website for iportant inforation about collaboration and late policies, as well

More information

Consistent Multiclass Algorithms for Complex Performance Measures. Supplementary Material

Consistent Multiclass Algorithms for Complex Performance Measures. Supplementary Material Consistent Multiclass Algoriths for Coplex Perforance Measures Suppleentary Material Notations. Let λ be the base easure over n given by the unifor rando variable (say U over n. Hence, for all easurable

More information

A Theoretical Analysis of a Warm Start Technique

A Theoretical Analysis of a Warm Start Technique A Theoretical Analysis of a War Start Technique Martin A. Zinkevich Yahoo! Labs 701 First Avenue Sunnyvale, CA Abstract Batch gradient descent looks at every data point for every step, which is wasteful

More information

Robustness and Regularization of Support Vector Machines

Robustness and Regularization of Support Vector Machines Robustness and Regularization of Support Vector Machines Huan Xu ECE, McGill University Montreal, QC, Canada xuhuan@ci.cgill.ca Constantine Caraanis ECE, The University of Texas at Austin Austin, TX, USA

More information

Learning with Rejection

Learning with Rejection Learning with Rejection Corinna Cortes 1, Giulia DeSalvo 2, and Mehryar Mohri 1,2 1 Google Research, 111 8th Avenue, New York, NY 2 Courant Institute of Matheatical Sciences, 251 Mercer Street, New York,

More information

Learnability and Stability in the General Learning Setting

Learnability and Stability in the General Learning Setting Learnability and Stability in the General Learning Setting Shai Shalev-Shwartz TTI-Chicago shai@tti-c.org Ohad Shair The Hebrew University ohadsh@cs.huji.ac.il Nathan Srebro TTI-Chicago nati@uchicago.edu

More information

EMPIRICAL COMPLEXITY ANALYSIS OF A MILP-APPROACH FOR OPTIMIZATION OF HYBRID SYSTEMS

EMPIRICAL COMPLEXITY ANALYSIS OF A MILP-APPROACH FOR OPTIMIZATION OF HYBRID SYSTEMS EMPIRICAL COMPLEXITY ANALYSIS OF A MILP-APPROACH FOR OPTIMIZATION OF HYBRID SYSTEMS Jochen Till, Sebastian Engell, Sebastian Panek, and Olaf Stursberg Process Control Lab (CT-AST), University of Dortund,

More information

Nyström Method vs Random Fourier Features: A Theoretical and Empirical Comparison

Nyström Method vs Random Fourier Features: A Theoretical and Empirical Comparison yströ Method vs : A Theoretical and Epirical Coparison Tianbao Yang, Yu-Feng Li, Mehrdad Mahdavi, Rong Jin, Zhi-Hua Zhou Machine Learning Lab, GE Global Research, San Raon, CA 94583 Michigan State University,

More information

Using EM To Estimate A Probablity Density With A Mixture Of Gaussians

Using EM To Estimate A Probablity Density With A Mixture Of Gaussians Using EM To Estiate A Probablity Density With A Mixture Of Gaussians Aaron A. D Souza adsouza@usc.edu Introduction The proble we are trying to address in this note is siple. Given a set of data points

More information

Boosting with Abstention

Boosting with Abstention Boosting with Abstention Corinna Cortes Google Research New York, NY 00 corinna@google.co Giulia DeSalvo Courant Institute New York, NY 00 desalvo@cis.nyu.edu Mehryar Mohri Courant Institute and Google

More information

Polygonal Designs: Existence and Construction

Polygonal Designs: Existence and Construction Polygonal Designs: Existence and Construction John Hegean Departent of Matheatics, Stanford University, Stanford, CA 9405 Jeff Langford Departent of Matheatics, Drake University, Des Moines, IA 5011 G

More information

E0 370 Statistical Learning Theory Lecture 5 (Aug 25, 2011)

E0 370 Statistical Learning Theory Lecture 5 (Aug 25, 2011) E0 370 Statistical Learning Theory Lecture 5 Aug 5, 0 Covering Nubers, Pseudo-Diension, and Fat-Shattering Diension Lecturer: Shivani Agarwal Scribe: Shivani Agarwal Introduction So far we have seen how

More information

Stochastic Subgradient Methods

Stochastic Subgradient Methods Stochastic Subgradient Methods Lingjie Weng Yutian Chen Bren School of Inforation and Coputer Science University of California, Irvine {wengl, yutianc}@ics.uci.edu Abstract Stochastic subgradient ethods

More information

A Simple Regression Problem

A Simple Regression Problem A Siple Regression Proble R. M. Castro March 23, 2 In this brief note a siple regression proble will be introduced, illustrating clearly the bias-variance tradeoff. Let Y i f(x i ) + W i, i,..., n, where

More information

Course Notes for EE227C (Spring 2018): Convex Optimization and Approximation

Course Notes for EE227C (Spring 2018): Convex Optimization and Approximation Course Notes for EE7C (Spring 018: Convex Optiization and Approxiation Instructor: Moritz Hardt Eail: hardt+ee7c@berkeley.edu Graduate Instructor: Max Sichowitz Eail: sichow+ee7c@berkeley.edu October 15,

More information

A Note on Scheduling Tall/Small Multiprocessor Tasks with Unit Processing Time to Minimize Maximum Tardiness

A Note on Scheduling Tall/Small Multiprocessor Tasks with Unit Processing Time to Minimize Maximum Tardiness A Note on Scheduling Tall/Sall Multiprocessor Tasks with Unit Processing Tie to Miniize Maxiu Tardiness Philippe Baptiste and Baruch Schieber IBM T.J. Watson Research Center P.O. Box 218, Yorktown Heights,

More information

Geometrical intuition behind the dual problem

Geometrical intuition behind the dual problem Based on: Geoetrical intuition behind the dual proble KP Bennett, EJ Bredensteiner, Duality and Geoetry in SVM Classifiers, Proceedings of the International Conference on Machine Learning, 2000 1 Geoetrical

More information

E. Alpaydın AERFAISS

E. Alpaydın AERFAISS E. Alpaydın AERFAISS 00 Introduction Questions: Is the error rate of y classifier less than %? Is k-nn ore accurate than MLP? Does having PCA before iprove accuracy? Which kernel leads to highest accuracy

More information

Prediction by random-walk perturbation

Prediction by random-walk perturbation Prediction by rando-walk perturbation Luc Devroye School of Coputer Science McGill University Gábor Lugosi ICREA and Departent of Econoics Universitat Popeu Fabra lucdevroye@gail.co gabor.lugosi@gail.co

More information

Boosting with Abstention

Boosting with Abstention Boosting with Abstention Corinna Cortes Google Research New York, NY corinna@google.co Giulia DeSalvo Courant Institute New York, NY desalvo@cis.nyu.edu Mehryar Mohri Courant Institute and Google New York,

More information

Fast Montgomery-like Square Root Computation over GF(2 m ) for All Trinomials

Fast Montgomery-like Square Root Computation over GF(2 m ) for All Trinomials Fast Montgoery-like Square Root Coputation over GF( ) for All Trinoials Yin Li a, Yu Zhang a, a Departent of Coputer Science and Technology, Xinyang Noral University, Henan, P.R.China Abstract This letter

More information

Ch 12: Variations on Backpropagation

Ch 12: Variations on Backpropagation Ch 2: Variations on Backpropagation The basic backpropagation algorith is too slow for ost practical applications. It ay take days or weeks of coputer tie. We deonstrate why the backpropagation algorith

More information

Learning with Rejection

Learning with Rejection Learning with Rejection Corinna Cortes 1, Giulia DeSalvo 2, and Mehryar Mohri 2,1 1 Google Research, 111 8th Avenue, New York, NY 2 Courant Institute of Mathematical Sciences, 251 Mercer Street, New York,

More information

ADANET: adaptive learning of neural networks

ADANET: adaptive learning of neural networks ADANET: adaptive learning of neural networks Joint work with Corinna Cortes (Google Research) Javier Gonzalo (Google Research) Vitaly Kuznetsov (Google Research) Scott Yang (Courant Institute) MEHRYAR

More information

Probability Distributions

Probability Distributions Probability Distributions In Chapter, we ephasized the central role played by probability theory in the solution of pattern recognition probles. We turn now to an exploration of soe particular exaples

More information

Soft-margin SVM can address linearly separable problems with outliers

Soft-margin SVM can address linearly separable problems with outliers Non-linear Support Vector Machines Non-linearly separable probles Hard-argin SVM can address linearly separable probles Soft-argin SVM can address linearly separable probles with outliers Non-linearly

More information

Probabilistic Machine Learning

Probabilistic Machine Learning Probabilistic Machine Learning by Prof. Seungchul Lee isystes Design Lab http://isystes.unist.ac.kr/ UNIST Table of Contents I.. Probabilistic Linear Regression I... Maxiu Likelihood Solution II... Maxiu-a-Posteriori

More information

1 Proof of learning bounds

1 Proof of learning bounds COS 511: Theoretical Machine Learning Lecturer: Rob Schapire Lecture #4 Scribe: Akshay Mittal February 13, 2013 1 Proof of learning bounds For intuition of the following theore, suppose there exists a

More information

A Note on the Applied Use of MDL Approximations

A Note on the Applied Use of MDL Approximations A Note on the Applied Use of MDL Approxiations Daniel J. Navarro Departent of Psychology Ohio State University Abstract An applied proble is discussed in which two nested psychological odels of retention

More information

Supplementary to Learning Discriminative Bayesian Networks from High-dimensional Continuous Neuroimaging Data

Supplementary to Learning Discriminative Bayesian Networks from High-dimensional Continuous Neuroimaging Data Suppleentary to Learning Discriinative Bayesian Networks fro High-diensional Continuous Neuroiaging Data Luping Zhou, Lei Wang, Lingqiao Liu, Philip Ogunbona, and Dinggang Shen Proposition. Given a sparse

More information

Multiple Instance Learning with Query Bags

Multiple Instance Learning with Query Bags Multiple Instance Learning with Query Bags Boris Babenko UC San Diego bbabenko@cs.ucsd.edu Piotr Dollár California Institute of Technology pdollar@caltech.edu Serge Belongie UC San Diego sjb@cs.ucsd.edu

More information

Quantum algorithms (CO 781, Winter 2008) Prof. Andrew Childs, University of Waterloo LECTURE 15: Unstructured search and spatial search

Quantum algorithms (CO 781, Winter 2008) Prof. Andrew Childs, University of Waterloo LECTURE 15: Unstructured search and spatial search Quantu algoriths (CO 781, Winter 2008) Prof Andrew Childs, University of Waterloo LECTURE 15: Unstructured search and spatial search ow we begin to discuss applications of quantu walks to search algoriths

More information

On the Use of A Priori Information for Sparse Signal Approximations

On the Use of A Priori Information for Sparse Signal Approximations ITS TECHNICAL REPORT NO. 3/4 On the Use of A Priori Inforation for Sparse Signal Approxiations Oscar Divorra Escoda, Lorenzo Granai and Pierre Vandergheynst Signal Processing Institute ITS) Ecole Polytechnique

More information

Lower Bounds for Quantized Matrix Completion

Lower Bounds for Quantized Matrix Completion Lower Bounds for Quantized Matrix Copletion Mary Wootters and Yaniv Plan Departent of Matheatics University of Michigan Ann Arbor, MI Eail: wootters, yplan}@uich.edu Mark A. Davenport School of Elec. &

More information

Kernel-Based Nonparametric Anomaly Detection

Kernel-Based Nonparametric Anomaly Detection Kernel-Based Nonparaetric Anoaly Detection Shaofeng Zou Dept of EECS Syracuse University Eail: szou@syr.edu Yingbin Liang Dept of EECS Syracuse University Eail: yliang6@syr.edu H. Vincent Poor Dept of

More information

An improved self-adaptive harmony search algorithm for joint replenishment problems

An improved self-adaptive harmony search algorithm for joint replenishment problems An iproved self-adaptive harony search algorith for joint replenishent probles Lin Wang School of Manageent, Huazhong University of Science & Technology zhoulearner@gail.co Xiaojian Zhou School of Manageent,

More information

Stability Bounds for Stationary ϕ-mixing and β-mixing Processes

Stability Bounds for Stationary ϕ-mixing and β-mixing Processes Journal of Machine Learning Research (200) 789-84 Subitted /08; Revised /0; Published 2/0 Stability Bounds for Stationary ϕ-ixing and β-ixing Processes Mehryar Mohri Courant Institute of Matheatical Sciences

More information

Compression and Predictive Distributions for Large Alphabet i.i.d and Markov models

Compression and Predictive Distributions for Large Alphabet i.i.d and Markov models 2014 IEEE International Syposiu on Inforation Theory Copression and Predictive Distributions for Large Alphabet i.i.d and Markov odels Xiao Yang Departent of Statistics Yale University New Haven, CT, 06511

More information

Perceptron Mistake Bounds

Perceptron Mistake Bounds Perceptron Mistake Bounds Mehryar Mohri, and Afshin Rostamizadeh Google Research Courant Institute of Mathematical Sciences Abstract. We present a brief survey of existing mistake bounds and introduce

More information

1 Identical Parallel Machines

1 Identical Parallel Machines FB3: Matheatik/Inforatik Dr. Syaantak Das Winter 2017/18 Optiizing under Uncertainty Lecture Notes 3: Scheduling to Miniize Makespan In any standard scheduling proble, we are given a set of jobs J = {j

More information