Deep Boosting


Corinna Cortes, Google Research, 111 8th Avenue, New York, NY 10011. CORINNA@GOOGLE.COM
Mehryar Mohri, Courant Institute and Google Research, 251 Mercer Street, New York, NY 10012. MOHRI@CIMS.NYU.EDU
Umar Syed, Google Research, 111 8th Avenue, New York, NY 10011. USYED@GOOGLE.COM

Proceedings of the 31st International Conference on Machine Learning, Beijing, China, 2014. JMLR: W&CP volume 32. Copyright 2014 by the authors.

Abstract

We present a new ensemble learning algorithm, DeepBoost, which can use as base classifiers a hypothesis set containing deep decision trees, or members of other rich or complex families, and succeed in achieving high accuracy without overfitting the data. The key to the success of the algorithm is a capacity-conscious criterion for the selection of the hypotheses. We give new data-dependent learning bounds for convex ensembles expressed in terms of the Rademacher complexities of the sub-families composing the base classifier set, and the mixture weight assigned to each sub-family. Our algorithm directly benefits from these guarantees since it seeks to minimize the corresponding learning bound. We give a full description of our algorithm, including the details of its derivation, and report the results of several experiments showing that its performance compares favorably to that of AdaBoost and Logistic Regression and their L1-regularized variants.

1. Introduction

Ensemble methods are general techniques in machine learning for combining several predictors or experts to create a more accurate one. In the batch learning setting, techniques such as bagging, boosting, stacking, error-correction techniques, Bayesian averaging, or other averaging schemes are prominent instances of these methods (Breiman, 1996; Freund & Schapire, 1997; Smyth & Wolpert, 1999; MacKay, 1991; Freund et al., 2004). Ensemble methods often significantly improve performance in practice (Quinlan, 1996; Bauer & Kohavi, 1999; Caruana et al., 2004; Dietterich, 2000; Schapire, 2003) and benefit from favorable learning guarantees. In particular, AdaBoost and its variants are based on a rich theoretical analysis, with performance guarantees in terms of the margins of the training samples (Schapire et al., 1997; Koltchinskii & Panchenko, 2002).

Standard ensemble algorithms such as AdaBoost combine functions selected from a base classifier hypothesis set H. In many successful applications of AdaBoost, H is reduced to the so-called boosting stumps, that is, decision trees of depth one. For some difficult tasks in speech or image processing, simple boosting stumps are not sufficient to achieve a high level of accuracy. It is tempting then to use a more complex hypothesis set, for example the set of all decision trees with depth bounded by some relatively large number. But existing learning guarantees for AdaBoost depend not only on the margin and the number of training examples, but also on the complexity of H, measured in terms of its VC-dimension or its Rademacher complexity (Schapire et al., 1997; Koltchinskii & Panchenko, 2002). These learning bounds become looser when too complex a base classifier set H is used. They suggest a risk of overfitting, which indeed can be observed in some experiments with AdaBoost (Grove & Schuurmans, 1998; Schapire, 1999; Dietterich, 2000; Rätsch et al., 2001b).

This paper explores the design of alternative ensemble algorithms using as base classifiers a hypothesis set H that may contain very deep decision trees, or members of some other very rich or complex families, and that can yet succeed in achieving a higher performance level.

Assume that the set of base classifiers H can be decomposed as the union of p disjoint families H_1, ..., H_p ordered by increasing complexity, where H_k, k in [1, p], could be, for example, the set of decision trees of depth k, or a set of functions based on monomials of degree k. Figure 1 shows a pictorial illustration. Of course, if we strictly confine ourselves to using hypotheses belonging only to families H_k with small k, then we are effectively using a smaller base classifier set H with favorable guarantees. But, to succeed in some challenging tasks, the use of a few more complex hypotheses could be needed.

Figure 1. Base classifier set H decomposed in terms of sub-families H_1, ..., H_p or their unions.

The main idea behind the design of our algorithms is that an ensemble based on hypotheses drawn from H_1, ..., H_p can achieve a higher accuracy by making use of hypotheses drawn from H_k's with large k if it allocates more weight to hypotheses drawn from H_k's with small k. But can we determine quantitatively the amounts of mixture weights apportioned to different families? Can we provide learning guarantees for such algorithms? Note that our objective is somewhat reminiscent of that of model selection, in particular Structural Risk Minimization (SRM) (Vapnik, 1998), but it differs from it in that we do not wish to limit our base classifier set to some optimal H_q = union of H_1, ..., H_q. Rather, we seek the freedom of using as base hypotheses even relatively deep trees from rich H_k's, with the promise of doing so infrequently, or that of reserving them a somewhat small weight contribution. This provides the flexibility of learning with deep hypotheses.

We present a new algorithm, DeepBoost, whose design is precisely guided by the ideas just discussed. Our algorithm is grounded in a solid theoretical analysis that we present in Section 2. We give new data-dependent learning bounds for convex ensembles. These guarantees are expressed in terms of the Rademacher complexities of the sub-families H_k and the mixture weight assigned to each H_k, in addition to the familiar margin terms and sample size. Our capacity-conscious algorithm is derived via the application of a coordinate descent technique seeking to minimize such learning bounds. We give a full description of our algorithm, including the details of its derivation and its pseudocode (Section 3), and discuss its connection with previous boosting-style algorithms. We also report the results of several experiments (Section 4) demonstrating that its performance compares favorably to that of AdaBoost, which is known to be one of the most competitive binary classification algorithms.

2. Data-dependent learning guarantees for convex ensembles with multiple hypothesis sets

Non-negative linear combination ensembles such as boosting or bagging typically assume that base functions are selected from the same hypothesis set H. Margin-based generalization bounds were given for ensembles of base functions taking values in {-1, +1} by Schapire et al. (1997) in terms of the VC-dimension of H. Tighter margin bounds with simpler proofs were later given by Koltchinskii & Panchenko (2002), see also (Bartlett & Mendelson, 2002), for the more general case of a family H taking arbitrary real values, in terms of the Rademacher complexity of H. Here, we also consider base hypotheses taking arbitrary real values, but assume that they can be selected from several distinct hypothesis sets H_1, ..., H_p, with p >= 1, and present margin-based learning guarantees in terms of the Rademacher complexities of these sets. Remarkably, the complexity term of these bounds admits an explicit dependency on the mixture coefficients defining the ensembles. Thus, the ensemble family we consider is F = conv(H_1 ∪ ... ∪ H_p), that is the family of functions f of the form f = Σ_{t=1}^T α_t h_t, where α = (α_1, ..., α_T) is in the simplex and where, for each t in [1, T], h_t is in H_{k_t} for some k_t in [1, p].

Let X denote the input space. H_1, ..., H_p are thus families of functions mapping from X to R. We consider the familiar supervised learning scenario and assume that training and test points are drawn i.i.d. according to some distribution D over X × {-1, +1}, and denote by S = ((x_1, y_1), ..., (x_m, y_m)) a training sample of size m drawn according to D^m.

Let ρ > 0. For a function f taking values in R, we denote by R(f) its binary classification error, by R_ρ(f) its ρ-margin error, and by \hat R_{S,ρ}(f) its empirical margin error:

$$R(f) = \mathop{\mathrm E}_{(x,y)\sim D}\big[1_{yf(x)\le 0}\big],\qquad R_\rho(f) = \mathop{\mathrm E}_{(x,y)\sim D}\big[1_{yf(x)\le \rho}\big],\qquad \widehat R_{S,\rho}(f) = \mathop{\mathrm E}_{(x,y)\sim S}\big[1_{yf(x)\le \rho}\big],$$

where the notation (x, y) ~ S indicates that (x, y) is drawn according to the empirical distribution defined by S.

The following theorem gives a margin-based Rademacher complexity bound for learning with such functions in the binary classification case. As with other Rademacher complexity learning guarantees, our bound is data-dependent, which is an important and favorable characteristic of our results. For p = 1, that is for the special case of a single hypothesis set, the analysis coincides with that of the standard ensemble margin bounds (Koltchinskii & Panchenko, 2002).

Theorem 1. Assume p > 1. Fix ρ > 0. Then, for any δ > 0, with probability at least 1 − δ over the choice of a sample S of size m drawn i.i.d. according to D^m, the following inequality holds for all f = Σ_{t=1}^T α_t h_t in F:

$$R(f) \le \widehat R_{S,\rho}(f) + \frac{4}{\rho}\sum_{t=1}^T \alpha_t R_m(H_{k_t}) + \frac{2}{\rho}\sqrt{\frac{\log p}{m}} + \sqrt{\bigg\lceil \frac{4}{\rho^2}\log\Big(\frac{\rho^2 m}{\log p}\Big)\bigg\rceil \frac{\log p}{m} + \frac{\log\frac{2}{\delta}}{2m}}.$$

Thus, R(f) ≤ \hat R_{S,ρ}(f) + (4/ρ) Σ_{t=1}^T α_t R_m(H_{k_t}) + C(m, p), with C(m, p) = O((1/ρ) sqrt((log p)/m · log[(ρ² m)/(log p)])).
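To make the role of each term concrete, here is a minimal Python sketch that evaluates the right-hand side of Theorem 1 numerically under our reading of the bound. All input values below (weights, complexity estimates, ρ, m, p, δ) are hypothetical placeholders, not quantities from the paper.

```python
import math

def theorem1_bound(emp_margin_loss, alphas, rademachers, rho, m, p, delta):
    """Evaluate the right-hand side of the margin bound of Theorem 1.

    emp_margin_loss -- empirical rho-margin error R_{S,rho}(f)
    alphas          -- mixture weights alpha_t (non-negative, summing to at most 1)
    rademachers     -- estimates of R_m(H_{k_t}), one per alpha_t
    """
    complexity = (4.0 / rho) * sum(a * r for a, r in zip(alphas, rademachers))
    slack = (2.0 / rho) * math.sqrt(math.log(p) / m)
    # assumes rho**2 * m > log(p) so that the logarithm below is positive
    n_star = math.ceil((4.0 / rho**2) * math.log(rho**2 * m / math.log(p)))
    tail = math.sqrt(n_star * math.log(p) / m + math.log(2.0 / delta) / (2.0 * m))
    return emp_margin_loss + complexity + slack + tail

# Deeper families (larger Rademacher estimates) are given smaller weights.
print(theorem1_bound(0.05, [0.7, 0.2, 0.1], [0.01, 0.05, 0.2],
                     rho=0.5, m=100000, p=5, delta=0.05))
```

The sketch makes explicit that the complexity term is a weighted average: shifting weight from a complex family to a simpler one lowers the bound even if the number of complex hypotheses used is unchanged.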

This result is remarkable since the complexity term in the right-hand side of the bound admits an explicit dependency on the mixture coefficients α_t: it is a weighted average of Rademacher complexities with mixture weights α_t, t in [1, T]. Thus, the second term of the bound suggests that, while some hypothesis sets H_k used for learning could have a large Rademacher complexity, this may not be detrimental to generalization if the corresponding total mixture weight (the sum of the α_t's corresponding to that hypothesis set) is relatively small. Such complex families offer the potential of achieving a better margin on the training sample.

The theorem cannot be proven via a standard Rademacher complexity analysis such as that of Koltchinskii & Panchenko (2002), since the complexity term of the bound would then be the Rademacher complexity of the family of hypotheses F = conv(H_1 ∪ ... ∪ H_p) and would not depend on the specific weights α_t defining a given function f. Furthermore, the complexity term of a standard Rademacher complexity analysis is always lower bounded by the complexity term appearing in our bound. Indeed, since R_m(conv(H_1 ∪ ... ∪ H_p)) = R_m(H_1 ∪ ... ∪ H_p), the following lower bound holds for any choice of the non-negative mixture weights α_t summing to one:

$$R_m(F) = R_m\Big(\bigcup_{k=1}^p H_k\Big) \ge \max_{k\in[1,p]} R_m(H_k) \ge \sum_{t=1}^T \alpha_t R_m(H_{k_t}). \qquad (1)$$

Thus, Theorem 1 provides a finer learning bound than the one obtained via a standard Rademacher complexity analysis. The full proof of the theorem is given in Appendix A. Our proof technique exploits standard tools used to derive Rademacher complexity learning bounds (Koltchinskii & Panchenko, 2002) as well as a technique used by Schapire, Freund, Bartlett, and Lee (1997) to derive early VC-dimension margin bounds. Using other standard techniques as in (Koltchinskii & Panchenko, 2002; Mohri et al., 2012), Theorem 1 can be straightforwardly generalized to hold uniformly for all ρ > 0 at the price of an additional term in O(sqrt((log log_2(2/ρ))/m)).

3. Algorithm

In this section, we use the learning guarantees of Section 2 to derive a capacity-conscious ensemble algorithm for binary classification.

3.1. Optimization problem

Let H_1, ..., H_p be p disjoint families of functions taking values in [-1, +1], with increasing Rademacher complexities R_m(H_k), k in [1, p]. We will assume that the hypothesis sets H_k are symmetric, that is, for any h in H_k, we also have -h in H_k, which holds for most hypothesis sets typically considered in practice. This assumption is not necessary, but it helps simplify the presentation of our algorithm. For any hypothesis h in H_1 ∪ ... ∪ H_p, we denote by d(h) the index of the hypothesis set it belongs to, that is h in H_{d(h)}.

The bound of Theorem 1 holds uniformly for all ρ > 0 and functions f in conv(H_1 ∪ ... ∪ H_p). Since the last term of the bound does not depend on α, it suggests selecting α to minimize

$$G(\alpha) = \frac{1}{m}\sum_{i=1}^m 1_{y_i \sum_{t=1}^T \alpha_t h_t(x_i) \le \rho} + \frac{4}{\rho}\sum_{t=1}^T \alpha_t r_t,$$

where r_t = R_m(H_{d(h_t)}). Since for any ρ > 0, f and f/ρ admit the same generalization error, we can instead search for α ≥ 0 with Σ_{t=1}^T α_t ≤ 1/ρ, which leads to

$$\min_{\alpha \ge 0}\ \frac{1}{m}\sum_{i=1}^m 1_{y_i \sum_{t=1}^T \alpha_t h_t(x_i) \le 1} + 4\sum_{t=1}^T \alpha_t r_t \quad \text{s.t.}\quad \sum_{t=1}^T \alpha_t \le \frac{1}{\rho}.$$

The first term of the objective is not a convex function of α and its minimization is known to be computationally hard. Thus, we will consider instead a convex upper bound. Let u → Φ(−u) be a non-increasing convex function upper bounding u → 1_{u≤0} over R, with Φ differentiable over R and Φ'(u) ≠ 0 for all u. Φ may be selected to be, for example, the exponential function as in AdaBoost (Freund & Schapire, 1997) or the logistic function. Using such an upper bound, we obtain the following convex optimization problem:

$$\min_{\alpha \ge 0}\ \frac{1}{m}\sum_{i=1}^m \Phi\Big(1 - y_i \sum_{t=1}^T \alpha_t h_t(x_i)\Big) + \lambda \sum_{t=1}^T \alpha_t r_t \quad \text{s.t.}\quad \sum_{t=1}^T \alpha_t \le \frac{1}{\rho}, \qquad (2)$$

where we introduced a parameter λ ≥ 0 controlling the balance between the magnitude of the values taken by the function Φ and the second term. Introducing a Lagrange variable β ≥ 0 associated to the constraint in (2), the problem can be equivalently written as

$$\min_{\alpha \ge 0}\ \frac{1}{m}\sum_{i=1}^m \Phi\Big(1 - y_i \sum_{t=1}^T \alpha_t h_t(x_i)\Big) + \sum_{t=1}^T (\lambda r_t + \beta)\,\alpha_t.$$

Here, β is a parameter that can be freely selected by the algorithm, since any choice of its value is equivalent to a choice of ρ in (2). (The condition Σ_{t=1}^T α_t = 1 of Theorem 1 can be relaxed to Σ_{t=1}^T α_t ≤ 1. To see this, use for example a null hypothesis, h_t = 0 for some t.)
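For reference, here is a minimal NumPy sketch of this Lagrangian objective, assuming the base-hypothesis predictions have been precomputed in a matrix H (one column per hypothesis). It is an illustration under the exponential surrogate and our own variable names, not the authors' implementation.

```python
import numpy as np

def lagrangian_objective(alpha, H, y, r, lam, beta):
    """(1/m) sum_i Phi(1 - y_i sum_t alpha_t h_t(x_i)) + sum_t (lam*r_t + beta)*alpha_t,
    with the exponential surrogate Phi(u) = exp(u) and alpha >= 0.

    H : (m, T) array of base-hypothesis predictions h_t(x_i) in [-1, +1]
    y : (m,) array of labels in {-1, +1}
    r : (T,) Rademacher complexity estimates of the family containing each h_t
    """
    u = 1.0 - y * (H @ alpha)                 # 1 - y_i f(x_i)
    surrogate = np.exp(u).mean()              # swap in np.log2(1 + np.exp(u)) for the logistic loss
    penalty = np.sum((lam * r + beta) * alpha)
    return surrogate + penalty
```

Because the penalty coefficient λ r_t + β grows with the complexity estimate r_t, hypotheses from deeper families are retained only when they reduce the surrogate loss by more than their capacity cost.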

Let {h_1, ..., h_N} be the set of distinct base functions, and let G be the objective function based on that collection:

$$G(\alpha) = \frac{1}{m}\sum_{i=1}^m \Phi\Big(1 - y_i \sum_{j=1}^N \alpha_j h_j(x_i)\Big) + \sum_{j=1}^N (\lambda r_j + \beta)\,\alpha_j,$$

with α = (α_1, ..., α_N) in R^N. Note that we can drop the requirement α ≥ 0, since the hypothesis sets are symmetric and α_t h_t = (−α_t)(−h_t). For each hypothesis h, we keep either h or −h in {h_1, ..., h_N}. Using the notation

$$\Lambda_j = \lambda r_j + \beta, \qquad (3)$$

for all j in [1, N], our optimization problem can then be rewritten as min_α F(α) with

$$F(\alpha) = \frac{1}{m}\sum_{i=1}^m \Phi\Big(1 - y_i \sum_{j=1}^N \alpha_j h_j(x_i)\Big) + \sum_{j=1}^N \Lambda_j |\alpha_j|, \qquad (4)$$

with no non-negativity constraint on α. The function F is convex as a sum of convex functions and admits a subdifferential at every α in R^N. We can design a boosting-style algorithm by applying coordinate descent to F(α). Let α_t = (α_{t,1}, ..., α_{t,N}) denote the vector obtained after t ≥ 1 iterations and let α_0 = 0. Let e_k denote the kth unit vector in R^N, k in [1, N]. The direction e_k and the step η selected at the tth round are those minimizing F(α_{t−1} + η e_k), that is

$$F(\alpha_{t-1} + \eta e_k) = \frac{1}{m}\sum_{i=1}^m \Phi\big(1 - y_i f_{t-1}(x_i) - \eta\, y_i h_k(x_i)\big) + \sum_{j\ne k}\Lambda_j|\alpha_{t-1,j}| + \Lambda_k|\alpha_{t-1,k} + \eta|,$$

where f_{t−1} = Σ_{j=1}^N α_{t−1,j} h_j. For any t in [1, T], we denote by D_t the distribution defined by

$$D_t(i) = \frac{\Phi'\big(1 - y_i f_{t-1}(x_i)\big)}{S_t}, \qquad (5)$$

where S_t is a normalization factor, S_t = Σ_{i=1}^m Φ'(1 − y_i f_{t−1}(x_i)). For any s in [1, T] and j in [1, N], we denote by ε_{s,j} the weighted error of hypothesis h_j for the distribution D_s:

$$\epsilon_{s,j} = \frac{1}{2}\Big[1 - \mathop{\mathrm E}_{i\sim D_s}[y_i h_j(x_i)]\Big]. \qquad (6)$$

3.2. DeepBoost

Figure 2 shows the pseudocode of the algorithm DeepBoost derived by applying coordinate descent to the objective function (4). The details of the derivation are given in Appendix B.

DEEPBOOST(S = ((x_1, y_1), ..., (x_m, y_m)))
 1  for i ← 1 to m do
 2      D_1(i) ← 1/m
 3  for t ← 1 to T do
 4      for j ← 1 to N do
 5          if (α_{t-1,j} ≠ 0) then
 6              d_j ← (ε_{t,j} − 1/2) + sgn(α_{t-1,j}) Λ_j m / (2 S_t)
 7          elseif (|ε_{t,j} − 1/2| ≤ Λ_j m / (2 S_t)) then
 8              d_j ← 0
 9          else d_j ← (ε_{t,j} − 1/2) − sgn(ε_{t,j} − 1/2) Λ_j m / (2 S_t)
10      k ← argmax_{j in [1,N]} |d_j|
11      ε_t ← ε_{t,k}
12      if (|(1 − ε_t) e^{α_{t-1,k}} − ε_t e^{−α_{t-1,k}}| ≤ Λ_k m / S_t) then
13          η_t ← −α_{t-1,k}
14      elseif ((1 − ε_t) e^{α_{t-1,k}} − ε_t e^{−α_{t-1,k}} > Λ_k m / S_t) then
15          η_t ← log[ −Λ_k m / (2 ε_t S_t) + sqrt( (Λ_k m / (2 ε_t S_t))² + (1 − ε_t)/ε_t ) ]
16      else η_t ← log[ +Λ_k m / (2 ε_t S_t) + sqrt( (Λ_k m / (2 ε_t S_t))² + (1 − ε_t)/ε_t ) ]
17      α_t ← α_{t-1} + η_t e_k
18      S_{t+1} ← Σ_{i=1}^m Φ'(1 − y_i Σ_{j=1}^N α_{t,j} h_j(x_i))
19      for i ← 1 to m do
20          D_{t+1}(i) ← Φ'(1 − y_i Σ_{j=1}^N α_{t,j} h_j(x_i)) / S_{t+1}
21  f ← Σ_{j=1}^N α_{T,j} h_j
22  return f

Figure 2. Pseudocode of the DeepBoost algorithm for both the exponential loss and the logistic loss. The expression of the weighted error ε_{t,j} is given in (6). In the generic case of a surrogate loss Φ different from the exponential or logistic losses, η_t is found instead via a line search or other numerical methods from η_t = argmin_η F(α_{t−1} + η e_k).

In the special cases of the exponential loss (Φ(−u) = exp(−u)) or the logistic loss (Φ(−u) = log_2(1 + exp(−u))), a closed-form expression is given for the step size (lines 12-16), which is the same in both cases (see Sections B.4 and B.5). In the generic case, the step size η_t can be found using a line search or other numerical methods.

Note that when the condition of line 12 is satisfied, the step taken by the algorithm cancels out the coordinate along the direction k, thereby leading to a sparser result. This is consistent with the fact that the objective function contains a second term based on a weighted L1-norm, which favors sparsity.
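The direction-selection step (lines 4-10 of Figure 2) can be written densely as follows. This is a small NumPy sketch of our reading of the pseudocode, with argument names chosen here for illustration.

```python
import numpy as np

def select_direction(D, H, y, alpha, Lam, S_t):
    """Return the coordinate k maximizing |d_j| (lines 4-10 of Figure 2) and eps_{t,k}.

    D     : (m,) current distribution D_t over the training points
    H     : (m, N) base-hypothesis predictions h_j(x_i)
    y     : (m,) labels in {-1, +1}
    alpha : (N,) current weight vector alpha_{t-1}
    Lam   : (N,) penalties Lambda_j = lam * r_j + beta
    S_t   : normalization factor sum_i Phi'(1 - y_i f_{t-1}(x_i))
    """
    m = len(y)
    eps = 0.5 * (1.0 - (D * y) @ H)          # weighted errors eps_{t,j}, eq. (6)
    shift = Lam * m / (2.0 * S_t)
    d = np.where(alpha != 0,
                 (eps - 0.5) + np.sign(alpha) * shift,
                 np.where(np.abs(eps - 0.5) <= shift, 0.0,
                          (eps - 0.5) - np.sign(eps - 0.5) * shift))
    k = int(np.argmax(np.abs(d)))
    return k, eps[k]
```

The shift Λ_j m/(2 S_t) acts as a per-hypothesis soft threshold: a hypothesis from a complex family must have an edge exceeding its penalty before it can be selected at all.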

Our algorithm is related to several other boosting-type algorithms devised in the past. For λ = 0 and β = 0 and using the exponential surrogate loss, it coincides with AdaBoost (Freund & Schapire, 1997), with precisely the same direction and the same step (1/2) log[(1 − ε_t)/ε_t], using H = H_1 ∪ ... ∪ H_p as the hypothesis set for base learners. This corresponds to ignoring the complexity term of our bound as well as the control of the sum of the mixture weights via β. For λ = 0 and β = 0 and using the logistic surrogate loss, our algorithm also coincides with additive logistic regression (Friedman et al., 1998). In the special case where λ = 0 and β ≠ 0 and for the exponential surrogate loss, our algorithm matches L1-norm regularized AdaBoost (e.g., see Rätsch et al., 2001a). For the same choice of the parameters and for the logistic surrogate loss, our algorithm matches the L1-norm regularized additive Logistic Regression studied by Duchi & Singer (2009), using the base learner hypothesis set H = H_1 ∪ ... ∪ H_p. H may in general be very rich. The key foundation of our algorithm and analysis is instead to take into account the relative complexity of the sub-families H_k. Also, note that L1-norm regularized AdaBoost and Logistic Regression can be viewed as algorithms minimizing the learning bound obtained via the standard Rademacher complexity analysis (Koltchinskii & Panchenko, 2002), using the exponential or logistic surrogate losses. Instead, the objective function minimized by our algorithm is based on the generalization bound of Theorem 1, which as discussed earlier is a finer bound (see (1)). For λ = 0 but β ≠ 0, our algorithm is also close to the so-called unnormalized Arcing (Breiman, 1999) or AdaBoost_ρ (Rätsch & Warmuth, 2002) using H as a hypothesis set. AdaBoost_ρ coincides with AdaBoost modulo the step size, which is more conservative than that of AdaBoost and depends on ρ. Rätsch & Warmuth (2005) give another variant of the algorithm that does not require knowing the best ρ; see also the related work of Kivinen & Warmuth (1999) and Warmuth et al. (2006).

Our algorithm directly benefits from the learning guarantees given in Section 2 since it seeks to minimize the bound of Theorem 1. In the next section, we report the results of our experiments with DeepBoost. Let us mention that we have also designed an alternative deep boosting algorithm that we briefly describe and discuss in Appendix C.

4. Experiments

An additional benefit of the learning bounds presented in Section 2 is that they are data-dependent. They are based on the Rademacher complexity of the base hypothesis sets H_k, which in some cases can be well estimated from the training sample. The algorithm DeepBoost directly inherits this advantage. For example, if the hypothesis set H_k is based on a positive definite kernel with sample matrix K_k, it is known that its empirical Rademacher complexity can be upper bounded by sqrt(Tr[K_k])/m and lower bounded by sqrt(Tr[K_k])/(sqrt(2) m). In other cases, when H_k is a family of functions taking binary values, we can use an upper bound on the Rademacher complexity in terms of the growth function Π_{H_k}(m) of H_k: R_m(H_k) ≤ sqrt(2 log Π_{H_k}(m) / m). Thus, for the family H_1^stumps of boosting stumps in dimension d, Π_{H_1^stumps}(m) ≤ 2md, since there are 2m distinct threshold functions for each dimension with m points. Thus, the following inequality holds:

$$R_m(H_1^{\mathrm{stumps}}) \le \sqrt{\frac{2\log(2md)}{m}}. \qquad (7)$$

Similarly, we consider the family H_2^stumps of decision trees of depth 2 with the same question at both internal nodes of depth 1. We have Π_{H_2^stumps}(m) ≤ (2m)² d(2d − 1), since there are d(2d − 1)/2 distinct trees of this type and each induces at most 2(2m)² labelings. Thus, we can write

$$R_m(H_2^{\mathrm{stumps}}) \le \sqrt{\frac{2\log\big((2m)^2 d(2d-1)\big)}{m}}. \qquad (8)$$

More generally, we also consider the family H_k^trees of all binary decision trees of depth k. For this family it is known that VCdim(H_k^trees) ≤ (2^k + 1) log_2(d + 2) (Mansour, 1997). More generally, the VC-dimension of T_n, the family of decision trees with n nodes in dimension d, can be bounded by (2n + 1) log_2(d + 2) (see for example Mohri et al., 2012). Since R_m(H) ≤ sqrt(2 VCdim(H) log(m + 1) / m) for any hypothesis class H, we have

$$R_m(T_n) \le \sqrt{\frac{(4n + 2)\log_2(d + 2)\log(m + 1)}{m}}. \qquad (9)$$

The experiments with DeepBoost described below use either H^stumps = H_1^stumps ∪ H_2^stumps or H_K^trees = H_1^trees ∪ ... ∪ H_K^trees, for some K > 0, as the base hypothesis sets. For any hypothesis in these sets, DeepBoost uses the upper bounds given above as a proxy for the Rademacher complexity of the set to which it belongs. We leave it to the future to experiment with finer data-dependent estimates or upper bounds on the Rademacher complexity, which could further improve the performance of our algorithm.

Recall that each iteration of DeepBoost searches for the base hypothesis that is optimal with respect to a certain criterion (see lines 5-10 of Figure 2). While an exhaustive search is feasible for H^stumps, it would be far too expensive to visit all trees in H_K^trees when K is large. Therefore, when using H_K^trees (and also H^stumps) as the base hypotheses, we use the following heuristic search procedure in each iteration t: First, the optimal tree h_1 in H_1^trees is found via exhaustive search. Next, for all 1 < k ≤ K, a locally optimal tree h_k in H_k^trees is found by considering only trees that can be obtained by adding a single layer of leaves to h_{k−1}. Finally, we select the best hypothesis in the set {h_1, ..., h_K, h'_1, ..., h'_{t−1}}, where h'_1, ..., h'_{t−1} are the hypotheses selected in previous iterations.
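The complexity proxies of equations (7) and (9) are simple closed-form quantities; the following sketch computes them for a hypothetical dataset size and feature dimension (the helper names are ours).

```python
import math

def rademacher_stumps(m, d):
    """Upper bound (7) on the Rademacher complexity of depth-1 stumps."""
    return math.sqrt(2.0 * math.log(2 * m * d) / m)

def rademacher_tree(n_nodes, m, d):
    """Upper bound (9) for binary decision trees with n_nodes nodes in dimension d."""
    return math.sqrt((4 * n_nodes + 2) * math.log2(d + 2) * math.log(m + 1) / m)

# Proxies r_k for depth-k trees; a depth-k binary tree has at most 2**k - 1 internal nodes.
m, d, K = 5000, 20, 4
r = [rademacher_tree(2 ** k - 1, m, d) for k in range(1, K + 1)]
```

The proxy grows roughly as the square root of the number of tree nodes and only logarithmically in m and d, which is why moderately deep trees can still receive nonzero weight.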

Table 1. Results for boosted decision stumps and the exponential loss function. [For each dataset (breastcancer, ionosphere, german, diabetes, ocr17, ocr49, ocr17-mnist, ocr49-mnist), the table reports Error, std dev, Avg tree size, and Avg no. of trees for AdaBoost with H_1^stumps, AdaBoost with H_2^stumps, AdaBoost-L1, and DeepBoost; the numerical entries are not reproduced here.]

Breiman (1999) and Reyzin & Schapire (2006) extensively investigated the relationship between the complexity of decision trees in an ensemble learned by AdaBoost and the generalization error of the ensemble. We tested DeepBoost on the same UCI datasets used by these authors, archive.ics.uci.edu/ml/datasets.html, specifically breastcancer, ionosphere, german (numeric) and diabetes. We also experimented with two optical character recognition datasets used by Reyzin & Schapire (2006), ocr17 and ocr49, which contain the handwritten digits 1 and 7, and 4 and 9, respectively. Finally, because these OCR datasets are fairly small, we also constructed the analogous datasets from all of MNIST, lecun.com/exdb/mnist/, which we call ocr17-mnist and ocr49-mnist. More details on all the datasets are given in Table 4, Appendix D.1.

As we discussed in Section 3.2, by fixing the parameters β and λ to certain values, we recover some known algorithms as special cases of DeepBoost. Our experiments compared DeepBoost to AdaBoost (β = λ = 0 with exponential loss), to Logistic Regression (β = λ = 0 with logistic loss), which we abbreviate as LogReg, to L1-norm regularized AdaBoost (e.g., see Rätsch et al., 2001a), abbreviated as AdaBoost-L1, and also to the L1-norm regularized additive Logistic Regression algorithm studied by Duchi & Singer (2009) (β > 0, λ = 0), abbreviated as LogReg-L1.

In the first set of experiments, reported in Table 1, we compared AdaBoost, AdaBoost-L1, and DeepBoost with the exponential loss Φ(−u) = exp(−u) and base hypotheses H^stumps. We tested standard AdaBoost with base hypotheses H_1^stumps and H_2^stumps. For AdaBoost-L1, we optimized over β in {2^{-i} : i = 6, ..., 10}, and for DeepBoost, we optimized over β in the same range and λ in {0.0001, 0.005, 0.01, 0.05, 0.1, 0.5}. The exact parameter optimization procedure is described below. In the second set of experiments, reported in Table 2, we used base hypotheses H_K^trees instead of H^stumps, where the maximum tree depth K was an additional parameter to be optimized. Specifically, for AdaBoost we optimized over K in {1, ..., 6}, for AdaBoost-L1 we optimized over those same values for K and over β in {10^{-i} : i = 3, ..., 7}, and for DeepBoost we optimized over those same values for K and β and over λ in {10^{-i} : i = 3, ..., 7}. The last set of experiments, reported in Table 3, is identical to the experiments reported in Table 2, except that we used the logistic loss Φ(−u) = log_2(1 + exp(−u)).

We used the following parameter optimization procedure in all experiments: Each dataset was randomly partitioned into 10 folds, and each algorithm was run 10 times, with a different assignment of folds to the training set, validation set and test set for each run. Specifically, for each run i in {0, ..., 9}, fold i was used for testing, fold i + 1 (mod 10) was used for validation, and the remaining folds were used for training. For each run, we selected the parameters that had the lowest error on the validation set and then measured the error of those parameters on the test set. The average error and the standard deviation of the error over all 10 runs are reported in Tables 1, 2 and 3, as are the average number of trees and the average size of the trees in the ensembles. In all of our experiments, the number of iterations was set to 100.
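The fold rotation described above can be summarized by the following small sketch (a paraphrase of the protocol, not code from the authors).

```python
def fold_roles(run, num_folds=10):
    """Fold assignment for one run: fold `run` is the test set, fold (run+1) mod 10
    is the validation set, and the remaining eight folds form the training set."""
    test = run
    valid = (run + 1) % num_folds
    train = [f for f in range(num_folds) if f not in (test, valid)]
    return train, valid, test

# Ten runs, each with a different rotation of the folds.
splits = [fold_roles(run) for run in range(10)]
```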

Table 2. Results for boosted decision trees and the exponential loss function. [Same datasets as in Table 1; Error, std dev, Avg tree size, and Avg no. of trees are reported for AdaBoost, AdaBoost-L1, and DeepBoost; the numerical entries are not reproduced here.]

Table 3. Results for boosted decision trees and the logistic loss function. [Same layout as Table 2, with LogReg, LogReg-L1, and DeepBoost as the compared algorithms.]

We also experimented with running each algorithm for up to 1,000 iterations, but observed that the test errors did not change significantly and, more importantly, that the ordering of the algorithms by their test errors was unchanged from 100 iterations to 1,000 iterations.

Observe that with the exponential loss, DeepBoost has a smaller test error than AdaBoost and AdaBoost-L1 on every dataset and for every set of base hypotheses, except for the ocr49-mnist dataset with decision trees, where its performance matches that of AdaBoost-L1. Similarly, with the logistic loss, DeepBoost always performs at least as well as LogReg or LogReg-L1. For the small-sized UCI datasets it is difficult to obtain statistically significant results but, for the larger ocr17-mnist and ocr49-mnist datasets, our results with DeepBoost are statistically significantly better (one-sided paired t-tests) in all three sets of experiments (three tables), except for ocr49-mnist in Table 3, where this holds only for the comparison with LogReg.

This across-the-board improvement is the result of DeepBoost's complexity-conscious ability to dynamically tune the sizes of the decision trees selected in each boosting round, trading off between training error and hypothesis class complexity. The selected tree sizes should depend on properties of the training set, and this is borne out by our experiments: for some datasets, such as breastcancer, DeepBoost selects trees that are smaller on average than the trees selected by AdaBoost-L1 or LogReg-L1, while for other datasets, such as german, the average tree size is larger. Note that AdaBoost and AdaBoost-L1 produce ensembles of trees that have a constant depth, since neither algorithm penalizes tree size (except for imposing a maximum tree depth K), while for DeepBoost the trees in one ensemble typically vary in size. Figure 3 plots the distribution of tree sizes for one run of DeepBoost.

Figure 3. Distribution of tree sizes when DeepBoost is run on the ionosphere dataset. [Histogram of tree sizes (frequency vs. tree size) for one fold.]

It should be noted that the columns for AdaBoost in Table 1 simply list the number of stumps to be the same as the number of boosting rounds; a careful examination of the ensembles for 100 rounds of boosting typically reveals a 5% duplication of stumps in the ensembles.

Theorem 1 is a margin-based generalization guarantee, and it is also the basis for the derivation of DeepBoost, so we should expect DeepBoost to induce large margins on the training set. Figure 4 shows the margin distributions for AdaBoost, AdaBoost-L1 and DeepBoost on the same subset of the ionosphere dataset.

Figure 4. Distribution of normalized margins for AdaBoost (upper right), AdaBoost-L1 (upper left) and DeepBoost (lower left) on the same subset of ionosphere. The cumulative margin distributions (lower right) illustrate that DeepBoost (red) induces larger margins on the training set than either AdaBoost (black) or AdaBoost-L1 (blue).

5. Conclusion

We presented a theoretical analysis of learning with a base hypothesis set composed of increasingly complex sub-families, including very deep or complex ones, and derived an algorithm, DeepBoost, which is precisely based on those guarantees. We also reported the results of experiments with this algorithm and compared its performance with that of AdaBoost and additive Logistic Regression, and their L1-norm regularized counterparts, in several tasks.

We have derived similar theoretical guarantees in the multi-class setting and used them to derive a family of new multi-class deep boosting algorithms that we will present and discuss elsewhere. Our theoretical analysis and algorithmic design could also be extended to ranking and to a broad class of loss functions. This should also lead to the generalization of several existing algorithms and their use with a richer hypothesis set structured as a union of families with different Rademacher complexities. In particular, the broad family of maximum entropy models and conditional maximum entropy models and their many variants, which includes the already discussed logistic regression, could all be extended in a similar way. The resulting DeepMaxent models, or their conditional versions, may admit an alternative theoretical justification that we will discuss elsewhere. Our algorithm can also be extended by considering non-differentiable convex surrogate losses such as the hinge loss. When used with kernel base classifiers, this leads to an algorithm we have named DeepSVM. The theory we developed could perhaps be further generalized to encompass the analysis of other learning techniques such as multi-layer neural networks.

Our analysis and algorithm also shed some new light on some remaining questions about the theory underlying AdaBoost. The primary theoretical justification for AdaBoost is a margin guarantee (Schapire et al., 1997; Koltchinskii & Panchenko, 2002). However, AdaBoost does not precisely maximize the minimum margin, while other algorithms such as arc-gv (Breiman, 1999) that are designed to do so tend not to outperform AdaBoost (Reyzin & Schapire, 2006). Two main reasons are suspected for this observation: (1) in order to achieve a better margin, algorithms such as arc-gv may tend to select deeper decision trees or, in general, more complex hypotheses, which may then affect their generalization; (2) while those algorithms achieve a better minimum margin, they do not achieve a better margin distribution. Our theory may help better understand and evaluate the effect of factor (1), since our learning bounds explicitly depend on the mixture weights and the contribution of each hypothesis set H_k to the definition of the ensemble function. However, our guarantees also suggest a better algorithm, DeepBoost.

Acknowledgments

We thank Vitaly Kuznetsov for his comments on an earlier draft of this paper. The work of M. Mohri was partly funded by an NSF IIS award.

References

Bartlett, Peter L. and Mendelson, Shahar. Rademacher and Gaussian complexities: Risk bounds and structural results. JMLR, 3, 2002.

Bauer, Eric and Kohavi, Ron. An empirical comparison of voting classification algorithms: Bagging, boosting, and variants. Machine Learning, 36(1-2):105-139, 1999.

Breiman, Leo. Bagging predictors. Machine Learning, 24(2):123-140, 1996.

Breiman, Leo. Prediction games and arcing algorithms. Neural Computation, 11(7):1493-1517, 1999.

Caruana, Rich, Niculescu-Mizil, Alexandru, Crew, Geoff, and Ksikes, Alex. Ensemble selection from libraries of models. In ICML, 2004.

Dietterich, Thomas G. An experimental comparison of three methods for constructing ensembles of decision trees: Bagging, boosting, and randomization. Machine Learning, 40(2):139-157, 2000.

Duchi, John C. and Singer, Yoram. Boosting with structural sparsity. In ICML, 2009.

Freund, Yoav and Schapire, Robert E. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119-139, 1997.

Freund, Yoav, Mansour, Yishay, and Schapire, Robert E. Generalization bounds for averaged classifiers. The Annals of Statistics, 32:1698-1722, 2004.

Friedman, Jerome, Hastie, Trevor, and Tibshirani, Robert. Additive logistic regression: a statistical view of boosting. Annals of Statistics, 28:2000, 1998.

Grove, Adam J. and Schuurmans, Dale. Boosting in the limit: Maximizing the margin of learned ensembles. In AAAI/IAAI, 1998.

Kivinen, Jyrki and Warmuth, Manfred K. Boosting as entropy projection. In COLT, 1999.

Koltchinskii, Vladimir and Panchenko, Dmitry. Empirical margin distributions and bounding the generalization error of combined classifiers. Annals of Statistics, 30, 2002.

MacKay, David J. C. Bayesian methods for adaptive models. PhD thesis, California Institute of Technology, 1991.

Mansour, Yishay. Pessimistic decision tree pruning based on tree size. In Proceedings of ICML, pp. 195-201, 1997.

Mohri, Mehryar, Rostamizadeh, Afshin, and Talwalkar, Ameet. Foundations of Machine Learning. The MIT Press, 2012.

Quinlan, J. Ross. Bagging, boosting, and C4.5. In AAAI/IAAI, Vol. 1, 1996.

Rätsch, Gunnar and Warmuth, Manfred K. Maximizing the margin with boosting. In COLT, 2002.

Rätsch, Gunnar and Warmuth, Manfred K. Efficient margin maximizing with boosting. Journal of Machine Learning Research, 6:2131-2152, 2005.

Rätsch, Gunnar, Mika, Sebastian, and Warmuth, Manfred K. On the convergence of leveraging. In NIPS, 2001a.

Rätsch, Gunnar, Onoda, Takashi, and Müller, Klaus-Robert. Soft margins for AdaBoost. Machine Learning, 42(3):287-320, 2001b.

Reyzin, Lev and Schapire, Robert E. How boosting the margin can also boost classifier complexity. In ICML, 2006.

Schapire, Robert E. Theoretical views of boosting and applications. In Proceedings of ALT 1999, volume 1720 of Lecture Notes in Computer Science. Springer, 1999.

Schapire, Robert E. The boosting approach to machine learning: An overview. In Nonlinear Estimation and Classification. Springer, 2003.

Schapire, Robert E., Freund, Yoav, Bartlett, Peter, and Lee, Wee Sun. Boosting the margin: A new explanation for the effectiveness of voting methods. In ICML, 1997.

Smyth, Padhraic and Wolpert, David. Linearly combining density estimators via stacking. Machine Learning, 36:59-83, July 1999.

Vapnik, Vladimir N. Statistical Learning Theory. Wiley-Interscience, 1998.

Warmuth, Manfred K., Liao, Jun, and Rätsch, Gunnar. Totally corrective boosting algorithms that maximize the margin. In ICML, 2006.

A. Proof of Theorem 1

Theorem 1. Assume p > 1. Fix ρ > 0. Then, for any δ > 0, with probability at least 1 − δ over the choice of a sample S of size m drawn i.i.d. according to D^m, the following inequality holds for all f = Σ_{t=1}^T α_t h_t in F:

$$R(f) \le \widehat R_{S,\rho}(f) + \frac{4}{\rho}\sum_{t=1}^T \alpha_t R_m(H_{k_t}) + \frac{2}{\rho}\sqrt{\frac{\log p}{m}} + \sqrt{\bigg\lceil \frac{4}{\rho^2}\log\Big(\frac{\rho^2 m}{\log p}\Big)\bigg\rceil \frac{\log p}{m} + \frac{\log\frac{2}{\delta}}{2m}}.$$

Thus, R(f) ≤ \hat R_{S,ρ}(f) + (4/ρ) Σ_{t=1}^T α_t R_m(H_{k_t}) + C(m, p), with C(m, p) = O((1/ρ) sqrt((log p)/m · log[(ρ² m)/(log p)])).

Proof. For a fixed h = (h_1, ..., h_T), any α in the simplex defines a distribution over {h_1, ..., h_T}. Sampling from {h_1, ..., h_T} according to α and averaging leads to functions g of the form g = (1/n) Σ_{t=1}^T n_t h_t for some n = (n_1, ..., n_T), with Σ_{t=1}^T n_t = n and h_t in H_{k_t}. For any N = (N_1, ..., N_p) with |N| = n, we consider the family of functions

$$G_{F,N} = \Big\{\frac{1}{n}\sum_{k=1}^p\sum_{j=1}^{N_k} h_{k,j} \;\Big|\; \forall (k,j)\in[p]\times[N_k],\ h_{k,j}\in H_k\Big\},$$

and the union of all such families, G_{F,n} = ∪_{|N|=n} G_{F,N}. Fix ρ > 0. For a fixed N, the Rademacher complexity of G_{F,N} can be bounded as follows for any m ≥ 1:

$$R_m(G_{F,N}) \le \frac{1}{n}\sum_{k=1}^p N_k\,R_m(H_k).$$

Thus, the following standard margin-based Rademacher complexity bound holds (Koltchinskii & Panchenko, 2002): for any δ > 0, with probability at least 1 − δ, for all g in G_{F,N},

$$R_\rho(g) \le \widehat R_{S,\rho}(g) + \frac{2}{\rho}\cdot\frac{1}{n}\sum_{k=1}^p N_k\,R_m(H_k) + \sqrt{\frac{\log\frac{1}{\delta}}{2m}}.$$

Since there are at most p^n possible p-tuples N with |N| = n, by the union bound, for any δ > 0, with probability at least 1 − δ, for all g in G_{F,n},

$$R_\rho(g) \le \widehat R_{S,\rho}(g) + \frac{2}{\rho}\cdot\frac{1}{n}\sum_{k=1}^p N_k\,R_m(H_k) + \sqrt{\frac{\log\frac{p^n}{\delta}}{2m}}.$$

Thus, with probability at least 1 − δ, for all functions g = (1/n)Σ_{t=1}^T n_t h_t with h_t in H_{k_t}, the following inequality holds:

$$R_\rho(g) \le \widehat R_{S,\rho}(g) + \frac{2}{\rho}\cdot\frac{1}{n}\sum_{k=1}^p\sum_{t:\,k_t = k} n_t\,R_m(H_{k_t}) + \sqrt{\frac{\log\frac{p^n}{\delta}}{2m}}.$$

Taking the expectation with respect to α and using E_α[n_t/n] = α_t, we obtain that for any δ > 0, with probability at least 1 − δ, for all h,

$$\mathop{\mathrm E}_\alpha\big[R_\rho(g) - \widehat R_{S,\rho}(g)\big] \le \frac{2}{\rho}\sum_{t=1}^T \alpha_t\,R_m(H_{k_t}) + \sqrt{\frac{\log\frac{p^n}{\delta}}{2m}}.$$

Fix n ≥ 1. Then, for any δ_n > 0, with probability at least 1 − δ_n,

$$\mathop{\mathrm E}_\alpha\big[R_{\rho/2}(g) - \widehat R_{S,\rho/2}(g)\big] \le \frac{4}{\rho}\sum_{t=1}^T \alpha_t\,R_m(H_{k_t}) + \sqrt{\frac{\log\frac{p^n}{\delta_n}}{2m}}.$$

Choose δ_n = δ/(2 p^{n−1}) for some δ > 0; then, for p ≥ 2, Σ_{n≥1} δ_n = δ/(2(1 − 1/p)) ≤ δ. Thus, for any δ > 0 and any n ≥ 1, with probability at least 1 − δ, the following holds for all h:

$$\mathop{\mathrm E}_\alpha\big[R_{\rho/2}(g) - \widehat R_{S,\rho/2}(g)\big] \le \frac{4}{\rho}\sum_{t=1}^T \alpha_t\,R_m(H_{k_t}) + \sqrt{\frac{2n\log p + \log\frac{2}{\delta}}{2m}}. \qquad (10)$$

Now, for any f = Σ_{t=1}^T α_t h_t in F and any g = (1/n)Σ_{t=1}^T n_t h_t, we can upper bound R(f) = Pr_{(x,y)~D}[yf(x) ≤ 0], the generalization error of f, as follows:

$$R(f) = \Pr[yf(x) - yg(x) + yg(x) \le 0] \le \Pr[yf(x) - yg(x) < -\rho/2] + \Pr[yg(x) \le \rho/2] = \Pr[yf(x) - yg(x) < -\rho/2] + R_{\rho/2}(g).$$

We can also write

$$\widehat R_{S,\rho/2}(g) = \widehat R_{S,\rho/2}(g - f + f) \le \Pr_S[yg(x) - yf(x) < -\rho/2] + \widehat R_{S,\rho}(f).$$

Combining these inequalities yields

$$\Pr_{(x,y)\sim D}[yf(x)\le 0] - \widehat R_{S,\rho}(f) \le \Pr[yf(x) - yg(x) < -\rho/2] + \Pr_S[yg(x) - yf(x) < -\rho/2] + R_{\rho/2}(g) - \widehat R_{S,\rho/2}(g).$$

Taking the expectation with respect to α yields

$$R(f) - \widehat R_{S,\rho}(f) \le \mathop{\mathrm E}_{x\sim D,\alpha}\big[1_{yf(x)-yg(x)<-\rho/2}\big] + \mathop{\mathrm E}_{x\sim S,\alpha}\big[1_{yg(x)-yf(x)<-\rho/2}\big] + \mathop{\mathrm E}_\alpha\big[R_{\rho/2}(g) - \widehat R_{S,\rho/2}(g)\big].$$

Since f = E_α[g], by Hoeffding's inequality, for any x,

$$\mathop{\mathrm E}_\alpha\big[1_{yf(x)-yg(x)<-\rho/2}\big] = \Pr_\alpha[yf(x)-yg(x)<-\rho/2] \le e^{-n\rho^2/8},\qquad \mathop{\mathrm E}_\alpha\big[1_{yg(x)-yf(x)<-\rho/2}\big] = \Pr_\alpha[yg(x)-yf(x)<-\rho/2] \le e^{-n\rho^2/8}.$$

Thus, for any fixed f in F, we can write

$$R(f) - \widehat R_{S,\rho}(f) \le 2e^{-n\rho^2/8} + \mathop{\mathrm E}_\alpha\big[R_{\rho/2}(g) - \widehat R_{S,\rho/2}(g)\big].$$

Thus, the following inequality holds:

$$\sup_{f\in F}\big(R(f) - \widehat R_{S,\rho}(f)\big) \le 2e^{-n\rho^2/8} + \sup_{h}\,\mathop{\mathrm E}_\alpha\big[R_{\rho/2}(g) - \widehat R_{S,\rho/2}(g)\big].$$

Therefore, in view of (10), for any δ > 0 and any n ≥ 1, with probability at least 1 − δ, the following holds for all f in F:

$$R(f) - \widehat R_{S,\rho}(f) \le \frac{4}{\rho}\sum_{t=1}^T \alpha_t R_m(H_{k_t}) + 2e^{-n\rho^2/8} + \sqrt{\frac{2n\log p + \log\frac{2}{\delta}}{2m}}.$$

To select n, we seek to minimize the function φ: n → 2e^{−nu} + sqrt(nv), with u = ρ²/8 and v = (log p)/m. φ is differentiable and, for all n, φ'(n) = −2u e^{−nu} + v/(2 sqrt(nv)). The minimum of φ is thus attained at a point n such that

$$\varphi'(n) = 0 \iff 2u\,e^{-nu} = \frac{1}{2}\sqrt{\frac{v}{n}} \iff (2nu)\,e^{-2nu} = \frac{v}{8u} \iff n = -\frac{1}{2u}\,W_{-1}\Big(-\frac{v}{8u}\Big),$$

where W_{−1} is the second branch of the Lambert function (the inverse of x → x e^x). It is not hard to verify that the following inequalities hold for all x in (0, 1/e]: log(1/x) ≤ −W_{−1}(−x) ≤ 2 log(1/x). Bounding −W_{−1} using the lower bound leads to the following choice for n:

$$n = \bigg\lceil\frac{1}{2u}\log\frac{8u}{v}\bigg\rceil = \bigg\lceil\frac{4}{\rho^2}\log\Big(\frac{\rho^2 m}{\log p}\Big)\bigg\rceil.$$

Plugging in this value of n, and using 2e^{−nρ²/8} ≤ (2/ρ) sqrt((log p)/m) for this choice, yields the following bound:

$$R(f) \le \widehat R_{S,\rho}(f) + \frac{4}{\rho}\sum_{t=1}^T \alpha_t R_m(H_{k_t}) + \frac{2}{\rho}\sqrt{\frac{\log p}{m}} + \sqrt{\bigg\lceil\frac{4}{\rho^2}\log\Big(\frac{\rho^2 m}{\log p}\Big)\bigg\rceil\frac{\log p}{m} + \frac{\log\frac{2}{\delta}}{2m}},$$

which concludes the proof.

Figure 5. Illustration of the directional derivatives in the three cases of definition (11) (panels (a), (b) and (c)).

B. Coordinate descent

B.1. Maximum descent coordinate

For a differentiable convex function, the definition of coordinate descent along the direction with maximal descent is standard: the direction selected is the one maximizing the absolute value of the directional derivative. Here, we clarify the definition of the maximal descent strategy for a non-differentiable convex function. For any function Q: R^N → R, we denote by Q'_+(α, e) the right directional derivative of Q at α in R^N and by Q'_−(α, e) its left directional derivative at α along the direction e in R^N, ||e|| = 1, when they exist:

$$Q'_+(\alpha, e) = \lim_{\eta\to 0^+}\frac{Q(\alpha + \eta e) - Q(\alpha)}{\eta},\qquad Q'_-(\alpha, e) = \lim_{\eta\to 0^-}\frac{Q(\alpha + \eta e) - Q(\alpha)}{\eta}.$$

For the remainder of this section, we will assume that Q is a convex function. It is known that in that case these quantities always exist and that Q'_−(α, e) ≤ Q'_+(α, e) for all α and e. The left and right directional derivatives coincide with the directional derivative Q'(α, e) of Q along the direction e when Q is differentiable at α along the direction e: Q'(α, e) = Q'_+(α, e) = Q'_−(α, e). For any j in [1, N], let e_j denote the jth unit vector in R^N. For any α in R^N and j in [1, N], we define the descent gradient δQ(α, e_j) of Q along the direction e_j as follows:

$$\delta Q(\alpha, e_j) = \begin{cases} 0 & \text{if } Q'_-(\alpha, e_j) \le 0 \le Q'_+(\alpha, e_j)\\ Q'_+(\alpha, e_j) & \text{if } Q'_-(\alpha, e_j) \le Q'_+(\alpha, e_j) \le 0\\ Q'_-(\alpha, e_j) & \text{if } 0 \le Q'_-(\alpha, e_j) \le Q'_+(\alpha, e_j).\end{cases} \qquad (11)$$

δQ(α, e_j) is the element of the subgradient along e_j that is the closest to 0. Figure 5 illustrates the three cases of this definition. Note that when Q is differentiable along e_j, then Q'_+(α, e_j) = Q'_−(α, e_j) and δQ(α, e_j) = Q'(α, e_j). The maximum descent coordinate can then be defined by

$$k = \mathop{\mathrm{argmax}}_{j\in[1,N]}\ \big|\delta Q(\alpha, e_j)\big|. \qquad (12)$$

This coincides with the standard definition when Q is convex and differentiable.
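For clarity, the case analysis of definition (11) can be expressed as a tiny Python helper (a sketch, taking left and right directional derivatives computed elsewhere).

```python
def descent_gradient(left_deriv, right_deriv):
    """delta Q(alpha, e_j): the element of the subgradient along e_j closest to 0.
    Assumes left_deriv <= right_deriv, which holds for a convex Q."""
    if right_deriv <= 0:      # both derivatives non-positive: closest to 0 is the right one
        return right_deriv
    if left_deriv >= 0:       # both derivatives non-negative: closest to 0 is the left one
        return left_deriv
    return 0.0                # 0 lies between them, i.e. 0 is in the subgradient
```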

B.2. Direction

In view of (12), at each iteration t, the direction e_k selected by coordinate descent with maximum descent is k = argmax_{j in [1,N]} |δF(α_{t−1}, e_j)|. To determine k, we compute δF(α_{t−1}, e_j) for all j in [1, N] by distinguishing two cases: α_{t−1,j} ≠ 0 and α_{t−1,j} = 0.

Assume first that α_{t−1,j} ≠ 0 and let s denote the sign of α_{t−1,j}. For η sufficiently small, α_{t−1,j} + η has the sign of α_{t−1,j}, that is s, and

$$F(\alpha_{t-1} + \eta e_j) = \frac{1}{m}\sum_{i=1}^m \Phi\big(1 - y_i f_{t-1}(x_i) - \eta\,y_i h_j(x_i)\big) + \sum_{p\ne j}\Lambda_p|\alpha_{t-1,p}| + s\,\Lambda_j(\alpha_{t-1,j} + \eta).$$

Thus, when α_{t−1,j} ≠ 0, F admits a directional derivative along e_j given by

$$F'(\alpha_{t-1}, e_j) = -\frac{1}{m}\sum_{i=1}^m y_i h_j(x_i)\,\Phi'\big(1 - y_i f_{t-1}(x_i)\big) + s\Lambda_j = -\frac{S_t}{m}\sum_{i=1}^m y_i h_j(x_i)\,D_t(i) + s\Lambda_j = (2\epsilon_{t,j} - 1)\frac{S_t}{m} + s\Lambda_j,$$

and δF(α_{t−1}, e_j) = (2ε_{t,j} − 1) S_t/m + sgn(α_{t−1,j}) Λ_j. When α_{t−1,j} = 0, we find similarly that

$$F'_+(\alpha_{t-1}, e_j) = (2\epsilon_{t,j} - 1)\frac{S_t}{m} + \Lambda_j,\qquad F'_-(\alpha_{t-1}, e_j) = (2\epsilon_{t,j} - 1)\frac{S_t}{m} - \Lambda_j.$$

The condition F'_−(α_{t−1}, e_j) ≤ 0 ≤ F'_+(α_{t−1}, e_j) is then equivalent to |2ε_{t,j} − 1| S_t/m ≤ Λ_j, that is |ε_{t,j} − 1/2| ≤ Λ_j m/(2 S_t). Thus, in summary, we can write, for all j in [1, N],

$$\delta F(\alpha_{t-1}, e_j) = \begin{cases}(2\epsilon_{t,j} - 1)\frac{S_t}{m} + \mathrm{sgn}(\alpha_{t-1,j})\,\Lambda_j & \text{if } \alpha_{t-1,j} \ne 0\\ 0 & \text{else if } \big|\epsilon_{t,j} - \tfrac12\big| \le \tfrac{\Lambda_j m}{2 S_t}\\ (2\epsilon_{t,j} - 1)\frac{S_t}{m} + \Lambda_j & \text{else if } \epsilon_{t,j} - \tfrac12 < -\tfrac{\Lambda_j m}{2 S_t}\\ (2\epsilon_{t,j} - 1)\frac{S_t}{m} - \Lambda_j & \text{otherwise.}\end{cases}$$

This can be simplified by unifying the last two cases and observing that the sign of ε_{t,j} − 1/2 suffices to distinguish between them:

$$\delta F(\alpha_{t-1}, e_j) = \begin{cases}(2\epsilon_{t,j} - 1)\frac{S_t}{m} + \mathrm{sgn}(\alpha_{t-1,j})\,\Lambda_j & \text{if } \alpha_{t-1,j} \ne 0\\ 0 & \text{else if } \big|\epsilon_{t,j} - \tfrac12\big| \le \tfrac{\Lambda_j m}{2 S_t}\\ (2\epsilon_{t,j} - 1)\frac{S_t}{m} - \mathrm{sgn}\big(\epsilon_{t,j} - \tfrac12\big)\,\Lambda_j & \text{otherwise.}\end{cases}$$

Up to multiplication by the positive factor m/(2 S_t), these are the quantities d_j computed on lines 5-9 of Figure 2, so the argmax is unchanged.

B.3. Step

Given the direction e_k, the optimal step value η* is given by argmin_η F(α_{t−1} + η e_k). In the most general case, η* can be found via a line search or other numerical methods. In some special cases, we can derive a closed-form solution for the step by minimizing an upper bound on F(α_{t−1} + η e_k). For convenience, in what follows, we use the shorthand ε_t for ε_{t,k}. Since y_i h_k(x_i) = ((1 + y_i h_k(x_i))/2)(+1) + ((1 − y_i h_k(x_i))/2)(−1) with y_i h_k(x_i) in [−1, +1], by the convexity of u → Φ(1 − y_i f_{t−1}(x_i) − ηu), the following holds for all η in R:

$$\Phi\big(1 - y_i f_{t-1}(x_i) - \eta\,y_i h_k(x_i)\big) \le \frac{1 + y_i h_k(x_i)}{2}\,\Phi\big(1 - y_i f_{t-1}(x_i) - \eta\big) + \frac{1 - y_i h_k(x_i)}{2}\,\Phi\big(1 - y_i f_{t-1}(x_i) + \eta\big). \qquad (13)$$

Thus, we can write

$$F(\alpha_{t-1} + \eta e_k) \le \sum_{j\ne k}\Lambda_j|\alpha_{t-1,j}| + \frac{1}{m}\sum_{i=1}^m\bigg[\frac{1 + y_i h_k(x_i)}{2}\,\Phi\big(1 - y_i f_{t-1}(x_i) - \eta\big) + \frac{1 - y_i h_k(x_i)}{2}\,\Phi\big(1 - y_i f_{t-1}(x_i) + \eta\big)\bigg] + \Lambda_k|\alpha_{t-1,k} + \eta|.$$

Let J(η) denote the η-dependent part of this upper bound, that is the upper bound minus the constant term Σ_{j≠k} Λ_j|α_{t−1,j}|. We can select η to minimize J(η). J is convex and admits a subdifferential at all points. Thus, η* is a minimizer of J iff 0 is in ∂J(η*), where ∂J(η) denotes the subdifferential of J at η. (Note that when the functions in H_k take values in {−1, +1}, (13) is in fact an equality and J(η) coincides with F(α_{t−1} + η e_k) − Σ_{j≠k} Λ_j|α_{t−1,j}|.)

B.4. Exponential loss

In the case of the exponential loss, Φ(−u) = e^{−u}, J(η) can be expressed as follows:

$$J(\eta) = \frac{1}{m}\sum_{i=1}^m\bigg[\frac{1 + y_i h_k(x_i)}{2}\,e^{1 - y_i f_{t-1}(x_i)}\,e^{-\eta} + \frac{1 - y_i h_k(x_i)}{2}\,e^{1 - y_i f_{t-1}(x_i)}\,e^{\eta}\bigg] + \Lambda_k|\alpha_{t-1,k} + \eta|,$$

and e^{1 − y_i f_{t−1}(x_i)} = Φ'(1 − y_i f_{t−1}(x_i)) = S_t D_t(i). Thus, J can be rewritten as follows:

$$J(\eta) = (1 - \epsilon_t)\frac{S_t}{m}\,e^{-\eta} + \epsilon_t\frac{S_t}{m}\,e^{\eta} + \Lambda_k|\alpha_{t-1,k} + \eta|.$$

Recall that we use the shorthand ε_t = ε_{t,k}, where k is the index of the direction e_k selected.

If α_{t−1,k} + η* = 0, then the subdifferential of η → |α_{t−1,k} + η| at η* is the set {ν : ν in [−1, +1]}. Thus, 0 is in ∂J(η*) iff there exists ν in [−1, +1] such that

$$-(1-\epsilon_t)\frac{S_t}{m}\,e^{-\eta^*} + \epsilon_t\frac{S_t}{m}\,e^{\eta^*} + \Lambda_k\nu = 0 \iff -(1-\epsilon_t)\frac{S_t}{m}\,e^{\alpha_{t-1,k}} + \epsilon_t\frac{S_t}{m}\,e^{-\alpha_{t-1,k}} + \Lambda_k\nu = 0.$$

This is equivalent to the condition

$$\big|(1-\epsilon_t)\,e^{\alpha_{t-1,k}} - \epsilon_t\,e^{-\alpha_{t-1,k}}\big| \le \frac{\Lambda_k m}{S_t}. \qquad (14)$$

If α_{t−1,k} + η* > 0, then the subdifferential of η → |α_{t−1,k} + η| at η* is reduced to {+1}, and 0 is in ∂J(η*) iff

$$-(1-\epsilon_t)\frac{S_t}{m}\,e^{-\eta^*} + \epsilon_t\frac{S_t}{m}\,e^{\eta^*} + \Lambda_k = 0 \iff \epsilon_t\,e^{2\eta^*} + \frac{\Lambda_k m}{S_t}\,e^{\eta^*} - (1-\epsilon_t) = 0. \qquad (15)$$

Solving the resulting second-degree equation in e^{η*} gives

$$e^{\eta^*} = -\frac{\Lambda_k m}{2\epsilon_t S_t} + \sqrt{\Big(\frac{\Lambda_k m}{2\epsilon_t S_t}\Big)^2 + \frac{1-\epsilon_t}{\epsilon_t}},\quad\text{that is}\quad \eta^* = \log\Bigg[-\frac{\Lambda_k m}{2\epsilon_t S_t} + \sqrt{\Big(\frac{\Lambda_k m}{2\epsilon_t S_t}\Big)^2 + \frac{1-\epsilon_t}{\epsilon_t}}\Bigg].$$

Let P be the second-degree polynomial of (15) whose positive root is e^{η*}. P is convex, has one negative root and one positive root, and the positive root is e^{η*}. Since e^{−α_{t−1,k}} is positive, the condition α_{t−1,k} + η* > 0, that is −α_{t−1,k} < η*, is then equivalent to P(e^{−α_{t−1,k}}) < 0 (see Figure 6), that is

$$\epsilon_t\,e^{-2\alpha_{t-1,k}} + \frac{\Lambda_k m}{S_t}\,e^{-\alpha_{t-1,k}} - (1-\epsilon_t) < 0 \iff (1-\epsilon_t)\,e^{\alpha_{t-1,k}} - \epsilon_t\,e^{-\alpha_{t-1,k}} > \frac{\Lambda_k m}{S_t}, \qquad (16)$$

and leads to the step size η_t = η* given above. Note that this step size satisfies η_t ≤ η_0, where η_0 = (1/2) log[(1 − ε_t)/ε_t] is the step size used by AdaBoost.

The case α_{t−1,k} + η* < 0 can be treated similarly. It is equivalent to the condition

$$(1-\epsilon_t)\,e^{\alpha_{t-1,k}} - \epsilon_t\,e^{-\alpha_{t-1,k}} < -\frac{\Lambda_k m}{S_t}, \qquad (17)$$

and leads to the step size

$$\eta_t = \log\Bigg[\frac{\Lambda_k m}{2\epsilon_t S_t} + \sqrt{\Big(\frac{\Lambda_k m}{2\epsilon_t S_t}\Big)^2 + \frac{1-\epsilon_t}{\epsilon_t}}\Bigg].$$

Figure 6. Plot of the second-degree polynomial function P.
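Putting the three cases together, the exponential-loss step (lines 12-16 of Figure 2) can be sketched as follows in Python; this is our rendering of the closed form above, offered as an illustration rather than the authors' code.

```python
import math

def step_size_exp(eps_t, alpha_k, Lambda_k, S_t, m):
    """Closed-form coordinate-descent step for the exponential loss.

    eps_t    -- weighted error eps_{t,k} of the selected hypothesis, in (0, 1)
    alpha_k  -- current weight alpha_{t-1,k} of that hypothesis
    Lambda_k -- penalty lam * r_k + beta
    S_t, m   -- normalization factor and sample size
    """
    edge = (1.0 - eps_t) * math.exp(alpha_k) - eps_t * math.exp(-alpha_k)
    bound = Lambda_k * m / S_t
    if abs(edge) <= bound:                     # condition (14): cancel the coordinate
        return -alpha_k
    c = Lambda_k * m / (2.0 * eps_t * S_t)
    root = math.sqrt(c * c + (1.0 - eps_t) / eps_t)
    if edge > bound:                           # condition (16): alpha_k + eta > 0
        return math.log(-c + root)
    return math.log(c + root)                  # condition (17): alpha_k + eta < 0
```

With Λ_k = 0, all branches reduce to the AdaBoost step (1/2) log((1 − ε_t)/ε_t).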

B.5. Logistic loss

In the case of the logistic loss, for any u in R, Φ(−u) = log_2(1 + e^{−u}) and Φ'(−u) = 1/(log 2 (1 + e^{u})). To determine the step size, we use the following general upper bound:

$$\Phi(-u - v) - \Phi(-u) = \log_2\Big[\frac{1 + e^{-u-v}}{1 + e^{-u}}\Big] = \log_2\Big[1 + \frac{e^{-u}(e^{-v} - 1)}{1 + e^{-u}}\Big] \le \frac{1}{\log 2}\cdot\frac{e^{-v} - 1}{1 + e^{u}} \le \frac{e^{-v}}{\log 2\,(1 + e^{u})} = \Phi'(-u)\,e^{-v}.$$

Thus, we can write

$$F(\alpha_{t-1} + \eta e_k) - F(\alpha_{t-1}) \le \frac{1}{m}\sum_{i=1}^m \Phi'\big(1 - y_i f_{t-1}(x_i)\big)\,e^{-\eta y_i h_k(x_i)} + \Lambda_k\big(|\alpha_{t-1,k} + \eta| - |\alpha_{t-1,k}|\big) = \frac{S_t}{m}\sum_{i=1}^m D_t(i)\,e^{-\eta y_i h_k(x_i)} + \Lambda_k\big(|\alpha_{t-1,k} + \eta| - |\alpha_{t-1,k}|\big).$$

To determine η, we can minimize this upper bound, or equivalently the following:

$$\frac{S_t}{m}\sum_{i=1}^m D_t(i)\,e^{-\eta y_i h_k(x_i)} + \Lambda_k|\alpha_{t-1,k} + \eta|.$$

This expression is syntactically the same as the one considered in the case of the exponential loss, with only the distribution weights D_t(i) and the normalization factor S_t being different. Indeed, in the case of the exponential loss Φ(−u) = e^{−u}, we can write

$$F(\alpha_{t-1} + \eta e_k) - \sum_{j\ne k}\Lambda_j|\alpha_{t-1,j}| = \frac{1}{m}\sum_{i=1}^m \Phi\big(1 - y_i f_{t-1}(x_i) - \eta y_i h_k(x_i)\big) + \Lambda_k|\alpha_{t-1,k} + \eta| = \frac{1}{m}\sum_{i=1}^m \Phi'\big(1 - y_i f_{t-1}(x_i)\big)\,e^{-\eta y_i h_k(x_i)} + \Lambda_k|\alpha_{t-1,k} + \eta| = \frac{S_t}{m}\sum_{i=1}^m D_t(i)\,e^{-\eta y_i h_k(x_i)} + \Lambda_k|\alpha_{t-1,k} + \eta|,$$

since Φ = Φ' for the exponential loss. Thus, we obtain immediately the same expressions for the step size in the case of the logistic loss, with the same three cases, but with

$$S_t = \sum_{i=1}^m \frac{1}{1 + e^{-1 + y_i f_{t-1}(x_i)}} \qquad\text{and}\qquad D_t(i) = \frac{1}{S_t}\cdot\frac{1}{1 + e^{-1 + y_i f_{t-1}(x_i)}}.$$

C. Alternative DeepBoost_γ algorithm

We also devised and implemented an alternative algorithm, DeepBoost_γ, which is inspired by the learning bound of Theorem 1 but does not seek to minimize it. The algorithm admits a parameter γ > 0 representing the edge value demanded at each boosting round; this is the amount by which we require the error ε_t of the base hypothesis h_t selected at round t to be better than 1/2, that is 1/2 − ε_t > γ. We assume given p distinct hypothesis sets with increasing degrees of complexity, H_1, ..., H_p. DeepBoost_γ proceeds as if we were running AdaBoost using only H_1 as the base hypothesis set. But, at each round, if the edge achieved by the best hypothesis found in H_1 is not sufficient, that is if it is not larger than the demanded edge γ, then it selects instead the hypothesis in H_2 with the smallest error on the sample weighted by D_t. If the edge of that hypothesis is also not sufficient, it proceeds with the next hypothesis set, and so forth. If the edge is insufficient even with the best hypothesis in H_p, then it just uses the best hypothesis found in H = H_1 ∪ ... ∪ H_p. The edge parameter γ is determined via cross-validation.

DeepBoost_γ is inspired by the bound of Theorem 1 since it seeks to use as much as possible hypotheses from H_1 or other low-complexity families, and only when necessary functions from more complex families. Since it tends to rarely choose hypotheses from more complex H_k's, the complexity term of the bound of Theorem 1 remains close to the one obtained using only H_1. On the other hand, DeepBoost_γ can achieve a smaller empirical margin loss (the first term of the bound) by selecting, when needed, more powerful hypotheses than those accessible using H_1 alone. We carried out some early experiments on several datasets with DeepBoost_γ using boosting stumps, in which the performance of the algorithm was found to be superior to that of AdaBoost. A more extensive study of the theoretical and empirical properties of this algorithm is left to the future.

D. Additional empirical information

D.1. Dataset sizes and attributes

The size and the number of attributes for the datasets used in our experiments are indicated in Table 4.

Table 4. Dataset statistics. "german" refers more specifically to the german numeric dataset. [For each dataset (breastcancer, ionosphere, german, diabetes, ocr17, ocr49, ocr17-mnist, ocr49-mnist), the table lists the number of examples and the number of attributes; the numerical entries are not reproduced here.]
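To make the selection rule of the DeepBoost_γ variant of Appendix C concrete, here is a minimal Python sketch. The per-family search helpers passed in as `best_in_family` are assumptions of this illustration; the paper does not specify an implementation.

```python
def deepboost_gamma_select(best_in_family, D, gamma):
    """One round of base-hypothesis selection for DeepBoost_gamma.

    best_in_family -- list of callables, ordered by increasing family complexity;
                      each takes the current distribution D and returns
                      (hypothesis, weighted_error) for the best hypothesis in H_k.
    gamma          -- demanded edge: accept a family as soon as 1/2 - error > gamma.
    """
    overall_best, overall_err = None, 0.5
    for search_k in best_in_family:
        h, err = search_k(D)
        if err < overall_err:
            overall_best, overall_err = h, err
        if 0.5 - err > gamma:            # sufficient edge: stop at this family
            return h, err
    # No family achieved the demanded edge: fall back to the best over the union.
    return overall_best, overall_err
```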


More information

Computational and Statistical Learning Theory

Computational and Statistical Learning Theory Coputational and Statistical Learning Theory Proble sets 5 and 6 Due: Noveber th Please send your solutions to learning-subissions@ttic.edu Notations/Definitions Recall the definition of saple based Radeacher

More information

Computable Shell Decomposition Bounds

Computable Shell Decomposition Bounds Coputable Shell Decoposition Bounds John Langford TTI-Chicago jcl@cs.cu.edu David McAllester TTI-Chicago dac@autoreason.co Editor: Leslie Pack Kaelbling and David Cohn Abstract Haussler, Kearns, Seung

More information

CS Lecture 13. More Maximum Likelihood

CS Lecture 13. More Maximum Likelihood CS 6347 Lecture 13 More Maxiu Likelihood Recap Last tie: Introduction to axiu likelihood estiation MLE for Bayesian networks Optial CPTs correspond to epirical counts Today: MLE for CRFs 2 Maxiu Likelihood

More information

Grafting: Fast, Incremental Feature Selection by Gradient Descent in Function Space

Grafting: Fast, Incremental Feature Selection by Gradient Descent in Function Space Journal of Machine Learning Research 3 (2003) 1333-1356 Subitted 5/02; Published 3/03 Grafting: Fast, Increental Feature Selection by Gradient Descent in Function Space Sion Perkins Space and Reote Sensing

More information

1 Generalization bounds based on Rademacher complexity

1 Generalization bounds based on Rademacher complexity COS 5: Theoretical Machine Learning Lecturer: Rob Schapire Lecture #0 Scribe: Suqi Liu March 07, 08 Last tie we started proving this very general result about how quickly the epirical average converges

More information

Introduction to Machine Learning Lecture 13. Mehryar Mohri Courant Institute and Google Research

Introduction to Machine Learning Lecture 13. Mehryar Mohri Courant Institute and Google Research Introduction to Machine Learning Lecture 13 Mehryar Mohri Courant Institute and Google Research mohri@cims.nyu.edu Multi-Class Classification Mehryar Mohri - Introduction to Machine Learning page 2 Motivation

More information

Computable Shell Decomposition Bounds

Computable Shell Decomposition Bounds Journal of Machine Learning Research 5 (2004) 529-547 Subitted 1/03; Revised 8/03; Published 5/04 Coputable Shell Decoposition Bounds John Langford David McAllester Toyota Technology Institute at Chicago

More information

arxiv: v4 [cs.lg] 4 Apr 2016

arxiv: v4 [cs.lg] 4 Apr 2016 e-publication 3 3-5 Relative Deviation Learning Bounds and Generalization with Unbounded Loss Functions arxiv:35796v4 cslg 4 Apr 6 Corinna Cortes Google Research, 76 Ninth Avenue, New York, NY Spencer

More information

This article appeared in a journal published by Elsevier. The attached copy is furnished to the author for internal non-commercial research and

This article appeared in a journal published by Elsevier. The attached copy is furnished to the author for internal non-commercial research and This article appeared in a ournal published by Elsevier. The attached copy is furnished to the author for internal non-coercial research and education use, including for instruction at the authors institution

More information

Foundations of Machine Learning Multi-Class Classification. Mehryar Mohri Courant Institute and Google Research

Foundations of Machine Learning Multi-Class Classification. Mehryar Mohri Courant Institute and Google Research Foundations of Machine Learning Multi-Class Classification Mehryar Mohri Courant Institute and Google Research mohri@cims.nyu.edu Motivation Real-world problems often have multiple classes: text, speech,

More information

PAC-Bayesian Learning of Linear Classifiers

PAC-Bayesian Learning of Linear Classifiers Pascal Gerain Pascal.Gerain.@ulaval.ca Alexandre Lacasse Alexandre.Lacasse@ift.ulaval.ca François Laviolette Francois.Laviolette@ift.ulaval.ca Mario Marchand Mario.Marchand@ift.ulaval.ca Départeent d inforatique

More information

Algorithms for parallel processor scheduling with distinct due windows and unit-time jobs

Algorithms for parallel processor scheduling with distinct due windows and unit-time jobs BULLETIN OF THE POLISH ACADEMY OF SCIENCES TECHNICAL SCIENCES Vol. 57, No. 3, 2009 Algoriths for parallel processor scheduling with distinct due windows and unit-tie obs A. JANIAK 1, W.A. JANIAK 2, and

More information

Structured Prediction Theory Based on Factor Graph Complexity

Structured Prediction Theory Based on Factor Graph Complexity Structured Prediction Theory Based on Factor Graph Coplexity Corinna Cortes Google Research New York, NY 00 corinna@googleco Mehryar Mohri Courant Institute and Google New York, NY 00 ohri@cisnyuedu Vitaly

More information

e-companion ONLY AVAILABLE IN ELECTRONIC FORM

e-companion ONLY AVAILABLE IN ELECTRONIC FORM OPERATIONS RESEARCH doi 10.1287/opre.1070.0427ec pp. ec1 ec5 e-copanion ONLY AVAILABLE IN ELECTRONIC FORM infors 07 INFORMS Electronic Copanion A Learning Approach for Interactive Marketing to a Custoer

More information

Stability Bounds for Non-i.i.d. Processes

Stability Bounds for Non-i.i.d. Processes tability Bounds for Non-i.i.d. Processes Mehryar Mohri Courant Institute of Matheatical ciences and Google Research 25 Mercer treet New York, NY 002 ohri@cis.nyu.edu Afshin Rostaiadeh Departent of Coputer

More information

Learning with Rejection

Learning with Rejection Learning with Rejection Corinna Cortes 1, Giulia DeSalvo 2, and Mehryar Mohri 2,1 1 Google Research, 111 8th Avenue, New York, NY 2 Courant Institute of Matheatical Sciences, 251 Mercer Street, New York,

More information

Block designs and statistics

Block designs and statistics Bloc designs and statistics Notes for Math 447 May 3, 2011 The ain paraeters of a bloc design are nuber of varieties v, bloc size, nuber of blocs b. A design is built on a set of v eleents. Each eleent

More information

13.2 Fully Polynomial Randomized Approximation Scheme for Permanent of Random 0-1 Matrices

13.2 Fully Polynomial Randomized Approximation Scheme for Permanent of Random 0-1 Matrices CS71 Randoness & Coputation Spring 018 Instructor: Alistair Sinclair Lecture 13: February 7 Disclaier: These notes have not been subjected to the usual scrutiny accorded to foral publications. They ay

More information

Machine Learning Basics: Estimators, Bias and Variance

Machine Learning Basics: Estimators, Bias and Variance Machine Learning Basics: Estiators, Bias and Variance Sargur N. srihari@cedar.buffalo.edu This is part of lecture slides on Deep Learning: http://www.cedar.buffalo.edu/~srihari/cse676 1 Topics in Basics

More information

Randomized Recovery for Boolean Compressed Sensing

Randomized Recovery for Boolean Compressed Sensing Randoized Recovery for Boolean Copressed Sensing Mitra Fatei and Martin Vetterli Laboratory of Audiovisual Counication École Polytechnique Fédéral de Lausanne (EPFL) Eail: {itra.fatei, artin.vetterli}@epfl.ch

More information

New Bounds for Learning Intervals with Implications for Semi-Supervised Learning

New Bounds for Learning Intervals with Implications for Semi-Supervised Learning JMLR: Workshop and Conference Proceedings vol (1) 1 15 New Bounds for Learning Intervals with Iplications for Sei-Supervised Learning David P. Helbold dph@soe.ucsc.edu Departent of Coputer Science, University

More information

Support Vector Machines. Machine Learning Series Jerry Jeychandra Blohm Lab

Support Vector Machines. Machine Learning Series Jerry Jeychandra Blohm Lab Support Vector Machines Machine Learning Series Jerry Jeychandra Bloh Lab Outline Main goal: To understand how support vector achines (SVMs) perfor optial classification for labelled data sets, also a

More information

A MESHSIZE BOOSTING ALGORITHM IN KERNEL DENSITY ESTIMATION

A MESHSIZE BOOSTING ALGORITHM IN KERNEL DENSITY ESTIMATION A eshsize boosting algorith in kernel density estiation A MESHSIZE BOOSTING ALGORITHM IN KERNEL DENSITY ESTIMATION C.C. Ishiekwene, S.M. Ogbonwan and J.E. Osewenkhae Departent of Matheatics, University

More information

Domain-Adversarial Neural Networks

Domain-Adversarial Neural Networks Doain-Adversarial Neural Networks Hana Ajakan, Pascal Gerain 2, Hugo Larochelle 3, François Laviolette 2, Mario Marchand 2,2 Départeent d inforatique et de génie logiciel, Université Laval, Québec, Canada

More information

Foundations of Machine Learning Lecture 9. Mehryar Mohri Courant Institute and Google Research

Foundations of Machine Learning Lecture 9. Mehryar Mohri Courant Institute and Google Research Foundations of Machine Learning Lecture 9 Mehryar Mohri Courant Institute and Google Research mohri@cims.nyu.edu Multi-Class Classification page 2 Motivation Real-world problems often have multiple classes:

More information

On the Impact of Kernel Approximation on Learning Accuracy

On the Impact of Kernel Approximation on Learning Accuracy On the Ipact of Kernel Approxiation on Learning Accuracy Corinna Cortes Mehryar Mohri Aeet Talwalkar Google Research New York, NY corinna@google.co Courant Institute and Google Research New York, NY ohri@cs.nyu.edu

More information

COS 424: Interacting with Data. Written Exercises

COS 424: Interacting with Data. Written Exercises COS 424: Interacting with Data Hoework #4 Spring 2007 Regression Due: Wednesday, April 18 Written Exercises See the course website for iportant inforation about collaboration and late policies, as well

More information

Consistent Multiclass Algorithms for Complex Performance Measures. Supplementary Material

Consistent Multiclass Algorithms for Complex Performance Measures. Supplementary Material Consistent Multiclass Algoriths for Coplex Perforance Measures Suppleentary Material Notations. Let λ be the base easure over n given by the unifor rando variable (say U over n. Hence, for all easurable

More information

A Theoretical Analysis of a Warm Start Technique

A Theoretical Analysis of a Warm Start Technique A Theoretical Analysis of a War Start Technique Martin A. Zinkevich Yahoo! Labs 701 First Avenue Sunnyvale, CA Abstract Batch gradient descent looks at every data point for every step, which is wasteful

More information

Robustness and Regularization of Support Vector Machines

Robustness and Regularization of Support Vector Machines Robustness and Regularization of Support Vector Machines Huan Xu ECE, McGill University Montreal, QC, Canada xuhuan@ci.cgill.ca Constantine Caraanis ECE, The University of Texas at Austin Austin, TX, USA

More information

Learning with Rejection

Learning with Rejection Learning with Rejection Corinna Cortes 1, Giulia DeSalvo 2, and Mehryar Mohri 1,2 1 Google Research, 111 8th Avenue, New York, NY 2 Courant Institute of Matheatical Sciences, 251 Mercer Street, New York,

More information

Learnability and Stability in the General Learning Setting

Learnability and Stability in the General Learning Setting Learnability and Stability in the General Learning Setting Shai Shalev-Shwartz TTI-Chicago shai@tti-c.org Ohad Shair The Hebrew University ohadsh@cs.huji.ac.il Nathan Srebro TTI-Chicago nati@uchicago.edu

More information

EMPIRICAL COMPLEXITY ANALYSIS OF A MILP-APPROACH FOR OPTIMIZATION OF HYBRID SYSTEMS

EMPIRICAL COMPLEXITY ANALYSIS OF A MILP-APPROACH FOR OPTIMIZATION OF HYBRID SYSTEMS EMPIRICAL COMPLEXITY ANALYSIS OF A MILP-APPROACH FOR OPTIMIZATION OF HYBRID SYSTEMS Jochen Till, Sebastian Engell, Sebastian Panek, and Olaf Stursberg Process Control Lab (CT-AST), University of Dortund,

More information

Nyström Method vs Random Fourier Features: A Theoretical and Empirical Comparison

Nyström Method vs Random Fourier Features: A Theoretical and Empirical Comparison yströ Method vs : A Theoretical and Epirical Coparison Tianbao Yang, Yu-Feng Li, Mehrdad Mahdavi, Rong Jin, Zhi-Hua Zhou Machine Learning Lab, GE Global Research, San Raon, CA 94583 Michigan State University,

More information

Using EM To Estimate A Probablity Density With A Mixture Of Gaussians

Using EM To Estimate A Probablity Density With A Mixture Of Gaussians Using EM To Estiate A Probablity Density With A Mixture Of Gaussians Aaron A. D Souza adsouza@usc.edu Introduction The proble we are trying to address in this note is siple. Given a set of data points

More information

Boosting with Abstention

Boosting with Abstention Boosting with Abstention Corinna Cortes Google Research New York, NY 00 corinna@google.co Giulia DeSalvo Courant Institute New York, NY 00 desalvo@cis.nyu.edu Mehryar Mohri Courant Institute and Google

More information

Polygonal Designs: Existence and Construction

Polygonal Designs: Existence and Construction Polygonal Designs: Existence and Construction John Hegean Departent of Matheatics, Stanford University, Stanford, CA 9405 Jeff Langford Departent of Matheatics, Drake University, Des Moines, IA 5011 G

More information

E0 370 Statistical Learning Theory Lecture 5 (Aug 25, 2011)

E0 370 Statistical Learning Theory Lecture 5 (Aug 25, 2011) E0 370 Statistical Learning Theory Lecture 5 Aug 5, 0 Covering Nubers, Pseudo-Diension, and Fat-Shattering Diension Lecturer: Shivani Agarwal Scribe: Shivani Agarwal Introduction So far we have seen how

More information

Stochastic Subgradient Methods

Stochastic Subgradient Methods Stochastic Subgradient Methods Lingjie Weng Yutian Chen Bren School of Inforation and Coputer Science University of California, Irvine {wengl, yutianc}@ics.uci.edu Abstract Stochastic subgradient ethods

More information

A Simple Regression Problem

A Simple Regression Problem A Siple Regression Proble R. M. Castro March 23, 2 In this brief note a siple regression proble will be introduced, illustrating clearly the bias-variance tradeoff. Let Y i f(x i ) + W i, i,..., n, where

More information

Course Notes for EE227C (Spring 2018): Convex Optimization and Approximation

Course Notes for EE227C (Spring 2018): Convex Optimization and Approximation Course Notes for EE7C (Spring 018: Convex Optiization and Approxiation Instructor: Moritz Hardt Eail: hardt+ee7c@berkeley.edu Graduate Instructor: Max Sichowitz Eail: sichow+ee7c@berkeley.edu October 15,

More information

A Note on Scheduling Tall/Small Multiprocessor Tasks with Unit Processing Time to Minimize Maximum Tardiness

A Note on Scheduling Tall/Small Multiprocessor Tasks with Unit Processing Time to Minimize Maximum Tardiness A Note on Scheduling Tall/Sall Multiprocessor Tasks with Unit Processing Tie to Miniize Maxiu Tardiness Philippe Baptiste and Baruch Schieber IBM T.J. Watson Research Center P.O. Box 218, Yorktown Heights,

More information

Geometrical intuition behind the dual problem

Geometrical intuition behind the dual problem Based on: Geoetrical intuition behind the dual proble KP Bennett, EJ Bredensteiner, Duality and Geoetry in SVM Classifiers, Proceedings of the International Conference on Machine Learning, 2000 1 Geoetrical

More information

E. Alpaydın AERFAISS

E. Alpaydın AERFAISS E. Alpaydın AERFAISS 00 Introduction Questions: Is the error rate of y classifier less than %? Is k-nn ore accurate than MLP? Does having PCA before iprove accuracy? Which kernel leads to highest accuracy

More information

Prediction by random-walk perturbation

Prediction by random-walk perturbation Prediction by rando-walk perturbation Luc Devroye School of Coputer Science McGill University Gábor Lugosi ICREA and Departent of Econoics Universitat Popeu Fabra lucdevroye@gail.co gabor.lugosi@gail.co

More information

Boosting with Abstention

Boosting with Abstention Boosting with Abstention Corinna Cortes Google Research New York, NY corinna@google.co Giulia DeSalvo Courant Institute New York, NY desalvo@cis.nyu.edu Mehryar Mohri Courant Institute and Google New York,

More information

Fast Montgomery-like Square Root Computation over GF(2 m ) for All Trinomials

Fast Montgomery-like Square Root Computation over GF(2 m ) for All Trinomials Fast Montgoery-like Square Root Coputation over GF( ) for All Trinoials Yin Li a, Yu Zhang a, a Departent of Coputer Science and Technology, Xinyang Noral University, Henan, P.R.China Abstract This letter

More information

Ch 12: Variations on Backpropagation

Ch 12: Variations on Backpropagation Ch 2: Variations on Backpropagation The basic backpropagation algorith is too slow for ost practical applications. It ay take days or weeks of coputer tie. We deonstrate why the backpropagation algorith

More information

Learning with Rejection

Learning with Rejection Learning with Rejection Corinna Cortes 1, Giulia DeSalvo 2, and Mehryar Mohri 2,1 1 Google Research, 111 8th Avenue, New York, NY 2 Courant Institute of Mathematical Sciences, 251 Mercer Street, New York,

More information

ADANET: adaptive learning of neural networks

ADANET: adaptive learning of neural networks ADANET: adaptive learning of neural networks Joint work with Corinna Cortes (Google Research) Javier Gonzalo (Google Research) Vitaly Kuznetsov (Google Research) Scott Yang (Courant Institute) MEHRYAR

More information

Probability Distributions

Probability Distributions Probability Distributions In Chapter, we ephasized the central role played by probability theory in the solution of pattern recognition probles. We turn now to an exploration of soe particular exaples

More information

Soft-margin SVM can address linearly separable problems with outliers

Soft-margin SVM can address linearly separable problems with outliers Non-linear Support Vector Machines Non-linearly separable probles Hard-argin SVM can address linearly separable probles Soft-argin SVM can address linearly separable probles with outliers Non-linearly

More information

Probabilistic Machine Learning

Probabilistic Machine Learning Probabilistic Machine Learning by Prof. Seungchul Lee isystes Design Lab http://isystes.unist.ac.kr/ UNIST Table of Contents I.. Probabilistic Linear Regression I... Maxiu Likelihood Solution II... Maxiu-a-Posteriori

More information

1 Proof of learning bounds

1 Proof of learning bounds COS 511: Theoretical Machine Learning Lecturer: Rob Schapire Lecture #4 Scribe: Akshay Mittal February 13, 2013 1 Proof of learning bounds For intuition of the following theore, suppose there exists a

More information

A Note on the Applied Use of MDL Approximations

A Note on the Applied Use of MDL Approximations A Note on the Applied Use of MDL Approxiations Daniel J. Navarro Departent of Psychology Ohio State University Abstract An applied proble is discussed in which two nested psychological odels of retention

More information

Supplementary to Learning Discriminative Bayesian Networks from High-dimensional Continuous Neuroimaging Data

Supplementary to Learning Discriminative Bayesian Networks from High-dimensional Continuous Neuroimaging Data Suppleentary to Learning Discriinative Bayesian Networks fro High-diensional Continuous Neuroiaging Data Luping Zhou, Lei Wang, Lingqiao Liu, Philip Ogunbona, and Dinggang Shen Proposition. Given a sparse

More information

Multiple Instance Learning with Query Bags

Multiple Instance Learning with Query Bags Multiple Instance Learning with Query Bags Boris Babenko UC San Diego bbabenko@cs.ucsd.edu Piotr Dollár California Institute of Technology pdollar@caltech.edu Serge Belongie UC San Diego sjb@cs.ucsd.edu

More information

Quantum algorithms (CO 781, Winter 2008) Prof. Andrew Childs, University of Waterloo LECTURE 15: Unstructured search and spatial search

Quantum algorithms (CO 781, Winter 2008) Prof. Andrew Childs, University of Waterloo LECTURE 15: Unstructured search and spatial search Quantu algoriths (CO 781, Winter 2008) Prof Andrew Childs, University of Waterloo LECTURE 15: Unstructured search and spatial search ow we begin to discuss applications of quantu walks to search algoriths

More information

On the Use of A Priori Information for Sparse Signal Approximations

On the Use of A Priori Information for Sparse Signal Approximations ITS TECHNICAL REPORT NO. 3/4 On the Use of A Priori Inforation for Sparse Signal Approxiations Oscar Divorra Escoda, Lorenzo Granai and Pierre Vandergheynst Signal Processing Institute ITS) Ecole Polytechnique

More information

Lower Bounds for Quantized Matrix Completion

Lower Bounds for Quantized Matrix Completion Lower Bounds for Quantized Matrix Copletion Mary Wootters and Yaniv Plan Departent of Matheatics University of Michigan Ann Arbor, MI Eail: wootters, yplan}@uich.edu Mark A. Davenport School of Elec. &

More information

Kernel-Based Nonparametric Anomaly Detection

Kernel-Based Nonparametric Anomaly Detection Kernel-Based Nonparaetric Anoaly Detection Shaofeng Zou Dept of EECS Syracuse University Eail: szou@syr.edu Yingbin Liang Dept of EECS Syracuse University Eail: yliang6@syr.edu H. Vincent Poor Dept of

More information

An improved self-adaptive harmony search algorithm for joint replenishment problems

An improved self-adaptive harmony search algorithm for joint replenishment problems An iproved self-adaptive harony search algorith for joint replenishent probles Lin Wang School of Manageent, Huazhong University of Science & Technology zhoulearner@gail.co Xiaojian Zhou School of Manageent,

More information

Stability Bounds for Stationary ϕ-mixing and β-mixing Processes

Stability Bounds for Stationary ϕ-mixing and β-mixing Processes Journal of Machine Learning Research (200) 789-84 Subitted /08; Revised /0; Published 2/0 Stability Bounds for Stationary ϕ-ixing and β-ixing Processes Mehryar Mohri Courant Institute of Matheatical Sciences

More information

Compression and Predictive Distributions for Large Alphabet i.i.d and Markov models

Compression and Predictive Distributions for Large Alphabet i.i.d and Markov models 2014 IEEE International Syposiu on Inforation Theory Copression and Predictive Distributions for Large Alphabet i.i.d and Markov odels Xiao Yang Departent of Statistics Yale University New Haven, CT, 06511

More information

Perceptron Mistake Bounds

Perceptron Mistake Bounds Perceptron Mistake Bounds Mehryar Mohri, and Afshin Rostamizadeh Google Research Courant Institute of Mathematical Sciences Abstract. We present a brief survey of existing mistake bounds and introduce

More information

1 Identical Parallel Machines

1 Identical Parallel Machines FB3: Matheatik/Inforatik Dr. Syaantak Das Winter 2017/18 Optiizing under Uncertainty Lecture Notes 3: Scheduling to Miniize Makespan In any standard scheduling proble, we are given a set of jobs J = {j

More information