Rademacher Complexity Margin Bounds for Learning with a Large Number of Classes


Vitaly Kuznetsov, Courant Institute of Mathematical Sciences, 251 Mercer Street, New York, NY 10012. VITALY@CIMS.NYU.EDU
Mehryar Mohri, Courant Institute and Google Research, 251 Mercer Street, New York, NY 10012. MOHRI@CS.NYU.EDU
Umar Syed, Google Research, 76 Ninth Avenue, New York, NY 10011. USYED@GOOGLE.COM

Abstract

This paper presents improved Rademacher complexity margin bounds that scale linearly with the number of classes, as opposed to the quadratic dependence of existing Rademacher complexity margin-based learning guarantees. We further use this result to prove a novel generalization bound for multi-class classifier ensembles that depends only on the Rademacher complexity of the hypothesis classes to which the classifiers in the ensemble belong.

1. Introduction

Multi-class classification is one of the central problems in machine learning. Given a sample S = ((x₁, y₁), ..., (x_m, y_m)) drawn i.i.d. from some unknown distribution D over X × {1, ..., c}, the objective of the learner is to find a hypothesis h, selected out of some hypothesis class H, that admits a small expected loss E_{(X,Y)∼D}[L(h(X), Y)], where L is a loss function, typically chosen to be the zero-one loss defined by L(y′, y) = 1_{y′≠y}.

A common approach to multi-class classification consists of learning a scoring function f: X × Y → R that assigns a score f(x, y) to the pair made of an input point x ∈ X and a candidate label y. The label predicted for x is the one with the highest score: h(x) = argmax_{y∈Y} f(x, y). The difference between the score of the correct label and that of the runner-up is the margin achieved for that example, and the fraction of sample points with margin less than a specified constant ρ is the empirical margin loss of h. These quantities play a critical role in an algorithm-agnostic analysis of generalization in the multi-class setting based on data-dependent complexity measures such as Rademacher complexity. In particular, (Koltchinskii & Panchenko, 2002) showed that, with high probability, uniformly over the hypothesis set,

R(h) ≤ R_{S,ρ}(h) + (2c²/ρ) R_m(Π₁(G)) + O(1/√m),

where R(h) is the generalization error of hypothesis h, R_{S,ρ}(h) its empirical margin loss, and R_m(Π₁(G)) the Rademacher complexity of the family of functions Π₁(G) associated to the hypothesis set, which is defined precisely below.

This bound is pessimistic and suggests that learning with an extremely large number of classes may not be possible. Indeed, it is well known that for certain classes of commonly used hypotheses, including linear and kernel-based ones, R_m(Π₁(G)) = O(1/√m). Therefore, for learning to occur, m will need to be of the order of at least c^{4+ε}/ρ, for some ε > 0. In some modern machine learning tasks, such as speech recognition and image classification, c is often greater than 10⁴. The bound above thus suggests that, even for extremely favorable margin values of the order of 10³, the sample required for learning has to be of the order of at least 10¹³ points. However, empirical results in speech recognition and image categorization suggest that it is possible to learn with far fewer samples. The bound is also pessimistic in terms of computational complexity, since storing and processing 10¹³ sample points may not be feasible. In this paper, we show that this bound can be improved to scale linearly with the number of classes.

(Proceedings of the 32nd International Conference on Machine Learning, Lille, France, 2015. JMLR: W&CP volume 37. Copyright 2015 by the authors.)
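To make the difference between the two dependencies concrete, the short Python sketch below (not part of the paper) compares the dominant complexity term of the existing bound, (2c²/ρ) R_m(Π₁(G)), with that of the bound established in Section 2, (4c/ρ) R_m(Π₁(G)), under the illustrative assumption R_m(Π₁(G)) ≈ 1/√m; the specific values of c, ρ, and m are hypothetical.

```python
import math

def complexity_terms(c, rho, m):
    """Dominant terms of the quadratic (2c^2/rho) and linear (4c/rho) margin bounds,
    under the illustrative assumption R_m(Pi_1(G)) ~ 1/sqrt(m)."""
    rad = 1.0 / math.sqrt(m)            # assumed Rademacher complexity of Pi_1(G)
    quadratic = 2 * c ** 2 / rho * rad  # existing (Koltchinskii & Panchenko)-style term
    linear = 4 * c / rho * rad          # Theorem 2-style term
    return quadratic, linear

if __name__ == "__main__":
    c, rho = 10 ** 4, 1.0               # hypothetical: 10^4 classes, margin 1
    for m in (10 ** 6, 10 ** 9, 10 ** 12):
        q, lin = complexity_terms(c, rho, m)
        print(f"m = {m:.0e}: quadratic term {q:.3g}, linear term {lin:.3g}")
```

For instance, with c = 10⁴ and m = 10⁹, the quadratic term is about 6·10³ while the linear term is about 1.3, which illustrates the gap the improved analysis is meant to close.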

We further consider convex ensembles of classification models. Ensemble methods are general techniques in machine learning for combining several multi-class classification hypotheses to further improve accuracy. Learning a linear combination of base classifiers, or a classifier ensemble, is one of the oldest and most powerful ideas in machine learning. Boosting (Freund & Schapire, 1997), also known as forward stagewise additive modeling (Friedman et al., 1998), is a widely used meta-algorithm for ensemble learning. In the boosting approach, the ensemble's misclassification error is replaced by a convex upper bound, called the surrogate loss. The algorithm greedily minimizes the surrogate loss by augmenting the ensemble with a classifier (or adjusting the weight of a classifier already in the ensemble) at each iteration. One of the main advantages of boosting is that, because it is a stagewise procedure, one can efficiently learn a classifier ensemble in which each classifier belongs to a large (and potentially infinite) base hypothesis class, provided that one has an efficient algorithm for learning good base classifiers. For example, decision trees are commonly used as the base hypothesis class. In contrast, generalization bounds for classifier ensembles tend to increase with the complexity of the base hypothesis class (Schapire et al., 1997), and indeed boosting has been observed to overfit in practice (Grove & Schuurmans, 1998; Schapire, 1999; Dietterich, 2000; Rätsch et al., 2001b).

One way to address overfitting in a boosted ensemble is to regularize the weights of the classifiers. Standard regularization penalizes all the weights in the ensemble equally (Rätsch et al., 2001a; Duchi & Singer, 2009), but in some cases it seems they should be penalized unequally. For example, in an ensemble of decision trees, deeper decision trees should have a larger regularization penalty than shallower ones. Based on this idea, we present a novel generalization guarantee for multi-class classifier ensembles that depends only on the Rademacher complexity of the hypothesis classes to which the classifiers in the ensemble belong. (Cortes et al., 2014) developed this idea in an algorithm called DeepBoost, a boosting algorithm in which the decision at each iteration of which classifier to add to the ensemble, and the weight assigned to that classifier, depends in part on the complexity of the hypothesis class to which it belongs. One interpretation of DeepBoost is that it applies the principle of structural risk minimization to each iteration of boosting. (Kuznetsov et al., 2014) extended these ideas to the multi-class setting.

The rest of this paper is organized as follows. In Section 2, we present and prove our improved Rademacher complexity margin bounds that scale linearly with the number of classes. In Section 3, we use this result to prove a novel generalization bound for multi-class classifier ensembles that depends only on the Rademacher complexity of the hypothesis classes to which the classifiers in the ensemble belong. We conclude with some final remarks in Section 4.

2. Multi-class margin bounds

In this section, we present our improved data-dependent learning bound in the multi-class setting. Let X denote the input space. We denote by Y = {1, ..., c} a set of c classes, which, for convenience, we index by the integers in [1, c]. The label associated by a hypothesis f: X × Y → R to x ∈ X is given by argmax_{y∈Y} f(x, y). The margin ρ_f(x, y) of the function f for a labeled example (x, y) ∈ X × Y is defined by

ρ_f(x, y) = f(x, y) − max_{y′≠y} f(x, y′).    (1)

Thus, f misclassifies (x, y) iff ρ_f(x, y) ≤ 0.
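As a concrete reading of definition (1), the following sketch (illustrative only; the score vector is a hypothetical stand-in for f(x, ·)) computes the predicted label and the margin ρ_f(x, y) from the c class scores of a single example.

```python
import numpy as np

def predict_and_margin(scores, y):
    """Given scores[k] = f(x, k) for the c classes and the true label y, return
    (predicted label, margin rho_f(x, y) = f(x, y) - max_{y' != y} f(x, y'))."""
    scores = np.asarray(scores, dtype=float)
    y_pred = int(np.argmax(scores))            # h(x) = argmax_y f(x, y)
    runner_up = np.max(np.delete(scores, y))   # best score among labels y' != y
    margin = scores[y] - runner_up
    return y_pred, margin

# Hypothetical scores for a 4-class problem with true label y = 2.
y_pred, margin = predict_and_margin([0.1, 0.4, 0.9, 0.3], y=2)
print(y_pred, margin)  # 2, 0.5: correctly classified with margin 0.5
```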
We assume that training and test points are drawn i.i.d. according to some distribution D over X × Y, and denote by S = ((x₁, y₁), ..., (x_m, y_m)) a training sample of size m drawn according to D. For any ρ > 0, the generalization error R(f), the ρ-margin error R_ρ(f), and the empirical margin error R_{S,ρ}(f) are defined as follows:

R(f) = E_{(x,y)∼D}[1_{ρ_f(x,y) ≤ 0}],
R_ρ(f) = E_{(x,y)∼D}[1_{ρ_f(x,y) ≤ ρ}],
R_{S,ρ}(f) = E_{(x,y)∼S}[1_{ρ_f(x,y) ≤ ρ}],

where the notation (x, y) ∼ S indicates that (x, y) is drawn according to the empirical distribution defined by S. For any family of hypotheses G mapping X × Y to R, we define Π₁(G) by

Π₁(G) = {x ↦ h(x, y): y ∈ Y, h ∈ G}.    (2)
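The two sample-dependent quantities that drive the bounds below are the empirical margin loss R_{S,ρ}(g) and the Rademacher complexity of Π₁(G). As a rough illustration (not part of the paper), the sketch below computes the first directly from a list of margins and estimates the empirical counterpart of the second by Monte Carlo over random sign vectors, for a small finite hypothesis set encoded by a hypothetical array scores[g, i, y] = g(x_i, y).

```python
import numpy as np

def empirical_margin_loss(margins, rho):
    """Empirical rho-margin loss R_{S,rho}: fraction of points with margin <= rho."""
    return float(np.mean(np.asarray(margins) <= rho))

def empirical_rademacher_pi1(scores, n_draws=2000, seed=0):
    """Monte Carlo estimate of the empirical Rademacher complexity of
    Pi_1(G) = {x -> g(x, y): y in Y, g in G} for a finite hypothesis set G.
    scores[g, i, y] holds g(x_i, y); shape (num_hypotheses, m, c)."""
    rng = np.random.default_rng(seed)
    num_g, m, c = scores.shape
    rows = scores.transpose(0, 2, 1).reshape(num_g * c, m)  # one row per pair (g, y)
    total = 0.0
    for _ in range(n_draws):
        sigma = rng.choice([-1.0, 1.0], size=m)             # Rademacher variables
        total += np.max(rows @ sigma) / m                   # sup over Pi_1(G)
    return total / n_draws

# Hypothetical toy data: 5 hypotheses, m = 50 points, c = 10 classes, scores in [0, 1].
scores = np.random.default_rng(1).uniform(0.0, 1.0, size=(5, 50, 10))
print(empirical_rademacher_pi1(scores))
print(empirical_margin_loss([0.5, -0.1, 0.3, 0.8], rho=0.25))  # -> 0.25
```

For a finite class such as this one, Massart's lemma bounds the estimated quantity by O(√(log(|G| c)/m)), which is the 1/√m-type behavior assumed in the numerical illustration of the introduction.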

The following result, due to (Koltchinskii & Panchenko, 2002), is a well-known margin bound for the multi-class setting.

Theorem 1. Let G be a family of hypotheses mapping X × Y to R, with Y = {1, ..., c}. Fix ρ > 0. Then, for any δ > 0, with probability at least 1 − δ, the following bound holds for all g ∈ G:

R(g) ≤ R_{S,ρ}(g) + (2c²/ρ) R_m(Π₁(G)) + √(log(1/δ)/(2m)),

where Π₁(G) = {x ↦ g(x, y): y ∈ Y, g ∈ G}.

As discussed in the introduction, the bound of Theorem 1 is pessimistic and suggests that learning with an extremely large number of classes may not be possible. The following theorem presents our margin learning guarantee for multi-class classification with a large number of classes; it scales linearly with the number of classes, as opposed to the quadratic dependency of Theorem 1.

Theorem 2. Let G be a family of hypotheses mapping X × Y to R, with Y = {1, ..., c}. Fix ρ > 0. Then, for any δ > 0, with probability at least 1 − δ, the following bound holds for all g ∈ G:

R(g) ≤ R_{S,ρ}(g) + (4c/ρ) R_m(Π₁(G)) + √(log(1/δ)/(2m)),

where Π₁(G) = {x ↦ g(x, y): y ∈ Y, g ∈ G}.

Note that the bound of Theorem 2 is strictly better than that of Theorem 1 for all c > 2. The bound of Theorem 2 is more optimistic both in terms of the computational resources and the statistical hardness of the problem. To the best of our knowledge, it is an open problem whether the dependence on the number of classes can be further improved in general, that is, for arbitrary hypothesis sets.

Proof. We will need the following definitions for this proof:

ρ_g(x, y) = min_{y′≠y} (g(x, y) − g(x, y′)),
ρ_{θ,g}(x, y) = min_{y′} (g(x, y) − g(x, y′) + θ·1_{y′=y}),

where θ > 0 is an arbitrary constant. Observe that 1_{ρ_g(x,y) ≤ 0} ≤ 1_{ρ_{θ,g}(x,y) ≤ 0}. To verify this claim, it suffices to check that if ρ_g(x, y) ≤ 0 then ρ_{θ,g}(x, y) ≤ 0. Indeed, this follows from the bound

ρ_{θ,g}(x, y) = min_{y′} (g(x, y) − g(x, y′) + θ·1_{y′=y}) ≤ min_{y′≠y} (g(x, y) − g(x, y′) + θ·1_{y′=y}) = min_{y′≠y} (g(x, y) − g(x, y′)) = ρ_g(x, y),

where the inequality follows from taking the minimum over a smaller set.

Let Φ_ρ be the margin loss function defined for all u ∈ R by Φ_ρ(u) = 1_{u ≤ 0} + (1 − u/ρ)·1_{0 < u ≤ ρ}. We also let G̃ = {(x, y) ↦ ρ_{θ,g}(x, y): g ∈ G} and G̃′ = {Φ_ρ ∘ g̃: g̃ ∈ G̃}. By the standard Rademacher complexity bound (Koltchinskii & Panchenko, 2002; Mohri et al., 2012), for any δ > 0, with probability at least 1 − δ, the following holds for all g ∈ G:

R(g) ≤ (1/m) Σ_{i=1}^m Φ_ρ(ρ_{θ,g}(x_i, y_i)) + 2 R_m(G̃′) + √(log(1/δ)/(2m)).

Fixing θ = 2ρ, we observe that Φ_ρ(ρ_{θ,g}(x_i, y_i)) = Φ_ρ(ρ_g(x_i, y_i)) ≤ 1_{ρ_g(x_i, y_i) ≤ ρ}. Indeed, either ρ_{θ,g}(x_i, y_i) = ρ_g(x_i, y_i), or ρ_{θ,g}(x_i, y_i) = 2ρ ≤ ρ_g(x_i, y_i), which implies the desired result. Talagrand's lemma (Ledoux & Talagrand, 1991; Mohri et al., 2012) yields R_m(G̃′) ≤ (1/ρ) R_m(G̃), since Φ_ρ is a (1/ρ)-Lipschitz function. Therefore, for any δ > 0, with probability at least 1 − δ, for all g ∈ G,

R(g) ≤ R_{S,ρ}(g) + (2/ρ) R_m(G̃) + √(log(1/δ)/(2m)),

and to complete the proof it suffices to show that R_m(G̃) ≤ 2c R_m(Π₁(G)). Here, R_m(G̃) can be upper-bounded as follows:

R_m(G̃) = (1/m) E_σ[sup_{g∈G} Σ_{i=1}^m σ_i (g(x_i, y_i) − max_y (g(x_i, y) − 2ρ·1_{y=y_i}))]
≤ (1/m) E_σ[sup_{g∈G} Σ_{i=1}^m σ_i g(x_i, y_i)] + (1/m) E_σ[sup_{g∈G} Σ_{i=1}^m σ_i max_y (g(x_i, y) − 2ρ·1_{y=y_i})].

We first bound the first term above. Observe that

E_σ[sup_{g∈G} Σ_{i=1}^m σ_i g(x_i, y_i)] = E_σ[sup_{g∈G} Σ_{i=1}^m σ_i Σ_{y∈Y} g(x_i, y) 1_{y_i=y}]
≤ Σ_{y∈Y} E_σ[sup_{g∈G} Σ_{i=1}^m σ_i g(x_i, y) 1_{y_i=y}]
= Σ_{y∈Y} E_σ[sup_{g∈G} Σ_{i=1}^m σ_i g(x_i, y) (ε_i + 1)/2],

where ε_i = 2·1_{y_i=y} − 1. Since ε_i ∈ {−1, +1}, σ_i and σ_i ε_i admit the same distribution and, for any y ∈ Y, each of the terms of the right-hand side can be bounded as follows:

(1/m) E_σ[sup_{g∈G} Σ_{i=1}^m σ_i g(x_i, y) (ε_i + 1)/2] ≤ (1/(2m)) E_σ[sup_{g∈G} Σ_{i=1}^m σ_i ε_i g(x_i, y)] + (1/(2m)) E_σ[sup_{g∈G} Σ_{i=1}^m σ_i g(x_i, y)] ≤ R_m(Π₁(G)).

Thus, we can write (1/m) E_σ[sup_{g∈G} Σ_{i=1}^m σ_i g(x_i, y_i)] ≤ c R_m(Π₁(G)). To bound the second term, we first apply Lemma 8.1 of (Mohri et al., 2012), which immediately yields

(1/m) E_σ[sup_{g∈G} Σ_{i=1}^m σ_i max_y (g(x_i, y) − 2ρ·1_{y=y_i})] ≤ Σ_{y∈Y} (1/m) E_σ[sup_{g∈G} Σ_{i=1}^m σ_i (g(x_i, y) − 2ρ·1_{y=y_i})],

and, since Rademacher variables are mean zero, we observe that

(1/m) E_σ[sup_{g∈G} Σ_{i=1}^m σ_i (g(x_i, y) − 2ρ·1_{y=y_i})] = (1/m) E_σ[sup_{g∈G} Σ_{i=1}^m σ_i g(x_i, y)] − (2ρ/m) E_σ[Σ_{i=1}^m σ_i 1_{y=y_i}] = (1/m) E_σ[sup_{g∈G} Σ_{i=1}^m σ_i g(x_i, y)] ≤ R_m(Π₁(G)),

which completes the proof.

3. Multi-class data-dependent learning guarantee for convex ensembles

We consider p families H₁, ..., H_p of functions mapping from X × Y to [0, 1] and the ensemble family F = conv(∪_{k=1}^p H_k), that is, the family of functions f of the form f = Σ_{t=1}^T α_t h_t, where α = (α₁, ..., α_T) is in the simplex and where, for each t ∈ [1, T], h_t is in H_{k_t} for some k_t ∈ [1, p]. The following theorem gives a margin-based Rademacher complexity bound for learning with ensembles of base classifiers with multiple hypothesis sets. As with other Rademacher complexity learning guarantees, our bound is data-dependent, which is an important and favorable characteristic of our results.

Theorem 3. Assume p > 1 and let H₁, ..., H_p be p families of functions mapping from X × Y to [0, 1]. Fix ρ > 0. Then, for any δ > 0, with probability at least 1 − δ over the choice of a sample S of size m drawn i.i.d. according to D, the following inequality holds for all f = Σ_{t=1}^T α_t h_t ∈ F:

R(f) ≤ R_{S,ρ}(f) + (8c/ρ) Σ_{t=1}^T α_t R_m(Π₁(H_{k_t})) + (2/ρ) √(log p / m) + √(⌈(4/ρ²) log(ρ²c²m/(4 log p))⌉ (log p)/m + log(2/δ)/(2m)).

Thus, R(f) ≤ R_{S,ρ}(f) + (8c/ρ) Σ_{t=1}^T α_t R_m(Π₁(H_{k_t})) + O(√((log p)/(ρ²m) · log(ρ²c²m/(4 log p)))).

Before we present the proof of this result, we discuss some of its consequences. For p = 1, that is, for the special case of a single hypothesis set, this bound reduces to a bound of the same form as that of Theorem 2, modulo constant factors. However, the main remarkable benefit of this learning bound is that its complexity term admits an explicit dependency on the mixture coefficients α_t: it is a weighted average of Rademacher complexities with mixture weights α_t, t ∈ [1, T]. Thus, the second term of the bound suggests that, while some of the hypothesis sets H_k used for learning could have a large Rademacher complexity, this may not negatively affect generalization if the corresponding total mixture weight (the sum of the α_t's corresponding to that hypothesis set) is relatively small. Using such potentially complex families could help achieve a better margin on the training sample.

The theorem cannot be proven via the standard Rademacher complexity analysis of Koltchinskii & Panchenko (2002), since the complexity term of the bound would then be R_m(conv(∪_{k=1}^p H_k)) = R_m(∪_{k=1}^p H_k), which does not admit an explicit dependency on the mixture weights and is lower bounded by Σ_{t=1}^T α_t R_m(H_{k_t}). Thus, the theorem provides a finer learning bound than the one obtained via a standard Rademacher complexity analysis.
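To illustrate this point numerically (the example below is not from the paper; all values are hypothetical), the following sketch compares the weighted complexity term of Theorem 3, Σ_t α_t R_m(Π₁(H_{k_t})), with the complexity of the most complex family used, which lower-bounds the union-based term of the standard analysis.

```python
import numpy as np

def weighted_complexity(alpha, family_index, rademacher):
    """Weighted complexity term of Theorem 3: sum_t alpha_t * R_m(Pi_1(H_{k_t})).
    alpha[t] is the mixture weight of h_t, family_index[t] = k_t, and
    rademacher[k] is a (hypothetical) value of R_m(Pi_1(H_k))."""
    alpha = np.asarray(alpha, dtype=float)
    comp = np.asarray([rademacher[k] for k in family_index], dtype=float)
    return float(np.dot(alpha, comp))

# Hypothetical ensemble: most of the weight on a simple family (k = 0),
# a small weight on a complex one (k = 2).
rademacher = [0.01, 0.05, 0.5]   # assumed complexities of H_1, H_2, H_3
alpha = [0.6, 0.3, 0.1]          # mixture weights, summing to 1
family_index = [0, 1, 2]         # h_1 in H_1, h_2 in H_2, h_3 in H_3

print(weighted_complexity(alpha, family_index, rademacher))  # 0.071
print(max(rademacher))  # 0.5: a crude proxy for the weight-blind, union-based term
```

When the complex families carry only a small total mixture weight, the weighted term stays close to the complexity of the simple families, which is precisely the behavior that DeepBoost-style algorithms exploit.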

Our proof makes use of Theorem 2 and a proof technique used in (Schapire et al., 1997).

Proof. For a fixed h = (h₁, ..., h_T), any α in the probability simplex defines a distribution over {h₁, ..., h_T}. Sampling from {h₁, ..., h_T} according to α and averaging leads to functions g of the form g = (1/n) Σ_{t=1}^T n_t h_t, for some (n₁, ..., n_T) with Σ_{t=1}^T n_t = n and h_t ∈ H_{k_t}. For any N = (N₁, ..., N_p) with |N| = n, we consider the family of functions

G_{F,N} = {(1/n) Σ_{k=1}^p Σ_{j=1}^{N_k} h_{k,j} : ∀(k, j) ∈ [1, p] × [1, N_k], h_{k,j} ∈ H_k},

and the union of all such families, G_{F,n} = ∪_{|N|=n} G_{F,N}. Fix ρ > 0. For a fixed N, the Rademacher complexity of Π₁(G_{F,N}) can be bounded as follows for any m ≥ 1:

R_m(Π₁(G_{F,N})) ≤ (1/n) Σ_{k=1}^p N_k R_m(Π₁(H_k)).

Thus, by Theorem 2, the following multi-class margin-based Rademacher complexity bound holds: for any δ > 0, with probability at least 1 − δ, for all g ∈ G_{F,N},

R_ρ(g) ≤ R_{S,ρ}(g) + (4c/(ρn)) Σ_{k=1}^p N_k R_m(Π₁(H_k)) + √(log(1/δ)/(2m)).

Since there are at most pⁿ possible p-tuples N with |N| = n (the number S(p, n) of such p-tuples is known to be precisely the binomial coefficient C(p + n − 1, p − 1)), by the union bound, for any δ > 0, with probability at least 1 − δ, for all g ∈ G_{F,n}, we can write

R_ρ(g) ≤ R_{S,ρ}(g) + (4c/(ρn)) Σ_{k=1}^p N_k R_m(Π₁(H_k)) + √(log(pⁿ/δ)/(2m)).

Thus, with probability at least 1 − δ, for all functions g = (1/n) Σ_{t=1}^T n_t h_t with h_t ∈ H_{k_t}, the following inequality holds:

R_ρ(g) ≤ R_{S,ρ}(g) + (4c/(ρn)) Σ_{k=1}^p Σ_{t: k_t=k} n_t R_m(Π₁(H_{k_t})) + √(log(pⁿ/δ)/(2m)).

Taking the expectation with respect to α and using E_α[n_t/n] = α_t, we obtain that, for any δ > 0, with probability at least 1 − δ, for all g, we can write

R_ρ(g) ≤ R_{S,ρ}(g) + (4c/ρ) Σ_{t=1}^T α_t R_m(Π₁(H_{k_t})) + √(log(pⁿ/δ)/(2m)).

Fix n. Then, for any δ_n > 0, with probability at least 1 − δ_n,

R_{ρ/2}(g) ≤ R_{S,ρ/2}(g) + (8c/ρ) Σ_{t=1}^T α_t R_m(Π₁(H_{k_t})) + √(log(pⁿ/δ_n)/(2m)).

Choose δ_n = δ/(2p^{n−1}) for some δ > 0; then, for p ≥ 2, Σ_{n≥1} δ_n = δ/(2(1 − 1/p)) ≤ δ. Thus, for any δ > 0 and any n ≥ 1, with probability at least 1 − δ, the following holds for all g:

R_{ρ/2}(g) ≤ R_{S,ρ/2}(g) + (8c/ρ) Σ_{t=1}^T α_t R_m(Π₁(H_{k_t})) + √(log(2p^{2n−1}/δ)/(2m)).    (3)

Now, for any f = Σ_{t=1}^T α_t h_t ∈ F and any g = (1/n) Σ_{t=1}^T n_t h_t, we can upper-bound R(f) = Pr_{(x,y)∼D}[ρ_f(x, y) ≤ 0], the generalization error of f, as follows:

R(f) = Pr_{(x,y)∼D}[ρ_f(x, y) − ρ_g(x, y) + ρ_g(x, y) ≤ 0]
≤ Pr_{(x,y)∼D}[ρ_f(x, y) − ρ_g(x, y) < −ρ/2] + Pr_{(x,y)∼D}[ρ_g(x, y) ≤ ρ/2]
= Pr_{(x,y)∼D}[ρ_f(x, y) − ρ_g(x, y) < −ρ/2] + R_{ρ/2}(g).

We can also write

R_{S,ρ/2}(g) = R_{S,ρ/2}(g − f + f) ≤ Pr_{(x,y)∼S}[ρ_g(x, y) − ρ_f(x, y) < −ρ/2] + R_{S,ρ}(f).

Combining these inequalities yields

Pr_{(x,y)∼D}[ρ_f(x, y) ≤ 0] ≤ R_{S,ρ}(f) + Pr_{(x,y)∼D}[ρ_f(x, y) − ρ_g(x, y) < −ρ/2] + Pr_{(x,y)∼S}[ρ_g(x, y) − ρ_f(x, y) < −ρ/2] + R_{ρ/2}(g) − R_{S,ρ/2}(g).

Taking the expectation with respect to α yields

R(f) ≤ R_{S,ρ}(f) + E_{(x,y)∼D, α}[1_{ρ_f(x,y) − ρ_g(x,y) < −ρ/2}] + E_{(x,y)∼S, α}[1_{ρ_g(x,y) − ρ_f(x,y) < −ρ/2}] + E_α[R_{ρ/2}(g) − R_{S,ρ/2}(g)].    (4)

Fix (x, y), and for any function ϕ: X × Y → [0, 1], define y_ϕ as follows: y_ϕ = argmax_{y′≠y} ϕ(x, y′). For any g, by definition of ρ_g, we can write ρ_g(x, y) ≤ g(x, y) − g(x, y_f). In light of this inequality and Hoeffding's bound, the following holds:

E_α[1_{ρ_f(x,y) − ρ_g(x,y) < −ρ/2}] = Pr_α[ρ_f(x, y) − ρ_g(x, y) < −ρ/2]
≤ Pr_α[(f(x, y) − f(x, y_f)) − (g(x, y) − g(x, y_f)) < −ρ/2] ≤ e^{−nρ²/8}.

Similarly, for any g, we can write ρ_f(x, y) ≤ f(x, y) − f(x, y_g). Using this inequality, the union bound, and Hoeffding's bound, the other expectation term appearing on the right-hand side of (4) can be bounded as follows:

E_α[1_{ρ_g(x,y) − ρ_f(x,y) < −ρ/2}] = Pr_α[ρ_g(x, y) − ρ_f(x, y) < −ρ/2]
≤ Pr_α[(g(x, y) − g(x, y_g)) − (f(x, y) − f(x, y_g)) < −ρ/2]
≤ Σ_{y′≠y} Pr_α[(g(x, y) − g(x, y′)) − (f(x, y) − f(x, y′)) < −ρ/2]
≤ (c − 1) e^{−nρ²/8}.

Thus, for any fixed f ∈ F, we can write

R(f) ≤ R_{S,ρ}(f) + c e^{−nρ²/8} + E_α[R_{ρ/2}(g) − R_{S,ρ/2}(g)].

Therefore, the following inequality holds:

sup_{f∈F} [R(f) − R_{S,ρ}(f)] ≤ c e^{−nρ²/8} + sup_g [R_{ρ/2}(g) − R_{S,ρ/2}(g)],

and, in view of (3), for any δ > 0 and any n ≥ 1, with probability at least 1 − δ, the following holds for all f ∈ F:

R(f) ≤ R_{S,ρ}(f) + (8c/ρ) Σ_{t=1}^T α_t R_m(Π₁(H_{k_t})) + c e^{−nρ²/8} + √(((2n − 1) log p + log(2/δ))/(2m)).

Choosing n = ⌈(4/ρ²) log(ρ²c²m/(4 log p))⌉ yields the following inequality:

R(f) ≤ R_{S,ρ}(f) + (8c/ρ) Σ_{t=1}^T α_t R_m(Π₁(H_{k_t})) + (2/ρ) √(log p/m) + √(⌈(4/ρ²) log(ρ²c²m/(4 log p))⌉ (log p)/m + log(2/δ)/(2m)),

which concludes the proof.

(Footnote: to select n, we consider f(n) = c e^{−nu} + √(nv), where u = ρ²/8 and v = (log p)/m. Taking the derivative of f, setting it to zero, and solving for n, we obtain n = −(1/(2u)) W_{−1}(−v/(2c²u)), where W_{−1} is the second branch of the Lambert function, the inverse of x ↦ x e^x. Using the bound log x ≤ −W_{−1}(−1/x) ≤ 2 log x leads to the choice n = (1/(2u)) log(2c²u/v).)

4. Conclusion

We presented improved Rademacher complexity margin bounds that scale linearly with the number of classes, as opposed to the quadratic dependency of the existing Rademacher complexity margin-based learning guarantees. Furthermore, we used this result to prove a novel generalization bound for multi-class classifier ensembles that depends only on the Rademacher complexity of the hypothesis classes to which the classifiers in the ensemble belong. (Cortes et al., 2014) developed this idea in an algorithm called DeepBoost, a boosting algorithm where the decision at each iteration of which classifier to add to the ensemble, and which weight to assign to that classifier, depends on the complexity of the hypothesis class to which it belongs. One interpretation of DeepBoost is that it applies the principle of structural risk minimization to each iteration of boosting. (Kuznetsov et al., 2014) extended these ideas to the multi-class setting.

References

Cortes, Corinna, Mohri, Mehryar, and Syed, Umar. Deep boosting. In ICML, 2014.

Dietterich, Thomas G. An experimental comparison of three methods for constructing ensembles of decision trees: Bagging, boosting, and randomization. Machine Learning, 40(2):139–157, 2000.

Duchi, John C. and Singer, Yoram. Boosting with structural sparsity. In ICML, 2009.

Freund, Yoav and Schapire, Robert E. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139, 1997.

Friedman, Jerome H., Hastie, Trevor, and Tibshirani, Robert. Additive logistic regression: a statistical view of boosting. Annals of Statistics, 28(2):337–407, 2000.

Grove, Adam J. and Schuurmans, Dale. Boosting in the limit: Maximizing the margin of learned ensembles. In AAAI/IAAI, 1998.

Koltchinskii, Vladimir and Panchenko, Dmitry. Empirical margin distributions and bounding the generalization error of combined classifiers. Annals of Statistics, 30, 2002.

Kuznetsov, Vitaly, Mohri, Mehryar, and Syed, Umar. Multi-class deep boosting. In Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N. D., and Weinberger, K. Q. (eds.), Advances in Neural Information Processing Systems 27. Curran Associates, Inc., 2014.

Ledoux, Michel and Talagrand, Michel. Probability in Banach Spaces: Isoperimetry and Processes. Springer, 1991.

Mohri, Mehryar, Rostamizadeh, Afshin, and Talwalkar, Ameet. Foundations of Machine Learning. The MIT Press, 2012.

Rätsch, Gunnar, Mika, Sebastian, and Warmuth, Manfred K. On the convergence of leveraging. In NIPS, 2001a.

Rätsch, Gunnar, Onoda, Takashi, and Müller, Klaus-Robert. Soft margins for AdaBoost. Machine Learning, 42(3):287–320, 2001b.

Schapire, Robert E. Theoretical views of boosting and applications.
In Proceedings of ALT 1999, volume 1720 of Lecture Notes in Computer Science. Springer, 1999.

Schapire, Robert E., Freund, Yoav, Bartlett, Peter, and Lee, Wee Sun. Boosting the margin: A new explanation for the effectiveness of voting methods. In ICML, 1997.
