Computable Shell Decomposition Bounds
Journal of Machine Learning Research 5 (2004). Submitted 1/03; Revised 8/03; Published 5/04.

Computable Shell Decomposition Bounds

John Langford (JL@TTI-C.ORG)
David McAllester (MCALLESTER@TTI-C.ORG)
Toyota Technology Institute at Chicago
1427 East 60th Street
Chicago, IL 60637, USA

Editor: Manfred Warmuth

Abstract

Haussler, Kearns, Seung and Tishby introduced the notion of a shell decomposition of the union bound as a means of understanding certain empirical phenomena in learning curves, such as phase transitions. Here we use a variant of their ideas to derive an upper bound on the generalization error of a hypothesis computable from its training error and the histogram of training errors for the hypotheses in the class. In most cases this new bound is significantly tighter than traditional bounds computed from the training error and the cardinality of the class. Our results can also be viewed as providing a rigorous foundation for a model selection algorithm proposed by Scheffer and Joachims.

Keywords: sample complexity, classification, true error bounds, shell bounds

1 Introduction

For an arbitrary finite hypothesis class we consider the hypothesis of minimal training error. We give a new upper bound on the generalization error of this hypothesis, computable from the training error of the hypothesis and the histogram of the training errors of the other hypotheses in the class. This new bound is typically much tighter than more traditional upper bounds computed from the training error and cardinality of the class. As a simple example, suppose that we observe that all but one empirical error in a hypothesis space is 1/2 and one empirical error is 0. Furthermore, suppose that the sample size is large enough (relative to the size of the hypothesis class) that with high confidence we have that, for all hypotheses in the class, the true (generalization) error of a hypothesis is within 1/5 of its training error. This implies that, with high confidence, hypotheses with training error near 1/2 have true error in [3/10,
7/10]. Intuitively, we would expect the true error of the hypothesis with minimum empirical error to be very near to 0, rather than simply less than 1/5, because none of the hypotheses which produced an empirical error of 1/2 could have a true error close enough to 0 that there exists a significant probability of producing 0 empirical error. The bound presented here validates this intuition. We show that you can ignore hypotheses with training error near 1/2 in calculating an effective size of the class for hypotheses with training error near 0. This new effective class size allows us to calculate a tighter bound on the difference between training error and true error for hypotheses with training error near 0. The new bound is proved using a distribution-dependent application of the union bound, similar in spirit to the shell decomposition introduced by Haussler et al. (1996).

© 2004 John Langford and David McAllester.
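The intuition in this example can be checked with a short calculation. The sketch below uses hypothetical numbers chosen only for illustration (a class of 1000 hypotheses, a sample of size m = 100, and a true-error threshold of 0.3; none of these come from the paper): a hypothesis with true error at least 0.3 produces zero training errors only with exponentially small probability, and a union bound over the whole class rules such hypotheses out.

```python
import math

# Hypothetical numbers for illustration (not from the paper).
m = 100            # sample size
class_size = 1000  # |H|

# A hypothesis with true error >= 0.3 makes zero mistakes on m IID
# examples with probability at most (1 - 0.3)**m.
p_perfect = 0.7 ** m

# Union bound: the probability that ANY hypothesis with true error >= 0.3
# achieves empirical error 0 on the sample.
p_any_perfect = class_size * p_perfect

print(p_any_perfect)   # vanishingly small
assert p_any_perfect < 1e-12
```

So with overwhelming confidence the hypothesis observed to have empirical error 0 is not one of the near-1/2 hypotheses, and its true error must be near 0.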
We actually give two upper bounds on generalization error: an uncomputable bound and a computable bound. The uncomputable bound is a function of the unknown distribution of true error rates of the hypotheses in the class. The computable bound is, essentially, the uncomputable bound with the unknown distribution of true errors replaced by the known histogram of training errors. Our main contribution is that this replacement is sound, i.e., the computable version remains, with high confidence, an upper bound on generalization error.

When considering asymptotic properties of learning theory bounds it is important to take limits in which the cardinality (or VC dimension) of the hypothesis class is allowed to grow with the size of the sample. In practice, more data typically justifies a larger hypothesis class. For example, the size of a decision tree is generally proportional to the amount of training data available. Here we analyze the asymptotic properties of our bounds by considering an infinite sequence of hypothesis classes H_m, one for each sample size m, such that (ln |H_m|)/m approaches a limit larger than zero. This kind of asymptotic analysis provides a clear account of the improvement achieved by bounds that are functions of error rate distributions rather than simply the size (or VC dimension) of the class. We give a lower bound on generalization error showing that the uncomputable upper bound is asymptotically as tight as possible: any upper bound on generalization error given as a function of the unknown distribution of true error rates must asymptotically be greater than or equal to our uncomputable upper bound. Our lower bound on generalization error also shows that there is essentially no loss in working with an upper bound computed from the true error distribution rather than expectations computed from this distribution as used by Scheffer and Joachims (1999). Asymptotically, the computable bound is simply the uncomputable bound with the unknown distribution of true errors replaced with the
observed histogram of training errors. Unfortunately, we can show that in limits where (ln |H_m|)/m converges to a value greater than zero, the histogram of training errors need not converge to the distribution of true errors: the histogram of training errors is a smeared-out version of the distribution of true errors. This smearing loosens the bound even in the large-sample asymptotic limit. We give a precise asymptotic characterization of this smearing effect for the case where distinct hypotheses have independent training errors. In spite of the divergence between these bounds, the computable bound is still significantly tighter than classical bounds not involving error distributions.

The computable bound can be used for model selection. In the case of model selection we can assume an infinite sequence of finite model classes H_0, H_1, ..., where each H_j is a finite class with ln |H_j| growing linearly in j. To perform model selection we find the hypothesis of minimal training error in each class and use the computable bound to bound its generalization error. We can then select, among these, the model with the smallest upper bound on generalization error. Scheffer and Joachims propose (without formal justification) replacing the distribution of true errors with the histogram of training errors. Under this replacement, the model selection algorithm based on our computable upper bound is asymptotically identical to the algorithm proposed by Scheffer and Joachims.

The shell decomposition is a distribution-dependent use of the union bound. Distribution-dependent uses of the union bound have been previously exploited in so-called self-bounding algorithms. Freund (1998) defines, for a given learning algorithm and data distribution, a set S of hypotheses such that, with high probability over the sample, the algorithm always returns a hypothesis in that set. Although S is defined in terms of the unknown data distribution, Freund gives a way of computing a set S′ from the given algorithm and the sample such that, with high confidence, S′ contains S
and hence the effective size of the hypothesis class is bounded by |S′|. Langford and
Blum (1999) give a more practical version of this algorithm. Given an algorithm and data distribution they conceptually define a weighting over the possible executions of the algorithm. Although the data distribution is unknown, they give a way of computing a lower bound on the weight of the particular execution of the algorithm generated by the sample at hand. In this paper we consider distribution-dependent union bounds defined independently of any particular learning algorithm.

The bounds given in this paper apply to finite concept classes. Of course more sophisticated measures of the complexity of a concept class, such as VC dimension or Rademacher complexity, are possible and can sometimes result in tighter bounds. However, insight into finite classes remains useful in at least two ways. Finite class analysis is useful as a pedagogical tool, teaching about directions in which to look for the removal of slack from these more sophisticated bounds. Indeed, various localized Rademacher complexity results (Bartlett et al., 2002) and the peeling technique (van de Geer, 1999) appear to (roughly) correspond to the orthogonal combination of shell bounds and earlier Rademacher complexity results. One advantage of the shell bounds is the KL-divergence form of the bounds, which smoothly interpolates between the linear bounds of the realizable case and the quadratic bounds of the unrealizable case. This realizable-unrealizable interpolation is orthogonal to the shell principle that concepts with large empirical error are unlikely to be confused with concepts with low error rate. The shell bound also supports intuitions that are difficult to achieve in more complex settings. For example, the simple shell bounds clearly exhibit phase transitions in the learning bound, something which does not appear to be well-elucidated for localized Rademacher bounds. In summary, the simplicity of finite classes (and a shell bound analysis on a finite class) provides a clarity that is difficult to
achieve with more complex structure-exploiting bounds.

Finite class analysis is also useful in a more practical sense. In practice a finite VC dimension class usually has a finite parameterization. Given that these real parameters are typically represented as 32 bit floating point numbers, the class becomes finite and the log of the class size is linear in the number of parameters. Since many of the more sophisticated infinite-class techniques are loose by large multiplicative constants, a finite class analysis applied to a VC class discretized to a small number of bits can actually yield tighter bounds, as shown in Figure 1.

2 Mathematical Preliminaries

For an arbitrary measure on an arbitrary sample space we use the notation ∀^δ S: Φ[S, δ] to mean that with probability at least 1 − δ over the choice of the sample S we have that Φ[S, δ] holds.¹ In practice S is the training sample of a learning algorithm. Note that ∀x ∀^δ S: Φ[x, S, δ] does not imply ∀^δ S ∀x: Φ[x, S, δ]. If X is a finite set, and for all x ∈ X we have the assertion ∀δ > 0 ∀^δ S: Φ[S, x, δ], then by a standard application of the union bound we have the assertion ∀δ > 0 ∀^δ S ∀x ∈ X: Φ[S, x, δ/|X|]. We call this the quantification rule. If ∀δ > 0 ∀^δ S: Φ[S, δ] and ∀δ > 0 ∀^δ S: Ψ[S, δ], then by a standard application of the union bound we have ∀δ > 0 ∀^δ S: Φ[S, δ/2] ∧ Ψ[S, δ/2]. We call this the conjunction rule.

The KL-divergence of p from q, denoted D(q || p), is q ln(q/p) + (1 − q) ln((1 − q)/(1 − p)), with 0 ln(0/p) = 0 and q ln(q/0) = ∞. Let p̂ be the fraction of heads in a sequence S of m tosses of a biased coin where

1. This can be read as "for all but δ sets S, the predicate Φ[S, δ] holds" or "with probability 1 − δ over the draw of S, the predicate Φ[S, δ] holds."
[Figure: true error bound as a function of training error, with curves for the VC bound and the Occam's Razor Bound (ORB) at 32, 16, and 8 bits.]

Figure 1: A graph comparing the (infinite hypothesis) VC bound to the finite hypothesis Occam's Razor Bound. For all curves we use VC dimension d = 10, bound failure probability δ = 0.1, and m = 1000 examples. For the VC bound calculation (see Moore, 2004, for details) the formula is: true error ≤ train error + √((d ln(2m/d) + ln(4/δ))/m). For the Occam's Razor Bound calculation (see Langford, 2003, for details), we use a uniform distribution over the 2^(kd) discrete classifiers which might be representable when we discretize d parameters to k = 8, 16, 32 bits per dimension. The basic formula is: KL(train error || true error) ≤ (kd ln 2 + ln(1/δ))/m. This graph is approximately the same for any similar ratio of d/m, with smaller values favoring the Occam's Razor Bound.
the probability of heads is p. We have the following inequality, given by Chernoff (1952):

∀q ∈ [p, 1]: Pr(p̂ ≥ q) ≤ e^(−m D(q||p)). (1)

This bound can be rewritten as follows:

∀δ > 0 ∀^δ S: D(max(p̂, p) || p) ≤ ln(1/δ)/m. (2)

To derive (2) from (1), note that Pr(D(max(p̂, p) || p) ≥ ln(1/δ)/m) equals Pr(p̂ ≥ q), where q ≥ p and D(q||p) = ln(1/δ)/m. By (1) we then have that this probability is no larger than e^(−m D(q||p)) = δ. It is just as easy to derive (1) from (2), so the two statements are equivalent. By duality, i.e., by considering the problem defined by replacing p by 1 − p, we get

∀δ > 0 ∀^δ S: D(min(p̂, p) || p) ≤ ln(1/δ)/m. (3)

Conjoining (2) and (3) yields the following corollary of (1):

∀δ > 0 ∀^δ S: D(p̂ || p) ≤ ln(2/δ)/m. (4)

Using the inequality D(q||p) ≥ 2(q − p)², one can show that (4) implies the better known form of the Chernoff bound:

∀δ > 0 ∀^δ S: |p − p̂| ≤ √(ln(2/δ)/(2m)). (5)

Using the inequality D(q||p) ≥ (p − q)²/(2p), which holds for q ≤ p, we can show that (3) implies the following:²

∀δ > 0 ∀^δ S: p ≤ p̂ + √(2 p̂ ln(1/δ)/m) + 2 ln(1/δ)/m.³ (6)

Note that for small values of p̂, formula (6) gives a tighter upper bound on p than does (5). The upper bound on p implicit in (4) is somewhat tighter than the minimum of the bounds given by (5) and (6).

We now consider a formal setting for hypothesis learning. We assume a finite set H of hypotheses and a space X of instances. We assume that each hypothesis represents a function from X to {0, 1}, where we write h(x) for the value of the function represented by hypothesis h when applied to instance x. We also assume a distribution D on pairs ⟨x, y⟩ with x ∈ X and y ∈ {0, 1}. For any hypothesis h we define the error rate of h, denoted e(h), to be P_⟨x,y⟩∼D(h(x) ≠ y). For a given sample S of m pairs drawn from D we write ê(h) to denote the fraction of the pairs ⟨x, y⟩ in S such that h(x) ≠ y. Quantifying over h ∈ H in (4) yields the following second corollary of (1):

∀^δ S ∀h ∈ H: D(ê(h) || e(h)) ≤ (ln |H| + ln(2/δ))/m. (7)

2. A derivation of this formula can be found in Mansour and McAllester (2000) or McAllester and Schapire (2000).
3. To see the need for the last term, consider the case where p̂ = 0.
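The bounds (4), (5), and (6), together with the two formulas in the caption of Figure 1, are all easy to evaluate numerically; the implicit KL form only requires a one-dimensional bisection. A minimal sketch (the specific values m = 1000, δ = 0.05, p̂ = 0.01 are illustrative only):

```python
import math

def kl(q, p):
    """Bernoulli KL divergence D(q||p), with the 0 ln 0 = 0 convention."""
    if p <= 0.0 or p >= 1.0:
        return 0.0 if q == p else float("inf")
    t = 0.0
    if q > 0.0:
        t += q * math.log(q / p)
    if q < 1.0:
        t += (1.0 - q) * math.log((1.0 - q) / (1.0 - p))
    return t

def kl_upper_inverse(q, bound):
    """Largest p >= q with D(q||p) <= bound, found by bisection."""
    lo, hi = q, 1.0
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if kl(q, mid) <= bound:
            lo = mid
        else:
            hi = mid
    return lo

m, delta, p_hat = 1000, 0.05, 0.01

b4 = kl_upper_inverse(p_hat, math.log(2 / delta) / m)        # implicit bound (4)
b5 = p_hat + math.sqrt(math.log(2 / delta) / (2 * m))        # additive form (5)
b6 = (p_hat + math.sqrt(2 * p_hat * math.log(1 / delta) / m)
      + 2 * math.log(1 / delta) / m)                         # small-p_hat form (6)
assert b4 < b6 < b5   # (6) beats (5) for small p_hat; the implicit bound (4) is tightest

# Figure 1 formulas: VC bound vs. the discretized Occam's Razor Bound,
# with d = 10, delta = 0.1, m = 1000 as in the caption.
d, delta = 10, 0.1
vc = lambda e: e + math.sqrt((d * math.log(2 * m / d) + math.log(4 / delta)) / m)
orb = lambda e, k: kl_upper_inverse(e, (k * d * math.log(2) + math.log(1 / delta)) / m)
assert orb(0.0, 32) < vc(0.0)   # even 32-bit discretization beats the VC bound here
```

The comparison at the end reproduces the qualitative claim of Figure 1: with these settings the discretized finite-class bound is tighter than the VC bound at training error 0.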
By considering bounds on D(q||p) we can derive the more well known corollary of (7):

∀δ > 0 ∀^δ S ∀h ∈ H: |e(h) − ê(h)| ≤ √((ln |H| + ln(2/δ))/(2m)). (8)

These two formulas both limit the distance between ê(h) and e(h). In this paper we work with (7) rather than (8) because it yields an (implicit) upper bound on generalization error that is optimal up to asymptotic equality.

3 The Upper Bound

Our goal now is to improve on (7). Our first step is to divide the hypotheses in H into disjoint sets based on their true error rates. More specifically, for p ∈ [0, 1] define ⌈p⌉ to be max(1/m, ⌈pm⌉/m). Note that ⌈p⌉ is of the form k/m, where either p = 0 and k = 1, or p > 0 and p ∈ ((k − 1)/m, k/m]. In either case we have ⌈p⌉ ∈ {1/m, ..., m/m}, and if ⌈p⌉ = k/m then p ∈ [(k − 1)/m, k/m]. Now we define H(k/m) to be the set of h ∈ H such that ⌈e(h)⌉ = k/m. We define s(k/m) to be ln(max(1, |H(k/m)|)). We now have the following lemma.

Lemma 3.1 With high probability over the draw of S, the deviation between the empirical error ê(h) and the true error e(h) of every hypothesis is bounded in terms of s(⌈e(h)⌉). More precisely,

∀δ > 0 ∀^δ S ∀h ∈ H: D(ê(h) || e(h)) ≤ (s(⌈e(h)⌉) + ln(2m/δ))/m.

Proof. Quantifying over p ∈ {1/m, ..., m/m} and over h ∈ H(p) in (4) gives

∀δ > 0 ∀^δ S ∀p ∈ {1/m, ..., m/m} ∀h ∈ H(p): D(ê(h) || e(h)) ≤ (s(p) + ln(2m/δ))/m.

But this implies the lemma. □

Lemma 3.1 imposes a constraint, and hence a bound, on e(h). More specifically, we have the following, where lub{x : Φ[x]} denotes the least upper bound (the maximum) of the set {x : Φ[x]}:

e(h) ≤ lub{ q : D(ê(h) || q) ≤ (s(⌈q⌉) + ln(2m/δ))/m }. (9)

This is our uncomputable bound. It is uncomputable because the numbers s(1/m), ..., s(m/m) are unknown. Ignoring this problem, however, we can see that this bound is typically significantly tighter than (7). More specifically, we can rewrite (7) as

e(h) ≤ lub{ q : D(ê(h) || q) ≤ (ln |H| + ln(2/δ))/m }. (10)

Since s(⌈q⌉) ≤ ln |H|, and since ln(m)/m is small for large m, we have that (9) is never significantly looser than (10). Now consider a hypothesis h such that the bound on e(h) given by (7), or equivalently,
(10), is significantly less than 1/2. Assuming m is large, the bound given by (9) must also be significantly less than 1/2. But for q significantly less than 1/2 we typically have that s(⌈q⌉) is significantly smaller than ln |H|. For example, suppose H is the set of all decision trees of size m/10. For large m, a random decision tree of this size has error rate near 1/2. The set of decision trees with error rate significantly smaller than 1/2 is an exponentially small fraction of the set of all possible trees. So for q small compared to 1/2 we get that s(⌈q⌉) is significantly smaller than ln |H|. This makes the bound given by (9) significantly tighter than the bound given by (10).

We now show that the distribution of true errors can be replaced, essentially, by the histogram of training errors. We first introduce the following definitions:

Ĥ(k/m, δ) ≡ { h ∈ H : |ê(h) − k/m| ≤ 1/m + √(ln(16m²/δ)/(2m − 1)) },

ŝ(k/m, δ) ≡ ln(max(1, 2 |Ĥ(k/m, δ)|)).

The definition of ŝ(k/m, δ) is motivated by the following lemma.

Lemma 3.2 With high probability over the draw of S we have, for all q, that s(q) ≤ ŝ(q, 2δ). More precisely,

∀δ > 0 ∀^δ S ∀q ∈ {1/m, ..., m/m}: s(q) ≤ ŝ(q, 2δ).

Before proving Lemma 3.2 we note that by conjoining (9) and Lemma 3.2 we get the following. This is our main result.

Theorem 3.3 With high probability over the draw of S, the true error e(h) of every hypothesis is bounded in terms of its empirical error ê(h) and the observable quantities ŝ(⌈q⌉, δ). More precisely,

∀δ > 0 ∀^δ S ∀h ∈ H: e(h) ≤ lub{ q : D(ê(h) || q) ≤ (ŝ(⌈q⌉, δ) + ln(4m/δ))/m }.

As for Lemma 3.1, the bound implicit in Theorem 3.3 is typically significantly tighter than the bound in (7) or its equivalent form (10). The argument for the improved tightness of Theorem 3.3 over (10) is similar to the argument for (9). More specifically, consider a hypothesis h for which the bound in (10) is significantly less than 1/2. Since ŝ(⌈q⌉, δ) ≤ ln(2|H|), the set of values of q satisfying the condition in Theorem 3.3 must all be significantly less than 1/2. But for large m we have that 1/m + √(ln(16m²/δ)/(2m − 1)) is small.
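Theorem 3.3 can be turned into a small program. The sketch below is a best-effort reading of the (partly garbled) definitions of Ĥ(k/m, δ) and ŝ(k/m, δ) in this transcription — the window width and the ln(4m/δ) term are reconstructions, so treat the constants as illustrative. The example class (one hypothesis with training error 0 and a million with training error 1/2, both counts hypothetical) mirrors the example from the introduction, and the comparison value is the standard bound (10).

```python
import math

def kl(q, p):
    """Bernoulli KL divergence D(q||p), with the 0 ln 0 = 0 convention."""
    if p <= 0.0 or p >= 1.0:
        return 0.0 if q == p else float("inf")
    t = q * math.log(q / p) if q > 0.0 else 0.0
    if q < 1.0:
        t += (1.0 - q) * math.log((1.0 - q) / (1.0 - p))
    return t

def shell_bound(hist, e_hat, m, delta):
    """Sketch of the computable shell bound (constants as reconstructed here).
    hist[j] = number of hypotheses in the class with training error j/m;
    e_hat = training error of the hypothesis being bounded."""
    dev = math.sqrt(math.log(16 * m * m / delta) / (2 * m - 1))
    w = int(1 + dev * m)                   # window half-width in units of 1/m
    cum = [0] * (m + 2)
    for j in range(m + 1):
        cum[j + 1] = cum[j] + hist[j]      # cum[j+1] = #{h : training error <= j/m}
    log_term = math.log(4 * m / delta)
    best = e_hat
    for k in range(1, m + 1):              # candidate q = k/m on the 1/m grid
        count = cum[min(m, k + w) + 1] - cum[max(0, k - w)]   # |H_hat(k/m, delta)|
        s_hat = math.log(max(1, 2 * count))
        q = k / m
        if kl(min(e_hat, q), q) <= (s_hat + log_term) / m:
            best = max(best, q)            # q is feasible; track the lub
    return best

m, delta = 1000, 0.05
hist = [0] * (m + 1)
hist[0] = 1              # one hypothesis with zero training error
hist[m // 2] = 10 ** 6   # a million hypotheses with training error 1/2

shell = shell_bound(hist, 0.0, m, delta)

# Standard bound (10) for comparison: lub{q : D(0||q) <= (ln|H| + ln(2/delta))/m}.
occam = 1 - math.exp(-(math.log(sum(hist)) + math.log(2 / delta)) / m)

print(shell, occam)
assert shell < occam < 0.5   # the shell bound ignores the near-1/2 hypotheses
```

The near-1/2 shell never enters the effective class size for small q, so the shell bound beats the cardinality-based bound even though it pays extra ln(m) terms.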
So if q is significantly less than 1/2, then all hypotheses in Ĥ(⌈q⌉, δ) have empirical error rates significantly less than 1/2. But for most hypothesis classes, e.g., decision trees, the set of hypotheses with empirical error rates far from 1/2 should be an exponentially small fraction of the class. Hence we get that ŝ(⌈q⌉, δ) is significantly less than ln |H|, and Theorem 3.3 is tighter than (10).

The remainder of this section is a proof of Lemma 3.2. Our departure point for the proof is the following lemma from McAllester (1999).
Lemma 3.4 (McAllester 99) For any measure on any hypothesis class we have the following, where E_h f(h) denotes the expectation of f(h) under the given measure on h:

∀δ > 0 ∀^δ S: E_h e^((2m−1)(ê(h) − e(h))²) ≤ 4m/δ.

Intuitively, this lemma states that with high confidence over the choice of the sample, most hypotheses have empirical error near their true error. This allows us to prove that ŝ(⌈q⌉, δ) bounds s(⌈q⌉). More specifically, by considering the uniform distribution on H(k/m), Lemma 3.4 implies that with probability at least 1 − δ over the draw of S we have

E_{h∈H(k/m)} e^((2m−1)(ê(h) − e(h))²) ≤ 4m/δ

⟹ Pr_{h∈H(k/m)}( e^((2m−1)(ê(h) − e(h))²) ≥ 8m/δ ) ≤ 1/2

⟹ Pr_{h∈H(k/m)}( |ê(h) − e(h)| ≤ √(ln(8m/δ)/(2m−1)) ) ≥ 1/2

⟹ Pr_{h∈H(k/m)}( |ê(h) − k/m| ≤ 1/m + √(ln(8m/δ)/(2m−1)) ) ≥ 1/2

⟹ |H(k/m)| ≤ 2 |{ h ∈ H : |ê(h) − k/m| ≤ 1/m + √(ln(8m/δ)/(2m−1)) }|.

Lemma 3.2 now follows by quantification over q ∈ {1/m, ..., m/m}, which replaces δ by δ/m and thereby turns the deviation term ln(8m/δ) into ln(8m²/δ) = ln(16m²/(2δ)), the deviation term in the definition of Ĥ(q, 2δ). □

4 Asymptotic Analysis and Phase Transitions

This section and the two that follow give an asymptotic analysis of the bounds presented earlier. The asymptotic analysis is stated in Theorem 4.1 and Statement 6.1. To develop the asymptotic analysis, however, a preliminary discussion is needed regarding the phenomenon of phase transitions.

The bounds given in (9) and Theorem 3.3 exhibit phase transitions. More specifically, the bounding expression can be discontinuous in δ and m; e.g., arbitrarily small changes in δ can cause large changes in the bound. To see how this happens, consider the constraint on the quantity q:

D(ê(h) || q) ≤ (s(⌈q⌉) + ln(2m/δ))/m. (11)

The bound given by (9) is the least upper bound of the values of q satisfying (11). Assume that m is sufficiently large that we can think of s(⌈q⌉)/m as a continuous function of q, which we write as s(q). We can then rewrite (11) as follows, where λ is a quantity not depending on q and s(q) does not depend on δ:

D(ê(h) || q) ≤ s(q) + λ. (12)
For q ≥ ê(h) we know that D(ê(h) || q) is a monotonically increasing function of q. It is reasonable to assume that for q ≤ 1/2 we also have that s(q) is a monotonically increasing function of q. But even under these conditions it is possible that the feasible values of q, i.e., those satisfying (12), can be divided into separated regions. Furthermore, increasing λ can cause a new feasible region to come into existence. When this happens the bound, which is the least upper bound of the feasible values, can increase discontinuously. At a more intuitive level, consider a large number of high error concepts and a smaller number of lower error concepts. At a certain confidence level the high error concepts can be ruled out. But as the confidence requirement becomes more stringent, suddenly (and discontinuously) the high error concepts must be considered. A similar discontinuity can occur in sample size. Phase transitions in shell decomposition bounds are discussed in more detail by Haussler et al. (1996).

Phase transitions complicate asymptotic analysis. But asymptotic analysis illuminates the nature of phase transitions. As mentioned in the introduction, in the asymptotic analysis of learning theory bounds it is important that one does not hold H fixed as the sample size m increases. If we hold H fixed then lim_{m→∞} (ln |H|)/m = 0. But this is not what one expects for large samples in practice. As the sample size increases one typically uses larger hypothesis classes. Intuitively, we expect that even for very large m we have that (ln |H|)/m is far from zero.

For the asymptotic analysis of the bound in (9) we assume an infinite sequence of hypothesis classes H_1, H_2, H_3, ... and an infinite sequence of data distributions D_1, D_2, D_3, .... Let s_m(k/m) be s(k/m) defined relative to H_m and D_m. In the asymptotic analysis we assume that the sequence of functions s_m(⌈q⌉)/m, viewed as functions of q ∈ [0, 1], converges uniformly to a continuous function s(q). This means that for any ε > 0 there exists a k such that for all m > k we have
∀q ∈ [0, 1]: |s_m(⌈q⌉)/m − s(q)| ≤ ε.

Given the functions s_m(⌈p⌉) and their limit function s(p), we define the following functions of an empirical error rate ê:

B_m(ê) ≡ lub{ q : D(ê || q) ≤ (s_m(⌈q⌉) + ln(2m/δ))/m },

B(ê) ≡ lub{ q : D(ê || q) ≤ s(q) }.

The function B_m(ê) corresponds directly to the upper bound in (9). The function B(ê) is intended to be the large m asymptotic limit of B_m(ê). However, phase transitions complicate asymptotic analysis. The bound B(ê) need not be a continuous function of ê. A value of ê where the bound B(ê) is discontinuous corresponds to a phase transition in the bound. At a phase transition the sequence B_m(ê) need not converge. Away from phase transitions, however, we have the following theorem.

Theorem 4.1 If the bound B(ê) is continuous at the point ê (so we are not at a phase transition), and the functions s_m(⌈q⌉)/m, viewed as functions of q ∈ [0, 1], converge uniformly to a continuous function s(q), then we have

lim_{m→∞} B_m(ê) = B(ê).
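Before the proof, the discontinuity of B(ê) is easy to exhibit numerically. In the sketch below the shell curve s(q) is made up (a small shell of low-error concepts ramping up to a large shell of concepts with error near 1/2, chosen only so that the feasible set splits into two regions); B(ê) = lub{q : D(ê||q) ≤ s(q)} is evaluated on a grid, and a small change in ê makes the second feasible region appear, jumping the bound discontinuously.

```python
import math

def kl(q, p):
    """Bernoulli KL divergence D(q||p), with the 0 ln 0 = 0 convention."""
    if p <= 0.0 or p >= 1.0:
        return 0.0 if q == p else float("inf")
    t = q * math.log(q / p) if q > 0.0 else 0.0
    if q < 1.0:
        t += (1.0 - q) * math.log((1.0 - q) / (1.0 - p))
    return t

def s(q):
    """Hypothetical limit curve (nats per example): few low-error concepts,
    exponentially many concepts with error near 1/2. Made up for illustration."""
    if q <= 0.44:
        return 0.002
    if q >= 0.46:
        return 0.2
    return 0.002 + (0.2 - 0.002) * (q - 0.44) / 0.02   # continuous ramp

def B(e_hat, grid=2000):
    """lub{ q : D(e_hat||q) <= s(q) }, evaluated on a grid of q values."""
    return max(i / grid for i in range(grid + 1)
               if kl(e_hat, i / grid) <= s(i / grid))

before, after = B(0.15), B(0.18)
print(before, after)
assert after - before > 0.2   # a discontinuous jump: the phase transition
```

At ê = 0.15 only the low-error feasible region exists; by ê = 0.18 the high-error shell near 1/2 has become feasible and the least upper bound jumps by roughly 0.3.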
Proof. Define the set F_m(ê) as

F_m(ê) ≡ { q : D(ê || q) ≤ (s_m(⌈q⌉) + ln(2m/δ))/m }.

This gives

B_m(ê) = lub F_m(ê).

Similarly, define F(ê, ε) and B(ê, ε) as

F(ê, ε) ≡ { q ∈ [0, 1] : D(ê || q) ≤ s(q) + ε },

B(ê, ε) ≡ lub F(ê, ε).

We first show that the continuity of B(ê) at the point ê implies the continuity of B(ê, ε) at the point ⟨ê, 0⟩. We note that there exists a continuous function f(ê, ε) with f(ê, 0) = ê and such that for any ε sufficiently near 0 we have

D(f(ê, ε) || q) = D(ê || q) − ε.

We then have

B(ê, ε) = B(f(ê, ε)).

Since f is continuous, and B(ê) is continuous at the point ê, we get that B(ê, ε) is continuous at the point ⟨ê, 0⟩.

We now prove the theorem. The functions of the form (s_m(⌈q⌉) + ln(2m/δ))/m converge uniformly to the function s(q). This implies that for any ε > 0 there exists a k such that for all m > k we have

F(ê, −ε) ⊆ F_m(ê) ⊆ F(ê, ε).

But this in turn implies that

B(ê, −ε) ≤ B_m(ê) ≤ B(ê, ε). (13)

The theorem now follows from the continuity of the function B(ê, ε) at the point ⟨ê, 0⟩. □

Theorem 4.1 can be interpreted as saying that for large sample sizes, and for values of ê other than the special phase transition values, the bound has a well defined value independent of the confidence parameter δ and determined only by a smooth function s(q). A similar statement can be made for the bound in Theorem 3.3: for large m, and at points other than phase transitions, the bound is independent of δ and is determined by a smooth limit curve.

For the asymptotic analysis of Theorem 3.3 we assume an infinite sequence H_1, H_2, H_3, ... of hypothesis classes and an infinite sequence S_1, S_2, S_3, ... of samples such that sample S_m has size m. Let Ĥ_m(k/m, δ) and ŝ_m(k/m, δ) be Ĥ(k/m, δ) and ŝ(k/m, δ) respectively defined relative to hypothesis class H_m and sample S_m. Let U_m(k/m) be the set of hypotheses in H_m having an empirical error of exactly k/m in the sample S_m. Let u_m(k/m) be ln(max(1, |U_m(k/m)|)). In the analysis of Theorem 3.3 we allow that the functions u_m(⌈q⌉)/m are only locally uniformly convergent to a continuous function ū(q), i.e., for any q ∈ [0, 1] and any ε
> 0 there exists an integer k and a real number γ > 0 satisfying

∀m > k, ∀p ∈ (q − γ, q + γ): |u_m(⌈p⌉)/m − ū(p)| ≤ ε.

Locally uniform convergence plays a role in the analysis in Section 6.
Theorem 4.2 If the functions u_m(⌈q⌉)/m converge locally uniformly to a continuous function ū(q) then, for any fixed value of δ, the functions ŝ_m(⌈q⌉, δ)/m also converge locally uniformly to ū(q). If the convergence of u_m(⌈q⌉)/m is uniform, then so is the convergence of ŝ_m(⌈q⌉, δ)/m.

Proof. Consider an arbitrary value q ∈ [0, 1] and ε > 0. We construct the desired k and γ. More specifically, select k sufficiently large and γ sufficiently small that we have the properties

∀m > k, ∀p ∈ (q − 2γ, q + 2γ): |u_m(⌈p⌉)/m − ū(p)| < ε/3,

∀p ∈ (q − 2γ, q + 2γ): |ū(p) − ū(q)| ≤ ε/3,

1/k + √(ln(16k²/δ)/(2k − 1)) < γ,

(ln k)/k ≤ ε/3.

Consider an m > k and a p ∈ (q − γ, q + γ). It now suffices to show that |ŝ_m(⌈p⌉, δ)/m − ū(p)| ≤ ε. Because U_m(⌈p⌉) is a subset of Ĥ_m(⌈p⌉, δ), we have

ŝ_m(⌈p⌉, δ)/m ≥ u_m(⌈p⌉)/m ≥ ū(p) − ε.

We can also upper bound ŝ_m(⌈p⌉, δ)/m as follows. By the third property, every hypothesis in Ĥ_m(⌈p⌉, δ) has empirical error k/m with |k/m − p| ≤ γ, so

ŝ_m(⌈p⌉, δ)/m ≤ (1/m) ln( 2 Σ_{k : |k/m − p| ≤ γ} e^(u_m(k/m)) )

≤ (1/m) ln( 2 Σ_{k : |k/m − p| ≤ γ} e^(m(ū(k/m) + ε/3)) )

≤ (1/m) ln( 2m e^(m(ū(p) + 2ε/3)) )

= ū(p) + 2ε/3 + ln(2m)/m

≤ ū(p) + ε.
A similar argument shows that if u_m(⌈q⌉)/m converges uniformly to ū(q) then so does ŝ_m(⌈q⌉, δ)/m. □

Given quantities ŝ_m(⌈q⌉, δ)/m that converge uniformly to ū(q), the remainder of the analysis is identical to that for the asymptotic analysis of (9). We define the upper bounds

B̂_m(ê) ≡ lub{ q : D(ê || q) ≤ (ŝ_m(⌈q⌉, δ) + ln(4m/δ))/m },

B̂(ê) ≡ lub{ q : D(ê || q) ≤ ū(q) }.

Again we say that ê is at a phase transition if the function B̂(ê) is discontinuous at the value ê. We then get the following, whose proof is identical to that of Theorem 4.1.

Theorem 4.3 If the bound B̂(ê) is continuous at the point ê (so we are not at a phase transition), and the functions u_m(⌈q⌉)/m converge uniformly to ū(q), then we have

lim_{m→∞} B̂_m(ê) = B̂(ê).

5 Asymptotic Optimality of (9)

Formula (9) can be viewed as providing an upper bound on e(h) as a function of ê(h) and the function s. In this section we show that for any curve s and value ê there exists a hypothesis class and data distribution such that the upper bound in (9) is realized up to asymptotic equality. Up to asymptotic equality, (9) is the tightest possible bound computable from ê(h) and the numbers s(1/m), ..., s(m/m).

The classical VC dimension bounds are nearly optimal over bounds computable from the chosen hypothesis error rate ê(h) and the class H. The numbers s(1/m), ..., s(m/m) depend on both H and the data distribution. Hence the bound in (9) uses information about the distribution, and hence can be tighter than classical VC bounds. A similar statement applies to the bound in Theorem 3.3 computed from the empirically observable numbers ŝ(1/m, δ), ..., ŝ(m/m, δ). In this case, the bound uses more information from the sample than just ê(h). The optimality theorem given here also differs from the traditional lower bound results for VC dimension in that here the lower bounds match the upper bounds up to asymptotic equality.

The departure point for our optimality analysis is the following lemma from Cover and Thomas (1991).

Lemma 5.1 (Cover and Thomas) If p̂ is the fraction of heads out of m tosses of a coin where the
true probability of heads is p, then for q ≥ p we have

Pr(p̂ ≥ q) ≥ e^(−m D(q||p)) / (m + 1).

This lower bound on Pr(p̂ ≥ q) is very close to Chernoff's 1952 upper bound (1). The tightness of (9) is a direct reflection of the tightness of (1). To exploit Lemma 5.1 we need to construct hypothesis classes and data distributions where distinct hypotheses have independent training errors. More specifically, we say that a set of hypotheses {h_1, ..., h_n} has independent training errors if the random variables ê(h_1), ..., ê(h_n) are independent.
By an argument similar to the derivation of (3) from (1), we can prove from Lemma 5.1 that

Pr( D(min(p̂, p) || p) ≥ (ln(1/δ) − ln(m + 1))/m ) ≥ δ. (14)

Lemma 5.2 Let X be any finite set, S a random variable, and Θ[S, x, δ] a formula such that for every x ∈ X and δ > 0 we have Pr(Θ[S, x, δ]) ≥ δ, and such that the events are independent:

Pr( ∀x ∈ X: ¬Θ[S, x, δ] ) = ∏_{x∈X} Pr( ¬Θ[S, x, δ] ).

We then have

∀δ > 0 ∀^δ S ∃x ∈ X: Θ[S, x, ln(1/δ)/|X|].

Proof.

Pr( Θ[S, x, ln(1/δ)/|X|] ) ≥ ln(1/δ)/|X|

⟹ Pr( ¬Θ[S, x, ln(1/δ)/|X|] ) ≤ 1 − ln(1/δ)/|X| ≤ e^(−ln(1/δ)/|X|)

⟹ Pr( ∀x ∈ X: ¬Θ[S, x, ln(1/δ)/|X|] ) ≤ e^(−ln(1/δ)) = δ. □

Now define h*(k/m) to be the hypothesis of minimal training error in the set H(k/m). Let glb{x : Φ[x]} denote the greatest lower bound (the minimum) of the set {x : Φ[x]}. We now have the following lemma.

Lemma 5.3 If the hypotheses in the class H(⌈q⌉) are independent then

∀δ > 0 ∀^δ S ∀q ∈ {1/m, ..., m/m}: ê(h*(q)) ≤ glb{ ê : D(min(ê, q − 1/m) || q) ≤ (s(q) − ln(m + 1) − ln(ln(m/δ)))/m }.

Proof. To prove Lemma 5.3, let q be a fixed rational number of the form k/m. Assuming independent hypotheses, we can apply Lemma 5.2 to (14) to get

∀δ > 0 ∀^δ S ∃h ∈ H(k/m): D(min(ê(h), e(h)) || e(h)) ≥ (s(q) − ln(m + 1) − ln(ln(1/δ)))/m.

Let w be the hypothesis in H(q) satisfying this formula. We now have ê(h*(q)) ≤ ê(w) and q − 1/m ≤ e(w) ≤ q. These two conditions imply

∀δ > 0 ∀^δ S: D(min(ê(h*(q)), q − 1/m) || q) ≥ (s(q) − ln(m + 1) − ln(ln(1/δ)))/m.

This implies that

ê(h*(q)) ≤ glb{ ê : D(min(ê, q − 1/m) || q) ≤ (s(q) − ln(m + 1) − ln(ln(1/δ)))/m }.

Lemma 5.3 now follows by quantification over q ∈ {1/m, ..., m/m}. □
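Lemma 5.1 and the Chernoff upper bound (1) sandwich the binomial tail to within the polynomial factor m + 1, which is the engine behind the matching lower bound. A quick numerical check, with illustrative values m = 100, p = 0.2, q = 0.35 (chosen here, not taken from the paper):

```python
import math

def kl(q, p):
    """Bernoulli KL divergence D(q||p) for 0 < p, q < 1."""
    return q * math.log(q / p) + (1 - q) * math.log((1 - q) / (1 - p))

m, p, q = 100, 0.2, 0.35

# Exact binomial tail Pr(p_hat >= q), computed directly from the definition.
k0 = round(q * m)
tail = sum(math.comb(m, k) * p**k * (1 - p)**(m - k)
           for k in range(k0, m + 1))

upper = math.exp(-m * kl(q, p))   # Chernoff upper bound (1)
lower = upper / (m + 1)           # Cover-Thomas style lower bound (Lemma 5.1)

print(lower, tail, upper)
assert lower <= tail <= upper
```

Since the two sides differ only by the factor m + 1, whose logarithm is o(m), the upper bound (9) built from (1) cannot be improved by more than vanishing per-example terms.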
For q ∈ [0, 1] we have that Lemma 3.1 implies that

ê(h*(⌈q⌉)) ≥ glb{ ê : D(ê || ⌈q⌉ − 1/m) ≤ (s(⌈q⌉) + ln(2m/δ))/m }.

We now have upper and lower bounds on the quantity ê(h*(⌈q⌉)) which agree up to asymptotic equality: in a large m limit where s_m(⌈q⌉)/m converges (pointwise) to a continuous function s(q), we have that the upper and lower bounds on ê(h*(⌈q⌉)) both converge (pointwise) to

ê(h*(q)) = glb{ ê : D(ê || q) ≤ s(q) }.

This asymptotic value of ê(h*(q)) is a continuous function of q. Since q is held fixed in calculating the bounds on ê(h*(⌈q⌉)), phase transitions are not an issue and uniform convergence of the functions s_m(⌈q⌉)/m is not required. Note that for large m and independent hypotheses we get that ê(h*(q)) is determined as a function of the true error rate q and s(q).

The following theorem states that any limit function s(p) is consistent with the possibility that hypotheses are independent. This, together with Lemma 5.3, implies that no uniform bound on e(h) as a function of ê(h) and |H(1/m)|, ..., |H(m/m)| can be asymptotically tighter than (9).

Theorem 5.4 Let s(p) be any continuous function of p ∈ [0, 1]. There exists an infinite sequence of hypothesis spaces H_1, H_2, H_3, ..., and a sequence of data distributions D_1, D_2, D_3, ..., such that each class H_m has independent hypotheses for data distribution D_m and such that s_m(⌈p⌉)/m converges (pointwise) to s(p).

Proof. First we show that if |H_m(i/m)| = e^(m s(i/m)) then the functions s_m(⌈p⌉)/m converge (pointwise) to s(p). Assume |H_m(i/m)| = e^(m s(i/m)). In this case we have s_m(⌈p⌉)/m = s(⌈p⌉). Since s(p) is continuous, for any fixed value of p we get that s_m(⌈p⌉)/m converges to s(p).

Recall that D_m is a probability distribution on pairs ⟨x, y⟩ with y ∈ {0, 1} and x ∈ X for some set X. We take H_m to be a disjoint union of sets H_m(k/m), where H_m(k/m) is selected as above. Let f_1, ..., f_N be the elements of H_m with N = |H_m|. Let X be the set of all N-bit strings and define f_i(x) to be the value of the ith bit of the bit vector x. Now define the distribution D_m on pairs ⟨x, y⟩ by selecting y to be 1 with
probability 1/2 and then selecting each bit of x independently, where the ith bit is selected to disagree with y with probability k/m, where k is such that f_i ∈ H_m(k/m). □

6 Relating ŝ and s

In this section we show that in large m limits of the type discussed in Section 4, the histogram of empirical errors need not converge to the histogram of true errors. So even in the large m asymptotic limit, the bound given by Theorem 3.3 is significantly weaker than the bound given by (9). To show that ŝ(⌈q⌉, δ)/m can be asymptotically different from s(⌈q⌉)/m, we consider the case of independent hypotheses. More specifically, given a continuous function s(p) we construct an infinite
sequence of hypothesis spaces H_1, H_2, H_3, …, and an infinite sequence of data distributions D_1, D_2, D_3, …, using the construction in the proof of Theorem 5.4. We note that if s(p) is differentiable with bounded derivative then the functions s_m(p) converge uniformly to s(p). For a given infinite sequence of data distributions we generate an infinite sample sequence S_1, S_2, S_3, …, by selecting S_m to consist of m pairs ⟨x, y⟩ drawn IID from distribution D_m. For a given sample sequence and h ∈ H_m we define ê_m(h) and ŝ_m(k/m, δ) in a manner similar to ê(h) and ŝ(k/m, δ) but for sample S_m. The main result of this section is the following.

Conjecture 6.1 If each H_m has independent hypotheses under data distribution D_m, and the functions s_m(p) converge uniformly to a continuous function s(p), then for any δ > 0 and p ∈ [0,1] we have, with probability 1 over the generation of the sample sequence, that

    lim_{m→∞} ŝ_m(p, δ) = sup_{q∈[0,1]} [ s(q) − D(p‖q) ].

We call this a conjecture rather than a theorem because the proof has not been worked out to a high level of rigor. Nonetheless, we believe the proof sketch given below can be expanded to a fully rigorous argument. Before giving the proof sketch we note that the limiting value of ŝ_m(p, δ) is independent of δ. This is consistent with Theorem 4.2. Define

    ŝ(p) ≡ sup_{q∈[0,1]} [ s(q) − D(p‖q) ].

Note that ŝ(p) ≥ s(p) (take q = p). This gives an asymptotic version of Lemma 3.2. But since D(p‖q) can be locally approximated as c(p − q)^2 (up to its second-order Taylor expansion), if s(p) is increasing at the point p then we also get that ŝ(p) is strictly larger than s(p).

Proof Outline: To prove Statement 6.1 we first define H_m(p, q), for p, q ∈ {1/m, …, m/m}, to be the set of all h ∈ H(q) such that ê_m(h) = p. Intuitively, H_m(p, q) is the set of concepts with true error rate near q that have empirical error rate p. Ignoring factors that are only polynomial in m, the probability of a hypothesis with true error rate q having empirical error rate p can be written as (approximately) e^{−mD(p‖q)}. So the
expected size of H_m(p, q) can be written as |H(q)| e^{−mD(p‖q)}, or alternatively, (approximately) as e^{m s(q)} e^{−mD(p‖q)}, or e^{m(s(q) − D(p‖q))}. More formally, we have, for any fixed value of p and q,

    lim_{m→∞} (1/m) ln(max(1, E[|H_m(p, q)|])) = max(0, s(q) − D(p‖q)).

We now show that the expectation can be eliminated from the above limit. First, consider distinct values of p and q such that s(q) − D(p‖q) > 0. Since p and q are distinct, the probability that a fixed hypothesis in H(q) is in H_m(p, q) declines exponentially in m. Since s(q) − D(p‖q) > 0, the expected size of H_m(p, q) grows exponentially in m. Since the hypotheses are independent, the distribution of possible values of |H_m(p, q)| becomes essentially a Poisson mass distribution with an expected number of arrivals growing exponentially in m. The probability that |H_m(p, q)| deviates from its expectation by as much as a factor of 2 declines exponentially in m. We say that a sample sequence is safe after k if for all m > k we
have that |H_m(p, q)| is within a factor of 2 of its expectation. Since the probability of being unsafe at m declines exponentially in m, for any δ there exists a k such that with probability at least 1 − δ the sample sequence is safe after k. So for any δ > 0 we have that with probability at least 1 − δ the sequence is safe after some k. But since this holds for all δ > 0, with probability 1 such a k must exist:

    lim_{m→∞} (1/m) ln(max(1, |H_m(p, q)|)) = s(q) − D(p‖q).

We now define

    s_m(p, q) ≡ (1/m) ln(max(1, |H_m(p, q)|)).

It is also possible to show that for p = q we have, with probability 1, that s_m(p, q) approaches s(p), and that for distinct p and q with s(q) − D(p‖q) ≤ 0 we have that s_m(p, q) approaches 0. Putting these together yields that, with probability 1, we have

    lim_{m→∞} s_m(p, q) = max(0, s(q) − D(p‖q)).    (15)

Define U_m(k/m) and u_m(k/m) as in Section 4. We now have the following equality:

    U_m(p) = ∪_{q ∈ {1/m, …, m/m}} H_m(p, q).

We now show that, with probability 1, u_m(p) approaches ŝ(p). First, consider a p ∈ [0,1] such that ŝ(p) > 0. Since s(q) − D(p‖q) is a continuous function of q, and [0,1] is a compact set, sup_{q∈[0,1]} [s(q) − D(p‖q)] must be realized at some value in [0,1]. Let q* be such that s(q*) − D(p‖q*) equals ŝ(p). We have that u_m(p) ≥ s_m(p, q*). This, together with (15), implies that

    lim inf_{m→∞} u_m(p) ≥ ŝ(p).

The sample sequence is safe at m and k if |H_m(p, k/m)| does not exceed twice the expectation of |H_m(p, q*)|. Assuming uniform convergence of s_m(p), the probability of not being safe at m and k declines exponentially in m at a rate at least as fast as the rate of decline of the probability of not being safe at m and q*. By the union bound this implies that for a given m the probability that there exists an unsafe k also declines exponentially. We say that the sequence is safe after N if it is safe for all m and k with m > N. The probability of not being safe after N also declines exponentially with N. By an argument similar to that given above, this implies that with probability 1 over the choice of the sequence
there exists an N such that the sequence is safe after N. But if we are safe at m then |U_m(p)| ≤ 2m E[|H_m(p, q*)|]. This implies that

    lim sup_{m→∞} u_m(p) ≤ ŝ(p).

Putting the two bounds together we get

    lim_{m→∞} u_m(p) = ŝ(p).
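The limiting quantity ŝ(p) = sup_{q∈[0,1]} [s(q) − D(p‖q)] is easy to evaluate numerically. The following sketch (not code from the paper; the limit function s below is a hypothetical stand-in) approximates the supremum on a grid and illustrates both ŝ(p) ≥ s(p) and the strict inequality at points where s is increasing:

```python
import math

def kl(p, q):
    """Binary relative entropy D(p||q) in nats."""
    eps = 1e-12
    q = min(max(q, eps), 1 - eps)   # avoid log(0) at the grid endpoints
    total = 0.0
    if p > 0:
        total += p * math.log(p / q)
    if p < 1:
        total += (1 - p) * math.log((1 - p) / (1 - q))
    return total

def s_hat(p, s, grid=10000):
    """Approximate s_hat(p) = sup_q [s(q) - D(p||q)] over a uniform q-grid."""
    return max(s(i / grid) - kl(p, i / grid) for i in range(grid + 1))

# Hypothetical increasing limit function s(q).
s = lambda q: 0.5 * q
p = 0.3
assert s_hat(p, s) > s(p)   # strictly larger, since s is increasing at p
```

Because D(p‖q) is flat to second order at q = p while s is increasing there, the maximizing q lies slightly above p, which is exactly the mechanism by which the histogram of empirical errors overstates s.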
This argument establishes (to some level of rigor) pointwise convergence of u_m(p) to ŝ(p). It is also possible to establish a convergence rate that is a continuous function of p. This implies that the convergence of u_m(p) can be made locally uniform. Theorem 4.2 then implies the desired result.

7 Improvements

Theorem 3.3 has been improved in various ways (Langford, 2002): removing the discretization of true errors, using one-sided bounds, using nonuniform union bounds over discrete values of the form k/m, tightening the Chernoff bound using direct calculation of binomial coefficients, and improving Lemma 3.4. These improvements allow the removal of all but one of the ln(m) terms from the statement of the bound. However, they do not improve the asymptotic equations given by Theorem 4.1 and Statement 6.1.

A practical difficulty with the bound in Theorem 3.3 is that it is usually impossible to enumerate the elements of an exponentially large hypothesis class, and hence impractical to compute the histogram of training errors for the hypotheses in the class. In practice the values of s(k/m) might be estimated using some form of Markov chain Monte Carlo sampling over the hypotheses. For certain hypothesis spaces it might also be possible to directly calculate the empirical error distribution without evaluating every hypothesis. For example, this can be done with partition rules which, given a fixed partition of the input space, make predictions that are constant on each partition element. If there are n elements in the partition then there are 2^n partition rules. For a fixed partition, the histogram of empirical errors for the 2^n partition rules can be computed in polynomial time. Note that the class of decision trees is a union of partition rules, where the structure of a tree defines a partition and the labels at the leaves of the tree define a particular partition rule relative to that partition. Taking advantage of this, it is surprisingly easy to compute a shell bound for small decision trees (Langford,
2002).

8 Discussion and Conclusion

Traditional PAC bounds are stated in terms of the training error and class size or VC dimension. The computable bound given here is sometimes much tighter because it exploits the additional information in the histogram of training errors. The uncomputable bound uses the additional (unavailable) information in the distribution of true errors. Any distribution of true errors can be realized in a case with independent hypotheses. We have shown that in such cases this uncomputable bound is asymptotically equal to the actual generalization error. Hence this is the tightest possible bound, up to asymptotic equality, over all bounds expressed as functions of ê(h) and the distribution of true errors. We have also shown that the use of the histogram of empirical errors results in a bound that, while still tighter than traditional bounds, is looser than the uncomputable bound even in the large-sample asymptotic limit.
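The partition-rule computation described in Section 7 can be sketched concretely: the histogram of empirical errors over all 2^n rules factors across partition elements, since each element contributes a two-point error distribution, and the full histogram is their n-fold convolution, which takes polynomial rather than exponential time. A minimal sketch (the per-element counts are hypothetical; this is not code from the paper):

```python
from collections import Counter

def partition_rule_histogram(cells):
    """Histogram of training errors over all 2^n partition rules.

    cells -- list of (n0, n1) pairs giving, for each partition element,
    the number of training examples with label 0 and with label 1 in it.
    Predicting 1 on an element costs n0 errors; predicting 0 costs n1.
    The histogram over all 2^n rules is the convolution of the n
    per-element two-point histograms, so no enumeration is needed.
    """
    hist = Counter({0: 1})  # start from a single empty rule with zero errors
    for n0, n1 in cells:
        step = Counter()
        for errors, count in hist.items():
            step[errors + n1] += count  # this element predicts 0
            step[errors + n0] += count  # this element predicts 1
        hist = step
    return hist

cells = [(2, 1), (0, 3), (1, 1)]            # hypothetical per-element counts
hist = partition_rule_histogram(cells)
assert sum(hist.values()) == 2 ** len(cells)  # all 2^3 = 8 rules accounted for
```

Each convolution step touches at most m + 1 distinct error counts for a sample of size m, so the whole computation is O(nm), in line with the polynomial-time claim above.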
One of the goals of learning theory is to give generalization guarantees that are predictive of actual generalization error. It is well known that the actual generalization error can exhibit phase transitions: as the sample size increases, the expected generalization error can jump essentially discontinuously. So accurate true error bounds should also exhibit phase transitions. Shell bounds exhibit these phase transitions while other bounds, such as VC dimension results, do not. The phase transitions can also be interpreted as a statement about the bound as a function of the confidence parameter δ: as the value of δ is varied, the bound may shift essentially discontinuously. To put this another way, let h be the hypothesis of minimal training error on a large sample. Near a phase transition in true generalization error (as opposed to a phase transition in the bound) we may have that with probability 1 − δ the true error of h is near its training error, but with probability δ/2, say, the true error of h is far from its training error. More traditional bounds do not exhibit this kind of sensitivity to δ. Bounds that exhibit phase transitions seem to bring the theoretical analysis of generalization closer to the actual phenomenon.

Acknowledgments

Yoav Freund, Avrim Blum, and Tobias Scheffer all provided useful discussion in forming this paper.

References

P. Bartlett, O. Bousquet, and S. Mendelson. Localized Rademacher complexities. Proceedings of the 15th Annual Conference on Computational Learning Theory, pp. 44-58, 2002.

H. Chernoff. A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations. Annals of Mathematical Statistics, 23, 1952.

T. M. Cover and J. A. Thomas. Elements of Information Theory. Wiley, 1991.

Y. Freund. Self bounding algorithms. Computational Learning Theory (COLT), 1998.

D. Haussler, M. Kearns, H. S. Seung, and N. Tishby. Rigorous learning curve bounds from statistical mechanics. Machine Learning, 25, 1996.

J. Langford. Practical
prediction theory for classification. ICML 2003 tutorial, available at jl/projects/prediction bounds/tutorial/tutorial.ps.

J. Langford. Quantitatively tight sample complexity bounds. PhD thesis, Carnegie Mellon University, 2002.

J. Langford and A. Blum. Microchoice and self-bounding algorithms. Computational Learning Theory (COLT), 1999.

Y. Mansour and D. McAllester. Generalization bounds for decision trees. Computational Learning Theory (COLT), 2000.

A. Moore. VC dimension for characterizing classifiers. Tutorial at 2.cs.cmu.edu/~awm/tutorials/vcdim08.pdf.
D. McAllester. PAC-Bayesian model averaging. Computational Learning Theory (COLT), 1999.

D. McAllester and R. Schapire. On the convergence rate of Good-Turing estimators. Computational Learning Theory (COLT), 2000.

T. Scheffer and T. Joachims. Expected error analysis for model selection. International Conference on Machine Learning (ICML), 1999.

S. van de Geer. Empirical Processes in M-Estimation. Cambridge University Press.
More informationSupport Vector Machines. Maximizing the Margin
Support Vector Machines Support vector achines (SVMs) learn a hypothesis: h(x) = b + Σ i= y i α i k(x, x i ) (x, y ),..., (x, y ) are the training exs., y i {, } b is the bias weight. α,..., α are the
More informationNote on generating all subsets of a finite set with disjoint unions
Note on generating all subsets of a finite set with disjoint unions David Ellis e-ail: dce27@ca.ac.uk Subitted: Dec 2, 2008; Accepted: May 12, 2009; Published: May 20, 2009 Matheatics Subject Classification:
More informationCurious Bounds for Floor Function Sums
1 47 6 11 Journal of Integer Sequences, Vol. 1 (018), Article 18.1.8 Curious Bounds for Floor Function Sus Thotsaporn Thanatipanonda and Elaine Wong 1 Science Division Mahidol University International
More informationa a a a a a a m a b a b
Algebra / Trig Final Exa Study Guide (Fall Seester) Moncada/Dunphy Inforation About the Final Exa The final exa is cuulative, covering Appendix A (A.1-A.5) and Chapter 1. All probles will be ultiple choice
More informationSupport Vector Machine Classification of Uncertain and Imbalanced data using Robust Optimization
Recent Researches in Coputer Science Support Vector Machine Classification of Uncertain and Ibalanced data using Robust Optiization RAGHAV PAT, THEODORE B. TRAFALIS, KASH BARKER School of Industrial Engineering
More informationChaotic Coupled Map Lattices
Chaotic Coupled Map Lattices Author: Dustin Keys Advisors: Dr. Robert Indik, Dr. Kevin Lin 1 Introduction When a syste of chaotic aps is coupled in a way that allows the to share inforation about each
More informationUpper bound on false alarm rate for landmine detection and classification using syntactic pattern recognition
Upper bound on false alar rate for landine detection and classification using syntactic pattern recognition Ahed O. Nasif, Brian L. Mark, Kenneth J. Hintz, and Nathalia Peixoto Dept. of Electrical and
More informationTight Bounds for Maximal Identifiability of Failure Nodes in Boolean Network Tomography
Tight Bounds for axial Identifiability of Failure Nodes in Boolean Network Toography Nicola Galesi Sapienza Università di Roa nicola.galesi@uniroa1.it Fariba Ranjbar Sapienza Università di Roa fariba.ranjbar@uniroa1.it
More informationA Theoretical Framework for Deep Transfer Learning
A Theoretical Fraewor for Deep Transfer Learning Toer Galanti The School of Coputer Science Tel Aviv University toer22g@gail.co Lior Wolf The School of Coputer Science Tel Aviv University wolf@cs.tau.ac.il
More informationVC Dimension and Sauer s Lemma
CMSC 35900 (Spring 2008) Learning Theory Lecture: VC Diension and Sauer s Lea Instructors: Sha Kakade and Abuj Tewari Radeacher Averages and Growth Function Theore Let F be a class of ±-valued functions
More informationSymbolic Analysis as Universal Tool for Deriving Properties of Non-linear Algorithms Case study of EM Algorithm
Acta Polytechnica Hungarica Vol., No., 04 Sybolic Analysis as Universal Tool for Deriving Properties of Non-linear Algoriths Case study of EM Algorith Vladiir Mladenović, Miroslav Lutovac, Dana Porrat
More informationTEST OF HOMOGENEITY OF PARALLEL SAMPLES FROM LOGNORMAL POPULATIONS WITH UNEQUAL VARIANCES
TEST OF HOMOGENEITY OF PARALLEL SAMPLES FROM LOGNORMAL POPULATIONS WITH UNEQUAL VARIANCES S. E. Ahed, R. J. Tokins and A. I. Volodin Departent of Matheatics and Statistics University of Regina Regina,
More informationMulti-Dimensional Hegselmann-Krause Dynamics
Multi-Diensional Hegselann-Krause Dynaics A. Nedić Industrial and Enterprise Systes Engineering Dept. University of Illinois Urbana, IL 680 angelia@illinois.edu B. Touri Coordinated Science Laboratory
More informationIterative Decoding of LDPC Codes over the q-ary Partial Erasure Channel
1 Iterative Decoding of LDPC Codes over the q-ary Partial Erasure Channel Rai Cohen, Graduate Student eber, IEEE, and Yuval Cassuto, Senior eber, IEEE arxiv:1510.05311v2 [cs.it] 24 ay 2016 Abstract In
More informationTesting equality of variances for multiple univariate normal populations
University of Wollongong Research Online Centre for Statistical & Survey Methodology Working Paper Series Faculty of Engineering and Inforation Sciences 0 esting equality of variances for ultiple univariate
More informationBayesian Learning. Chapter 6: Bayesian Learning. Bayes Theorem. Roles for Bayesian Methods. CS 536: Machine Learning Littman (Wu, TA)
Bayesian Learning Chapter 6: Bayesian Learning CS 536: Machine Learning Littan (Wu, TA) [Read Ch. 6, except 6.3] [Suggested exercises: 6.1, 6.2, 6.6] Bayes Theore MAP, ML hypotheses MAP learners Miniu
More informationQuantum algorithms (CO 781, Winter 2008) Prof. Andrew Childs, University of Waterloo LECTURE 15: Unstructured search and spatial search
Quantu algoriths (CO 781, Winter 2008) Prof Andrew Childs, University of Waterloo LECTURE 15: Unstructured search and spatial search ow we begin to discuss applications of quantu walks to search algoriths
More informationMachine Learning Basics: Estimators, Bias and Variance
Machine Learning Basics: Estiators, Bias and Variance Sargur N. srihari@cedar.buffalo.edu This is part of lecture slides on Deep Learning: http://www.cedar.buffalo.edu/~srihari/cse676 1 Topics in Basics
More informationAlgorithmic Stability and Sanity-Check Bounds for Leave-One-Out Cross-Validation
Algorithic Stability and Sanity-Check Bounds for Leave-One-Out Cross-Validation Michael Kearns AT&T Labs Research Murray Hill, New Jersey kearns@research.att.co Dana Ron MIT Cabridge, MA danar@theory.lcs.it.edu
More information