Computable Shell Decomposition Bounds
Journal of Machine Learning Research 5 (2004). Submitted 1/03; Revised 8/03; Published 5/04.

Computable Shell Decomposition Bounds

John Langford (JL@TTI-C.ORG)
David McAllester (MCALLESTER@TTI-C.ORG)
Toyota Technology Institute at Chicago
1427 East 60th Street
Chicago, IL 60637, USA

Editor: Manfred Warmuth

Abstract

Haussler, Kearns, Seung and Tishby introduced the notion of a shell decomposition of the union bound as a means of understanding certain empirical phenomena in learning curves, such as phase transitions. Here we use a variant of their ideas to derive an upper bound on the generalization error of a hypothesis computable from its training error and the histogram of training errors for the hypotheses in the class. In most cases this new bound is significantly tighter than traditional bounds computed from the training error and the cardinality of the class. Our results can also be viewed as providing a rigorous foundation for a model selection algorithm proposed by Scheffer and Joachims.

Keywords: sample complexity, classification, true error bounds, shell bounds

1 Introduction

For an arbitrary finite hypothesis class we consider the hypothesis of minimal training error. We give a new upper bound on the generalization error of this hypothesis, computable from the training error of the hypothesis and the histogram of the training errors of the other hypotheses in the class. This new bound is typically much tighter than more traditional upper bounds computed from the training error and cardinality of the class. As a simple example, suppose that we observe that all but one empirical error in a hypothesis space is 1/2 and one empirical error is 0. Furthermore, suppose that the sample size is large enough (relative to the size of the hypothesis class) that with high confidence we have that, for all hypotheses in the class, the true (generalization) error of a hypothesis is within 1/5 of its training error. This implies that, with high confidence, hypotheses with training error near 1/2 have true error in [3/10,
7/10]. Intuitively, we would expect the true error of the hypothesis with minimum empirical error to be very near to 0, rather than simply less than 1/5, because none of the hypotheses which produced an empirical error of 1/2 could have a true error close enough to 0 that there exists a significant probability of producing 0 empirical error. The bound presented here validates this intuition. We show that you can ignore hypotheses with training error near 1/2 in calculating an effective size of the class for hypotheses with training error near 0. This new effective class size allows us to calculate a tighter bound on the difference between training error and true error for hypotheses with training error near 0. The new bound is proved using a distribution-dependent application of the union bound, similar in spirit to the shell decomposition introduced by Haussler et al. (1996).

© 2004 John Langford and David McAllester.
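The intuition in this example can be checked with a short calculation. The sketch below uses hypothetical numbers chosen only for illustration (a class of 1000 hypotheses, a sample of size m = 100, and a true-error threshold of 0.3; none of these come from the paper): a hypothesis with true error at least 0.3 produces zero training errors only with exponentially small probability, and a union bound over the whole class rules such hypotheses out.

```python
import math

# Hypothetical numbers for illustration (not from the paper).
m = 100            # sample size
class_size = 1000  # |H|

# A hypothesis with true error >= 0.3 makes zero mistakes on m IID
# examples with probability at most (1 - 0.3)**m.
p_perfect = 0.7 ** m

# Union bound: the probability that ANY hypothesis with true error >= 0.3
# achieves empirical error 0 on the sample.
p_any_perfect = class_size * p_perfect

print(p_any_perfect)   # vanishingly small
assert p_any_perfect < 1e-12
```

So with overwhelming confidence the hypothesis observed to have empirical error 0 is not one of the near-1/2 hypotheses, and its true error must be near 0.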
We actually give two upper bounds on generalization error: an uncomputable bound and a computable bound. The uncomputable bound is a function of the unknown distribution of true error rates of the hypotheses in the class. The computable bound is, essentially, the uncomputable bound with the unknown distribution of true errors replaced by the known histogram of training errors. Our main contribution is that this replacement is sound, i.e., the computable version remains, with high confidence, an upper bound on generalization error.

When considering asymptotic properties of learning theory bounds it is important to take limits in which the cardinality (or VC dimension) of the hypothesis class is allowed to grow with the size of the sample. In practice, more data typically justifies a larger hypothesis class. For example, the size of a decision tree is generally proportional to the amount of training data available. Here we analyze the asymptotic properties of our bounds by considering an infinite sequence of hypothesis classes H_m, one for each sample size m, such that (ln |H_m|)/m approaches a limit larger than zero. This kind of asymptotic analysis provides a clear account of the improvement achieved by bounds that are functions of error rate distributions rather than simply the size (or VC dimension) of the class. We give a lower bound on generalization error showing that the uncomputable upper bound is asymptotically as tight as possible: any upper bound on generalization error given as a function of the unknown distribution of true error rates must asymptotically be greater than or equal to our uncomputable upper bound. Our lower bound on generalization error also shows that there is essentially no loss in working with an upper bound computed from the true error distribution rather than expectations computed from this distribution as used by Scheffer and Joachims (1999). Asymptotically, the computable bound is simply the uncomputable bound with the unknown distribution of true errors replaced with the
observed histogram of training errors. Unfortunately, we can show that in limits where (ln |H_m|)/m converges to a value greater than zero, the histogram of training errors need not converge to the distribution of true errors: the histogram of training errors is a smeared-out version of the distribution of true errors. This smearing loosens the bound even in the large-sample asymptotic limit. We give a precise asymptotic characterization of this smearing effect for the case where distinct hypotheses have independent training errors. In spite of the divergence between these bounds, the computable bound is still significantly tighter than classical bounds not involving error distributions.

The computable bound can be used for model selection. In the case of model selection we can assume an infinite sequence of finite model classes H_0, H_1, ..., where each H_j is a finite class with ln |H_j| growing linearly in j. To perform model selection we find the hypothesis of minimal training error in each class and use the computable bound to bound its generalization error. We can then select, among these, the model with the smallest upper bound on generalization error. Scheffer and Joachims propose (without formal justification) replacing the distribution of true errors with the histogram of training errors. Under this replacement, the model selection algorithm based on our computable upper bound is asymptotically identical to the algorithm proposed by Scheffer and Joachims.

The shell decomposition is a distribution-dependent use of the union bound. Distribution-dependent uses of the union bound have been previously exploited in so-called self-bounding algorithms. Freund (1998) defines, for a given learning algorithm and data distribution, a set S of hypotheses such that, with high probability over the sample, the algorithm always returns a hypothesis in that set. Although S is defined in terms of the unknown data distribution, Freund gives a way of computing a set S′ from the given algorithm and the sample such that, with high confidence, S′ contains S
and hence the effective size of the hypothesis class is bounded by |S′|. Langford and
Blum (1999) give a more practical version of this algorithm. Given an algorithm and data distribution they conceptually define a weighting over the possible executions of the algorithm. Although the data distribution is unknown, they give a way of computing a lower bound on the weight of the particular execution of the algorithm generated by the sample at hand. In this paper we consider distribution-dependent union bounds defined independently of any particular learning algorithm.

The bounds given in this paper apply to finite concept classes. Of course more sophisticated measures of the complexity of a concept class, such as VC dimension or Rademacher complexity, are possible and can sometimes result in tighter bounds. However, insight into finite classes remains useful in at least two ways. Finite class analysis is useful as a pedagogical tool, teaching about directions in which to look for the removal of slack from these more sophisticated bounds. Indeed, various localized Rademacher complexity results (Bartlett et al., 2002) and the peeling technique (van de Geer, 1999) appear to (roughly) correspond to the orthogonal combination of shell bounds and earlier Rademacher complexity results. One advantage of the shell bounds is the KL-divergence form of the bounds, which smoothly interpolates between the linear bounds of the realizable case and the quadratic bounds of the unrealizable case. This realizable-unrealizable interpolation is orthogonal to the shell principle that concepts with large empirical error are unlikely to be confused with concepts with low error rate. The shell bound also supports intuitions that are difficult to achieve in more complex settings. For example, the simple shell bounds clearly exhibit phase transitions in the learning bound, something which does not appear to be well-elucidated for localized Rademacher bounds. In summary, the simplicity of finite classes (and a shell bound analysis on a finite class) provides a clarity that is difficult to
achieve with more complex structure-exploiting bounds.

Finite class analysis is also useful in a more practical sense. In practice a finite VC dimension class usually has a finite parameterization. Given that these real parameters are typically represented as 32 bit floating point numbers, the class becomes finite and the log of the class size is linear in the number of parameters. Since many of the more sophisticated infinite-class techniques are loose by large multiplicative constants, a finite class analysis applied to a VC class discretized to a small number of bits can actually yield tighter bounds, as shown in Figure 1.

2 Mathematical Preliminaries

For an arbitrary measure on an arbitrary sample space we use the notation ∀^δ S: Φ[S, δ] to mean that with probability at least 1 − δ over the choice of the sample S we have that Φ[S, δ] holds.¹ In practice S is the training sample of a learning algorithm. Note that ∀x ∀^δ S: Φ[x, S, δ] does not imply ∀^δ S ∀x: Φ[x, S, δ]. If X is a finite set, and for all x ∈ X we have the assertion ∀δ > 0 ∀^δ S: Φ[S, x, δ], then by a standard application of the union bound we have the assertion ∀δ > 0 ∀^δ S ∀x ∈ X: Φ[S, x, δ/|X|]. We call this the quantification rule. If ∀δ > 0 ∀^δ S: Φ[S, δ] and ∀δ > 0 ∀^δ S: Ψ[S, δ], then by a standard application of the union bound we have ∀δ > 0 ∀^δ S: Φ[S, δ/2] ∧ Ψ[S, δ/2]. We call this the conjunction rule.

The KL-divergence of p from q, denoted D(q || p), is q ln(q/p) + (1 − q) ln((1 − q)/(1 − p)), with 0 ln(0/p) = 0 and q ln(q/0) = ∞. Let p̂ be the fraction of heads in a sequence S of m tosses of a biased coin where

1. This can be read as "for all but δ sets S, the predicate Φ[S, δ] holds" or "with probability 1 − δ over the draw of S, the predicate Φ[S, δ] holds."
[Figure: true error bound as a function of training error, with curves for the VC bound and the Occam's Razor Bound (ORB) at 32, 16, and 8 bits.]

Figure 1: A graph comparing the (infinite hypothesis) VC bound to the finite hypothesis Occam's Razor Bound. For all curves we use VC dimension d = 10, bound failure probability δ = 0.1, and m = 1000 examples. For the VC bound calculation (see Moore, 2004, for details) the formula is: true error ≤ train error + √((d ln(2m/d) + ln(4/δ))/m). For the Occam's Razor Bound calculation (see Langford, 2003, for details), we use a uniform distribution over the 2^(kd) discrete classifiers which might be representable when we discretize d parameters to k = 8, 16, 32 bits per dimension. The basic formula is: KL(train error || true error) ≤ (kd ln 2 + ln(1/δ))/m. This graph is approximately the same for any similar ratio of d/m, with smaller values favoring the Occam's Razor Bound.
the probability of heads is p. We have the following inequality, given by Chernoff (1952):

∀q ∈ [p, 1]: Pr(p̂ ≥ q) ≤ e^(−m D(q||p)). (1)

This bound can be rewritten as follows:

∀δ > 0 ∀^δ S: D(max(p̂, p) || p) ≤ ln(1/δ)/m. (2)

To derive (2) from (1), note that Pr(D(max(p̂, p) || p) ≥ ln(1/δ)/m) equals Pr(p̂ ≥ q), where q ≥ p and D(q||p) = ln(1/δ)/m. By (1) we then have that this probability is no larger than e^(−m D(q||p)) = δ. It is just as easy to derive (1) from (2), so the two statements are equivalent. By duality, i.e., by considering the problem defined by replacing p by 1 − p, we get

∀δ > 0 ∀^δ S: D(min(p̂, p) || p) ≤ ln(1/δ)/m. (3)

Conjoining (2) and (3) yields the following corollary of (1):

∀δ > 0 ∀^δ S: D(p̂ || p) ≤ ln(2/δ)/m. (4)

Using the inequality D(q||p) ≥ 2(q − p)², one can show that (4) implies the better known form of the Chernoff bound:

∀δ > 0 ∀^δ S: |p − p̂| ≤ √(ln(2/δ)/(2m)). (5)

Using the inequality D(q||p) ≥ (p − q)²/(2p), which holds for q ≤ p, we can show that (3) implies the following:²

∀δ > 0 ∀^δ S: p ≤ p̂ + √(2 p̂ ln(1/δ)/m) + 2 ln(1/δ)/m.³ (6)

Note that for small values of p̂, formula (6) gives a tighter upper bound on p than does (5). The upper bound on p implicit in (4) is somewhat tighter than the minimum of the bounds given by (5) and (6).

We now consider a formal setting for hypothesis learning. We assume a finite set H of hypotheses and a space X of instances. We assume that each hypothesis represents a function from X to {0, 1}, where we write h(x) for the value of the function represented by hypothesis h when applied to instance x. We also assume a distribution D on pairs ⟨x, y⟩ with x ∈ X and y ∈ {0, 1}. For any hypothesis h we define the error rate of h, denoted e(h), to be P_⟨x,y⟩∼D(h(x) ≠ y). For a given sample S of m pairs drawn from D we write ê(h) to denote the fraction of the pairs ⟨x, y⟩ in S such that h(x) ≠ y. Quantifying over h ∈ H in (4) yields the following second corollary of (1):

∀^δ S ∀h ∈ H: D(ê(h) || e(h)) ≤ (ln |H| + ln(2/δ))/m. (7)

2. A derivation of this formula can be found in Mansour and McAllester (2000) or McAllester and Schapire (2000).
3. To see the need for the last term, consider the case where p̂ = 0.
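The bounds (4), (5), and (6), together with the two formulas in the caption of Figure 1, are all easy to evaluate numerically; the implicit KL form only requires a one-dimensional bisection. A minimal sketch (the specific values m = 1000, δ = 0.05, p̂ = 0.01 are illustrative only):

```python
import math

def kl(q, p):
    """Bernoulli KL divergence D(q||p), with the 0 ln 0 = 0 convention."""
    if p <= 0.0 or p >= 1.0:
        return 0.0 if q == p else float("inf")
    t = 0.0
    if q > 0.0:
        t += q * math.log(q / p)
    if q < 1.0:
        t += (1.0 - q) * math.log((1.0 - q) / (1.0 - p))
    return t

def kl_upper_inverse(q, bound):
    """Largest p >= q with D(q||p) <= bound, found by bisection."""
    lo, hi = q, 1.0
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if kl(q, mid) <= bound:
            lo = mid
        else:
            hi = mid
    return lo

m, delta, p_hat = 1000, 0.05, 0.01

b4 = kl_upper_inverse(p_hat, math.log(2 / delta) / m)        # implicit bound (4)
b5 = p_hat + math.sqrt(math.log(2 / delta) / (2 * m))        # additive form (5)
b6 = (p_hat + math.sqrt(2 * p_hat * math.log(1 / delta) / m)
      + 2 * math.log(1 / delta) / m)                         # small-p_hat form (6)
assert b4 < b6 < b5   # (6) beats (5) for small p_hat; the implicit bound (4) is tightest

# Figure 1 formulas: VC bound vs. the discretized Occam's Razor Bound,
# with d = 10, delta = 0.1, m = 1000 as in the caption.
d, delta = 10, 0.1
vc = lambda e: e + math.sqrt((d * math.log(2 * m / d) + math.log(4 / delta)) / m)
orb = lambda e, k: kl_upper_inverse(e, (k * d * math.log(2) + math.log(1 / delta)) / m)
assert orb(0.0, 32) < vc(0.0)   # even 32-bit discretization beats the VC bound here
```

The comparison at the end reproduces the qualitative claim of Figure 1: with these settings the discretized finite-class bound is tighter than the VC bound at training error 0.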
By considering bounds on D(q||p) we can derive the more well known corollary of (7):

∀δ > 0 ∀^δ S ∀h ∈ H: |e(h) − ê(h)| ≤ √((ln |H| + ln(2/δ))/(2m)). (8)

These two formulas both limit the distance between ê(h) and e(h). In this paper we work with (7) rather than (8) because it yields an (implicit) upper bound on generalization error that is optimal up to asymptotic equality.

3 The Upper Bound

Our goal now is to improve on (7). Our first step is to divide the hypotheses in H into disjoint sets based on their true error rates. More specifically, for p ∈ [0, 1] define ⌈p⌉ to be max(1/m, ⌈pm⌉/m). Note that ⌈p⌉ is of the form k/m, where either p = 0 and k = 1, or p > 0 and p ∈ ((k − 1)/m, k/m]. In either case we have ⌈p⌉ ∈ {1/m, ..., m/m}, and if ⌈p⌉ = k/m then p ∈ [(k − 1)/m, k/m]. Now we define H(k/m) to be the set of h ∈ H such that ⌈e(h)⌉ = k/m. We define s(k/m) to be ln(max(1, |H(k/m)|)). We now have the following lemma.

Lemma 3.1 With high probability over the draw of S, the deviation between the empirical error ê(h) and the true error e(h) of every hypothesis is bounded in terms of s(⌈e(h)⌉). More precisely,

∀δ > 0 ∀^δ S ∀h ∈ H: D(ê(h) || e(h)) ≤ (s(⌈e(h)⌉) + ln(2m/δ))/m.

Proof. Quantifying over p ∈ {1/m, ..., m/m} and over h ∈ H(p) in (4) gives

∀δ > 0 ∀^δ S ∀p ∈ {1/m, ..., m/m} ∀h ∈ H(p): D(ê(h) || e(h)) ≤ (s(p) + ln(2m/δ))/m.

But this implies the lemma. □

Lemma 3.1 imposes a constraint, and hence a bound, on e(h). More specifically, we have the following, where lub{x : Φ[x]} denotes the least upper bound (the maximum) of the set {x : Φ[x]}:

e(h) ≤ lub{ q : D(ê(h) || q) ≤ (s(⌈q⌉) + ln(2m/δ))/m }. (9)

This is our uncomputable bound. It is uncomputable because the numbers s(1/m), ..., s(m/m) are unknown. Ignoring this problem, however, we can see that this bound is typically significantly tighter than (7). More specifically, we can rewrite (7) as

e(h) ≤ lub{ q : D(ê(h) || q) ≤ (ln |H| + ln(2/δ))/m }. (10)

Since s(⌈q⌉) ≤ ln |H|, and since ln(m)/m is small for large m, we have that (9) is never significantly looser than (10). Now consider a hypothesis h such that the bound on e(h) given by (7), or equivalently,
(10), is significantly less than 1/2. Assuming m is large, the bound given by (9) must also be significantly less than 1/2. But for q significantly less than 1/2 we typically have that s(⌈q⌉) is significantly smaller than ln |H|. For example, suppose H is the set of all decision trees of size m/10. For large m, a random decision tree of this size has error rate near 1/2. The set of decision trees with error rate significantly smaller than 1/2 is an exponentially small fraction of the set of all possible trees. So for q small compared to 1/2 we get that s(⌈q⌉) is significantly smaller than ln |H|. This makes the bound given by (9) significantly tighter than the bound given by (10).

We now show that the distribution of true errors can be replaced, essentially, by the histogram of training errors. We first introduce the following definitions:

Ĥ(k/m, δ) ≡ { h ∈ H : |ê(h) − k/m| ≤ 1/m + √(ln(16m²/δ)/(2m − 1)) },

ŝ(k/m, δ) ≡ ln(max(1, 2 |Ĥ(k/m, δ)|)).

The definition of ŝ(k/m, δ) is motivated by the following lemma.

Lemma 3.2 With high probability over the draw of S we have, for all q, that s(q) ≤ ŝ(q, 2δ). More precisely,

∀δ > 0 ∀^δ S ∀q ∈ {1/m, ..., m/m}: s(q) ≤ ŝ(q, 2δ).

Before proving Lemma 3.2 we note that by conjoining (9) and Lemma 3.2 we get the following. This is our main result.

Theorem 3.3 With high probability over the draw of S, the true error e(h) of every hypothesis is bounded in terms of its empirical error ê(h) and the observable quantities ŝ(⌈q⌉, δ). More precisely,

∀δ > 0 ∀^δ S ∀h ∈ H: e(h) ≤ lub{ q : D(ê(h) || q) ≤ (ŝ(⌈q⌉, δ) + ln(4m/δ))/m }.

As for Lemma 3.1, the bound implicit in Theorem 3.3 is typically significantly tighter than the bound in (7) or its equivalent form (10). The argument for the improved tightness of Theorem 3.3 over (10) is similar to the argument for (9). More specifically, consider a hypothesis h for which the bound in (10) is significantly less than 1/2. Since ŝ(⌈q⌉, δ) ≤ ln(2|H|), the set of values of q satisfying the condition in Theorem 3.3 must all be significantly less than 1/2. But for large m we have that 1/m + √(ln(16m²/δ)/(2m − 1)) is small.
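Theorem 3.3 can be turned into a small program. The sketch below is a best-effort reading of the (partly garbled) definitions of Ĥ(k/m, δ) and ŝ(k/m, δ) in this transcription — the window width and the ln(4m/δ) term are reconstructions, so treat the constants as illustrative. The example class (one hypothesis with training error 0 and a million with training error 1/2, both counts hypothetical) mirrors the example from the introduction, and the comparison value is the standard bound (10).

```python
import math

def kl(q, p):
    """Bernoulli KL divergence D(q||p), with the 0 ln 0 = 0 convention."""
    if p <= 0.0 or p >= 1.0:
        return 0.0 if q == p else float("inf")
    t = q * math.log(q / p) if q > 0.0 else 0.0
    if q < 1.0:
        t += (1.0 - q) * math.log((1.0 - q) / (1.0 - p))
    return t

def shell_bound(hist, e_hat, m, delta):
    """Sketch of the computable shell bound (constants as reconstructed here).
    hist[j] = number of hypotheses in the class with training error j/m;
    e_hat = training error of the hypothesis being bounded."""
    dev = math.sqrt(math.log(16 * m * m / delta) / (2 * m - 1))
    w = int(1 + dev * m)                   # window half-width in units of 1/m
    cum = [0] * (m + 2)
    for j in range(m + 1):
        cum[j + 1] = cum[j] + hist[j]      # cum[j+1] = #{h : training error <= j/m}
    log_term = math.log(4 * m / delta)
    best = e_hat
    for k in range(1, m + 1):              # candidate q = k/m on the 1/m grid
        count = cum[min(m, k + w) + 1] - cum[max(0, k - w)]   # |H_hat(k/m, delta)|
        s_hat = math.log(max(1, 2 * count))
        q = k / m
        if kl(min(e_hat, q), q) <= (s_hat + log_term) / m:
            best = max(best, q)            # q is feasible; track the lub
    return best

m, delta = 1000, 0.05
hist = [0] * (m + 1)
hist[0] = 1              # one hypothesis with zero training error
hist[m // 2] = 10 ** 6   # a million hypotheses with training error 1/2

shell = shell_bound(hist, 0.0, m, delta)

# Standard bound (10) for comparison: lub{q : D(0||q) <= (ln|H| + ln(2/delta))/m}.
occam = 1 - math.exp(-(math.log(sum(hist)) + math.log(2 / delta)) / m)

print(shell, occam)
assert shell < occam < 0.5   # the shell bound ignores the near-1/2 hypotheses
```

The near-1/2 shell never enters the effective class size for small q, so the shell bound beats the cardinality-based bound even though it pays extra ln(m) terms.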
So if q is significantly less than 1/2, then all hypotheses in Ĥ(⌈q⌉, δ) have empirical error rates significantly less than 1/2. But for most hypothesis classes, e.g., decision trees, the set of hypotheses with empirical error rates far from 1/2 should be an exponentially small fraction of the class. Hence we get that ŝ(⌈q⌉, δ) is significantly less than ln |H|, and Theorem 3.3 is tighter than (10).

The remainder of this section is a proof of Lemma 3.2. Our departure point for the proof is the following lemma from McAllester (1999).
Lemma 3.4 (McAllester 99) For any measure on any hypothesis class we have the following, where E_h f(h) denotes the expectation of f(h) under the given measure on h:

∀δ > 0 ∀^δ S: E_h e^((2m−1)(ê(h) − e(h))²) ≤ 4m/δ.

Intuitively, this lemma states that with high confidence over the choice of the sample, most hypotheses have empirical error near their true error. This allows us to prove that ŝ(⌈q⌉, δ) bounds s(⌈q⌉). More specifically, by considering the uniform distribution on H(k/m), Lemma 3.4 implies that with probability at least 1 − δ over the draw of S we have

E_{h∈H(k/m)} e^((2m−1)(ê(h) − e(h))²) ≤ 4m/δ

⟹ Pr_{h∈H(k/m)}( e^((2m−1)(ê(h) − e(h))²) ≥ 8m/δ ) ≤ 1/2

⟹ Pr_{h∈H(k/m)}( |ê(h) − e(h)| ≤ √(ln(8m/δ)/(2m−1)) ) ≥ 1/2

⟹ Pr_{h∈H(k/m)}( |ê(h) − k/m| ≤ 1/m + √(ln(8m/δ)/(2m−1)) ) ≥ 1/2

⟹ |H(k/m)| ≤ 2 |{ h ∈ H : |ê(h) − k/m| ≤ 1/m + √(ln(8m/δ)/(2m−1)) }|.

Lemma 3.2 now follows by quantification over q ∈ {1/m, ..., m/m}, which replaces δ by δ/m and thereby turns the deviation term ln(8m/δ) into ln(8m²/δ) = ln(16m²/(2δ)), the deviation term in the definition of Ĥ(q, 2δ). □

4 Asymptotic Analysis and Phase Transitions

This section and the two that follow give an asymptotic analysis of the bounds presented earlier. The asymptotic analysis is stated in Theorem 4.1 and Statement 6.1. To develop the asymptotic analysis, however, a preliminary discussion is needed regarding the phenomenon of phase transitions.

The bounds given in (9) and Theorem 3.3 exhibit phase transitions. More specifically, the bounding expression can be discontinuous in δ and m; e.g., arbitrarily small changes in δ can cause large changes in the bound. To see how this happens, consider the constraint on the quantity q:

D(ê(h) || q) ≤ (s(⌈q⌉) + ln(2m/δ))/m. (11)

The bound given by (9) is the least upper bound of the values of q satisfying (11). Assume that m is sufficiently large that we can think of s(⌈q⌉)/m as a continuous function of q, which we write as s(q). We can then rewrite (11) as follows, where λ is a quantity not depending on q and s(q) does not depend on δ:

D(ê(h) || q) ≤ s(q) + λ. (12)
For q ≥ ê(h) we know that D(ê(h) || q) is a monotonically increasing function of q. It is reasonable to assume that for q ≤ 1/2 we also have that s(q) is a monotonically increasing function of q. But even under these conditions it is possible that the feasible values of q, i.e., those satisfying (12), can be divided into separated regions. Furthermore, increasing λ can cause a new feasible region to come into existence. When this happens the bound, which is the least upper bound of the feasible values, can increase discontinuously. At a more intuitive level, consider a large number of high error concepts and a smaller number of lower error concepts. At a certain confidence level the high error concepts can be ruled out. But as the confidence requirement becomes more stringent, suddenly (and discontinuously) the high error concepts must be considered. A similar discontinuity can occur in sample size. Phase transitions in shell decomposition bounds are discussed in more detail by Haussler et al. (1996).

Phase transitions complicate asymptotic analysis. But asymptotic analysis illuminates the nature of phase transitions. As mentioned in the introduction, in the asymptotic analysis of learning theory bounds it is important that one does not hold H fixed as the sample size m increases. If we hold H fixed then lim_{m→∞} (ln |H|)/m = 0. But this is not what one expects for large samples in practice. As the sample size increases one typically uses larger hypothesis classes. Intuitively, we expect that even for very large m we have that (ln |H|)/m is far from zero.

For the asymptotic analysis of the bound in (9) we assume an infinite sequence of hypothesis classes H_1, H_2, H_3, ... and an infinite sequence of data distributions D_1, D_2, D_3, .... Let s_m(k/m) be s(k/m) defined relative to H_m and D_m. In the asymptotic analysis we assume that the sequence of functions s_m(⌈q⌉)/m, viewed as functions of q ∈ [0, 1], converges uniformly to a continuous function s(q). This means that for any ε > 0 there exists a k such that for all m > k we have
∀q ∈ [0, 1]: |s_m(⌈q⌉)/m − s(q)| ≤ ε.

Given the functions s_m(⌈p⌉) and their limit function s(p), we define the following functions of an empirical error rate ê:

B_m(ê) ≡ lub{ q : D(ê || q) ≤ (s_m(⌈q⌉) + ln(2m/δ))/m },

B(ê) ≡ lub{ q : D(ê || q) ≤ s(q) }.

The function B_m(ê) corresponds directly to the upper bound in (9). The function B(ê) is intended to be the large m asymptotic limit of B_m(ê). However, phase transitions complicate asymptotic analysis. The bound B(ê) need not be a continuous function of ê. A value of ê where the bound B(ê) is discontinuous corresponds to a phase transition in the bound. At a phase transition the sequence B_m(ê) need not converge. Away from phase transitions, however, we have the following theorem.

Theorem 4.1 If the bound B(ê) is continuous at the point ê (so we are not at a phase transition), and the functions s_m(⌈q⌉)/m, viewed as functions of q ∈ [0, 1], converge uniformly to a continuous function s(q), then we have

lim_{m→∞} B_m(ê) = B(ê).
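Before the proof, the discontinuity of B(ê) is easy to exhibit numerically. In the sketch below the shell curve s(q) is made up (a small shell of low-error concepts ramping up to a large shell of concepts with error near 1/2, chosen only so that the feasible set splits into two regions); B(ê) = lub{q : D(ê||q) ≤ s(q)} is evaluated on a grid, and a small change in ê makes the second feasible region appear, jumping the bound discontinuously.

```python
import math

def kl(q, p):
    """Bernoulli KL divergence D(q||p), with the 0 ln 0 = 0 convention."""
    if p <= 0.0 or p >= 1.0:
        return 0.0 if q == p else float("inf")
    t = q * math.log(q / p) if q > 0.0 else 0.0
    if q < 1.0:
        t += (1.0 - q) * math.log((1.0 - q) / (1.0 - p))
    return t

def s(q):
    """Hypothetical limit curve (nats per example): few low-error concepts,
    exponentially many concepts with error near 1/2. Made up for illustration."""
    if q <= 0.44:
        return 0.002
    if q >= 0.46:
        return 0.2
    return 0.002 + (0.2 - 0.002) * (q - 0.44) / 0.02   # continuous ramp

def B(e_hat, grid=2000):
    """lub{ q : D(e_hat||q) <= s(q) }, evaluated on a grid of q values."""
    return max(i / grid for i in range(grid + 1)
               if kl(e_hat, i / grid) <= s(i / grid))

before, after = B(0.15), B(0.18)
print(before, after)
assert after - before > 0.2   # a discontinuous jump: the phase transition
```

At ê = 0.15 only the low-error feasible region exists; by ê = 0.18 the high-error shell near 1/2 has become feasible and the least upper bound jumps by roughly 0.3.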
Proof. Define the set F_m(ê) as

F_m(ê) ≡ { q : D(ê || q) ≤ (s_m(⌈q⌉) + ln(2m/δ))/m }.

This gives

B_m(ê) = lub F_m(ê).

Similarly, define F(ê, ε) and B(ê, ε) as

F(ê, ε) ≡ { q ∈ [0, 1] : D(ê || q) ≤ s(q) + ε },

B(ê, ε) ≡ lub F(ê, ε).

We first show that the continuity of B(ê) at the point ê implies the continuity of B(ê, ε) at the point ⟨ê, 0⟩. We note that there exists a continuous function f(ê, ε) with f(ê, 0) = ê and such that for any ε sufficiently near 0 we have

D(f(ê, ε) || q) = D(ê || q) − ε.

We then have

B(ê, ε) = B(f(ê, ε)).

Since f is continuous, and B(ê) is continuous at the point ê, we get that B(ê, ε) is continuous at the point ⟨ê, 0⟩.

We now prove the theorem. The functions of the form (s_m(⌈q⌉) + ln(2m/δ))/m converge uniformly to the function s(q). This implies that for any ε > 0 there exists a k such that for all m > k we have

F(ê, −ε) ⊆ F_m(ê) ⊆ F(ê, ε).

But this in turn implies that

B(ê, −ε) ≤ B_m(ê) ≤ B(ê, ε). (13)

The theorem now follows from the continuity of the function B(ê, ε) at the point ⟨ê, 0⟩. □

Theorem 4.1 can be interpreted as saying that for large sample sizes, and for values of ê other than the special phase transition values, the bound has a well defined value independent of the confidence parameter δ and determined only by a smooth function s(q). A similar statement can be made for the bound in Theorem 3.3: for large m, and at points other than phase transitions, the bound is independent of δ and is determined by a smooth limit curve.

For the asymptotic analysis of Theorem 3.3 we assume an infinite sequence H_1, H_2, H_3, ... of hypothesis classes and an infinite sequence S_1, S_2, S_3, ... of samples such that sample S_m has size m. Let Ĥ_m(k/m, δ) and ŝ_m(k/m, δ) be Ĥ(k/m, δ) and ŝ(k/m, δ) respectively defined relative to hypothesis class H_m and sample S_m. Let U_m(k/m) be the set of hypotheses in H_m having an empirical error of exactly k/m in the sample S_m. Let u_m(k/m) be ln(max(1, |U_m(k/m)|)). In the analysis of Theorem 3.3 we allow that the functions u_m(⌈q⌉)/m are only locally uniformly convergent to a continuous function ū(q), i.e., for any q ∈ [0, 1] and any ε
> 0 there exists an integer k and a real number γ > 0 satisfying

∀m > k, ∀p ∈ (q − γ, q + γ): |u_m(⌈p⌉)/m − ū(p)| ≤ ε.

Locally uniform convergence plays a role in the analysis in Section 6.
Theorem 4.2 If the functions u_m(⌈q⌉)/m converge locally uniformly to a continuous function ū(q) then, for any fixed value of δ, the functions ŝ_m(⌈q⌉, δ)/m also converge locally uniformly to ū(q). If the convergence of u_m(⌈q⌉)/m is uniform, then so is the convergence of ŝ_m(⌈q⌉, δ)/m.

Proof. Consider an arbitrary value q ∈ [0, 1] and ε > 0. We construct the desired k and γ. More specifically, select k sufficiently large and γ sufficiently small that we have the properties

∀m > k, ∀p ∈ (q − 2γ, q + 2γ): |u_m(⌈p⌉)/m − ū(p)| < ε/3,

∀p ∈ (q − 2γ, q + 2γ): |ū(p) − ū(q)| ≤ ε/3,

1/k + √(ln(16k²/δ)/(2k − 1)) < γ,

(ln k)/k ≤ ε/3.

Consider an m > k and a p ∈ (q − γ, q + γ). It now suffices to show that |ŝ_m(⌈p⌉, δ)/m − ū(p)| ≤ ε. Because U_m(⌈p⌉) is a subset of Ĥ_m(⌈p⌉, δ), we have

ŝ_m(⌈p⌉, δ)/m ≥ u_m(⌈p⌉)/m ≥ ū(p) − ε.

We can also upper bound ŝ_m(⌈p⌉, δ)/m as follows. By the third property, every hypothesis in Ĥ_m(⌈p⌉, δ) has empirical error k/m with |k/m − p| ≤ γ, so

ŝ_m(⌈p⌉, δ)/m ≤ (1/m) ln( 2 Σ_{k : |k/m − p| ≤ γ} e^(u_m(k/m)) )

≤ (1/m) ln( 2 Σ_{k : |k/m − p| ≤ γ} e^(m(ū(k/m) + ε/3)) )

≤ (1/m) ln( 2m e^(m(ū(p) + 2ε/3)) )

= ū(p) + 2ε/3 + ln(2m)/m

≤ ū(p) + ε.
A similar argument shows that if u_m(⌈q⌉)/m converges uniformly to ū(q) then so does ŝ_m(⌈q⌉, δ)/m. □

Given quantities ŝ_m(⌈q⌉, δ)/m that converge uniformly to ū(q), the remainder of the analysis is identical to that for the asymptotic analysis of (9). We define the upper bounds

B̂_m(ê) ≡ lub{ q : D(ê || q) ≤ (ŝ_m(⌈q⌉, δ) + ln(4m/δ))/m },

B̂(ê) ≡ lub{ q : D(ê || q) ≤ ū(q) }.

Again we say that ê is at a phase transition if the function B̂(ê) is discontinuous at the value ê. We then get the following, whose proof is identical to that of Theorem 4.1.

Theorem 4.3 If the bound B̂(ê) is continuous at the point ê (so we are not at a phase transition), and the functions u_m(⌈q⌉)/m converge uniformly to ū(q), then we have

lim_{m→∞} B̂_m(ê) = B̂(ê).

5 Asymptotic Optimality of (9)

Formula (9) can be viewed as providing an upper bound on e(h) as a function of ê(h) and the function s. In this section we show that for any curve s and value ê there exists a hypothesis class and data distribution such that the upper bound in (9) is realized up to asymptotic equality. Up to asymptotic equality, (9) is the tightest possible bound computable from ê(h) and the numbers s(1/m), ..., s(m/m).

The classical VC dimension bounds are nearly optimal over bounds computable from the chosen hypothesis error rate ê(h) and the class H. The numbers s(1/m), ..., s(m/m) depend on both H and the data distribution. Hence the bound in (9) uses information about the distribution, and hence can be tighter than classical VC bounds. A similar statement applies to the bound in Theorem 3.3 computed from the empirically observable numbers ŝ(1/m, δ), ..., ŝ(m/m, δ). In this case, the bound uses more information from the sample than just ê(h). The optimality theorem given here also differs from the traditional lower bound results for VC dimension in that here the lower bounds match the upper bounds up to asymptotic equality.

The departure point for our optimality analysis is the following lemma from Cover and Thomas (1991).

Lemma 5.1 (Cover and Thomas) If p̂ is the fraction of heads out of m tosses of a coin where the
true probability of heads is p, then for q ≥ p we have

Pr(p̂ ≥ q) ≥ e^(−m D(q||p)) / (m + 1).

This lower bound on Pr(p̂ ≥ q) is very close to Chernoff's 1952 upper bound (1). The tightness of (9) is a direct reflection of the tightness of (1). To exploit Lemma 5.1 we need to construct hypothesis classes and data distributions where distinct hypotheses have independent training errors. More specifically, we say that a set of hypotheses {h_1, ..., h_n} has independent training errors if the random variables ê(h_1), ..., ê(h_n) are independent.
By an argument similar to the derivation of (3) from (1), we can prove from Lemma 5.1 that

Pr( D(min(p̂, p) || p) ≥ (ln(1/δ) − ln(m + 1))/m ) ≥ δ. (14)

Lemma 5.2 Let X be any finite set, S a random variable, and Θ[S, x, δ] a formula such that for every x ∈ X and δ > 0 we have Pr(Θ[S, x, δ]) ≥ δ, and such that the events are independent:

Pr( ∀x ∈ X: ¬Θ[S, x, δ] ) = ∏_{x∈X} Pr( ¬Θ[S, x, δ] ).

We then have

∀δ > 0 ∀^δ S ∃x ∈ X: Θ[S, x, ln(1/δ)/|X|].

Proof.

Pr( Θ[S, x, ln(1/δ)/|X|] ) ≥ ln(1/δ)/|X|

⟹ Pr( ¬Θ[S, x, ln(1/δ)/|X|] ) ≤ 1 − ln(1/δ)/|X| ≤ e^(−ln(1/δ)/|X|)

⟹ Pr( ∀x ∈ X: ¬Θ[S, x, ln(1/δ)/|X|] ) ≤ e^(−ln(1/δ)) = δ. □

Now define h*(k/m) to be the hypothesis of minimal training error in the set H(k/m). Let glb{x : Φ[x]} denote the greatest lower bound (the minimum) of the set {x : Φ[x]}. We now have the following lemma.

Lemma 5.3 If the hypotheses in the class H(⌈q⌉) are independent then

∀δ > 0 ∀^δ S ∀q ∈ {1/m, ..., m/m}: ê(h*(q)) ≤ glb{ ê : D(min(ê, q − 1/m) || q) ≤ (s(q) − ln(m + 1) − ln(ln(m/δ)))/m }.

Proof. To prove Lemma 5.3, let q be a fixed rational number of the form k/m. Assuming independent hypotheses, we can apply Lemma 5.2 to (14) to get

∀δ > 0 ∀^δ S ∃h ∈ H(k/m): D(min(ê(h), e(h)) || e(h)) ≥ (s(q) − ln(m + 1) − ln(ln(1/δ)))/m.

Let w be the hypothesis in H(q) satisfying this formula. We now have ê(h*(q)) ≤ ê(w) and q − 1/m ≤ e(w) ≤ q. These two conditions imply

∀δ > 0 ∀^δ S: D(min(ê(h*(q)), q − 1/m) || q) ≥ (s(q) − ln(m + 1) − ln(ln(1/δ)))/m.

This implies that

ê(h*(q)) ≤ glb{ ê : D(min(ê, q − 1/m) || q) ≤ (s(q) − ln(m + 1) − ln(ln(1/δ)))/m }.

Lemma 5.3 now follows by quantification over q ∈ {1/m, ..., m/m}. □
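Lemma 5.1 and the Chernoff upper bound (1) sandwich the binomial tail to within the polynomial factor m + 1, which is the engine behind the matching lower bound. A quick numerical check, with illustrative values m = 100, p = 0.2, q = 0.35 (chosen here, not taken from the paper):

```python
import math

def kl(q, p):
    """Bernoulli KL divergence D(q||p) for 0 < p, q < 1."""
    return q * math.log(q / p) + (1 - q) * math.log((1 - q) / (1 - p))

m, p, q = 100, 0.2, 0.35

# Exact binomial tail Pr(p_hat >= q), computed directly from the definition.
k0 = round(q * m)
tail = sum(math.comb(m, k) * p**k * (1 - p)**(m - k)
           for k in range(k0, m + 1))

upper = math.exp(-m * kl(q, p))   # Chernoff upper bound (1)
lower = upper / (m + 1)           # Cover-Thomas style lower bound (Lemma 5.1)

print(lower, tail, upper)
assert lower <= tail <= upper
```

Since the two sides differ only by the factor m + 1, whose logarithm is o(m), the upper bound (9) built from (1) cannot be improved by more than vanishing per-example terms.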
For q ∈ [0, 1] we have that Lemma 3.1 implies that

ê(h*(⌈q⌉)) ≥ glb{ ê : D(ê || ⌈q⌉ − 1/m) ≤ (s(⌈q⌉) + ln(2m/δ))/m }.

We now have upper and lower bounds on the quantity ê(h*(⌈q⌉)) which agree up to asymptotic equality: in a large m limit where s_m(⌈q⌉)/m converges (pointwise) to a continuous function s(q), we have that the upper and lower bounds on ê(h*(⌈q⌉)) both converge (pointwise) to

ê(h*(q)) = glb{ ê : D(ê || q) ≤ s(q) }.

This asymptotic value of ê(h*(q)) is a continuous function of q. Since q is held fixed in calculating the bounds on ê(h*(⌈q⌉)), phase transitions are not an issue and uniform convergence of the functions s_m(⌈q⌉)/m is not required. Note that for large m and independent hypotheses we get that ê(h*(q)) is determined as a function of the true error rate q and s(q).

The following theorem states that any limit function s(p) is consistent with the possibility that hypotheses are independent. This, together with Lemma 5.3, implies that no uniform bound on e(h) as a function of ê(h) and |H(1/m)|, ..., |H(m/m)| can be asymptotically tighter than (9).

Theorem 5.4 Let s(p) be any continuous function of p ∈ [0, 1]. There exists an infinite sequence of hypothesis spaces H_1, H_2, H_3, ..., and a sequence of data distributions D_1, D_2, D_3, ..., such that each class H_m has independent hypotheses for data distribution D_m and such that s_m(⌈p⌉)/m converges (pointwise) to s(p).

Proof. First we show that if |H_m(i/m)| = e^(m s(i/m)) then the functions s_m(⌈p⌉)/m converge (pointwise) to s(p). Assume |H_m(i/m)| = e^(m s(i/m)). In this case we have s_m(⌈p⌉)/m = s(⌈p⌉). Since s(p) is continuous, for any fixed value of p we get that s_m(⌈p⌉)/m converges to s(p).

Recall that D_m is a probability distribution on pairs ⟨x, y⟩ with y ∈ {0, 1} and x ∈ X for some set X. We take H_m to be a disjoint union of sets H_m(k/m), where H_m(k/m) is selected as above. Let f_1, ..., f_N be the elements of H_m with N = |H_m|. Let X be the set of all N-bit strings and define f_i(x) to be the value of the ith bit of the bit vector x. Now define the distribution D_m on pairs ⟨x, y⟩ by selecting y to be 1 with
probability 1/2 and then selecting each bit of x independently, where the ith bit is selected to disagree with y with probability k/m, where k is such that f_i ∈ H_m(k/m). □

6 Relating ŝ and s

In this section we show that in large m limits of the type discussed in Section 4, the histogram of empirical errors need not converge to the histogram of true errors. So even in the large m asymptotic limit, the bound given by Theorem 3.3 is significantly weaker than the bound given by (9). To show that ŝ(⌈q⌉, δ)/m can be asymptotically different from s(⌈q⌉)/m, we consider the case of independent hypotheses. More specifically, given a continuous function s(p) we construct an infinite
sequence of hypothesis spaces H_1, H_2, H_3, …, and an infinite sequence of data distributions D_1, D_2, D_3, …, using the construction in the proof of Theorem 5.4. We note that if s(p) is differentiable with bounded derivative then the functions s_m(p) converge uniformly to s(p). For a given infinite sequence of data distributions we generate an infinite sample sequence S_1, S_2, S_3, …, by selecting S_m to consist of m pairs ⟨x, y⟩ drawn IID from distribution D_m. For a given sample sequence and h ∈ H_m we define ê_m(h) and ŝ_m(k/m, δ) in a manner similar to ê(h) and ŝ(k/m, δ) but for sample S_m. The main result of this section is the following.

Conjecture 6.1 If each H_m has independent hypotheses under data distribution D_m, and the functions s_m(p) converge uniformly to a continuous function s(p), then for any δ > 0 and p ∈ [0,1] we have, with probability 1 over the generation of the sample sequence, that

    lim_{m→∞} ŝ_m(p, δ) = sup_{q∈[0,1]} [ s(q) − D(p‖q) ].

We call this a conjecture rather than a theorem because the proof has not been worked out to a high level of rigor. Nonetheless, we believe the proof sketch given below can be expanded to a fully rigorous argument. Before giving the proof sketch we note that the limiting value of ŝ_m(p, δ) is independent of δ. This is consistent with Theorem 4.2. Define

    ŝ(p) ≡ sup_{q∈[0,1]} [ s(q) − D(p‖q) ].

Note that ŝ(p) ≥ s(p) (take q = p). This gives an asymptotic version of Lemma 3.2. But since D(p‖q) can be locally approximated as c(p − q)^2 (up to its second-order Taylor expansion), if s(p) is increasing at the point p then we also get that ŝ(p) is strictly larger than s(p).

Proof Outline: To prove Statement 6.1 we first define H_m(p, q), for p, q ∈ {1/m, …, m/m}, to be the set of all h ∈ H(q) such that ê_m(h) = p. Intuitively, H_m(p, q) is the set of concepts with true error rate near q that have empirical error rate p. Ignoring factors that are only polynomial in m, the probability of a hypothesis with true error rate q having empirical error rate p can be written as (approximately) e^{−mD(p‖q)}. So the
expected size of H_m(p, q) can be written as |H(q)| e^{−mD(p‖q)}, or alternatively, (approximately) as e^{m s(q)} e^{−mD(p‖q)}, or e^{m(s(q) − D(p‖q))}. More formally, we have, for any fixed value of p and q,

    lim_{m→∞} (1/m) ln(max(1, E[|H_m(p, q)|])) = max(0, s(q) − D(p‖q)).

We now show that the expectation can be eliminated from the above limit. First, consider distinct values of p and q such that s(q) − D(p‖q) > 0. Since p and q are distinct, the probability that a fixed hypothesis in H(q) is in H_m(p, q) declines exponentially in m. Since s(q) − D(p‖q) > 0, the expected size of H_m(p, q) grows exponentially in m. Since the hypotheses are independent, the distribution of possible values of |H_m(p, q)| becomes essentially a Poisson mass distribution with an expected number of arrivals growing exponentially in m. The probability that |H_m(p, q)| deviates from its expectation by as much as a factor of 2 declines exponentially in m. We say that a sample sequence is safe after k if for all m > k we
have that |H_m(p, q)| is within a factor of 2 of its expectation. Since the probability of being unsafe at m declines exponentially in m, for any δ there exists a k such that with probability at least 1 − δ the sample sequence is safe after k. So for any δ > 0 we have that with probability at least 1 − δ the sequence is safe after some k. But since this holds for all δ > 0, with probability 1 such a k must exist:

    lim_{m→∞} (1/m) ln(max(1, |H_m(p, q)|)) = s(q) − D(p‖q).

We now define

    s_m(p, q) ≡ (1/m) ln(max(1, |H_m(p, q)|)).

It is also possible to show that for p = q we have, with probability 1, that s_m(p, q) approaches s(p), and that for distinct p and q with s(q) − D(p‖q) ≤ 0 we have that s_m(p, q) approaches 0. Putting these together yields that, with probability 1, we have

    lim_{m→∞} s_m(p, q) = max(0, s(q) − D(p‖q)).    (15)

Define U_m(k/m) and u_m(k/m) as in Section 4. We now have the following equality:

    U_m(p) = ∪_{q ∈ {1/m, …, m/m}} H_m(p, q).

We now show that, with probability 1, u_m(p) approaches ŝ(p). First, consider a p ∈ [0,1] such that ŝ(p) > 0. Since s(q) − D(p‖q) is a continuous function of q, and [0,1] is a compact set, sup_{q∈[0,1]} [s(q) − D(p‖q)] must be realized at some value in [0,1]. Let q* be such that s(q*) − D(p‖q*) equals ŝ(p). We have that u_m(p) ≥ s_m(p, q*). This, together with (15), implies that

    lim inf_{m→∞} u_m(p) ≥ ŝ(p).

The sample sequence is safe at m and k if |H_m(p, k/m)| does not exceed twice the expectation of |H_m(p, q*)|. Assuming uniform convergence of s_m(p), the probability of not being safe at m and k declines exponentially in m at a rate at least as fast as the rate of decline of the probability of not being safe at m and q*. By the union bound this implies that for a given m the probability that there exists an unsafe k also declines exponentially. We say that the sequence is safe after N if it is safe for all m and k with m > N. The probability of not being safe after N also declines exponentially with N. By an argument similar to that given above, this implies that with probability 1 over the choice of the sequence
there exists an N such that the sequence is safe after N. But if we are safe at m then |U_m(p)| ≤ 2m E[|H_m(p, q*)|]. This implies that

    lim sup_{m→∞} u_m(p) ≤ ŝ(p).

Putting the two bounds together we get

    lim_{m→∞} u_m(p) = ŝ(p).
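The limiting quantity ŝ(p) = sup_{q∈[0,1]} [s(q) − D(p‖q)] is easy to evaluate numerically. The following sketch (not code from the paper; the limit function s below is a hypothetical stand-in) approximates the supremum on a grid and illustrates both ŝ(p) ≥ s(p) and the strict inequality at points where s is increasing:

```python
import math

def kl(p, q):
    """Binary relative entropy D(p||q) in nats."""
    eps = 1e-12
    q = min(max(q, eps), 1 - eps)   # avoid log(0) at the grid endpoints
    total = 0.0
    if p > 0:
        total += p * math.log(p / q)
    if p < 1:
        total += (1 - p) * math.log((1 - p) / (1 - q))
    return total

def s_hat(p, s, grid=10000):
    """Approximate s_hat(p) = sup_q [s(q) - D(p||q)] over a uniform q-grid."""
    return max(s(i / grid) - kl(p, i / grid) for i in range(grid + 1))

# Hypothetical increasing limit function s(q).
s = lambda q: 0.5 * q
p = 0.3
assert s_hat(p, s) > s(p)   # strictly larger, since s is increasing at p
```

Because D(p‖q) is flat to second order at q = p while s is increasing there, the maximizing q lies slightly above p, which is exactly the mechanism by which the histogram of empirical errors overstates s.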
This argument establishes (to some level of rigor) pointwise convergence of u_m(p) to ŝ(p). It is also possible to establish a convergence rate that is a continuous function of p. This implies that the convergence of u_m(p) can be made locally uniform. Theorem 4.2 then implies the desired result.

7 Improvements

Theorem 3.3 has been improved in various ways (Langford, 2002): removing the discretization of true errors, using one-sided bounds, using nonuniform union bounds over discrete values of the form k/m, tightening the Chernoff bound using direct calculation of binomial coefficients, and improving Lemma 3.4. These improvements allow the removal of all but one of the ln(m) terms from the statement of the bound. However, they do not improve the asymptotic equations given by Theorem 4.1 and Statement 6.1.

A practical difficulty with the bound in Theorem 3.3 is that it is usually impossible to enumerate the elements of an exponentially large hypothesis class, and hence impractical to compute the histogram of training errors for the hypotheses in the class. In practice the values of s(k/m) might be estimated using some form of Markov chain Monte Carlo sampling over the hypotheses. For certain hypothesis spaces it might also be possible to directly calculate the empirical error distribution without evaluating every hypothesis. For example, this can be done with partition rules which, given a fixed partition of the input space, make predictions that are constant on each partition element. If there are n elements in the partition then there are 2^n partition rules. For a fixed partition, the histogram of empirical errors for the 2^n partition rules can be computed in polynomial time. Note that the class of decision trees is a union of partition rules, where the structure of a tree defines a partition and the labels at the leaves of the tree define a particular partition rule relative to that partition. Taking advantage of this, it is surprisingly easy to compute a shell bound for small decision trees (Langford,
2002).

8 Discussion and Conclusion

Traditional PAC bounds are stated in terms of the training error and class size or VC dimension. The computable bound given here is sometimes much tighter because it exploits the additional information in the histogram of training errors. The uncomputable bound uses the additional (unavailable) information in the distribution of true errors. Any distribution of true errors can be realized in a case with independent hypotheses. We have shown that in such cases this uncomputable bound is asymptotically equal to the actual generalization error. Hence this is the tightest possible bound, up to asymptotic equality, over all bounds expressed as functions of ê(h) and the distribution of true errors. We have also shown that the use of the histogram of empirical errors results in a bound that, while still tighter than traditional bounds, is looser than the uncomputable bound even in the large-sample asymptotic limit.
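The partition-rule computation described in Section 7 can be sketched concretely: the histogram of empirical errors over all 2^n rules factors across partition elements, since each element contributes a two-point error distribution, and the full histogram is their n-fold convolution, which takes polynomial rather than exponential time. A minimal sketch (the per-element counts are hypothetical; this is not code from the paper):

```python
from collections import Counter

def partition_rule_histogram(cells):
    """Histogram of training errors over all 2^n partition rules.

    cells -- list of (n0, n1) pairs giving, for each partition element,
    the number of training examples with label 0 and with label 1 in it.
    Predicting 1 on an element costs n0 errors; predicting 0 costs n1.
    The histogram over all 2^n rules is the convolution of the n
    per-element two-point histograms, so no enumeration is needed.
    """
    hist = Counter({0: 1})  # start from a single empty rule with zero errors
    for n0, n1 in cells:
        step = Counter()
        for errors, count in hist.items():
            step[errors + n1] += count  # this element predicts 0
            step[errors + n0] += count  # this element predicts 1
        hist = step
    return hist

cells = [(2, 1), (0, 3), (1, 1)]            # hypothetical per-element counts
hist = partition_rule_histogram(cells)
assert sum(hist.values()) == 2 ** len(cells)  # all 2^3 = 8 rules accounted for
```

Each convolution step touches at most m + 1 distinct error counts for a sample of size m, so the whole computation is O(nm), in line with the polynomial-time claim above.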
One of the goals of learning theory is to give generalization guarantees that are predictive of actual generalization error. It is well known that the actual generalization error can exhibit phase transitions: as the sample size increases, the expected generalization error can jump essentially discontinuously. So accurate true error bounds should also exhibit phase transitions. Shell bounds exhibit these phase transitions while other bounds, such as VC dimension results, do not. The phase transitions can also be interpreted as a statement about the bound as a function of the confidence parameter δ: as the value of δ is varied, the bound may shift essentially discontinuously. To put this another way, let h be the hypothesis of minimal training error on a large sample. Near a phase transition in true generalization error (as opposed to a phase transition in the bound) we may have that with probability 1 − δ the true error of h is near its training error, but with probability δ/2, say, the true error of h is far from its training error. More traditional bounds do not exhibit this kind of sensitivity to δ. Bounds that exhibit phase transitions seem to bring the theoretical analysis of generalization closer to the actual phenomenon.

Acknowledgments

Yoav Freund, Avrim Blum, and Tobias Scheffer all provided useful discussion in forming this paper.

References

P. Bartlett, O. Bousquet, and S. Mendelson. Localized Rademacher complexities. Proceedings of the 15th Annual Conference on Computational Learning Theory, pp. 44-58, 2002.

H. Chernoff. A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations. Annals of Mathematical Statistics, 23, 1952.

T. M. Cover and J. A. Thomas. Elements of Information Theory. Wiley, 1991.

Y. Freund. Self bounding algorithms. Computational Learning Theory (COLT), 1998.

D. Haussler, M. Kearns, H. S. Seung, and N. Tishby. Rigorous learning curve bounds from statistical mechanics. Machine Learning, 25, 1996.

J. Langford. Practical
prediction theory for classification. ICML 2003 tutorial, available at jl/projects/prediction bounds/tutorial/tutorial.ps.

J. Langford. Quantitatively tight sample complexity bounds. PhD thesis, Carnegie Mellon University, 2002.

J. Langford and A. Blum. Microchoice and self-bounding algorithms. Computational Learning Theory (COLT), 1999.

Y. Mansour and D. McAllester. Generalization bounds for decision trees. Computational Learning Theory (COLT), 2000.

A. Moore. VC dimension for characterizing classifiers. Tutorial at 2.cs.cmu.edu/~awm/tutorials/vcdim08.pdf.
D. McAllester. PAC-Bayesian model averaging. Computational Learning Theory (COLT), 1999.

D. McAllester and R. Schapire. On the convergence rate of Good-Turing estimators. Computational Learning Theory (COLT), 2000.

T. Scheffer and T. Joachims. Expected error analysis for model selection. International Conference on Machine Learning (ICML), 1999.

S. van de Geer. Empirical Processes in M-Estimation. Cambridge University Press.
More informationSupport Vector Machines. Maximizing the Margin
Support Vector Machines Support vector achines (SVMs) learn a hypothesis: h(x) = b + Σ i= y i α i k(x, x i ) (x, y ),..., (x, y ) are the training exs., y i {, } b is the bias weight. α,..., α are the
More informationNote on generating all subsets of a finite set with disjoint unions
Note on generating all subsets of a finite set with disjoint unions David Ellis e-ail: dce27@ca.ac.uk Subitted: Dec 2, 2008; Accepted: May 12, 2009; Published: May 20, 2009 Matheatics Subject Classification:
More informationCurious Bounds for Floor Function Sums
1 47 6 11 Journal of Integer Sequences, Vol. 1 (018), Article 18.1.8 Curious Bounds for Floor Function Sus Thotsaporn Thanatipanonda and Elaine Wong 1 Science Division Mahidol University International
More informationa a a a a a a m a b a b
Algebra / Trig Final Exa Study Guide (Fall Seester) Moncada/Dunphy Inforation About the Final Exa The final exa is cuulative, covering Appendix A (A.1-A.5) and Chapter 1. All probles will be ultiple choice
More informationSupport Vector Machine Classification of Uncertain and Imbalanced data using Robust Optimization
Recent Researches in Coputer Science Support Vector Machine Classification of Uncertain and Ibalanced data using Robust Optiization RAGHAV PAT, THEODORE B. TRAFALIS, KASH BARKER School of Industrial Engineering
More informationChaotic Coupled Map Lattices
Chaotic Coupled Map Lattices Author: Dustin Keys Advisors: Dr. Robert Indik, Dr. Kevin Lin 1 Introduction When a syste of chaotic aps is coupled in a way that allows the to share inforation about each
More informationUpper bound on false alarm rate for landmine detection and classification using syntactic pattern recognition
Upper bound on false alar rate for landine detection and classification using syntactic pattern recognition Ahed O. Nasif, Brian L. Mark, Kenneth J. Hintz, and Nathalia Peixoto Dept. of Electrical and
More informationTight Bounds for Maximal Identifiability of Failure Nodes in Boolean Network Tomography
Tight Bounds for axial Identifiability of Failure Nodes in Boolean Network Toography Nicola Galesi Sapienza Università di Roa nicola.galesi@uniroa1.it Fariba Ranjbar Sapienza Università di Roa fariba.ranjbar@uniroa1.it
More informationA Theoretical Framework for Deep Transfer Learning
A Theoretical Fraewor for Deep Transfer Learning Toer Galanti The School of Coputer Science Tel Aviv University toer22g@gail.co Lior Wolf The School of Coputer Science Tel Aviv University wolf@cs.tau.ac.il
More informationVC Dimension and Sauer s Lemma
CMSC 35900 (Spring 2008) Learning Theory Lecture: VC Diension and Sauer s Lea Instructors: Sha Kakade and Abuj Tewari Radeacher Averages and Growth Function Theore Let F be a class of ±-valued functions
More informationSymbolic Analysis as Universal Tool for Deriving Properties of Non-linear Algorithms Case study of EM Algorithm
Acta Polytechnica Hungarica Vol., No., 04 Sybolic Analysis as Universal Tool for Deriving Properties of Non-linear Algoriths Case study of EM Algorith Vladiir Mladenović, Miroslav Lutovac, Dana Porrat
More informationTEST OF HOMOGENEITY OF PARALLEL SAMPLES FROM LOGNORMAL POPULATIONS WITH UNEQUAL VARIANCES
TEST OF HOMOGENEITY OF PARALLEL SAMPLES FROM LOGNORMAL POPULATIONS WITH UNEQUAL VARIANCES S. E. Ahed, R. J. Tokins and A. I. Volodin Departent of Matheatics and Statistics University of Regina Regina,
More informationMulti-Dimensional Hegselmann-Krause Dynamics
Multi-Diensional Hegselann-Krause Dynaics A. Nedić Industrial and Enterprise Systes Engineering Dept. University of Illinois Urbana, IL 680 angelia@illinois.edu B. Touri Coordinated Science Laboratory
More informationIterative Decoding of LDPC Codes over the q-ary Partial Erasure Channel
1 Iterative Decoding of LDPC Codes over the q-ary Partial Erasure Channel Rai Cohen, Graduate Student eber, IEEE, and Yuval Cassuto, Senior eber, IEEE arxiv:1510.05311v2 [cs.it] 24 ay 2016 Abstract In
More informationTesting equality of variances for multiple univariate normal populations
University of Wollongong Research Online Centre for Statistical & Survey Methodology Working Paper Series Faculty of Engineering and Inforation Sciences 0 esting equality of variances for ultiple univariate
More informationBayesian Learning. Chapter 6: Bayesian Learning. Bayes Theorem. Roles for Bayesian Methods. CS 536: Machine Learning Littman (Wu, TA)
Bayesian Learning Chapter 6: Bayesian Learning CS 536: Machine Learning Littan (Wu, TA) [Read Ch. 6, except 6.3] [Suggested exercises: 6.1, 6.2, 6.6] Bayes Theore MAP, ML hypotheses MAP learners Miniu
More informationQuantum algorithms (CO 781, Winter 2008) Prof. Andrew Childs, University of Waterloo LECTURE 15: Unstructured search and spatial search
Quantu algoriths (CO 781, Winter 2008) Prof Andrew Childs, University of Waterloo LECTURE 15: Unstructured search and spatial search ow we begin to discuss applications of quantu walks to search algoriths
More informationMachine Learning Basics: Estimators, Bias and Variance
Machine Learning Basics: Estiators, Bias and Variance Sargur N. srihari@cedar.buffalo.edu This is part of lecture slides on Deep Learning: http://www.cedar.buffalo.edu/~srihari/cse676 1 Topics in Basics
More informationAlgorithmic Stability and Sanity-Check Bounds for Leave-One-Out Cross-Validation
Algorithic Stability and Sanity-Check Bounds for Leave-One-Out Cross-Validation Michael Kearns AT&T Labs Research Murray Hill, New Jersey kearns@research.att.co Dana Ron MIT Cabridge, MA danar@theory.lcs.it.edu
More information