Learnability and Stability in the General Learning Setting


Shai Shalev-Shwartz (TTI-Chicago), Ohad Shamir (The Hebrew University), Nathan Srebro (TTI-Chicago), Karthik Sridharan (TTI-Chicago)

Abstract

We establish that stability is necessary and sufficient for learning, even in the General Learning Setting, where uniform convergence conditions are not necessary for learning, and where learning might only be possible with a non-ERM learning rule. This goes beyond previous work on the relationship between stability and learnability, which focused on supervised classification and regression, where learnability is equivalent to uniform convergence and it is enough to consider the ERM.

1 Introduction

We consider the General Setting of Learning [10], where we would like to minimize a population risk functional (stochastic objective)

    F(h) = E_{Z~D}[f(h; Z)]    (1)

where the distribution D of Z is unknown, based on an i.i.d. sample z_1, ..., z_m drawn from D (and full knowledge of the function f). This General Setting subsumes supervised classification and regression, certain unsupervised learning problems, density estimation and stochastic optimization. For example, in supervised learning z = (x, y) is an instance-label pair, h is a predictor, and f(h, (x, y)) = loss(h(x), y) is the loss functional.

For supervised classification and regression problems, it is well known that a problem is learnable (see precise definition in Section 2) if and only if the empirical risks

    F_S(h) = (1/m) Σ_{i=1}^m f(h, z_i)    (2)

converge uniformly to their expectations. If uniform convergence holds, then the empirical risk minimizer (ERM) is consistent, i.e. the population risk of the ERM converges to the optimal population risk, and the problem is learnable using the ERM. That is, learnability is equivalent to learnability by ERM, and so we can focus our attention solely on empirical risk minimizers.

Stability has also been suggested as an explicit alternate condition for learnability. Intuitively, stability notions focus on particular algorithms, or learning rules, and measure their sensitivity to perturbations in the training set. In particular, it has been established that various forms of stability of the ERM are sufficient for learnability. Mukherjee et al [7] argue that since uniform convergence also implies stability of the ERM, and is necessary for (distribution-independent) learning in the supervised classification and regression setting, stability of the ERM is necessary and sufficient for learnability in that setting. It is important to emphasize that this characterization of stability as necessary for learnability goes through uniform convergence, i.e. the chain of implications is:

    Learnable with ERM ==> Uniform Convergence ==> ERM Stable ==> Learnable with ERM

However, the equivalence between (distribution-independent) consistency of empirical risk minimization and uniform convergence is specific to supervised classification and regression. The results of Alon et al [1] establishing this equivalence do not always hold in the General Learning Setting. In particular, we recently showed that in strongly convex stochastic optimization problems, the ERM is stable and thus consistent, even though the empirical risks do not converge to their expectations uniformly (Example 7.1, taken from [9]). Since the other implications in the chain above still hold in the general learning setting (e.g., uniform convergence implies stability and stability implies learnability by ERM), this example demonstrates that stability is a strictly more general sufficient condition for learnability.
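As a concrete illustration of (1) and (2), the following Python sketch computes empirical risks and returns an empirical risk minimizer over a finite hypothesis class. The particular choices (a grid of real-valued hypotheses, the squared objective f(h, z) = (h - z)^2, and D = Uniform[0, 1]) are illustrative assumptions only, not part of the formal setting.

    import random

    # Illustrative instance of the General Learning Setting (an assumption for
    # this sketch, not the paper's example): hypotheses are grid points in
    # [0, 1], instances are reals in [0, 1], and f(h, z) = (h - z)^2 <= B = 1.
    def f(h, z):
        return (h - z) ** 2

    def empirical_risk(h, sample):
        # F_S(h) = (1/m) * sum_i f(h, z_i)  -- equation (2)
        return sum(f(h, z) for z in sample) / len(sample)

    def erm(hypotheses, sample):
        # An ERM rule: return some hypothesis minimizing the empirical risk.
        return min(hypotheses, key=lambda h: empirical_risk(h, sample))

    if __name__ == "__main__":
        random.seed(0)
        H = [i / 100 for i in range(101)]          # finite hypothesis domain
        S = [random.random() for _ in range(50)]   # i.i.d. sample from D = Uniform[0, 1]
        h_hat = erm(H, S)
        print("ERM:", h_hat, " empirical risk:", round(empirical_risk(h_hat, S), 4))

Any bounded objective over any hypothesis and instance domain could be substituted here; the General Setting only requires access to f(h, z) on samples.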
A natural question is then whether, in the General Setting, stability is also necessary for learning. Here we establish that indeed, even in the general learning setting, (distribution-independent) stability of the ERM is necessary and sufficient for (distribution-independent) consistency of the ERM. The situation is therefore as follows:

    Uniform Convergence ==> ERM Stable <==> Learnable with ERM

We emphasize that, unlike the arguments of Mukherjee et al [7], the proof of necessity does not go through uniform convergence, allowing us to deal also with settings beyond supervised classification and regression.

The discussion above concerns only stability and learnability of the ERM. In supervised classification and regression there is no need to go beyond the ERM, since learnability is equivalent to learnability by

empirical risk minimization. But as we recently showed, there are learning problems in the General Setting which are learnable using some alternate learning rule, but in which the ERM is neither stable nor consistent (Example 7.2, taken from [9]). Stability of the ERM is therefore a sufficient, but not necessary, condition for learnability:

    Uniform Convergence ==> ERM Stable <==> Learnable with ERM ==> Learnable (the last implication cannot be reversed)

This prompts us to study the stability properties of non-ERM learning rules. We establish that, even in the General Setting, any consistent and generalizing learning rule (i.e. where in addition to consistency, the empirical risk is also a good estimate of the population risk) must be asymptotically empirical risk minimizing (AERM, see precise definition in Section 2). We thus focus on such rules and show that also for an AERM, stability is sufficient for consistency and generalization. The converse is a bit weaker for AERMs, though. We show that a strict notion of stability, which is necessary for ERM consistency, is not necessary for AERM consistency, and instead suggest a weaker notion (averaging out fluctuations across random training sets) that is necessary and sufficient for AERM consistency. Noting that any consistent learning rule can be transformed into a consistent and generalizing learning rule, we obtain a sharp characterization of learnability in terms of stability: learnability is equivalent to the existence of a stable AERM:

    Learnable <==> Exists Stable AERM <==> Learnable with AERM

2 The General Learning Setting

A learning problem is specified by a hypothesis domain H, an instance domain Z and an objective function (e.g. loss functional or cost function) f : H x Z -> R. Throughout this paper we assume the function is bounded by some constant B, i.e. |f(h, z)| <= B for all h in H and z in Z. A learning rule is a mapping A : Z^m -> H from sequences of instances in Z to hypotheses. We refer to sequences S = {z_1, ..., z_m} as sample sets, but it is important to remember that the order and multiplicity of instances may be significant. A learning rule that does not depend on the order is said to be symmetric. We will generally consider samples S ~ D^m of m i.i.d. draws from D.

A possible approach to learning is to minimize the empirical risk F_S(h). We say that a rule A is an ERM (Empirical Risk Minimizer) if it minimizes the empirical risk:

    F_S(A(S)) = F_S(ĥ_S) = min_{h in H} F_S(h),    (3)

where we use F_S(ĥ_S) = min_{h in H} F_S(h) to refer to the minimum empirical risk. But since there might be several hypotheses minimizing the empirical risk, ĥ_S does not refer to a specific hypothesis, and there might be many rules which are all ERMs. We say that a rule A is an AERM (Asymptotic Empirical Risk Minimizer) with rate ε_erm(m) under distribution D if:

    E_{S~D^m}[F_S(A(S)) − F_S(ĥ_S)] <= ε_erm(m).    (4)

Here and whenever talking about a rate ε(m), we require it to be monotone decreasing with ε(m) -> 0. A learning rule is universally an AERM with rate ε_erm(m) if it is an AERM with rate ε_erm(m) under all distributions D over Z. Returning to our goal of minimizing the expected risk, we say a rule A is consistent with rate ε_cons(m) under distribution D if for all m,

    E_{S~D^m}[F(A(S)) − F(h*)] <= ε_cons(m),    (5)

where we denote F(h*) = inf_{h in H} F(h). A rule is universally consistent with rate ε_cons(m) if it is consistent with rate ε_cons(m) under all distributions D over Z. A problem is said to be learnable if there exists some universally consistent learning rule for the problem. This definition of learnability, requiring a uniform rate for all distributions, is the relevant notion for studying learnability of a hypothesis class.
It is a direct generalization of agnostic PAC-learnability [4] to Vapnik's General Setting of Learning as studied by Haussler [3] and others.

We say a rule A generalizes with rate ε_gen(m) under distribution D if for all m,

    E_{S~D^m}|F(A(S)) − F_S(A(S))| <= ε_gen(m).    (6)

A rule universally generalizes with rate ε_gen(m) if it generalizes with rate ε_gen(m) under all distributions D over Z. We note that other authors sometimes define consistency, and thus also learnability, as a combination of our notions of consistency and generalization.

3 Stability

We define a sequence of progressively weaker notions of stability, all based on leave-one-out validation. For a sample S of size m, let S^{\i} = {z_1, ..., z_{i−1}, z_{i+1}, ..., z_m} be a sample of m−1 points obtained by deleting the i-th observation of S. All our measures of stability concern the effect deleting z_i has on f(h, z_i), where h is the hypothesis returned by the learning rule. That is, all measures consider the magnitude of f(A(S^{\i}); z_i) − f(A(S); z_i).

Definition 1. A rule A is uniform-LOO stable with rate ε_stable(m) if for all samples S of m points and for all i:

    |f(A(S^{\i}); z_i) − f(A(S); z_i)| <= ε_stable(m).

Definition 2. A rule A is all-i-LOO stable with rate ε_stable(m) under distribution D if for all i:

    E_{S~D^m}|f(A(S^{\i}); z_i) − f(A(S); z_i)| <= ε_stable(m).

Definition 3. A rule A is LOO stable with rate ε_stable(m) under distribution D if

    (1/m) Σ_{i=1}^m E_{S~D^m}|f(A(S^{\i}); z_i) − f(A(S); z_i)| <= ε_stable(m).

For symmetric learning rules, Definitions 2 and 3 are equivalent. Example 7.5 shows that the symmetry assumption is necessary, and that the two definitions are not equivalent for non-symmetric learning rules. Our weakest notion of stability, which we show is still enough to ensure learnability, is:

Definition 4. A rule A is on-average-LOO stable with rate ε_stable(m) under distribution D if

    |(1/m) Σ_{i=1}^m E_{S~D^m}[f(A(S^{\i}); z_i) − f(A(S); z_i)]| <= ε_stable(m).

We say that a rule is universally stable with rate ε_stable(m) if the stability property holds with rate ε_stable(m) for all distributions.

Claim 3.1. Uniform-LOO stability with rate ε_stable(m) implies all-i-LOO stability with rate ε_stable(m), which implies LOO stability with rate ε_stable(m), which implies on-average-LOO stability with rate ε_stable(m).

Relationship to Other Notions of Stability

Many different notions of stability, some under multiple names, have been suggested in the literature. In particular, our notion of all-i-LOO stability has been studied by several authors under different names: pointwise-hypothesis stability [2], CVloo stability [7], and cross-validation-(deletion) stability [8]. All are equivalent, though the rate is sometimes defined differently. Other authors define stability with respect to replacing, rather than deleting, an observation. E.g. CV stability [5] and cross-validation-(replacement) stability [8] are analogous to all-i-LOO stability, and average stability [8] is analogous to on-average-LOO stability for symmetric learning rules. In general the deletion and replacement variants of stability are incomparable; in Appendix A we briefly discuss how the results in this paper change if replacement stability is used. A much stronger notion is uniform stability [2], which is strictly stronger than any of our notions, and is sufficient for tight generalization bounds. However, this notion is far from necessary for learnability ([5] and Example 7.3 below). In the context of symmetric learning rules, all-i-LOO stability and LOO stability are equivalent. In order to treat non-symmetric rules more easily, we prefer working with LOO stability. For an elaborate discussion of the relationships between different notions of stability, see [5].

4 Main Results

We first establish that the existence of a stable AERM is sufficient for learning:

Theorem 4.1. If a rule is an AERM with rate ε_erm(m) and stable (under any of our definitions) with rate ε_stable(m) under D, then it is consistent and generalizes under D with rates

    ε_cons(m) <= 3ε_erm(m) + ε_stable(m+1) + 2B/(m+1),
    ε_gen(m) <= 4ε_erm(m) + ε_stable(m+1) + 6B/√m.

Corollary 4.2. If a rule is universally an AERM and stable, then it is universally consistent and generalizing.

Seeking a converse to the above, we first note that it is not possible to obtain a converse for each distribution D separately, i.e. a converse to Theorem 4.1. In Example 7.6, we show a specific learning problem and distribution D in which the ERM (in fact, any AERM) is consistent, but not stable, even under our weakest notion of stability. However, we are able to obtain a converse to Corollary 4.2, that is, to establish that a universally consistent ERM, or even AERM, must also be stable. For exact ERMs we have:

Theorem 4.3. For an ERM the following are equivalent: universal LOO stability; universal consistency; universal generalization.

Recall that for a symmetric rule, LOO stability and all-i-LOO stability are equivalent, and so consistency or generalization of a symmetric ERM (the typical case) also implies all-i-LOO stability. Theorem 4.3 only guarantees LOO stability as a necessary condition for consistency. Example 7.3 (adapted from [5]) establishes that we cannot strengthen the condition to uniform-LOO stability, or any stronger definition: there exists a learning problem for which the ERM is universally consistent, but not uniform-LOO stable.
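For any concrete learning rule, the leave-one-out quantities appearing in Definitions 3 and 4 can be estimated by straightforward Monte Carlo simulation. The Python sketch below does this for an assumed toy setting (squared objective on [0, 1], with the sample mean as the ERM); the setting and the estimator are illustrative assumptions, not part of the paper's development. For this rule both estimates shrink with m, and the LOO estimate is always at least the on-average estimate, as Claim 3.1's ordering requires.

    import random

    # Assumed toy setting: instances uniform on [0, 1], f(h, z) = (h - z)^2,
    # and A(S) is the ERM over the reals, i.e. the sample mean.
    def f(h, z):
        return (h - z) ** 2

    def rule(S):
        return sum(S) / len(S)

    def loo_stability_estimates(m, trials=300, seed=0):
        """Monte-Carlo estimates of the LOO (Definition 3) and on-average-LOO
        (Definition 4) quantities for the rule above under D = Uniform[0, 1]."""
        rng = random.Random(seed)
        loo_sum, on_avg_sum = 0.0, 0.0
        for _ in range(trials):
            S = [rng.random() for _ in range(m)]
            h_full = rule(S)
            for i, z_i in enumerate(S):
                h_del = rule(S[:i] + S[i + 1:])       # A(S^{\i})
                diff = f(h_del, z_i) - f(h_full, z_i)
                loo_sum += abs(diff) / m              # absolute value inside the average
                on_avg_sum += diff / m                # absolute value taken at the end
        return loo_sum / trials, abs(on_avg_sum / trials)

    if __name__ == "__main__":
        for m in (10, 40, 160):
            loo, oavg = loo_stability_estimates(m)
            print(f"m={m:4d}   LOO ~ {loo:.5f}   on-average-LOO ~ {oavg:.5f}")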
For AERMs, we obtain a weaker converse, ensuring only on-average-LOO stability:

Theorem 4.4. For an AERM, the following are equivalent: universal on-average-LOO stability; universal consistency; universal generalization.

On-average-LOO stability is strictly weaker than LOO stability, but this is the best that can be assured: in Example 7.4 we present a learning problem and an AERM that is universally consistent, but is not LOO stable. The exact rate conversions of Theorems 4.3 and 4.4 are specified in the corresponding proofs (Section 6), and are all polynomial. In particular, an ε_cons-universally consistent ε_erm-AERM is on-average-LOO stable with rate

    ε_stable(m) <= 3ε_erm(m−1) + 3ε_cons((m−1)^{1/4}) + 6B/√(m−1).

The above results apply only to AERMs, for which we also see that universal consistency and generalization are equivalent. Next we show that if in fact we seek universal consistency and generalization, then we must consider only AERMs:

Theorem 4.5. If a rule A is universally consistent with rate ε_cons(m) and generalizing with rate ε_gen(m), then it is universally an AERM with rate

    ε_erm(m) <= ε_gen(m) + 3ε_cons(m^{1/4}) + 4B/√m.

Combining Theorems 4.4 and 4.5, we get that the existence of a universally on-average-LOO stable AERM is a necessary (and sufficient) condition for the existence of some universally consistent and generalizing rule. As we show in Example 7.7, there might still be a universally consistent learning rule (hence the problem is learnable by our definition) that is not stable even by our weakest definition (and is neither an AERM nor generalizing). Nevertheless, any universally consistent learning rule can be transformed into a universally consistent and generalizing learning rule (Lemma 6.11). Thus by Theorems 4.5 and 4.4 this rule must also be a stable AERM, establishing:

Theorem 4.6. A learning problem is learnable if and only if there exists a universally on-average-LOO stable AERM. In particular, if there exists an ε_cons-universally consistent rule, then there exists a rule that is ε_stable-on-average-LOO stable and an ε_erm-AERM, where

    ε_erm(m) = 3ε_cons(m^{1/4}) + 7B/√m,    (7)
    ε_stable(m) = 6ε_cons((m−1)^{1/4}) + 9B/√(m−1).

5 Comparison with Prior Work

5.1 Theorem 4.1 and Corollary 4.2

The consistency implication in Theorem 4.1 (specifically, that all-i-LOO stability of an AERM implies consistency) was established by Mukherjee et al [7, Theorem 3.5]. As for the generalization guarantee, Rakhlin et al [8] prove that for an ERM, on-average-LOO stability is equivalent to generalization. For more general learning rules, [2] attempted to show that all-i-LOO stability implies generalization. However, Mukherjee et al [7] (in Remark 3, pg. 173) provide a simple counterexample and note that the proof of [2] is wrong, and that in fact all-i-LOO stability alone is not enough to ensure generalization. To correct this, Mukherjee et al [7] introduced an additional condition, referred to as Eloo_err stability, which together with all-i-LOO stability ensures generalization. For AERMs, they use arguments specific to supervised learning, arguing that universal consistency implies uniform convergence, and establish generalization only via this route. And so, Mukherjee et al obtain a version of Corollary 4.2 that is specific to supervised learning. In summary, comparing Theorem 4.1 to previous work, our results extend the generalization guarantee also to AERMs in the general learning setting. Rakhlin et al [8] also show that the replacement (rather than deletion) version of stability implies generalization for any rule (even a non-AERM), and hence consistency for AERMs. Recall that the deletion and replacement versions are not equivalent. We are not aware of strong converses for the replacement variant.

5.2 Converse Results

Mukherjee et al [7] argue that all-i-LOO stability of the ERM (in fact, of any AERM) is also necessary for ERM universal consistency and thus learnability. However, their arguments are specific to supervised learning, and establish stability only via uniform convergence of F_S(h) to F(h), as discussed in the introduction. As we now know, in the general learning setting, ERM consistency is not equivalent to this uniform convergence, and furthermore, there might be a non-ERM universally consistent rule even though the ERM is not universally consistent. Our results here, by contrast, apply to the general learning setting and do not use uniform convergence arguments. For an ERM, Rakhlin et al [8] show that generalization is equivalent to on-average-LOO stability, for any distribution and without resorting to uniform convergence arguments. This provides a partial converse to Theorem 4.1. However, our results extend to AERMs as well and, more importantly, provide a converse to AERM consistency rather than just generalization. This distinction between consistency and generalization is important, as there are situations with consistent but not stable AERMs (Example 7.6), or even universally consistent learning rules which are neither stable, generalizing nor AERMs (Example 7.7). Another converse result that does not use uniform convergence arguments, but is specific to the realizable binary learning setting, was given by Kutin and Niyogi [5]. They show that in this setting, for any distribution D, all-i-LOO stability of the ERM under D is necessary for ERM consistency under D.
This is a much stronger form of converse, as it applies to any specific distribution separately, rather than requiring universal consistency. However, not only is it specific to supervised learning, it further requires the distribution to be realizable (i.e. zero error is achievable). As we show in Example 7.6, a distribution-specific converse is not possible in the general setting. All the papers cited above focus on symmetric learning rules, where all-i-LOO stability is equivalent to LOO stability. We prefer not to limit our attention to symmetric rules, and instead use LOO stability.

6 Detailed Results and Proofs

We first establish that for AERMs, on-average-LOO stability and generalization are equivalent, and that for ERMs the equivalence extends also to LOO stability. This extends the work of Rakhlin et al [8] from ERMs to AERMs, and with somewhat better rate conversions.

6.1 Equivalence of Stability and Generalization

It will be convenient to work with a weaker version of generalization as an intermediate step. We say a rule A on-average generalizes with rate ε_oag(m) under distribution D if for all m,

    |E_{S~D^m}[F(A(S)) − F_S(A(S))]| <= ε_oag(m).    (8)

It is straightforward to see that generalization implies on-average generalization with the same rate. We show that for AERMs the converse is also true, and also that on-average generalization is equivalent to on-average stability, establishing the equivalence between generalization and on-average stability (for AERMs).

Lemma 6.1 (For AERMs: on-average generalization <==> on-average stability). Let A be an AERM with rate ε_erm(m) under D. If A is on-average generalizing with rate ε_oag(m), then it is on-average-LOO stable with rate ε_oag(m−1) + 2ε_erm(m−1) + 2B/m. If A is on-average-LOO stable with rate ε_stable(m), then it is on-average generalizing with rate ε_stable(m+1) + 2ε_erm(m) + 2B/m.

Proof. For the ERMs of S and S^{\i} we have |F_{S^{\i}}(ĥ_{S^{\i}}) − F_S(ĥ_S)| <= 2B/m, and so, since A is an AERM,

    E|F_S(A(S)) − F_{S^{\i}}(A(S^{\i}))| <= 2ε_erm(m−1) + 2B/m.    (9)

generalization ==> stability: Applying (8) to S^{\i} and combining with (9) we have |E[F(A(S^{\i})) − F_S(A(S))]| <= ε_oag(m−1) + 2ε_erm(m−1) + 2B/m, which does not actually depend on i, hence:

    ε_oag(m−1) + 2ε_erm(m−1) + 2B/m
    >= |E_i E[F(A(S^{\i})) − F_S(A(S))]|    (10)
    = |E_i [E_{S^{\i}, z_i} f(A(S^{\i}), z_i) − E_S f(A(S), z_i)]|
    = |E E_i [f(A(S^{\i}), z_i) − f(A(S), z_i)]|,    (11)

which establishes on-average stability.

stability ==> generalization: Bounding (11) by ε_stable(m) and working back, we get that (10) is also bounded by ε_stable(m). Combined with (9) we get

    |E[F(A(S^{\i})) − F_{S^{\i}}(A(S^{\i}))]| <= ε_stable(m) + 2ε_erm(m−1) + 2B/m,

which establishes on-average generalization.

Lemma 6.2 (AERM + on-average generalization ==> generalization). If A is an AERM with rate ε_erm(m) and on-average generalizes with rate ε_oag(m) under D, then A generalizes with rate ε_oag(m) + 2ε_erm(m) + 2B/√m under D.

Proof. Using the respective optimalities of ĥ_S and h* we can bound:

    F_S(A(S)) − F(A(S))
    = [F_S(A(S)) − F_S(ĥ_S)] + [F_S(ĥ_S) − F_S(h*)] + [F_S(h*) − F(h*)] + [F(h*) − F(A(S))]
    <= [F_S(A(S)) − F_S(ĥ_S)] + [F_S(h*) − F(h*)] =: Y,    (12)

where the final equality defines a new random variable Y. By Lemma 6.3 and the AERM guarantee we have E|Y| <= ε_erm(m) + B/√m. From Lemma 6.4 we can conclude that

    E|F_S(A(S)) − F(A(S))| <= E[F(A(S)) − F_S(A(S))] + 2E|Y| <= ε_oag(m) + 2ε_erm(m) + 2B/√m.

Utility Lemma 6.3. For i.i.d. X_1, ..., X_m with |X_i| <= B and X̄ = (1/m) Σ_i X_i, we have E|X̄ − E X̄| <= B/√m.

Proof. E|X̄ − E X̄| <= √(Var X̄) = √(Var X_1 / m) <= B/√m.

Utility Lemma 6.4. Let X, Y be random variables such that X <= Y almost surely. Then E|X| <= 2E|Y| − EX.

Proof. Denote a_+ = max(0, a) and observe that X <= Y implies X_+ <= Y_+ (this holds when both have the same sign, and when X <= 0 <= Y, while Y < 0 < X is not possible). We therefore have E X_+ <= E Y_+ <= E|Y|. Also note that |X| = 2X_+ − X. We can now calculate: E|X| = E[2X_+ − X] = 2E X_+ − E X <= 2E|Y| − E X.

For an exact ERM, we get a stronger equivalence:

Lemma 6.5 (ERM + on-average-LOO stable ==> LOO stable). If an exact ERM A is on-average-LOO stable with rate ε_stable(m) under D, then it is also LOO stable under D with the same rate.

Proof. By the optimality of ĥ_S = A(S):

    f(ĥ_{S^{\i}}, z_i) − f(ĥ_S, z_i)
    = m [F_S(ĥ_{S^{\i}}) − F_S(ĥ_S)] + (m−1)[F_{S^{\i}}(ĥ_S) − F_{S^{\i}}(ĥ_{S^{\i}})] >= 0.    (13)

Then using on-average-LOO stability:

    (1/m) Σ_i E|f(ĥ_{S^{\i}}, z_i) − f(ĥ_S, z_i)| = |(1/m) Σ_i E[f(ĥ_{S^{\i}}, z_i) − f(ĥ_S, z_i)]| <= ε_stable(m).

Lemma 6.5 can be extended to AERMs with rate o(1/m). However, for AERMs with a slower rate, or at least with rate Ω(1/m), Example 7.4 establishes that this stronger converse is not possible. We have now established the stability <==> generalization parts of Theorems 4.1, 4.3 and 4.4 (in fact, an even slightly stronger converse than in Theorems 4.3 and 4.4, as it does not require universality).

6.2 A Sufficient Condition for Consistency

It is also fairly straightforward to see that generalization (or even on-average generalization) of an AERM implies its consistency:

Lemma 6.6 (AERM + generalization ==> consistency). If A is an AERM with rate ε_erm(m) and on-average generalizes with rate ε_oag(m) under D, then it is consistent with rate ε_oag(m) + ε_erm(m) under D.

Proof.

    E[F(A(S))] − F(h*) = E[F(A(S))] − E[F_S(h*)]
    = E[F(A(S)) − F_S(A(S))] + E[F_S(A(S)) − F_S(h*)]
    <= E[F(A(S)) − F_S(A(S))] + E[F_S(A(S)) − F_S(ĥ_S)]
    <= ε_oag(m) + ε_erm(m).

Combined with the results of Section 6.1, this completes the proof of Theorem 4.1 and the stability ==> consistency and generalization ==> consistency parts of Theorems 4.3 and 4.4.

6.3 Converse Direction

Lemma 6.1 already provides a converse result, establishing that stability is necessary for generalization.
However, in order to establish that stability is also necessary for universal consistency, we must prove that universal consistency of an AERM implies universal generalization. Note that consistency of an AERM under a specific distribution does not imply generalization nor stability (Example 7.6); we must instead rely on universal consistency. The main tool we use is the following lemma:

Lemma 6.7 (Main Converse Lemma). If a problem is learnable, i.e. there exists a universally consistent rule A with rate ε_cons(m), then under any distribution,

    E|F_S(ĥ_S) − F(h*)| <= ε_emp(m),    where ε_emp(m) = 2ε_cons(n) + 2B/√(m−n) + 2Bn²/m,

for any sequence n = n(m) such that n -> ∞ and n = o(√m).

Proof. Let I = {I_1, ..., I_n} be a random sample of n indexes in the range 1..m, where each I_i is independently uniformly distributed and I is independent of S. Let S' = {z_{I_i}}, i.e. a sample of size n drawn from the uniform distribution over instances in S (with replacement). We first bound the probability that I has repeated indexes ("duplicates"):

    Pr(I has duplicates) <= Σ_{i=1}^n (i−1)/m <= n²/(2m).    (14)

Conditioned on not having duplicates in I, the sample S' is actually distributed according to D^n, i.e. it can be viewed as a sample from the original distribution. We therefore have, by universal consistency,

    E[F(A(S')) − F(h*) | no duplicates] <= ε_cons(n).    (15)

But viewing S' as a sample drawn from the uniform distribution over instances in S, we also have

    E|F_S(A(S')) − F_S(ĥ_S)| <= ε_cons(n).    (16)

Conditioned on having no duplicates in I, S \ S' (i.e. those instances in S not chosen by I) is independent of S', and |S \ S'| = m − n, so by Lemma 6.3,

    E|F(A(S')) − F_{S\S'}(A(S'))| <= B/√(m−n).    (17)

Finally, if there are no duplicates, then for any hypothesis, and in particular for A(S'), we have

    |F_S(A(S')) − F_{S\S'}(A(S'))| <= 2Bn/m.    (18)

Combining (15), (16), (17) and (18), accounting for a maximal discrepancy of B when we do have duplicates, and assuming n²/(2m) <= 1/2, we get the desired bound.

In the supervised learning setting, Lemma 6.7 is just an immediate consequence of learnability being equivalent to consistency and generalization of the ERM. However, the lemma applies also in the General Setting, where universal consistency might be achieved only by a non-ERM. The lemma states that if a problem is learnable, then even though the ERM might not be consistent (as in, e.g., Example 7.2), the empirical risk achieved by the ERM is in fact an asymptotically unbiased estimator of F(h*).

Equipped with Lemma 6.7, we are now ready to show that universal consistency of an AERM implies generalization, and that any universally consistent and generalizing rule must be an AERM. What we show is actually a bit stronger: if a problem is learnable, and so Lemma 6.7 holds, then for any distribution D separately, consistency of an AERM under D implies generalization under D, and also any consistent and generalizing rule under D must be an AERM.

Lemma 6.8 (learnable + AERM + consistent ==> generalizing). If Lemma 6.7 holds with rate ε_emp(m), and A is an ε_erm-AERM and ε_cons-consistent under D, then it is generalizing under D with rate ε_emp(m) + ε_erm(m) + ε_cons(m).

Proof.

    E|F_S(A(S)) − F(A(S))| <= E|F_S(A(S)) − F_S(ĥ_S)| + E|F(h*) − F(A(S))| + E|F_S(ĥ_S) − F(h*)|
    <= ε_erm(m) + ε_cons(m) + ε_emp(m).

Lemma 6.9 (learnable + consistent + generalizing ==> AERM). If Lemma 6.7 holds with rate ε_emp(m), and A is ε_cons-consistent and ε_gen-generalizing under D, then it is an AERM under D with rate ε_emp(m) + ε_gen(m) + ε_cons(m).

Proof.

    E|F_S(A(S)) − F_S(ĥ_S)| <= E|F_S(A(S)) − F(A(S))| + E|F(A(S)) − F(h*)| + E|F(h*) − F_S(ĥ_S)|
    <= ε_gen(m) + ε_cons(m) + ε_emp(m).

Lemma 6.8 establishes that universal consistency of an AERM implies universal generalization, and thus completes the proof of Theorems 4.3 and 4.4. Lemma 6.9 establishes Theorem 4.5. To get the rates in Section 4, we use n = m^{1/4} in Lemma 6.7. Lemmas 6.6, 6.8 and 6.9 together establish an interesting relationship:

Corollary 6.10. For a (universally) learnable problem, for any distribution D and learning rule A, any two of the following imply the third: A is an AERM under D; A is consistent under D; A generalizes under D.
Note, however, that any one property by itself is possible, even universally:
- The ERM in Example 7.2 is neither consistent nor generalizing, despite the problem being learnable.
- Example 7.7 demonstrates a universally consistent learning rule which is neither generalizing nor an AERM.
- A rule returning a fixed hypothesis always generalizes, but of course need not be consistent nor an AERM.

In contrast, for learnable supervised classification and regression problems, it is not possible for a learning rule to be just universally consistent without being an AERM and without generalizing; nor is it possible for a learning rule to be a universal AERM for a learnable problem without being generalizing and consistent. Corollary 6.10 can also provide a certificate of non-learnability. E.g., for the problem in Example 7.6 we exhibit a specific distribution for which there is a consistent AERM that does not generalize. We can conclude that there is no universally consistent learning rule for the problem, as otherwise the corollary would be violated.

6.4 Existence of a Stable Rule

Theorems 4.5 and 4.4, which we have just completed proving, already establish that for AERMs, universal consistency is equivalent to universal on-average-LOO stability. Existence of a universally on-average-LOO stable AERM is thus sufficient for learnability. In order to prove that it is also necessary, it is enough to show that the existence of a universally consistent learning rule implies the existence of a universally consistent AERM; this AERM must then be on-average-LOO stable by Theorem 4.4. We actually show how to transform a consistent rule into a consistent and generalizing rule. If this rule is universally consistent, then by Lemma 6.9 we can conclude it must be an AERM, and by Lemma 6.1 that it must be on-average-LOO stable.

Lemma 6.11. For any rule A there exists a rule A' such that: A' universally generalizes with rate 3B/√m; and for any D, if A is ε_cons-consistent under D then A' is ε_cons(⌊√m⌋)-consistent under D.

Proof. For a sample S of size m, let S' be the sub-sample consisting of the first ⌊√m⌋ observations in S, and define A'(S) = A(S'). That is, A' applies A to only √m of the m observations in S.

A' generalizes: We can decompose

    F_S(A(S')) − F(A(S')) = (√m/m)[F_{S'}(A(S')) − F(A(S'))] + ((m−√m)/m)[F_{S\S'}(A(S')) − F(A(S'))].

The first term is bounded in magnitude by 2B/√m. As for the second term, S \ S' is statistically independent of S', and so we can use Lemma 6.3 to bound its expected magnitude, obtaining

    E|F_S(A(S')) − F(A(S'))| <= 2B/√m + ((m−√m)/m) · B/√(m−√m) <= 3B/√m.

A' is consistent: If A is consistent, then

    E[F(A'(S))] − inf_{h in H} F(h) = E[F(A(S'))] − inf_{h in H} F(h) <= ε_cons(⌊√m⌋).    (19)

Proof of Converse in Theorem 4.6. If there exists a universally consistent rule with rate ε_cons(m), then by Lemma 6.11 there exists A' which is universally consistent and generalizing. Choosing n = m^{1/4} in Lemma 6.7 and applying Lemmas 6.9 and 6.1, we get the rates specified in (7).

Remark. We can strengthen the above theorem to show the existence of an on-average-LOO stable, always AERM (i.e. a rule which for every sample approximately minimizes F_S(h)). The new learning rule for this purpose chooses the hypothesis returned by the original rule whenever its empirical risk is small, and chooses an ERM otherwise. The proof is completed via Markov's inequality, to bound the probability that we do not choose the hypothesis returned by the original learning rule.

7 Examples

Our first example (taken from [9]) shows that uniform convergence is not necessary for ERM consistency, i.e. universal ERM consistency can hold without uniform convergence. Of course, this can also happen in trivial settings where there is one hypothesis h_0 which dominates all other hypotheses (i.e. f(h_0, z) < f(h, z) for all z and all h ≠ h_0) [10]. However, the example below demonstrates a non-trivial situation with ERM universal consistency but no uniform convergence: there is no dominating hypothesis, and finding the optimal hypothesis does require learning. In particular, unlike trivial problems with a dominating hypothesis, in the example below there is not even local uniform convergence, i.e. there is no uniform convergence even among hypotheses that are close to being population-optimal.

Example 7.1. There exists a learning problem for which any ERM is universally consistent, but the empirical risks do not converge uniformly to their expectations.

Proof. Consider a convex stochastic optimization problem given by

    f(w; (x, α)) = ‖α ∗ (w − x)‖² + ‖w‖² = Σ_i α_i (w_i − x_i)² + ‖w‖²,

where ∗ denotes element-wise product, w is the hypothesis, w and x are elements of the unit ball around the origin of a Hilbert space with a countably infinite orthonormal basis e_1, e_2, ..., and α is an infinite binary sequence.
Here α_i is the i-th coordinate of α, w_i := ⟨w, e_i⟩, and x_i is defined similarly. In our other submission [9], we show that the ERM is stable, hence consistent. However, when x = 0 a.s. and the coordinates of α are i.i.d. uniform on {0, 1}, there is no uniform convergence, not even locally. To see why, note that for a random sample S of any finite size, with probability one there exists an "excluded" basis vector e_j, i.e. one with α_j = 0 for all (x, α) in S. For any t > 0 we then have F(t e_j) − F_S(t e_j) >= t²/2, regardless of the sample size. Setting t = 1 establishes sup_w (F(w) − F_S(w)) >= 1/2 even as m -> ∞, and so there is no uniform convergence. Choosing t arbitrarily small, we see that even when F(t e_j) is close to optimal, the deviations F(w) − F_S(w) still do not converge to zero as m -> ∞.

Perhaps more surprisingly, the next example (also taken from [9]) shows that in the general setting, learnability might require using a non-ERM.

Example 7.2. There exists a learning problem with a universally consistent learning rule, but for which no ERM is universally consistent.

Proof. Consider the same hypothesis space and sample space as before, with

    f(w, z) = ‖α ∗ (w − x)‖² + ε Σ_i 2^{−i} (w_i − 1)²,

where ε = 0.01. When x = 0 a.s. and the coordinates of α are i.i.d. uniform, the ERM must have ‖ŵ‖ = 1. To see why, note that

for an excluded e_j (which exists a.s.), increasing ŵ_j towards one decreases the objective. But since ‖ŵ‖ = 1, we have F(ŵ) >= 1/2, while inf_w F(w) <= F(0) = ε, and so F(ŵ) − inf_w F(w) >= 1/2 − ε. On the other hand, A(S) = argmin_w (F_S(w) + (1/√m)‖w‖²) is a uniformly-LOO stable AERM and hence, by Theorem 4.1, universally consistent.

In the next three examples, we show that in a certain sense, Theorem 4.3 and Theorem 4.4 cannot be improved with stronger stability notions. Viewed differently, they also constitute separation results between our various stability notions, and show which are strictly stronger than the others. Example 7.4 also demonstrates the gap between supervised learning and the general learning setting, by presenting a learning problem and an AERM that is universally consistent, but not LOO stable.

Example 7.3. There exists a learning problem with a universally consistent and all-i-LOO stable learning rule, but no universally consistent and uniform-LOO stable learning rule.

Proof. This example is taken from [5]. Consider the hypothesis space {0, 1}, the instance space {0, 1}, and the objective function f(h, z) = |h − z|. It is straightforward to verify that an ERM is a universally consistent learning rule. It is also universally all-i-LOO stable, because removing an instance can change the hypothesis only if the original sample had an equal number of 0's and 1's (plus or minus one), which happens with probability at most O(1/√m), where m is the sample size. However, it is not hard to see that the only uniform-LOO stable learning rule, at least for large enough sample sizes, is a constant rule which always returns the same hypothesis h regardless of the sample. Such a learning rule is obviously not universally consistent.

Example 7.4. There exists a learning problem with a universally consistent (and on-average-LOO stable) AERM which is not LOO stable.

Proof. Let the instance space, hypothesis space and objective function be as in Example 7.3. Consider the following learning rule, based on a sample S = (z_1, ..., z_m): if (1/m) Σ_i 1_{[z_i = 1]} > 1/2 + √(log(4)/(2m)), return 1; if (1/m) Σ_i 1_{[z_i = 1]} < 1/2 − √(log(4)/(2m)), return 0; otherwise, return Parity(S) = (z_1 + ... + z_m) mod 2. This learning rule is an AERM with rate ε_erm(m) = 2√(log(4)/(2m)). Since we have only two hypotheses, we have uniform convergence of F_S(·) to F(·), and therefore our learning rule universally generalizes (with rate ε_gen(m) <= 1/(2√m)); by Theorem 4.4, this implies that the learning rule is also universally consistent and on-average-LOO stable. However, the learning rule is not LOO stable. Consider the uniform distribution on the instance space. By Hoeffding's inequality, |(1/m) Σ_i 1_{[z_i = 1]} − 1/2| <= √(log(4)/(2m)) with probability at least 1/2, for any sample size. In that case, the returned hypothesis is the parity function (even when we remove an instance from the sample, assuming m >= 3). When this happens, it is not hard to see that for any i,

    f(A(S), z_i) − f(A(S^{\i}), z_i) = 1_{[z_i = 1]} (−1)^{Parity(S)}.

This implies that

    (1/m) Σ_{i=1}^m E|f(A(S^{\i}); z_i) − f(A(S); z_i)| >= (1/2)(1/2 − √(log(4)/(2m))),    (20)

which does not converge to zero with the sample size. Therefore, the learning rule is not LOO stable. Note that the proof implies that on-average-LOO stability cannot be replaced even by stability notions weaker than LOO stability. For instance, a natural stability notion intermediate between on-average-LOO stability and LOO stability is

    E_{S~D^m} |(1/m) Σ_{i=1}^m (f(A(S^{\i}); z_i) − f(A(S); z_i))|,    (21)

where the absolute value is now over the entire sum, but inside the expectation.
In the example used in the proof, (21) is still lower bounded by the right-hand side of (20), which does not converge to zero with the sample size.

Example 7.5. There exists a learning problem with a universally consistent and LOO-stable AERM which is not symmetric and is not all-i-LOO stable.

Proof. Let the instance space be [0, 1], the hypothesis space [0, 2], and the objective function f(h, z) = 1_{[h = z]}. Consider the following learning rule A: given a sample, check whether the value z_1 appears more than once in the sample. If not, return z_1; otherwise return 2. Since F_S(2) = 0, and z_1 is returned only if this value constitutes 1/m of the sample, the rule above is an AERM with rate ε_erm(m) = 1/m. To see universal consistency, let Pr(z_1) = p. With probability (1−p)^{m−1}, z_1 is not among z_2, ..., z_m, and the returned hypothesis is z_1, with F(z_1) = p; otherwise the returned hypothesis is 2, with F(2) = 0. Hence E_S[F(A(S))] <= p(1−p)^{m−1}, which is easily verified to be at most 1/m, so the learning rule is consistent with rate ε_cons(m) <= 1/m. To see LOO stability, notice that the returned hypothesis can change by deleting z_i, i > 1, only if z_i is the only instance among z_2, ..., z_m equal to z_1, so ε_stable(m) <= 2/m (in fact, LOO stability holds even without the expectation). However, this learning rule is not all-i-LOO stable: for any continuous distribution, |f(A(S^{\1}), z_1) − f(A(S), z_1)| = 1 with probability 1, so it cannot be all-i-LOO stable with respect to i = 1.
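The rule of Example 7.4 is simple enough to simulate directly, reusing the estimation pattern from the earlier sketch. The rule and objective below are exactly those of Example 7.4; the Monte-Carlo harness around them is an illustrative assumption. In the output, the LOO estimate stays bounded away from zero while the on-average estimate stays near zero, matching the separation claimed in the example.

    import math
    import random

    def f(h, z):
        return abs(h - z)

    def rule(S):
        # The learning rule of Example 7.4.
        m = len(S)
        t = math.sqrt(math.log(4) / (2 * m))
        frac_ones = sum(S) / m
        if frac_ones > 0.5 + t:
            return 1
        if frac_ones < 0.5 - t:
            return 0
        return sum(S) % 2                              # Parity(S)

    def stability_estimates(m, trials=300, seed=0):
        rng = random.Random(seed)
        loo_sum, on_avg_sum = 0.0, 0.0                 # Definitions 3 and 4
        for _ in range(trials):
            S = [rng.randint(0, 1) for _ in range(m)]  # uniform distribution on {0, 1}
            h_full = rule(S)
            for i, z_i in enumerate(S):
                diff = f(rule(S[:i] + S[i + 1:]), z_i) - f(h_full, z_i)
                loo_sum += abs(diff) / m
                on_avg_sum += diff / m
        return loo_sum / trials, abs(on_avg_sum / trials)

    if __name__ == "__main__":
        for m in (11, 41, 161):
            loo, oavg = stability_estimates(m)
            # LOO stays near 1/4; the on-average quantity is near 0 (up to noise).
            print(f"m={m:4d}   LOO ~ {loo:.3f}   on-average-LOO ~ {oavg:.3f}")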

[Figure 1 (diagram omitted): Implications of various properties of learning problems — Uniform Convergence, ERM Strict Consistency, ERM Stability, ERM Consistency, All AERMs Stable and Consistent, Exists Stable AERM, Exists Consistent AERM, Learnability. Consistency refers to universal consistency and stability refers to universal on-average-LOO stability.]

Next we show that for specific distributions, even ERM consistency does not imply our weakest notion of stability.

Example 7.6. There exists a learning problem and a distribution on the instance space such that the ERM (or any AERM) is consistent but is not on-average-LOO stable.

Proof. Let the instance space be [0, 1], let the hypothesis space consist of all finite subsets of [0, 1], and define the objective function as f(h, z) = 1_{[z ∉ h]}. Consider any continuous distribution on the instance space. Since the underlying distribution D is continuous, we have F(h) = 1 for any hypothesis h. Therefore, any learning rule (including any AERM) is consistent, with F(A(S)) = 1. On the other hand, the ERM here always achieves F_S(ĥ_S) = 0, so any AERM cannot generalize, or even on-average generalize (by Lemma 6.2), and hence cannot be on-average-LOO stable (by Lemma 6.1).

Finally, the following example shows that while learnability is equivalent to the existence of stable and consistent AERMs (Theorem 4.4 and Theorem 4.6), there might still exist other learning rules which are neither of the above.

Example 7.7. There exists a learning problem with a universally consistent learning rule which is not on-average-LOO stable, not generalizing, and not an AERM.

Proof. Let the instance space be [0, 1], let the hypothesis space consist of all finite subsets of [0, 1], and let the objective function be the indicator function f(h, z) = 1_{[z ∈ h]}. Consider the following learning rule: given a sample S ⊆ [0, 1], the learning rule checks whether there are any two identical instances in the sample. If so, the learning rule returns the empty set; otherwise, it returns the sample itself. This learning rule is not an AERM, nor does it necessarily generalize or satisfy on-average-LOO stability. Consider any continuous distribution on [0, 1]. The learning rule always returns a countable set A(S), with F_S(A(S)) = 1, while F_S(∅) = 0 (so it is not an AERM) and F(A(S)) = 0 (so it does not generalize). Also, f(A(S), z_i) = 1 while f(A(S^{\i}), z_i) = 0 with probability 1, so it is not on-average-LOO stable either. However, the learning rule is universally consistent. If the underlying distribution is continuous on [0, 1], then the returned hypothesis is S, which is countable; hence F(S) = 0 = inf_h F(h). For discrete distributions, let M_1 denote the proportion of instances in the sample which appear exactly once, and let M_0 be the probability mass of instances which did not appear in the sample. Using [6, Theorem 3], we have that for any δ, with probability at least 1 − δ over a sample of size m,

    |M_0 − M_1| <= O(√(log(1/δ)/m)),

uniformly for any discrete distribution. If this event occurs, then either M_1 < 1, or M_0 >= 1 − O(√(log(1/δ)/m)). In the first case, there are duplicate instances in the sample, so the returned hypothesis is the optimal ∅; in the second case, the returned hypothesis is the sample, which has total probability mass at most O(√(log(1/δ)/m)), and therefore F(A(S)) <= O(√(log(1/δ)/m)). As a result, regardless of the underlying distribution, with probability at least 1 − δ over the sample,

    F(A(S)) <= O(√(log(1/δ)/m)).

Since the right-hand side converges to 0 with m for any δ, it is easy to see that the learning rule is universally consistent.
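The rule of Example 7.7 can likewise be written out directly. The short Python sketch below uses exactly that rule and objective; the choice of a continuous distribution (Uniform[0, 1]) is the illustrative assumption. It exhibits empirical risk 1 for the returned hypothesis against empirical risk 0 for the ERM, i.e. consistency without generalization or the AERM property.

    import random

    # The rule and objective of Example 7.7.
    def f(h, z):
        return 1.0 if z in h else 0.0

    def rule(S):
        # Return the empty set if the sample contains duplicates, else the sample.
        return frozenset() if len(set(S)) < len(S) else frozenset(S)

    def empirical_risk(h, S):
        return sum(f(h, z) for z in S) / len(S)

    if __name__ == "__main__":
        rng = random.Random(0)
        S = [rng.random() for _ in range(100)]   # duplicates occur with probability 0
        h = rule(S)
        print("F_S(A(S))      =", empirical_risk(h, S))             # 1.0: not an AERM
        print("F_S(empty set) =", empirical_risk(frozenset(), S))   # 0.0: the ERM value
        # F(A(S)) = 0 exactly, since a finite set has measure zero under a
        # continuous D -- so A is consistent (inf_h F(h) = 0) yet neither
        # generalizing nor an AERM, matching Example 7.7.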
8 Discussion

In the familiar setting of supervised classification or regression, the question of learnability is reduced to that of uniform convergence of empirical risks to their expectations, and in turn to finiteness of the fat-shattering dimension. Furthermore, due to the equivalence of learnability and uniform convergence, there is no need to look beyond the ERM. We recently showed [9] that the situation in the General Learning Setting is substantially more complex: universal ERM consistency need not be equivalent to uniform convergence, and furthermore, learnability might be possible only with a non-ERM. We are therefore in need of a new understanding of the question of learnability that applies more broadly than just to supervised classification and regression.

In studying learnability in the General Setting, Vapnik [10] focuses solely on empirical risk minimization, which we now know is not sufficient for understanding learnability (e.g. Example 7.2). Furthermore, for empirical risk minimization, Vapnik establishes uniform convergence as a necessary and sufficient condition not for ERM consistency, but rather for strict consistency of the ERM. We now know that even in rather non-trivial problems (e.g. Example 7.1, taken from [9]), where the ERM is consistent and generalizes, strict consistency does not hold. Furthermore, Example 7.1 also demonstrates that ERM stability guarantees ERM consistency, but not strict consistency, perhaps giving another indication that strict consistency might be too strict (this and other relationships are depicted in Figure 1). In Examples 7.1 and 7.2 we see that stability is a strictly more general sufficient condition for learnability. This makes stability an appealing candidate for understanding learnability in the more general setting. Indeed, we show that stability is not only sufficient, but also necessary for learning, even in the General Learning Setting. A previous such characterization was based on uniform convergence and thus applied only to supervised

classification and regression [7]. Extending the characterization beyond these settings is particularly interesting, since for supervised classification and regression the question of learnability is already essentially solved. Extending the characterization without relying on uniform convergence also allows us to frame stability as the core condition guaranteeing learnability, with uniform convergence only a sufficient, but not necessary, condition for stability (see Figure 1).

In studying the question of learnability and its relation to stability, we encounter several differences between this more general setting and settings such as supervised classification and regression, where learnability is equivalent to uniform convergence. We summarize some of these distinctions:

Perhaps the most important distinction is that in the General Setting, learnability might be possible only with a non-ERM. In this paper we establish that if a problem is learnable, although it might not be learnable with an ERM, it must be learnable with some AERM. And so, in the General Setting we must look beyond empirical risk minimization, but not beyond asymptotic empirical risk minimization.

In supervised classification and regression, if one AERM is universally consistent then all AERMs are universally consistent. In the General Setting we must choose the AERM carefully.

In supervised classification and regression, a universally consistent rule must also generalize and be an AERM. In the General Setting, a universally consistent rule need not generalize nor be an AERM, as Example 7.7 demonstrates. However, Theorem 4.5 establishes that, even in the General Setting, if a rule is universally consistent and generalizing then it must be an AERM. This gives us another reason not to look beyond asymptotic empirical risk minimization, even in the General Setting.

The above distinctions can also be seen through Corollary 6.10, which concerns the relationship between the AERM property, consistency and generalization in learnable problems. In the General Setting, any two conditions imply the third, but it is possible for any one condition to hold without the others. In supervised classification and regression, if a problem is learnable then generalization always holds (for any rule), and so universal consistency and the AERM property imply each other.

In supervised classification and regression, ERM inconsistency for some distribution is enough to establish non-learnability. Establishing non-learnability in the General Setting is trickier, since one must consider all AERMs. We show how Corollary 6.10 can provide a certificate of non-learnability, in the form of a rule that is consistent and an AERM for some specific distribution, but does not generalize (Example 7.6).

In the General Setting, universal consistency of an AERM only guarantees on-average-LOO stability, not LOO stability as in the supervised classification setting [7]. As we show in Example 7.4, this is a real difference and not merely a deficiency of our proofs.

We have begun exploring the issue of learnability in the General Setting, and uncovered important relationships between learnability and stability, but many problems are left open. Throughout the paper we ignored the issue of obtaining high-confidence concentration guarantees: we chose to use convergence in expectation, and defined the rates as rates on the expectation.
Since the objective f is bounded, convergence in expectation is equivalent to convergence in probability, and using Markov's inequality we can translate a rate of the form E[·] <= ε(m) into a low-confidence guarantee Pr(· > ε(m)/δ) <= δ. Can we also obtain exponential concentration results of the form Pr(· > ε(m) · polylog(1/δ)) <= δ? It is possible to construct examples in the General Setting in which convergence in expectation of the stability does not imply exponential concentration of consistency and generalization. Is it possible to show that exponential concentration of stability is equivalent to exponential concentration of consistency and generalization?

We showed that existence of an on-average-LOO stable AERM is necessary and sufficient for learnability (Theorem 4.6). Although specific AERMs might be universally consistent and generalizing without being LOO stable (Example 7.4), it might still be possible to show that for a learnable problem there always exists some LOO stable AERM. This would tighten our converse result and establish existence of a LOO stable AERM as equivalent to learnability.

Even existence of a LOO stable AERM is not as elegant and simple as having finite VC dimension or fat-shattering dimension. It would be very interesting to derive equivalent but more combinatorial conditions for learnability.

References

[1] N. Alon, S. Ben-David, N. Cesa-Bianchi, and D. Haussler. Scale-sensitive dimensions, uniform convergence, and learnability. J. ACM, 44(4):615-631, 1997.
[2] O. Bousquet and A. Elisseeff. Stability and generalization. J. Mach. Learn. Res., 2:499-526, 2002.
[3] D. Haussler. Decision theoretic generalizations of the PAC model for neural net and other learning applications. Information and Computation, 100(1):78-150, 1992.
[4] M. J. Kearns, R. E. Schapire, and L. M. Sellie. Toward efficient agnostic learning. In Proc. of COLT, 1992.
[5] S. Kutin and P. Niyogi. Almost-everywhere algorithmic stability and generalization error. In Proc. of UAI, 2002.
[6] D. A. McAllester and R. E. Schapire. On the convergence rate of Good-Turing estimators. In Proc. of COLT, 2000.
[7] S. Mukherjee, P. Niyogi, T. Poggio, and R. M. Rifkin. Learning theory: stability is sufficient for generalization and necessary and sufficient for consistency of empirical risk minimization. Adv. Comput. Math., 25(1-3):161-193, 2006.
[8] S. Rakhlin, S. Mukherjee, and T. Poggio. Stability results in learning theory. Analysis and Applications, 3(4):397-419, 2005.
[9] S. Shalev-Shwartz, O. Shamir, N. Srebro, and K. Sridharan. Stochastic convex optimization. In Proc. of COLT, 2009.
[10] V. N. Vapnik. The Nature of Statistical Learning Theory. Springer, 1995.

A Replacement Stability

We used the leave-one-out version of stability throughout the paper; however, many of the results hold when we use the replacement versions instead. Here we briefly survey the differences in the main results as they apply to replacement-based stabilities. Let S^{(i)} denote the sample S with z_i replaced by some other z'_i drawn from the same unknown distribution D.

Definition 5. A rule A is uniform-RO stable with rate ε_stable(m) if for all samples S of m points and all z'_1, ..., z'_m, z' in Z:

    (1/m) Σ_{i=1}^m |f(A(S^{(i)}); z') − f(A(S); z')| <= ε_stable(m).

Definition 6. A rule A is on-average-RO stable with rate ε_stable(m) under distribution D if

    |(1/m) Σ_{i=1}^m E_{S~D^m; z'_1,...,z'_m ~ D}[f(A(S^{(i)}); z_i) − f(A(S); z_i)]| <= ε_stable(m).

With the above definitions replacing uniform-LOO stability and on-average-LOO stability respectively, all theorems in Section 4 other than Theorem 4.3 hold (i.e. Theorem 4.1, Corollary 4.2, Theorem 4.4 and Theorem 4.6). We do not know how to obtain a replacement variant of Theorem 4.3: even for a consistent ERM, we can only guarantee on-average-RO stability (as in Theorem 4.4), and we do not know whether this is enough to ensure RO stability. However, although for ERMs we can only obtain a weaker converse, we can guarantee the existence of an AERM that is not only on-average-RO stable but actually uniform-RO stable. That is, we get a much stronger variant of Theorem 4.6:

Theorem A.1. A learning problem is learnable if and only if there exists a uniform-RO stable AERM.

Proof. Clearly if there exists any rule A that is uniform-RO stable and an AERM, then the problem is learnable, since the learning rule A is in fact universally consistent by Theorem 4.1. On the other hand, if there exists a rule A that is universally consistent, consider the rule A' constructed in Lemma 6.11. As shown in the lemma, this rule is consistent. Now note that A' only uses the first √m samples of S. Hence for i > √m we have A'(S^{(i)}) = A'(S), and so

    (1/m) Σ_{i=1}^m |f(A'(S^{(i)}); z') − f(A'(S); z')| = (1/m) Σ_{i <= √m} |f(A'(S^{(i)}); z') − f(A'(S); z')| <= 2B√m/m = 2B/√m.

We thus showed that this rule is consistent, generalizes, and is 2B/√m-uniformly RO stable.
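Operationally, the construction used in this proof (and in Lemma 6.11) just runs the original rule on the first ⌊√m⌋ observations. A minimal Python sketch follows; the base rule chosen here (the sample mean, i.e. the ERM of a squared objective) is an arbitrary stand-in, since the construction is agnostic to what A does.

    import math
    import random

    def mean_rule(S):
        # Stand-in base rule A (an assumption for this sketch): the sample mean,
        # i.e. the ERM of the squared objective over real-valued hypotheses.
        return sum(S) / len(S)

    def prefix_rule(S, base_rule=mean_rule):
        # A'(S) = A(S'), where S' is the first floor(sqrt(m)) observations of S.
        k = max(1, math.isqrt(len(S)))
        return base_rule(S[:k])

    if __name__ == "__main__":
        rng = random.Random(0)
        S = [rng.random() for _ in range(100)]
        h = prefix_rule(S)
        S_replaced = list(S)
        S_replaced[50] = rng.random()      # replace an instance beyond the prefix
        # The output is unchanged, so the corresponding term of the uniform-RO
        # stability sum is exactly zero; only the first sqrt(m) indices can
        # contribute, giving the 2B*sqrt(m)/m = 2B/sqrt(m) rate.
        print(prefix_rule(S_replaced) == h)   # True

Because only the length-⌊√m⌋ prefix is ever read, the rule trades statistical efficiency (it learns from √m samples) for this very strong form of replacement stability.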


Lecture October 23. Scribes: Ruixin Qiang and Alana Shine CSCI699: Topics in Learning and Gae Theory Lecture October 23 Lecturer: Ilias Scribes: Ruixin Qiang and Alana Shine Today s topic is auction with saples. 1 Introduction to auctions Definition 1. In a single

More information

Tight Complexity Bounds for Optimizing Composite Objectives

Tight Complexity Bounds for Optimizing Composite Objectives Tight Coplexity Bounds for Optiizing Coposite Objectives Blake Woodworth Toyota Technological Institute at Chicago Chicago, IL, 60637 blake@ttic.edu Nathan Srebro Toyota Technological Institute at Chicago

More information

1 Bounding the Margin

1 Bounding the Margin COS 511: Theoretical Machine Learning Lecturer: Rob Schapire Lecture #12 Scribe: Jian Min Si March 14, 2013 1 Bounding the Margin We are continuing the proof of a bound on the generalization error of AdaBoost

More information

A Theoretical Framework for Deep Transfer Learning

A Theoretical Framework for Deep Transfer Learning A Theoretical Fraewor for Deep Transfer Learning Toer Galanti The School of Coputer Science Tel Aviv University toer22g@gail.co Lior Wolf The School of Coputer Science Tel Aviv University wolf@cs.tau.ac.il

More information

Improved Guarantees for Agnostic Learning of Disjunctions

Improved Guarantees for Agnostic Learning of Disjunctions Iproved Guarantees for Agnostic Learning of Disjunctions Pranjal Awasthi Carnegie Mellon University pawasthi@cs.cu.edu Avri Blu Carnegie Mellon University avri@cs.cu.edu Or Sheffet Carnegie Mellon University

More information

Boosting with log-loss

Boosting with log-loss Boosting with log-loss Marco Cusuano-Towner Septeber 2, 202 The proble Suppose we have data exaples {x i, y i ) i =... } for a two-class proble with y i {, }. Let F x) be the predictor function with the

More information

The Weierstrass Approximation Theorem

The Weierstrass Approximation Theorem 36 The Weierstrass Approxiation Theore Recall that the fundaental idea underlying the construction of the real nubers is approxiation by the sipler rational nubers. Firstly, nubers are often deterined

More information

Intelligent Systems: Reasoning and Recognition. Perceptrons and Support Vector Machines

Intelligent Systems: Reasoning and Recognition. Perceptrons and Support Vector Machines Intelligent Systes: Reasoning and Recognition Jaes L. Crowley osig 1 Winter Seester 2018 Lesson 6 27 February 2018 Outline Perceptrons and Support Vector achines Notation...2 Linear odels...3 Lines, Planes

More information

Polygonal Designs: Existence and Construction

Polygonal Designs: Existence and Construction Polygonal Designs: Existence and Construction John Hegean Departent of Matheatics, Stanford University, Stanford, CA 9405 Jeff Langford Departent of Matheatics, Drake University, Des Moines, IA 5011 G

More information

Block designs and statistics

Block designs and statistics Bloc designs and statistics Notes for Math 447 May 3, 2011 The ain paraeters of a bloc design are nuber of varieties v, bloc size, nuber of blocs b. A design is built on a set of v eleents. Each eleent

More information

3.3 Variational Characterization of Singular Values

3.3 Variational Characterization of Singular Values 3.3. Variational Characterization of Singular Values 61 3.3 Variational Characterization of Singular Values Since the singular values are square roots of the eigenvalues of the Heritian atrices A A and

More information

Pattern Recognition and Machine Learning. Learning and Evaluation for Pattern Recognition

Pattern Recognition and Machine Learning. Learning and Evaluation for Pattern Recognition Pattern Recognition and Machine Learning Jaes L. Crowley ENSIMAG 3 - MMIS Fall Seester 2017 Lesson 1 4 October 2017 Outline Learning and Evaluation for Pattern Recognition Notation...2 1. The Pattern Recognition

More information

A Note on Scheduling Tall/Small Multiprocessor Tasks with Unit Processing Time to Minimize Maximum Tardiness

A Note on Scheduling Tall/Small Multiprocessor Tasks with Unit Processing Time to Minimize Maximum Tardiness A Note on Scheduling Tall/Sall Multiprocessor Tasks with Unit Processing Tie to Miniize Maxiu Tardiness Philippe Baptiste and Baruch Schieber IBM T.J. Watson Research Center P.O. Box 218, Yorktown Heights,

More information

Efficient Learning of Generalized Linear and Single Index Models with Isotonic Regression

Efficient Learning of Generalized Linear and Single Index Models with Isotonic Regression Efficient Learning of Generalized Linear and Single Index Models with Isotonic Regression Sha M Kakade Microsoft Research and Wharton, U Penn skakade@icrosoftco Varun Kanade SEAS, Harvard University vkanade@fasharvardedu

More information

New Bounds for Learning Intervals with Implications for Semi-Supervised Learning

New Bounds for Learning Intervals with Implications for Semi-Supervised Learning JMLR: Workshop and Conference Proceedings vol (1) 1 15 New Bounds for Learning Intervals with Iplications for Sei-Supervised Learning David P. Helbold dph@soe.ucsc.edu Departent of Coputer Science, University

More information

List Scheduling and LPT Oliver Braun (09/05/2017)

List Scheduling and LPT Oliver Braun (09/05/2017) List Scheduling and LPT Oliver Braun (09/05/207) We investigate the classical scheduling proble P ax where a set of n independent jobs has to be processed on 2 parallel and identical processors (achines)

More information

Kernel Methods and Support Vector Machines

Kernel Methods and Support Vector Machines Intelligent Systes: Reasoning and Recognition Jaes L. Crowley ENSIAG 2 / osig 1 Second Seester 2012/2013 Lesson 20 2 ay 2013 Kernel ethods and Support Vector achines Contents Kernel Functions...2 Quadratic

More information

Symmetrization and Rademacher Averages

Symmetrization and Rademacher Averages Stat 928: Statistical Learning Theory Lecture: Syetrization and Radeacher Averages Instructor: Sha Kakade Radeacher Averages Recall that we are interested in bounding the difference between epirical and

More information

A Theoretical Analysis of a Warm Start Technique

A Theoretical Analysis of a Warm Start Technique A Theoretical Analysis of a War Start Technique Martin A. Zinkevich Yahoo! Labs 701 First Avenue Sunnyvale, CA Abstract Batch gradient descent looks at every data point for every step, which is wasteful

More information

Chapter 6 1-D Continuous Groups

Chapter 6 1-D Continuous Groups Chapter 6 1-D Continuous Groups Continuous groups consist of group eleents labelled by one or ore continuous variables, say a 1, a 2,, a r, where each variable has a well- defined range. This chapter explores:

More information

Handout 7. and Pr [M(x) = χ L (x) M(x) =? ] = 1.

Handout 7. and Pr [M(x) = χ L (x) M(x) =? ] = 1. Notes on Coplexity Theory Last updated: October, 2005 Jonathan Katz Handout 7 1 More on Randoized Coplexity Classes Reinder: so far we have seen RP,coRP, and BPP. We introduce two ore tie-bounded randoized

More information

Support Vector Machines. Goals for the lecture

Support Vector Machines. Goals for the lecture Support Vector Machines Mark Craven and David Page Coputer Sciences 760 Spring 2018 www.biostat.wisc.edu/~craven/cs760/ Soe of the slides in these lectures have been adapted/borrowed fro aterials developed

More information

The sample complexity of agnostic learning with deterministic labels

The sample complexity of agnostic learning with deterministic labels The sample complexity of agnostic learning with deterministic labels Shai Ben-David Cheriton School of Computer Science University of Waterloo Waterloo, ON, N2L 3G CANADA shai@uwaterloo.ca Ruth Urner College

More information

The Hilbert Schmidt version of the commutator theorem for zero trace matrices

The Hilbert Schmidt version of the commutator theorem for zero trace matrices The Hilbert Schidt version of the coutator theore for zero trace atrices Oer Angel Gideon Schechtan March 205 Abstract Let A be a coplex atrix with zero trace. Then there are atrices B and C such that

More information

Sharp Time Data Tradeoffs for Linear Inverse Problems

Sharp Time Data Tradeoffs for Linear Inverse Problems Sharp Tie Data Tradeoffs for Linear Inverse Probles Saet Oyak Benjain Recht Mahdi Soltanolkotabi January 016 Abstract In this paper we characterize sharp tie-data tradeoffs for optiization probles used

More information

On Poset Merging. 1 Introduction. Peter Chen Guoli Ding Steve Seiden. Keywords: Merging, Partial Order, Lower Bounds. AMS Classification: 68W40

On Poset Merging. 1 Introduction. Peter Chen Guoli Ding Steve Seiden. Keywords: Merging, Partial Order, Lower Bounds. AMS Classification: 68W40 On Poset Merging Peter Chen Guoli Ding Steve Seiden Abstract We consider the follow poset erging proble: Let X and Y be two subsets of a partially ordered set S. Given coplete inforation about the ordering

More information

Feature Extraction Techniques

Feature Extraction Techniques Feature Extraction Techniques Unsupervised Learning II Feature Extraction Unsupervised ethods can also be used to find features which can be useful for categorization. There are unsupervised ethods that

More information

Tight Information-Theoretic Lower Bounds for Welfare Maximization in Combinatorial Auctions

Tight Information-Theoretic Lower Bounds for Welfare Maximization in Combinatorial Auctions Tight Inforation-Theoretic Lower Bounds for Welfare Maxiization in Cobinatorial Auctions Vahab Mirrokni Jan Vondrák Theory Group, Microsoft Dept of Matheatics Research Princeton University Redond, WA 9805

More information

e-companion ONLY AVAILABLE IN ELECTRONIC FORM

e-companion ONLY AVAILABLE IN ELECTRONIC FORM OPERATIONS RESEARCH doi 10.1287/opre.1070.0427ec pp. ec1 ec5 e-copanion ONLY AVAILABLE IN ELECTRONIC FORM infors 07 INFORMS Electronic Copanion A Learning Approach for Interactive Marketing to a Custoer

More information

Stochastic Subgradient Methods

Stochastic Subgradient Methods Stochastic Subgradient Methods Lingjie Weng Yutian Chen Bren School of Inforation and Coputer Science University of California, Irvine {wengl, yutianc}@ics.uci.edu Abstract Stochastic subgradient ethods

More information

1 Identical Parallel Machines

1 Identical Parallel Machines FB3: Matheatik/Inforatik Dr. Syaantak Das Winter 2017/18 Optiizing under Uncertainty Lecture Notes 3: Scheduling to Miniize Makespan In any standard scheduling proble, we are given a set of jobs J = {j

More information

Learnability, Stability, Regularization and Strong Convexity

Learnability, Stability, Regularization and Strong Convexity Learnability, Stability, Regularization and Strong Convexity Nati Srebro Shai Shalev-Shwartz HUJI Ohad Shamir Weizmann Karthik Sridharan Cornell Ambuj Tewari Michigan Toyota Technological Institute Chicago

More information

Testing Properties of Collections of Distributions

Testing Properties of Collections of Distributions Testing Properties of Collections of Distributions Reut Levi Dana Ron Ronitt Rubinfeld April 9, 0 Abstract We propose a fraework for studying property testing of collections of distributions, where the

More information

Stability Bounds for Stationary ϕ-mixing and β-mixing Processes

Stability Bounds for Stationary ϕ-mixing and β-mixing Processes Journal of Machine Learning Research (200 66-686 Subitted /08; Revised /0; Published 2/0 Stability Bounds for Stationary ϕ-ixing and β-ixing Processes Mehryar Mohri Courant Institute of Matheatical Sciences

More information

A Bernstein-Markov Theorem for Normed Spaces

A Bernstein-Markov Theorem for Normed Spaces A Bernstein-Markov Theore for Nored Spaces Lawrence A. Harris Departent of Matheatics, University of Kentucky Lexington, Kentucky 40506-0027 Abstract Let X and Y be real nored linear spaces and let φ :

More information

This article appeared in a journal published by Elsevier. The attached copy is furnished to the author for internal non-commercial research and

This article appeared in a journal published by Elsevier. The attached copy is furnished to the author for internal non-commercial research and This article appeared in a ournal published by Elsevier. The attached copy is furnished to the author for internal non-coercial research and education use, including for instruction at the authors institution

More information

Supplement to: Subsampling Methods for Persistent Homology

Supplement to: Subsampling Methods for Persistent Homology Suppleent to: Subsapling Methods for Persistent Hoology A. Technical results In this section, we present soe technical results that will be used to prove the ain theores. First, we expand the notation

More information

On Conditions for Linearity of Optimal Estimation

On Conditions for Linearity of Optimal Estimation On Conditions for Linearity of Optial Estiation Erah Akyol, Kuar Viswanatha and Kenneth Rose {eakyol, kuar, rose}@ece.ucsb.edu Departent of Electrical and Coputer Engineering University of California at

More information

Lecture 20 November 7, 2013

Lecture 20 November 7, 2013 CS 229r: Algoriths for Big Data Fall 2013 Prof. Jelani Nelson Lecture 20 Noveber 7, 2013 Scribe: Yun Willia Yu 1 Introduction Today we re going to go through the analysis of atrix copletion. First though,

More information

Stability Bounds for Stationary ϕ-mixing and β-mixing Processes

Stability Bounds for Stationary ϕ-mixing and β-mixing Processes Journal of Machine Learning Research (200) 789-84 Subitted /08; Revised /0; Published 2/0 Stability Bounds for Stationary ϕ-ixing and β-ixing Processes Mehryar Mohri Courant Institute of Matheatical Sciences

More information

arxiv: v1 [cs.ds] 17 Mar 2016

arxiv: v1 [cs.ds] 17 Mar 2016 Tight Bounds for Single-Pass Streaing Coplexity of the Set Cover Proble Sepehr Assadi Sanjeev Khanna Yang Li Abstract arxiv:1603.05715v1 [cs.ds] 17 Mar 2016 We resolve the space coplexity of single-pass

More information

Non-Parametric Non-Line-of-Sight Identification 1

Non-Parametric Non-Line-of-Sight Identification 1 Non-Paraetric Non-Line-of-Sight Identification Sinan Gezici, Hisashi Kobayashi and H. Vincent Poor Departent of Electrical Engineering School of Engineering and Applied Science Princeton University, Princeton,

More information

The Simplex Method is Strongly Polynomial for the Markov Decision Problem with a Fixed Discount Rate

The Simplex Method is Strongly Polynomial for the Markov Decision Problem with a Fixed Discount Rate The Siplex Method is Strongly Polynoial for the Markov Decision Proble with a Fixed Discount Rate Yinyu Ye April 20, 2010 Abstract In this note we prove that the classic siplex ethod with the ost-negativereduced-cost

More information

Some Classical Ergodic Theorems

Some Classical Ergodic Theorems Soe Classical Ergodic Theores Matt Rosenzweig Contents Classical Ergodic Theores. Mean Ergodic Theores........................................2 Maxial Ergodic Theore.....................................

More information

Fairness via priority scheduling

Fairness via priority scheduling Fairness via priority scheduling Veeraruna Kavitha, N Heachandra and Debayan Das IEOR, IIT Bobay, Mubai, 400076, India vavitha,nh,debayan}@iitbacin Abstract In the context of ulti-agent resource allocation

More information

A Low-Complexity Congestion Control and Scheduling Algorithm for Multihop Wireless Networks with Order-Optimal Per-Flow Delay

A Low-Complexity Congestion Control and Scheduling Algorithm for Multihop Wireless Networks with Order-Optimal Per-Flow Delay A Low-Coplexity Congestion Control and Scheduling Algorith for Multihop Wireless Networks with Order-Optial Per-Flow Delay Po-Kai Huang, Xiaojun Lin, and Chih-Chun Wang School of Electrical and Coputer

More information

A theory of learning from different domains

A theory of learning from different domains DOI 10.1007/s10994-009-5152-4 A theory of learning fro different doains Shai Ben-David John Blitzer Koby Craer Alex Kulesza Fernando Pereira Jennifer Wortan Vaughan Received: 28 February 2009 / Revised:

More information

A := A i : {A i } S. is an algebra. The same object is obtained when the union in required to be disjoint.

A := A i : {A i } S. is an algebra. The same object is obtained when the union in required to be disjoint. 59 6. ABSTRACT MEASURE THEORY Having developed the Lebesgue integral with respect to the general easures, we now have a general concept with few specific exaples to actually test it on. Indeed, so far

More information

On the Inapproximability of Vertex Cover on k-partite k-uniform Hypergraphs

On the Inapproximability of Vertex Cover on k-partite k-uniform Hypergraphs On the Inapproxiability of Vertex Cover on k-partite k-unifor Hypergraphs Venkatesan Guruswai and Rishi Saket Coputer Science Departent Carnegie Mellon University Pittsburgh, PA 1513. Abstract. Coputing

More information

arxiv: v2 [stat.ml] 23 Feb 2016

arxiv: v2 [stat.ml] 23 Feb 2016 Perutational Radeacher Coplexity A New Coplexity Measure for Transductive Learning Ilya Tolstikhin 1, Nikita Zhivotovskiy 3, and Gilles Blanchard 4 arxiv:1505.0910v stat.ml 3 Feb 016 1 Max-Planck-Institute

More information

On Process Complexity

On Process Complexity On Process Coplexity Ada R. Day School of Matheatics, Statistics and Coputer Science, Victoria University of Wellington, PO Box 600, Wellington 6140, New Zealand, Eail: ada.day@cs.vuw.ac.nz Abstract Process

More information

Efficient Learning with Partially Observed Attributes

Efficient Learning with Partially Observed Attributes Nicolò Cesa-Bianchi DSI, Università degli Studi di Milano, Italy Shai Shalev-Shwartz The Hebrew University, Jerusale, Israel Ohad Shair The Hebrew University, Jerusale, Israel Abstract We describe and

More information

COS 424: Interacting with Data. Written Exercises

COS 424: Interacting with Data. Written Exercises COS 424: Interacting with Data Hoework #4 Spring 2007 Regression Due: Wednesday, April 18 Written Exercises See the course website for iportant inforation about collaboration and late policies, as well

More information

Stability Bounds for Non-i.i.d. Processes

Stability Bounds for Non-i.i.d. Processes tability Bounds for Non-i.i.d. Processes Mehryar Mohri Courant Institute of Matheatical ciences and Google Research 25 Mercer treet New York, NY 002 ohri@cis.nyu.edu Afshin Rostaiadeh Departent of Coputer

More information

Math Real Analysis The Henstock-Kurzweil Integral

Math Real Analysis The Henstock-Kurzweil Integral Math 402 - Real Analysis The Henstock-Kurzweil Integral Steven Kao & Jocelyn Gonzales April 28, 2015 1 Introduction to the Henstock-Kurzweil Integral Although the Rieann integral is the priary integration

More information

Fixed-to-Variable Length Distribution Matching

Fixed-to-Variable Length Distribution Matching Fixed-to-Variable Length Distribution Matching Rana Ali Ajad and Georg Böcherer Institute for Counications Engineering Technische Universität München, Gerany Eail: raa2463@gail.co,georg.boecherer@tu.de

More information

arxiv: v2 [math.co] 3 Dec 2008

arxiv: v2 [math.co] 3 Dec 2008 arxiv:0805.2814v2 [ath.co] 3 Dec 2008 Connectivity of the Unifor Rando Intersection Graph Sion R. Blacburn and Stefanie Gere Departent of Matheatics Royal Holloway, University of London Egha, Surrey TW20

More information

Information Theoretic Guarantees for Empirical Risk Minimization with Applications to Model Selection and Large-Scale Optimization

Information Theoretic Guarantees for Empirical Risk Minimization with Applications to Model Selection and Large-Scale Optimization with Applications to Model Selection and Large-Scale Optiization Ibrahi Alabdulohsin 1 Abstract In this paper, we derive bounds on the utual inforation of the epirical risk iniization (ERM) procedure for

More information

Robustness and Regularization of Support Vector Machines

Robustness and Regularization of Support Vector Machines Robustness and Regularization of Support Vector Machines Huan Xu ECE, McGill University Montreal, QC, Canada xuhuan@ci.cgill.ca Constantine Caraanis ECE, The University of Texas at Austin Austin, TX, USA

More information

Model Fitting. CURM Background Material, Fall 2014 Dr. Doreen De Leon

Model Fitting. CURM Background Material, Fall 2014 Dr. Doreen De Leon Model Fitting CURM Background Material, Fall 014 Dr. Doreen De Leon 1 Introduction Given a set of data points, we often want to fit a selected odel or type to the data (e.g., we suspect an exponential

More information

Prerequisites. We recall: Theorem 2 A subset of a countably innite set is countable.

Prerequisites. We recall: Theorem 2 A subset of a countably innite set is countable. Prerequisites 1 Set Theory We recall the basic facts about countable and uncountable sets, union and intersection of sets and iages and preiages of functions. 1.1 Countable and uncountable sets We can

More information

Bootstrapping Dependent Data

Bootstrapping Dependent Data Bootstrapping Dependent Data One of the key issues confronting bootstrap resapling approxiations is how to deal with dependent data. Consider a sequence fx t g n t= of dependent rando variables. Clearly

More information

The degree of a typical vertex in generalized random intersection graph models

The degree of a typical vertex in generalized random intersection graph models Discrete Matheatics 306 006 15 165 www.elsevier.co/locate/disc The degree of a typical vertex in generalized rando intersection graph odels Jerzy Jaworski a, Michał Karoński a, Dudley Stark b a Departent

More information

Lower Bounds for Quantized Matrix Completion

Lower Bounds for Quantized Matrix Completion Lower Bounds for Quantized Matrix Copletion Mary Wootters and Yaniv Plan Departent of Matheatics University of Michigan Ann Arbor, MI Eail: wootters, yplan}@uich.edu Mark A. Davenport School of Elec. &

More information

New upper bound for the B-spline basis condition number II. K. Scherer. Institut fur Angewandte Mathematik, Universitat Bonn, Bonn, Germany.

New upper bound for the B-spline basis condition number II. K. Scherer. Institut fur Angewandte Mathematik, Universitat Bonn, Bonn, Germany. New upper bound for the B-spline basis condition nuber II. A proof of de Boor's 2 -conjecture K. Scherer Institut fur Angewandte Matheati, Universitat Bonn, 535 Bonn, Gerany and A. Yu. Shadrin Coputing

More information

Ensemble Based on Data Envelopment Analysis

Ensemble Based on Data Envelopment Analysis Enseble Based on Data Envelopent Analysis So Young Sohn & Hong Choi Departent of Coputer Science & Industrial Systes Engineering, Yonsei University, Seoul, Korea Tel) 82-2-223-404, Fax) 82-2- 364-7807

More information

LORENTZ SPACES AND REAL INTERPOLATION THE KEEL-TAO APPROACH

LORENTZ SPACES AND REAL INTERPOLATION THE KEEL-TAO APPROACH LORENTZ SPACES AND REAL INTERPOLATION THE KEEL-TAO APPROACH GUILLERMO REY. Introduction If an operator T is bounded on two Lebesgue spaces, the theory of coplex interpolation allows us to deduce the boundedness

More information

Foundations of Machine Learning Boosting. Mehryar Mohri Courant Institute and Google Research

Foundations of Machine Learning Boosting. Mehryar Mohri Courant Institute and Google Research Foundations of Machine Learning Boosting Mehryar Mohri Courant Institute and Google Research ohri@cis.nyu.edu Weak Learning Definition: concept class C is weakly PAC-learnable if there exists a (weak)

More information

Estimating Entropy and Entropy Norm on Data Streams

Estimating Entropy and Entropy Norm on Data Streams Estiating Entropy and Entropy Nor on Data Streas Ait Chakrabarti 1, Khanh Do Ba 1, and S. Muthukrishnan 2 1 Departent of Coputer Science, Dartouth College, Hanover, NH 03755, USA 2 Departent of Coputer

More information

Bounds on the Minimax Rate for Estimating a Prior over a VC Class from Independent Learning Tasks

Bounds on the Minimax Rate for Estimating a Prior over a VC Class from Independent Learning Tasks Bounds on the Miniax Rate for Estiating a Prior over a VC Class fro Independent Learning Tasks Liu Yang Steve Hanneke Jaie Carbonell Deceber 01 CMU-ML-1-11 School of Coputer Science Carnegie Mellon University

More information

Introduction to Machine Learning. Recitation 11

Introduction to Machine Learning. Recitation 11 Introduction to Machine Learning Lecturer: Regev Schweiger Recitation Fall Seester Scribe: Regev Schweiger. Kernel Ridge Regression We now take on the task of kernel-izing ridge regression. Let x,...,

More information

This model assumes that the probability of a gap has size i is proportional to 1/i. i.e., i log m e. j=1. E[gap size] = i P r(i) = N f t.

This model assumes that the probability of a gap has size i is proportional to 1/i. i.e., i log m e. j=1. E[gap size] = i P r(i) = N f t. CS 493: Algoriths for Massive Data Sets Feb 2, 2002 Local Models, Bloo Filter Scribe: Qin Lv Local Models In global odels, every inverted file entry is copressed with the sae odel. This work wells when

More information

Research Article On the Isolated Vertices and Connectivity in Random Intersection Graphs

Research Article On the Isolated Vertices and Connectivity in Random Intersection Graphs International Cobinatorics Volue 2011, Article ID 872703, 9 pages doi:10.1155/2011/872703 Research Article On the Isolated Vertices and Connectivity in Rando Intersection Graphs Yilun Shang Institute for

More information

Curious Bounds for Floor Function Sums

Curious Bounds for Floor Function Sums 1 47 6 11 Journal of Integer Sequences, Vol. 1 (018), Article 18.1.8 Curious Bounds for Floor Function Sus Thotsaporn Thanatipanonda and Elaine Wong 1 Science Division Mahidol University International

More information