Learnability and Stability in the General Learning Setting


Shai Shalev-Shwartz (TTI-Chicago), Ohad Shamir (The Hebrew University), Nathan Srebro (TTI-Chicago), Karthik Sridharan (TTI-Chicago)

Abstract

We establish that stability is necessary and sufficient for learning, even in the General Learning Setting, where uniform convergence conditions are not necessary for learning, and where learning might only be possible with a non-ERM learning rule. This goes beyond previous work on the relationship between stability and learnability, which focused on supervised classification and regression, where learnability is equivalent to uniform convergence and it is enough to consider the ERM.

1 Introduction

We consider the General Setting of Learning [10], where we would like to minimize a population risk functional (stochastic objective)

    F(h) = E_{Z~D}[f(h; Z)]    (1)

where the distribution D of Z is unknown, based on an i.i.d. sample z_1, ..., z_m drawn from D (and full knowledge of the function f). This General Setting subsumes supervised classification and regression, certain unsupervised learning problems, density estimation and stochastic optimization. For example, in supervised learning z = (x, y) is an instance-label pair, h is a predictor, and f(h, (x, y)) = loss(h(x), y) is the loss functional.

For supervised classification and regression problems, it is well known that a problem is learnable (see precise definition in Section 2) if and only if the empirical risks

    F_S(h) = (1/m) Σ_{i=1}^m f(h, z_i)    (2)

converge uniformly to their expectations. If uniform convergence holds, then the empirical risk minimizer (ERM) is consistent, i.e. the population risk of the ERM converges to the optimal population risk, and the problem is learnable using the ERM. That is, learnability is equivalent to learnability by ERM, and so we can focus our attention solely on empirical risk minimizers.

Stability has also been suggested as an explicit alternate condition for learnability. Intuitively, stability notions focus on particular algorithms, or learning rules, and measure their sensitivity to perturbations in the training set. In particular, it has been established that various forms of stability of the ERM are sufficient for learnability. Mukherjee et al [7] argue that since uniform convergence also implies stability of the ERM, and is necessary for (distribution-independent) learning in the supervised classification and regression setting, stability of the ERM is necessary and sufficient for learnability in that setting. It is important to emphasize that this characterization of stability as necessary for learnability goes through uniform convergence, i.e. the chain of implications is:

    Learnable with ERM ==> Uniform Convergence ==> ERM Stable ==> Learnable with ERM

However, the equivalence between (distribution-independent) consistency of empirical risk minimization and uniform convergence is specific to supervised classification and regression. The results of Alon et al [1] establishing this equivalence do not always hold in the General Learning Setting. In particular, we recently showed that in strongly convex stochastic optimization problems, the ERM is stable and thus consistent, even though the empirical risks do not converge to their expectations uniformly (Example 7.1, taken from [9]). Since the other implications in the chain above still hold in the general learning setting (e.g., uniform convergence implies stability and stability implies learnability by ERM), this example demonstrates that stability is a strictly more general sufficient condition for learnability.
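As a concrete illustration of (1) and (2), the following Python sketch computes empirical risks and returns an empirical risk minimizer over a finite hypothesis class. The particular choices (a grid of real-valued hypotheses, the squared objective f(h, z) = (h - z)^2, and D = Uniform[0, 1]) are illustrative assumptions only, not part of the formal setting.

    import random

    # Illustrative instance of the General Learning Setting (an assumption for
    # this sketch, not the paper's example): hypotheses are grid points in
    # [0, 1], instances are reals in [0, 1], and f(h, z) = (h - z)^2 <= B = 1.
    def f(h, z):
        return (h - z) ** 2

    def empirical_risk(h, sample):
        # F_S(h) = (1/m) * sum_i f(h, z_i)  -- equation (2)
        return sum(f(h, z) for z in sample) / len(sample)

    def erm(hypotheses, sample):
        # An ERM rule: return some hypothesis minimizing the empirical risk.
        return min(hypotheses, key=lambda h: empirical_risk(h, sample))

    if __name__ == "__main__":
        random.seed(0)
        H = [i / 100 for i in range(101)]          # finite hypothesis domain
        S = [random.random() for _ in range(50)]   # i.i.d. sample from D = Uniform[0, 1]
        h_hat = erm(H, S)
        print("ERM:", h_hat, " empirical risk:", round(empirical_risk(h_hat, S), 4))

Any bounded objective over any hypothesis and instance domain could be substituted here; the General Setting only requires access to f(h, z) on samples.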
A natural question is then whether, in the General Setting, stability is also necessary for learning. Here we establish that indeed, even in the general learning setting, (distribution-independent) stability of the ERM is necessary and sufficient for (distribution-independent) consistency of the ERM. The situation is therefore as follows:

    Uniform Convergence ==> ERM Stable <==> Learnable with ERM

We emphasize that, unlike the arguments of Mukherjee et al [7], the proof of necessity does not go through uniform convergence, allowing us to deal also with settings beyond supervised classification and regression.

The discussion above concerns only stability and learnability of the ERM. In supervised classification and regression there is no need to go beyond the ERM, since learnability is equivalent to learnability by

empirical risk minimization. But as we recently showed, there are learning problems in the General Setting which are learnable using some alternate learning rule, but in which the ERM is neither stable nor consistent (Example 7.2, taken from [9]). Stability of the ERM is therefore a sufficient, but not necessary, condition for learnability:

    Uniform Convergence ==> ERM Stable <==> Learnable with ERM ==> Learnable (the last implication cannot be reversed)

This prompts us to study the stability properties of non-ERM learning rules. We establish that, even in the General Setting, any consistent and generalizing learning rule (i.e. where in addition to consistency, the empirical risk is also a good estimate of the population risk) must be asymptotically empirical risk minimizing (AERM, see precise definition in Section 2). We thus focus on such rules and show that also for an AERM, stability is sufficient for consistency and generalization. The converse is a bit weaker for AERMs, though. We show that a strict notion of stability, which is necessary for ERM consistency, is not necessary for AERM consistency, and instead suggest a weaker notion (averaging out fluctuations across random training sets) that is necessary and sufficient for AERM consistency. Noting that any consistent learning rule can be transformed into a consistent and generalizing learning rule, we obtain a sharp characterization of learnability in terms of stability: learnability is equivalent to the existence of a stable AERM:

    Learnable <==> Exists Stable AERM <==> Learnable with AERM

2 The General Learning Setting

A learning problem is specified by a hypothesis domain H, an instance domain Z and an objective function (e.g. loss functional or cost function) f : H x Z -> R. Throughout this paper we assume the function is bounded by some constant B, i.e. |f(h, z)| <= B for all h in H and z in Z. A learning rule is a mapping A : Z^m -> H from sequences of instances in Z to hypotheses. We refer to sequences S = {z_1, ..., z_m} as sample sets, but it is important to remember that the order and multiplicity of instances may be significant. A learning rule that does not depend on the order is said to be symmetric. We will generally consider samples S ~ D^m of m i.i.d. draws from D.

A possible approach to learning is to minimize the empirical risk F_S(h). We say that a rule A is an ERM (Empirical Risk Minimizer) if it minimizes the empirical risk:

    F_S(A(S)) = F_S(ĥ_S) = min_{h in H} F_S(h),    (3)

where we use F_S(ĥ_S) = min_{h in H} F_S(h) to refer to the minimum empirical risk. But since there might be several hypotheses minimizing the empirical risk, ĥ_S does not refer to a specific hypothesis, and there might be many rules which are all ERMs. We say that a rule A is an AERM (Asymptotic Empirical Risk Minimizer) with rate ε_erm(m) under distribution D if:

    E_{S~D^m}[F_S(A(S)) − F_S(ĥ_S)] <= ε_erm(m).    (4)

Here and whenever talking about a rate ε(m), we require it to be monotone decreasing with ε(m) -> 0. A learning rule is universally an AERM with rate ε_erm(m) if it is an AERM with rate ε_erm(m) under all distributions D over Z. Returning to our goal of minimizing the expected risk, we say a rule A is consistent with rate ε_cons(m) under distribution D if for all m,

    E_{S~D^m}[F(A(S)) − F(h*)] <= ε_cons(m),    (5)

where we denote F(h*) = inf_{h in H} F(h). A rule is universally consistent with rate ε_cons(m) if it is consistent with rate ε_cons(m) under all distributions D over Z. A problem is said to be learnable if there exists some universally consistent learning rule for the problem. This definition of learnability, requiring a uniform rate for all distributions, is the relevant notion for studying learnability of a hypothesis class.
It is a direct generalization of agnostic PAC-learnability [4] to Vapnik's General Setting of Learning as studied by Haussler [3] and others.

We say a rule A generalizes with rate ε_gen(m) under distribution D if for all m,

    E_{S~D^m}|F(A(S)) − F_S(A(S))| <= ε_gen(m).    (6)

A rule universally generalizes with rate ε_gen(m) if it generalizes with rate ε_gen(m) under all distributions D over Z. We note that other authors sometimes define consistency, and thus also learnability, as a combination of our notions of consistency and generalization.

3 Stability

We define a sequence of progressively weaker notions of stability, all based on leave-one-out validation. For a sample S of size m, let S^{\i} = {z_1, ..., z_{i−1}, z_{i+1}, ..., z_m} be a sample of m−1 points obtained by deleting the i-th observation of S. All our measures of stability concern the effect deleting z_i has on f(h, z_i), where h is the hypothesis returned by the learning rule. That is, all measures consider the magnitude of f(A(S^{\i}); z_i) − f(A(S); z_i).

Definition 1. A rule A is uniform-LOO stable with rate ε_stable(m) if for all samples S of m points and for all i:

    |f(A(S^{\i}); z_i) − f(A(S); z_i)| <= ε_stable(m).

Definition 2. A rule A is all-i-LOO stable with rate ε_stable(m) under distribution D if for all i:

    E_{S~D^m}|f(A(S^{\i}); z_i) − f(A(S); z_i)| <= ε_stable(m).

Definition 3. A rule A is LOO stable with rate ε_stable(m) under distribution D if

    (1/m) Σ_{i=1}^m E_{S~D^m}|f(A(S^{\i}); z_i) − f(A(S); z_i)| <= ε_stable(m).

For symmetric learning rules, Definitions 2 and 3 are equivalent. Example 7.5 shows that the symmetry assumption is necessary, and that the two definitions are not equivalent for non-symmetric learning rules. Our weakest notion of stability, which we show is still enough to ensure learnability, is:

Definition 4. A rule A is on-average-LOO stable with rate ε_stable(m) under distribution D if

    |(1/m) Σ_{i=1}^m E_{S~D^m}[f(A(S^{\i}); z_i) − f(A(S); z_i)]| <= ε_stable(m).

We say that a rule is universally stable with rate ε_stable(m) if the stability property holds with rate ε_stable(m) for all distributions.

Claim 3.1. Uniform-LOO stability with rate ε_stable(m) implies all-i-LOO stability with rate ε_stable(m), which implies LOO stability with rate ε_stable(m), which implies on-average-LOO stability with rate ε_stable(m).

Relationship to Other Notions of Stability

Many different notions of stability, some under multiple names, have been suggested in the literature. In particular, our notion of all-i-LOO stability has been studied by several authors under different names: pointwise-hypothesis stability [2], CVloo stability [7], and cross-validation-(deletion) stability [8]. All are equivalent, though the rate is sometimes defined differently. Other authors define stability with respect to replacing, rather than deleting, an observation. E.g. CV stability [5] and cross-validation-(replacement) stability [8] are analogous to all-i-LOO stability, and average stability [8] is analogous to on-average-LOO stability for symmetric learning rules. In general the deletion and replacement variants of stability are incomparable; in Appendix A we briefly discuss how the results in this paper change if replacement stability is used. A much stronger notion is uniform stability [2], which is strictly stronger than any of our notions, and is sufficient for tight generalization bounds. However, this notion is far from necessary for learnability ([5] and Example 7.3 below). In the context of symmetric learning rules, all-i-LOO stability and LOO stability are equivalent. In order to treat non-symmetric rules more easily, we prefer working with LOO stability. For an elaborate discussion of the relationships between different notions of stability, see [5].

4 Main Results

We first establish that the existence of a stable AERM is sufficient for learning:

Theorem 4.1. If a rule is an AERM with rate ε_erm(m) and stable (under any of our definitions) with rate ε_stable(m) under D, then it is consistent and generalizes under D with rates

    ε_cons(m) <= 3ε_erm(m) + ε_stable(m+1) + 2B/(m+1),
    ε_gen(m) <= 4ε_erm(m) + ε_stable(m+1) + 6B/√m.

Corollary 4.2. If a rule is universally an AERM and stable, then it is universally consistent and generalizing.

Seeking a converse to the above, we first note that it is not possible to obtain a converse for each distribution D separately, i.e. a converse to Theorem 4.1. In Example 7.6, we show a specific learning problem and distribution D in which the ERM (in fact, any AERM) is consistent, but not stable, even under our weakest notion of stability. However, we are able to obtain a converse to Corollary 4.2, that is, to establish that a universally consistent ERM, or even AERM, must also be stable. For exact ERMs we have:

Theorem 4.3. For an ERM the following are equivalent: universal LOO stability; universal consistency; universal generalization.

Recall that for a symmetric rule, LOO stability and all-i-LOO stability are equivalent, and so consistency or generalization of a symmetric ERM (the typical case) also implies all-i-LOO stability. Theorem 4.3 only guarantees LOO stability as a necessary condition for consistency. Example 7.3 (adapted from [5]) establishes that we cannot strengthen the condition to uniform-LOO stability, or any stronger definition: there exists a learning problem for which the ERM is universally consistent, but not uniform-LOO stable.
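For any concrete learning rule, the leave-one-out quantities appearing in Definitions 3 and 4 can be estimated by straightforward Monte Carlo simulation. The Python sketch below does this for an assumed toy setting (squared objective on [0, 1], with the sample mean as the ERM); the setting and the estimator are illustrative assumptions, not part of the paper's development. For this rule both estimates shrink with m, and the LOO estimate is always at least the on-average estimate, as Claim 3.1's ordering requires.

    import random

    # Assumed toy setting: instances uniform on [0, 1], f(h, z) = (h - z)^2,
    # and A(S) is the ERM over the reals, i.e. the sample mean.
    def f(h, z):
        return (h - z) ** 2

    def rule(S):
        return sum(S) / len(S)

    def loo_stability_estimates(m, trials=300, seed=0):
        """Monte-Carlo estimates of the LOO (Definition 3) and on-average-LOO
        (Definition 4) quantities for the rule above under D = Uniform[0, 1]."""
        rng = random.Random(seed)
        loo_sum, on_avg_sum = 0.0, 0.0
        for _ in range(trials):
            S = [rng.random() for _ in range(m)]
            h_full = rule(S)
            for i, z_i in enumerate(S):
                h_del = rule(S[:i] + S[i + 1:])       # A(S^{\i})
                diff = f(h_del, z_i) - f(h_full, z_i)
                loo_sum += abs(diff) / m              # absolute value inside the average
                on_avg_sum += diff / m                # absolute value taken at the end
        return loo_sum / trials, abs(on_avg_sum / trials)

    if __name__ == "__main__":
        for m in (10, 40, 160):
            loo, oavg = loo_stability_estimates(m)
            print(f"m={m:4d}   LOO ~ {loo:.5f}   on-average-LOO ~ {oavg:.5f}")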
For AERMs, we obtain a weaker converse, ensuring only on-average-LOO stability:

Theorem 4.4. For an AERM, the following are equivalent: universal on-average-LOO stability; universal consistency; universal generalization.

On-average-LOO stability is strictly weaker than LOO stability, but this is the best that can be assured: in Example 7.4 we present a learning problem and an AERM that is universally consistent, but is not LOO stable. The exact rate conversions of Theorems 4.3 and 4.4 are specified in the corresponding proofs (Section 6), and are all polynomial. In particular, an ε_cons-universally consistent ε_erm-AERM is on-average-LOO stable with rate

    ε_stable(m) <= 3ε_erm(m−1) + 3ε_cons((m−1)^{1/4}) + 6B/√(m−1).

The above results apply only to AERMs, for which we also see that universal consistency and generalization are equivalent. Next we show that if in fact we seek universal consistency and generalization, then we must consider only AERMs:

Theorem 4.5. If a rule A is universally consistent with rate ε_cons(m) and generalizing with rate ε_gen(m), then it is universally an AERM with rate

    ε_erm(m) <= ε_gen(m) + 3ε_cons(m^{1/4}) + 4B/√m.

Combining Theorems 4.4 and 4.5, we get that the existence of a universally on-average-LOO stable AERM is a necessary (and sufficient) condition for the existence of some universally consistent and generalizing rule. As we show in Example 7.7, there might still be a universally consistent learning rule (hence the problem is learnable by our definition) that is not stable even by our weakest definition (and is neither an AERM nor generalizing). Nevertheless, any universally consistent learning rule can be transformed into a universally consistent and generalizing learning rule (Lemma 6.11). Thus by Theorems 4.5 and 4.4 this rule must also be a stable AERM, establishing:

Theorem 4.6. A learning problem is learnable if and only if there exists a universally on-average-LOO stable AERM. In particular, if there exists an ε_cons-universally consistent rule, then there exists a rule that is ε_stable-on-average-LOO stable and an ε_erm-AERM, where

    ε_erm(m) = 3ε_cons(m^{1/4}) + 7B/√m,    (7)
    ε_stable(m) = 6ε_cons((m−1)^{1/4}) + 9B/√(m−1).

5 Comparison with Prior Work

5.1 Theorem 4.1 and Corollary 4.2

The consistency implication in Theorem 4.1 (specifically, that all-i-LOO stability of an AERM implies consistency) was established by Mukherjee et al [7, Theorem 3.5]. As for the generalization guarantee, Rakhlin et al [8] prove that for an ERM, on-average-LOO stability is equivalent to generalization. For more general learning rules, [2] attempted to show that all-i-LOO stability implies generalization. However, Mukherjee et al [7] (in Remark 3, pg. 173) provide a simple counterexample and note that the proof of [2] is wrong, and that in fact all-i-LOO stability alone is not enough to ensure generalization. To correct this, Mukherjee et al [7] introduced an additional condition, referred to as Eloo_err stability, which together with all-i-LOO stability ensures generalization. For AERMs, they use arguments specific to supervised learning, arguing that universal consistency implies uniform convergence, and establish generalization only via this route. And so, Mukherjee et al obtain a version of Corollary 4.2 that is specific to supervised learning. In summary, comparing Theorem 4.1 to previous work, our results extend the generalization guarantee also to AERMs in the general learning setting. Rakhlin et al [8] also show that the replacement (rather than deletion) version of stability implies generalization for any rule (even a non-AERM), and hence consistency for AERMs. Recall that the deletion and replacement versions are not equivalent. We are not aware of strong converses for the replacement variant.

5.2 Converse Results

Mukherjee et al [7] argue that all-i-LOO stability of the ERM (in fact, of any AERM) is also necessary for ERM universal consistency and thus learnability. However, their arguments are specific to supervised learning, and establish stability only via uniform convergence of F_S(h) to F(h), as discussed in the introduction. As we now know, in the general learning setting, ERM consistency is not equivalent to this uniform convergence, and furthermore, there might be a non-ERM universally consistent rule even though the ERM is not universally consistent. Our results here, by contrast, apply to the general learning setting and do not use uniform convergence arguments. For an ERM, Rakhlin et al [8] show that generalization is equivalent to on-average-LOO stability, for any distribution and without resorting to uniform convergence arguments. This provides a partial converse to Theorem 4.1. However, our results extend to AERMs as well and, more importantly, provide a converse to AERM consistency rather than just generalization. This distinction between consistency and generalization is important, as there are situations with consistent but not stable AERMs (Example 7.6), or even universally consistent learning rules which are neither stable, generalizing nor AERMs (Example 7.7). Another converse result that does not use uniform convergence arguments, but is specific to the realizable binary learning setting, was given by Kutin and Niyogi [5]. They show that in this setting, for any distribution D, all-i-LOO stability of the ERM under D is necessary for ERM consistency under D.
This is a much stronger form of converse, as it applies to any specific distribution separately, rather than requiring universal consistency. However, not only is it specific to supervised learning, it further requires the distribution to be realizable (i.e. zero error is achievable). As we show in Example 7.6, a distribution-specific converse is not possible in the general setting. All the papers cited above focus on symmetric learning rules, where all-i-LOO stability is equivalent to LOO stability. We prefer not to limit our attention to symmetric rules, and instead use LOO stability.

6 Detailed Results and Proofs

We first establish that for AERMs, on-average-LOO stability and generalization are equivalent, and that for ERMs the equivalence extends also to LOO stability. This extends the work of Rakhlin et al [8] from ERMs to AERMs, and with somewhat better rate conversions.

6.1 Equivalence of Stability and Generalization

It will be convenient to work with a weaker version of generalization as an intermediate step. We say a rule A on-average generalizes with rate ε_oag(m) under distribution D if for all m,

    |E_{S~D^m}[F(A(S)) − F_S(A(S))]| <= ε_oag(m).    (8)

It is straightforward to see that generalization implies on-average generalization with the same rate. We show that for AERMs the converse is also true, and also that on-average generalization is equivalent to on-average stability, establishing the equivalence between generalization and on-average stability (for AERMs).

Lemma 6.1 (For AERMs: on-average generalization <==> on-average stability). Let A be an AERM with rate ε_erm(m) under D. If A is on-average generalizing with rate ε_oag(m), then it is on-average-LOO stable with rate ε_oag(m−1) + 2ε_erm(m−1) + 2B/m. If A is on-average-LOO stable with rate ε_stable(m), then it is on-average generalizing with rate ε_stable(m+1) + 2ε_erm(m) + 2B/m.

Proof. For the ERMs of S and S^{\i} we have |F_{S^{\i}}(ĥ_{S^{\i}}) − F_S(ĥ_S)| <= 2B/m, and so, since A is an AERM,

    E|F_S(A(S)) − F_{S^{\i}}(A(S^{\i}))| <= 2ε_erm(m−1) + 2B/m.    (9)

generalization ==> stability: Applying (8) to S^{\i} and combining with (9) we have |E[F(A(S^{\i})) − F_S(A(S))]| <= ε_oag(m−1) + 2ε_erm(m−1) + 2B/m, which does not actually depend on i, hence:

    ε_oag(m−1) + 2ε_erm(m−1) + 2B/m
    >= |E_i E[F(A(S^{\i})) − F_S(A(S))]|    (10)
    = |E_i [E_{S^{\i}, z_i} f(A(S^{\i}), z_i) − E_S f(A(S), z_i)]|
    = |E E_i [f(A(S^{\i}), z_i) − f(A(S), z_i)]|,    (11)

which establishes on-average stability.

stability ==> generalization: Bounding (11) by ε_stable(m) and working back, we get that (10) is also bounded by ε_stable(m). Combined with (9) we get

    |E[F(A(S^{\i})) − F_{S^{\i}}(A(S^{\i}))]| <= ε_stable(m) + 2ε_erm(m−1) + 2B/m,

which establishes on-average generalization.

Lemma 6.2 (AERM + on-average generalization ==> generalization). If A is an AERM with rate ε_erm(m) and on-average generalizes with rate ε_oag(m) under D, then A generalizes with rate ε_oag(m) + 2ε_erm(m) + 2B/√m under D.

Proof. Using the respective optimalities of ĥ_S and h* we can bound:

    F_S(A(S)) − F(A(S))
    = [F_S(A(S)) − F_S(ĥ_S)] + [F_S(ĥ_S) − F_S(h*)] + [F_S(h*) − F(h*)] + [F(h*) − F(A(S))]
    <= [F_S(A(S)) − F_S(ĥ_S)] + [F_S(h*) − F(h*)] =: Y,    (12)

where the final equality defines a new random variable Y. By Lemma 6.3 and the AERM guarantee we have E|Y| <= ε_erm(m) + B/√m. From Lemma 6.4 we can conclude that

    E|F_S(A(S)) − F(A(S))| <= E[F(A(S)) − F_S(A(S))] + 2E|Y| <= ε_oag(m) + 2ε_erm(m) + 2B/√m.

Utility Lemma 6.3. For i.i.d. X_1, ..., X_m with |X_i| <= B and X̄ = (1/m) Σ_i X_i, we have E|X̄ − E X̄| <= B/√m.

Proof. E|X̄ − E X̄| <= √(Var X̄) = √(Var X_1 / m) <= B/√m.

Utility Lemma 6.4. Let X, Y be random variables such that X <= Y almost surely. Then E|X| <= 2E|Y| − EX.

Proof. Denote a_+ = max(0, a) and observe that X <= Y implies X_+ <= Y_+ (this holds when both have the same sign, and when X <= 0 <= Y, while Y < 0 < X is not possible). We therefore have E X_+ <= E Y_+ <= E|Y|. Also note that |X| = 2X_+ − X. We can now calculate: E|X| = E[2X_+ − X] = 2E X_+ − E X <= 2E|Y| − E X.

For an exact ERM, we get a stronger equivalence:

Lemma 6.5 (ERM + on-average-LOO stable ==> LOO stable). If an exact ERM A is on-average-LOO stable with rate ε_stable(m) under D, then it is also LOO stable under D with the same rate.

Proof. By the optimality of ĥ_S = A(S):

    f(ĥ_{S^{\i}}, z_i) − f(ĥ_S, z_i)
    = m [F_S(ĥ_{S^{\i}}) − F_S(ĥ_S)] + (m−1)[F_{S^{\i}}(ĥ_S) − F_{S^{\i}}(ĥ_{S^{\i}})] >= 0.    (13)

Then using on-average-LOO stability:

    (1/m) Σ_i E|f(ĥ_{S^{\i}}, z_i) − f(ĥ_S, z_i)| = |(1/m) Σ_i E[f(ĥ_{S^{\i}}, z_i) − f(ĥ_S, z_i)]| <= ε_stable(m).

Lemma 6.5 can be extended to AERMs with rate o(1/m). However, for AERMs with a slower rate, or at least with rate Ω(1/m), Example 7.4 establishes that this stronger converse is not possible. We have now established the stability <==> generalization parts of Theorems 4.1, 4.3 and 4.4 (in fact, an even slightly stronger converse than in Theorems 4.3 and 4.4, as it does not require universality).

6.2 A Sufficient Condition for Consistency

It is also fairly straightforward to see that generalization (or even on-average generalization) of an AERM implies its consistency:

Lemma 6.6 (AERM + generalization ==> consistency). If A is an AERM with rate ε_erm(m) and on-average generalizes with rate ε_oag(m) under D, then it is consistent with rate ε_oag(m) + ε_erm(m) under D.

Proof.

    E[F(A(S))] − F(h*) = E[F(A(S))] − E[F_S(h*)]
    = E[F(A(S)) − F_S(A(S))] + E[F_S(A(S)) − F_S(h*)]
    <= E[F(A(S)) − F_S(A(S))] + E[F_S(A(S)) − F_S(ĥ_S)]
    <= ε_oag(m) + ε_erm(m).

Combined with the results of Section 6.1, this completes the proof of Theorem 4.1 and the stability ==> consistency and generalization ==> consistency parts of Theorems 4.3 and 4.4.

6.3 Converse Direction

Lemma 6.1 already provides a converse result, establishing that stability is necessary for generalization.
However, in order to establish that stability is also necessary for universal consistency, we must prove that universal consistency of an AERM implies universal generalization. Note that consistency of an AERM under a specific distribution does not imply generalization nor stability (Example 7.6); we must instead rely on universal consistency. The main tool we use is the following lemma:

Lemma 6.7 (Main Converse Lemma). If a problem is learnable, i.e. there exists a universally consistent rule A with rate ε_cons(m), then under any distribution,

    E|F_S(ĥ_S) − F(h*)| <= ε_emp(m),    where ε_emp(m) = 2ε_cons(n) + 2B/√(m−n) + 2Bn²/m,

for any sequence n = n(m) such that n -> ∞ and n = o(√m).

Proof. Let I = {I_1, ..., I_n} be a random sample of n indexes in the range 1..m, where each I_i is independently uniformly distributed and I is independent of S. Let S' = {z_{I_i}}, i.e. a sample of size n drawn from the uniform distribution over instances in S (with replacement). We first bound the probability that I has repeated indexes ("duplicates"):

    Pr(I has duplicates) <= Σ_{i=1}^n (i−1)/m <= n²/(2m).    (14)

Conditioned on not having duplicates in I, the sample S' is actually distributed according to D^n, i.e. it can be viewed as a sample from the original distribution. We therefore have, by universal consistency,

    E[F(A(S')) − F(h*) | no duplicates] <= ε_cons(n).    (15)

But viewing S' as a sample drawn from the uniform distribution over instances in S, we also have

    E|F_S(A(S')) − F_S(ĥ_S)| <= ε_cons(n).    (16)

Conditioned on having no duplicates in I, S \ S' (i.e. those instances in S not chosen by I) is independent of S', and |S \ S'| = m − n, so by Lemma 6.3,

    E|F(A(S')) − F_{S\S'}(A(S'))| <= B/√(m−n).    (17)

Finally, if there are no duplicates, then for any hypothesis, and in particular for A(S'), we have

    |F_S(A(S')) − F_{S\S'}(A(S'))| <= 2Bn/m.    (18)

Combining (15), (16), (17) and (18), accounting for a maximal discrepancy of B when we do have duplicates, and assuming n²/(2m) <= 1/2, we get the desired bound.

In the supervised learning setting, Lemma 6.7 is just an immediate consequence of learnability being equivalent to consistency and generalization of the ERM. However, the lemma applies also in the General Setting, where universal consistency might be achieved only by a non-ERM. The lemma states that if a problem is learnable, then even though the ERM might not be consistent (as in, e.g., Example 7.2), the empirical risk achieved by the ERM is in fact an asymptotically unbiased estimator of F(h*).

Equipped with Lemma 6.7, we are now ready to show that universal consistency of an AERM implies generalization, and that any universally consistent and generalizing rule must be an AERM. What we show is actually a bit stronger: if a problem is learnable, and so Lemma 6.7 holds, then for any distribution D separately, consistency of an AERM under D implies generalization under D, and also any consistent and generalizing rule under D must be an AERM.

Lemma 6.8 (learnable + AERM + consistent ==> generalizing). If Lemma 6.7 holds with rate ε_emp(m), and A is an ε_erm-AERM and ε_cons-consistent under D, then it is generalizing under D with rate ε_emp(m) + ε_erm(m) + ε_cons(m).

Proof.

    E|F_S(A(S)) − F(A(S))| <= E|F_S(A(S)) − F_S(ĥ_S)| + E|F(h*) − F(A(S))| + E|F_S(ĥ_S) − F(h*)|
    <= ε_erm(m) + ε_cons(m) + ε_emp(m).

Lemma 6.9 (learnable + consistent + generalizing ==> AERM). If Lemma 6.7 holds with rate ε_emp(m), and A is ε_cons-consistent and ε_gen-generalizing under D, then it is an AERM under D with rate ε_emp(m) + ε_gen(m) + ε_cons(m).

Proof.

    E|F_S(A(S)) − F_S(ĥ_S)| <= E|F_S(A(S)) − F(A(S))| + E|F(A(S)) − F(h*)| + E|F(h*) − F_S(ĥ_S)|
    <= ε_gen(m) + ε_cons(m) + ε_emp(m).

Lemma 6.8 establishes that universal consistency of an AERM implies universal generalization, and thus completes the proof of Theorems 4.3 and 4.4. Lemma 6.9 establishes Theorem 4.5. To get the rates in Section 4, we use n = m^{1/4} in Lemma 6.7. Lemmas 6.6, 6.8 and 6.9 together establish an interesting relationship:

Corollary 6.10. For a (universally) learnable problem, for any distribution D and learning rule A, any two of the following imply the third: A is an AERM under D; A is consistent under D; A generalizes under D.
Note, however, that any one property by itself is possible, even universally:
- The ERM in Example 7.2 is neither consistent nor generalizing, despite the problem being learnable.
- Example 7.7 demonstrates a universally consistent learning rule which is neither generalizing nor an AERM.
- A rule returning a fixed hypothesis always generalizes, but of course need not be consistent nor an AERM.

In contrast, for learnable supervised classification and regression problems, it is not possible for a learning rule to be just universally consistent without being an AERM and without generalizing; nor is it possible for a learning rule to be a universal AERM for a learnable problem without being generalizing and consistent. Corollary 6.10 can also provide a certificate of non-learnability. E.g., for the problem in Example 7.6 we exhibit a specific distribution for which there is a consistent AERM that does not generalize. We can conclude that there is no universally consistent learning rule for the problem, as otherwise the corollary would be violated.

6.4 Existence of a Stable Rule

Theorems 4.5 and 4.4, which we have just completed proving, already establish that for AERMs, universal consistency is equivalent to universal on-average-LOO stability. Existence of a universally on-average-LOO stable AERM is thus sufficient for learnability. In order to prove that it is also necessary, it is enough to show that the existence of a universally consistent learning rule implies the existence of a universally consistent AERM; this AERM must then be on-average-LOO stable by Theorem 4.4. We actually show how to transform a consistent rule into a consistent and generalizing rule. If this rule is universally consistent, then by Lemma 6.9 we can conclude it must be an AERM, and by Lemma 6.1 that it must be on-average-LOO stable.

Lemma 6.11. For any rule A there exists a rule A' such that: A' universally generalizes with rate 3B/√m; and for any D, if A is ε_cons-consistent under D then A' is ε_cons(⌊√m⌋)-consistent under D.

Proof. For a sample S of size m, let S' be the sub-sample consisting of the first ⌊√m⌋ observations in S, and define A'(S) = A(S'). That is, A' applies A to only √m of the m observations in S.

A' generalizes: We can decompose

    F_S(A(S')) − F(A(S')) = (√m/m)[F_{S'}(A(S')) − F(A(S'))] + ((m−√m)/m)[F_{S\S'}(A(S')) − F(A(S'))].

The first term is bounded in magnitude by 2B/√m. As for the second term, S \ S' is statistically independent of S', and so we can use Lemma 6.3 to bound its expected magnitude, obtaining

    E|F_S(A(S')) − F(A(S'))| <= 2B/√m + ((m−√m)/m) · B/√(m−√m) <= 3B/√m.

A' is consistent: If A is consistent, then

    E[F(A'(S))] − inf_{h in H} F(h) = E[F(A(S'))] − inf_{h in H} F(h) <= ε_cons(⌊√m⌋).    (19)

Proof of Converse in Theorem 4.6. If there exists a universally consistent rule with rate ε_cons(m), then by Lemma 6.11 there exists A' which is universally consistent and generalizing. Choosing n = m^{1/4} in Lemma 6.7 and applying Lemmas 6.9 and 6.1, we get the rates specified in (7).

Remark. We can strengthen the above theorem to show the existence of an on-average-LOO stable, always AERM (i.e. a rule which for every sample approximately minimizes F_S(h)). The new learning rule for this purpose chooses the hypothesis returned by the original rule whenever its empirical risk is small, and chooses an ERM otherwise. The proof is completed via Markov's inequality, to bound the probability that we do not choose the hypothesis returned by the original learning rule.

7 Examples

Our first example (taken from [9]) shows that uniform convergence is not necessary for ERM consistency, i.e. universal ERM consistency can hold without uniform convergence. Of course, this can also happen in trivial settings where there is one hypothesis h_0 which dominates all other hypotheses (i.e. f(h_0, z) < f(h, z) for all z and all h ≠ h_0) [10]. However, the example below demonstrates a non-trivial situation with ERM universal consistency but no uniform convergence: there is no dominating hypothesis, and finding the optimal hypothesis does require learning. In particular, unlike trivial problems with a dominating hypothesis, in the example below there is not even local uniform convergence, i.e. there is no uniform convergence even among hypotheses that are close to being population-optimal.

Example 7.1. There exists a learning problem for which any ERM is universally consistent, but the empirical risks do not converge uniformly to their expectations.

Proof. Consider a convex stochastic optimization problem given by

    f(w; (x, α)) = ‖α ∗ (w − x)‖² + ‖w‖² = Σ_i α_i (w_i − x_i)² + ‖w‖²,

where ∗ denotes element-wise product, w is the hypothesis, w and x are elements of the unit ball around the origin of a Hilbert space with a countably infinite orthonormal basis e_1, e_2, ..., and α is an infinite binary sequence.
Here α_i is the i-th coordinate of α, w_i := ⟨w, e_i⟩, and x_i is defined similarly. In our other submission [9], we show that the ERM is stable, hence consistent. However, when x = 0 a.s. and the coordinates of α are i.i.d. uniform on {0, 1}, there is no uniform convergence, not even locally. To see why, note that for a random sample S of any finite size, with probability one there exists an "excluded" basis vector e_j, i.e. one with α_j = 0 for all (x, α) in S. For any t > 0 we then have F(t e_j) − F_S(t e_j) >= t²/2, regardless of the sample size. Setting t = 1 establishes sup_w (F(w) − F_S(w)) >= 1/2 even as m -> ∞, and so there is no uniform convergence. Choosing t arbitrarily small, we see that even when F(t e_j) is close to optimal, the deviations F(w) − F_S(w) still do not converge to zero as m -> ∞.

Perhaps more surprisingly, the next example (also taken from [9]) shows that in the general setting, learnability might require using a non-ERM.

Example 7.2. There exists a learning problem with a universally consistent learning rule, but for which no ERM is universally consistent.

Proof. Consider the same hypothesis space and sample space as before, with

    f(w, z) = ‖α ∗ (w − x)‖² + ε Σ_i 2^{−i} (w_i − 1)²,

where ε = 0.01. When x = 0 a.s. and the coordinates of α are i.i.d. uniform, the ERM must have ‖ŵ‖ = 1. To see why, note that

for an excluded e_j (which exists a.s.), increasing ŵ_j towards one decreases the objective. But since ‖ŵ‖ = 1, we have F(ŵ) >= 1/2, while inf_w F(w) <= F(0) = ε, and so F(ŵ) − inf_w F(w) >= 1/2 − ε. On the other hand, A(S) = argmin_w (F_S(w) + (1/√m)‖w‖²) is a uniformly-LOO stable AERM and hence, by Theorem 4.1, universally consistent.

In the next three examples, we show that in a certain sense, Theorem 4.3 and Theorem 4.4 cannot be improved with stronger stability notions. Viewed differently, they also constitute separation results between our various stability notions, and show which are strictly stronger than the others. Example 7.4 also demonstrates the gap between supervised learning and the general learning setting, by presenting a learning problem and an AERM that is universally consistent, but not LOO stable.

Example 7.3. There exists a learning problem with a universally consistent and all-i-LOO stable learning rule, but no universally consistent and uniform-LOO stable learning rule.

Proof. This example is taken from [5]. Consider the hypothesis space {0, 1}, the instance space {0, 1}, and the objective function f(h, z) = |h − z|. It is straightforward to verify that an ERM is a universally consistent learning rule. It is also universally all-i-LOO stable, because removing an instance can change the hypothesis only if the original sample had an equal number of 0's and 1's (plus or minus one), which happens with probability at most O(1/√m), where m is the sample size. However, it is not hard to see that the only uniform-LOO stable learning rule, at least for large enough sample sizes, is a constant rule which always returns the same hypothesis h regardless of the sample. Such a learning rule is obviously not universally consistent.

Example 7.4. There exists a learning problem with a universally consistent (and on-average-LOO stable) AERM which is not LOO stable.

Proof. Let the instance space, hypothesis space and objective function be as in Example 7.3. Consider the following learning rule, based on a sample S = (z_1, ..., z_m): if (1/m) Σ_i 1_{[z_i = 1]} > 1/2 + √(log(4)/(2m)), return 1; if (1/m) Σ_i 1_{[z_i = 1]} < 1/2 − √(log(4)/(2m)), return 0; otherwise, return Parity(S) = (z_1 + ... + z_m) mod 2. This learning rule is an AERM with rate ε_erm(m) = 2√(log(4)/(2m)). Since we have only two hypotheses, we have uniform convergence of F_S(·) to F(·), and therefore our learning rule universally generalizes (with rate ε_gen(m) <= 1/(2√m)); by Theorem 4.4, this implies that the learning rule is also universally consistent and on-average-LOO stable. However, the learning rule is not LOO stable. Consider the uniform distribution on the instance space. By Hoeffding's inequality, |(1/m) Σ_i 1_{[z_i = 1]} − 1/2| <= √(log(4)/(2m)) with probability at least 1/2, for any sample size. In that case, the returned hypothesis is the parity function (even when we remove an instance from the sample, assuming m >= 3). When this happens, it is not hard to see that for any i,

    f(A(S), z_i) − f(A(S^{\i}), z_i) = 1_{[z_i = 1]} (−1)^{Parity(S)}.

This implies that

    (1/m) Σ_{i=1}^m E|f(A(S^{\i}); z_i) − f(A(S); z_i)| >= (1/2)(1/2 − √(log(4)/(2m))),    (20)

which does not converge to zero with the sample size. Therefore, the learning rule is not LOO stable. Note that the proof implies that on-average-LOO stability cannot be replaced even by stability notions weaker than LOO stability. For instance, a natural stability notion intermediate between on-average-LOO stability and LOO stability is

    E_{S~D^m} |(1/m) Σ_{i=1}^m (f(A(S^{\i}); z_i) − f(A(S); z_i))|,    (21)

where the absolute value is now over the entire sum, but inside the expectation.
In the example used in the proof, (21) is still lower bounded by the right-hand side of (20), which does not converge to zero with the sample size.

Example 7.5. There exists a learning problem with a universally consistent and LOO-stable AERM which is not symmetric and is not all-i-LOO stable.

Proof. Let the instance space be [0, 1], the hypothesis space [0, 2], and the objective function f(h, z) = 1_{[h = z]}. Consider the following learning rule A: given a sample, check whether the value z_1 appears more than once in the sample. If not, return z_1; otherwise return 2. Since F_S(2) = 0, and z_1 is returned only if this value constitutes 1/m of the sample, the rule above is an AERM with rate ε_erm(m) = 1/m. To see universal consistency, let Pr(z_1) = p. With probability (1−p)^{m−1}, z_1 is not among z_2, ..., z_m, and the returned hypothesis is z_1, with F(z_1) = p; otherwise the returned hypothesis is 2, with F(2) = 0. Hence E_S[F(A(S))] <= p(1−p)^{m−1}, which is easily verified to be at most 1/m, so the learning rule is consistent with rate ε_cons(m) <= 1/m. To see LOO stability, notice that the returned hypothesis can change by deleting z_i, i > 1, only if z_i is the only instance among z_2, ..., z_m equal to z_1, so ε_stable(m) <= 2/m (in fact, LOO stability holds even without the expectation). However, this learning rule is not all-i-LOO stable: for any continuous distribution, |f(A(S^{\1}), z_1) − f(A(S), z_1)| = 1 with probability 1, so it cannot be all-i-LOO stable with respect to i = 1.
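The rule of Example 7.4 is simple enough to simulate directly, reusing the estimation pattern from the earlier sketch. The rule and objective below are exactly those of Example 7.4; the Monte-Carlo harness around them is an illustrative assumption. In the output, the LOO estimate stays bounded away from zero while the on-average estimate stays near zero, matching the separation claimed in the example.

    import math
    import random

    def f(h, z):
        return abs(h - z)

    def rule(S):
        # The learning rule of Example 7.4.
        m = len(S)
        t = math.sqrt(math.log(4) / (2 * m))
        frac_ones = sum(S) / m
        if frac_ones > 0.5 + t:
            return 1
        if frac_ones < 0.5 - t:
            return 0
        return sum(S) % 2                              # Parity(S)

    def stability_estimates(m, trials=300, seed=0):
        rng = random.Random(seed)
        loo_sum, on_avg_sum = 0.0, 0.0                 # Definitions 3 and 4
        for _ in range(trials):
            S = [rng.randint(0, 1) for _ in range(m)]  # uniform distribution on {0, 1}
            h_full = rule(S)
            for i, z_i in enumerate(S):
                diff = f(rule(S[:i] + S[i + 1:]), z_i) - f(h_full, z_i)
                loo_sum += abs(diff) / m
                on_avg_sum += diff / m
        return loo_sum / trials, abs(on_avg_sum / trials)

    if __name__ == "__main__":
        for m in (11, 41, 161):
            loo, oavg = stability_estimates(m)
            # LOO stays near 1/4; the on-average quantity is near 0 (up to noise).
            print(f"m={m:4d}   LOO ~ {loo:.3f}   on-average-LOO ~ {oavg:.3f}")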

[Figure 1 (diagram omitted): Implications of various properties of learning problems — Uniform Convergence, ERM Strict Consistency, ERM Stability, ERM Consistency, All AERMs Stable and Consistent, Exists Stable AERM, Exists Consistent AERM, Learnability. Consistency refers to universal consistency and stability refers to universal on-average-LOO stability.]

Next we show that for specific distributions, even ERM consistency does not imply our weakest notion of stability.

Example 7.6. There exists a learning problem and a distribution on the instance space such that the ERM (or any AERM) is consistent but is not on-average-LOO stable.

Proof. Let the instance space be [0, 1], let the hypothesis space consist of all finite subsets of [0, 1], and define the objective function as f(h, z) = 1_{[z ∉ h]}. Consider any continuous distribution on the instance space. Since the underlying distribution D is continuous, we have F(h) = 1 for any hypothesis h. Therefore, any learning rule (including any AERM) is consistent, with F(A(S)) = 1. On the other hand, the ERM here always achieves F_S(ĥ_S) = 0, so any AERM cannot generalize, or even on-average generalize (by Lemma 6.2), and hence cannot be on-average-LOO stable (by Lemma 6.1).

Finally, the following example shows that while learnability is equivalent to the existence of stable and consistent AERMs (Theorem 4.4 and Theorem 4.6), there might still exist other learning rules which are neither of the above.

Example 7.7. There exists a learning problem with a universally consistent learning rule which is not on-average-LOO stable, not generalizing, and not an AERM.

Proof. Let the instance space be [0, 1], let the hypothesis space consist of all finite subsets of [0, 1], and let the objective function be the indicator function f(h, z) = 1_{[z ∈ h]}. Consider the following learning rule: given a sample S ⊆ [0, 1], the learning rule checks whether there are any two identical instances in the sample. If so, the learning rule returns the empty set; otherwise, it returns the sample itself. This learning rule is not an AERM, nor does it necessarily generalize or satisfy on-average-LOO stability. Consider any continuous distribution on [0, 1]. The learning rule always returns a countable set A(S), with F_S(A(S)) = 1, while F_S(∅) = 0 (so it is not an AERM) and F(A(S)) = 0 (so it does not generalize). Also, f(A(S), z_i) = 1 while f(A(S^{\i}), z_i) = 0 with probability 1, so it is not on-average-LOO stable either. However, the learning rule is universally consistent. If the underlying distribution is continuous on [0, 1], then the returned hypothesis is S, which is countable; hence F(S) = 0 = inf_h F(h). For discrete distributions, let M_1 denote the proportion of instances in the sample which appear exactly once, and let M_0 be the probability mass of instances which did not appear in the sample. Using [6, Theorem 3], we have that for any δ, with probability at least 1 − δ over a sample of size m,

    |M_0 − M_1| <= O(√(log(1/δ)/m)),

uniformly for any discrete distribution. If this event occurs, then either M_1 < 1, or M_0 >= 1 − O(√(log(1/δ)/m)). In the first case, there are duplicate instances in the sample, so the returned hypothesis is the optimal ∅; in the second case, the returned hypothesis is the sample, which has total probability mass at most O(√(log(1/δ)/m)), and therefore F(A(S)) <= O(√(log(1/δ)/m)). As a result, regardless of the underlying distribution, with probability at least 1 − δ over the sample,

    F(A(S)) <= O(√(log(1/δ)/m)).

Since the right-hand side converges to 0 with m for any δ, it is easy to see that the learning rule is universally consistent.
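The rule of Example 7.7 can likewise be written out directly. The short Python sketch below uses exactly that rule and objective; the choice of a continuous distribution (Uniform[0, 1]) is the illustrative assumption. It exhibits empirical risk 1 for the returned hypothesis against empirical risk 0 for the ERM, i.e. consistency without generalization or the AERM property.

    import random

    # The rule and objective of Example 7.7.
    def f(h, z):
        return 1.0 if z in h else 0.0

    def rule(S):
        # Return the empty set if the sample contains duplicates, else the sample.
        return frozenset() if len(set(S)) < len(S) else frozenset(S)

    def empirical_risk(h, S):
        return sum(f(h, z) for z in S) / len(S)

    if __name__ == "__main__":
        rng = random.Random(0)
        S = [rng.random() for _ in range(100)]   # duplicates occur with probability 0
        h = rule(S)
        print("F_S(A(S))      =", empirical_risk(h, S))             # 1.0: not an AERM
        print("F_S(empty set) =", empirical_risk(frozenset(), S))   # 0.0: the ERM value
        # F(A(S)) = 0 exactly, since a finite set has measure zero under a
        # continuous D -- so A is consistent (inf_h F(h) = 0) yet neither
        # generalizing nor an AERM, matching Example 7.7.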
8 Discussion

In the familiar setting of supervised classification or regression, the question of learnability is reduced to that of uniform convergence of empirical risks to their expectations, and in turn to finiteness of the fat-shattering dimension. Furthermore, due to the equivalence of learnability and uniform convergence, there is no need to look beyond the ERM. We recently showed [9] that the situation in the General Learning Setting is substantially more complex: universal ERM consistency need not be equivalent to uniform convergence, and furthermore, learnability might be possible only with a non-ERM. We are therefore in need of a new understanding of the question of learnability that applies more broadly than just to supervised classification and regression.

In studying learnability in the General Setting, Vapnik [10] focuses solely on empirical risk minimization, which we now know is not sufficient for understanding learnability (e.g. Example 7.2). Furthermore, for empirical risk minimization, Vapnik establishes uniform convergence as a necessary and sufficient condition not for ERM consistency, but rather for strict consistency of the ERM. We now know that even in rather non-trivial problems (e.g. Example 7.1, taken from [9]), where the ERM is consistent and generalizes, strict consistency does not hold. Furthermore, Example 7.1 also demonstrates that ERM stability guarantees ERM consistency, but not strict consistency, perhaps giving another indication that strict consistency might be too strict (this and other relationships are depicted in Figure 1). In Examples 7.1 and 7.2 we see that stability is a strictly more general sufficient condition for learnability. This makes stability an appealing candidate for understanding learnability in the more general setting. Indeed, we show that stability is not only sufficient, but also necessary for learning, even in the General Learning Setting. A previous such characterization was based on uniform convergence and thus applied only to supervised

classification and regression [7]. Extending the characterization beyond these settings is particularly interesting, since for supervised classification and regression the question of learnability is already essentially solved. Extending the characterization without relying on uniform convergence also allows us to frame stability as the core condition guaranteeing learnability, with uniform convergence only a sufficient, but not necessary, condition for stability (see Figure 1).

In studying the question of learnability and its relation to stability, we encounter several differences between this more general setting and settings such as supervised classification and regression, where learnability is equivalent to uniform convergence. We summarize some of these distinctions:

Perhaps the most important distinction is that in the General Setting, learnability might be possible only with a non-ERM. In this paper we establish that if a problem is learnable, although it might not be learnable with an ERM, it must be learnable with some AERM. And so, in the General Setting we must look beyond empirical risk minimization, but not beyond asymptotic empirical risk minimization.

In supervised classification and regression, if one AERM is universally consistent then all AERMs are universally consistent. In the General Setting we must choose the AERM carefully.

In supervised classification and regression, a universally consistent rule must also generalize and be an AERM. In the General Setting, a universally consistent rule need not generalize nor be an AERM, as Example 7.7 demonstrates. However, Theorem 4.5 establishes that, even in the General Setting, if a rule is universally consistent and generalizing then it must be an AERM. This gives us another reason not to look beyond asymptotic empirical risk minimization, even in the General Setting.

The above distinctions can also be seen through Corollary 6.10, which concerns the relationship between the AERM property, consistency and generalization in learnable problems. In the General Setting, any two conditions imply the third, but it is possible for any one condition to hold without the others. In supervised classification and regression, if a problem is learnable then generalization always holds (for any rule), and so universal consistency and the AERM property imply each other.

In supervised classification and regression, ERM inconsistency for some distribution is enough to establish non-learnability. Establishing non-learnability in the General Setting is trickier, since one must consider all AERMs. We show how Corollary 6.10 can provide a certificate of non-learnability, in the form of a rule that is consistent and an AERM for some specific distribution, but does not generalize (Example 7.6).

In the General Setting, universal consistency of an AERM only guarantees on-average-LOO stability, not LOO stability as in the supervised classification setting [7]. As we show in Example 7.4, this is a real difference and not merely a deficiency of our proofs.

We have begun exploring the issue of learnability in the General Setting, and uncovered important relationships between learnability and stability, but many problems are left open. Throughout the paper we ignored the issue of obtaining high-confidence concentration guarantees: we chose to use convergence in expectation, and defined the rates as rates on the expectation.
Since the objective f is bounded, convergence in expectation is equivalent to convergence in probability, and using Markov's inequality we can translate a rate of the form E[·] <= ε(m) into a low-confidence guarantee Pr(· > ε(m)/δ) <= δ. Can we also obtain exponential concentration results of the form Pr(· > ε(m) · polylog(1/δ)) <= δ? It is possible to construct examples in the General Setting in which convergence in expectation of the stability does not imply exponential concentration of consistency and generalization. Is it possible to show that exponential concentration of stability is equivalent to exponential concentration of consistency and generalization?

We showed that existence of an on-average-LOO stable AERM is necessary and sufficient for learnability (Theorem 4.6). Although specific AERMs might be universally consistent and generalizing without being LOO stable (Example 7.4), it might still be possible to show that for a learnable problem there always exists some LOO stable AERM. This would tighten our converse result and establish existence of a LOO stable AERM as equivalent to learnability.

Even existence of a LOO stable AERM is not as elegant and simple as having finite VC dimension or fat-shattering dimension. It would be very interesting to derive equivalent but more combinatorial conditions for learnability.

References

[1] N. Alon, S. Ben-David, N. Cesa-Bianchi, and D. Haussler. Scale-sensitive dimensions, uniform convergence, and learnability. J. ACM, 44(4):615-631, 1997.
[2] O. Bousquet and A. Elisseeff. Stability and generalization. J. Mach. Learn. Res., 2:499-526, 2002.
[3] D. Haussler. Decision theoretic generalizations of the PAC model for neural net and other learning applications. Information and Computation, 100(1):78-150, 1992.
[4] M. J. Kearns, R. E. Schapire, and L. M. Sellie. Toward efficient agnostic learning. In Proc. of COLT, 1992.
[5] S. Kutin and P. Niyogi. Almost-everywhere algorithmic stability and generalization error. In Proc. of UAI, 2002.
[6] D. A. McAllester and R. E. Schapire. On the convergence rate of Good-Turing estimators. In Proc. of COLT, 2000.
[7] S. Mukherjee, P. Niyogi, T. Poggio, and R. M. Rifkin. Learning theory: stability is sufficient for generalization and necessary and sufficient for consistency of empirical risk minimization. Adv. Comput. Math., 25(1-3):161-193, 2006.
[8] S. Rakhlin, S. Mukherjee, and T. Poggio. Stability results in learning theory. Analysis and Applications, 3(4):397-419, 2005.
[9] S. Shalev-Shwartz, O. Shamir, N. Srebro, and K. Sridharan. Stochastic convex optimization. In Proc. of COLT, 2009.
[10] V. N. Vapnik. The Nature of Statistical Learning Theory. Springer, 1995.

A Replacement Stability

We used the leave-one-out version of stability throughout the paper; however, many of the results hold when we use the replacement versions instead. Here we briefly survey the differences in the main results as they apply to replacement-based stabilities. Let S^{(i)} denote the sample S with z_i replaced by some other z'_i drawn from the same unknown distribution D.

Definition 5. A rule A is uniform-RO stable with rate ε_stable(m) if for all samples S of m points and all z'_1, ..., z'_m, z' in Z:

    (1/m) Σ_{i=1}^m |f(A(S^{(i)}); z') − f(A(S); z')| <= ε_stable(m).

Definition 6. A rule A is on-average-RO stable with rate ε_stable(m) under distribution D if

    |(1/m) Σ_{i=1}^m E_{S~D^m; z'_1,...,z'_m ~ D}[f(A(S^{(i)}); z_i) − f(A(S); z_i)]| <= ε_stable(m).

With the above definitions replacing uniform-LOO stability and on-average-LOO stability respectively, all theorems in Section 4 other than Theorem 4.3 hold (i.e. Theorem 4.1, Corollary 4.2, Theorem 4.4 and Theorem 4.6). We do not know how to obtain a replacement variant of Theorem 4.3: even for a consistent ERM, we can only guarantee on-average-RO stability (as in Theorem 4.4), and we do not know whether this is enough to ensure RO stability. However, although for ERMs we can only obtain a weaker converse, we can guarantee the existence of an AERM that is not only on-average-RO stable but actually uniform-RO stable. That is, we get a much stronger variant of Theorem 4.6:

Theorem A.1. A learning problem is learnable if and only if there exists a uniform-RO stable AERM.

Proof. Clearly if there exists any rule A that is uniform-RO stable and an AERM, then the problem is learnable, since the learning rule A is in fact universally consistent by Theorem 4.1. On the other hand, if there exists a rule A that is universally consistent, consider the rule A' constructed in Lemma 6.11. As shown in the lemma, this rule is consistent. Now note that A' only uses the first √m samples of S. Hence for i > √m we have A'(S^{(i)}) = A'(S), and so

    (1/m) Σ_{i=1}^m |f(A'(S^{(i)}); z') − f(A'(S); z')| = (1/m) Σ_{i <= √m} |f(A'(S^{(i)}); z') − f(A'(S); z')| <= 2B√m/m = 2B/√m.

We thus showed that this rule is consistent, generalizes, and is 2B/√m-uniformly RO stable.
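Operationally, the construction used in this proof (and in Lemma 6.11) just runs the original rule on the first ⌊√m⌋ observations. A minimal Python sketch follows; the base rule chosen here (the sample mean, i.e. the ERM of a squared objective) is an arbitrary stand-in, since the construction is agnostic to what A does.

    import math
    import random

    def mean_rule(S):
        # Stand-in base rule A (an assumption for this sketch): the sample mean,
        # i.e. the ERM of the squared objective over real-valued hypotheses.
        return sum(S) / len(S)

    def prefix_rule(S, base_rule=mean_rule):
        # A'(S) = A(S'), where S' is the first floor(sqrt(m)) observations of S.
        k = max(1, math.isqrt(len(S)))
        return base_rule(S[:k])

    if __name__ == "__main__":
        rng = random.Random(0)
        S = [rng.random() for _ in range(100)]
        h = prefix_rule(S)
        S_replaced = list(S)
        S_replaced[50] = rng.random()      # replace an instance beyond the prefix
        # The output is unchanged, so the corresponding term of the uniform-RO
        # stability sum is exactly zero; only the first sqrt(m) indices can
        # contribute, giving the 2B*sqrt(m)/m = 2B/sqrt(m) rate.
        print(prefix_rule(S_replaced) == h)   # True

Because only the length-⌊√m⌋ prefix is ever read, the rule trades statistical efficiency (it learns from √m samples) for this very strong form of replacement stability.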


Lecture October 23. Scribes: Ruixin Qiang and Alana Shine CSCI699: Topics in Learning and Gae Theory Lecture October 23 Lecturer: Ilias Scribes: Ruixin Qiang and Alana Shine Today s topic is auction with saples. 1 Introduction to auctions Definition 1. In a single

More information

Tight Complexity Bounds for Optimizing Composite Objectives

Tight Complexity Bounds for Optimizing Composite Objectives Tight Coplexity Bounds for Optiizing Coposite Objectives Blake Woodworth Toyota Technological Institute at Chicago Chicago, IL, 60637 blake@ttic.edu Nathan Srebro Toyota Technological Institute at Chicago

More information

1 Bounding the Margin

1 Bounding the Margin COS 511: Theoretical Machine Learning Lecturer: Rob Schapire Lecture #12 Scribe: Jian Min Si March 14, 2013 1 Bounding the Margin We are continuing the proof of a bound on the generalization error of AdaBoost

More information

A Theoretical Framework for Deep Transfer Learning

A Theoretical Framework for Deep Transfer Learning A Theoretical Fraewor for Deep Transfer Learning Toer Galanti The School of Coputer Science Tel Aviv University toer22g@gail.co Lior Wolf The School of Coputer Science Tel Aviv University wolf@cs.tau.ac.il

More information

Improved Guarantees for Agnostic Learning of Disjunctions

Improved Guarantees for Agnostic Learning of Disjunctions Iproved Guarantees for Agnostic Learning of Disjunctions Pranjal Awasthi Carnegie Mellon University pawasthi@cs.cu.edu Avri Blu Carnegie Mellon University avri@cs.cu.edu Or Sheffet Carnegie Mellon University

More information

Boosting with log-loss

Boosting with log-loss Boosting with log-loss Marco Cusuano-Towner Septeber 2, 202 The proble Suppose we have data exaples {x i, y i ) i =... } for a two-class proble with y i {, }. Let F x) be the predictor function with the

More information

The Weierstrass Approximation Theorem

The Weierstrass Approximation Theorem 36 The Weierstrass Approxiation Theore Recall that the fundaental idea underlying the construction of the real nubers is approxiation by the sipler rational nubers. Firstly, nubers are often deterined

More information

Intelligent Systems: Reasoning and Recognition. Perceptrons and Support Vector Machines

Intelligent Systems: Reasoning and Recognition. Perceptrons and Support Vector Machines Intelligent Systes: Reasoning and Recognition Jaes L. Crowley osig 1 Winter Seester 2018 Lesson 6 27 February 2018 Outline Perceptrons and Support Vector achines Notation...2 Linear odels...3 Lines, Planes

More information

Polygonal Designs: Existence and Construction

Polygonal Designs: Existence and Construction Polygonal Designs: Existence and Construction John Hegean Departent of Matheatics, Stanford University, Stanford, CA 9405 Jeff Langford Departent of Matheatics, Drake University, Des Moines, IA 5011 G

More information

Block designs and statistics

Block designs and statistics Bloc designs and statistics Notes for Math 447 May 3, 2011 The ain paraeters of a bloc design are nuber of varieties v, bloc size, nuber of blocs b. A design is built on a set of v eleents. Each eleent

More information

3.3 Variational Characterization of Singular Values

3.3 Variational Characterization of Singular Values 3.3. Variational Characterization of Singular Values 61 3.3 Variational Characterization of Singular Values Since the singular values are square roots of the eigenvalues of the Heritian atrices A A and

More information

Pattern Recognition and Machine Learning. Learning and Evaluation for Pattern Recognition

Pattern Recognition and Machine Learning. Learning and Evaluation for Pattern Recognition Pattern Recognition and Machine Learning Jaes L. Crowley ENSIMAG 3 - MMIS Fall Seester 2017 Lesson 1 4 October 2017 Outline Learning and Evaluation for Pattern Recognition Notation...2 1. The Pattern Recognition

More information

A Note on Scheduling Tall/Small Multiprocessor Tasks with Unit Processing Time to Minimize Maximum Tardiness

A Note on Scheduling Tall/Small Multiprocessor Tasks with Unit Processing Time to Minimize Maximum Tardiness A Note on Scheduling Tall/Sall Multiprocessor Tasks with Unit Processing Tie to Miniize Maxiu Tardiness Philippe Baptiste and Baruch Schieber IBM T.J. Watson Research Center P.O. Box 218, Yorktown Heights,

More information

Efficient Learning of Generalized Linear and Single Index Models with Isotonic Regression

Efficient Learning of Generalized Linear and Single Index Models with Isotonic Regression Efficient Learning of Generalized Linear and Single Index Models with Isotonic Regression Sha M Kakade Microsoft Research and Wharton, U Penn skakade@icrosoftco Varun Kanade SEAS, Harvard University vkanade@fasharvardedu

More information

New Bounds for Learning Intervals with Implications for Semi-Supervised Learning

New Bounds for Learning Intervals with Implications for Semi-Supervised Learning JMLR: Workshop and Conference Proceedings vol (1) 1 15 New Bounds for Learning Intervals with Iplications for Sei-Supervised Learning David P. Helbold dph@soe.ucsc.edu Departent of Coputer Science, University

More information

List Scheduling and LPT Oliver Braun (09/05/2017)

List Scheduling and LPT Oliver Braun (09/05/2017) List Scheduling and LPT Oliver Braun (09/05/207) We investigate the classical scheduling proble P ax where a set of n independent jobs has to be processed on 2 parallel and identical processors (achines)

More information

Kernel Methods and Support Vector Machines

Kernel Methods and Support Vector Machines Intelligent Systes: Reasoning and Recognition Jaes L. Crowley ENSIAG 2 / osig 1 Second Seester 2012/2013 Lesson 20 2 ay 2013 Kernel ethods and Support Vector achines Contents Kernel Functions...2 Quadratic

More information

Symmetrization and Rademacher Averages

Symmetrization and Rademacher Averages Stat 928: Statistical Learning Theory Lecture: Syetrization and Radeacher Averages Instructor: Sha Kakade Radeacher Averages Recall that we are interested in bounding the difference between epirical and

More information

A Theoretical Analysis of a Warm Start Technique

A Theoretical Analysis of a Warm Start Technique A Theoretical Analysis of a War Start Technique Martin A. Zinkevich Yahoo! Labs 701 First Avenue Sunnyvale, CA Abstract Batch gradient descent looks at every data point for every step, which is wasteful

More information

Chapter 6 1-D Continuous Groups

Chapter 6 1-D Continuous Groups Chapter 6 1-D Continuous Groups Continuous groups consist of group eleents labelled by one or ore continuous variables, say a 1, a 2,, a r, where each variable has a well- defined range. This chapter explores:

More information

Handout 7. and Pr [M(x) = χ L (x) M(x) =? ] = 1.

Handout 7. and Pr [M(x) = χ L (x) M(x) =? ] = 1. Notes on Coplexity Theory Last updated: October, 2005 Jonathan Katz Handout 7 1 More on Randoized Coplexity Classes Reinder: so far we have seen RP,coRP, and BPP. We introduce two ore tie-bounded randoized

More information

Support Vector Machines. Goals for the lecture

Support Vector Machines. Goals for the lecture Support Vector Machines Mark Craven and David Page Coputer Sciences 760 Spring 2018 www.biostat.wisc.edu/~craven/cs760/ Soe of the slides in these lectures have been adapted/borrowed fro aterials developed

More information

The sample complexity of agnostic learning with deterministic labels

The sample complexity of agnostic learning with deterministic labels The sample complexity of agnostic learning with deterministic labels Shai Ben-David Cheriton School of Computer Science University of Waterloo Waterloo, ON, N2L 3G CANADA shai@uwaterloo.ca Ruth Urner College

More information

The Hilbert Schmidt version of the commutator theorem for zero trace matrices

The Hilbert Schmidt version of the commutator theorem for zero trace matrices The Hilbert Schidt version of the coutator theore for zero trace atrices Oer Angel Gideon Schechtan March 205 Abstract Let A be a coplex atrix with zero trace. Then there are atrices B and C such that

More information

Sharp Time Data Tradeoffs for Linear Inverse Problems

Sharp Time Data Tradeoffs for Linear Inverse Problems Sharp Tie Data Tradeoffs for Linear Inverse Probles Saet Oyak Benjain Recht Mahdi Soltanolkotabi January 016 Abstract In this paper we characterize sharp tie-data tradeoffs for optiization probles used

More information

On Poset Merging. 1 Introduction. Peter Chen Guoli Ding Steve Seiden. Keywords: Merging, Partial Order, Lower Bounds. AMS Classification: 68W40

On Poset Merging. 1 Introduction. Peter Chen Guoli Ding Steve Seiden. Keywords: Merging, Partial Order, Lower Bounds. AMS Classification: 68W40 On Poset Merging Peter Chen Guoli Ding Steve Seiden Abstract We consider the follow poset erging proble: Let X and Y be two subsets of a partially ordered set S. Given coplete inforation about the ordering

More information

Feature Extraction Techniques

Feature Extraction Techniques Feature Extraction Techniques Unsupervised Learning II Feature Extraction Unsupervised ethods can also be used to find features which can be useful for categorization. There are unsupervised ethods that

More information

Tight Information-Theoretic Lower Bounds for Welfare Maximization in Combinatorial Auctions

Tight Information-Theoretic Lower Bounds for Welfare Maximization in Combinatorial Auctions Tight Inforation-Theoretic Lower Bounds for Welfare Maxiization in Cobinatorial Auctions Vahab Mirrokni Jan Vondrák Theory Group, Microsoft Dept of Matheatics Research Princeton University Redond, WA 9805

More information

e-companion ONLY AVAILABLE IN ELECTRONIC FORM

e-companion ONLY AVAILABLE IN ELECTRONIC FORM OPERATIONS RESEARCH doi 10.1287/opre.1070.0427ec pp. ec1 ec5 e-copanion ONLY AVAILABLE IN ELECTRONIC FORM infors 07 INFORMS Electronic Copanion A Learning Approach for Interactive Marketing to a Custoer

More information

Stochastic Subgradient Methods

Stochastic Subgradient Methods Stochastic Subgradient Methods Lingjie Weng Yutian Chen Bren School of Inforation and Coputer Science University of California, Irvine {wengl, yutianc}@ics.uci.edu Abstract Stochastic subgradient ethods

More information

1 Identical Parallel Machines

1 Identical Parallel Machines FB3: Matheatik/Inforatik Dr. Syaantak Das Winter 2017/18 Optiizing under Uncertainty Lecture Notes 3: Scheduling to Miniize Makespan In any standard scheduling proble, we are given a set of jobs J = {j

More information

Learnability, Stability, Regularization and Strong Convexity

Learnability, Stability, Regularization and Strong Convexity Learnability, Stability, Regularization and Strong Convexity Nati Srebro Shai Shalev-Shwartz HUJI Ohad Shamir Weizmann Karthik Sridharan Cornell Ambuj Tewari Michigan Toyota Technological Institute Chicago

More information

Testing Properties of Collections of Distributions

Testing Properties of Collections of Distributions Testing Properties of Collections of Distributions Reut Levi Dana Ron Ronitt Rubinfeld April 9, 0 Abstract We propose a fraework for studying property testing of collections of distributions, where the

More information

Stability Bounds for Stationary ϕ-mixing and β-mixing Processes

Stability Bounds for Stationary ϕ-mixing and β-mixing Processes Journal of Machine Learning Research (200 66-686 Subitted /08; Revised /0; Published 2/0 Stability Bounds for Stationary ϕ-ixing and β-ixing Processes Mehryar Mohri Courant Institute of Matheatical Sciences

More information

A Bernstein-Markov Theorem for Normed Spaces

A Bernstein-Markov Theorem for Normed Spaces A Bernstein-Markov Theore for Nored Spaces Lawrence A. Harris Departent of Matheatics, University of Kentucky Lexington, Kentucky 40506-0027 Abstract Let X and Y be real nored linear spaces and let φ :

More information

This article appeared in a journal published by Elsevier. The attached copy is furnished to the author for internal non-commercial research and

This article appeared in a journal published by Elsevier. The attached copy is furnished to the author for internal non-commercial research and This article appeared in a ournal published by Elsevier. The attached copy is furnished to the author for internal non-coercial research and education use, including for instruction at the authors institution

More information

Supplement to: Subsampling Methods for Persistent Homology

Supplement to: Subsampling Methods for Persistent Homology Suppleent to: Subsapling Methods for Persistent Hoology A. Technical results In this section, we present soe technical results that will be used to prove the ain theores. First, we expand the notation

More information

On Conditions for Linearity of Optimal Estimation

On Conditions for Linearity of Optimal Estimation On Conditions for Linearity of Optial Estiation Erah Akyol, Kuar Viswanatha and Kenneth Rose {eakyol, kuar, rose}@ece.ucsb.edu Departent of Electrical and Coputer Engineering University of California at

More information

Lecture 20 November 7, 2013

Lecture 20 November 7, 2013 CS 229r: Algoriths for Big Data Fall 2013 Prof. Jelani Nelson Lecture 20 Noveber 7, 2013 Scribe: Yun Willia Yu 1 Introduction Today we re going to go through the analysis of atrix copletion. First though,

More information

Stability Bounds for Stationary ϕ-mixing and β-mixing Processes

Stability Bounds for Stationary ϕ-mixing and β-mixing Processes Journal of Machine Learning Research (200) 789-84 Subitted /08; Revised /0; Published 2/0 Stability Bounds for Stationary ϕ-ixing and β-ixing Processes Mehryar Mohri Courant Institute of Matheatical Sciences

More information

arxiv: v1 [cs.ds] 17 Mar 2016

arxiv: v1 [cs.ds] 17 Mar 2016 Tight Bounds for Single-Pass Streaing Coplexity of the Set Cover Proble Sepehr Assadi Sanjeev Khanna Yang Li Abstract arxiv:1603.05715v1 [cs.ds] 17 Mar 2016 We resolve the space coplexity of single-pass

More information

Non-Parametric Non-Line-of-Sight Identification 1

Non-Parametric Non-Line-of-Sight Identification 1 Non-Paraetric Non-Line-of-Sight Identification Sinan Gezici, Hisashi Kobayashi and H. Vincent Poor Departent of Electrical Engineering School of Engineering and Applied Science Princeton University, Princeton,

More information

The Simplex Method is Strongly Polynomial for the Markov Decision Problem with a Fixed Discount Rate

The Simplex Method is Strongly Polynomial for the Markov Decision Problem with a Fixed Discount Rate The Siplex Method is Strongly Polynoial for the Markov Decision Proble with a Fixed Discount Rate Yinyu Ye April 20, 2010 Abstract In this note we prove that the classic siplex ethod with the ost-negativereduced-cost

More information

Some Classical Ergodic Theorems

Some Classical Ergodic Theorems Soe Classical Ergodic Theores Matt Rosenzweig Contents Classical Ergodic Theores. Mean Ergodic Theores........................................2 Maxial Ergodic Theore.....................................

More information

Fairness via priority scheduling

Fairness via priority scheduling Fairness via priority scheduling Veeraruna Kavitha, N Heachandra and Debayan Das IEOR, IIT Bobay, Mubai, 400076, India vavitha,nh,debayan}@iitbacin Abstract In the context of ulti-agent resource allocation

More information

A Low-Complexity Congestion Control and Scheduling Algorithm for Multihop Wireless Networks with Order-Optimal Per-Flow Delay

A Low-Complexity Congestion Control and Scheduling Algorithm for Multihop Wireless Networks with Order-Optimal Per-Flow Delay A Low-Coplexity Congestion Control and Scheduling Algorith for Multihop Wireless Networks with Order-Optial Per-Flow Delay Po-Kai Huang, Xiaojun Lin, and Chih-Chun Wang School of Electrical and Coputer

More information

A theory of learning from different domains

A theory of learning from different domains DOI 10.1007/s10994-009-5152-4 A theory of learning fro different doains Shai Ben-David John Blitzer Koby Craer Alex Kulesza Fernando Pereira Jennifer Wortan Vaughan Received: 28 February 2009 / Revised:

More information

A := A i : {A i } S. is an algebra. The same object is obtained when the union in required to be disjoint.

A := A i : {A i } S. is an algebra. The same object is obtained when the union in required to be disjoint. 59 6. ABSTRACT MEASURE THEORY Having developed the Lebesgue integral with respect to the general easures, we now have a general concept with few specific exaples to actually test it on. Indeed, so far

More information

On the Inapproximability of Vertex Cover on k-partite k-uniform Hypergraphs

On the Inapproximability of Vertex Cover on k-partite k-uniform Hypergraphs On the Inapproxiability of Vertex Cover on k-partite k-unifor Hypergraphs Venkatesan Guruswai and Rishi Saket Coputer Science Departent Carnegie Mellon University Pittsburgh, PA 1513. Abstract. Coputing

More information

arxiv: v2 [stat.ml] 23 Feb 2016

arxiv: v2 [stat.ml] 23 Feb 2016 Perutational Radeacher Coplexity A New Coplexity Measure for Transductive Learning Ilya Tolstikhin 1, Nikita Zhivotovskiy 3, and Gilles Blanchard 4 arxiv:1505.0910v stat.ml 3 Feb 016 1 Max-Planck-Institute

More information

On Process Complexity

On Process Complexity On Process Coplexity Ada R. Day School of Matheatics, Statistics and Coputer Science, Victoria University of Wellington, PO Box 600, Wellington 6140, New Zealand, Eail: ada.day@cs.vuw.ac.nz Abstract Process

More information

Efficient Learning with Partially Observed Attributes

Efficient Learning with Partially Observed Attributes Nicolò Cesa-Bianchi DSI, Università degli Studi di Milano, Italy Shai Shalev-Shwartz The Hebrew University, Jerusale, Israel Ohad Shair The Hebrew University, Jerusale, Israel Abstract We describe and

More information

COS 424: Interacting with Data. Written Exercises

COS 424: Interacting with Data. Written Exercises COS 424: Interacting with Data Hoework #4 Spring 2007 Regression Due: Wednesday, April 18 Written Exercises See the course website for iportant inforation about collaboration and late policies, as well

More information

Stability Bounds for Non-i.i.d. Processes

Stability Bounds for Non-i.i.d. Processes tability Bounds for Non-i.i.d. Processes Mehryar Mohri Courant Institute of Matheatical ciences and Google Research 25 Mercer treet New York, NY 002 ohri@cis.nyu.edu Afshin Rostaiadeh Departent of Coputer

More information

Math Real Analysis The Henstock-Kurzweil Integral

Math Real Analysis The Henstock-Kurzweil Integral Math 402 - Real Analysis The Henstock-Kurzweil Integral Steven Kao & Jocelyn Gonzales April 28, 2015 1 Introduction to the Henstock-Kurzweil Integral Although the Rieann integral is the priary integration

More information

Fixed-to-Variable Length Distribution Matching

Fixed-to-Variable Length Distribution Matching Fixed-to-Variable Length Distribution Matching Rana Ali Ajad and Georg Böcherer Institute for Counications Engineering Technische Universität München, Gerany Eail: raa2463@gail.co,georg.boecherer@tu.de

More information

arxiv: v2 [math.co] 3 Dec 2008

arxiv: v2 [math.co] 3 Dec 2008 arxiv:0805.2814v2 [ath.co] 3 Dec 2008 Connectivity of the Unifor Rando Intersection Graph Sion R. Blacburn and Stefanie Gere Departent of Matheatics Royal Holloway, University of London Egha, Surrey TW20

More information

Information Theoretic Guarantees for Empirical Risk Minimization with Applications to Model Selection and Large-Scale Optimization

Information Theoretic Guarantees for Empirical Risk Minimization with Applications to Model Selection and Large-Scale Optimization with Applications to Model Selection and Large-Scale Optiization Ibrahi Alabdulohsin 1 Abstract In this paper, we derive bounds on the utual inforation of the epirical risk iniization (ERM) procedure for

More information

Robustness and Regularization of Support Vector Machines

Robustness and Regularization of Support Vector Machines Robustness and Regularization of Support Vector Machines Huan Xu ECE, McGill University Montreal, QC, Canada xuhuan@ci.cgill.ca Constantine Caraanis ECE, The University of Texas at Austin Austin, TX, USA

More information

Model Fitting. CURM Background Material, Fall 2014 Dr. Doreen De Leon

Model Fitting. CURM Background Material, Fall 2014 Dr. Doreen De Leon Model Fitting CURM Background Material, Fall 014 Dr. Doreen De Leon 1 Introduction Given a set of data points, we often want to fit a selected odel or type to the data (e.g., we suspect an exponential

More information

Prerequisites. We recall: Theorem 2 A subset of a countably innite set is countable.

Prerequisites. We recall: Theorem 2 A subset of a countably innite set is countable. Prerequisites 1 Set Theory We recall the basic facts about countable and uncountable sets, union and intersection of sets and iages and preiages of functions. 1.1 Countable and uncountable sets We can

More information

Bootstrapping Dependent Data

Bootstrapping Dependent Data Bootstrapping Dependent Data One of the key issues confronting bootstrap resapling approxiations is how to deal with dependent data. Consider a sequence fx t g n t= of dependent rando variables. Clearly

More information

The degree of a typical vertex in generalized random intersection graph models

The degree of a typical vertex in generalized random intersection graph models Discrete Matheatics 306 006 15 165 www.elsevier.co/locate/disc The degree of a typical vertex in generalized rando intersection graph odels Jerzy Jaworski a, Michał Karoński a, Dudley Stark b a Departent

More information

Lower Bounds for Quantized Matrix Completion

Lower Bounds for Quantized Matrix Completion Lower Bounds for Quantized Matrix Copletion Mary Wootters and Yaniv Plan Departent of Matheatics University of Michigan Ann Arbor, MI Eail: wootters, yplan}@uich.edu Mark A. Davenport School of Elec. &

More information

New upper bound for the B-spline basis condition number II. K. Scherer. Institut fur Angewandte Mathematik, Universitat Bonn, Bonn, Germany.

New upper bound for the B-spline basis condition number II. K. Scherer. Institut fur Angewandte Mathematik, Universitat Bonn, Bonn, Germany. New upper bound for the B-spline basis condition nuber II. A proof of de Boor's 2 -conjecture K. Scherer Institut fur Angewandte Matheati, Universitat Bonn, 535 Bonn, Gerany and A. Yu. Shadrin Coputing

More information

Ensemble Based on Data Envelopment Analysis

Ensemble Based on Data Envelopment Analysis Enseble Based on Data Envelopent Analysis So Young Sohn & Hong Choi Departent of Coputer Science & Industrial Systes Engineering, Yonsei University, Seoul, Korea Tel) 82-2-223-404, Fax) 82-2- 364-7807

More information

LORENTZ SPACES AND REAL INTERPOLATION THE KEEL-TAO APPROACH

LORENTZ SPACES AND REAL INTERPOLATION THE KEEL-TAO APPROACH LORENTZ SPACES AND REAL INTERPOLATION THE KEEL-TAO APPROACH GUILLERMO REY. Introduction If an operator T is bounded on two Lebesgue spaces, the theory of coplex interpolation allows us to deduce the boundedness

More information

Foundations of Machine Learning Boosting. Mehryar Mohri Courant Institute and Google Research

Foundations of Machine Learning Boosting. Mehryar Mohri Courant Institute and Google Research Foundations of Machine Learning Boosting Mehryar Mohri Courant Institute and Google Research ohri@cis.nyu.edu Weak Learning Definition: concept class C is weakly PAC-learnable if there exists a (weak)

More information

Estimating Entropy and Entropy Norm on Data Streams

Estimating Entropy and Entropy Norm on Data Streams Estiating Entropy and Entropy Nor on Data Streas Ait Chakrabarti 1, Khanh Do Ba 1, and S. Muthukrishnan 2 1 Departent of Coputer Science, Dartouth College, Hanover, NH 03755, USA 2 Departent of Coputer

More information

Bounds on the Minimax Rate for Estimating a Prior over a VC Class from Independent Learning Tasks

Bounds on the Minimax Rate for Estimating a Prior over a VC Class from Independent Learning Tasks Bounds on the Miniax Rate for Estiating a Prior over a VC Class fro Independent Learning Tasks Liu Yang Steve Hanneke Jaie Carbonell Deceber 01 CMU-ML-1-11 School of Coputer Science Carnegie Mellon University

More information

Introduction to Machine Learning. Recitation 11

Introduction to Machine Learning. Recitation 11 Introduction to Machine Learning Lecturer: Regev Schweiger Recitation Fall Seester Scribe: Regev Schweiger. Kernel Ridge Regression We now take on the task of kernel-izing ridge regression. Let x,...,

More information

This model assumes that the probability of a gap has size i is proportional to 1/i. i.e., i log m e. j=1. E[gap size] = i P r(i) = N f t.

This model assumes that the probability of a gap has size i is proportional to 1/i. i.e., i log m e. j=1. E[gap size] = i P r(i) = N f t. CS 493: Algoriths for Massive Data Sets Feb 2, 2002 Local Models, Bloo Filter Scribe: Qin Lv Local Models In global odels, every inverted file entry is copressed with the sae odel. This work wells when

More information

Research Article On the Isolated Vertices and Connectivity in Random Intersection Graphs

Research Article On the Isolated Vertices and Connectivity in Random Intersection Graphs International Cobinatorics Volue 2011, Article ID 872703, 9 pages doi:10.1155/2011/872703 Research Article On the Isolated Vertices and Connectivity in Rando Intersection Graphs Yilun Shang Institute for

More information

Curious Bounds for Floor Function Sums

Curious Bounds for Floor Function Sums 1 47 6 11 Journal of Integer Sequences, Vol. 1 (018), Article 18.1.8 Curious Bounds for Floor Function Sus Thotsaporn Thanatipanonda and Elaine Wong 1 Science Division Mahidol University International

More information