Vapnik-Chervonenkis theory

Risi Kondor

June 13, 2008

For the purposes of this lecture, we restrict ourselves to the binary supervised batch learning setting. We assume that we have an input space X, and an unknown distribution D on X × {−1,+1}. Given a training set S = ((x_1,y_1),(x_2,y_2),...,(x_m,y_m)) drawn from D, our learning algorithm tries to find a hypothesis ĥ: X → {−1,+1} that will predict well on future (x,y) examples drawn from D, in the sense that the true error

    E(ĥ) = E_{(x,y)∼D}[ I[ĥ(x) ≠ y] ]                                (1)

will not be too big. Of course we cannot measure E(ĥ), because we don't know D. What we have instead is the empirical error measured on the sample S:

    E_S(ĥ) = (1/m) Σ_{i=1}^{m} I[ĥ(x_i) ≠ y_i].                      (2)

This is otherwise known as the training error, and typically it is an overoptimistic estimate of the true error, simply because the way most learning algorithms work is to implicitly or explicitly drive down just this quantity. Some amount of overfitting is then unavoidable.

The job of generalization bounds is to relate these two quantities, so that we can report things like "my algorithm found the hypothesis ĥ; on the training data it has error E_S(ĥ), and on future data the error is not likely to be more than ǫ worse", in other words, E(ĥ) − E_S(ĥ) < ǫ. No matter how we set ǫ, however, a really, really bad training set (in the sense of a very unrepresentative sample) can always mislead us even more, so explicit bounds of this form are generally not obtainable. The "P" part of "PAC" stands for aiming for guarantees of this form only in the probabilistic sense, i.e., finding (ǫ,δ) pairs, where both are small positive real numbers, such that

    P_S[ E(ĥ) − E_S(ĥ) ≤ ǫ ] ≥ 1 − δ.                                (3)

Here P_S stands for probability over the choice of training set.
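To make the gap between (1) and (2) tangible, here is a small simulation sketch; the distribution (a noisy threshold on [0,1]), the hypothesis class, and every numeric choice are hypothetical, picked purely for illustration:

```python
import random

random.seed(0)

# Hypothetical D: x uniform on [0,1], y = sign(x - 0.5), with 10% label noise.
def sample(m):
    data = []
    for _ in range(m):
        x = random.random()
        y = 1 if x > 0.5 else -1
        if random.random() < 0.1:       # flip the label with probability 0.1
            y = -y
        data.append((x, y))
    return data

def err(h, data):
    # Empirical error (2): fraction of examples the hypothesis gets wrong.
    return sum(1 for x, y in data if h(x) != y) / len(data)

S = sample(30)                           # a small training set
# "Learning algorithm": pick the threshold classifier with lowest training error.
t_hat = min((i / 100 for i in range(101)),
            key=lambda t: err(lambda x: 1 if x > t else -1, S))
h_hat = lambda x: 1 if x > t_hat else -1

train_err = err(h_hat, S)                # E_S(h_hat)
true_err = err(h_hat, sample(100000))    # Monte Carlo estimate of E(h_hat)
print(train_err, true_err)               # the training error is typically optimistic
```

Since the threshold is chosen to minimize the training error, E_S(ĥ) will usually come out below the Monte Carlo estimate of E(ĥ): exactly the optimism the bounds below are meant to control.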
Concentration

The fundamental idea behind all generalization bounds is that although it is possible that the same quantity, in our case the error, will turn out to be very different on two different samples from the same distribution, this is not likely to happen. In general, as the sample size grows, empirical quantities tend to concentrate more and more around their mean, in our case the true error. The simplest inequality capturing this fact, and one that is key to our development, is Hoeffding's inequality, which states that if Z_1, Z_2, ..., Z_m are independent draws of a Bernoulli random variable with parameter p and γ > 0, then

    P[ (1/m) Σ_{i=1}^{m} Z_i > p + γ ] ≤ e^{−2mγ²}.

This fits our problem nicely, because for any h ∈ C, if we take Z_i = I[h(x_i) ≠ y_i], Hoeffding's inequality says something about the probability of deviations from the true error p = E(h) of the hypothesis h. Plugging in the hypothesis ĥ returned by our algorithm and blindly applying Hoeffding's inequality gives

    P[ E(ĥ) − E_S(ĥ) > ǫ ] ≤ e^{−2mǫ²},

so setting the right hand side equal to δ, the PAC bound

    P[ E(ĥ) − E_S(ĥ) > ǫ ] < δ                                       (4)

is satisfied when

    ǫ > sqrt( ln(1/δ) / (2m) ).

Unfortunately, this simple argument is NOT CORRECT. What goes wrong is that given the fact that our algorithm returned ĥ, the (x_i, y_i) examples in the sample are not IID samples from D, and consequently neither are the Z_i's IID samples from Bernoulli(E(ĥ)). One way to put this coupling between the sample and the hypothesis in relief is to note that our algorithm is really just a function A: S ↦ ĥ. This makes it clear that the empirical error is a function of S in two ways: through the hypothesis A(S) and through the individual training examples (x_i, y_i) ∈ S. Clearly, we can't hold one constant and regard E_S as a statistic based on IID draws of the other.
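Hoeffding's inequality is easy to probe by simulation. The sketch below (with arbitrary illustrative values of p, m and γ) estimates P[(1/m) Σ Z_i > p + γ] empirically and compares it to e^{−2mγ²}:

```python
import math
import random

random.seed(1)

p, m, gamma, trials = 0.3, 200, 0.1, 20000   # illustrative values only

def sample_mean():
    # Mean of m independent Bernoulli(p) draws.
    return sum(random.random() < p for _ in range(m)) / m

freq = sum(sample_mean() > p + gamma for _ in range(trials)) / trials
bound = math.exp(-2 * m * gamma ** 2)        # Hoeffding: e^{-2 m gamma^2}
print(freq, bound)                            # empirical frequency vs. the bound
```

For these values the bound is e^{−4} ≈ 0.018, and the simulated deviation frequency comes out well below it: Hoeffding's inequality is valid but not tight, a looseness that compounds in the union-bound argument below.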
Uniform convergence

Overcoming the problem of the coupling between S and ĥ is a major hurdle in proving generalization bounds. The way to proceed is, instead of focusing on any one particular h, to focus on all of them simultaneously. In particular, if we can find an (ǫ,δ) pair such that

    P[ E(h) − E_S(h) ≤ ǫ  for all h ∈ C ] ≥ 1 − δ,

or equivalently,

    P[ ∃ h ∈ C such that E(h) − E_S(h) > ǫ ] < δ,

then that (ǫ,δ) pair will certainly satisfy the PAC bound (3). At first sight this seems like terrible overkill, since C might include some crazy irregular functions that might never be chosen by any reasonable algorithm, and these functions might make our bound very loose. On the other hand, it is worth noting that, at least amongst reasonable functions, the ĥ chosen by our learning algorithm is actually likely to be towards the top of the list in terms of the magnitude of E(h) − E_S(h), simply because learning algorithms by their very nature tend to drive down E_S(h). So bounding E(h) − E_S(h) for all h ∈ C might not be such a crazy thing to do after all.

How best to do it is not obvious, though. If C has only a finite number of hypotheses, the simplistic method is to use the union bound:

    P[ ∃ h ∈ C such that E(h) − E_S(h) > ǫ ] ≤ Σ_{h∈C} P[ E(h) − E_S(h) > ǫ ],

where on the right hand side now we are allowed to use the Hoeffding bound

    P[ E(h) − E_S(h) > ǫ ] ≤ e^{−2mǫ²},

leading to |C| e^{−2mǫ²} ≤ δ and therefore

    ǫ > sqrt( (ln|C| + ln(1/δ)) / (2m) ).

The union bound is clearly very loose though, and the explicit appearance of the number of hypotheses in C is also worrying: what if two hypotheses are almost identical? Should we still count them as separate? Clearly, there must be some more appropriate way of quantifying the richness of a concept space than just counting the number of hypotheses in C. VC theory is an attempt to do just this.
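For a finite class, the resulting ǫ is directly computable. A quick sketch, with purely illustrative numbers:

```python
import math

# From the union bound: epsilon = sqrt((ln|C| + ln(1/delta)) / (2m)).
def epsilon(card_C, delta, m):
    return math.sqrt((math.log(card_C) + math.log(1 / delta)) / (2 * m))

# e.g. |C| = 1000 hypotheses, confidence 1 - delta = 0.95, m = 10000 examples:
print(epsilon(1000, 0.05, 10000))    # roughly 0.022
# Note the mild, logarithmic dependence on the number of hypotheses |C|,
# and the usual 1/sqrt(m) decay in the sample size.
```

Even so, the appearance of |C| at all is what the growth function and VC dimension will replace.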
Symmetrization

Concentration inequalities don't just tell us that on a single sample S, E_S(h) can't deviate too much from its mean E(h); they also imply that for a pair of independent samples S and S′, E_S(h) can't be very far from E_{S′}(h). The key idea behind Vapnik and Chervonenkis' pioneering work was to exploit this fact by a process called symmetrization to reduce everything to just looking at finite samples. We start with the following simple application of Hoeffding's inequality.

Proposition 1 Let S and S′ be two independent samples of size m drawn from a distribution D on X × {−1,+1}, and let E and E_S be defined as in (1) and (2). Then for any h ∈ C,

    P[ E(h) − E_S(h) > ǫ ] ≤ 2 P[ E_{S′}(h) − E_S(h) > ǫ/2 ].

This result also readily generalizes to the uniform case.

Proposition 2 Let S and S′ be two independent samples of size m drawn from a distribution D on X × {−1,+1}, and let E and E_S be defined as in (1) and (2). Then

    P[ sup_{h∈C} (E(h) − E_S(h)) > ǫ ] ≤ 2 P[ sup_{h∈C} (E_{S′}(h) − E_S(h)) > ǫ/2 ].

Now let us define S̄ = S ∪ S′ and ask ourselves: what is the probability that the errors incurred by h on S̄ are distributed in such a way that mǫ/2 more of them fall in S′ than in S? For the sake of simplicity, here we only consider the case that exactly 0 errors fall in S and k = mǫ/2 fall in S′.

Proposition 3 Consider 2m balls of which exactly k are black. If we randomly split the balls into two sets of size m, then the probability P_{k,0} that all the black balls end up in the first set is at most 1/2^k.

Proof.

    P_{k,0} = C(2m−k, m−k) / C(2m, m)
            = [ m(m−1)(m−2)...(m−k+1) ] / [ (2m)(2m−1)...(2m−k+1) ]
            ≤ 1/2^k.

For the general case of u and u + k balls in the two sets, a similar combinatorial inequality holds.

The real significance of symmetrization is that it allows us to quantify the complexity of C in terms of just the joint sample S̄ instead of its behavior on the entire input space. In particular, in bounding sup_{h∈C} (E_{S′}(h) − E_S(h)), two hypotheses h and h′ only need to be counted as distinct if they differ on S̄: how they behave over the rest of X is immaterial.
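Proposition 3 can be checked by direct simulation; the values of m, k and the trial count below are arbitrary illustrative choices:

```python
import random
from math import comb

random.seed(2)

m, k, trials = 10, 4, 200000     # illustrative values

def all_black_in_first_half():
    balls = [1] * k + [0] * (2 * m - k)       # 1 marks a black ball
    random.shuffle(balls)                      # a uniformly random split
    return sum(balls[:m]) == k                 # did all k land in the first set?

freq = sum(all_black_in_first_half() for _ in range(trials)) / trials
exact = comb(2 * m - k, m - k) / comb(2 * m, m)   # P_{k,0} from the proof
print(freq, exact, 2 ** -k)      # empirical, exact, and the 2^{-k} bound
```

For m = 10 and k = 4 the exact probability is about 0.043, comfortably below 2^{−4} = 0.0625, and the simulated frequency agrees with the exact value.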
To be somewhat more explicit, we define the restriction of h to S̄ as

    h|_S̄ : S̄ → {−1,+1},        h|_S̄(x) = h(x),
and the corresponding restricted concept class as

    C|_S̄ = { h|_S̄ : h ∈ C }.

While C|_S̄ is of course a property of S̄, it is also a characteristic of the entire concept class C in the sense that it is often possible to bound its size independently of S̄. The maximal rate at which |C|_S̄| can grow with the size of S̄,

    Π_C(n) = max_{U ⊆ X, |U| = n} |C|_U|,

is called the growth function. Using the growth function and Proposition 3, we can now give the finite sample version of the union bound:

    P[ sup_{h∈C} (E_{S′}(h) − E_S(h)) > ǫ/2 ] ≤ Π_C(2m) · 2^{−mǫ/2}.

The VC dimension is solely a device for computing Π_C(n).

The VC dimension

The concept class C is said to shatter a set S ⊆ X if C can realize all possible labelings of S, i.e., if |C|_S| = 2^{|S|}. The Vapnik-Chervonenkis dimension of C is the size of the largest subset of X that C can shatter:

    VC(C) = max { |S| : S ⊆ X and |C|_S| = 2^{|S|} }.

The following famous result (called the Sauer-Shelah lemma) tells us how to bound the growth function in terms of the VC dimension.

Proposition 4 (Sauer-Shelah lemma) Let C be a concept class of VC dimension d, and let Π(m) be the corresponding growth function. Then for m ≤ d, Π(m) = 2^m; and for m > d,

    Π(m) ≤ (em/d)^d.                                                 (5)

Proof. The m ≤ d case is just a restatement of the definition of VC dimension. For m > d we first show, by induction on m, that

    Π(m) ≤ Σ_{i=0}^{d} C(m, i).

For m = 2 it is trivial to show that this holds. Assuming that it holds for m and any d, we now show that it also holds for m+1 and any d. Let S̄ be any subset of X of size m+1, and fixing some x ∈ S̄, let us write it as S̄ = S̄\x ∪ {x}, where |S̄\x| = m. Now for any h ∈ C|_{S̄\x}, consider its two possible extensions

    h⁺ : S̄ → {−1,+1}  with  h⁺(x′) = h(x′) for x′ ∈ S̄\x  and  h⁺(x) = +1,
    h⁻ : S̄ → {−1,+1}  with  h⁻(x′) = h(x′) for x′ ∈ S̄\x  and  h⁻(x) = −1.

Either both of these hypotheses are in C|_S̄ or only one of them is. Let U be the subset of C|_{S̄\x} of hypotheses for which both h⁺ and h⁻ are in C|_S̄, and let U_S̄ = ∪_{h∈U} {h⁺, h⁻}. Then we have

    |C|_S̄| = |C|_{S̄\x}| + |U|.
By the inductive hypothesis, |C|_{S̄\x}| ≤ Σ_{i=0}^{d} C(m, i). As for the second term, consider that if U shatters any set V, then U_S̄ will shatter V ∪ {x}, so VC(U) ≤ VC(U_S̄) − 1 ≤ d − 1, and therefore by the inductive hypothesis |U| ≤ Σ_{i=0}^{d−1} C(m, i). Therefore

    |C|_S̄| ≤ Σ_{i=0}^{d} C(m, i) + Σ_{i=0}^{d−1} C(m, i)
           = Σ_{i=0}^{d} [ C(m, i) + C(m, i−1) ]
           = Σ_{i=0}^{d} C(m+1, i),

since Σ_{i=0}^{d} C(m+1, i) is just the number of ways of choosing up to d objects from m+1, and the sum C(m, i) + C(m, i−1) corresponds to decomposing these choices according to whether a particular object has been chosen or not.

Finally, for m ≥ d,

    (d/m)^d Σ_{i=0}^{d} C(m, i) ≤ Σ_{i=0}^{d} C(m, i) (d/m)^i
                                ≤ Σ_{i=0}^{m} C(m, i) (d/m)^i
                                = (1 + d/m)^m
                                ≤ exp(d),

so Σ_{i=0}^{d} C(m, i) ≤ (m/d)^d exp(d) = (em/d)^d, which completes the proof.
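Both the growth-function definitions and the final inequality are easy to check numerically. The sketch below first brute-forces the labelings C|_U for a toy concept class (thresholds on the line, which have VC dimension 1 and growth function Π(m) = m + 1), and then verifies Σ_{i=0}^{d} C(m, i) ≤ (em/d)^d for a few (m, d) pairs; all concrete values are arbitrary illustrative choices:

```python
from math import comb, e

# Toy concept class: thresholds h_t(x) = +1 iff x > t. Restricted to an
# m-point set U, this class realizes exactly m + 1 distinct labelings,
# so it shatters single points but no 2-point set (VC dimension 1).
def labelings(U, thresholds):
    return {tuple(1 if x > t else -1 for x in U) for t in thresholds}

U = (0.1, 0.4, 0.7)
ts = [-1.0, 0.2, 0.5, 0.8]   # enough thresholds to realize every labeling on U
L = labelings(U, ts)
print(len(L))                 # 4 = |U| + 1, far below 2^3 = 8: no shattering

# The final step of the proof: sum_{i=0}^{d} C(m, i) <= (e m / d)^d, m >= d.
def binom_sum(m, d):
    return sum(comb(m, i) for i in range(d + 1))

for m, d in [(10, 3), (50, 5), (200, 10)]:
    print(m, d, binom_sum(m, d) <= (e * m / d) ** d)
```

The point of the (em/d)^d form is that it turns the raw count 2^m into a polynomial of degree d in m, which is what makes the bound Π_C(2m) · 2^{−mǫ/2} go to zero as m grows.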