Learning Theory: Lecture Notes

Lecturer: Kamalika Chaudhuri    Scribe: Qiushi Wang    October 27, 2012

1 The Agnostic PAC Model

Recall that one of the constraints of the PAC model is that the data distribution has to be separable with respect to the hypothesis class $H$. The Agnostic PAC model removes this restriction. That is, there need no longer exist an $h \in H$ with $err_D(h) = 0$.

Definition 1 (Agnostic PAC Model) A hypothesis class $H$ is said to be Agnostic PAC-Learnable if there is an algorithm $A$ with the following property. For all $\epsilon, \delta$ with $0 < \epsilon, \delta < \frac{1}{2}$ and all distributions $D$ over $X \times Y$, if $A$ is given $\epsilon$, $\delta$ and $m_H(\epsilon, \delta)$ examples drawn from $D$, then with probability at least $1 - \delta$ it outputs an $h \in H$ with
$$ err_D(h) \le \epsilon + \inf_{h' \in H} err_D(h'). $$

The learning procedure in the PAC model is to find a hypothesis in $H$ which is consistent with all the input examples. In the Agnostic PAC model, there is no such hypothesis. Instead, a common learning procedure is to find a hypothesis $h$ that minimizes the empirical error, that is, the error on the training examples. Suppose that, given a set of samples $S$ drawn from a data distribution $D$, $\hat{h}$ minimizes the empirical error $err(h, S)$ while $h_{opt}$ minimizes the true error $err_D(h)$:
$$ \hat{h} = \arg\min_{h \in H} err(h, S) \quad \text{and} \quad h_{opt} = \arg\min_{h \in H} err_D(h). $$
Our goal is to find the condition under which $err_D(\hat{h}) \le err_D(h_{opt}) + \epsilon$.

Lemma 1 For a fixed $h \in H$ and $m$ samples $S$ drawn from $D$,
$$ P\left( |err_D(h) - err(h, S)| \ge \epsilon \right) \le 2e^{-m\epsilon^2}. $$

Proof: Let $S = \{(x_1, y_1), \ldots, (x_m, y_m)\}$ be the sample set, and let $Z_i = 1(h(x_i) \ne y_i)$ for the fixed $h \in H$. Then $E[Z_i] = err_D(h)$ and $err(h, S) = \frac{1}{m} \sum_i Z_i$. The bound then follows directly from applying Hoeffding's Inequality.

Theorem 1 For a finite hypothesis class $H$,
$$ P\left( err_D(\hat{h}) - err_D(h_{opt}) \ge \epsilon \right) \le 2|H| e^{-m\epsilon^2/4}. $$
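Before moving to the proof of Theorem 1, here is a minimal simulation sketch illustrating the concentration in Lemma 1. It is my addition, not part of the original notes: the toy distribution $D$ (points uniform on $[0,1]$, labeled $1(x > 0.5)$ with the label flipped with probability $0.1$) and the fixed hypothesis are assumptions chosen purely for illustration, and the observed deviation probability is compared with the $2e^{-m\epsilon^2}$ bound.

    # Simulation sketch for Lemma 1 (illustrative only; the distribution and
    # hypothesis below are assumptions, not part of the lecture notes).
    import math
    import random

    def draw_sample(m):
        """m i.i.d. points x ~ Uniform[0,1], labeled 1(x > 0.5), flipped w.p. 0.1."""
        sample = []
        for _ in range(m):
            x = random.random()
            y = int(x > 0.5)
            if random.random() < 0.1:   # label noise: the problem is agnostic
                y = 1 - y
            sample.append((x, y))
        return sample

    def empirical_error(h, sample):
        """err(h, S): fraction of examples that h misclassifies."""
        return sum(h(x) != y for x, y in sample) / len(sample)

    h = lambda x: int(x > 0.5)          # fixed hypothesis; its true error err_D(h) is 0.1
    m, eps, trials = 1000, 0.05, 2000

    deviations = sum(abs(empirical_error(h, draw_sample(m)) - 0.1) >= eps
                     for _ in range(trials))
    print("observed P(|err_D(h) - err(h,S)| >= eps):", deviations / trials)
    print("bound from Lemma 1, 2*exp(-m*eps^2):      ", 2 * math.exp(-m * eps ** 2))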

Proof: First observe that $err_D(\hat{h}) - err_D(h_{opt})$ can be split into three terms:
$$ err_D(\hat{h}) - err_D(h_{opt}) = \left[ err_D(\hat{h}) - err(\hat{h}, S) \right] + \left[ err(\hat{h}, S) - err(h_{opt}, S) \right] + \left[ err(h_{opt}, S) - err_D(h_{opt}) \right]. $$
The middle term satisfies $err(\hat{h}, S) - err(h_{opt}, S) \le 0$, because $\hat{h}$ minimizes $err(\cdot, S)$. Thus
$$ err_D(\hat{h}) - err_D(h_{opt}) \le 2 \sup_{h \in H} |err_D(h) - err(h, S)|. $$
The theorem then results from combining this with the previous lemma and applying a Union Bound over all $h \in H$:
$$ P\left( \sup_{h \in H} |err_D(h) - err(h, S)| \ge \frac{\epsilon}{2} \right) \le \sum_{h \in H} P\left( |err_D(h) - err(h, S)| \ge \frac{\epsilon}{2} \right) \le 2|H| e^{-m\epsilon^2/4}. $$

For failure probability $\delta$, the bound in Theorem 1 can be rewritten as
$$ \epsilon \le 2\sqrt{\frac{\ln(2|H|/\delta)}{m}}, \quad \text{equivalently} \quad m \ge \frac{4 \ln(2|H|/\delta)}{\epsilon^2}. $$
Contrast this with the analogous bound for PAC learning:
$$ \epsilon \le \frac{\ln(|H|/\delta)}{m}, \quad \text{equivalently} \quad m \ge \frac{\ln(|H|/\delta)}{\epsilon}. $$
Thus, since the required sample size grows as $1/\epsilon^2$ rather than $1/\epsilon$, Agnostic PAC learning is statistically harder than PAC learning. Usually it is computationally harder as well.

2 Bounds for Infinite Hypothesis Classes

The generalization bounds we have proved so far apply to finite hypothesis classes, because the union bound step breaks down when $H$ is infinite. We will now see how we can exploit the structure of a hypothesis class to show generalization bounds which apply to infinite classes as well.

What kind of structure can we exploit? In cases where a hypothesis class is infinite, many different hypotheses can produce the same labeling, so often the set of meaningful hypotheses is much smaller. We will measure the complexity of a hypothesis class by the richness of the labelings it can produce. This notion is made formal by the VC dimension.

Assuming binary classification, that is $Y = \{0, 1\}$, for a hypothesis class $H$ and a set of examples $S = \{x_1, \ldots, x_m\}$, we define:
$$ \Pi_H(S) = \{ (h(x_1), \ldots, h(x_m)) : h \in H \}. $$
Here $H$ may be infinite, but $\Pi_H(S)$ has at most $2^m$ possible elements, and under certain conditions on $H$, $\Pi_H(S)$ may have even fewer.

Definition 2 We say a hypothesis class $H$ shatters $S$ if $\Pi_H(S) = \{0, 1\}^m$.

Definition 3 The VC dimension of $H$ is the size of the largest set of examples that can be shattered by $H$. The VC dimension is infinite if for all $m$, there is a set of $m$ examples shattered by $H$.
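These definitions can be made concrete with a brute-force check. The sketch below is an added illustration (the hypothesis class and sample points are arbitrary choices of mine): it enumerates $\Pi_H(S)$ as a set of label vectors and tests whether $S$ is shattered.

    # Brute-force illustration of Pi_H(S) and shattering for a toy finite class.
    # The hypotheses and points below are arbitrary choices made for this example.
    from itertools import product

    S = [0.2, 0.5, 0.8]                       # three example points

    # A small hypothesis class: thresholds at a few fixed values, with both signs.
    H = [lambda x, t=t, s=s: int((x >= t) == s)
         for t in (0.1, 0.35, 0.65, 0.9) for s in (True, False)]

    def labelings(H, S):
        """Pi_H(S): the set of label vectors the class H produces on S."""
        return {tuple(h(x) for x in S) for h in H}

    pi = labelings(H, S)
    print("Pi_H(S) =", sorted(pi))
    print("|Pi_H(S)| =", len(pi), "out of", 2 ** len(S), "possible labelings")
    print("S is shattered by H:", pi == set(product((0, 1), repeat=len(S))))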

Example 1: Bidirectional Thresholds. Let $X = \mathbb{R}$ with $H$ indexed by $\mathbb{R} \times \{+, -\}$. Here each example is a point on a line and has a binary label. Each hypothesis in $H$ corresponds to a threshold $t$ and a sign $+$ or $-$, and can be written as $h_{t,+}$ or $h_{t,-}$, defined as follows:
$$ h_{t,+}(x) = \begin{cases} +, & x \ge t \\ -, & \text{otherwise} \end{cases} $$
In other words, $h_{t,+}$ labels everything to the right of $t$ as $+$ and everything else as $-$, and $h_{t,-}$ is defined correspondingly. Since $t$ can take on any real value, $H$ is infinite. Note that on any fixed set of points $S = \{x_1, x_2, \ldots, x_m\}$ of size $m$ (listed in increasing order), $|\Pi_H(S)| \le 2m$. Consider the following $m + 1$ intervals:
$$ (-\infty, x_1), (x_1, x_2), (x_2, x_3), \ldots, (x_{m-2}, x_{m-1}), (x_{m-1}, x_m), (x_m, \infty). \quad (3) $$
Two thresholds $t$ and $t'$ placed in the same interval and with the same sign result in the same labeling; moreover, a threshold in the leftmost interval with sign $+$ gives the same (all-$+$) labeling as a threshold in the rightmost interval with sign $-$, and a threshold in the leftmost interval with sign $-$ gives the same (all-$-$) labeling as a threshold in the rightmost interval with sign $+$. Thus there are $2(m+1) - 2 = 2m$ distinct labelings.

What is the VC dimension of this class? Thresholds can produce all possible labelings of a set of two distinct points. However, on a sequence of three points they cannot produce the labelings $+, -, +$ or $-, +, -$. Thus no set of size 3 is shattered, and the VC dimension of this hypothesis class is 2.

Example 2: Intervals on the line. Let $X = \mathbb{R}$ with $H$ indexed by $\mathbb{R} \times \mathbb{R}$. Examples are again labeled points on the line, and each hypothesis corresponds to two real values defining an interval; points inside the interval are labeled $+$ and everything else is labeled $-$. Formally, for each interval $[a, b]$, $h_{[a,b]}(x) = +$ for $a \le x \le b$, and $-$ otherwise. For any set $S = \{x_1, \ldots, x_m\}$ of $m$ points, $|\Pi_H(S)| = \binom{m+1}{2} + 1$. Any two hypotheses $h_{[a,b]}$ and $h_{[a',b']}$ where $a$ and $a'$ lie in the same interval of the sequence in Equation (3), and $b$ and $b'$ do as well, produce the same labeling of $S$. Thus there are $\binom{m+1}{2}$ distinct labelings of $S$ where not all data points are labeled $-$, corresponding to hypotheses $h_{[a,b]}$ where $a$ and $b$ lie in different intervals of the sequence in Equation (3). Finally, we add the all-$-$ labeling, which is achieved by $h_{[a,a]}$ for any $a$ outside $S$.

What is the VC dimension of intervals? Intervals can produce any labeling of two distinct points but cannot give a sequence of three distinct points the labeling $+, -, +$. Thus the VC dimension of $H$ is 2. If $H$ is expanded to allow bidirectional intervals, the previous labeling could then be produced, but labelings of four points such as $+, -, +, -$ could not be, giving a VC dimension of 3.

Example 3: Linear Classifiers. Let $X = \mathbb{R}^2$ with $H = \{\text{linear classifiers over } \mathbb{R}^2\}$. Consider a set $S$ of 3 points in general position. Figure 1 shows that all possible labelings of $S$ are achievable by $H$. Thus there exists a set of 3 points that can be shattered by $H$. On the other hand, it can be shown that no set of 4 distinct points on the plane can be shattered by $H$. Thus the VC dimension of $H$ is 3. Note that a set of 3 collinear points on the plane cannot be shattered by $H$, because the labeling $+, -, +$ is not achievable; but this does not change the VC dimension calculation, because there is a set of size 3 that can be shattered. In general, the VC dimension of the hypothesis class of linear classifiers in $\mathbb{R}^d$ is $d + 1$.

Theorem 2 For any finite hypothesis class $H$, $VCdim(H) \le \log_2 |H|$.

Proof: If $H$ shatters a set $S$ of size $m$, then $|H| \ge |\Pi_H(S)| = 2^m$, meaning the VC dimension can be at most $\log_2 |H|$.
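As a quick check on the counting in Examples 1 and 2 above, the following sketch is an added illustration with arbitrarily chosen sample points (label 1 plays the role of $+$ and 0 the role of $-$): it enumerates the distinct labelings produced by thresholds and by intervals on $m$ points and compares the counts with $2m$ and $\binom{m+1}{2} + 1$.

    # Counting labelings produced by thresholds and by intervals on m points.
    from math import comb

    points = [1.0, 2.0, 3.0, 4.0, 5.0]        # any m distinct points work
    m = len(points)

    # Candidate parameters: one value strictly inside each interval of Equation (3).
    cuts = ([points[0] - 1]
            + [(a + b) / 2 for a, b in zip(points, points[1:])]
            + [points[-1] + 1])

    threshold_labelings = {
        tuple(int((x >= t) == sign) for x in points)
        for t in cuts for sign in (True, False)          # h_{t,+} and h_{t,-}
    }
    interval_labelings = {
        tuple(int(a <= x <= b) for x in points)
        for a in cuts for b in cuts if a <= b            # h_[a,b] with endpoints in any gaps
    }

    print("thresholds:", len(threshold_labelings), "distinct labelings; 2m =", 2 * m)
    print("intervals: ", len(interval_labelings),
          "distinct labelings; C(m+1,2)+1 =", comb(m + 1, 2) + 1)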

Figure 1: All possible labelings of S are achievable by the class of linear classifiers on the plane.

Example 4: Infinite VC dimension. Let $X = \mathbb{R}$ and let $H$ be indexed by $\mathbb{R}$: for $w \in \mathbb{R}$, a hypothesis is given by $h_w(x) = \mathrm{sgn}(\sin(wx))$. For all $m$, the set $S = \{2^1, 2^2, \ldots, 2^m\}$ is shattered by $H$. To see this, let $w = \pi \cdot (0.\bar{y}_1 \bar{y}_2 \ldots \bar{y}_m)_2$ be the binary encoding of the desired labels, converting each label $+1$ to the bit 0 and each label $-1$ to the bit 1. Essentially, each $x_i = 2^i$ bit-shifts $w$ to produce the desired label, as a result of the fact that $\mathrm{sgn}(\sin(\pi z)) = (-1)^{\lfloor z \rfloor}$ for non-integer $z$. Thus the VC dimension of this hypothesis class is infinite.

2.1 Sauer's Lemma

Sauer's Lemma formally relates the VC dimension of a hypothesis class $H$ and the size of $\Pi_H(S)$ for any set $S$ of examples of size $m$.

Lemma 2 If the VC dimension of a hypothesis class $H$ is $d$, then for any set $S$ of $m$ samples, where $m \ge d$,
$$ |\Pi_H(S)| \le \sum_{i=0}^{d} \binom{m}{i} \le \left( \frac{em}{d} \right)^d = O(m^d). $$

Proof: We will prove this by induction over $m$ and $d$. Let $\Phi_d(m) = \sum_{i=0}^{d} \binom{m}{i}$. The two base cases: when $m = 0$, $S$ is the empty set, so $|\Pi_H(S)| \le 1 = \Phi_d(0)$. When $d = 0$, $H$ cannot even shatter one point, so only one labeling is possible and $|\Pi_H(S)| = 1 = \Phi_0(m)$. Then, assuming Sauer's Lemma holds for $(m-1, d)$ and $(m-1, d-1)$, we wish to show $|\Pi_H(S)| \le \Phi_d(m)$. Let $S = \{x_1, \ldots, x_m\}$. In what follows, we restrict ourselves to the sample space $S$; restriction to $S$ can only decrease the VC dimension of $H$, so it does not affect the theorem statement.

We start by splitting $\Pi_H(S)$ by introducing two new hypothesis classes $H_1$ and $H_2$, defined on the samples $S' = \{x_1, \ldots, x_{m-1}\}$. $H_1$ is identical to $H$ but ignores the last example $x_m$, while $H_2$ consists of only those hypotheses for which duplicates differing only on $x_m$ occur in $H$. A sample split could be as follows:

            H                     H_1               H_2
            x1 x2 x3 x4 x5        x1 x2 x3 x4       x1 x2 x3 x4
    h1      0  1  1  0  0         0  1  1  0
    h2      0  1  1  0  1                           0  1  1  0
    h3      0  1  1  1  0         0  1  1  1
    h4      1  0  0  1  0         1  0  0  1
    h5      1  0  0  1  1                           1  0  0  1
    h6      1  1  0  0  1         1  1  0  0

If a set is shattered by $H_1$, it is also shattered by $H$. Thus $VCdim(H_1) \le VCdim(H) = d$. If a set $T$ is shattered by $H_2$, then $T \cup \{x_m\}$ is shattered by $H$, implying $VCdim(H_2) \le VCdim(H) - 1 = d - 1$.

With this split, $|\Pi_H(S)| = |\Pi_{H_1}(S')| + |\Pi_{H_2}(S')|$. To see this, let $l$ be any labeling of $S' = S \setminus \{x_m\}$ achievable by $H$; if $(l, 0)$ and $(l, 1)$ both occur in $\Pi_H(S)$, then $l$ occurs in both $\Pi_{H_1}(S')$ and $\Pi_{H_2}(S')$; otherwise, $l$ occurs only in $\Pi_{H_1}(S')$. So by the inductive hypothesis,
$$ |\Pi_H(S)| \le \Phi_d(m-1) + \Phi_{d-1}(m-1) = \sum_{i=0}^{d} \binom{m-1}{i} + \sum_{i=0}^{d-1} \binom{m-1}{i} = \sum_{i=0}^{d} \left[ \binom{m-1}{i} + \binom{m-1}{i-1} \right] = \sum_{i=0}^{d} \binom{m}{i} = \Phi_d(m). $$

Finally, for $m \ge d$, using $\left(1 + \frac{d}{m}\right)^m \le e^d$:
$$ \Phi_d(m) = \sum_{i=0}^{d} \binom{m}{i} \le \left( \frac{m}{d} \right)^d \sum_{i=0}^{d} \binom{m}{i} \left( \frac{d}{m} \right)^i \le \left( \frac{m}{d} \right)^d \sum_{i=0}^{m} \binom{m}{i} \left( \frac{d}{m} \right)^i = \left( \frac{m}{d} \right)^d \left( 1 + \frac{d}{m} \right)^m \le \left( \frac{em}{d} \right)^d. $$
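The counting identity used in the proof can be checked directly on the table above. The sketch below is an added illustration: it takes the six hypotheses $h_1, \ldots, h_6$ from the table, forms $\Pi_H(S)$, $\Pi_{H_1}(S')$ and $\Pi_{H_2}(S')$, verifies $|\Pi_H(S)| = |\Pi_{H_1}(S')| + |\Pi_{H_2}(S')|$, and also checks the bound $\Phi_d(m) \le (em/d)^d$ for one choice of $d$ and $m$.

    # Verifying the counting identity from the proof on the table's six hypotheses.
    import math

    H = [
        (0, 1, 1, 0, 0),   # h1
        (0, 1, 1, 0, 1),   # h2
        (0, 1, 1, 1, 0),   # h3
        (1, 0, 0, 1, 0),   # h4
        (1, 0, 0, 1, 1),   # h5
        (1, 1, 0, 0, 1),   # h6
    ]

    pi_H = set(H)                          # labelings of S = {x1,...,x5}
    H1 = {h[:-1] for h in H}               # restrictions to S' = {x1,...,x4}
    H2 = {h[:-1] for h in H                # restrictions whose both extensions occur in H
          if (h[:-1] + (1 - h[-1],)) in pi_H}
    print("|Pi_H(S)| =", len(pi_H), " |Pi_H1(S')| =", len(H1), " |Pi_H2(S')| =", len(H2))
    print("identity holds:", len(pi_H) == len(H1) + len(H2))

    # Sauer's Lemma bound: Phi_d(m) = sum_{i<=d} C(m,i) <= (e*m/d)^d for m >= d.
    def phi(d, m):
        return sum(math.comb(m, i) for i in range(d + 1))
    d, m = 2, 10
    print("Phi_d(m) =", phi(d, m), " (e*m/d)^d =", (math.e * m / d) ** d)

Returning to Example 4, the bit-shifting construction can also be verified numerically for small $m$. The sketch below is likewise an added illustration: it builds $w$ from each desired labeling exactly as described there, except for one extra guard bit (my addition) appended to $w$ so that $\sin(wx)$ is never exactly zero.

    # Numerical check of the sin(wx) shattering construction in Example 4.
    import math
    from itertools import product

    def sign(z):
        return 1 if z > 0 else -1

    m = 6
    S = [2 ** i for i in range(1, m + 1)]                 # S = {2^1, ..., 2^m}

    ok = True
    for labels in product((1, -1), repeat=m):             # every possible labeling of S
        bits = [0 if y == 1 else 1 for y in labels]       # +1 -> bit 0, -1 -> bit 1
        frac = sum(b * 2.0 ** -(i + 1) for i, b in enumerate(bits))
        w = math.pi * (frac + 2.0 ** -(m + 1))            # guard bit keeps sin(w*x) nonzero
        ok = ok and all(sign(math.sin(w * x)) == y for x, y in zip(S, labels))
    print("all", 2 ** m, "labelings of S realized by some h_w:", ok)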