Statistical Machine Learning II Spring 2017, Learning Theory, Lecture 7

Jean Honorio
jhonorio@purdue.edu

1 Introduction

So far we have seen some techniques for proving generalization for countably finite hypothesis classes (e.g., union bound), as well as for infinite hypothesis classes (e.g., primal-dual witness, Rademacher complexity).

Example 1. Let's start with a very simple example: classification of one-dimensional data. Our task is to give a binary label in $\{0, 1\}$ to an input $z \in \mathbb{R}$. In this example, the hypothesis class is the set of threshold functions:

$$F = \{f : \mathbb{R} \to \{0,1\} \mid f(z) = 1[z > \theta],\ \theta \in \mathbb{R}\} \cup \{f : \mathbb{R} \to \{0,1\} \mid f(z) = 1[z < \theta],\ \theta \in \mathbb{R}\}$$

Could we use union bounds in this case? The cardinality of $F$ is equal to the number of scalars in $\mathbb{R}$ [1]. Instead of counting the number of functions in $F$, we will count the possible ways in which $n$ training samples could be labeled with functions in $F$.

[1] $\mathbb{R}$ is not a countable set, by Cantor's diagonalisation argument.

2 Growth function

We will assume an arbitrary domain $Z$ and a dataset $S = \{z_1, \dots, z_n\}$ containing $n$ samples, where $z_i \in Z$ for all $i$. In general, we will assume a hypothesis class $F \subseteq \{f \mid f : Z \to \{0,1\}\}$. We will use the following shorthand notation:

$$F(S) = \{(f(z_1), \dots, f(z_n)) \in \{0,1\}^n \mid f \in F\}$$

That is, $F(S)$ contains all the $\{0,1\}^n$ vectors that can be produced by applying all functions in $F$ to the dataset $S$. A natural measure of complexity is the following.

Definition 7.1. The growth function (or shatter coefficient) of the hypothesis class $F \subseteq \{f \mid f : Z \to \{0,1\}\}$ for $n$ samples is:

$$G(F, n) = \max_{S \in Z^n} |F(S)|$$
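The set $F(S)$ of Example 1 can be enumerated directly for a small dataset. The following is a minimal Python sketch, not part of the original notes; the function name `threshold_labelings` and the choice of candidate thresholds are assumptions made for illustration. It relies on the observation that only thresholds below all points, between neighbouring points, or above all points can yield distinct labelings.

```python
def threshold_labelings(sample):
    """Return F(S) for the threshold class of Example 1 (hypothetical helper).

    `sample` is a list of real numbers z_1, ..., z_n. The result is the set of
    label vectors (f(z_1), ..., f(z_n)) over all threshold functions f.
    """
    points = sorted(set(sample))
    # Candidate thresholds: one below all points, one between each pair of
    # neighbouring points, and one above all points. Any other theta produces
    # the same labeling as one of these.
    candidates = [points[0] - 1.0]
    candidates += [(a + b) / 2.0 for a, b in zip(points, points[1:])]
    candidates += [points[-1] + 1.0]

    labelings = set()
    for theta in candidates:
        labelings.add(tuple(int(z > theta) for z in sample))  # f(z) = 1[z > theta]
        labelings.add(tuple(int(z < theta) for z in sample))  # f(z) = 1[z < theta]
    return labelings


if __name__ == "__main__":
    S = [0.5, 1.5, 3.0]
    for vector in sorted(threshold_labelings(S)):
        print(vector)
```

Note that this computes $|F(S)|$ for one particular dataset $S$, whereas $G(F, n)$ is the maximum of this quantity over all datasets of size $n$.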

Note that the growth function does not depend on the specific training set $S$; it measures the worst case among all possible training sets. Clearly $G(F, n) \le 2^n$, but often it is much smaller.

Example 1 (continues). Assume we sort all samples in $S$ in increasing order, and recall that we have threshold functions. After sorting, it is not possible to label three consecutive samples as 0, 1, 0 or as 1, 0, 1. In other words, all samples to the left should be 0 and all samples to the right should be 1 (or alternatively, all samples to the left should be 1 and all samples to the right should be 0). Let's see this more graphically. Each column is one of the $n$ samples, and each row is a possible $\{0,1\}^n$ vector in $F(S)$.

(1, 0, 0, 0, ..., 0)
(1, 1, 0, 0, ..., 0)
(1, 1, 1, 0, ..., 0)
...
(1, 1, 1, 1, ..., 1)
(0, 1, 1, 1, ..., 1)
(0, 0, 1, 1, ..., 1)
(0, 0, 0, 1, ..., 1)
...
(0, 0, 0, 0, ..., 0)

Clearly, $G(F, n) = 2n$ for Example 1.

3 Vapnik-Chervonenkis (VC) dimension

Definition 7.2. The VC dimension of the hypothesis class $F \subseteq \{f \mid f : Z \to \{0,1\}\}$ is:

$$VC(F) = \max_{n \in \mathbb{N}} \{n \mid G(F, n) = 2^n\}$$

Like the growth function, the VC dimension does not depend on the specific training set $S$. Also, by definition, if $VC(F) = d$ then for all $n > d$ we have $G(F, n) < 2^n$. (The inequality is strict.) In the following section, we show a less obvious result.

Example 1 (continues). As before, assume we sort all samples in $S$ in increasing order, and recall that we have threshold functions. Let's list all possible $\{0,1\}^n$ vectors and strike out the ones that are not in the set $F(S)$.

n = 1          n = 2          n = 3
(0)            (0, 0)         (0, 0, 0)
(1)            (0, 1)         (0, 0, 1)
               (1, 0)         (0, 1, 0)   [struck out]
               (1, 1)         (0, 1, 1)
                              (1, 0, 0)
                              (1, 0, 1)   [struck out]
                              (1, 1, 0)
                              (1, 1, 1)
G(F, 1) = 2    G(F, 2) = 4    G(F, 3) = 6

Clearly, $VC(F) = 2$ for Example 1. In fact, since we previously found that $G(F, n) = 2n$, by Definition 7.2 we have:

$$VC(F) = \max_{n \in \mathbb{N}} \{n \mid G(F, n) = 2^n\} = \max_{n \in \mathbb{N}} \{n \mid 2n = 2^n\} = 2$$

4 Sauer-Shelah lemma

Lemma 7.1. The growth function and the VC dimension of a hypothesis class $F \subseteq \{f \mid f : Z \to \{0,1\}\}$ fulfill:

$$G(F, n) \le \sum_{i=0}^{VC(F)} \binom{n}{i} \le (n+1)^{VC(F)}$$

Proof. The right-hand side is just a consequence of the binomial theorem, thus we will concentrate on the left-hand side. We will use proof by induction. First, define for clarity:

$$H(n, d) = \sum_{i=0}^{d} \binom{n}{i}$$

Since the binomial coefficient fulfills $\binom{n}{i} = \binom{n-1}{i} + \binom{n-1}{i-1}$, it is clear that:

$$H(n, d) = H(n-1, d) + H(n-1, d-1) \qquad (1)$$

We can restate the theorem as follows: if $VC(F) \le d$ then:

$$G(F, n) \le H(n, d) \qquad (2)$$
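The quantities in Lemma 7.1 are easy to check numerically for small values. Below is a minimal Python sketch, not part of the original notes; the function names `H` and `check_sauer_shelah_identities` are assumptions chosen to mirror the notation above. It verifies the recursion in eq.(1), the right-hand inequality of Lemma 7.1, and that the threshold class of Example 1 (with $G(F, n) = 2n$ and $VC(F) = 2$) satisfies eq.(2).

```python
from math import comb


def H(n, d):
    """Sum of binomial coefficients C(n, 0) + ... + C(n, d)."""
    # math.comb(n, i) returns 0 when i > n, matching the usual convention.
    return sum(comb(n, i) for i in range(d + 1))


def check_sauer_shelah_identities(max_n=10, max_d=5):
    """Return True if the checked identities hold for all small n and d."""
    for n in range(1, max_n + 1):
        for d in range(1, max_d + 1):
            # Recursion of eq.(1).
            if H(n, d) != H(n - 1, d) + H(n - 1, d - 1):
                return False
            # Right-hand inequality of Lemma 7.1.
            if H(n, d) > (n + 1) ** d:
                return False
        # Example 1: growth function 2n, VC dimension 2.
        if 2 * n > H(n, 2):
            return False
    return True


if __name__ == "__main__":
    print(check_sauer_shelah_identities())
```

For Example 1 the bound is tight at $n = 1$ and $n = 2$, where $2n = H(n, 2)$, and becomes loose for larger $n$ since $H(n, 2)$ grows quadratically.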

Base case. We show that eq.(2) holds for $n = 1$ and all $d \ge 1$. Since we have only one sample, we have only two possible $\{0,1\}^1$ vectors, $(0)$ and $(1)$, and therefore:

$$G(F, 1) \le |\{(0), (1)\}| = 2$$

On the other hand:

$$H(1, d) = \sum_{i=0}^{d} \binom{1}{i} = \binom{1}{0} + \binom{1}{1} = 2$$

Thus, $G(F, 1) \le H(1, d) = 2$ and eq.(2) holds for $n = 1$ and all $d \ge 1$.

Inductive step. Assume that eq.(2) holds for $n - 1$ and all $d \ge 1$, and show that it holds for $n$ and $d$. Fix a dataset $S$ and define:

$$S = \{z_1, z_2, \dots, z_n\}, \qquad S_2 = \{z_2, \dots, z_n\}$$

Furthermore, define:

$$F_2 = F(S_2)$$

$$F'_2 = \{(f(z_2), \dots, f(z_n)) \mid f \in F \text{ such that } (\exists f' \in F)\ f'(z_1) = 1 - f(z_1) \text{ and } f'(z_i) = f(z_i) \text{ for } i = 2 \dots n\}$$

Let's see this more graphically. For $F(S)$, each column is one of the $n$ samples, and each row is a possible $\{0,1\}^n$ vector. For $F_2$ and $F'_2$, each column is one of the $n - 1$ samples, and each row is a possible $\{0,1\}^{n-1}$ vector. Here $b_i \in \{0, 1\}$.

There are three cases                          F(S)                   F_2                 F'_2
1. Two vectors in F(S) match in                (0, b_2, ..., b_n)     (b_2, ..., b_n)     (b_2, ..., b_n)
   entries 2 to n, but not in entry 1          (1, b_2, ..., b_n)
2. A vector in F(S) is unique in               (0, b_2, ..., b_n)     (b_2, ..., b_n)
   entries 2 to n, entry 1 is 0
3. A vector in F(S) is unique in               (1, b_2, ..., b_n)     (b_2, ..., b_n)
   entries 2 to n, entry 1 is 1

Let $c_1$, $c_2$ and $c_3$ be the number of times cases 1, 2 and 3 occur in $F(S)$, respectively.

The next table shows the number of binary vectors in $F(S)$, $F_2$ and $F'_2$.

There are three cases                          F(S)                 F_2                 F'_2
1. Two vectors in F(S) match in ...            2 c_1                c_1                 c_1
2. A vector in F(S) is unique in ...           c_2                  c_2                 0
3. A vector in F(S) is unique in ...           c_3                  c_3                 0
Total number of binary vectors                 2 c_1 + c_2 + c_3    c_1 + c_2 + c_3     c_1

From the above, it is clear that:

$$|F(S)| = |F_2| + |F'_2| \qquad (3)$$

$$|F_2| \le |F(S)|$$

$$2\,|F'_2| \le |F(S)|$$

Recall that Definition 7.2 (VC dimension) depends on powers of 2. Thus from the above, it is clear that if $VC(F(S)) \le d$ then:

$$VC(F_2) \le d$$

$$VC(F'_2) \le d - 1$$

Recall that the number of samples in $S$ is $n$, while the number of samples in $S_2$ is $n - 1$. From eq.(3), the above and eq.(1), we have:

$$|F(S)| = |F_2| + |F'_2| \le H(n-1, d) + H(n-1, d-1) = H(n, d)$$

Since the choice of $S$ was arbitrary, the above holds for any dataset $S$, and thus:

$$G(F, n) = \max_{S \in Z^n} |F(S)| \le H(n, d)$$

Therefore, eq.(2) holds and we prove our claim.

5 Massart lemma and Rademacher complexity

Lemma 7.2. Let $A$ be a countably finite subset of $\mathbb{R}^n$. Let $\sigma = \{\sigma_1 \dots \sigma_n\}$ be $n$ independent Rademacher random variables. We have:

$$E_\sigma\Big[\sup_{a \in A} \sum_{i=1}^{n} \sigma_i a_i\Big] \le \sqrt{2 \log |A|}\ \sup_{a \in A} \|a\|_2$$

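Proof. For any $t > 0$ we have:

$$
\begin{aligned}
\exp\Big(t\, E_\sigma\Big[\sup_{a \in A} \sum_{i=1}^{n} \sigma_i a_i\Big]\Big)
&\le E_\sigma\Big[\exp\Big(t \sup_{a \in A} \sum_{i=1}^{n} \sigma_i a_i\Big)\Big] && (4.a) \\
&= E_\sigma\Big[\sup_{a \in A} \exp\Big(t \sum_{i=1}^{n} \sigma_i a_i\Big)\Big] \\
&\le \sum_{a \in A} E_\sigma\Big[\exp\Big(t \sum_{i=1}^{n} \sigma_i a_i\Big)\Big] \\
&= \sum_{a \in A} E_\sigma\Big[\prod_{i=1}^{n} \exp(t \sigma_i a_i)\Big] \\
&= \sum_{a \in A} \prod_{i=1}^{n} E_{\sigma_i}\big[\exp(t \sigma_i a_i)\big] \\
&= \sum_{a \in A} \prod_{i=1}^{n} \frac{\exp(t a_i) + \exp(-t a_i)}{2} \\
&\le \sum_{a \in A} \prod_{i=1}^{n} \exp\Big(\tfrac{1}{2} t^2 a_i^2\Big) && (4.b) \\
&= \sum_{a \in A} \exp\Big(\tfrac{1}{2} t^2 \|a\|_2^2\Big) \\
&\le |A| \exp\Big(\tfrac{1}{2} t^2 \sup_{a \in A} \|a\|_2^2\Big)
\end{aligned}
$$

where the step in eq.(4.a) follows from Jensen's inequality. The step in eq.(4.b) follows since for all $z \in \mathbb{R}$ we have that $(e^z + e^{-z})/2 \le e^{z^2/2}$. By taking logarithms and dividing by $t$ on both sides of the above, we have:

$$E_\sigma\Big[\sup_{a \in A} \sum_{i=1}^{n} \sigma_i a_i\Big] \le \frac{\log |A|}{t} + \frac{1}{2}\, t \sup_{a \in A} \|a\|_2^2$$

In order to minimize the function $f(t) = \frac{\log |A|}{t} + \frac{1}{2}\, t \sup_{a \in A} \|a\|_2^2$, we make the derivative equal to zero and solve for $t$. That is:

$$0 = \partial f(t) / \partial t = -\frac{\log |A|}{t^2} + \frac{1}{2} \sup_{a \in A} \|a\|_2^2$$

Thus, $t = \sqrt{2 \log |A|}\, / \sup_{a \in A} \|a\|_2$. Plugging this back in the above, we prove our claim.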
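For small $n$, the left-hand side of Lemma 7.2 can be computed exactly by enumerating all $2^n$ sign vectors, which gives a quick sanity check of the bound. The following is a minimal Python sketch, not part of the original notes; the names `expected_sup` and `massart_bound` and the randomly generated set `A` are assumptions made for illustration.

```python
import itertools
import math
import random


def expected_sup(A):
    """Exact value of E_sigma[ sup_{a in A} sum_i sigma_i a_i ].

    `A` is a non-empty list of equal-length tuples. The expectation is
    computed by enumerating all 2^n Rademacher sign vectors, so this is
    only practical for small n.
    """
    n = len(A[0])
    total = 0.0
    for sigma in itertools.product((-1, 1), repeat=n):
        total += max(sum(s * x for s, x in zip(sigma, a)) for a in A)
    return total / 2 ** n


def massart_bound(A):
    """Right-hand side of Lemma 7.2: sqrt(2 log|A|) * sup_a ||a||_2."""
    radius = max(math.sqrt(sum(x * x for x in a)) for a in A)
    return math.sqrt(2.0 * math.log(len(A))) * radius


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed, arbitrary illustrative data
    A = [tuple(rng.uniform(-1.0, 1.0) for _ in range(8)) for _ in range(5)]
    print(expected_sup(A), massart_bound(A))
```

The enumeration cost grows as $2^n |A|$, so for larger $n$ one would presumably replace the exact expectation with a Monte Carlo average over sampled sign vectors.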

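Lemma 7.3. Let $F \subseteq \{f \mid f : Z \to \{0,1\}\}$ be a hypothesis class. The empirical Rademacher complexity (Definition 5.2) of the hypothesis class $F$ with respect to $n$ samples is bounded as follows:

$$\hat{R}_n(F) \le \sqrt{\frac{2 \log G(F, n)}{n}}$$

Proof.

$$
\begin{aligned}
\hat{R}_n(F) &= E_\sigma\Big[\sup_{h \in F} \frac{1}{n} \sum_{i=1}^{n} \sigma_i h(z_i)\Big] && (5.a) \\
&= \frac{1}{n} E_\sigma\Big[\sup_{a \in F(S)} \sum_{i=1}^{n} \sigma_i a_i\Big] \\
&\le \frac{1}{n} \sqrt{2 \log |F(S)|}\ \sup_{a \in F(S)} \|a\|_2 && (5.b) \\
&\le \frac{1}{n} \sqrt{2 \log G(F, n)}\ \sqrt{n} && (5.c) \\
&= \sqrt{\frac{2 \log G(F, n)}{n}}
\end{aligned}
$$

where the step in eq.(5.a) follows from Definition 5.2. The step in eq.(5.b) follows from the Massart lemma (Lemma 7.2). The step in eq.(5.c) follows since $(\forall S \in Z^n)\ |F(S)| \le G(F, n)$, and since for all $a \in \{0,1\}^n$ we have that $\|a\|_2 \le \sqrt{n}$.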
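As an illustration of Lemma 7.3, the bound can be compared against the exact empirical Rademacher complexity of the threshold class of Example 1, for which $G(F, n) = 2n$. The following is a minimal Python sketch, not part of the original notes; the function names are assumptions, and the code assumes $n$ distinct samples already sorted in increasing order, so that $F(S)$ consists of the "prefix" and "suffix" label vectors described in Example 1.

```python
import itertools
import math


def threshold_vectors(n):
    """F(S) for threshold functions on n distinct, sorted samples."""
    vectors = set()
    for k in range(n + 1):
        vectors.add((0,) * k + (1,) * (n - k))  # f(z) = 1[z > theta]
        vectors.add((1,) * k + (0,) * (n - k))  # f(z) = 1[z < theta]
    return vectors


def empirical_rademacher(vectors, n):
    """Exact E_sigma[ sup_{a in F(S)} (1/n) sum_i sigma_i a_i ] by enumeration."""
    total = 0.0
    for sigma in itertools.product((-1, 1), repeat=n):
        total += max(sum(s * a for s, a in zip(sigma, vec)) for vec in vectors)
    return total / (n * 2 ** n)


def growth_bound(growth, n):
    """Right-hand side of Lemma 7.3: sqrt(2 log G(F, n) / n)."""
    return math.sqrt(2.0 * math.log(growth) / n)


if __name__ == "__main__":
    for n in range(1, 9):
        exact = empirical_rademacher(threshold_vectors(n), n)
        print(n, exact, growth_bound(2 * n, n))
```

Since $\log(2n)/n$ tends to zero, the bound shrinks as $n$ grows, which is the behaviour one would hope for from a class with finite VC dimension.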