Lecture 1, October 18, 2016. Intro to Learning Theory. Ruth Urner.

1 Machine Learning and Learning Theory

Coming soon.

2 Formal Framework

2.1 Basic notions

In our formal model for machine learning, the instances to be classified are members of a set $X$, the domain set or feature space. Instances are to be classified into a label set $Y$. For now (and most of the class), we assume that the label set is binary, that is, $Y = \{0, 1\}$. For example, an instance $x \in X$ could be an email and its label indicates whether the email is spam ($y = 1$) or not spam ($y = 0$). We often assume that the instances are represented as real-valued vectors, that is, $X \subseteq \mathbb{R}^d$ for some dimension $d$.

A predictor or classifier is a function $h : X \to Y$. A learner is a function that takes some training data and maps it to a predictor. We let the training data be denoted by a sequence $S = ((X_1, Y_1), \ldots, (X_n, Y_n))$. Then, formally, a learner $A$ is a function

$A : \bigcup_{i=1}^{\infty} (X \times Y)^i \to Y^X, \qquad A : S \mapsto h,$

where $Y^X$ denotes the set of all functions from set $X$ to set $Y$. For convenience, when the learner is clear from context, we use the notation $h_n$ to denote the output of the learner on data of size $n$, that is, $h_n = A(S)$ for $|S| = n$.

The goal of learning is to produce a predictor $h$ that correctly classifies not only the training data, but also future instances that it has not seen yet. We thus need a mathematical description of how the environment produces instances. In particular, we would like to model that the environment (or nature) remains somehow stable, that is, that the process that generated the training data is the same that will generate future data. We model the data generation as a probability distribution $P$ over $X \times Y = X \times \{0, 1\}$. We further assume that the instances $(X_i, Y_i)$ are i.i.d. (independently and identically distributed) according to $P$.
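As a concrete companion to these definitions, the following Python sketch models a distribution $P$ over $X \times \{0, 1\}$, i.i.d. sampling, and a learner as a function from samples to predictors. The toy distribution, the function names, and the (deliberately naive) majority-vote learner are all illustrative assumptions of this sketch, not part of the notes.

```python
import random

# A sample S is a list of (x, y) pairs drawn i.i.d. from a toy
# data-generating distribution P over X x {0, 1} (an assumption of
# this sketch): x is uniform on [0, 1], and the label is 1 iff x >= 0.5.
def draw_sample(n, seed=0):
    rng = random.Random(seed)
    return [(x, int(x >= 0.5)) for x in (rng.random() for _ in range(n))]

# A learner A : S -> h maps training data to a predictor h : X -> Y.
# This one ignores the structure of S entirely and always predicts
# the majority label of the sample -- a legal learner, just a bad one.
def constant_learner(sample):
    majority = int(sum(y for _, y in sample) >= len(sample) / 2)
    return lambda x: majority

S = draw_sample(10)           # training data of size n = 10
h = constant_learner(S)       # h_n = A(S), a function from X to {0, 1}
print(h(0.3), h(0.9))
```

The point of the sketch is only the type signature: a learner consumes a finite sequence of labeled examples and emits a function defined on all of $X$.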
The performance of a classifier $h$ on an instance $(X, Y)$ is measured by a loss function. A loss function is a function $\ell : Y^X \times X \times Y \to \mathbb{R}$. The value $\ell(h, X, Y) \in \mathbb{R}$ indicates how badly $h$ predicts on example $(X, Y)$. We will, for now, work with the binary loss (or 0/1-loss), defined as

$\ell(h, X, Y) = \mathbb{1}[h(X) \neq Y],$

where $\mathbb{1}[p]$ denotes the indicator function of predicate $p$, that is, $\mathbb{1}[p] = 1$ if $p$ is true and $\mathbb{1}[p] = 0$ if $p$ is false. The binary loss is 1 if the prediction of $h$ on example $(X, Y)$ is wrong. If the prediction is correct, no loss is suffered and the binary loss assigns value 0.

We can now formally phrase the goal of learning as aiming for a classifier that has low loss in expectation over the data generating distribution. That is, we would like to output a classifier that has low expected loss, or risk, defined as

$L(h) = \mathbb{E}_{(X,Y) \sim P}[\ell(h, X, Y)] = \mathbb{E}_{(X,Y) \sim P}[\mathbb{1}[h(X) \neq Y]].$

Since our loss function assumes only values in $\{0, 1\}$, the above expectation is equal to the probability of generating an example $(X, Y)$ on which $h$ makes a wrong prediction. That is, we have

$L(h) = \mathbb{E}_{(X,Y) \sim P}[\mathbb{1}[h(X) \neq Y]] = \mathbb{P}_{(X,Y) \sim P}[h(X) \neq Y].$

Note, however, that the learner does not get to see the data generating distribution. It can thus not merely output a classifier of lowest expected loss. The learner needs to make its decisions based on the data $S$. Given a classifier $h$ and data $S$, the learner can evaluate the empirical risk of $h$ on $S$:

$L_n(h) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}[h(X_i) \neq Y_i].$

2.2 On the relation of empirical and true risk

A natural strategy for the learner would be to simply output a function that has small empirical risk. In favor of this approach, we now show that the empirical risk is an unbiased estimator of the true risk.

Claim 1. For all functions $h : X \to \{0, 1\}$ and for all sample sizes $n$, we have $\mathbb{E}_S[L_n(h)] = L(h)$.
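The 0/1-loss and the empirical risk $L_n(h)$ translate directly into code. The following is a minimal sketch; the particular threshold classifier and the four-point sample are made up for illustration.

```python
# 0/1-loss: 1 if h misclassifies (x, y), else 0.
def zero_one_loss(h, x, y):
    return int(h(x) != y)

# Empirical risk: L_n(h) = (1/n) * sum_i 1[h(x_i) != y_i].
def empirical_risk(h, sample):
    return sum(zero_one_loss(h, x, y) for x, y in sample) / len(sample)

# An illustrative threshold classifier and a tiny sample:
h = lambda x: int(x >= 0.5)
S = [(0.1, 0), (0.4, 1), (0.6, 1), (0.9, 1)]
print(empirical_risk(h, S))  # prints 0.25: one mistake (at x = 0.4) out of four
```

Note that the true risk $L(h)$ cannot be computed this way: it is an expectation over the unknown distribution $P$, whereas $L_n(h)$ only needs the sample.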
Proof.

$\mathbb{E}_S[L_n(h)] = \mathbb{E}_S\!\left[\frac{1}{n} \sum_{i=1}^{n} \mathbb{1}[h(X_i) \neq Y_i]\right] = \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}_S[\mathbb{1}[h(X_i) \neq Y_i]] = \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}_{(X,Y) \sim P}[\mathbb{1}[h(X) \neq Y]] = \frac{1}{n} \sum_{i=1}^{n} L(h) = L(h),$

where the second equality holds by linearity of expectation, and the third equality holds since the expectation depends only on one (the $i$-th) example in $S$.

Thus, for any fixed function, the empirical risk gives us an unbiased estimate of the quantity that we are after, the true risk. Note that this holds even for small sample sizes. Moreover, by the law of large numbers, the above claim implies that, with large sample sizes, the empirical risk of a classifier converges to its true risk (in probability). As we see more and more data, the empirical risk of a function becomes a better and better estimate of its true risk.

This may lead us to believe that the simple learning strategy of just finding some function with low empirical risk should succeed at achieving low true risk as we see more and more data. However, the following phenomenon shows that this strategy can in fact go wrong arbitrarily badly.

Claim 2. There exists a distribution $P$ and a learner, such that for all $n$ we have $L_n(h_n) = 0$ and $L(h_n) = 1$.

Proof. As the data generating distribution, consider the uniform distribution over $[0,1] \times \{1\}$. That is, in any sample $S$ generated by this $P$, the examples are labeled with 1, that is, $S = ((X_1, 1), \ldots, (X_n, 1))$. We construct a "stubborn" learner $A$. The stubborn learner outputs a function that agrees with the sample's labels on points that were in the sample $S$, but keeps believing that the label is 0 everywhere else. Formally:

$h_n(X) = A(S)(X) = \begin{cases} 1 & \text{if } (X, 1) \in S \\ 0 & \text{otherwise.} \end{cases}$

Now we clearly have $L_n(h_n) = 0$ for all $n$. However, since $S$ is finite, the set of instances $X$ on which $h_n$ predicts 1 has measure 0. Thus, with probability 1, $h_n$ outputs the incorrect label 0 on a fresh example. Thus $L(h_n) = 1$.

The difference between the situations in the above two claims is that, in the second case, the function $h_n$ depends on the data. While, for every fixed function $h$ (fixed before the data is seen), the empirical risk estimates converge to the true risk of this function,
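The stubborn learner of Claim 2 can be simulated directly. This sketch (toy code, not from the notes) samples from a distribution whose labels are all 1, memorizes the sample, and predicts 0 everywhere else: the empirical risk is exactly 0, while on a fresh point the prediction is (almost surely) the wrong label 0.

```python
import random

# The "stubborn" learner: agree with the sample on points it has seen,
# predict 0 everywhere else.
def stubborn_learner(sample):
    seen = {x for x, y in sample if y == 1}
    return lambda x: int(x in seen)

rng = random.Random(0)
S = [(rng.random(), 1) for _ in range(100)]   # every example is labeled 1
h_n = stubborn_learner(S)

emp_risk = sum(h_n(x) != y for x, y in S) / len(S)
print(emp_risk)        # 0.0: perfect fit on the training data
print(h_n(0.123456))   # 0: a fresh point is unseen, so h_n stubbornly says 0
```

The simulation mirrors the proof: memorization drives $L_n(h_n)$ to 0 without moving $L(h_n)$ away from 1.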
this convergence is not uniform over all functions. Claim 2 shows that, at any given sample size, there exist functions for which true and empirical risk are arbitrarily far apart.

Now, in machine learning, we do want the function that the learner outputs to be able to depend on the data. Furthermore, the learner only ever gets to see a finite amount of data. We have seen that, for any finite sample size, that is, on any finite amount of data, the empirical risk can be a very bad indicator of the true risk of a function. Basic questions of learning theory thus are: How can we control the (true) risk of a function learned based on a finite amount of data? Can we identify situations where we can relate the true and empirical risk?

2.3 Fixing a hypothesis space

We have seen that, if we want our learned function $h$ to depend on the data, we have to change the rules for the learner. In Claim 2, we let the learner output any function it wanted. This resulted in the learner adapting itself very well to the data it has seen in the sample, achieving 0 empirical risk, while not making any progress towards predicting well on unseen examples. The construction of Claim 2 is an extreme version of a phenomenon called overfitting. In informal terms, if a learning method has too much freedom with regard to the functions it can output, it may overadapt to the training data rather than extract structure that will also apply to unseen examples. Overfitting is a frequently encountered phenomenon in practice that one has to guard against.

To prevent the learner from overfitting, we need to restrict the class of predictors. A hypothesis class $H$ is a set of predictors, $H \subseteq \{0, 1\}^X$. Instead of allowing the learner to output any function, we will now consider learners that output functions from $H$. We will see that, in many cases, fixing the hypothesis class before we see the data will let us regain control over the relation between empirical and true risk. However, fixing the hypothesis class also means that there may not be any good function in the class. We will thus rephrase the goal of learning to only require the learner to come up with
a function that is (approximately) as good as the best function in the class $H$. Thus, our new goal is to show that

$L(h_n) \leq \inf_{h \in H} L(h) + f(n),$

where $f$ is a decreasing function of the sample size $n$. That is, as we see more and more data, we would like the true risk of the output of the learner to approach the best risk achievable with the class $H$. Equivalently, we would like to show that

$L(h_n) - \inf_{h \in H} L(h) \leq f(n).$

2.4 Learnability of finite classes

We now show that the above goal is achievable for finite classes $H = \{h_1, \ldots, h_N\}$. We will analyze the learner ERM (Empirical Risk Minimization), which outputs a function from $H$ that has minimal empirical risk:

$\mathrm{ERM} : S \mapsto \hat{h} \in \operatorname{argmin}_{h_i \in H} L_n(h_i).$
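A minimal ERM learner over a finite class can be sketched as follows. The class of threshold functions and the four-point sample are illustrative assumptions of this sketch, chosen so that the data is realizable by the class.

```python
# Empirical risk L_n(h) of a predictor h on a sample of (x, y) pairs.
def empirical_risk(h, sample):
    return sum(h(x) != y for x, y in sample) / len(sample)

# ERM over a finite class: return some hypothesis of minimal empirical risk.
# (Ties are broken arbitrarily; any minimizer is a valid ERM output.)
def erm(hypotheses, sample):
    return min(hypotheses, key=lambda h: empirical_risk(h, sample))

# An illustrative finite class: N = 11 threshold classifiers on [0, 1].
def threshold(t):
    return lambda x: int(x >= t)

H = [threshold(t / 10) for t in range(11)]
S = [(0.2, 0), (0.3, 0), (0.6, 1), (0.8, 1)]   # realizable: any t in (0.3, 0.6] fits
h_hat = erm(H, S)
print(empirical_risk(h_hat, S))  # prints 0.0 on this realizable sample
```

Since $H$ is finite, the minimum is always attained, matching the remark below that the argmin is a nonempty subset of $H$.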
There may be several functions in $H$ that have lowest empirical risk on a data set $S$. But since every function $h \in H$ has some empirical risk on data $S$, and the empirical risk can only assume finitely many values (namely multiples of $\frac{1}{|S|} = \frac{1}{n}$), the argmin is a nonempty subset of $H$. A learner is an ERM learner if it always outputs some function from this subset.

For now, we will further make a simplifying assumption on the data generating distribution $P$. We will assume that $P$ is realizable with respect to the class $H$. A distribution is realizable with respect to a hypothesis class $H$ if there is an $h^* \in H$ with $L(h^*) = 0$.

Theorem 1. Let $H = \{h_1, \ldots, h_N\}$ and $\delta \in (0, 1]$. Under the realizability assumption, we have, with probability at least $(1 - \delta)$ over the generation of the sample $S$,

$L(\hat{h}) \leq \frac{\log N + \log(1/\delta)}{n}.$

Proof. Note that $L_n(h^*) = 0$ for all possible samples $S$. Thus, for any $\epsilon > 0$, ERM only outputs a function with error larger than $\epsilon$ if $L_n(\hat{h}) = 0$ while $L(\hat{h}) \geq \epsilon$. For every $h \in H$ with $L(h) > \epsilon$, we have

$\mathbb{P}_S[L_n(h) = 0] \leq (1 - \epsilon)^n \leq e^{-\epsilon n}.$

(Recall that, for all $x \in \mathbb{R}$, we have $(1 + x) \leq e^x$.) Let $H_\epsilon$ denote the set of functions $h$ in $H$ with $L(h) > \epsilon$. We get, using the union bound,

$\mathbb{P}_S[L(\hat{h}) \geq \epsilon] \leq \mathbb{P}_S[\exists h \in H_\epsilon : L_n(h) = 0] = \mathbb{P}_S\!\left[\bigcup_{h \in H_\epsilon} \{L_n(h) = 0\}\right] \leq |H_\epsilon| (1 - \epsilon)^n \leq |H| (1 - \epsilon)^n \leq |H| e^{-\epsilon n} = N e^{-\epsilon n}.$

Now we set $\epsilon = \frac{\log N + \log(1/\delta)}{n}$. Plugging in this value for $\epsilon$, we have shown

$\mathbb{P}_S\!\left[L(\hat{h}) \geq \frac{\log N + \log(1/\delta)}{n}\right] \leq N e^{-\epsilon n} = \delta,$

which is equivalent to the statement of the theorem.

Thus, under realizability, we have $L(h_n) - \min_{h \in H} L(h) \leq f(n)$ (with probability at least $1 - \delta$) for $f(n) = \frac{\log N + \log(1/\delta)}{n}$.
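To get a feel for the guarantee of Theorem 1, one can evaluate the bound $(\log N + \log(1/\delta))/n$ numerically. The function name and the particular values of $N$, $\delta$, and $n$ below are illustrative choices.

```python
import math

# Theorem 1 bound: with probability >= 1 - delta over the sample,
# L(h_hat) <= (log N + log(1/delta)) / n, under realizability.
def erm_risk_bound(N, n, delta):
    return (math.log(N) + math.log(1 / delta)) / n

# With N = 1000 hypotheses and delta = 0.01, the bound shrinks like 1/n,
# while the class size N only enters logarithmically:
for n in (100, 1000, 10000):
    print(n, round(erm_risk_bound(1000, n, 0.01), 4))
```

Two features of the bound are worth noting: the $1/n$ decay means each factor-of-10 increase in sample size shrinks the guaranteed error by a factor of 10, and the logarithmic dependence on $N$ means even very large finite classes remain learnable.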