1 Review and Overview


DRAFT: a final version will be posted shortly

CS229T/STATS231: Statistical Learning Theory
Lecturer: Tengyu Ma
Lecture #3
Scribe: Mingda Qiao
October 1

1 Review and Overview

In the first half of this course, the central question that we want to answer is: why does minimizing the training error often lead to a small testing error?

In the last lecture, we proved the asymptotics of the maximum likelihood estimator (MLE). In particular, as the number of training examples, denoted by $n$, tends to infinity,

$$L(\hat\theta) - L(\theta^*) \le \frac{p}{2n} + o\left(\frac{1}{n}\right).$$

Here $L$ is the expected loss, $\hat\theta$ is the minimizer of the training loss, and $\theta^*$ is the ground truth parameter. This result partially explains why the MLE, which has the smallest training loss, is also likely to achieve a small testing error when there are enough training examples.

One limitation of the above result is that it requires well-specifiedness, i.e., the data are distributed precisely according to a particular ground truth parameter $\theta^*$ in the parameter space. We would like to prove a more general result of the following form without assuming well-specifiedness.¹

$$L(\hat\theta) - L(\theta^*) \le f(p, n), \quad \forall p, n \ge 1.$$

Another limitation of this asymptotic result is that it ignores the dependence of higher order terms on other hyperparameters (in this case, the dimension $p$). Consider the following two functions, both of which are of order $\frac{p}{2n} + o\left(\frac{1}{n}\right)$:

$$\frac{p}{2n} + \frac{1}{n^2} \quad \text{vs.} \quad \frac{p}{2n} + \frac{p^{100}}{n^2}.$$

Arguably, the first one is a better upper bound, since the higher order term of the second bound falls below 1 only when there are $n > p^{50}$ training examples.

From now on, we restrict our attention to the non-asymptotic regime, where $n$ is finite. In this lecture, we preview the form of results that we would like to prove in the following few lectures. Then we introduce the uniform convergence framework, which we will instantiate in various settings to prove generalization bounds. We end this lecture by proving uniform convergence on finite hypothesis classes.

¹ In this case, $\theta^*$ needs to be redefined as the minimizer of the expected loss.

2 Notations

In the non-asymptotic setting, we only ignore absolute constants (also known as universal constants). Specifically, whenever the notation $O(X)$ appears in a statement, it means that there exists a universal constant $c$ such that the statement holds if we replace $O(X)$ by $cX$. Similarly, we write $A \lesssim B$ as a shorthand for $A \le O(B)$.

In the following, we review a few central notations from previous lectures.

Hypothesis space: $\mathcal{H}$ is a family of hypotheses, i.e., prediction functions.

Loss function: $\ell : (\mathcal{X} \times \mathcal{Y}) \times \mathcal{H} \to \mathbb{R}$. This is analogous to the notation $\ell((x, y), \theta)$ for loss functions in the last lecture, yet it accommodates the general case where we cannot naturally parametrize $\mathcal{H}$ by a continuum of parameters.

Expected loss: $L(h) = \mathbb{E}_{(x,y) \sim P}\left[\ell((x, y), h)\right]$, where $P$ is a data distribution over $\mathcal{X} \times \mathcal{Y}$. Moreover, we define $h^* \in \operatorname{argmin}_{h \in \mathcal{H}} L(h)$ as the minimizer of the expected loss.

Training loss (also known as empirical risk): $\hat{L}(h) = \frac{1}{n} \sum_{i=1}^{n} \ell\left((x^{(i)}, y^{(i)}), h\right)$, where $(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \ldots, (x^{(n)}, y^{(n)})$ are $n$ training examples drawn i.i.d. from $P$.

Empirical risk minimizer (ERM): $\hat{h} \in \operatorname{argmin}_{h \in \mathcal{H}} \hat{L}(h)$.

Remark 1. For a fixed data distribution $P$, there is no randomness in $h^*$, since $h^*$ is just the minimizer of a deterministic function $L(h)$. On the other hand, $\hat{h}$ is indeed a random variable, as the training loss function $\hat{L}$ is defined based on the training examples.

The original notes depict the dependence between some of the above concepts in a diagram (not reproduced here).

3 Objective

Our goal is to prove an upper bound on $L(\hat{h}) - L(h^*)$, the difference between the expected loss of the ERM $\hat{h}$ and that of the optimal hypothesis $h^*$. In particular, the results that we are going to prove in this and the following lectures are of the following form:

$$\Pr\left[L(\hat{h}) - L(h^*) > \epsilon\right] \le \delta.$$

In words, with probability at least $1 - \delta$, it holds that $L(\hat{h}) - L(h^*) \le \epsilon$.² We can interpret $L(\hat{h}) - L(h^*) > \epsilon$ as a failure event, since it means that the ERM $\hat{h}$ has an excess risk greater than the parameter $\epsilon$, which is undesirable. Therefore, we would like the probability of this failure event to be as small as possible. Moreover, the smaller $\epsilon$ is, the stricter we are when evaluating the performance of $\hat{h}$. Thus, we would like the parameters $\epsilon$ and $\delta$ to be as small as possible, given a fixed number of training examples.³

Recall the definition of the ERM $\hat{h}$. We have

$$\hat{L}(\hat{h}) = \min_{h \in \mathcal{H}} \hat{L}(h) \le \hat{L}(h^*).$$
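The definitions above can be made concrete on a toy problem. The sketch below is an illustration only and is not part of the original notes: the threshold hypothesis class, the uniform data distribution, the label-noise rate, and all function names are assumptions chosen so that the expected loss has a closed form. It computes $h^*$, the ERM $\hat{h}$, and their training and expected losses.

```python
import random

TRUE_THRESHOLD = 0.5  # hypothetical ground-truth decision boundary
NOISE = 0.1           # hypothetical probability that a label is flipped


def sample(n, rng):
    """Draw n i.i.d. examples: x ~ Uniform[0, 1], y = 1[x >= 0.5], flipped w.p. NOISE."""
    examples = []
    for _ in range(n):
        x = rng.random()
        y = 1 if x >= TRUE_THRESHOLD else 0
        if rng.random() < NOISE:
            y = 1 - y
        examples.append((x, y))
    return examples


def loss(example, t):
    """0-1 loss of the threshold hypothesis h_t(x) = 1[x >= t] on one example."""
    x, y = example
    prediction = 1 if x >= t else 0
    return 0 if prediction == y else 1


def training_loss(examples, t):
    """Empirical risk: average loss of h_t over the training examples."""
    return sum(loss(e, t) for e in examples) / len(examples)


def expected_loss(t):
    """Closed-form expected loss of h_t under the toy distribution above."""
    return NOISE + (1 - 2 * NOISE) * abs(t - TRUE_THRESHOLD)


hypotheses = [k / 10 for k in range(11)]  # finite class H of thresholds
examples = sample(50, random.Random(0))

# h* minimizes the deterministic expected loss; it does not depend on the sample.
h_star = min(hypotheses, key=expected_loss)
# The ERM minimizes the training loss, so it changes with the sample.
h_hat = min(hypotheses, key=lambda t: training_loss(examples, t))

print("h* =", h_star, "training:", training_loss(examples, h_star),
      "expected:", expected_loss(h_star))
print("ERM =", h_hat, "training:", training_loss(examples, h_hat),
      "expected:", expected_loss(h_hat))
```

Whatever sample is drawn, the ERM's training loss is at most that of $h^*$ and its expected loss is at least that of $h^*$; the gap between the two expected losses is the excess risk that the following sections bound.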
In order to prove a high-probability bound of the form $L(\hat{h}) \le L(h^*) + \text{extra term}$, it remains to argue that $\hat{L}(\hat{h}) \approx L(\hat{h})$ and $\hat{L}(h^*) \approx L(h^*)$ (up to a small additive error) with high probability.

Proving $\hat{L}(h^*) \approx L(h^*)$ is relatively simple: $\hat{L}(h^*) = \frac{1}{n} \sum_{i=1}^{n} \ell((x^{(i)}, y^{(i)}), h^*)$ is the average of $n$ i.i.d. random variables, each with expectation $L(h^*)$, so we can prove a bound of the following form by applying standard concentration inequalities:

$$\Pr\left[\left|\hat{L}(h^*) - L(h^*)\right| \le \epsilon\right] \ge 1 - \delta.$$

In fact, this argument holds for any fixed hypothesis $h \in \mathcal{H}$, as long as $h$ does not depend on the training examples.

The difficulty is in proving $\hat{L}(\hat{h}) \approx L(\hat{h})$, since $\hat{h}$ is indeed a random variable that depends on the training examples (see Remark 1). Instead of a concentration bound for a single fixed hypothesis $h$, we need a stronger concentration property that holds for every hypothesis in $\mathcal{H}$ simultaneously. This is where the notion of uniform convergence comes into play.

² Here and in the following, the probability is always taken over the randomness in the training examples unless otherwise specified.

³ As a general rule of thumb, we can make $\delta$ inverse polynomially small without using too many training examples, while the cost of minimizing $\epsilon$ is generally higher.

4 Uniform Convergence

Uniform convergence is a property of the hypothesis class $\mathcal{H}$ of the following form:

$$\Pr\left[\forall h \in \mathcal{H},\ \left|\hat{L}(h) - L(h)\right| \le \epsilon\right] \ge 1 - \delta. \tag{1}$$

In words, it states that with probability at least $1 - \delta$ over the random draw of training data, the training loss is pointwise close to the expected loss, up to an additive error of at most $\epsilon$.

In general, if we parametrize the hypothesis space by $\mathbb{R}$, we would expect the picture of training loss and expected loss to be as in Figure 1(a) if uniform convergence holds. It turns out that for particular learning tasks, the training loss exhibits a nicer landscape: it is not only pointwise close to the expected loss but also of the same shape, which is informally depicted in Figure 1(b). (See [GLM16, MBM18] for some of the recent work along this line of research.) Nevertheless, as this lecture is concerned only with uniform convergence, we do not distinguish these two different landscapes.

Figure 1: Two different empirical risk landscapes, panels (a) and (b). The blue line and the red line denote the expected and training losses, respectively. The dashed green lines denote the expected loss $\pm \epsilon$.
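To see why a concentration bound for a fixed hypothesis does not automatically carry over to the ERM, the following simulation may help. It is a sketch under a strong simplifying assumption that is not in the notes: every hypothesis has the same expected loss `q`, and its 0-1 losses on the training examples are independent Bernoulli(`q`) draws. The function name `one_trial` and all parameter values are hypothetical.

```python
import random


def one_trial(n, num_hypotheses, rng, q=0.5):
    """Simulate one training set and return three deviations |training - expected|.

    Assumption for illustration: each hypothesis has expected loss q, and its
    0-1 losses on the n examples are independent Bernoulli(q) draws.
    """
    train = [sum(rng.random() < q for _ in range(n)) / n
             for _ in range(num_hypotheses)]
    fixed_dev = abs(train[0] - q)              # hypothesis chosen before seeing data
    erm_dev = abs(min(train) - q)              # ERM: chosen after seeing data
    sup_dev = max(abs(t - q) for t in train)   # worst case over the whole class
    return fixed_dev, erm_dev, sup_dev


rng = random.Random(0)
trials = [one_trial(n=100, num_hypotheses=50, rng=rng) for _ in range(200)]

avg_fixed = sum(t[0] for t in trials) / len(trials)
avg_erm = sum(t[1] for t in trials) / len(trials)
avg_sup = sum(t[2] for t in trials) / len(trials)

print("average deviation, fixed hypothesis:", avg_fixed)
print("average deviation, ERM:             ", avg_erm)
print("average deviation, sup over class:  ", avg_sup)
```

Because the ERM is selected for having the smallest training loss, its training loss tends to underestimate its expected loss by more than a fixed hypothesis's does. In every trial, however, the ERM's deviation is at most the supremum over the class, which is why controlling that supremum is enough.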

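4.1 Uniform Convergence Implies Generalization

Before proving uniform convergence for specific hypothesis classes, we first demonstrate how it implies generalization, i.e., an upper bound on $L(\hat{h}) - L(h^*)$. We can write $L(\hat{h}) - L(h^*)$ as

$$\begin{aligned}
L(\hat{h}) - L(h^*) &= L(\hat{h}) - \hat{L}(\hat{h}) + \hat{L}(\hat{h}) - \hat{L}(h^*) + \hat{L}(h^*) - L(h^*) \\
&\le \left|L(\hat{h}) - \hat{L}(\hat{h})\right| + \left|\hat{L}(h^*) - L(h^*)\right| \\
&\le 2 \sup_{h \in \mathcal{H}} \left|\hat{L}(h) - L(h)\right|.
\end{aligned} \tag{2}$$

Here the second step applies the definition of $\hat{h}$, which gives $\hat{L}(\hat{h}) - \hat{L}(h^*) \le 0$, and the third step follows directly from the definition of the supremum. By Equation (2), we have the following implication:

$$\forall h \in \mathcal{H},\ \left|\hat{L}(h) - L(h)\right| \le \epsilon \implies \sup_{h \in \mathcal{H}} \left|\hat{L}(h) - L(h)\right| \le \epsilon \implies L(\hat{h}) - L(h^*) \le 2\epsilon. \tag{3}$$

Therefore, if we could prove uniform convergence (1) for the hypothesis class $\mathcal{H}$, we would have

$$\Pr\left[L(\hat{h}) - L(h^*) \le 2\epsilon\right] \ge \Pr\left[\forall h \in \mathcal{H},\ \left|\hat{L}(h) - L(h)\right| \le \epsilon\right] \ge 1 - \delta,$$

which is the generalization bound that we desire.

4.2 Finite Hypothesis Classes

Now we prove that uniform convergence indeed holds for finite hypothesis classes. Recall that $n$ is the number of examples drawn i.i.d. from the data distribution, and the probability is always taken over the randomness in the training examples.

Theorem 2. If $\mathcal{H}$ is finite and $\ell((x, y), h) \in [0, 1]$, we have the following statements:

(1) For any fixed $h \in \mathcal{H}$ and $\epsilon > 0$,
$$\Pr\left[\left|\hat{L}(h) - L(h)\right| \le \epsilon\right] \ge 1 - 2e^{-2n\epsilon^2}.$$

(2) For any $\epsilon > 0$,
$$\Pr\left[\forall h \in \mathcal{H},\ \left|\hat{L}(h) - L(h)\right| \le \epsilon\right] \ge 1 - 2|\mathcal{H}|e^{-2n\epsilon^2}.$$

(3) With probability at least $1 - \delta$, for any $h \in \mathcal{H}$,
$$\left|\hat{L}(h) - L(h)\right| \le \sqrt{\frac{\ln|\mathcal{H}| + \ln\frac{2}{\delta}}{2n}}.$$

(4) With probability at least $1 - \delta$,
$$L(\hat{h}) - L(h^*) \le 2\sqrt{\frac{\ln|\mathcal{H}| + \ln\frac{2}{\delta}}{2n}}.$$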
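To get a feel for the magnitudes in Theorem 2, the bounds can be evaluated numerically. The helper functions below are a minimal sketch; their names and the example values of $|\mathcal{H}|$, $n$, and $\delta$ are hypothetical and not taken from the notes. The functions simply evaluate the expressions in Statements (2) to (4) for losses in $[0, 1]$.

```python
import math


def failure_probability(num_hypotheses, n, epsilon):
    """Statement (2): upper bound on Pr[some h has |training - expected| > epsilon]."""
    return 2 * num_hypotheses * math.exp(-2 * n * epsilon ** 2)


def uniform_deviation_bound(num_hypotheses, n, delta):
    """Statement (3): epsilon such that all h deviate by at most epsilon w.p. >= 1 - delta."""
    return math.sqrt((math.log(num_hypotheses) + math.log(2 / delta)) / (2 * n))


def excess_risk_bound(num_hypotheses, n, delta):
    """Statement (4): bound on L(ERM) - L(h*) that holds w.p. >= 1 - delta."""
    return 2 * uniform_deviation_bound(num_hypotheses, n, delta)


# Hypothetical setting: 1000 hypotheses, 10000 examples, 95% confidence.
eps = uniform_deviation_bound(1000, 10_000, 0.05)
print("uniform deviation bound:", eps)
print("excess risk bound:      ", excess_risk_bound(1000, 10_000, 0.05))
# Plugging eps back into Statement (2) recovers delta, up to rounding.
print("failure probability:    ", failure_probability(1000, 10_000, eps))
```

In this hypothetical setting the uniform deviation bound is roughly 0.023. Note that the class size enters only through $\ln|\mathcal{H}|$, while quadrupling $n$ halves the bound.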

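The proof of the theorem relies on the following concentration inequality, which is a quantitative version of the central limit theorem.

Lemma 3 (Hoeffding's inequality). Let $X_1, X_2, \ldots, X_n$ be independent random variables such that $a_i \le X_i \le b_i$ almost surely for each $i$. Then, for any $\epsilon > 0$,

$$\Pr\left[\left|\frac{1}{n}\sum_{i=1}^{n} X_i - \mathbb{E}\left[\frac{1}{n}\sum_{i=1}^{n} X_i\right]\right| \le \epsilon\right] \ge 1 - 2\exp\left(-\frac{2n^2\epsilon^2}{\sum_{i=1}^{n}(b_i - a_i)^2}\right).$$

Now we are ready to prove Theorem 2. We first prove Statement (1) using Hoeffding's inequality, and then we show that each statement directly implies the next.

Proof of Theorem 2. Statement (1) follows from Hoeffding's inequality by taking $X_i = \ell((x^{(i)}, y^{(i)}), h)$, $a_i = 0$ and $b_i = 1$ for each $i$ in Lemma 3. Then, Statement (2) follows from a union bound:

$$\begin{aligned}
\Pr\left[\exists h \in \mathcal{H},\ \left|\hat{L}(h) - L(h)\right| > \epsilon\right] &\le \sum_{h \in \mathcal{H}} \Pr\left[\left|\hat{L}(h) - L(h)\right| > \epsilon\right] && \text{(union bound)} \\
&\le \sum_{h \in \mathcal{H}} 2e^{-2n\epsilon^2} = 2|\mathcal{H}|e^{-2n\epsilon^2}. && \text{(Statement (1))}
\end{aligned}$$

Statement (3) follows from plugging $\epsilon = \sqrt{\frac{\ln|\mathcal{H}| + \ln\frac{2}{\delta}}{2n}}$ into Statement (2). Finally, Statement (3) and the implication in Equation (3) imply Statement (4).

5 Digression: The PAC Learning Framework

Probably approximately correct (PAC) learning is a theoretical framework of machine learning proposed by Valiant [Val84]. One of the key definitions in PAC learning is the notion of PAC learning algorithms.

Definition 4 (PAC learning algorithm). Algorithm $A$ is a PAC learning algorithm for hypothesis class $\mathcal{H}$ if, for any distribution $P$ over $\mathcal{X} \times \mathcal{Y}$, $\epsilon > 0$ and $\delta \in (0, 1)$, the output $\hat{h} = A\left((x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \ldots, (x^{(n)}, y^{(n)})\right)$ satisfies

$$\Pr\left[L(\hat{h}) - L(h^*) > \epsilon\right] \le \delta,$$

and $A$ runs in polynomial time with respect to $\mathrm{size}(\mathcal{X})$, $\frac{1}{\epsilon}$ and $\frac{1}{\delta}$.

Remark 5. Informally, $\mathrm{size}(\mathcal{X})$ is the number of bits needed to describe an element of $\mathcal{X}$. For example, $\mathrm{size}(\mathcal{X}) = \log_2 |\mathcal{X}|$ if $\mathcal{X}$ is finite. In the general case where $\mathcal{X}$ is parametrized by real numbers, this definition requires $\mathcal{X}$ to be discretized first.

Remark 6. Definition 4 implicitly requires the number of training examples, $n$, to be polynomial in $\mathrm{size}(\mathcal{X})$, $\frac{1}{\epsilon}$ and $\frac{1}{\delta}$. Otherwise, $A$ does not have enough time to read its entire input.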
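As a rough sketch that is not part of the original notes, Statement (4) of Theorem 2 indicates how many examples suffice for the ERM over a finite class, with losses in $[0, 1]$, to meet the accuracy requirement of Definition 4. Requiring the right-hand side of Statement (4) to be at most $\epsilon$ and solving for $n$ gives:

```latex
\[
2\sqrt{\frac{\ln|\mathcal{H}| + \ln\frac{2}{\delta}}{2n}} \le \epsilon
\quad\Longleftrightarrow\quad
\frac{2\left(\ln|\mathcal{H}| + \ln\frac{2}{\delta}\right)}{n} \le \epsilon^2
\quad\Longleftrightarrow\quad
n \ge \frac{2\left(\ln|\mathcal{H}| + \ln\frac{2}{\delta}\right)}{\epsilon^2}.
\]
```

Under these assumptions, the sufficient sample size grows quadratically in $\frac{1}{\epsilon}$ but only logarithmically in $|\mathcal{H}|$ and $\frac{1}{\delta}$. This addresses only the number of examples; whether the ERM can be computed in polynomial time is a separate question that this calculation does not settle.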

One limitation of Valiant's framework is that it requires the algorithm to work on every data distribution $P$, which turns out to be overly ambitious and thus unrealistic. In contrast, recent research on learning theory usually requires certain assumptions on the data distribution, e.g., the distribution is Gaussian or realizable (i.e., the hypothesis class contains the function to be learned).

Thus, it is worth thinking about the role of assumptions in learning theory research. Suppose we could prove the following three theorems:

Theorem A. Statement P implies Statement Q.

Theorem B. Under certain assumptions, P implies Q.

Theorem C. Under certain assumptions, P implies a statement stronger than Q.

Theorem B is definitely the weakest among these three, yet Theorems A and C are incomparable. It is often a matter of taste whether one prefers A or C. In deep learning theory, however, Statement Q is often vacuous for most practical uses. In this case, we had better prove a result similar to Theorem C, and then find out how the assumptions can be weakened and eventually removed.

References

[GLM16] Rong Ge, Jason D. Lee, and Tengyu Ma. Matrix completion has no spurious local minimum. In Advances in Neural Information Processing Systems (NIPS), 2016.

[MBM18] Song Mei, Yu Bai, and Andrea Montanari. The landscape of empirical risk for nonconvex losses. The Annals of Statistics, 46(6A), 2018.

[Val84] Leslie G. Valiant. A theory of the learnable. Communications of the ACM (CACM), 27(11), 1984.
