18.657: Mathematics of Machine Learning



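Lecturer: Philippe Rigollet, Lecture 4
Scribe: Cheng Mao, Sep. 2015

In this lecture, we continue to discuss the effect of noise on the rate of the excess risk $\mathcal{E}(\hat h) = R(\hat h) - R(h^*)$, where $\hat h$ is the empirical risk minimizer. In the binary classification model, noise roughly means how close the regression function $\eta$ is to $1/2$. In particular, if $\eta = 1/2$ then we observe only noise, and if $\eta \in \{0,1\}$ we are in the noiseless case, which was studied last time. There we achieved the fast rate $\log M / n$ by assuming $h^* \in \mathcal{H}$, which implies that $\bar h = h^*$, where $\bar h$ denotes the minimizer of the risk $R$ over the dictionary $\mathcal{H} = \{h_1, \dots, h_M\}$. This assumption was essential for the proof, and we will see why it is necessary again in the following section.

3.2 Noise conditions

The noiseless assumption is rather unrealistic, so it is natural to ask what the rate of excess risk is when noise is present but can be controlled. Instead of the condition $\eta \in \{0,1\}$, we can control the noise by assuming that $\eta$ is uniformly bounded away from $1/2$, which motivates the following definition.

Definition (Massart's noise condition): The noise in binary classification is said to satisfy Massart's condition with constant $\gamma \in (0, 1/2]$ if $|\eta(X) - 1/2| \ge \gamma$ almost surely.

Once uniform boundedness is assumed, the fast rate follows from the last proof with an appropriate modification of constants.

Theorem: Let $\mathcal{E}(\hat h)$ denote the excess risk of the empirical risk minimizer $\hat h = \hat h^{\mathrm{erm}}$. If Massart's noise condition is satisfied with constant $\gamma$, then

$$\mathcal{E}(\hat h) \le \frac{\log(M/\delta)}{\gamma n}$$

with probability at least $1 - \delta$. (In particular, $\gamma = 1/2$ gives exactly the noiseless case.)

Proof. Define $Z_i(h) = \mathbb{1}(\bar h(X_i) \ne Y_i) - \mathbb{1}(h(X_i) \ne Y_i)$. By the assumption $\bar h = h^*$ and the definition of $\hat h = \hat h^{\mathrm{erm}}$,

$$
\begin{aligned}
\mathcal{E}(\hat h) &= R(\hat h) - R(\bar h) \\
&= \hat R_n(\hat h) - \hat R_n(\bar h) + \hat R_n(\bar h) - \hat R_n(\hat h) - \big( R(\bar h) - R(\hat h) \big) \qquad (3.1) \\
&\le \frac{1}{n} \sum_{i=1}^n \big( Z_i(\hat h) - \mathbb{E}[Z_i(\hat h)] \big), \qquad (3.2)
\end{aligned}
$$

where the inequality uses $\hat R_n(\hat h) \le \hat R_n(\bar h)$, which holds because $\hat h$ minimizes the empirical risk. Hence it suffices to bound the deviation of $\sum_i Z_i$ from its expectation. To this end, we hope to apply Bernstein's inequality. Since

$$\mathrm{Var}[Z_i(h)] \le \mathbb{E}[Z_i(h)^2] = \mathbb{P}[h(X_i) \ne \bar h(X_i)],$$

we have that for any $1 \le j \le M$,

$$\frac{1}{n} \sum_{i=1}^n \mathrm{Var}[Z_i(h_j)] \le \mathbb{P}[h_j(X) \ne \bar h(X)] =: \sigma_j^2.$$

Bernstein's inequality implies that

$$\mathbb{P}\Big[ \frac{1}{n} \sum_{i=1}^n \big( Z_i(h_j) - \mathbb{E}[Z_i(h_j)] \big) > t \Big] \le \exp\Big( - \frac{n t^2}{2\sigma_j^2 + \frac{2}{3} t} \Big) =: \frac{\delta}{M}.$$

Applying a union bound over $1 \le j \le M$ and taking

$$t = t_0(j) := \max\Big( \sqrt{\frac{2 \sigma_j^2 \log(M/\delta)}{n}}, \ \frac{2 \log(M/\delta)}{3n} \Big),$$

we get that

$$\frac{1}{n} \sum_{i=1}^n \big( Z_i(h_j) - \mathbb{E}[Z_i(h_j)] \big) \le t_0(j) \qquad (3.3)$$

for all $1 \le j \le M$ with probability at least $1 - \delta$. Suppose $\hat h = h_{\hat j}$. It follows from (3.2) and (3.3) that with probability at least $1 - \delta$,

$$\mathcal{E}(\hat h) \le t_0(\hat j).$$

(Note that so far the proof is exactly the same as in the noiseless case.) Since $|\eta(X) - 1/2| \ge \gamma$ a.s. and $\bar h = h^*$,

$$\mathcal{E}(\hat h) = \mathbb{E}\big[ |2\eta(X) - 1| \, \mathbb{1}(\hat h(X) \ne h^*(X)) \big] \ge 2\gamma \, \mathbb{P}[\hat h(X) \ne \bar h(X)] = 2\gamma \sigma_{\hat j}^2.$$

Therefore,

$$\mathcal{E}(\hat h) \le \max\Big( \sqrt{\frac{\mathcal{E}(\hat h) \log(M/\delta)}{\gamma n}}, \ \frac{2 \log(M/\delta)}{3n} \Big), \qquad (3.4)$$

so we conclude that with probability at least $1 - \delta$,

$$\mathcal{E}(\hat h) \le \frac{\log(M/\delta)}{\gamma n}.$$

The assumption that $\bar h = h^*$ was used twice in the proof. First, it enables us to ignore the approximation error and study only the stochastic error. More importantly, it makes the excess risk appear on the right-hand side of (3.4), so that we can rearrange the inequality to get the fast rate.

Massart's noise condition is still somewhat strong because it assumes that $\eta$ is uniformly bounded away from $1/2$. Instead, we can allow $\eta$ to be close to $1/2$, but only with small probability, and this is the content of the next definition.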
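The rearrangement step from (3.4) to the final bound can be checked numerically. The following is a minimal sketch, not part of the original notes: the values of `M`, `n`, `delta` and `gamma` are arbitrary illustrative choices, and the helper names are hypothetical. It scans candidate values of the excess risk and confirms that every value satisfying (3.4) is at most $\log(M/\delta)/(\gamma n)$.

```python
import math

def massart_bound(M, n, delta, gamma):
    """Right-hand side log(M/delta) / (gamma * n) of the theorem."""
    return math.log(M / delta) / (gamma * n)

def satisfies_3_4(excess, M, n, delta, gamma):
    """Check whether a candidate excess risk value satisfies inequality (3.4)."""
    L = math.log(M / delta)
    rhs = max(math.sqrt(excess * L / (gamma * n)), 2 * L / (3 * n))
    return excess <= rhs

# Illustrative (assumed) parameter values.
M, n, delta, gamma = 100, 1000, 0.05, 0.25
bound = massart_bound(M, n, delta, gamma)

# Candidate excess risk values on a fine grid in (0, 1].
grid = [k / 100000 for k in range(1, 100001)]
largest_feasible = max(e for e in grid if satisfies_3_4(e, M, n, delta, gamma))

print("theorem bound:", bound)
print("largest grid value satisfying (3.4):", largest_feasible)
```

The largest feasible grid value sits just below the bound because the first term of the maximum is active: $\mathcal{E} \le \sqrt{\mathcal{E} L / (\gamma n)}$ is equivalent to $\mathcal{E} \le L/(\gamma n)$ with $L = \log(M/\delta)$, and the second term is smaller since $\gamma \le 1/2$.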

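Definition (Tsybakov's noise condition or Mammen-Tsybakov noise condition): The noise in binary classification is said to satisfy Tsybakov's condition if there exist $\alpha \in (0,1)$, $C_0 > 0$ and $t_0 \in (0, 1/2]$ such that

$$\mathbb{P}\big[ |\eta(X) - 1/2| \le t \big] \le C_0 \, t^{\frac{\alpha}{1-\alpha}}$$

for all $t \in [0, t_0]$.

In particular, as $\alpha \to 1$, $t^{\frac{\alpha}{1-\alpha}} \to 0$, so this recovers Massart's condition with $\gamma = t_0$ and we have the fast rate. As $\alpha \to 0$, $t^{\frac{\alpha}{1-\alpha}} \to 1$, so the condition is void and we have the slow rate. In between, it is natural to expect a fast rate (meaning faster than the slow rate) whose order depends on $\alpha$. We will see that this is indeed the case.

Lemma: Under Tsybakov's noise condition with constants $\alpha$, $C_0$ and $t_0$, we have

$$\mathbb{P}[h(X) \ne h^*(X)] \le C \, \mathcal{E}(h)^{\alpha}$$

for any classifier $h$, where $C = C(\alpha, C_0, t_0)$ is a constant.

Proof. We have

$$
\begin{aligned}
\mathcal{E}(h) &= \mathbb{E}\big[ |2\eta(X) - 1| \, \mathbb{1}(h(X) \ne h^*(X)) \big] \\
&\ge \mathbb{E}\big[ |2\eta(X) - 1| \, \mathbb{1}(|\eta(X) - 1/2| > t) \, \mathbb{1}(h(X) \ne h^*(X)) \big] \\
&\ge 2t \, \mathbb{P}\big[ |\eta(X) - 1/2| > t, \ h(X) \ne h^*(X) \big] \\
&\ge 2t \, \mathbb{P}[h(X) \ne h^*(X)] - 2t \, \mathbb{P}\big[ |\eta(X) - 1/2| \le t \big] \\
&\ge 2t \, \mathbb{P}[h(X) \ne h^*(X)] - 2 C_0 \, t^{\frac{1}{1-\alpha}},
\end{aligned}
$$

where Tsybakov's condition was used in the last step. Take $t = c \, \mathbb{P}[h(X) \ne h^*(X)]^{\frac{1-\alpha}{\alpha}}$ for some positive $c = c(\alpha, C_0, t_0)$ to be chosen later. We assume that $c \le t_0$ to guarantee that $t \in [0, t_0]$. Since $\alpha \in (0,1)$,

$$
\begin{aligned}
\mathcal{E}(h) &\ge 2c \, \mathbb{P}[h(X) \ne h^*(X)]^{1/\alpha} - 2 C_0 \, c^{\frac{1}{1-\alpha}} \, \mathbb{P}[h(X) \ne h^*(X)]^{1/\alpha} \\
&\ge c \, \mathbb{P}[h(X) \ne h^*(X)]^{1/\alpha}
\end{aligned}
$$

by selecting $c$ sufficiently small depending on $\alpha$ and $C_0$. Therefore

$$\mathbb{P}[h(X) \ne h^*(X)] \le \frac{1}{c^{\alpha}} \, \mathcal{E}(h)^{\alpha},$$

and choosing $C = C(\alpha, C_0, t_0) := c^{-\alpha}$ completes the proof.

Having established the key lemma, we are ready to prove the promised fast rate under Tsybakov's noise condition.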
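As a sanity check on the lemma, here is a small sketch on a toy model that is not taken from the notes. It assumes $X \sim \mathrm{Uniform}[-1,1]$ and $\eta(x) = (1+x)/2$, so that $\mathbb{P}[|\eta(X) - 1/2| \le t] = 2t$ for $t \le 1/2$, i.e. Tsybakov's condition holds with $\alpha = 1/2$, $C_0 = 2$, $t_0 = 1/2$. The Bayes classifier is $h^*(x) = \mathbb{1}(x > 0)$, and the hypothetical classifiers $h_s(x) = \mathbb{1}(x > s)$ disagree with it on $(0, s]$. The constant $C$ is computed by following the choice of $c$ in the proof.

```python
ALPHA, C0, T0 = 0.5, 2.0, 0.5  # Tsybakov constants of the assumed toy model

def disagreement_prob(s):
    """P[h_s(X) != h*(X)] for h_s(x) = 1(x > s), X ~ Uniform[-1, 1], 0 <= s <= 1."""
    return s / 2.0

def excess_risk(s, steps=10000):
    """E[|2 eta(X) - 1| 1(h_s(X) != h*(X))] with eta(x) = (1 + x) / 2, by a midpoint rule."""
    width = s / steps
    # Integrand |x| times the density 1/2 on the disagreement region (0, s].
    return sum(((k + 0.5) * width) * 0.5 * width for k in range(steps))

def lemma_constant(alpha, c0, t0):
    """C = c^(-alpha), where c <= t0 and 2 * c0 * c^(alpha / (1 - alpha)) <= 1 as in the proof."""
    c = min(t0, (1.0 / (2.0 * c0)) ** ((1.0 - alpha) / alpha))
    return c ** (-alpha)

C = lemma_constant(ALPHA, C0, T0)
for s in (0.1, 0.5, 1.0):
    p, e = disagreement_prob(s), excess_risk(s)
    print(f"s={s}: P[h != h*] = {p:.4f}, C * E(h)^alpha = {C * e ** ALPHA:.4f}")
```

In this model the excess risk is $s^2/4$ and the disagreement probability is $s/2 = \mathcal{E}(h_s)^{1/2}$, so the exponent $\alpha = 1/2$ in the lemma cannot be improved here, while the constant produced by the proof is somewhat conservative.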

Theorem: If Tsybakov's noise condition is satisfied with constants $\alpha$, $C_0$ and $t_0$, then there exists a constant $C = C(\alpha, C_0, t_0)$ such that

$$\mathcal{E}(\hat h) \le C \Big( \frac{\log(M/\delta)}{n} \Big)^{\frac{1}{2-\alpha}}$$

with probability at least $1 - \delta$.

This rate of excess risk, parametrized by $\alpha$, is indeed an interpolation between the slow rate ($\alpha \to 0$) and the fast rate ($\alpha \to 1$). Furthermore, note that the empirical risk minimizer $\hat h$ does not depend on the parameter $\alpha$ at all! It automatically adjusts to the noise level, which is a very nice feature of the empirical risk minimizer.

Proof. The majority of the last proof remains valid, and we explain only the difference. After establishing that $\mathcal{E}(\hat h) \le t_0(\hat j)$, we note that the lemma gives

$$\sigma_{\hat j}^2 = \mathbb{P}[\hat h(X) \ne \bar h(X)] \le C \, \mathcal{E}(\hat h)^{\alpha}.$$

It follows that

$$\mathcal{E}(\hat h) \le \max\Big( \sqrt{\frac{2 C \, \mathcal{E}(\hat h)^{\alpha} \log(M/\delta)}{n}}, \ \frac{2 \log(M/\delta)}{3n} \Big)$$

and thus

$$\mathcal{E}(\hat h) \le \max\Big( \Big( \frac{2 C \log(M/\delta)}{n} \Big)^{\frac{1}{2-\alpha}}, \ \frac{2 \log(M/\delta)}{3n} \Big).$$

Since $1/(2-\alpha) \le 1$, the first term dominates up to a constant whenever $\log(M/\delta) \le n$, which gives the claimed rate after adjusting $C$.

4. VAPNIK-CHERVONENKIS (VC) THEORY

The upper bounds proved so far are meaningful only for a finite dictionary $\mathcal{H}$, because if $M = |\mathcal{H}|$ is infinite, all of the bounds we have are simply infinite. To extend the previous results to the infinite case, we essentially need the condition that only a finite number of elements in an infinite dictionary $\mathcal{H}$ really matter. This is the objective of the Vapnik-Chervonenkis (VC) theory, which was developed in 1971.

4.1 Empirical measure

Recall from previous proofs (see (3.1) for example) that the key quantity we need to control is

$$\sup_{h \in \mathcal{H}} \big| \hat R_n(h) - R(h) \big|.$$

Instead of the union bound, which does not work in the infinite case, we seek a bound that potentially depends on $n$ and the complexity of the set $\mathcal{H}$. One approach is to consider some metric structure on $\mathcal{H}$ and hope that if two elements of $\mathcal{H}$ are close, then the quantity evaluated at these two elements is also close. The VC theory, on the other hand, is more combinatorial and does not involve any metric space structure, as we will see.
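The combinatorial viewpoint can be illustrated with a small sketch that is not part of the notes. It assumes a hypothetical infinite dictionary of threshold classifiers $h_s(x) = \mathbb{1}(x > s)$, $s \in \mathbb{R}$, and a fixed set of sample points chosen for illustration. Even though there are infinitely many thresholds, only a handful of distinct label vectors can be produced on the sample.

```python
def label_patterns(points, thresholds):
    """Distinct label vectors (h_s(x_1), ..., h_s(x_n)) for h_s(x) = 1(x > s)."""
    return {tuple(int(x > s) for x in points) for s in thresholds}

# Illustrative (assumed) sample points and a fine grid of thresholds on [-0.5, 1.5].
points = [0.1, 0.35, 0.4, 0.62, 0.8]
thresholds = [k / 10000 - 0.5 for k in range(20001)]

patterns = label_patterns(points, thresholds)
print("number of thresholds tried:", len(thresholds))
print("number of distinct label vectors:", len(patterns))
```

For thresholds on the real line, moving $s$ across a sample point changes exactly one label, so $n$ distinct points give at most $n + 1$ label vectors, far fewer than the $2^n$ vectors in $\{0,1\}^n$.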

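By definition,

$$\hat R_n(h) - R(h) = \frac{1}{n} \sum_{i=1}^n \Big( \mathbb{1}(h(X_i) \ne Y_i) - \mathbb{E}[\mathbb{1}(h(X_i) \ne Y_i)] \Big).$$

Let $Z = (X, Y)$ and $Z_i = (X_i, Y_i)$, and let $\mathcal{A}$ denote the class of measurable sets in the sample space $\mathcal{X} \times \{0,1\}$. For a classifier $h$, define $A_h \in \mathcal{A}$ by

$$\{Z_i \in A_h\} = \{h(X_i) \ne Y_i\}.$$

Moreover, define measures $\mu_n$ and $\mu$ on $\mathcal{A}$ by

$$\mu_n(A) = \frac{1}{n} \sum_{i=1}^n \mathbb{1}(Z_i \in A) \quad \text{and} \quad \mu(A) = \mathbb{P}[Z_i \in A]$$

for $A \in \mathcal{A}$. With this notation, the slow rate we proved is just

$$\sup_{h \in \mathcal{H}} \big| \hat R_n(h) - R(h) \big| = \sup_{A \in \mathcal{A}} \big| \mu_n(A) - \mu(A) \big| \le \sqrt{\frac{\log(2|\mathcal{A}|/\delta)}{2n}},$$

where, with a slight abuse of notation, the supremum over $A$ runs over the sets $A_h$, $h \in \mathcal{H}$.

Since this bound is not accessible in the infinite case, we hope to use one of the concentration inequalities to give an upper bound. Note that $\sup_{A} |\mu_n(A) - \mu(A)|$ is not itself a sum of independent random variables, so inequalities for sums do not apply directly, and the tool we can use now is the bounded difference inequality. If we change the value of only one $z_i$ in the function

$$(z_1, \dots, z_n) \mapsto \sup_{A \in \mathcal{A}} \big| \mu_n(A) - \mu(A) \big|,$$

the value of the function changes by at most $1/n$. Hence it satisfies the bounded difference assumption with $c_i = 1/n$ for all $1 \le i \le n$. Applying the bounded difference inequality, we get that

$$\Big| \sup_{A \in \mathcal{A}} |\mu_n(A) - \mu(A)| - \mathbb{E}\Big[ \sup_{A \in \mathcal{A}} |\mu_n(A) - \mu(A)| \Big] \Big| \le \sqrt{\frac{\log(2/\delta)}{2n}}$$

with probability at least $1 - \delta$. Note that this already precludes any fast rate (faster than $1/\sqrt{n}$). To achieve a fast rate, we need Talagrand's inequality and localization techniques, which are beyond the scope of this section.

It follows that with probability at least $1 - \delta$,

$$\sup_{A \in \mathcal{A}} |\mu_n(A) - \mu(A)| \le \mathbb{E}\Big[ \sup_{A \in \mathcal{A}} |\mu_n(A) - \mu(A)| \Big] + \sqrt{\frac{\log(2/\delta)}{2n}}.$$

We will now focus on bounding the first term on the right-hand side. To this end, we need a technique called symmetrization, which is the subject of the next section.

4.2 Symmetrization and Rademacher complexity

Symmetrization is a frequently used technique in machine learning. Let $D = \{Z_1, \dots, Z_n\}$ be the sample set. To employ symmetrization, we take another independent copy of the sample set, $D' = \{Z_1', \dots, Z_n'\}$. This sample exists only for the proof, so it is sometimes referred to as a ghost sample. Then we have

$$\mu(A) = \mathbb{P}[Z \in A] = \mathbb{E}\Big[ \frac{1}{n} \sum_{i=1}^n \mathbb{1}(Z_i' \in A) \Big] = \mathbb{E}\Big[ \frac{1}{n} \sum_{i=1}^n \mathbb{1}(Z_i' \in A) \,\Big|\, D \Big] = \mathbb{E}[\mu_n'(A) \mid D],$$

where $\mu_n'(A) := \frac{1}{n} \sum_{i=1}^n \mathbb{1}(Z_i' \in A)$. Thus by Jensen's inequality,

$$
\begin{aligned}
\mathbb{E}\Big[ \sup_{A \in \mathcal{A}} |\mu_n(A) - \mu(A)| \Big] &= \mathbb{E}\Big[ \sup_{A \in \mathcal{A}} \big| \mu_n(A) - \mathbb{E}[\mu_n'(A) \mid D] \big| \Big] \\
&\le \mathbb{E}\Big[ \sup_{A \in \mathcal{A}} \mathbb{E}\big[ |\mu_n(A) - \mu_n'(A)| \,\big|\, D \big] \Big] \\
&\le \mathbb{E}\Big[ \sup_{A \in \mathcal{A}} |\mu_n(A) - \mu_n'(A)| \Big] \\
&= \mathbb{E}\Big[ \sup_{A \in \mathcal{A}} \Big| \frac{1}{n} \sum_{i=1}^n \big( \mathbb{1}(Z_i \in A) - \mathbb{1}(Z_i' \in A) \big) \Big| \Big].
\end{aligned}
$$

Since $D'$ has the same distribution as $D$, by symmetry $\mathbb{1}(Z_i \in A) - \mathbb{1}(Z_i' \in A)$ has the same distribution as $\sigma_i \big( \mathbb{1}(Z_i \in A) - \mathbb{1}(Z_i' \in A) \big)$, where $\sigma_1, \dots, \sigma_n$ are i.i.d. $\mathrm{Rad}(1/2)$, i.e. $\mathbb{P}[\sigma_i = 1] = \mathbb{P}[\sigma_i = -1] = 1/2$, and the $\sigma_i$'s are taken to be independent of both samples. Therefore,

$$
\begin{aligned}
\mathbb{E}\Big[ \sup_{A \in \mathcal{A}} |\mu_n(A) - \mu(A)| \Big] &\le \mathbb{E}\Big[ \sup_{A \in \mathcal{A}} \Big| \frac{1}{n} \sum_{i=1}^n \sigma_i \big( \mathbb{1}(Z_i \in A) - \mathbb{1}(Z_i' \in A) \big) \Big| \Big] \\
&\le 2 \, \mathbb{E}\Big[ \sup_{A \in \mathcal{A}} \Big| \frac{1}{n} \sum_{i=1}^n \sigma_i \mathbb{1}(Z_i \in A) \Big| \Big]. \qquad (4.5)
\end{aligned}
$$

Using symmetrization, we have bounded $\mathbb{E}[\sup_{A} |\mu_n(A) - \mu(A)|]$ by a much nicer quantity. Yet we still need an upper bound on the last quantity that depends only on the structure of $\mathcal{A}$ and not on the random sample $\{Z_i\}$. This is achieved by taking the supremum over all $z_i \in \mathcal{X} \times \{0,1\} =: \mathcal{Y}$.

Definition: The Rademacher complexity of a family of sets $\mathcal{A}$ in a space $\mathcal{Y}$ is defined to be the quantity

$$R_n(\mathcal{A}) = \sup_{z_1, \dots, z_n \in \mathcal{Y}} \mathbb{E}\Big[ \sup_{A \in \mathcal{A}} \Big| \frac{1}{n} \sum_{i=1}^n \sigma_i \mathbb{1}(z_i \in A) \Big| \Big].$$

The Rademacher complexity of a set $B \subset \mathbb{R}^n$ is defined to be

$$R_n(B) = \mathbb{E}\Big[ \sup_{b \in B} \Big| \frac{1}{n} \sum_{i=1}^n \sigma_i b_i \Big| \Big].$$

We conclude from (4.5) and the definition that

$$\mathbb{E}\Big[ \sup_{A \in \mathcal{A}} |\mu_n(A) - \mu(A)| \Big] \le 2 R_n(\mathcal{A}).$$

In the definition of the Rademacher complexity of a set, the quantity $\big| \frac{1}{n} \sum_{i=1}^n \sigma_i b_i \big|$ measures how well a vector $b \in B$ correlates with a random sign pattern $\{\sigma_i\}$. The more complex $B$ is, the better some vector in $B$ can replicate a sign pattern. In particular, if $B$ is the full hypercube $[-1,1]^n$, then $R_n(B) = 1$. However, if $B$ is the set of $k$-sparse vectors in $[-1,1]^n$, then $R_n(B) = k/n$. Hence $R_n(B)$ is indeed a measure of the complexity of the set $B$.

The set of vectors of interest in the definition of the Rademacher complexity of $\mathcal{A}$ is

$$T(z) := \big\{ (\mathbb{1}(z_1 \in A), \dots, \mathbb{1}(z_n \in A))^\top, \ A \in \mathcal{A} \big\}.$$

Thus the key quantity here is the cardinality of $T(z)$, i.e., the number of sign patterns these vectors can replicate as $A$ ranges over $\mathcal{A}$. Although the cardinality of $\mathcal{A}$ may be infinite, the cardinality of $T(z)$ is bounded by $2^n$.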
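The two values of $R_n(B)$ quoted above can be reproduced exactly for small $n$ by enumerating all $2^n$ sign patterns. This is a minimal sketch rather than anything from the notes. It relies on the assumption, stated here explicitly, that the supremum of $|\sum_i \sigma_i b_i|$ over the hypercube (respectively over $k$-sparse vectors in it) is attained at vectors with entries in $\{-1, 1\}$ (respectively at $k$-sparse vectors with nonzero entries in $\{-1, 1\}$), so finite sets suffice. The function names and the values of `n` and `k` are illustrative.

```python
from itertools import combinations, product

def rademacher_complexity(B, n):
    """Exact R_n(B) = E sup_{b in B} |(1/n) sum_i sigma_i b_i| for a finite set B."""
    total = 0.0
    for sigma in product((-1, 1), repeat=n):  # all 2^n equally likely sign patterns
        total += max(abs(sum(s * b for s, b in zip(sigma, vec))) for vec in B) / n
    return total / 2 ** n

def sparse_sign_vectors(n, k):
    """All vectors in {-1, 0, 1}^n with exactly k nonzero entries."""
    vectors = []
    for support in combinations(range(n), k):
        for signs in product((-1, 1), repeat=k):
            vec = [0] * n
            for idx, sgn in zip(support, signs):
                vec[idx] = sgn
            vectors.append(tuple(vec))
    return vectors

n, k = 6, 2
cube_vertices = list(product((-1, 1), repeat=n))
print("hypercube:", rademacher_complexity(cube_vertices, n))
print("k-sparse: ", rademacher_complexity(sparse_sign_vectors(n, k), n), "vs k/n =", k / n)
```

For every sign pattern the hypercube contains the pattern itself, so the inner supremum is always $1$, whereas a $k$-sparse vector can match the signs on only $k$ coordinates, giving $k/n$.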

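Inequality (4.5) can also be examined by simulation. The sketch below is an illustration under simplifying assumptions that are not in the notes: the sample is $\mathrm{Uniform}[0,1]$ on the real line rather than on $\mathcal{X} \times \{0,1\}$, and the class of sets is the hypothetical family of half-lines $(-\infty, s]$, for which both suprema can be computed by sorting. It estimates both sides of (4.5) by Monte Carlo.

```python
import random

def sup_deviation(sample):
    """sup_s |mu_n((-inf, s]) - mu((-inf, s])| for Uniform[0, 1] data."""
    xs = sorted(sample)
    n = len(xs)
    # The supremum is attained just before or at one of the sample points.
    return max(max(abs((i + 1) / n - x), abs(i / n - x)) for i, x in enumerate(xs))

def sup_rademacher_sum(sample, signs):
    """sup_s |(1/n) sum_i sigma_i 1(Z_i <= s)|, a maximum over prefix sums in sorted order."""
    n = len(sample)
    order = sorted(range(n), key=lambda i: sample[i])
    partial, best = 0, 0
    for i in order:
        partial += signs[i]
        best = max(best, abs(partial))
    return best / n

rng = random.Random(0)  # fixed seed so the run is reproducible
n, trials = 50, 2000
lhs = rhs = 0.0
for _ in range(trials):
    sample = [rng.random() for _ in range(n)]
    signs = [rng.choice((-1, 1)) for _ in range(n)]
    lhs += sup_deviation(sample) / trials
    rhs += sup_rademacher_sum(sample, signs) / trials

print("estimate of E sup |mu_n(A) - mu(A)|:", lhs)
print("estimate of 2 E sup |(1/n) sum sigma_i 1(Z_i in A)|:", 2 * rhs)
```

A simulation of this kind only illustrates the inequality for one class and one distribution; the proof above is what establishes it in general.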
MIT OpenCourseWare
18.657 Mathematics of Machine Learning, Fall 2015
For information about citing these materials or our Terms of Use, visit:
