Sieve Estimators: Consistency and Rates of Convergence


EECS 598: Statistical Learning Theory, Winter 2014, Topic 6
Lecturer: Clayton Scott. Scribes: Julian Katz-Samuels, Brandon Oselio, Pin-Yu Chen

Disclaimer: These notes have not been subjected to the usual scrutiny reserved for formal publications. They may be distributed outside this class only with the permission of the Instructor.

1 Introduction

We can decompose the excess risk of a discrimination rule as follows:

\[ R(\hat{h}_n) - R^* = \underbrace{R(\hat{h}_n) - R(\mathcal{H})}_{\text{estimation error}} + \underbrace{R(\mathcal{H}) - R^*}_{\text{approximation error}}. \]

The first term is the estimation error and measures the performance of the discrimination rule with respect to the best hypothesis in $\mathcal{H}$. In previous lectures, we studied performance guarantees for this quantity when $\hat{h}_n$ is ERM. Now, we also consider the approximation error, which captures how well a sequence of hypothesis classes $\{\mathcal{H}_k\}$ approximates the Bayes decision boundary. For example, consider the histogram classifiers in Figures 1 and 2. As the grid becomes more fine-grained, the class of hypotheses approximates the Bayes decision boundary with increasing accuracy. The examination of the approximation error will lead to the design of sieve estimators, which perform ERM over $\mathcal{H}_{k_n}$, where $k_n$ grows with $n$ at an appropriate rate such that both the approximation error and the estimation error converge to 0. Note that while the estimation error is random because it depends on the sample, the approximation error is not random.

2 Approximation Error

The following assumption will be adopted to establish universal consistency.

Definition 1. A sequence of sets of classifiers $\mathcal{H}_1, \mathcal{H}_2, \ldots$ is said to have the universal approximation property (UAP) if for all $P_{XY}$, $R(\mathcal{H}_k) \to R^*$ as $k \to \infty$.

Observe that $R(\mathcal{H}_k)$ is not random; it does not depend on the training data. It does, however, depend on $P_{XY}$ because it is an expectation.

The following theorem gives a sufficient condition for a sequence of classes to have the UAP.

Theorem 1. Suppose $\mathcal{H}_k$ consists of classifiers that are piecewise constant on a partition of $\mathcal{X}$ into cells $A_{k,1}, A_{k,2}, \ldots$. If $\sup_{j \ge 1} \mathrm{diam}(A_{k,j}) \to 0$ as $k \to \infty$, then $\{\mathcal{H}_k\}$ has the UAP.

Proof. See the proof of Theorem 6.1 in [1].

Example. Suppose $\mathcal{X} = [0,1]^d$. Consider $\mathcal{H}_k = \{$histogram classifiers based on cells of side length $1/k\}$. Then $\mathrm{diam}(A_{k,j}) = \sqrt{d}/k$ for all $j$. In particular,

\[ \sup_{j \ge 1} \mathrm{diam}(A_{k,j}) = \frac{\sqrt{d}}{k} \to 0 \quad \text{as } k \to \infty. \]

By the above theorem, $\{\mathcal{H}_k\}$ has the UAP.
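As a quick numerical companion to this example (added here for illustration; the distribution and all constants below are invented for the demo), one can estimate the approximation error $R(\mathcal{H}_k) - R^*$ of histogram classes on $[0,1]^2$ by Monte Carlo. For a distribution with deterministic labels ($R^* = 0$), the best classifier in $\mathcal{H}_k$ assigns each cell its majority label:

```python
import numpy as np

# Deterministic labels (R* = 0) with a smooth Bayes decision boundary:
# y = 1{ x2 > 0.25*sin(2*pi*x1) + 0.5 } on X = [0,1]^2.
rng = np.random.default_rng(0)
N = 1_000_000                       # large sample; k^2 cells << N, so the
X = rng.random((N, 2))              # in-sample estimate below is adequate
y = (X[:, 1] > 0.25 * np.sin(2 * np.pi * X[:, 0]) + 0.5).astype(int)

for k in [2, 4, 8, 16, 32]:
    # Flatten each point's (row, col) cell indices into a single cell id.
    cell = (np.minimum((X * k).astype(int), k - 1) * [1, k]).sum(axis=1)
    ones = np.bincount(cell, weights=y, minlength=k * k)
    total = np.bincount(cell, minlength=k * k)
    h_star = (ones > total / 2).astype(int)  # best-in-class: cellwise majority
    print(f"k = {k:2d}   estimated R(H_k) - R* = {np.mean(h_star[cell] != y):.4f}")
```

For a boundary like this one, the estimated error shrinks roughly like $1/k$, consistent with the box-counting analysis of Section 6 below.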

Figure 1: Histogram classifier based on a 6 x 6 grid.
Figure 2: Histogram classifier based on a 12 x 12 grid.

3 Convergence of Random Variables

This section offers a brief review of convergence in probability and almost sure convergence, and gives some useful results for proving both kinds of convergence.

Definition 2. Let $Z_1, Z_2, \ldots$ be a sequence of random variables in $\mathbb{R}$. We say that $Z_n$ converges to $Z$ in probability, written $Z_n \xrightarrow{i.p.} Z$, if

\[ \forall \epsilon > 0, \quad \lim_{n \to \infty} \Pr(|Z_n - Z| \ge \epsilon) = 0. \]

We say $Z_n$ converges to $Z$ almost surely (with probability one), written $Z_n \xrightarrow{a.s.} Z$, if

\[ \Pr\left(\{\omega : \lim_{n \to \infty} Z_n(\omega) = Z(\omega)\}\right) = 1. \]

Example. In the analysis of discrimination rules, we consider $\Omega = \{\omega = ((X_1, Y_1), (X_2, Y_2), \ldots)\} = (\mathcal{X} \times \mathcal{Y})^{\infty}$, where $(X_i, Y_i) \overset{i.i.d.}{\sim} P_{XY}$. To study convergence of the risk to the Bayes risk, we take $Z_n = R(\hat{h}_n)$, where $\hat{h}_n$ is a classifier based on the first $n$ entries of $\omega$, namely $(X_1, Y_1), \ldots, (X_n, Y_n)$, and $Z = R^*$.

Now, we consider some useful lemmas for proving convergence in probability and almost sure convergence. We begin with convergence in probability.

Lemma 1. If $Z_n \xrightarrow{a.s.} Z$, then $Z_n \xrightarrow{i.p.} Z$.

Proof. This was proved in EECS 501.

Lemma 2. If $\mathbb{E}\{|Z_n - Z|\} \to 0$ as $n \to \infty$, then $Z_n \xrightarrow{i.p.} Z$.

Proof. By Markov's inequality,

\[ \Pr(|Z_n - Z| \ge \epsilon) \le \frac{\mathbb{E}\{|Z_n - Z|\}}{\epsilon} \to 0. \]
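To see Definition 2 and Lemma 2 in action, here is a small simulation (an added illustration, not from the original notes). For $Z_n$ the sample mean of $n$ i.i.d. Bernoulli($p$) draws and $Z = p$, we have $\mathbb{E}|Z_n - Z| \le \sqrt{p(1-p)/n} \to 0$, so Lemma 2 gives $Z_n \xrightarrow{i.p.} p$, and the probability in the definition can be estimated empirically:

```python
import numpy as np

# Z_n = sample mean of n Bernoulli(p) draws; Z = p. We estimate
# Pr(|Z_n - Z| >= eps) over many independent replications and watch
# it decay to 0, as convergence in probability requires.
rng = np.random.default_rng(0)
p, eps, reps = 0.3, 0.05, 5000

for n in [10, 100, 1000, 10000]:
    Zn = rng.binomial(n, p, size=reps) / n   # reps independent copies of Z_n
    print(f"n = {n:5d}   Pr(|Z_n - p| >= {eps}) ~= {np.mean(np.abs(Zn - p) >= eps):.4f}")
```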

Now, we turn our attention to almost sure convergence. We begin with the Borel-Cantelli lemma, which yields a useful corollary.

Lemma 3 (Borel-Cantelli lemma). Consider a probability space $(\Omega, \mathcal{A}, P)$. Let $\{A_n\}_{n \ge 1}$ be a sequence of events. If $\sum_{n \ge 1} P(A_n) < \infty$, then

\[ P\left(\limsup_n A_n\right) = 0, \]

where

\[ \limsup_n A_n := \{\omega \in \Omega : \forall N\ \exists n \ge N \text{ such that } \omega \in A_n\} = \{\omega \text{ that occur infinitely often in the sequence of events}\} = \bigcap_{N \ge 1} \bigcup_{n \ge N} A_n. \]

Proof. Observe that $P(\limsup_n A_n) = P(\lim_{N \to \infty} \bigcup_{n \ge N} A_n)$. Let $B_N = \bigcup_{n \ge N} A_n$. Observe that $B_1 \supseteq B_2 \supseteq B_3 \supseteq \cdots$, i.e., $\{B_N\}$ is a decreasing sequence of events. Then, by continuity of $P$, we have

\[ P\left(\lim_{N \to \infty} B_N\right) = \lim_{N \to \infty} P(B_N) = \lim_{N \to \infty} P\left(\bigcup_{n \ge N} A_n\right) \le \lim_{N \to \infty} \sum_{n \ge N} P(A_n) = 0, \]

where the inequality follows from the union bound and the last equality follows from the hypothesis that $\sum_{n \ge 1} P(A_n) < \infty$.

The following corollary of the Borel-Cantelli lemma is very useful for showing almost sure convergence.

Corollary 1. If for all $\epsilon > 0$ we have

\[ \sum_{n \ge 1} \Pr(|Z_n - Z| \ge \epsilon) < \infty, \]

then $Z_n \xrightarrow{a.s.} Z$.

Proof. Define the event $A^{\epsilon} = \{\omega \in \Omega : \forall N\ \exists n \ge N \text{ such that } |Z_n(\omega) - Z(\omega)| \ge \epsilon\}$. Let $A_n^{\epsilon} = \{\omega \in \Omega : |Z_n(\omega) - Z(\omega)| \ge \epsilon\}$. Observe that $\limsup_n A_n^{\epsilon} = A^{\epsilon}$. By the hypothesis, we may apply the Borel-Cantelli lemma to obtain $\Pr(A^{\epsilon}) = 0$. Now define

\[ A = \{\omega \in \Omega : Z_n(\omega) \not\to Z(\omega)\} = \{\omega \in \Omega : \exists \epsilon > 0\ \forall N\ \exists n \ge N \text{ such that } |Z_n(\omega) - Z(\omega)| \ge \epsilon\}. \]

In words, this means: for some $\epsilon$, no matter how far along you go in the sequence, you can find some $n$ such that $|Z_n(\omega) - Z(\omega)| \ge \epsilon$. Now, let $\{\epsilon_j\}$ be a strictly decreasing sequence converging to 0. Observe that $A \subseteq \bigcup_{j \ge 1} A^{\epsilon_j}$. To see this, take some $\omega \in A$. Then, for some $\epsilon > 0$, $\forall N\ \exists n \ge N$ such that $|Z_n(\omega) - Z(\omega)| \ge \epsilon$. As $j \to \infty$, eventually there is some $j$ such that $\epsilon_j \le \epsilon$, so that $\omega \in A^{\epsilon_j}$. Therefore,

\[ \Pr(A) \le \Pr\left(\bigcup_{j \ge 1} A^{\epsilon_j}\right) \le \sum_{j \ge 1} \Pr(A^{\epsilon_j}) = 0. \]

This concludes the proof.
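As a sanity check on Corollary 1 (an added example, not in the original notes), combining it with Hoeffding's inequality yields the strong law of large numbers for bounded random variables; this is exactly the summability template reused in Section 4 below:

```latex
% Added example: Corollary 1 + Hoeffding => a.s. convergence of sample means.
% Let X_1, X_2, \dots be i.i.d. taking values in [0,1], and set
% Z_n = \frac{1}{n}\sum_{i=1}^n X_i and Z = \mathbb{E}[X_1].
% Hoeffding's inequality gives, for every \epsilon > 0,
\Pr\bigl(|Z_n - Z| \ge \epsilon\bigr) \le 2 e^{-2n\epsilon^2},
\qquad \text{so} \qquad
\sum_{n \ge 1} \Pr\bigl(|Z_n - Z| \ge \epsilon\bigr)
\le \sum_{n \ge 1} 2 e^{-2n\epsilon^2}
= \frac{2 e^{-2\epsilon^2}}{1 - e^{-2\epsilon^2}} < \infty.
% By Corollary 1, Z_n -> Z almost surely: the strong law of large numbers
% for bounded variables.
```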

4 Sieve Estimators

Let $\{\mathcal{H}_k\}$ be a sequence of sets of classifiers with the universal approximation property (UAP). Assume that we have a uniform deviation bound for each $\mathcal{H}_k$ (e.g., if $\mathcal{H}_k$ is finite or a VC class). We denote by $\hat{h}_{n,k}$ the classifier we obtain from empirical risk minimization (ERM) over $\mathcal{H}_k$:

\[ \hat{h}_{n,k} = \arg\min_{h \in \mathcal{H}_k} \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}_{\{h(X_i) \ne Y_i\}}. \]

Let $k_n$ be an integer-valued sequence and define $\hat{h}_n = \hat{h}_{n,k_n}$. Since we have the UAP, it should be clear that the approximation error goes to 0 as long as $k_n \to \infty$ with $n$. We wish to restrict the rate of growth of $k_n$ appropriately so that the estimation error also goes to zero, in probability or almost surely. This is called a sieve estimator, whose name is generally attributed to Grenander [3]. Formally, we need

\[ R(\hat{h}_n) - R(\mathcal{H}_{k_n}) \to 0, \quad \text{in probability or almost surely.} \]

Let's apply some previously developed theory to this problem. Consider the case where $|\mathcal{H}_k| < \infty$. We have seen that

\[ \Pr\left(R(\hat{h}_n) - R(\mathcal{H}_{k_n}) \ge \epsilon\right) \le 2 |\mathcal{H}_{k_n}| e^{-n\epsilon^2/2}. \tag{1} \]

We need to make sure that $|\mathcal{H}_{k_n}|$ does not dominate the exponential term, so that the sum in Corollary 1 converges. Rewrite the right-hand side of Eq. (1) as

\[ 2 |\mathcal{H}_{k_n}| e^{-n\epsilon^2/2} = 2 e^{\ln|\mathcal{H}_{k_n}| - n\epsilon^2/2}. \]

Thus it suffices to have $\frac{\ln|\mathcal{H}_{k_n}|}{n} \to 0$ as $n \to \infty$. Another way to write this is using little-oh notation. We say that a sequence $a_n = o(b_n)$ if $\frac{a_n}{b_n} \to 0$. So, the above is equivalent to $\ln|\mathcal{H}_{k_n}| = o(n)$. Under this condition,

\[ \forall \epsilon > 0,\ \exists N_{\epsilon} \text{ such that } \forall n \ge N_{\epsilon}, \quad \frac{\ln|\mathcal{H}_{k_n}|}{n} \le \frac{\epsilon^2}{4}. \]

Then for all $\epsilon > 0$,

\[ \sum_{n \ge 1} \Pr\left(R(\hat{h}_n) - R(\mathcal{H}_{k_n}) \ge \epsilon\right) \le N_{\epsilon} + \sum_{n \ge N_{\epsilon}} 2 e^{-n\epsilon^2/4} < \infty. \tag{2} \]

The summation on the right is simply a converging geometric series. By Corollary 1 we have established

Theorem 2. Let $\{\mathcal{H}_k\}$ satisfy the UAP with $|\mathcal{H}_k| < \infty$. Let $k_n$ be an integer sequence such that, as $n \to \infty$, $k_n \to \infty$ and $\ln|\mathcal{H}_{k_n}| = o(n)$. Then $R(\hat{h}_n) \to R^*$ almost surely, i.e., $\hat{h}_n$ is strongly universally consistent.

Corollary 2. Let $\mathcal{H}_k = \{$histogram classifiers with side length $1/k\}$ on $\mathcal{X} = [0,1]^d$. Let $k_n \to \infty$ such that $k_n^d = o(n)$. Then $\hat{h}_n$ is strongly universally consistent.

Proof. We saw previously in Section 2 that histograms have the UAP. Also, $|\mathcal{H}_k| = 2^{k^d}$, so $\ln|\mathcal{H}_{k_n}| = k_n^d \ln 2 = o(n)$, and the corollary follows from Theorem 2.

In Corollary 2, we require $\frac{k_n^d}{n} \to 0$. Note that $n / k_n^d$ is the number of samples per cell. Therefore the number of samples per cell must go to infinity; this seems like a reasonable condition for strong consistency.
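Here is a minimal sketch of such a sieve estimator in code (added for illustration; the function name, the $\{0,1\}$ label encoding, and the schedule $k_n = \lfloor n^{1/(d+1)} \rfloor$, which satisfies $k_n \to \infty$ and $k_n^d = o(n)$ as required by Corollary 2, are all choices made for the demo). ERM over histogram classifiers reduces to a majority vote within each cell:

```python
import numpy as np

def sieve_histogram_classifier(X, y):
    """Minimal sketch: ERM over histogram classifiers on [0,1]^d with side
    length 1/k_n, where k_n = floor(n^(1/(d+1))), so that k_n -> infinity
    and k_n^d = o(n) (the hypotheses of Corollary 2). Labels are assumed
    to lie in {0, 1}. Returns a function mapping test points to labels."""
    n, d = X.shape
    k = max(1, int(n ** (1.0 / (d + 1))))
    # Assign each training point to its cell (a d-tuple of bin indices).
    bins = np.minimum((X * k).astype(int), k - 1)
    counts = {}
    for b, label in zip(map(tuple, bins), y):
        c = counts.setdefault(b, [0, 0])  # [count of 0s, count of 1s]
        c[int(label)] += 1
    def predict(X_test):
        bt = np.minimum((X_test * k).astype(int), k - 1)
        out = []
        for b in map(tuple, bt):
            c0, c1 = counts.get(b, (1, 0))  # empty cells default to label 0
            out.append(int(c1 > c0))
        return np.array(out)
    return predict

# Usage: predict = sieve_histogram_classifier(X_train, y_train)
#        y_hat   = predict(X_test)
```

Any schedule with $k_n \to \infty$ and $k_n^d = o(n)$ gives consistency; Section 6 below identifies the growth rate that optimizes the rate of convergence.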

Now, let's consider the case where the VC dimension of $\mathcal{H}_k$ is finite. From our discussion of VC theory, we have that for all $\epsilon > 0$,

\[ \Pr\left(R(\hat{h}_n) - R(\mathcal{H}_{k_n}) \ge \epsilon\right) \le 8\, S(\mathcal{H}_{k_n}, n)\, e^{-n\epsilon^2/128}. \]

We also know from Sauer's lemma that

\[ S(\mathcal{H}, n) \le \left(\frac{en}{V}\right)^{V}, \]

where $V$ is the VC dimension of $\mathcal{H}$. Using this, we deduce

\[ \Pr\left(R(\hat{h}_n) - R(\mathcal{H}_{k_n}) \ge \epsilon\right) \le 8\, S(\mathcal{H}_{k_n}, n)\, e^{-n\epsilon^2/128} \le C\, n^{V_{k_n}} e^{-n\epsilon^2/128} = C \exp\left(V_{k_n} \ln n - \frac{n\epsilon^2}{128}\right), \]

where $C$ is simply a constant that does not depend on $n$. If we choose $k_n$ such that $V_{k_n} = o(n/\ln n)$, then we obtain the following inequality, which is similar to the one in Eq. (2):

\[ \forall \epsilon > 0,\ \exists N_{\epsilon} \text{ such that } \forall n \ge N_{\epsilon}, \quad \frac{V_{k_n} \ln n}{n} \le \frac{\epsilon^2}{128} - \frac{\epsilon^2}{256} = \frac{\epsilon^2}{256}. \]

Therefore

\[ \sum_{n \ge 1} \Pr\left(R(\hat{h}_n) - R(\mathcal{H}_{k_n}) \ge \epsilon\right) \le N_{\epsilon} + C \sum_{n \ge N_{\epsilon}} e^{-n\epsilon^2/256} < \infty. \]

We have established

Theorem 3. Let $\{\mathcal{H}_k\}$ satisfy the UAP, and let $V_k < \infty$ denote the VC dimension of $\mathcal{H}_k$. Let $k_n$ be an integer-valued sequence such that, as $n \to \infty$, $k_n \to \infty$ and $V_{k_n} = o(n/\ln n)$. Then $R(\hat{h}_n) \to R^*$ almost surely, i.e., $\hat{h}_n$ is strongly universally consistent.

5 Rates of Convergence

One could ask whether there is a universal rate of convergence. Put more formally, is there an $\hat{h}_n$ such that $\mathbb{E}[R(\hat{h}_n)] - R^*$ converges to 0 at a fixed rate, for any joint distribution $P_{XY}$? It turns out that the answer is no. A theorem from [2] makes this concrete.

Theorem 4. Let $\{a_n\}$ with $a_n \to 0$ be such that $\frac{1}{16} \ge a_1 \ge a_2 \ge \cdots$. For any $\hat{h}_n$, there exists a joint distribution $P_{XY}$ such that $R^* = 0$ and $\mathbb{E}[R(\hat{h}_n)] \ge a_n$.

Proof. See Chapter 7 of [1].

Since the sequence $\{a_n\}$ is arbitrary, we can choose it to converge to 0 as slowly as we like, showing that there is no universal rate of convergence. In order to establish rates of convergence, we will therefore have to place restrictions on $P_{XY}$. Some possible reasonable restrictions on $P_{XY}$ include:

- $P_{X|Y=y}$ is a continuous distribution.
- $\eta(x) = \Pr(Y = 1 \mid X = x)$ is smooth.
- there exists $t_0 > 0$ such that $P_X\left(\left|\eta(X) - \frac{1}{2}\right| \le t_0\right) = 0$.
- the Bayes decision boundary is smooth.

Below we consider one particular distributional assumption under which we can establish a rate of convergence.

Figure 3: Illustration of the Bayes decision boundary (BDB).

6 Box-Counting Class

Let $\mathcal{X} = [0,1]^d$. For $m \ge 1$, define $P_m$ to be the partition of $\mathcal{X}$ into cubes of side length $1/m$. Note that $|P_m| = m^d$.

Definition 3. The box-counting class $\mathcal{B}$ is the set of all $P_{XY}$ such that

(A) $P_X$ has a bounded density.

(B) There exists a constant $C$ such that for all $m \ge 1$, the Bayes decision boundary (BDB) $\{x : \eta(x) = 1/2\}$ passes through at most $C m^{d-1}$ elements of $P_m$ (see Fig. 3).

Condition (B) essentially says that the Bayes decision boundary has dimension $d - 1$. Look up the term "box-counting dimension" for more.

We have the following rate of convergence result.

Theorem 5. Let $\mathcal{H}_k = \{$histogram classifiers based on $P_k\}$, assume $P_{XY} \in \mathcal{B}$, and let $\hat{h}_n$ be a sieve estimator with $k_n \asymp n^{1/(d+2)}$. Then

\[ \mathbb{E}[R(\hat{h}_n)] - R^* = O\left(n^{-\frac{1}{d+2}}\right). \]
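Before the proof, it may help to see where the exponent comes from. The following back-of-the-envelope balance (a sketch added here, with constants and logarithmic factors suppressed) mirrors the formal argument below:

```latex
% Added heuristic: balance estimation and approximation errors for histograms.
% With |H_k| = 2^{k^d}, the estimation error scales like sqrt(k^d / n);
% under the box-counting assumption, the approximation error scales like 1/k.
\sqrt{\frac{k^d}{n}} \asymp \frac{1}{k}
\;\Longleftrightarrow\; k^{d+2} \asymp n
\;\Longleftrightarrow\; k \asymp n^{1/(d+2)},
\qquad \text{giving the rate} \quad \frac{1}{k} \asymp n^{-1/(d+2)}.
```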

Proof. Recall the decomposition into estimation and approximation errors:

\[ \mathbb{E}[R(\hat{h}_n)] - R^* = \mathbb{E}\left[R(\hat{h}_n) - R(\mathcal{H}_{k_n})\right] + \left(R(\mathcal{H}_{k_n}) - R^*\right). \]

To bound the estimation error, denote $\Delta_n := R(\hat{h}_n) - R(\mathcal{H}_{k_n})$. By the law of total expectation,

\[ \mathbb{E}[\Delta_n] = \Pr(\Delta_n \ge \epsilon) \underbrace{\mathbb{E}[\Delta_n \mid \Delta_n \ge \epsilon]}_{\le 1 \text{, trivial since } R \le 1} + \Pr(\Delta_n < \epsilon) \underbrace{\mathbb{E}[\Delta_n \mid \Delta_n < \epsilon]}_{< \epsilon} \le \delta + \epsilon. \]

Choosing $\epsilon = \sqrt{\frac{2[k^d \ln 2 + \ln(2/\delta)]}{n}}$ and $\delta = 1/n$, we have $\mathbb{E}[R(\hat{h}_n) - R(\mathcal{H}_{k_n})] = O\left(\sqrt{k_n^d / n}\right)$ (the $\ln n$ contribution from $\delta = 1/n$ is of lower order, since $k_n^d$ grows polynomially in $n$).

To bound the approximation error, let $h^*$ denote the Bayes classifier and $h_k^*$ the best classifier in $\mathcal{H}_k$. Observe

\[ R(h_k^*) - R(h^*) = 2\, \mathbb{E}_X\left[\left|\eta(X) - \tfrac{1}{2}\right| \mathbb{1}_{\{h_k^*(X) \ne h^*(X)\}}\right] \le \mathbb{E}_X\left[\mathbb{1}_{\{h_k^*(X) \ne h^*(X)\}}\right] = P_X\left(h_k^*(X) \ne h^*(X)\right), \tag{3} \]

where the inequality uses $|\eta(x) - \frac{1}{2}| \le \frac{1}{2}$.

Let $\Delta := \{x : h_k^*(x) \ne h^*(x)\}$, and let $f$ be the density of $P_X$. Then

\[ P_X(\Delta) = \int_{\Delta} f(x)\, dx \le \|f\|_{\infty}\, \lambda(\Delta), \]

where $\lambda$ denotes volume (Lebesgue measure). Now, classification errors occur only on cells intersecting the Bayes decision boundary (see the shaded cells in Fig. 3), and so

\[ \lambda(\Delta) \le \frac{C k^{d-1}}{k^d} = O\left(\frac{1}{k}\right). \]

Setting $\frac{1}{k} = \sqrt{\frac{k^d}{n}}$, we see that $k_n \asymp n^{1/(d+2)}$ gives the best rate for histograms.

Histograms do not achieve the best rate among all discrimination rules for the box-counting class; this best rate is $\mathbb{E}[R(\hat{h}_n)] - R^* = O(n^{-1/d})$ [4]. In the next set of notes we'll examine a discrimination rule that achieves this rate to within a logarithmic factor.

Exercises

1. With $\mathcal{X} = \mathbb{R}^d$, let $\mathcal{H}_k = \{h(x) = \mathbb{1}_{\{f(x) \ge 0\}} : f \text{ is a polynomial of degree at most } k\}$. Let $\mathcal{P}$ be the set of all joint distributions $P_{XY}$ such that $\inf_{h \in \mathcal{H}_k} R(h) \to R^*$ as $k \to \infty$. Determine an explicit sufficient condition on $k_n$ for the sieve estimator to be consistent for all $P_{XY} \in \mathcal{P}$.

2. Up to this point in our discussion of empirical risk minimization, we have always assumed that an empirical risk minimizer exists. However, it could be that the infimum of the empirical risk is not attained by any classifier. In this case, we can still have a consistent sieve estimator based on an approximate empirical risk minimizer. Thus, let $\tau_k$ be a sequence of positive real numbers decreasing to zero. Define $\hat{h}_{n,k}$ to be any classifier

\[ \hat{h}_{n,k} \in \left\{h \in \mathcal{H}_k : \hat{R}_n(h) \le \inf_{h' \in \mathcal{H}_k} \hat{R}_n(h') + \tau_k\right\}, \]

where $\hat{R}_n$ denotes the empirical risk. Now consider the discrimination rule $\hat{h}_n := \hat{h}_{n,k_n}$. Show that Theorems 2 and 3 still hold for this discrimination rule. State any additional assumptions on the rate of convergence of $\tau_k$ that may be needed.

3. Assume $\mathcal{X} \subseteq \mathbb{R}^d$. A function $f : \mathcal{X} \to \mathbb{R}$ is said to be Lipschitz continuous if there exists a constant $L > 0$ such that for all $x, x' \in \mathcal{X}$, $|f(x) - f(x')| \le L \|x - x'\|$. Show that if one coordinate of the Bayes decision boundary is a Lipschitz continuous function of the others, then (B) holds in the definition of the box-counting class. An example of the stated condition is when the Bayes decision boundary is the graph of, say, $x_d = f(x_1, \ldots, x_{d-1})$, where $f$ is Lipschitz.

4. Establish the following partial converse to Lemma 2: If $Z_n - Z$ is bounded, then $Z_n \xrightarrow{i.p.} Z$ implies $\mathbb{E}\{|Z_n - Z|\} \to 0$. Therefore, in the context of classification, $\mathbb{E}[R(\hat{h}_n)] \to R^*$ is equivalent to weak consistency.

References

[1] L. Devroye, L. Györfi, and G. Lugosi, A Probabilistic Theory of Pattern Recognition, Springer, 1996.

[2] L. Devroye, "Any discrimination rule can have an arbitrarily bad probability of error for finite sample size," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 4, pp. 154-157, 1982.

[3] U. Grenander, Abstract Inference, Wiley, 1981.

[4] C. Scott and R. Nowak, "Minimax-optimal classification with dyadic decision trees," IEEE Trans. Inform. Theory, vol. 52, pp. 1335-1353, 2006.