Empirical Processes: Glivenko Cantelli Theorems

Moulinath Banerjee
June 6, 200

Glivenko-Cantelli classes of functions

The reader is referred to Chapter 1.6 of Wellner's Torgnon notes, Chapter ??? of VDVW and Chapter 8.3 of Kosorok.

First, a theorem using bracketing entropy. Let $(\mathcal{F}, \|\cdot\|)$ be a subset of a normed space of real functions $f : \mathcal{X} \to \mathbb{R}$. Given real functions $l$ and $u$ on $\mathcal{X}$ (but not necessarily in $\mathcal{F}$), the bracket $[l, u]$ is defined as the set of all functions $f \in \mathcal{F}$ satisfying $l \le f \le u$. The functions $l, u$ are assumed to have finite norms. An $\epsilon$-bracket is a bracket with $\|u - l\| \le \epsilon$. The bracketing number $N_{[\,]}(\epsilon, \mathcal{F}, \|\cdot\|)$ is the minimum number of $\epsilon$-brackets with which $\mathcal{F}$ can be covered, and the bracketing entropy is the log of this number.

Theorem 1.1 Let $\mathcal{F}$ be a class of measurable functions with $N_{[\,]}(\epsilon, \mathcal{F}, L_1(P)) < \infty$ for all $\epsilon > 0$. Then $\mathcal{F}$ is $P$-Glivenko-Cantelli, i.e. $\|\mathbb{P}_n - P\|_{\mathcal{F}} \to_{a.s.} 0$.

Brief sketch: For any $\epsilon > 0$ choose finitely many $\epsilon$-brackets $\{[l_i, u_i]\}_{i=1}^m$ covering $\mathcal{F}$ (which can be arranged, by assumption) and argue, by finding a bound on $(\mathbb{P}_n - P) f$ (for each $f$) in terms of the bracket $[l_i, u_i]$ that contains it, that:

$$\sup_{f \in \mathcal{F}} |(\mathbb{P}_n - P) f| \le \Big\{ \max_{1 \le i \le m} (\mathbb{P}_n - P)\, u_i \Big\} \vee \Big\{ \max_{1 \le i \le m} -(\mathbb{P}_n - P)\, l_i \Big\} + \epsilon,$$

and conclude, using the strong law for random variables, that the right side of the above display is almost surely less than $2\epsilon$ eventually.

GC theorem for a continuous distribution function on the line: Let $F$ be a continuous cdf and $P$ the corresponding measure. By uniform continuity of $F$ on the line, for every $\epsilon > 0$, we can find $-\infty = t_0 < t_1 < t_2 < \ldots < t_k < t_{k+1} = \infty$, with $k$ a positive integer, such that the union of the brackets $[1(x \le t_i), 1(x \le t_{i+1})]$ for $i = 0, 1, \ldots, k$ contains $\{1(x \le t) : t \in \mathbb{R}\}$ and the brackets satisfy $F(t_{i+1}) - F(t_i) \le \epsilon$. The above theorem now applies directly. Note that the continuity of the distribution function $F$ was used crucially. The GC theorem on the line holds for arbitrary distribution functions though. This more general result will be seen to be a corollary of a subsequent GC theorem.

The next lemma provides a setting which guarantees a finite bracketing number for appropriate classes of functions and finds a ready application in inference in parametric statistical models.

Lemma 1.1 Suppose that $\mathcal{F} = \{f(\cdot, t) : t \in T\}$, where $T$ is a compact subset of a metric space $(D, d)$ and the functions $f : \mathcal{X} \times T \to \mathbb{R}$ are continuous in $t$ for $P$-almost every $x \in \mathcal{X}$. Assume that the envelope function $F$ defined by $F(x) = \sup_{t \in T} |f(x, t)|$ satisfies $P F < \infty$. Then $N_{[\,]}(\epsilon, \mathcal{F}, L_1(P)) < \infty$ for each $\epsilon > 0$.

The proof is given in Chapter 1.6 of Wellner's Torgnon notes. We skip it but show next how the above result is helpful for deducing consistency in parametric statistical models.

Consistency in parametric models: Let $\{p(x, \theta) : \theta \in \Theta\}$ with $\Theta \subset \mathbb{R}^d$ be a class of parametric densities and consider $X_1, X_2, \ldots$ generated from some $p(x, \theta_0)$. Also assume that $\Theta$ is compact and that $p(x, \theta)$ is continuous in $\theta$ for $P_{\theta_0}$-almost every $x$. Define $M(\theta) = E_{\theta_0}\, l(X_1, \theta)$ where $l(x, \theta) = \log p(x, \theta)$. Finally assume that $\sup_{\theta \in \Theta} |l(x, \theta)| \le B(x)$ for some $B$ with $E_{\theta_0} B(X_1) < \infty$. Then, note that $M(\theta)$ is finite for all $\theta$ and, moreover, continuous on $\Theta$. If $P_{\theta_0}$ denotes the measure corresponding to $\theta_0$, then $M(\theta) = P_{\theta_0}\, l(\cdot, \theta)$. The MLE of $\theta$ is given by $\hat\theta_n = \mathrm{argmax}_{\theta \in \Theta} M_n(\theta)$ where $M_n(\theta) = \mathbb{P}_n\, l(\cdot, \theta)$. Under the assumption that the model is identifiable (i.e. the probability distributions corresponding to different $\theta$'s are different), it is easily seen that $M(\theta)$ is uniquely maximized at $\theta_0$.
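The criterion functions $M_n$ and $M$ defined above can be illustrated numerically. The following is only a minimal sketch under assumptions that are not in the notes: the model is taken to be $N(\theta, 1)$ with $\Theta = [-2, 2]$ and $\theta_0 = 0.5$, the supremum over $\Theta$ is approximated on a finite grid, and the function names (`log_lik`, `population_criterion`, `mle_experiment`) are hypothetical. For this model $M(\theta)$ has a closed form, so the uniform deviation $\sup_\theta |M_n(\theta) - M(\theta)|$ can be computed directly.

```python
import numpy as np

def log_lik(x, theta):
    """l(x, theta) for the assumed N(theta, 1) model; rows index x, columns index theta."""
    return -0.5 * np.log(2.0 * np.pi) - 0.5 * (x[:, None] - theta[None, :]) ** 2

def population_criterion(theta, theta0):
    """Closed form of M(theta) = E_{theta0} l(X, theta) when X ~ N(theta0, 1)."""
    return -0.5 * np.log(2.0 * np.pi) - 0.5 * (1.0 + (theta - theta0) ** 2)

def mle_experiment(n, theta0=0.5, grid_size=201, seed=0):
    """Return (grid MLE, max over the grid of |M_n - M|) for one simulated sample."""
    rng = np.random.default_rng(seed)
    x = rng.normal(loc=theta0, scale=1.0, size=n)
    theta_grid = np.linspace(-2.0, 2.0, grid_size)   # grid on the compact set Theta = [-2, 2]
    m_n = log_lik(x, theta_grid).mean(axis=0)        # M_n(theta) = P_n l(., theta)
    m = population_criterion(theta_grid, theta0)     # M(theta)
    theta_hat = theta_grid[np.argmax(m_n)]           # maximizer of M_n over the grid
    return theta_hat, float(np.max(np.abs(m_n - m)))

if __name__ == "__main__":
    for n in (100, 1000, 10000):
        theta_hat, sup_dev = mle_experiment(n)
        print(f"n={n:6d}  theta_hat={theta_hat:+.3f}  sup|M_n - M|={sup_dev:.4f}")
```

In runs of this sketch one would expect the reported uniform deviation to shrink and the grid maximizer to approach $\theta_0$ as $n$ grows, which is the behaviour the consistency argument formalizes.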
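Finally, note that $\theta_0$ is a well-separated maximizer in the sense that for any $\eta > 0$, $\sup_{\theta \in \Theta \cap B_\eta(\theta_0)^c} M(\theta) < M(\theta_0)$, with $B_\eta(\theta_0)$ being the open ball of radius $\eta$ centered at $\theta_0$. (This follows because $M$ is continuous, $\Theta \cap B_\eta(\theta_0)^c$ is compact, and the maximizer is unique.) Let $\psi(\eta) = M(\theta_0) - \sup_{\theta \in \Theta \cap B_\eta(\theta_0)^c} M(\theta)$. Then $\psi(\eta) > 0$.

Our goal is to show that $\hat\theta_n \to_{P_{\theta_0}} \theta_0$. So, given $\epsilon > 0$, consider $P_{\theta_0}(\hat\theta_n \in B_\epsilon(\theta_0)^c)$. Now,

$$\hat\theta_n \in B_\epsilon(\theta_0)^c \;\Rightarrow\; M(\hat\theta_n) \le \sup_{\theta \in \Theta \cap B_\epsilon(\theta_0)^c} M(\theta)$$
$$\Leftrightarrow\; M(\hat\theta_n) - M(\theta_0) \le -\psi(\epsilon)$$
$$\Rightarrow\; M(\hat\theta_n) - M(\theta_0) + M_n(\theta_0) - M_n(\hat\theta_n) \le -\psi(\epsilon)$$
$$\Rightarrow\; 2 \sup_{\theta \in \Theta} |M_n(\theta) - M(\theta)| \ge \psi(\epsilon),$$

where the third line uses $M_n(\theta_0) - M_n(\hat\theta_n) \le 0$, which holds because $\hat\theta_n$ maximizes $M_n$. Thus,

$$P_{\theta_0}(\hat\theta_n \in B_\epsilon(\theta_0)^c) \le P_{\theta_0}\Big(\sup_{\theta \in \Theta} |M_n(\theta) - M(\theta)| \ge \psi(\epsilon)/2\Big) = P_{\theta_0}\Big(\sup_{\theta \in \Theta} |(\mathbb{P}_n - P_{\theta_0})\, l(\cdot, \theta)| \ge \psi(\epsilon)/2\Big),$$

and this goes to 0, owing to the fact that $\sup_{\theta \in \Theta} |(\mathbb{P}_n - P_{\theta_0})\, l(\cdot, \theta)| \to_{a.s.} 0$ (since, under our assumptions on the parametric model, we can conclude from Lemma 1.1 that $N_{[\,]}(\eta, \{l(\cdot, \theta) : \theta \in \Theta\}, L_1(P_{\theta_0})) < \infty$ for every $\eta > 0$ and then invoke Theorem 1.1).

We next state (and partly prove) a result that provides necessary and sufficient conditions for a class of functions $\mathcal{F}$ to be Glivenko-Cantelli in terms of covering numbers.

Theorem 1.2 Let $\mathcal{F}$ be a $P$-measurable class of measurable functions bounded in $L_1(P)$. Then $\mathcal{F}$ is $P$-Glivenko-Cantelli if and only if:
(a) $P F < \infty$,
(b) $\lim_{n \to \infty} E \log N(\epsilon, \mathcal{F}_M, L_2(\mathbb{P}_n)) / n = 0$, for all $M < \infty$ and $\epsilon > 0$.
Here $\mathcal{F}_M = \{f\, 1(F \le M) : f \in \mathcal{F}\}$.

Discussion: We will only consider the "if" part of the proof. This will be provided later. First, we note that $L_2$ can be replaced by any $L_r$, $r \ge 1$. At least for the "if" part, this will be obvious from the proof. Secondly, for the "if" part, the second condition can be replaced by the weaker condition that $\log N(\epsilon, \mathcal{F}_M, L_2(\mathbb{P}_n)) / n \to_P 0$. Thirdly, since $N(\epsilon, \mathcal{F}_M, L_2(\mathbb{P}_n)) \le N(\epsilon, \mathcal{F}, L_2(\mathbb{P}_n))$ for all $M > 0$, condition (b) in the theorem can be replaced by the alternative condition that $E\big(\log N(\epsilon, \mathcal{F}, L_2(\mathbb{P}_n)) / n\big) \to 0$ (or a condition involving convergence in probability for the "if" part). Finally, if $\mathcal{F}$ has a measurable and integrable envelope $F$, then $\mathbb{P}_n F$ is finite almost surely (simple strong law) and it is readily argued that:

$$\forall \epsilon > 0,\ \log N(\epsilon, \mathcal{F}, L_1(\mathbb{P}_n)) = o_p(n) \;\Leftrightarrow\; \forall \epsilon > 0,\ \log N(\epsilon \|F\|_{\mathbb{P}_n, 1}, \mathcal{F}, L_1(\mathbb{P}_n)) = o_p(n).$$

To see this quickly, use the characterization of in-probability convergence in terms of almost sure convergence along subsequences.

It turns out that there is a large class of functions, called VC classes of functions, for which the quantity $\log N(\epsilon \|F\|_{\mathbb{P}_n, 1}, \mathcal{F}, L_1(\mathbb{P}_n))$ is bounded, uniformly in $n$ and $\omega$; in fact, for such a class $\mathcal{F}$ of functions:

$$\sup_Q N(\epsilon \|F\|_{Q, r}, \mathcal{F}, L_r(Q)) \le K \left(\frac{1}{\epsilon}\right)^{rM},$$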

for an integer $M$ that depends solely on $\mathcal{F}$, a constant $K$ that depends only on $\mathcal{F}$, and the supremum is taken over all probability measures $Q$ for which $\|F\|_{Q, r} > 0$. Thus, a VC class of functions with integrable envelope $F$ is easily Glivenko-Cantelli for any probability measure on the corresponding sample space. The fortunate thing is that functions formed by combining VC classes of functions via various mathematical operations often satisfy entropy bounds similar to the one in the above display, so that such (more) complex classes continue to remain Glivenko-Cantelli under integrability hypotheses.

As a special case, consider $\mathcal{F} = \{f_t(x) = 1_{(-\infty, t]}(x) : t \in \mathbb{R}^d\}$. Thus $f_t(x)$ is simply the indicator of the infinite rectangle to the south-west of the point $t$. For all probability measures $Q$ on $d$-dimensional Euclidean space:

$$N(\epsilon, \mathcal{F}, L_1(Q)) \le M_d \left(\frac{K}{\epsilon}\right)^d,$$

which immediately implies the classical Glivenko-Cantelli theorem in $\mathbb{R}^d$.

Proof of Theorem 1.2: We prove the "if" part. By $P$-measurability of the class $\mathcal{F}$ and Corollary 1.1 of the symmetrization notes applied with $\Phi$ being the identity,

$$E \|\mathbb{P}_n - P\|_{\mathcal{F}} \le 2\, E \Big\| \frac{1}{n} \sum_{i=1}^n \epsilon_i f(X_i) \Big\|_{\mathcal{F}} = 2\, E_X E_\epsilon \Big\| \frac{1}{n} \sum_{i=1}^n \epsilon_i f(X_i) \Big\|_{\mathcal{F}} \le 2\, E_X E_\epsilon \Big\| \frac{1}{n} \sum_{i=1}^n \epsilon_i f(X_i) \Big\|_{\mathcal{F}_M} + 2\, P\big(F\, 1(F > M)\big).$$

Given any $\epsilon > 0$, an appropriate choice of $M$ ensures that the second term is no larger than $\epsilon$. It suffices to show that for this choice of $M$, the first term is eventually smaller than $\epsilon$. To this end, first fix $X_1, X_2, \ldots, X_n$. An $\epsilon$-net $\mathcal{G}$ (assumed to be of minimal size) over $\mathcal{F}_M$ in $L_2(\mathbb{P}_n)$ is also an $\epsilon$-net in $L_1(\mathbb{P}_n)$. It follows that:

$$E_\epsilon \Big\| \frac{1}{n} \sum_{i=1}^n \epsilon_i f(X_i) \Big\|_{\mathcal{F}_M} \le E_\epsilon \Big\| \frac{1}{n} \sum_{i=1}^n \epsilon_i f(X_i) \Big\|_{\mathcal{G}} + \epsilon.$$

Before going further, note that each $g \in \mathcal{G}$ can be assumed to be uniformly bounded (in absolute value) by $M$. This can be achieved since each $f$ in $\mathcal{F}_M$ is bounded (in absolute value) by $M$. So, given an arbitrary $\epsilon$-net $\mathcal{G}$, perturb each $g$ to a $\tilde g$ which coincides with $g$ whenever $|g| \le M$ and on the complement of this set equals $\mathrm{sign}(g)\, M$. These perturbed functions continue to constitute an $\epsilon$-net over $\mathcal{F}_M$.

Consider the first term on the right of the above display. Since the $L_1$ norm is bounded (up to a constant) by the $\psi_1$ Orlicz norm, which is bounded up to a constant by the $\psi_2$ Orlicz norm, we can use Lemma 1.1 in the chaining notes to bound the first term, up to a constant, by:

$$B = \sqrt{1 + \log N(\epsilon, \mathcal{F}_M, L_2(\mathbb{P}_n))}\; \max_{f \in \mathcal{G}} \Big\| \frac{1}{n} \sum_{i=1}^n \epsilon_i f(X_i) \Big\|_{\psi_2 | X}.$$

As a consequence of Hoeffding's inequality (see the first page of the symmetrization notes):

$$\Big\| \frac{1}{n} \sum_{i=1}^n \epsilon_i f(X_i) \Big\|_{\psi_2 | X} \le \sqrt{\frac{6}{n}}\, (\mathbb{P}_n f^2)^{1/2} \le \sqrt{\frac{6}{n}}\, M,$$

and thus

$$B \le \sqrt{\frac{6}{n}}\, M \sqrt{1 + \log N(\epsilon, \mathcal{F}_M, L_2(\mathbb{P}_n))} \to_P 0,$$

by Condition (b) of the theorem. Conclude that:

$$E_\epsilon \Big\| \frac{1}{n} \sum_{i=1}^n \epsilon_i f(X_i) \Big\|_{\mathcal{F}_M} \to_P 0.$$

Since the above random variable is bounded, conclude that:

$$E_X E_\epsilon \Big\| \frac{1}{n} \sum_{i=1}^n \epsilon_i f(X_i) \Big\|_{\mathcal{F}_M} \to 0.$$

It follows that $E\big(\|\mathbb{P}_n - P\|_{\mathcal{F}}\big) \to 0$. Our goal is however to show almost sure convergence. This is deduced by a submartingale argument, a simplified version of which is presented at the end of these notes. The idea here is to show that $\|\mathbb{P}_n - P\|_{\mathcal{F}}$ is a reverse submartingale with respect to a (decreasing) filtration that converges to the symmetric sigma-field and therefore has an almost sure limit. This almost sure limit, being measurable with respect to the symmetric sigma-field, must be a non-negative constant almost surely. The fact that the expectation converges to 0 then forces this constant to be 0. The full undiluted version of the argument is presented in Lemma 2.4.5 of VDVW.

Uniform and universal GC classes: If $\mathcal{F}$ is $P$-Glivenko-Cantelli for all probability measures $P$ on $(\mathcal{X}, \mathcal{A})$, it is called a universal Glivenko-Cantelli class. For example, VC classes of functions (that appear in the discussion preceding the proof of Theorem 1.2) are universal GC classes provided they are uniformly bounded (so that there is an integrable envelope for every probability measure $P$). A stronger GC property can be formulated in terms of the uniformity of the convergence of the empirical measure to the true measure over all probability measures on $(\mathcal{X}, \mathcal{A})$. Say that $\mathcal{F}$ is a strong uniform GC class if, for all $\epsilon > 0$,

$$\sup_{P \in \mathcal{P}(\mathcal{X}, \mathcal{A})} \mathrm{Pr}_P \Big( \sup_{m \ge n} \|\mathbb{P}_m - P\|_{\mathcal{F}} > \epsilon \Big) \to 0.$$

Note that the almost sure convergence of $\|\mathbb{P}_n - P\|_{\mathcal{F}}$ to 0 for a fixed $P$ is equivalent to the condition: for every $\epsilon > 0$,

$$\mathrm{Pr}_P \Big( \sup_{m \ge n} \|\mathbb{P}_m - P\|_{\mathcal{F}} > \epsilon \Big) \to 0.$$

Uniform Glivenko-Cantelli classes are sometimes useful in statistical applications, for example in situations where the parent distribution from which a statistical model is generated is allowed to vary with the sample size, or situations where there are two indices $m, n$ that go to infinity, with $n$ being the sample size and $m$ an index that labels the statistical model. Consistency arguments for such situations can be constructed via the notion of uniform GC classes of functions. A compelling application is presented in the paper by Sen, Banerjee and Michailidis (200) (available on Banerjee's webpage), where the problem is one of estimating the minimum effective dose in a dose-response setting (the largest dose beyond which the response is positive) and $n$ is the number of distinct doses, with each dose administered to a distinct set of $m$ individuals. Consistency of a least squares estimate of the minimum effective dose is established as $m, n \to \infty$, and the notion of uniform GC classes is heavily used. Section 2.8.1 of VDVW deals with these ideas; see Theorem 2.8.1, which can be used to deduce that VC classes of functions are uniformly Glivenko-Cantelli under appropriate integrability restrictions.

GC preservation: Preservation of GC properties is important from the perspective of applications.
Often, in a statistical application, it becomes necessary to show the GC property for a class of functions with complex functional forms to which tailor-made GC theorems are difficult to apply. However, if such classes can be built up from simple GC classes of functions via simple mathematical operations, the GC property often translates to the complex classes of interest. Section 1.6 of Wellner's notes has a discussion of preservation properties, as does Section 9.3 of Kosorok.

Some discussion from Kosorok: An example: Suppose that $\mathcal{X} = \mathbb{R}$ and that $X \sim P$.

(i) For $0 < M < \infty$ and $a \in \mathbb{R}$, let $f(x, t) = |x - t|$ and $\mathcal{F} = \mathcal{F}_{a, M} = \{f(x, t) : |t - a| \le M\}$. Show that if $E(|X|) < \infty$, then $N_{[\,]}(\epsilon, \mathcal{F}, L_1(P)) < \infty$.

Derivation: Chop the interval $[a - M, a + M]$ into an evenly spaced (finite) grid of points $\{s_i\}$ including the end-points such that successive points on the grid are separated by a distance no larger than $\epsilon' < \epsilon$. Construct a set of brackets $\{[l_j, u_j]\}$ where

$$l_j(x) = |x - s_j|\, 1(x \le s_j) + |x - s_{j+1}|\, 1(x \ge s_{j+1}) \quad \text{and} \quad u_j(x) = |x - s_j| \vee |x - s_{j+1}|.$$

Each $l_j, u_j$ has finite norm since $E_P(|X|) < \infty$. A simple picture should now convince you that $u_j - l_j$ is non-negative and no larger than $\epsilon$ pointwise and hence in the $L_1(P)$ norm. Every point $t$ in $[a - M, a + M]$ lies in some $[s_j, s_{j+1}]$ and the function $f(x, t)$ then belongs to the bracket $[l_j, u_j]$, showing that $N_{[\,]}(\epsilon, \mathcal{F}, L_1(P)) < \infty$.

(ii) Same as before but let $f(x, t) = |x - t| - |x - a|$. Show that $N_{[\,]}(\epsilon, \mathcal{F}, L_1(P)) < \infty$ but without the assumption that $E_P(|X|) < \infty$.

Derivation: Take the $l_j, u_j$'s constructed above and define $\tilde u_j = u_j - |x - a|$ and $\tilde l_j = l_j - |x - a|$. Consider $\{[\tilde l_j, \tilde u_j]\}$. It is easy to show, using the fact that $\big| |x - t| - |x - t'| \big| \le |t - t'|$, that each $\tilde u_j$ and each $\tilde l_j$ is bounded and therefore integrable, irrespective of whether $E(|X|) < \infty$. Since $\tilde u_j - \tilde l_j = u_j - l_j$, these are again $\epsilon$-brackets. If $t \in [s_j, s_{j+1}]$, then $f(x, t)$ lies in the bracket $[\tilde l_j, \tilde u_j]$.
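The bracket construction in parts (i) and (ii) can be checked numerically in place of the "simple picture". The sketch below is illustrative only: the values $a = 0$, $M = 1$, $\epsilon = 0.25$, the choice of grid spacing $\epsilon' \le \epsilon/2$, the finite evaluation grid for $x$, and the function names (`make_grid`, `lower`, `upper`, `check_brackets`) are all assumptions made for the example, not part of the notes. It verifies on sampled values of $t$ that $l_j \le f(\cdot, t) \le u_j$, that $u_j - l_j \le \epsilon$ pointwise, and that the shifted brackets of part (ii) are bounded by $M$.

```python
import numpy as np

def make_grid(a, M, eps):
    """Evenly spaced grid on [a - M, a + M] with spacing at most eps / 2 (assumed eps')."""
    m = int(np.ceil(2.0 * M / (eps / 2.0)))
    return np.linspace(a - M, a + M, m + 1)

def lower(x, s_lo, s_hi):
    """l_j(x) = |x - s_j| 1(x <= s_j) + |x - s_{j+1}| 1(x >= s_{j+1})."""
    return np.abs(x - s_lo) * (x <= s_lo) + np.abs(x - s_hi) * (x >= s_hi)

def upper(x, s_lo, s_hi):
    """u_j(x) = max(|x - s_j|, |x - s_{j+1}|)."""
    return np.maximum(np.abs(x - s_lo), np.abs(x - s_hi))

def check_brackets(a=0.0, M=1.0, eps=0.25, seed=0):
    """Return (number of brackets, max of u_j - l_j, sandwich check, boundedness check)."""
    rng = np.random.default_rng(seed)
    x = np.linspace(a - 5.0 * M, a + 5.0 * M, 2001)   # finite evaluation grid for x
    grid = make_grid(a, M, eps)
    tol = 1e-9                                        # floating-point slack
    max_width, sandwiched, bounded = 0.0, True, True
    for t in rng.uniform(a - M, a + M, size=200):
        # index j with s_j <= t <= s_{j+1}
        j = min(int(np.searchsorted(grid, t, side="right")) - 1, len(grid) - 2)
        l, u = lower(x, grid[j], grid[j + 1]), upper(x, grid[j], grid[j + 1])
        f = np.abs(x - t)                             # part (i): f(x, t) = |x - t|
        sandwiched = sandwiched and bool(np.all(l <= f + tol) and np.all(f <= u + tol))
        max_width = max(max_width, float(np.max(u - l)))
        shift = np.abs(x - a)                         # part (ii): subtract |x - a|
        bounded = bounded and bool(np.max(np.abs(u - shift)) <= M + tol
                                   and np.max(np.abs(l - shift)) <= M + tol)
    return len(grid) - 1, max_width, sandwiched, bounded

if __name__ == "__main__":
    print(check_brackets())
```

The number of brackets returned is the number of grid cells, which gives a finite upper bound on the bracketing number for the chosen $\epsilon$; the check itself is only over finitely many $x$ and $t$, so it illustrates the argument rather than proving it.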