Empirical Processes: Glivenko Cantelli Theorems

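Moulinath Banerjee
June 6, 200

1 Glivenko-Cantelli classes of functions

The reader is referred to Chapter 1.6 of Wellner's Torgnon notes, Chapter ??? of VDVW and Chapter 8.3 of Kosorok.

First, a theorem using bracketing entropy. Let $(\mathcal{F}, \|\cdot\|)$ be a subset of a normed space of real functions $f : \mathcal{X} \to \mathbb{R}$. Given real functions $l$ and $u$ on $\mathcal{X}$ (but not necessarily in $\mathcal{F}$), the bracket $[l, u]$ is defined as the set of all functions $f \in \mathcal{F}$ satisfying $l \le f \le u$. The functions $l, u$ are assumed to have finite norms. An $\epsilon$-bracket is a bracket with $\|u - l\| \le \epsilon$. The bracketing number $N_{[\,]}(\epsilon, \mathcal{F}, \|\cdot\|)$ is the minimum number of $\epsilon$-brackets with which $\mathcal{F}$ can be covered, and the bracketing entropy is the log of this number.

Theorem 1.1 Let $\mathcal{F}$ be a class of measurable functions with $N_{[\,]}(\epsilon, \mathcal{F}, L_1(P)) < \infty$ for all $\epsilon > 0$. Then $\mathcal{F}$ is $P$-Glivenko-Cantelli, i.e. $\|P_n - P\|_{\mathcal{F}} \to_{a.s.} 0$, where $\|P_n - P\|_{\mathcal{F}} = \sup_{f \in \mathcal{F}} |(P_n - P) f|$.

Brief sketch: For any $\epsilon > 0$ choose finitely many $\epsilon$-brackets $\{[l_i, u_i]\}_{i=1}^m$ covering $\mathcal{F}$ (which can be arranged, by assumption) and argue, by finding a bound on $(P_n - P) f$ (for each $f$) in terms of the $[l_i, u_i]$ that contains it, that:

$$\sup_{f \in \mathcal{F}} |(P_n - P) f| \le \left\{ \max_{i \le m} (P_n - P) u_i \right\} \vee \left\{ \max_{i \le m} -(P_n - P) l_i \right\} + \epsilon .$$

Indeed, if $l_i \le f \le u_i$ then $(P_n - P) f \le (P_n - P) u_i + P(u_i - f) \le (P_n - P) u_i + \epsilon$, and similarly $(P_n - P) f \ge (P_n - P) l_i - \epsilon$. Conclude, using the strong law for random variables, that the right side of the above display is almost surely less than $2\epsilon$ eventually.

GC theorem for a continuous distribution function on the line: Let $F$ be a continuous cdf and $P$ the corresponding measure. By uniform continuity of $F$ on the line, for every $\epsilon > 0$,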
we can find $-\infty = t_0 < t_1 < t_2 < \ldots < t_k < t_{k+1} = \infty$, with $k$ a positive integer, such that the union of the brackets $[1(x \le t_i), 1(x \le t_{i+1})]$ for $i = 0, 1, \ldots, k$ contains $\{1(x \le t) : t \in \mathbb{R}\}$ and the brackets satisfy $F(t_{i+1}) - F(t_i) \le \epsilon$. The above theorem now applies directly. Note that the continuity of the distribution function $F$ was used crucially. The GC theorem on the line holds for arbitrary distribution functions though. This more general result will be seen to be a corollary of a subsequent GC theorem.

The next lemma provides a setting which guarantees a finite bracketing number for appropriate classes of functions and finds a ready application in inference in parametric statistical models.

Lemma 1.1 Suppose that $\mathcal{F} = \{f(\cdot, t) : t \in T\}$, where $T$ is a compact subset of a metric space $(D, d)$ and the functions $f : \mathcal{X} \times T \to \mathbb{R}$ are continuous in $t$ for $P$-almost every $x \in \mathcal{X}$. Assume that the envelope function $F$ defined by $F(x) = \sup_{t \in T} |f(x, t)|$ satisfies $P F < \infty$. Then $N_{[\,]}(\epsilon, \mathcal{F}, L_1(P)) < \infty$ for each $\epsilon > 0$.

The proof is given in Chapter 1.6 of Wellner's Torgnon notes. We skip it but show next how the above result is helpful for deducing consistency in parametric statistical models.

Consistency in parametric models: Let $\{p(x, \theta) : \theta \in \Theta\}$ with $\Theta \subset \mathbb{R}^d$ be a class of parametric densities and consider $X_1, X_2, \ldots$ generated from some $p(x, \theta_0)$. Also assume that $\Theta$ is compact and that $p(x, \theta)$ is continuous in $\theta$ for $P_{\theta_0}$-almost every $x$. Define $M(\theta) = E_{\theta_0}\, l(X_1, \theta)$ where $l(x, \theta) = \log p(x, \theta)$. Finally assume that $\sup_{\theta \in \Theta} |l(x, \theta)| \le B(x)$ for some $B$ with $E_{\theta_0} B(X_1) < \infty$. Then, note that $M(\theta)$ is finite for all $\theta$ and, moreover, continuous on $\Theta$. If $P_{\theta_0}$ denotes the measure corresponding to $\theta_0$, then $M(\theta) = P_{\theta_0}\, l(\cdot, \theta)$. The MLE of $\theta$ is given by $\hat\theta_n = \mathrm{argmax}_{\theta \in \Theta} M_n(\theta)$ where $M_n(\theta) = P_n\, l(\cdot, \theta)$. Under the assumption that the model is identifiable (i.e. the probability distributions corresponding to different $\theta$'s are different), it is easily seen that $M(\theta)$ is uniquely maximized at $\theta_0$.
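To make the objects $M$ and $M_n$ concrete, here is a minimal sketch under assumptions that are not in the notes: the model is taken to be $N(\theta, 1)$ with $\Theta = [-3, 3]$ and $\theta_0 = 0$, so that $M(\theta)$ has a closed form, and the supremum over $\Theta$ is approximated by a maximum over a finite grid. The function names (`run`, `sup_deviation_and_mle`, and so on) are hypothetical, chosen for illustration only.

```python
import math
import random


def log_density(x, theta):
    """l(x, theta) = log p(x, theta) for the assumed N(theta, 1) model."""
    return -0.5 * math.log(2.0 * math.pi) - 0.5 * (x - theta) ** 2


def population_criterion(theta, theta0):
    """M(theta) = E_{theta0} l(X, theta); closed form since E(X - theta)^2 = 1 + (theta - theta0)^2."""
    return -0.5 * math.log(2.0 * math.pi) - 0.5 * (1.0 + (theta - theta0) ** 2)


def empirical_criterion(sample, theta):
    """M_n(theta) = P_n l(., theta), the average log-likelihood."""
    return sum(log_density(x, theta) for x in sample) / len(sample)


def sup_deviation_and_mle(sample, theta0, grid):
    """Return (max over the grid of |M_n - M|, grid maximizer of M_n)."""
    values = [empirical_criterion(sample, t) for t in grid]
    sup_dev = max(abs(v - population_criterion(t, theta0)) for v, t in zip(values, grid))
    theta_hat = grid[max(range(len(grid)), key=values.__getitem__)]
    return sup_dev, theta_hat


def run(n, theta0=0.0, half_width=3.0, grid_size=61, seed=0):
    """Simulate n draws from N(theta0, 1) and evaluate both criteria on a grid over Theta."""
    rng = random.Random(seed)
    sample = [rng.gauss(theta0, 1.0) for _ in range(n)]
    grid = [-half_width + 2.0 * half_width * k / (grid_size - 1) for k in range(grid_size)]
    return sup_deviation_and_mle(sample, theta0, grid)


if __name__ == "__main__":
    for n in (50, 500, 5000):
        sup_dev, theta_hat = run(n)
        print(f"n={n:5d}  grid sup|M_n - M| = {sup_dev:.4f}  theta_hat = {theta_hat:+.2f}")
```

In this assumed model the envelope condition holds with $B(x) = \frac{1}{2}\log(2\pi) + \frac{1}{2}(|x| + 3)^2$, which is integrable, so the setting matches the hypotheses above. The grid maximum only approximates the supremum over $\Theta$; it is meant as a numerical illustration, not a proof.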
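Finally, note that $\theta_0$ is a well-separated maximizer in the sense that for any $\eta > 0$, $\sup_{\theta \in \Theta \cap B_\eta(\theta_0)^c} M(\theta) < M(\theta_0)$, with $B_\eta(\theta_0)$ being the open ball of radius $\eta$ centered at $\theta_0$; this follows from the continuity of $M$ and the compactness of $\Theta \cap B_\eta(\theta_0)^c$. Let $\psi(\eta) = M(\theta_0) - \sup_{\theta \in \Theta \cap B_\eta(\theta_0)^c} M(\theta)$. Then $\psi(\eta) > 0$.

Our goal is to show that $\hat\theta_n \to_{P_{\theta_0}} \theta_0$. So, given $\epsilon > 0$, consider $P(\hat\theta_n \in B_\epsilon(\theta_0)^c)$. Now,

$$\begin{aligned}
\hat\theta_n \in B_\epsilon(\theta_0)^c
&\Rightarrow M(\hat\theta_n) \le \sup_{\theta \in \Theta \cap B_\epsilon(\theta_0)^c} M(\theta) \\
&\Rightarrow M(\hat\theta_n) - M(\theta_0) \le -\psi(\epsilon) \\
&\Rightarrow M(\hat\theta_n) - M(\theta_0) + M_n(\theta_0) - M_n(\hat\theta_n) \le -\psi(\epsilon),
\end{aligned}$$

where the last step uses $M_n(\theta_0) - M_n(\hat\theta_n) \le 0$, which holds because $\hat\theta_n$ maximizes $M_n$.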
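On this event, $2 \sup_{\theta \in \Theta} |M_n(\theta) - M(\theta)| \ge \psi(\epsilon)$, because $M(\hat\theta_n) - M(\theta_0) + M_n(\theta_0) - M_n(\hat\theta_n) \ge -2 \sup_{\theta \in \Theta} |M_n(\theta) - M(\theta)|$. Thus,

$$P(\hat\theta_n \in B_\epsilon(\theta_0)^c) \le P\left(\sup_{\theta \in \Theta} |M_n(\theta) - M(\theta)| \ge \psi(\epsilon)/2\right) = P\left(\sup_{\theta \in \Theta} |(P_n - P_{\theta_0})\, l(\cdot, \theta)| \ge \psi(\epsilon)/2\right),$$

and this goes to 0, owing to the fact that $\sup_{\theta \in \Theta} |(P_n - P_{\theta_0})\, l(\cdot, \theta)| \to_{a.s.} 0$ (since under our assumptions on the parametric model, we can conclude from Lemma 1.1 that $N_{[\,]}(\eta, \{l(\cdot, \theta) : \theta \in \Theta\}, L_1(P_{\theta_0})) < \infty$ for every $\eta > 0$ and then invoke Theorem 1.1).

We next state (and partly prove) a result that provides necessary and sufficient conditions for a class of functions $\mathcal{F}$ to be Glivenko-Cantelli in terms of covering numbers.

Theorem 1.2 Let $\mathcal{F}$ be a $P$-measurable class of measurable functions bounded in $L_1(P)$. Then $\mathcal{F}$ is $P$-Glivenko-Cantelli if and only if:
(a) $P F < \infty$,
(b) $\lim_{n \to \infty} E \log N(\epsilon, \mathcal{F}_M, L_2(P_n)) / n = 0$, for all $M < \infty$ and $\epsilon > 0$.
Here $F$ is the envelope function of the class and $\mathcal{F}_M = \{f\, 1(F \le M) : f \in \mathcal{F}\}$.

Discussion: We will only consider the "if" part of the proof. This will be provided later. First, we note that $L_2$ can be replaced by any $L_r$, $r \ge 1$. At least for the "if" part, this will be obvious from the proof. Secondly, for the "if" part, the second condition can be replaced by the weaker condition that $\log N(\epsilon, \mathcal{F}_M, L_2(P_n)) / n \to_P 0$. Thirdly, since $N(\epsilon, \mathcal{F}_M, L_2(P_n)) \le N(\epsilon, \mathcal{F}, L_2(P_n))$ for all $M > 0$, condition (b) in the theorem can be replaced by the alternative condition that $E(\log N(\epsilon, \mathcal{F}, L_2(P_n)) / n) \to 0$ (or a condition involving convergence in probability for the "if" part). Finally, if $\mathcal{F}$ has a measurable and integrable envelope, $F$, then $P_n F$ is finite almost surely (simple strong law: $P_n F \to P F < \infty$) and it is readily argued that:

$$\forall \epsilon > 0,\ \log N(\epsilon, \mathcal{F}, L_1(P_n)) = o_p(n) \iff \forall \epsilon > 0,\ \log N(\epsilon \|F\|_{P_n, 1}, \mathcal{F}, L_1(P_n)) = o_p(n).$$

To see this quickly, use the characterization of in-probability convergence in terms of almost sure convergence along subsequences.

It turns out that there is a large class of functions, called VC classes of functions, for which the quantity $\log N(\epsilon \|F\|_{P_n, 1}, \mathcal{F}, L_1(P_n))$ is bounded, uniformly in $n$ and $\omega$; in fact, for such a class $\mathcal{F}$ of functions and every $r \ge 1$:

$$\sup_Q N(\epsilon \|F\|_{Q, r}, \mathcal{F}, L_r(Q)) \le K \left(\frac{1}{\epsilon}\right)^{rM},$$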
for an integer $M$ that depends solely on $\mathcal{F}$, a constant $K$ that depends only on $\mathcal{F}$, and the supremum is taken over all probability measures $Q$ for which $\|F\|_{Q, r} > 0$. Thus, a VC class of functions with integrable envelope $F$ is easily Glivenko-Cantelli for any probability measure on the corresponding sample space. The fortunate thing is that functions formed by combining VC classes of functions via various mathematical operations often satisfy similar entropy bounds as in the above display, so that such (more) complex classes continue to remain Glivenko-Cantelli under integrability hypotheses.

As a special case, consider $\mathcal{F} = \{f_t(x) = 1_{(-\infty, t]}(x) : t \in \mathbb{R}^d\}$. Thus $f_t(x)$ is simply the indicator of the infinite rectangle to the south-west of the point $t$. For all probability measures $Q$ on $d$-dimensional Euclidean space:

$$N(\epsilon, \mathcal{F}, L_1(Q)) \le M_d \left(\frac{K}{\epsilon}\right)^d,$$

which immediately implies the classical Glivenko-Cantelli theorem in $\mathbb{R}^d$.

Proof of Theorem 1.2: We prove the "if" part. By $P$-measurability of the class $\mathcal{F}$ and Corollary 1.1 of the symmetrization notes applied with $\Phi$ being the identity,

$$\begin{aligned}
E \|P_n - P\|_{\mathcal{F}}
&\le 2\, E \left\| \frac{1}{n} \sum_{i=1}^n \epsilon_i f(X_i) \right\|_{\mathcal{F}}
= 2\, E_X E_\epsilon \left\| \frac{1}{n} \sum_{i=1}^n \epsilon_i f(X_i) \right\|_{\mathcal{F}} \\
&\le 2\, E_X E_\epsilon \left\| \frac{1}{n} \sum_{i=1}^n \epsilon_i f(X_i) \right\|_{\mathcal{F}_M} + 2\, P(F\, 1(F > M)).
\end{aligned}$$

Given any $\epsilon > 0$, an appropriate choice of $M$ ensures that the second term is no larger than $\epsilon$. It suffices to show that for this choice of $M$, the first term is eventually smaller than $\epsilon$. To this end, first fix $X_1, X_2, \ldots, X_n$. An $\epsilon$-net $\mathcal{G}$ (assumed to be of minimal size) over $\mathcal{F}_M$ in $L_2(P_n)$ is also an $\epsilon$-net in $L_1(P_n)$. It follows that:

$$E_\epsilon \left\| \frac{1}{n} \sum_{i=1}^n \epsilon_i f(X_i) \right\|_{\mathcal{F}_M} \le E_\epsilon \left\| \frac{1}{n} \sum_{i=1}^n \epsilon_i f(X_i) \right\|_{\mathcal{G}} + \epsilon .$$

Before going further, note that each $g \in \mathcal{G}$ can be assumed to be uniformly bounded (in absolute value) by $M$. This can be achieved since each $f$ in $\mathcal{F}_M$ is bounded (in absolute value) by $M$. So,
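given an arbitrary $\epsilon$-net $\mathcal{G}$, perturb each $g$ to a $\tilde{g}$ which coincides with $g$ whenever $|g| \le M$ and on the complement of this set equals $\mathrm{sign}(g)\, M$. These perturbed functions continue to constitute an $\epsilon$-net over $\mathcal{F}_M$.

Consider the first term on the right of the above display. Since the $L_1$ norm is bounded (up to a constant) by the $\psi_1$ Orlicz norm, which is bounded up to a constant by the $\psi_2$ Orlicz norm, we can use Lemma 1.1 in the chaining notes to bound the first term, up to a constant, by:

$$B_n = \sqrt{1 + \log N(\epsilon, \mathcal{F}_M, L_2(P_n))}\ \max_{f \in \mathcal{G}} \left\| \frac{1}{n} \sum_{i=1}^n \epsilon_i f(X_i) \right\|_{\psi_2 | X} .$$

As a consequence of Hoeffding's inequality (see the first page of the symmetrization notes):

$$\left\| \frac{1}{n} \sum_{i=1}^n \epsilon_i f(X_i) \right\|_{\psi_2 | X} \le \sqrt{\frac{6}{n}}\, (P_n f^2)^{1/2} \le \sqrt{\frac{6}{n}}\, M,$$

and thus

$$B_n \le \sqrt{6}\, M \sqrt{\frac{1 + \log N(\epsilon, \mathcal{F}_M, L_2(P_n))}{n}} \to_P 0,$$

by Condition (b) of the theorem. Conclude that:

$$E_\epsilon \left\| \frac{1}{n} \sum_{i=1}^n \epsilon_i f(X_i) \right\|_{\mathcal{G}} \to_P 0 .$$

Since the above random variable is bounded (by $M$), conclude that:

$$E_X E_\epsilon \left\| \frac{1}{n} \sum_{i=1}^n \epsilon_i f(X_i) \right\|_{\mathcal{G}} \to 0 .$$

Combining this with the two preceding bounds, and since $\epsilon > 0$ was arbitrary, it follows that $E \|P_n - P\|_{\mathcal{F}} \to 0$.

Our goal is however to show almost sure convergence. This is deduced by a submartingale argument, a simplified version of which is presented at the end of these notes. The idea here is to show that $\|P_n - P\|_{\mathcal{F}}$ is a reverse submartingale with respect to a (decreasing) filtration that converges to the symmetric sigma-field and therefore has an almost sure limit. This almost sure limit, being measurable with respect to the symmetric sigma-field, must be a non-negative constant almost surely. The fact that the expectation converges to 0 then forces this constant to be 0. The full undiluted version of the argument is presented in Lemma 2.4.5 of VDVW.

Uniform and universal GC classes: If $\mathcal{F}$ is $P$-Glivenko-Cantelli for all probability measures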
$P$ on $(\mathcal{X}, \mathcal{A})$, it is called a universal Glivenko-Cantelli class. For example, VC classes of functions (that appear in the discussion preceding the proof of Theorem 1.2) are universal GC classes provided they are uniformly bounded (so that there is an integrable envelope for every probability measure $P$). A stronger GC property can be formulated in terms of the uniformity of the convergence of the empirical measure to the true measure over all probability measures on $(\mathcal{X}, \mathcal{A})$. Say that $\mathcal{F}$ is a strong uniform GC class if, for all $\epsilon > 0$,

$$\sup_{P \in \mathcal{P}(\mathcal{X}, \mathcal{A})} \mathrm{Pr}_P \left( \sup_{m \ge n} \|P_m - P\|_{\mathcal{F}} > \epsilon \right) \to 0 \quad \text{as } n \to \infty .$$

Note that the almost sure convergence of $\|P_n - P\|_{\mathcal{F}}$ to 0 for a fixed $P$ is equivalent to the condition: for every $\epsilon > 0$,

$$\mathrm{Pr}_P \left( \sup_{m \ge n} \|P_m - P\|_{\mathcal{F}} > \epsilon \right) \to 0 .$$

Uniform Glivenko-Cantelli classes are sometimes useful in statistical applications, for example in situations where the parent distribution from which a statistical model is generated is allowed to vary with the sample size, or situations where there are two indices $m, n$ that go to infinity, with $n$ being the sample size, and $m$ an index that labels the statistical model. Consistency arguments for such situations can be constructed via the notion of uniform GC classes of functions. A compelling application is presented in the paper by Sen, Banerjee and Michailidis (200) (available on Banerjee's webpage) where the problem is one of estimating the minimum effective dose in a dose response setting (the largest dose beyond which the response is positive) and $n$ is the number of distinct doses with each dose administered to a distinct set of $m$ individuals. Consistency of a least squares estimate of the minimum effective dose is established as $m, n \to \infty$, and the notion of uniform GC classes is heavily used. Section 2.8.1 of VDVW deals with these ideas; see Theorem 2.8.1 which can be used to deduce that VC classes of functions are uniformly Glivenko-Cantelli under appropriate integrability restrictions.

GC preservation: Preservation of GC properties is important from the perspective of applications.
Often, in a statistical application, it becomes necessary to show the GC property for a class of functions with complex functional forms to which tailor-made GC theorems are difficult to apply. However, if such classes can be built up from simple GC classes of functions via simple
mathematical operations, the GC property often translates to the complex classes of interest. Section 1.6 of Wellner's notes has a discussion of preservation properties, as does Section 9.3 of Kosorok. Some discussion from Kosorok:

An example: Suppose that $\mathcal{X} = \mathbb{R}$ and that $X \sim P$.

(i) For $0 < M < \infty$ and $a \in \mathbb{R}$, let $f(x, t) = |x - t|$ and $\mathcal{F} = \mathcal{F}_{a, M} = \{f(x, t) : |t - a| \le M\}$. Show that if $E(|X|) < \infty$, then $N_{[\,]}(\epsilon, \mathcal{F}, L_1(P)) < \infty$.

Derivation: Chop the interval $[a - M, a + M]$ into an evenly spaced (finite) grid of points $\{s_j\}$ including the end-points such that successive points on the grid are separated by a distance no larger than $\epsilon' < \epsilon$. Construct a set of brackets $\{[l_j, u_j]\}$ where

$$l_j(x) = |x - s_j|\, 1(x \le s_j) + |x - s_{j+1}|\, 1(x > s_{j+1}) \quad \text{and} \quad u_j(x) = |x - s_j| \vee |x - s_{j+1}| .$$

Each $l_j, u_j$ has finite norm since $E_P(|X|) < \infty$. A simple picture should now convince you that $u_j - l_j$ is non-negative and no larger than $\epsilon$ pointwise and hence in the $L_1(P)$ norm. Every point $t$ in $[a - M, a + M]$ lies in some $[s_j, s_{j+1}]$ and the function $f(x, t)$ then belongs to the bracket $[l_j, u_j]$, showing that $N_{[\,]}(\epsilon, \mathcal{F}, L_1(P)) < \infty$.

(ii) Same as before but let $f(x, t) = |x - t| - |x - a|$. Show that $N_{[\,]}(\epsilon, \mathcal{F}, L_1(P)) < \infty$ but without the assumption that $E_P(|X|) < \infty$.

Derivation: Take the $l_j, u_j$'s constructed above and define $\tilde{u}_j = u_j - |x - a|$ and $\tilde{l}_j = l_j - |x - a|$. Consider $\{[\tilde{l}_j, \tilde{u}_j]\}$. It is easy to show, using the fact that $\big| |x - t| - |x - t'| \big| \le |t - t'|$, that each $\tilde{u}_j$ and each $\tilde{l}_j$ is bounded and therefore integrable, irrespective of whether $E(|X|) < \infty$. If $t \in [s_j, s_{j+1}]$, then $f(x, t)$ lies in the bracket $[\tilde{l}_j, \tilde{u}_j]$.
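
The "simple picture" behind the brackets in this example can also be checked numerically. The following is a minimal sketch, not part of the original notes: it takes $l_j(x) = |x - s_j|\, 1(x \le s_j) + |x - s_{j+1}|\, 1(x > s_{j+1})$ and $u_j(x) = |x - s_j| \vee |x - s_{j+1}|$ as the bracket endpoints, and the values $a = 1$, $M = 2$, $\epsilon = 0.25$, the Gaussian sampling of $x$, and all function names are assumptions made purely for illustration.

```python
import bisect
import random


def make_grid(a, M, eps):
    """Evenly spaced grid on [a - M, a + M], end-points included, spacing < eps."""
    cells = int(2.0 * M / eps) + 1  # cells > 2M/eps, so the spacing 2M/cells is below eps
    return [a - M + 2.0 * M * j / cells for j in range(cells + 1)]


def lower(x, s_lo, s_hi):
    """l_j(x): distance from x to the interval [s_lo, s_hi] (zero inside it)."""
    if x <= s_lo:
        return s_lo - x
    if x > s_hi:
        return x - s_hi
    return 0.0


def upper(x, s_lo, s_hi):
    """u_j(x): the larger of the distances from x to the two end-points."""
    return max(abs(x - s_lo), abs(x - s_hi))


def check_brackets(a=1.0, M=2.0, eps=0.25, trials=2000, seed=0):
    """Check l_j <= f(., t) <= u_j and u_j - l_j <= eps at random (x, t) pairs.

    Also tracks the largest absolute value of the shifted brackets from part (ii).
    """
    rng = random.Random(seed)
    grid = make_grid(a, M, eps)
    tol = 1e-12
    contained = True
    worst_width = 0.0
    worst_shifted = 0.0
    for _ in range(trials):
        t = rng.uniform(a - M, a + M)
        x = rng.gauss(0.0, 10.0)
        # index j of a grid cell [s_j, s_{j+1}] that contains t
        j = min(max(bisect.bisect_right(grid, t) - 1, 0), len(grid) - 2)
        lo = lower(x, grid[j], grid[j + 1])
        hi = upper(x, grid[j], grid[j + 1])
        f = abs(x - t)
        contained = contained and (lo - tol <= f <= hi + tol)
        worst_width = max(worst_width, hi - lo)
        shift = abs(x - a)
        worst_shifted = max(worst_shifted, abs(lo - shift), abs(hi - shift))
    return contained, worst_width, worst_shifted, len(grid) - 1


if __name__ == "__main__":
    ok, width, shifted, count = check_brackets()
    print(f"brackets used: {count}, containment holds: {ok}")
    print(f"largest u_j - l_j seen: {width:.4f}, largest |shifted bracket| seen: {shifted:.4f}")
```

A random check of this kind only probes finitely many points, so it illustrates the derivation rather than replacing it. The last quantity tracked relates to part (ii): after subtracting $|x - a|$, both bracket endpoints stay bounded by $M$ in absolute value, which is why no moment assumption on $X$ is needed there.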