Introduction to Empirical Processes and Semiparametric Inference Lecture 02: Overview Continued


Introduction to Empirical Processes and Semiparametric Inference, Lecture 02: Overview Continued. Michael R. Kosorok, Ph.D., Professor and Chair of Biostatistics, Professor of Statistics and Operations Research, University of North Carolina-Chapel Hill.

Empirical Processes: The Main Features

A stochastic process is a collection of random variables $\{X(t), t \in T\}$ on the same probability space, indexed by an arbitrary index set $T$. In general, an empirical process is a stochastic process based on a random sample, usually of $n$ i.i.d. random variables $X_1,\ldots,X_n$, such as $F_n$, where the index set is $T = \mathbb{R}$.

More generally, an empirical process has the following features. The i.i.d. sample $X_1,\ldots,X_n$ is drawn from a probability measure $P$ on an arbitrary sample space $\mathcal{X}$. We define the empirical measure to be $\mathbb{P}_n = n^{-1}\sum_{i=1}^n \delta_{X_i}$, where $\delta_{X_i}$ is the measure that assigns mass 1 to $X_i$ and mass zero elsewhere. For a measurable function $f:\mathcal{X}\mapsto\mathbb{R}$, we define $\mathbb{P}_n f = n^{-1}\sum_{i=1}^n f(X_i)$. For any class $\mathcal{F}$ of functions $f:\mathcal{X}\mapsto\mathbb{R}$ we can define the empirical process $\{\mathbb{P}_n f, f\in\mathcal{F}\}$, i.e., $\mathcal{F}$ becomes the index set. This approach leads to a stunningly rich array of empirical processes.
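
A minimal numerical sketch of these definitions, assuming for concreteness a standard normal $P$ and the indicator class introduced below (the name Pn and the grid are illustrative only):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=100)            # i.i.d. draws X_1, ..., X_n from P (standard normal here)

def Pn(f, X):
    """Empirical measure applied to f:  P_n f = n^{-1} sum_i f(X_i)."""
    return np.mean(f(X))

# Indexing by the class F = {1{x <= t} : t in R} recovers the empirical distribution function.
ts = np.linspace(-3, 3, 13)
Fn = [Pn(lambda x, t=t: (x <= t).astype(float), X) for t in ts]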

Setting $\mathcal{X} = \mathbb{R}$, we can re-express $F_n$ as the empirical process $\{\mathbb{P}_n f, f\in\mathcal{F}\}$ by setting $\mathcal{F} = \{1\{x\le t\}, t\in\mathbb{R}\}$. Recall that the law of large numbers yields, for each $t\in\mathbb{R}$, $F_n(t) \stackrel{as}{\to} F(t)$. (1) The functional perspective invites us to consider uniform results over $t\in\mathbb{R}$ or, more generally, over $f\in\mathcal{F}$: this is also called the sample path perspective.

Along this line of reasoning, Glivenko (1933) and Cantelli (1933) strengthened (1) to $\sup_{t\in\mathbb{R}} |F_n(t) - F(t)| \stackrel{as}{\to} 0$. (2) Another way of saying this is that the sample paths of $F_n$ get uniformly closer to $F$ as $n\to\infty$, almost surely.
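
The Glivenko-Cantelli phenomenon in (2) is easy to see numerically. A small simulation sketch, assuming $P$ is uniform on $[0,1]$ so that $F(t) = t$; the exact uniform distance for a step function is attained at the jump points:

import numpy as np

rng = np.random.default_rng(1)
for n in [100, 1000, 10000]:
    X = np.sort(rng.uniform(size=n))        # P = Uniform(0,1), so F(t) = t on [0,1]
    Fn_at_X = np.arange(1, n + 1) / n        # F_n evaluated at the order statistics
    sup_dist = np.max(np.maximum(np.abs(Fn_at_X - X),
                                 np.abs(Fn_at_X - 1.0 / n - X)))
    print(n, sup_dist)                       # shrinks toward 0 as n grows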

For the general empirical process setting, we say that a class $\mathcal{F}$ of measurable functions $f:\mathcal{X}\mapsto\mathbb{R}$ is $P$-Glivenko-Cantelli (or $P$-GC, or Glivenko-Cantelli, or GC) if $\sup_{f\in\mathcal{F}} |\mathbb{P}_n f - Pf| \stackrel{as*}{\to} 0$, (3) where $Pf = \int_{\mathcal{X}} f(x)\,P(dx)$, and $\stackrel{as*}{\to}$ is a mode of convergence slightly stronger than $\stackrel{as}{\to}$ with the following attributes: certain measurability problems that can arise in complicated empirical processes (including many practical survival analysis settings) are cleanly resolved; the two modes of convergence are equivalent in the setting of (2); and more details will be described later.

Returning to $F_n$, the central limit theorem tells us that for each $t\in\mathbb{R}$, $\mathbb{G}_n(t) \equiv \sqrt{n}\,[F_n(t) - F(t)] \rightsquigarrow \mathbb{G}(t)$, where $\rightsquigarrow$ denotes convergence in distribution and $\mathbb{G}(t)$ is a mean zero normal (Gaussian) random variable with variance $F(t)(1 - F(t))$.

In fact, we know that for all $t$ in any finite set of the form $T_k = \{t_1,\ldots,t_k\}\subset\mathbb{R}$, $\mathbb{G}_n$ will simultaneously converge to a mean zero Gaussian vector $\{\mathbb{G}(t_1),\ldots,\mathbb{G}(t_k)\}$, where $\mathrm{cov}[\mathbb{G}(s),\mathbb{G}(t)] = E[\mathbb{G}(s)\mathbb{G}(t)] = F(s\wedge t) - F(s)F(t)$, (4) for all $s,t\in T_k$. Much more can be said.
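
For completeness, the covariance in (4) follows from a one-line computation with the indicators $f_s(x) = 1\{x\le s\}$ and $f_t(x) = 1\{x\le t\}$:
\[
\mathrm{cov}\bigl[1\{X\le s\},\,1\{X\le t\}\bigr]
 = E\bigl[1\{X\le s\}\,1\{X\le t\}\bigr] - E\,1\{X\le s\}\;E\,1\{X\le t\}
 = F(s\wedge t) - F(s)F(t),
\]
since $1\{X\le s\}\,1\{X\le t\} = 1\{X\le s\wedge t\}$.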

Donsker (1952) showed that the sample paths of $\mathbb{G}_n$ (as functions on $\mathbb{R}$) converge in distribution to a certain stochastic process $\mathbb{G}$. Weak convergence is the generalization of convergence in distribution from vectors of random variables to sample paths of stochastic processes.

Donsker's famous result can be stated succinctly as $\mathbb{G}_n \rightsquigarrow \mathbb{G}$ in $\ell^\infty(\mathbb{R})$, (5) where, for any index set $T$, $\ell^\infty(T)$ is the collection of all bounded functions $f: T\mapsto\mathbb{R}$. $\ell^\infty(T)$ is a metric space with respect to the uniform metric on $T$, and it is used to remind us that we are taking the functional perspective or, more precisely, that we are thinking of distributional convergence in terms of the sample paths.

The limiting process $\mathbb{G}$ in (5) is a mean zero Gaussian process with $E[\mathbb{G}(s)\mathbb{G}(t)] = F(s\wedge t) - F(s)F(t)$, as given in (4). A Gaussian process is a stochastic process $\{Z(t), t\in T\}$ where, for every finite $T_k\subset T$, $\{Z(t), t\in T_k\}$ is multivariate normal and all (or almost all) sample paths are continuous in a certain sense to be defined later.

The particular Gaussian process $\mathbb{G}$ in (5) can be written as $\mathbb{G}(t) = \mathbb{B}(F(t))$, where $\mathbb{B}$ is a Brownian bridge process on the unit interval $[0,1]$. The process $\mathbb{B}$ is a mean zero Gaussian process on $[0,1]$ with covariance $s\wedge t - st$, for $s,t\in[0,1]$, and has the form $\mathbb{B}(t) = \mathbb{W}(t) - t\,\mathbb{W}(1)$, for $t\in[0,1]$, where $\mathbb{W}$ is a standard Brownian motion process. $\mathbb{B}$ can be generalized in a very important manner which we will discuss shortly.
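
As a quick check that $\mathbb{W}(t) - t\,\mathbb{W}(1)$ has the stated covariance, use $\mathrm{cov}[\mathbb{W}(s),\mathbb{W}(t)] = s\wedge t$: for $s,t\in[0,1]$,
\[
\mathrm{cov}\bigl[\mathbb{W}(s) - s\,\mathbb{W}(1),\ \mathbb{W}(t) - t\,\mathbb{W}(1)\bigr]
 = (s\wedge t) - ts - st + st = s\wedge t - st .
\]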

The Brownian motion $\mathbb{W}$ is a mean zero Gaussian process on $[0,\infty)$ with continuous sample paths (almost surely), $\mathbb{W}(0) = 0$, and covariance $s\wedge t$. Both the Brownian bridge and Brownian motion are very important stochastic processes that arise frequently in statistics and probability.
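
Both processes are easy to simulate on a grid. A minimal sketch, assuming an equally spaced grid of 501 points on $[0,1]$: build $\mathbb{W}$ from independent Gaussian increments, then form the bridge as $\mathbb{W}(t) - t\,\mathbb{W}(1)$.

import numpy as np

rng = np.random.default_rng(6)
grid = np.linspace(0, 1, 501)
dt = grid[1] - grid[0]

# Brownian motion: cumulative sum of independent N(0, dt) increments, starting at W(0) = 0.
W = np.concatenate([[0.0], np.cumsum(rng.normal(scale=np.sqrt(dt), size=len(grid) - 1))])
# Brownian bridge: B(t) = W(t) - t W(1).
Bb = W - grid * W[-1]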

Returning again to general empirical processes, define the random measure $\mathbb{G}_n = \sqrt{n}\,(\mathbb{P}_n - P)$, and define $\mathbb{G}$ to be a mean zero Gaussian process indexed by $\mathcal{F}$, with covariance $E[f(X)g(X)] - E[f(X)]\,E[g(X)]$ for all $f,g\in\mathcal{F}$, and having appropriately continuous sample paths (almost surely). Both $\mathbb{G}_n$ and $\mathbb{G}$ can be thought of as being indexed by $\mathcal{F}$, and both are completely defined once we specify $\mathcal{F}$.

We say that $\mathcal{F}$ is $P$-Donsker if $\mathbb{G}_n \rightsquigarrow \mathbb{G}$ in $\ell^\infty(\mathcal{F})$. The $P$ or the $\mathcal{F}$ may be dropped if the context is clear.

Donsker's (1952) theorem tells us that $\mathcal{F} = \{1\{x\le t\}, t\in\mathbb{R}\}$ is Donsker for all probability measures defined by a real distribution function $F$. With $f(x) = 1\{x\le t\}$ and $g(x) = 1\{x\le s\}$, $E[f(X)g(X)] - E[f(X)]\,E[g(X)] = F(s\wedge t) - F(s)F(t)$. For this reason, $\mathbb{G}$ is referred to as a Brownian bridge (or generalized Brownian bridge). The range of possibilities for $\mathbb{G}_n$ and $\mathbb{G}$, as defined through classes of functions $\mathcal{F}$, is stunningly vast.

Suppose we are interested in forming simultaneous confidence bands for $F$ over some subset $H\subset\mathbb{R}$. Since $\mathcal{F} = \{1\{x\le t\}, t\in\mathbb{R}\}$ is Glivenko-Cantelli, we can uniformly consistently estimate the covariance $\sigma(s,t) = F(s\wedge t) - F(s)F(t)$ with $\hat\sigma(s,t) = F_n(s\wedge t) - F_n(s)F_n(t)$. Is it possible to use this to generate the desired confidence bands?

While knowledge of the covariance is sufficient to generate simultaneous confidence bands when $H$ is finite (via the chi-square distribution, for example), it is not sufficient when $H$ is infinite (e.g., when $H$ is a subinterval of $\mathbb{R}$). When $H$ is infinite, and more generally, it is useful to make use of the Donsker result for $\mathbb{G}_n$. In particular, let $U_n = \sup_{t\in H}|\mathbb{G}_n(t)|$, and note that the distribution of $U_n$ can be used to construct the desired confidence bands.

The continuous mapping theorem tells us that whenever a process $\{Z_n(t), t\in H\}$ converges weakly to a tight limiting process $\{Z(t), t\in H\}$ in $\ell^\infty(H)$, then $h(Z_n) \rightsquigarrow h(Z)$ in $h(\ell^\infty(H))$ for any continuous map $h$. In our setting, $U_n = h(\mathbb{G}_n)$, where $g\mapsto h(g) = \sup_{t\in H}|g(t)|$, for $g\in\ell^\infty(\mathbb{R})$, is a continuous real-valued function. Thus $U_n \rightsquigarrow U = \sup_{t\in H}|\mathbb{G}(t)|$ by the continuous mapping theorem.

Note that when $H = \mathbb{R}$ (and $F$ is continuous), $U = \sup_{t\in H}|\mathbb{G}(t)| = \sup_{t\in H}|\mathbb{B}(F(t))| = \sup_{u\in[0,1]}|\mathbb{B}(u)|$ is distribution free and has a known distribution. In particular, for $s > 0$, $P[U\le s] = 1 + 2\sum_{k=1}^\infty (-1)^k e^{-2k^2 s^2}$ (Billingsley, 1968, p. 85); e.g., an asymptotic 95% confidence band is $F_n(\cdot) \pm 1.36/\sqrt{n}$.
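
The critical value and the resulting band are simple to compute. A sketch (assuming a standard normal sample; the names are illustrative) that inverts the series for $P[U\le s]$ numerically, giving a constant of roughly 1.36 at the 95% level:

import numpy as np

def kolmogorov_cdf(s, terms=100):
    """P[U <= s] = 1 + 2 sum_{k>=1} (-1)^k exp(-2 k^2 s^2)."""
    k = np.arange(1, terms + 1)
    return 1.0 + 2.0 * np.sum((-1.0) ** k * np.exp(-2.0 * k ** 2 * s ** 2))

# Solve P[U <= c] = 0.95 by bisection; c is roughly 1.36.
lo, hi = 0.5, 3.0
for _ in range(60):
    mid = (lo + hi) / 2.0
    lo, hi = (mid, hi) if kolmogorov_cdf(mid) < 0.95 else (lo, mid)
c95 = (lo + hi) / 2.0

# Asymptotic 95% uniform band for F from a sample X_1,...,X_n: F_n(t) +/- c95 / sqrt(n).
rng = np.random.default_rng(4)
X = np.sort(rng.normal(size=500))
n = len(X)
Fn = np.arange(1, n + 1) / n
lower = np.clip(Fn - c95 / np.sqrt(n), 0, 1)
upper = np.clip(Fn + c95 / np.sqrt(n), 0, 1)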

An alternative approach is to use bootstraps of $F_n$. These bootstraps have the form $\hat F_n(t) = n^{-1}\sum_{i=1}^n W_{ni}\,1\{X_i\le t\}$, where $(W_{n1},\ldots,W_{nn})$ is a multinomial random $n$-vector with probabilities $(1/n,\ldots,1/n)$ and number of trials $n$, drawn independently of the data $X_1,\ldots,X_n$.

More general weights $(W_{n1},\ldots,W_{nn})$ are possible (and useful), and it can be shown that $\hat{\mathbb{G}}_n = \sqrt{n}\,(\hat F_n - F_n)$ converges weakly, conditional on the data $X_1,\ldots,X_n$, to $\mathbb{G}$ in $\ell^\infty(\mathbb{R})$. Thus the bootstrap provides valid confidence bands for $F$.
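
A minimal sketch of this bootstrap band (our own illustration, on a grid of $t$ values): resample multinomial weights, recompute $\hat F_n$, and take the 0.95 quantile of $\sup_t \sqrt{n}\,|\hat F_n(t) - F_n(t)|$ as the band constant.

import numpy as np

rng = np.random.default_rng(5)
n, B = 300, 2000
X = rng.normal(size=n)
ts = np.linspace(-3, 3, 121)
ind = (X[:, None] <= ts[None, :]).astype(float)     # 1{X_i <= t} on the grid
Fn = ind.mean(axis=0)

sup_stats = np.empty(B)
for b in range(B):
    W = rng.multinomial(n, np.full(n, 1.0 / n))      # nonparametric bootstrap weights
    F_boot = (W[:, None] * ind).mean(axis=0)         # bootstrapped empirical CDF
    sup_stats[b] = np.sqrt(n) * np.max(np.abs(F_boot - Fn))

c95 = np.quantile(sup_stats, 0.95)                   # bootstrap critical value
band = (Fn - c95 / np.sqrt(n), Fn + c95 / np.sqrt(n))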

Consider now the more general setting, where $\mathcal{F}$ is a Donsker class and we wish to construct confidence bands for $Ef(X)$ for all $f\in H\subset\mathcal{F}$. Under minimal (moment) conditions on $\mathcal{F}$, $\hat\sigma(f,g) = \mathbb{P}_n[f(X)g(X)] - \mathbb{P}_n f(X)\,\mathbb{P}_n g(X)$ is uniformly consistent for $\sigma(f,g) = P[f(X)g(X)] - Pf(X)\,Pg(X)$. Whenever $H$ is infinite, knowledge of $\sigma$ is not enough to form confidence bands.

Fortunately, the bootstrap is always valid when $\mathcal{F}$ is Donsker and can thus be used for arbitrary and/or infinite $H$. If $\hat{\mathbb{G}}_n = \sqrt{n}\,(\hat{\mathbb{P}}_n - \mathbb{P}_n)$, where $\hat{\mathbb{P}}_n f = n^{-1}\sum_{i=1}^n W_{ni} f(X_i)$, then the conditional distribution of $\hat{\mathbb{G}}_n$ given the data $X_1,\ldots,X_n$ converges weakly to $\mathbb{G}$ in $\ell^\infty(\mathcal{F})$, and $\mathcal{F}$ can be replaced with any $H\subset\mathcal{F}$. The bootstrap for $F_n$ is a special case of this.

Although many statistics do not have the form $\{\mathbb{P}_n f, f\in\mathcal{F}\}$, many can be written as $\phi(\mathbb{P}_n)$, where $\phi: \ell^\infty(\mathcal{F})\mapsto B$ is continuous and $B$ is a set (possibly infinite-dimensional). Consider, for example, $\xi_n(p) = F_n^{-1}(p)$ for $p\in[a,b]\subset[0,1]$. The process $\xi_n$ is the sample quantile process and can be shown to be expressible, under reasonable regularity conditions, as $\phi(F_n)$, where $\phi$ is a Hadamard differentiable functional of $F_n$.

In this example, and in general, the functional delta method states that $\sqrt{n}\,[\phi(\mathbb{P}_n) - \phi(P)]$ converges weakly in $B$ to $\phi'(\mathbb{G})$, whenever $\mathcal{F}$ is Donsker and $\phi$ is Hadamard differentiable. Moreover, the empirical process bootstrap discussed previously is also automatically valid, i.e., $\sqrt{n}\,[\phi(\hat{\mathbb{P}}_n) - \phi(\mathbb{P}_n)]$, conditional on the data, converges weakly to $\phi'(\mathbb{G})$.
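
For the quantile example the derivative can be written down explicitly. The following is a standard sketch, assuming $F$ has a positive, continuous density $f$ at $F^{-1}(p)$ for $p\in[a,b]$:
\[
\phi'_F(h)(p) = -\,\frac{h\bigl(F^{-1}(p)\bigr)}{f\bigl(F^{-1}(p)\bigr)},
\qquad\text{so that}\qquad
\sqrt{n}\,\bigl[F_n^{-1}(p) - F^{-1}(p)\bigr]
 \rightsquigarrow -\,\frac{\mathbb{G}\bigl(F^{-1}(p)\bigr)}{f\bigl(F^{-1}(p)\bigr)}
 = -\,\frac{\mathbb{B}(p)}{f\bigl(F^{-1}(p)\bigr)},
\]
uniformly over $p\in[a,b]$.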

Many other statistics can be written as zeros, or approximate zeros, of estimating equations based on empirical processes: these are called Z-estimators. An example is $\hat\beta$ from linear regression, which can be written as a zero of $U_n(\beta) = \mathbb{P}_n[X(Y - X'\beta)]$; a small numerical illustration follows below. Yet other statistics can be written as maxima or minima of objective functions based on empirical processes: these are called M-estimators. Examples include least squares, maximum likelihood, and minimum penalized likelihoods such as $L(\beta,\eta)$ from partly-linear logistic regression.
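
A minimal numerical illustration of the Z-estimator view of linear regression (simulated data; the names are ours): $\hat\beta$ solves $U_n(\beta) = 0$, which is just the normal equations.

import numpy as np

rng = np.random.default_rng(2)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n)])   # design matrix with intercept
beta_true = np.array([1.0, 2.0])
Y = X @ beta_true + rng.normal(size=n)

# Zero of U_n(beta) = P_n[X (Y - X'beta)] = n^{-1} X'(Y - X beta),
# which here coincides with the least-squares (M-estimator) solution.
beta_hat = np.linalg.solve(X.T @ X / n, X.T @ Y / n)
print(beta_hat)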

Thus many important statistics can be viewed as involving an empirical process. A key asymptotic issue is studying the limiting behavior of these empirical processes in terms of their sample paths (the functional perspective). Primary achievements in this direction include: Glivenko-Cantelli results, which extend the law of large numbers; Donsker results, which extend the central limit theorem; the validity of the bootstrap for Donsker classes; and the functional delta method.

Stochastic Convergence

There is always a metric space $(\mathbb{D}, d)$ involved, where $\mathbb{D}$ is the set of points and $d$ is a metric satisfying, for all $x,y,z\in\mathbb{D}$: $d(x,y)\ge 0$, $d(x,y) = d(y,x)$, $d(x,z)\le d(x,y) + d(y,z)$, and $d(x,y) = 0$ iff $x = y$. Often $\mathbb{D} = \ell^\infty(T)$, the set of bounded functions $f: T\mapsto\mathbb{R}$, where $T$ is an index set of interest and $d(x,y) = \sup_{t\in T}|x(t) - y(t)|$ is the uniform distance.

When we speak of $X_n$ converging to $X$ in $\mathbb{D}$, we mean that the sample paths of $X_n$ behave more and more like the sample paths of $X$ (as points in $\mathbb{D}$). When $X_n$ and $X$ are Borel measurable, weak convergence, denoted $X_n \rightsquigarrow X$, is equivalent to $Ef(X_n)\to Ef(X)$ for every bounded, continuous $f:\mathbb{D}\mapsto\mathbb{R}$, which set of functions we denote $C_b(\mathbb{D})$, and where continuity is in terms of $d$, i.e., $|f(x) - f(y)|\to 0$ as $d(x,y)\to 0$. We will define Borel measurability of $X_n$ later; when it fails, it basically means that there are certain important subsets $A\subset\mathbb{D}$ for which $P(X_n\in A)$ is not defined.

What about when $X_n$ is not Borel measurable? We need to introduce outer expectation for arbitrary maps (not necessarily random variables) $T: \Omega\mapsto\bar{\mathbb{R}} = [-\infty,\infty]$, where $\Omega$ is a sample space. The outer expectation of $T$, denoted $E^*T$, is the infimum of all $EU$, where $U: \Omega\mapsto\bar{\mathbb{R}}$ is measurable, $U\ge T$, and $EU$ exists. We analogously define $E_*T = -E^*(-T)$ as the inner expectation.

We can show that there exists a minimal measurable majorant $T^*$ of $T$ such that $T^*$ is measurable and $ET^* = E^*T$. Conversely, the maximal measurable minorant $T_*$ also exists and satisfies $T_* = -(-T)^*$ and $E_*T = ET_*$. We can also define the outer probability $P^*(A)$ as the infimum over all $P(B)$, where $A\subset B\subset\Omega$ and $B$ is measurable; $P_*(A) = 1 - P^*(\Omega\setminus A)$ is the inner probability.

Even if $X_n$ is not Borel measurable, we can have weak convergence: specifically, if $E^*f(X_n)\to Ef(X)$ for all $f\in C_b(\mathbb{D})$, then we say $X_n \rightsquigarrow X$. Such weak convergence carries with it the following implicit measurability requirements: $X$ is Borel measurable and $E^*f(X_n) - E_*f(X_n)\to 0$ for every $f\in C_b(\mathbb{D})$. A sequence $X_n$ satisfying the second requirement is called asymptotically measurable.

We can define outer measurable forms of weak (in probability) and strong (almost sure) convergence: convergence in probability, $X_n \stackrel{P}{\to} X$, if $P^*\{d(X_n, X) > \epsilon\}\to 0$ for every $\epsilon > 0$; and outer almost sure convergence, $X_n \stackrel{as*}{\to} X$, if there exists a sequence $\Delta_n$ of measurable random variables such that $d(X_n, X)\le\Delta_n$ and $P\{\limsup_{n\to\infty}\Delta_n = 0\} = 1$. While these modes of convergence are slightly different from the usual ones, they are identical when all the random quantities involved are measurable.

For most settings we will study, the limiting quantity $X$ will be tight (related to smoothness) in addition to being Borel measurable. Moreover, the most common choice for $\mathbb{D}$ will be $\ell^\infty(T)$ with the uniform metric $d$. Let $\rho(s,t)$ be a semimetric on $T$: a semimetric $\rho$ satisfies all of the requirements of a metric except that $\rho(s,t) = 0$ does not necessarily imply $s = t$.

We say that the semimetric space $(T,\rho)$ is totally bounded if, for every $\epsilon > 0$, there exists a finite subset $T_k = \{t_1,\ldots,t_k\}\subset T$ such that for all $t\in T$ we have $\rho(s,t)\le\epsilon$ for some $s\in T_k$. Define $UC(T,\rho)$ to be the subset of $\ell^\infty(T)$ consisting of functions $x$ for which $\lim_{\delta\downarrow 0}\sup_{s,t\in T:\,\rho(s,t)\le\delta}|x(t) - x(s)| = 0$. The functions of $UC(T,\rho)$ are also called the $\rho$-equicontinuous functions. The UC refers to uniform continuity, and $UC(T,\rho)$ is a very important metric space.

A Borel measurable stochastic process $X$ in $\ell^\infty(T)$ is tight if $X\in UC(T,\rho)$ almost surely for some $\rho$ making $T$ totally bounded. If $X$ is a Gaussian process, then $\rho$ can be chosen, without loss of generality, to be $\rho(s,t) = (\mathrm{var}[X(s) - X(t)])^{1/2}$. Tight Gaussian processes will be the most important limiting processes we will consider.

Theorem 2.1. $X_n$ converges weakly to a tight $X$ in $\ell^\infty(T)$ if and only if:
(i) For all finite $\{t_1,\ldots,t_k\}\subset T$, the multivariate distribution of $\{X_n(t_1),\ldots,X_n(t_k)\}$ converges to that of $\{X(t_1),\ldots,X(t_k)\}$.
(ii) There exists a semimetric $\rho$ for which $T$ is totally bounded and, for all $\epsilon > 0$,
\[
\lim_{\delta\downarrow 0}\limsup_{n\to\infty}
 P\Bigl\{\sup_{s,t\in T:\,\rho(s,t)<\delta}\,|X_n(s) - X_n(t)| > \epsilon\Bigr\} = 0 .
\]

Usually, establishing Condition (i) is easy, while establishing Condition (ii) is hard. For empirical processes from i.i.d. data, we want to show $\mathbb{G}_n \rightsquigarrow \mathbb{G}$ in $\ell^\infty(\mathcal{F})$, where $\mathcal{F}$ is a class of measurable functions $f:\mathcal{X}\mapsto\mathbb{R}$, $\mathcal{X}$ is the sample space, and $Ef^2(X) < \infty$ for all $f\in\mathcal{F}$. Thus Condition (i) is automatic by the standard central limit theorem.

Establishing Condition (ii) is much harder. When $\mathcal{F}$ is Donsker: the limiting process $\mathbb{G}$ is always a tight Gaussian process; $\mathcal{F}$ is totally bounded w.r.t. $\rho(f,g) = \{\mathrm{var}[f(X) - g(X)]\}^{1/2}$; and Condition (ii) is satisfied with $T = \mathcal{F}$, $X_n(f) = \mathbb{G}_n f$, and $X(f) = \mathbb{G}f$, for all $f\in\mathcal{F}$. Of course, the hard part is showing $\mathcal{F}$ is Donsker.

Theorem (continuous mapping). Suppose $g: \mathbb{D}\mapsto\mathbb{E}$ is continuous at every point of $\mathbb{D}_0\subset\mathbb{D}$ and $X_n \rightsquigarrow X$, where $X$ takes its values almost surely in $\mathbb{D}_0$. Then $g(X_n) \rightsquigarrow g(X)$.

Example. Let $g: \ell^\infty(\mathcal{F})\mapsto\mathbb{R}$ be $g(x) = \|x(f)\|_{\mathcal{F}} \equiv \sup_{f\in\mathcal{F}}|x(f)|$. The continuous mapping theorem now implies that $\|\mathbb{G}_n\|_{\mathcal{F}} \rightsquigarrow \|\mathbb{G}\|_{\mathcal{F}}$. This can be used for confidence bands for $Pf$.

Entropy for Glivenko-Cantelli and Donsker Theorems

In order to obtain Glivenko-Cantelli and Donsker results for $\mathcal{F}$, we need to evaluate the complexity (or entropy) of $\mathcal{F}$. The easiest way is with entropy with bracketing (also called bracketing entropy), which we now introduce. First, for $1\le r < \infty$, define $L_r(P)$ to be the space of measurable functions $f:\mathcal{X}\mapsto\mathbb{R}$ with $\|f\|_{P,r} \equiv [P|f(X)|^r]^{1/r} < \infty$.

An $\epsilon$-bracket in $L_r(P)$ is a pair of functions $l, u\in L_r(P)$ with $l(x)\le u(x)$ $P$-almost surely and $\|u - l\|_{P,r}\le\epsilon$. A function $f\in\mathcal{F}$ lies in the bracket $(l,u)$ if $l(x)\le f(x)\le u(x)$ $P$-almost surely. The bracketing number $N_{[]}(\epsilon,\mathcal{F},L_r(P))$ is the minimum number of $\epsilon$-brackets in $L_r(P)$ needed to ensure that every $f\in\mathcal{F}$ is contained in at least one bracket. The logarithm of the bracketing number is the entropy with bracketing.

Theorem 2.2. Let $\mathcal{F}$ be a class of measurable functions and suppose $N_{[]}(\epsilon,\mathcal{F},L_r(P)) < \infty$ for every $\epsilon > 0$. Then $\mathcal{F}$ is $P$-Glivenko-Cantelli.

Example. Let $\mathcal{X} = \mathbb{R}$, $\mathcal{F} = \{1\{x\le t\}, t\in\mathbb{R}\}$, and let $P$ be the probability measure with distribution function $F$ on $\mathcal{X}$.

For each $\epsilon > 0$, there exists a finite set of real numbers $-\infty = t_0 < t_1 < \cdots < t_k = \infty$ such that $F(t_j) - F(t_{j-1})\le\epsilon$ for all $1\le j\le k$, with $F(t_0) = 0$ and $F(t_k) = 1$. Now consider the brackets $\{(l_j, u_j), 1\le j\le k\}$, with $l_j(x) = 1\{x\le t_{j-1}\}$ and $u_j(x) = 1\{x < t_j\}$ (note that $u_j$ is not in $\mathcal{F}$). Clearly, each $f\in\mathcal{F}$ is contained in one of these brackets, and the number of such brackets is finite. By Theorem 2.2, this implies that $F_n$ is uniformly consistent for $F$ almost surely.
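
To make the construction concrete, here is a small sketch of our own (it assumes a standard normal $F$ and uses scipy for the quantile function): the cut points are chosen on the probability scale so that each bracket has $L_1(P)$ size at most $\epsilon$.

import numpy as np
from scipy.stats import norm

def epsilon_brackets(eps):
    """Interior cut points t_1 < ... < t_{k-1} (plus t_0 = -inf, t_k = +inf) with
    F(t_j) - F(t_{j-1}) <= eps for F = N(0,1).  Bracket j is (1{x <= t_{j-1}}, 1{x < t_j})
    and has L_1(P) size F(t_j) - F(t_{j-1}) <= eps."""
    j = np.arange(1, int(np.ceil(1.0 / eps)))     # cut points on the probability scale
    interior = norm.ppf(j * eps)
    return interior, len(interior) + 1             # (cut points, number of brackets)

for eps in [0.5, 0.2, 0.05]:
    _, k = epsilon_brackets(eps)
    print(eps, k, "brackets;  bound 1 + 1/eps =", 1.0 + 1.0 / eps)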

For Donsker results, a more refined assessment of entropy is needed. The bracketing integral (or bracketing entropy integral) is $J_{[]}(\delta,\mathcal{F},L_r(P)) \equiv \int_0^\delta \sqrt{\log N_{[]}(\epsilon,\mathcal{F},L_r(P))}\,d\epsilon$. This allows the bracketing entropy to go to $\infty$ as $\epsilon\downarrow 0$, but keeps this from happening too fast.

Theorem 2.3. Let $\mathcal{F}$ be a class of measurable functions with $J_{[]}(\infty,\mathcal{F},L_2(P)) < \infty$. Then $\mathcal{F}$ is $P$-Donsker.

Example, continued. For each $\epsilon > 0$, choose the brackets $(l_j, u_j)$ as before and note that $\|u_j - l_j\|_{P,2} = (\|u_j - l_j\|_{P,1})^{1/2}\le\epsilon^{1/2}$, since $u_j - l_j$ takes only the values 0 and 1. Thus an $L_1$ $\epsilon^2$-bracket of this form is an $L_2$ $\epsilon$-bracket, and hence the minimum number of $L_2$ $\epsilon$-brackets needed to cover $\mathcal{F}$ is at most $1 + 1/\epsilon^2$.

Since $\log(1+a)\le 1 + \log(a)$ for all $a\ge 1$, we now have that $\log N_{[]}(\epsilon,\mathcal{F},L_2(P))\le [1 + \log(1/\epsilon^2)]\,1\{\epsilon\le 1\}$, and thus $J_{[]}(\infty,\mathcal{F},L_2(P))\le \int_0^\infty u^{1/2} e^{-u/2}\,du = \sqrt{2\pi} < \infty$, where we used the variable substitution $u = 1 + \log(1/\epsilon^2)$. Many other classes of functions have a bounded bracketing entropy integral, including many parametric classes of functions and the class $\mathcal{F}$ of all monotone functions $f:\mathcal{X} = \mathbb{R}\mapsto[0,1]$, for all $1\le r < \infty$ and any $P$ on $\mathcal{X} = \mathbb{R}$.
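
Spelling out the compressed substitution step:
\[
J_{[]}(\infty,\mathcal{F},L_2(P))
 \le \int_0^1 \sqrt{1+\log(1/\epsilon^2)}\,d\epsilon
 = \frac{\sqrt{e}}{2}\int_1^\infty u^{1/2} e^{-u/2}\,du
 \le \int_0^\infty u^{1/2} e^{-u/2}\,du
 = 2^{3/2}\,\Gamma(3/2) = \sqrt{2\pi},
\]
using $u = 1 + \log(1/\epsilon^2)$, so that $\epsilon = e^{(1-u)/2}$ and $d\epsilon = -\tfrac12 e^{(1-u)/2}\,du$; the second inequality uses $\sqrt{e}/2 < 1$ and an extension of the range of integration.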

There are many classes $\mathcal{F}$ for which bracketing entropy does not work, but a different kind of entropy, uniform entropy (or just entropy), works. For a probability measure $Q$, the covering number $N(\epsilon,\mathcal{F},L_r(Q))$ is the minimum number of $L_r(Q)$ $\epsilon$-balls needed to cover $\mathcal{F}$, where an $\epsilon$-ball around a function $g\in L_r(Q)$ is the set $\{h\in L_r(Q): \|h - g\|_{Q,r} < \epsilon\}$. For a collection of $\epsilon$-balls to cover $\mathcal{F}$, every element of $\mathcal{F}$ must be contained in at least one of the $\epsilon$-balls; it is not necessary that the centers of the balls be in $\mathcal{F}$.

The logarithm of the covering number is the entropy. What we really need, though, is the uniform covering number $\sup_Q N(\epsilon\|F\|_{Q,r},\mathcal{F},L_r(Q))$, (6) where $F:\mathcal{X}\mapsto\mathbb{R}$ is an envelope for $\mathcal{F}$, meaning that $|f(x)|\le F(x)$ for all $x\in\mathcal{X}$ and all $f\in\mathcal{F}$, and where the supremum is taken over all finitely discrete probability measures $Q$ with $\|F\|_{Q,r} > 0$.

The logarithm of (6) is the uniform entropy: note that it does not depend on the probability measure $P$ for the observed data. The uniform entropy integral is $J(\delta,\mathcal{F},L_r) = \int_0^\delta \sqrt{\log\sup_Q N(\epsilon\|F\|_{Q,r},\mathcal{F},L_r(Q))}\,d\epsilon$, where the supremum is taken over the same set of measures used in (6).

The following two theorems are the Glivenko-Cantelli and Donsker theorems for uniform entropy:

Theorem 2.4. Let $\mathcal{F}$ be an appropriately measurable class of measurable functions with $\sup_Q N(\epsilon\|F\|_{Q,1},\mathcal{F},L_1(Q)) < \infty$ for every $\epsilon > 0$, where the supremum is taken over the same set used in (6). If $P^*F < \infty$, then $\mathcal{F}$ is $P$-Glivenko-Cantelli.

Theorem 2.5. Let $\mathcal{F}$ be an appropriately measurable class of measurable functions with $J(1,\mathcal{F},L_2) < \infty$. If $P^*F^2 < \infty$, then $\mathcal{F}$ is $P$-Donsker.

An important family of classes $\mathcal{F}$ for which $J(1,\mathcal{F},L_2) < \infty$ are the Vapnik-Cervonenkis (VC) classes: these include the indicator function class example from above, classes that can be expressed as finite-dimensional vector spaces, and many other classes. There are preservation theorems for building VC classes from other VC classes, as well as preservation theorems for Donsker, Glivenko-Cantelli, and other classes: we will go into this in depth in Chapter 9.

Bootstrapping Empirical Processes

We mentioned earlier that measurability in weak convergence can be tricky. This is even more the case with the bootstrap, since there are two sources of randomness: the data and the random weights (resampling) used in the bootstrap. For this reason, we have to assess convergence of conditional laws (the bootstrap given the data) differently than regular weak convergence.

An important result is that $X_n \rightsquigarrow X$ in the metric space $(\mathbb{D}, d)$ if and only if $\sup_{f\in BL_1}|E^*f(X_n) - Ef(X)|\to 0$, (7) where $BL_1$ is the space of functions $f:\mathbb{D}\mapsto\mathbb{R}$ with Lipschitz norm bounded by 1, i.e., $\sup_{x\in\mathbb{D}}|f(x)|\le 1$ and $|f(x) - f(y)|\le d(x,y)$ for all $x,y\in\mathbb{D}$. We can now define bootstrap weak convergence.

Let $\hat X_n$ be a sequence of bootstrapped processes in $\mathbb{D}$ with random weights which we denote by $M$. For some tight process $X$ in $\mathbb{D}$, we use the notation $\hat X_n \overset{P}{\underset{M}{\rightsquigarrow}} X$ to mean that $\sup_{h\in BL_1}|E_M h(\hat X_n) - Eh(X)| \stackrel{P}{\to} 0$ and $E_M h(\hat X_n)^* - E_M h(\hat X_n)_* \stackrel{P}{\to} 0$ for all $h\in BL_1$, where the subscript $M$ denotes conditional expectation over $M$ given the data, and where $h(\hat X_n)^*$ and $h(\hat X_n)_*$ are the measurable majorants and minorants with respect to the joint data (including $M$).

In addition to using the multinomial weights mentioned previously, which correspond to the nonparametric bootstrap, we have a useful alternative that performs better in some settings. Let $\xi = (\xi_1,\xi_2,\ldots)$ be an infinite sequence of nonnegative i.i.d. random variables, also independent of the data $X_1,\ldots,X_n$, which have mean $0 < \mu < \infty$ and variance $0 < \tau^2 < \infty$, and which satisfy $\|\xi\|_{2,1} < \infty$, where $\|\xi\|_{2,1} = \int_0^\infty \sqrt{P(|\xi| > x)}\,dx$.

We can now define the multiplier bootstrap empirical measure $\tilde{\mathbb{P}}_n f = n^{-1}\sum_{i=1}^n (\xi_i/\bar\xi_n) f(X_i)$, where $\bar\xi_n = n^{-1}\sum_{i=1}^n \xi_i$ and $\tilde{\mathbb{P}}_n$ is defined to be zero if $\bar\xi_n = 0$. Note that the weights involved add up to $n$ for both the multiplier and nonparametric bootstraps. When $\xi$ has a standard exponential distribution, for example, the moment conditions are easily satisfied and the resulting multiplier bootstrap has Dirichlet weights. Let $\hat{\mathbb{G}}_n = \sqrt{n}\,(\hat{\mathbb{P}}_n - \mathbb{P}_n)$, $\tilde{\mathbb{G}}_n = \sqrt{n}\,(\mu/\tau)(\tilde{\mathbb{P}}_n - \mathbb{P}_n)$, and let $\mathbb{G}$ be the Brownian bridge in $\ell^\infty(\mathcal{F})$.
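
A small sketch of our own contrasting the two sets of bootstrap weights on a grid of $t$ values; for standard exponential multipliers, $\mu = \tau = 1$, so no extra rescaling is needed:

import numpy as np

rng = np.random.default_rng(3)
n = 200
X = rng.normal(size=n)
ts = np.linspace(-2.5, 2.5, 101)
Fn = (X[:, None] <= ts[None, :]).mean(axis=0)            # F_n(t) on the grid

def boot_process(weights):
    """sqrt(n) * (weighted empirical CDF - F_n) evaluated on the grid ts."""
    Fhat = (weights[:, None] * (X[:, None] <= ts[None, :])).mean(axis=0)
    return np.sqrt(n) * (Fhat - Fn)

# Nonparametric bootstrap: multinomial(n; 1/n, ..., 1/n) weights.
W = rng.multinomial(n, np.full(n, 1.0 / n)).astype(float)
G_hat = boot_process(W)

# Multiplier bootstrap: i.i.d. standard exponential xi_i, normalized by their mean
# so the weights xi_i / xi_bar sum to n.
xi = rng.exponential(size=n)
G_tilde = boot_process(xi / xi.mean())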

Theorem 2.6. The following are equivalent:
1. $\mathcal{F}$ is $P$-Donsker.
2. $\hat{\mathbb{G}}_n \overset{P}{\underset{W}{\rightsquigarrow}} \mathbb{G}$ and the sequence $\hat{\mathbb{G}}_n$ is asymptotically measurable.
3. $\tilde{\mathbb{G}}_n \overset{P}{\underset{\xi}{\rightsquigarrow}} \mathbb{G}$ and the sequence $\tilde{\mathbb{G}}_n$ is asymptotically measurable.

Theorem 2.7. The following are equivalent:
1. $\mathcal{F}$ is $P$-Donsker and $P^*\bigl[\sup_{f\in\mathcal{F}}(f(X) - Pf)^2\bigr] < \infty$.
2. $\hat{\mathbb{G}}_n \overset{as*}{\underset{W}{\rightsquigarrow}} \mathbb{G}$.
3. $\tilde{\mathbb{G}}_n \overset{as*}{\underset{\xi}{\rightsquigarrow}} \mathbb{G}$.

A few comments: Note that most other bootstrap results use conditional almost sure consistency; however, convergence in probability is sufficient for almost all statistical applications. We can use these results for many complicated inference settings. There are continuous mapping results for the bootstrap. There are also Glivenko-Cantelli type results which are needed for some applications. More on the bootstrap will be discussed in Chapter 10.