1 Glivenko-Cantelli type theorems

Size: px

Start display at page:

Download "1 Glivenko-Cantelli type theorems"

Laurence Allen
5 years ago
Views:

1 STA79 Lecture Spring Semester Glivenko-Cantelli type theorems Given i.i.d. observations X,..., X n with unknown distribution function F (t, consider the empirical (sample CDF ˆF n (t = I [Xi t]. n Then as n, ˆF n (t F (t a.s. < t < Without the (i.e. for a fixed t this is just an ordinary LLN for Bernoulli r.v.s The difficult (and usefulness is in the. Notice that F (t = P (X t = P (X (, t], where (, t] can be considered as a set A (indexed by t. And the Glivenko-Cantelli theorem can be rewritten as: A A d[ ˆF n (s F (s] a.s. Does the following convergence hold if A is any Lebesgue measurable set in F? d[ ˆF n (s F (s] a.s. A F A We know the following: ( if F = {(, t], t R}, then the uniform convergence holds; (.5 if F = {(a, b], for any real a < b}, then the uniform convergence holds; (2 if F = { all measurable sets }, then the uniform convergence doesn t holds; (3 if F = Vapnik-Chervonenkis (V-C sets, then the uniform convergence holds. We shall see that the key is F {x, x 2,, x n } should have n k (polynomials many different sets, not exponentially many (2 n.. The proof of Glivenko-Cantelli theorem i.i.d. i.i.d. Suppose X,..., X n F (t, and Y,..., Y n F (t (same CDF. Also assume X s are independent of Y s. Let ˆF n (t = I [Xi t] n and F n(t = n I [Yi t] Step : Symmetrization (See Page 4 of Pollard for details 2P ( ɛ > ; P ( ˆF n (t F (t > ɛ <t< <t< ( ˆF n (t F (t (F n(t F (t > ɛ 2 = 2P ( ˆF n (t Fn(t > ɛ <t< 2

2 STA79 Lecture Spring Semester Since ˆF n (t and Fn(t are piecewise constant functions, thus ˆF n (t Fn(t has at most (2n + different values when < t <. Step 2: Turn infinite many Sup to finite many Max, corresponding to (2n+ different values. 2P ( ˆF n (t Fn(t > ɛ = 2P ( max 2 ˆF n (t F t=t,...,t 2n+ n(t > ɛ 2 <t< 2n+ = 2P ( ˆF n (t i Fn(t i > ɛ 2 2n+ 2 P ( ˆF n (t i Fn(t i > ɛ 2 (By Boole s ineq. Step 3: Hoeffding s Inequality (Pollard, 984 Suppose Y,..., Y n are independent with EY i = (Mean and a i Y b i (bounded then, η >, P ( Y + Y Y n > η 2e 2η 2 n (b i a i 2. Let Y i = n (I [X i t] I [Yi t] then we have n Y i n and E(Y i =. Thus Hoeffding s Inequality can be applied to ˆF n (t i Fn(t i, with η = ɛ 2 2n+ 2 P ( ˆF n (t i Fn(t i > ɛ 2n+ 2 2 ( 2( ɛ 2 2 exp 2 ( 2 n 2 n = (8n + 4e nɛ2 8 as n Remarks: ( The above inequality holds for any ɛ > and any n. So we actually proved P ( <t< ˆF n (t F (t > ɛ (8n + 4e nɛ2 8 ; ( (2 It is worth noting that how fast this bound (8n + 4e nɛ2 8 goes to. For example, n= (8n + 4e nɛ2 8 <, an application of Borel-Cantalli lemma turns this into a.s. convergence. so Glivenko-Cantelli is almost surely converge. This also works if we replace (8n + 4 with any polynomials of n like n k. 2

3 STA79 Lecture Spring Semester.2 Generalizations Many generalizations are possible.. The random variables X, X 2,, X n need only be independent; and do not have to be identically distributed. The limiting distribution is then F n (t = /n F i (t. (The limit is always obtained by replace the random variables by the expectations 2. The constant /n may be replaced by other constants or a sequence of n constants: a, a 2,, a n. The result will be P ( <t< [ a i I[X i t] a i F i (t > ɛ (8n + 4 exp ɛ 2 8 n /a2 i 3. The limit do not have to be distribution functions. Any bounded non random function will do. In particular a sub-distrbution function. ] ; where U i (t = EI [Xi t, δ i =]. t a i I [Xi t, δ i =] U i (t Excercise: Suppose, as n we have <t< n N(t EN(t a.s. and <t< n R(t ER(t a.s. as n. Show that t dn(s t R(s den(s ER(s again, uniformly for those t that ER(t > η >. Furthermore, pose g(t is a function that the integral in the limit below is well defined. Let g n (t be a random sequence of functions that Show that t g n (t g(t a.s.. <t< g n (sdn(s R(s again, uniformly for those t that ER(t > η >. t g(sden(s ER(s 3

4 STA79 Lecture Spring Semester Let F = {A t = (, t], < t < } A t {x,..., x n } = {φ}, {x }, {x x 2 },..., {x...x n } (WLOG assume the x i s are ordered. The number of all subsets of {x,..., x n } is 2 n, but the number of all sets of the form A t {x,..., x n } is (n +. In general, if the number of all sets of the type A {x,..., x n } is a polynomial function in n (i.e. O(n k 2 n, then the sets contained in A is a V-C class of sets. For example, if A = A ab = (a, b], < a < b <, then the number of all sets of type A {x,..., x n } is n(n+ + (including empty set. Therefore the sets of A 2 ab = (a, b] is a V-C class of sets. Claim: If and only if F is a V-C class of sets, then P ( A F I [A] d ˆF n (t I [A] df (t > ɛ.3 Applications In the Cox model, the Breslow estimate of Baseline hazard and Fisher information matrix. ˆΛ (t = t n dn(s n n I [Y i s]e βz i We focus on the denominator. ( P s n I [Yi s]e βz i n P (Y i se βz i > ɛ (8n + 4e nɛ2 8M 2 (Condition : z i M < Where P (Y i s = P (T i sp (C i s = e Λ i(s [ G(s] = e Λ (se βz i [ G(s] Similar for Fisher information matrix n z 2 i I [Yi s]e βz i 4

5 STA79 Lecture Spring Semester The Glivenko-Cantelli can also be formulated for functions. P ( f F f(xd ˆF n (x f(xdf (x > ɛ = P ( f F n f(x i f(xdf (x > ɛ What is the condition on F to make above? V-C class of function: if a function s graph is a V-C class of sets. f(x graph{(x, y f(x > y} More dimensions (example: 2 dimensions The number of all sets of A {x,..., x n } is a polynomial function in n A = rectangles V-C class of sets Hence the Glivenko-Cantelli convergence works in 2 dimensions etc. Homework: Is the following true? Prove if it is true. ˆF n (t a.s. F (t <t x (n If not, what bound instead of x (n will make the convergence hold? Homework: Suppose ˆΛ n (t is the Nelson-Aalen estimator based on n right censored observations, and the Λ(t is the true cumulative hazard. Assume Λ(t is continuous, also assume Λ(t as t. Show that, as n either in probability or almost surely. Any speed? Can we make it t<? ˆΛ n (t Λ(t t M Reference: Pollard D. (984 Convergence of Stochastic Processes. Springer 5

6 STA79 Lecture Spring Semester 2 Empirical Likelihood and Bootstrap The idea of Boostrap: In the correspondence (or the link between that of ˆF n ( (ˆθ n θ, bootstrap apply a random perturb to the ˆF n, and see how (ˆθ n θ change accordingly. Repeat this many times and you have a sampling distribution of (ˆθ n θ. The random perturbation is obtained by a random sampling (or re-sampling to ˆF n. The idea of Empirical Likelihood: In the correspondence of ˆFn ( ˆθ n, EL force the statistic ˆθ n to the value θ, and find the tilted ˆF n that corresponding to this perturbed ˆθ n. We denote the tilted distribution as ˆF λ n for some nonzero λ. It turns out 2 log EL( ˆF λ n EL( ˆF n will have a chi square distribution, a pivatol distribution when θ is the true value of the parameter. Under null hypothesis, the perturbation of ˆθ n to θ is of order / n (usually. In bootstrap, the perturbation of re-sampling to ˆF n is also of order / n. The differrence: the bootstrap is a random perturbation but EL is a fixed perturbation, so bootstrap usually need simulation to repeat many times, and result may be slightly difference due to random re-sample errors. On the other hand, bootstrap can be applied to any statistic, but EL works most successfully for the case ˆθ is NPMLE. (has anyone try it on non-mle? In some setup, it may not be clear how a random perturbation should be applied to the ˆF becausse there are several plausible ways to do it. On the other hand, for EL there is usually clear, and only one way to set ˆθ to θ. Bootstrap needs to estimate a whole distribution (or percentile, and the EL can rely on the fact that the distribution of the likelihood ratio is a pivatol chi square. The introduction of the λ turns the non-parametric problem into a parametric problem. In the new parametric problem, we are estimate the true value of zero, and the information of λ is just the negative second derivative of the log likelihood and the MLE is ˆλ n which has an asymptotic normal distribution. This sub-family of parametric distributions are the so-called least favorable sub-distributions, an idea first proposed by Stone in

7 STA79 Lecture Spring Semester For any (square integrable function φ(t and a distribution F (, define φ(t = φ F (t = φ(udf (u F (t where τ F = {x : F (x < }. (t, τ F ] Theorem have that is Denote the Kaplan-Meier estimator based on n i.i.d. observations as ˆF n. We ˆF φ(ud ˆF n (u φ(udf (u n (t (t, τ ˆFn ] F (t (t, τ F ] φ ˆFn (t φ F (t The convergence is uniformly, almost sure, i.e. φ ˆFn (t φ F (t, a.s. t Theorem Assume φ(t is square integrable with respect to F (t. Then we have [φ(t φ ˆF (t] 2 d ˆF n (t [φ(t φ F (t] 2 df (t 7

8 STA79 Lecture Spring Semester Akritas (2 studied the central limit theorem for the Kaplan-Meier integrals. There are earlier papers about the same topic, but the asymptotic variance expression of Akritas (2 is new and interesting. Theorem (Akritas 2 The asymtotic variance of Kaplan-Meier integrals are ( n AsyV ar φ(td ˆF KM (t = τ [φ(t φ(t] 2 [ F (t]df (t H(t A multivariate version of this theorem can be easily obtained. Denote Φ(t = (φ (t,, φ k (t, then the asymptotic variance-covariance matrix of the k-vector of Kaplan-Meier integrals is ( n AsyV arcov Φ(td ˆF KM (t = [σ ij ], with τ σ ij = [φ i (t φ i (t][φ j (t φ [ F (t]df (t j (t]. H(t This multivariate version can be obtained by using the representation of Akritas (2, his Theorem 6. An easier to check sufficient condition to insure the variance are well defined is τ φ 2 (s df (s <. G(s. When there is no censoring, the Kaplan-Meier estimator become the empirical distribution and the integral with respect to empirical distribution is just the i.i.d. summation (or average. Finally, when there is no censoring H(s = F (s, the covariance formula of Akritas above simplify to the following AsyCov ( n φ i (X u, u= n φ j (X u can be written as τ [φ i (t φ i (t][φ j (t φ [ F (t]df (t j (t]. F (t On the other hand, the said covariance can obviously be written as τ u= [φ i (t Eφ i ][φ j (t Eφ j ]df (t. We, therefore, arrive at the following identity Lemma For function φ i and φ j that are square integrable with respect to F (t we have τ [φ i (t φ i (t][φ j (t φ [ F (t]df (t j (t] F (t = τ [φ i (t Eφ i ][φ j (t Eφ j ]df (t. When either the expectations Eφ i = or Eφ j = or both, the above identity can further be simplified to τ [φ i (t φ i (t][φ j (t φ [ F (t]df (t j (t] F (t = τ [φ i (t][φ j (t]df (t. We comment that this identity holds for any distribution F (, we later will use this when F (t is the Kaplan-Meier distribution. 8

9 STA79 Lecture Spring Semester A general empirical likelihood theorem. For a sample of n independent observations with distribution belongs to a family F n (β here β is the finite dimentional parameter, F n can be nonparametric. If there exist a distribution F n such that F n << F n, that is all distributions are dominated by a single (but can depend on n distribution F n, then empirical likelihood works for test the finite parameter β of the distributions F n. 9

11 Survival Analysis and Empirical Likelihood

11 Survival Analysis and Empirical Likelihood The first paper of empirical likelihood is actually about confidence intervals with the Kaplan-Meier estimator (Thomas and Grunkmeier 1979), i.e. deals with