Asymptotically optimal estimation of smooth functionals for interval censoring, case 2

Ronald Geskus [1] and Piet Groeneboom [2]

Municipal Health Service, Dept. of Public Health and Environment, Nieuwe Achtergracht 100, 1018 WT Amsterdam, and Faculty of Information Technology and Systems, Delft University of Technology, Mekelweg 4, 2628 CD Delft

Abstract

For a version of the interval censoring model, case 2, in which the observation intervals are allowed to be arbitrarily small, we consider estimation of functionals that are differentiable along Hellinger differentiable paths. The asymptotic information lower bound for such functionals can be represented as the squared L_2-norm of the canonical gradient in the observation space. This canonical gradient has an implicit expression as a solution of an integral equation that does not belong to one of the standard types. We study an extended version of the integral equation that can also be used for discrete distribution functions like the nonparametric maximum likelihood estimator (NPMLE), and derive the asymptotic normality and efficiency of the NPMLE from properties of the solutions of the integral equations.

1. rgeskus@gggd.amsterdam.nl
2. p.groeneboom@twi.tudelft.nl

AMS 1991 subject classifications. 60F17, 62E20, 62G05, 62G20, 45A05.
Key words and phrases. Nonparametric maximum likelihood, empirical processes, asymptotic distributions, asymptotic efficiency, integral equations.

1 Introduction

In the interval censoring problem one wants to obtain information on some distribution F, often representing an event time distribution, based on a sample of random intervals J_1, ..., J_n in which unobservable X_1, ..., X_n ~ F are known to be contained. In case 1, we have a sample of observation times T_i and we know whether X_i is smaller or larger than the corresponding observation time T_i. More formally: we observe (T_1, Δ_1), ..., (T_n, Δ_n), with Δ_i = 1_{X_i ≤ T_i}. Case 2 is usually denoted as the situation with two observation times (U_i, V_i) and the information whether X_i is left of U_i, between U_i and V_i, or right of V_i.

For case 1, also denoted by current status data, quite a lot is known. It was already shown in Ayer et al. (1955) and van Eeden (1956) that there exists a one-step procedure for calculation of the nonparametric maximum likelihood estimator (NPMLE) F̂_n of the distribution function F, based on isotonic regression theory. The asymptotic distribution of F̂_n(t_0), for fixed t_0 ∈ IR, is derived in part II, chapter 5 of Groeneboom and Wellner (1992); n^{1/3} is the obtained convergence rate. The same chapter discusses convergence properties of the NPMLE μ(F̂_n) of the mean μ(F), and it is shown that, under some extra conditions,

√n (μ(F̂_n) − μ(F)) →_D N(0, ∫ F(x)[1 − F(x)] / g(x) dx),   as n → ∞,

with g(x) the density of the distribution of the observation times, and →_D denoting convergence in distribution. Huang and Wellner (1995) prove a similar result for functionals that are linear in F: K(F) = ∫ c dF. Then an extra factor c′(x)^2 appears in the integral formula for the limit variance. Groeneboom's proof for the mean uses the convergence rate of the supremum distance between the NPMLE and the underlying distribution function, which is replaced by an easier argument based on L_2-distance properties of F̂_n and smoothness of c′(x)/g(x) in Huang and Wellner's proof.

In case 2 one may expect better estimation results than in case 1, since one has more information on the location of the X_i's. However, both theoretical and practical aspects of the problem are more complicated. Only iterative procedures are available for computation of the NPMLE F̂_n of F. The iterative convex minorant algorithm, as introduced by Groeneboom in part II of Groeneboom and Wellner (1992), converges quickly in computer experiments, and a slight modification of this algorithm is shown to converge in Jongbloed (1998). When considering the asymptotic distribution of the NPMLE F̂_n(t_0), a distinction should be made. If the relative amount of mass of the (U, V)-distribution near the diagonal point (t_0, t_0), compared to the amount of mass of F near t_0, is very small, we are more in a case 1-type situation and we still have an n^{1/3} convergence rate (Wellner (1995)). The limit distribution of F̂_n(t_0) has been established in Groeneboom (1996) for the situation that U and V remain bounded away from each other. If the observation time distribution has sufficient mass along the diagonal, the convergence rate increases to (n log n)^{1/3} (see Groeneboom and Wellner (1992)). There is a conjecture on the asymptotic distribution of F̂_n(t_0) in this situation, but the proof is still incomplete.

Another estimator is F_n^{(1)}(t_0), which is obtained by doing one step in the iterative convex minorant algorithm, with the true underlying distribution F as starting value. Of course, this procedure, which does not even lead to an estimator in the strict sense, has no practical value. However, it may be relevant for theoretical purposes, for the asymptotic distribution of F_n^{(1)}(t_0) is known, and it is conjectured to have the same asymptotic distribution as the NPMLE.
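For readers who want to experiment with the observation scheme just described, the following sketch simulates a case 2 sample (U_i, V_i, Δ_i, Γ_i). The uniform event time distribution and the uniform distribution of (U, V) on the upper triangle are illustrative assumptions of this sketch, not choices made in the paper up to this point.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_case2(n, rng):
    """Simulate case 2 interval-censored data.

    X_i ~ F (here uniform on [0, 1], an illustrative assumption) is never
    observed; instead we see two observation times U_i < V_i together with
    delta_i = 1{X_i <= U_i} and gamma_i = 1{U_i < X_i <= V_i}.
    """
    x = rng.uniform(0.0, 1.0, n)                 # unobservable event times
    a = rng.uniform(0.0, 1.0, (n, 2))
    u, v = a.min(axis=1), a.max(axis=1)          # (U, V) uniform on the upper triangle
    delta = (x <= u).astype(int)
    gamma = ((u < x) & (x <= v)).astype(int)
    return u, v, delta, gamma

u, v, delta, gamma = simulate_case2(1000, rng)
print(delta.mean(), gamma.mean())                # rough check: P(X <= U) and P(U < X <= V)
```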

Contrary to F(t_0), for case 1 as well as case 2, the mean μ(F) is a smooth linear functional. This means that it is differentiable along Hellinger differentiable paths of distributions. For any functional having this property, one can derive a Cramér–Rao-type information lower bound, giving the best possible limit variance that can be attained under √n convergence rate. See e.g. van der Vaart (1991), part I of Groeneboom and Wellner (1992) or Bickel et al. (1993) for the general theory and the application to case 1. For case 1, an explicit expression of the information lower bound can be derived and is given by

∫ c′(x) φ(x) dx,   with φ(x) being the function   φ(x) = c′(x) F(x)[1 − F(x)] / g(x).

So this lower bound is attained by the NPMLE. The function φ appearing in the information lower bound has an analogue in case 2. However, contrary to case 1, an explicit formula for φ is unknown, and unlikely to exist, except for some very special choices of the distributions of X and (U, V) (see Geskus (1997)). Nevertheless it can be proved, without knowing φ explicitly, that the NPMLE K(F̂_n), estimating the smooth functional K(F), shows asymptotically optimal behavior in case 2 as well, using properties of the integral equation for φ.

Just as in the problem of estimation of F(t_0), two situations can be distinguished, showing different behavior. In Geskus and Groeneboom (1996) and Geskus and Groeneboom (1997), together from now on denoted as GG, the simplest case is treated, with the (U, V)-distribution having no mass along the diagonal. Smoothness properties of φ are derived from the structure of the integral equation and are sufficient to give the proof. The technical report Geskus and Groeneboom (1995) contains both papers. See also Geskus (1997), which contains a rather extensive treatment of the application of the general lower bound theory to case 1 and this simple version of case 2.

In the present paper, we treat the situation where the observation times can be arbitrarily close. In the long run we will have observations for which we get very close to observing X itself, a situation that never occurs in case 1 and the case 2 treated in GG. Now the analysis is much more complicated, since we really have to deal with the singularity of the integrand of the integral equation on the diagonal.

Our general outline is as follows. In section 2, after a specification of the problem, the relevant lower bound calculations are given. In section 3 we show asymptotic efficiency of the NPMLE; here we benefit from some recent results on the Hellinger distance between the NPMLE and the underlying distribution function in van de Geer (1996). Section 3 also contains our main result, Theorem 3.2, showing that the NPMLE K(F̂_n) of a smooth functional K(F) of the underlying distribution function F converges at √n-rate, is asymptotically normal and has an asymptotic variance that is equal to the information lower bound. In section 4 we present some simulation results showing that the variance of the NPMLE is already very close to the information lower bound for reasonable sample sizes. In this section we also show some pictures of the solution of the integral equation and its derivative for the case that the distributions are uniform (F uniform and the distribution of (U, V) uniform on the upper triangle of the unit square) and the smooth functional corresponds to the mean. These graphs are compared to graphs of a corresponding solution of the integral equation when F is replaced by the NPMLE (and the equation is transformed into another equation in an inverse scale).
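The explicit case 1 bound quoted above can be evaluated numerically. The sketch below does this for the mean (c(x) = x, so c′ ≡ 1) with F and the observation time density g both uniform on [0, 1]; these concrete choices are assumptions of the illustration only.

```python
import numpy as np

def case1_lower_bound(c_prime, F, g, grid):
    """Case 1 information lower bound: integral of c'(x)^2 F(x)(1 - F(x)) / g(x) dx,
    approximated by a Riemann sum over an equispaced grid."""
    integrand = c_prime(grid) ** 2 * F(grid) * (1.0 - F(grid)) / g(grid)
    return integrand.mean() * (grid[-1] - grid[0])

grid = np.linspace(1e-6, 1.0 - 1e-6, 100_000)
bound = case1_lower_bound(lambda x: np.ones_like(x),   # c(x) = x, so c'(x) = 1 (the mean)
                          lambda x: x,                 # F uniform on [0, 1]
                          lambda x: np.ones_like(x),   # g uniform on [0, 1]
                          grid)
print(bound)   # close to 1/6 = integral of x(1 - x) over [0, 1]
```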

Finally, in the appendix, some technical results are proved that are needed in section 3.

Of course, situations with more than two observation times for each unobservable event time may occur as well. This is usually denoted as the case k situation. However, only the two observation times immediately around the event time give relevant information, so it very much resembles case 2. We will only consider case 2. See also GG, where case k is briefly discussed.

Unless otherwise stated, all norms in this paper should be read as L_2-norms. The dominating measure varies, but is denoted by a subscript, or should be clear from the context. The L_2^0-space denotes the subspace of functions that integrate to zero.

2 The model; lower bound considerations

We will start with a brief formulation of the problem and a list of assumptions that are needed in order to perform the lower bound calculations. Moreover, a key equation on which our analysis will be based is given. A more elaborate introduction, including a derivation of this equation, can be found in GG and Geskus (1997).

2.1 The model; general lower bound theory for smooth functionals

Let F be a distribution function. We are interested in estimation of some functional K(F). However, instead of a sample X_1, ..., X_n ~ F, we are only able to observe the sample (U_1, V_1, Δ_1, Γ_1), ..., (U_n, V_n, Δ_n, Γ_n) with Δ_i = 1_{X_i ≤ U_i} and Γ_i = 1_{U_i < X_i ≤ V_i}. The following model assumptions are made:

(M1) Let K > 0 and let S be a bounded interval ⊂ IR. F is contained in the class

F_S := { F : support(F) ⊂ S, F absolutely continuous, sup_x f(x) ≤ K }.

F is the distribution on which we want to obtain information; however, we do not observe X_i ~ F directly.

(M2) Instead, we observe the pairs (U_i, V_i), with distribution function H. H is contained in H, the collection of all two-dimensional distributions on {(u, v) : u < v}, absolutely continuous with respect to two-dimensional Lebesgue measure and such that (U_i, V_i) is independent of X_i. Let h denote the density of (U_i, V_i), with marginal densities and distribution functions h_1, H_1 and h_2, H_2 for U_i and V_i respectively.

(M3) If both H_1 and H_2 put zero mass on some set A, then F ∈ F_S has zero mass on A as well, so F ≪ H_1 + H_2. This means that F does not have mass on sets in which no observations can occur.

Typically, under (M1), F has support [0, M], and H has its mass on the triangle {(u, v) : 0 ≤ u < v ≤ M}. From now on we will restrict attention to this typical situation. More general choices of the support and the situation with H_1 + H_2 having larger support than F have been treated in Geskus (1997), but are similar in essence. Without condition (M3), the functionals in which we are interested are not well-defined.

Moreover, if F has probability mass on a region in which no observations can occur, one cannot estimate the structure of F on this region consistently. In many survival studies, events do occur outside the domain of observation, making estimation of functionals like the mean an impossible task.

The observable random vectors (U_i, V_i, Δ_i, Γ_i) have density

q_{F,H}(u, v, δ, γ) = h(u, v) F(u)^δ (F(v) − F(u))^γ (1 − F(v))^{1−δ−γ}

with respect to λ_2 × ν_2, where ν_2 denotes counting measure on the set {(0, 1), (1, 0), (0, 0)}. The letter M in conditions (M1) to (M3) stands for model. With respect to the functional to be estimated we assume:

(F1) K is differentiable along Hellinger differentiable paths of distributions from F_S. The canonical gradient for this functional is denoted by κ_F.

What functionals satisfy this pathwise differentiability requirement? An important class of functionals are the functionals that are linear in F,

K(F) = ∫ c(x) dF(x).

All moment functionals F ↦ ∫ x^k dF(x) belong to this class. Estimation of the distribution function at a fixed point concerns a linear functional as well: for K(F) = F(t_0) we have c(x) = 1_{[0,t_0]}(x). In Bickel et al. (1993), proposition A.5.2, it is shown that linear functionals on F_S with sup_{F ∈ F_S} E_F c(X)^2 < ∞ are pathwise differentiable at any F ∈ F_S, with canonical gradient

κ_F(x) = c(x) − ∫ c(x) dF(x).

So if we were able to observe the X_i's directly, we would obtain the information lower bound ‖c(X) − E_F(c(X))‖^2_F. For nonlinear functionals, there is no general method that immediately establishes pathwise differentiability and supplies the formula for the canonical gradient. An example of a nonlinear functional to which our theory can be applied is

K(F) = ∫ F^2(x) w(x) dx.

If the function w is bounded, a slight extension of the proof in Bickel et al. (1993), as given in Geskus (1997), shows that this functional has canonical gradient

κ_F(x) = 2 ∫_{s=x}^M F(s) w(s) ds − 2 ∫_{y=0}^M ∫_{s=y}^M F(s) w(s) ds dF(y).

In order to show that the NPMLE K(F̂_n) asymptotically attains the lower bound, we have to make the following extra assumption:

(F2) K(G) − K(F) = ∫ κ_F(x) d(G − F)(x) + O(‖G − F‖^2_λ),

for all distribution functions G with support contained in [0, M], with λ denoting Lebesgue measure on IR.
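As a small illustration of the displays above, the following sketch evaluates the observation density q_{F,H} and the canonical gradient κ_F(x) = c(x) − ∫ c dF of a linear functional. The uniform F and the triangular density h(u, v) = 2 are assumptions made only for this example.

```python
import numpy as np

def q_FH(u, v, delta, gamma, F, h):
    """Observation density q_{F,H}(u, v, delta, gamma)
    = h(u,v) * F(u)^delta * (F(v) - F(u))^gamma * (1 - F(v))^(1 - delta - gamma)."""
    return h(u, v) * F(u) ** delta * (F(v) - F(u)) ** gamma * (1.0 - F(v)) ** (1 - delta - gamma)

def kappa_linear(c, F_sample, x):
    """Canonical gradient of K(F) = int c dF in the model without censoring:
    kappa_F(x) = c(x) - int c dF, the integral approximated from a Monte Carlo
    sample drawn from F."""
    return c(x) - np.mean(c(F_sample))

# illustrative assumptions: F uniform on [0, 1], h(u, v) = 2 on {0 < u < v < 1}
F = lambda x: np.clip(x, 0.0, 1.0)
h = lambda u, v: np.where(u < v, 2.0, 0.0)
print(q_FH(0.3, 0.7, 0, 1, F, h))          # h * (F(0.7) - F(0.3)) = 2 * 0.4

rng = np.random.default_rng(1)
print(kappa_linear(lambda x: x, rng.uniform(size=100_000), 0.25))   # about 0.25 - 0.5
```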

For linear functionals (F2) holds without the O-term. However, the functional K(F) = ∫ F^2(x) w(x) dx also satisfies (F2).

In the interval censoring model we do not observe X ~ F directly, and the estimand K(F) is only implicitly defined as a functional Θ(Q_{F,H}) on the class of probability measures on the observation space, with H acting as a nuisance parameter. What about differentiability and information lower bounds in this model? The score operator L_1, relating the censoring model to the, unattainable, model without censoring is in our situation:

[L_1 a](u, v, δ, γ) = E[a(X) | U = u, V = v, Δ = δ, Γ = γ]
  = δ ∫_0^u a dF / F(u) + γ ∫_u^v a dF / (F(v) − F(u)) + (1 − δ − γ) ∫_v^M a dF / (1 − F(v))   a.e. [Q_{F,H}].   (2.1)

This operator may be defined on L_2(F), with range in L_2(Q_{F,H}). However, since it relates scores, our main interest lies in the domain L_2^0(F). Then its range is contained in L_2^0(Q_{F,H}). The adjoint of L_1 on L_2^0(Q_{F,H}) can be written as [L_1^* b](x) = E[b(U, V, Δ, Γ) | X = x] and we get

[L_1^* b](x) = ∫_{u=x}^M ∫_{v=u}^M b(u, v, 1, 0) h(u, v) dv du + ∫_{u=0}^x ∫_{v=x}^M b(u, v, 0, 1) h(u, v) dv du
  + ∫_{u=0}^x ∫_{v=u}^x b(u, v, 0, 0) h(u, v) dv du   a.e.-[F].   (2.2)

Now we have pathwise differentiability of Θ(Q_{F,H}) if and only if κ_F ∈ R(L_1^*) and if this holds, then the canonical gradient is the unique element θ_{F,H} in R(L_1)‾ ⊂ L_2^0(Q_{F,H}) satisfying

L_1^* θ_{F,H} = κ_F.   (2.3)

(See van der Vaart (1991).) Many functionals that are pathwise differentiable in the model without censoring lose this property in the interval censoring model. Due to the smoothness of the adjoint operator, any functional K with a canonical gradient that is not a.e. equal to a continuous function cannot be obtained under L_1^*. So not all linear functionals remain pathwise differentiable. For example, K(F) = F(t_0), with canonical gradient 1_{[0,t_0]}(·) − F(t_0), is discontinuous at t_0, and therefore does not belong to the range of L_1^*. This corresponds with F(t_0) not being estimable at √n-rate. However, functionals with a canonical gradient that is sufficiently smooth will be shown to remain differentiable under censoring. Hence for these functionals the information lower bound theory holds.
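A direct transcription of (2.1): the score operator replaces a(X) by its conditional expectation given the censored observation. The sketch below approximates the three conditional means by numerical integration; the concrete F, its density f, and the choice of the score a are illustrative assumptions.

```python
import numpy as np
from scipy.integrate import quad

def score_operator_L1(a, F, f, M=1.0):
    """Return the function [L_1 a](u, v, delta, gamma) of equation (2.1):
    the conditional expectation of a(X) given the censored observation."""
    def L1a(u, v, delta, gamma):
        if delta == 1:                                   # X <= u
            num, _ = quad(lambda x: a(x) * f(x), 0.0, u)
            return num / F(u)
        if gamma == 1:                                   # u < X <= v
            num, _ = quad(lambda x: a(x) * f(x), u, v)
            return num / (F(v) - F(u))
        num, _ = quad(lambda x: a(x) * f(x), v, M)       # X > v
        return num / (1.0 - F(v))
    return L1a

# illustrative assumptions: F uniform on [0, 1], a(x) = x - 1/2 (a mean-zero score)
L1a = score_operator_L1(lambda x: x - 0.5, lambda x: x, lambda x: 1.0)
print(L1a(0.3, 0.7, 0, 1))   # E[X - 1/2 | 0.3 < X <= 0.7] = 0.0 for the uniform
```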

If the canonical gradients θ = θ_{F,H} and κ_F satisfy some extra conditions, the information lower bound ‖θ‖^2_{Q_{F,H}} has an alternative formulation.

Theorem 2.1 Let θ be contained in R(L_1), say θ = L_1 a for some a ∈ L_2^0(F). Assume that the function x ↦ κ_F(x) is differentiable with bounded derivative. Then we have

‖θ‖^2_{Q_{F,H}} = ⟨a, κ_F⟩_F = ∫_0^M κ_F′(x) φ(x) dx,

with φ(x) = ∫_x^M a(t) dF(t).

Proof: See theorem 3.3 in Geskus and Groeneboom (1995).

This theorem holds more generally. However, in the interval censoring model, both case 1 and case 2, we have the extra property that the function

φ(x) := ∫_x^M a(t) dF(t),   with a ∈ L_2^0(F),

also appears explicitly in the score operator L_1. Therefore it will play an important role. It will be called the integrated score function. From its definition we know that φ satisfies φ(0) = φ(M) = 0 and that φ is continuous for F ∈ F_S.

In section 2.2 we will pay attention to the structure of the lower bound. Section 3 will be devoted to showing that the NPMLE Θ̂_n of Θ(Q_{F,H}) satisfies

√n (Θ̂_n − Θ(Q_{F,H})) →_D N(0, ‖θ‖^2_{Q_{F,H}}).

2.2 Lower bounds for interval censoring case 2

We restrict ourselves to the case θ ∈ R(L_1). So the case θ ∈ R(L_1)‾ \ R(L_1) will not be considered. Solvability of the equation

κ_F(x) = [L_1^* L_1 a](x)   a.e. [F]   (2.4)

in the variable a ∈ L_2^0(F) will be investigated. The support of F may consist of a finite number of disjoint intervals. However, (2.4) is not defined on intervals where F does not put mass, and these intervals do not play any further role. So without loss of generality we may assume the support of F to consist of one interval [0, M]. By the structure of the score operator L_1 this can be reformulated as an equation in φ. If we suppose equation (2.4) to hold for all x ∈ [0, M], taking derivatives on both sides yields the following integral equation:

φ(x) + d_F(x) [ ∫_{t=0}^x (φ(x) − φ(t))/(F(x) − F(t)) h(t, x) dt − ∫_{t=x}^M (φ(t) − φ(x))/(F(t) − F(x)) h(x, t) dt ] = k(x) d_F(x),   (2.5)

with d_F(x) being the function

d_F(x) = F(x)[1 − F(x)] / ( h_1(x)[1 − F(x)] + h_2(x) F(x) ),

writing k(x) instead of κ_F′(x). Although k may depend on F, we do not explicitly express this dependence. This is done, since in proving asymptotic efficiency of the NPMLE we have to consider (2.5) for convex combinations (1 − α)F_0 + α F̂_n of the unknown (continuous) distribution F_0 ∈ F_S and the NPMLE F̂_n of F_0, which is purely discrete.
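Equation (2.5) can be explored numerically by collocation on a grid. In the sketch below the difference quotients near the diagonal are cut off at a small ε, in the spirit of the desingularized equation studied in Lemma 2.1 below; the uniform F, the density h(u, v) = 2 on the upper triangle, k ≡ 1 (the mean), the grid size and the cutoff are all assumptions of this sketch, chosen to match the example announced for section 4.

```python
import numpy as np

def solve_phi_uniform(N=400, eps=1e-3):
    """Crude collocation solver for equation (2.5) with F uniform on [0, 1],
    h(u, v) = 2 on {u < v}, and k(x) = 1 (the mean).  Near-diagonal difference
    quotients are cut off at eps, mimicking the desingularized equation (2.11)."""
    x = (np.arange(N) + 0.5) / N          # interior grid of midpoints, F(x) = x here
    dx = 1.0 / N
    h1 = 2.0 * (1.0 - x)                  # marginal density of U
    h2 = 2.0 * x                          # marginal density of V
    dF = x * (1.0 - x) / (h1 * (1.0 - x) + h2 * x)

    # weights w[i, j] = h * dx / max(|F(x_i) - F(x_j)|, eps); both integrals in (2.5)
    # contribute terms of the form w_ij (phi_i - phi_j), so the system is linear in phi
    diff = np.abs(x[:, None] - x[None, :])
    w = 2.0 * dx / np.maximum(diff, eps)
    np.fill_diagonal(w, 0.0)

    A = np.diag(1.0 + dF * w.sum(axis=1)) - dF[:, None] * w
    b = dF                                # right-hand side k(x) d_F(x) with k = 1
    return x, np.linalg.solve(A, b)

x, phi = solve_phi_uniform()
print(phi[0], phi[len(phi) // 2], phi[-1])   # phi should vanish near 0 and 1
print(phi.mean())   # with k = 1 this approximates the lower bound of Theorem 2.1 for the mean
```

With k ≡ 1 the grid average of φ approximates ∫_0^1 k(x) φ(x) dx, the alternative form of the information lower bound given in Theorem 2.1, under the illustrative uniform assumptions above.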

Solvability and structure of the solution to (2.5) will be investigated for such combinations, with k still determined by the underlying distribution F_0 (so k = κ_{F_0}′). Apart from the model conditions (M1) to (M3), some extra conditions will have to be introduced in order to make the proofs in this section possible. For the distributions we assume

(D1) h_1(x) + h_2(x) > 0 for all x ∈ [0, M].

(D2) h(u, v) is continuous. The partial derivatives ∂_{1,t}(x) = ∂/∂x h(x, t) and ∂_{2,t}(x) = ∂/∂x h(t, x) exist, except for at most a finite number of points x, where left and right derivatives with respect to x do exist for each t. The derivatives are bounded, uniformly in t and x.

(D3) F is a nondegenerate distribution function with at most finitely many points of jump x_i ∈ (0, M). Let D = {x_0 = 0, x_1, ..., x_m, x_{m+1} = M} denote the ordered set of jump points of F, augmented with the endpoints of the interval [0, M]. We assume that F is differentiable between jumps, except for at most a finite number of points, where left and right derivatives exist. Everywhere outside D, the derivative is bounded and ≥ c for some c > 0 (so we assume f(x) ≥ c for all x ∈ [0, M]). The set of points of jump may be empty. Note that if F has jumps, we assume that F has derivative ≥ c also on the (non-empty) intervals (0, x_1) and (x_m, M) (where we allow x_1 = x_m, though).

For the functional we should have

(F3) k is differentiable, except for at most a finite number of points x, where left and right derivatives exist. The derivative is bounded, uniformly in x.

Note that the letter D in conditions (D1) to (D3) stands for distribution and the letter F in (F1) to (F3) for functional. Of course, (D2) implies continuity of h_1 and h_2. (D1) is the equivalent of g > 0 in case 1 and is needed. It implies that d_F is bounded.

In case 1, the function φ has an explicit representation of the form

φ(x) = k(x) F(x)[1 − F(x)] / g(x),

whereas in case 2, φ can only be expressed implicitly as a solution to (2.5). If (2.5) is solvable, its solution φ can be shown to contain a factor F(1 − F), just as in case 1. The structure of d_F already suggests this factor to be present. Validity of the factorization is shown by inserting φ = F(1 − F) ξ in (2.5). Some reordering yields an integral equation in ξ, which will be shown to be solvable. This ξ-equation has the following form:

ξ(x) + c_F(x) [ ∫_{t=0}^x (ξ(x) − ξ(t))/(F(x) − F(t)) h̃(t, x) dt − ∫_{t=x}^M (ξ(t) − ξ(x))/(F(t) − F(x)) h̃(x, t) dt ] = k(x) c_F(x),   (2.6)

with c_F(x) given by

c_F(x)^{-1} = ∫_{t=0}^x [1 − F(t)] h(t, x) dt + ∫_{t=x}^M F(t) h(x, t) dt = h_2(x) E[1 − F(U) | V = x] + h_1(x) E[F(V) | U = x]   (2.7)

and

h̃(t, x) = F(t)[1 − F(t)] h(t, x)   if t ≤ x,
h̃(x, t) = F(t)[1 − F(t)] h(x, t)   if x ≤ t.

This equation is similar in structure to the φ-equation. So the lemmas and theorems in the remainder of this section apply to both the φ-equation (2.5) and the ξ-equation (2.6). Most of the proofs will only be given for the φ-equation.

Unlike the situation treated in GG, we now assume that the observation density does have mass along the diagonal. This has the consequence that the integral equation may no longer be a Fredholm integral equation. However, we first consider a desingularized integral equation, to which the theory on Fredholm integral equations of the second kind can be applied (Kress (1989)).

If F has jumps, the solution of the integral equation will in general also have jumps. However, the key observation in analyzing the integral equation and in proving the efficiency of the NPMLE is that, even when F has discontinuities, we can make a change of scale in such a way that the solution of the integral equation can be extended to a Lipschitz function in the transformed scale.

We first introduce some notation. Let G(t) = F^{-1}(t), t ∈ [0, 1], with a derivative g which exists except for at most a finite number of points, where, however, G has left and right derivatives. Furthermore, let

k̄(t) = k(G(t)),   H̄(t, u) = H(G(t), G(u)),   and likewise   h̄(t, u) = h(G(t), G(u)),   (2.8)

and let d̄_F be defined by

d̄_F(t) = t(1 − t) / ( (1 − t) h̄_1(t) + t h̄_2(t) ),   (2.9)

where h̄_i = h_i ∘ G, i = 1, 2. Note that, if F has jumps, d̄_F ≠ d_F ∘ G. Also note that k̄, d̄_F and h̄ are continuous. In a similar way, we define

c̄_F(t)^{-1} = ∫_0^t (1 − s) h̄(s, t) dG(s) + ∫_t^1 s h̄(t, s) dG(s)

and

h̄̃(t, u) = t(1 − t) h̄(t, u)   if t ≤ u,
h̄̃(u, t) = t(1 − t) h̄(u, t)   if u ≤ t.   (2.10)

We have the following lemma.

Lemma 2.1 (i) The integral equation

φ̄_ε(t) = d̄_F(t) [ k̄(t) − ∫_0^t (φ̄_ε(t) − φ̄_ε(t′))/((t − t′) ∨ ε) h̄(t′, t) dG(t′) + ∫_t^1 (φ̄_ε(u) − φ̄_ε(t))/((u − t) ∨ ε) h̄(t, u) dG(u) ]   (2.11)

has a unique continuous solution φ̄_ε, satisfying

inf_{x ∈ [0,M]} d_F(x) k(x) ≤ φ̄_ε(t) ≤ sup_{x ∈ [0,M]} d_F(x) k(x),   (2.12)

for all t ∈ [0, 1] and ε > 0. For points t in the range of F, say t = F(x), we have φ̄_ε(t) = φ_ε(x).

(ii) The integral equation

ξ̄_ε(t) = c̄_F(t) [ k̄(t) − ∫_0^t (ξ̄_ε(t) − ξ̄_ε(t′))/((t − t′) ∨ ε) h̄̃(t′, t) dG(t′) + ∫_t^1 (ξ̄_ε(u) − ξ̄_ε(t))/((u − t) ∨ ε) h̄̃(t, u) dG(u) ]   (2.13)

has a unique continuous solution ξ̄_ε, satisfying

inf_{x ∈ [0,M]} c_F(x) k(x) ≤ ξ̄_ε(t) ≤ sup_{x ∈ [0,M]} c_F(x) k(x),   (2.14)

for all t ∈ [0, 1] and ε > 0. For points t in the range of F, say t = F(x), we have ξ̄_ε(t) = ξ_ε(x).

Proof:

ad (i): By the Fredholm theory, as used e.g. in Geskus and Groeneboom (1996), theorem 5, p. 82, the φ̄_ε-equation (2.11) can be shown to have a unique continuous solution, for each ε > 0. Note that the integration in (2.11) is only with respect to dG(t′) and dG(u) and therefore only involves values belonging to the range of F. So for points t in the range of F we have φ̄_ε(t) = φ_ε(G(t)). Let m_0 = arg min φ̄_ε and s_0 = arg max φ̄_ε. We have

φ̄_ε(s_0) ≤ d̄_F(s_0) k̄(s_0) ≤ sup_{x ∈ [0,M]} d_F(x) k(x),

since φ̄_ε(s_0) ≥ φ̄_ε(t), t ∈ [0, 1], and similarly

φ̄_ε(m_0) ≥ d̄_F(m_0) k̄(m_0) ≥ inf_{x ∈ [0,M]} d_F(x) k(x),

since φ̄_ε(m_0) ≤ φ̄_ε(t), t ∈ [0, 1]. Hence we have (2.12).

ad (ii): The argument is completely similar to the argument given for (i).

The following lemma is the crux of the proof of the existence of the solution to the original integral equation.

Lemma 2.2 The functions φ̄_ε are Lipschitz on [0, 1], uniformly in ε > 0.

Proof: We will use similar notation as in Lemma 2.1. Let x_1, ..., x_m be the points of jump of F and let x_0 = 0, x_{m+1} = M. Furthermore, let τ_i = F(x_i), i = 0, ..., m+1. For i = 0, ..., m, the interval [τ_i, τ_{i+1}] can be divided into two parts:

(1) the interval [τ_i, τ_i′), where τ_i′ = F(x_{i+1}−). The interval [τ_i, τ_i′) corresponds to the interval [x_i, x_{i+1}) in the original scale. The function G is strictly increasing and differentiable on the interval (τ_i, τ_i′), and is right and left differentiable at τ_i and τ_i′ respectively.

(2) the interval [τ_i′, τ_{i+1}]. This interval corresponds to the jump of F at x_{i+1}. Here the function G is constant, again having right and left derivatives at the respective endpoints.

11 If i = m, the second interval only consists of the point 1. Let D = τ,..., τ m+1 τ,..., τ m discontinuity points of k (t), d F (t), 1 t (t) = t h(t, t ) for t t, and 2 u(t) = t h(u, t) for t u. Then φ ɛ (t) is differentiable for t / D, and has left and right derivatives for t D. Using k(t) t φ ɛ(t) φ ɛ(t ) (t t ) ɛ h(t, t) dg(t ) + 1 t φ ɛ(u) φ ɛ(t) (u t) ɛ = φ ɛ (t)/ d F (t) = ξ ɛ (t)[(1 t) h 1 (t) + t h 2 (t)], and using left or right dervatives when t D, we have: φ ɛ (t) = d F (t) ξ ɛ (t) [(1 t) h 1 (t) + t h 2 (t)] + d F (t) k (t) t φ ɛ(t) φ ɛ(t ) (t t ) ɛ 1 t h(t, t) dg(t ) φ ɛ(u) φ ɛ(t) (u t) ɛ + t d F (t) φ ɛ (t) t :t t t t φ ɛ(t) φ ɛ(t ) >ɛ d F (t) φ ɛ (t) ɛ 1 t t ɛ + (t t ) 2 d H(t, t) h(t, u) dg(u) h(t, t u) dg(u) φ ɛ (t) u t φ ɛ(u) φ ɛ(t) u:u t>ɛ t+ɛ h(t, t)g(t ) dt + t d H(t, u) (u t) 2 h(t, u)g(u) du. (2.15) Note that H(t, t u) = h(t, u)g(t) and similarly for the other partial derivative of H. Moving the terms containing φ ɛ to the left-hand side of (2.15), shows that φ ɛ (t) has a finite upper bound, using Lemma 2.1. Moreover, φ ɛ is piecewise continuous on the closed intervals from one point in D to the subsequent one. So φ ɛ attains a maximum value, which may be a right or left derivative. The rest of the proof is devoted to showing that this maximum value is uniform in ɛ. def Let M ɛ = sup t [,1] φ ɛ (t) and suppose that φ ɛ attains its supremum at a point s. Note that M ɛ, since φ ɛ () = φ ɛ (1) = and φ ɛ is continuous. Then, if < t < s ɛ, φ ɛ (s) s t Likewise, if 1 > t > s + ɛ, we get φ ɛ(s) φ ɛ(t) (s t) 2 φ ɛ(s) t s s t φ ɛ (s) φ ɛ (u) du φ ɛ(t) φ ɛ(s) (t s) 2. (s t) 2. So these parts work in the opposite direction, and are harmless in (2.15). Now let K ɛ (t) be defined by K ɛ (t) def = d F (t) k (t) + d F (t) ξ ɛ (t) [(1 t) h 1 (t) + t h 2 (t)] t d F (t) φ ɛ(t) φ ɛ(t ) (t t ) ɛ 1 t h(t, t) dg(t ) t φ ɛ(u) φ ɛ(t) (u t) ɛ h(t, t u) dg(u) 11

12 and let C ɛ (t) be defined by C ɛ (t) def = 1 + d F (t) ɛ 1 t Then we have implying t ɛ h(t, t)g(t ) dt + In a similar way, if m ɛ def = inf t [,1] φ ɛ(t), we get t+ɛ t h(t, u)g(u) du, t [, 1]. (2.16) φ ɛ(s) C ɛ (s) K ɛ (s), (2.17) M ɛ sup K ɛ (t)/c ɛ (t). (2.18) t [,1] m ɛ Let the function A δ be defined by t A δ (t) def = d F (t) h(t t, t) dg(t) + t δ Fix δ > such that, for all t [, 1], inf K ɛ(t)/c ɛ (t). (2.19) t [,1] t+δ Note that δ > can be chosen independently of ɛ >, since t t h(t, u) dg(u), t [, 1] A δ (t)/c ɛ (t) 1 2. (2.2) lim C ɛ (t) = d F (t) h(t, t)g(t), t (, 1). ɛ Then we get from (2.2), for each t [, 1], by applying the mean value theorem on the ratios φ ɛ (t) φ ɛ (t )/(t t ) and φ ɛ (u) φ ɛ (t)/(u t), t d F (t) φ ɛ(t) φ ɛ(t ) h(t t+δ t, t) dg(t ) + φ ɛ(u) φ ɛ(t) h(t, t u) dg(u) /C ɛ (t) t δ (t t ) ɛ A δ (t) maxm ɛ, m ɛ /C ɛ (t) 1 2 maxm ɛ, m ɛ. Defining B δ (t) by B δ (t) we get, for t [, 1], t (u t) ɛ def = d F (t) k (t) + d F (t) [(1 t) h 1 (t) + t h 2 (t)] sup + 2 d F (t) δ sup d F (t ) k(t ) t [,1] sup t [,t] d F (t) k (t) + d F (t) ξ ɛ (t) [(1 t) h 1 (t) + t h 2 (t)] + d F (t) t δ φ ɛ(t) + φ ɛ(t ) t t t h(t, t) dg(t ) + d F (t) k (t) + d F (t) ξ ɛ (t) [(1 t) h 1 (t) + t h 2 (t)] + 2 d F (t) δ B δ (t) c, sup φ ɛ (t ) t [,1] t δ t h(t, t) dg(t ) + 12 t [,1] t h(t, t) + sup 1 t+δ 1 t+δ c F (t ) k(t ) u [t,1] φ ɛ(t) + φ ɛ(u) u t t h(t, u) dg(u) h(t, t u) h(t, t u) dg(u) (2.21)

13 for some constant c, independent of ɛ and t. Hence, for each t [, 1], implying φ ɛ (t) A δ(t)/c ɛ (t) + B δ (t)/c ɛ (t) 1 2 maxm ɛ, m ɛ + B δ (t)/c ɛ (t), 1 2 maxm ɛ, m ɛ sup B δ (t)/c ɛ (t) sup c/c ɛ (t) c, (2.22) t [,1] t [,1] for some constant c independent of ɛ. Hence φ ɛ (t) is bounded on [, 1], uniformly in ɛ and t, implying that φ ɛ is Lipschitz, uniformly in ɛ >. We now have the following theorem. Theorem 2.2 Let G(t) = F 1 (t), t [, 1], with a derivative g which exists except for at most a finite number of points, where G has left and right derivatives. Furthermore, let k(t) = k(g(t)), H(t, u) = H(G(t), G(u)), h(t, u) = h(g(t), G(u)), and let df be defined by where h i = h i G, i = 1, 2. Then (i) The integral equation t φ(t) = d F (t) k(t) d F (t) def = t(1 t) (1 t) h 1 (t) + t h 2 (t), (2.23) 1 φ(t) φ(t ) t t d H(t, t) + t has a unique solution which is Lipschitz on [, 1]. φ(u) φ(t) u t d H(t, u), t [, 1], (2.24) (ii) The Lipschitz norm in (i) has the following upper bound. Let C(t) be defined by Moreover, let A δ (t) and B δ (t) be defined by t A δ (t) def = d F (t) h(t t, t) dg(t ) + t δ and C(t) def = d F (t)g(t) h(t, t). (2.25) t+δ t t h(t, u) dg(u), (2.26) B δ (t) def = d F (t) k (t) + d F (t) [(1 t) h 1 (t) + t h 2 (t)] sup + 2 d F (t) δ sup d F (t ) k(t ) t [,1] sup t [,t] t [,1] t h(t, t) + sup c F (t ) k(t ) u [t,1] t h(t, u) (2.27) At the points in D = discontinuity points of g(t), augmented with and 1 discontinuity points of k (t), d F (t), 1 t (t) = h(t, t t ) for t t, and 2 u (t) = t h(u, t) for t u, 13

A_δ and B_δ have two versions, one corresponding to taking left derivatives and one corresponding to taking right derivatives. Then there exists a δ > 0 such that

sup_{t ∈ [0,1]} A_δ(t)/C(t) ≤ 1/2,

and we have

|φ̄(u) − φ̄(t)| ≤ c (u − t),   0 ≤ t < u ≤ 1,   (2.28)

where c is given by

c = 2 sup_{t ∈ [0,1]} B_δ(t)/C(t).   (2.29)

(iii) The integral equation (2.5) has a unique solution φ.

Proof:

ad (i): By the preceding two lemmas, the set {φ̄_ε : ε ≤ ε_0} (for some ε_0 > 0) is bounded and equicontinuous. Hence, by the Arzelà–Ascoli theorem, each sequence φ̄_{ε_n}, ε_n ↓ 0, has a subsequence (φ̄_{ε_m}), converging in the supremum metric to a continuous function φ̄ on [0, 1]. By Lebesgue's dominated convergence theorem we get, for such a subsequence (φ̄_{ε_m}),

φ̄(x) = lim_{m→∞} φ̄_{ε_m}(x) = d̄_F(x) [ k̄(x) − ∫_0^x (φ̄(x) − φ̄(t))/(x − t) h̄(t, x) dG(t) + ∫_x^1 (φ̄(t) − φ̄(x))/(t − x) h̄(x, t) dG(t) ].   (2.30)

Uniqueness of the solution follows in the same way as in lemma 2.1.

ad (ii): It was shown in (2.22) in the proof of Lemma 2.2 that

sup_{t ∈ [0,1]} |φ̄_ε′(t)| ≤ 2 sup_{t ∈ [0,1]} B_δ(t)/C_ε(t),

where C_ε is defined by (2.16). But since

lim_{ε ↓ 0} C_ε(t) = d̄_F(t) h̄(t, t) g(t),

for t ∈ [0, 1], (2.28) now follows.

ad (iii): We define φ by φ(x) = φ̄(F(x)). If t = F(x), we get, by a change of variables,

φ(x) = φ̄(t) = d̄_F(t) [ k̄(t) − ∫_0^t (φ̄(t) − φ̄(t′))/(t − t′) dH̄(t′, t) + ∫_t^1 (φ̄(u) − φ̄(t))/(u − t) dH̄(t, u) ]
  = d_F(x) [ k(x) − ∫_0^x (φ(x) − φ(x′))/(F(x) − F(x′)) dH(x′, x) + ∫_x^M (φ(y) − φ(x))/(F(y) − F(x)) dH(x, y) ],

and hence φ satisfies the original integral equation. Uniqueness of φ follows from uniqueness of φ̄ (since a solution φ conversely defines a solution φ̄ on the inverse scale).

Remark. The same arguments can be applied to prove existence of a solution to the ξ-equation. Hence φ can be written as φ = F(1 − F) ξ.

Solvability of κ_F = L_1^* L_1 a can now immediately be seen.

Corollary 2.1 The equation κ_F = L_1^* L_1 a is solvable.

Proof: By the Lipschitz property of φ̄ we have, for any 0 ≤ x < y ≤ M,

|φ(y) − φ(x)| / (F(y) − F(x)) = |φ̄(F(y)) − φ̄(F(x))| / (F(y) − F(x)) ≤ K,

for some constant K. Thus the Radon–Nikodym derivative dφ/dF is a.e.-[F] bounded by K. For the canonical gradient we get, if t < u,

θ_F(t, u, δ, γ) = [L_1 a](t, u, δ, γ) = −δ φ(t)/F(t) − γ (φ(u) − φ(t))/(F(u) − F(t)) + (1 − δ − γ) φ(u)/(1 − F(u)).   (2.31)

3 Asymptotic efficiency of the NPMLE

In this section we will denote the unknown distribution function of the unobservable random variables X_i by F_0. As in section 2, we will assume that F_0 is continuous. Let F̂_n be the NPMLE of F_0, based on the sample of observations (U_1, V_1, Δ_1, Γ_1), ..., (U_n, V_n, Δ_n, Γ_n). It is obtained by maximizing the likelihood

∏_{i=1}^n F(U_i)^{Δ_i} (F(V_i) − F(U_i))^{Γ_i} (1 − F(V_i))^{1−Δ_i−Γ_i} h(U_i, V_i)   (3.1)

over the class of piecewise constant right-continuous (sub-)distribution functions on [0, M], having jumps only at a subset of the points U_i and V_i, i = 1, ..., n. The properties of the function thus obtained are discussed in Groeneboom and Wellner (1992) and GG. A rather important property of the NPMLE is that it does not depend on h(U_i, V_i), so we do not have to perform any preliminary density estimation or bandwidth choice. The fact that we do not have to solve a bandwidth problem is one of the great advantages of the nonparametric maximum likelihood approach in the present problem. By the restriction that F̂_n only has mass at the observation times, also made in Groeneboom and Wellner (1992), we get a piecewise constant function. Let x_0 = 0, x_{m+1} = M and let x_1 < ... < x_m be the points of jump of F̂_n in the interval (0, M). Then F̂_n satisfies the following properties:

Proposition 3.1 Any function σ that is constant on the same intervals J_i = [x_{i−1}, x_i) as F̂_n satisfies

∫ σ(t) 1_{t ∈ J_i} [ δ/F̂_n(t) − γ/(F̂_n(u) − F̂_n(t)) ] dQ_n(t, u, δ, γ) + ∫ σ(u) 1_{u ∈ J_i} [ γ/(F̂_n(u) − F̂_n(t)) − (1 − δ − γ)/(1 − F̂_n(u)) ] dQ_n(t, u, δ, γ) = 0

for i = 2, ..., m.

Proof: See Groeneboom and Wellner (1992), part II, proposition 1.3, and Geskus and Groeneboom (1997), corollary 1.
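The authors compute F̂_n with the iterative convex minorant algorithm; as a simple, slower stand-in for experimentation, the sketch below maximizes the h-free part of the likelihood (3.1) by Turnbull-type self-consistency (EM) steps, restricting the mass to the observation times as in the text. The EM route and the default M = 1 are assumptions of this illustration, not the paper's algorithm.

```python
import numpy as np

def npmle_case2(u, v, delta, gamma, M=1.0, iters=500):
    """Self-consistency (EM) iterations for the NPMLE of F from case 2 data.
    Mass is restricted to the observation times (plus M)."""
    support = np.unique(np.concatenate([u, v, [M]]))
    # each observation carries the interval (0, U], (U, V] or (V, M]
    left = np.where(delta == 1, 0.0, np.where(gamma == 1, u, v))
    right = np.where(delta == 1, u, np.where(gamma == 1, v, M))
    # A[i, j] = 1 if support point j lies in the interval implied by observation i
    A = (support[None, :] > left[:, None]) & (support[None, :] <= right[:, None])
    p = np.full(len(support), 1.0 / len(support))
    for _ in range(iters):
        w = A * p[None, :]
        w /= w.sum(axis=1, keepdims=True)      # E-step: conditional mass within each interval
        p = w.mean(axis=0)                     # M-step: average the conditional masses
    return support, np.cumsum(p)

# usage with the simulated sample from the introduction sketch:
# support, Fhat = npmle_case2(u, v, delta, gamma)
# masses = np.diff(np.concatenate([[0.0], Fhat]))
# mean_hat = np.sum(masses * support)          # K(Fhat) for the mean
```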

Proposition 3.2 Prob{ lim_{n→∞} ‖F̂_n − F_0‖_∞ = 0 } = 1.

Proof: See Groeneboom and Wellner (1992), part II, sections 4.1 (case 1) and 4.3 (case 2).

Proposition 3.3 ‖F̂_n − F_0‖_{H_i} = O_p(n^{−1/3} (log n)^{1/6}) as n → ∞, for i = 1, 2.

Proof: See Geskus and Groeneboom (1997), corollary 2, p. 29, and van de Geer (1996), example 3.2.

The following result is needed in the proof of Lemma 3.1.

Proposition 3.4 lim_{n→∞} Pr{ F̂_n is defective } = 0.

Proof: See Geskus and Groeneboom (1997), proposition 1, p. 26. Although the conditions on H are different there, the proof is the same, since the difference in conditions has no bearing on this particular property.

In addition to the smoothness conditions (D1) to (D3), given in section 2, we assume

(D4) h(t, t) = lim_{u ↓ t} h(t, u) ≥ c > 0, for all t ∈ (0, M) and some c > 0.

Like in GG, our definition of the canonical gradient θ will be extended to piecewise constant distribution functions F with finitely many discontinuities, based on the solution φ_F of a discrete version of the integral equation (2.5). (In order to stress dependence on F, we will write φ_F instead of φ.) However, since F(v) − F(u) no longer remains bounded away from zero on the region where H puts mass, we have to use an approach different from the one in GG. On one hand, the quotient

(φ_F(v) − φ_F(u)) / (F(v) − F(u)),

for u and v in the same interval of constancy of F, can only be defined correctly if φ_F is constant on the same interval. On the other hand, d_F, h and κ_F in general are not constant on these intervals, making a completely discrete version of the integral equation impossible. Therefore, instead of one function φ_F we now need a pair of functions (φ_F, ψ_F), satisfying

φ_F(x) = d_F(x) [ k(x) − ∫_0^x r_F(t, x) h(t, x) dt + ∫_x^M r_F(x, t) h(x, t) dt ],   (3.2)

where r_F(t, u) is defined by

r_F(t, u) = (φ_F(u) − φ_F(t)) / (F(u) − F(t))   if F(t) < F(u),
r_F(t, u) = (ψ_F(u) − ψ_F(t)) / (F_0(u) − F_0(t))   if F(t) = F(u), t < u,   (3.3)

and where φ_F is constant on the same intervals as F.
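Returning to Propositions 3.2 and 3.3 above for a moment: their consistency statements can be eyeballed numerically with the simulation and NPMLE sketches given earlier (the hypothetical helpers simulate_case2 and npmle_case2, in which F_0 is uniform on [0, 1] by assumption).

```python
import numpy as np

def sup_distance_check(ns=(100, 400, 1600), seed=4):
    """Empirical look at Proposition 3.2: sup |Fhat_n - F_0| for growing n,
    using the simulate_case2 and npmle_case2 sketches (F_0(x) = x there)."""
    rng = np.random.default_rng(seed)
    out = {}
    for n in ns:
        u, v, d, g = simulate_case2(n, rng)
        s, Fhat = npmle_case2(u, v, d, g)
        out[n] = np.max(np.abs(Fhat - s))      # distance to F_0 at the support points
    return out

# print(sup_distance_check())   # the sup distance should decrease with n
```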

Since φ_F is constant on the same intervals as F, the only real integral part is the ψ_F-part; the remaining part of the integral can be written as a summation. The key to the proof of the existence of a solution pair (φ_F, ψ_F), and also to the other proofs in this section, is a representation of the equation for φ_F on an inverse scale and the construction of a continuous extension of the equation for φ_F on this inverse scale (similar techniques were used in section 2). Using a similar notation as in section 2, we denote by G the inverse of F, where, for purely discrete distribution functions F, we take the right-continuous version of the inverse, defined by

G(t) = inf{ x ∈ [0, M] : F(x) > t },   t ≥ 0.

Furthermore, we define k̄_F = k ∘ G, h̄_{1,F} = h_1 ∘ G, h̄_{2,F} = h_2 ∘ G and

d̄_F(t) = t(1 − t) / ( (1 − t) h̄_{1,F}(t) + t h̄_{2,F}(t) ),

and likewise H̄(t, u) = H(G(t), G(u)), 0 ≤ t ≤ u ≤ 1. For part (iii) of theorem 3.1 we will also need the following notation:

Δ_i(g) := ∫_{x_i}^{x_{i+1}} g(t) dt,   (3.4)

Δ_{ij}(h) := ∫_{u=x_i}^{x_{i+1}} ∫_{v=x_j}^{x_{j+1}} h(u, v) dv du,   (3.5)

d_i := z_i(1 − z_i) / ( Δ_i(h_1)(1 − z_i) + Δ_i(h_2) z_i ).   (3.6)

The following theorem shows the existence of the solution pair. Moreover it gives a uniform Lipschitz condition for the functions φ̄_F and ψ_F, which will be a crucial tool in showing the Donsker property for θ̄_F.

Theorem 3.1 Let the following conditions on F_0, H and κ_{F_0} be satisfied: (M1) to (M3); (D1) to (D4); (F1) to (F3). Furthermore, let F_{[0,M]} be the set of discrete non-defective distribution functions on [0, M] with finitely many points of jump, contained in (0, M). Then there exists an ε > 0 such that, for F ∈ F̃, where F̃ is defined by

F̃ = { F ∈ F_{[0,M]} : sup_{x ∈ [0,M]} |F(x) − F_0(x)| ≤ ε },

(i) There exists a unique Lipschitz function φ̄_F : [0, 1] → IR such that, for t ∈ [0, 1] \ D,

φ̄_F(t) = d̄_F(t) [ k̄_F(t) − ∫_{[0,t)} (φ̄_F(t) − φ̄_F(t′))/(t − t′) dH̄(t′, t) + ∫_{(t,1]} (φ̄_F(u) − φ̄_F(t))/(u − t) dH̄(t, u) ],   (3.7)

where D is the (finite) set of discontinuities of the right-continuous inverse G = F^{-1} in (0, 1), augmented with 0 and 1. The function φ̄_F is Lipschitz, uniformly for F ∈ F̃.

(ii) There exists a pair (φ_F, ψ_F), solving the integral equation (3.2), where φ_F is absolutely continuous with respect to F and the function ψ_F is Lipschitz on each interval between jumps of F, uniformly for F ∈ F̃, with a Lipschitz norm not depending on the interval.

(iii) Let z_i = F(x_i) and y_i = φ_F(x_i), i = 1, ..., m. Then, using the definitions (3.4) to (3.6), we have that the vector y = (y_1, ..., y_m) is the unique solution of the set of linear equations

y_i [ 1/d_i + Σ_{j<i} Δ_{ji}(h)/(z_i − z_j) + Σ_{j>i} Δ_{ij}(h)/(z_j − z_i) ] = Δ_i(k) + Σ_{j<i} Δ_{ji}(h)/(z_i − z_j) y_j + Σ_{j>i} Δ_{ij}(h)/(z_j − z_i) y_j,   i = 1, ..., m.   (3.8)

Theorem 3.1 will be proved by approximating the purely discrete distribution function F by the functions F_α = (1 − α)F_0 + α F and by studying the behavior of the corresponding functions φ̄_{F_α}, as α ↑ 1. The rather technical proof is given in the Appendix.

By Theorem 3.1, the definition of the function θ_F can be extended to piecewise constant distribution functions F ∈ F̃ by defining

θ̄_F(t, u, δ, γ) = −δ φ_F(t)/F(t) − γ r_F(t, u) + (1 − δ − γ) φ_F(u)/(1 − F(u)),   (3.9)

where φ_F and ψ_F solve equation (3.2), and where φ_F(t)/F(t) and φ_F(u)/(1 − F(u)) are defined to be zero if F(t) = 0 or if 1 − F(u) = 0, respectively. Note that θ̄_F no longer has an interpretation as canonical gradient. In the sequel we will write Q_F instead of Q_{F,H}. We are now ready to formulate our main result.

Theorem 3.2 Let the conditions of Theorem 3.1 be satisfied. Then

√n (K(F̂_n) − K(F_0)) →_D N(0, ‖θ_{F_0}‖^2_{Q_{F_0}})   as n → ∞.   (3.10)

Proof: Note that it is sufficient to show the following:

√n (K(F̂_n) − K(F_0)) = √n ∫ θ_{F_0} d(Q_n − Q_{F_0}) + o_p(1).   (3.11)

Moreover, using the uniform consistency of F̂_n (see Proposition 3.2), we may assume that F̂_n ∈ F̃, for all large n, where F̃ is defined as in Theorem 3.1. The proof consists of the following steps.

I. By conditions (D1) and (F2), and Proposition 3.3 we have

√n (K(F̂_n) − K(F_0)) = √n ∫ κ_{F_0} d(F̂_n − F_0) + o_p(1).

II. In lemma 3.1 the following will be shown:

∫ κ_{F_0} d(F − F_0) = −∫ θ̄_F dQ_{F_0},   if F ∈ F̃.

III. Unlike the situation in GG, φ_{F̂_n} is constant on the same intervals as F̂_n. Since γ = 0 if F̂_n(u) = F̂_n(t), Proposition 3.1 can be used to obtain ∫ θ̄_{F̂_n} dQ_n = 0, yielding

−√n ∫ θ̄_{F̂_n} dQ_{F_0} = √n ∫ θ̄_{F̂_n} d(Q_n − Q_{F_0}).

IV. This is further split into

√n ∫ θ̄_{F̂_n} d(Q_n − Q_{F_0}) = √n ∫ θ_{F_0} d(Q_n − Q_{F_0}) + √n ∫ (θ̄_{F̂_n} − θ_{F_0}) d(Q_n − Q_{F_0}).

The last term will be shown to be o_p(1) (Theorem 3.3).
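Theorem 3.2 lends itself to an informal Monte Carlo check under the illustrative uniform setup of the earlier sketches: compare n times the sampling variance of K(F̂_n), for K the mean, with the lower bound ∫ k(x) φ(x) dx computed from the integral-equation sketch after (2.5). All names below (simulate_case2, npmle_case2, phi, x) refer to those hypothetical sketches, not to anything defined in the paper.

```python
import numpy as np

def mc_variance_of_mean(n=500, reps=200, seed=2):
    """Monte Carlo estimate of n * Var(K(Fhat_n)) for K(F) the mean,
    using simulate_case2 and npmle_case2 from the earlier sketches."""
    rng = np.random.default_rng(seed)
    est = np.empty(reps)
    for r in range(reps):
        u, v, d, g = simulate_case2(n, rng)
        s, Fhat = npmle_case2(u, v, d, g)
        masses = np.diff(np.concatenate([[0.0], Fhat]))
        est[r] = np.sum(masses * s)
    return n * est.var()

# compare with the lower bound phi.mean() computed in the sketch after equation (2.5):
# print(mc_variance_of_mean(), phi.mean())
```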

Lemma 3.1 Let F̃ be defined as in Theorem 3.1. Then we have, under the conditions of Theorem 3.1, for all F ∈ F̃,

∫ κ_{F_0} d(F − F_0) = −∫ θ̄_F dQ_{F_0}.

Proof: Let, for any distribution function F, L_F : L_2(F) → L_2(Q_F) denote the conditional expectation operator

[L_F a](u, v, δ, γ) = δ ∫_0^u a dF / F(u) + γ ∫_u^v a dF / (F(v) − F(u)) + (1 − δ − γ) ∫_v^M a dF / (1 − F(v))   a.e. [Q_F],

with adjoint L^*, given by the conditional expectation

[L^* b](x) = E[b(U, V, Δ, Γ) | X = x]   a.e.-F.

Since the adjoint is an expectation, conditionally on the value of the random variable X ~ F, its structure does not depend on F. F only determines where it has to be defined (the a.e.-F part). Still, a ∈ L_2^0(F) implies L_F(a) ∈ L_2^0(Q_F). The ratios occurring in θ̄_F, F ∈ F̃, are bounded, since, by Theorem 3.1, φ̄_F and ψ_F are Lipschitz functions, if F ∈ F̃. Hence θ̄_F ∈ L_2(Q_F), for F ∈ F̃. Let 1 ∈ L_2(F) denote the constant function 1(x) ≡ 1, x ∈ IR. Under L_F this transforms into the constant function 1(t, u, δ, γ) ≡ 1 on L_2(Q_F). Now we have

∫ θ̄_F dQ_{F_0} = ⟨θ̄_F, 1⟩_{Q_{F_0}} = ⟨θ̄_F, L_{F_0}(1)⟩_{Q_{F_0}} = ⟨L^*(θ̄_F), 1⟩_{F_0} = ∫ L^*(θ̄_F) dF_0.

If we can prove L^*(θ̄_F) = κ_{F_0} − ∫ κ_{F_0} dF a.e.-F_0, we are done. This is shown as follows. Recall that the integral equation was obtained by taking derivatives on both sides of the equation κ_{F_0}(x) = [L^* θ_{F_0}](x) for all x ∈ [0, M]. Now we will go the other way, integrate, but replace θ_{F_0} by θ̄_F, obtaining

[L^* θ̄_F](x) = κ_{F_0}(x) + C   for all x ∈ [0, M].

For the constant C we have, using that F is non-defective,

C = ∫ C dF = ∫ L^*(θ̄_F) dF − ∫ κ_{F_0} dF.

It is easily shown that θ̄_F is contained in L_2^0(Q_F). (However, it is not contained in R(L_F), because of the part ψ_F of the solution pair (φ_F, ψ_F).) Now we have

⟨L^*(θ̄_F), 1⟩_F = ⟨θ̄_F, L_F(1)⟩_{Q_F} = ⟨θ̄_F, 1⟩_{Q_F} = 0.

Hence C = −∫ κ_{F_0} dF, which proves the claim.

The hard part of the proof of Theorem 3.2 is to show that

√n ∫ (θ̄_{F̂_n} − θ_{F_0}) d(Q_n − Q_{F_0}) = o_p(1).   (3.12)

We will prove the Donsker property for the middle part γ r_F of θ̄_F and only indicate the very similar (simpler) proofs for the two other parts of θ̄_F at the end of the proof of Theorem 3.3. Since the proof is rather involved, we first sketch the general ideas.

We start by defining a neighborhood F_n, shrinking with n, such that the probability that F̂_n belongs to F_n tends to 1, as n → ∞. Let, for F ∈ F̃,

q_F(t, u, δ, γ) = δ F(t) + γ {F(u) − F(t)} + (1 − γ − δ) {1 − F(u)}   (3.13)

and

q_{F_0}(t, u, δ, γ) = δ F_0(t) + γ {F_0(u) − F_0(t)} + (1 − δ − γ) {1 − F_0(u)}.   (3.14)

It is proved in van de Geer (1996) that

h^2(q_{F̂_n}, q_{F_0}) = O_p(n^{−2/3} (log n)^{1/3}),   (3.15)

where h(q_F, q_{F_0}) is the Hellinger distance between the densities q_F and q_{F_0} w.r.t. the product of the measure induced by H and counting measure on {0, 1}^2 \ {(1, 1)}. The result (3.15) has already been used above, since Proposition 3.3 is based on it.
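The squared Hellinger distance appearing in (3.15) can be approximated directly from (3.13)–(3.14): sum over the three cells of (δ, γ) and integrate over (u, v) against H. The sketch below does this by Monte Carlo; the ½ normalization h²(p, q) = ½ ∫ (√p − √q)² dμ, and the concrete F, F_0 and H, are assumptions of this illustration.

```python
import numpy as np

def hellinger_sq(F, F0, u, v):
    """Monte Carlo approximation of h^2(q_F, q_F0) for case 2, with (u, v) a sample
    from H.  q_F(t, u, delta, gamma) = delta*F(t) + gamma*(F(u)-F(t))
    + (1-delta-gamma)*(1-F(u)), as in (3.13)-(3.14)."""
    qF = np.stack([F(u), F(v) - F(u), 1.0 - F(v)])        # the three (delta, gamma) cells
    qF0 = np.stack([F0(u), F0(v) - F0(u), 1.0 - F0(v)])
    return 0.5 * np.mean(np.sum((np.sqrt(qF) - np.sqrt(qF0)) ** 2, axis=0))

rng = np.random.default_rng(3)
a = rng.uniform(size=(100_000, 2))
u, v = a.min(axis=1), a.max(axis=1)                       # (U, V) uniform on the triangle
F0 = lambda x: x                                          # true F_0 uniform (illustration)
F = lambda x: x + 0.05 * np.sin(2 * np.pi * x)            # a nearby distribution function
print(hellinger_sq(F, F0, u, v))
```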

Now, if F_n is the set of distribution functions F ∈ F̃ satisfying

h^2(q_F, q_{F_0}) ≤ n^{−2/3} log n,   (3.16)

we have:

Pr{ F̂_n ∈ F_n } → 1,   as n → ∞.

In fact, the upper bound n^{−2/3} log n, defining the class F_n, could be replaced by c_n n^{−2/3} (log n)^{1/3}, where we only need c_n → ∞, as n → ∞, but we are a little bit wasteful with our powers of log n in an attempt to avoid an accumulation of constants in the upper bounds.

We need to study properties of the empirical integrals Q_n(θ̄_F − θ_{F_0})^2 and Q_n(θ̄_F − θ̄_G)^2, for F, G ∈ F_n. The denominators in θ̄_F, θ_{F_0} and θ̄_G can be arbitrarily close to zero. If F ∈ F̃, then F(u) − F(t) will be zero on a region of positive Lebesgue measure, in which case we get for the middle part γ r_F of θ̄_F:

γ r_F(t, u) = γ {ψ_F(u) − ψ_F(t)} / {F_0(u) − F_0(t)}.

We will face these difficulties by considering three regions of integration:

C_{n,η}(F) = { w : q_F(w) > η q_{F_0}(w), q_{F_0}(w) > n^{−1/3} },   (3.17)

D_η(F) = { w : q_F(w) ≤ η q_{F_0}(w) },   (3.18)

and

C_n(F) = { w : q_{F_0}(w) ≤ n^{−1/3} },   (3.19)

for some η ∈ (0, 1), where the elements w of the sets defined above are of the form w = (t, u, δ, γ). On the region C_{n,η}(F), θ̄_F has a behavior which is comparable to the behavior of θ_{F_0}; on the other regions we just use the uniform boundedness of θ̄_F and the fact that the integrals over these regions become sufficiently small.

For the entropy calculations we shall use ratios r_{F_k,G_k,φ̄_k} of the form

{ φ̄_k(G_k(u)) − φ̄_k(F_k(t)) } / { G_k(u) − F_k(t) },

where F_k and G_k are distribution functions such that F_k ≤ F ≤ G_k ((F_k, G_k) is a bracket for F) and where φ̄_k is a Lipschitz function approximating φ̄_F. In this way the good behavior of the ratios r_F on the region C_{n,η}(F) is preserved on the same region by the approximating ratio r_{F_k,G_k,φ̄_k}. Next we will apply the chaining lemma, using a chain with a kind of funnel structure, preserving this good behavior on the region C_{n,η}(F). Note that the approximating ratios are outside the original class of ratios r_F. In dealing with the region D_η(F), defined by (3.18), we will need the functions

g_F = (q_F − q_{F_0}) / (q_F + q_{F_0}),   (3.20)

for which Lemma A.1 in van de Geer (1996) holds. This lemma, specialized to our situation, is given below for easy reference.

22 Lemma 3.2 Let, for F F n, S n (F ) and G n be defined by S n (F ) = q F >σ n g 2 F dq n 1/2 and G n = g F 1 qf>σn : F F n, where (we take) σ n = n 1/3. Let (ρ n ) n 1 be a non-decreasing sequence of real numbers 1. Then, given < ν < 2 and < C <, there exists an < L < depending on ν and C, such that for τ n n 1 ν 2+ν 2+ν ρn, for all n, we have S n (F ) lim sup n P r h(q F, q F ) (Lτ n ) 8 for some F F n lim sup n 4P r sup δ> ( δ ρ n ) ν H(δ, Gn, Q n ) > C where H(δ, G n, Q n ) denotes the δ-entropy of G n for the L 2 -distance w.r.t. Q n., We will also need the following two lemmas. Lemma 3.3 (i) Let the function a n be defined by a n = 1 qf >n 1/3 / q2 F. Then Q n a n = O p (log n) (3.21) (ii) Let the function b n be defined by b n = 1 qf n 1/3. Then Q n b n = O p (n 2/3 ) (3.22) Proof: Both statements are simple consequences of the Markov inequality: P r Q n a n > c log n (c log n) 1 Q F a n = (c log n) 1 q F >n 1/3 q 2 F dq F = c 1 O(1), and, likewise, P r Q n b n > c n 2/3 c 1 n 2/3 Q F b n = c 1 n 2/3 q F n 1/3 dq F = c 1 O(1). 22

23 Lemma 3.4 Let, for η (, 1), the set D η (F ) be defined by (3.18). Then sup Q n D η (F ) = O p (n 2/3 log n). (3.23) F F n Proof: We have: Q n D η (F ) (1 η) 2 qf q F q F D η(f ) 2 dqn 4(1 η) 2 Q n qf q F q F +q F 2, (3.24) where we use in the first step that q F q F q F = q F 1 q F (1 η)q F, if q F η q F. Furthermore, by Lemmas 3.2 and 3.3, qf q 2 sup F F F n q F >n 1/3 q F +q dqn F = sup gf 2 dq n = O p (n 2/3 log n). (3.25) F F n q F >n 1/3 Relation (3.25) follows from Lemma 3.2, by taking ν = 1, τ n = n 1/3 (log n) 1/2 and ρ n = log n. The metric entropy H(δ, G n, Q n ) of the class G n, with the L 2 -distance with respect to Q n, is then O p (δ 1 (log n) 1/2 ) uniformly in δ >. This is seen by first noting that q F q F q F + q F = 2q F q F + q F 1, and next that, for two distribution functions F 1 and F 2, q F1 q F2 q F1 + q F q F2 + q F q F 1 q F2. q F By the results of Birman and Solomjak (1967) or Ball and Pajor (199), applied on classes of uniformly bounded monotone functions, the δ-entropy of the class of functions q F for the L 2 -distance w.r.t. the probability measure Q n, defined by Q n (B) = B q F >n 1/3 q 2 F dq n / q F >n 1/3 q 2 F dq n, is O(δ 1 ), implying H(δ, G n, Q n ) = O p (δ 1 (log n) 1/2 ), using part (i) of Lemma 3.3. The entropy results needed here can also be found in van der Vaart and Wellner (1996), Theorem 2.7.5, p. 159; Theorem 2.6.9, p. 142 and Example , p. 149, where, in fact, log N(ɛ, F, L 2 (Q)) K/ɛ follows from Theorem by taking F in the latter to be the class of indicators 1 [,t] : t IR, with V = 2. Since, by g F 1 and part (ii) of Lemma 3.3, qf q 2 sup F F F n q F n 1/3 q F +q dqn F dq n O p (n 2/3 ), q F n 1/3 23

24 we get, in combination with (3.25), sup F F n The result now follows from (3.24) and (3.26). qf q F q F +q F 2 dqn = O p (n 2/3 log n). (3.26) From now on we will concentrate on the behavior of the middle part of θ F. Consider triples (F k, G k, φ k ), where F k and G k are distribution functions belonging to F and φ k belongs to a uniform class of Lipschitz functions on [, 1], with the same uniform Lipschitz norm c Lip and upper bound as the functions φ F, F F, of Theorem 3.1. For these triples we define r Fk,G k, φ k (t, u) = φ k (G k (u)) φ k (F k (t)) G k (u) F k (t), if G k (u) > F k (t), (3.27) and, for pairs (t, u) such that F k (t) = G k (u), we define r Fk,G k, φ k (t, u) =. Moreover, we define the semi-metric d n ((F k, G k, φ k ), (F l, G l, φ l )) = 2 max φ k (t) φ 1 l (t) t [,1] F (u) F (t)>n 1/3 F (u) F γ dq (t) 2 n 1/2 2 + Fk (t) F l (t) F (u) F (t) γ dqn + F (u) F (t)>n 1/3 F (u) F (t)>n 1/3 We now have the following result. 1/2 Gk (u) G l (u) F (u) F (t) 2 γ dqn 1/2 (3.28) Lemma 3.5 Let, for distribution functions F and G such that F G, the set C n,η (F, G) be defined by C n,η (F, G) = (t, u) : F (u) F (t) > n 1/3, G(u) F (t) ηf (u) F (t). (3.29) Then we have for all pairs of distribution functions (F k, G k ) and (F l, G l ) such that F k G k and F l G l, Q n (r Fk,G k, φ k r Fl,G l, φ l ) 2 γ1 Cn,η(F k,g k ) C n,η(f l,g l ) C 2 d n ((F k, G k, φ k ), (F l, G l, φ l )) 2, where r F,G, φ is defined by (3.27), and where C > is a constant, only depending on η (, 1) and the Lipschitz norm c Lip, corresponding to the uniform Lipschitz class of functions φ F, F F. Proof: If G k (u) F k (t) > and G l (u) F l (t) >, we have the decomposition = r Fk,G k, φ k (t, u) r Fl,G l, φ l (t, u)γ φk (G k (u)) φ k (F k (t)) G k (u) F k (t) = φ k (G k (u)) φ k (F k (t)) G k (u) F k (t) φ l (G l (u)) φ l (G l (t)) G l (u) F l (t) γ G l (u) F l (t) G k (u) (F k (t) + φ k (G k (u)) φ k (F k (t)) φ l (G l (u)) φ l (F l (t)) 24 γ G l (u) F l (t) γ G l (u) F l (t).

25 Hence Q n (r Fk,G k, φ k r Fl,G l, φ l ) 2 γ1 Cn,η(F k,g k ) C n,η(f l,g l ) c 2 Lip η 2 +η 2 c 2 Lip η 2 +2η 2 +4c 2 Lip η 2 C 2 +C 2 C n,η(f k,g k ) C n,η(f l,g l ) C n,η(f k,g k ) C n,η(f l,g l ) C n,η(f k,g k ) C n,η(f l,g l ) C n,η(f k,g k ) C n,η(f l,g l ) C n,η(f k,g k ) C n,η(f l,g l ) C n,η(f k,g k ) C n,η(f l,g l ) C n,η(f k,g k ) C n,η(f l,g l ) G l (u) F l (t) (G k (u) F k (t)) 2 (F (u) F (t)) 2 γ dq n φ k (G k (u)) φ k (F k (t)) ( φ l (G l (u)) φ l (F l (t))) 2 (F (u) F (t) 2 γ dq n G l (u) F l (t) (G k (u) (F k (t)) 2 (F (u) F (t)) 2 γ dq n φ k (G l (u)) φ k (F l (t)) ( φ l (G l (u)) φ l (F l (t))) 2 (F (u) F (t)) 2 γ dq n (G k (u) G l (u)) 2 +(F k (t) F l (t)) 2 (F (u) F (t)) 2 γ dq n (G k (u) G l (u)) 2 +(F k (t) F l (t)) 2 (F (u) F (t)) 2 γ dq n ( φ k (G l (u)) φ l (G l (u))) 2 +( φ k (F l (t)) φ l (F l (t))) 2 (F (u) F (t)) 2 γ dq n C 2 d n ((F k, G k, φ k ), (F l, G l, φ l )) 2, (3.3) where C > is a constant, only depending on η and the Lipschitz norm c Lip, corresponding to the uniform Lipschitz class of functions φ F, F F. Note that we can also apply Lemma 3.5 in the approximation of a ratio r F, by making the identification r F = r F,F, and by noting that C φf n,η (F, F ) = C n,η (F ). For the set of functions F n we now have the following theorem which finishes the proof of Theorem 3.2. Theorem 3.3 Let F n be the set of distribution functions F F, defined by (3.16). Then we have, under the conditions of Theorem 3.2, for each ɛ >, P r sup n(q n Q F )( θ F θ F ) > ɛ, as n. (3.31) F F n Proof: We will (using the notation of Pollard (1984), page 15) denote the empirical process n(q n Q F ) by E n and the symmetrized empirical process E n by En. Fix an (arbitrary) ɛ >. By the symmetrization lemma we have P r E n (r F r F )γ > ɛ for some F F n 4P r En (r F r F )γ > 1 4 ɛ for some F F n. Let ɛ > and η (, 1) be fixed in the following. We are going to show that P r En(r F r F )γ > 1 4 ɛ for some F F n ξ n, as n, (3.32) for all ξ n = ((T 1, U 1, δ 1, γ 1 ),..., (T n, U n, δ n, γ n )), 25


More information

2 (Bonus). Let A X consist of points (x, y) such that either x or y is a rational number. Is A measurable? What is its Lebesgue measure?

2 (Bonus). Let A X consist of points (x, y) such that either x or y is a rational number. Is A measurable? What is its Lebesgue measure? MA 645-4A (Real Analysis), Dr. Chernov Homework assignment 1 (Due 9/5). Prove that every countable set A is measurable and µ(a) = 0. 2 (Bonus). Let A consist of points (x, y) such that either x or y is

More information

3 Integration and Expectation

3 Integration and Expectation 3 Integration and Expectation 3.1 Construction of the Lebesgue Integral Let (, F, µ) be a measure space (not necessarily a probability space). Our objective will be to define the Lebesgue integral R fdµ

More information

INVERSE FUNCTION THEOREM and SURFACES IN R n

INVERSE FUNCTION THEOREM and SURFACES IN R n INVERSE FUNCTION THEOREM and SURFACES IN R n Let f C k (U; R n ), with U R n open. Assume df(a) GL(R n ), where a U. The Inverse Function Theorem says there is an open neighborhood V U of a in R n so that

More information

APPLICATIONS OF DIFFERENTIABILITY IN R n.

APPLICATIONS OF DIFFERENTIABILITY IN R n. APPLICATIONS OF DIFFERENTIABILITY IN R n. MATANIA BEN-ARTZI April 2015 Functions here are defined on a subset T R n and take values in R m, where m can be smaller, equal or greater than n. The (open) ball

More information

1/12/05: sec 3.1 and my article: How good is the Lebesgue measure?, Math. Intelligencer 11(2) (1989),

1/12/05: sec 3.1 and my article: How good is the Lebesgue measure?, Math. Intelligencer 11(2) (1989), Real Analysis 2, Math 651, Spring 2005 April 26, 2005 1 Real Analysis 2, Math 651, Spring 2005 Krzysztof Chris Ciesielski 1/12/05: sec 3.1 and my article: How good is the Lebesgue measure?, Math. Intelligencer

More information

Introduction to Empirical Processes and Semiparametric Inference Lecture 13: Entropy Calculations

Introduction to Empirical Processes and Semiparametric Inference Lecture 13: Entropy Calculations Introduction to Empirical Processes and Semiparametric Inference Lecture 13: Entropy Calculations Michael R. Kosorok, Ph.D. Professor and Chair of Biostatistics Professor of Statistics and Operations Research

More information

Lebesgue measure and integration

Lebesgue measure and integration Chapter 4 Lebesgue measure and integration If you look back at what you have learned in your earlier mathematics courses, you will definitely recall a lot about area and volume from the simple formulas

More information

NONPARAMETRIC CONFIDENCE INTERVALS FOR MONOTONE FUNCTIONS. By Piet Groeneboom and Geurt Jongbloed Delft University of Technology

NONPARAMETRIC CONFIDENCE INTERVALS FOR MONOTONE FUNCTIONS. By Piet Groeneboom and Geurt Jongbloed Delft University of Technology NONPARAMETRIC CONFIDENCE INTERVALS FOR MONOTONE FUNCTIONS By Piet Groeneboom and Geurt Jongbloed Delft University of Technology We study nonparametric isotonic confidence intervals for monotone functions.

More information

Lecture Notes in Advanced Calculus 1 (80315) Raz Kupferman Institute of Mathematics The Hebrew University

Lecture Notes in Advanced Calculus 1 (80315) Raz Kupferman Institute of Mathematics The Hebrew University Lecture Notes in Advanced Calculus 1 (80315) Raz Kupferman Institute of Mathematics The Hebrew University February 7, 2007 2 Contents 1 Metric Spaces 1 1.1 Basic definitions...........................

More information

Equilibria with a nontrivial nodal set and the dynamics of parabolic equations on symmetric domains

Equilibria with a nontrivial nodal set and the dynamics of parabolic equations on symmetric domains Equilibria with a nontrivial nodal set and the dynamics of parabolic equations on symmetric domains J. Földes Department of Mathematics, Univerité Libre de Bruxelles 1050 Brussels, Belgium P. Poláčik School

More information

3 (Due ). Let A X consist of points (x, y) such that either x or y is a rational number. Is A measurable? What is its Lebesgue measure?

3 (Due ). Let A X consist of points (x, y) such that either x or y is a rational number. Is A measurable? What is its Lebesgue measure? MA 645-4A (Real Analysis), Dr. Chernov Homework assignment 1 (Due ). Show that the open disk x 2 + y 2 < 1 is a countable union of planar elementary sets. Show that the closed disk x 2 + y 2 1 is a countable

More information

Real Analysis Qualifying Exam May 14th 2016

Real Analysis Qualifying Exam May 14th 2016 Real Analysis Qualifying Exam May 4th 26 Solve 8 out of 2 problems. () Prove the Banach contraction principle: Let T be a mapping from a complete metric space X into itself such that d(tx,ty) apple qd(x,

More information

Green s Functions and Distributions

Green s Functions and Distributions CHAPTER 9 Green s Functions and Distributions 9.1. Boundary Value Problems We would like to study, and solve if possible, boundary value problems such as the following: (1.1) u = f in U u = g on U, where

More information

Functional Analysis. Franck Sueur Metric spaces Definitions Completeness Compactness Separability...

Functional Analysis. Franck Sueur Metric spaces Definitions Completeness Compactness Separability... Functional Analysis Franck Sueur 2018-2019 Contents 1 Metric spaces 1 1.1 Definitions........................................ 1 1.2 Completeness...................................... 3 1.3 Compactness......................................

More information

An elementary proof of the weak convergence of empirical processes

An elementary proof of the weak convergence of empirical processes An elementary proof of the weak convergence of empirical processes Dragan Radulović Department of Mathematics, Florida Atlantic University Marten Wegkamp Department of Mathematics & Department of Statistical

More information

1. Bounded linear maps. A linear map T : E F of real Banach

1. Bounded linear maps. A linear map T : E F of real Banach DIFFERENTIABLE MAPS 1. Bounded linear maps. A linear map T : E F of real Banach spaces E, F is bounded if M > 0 so that for all v E: T v M v. If v r T v C for some positive constants r, C, then T is bounded:

More information

2 A Model, Harmonic Map, Problem

2 A Model, Harmonic Map, Problem ELLIPTIC SYSTEMS JOHN E. HUTCHINSON Department of Mathematics School of Mathematical Sciences, A.N.U. 1 Introduction Elliptic equations model the behaviour of scalar quantities u, such as temperature or

More information

MATHS 730 FC Lecture Notes March 5, Introduction

MATHS 730 FC Lecture Notes March 5, Introduction 1 INTRODUCTION MATHS 730 FC Lecture Notes March 5, 2014 1 Introduction Definition. If A, B are sets and there exists a bijection A B, they have the same cardinality, which we write as A, #A. If there exists

More information

t y n (s) ds. t y(s) ds, x(t) = x(0) +

t y n (s) ds. t y(s) ds, x(t) = x(0) + 1 Appendix Definition (Closed Linear Operator) (1) The graph G(T ) of a linear operator T on the domain D(T ) X into Y is the set (x, T x) : x D(T )} in the product space X Y. Then T is closed if its graph

More information

2 Statement of the problem and assumptions

2 Statement of the problem and assumptions Mathematical Notes, 25, vol. 78, no. 4, pp. 466 48. Existence Theorem for Optimal Control Problems on an Infinite Time Interval A.V. Dmitruk and N.V. Kuz kina We consider an optimal control problem on

More information

Brownian motion. Samy Tindel. Purdue University. Probability Theory 2 - MA 539

Brownian motion. Samy Tindel. Purdue University. Probability Theory 2 - MA 539 Brownian motion Samy Tindel Purdue University Probability Theory 2 - MA 539 Mostly taken from Brownian Motion and Stochastic Calculus by I. Karatzas and S. Shreve Samy T. Brownian motion Probability Theory

More information

The Arzelà-Ascoli Theorem

The Arzelà-Ascoli Theorem John Nachbar Washington University March 27, 2016 The Arzelà-Ascoli Theorem The Arzelà-Ascoli Theorem gives sufficient conditions for compactness in certain function spaces. Among other things, it helps

More information

Lecture Notes on PDEs

Lecture Notes on PDEs Lecture Notes on PDEs Alberto Bressan February 26, 2012 1 Elliptic equations Let IR n be a bounded open set Given measurable functions a ij, b i, c : IR, consider the linear, second order differential

More information

Efficient Semiparametric Estimators via Modified Profile Likelihood in Frailty & Accelerated-Failure Models

Efficient Semiparametric Estimators via Modified Profile Likelihood in Frailty & Accelerated-Failure Models NIH Talk, September 03 Efficient Semiparametric Estimators via Modified Profile Likelihood in Frailty & Accelerated-Failure Models Eric Slud, Math Dept, Univ of Maryland Ongoing joint project with Ilia

More information

1 Local Asymptotic Normality of Ranks and Covariates in Transformation Models

1 Local Asymptotic Normality of Ranks and Covariates in Transformation Models Draft: February 17, 1998 1 Local Asymptotic Normality of Ranks and Covariates in Transformation Models P.J. Bickel 1 and Y. Ritov 2 1.1 Introduction Le Cam and Yang (1988) addressed broadly the following

More information

A LOCALIZATION PROPERTY AT THE BOUNDARY FOR MONGE-AMPERE EQUATION

A LOCALIZATION PROPERTY AT THE BOUNDARY FOR MONGE-AMPERE EQUATION A LOCALIZATION PROPERTY AT THE BOUNDARY FOR MONGE-AMPERE EQUATION O. SAVIN. Introduction In this paper we study the geometry of the sections for solutions to the Monge- Ampere equation det D 2 u = f, u

More information

Math 118B Solutions. Charles Martin. March 6, d i (x i, y i ) + d i (y i, z i ) = d(x, y) + d(y, z). i=1

Math 118B Solutions. Charles Martin. March 6, d i (x i, y i ) + d i (y i, z i ) = d(x, y) + d(y, z). i=1 Math 8B Solutions Charles Martin March 6, Homework Problems. Let (X i, d i ), i n, be finitely many metric spaces. Construct a metric on the product space X = X X n. Proof. Denote points in X as x = (x,

More information

1 Glivenko-Cantelli type theorems

1 Glivenko-Cantelli type theorems STA79 Lecture Spring Semester Glivenko-Cantelli type theorems Given i.i.d. observations X,..., X n with unknown distribution function F (t, consider the empirical (sample CDF ˆF n (t = I [Xi t]. n Then

More information

L p Spaces and Convexity

L p Spaces and Convexity L p Spaces and Convexity These notes largely follow the treatments in Royden, Real Analysis, and Rudin, Real & Complex Analysis. 1. Convex functions Let I R be an interval. For I open, we say a function

More information

Preservation Theorems for Glivenko-Cantelli and Uniform Glivenko-Cantelli Classes

Preservation Theorems for Glivenko-Cantelli and Uniform Glivenko-Cantelli Classes Preservation Theorems for Glivenko-Cantelli and Uniform Glivenko-Cantelli Classes This is page 5 Printer: Opaque this Aad van der Vaart and Jon A. Wellner ABSTRACT We show that the P Glivenko property

More information

Economics 204 Summer/Fall 2011 Lecture 5 Friday July 29, 2011

Economics 204 Summer/Fall 2011 Lecture 5 Friday July 29, 2011 Economics 204 Summer/Fall 2011 Lecture 5 Friday July 29, 2011 Section 2.6 (cont.) Properties of Real Functions Here we first study properties of functions from R to R, making use of the additional structure

More information

On a Class of Multidimensional Optimal Transportation Problems

On a Class of Multidimensional Optimal Transportation Problems Journal of Convex Analysis Volume 10 (2003), No. 2, 517 529 On a Class of Multidimensional Optimal Transportation Problems G. Carlier Université Bordeaux 1, MAB, UMR CNRS 5466, France and Université Bordeaux

More information

OPTIMALITY CONDITIONS AND ERROR ANALYSIS OF SEMILINEAR ELLIPTIC CONTROL PROBLEMS WITH L 1 COST FUNCTIONAL

OPTIMALITY CONDITIONS AND ERROR ANALYSIS OF SEMILINEAR ELLIPTIC CONTROL PROBLEMS WITH L 1 COST FUNCTIONAL OPTIMALITY CONDITIONS AND ERROR ANALYSIS OF SEMILINEAR ELLIPTIC CONTROL PROBLEMS WITH L 1 COST FUNCTIONAL EDUARDO CASAS, ROLAND HERZOG, AND GERD WACHSMUTH Abstract. Semilinear elliptic optimal control

More information

Statistical inference on Lévy processes

Statistical inference on Lévy processes Alberto Coca Cabrero University of Cambridge - CCA Supervisors: Dr. Richard Nickl and Professor L.C.G.Rogers Funded by Fundación Mutua Madrileña and EPSRC MASDOC/CCA student workshop 2013 26th March Outline

More information

ASYMPTOTIC EQUIVALENCE OF DENSITY ESTIMATION AND GAUSSIAN WHITE NOISE. By Michael Nussbaum Weierstrass Institute, Berlin

ASYMPTOTIC EQUIVALENCE OF DENSITY ESTIMATION AND GAUSSIAN WHITE NOISE. By Michael Nussbaum Weierstrass Institute, Berlin The Annals of Statistics 1996, Vol. 4, No. 6, 399 430 ASYMPTOTIC EQUIVALENCE OF DENSITY ESTIMATION AND GAUSSIAN WHITE NOISE By Michael Nussbaum Weierstrass Institute, Berlin Signal recovery in Gaussian

More information

Recall that if X is a compact metric space, C(X), the space of continuous (real-valued) functions on X, is a Banach space with the norm

Recall that if X is a compact metric space, C(X), the space of continuous (real-valued) functions on X, is a Banach space with the norm Chapter 13 Radon Measures Recall that if X is a compact metric space, C(X), the space of continuous (real-valued) functions on X, is a Banach space with the norm (13.1) f = sup x X f(x). We want to identify

More information

at time t, in dimension d. The index i varies in a countable set I. We call configuration the family, denoted generically by Φ: U (x i (t) x j (t))

at time t, in dimension d. The index i varies in a countable set I. We call configuration the family, denoted generically by Φ: U (x i (t) x j (t)) Notations In this chapter we investigate infinite systems of interacting particles subject to Newtonian dynamics Each particle is characterized by its position an velocity x i t, v i t R d R d at time

More information

P(E t, Ω)dt, (2) 4t has an advantage with respect. to the compactly supported mollifiers, i.e., the function W (t)f satisfies a semigroup law:

P(E t, Ω)dt, (2) 4t has an advantage with respect. to the compactly supported mollifiers, i.e., the function W (t)f satisfies a semigroup law: Introduction Functions of bounded variation, usually denoted by BV, have had and have an important role in several problems of calculus of variations. The main features that make BV functions suitable

More information

An invariance result for Hammersley s process with sources and sinks

An invariance result for Hammersley s process with sources and sinks An invariance result for Hammersley s process with sources and sinks Piet Groeneboom Delft University of Technology, Vrije Universiteit, Amsterdam, and University of Washington, Seattle March 31, 26 Abstract

More information

Real Analysis, 2nd Edition, G.B.Folland Elements of Functional Analysis

Real Analysis, 2nd Edition, G.B.Folland Elements of Functional Analysis Real Analysis, 2nd Edition, G.B.Folland Chapter 5 Elements of Functional Analysis Yung-Hsiang Huang 5.1 Normed Vector Spaces 1. Note for any x, y X and a, b K, x+y x + y and by ax b y x + b a x. 2. It

More information

MATH MEASURE THEORY AND FOURIER ANALYSIS. Contents

MATH MEASURE THEORY AND FOURIER ANALYSIS. Contents MATH 3969 - MEASURE THEORY AND FOURIER ANALYSIS ANDREW TULLOCH Contents 1. Measure Theory 2 1.1. Properties of Measures 3 1.2. Constructing σ-algebras and measures 3 1.3. Properties of the Lebesgue measure

More information

Introduction to Real Analysis Alternative Chapter 1

Introduction to Real Analysis Alternative Chapter 1 Christopher Heil Introduction to Real Analysis Alternative Chapter 1 A Primer on Norms and Banach Spaces Last Updated: March 10, 2018 c 2018 by Christopher Heil Chapter 1 A Primer on Norms and Banach Spaces

More information

X n D X lim n F n (x) = F (x) for all x C F. lim n F n(u) = F (u) for all u C F. (2)

X n D X lim n F n (x) = F (x) for all x C F. lim n F n(u) = F (u) for all u C F. (2) 14:17 11/16/2 TOPIC. Convergence in distribution and related notions. This section studies the notion of the so-called convergence in distribution of real random variables. This is the kind of convergence

More information

56 4 Integration against rough paths

56 4 Integration against rough paths 56 4 Integration against rough paths comes to the definition of a rough integral we typically take W = LV, W ; although other choices can be useful see e.g. remark 4.11. In the context of rough differential

More information

Analysis II: The Implicit and Inverse Function Theorems

Analysis II: The Implicit and Inverse Function Theorems Analysis II: The Implicit and Inverse Function Theorems Jesse Ratzkin November 17, 2009 Let f : R n R m be C 1. When is the zero set Z = {x R n : f(x) = 0} the graph of another function? When is Z nicely

More information

Lecture 4. f X T, (x t, ) = f X,T (x, t ) f T (t )

Lecture 4. f X T, (x t, ) = f X,T (x, t ) f T (t ) LECURE NOES 21 Lecture 4 7. Sufficient statistics Consider the usual statistical setup: the data is X and the paramter is. o gain information about the parameter we study various functions of the data

More information

MINIMAL GRAPHS PART I: EXISTENCE OF LIPSCHITZ WEAK SOLUTIONS TO THE DIRICHLET PROBLEM WITH C 2 BOUNDARY DATA

MINIMAL GRAPHS PART I: EXISTENCE OF LIPSCHITZ WEAK SOLUTIONS TO THE DIRICHLET PROBLEM WITH C 2 BOUNDARY DATA MINIMAL GRAPHS PART I: EXISTENCE OF LIPSCHITZ WEAK SOLUTIONS TO THE DIRICHLET PROBLEM WITH C 2 BOUNDARY DATA SPENCER HUGHES In these notes we prove that for any given smooth function on the boundary of

More information

Asymptotic normality of the L k -error of the Grenander estimator

Asymptotic normality of the L k -error of the Grenander estimator Asymptotic normality of the L k -error of the Grenander estimator Vladimir N. Kulikov Hendrik P. Lopuhaä 1.7.24 Abstract: We investigate the limit behavior of the L k -distance between a decreasing density

More information

On continuous time contract theory

On continuous time contract theory Ecole Polytechnique, France Journée de rentrée du CMAP, 3 octobre, 218 Outline 1 2 Semimartingale measures on the canonical space Random horizon 2nd order backward SDEs (Static) Principal-Agent Problem

More information

Some lecture notes for Math 6050E: PDEs, Fall 2016

Some lecture notes for Math 6050E: PDEs, Fall 2016 Some lecture notes for Math 65E: PDEs, Fall 216 Tianling Jin December 1, 216 1 Variational methods We discuss an example of the use of variational methods in obtaining existence of solutions. Theorem 1.1.

More information

REVIEW OF DIFFERENTIAL CALCULUS

REVIEW OF DIFFERENTIAL CALCULUS REVIEW OF DIFFERENTIAL CALCULUS DONU ARAPURA 1. Limits and continuity To simplify the statements, we will often stick to two variables, but everything holds with any number of variables. Let f(x, y) be

More information

Score Statistics for Current Status Data: Comparisons with Likelihood Ratio and Wald Statistics

Score Statistics for Current Status Data: Comparisons with Likelihood Ratio and Wald Statistics Score Statistics for Current Status Data: Comparisons with Likelihood Ratio and Wald Statistics Moulinath Banerjee 1 and Jon A. Wellner 2 1 Department of Statistics, Department of Statistics, 439, West

More information

S chauder Theory. x 2. = log( x 1 + x 2 ) + 1 ( x 1 + x 2 ) 2. ( 5) x 1 + x 2 x 1 + x 2. 2 = 2 x 1. x 1 x 2. 1 x 1.

S chauder Theory. x 2. = log( x 1 + x 2 ) + 1 ( x 1 + x 2 ) 2. ( 5) x 1 + x 2 x 1 + x 2. 2 = 2 x 1. x 1 x 2. 1 x 1. Sep. 1 9 Intuitively, the solution u to the Poisson equation S chauder Theory u = f 1 should have better regularity than the right hand side f. In particular one expects u to be twice more differentiable

More information

ON THE COMPOSITION OF DIFFERENTIABLE FUNCTIONS

ON THE COMPOSITION OF DIFFERENTIABLE FUNCTIONS ON THE COMPOSITION OF DIFFERENTIABLE FUNCTIONS M. BACHIR AND G. LANCIEN Abstract. We prove that a Banach space X has the Schur property if and only if every X-valued weakly differentiable function is Fréchet

More information

Introductory Analysis I Fall 2014 Homework #9 Due: Wednesday, November 19

Introductory Analysis I Fall 2014 Homework #9 Due: Wednesday, November 19 Introductory Analysis I Fall 204 Homework #9 Due: Wednesday, November 9 Here is an easy one, to serve as warmup Assume M is a compact metric space and N is a metric space Assume that f n : M N for each

More information

Examples of Dual Spaces from Measure Theory

Examples of Dual Spaces from Measure Theory Chapter 9 Examples of Dual Spaces from Measure Theory We have seen that L (, A, µ) is a Banach space for any measure space (, A, µ). We will extend that concept in the following section to identify an

More information

A note on L convergence of Neumann series approximation in missing data problems

A note on L convergence of Neumann series approximation in missing data problems A note on L convergence of Neumann series approximation in missing data problems Hua Yun Chen Division of Epidemiology & Biostatistics School of Public Health University of Illinois at Chicago 1603 West

More information

On some weighted fractional porous media equations

On some weighted fractional porous media equations On some weighted fractional porous media equations Gabriele Grillo Politecnico di Milano September 16 th, 2015 Anacapri Joint works with M. Muratori and F. Punzo Gabriele Grillo Weighted Fractional PME

More information

An introduction to Mathematical Theory of Control

An introduction to Mathematical Theory of Control An introduction to Mathematical Theory of Control Vasile Staicu University of Aveiro UNICA, May 2018 Vasile Staicu (University of Aveiro) An introduction to Mathematical Theory of Control UNICA, May 2018

More information

Statistical Properties of Numerical Derivatives

Statistical Properties of Numerical Derivatives Statistical Properties of Numerical Derivatives Han Hong, Aprajit Mahajan, and Denis Nekipelov Stanford University and UC Berkeley November 2010 1 / 63 Motivation Introduction Many models have objective

More information

7 Complete metric spaces and function spaces

7 Complete metric spaces and function spaces 7 Complete metric spaces and function spaces 7.1 Completeness Let (X, d) be a metric space. Definition 7.1. A sequence (x n ) n N in X is a Cauchy sequence if for any ɛ > 0, there is N N such that n, m

More information

Regularity for Poisson Equation

Regularity for Poisson Equation Regularity for Poisson Equation OcMountain Daylight Time. 4, 20 Intuitively, the solution u to the Poisson equation u= f () should have better regularity than the right hand side f. In particular one expects

More information

Defining the Integral

Defining the Integral Defining the Integral In these notes we provide a careful definition of the Lebesgue integral and we prove each of the three main convergence theorems. For the duration of these notes, let (, M, µ) be

More information

Notes, March 4, 2013, R. Dudley Maximum likelihood estimation: actual or supposed

Notes, March 4, 2013, R. Dudley Maximum likelihood estimation: actual or supposed 18.466 Notes, March 4, 2013, R. Dudley Maximum likelihood estimation: actual or supposed 1. MLEs in exponential families Let f(x,θ) for x X and θ Θ be a likelihood function, that is, for present purposes,

More information

Chapter 4. Inverse Function Theorem. 4.1 The Inverse Function Theorem

Chapter 4. Inverse Function Theorem. 4.1 The Inverse Function Theorem Chapter 4 Inverse Function Theorem d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d dd d d d d This chapter

More information

Asymptotic statistics using the Functional Delta Method

Asymptotic statistics using the Functional Delta Method Quantiles, Order Statistics and L-Statsitics TU Kaiserslautern 15. Februar 2015 Motivation Functional The delta method introduced in chapter 3 is an useful technique to turn the weak convergence of random

More information

Lecture notes on Ordinary Differential Equations. S. Sivaji Ganesh Department of Mathematics Indian Institute of Technology Bombay

Lecture notes on Ordinary Differential Equations. S. Sivaji Ganesh Department of Mathematics Indian Institute of Technology Bombay Lecture notes on Ordinary Differential Equations S. Ganesh Department of Mathematics Indian Institute of Technology Bombay May 20, 2016 ii IIT Bombay Contents I Ordinary Differential Equations 1 1 Initial

More information

Real Analysis Math 131AH Rudin, Chapter #1. Dominique Abdi

Real Analysis Math 131AH Rudin, Chapter #1. Dominique Abdi Real Analysis Math 3AH Rudin, Chapter # Dominique Abdi.. If r is rational (r 0) and x is irrational, prove that r + x and rx are irrational. Solution. Assume the contrary, that r+x and rx are rational.

More information

Concentration behavior of the penalized least squares estimator

Concentration behavior of the penalized least squares estimator Concentration behavior of the penalized least squares estimator Penalized least squares behavior arxiv:1511.08698v2 [math.st] 19 Oct 2016 Alan Muro and Sara van de Geer {muro,geer}@stat.math.ethz.ch Seminar

More information

Riemann integral and volume are generalized to unbounded functions and sets. is an admissible set, and its volume is a Riemann integral, 1l E,

Riemann integral and volume are generalized to unbounded functions and sets. is an admissible set, and its volume is a Riemann integral, 1l E, Tel Aviv University, 26 Analysis-III 9 9 Improper integral 9a Introduction....................... 9 9b Positive integrands................... 9c Special functions gamma and beta......... 4 9d Change of

More information

CHAPTER 6. Differentiation

CHAPTER 6. Differentiation CHPTER 6 Differentiation The generalization from elementary calculus of differentiation in measure theory is less obvious than that of integration, and the methods of treating it are somewhat involved.

More information

Integration on Measure Spaces

Integration on Measure Spaces Chapter 3 Integration on Measure Spaces In this chapter we introduce the general notion of a measure on a space X, define the class of measurable functions, and define the integral, first on a class of

More information

L p Functions. Given a measure space (X, µ) and a real number p [1, ), recall that the L p -norm of a measurable function f : X R is defined by

L p Functions. Given a measure space (X, µ) and a real number p [1, ), recall that the L p -norm of a measurable function f : X R is defined by L p Functions Given a measure space (, µ) and a real number p [, ), recall that the L p -norm of a measurable function f : R is defined by f p = ( ) /p f p dµ Note that the L p -norm of a function f may

More information

University of Sheffield. School of Mathematics and Statistics. Metric Spaces MAS331/6352

University of Sheffield. School of Mathematics and Statistics. Metric Spaces MAS331/6352 University of Sheffield School of Mathematics and Statistics Metric Spaces MAS331/635 017 18 1 Metric spaces This course is entitled Metric spaces. You may be wondering what such a thing should be. We

More information