Preservation Theorems for Glivenko-Cantelli and Uniform Glivenko-Cantelli Classes

Aad van der Vaart and Jon A. Wellner

ABSTRACT We show that the $P$-Glivenko-Cantelli property of classes of functions $\mathcal{F}_1, \ldots, \mathcal{F}_k$ is preserved by a continuous function $\varphi$ from $R^k$ to $R$ in the sense that the new class of functions $x \mapsto \varphi(f_1(x), \ldots, f_k(x))$, $f_i \in \mathcal{F}_i$, $i = 1, \ldots, k$, is again a Glivenko-Cantelli class of functions if it has an integrable envelope. We also prove an analogous result for preservation of the uniform Glivenko-Cantelli property. Corollaries of the main theorem include two preservation theorems of Dudley (1998). We apply the main result to reprove a theorem of Schick and Yu (1999) concerning consistency of the NPMLE in a model for mixed case interval censoring. Finally, a version of the consistency result of Schick and Yu (1999) is established for a general model for mixed case interval censoring in which a general sample space $\mathcal{Y}$ is partitioned into sets which are members of some VC-class $\mathcal{C}$ of subsets of $\mathcal{Y}$.

1 Glivenko-Cantelli theorems

Let $(\mathcal{X}, \mathcal{A}, P)$ be a probability space, and suppose that $\mathcal{F} \subset L_1(P)$. For such a class of functions, let $\mathcal{F}_{0,P} \equiv \{f - Pf : f \in \mathcal{F}\}$. We also let $F \equiv F_{\mathcal{F}}(x) \equiv \sup_{f \in \mathcal{F}} |f(x)|$, the envelope function of $\mathcal{F}$. If $|f| \le F$ for all $f \in \mathcal{F}$ with $F$ measurable, then $F$ is an envelope for $\mathcal{F}$.

Suppose that $X_1, X_2, \ldots$ are i.i.d. $P$. The Glivenko-Cantelli theorems give conditions under which the empirical measure $\mathbb{P}_n$ converges uniformly to $P$ over a class $\mathcal{F}$, either in probability (in which case we say that $\mathcal{F}$ is a weak Glivenko-Cantelli class for $P$) or almost surely:

$$\|\mathbb{P}_n - P\|_{\mathcal{F}} \equiv \sup_{f \in \mathcal{F}} |\mathbb{P}_n f - Pf| \to_{a.s.} 0;$$

in this case we say that $\mathcal{F}$ is a strong Glivenko-Cantelli class for $P$.

Useful sufficient conditions for a class $\mathcal{F}$ to be a strong Glivenko-Cantelli class for $P$ are that it has an integrable envelope and either $\log N_{[\,]}(\epsilon, \mathcal{F}, L_1(P)) < \infty$ for all $\epsilon > 0$
or $\log N(\epsilon, \mathcal{F}_M, L_1(\mathbb{P}_n)) = o_P(n)$ for every $\epsilon > 0$ and $0 < M < \infty$. Here $\mathcal{F}_M = \{f 1_{[F \le M]} : f \in \mathcal{F}\}$. In the second case some additional measurability conditions are necessary; see Van der Vaart and Wellner (1996), Chapter 2.4, pp. 122-126. In particular, the condition in Theorem 2.4.3, page 123, is shown by Giné and Zinn (1984) to be both necessary and sufficient, under measurability assumptions, for the class $\mathcal{F}$ to be a strong Glivenko-Cantelli class (see Talagrand (1996) for another proof). Talagrand (1987b) gives necessary and sufficient conditions for the Glivenko-Cantelli theorem without any measurability hypotheses.

Theorem 1. (Giné and Zinn, 1984). Suppose that $\mathcal{F}$ is $L_1(P)$-bounded and nearly linearly supremum measurable for $P$; in particular this holds if $\mathcal{F}$ is image admissible Suslin. Then the following are equivalent:

(a) $\mathcal{F}$ is a strong Glivenko-Cantelli class for $P$.

(b) $\mathcal{F}$ has an envelope function $F \in L_1(P)$ and the truncated classes $\mathcal{F}_M \equiv \{f 1\{F \le M\} : f \in \mathcal{F}\}$ satisfy

$$\frac{1}{n} E \log N(\epsilon, \mathcal{F}_M, L_r(\mathbb{P}_n)) \to 0 \quad \text{for all } \epsilon > 0 \text{ and for all } M \in (0, \infty)$$

for some (all) $r \in (0, \infty]$, where $\|f\|_{L_r(P)} \equiv \|f\|_{P,r} \equiv \{P(|f|^r)\}^{r^{-1} \wedge 1}$.

2 Preservation of the Glivenko-Cantelli property

Our goal in this section is to present several results concerning the stability of the Glivenko-Cantelli property of one or more classes of functions under composition with functions $\varphi$. A theorem which motivated our interest is the following result of Dudley (1998a).

Theorem 2. (Dudley, 1998a). Suppose that $\mathcal{F}$ is a strong Glivenko-Cantelli class for $P$ with $PF < \infty$, $J$ is a possibly unbounded interval including the ranges of all $f \in \mathcal{F}$, $\varphi$ is continuous and monotone on $J$, and for some finite constants $c$, $d$, $|\varphi(y)| \le c|y| + d$ for all $y \in J$. Then $\varphi(\mathcal{F})$ is also a strong Glivenko-Cantelli class for $P$.

Dudley (1998a) proves this via the characterization of Glivenko-Cantelli classes due to Talagrand (1987b). Dudley (1998b) also uses Talagrand's characterization to prove the following interesting proposition.

Proposition 1. (Dudley, 1998b).
Suppose that $\mathcal{F}$ is a strong Glivenko-Cantelli class for $P$ with $PF < \infty$, and $g$ is a fixed bounded function ($\|g\|_\infty < \infty$). Then the class of functions $g \cdot \mathcal{F} \equiv \{g \cdot f : f \in \mathcal{F}\}$ is a strong Glivenko-Cantelli class for $P$.
Yet another proposition in this same vein is:

Proposition 2. (Giné and Zinn, 1984). Suppose that $\mathcal{F}$ is a uniformly bounded strong Glivenko-Cantelli class for $P$, and $g \in L_1(P)$ is a fixed function. Then the class of functions $g \cdot \mathcal{F} \equiv \{g \cdot f : f \in \mathcal{F}\}$ is a strong Glivenko-Cantelli class for $P$.

Given classes $\mathcal{F}_1, \ldots, \mathcal{F}_k$ of functions $f_i : \mathcal{X} \to R$ and a function $\varphi : R^k \to R$, let $\varphi(\mathcal{F}_1, \ldots, \mathcal{F}_k)$ be the class of functions $x \mapsto \varphi(f_1(x), \ldots, f_k(x))$, where $f_i \in \mathcal{F}_i$, $i = 1, \ldots, k$. Theorem 2 and Propositions 1 and 2 are all corollaries of the following theorem.

Theorem 3. Suppose that $\mathcal{F}_1, \ldots, \mathcal{F}_k$ are $P$-Glivenko-Cantelli classes of functions, and that $\varphi : R^k \to R$ is continuous. Then $\mathcal{H} \equiv \varphi(\mathcal{F}_1, \ldots, \mathcal{F}_k)$ is $P$-Glivenko-Cantelli provided that it has an integrable envelope function.

Proof. We first assume that the classes of functions $\mathcal{F}_i$ are appropriately measurable. Let $F_1, \ldots, F_k$ and $H$ be integrable envelopes for $\mathcal{F}_1, \ldots, \mathcal{F}_k$ and $\mathcal{H}$ respectively, and set $F = F_1 \vee \cdots \vee F_k$. For $M \in (0, \infty)$, define

$$\mathcal{H}_M \equiv \{\varphi(f) 1_{[F \le M]} : f = (f_1, \ldots, f_k) \in \mathcal{F}_1 \times \mathcal{F}_2 \times \cdots \times \mathcal{F}_k \equiv \mathcal{F}\}.$$

Now

$$\|(\mathbb{P}_n - P)\varphi(f)\|_{\mathcal{F}} \le (\mathbb{P}_n + P) H 1_{[F > M]} + \|(\mathbb{P}_n - P)h\|_{\mathcal{H}_M}.$$

The expectation of the first term on the right converges to 0 as $M \to \infty$. Hence it suffices to show that $\mathcal{H}_M$ is $P$-Glivenko-Cantelli for every fixed $M$. Let $\delta = \delta(\epsilon)$ be the $\delta$ of Lemma 2 below for $\varphi : [-M, M]^k \to R$, $\epsilon > 0$, and $\|\cdot\|$ the $L_1$-norm $\|\cdot\|_1$. Then for any $(f_j, g_j) \in \mathcal{F}_j$, $j = 1, \ldots, k$,

$$\mathbb{P}_n |f_j - g_j| 1_{[F_j \le M]} \le \frac{\delta}{k}, \quad j = 1, \ldots, k$$

implies that

$$\mathbb{P}_n |\varphi(f_1, \ldots, f_k) - \varphi(g_1, \ldots, g_k)| 1_{[F \le M]} \le \epsilon.$$

It follows that

$$N(\epsilon, \mathcal{H}_M, L_1(\mathbb{P}_n)) \le \prod_{j=1}^k N\Big(\frac{\delta}{k}, \mathcal{F}_j 1_{[F_j \le M]}, L_1(\mathbb{P}_n)\Big).$$

Thus $E \log N(\epsilon, \mathcal{H}_M, L_1(\mathbb{P}_n)) = o(n)$ for every $\epsilon > 0$, $M < \infty$. This implies that $E \log N(\epsilon, (\mathcal{H}_M)_N, L_1(\mathbb{P}_n)) = o(n)$ for $(\mathcal{H}_M)_N$ the functions $h 1\{H \le N\}$ for $h \in \mathcal{H}_M$. Thus $\mathcal{H}_M$ is strong Glivenko-Cantelli for $P$ by Theorem 1. This concludes the proof that $\mathcal{H} = \varphi(\mathcal{F})$ is weak Glivenko-Cantelli. Because it has an integrable envelope, it is strong Glivenko-Cantelli by, e.g., Lemma 2.4.5 of Van der Vaart and Wellner (1996). This concludes the proof for appropriately measurable classes $\mathcal{F}_j$, $j = 1, \ldots, k$.

We extend the theorem to general Glivenko-Cantelli classes using separable versions as in Talagrand (1987a). (Also see van der Vaart and Wellner (1996), pp. 115-120 for a discussion.) As shown in the preceding argument, it is not a loss of generality to assume that the classes $\mathcal{F}_i$ are uniformly bounded. Furthermore, it suffices to show that $\varphi(\mathcal{F}_1, \ldots, \mathcal{F}_k)$ is weak Glivenko-Cantelli. We first need a lemma.

Lemma 1. Any strong $P$-Glivenko-Cantelli class $\mathcal{F}$ is totally bounded in $L_1(P)$ if and only if $\|P\|_{\mathcal{F}} < \infty$. Furthermore, for any $r \in (1, \infty)$, if $\mathcal{F}$ has an envelope that is contained in $L_r(P)$, then $\mathcal{F}$ is also totally bounded in $L_r(P)$.

Proof. A class that is totally bounded is also bounded. Thus for the first statement we only need to prove that a strong Glivenko-Cantelli class $\mathcal{F}$ with $\|P\|_{\mathcal{F}} < \infty$ is totally bounded in $L_1(P)$. It is well known that such a class has an integrable envelope. E.g. see Giné and Zinn (1983) or Problem 2.4.1 of van der Vaart and Wellner (1996) to conclude first that $P\|f - Pf\|_{\mathcal{F}} < \infty$. Next the claim follows from the triangle inequality $\|f\|_{\mathcal{F}} \le \|f - Pf\|_{\mathcal{F}} + \|P\|_{\mathcal{F}}$. Thus it is no loss of generality to assume that the class $\mathcal{F}$ possesses an envelope that is finite everywhere.

If $\mathcal{F}$ is not totally bounded in $L_1(P)$, then there is $\varepsilon > 0$ and $\mathcal{G} \subset \mathcal{F}$ countable such that the packing number $D(\varepsilon, \mathcal{G}, L_1(P)) = \infty$. This is obvious, since if $\mathcal{F}$ is not totally bounded, then $D(\varepsilon, \mathcal{F}, L_1(P)) = \infty$ for some $\varepsilon > 0$; then take a countable set of $\varepsilon$-separated points. So, now $\mathcal{G}$ has all the measurability properties, is GC and $L_1(P)$-bounded, and moreover, $D(\varepsilon, \mathcal{G}, L_1) = \infty$.
But this is a contradiction because, as shown in Giné and Zinn (1984), first half of page 984, $D(\varepsilon, \mathcal{G}, L_1)$ must be finite for all $\varepsilon > 0$. This completes the proof of the first part of the lemma.

If $\mathcal{F}$ has an envelope in $L_r(P)$, then $\mathcal{F}$ is totally bounded in $L_r(P)$ if the class $\mathcal{F}_M$ of functions $f 1\{F \le M\}$ is totally bounded in $L_r(P)$ for every fixed $M$. The class $\mathcal{F}_M$ is $P$-GC by Theorem 3 and hence this class is totally bounded in $L_1(P)$. But then it is also totally bounded in $L_r(P)$, because $P|f|^r \le P|f| M^{r-1}$ for any $f$ that is bounded by $M$ and we can construct the $\epsilon$-net over $\mathcal{F}_M$ in $L_1(P)$ to consist of functions that are bounded by $M$.

Because a Glivenko-Cantelli class $\mathcal{F}$ with $\|P\|_{\mathcal{F}} < \infty$ is totally bounded in $L_1(P)$ by Lemma 1, it is separable as a subset of $L_1(P)$. A minor generalization of Theorem 2.3.17 in van der Vaart and Wellner (1996) shows that there exists a bijection $f \leftrightarrow \tilde{f}$ of $\mathcal{F}$ onto a class $\tilde{\mathcal{F}} \subset L_1(P)$ such that
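$$\tilde{f} = f \quad P\text{-almost surely for every } f \in \mathcal{F};$$

there exists a countable subset $\mathcal{G} \subset \tilde{\mathcal{F}}$ such that for every $n$ there exists a measurable set $N_n \subset \mathcal{X}^n$ with $P^n(N_n) = 0$ such that for all $(x_1, \ldots, x_n) \notin N_n$ and $\tilde{f} \in \tilde{\mathcal{F}}$ there exists $\{g_m\} \subset \mathcal{G}$ such that $P|g_m - \tilde{f}| \to 0$ and $(g_m(x_1), \ldots, g_m(x_n)) \to (\tilde{f}(x_1), \ldots, \tilde{f}(x_n))$.

By an adaptation of a theorem due to Talagrand (1987a) (see Theorem 2.3.15 in van der Vaart and Wellner (1996)) a class $\mathcal{F}$ is weak Glivenko-Cantelli if and only if the class $\tilde{\mathcal{F}}$ is weak Glivenko-Cantelli and

$$\sup_{f \in \mathcal{F}} \mathbb{P}_n |f - \tilde{f}| \to 0$$

in outer probability.

Construct a pointwise separable version $\tilde{\mathcal{F}}_i$ for each of the classes $\mathcal{F}_i$. The classes $\tilde{\mathcal{F}}_i$ possess enough measurability to make the preceding argument work; in particular, a pointwise separable version in the above sense is sufficient for the nearly linearly supremum measurable hypothesis of Giné and Zinn (1984) for both $\tilde{\mathcal{F}}_1, \ldots, \tilde{\mathcal{F}}_k$ and $\varphi(\tilde{\mathcal{F}}_1, \ldots, \tilde{\mathcal{F}}_k)$. Thus the class $\varphi(\tilde{\mathcal{F}}_1, \ldots, \tilde{\mathcal{F}}_k)$ is Glivenko-Cantelli for $P$. Now by Lemma 2 there exists for every $\epsilon > 0$ a $\delta > 0$ such that $\mathbb{P}_n |f_j - \tilde{f}_j| < \delta/k$, $j = 1, \ldots, k$, implies

$$\mathbb{P}_n |\varphi(f_1, \ldots, f_k) - \varphi(\tilde{f}_1, \ldots, \tilde{f}_k)| < \epsilon.$$

The theorem follows.

Lemma 2. Suppose that $\varphi : K \to R$ is continuous and $K \subset R^k$ is compact. Then for every $\epsilon > 0$ there exists $\delta > 0$ such that for all $n$ and for all $a_1, \ldots, a_n, b_1, \ldots, b_n \in K \subset R^k$,

$$\frac{1}{n} \sum_{i=1}^n \|a_i - b_i\| < \delta$$

implies

$$\frac{1}{n} \sum_{i=1}^n |\varphi(a_i) - \varphi(b_i)| < \epsilon.$$

Here $\|\cdot\|$ can be any norm on $R^k$; in particular it can be $\|x\|_r = \big(\sum_{i=1}^k |x_i|^r\big)^{1/r}$, $r \in [1, \infty)$, or $\|x\|_\infty \equiv \max_{1 \le i \le k} |x_i|$ for $x = (x_1, \ldots, x_k) \in R^k$.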
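As an informal numerical illustration of the statement of Lemma 2 (not part of the original argument), the following sketch takes $k = 1$, $K = [-1, 1]$ and the continuous but non-Lipschitz function $\varphi(x) = \sqrt{|x|}$. For this particular choice one can take $\delta = \epsilon^2$ for every $n$, because $|\sqrt{|u|} - \sqrt{|v|}| \le \sqrt{|u - v|}$ and, by Jensen's inequality, the average of square roots is at most the square root of the average. The function names `phi`, `average_gap` and `lemma2_gaps` are hypothetical helpers introduced only for this example.

```python
import math
import random


def phi(x):
    """Continuous but not Lipschitz on the compact set K = [-1, 1]."""
    return math.sqrt(abs(x))


def average_gap(a, b, func=lambda v: v):
    """Return (1/n) * sum_i |func(a_i) - func(b_i)|."""
    return sum(abs(func(u) - func(v)) for u, v in zip(a, b)) / len(a)


def lemma2_gaps(n, delta, seed=0):
    """Draw a_i in K, perturb each by at most delta, and return both averages."""
    rng = random.Random(seed)
    a = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    # Clipping keeps b_i in K and cannot increase |a_i - b_i| because a_i is in K.
    b = [min(1.0, max(-1.0, u + rng.uniform(-delta, delta))) for u in a]
    return average_gap(a, b), average_gap(a, b, phi)


if __name__ == "__main__":
    for delta in (1e-1, 1e-2, 1e-4):
        input_gap, output_gap = lemma2_gaps(n=10_000, delta=delta)
        # The last column is the bound sqrt(input_gap) on output_gap.
        print(delta, input_gap, output_gap, math.sqrt(input_gap))
```

The point of the lemma is that the bound on the averaged output gap depends only on the averaged input gap, not on $n$ or on how the individual gaps are distributed; the sketch checks this only for one sampling scheme and one $\varphi$.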
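Proof. Let $U_n$ be uniform on $\{1, \ldots, n\}$, and set $X_n = a_{U_n}$, $Y_n = b_{U_n}$. Then we can write

$$\frac{1}{n} \sum_{i=1}^n \|a_i - b_i\| = E\|X_n - Y_n\|$$

and

$$\frac{1}{n} \sum_{i=1}^n |\varphi(a_i) - \varphi(b_i)| = E|\varphi(X_n) - \varphi(Y_n)|.$$

Hence it suffices to show that for every $\epsilon > 0$ there exists $\delta > 0$ such that for all $(X, Y)$ random vectors in $K \subset R^k$,

$$E\|X - Y\| < \delta \quad \text{implies} \quad E|\varphi(X) - \varphi(Y)| < \epsilon.$$

Suppose not. Then for some $\epsilon > 0$ and for all $m = 1, 2, \ldots$ there exists $(X_m, Y_m)$ such that

$$E\|X_m - Y_m\| < \frac{1}{m}, \qquad E|\varphi(X_m) - \varphi(Y_m)| \ge \epsilon.$$

But since $\{(X_m, Y_m)\}$ is tight, there exists a subsequence $(X_{m'}, Y_{m'}) \to_d (X, Y)$. Then it follows that

$$E\|X - Y\| = \lim_{m' \to \infty} E\|X_{m'} - Y_{m'}\| = 0$$

so that $X = Y$ a.s., while on the other hand

$$0 = E|\varphi(X) - \varphi(Y)| = \lim_{m' \to \infty} E|\varphi(X_{m'}) - \varphi(Y_{m'})| \ge \epsilon > 0.$$

This contradiction means that the desired implication holds.

Another potentially useful preservation theorem is one based on building up Glivenko-Cantelli classes from the restrictions of a class of functions to elements of a partition of the sample space. The following theorem is related to the results of Van der Vaart (1996) for Donsker classes.

Theorem 4. Suppose that $\mathcal{F}$ is a class of functions on $(\mathcal{X}, \mathcal{A}, P)$, and $\{\mathcal{X}_i\}$ is a partition of $\mathcal{X}$: $\cup_{i=1}^\infty \mathcal{X}_i = \mathcal{X}$, $\mathcal{X}_i \cap \mathcal{X}_j = \emptyset$ for $i \ne j$. Suppose that $\mathcal{F}_j \equiv \{f 1_{\mathcal{X}_j} : f \in \mathcal{F}\}$ is $P$-Glivenko-Cantelli for each $j$, and $\mathcal{F}$ has an integrable envelope function $F$. Then $\mathcal{F}$ is itself $P$-Glivenko-Cantelli.

Proof. Since

$$f = f \sum_{j=1}^\infty 1_{\mathcal{X}_j} = \sum_{j=1}^\infty f 1_{\mathcal{X}_j},$$

it follows that

$$E\|\mathbb{P}_n - P\|_{\mathcal{F}} \le \sum_{j=1}^\infty E\|\mathbb{P}_n - P\|_{\mathcal{F}_j} \to 0$$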
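by the dominated convergence theorem, since each term in the sum converges to zero by the hypothesis that each $\mathcal{F}_j$ is $P$-Glivenko-Cantelli, and we have

$$E\|\mathbb{P}_n - P\|_{\mathcal{F}_j} \le E\,\mathbb{P}_n(F 1_{\mathcal{X}_j}) + P(F 1_{\mathcal{X}_j}) \le 2 P(F 1_{\mathcal{X}_j})$$

where $\sum_{j=1}^\infty P(F 1_{\mathcal{X}_j}) = P(F) < \infty$.

3 Preservation of the uniform Glivenko-Cantelli property

We say that $\mathcal{F}$ is a strong uniform Glivenko-Cantelli class if for all $\epsilon > 0$

$$\sup_{P \in \mathcal{P}(\mathcal{X}, \mathcal{A})} \Pr{}_P\Big(\sup_{m \ge n} \|\mathbb{P}_m - P\|_{\mathcal{F}} > \epsilon\Big) \to 0 \quad \text{as } n \to \infty$$

where $\mathcal{P}(\mathcal{X}, \mathcal{A})$ is the set of all probability measures on $(\mathcal{X}, \mathcal{A})$. For $x = (x_1, \ldots, x_n) \in \mathcal{X}^n$, $n = 1, 2, \ldots$, and $r \in (0, \infty)$, we define on $\mathcal{F}$ the pseudo-distances

$$e_{x,r}(f, g) = \Big\{\frac{1}{n} \sum_{i=1}^n |f(x_i) - g(x_i)|^r\Big\}^{r^{-1} \wedge 1}, \qquad e_{x,\infty}(f, g) = \max_{1 \le i \le n} |f(x_i) - g(x_i)|, \qquad f, g \in \mathcal{F}.$$

Let $N(\epsilon, \mathcal{F}, e_{x,r})$ denote the $\epsilon$-covering number of $(\mathcal{F}, e_{x,r})$, $\epsilon > 0$. Then define, for $n = 1, 2, \ldots$, $\epsilon > 0$, and $r \in (0, \infty]$, the quantities

$$N_{n,r}(\epsilon, \mathcal{F}) = \sup_{x \in \mathcal{X}^n} N(\epsilon, \mathcal{F}, e_{x,r}).$$

Theorem 5. (Dudley, Giné, and Zinn (1991)). Suppose that $\mathcal{F}$ is a class of uniformly bounded functions such that $\mathcal{F}$ is image admissible Suslin. Then the following are equivalent:

(a) $\mathcal{F}$ is a strong uniform Glivenko-Cantelli class.

(b) $$\frac{\log N_{n,r}(\epsilon, \mathcal{F})}{n} \to 0 \quad \text{for all } \epsilon > 0$$ for some (all) $r \in (0, \infty]$.

For the definition of the image admissible Suslin property see Dudley (1984), sections 10.3 and 11.1. The following theorem gives natural sufficient conditions for preservation of the uniform Glivenko-Cantelli property.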
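Theorem 6. Suppose that $\mathcal{F}_1, \ldots, \mathcal{F}_k$ are classes of uniformly bounded functions on $(\mathcal{X}, \mathcal{A})$ such that $\mathcal{F}_1, \ldots, \mathcal{F}_k$ are image admissible Suslin and strong uniform Glivenko-Cantelli classes. Suppose that $\varphi : R^k \to R$ is continuous. Then $\mathcal{H} \equiv \varphi(\mathcal{F}_1, \ldots, \mathcal{F}_k)$ is a strong uniform Glivenko-Cantelli class.

Proof. It follows from Lemma 2 that for any $\epsilon > 0$ there exists $\delta > 0$ such that for any $f_j, g_j \in \mathcal{F}_j$, $j = 1, \ldots, k$, $x \in \mathcal{X}^n$,

$$e_{x,1}(f_j, g_j) \le \frac{\delta}{k}, \quad j = 1, \ldots, k$$

implies that

$$e_{x,1}(\varphi(f_1, \ldots, f_k), \varphi(g_1, \ldots, g_k)) \le \epsilon.$$

It follows that

$$N_{n,1}(\epsilon, \mathcal{H}) \le \prod_{j=1}^k N_{n,1}\Big(\frac{\delta}{k}, \mathcal{F}_j\Big),$$

and hence that

$$\frac{1}{n} \log N_{n,1}(\epsilon, \mathcal{H}) \le \sum_{j=1}^k \frac{1}{n} \log N_{n,1}\Big(\frac{\delta}{k}, \mathcal{F}_j\Big) \to 0$$

where the convergence follows from part (b) of Theorem 5 with $r = 1$. If we show that $\mathcal{H} = \varphi(\mathcal{F})$ is image admissible Suslin, then the conclusion follows from (b) implies (a) in Theorem 5. We give a proof of a slightly stronger statement in the following lemma.

Lemma 3. If $\mathcal{F}_1, \ldots, \mathcal{F}_k$ are image admissible Suslin via $(Y_i, \mathcal{S}_i, T_i)$, $i = 1, \ldots, k$, and $\varphi : R^k \to R$ is measurable, then $\varphi(\mathcal{F}_1, \ldots, \mathcal{F}_k)$ is image admissible Suslin via $\big(\prod_{i=1}^k Y_i, \prod_{i=1}^k \mathcal{S}_i, T\big)$ for $T(y_1, \ldots, y_k) = \varphi(T_1 y_1, \ldots, T_k y_k)$.

Proof. There exist Polish spaces $Z_i$ and measurable maps $g_i : Z_i \to Y_i$ which are onto. Then $g : Z_1 \times \cdots \times Z_k \to Y_1 \times \cdots \times Y_k$ defined by $g(z_1, \ldots, z_k) = (g_1(z_1), \ldots, g_k(z_k))$ is onto and measurable for $\mathcal{S}_1 \times \cdots \times \mathcal{S}_k$. So $\big(\prod_i Y_i, \prod_i \mathcal{S}_i\big)$ is Suslin. It suffices to check that $T$ is onto and that $\psi$ defined by

$$\psi(x, y_1, \ldots, y_k) = \varphi(T_1 y_1(x), \ldots, T_k y_k(x))$$

is measurable. Obviously $T$ is onto, and $\psi$ is measurable because each map $(x, y_i) \mapsto T_i y_i(x)$ is measurable on $\mathcal{X} \times Y_i$ and hence on $\mathcal{X} \times Y_1 \times \cdots \times Y_k$, and hence $(x, y_1, \ldots, y_k) \mapsto (T_1 y_1(x), \ldots, T_k y_k(x))$ is measurable from $\mathcal{X} \times Y_1 \times \cdots \times Y_k$ to $R^k$.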
4 Nonparametric maximum likelihood estimation: a general result

Now we prove a general result for nonparametric maximum likelihood estimation in a class of densities. The main proposition in this section is related to results of Pfanzagl (1988) and Van de Geer (1993), (1996).

Suppose that $\mathcal{P}$ is a class of densities with respect to a fixed $\sigma$-finite measure $\mu$ on a measurable space $(\mathcal{X}, \mathcal{A})$. Suppose that $X_1, \ldots, X_n$ are i.i.d. $P_0$ with density $p_0 \in \mathcal{P}$. Let

$$\hat{p}_n \equiv \operatorname{argmax}_{p \in \mathcal{P}} \mathbb{P}_n \log p.$$

For $0 < \alpha \le 1$, let $\varphi_\alpha(t) = (t^\alpha - 1)/(t^\alpha + 1)$ for $t \ge 0$, $\varphi_\alpha(t) = -1$ for $t < 0$. Then $\varphi_\alpha$ is bounded and continuous for each $\alpha \in (0, 1]$. For $0 < \beta < 1$ define

$$h_\beta^2(p, q) \equiv 1 - \int p^\beta q^{1-\beta} \, d\mu.$$

Note that $h_{1/2}(p, q) \equiv h(p, q)$ is the Hellinger distance between $p$ and $q$, and by Hölder's inequality, $h_\beta(p, q) \ge 0$ with equality if and only if $p = q$ a.e. $\mu$.

Proposition 3. Suppose that $\mathcal{P}$ is convex. Then

$$h_{\alpha/2}^2(\hat{p}_n, p_0) \le (\mathbb{P}_n - P_0)\Big(\varphi_\alpha\Big(\frac{\hat{p}_n}{p_0}\Big)\Big).$$

In particular, when $\alpha = 1$ we have, with $\varphi \equiv \varphi_1$,

$$h^2(\hat{p}_n, p_0) \equiv h_{1/2}^2(\hat{p}_n, p_0) \le (\mathbb{P}_n - P_0)\Big(\varphi\Big(\frac{\hat{p}_n}{p_0}\Big)\Big).$$

Corollary 1. Suppose that $\{\varphi(p/p_0) : p \in \mathcal{P}\}$ is a $P_0$-Glivenko-Cantelli class. Then for each $0 < \alpha \le 1$, $h_{\alpha/2}(\hat{p}_n, p_0) \to_{a.s.} 0$.

Proof of Proposition 3. It follows from convexity of $\mathcal{P}$ and convexity of $t \mapsto \varphi_\alpha(1/t)$ that

$$0 \le \mathbb{P}_n \varphi_\alpha(\hat{p}_n/p_0) = (\mathbb{P}_n - P_0)\varphi_\alpha(\hat{p}_n/p_0) + P_0 \varphi_\alpha(\hat{p}_n/p_0); \tag{1}$$

see van der Vaart and Wellner (1996) page 330, and Pfanzagl (1988), pp. 141-143. Now we show that

$$P_0 \varphi_\alpha(p/p_0) = \int \frac{p^\alpha - p_0^\alpha}{p^\alpha + p_0^\alpha} \, dP_0 \le -\Big(1 - \int p_0^\beta p^{1-\beta} \, d\mu\Big) \le 0 \tag{2}$$

for $\beta = 1 - \alpha/2$. Note that this holds if and only if

$$-1 + 2\int \frac{p^\alpha}{p_0^\alpha + p^\alpha} \, p_0 \, d\mu \le -1 + \int p_0^\beta p^{1-\beta} \, d\mu,$$

or

$$\int p_0^\beta p^{1-\beta} \, d\mu \ge 2 \int \frac{p^\alpha}{p_0^\alpha + p^\alpha} \, p_0 \, d\mu.$$

But this holds if

$$p_0^\beta p^{1-\beta} \ge 2\, \frac{p^\alpha p_0}{p_0^\alpha + p^\alpha}.$$

With $\beta = 1 - \alpha/2$, this becomes

$$\frac{1}{2}(p_0^\alpha + p^\alpha) \ge p_0^{\alpha/2} p^{\alpha/2} = \sqrt{p_0^\alpha p^\alpha},$$

and this holds by the arithmetic mean - geometric mean inequality. Thus (2) holds. Since $1 - \int p_0^\beta p^{1-\beta} \, d\mu = h_{\alpha/2}^2(p, p_0)$ when $\beta = 1 - \alpha/2$, combining (2) with (1) yields the claim of the proposition.

5 Example: a result of Schick and Yu

Our goal in this section is to give another proof of the consistency result of Schick and Yu (1999) for the Non-Parametric Maximum Likelihood Estimator (NPMLE) $\hat{F}_n$ for mixed case interval censored data. Our proof is based on the inequality of the preceding section, and is similar in spirit to results of Van de Geer (1993), (1996).

Suppose that $Y$ is a random variable taking values in $R^+ = [0, \infty)$ with distribution function $F \in \mathcal{F} = \{\text{all df's } F \text{ on } R^+\}$. Unfortunately we are not able to observe $Y$ itself. What we do observe is a vector of times $T_K = (T_{K,1}, \ldots, T_{K,K})$ where $K$, the number of times, is itself random, and the interval $(T_{K,j-1}, T_{K,j}]$ into which $Y$ falls (with $T_{K,0} \equiv 0$, $T_{K,K+1} \equiv \infty$). More formally, we assume that $K$ is an integer-valued random variable, and $T = \{T_{k,j}, j = 1, \ldots, k, k = 1, 2, \ldots\}$ is a triangular array of potential observation times, and that $Y$ and $(K, T)$ are independent. Let $X = (\Delta_K, T_K, K)$, with a possible value $x = (\delta_k, t_k, k)$, where $\Delta_k = (\Delta_{k,1}, \ldots, \Delta_{k,k+1})$ with $\Delta_{k,j} = 1_{(T_{k,j-1}, T_{k,j}]}(Y)$, $j = 1, 2, \ldots, k+1$, and $T_k$ is the $k$th row of the triangular array $T$. Suppose we observe $n$ i.i.d. copies of $X$; $X_1, X_2, \ldots, X_n$, where $X_i = (\Delta^{(i)}_{K^{(i)}}, T^{(i)}_{K^{(i)}}, K^{(i)})$, $i = 1, 2, \ldots, n$. Here $(Y^{(i)}, T^{(i)}, K^{(i)})$, $i = 1, 2, \ldots$ are the underlying i.i.d. copies of $(Y, T, K)$.

We first note that conditionally on $K$ and $T_K$, the vector $\Delta_K$ has a multinomial distribution:

$$(\Delta_K \mid K, T_K) \sim \text{Multinomial}_{K+1}(1, \Delta F_K)$$

where

$$\Delta F_K \equiv (F(T_{K,1}), F(T_{K,2}) - F(T_{K,1}), \ldots, 1 - F(T_{K,K})).$$
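The observation scheme just described can be made concrete with a small simulation. The sketch below draws one observation $(\Delta_K, T_K, K)$; the function name `simulate_observation` and the sampler arguments `sample_y` and `sample_times` are hypothetical, and the exponential laws in the usage lines are arbitrary choices for illustration, not part of the model in the text.

```python
import math
import random


def simulate_observation(rng, sample_y, sample_times):
    """Draw one observation (deltas, times, k) under mixed case interval censoring.

    sample_y(rng) returns Y >= 0; sample_times(rng) returns the K observation
    times. Y and the times are drawn independently, as the model requires.
    """
    times = sorted(sample_times(rng))
    y = sample_y(rng)
    # Cell boundaries; the first cell is treated as [0, t_1] by using -inf below.
    bounds = [-math.inf] + times + [math.inf]
    deltas = [1 if bounds[j - 1] < y <= bounds[j] else 0
              for j in range(1, len(bounds))]
    return deltas, times, len(times)


if __name__ == "__main__":
    rng = random.Random(0)
    # Illustrative assumptions: Y ~ Exp(1), K uniform on {1, 2, 3}, Exp(1) times.
    draw_y = lambda r: r.expovariate(1.0)
    draw_times = lambda r: [r.expovariate(1.0) for _ in range(r.randint(1, 3))]
    print(simulate_observation(rng, draw_y, draw_times))
```

Each draw returns an indicator vector of length $k + 1$ with a single entry equal to 1, which is the multinomial structure with one trial noted above.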
Suppose for the moment that the distribution $G_k$ of $(T_K \mid K = k)$ has density $g_k$ and $p_k \equiv P(K = k)$. Then a density of $X$ is given by

$$p_F(x) \equiv p_F(\delta_k, t_k, k) = \prod_{j=1}^{k+1} (F(t_{k,j}) - F(t_{k,j-1}))^{\delta_{k,j}} \, g_k(t_k) \, p_k \tag{3}$$

where $t_{k,0} \equiv 0$, $t_{k,k+1} \equiv \infty$. In general,

$$p_F(x) \equiv p_F(\delta_k, t_k, k) = \prod_{j=1}^{k+1} (F(t_{k,j}) - F(t_{k,j-1}))^{\delta_{k,j}} = \sum_{j=1}^{k+1} \delta_{k,j} (F(t_{k,j}) - F(t_{k,j-1})) \tag{4}$$

is a density of $X$ with respect to the dominating measure $\nu$ where $\nu$ is determined by the joint distribution of $(K, T)$, and it is this version of the density of $X$ with which we will work throughout the rest of the paper. Thus the log-likelihood function for $F$ of $X_1, \ldots, X_n$ is given by

$$\frac{1}{n} l_n(F \mid X) = \frac{1}{n} \sum_{i=1}^n \sum_{j=1}^{K^{(i)}+1} \Delta^{(i)}_{K^{(i)},j} \log\Big(F(T^{(i)}_{K^{(i)},j}) - F(T^{(i)}_{K^{(i)},j-1})\Big) = \mathbb{P}_n m_F$$

where

$$m_F(X) = \sum_{j=1}^{K+1} \Delta_{K,j} \log(F(T_{K,j}) - F(T_{K,j-1})) \equiv \sum_{j=1}^{K+1} \Delta_{K,j} \log(\Delta F_{K,j})$$

and where we have ignored the terms not involving $F$. We also note that, with $P_0 \equiv P_{F_0}$,

$$P_0 m_F(X) = P_0\Big(\sum_{j=1}^{K+1} \Delta F_{0,K,j} \log(\Delta F_{K,j})\Big).$$

The Nonparametric Maximum Likelihood Estimator (NPMLE) $\hat{F}_n$ is the distribution function $\hat{F}_n(t)$ which puts all its mass at the observed time points and maximizes the log-likelihood $l_n(F \mid X)$. It can be calculated via the iterative convex minorant algorithm proposed in Groeneboom and Wellner (1992) for case 2 interval censored data.
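To make the criterion $\mathbb{P}_n m_F$ concrete, here is a minimal sketch that evaluates it for a candidate distribution function and a list of observations $(\delta, t)$. It only evaluates the log-likelihood; it does not compute the NPMLE and is not the iterative convex minorant algorithm. The names `cell_probabilities` and `mean_log_likelihood` are hypothetical helpers, and the uniform distribution function in the usage lines is an arbitrary example.

```python
import math


def cell_probabilities(cdf, times):
    """Return (F(t_1), F(t_2) - F(t_1), ..., 1 - F(t_k)) for sorted times."""
    values = [0.0] + [cdf(t) for t in times] + [1.0]
    return [values[j] - values[j - 1] for j in range(1, len(values))]


def mean_log_likelihood(cdf, observations):
    """Evaluate P_n m_F = (1/n) sum_i sum_j delta_ij * log(cell probability).

    observations is a list of (deltas, times) pairs, where deltas has
    len(times) + 1 entries and exactly one entry equal to 1.
    """
    total = 0.0
    for deltas, times in observations:
        for delta, prob in zip(deltas, cell_probabilities(cdf, times)):
            if delta:
                if prob <= 0.0:
                    return -math.inf  # candidate F gives the observed cell no mass
                total += math.log(prob)
    return total / len(observations)


if __name__ == "__main__":
    uniform_cdf = lambda t: min(max(t, 0.0), 1.0)
    data = [
        ([0, 1, 0], [0.25, 0.5]),  # Y fell in (0.25, 0.5]
        ([0, 1], [0.5]),           # Y fell in (0.5, infinity)
    ]
    print(mean_log_likelihood(uniform_cdf, data))
```

Because only the cell containing $Y$ contributes, the product form and the sum form of the density in (4) give the same value, which is why the inner loop can skip cells with $\delta_{k,j} = 0$.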
By Proposition 3 with $\alpha = 1$ and $\varphi \equiv \varphi_1$ as before, it follows that

$$h^2(p_{\hat{F}_n}, p_{F_0}) \le (\mathbb{P}_n - P_0)\big(\varphi(p_{\hat{F}_n}/p_{F_0})\big)$$

where $\varphi$ is bounded and continuous from $R$ to $R$. Now the collection of functions

$$\mathcal{G} \equiv \{p_F : F \in \mathcal{F}\}$$

is easily seen to be a Glivenko-Cantelli class of functions: this can be seen by first applying Theorem 4 to the collections $\mathcal{G}_k$, $k = 1, 2, \ldots$ obtained from $\mathcal{G}$ by restricting to the sets $K = k$. For fixed $k$, the collections $\mathcal{G}_k = \{p_F(\delta, t_k, k) : F \in \mathcal{F}\}$ are $P_0$-Glivenko-Cantelli classes, since $\mathcal{F}$ is a uniform Glivenko-Cantelli class and since the functions $p_F$ are continuous transformations of the classes of functions $x \mapsto \delta_{k,j}$ and $x \mapsto F(t_{k,j})$ for $j = 1, \ldots, k+1$; hence $\mathcal{G}$ is $P_0$-Glivenko-Cantelli by Theorem 3.

Note that the single function $p_{F_0}$ is trivially $P_0$-Glivenko-Cantelli since it is uniformly bounded, and the single function $(1/p_{F_0})$ is also $P_0$-GC since $P_0(1/p_{F_0}) < \infty$. Thus by Proposition 2 with $g = (1/p_{F_0})$ and $\mathcal{F} = \mathcal{G} = \{p_F : F \in \mathcal{F}\}$, it follows that $\mathcal{G}' \equiv \{p_F/p_{F_0} : F \in \mathcal{F}\}$ is $P_0$-Glivenko-Cantelli. Finally another application of Theorem 3 shows that the collection

$$\mathcal{H} \equiv \{\varphi(p_F/p_{F_0}) : F \in \mathcal{F}\}$$

is also $P_0$-Glivenko-Cantelli. When combined with Proposition 3, this yields the following theorem:

Theorem 7. The NPMLE $\hat{F}_n$ satisfies

$$h(p_{\hat{F}_n}, p_{F_0}) \to_{a.s.} 0.$$

To relate this result to a recent theorem of Schick and Yu (1999), it remains only to understand the relationship between their $L_1(\mu)$ metric and the Hellinger metric $h$ between $p_F$ and $p_{F_0}$.

Let $\mathcal{B}$ denote the collection of Borel sets in $R$. On $\mathcal{B}$ we define measures $\mu$ and $\tilde{\mu}$ as follows: for $B \in \mathcal{B}$,

$$\mu(B) = \sum_{k=1}^\infty P(K = k) \sum_{j=1}^k P(T_{k,j} \in B \mid K = k), \tag{5}$$

and

$$\tilde{\mu}(B) = \sum_{k=1}^\infty P(K = k) \, \frac{1}{k} \sum_{j=1}^k P(T_{k,j} \in B \mid K = k). \tag{6}$$

Let $d$ be the $L_1(\mu)$ metric on the class $\mathcal{F}$; thus for $F_1, F_2 \in \mathcal{F}$,

$$d(F_1, F_2) = \int |F_1(t) - F_2(t)| \, d\mu(t).$$
The measure $\mu$ was introduced by Schick and Yu (1999); note that $\mu$ is a finite measure if $E(K) < \infty$. Note that $d(F_1, F_2)$ can also be written in terms of an expectation as:

$$d(F_1, F_2) = E_{(K,T)}\Big[\sum_{j=1}^K |F_1(T_{K,j}) - F_2(T_{K,j})|\Big]. \tag{7}$$

As Schick and Yu (1999) observed, consistency of the NPMLE $\hat{F}_n$ in $L_1(\mu)$ holds under virtually no further hypotheses.

Theorem 8. (Schick and Yu). Suppose that $E(K) < \infty$. Then $d(\hat{F}_n, F_0) \to_{a.s.} 0$.

Proof. We will show that Theorem 8 follows from Theorem 7 and the following lemma.

Lemma 4.

$$\frac{1}{2}\Big\{\int |\hat{F}_n - F_0| \, d\tilde{\mu}\Big\}^2 \le h^2(p_{\hat{F}_n}, p_{F_0}).$$

Proof. We know that

$$h^2(p_{\hat{F}_n}, p_{F_0}) \le d_{TV}(p_{\hat{F}_n}, p_{F_0}) \le \sqrt{2}\, h(p_{\hat{F}_n}, p_{F_0})$$

where, with $y_{k,0} = -\infty$, $y_{k,k+1} = \infty$,

$$h^2(p_{\hat{F}_n}, p_{F_0}) = \sum_{k=1}^\infty P(K = k) \int \sum_{j=1}^{k+1} \Big\{[\hat{F}_n(y_{k,j}) - \hat{F}_n(y_{k,j-1})]^{1/2} - [F_0(y_{k,j}) - F_0(y_{k,j-1})]^{1/2}\Big\}^2 \, dG_k(y)$$

while

$$d_{TV}(p_{\hat{F}_n}, p_{F_0}) = \sum_{k=1}^\infty P(K = k) \int \sum_{j=1}^{k+1} \Big|[\hat{F}_n(y_{k,j}) - \hat{F}_n(y_{k,j-1})] - [F_0(y_{k,j}) - F_0(y_{k,j-1})]\Big| \, dG_k(y).$$

Note that

$$\sum_{j=1}^{k+1} \Big|[\hat{F}_n(y_{k,j}) - \hat{F}_n(y_{k,j-1})] - [F_0(y_{k,j}) - F_0(y_{k,j-1})]\Big| = \sum_{j=1}^{k+1} \big|(\hat{F}_n - F_0)(y_{k,j-1}, y_{k,j}]\big| \ge \max_{1 \le j \le k+1} |\hat{F}_n(y_{k,j}) - F_0(y_{k,j})|,$$

so integrating across this inequality with respect to $G_k(y)$ yields

$$\int \sum_{j=1}^{k+1} \Big|[\hat{F}_n(y_{k,j}) - \hat{F}_n(y_{k,j-1})] - [F_0(y_{k,j}) - F_0(y_{k,j-1})]\Big| \, dG_k(y) \ge \max_{1 \le j \le k} \int |\hat{F}_n(y_{k,j}) - F_0(y_{k,j})| \, dG_{k,j}(y_{k,j}) \ge \frac{1}{k} \sum_{j=1}^k \int |\hat{F}_n(y_{k,j}) - F_0(y_{k,j})| \, dG_{k,j}(y_{k,j}).$$

By multiplying across by $P(K = k)$ and summing over $k$, this yields

$$d_{TV}(p_{\hat{F}_n}, p_{F_0}) \ge \int |\hat{F}_n - F_0| \, d\tilde{\mu},$$

and hence

$$h^2(p_{\hat{F}_n}, p_{F_0}) \ge \frac{1}{2}\Big\{\int |\hat{F}_n - F_0| \, d\tilde{\mu}\Big\}^2. \tag{8}$$

The measure $\tilde{\mu}$ figuring in Lemma 4 is not the same as the measure $\mu$ of Schick and Yu (1999) because of the factor $1/k$. Note that this factor means that the measure $\tilde{\mu}$ is always a finite measure, even if $E(K) = \infty$. It is clear that

$$\tilde{\mu}(B) \le \mu(B)$$

for every Borel set $B$, and that $\mu \prec\prec \tilde{\mu}$. The following lemma (Lemma 2.2 of Schick and Yu (1999)) together with Lemma 4 shows that Theorem 7 implies the result of Schick and Yu once again:

Lemma 5. Suppose that $\mu$ and $\tilde{\mu}$ are two finite measures, and that $g, g_1, g_2, \ldots$ are measurable functions with range in $[0, 1]$. Suppose that $\mu$ is absolutely continuous with respect to $\tilde{\mu}$. Then $\int |g_n - g| \, d\tilde{\mu} \to 0$ implies that $\int |g_n - g| \, d\mu \to 0$.

Proof. Write

$$\int |g_n - g| \, d\mu = \int |g_n - g| \, \frac{d\mu}{d\tilde{\mu}} \, d\tilde{\mu}$$

and use the dominated convergence theorem applied to a.e. convergent subsequences.

6 Example: generalizing the result of Schick and Yu

Our goal in this section is to give a generalization of the consistency result of Schick and Yu (1999).
Suppose that $Y$ is a random variable taking values in $\mathcal{Y}$. Suppose that $Y$ has distribution $Q$ on the measurable space $(\mathcal{Y}, \mathcal{B})$. Unfortunately we are not able to observe $Y$ itself. What we do observe is a vector of random sets $C_K = (C_{K,1}, \ldots, C_{K,K})$ where $K$, the number of sets, is itself random, and the sets $\{C_{K,j}\}_{j=1}^K$ form a partition of $\mathcal{Y}$: $C_{K,j} \cap C_{K,j'} = \emptyset$ if $j \ne j'$ and $\cup_{j=1}^K C_{K,j} = \mathcal{Y}$. More formally, we assume that $K$ is an integer-valued random variable, and $C = \{C_{k,j}, j = 1, \ldots, k, k = 1, 2, \ldots\}$ is a triangular array of random sets, and that $Y$ and $(K, C)$ are independent. Let $X = (\Delta_K, C_K, K)$, with a possible value $x = (\delta_k, c_k, k)$, where $\Delta_k = (\Delta_{k,1}, \ldots, \Delta_{k,k})$ with $\Delta_{k,j} = 1_{C_{k,j}}(Y)$, $j = 1, 2, \ldots, k$, and $C_k$ is the $k$th row of the triangular array $C$. Suppose we observe $n$ i.i.d. copies of $X$; $X_1, X_2, \ldots, X_n$, where $X_i = (\Delta^{(i)}_{K^{(i)}}, C^{(i)}_{K^{(i)}}, K^{(i)})$, $i = 1, 2, \ldots, n$. Here $(Y^{(i)}, C^{(i)}, K^{(i)})$, $i = 1, 2, \ldots$ are the underlying i.i.d. copies of $(Y, C, K)$.

We first note that conditionally on $K$ and $C_K$, the vector $\Delta_K$ has a multinomial distribution:

$$(\Delta_K \mid K, C_K) \sim \text{Multinomial}_K(1, Q_K)$$

where

$$Q_K \equiv (Q(C_{K,1}), Q(C_{K,2}), \ldots, Q(C_{K,K})).$$

Suppose for the moment that the distribution $G_k$ of $(C_K \mid K = k)$ has density $g_k$ and $p_k \equiv P(K = k)$. Then a density of $X$ is given by

$$p_Q(x) \equiv p_Q(\delta_k, c_k, k) = \prod_{j=1}^k Q(c_{k,j})^{\delta_{k,j}} \, g_k(c_k) \, p_k. \tag{9}$$

In general,

$$p_Q(x) \equiv p_Q(\delta_k, c_k, k) = \prod_{j=1}^k Q(c_{k,j})^{\delta_{k,j}} = \sum_{j=1}^k \delta_{k,j} Q(c_{k,j}) \tag{10}$$

is a density of $X$ with respect to the dominating measure $\nu$ where $\nu$ is determined by the joint distribution of $(K, C)$, and it is this version of the density of $X$ with which we will work throughout the rest of the paper. Thus the log-likelihood function for $Q$ of $X_1, \ldots, X_n$ is given by

$$\frac{1}{n} l_n(Q \mid X) = \frac{1}{n} \sum_{i=1}^n \sum_{j=1}^{K^{(i)}} \Delta^{(i)}_{K^{(i)},j} \log Q(C^{(i)}_{K^{(i)},j}) = \mathbb{P}_n m_Q$$

where

$$m_Q(X) = \sum_{j=1}^K \Delta_{K,j} \log Q(C_{K,j})$$
and where we have ignored the terms not involving $Q$. We also note that, with $P_0 \equiv P_{Q_0}$,

$$P_0 m_Q(X) = P_0\Big(\sum_{j=1}^K Q_0(C_{K,j}) \log Q(C_{K,j})\Big).$$

The Nonparametric Maximum Likelihood Estimator (NPMLE) $\hat{Q}_n$ is a probability measure which maximizes the log-likelihood $l_n(Q \mid X)$. Here we will bypass the many interesting existence, characterization, and computational issues connected with the NPMLE $\hat{Q}_n$, and focus instead on the issue of consistency once we have the NPMLE in hand.

By Proposition 3 with $\alpha = 1$ it follows that

$$h^2(p_{\hat{Q}_n}, p_{Q_0}) \le (\mathbb{P}_n - P_0)\big(\varphi(p_{\hat{Q}_n}/p_{Q_0})\big)$$

where $\varphi$ is bounded and continuous. Now the collection of functions

$$\mathcal{G} \equiv \{p_Q : Q \in \mathcal{Q}\}$$

where $\mathcal{Q}$ is the collection of all probability distributions on $(\mathcal{Y}, \mathcal{B})$, is easily seen to be a Glivenko-Cantelli class of functions: this can be seen by first applying Theorem 4 to the collections $\mathcal{G}_k$, $k = 1, 2, \ldots$ obtained from $\mathcal{G}$ by restricting to the sets $K = k$.

The next step is to show that the collections

$$\mathcal{G}_k \equiv \{p_Q(\cdot, \cdot, k) : Q \in \mathcal{Q}\}$$

are $P_0$-Glivenko-Cantelli for each $k$. To this end, suppose that $(\mathcal{Y}, \mathcal{B})$ is a measurable space and let $\mathcal{C}$ be a universal GC-class of sets in $\mathcal{Y}$ (i.e. $Q$-GC for every probability measure $Q$ on $(\mathcal{Y}, \mathcal{B})$). Let $(T, \mathcal{T}, G)$ be a probability space, and let $t \mapsto C_t$ be a map with values in $\mathcal{C}$.

Lemma 6. Suppose that the collection $\{\{t : y \in C_t\} : y \in \mathcal{Y}\}$ is $G$-GC. Then the collection of functions $t \mapsto Q(C_t)$, where $Q$ ranges over all probability measures $Q$ on $\mathcal{Y}$, is $G$-GC.

Remark. By Assouad (1983), Proposition 2.2, page 246, together with Corollary 1.10, page 241, it follows that $\{\{t \in T : y \in C_t\} : y \in \mathcal{Y}\}$ is a VC-class of subsets of $T$ if and only if $\{C_t : t \in T\}$ is a VC-class of subsets of $\mathcal{Y}$. Thus if we assume that all the subsets $C = C_t$ arising in the partitions are elements of a VC-class $\mathcal{C}$, then, subject to being suitably measurable, the hypothesis of Lemma 6 is satisfied (and the class in question is even universal Glivenko-Cantelli).
Proof. For $Q$ the Dirac measure at $y$ (unit mass at $y$), the function $t \mapsto Q(C_t)$ becomes $t \mapsto 1_{C_t}(y)$. By assumption, the set of all such functions, with $y$ ranging over $\mathcal{Y}$, is $G$-GC. Then so is the set of all functions

$$t \mapsto \frac{1}{m} \sum_{i=1}^m 1_{C_t}(y_i), \qquad y_1, \ldots, y_m \in \mathcal{Y}, \ m \in N.$$

Let $Y_1, \ldots, Y_m$ be an i.i.d. sample from a given $Q$. Because $\mathcal{C}$ is $Q$-GC,

$$\sup_{C \in \mathcal{C}} \Big|\frac{1}{m} \sum_{i=1}^m 1\{Y_i \in C\} - Q(C)\Big| \to 0 \quad \text{a.s.}$$

Then there exists a sequence $y_1, y_2, \ldots$ in $\mathcal{Y}$ such that

$$\sup_{C \in \mathcal{C}} \Big|\frac{1}{m} \sum_{i=1}^m 1\{y_i \in C\} - Q(C)\Big| \to 0,$$

and consequently

$$\sup_{t \in T} \Big|\frac{1}{m} \sum_{i=1}^m 1\{y_i \in C_t\} - Q(C_t)\Big| \to 0.$$

It follows that the maps $t \mapsto Q(C_t)$ are the (uniform) limits of sequences of functions from a $G$-GC class. Hence they are $G$-GC.

Note on measurability: It is assumed implicitly that the sets $\{t : y \in C_t\}$ are measurable in $T$ for every $y \in \mathcal{Y}$. The proof shows that the maps $t \mapsto Q(C_t)$ are then also measurable.

It follows that for fixed $k$, the collections $\mathcal{G}_k = \{p_Q(\delta, c_k, k) : Q \in \mathcal{Q}\}$ are $P_0$-Glivenko-Cantelli, since the functions $p_Q$ are continuous transformations of the classes of functions $x \mapsto \delta_{k,j}$ and $x \mapsto Q(c_{k,j})$ for $j = 1, \ldots, k$; hence $\mathcal{G}$ is $P_0$-Glivenko-Cantelli by Theorem 3. Note that the single function $p_{Q_0}$ is trivially $P_0$-Glivenko-Cantelli since it is uniformly bounded. Now another application of Theorem 3 shows that the collection

$$\mathcal{H} \equiv \{\varphi(p_Q/p_{Q_0}) : Q \in \mathcal{Q}\}$$

is also $P_0$-Glivenko-Cantelli. When combined with Proposition 3, this yields the following theorem.

Theorem 9. If all $C_{K,j} \in \mathcal{C}$, a VC collection of subsets of $\mathcal{Y}$, then the NPMLE $\hat{Q}_n$ satisfies

$$h(p_{\hat{Q}_n}, p_{Q_0}) \to_{a.s.} 0.$$
To obtain a statement analogous to the theorem of Schick and Yu (1999), let $\Sigma$ denote a sigma algebra for the space $\mathcal{C}$ of subsets: we are assuming that $P\big(\cap_{j=1}^K [C_{K,j} \in \mathcal{C}]\big) = 1$. Thus $(\mathcal{C}, \Sigma)$ is a measurable space. On $\Sigma$ we define the measure $\mu$ as follows:

$$\mu(B) = \sum_{k=1}^\infty P(K = k) \sum_{j=1}^k P(C_{k,j} \in B \mid K = k), \qquad B \in \Sigma. \tag{11}$$

Note that

$$d_{TV}(p_{\hat{Q}_n}, p_{Q_0}) = \sum_{k=1}^\infty P(K = k) \int \sum_{j=1}^k |\hat{Q}_n(c_{k,j}) - Q_0(c_{k,j})| \, dG_k(c) = \int |\hat{Q}_n(c) - Q_0(c)| \, d\mu(c) \equiv d(\hat{Q}_n, Q_0). \tag{12}$$

Here is a generalization of the theorem of Schick and Yu (1999).

Theorem 10. The NPMLE $\hat{Q}_n$ satisfies

$$d(\hat{Q}_n, Q_0) \to_{a.s.} 0.$$

Proof. Theorem 10 follows from Theorem 9, (12) and the observation that

$$d(\hat{Q}_n, Q_0) = d_{TV}(p_{\hat{Q}_n}, p_{Q_0}) \le \sqrt{2}\, h(p_{\hat{Q}_n}, p_{Q_0}).$$

Example 1. (Mixed case interval censoring in $R^2$). Suppose that $Y \sim Q_0$ takes values in $R^{+2}$, and the partitions $\{C_{K,j}\}_{j=1}^K$ are obtained as the natural rectangles formed from two increasing collections $0 \le S_{K_1,1} \le \cdots \le S_{K_1,K_1}$ and $0 \le T_{K_2,1} \le \cdots \le T_{K_2,K_2}$ on the two coordinate axes. Then $K = (K_1 + 1)(K_2 + 1)$, and all the elements of the partitions are elements of the VC-class of rectangles in $R^2$. It follows that the NPMLE $\hat{Q}_n$ of $Q_0$ satisfies

$$\int_{s,t \in R^{+2}} |\hat{Q}_n(s, t] - Q_0(s, t]| \, d\mu(s, t) \to 0 \quad \text{a.s.}$$

Acknowledgments. We owe thanks to Evarist Giné and Joel Zinn for suggesting the proof given for Lemma 1.

References

Assouad, P. (1983). Densité et dimension. Annales de l'Institut Fourier 33, 233-282.

Dudley, R. M. (1984). A Course on Empirical Processes. Ecole d'été de probabilités de St.-Flour, 1982. Lecture Notes in Mathematics 1097, Springer, Berlin, 1-142.
Dudley, R. M. (1998a). Consistency of M-estimators and one-sided bracketing. In Proceedings of the First International Conference on High-Dimensional Probability. E. Eberlein, M. Hahn, and M. Talagrand, eds. Birkhäuser, Basel, pp. 33-58.

Dudley, R. M. (1998b). Personal communication.

Dudley, R. M., Giné, E., and Zinn, J. (1991). Uniform and universal Glivenko-Cantelli classes. J. Theoretical Probab. 4, 485-510.

Giné, E., and Zinn, J. (1984). Some limit theorems for empirical processes. Ann. Probability 12, 929-989.

Groeneboom, P. and Wellner, J. A. (1992). Information Bounds and Nonparametric Maximum Likelihood Estimation. DMV Seminar Band 19, Birkhäuser, Basel.

Pfanzagl, J. (1988). Consistency of maximum likelihood estimators for certain nonparametric families, in particular: mixtures. J. Statist. Planning and Inference 19, 137-158.

Schick, A. and Yu, Q. (1999). Consistency of the GMLE with Mixed Case Interval-Censored Data. Scand. J. Statist., to appear.

Talagrand, M. (1987a). Measurability problems for empirical measures. Ann. Probability 15, 204-212.

Talagrand, M. (1987b). The Glivenko-Cantelli problem. Ann. Probability 15, 837-870.

Talagrand, M. (1996). The Glivenko-Cantelli problem, ten years later. J. Theor. Prob. 9, 371-384.

Van de Geer, S. (1993). Hellinger consistency of certain nonparametric maximum likelihood estimators. Ann. Statist. 21, 14-44.

Van de Geer, S. (1996). Rates of convergence for the maximum likelihood estimator in mixture models. J. Nonparametric Statist. 6, 293-310.

Van der Vaart, A. W. (1996). New Donsker classes. Ann. Probability 24, 2128-2140.

Van der Vaart, A. W., and Wellner, J. A. (1996). Weak Convergence and Empirical Processes with Applications to Statistics. Springer-Verlag, New York.

Department of Mathematics
Free University
De Boelelaan 1081a
1081 HV Amsterdam
The Netherlands
e-mail: aad@cs.vu.nl

Department of Statistics
University of Washington
P.O. Box 354322
Seattle, Washington 98195-4322
U.S.A.
e-mail: jaw@stat.washington.edu