Maximum Likelihood Estimation under Shape Constraints


Hanna K. Jankowski

June 2-3, 2009

Contents

1  Introduction
2  The MLE of a decreasing density
   2.1  Finding the Estimator
   2.2  Consistency
   2.3  Inconsistency at Zero
   2.4  Local Asymptotic Theory
        2.4.1  Why Do We See a Cube Root Rate?
        2.4.2  Proofs and the Switching Relation
   2.5  Some Global Asymptotic Results
   2.6  Exercises
3  The Nonparametric MLE for Current Status Data
   3.1  Finding the Estimator
        3.1.1  A Maximal Intersection Approach
        3.1.2  A Convex Minorant Approach
   3.2  Consistency and Asymptotics
   3.3  Exercises
4  Disclaimer

1 Introduction

The estimation of functions such as densities, hazard rates, cumulative distributions or regression functions plays a large role in statistical inference. Often, it is quite natural to assume that the function of interest has a certain shape. For example, in economics it is often assumed that the functional form of the model is convex (Chak et al., 2005). In survival analysis, we may assume that the hazard function is bathtub, or U-shaped: that is, it is first decreasing and then increasing. Heuristically, bathtub-shaped hazards correspond to lifetime distributions with high initial hazard (infant mortality), lower and often rather constant hazard during the middle of life, and then increasing hazard of failure (wear-out) as aging proceeds (see, for example, Jankowski and Wellner (2009) and the references therein). As yet another example, functions such as the cumulative distribution function are naturally shape constrained to be increasing.¹

Many different shape assumptions may be made. Here, we restrict ourselves to a discussion of the k-monotone family of density functions (in fact, we will shortly narrow our focus even further). A density f is k-monotone if

    (-1)^m f^{(m)} is decreasing and convex                              (1.1)

for all m \in \{0, \ldots, k-2\} when k \ge 2, and if it is decreasing when k = 1. Note that for k = 2, we are looking at the family of convex decreasing densities. If a density satisfies (1.1) for all m, then it is said to be completely monotone.

The study of decreasing densities began with the seminal work of Grenander (1956), who studied the maximum likelihood estimator (MLE) of a decreasing density and of a decreasing hazard function (motivated by problems of survival analysis in actuarial science). Consistency of the MLE was shown in Marshall and Proschan (1965), and Prakasa Rao (1970) later established the local asymptotic distribution theory at a fixed point. Notably, convergence here occurs at rate n^{1/3}. The estimators for k = 2 were considered in Groeneboom et al. (2001b,a), and the general k \ge 1 case was studied by Balabdaoui and Wellner (2007) with convergence rates of n^{k/(2k+1)}, although several open problems still remain here. For the completely monotone setting we currently know that the MLE exists and is consistent (Jewell, 1982).

One interesting fact about the k-monotone family of densities is their characterization via mixtures.

Proposition 1.1. A density f is k-monotone (completely monotone) if and only if it can be represented as a scale mixture of Beta(1,k) (exponential) densities:

    f(x) = \int_0^\infty y^{-1} \bigl(1 - x/(ky)\bigr)_+^{k-1} \, d\mu(y),    1 \le k < \infty,
    f(x) = \int_0^\infty y^{-1} \exp(-x/y) \, d\mu(y),                        k = \infty,

for some probability measure \mu.

¹ Throughout, we say "decreasing" in lieu of "non-increasing", and "strictly decreasing" for "decreasing". A similar language will be used for increasing, positive and negative functions or variables.

For a proof of this fact see Williamson (1956), Lévy (1962) and Gneiting (1999) when k < \infty, and Widder (1941), Gneiting (1998) and Feller (1971) when k = \infty. We note that appropriately modified versions of these characterizations hold true for other functions, and not just densities.

In this short course we focus on two specific examples from the k = 1 family. We will study the nonparametric maximum likelihood estimator for:

1. The MLE of a decreasing density on [0, \infty).
2. The MLE of the cumulative distribution function for current status (interval censoring, case 1) data.

2 The MLE of a decreasing density

By Proposition 1.1, every decreasing density is a mixture of uniform densities. That is, for some probability measure \mu,

    f(x) = \int_0^\infty y^{-1} I_{[0,y]}(x) \, d\mu(y).                  (2.1)

Suppose that we observe n independent realizations of a random variable with a decreasing density on [0, \infty). The likelihood is then

    L(f) = \prod_{i=1}^n f(x_i),

and we assume (without loss of generality) that the observed x_i values are ordered. Without any shape restrictions on the density, the MLE of f would be

    \tilde{f}(x) = \frac{1}{n} \sum_{i=1}^n \delta_{x_i}(x),

where \delta_y is the Dirac delta function. This estimator is weakly consistent and converges at rate n^{1/2}; however, it is not even a density!

2.1 Finding the Estimator

Our problem is then to maximize

    L(f) = \prod_{i=1}^n f(x_i),

subject to \int_0^\infty f(x) \, dx = 1 and f being decreasing. This may be accomplished in a series of steps.

Figure 1: MLE via least concave majorants, for the uniform (top) and exponential (bottom) distributions. On the left is shown the empirical cdf along with its least concave majorant. On the right is the resulting density estimate (black) along with the true density (red).

STEP 1: Note that the likelihood only looks at the values of the function f at the points x_1 \le x_2 \le \ldots \le x_n. Consider any decreasing function f over the interval (x_{i-1}, x_i] (where we take x_0 = 0). The likelihood increases if the value of f(x_i) increases, and the value of f on (x_{i-1}, x_i) has no impact on the likelihood.

Therefore, if f maximizes L, it must be constant on (x_{i-1}, x_i]. We have thus learned that the MLE is a left-continuous, piecewise constant function with jumps only at the observed data points.

STEP 2: Let y_i = f(x_i) for i = 1, \ldots, n. By the above step, to find the MLE we need to find the values y_1, \ldots, y_n which maximize

    L = \prod_{i=1}^n y_i,

subject to y_1 \ge y_2 \ge \ldots \ge y_n \ge 0 and \sum_{i=1}^n y_i (x_i - x_{i-1}) = 1. We have therefore reduced the question from an infinite-dimensional problem to an n-dimensional one. Moreover, since the function l = \log L is continuous and concave in y, a unique solution exists.

STEP 3: To find the constrained maximum of L (or, equivalently, of l = \log L), one may first try to use the Lagrange method. Unfortunately, this approach will not work directly, as the answer typically lies on the boundary of the domain: the solution lies on the boundary if y_{i+1} = y_i for some i. We can still learn something quite useful about the MLE via the Lagrange approach, when we constrain the problem to lie in the interior of the domain. Suppose that we know a priori that the solution is made up of only p unique values of y. That is,

    y_1 = y_2 = \ldots = y_{i_1} = \phi_1,
    y_{i_1+1} = y_{i_1+2} = \ldots = y_{i_2} = \phi_2,
    \ldots
    y_{i_{p-1}+1} = y_{i_{p-1}+2} = \ldots = y_{i_p} = \phi_p,

where p and the locations i_1, \ldots, i_p are known ahead of time, and only the values of \phi_1, \ldots, \phi_p are unknown. We assume that i_p = n and that \phi_1 > \phi_2 > \ldots > \phi_p. Since we have constrained the problem to lie away from the boundary, we may now use a Lagrange approach to find the values of \phi_j for j = 1, \ldots, p. The reduced problem is to maximize

    L = \prod_{j=1}^p \phi_j^{(i_j - i_{j-1})}

(taking i_0 = 0), subject to \sum_{j=1}^p \phi_j (x_{i_j} - x_{i_{j-1}}) = 1. This gives the answer

    \phi_j = \frac{1}{n} \cdot \frac{i_j - i_{j-1}}{x_{i_j} - x_{i_{j-1}}}.
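The Lagrange computation behind this display (Exercise 1 below asks you to verify it) can be sketched as follows; the shorthand n_j and \Delta_j for the block counts and widths is mine:

```latex
% Sketch of STEP 3, with n_j = i_j - i_{j-1} and \Delta_j = x_{i_j} - x_{i_{j-1}}.
\log L = \sum_{j=1}^p n_j \log \phi_j, \qquad
\text{constraint: } \sum_{j=1}^p \phi_j \Delta_j = 1.
% Stationarity of \sum_j n_j \log\phi_j - \lambda\bigl(\sum_j \phi_j\Delta_j - 1\bigr):
\frac{n_j}{\phi_j} = \lambda \Delta_j
  \;\Longrightarrow\; \phi_j = \frac{n_j}{\lambda \Delta_j}.
% Substituting into the constraint gives \sum_j n_j/\lambda = 1, and since
% \sum_j n_j = i_p = n, we get \lambda = n. Hence
\phi_j = \frac{1}{n}\,\frac{n_j}{\Delta_j}
       = \frac{1}{n}\,\frac{i_j - i_{j-1}}{x_{i_j} - x_{i_{j-1}}}.
```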

Thus, we have reduced the problem to finding the values of i_1, \ldots, i_p. Note that part of finding the correct values of i_1, \ldots, i_p is the stipulation that the \phi values are strictly decreasing.

STEP 4: Suppose that we do know the values of i_1, \ldots, i_p, and let \hat{F}_n denote the cdf associated with the MLE, found by integrating the density \hat{f}. A simple calculation shows that

    \hat{F}_n(x_{i_j}) = i_j / n                                          (2.2)

for j = 1, \ldots, p. Let F_n denote the empirical cumulative distribution function of the data. That is,

    F_n(x) = \frac{1}{n} \sum_{i=1}^n I(x_i \le x).

We may re-write (2.2) as

    \hat{F}_n(x_{i_j}) = F_n(x_{i_j}),    j = 1, \ldots, p.

We also know that \hat{F}_n is a concave function, as the MLE must be decreasing. We will next prove that

    \hat{F}_n(x) \ge F_n(x).                                              (2.3)

Proof of (2.3). As \hat{F}_n is piecewise linear, it is enough to show that (2.3) holds at all data points. To wit, suppose that there exists some x_v such that \hat{F}_n(x_v) < F_n(x_v), and let \hat{f} be the density of \hat{F}_n as given above. Then x_v falls in some interval (x_{i_{j-1}}, x_{i_j}). Let m = i_{j-1} and \bar{m} = i_j, and write \delta_1 = x_v - x_m and \delta_2 = x_{\bar{m}} - x_v. Since \hat{F}_n is linear on [x_m, x_{\bar{m}}] with \hat{F}_n(x_m) = m/n and \hat{F}_n(x_{\bar{m}}) = \bar{m}/n, while F_n(x_v) = v/n, we have

    \hat{F}_n(x_v) - F_n(x_v)
      = \frac{m}{n} + \frac{\delta_1}{\delta_1 + \delta_2} \cdot \frac{\bar{m} - m}{n} - \frac{v}{n}
      = \frac{1}{n(\delta_1 + \delta_2)} \bigl( \delta_1 (\bar{m} - v) - \delta_2 (v - m) \bigr).

By assumption this is negative, so that

    \frac{v - m}{\delta_1} - \frac{\bar{m} - v}{\delta_2} > 0.             (2.4)

Now let \bar{f} be another decreasing density, equal to \hat{f} except on the interval (x_m, x_{\bar{m}}]: we define \bar{f} to be equal to \hat{f} + \epsilon/\delta_1 on (x_m, x_v], and \hat{f} - \epsilon/\delta_2 on (x_v, x_{\bar{m}}], so that \bar{f} still integrates to one.

To finish, consider the difference of the likelihoods at \bar{f} and \hat{f}. Since \hat{f} is constant on (x_m, x_{\bar{m}}], with v - m observations in (x_m, x_v] and \bar{m} - v observations in (x_v, x_{\bar{m}}], this difference is equal to a positive constant times

    (\hat{f} + \epsilon/\delta_1)^{v-m} (\hat{f} - \epsilon/\delta_2)^{\bar{m}-v} - \hat{f}^{\,\bar{m}-m}
      = \epsilon \, \hat{f}^{\,\bar{m}-m-1} \Bigl( \frac{v-m}{\delta_1} - \frac{\bar{m}-v}{\delta_2} \Bigr) + O(\epsilon^2),

which is strictly positive by (2.4). Therefore, for \epsilon small enough, we have found a new decreasing density with a larger likelihood. As this is impossible, it must be that (2.3) holds.

STEP 5: Let us collect what we have so far. We know that \hat{F}_n(x) \ge F_n(x) for all x, with equality at the special locations x_{i_j} for j = 1, \ldots, p. Moreover, \hat{F}_n(x) is concave by definition. Thus, it is a concave majorant of F_n. Since it is linear in between the touchpoints, it must be the least concave majorant.

Theorem 2.1. The maximum likelihood estimator of a decreasing density on [0, \infty) is the left derivative of the least concave majorant of the empirical cumulative distribution function F_n. Here, we define \hat{f}_n(0) = \hat{f}_n(0+).

This result gives us an easy graphical representation of the MLE. Some examples are shown in Figure 1. Note that the left derivative of the piecewise linear majorant of F_n is exactly what is needed to recover the left-continuous MLE.
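Theorem 2.1 translates directly into a small algorithm. The sketch below is my own minimal R reconstruction (not the grenander.r file mentioned in the exercises): it computes the least concave majorant of the empirical cdf with a stack-based scan and returns its left-derivative slopes. It assumes distinct observations.

```r
## Minimal sketch of the Grenander estimator: left derivative of the
## least concave majorant (LCM) of the empirical cdf.
grenander <- function(x) {
  x <- sort(x)
  n <- length(x)
  px <- c(0, x)                 # x-coordinates of the cdf vertices
  py <- c(0, seq_len(n)) / n    # y-coordinates: 0, 1/n, ..., 1
  keep <- 1                     # stack of vertex indices on the LCM
  for (i in 2:(n + 1)) {
    keep <- c(keep, i)
    ## pop the middle vertex while the last two slopes fail to decrease
    while (length(keep) >= 3) {
      k  <- length(keep)
      s1 <- (py[keep[k - 1]] - py[keep[k - 2]]) / (px[keep[k - 1]] - px[keep[k - 2]])
      s2 <- (py[keep[k]] - py[keep[k - 1]]) / (px[keep[k]] - px[keep[k - 1]])
      if (s2 >= s1) keep <- keep[-(k - 1)] else break
    }
  }
  ## the MLE equals the LCM slope on each segment between knots
  list(knots = px[keep], fhat = diff(py[keep]) / diff(px[keep]))
}

## example: estimate a standard exponential density
set.seed(1)
est <- grenander(rexp(100))
plot(stepfun(est$knots[-1], c(est$fhat, 0), right = TRUE), do.points = FALSE)
curve(dexp(x), add = TRUE, col = "red")
```

The while-loop enforces that consecutive slopes strictly decrease, which is exactly concavity of the piecewise linear majorant.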

2.2 Consistency

Figure 2: The MLE (black) and the true density (red) for the uniform and exponential densities.

Figure 2 shows examples of the MLE for the exponential and uniform distributions. In general, the estimator appears to have really nice behaviour (especially considering the flexibility of the nonparametric MLE). The only problems appear near zero. Indeed, as we will see later, the Grenander estimator is not consistent at zero.

Theorem 2.2. Suppose that X_1, X_2, \ldots are IID random variables with a decreasing density f_0 on [0, \infty). Then the MLE \hat{f}_n is uniformly consistent on closed intervals bounded away from zero: that is, for each c > 0, we have

    \sup_{c \le x < \infty} |\hat{f}_n(x) - f_0(x)| \to 0

almost surely.

The first proof of this fact appeared in Marshall and Proschan (1965). Here, we present a proof based on a general method first developed in Jewell (1982).

Proof. The proof relies on convex optimization, and we must first set up the appropriate notation. Notice that the log-likelihood, normalized by n, may be written as

    l(f) = \frac{1}{n} \sum_{i=1}^n \log f(x_i) = \int \log f(x) \, dF_n(x).

The MLE is the maximizer of l(f) over the space K_1, the space of positive, decreasing functions on [0, \infty) which integrate to one (that is, the space of decreasing densities). Equivalently, we may minimize -l(f) over K_1. It turns out that it is a little easier to relax the minimization. Let K denote the space of positive and decreasing functions on [0, \infty). Next, define the function

    \phi(f) = -\int \log f(x) \, dF_n(x) + \int f(x) \, dx.              (2.5)

It turns out that the function which minimizes \phi(f) over K has integral equal to one, and is therefore the MLE. We will prove this fact in the first step below. The remaining steps prove the entire result.

STEP 1: Consider minimizing the function \phi over K, and let \hat{f} denote its minimizer. We know that for any other function g \in K, \phi(g) \ge \phi(\hat{f}). Let \epsilon > 0. If \hat{f} + \epsilon g \in K for all small \epsilon, then \phi(\hat{f} + \epsilon g) \ge \phi(\hat{f}), and hence the directional derivative

    \partial_g \phi(\hat{f}) = \lim_{\epsilon \downarrow 0} \frac{\phi(\hat{f} + \epsilon g) - \phi(\hat{f})}{\epsilon}
                             = -\int \frac{g(x)}{\hat{f}(x)} \, dF_n(x) + \int g(x) \, dx

is non-negative; if \hat{f} \pm \epsilon g \in K, then \partial_g \phi(\hat{f}) = 0. Choosing g = \hat{f}, we see that (1 \pm \epsilon)\hat{f} \in K for all \epsilon sufficiently small, and therefore \partial_{\hat{f}} \phi(\hat{f}) = 0. This implies that

    0 = -\int \frac{\hat{f}(x)}{\hat{f}(x)} \, dF_n(x) + \int \hat{f}(x) \, dx = -1 + \int \hat{f}(x) \, dx,

or \int \hat{f}(x) \, dx = 1. It follows that to find the MLE we may simply minimize \phi(f) over K.

STEP 2: Recall that f_0 is the true density of the observed data, and let F_0(x) = \int_0^x f_0(u) \, du. It is well known that \sup_x |F_n(x) - F_0(x)| \to 0 almost surely. Let \Omega_0 \subset \Omega denote the event on which this convergence occurs, and note that P(\Omega_0) = 1.

STEP 3: Recall the characterization (2.1) of a decreasing density. It follows that

    f(x) = \int_0^\infty y^{-1} I_{[0,y]}(x) \, d\mu(y) = \int_x^\infty y^{-1} \, d\mu(y) \le \frac{1}{x}.

This tells us that any decreasing density has an upper bound at each x > 0. Fix \omega \in \Omega_0 (for the remainder of the proof!), and consider the sequence \{\hat{f}_n\}. For each x > 0, we can find a subsequence \{n_k\} such that \hat{f}_{n_k}(x) converges (since \hat{f}_n(x) is bounded above by 1/x). Using a standard diagonalization procedure, we conclude that there exists a further subsequence, which we denote for simplicity as \{\hat{f}_n\}, and a function \hat{f}, such that \hat{f}_n(x) \to \hat{f}(x) for all x > 0. Note that \hat{f} must also be positive and decreasing. Moreover, by Fatou's lemma, it satisfies

    \int \hat{f}(x) \, dx \le 1.                                          (2.6)

STEP 4: Choosing g = f_0 in the directional derivative defined in STEP 1 above, we find that (since \hat{f}_n + \epsilon f_0 \in K for any \epsilon > 0, and hence \partial_{f_0} \phi(\hat{f}_n) \ge 0)

    \int \frac{f_0(x)}{\hat{f}_n(x)} \, dF_n(x) \le \int f_0(x) \, dx = 1.    (2.7)

STEP 5: Fix \delta > 0, and let \eta_\delta = F_0^{-1}(1 - \delta). We may assume that \delta is small enough that \delta < \eta_\delta. From (2.7), it follows that there exists a finite \tau_\delta > 0 such that \hat{f}_n(\eta_\delta) \ge \tau_\delta for all sufficiently large n. This implies that \sup_{\delta \le x \le \eta_\delta} |1/\hat{f}_n(x) - 1/\hat{f}(x)| \to 0 as n \to \infty. Thus, by applying the general form of the dominated convergence theorem and (2.7), we find that

    \int_\delta^{\eta_\delta} \frac{f_0(x)}{\hat{f}(x)} \, dF_0(x) \le 1.

Since \delta was chosen arbitrarily small, we may let \delta \to 0 to obtain

    \int \frac{f_0(x)}{\hat{f}(x)} \, dF_0(x) = \int \frac{f_0^2(x)}{\hat{f}(x)} \, dx \le 1.    (2.8)

STEP 6: Finally, fix \epsilon > 0. Using (2.6) and (2.8), we have that

    0 \le \int_\epsilon^{1/\epsilon} \frac{(f_0(x) - \hat{f}(x))^2}{\hat{f}(x)} \, dx
        = \int_\epsilon^{1/\epsilon} \frac{f_0^2(x)}{\hat{f}(x)} \, dx - 2 \int_\epsilon^{1/\epsilon} f_0(x) \, dx + \int_\epsilon^{1/\epsilon} \hat{f}(x) \, dx
       \le 2 - 2 \int_\epsilon^{1/\epsilon} f_0(x) \, dx.

Letting \epsilon \to 0, the right-hand side converges to zero, and therefore \hat{f}(x) = f_0(x) for all x > 0. We have thus shown that for any \omega \in \Omega_0, where P(\Omega_0) = 1, every subsequence of \{\hat{f}_n\} has a further subsequence which converges to f_0 pointwise on (0, \infty); hence the full sequence converges pointwise. Since the functions in question are decreasing densities, we immediately also obtain uniform convergence over [c, \infty). This implies that \sup_{x \in [c, \infty)} |\hat{f}_n(x) - f_0(x)| \to 0 on \Omega_0, and completes the proof.

Remark 2.3. The approach taken in STEP 1 above to show that \int \hat{f}(x) \, dx = 1 may be extended by considering other functions g. That is, we can plug different choices of g into the inequality \partial_g \phi(\hat{f}) \ge 0, as long as \hat{f} + \epsilon g \in K. This yields a characterization of the MLE

in terms of a series of inequalities. For the Grenander estimator, nothing beats the least concave majorant characterization. However, for estimators under other constraints (e.g. k = 2), these characterizations turn out to be the best, and only, descriptions available of the estimators.

2.3 Inconsistency at Zero

Many shape-constrained estimators of the type we are considering here are inconsistent at x = 0 (or at other endpoints). For the Grenander estimator, this problem was first studied in Woodroofe and Sun (1993) when f_0(0) is bounded. We stipulate here that f_0(0) = f_0(0+), and use the same convention for the MLE itself. The results of Woodroofe and Sun (1993) were later extended to other situations in Balabdaoui et al. (2009), where different behaviour near zero of the true density is considered. For the k = 2 case, see Groeneboom et al. (2001b), where the estimator of a convex density is shown to be inconsistent at zero, and Jankowski and Wellner (2009), where the estimator of a convex hazard function over [0, T] is shown to be inconsistent at both 0 and T. Woodroofe and Sun (1993) also propose an alternative to the Grenander estimator near zero which is consistent, and a similar approach is taken in Balabdaoui (2007) for the general k-monotone density estimator.

Why is this problem even interesting? As one example, suppose that the times between failures of a system are IID random variables Y_1, \ldots, Y_n with common distribution G and finite positive mean \mu_0. Fix a time t, and let N = N_t denote the largest n such that Y_1 + \ldots + Y_n \le t (if the Y_i were exponential then N_t would be the Poisson process). The variable N is the number of failures observed prior to time t. Suppose that we observe only X = t - (Y_1 + \ldots + Y_N), the time since the last breakdown. Then, for large t, we may approximate the density of X as

    f(x) = \frac{1}{\mu_0} \bigl[ 1 - G(x-) \bigr],

which is decreasing since G is a cdf. Also, f(0) = 1/\mu_0, which is the natural quantity of interest in this problem. Other examples are given in Balabdaoui (2007), Balabdaoui et al. (2009) and Woodroofe and Sun (1993).

Theorem 2.4. Suppose that 0 < f_0(0) < \infty, and let N(t) denote a rate one Poisson process. Then

    \frac{\hat{f}_n(0)}{f_0(0)} \Rightarrow \sup_{t > 0} \frac{N(t)}{t} \stackrel{d}{=} \frac{1}{U},

where U is a uniform random variable on the unit interval.

Sketch of Proof. By the characterization of the MLE via the least concave majorant, we know that

    \hat{f}_n(0) = \sup_{t > 0} \frac{F_n(t)}{t} = \sup_{t > 0} \frac{n F_n(t/n)}{t}.

Note that n F_n(t/n) is a Binomial random variable with parameters n and F_0(t/n) \approx n^{-1} f_0(0) t, which converges to a Poisson random variable with mean f_0(0) t. This proves the first part of the result. The fact that the supremum of N(t)/t has the same distribution as the reciprocal of a uniform random variable was shown, for example, in Pyke (1959).
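Theorem 2.4 is easy to probe numerically. The sketch below is my own illustration (not code from the course website): it uses the identity \hat{f}_n(0) = \max_i (i/n)/x_{(i)} and compares the simulated quantiles of \hat{f}_n(0)/f_0(0) with those of 1/U, whose p-quantile is 1/(1-p).

```r
## Monte Carlo check of Theorem 2.4: for standard exponential data
## (so f_0(0) = 1), fhat_n(0) = sup_t F_n(t)/t = max_i (i/n)/x_(i)
## should be approximately distributed as 1/U, U ~ Uniform(0,1).
set.seed(2)
n <- 10000
ratio <- replicate(2000, {
  x <- sort(rexp(n))
  max(seq_len(n) / n / x)
})
rbind(simulated = quantile(ratio, c(.5, .75, .9)),
      limit     = 1 / (1 - c(.5, .75, .9)))   # quantiles of 1/U
```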

2.4 Local Asymptotic Theory

We now know that the MLE is a consistent estimator (except at zero). Next, we wish to know the rate of convergence. For the Grenander estimator, the local (and global) rate of convergence is n^{1/3}, which is much slower than the n^{1/2} rate we typically see with parametric estimators. In a sense, this is the price one pays for the flexibility of nonparametric estimation. The local n^{1/3} rate of convergence is only seen if the true density is locally strictly decreasing. In regions where the true density is flat, we obtain the usual n^{1/2} rate. The two main results are as follows.

Theorem 2.5. Let X_1, \ldots, X_n be independent observations from a decreasing density f_0 on [0, \infty). Fix a point x_0 \in (0, \infty) at which f_0 is differentiable with f_0'(x_0) < 0. Then

    n^{1/3} \left| \tfrac{1}{2} f_0(x_0) f_0'(x_0) \right|^{-1/3} \bigl( \hat{f}_n(x_0) - f_0(x_0) \bigr) \Rightarrow 2Z,

where Z is the location of the maximum of the process \{B(t) - t^2, t \in R\}, and B(t) is a standard two-sided Brownian motion on R with B(0) = 0.

The above result was first shown in Prakasa Rao (1970). The proof we present here is based on the approach via the switching relation developed by Piet Groeneboom (see Groeneboom (1985), as well as Groeneboom (1983) and Groeneboom (1986)). The distribution of Z is known as Chernoff's distribution (after Chernoff (1964), who first showed how it arises as the limit in mode estimation). Chernoff shows that the density of Z may be written as

    f_Z(z) = \frac{1}{2} u_x(z^2, z) \, u_x(z^2, -z),

where u is the solution to the heat equation u_t = u_{xx}/2 for x < t^2, with boundary conditions u(x, t) = 1 for x \ge t^2 and u(x, t) \to 0 as x \to -\infty. In Groeneboom (1989), Groeneboom (who won the Rollo Davidson prize in 1985) derived the density of Z in terms of the Airy function Ai. He showed that g(z) = u_x(z^2, z) is the function with Fourier transform

    \hat{g}(s) = 2^{1/3} / Ai(i 2^{-1/3} s),    s \in R.

Exact quantiles for this distribution may be found in Groeneboom and Wellner (2001).

Remark 2.6. In fact, a stronger result than that stated in Theorem 2.5 holds. Under the same conditions, one can show that

    n^{1/3} \bigl( \hat{f}_n(x_0 + n^{-1/3} x) - f_0(x_0) \bigr) \Rightarrow V(x),

where V(x) is the left derivative at x of the least concave majorant of the process \{ \sqrt{f_0(x_0)} B(t) + (f_0'(x_0)/2) t^2, t \in R \}. This result provides one reason why we see the n^{1/3} rate of convergence: in a neighbourhood of x_0, there are roughly n^{1/3} touchpoints between F_n and its least concave majorant. Therefore, to get a non-trivial limit, we need to rescale space on the order of n^{-1/3}.

Theorem 2.7. Suppose that X_1, \ldots, X_n are IID Uniform[0,1]. For any x_0 \in (0, 1), we have that

    \sqrt{n} \bigl( \hat{f}_n(x_0) - 1 \bigr) \Rightarrow W,

where W is the left derivative at x_0 of the least concave majorant of the standard Brownian bridge.

This result was proved in Groeneboom (1986), and extended to densities with locally flat regions in Carolan and Dykstra (1999).

2.4.1 Why Do We See a Cube Root Rate?

There are many other examples in statistics where an argmax estimator converges at a rate slower than the more typical n^{1/2}. One of the simplest illustrations of what is going on is the following. Suppose that we are interested in finding the value \theta_0 which maximizes

    P[\theta - 1, \theta + 1] = E_P\bigl[ I(\theta - 1 \le X \le \theta + 1) \bigr].

To do this, we use the estimator \hat\theta_n which maximizes the empirical version

    P_n[\theta - 1, \theta + 1] = \frac{1}{n} \sum_{i=1}^n I(\theta - 1 \le X_i \le \theta + 1).

If P has a smooth density p, then P[\theta - 1, \theta + 1] is parabolic near \theta_0, since (assuming wlog that \theta_0 < \theta)

    P[\theta_0 - 1, \theta_0 + 1] - P[\theta - 1, \theta + 1]
      = \int_{-1+\theta_0}^{-1+\theta} p(x) \, dx - \int_{1+\theta_0}^{1+\theta} p(x) \, dx
      \approx p(-1+\theta_0)(\theta - \theta_0) + p'(-1+\theta_0)(\theta - \theta_0)^2/2
              - p(1+\theta_0)(\theta - \theta_0) - p'(1+\theta_0)(\theta - \theta_0)^2/2.

Since \theta_0 maximizes P[\theta - 1, \theta + 1], we have that p(1+\theta_0) = p(-1+\theta_0). Also, p must be increasing at -1+\theta_0 and decreasing at 1+\theta_0. Therefore,

    P[\theta_0 - 1, \theta_0 + 1] - P[\theta - 1, \theta + 1] \approx c (\theta - \theta_0)^2,

for some positive constant c. The function above describes the deterministic trend of \hat\theta_n about \theta_0. The random perturbation is equal to

    (P_n - P)[\theta - 1, \theta + 1] - (P_n - P)[\theta_0 - 1, \theta_0 + 1],

which is a centered average and therefore approximately normal, with variance

    \sigma^2 = \frac{1}{n} \mathrm{var}\bigl( I[\theta - 1 \le X \le \theta + 1] - I[\theta_0 - 1 \le X \le \theta_0 + 1] \bigr)
             = \frac{1}{n} \mathrm{var}\bigl( I[\theta_0 - 1 \le X \le \theta - 1] - I[\theta_0 + 1 \le X \le \theta + 1] \bigr)
           \le \frac{1}{n} \Bigl\{ P[\theta_0 + 1, \theta + 1] + P[\theta_0 - 1, \theta - 1] \Bigr\}
             = \frac{1}{n} \Bigl\{ \int_{\theta_0+1}^{\theta+1} p(x) \, dx + \int_{\theta_0-1}^{\theta-1} p(x) \, dx \Bigr\}
         \approx \frac{c'}{n} (\theta - \theta_0),

for some other positive constant c'. It follows that the random perturbation term is of the order O_p(n^{-1/2} |\theta - \theta_0|^{1/2}). In order for \hat\theta_n to be the maximum, the random perturbation term and the deterministic term must be of the same order. That is,

    random term + deterministic term = P_n[\theta - 1, \theta + 1] - P_n[\theta_0 - 1, \theta_0 + 1],

which must be non-negative for \theta = \hat\theta_n, since \hat\theta_n is the maximizer. Therefore,

    O_p\bigl( n^{-1/2} |\theta - \theta_0|^{1/2} \bigr) \ge c (\theta - \theta_0)^2,

which rearranges to |\theta - \theta_0| = O_p(n^{-1/3}), giving the n^{1/3} rate.
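This heuristic is easy to check by simulation; the sketch below is my own illustration (not part of the original notes), using N(0,1) data so that \theta_0 = 0. The log-log slope of the mean absolute error against n should be roughly -1/3.

```r
## Numerical check of the cube-root heuristic: theta_hat maximizes
## P_n[theta - 1, theta + 1] for N(0,1) data, where theta_0 = 0.
set.seed(11)
avg_err <- function(n, reps = 100) {
  mean(replicate(reps, {
    x <- rnorm(n)
    ## the empirical count changes only at x_i - 1 and x_i + 1,
    ## so it suffices to search over these candidate values
    cand <- c(x - 1, x + 1)
    cnt  <- vapply(cand, function(th) sum(x >= th - 1 & x <= th + 1),
                   integer(1))
    abs(cand[which.max(cnt)])
  }))
}
ns <- c(100, 400, 1600)
e  <- vapply(ns, avg_err, numeric(1))
coef(lm(log(e) ~ log(ns)))[2]   # roughly -1/3
```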

At the heart of the argument is the fact that the function of interest, g(x, \theta) = I[\theta - 1 \le x \le \theta + 1], is not smooth. If we had instead considered g(x, \theta) = (x - \theta)^2, the maximization would yield the typical n^{1/2} rate of convergence.

The above example is taken from Kim and Pollard (1990), who study many examples of cube-root asymptotics and develop a general theory for the problem. Note that the exposition given above is not rigorous. To make the argument formal, one would also need to prove these results uniformly in \theta, so that we can simply plug in the random \theta = \hat\theta_n. This may be done using empirical process theory. The same argument as above, but with some nice pictures, may also be found in Maathuis (2007).

For the Grenander estimator \hat{f}_n, instead of maximizing the function P_n[\theta - 1, \theta + 1] we minimize \phi(f) as defined in (2.5). At first glance, \phi is a smooth function, quite different from g(x, \theta) = I[\theta - 1 \le x \le \theta + 1]. However, we must not forget the restriction to the class of decreasing densities. Recall that by (2.1), the density function may be written as

    f(x) = \int_0^\infty y^{-1} I_{[0,y]}(x) \, d\mu(y).

Thus, at the root of the problem, we are working with the functions g(x, y) = I_{[0,y]}(x), and we therefore expect a similar behaviour as in the example considered above.

2.4.2 Proofs and the Switching Relation

Define the process

    \hat{s}_n(a) = \sup\bigl\{ x \ge 0 : F_n(x) - ax = \sup_z \bigl( F_n(z) - az \bigr) \bigr\}.

To simplify notation, we will also write this as \hat{s}_n(a) = \mathrm{argmax}_x \{ F_n(x) - ax \}. The switching relation is the name given to the following set equality:

    \{ \hat{s}_n(a) < x \} = \{ \hat{f}_n(x) < a \}.                      (2.9)

The relation was first noted in Groeneboom (1983). For a proof using convex analysis, as well as more background on the relation, see Balabdaoui et al. (2009) and van der Vaart and van der Laan (2006). For a visual justification, see Figure 3. The advantage of the switching relation is that it shows that the least concave majorant is a continuous mapping of F_n, since the argmax is a continuous map (see e.g. Ferger (2004)). Thus, to find the limits, we need only understand the underlying limiting process, and then apply the continuous mapping theorem.

Figure 3: Imagine dropping a line with slope a from infinity. The line will make contact with the graph of F_n at some point. In the picture, this is the red line, with the least concave majorant denoted in black; the black dots denote the points of the empirical cdf. Let x^* = \hat{s}_n(a), with a being the slope of the red line. Note that the value of \hat{f}_n(x) for any x > x^* must be smaller than a.

Proof of Theorem 2.5. By the switching relation, we have that

    P\bigl( n^{1/3} ( \hat{f}_n(x_0) - f_0(x_0) ) < t \bigr)
      = P\bigl( \hat{f}_n(x_0) < f_0(x_0) + n^{-1/3} t \bigr)
      = P\bigl( \hat{s}_n(f_0(x_0) + n^{-1/3} t) < x_0 \bigr)
      = P\bigl( n^{1/3} ( \hat{s}_n(f_0(x_0) + n^{-1/3} t) - x_0 ) < 0 \bigr),    (2.10)

where we have added the additional n^{1/3} scaling. We next examine the term n^{1/3}(\hat{s}_n(f_0(x_0) + n^{-1/3} t) - x_0) more closely. This is equal to

    n^{1/3} \bigl( \mathrm{argmax}_x \{ F_n(x) - (f_0(x_0) + n^{-1/3} t) x \} - x_0 \bigr)
      = \mathrm{argmax}_{h \ge -n^{1/3} x_0} \{ F_n(x_0 + n^{-1/3} h) - (f_0(x_0) + n^{-1/3} t)(x_0 + n^{-1/3} h) \}
      = \mathrm{argmax}_{h \ge -n^{1/3} x_0} \{ F_n(x_0 + n^{-1/3} h) - F_n(x_0) - n^{-1/3} (f_0(x_0) + n^{-1/3} t) h \}
      = \mathrm{argmax}_{h \ge -n^{1/3} x_0} \{ n^{2/3} ( F_n(x_0 + n^{-1/3} h) - F_n(x_0) - n^{-1/3} (f_0(x_0) + n^{-1/3} t) h ) \}
      = \mathrm{argmax}_{h \ge -n^{1/3} x_0} V_n(h),

where

    V_n(h) = n^{2/3} \bigl( F_n(x_0 + n^{-1/3} h) - F_n(x_0) - n^{-1/3} f_0(x_0) h \bigr) - th
           = A_n(h) + B_n(h) - th,

letting

    A_n(h) = n^{2/3} \bigl( F_n(x_0 + n^{-1/3} h) - F_n(x_0) - ( F_0(x_0 + n^{-1/3} h) - F_0(x_0) ) \bigr),
    B_n(h) = n^{2/3} \bigl( F_0(x_0 + n^{-1/3} h) - F_0(x_0) - n^{-1/3} f_0(x_0) h \bigr).

The limit of B_n is easy: by a Taylor expansion,

    B_n(h) = n^{2/3} \bigl( F_0(x_0 + n^{-1/3} h) - F_0(x_0) - n^{-1/3} f_0(x_0) h \bigr) \to \frac{1}{2} f_0'(x_0) h^2.

The limit of A_n is also not hard to see. Let U_n be the empirical process \sqrt{n}(P_n - P) when P is the uniform distribution on the unit interval, let U be the standard Brownian bridge, and let B denote a standard Brownian motion. We then have

    A_n(h) \stackrel{d}{=} n^{1/6} \bigl( U_n(F_0(x_0 + n^{-1/3} h)) - U_n(F_0(x_0)) \bigr)
           \approx n^{1/6} \bigl( U(F_0(x_0 + n^{-1/3} h)) - U(F_0(x_0)) \bigr)
           \stackrel{d}{=} n^{1/6} \bigl( B(F_0(x_0 + n^{-1/3} h)) - B(F_0(x_0)) - B(1)( F_0(x_0 + n^{-1/3} h) - F_0(x_0) ) \bigr)
           \approx n^{1/6} \bigl( B(F_0(x_0 + n^{-1/3} h)) - B(F_0(x_0)) \bigr) - n^{-1/6} f_0(x_0) B(1) h
           \stackrel{d}{=} n^{1/6} B\bigl( F_0(x_0 + n^{-1/3} h) - F_0(x_0) \bigr) - n^{-1/6} f_0(x_0) B(1) h
           \approx n^{1/6} B\bigl( f_0(x_0) n^{-1/3} h \bigr) - n^{-1/6} f_0(x_0) B(1) h
           \stackrel{d}{=} \sqrt{f_0(x_0)} B(h) - n^{-1/6} f_0(x_0) B(1) h
           \to \sqrt{f_0(x_0)} B(h).

It follows, by the continuous mapping theorem, that

    \mathrm{argmax}_{h \ge -n^{1/3} x_0} V_n(h) \Rightarrow \mathrm{argmax}_{h \in R} \Bigl\{ \sqrt{f_0(x_0)} B(h) + \frac{1}{2} f_0'(x_0) h^2 - th \Bigr\}.

Let a = f_0(x_0), b = |f_0'(x_0)|/2 and c^3 = a b^{-2}. Note that, since f_0 is strictly decreasing at x_0, b is strictly positive.

Some careful calculation shows that

    \mathrm{argmax}_h \{ \sqrt{a} B(h) - b h^2 - t h \}
      = c \, \mathrm{argmax}_h \{ \sqrt{a} B(ch) - b c^2 h^2 - t c h \}
      \stackrel{d}{=} c \, \mathrm{argmax}_h \{ \sqrt{ac} B(h) - b c^2 h^2 - t c h \}
      = c \, \mathrm{argmax}_h \{ (a^2 b^{-1})^{1/3} ( B(h) - h^2 - \tilde{t} h ) \},

where \tilde{t} = t (ab)^{-1/3}, and where we used that \sqrt{a} B(ch) \stackrel{d}{=} \sqrt{ac} B(h) and that b c^2 = \sqrt{ac} = (a^2 b^{-1})^{1/3} by the choice c^3 = a b^{-2}. Continuing with the calculation, we find that the last line above is equal to

    c \, \mathrm{argmax}_h \{ B(h) - h^2 - \tilde{t} h \}
      = c \, \mathrm{argmax}_h \{ B(h) - (h + \tilde{t}/2)^2 \}
      = c \bigl( \mathrm{argmax}_h \{ B(h - \tilde{t}/2) - h^2 \} - \tilde{t}/2 \bigr)
      \stackrel{d}{=} c \bigl( \mathrm{argmax}_h \{ B(h) - h^2 \} - \tilde{t}/2 \bigr).

Plugging this result back into (2.10), we find that

    P\bigl( n^{1/3} ( \hat{f}_n(x_0) - f_0(x_0) ) < t \bigr)
      \to P\bigl( \mathrm{argmax}_h \{ B(h) - h^2 \} - \tilde{t}/2 < 0 \bigr)
      = P\bigl( 2 (ab)^{1/3} \, \mathrm{argmax}_h \{ B(h) - h^2 \} < t \bigr).

Since the process B(h) - h^2 has a unique maximum almost surely, and 2(ab)^{1/3} = 2 | f_0(x_0) f_0'(x_0) / 2 |^{1/3}, this proves the claim.
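The limit random variable Z = argmax\{B(h) - h^2\} is easy to simulate crudely; the sketch below is my own illustration, approximating the two-sided Brownian motion on a grid over [-3, 3] (the truncation is harmless since the tails of Z decay very quickly).

```r
## Crude Monte Carlo for Chernoff's distribution Z = argmax_t {B(t) - t^2},
## with B a standard two-sided Brownian motion approximated on a grid.
set.seed(3)
h  <- 0.001; n1 <- 3000
tt <- h * (-n1:n1)
rZ <- replicate(2000, {
  ## two independent Brownian halves, pinned together at zero
  B <- c(rev(cumsum(rnorm(n1, sd = sqrt(h)))), 0,
         cumsum(rnorm(n1, sd = sqrt(h))))
  tt[which.max(B - tt^2)]
})
mean(rZ)        # approximately 0, by symmetry
mean(abs(rZ))   # approximates E|Z|, the constant entering Theorem 2.8 below
```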

2.5 Some Global Asymptotic Results

The previous section looked at what happens to the estimator at a fixed point x_0. In addition to these results, there are also a number of global theorems, which give information about convergence of the estimator at more than just one point. In particular, we have the following theorem, proved in Groeneboom (1985).

Theorem 2.8. Let f_0 be a decreasing density with support [0, T], with a bounded, continuous second derivative, such that f_0'(x) < 0 for all x \in (0, T). Then

    n^{1/6} \Bigl\{ n^{1/3} \int_0^T | \hat{f}_n(x) - f_0(x) | \, dx - C \Bigr\} \Rightarrow N(0, \sigma^2),

where \sigma^2 \approx 0.17 and

    C = 2 E|Z| \int_0^T \bigl| f_0'(x) f_0(x) / 2 \bigr|^{1/3} dx,

with Z equal to the location of the maximum of \{B(h) - h^2\} as above, and E|Z| \approx 0.41.

The assumption that the true density is strictly decreasing is key to this result. For densities with regions of constancy, we observe a local rate of convergence of n^{1/2}. The following theorem, established in Groeneboom and Pyke (1983), shows the corresponding global result for the uniform distribution.

Theorem 2.9. Let \hat{f}_n be the Grenander estimator of a decreasing density when the data come from a uniform distribution on the unit interval. Then

    \frac{1}{\sqrt{3 \log n}} \Bigl( n \int_0^1 ( \hat{f}_n(x) - 1 )^2 dx - \log n \Bigr) \Rightarrow N(0, 1).

2.6 Exercises

1. Show that the constrained maximization solution in STEP 3 of Section 2.1 is correct.

2. I have posted R code on the website which finds the Grenander estimator (MLE) of a decreasing density (see grenander.r). Pick your favourite decreasing density and compare how well the MLE performs for different sample sizes n. How does this relate to the theory presented in the lectures?

3. The alternative characterization mentioned in Remark 2.3 is stated as follows. The function \hat{f}_n is the Grenander estimator if and only if it satisfies

       \int_0^y \frac{1}{\hat{f}_n(x)} \, dF_n(x) \le y    for all y \ge 0,

   with equality at all points of discontinuity of \hat{f}_n.

   (a) Prove the necessity of these conditions. To do this, consider the directional derivative \partial_g \phi(\hat{f}_n) for g(x) = I_{[0,y]}(x), and follow the approach described in Remark 2.3.
   (b) How does the choice of g above relate to (2.1)?

4. In Section 2.4.1, we gave a reason for the n^{1/3} rate of convergence of the estimator of the location of the maximum of the function P[\theta - 1, \theta + 1]. Repeat the argument to show that the estimator of the location of the minimum of P(X - \theta)^2 = E_P[(X - \theta)^2] must exhibit n^{1/2} convergence.

5. Use the switching relation to prove Theorem 2.7.

3 The Nonparametric MLE for Current Status Data

Medical studies, reliability theory and actuarial science often deal with the estimation of lifetime distributions. In these situations, it is typical to observe censored data of a variety of types. Here, we discuss one specific case, known as current status, or interval censoring case 1, data.

In Hoel and Walberg (1972), the time to onset of lung cancer in 144 male RFM mice was studied (lung tumors are predominantly non-lethal in RFM mice). The mice were separated into two groups: mice subject to a conventional environment (96 subjects) and mice subject to a germ-free environment (48 subjects). Each mouse was sacrificed at a random time, and at this time the presence or absence of a tumor was determined. This data set is also given in Sun (2006, page 6). These types of experiments are called tumorigenicity studies, and are designed to determine whether some agent or environment accelerates the time until tumor onset. In this case, we are interested in estimating the time until the onset of the tumor for both groups. However, we do not observe the exact time of onset for the subjects. We only observe whether or not the tumor was present at a random time.

The observed data can be described as follows. Let X denote the failure time, with distribution F. In our example, X is the onset time of the lung tumor. Our goal is to estimate F. Let T denote the random observation time, with distribution G, and assume that T is independent of X. In our example, T is the time that a mouse is sacrificed. We observe n IID copies of (T, \Delta), where \Delta = I(X \le T). This type of data is called current status data, as we only observe the current status of each subject at the observation time, and not the actual time of failure. It is also a special type of interval censoring: for each subject, we observe only whether the failure time fell into the (random) interval [0, T] or (T, \infty) (see e.g. Figure 4). Specifically, this type of interval censoring is called interval censoring case 1, as there is exactly one observation time per subject. One can also consider interval censoring case k, where there are k observation times per subject, but this is beyond the scope of this course.

Figure 4: An example of current status data with n = 4. We observe t = (1, 2.5, 3, 5) and \delta = (1, 0, 1, 0).

3.1 Finding the Estimator

The likelihood for current status data may be written as

    L(F) = \prod_{i=1}^n F(t_i)^{\delta_i} (1 - F(t_i))^{1 - \delta_i} g(t_i)
         \propto \prod_{i=1}^n F(t_i)^{\delta_i} (1 - F(t_i))^{1 - \delta_i}.

Without loss of generality, we assume that the observation times have been ordered: t_1 \le t_2 \le \ldots \le t_n. Our goal is to maximize L(F), or equivalently to maximize the log-likelihood

    l(F) = \sum_{i=1}^n \delta_i \log F(t_i) + (1 - \delta_i) \log(1 - F(t_i)),

over the space of distribution functions. Note that the likelihood depends only on the values of F at the observation times t_i, so that one can change F away from these points and obtain the same likelihood. This means that the cdf F is not uniquely determined by the maximization problem. To get around this, we make the following standard additional assumptions.

1. We assume that the MLE is a piecewise constant function, with jumps only at the observation points.
2. If \hat{F}(t_n) < 1, we do not specify the location of the remaining mass.

Under these conventions, the nonparametric MLE is uniquely determined. Let y_i = F(t_i). We may now re-write the maximization problem as the n-dimensional problem of maximizing

    l(y) = \sum_{i=1}^n \delta_i \log y_i + (1 - \delta_i) \log(1 - y_i)    (3.1)

subject to 0 \le y_1 \le y_2 \le \ldots \le y_n \le 1.

Suppose that \delta_n = 1. Then we must have that y_n = 1, since this maximizes the corresponding term in the likelihood without placing any restrictions on y_{n-1}, say. A similar behaviour occurs if \delta_1 = 0, when we must have y_1 = 0. For this reason, we also add the following assumption.

3. We assume that \delta_1 = 1 and \delta_n = 0. If this is not the case, then let i_1 = \min\{i : \delta_i = 1\} and i_n = \max\{i : \delta_i = 0\}, and set \hat{y}_i = 0 for all i < i_1 and \hat{y}_i = 1 for all i > i_n for the MLE.

3.1.1 A Maximal Intersection Approach

One way to think about the maximization problem is in terms of the observed intervals. Let us denote these as R_1, \ldots, R_n. That is, in the example where n = 4, t = (1, 2.5, 3, 5) and \delta = (1, 0, 1, 0), we observe the intervals R_1 = [0, 1], R_2 = (2.5, \infty), R_3 = [0, 3] and R_4 = (5, \infty). These intervals are shown in Figure 4. The likelihood may be re-written as

    L(F) = \prod_{i=1}^n P(R_i).

It is clear that the MLE must place all of its mass on the intervals R_i. However, we can reduce these sets even further. A set I is a maximal intersection if there exists a subset \beta \subset \{1, \ldots, n\} such that I = \cap_{i \in \beta} R_i \ne \emptyset, but \cap_{i \in \beta'} R_i = \emptyset for all strict supersets \beta', that is, \beta \subsetneq \beta' \subset \{1, \ldots, n\}. Alternatively, for each x \in R_+, we could define the height map h(x) = \#\{i : x \in R_i\}. Then the maximal intersections are the sets formed by the local maxima of the height map. In our example, the maximal intersections are I_1 = [0, 1], I_2 = (2.5, 3] and I_3 = (5, \infty). (These are the intervals shaded in grey on the x-axis in Figure 4.)

Now, suppose that the MLE puts some mass on the set R_3 \setminus I_1. Since this set shows up only once in the likelihood, while the interval I_1 shows up twice, we increase the likelihood by shifting that mass onto the interval I_1. This implies that the MLE will put mass only on the maximal intersections observed. This property was first observed in Peto (1973) and Turnbull (1976). Denote the maximal intersections as I_j, for j = 1, \ldots, m, and denote the mass placed in I_j as \alpha_j. We may then re-write the log-likelihood in terms of the

masses \alpha_j as

    l(\alpha) = \sum_{i=1}^n \log P(R_i)
              = \sum_{i=1}^n \log \Bigl( \sum_{j=1}^m \alpha_j I(I_j \subset R_i) \Bigr),

which needs to be maximized over all \alpha \ge 0 such that \alpha_1 + \ldots + \alpha_m = 1. Note that we have made the assumption that the cdf is piecewise constant with jumps at the observation times t_i only. This corresponds to the assumption that the mass \alpha_j is placed at the right-most point of the maximal intersection I_j.

In the example considered above, we find that

    l(\alpha) = \log \alpha_1 + \log(\alpha_2 + \alpha_3) + \log(\alpha_1 + \alpha_2) + \log \alpha_3,

which is maximized at the boundary point \alpha = (0.5, 0, 0.5). This corresponds to putting mass 1/2 at the point t_1 = 1 and the remaining mass beyond t_4 = 5.
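As a quick sanity check (my own illustration, not part of the original notes), one can maximize l(\alpha) numerically over the simplex with a crude grid search:

```r
## Numerical check that l(alpha) from the n = 4 example is maximized at
## the boundary point alpha = (0.5, 0, 0.5). The simplex is parametrized
## by (a1, a2), with a3 = 1 - a1 - a2.
ll <- function(a) log(a[1]) + log(a[2] + a[3]) + log(a[1] + a[2]) + log(a[3])
g  <- expand.grid(a1 = seq(0.01, 0.98, by = 0.01),
                  a2 = seq(0.00, 0.98, by = 0.01))
g  <- g[g$a1 + g$a2 <= 0.99, ]           # keep a3 >= 0.01
v  <- apply(g, 1, function(p) ll(c(p[1], p[2], 1 - p[1] - p[2])))
g[which.max(v), ]                        # approximately (0.5, 0)
```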

The approach to solving the problem using maximal intersections extends nicely to other forms of interval censoring; see, for example, the discussion in Maathuis (2007). However, in the special case of current status data, there is a better way of finding the estimator.

3.1.2 A Convex Minorant Approach

We now return to the formulation of the likelihood problem given in (3.1). The following proposition gives a characterization of the MLE; a proof is given, for example, in Groeneboom and Wellner (1992). Recall that the shape-constrained MLE may often be found in terms of a set of equalities (and inequalities). The reason for this is that maximizing the likelihood is actually a constrained convex optimization problem, and these inequalities may be seen as a special case of Fenchel duality. A discussion of this link appears in Robertson et al. (1988, Chapter 6).

Proposition 3.1. The vector \hat{y} maximizes l(y) given in (3.1) subject to 0 \le y_1 \le \ldots \le y_n \le 1 if and only if the following conditions hold.

(C1) For all j = 1, \ldots, n,

    \sum_{i=j}^n \Bigl( \frac{\delta_i}{\hat{y}_i} - \frac{1 - \delta_i}{1 - \hat{y}_i} \Bigr) \le 0,

with equality if \hat{y}_j > \hat{y}_{j-1}.

(C2)

    \sum_{i=1}^n \hat{y}_i \Bigl( \frac{\delta_i}{\hat{y}_i} - \frac{1 - \delta_i}{1 - \hat{y}_i} \Bigr) = 0.

Moreover, the \hat{y} which satisfies the equations above is unique.

A little bit of calculation allows us to re-write this characterization as follows.

Corollary 3.2. The vector \hat{y} is the MLE if and only if

    \sum_{i=1}^{j-1} ( \hat{y}_i - \delta_i ) \le 0    for all j = 1, \ldots, n,

with equality holding if \hat{y}_j > \hat{y}_{j-1}.

The incredible thing about this corollary is that it allows us to again characterize the MLE in terms of convex minorants (last time we had concave majorants...). Indeed, the vector \hat{y} is the vector of left derivatives, at j = 1, \ldots, n, of the greatest convex minorant of the graph of \{(j, \sum_{i=1}^j \delta_i) : j = 0, \ldots, n\}. This is easy to see: define \Delta(j) = \sum_{i=1}^j \delta_i and G(j) = \sum_{i=1}^j \hat{y}_i, with G linear in between integer values of j. Clearly, G is convex and a minorant of \Delta. That it is the greatest such minorant follows from the linearity of G between points of touch with \Delta. Also, the left derivative of G at j is its slope there, given by G(j) - G(j-1) = \hat{y}_j.

The graphical characterization gives us a very simple way of finding the MLE. The example considered previously in Figure 4 is shown using this approach in Figure 5.

Figure 5: Finding the MLE using the greatest convex minorant approach.
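In R, the slopes of the greatest convex minorant of the cumulative-sum diagram are exactly what stats::isoreg computes, since the isotonic least-squares fit and the GCM slopes coincide (this is the isotonic-regression connection discussed at the end of this section). A minimal sketch, for the Figure 4 example (this is my own illustration, not the posted status.r):

```r
## MLE for the current status example of Figure 4, via the greatest
## convex minorant of the cusum diagram: isoreg() returns the isotonic
## (monotone increasing) least-squares fit of delta on t, whose fitted
## values are the GCM slopes.
t <- c(1, 2.5, 3, 5)
d <- c(1, 0, 1, 0)
yhat <- isoreg(t, d)$yf   # MLE of F at the observation times
yhat                      # 0.5 0.5 0.5 0.5, i.e. Fhat = 1/2 on [t_1, t_4]
```

This matches the maximal-intersection solution above: mass 1/2 at t_1 = 1 and mass 1/2 beyond t_4 = 5.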

Proof of Corollary 3.2. Let \tau and \eta be two successive jump points of \hat{y}. That is,

    \hat{y}_{\tau-1} < \hat{y}_\tau = \hat{y}_{\tau+1} = \ldots = \hat{y}_{\eta-1} < \hat{y}_\eta.

By (C1), we also know that, for \tau < j \le \eta,

    \sum_{i=\tau}^{j-1} \Bigl( \frac{\delta_i}{\hat{y}_i} - \frac{1 - \delta_i}{1 - \hat{y}_i} \Bigr) \ge 0,

with strict inequality for j = \tau + 1, \ldots, \eta - 1, and equality for j = \eta (this is the difference of the sums in (C1) starting at i = \tau and at i = j, and the sum starting at a jump point vanishes). We multiply the above display by \hat{y}_i (1 - \hat{y}_i), which is constant for \tau \le i \le \eta - 1, and obtain

    \sum_{i=\tau}^{j-1} \bigl( \delta_i (1 - \hat{y}_i) - (1 - \delta_i) \hat{y}_i \bigr) = \sum_{i=\tau}^{j-1} ( \delta_i - \hat{y}_i ) \ge 0,

with equality at j = \eta. Piecing together the blocks of constancy proves the result.

Recall the tumorigenicity study of RFM mice (taken from Sun (2006)) considered at the beginning of this section. The nonparametric MLE of the survival functions for both groups is shown in Figure 6 below.

Figure 6: Survival functions resulting from the nonparametric MLE for the RFM mice data set. The germ-free mice are shown in blue (mean = 689.5), and the conventional environment in red (mean > 79.28).

The convex minorant characterization may also be arrived at by viewing the MLE as an isotonic regression. That is, one may show that the MLE is also the minimizer of the least-squares criterion

    \sum_{i=1}^n ( \delta_i - y_i )^2

over all y subject to our monotonicity conditions. For more information on isotonic regression and the link to convex minorants, we refer to Robertson et al. (1988, Theorem 1.2.1).

3.2 Consistency and Asymptotics

Figure 7: Here X comes from an exponential distribution with mean one, and T from an exponential with mean 0.5. The plots show the MLE together with the true cdf F (red) for increasing sample sizes n.

Figure 7 gives an example of the MLE as the sample size increases. From the figure, it appears that the estimator is consistent for the true distribution F; a picture of this kind is easy to reproduce, as in the sketch below.
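```r
## A sketch reproducing the flavour of Figure 7 (my own illustration,
## not the posted status.r): X ~ Exponential(mean 1), T ~ Exponential
## (mean 0.5), with the MLE computed via isotonic regression as in
## Section 3.1.2.
set.seed(4)
n <- 1000
x <- rexp(n)                      # failure times (never observed)
t <- rexp(n, rate = 2)            # observation times, mean 0.5
d <- as.numeric(x <= t)           # current status indicators
o <- order(t); t <- t[o]; d <- d[o]
Fhat <- isoreg(t, d)$yf           # MLE of F at the ordered times
plot(t, Fhat, type = "s", ylim = c(0, 1), xlab = "x", ylab = "F(x)")
curve(pexp(x), add = TRUE, col = "red")   # true cdf of X
```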

However, there is one condition we need for this to hold.

Theorem 3.3. Let F denote the true distribution of X and G the distribution of the observation times T. Suppose also that P_F \ll P_G. Then the nonparametric MLE \hat{F}_n of F is consistent. That is,

    \sup_{t \in R} | \hat{F}_n(t) - F(t) | \to 0

almost surely.

Recall that P_F \ll P_G means that the probability measure P_F induced by F is absolutely continuous with respect to the probability measure P_G induced by G. That is, if there is a set A such that P_G(A) = 0, then we must also have that P_F(A) = 0. It is easy to see why this assumption is necessary: if G misses a region of positive probability under F, then there is no way for the estimator to learn anything about F there. Notice also that the MLE is now consistent everywhere (unlike the Grenander estimator of a decreasing density, which is inconsistent at zero). This is because we have more control over cumulative distribution functions than over densities. For a proof of Theorem 3.3, see Groeneboom and Wellner (1992); the proof follows a method similar to that shown in the proof of Theorem 2.2.

Next, we consider the global and local convergence rates of the estimators. Again, we find that these are of order n^{1/3}, as is true for all monotone nonparametric maximum likelihood estimators.

Theorem 3.4. Let t_0 be such that 0 < F(t_0), G(t_0) < 1, and let F and G be differentiable at t_0, with strictly positive derivatives f(t_0) and g(t_0), respectively. Then

    n^{1/3} \bigl( \hat{F}_n(t_0) - F(t_0) \bigr) \Rightarrow 2cZ,

where

    c = \Bigl( \frac{F(t_0)(1 - F(t_0)) f(t_0)}{2 g(t_0)} \Bigr)^{1/3},

and Z is the location of the maximum of the process \{B(t) - t^2, t \in R\}, with B(t) a standard two-sided Brownian motion on R such that B(0) = 0.

Theorem 3.5.

    \int | \hat{F}_n(x) - F(x) | \, dG(x) = O_p(n^{-1/3}).

Theorem 3.4 is taken from Groeneboom and Wellner (1992, page 89), and Theorem 3.5 is taken from van der Vaart and Wellner (1996, page 322).

For current status data, there is no known switching relation. Instead, we use the characterization of the MLE via the inequalities in Proposition 3.1 to prove these results. The idea is to use this characterization as an operator which uniquely finds the MLE as a function of some underlying process. The operator is then shown to be continuous, and the limit of the underlying process is found. The details of the proofs are given in the references. It is also possible to state stronger results for \hat{F}_n(t_0), similar to that of Remark 2.6 (see e.g. Theorem 4.2 in Maathuis (2007)). However, the statement of Theorem 3.4 is particularly nice, in that the constant c is revealing in its dependence on g(t_0). We already know that we

need P_F \ll P_G for consistency to hold, as otherwise G undersamples regions of F. On a local level, it would be optimal for G also not to oversample relative to F. We see this in the term f(t_0)/g(t_0) which appears in the constant c. The dependence of c on F(t_0)(1 - F(t_0)) is natural and to be expected.

3.3 Exercises

1. Prove the necessity of the conditions (C1) and (C2) of Proposition 3.1.

   (a) Define the directional derivative of the likelihood function as

           \partial_y l(\hat{y}) = \lim_{\epsilon \downarrow 0} \epsilon^{-1} \bigl( l(\hat{y} + \epsilon y) - l(\hat{y}) \bigr),

       and show that it is equal to

           \partial_y l(\hat{y}) = \sum_{i=1}^n y_i \Bigl( \frac{\delta_i}{\hat{y}_i} - \frac{1 - \delta_i}{1 - \hat{y}_i} \Bigr).

   (b) Define Y = \{ y \in R^n : 0 \le y_1 \le \ldots \le y_n \le 1 \}. Show that if \hat{y} + \epsilon y \in Y for all small \epsilon > 0, then \partial_y l(\hat{y}) \le 0, and that if \hat{y} \pm \epsilon y \in Y, then \partial_y l(\hat{y}) = 0.

   (c) Define e^{(i)} = (0, \ldots, 0, 1, \ldots, 1), with a total of i - 1 zeros in the vector. Now consider the choices y = \hat{y} and y = e^{(i)}. Plugging these into part (b) above should yield conditions (C1) and (C2).

   (d) Recall the characterization of decreasing densities given in (2.1). Write a similar characterization for all increasing functions. Can you relate the choice of the vectors e^{(i)} to this characterization?

2. Posted on the website is R code to find the nonparametric MLE of current status data using the graphical representation (see status.r). I have also posted the data set hepa.txt, which includes Hepatitis A data taken from Keiding (1991). The data consist of a cross-sectional survey of people of various ages who had their blood tested for the presence of Hepatitis A antibodies. The data give the age (age), the number of people with the antibodies present (npos), and the total number tested at a given age (total). Find the nonparametric MLE of the cdf of the random age at which one develops the Hepatitis A antibodies.

3. For the RFM mice example, we found the MLE of the cdf, and used this to calculate the means for the two groups (CE = conventional environment, GF = germ-free) using

       \mu_{CE} = \int_0^\infty (1 - \hat{F}_{CE}(x)) \, dx,
       \mu_{GF} = \int_0^\infty (1 - \hat{F}_{GF}(x)) \, dx.

See Figure 6 for the exact values.

   (a) We state that \mu_{CE} > 79.28. Why not report an equality?
   (b) What does the assumption that the MLE \hat{F}_n is piecewise constant, with all mass placed at the right-most point of each maximal intersection, imply about \hat{\mu}_n calculated as \int_0^\infty (1 - \hat{F}_n(x)) \, dx?
   (c) Is \hat{\mu}_n a consistent estimator of \int_0^\infty (1 - F(x)) \, dx?

4. In Figure 7 we showed an example of the nonparametric MLE for current status data. In the example, both X and T are exponential, with means 1 and 0.5, respectively. Recall that we say that X is stochastically greater than T (and write T \le_{st} X) if F_T(x) \ge F_X(x) for all x. Would this additional information have any impact on the nonparametric MLE?

4 Disclaimer

In this short course I have tried to give a flavour of the work done in shape-constrained nonparametric likelihood estimation. The notes are, as expected, far from complete. Hopefully, the references given will provide a good start for those interested in different aspects of the theory presented here. In particular, I recommend Groeneboom and Wellner (1992) for the more theoretical aspects of the results given in Section 3, and Maathuis (2007) for accessible lecture notes on interval censoring and the nonparametric MLE.

References

Balabdaoui, F. (2007). Consistent estimation of a convex density at the origin. Math. Methods Statist.

Balabdaoui, F., Jankowski, H., Pavlides, M., Seregin, A. and Wellner, J. (2009). On the Grenander estimator at zero. Tech. Rep. 554, University of Washington.

Balabdaoui, F. and Wellner, J. A. (2007). Estimation of a k-monotone density: limit distribution theory and the spline connection. Ann. Statist.

Carolan, C. and Dykstra, R. (1999). Asymptotic behavior of the Grenander estimator at density flat regions. Canad. J. Statist.

Chak, P. M., Madras, N. and Smith, B. (2005). Semi-nonparametric estimation with Bernstein polynomials. Economics Letters.

Chernoff, H. (1964). Estimation of the mode. Ann. Inst. Statist. Math.

Feller, W. (1971). An Introduction to Probability Theory and Its Applications, Vol. II. Second edition, John Wiley & Sons, New York.

Ferger, D. (2004). A continuous mapping theorem for the argmax-functional in the non-unique case. Statist. Neerlandica.

Gneiting, T. (1998). On the Bernstein-Hausdorff-Widder conditions for completely monotone functions. Exposition. Math.

Gneiting, T. (1999). Radial positive definite functions generated by Euclid's hat. J. Multivariate Anal.

Grenander, U. (1956). On the theory of mortality measurement. II. Skand. Aktuarietidskr. (1957).

Groeneboom, P. (1983). The concave majorant of Brownian motion. Ann. Probab.

Groeneboom, P. (1985). Estimating a monotone density. In Proceedings of the Berkeley Conference in Honor of Jerzy Neyman and Jack Kiefer, Vol. II (Berkeley, Calif., 1983). Wadsworth Statist./Probab. Ser., Wadsworth, Belmont, CA.

Groeneboom, P. (1986). Some current developments in density estimation. In Mathematics and Computer Science (Amsterdam, 1983), vol. 1 of CWI Monogr. North-Holland, Amsterdam.

Groeneboom, P. (1989). Brownian motion with a parabolic drift and Airy functions. Probab. Theory Related Fields.

Groeneboom, P., Jongbloed, G. and Wellner, J. A. (2001a). A canonical process for estimation of convex functions: the "invelope" of integrated Brownian motion + t^4. Ann. Statist.

Groeneboom, P., Jongbloed, G. and Wellner, J. A. (2001b). Estimation of a convex function: characterizations and asymptotic theory. Ann. Statist.

Groeneboom, P. and Pyke, R. (1983). Asymptotic normality of statistics based on the convex minorants of empirical distribution functions. Ann. Probab.

Groeneboom, P. and Wellner, J. A. (1992). Information Bounds and Nonparametric Maximum Likelihood Estimation, vol. 19 of DMV Seminar. Birkhäuser Verlag, Basel.

Groeneboom, P. and Wellner, J. A. (2001). Computing Chernoff's distribution. J. Comput. Graph. Statist.

Hoel, D. and Walberg, H. (1972). Statistical analysis of survival experiments. Journal of the National Cancer Institute.

Jankowski, H. and Wellner, J. (2009). Nonparametric estimation of a convex bathtub-shaped hazard function. Bernoulli. To appear (available online).

Jewell, N. P. (1982). Mixtures of exponential distributions. Ann. Statist.

Keiding, N. (1991). Age-specific incidence and prevalence: a statistical perspective. J. Roy. Statist. Soc. Ser. A. With discussion.

Kim, J. and Pollard, D. (1990). Cube root asymptotics. Ann. Statist.

Lévy, P. (1962). Extensions d'un théorème de D. Dugué et M. Girault. Z. Wahrscheinlichkeitstheorie Verw. Gebiete.

Maathuis, M. (2007). Survival analysis for interval censored data: lecture notes. Available online: maathuis/teaching/.

Marshall, A. W. and Proschan, F. (1965). Maximum likelihood estimation for distributions with monotone failure rate. Ann. Math. Statist.

Peto, R. (1973). Experimental survival curves for interval censored data. Applied Statistics.

Prakasa Rao, B. L. S. (1970). Estimation for distributions with monotone failure rate. Ann. Math. Statist.

Pyke, R. (1959). The supremum and infimum of the Poisson process. Ann. Math. Statist.

Robertson, T., Wright, F. T. and Dykstra, R. L. (1988). Order Restricted Statistical Inference. Wiley Series in Probability and Mathematical Statistics, John Wiley & Sons, Chichester.

Sun, J. (2006). The Statistical Analysis of Interval-Censored Failure Time Data. Statistics for Biology and Health, Springer, New York.

Turnbull, B. W. (1976). The empirical distribution function with arbitrarily grouped, censored and truncated data. J. Roy. Statist. Soc. Ser. B.


Nonparametric estimation of. s concave and log-concave densities: alternatives to maximum likelihood Nonparametric estimation of s concave and log-concave densities: alternatives to maximum likelihood Jon A. Wellner University of Washington, Seattle Cowles Foundation Seminar, Yale University November

More information

Introduction to Empirical Processes and Semiparametric Inference Lecture 02: Overview Continued

Introduction to Empirical Processes and Semiparametric Inference Lecture 02: Overview Continued Introduction to Empirical Processes and Semiparametric Inference Lecture 02: Overview Continued Michael R. Kosorok, Ph.D. Professor and Chair of Biostatistics Professor of Statistics and Operations Research

More information

Likelihood Based Inference for Monotone Response Models

Likelihood Based Inference for Monotone Response Models Likelihood Based Inference for Monotone Response Models Moulinath Banerjee University of Michigan September 11, 2006 Abstract The behavior of maximum likelihood estimates (MLEs) and the likelihood ratio

More information

Nonparametric estimation for current status data with competing risks

Nonparametric estimation for current status data with competing risks Nonparametric estimation for current status data with competing risks Marloes Henriëtte Maathuis A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy

More information

Hypothesis Testing. 1 Definitions of test statistics. CB: chapter 8; section 10.3

Hypothesis Testing. 1 Definitions of test statistics. CB: chapter 8; section 10.3 Hypothesis Testing CB: chapter 8; section 0.3 Hypothesis: statement about an unknown population parameter Examples: The average age of males in Sweden is 7. (statement about population mean) The lowest

More information

Nonparametric estimation: s concave and log-concave densities: alternatives to maximum likelihood

Nonparametric estimation: s concave and log-concave densities: alternatives to maximum likelihood Nonparametric estimation: s concave and log-concave densities: alternatives to maximum likelihood Jon A. Wellner University of Washington, Seattle Statistics Seminar, York October 15, 2015 Statistics Seminar,

More information

Legendre-Fenchel transforms in a nutshell

Legendre-Fenchel transforms in a nutshell 1 2 3 Legendre-Fenchel transforms in a nutshell Hugo Touchette School of Mathematical Sciences, Queen Mary, University of London, London E1 4NS, UK Started: July 11, 2005; last compiled: August 14, 2007

More information

DR.RUPNATHJI( DR.RUPAK NATH )

DR.RUPNATHJI( DR.RUPAK NATH ) Contents 1 Sets 1 2 The Real Numbers 9 3 Sequences 29 4 Series 59 5 Functions 81 6 Power Series 105 7 The elementary functions 111 Chapter 1 Sets It is very convenient to introduce some notation and terminology

More information

Les deux complices. Piet Groeneboom, Delft University. Juin 28, 2018

Les deux complices. Piet Groeneboom, Delft University. Juin 28, 2018 Les deux complices Piet Groeneboom, Delft University Juin 28, 2018 Monsieur Teste (Paul Valéry) Je me suis rarement perdu de vue Je me suis détesté je me suis adoré Puis, nous avons vieilli ensemble (La

More information

HYPOTHESIS TESTING: FREQUENTIST APPROACH.

HYPOTHESIS TESTING: FREQUENTIST APPROACH. HYPOTHESIS TESTING: FREQUENTIST APPROACH. These notes summarize the lectures on (the frequentist approach to) hypothesis testing. You should be familiar with the standard hypothesis testing from previous

More information

Optimization and Optimal Control in Banach Spaces

Optimization and Optimal Control in Banach Spaces Optimization and Optimal Control in Banach Spaces Bernhard Schmitzer October 19, 2017 1 Convex non-smooth optimization with proximal operators Remark 1.1 (Motivation). Convex optimization: easier to solve,

More information

2tdt 1 y = t2 + C y = which implies C = 1 and the solution is y = 1

2tdt 1 y = t2 + C y = which implies C = 1 and the solution is y = 1 Lectures - Week 11 General First Order ODEs & Numerical Methods for IVPs In general, nonlinear problems are much more difficult to solve than linear ones. Unfortunately many phenomena exhibit nonlinear

More information

Statistics 3858 : Maximum Likelihood Estimators

Statistics 3858 : Maximum Likelihood Estimators Statistics 3858 : Maximum Likelihood Estimators 1 Method of Maximum Likelihood In this method we construct the so called likelihood function, that is L(θ) = L(θ; X 1, X 2,..., X n ) = f n (X 1, X 2,...,

More information

Statistics: Learning models from data

Statistics: Learning models from data DS-GA 1002 Lecture notes 5 October 19, 2015 Statistics: Learning models from data Learning models from data that are assumed to be generated probabilistically from a certain unknown distribution is a crucial

More information

An exponential family of distributions is a parametric statistical model having densities with respect to some positive measure λ of the form.

An exponential family of distributions is a parametric statistical model having densities with respect to some positive measure λ of the form. Stat 8112 Lecture Notes Asymptotics of Exponential Families Charles J. Geyer January 23, 2013 1 Exponential Families An exponential family of distributions is a parametric statistical model having densities

More information

11 Survival Analysis and Empirical Likelihood

11 Survival Analysis and Empirical Likelihood 11 Survival Analysis and Empirical Likelihood The first paper of empirical likelihood is actually about confidence intervals with the Kaplan-Meier estimator (Thomas and Grunkmeier 1979), i.e. deals with

More information

Fourth Week: Lectures 10-12

Fourth Week: Lectures 10-12 Fourth Week: Lectures 10-12 Lecture 10 The fact that a power series p of positive radius of convergence defines a function inside its disc of convergence via substitution is something that we cannot ignore

More information

Asymptotics for posterior hazards

Asymptotics for posterior hazards Asymptotics for posterior hazards Igor Prünster University of Turin, Collegio Carlo Alberto and ICER Joint work with P. Di Biasi and G. Peccati Workshop on Limit Theorems and Applications Paris, 16th January

More information

The strictly 1/2-stable example

The strictly 1/2-stable example The strictly 1/2-stable example 1 Direct approach: building a Lévy pure jump process on R Bert Fristedt provided key mathematical facts for this example. A pure jump Lévy process X is a Lévy process such

More information

EC9A0: Pre-sessional Advanced Mathematics Course. Lecture Notes: Unconstrained Optimisation By Pablo F. Beker 1

EC9A0: Pre-sessional Advanced Mathematics Course. Lecture Notes: Unconstrained Optimisation By Pablo F. Beker 1 EC9A0: Pre-sessional Advanced Mathematics Course Lecture Notes: Unconstrained Optimisation By Pablo F. Beker 1 1 Infimum and Supremum Definition 1. Fix a set Y R. A number α R is an upper bound of Y if

More information

Lecture 1: Entropy, convexity, and matrix scaling CSE 599S: Entropy optimality, Winter 2016 Instructor: James R. Lee Last updated: January 24, 2016

Lecture 1: Entropy, convexity, and matrix scaling CSE 599S: Entropy optimality, Winter 2016 Instructor: James R. Lee Last updated: January 24, 2016 Lecture 1: Entropy, convexity, and matrix scaling CSE 599S: Entropy optimality, Winter 2016 Instructor: James R. Lee Last updated: January 24, 2016 1 Entropy Since this course is about entropy maximization,

More information

Constrained Optimization and Lagrangian Duality

Constrained Optimization and Lagrangian Duality CIS 520: Machine Learning Oct 02, 2017 Constrained Optimization and Lagrangian Duality Lecturer: Shivani Agarwal Disclaimer: These notes are designed to be a supplement to the lecture. They may or may

More information

STAT 512 sp 2018 Summary Sheet

STAT 512 sp 2018 Summary Sheet STAT 5 sp 08 Summary Sheet Karl B. Gregory Spring 08. Transformations of a random variable Let X be a rv with support X and let g be a function mapping X to Y with inverse mapping g (A = {x X : g(x A}

More information

MATH 6605: SUMMARY LECTURE NOTES

MATH 6605: SUMMARY LECTURE NOTES MATH 6605: SUMMARY LECTURE NOTES These notes summarize the lectures on weak convergence of stochastic processes. If you see any typos, please let me know. 1. Construction of Stochastic rocesses A stochastic

More information

NONPARAMETRIC CONFIDENCE INTERVALS FOR MONOTONE FUNCTIONS. By Piet Groeneboom and Geurt Jongbloed Delft University of Technology

NONPARAMETRIC CONFIDENCE INTERVALS FOR MONOTONE FUNCTIONS. By Piet Groeneboom and Geurt Jongbloed Delft University of Technology NONPARAMETRIC CONFIDENCE INTERVALS FOR MONOTONE FUNCTIONS By Piet Groeneboom and Geurt Jongbloed Delft University of Technology We study nonparametric isotonic confidence intervals for monotone functions.

More information

Independence of some multiple Poisson stochastic integrals with variable-sign kernels

Independence of some multiple Poisson stochastic integrals with variable-sign kernels Independence of some multiple Poisson stochastic integrals with variable-sign kernels Nicolas Privault Division of Mathematical Sciences School of Physical and Mathematical Sciences Nanyang Technological

More information

EMPIRICAL ENVELOPE MLE AND LR TESTS. Mai Zhou University of Kentucky

EMPIRICAL ENVELOPE MLE AND LR TESTS. Mai Zhou University of Kentucky EMPIRICAL ENVELOPE MLE AND LR TESTS Mai Zhou University of Kentucky Summary We study in this paper some nonparametric inference problems where the nonparametric maximum likelihood estimator (NPMLE) are

More information

Statistical Methods for Handling Incomplete Data Chapter 2: Likelihood-based approach

Statistical Methods for Handling Incomplete Data Chapter 2: Likelihood-based approach Statistical Methods for Handling Incomplete Data Chapter 2: Likelihood-based approach Jae-Kwang Kim Department of Statistics, Iowa State University Outline 1 Introduction 2 Observed likelihood 3 Mean Score

More information

MATH 117 LECTURE NOTES

MATH 117 LECTURE NOTES MATH 117 LECTURE NOTES XIN ZHOU Abstract. This is the set of lecture notes for Math 117 during Fall quarter of 2017 at UC Santa Barbara. The lectures follow closely the textbook [1]. Contents 1. The set

More information

Generalized continuous isotonic regression

Generalized continuous isotonic regression Generalized continuous isotonic regression Piet Groeneboom, Geurt Jongbloed To cite this version: Piet Groeneboom, Geurt Jongbloed. Generalized continuous isotonic regression. Statistics and Probability

More information

STAT 830 Hypothesis Testing

STAT 830 Hypothesis Testing STAT 830 Hypothesis Testing Hypothesis testing is a statistical problem where you must choose, on the basis of data X, between two alternatives. We formalize this as the problem of choosing between two

More information

STAT Sample Problem: General Asymptotic Results

STAT Sample Problem: General Asymptotic Results STAT331 1-Sample Problem: General Asymptotic Results In this unit we will consider the 1-sample problem and prove the consistency and asymptotic normality of the Nelson-Aalen estimator of the cumulative

More information

MATH Max-min Theory Fall 2016

MATH Max-min Theory Fall 2016 MATH 20550 Max-min Theory Fall 2016 1. Definitions and main theorems Max-min theory starts with a function f of a vector variable x and a subset D of the domain of f. So far when we have worked with functions

More information

7 Influence Functions

7 Influence Functions 7 Influence Functions The influence function is used to approximate the standard error of a plug-in estimator. The formal definition is as follows. 7.1 Definition. The Gâteaux derivative of T at F in the

More information

A Probabilistic Upper Bound on Differential Entropy

A Probabilistic Upper Bound on Differential Entropy A Probabilistic Upper Bound on Differential Entropy Joseph DeStefano, Qifeng Lu and Erik Learned-Miller Abstract The differential entrops a quantity employed ubiquitousln communications, statistical learning,

More information

arxiv: v3 [math.st] 29 Nov 2018

arxiv: v3 [math.st] 29 Nov 2018 A unified study of nonparametric inference for monotone functions arxiv:186.1928v3 math.st] 29 Nov 218 Ted Westling Center for Causal Inference University of Pennsylvania tgwest@pennmedicine.upenn.edu

More information

P n. This is called the law of large numbers but it comes in two forms: Strong and Weak.

P n. This is called the law of large numbers but it comes in two forms: Strong and Weak. Large Sample Theory Large Sample Theory is a name given to the search for approximations to the behaviour of statistical procedures which are derived by computing limits as the sample size, n, tends to

More information

STAT 830 Hypothesis Testing

STAT 830 Hypothesis Testing STAT 830 Hypothesis Testing Richard Lockhart Simon Fraser University STAT 830 Fall 2018 Richard Lockhart (Simon Fraser University) STAT 830 Hypothesis Testing STAT 830 Fall 2018 1 / 30 Purposes of These

More information

Empirical likelihood ratio with arbitrarily censored/truncated data by EM algorithm

Empirical likelihood ratio with arbitrarily censored/truncated data by EM algorithm Empirical likelihood ratio with arbitrarily censored/truncated data by EM algorithm Mai Zhou 1 University of Kentucky, Lexington, KY 40506 USA Summary. Empirical likelihood ratio method (Thomas and Grunkmier

More information

Product measure and Fubini s theorem

Product measure and Fubini s theorem Chapter 7 Product measure and Fubini s theorem This is based on [Billingsley, Section 18]. 1. Product spaces Suppose (Ω 1, F 1 ) and (Ω 2, F 2 ) are two probability spaces. In a product space Ω = Ω 1 Ω

More information

41903: Introduction to Nonparametrics

41903: Introduction to Nonparametrics 41903: Notes 5 Introduction Nonparametrics fundamentally about fitting flexible models: want model that is flexible enough to accommodate important patterns but not so flexible it overspecializes to specific

More information

LECTURE-13 : GENERALIZED CAUCHY S THEOREM

LECTURE-13 : GENERALIZED CAUCHY S THEOREM LECTURE-3 : GENERALIZED CAUCHY S THEOREM VED V. DATAR The aim of this lecture to prove a general form of Cauchy s theorem applicable to multiply connected domains. We end with computations of some real

More information

Lecture 4 September 15

Lecture 4 September 15 IFT 6269: Probabilistic Graphical Models Fall 2017 Lecture 4 September 15 Lecturer: Simon Lacoste-Julien Scribe: Philippe Brouillard & Tristan Deleu 4.1 Maximum Likelihood principle Given a parametric

More information

with Current Status Data

with Current Status Data Estimation and Testing with Current Status Data Jon A. Wellner University of Washington Estimation and Testing p. 1/4 joint work with Moulinath Banerjee, University of Michigan Talk at Université Paul

More information

The properties of L p -GMM estimators

The properties of L p -GMM estimators The properties of L p -GMM estimators Robert de Jong and Chirok Han Michigan State University February 2000 Abstract This paper considers Generalized Method of Moment-type estimators for which a criterion

More information

ISOTONIC MAXIMUM LIKELIHOOD ESTIMATION FOR THE CHANGE POINT OF A HAZARD RATE

ISOTONIC MAXIMUM LIKELIHOOD ESTIMATION FOR THE CHANGE POINT OF A HAZARD RATE Sankhyā : The Indian Journal of Statistics 1997, Volume 59, Series A, Pt. 3, pp. 392-407 ISOTONIC MAXIMUM LIKELIHOOD ESTIMATION FOR THE CHANGE POINT OF A HAZARD RATE By S.N. JOSHI Indian Statistical Institute,

More information

Chapter 3a Topics in differentiation. Problems in differentiation. Problems in differentiation. LC Abueg: mathematical economics

Chapter 3a Topics in differentiation. Problems in differentiation. Problems in differentiation. LC Abueg: mathematical economics Chapter 3a Topics in differentiation Lectures in Mathematical Economics L Cagandahan Abueg De La Salle University School of Economics Problems in differentiation Problems in differentiation Problem 1.

More information

Statistical inference on Lévy processes

Statistical inference on Lévy processes Alberto Coca Cabrero University of Cambridge - CCA Supervisors: Dr. Richard Nickl and Professor L.C.G.Rogers Funded by Fundación Mutua Madrileña and EPSRC MASDOC/CCA student workshop 2013 26th March Outline

More information

Empirical Processes: General Weak Convergence Theory

Empirical Processes: General Weak Convergence Theory Empirical Processes: General Weak Convergence Theory Moulinath Banerjee May 18, 2010 1 Extended Weak Convergence The lack of measurability of the empirical process with respect to the sigma-field generated

More information

A note on the asymptotic distribution of Berk-Jones type statistics under the null hypothesis

A note on the asymptotic distribution of Berk-Jones type statistics under the null hypothesis A note on the asymptotic distribution of Berk-Jones type statistics under the null hypothesis Jon A. Wellner and Vladimir Koltchinskii Abstract. Proofs are given of the limiting null distributions of the

More information

Peter Hoff Minimax estimation October 31, Motivation and definition. 2 Least favorable prior 3. 3 Least favorable prior sequence 11

Peter Hoff Minimax estimation October 31, Motivation and definition. 2 Least favorable prior 3. 3 Least favorable prior sequence 11 Contents 1 Motivation and definition 1 2 Least favorable prior 3 3 Least favorable prior sequence 11 4 Nonparametric problems 15 5 Minimax and admissibility 18 6 Superefficiency and sparsity 19 Most of

More information

and Comparison with NPMLE

and Comparison with NPMLE NONPARAMETRIC BAYES ESTIMATOR OF SURVIVAL FUNCTIONS FOR DOUBLY/INTERVAL CENSORED DATA and Comparison with NPMLE Mai Zhou Department of Statistics, University of Kentucky, Lexington, KY 40506 USA http://ms.uky.edu/

More information

(Y; I[X Y ]), where I[A] is the indicator function of the set A. Examples of the current status data are mentioned in Ayer et al. (1955), Keiding (199

(Y; I[X Y ]), where I[A] is the indicator function of the set A. Examples of the current status data are mentioned in Ayer et al. (1955), Keiding (199 CONSISTENCY OF THE GMLE WITH MIXED CASE INTERVAL-CENSORED DATA By Anton Schick and Qiqing Yu Binghamton University April 1997. Revised December 1997, Revised July 1998 Abstract. In this paper we consider

More information

B553 Lecture 1: Calculus Review

B553 Lecture 1: Calculus Review B553 Lecture 1: Calculus Review Kris Hauser January 10, 2012 This course requires a familiarity with basic calculus, some multivariate calculus, linear algebra, and some basic notions of metric topology.

More information

University of California, Berkeley

University of California, Berkeley University of California, Berkeley U.C. Berkeley Division of Biostatistics Working Paper Series Year 24 Paper 153 A Note on Empirical Likelihood Inference of Residual Life Regression Ying Qing Chen Yichuan

More information

Lecture 16: Sample quantiles and their asymptotic properties

Lecture 16: Sample quantiles and their asymptotic properties Lecture 16: Sample quantiles and their asymptotic properties Estimation of quantiles (percentiles Suppose that X 1,...,X n are i.i.d. random variables from an unknown nonparametric F For p (0,1, G 1 (p

More information

Fall 2017 STAT 532 Homework Peter Hoff. 1. Let P be a probability measure on a collection of sets A.

Fall 2017 STAT 532 Homework Peter Hoff. 1. Let P be a probability measure on a collection of sets A. 1. Let P be a probability measure on a collection of sets A. (a) For each n N, let H n be a set in A such that H n H n+1. Show that P (H n ) monotonically converges to P ( k=1 H k) as n. (b) For each n

More information

Invariant HPD credible sets and MAP estimators

Invariant HPD credible sets and MAP estimators Bayesian Analysis (007), Number 4, pp. 681 69 Invariant HPD credible sets and MAP estimators Pierre Druilhet and Jean-Michel Marin Abstract. MAP estimators and HPD credible sets are often criticized in

More information

Legendre-Fenchel transforms in a nutshell

Legendre-Fenchel transforms in a nutshell 1 2 3 Legendre-Fenchel transforms in a nutshell Hugo Touchette School of Mathematical Sciences, Queen Mary, University of London, London E1 4NS, UK Started: July 11, 2005; last compiled: October 16, 2014

More information

Estimation of a discrete monotone distribution

Estimation of a discrete monotone distribution Electronic Journal of Statistics Vol. 3 (009) 1567 1605 ISSN: 1935-754 DOI: 10.114/09-EJS56 Estimation of a discrete monotone distribution Hanna K. Jankowski Department of Mathematics and Statistics York

More information

Lecture 2 Machine Learning Review

Lecture 2 Machine Learning Review Lecture 2 Machine Learning Review CMSC 35246: Deep Learning Shubhendu Trivedi & Risi Kondor University of Chicago March 29, 2017 Things we will look at today Formal Setup for Supervised Learning Things

More information

Lecture 21. Hypothesis Testing II

Lecture 21. Hypothesis Testing II Lecture 21. Hypothesis Testing II December 7, 2011 In the previous lecture, we dened a few key concepts of hypothesis testing and introduced the framework for parametric hypothesis testing. In the parametric

More information

STAT 7032 Probability Spring Wlodek Bryc

STAT 7032 Probability Spring Wlodek Bryc STAT 7032 Probability Spring 2018 Wlodek Bryc Created: Friday, Jan 2, 2014 Revised for Spring 2018 Printed: January 9, 2018 File: Grad-Prob-2018.TEX Department of Mathematical Sciences, University of Cincinnati,

More information

ST745: Survival Analysis: Nonparametric methods

ST745: Survival Analysis: Nonparametric methods ST745: Survival Analysis: Nonparametric methods Eric B. Laber Department of Statistics, North Carolina State University February 5, 2015 The KM estimator is used ubiquitously in medical studies to estimate

More information

We are going to discuss what it means for a sequence to converge in three stages: First, we define what it means for a sequence to converge to zero

We are going to discuss what it means for a sequence to converge in three stages: First, we define what it means for a sequence to converge to zero Chapter Limits of Sequences Calculus Student: lim s n = 0 means the s n are getting closer and closer to zero but never gets there. Instructor: ARGHHHHH! Exercise. Think of a better response for the instructor.

More information

Goodness-of-Fit Tests for Time Series Models: A Score-Marked Empirical Process Approach

Goodness-of-Fit Tests for Time Series Models: A Score-Marked Empirical Process Approach Goodness-of-Fit Tests for Time Series Models: A Score-Marked Empirical Process Approach By Shiqing Ling Department of Mathematics Hong Kong University of Science and Technology Let {y t : t = 0, ±1, ±2,

More information

Metric spaces and metrizability

Metric spaces and metrizability 1 Motivation Metric spaces and metrizability By this point in the course, this section should not need much in the way of motivation. From the very beginning, we have talked about R n usual and how relatively

More information

Optimality, Duality, Complementarity for Constrained Optimization

Optimality, Duality, Complementarity for Constrained Optimization Optimality, Duality, Complementarity for Constrained Optimization Stephen Wright University of Wisconsin-Madison May 2014 Wright (UW-Madison) Optimality, Duality, Complementarity May 2014 1 / 41 Linear

More information

Monte Carlo Studies. The response in a Monte Carlo study is a random variable.

Monte Carlo Studies. The response in a Monte Carlo study is a random variable. Monte Carlo Studies The response in a Monte Carlo study is a random variable. The response in a Monte Carlo study has a variance that comes from the variance of the stochastic elements in the data-generating

More information

POISSON PROCESSES 1. THE LAW OF SMALL NUMBERS

POISSON PROCESSES 1. THE LAW OF SMALL NUMBERS POISSON PROCESSES 1. THE LAW OF SMALL NUMBERS 1.1. The Rutherford-Chadwick-Ellis Experiment. About 90 years ago Ernest Rutherford and his collaborators at the Cavendish Laboratory in Cambridge conducted

More information

Notes on uniform convergence

Notes on uniform convergence Notes on uniform convergence Erik Wahlén erik.wahlen@math.lu.se January 17, 2012 1 Numerical sequences We begin by recalling some properties of numerical sequences. By a numerical sequence we simply mean

More information