Nonparametric estimation for current status data with competing risks


Nonparametric estimation for current status data with competing risks

Marloes Henriëtte Maathuis

A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy

University of Washington
2006

Program Authorized to Offer Degree: Statistics

University of Washington Graduate School

This is to certify that I have examined this copy of a doctoral dissertation by Marloes Henriëtte Maathuis and have found that it is complete and satisfactory in all respects, and that any and all revisions required by the final examining committee have been made.

Co-Chairs of the Supervisory Committee:
Piet Groeneboom
Jon A. Wellner

Reading Committee:
Piet Groeneboom
Michael G. Hudgens
Jon A. Wellner

Date:

In presenting this dissertation in partial fulfillment of the requirements for the doctoral degree at the University of Washington, I agree that the Library shall make its copies freely available for inspection. I further agree that extensive copying of this dissertation is allowable only for scholarly purposes, consistent with fair use as prescribed in the U.S. Copyright Law. Requests for copying or reproduction of this dissertation may be referred to Proquest Information and Learning, 300 North Zeeb Road, Ann Arbor, MI 48106-1346, 1-800-521-0600, to whom the author has granted the right to reproduce and sell (a) copies of the manuscript in microform and/or (b) printed copies of the manuscript made from microform. Signature Date

University of Washington

Abstract

Nonparametric estimation for current status data with competing risks

Marloes Henriëtte Maathuis

Co-Chairs of the Supervisory Committee:
Professor Piet Groeneboom, Statistics
Professor Jon A. Wellner, Statistics

We study current status data with competing risks. Such data arise naturally in cross-sectional survival studies with several failure causes. Moreover, generalizations of these data arise in HIV vaccine clinical trials. The general framework is as follows. We analyze a system that can fail from K competing risks, where K ∈ ℕ is fixed. The random variables of interest are (X, Y), where X ∈ ℝ+ = (0, ∞) is the failure time of the system, and Y ∈ {1, ..., K} is the corresponding failure cause. However, we cannot observe (X, Y) directly. Rather, we observe the current status of the system at a single random observation time T ∈ ℝ+, where T is independent of (X, Y). This means that at time T, we observe whether or not failure occurred, and if and only if failure occurred, we also observe the failure cause Y. We study nonparametric estimation of the sub-distribution functions F_0k(t) = P(X ≤ t, Y = k), k = 1, ..., K, t ∈ ℝ+. We focus on two estimators: the nonparametric maximum likelihood estimator (MLE) and the naive estimator introduced by Jewell, Van der Laan and Henneman (2003). Our main interest is in asymptotic properties of the MLE, and the naive estimator is considered for comparison.

Until now, the asymptotic properties of the MLE have been largely unknown. We resolve this issue by proving its consistency, n^{1/3} rate of convergence, and limiting distribution. The limiting distribution involves a new self-induced limiting process, consisting of the convex minorants of K correlated two-sided Brownian motion processes plus parabolic drifts, plus an additional term involving the difference between the sum of the K drifting Brownian motions and their convex minorants. Various other aspects that we consider include characterizations of the estimators, uniqueness, graph theory, and computational algorithms. Furthermore, we show that both the MLE and the naive estimator are asymptotically efficient for a family of smooth functionals, with √n rate of convergence to a normal limit. Finally, we study an extension of the model, where X is subject to interval censoring and Y is a continuous random variable. We show that the MLE is typically inconsistent in this model, and propose a simple method to repair this inconsistency.

TABLE OF CONTENTS

List of Figures ................................... iii
List of Tables .................................... v

Chapter 1: Introduction ............................ 1
 1.1 Motivation and problem description ................... 1
 1.2 Overview of previous work ........................ 3
 1.3 Overview of new results and outline of this thesis ........... 4
Chapter 2: The estimators ........................... 7
 2.1 Definition of the estimators ....................... 7
 2.2 Censored data perspective ........................ 13
 2.3 Graph theory and uniqueness ...................... 23
 2.4 Characterizations ............................. 37
Chapter 3: Computation ............................ 62
 3.1 Reduction and optimization ....................... 63
 3.2 Iterative convex minorant algorithms .................. 66
Chapter 4: Consistency ............................. 71
 4.1 Hellinger consistency ........................... 71
 4.2 Local and uniform consistency ...................... 77
Chapter 5: Rate of convergence ........................ 85
 5.1 Hellinger rate of convergence ...................... 85
 5.2 Asymptotic local minimax lower bound ................. 90
 5.3 Local rate of convergence ........................ 96
 5.4 Technical lemmas and proofs ....................... 118

Chapter 6: Limiting distribution ...................... 132
 6.1 The limiting distribution of the naive estimator ............ 133
 6.2 The limiting distribution of the MLE .................. 146
 6.3 Technical lemmas and proofs ....................... 177
Chapter 7: A family of smooth functionals .................. 186
 7.1 Information bound calculations ..................... 187
 7.2 Asymptotic normality of functionals of the MLE ............ 194
Chapter 8: Examples .............................. 210
 8.1 Menopause data .............................. 210
 8.2 Simulations ................................ 213
Chapter 9: An extension: interval censored continuous mark data ..... 229
 9.1 The model and an explicit formula for the MLE ............ 230
 9.2 Inconsistency of the MLE ........................ 236
 9.3 Repaired MLE via discretization of marks ............... 246
 9.4 Examples ................................. 248

LIST OF FIGURES

Figure Number                                                        Page
2.1 The estimators: Graphical representation of the observed data ..... 9
2.2 Graph theory: Intersection graph for the MLE ............. 30
2.3 Convex minorant characterizations: Plots for the data in Table 2.5 .. 59
5.1 Asymptotic local minimax lower bound: The perturbation F_nk ..... 91
5.2 Local rate: Plot of v_n(t) for various values of β ............. 100
5.3 Local rate: Example clarifying the proof of Lemma 5.16 ........ 128
6.1 Limiting distribution: Processes for the naive estimator at t_0 = 1 ... 136
6.2 Limiting distribution: Processes for the naive estimator at t_0 = 2 ... 137
6.3 Limiting distribution: Processes for the MLE at t_0 = 1 ........ 153
6.4 Limiting distribution: Processes for the MLE at t_0 = 2 ........ 154
6.5 Limiting distribution: Comparison of limiting processes at t_0 = 1 ... 155
6.6 Limiting distribution: Comparison of limiting processes at t_0 = 2 ... 156
8.1 Menopause data: Question of the Health Examination Study ...... 211
8.2 Menopause data: The MLE and the naive estimator .......... 212
8.3 Simulations: The true underlying sub-distribution functions ...... 218
8.4 Simulations: The estimators in a single simulation ........... 219
8.5 Simulations: Pointwise bias ........................ 220
8.6 Simulations: Pointwise variance ...................... 221
8.7 Simulations: Pointwise mean squared error ................ 222
8.8 Simulations: Pointwise relative efficiency ................. 223
8.9 Simulations: Smooth functionals of the MLE for t_0 = 2 ......... 225
8.10 Simulations: Smooth functionals of the naive estimator for t_0 = 2 ... 226
8.11 Simulations: Smooth functionals of the MLE for t_0 = 10 ........ 227
8.12 Simulations: Smooth functionals of the naive estimator for t_0 = 10 .. 228
9.1 Continuous mark data: Contour lines for estimates of F_0(x, y) ..... 254

9.2 Continuous mark data: Estimates of F_0X(x) ............... 255
9.3 Continuous mark data: Estimates of F_0(x_0, y) .............. 256

LIST OF TABLES

Table Number                                                         Page
2.1 Censored data perspective: Example data ................ 22
2.2 Censored data perspective: Estimators for the data in Table 2.1 .... 23
2.3 Graph theory: Example data ....................... 29
2.4 Graph theory: Clique matrix for the data in Table 2.3 ......... 35
2.5 Convex minorant characterizations: Example data ........... 58
8.1 Simulations: Pointwise bias, variance and MSE at t = 10 ........ 224
9.1 Continuous mark data: Summary of the examples ............ 249

ACKNOWLEDGMENTS

I sincerely thank my advisors, Piet Groeneboom and Jon Wellner, for their mentorship over the past years. Their knowledge, guidance, inspiration and encouragement have been very important to me. I thank Peter Gilbert, Tilmann Gneiting, Peter Hoff and Michael Hudgens for serving on my committee, with special thanks to Michael for suggesting this research problem. I thank Bernard Deconinck for serving as the graduate school representative. I am grateful to the faculty, staff and students in our department for providing a stimulating and supportive research environment. In particular, I thank Fadoua Balabdaoui, Moulinath Banerjee and Hanna Jankowski for helpful discussions. Finally, I want to express my deep gratitude to Steven, my parents, my family and my friends, for their continuous support.

Chapter 1

INTRODUCTION

1.1 Motivation and problem description

The work in this thesis is motivated by recent clinical trials of candidate vaccines against HIV/AIDS. The main purpose of such trials is to determine the overall efficacy of a candidate vaccine. Like many viruses, HIV exhibits significant genotypic and phenotypic variation, so that it can be classified into several subtypes. Therefore, it is also of interest to determine the efficacy of a vaccine against each subtype of the virus. Establishing vaccine efficacy for certain subtypes can warrant vaccination of populations in which the given subtypes are highly prevalent. Furthermore, establishing that the vaccine is efficacious for some subtypes, but not for others, gives important information for possible improvements of the vaccine. Thus, the variables of interest are the time of infection and the subtype of the infecting virus. These variables cannot be observed directly, because participants of a trial are only tested for the virus at several follow-up times. Since each test indicates whether or not infection happened before the time of the test, the time of infection is interval censored, i.e., only known to lie within a time interval determined by the follow-up times. Since simultaneous infections with several subtypes of a virus are rare, the subtypes are often analyzed as competing risks (see, e.g., Hudgens, Satten and Longini (2001)). Hence, these trials yield interval censored survival data with competing risks.

In this thesis, we analyze current status data with competing risks. Current status censoring is the simplest form of interval censoring, where there is exactly one

observation time for each subject. We study these data for two reasons. First, such data arise naturally in cross-sectional studies with several failure causes. Second, understanding current status data with competing risks is a first step towards understanding the more complicated interval censored data with competing risks that arise in vaccine clinical trials.

We consider the following general framework. We analyze a system that can fail from K competing risks, where K ∈ ℕ is fixed. The random variables of interest are (X, Y), where X ∈ ℝ+ = (0, ∞) is the failure time of the system, and Y ∈ {1, ..., K} is the corresponding failure cause. Due to censoring, we cannot observe (X, Y) directly. Rather, we observe the current status of the system at a single random observation time T ∈ ℝ+, where T is independent of (X, Y). Thus, at time T we observe whether or not failure occurred, and if and only if failure occurred, we also observe the failure cause Y.

Examples that fit into this framework can be found in reliability and survival analysis. For an example, see the menopause data analyzed by Krailo and Pike (1983), where X is the age at menopause, Y is the cause of menopause (natural or operative), and T is the age at the time of the survey. In cross-sectional HIV studies we think of X as the time of HIV infection, Y as the subtype of the infecting HIV virus, and T as the time of the HIV test. Note that one is free to choose the origin of the time scale. Common choices include the date of birth and the beginning of the study.

Given current status data with competing risks, we consider nonparametric estimation of the sub-distribution functions F_0k(t) = P(X ≤ t, Y = k), k = 1, ..., K. This problem, or close variants thereof, has been studied by Hudgens, Satten and Longini (2001), Jewell, Van der Laan and Henneman (2003), and Jewell and Kalbfleisch (2004). However, there are still many open problems.
In particular, until now, the asymptotic properties of the nonparametric maximum likelihood estimator (MLE) have been largely unknown. In this thesis, we resolve this problem. We prove consistency, the rate of convergence and the limiting distribution of the MLE. These asymptotic results form an important step towards making inference about the sub-distribution functions.

The outline of the remainder of this chapter is as follows. In Section 1.2 we give an overview of previous work in this area. In Section 1.3 we give an outline of this thesis, together with a discussion of our main results.

1.2 Overview of previous work

Hudgens, Satten and Longini (2001) study competing risks data subject to interval censoring and truncation. They derive the nonparametric maximum likelihood estimator (MLE) and provide an EM algorithm for its computation. They also introduce an alternative pseudo-likelihood estimator. They apply their methods to data from a cohort of injecting drug users in Thailand, where the event of interest is infection with HIV-1, and the competing risks are HIV-1 subtypes B and E.

Jewell, Van der Laan and Henneman (2003) study current status data with competing risks. They consider some simple parametric models, some ad-hoc nonparametric estimators, and the MLE. They compare these estimators in a simulation study. Furthermore, they apply their methods to data analyzed by Krailo and Pike (1983), where the event of interest is menopause and the competing risks are natural and operative menopause. Finally, the authors discuss results suggesting that the simple ad-hoc estimators might yield fully efficient estimators for smooth functionals of the sub-distribution functions.

Jewell and Kalbfleisch (2004) study maximum likelihood estimation of a series of ordered multinomial parameters. Current status data with competing risks can be viewed as a special case of this setting. The authors focus on the computation of the MLE, and introduce an iterative version of the Pool Adjacent Violators Algorithm.
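The Pool Adjacent Violators idea mentioned above can be sketched in a few lines. The following is the generic single-sequence PAVA (isotonic least-squares fit), shown only for orientation; it is not the iterative variant of Jewell and Kalbfleisch (2004), and the function names are illustrative.

```python
def pava(values, weights=None):
    """Pool Adjacent Violators: weighted isotonic (nondecreasing) fit.

    Scans left to right, merging adjacent blocks whenever their means
    violate monotonicity, then expands the block means back out.
    """
    n = len(values)
    w = list(weights) if weights else [1.0] * n
    means, wts, counts = [], [], []           # current blocks
    for v, wi in zip(values, w):
        means.append(v); wts.append(wi); counts.append(1)
        # pool while the last two block means violate monotonicity
        while len(means) > 1 and means[-2] >= means[-1]:
            m2, w2, c2 = means.pop(), wts.pop(), counts.pop()
            m1, w1, c1 = means.pop(), wts.pop(), counts.pop()
            wt = w1 + w2
            means.append((w1 * m1 + w2 * m2) / wt)
            wts.append(wt); counts.append(c1 + c2)
    fit = []
    for m, c in zip(means, counts):
        fit.extend([m] * c)
    return fit

# For univariate current status data, the MLE of a distribution function
# at the ordered observation times is exactly the isotonic fit of the
# failure indicators:  pava([d for _, d in sorted(zip(ts, deltas))]).
```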

1.3 Overview of new results and outline of this thesis

We focus on the following two nonparametric estimators for the sub-distribution functions: the MLE F̂_n = (F̂_n1, ..., F̂_nK), and the naive estimator F̃_n = (F̃_n1, ..., F̃_nK) introduced by Jewell, Van der Laan and Henneman (2003). (The subscript n denotes the sample size.) Our main interest is in asymptotic properties of the MLE, and the naive estimator is considered for comparison.

In Chapter 2 we define the estimators, and discuss the relationship between them. We show that both the MLE and the naive estimator can be viewed as maximum likelihood estimators for censored data. This observation is useful, because it allows us to use readily available theory and computational algorithms. In particular, the naive estimator can be viewed as the maximum likelihood estimator for reduced univariate current status data. Hence, many properties of the naive estimator follow straightforwardly from known results on current status data. The censored data perspective also allows us to use graph theory to study uniqueness properties of the estimators. Finally, we characterize the estimators in terms of necessary and sufficient conditions, in the form of Fenchel characterizations and (self-induced) convex minorant characterizations. These characterizations play a key role in the development of the asymptotic theory, and also lead to computational algorithms.

Computational aspects of the MLE are discussed in Chapter 3. Since there are no explicit formulas available for the MLE, we compute the MLE with an iterative algorithm. We discuss two classes of algorithms and the connections between them. The first class is based on sequential quadratic programming, where each quadratic programming problem is solved using a support reduction algorithm. The second class consists of iterative convex minorant algorithms. We prove convergence of algorithms in both classes. Furthermore, we show that one particular iterative convex minorant algorithm can be viewed as a sequential quadratic programming method that only uses the diagonal elements of the Hessian matrix.

In Chapter 4 we discuss consistency of the estimators. We prove that both estimators are Hellinger consistent, and we use this to derive various forms of local and uniform consistency.

The rate of convergence is discussed in Chapter 5. The Hellinger rate of convergence and the local rate of convergence of the naive estimator are n^{1/3}. This follows from known results on current status data without competing risks. For the MLE, we prove that the Hellinger rate of convergence is n^{1/3}. Next, we derive a local asymptotic minimax lower bound of n^{1/3}, meaning that no estimator can have a better local rate of convergence than n^{1/3}, in a minimax sense. We proceed by proving that the local rate of convergence of the MLE is n^{1/3}. This result comes as no surprise given the local asymptotic minimax lower bound and the local rate of convergence of the naive estimator. However, the proof of this result turned out to be rather involved, and required new methods. The key idea is to first establish a rate result for ∑_{k=1}^K F̂_nk that holds uniformly on a fixed neighborhood around a point t_0, instead of on the usual shrinking neighborhood of order O(n^{-1/3}).

In Chapter 6 we discuss the limiting distribution of the estimators. The limiting distribution of the naive estimator is given by the slopes of the convex minorants of K correlated two-sided Brownian motion processes plus parabolic drifts. The limiting distribution of the MLE involves a new self-induced limiting process, consisting of the convex minorants of K correlated two-sided Brownian motion processes plus parabolic drifts, plus an additional term involving the difference between the sum of the K drifting Brownian motion processes and their convex minorants.

In Chapter 7 we consider estimation of smooth functionals. Jewell, Van der Laan and Henneman (2003) suggested that the naive estimator yields asymptotically efficient smooth functionals.
We show that this is indeed the case, and that the same holds for the MLE.

In Chapter 8 we apply our methods to real and simulated data. We compare the MLE and the naive estimator in a simulation study, considering both pointwise estimation and the estimation of smooth functionals. For pointwise estimation, we show that the MLE is superior to the naive estimator in terms of mean squared error, both for small and large sample sizes. For the estimation of smooth functionals, we show that the behavior of the MLE and the naive estimator is similar, and in agreement with the results in Chapter 7.

Finally, in Chapter 9 we consider an extension of the model, where X is subject to interval censoring case k, and Y is a continuous random variable. This model is referred to as the interval censored continuous mark model. It is applicable to HIV vaccine clinical trials by letting X be the time of HIV infection, and Y be the viral distance between the infecting HIV virus and the virus present in the vaccine. We derive the limit of the MLE in this model, and show that the MLE is inconsistent in general. We also suggest a simple method for repairing the MLE by discretizing Y, an operation that transforms the data into interval censored data with competing risks. We illustrate the behavior of the MLE and the repaired MLE in four examples.
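The repair step just described, discretizing the continuous mark Y into K categories so that the data become interval censored competing risks data, amounts to a simple binning operation. A minimal sketch follows; the helper name `discretize_mark` and the cut points are illustrative assumptions, not code or values from the thesis.

```python
import bisect

def discretize_mark(y, cuts):
    """Map a continuous mark y to a category in {1, ..., len(cuts) + 1}.

    `cuts` is a sorted list of cut points: y <= cuts[0] maps to 1,
    cuts[0] < y <= cuts[1] maps to 2, and so on.  Illustrative only.
    """
    return bisect.bisect_left(cuts, y) + 1

# e.g. with cut points [0.1, 0.3], marks fall into K = 3 categories,
# turning (X, Y) with continuous Y into a K-competing-risks problem
```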

Chapter 2

THE ESTIMATORS

In this chapter we study finite sample properties of the MLE and the naive estimator. In Section 2.1 we formally define the model and the estimators. Since both estimators can be viewed as maximum likelihood estimators for censored data, Section 2.2 provides a general discussion of the MLE for censored data. In Section 2.3 we use a graph theoretic perspective to derive properties of the estimators. Finally, in Section 2.4, we characterize the estimators in terms of necessary and sufficient Fenchel and convex minorant conditions.

2.1 Definition of the estimators

Before we define the MLE and the naive estimator, we introduce some assumptions and notation. Recall that K ∈ ℕ denotes the number of competing risks. The variables of interest are (X, Y), where X ∈ ℝ+ is the failure time of a system, and Y ∈ {1, ..., K} is the corresponding failure cause. We do not observe (X, Y) directly. Rather, we observe the system at a random observation time T ∈ ℝ+. At this time, we observe whether or not failure occurred, and if and only if failure occurred, we also observe the failure cause Y. Our goal is nonparametric estimation of the bivariate distribution function of (X, Y), or equivalently, of the vector of sub-distribution functions F_0 = (F_01, ..., F_0K), where

F_0k(t) = P(X ≤ t, Y = k), k = 1, ..., K.

We make the following assumptions:

8 (a) T is independent of (X, Y ); (b) The system cannot fail from two or more causes at the same time. Assumption (a) is essential for the development of the theory, and is used in the definition of the estimators in Sections 2.1.2 and 2.1.3. Assumption (b) ensures that the failure cause is well defined. This assumption is always satisfied by defining simultaneous failure from several causes as a new failure cause. We do not make any other assumptions. In particular, we do not require that all observation times are distinct. 2.1.1 Notation We denote the observed data by Z = (T, ), where = ( 1,..., K+1 ) and k = 1{X T, Y = k}, k = 1,...,K, (2.1) K+1 = 1{X > T }. (2.2) Thus, for k = 1,..., K, k = 1 if and only if failure happened by time T and was due to cause k. Furthermore, K+1 = 1 if and only if failure did not happen by time T. Note that K+1 k=1 k = 1, and hence K+1 = 1 K k=1 k. A graphical representation of the observed data is given in Figure 2.1. Let Z 1,...,Z n be n i.i.d. observations of Z, where Z i = (T i, i ) and i = ( i1,..., i,k+1 ). We call an observation Z i right censored if i,k+1 = 1, and left censored otherwise. Let T (1),...,T (n) be the order statistics of T 1,...,T n, where ties are broken arbitrarily after ensuring that left censored observation are ordered before right censored observations. We denote the corresponding -vectors by (1),..., (n), where (i) = ( (i)1,..., (i),k+1 ).

Figure 2.1: Graphical representation of the observed data (T, Δ) in an example with K = 3 competing risks. The grey sets indicate the values of (X, Y) that are consistent with (T, Δ), for each of the four possible values of Δ.

Let e_k, k = 1, ..., K + 1, be the kth unit vector in ℝ^{K+1}, and let

Z = {(t, e_k) : t ∈ ℝ+, k = 1, ..., K + 1}. (2.3)

Let G be the distribution of T, and let G_n be the empirical distribution of T_1, ..., T_n. Furthermore, let P_n be the empirical distribution of Z_1, ..., Z_n, i.e., for any function h : Z → ℝ we have P_n h(Z) = ∫ h(z) dP_n(z) = (1/n) ∑_{i=1}^n h(Z_i).

For vectors x = (x_1, ..., x_K) ∈ ℝ^K, we define x_+ = ∑_{k=1}^K x_k and x_{K+1} = 1 − x_+. For example, we write Δ_+ = ∑_{k=1}^K Δ_k, F_0+(t) = ∑_{k=1}^K F_0k(t) and F_{0,K+1}(t) = 1 − F_0+(t). The only exception to the notation x_{K+1} = 1 − x_+ is that we do not use it for the naive estimator. The reason for this will become clear in Section 2.1.3.

2.1.2 The MLE

We now define the MLE F̂_n = (F̂_n1, ..., F̂_nK) for F_0 = (F_01, ..., F_0K). Note that

(Δ | T) ∼ Multinomial_{K+1}(1, (F_01(T), ..., F_{0,K+1}(T))). (2.4)

Hence, under F = (F_1, ..., F_K), the density for a single observation z = (t, δ) is

p_F(z) = ∏_{k=1}^{K+1} F_k(t)^{δ_k}, (2.5)

with respect to the dominating measure µ = G × #, where # is counting measure on {e_k : k = 1, ..., K + 1}. The corresponding log likelihood (divided by n)¹ is

l_n(F) = ∫ log p_F(u, δ) dP_n(u, δ) = ∑_{k=1}^{K+1} ∫ δ_k log F_k(u) dP_n(u, δ), (2.6)

and the MLE (if it exists)² is defined by

l_n(F̂_n) = max_{F ∈ F_K} l_n(F), (2.7)

where F_K is the set of all K-tuples of sub-distribution functions on ℝ+ with pointwise sum bounded by one. Note that we can absorb G in the dominating measure µ because of the assumed independence between T and (X, Y).

2.1.3 The naive estimator

We now define the naive estimator F̃_n = (F̃_n1, ..., F̃_{n,K+1}). The naive estimator F̃_nk can be viewed as the MLE for the reduced current status data Z_k = (T, Δ_k). To see

¹ In order to efficiently use the empirical process notation, we use the convention of dividing all log likelihoods by n.
² Existence of the estimators will follow from Theorem 2.1 ahead.

this, let $p_{k,F_k}(u, \delta)$ be the marginal density of the reduced current status data $Z_k$:

$$p_{k,F_k}(u, \delta) = F_k(u)^{\delta_k}\{1 - F_k(u)\}^{1 - \delta_k}.$$

Then the naive estimator $\tilde{F}_{nk}$ maximizes the marginal log likelihood

$$l_{nk}(F_k) = \int \log p_{k,F_k}(u, \delta)\,d\mathbb{P}_n(u, \delta) = \int \bigl\{\delta_k \log F_k(u) + (1 - \delta_k)\log(1 - F_k(u))\bigr\}\,d\mathbb{P}_n(u, \delta), \qquad (2.8)$$

for $k = 1, \dots, K+1$. Thus, the naive estimators (if they exist) are defined by

$$l_{nk}(\tilde{F}_{nk}) = \max_{F_k \in \mathcal{F}} l_{nk}(F_k), \quad k = 1, \dots, K, \qquad (2.9)$$
$$l_{n,K+1}(\tilde{F}_{n,K+1}) = \max_{S \in \mathcal{S}} l_{n,K+1}(S), \qquad (2.10)$$

where $\mathcal{F}$ is the collection of all sub-distribution functions on $\mathbb{R}_+$, and $\mathcal{S}$ is the collection of all sub-survival functions on $\mathbb{R}_+$. Note that we can omit $G$ in the marginal log likelihood, since $T$ and $(X, Y)$ are independent.

The naive estimator provides two different estimators for the overall failure time distribution $F_{0+}$, namely $\tilde{F}_{n+} = \sum_{k=1}^K \tilde{F}_{nk}$ and $1 - \tilde{F}_{n,K+1}$. Since the naive estimator does not require the sum of the sub-distribution functions to be bounded by one, $\tilde{F}_{n+}$ may exceed one. In contrast, $1 - \tilde{F}_{n,K+1}$ is always bounded between zero and one. This estimator is simply the MLE for the overall failure time distribution when information on the failure causes is ignored. In general, $\tilde{F}_{n,K+1} \neq 1 - \tilde{F}_{n+}$, and we therefore do not use the shorthand notation $x_{K+1} = 1 - x_+$ for the naive estimator.

2.1.4 Comparison of the two estimators

In order to point out the similarities and differences between the MLE and the naive estimator, we give the following alternative but equivalent definition of the naive

estimator. For $F = (F_1, \dots, F_K)$, we define

$$\tilde{l}_n(F) = \sum_{k=1}^K \int \bigl[\delta_k \log F_k(u) + (1 - \delta_k)\log(1 - F_k(u))\bigr]\,d\mathbb{P}_n(u, \delta). \qquad (2.11)$$

Then the naive estimator $\tilde{F}_n = (\tilde{F}_{n1}, \dots, \tilde{F}_{nK})$ (if it exists) is defined by

$$\tilde{l}_n(\tilde{F}_n) = \max_{F \in \mathcal{F}^K} \tilde{l}_n(F), \qquad (2.12)$$

where $\mathcal{F}^K$ is the space of all $K$-tuples of sub-distribution functions on $\mathbb{R}_+$. Comparing this optimization problem with the optimization problem (2.7) for the MLE, we see the following two differences:

(a) The log likelihood (2.6) for the MLE contains a term involving $F_{K+1}(u) = 1 - F_+(u)$, while the log likelihood (2.11) for the naive estimator does not include such a term;

(b) The space $\mathcal{F}_K$ for the MLE includes the constraint that the sum of the sub-distribution functions is bounded by one, while the space $\mathcal{F}^K$ for the naive estimator does not include such a constraint.

Thus, the MLE takes into account the $K$-dimensional system of sub-distribution functions, while the naive estimator ignores this aspect of the problem. In fact, since the sub-distribution functions in optimization problem (2.12) are not related to each other, the optimization problem can be split into the $K$ optimization problems defined in (2.9). Since these optimization problems correspond to the MLE for univariate current status data, both computational results and asymptotic theory follow straightforwardly from known results for current status data (see Groeneboom and Wellner (1992, Part II, Sections 1.1, 4.1 and 5.1)). The fact that the MLE takes into account the system of sub-distribution functions leads to more complicated computation and asymptotic theory. However, these complications result in a better pointwise behavior of the MLE, as shown in the simulation study in Section 8.2.

2.2 Censored data perspective

From the definitions of the MLE and the naive estimator, we see that both estimators can be viewed as nonparametric maximum likelihood estimators for censored data. Viewing the estimators from this perspective allows us to use readily available computational algorithms and theory for the MLE for censored data.

We consider the following general framework. Let $W$ be a random variable taking values in a space $\mathcal{W}$. Suppose that $W$ has distribution $F_0$. Our goal is to estimate this distribution. However, we do not observe $W$ directly. Rather, we observe a vector of random sets $D = (D_1, \dots, D_p)$ that form a partition of $\mathcal{W}$, i.e., $\bigcup_{j=1}^p D_j = \mathcal{W}$ and $D_j \cap D_k = \emptyset$ for $j \neq k \in \{1, \dots, p\}$. We assume that $D$ is independent of $W$. In principle, we can allow the number of random sets to be random, but for our purposes that is not needed. Furthermore, we observe an indicator vector $\Delta = (\Delta_1, \dots, \Delta_p)$, where $\Delta_j = 1\{W \in D_j\}$, $j = 1, \dots, p$. Thus, we observe a vector $D$ containing a random partition of $\mathcal{W}$, and an indicator vector $\Delta$ indicating which set $R \in \{D_1, \dots, D_p\}$ contains the unobservable $W$. We call the set $R$ an observed set. Using the convention $0 \cdot D_j = \emptyset$, we can write $R = \bigcup_{j=1}^p \Delta_j D_j$.

Let $Z_1, \dots, Z_n$ be $n$ i.i.d. copies of $Z = (D, \Delta)$. These data define $n$ i.i.d. observed sets $R_1, \dots, R_n$. Writing the log likelihood in terms of these sets gives

$$l_n(F) = \frac{1}{n} \sum_{i=1}^n \log P_F(R_i),$$

where $P_F(R_i)$ denotes the probability mass in $R_i$ under distribution $F$. The maximum

likelihood estimator (if it exists) is defined by

$$l_n(\hat{F}_n) = \max_{F \in \mathcal{F}} l_n(F), \qquad (2.13)$$

where $\mathcal{F}$ is the space of all distribution functions on $\mathcal{W}$. Since $l_n(F)$ is optimized over the function space $\mathcal{F}$, the optimization problem (2.13) is infinite dimensional. However, the number of parameters can be reduced by generalizing the reasoning of Turnbull (1976) for univariate censored data. It follows that the estimators can only assign mass to a finite collection of disjoint sets $A_1, \dots, A_m$, called maximal intersections by Wong and Yu (1999).

In the literature, there are several equivalent definitions of maximal intersections. Wong and Yu (1999) define $A_j$ to be a maximal intersection if and only if it is a finite intersection of the $R_i$'s such that for each $i$, $A_j \cap R_i = \emptyset$ or $A_j \cap R_i = A_j$. Gentleman and Vandal (2002) use a graph theoretic perspective. They show that the maximal intersections correspond to maximal cliques of the intersection graph of the observed sets. We discuss this perspective in detail in the next section. For observed sets that take the form of rectangles in $\mathbb{R}^p$, $p \in \mathbb{N}$, Maathuis (2005) introduces yet another way to view the maximal intersections, using a height map of the observed sets. This height map is a function $h : \mathbb{R}^p \to \{0, 1, \dots, n\}$, where $h(x)$ is defined as the number of observed sets that overlap at the point $x \in \mathbb{R}^p$. Maathuis (2005) shows that the maximal intersections are exactly the local maxima of the height map of a canonical version of the observed sets. We say that $\tilde{R}_1, \dots, \tilde{R}_n$ are a canonical version of $R_1, \dots, R_n$ if the following three properties hold: (i) $\tilde{R}_1, \dots, \tilde{R}_n$ and $R_1, \dots, R_n$ have the same intersection structure, i.e., $\tilde{R}_i \cap \tilde{R}_j = \emptyset$ if and only if $R_i \cap R_j = \emptyset$, for all $i, j \in \{1, \dots, n\}$; (ii) the $x$-coordinates of $\tilde{R}_1, \dots, \tilde{R}_n$ are distinct and take values in $\{1, \dots, 2n\}$; (iii) the $y$-coordinates of $\tilde{R}_1, \dots, \tilde{R}_n$ are distinct and take values in $\{1, \dots, 2n\}$.
Thus, any ties that may have been present in $R_1, \dots, R_n$ are resolved in $\tilde{R}_1, \dots, \tilde{R}_n$, but in a way that does not affect the intersection structure. For details on the transformation to canonical sets, see Maathuis (2005, Section 2.1).
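For one-dimensional observed sets that are half-open intervals, as will be the case for the naive estimator below, the height-map characterization reduces to a simple rule: a maximal intersection is an interval $(a, b]$ between two adjacent endpoints, where $a$ is a left endpoint and $b$ is a right endpoint of some observed sets. A small sketch under this assumption (the helper name is ours, for illustration only):

```python
def maximal_intersections(lefts, rights):
    """Maximal intersections of the half-open intervals (lefts[i], rights[i]].

    Uses the local-maxima-of-the-height-map rule for intervals:
    a maximal intersection is (a, b] with a a left endpoint, b a right
    endpoint, and a, b adjacent in the sorted list of all endpoints.
    """
    left_set, right_set = set(lefts), set(rights)
    pts = sorted(left_set | right_set)
    return [(a, b) for a, b in zip(pts, pts[1:])
            if a in left_set and b in right_set]
```

For example, the current status sets $(0, 1]$, $(2, \infty)$ and $(0, 3]$ have height map values 2, 1, 2, 1 on $(0,1]$, $(1,2]$, $(2,3]$, $(3,\infty)$, so the maximal intersections are $(0, 1]$ and $(2, 3]$.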

By generalizing the reasoning of Turnbull (1976), it follows that the MLE is indifferent to the distribution of mass within the maximal intersections. As a result, the MLE is typically not uniquely defined on the maximal intersections. This type of non-uniqueness is called representational non-uniqueness by Gentleman and Vandal (2002). Thus, we can at best hope to determine the probability masses $\alpha_j = P_F(A_j)$, $j = 1, \dots, m$. We let $\alpha = (\alpha_1, \dots, \alpha_m)$ and write the probability mass in an observed set $R_i$ in terms of $\alpha$:

$$P_\alpha(R_i) = \sum_{j=1}^m \alpha_j 1\{A_j \subseteq R_i\}. \qquad (2.14)$$

Then we can write the log likelihood as

$$l_n(\alpha) = \frac{1}{n} \sum_{i=1}^n \log P_\alpha(R_i) = \frac{1}{n} \sum_{i=1}^n \log\Bigl(\sum_{j=1}^m \alpha_j 1\{A_j \subseteq R_i\}\Bigr). \qquad (2.15)$$

Thus, we can think of the computation of the estimators as a two step process. First, in the reduction step, we compute the maximal intersections $A_1, \dots, A_m$. Next, in the optimization step, we solve the optimization problem

$$l_n(\hat{\alpha}) = \max_{\alpha \in \mathcal{A}} l_n(\alpha), \qquad (2.16)$$

where $\mathcal{A} = \{\alpha \in \mathbb{R}^m : \alpha_j \geq 0,\ j = 1, \dots, m,\ 1^T\alpha = 1\}$ and $1$ is the all-one vector in $\mathbb{R}^m$. This optimization problem is an $m$-dimensional convex constrained optimization problem. Existence of the MLE follows directly from standard methods in optimization theory.

Theorem 2.1 The MLE $\hat{\alpha}$ defined by (2.16) exists.
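One standard way to solve (2.16) numerically is the self-consistency (EM) iteration of Turnbull (1976), which maps the simplex into itself. A minimal sketch, in terms of the $n \times m$ matrix $H$ with entries $1\{A_j \subseteq R_i\}$ (this matrix reappears as the clique matrix in Section 2.3; the function name is ours):

```python
import numpy as np

def em_mixture_weights(H, n_iter=1000):
    """Turnbull-style self-consistency/EM iteration for (2.16).

    H : (n, m) 0/1 array with H[i, j] = 1{A_j subset of R_i}.
    Returns an approximate maximizer alpha on the unit simplex.
    """
    n, m = H.shape
    alpha = np.full(m, 1.0 / m)
    for _ in range(n_iter):
        denom = H @ alpha                     # P_alpha(R_i) for each i
        alpha = alpha * (H.T @ (1.0 / denom)) / n
    return alpha
```

Each update multiplies $\alpha_j$ by the average of $1\{A_j \subseteq R_i\}/P_\alpha(R_i)$ over the observations, so the iteration preserves $\sum_j \alpha_j = 1$ and its fixed points satisfy the self-consistency equations.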

Proof: Letting $\log(0) = -\infty$, $l_n(\alpha)$ is a continuous extended real valued function on the nonempty compact set $\mathcal{A}$. Hence, the maximum exists by, e.g., Zeidler (1985, Corollary 38.10).

The optimization problem (2.16) may have several solutions. This forms a second source of non-uniqueness for the MLE, called mixture non-uniqueness by Gentleman and Vandal (2002). We will show in Section 2.3 that for current status data with competing risks, both the MLE and the naive estimator are mixture unique. However, we first show how both estimators fit into the censored data framework.

2.2.1 Censored data perspective of the MLE

For the MLE, the variable of interest is $W = (X, Y)$, taking values in the space $\mathcal{W} = \mathbb{R}_+ \times \{1, \dots, K\}$. The observation time $T$ defines a partition of $p = K+1$ random sets in $\mathcal{W}$:

$$D_k = (0, T] \times \{k\}, \quad k = 1, \dots, K, \qquad (2.17)$$
$$D_{K+1} = (T, \infty) \times \{1, \dots, K\}. \qquad (2.18)$$

Since there is a one-to-one correspondence between $D = (D_1, \dots, D_{K+1})$ and $T$, the assumption that $T$ is independent of $(X, Y)$ is equivalent to the assumption that $D$ is independent of $(X, Y)$. Furthermore, note that $\Delta_k = 1\{X \leq T, Y = k\} = 1\{(X, Y) \in D_k\}$ for $k = 1, \dots, K$, and $\Delta_{K+1} = 1\{X > T\} = 1\{(X, Y) \in D_{K+1}\}$. Hence, the vector $\Delta$ indicates which set contains the unobservable $(X, Y)$, and the observed data $(T, \Delta)$ give exactly the same information as $(D, \Delta)$. The corresponding observed sets are $R = \bigcup_{k=1}^{K+1} \Delta_k D_k$, so that

$$R = \begin{cases} (0, T] \times \{k\} & \text{if } \Delta_k = 1,\ k = 1, \dots, K, \\ (T, \infty) \times \{1, \dots, K\} & \text{if } \Delta_{K+1} = 1. \end{cases} \qquad (2.19)$$
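The construction (2.19) is mechanical. A small sketch (the function name and the encoding of an observed set as an $x$-interval paired with a set of causes are ours, for illustration):

```python
def observed_set(t, delta, K):
    """Observed set R of (2.19) for one observation (t, delta).

    delta : 0/1 list of length K+1 with exactly one entry equal to 1.
    Returns (x_interval, causes), encoding x_interval x causes.
    """
    if delta[K] == 1:                       # Delta_{K+1} = 1: X > t
        return ((t, float('inf')), set(range(1, K + 1)))
    k = delta.index(1) + 1                  # failure from cause k before t
    return ((0, t), {k})
```

For instance, with $K = 2$, an observation with $\Delta = (0, 0, 1)$ at $t = 2$ yields $(2, \infty) \times \{1, 2\}$, and $\Delta = (0, 1, 0)$ at $t = 3$ yields $(0, 3] \times \{2\}$.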

It follows that we can write the log likelihood (2.6) as $l_n(F) = \frac{1}{n}\sum_{i=1}^n \log P_F(R_i)$. The MLE maximizes this expression over all bivariate sub-distribution functions $F$ on $\mathbb{R}_+ \times \{1, \dots, K\}$, or equivalently, over all $K$-tuples of sub-distribution functions $F = (F_1, \dots, F_K)$ with pointwise sum bounded by one.

We now consider the maximal intersections of the observed sets $R_1, \dots, R_n$. Note that the observed sets can take the form $(t, \infty) \times \{1, \dots, K\}$ for some $t \in \mathbb{R}_+$. Such sets are not rectangles in $\mathbb{R}^2$, and hence we cannot directly use the concept of the height map of Maathuis (2005). However, by transforming such sets into $(t, \infty) \times [1, K]$, we do have rectangles in $\mathbb{R}^2$. We can then compute the maximal intersections using the concept of the height map. Afterwards we transform sets of the form $(t, \infty) \times [1, K]$ back to $(t, \infty) \times \{1, \dots, K\}$.

Once we have computed $\hat{\alpha}$, we obtain $\hat{F}_{nk}(t)$ by summing the mass in $(0, t] \times \{k\}$, for $k = 1, \dots, K$ and $t \in \mathbb{R}_+$. For each $k \in \{1, \dots, K+1\}$, we call $A$ a maximal intersection for $\hat{F}_{nk}$ if $A$ is involved in the computation of $\hat{F}_{nk}$. A precise definition is given below.

Definition 2.2 Let $k \in \{1, \dots, K\}$, and let $\mathcal{R} = \{R_1, \dots, R_n\}$ be the observed sets as defined in (2.19). We call $A$ a maximal intersection for $\hat{F}_{nk}$ if it is a maximal intersection of $\mathcal{R}$ and $A \cap (\mathbb{R}_+ \times \{k\}) \neq \emptyset$. We call $A$ a maximal intersection for $\hat{F}_{n+}$ (or equivalently, for $\hat{F}_{n,K+1}$) if $A$ is a maximal intersection for some $\hat{F}_{nk}$, $k = 1, \dots, K$.

Note that maximal intersections for $\hat{F}_{n+}$ are sets in $\mathbb{R}_+ \times \{1, \dots, K\}$, although $\hat{F}_{n+}$ is a function on $\mathbb{R}_+$. Recall from Section 2.1.1 that we order the observations such that their observation times are nondecreasing, where ties are broken arbitrarily after ensuring that left censored observations are ordered before right censored observations. Hence, if there is an observation $Z_i$ such that $T_i = T_{(n)}$ and $\Delta_{i,K+1} = 1$, then $\Delta_{(n),K+1} = 1$ holds, even if there are other observations with $T_i = T_{(n)}$ and $\Delta_{ik} = 1$ for some $k \in \{1, \dots, K\}$.
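Once the masses $\hat{\alpha}$ are available, evaluating $\hat{F}_{nk}(t)$ is a simple sum over the maximal intersections contained in $(0, t] \times \{k\}$. A sketch (the function name and the encoding of a maximal intersection as an $x$-interval paired with a set of causes are ours):

```python
def F_hat_k(alpha, mis, k, t):
    """F_nk(t): total mass of maximal intersections inside (0, t] x {k}.

    alpha : masses alpha_j of the maximal intersections.
    mis   : list of ((lo, hi), causes) pairs, encoding (lo, hi] x causes.
    """
    return sum(a for a, ((lo, hi), causes) in zip(alpha, mis)
               if causes == {k} and hi <= t)
```

Sets of the form $(T_{(n)}, \infty) \times \{1, \dots, K\}$ are automatically excluded, since their mass never lies in $(0, t] \times \{k\}$ for finite $t$.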
This is used in the following lemma, which provides information on the form of the maximal intersections for $\hat{F}_{nk}$. The lemma follows directly

from the idea of the height map.

Lemma 2.3 Let $k \in \{1, \dots, K\}$. Each maximal intersection for $\hat{F}_{nk}$ satisfies one of the following two conditions:

(i) $A = (T_{(i)}, T_{(j)}] \times \{k\}$, with $i < j$, $\Delta_{(i),K+1} = 1$, $\Delta_{(j)k} = 1$, and $\Delta_{(l),K+1} = \Delta_{(l)k} = 0$ for all $l$ such that $T_{(i)} < T_{(l)} < T_{(j)}$;

(ii) $A = (T_{(n)}, \infty) \times \{1, \dots, K\}$, with $\Delta_{(n),K+1} = 1$.

Moreover, if a set $A$ satisfies one of these conditions, then $A$ is a maximal intersection for $\hat{F}_{nk}$.

2.2.2 Censored data perspective of the naive estimator

For the naive estimator $\tilde{F}_{nk}$, we consider the reduced current status data $Z_k = (T, \Delta_k)$. Define the variables

$$W_k = X 1\{Y = k\} + \infty \cdot 1\{Y \neq k\}, \quad k = 1, \dots, K, \qquad W_{K+1} = X,$$

taking values in $\mathcal{W} = \mathbb{R}_+ \cup \{\infty\}$. Note that $F_{0k}(t) = P(W_k \leq t)$ for $k = 1, \dots, K$, and $F_{0,K+1}(t) = P(W_{K+1} > t)$. Hence we can take $W_1, \dots, W_{K+1}$ to be our variables of interest. The observation time $T$ defines a partition of $p = 2$ random sets in $\mathcal{W}$:

$$D_1 = (0, T] \quad \text{and} \quad D_2 = (T, \infty]. \qquad (2.20)$$

Since there is a one-to-one correspondence between $D = (D_1, D_2)$ and $T$, the assumption that $T$ is independent of $(X, Y)$ is equivalent to the assumption that $D$ is independent of $W_1, \dots, W_{K+1}$.
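For each fixed $k$, the reduced problem is a standard univariate current status problem, and it is well known (see Groeneboom and Wellner (1992)) that its MLE at the ordered observation times equals the isotonic (least squares, nondecreasing) regression of the corresponding indicators, computable by the pool adjacent violators algorithm. A sketch (helper names are ours):

```python
import numpy as np

def pava(y):
    """Pool adjacent violators: nondecreasing least-squares fit to y,
    which coincides with the current status MLE evaluated at the
    ordered observation times."""
    levels, weights = [], []
    for v in y:
        levels.append(float(v))
        weights.append(1)
        while len(levels) > 1 and levels[-2] > levels[-1]:
            w = weights[-2] + weights[-1]
            m = (weights[-2] * levels[-2] + weights[-1] * levels[-1]) / w
            levels.pop(); weights.pop()
            levels[-1], weights[-1] = m, w
    return np.repeat(levels, weights)

def naive_estimator_k(t, delta_k):
    """Values of the naive estimator for cause k at the sorted times."""
    order = np.argsort(t)
    return t[order], pava(np.asarray(delta_k, dtype=float)[order])
```

For $k = K+1$ one fits a nonincreasing function instead, e.g., by applying `pava` to the reversed sequence of indicators and reversing the result.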

For $k = 1, \dots, K$, note that $\Delta_k = 1\{X \leq T, Y = k\} = 1\{W_k \leq T\} = 1\{W_k \in D_1\}$. Hence, the vector $(\Delta_k, 1 - \Delta_k)$ indicates whether $D_1$ or $D_2$ contains the unobservable $W_k$, and the reduced current status data $(T, \Delta_k)$ give exactly the same information as $(D, \Delta_k)$. The corresponding observed sets are $R^{(k)} = \Delta_k D_1 \cup (1 - \Delta_k) D_2$, so that

$$R^{(k)} = \begin{cases} (0, T] & \text{if } \Delta_k = 1, \\ (T, \infty] & \text{if } \Delta_k = 0. \end{cases} \qquad (2.21)$$

We can write the log likelihood (2.8) as $l_{nk}(F_k) = \frac{1}{n}\sum_{i=1}^n \log P_{F_k}(R_i^{(k)})$. The naive estimator maximizes this expression over all sub-distribution functions $F_k$ on $\mathbb{R}_+$.

For $k = K+1$, note that $\Delta_{K+1} = 1\{X > T\} = 1\{W_{K+1} \in D_2\}$. Hence, the vector $(1 - \Delta_{K+1}, \Delta_{K+1})$ indicates whether $D_1$ or $D_2$ contains the unobservable $X$, and the reduced current status data $(T, \Delta_{K+1})$ give exactly the same information as $(D, \Delta_{K+1})$. The corresponding observed sets are $R^{(K+1)} = (1 - \Delta_{K+1}) D_1 \cup \Delta_{K+1} D_2$, so that

$$R^{(K+1)} = \begin{cases} (0, T] & \text{if } \Delta_{K+1} = 0, \\ (T, \infty] & \text{if } \Delta_{K+1} = 1. \end{cases} \qquad (2.22)$$

We can write the log likelihood (2.8) as $l_{n,K+1}(S) = \frac{1}{n}\sum_{i=1}^n \log P_S(R_i^{(K+1)})$. The naive estimator $\tilde{F}_{n,K+1}$ maximizes this expression over all sub-survival functions $S$ on $\mathbb{R}_+$.

Definition 2.4 For $k = 1, \dots, K+1$, we call $A$ a maximal intersection for $\tilde{F}_{nk}$ if it is a maximal intersection of the observed sets $R_1^{(k)}, \dots, R_n^{(k)}$ as defined in (2.21) and (2.22).

The maximal intersections for the naive estimator are described in Lemmas 2.5 and 2.6. Both lemmas follow directly from the idea of the height map.

Lemma 2.5 Let $k \in \{1, \dots, K\}$. Each maximal intersection $A$ for $\tilde{F}_{nk}$ satisfies one of the following two conditions:

(i) $A = (T_{(i)}, T_{(j)}]$, with $(T_{(i)}, T_{(j)}) \cap \{T_1, \dots, T_n\} = \emptyset$, $\Delta_{(i)k} = 0$, and $\Delta_{(j)k} = 1$;

(ii) $A = (T_{(n)}, \infty]$, with $\Delta_{(n)k} = 0$.

Moreover, if an interval $A$ satisfies one of these conditions, then it is a maximal intersection for $\tilde{F}_{nk}$.

Lemma 2.6 Each maximal intersection for $\tilde{F}_{n,K+1}$ satisfies one of the following two conditions:

(i) $A = (T_{(i)}, T_{(j)}]$, with $(T_{(i)}, T_{(j)}) \cap \{T_1, \dots, T_n\} = \emptyset$, $\Delta_{(i),K+1} = 1$, and $\Delta_{(j),K+1} = 0$;

(ii) $A = (T_{(n)}, \infty]$, with $\Delta_{(n),K+1} = 1$.

Moreover, if an interval $A$ satisfies one of these conditions, then $A$ is a maximal intersection for $\tilde{F}_{n,K+1}$.

2.2.3 Comparing the maximal intersections for both estimators

Definition 2.7 For any set $A \subseteq \mathbb{R}^2$, we define the $x$-interval and $y$-interval of $A$ to be the projections of $A$ on the $x$-axis and $y$-axis. Furthermore, we define the lower and upper endpoint of $A$ to be the lower and upper endpoint of its $x$-interval.

We now compare the maximal intersections for $\hat{F}_{nk}$ and $\tilde{F}_{nk}$, for $k \in \{1, \dots, K\}$.

Lemma 2.8 For each $k = 1, \dots, K$, the number of maximal intersections for $\tilde{F}_{nk}$ is at least as large as the number of maximal intersections for $\hat{F}_{nk}$. Moreover, each upper endpoint of a maximal intersection for $\hat{F}_{nk}$ is an upper endpoint of a maximal intersection for $\tilde{F}_{nk}$.

Proof: Let $A$ be a maximal intersection for $\hat{F}_{nk}$. We show that there is a maximal intersection for $\tilde{F}_{nk}$ with the same upper endpoint. Note that $A$ must satisfy one of

the two conditions of Lemma 2.3. First, suppose that $A = (T_{(n)}, \infty) \times \{1, \dots, K\}$ with $\Delta_{(n),K+1} = 1$. Then $\Delta_{(n)k} = 0$, and $(T_{(n)}, \infty]$ is a maximal intersection for $\tilde{F}_{nk}$ by Lemma 2.5. Next, suppose that $A = (T_{(i)}, T_{(j)}] \times \{k\}$, with $\Delta_{(i),K+1} = 1$, $\Delta_{(j)k} = 1$ and $\Delta_{(l)k} = \Delta_{(l),K+1} = 0$ for all $l$ such that $T_{(i)} < T_{(l)} < T_{(j)}$. Then $\Delta_{(j-1)k} = 0$, and hence $(T_{(j-1)}, T_{(j)}]$ is a maximal intersection for $\tilde{F}_{nk}$ by Lemma 2.5.

Lemma 2.9 The number of maximal intersections for $\tilde{F}_{n,K+1}$ is at most as large as the number of maximal intersections for $\hat{F}_{n,K+1}$. Moreover, the collection of lower endpoints of the maximal intersections for $\tilde{F}_{n,K+1}$ is identical to the collection of lower endpoints of the maximal intersections for $\hat{F}_{n,K+1}$. As a result, the number of regions on the $x$-axis where $\tilde{F}_{n,K+1}$ can put mass is identical to the number of regions on the $x$-axis where $\hat{F}_{n,K+1}$ can put mass. Finally, the union of the maximal intersections for $\tilde{F}_{n,K+1}$ is contained in the union of the $x$-intervals of the maximal intersections for $\hat{F}_{n,K+1}$.

Proof: Let $A$ be a maximal intersection for $\tilde{F}_{n,K+1}$. We show that there is a maximal intersection for $\hat{F}_{n,K+1}$ with the same lower endpoint. Note that $A$ must satisfy one of the two conditions of Lemma 2.6. First, suppose that $A = (T_{(i)}, T_{(j)}]$ with $(T_{(i)}, T_{(j)}) \cap \{T_1, \dots, T_n\} = \emptyset$, $\Delta_{(i),K+1} = 1$ and $\Delta_{(j),K+1} = 0$. Since $\Delta_{(j),K+1} = 0$, there must be a $k \in \{1, \dots, K\}$ such that $\Delta_{(j)k} = 1$. But this implies that $(T_{(i)}, T_{(j)}] \times \{k\}$ is a maximal intersection for $\hat{F}_{nk}$, by Lemma 2.3. Next, suppose that $A = (T_{(n)}, \infty]$ with $\Delta_{(n),K+1} = 1$. Then $(T_{(n)}, \infty) \times \{1, \dots, K\}$ is a maximal intersection for $\hat{F}_{n1}, \dots, \hat{F}_{nK}$ by Lemma 2.3, and hence it is a maximal intersection for $\hat{F}_{n,K+1}$ by definition.

Next, let $A$ be a maximal intersection for $\hat{F}_{n,K+1}$. We show that there is a maximal intersection for $\tilde{F}_{n,K+1}$ with the same lower endpoint. By definition, it follows that there is a $k \in \{1, \dots, K\}$ so that $A$ is a maximal intersection for $\hat{F}_{nk}$.
Hence, $A$ must satisfy one of the two conditions of Lemma 2.3. First, suppose that $A = (T_{(i)}, T_{(j)}] \times \{k\}$, with $\Delta_{(i),K+1} = 1$, $\Delta_{(j)k} = 1$ and $\Delta_{(l)k} = \Delta_{(l),K+1} = 0$ for all $l$

Table 2.1: Example data with $K = 2$ competing risks, illustrating that the number of positive maximal intersections for $\tilde{F}_{n,K+1}$ can be larger than the number of positive maximal intersections for $\hat{F}_{n,K+1}$.

  i   t_(i)   delta_(i)1   delta_(i)2   delta_(i)3
  1     1         1            0            0
  2     2         0            0            1
  3     3         0            0            1
  4     4         1            0            0
  5     5         0            0            1
  6     6         0            1            0
  7     7         0            1            0
  8     8         1            0            0
  9     9         0            1            0
 10    10         0            1            0

such that $T_{(i)} < T_{(l)} < T_{(j)}$. Let $S = (T_{(i)}, T_{(j)}) \cap \{T_1, \dots, T_n\}$. If $S = \emptyset$, then $(T_{(i)}, T_{(j)}]$ is a maximal intersection for $\tilde{F}_{n,K+1}$ by Lemma 2.6. Otherwise, $(T_{(i)}, \min S]$ is a maximal intersection for $\tilde{F}_{n,K+1}$. Next, suppose that $A = (T_{(n)}, \infty) \times \{1, \dots, K\}$ with $\Delta_{(n),K+1} = 1$. Then $(T_{(n)}, \infty]$ is a maximal intersection for $\tilde{F}_{n,K+1}$ by Lemma 2.6.

The last statement follows by combining the fact that the collections of lower endpoints of the maximal intersections for $\tilde{F}_{n,K+1}$ and $\hat{F}_{n,K+1}$ are identical, with the fact that maximal intersections for $\tilde{F}_{n,K+1}$ cannot contain observation times in their interior (Lemma 2.6).

Remark 2.10 The last statement of Lemma 2.9 has implications for representational non-uniqueness of the estimators. It shows that it is possible that the area in which the MLE $\hat{F}_{n,K+1}$ suffers from representational non-uniqueness is larger than the area in which $\tilde{F}_{n,K+1}$ suffers from representational non-uniqueness. This was also noted by Hudgens, Satten and Longini (2001), and partly motivated their pseudo-likelihood estimator. However, note that it can also happen that $\tilde{F}_{n,K+1}$ is non-unique over a larger area, if many of the maximal intersections for $\hat{F}_{n,K+1}$ get zero mass. For an example, see Tables 2.1 and 2.2.

Motivated by Remark 2.10, we now consider maximal intersections that get positive mass. We introduce the following terminology:

Table 2.2: The estimators for the data in Table 2.1, in terms of their maximal intersections (MIs) and the corresponding probability masses.

  $\hat{F}_{n,K+1}$ (MLE)              $\tilde{F}_{n,K+1}$ (naive estimator)
  MIs               mass               MIs          mass
  (0, 1] × {1}      3/10               (0, 1]       1/3
  (3, 4] × {1}      0                  (3, 4]       1/6
  (5, 8] × {1}      0                  (5, 6]       1/2
  (5, 6] × {2}      7/10

Definition 2.11 Let $k \in \{1, \dots, K+1\}$. We say that $A$ is a positive maximal intersection for $\hat{F}_{nk}$ if $A$ is a maximal intersection for $\hat{F}_{nk}$ and the MLE assigns positive mass to $A$. Similarly, we say that $A$ is a positive maximal intersection for $\tilde{F}_{nk}$ if $A$ is a maximal intersection for $\tilde{F}_{nk}$ and $\tilde{F}_{nk}$ assigns positive mass to $A$.

After reading Lemma 2.9, one may wonder whether the number of positive maximal intersections for $\tilde{F}_{n,K+1}$ is at most as large as the number of positive maximal intersections for $\hat{F}_{n,K+1}$. This is indeed often the case in simulations, but not always. A counterexample can be found in Table 2.1. In this example, $\hat{F}_{n,K+1}$ has four maximal intersections, given in Table 2.2. The naive estimator $\tilde{F}_{n,K+1}$ has three maximal intersections, with corresponding masses given in Table 2.2. Note that the maximal intersections satisfy the statement of Lemma 2.9. However, there are only two positive maximal intersections for $\hat{F}_{n,K+1}$, while there are three positive maximal intersections for $\tilde{F}_{n,K+1}$.

2.3 Graph theory and uniqueness

Gentleman and Vandal (2001), Gentleman and Vandal (2002), Maathuis (2003), and Vandal, Gentleman and Liu (2006) use a graph theoretic perspective to study properties of the maximum likelihood estimator for censored data. Before we apply these methods to our problem, we give an introduction to graph theory. This introduction

is mostly based on Golumbic (1980), and is also partly given in Maathuis (2003, Section 3.3).

2.3.1 Introduction to graph theory for censored data

Let $G = (V, E)$ be an undirected graph, where $V$ is a set of vertices and $E$ is a set of edges. An edge is a collection of two vertices. Two vertices $v$ and $w$ are said to be adjacent in $G$ if there is an edge between $v$ and $w$, i.e., $vw \in E$. We say that two sets of vertices $S_1$ and $S_2$ are adjacent if there is at least one pair of vertices $(v, w)$ such that $v \in S_1$, $w \in S_2$ and $vw \in E$. A subgraph of $G = (V, E)$ is defined to be any graph $G' = (V', E')$ such that $V' \subseteq V$ and $E' \subseteq E$. Given a subset $A \subseteq V$ of vertices, we define the subgraph induced by $A$ to be $G_A = (A, E_A)$, where $E_A = \{xy \in E : x \in A, y \in A\}$. We call a subset $M \subseteq V$ of vertices a clique if every pair of distinct vertices in $M$ is adjacent. We call $M \subseteq V$ a maximal clique if there is no clique in $G$ that properly contains $M$ as a subset (instead of the terms clique and maximal clique, some authors use the terms complete subgraph and clique). Every finite graph has a finite number of maximal cliques, which we denote by $\mathcal{C} = \{C_1, \dots, C_m\}$.

Let $\mathcal{R} = \{R_1, \dots, R_n\}$ be a family of sets. The intersection graph of $\mathcal{R}$ is obtained by representing each set in $\mathcal{R}$ by a vertex, and connecting two vertices by an edge if and only if their corresponding sets intersect. An intersection graph of a collection of intervals on a linearly ordered set is called an interval graph. Alternatively, an undirected graph $G$ is called an interval graph if it can be thought of as an intersection graph of a set of intervals on the real line. Every maximal clique $C_j$ in an intersection graph has a real representation $A_j = \bigcap_{R \in C_j} R$, given by the intersection of the sets that form the maximal clique.

A sequence of vertices $(v_0, v_1, \dots, v_l)$ is called a cycle of length $l+1$ if $v_{i-1} v_i \in E$ for all $i = 1, \dots, l$ and $v_l v_0 \in E$. A cycle $(v_0, \dots, v_l)$ is called a simple cycle if $v_i \neq v_j$ for $i \neq j$. A simple cycle $(v_0, v_1, \dots, v_l)$ is called chordless if for all $i = 0, \dots, l$, $v_i v_j \in E$ only for $j = (i \pm 1) \bmod (l+1)$. A graph is called triangulated if it does not contain chordless cycles of length strictly greater than three. Hajós (1957) showed that every interval graph is triangulated.

A clique graph of $\mathcal{R}$ is an intersection graph of the maximal cliques $\mathcal{C}$. Thus, in this graph each vertex represents a maximal clique, and two vertices $C_j$ and $C_k$ are adjacent if and only if $C_j \cap C_k \neq \emptyset$, i.e., if there is at least one set in $\mathcal{R}$ that is an element of both $C_j$ and $C_k$. We define the clique matrix to be a vertices versus maximal cliques incidence matrix. For $n$ observed sets with $m$ maximal cliques, this is an $n \times m$ matrix $H$ with elements $H_{ij} = 1\{A_j \subseteq R_i\}$ (note that our $H$ is the transpose of the incidence matrix defined in Gentleman and Vandal (2002, page 559)).

We now return to the maximum likelihood estimator for censored data. Let $\mathcal{R} = \{R_1, \dots, R_n\}$ be the observed sets. Gentleman and Vandal (2001) showed that the maximal intersections $A_1, \dots, A_m$ of $\mathcal{R}$, defined in Section 2.2, are exactly the real representations of the maximal cliques of the intersection graph of $\mathcal{R}$. Hence, we can study the intersection graph to deduce properties of the MLE. In particular, Gentleman and Vandal (2002, Lemma 4) showed that $\hat{\alpha}$ is unique if the intersection graph is triangulated. An alternative proof can be found in Maathuis (2003, Lemma 3.13). Finally, we can use the clique matrix $H$ to rewrite the optimization problem (2.16). Namely, $P_\alpha(R_i) = (H\alpha)_i$, so that (2.16) becomes

$$l_n(\hat{\alpha}) = \max_{\alpha \in \mathcal{A}} \frac{1}{n} \sum_{i=1}^n \log\bigl((H\alpha)_i\bigr).$$

2.3.2 Graph theoretic aspects and uniqueness of the naive estimator

For $k = 1, \dots, K+1$, let $\mathcal{R}^{(k)} = \{R_1^{(k)}, \dots, R_n^{(k)}\}$ be the observed sets for the naive estimator $\tilde{F}_{nk}$, as defined in (2.21) and (2.22). The following proposition uses the structure of the intersection graph and the form of the maximal intersections to

25 for i j. A simple cycle (v 0, v 1,...,v l ) is called chordless if for all i = 0,..., l, v i v j E only for j = (i ± 1) mod (l + 1). A graph is called triangulated if it does not contain chordless cycles of length strictly greater than three. Hajös (1957) showed that every interval graph is triangulated. A clique graph of R is an intersection graph of the maximal cliques C. Thus, in this graph each vertex represents a maximal clique, and two vertices C j and C k are adjacent if and only if C j C k, i.e., if there is at least one set in R that is an element of both C j and C k. We define the clique matrix to be a vertices versus maximal cliques incidence matrix. For n observed sets with m maximal cliques, this is an n m matrix H with elements H ij = 1{A j R i }. 4 We now return to the maximum likelihood estimator for censored data. Let R = {R 1,...,R n } be the observed sets. Gentleman and Vandal (2001) showed that the maximal intersections A 1,..., A m of R, defined in Section 2.2, are exactly the real representations of the maximal cliques of the intersection graph of R. Hence, we can study the intersection graph to deduce properties of the MLE. In particular, Gentleman and Vandal (2002, Lemma 4) showed that α is unique if the intersection graph is triangulated. An alternative proof can be found in Maathuis (2003, Lemma 3.13). Finally, we can use the clique matrix H to rewrite the optimization problem (2.16). Namely, P α (R i ) = (Hα) i, so that (2.16) becomes l n ( α) = max A n log ((Hα) i ). 2.3.2 Graph theoretic aspects and uniqueness of the naive estimator i=1 For k = 1,...,K + 1, let R (k) = {R (k) 1,...,R(k) n } be the observed sets for the naive estimator F nk, as defined in (2.21) and (2.22). The following proposition uses the structure of the intersection graph and the form of the maximal intersections to 4 Note that our H is the transpose of the incidence matrix defined in Gentleman and Vandal (2002, page 559).