Product-limit estimators of the survival function with left or right censored data

Size: px

Start display at page:

Download "Product-limit estimators of the survival function with left or right censored data"

Peter Dorsey
5 years ago
Views:

1 Product-limit estimators of the survival function with left or right censored data 1 CREST-ENSAI Campus de Ker-Lann Rue Blaise Pascal - BP Bruz cedex, France ( patilea@ensai.fr) 2 Institut de Statistique Université Catholique de Louvain 20, voie du Roman Pays 1348 Louvain-la-Neuve, Belgique ( rolin@stat.ucl.ac.be) Valentin Patilea 1 and Jean-Marie Rolin 2 Abstract. The problem of estimating the distribution of a lifetime when data may be left or right censored is considered. Two models are introduced and the corresponding product-limit estimators are derived. Strong uniform convergence and asymptotic normality are proved for the product-limit estimators on the whole range of the observations. A bootstrap procedure that can be applied to confidence intervals construction is proposed. Keywords: bootstrap, delta-method, martingales, left and right censoring, strong convergence, weak convergence. 1 Introduction A great deal of recent attention in survival analysis has focused on estimating the survivor distributions in the presence of various and complex censoring mechanisms. The goal of this paper is to analyze simple models for lifetime data that may be left or right censored. Typically, a lifetime T is left or right censored if, instead of observing T we observe a finite nonnegative random variable Y, and a discrete random variable with values 0, 1 or 2. By definition, when A = 0, Y = T, when A = 1, Y < T and, when A = 2, Y > T. Models for left or right censored data were proposed by [Turnbull, 1974], [Sampath and Chandra, 1990] and [Huang, 1999]. See also [Gu and Zhang, 1993], [Kim, 1994]. Assume that the sample consists of n independent copies of (Y, A) and let F T be the distribution of the lifetime of interest T. Using the plug-in (or substitution) principle, the nonparametric estimation of F T is straight as soon as F T can be expressed as an explicit function of the distribution of (Y, A). The existence of such a function requires a precise description of the censoring mechanism that is generally achieved by introducing latent variables and by making assumptions on their distributions. In this paper,

2 980 Patilea and Rolin two latent models allowing for explicit inversion formula, that is closed-form function relating F T to the distribution of (Y, A), are proposed. In some sense, our first latent model lies between the classical rightcensorship model and the current-status data model. It may be applied to the following framework. Consider a study where T the age at onset for a disease is analyzed. The individuals are examined only one time and they belong to one of the following categories: (i) evidence of the disease is present and the age at onset is known (from medical records, interviews with the patient or family members,...); (ii) the disease is diagnosed but the age at onset is unknown or the accuracy of the information about this is questionable; and (iii) the disease is not diagnosed at the examination time. Let C denote the age of the individual at the examination time. In the first case the exact failure time T (age at onset) is observed, that is Y = T. In case (ii) the failure time T is left-censored by C and thus Y = C, A = 2. Finally, the onset time T is right-censored by C for the individuals who have not yet developed the disease; in this case Y = C, A = 1. If no observation as in (ii) occurs, we are in the classical right-censorship framework, while if no uncensored observation is recorded we have current-status data. Our first latent model can be applied, for instance, with the data sets analyzed by [Turnbull and Weiss, 1978]. The second latent model proposed is closely related to the first one. It lies between the left-censorship model and the current-status data model. Consider the example of a reliability experiment where the failure time of a type of device is analyzed. A sample of devices is considered and a single inspection for each device in the sample is undertaken. Some of them already failed without knowing when (left censored observations). To increase the precision of the estimates, a proportion of the devices still working is selected randomly and followed until failure (uncensored observations). For the remaining working devices the failure time is right censored by the inspection time. Let us point out that, without any model assumption, given a distribution for the observed variables (Y, A) with Y 0 and A {0, 1, 2}, we can always apply our two inversion formulae. In this way we build two pseudotrue distribution functions of the lifetime of interest which are functionals of the observed distribution. If the experiment under observation is compatible with the hypothesis of one of our latent models, the true F T can be exactly recovered from the observed distribution. Otherwise, we can only approximate the true lifetime distribution. The paper is organized as follows. Section 2 introduces our two latent models through the equations relating the distribution of the observations to those of the latent variables. Solving these equations for F T we deduce the inversion formulae. The product-limit estimators are obtained by applying the inversion formulae to the empirical distribution. Section 2 is ended with some remarks and comments on related models. Section 3 contains the

3 PL estimators with left or right censored data 981 asymptotic results for the first latent model (similar arguments apply for the second model). We prove strong uniform convergence for the product-limit estimator on the whole range of the observations. Our proof extends and simplifies the results of [Stute and Wang, 1994] and [Gill, 1994] provided in the case of the Kaplan-Meier estimator. Next, the asymptotic normality of our product-limit estimator is obtained. The variance of the limit Gaussian process being complicated, a bootstrap procedure for which the asymptotic validity is a direct consequence of the delta-method is proposed. 2 The latent models 2.1 Model 1 The survival time of interest is T (e.g., the age at onset). Let C be a censoring time (e.g., the age of the individual at the examination time) and be a Bernoulli random variable. Assume that the latent variables T, C and are independent. The observations are independent copies of the variables (Y, A), with Y 0 and A {0, 1, 2}. These variables are defined as Y = min(t, C) + (1 )max(c T, 0) = C + min(t C, 0) and A = 2(1 )1 {T C} +1 {C<T }, where 1 A denotes the indicator function of the set A. With this censoring mechanism the lifetime T is observed, right censored or left censored. In view of the definitions of Y and A, note that if is constant and equal to one (resp. zero), we obtain right censored (resp. current status ) data. Let F T and F C denote the distributions of T and C, respectively. Let p = P ( = 1). Define the observed subdistributions of Y as H k (B) = P (Y B, A = k), k = 0, 1, 2, (1) for any B Borel subset of [0, ]. As usually in survival analysis, the censoring mechanism defines a map Φ between the distributions of the latent variables and the observed distributions. For the censoring mechanism we consider, the relationship (H 0, H 1, H 2 ) = Φ(F T, F C, p) between the subdistributions of Y and the distributions of the latent variables T, C and is the following: H 0 (dt) = p F C ([t, ]) F T (dt) H 1 (dt) = F T ((t, ])F C (dt). (2) H 2 (dt) = (1 p)f T ([0, t])f C (dt) Remark that when p = 1 (resp. p = 0) the equations (2) boil down to the equations of the classical independent right-censoring (resp. current status) model. By plug-in applied with the empirical distribution, the nonparametric estimation of the distribution of T is straight as soon as the map Φ is invertible

4 982 Patilea and Rolin and F T can be written as an explicit function of the observed subdistributions H k, k = 0, 1, 2. The model considered allows us an explicit inversion formula for F T. In order to derive this inversion formula, integrate the first and the second equation in (2) on [t, ] and deduce For t = 0, it follows that H 0 ([t, ]) + ph 1 ([t, ]) = pf T ([t, ])F C ([t, ]). (3) p = H 0([0, ]) 1 H 1 ([0, ]) = H 0 ([0, ]) H 0 ([0, ]) + H 2 ([0, ]). (4) Recall that the hazard measure associated to a distribution F is Λ(dt) = F(dt)/F([t, ]). In our case, use (2)-(3) to deduce that the hazard function corresponding to F T can be written as Λ T (dt) = Finally, the distribution F T can be expressed as H 0 (dt) H 0 ([t, ]) + ph 1 ([t, ]). (5) F T ((t, ]) = [0,t](1 Λ T (ds)), (6) where [0,t] is the product-integral on [0, t]. Note that there is no explicit formula for F T if p = 0 in equations (2), that is with current status data. Given the explicit relationship between the distribution of T and the observed subdistributions, to obtain the product-limit estimator of F T, we simply replace H k, k = 0, 1, 2 by their empirical counterparts. Let F T denote the product-limit estimator of F T. 2.2 Model 2 As in Model 1, assume that T, C and are independent. The observations are independent copies of the variables (Y, A), with Y 0 and A {0, 1, 2} where Y = T, A = 0 if 0 C T and = 1; Y = C, A = 1 if 0 C T and = 0; (7) Y = C, A = 2 if 0 T < C. The equations of this model are H 0 (dt) = p F C ([0, t])f T (dt) H 1 (dt) = (1 p)f T ([t, ])F C (dt). (8) H 2 (dt) = F T ([0, t))f C (dt) Remark that when p = 1 (resp. p = 0) the equations (8) boil down to the equations of the classical independent left-censoring (resp. current status)

5 PL estimators with left or right censored data 983 model. This model also allows for an explicit inversion formula for F T. By integration in the first and the third equation in (8), H 0 ([0, t])+ph 2 ([0, t]) = pf T ([0, t])f C ([0, t]). Deduce p = H 0([0, ]) 1 H 2 ([0, ]). Recall that given a distribution F, the associated reverse hazard measure is M(dt) = F(dt)/F([0, t]). By equations (8) deduce that the reverse hazard function M T associated to F T can be written as M T (dt) = H 0 (dt) H 0 ([0, t]) + ph 2 ([0, t]). Finally, the distribution F T can be expressed as F T ([0, t]) = (1 M T (ds)). (t, ] Applying the inversion formula with the empirical subdistributions, we get the product-limit estimator of F T in Model 2. Note that if T = h(t) and C = h(c), with h 0 a decreasing transformation, then T, C and are the variables of Model 1 applied to the left or right censored lifetime h(y ). In other words, Model 2 is equivalent to Model 1, up to a time reversal transformation. 2.3 Related models [Huang, 1999] introduced a model for the so-called partly interval-censored data, Case 1; see also [Kim, 1994]. In such data, for some subjects, the exact failure time of interest T is observed. For the remaining subjects, only the information on their current status at the examination time is available. [Huang, 1999] considered the nonparametric maximum likelihood estimator (NPMLE) of F T. Unfortunately, NPMLE does not have an explicit form and therefore Huang needs strong assumptions for deriving its asymptotic properties and a numerical algorithm for the applications. Let us point out that, on contrary to our Model 1 (resp. Model 2), in Huang s model one may observe exact failure times even if failure occurs after (resp. before) the examination time. Moreover, in Huang s model one may still obtain a n consistent estimator of the distribution FT if one simply considers the empirical distribution of the uncensored lifetimes. This is no longer true in our models. Perhaps, the most popular model for left or right-censored data is the one introduced by [Turnbull, 1974]; see also [Gu and Zhang, 1993]. In Turnbull s model there are three latent lifetimes L (left-censoring), T (lifetime of interest) and R (right-censoring) with L R. The observed variables

6 984 Patilea and Rolin are Y = max(l, min(t, R)) = min(max(l, T), R) and A defines as follows: A = 0 if L < T R; A = 1 if R < T; and A = 2 if T L. The equations of this model are H 0 (dt) = {F R ([t, ]) F L ([t, ])} F T (dt) H 1 (dt) = F T ((t, ]) F R (dt), H 2 (dt) = F T ([0, t]) F L (dt) where H k, k = 0, 1, 2 are defined as in (1) and F T, F L and F R are the distributions of T, L and R, respectively. The NPMLE of the distribution of the failure time T is not explicit but it can be computed, for instance, by iterations based on the so-called self-consistency equation. Note that imposing F C (dt) = (1 p) 1 F L (dt) = F R (dt), one recovers the equations of Model 1. However, for the applications we have in mind, there is no natural interpretation for such a constraint in Turnbull s model. Moreover, we derive a product-limit estimator for our Model 1. Finally, the asymptotic results below are much simpler and they are obtained under weaker conditions than in Turnbull s model. 3 Asymptotic results In this section the strong uniform convergence and the asymptotic normality for the estimator of the distribution F T in Model 1 are derived. Moreover, we propose a bootstrap procedure that can be used to build confidence intervals for F T. As in the previous sections, the distributions F T and F C need not be continuous. For simpler notation, hereafter, the subscript T is suppressed when there is no possible confusion. We write F (resp. F, Λ and Λ) instead of F T (resp. F T, ΛT and Λ T ). 3.1 Strong uniform convergence Let H nk be the empirical counterparts of the subdistributions H k, k = 0, 1, 2,, that is n H nk ([0, t]) = 1 {Yi t, A i=k}, k = 0, 1, 2. i=1 Clearly, sup t 0 H nk ([0, t]) H k ([0, t]) 0, almost surely. We want to prove the strong uniform convergence of the distribution F, that is F([0, t]) F([0, t]) 0, asn, almost surely, sup t I where I = {t : H 0 ([t, ])+ph 1 ([t, ]) > 0}. For this purpose, first we prove the almost sure convergence of the hazard function.

7 PL estimators with left or right censored data 985 Theorem 1 Assume that p (0, 1] and let t = supi. For any σ I, sup Λ([0, t]) Λ ([0, t]) 0, as n, almost surely. 0 t σ Moreover, if t / I and Λ ([0, t )) <, then Λ ([0, t )) Λ ([0, t )), almost surely. The strong uniform convergence of the distribution F follows without any additional assumption. Theorem 2 Assume that p (0, 1]. Then F([0, t]) F([0, t]) 0, as n, almost surely. sup t I With p = 1 one recovers the strong uniform convergence result for the Kaplan-Meier estimator obtained by [Stute and Wang, 1994], [Gill, 1994]. Our alternative proof is simpler. 3.2 Asymptotic normality Here we study the weak convergence of the process n( F F) where F is the product-limit estimator of Model 1. In this case, Λ does no longer have a martingale structure (in t) as in the case of the Nelson-Aalen estimator, that is when p = 1. However, a continuous time submartingale property for Λ can be obtained. This suffices us to extend the techniques of Gill (1983) and to use them in combination with the functional delta-method in order to establish the weak convergence of n( F F) to a Gaussian process. Here, the weak convergence is denoted by. The space D[a, b] of càdlàg functions defined on [a, b] is endowed with the supremum norm and the ball σ field. Theorem 3 Assume that p (0, 1] and define U(t) = n( F([0, t]) F([0, t])), t 0. Let t = supi. a) Let τ be a point in I. Then, U G in D[0, τ], where G is a Gaussian process. b) If t I, but H 0 (dt) <, (1) {H 0 ([t, ]) + ph 1 ([t, ])} 2 [0,t ) then G can be extended to a Gaussian process on [0, t ] and U G in D[0, t ]. The proof of the weak convergence is postponed to the appendix. Note that when t I, condition (1) is equivalent to F T (dt) F T ([t, ]) > 0 and <. (2) F C ([t, ]) [0,t )

8 986 Patilea and Rolin 3.3 Bootstrapping the product-limit estimator Theorem 3 may be used to obtain confidence intervals and confidence bands for F. However, the law of the process G(t) = G(t)/F ((t, ]) being complicated, one may prefer a bootstrap method in order to avoid handling this process in practical applications. Here, a bootstrap sample is obtained by simple random sample with replacement from the set of observations {(Y i, A i ) : 1 i n}. Let {(Yi, A i ) : 1 i n} denote a bootstrap sample and let Hk be the corresponding subdistributions. Apply equations (4) to (6) to obtain the bootstrap estimator F. The following theorem state that the bootstrap works almost surely for our product-limit estimator on any interval [0, τ] such that H 0 ([τ, ]) + ph 1 ([τ, ]) > 0. This result is a simple corollary of Theorem of [Van der Vaart and Wellner, 1996] and it is based on the uniform Hadamard differentiability of the maps involved in the inversion formula of Model 1. Theorem 4 Let τ I and let G(t) be the limit of U(t)/F ((t, ]) in D[0, τ], as obtained from Theorem 3. Then, the process n{ F ([0, t]) F([0, t])}/ F ((t, ]) converges to G in D[0, τ], almost surely. References [Gill, 1994]R. Gill. Lectures on survival analysis. Lecture Notes in Mathematics (Ecole d été de Probabilités de Saint-Flour XXII 1992), pages , [Gu and Zhang, 1993]M.G. Gu and C.-H. Zhang. Asymptotic properties of selfconsistent estimators based on doubly censored data. Ann. Statist., pages , [Huang, 1999]J. Huang. Asymptotic properties of nonparametric estimation based on partly interval-censored data. Statist. Sinica, pages , [Kim, 2003]J.S. Kim. Maximum likelihood estimation for the proportional hazards models with partly interval-censored data. J. R. Stat. Soc. Ser B, pages , [Samuelsen, 1989]S.O. Samuelsen. Asymptotic theory for non-parametric estimators from doubly censored data. Scand. J. Statist., pages 1 21, [Stute and Wang, 1994]W. Stute and J. Wang. The strong law under random censorship. Ann. Statist., pages , [Turnbull and Weiss, 1978]B.W. Turnbull and L. Weiss. A likelihood ratio statistics for testing goodness of fit with randomly censored data. Biometrics, pages , [Turnbull, 1974]B.W. Turnbull. Nonparametric estimation of a survivorship function with doubly censored data. J. Amer. Statist. Assoc., pages , [Van der Vaart and Wellner, 1996]A.W. Van der Vaart and J.A. Wellner. Weak convergence and empirical processes. Springer Verlag, New-York, 1996.

Product-limit estimators of the survival function for two modified forms of currentstatus

Bernoulli 12(5), 2006, 801 819 Product-limit estimators of the survival function for two modified forms of currentstatus data VALENTIN PATILEA 1 and JEAN-MARIE ROLIN 2 1 CREST-ENSAI, Campus de Ker Lann,