Parameter Estimation for Partially Complete Time and Type of Failure Data


Biometrical Journal 46 (2004), DOI 10.1002/bimj

Debasis Kundu*
Department of Mathematics, Indian Institute of Technology Kanpur, Pin 208016, INDIA

Received August 00, revised July 03, accepted December 003

* Corresponding author: kundu@iitk.ac.in

Summary

The theory of competing risks has been developed to assess a specific risk in the presence of other risk factors. In this paper we consider the parametric estimation of different failure modes under partially complete time and type of failure data, using the latent failure times model and the cause specific hazard functions model. Uniformly minimum variance unbiased estimators and maximum likelihood estimators are obtained when the latent failure times and the cause specific hazard functions are exponentially distributed. We also consider the case when they follow Weibull distributions. One data set is used to illustrate the proposed techniques.

Key words: Competing risks; Incomplete data; Failure distribution; Exponential distribution; Maximum likelihood estimators; Minimum variance unbiased estimator.

1. Introduction

In medical studies an investigator is often concerned with the assessment of a specific risk in the presence of other risk factors. In the statistical literature this is well known as the competing risks model. In analyzing the competing risks model, ideally the data consist of a failure time and an indicator denoting the failure type. But in many situations the data may be incomplete, both in failure time and in failure type. Moreover, under some circumstances, the failure type can be determined even when the failure time is censored. For example, consider an analysis of prostatic cancer in older men. Since prostatic cancer is not rapidly lethal and can be diagnosed long before death, a man known to have prostatic cancer who is alive at the time of analysis contributes complete information on the failure type, but incomplete information on the failure time. On the other hand, without an autopsy, the time of death provides complete information on the failure time but no information on the failure type.

In the presence of complete information, several studies have been carried out over the last two decades, both in the parametric and the non-parametric setup. Dinse [3] considered the non-parametric estimation of the survival function of a particular cause for incomplete time and type of failure data. Because of the non-parametric formulation, it was difficult to consider estimation and construction of confidence intervals for several other parameters of interest. Here we adopt the parametric formulation of the problem and consider statistical inference for several important parameters.

We approach the problem in two different ways. First we use the latent failure times model, as suggested by Cox [1]. In the latent failure times model, it is assumed that the competing risks are independent. Secondly, we use the cause specific hazard functions model as suggested by Prentice et al. [9], where the distributions of the competing causes may not be independent. It is assumed that the latent failure time distributions can be either exponential or Weibull. Similarly, under the cause specific hazard functions model, it is assumed that the cause specific hazard functions can be either exponential or Weibull. Interestingly, in both formulations the likelihood functions of the observed data are the same in each case. Therefore, the estimation procedures of the different unknown parameters and their statistical properties remain unchanged, although the interpretations of the different parameters might be different.

In this paper, we make assumptions similar to those of Dinse [3]. It is assumed that every member of a certain target population dies either of a particular disease, say cancer, or of other causes. A proportion $p$ of the population die of cancer and the proportion $(1 - p)$ die due to other causes. Suppose that a subject can experience one of $J$ failure types, and let $X$ be the time of failure and $\delta$ the failure type. At the end of the study, we have the following types of observations:

1. $\{X = t,\ \delta = j\}$
2. $\{X > t\}$
3. $\{X > t,\ \delta = j\}$
4. $\{X = t\}$

It may be mentioned that Miyakawa [8] and recently Kundu and Basu [5] considered a similar problem in the presence of only type 1 and type 4 observations.

The rest of the paper is organized as follows. In Section 2, we provide the formulation in terms of the latent failure times model and describe the notation used throughout this paper. In Section 3, we consider the estimation of different parameters when the latent failure time distributions are exponential. The corresponding confidence intervals are considered in Section 4. The Weibull latent failure distributions are considered in Section 5. One data set is analyzed in Section 6 for illustration purposes. In Section 7, we formulate the problem using the cause specific hazard functions model, and finally we draw conclusions from our work in Section 8.

2. Model Description and Notation

Without loss of generality, we assume that there are only two failure types.
We use the following notation:

$X_i$: lifetime of system $i$
$X_{ji}$: lifetime of mode $j$ ($= 1$ or $2$) of system $i$
$F(\cdot)$: cumulative distribution function of $X_i$
$f(\cdot)$: density function of $X_i$
$F_j(\cdot)$: distribution function of $X_{ji}$
$f_j(\cdot)$: density function of $X_{ji}$
$\bar F_j(\cdot) = 1 - F_j(\cdot)$
$\delta_i$: indicator variable denoting the cause of failure of system $i$
$I[\cdot]$: indicator function of the event $[\cdot]$
$\mathrm{Exp}(\lambda)$: exponential random variable with density function $\lambda e^{-\lambda x}$
$\mathrm{Gamma}(\alpha, \lambda)$: gamma random variable with density function $\frac{\lambda^{\alpha}}{\Gamma(\alpha)} x^{\alpha-1} e^{-\lambda x}$
$\mathrm{Weibull}(\alpha, \lambda)$: Weibull random variable with density function $\alpha\lambda x^{\alpha-1} e^{-\lambda x^{\alpha}}$
$\mathrm{bin}(n, p)$: binomial random variable with parameters $n$ and $p$

In this section we formulate the problem using the latent failure times model of Cox [1]. It is assumed that $(X_{1i}, X_{2i})$, $i = 1, \ldots, N$, are $N$ independent and identically distributed (i.i.d.) random variables. It is also assumed that $X_{1i}$ and $X_{2i}$ are independent for all $i = 1, 2, \ldots, N$ and that $X_i = \min\{X_{1i}, X_{2i}\}$. The two failure types are labelled 1 and 2.

We assume that we have the following observations. The first $r_1$ observations have complete failure times and the corresponding cause of failure is 1 for all of them. We denote this set as $I_1$, so the number of elements in $I_1$, $|I_1|$, is $r_1$. Similarly, the next $r_2$ observations have complete failure times and the corresponding cause of failure is 2 for all of them. We denote this set as $I_2$. The next $r_3$ observations have complete failure times but the corresponding causes of failure are unknown, and we denote this set as $I_3$. For the next $r_4$ observations we know that the failure type will be 1, but the failure times have been right censored. We denote this set as $I_4$. Similarly, for the next $r_5$ observations we have failure type 2, but the corresponding failure times are right censored; this set is $I_5$. Finally, we have the last $r_6$ right censored observations, where the exact failure time and the failure type are both unknown, and we denote this set as $I_6$. In summary, we have the following types of observations:

(a) failure time $x_i$ observed, cause 1; $i \in I_1$, $|I_1| = r_1$;
(b) failure time $x_i$ observed, cause 2; $i \in I_2$, $|I_2| = r_2$;
(c) failure time $x_i$ observed, cause unknown; $i \in I_3$, $|I_3| = r_3$;
(d) failure time censored at $x_i$, cause 1; $i \in I_4$, $|I_4| = r_4$;        (1)
(e) failure time censored at $x_i$, cause 2; $i \in I_5$, $|I_5| = r_5$;
(f) failure time censored at $x_i$, cause unknown; $i \in I_6$, $|I_6| = r_6$.

We denote $r_1 + r_2 + r_3 = n$ and $r_4 + r_5 + r_6 = m$, and therefore $m + n = N$. In order to analyze incomplete data it is assumed that the failure times are from the same population as the complete data. We also assume throughout that $r_1 + r_2 = n^*$, $r_3$, $r_4 + r_5 = m^*$ and $r_6$ are fixed and not random, and that $n > 0$. If they are assumed to be random, then all the results are valid conditionally; see Kundu and Basu [5] for details. The likelihood contributions from the observations a, b, c, d, e and f are

$$f_1(x)\bar F_2(x);\quad f_2(x)\bar F_1(x);\quad f(x);\quad \int_x^{\infty} f_1(y)\bar F_2(y)\,dy;\quad \int_x^{\infty} f_2(y)\bar F_1(y)\,dy;\quad \bar F(x),$$

respectively, where $\bar F(x) = 1 - F(x)$.

3. Exponential Latent Failure Time Distributions: Point Estimation

In this section, we assume that the $X_{ji}$'s are exponential random variables with parameters $\lambda_j$ for $j = 1$ and $2$ and for $i = 1, \ldots, N$. The distribution function $F_j(t)$ of $X_{ji}$ has the following form:

$$F_j(t) = 1 - e^{-\lambda_j t}, \quad j = 1, 2.$$

In this case the likelihood function of the observed data (1) is

$$L = \lambda_1^{r_1} e^{-(\lambda_1+\lambda_2)\sum_{i\in I_1} x_i}\; \lambda_2^{r_2} e^{-(\lambda_1+\lambda_2)\sum_{i\in I_2} x_i}\; (\lambda_1+\lambda_2)^{r_3} e^{-(\lambda_1+\lambda_2)\sum_{i\in I_3} x_i} \left(\frac{\lambda_1}{\lambda_1+\lambda_2}\right)^{r_4} e^{-(\lambda_1+\lambda_2)\sum_{i\in I_4} x_i} \left(\frac{\lambda_2}{\lambda_1+\lambda_2}\right)^{r_5} e^{-(\lambda_1+\lambda_2)\sum_{i\in I_5} x_i}\; e^{-(\lambda_1+\lambda_2)\sum_{i\in I_6} x_i}$$

$$= \lambda_1^{r_1+r_4}\,\lambda_2^{r_2+r_5}\,(\lambda_1+\lambda_2)^{r_3-r_4-r_5} \exp\Big(-(\lambda_1+\lambda_2)\sum_{i\in I} x_i\Big). \tag{2}$$

Here $I = I_1 \cup I_2 \cup \ldots \cup I_6$. From the likelihood function (2), we obtain the MLEs of $\lambda_1$ and $\lambda_2$ as

$$\hat\lambda_1 = \frac{n(r_1+r_4)}{(n^*+m^*)\sum_{i\in I} x_i} \quad\text{and}\quad \hat\lambda_2 = \frac{n(r_2+r_5)}{(n^*+m^*)\sum_{i\in I} x_i}.$$

To compute the means and variances of $\hat\lambda_1$ and $\hat\lambda_2$, we observe the following:

$$r_1 \sim \mathrm{bin}\Big(n^*, \frac{\lambda_1}{\lambda_1+\lambda_2}\Big), \quad r_2 \sim \mathrm{bin}\Big(n^*, \frac{\lambda_2}{\lambda_1+\lambda_2}\Big), \tag{3}$$

$$r_4 \sim \mathrm{bin}\Big(m^*, \frac{\lambda_1}{\lambda_1+\lambda_2}\Big), \quad r_5 \sim \mathrm{bin}\Big(m^*, \frac{\lambda_2}{\lambda_1+\lambda_2}\Big), \tag{4}$$

$$\sum_{i=1}^{n+m} X_i \sim \mathrm{Gamma}(n+m,\ \lambda_1+\lambda_2). \tag{5}$$
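The closed-form MLEs obtained from (2) can be sketched in a few lines of Python. This is only an illustrative sketch: the function names `exponential_mles` and `score` are hypothetical, the counts in the usage note are made up, and the inputs are assumed to be the counts $r_1, \ldots, r_5$ together with the total observed time $\sum_{i \in I} x_i$. The second function evaluates the two score equations of the log of (2), which should vanish at the MLEs.

```python
def exponential_mles(r1, r2, r3, r4, r5, total_time):
    """MLEs of (lambda1, lambda2) implied by likelihood (2).

    total_time is the sum of all observed (exact or censored) times.
    """
    n = r1 + r2 + r3            # observations with an exact failure time
    k = r1 + r2 + r4 + r5       # n* + m*: observations with a known failure type
    if k == 0 or total_time <= 0:
        raise ValueError("need at least one known failure type and positive total time")
    lam1 = n * (r1 + r4) / (k * total_time)
    lam2 = n * (r2 + r5) / (k * total_time)
    return lam1, lam2


def score(lam1, lam2, r1, r2, r3, r4, r5, total_time):
    """Partial derivatives of the log of (2) with respect to lambda1 and lambda2."""
    s = lam1 + lam2
    c = r3 - r4 - r5            # exponent of (lambda1 + lambda2) in (2)
    d1 = (r1 + r4) / lam1 + c / s - total_time
    d2 = (r2 + r5) / lam2 + c / s - total_time
    return d1, d2
```

For example, with the hypothetical counts `r1=3, r2=1, r3=4, r4=1, r5=1` and `total_time=10.0`, the two estimates sum to $n / \sum x_i = 0.8$ and both score components are zero up to rounding.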

Note that using (3)–(5), the exact distributions of $\hat\lambda_1$ and $\hat\lambda_2$ can be easily calculated. Using (3)–(5), we obtain

$$E(\hat\lambda_1) = \frac{n}{n+m-1}\,\lambda_1, \quad V(\hat\lambda_1) = \frac{n^2}{(n+m-1)(n+m-2)}\left[\frac{\lambda_1\lambda_2}{n^*+m^*} + \frac{\lambda_1^2}{n+m-1}\right],$$

$$E(\hat\lambda_2) = \frac{n}{n+m-1}\,\lambda_2, \quad V(\hat\lambda_2) = \frac{n^2}{(n+m-1)(n+m-2)}\left[\frac{\lambda_1\lambda_2}{n^*+m^*} + \frac{\lambda_2^2}{n+m-1}\right]. \tag{6}$$

From (6), it is immediate that

$$\tilde\lambda_1 = \frac{(r_1+r_4)(n+m-1)}{(n^*+m^*)\sum_{i\in I} x_i} \quad\text{and}\quad \tilde\lambda_2 = \frac{(r_2+r_5)(n+m-1)}{(n^*+m^*)\sum_{i\in I} x_i}$$

are uniformly minimum variance unbiased estimators (UMVUEs) of $\lambda_1$ and $\lambda_2$ respectively. Note that these results match those of Miyakawa [8] when $r_4 = r_5 = r_6 = 0$.

The relative risk rate, $p$, due to cause 1 is

$$p = P[X_{1i} < X_{2i}] = \int_0^{\infty} \lambda_1 \exp(-(\lambda_1+\lambda_2)x)\,dx = \frac{\lambda_1}{\lambda_1+\lambda_2}.$$

Observe that $r_1/n^*$ and $r_4/m^*$ are unbiased estimators of $p$, but the MLE of $p$, $\hat p$, is

$$\hat p = \frac{\hat\lambda_1}{\hat\lambda_1+\hat\lambda_2} = \frac{r_1+r_4}{n^*+m^*}.$$

Note that the distribution of $\hat p$ is a scaled version of a binomial distribution and $\hat p$ is an unbiased estimator of $p$.

Now we consider the problem of estimating the mean lifetimes $\theta_1 = 1/\lambda_1$ and $\theta_2 = 1/\lambda_2$ due to cause 1 and cause 2 respectively. Although the MLEs and UMVUEs of $\lambda_1$ and $\lambda_2$ always exist, the same is not true for $\theta_1$ and $\theta_2$. UMVUEs of $\theta_1$ and $\theta_2$ do not exist, and MLEs of $\theta_1$ and $\theta_2$ exist when $r_1 + r_4 > 0$ and $r_2 + r_5 > 0$ respectively. First we give the results for $\theta_1$. The conditional MLE (conditioning on $r_1 + r_4 > 0$) of $\theta_1$, say $\hat\theta_1$, is

$$\hat\theta_1 = \frac{(n^*+m^*)\sum_{i\in I} x_i}{n(r_1+r_4)}.$$

Now we obtain the conditional distribution of $\hat\theta_1$, conditioning on $r_1 + r_4 > 0$, which in turn helps us in constructing a confidence interval for $\theta_1$. We need the following lemma for further development.

Lemma 1. The conditional moment generating function (mgf) of $\hat\theta_1$, say $\phi_{\hat\theta_1}(t)$, is of the following form:

$$\phi_{\hat\theta_1}(t) = E[\exp(t\hat\theta_1) \mid r_1+r_4 > 0] = \sum_{i=1}^{n^*+m^*} p_i \left(1 - \frac{t\,(n^*+m^*)\,\theta_1\theta_2}{n\,i\,(\theta_1+\theta_2)}\right)^{-(n+m)}.$$

Here $q = \theta_2/(\theta_1+\theta_2)$ and

$$p_i = \frac{1}{1-(1-q)^{n^*+m^*}}\;\frac{(n^*+m^*)!}{i!\,(n^*+m^*-i)!} \left(\frac{\theta_2}{\theta_1+\theta_2}\right)^{i} \left(\frac{\theta_1}{\theta_1+\theta_2}\right)^{n^*+m^*-i}.$$

Proof of Lemma 1. Note that $\sum_{i=1}^{n+m} X_i$ is a $\mathrm{Gamma}\big(n+m, \frac{\theta_1+\theta_2}{\theta_1\theta_2}\big)$ random variable and $r_1+r_4$ is a $\mathrm{bin}\big(n^*+m^*, \frac{\lambda_1}{\lambda_1+\lambda_2}\big)$ random variable. Therefore,

$$\phi_{\hat\theta_1}(t) = E[\exp(t\hat\theta_1) \mid r_1+r_4 > 0] = \sum_{i=1}^{n^*+m^*} E[\exp(t\hat\theta_1) \mid r_1+r_4 = i]\; P[r_1+r_4 = i \mid r_1+r_4 > 0].$$

Note that $p_i = P[r_1+r_4 = i \mid r_1+r_4 > 0]$ for $i = 1, \ldots, n^*+m^*$. Since the moment generating function of $\mathrm{Gamma}(\alpha, \lambda)$ is $\big(1 - \frac{t}{\lambda}\big)^{-\alpha}$, the result immediately follows.

Therefore, we have the following theorem.

Theorem 1. The conditional probability density function (pdf) of $\hat\theta_1$, $f_{\hat\theta_1}(x)$, and the conditional probability distribution function, $F_{\hat\theta_1}(x)$, are

$$f_{\hat\theta_1}(x) = \sum_{i=1}^{n^*+m^*} p_i\, g_i(x), \quad F_{\hat\theta_1}(x) = \sum_{i=1}^{n^*+m^*} p_i\, G_i(x).$$

Here $g_i(x)$ and $G_i(x)$ denote the density function and the distribution function respectively of a gamma random variable with shape parameter $(n+m)$ and scale parameter $\frac{n\,i\,(\theta_1+\theta_2)}{(n^*+m^*)\,\theta_1\theta_2}$, for $i = 1, \ldots, n^*+m^*$.

Proof of Theorem 1. The proof follows immediately from Lemma 1.

From Theorem 1, it is clear that the conditional distribution of $\hat\theta_1$ is a mixture of gamma random variables. It is easy to compute different conditional moments of $\hat\theta_1$. We provide the first and the second conditional moments; others can be computed along the same lines. We denote the two conditional moments as $E(\hat\theta_1)$ and $E(\hat\theta_1^2)$ respectively. We have

$$E(\hat\theta_1) = \left(\sum_{i=1}^{n^*+m^*} \frac{p_i}{i}\right) \frac{(n+m)(n^*+m^*)}{n}\; \frac{\theta_1\theta_2}{\theta_1+\theta_2} \tag{7}$$

and

$$E(\hat\theta_1^2) = \left(\sum_{i=1}^{n^*+m^*} \frac{p_i}{i^2}\right) \frac{(n+m)(n+m+1)(n^*+m^*)^2}{n^2}\; \frac{(\theta_1\theta_2)^2}{(\theta_1+\theta_2)^2}.$$

Note that $\hat\theta_1$ is not an unbiased estimator of $\theta_1$, and since the summations are inverse moments of positive binomial random variables, it is not possible to give exact closed-form expressions. However, it is known (Mendenhall and Lehmann [7]) that if $Z$ is a binomial random variable with parameters $N$ and $p$, then for large $N$, $E\big(\frac{1}{Z} \mid Z > 0\big) \approx \frac{1}{E(Z)} = \frac{1}{Np}$ and $E\big(\frac{1}{Z^2} \mid Z > 0\big) \approx \frac{1}{(E(Z))^2} = \frac{1}{(Np)^2}$. Using these approximations, we have $E(\hat\theta_1) \approx \frac{n+m}{n}\,\theta_1$ and $E(\hat\theta_1^2) \approx \frac{(n+m)(n+m+1)}{n^2}\,\theta_1^2$. Therefore, when $m/n$ tends to zero, $\hat\theta_1$ is an asymptotically unbiased and consistent estimator of $\theta_1$.
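As a rough numerical illustration of (7), the conditional mean can be evaluated exactly by summing over the positive binomial weights $p_i$ and then compared with the large-sample approximation $\frac{n+m}{n}\theta_1$. The sketch below uses only the Python standard library; the function names are hypothetical and the parameter values in the usage note are made up for illustration.

```python
from math import comb


def positive_binomial_weights(k, q):
    """p_i = P[Z = i | Z > 0] for i = 1..k, where Z ~ bin(k, q)."""
    norm = 1.0 - (1.0 - q) ** k
    return [comb(k, i) * q ** i * (1.0 - q) ** (k - i) / norm
            for i in range(1, k + 1)]


def conditional_mean_theta1(theta1, theta2, n, m, n_star, m_star):
    """Exact conditional mean of the MLE of theta1, following (7)."""
    k = n_star + m_star
    q = theta2 / (theta1 + theta2)        # equals lambda1 / (lambda1 + lambda2)
    weights = positive_binomial_weights(k, q)
    inverse_moment = sum(p / i for i, p in enumerate(weights, start=1))
    return inverse_moment * (n + m) * k / n * theta1 * theta2 / (theta1 + theta2)


def approximate_mean_theta1(theta1, n, m):
    """Large-sample approximation (n + m) / n * theta1."""
    return (n + m) / n * theta1
```

With, say, `theta1 = theta2 = 1.0`, `n = n_star = 200` and `m = m_star = 0`, the exact and approximate values should differ only by a small amount, which is the sense in which (7) can serve as a bias correction for fixed $m$ and large $n$.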
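For fixed $m$ and $n$ (large), the expression (7) of $E(\hat\theta_1)$ can be used for bias correction. The same results hold for $\hat\theta_2$ and can be obtained simply by interchanging the roles of $\theta_1$ and $\theta_2$.

4. Exponential Latent Failure Time Distributions: Confidence Intervals

In this section we obtain confidence bounds for $\lambda_1$ and $\lambda_2$ using the asymptotic distributions of $\hat\lambda_1$ and $\hat\lambda_2$. The Fisher information of $\lambda_1$ and $\lambda_2$ is $I(\lambda_1, \lambda_2) = (I_{ij}(\lambda_1, \lambda_2))$ for $i, j = 1$ and $2$. Here

$$I_{ij}(\lambda_1, \lambda_2) = -E\left[\frac{\partial^2 \ln L(\lambda_1, \lambda_2)}{\partial\lambda_i\,\partial\lambda_j}\right]$$

and

$$I_{11}(\lambda_1, \lambda_2) = \frac{n\lambda_1 + (n^*+m^*)\lambda_2}{\lambda_1(\lambda_1+\lambda_2)^2}, \quad I_{12}(\lambda_1, \lambda_2) = I_{21}(\lambda_1, \lambda_2) = \frac{n - m^* - n^*}{(\lambda_1+\lambda_2)^2}, \quad I_{22}(\lambda_1, \lambda_2) = \frac{n\lambda_2 + (n^*+m^*)\lambda_1}{\lambda_2(\lambda_1+\lambda_2)^2}.$$

Therefore, if $\lambda = (\lambda_1, \lambda_2)$ and $\hat\lambda = (\hat\lambda_1, \hat\lambda_2)$, then we have

$$(\hat\lambda - \lambda) \to N_2\big(0,\ I^{-1}(\lambda_1, \lambda_2)\big),$$

where $I^{-1}(\lambda_1, \lambda_2) = (I^{ij}(\lambda_1, \lambda_2))$ and

$$I^{11}(\lambda_1, \lambda_2) = \frac{\lambda_1\big(n\lambda_2 + (n^*+m^*)\lambda_1\big)}{n(n^*+m^*)}, \quad I^{12}(\lambda_1, \lambda_2) = I^{21}(\lambda_1, \lambda_2) = \frac{\lambda_1\lambda_2\big(-(n - m^*) + n^*\big)}{n(n^*+m^*)}, \quad I^{22}(\lambda_1, \lambda_2) = \frac{\lambda_2\big(n\lambda_1 + (n^*+m^*)\lambda_2\big)}{n(n^*+m^*)}.$$

Similarly, we can obtain the Fisher information matrix $I(\theta_1, \theta_2)$ for $\theta_1$ and $\theta_2$ by a simple Jacobian transformation. Some other important parameters are:

[1] $p = \frac{\lambda_1}{\lambda_1+\lambda_2} = \frac{\theta_2}{\theta_1+\theta_2}$: probability of death due to cause 1.
[2] $\tau_1 = p\theta_1 + (1-p)\theta_2 = p\lambda_1^{-1} + (1-p)\lambda_2^{-1}$: the expected lifetime of all individuals.
[3] $\tau_2 = p\tau_1$: the cause (1) specific mortality index.
[4] $\tau_3 = \exp(-\lambda_1 x)$: the survival function due to cause 1.

The MLEs of $p$, $\tau_1$, $\tau_2$ and $\tau_3$ can be obtained very easily by using the invariance property of the MLEs, and their asymptotic distributions can be obtained using the $\delta$-method.

5. Weibull Latent Failure Time Distributions

5.1 Estimation of parameters

In this section we assume that the $X_{ji}$'s are Weibull random variables with parameters $(\alpha, \lambda_j)$ for $j = 1, 2$ and for $i = 1, \ldots, N$. The distribution function $F_j(\cdot)$ of $X_{ji}$ has the following form:

$$F_j(t) = 1 - \exp(-\lambda_j t^{\alpha}).$$

We assume that the lifetime distributions of the different causes follow Weibull distributions with different scale parameters but the same shape parameter. The hazard rates due to cause 1 and cause 2 are given by

$$\frac{f_1(t)}{\bar F_1(t)} = \alpha\lambda_1 t^{\alpha-1} \quad\text{and}\quad \frac{f_2(t)}{\bar F_2(t)} = \alpha\lambda_2 t^{\alpha-1}$$

respectively. The mean lifetimes due to cause 1 and cause 2 are

$$E(X_{1i}) = \frac{1}{\lambda_1^{1/\alpha}}\,\Gamma\Big(1 + \frac{1}{\alpha}\Big) \quad\text{and}\quad E(X_{2i}) = \frac{1}{\lambda_2^{1/\alpha}}\,\Gamma\Big(1 + \frac{1}{\alpha}\Big)$$

respectively. The relative risk rate, $p$, due to cause 1 is

$$p = P[X_{1i} < X_{2i}] = \int_0^{\infty} \alpha\lambda_1 x^{\alpha-1} \exp\big(-x^{\alpha}(\lambda_1+\lambda_2)\big)\,dx = \frac{\lambda_1}{\lambda_1+\lambda_2},$$

and it is independent of the shape parameter. The log-likelihood of the observed data is

$$\ln(L) = (r_1+r_4)\ln(\lambda_1) + (r_2+r_5)\ln(\lambda_2) + (r_3-r_4-r_5)\ln(\lambda_1+\lambda_2) - (\lambda_1+\lambda_2)\sum_{i\in I} x_i^{\alpha} + (\alpha-1)\sum_{i\in I_1\cup I_2\cup I_3} \ln(x_i) + (r_1+r_2+r_3)\ln(\alpha).$$

Taking derivatives of $\ln(L)$ with respect to $\lambda_1$, $\lambda_2$ and $\alpha$ and setting them equal to zero gives

$$\hat\lambda_1(\alpha) = \frac{n(r_1+r_4)}{(n^*+m^*)\sum_{i\in I} x_i^{\alpha}} \quad\text{and}\quad \hat\lambda_2(\alpha) = \frac{n(r_2+r_5)}{(n^*+m^*)\sum_{i\in I} x_i^{\alpha}}.$$

Putting the values of $\hat\lambda_1(\alpha)$ and $\hat\lambda_2(\alpha)$ in $\ln(L)$, we obtain

$$\ln(L(\alpha)) = K + (r_1+r_2+r_3)\ln(\alpha) - (r_1+r_2+r_3)\ln\Big(\sum_{i\in I} x_i^{\alpha}\Big) + (\alpha-1)\sum_{i\in I_1\cup I_2\cup I_3} \ln(x_i), \tag{8}$$

where $K$ is a constant independent of $\alpha$. Differentiating (8) with respect to $\alpha$ and equating it to zero, we obtain

$$\frac{d}{d\alpha}\ln(L(\alpha)) = -(r_1+r_2+r_3)\,\frac{\sum_{i\in I} x_i^{\alpha}\ln(x_i)}{\sum_{i\in I} x_i^{\alpha}} + \sum_{i\in I_1\cup I_2\cup I_3} \ln(x_i) + \frac{r_1+r_2+r_3}{\alpha} = 0. \tag{9}$$

Equation (9) can be written equivalently as

$$\alpha = \left[\frac{\sum_{i\in I} x_i^{\alpha}\ln(x_i)}{\sum_{i\in I} x_i^{\alpha}} - \frac{\sum_{i\in I_1\cup I_2\cup I_3} \ln(x_i)}{r_1+r_2+r_3}\right]^{-1} = h(\alpha) \text{ (say)}. \tag{10}$$

Therefore, a simple iterative scheme may be used to solve (10). From the $i$th iterate $\alpha^{(i)}$, $\alpha^{(i+1)}$ can be obtained as $h(\alpha^{(i)})$. Once we obtain $\hat\alpha$, the MLEs of $\lambda_1$ and $\lambda_2$ are obtained as $\hat\lambda_1(\hat\alpha)$ and $\hat\lambda_2(\hat\alpha)$ respectively.

Alternatively, we can use the EM algorithm of Dempster, Laird and Rubin [2] to compute the MLEs of the unknown parameters. The EM algorithm consists of two steps. The first one is the estimation (E) step, where complete observations are left intact and pseudo observations are formed. In this case we consider the E-step as follows. All data belonging to categories a, b, d, e and f are left intact; only if the observation $x$ belongs to category c do we form pseudo observations, by fractioning $x$ into two partially complete pseudo observations of the form $(x, w_1(x; \gamma))$ and $(x, w_2(x; \gamma))$, where $\gamma = (\lambda_1, \lambda_2, \alpha)$.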
Specifically, the fractional mass $w_j(x; \gamma)$ assigned to the pseudo observation $x$ is the conditional probability that the individual died from risk $j$ given that the individual died at time point $x$. Therefore,

$$w_1(x; \gamma) = \frac{\lambda_1}{\lambda_1+\lambda_2} \quad\text{and}\quad w_2(x; \gamma) = \frac{\lambda_2}{\lambda_1+\lambda_2}.$$

From now on we denote $w_1(x; \gamma)$ and $w_2(x; \gamma)$ as $w_1$ and $w_2$ respectively. The log-likelihood function of the pseudo data can be written as $\ln(L_S(\lambda_1, \lambda_2, \alpha))$, where

$$\ln(L_S(\lambda_1, \lambda_2, \alpha)) = (r_1+r_2+r_3)\ln(\alpha) + (r_1 + w_1 r_3)\ln(\lambda_1) + (r_2 + w_2 r_3)\ln(\lambda_2) + (\alpha-1)\sum_{i\in I_1\cup I_2\cup I_3} \ln(x_i) - (\lambda_1+\lambda_2)\sum_{i\in I} x_i^{\alpha}. \tag{11}$$
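As a point of comparison with the EM formulation, the direct fixed-point scheme $\alpha^{(i+1)} = h(\alpha^{(i)})$ for (10) can be sketched as below. This is a minimal sketch under stated assumptions: the function names, the starting value, the tolerance and the iteration cap are all hypothetical choices, `all_times` is assumed to hold every observed time (the set $I$) and `exact_times` the times with an exact failure time ($I_1 \cup I_2 \cup I_3$), and all times are assumed strictly positive. Convergence of the plain iteration is not guaranteed in general.

```python
import math


def h(alpha, all_times, exact_times):
    """Right-hand side of (10) for a given shape value alpha."""
    weighted_log = sum(x ** alpha * math.log(x) for x in all_times)
    weight_total = sum(x ** alpha for x in all_times)
    mean_log = sum(math.log(x) for x in exact_times) / len(exact_times)
    return 1.0 / (weighted_log / weight_total - mean_log)


def solve_shape(all_times, exact_times, alpha0=1.0, tol=1e-10, max_iter=500):
    """Iterate alpha <- h(alpha) until successive values agree within tol."""
    alpha = alpha0
    for _ in range(max_iter):
        new_alpha = h(alpha, all_times, exact_times)
        if abs(new_alpha - alpha) < tol:
            return new_alpha
        alpha = new_alpha
    raise RuntimeError("fixed-point iteration did not converge")
```

Once a value of $\alpha$ has been obtained in this way, the scale parameters follow from the closed-form expressions $\hat\lambda_1(\alpha)$ and $\hat\lambda_2(\alpha)$ given with (8).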

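The maximization (M) step involves the maximization of the log-likelihood function of the pseudo data (11) with respect to $\lambda_1$, $\lambda_2$ and $\alpha$. Assuming the $w_j$'s to be known, we need to maximize $L_S(\lambda_1, \lambda_2, \alpha)$. This has to be performed iteratively. If $(\lambda_1^{(i)}, \lambda_2^{(i)}, \alpha^{(i)})$ is the $i$th iterate, then the $(i+1)$th iterate, $(\lambda_1^{(i+1)}, \lambda_2^{(i+1)}, \alpha^{(i+1)})$, can be obtained as

$$\lambda_1^{(i+1)} = \frac{r_1 + r_3 w_1}{\sum_{i\in I} x_i^{\alpha^{(i)}}} \quad\text{and}\quad \lambda_2^{(i+1)} = \frac{r_2 + r_3 w_2}{\sum_{i\in I} x_i^{\alpha^{(i)}}}.$$

Finally, $\alpha^{(i+1)}$ can be obtained as

$$\alpha^{(i+1)} = \arg\max\, g(\alpha),$$

where

$$g(\alpha) = (r_1+r_2+r_3)\ln(\alpha) + (r_1 + w_1 r_3)\ln(\lambda_1^{(i)}) + (r_2 + w_2 r_3)\ln(\lambda_2^{(i)}) + (\alpha-1)\sum_{i\in I_1\cup I_2\cup I_3} \ln(x_i) - (\lambda_1^{(i)}+\lambda_2^{(i)})\sum_{i\in I} x_i^{\alpha}.$$

Equating $g'(\alpha) = 0$, we obtain

$$\alpha = \frac{r_1+r_2+r_3}{(\lambda_1^{(i)}+\lambda_2^{(i)})\sum_{i\in I} x_i^{\alpha}\ln(x_i) - \sum_{i\in I_1\cup I_2\cup I_3} \ln(x_i)} = v(\alpha) \text{ (say)}. \tag{12}$$

Note that (12) can also be solved similarly to (10).

5.2 Confidence intervals

In this section we provide confidence intervals for the different parameters. Since it is not possible to obtain the exact distribution of the MLEs, we use the asymptotic method. The asymptotic results can be stated as follows:

$$(\hat\lambda_1 - \lambda_1,\ \hat\lambda_2 - \lambda_2,\ \hat\alpha - \alpha) \to N_3\big(0,\ I^{-1}(\lambda_1, \lambda_2, \alpha)\big). \tag{13}$$

Here $I(\lambda_1, \lambda_2, \alpha)$ is the Fisher information matrix for the unknown parameters $(\lambda_1, \lambda_2, \alpha)$. The elements of the $3 \times 3$ matrix $I = ((I_{ij}))$ are as follows:

$$I_{11}(\lambda_1, \lambda_2, \alpha) = \frac{n\lambda_1 + (n^*+m^*)\lambda_2}{\lambda_1(\lambda_1+\lambda_2)^2}, \quad I_{22}(\lambda_1, \lambda_2, \alpha) = \frac{n\lambda_2 + (n^*+m^*)\lambda_1}{\lambda_2(\lambda_1+\lambda_2)^2},$$

$$I_{12}(\lambda_1, \lambda_2, \alpha) = \frac{n - (n^*+m^*)}{(\lambda_1+\lambda_2)^2} = I_{21}(\lambda_1, \lambda_2, \alpha), \quad I_{33}(\lambda_1, \lambda_2, \alpha) = \frac{n}{\alpha^2} + V(n+m)(\lambda_1+\lambda_2),$$

$$I_{13}(\lambda_1, \lambda_2, \alpha) = (n+m)\,U = I_{23}(\lambda_1, \lambda_2, \alpha) = I_{31}(\lambda_1, \lambda_2, \alpha) = I_{32}(\lambda_1, \lambda_2, \alpha).$$

Here $U = E(X^{\alpha}\ln(X))$ and $V = E(X^{\alpha}(\ln(X))^2)$, where $X$ is distributed as $\mathrm{Weibull}(\alpha, \lambda_1+\lambda_2)$. Note that $U$ and $V$ can be written as

$$U = \frac{1}{\alpha(\lambda_1+\lambda_2)}\big[\psi(2) - \ln(\lambda_1+\lambda_2)\big],$$

$$V = \frac{1}{\alpha^2(\lambda_1+\lambda_2)}\big[\psi'(2) + \psi^2(2) + (\ln(\lambda_1+\lambda_2))^2 - 2\psi(2)\ln(\lambda_1+\lambda_2)\big].$$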

Here $\psi(\cdot)$ and $\psi'(\cdot)$ are the digamma and polygamma functions respectively. We can use (13) to compute asymptotic confidence intervals for all unknown parameters. Since it has been observed that in many cases (see Efron and Hinkley [4]) it is more appropriate to use the observed information matrix than the expected information matrix, we provide the observed information matrix also. It is possible to obtain the observed information matrix when the EM algorithm is used (Louis [6]). Using the same notation as Louis [6], the observed information matrix, $\hat I$, takes the form

$$\hat I = B - SS'.$$

Here $B$ is the $3 \times 3$ negative of the second derivative matrix and $S$ is the $3 \times 1$ gradient vector. They are as follows:

$$S(1) = \frac{r_1 + w_1 r_3}{\lambda_1} - \sum_{i\in I} x_i^{\alpha}, \quad S(2) = \frac{r_2 + w_2 r_3}{\lambda_2} - \sum_{i\in I} x_i^{\alpha},$$

$$S(3) = \frac{r_1+r_2+r_3}{\alpha} + \sum_{i\in I_1\cup I_2\cup I_3} \ln(x_i) - (\lambda_1+\lambda_2)\sum_{i\in I} x_i^{\alpha}\ln(x_i),$$

and

$$B(1,1) = \frac{r_1 + w_1 r_3}{\lambda_1^2}, \quad B(2,2) = \frac{r_2 + w_2 r_3}{\lambda_2^2}, \quad B(1,2) = B(2,1) = 0,$$

$$B(1,3) = B(3,1) = B(2,3) = B(3,2) = \sum_{i\in I} x_i^{\alpha}\ln(x_i), \quad B(3,3) = \frac{r_1+r_2+r_3}{\alpha^2} + (\lambda_1+\lambda_2)\sum_{i\in I} x_i^{\alpha}(\ln(x_i))^2.$$

6. Data Analysis

We illustrate the proposed parametric estimation technique using data from one cancer clinical trial. The data were originally analyzed by Dinse [3] using the non-parametric approach.

Data Set: This example is concerned with the relationship between the time until progression and an indicator of patient status at the time of progression for patients with glioblastoma, a cancer of the brain. Define $X$ as the time until disease progression, measured in months from the date of randomization. Suppose $\delta$ indicates whether a patient is non-ambulatory ($\delta = 1$) or ambulatory ($\delta = 2$) at the time of their most recent examination within the month before progression. If a patient does not experience a progression by the time of analysis, the patient belongs to category f. A patient with a known progression time but no record of ambulatory status within the month preceding progression falls into category c.
For a detailed discussion of the data set, readers are referred to Dinse [3]. The data set presents times and indicators for 7 patients from a clinical trial. The summary of the data set is as follows: $r_1 = 4$, $r_2 = 7$, $r_3 = 3$, $r_4 = r_5 = 0$, $r_6 = 83$, $n^* = 58$, $n = 89$, $m^* = 0$ and $m = 83$. Also $\sum x_i = 3$, $\sum x_i = 50$, $= 76$, $\sum x_i = 885$ and $\sum x_i =$

Before fitting our model to the complete data set we would like to test the validity of the proposed parametric models. Since the closeness between the empirical survival function and the estimated survival function is a good measure of model validity for uncensored data, we consider a subset of the data which has only a, b or c types of failure. Assuming that the latent failure time distributions are exponential, we obtain the MLEs of the parameters of the restricted data as $\hat\lambda_{1r} =$ and $\hat\lambda_{2r} =$ The corresponding UMVUEs are $\tilde\lambda_{1r} =$ and $\tilde\lambda_{2r} =$ respectively. In this case the MLEs and UMVUEs are quite close to each other and we consider the MLEs only. The Kolmogorov-Smirnov (K-S) distance between the empirical distribution function of the restricted data and the estimated distribution function using the exponential model is 0.9 and the corresponding p value is

Similarly, assuming that the latent failure time distributions are Weibull, we obtain the MLEs of the parameters of the restricted data as $\hat\lambda_{1r} =$ ; $\hat\lambda_{2r} =$ and $\hat\alpha_r =$ In this case the K-S distance between the empirical distribution function and the estimated distribution function using the Weibull model is 0.73 and the corresponding p value is Since the p value corresponding to the Weibull model is quite high, the preliminary data analysis suggests that the proposed Weibull model can be used to analyze this data set. We plot the empirical survival function and the estimated survival functions using the exponential model and the Weibull model in Figure 1. Since it is assumed that the incomplete data come from the same population as the complete data, it is not unreasonable to use the proposed models to analyze this data set.

Fig. 1 Empirical survival function and the estimated survival functions using exponential and Weibull models.

Now we consider the full data set. Under the assumption that the latent failure times are exponentially distributed, the MLEs of $\lambda_1$ and $\lambda_2$ are $\hat\lambda_1 =$ and $\hat\lambda_2 = 0059$ respectively, and the log-likelihood value is Also $V(\hat\lambda_1) = 5\ 0\ 6$ and $V(\hat\lambda_2) =$ Moreover, the UMVUEs of $\lambda_1$ and $\lambda_2$ are $\tilde\lambda_1 =$ and $\tilde\lambda_2 =$ respectively, with $V(\tilde\lambda_1) =$ and $V(\tilde\lambda_2) = 0\ 5$. Clearly the MLEs and UMVUEs are quite different, and this is mainly due to heavy censoring. The estimated survival function obtained using the MLEs is $e^{-(00384 + 0059)x}$ and the corresponding 95% confidence bounds are $e^{-(00384 + 0059)x} \pm 003\, e^{-(00384 + 0059)x}$. The Kolmogorov-Smirnov (K-S) distance between the empirical distribution function and the estimated distribution function is 0.96 and the corresponding p value is almost zero. Similarly, the estimated survival function using the UMVUEs is $e^{-(00738 + 00305)x}$ and the corresponding 95% confidence bands are $e^{-(00738 + 00305)x} \pm 0046\, e^{-(00738 + 00305)x}$. In this case the K-S distance is 0.5 and the corresponding p value is We provide the empirical survival function and the estimated survival functions using MLEs and UMVUEs in Figure 2. Clearly, the UMVUEs provide a much better fit to the empirical survival function, but the p value is still quite small. Instead of the MLEs we prefer to use the bias corrected UMVUEs in this case. Conditional estimates of the mean times until progression of patients with non-ambulatory status and ambulatory status are $\tilde\theta_1 = 3480$ and $\tilde\theta_2 = 353$ respectively.

Fig. 2 Empirical survival function and estimated survival functions using MLEs and UMVUEs. Here the lifetime distributions are exponential.

Fig. 3 Empirical survival function and the estimated survival functions using MLEs and modified MLEs. Here the lifetime distributions are Weibull.
The corresponding 95% confidence intervals are (4.8404,.99) and (6.4388, ) respectively. Clearly the mean progression time of the patients with ambulatory status is significantly longer than that of the patients with non-ambulatory status. The bias corrected estimate of the mean progression time of all patients is, say, $\tilde\tau_1 =$ and the corresponding 95% asymptotic confidence interval is (5.094, 3.049). The estimate of the relative risk rate of the non-ambulatory status is and the corresponding 95% asymptotic confidence interval is (0.6076, ). Clearly it is significantly greater than 0.5. The estimates of the survival functions of the time until progression of the patients with non-ambulatory and ambulatory status at the time point $x$ are $e^{-00738x}$ and $e^{-00305x}$ respectively. The corresponding 95% confidence intervals are $e^{-00738x} \pm 0096x\, e^{-00738x}$ and $e^{-00305x} \pm 0038x\, e^{-00305x}$ respectively.
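One simple way to obtain an asymptotic interval for the relative risk rate is sketched below. It relies on the fact stated in Section 3 that $\hat p = (r_1+r_4)/(n^*+m^*)$ is a scaled binomial, so that its variance is $p(1-p)/(n^*+m^*)$. The Wald-type construction, the function name and the normal quantile 1.96 are assumptions made for illustration; the paper does not spell out which construction was used for the interval reported above, and the counts in the usage note are hypothetical.

```python
import math


def relative_risk_interval(r1, r4, n_star, m_star, z=1.96):
    """Wald-type interval for p based on the scaled-binomial estimator p_hat."""
    k = n_star + m_star                      # observations with known failure type
    p_hat = (r1 + r4) / k
    half_width = z * math.sqrt(p_hat * (1.0 - p_hat) / k)
    lower = max(0.0, p_hat - half_width)     # keep the interval inside [0, 1]
    upper = min(1.0, p_hat + half_width)
    return p_hat, lower, upper
```

For instance, with hypothetical counts `r1 = 70`, `r4 = 0`, `n_star = 100` and `m_star = 0`, the estimate is 0.7 and the interval lies entirely above 0.5, which is the kind of comparison made in the text.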

Now we analyze the data under the assumption that the latent failure time distributions are Weibull. In this case, we obtain the MLEs of the unknown parameters using the methods proposed in Section 5. Both methods converge to the same solutions, which are $\hat\alpha = 45$; $\hat\lambda_1 = 00$ and $\hat\lambda_2 = 0009$; the corresponding log-likelihood value is We obtain the empirical survival function and the estimated survival function using the MLEs as $e^{-(00 + 0009)x^{45}}$. We provide it in Figure 3. Unfortunately it does not match very well. The K-S distance between the empirical distribution function and the estimated distribution function is 0.33 and the corresponding p value is almost zero. A similar thing happens in the exponential case also. We feel that the estimates of the scale parameters are highly biased. We make the bias correction as follows. Assuming that the shape parameter is known to be .45, we transform the data as $X^{\alpha}$ and then estimate the scale parameters by UMVUEs. This gives the improved estimates of $\lambda_1$ and $\lambda_2$, say $\tilde\lambda_1 = 0049$ and $\tilde\lambda_2 = 0078$ respectively. We obtain the empirical survival function and the estimated survival function using the improved estimates as $e^{-(0049 + 0078)x^{45}}$. It is also plotted in Figure 3. The K-S distance between the empirical survival function and the estimated survival function in this case is and the corresponding p value is 0.7. We will be using these new estimates from now on. The asymptotic variance-covariance matrix of $(\hat\lambda_1, \hat\lambda_2, \hat\alpha)$ obtained from the observed information matrix using the EM algorithm is Therefore, by proper multiplication we obtain the variance-covariance matrix of $(\tilde\lambda_1, \tilde\lambda_2, \tilde\alpha)$ as The asymptotic 95% confidence bounds for $\lambda_1$, $\lambda_2$ and $\alpha$ are (0.0366, 0.049), (0.09, 0.036) and (.350,.940) respectively. Since the 95% confidence interval of $\alpha$ does not include 1, it is reasonable to use the Weibull model rather than the exponential model for this data set.
The log-likelihood values also suggest (performing $\chi^2$ testing) that the Weibull model is preferable to the exponential model in this case. The estimates of the mean lifetimes of the patients with non-ambulatory and ambulatory status are .5349 and and the corresponding 95% confidence bounds are (5.099, ) and (4.5047, 37.) respectively. The estimate of the relative risk rate of the patients with non-ambulatory status is and the corresponding asymptotic 95% confidence band is (0.5885, 0.85). The estimates of the survival functions of the time until progression of the patients with non-ambulatory and ambulatory status at the time point $x$ are $e^{-0049x^{45}}$ and $e^{-0078x^{45}}$ respectively. The corresponding 95% asymptotic confidence bounds are e 0049x45 e 0049x45 x 45 ( ln (x) + 59570 (ln (x)) ) 0 3 and e 0078x45 e 0078x45 x 45 ( ln (x) + 07 (ln (x)) ) 0 3. They are plotted in Figure 4, and it is clear from the figure that the two survival functions are significantly different.

Fig. 4 Estimated survival functions and the corresponding 95% confidence bounds for the non-ambulatory and the ambulatory status patients. The lifetime distributions are assumed to be Weibull.

For illustration purposes we assume that 0% of the category f data have known patient status. We randomly choose 0% of the data from category f and assign status 1 and status 2 with probability 07 ($p$) and 03 respectively. The summary of the new data set is as follows: $r_1 = 4$, $r_2 = 7$, $r_3 = 3$, $r_4 = 7$, $r_5 = 5$, $r_6 = 7$, $n^* = 58$, $n = 89$, $m^* =$ and $m = 83$. Also $\sum x_i = 3$, $\sum x_i = 50$, $= 76$, $= 09$, $= 76$, $\sum x_i = 700$ and $\sum x_i = 639$. Assuming that the lifetime distributions are exponential, the estimates of the scale parameters are as follows: $\hat\lambda_1 = 0037$; $\hat\lambda_2 = 007$; $\tilde\lambda_1 = 0074$ and $\tilde\lambda_2 =$ Similarly, assuming that the lifetime distributions are Weibull, the estimates are $\hat\alpha = 45$; $\hat\lambda_1 = 004$; $\hat\lambda_2 = 00099$; $\tilde\lambda_1 = 0043$ and $\tilde\lambda_2 =$ Therefore the estimates of the other parameters can be obtained similarly as before, using these estimates.

Now we compare our findings with those of Dinse [3]. First of all, the non-parametric estimate of $\bar F(\cdot)$ obtained by Dinse [3] is quite close to the parametric estimate of the survival function, particularly when the latent failure time distributions are assumed to be Weibull (see Figure 3). This indicates that both analyses should provide comparable results. Dinse [3] also mentioned that the non-parametric estimates of $d_j(t) = P[\delta = j \mid X = t]$ are very erratic and therefore provided smoothed estimates of $d_j(t)$. In our formulation, $d_j(t)$ is constant for all $t$. Therefore, the parametric estimate of $d_j(t)$ can be considered as a smoothed version of the non-parametric estimates. In our case, using Weibull distributions, we obtain $\hat d_1(t) \approx 070$ and $\hat d_2(t) \approx 030$. From Figure of Dinse [3], it is clear that the non-parametric smoothed estimate of $d_1(t)$ is close to Therefore, both methods provide similar results.

7. Cause Specific Hazard Functions Model

In this section, we formulate the problem using the cause specific hazard functions, as originally proposed by Prentice et al. [9]. We assume the cause specific hazard functions to be exponential or Weibull. We use $\lambda(t)$ as the overall hazard function and $\lambda_j(t)$ as the cause specific hazard function for $j = 1$ and $2$. It is known that $\lambda(t) = \lambda_1(t) + \lambda_2(t)$.
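Suppose $X$, $\delta$ and $\bar F(\cdot)$ are the same as defined in Sections 1 and 2. The likelihood contributions of the observations a, b, c, d, e and f are

$$\lambda_1(t)\bar F(t);\quad \lambda_2(t)\bar F(t);\quad \lambda(t)\bar F(t);\quad \int_t^{\infty} \lambda_1(u)\bar F(u)\,du;\quad \int_t^{\infty} \lambda_2(u)\bar F(u)\,du;\quad \int_t^{\infty} \lambda(u)\bar F(u)\,du,$$

respectively.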
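As a sanity check on the category d contribution, the integral $\int_t^{\infty} \lambda_1(u)\bar F(u)\,du$ can be evaluated numerically in the special case of constant cause specific hazards and compared with the closed form $\frac{\lambda_1}{\lambda_1+\lambda_2} e^{-(\lambda_1+\lambda_2)t}$. The sketch below is illustrative only: the function names are hypothetical, the integral is truncated at $t + 50/(\lambda_1+\lambda_2)$ (where the integrand is negligible), and composite Simpson's rule is an arbitrary choice of quadrature.

```python
import math


def censored_type1_contribution(lam1, lam2, t, steps=20000):
    """Simpson's rule for the integral of lam1 * exp(-(lam1 + lam2) * u) over [t, upper]."""
    s = lam1 + lam2
    upper = t + 50.0 / s                  # truncation point; the tail beyond is negligible
    width = (upper - t) / steps           # steps must be even for Simpson's rule

    def integrand(u):
        return lam1 * math.exp(-s * u)    # lambda_1(u) * survival function at u

    total = integrand(t) + integrand(upper)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * integrand(t + k * width)
    return total * width / 3.0


def closed_form(lam1, lam2, t):
    """Closed-form value of the same contribution under constant hazards."""
    s = lam1 + lam2
    return lam1 / s * math.exp(-s * t)
```

With hypothetical values such as `lam1 = 0.3`, `lam2 = 0.2` and `t = 1.5`, the two functions should agree to many decimal places.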

7.1 Exponential cause specific hazard functions

In this subsection, it is assumed that

$$\lambda_1(t) = \lambda_1 \quad\text{and}\quad \lambda_2(t) = \lambda_2.$$

In this case the likelihood contributions from the observations a, b, c, d, e and f are

$$\lambda_1 e^{-(\lambda_1+\lambda_2)t};\quad \lambda_2 e^{-(\lambda_1+\lambda_2)t};\quad (\lambda_1+\lambda_2) e^{-(\lambda_1+\lambda_2)t};\quad \frac{\lambda_1}{\lambda_1+\lambda_2} e^{-(\lambda_1+\lambda_2)t};\quad \frac{\lambda_2}{\lambda_1+\lambda_2} e^{-(\lambda_1+\lambda_2)t};\quad e^{-(\lambda_1+\lambda_2)t},$$

respectively. Therefore, under this formulation, the likelihood function of the observed data (1) is the same as (2).

7.2 Weibull cause specific hazard functions

In this subsection, it is assumed that

$$\lambda_1(t) = \alpha\lambda_1 t^{\alpha-1} \quad\text{and}\quad \lambda_2(t) = \alpha\lambda_2 t^{\alpha-1}.$$

Therefore,

$$\bar F(t) = e^{-(\lambda_1+\lambda_2)t^{\alpha}}.$$

In this case the likelihood contributions from the observations a, b, c, d, e and f are

$$\alpha\lambda_1 t^{\alpha-1} e^{-(\lambda_1+\lambda_2)t^{\alpha}};\quad \alpha\lambda_2 t^{\alpha-1} e^{-(\lambda_1+\lambda_2)t^{\alpha}};\quad \alpha(\lambda_1+\lambda_2) t^{\alpha-1} e^{-(\lambda_1+\lambda_2)t^{\alpha}};\quad \frac{\lambda_1}{\lambda_1+\lambda_2} e^{-(\lambda_1+\lambda_2)t^{\alpha}};\quad \frac{\lambda_2}{\lambda_1+\lambda_2} e^{-(\lambda_1+\lambda_2)t^{\alpha}};\quad e^{-(\lambda_1+\lambda_2)t^{\alpha}},$$

respectively. In this case also the likelihood function of the data is the same as the likelihood function obtained using independent Weibull latent failure time distributions in Section 5. Since the likelihood functions are equal in both cases, the estimation procedures for the different unknown parameters, namely $(\lambda_1, \lambda_2)$ or $(\lambda_1, \lambda_2, \alpha)$, and the statistical properties of these estimates are the same in both formulations.

8. Conclusions

In this paper we consider the analysis of partially complete time and type of failure data in the presence of competing risks. We formulate the problem in two different ways and observe that both of them lead to the same likelihood function. This raises the important question of the identifiability problem of competing risks. Our analysis shows, as expected, that in the above situation it is not possible to identify from the given data whether the competing causes are independent or not without the presence of covariates.

References

[1] Cox, D. R. (1959). The analysis of exponentially distributed lifetimes with two types of failures. Journal of the Royal Statistical Society, Series B 21, 411–421.
[2] Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). Maximum likelihood estimation from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B 39.
[3] Dinse, G. E. (1982). Non-parametric estimation of partially incomplete time and types of failure data. Biometrics 38.
[4] Efron, B. and Hinkley, D. V. (1978). The observed versus expected information. Biometrika 65.
[5] Kundu, D. and Basu, S. (2000). Analysis of incomplete data in presence of competing risks. Journal of Statistical Planning and Inference 87, 221–239.
[6] Louis, T. A. (1982). Finding the observed information matrix when using the EM algorithm. Journal of the Royal Statistical Society, Series B 44.
[7] Mendenhall, W. and Lehmann, E. H. Jr. (1960). An approximation of the negative moments of the positive binomial useful in life testing. Technometrics 2, 227–242.
[8] Miyakawa, M. (1984). Analysis of incomplete data in competing risks model. IEEE Transactions on Reliability Analysis 33.
[9] Prentice, R. L., Kalbfleisch, J. D., Peterson, Jr., A. V., Flournoy, N., Farewell, V. T. and Breslow, N. E. (1978). The analysis of failure times in the presence of competing risks. Biometrics 34.
