Extending the Iterative Convex Minorant Algorithm to the Cox Model for Interval-Censored Data

Wei Pan 1

The iterative convex minorant (ICM) algorithm (Groeneboom and Wellner, 1992) is fast in computing the NPMLE of the distribution function for interval censored data without covariates. We reformulate the ICM as a generalized gradient projection method (GGP), which leads to a natural extension to the Cox model. It is also easily extended to support the Lasso (Tibshirani, 1996). Some simulation results are shown, and for illustration we reanalyze two real datasets.

Key Words: Cross-validation; EM; Generalized gradient projection; ICM; Interval censoring; Isotonic regression; Lasso; NPMLE.

1. INTRODUCTION

There has been increasing research interest in interval censored data recently, largely due to its prevalence in studies of AIDS and other chronic diseases, in which the event of interest is never observed exactly but is only known to occur within some time interval. In these studies a baseline examination is conducted; then, after some fixed time, the first follow-up takes place, followed by second, third, etc. follow-ups (if any) at intervals of several years each. If the event happens, we can only be aware of its happening between two examinations; otherwise, it may only happen after the end of the last follow-up. This produces interval censoring. One fundamental statistical problem in survival analysis is to estimate the distribution function of the event time from censored data. In a more general framework, Turnbull (1976) presents an EM algorithm (Dempster, Laird and Rubin, 1977) to compute the nonparametric maximum likelihood estimator (NPMLE) of the distribution function for this kind of data without covariates. Groeneboom and Wellner (1992, shortened as G&W in the sequel) propose an iterative convex minorant (ICM) algorithm, which they suggest is much faster than the EM.
In the Cox model with interval censored data, Finkelstein (1986) gives a modified Newton-Raphson method after reparameterization. This approach uses the

1 Division of Biostatistics, University of Minnesota, A46 Mayo Building (Box 33), Minneapolis, MN. weip@biostat.umn.edu
inverse Hessian matrix at each iteration. However, for a continuous time model without grouping, the dimension of the unknown parameters is on the order of the number of observations, often several hundred or more. Many of today's ordinary workstations cannot afford this storage space. Moreover, the Hessian matrix is sometimes near singular, which may cause numerical problems when inverting it. In contrast, neither the EM nor the ICM needs to store and invert the Hessian matrix; in fact, as analyzed in the next section, the ICM only uses the diagonal elements of the Hessian. On the other hand, in spite of its slowness the EM is hard to use in the Cox model for continuous survival times (Alioum and Commenges, 1996). Hence an extension of the ICM to the Cox model has practical importance. For low-dimensional regression coefficients, a profile likelihood approach has been suggested (Huang, 1996; Huang and Wellner, 1995). It works by fixing the regression coefficient at some possible values; for each fixed value, the likelihood is maximized with respect to the baseline distribution by a modified ICM algorithm, and the point achieving the maximum of this profile likelihood (possibly after smoothing) is taken as the final estimate. Just recently, Huang and Wellner (1997) suggested a more general method based on the back-fitting scheme (Hastie and Tibshirani, 1990), though no simulation results were shown. We will compare our approach with theirs later. Originally the ICM was formulated in terms of some special stochastic processes and isotonic regression algorithms (G&W; Huang and Wellner, 1997). In Section 2 we identify it as a special case of the generalized gradient projection method (GGP) in the optimization literature (Mangasarian, 1996; Bertsekas, 1982). From this connection a natural extension of the ICM to the Cox model is almost trivial.
We further extend it to support the Lasso, a biased regression scheme proposed by Tibshirani (1996, 1997). In Section 3 simulations are conducted to assess the performance of the extended ICM methods. Finally, two real world problems are used to illustrate the methods, followed by a brief discussion.

2. COMPUTATIONAL PROCEDURES

2.1 GGP

We first briefly review the GGP, a general optimization scheme (Mangasarian, 1996; Bertsekas, 1982). It is helpful to regard it as a Newton-type iterative algorithm for constrained optimization.
Specifically, suppose we want to maximize a function $f(x)$ over a closed convex set $\mathcal{X}$. Let $\nabla f$ denote the first derivative of $f$, and let $H$ be a symmetric positive definite matrix, which can often be taken as the negative Hessian matrix of $f$ if $f$ is strictly concave. The GGP updates its current estimate $x^{(m)}$ by
$$x^{(m+1)} = \mathrm{Proj}[\, x^{(m)} + \alpha^{(m)} (H^{(m)})^{-1} \nabla f(x^{(m)});\ H^{(m)},\ \mathcal{X} \,],$$
where $\alpha^{(m)} > 0$ is a suitably chosen stepsize and Proj is the projection operation defined by
$$\mathrm{Proj}[x;\ H,\ \mathcal{X}] := \arg\min_{x^* \in \mathcal{X}} (x - x^*)' H (x - x^*).$$
The iterate $x^{(m)}$ at convergence is taken as a solution of the maximization problem. The rationale behind the GGP is intuitively explained by the following result. Consider the statements:
1) $x^* \in \arg\max_{x \in \mathcal{X}} f(x)$;
2) $x^* = \mathrm{Proj}[x^* + \alpha H^{-1} \nabla f(x^*);\ H,\ \mathcal{X}]$ for any $\alpha > 0$;
3) $x^* = \mathrm{Proj}[x^* + \alpha H^{-1} \nabla f(x^*);\ H,\ \mathcal{X}]$ for some $\alpha > 0$.
If $f$ is differentiable, then 1) implies 2); on the other hand, if $f$ is concave, then 3) implies 1). We note in passing that taking $H$ as an identity matrix (i.e., completely ignoring the second order information of $f$) results in the gradient projection method (Polak, 1971). Some theoretical properties of the GGP, such as its superlinear or linear convergence rate (depending on the choice of $H$), have been established in Bertsekas (1982). Furthermore, Bertsekas has shown its effectiveness via computational examples involving very many variables. This feature is especially attractive in our current setting (often involving several hundred or more parameters for medium to large samples).

2.2 ICM for Interval Censored Data without Covariates

For interval censored data, the time of the event of interest, say survival time $X_i$, is not observed exactly. The available observations are censoring times $(U_i, V_i)$ for $i = 1, \ldots, n$; we only know that $U_i < X_i \le V_i$. As usual we assume that $X_i$ and $(U_i, V_i)$ are independent, and that $\{X_i\}$ and $\{(U_i, V_i)\}$ are each iid samples. We are interested in estimating the distribution function $F$ of $X_i$ from $\{(U_i, V_i)\}$.
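The generic GGP update of Section 2.1 can be illustrated with a toy Python sketch. All ingredients here are our own illustrative assumptions: a concave quadratic $f$, a diagonal $H$, and a box constraint, for which the $H$-weighted projection reduces to componentwise clipping because $H$ is diagonal.

```python
import numpy as np

def ggp_step(x, grad, H_diag, project, step=1.0):
    """One generalized gradient projection (GGP) update:
    x_new = Proj[x + step * H^{-1} grad(x); H, X].
    H_diag holds the diagonal of a positive definite H."""
    return project(x + step * grad(x) / H_diag)

# Toy example: maximize f(x) = -||x - c||^2 over the box [0, 1]^2.
c = np.array([1.5, -0.5])
grad = lambda x: -2.0 * (x - c)           # gradient of f
H_diag = np.array([2.0, 2.0])             # diagonal of the negative Hessian of f
# With a diagonal H, the H-weighted projection onto a box is plain clipping.
project = lambda x: np.clip(x, 0.0, 1.0)

x = np.array([0.5, 0.5])
for _ in range(50):
    x = ggp_step(x, grad, H_diag, project)
print(x)  # converges to the box-constrained maximizer [1.0, 0.0]
```

Here one Newton-scaled step lands on the unconstrained maximizer $c$, and the projection pulls it back to the nearest feasible point.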
The (conditional) log-likelihood can be written down as (Turnbull, 1976):
$$L(F \mid \{(U_i, V_i)\}) = \sum_{i=1}^{n} \log\left( F(V_i) - F(U_i-) \right).$$
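This log-likelihood is easy to evaluate once $F$ is represented by its values at the ordered candidate jump points. The padding of $F$ with 0 on the left and 1 on the right, and the index convention below, are our own illustrative choices (they let left- and right-censored observations be handled uniformly), not the paper's notation.

```python
import numpy as np

def loglik(F, l_idx, r_idx):
    """Interval-censored log-likelihood sum_i log(F(V_i) - F(U_i-)).
    F holds the values of the distribution function at the ordered
    candidate jump points tau_1 < ... < tau_k; index 0 of the padded
    vector means F = 0 (before tau_1), index k+1 means F = 1."""
    Fext = np.concatenate(([0.0], F, [1.0]))
    return np.sum(np.log(Fext[r_idx] - Fext[l_idx]))

F = np.array([0.2, 0.5, 0.9])
# Three observations: event in (-inf, tau_1], in (tau_1, tau_3], after tau_3.
val = loglik(F, np.array([0, 1, 3]), np.array([1, 3, 4]))
print(val)  # log(.2) + log(.7) + log(.1)
```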
Notice that $L$ can be further expressed in terms of the cumulative hazard function, which is actually what we did in implementing our algorithms. There are two advantages to this parametrization: i) concavity of the likelihood function is readily available, even in the Cox model (Huang and Wellner, 1997); ii) in contrast to the restriction $F \le 1$, there is no upper bound on the cumulative hazard. But for simplicity of notation we still use the distribution function throughout. The nonparametric maximum likelihood estimator (NPMLE) $\hat F$ of $F$ is a right-continuous function maximizing $L(F)$. Turnbull (1976) shows that $\hat F$ can only have jumps between the order statistics $\tau_i$, $i = 1, \ldots, k$, of the $U_i$ and $V_i$, $1 \le i \le n$. However, the probability distribution of $\hat F$ within some intervals is nonidentifiable. This nonidentifiability does not influence the computation of $\hat F$ by the EM algorithm. To facilitate the computation of $\hat F$ by the ICM, G&W define the NPMLE $\hat F$ as piecewise constant with possible discontinuities only at the order statistics $\{\tau_i\}$; moreover, $\hat F$ may be less than 1 at the largest order statistic $\tau_k$. These two different definitions seem non-critical; in the sequel we adopt the second one. Turnbull (1976) also suggests how to reduce the number of the order statistics $\{\tau_i\}$ on which $\hat F$ may have jumps. We denote the possible support of the NPMLE as a disjoint union of intervals $[a_i, b_i)$, $i = 1, \ldots, m$. (It is possible that the NPMLE has no probability mass in some of these intervals.) Generally we use $v$ to denote a column vector with components $v_i$, and similarly denote the $(i,j)$th element of a matrix $M$ by $M_{ij}$. So far the ICM is only applicable to interval censored data without covariates. The original exposition of the ICM was in terms of stochastic processes (Groeneboom and Wellner, 1992). Here we reformulate it as a GGP method (Mangasarian, 1996; Bertsekas, 1982) so that its connection with the Newton-Raphson algorithm and the gradient projection method becomes obvious. Let the first derivative (i.e.
gradient) and second derivative (i.e., the Hessian) of $L$ with respect to $F = (F(\tau_1), \ldots, F(\tau_k))$ be $\nabla L$ and $\nabla^2 L$ respectively. Let $G$ be a $k \times k$ diagonal matrix with the same diagonal elements as $-\nabla^2 L$. To have a proper (sub)distribution function we need to restrict $F \in \mathcal{R} = \{F : 0 \le F(\tau_1) \le \cdots \le F(\tau_k) \le 1\}$. Denote the estimate from the $m$th iteration by $F^{(m)}$. Then the ICM iterates as
$$F^{(m+1)} = \mathrm{Proj}[\, F^{(m)} + G(F^{(m)})^{-1} \nabla L(F^{(m)});\ G(F^{(m)}),\ \mathcal{R} \,],$$
where the projection by definition is
$$\mathrm{Proj}[y;\ G,\ \mathcal{R}] := \arg\min_x \left\{ \sum_{i=1}^{k} G_{ii} (y_i - x_i)^2 : x_1 \le x_2 \le \cdots \le x_k \le 1 \right\},$$
which is just a weighted isotonic least squares regression problem, and hence can be efficiently accomplished by well-known algorithms such as the pool-adjacent-violators algorithm (PAVA) (Robertson et al., 1988). The time complexity of the PAVA is $O(k^2)$ and it is easy to program. Graphically, the solution of the isotonic regression (with a simple order as in $\mathcal{R}$) can be represented by the greatest convex minorant of the cumulative sum diagram $\{(0,0),\ (\sum_{i=1}^{j} G_{ii},\ \sum_{i=1}^{j} G_{ii} y_i)\ \text{for } j = 1, \ldots, k\}$. This may be the reason why the above (iterative) algorithm is named the ICM. The projection guarantees that the new $F^{(m+1)}$ is still a proper (sub)distribution function; moreover, the projection seems to speed up convergence, as suggested in G&W. Evidently this is a special case of the GGP method: without the projection, the above iteration is just a Newton-like step in which only the diagonal elements of the Hessian matrix are used. The estimate $F^{(m)}$ at convergence is taken as the NPMLE. We follow Aragon and Eberly (1992) to further define the damped iterative convex minorant algorithm (DICM) as
$$F^{(m+1)} = \mathrm{Proj}[\, F^{(m)} + \lambda_j G(F^{(m)})^{-1} \nabla L(F^{(m)});\ G(F^{(m)}),\ \mathcal{R} \,],$$
where $\lambda_j$ is an Armijo-like stepsize (Polak, 1971), or can simply be chosen by
$$\lambda_j = \max\{ 1/2^i : L(F^{(m+1)}) > L(F^{(m)}),\ i = 0, 1, 2, \ldots \},$$
which tries to take a large stepsize while increasing the likelihood. The appeal of the DICM is its global convergence and increasing likelihood in each step, though Zhan and Wellner (1995) point out some possible problems in the proof of global convergence in Aragon and Eberly (1992). On the other hand, G&W (page 7) suggest using a "buffer" to keep the estimate a proper distribution and to avoid the likelihood becoming $-\infty$; however, they do not say what is to be done if the new estimate does make the likelihood $-\infty$. Damping is a natural way to avoid this problem.
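The weighted isotonic regression in the projection step can be computed by the PAVA. The following is a minimal Python sketch of our own (not the paper's code): equal weights correspond to $G = I$, and the upper bound of 1 in $\mathcal{R}$ can be imposed afterwards by clipping.

```python
import numpy as np

def pava(y, w):
    """Weighted isotonic regression by pool-adjacent-violators:
    minimize sum_i w_i (y_i - x_i)^2 subject to x_1 <= ... <= x_k."""
    # Each block stores (weighted mean, total weight, number of points).
    blocks = []
    for yi, wi in zip(y, w):
        blocks.append([yi, wi, 1])
        # Pool adjacent blocks while the monotonicity constraint is violated.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, n2 = blocks.pop()
            m1, w1, n1 = blocks.pop()
            blocks.append([(w1 * m1 + w2 * m2) / (w1 + w2), w1 + w2, n1 + n2])
    return np.concatenate([np.full(n, m) for m, _, n in blocks])

print(pava([1.0, 3.0, 2.0, 4.0], [1.0, 1.0, 1.0, 1.0]))
# the violating pair (3, 2) is pooled to its mean 2.5: [1. 2.5 2.5 4.]
```

Unequal weights shift the pooled value toward the heavier observations, which is exactly how the diagonal of $G$ enters the projection.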
Hence in our discussion we refer to the ICM and DICM interchangeably.

2.3 Extending ICM to the Cox Model

The Cox proportional hazards model is probably the most widely used model in survival analysis. It assumes that the hazard function $h(x \mid z)$ with covariate (vector) $Z$ is proportional to an unknown baseline hazard $h_0(x)$:
$$h(x \mid z) = h_0(x) \exp(z' \beta),$$
or, formulated in terms of distribution functions,
$$1 - F(x \mid z) = [1 - F(x)]^{\exp(z' \beta)},$$
where $F$ is the (unknown) baseline distribution and $\beta$ is the regression coefficient (vector). Let the $n$ observations be labelled $(U_1, V_1, Z_1), \ldots, (U_n, V_n, Z_n)$, where $(U_i, V_i)$ as before gives the lower and upper boundaries of the censoring interval, and $Z_i$ is the covariate (vector). The survival time $X_i$ is only observed to be inside $(U_i, V_i]$. As usual we assume that $X_i$ and $(U_i, V_i)$ are mutually independent. The joint log-likelihood can be written down (up to a constant) as (Finkelstein, 1986):
$$L(F, \beta) = \sum_{i=1}^{n} \log\left\{ (1 - F(U_i-))^{\exp(Z_i' \beta)} - (1 - F(V_i))^{\exp(Z_i' \beta)} \right\}.$$
$L$ is maximized to obtain the NPMLE $\hat\beta$ of the regression coefficient along with that of the baseline distribution, $\hat F$. Here we approach this maximization problem by extending the ICM. Computationally this is different from Finkelstein (1986), where the Newton-Raphson method is adopted after reparametrization. As before, we would like to avoid using the full Hessian matrix, due to the possible difficulties mentioned earlier. Our extended ICM is simple (without inverting a high-dimensional matrix) and reasonably fast when applied in this setting. Extending the ICM is straightforward: consider the regression coefficient as another component of the parameters in addition to the baseline distribution. Let $\nabla_1 L(F, \beta) = \partial L(F, \beta)/\partial F$ and $\nabla_2 L(F, \beta) = \partial L(F, \beta)/\partial \beta$ denote the first derivatives, and $G_1(F, \beta)$ and $G_2(F, \beta)$ the corresponding diagonal matrices of the negative second derivatives. Then our extended ICM iterates as
$$F^{(m+1)} = \mathrm{Proj}[\, F^{(m)} + \lambda_j G_1(F^{(m)}, \beta^{(m)})^{-1} \nabla_1 L(F^{(m)}, \beta^{(m)});\ G_1(F^{(m)}, \beta^{(m)}),\ \mathcal{R} \,],$$
$$\beta^{(m+1)} = \beta^{(m)} + \lambda_j G_2(F^{(m)}, \beta^{(m)})^{-1} \nabla_2 L(F^{(m)}, \beta^{(m)}),$$
where $\lambda_j$ and Proj are defined as before. Again the projection step is taken to ensure that the estimate of $F$ is a proper (sub)distribution, which requires that $\hat F$ be nondecreasing and between 0 and 1.
Since there is no constraint on $\beta$, no projection is performed on it. We will discuss another application of the GGP, to a constrained $\beta$, next.
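The joint update can be sketched end to end on simulated data. The following Python sketch is ours, not the paper's implementation: it uses finite differences for the gradients and the diagonal of the Hessian (the paper uses analytic derivatives), a PAVA-based projection, and DICM-style step halving; the simulated dataset and all tuning constants are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- toy interval-censored Cox data (our own simulation) ---
tau = np.array([1.0, 2.0, 3.0, 4.0])        # candidate jump points of baseline F
n = 40
z = rng.integers(0, 2, n).astype(float)     # binary covariate
beta_true = 1.0
# Weibull(shape 2) event times; scaling time by exp(-z*beta/2) yields exact
# proportional hazards with coefficient beta_true for shape 2.
x = 2.0 * np.exp(-z * beta_true / 2.0) * rng.weibull(2.0, n)
j = np.searchsorted(tau, x)                 # event falls in (tau[j-1], tau[j]]
l_idx, r_idx = j, j + 1                     # indices into the padded survival vector

def loglik(F, beta):
    """Finkelstein-type log-likelihood; padding S with 1 (left) and 0 (right)
    handles left- and right-censored observations uniformly."""
    S = np.concatenate(([1.0], 1.0 - F, [0.0]))
    w = np.exp(z * beta)
    p = S[l_idx] ** w - S[r_idx] ** w
    return -np.inf if np.any(p <= 0) else np.sum(np.log(p))

def fd_grad_diag(f, v, eps=1e-5):
    """Central-difference gradient and floored negative diagonal Hessian
    (a crude stand-in for the analytic G_1, G_2 matrices)."""
    g, d = np.zeros_like(v), np.zeros_like(v)
    f0 = f(v)
    for i in range(v.size):
        e = np.zeros_like(v); e[i] = eps
        fp, fm = f(v + e), f(v - e)
        g[i] = (fp - fm) / (2 * eps)
        d[i] = max((2 * f0 - fp - fm) / eps ** 2, 1e-3)   # floor, LM-style
    return g, d

def pava(y, w):
    """Weighted isotonic regression by pool-adjacent-violators."""
    blocks = []
    for yi, wi in zip(y, w):
        blocks.append([yi, wi, 1])
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, n2 = blocks.pop()
            m1, w1, n1 = blocks.pop()
            blocks.append([(w1 * m1 + w2 * m2) / (w1 + w2), w1 + w2, n1 + n2])
    return np.concatenate([np.full(c, m) for m, _, c in blocks])

F, beta = np.array([0.2, 0.4, 0.6, 0.8]), 0.0
L_start = loglik(F, beta)
for _ in range(25):
    gF, dF = fd_grad_diag(lambda v: loglik(v, beta), F)
    gB, dB = fd_grad_diag(lambda v: loglik(F, v[0]), np.array([beta]))
    step = 1.0
    while step > 1e-8:                       # damped (DICM-style) step halving
        Fn = np.clip(pava(F + step * gF / dF, dF), 1e-6, 1 - 1e-6)
        bn = beta + step * gB[0] / dB[0]
        if loglik(Fn, bn) > loglik(F, beta):
            F, beta = Fn, bn
            break
        step /= 2

print("log-likelihood gain:", loglik(F, beta) - L_start, " beta_hat:", beta)
```

Note that only the $F$-part is projected (via the PAVA plus clipping), while $\beta$ takes an unconstrained damped Newton-like step, mirroring the two displayed updates above.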
Notice that inverting the diagonal matrices $G_1$ and $G_2$ is trivial. But to ensure numerical stability of inverting $G_1$ and $G_2$ (and positive-definiteness of $G_1$ and $G_2$ for truncated data, as mentioned at the end of the paper), the Levenberg-Marquardt adjustment is adopted (Thisted, 1988). Specifically, $G_1^{-1}$ and $G_2^{-1}$ are respectively replaced by $(G_1 + \epsilon I)^{-1}$ and $(G_2 + \epsilon I)^{-1}$, with a suitably chosen small $\epsilon > 0$, when they are near singular (or non-positive-definite for truncated data). We note that our extended ICM is different from that outlined by Huang and Wellner (1997), which is basically a back-fitting algorithm (Hastie and Tibshirani, 1990): $L(F, \beta^{(m)})$ is first maximized with respect to $F$ at the fixed $\beta^{(m)}$ (by an ICM-like algorithm), which leads to $F^{(m+1)}$; then $L(F^{(m+1)}, \beta)$ is maximized with respect to $\beta$ at the fixed $F^{(m+1)}$ to get another estimate $\beta^{(m+1)}$; the above two steps are repeated until convergence. In contrast, we propose to maximize the likelihood jointly over $F$ and $\beta$, which seems more common and likely more efficient. Huang and Wellner (1997) did not give any specific numerical results; some numerical comparison is presented in Section 3.3. Intuitively, the (extended) ICM will converge, since the likelihood function is upper-bounded and each step of the (extended) ICM increases the likelihood. When we do not use the second order information (i.e., $G$ is an identity matrix), the GGP reduces to the GP, which only has a linear convergence rate. On the other hand, the GGP can have a superlinear convergence rate if $G$ is suitably chosen (part of which is diagonal and the remaining part equals the corresponding Hessian) (Propositions 3 and 4, Bertsekas, 1982).
Since the second order information $G$ used in our (extended) ICM is between these two choices, we can expect the convergence rate of the (extended) ICM to be between linear and superlinear; furthermore, if the diagonal elements of the Hessian dominate the non-diagonals, the convergence rate of the (extended) ICM will at least be close to superlinear. Rigorous derivations are not pursued here, though our limited experience seems to be positive.

2.4 Extending ICM to Support Lasso

Tibshirani (1996) proposes an estimation procedure called the Lasso in linear regression, and then extends it to the Cox model for right-censored data (Tibshirani, 1997). The Lasso works by constraining the sum of the absolute values of the regression coefficients to be less than a constant. This immediately reminds one of ridge regression, where the constraint is instead on the sum of the squared regression coefficients. As in ridge regression, the Lasso is a biased estimation method which nonetheless may lead to smaller mean squared errors by reducing the variability of the estimates. However, to our knowledge, there is still no application of the Lasso method to the Cox model with interval censored data in the literature. Unlike in linear regression or in the Cox model with right censored data, here we have a high-dimensional (and ordered) nuisance parameter, the baseline distribution. (The baseline distribution is eliminated from the partial likelihood in the Cox model for right censored data, but we do not have an explicit partial likelihood for interval censored data.) The ICM is readily adapted to support the Lasso by using the ideas of the GGP. Now we need to constrain $\beta$ in $\mathcal{C} = \{\beta : \sum_{i=1}^{r} |\beta_i| \le t\}$ (where $t$ is a specified constant). This only requires modifying our previous algorithm by
$$\beta^{(m+1)} = \mathrm{Proj}[\, \beta^{(m)} + \lambda_j G_2(F^{(m)}, \beta^{(m)})^{-1} \nabla_2 L(F^{(m)}, \beta^{(m)});\ G_2(F^{(m)}, \beta^{(m)}),\ \mathcal{C} \,].$$
The projection of a point into $\mathcal{C}$ is a standard quadratic programming problem; in particular, Tibshirani (1996) gives two efficient algorithms. We term the $\beta^{(m)}$ at convergence the Lasso estimate. The constant $t$ in the constraint can be chosen by generalized cross-validation for right censored data (Tibshirani, 1997); here we use a 10-fold likelihood cross-validation (Silverman, 1986).

3. SIMULATION

In our simulations, the baseline distribution $F$ is Weibull with shape parameter 2 and scale parameter 1. Only binary covariates are used, each component of which takes 0 or 1 with equal probability. The regression coefficient is also a vector of 0's and 1's. The first examination time $T$ is Uniform$(0, \theta)$. To mimic many panel studies, we take the length of the time interval between two follow-up examinations to be constant, len = .5. Generally, if we have $k + 1$ examinations, survival time $X_i$ is censored into one of $(0, T_i], (T_i, T_i + \text{len}], \ldots, (T_i + k \cdot \text{len}, \infty)$. $k$ and $\theta$ jointly determine the censoring pattern. In our simulation studies there are four configurations:
1. $k = 1$, $\theta = 2$, $\beta = (1, 0, 0)$
2. $k = 1$, $\theta = 2$, $\beta = (.333, .333, .333)$
3. $k = 2$, $\theta = 1$, $\beta = (1, 0, 0)$
4. $k = 2$, $\theta = 1$, $\beta = (.333, .333, .333)$
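Returning to the Lasso step of Section 2.4: the Euclidean (unweighted) projection onto $\{\beta : \sum_i |\beta_i| \le t\}$ has a well-known sorting-based solution, sketched below in Python. This is a simplified illustration of our own; the projection in the algorithm itself uses the weighted norm induced by $G_2$ and is solved by quadratic programming.

```python
import numpy as np

def project_l1(v, t):
    """Euclidean projection of v onto the L1 ball {x : sum(|x|) <= t},
    computed by sorting and soft-thresholding at a data-driven level."""
    if np.abs(v).sum() <= t:
        return v.copy()
    u = np.sort(np.abs(v))[::-1]            # sorted magnitudes, descending
    css = np.cumsum(u)
    # Largest index rho with u_rho * rho > (cumsum_rho - t).
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > css - t)[0][-1]
    theta = (css[rho] - t) / (rho + 1)      # threshold level
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

b = project_l1(np.array([3.0, -1.0, 0.5]), 2.0)
print(b, np.abs(b).sum())  # lands on the boundary: [2. 0. 0.], L1 norm 2.0
```

Note how the projection zeroes out the small coefficients, which is the source of the Lasso's variable-selection behavior.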
It is evident that in the first two cases the average length of the censoring intervals tends to be longer than in the other two, which implies more information loss from censoring. The first and third cases represent one large covariate effect, while the other two represent many smaller covariate effects; they are used to compare the NPMLE with the Lasso estimate. The sample size is always 100. All the programs are implemented in C on a SunSparc. The starting value for the baseline distribution is the discrete uniform on the possible support of the NPMLE, whereas it is zero for the regression coefficient. An algorithm is stopped when both the log-likelihood increment and the change of the regression coefficient between two consecutive iterations are less than a prescribed small tolerance.

3.1 NPMLE and Bootstrap

For cases 1 and 3, suppose we know the true model; in other words, we know there is only one non-zero coefficient, $\beta_1$. The ICM is so fast that it often converges within several seconds. The speed of the ICM facilitates the use of computing-intensive methods such as the bootstrap and cross-validation. Cross-validation is used in choosing the Lasso constraint; here we apply the bootstrap to estimate the variability of the NPMLE $\hat\beta$. For related work in the context of right-censored data see Burr (1994). As reviewed in Huang and Wellner (1997), in practice there are two approaches to estimating the variance of $\hat\beta$: one uses the observed information matrix; the other uses the profile likelihood for low-dimensional $\beta$. As we pointed out earlier, for a continuous time model without grouping, the observed information matrix is often of too high a dimension and sometimes is near singular. There is also no theoretical justification for their use. Bootstrapping seems a natural alternative, though we are not aware of any empirical studies yet, which probably results from the lack of faster algorithms.
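The resampling mechanics can be sketched generically. In the Python sketch below, the estimator (a sample mean on synthetic normal data) is only a stand-in for the NPMLE $\hat\beta$, used to show the structure of the bootstrap standard-error computation.

```python
import numpy as np

rng = np.random.default_rng(1)

def bootstrap_se(data, estimator, B=200):
    """Nonparametric bootstrap standard error: draw n observations with
    replacement from the original sample, re-estimate, and take the
    standard deviation across the B replicates."""
    n = len(data)
    stats = [estimator(data[rng.integers(0, n, n)]) for _ in range(B)]
    return np.std(stats, ddof=1)

x = rng.normal(0.0, 1.0, 100)
se = bootstrap_se(x, np.mean, B=500)
print(se)  # close to the textbook value 1/sqrt(100) = 0.1
```

For the NPMLE, `estimator` would refit the extended ICM on each bootstrap sample, which is feasible precisely because the algorithm is fast.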
Our bootstrapping scheme is as discussed by Efron and Tibshirani (1986): the same number of observations are randomly drawn with replacement from the original sample to form a bootstrap sample. We used 1,000 bootstrap samples. The results are reported in Table 1. Noticeably, the NPMLE is biased upwards and has a large variability in case 1. The reason is too much information loss from censoring, particularly the overwhelming 60% left-censoring (and approximately 10% right-censoring). For a left-censored observation $(V_i, Z_i)$, its contribution to the
log-likelihood is
$$L_i = \log\left\{ 1 - (1 - F(V_i))^{\exp(Z_i' \beta)} \right\},$$
which is maximized as $\beta$ tends to infinity for a positive $Z_i$. Nevertheless, the bootstrap estimate of the variance of the NPMLE is still sensible. In case 3, the left-censoring percentage is reduced to 30% (and there is about 10% right-censoring); the bootstrap estimate improves along with the NPMLE. Overall, we can see that the bootstrap works reasonably well.

Table 1: NPMLE of the regression coefficient, $\hat\beta$, and the bootstrap estimate of its standard error, $\widehat{se}$. Their means and Monte Carlo standard errors were calculated from 500 replications; $\beta_1 = 1$. [Columns: Case, mean($\hat\beta$), se($\hat\beta$), mean($\widehat{se}$), se($\widehat{se}$).]

3.2 NPMLE and Lasso Estimate

In this subsection we compare the performance of the NPMLE and the Lasso estimate of the regression coefficient. Unlike in the last subsection, all three covariates are now used; in other words, we do not know in advance which $\beta_i$ is zero. Our results are presented in Table 2, and are quite similar to those of Tibshirani (1996) for ordinary least squares regression. The Lasso estimates tend to have larger biases (shrunk down towards zero) but much smaller variances, and hence smaller mean squared errors (MSE), than the NPMLE. However, as the performance of the NPMLE improves in cases 3 and 4, the Lasso estimate holds only a slight edge.

3.3 A Comparison with Other Algorithms

As pointed out earlier, Finkelstein's Newton-Raphson algorithm is not practical when the sample size is large, whereas the profile likelihood approach (Huang, 1996; Huang and Wellner, 1995) is only applicable with low-dimensional regression coefficients. Hence the only remaining competitor is the algorithm outlined in Huang and Wellner (1997). Essentially their method is to alternate
maximizing the likelihood with respect to the baseline distribution and with respect to the regression coefficients. In contrast, we propose to maximize the likelihood with respect to the baseline distribution and the regression coefficients simultaneously.

Table 2: NPMLEs and Lasso estimates of regression coefficients (with their standard errors in parentheses) from 500 replications. $\beta = (1, 0, 0)$ for Cases 1 and 3; $\beta = (.333, .333, .333)$ for the other two. [Columns: Case; NPMLE MSE, $\hat\beta_1$, $\hat\beta_2$, $\hat\beta_3$; Lasso MSE, $\tilde\beta_1$, $\tilde\beta_2$, $\tilde\beta_3$.]

The advantage of our algorithm can be explained intuitively as follows: suppose we want to maximize a bivariate function, say $f(x, y)$, with respect to $x$ and $y$. If possible, we would like to maximize $f(x, y)$ with respect to $x$ and $y$ jointly, rather than maximize $f(x, y)$ with respect to $x$, then with respect to $y$, and repeat this process. Due to this connection between the two algorithms, it may appear that our proposal is a trivial extension of Huang and Wellner's; however, this connection is established largely by our formulation of the ICM as a GGP. We compared these two algorithms in the configuration of Case 3. Both algorithms used the same starting values and convergence criteria as before. The results are shown in Table 3. It is apparent that our algorithm is over two times faster than Huang and Wellner's. Moreover, the log-likelihoods at the convergence of the two algorithms are very close to each other; in fact, all the log-likelihood values at convergence of our algorithm are slightly larger than their counterparts from Huang and Wellner's. This guarantees that our algorithm was not favorably stopped
earlier.

Table 3: Speed of our new algorithm and Huang and Wellner's in computing the NPMLE in Case 3, with 500 replications. The CPU time is in seconds; the log-likelihood is that at convergence; standard errors are in parentheses. [Columns: Algorithm, CPU time, log-likelihood, $\hat\beta$; CPU times: Huang-Wellner 1.36 (.24), New .44 (.8).]

To investigate our algorithm's (in)dependence on starting values, we used random starting values for the first ten datasets in Case 3, each one replicated 10 times. The initial baseline hazard increment in each of its support intervals was taken as a uniform random number between .01 and .1, while the starting regression coefficients were drawn from a uniform distribution between -2 and 2. For each dataset, the algorithm always converged to the neighborhood of the same point: the estimated regression coefficients agreed up to the third decimal digit, while the log-likelihood value was the same. However, more work is needed to clarify whether there are any local maxima to which the algorithm may converge, as one referee pointed out.

4. REAL EXAMPLES

Now we apply our extended ICM to reanalyze two real datasets via the NPMLE. They have been discussed in, among others, Finkelstein (1986), Huang (1994) and Huang and Wellner (1995). Finkelstein focuses on hypothesis testing, while Huang (1994) and Huang and Wellner (1995) establish the asymptotic normality and asymptotic efficiency of the NPMLE of the regression coefficient. The first dataset is from a tumorigenicity study, where 144 mice were assigned to either a germfree or a conventional environment. The survival time here is the tumor onset time, which cannot be observed directly: the presence of the tumor can only be detected at the time of the sacrifice or death of the animal. Hence the survival time is either left censored or right censored (i.e., "case 1" interval censoring).
If we take the germfree group as the baseline with covariate $Z = 0$ and the other group as $Z = 1$, then the NPMLE $\hat\beta$ of the regression coefficient from the Cox model
is $\hat\beta = -.68$, with estimated standard error .41 by the bootstrap (with 1,000 bootstrap replications). These differ from Huang's results: $-.55$ and .29. Huang uses the profile likelihood to compute the NPMLE and approximates its variance by asymptotics, which involves some density estimation. Our 95% bootstrap percentile confidence interval for $\beta$ is $[-1.26, .34]$, compared with Huang's asymptotic normal approximation $[-1.12, .1]$. The second example considered is the Breast Cosmesis Study dataset (Finkelstein, 1986; Finkelstein and Wolfe, 1985). Two medical treatments were given to 94 early breast cancer patients after tumorectomy: one a mix of primary radiotherapy and adjuvant chemotherapy, the other radiotherapy only. The interest of this clinical trial is in which treatment has better long-term cosmetic effects. Interval censoring results because the patients could only be visited every 4 to 6 months, and those living farther from the clinic had even longer follow-up intervals. If we take the radiotherapy group as the baseline group, the NPMLE of the regression coefficient from the Cox model is $\hat\beta = .735$, with estimated standard error .35 by the bootstrap. These results are not much different from the corresponding .795 and .29 from Huang and Wellner (1995), but the bootstrap inference seems more conservative than theirs: the 95% bootstrap percentile and asymptotic confidence intervals are respectively [.11, 1.49] and [.23, 1.36]. Further studies are needed to clarify these discrepancies.

5. SUMMARY

In computing the NPMLE of the distribution function for interval censored data, the dimension of the parameters is on the order of the sample size, often several hundred or more. Hence the space and time overheads incurred by using the Hessian (or its approximation by a full matrix) are prohibitive. This motivated us to investigate algorithms that avoid using these large matrices. Clearly we are trading time for space.
In particular, we do not claim that our extended ICM is faster than Finkelstein's Newton-Raphson method. The ICM is fast yet simple, avoiding high-dimensional matrices such as the Hessian, for interval censored data with no covariates. Our formulation of the ICM as a GGP helps not only to establish its connection with the Newton-Raphson algorithm, but also to ease its extension to the Cox model and to support the Lasso. Its speed also facilitates combining it with computing-intensive methods such as the bootstrap and cross-validation. In particular, our limited
simulation studies seem to favor the Lasso estimates of the regression coefficients in terms of mean squared error, as in the linear regression setting (Tibshirani, 1996). Finally, we note that our extended ICM (with the Levenberg-Marquardt adjustment) is readily applicable to truncated and interval censored data (Pan and Chappell, 1998).

Acknowledgements

The author would like to thank Olvi Mangasarian for introducing the GGP; thanks also go to Rick Chappell for stimulating discussions and support. Many insightful comments from the Associate Editor and three anonymous referees dramatically improved the presentation.
References

Alioum, A. and Commenges, D. (1996), "A Proportional Hazards Model for Arbitrarily Censored and Truncated Data," Biometrics, 52.
Aragon, J. and Eberly, D. (1992), "On Convergence of Convex Minorant Algorithms for Distribution Estimation with Interval-Censored Data," Journal of Computational and Graphical Statistics, 1.
Bertsekas, D.P. (1976), "On the Goldstein-Levitin-Polyak Gradient Projection Method," IEEE Transactions on Automatic Control, 21.
Bertsekas, D.P. (1982), "Projected Newton Methods for Optimization Problems with Simple Constraints," SIAM Journal on Control and Optimization, 20.
Burr, D. (1994), "A Comparison of Certain Bootstrap Confidence Intervals in the Cox Model," Journal of the American Statistical Association, 89.
Dempster, A.P., Laird, N.M. and Rubin, D.B. (1977), "Maximum Likelihood from Incomplete Data via the EM Algorithm," Journal of the Royal Statistical Society, Series B, 39.
Efron, B. and Tibshirani, R.J. (1986), "Bootstrap Methods for Standard Errors, Confidence Intervals, and Other Measures of Statistical Accuracy" (with discussion), Statistical Science, 1.
Efron, B. and Tibshirani, R.J. (1993), An Introduction to the Bootstrap, New York: Chapman & Hall.
Finkelstein, D.M. (1986), "A Proportional Hazards Model for Interval-Censored Failure Time Data," Biometrics, 42.
Finkelstein, D.M. and Wolfe, R.A. (1985), "A Semiparametric Model for Regression Analysis of Interval-Censored Failure Time Data," Biometrics, 41.
Groeneboom, P. and Wellner, J.A. (1992), Information Bounds and Nonparametric Maximum Likelihood Estimation, Basel: Birkhäuser Verlag.
Hastie, T. and Tibshirani, R. (1990), Generalized Additive Models, London: Chapman & Hall.
Huang, J. (1996), "Efficient Estimation for the Cox Model with Interval Censoring," Annals of Statistics, 24.
Huang, J. and Wellner, J.A. (1995), "Efficient Estimation for the Proportional Hazards Model with 'Case 2' Interval Censoring," Tech. Rept., Dept. of Statistics, Univ. of Washington, Seattle, WA.
Huang, J. and Wellner, J.A. (1997), "Interval Censored Survival Data: A Review of Recent Progress," Proceedings of the First Seattle Symposium in Biostatistics: Survival Analysis, New York: Springer.
Mangasarian, O.L. (1996), "Algorithms of Nonlinear Mathematical Programming," CS73 Course Notes, Computer Sciences Dept., Univ. of Wisconsin, Madison, WI.
Pan, W. and Chappell, R. (1998), "Computation of the NPMLE of Distribution Functions for Interval Censored and Truncated Data with Applications to the Cox Model," Computational Statistics and Data Analysis, 28.
Polak, E. (1971), Computational Methods in Optimization, New York: Academic Press.
Robertson, T., Wright, F.T. and Dykstra, R.L. (1988), Order Restricted Statistical Inference, New York: Wiley.
Silverman, B.W. (1986), Density Estimation for Statistics and Data Analysis, New York: Chapman & Hall.
Thisted, R.A. (1988), Elements of Statistical Computing, New York: Chapman & Hall.
Tibshirani, R. (1996), "Regression Shrinkage and Selection via the Lasso," Journal of the Royal Statistical Society, Series B, 58.
Tibshirani, R. (1997), "A Proposal for Variable Selection in the Cox Model," Statistics in Medicine, 16.
Turnbull, B.W. (1976), "The Empirical Distribution Function with Arbitrarily Grouped, Censored and Truncated Data," Journal of the Royal Statistical Society, Series B, 38.
Zhan, Y. and Wellner, J.A. (1995), "Double Censoring: Characterization and Computation of the Nonparametric Maximum Likelihood Estimator," Tech. Rept., Dept. of Statistics, Univ. of Washington, Seattle, WA.