
Biostatistics (2004), 5, 2. Printed in Great Britain

Maximum likelihood estimation for Cox's regression model under nested case-control sampling

THOMAS H. SCHEIKE
Department of Biostatistics, University of Copenhagen, Blegdamsvej 3, DK-2200 KBH N, Denmark
ts@kubism.ku.dk

ANDERS JUUL
Department of Growth and Reproduction, University Hospital of Copenhagen, Blegdamsvej 9, Denmark and Centre of Preventive Medicine, Glostrup County Hospital, Denmark

SUMMARY

Nested case-control sampling is designed to reduce the costs of large cohort studies. It is important to estimate the parameters of interest as efficiently as possible. We present a new maximum likelihood estimator (MLE) for nested case-control sampling in the context of Cox's proportional hazards model. The MLE is computed by the EM-algorithm, which is easy to implement in the proportional hazards setting. Standard errors are estimated by a numerical profile likelihood approach based on EM-aided differentiation. The work was motivated by a nested case-control study that hypothesized that insulin-like growth factor I was associated with ischemic heart disease. The study was based on a population of 3784 Danes and 231 cases of ischemic heart disease where controls were matched on age and gender. We illustrate the use of the MLE for these data and show how the maximum likelihood framework can be used to obtain information additional to the relative risk estimates of covariates.

Keywords: Cox model; Efficiency; Nested case-control; Proportional hazards model; Survival data.

1. INTRODUCTION

Large cohort studies are designed to learn about covariate effects for relatively rare diseases. Often the covariates of interest are expensive to obtain and the study will therefore be very expensive to carry out. Thomas (1977) suggested an alternative design called the nested case-control (NCC) design where each case is compared to a random sample from the risk set.
This design will typically reduce the amount of data dramatically. The standard analysis for nested case-control sampling, see Thomas (1977), Oakes (1981), Goldstein and Langholz (1992) or Borgan et al. (1995), can be implemented by standard software. Samuelsen (1997) suggested several procedures for more efficient analysis.

To find out whether low serum insulin-like growth factor I (IGF-I) was associated with increased risk of ischemic heart disease (IHD), we carried out matched nested case-control sampling within a large cohort study (Juul et al., 2002). We had two major reasons for doing the study as an NCC. First, the measurement of IGF-I was done by a relatively costly analysis based on blood samples that were retrieved from a freezer where they had been stored at the initiation of the study. Secondly, since the blood sample had to be discarded at the analysis for IGF-I, we also wanted to limit the use of valuable biological material that might be useful for other scientific studies.

To whom correspondence should be addressed. Biostatistics Vol. 5 No. 2 © Oxford University Press 2004; all rights reserved.

The proportional hazards model with time-constant covariates is often used to analyze survival data. We here suggest a maximum likelihood estimator (MLE) that appears to have better finite sample properties than the standard partial likelihood estimation procedure. For the parameter levels found in the case study we observed that the standard analysis and the MLE had the same efficiency, but when covariate effects were large the MLE gave increased efficiency compared to the standard analysis. Maximization of the likelihood function is easily carried out by use of the EM-algorithm. This is particularly simple in the present context since standard software can be applied. Standard errors are obtained by a numerical profile likelihood approach based on EM-aided differentiation as in Chen and Little (1999); see also Murphy et al. (1997).

The EM-algorithm (Dempster et al., 1977) has been utilized in a number of non-parametric maximum likelihood settings, see e.g. Turnbull (1976), Nielsen et al. (1992), Murphy et al. (1997), Wellner and Zhan (1997), Martinussen (1999) and Chen and Little (1999). The papers by Martinussen (1999) and Chen and Little (1999) on missing covariates in a proportional hazards setting are closely related to the present approach, but our approach differs from their methods by a new non-parametric way of handling the distribution of the covariates. Martinussen (1999) and Chen and Little (1999) are limited to parametric assumptions about the covariate distribution for the partly observed covariates.
We here suggest a simple and elegant approach for dealing with the distribution of the covariates completely non-parametrically by combining the approach of Wellner and Zhan (1997) with the EM-algorithm for survival data. Chen (2002) extends Chen and Little (1999) and makes a similar suggestion for the missing data problem. The missing data problem deals with covariates that are missing at random (MAR), and nested case-control sampling can be viewed as having covariates missing at random, since the missingness of the covariates depends solely on the survival times and censoring indicators of the data. Even though the probability that a subject has missing covariates does not depend on the unobserved covariates, nested case-control sampling differs from the set-up described in the missing data papers because the missingness is not independent across subjects. Despite these differences, however, nested case-control sampling leads to a likelihood and EM-algorithm that are equivalent to those of Chen (2002).

We also present some new models where the stratified baseline hazard is modelled by a proportional hazards model, and show how the algorithm can be modified to estimate cumulative baseline hazards stratified according to the covariate that is only partly observed. Both these models are new for nested case-control sampling, but are easy to estimate in a maximum likelihood setting.

The paper is organized as follows. Section 2 presents nested case-control sampling and introduces notation. Section 3 introduces the new non-parametric MLE and outlines the estimation procedure. In Section 4 we show some extensions and discuss various simplifying assumptions. Section 5 gives a simulation study aimed at validating the results from the application in Section 6. Section 7 contains some closing remarks.

2. NESTED CASE-CONTROL SAMPLING

Let $U$ and $C$ denote the survival and censoring times of an individual.
Let $T = \min(U, C)$ denote the observed time at risk with $\delta = I(U \le C)$ the indicator of failure, and let $(X, Z)$ denote a $(q + p)$-dimensional covariate vector. We make the assumption that $U$ and $C$ are independent given $Z$, $X$. An additional important assumption is that the censoring is non-informative on $Z$, i.e. that the censoring distribution may only depend on $X$. To obtain an MLE we also assume that the censoring distribution is non-informative on the parameters of interest. It is crucial for the further analysis that the covariate vector does not vary with time. Define the associated counting process $N(t) = I(t \ge T)$ and the at risk indicator $Y(t) = I(T > t)$.

Consider a cohort of $M$ independent and identically distributed subjects that we denote as $(T_i, \delta_i, Z_i, X_i)$, $i = 1, \ldots, M$, or $(N_i(t), Y_i(t), X_i, Z_i)$, $i = 1, \ldots, M$. We do not observe the full cohort but only observe data from a nested case-control sampling design. We thus observe $X_i$ at the initiation of the study for all subjects (covariates such as gender and age) but we only observe $Z_i$ for those of the subjects that become cases or are selected as controls. The observations are $(N_i(t), Y_i(t), X_i)$, $i = 1, \ldots, M$, as well as $Z_i$ for those subjects that become cases or controls.

Define the size of the cohort as $M$, let $R = \{1, \ldots, M\}$ be the index set of all subjects in the cohort, and let $R_1$ and $R_0$ denote, respectively, the index sets of all cases and controls in $R$, defined as $R_j = \{i \in R : \delta_i = j\}$ for $j = 0, 1$. When the risk sets are functions of time they are restricted to subjects under risk at that time, such that for example $R(t) = \{i \in R : Y_i(t) = 1\}$ is the set of subjects at risk just before time $t$ and $R_1(t) = \{i \in R(t) : i \in R_1\}$. Let $\tilde{R}_0$ denote the index set of the selected controls that are not in $R_1$.

We assume that the intensity of $N(t)$ is given by a stratified version of Cox's regression model as

$$\lambda(t) = Y(t)\,\lambda(X, t)\exp\{Z^T \beta_0\}, \tag{2.1}$$

where $\beta_0$ is a $p$-dimensional unknown regression parameter, and that $X$ takes $S$ distinct values, thus resulting in $S$ strata. We denote these values as $x_j$, $j = 1, \ldots, S$, and let $\lambda_j(t) = \lambda(x_j, t)$. The stratified baseline hazard, $\lambda(x, t)$, is nonparametric, which allows the model a great deal of flexibility. We need additional notation to describe how the controls are selected in the NCC.
Given that subject $i$ becomes a case at time $t$, we select a set of controls, $R_{ncc}(t)$, from the risk set at time $t$. The controls at time $t$ are selected as a random sample of subjects under risk ($R(t)$) and we allow the selection to depend only on $X$. When likelihood-based inference is the aim it is crucial that the probability of sampling a given risk set depends solely on the observed data. This is the case for nested case-control sampling. We here restrict attention to simple random sampling of those under risk, possibly matched on $X$. This implies that controls may be sampled several times. The sampling of controls must not depend on knowledge of who becomes cases. Therefore subjects who become cases after time $t$ should be available for sampling as controls at $t$.

Based on full information about the entire cohort, the partial likelihood estimator, $\hat\beta$, for Cox's regression model would be the solution to the score equation

$$U_0(\beta) = \sum_{i=1}^{M} \int \left( Z_i - \frac{S_1(\beta, R(t))}{S_0(\beta, R(t))} \right) dN_i(t) = 0,$$

where $S_k(\beta, R(t)) = \sum_{i \in R(t)} Y_i(t) \exp(Z_i^T \beta) Z_i^k$ for $k = 0, 1$ and where for a vector $a$, $a^0 = 1$ and $a^1 = a$. The standard analysis of the nested case-control design, see Thomas (1977), Oakes (1981), Goldstein and Langholz (1992) or Borgan et al. (1995), is based on the score equation for the regression parameter $\beta_0$

$$U_{ncc}(\beta) = \sum_{i=1}^{M} \int \left( Z_i - \frac{S_1(\beta, R_{ncc}(t) \cup \{i\})}{S_0(\beta, R_{ncc}(t) \cup \{i\})} \right) dN_i(t) = 0,$$

where the risk set $R_{ncc}(t)$ is a random sample of the risk set at time $t$ as above. Borgan et al. (1995) present an asymptotic analysis in the martingale framework.

We consider two different sampling methods for the controls: (1) the controls are a simple random sample from the risk set; and (2) the controls are randomly sampled from the risk set of controls that match the cases on some aspects of $X$. The matched sampling is the more general situation and is dealt with in the next section.

3. MLE FOR NESTED CASE-CONTROL SAMPLING

In this section we propose a maximum likelihood estimator for $\beta_0$ in the semi-parametric setting of Cox's regression model. A traditional survival analysis will condition on the observed covariates, and then no modelling of their distribution is needed. Similarly, the standard nested case-control analysis does not require modelling of the distribution of $Z$. However, when a likelihood approach is pursued with $Z$ only observed for some subjects we need to model the distribution of $Z$ given $X$ to represent precisely the information contributed by the subjects for whom $Z$ is unobserved. Because $X$ is fully observed it makes no difference whether it is conditioned on or considered random, and we therefore do not include it in the likelihood. Note, however, that covariates must not depend on time to accommodate a full likelihood approach.

Let $f(z \mid x)$ and $f(z \mid U = t, x)$ denote the conditional distribution of $Z$ given $X = x$ and the conditional distribution of $Z$ given $U = t$ and $X = x$, respectively. We let $f(z)$ denote the marginal distribution of $Z$. Writing $\Lambda(x, t) = \int_0^t \lambda(x, s)\,ds$ for the cumulative baseline hazard, note that

$$f(z \mid U = t, x) = \lambda(x, t) \exp(z^T \beta_0) \exp(-\Lambda(x, t) \exp(z^T \beta_0))\, f(z \mid x) / g(t \mid x),$$

where

$$g(t \mid x) = \int \lambda(x, t) \exp(z^T \beta_0) \exp(-\Lambda(x, t) \exp(z^T \beta_0))\, f(z \mid x)\, dz$$

is the marginal distribution of $U$ given $X = x$, and that the conditional distribution of $Z$ given $U \ge t$ and $X = x$ is equal to

$$f(z \mid U \ge t, x) = \exp(-\Lambda(x, t) \exp(z^T \beta_0))\, f(z \mid x) / G(t \mid x), \tag{3.1}$$

where

$$G(t \mid x) = P(U \ge t \mid X = x) = \int \exp(-\Lambda(x, t) \exp(z^T \beta_0))\, f(z \mid x)\, dz.$$

The likelihood for the data of the $M$ iid subjects consists of terms from controls, cases and other subjects.
The likelihood contribution for a subject that becomes a case at the time where we record the covariate $Z$ is

$$g(T \mid X)\, f(Z \mid U = T, X) = \lambda(X, T) \exp(Z^T \beta) \exp(-\Lambda(X, T) \exp(Z^T \beta))\, f(Z \mid X);$$

the likelihood contribution for a subject who is selected as a control at a time, $T_z$, and is then followed until $T$ is

$$G(T_z \mid X)\, f(Z \mid U \ge T_z, X) \exp(-(\Lambda(X, T) - \Lambda(X, T_z)) \exp(Z^T \beta)) (\lambda(X, T) \exp(Z^T \beta))^{\delta} = (\lambda(X, T) \exp(Z^T \beta))^{\delta} \exp(-\Lambda(X, T) \exp(Z^T \beta))\, f(Z \mid X);$$

finally, the likelihood contribution for a subject whose $Z$ covariate is never observed is $G(T \mid X)$. The combined likelihood of these contributions is (proportional to)

$$\prod_{i \in R_1 \cup \tilde{R}_0} \{ (\lambda(X_i, T_i) \exp(Z_i^T \beta))^{\delta_i} \exp(-\Lambda(X_i, T_i) \exp(Z_i^T \beta))\, f(Z_i \mid X_i) \} \tag{3.2}$$

$$\times \prod_{i \in R \setminus (R_1 \cup \tilde{R}_0)} G(T_i \mid X_i), \tag{3.3}$$

where $\tilde{R}_0$ is the set of selected controls not in $R_1$ and $\setminus$ denotes set difference. The first part, (3.2), of the likelihood is of the standard Cox form, while the second term, (3.3), is more complicated. We have omitted all the sampling probabilities of the controls since these were assumed to depend only on the observed data.
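The two kinds of contribution in (3.2) and (3.3) can be illustrated numerically. The following is a minimal sketch, not the authors' implementation, under simplifying assumptions that are not part of the paper: a single stratum, a constant baseline hazard `lam` (so the cumulative hazard is `lam * t`), and a scalar Z with a known discrete distribution given by the hypothetical arguments `support` and `probs`. The function name is also hypothetical.

```python
import math


def log_lik_contribution(t, delta, z, beta, lam, support, probs):
    """Log-likelihood contribution of one subject (illustrative sketch).

    Assumptions, not from the paper: one stratum, constant baseline
    hazard `lam`, scalar Z with point masses `probs` on `support`.
    Pass z=None for a subject whose Z was never observed.
    """
    cum_hazard = lam * t  # cumulative baseline hazard at the observed time
    if z is not None:
        # Case or selected control: Cox-type term times f(z | x), as in (3.2).
        risk = math.exp(z * beta)
        f_z = probs[support.index(z)]
        return delta * math.log(lam * risk) - cum_hazard * risk + math.log(f_z)
    # Z never observed: survival function averaged over Z, as in (3.3).
    g = sum(p * math.exp(-cum_hazard * math.exp(w * beta))
            for w, p in zip(support, probs))
    return math.log(g)
```

Summing this function over all subjects would give the observed-data log-likelihood under these assumptions. With `beta = 0` the covariate carries no information, and the unobserved-Z contribution reduces to `-lam * t`.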

The key assumption of this paper is that the MLE of the distribution of $Z$ given $X$ has point masses at the observed covariate values; see Wellner and Zhan (1997) for a discussion of the consequences of this assumption. We let $S(X)$ be the mapping that takes $X$ into the stratum number. Denote the distinct values among the observed $Z$ covariates for which $X = x_s$ as $W_s(1), \ldots, W_s(l_s)$ ($l_s$ is the number of distinct values in stratum $s$) and the corresponding point masses as $p_s = (p_s(1), \ldots, p_s(l_s))$, such that $\sum_{i=1}^{l_s} p_s(i) = 1$ for all $s = 1, \ldots, S$. We use the notation $p_s(Z_i) = \sum_{k=1}^{l_s} p_s(k) I(Z_i = W_s(k))$. The second term of the likelihood, (3.3), can then be written as

$$G(T \mid X) = \sum_{i=1}^{l_{S(X)}} \exp(-\Lambda(X, T) \exp(W_{S(X)}(i)^T \beta))\, p_{S(X)}(i).$$

It would be appealing to maximize the likelihood with respect to the two non-parametric terms $p_j$ and $\Lambda_j(\cdot)$ ($j = 1, \ldots, S$) for given $\beta$ to obtain a profile likelihood for $\beta$ on which inference and asymptotic properties could be established. Unfortunately, this does not seem tractable because of the $G(T \mid X)$ terms. Instead we apply the EM-algorithm since maximization is easy in the full data situation. The full log-likelihood given all covariates is

$$l(\Lambda, \beta, p) = \sum_{i=1}^{M} \{ \delta_i \log(\lambda(X_i, T_i) \exp(Z_i^T \beta)) - \Lambda(X_i, T_i) \exp(Z_i^T \beta) + \log(f(Z_i \mid X_i)) \}$$

$$= \sum_{i=1}^{M} \{ \delta_i \log(\lambda(X_i, T_i) \exp(Z_i^T \beta)) - \Lambda(X_i, T_i) \exp(Z_i^T \beta) \} + \sum_{s=1}^{S} \sum_{k=1}^{l_s} a_s(k) \log(p_s(k)) \tag{3.4}$$

with $a_s(j) = \sum_{i=1}^{M} I(Z_i = W_s(j), X_i = x_s)$. We further define $b_s = \sum_{i=1}^{M} I(X_i = x_s)$, and let $D_i$ denote the data of the $i$th subject. We use the notation $\Lambda = (\Lambda_1(\cdot), \ldots, \Lambda_S(\cdot))$, where $\Lambda_s(\cdot) = \Lambda(x_s, \cdot)$, and $p = (p_1, \ldots, p_S)$.

The E-step of the EM-algorithm consists of computing

$$Q((\Lambda, \beta, p), (\Lambda^k, \beta^k, p^k)) := E^k(l(\Lambda, \beta, p) \mid D_1, \ldots, D_M, \Lambda^k, p^k, \beta^k),$$

i.e. the conditional expectation of $l(\Lambda, \beta, p)$ given the observed data, $D_1, \ldots, D_M$, and the current parameter estimates, $(\Lambda^k, \beta^k, p^k)$. We thus need to compute the expectations $E^k(Z_i \mid D_i)$ and $E^k(\exp(Z_i^T \beta) \mid D_i)$. These expectations are simple to compute due to the assumption about the finite number of values. Using (3.1) leads to

$$P^k(Z_i = W_s(j) \mid U_i \ge T_i, X_i = x_s) = \frac{\exp(-\Lambda^k(x_s, T_i) \exp(W_s(j)^T \beta^k))\, p_s^k(j)}{\sum_{l=1}^{l_s} \exp(-\Lambda^k(x_s, T_i) \exp(W_s(l)^T \beta^k))\, p_s^k(l)}.$$

Define $\alpha_{ij}^k = P^k(Z_i = W_{S(X_i)}(j) \mid D_i)$. For subjects that become cases or are selected as controls $\alpha_{ij}^k = I(Z_i = W_{S(X_i)}(j))$, $j = 1, \ldots, l_{S(X_i)}$, whereas for the other subjects $\alpha_{ij}^k = P^k(Z_i = W_{S(X_i)}(j) \mid U_i \ge T_i, X_i)$. Now,

$$E^k(\exp(Z_i^T \beta) \mid U_i > T_i, X_i = x_s) = \frac{\sum_{j=1}^{l_s} \exp(W_s(j)^T \beta)\, \alpha_{ij}^k}{\sum_{j=1}^{l_s} \alpha_{ij}^k}.$$
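The E-step weights above are a direct application of Bayes' rule over the finite support of Z. The following is a minimal sketch for a subject whose Z is unobserved, assuming a scalar Z within one stratum; the function and argument names are hypothetical and not taken from the paper.

```python
import math


def posterior_weights(cum_hazard, beta, support, probs):
    """Weights alpha_j = P(Z = w_j | U >= T, X) for one censored subject.

    cum_hazard: current estimate of the cumulative baseline hazard at T
    beta:       current regression coefficient (scalar Z assumed here)
    support:    distinct observed Z values in the subject's stratum
    probs:      current point masses on those values
    """
    unnorm = [p * math.exp(-cum_hazard * math.exp(w * beta))
              for w, p in zip(support, probs)]
    total = sum(unnorm)
    return [u / total for u in unnorm]


def expected_relative_risk(weights, beta, support):
    """Conditional expectation of exp(Z * beta) under the given weights."""
    return sum(a * math.exp(w * beta) for a, w in zip(weights, support))
```

When `beta` is zero the weights equal the prior point masses, since survival then carries no information about Z. For a positive `beta`, surviving a long time shifts weight toward the smaller covariate values.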

The conditional expectation $E^k(l(\Lambda, \beta, p) \mid D_1, \ldots, D_M, \Lambda^k, \beta^k, p^k)$ is therefore computed as

$$\sum_{i \in R_1 \cup \tilde{R}_0} \{ \delta_i \log(\lambda(X_i, T_i) \exp(Z_i^T \beta)) - \Lambda(X_i, T_i) \exp(Z_i^T \beta) + \log(p_{S(X_i)}(Z_i)) \} \tag{3.5}$$

$$+ \sum_{i \in R \setminus (R_1 \cup \tilde{R}_0)} \sum_{j=1}^{l_{S(X_i)}} \{ -\Lambda(X_i, T_i) \exp(W_{S(X_i)}(j)^T \beta) \}\, \alpha_{ij}^k + \sum_{i \in R \setminus (R_1 \cup \tilde{R}_0)} \sum_{j=1}^{l_{S(X_i)}} \log(p_{S(X_i)}(j))\, \alpha_{ij}^k. \tag{3.6}$$

The conditional expectation of the full data likelihood given the observed data, $Q((\Lambda, \beta, p), (\Lambda^k, \beta^k, p^k))$, is maximized in $p_1, \ldots, p_S$, subject to $\sum_{j=1}^{l_s} p_s(j) = 1$, by

$$\hat{p}_s(j) = \sum_{i=1}^{M} I(X_i = x_s)\, \alpha_{ij}^k / b_s, \quad s = 1, \ldots, S, \tag{3.7}$$

in $\beta$ by $\hat\beta$ that is the solution to the partial likelihood score

$$\sum_{i=1}^{M} \sum_{s=1}^{S} I(X_i = x_s) \int \left( Z_i - \frac{S_1(\beta, t, s)}{S_0(\beta, t, s)} \right) dN_i(t) = 0, \tag{3.8}$$

and in $\Lambda$ by the Breslow estimator

$$\hat\Lambda(s, t) = \int_0^t \frac{1}{S_0(\hat\beta, u, s)}\, dN_s(u), \tag{3.9}$$

where $N_s(t) = \sum_i I(X_i = x_s) N_i(t)$ and for $h = 0, 1$

$$S_h(\beta, t, s) = \sum_{i \in R_1 \cup \tilde{R}_0} I(X_i = x_s) Y_i(t) \exp(Z_i^T \beta) Z_i^h + \sum_{i \in R \setminus (R_1 \cup \tilde{R}_0)} \sum_{j} I(X_i = x_s) Y_i(t) \exp(W_s(j)^T \beta)\, \alpha_{ij}^k\, W_s(j)^h.$$

The last maximization in $\beta$ and $\Lambda$ is equivalent to a standard Cox regression with offsets. The algorithm is therefore simple to implement by standard software. In the case where all subjects that are not selected as cases or controls are right censored at the same time the expressions simplify.

In recent work Chen (2002) shows that the MLE for the relative risk parameters is consistent and asymptotically Normal when based on independent identically distributed observations with an independent missing data mechanism. For nested case-control sampling, however, the missingness is not independent across subjects. The probability that a subject is sampled as a control depends on both the number of failures and the failure times of all subjects. The likelihood and the EM-algorithm are, however, equivalent and we expect the results to carry through.

Standard errors for the regression parameters may be obtained by bootstrapping techniques or directly from the information matrix as in Louis (1982).
The total number of parameters will, however, be large if the number of events and the number of different covariate values are large. Therefore, as interest primarily centres on the regression parameters, it seems preferable to use techniques that focus on these parameters. We suggest estimating standard errors by EM-aided differentiation of the profile likelihood for $\beta$, as in Chen and Little (1999). With $\hat\Lambda_\beta$ and $\hat{p}_\beta$ denoting the maximizers of the observed data log-likelihood, $l_O(\Lambda, \beta, p)$, for given $\beta$, the derivative of the observed data profile likelihood is

$$\frac{\partial}{\partial \beta} l_O(\hat\Lambda_\beta, \beta, \hat{p}_\beta) = E\left( \frac{\partial}{\partial \beta} l(\hat\Lambda_\beta, \beta, \hat{p}_\beta) \,\Big|\, D_1, \ldots, D_M, \hat\Lambda_\beta, \beta, \hat{p}_\beta \right),$$

where the conditional expectation is computed with the parameters $(\hat\Lambda_\beta, \beta, \hat{p}_\beta)$. Now, with $\beta^{N,j}$ denoting a perturbed version of $\hat\beta$ where the $j$th component is perturbed by $d$,

$$\frac{\partial^2}{\partial \beta\, \partial \beta_j} l_O(\hat\Lambda, \hat\beta, \hat{p}) \approx \frac{1}{d}\, E\left( \frac{\partial}{\partial \beta} l(\hat\Lambda_{\beta^{N,j}}, \beta^{N,j}, \hat{p}_{\beta^{N,j}}) \,\Big|\, D_1, \ldots, D_M, \hat\Lambda_{\beta^{N,j}}, \beta^{N,j}, \hat{p}_{\beta^{N,j}} \right).$$

Chen and Little suggest that $d = 1/N$ results in a reasonable performance. Our simulations also indicate that the standard errors perform quite well for the sample sizes considered in the case study.

4. EXTENSIONS AND ADDITIONAL MODELLING

Various additional assumptions may be made to simplify the parameters of the model. We here consider two possible assumptions about the baseline hazards and the distribution of $Z$ given $X$. We illustrate the use of the simplifying assumptions for the case study. We may for example assume that the stratified Cox model is in fact a proportional hazards model

$$\lambda(x, t) = \lambda_0(t) \exp\{x^T \eta\}. \tag{4.1}$$

This leads to a similar analysis except that the score equation for the relative risk parameters, (3.8), and the Breslow estimator, (3.9), are no longer stratified. If the conditional distribution of $Z$ given $X$, $f(z \mid x)$, is known not to depend on $X$, so that the stratification variable does not contain information about the distribution of $Z$, the maximization (3.7) is not stratified.

It may also be of interest to stratify the baseline hazards according to the partly unobserved covariates. This gives the model

$$\lambda(t) = Y(t)\,\lambda(Z, t) \exp\{X^T \beta\}, \tag{4.2}$$

where we now assume that the partly unobserved covariate $Z$ takes only a finite number of distinct values, and that the fully observed covariate $X$ leads to proportional hazards. This model can be maximized similarly to the full data likelihood in the standard case (3.4). Subjects with unobserved $Z$ are simply distributed to the strata according to the distribution of $Z$ given the observed data.
Model (4.2) may be used for examining the proportionality of the covariates and can give additional insight into the time-varying effects of the covariates.

5. SIMULATIONS FOR NESTED CASE-CONTROL SAMPLING

To learn about the finite sample properties of the MLE in a situation comparable to the case study discussed in the next section we simulated survival times from a cohort of size 4000 that were censored at time 15. We first considered two continuous covariates that were independent standard Normals with log-relative risk −0.25 and 0.25, respectively, and with a constant baseline hazard with levels 0.004, and 0.016, respectively. We also varied the log-relative risk to (0, 0), (−0.5, 0.5) and (−1, 1). Due to the equivariance of the likelihood approach, the minus sign could be omitted without otherwise changing the results; it is included here only in order to mimic the values of the application.

Table 1. Simulations of cohort size 4000 with two covariates and three different baseline hazard levels, and for three levels of the relative risk parameters ((0, 0), (−0.5, 0.5), and (−1, 1)). Empirical mean (Emp. mean), empirical standard deviation (emp. sd.), and mean of estimated standard deviation (mean est. sd.). The empirical standard deviations are given in absolute size as well as relative (in parentheses) to the full cohort Cox standard errors. Simulations based on 1000 replications. N = 4000, two controls.
Columns: λ_0(t); Av. cases; Emp. mean (two covariates); Emp. sd. (two covariates); Mean est. sd. (two covariates).
[Only the following entries survive in the source, listed in the order of the nine simulation blocks; the full-cohort Cox rows and the remaining entries are missing.]
Block 1: Standard NCC (1.24) 0.08 (1.22); MLE (1.23) 0.08 (1.21)
Block 2: Standard NCC (1.21) 0.05 (1.15); MLE (1.24) 0.06 (1.16)
Block 3: Standard NCC (1.24) 0.04 (1.25); MLE (1.26) 0.05 (1.30)
Block 4: Standard NCC (1.48) 0.08 (1.60); MLE (1.44) 0.08 (1.52)
Block 5: Standard NCC (1.42) 0.06 (1.39); MLE (1.35) 0.06 (1.31)
Block 6: Standard NCC (1.29) 0.04 (1.25); MLE (1.28) 0.04 (1.21)
Block 7: Standard NCC (1.66) 0.15 (1.84); MLE (1.25) 0.10 (1.25)
Block 8: Standard NCC (1.67) 0.09 (1.70); MLE (1.32) 0.07 (1.26)
Block 9: Standard NCC (1.89) 0.06 (1.73); MLE (1.12) 0.04 (1.21)

Table 1 shows the performance of a Cox regression analysis for a fully observed cohort (4000), the standard nested case-control analysis (standard NCC) and the MLE, with log-relative risk parameters (0, 0), (−0.5, 0.5) and (−1, 1). We computed the empirical mean, empirical standard error and the mean of the estimated standard error for the different estimators. The empirical variances are also given relative to the full cohort Cox. For the low relative risk effects the standard NCC analysis performed very well and led to estimates that were almost as efficient as the full Cox. The MLE had a similar behaviour but led to slightly higher variances. The findings are consistent for the three different levels of the baseline hazard. Log-relative risk at absolute level 0.25 led to similar conclusions and is not shown. For log-relative risk at 0.5 we see that the estimators have similar performances, but that the MLE is slightly better. For the high risk case we see that the MLE leads to an important improvement in efficiency compared to the standard analysis. All estimators had their variability well estimated by the estimators of the standard errors. Note also that the MLE has a slight bias.

In conclusion, we find that when the NCC has good efficiency then the MLE has a similar performance.
This appears to be the case for log-relative risk effects smaller than 0.5 for the level of the baseline hazard in the simulation. For larger effects the MLE improves considerably on the NCC.
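The data-generating part of such a simulation can be sketched as follows. This is only an illustration under the set-up stated above (two independent standard Normal covariates, a constant baseline hazard, administrative censoring at a fixed time, and a fixed number of controls per case drawn by simple random sampling from the risk set). The function names, the data layout and the seeded generator are assumptions made for the example, not the authors' code, and no estimation step is included.

```python
import math
import random


def simulate_cohort(n, beta, lam, censor_time, rng):
    """Return a list of (time, delta, z) tuples for n subjects.

    Survival times are exponential with rate lam * exp(z' beta) and are
    censored at censor_time; z holds two independent N(0, 1) covariates.
    """
    cohort = []
    for _ in range(n):
        z = (rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0))
        rate = lam * math.exp(beta[0] * z[0] + beta[1] * z[1])
        u = rng.expovariate(rate)
        cohort.append((min(u, censor_time), int(u <= censor_time), z))
    return cohort


def sample_ncc(cohort, m, rng):
    """Map each case index to m controls drawn from its risk set.

    Controls are sampled without replacement among the other subjects
    still under observation at the case's failure time. Future cases are
    eligible, and a subject may serve as a control for several cases.
    """
    sampled = {}
    for i, (t_i, delta_i, _) in enumerate(cohort):
        if delta_i == 1:
            at_risk = [j for j, (t_j, _, _) in enumerate(cohort)
                       if t_j >= t_i and j != i]
            sampled[i] = rng.sample(at_risk, min(m, len(at_risk)))
    return sampled
```

For example, `simulate_cohort(4000, (-0.5, 0.5), 0.016, 15.0, random.Random(1))` followed by `sample_ncc(cohort, 2, rng)` would mimic one replication of the design with two controls per case; only the covariates of the cases and sampled controls would then be treated as observed.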

Table 2. Log-relative risk of standardized IGF-I and IGFBP-3 for IHD data
[The point estimates are missing in the source; only the values in parentheses survive.]
            Standard NCC    MLE
IGF-I       (0.118)         (0.091)
IGFBP-3     (0.114)         (0.115)

6. APPLICATION

To study the effect of IGF-I and its binding protein 3 (IGFBP-3) on IHD, Juul et al. (2002) performed a nested case-control study based on a Danish population. The study was based on a population of 3784 Danes leading to 231 cases of IHD, each of whom was subsequently matched on age and gender to two randomly selected controls. Study participants were recruited with approximate ages 30, 40, 50 and 60 years. A detailed description of the study can be found in Juul et al. (2002). Here, we give some additional analysis of the key covariates in the MLE framework to illustrate the use of the MLE.

First we report the analysis of Juul et al. (2002) focusing on the effects of the covariates IGF-I and IGFBP-3. Table 2 shows the estimators of the standard analysis and the MLE where IGF-I and IGFBP-3 were included as standardized continuous covariates. The estimates and standard errors are almost the same for the two methods. Increased levels of IGF-I lead to significantly lowered risk of IHD when correcting for IGFBP-3. Conversely, increased levels of IGFBP-3 lead to a significantly higher risk of IHD.

The baseline hazards stratified by age and gender were estimated by the MLE and are shown in Figure 1. These reveal that the baseline hazards varied considerably with age group and gender. Generally, females had a risk of IHD between one-half and one-third of that of males for all age groups, and the risk increased with age. To further summarize the effect of age and gender we assumed that the stratified baseline hazards led to proportional effects, thus using the additional modelling of the stratified baseline hazard described in Section 4. The effects of age and gender were then found using the MLE method.
We found that females had a log-relative risk (sd) of −1.08 (0.16), whilst with every year of age the log-relative risk increased by 0.088 (0.0090). The common baseline hazard, which cannot be estimated by the standard NCC analysis, is shown in Figure 2. The thick and thin lines give the cumulative baseline hazards for males and for females, respectively.

We also grouped IGF-I and IGFBP-3 into their quartiles and estimated the relative risk for these groups. The quartile analysis (Table 3) resulted in considerably smaller standard errors for the MLE in the stratified case compared to the estimates from the standard NCC. Not all of the log-relative risk estimates are significant, but the estimates are consistent with the linear modelling carried out above. We considered two versions of the MLE: one where the baseline hazard had proportional effects of age and gender; and a more general model, closer in spirit to the NCC analysis, where the baseline hazard was stratified according to eight groups depending on age and gender. We see, for example, that IGF-I in the lowest quartile corresponds to a log-relative risk of 0.57 (0.15) for the stratified MLE, in contrast to the standard NCC estimate of 0.68 (0.29). The stratified MLE had standard errors that differed considerably from the standard errors of the MLE. This suggests that the additional modelling of the stratified baseline hazards did not result in increased efficiency, even though the proportional modelling of the baseline hazards appears quite reasonable. The difference in the two sets of standard errors may therefore indicate that the EM-aided standard errors should only be used with caution. We tried different convergence criteria and found some variation in the estimated standard errors.

[Figure 1: four panels by age group (panel titles Age = 30 and Age = 40 are legible); x-axis: time (years); y-axis: cum. baseline.]
Fig. 1. Cumulative baseline hazards according to age and gender. Males, thick line; females, thin line.

[Figure 2: Cumulative Baseline Hazard for IHD; x-axis: time (years); y-axis: cum. baseline.]
Fig. 2. Cumulative baseline hazards adjusted for effects of age and gender by proportional modelling for subjects aged 50 on entry, see text. Males, thick line; females, thin line.

Table 3. Log-relative risk of quartiles of IGF-I and IGFBP-3 for IHD data. The effects of age and gender are only given for the model where they are modelled as proportional effects
[Cells shown without a point estimate lost it in the source.]
                      Standard NCC    MLE            MLE Strat.
IGF-I 1st quartile    0.68 (0.29)     0.58 (0.21)    0.57 (0.15)
IGF-I 2nd quartile    0.33 (0.27)     0.26 (0.19)    0.26 (0.20)
IGF-I 3rd quartile    0.49 (0.25)     0.51 (0.27)    0.51 (0.18)
IGF-I 4th quartile
BP-3 1st quartile
BP-3 2nd quartile     0.54 (0.26)     0.49 (0.17)    0.44 (0.22)
BP-3 3rd quartile     0.44 (0.28)     0.46 (0.15)    0.39 (0.18)
BP-3 4th quartile     0.71 (0.30)     0.63 (0.12)    0.59 (0.14)
Gender                                (0.16)         -
Age                                   (0.0090)       -

[Figure 3: four panels (Males IGF-I, Females IGF-I, Males IGFBP-3, Females IGFBP-3); x-axis: standardized IGF-I or IGFBP-3; y-axis: cum. distribution.]
Fig. 3. Marginal distribution of IGF-I and IGFBP-3 for males and females, on standardized scale.

The MLE gives an estimate of the distribution of IGF-I and IGFBP-3 given the stratification variables. In Figure 3 we show the marginal distribution for IGF-I and IGFBP-3 for the two genders in a model where the distributions of these factors were assumed to depend only on gender. Figure 3 indicates that both hormones are somewhat higher for males. The distributions are shown on the standardized scale for the two hormones, as the variables enter the analysis in the standardized form. Note, however, that the estimates are invariant under translation and change of scale. A further analysis also indicated that the distribution of the hormones did not vary much with the age groups.

[Figure 4: one curve per quartile (1st to 4th) of the ratio IGF-I/IGFBP-3; x-axis: time (years); y-axis: cum. baseline.]
Fig. 4. Cumulative baseline hazards stratified according to level of the ratio of IGF-I to IGFBP-3 for males aged 50 on entry.

Finally, we estimated the cumulative baseline hazard stratified according to the quartiles of the ratio of IGF-I and IGFBP-3. Gender and age were present in the model through proportional hazards modelling. This analysis is summarized in Figure 4, where we see that a low ratio of IGF-I to IGFBP-3 indicates a high risk of IHD. For an individual in the high risk group the expected rate of occurrence of IHD is 0.12 per 15 years, which approximately translates to a 12% risk of developing IHD over the 15 years of the study. The analysis was adjusted for effects of gender and age that are fully observed for all individuals. These effects were similar to those reported earlier.

7. DISCUSSION

The proposed MLE for nested case-control sampling appears to result in increased efficiency compared to the standard analysis when the effects are large. In addition, the MLE may be used to obtain additional information about the covariates. In our example we were able to compute estimates of the cumulative baseline hazard stratified according to the ratio of IGF-I and IGFBP-3.

One important methodological point is that the NCC and the case-cohort study (Prentice, 1986; Self and Prentice, 1988) lead to likelihoods of the same form and therefore can be analysed by the same program. We conjecture that the two designs lead to equivalent power. The suggested MLE technique can be extended to deal with left truncation, in which case we are still able to compute the required conditional mean of the unobserved covariates given the data.
The MLE has several important limitations compared to the standard analysis. The MLE procedure can only deal with time-constant covariates and a limited number of situations. Also, the censoring mechanism must be independent of the partly unobserved covariate. We estimated the standard errors of the log-relative risk parameters by EM-aided numerical differentiation, and this approach worked well in the simulations that we performed. In the application, however,

we found that the standard errors showed some numerical instability. Repeated analysis of the data with slightly different criteria for convergence led to some variability in the estimates of the standard errors. We have assumed that X is discrete in nature. However, this assumption may be relaxed by using smoothing techniques to estimate the conditional distribution of Z given X. Further research should study the asymptotic properties of the MLE procedure. We conjecture that the results of Chen (2002) will carry through. Note that the partly observed covariates (Z) are not allowed to influence the selection of controls. To deal with counter-matching, where the risk set selection also depends on the observed value, Z_i, of the covariate for the corresponding case, some modification of the suggested MLE is needed. See Borgan et al. (1995) for a discussion of various other strategies for selecting the controls in the NCC design.

ACKNOWLEDGEMENTS

The first author received partial support through a grant from the National Institutes of Health and did most of the work while employed by the Department of Mathematical Sciences at Aalborg University. We thank the Centre for Preventive Medicine for letting us use their data, and the associate editor and two reviewers for their comments that led to an improved presentation.

REFERENCES

ANDERSEN, P. K., BORGAN, Ø., GILL, R. D. AND KEIDING, N. (1993). Statistical Models Based on Counting Processes. New York: Springer.
BORGAN, Ø., GOLDSTEIN, L. AND LANGHOLZ, B. (1995). Methods for the analysis of sampled cohort data in the Cox proportional hazards model. Annals of Statistics 23,
CHEN, H. Y. (2002). Double-semiparametric method for missing covariates in Cox regression models. Journal of the American Statistical Association 97,
CHEN, H. Y. AND LITTLE, R. J. (1999). Proportional hazards regression with missing covariates. Journal of the American Statistical Association 94,
COX, D. R. (1972). Regression models and life tables. Journal of the Royal Statistical Society B 34,
DEMPSTER, A. P., LAIRD, N. M. AND RUBIN, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society B 39,
GOLDSTEIN, L. AND LANGHOLZ, B. (1992). Asymptotic theory for nested case-control sampling in the Cox regression model. Annals of Statistics 20,
JUUL, A., SCHEIKE, T., DAVIDSEN, M., GYLLENBORG, J. AND JØRGENSEN, T. (2002). Low serum insulinlike growth factor I is associated with increased risk of ischemic heart disease: a population based cohort study. Circulation 106,
LOUIS, T. A. (1982). Finding the observed information matrix when using the EM algorithm. Journal of the Royal Statistical Society B 44,
MARTINUSSEN, T. (1999). Cox regression with incomplete covariate measurements using the EM-algorithm. Scandinavian Journal of Statistics 26,
MURPHY, S. A., ROSSINI, A. J. AND VAN DER VAART, A. W. (1997). Maximum likelihood estimation in the proportional odds model. Journal of the American Statistical Association 92,
NIELSEN, G. G., GILL, R., ANDERSEN, P. K. AND SØRENSEN, T. I. A. (1992). A counting process approach to maximum likelihood estimation in frailty models. Scandinavian Journal of Statistics 19,
OAKES, D. (1981). Survival times: aspects of partial likelihood (with discussion). International Statistical Review 49,
PRENTICE, R. L. (1986). A case-cohort design for epidemiological cohort studies and disease prevention trials. Biometrika 73,
SAMUELSEN, S. O. (1997). A pseudolikelihood approach to analysis of nested case-control studies. Biometrika 84,
SELF, S. G. AND PRENTICE, R. L. (1988). Asymptotic distribution theory and efficiency results for case-cohort studies. Annals of Statistics 16,
THOMAS, D. (1977). Addendum to: methods of cohort analysis: appraisal by application to asbestos mining, by F. D. K. Liddell, J. C. McDonald and D. C. Thomas. Journal of the Royal Statistical Society A 140,
TURNBULL, B. W. (1976). The empirical distribution function with arbitrarily grouped, censored and truncated data. Journal of the American Statistical Association 69,
WELLNER, J. A. AND ZHAN, Y. (1997). A hybrid algorithm for computing of the nonparametric maximum likelihood estimator from censored data. Journal of the American Statistical Association 92,

[Received August 28, 2002; first revision October 14, 2002; second revision January 13, 2003; third revision April 2, 2003; accepted for publication August 7, 2003]

On the Breslow estimator

Lifetime Data Anal (2007) 13:471-480 DOI 10.1007/s10985-007-9048-y On the Breslow estimator D. Y. Lin Received: 5 April 2007 / Accepted: 16 July 2007 / Published online: 2 September 2007 Springer Science+Business Media,