Logistic regression model for survival time analysis using time-varying coefficients


Accepted in American Journal of Mathematical and Management Sciences, 2016

Kenichi SATOH
Research Institute for Radiation Biology and Medicine, Hiroshima University, Kasumi, Minami-ku, Hiroshima

Tetsuji TONDA
Faculty of Management and Information Systems, Prefectural University of Hiroshima, Ujina-Higashi, Minami-ku, Hiroshima, JAPAN.

Shizue IZUMI
Center for Data Science Education and Research, Shiga University, Banbacho, Hikone, Shiga, JAPAN.

SYNOPTIC ABSTRACT

In epidemiological studies, odds ratios are widely used for quantifying the relative risk. The odds ratio can be estimated from background factors using logistic regression. In this paper, a logistic regression model for the survival time is proposed using time-varying coefficients, and statistical inference is conducted using the Newton-Raphson method and simultaneous confidence intervals. Numerical examples and simulation studies demonstrate that the proposed model can be used to obtain the odds ratio in survival time analysis.

Key words: Logistic regression model; Newton-Raphson method; Odds ratio; Survival time analysis; Time-varying coefficient.

1. Introduction.

Odds ratios are widely used in epidemiology to measure the association between dichotomous outcome variables, such as case or control, normal or abnormal, dead or alive (see McCullagh and Nelder (1989)). The odds ratio can be interpreted as a relative risk when the probability of occurrence is very small.
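The rare-event approximation can be checked with a short worked equation. The symbols $p_1$ and $p_0$ (event probabilities in the two groups being compared) and the numbers used are introduced here only for illustration and do not come from the paper's data.

$$\mathrm{OR} = \frac{p_1/(1-p_1)}{p_0/(1-p_0)} = \frac{p_1}{p_0}\cdot\frac{1-p_0}{1-p_1} = \mathrm{RR}\cdot\frac{1-p_0}{1-p_1}.$$

For example, with $p_1 = 0.02$ and $p_0 = 0.01$ the relative risk is $2$ and the odds ratio is $2 \times 0.99/0.98 \approx 2.02$; the correction factor $(1-p_0)/(1-p_1)$ tends to $1$ as both probabilities tend to zero.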

Logistic regression models are often used to estimate the odds ratio when there are confounding factors requiring adjustment. On the other hand, time to death, or survival time, is frequently analyzed using the Cox proportional hazards model proposed in Cox (1972). However, that model is concerned with the hazard ratio rather than the odds ratio. Here, we apply the logistic regression model to survival time analysis and evaluate the odds ratio. In Section 2, we consider time-varying coefficients in the logistic regression model in order to describe survival time data. In Section 3, the proposed model is applied to a real dataset, and in Section 4 the stability of the estimation method is investigated in a simulation study. In Section 5, we discuss our proposed method and the conclusions from our investigation.

2. Logistic regression model for survival time data.

First, we define survival time as a random variable and explain the censoring time in 2.1. Then we connect the distribution function of survival time with time-varying coefficients. In 2.2, regression coefficients are estimated by maximizing a log-likelihood under the logistic regression model, for which the Newton-Raphson method can be implemented. Since the estimated time-varying coefficients are functions of time, their confidence intervals are also functions, given in 2.3.

2.1. Describing distribution function of survival time data by using time-varying coefficients.

Let $T$ be a continuous random variable denoting the time of death, whose cumulative distribution function (cdf) is given by $F(t) = \Pr(T \le t)$. The complement of the cdf is known as the survival function, given by $S(t) = 1 - F(t)$. It denotes the probability of being alive up until time $t$, or more generally, the probability that the event of interest has not occurred by time $t$; such a time is often called the censoring time. Let the regression coefficients of covariates $a = (a_1, \ldots, a_p)^\top$ be $\beta(t) = (\beta_1(t), \ldots, \beta_p(t))^\top$.
The effects of covariates can be non-stationary, and are referred to as time-varying coefficients (Hastie and Tibshirani (1993)). With the logit or log-odds transformation of $F(t)$, a logistic regression model can be obtained for survival time data as follows,

$$\log\frac{F(t)}{S(t)} = z(t \mid a) = \beta(t)^\top a. \tag{1}$$

Thus, the log-odds ratio for $a_j = 1$ to $a_j = 0$ at time $t$ can be expressed by

$$z(t \mid a_j = 1) - z(t \mid a_j = 0) = \beta_j(t), \tag{2}$$

or the odds ratio is given by $\exp\{\beta_j(t)\}$. The model in (1) can be regarded as an extension of the log-logistic model proposed by Bennet (1983), which uses the log-logistic distribution function for survival time and has a varying coefficient $\log t$ only for a constant covariate $a_1$, i.e., $\log F(t)/S(t) = \varphi \log t + \beta^\top a$. Here we propose a model to evaluate the time-varying coefficients for the covariates in equation (1). We consider linear time-varying coefficients using the growth curve model presented in Satoh and Yanagihara (2010) for longitudinal data. Let $x(t)$ be a $(q-1)$th degree polynomial basis function for the varying coefficients $\beta(t)$, i.e.,

$$\beta(t)^\top = x(t)^\top \Theta. \tag{3}$$

Here, $x(t) = (1, t, t^2, \ldots, t^{q-1})^\top$ and $\Theta = (\theta_1, \ldots, \theta_p)$ is a $q \times p$ unknown regression coefficient matrix. Note that $x(t)$ does not need to be a polynomial basis function, but it must be a differentiable function of $t$.

2.2. Deriving maximum likelihood estimators of regression coefficients.

Assuming that the cdf $F(t)$ is differentiable, we can then obtain the probability density function (pdf) given by

$$f(t) = \frac{dF(t)}{dt} = F(t)S(t)\frac{dz(t)}{dt}. \tag{4}$$

From (1) and (3), the derivative appearing in (4) is

$$\frac{dz(t)}{dt} = \frac{d\beta(t)^\top}{dt}\,a = \dot{x}(t)^\top \Theta a. \tag{5}$$
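As a minimal sketch of equations (1) to (3), the following Python code evaluates $\beta(t)$, the cdf $F(t \mid a)$ and the odds ratio $\exp\{\beta_j(t)\}$ for a given coefficient matrix $\Theta$. The function names and the numerical entries of `Theta` are hypothetical and chosen only for illustration; they are not estimates from this paper.

```python
import numpy as np

def poly_basis(t, q):
    """Polynomial basis x(t) = (1, t, ..., t^(q-1))."""
    return np.array([t ** k for k in range(q)], dtype=float)

def beta(t, Theta):
    """Time-varying coefficients from (3): beta(t)' = x(t)' Theta, Theta is q x p."""
    q = Theta.shape[0]
    return Theta.T @ poly_basis(t, q)

def cdf(t, a, Theta):
    """F(t | a) implied by the logit model (1): log F/S = beta(t)' a."""
    z = beta(t, Theta) @ np.asarray(a, dtype=float)
    return 1.0 / (1.0 + np.exp(-z))

def odds_ratio(t, j, Theta):
    """Odds ratio exp{beta_j(t)} for a_j = 1 versus a_j = 0, from (2)."""
    return float(np.exp(beta(t, Theta)[j]))

# Hypothetical q = 2, p = 2 example: column 0 is the baseline curve,
# column 1 a treatment effect that happens to be constant in t.
Theta = np.array([[-3.0, -2.0],
                  [1.5, 0.0]])
t = 2.0
F_control = cdf(t, [1, 0], Theta)   # covariates (1, a) with a = 0
F_treated = cdf(t, [1, 1], Theta)   # covariates (1, a) with a = 1
print(F_control, F_treated, odds_ratio(t, 1, Theta))
```

The odds ratio returned by `odds_ratio` agrees with the ratio of the odds $F/(1-F)$ computed from the two cdf values, which is the content of equation (2).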

Here, $\dot{x}(t)$ denotes the derivative of the basis,

$$\dot{x}(t) = \frac{dx(t)}{dt} = (0, 1, 2t, \ldots, (q-1)t^{q-2})^\top. \tag{6}$$

Note that the hazard function can be written as $f(t)/S(t) = F(t)\,\dot{x}(t)^\top \Theta a$. In most real situations, polynomial basis functions based on $t = \log(\tilde{t})$ can provide a better fit for survival data than those based on the original survival time $\tilde{t}$, e.g., Bennet (1983).

Assume that each subject either experiences the event or is censored; that is, for subject $i$ we observe $(t_i, \delta_i)$, $i = 1, \ldots, n$, where $t_i$ is the time of death or censoring and $\delta_i$ indicates whether the subject is uncensored ($\delta_i = 1$) or censored ($\delta_i = 0$). Then the likelihood function for the regression coefficients $\Theta$ can be expressed as

$$L(\Theta) = \prod_{i=1}^{n} f_i^{\delta_i} S_i^{1-\delta_i} = \prod_{i=1}^{n} \{F_i S_i \dot{z}_i\}^{\delta_i} S_i^{1-\delta_i}, \tag{7}$$

where $a_i$ is a covariate vector for subject $i$, $\dot{z}_i = \dot{x}(t_i)^\top \Theta a_i$, $f_i = f(t_i)$, $F_i = F(t_i)$ and $S_i = S(t_i)$. By maximizing the log-likelihood function with respect to $\Theta$, the maximum likelihood estimator $\hat{\Theta} = (\hat{\theta}_1, \ldots, \hat{\theta}_p)$ can be obtained. Let $\theta = \mathrm{vec}(\Theta) = (\theta_1^\top, \ldots, \theta_p^\top)^\top$ and $l(\theta) = \log L(\Theta)$. Then the estimator $\hat{\theta} = \mathrm{vec}(\hat{\Theta})$ satisfies $dl(\hat{\theta})/d\theta = 0_{qp}$, where the score is

$$\frac{dl(\theta)}{d\theta} = \sum_{i=1}^{n} \left\{ \delta_i S_i w_i - F_i w_i + \delta_i \dot{z}_i^{-1} \dot{w}_i \right\}, \tag{8}$$

with $w_i = a_i \otimes x(t_i)$ and $\dot{w}_i = a_i \otimes \dot{x}(t_i)$. Its Hessian matrix is given by

$$\frac{d^2 l(\theta)}{d\theta^2} = -\sum_{i=1}^{n} \left\{ (1+\delta_i) F_i S_i w_i w_i^\top + \delta_i \dot{z}_i^{-2} \dot{w}_i \dot{w}_i^\top \right\}. \tag{9}$$

Using the Newton-Raphson method, the maximum likelihood estimator $\hat{\theta}$ can be obtained by the following recurrence formula:

$$\theta_{m+1} = \theta_m - \left\{ \frac{d^2 l(\theta_m)}{d\theta^2} \right\}^{-1} \frac{dl(\theta_m)}{d\theta}, \quad m = 0, 1, 2, \ldots. \tag{10}$$
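The estimation steps in (7) to (10) can be sketched in Python as follows. This is an illustrative implementation under stated assumptions, not the authors' code: the data are simulated from hypothetical parameter values, the function names are made up for this sketch, and a step-halving safeguard (not part of recurrence (10)) is added so that every iterate keeps $\dot{z}_i > 0$ for uncensored subjects and does not decrease the log-likelihood.

```python
import numpy as np

def design(t, A, q):
    """Rows w_i = a_i (x) x(t_i) and dw_i = a_i (x) dx(t_i) for the polynomial basis."""
    k = np.arange(q)
    x = t[:, None] ** k                               # x(t) = (1, t, ..., t^(q-1))
    dx = np.zeros_like(x)
    dx[:, 1:] = k[1:] * t[:, None] ** (k[1:] - 1)     # (0, 1, 2t, ...), equation (6)
    n = len(t)
    W = (A[:, :, None] * x[:, None, :]).reshape(n, -1)
    dW = (A[:, :, None] * dx[:, None, :]).reshape(n, -1)
    return W, dW

def loglik(theta, W, dW, delta):
    """Log of the likelihood (7); -inf when some uncensored dz_i is not positive."""
    z, dz = W @ theta, dW @ theta
    d = delta == 1
    if np.any(dz[d] <= 0):
        return -np.inf
    log_F = -np.logaddexp(0.0, -z)                    # log F = -log(1 + e^{-z})
    log_S = -np.logaddexp(0.0, z)                     # log S = -log(1 + e^{z})
    return np.sum(log_S) + np.sum(log_F[d] + np.log(dz[d]))

def score_hessian(theta, W, dW, delta):
    """Score (8) and Hessian (9)."""
    z, dz = W @ theta, dW @ theta
    F = 1.0 / (1.0 + np.exp(-z))
    S = 1.0 - F
    g = W.T @ (delta * S - F) + dW.T @ (delta / dz)
    H = -(W.T * ((1 + delta) * F * S)) @ W - (dW.T * (delta / dz ** 2)) @ dW
    return g, H

def newton_raphson(theta0, W, dW, delta, max_iter=50, tol=1e-6):
    """Recurrence (10), with step halving added as a safeguard (an assumption)."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iter):
        g, H = score_hessian(theta, W, dW, delta)
        if np.linalg.norm(g) < tol:
            break
        step = np.linalg.solve(H, g)                  # H^{-1} g
        lam, l_old = 1.0, loglik(theta, W, dW, delta)
        while lam > 1e-8 and loglik(theta - lam * step, W, dW, delta) < l_old - 1e-9:
            lam /= 2.0
        theta = theta - lam * step
    g, H = score_hessian(theta, W, dW, delta)
    return theta, np.linalg.inv(-H), g                # estimate, covariance, final score

# Simulated data from hypothetical parameters (not the leukemia data).
rng = np.random.default_rng(0)
n, q = 200, 2
a = rng.integers(0, 2, size=n)
A = np.column_stack([np.ones(n), a])                  # covariates (1, a_i)
u = rng.uniform(size=n)
t_event = (np.log(u / (1 - u)) + 3.0 + 2.0 * a) / 1.5 # solves -3 + 1.5 t - 2 a = logit(u)
t_cens = rng.normal(3.5, 1.0, size=n)                 # hypothetical censoring times
t_obs = np.minimum(t_event, t_cens)
delta = (t_event <= t_cens).astype(float)

W, dW = design(t_obs, A, q)
theta_hat, cov_hat, grad = newton_raphson(np.array([-2.0, 1.0, 0.0, 0.0]), W, dW, delta)
print(theta_hat, np.sqrt(np.diag(cov_hat)))
```

Here `theta_hat` is ordered as $\mathrm{vec}(\Theta)$, that is, intercept, $t$, $a$, $t \times a$. Because the Hessian (9) is negative semidefinite for every $\theta$ with $\dot{z}_i > 0$, the log-likelihood is concave on that region, which is why a simple step-halving rule is a reasonable choice for this sketch.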

Here, $\theta_0$ in (10) is an adequate initial value. Note that the inverse of the negative Hessian matrix can be used as an asymptotic covariance matrix of the maximum likelihood estimator $\hat{\theta}$, i.e.,

$$\Omega = \mathrm{Cov}(\hat{\theta}) \approx \left\{ -\frac{d^2 l(\hat{\theta})}{d\theta^2} \right\}^{-1}. \tag{11}$$

We then have estimators for the linear time-varying coefficients,

$$\hat{\beta}(t)^\top = x(t)^\top \hat{\Theta} \quad \text{or} \quad \hat{\beta}_j(t) = x(t)^\top \hat{\theta}_j, \quad j \in \{1, \ldots, p\}. \tag{12}$$

From the properties of the maximum likelihood estimator under regularity conditions, e.g., Philippou and Roussas (1975), the estimators are asymptotically normal, $\hat{\beta}_j(t) - \beta_j(t) \sim N(0, \sigma_j^2(t))$, where $\sigma_j^2(t) = x(t)^\top \Omega_j x(t)$ and $\mathrm{Cov}(\hat{\theta}_j) = \Omega_j$ is the corresponding $q \times q$ block of $\Omega = (\Omega_{uv})$, $u, v = 1, \ldots, pq$, i.e., $\Omega_j = (\Omega_{uv})$, $u, v = (j-1)q + 1, \ldots, jq$.

2.3. Constructing simultaneous confidence intervals of time-varying coefficients.

Here, we construct a confidence interval for the linear time-varying coefficients, given by

$$I_{j,\alpha}(t \mid u_\alpha) = \left[ \hat{\beta}_j(t) - u_\alpha \hat{\sigma}_j(t),\; \hat{\beta}_j(t) + u_\alpha \hat{\sigma}_j(t) \right]. \tag{13}$$

The covering probability of $I_{j,\alpha}(t \mid u_\alpha)$ depends on $u_\alpha$. For example, the pointwise confidence interval at a fixed time $t$ can be constructed by letting $u_\alpha = z_{\alpha/2}$, where $z_\alpha$ denotes the upper $100\alpha$ percentile of $N(0, 1)$. Note that the confidence interval $I_{j,\alpha}(t \mid z_{\alpha/2})$ asymptotically satisfies $\Pr(\beta_j(t) \in I_{j,\alpha}(t \mid z_{\alpha/2})) \approx 1 - \alpha$ for a fixed time $t$. To construct a simultaneous confidence interval, we need to evaluate the distribution of the supremum of the Wald type statistic $T_j(t) = \{\hat{\beta}_j(t) - \beta_j(t)\}/\sigma_j(t)$, but it is difficult to derive an explicit expression for the distribution of the supremum statistic in general. Here, we evaluate the upper bound of the supremum of $T_j(t)$ in the same manner as in Satoh and Yanagihara (2010). From the inequality in Rao (1973, p. 60), $\hat{\beta}_j(t)$ asymptotically satisfies the following relation:

$$\sup_{t \in \mathbb{R}} T_j(t)^2 = \sup_{t \in \mathbb{R}} \frac{\{x(t)^\top(\hat{\theta}_j - \theta_j)\}^2}{x(t)^\top \Omega_j x(t)} \le \sup_{x \in \mathbb{R}^q} \frac{\{x^\top(\hat{\theta}_j - \theta_j)\}^2}{x^\top \Omega_j x} = (\hat{\theta}_j - \theta_j)^\top \Omega_j^{-1} (\hat{\theta}_j - \theta_j) \sim \chi^2_q. \tag{14}$$

Note that the asymptotic distribution of the upper bound is $\chi^2_q$ for any time $t$. Let $u_\alpha = \sqrt{c_{q,\alpha}}$, where $c_{q,\alpha}$ is the upper $100\alpha$ percentile of $\chi^2_q$; then the covering probability of the confidence interval $I_{j,\alpha}(t \mid \sqrt{c_{q,\alpha}})$ satisfies

$$\Pr\left( \beta_j(t) \in I_{j,\alpha}(t \mid \sqrt{c_{q,\alpha}}) \;\; \forall t \in \mathbb{R} \right) \ge 1 - \alpha. \tag{15}$$

Based on equation (14), we can construct test statistics for the following null hypotheses for the time-varying coefficient $\beta_j(t)$:

$$\begin{aligned} \text{Uniformly zero:} \quad & H_0: \beta_j(t) = 0 \ \text{ for all } t \in \mathbb{R}, \\ \text{Uniformly constant:} \quad & H_0: \beta_j(t) = \text{const.} \ \text{ for all } t \in \mathbb{R}. \end{aligned} \tag{16}$$

The uniformly zero hypothesis is equivalent to $\theta_j = 0$. It implies that the corresponding covariate $a_j$ has no effect on observations and the corresponding odds ratio is 1, i.e., $\exp\{\beta_j(t)\} = 1$. Using equation (14) with $\theta_j = 0$, the upper bound of the supremum of $T_j(t)^2$ is $W_j = \hat{\theta}_j^\top \Omega_j^{-1} \hat{\theta}_j \sim \chi^2_q$. Hence, $W_j$ can be used as a test statistic for the null hypothesis $H_0$. The uniformly zero hypothesis is rejected when $W_j > c_{q,\alpha}$, and the p-value can be obtained by $\Pr(\chi^2_q > W_j)$. Note that the uniformly constant hypothesis is equivalent to $\theta_j^{(-1)} = 0$, where $\theta_j^{(-1)}$ is the $(q-1)$-dimensional vector obtained from $\theta_j$ by excluding its first element, which multiplies the constant term 1 of $x(t)$. Analogous to the test for the uniformly zero hypothesis, we can construct a test statistic and derive an asymptotic null distribution for the uniformly constant hypothesis.

3. Numerical example.

In this section, we consider a dataset of remission lengths (weeks) for acute leukemia patients in Table 1, which was reported by Freireich et al. (1963) and explained in Kleinbaum (2012). The data consist of a placebo group and a treatment group, each containing 21 patients. Our main concern is comparing the survival rates of the two groups. We considered the proposed model using the placebo group as a control group, and the covariate of the $i$th individual is expressed as $a_i = 1$ for the treatment group and $a_i = 0$ for the placebo group, where $i = 1, \ldots, n$ with $n = 21 \times 2 = 42$. Assuming the time-varying coefficient for the treatment effect to be a linear curve, the design vector is given by $x(t) = (1, t)^\top$ and its length is $q = 2$. Note that the survival time $t$ is the logarithm of the original length of remission. The maximum likelihood estimators and the asymptotic standard errors were calculated using (10) and (11), respectively, and are listed in Table 2. Hence, the estimated logistic regression model in (1) can be expressed as $\hat{\beta}_1(t) + \hat{\beta}_2(t)a$, where $\hat{\beta}_1(t)$ is the fitted linear function of $t$ for the placebo group and $\hat{\beta}_2(t)$ is the fitted linear function of $t$ for the treatment effect (coefficients in Table 2), i.e., $\hat{\beta}_1(t) + \hat{\beta}_2(t)$ for the treatment group.

Figure 1 shows the fitted survival curves for each group. The proposed model seems to provide a good fit to the Kaplan-Meier curves. Since the proposed model is based on logistic regression, the odds ratio for the treatment group to the placebo group can be expressed as $\exp\{\beta_2(t)\}$ (see Figure 2). The simultaneous confidence intervals were also derived using (15). The estimated time-varying odds ratio curve in Figure 2 stays around 0.1 throughout the observation period. In fact, the regression coefficient of $t \times a$ in Table 2 is not statistically significant (its p-value in Table 2 exceeds the significance level). The interaction term was therefore removed from the model of Table 2, and the corresponding estimates are given in Table 3. The treatment effect is now statistically significant, although it is not significant in Table 2. The estimated odds ratio in Table 3 is $\exp(-2.315) = 0.10$, and the curve in Figure 2 appears to be reasonably constant.

From the results of applying the proposed method to the remission time dataset, the proposed model constructed by logistic regression with time-varying coefficients can be seen to provide a good fit to the data, and we could confirm that the odds ratio was constant using the more flexible model, which allows for non-stationary odds ratios.
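The intervals (13) to (15) and the uniformly zero test based on $W_j$ can be sketched as follows for the case $q = 2$ used in this example. The values of `theta_j` and `Omega_j` are hypothetical placeholders, not the estimates of Table 2, and the function names are made up for this sketch. For $q = 2$ the chi-square distribution has the closed-form tail $\Pr(\chi^2_2 > c) = \exp(-c/2)$, so $c_{2,\alpha} = -2\log\alpha$; for general $q$ a chi-square quantile function from a statistics library would be needed instead.

```python
import math
from statistics import NormalDist

import numpy as np

def coefficient_bands(t_grid, theta_j, Omega_j, alpha=0.05):
    """Estimate, standard error and the two multipliers u_alpha in (13), for q = 2."""
    X = np.column_stack([np.ones_like(t_grid), t_grid])      # rows x(t)' = (1, t)
    est = X @ theta_j                                        # beta_j(t) = x(t)' theta_j
    se = np.sqrt(np.einsum("ij,jk,ik->i", X, Omega_j, X))    # sigma_j(t)
    u_point = NormalDist().inv_cdf(1 - alpha / 2)            # z_{alpha/2}, pointwise
    u_simul = math.sqrt(-2.0 * math.log(alpha))              # sqrt(c_{2,alpha}), simultaneous
    return est, se, u_point, u_simul

def wald_uniformly_zero(theta_j, Omega_j):
    """W_j = theta_j' Omega_j^{-1} theta_j and its chi-square (2 df) p-value."""
    W = float(theta_j @ np.linalg.solve(Omega_j, theta_j))
    return W, math.exp(-W / 2.0)                             # Pr(chi2_2 > W) = exp(-W/2)

# Hypothetical inputs for illustration only.
theta_j = np.array([-2.3, 0.1])
Omega_j = np.array([[0.90, -0.30],
                    [-0.30, 0.15]])
t_grid = np.linspace(0.0, 3.5, 8)                            # log-time grid

est, se, u_point, u_simul = coefficient_bands(t_grid, theta_j, Omega_j)
or_hat = np.exp(est)                                         # odds ratio curve
or_lower = np.exp(est - u_simul * se)                        # simultaneous band for the OR
or_upper = np.exp(est + u_simul * se)
W_j, p_value = wald_uniformly_zero(theta_j, Omega_j)
print(u_point, u_simul, W_j, p_value)
```

The simultaneous multiplier is larger than the pointwise one, so the band that holds for all $t$ is wider than the interval at any fixed $t$; exponentiating the limits for $\beta_j(t)$ gives a band for the odds ratio.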

4. Simulation.

We obtained our estimates for the model parameters using the Newton-Raphson method, as defined by the recurrence formula (10). The estimates will converge if the initial value $\theta_0$ is sufficiently close to the maximum likelihood estimator $\hat{\theta}$, since $dl(\hat{\theta})/d\theta = 0_{qp}$ (see McCullagh and Nelder (1989)). To elucidate the behavior of the estimator we investigated: 1) how quickly the estimator converged as the number of iterations increased, and 2) the influence of the initial guess for the estimator on the convergence. For our simulations, we used the parameter estimates in Table 3, which were fitted to the example shown in Table 1. Therefore, the initial values can be expressed as $\theta_0 = (\theta_{01}, \theta_{02}, \theta_{03})^\top$. The regression coefficients $\theta_{01}$ and $\theta_{02}$ were fixed at their estimates in Table 3 (1.830 for $\theta_{02}$), and the coefficient $\theta_{03}$ was simulated from the uniform distribution $U(-4, 0)$, a range relatively close to the true maximum likelihood estimator $\hat{\theta}_3$ given in Table 3. Thus, as shown in Figure 3, 1,000 initial values were simulated from the uniform distribution and 20 iterations of the Newton-Raphson method were applied for each initial value. All estimators successfully converged, and the converged values were almost the same as the true maximum likelihood estimator. As for the convergence rate, fewer than 5 iterations were needed until convergence. From the results of the simulations, the Newton-Raphson method seems to be suitable for obtaining the maximum likelihood estimators when the initial values are sufficiently close to the true values. Therefore, it is important to try different initial values and to check the likelihood value in (7) for the obtained estimators.

5. Conclusion.

We proposed a logistic regression survival model with time-varying coefficients. The maximum likelihood estimators and their asymptotic covariance matrix were calculated iteratively by the Newton-Raphson method.
In our model, the odds ratio can be expressed as a function of time, and its simultaneous confidence intervals were also considered. From the simulation study, the maximum likelihood estimator, and hence the odds ratio, can be obtained when the initial values are close to the true values. The model provided a good fit when applied to a real dataset, and it was confirmed that the odds ratio is constant in time. Besides providing a test of stationarity for the odds ratio, our proposed model might also be useful for modeling odds ratios which are non-stationary.

References

Bennet, S. (1983). Log-logistic regression models for survival data. Journal of Applied Statistics, 32.

Cox, D. R. (1972). Regression models and life-tables. Journal of the Royal Statistical Society, Series B, 34.

Freireich, E. O. et al. (1963). The effect of 6-mercaptopurine on the duration of steroid-induced remissions in acute leukemia. Blood, 21.

Hastie, T. and Tibshirani, R. (1993). Varying-coefficient models. Journal of the Royal Statistical Society, Series B, 55.

Kleinbaum, D. G. (2012). Survival Analysis, 3rd ed. Springer, New York.

Philippou, A. N. and Roussas, G. G. (1975). Asymptotic normality of the maximum likelihood estimate in the independent not identically distributed case. Annals of the Institute of Statistical Mathematics, 27.

Rao, C. R. (1973). Linear Statistical Inference and Its Applications. John Wiley, New York.

McCullagh, P. and Nelder, J. A. (1989). Generalized Linear Models, 2nd ed. Chapman and Hall/CRC, London.

Satoh, K. and Yanagihara, H. (2010). Estimation of varying coefficients for a growth curve model. American Journal of Mathematical and Management Sciences, 30.

Satoh, K. and Tonda, T. (2016). Estimating regression coefficients for balanced growth curve model when time trend of baseline is not specified. American Journal of Mathematical and Management Sciences, in press.

Table 1. Length of remission dataset by Freireich et al. (1963).
Columns: ID, Placebo, Treatment.

Table 2. Estimates of regression coefficients.
Columns: Variables, Estimate, Std. Error, $\chi^2_1$, p-value.
Rows: (Intercept), $t$, $a$, $t \times a$.

Table 3. Estimates of regression coefficients when the treatment effect is constant in time.
Columns: Variables, Estimate, Std. Error, $\chi^2_1$, p-value.
Rows: (Intercept), $t$, $a$.

Figure 1. Fitted survival curves based on the logistic regression model. (Axes: survival probability versus weeks; curves: treatment, placebo, and Kaplan-Meier.)

Figure 2. The estimated time-varying odds ratio curve and its 95% simultaneous confidence intervals. (Axes: odds ratio versus weeks; curves: estimated OR and 95% C.I.)

Figure 3. Convergence of the regression coefficients with different initial values, when using the Newton-Raphson method. The true value is the maximum likelihood estimate $\hat{\theta}_3$ given in Table 3. (Axes: estimates versus iterations of the Newton-Raphson method.)
