Logistic regression model for survival time analysis using time-varying coefficients


Accepted in American Journal of Mathematical and Management Sciences, 2016

Kenichi SATOH (ksatoh@hiroshima-u.ac.jp)
Research Institute for Radiation Biology and Medicine, Hiroshima University, 1-2-3 Kasumi, Minami-ku, Hiroshima 734-8553.

Tetsuji TONDA
Faculty of Management and Information Systems, Prefectural University of Hiroshima, 1-1-71 Ujina-Higashi, Minami-ku, Hiroshima 734-8558, JAPAN.

Shizue IZUMI
Center for Data Science Education and Research, Shiga University, 1-1-1 Banbacho, Hikone, Shiga 522-8522, JAPAN.

SYNOPTIC ABSTRACT

In epidemiological studies, odds ratios are widely used for quantifying the relative risk. The odds ratio can be estimated from background factors using logistic regression. In this paper, a logistic regression model for the survival time is proposed using time-varying coefficients, and statistical inference is conducted using the Newton-Raphson method and simultaneous confidence intervals. Numerical examples and simulation studies demonstrate that the proposed model can be used to obtain the odds ratio in survival time analysis.

Key words: Logistic regression model; Newton-Raphson method; Odds ratio; Survival time analysis; Time-varying coefficient.

1. Introduction.

Odds ratios are widely used in epidemiology to measure association with dichotomous outcome variables, such as case or control, normal or abnormal, or dead or alive (see McCullagh and Nelder (1989)). The odds ratio can be interpreted as a relative risk when the probability of occurrence is very small.
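The relation between the odds ratio and the relative risk can be written out as a short identity. The symbols $p_1$ and $p_0$ (the event probabilities in the two groups being compared) and the numerical values in the comment are introduced here for illustration only and do not appear in the paper.

```latex
\[
\mathrm{OR}
  = \frac{p_1/(1-p_1)}{p_0/(1-p_0)}
  = \frac{p_1}{p_0}\cdot\frac{1-p_0}{1-p_1}
  = \mathrm{RR}\cdot\frac{1-p_0}{1-p_1}
  \approx \mathrm{RR}
  \quad\text{when } p_0 \text{ and } p_1 \text{ are both small.}
\]
% Illustration: p_1 = 0.02 and p_0 = 0.01 give RR = 2 and
% OR = 2 \times 0.99/0.98 \approx 2.02.
```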

Logistic regression models are often used to estimate the odds ratio when there are confounding factors requiring adjustment. On the other hand, time to death, or survival time, is frequently analyzed using the Cox proportional hazards model proposed in Cox (1972). However, that model is concerned not with the odds ratio but with the hazard ratio. Here, we apply the logistic regression model to survival time analysis and evaluate the odds ratio. In Section 2, we consider time-varying coefficients in a logistic regression model in order to describe survival time data. In Section 3, the proposed model is applied to a real dataset, and in Section 4 the stability of the estimation method is investigated in a simulation study. In Section 5, we discuss the proposed method and the conclusions from our investigation.

2. Logistic regression model for survival time data.

First, we define survival time as a random variable and explain the censoring time in 2.1. Then we connect the distribution function of survival time with time-varying coefficients. In 2.2, regression coefficients are estimated by maximizing a log-likelihood under the logistic regression model, which can be implemented with the Newton-Raphson method. Since the estimated time-varying coefficients are functions of time, their confidence intervals are also functions of time; these are given in 2.3.

2.1. Describing the distribution function of survival time data by using time-varying coefficients.

Let $T$ be a continuous random variable denoting the time of death, whose cumulative distribution function (cdf) is given by $F(t) = \Pr(T \le t)$. The complement of the cdf is known as the survival function, given by $S(t) = 1 - F(t)$. It denotes the probability of being alive at time $t$ or, more generally, the probability that the event of interest has not occurred by time $t$. When observation of a subject ends before the event occurs, the time at which observation ends is called the censoring time. Let the regression coefficients of covariates $a = (a_1, \ldots, a_p)'$ be $\beta(t) = (\beta_1(t), \ldots, \beta_p(t))'$.
The effects of covariates can be non-stationary, in which case the coefficients are referred to as time-varying coefficients (Hastie and Tibshirani (1993)). With the logit, or log-odds, transformation of $F(t)$, a logistic regression model for survival time data is obtained as follows:

$$\log\frac{F(t)}{S(t)} = z(t \mid a) = \beta(t)'a. \tag{1}$$

Thus, the log-odds ratio for $a_j = 1$ relative to $a_j = 0$ at time $t$ can be expressed as

$$z(t \mid a_j = 1) - z(t \mid a_j = 0) = \beta_j(t), \tag{2}$$

and the odds ratio is given by $\exp\{\beta_j(t)\}$. The model in (1) can be regarded as an extension of the log-logistic model proposed by Bennet (1983), which uses the log-logistic distribution function for survival time and has a time-varying coefficient, linear in $\log t$, only for the constant covariate $a_1$, i.e., $\log\{F(t)/S(t)\} = \phi \log t + \beta'a$. Here we propose a model to evaluate the time-varying coefficients for the covariates in equation (1). We consider linear time-varying coefficients using the growth curve model presented in Satoh and Yanagihara (2010) for longitudinal data. Let $x(t)$ be a $(q-1)$th degree polynomial basis function for the varying coefficients $\beta(t)$, i.e.,

$$\beta(t)' = x(t)'\Theta. \tag{3}$$

Here, $x(t) = (1, t, t^2, \ldots, t^{q-1})'$ and $\Theta = (\theta_1, \ldots, \theta_p)$ is a $q \times p$ unknown regression coefficient matrix. Note that $x(t)$ does not need to be a polynomial basis function, but it must be a differentiable function of $t$.

2.2. Deriving maximum likelihood estimators of regression coefficients.

Assuming that the cdf $F(t)$ is differentiable, we can obtain the probability density function (pdf). Since (1) gives $F(t) = 1/[1 + \exp\{-z(t)\}]$, so that $dF/dz = F(t)S(t)$, the pdf is

$$f(t) = \frac{dF(t)}{dt} = F(t)S(t)\frac{dz(t)}{dt}. \tag{4}$$

From (1) and (3), it holds that

$$\frac{dz(t)}{dt} = \left\{\frac{d\beta(t)}{dt}\right\}'a = \dot{x}(t)'\Theta a, \tag{5}$$

where

$$\dot{x}(t) = \frac{dx(t)}{dt} = (0, 1, 2t, \ldots, (q-1)t^{q-2})'. \tag{6}$$

Note that the hazard function can be written as $f(t)/S(t) = F(t)\dot{x}(t)'\Theta a$. In most real situations, polynomial basis functions based on $t = \log(\tilde{t})$, where $\tilde{t}$ is the original survival time, can provide a better fit for survival data than those based on the original survival time itself, e.g., Bennet (1983).

Assume that each subject either experiences the event or is censored. That is, for subject $i$ we observe the pair $(t_i, \delta_i)$, $i = 1, \ldots, n$, where $t_i$ is the time of death or censoring and $\delta_i$ indicates whether the observation is uncensored ($\delta_i = 1$) or censored ($\delta_i = 0$). Then the likelihood function for the regression coefficients $\Theta$ can be expressed as

$$L(\Theta) = \prod_{i=1}^{n} f_i^{\delta_i} S_i^{1-\delta_i} = \prod_{i=1}^{n} \{F_i S_i \dot{z}_i\}^{\delta_i} S_i^{1-\delta_i}, \tag{7}$$

where $a_i$ is the covariate vector for subject $i$, $\dot{z}_i = \dot{x}(t_i)'\Theta a_i$, $f_i = f(t_i)$, $F_i = F(t_i)$ and $S_i = S(t_i)$. By maximizing the log-likelihood function with respect to $\Theta$, the maximum likelihood estimator $\hat{\Theta} = (\hat{\theta}_1, \ldots, \hat{\theta}_p)$ can be obtained. Let $\theta = \mathrm{vec}(\Theta) = (\theta_1', \ldots, \theta_p')'$ and $l(\theta) = \log L(\Theta)$. Then the estimator $\hat{\theta} = \mathrm{vec}(\hat{\Theta})$ satisfies $dl(\hat{\theta})/d\theta = 0_{qp}$, where the score is

$$\frac{dl(\theta)}{d\theta} = \sum_{i=1}^{n} \left\{\delta_i S_i w_i - F_i w_i + \delta_i \dot{z}_i^{-1} \dot{w}_i\right\}, \tag{8}$$

with $w_i = a_i \otimes x(t_i)$ and $\dot{w}_i = a_i \otimes \dot{x}(t_i)$, where $\otimes$ denotes the Kronecker product. Its Hessian matrix is given by

$$\frac{d^2 l(\theta)}{d\theta^2} = -\sum_{i=1}^{n} \left\{(1 + \delta_i) F_i S_i w_i w_i' + \delta_i \dot{z}_i^{-2} \dot{w}_i \dot{w}_i'\right\}. \tag{9}$$

Using the Newton-Raphson method, the maximum likelihood estimator $\hat{\theta}$ can be obtained from the following recurrence formula:

$$\theta_{m+1} = \theta_m - \left\{\frac{d^2 l(\theta_m)}{d\theta^2}\right\}^{-1} \frac{dl(\theta_m)}{d\theta}, \quad m = 0, 1, 2, \ldots. \tag{10}$$
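The score (8) and Hessian (9) can be checked numerically against finite differences of the log-likelihood from (7). The following is a minimal sketch, assuming NumPy is available. The function names, the toy data and the value of `theta` are hypothetical choices made here for illustration and are not taken from the paper.

```python
import numpy as np

def design(t, A, q):
    """Rows w_i = a_i (x) x(t_i) and wdot_i = a_i (x) xdot(t_i), polynomial basis."""
    k = np.arange(q)
    X = t[:, None] ** k                              # x(t) = (1, t, ..., t^(q-1))
    Xd = np.zeros_like(X)
    Xd[:, 1:] = k[1:] * t[:, None] ** (k[1:] - 1)    # xdot(t) = (0, 1, 2t, ...)
    W = np.array([np.kron(a, x) for a, x in zip(A, X)])
    Wd = np.array([np.kron(a, xd) for a, xd in zip(A, Xd)])
    return W, Wd

def loglik(theta, W, Wd, delta):
    """log L = sum_i [delta_i (log F_i + log zdot_i) + log S_i], from (7)."""
    z, zd = W @ theta, Wd @ theta
    log_F, log_S = -np.logaddexp(0.0, -z), -np.logaddexp(0.0, z)
    return np.sum(delta * (log_F + np.log(zd)) + log_S)

def score(theta, W, Wd, delta):
    """Gradient of the log-likelihood, equation (8)."""
    z, zd = W @ theta, Wd @ theta
    F = 1.0 / (1.0 + np.exp(-z))
    return W.T @ (delta * (1.0 - F) - F) + Wd.T @ (delta / zd)

def hessian(theta, W, Wd, delta):
    """Hessian of the log-likelihood, equation (9)."""
    z, zd = W @ theta, Wd @ theta
    F = 1.0 / (1.0 + np.exp(-z))
    c = (1.0 + delta) * F * (1.0 - F)
    return -(W.T @ (c[:, None] * W) + Wd.T @ ((delta / zd**2)[:, None] * Wd))

def central_diff(f, theta, h=1e-6):
    """Central-difference derivative of a scalar- or vector-valued f at theta."""
    cols = []
    for k in range(theta.size):
        e = np.zeros_like(theta)
        e[k] = h
        cols.append((f(theta + e) - f(theta - e)) / (2.0 * h))
    return np.array(cols).T

# Hypothetical toy data (not from the paper): log-times, event indicators, covariates.
t = np.log(np.array([1.0, 2.0, 3.0, 5.0, 8.0, 13.0, 21.0, 34.0]))
delta = np.array([1, 1, 0, 1, 0, 1, 1, 0], dtype=float)
A = np.array([[1, 0], [1, 1], [1, 0], [1, 1], [1, 1], [1, 0], [1, 1], [1, 0]], dtype=float)
W, Wd = design(t, A, q=2)
theta = np.array([-3.0, 2.0, -1.0, 0.1])   # vec(Theta); here zdot = 2 + 0.1 a > 0

g_analytic = score(theta, W, Wd, delta)
g_numeric = central_diff(lambda th: loglik(th, W, Wd, delta), theta)
H_analytic = hessian(theta, W, Wd, delta)
H_numeric = central_diff(lambda th: score(th, W, Wd, delta), theta)
print(np.max(np.abs(g_analytic - g_numeric)), np.max(np.abs(H_analytic - H_numeric)))
```

With $p = 2$ covariates (a constant and a binary indicator) and $q = 2$, each row of `W` has the layout $(1, t, a, at)$, which is the ordering implied by $\theta = \mathrm{vec}(\Theta)$.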

In (10), $\theta_0$ is an adequate initial value. Note that the negative inverse of the Hessian matrix can be used as an asymptotic covariance matrix of the maximum likelihood estimator $\hat{\theta}$, i.e.,

$$\Omega = \mathrm{Cov}(\hat{\theta}) \approx \left\{-\frac{d^2 l(\hat{\theta})}{d\theta^2}\right\}^{-1}. \tag{11}$$

We then have estimators for the linear time-varying coefficients,

$$\hat{\beta}(t)' = x(t)'\hat{\Theta} \quad\text{or}\quad \hat{\beta}_j(t) = x(t)'\hat{\theta}_j, \quad j \in \{1, \ldots, p\}. \tag{12}$$

From the properties of the maximum likelihood estimator under regularity conditions, e.g., Philippou and Roussas (1975), the estimators are asymptotically normal, $\hat{\beta}_j(t) - \beta_j(t) \sim N(0, \sigma_j^2(t))$, where $\sigma_j^2(t) = x(t)'\Omega_j x(t)$ and $\mathrm{Cov}(\hat{\theta}_j) = \Omega_j$ is the corresponding $q \times q$ diagonal block of $\Omega = (\Omega_{uv})$, $u, v = 1, \ldots, pq$, i.e., $\Omega_j = (\Omega_{uv})$, $u, v = (j-1)q + 1, \ldots, jq$.

2.3. Constructing simultaneous confidence intervals of time-varying coefficients.

Here, we construct a confidence interval for the linear time-varying coefficients, given by

$$I_{j,\alpha}(t \mid u_\alpha) = \left[\hat{\beta}_j(t) - u_\alpha \hat{\sigma}_j(t),\ \hat{\beta}_j(t) + u_\alpha \hat{\sigma}_j(t)\right]. \tag{13}$$

The covering probability of $I_{j,\alpha}(t \mid u_\alpha)$ depends on $u_\alpha$. For example, the pointwise confidence interval at a fixed time $t$ can be constructed by letting $u_\alpha = z_{\alpha/2}$, where $z_\alpha$ denotes the upper $100\alpha$ percentile of $N(0, 1)$. Note that the confidence interval $I_{j,\alpha}(t \mid z_{\alpha/2})$ satisfies $\Pr(\beta_j(t) \in I_{j,\alpha}(t \mid z_{\alpha/2})) \approx 1 - \alpha$ for a fixed time $t$. To construct a simultaneous confidence interval, we need to evaluate the distribution of the supremum of the Wald-type statistic $T_j(t) = \{\hat{\beta}_j(t) - \beta_j(t)\}/\sigma_j(t)$, but it is difficult to derive an explicit expression for the distribution of the supremum statistic in general. Here, we evaluate an upper bound of the supremum of $T_j(t)^2$ in the same manner as in Satoh and Yanagihara (2010). From the inequality in Rao (1973, p. 60), $\hat{\beta}_j(t)$ asymptotically satisfies the following relation:

$$\sup_{t \in \mathbb{R}} T_j(t)^2 = \sup_{t \in \mathbb{R}} \frac{\{x(t)'(\hat{\theta}_j - \theta_j)\}^2}{x(t)'\Omega_j x(t)} \le \sup_{x \in \mathbb{R}^q} \frac{\{x'(\hat{\theta}_j - \theta_j)\}^2}{x'\Omega_j x} = (\hat{\theta}_j - \theta_j)'\Omega_j^{-1}(\hat{\theta}_j - \theta_j) \sim \chi_q^2. \tag{14}$$

Note that the asymptotic distribution of the upper bound is $\chi_q^2$ and does not depend on the time $t$. Let $u_\alpha = \sqrt{c_{q,\alpha}}$, where $c_{q,\alpha}$ is the upper $100\alpha$ percentile of $\chi_q^2$. Then the covering probability of the confidence interval $I_{j,\alpha}(t \mid \sqrt{c_{q,\alpha}})$ satisfies

$$\Pr\left(\beta_j(t) \in I_{j,\alpha}(t \mid \sqrt{c_{q,\alpha}}),\ \forall t \in \mathbb{R}\right) \ge 1 - \alpha. \tag{15}$$

Based on equation (14), we can construct test statistics for the following null hypotheses for the time-varying coefficient $\beta_j(t)$:

$$\text{Uniformly zero: } H_0: \beta_j(t) = 0 \ \text{for } t \in \mathbb{R}; \qquad \text{Uniformly constant: } H_0: \beta_j(t) = \text{const. for } t \in \mathbb{R}. \tag{16}$$

The uniformly zero hypothesis is equivalent to $\theta_j = 0_q$. Under this hypothesis, the corresponding covariate $a_j$ has no effect on observations and the corresponding odds ratio is 1, i.e., $\exp\{\beta_j(t)\} = 1$. Using equation (14) with $\theta_j = 0_q$, the upper bound of the supremum of $T_j(t)^2$ is $W_j = \hat{\theta}_j'\Omega_j^{-1}\hat{\theta}_j \sim \chi_q^2$. Hence, $W_j$ can be used as a test statistic for the null hypothesis $H_0$. The uniformly zero hypothesis is rejected when $W_j > c_{q,\alpha}$, and the p-value can be obtained by $\Pr(\chi_q^2 > W_j)$. Note that the uniformly constant hypothesis is equivalent to $\theta_j^{(-1)} = 0_{q-1}$, where $\theta_j^{(-1)}$ is the $(q-1)$-dimensional vector obtained by excluding the first element of $\theta_j$. The first element is excluded because the corresponding element of $x(t)$ is equal to 1. Analogous to the test for the uniformly zero hypothesis, we can construct a test statistic and derive an asymptotic null distribution for the uniformly constant hypothesis.

3. Numerical example.

In this section, we consider the dataset of remission lengths (weeks) for acute leukemia patients in Table 1, which was reported by Freireich et al. (1963) and explained in Kleinbaum (2012). The data consist of a placebo group and a treatment group, each containing 21 patients. Our main concern is comparing the survival rates of the two groups. We considered the proposed model using the placebo group as a control group, and the covariate of the $i$th individual is expressed as $a_i = 1$ for the treatment group and $a_i = 0$ for the placebo group, where $i = 1, \ldots, n$ with $n = 21 \times 2 = 42$. Assuming the time-varying coefficient for the treatment effect to be a linear curve, the design vector is given by $x(t) = (1, t)'$ and its length is $q = 2$. Note that the survival time $t$ is the logarithm of the original length of remission. The maximum likelihood estimators and the asymptotic standard errors were calculated using (10) and (11), respectively, and are listed in Table 2. Hence, the estimated logistic regression model in (1) can be expressed as $\hat{\beta}_1(t) + \hat{\beta}_2(t)a$, where $\hat{\beta}_1(t) = -3.606 + 1.902t$ for the placebo group and $\hat{\beta}_2(t) = -1.764 - 0.218t$ for the treatment effect, i.e., $\hat{\beta}_1(t) + \hat{\beta}_2(t)$ for the treatment group. Figure 1 shows the fitted survival curves for each group. The proposed model seems to provide a good fit to the Kaplan-Meier curves.

Since the proposed model is based on logistic regression, the odds ratio for the treatment group relative to the placebo group can be expressed as $\exp\{\beta_2(t)\}$ (see Figure 2). The simultaneous confidence intervals were also derived using (15). The estimated time-varying odds ratio curve in Figure 2 stays around 0.1 throughout the observation period. In fact, the regression coefficient of $ta$ in Table 2 is not statistically significant ($p = 0.704 > 0.05$). The interaction term was therefore removed from the model of Table 2, and the corresponding estimates are given in Table 3. The treatment effect is statistically significant in Table 3, although it is not significant in Table 2. The estimated odds ratio in Table 3 is $\exp(-2.315) = 0.10$, and the curve in Figure 2 appears to be reasonably constant.
Applied to the remission time dataset, the proposed logistic regression model with time-varying coefficients provides a good fit to the data, and the more flexible model, which allows for non-stationary odds ratios, confirmed that the odds ratio was constant.
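The fit reported in Table 2 and the intervals and tests of Section 2.3 can be sketched in a few lines. This is a hedged illustration, assuming NumPy, and not the authors' code: the data are transcribed from Table 1, while the starting value, the step-halving safeguard, the grid `weeks`, all variable names, and the form of the uniformly constant statistic (a Wald statistic on the slope alone) are assumptions made here.

```python
import math
import numpy as np

# Remission lengths in weeks from Table 1, in ID order; 1 = relapse observed, 0 = censored ("+").
placebo = [1, 22, 3, 12, 8, 17, 2, 11, 8, 12, 2, 5, 4, 15, 8, 23, 5, 11, 4, 1, 8]
treated = [10, 7, 32, 23, 22, 6, 16, 34, 32, 25, 11, 20, 19, 6, 17, 35, 6, 13, 9, 6, 10]
treated_event = [1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0]

t = np.log(np.array(placebo + treated, dtype=float))   # log of remission length
delta = np.array([1] * 21 + treated_event, dtype=float)
a = np.array([0] * 21 + [1] * 21, dtype=float)

# w_i = a_i (x) x(t_i) with a_i = (1, a)' and x(t) = (1, t)': columns (Intercept), t, a, t*a.
W = np.column_stack([np.ones_like(t), t, a, a * t])
Wd = np.column_stack([np.zeros_like(t), np.ones_like(t), np.zeros_like(t), a])

def loglik(th):
    z, zd = W @ th, Wd @ th
    if np.any(zd <= 0):                 # zdot must be positive for a valid density
        return -np.inf
    return np.sum(delta * (np.log(zd) - np.logaddexp(0, -z)) - np.logaddexp(0, z))

def score_hessian(th):
    z, zd = W @ th, Wd @ th
    F = 1 / (1 + np.exp(-z))
    g = W.T @ (delta * (1 - F) - F) + Wd.T @ (delta / zd)               # equation (8)
    H = -(W.T @ (((1 + delta) * F * (1 - F))[:, None] * W)
          + Wd.T @ ((delta / zd**2)[:, None] * Wd))                     # equation (9)
    return g, H

theta = np.array([0.0, 1.0, 0.0, 0.0])  # assumed start with zdot = 1 > 0
for _ in range(100):
    g, H = score_hessian(theta)
    step = np.linalg.solve(H, g)        # Newton-Raphson direction of (10)
    lam = 1.0
    while lam > 1e-8 and loglik(theta - lam * step) < loglik(theta) - 1e-10:
        lam /= 2                        # step halving: a safeguard added here, not in (10)
    theta = theta - lam * step
    if np.max(np.abs(lam * step)) < 1e-10:
        break

Omega = np.linalg.inv(-score_hessian(theta)[1])    # asymptotic covariance (11)
se = np.sqrt(np.diag(Omega))

theta_trt, Omega_trt = theta[2:], Omega[2:, 2:]    # theta_2 and its q x q block Omega_2
weeks = np.array([1.0, 5.0, 10.0, 20.0, 35.0])
X = np.column_stack([np.ones_like(weeks), np.log(weeks)])
beta_trt = X @ theta_trt                           # beta_2(t) = x(t)' theta_2, as in (12)
sigma = np.sqrt(np.einsum("ij,jk,ik->i", X, Omega_trt, X))
u_point = 1.959964                                 # z_{0.025}
u_simul = math.sqrt(-2 * math.log(0.05))           # sqrt(c_{2,0.05}); chi^2_2 tail is exp(-x/2)
or_hat = np.exp(beta_trt)
band_lo = np.exp(beta_trt - u_simul * sigma)
band_hi = np.exp(beta_trt + u_simul * sigma)

W_zero = theta_trt @ np.linalg.solve(Omega_trt, theta_trt)   # uniformly zero, q = 2 d.f.
p_zero = math.exp(-W_zero / 2)
W_const = theta_trt[1] ** 2 / Omega_trt[1, 1]                # uniformly constant, 1 d.f.
p_const = math.erfc(math.sqrt(W_const / 2))
print(np.round(theta, 3), np.round(se, 3))
print(np.round(or_hat, 3), np.round(band_lo, 3), np.round(band_hi, 3))
```

The simultaneous multiplier `u_simul` exceeds the pointwise multiplier `u_point`, so the simultaneous band is wider at every time point. For $q = 2$ the $\chi^2_2$ tail probability has the closed form $\exp(-x/2)$, which is why no statistical library is needed here.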

4. Simulation.

We obtained our estimates for the model parameters using the Newton-Raphson method, as defined by the recurrence formula (10). The estimates will converge if the initial value $\theta_0$ is sufficiently close to the maximum likelihood estimator $\hat{\theta}$, since $dl(\hat{\theta})/d\theta = 0_{qp}$ (see McCullagh and Nelder (1989)). To elucidate the behavior of the estimator we investigated: 1) how quickly the estimator converged as the number of iterations increased, and 2) the influence of the initial guess for the estimator on the convergence. For our simulations, we used the parameter estimates in Table 3, which were fitted to the example shown in Table 1. Therefore, the initial values can be expressed as $\theta_0 = (\theta_{01}, \theta_{02}, \theta_{03})'$. The regression coefficients $\theta_{01}$ and $\theta_{02}$ were fixed at $-3.463$ and $1.830$, respectively, based on the values in Table 3, and the coefficient $\theta_{03}$ was simulated from the uniform distribution $U(-4, 0)$, whose values are relatively close to the maximum likelihood estimate $\hat{\theta}_3 = -2.315$ given in Table 3. Thus, as shown in Figure 3, 1,000 initial values were simulated from the uniform distribution and the Newton-Raphson method was applied 20 times for each initial value. All estimators successfully converged, and the converged values were almost the same as the maximum likelihood estimate. As for the convergence rate, fewer than five iterations were needed until convergence. From the results of the simulations, the Newton-Raphson method seems to be suitable for obtaining the maximum likelihood estimators when the initial values are sufficiently close to the true values. Therefore, it is important to try different initial values and to confirm the likelihood value in (7) for the obtained estimators.

5. Conclusion.

We proposed a logistic regression survival model with time-varying coefficients. The maximum likelihood estimators and their asymptotic covariance matrix were calculated iteratively by the Newton-Raphson method.
In our model, the odds ratio can be expressed as a function of time, and its simultaneous confidence intervals were also considered. The simulation study showed that the maximum likelihood estimator, and hence the odds ratio, can be obtained when the initial values are close to the true values. The model provided a good fit when applied to a real dataset, and it was confirmed that the odds ratio is constant in time. Besides providing a test of stationarity for the odds ratio, our proposed model might also be useful for modeling odds ratios which are non-stationary.

References

Bennet, S. (1983). Log-logistic regression models for survival data. Journal of Applied Statistics, 32, 165-171.
Cox, D. R. (1972). Regression models and life-tables. Journal of the Royal Statistical Society, Series B, 34, 187-220.
Freireich, E. O. et al. (1963). The effect of 6-mercaptopurine on the duration of steroid-induced remissions in acute leukemia. Blood, 21, 699-716.
Hastie, T. and Tibshirani, R. (1993). Varying-coefficient models. Journal of the Royal Statistical Society, Series B, 55, 757-796.
Kleinbaum, D. G. (2012). Survival Analysis, 3rd ed. Springer, New York.
McCullagh, P. and Nelder, J. A. (1989). Generalized Linear Models, 2nd ed. Chapman and Hall/CRC, London.
Philippou, A. N. and Roussas, G. G. (1975). Asymptotic normality of the maximum likelihood estimate in the independent not identically distributed case. Annals of the Institute of Statistical Mathematics, 27, 45-55.
Rao, C. R. (1973). Linear Statistical Inference and Its Applications. John Wiley, New York.

Satoh, K. and Yanagihara, H. (2010). Estimation of varying coefficients for a growth curve model. American Journal of Mathematical and Management Sciences, 30, 243-256.
Satoh, K. and Tonda, T. (2016). Estimating regression coefficients for balanced growth curve model when time trend of baseline is not specified. American Journal of Mathematical and Management Sciences, in press.

Table 1. Length of remission dataset by Freireich et al. (1963). Lengths are in weeks; "+" denotes a censored observation.

ID   Placebo   Treatment      ID   Placebo   Treatment
 1      1        10           11      2        11+
 2     22         7           12      5        20+
 3      3        32+          13      4        19+
 4     12        23           14     15         6
 5      8        22           15      8        17+
 6     17         6           16     23        35+
 7      2        16           17      5         6
 8     11        34+          18     11        13
 9      8        32+          19      4         9+
10     12        25+          20      1         6+
                              21      8        10+

Table 2. Estimates of regression coefficients.

Variables      Estimate   Std. Error   $\chi^2_1$   p-value
(Intercept)     -3.606      0.779        21.43       0.000
t                1.902      0.343        30.76       0.000
a               -1.764      1.567         1.27       0.260
t a             -0.218      0.575         0.15       0.704

Table 3. Estimates of regression coefficients when the treatment effect is constant in time.

Variables      Estimate   Std. Error   $\chi^2_1$   p-value
(Intercept)     -3.463      0.661        27.41       0.000
t                1.830      0.275        44.37       0.000
a               -2.315      0.628        13.61       0.000
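The $\chi^2_1$ and p-value columns of Tables 2 and 3 can be recomputed from the printed estimates and standard errors as Wald statistics, $(\text{estimate}/\text{s.e.})^2$. The sketch below uses only the Python standard library. The dictionary and function names are hypothetical, and the pointwise 95% interval for the constant odds ratio is an illustration added here, since the paper does not report it. Small discrepancies from the tables are expected because the inputs are rounded to three decimals.

```python
import math

# (estimate, standard error) pairs copied from Tables 2 and 3.
table2 = {"(Intercept)": (-3.606, 0.779), "t": (1.902, 0.343),
          "a": (-1.764, 1.567), "t a": (-0.218, 0.575)}
table3 = {"(Intercept)": (-3.463, 0.661), "t": (1.830, 0.275), "a": (-2.315, 0.628)}

def wald(estimate, se):
    """Wald chi-square on 1 d.f. and its p-value, P(chi^2_1 > x) = erfc(sqrt(x / 2))."""
    chi2 = (estimate / se) ** 2
    return chi2, math.erfc(math.sqrt(chi2 / 2.0))

wald2 = {name: wald(*row) for name, row in table2.items()}
wald3 = {name: wald(*row) for name, row in table3.items()}

# Constant odds ratio from Table 3 with a pointwise 95% interval (z_{0.025} = 1.959964).
est, se = table3["a"]
odds_ratio = math.exp(est)
or_ci = (math.exp(est - 1.959964 * se), math.exp(est + 1.959964 * se))

for label, results in (("Table 2", wald2), ("Table 3", wald3)):
    for name, (chi2, p) in results.items():
        print(f"{label}  {name:12s} chi2 = {chi2:6.2f}  p = {p:.3f}")
print(f"odds ratio = {odds_ratio:.3f}, 95% CI = ({or_ci[0]:.3f}, {or_ci[1]:.3f})")
```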

Figure 1. Fitted survival curves based on the logistic regression model. (Axes: survival probability against weeks; curves: treatment, placebo, and Kaplan-Meier.)

Figure 2. The estimated time-varying odds ratio curve and its 95% simultaneous confidence intervals. (Axes: odds ratio, from 0.01 to 10.00, against weeks; curves: estimated OR and 95% C.I.)

Figure 3. Convergence of the regression coefficients with different initial values, when using the Newton-Raphson method. The true value is $-2.315$. (Axes: estimates against iterations of the Newton-Raphson method.)
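The experiment of Section 4, summarized in Figure 3, can be re-created along the following lines. This is a sketch under stated assumptions rather than the authors' simulation code: it assumes NumPy, uses the Table 1 data with the constant-effect model of Table 3, and the random seed, the convergence tolerance `1e-6` and all names are choices made here. The iteration counts it reports depend on that tolerance.

```python
import numpy as np

# Remission lengths in weeks from Table 1, in ID order; 1 = relapse observed, 0 = censored ("+").
placebo = [1, 22, 3, 12, 8, 17, 2, 11, 8, 12, 2, 5, 4, 15, 8, 23, 5, 11, 4, 1, 8]
treated = [10, 7, 32, 23, 22, 6, 16, 34, 32, 25, 11, 20, 19, 6, 17, 35, 6, 13, 9, 6, 10]
treated_event = [1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0]

t = np.log(np.array(placebo + treated, dtype=float))
delta = np.array([1] * 21 + treated_event, dtype=float)
a = np.array([0] * 21 + [1] * 21, dtype=float)

# Model of Table 3: z = theta_1 + theta_2 t + theta_3 a, so zdot = theta_2.
W = np.column_stack([np.ones_like(t), t, a])
Wd = np.column_stack([np.zeros_like(t), np.ones_like(t), np.zeros_like(t)])

def newton_path(theta0, n_iter=20):
    """Apply the plain recurrence (10) n_iter times and return every iterate."""
    theta = np.array(theta0, dtype=float)
    path = [theta.copy()]
    for _ in range(n_iter):
        z, zd = W @ theta, Wd @ theta
        F = 0.5 * (1.0 + np.tanh(z / 2.0))            # logistic cdf, overflow-safe form
        g = W.T @ (delta * (1 - F) - F) + Wd.T @ (delta / zd)
        H = -(W.T @ (((1 + delta) * F * (1 - F))[:, None] * W)
              + Wd.T @ ((delta / zd**2)[:, None] * Wd))
        theta = theta - np.linalg.solve(H, g)
        path.append(theta.copy())
    return np.array(path)

rng = np.random.default_rng(0)                         # seed chosen here for reproducibility
starts = rng.uniform(-4.0, 0.0, size=1000)             # theta_03 ~ U(-4, 0)
paths = np.array([newton_path([-3.463, 1.830, s]) for s in starts])   # shape (1000, 21, 3)
final = paths[:, -1, :]

# First iteration at which theta_3 changes by less than the assumed tolerance 1e-6.
change = np.abs(np.diff(paths[:, :, 2], axis=1))
iters_needed = np.argmax(change < 1e-6, axis=1) + 1
print(final[:, 2].min(), final[:, 2].max(), iters_needed.max())
```

Plotting `paths[:, :6, 2]` against the iteration index would give a picture in the spirit of Figure 3.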