The influence of categorising survival time on parameter estimates in a Cox model

Anika Buchholz (1,2), Willi Sauerbrei (2), Patrick Royston (3)

1 Freiburger Zentrum für Datenanalyse und Modellbildung, Albert-Ludwigs-Universität Freiburg
2 Institut für Medizinische Biometrie und Medizinische Informatik, Universitätsklinikum Freiburg
3 MRC Clinical Trials Unit, London, UK

Funded by Deutsche Forschungsgemeinschaft, Research unit FOR 534
2 April 2007

Standard Cox model and its extension

Standard Cox model, with unspecified baseline hazard $\lambda_0(t)$:

$$\lambda(t \mid X) = \lambda_0(t) \exp(\beta_1 X_1 + \dots + \beta_p X_p)$$

Critical assumptions
- Linear effect of continuous covariates. Allow for non-linear covariate effects:

$$\lambda(t \mid X) = \lambda_0(t) \exp(\beta_1 f_1(X_1) + \dots + \beta_p f_p(X_p))$$

- Proportional hazards (PH). Allow for non-proportional hazards (time-varying effects):

$$\lambda(t \mid X) = \lambda_0(t) \exp(\beta_1(t) X_1 + \dots + \beta_p(t) X_p)$$

Extended Cox model relaxing both of the above assumptions:

$$\lambda(t \mid X) = \lambda_0(t) \exp(\beta_1(t) f_1(X_1) + \dots + \beta_p(t) f_p(X_p))$$
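The extended model can be illustrated with a minimal Python sketch. The baseline hazard cancels in a hazard ratio, so only the linear predictor is needed. The single covariate, the choice f(x) = log(x) and the form beta(t) = 0.5 - 0.2 log(t) are hypothetical values picked for illustration, not estimates from the Rotterdam data.

```python
import math

def log_relative_hazard(t, x, betas, transforms):
    """Linear predictor sum_j beta_j(t) * f_j(x_j) of the extended Cox model."""
    return sum(beta(t) * f(xj) for beta, f, xj in zip(betas, transforms, x))

# Hypothetical example: one covariate, f(x) = log(x), beta(t) = 0.5 - 0.2 * log(t)
betas = [lambda t: 0.5 - 0.2 * math.log(t)]
transforms = [math.log]

def hazard_ratio(t, x_a, x_b):
    """Hazard ratio at time t between covariate vectors x_a and x_b.

    The unspecified baseline hazard lambda_0(t) cancels in the ratio.
    """
    return math.exp(
        log_relative_hazard(t, x_a, betas, transforms)
        - log_relative_hazard(t, x_b, betas, transforms)
    )
```

With these assumed functions the hazard ratio between x = e and x = 1 is exp(0.5) at t = 1 and shrinks to 1 at t = exp(2.5), which is the kind of fading effect a time-varying coefficient is meant to capture.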

Causes for non-proportional hazards
- Effect changes over time
- Incorrect modelling
  - Omission of an important covariate
  - Incorrect functional form of a covariate
  - Different survival model is appropriate

Model selection strategy

Multivariable strategy for model selection needed to
- select variables which have influence on the outcome
- model the functional form of the influence of continuous variables
- model time-varying effects in case of non-PH

The Multivariable Fractional Polynomial Time approach combines
- backward elimination of variables
- a function selection procedure to select a function from the class of fractional polynomials (non-linear if sufficiently supported by the data)
- investigation of possible time-varying effects for each variable from a multivariable proportional hazards Cox model

Multivariable Fractional Polynomial Time (MFPT) algorithm

Stage 1: Determine time-fixed model M0
- Select model M0 using the MFP algorithm, assuming PH (full time period)

Stage 2: If necessary, add covariates with short-term effect only
- Start with model M0; keep variables and functions from M0
- Restrict the time period to (0, t), e.g. t defined by the first half of events
- Run the MFP algorithm for (0, t) and add, if necessary, significant covariates to M0. This gives a proportional hazards model M1.

[Figure: hazard (y-axis) against time (x-axis); label: non-significant]

[Figure: hazard (y-axis) against time (x-axis); label: significant]

Stage 3: Add possible time-varying effects of variables in M1
- Use a forward selection procedure to add significant time-varying effects to model M1.
- For each covariate of M1 in turn, investigate a time-varying effect β(t), adjusting for all other covariates of M1.
- This gives the final model M2.
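Stage 2 restricts follow-up to (0, t) with t "defined by the first half of events". A minimal Python sketch of one way to do this is given below. Both helper names are hypothetical, and rounding the half-way event up and censoring everyone still at risk at t are assumptions rather than details confirmed by the slides.

```python
def first_half_cutoff(times, events):
    """Return the event time by which half of all events have occurred."""
    event_times = sorted(t for t, e in zip(times, events) if e == 1)
    if not event_times:
        raise ValueError("no events observed")
    half = (len(event_times) + 1) // 2  # size of the first half, rounded up (assumption)
    return event_times[half - 1]

def restrict_to_period(times, events, cutoff):
    """Censor follow-up at the cutoff so that only the period (0, cutoff] is analysed."""
    restricted = []
    for t, e in zip(times, events):
        if t > cutoff:
            restricted.append((cutoff, 0))  # still at risk at the cutoff: censored there
        else:
            restricted.append((t, e))
    return restricted
```

The restricted data would then be passed to the stage 2 model selection, while stages 1 and 3 use the full time period.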

Rotterdam breast cancer series
- Breast cancer survival data with 2982 patients
- 1518 events for RFS (recurrence-free survival)
- 20 years max. follow-up
- 10 variables
- Median uncensored survival time: 2.5 years

[Figure: Kaplan-Meier survival estimate (0.00 to 1.00) against recurrence-free survival (years), Rotterdam breast cancer series]

Years          0     2     4     6     8    10    12    14    16    18
No. at risk 2982  2319  1805  1340   920   481   171    55    11     3

Development of the model

(+ = included, - = not included)

Variable                             Model M0   Model M1   Model M2
X1 (age)                                 +          +          +
X2 (menopausal status)                   -          -          -
X3a (tumour size > 20mm)                 +          +          +
X3b (tumour size > 50mm)                 -          +          +
X4 (tumour grade)                        +          +          +
X5^2 (no. of pos. lymph nodes)           +          +          +
log(X6) (progesterone receptor)          -          +          +
X7 (oestrogen receptor)                  -          -          -
X8 (hormonal therapy)                    +          +          +
X9 (chemotherapy)                        +          +          +
X3a (log(t))                                                   +
log(X6) (log(t))                                               +

- Model M0: Selected with MFP assuming PH, 4 variables eliminated
- Model M1: Add variables with short-term effect only
- Model M2: Add time-varying effects

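Enlargement of the data

The analysis of time-varying effects requires
- long-term follow-up
- large sample size

Why is enlargement necessary?

$$\ln L = \sum_{j=1}^{D} \left[ \sum_{k \in D_j} x_k \beta(t_{(j)}) - d_j \ln \left\{ \sum_{i \in R_j} \exp\bigl(x_i \beta(t_{(j)})\bigr) \right\} \right]$$

where $t_{(1)} < \dots < t_{(D)}$ are the distinct event times, $D_j$ is the set of $d_j$ individuals failing at $t_{(j)}$ and $R_j$ is the risk set at $t_{(j)}$. Because $\beta(t_{(j)})$ changes with $j$, each individual needs one record per event time at which they are at risk.

Enlargement of such data may cause computational problems:
. stsplit, at(failures)
gives about 2.2 million records in the Rotterdam data

Enlarged data
- may be difficult to manage for the analysis of one data set
- is nearly impossible for simulation studies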
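The growth in records can be sketched in a few lines of Python. The function name is hypothetical, and the counting rule (one record per distinct failure time strictly before a subject's exit time, plus the subject's final record) is an assumption about how splitting at failure times behaves, not a reimplementation of stsplit.

```python
from bisect import bisect_left

def records_after_split(times, events):
    """Count the records obtained when every subject is split at all distinct failure times."""
    failure_times = sorted({t for t, e in zip(times, events) if e == 1})
    # bisect_left gives the number of failure times strictly before the subject's exit time;
    # the subject's own final episode adds one more record.
    return sum(bisect_left(failure_times, t) + 1 for t in times)
```

Because almost every subject is at risk at many of the distinct event times, the count grows roughly with the product of sample size and number of distinct event times, which is why long follow-up and a large sample lead to millions of records.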

Possible solution: categorisation of survival time

Categorisation scheme
- Equidistant intervals (e.g. 6 month length)
  . stsplit period, at(.5(.5)20)
  results in only 35747 records
- Other categorisation schemes are possible, e.g. categorisation in quantiles

How to code categorised survival times
- Here: represent intervals by integers
- In clinical investigations: e.g. use the mean survival time within each interval

Example for categorised time in the enlarged data

. stsplit period, at(.5(.5)20)
(32765 observations (episodes) created)
. egen categorised_EFS = group(period)
. list categorised_EFS EFS_yrs event Patient_ID if Patient_ID==1

       catego~S   EFS_yrs   event   Patien~D
  1.          1        .5       .          1
  2.          2         1       .          1
  3.          3       1.5       .          1
  4.          4         2       .          1
  5.          5       2.5       .          1
  6.          6         3       .          1
  7.          7       3.5       .          1
  8.          8         4       .          1
  9.          9       4.5       .          1
 10.         10     4.925       0          1
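The integer coding in the listing above can be reproduced with a short Python sketch. The function names and the default width of 0.5 years are illustrative assumptions; the rule simply assigns a time to the k-th interval ((k-1) * width, k * width].

```python
import math

def interval_code(time_years, width=0.5):
    """Integer code of the equidistant interval that contains the survival time."""
    return max(1, math.ceil(time_years / width))

def episode_end_times(time_years, width=0.5):
    """End times of the episodes created by splitting follow-up at multiples of width."""
    n = interval_code(time_years, width)
    return [min(k * width, time_years) for k in range(1, n + 1)]
```

For the patient shown above, with a censored time of 4.925 years, this gives ten episodes ending at 0.5, 1, ..., 4.5 and 4.925, with the last one coded 10.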

Issues under investigation

Categorisation of survival time raises issues as to
- the number and position of cutpoints
- the loss of information
- the increased number of ties

We will
- consider interval lengths 1.5, 3, 6, 12 and 24 months
- compare parameter estimates obtained by stcox for the different interval lengths
- compare parameter estimates using the four methods of handling ties provided by stcox

Methods for handling ties in Stata
- breslow: approximation of the exact marginal log likelihood; fast but least accurate (default)
- efron: approximation of the exact marginal log likelihood; slower than breslow but more accurate
- exactm: exact marginal log likelihood; very slow
- exactp: exact partial log likelihood; very slow
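How the two approximations differ can be seen from the contribution of a single event time with tied failures. The Python sketch below uses the standard textbook forms of the Breslow and Efron terms; the function names are hypothetical, and eta denotes the linear predictor x * beta of each individual, with the tied failures included in the risk set.

```python
import math

def breslow_term(eta_events, eta_risk):
    """Breslow contribution of one event time: every tied failure uses the full risk set."""
    d = len(eta_events)
    return sum(eta_events) - d * math.log(sum(math.exp(e) for e in eta_risk))

def efron_term(eta_events, eta_risk):
    """Efron contribution: the l-th tied failure removes a share l/d of the tied
    individuals' weight from the risk-set sum."""
    d = len(eta_events)
    risk_sum = sum(math.exp(e) for e in eta_risk)
    tied_sum = sum(math.exp(e) for e in eta_events)
    return sum(eta_events) - sum(
        math.log(risk_sum - (l / d) * tied_sum) for l in range(d)
    )
```

Without ties (one failure per event time) the two terms coincide, so the methods can only diverge once categorisation creates many tied times.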

Influence of the length of categorisation interval

Differences of parameter estimates in percent relative to the original time; Breslow method for handling ties. The "Original time" column gives the parameter estimate itself.

                                 Original   Interval length (months)
                                 time          1.5       3       6      12      24
No. of records                   2982       138502   70000   35747   18649   10086
No. of distinct observed times   2183          155      78      39      20      10
X1 (age)                         -0.013       -0.4    -1.1    -1.9    -3.5    -8.5
X3a (tumour size > 20mm)          0.289       -0.5    -1.2    -2.5    -5.0   -11.2
X4 (tumour grade)                 0.390       -0.8    -1.4    -2.5    -6.0   -12.9
X5^2 (# pos. lymph nodes)        -1.713       -1.0    -2.1    -3.9    -7.8   -15.6
X8 (hormonal therapy)            -0.386       -1.3    -2.3    -3.4    -7.8   -13.9
X9 (chemotherapy)                -0.454       -1.3    -2.7    -4.7   -10.1   -22.0

2982 is the original number of observations (1419 ties); splitting the data at each event time gives approximately 2.2 million records.
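The percentages in this table can be read with the help of a small Python sketch. The formula 100 * (estimate - reference) / reference is an assumption about how the differences were computed; under it, a negative percentage means the estimate has moved towards zero, whatever the sign of the coefficient. The function names are hypothetical.

```python
def percent_difference(estimate, reference):
    """Difference of an estimate in percent relative to the reference estimate."""
    return 100.0 * (estimate - reference) / reference

def implied_estimate(reference, percent):
    """Estimate implied by a reported percent difference (inverse of the function above)."""
    return reference * (1.0 + percent / 100.0)
```

Dividing by the signed reference rather than its absolute value is what makes attenuation show up as a negative percentage for both positive and negative coefficients.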

Comparison of the methods of handling ties (original time)

Parameter estimates using the original time

Method                       Breslow     Efron       exactm      exactp
X1 (age)                     -0.0132     -0.0132     -0.0132     -0.0132
X3a (tumour size > 20mm)      0.2885      0.2886      0.2886      0.2886
X4 (tumour grade)             0.3900      0.3900      0.3900      0.3901
X5^2 (# pos. lymph nodes)    -1.7128     -1.7132     -1.7132     -1.7136
X8 (hormonal therapy)        -0.3857     -0.3859     -0.3858     -0.3859
X9 (chemotherapy)            -0.4539     -0.4540     -0.4540     -0.4541

Method of handling ties has no influence on parameter estimates

Comparison of the methods of handling ties (3 months)

Difference in parameter estimates in percent relative to analysis using original time
Interval length: 3 months

Method                       Breslow     Efron       exactm      exactp
X1 (age)                       -1.1       +1.4        +2.3        +3.4
X3a (tumour size > 20mm)       -1.2       +0.4        +1.1        +1.6
X4 (tumour grade)              -1.4       -0.2        +0.2        +1.7
X5^2 (# pos. lymph nodes)      -2.1        0.0        +0.7        +2.2
X8 (hormonal therapy)          -2.3       +0.4        +1.7        +3.4
X9 (chemotherapy)              -2.7        0.0        +1.1        +2.6

Comparison of the methods of handling ties (6 months)

Difference in parameter estimates in percent relative to analysis using original time
Interval length: 6 months
(a single value per variable is given for the exact methods)

Method                       Breslow     Efron       exactm / exactp
X1 (age)                       -1.9       +3.1           -79.2
X3a (tumour size > 20mm)       -2.5       +0.7           -86.9
X4 (tumour grade)              -2.5       -0.1           -84.1
X5^2 (# pos. lymph nodes)      -3.9        0.0           -76.9
X8 (hormonal therapy)          -3.4       +1.9           -79.5
X9 (chemotherapy)              -4.7       +0.4           -81.1

Comparison of run time of the methods of handling ties

Run time relative to Breslow using original time

Run time                                  Breslow      Efron     exactm     exactp
original time (2982 records)                 1          1.11      19.56    1237.85
3 month intervals (70000 records)           40.18      44.88      85.80    1675.15
split at failures (2.2 million records)   1926.25    1926.26    2067.26    3367.71

- Breslow and Efron similar
- exactm and exactp much more computationally demanding

Selection of model using mfp

Do stage 1 of the MFPT algorithm with all 10 candidate variables:
. mfp stcox x1 x2 x3a x3b x4b x5e x6 x7 x8 x9, select(0.01)
(breslow and efron methods only)

Identical model as in the original data
- for interval lengths 1.5, 3 and 6 months
- for both the breslow and efron methods for ties
with similar parameter estimates

Summary of results

- For a single analysis in one data set, categorisation is usually not required
- Categorisation may be sufficient for computer-intensive methods (simulations, bootstrap, cross-validation etc.)

In case of categorisation:
- Categorising long-term survival into 40-100 distinct values seems sensible:
  - Parameter estimates nearly identical
  - Loss of information seems negligible
- Handling ties:
  - Many distinct values: nearly identical results
  - Small(er) number of distinct values:
    - Exact methods break down
    - Breslow and Efron give acceptable results
- Breslow and Efron suitable for simulation studies, Efron slightly preferable

References

- Sauerbrei, W., Royston, P. and Look, M. (2007). A new proposal for multivariable modelling of time-varying effects in survival data based on fractional polynomial time-transformation. Biometrical Journal, in press.
- Buchholz, A., Sauerbrei, W. and Royston, P. (2006). Investigation of time-varying effects in survival analysis may require categorisation of time: does it matter? Submitted.

Program stmfpt available upon request from Patrick Royston (pr@ctu.mrc.ac.uk)

Thanks for your attention.


Required amount of memory for different interval lengths

Interval length (months)   No. of records   Amount of memory* (bytes)
Original time                       2,982                     146,118
Split at failures               2,220,499                 115,465,948
1.5                               138,502                   8,310,120
3                                  70,000                   4,200,000
6                                  35,747                   2,144,820
12                                 18,649                     988,397
24                                 10,086                     534,558

*data only, without overhead
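Dividing memory by the number of records shows that the memory requirement is driven almost entirely by the record count. The Python sketch below only redoes this arithmetic on the figures from the table; the dictionary name is arbitrary.

```python
# (number of records, bytes of memory) for each data layout, copied from the table
memory = {
    "Original time": (2982, 146118),
    "Split at failures": (2220499, 115465948),
    "1.5": (138502, 8310120),
    "3": (70000, 4200000),
    "6": (35747, 2144820),
    "12": (18649, 988397),
    "24": (10086, 534558),
}

# Bytes needed per record in each layout
bytes_per_record = {label: nbytes / n for label, (n, nbytes) in memory.items()}
```

Each record takes between 49 and 60 bytes in every layout, so the roughly 54-fold difference in memory between splitting at failures and 6 month intervals reflects the number of records rather than their size.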