Survival Analysis I (CHL5209H)

Survival Analysis
Dalla Lana School of Public Health, University of Toronto
olli.saarela@utoronto.ca
February 3, 2015

Time matching/risk set sampling/incidence density sampling/nested case-control design

Motivation

Suppose that a cohort (that is, a closed population) of size n has been recruited through participation in a health examination survey. Some baseline characteristics (e.g. blood pressure, cholesterol, BMI, smoking habits, prevalent health conditions and their treatment) are recorded on all cohort members. Denote these characteristics by X.

We are interested in whether particular genetic/biomarkers (denoted by Z), determined from plasma/serum samples collected at the examination, are associated with disease outcomes in the cohort. However, carrying out the measurements on all cohort members would be expensive, and no association study can be carried out at this point anyway. (Why?)

Solution: store (freeze) the biological material to wait for later use.

[Figure: Before the follow-up. Axes: follow-up years (0 to 10) and population (0 to 6000).]

[Figure: After 10 years of follow-up. Axes: follow-up years (0 to 10) and population (0 to 6000).]

[Figure: Case series (184 incident stroke events). Axes: follow-up years (0 to 10) and population (0 to 6000).]

[Figure: Zoom in to see more. Axes: follow-up years (0 to 10) and population (0 to 80).]

[Figure: Identify risk sets. Axes: follow-up years (0 to 10) and population (0 to 80).]

[Figure: Sample one control per case. Axes: follow-up years (0 to 10) and population (0 to 80).]

[Figure: Sample 5 controls per case. Axes: follow-up years (0 to 10) and population (0 to 80).]

[Figure: Or, sample the whole risk set. Axes: follow-up years (0 to 10) and population (0 to 80).]

Risk set sampling

From each risk set, m controls are selected randomly, and independently of the risk sets sampled earlier or later. Because of this, one individual can serve as a control for more than one case, and individuals with an outcome event can serve as controls before their event time. The case and the m controls are now matched by time.

Selecting more than m = 5 controls per case no longer substantially improves efficiency. This is why cost savings are possible through nested case-control designs.

The covariate measurements Z are collected on all individuals with a disease event of interest, and on the pooled set of individuals contributing the sampled controls. To ensure comparability of the measurements, the cases and controls may need to be further matched w.r.t. factors such as storage time, storage conditions, freeze-thaw cycles, and analytic batch (e.g. plate).
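The sampling scheme described above can be sketched in a few lines of base R. This is only an illustration under stated assumptions: the toy cohort, the column names (`id`, `time`, `event`) and the helper `sample_risk_sets` are hypothetical and do not come from the lecture.

```r
# Hypothetical toy cohort: exponential event times, administrative censoring at 10 years.
set.seed(1)
n <- 200
cohort <- data.frame(id = seq_len(n), time = pmin(rexp(n, rate = 0.05), 10))
cohort$event <- as.integer(cohort$time < 10)   # 1 = case, 0 = censored

# Sample m controls, independently for each case, from the risk set at its event time.
sample_risk_sets <- function(cohort, m) {
  cases <- cohort[cohort$event == 1, ]
  cases <- cases[order(cases$time), ]
  sets <- lapply(seq_len(nrow(cases)), function(k) {
    t_k <- cases$time[k]
    # Risk set at t_k: everyone still under follow-up, excluding the case itself.
    # Future cases are eligible, so a later case can serve as a control here.
    eligible <- cohort$id[cohort$time >= t_k & cohort$id != cases$id[k]]
    n_ctrl <- min(m, length(eligible))
    # Index into `eligible` to avoid sample()'s special handling of length-1 input.
    controls <- eligible[sample.int(length(eligible), n_ctrl)]
    data.frame(set = k, id = c(cases$id[k], controls),
               case = c(1L, rep(0L, n_ctrl)), time = t_k)
  })
  do.call(rbind, sets)
}

ncc <- sample_risk_sets(cohort, m = 5)
head(ncc)
```

Because each risk set is sampled independently, the same `id` may appear in several sets, which matches the remark that one individual can be a control for more than one case.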


Model

Suppose that we are interested in the parameter $\beta$ in the proportional hazards model

\[ \lambda_i(t) = \lambda_0(t) \exp\{\beta z_i + \gamma' x_i\}, \]

where $z_i$ is the genetic/biomarker measurement collected under the nested case-control design, and $x_i$ are other factors that we want to include in the model. How to estimate this model under the nested case-control design?

It turns out that a certain conditioning argument gives a very useful conditional likelihood expression. For deriving this expression, recall the notation of counting process jumps $dN_i(t)$ indicating whether an event occurred to individual $i$ at time $t$, and the history $\mathcal{F}_t$ representing all observed information just before $t$.

Likelihood contribution

Use now $i$ to index a sampled risk set, rather than an individual, with $i1, i2, \ldots$ indexing individuals within the sampled risk set. Consider the special case of a 1-1 matched design, where one control per case is selected randomly.

The likelihood contribution is given by the conditional probability of individual 1 in risk set $i$ experiencing an outcome event at time $t_i$, given that we know that one of the two individuals experienced an outcome event, that is,

\[ P(dN_{i1}(t_i) = 1 \mid dN_{i1}(t_i) + dN_{i2}(t_i) = 1, \mathcal{F}_{t_i}, \theta), \]

where $\theta$ represents the collection of all model parameters.

Likelihood contribution (2)

Using the basic rules of probability, and assuming the outcomes of the two individuals are independent, we can write this as

\[
\begin{aligned}
& P(dN_{i1}(t_i) = 1 \mid dN_{i1}(t_i) + dN_{i2}(t_i) = 1, \mathcal{F}_{t_i}, \theta) \\
&\quad = \frac{P(dN_{i1}(t_i) = 1, dN_{i2}(t_i) = 0 \mid \mathcal{F}_{t_i}, \theta)}{P(dN_{i1}(t_i) + dN_{i2}(t_i) = 1 \mid \mathcal{F}_{t_i}, \theta)} \\
&\quad = \frac{P(dN_{i1}(t_i) = 1 \mid \mathcal{F}_{t_i}, \theta) \, P(dN_{i2}(t_i) = 0 \mid \mathcal{F}_{t_i}, \theta)}{\sum_{(e_1, e_2) \in \{(0,1),(1,0)\}} P(dN_{i1}(t_i) = e_1 \mid \mathcal{F}_{t_i}, \theta) \, P(dN_{i2}(t_i) = e_2 \mid \mathcal{F}_{t_i}, \theta)}.
\end{aligned}
\]

Note further that the observed information $\mathcal{F}_{t_i}$ records the covariate values $\{z_{i1}, x_{i1}, z_{i2}, x_{i2}\}$ and that the two individuals are still at risk at this moment, that is, $\{N_{i1}(t_i) = 0, N_{i2}(t_i) = 0\}$.

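Likelihood contribution (2), continued

Substituting these in we get

\[
\begin{aligned}
& P(dN_{i1}(t_i) = 1 \mid dN_{i1}(t_i) + dN_{i2}(t_i) = 1, \mathcal{F}_{t_i}, \theta) \\
&\quad = \frac{P(dN_{i1}(t_i) = 1, N_{i1}(t_i) = 0 \mid z_{i1}, x_{i1}, \theta) \, P(dN_{i2}(t_i) = 0, N_{i2}(t_i) = 0 \mid z_{i2}, x_{i2}, \theta)}{\sum_{(e_1, e_2) \in \{(0,1),(1,0)\}} P(dN_{i1}(t_i) = e_1, N_{i1}(t_i) = 0 \mid z_{i1}, x_{i1}, \theta) \, P(dN_{i2}(t_i) = e_2, N_{i2}(t_i) = 0 \mid z_{i2}, x_{i2}, \theta)} \\
&\quad = \frac{\lambda_{i1}(t_i) S_{i1}(t_i) S_{i2}(t_i)}{\sum_{(e_1, e_2) \in \{(0,1),(1,0)\}} \lambda_{i1}(t_i)^{e_1} S_{i1}(t_i) \, \lambda_{i2}(t_i)^{e_2} S_{i2}(t_i)} \\
&\quad = \frac{\lambda_{i1}(t_i)}{\lambda_{i1}(t_i) + \lambda_{i2}(t_i)} \\
&\quad = \frac{\exp\{\beta z_{i1} + \gamma' x_{i1}\}}{\exp\{\beta z_{i1} + \gamma' x_{i1}\} + \exp\{\beta z_{i2} + \gamma' x_{i2}\}}.
\end{aligned}
\]

Note how the baseline hazard and the survival probabilities canceled out of the expression.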

Likelihood contribution (3)

If there are $d$ cases in total, the complete conditional likelihood expression is

\[ L(\beta, \gamma) = \prod_{i=1}^{d} \frac{\exp\{\beta z_{i1} + \gamma' x_{i1}\}}{\exp\{\beta z_{i1} + \gamma' x_{i1}\} + \exp\{\beta z_{i2} + \gamma' x_{i2}\}}. \tag{1} \]

The resulting expression should look familiar; if we sample the whole risk set, using the same conditioning argument, we would get

\[ L(\beta, \gamma) = \prod_{i=1}^{d} \frac{\exp\{\beta z_{i1} + \gamma' x_{i1}\}}{\sum_{l \in R(t_i)} \exp\{\beta z_{il} + \gamma' x_{il}\}}, \]

where $R(t_i)$ is the set of individuals at risk at the event time $t_i$. This is equivalent to the Cox partial likelihood.
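To make expression (1) concrete, the following base R sketch evaluates and maximizes it for the simplest case of a single covariate z and no additional factors x. The function name `cond_loglik` and the four matched pairs are hypothetical, made up purely for illustration.

```r
# Conditional log-likelihood (1) for 1-1 matched sets with a single covariate z.
# z_case[i] and z_ctrl[i] are the covariate values of the case and the control in set i.
cond_loglik <- function(beta, z_case, z_ctrl) {
  # log( exp(b*zc) / (exp(b*zc) + exp(b*zk)) ) = -log(1 + exp(b*(zk - zc)))
  -sum(log1p(exp(beta * (z_ctrl - z_case))))
}

# Hypothetical toy data for four matched pairs.
z_case <- c(1.2, 0.4, 2.0, 0.9)
z_ctrl <- c(0.3, 0.8, 1.1, 0.2)

# One-dimensional maximization over beta; extra arguments are passed on to cond_loglik.
fit <- optimize(cond_loglik, interval = c(-10, 10),
                z_case = z_case, z_ctrl = z_ctrl, maximum = TRUE)
fit$maximum   # estimated log-hazard ratio beta
```

Only the within-pair differences of z enter the function, which reflects the fact that the baseline hazard and anything shared within a matched set cancel out of (1).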

Expression (1) is called the conditional logistic regression likelihood. The reason for this is that in non-time-matched case-control designs, a likelihood expression of the same form would be obtained by substituting in a logistic regression model. In that case, the likelihood expression would be used to estimate log-odds ratios.

With time matching, the model substituted in the general probability expression is the proportional hazards survival model, and as a result, we can estimate log-hazard ratios. To sum up, with time-matched sampling of the controls, conditional logistic regression has nothing to do with logistic regression.

Also, confusingly, the terms time matching/risk set sampling/incidence density sampling/nested case-control design are used interchangeably in the literature. They all mean the same thing.

Conditional logistic regression in R

The conditional likelihood can be maximized using the clogit function of the R survival package,

    clogit(formula, data, weights, subset, na.action,
           method = c("exact", "approximate", "efron", "breslow"), ...)

which uses a model formula of the form

    case.status ~ exposure + strata(matched.set)

Here the strata variable identifies the time-matched risk sets. Ties in the data mean that more than one event occurred at the same time; the argument method specifies how the ties are handled.
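As a rough usage sketch, the call below fits the model to simulated 1-1 matched data. The data-generating step is an assumption made for illustration only: within each set the case is drawn with the probability given by the 1-1 likelihood contribution, using a made-up true value beta = 0.7. The variable names follow the formula shown above.

```r
library(survival)

set.seed(42)
n_sets <- 200
beta <- 0.7   # hypothetical true log-hazard ratio

# Two individuals per matched set, each with a covariate value.
z1 <- rnorm(n_sets)
z2 <- rnorm(n_sets)
# Probability that the first individual is the case, as in the 1-1 contribution.
p1 <- exp(beta * z1) / (exp(beta * z1) + exp(beta * z2))
first_is_case <- rbinom(n_sets, 1, p1)

# Long format: one row per individual, matched.set identifies the sampled risk set.
ncc <- data.frame(
  matched.set = rep(seq_len(n_sets), times = 2),
  exposure    = c(z1, z2),
  case.status = c(first_is_case, 1 - first_is_case)
)

fit <- clogit(case.status ~ exposure + strata(matched.set), data = ncc)
coef(fit)   # estimate of the log-hazard ratio for exposure
```

With exactly one case per stratum there are no ties, so the choice of method should not matter for this example.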

Exact method for ties

The conditioning approach generalizes if ties are present. For example, if two events occurred at time $t_i$ and one control was sampled from the risk set, we would get

\[
\begin{aligned}
& P\left( dN_{i1}(t_i) = 1, dN_{i2}(t_i) = 1 \,\middle|\, \sum_{l=1}^{3} dN_{il}(t_i) = 2, \mathcal{F}_{t_i}, \theta \right) \\
&\quad = \frac{e^{\eta_{i1}} e^{\eta_{i2}}}{e^{\eta_{i1}} e^{\eta_{i2}} + e^{\eta_{i1}} e^{\eta_{i3}} + e^{\eta_{i2}} e^{\eta_{i3}}}, \qquad \eta_{il} = \beta z_{il} + \gamma' x_{il}
\end{aligned}
\]

(homework). This is the so-called exact method for handling ties; the others are approximations.
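The tied-event probability can be checked numerically by enumerating every way of choosing the cases from the sampled risk set. The sketch below does this in base R; the function name `exact_tie_prob` and the linear predictor values are hypothetical, and it illustrates the formula rather than how clogit computes it internally.

```r
# Conditional probability that the individuals in `cases` are the ones with events,
# given how many events occurred in the sampled risk set.
# eta:   linear predictors beta*z + gamma'x for everyone in the sampled risk set
# cases: indices of the individuals who had the event
exact_tie_prob <- function(eta, cases) {
  r <- exp(eta)
  # Every candidate set of cases of the same size, one per column.
  subsets <- combn(length(eta), length(cases))
  denom <- sum(apply(subsets, 2, function(s) prod(r[s])))
  prod(r[cases]) / denom
}

eta <- c(0.5, -0.2, 0.1)   # hypothetical values for individuals 1, 2 and 3
exact_tie_prob(eta, cases = c(1, 2))
```

With three individuals and two events the denominator has three terms, one for each pair, matching the expression above. The number of terms grows combinatorially with the size of the risk set and the number of ties, which is one reason approximations are used.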