Survival Analysis for Case-Cohort Studies

Petr Klášterecký
Dept. of Probability and Mathematical Statistics, Faculty of Mathematics and Physics, Charles University, Prague, Czech Republic
e-mail: petr.klasterecky@matfyz.cz

Abstract: This paper gives an overview of some data analysis methods commonly used in survival analysis and their modifications for case-cohort studies. Most attention is paid to the Cox proportional hazards model for studying relative risk in failure-time studies.

Key Words: survival analysis, case-cohort studies, Cox proportional hazards model, partial likelihood

Introduction

In epidemiological and clinical research, so-called cohort studies are routinely conducted to assess the effects of various covariates on a failure time. After being randomly selected, the cohort of subjects is typically followed for a given time period, the covariates are measured according to the study design, and the failure or censoring time is recorded for each individual. The analysis is then performed using classical techniques such as Cox regression, accelerated failure time models, etc. The main drawback of this type of study is its cost when a large number of subjects must be followed, since the covariates of interest may include, for example, expensive medical laboratory tests. If, moreover, the disease prevalence in the population is very low, a very large cohort study would be needed to enroll a sufficient number of cases (i.e., subjects experiencing the event of interest) for the analysis to be meaningful at all.

Case-cohort studies were introduced by R.L. Prentice [Prentice] in 1986 in order to reduce this cost by, roughly speaking, observing far fewer subjects. Rather than following the whole cohort, a random sample of individuals called the subcohort is selected and followed. In addition, any subject who experiences the event of interest enters the study at that point. In this way the cases are sampled with probability one and their covariate values are recorded at (and potentially also after) their failure times, while only a relatively small number of the remaining individuals are followed from the beginning of the study. The setup introduced in [Prentice] also implies that no covariate information is available for a case before its failure time unless that case happens to be sampled into the subcohort. Case-cohort data are then typically analysed using modifications of the full-data partial likelihood.

Survival analysis overview

Let us first introduce some standard notation, as used e.g. in [Fleming and Harrington]: suppose there are n subjects in the study and denote, for each i = 1, ..., n, by $T_i$ a (non-negative) random failure time, by $C_i$ a (positive) censoring time of the i-th individual, by $X_i = \min(T_i, C_i)$ the censored failure time, by $\delta_i = I_{[X_i = T_i]}$ the failure indicator, and by $Z_i(t)$ a possibly time-dependent p-dimensional vector of covariates. Note that we are only able to observe the triplets $(X_i, \delta_i, Z_i(t))$, $i = 1, \ldots, n$. Let us further introduce the counting process $N(t) = I_{[T \le t,\, \delta = 1]}$, indicating whether the event of interest has been observed by time t, and the at-risk process $Y(t) = I_{[X \ge t]}$. Note that $Y(t) = 0$ implies that a potential failure at time t cannot be observed. The important inference questions in this setting most often concern relative risk estimation and the conditional distribution of the failure time given the covariates.
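To fix the notation, here is a minimal Python/NumPy sketch of this data structure; the simulated exponential failure and censoring distributions, and all variable names, are illustrative assumptions rather than anything prescribed by the paper.

```python
# Minimal sketch of the observed-data structure (X_i, delta_i, Z_i) and the
# processes N(t), Y(t); the simulated distributions are illustrative only.
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 2

Z = rng.normal(size=(n, p))                        # covariate vectors Z_i
beta_true = np.array([0.5, -0.3])
T = rng.exponential(scale=np.exp(-Z @ beta_true))  # failure times T_i
C = rng.exponential(scale=2.0, size=n)             # censoring times C_i

X = np.minimum(T, C)          # censored failure time X_i = min(T_i, C_i)
delta = (T <= C).astype(int)  # failure indicator delta_i = I[X_i = T_i]

def at_risk(t):
    """At-risk process Y_i(t) = I[X_i >= t], for all subjects at once."""
    return (X >= t).astype(int)

def counting(t):
    """Counting process N_i(t) = I[T_i <= t, delta_i = 1]."""
    return ((X <= t) & (delta == 1)).astype(int)
```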

The Cox proportional hazards model

The concepts of proportional hazards models and partial likelihood were first presented by D.R. Cox in [Cox 1972] and [Cox 1975]. Since then, these results have been reviewed in many books and we shall only point out the basic ideas here. Assume first, for simplicity, that no covariates depend on time and let $S(t \mid Z) = P[T > t \mid Z]$ be the conditional survival function. The conditional hazard function is then defined by

$\lambda(t \mid Z) = \lim_{h \to 0+} \frac{1}{h} P[t \le T < t + h \mid T \ge t, Z].$

Now suppose that, for any two covariate values $Z_1$ and $Z_2$, the associated failure rates have a fixed ratio over time. Then $\lambda(t \mid Z_1) = k(Z_1, Z_2)\, \lambda(t \mid Z_2)$, where k is a nonnegative real-valued function which does not depend on time: the hazard rates are proportional. If we denote $\lambda_0(t) = \lambda(t \mid Z = 0)$ (usually called the baseline hazard rate), then

(1) $\lambda(t \mid Z) = \lambda_0(t)\, k(Z, 0) = \lambda_0(t)\, g(Z).$

The function g(z) must be nonnegative (we are modelling a hazard rate) with g(0) = 1, and its dependence on Z should not be too complicated in order to keep the model practically useful. The dependence on Z is thus most often modelled through a linear term $\beta' Z = \beta_1 Z_1 + \ldots + \beta_p Z_p$, while g is usually taken to be $g(x) = e^x$. The model (1) then becomes

(2) $\lambda(t \mid Z) = \lambda_0(t) \exp(\beta' Z),$

which is the most common form of the proportional hazards model. For each $\beta_j$, $j = 1, \ldots, p$, the quantity $e^{\beta_j}$ directly gives the relative risk of failure for an individual with covariate vector $(Z_1, \ldots, Z_{j-1}, Z_j + 1, Z_{j+1}, \ldots, Z_p)$ as compared to an individual with covariates $(Z_1, \ldots, Z_{j-1}, Z_j, Z_{j+1}, \ldots, Z_p)$. The conditional survival function $S(t \mid Z)$ is now modelled as

$S(t \mid Z) = \{S_0(t)\}^{\exp(\beta' Z)}$

because of the relationship $S(t \mid Z) = \exp\left(-\int_0^t \lambda(u \mid Z)\, du\right)$. Note that time-dependent covariates can be introduced into the proportional hazards model simply through the function

(3) $\lambda(t \mid Z(t)) = \lim_{h \to 0+} \frac{1}{h} P[t \le T < t + h \mid T \ge t, Z(t)].$

However, the function $\lambda(t)$ defined in (3) can no longer be viewed as a traditional hazard function and cannot be used to compute a conditional survival function $S(t \mid Z(t))$.
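As a small worked illustration of (2) and of the hazard-survival relationship above, the following sketch assumes a Weibull baseline hazard; the baseline choice is an arbitrary assumption made only for the example.

```python
# Numerical illustration of model (2) and the survival relationship, under
# an assumed Weibull baseline hazard lambda_0(t) = a * t^(a-1), so that
# S_0(t) = exp(-t^a); the baseline choice and all names are illustrative.
import numpy as np

def hazard(t, z, beta, a=1.5):
    """lambda(t | Z) = lambda_0(t) * exp(beta'z), equation (2)."""
    return a * t ** (a - 1.0) * np.exp(z @ beta)

def survival(t, z, beta, a=1.5):
    """S(t | Z) = {S_0(t)}^{exp(beta'z)}."""
    return np.exp(-t ** a) ** np.exp(z @ beta)

beta = np.array([0.5, -0.3])
print(np.exp(beta))  # e^{beta_j}: relative risk per unit increase in Z_j
```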

Partial likelihood

The vector of parameters $\beta$ has to be estimated from the observed data in the proportional hazards model (1) or (2), respectively. The method of maximum likelihood is a very popular method of estimating parameters in classical statistics, e.g. in generalized linear models. If, however, the unknown baseline hazard function $\lambda_0(t)$ is present, the full likelihood approach becomes very cumbersome. Therefore the so-called partial likelihood is used instead and is presented in this section. Suppose X is a random vector with a density $f_X(x, \theta)$, where $\theta = (\phi, \beta)$ is a vector parameter with $\phi$ a nuisance parameter. The full likelihood approach would require joint maximization over the whole parameter vector $\theta$, which may not be feasible in high dimensions. The idea behind the partial likelihood approach is that the joint likelihood may often be written as a product of two parts: one of them depends on $\beta$ only (it does not involve $\phi$) and only this part is used for inference about $\beta$; the other may depend on both $\beta$ and $\phi$, and this part of the information about $\beta$ is discarded. When this idea is applied to failure time variables with covariates, it follows that if there are no ties in the observed failure times, the partial likelihood is given by

$\prod_{i=1}^{L} \frac{\exp(\beta' Z_{(i)})}{\sum_{l \in R_i} \exp(\beta' Z_l)},$

where L denotes the total number of observed failure times, $R_i$ is the set of individuals who remain at risk at the i-th failure time (the risk set), and $Z_{(i)}$ is the covariate vector of the individual who fails as the i-th one. One can see that, apart from the covariates, the only information preserved and used from the original data are in fact the ranks of the observed failure times. All other information (including, for example, the actual failure/censoring times) is discarded. A more general formula allowing for ties in the observed failure times or for stratification can be found in [Fleming and Harrington], pp. 143-144. Using the counting process notation, the general formula for the partial likelihood (allowing ties and time-dependent covariates) is given by

(4) $\prod_{i=1}^{n} \prod_{s \ge 0} \left\{ \frac{Y_i(s) \exp(\beta' Z_i(s))}{\sum_{j=1}^{n} Y_j(s) \exp(\beta' Z_j(s))} \right\}^{\Delta N_i(s)}.$

Statistical inference based on partial likelihood

Parameter estimates $\hat{\beta}$ can be obtained by setting the partial derivatives of (4) with respect to $\beta$ equal to 0 and solving the resulting system of equations. In order to keep the formulas as compact as possible, let us introduce the following notation:

$S^{(0)}(\beta, t) = n^{-1} \sum_{i=1}^{n} Y_i(t) \exp(\beta' Z_i(t)),$

$S^{(1)}(\beta, t) = n^{-1} \sum_{i=1}^{n} Z_i(t) Y_i(t) \exp(\beta' Z_i(t)).$

The partial likelihood (4) may also be viewed as the value at $t = \infty$ of the process

$L(\beta, t) = \prod_{i=1}^{n} \prod_{0 \le u \le t} \left\{ \frac{Y_i(u) \exp(\beta' Z_i(u))}{\sum_{j=1}^{n} Y_j(u) \exp(\beta' Z_j(u))} \right\}^{\Delta N_i(u)},$

implying that the score vector $U(\beta) = \frac{\partial}{\partial \beta} \log L(\beta, \infty)$ is the value at $t = \infty$ of the process

(5) $U(\beta, t) = \sum_{i=1}^{n} \int_0^t \{Z_i(u) - E(\beta, u)\}\, dN_i(u),$ where $E(\beta, t) = \frac{S^{(1)}(\beta, t)}{S^{(0)}(\beta, t)}$

can be thought of, roughly speaking, as the expected covariate vector among the subjects still in the risk set at time t. In this way one can view the score vector as a comparison of the observed and expected covariate values. Using martingale methods one can prove, under certain regularity conditions, consistency of the maximum partial likelihood estimator $\hat{\beta}$ and weak convergence of the normalized score process to a Gaussian process, so that asymptotic results similar to those of classical maximum likelihood estimation are available.
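The following sketch turns (5) into code: it sums the contributions $\{Z_i - E(\beta, t)\}$ over the observed failure times (assuming no ties) and solves $U(\beta) = 0$ by Newton-Raphson. It reuses the arrays X, delta, Z from the first sketch and is a minimal illustration under those assumptions, not a production implementation.

```python
# Sketch of the score process (5) and a Newton-Raphson solve of U(beta) = 0,
# assuming no tied failure times; reuses X, delta, Z from the first sketch.
import numpy as np

def cox_score_info(beta, X, delta, Z):
    """Score U(beta) and observed information, summed over failure times.

    At each observed failure time t, E(beta, t) = S^(1)/S^(0) is the
    weighted covariate average over the risk set {l : X_l >= t}."""
    w_all = np.exp(Z @ beta)
    p = Z.shape[1]
    U, info = np.zeros(p), np.zeros((p, p))
    for i in np.flatnonzero(delta == 1):
        risk = X >= X[i]
        w = w_all[risk]
        S0 = w.sum()                      # n * S^(0)(beta, t)
        S1 = w @ Z[risk]                  # n * S^(1)(beta, t)
        E = S1 / S0                       # E(beta, t)
        U += Z[i] - E                     # contribution {Z_i - E(beta, t)}
        S2 = (w[:, None] * Z[risk]).T @ Z[risk]
        info += S2 / S0 - np.outer(E, E)  # minus Hessian of log L(beta)
    return U, info

def fit_cox(X, delta, Z, tol=1e-8, max_iter=25):
    """Maximum partial likelihood estimate via Newton-Raphson."""
    beta = np.zeros(Z.shape[1])
    for _ in range(max_iter):
        U, info = cox_score_info(beta, X, delta, Z)
        step = np.linalg.solve(info, U)
        beta += step
        if np.max(np.abs(step)) < tol:
            break
    return beta
```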

Modifications for case-cohort studies

The usual methods based on the partial likelihood require knowledge of the covariates for all subjects, which of course is not available under the case-cohort design. Recall that the only information that can be used here consists of the complete covariate histories of the controls (and of the cases, depending on further assumptions), the covariate values of the cases at their failure times, and possibly some additional values such as age, gender or database entries available for the whole cohort. It is thus clear that some modifications are needed to adapt the methods described above to the case-cohort problem. Only the cases contribute directly to the summation in (5), whereas the controls' influence is hidden in the at-risk covariate averages $E(\beta, t)$. It is, however, this value of E that causes trouble, because under the case-cohort design it involves unobserved data and thus cannot be calculated. Instead, so-called pseudoscores, which rely on the available data only, are used:

(6) $U^*(\beta, t) = \sum_{i=1}^{n} \int_0^t \{Z_i(u) - E^*(\beta, u)\}\, dN_i(u),$

where the case-cohort at-risk average is given by $E^*(\beta, t) = S^{*(1)}(\beta, t) / S^{*(0)}(\beta, t)$ with

$S^{*(0)}(\beta, t) = n^{-1} \sum_{i=1}^{n} \varrho_i(t) Y_i(t) \exp(\beta' Z_i(t)),$

$S^{*(1)}(\beta, t) = n^{-1} \sum_{i=1}^{n} \varrho_i(t) Z_i(t) Y_i(t) \exp(\beta' Z_i(t)).$

The parameter vector $\beta$ is then estimated by solving the equation $U^*(\beta, \infty) = 0$. The estimators defined in this way were shown to be consistent and asymptotically normal, and their asymptotic variance can be estimated directly from the variance estimator of the score function $U(\beta)$ (see [Prentice] for special cases and [Kulich and Lin] for a general treatment). The asymptotic distribution theory is further studied in [Self and Prentice] using a combination of martingale and finite-population convergence results; among other things, sufficient conditions for validity of the asymptotic results for the full-cohort and case-cohort estimators are compared there. A robust approach to variance estimation is presented in [Barlow]; the variance estimator suggested there also falls into the class of so-called sandwich estimators. Yet another approach to estimating the variance of $\hat{\beta}$, utilizing the bootstrap, can be found in [Wacholder et al.]. All these results should be compared with extra caution, since the basic setup usually differs slightly from paper to paper. For example, the estimator studied in [Self and Prentice] differs slightly from the maximum pseudolikelihood estimator defined in [Prentice] because it is not based on the same pseudolikelihood function; both estimators are, however, shown to be asymptotically equivalent under quite general conditions.

The weights $\varrho_i(t)$ eliminate the subjects with incomplete data by setting $\varrho_i(t) = 0$ for such individuals; various proposals for the choice of $\varrho_i(t)$ have been published, leading to different parameter estimators. If we denote by $\alpha$ the sampling probability of each subject and set $\xi_i = 1$ if subject i was sampled and $\xi_i = 0$ otherwise, then the weights usually involve a factor $\xi_i / \alpha$, for two reasons: it eliminates subjects that were not selected into the subcohort, and at the same time it allows the natural interpretation that each selected subject represents a given proportion of individuals in the original cohort. If the true sampling probability $\alpha$ is unknown, its (possibly time-varying) estimator $\hat{\alpha}$ can be used; the simplest estimator of $\alpha$ is naturally the empirical proportion of sampled subjects. According to [Robins et al.], the resulting estimator can be more efficient when the estimator $\hat{\alpha}$ is used, even if the true value of $\alpha$ is known.
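A hedged sketch of the pseudoscore (6) follows. The function rho_at is a hypothetical user-supplied callable returning the weight vector $\varrho_i(t)$, so the same code covers any of the weighting schemes discussed below.

```python
# Hedged sketch of the pseudoscore (6): the risk-set averages E*(beta, t)
# use the weights rho_i(t) in place of full-cohort information.  `rho_at`
# is a hypothetical user-supplied callable returning the vector rho_i(t).
import numpy as np

def pseudo_score(beta, X, delta, Z, rho_at):
    """U*(beta, inf) = sum over observed failures of {Z_i - E*(beta, t)}."""
    w_all = np.exp(Z @ beta)
    U = np.zeros(Z.shape[1])
    for i in np.flatnonzero(delta == 1):
        t = X[i]
        w = rho_at(t) * (X >= t) * w_all  # rho_i(t) Y_i(t) exp(beta'Z_i(t))
        S0 = w.sum()                      # n * S*^(0)(beta, t)
        S1 = w @ Z                        # n * S*^(1)(beta, t)
        U += Z[i] - S1 / S0               # case i contributes at its failure
    return U
```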

In general there are two ways of treating the cases, resulting in two different classes of estimators referred to as N- and D-estimators, respectively. N-estimators treat the cases as if they were sampled with probability one at their failure time, but no covariates are recorded for them before that time unless they happen to fall into the subcohort. The subcohort is treated as a sample from all study subjects regardless of failure status, and the natural view of the whole procedure is prospective. Thus, the general weights of an N-estimator are given by

$\varrho_i(t) = \xi_i / \alpha(t)$ for $t < T_i$; $\varrho_i(T_i) = 1,$

where the individuals' sampling probability $\alpha(t)$ can be replaced by its estimator $\hat{\alpha}(t)$. With this approach, the cases are weighted by 1 while the controls are weighted by the inverse of their (estimated) sampling probabilities.

Unlike N-estimators, D-estimators work with the original cohort stratified according to failure status, the cases forming a separate stratum sampled with probability one. Given the failure status, the sampling probabilities $\alpha(t)$ apply to the set of controls only, and a D-estimator is defined by the weights

$\varrho_i(t) = \delta_i + (1 - \delta_i)\, \xi_i / \alpha(t),$

where the controls' sampling probabilities $\alpha(t)$ can again be estimated from the data. This approach can naturally be viewed as retrospective, mainly due to the stratification on failure status. Note that stratification on other discrete variables can be introduced without any fundamental complication; it will merely affect the sampling probabilities $\alpha_k(t)$, $k = 1, \ldots, K$, which may now differ across the K strata. As a consequence, the weights may also differ, and more care is needed in notation and computations.
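Both weighting schemes translate directly into code. In this sketch the sampling probability $\alpha$ is taken constant for simplicity (a time-varying or estimated $\hat{\alpha}(t)$ would enter the same way), and xi is the 0/1 subcohort-sampling indicator; all names are illustrative.

```python
# Hedged sketches of the N- and D-estimator weights, with a constant
# sampling probability alpha; xi is the 0/1 subcohort-sampling indicator.
import numpy as np

def n_weights(t, X, delta, xi, alpha):
    """N-estimator: rho_i(t) = xi_i/alpha for t < T_i, rho_i(T_i) = 1."""
    at_own_failure = (delta == 1) & np.isclose(X, t)
    return np.where(at_own_failure, 1.0, xi / alpha)

def d_weights(t, X, delta, xi, alpha):
    """D-estimator: rho_i(t) = delta_i + (1 - delta_i) xi_i/alpha."""
    return delta + (1 - delta) * xi / alpha
```

Either scheme can be plugged into the pseudo_score sketch above, e.g. rho_at = lambda t: d_weights(t, X, delta, xi, alpha).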

Concluding remarks

This paper presented a brief overview of some survival analysis methods used in case-cohort studies. The original Prentice estimator belongs to the class of N-estimators, and its basic asymptotic properties (such as asymptotic normality) were derived in [Prentice]. So far, little attention has been paid to certain topics, e.g. the effective use of the covariate values that are available for subjects not selected into the subcohort. A generalization addressing (among other things) this issue is the class of doubly-weighted estimators introduced and studied in [Kulich and Lin].

An interesting open problem is how to merge several subcohorts independently sampled from the same cohort. It is common that multiple subcohorts are drawn from a large study population to investigate the effects of different exposures on the same censored outcome. It would be useful to combine the covariate data sampled in this way in order to perform a single analysis of the joint effect of all such exposures. In this problem, only the cases have complete covariate information, and careful use of the available partial covariate information on the rest of the cohort will be necessary.

References

[Barlow] Barlow, W.E., Robust Variance Estimation for the Case-Cohort Design. Biometrics 50, pp. 1064-1072, 1994.
[Cox 1972] Cox, D.R., Regression Models and Life Tables (with discussion). Journal of the Royal Statistical Society, Series B 34, pp. 187-220, 1972.
[Cox 1975] Cox, D.R., Partial Likelihood. Biometrika 62, pp. 269-276, 1975.
[Fleming and Harrington] Fleming, T.R. and Harrington, D.P., Counting Processes and Survival Analysis. Wiley, New York, 1991.
[Kulich and Lin] Kulich, M. and Lin, D.Y., Improving the Efficiency of Relative-Risk Estimation in Case-Cohort Studies. JASA 99, 2004, to appear.
[Prentice] Prentice, R.L., A Case-Cohort Design for Epidemiologic Cohort Studies and Disease Prevention Trials. Biometrika 73, pp. 1-11, 1986.
[Robins et al.] Robins, J.M., Rotnitzky, A. and Zhao, L.P., Estimation of Regression Coefficients When Some Regressors Are Not Always Observed. JASA 89, pp. 846-866, 1994.
[Self and Prentice] Self, S.G. and Prentice, R.L., Asymptotic Distribution Theory and Efficiency Results for Case-Cohort Studies. Annals of Statistics 16, pp. 64-81, 1988.
[Wacholder et al.] Wacholder, S., Gail, M.H., Pee, D. and Brookmeyer, R., Alternative Variance and Efficiency Calculations for the Case-Cohort Design. Biometrika 76, pp. 117-123, 1989.