Practical tools for survival analysis with heterogeneity-induced competing risks

J. van Baardewijk (a), H. Garmo (b), M. van Hemelrijck (b), L. Holmberg (b), A.C.C. Coolen (a,c)

(a) Institute for Mathematical and Molecular Biomedicine, King's College London
(b) Cancer Epidemiology Group, King's College London
(c) London Institute for Mathematical Sciences

June 2013

Abstract: When censoring by non-primary risks is informative, many conventional survival analysis methods are not applicable: the observed primary risk hazard rates are no longer estimators of what they would have been in the absence of other risks. Recovering the decontaminated primary hazard rates and survival functions from survival data is called the competing risk problem. Most competing risks studies implicitly assume that risk correlations are induced by cohort or disease heterogeneity that was not captured by the covariates. If in addition one assumes that proportional hazards holds at the level of individuals, for all risks, one obtains a generic statistical description that allows us to handle the competing risk problem, and from which Cox regression, frailty and random effects models, and latent class models can all be recovered in special limits. From this we derive new practical tools for epidemiologists, such as formulae for decontaminated primary risk hazard rates and survival functions, and for the retrospective assignment of patients to cohort sub-classes (if these exist). Synthetic data confirm that our approach can map a cohort's substructure, and remove heterogeneity-induced false protectivity and false exposure effects. Application to survival data with prostate cancer as the primary risk (the ULSAM study) leads to plausible alternative explanations for previous counter-intuitive inferences.

Keywords: survival analysis; heterogeneity; competing risks

Contents

1 Introduction
2 Definitions and general identities
  2.1 Survival probability and crude cause-specific hazard rates
  2.2 Decontaminated cause-specific risk measures: the competing risk problem
3 Heterogeneity-induced competing risks
  3.1 Connection between cohort level and individual level descriptions
  3.2 Heterogeneous cohorts and different levels of risk complexity
  3.3 Implications of having heterogeneity-induced competing risks
  3.4 Canonical level of description for resolving heterogeneity-induced competing risks
  3.5 Estimation of W[h_0,...,h_R|z] from survival data
4 Parametrisation of W[h_0,...,h_R|z]
  4.1 Generic parametrisation
  4.2 Connection with conventional regression methods
  4.3 A simple latent class parametrisation for heterogeneity-induced competing risks
5 Application to synthetic survival data
  5.1 Cohort substructure and regression parameters
  5.2 Decontaminated survival functions
  5.3 Retrospective class identification
6 Applications to prostate cancer data
  6.1 Cohort substructure and regression parameters
  6.2 Decontaminated survival curves
7 Discussion
A Connection between cohort level and individual level cause-specific hazard rates
B Equivalence of formulae for data likelihood in terms of W[h_0,...,h_R|z]
C Connection with standard regression methods
D Numerical details

1 Introduction

For general introductions to the survival analysis literature we refer to the textbooks [1, 2, 3, 4]. The competing risk problem is the question of how to handle contamination, by informative censoring, of those primary risk characteristics that may be inferred from survival data [5], such as cause-specific hazard rates and survival curves. One would like to know their values in the hypothetical situation where all non-primary risks were disabled. This is nontrivial, since disabling non-primary risks will generally affect the hazard rate of the primary risk. If all risks have statistically independent event times, censoring is not informative and simple methods are available for analysis and regression, such as those of [6, 7]. Unfortunately, one cannot infer the presence or absence of risk correlations from survival data alone [8], and in many cases the independence assumption is expected to be incorrect. The importance of having reliable epidemiological tools for isolating statistical features even for interrelated comorbid diseases is increasingly recognised [9]. Unaccounted-for risk correlations can lead to incorrect inferences [10, 11, 12, 13, 14], and simulations with synthetic data confirm that uncritical use of simple methods can be quite misleading; see e.g. Figure 1.

Risk correlations are often fingerprints of residual heterogeneity, i.e. heterogeneity that is not visible in the covariates. A primary and a secondary disease could share molecular pathways, or be jointly influenced by factors that were not observed. Or a given disease could in fact be a spectrum of distinct diseases, each with specific covariate associations. Many authors have tried to model residual cohort heterogeneity, usually starting from Cox-type cause-specific hazard rates, but with additional individualised risk multipliers. If the multipliers do not depend on the covariates we speak of frailty models, e.g. [15, 16, 17, 18, 19], and regard them as representing the impact of unobserved covariates, see e.g. [20]. If they depend on the covariates we speak of random effects models, e.g. [21, 11, 22, 23, 24]. If the distribution of frailty factors takes the form of discrete clusters (latent classes, [25]), we obtain the latent class models; see e.g. [26] or [27] (which combines frailty and random effects with covariate-dependent class allocation as in [28]). Further variations include time-dependent frailty factors, and models in which the latent class of each individual is known. Most frailty and random effects studies, however, quantify only the hazard rate of the primary risk. They thereby capture some consequences of cohort heterogeneity, but without also modelling the non-primary risks it is fundamentally impossible to deal with the competing risk problem.

The approach of [29] focuses on parametrising the covariate-conditioned cumulative incidence function of the primary risk. It is conceptually similar to [7]; both model the primary risk profile in the presence of all risks. Cumulative incidence functions appear more intuitive than hazard rates; they are directly measurable, and also incorporate the impact of non-primary risks. However, expressing the data likelihood in terms of cumulative incidence functions is more cumbersome than in terms of hazard rates. But while [29] quantify risks that compete, they do not address the competing risk problem. Further developments involve e.g. alternative parametrisations [30, 31], application to the cumulative incidence of non-primary risks [32], and the inclusion of frailty factors [33]. Another community of authors has focused on identifying which mathematical constraints or conditions need to be imposed on multi-risk survival analysis models in order to circumvent Tsiatis' identifiability problem, and infer the joint event time distribution unambiguously from survival data. Examples involving survival data with covariates are [34] and [35]. However, these studies also do not take the step towards decontamination tools.

So we face the unsatisfactory situation of multiple distinct approaches to cohort heterogeneity and competing risks. Only a few address the competing risk problem, which requires modelling all risks and their correlations. None give formulae for decontaminated primary risk measures.

[Figure 1 appears here: two panels of Kaplan-Meier survival curves versus time t, for upper and lower covariate quartiles (UQ, LQ); left panel: primary risk only; right panel: primary and secondary risk.]

Figure 1: Illustration of the dangers of using Kaplan-Meier (KM) estimators [6] in the presence of competing risks. We show KM estimators $S^{\rm KM}_1$ of the primary risk survival function, for upper and lower quartiles (UQ, LQ) of covariate 1, for two data sets whose primary risk characteristics are identical. They differ in that on the left only the primary risk is active, whereas on the right a second risk is activated which causes informative censoring. The KM estimators on the right suggest a strong association of the primary risk with covariate 1, which is in fact spurious: what we see is false exposure. One of the aims of this work is to develop practical formulae for decontaminated survival estimators, which for both data sets would report the correct curves (i.e. the ones on the left). Details of these synthetic data are given in a later section.

In this work we try to build a generic statistical description of competing risks, and a partial resolution of the competing risk problem, that unifies the various schools of thought above. Our work is based on the observation that most papers implicitly assume that correlations between competing risks are induced by residual cohort heterogeneity. We show how this simple and transparent assumption leads in a natural way to a formalism with exact formulae for decontaminated primary risk measures, in which Cox regression, frailty models, random effects models, and latent class models are all included as special cases, and which produces transparent parametrisations of the cumulative incidence function (the language of Fine and Gray).

This report is organised as follows. In section 2 we define the competing risk problem in mathematical terms. In section 3 we inspect the relation between cohort level and individual level statistical descriptions, classify different levels of risk complexity from the competing risk perspective, and define what we mean by heterogeneity-induced competing risks. We derive the implications of having heterogeneity-induced competing risks, and show that the canonical mathematical description involves the covariate-conditioned functional distribution of the individual hazard rates of all risks. In section 4 we work out the theory for a natural family of such parametrisations, which includes conventional methods (Cox regression, frailty, random effects and latent class models) in special limits. In the remaining sections we apply the formalism to synthetic data, and to real survival data from the ULSAM longitudinal study [36, 37], with prostate cancer as the primary risk. This application leads to appealing and transparent new explanations for previously counter-intuitive inferences. We end with a summary of our findings.
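The false-exposure effect of Figure 1 is easy to reproduce numerically. The sketch below is a minimal illustration (not the paper's code; the class structure and rates are invented for the demo, though they echo the synthetic data of the later sections): it simulates a two-class heterogeneous cohort in which the secondary risk preferentially removes individuals who are at low primary risk, and compares naive Kaplan-Meier estimates with and without the secondary risk active.

```python
import numpy as np

rng = np.random.default_rng(0)

def kaplan_meier(times, events, grid):
    """Naive Kaplan-Meier estimator of S(t), evaluated on a time grid."""
    order = np.argsort(times)
    t, d = times[order], events[order]
    surv, s, at_risk, ti = [], 1.0, len(t), 0
    for g in grid:
        while ti < len(t) and t[ti] <= g:
            if d[ti]:
                s *= 1.0 - 1.0 / at_risk
            at_risk -= 1
            ti += 1
        surv.append(s)
    return np.array(surv)

# Hypothetical two-class cohort: in class 2 the covariate z protects against
# the primary risk but exposes to the secondary risk (invented rates).
N = 2000
z = rng.standard_normal(N)
cls = rng.integers(0, 2, N)                              # latent class labels
h1 = 0.05 * np.exp(np.where(cls == 0, 2.0, -2.0) * z)    # primary hazard
h2 = 0.10 * np.exp(np.where(cls == 0, 0.0, 2.0) * z)     # secondary hazard

t1 = rng.exponential(1.0 / h1)                           # primary event times
t2 = rng.exponential(1.0 / h2)                           # secondary event times
grid = np.linspace(0.0, 30.0, 100)
uq = z > np.quantile(z, 0.75)                            # upper covariate quartile

# "Left panel": only the primary risk active (end-of-trial censoring at t=50).
t_only, d_only = np.minimum(t1, 50.0), (t1 < 50.0).astype(int)
# "Right panel": the secondary risk censors the primary one informatively.
t_both = np.minimum(np.minimum(t1, t2), 50.0)
d_both = ((t1 < t2) & (t1 < 50.0)).astype(int)

S_left = kaplan_meier(t_only[uq], d_only[uq], grid)
S_right = kaplan_meier(t_both[uq], d_both[uq], grid)
print("UQ KM survival at t=20, primary only:", S_left[grid <= 20][-1])
print("UQ KM survival at t=20, both risks  :", S_right[grid <= 20][-1])
```

The two printed values differ markedly even though the primary risk characteristics are identical in both runs, which is exactly the artefact shown in Figure 1.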

2 Definitions and general identities

We recall briefly the basic definitions of survival analysis, and define the competing risk problem in mathematical terms. In doing so we will try to stay as close as possible to the notation conventions and terminology of [2].

2.1 Survival probability and crude cause-specific hazard rates

We imagine a cohort of individuals who are subject to R true risks, labelled by r = 1...R. We use r = 0 to indicate the end-of-trial censoring event, since for the structure of the theory there is no difference between censoring due to alternative risks and censoring due to trial termination. Most of the mathematical relations of survival analysis are derived directly from the joint distribution $P(t_0,\ldots,t_R)$ of event times $(t_0,\ldots,t_R)$, where $t_r \geq 0$ is the time at which risk r triggers an event.[1] From this distribution follow the crude cause-specific hazard rates, i.e. the probabilities per unit time that a given failure occurs at time t, given that until then none of the possible events has yet occurred:

$$h_r(t) = \frac{1}{S(t)}\int_0^\infty\!\!\cdots\int_0^\infty dt_0\cdots dt_R\; P(t_0,\ldots,t_R)\,\delta(t-t_r)\prod_{r'\neq r}\theta(t_{r'}-t) \qquad (1)$$

Here we used the delta-distribution $\delta(x)$, defined by the identity $\int dx\,\delta(x)f(x)=f(0)$, and the step function, defined by $\theta(x>0)=1$ and $\theta(x<0)=0$. It is easy to show that the survival function can be written as

$$S(t) = e^{-\sum_{r=0}^R \int_0^t ds\; h_r(s)} \qquad (2)$$

The crude cause-specific hazard rates provide the link between theory and observations, since the probability density $P(t,r)$ of finding the earliest event occurring at time t and corresponding to risk r is given by

$$P(t,r) = h_r(t)\, e^{-\sum_{r'=0}^R \int_0^t ds\; h_{r'}(s)} \qquad (3)$$

These relations hold irrespective of whether we have a large or a small cohort, or even a single individual, although the values of $P(t_0,\ldots,t_R)$ would be different. However, at some point we will work simultaneously with cohort level and individual level descriptions, and it will then be necessary to specify with further indices which of the two we refer to.

Conditioning on covariate information is straightforward. For simplicity we assume the covariates to be discrete; for continuous covariates one finds similar formulae, with integrals instead of sums. Knowing the values $z \in \mathbb{R}^p$ of p covariates means starting from the distribution $P(t_0,\ldots,t_R|z)$, which gives the event time statistics of the sub-cohort of those individuals i that have covariate vector $z_i = z$. It is related to the previous distribution of the full cohort via $P(t_0,\ldots,t_R) = \sum_z P(t_0,\ldots,t_R|z)P(z)$, where $P(z)$ gives the fraction of the cohort that has covariates z. We then obtain the following covariate-conditioned survival functions and crude cause-specific hazard rates:

$$S(t|z) = \int_0^\infty\!\!\cdots\int_0^\infty dt_0\cdots dt_R\; P(t_0,\ldots,t_R|z)\prod_{r=0}^R \theta(t_r-t) \qquad (4)$$

[1] This starting point is not fully general. It assumes that all risks will ultimately lead to failure. One can include events with a finite chance of never happening, by adding for each risk r a binary variable $\tau_r$ to indicate whether or not the "calamity button" is pressed at time $t_r$.

$$h_r(t|z) = \frac{1}{S(t|z)}\int_0^\infty\!\!\cdots\int_0^\infty dt_0\cdots dt_R\; P(t_0,\ldots,t_R|z)\,\delta(t-t_r)\prod_{r'\neq r}\theta(t_{r'}-t) \qquad (5)$$

with the usual relation between survival and crude hazard rates, and the usual link to observations:

$$S(t|z) = e^{-\sum_{r=0}^R \int_0^t ds\; h_r(s|z)} \qquad (6)$$

$$P(t,r|z) = h_r(t|z)\, e^{-\sum_{r'=0}^R \int_0^t ds\; h_{r'}(s|z)} \qquad (7)$$

If we study a cohort of N individuals, with covariate vectors $\{z_1,\ldots,z_N\}$, the survival data D usually consist of N samples of event time and event type pairs (t, r), viz. $D = \{(t_i, r_i)\}$. The probability density for an individual with covariate vector z to report (t, r) is given by (7), so the data likelihood $P(D) = \prod_{i=1}^N P(t_i, r_i|z_i)$ obeys

$$\log P(D) = \sum_{r=0}^R \sum_{i=1}^N \Big\{ \delta_{r,r_i} \log h_r(t_i|z_i) - \int_0^{t_i} dt\; h_r(t|z_i) \Big\} \qquad (8)$$
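As a concrete illustration of (8), the sketch below evaluates this log-likelihood under the simplifying assumption of constant, covariate-independent hazard rates $h_r(t|z) = h_r$ (a parametrisation invented for the demo, for which the inner integral is just $h_r t_i$).

```python
import numpy as np

def log_likelihood(times, risks, h):
    """Log-likelihood (8) for constant cause-specific hazard rates.

    times : (N,) observed event times t_i
    risks : (N,) event types r_i in {0,...,R}
    h     : (R+1,) constant hazard rates, one per risk
    """
    # First term of (8): sum_i log h_{r_i}(t_i)
    event_term = np.sum(np.log(h[risks]))
    # Second term: sum_r sum_i int_0^{t_i} h_r dt = sum_i t_i * sum_r h_r
    exposure_term = np.sum(times) * np.sum(h)
    return event_term - exposure_term

# Hypothetical data: two true risks plus end-of-trial censoring (r = 0).
rng = np.random.default_rng(1)
h_true = np.array([0.01, 0.05, 0.10])
t_all = rng.exponential(1.0 / h_true, size=(500, 3))
times = t_all.min(axis=1)        # earliest event per individual
risks = t_all.argmin(axis=1)     # which risk fired first
print(log_likelihood(times, risks, h_true))
```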

2.2 Decontaminated cause-specific risk measures: the competing risk problem

The aim of survival analysis is to extract statistical patterns from the data that allow us to make predictions for new individuals, if we know their covariates. We are often interested in one specific primary risk. Many relevant risk-specific quantities can be calculated once we know the crude hazard rates. For instance, the cause-specific cumulative incidence function $F_r(t)$, the probability that event r has been observed at any time prior to time t, is

$$F_r(t) = \int_0^t dt'\; S(t')\,h_r(t') \qquad (9)$$

Although $F_r(t)$ refers to risk r specifically, it can be heavily influenced by other risks. If it is small, this may be because event r is intrinsically unlikely, or because it tends to be preceded by events $r' \neq r$. One cannot tell. To obtain decontaminated information on a primary risk r one must consider the hypothetical situation where all other risks $r' \neq r$ are disabled. This means replacing[2]

$$P(t_0,\ldots,t_R)\;\to\; P(t_r) \lim_{\Lambda\to\infty} \prod_{r'\neq r} \delta(t_{r'}-\Lambda) \qquad (10)$$

with the marginal event time distribution $P(t_r) = \int\!\cdots\!\int [\prod_{s\neq r} dt_s]\; P(t_0,\ldots,t_R)$. Inserting (10) into (1) gives, as expected, zero values for all non-primary crude hazard rates, but it also affects the value of the primary risk hazard rate. One now finds the following formulae for the decontaminated (conditioned) cause-specific survival function and hazard rate for risk r, indicated with tildes to distinguish them from their crude counterparts:

$$\tilde{S}_r(t) = \int_t^\infty dt_r\; P(t_r), \qquad \tilde{h}_r(t) = -\frac{d}{dt}\log \tilde{S}_r(t) \qquad (11)$$

$$\tilde{S}_r(t|z) = \int_t^\infty dt_r\; P(t_r|z), \qquad \tilde{h}_r(t|z) = -\frac{d}{dt}\log \tilde{S}_r(t|z) \qquad (12)$$

[2] It was noted by [39] that one cannot be sure that this statement is always appropriate; it may be that correlated risks share biochemical pathways such that they can never be deactivated independently.

In general one will indeed find that $\tilde{h}_r(t) \neq h_r(t)$ and $\tilde{h}_r(t|z) \neq h_r(t|z)$. Equations (11,12) tell us that to determine the decontaminated risk measures for the primary risk r we must estimate the marginal distributions $P(t_r)$ or $P(t_r|z)$ from survival data. Tsiatis showed [8] that this is impossible without further assumptions. For every $P(t_0,\ldots,t_R)$ there is an alternative distribution $\tilde{P}(t_0,\ldots,t_R)$ that describes independent event times, but such that P and $\tilde{P}$ both generate identical cause-specific hazard rates for all risks:

$$\tilde{P}(t_0,\ldots,t_R) = \prod_{r=0}^R \Big( h_r(t_r)\; e^{-\int_0^{t_r} ds\; h_r(s)} \Big) \qquad (13)$$

in which $\{h_r(t)\}$ are the cause-specific hazard rates of $P(t_0,\ldots,t_R)$. Hence the only information that can be estimated from survival data alone are the (covariate-conditioned) crude cause-specific hazard rates. One cannot calculate $P(t_0,\ldots,t_R)$ or $P(t_0,\ldots,t_R|z)$ and their marginals. Without further information or assumptions there is no way to disentangle the different risks. This is the identifiability problem.

One way out is to assume that all risks are statistically independent, i.e. that $P(t_0,\ldots,t_R|z) = \prod_{r=0}^R P(t_r|z)$. This solves the competing risk problem trivially, since now one finds that $\tilde{h}_r(t|z) = h_r(t|z)$ for all r, and

$$\tilde{S}_r(t|z) = e^{-\int_0^t ds\; h_r(s|z)} \qquad (14)$$

This assumption underlies the clinical use of e.g. Cox's proportional hazards regression [7] and Kaplan-Meier estimators of the cause-specific survival function [6], which would otherwise be inappropriate tools (see Figure 1).

3 Heterogeneity-induced competing risks

We work out the consequences of assuming that event time correlations are caused by residual cohort heterogeneity. This is much weaker than assuming risk independence, but still allows us to deal with the competing risk problem.

3.1 Connection between cohort level and individual level descriptions

The standard survival analysis formalism is built solely on the starting point of a joint event time distribution; it can hence also be applied to risk at the level of individuals. Let N be the number of individuals in the cohort to which $P(t_0,\ldots,t_R)$ refers, labelled by i = 1...N. We write the joint event time distribution of individual i in this cohort as $P_i(t_0,\ldots,t_R)$, and the crude cause-specific hazard rates of individual i as $h^i_r(t)$. It then follows that

$$h^i_r(t) = \frac{1}{S_i(t)}\int_0^\infty\!\!\cdots\int_0^\infty dt_0\cdots dt_R\; P_i(t_0,\ldots,t_R)\,\delta(t-t_r)\prod_{r'\neq r}\theta(t_{r'}-t) \qquad (15)$$

$$S_i(t) = e^{-\sum_{r=0}^R \int_0^t ds\; h^i_r(s)} \qquad (16)$$

$$P_i(t,r) = h^i_r(t)\, e^{-\sum_{r'=0}^R \int_0^t ds\; h^i_{r'}(s)} \qquad (17)$$

Here $S_i(t)$ is the survival function of individual i, and $P_i(t,r)$ is the probability that the first event for individual i occurs at time t and corresponds to risk r. When describing a cohort, we have the added

uncertainty of not knowing which individuals were picked from the population, so the connection between the two levels is simply given by

$$P(t_0,\ldots,t_R) = \frac{1}{N}\sum_{i=1}^N P_i(t_0,\ldots,t_R) \qquad (18)$$

$$P(t_0,\ldots,t_R|z) = \frac{\sum_{i,\; z_i=z} P_i(t_0,\ldots,t_R)}{\sum_{i,\; z_i=z} 1} \qquad (19)$$

For quantities that depend linearly on the joint event time distribution, the link between cohort level and individual level is a simple averaging over the label i, possibly conditioned on covariates, e.g.

$$S(t) = \frac{1}{N}\sum_{i=1}^N S_i(t), \qquad P(t,r) = \frac{1}{N}\sum_{i=1}^N P_i(t,r) \qquad (20)$$

$$S(t|z) = \frac{\sum_{i,\; z_i=z} S_i(t)}{\sum_{i,\; z_i=z} 1}, \qquad P(t,r|z) = \frac{\sum_{i,\; z_i=z} P_i(t,r)}{\sum_{i,\; z_i=z} 1} \qquad (21)$$

In contrast, quantities such as the crude cause-specific hazard rates depend in a more complicated way on $P(t_0,\ldots,t_R)$, via their conditioning on survival. Cohort level cause-specific hazard rates, for instance, are not direct averages of their individual level counterparts. Instead one finds (see Appendix A for details):

$$h_r(t) = \frac{\sum_{i=1}^N h^i_r(t)\; e^{-\sum_{r'}\int_0^t ds\; h^i_{r'}(s)}}{\sum_{i=1}^N e^{-\sum_{r'}\int_0^t ds\; h^i_{r'}(s)}} \qquad (22)$$

$$h_r(t|z) = \frac{\sum_{i,\; z_i=z} h^i_r(t)\; e^{-\sum_{r'}\int_0^t ds\; h^i_{r'}(s)}}{\sum_{i,\; z_i=z} e^{-\sum_{r'}\int_0^t ds\; h^i_{r'}(s)}} \qquad (23)$$
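Equations (22,23) already exhibit the "cohort filtering" effect that drives much of what follows: even when every individual has a constant hazard rate, the cohort-level crude hazard rate acquires a time dependence, because high-risk individuals drop out early. A minimal numerical sketch of (22), with invented rates and two equally sized sub-groups, is given below.

```python
import numpy as np

# Two hypothetical sub-groups with constant individual hazard rates for a
# single risk: a high-risk half and a low-risk half of the cohort.
h_individual = np.array([0.20] * 500 + [0.02] * 500)

t = np.linspace(0.0, 50.0, 200)
# Equation (22) with constant rates: the cumulative hazard of i is h_i * t,
# so each individual's survival weight is exp(-h_i * t).
weights = np.exp(-np.outer(t, h_individual))
h_cohort = (weights * h_individual).sum(axis=1) / weights.sum(axis=1)

print("cohort hazard at t=0 :", h_cohort[0])    # ~ mean(h) = 0.11
print("cohort hazard at t=50:", h_cohort[-1])   # ~ 0.02: filtering at work
```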

3.2 Heterogeneous cohorts and different levels of risk complexity

We always allow our cohorts to be heterogeneous in terms of covariates; here we refer to heterogeneity in the relation between covariates and risks. A homogeneous cohort is one in which this relation is uniform, so the distribution $P_i(t_0,\ldots,t_R)$ can depend on i only via $z_i$. Hence there exists a function $P(t_0,\ldots,t_R|z)$ such that

$$P_i(t_0,\ldots,t_R) = P(t_0,\ldots,t_R|z_i) \quad \text{for all } i \qquad (24)$$

The same is then true for the cause-specific hazard rates: $h^i_r(t) = h_r(t|z_i)$ for all i, in which $h_r(t|z)$ is related to $P(t_0,\ldots,t_R|z)$ via equations (4,5). It also follows, directly from (19) and (23), that the cohort-level covariate-conditioned event time distribution and crude hazard rates coincide with this function and with $h_r(t|z)$ as just defined, as expected. A property of homogeneous cohorts is that uncorrelated individual level risks, i.e. $P_i(t_0,\ldots,t_R) = \prod_{r=0}^R P_i(t_r)$, imply uncorrelated covariate-conditioned cohort level risks. This follows from (19):

$$P(t_0,\ldots,t_R|z) = \frac{\sum_{i,\; z_i=z}\prod_{r=0}^R P_i(t_r)}{\sum_{i,\; z_i=z} 1} = \frac{\sum_{i,\; z_i=z}\prod_{r=0}^R P(t_r|z_i)}{\sum_{i,\; z_i=z} 1} = \prod_{r=0}^R P(t_r|z) \qquad (25)$$

In heterogeneous cohorts (24) does not hold; individuals have further features, not captured by the covariates, that impact their risks. Here one will observe a gradual "filtering": high-risk individuals will drop out early, causing time dependencies at cohort level that have no counterpart at individual level. For instance, even if all individuals have stationary hazard rates, one would according to (22,23) still find time-dependent crude cohort-level hazard rates. Here, having uncorrelated individual level risks no longer implies having uncorrelated covariate-conditioned cohort level risks: one can have $P_i(t_0,\ldots,t_R) = \prod_{r=0}^R P_i(t_r)$, but still $P(t_0,\ldots,t_R|z) \neq \prod_{r=0}^R P(t_r|z)$. Risk correlations can thus be generated at different levels, and there is a natural hierarchy of cohorts in terms of risk complexity, with implications for the applicability of methods:

Level 1: homogeneous cohort, no competing risks
individual: $P_i(t_0,\ldots,t_R) = \prod_{r=0}^R P(t_r|z_i)$
cohort: $P(t_0,\ldots,t_R|z) = \prod_{r=0}^R P(t_r|z)$
The members of the cohort differ in their covariates, but they are homogeneous in terms of the link between covariates and risk. For each individual, the event times of all risks are statistically independent, and their probabilities are determined by the covariates alone. Since there is no residual heterogeneity, there is no competing risk problem; crude and true cause-specific hazard rates and survival functions are identical.

Level 2: heterogeneous cohort, no competing risks
individual: $P_i(t_0,\ldots,t_R) = \prod_{r=0}^R P_i(t_r)$
cohort: $P(t_0,\ldots,t_R|z) = \prod_{r=0}^R P(t_r|z)$
For each individual the event times of all risks are statistically independent, but their susceptibilities are no longer determined by the covariates alone (reflecting e.g. disease sub-groups or the impact of unobserved covariates). However, this residual heterogeneity does not manifest itself in risk correlations at cohort level. One will therefore observe heterogeneity-induced effects, such as cohort filtering, but no competing risks.

Level 3: heterogeneity-induced competing risks
individual: $P_i(t_0,\ldots,t_R) = \prod_{r=0}^R P_i(t_r)$
cohort: $P(t_0,\ldots,t_R|z) \neq \prod_{r=0}^R P(t_r|z)$
For each individual the event times of all risks are statistically independent, and their susceptibilities are not determined by the covariates alone, as at level 2. However, residual cohort heterogeneity now leads to risk correlations at cohort level, which cause informative censoring. Here one will therefore observe competing risks phenomena.

Level 4: individual and cohort level competing risks
individual: $P_i(t_0,\ldots,t_R) \neq \prod_{r=0}^R P_i(t_r)$
cohort: $P(t_0,\ldots,t_R|z) \neq \prod_{r=0}^R P(t_r|z)$
This is the most complex situation from a modelling point of view, where both at the level of

individuals and at cohort level the event times of different risks are correlated. We will again observe competing risk phenomena, but we can no longer say where these are generated. In fact, correlations amongst non-primary risks are harmless; what matters is only whether there are correlations between primary and non-primary risks. We could in principle make a further distinction between having $P(t_0,\ldots,t_R|z) = \prod_{r=0}^R P(t_r|z)$ and the weaker but still sufficient condition $P(t_0,\ldots,t_R|z) = P(t_r|z)\,P(t_0,\ldots,t_{r-1},t_{r+1},\ldots,t_R|z)$. Here we will not pursue this; it is clear how the theory can incorporate this distinction.

Levels 1 and 2 are those where the assumption of statistically independent risks, underlying e.g. Cox regression and Kaplan-Meier estimators, is valid. At level 2 there is still no competing risk problem, but the heterogeneity demands parametrisations of crude cohort-level primary hazard rates that are more complex than those of Cox, which is the rationale behind frailty and random effects models, and the latent class models of [27]. All these approaches still model only the primary risk, and therefore cannot handle cohorts beyond level 2. Level 4 is the most complex scenario, which we will not deal with in this work. Our focus is on level 3: cohorts with heterogeneity-induced competing risks. Here the correlations between cohort-level event times have their origin strictly in correlations between the disease susceptibilities and covariate associations of individuals; e.g. someone with a high hazard rate for a disease A may also be likely to have a high hazard rate for B, for reasons not explained by the covariates.

3.3 Implications of having heterogeneity-induced competing risks

We now show that the assumption that competing risks are induced by residual cohort heterogeneity (level 3 in our classification) leads to a resolution of the competing risk problem. In the case of heterogeneity-induced competing risks we have independent event times at the level of individuals, hence for each individual i we know that

$$P_i(t_r) = h^i_r(t_r)\; e^{-\int_0^{t_r} ds\; h^i_r(s)} \qquad (26)$$

The covariate-conditioned cohort level event time marginals are therefore

$$P(t_r|z) = \frac{\sum_{i,\; z_i=z} h^i_r(t_r)\; e^{-\int_0^{t_r} ds\; h^i_r(s)}}{\sum_{i,\; z_i=z} 1} \qquad (27)$$

and via (12) we can write the decontaminated cause-specific survival function and hazard rate as

$$\tilde{S}_r(t|z) = \frac{\sum_{i,\; z_i=z} e^{-\int_0^t ds\; h^i_r(s)}}{\sum_{i,\; z_i=z} 1} \qquad (28)$$

$$\tilde{h}_r(t|z) = \frac{\sum_{i,\; z_i=z} h^i_r(t)\; e^{-\int_0^t ds\; h^i_r(s)}}{\sum_{i,\; z_i=z} e^{-\int_0^t ds\; h^i_r(s)}} \qquad (29)$$

Here we used $\int_0^\infty ds\; h^i_r(s) = \infty$ for all (i, r), which follows from the normalisation of $P_i(t_0,\ldots,t_R)$. Expressions (28,29) are similar but not identical to the formulae (14,23) for the decontaminated cause-specific survival function and the crude covariate-conditioned cause-specific hazard rates which would

have been found if all risks had been independent:

$$S_r(t|z) = e^{-\int_0^t ds\; h_r(s|z)} \qquad (30)$$

$$h_r(t|z) = \frac{\sum_{i,\; z_i=z} h^i_r(t)\; e^{-\sum_{r'}\int_0^t ds\; h^i_{r'}(s)}}{\sum_{i,\; z_i=z} e^{-\sum_{r'}\int_0^t ds\; h^i_{r'}(s)}} \qquad (31)$$

The differences are interpreted easily. In (29) the probability that individual i survives until time t is given by $\exp[-\int_0^t ds\; h^i_r(s)]$ (which causes the "cohort filtering"), since no risk other than r is active. In contrast, in (31) all risks contribute to cohort filtering. Formulae (29) and (31) will therefore be non-identical, unless we have risk independence, in which case (31) would contain an identical factor in numerator and denominator that drops out. The differences between (28,29) and (30,31) quantify the severity of the competing risk problem in our cohort. We also see that in homogeneous cohorts one indeed recovers $\tilde{S}_r(t|z) = S_r(t|z)$ and $\tilde{h}_r(t|z) = h_r(t|z)$.

Similarly, we can work out the link between the theory and survival data. Inserting (17) into (21) leads us to

$$P(t,r|z) = \frac{\sum_{i,\; z_i=z} h^i_r(t)\; e^{-\sum_{r'}\int_0^t ds\; h^i_{r'}(s)}}{\sum_{i,\; z_i=z} 1} \qquad (32)$$

Hence the assumption that competing risks (if present) are induced by heterogeneity leads to relatively simple formulae for the decontaminated cause-specific quantities of interest, and for the likelihood of observing individual survival data. What remains is to identify the minimal level of description required for evaluating these formulae, and to determine how the required information can be estimated from survival data.
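The gap between (28,29) and (30,31) can be made tangible numerically. The sketch below uses a toy cohort with invented constant individual hazard rates for two positively correlated risks, and compares the decontaminated survival (28) with the naive estimate obtained by exponentiating the integral of the crude hazard (31), i.e. with (30).

```python
import numpy as np

# Toy cohort at a fixed covariate value z: two sub-groups in which primary
# and secondary hazard rates are positively correlated (invented rates).
h1 = np.array([0.15] * 500 + [0.02] * 500)   # primary risk, per individual
h2 = np.array([0.30] * 500 + [0.01] * 500)   # secondary risk, per individual

t = np.linspace(0.0, 30.0, 100)
H1 = np.outer(t, h1)                          # cumulative primary hazards
H12 = np.outer(t, h1 + h2)                    # cumulative total hazards

# Decontaminated survival, eq (28): only the primary risk filters the cohort.
S_decon = np.exp(-H1).mean(axis=1)

# Crude hazard, eq (31): all risks filter; then the naive eq (30) estimate.
h_crude = (np.exp(-H12) * h1).sum(axis=1) / np.exp(-H12).sum(axis=1)
dt = t[1] - t[0]
S_naive = np.exp(-np.cumsum(h_crude) * dt)

print("at t=30: decontaminated S =", S_decon[-1], ", naive S =", S_naive[-1])
```

Because the secondary risk removes the high-primary-risk sub-group faster, the crude hazard decays too quickly and the naive survival estimate is biased upwards relative to the decontaminated one.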

3.4 Canonical level of description for resolving heterogeneity-induced competing risks

The canonical level of description is the minimal set of observables in terms of which we can write the decontaminated risk-specific quantities (28,29) (so that we can calculate what we are interested in) and the data likelihood (32) (so that it can be estimated). In (28,29) we need the covariate-constrained distribution of individual hazard rates for the primary risk. In (32) we need in addition the covariate-constrained distribution of the cumulative rates of the non-primary risks. In combination we see that the minimal description would be the functional distribution

$$W[h_r, h_{/r}|z] = \frac{\sum_{i,\; z_i=z} \delta_F[h_r - h^i_r]\;\delta_F[h_{/r} - \sum_{r'\neq r} h^i_{r'}]}{\sum_{i,\; z_i=z} 1} \qquad (33)$$

Here $\delta_F$ denotes the functional $\delta$-distribution[3], defined by the functional integral identity

$$\int \{df\}\; \delta_F[f]\; G[f] = G[f]\big|_{f(t)=0\;\;\forall t\geq 0} \qquad (34)$$

[3] Where the $\delta$-function can be interpreted as the probability distribution of a real-valued stochastic variable x without uncertainty, which always takes the value x = 0, its functional generalisation $\delta_F[f]$ can be interpreted as the functional probability distribution describing a real-valued function f acting on $[0,\infty)$ that always takes the value f(t) = 0 for all $t \geq 0$. For illustrations of its use see e.g. [40].

$W[h_r, h_{/r}|z]$ represents, for each possible choice of the function pair $\{h_r(t), h_{/r}(t)\}$, the fraction of those individuals in our cohort that have covariates z and also have the individual primary hazard rate $h^i_r(t) = h_r(t)$ and the cumulative non-primary hazard rate $\sum_{r'\neq r} h^i_{r'}(t) = h_{/r}(t)$.

In practice it will often be advantageous to relax our requirement of a minimal description. Non-primary risks will often be mutually very different in their characteristics, so finding an efficient parametrisation for the dependence on $\sum_{r'\neq r} h^i_{r'}(t)$ in $W[h_r, h_{/r}|z]$ will be awkward. A slightly redundant alternative choice, but one that is more easily parametrised, would be

$$W[h_0,\ldots,h_R|z] = \frac{\sum_{i,\; z_i=z} \prod_{r=0}^R \delta_F[h_r - h^i_r]}{\sum_{i,\; z_i=z} 1} \qquad (35)$$

It gives the joint functional distribution over the cohort of all R+1 individual cause-specific hazard rates at all times. The distribution (33) follows from (35) via

$$W[h_r, h_{/r}|z] = \int \{dh'_0 \cdots dh'_R\}\; W[h'_0,\ldots,h'_R|z]\; \delta_F[h_r - h'_r]\;\delta_F\big[h_{/r} - \sum_{r'\neq r} h'_{r'}\big] \qquad (36)$$

For independent risks one would simply find the factorised form $W[h_0,\ldots,h_R|z] = \prod_{r=0}^R W[h_r|z]$. If we know (35) we can write the decontaminated risk-specific quantities (28,29) as

$$\tilde{S}_r(t|z) = \int \{dh_0 \cdots dh_R\}\; W[h_0,\ldots,h_R|z]\; e^{-\int_0^t ds\; h_r(s)} \qquad (37)$$

$$\tilde{h}_r(t|z) = \frac{\int \{dh_0 \cdots dh_R\}\; W[h_0,\ldots,h_R|z]\; h_r(t)\; e^{-\int_0^t ds\; h_r(s)}}{\int \{dh_0 \cdots dh_R\}\; W[h_0,\ldots,h_R|z]\; e^{-\int_0^t ds\; h_r(s)}} \qquad (38)$$

whereas their crude counterparts, which would be reported upon assuming independent risks, are

$$S_r(t|z) = e^{-\int_0^t ds\; h_r(s|z)} \qquad (39)$$

$$h_r(t|z) = \frac{\int \{dh_0 \cdots dh_R\}\; W[h_0,\ldots,h_R|z]\; h_r(t)\; e^{-\sum_{r'}\int_0^t ds\; h_{r'}(s)}}{\int \{dh_0 \cdots dh_R\}\; W[h_0,\ldots,h_R|z]\; e^{-\sum_{r'}\int_0^t ds\; h_{r'}(s)}} \qquad (40)$$

We can quantify the impact of competing risks in the cohort by comparing (37,38) to (39,40). If the primary risk r is not correlated with the non-primary risks (i.e. if $W[h_0,\ldots,h_R|z] = W[h_1|z]\,W[h_0,h_2,\ldots,h_R|z]$ for primary risk r = 1), or if there is just one risk, the formulae (37) and (39), as well as (38) and (40), become pairwise identical, as expected. The data likelihood (32) acquires the form

$$P(t,r|z) = \int \{dh_0 \cdots dh_R\}\; W[h_0,\ldots,h_R|z]\; h_r(t)\; e^{-\sum_{r'}\int_0^t ds\; h_{r'}(s)} \qquad (41)$$

An alternative formula for $P(t,r|z)$ follows upon combining (23) with (40); in Appendix B we show that the results are identical. Finally, the covariate-conditioned cause-specific cumulative incidence functions can be written as

$$F_r(t|z) = \int \{dh_0 \cdots dh_R\}\; W[h_0,\ldots,h_R|z] \int_0^t dt'\; h_r(t')\; e^{-\sum_{r'}\int_0^{t'} ds\; h_{r'}(s)} \qquad (42)$$

The level of description (35) is sufficient and necessary for handling heterogeneity-induced competing risks, apart from the trivial option of combining the non-primary risks $r' \neq r$ into a single risk, leading

to (33). One cannot work with the crude cohort-level covariate-conditioned hazard rates alone: the latter can be calculated from $W[h_0,\ldots,h_R|z]$ via (40), but the converse is not true. In fact, for any $W[h_0,\ldots,h_R|z]$ there exists an alternative distribution $\tilde{W}[h_0,\ldots,h_R|z]$ describing a homogeneous cohort, such that W and $\tilde{W}$ give identical crude cohort-level cause-specific hazard rates, namely $\tilde{W}[h_0,\ldots,h_R|z] = \prod_{r=0}^R \delta_F[h_r - h_r(\cdot|z)]$, in which $h_r(\cdot|z)$ is the function (40).

3.5 Estimation of W[h_0,...,h_R|z] from survival data

When data are limited one must determine the relevant quantities in parametrised form, to avoid overfitting. Since the data likelihood can be expressed in terms of the crude cohort-level covariate-conditioned cause-specific hazard rates, one cannot extract information from survival data on $W[h_0,\ldots,h_R|z]$ that is not contained in $\{h_r(t|z)\}$. However, even relatively simple parametrisations of $W[h_0,\ldots,h_R|z]$ will via (40) correspond to nontrivial crude conditioned hazard rates (with time dependencies caused by cohort filtering) that one would be very unlikely to propose when parametrising at the level of the crude hazard rates themselves.

We thus assume $W[h_0,\ldots,h_R|z]$ to be a member of a parametrised family of conditioned distributions $W[h_0,\ldots,h_R|z,\theta]$, in which $\theta \in \Omega$ denotes the vector of parameters and $\Omega$ its value domain. Since the probability density for an individual with covariates z to report the pair (t, r) is given by (41), the data likelihood $P(D|\theta) = \prod_{i=1}^N P(t_i, r_i|z_i,\theta)$, given the parameters $\theta$, is

$$P(D|\theta) = \prod_{i=1}^N \int \{dh_0 \cdots dh_R\}\; W[h_0,\ldots,h_R|z_i,\theta]\; h_{r_i}(t_i)\; e^{-\sum_{r}\int_0^{t_i} ds\; h_{r}(s)} \qquad (43)$$

If we concentrate all the survival data into two empirical distributions,

$$\hat{P}(t,r|z) = \frac{\sum_{i,\; z_i=z} \delta(t-t_i)\,\delta_{r,r_i}}{\sum_{i,\; z_i=z} 1}, \qquad \hat{P}(z) = \frac{1}{N}\sum_{i=1}^N \delta_{z,z_i} \qquad (44)$$

(with $\delta_{ab} = 1$ if a = b, and $\delta_{ab} = 0$ otherwise), we can write the log-likelihood $L(\theta) = \log P(D|\theta)$ as

$$L(\theta) = N \sum_z \hat{P}(z) \sum_{r=0}^R \int dt\; \hat{P}(t,r|z)\, \log \int \{dh_0 \cdots dh_R\}\; W[h_0,\ldots,h_R|z,\theta]\; h_r(t)\; e^{-\sum_{r'}\int_0^t ds\; h_{r'}(s)} \qquad (45)$$

This log-likelihood can be interpreted in terms of the dissimilarity between the empirical function $\hat{P}(t,r|z)$ and the model prediction $P(t,r|z,\theta)$, i.e. the result of substituting $W[h_0,\ldots,h_R|z,\theta]$ into (41):

$$\frac{L(\theta)}{N} = \sum_z \hat{P}(z) \Big\{ \sum_{r=0}^R \int dt\; \hat{P}(t,r|z) \log \hat{P}(t,r|z) \;-\; \sum_{r=0}^R \int dt\; \hat{P}(t,r|z) \log\Big( \frac{\hat{P}(t,r|z)}{P(t,r|z,\theta)} \Big) \Big\} \qquad (46)$$

The first (entropic) term is independent of $\theta$; the second is minus the Kullback-Leibler distance $D(\hat{P}\|P)$ [43] between $\hat{P}$ and P. Hence finding the most probable parameters $\theta$ is equivalent to minimising $D(\hat{P}\|P)$. From this starting point one can follow different routes for estimating $\theta$, each with specific advantages and limitations. In maximum likelihood (ML) estimation one simply uses the value $\hat{\theta}$ for which the data are most likely,

$$\hat{\theta}_{\rm ML} = \mathop{\rm argmax}_{\theta\in\Omega} L(\theta) \qquad (47)$$

In the Bayesian formalism one does not commit oneself to one choice for $\theta$, but uses the full posterior distribution

$$P(\theta|D) = \frac{P(\theta)\, e^{L(\theta)}}{\int_\Omega d\theta'\; P(\theta')\, e^{L(\theta')}} \qquad (48)$$

Finally, in maximum a posteriori probability (MAP) estimation one uses the value $\hat{\theta}$ for which $P(\theta|D)$ is maximal,

$$\hat{\theta}_{\rm MAP} = \mathop{\rm argmax}_{\theta\in\Omega} \big[ L(\theta) + \log P(\theta) \big] \qquad (49)$$

For sufficiently large data sets the above estimation methods become equivalent, i.e. $\lim_{N\to\infty}\hat{\theta}_{\rm MAP} = \lim_{N\to\infty}\hat{\theta}_{\rm ML}$ and $\lim_{N\to\infty} P(\theta|D) = \delta(\theta - \hat{\theta}_{\rm ML})$. This follows from the property $\lim_{N\to\infty} L(\theta)/N = \lim_{N\to\infty} \log P(\theta|D)/N$. Moreover, from (46) it follows that $\hat{\theta}_{\rm MAP}$ and $\hat{\theta}_{\rm ML}$ are both consistent estimators [41], provided $W[h_0,\ldots,h_R|z,\theta]$ is an unambiguous parametrisation (i.e. the link $\theta \to P(t,r|z,\theta)$ is one-to-one), and provided the data were indeed generated from a member of the parametrised family $P(t,r|z,\theta)$, since then we will find $\lim_{N\to\infty}\hat{P}(t,r|z) = P(t,r|z,\hat{\theta})$ and $\lim_{N\to\infty} D(\hat{P}\|P_{\hat{\theta}}) = 0$. There are many variations on these protocols, see e.g. [42]. One could reduce the overfitting danger in the ML method by including Akaike's Information Criterion (AIC) or the Bayesian Information Criterion (BIC). Alternative Bayesian routes involve e.g. hyperparameter estimation, variational approximations of the posterior parameter distribution to reduce computation costs, or model selection to select good parametrisations $W[h_0,\ldots,h_R|z,\theta]$.

4 Parametrisation of W[h_0,...,h_R|z]

We obtain a transparent class of parametrisations for $W[h_0,\ldots,h_R|z,\theta]$ by assuming that proportional hazards holds at the level of individuals. We work out the relevant equations, and show how the resulting description includes the standard methods (e.g. Cox regression, frailty, random effects and latent class models) as special cases.

4.1 Generic parametrisation

For each individual i we can always write the individual cause-specific hazard rates in the form $h^i_r(t) = \lambda^i_r(t)\exp(\beta^{0i}_r + \sum_{\mu=1}^p \beta^{\mu i}_r z^i_\mu)$, with all time dependences concentrated in the $\lambda^i_r(t)$. The parameters $\beta^{0i}_r$ represent individual risk-specific frailties, which must be normalised to remove the redundancy due to the invariance of the hazard rates under $\{\lambda^i_r(t), \beta^{0i}_r\} \to \{\lambda^i_r(t)\,e^{-\zeta^i_r}, \beta^{0i}_r + \zeta^i_r\}$. According to (35), we can then write $W[h_0,\ldots,h_R|z]$ as

$$W[h_0,\ldots,h_R|z, M] = \int d\beta_0 \cdots d\beta_R \int \{d\lambda_0 \cdots d\lambda_R\}\; M(\beta_0,\ldots,\beta_R;\lambda_0,\ldots,\lambda_R|z) \prod_{r=0}^R \delta_F\big[h_r - \lambda_r\, e^{\beta^0_r + \sum_{\mu=1}^p \beta^\mu_r z_\mu}\big] \qquad (50)$$

with the short-hand $\beta_r = (\beta^0_r,\ldots,\beta^p_r)$, and with

$$M(\beta_0,\ldots,\beta_R;\lambda_0,\ldots,\lambda_R|z) = \frac{\sum_{i,\; z_i=z} \prod_{r=0}^R \big\{ \delta_F[\lambda_r - \lambda^i_r]\;\delta(\beta_r - \beta^i_r) \big\}}{\sum_{i,\; z_i=z} 1} \qquad (51)$$

So in this parametrisation $\theta = M$. Note that (50) is still completely general; it does not yet imply a proportional hazards assumption at the level of individuals, unless $M(\beta_0,\ldots,\beta_R;\lambda_0,\ldots,\lambda_R|z)$ is independent of z. However, it is practical only if $M(\beta_0,\ldots,\beta_R;\lambda_0,\ldots,\lambda_R|z)$ depends in a relatively simple way on the parameters $\{\beta_0,\ldots,\beta_R\}$ and the functions $\{\lambda_0,\ldots,\lambda_R\}$. To compactify our notation further we introduce the short-hands $\beta_r\!\cdot\! z = \beta^0_r + \sum_{\mu=1}^p \beta^\mu_r z_\mu$ and $\Lambda_r(t) = \int_0^t ds\;\lambda_r(s)$. Inserting (50) into (45) then gives the corresponding data log-likelihood L(M):

$$L(M) = N \sum_z \hat{P}(z) \sum_{r=0}^R \int dt\; \hat{P}(t,r|z)\, \log \int d\beta_0 \cdots d\beta_R \{d\lambda_0 \cdots d\lambda_R\}\; M(\beta_0,\ldots,\beta_R;\lambda_0,\ldots,\lambda_R|z)\; \lambda_r(t)\; e^{\beta_r\cdot z - \sum_{r'} \Lambda_{r'}(t)\exp(\beta_{r'}\cdot z)} \qquad (52)$$

This is equivalent to

$$L(M) = \sum_{i=1}^N \log \int d\beta_0 \cdots d\beta_R \{d\lambda_0 \cdots d\lambda_R\}\; M(\beta_0,\ldots,\beta_R;\lambda_0,\ldots,\lambda_R|z_i)\; \lambda_{r_i}(t_i)\; e^{\beta_{r_i}\cdot z_i - \sum_{r} \Lambda_{r}(t_i)\exp(\beta_{r}\cdot z_i)} \qquad (53)$$

The individual cause-specific hazard rates are written in a form reminiscent of [7], but with time-dependent factors and time-independent regression and frailty parameters for all R+1 risks, distributed according to $M(\beta_0,\ldots,\beta_R;\lambda_0,\ldots,\lambda_R|z)$, in the spirit of frailty and random effects models. However, here this is done for all risks, so that the complexities of competing risks are captured by the correlation structure of $M(\beta_0,\ldots,\beta_R;\lambda_0,\ldots,\lambda_R|z)$. All applications in this report are based on the generic parametrisation (50).

Given (50) one obtains formulae for the decontaminated and crude cause-specific quantities of interest, which are fully exact as long as $M(\beta_0,\ldots,\beta_R;\lambda_0,\ldots,\lambda_R|z)$ is kept general. We write the single-risk marginals of $M(\beta_0,\ldots,\beta_R;\lambda_0,\ldots,\lambda_R|z)$ as

$$M(\beta_r;\lambda_r|z) = \int \Big(\prod_{r'\neq r} d\beta_{r'}\,\{d\lambda_{r'}\}\Big)\; M(\beta_0,\ldots,\beta_R;\lambda_0,\ldots,\lambda_R|z) \qquad (54)$$

For the decontaminated cause-specific survival functions and hazard rates we then get

$$\tilde{S}_r(t|z) = \int d\beta_r\,\{d\lambda_r\}\; M(\beta_r;\lambda_r|z)\; e^{-\exp(\beta_r\cdot z)\,\Lambda_r(t)} \qquad (55)$$

$$\tilde{h}_r(t|z) = \frac{\int d\beta_r\,\{d\lambda_r\}\; M(\beta_r;\lambda_r|z)\; \lambda_r(t)\; e^{\beta_r\cdot z - \exp(\beta_r\cdot z)\,\Lambda_r(t)}}{\int d\beta_r\,\{d\lambda_r\}\; M(\beta_r;\lambda_r|z)\; e^{-\exp(\beta_r\cdot z)\,\Lambda_r(t)}} \qquad (56)$$

The crude hazard rates and the data probability become

$$h_r(t|z) = \frac{\int d\beta_0 \cdots d\beta_R \{d\lambda_0 \cdots d\lambda_R\}\; M(\beta_0,\ldots,\beta_R;\lambda_0,\ldots,\lambda_R|z)\; \lambda_r(t)\; e^{\beta_r\cdot z - \sum_{r'}\exp(\beta_{r'}\cdot z)\,\Lambda_{r'}(t)}}{\int d\beta_0 \cdots d\beta_R \{d\lambda_0 \cdots d\lambda_R\}\; M(\beta_0,\ldots,\beta_R;\lambda_0,\ldots,\lambda_R|z)\; e^{-\sum_{r'}\exp(\beta_{r'}\cdot z)\,\Lambda_{r'}(t)}} \qquad (57)$$

$$P(t,r|z) = \int d\beta_0 \cdots d\beta_R \{d\lambda_0 \cdots d\lambda_R\}\; M(\beta_0,\ldots,\beta_R;\lambda_0,\ldots,\lambda_R|z)\; \lambda_r(t)\; e^{\beta_r\cdot z - \sum_{r'}\exp(\beta_{r'}\cdot z)\,\Lambda_{r'}(t)} \qquad (58)$$

and, finally, the covariate-conditioned cumulative cause-specific incidence functions are

$$F_r(t|z) = \int d\beta_0 \cdots d\beta_R \{d\lambda_0 \cdots d\lambda_R\}\; M(\beta_0,\ldots,\beta_R;\lambda_0,\ldots,\lambda_R|z) \int_0^t dt'\; \lambda_r(t')\; e^{\beta_r\cdot z - \sum_{r'}\exp(\beta_{r'}\cdot z)\,\Lambda_{r'}(t')} \qquad (59)$$
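For any parametrisation M from which one can sample, the integrals in (55)-(58) reduce to Monte Carlo averages. The sketch below is a hypothetical example (all values invented): a bivariate-Gaussian random-effects M over the frailties of two correlated risks, with fixed regression coefficients and constant base hazard rates, for which we estimate the decontaminated survival (55) and the crude hazard (57) by sampling.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical M: the frailties (beta^0_1, beta^0_2) of two risks are jointly
# Gaussian and positively correlated; regression coefficients are fixed;
# base hazard rates are constant, lambda_r(t) = lam[r].
lam = np.array([0.05, 0.10])
cov = np.array([[1.0, 0.8], [0.8, 1.0]])
beta0 = rng.multivariate_normal([0.0, 0.0], cov, size=20000)
beta_z = np.array([0.5, -0.5])               # coefficients of covariate z

def risk_scores(z):
    """exp(beta_r . z) for each Monte Carlo sample and each risk."""
    return np.exp(beta0 + beta_z * z)        # shape (samples, 2)

def S_decon(t, z, r=0):
    """Decontaminated survival, eq (55): average over the marginal of M."""
    return np.mean(np.exp(-risk_scores(z)[:, r] * lam[r] * t))

def h_crude(t, z, r=0):
    """Crude hazard, eq (57): all risks filter the surviving sub-cohort."""
    e = risk_scores(z)
    w = np.exp(-(e * lam * t).sum(axis=1))   # joint survival weights
    return np.mean(w * lam[r] * e[:, r]) / np.mean(w)

print("S~_1(t=10 | z=1) =", S_decon(10.0, 1.0))
print("h_1(t=10 | z=1)  =", h_crude(10.0, 1.0))
```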

4.2 Connection with conventional regression methods

The parametrisation (50) is generic, so all regression methods compatible with assuming heterogeneity-induced competing risks will correspond to specific choices for $M(\beta_0,\ldots,\beta_R;\lambda_0,\ldots,\lambda_R|z)$. We label the primary risk as r = 1. All methods that take primary and non-primary risks to be independent would have $M(\beta_0,\ldots,\beta_R;\lambda_0,\ldots,\lambda_R|z) = M(\beta_1,\lambda_1|z)\,M(\beta_0,\beta_2,\ldots,\beta_R;\lambda_0,\lambda_2,\ldots,\lambda_R|z)$ with some specific $M(\beta_1,\lambda_1|z)$, e.g.

Cox's proportional hazards regression [7]. Here one assumes that there is no variability in the parameters $(\beta_1,\lambda_1)$ of the primary risk. Elimination of parameter redundancy then means that $\beta^0_1$ is absorbed into $\lambda_1(t)$, and we find

$$M(\beta_1;\lambda_1|z) = \delta_F[\lambda_1 - \hat{\lambda}]\;\delta(\beta^0_1) \prod_{\mu=1}^p \delta(\beta^\mu_1 - \hat{\beta}^\mu) \qquad (60)$$

Via maximum likelihood one can express the base hazard rate in terms of the regression coefficients $\{\hat{\beta}^\mu\}$ (giving Breslow's formula), substitution of which leads to Cox's equations [7]. See Appendix C for details.

Simple frailty models. In simple frailty models [16, 18], the frailty parameters of different risks are assumed statistically independent, so the heterogeneity of the cohort is concentrated in the random parameter $\beta^0_1$:

$$M(\beta_1;\lambda_1|z) = \delta_F[\lambda_1 - \hat{\lambda}]\; g(\beta^0_1) \prod_{\mu=1}^p \delta(\beta^\mu_1 - \hat{\beta}^\mu) \qquad (61)$$

One then usually chooses the frailty distribution $g(\beta^0_1)$ to be of a specific parametrised form that allows one to do the various relevant integrals over $\beta^0_1$ analytically. See Appendix C for details.

Simple random effects models. In simple random effects models, such as [21], one still takes the primary risk parameters to be independent of the non-primary ones, but now the regression coefficients that couple to the covariates are non-uniform:

$$M(\beta_1;\lambda_1|z) = \delta_F[\lambda_1 - \hat{\lambda}]\; W(\beta_1) \qquad (62)$$

One assumes a parametrised form for the distribution $W(\beta_1)$ and estimates its parameters from the data.

Latent class models. The latent class models of [27] are found upon assuming the cohort to consist of a finite number of discrete sub-cohorts. Each is of the type (60), but with distinct base hazard rates

and association parameters. The probabilities for individuals to belong to each sub-cohort are allowed to depend on their covariates, as in [28]:

$$M(\beta_1;\lambda_1|z) = \sum_{l=1}^L w(l|z)\; \delta_F[\lambda_1 - \hat{\lambda}_l]\;\delta(\beta^0_1) \prod_{\mu=1}^p \delta(\beta^\mu_1 - \hat{\beta}^{l\mu}) \qquad (63)$$

$$w(l|z) = \frac{e^{\alpha^l_0 + \sum_{\mu=1}^p \alpha^l_\mu z_\mu}}{\sum_{l'=1}^L e^{\alpha^{l'}_0 + \sum_{\mu=1}^p \alpha^{l'}_\mu z_\mu}} \qquad (64)$$

These models all focus on the primary risk only, and thereby lose the ability to deal with the competing risk problem. Some authors have tried to characterise all risks and their parameter interactions [17, 11], but did not yet develop systematic decontamination protocols. Of course there are many variations on the above models, including versions with time-dependent covariates, and models with non-latent classes in the sense that for each individual one knows the class label. It is easy to see how they fit into the generic formulation.

4.3 A simple latent class parametrisation for heterogeneity-induced competing risks

Descriptions that include all risks and their correlations will have more parameters than those limited to the primary risk. In view of overfitting, it is vital that one limits the complexity of the chosen parametrisation. The difference between frailty and random effects models lies only in whether the risk variability relates to known or unknown covariates, so it seems logical to combine both. If we take the heterogeneity to be discrete, but without the covariate dependence of the class probabilities of (63), if we assume the end-of-trial risk not to depend on the covariates, and if we choose the base hazard rates of all risks to be uniform in the cohort, we obtain a simple model family in which $M(\beta_0,\ldots,\beta_R;\lambda_0,\ldots,\lambda_R|z) = \delta(\beta_0)\,\delta_F[\lambda_0 - \hat{\lambda}_0]\, M(\beta_1,\ldots,\beta_R;\lambda_1,\ldots,\lambda_R)$, with

$$M(\beta_1,\ldots,\beta_R;\lambda_1,\ldots,\lambda_R) = M(\beta_1,\ldots,\beta_R) \prod_{r=1}^R \delta_F[\lambda_r - \hat{\lambda}_r] \qquad (65)$$

$$M(\beta_1,\ldots,\beta_R) = \sum_{l=1}^L w_l \prod_{r=1}^R \delta(\beta_r - \hat{\beta}^l_r) \qquad (66)$$

Here $\hat{\beta}^l_r = (\hat{\beta}^{l0}_r,\ldots,\hat{\beta}^{lp}_r)$. See Figure 2 for an illustration of what (65) means in terms of individual cause-specific hazard rates. For any choice of the number L of assumed latent classes, the remaining parameters to be estimated are: the cause-specific base hazard rates $\{\hat{\lambda}_r(t)\}$ of all risks, the L class sizes $w_l \in [0,1]$ (subject to $\sum_{l=1}^L w_l = 1$), and the regression coefficients $\{\hat{\beta}^{l\mu}_r\}$ and frailty parameters $\{\hat{\beta}^{l0}_r\}$ of all risks r = 1...R and all latent classes. One can see (65) as a generalisation of [26] (where only the frailties, as opposed to also the associations, were class-dependent). The remaining parametrisation invariance is $\{\hat{\lambda}_r(t), \hat{\beta}^{l0}_r\} \to \{\hat{\lambda}_r(t)\,e^{-\zeta_r}, \hat{\beta}^{l0}_r + \zeta_r\}$ for all l, which is removed by setting $\hat{\beta}^{10}_r = 0$ for all r. Finding the optimal value of L is a simple Bayesian model selection problem.

The log-likelihood (53) is at the core of parameter estimation. For the multi-risk parametrisation (65), with our usual short-hand $\hat{\beta}^l_r\!\cdot\! z = \hat{\beta}^{l0}_r + \sum_{\mu=1}^p \hat{\beta}^{l\mu}_r z_\mu$, with $\bar{\delta}_{ab} = 1 - \delta_{ab}$, and with the convention $\hat{\beta}^l_0 = (0,\ldots,0)$ for the end-of-trial risk, it simplifies to

$$L(M) = \sum_{i=1}^N \log \hat{\lambda}_{r_i}(t_i) + \sum_{i=1}^N \log\Big\{ \sum_{l=1}^L w_l\; e^{\hat{\beta}^l_{r_i}\cdot z_i - \sum_{r=0}^R \hat{\Lambda}_r(t_i)\exp(\hat{\beta}^l_r\cdot z_i)} \Big\} = L_0(M) + L_{\rm risks}(M) \qquad (67)$$

[Figure 2 appears here: schematic of latent classes 1...L; class l contains a fraction $w_l$ of the cohort, and within it every individual i has, for all r > 0, the hazard rate $h^i_r(t) = \hat{\lambda}_r(t)\,e^{\hat{\beta}^{l0}_r + \sum_{\mu=1}^p \hat{\beta}^{l\mu}_r z^i_\mu}$.]

Figure 2: Illustration of the parametrisation (65). All individuals i in the cohort are assumed to have personalised cause-specific hazard rates $h^i_r(t)$ which for all risks r = 1...R are of the proportional hazards form. The cohort is allowed to be heterogeneous in that it may consist of L sub-cohorts (or "latent classes"), labelled by l = 1...L. Each latent class l contains individuals with risk-specific frailties $\hat{\beta}^{l0}_r$ and with risk-specific regression parameters $\hat{\beta}^{l\mu}_r$ that capture the impact of the covariates. The base hazard rates $\hat{\lambda}_r(t)$ of the risks are assumed not to vary between individuals. The class membership of the individuals in our data set is not known a priori, but can be inferred a posteriori.

The first term of (67) probes the end-of-trial censoring information. The second contains the quantities related to the true risks:

$$L_0(M) = \sum_{i=1}^N \delta_{0,r_i} \log \hat{\lambda}_0(t_i) - \sum_{i=1}^N \hat{\Lambda}_0(t_i) \qquad (68)$$

$$L_{\rm risks}(M) = \sum_{i=1}^N \bar{\delta}_{0,r_i} \log \hat{\lambda}_{r_i}(t_i) + \sum_{i=1}^N \log\Big\{ \sum_{l=1}^L w_l\; e^{\bar{\delta}_{0,r_i}\,\hat{\beta}^l_{r_i}\cdot z_i - \sum_{r=1}^R \hat{\Lambda}_r(t_i)\exp(\hat{\beta}^l_r\cdot z_i)} \Big\} \qquad (69)$$

Inserting (65) into our formulae for the decontaminated cause-specific survival functions and hazard rates of the true risks r > 0 gives the relatively simple and intuitive expressions

$$\tilde{S}_r(t|z) = \sum_{l=1}^L w_l\; e^{-\exp(\hat{\beta}^l_r\cdot z)\,\hat{\Lambda}_r(t)} \qquad (70)$$

$$\tilde{h}_r(t|z) = \hat{\lambda}_r(t)\; \frac{\sum_{l=1}^L w_l\; e^{\hat{\beta}^l_r\cdot z - \exp(\hat{\beta}^l_r\cdot z)\,\hat{\Lambda}_r(t)}}{\sum_{l=1}^L w_l\; e^{-\exp(\hat{\beta}^l_r\cdot z)\,\hat{\Lambda}_r(t)}} \qquad (71)$$

The crude hazard rate and the data probability become

$$h_r(t|z) = \hat{\lambda}_r(t)\; \frac{\sum_{l=1}^L w_l\; e^{\hat{\beta}^l_r\cdot z - \sum_{r'=1}^R \exp(\hat{\beta}^l_{r'}\cdot z)\,\hat{\Lambda}_{r'}(t)}}{\sum_{l=1}^L w_l\; e^{-\sum_{r'=1}^R \exp(\hat{\beta}^l_{r'}\cdot z)\,\hat{\Lambda}_{r'}(t)}} \qquad (72)$$

$$P(t,r|z) = \hat{\lambda}_r(t)\, e^{-\hat{\Lambda}_0(t)} \sum_{l=1}^L w_l\; e^{\hat{\beta}^l_r\cdot z - \sum_{r'=1}^R \exp(\hat{\beta}^l_{r'}\cdot z)\,\hat{\Lambda}_{r'}(t)} \qquad (73)$$

From the crude cause-specific hazard rates follow the crude cause-specific survival functions for r = 1...R, via the relation $S_r(t|z) = \exp[-\int_0^t ds\; h_r(s|z)]$.
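A sketch of how (70)-(73) can be evaluated in practice is given below, for constant base hazard rates and invented parameter values (in general $\hat{\Lambda}_r(t)$ would come from the fitted, e.g. piecewise-constant, base hazard rates).

```python
import numpy as np

# Hypothetical fitted two-class model with two true risks, constant base
# hazard rates, and class-specific regression parameters (invented values).
# NB: code risk index 0/1 corresponds to the paper's r = 1/2 (the paper's
# r = 0 end-of-trial risk is not needed for eqs (70)-(72)).
w = np.array([0.5, 0.5])                       # class sizes w_l
lam = np.array([0.05, 0.10])                   # base rates lambda_r
beta = np.array([[[0.0, 2.0], [0.0, 0.0]],     # beta[l][r] = (b0, b1),
                 [[0.0, -2.0], [0.0, 2.0]]])   # class l, risk r, p=1 covariate

def bz(l, r, z):
    """Linear predictor beta^l_r . z for a single covariate z."""
    return beta[l, r, 0] + beta[l, r, 1] * z

def S_decon(t, z, r):
    """Decontaminated survival, eq (70)."""
    return sum(w[l] * np.exp(-np.exp(bz(l, r, z)) * lam[r] * t)
               for l in range(len(w)))

def h_crude(t, z, r):
    """Crude cohort-level hazard rate, eq (72)."""
    def tot(l):  # sum over r' of exp(beta^l_r' . z) * Lambda_r'(t)
        return sum(np.exp(bz(l, rp, z)) * lam[rp] * t for rp in range(2))
    num = sum(w[l] * np.exp(bz(l, r, z) - tot(l)) for l in range(len(w)))
    den = sum(w[l] * np.exp(-tot(l)) for l in range(len(w)))
    return lam[r] * num / den

print("S~_1(10 | z=0.7) =", S_decon(10.0, 0.7, r=0))
print("h_1(10 | z=0.7)  =", h_crude(10.0, 0.7, r=0))
```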

The cumulative cause-specific incidence functions corresponding to (65) are

$$F_r(t|z) = \int_0^t dt'\; \hat{\lambda}_r(t')\, e^{-\hat{\Lambda}_0(t')} \sum_{l=1}^L w_l\; e^{\hat{\beta}^l_r\cdot z - \sum_{r'=1}^R \exp(\hat{\beta}^l_{r'}\cdot z)\,\hat{\Lambda}_{r'}(t')} \qquad (74)$$

The specific parametrisation (65) has two additional useful features:

- After the class sizes $(w_1,\ldots,w_L)$ have been inferred, one obtains the effective number $L_{\rm eff}$ of classes via Shannon's information-theoretic entropy S [43], which takes into account any class size differences and can complement Bayesian model selection in the identification of the optimal value of L:

$$L_{\rm eff} = e^S, \qquad S = -\sum_{l=1}^L w_l \log w_l \qquad (75)$$

- Since our latent classes are defined in terms of the relation between covariates and risk, one cannot predict class membership for individuals on the basis of covariate information alone. However, Bayesian arguments allow us to calculate class membership probabilities retrospectively, for any individual for whom we have the covariates z and the survival information (t, r); see the sketch after this list. For each class label l, the model (65) gives

$$P(t,r|z,l) = \hat{\lambda}_r(t)\; e^{\hat{\beta}^l_r\cdot z - \hat{\Lambda}_0(t) - \sum_{r'=1}^R \exp(\hat{\beta}^l_{r'}\cdot z)\,\hat{\Lambda}_{r'}(t)} \qquad (76)$$

Hence, using $P(t,r,l|z) = P(t,r|z,l)\,w_l$ and $P(t,r|z) = \sum_{l'=1}^L P(t,r|z,l')\,w_{l'}$, we obtain

$$P(l|t,r,z) = \frac{w_l\; P(t,r|z,l)}{\sum_{l'=1}^L w_{l'}\; P(t,r|z,l')} = \frac{w_l\; e^{\hat{\beta}^l_r\cdot z - \sum_{r'=1}^R \exp(\hat{\beta}^l_{r'}\cdot z)\,\hat{\Lambda}_{r'}(t)}}{\sum_{l'=1}^L w_{l'}\; e^{\hat{\beta}^{l'}_r\cdot z - \sum_{r'=1}^R \exp(\hat{\beta}^{l'}_{r'}\cdot z)\,\hat{\Lambda}_{r'}(t)}} \qquad (77)$$

The probability that individual i belongs to class l is then $P(l|t_i,r_i,z_i)$. Retrospective class assignment could aid the search for informative new covariates, increasing our ability to predict personalised risk in heterogeneous cohorts; such covariates are expected to be features that patients in the same class have in common.

Finally, instead of imposing by hand the independence of the end-of-trial risk from the covariates (to reduce the number of model parameters), one could treat the end-of-trial risk like any other risk. Any parameter estimation protocol should then report that $\hat{\beta}^{l\mu}_0 = 0$ for all l and all $\mu = 1\ldots p$, which gives a sanity test of numerical implementations.
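The retrospective assignment (77) is straightforward to implement; a minimal sketch (reusing the invented two-class parameters of the previous code block, with constant base hazard rates) is shown below, together with the effective class number (75).

```python
import numpy as np

w = np.array([0.5, 0.5])
lam = np.array([0.05, 0.10])
beta = np.array([[[0.0, 2.0], [0.0, 0.0]],
                 [[0.0, -2.0], [0.0, 2.0]]])    # beta[l][r] = (b0, b1)

def class_posterior(t, r, z):
    """Retrospective class membership probabilities P(l | t, r, z), eq (77).

    r is the paper's risk label 1..R, mapped to code index r-1.
    """
    bz = beta[:, :, 0] + beta[:, :, 1] * z       # (L, R) linear predictors
    tot = (np.exp(bz) * lam * t).sum(axis=1)     # sum_r' exp(b.z) Lambda_r'(t)
    log_terms = np.log(w) + bz[:, r - 1] - tot
    p = np.exp(log_terms - log_terms.max())      # numerically stable
    return p / p.sum()

# Effective number of classes, eq (75).
S = -np.sum(w * np.log(w))
print("L_eff =", np.exp(S))
print("P(l | t=8, r=1, z=1.2) =", class_posterior(8.0, 1, 1.2))
```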

5 Application to synthetic survival data

To test our results under controlled conditions we turn to synthetic data with heterogeneity-induced competing risks, generated from populations of the type (65). Details of the numerical data generation are given in Appendix D. Our method should map a cohort's risk and association substructure, if it exists, i.e. report the number and sizes of sub-cohorts and their distinct regression parameters for all risks. It should then use this information to generate correct decontaminated survival curves, and assign individuals retrospectively to their correct latent classes.

5.1 Cohort substructure and regression parameters

We generated numerically event times and event types for three heterogeneous data sets A, B and C. Each has N = 1600 individuals from two latent classes of equal size, with at most two real risks, and with end-of-trial censoring at time t = 50. Each individual i has three covariates $(z^1_i, z^2_i, z^3_i)$, drawn randomly and independently from $P(z) = (2\pi)^{-1/2} e^{-z^2/2}$. All frailty parameters $\beta^{l0}_r$ are zero. The base hazard rates of the risks are time-independent: $\hat{\lambda}_1(t) = 0.05$ (primary risk) and $\hat{\lambda}_2(t) = 0.1$ (if a secondary risk is enabled). Table 1 shows the further specifications of the data sets, together with the results of performing proportional hazards regression [7], and of our generic heterogeneous regression with the latent class log-likelihood (69), where the MAP protocol was complemented with Akaike's Information Criterion as described in (111). The three data sets were constructed such that they have fully identical primary risk characteristics. In set A there is heterogeneity but no competing risk. In set B a second risk is introduced, which in one of the two classes targets individuals similar to those most sensitive to the primary risk (with respect to the first covariate); here one expects false protectivity effects. In set C a second risk is introduced, which in one of the two classes targets individuals similar to those least sensitive to the primary risk (with respect to the first covariate); here one expects false exposure effects.

As expected, the proportional hazards regression method [7] fails to report meaningful results, since it aims to describe the relation between covariates and the primary risk in each data set with a single regression vector $(\beta^1_1, \beta^2_1, \beta^3_1)$. The heterogeneous regression based on (69,111) always reports the correct number of classes (L = 2), and the correct class-specific parameters (within accuracy limits determined by the numerical search accuracy and the finite sample size). Note that the assignment of class labels to the identified classes is in principle arbitrary; see e.g. the regression results for data set B, where the class labelled l = 2 is labelled l = 1 in the data definition. A sketch of the data generation process is given below.
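The following sketch generates a data set of this type; it is our reading of the setup described above, and the class-specific regression vectors are placeholders for the Table 1 entries, which are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(3)
N, t_end = 1600, 50.0
lam = {1: 0.05, 2: 0.10}                      # constant base hazard rates

# Placeholder class-specific regression vectors (b1, b2, b3) per risk;
# the actual values are those specified in Table 1.
beta = {1: {1: np.array([2.0, 0.0, 0.0]),
            2: np.array([0.0, 0.0, 0.0])},
        2: {1: np.array([-2.0, 0.0, 0.0]),
            2: np.array([2.0, 0.0, 0.0])}}

z = rng.standard_normal((N, 3))               # three Gaussian covariates
cls = rng.permutation(np.repeat([1, 2], N // 2))   # two equal latent classes

times, risks = np.empty(N), np.empty(N, dtype=int)
for i in range(N):
    # Constant individual hazards => exponential event times per risk.
    t_r = {r: rng.exponential(1.0 / (lam[r] * np.exp(beta[cls[i]][r] @ z[i])))
           for r in lam}
    r_min = min(t_r, key=t_r.get)
    if t_r[r_min] < t_end:
        times[i], risks[i] = t_r[r_min], r_min
    else:
        times[i], risks[i] = t_end, 0         # end-of-trial censoring, r = 0
print("event counts (r=0,1,2):", np.bincount(risks))
```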
5.2 Decontaminated survival functions

The second test of our method and its numerical implementation is to verify that, for all three data sets A, B and C, it can extract the correct decontaminated covariate-conditioned survival curve $\tilde{S}_1(t|z)$ for the primary risk from the survival data alone. The result should be identical in all three cases, since the data sets differ only in the interference effects of a secondary risk. For the primary risk in Table 1, the correct expression (70) simplifies to

$$\tilde{S}_1(t|z_1) = \tfrac{1}{2}\, e^{-\frac{t}{20}\exp(2z_1)} + \tfrac{1}{2}\, e^{-\frac{t}{20}\exp(-2z_1)} \qquad (78)$$

From this we can calculate the true primary risk survival curves for the upper and lower covariate quartiles (UQ, LQ) and for the inter-quartile range (IQ). For our Gaussian-distributed covariates, with zero average and unit variance, the upper and lower quartile survival curves are identical, due to the symmetry $\tilde{S}_1(t|z_1) = \tilde{S}_1(t|-z_1)$. With the short-hand $Dz = (2\pi)^{-1/2} e^{-z^2/2}\, dz$, and the quartile point $z_Q$ defined via $\int_{z_Q}^\infty Dz = \frac{1}{4}$, giving $z_Q \approx 0.67449$, we obtain from (78):

$$\text{LQ, UQ:} \qquad \tilde{S}_1(t\,|\,z_1 \in [z_Q,\infty)) = 2\int_{z_Q}^\infty \! Dz\; \Big( e^{-\frac{t}{20}\exp(2z)} + e^{-\frac{t}{20}\exp(-2z)} \Big) \qquad (79)$$

$$\text{IQ:} \qquad \tilde{S}_1(t\,|\,z_1 \in [-z_Q,z_Q]) = 2\int_0^{z_Q} \! Dz\; \Big( e^{-\frac{t}{20}\exp(2z)} + e^{-\frac{t}{20}\exp(-2z)} \Big) \qquad (80)$$

Figure 3 shows the true LQ, UQ and IQ survival curves (79,80) for data sets A, B and C, together with the decontaminated curves $\tilde{S}_1$ of (70), calculated from application of our regression method