Duration data analysis - basic concepts

Similar documents
Exercises. (a) Prove that m(t) =

Cox s proportional hazards model and Cox s partial likelihood

Time-dependent covariates

Survival Analysis. Stat 526. April 13, 2018

Multistate Modeling and Applications

Chapter 4 Fall Notations: t 1 < t 2 < < t D, D unique death times. d j = # deaths at t j = n. Y j = # at risk /alive at t j = n

Survival Analysis. Lu Tian and Richard Olshen Stanford University

Censoring and Truncation - Highlighting the Differences

In contrast, parametric techniques (fitting exponential or Weibull, for example) are more focussed, can handle general covariates, but require

DAGStat Event History Analysis.

ST745: Survival Analysis: Nonparametric methods

Nonparametric Model Construction

Multi-state Models: An Overview

Lecture 6 PREDICTING SURVIVAL UNDER THE PH MODEL

Cox s proportional hazards/regression model - model assessment

PhD course in Advanced survival analysis. One-sample tests. Properties. Idea: (ABGK, sect. V.1.1) Counting process N(t)

Extensions of Cox Model for Non-Proportional Hazards Purpose

Survival Analysis I (CHL5209H)

Likelihood Construction, Inference for Parametric Survival Distributions

Ph.D. course: Regression models. Introduction. 19 April 2012

Practice Exam 1. (A) (B) (C) (D) (E) You are given the following data on loss sizes:

ST5212: Survival Analysis

Lecture 22 Survival Analysis: An Introduction

Ph.D. course: Regression models. Regression models. Explanatory variables. Example 1.1: Body mass index and vitamin D status

Statistics 262: Intermediate Biostatistics Regression & Survival Analysis

Lecture 7 Time-dependent Covariates in Cox Regression

Robust estimates of state occupancy and transition probabilities for Non-Markov multi-state models

Lecture 3. Truncation, length-bias and prevalence sampling

Censoring mechanisms

ADVANCED STATISTICAL ANALYSIS OF EPIDEMIOLOGICAL STUDIES. Cox s regression analysis Time dependent explanatory variables

Statistical Inference and Methods

STAT331. Cox s Proportional Hazards Model

10 Introduction to Reliability

Step-Stress Models and Associated Inference

Introduction to Reliability Theory (part 2)

for Time-to-event Data Mei-Ling Ting Lee University of Maryland, College Park

Subject CT4 Models. October 2015 Examination INDICATIVE SOLUTION

Notes largely based on Statistical Methods for Reliability Data by W.Q. Meeker and L. A. Escobar, Wiley, 1998 and on their class notes.

Survival Distributions, Hazard Functions, Cumulative Hazards

Duration Analysis. Joan Llull

Typical Survival Data Arising From a Clinical Trial. Censoring. The Survivor Function. Mathematical Definitions Introduction

REGRESSION ANALYSIS FOR TIME-TO-EVENT DATA THE PROPORTIONAL HAZARDS (COX) MODEL ST520

Multistate models and recurrent event models

Lecture 4 - Survival Models

4. Comparison of Two (K) Samples

A COMPARISON OF POISSON AND BINOMIAL EMPIRICAL LIKELIHOOD Mai Zhou and Hui Fang University of Kentucky

CIMAT Taller de Modelos de Capture y Recaptura Known Fate Survival Analysis

ASM Study Manual for Exam P, Second Edition By Dr. Krzysztof M. Ostaszewski, FSA, CFA, MAAA Errata

CHAPTER 3 ANALYSIS OF RELIABILITY AND PROBABILITY MEASURES

Other Survival Models. (1) Non-PH models. We briefly discussed the non-proportional hazards (non-ph) model

Application of Time-to-Event Methods in the Assessment of Safety in Clinical Trials

Yuxin Zhang. September 21st, 2018

STAT Sample Problem: General Asymptotic Results

Multistate models and recurrent event models

18.465, further revised November 27, 2012 Survival analysis and the Kaplan Meier estimator

Multi-state models: prediction

Survival Regression Models

Estimation for Modified Data

4 Testing Hypotheses. 4.1 Tests in the regression setting. 4.2 Non-parametric testing of survival between groups

Survival Models. Lecture: Weeks 2-3. Lecture: Weeks 2-3 (Math 3630) Survival Models Fall Valdez 1 / 31

ASM Study Manual for Exam P, First Edition By Dr. Krzysztof M. Ostaszewski, FSA, CFA, MAAA Errata

BEST TESTS. Abstract. We will discuss the Neymann-Pearson theorem and certain best test where the power function is optimized.

Survival Analysis. 732G34 Statistisk analys av komplexa data. Krzysztof Bartoszek

ECON 721: Lecture Notes on Duration Analysis. Petra E. Todd

PhD course: Statistical evaluation of diagnostic and predictive models

Log-linearity for Cox s regression model. Thesis for the Degree Master of Science

Stochastic Modelling Unit 1: Markov chain models

Statistical Methods for Alzheimer s Disease Studies

1. Reliability and survival - basic concepts

Statistics 262: Intermediate Biostatistics Non-parametric Survival Analysis

Logistic regression model for survival time analysis using time-varying coefficients

Approximation of Survival Function by Taylor Series for General Partly Interval Censored Data

Survival models and health sequences

Survival Analysis Math 434 Fall 2011

PubH 7470: STATISTICS FOR TRANSLATIONAL & CLINICAL RESEARCH

Math438 Actuarial Probability

Statistik for MPH: september Analyse af kohortestudier, rater, standardisering (Silva: 70-80,

Recall Discrete Distribution. 5.2 Continuous Random Variable. A probability histogram. Density Function 3/27/2012

A multi-state model for the prognosis of non-mild acute pancreatitis

Errata for the ASM Study Manual for Exam P, Fourth Edition By Dr. Krzysztof M. Ostaszewski, FSA, CFA, MAAA

Survival Analysis: Weeks 2-3. Lu Tian and Richard Olshen Stanford University

UNIVERSITY OF CALIFORNIA, SAN DIEGO

7.1 The Hazard and Survival Functions

ST495: Survival Analysis: Maximum likelihood

Multivariate spatial modeling

STAT Section 2.1: Basic Inference. Basic Definitions

Optimal Treatment Regimes for Survival Endpoints from a Classification Perspective. Anastasios (Butch) Tsiatis and Xiaofei Bai

Instrumental variables estimation in the Cox Proportional Hazard regression model

An Overview of Methods for Applying Semi-Markov Processes in Biostatistics.

Definitions and examples Simple estimation and testing Regression models Goodness of fit for the Cox model. Recap of Part 1. Per Kragh Andersen

Individualized Treatment Effects with Censored Data via Nonparametric Accelerated Failure Time Models

Multistate models in survival and event history analysis

Estimation of Conditional Kendall s Tau for Bivariate Interval Censored Data

Goodness-of-fit tests for randomly censored Weibull distributions with estimated parameters

First Aid Kit for Survival. Hypoxia cohort. Goal. DFS=Clinical + Marker 1/21/2015. Two analyses to exemplify some concepts of survival techniques

Topic 4: Continuous random variables

Analysis of competing risks data and simulation of data following predened subdistribution hazards

Creating New Distributions

e 4β e 4β + e β ˆβ =0.765

Probability Midterm Exam 2:15-3:30 pm Thursday, 21 October 1999

Transcription:

Duration data analysis - basic concepts Rasmus Waagepetersen Department of Mathematics Aalborg University Denmark September 7, 2018 1 / 25

Course topics (tentative) duration data - censoring and likelihoods estimation of the survival function and the cumulative hazard parametric models semi-parametric inference - Cox s partial likelihood model assessment point process/counting process approach (review) time-dependent variables competing risks 2 / 25

Estimation of probability of loss given default Risk management in banks: probability of default and probability of loss given default (default=nedskrivning eller tab). For each customer bank records monthly default/loss status (D,L) until first loss or customer leaves bank (Q, with no loss) or date of recording. Examples of data sets for various customers: D, D, D,D,D,D D,L D,D,L D,D,D, D D,D,Q D, D, D, D, D How to estimate probability of loss given default? First restrict attention to customer with default: D, D, D,D,D,D D,L D,D,L D,D,D, D D,D,Q. Here we observe two customers with loss given default and three customers without. Estimate 40%? 3 / 25

But suppose we did not observe loss for first customer because loss did not yet occur at date of recording data? We did perhaps not observe customer long enough missing data Default sequence: we denote by a default sequence, a sequence of observations initiated by a default and ending by L, Q, D or by D at time of recording. E.g. the data sequence D,D,D, D,D,D contains two default sequences D,D, D and D,D. T : time to loss after first default in a default sequence. I.e. for sequence D,D,L, T = 2. For sequence D,D T is unknown (just know T 2). 4 / 25

Idea: factorize into conditional probabilities Probability of loss given default is P(T < ) = P(T = n) = P(T = n T n)p(t n) n=1 n=1 Thus enough to estimate P(T = n T n) since for any k 1 k 1 P(T k) = (1 P(T = l T l)) l=1 We can estimate P(T = l T l) unbiasedly for any l! Focus now on survival function P(T n) and hazard function P(T = n T n). These are basic concepts in duration/survival analysis! 5 / 25

klosterforsikring In 1872 T.N. Thiele (Danish astronomer, statistician, actuarian) engaged in designing an annuity/insurance for unmarried women (of wealthy origin). A woman was dependent on getting married to support her living. Parents should be able to insure a daughter against not getting married. From certain age daughter would get a yearly amount until death or marriage. 6 / 25

Price of insurance: expected time to death or marriage times yearly amount. If annuity pr. year is q and T denotes time to marriage or death, then for retirement age t R, price = qe[t t R T t R ]P(T t R ) = qmrl(t R )S(t R ) NB: in reality future payments should be discounted to get present value of future payments (inflation) Sometimes we define survival function as S(t) = P(T > t) - distinction only matters for discrete time. mrl: mean residual life time. 7 / 25

T M, T D : times to marriage respectively death in years. T = min(t M, T D ). E[T t R T t R ] = n=0 P(T t R n T t R ) Assuming independence P(T t) = P(T M t)p(t D t). Thiele estimated P(T M t) and P(T D t) for t = 1, 2,... using parametric models and least squares from data recorded at jomfruklostre (existing homes for unmarried women). We will return to this data set later on in an exercise. 8 / 25

Practical considerations man...ved at gøre giftermål eller ikke gifter-mål til genstand for forsikring gør sig afhængig af den forsikredes frie vilje This is the reason why Thiele uses data from jomfruklostre to get valid estimates of probability that insured women do not marry - insured women might or might not be less inclined to marriage than women in general, however Er valget mellem gift og ugift stand end utvivlsomt altid en frivillig sag, så er der naturlige bånd på denne som på enhver frihed. Og er det end muligt for enhver at fatte og at gennemføre en cølibatsbeslutning så er der dog kræfter, mægtige kræfter, der modsætte sig Jeg mener også, at det vil være nødvendigt, ikke at optage interessenter i en så fremrykket alder, at det bliver let for dem eller deres familie, at danne sig et skøn om deres individuelle sandsynlighed for at blive gift 9 / 25

Time to breakdown of windturbine Vesta A/S wants to design insurance/maintenance policies. Thus need to estimate the cost of maintaining a wind turbine. Thus need to estimate the distribution of the time from wind turbine is installed until e.g. gear box breaks down. The wear of a turbine depends on the load that the wind turbine is exposed to - which again depends on the weather conditions: time dependent variable. Other variables (not time dependent): type of turbine, manufacturer... 10 / 25

Time to death of cirrhosis In the period 1962-1969 532 patients with the diagnosis of cirrhosis joined a randomized clinical trial for which the aim was to investigate the effect of treatment with the hormone prednison. The patients were randomly assigned to either prednison or placebo treatments. The survival times of the patients were observed until september 1974 so that observations were right censored for patients who were alive at this date. 11 / 25

Discrete or continuous time? In practice, data are always discrete either by construction or by rounding. Continous time models mathematically convenient and useful if rounding of data not too severe. E.g. Vestas and cirrhosis data analysed using continuous time models. 12 / 25

Common features of duration data 1. positive 2. right skewed 3. censored (mainly right censoring) - terminal event not observed at time of recording data. 4. theory very much based on probability. 5. semi-parametric methods very important. Due to 1. and 2. normal models usually not useful. Ignoring 3. will introduce possibly strong bias of estimates. 4. is a concept very different from usual parametric models. Selfstudy: various parametric alternatives to normal models (exponential, Weibull, log normal, gamma). 13 / 25

Hazard and survival function Let T denote random duration time with pdf f and cdf F. Assume T continuous random variable. Survival function Hazard function S(t) = P(T > t) = 1 F (t) h(t) = f (t)/s(t) h(t)dt: probability that T [t, t + dt[ given T t. Plots of hazard function usually more informative than plots of survival function. 14 / 25

Types of right censoring Let X be duration time and C time to censoring. We observe T = min(x, C) and = 1[X C] ( = 1 means duration time observed). Type 1 censoring: an event is only observed if it occurs prior to some fixed time t obs. If a subject enters at time t start then C = t obs t start. Progressive type 1 censoring: different subjects may have different observation times t obs. Generalized type 1 censoring: different subjects may have different starting times t start. 15 / 25

NB: if t start not controlled by experimenter then more reasonable to consider it as a random variable T start in which case also C is random. Then we may have a case of competing risk/random censoring (see later slide). 16 / 25

Type 2 censoring Type 2 censoring: experiment started for n individuals at time t start and terminates when duration times observed for 0 < r < n individuals. Then C = X (r) t start. Progressive type 2 censoring: type 2 censoring applied with r = r 1. After r 1 duration times observed, n 1 r 1 individuals (including the r 1 observed) are removed from the n individuals. Then type 2 censoring applied to the remaining n n 1 individuals etc. 17 / 25

Competing risks/random censoring If another event happens prior to the event of interest, X is not observed. C is the duration time until the other event. E.g. X time to death of cirrhosis and C time to death of heart attack or C time to patient leaves the study due to migration. In practice this type of censoring is difficult unless C independent of X. We return to competing risks in the end of the course. NB: some authors use the term random censoring for the case where C and X are independent! Question: what about independence of X and C in case of type 1 and 2 censoring? 18 / 25

Likelihoods for duration data Suppose we have observations (t i, δ i ) which are realizations of (T i, i ) and i = 1[X i C i ] and the X i are continous random variables with density f Xi. We assume the observations are independent so it is sufficient to derive the likelihood for one observation, say (t, δ) realization of (T, ). NB: KM derivations on the lower half part of page 75 very sloppy! Their equation (3.5.5) is OK if RHS is read as pdf. Note if T continuous random variable then (T, ) has density g if P(T t, = δ) = t 0 g(u, δ)du. 19 / 25

Case C random and independent of X Assume C continuous random variable with density f C. P(T t, = 0) = P(C < t, X > C) = t 0 c t 0 f C (c)f X (x)dxdc = f C (c)s X (c)dc Thus g(t, 0) = f C (t)s X (t). By symmetry, g(t, 1) = f X (t)s C (t). Thus likelihood is f X (t) δ S X (t) 1 δ f C (t) 1 δ S C (t) δ = h X (t) δ S X (t)h C (t) 1 δ S C (t) Suppose we consider a parametric family f X ( ; θ) for X but f C ( ) is constant as a function of θ (non-informative censoring). Then likelihood is equivalent to h X (t; θ) δ S X (t; θ) 20 / 25

Case C is deterministic Suppose C is deterministic and equal to the fixed value c. Given δ = 0, T = c is deterministic. Given δ = 1, T is continuous. Distribution of T is non-standard: a mixture of a discrete and a continuous distribution. P(T = t δ = 0) = 1[c = t] and P(δ = 0) = P(X > c) = S x (c) Hence contribution to likelihood is S X (t) if (t, δ) = (c, 0). Further, for 0 t c P(T t δ = 1) = P(X t) P(X c) = F X (t) F X (c) and P(δ = 1) = F x (c) Hence P(T t, δ = 1) = F X (t) with density f X (t). Summing up, likelihood is again f X (t) δ S X (t) 1 δ = h X (t) δ S X (t) 21 / 25

Mixture of discrete and continuous distribution T has density g(t) = f X (t)1[t < c] + S X (c)1[t = c] wrt Lebesgue + point mass at c. This in the sense that P(T t) = min(t,c) 0 f X (u)du + S X (c)1[c t] 22 / 25

Likelihood for type 2 censored data Exercise! 23 / 25

Less restrictive censoring assumption: independent censoring Terminology confusing: independent censoring is not the same as random censoring with X and C independent (e.g. Fleming and Harrington page 26-27 or ABGK page 51). Informally we have independent censoring if for any time t the survival of an individual with T t is representative of the survival of all individuals with X t. In other words, the information that an individual is not censored at time t does not change the distribution of the remaining survival time. Formally P(X [t, t+dt[ X t, C t) = P(X [t, t+dt[ X t) = h X (t)dt This is enough for non-parametric estimation of survival function (Kaplan-Meier) and Cox s partial likelihood (later). 24 / 25

Independent censoring continued Counter example: suppose patients tend to leave study if their condition deteriorates - thus remaining patients with C t and X t tend to be more healthy than an arbitrary patient with X t. Random independent censoring trivially implies independent censoring. Type 2 censoring is also an example of independent censoring (exercise). 25 / 25