ST495 Survival Analysis: Maximum Likelihood. Eric B. Laber, Department of Statistics, North Carolina State University. February 11, 2014.
Everything is deception: seeking the minimum of illusion, keeping within the ordinary limitations, seeking the maximum. In the first case one cheats the Good, by trying to make it too easy for oneself to get it, and the Evil by imposing all too unfavorable conditions of warfare on it. In the second case one cheats the Good by keeping as aloof from it as possible, and the Evil by hoping to make it powerless through intensifying it to the utmost. Franz Kafka
Last time

Introduced parametric models commonly used in survival analysis; discussed their densities, hazards, survivor functions, and CDFs; and showed how to draw from these distributions using R:
- Exponential
- Weibull
- Gamma
- Extreme value
- Log-normal
- Log-logistic

We also discussed location-scale models and how to use the location-scale framework to incorporate covariate information, as well as flexible models for the hazard function, including piecewise-constant hazards and basis expansions.
Last time: burning questions

How do we choose a distribution? How do we estimate the parameters indexing a chosen distribution? How can we accommodate different types of censoring? How can we use R to do the foregoing estimation steps?
Warm-up

Explain to your stat buddy:
1. The hazard function
2. How a bathtub hazard might arise
3. How an increasing hazard might arise
4. How a decreasing hazard might arise

True or false:
(T/F) Minima of independent Weibull r.v.s are Weibull
(T/F) Measles increases fecundity in goats
(T/F) An exponential distribution would be a good model for mortality in humans

Who is generally credited with discovering maximum likelihood estimation?
Warm-up cont'd

Some concepts and notation for today:

Let a_1, ..., a_n be a sequence of constants; then ∏_{i=1}^n a_i = a_1 a_2 ⋯ a_n.

Let f(x) denote a function from R^p into R; then x* = arg max_x f(x) satisfies f(x*) ≥ f(x) for all x ∈ R^p.

We use 1{Statement} to denote the function that equals one if Statement is true and zero otherwise. Thus, 1{t ≥ 1} equals one if t ≥ 1 and zero otherwise.
Observation schemes

Why is survival analysis its own sub-field of statistics?
- Abundance of important applications
- Fundamental contributions to statistical theory (esp. in semi-parametrics)
- Dealing with partial information due to censoring

Recall that when we only observe partial information about a failure time we say that it's censored:
- T > C (right censored)
- T < L (left censored)
- V ≤ T < U (interval censored)
Likelihood

If the generative model is indexed by a parameter θ, then the likelihood is L(θ) ∝ P(Data; θ), which is viewed as a function of θ with the data held fixed. The maximum likelihood estimator is

θ̂_n = arg max_θ L(θ).

Warm-up: Let X_1, ..., X_n ~ N(µ, σ²), so θ = (µ, σ²); derive L(θ) and θ̂_n. Check your answer with your stat buddy.
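As a concrete version of the warm-up, here is a minimal sketch (the sample size, true parameter values, and variable names are illustrative choices, not from the slides): for iid normal data the closed-form MLEs are the sample mean and the divide-by-n sample variance, which we can confirm by numerically maximizing the log-likelihood.

```r
# Sketch: MLE for N(mu, sigma^2), closed form vs. numerical maximization
set.seed(1)
n <- 200
x <- rnorm(n, mean = 3, sd = 2)  # illustrative simulated data

# Closed-form MLEs: the sample mean and the divide-by-n sample variance
mu.hat <- mean(x)
sigma2.hat <- mean((x - mu.hat)^2)

# Negative log-likelihood, parameterized by (mu, log sigma^2) so the
# optimizer cannot wander into negative variances
negLogLik <- function(theta) {
  -sum(dnorm(x, mean = theta[1], sd = exp(theta[2] / 2), log = TRUE))
}
fit <- optim(c(0, 0), negLogLik)
mu.num <- fit$par[1]
sigma2.num <- exp(fit$par[2])
```

The numerical optimum should agree with the closed form up to optimizer tolerance.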
Likelihood cont'd

Ex. Let T_1, ..., T_n be an iid draw from a distribution with density f(t; θ) indexed by θ; then

L(θ) ∝ P(Data; θ) = ∏_{i=1}^n P(T_i = t_i; θ) = ∏_{i=1}^n f(t_i; θ).

Let T denote a generic observation distributed according to f(t; θ). How can we use L(θ) to estimate:
- The mean of T?
- The CDF of T?
- The hazard of T?
Nonparametric maximum likelihood

Ex. Let T_1, ..., T_n be an iid draw from a distribution with density f(t). Suppose now, however, we don't put any restrictions on f(t) (other than it being a density). In this case f is our parameter:

L(f) ∝ P(Data) = ∏_{i=1}^n f(t_i);

how can we maximize this over densities f?

Claim: our estimated f, say f̂, should only put positive mass on t_1, ..., t_n. (Why?) If f puts mass α_i on t_i, then maximizing the likelihood is equivalent to solving

max_{α_1, ..., α_n ≥ 0} ∏_{i=1}^n α_i subject to ∑_{i=1}^n α_i = 1.

Some painful calculus shows f̂ is the pmf with f̂(t_i) = 1/n, i = 1, ..., n.
Nonparametric maximum likelihood cont'd

Thus f̂(t) is a discrete distribution with f̂(t_i) = 1/n. The estimated CDF is given by

F̂(t) = (1/n) ∑_{i=1}^n 1{t_i ≤ t};

this is called the empirical (cumulative) distribution function (ECDF).
Computing the ECDF in R

n = 50; x = rnorm(n); FHat = ecdf(x); plot(FHat, xlab = "x")

[Figure: step-function plot of the ECDF Fn(x) for 50 standard normal draws, x ranging from about -3 to 3.]
Observation schemes
Truncated estimation

Suppose t_1, ..., t_n compose a random sample from subjects with lifetimes less than or equal to one year. The likelihood takes the form

∏_{i=1}^n f(t_i | T_i ≤ 1) = ∏_{i=1}^n f(t_i) / F(1).

Why? Suppose we posited a parametric model f(t; θ) for T; how can we estimate θ using such truncated data?

Deceptively difficult stat question: suppose T_1, ..., T_n are drawn independently from an exp(θ) distribution but are truncated at one. When does the maximum likelihood estimator exist?
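A minimal sketch of estimating θ from truncated data (the true rate, sample size, and search interval below are illustrative assumptions): draw from the truncated distribution by inverting its CDF, then numerically maximize the truncated log-likelihood ∑ log f(t_i; θ) − n log F(1; θ).

```r
# Sketch: exponential lifetimes truncated at one year
set.seed(2)
theta.true <- 3  # illustrative true rate
n <- 500

# Inverse-CDF draw from T | T <= 1 with T ~ exp(theta):
# F(t | T <= 1) = (1 - exp(-theta * t)) / (1 - exp(-theta))
u <- runif(n)
t <- -log(1 - u * (1 - exp(-theta.true))) / theta.true

# Truncated log-likelihood: sum log f(t_i; theta) - n * log F(1; theta)
logLik <- function(theta) {
  sum(dexp(t, rate = theta, log = TRUE)) - n * pexp(1, rate = theta, log.p = TRUE)
}
theta.hat <- optimize(logLik, interval = c(0.01, 50), maximum = TRUE)$maximum
```

With a large truncated sample the numerical maximizer should land near the true rate; the "deceptively difficult" existence question concerns what happens when the likelihood has no interior maximum over θ > 0.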
Right-censoring

Recall that a failure time is right censored at C if we only observe that T > C. Our goal in the next few slides is to derive the likelihood under different right-censoring mechanisms.

Notation: we observe {(t_i, δ_i)}_{i=1}^n, where t_i is the observation time and δ_i is the censoring indicator:

δ_i = 1 if the failure time is observed, and δ_i = 0 if right censored.

Big result of the day: under a variety of right-censoring mechanisms,

LH ∝ ∏_{i=1}^n f(t_i)^{δ_i} S(t_i+)^{1−δ_i}.
Type I censoring

In Type I censoring each individual has a fixed (non-random) censoring time C > 0:
- If T ≤ C then the failure time is observed
- If T > C then it is right-censored

Ex. Odense Malignant Melanoma Data: n = 205 subjects enrolled between 1962 and 1972 at the Odense Dept. of Plastic Surgery had tumors and surrounding tissue removed. Patients were followed until death or until the study concluded in 1977. Note: these data are contained in the boot package in R.
Type I censoring cont'd

Using the book's notation: define t_i = min(T_i, C_i) and δ_i = 1{T_i ≤ C_i}; then:
- If δ_i = 1, t_i is the failure time, so the information (t_i, δ_i) contributes f(t_i) to the likelihood.
- If δ_i = 0, t_i is the censoring time, so the information (t_i, δ_i) contributes S(t_i+) to the likelihood.

Thus, the likelihood is

∏_{i=1}^n f(t_i)^{δ_i} S(t_i+)^{1−δ_i}.
Type I censoring cont'd

In class: suppose T_1, ..., T_n are iid exp(θ) but subject to Type I censoring, and let δ_1, ..., δ_n denote the censoring indicators. Derive the MLE for θ.
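A simulation sketch for checking a derivation numerically (the true rate, sample size, and censoring time are illustrative assumptions): generate Type I censored exponential data, compute a candidate closed-form estimator, and compare it against direct numerical maximization of the censored likelihood ∏ f(t_i)^{δ_i} S(t_i)^{1−δ_i}.

```r
# Sketch: Type I censoring with T ~ exp(theta) and fixed censoring time C
set.seed(3)
theta.true <- 0.5  # illustrative true rate
n <- 1000
C <- 2                            # fixed (non-random) censoring time
T <- rexp(n, rate = theta.true)   # latent failure times
t <- pmin(T, C)                   # observation times
delta <- as.numeric(T <= C)       # 1 = failure observed, 0 = censored

# Candidate closed-form MLE: observed failures over total time at risk
theta.hat <- sum(delta) / sum(t)

# Numerical check against the censored log-likelihood
logLik <- function(theta) {
  sum(delta * dexp(t, rate = theta, log = TRUE) +
      (1 - delta) * pexp(t, rate = theta, lower.tail = FALSE, log.p = TRUE))
}
theta.num <- optimize(logLik, interval = c(0.01, 10), maximum = TRUE)$maximum
```

If the derivation is right, the closed form and the numerical maximizer should agree.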
Independent random censoring

Assume the lifetime T and the censoring time C are both random variables. This is often more realistic:
- Ex. Random study enrollment times
- Ex. Subjects moving out of town...

Let G(t) and g(t) denote the survivor and density functions for C, respectively; define t_i = min(T_i, C_i) and δ_i = 1{T_i ≤ C_i}. Then

f(t_i, δ_i) = [f(t_i) G(t_i+)]^{δ_i} [S(t_i+) g(t_i)]^{1−δ_i}.

Why?
Independent random censoring cont'd

The likelihood for n iid observations is

( ∏_{i=1}^n f(t_i)^{δ_i} S(t_i+)^{1−δ_i} ) ( ∏_{i=1}^n g(t_i)^{1−δ_i} G(t_i+)^{δ_i} );

note that if g(t) and G(t) do not contain information about f(t) or S(t), then the likelihood is proportional to

∏_{i=1}^n f(t_i)^{δ_i} S(t_i+)^{1−δ_i}.

It's the same likelihood as before!
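A sketch of why the censoring factors can be ignored (the rates and sample size are illustrative assumptions): with T ~ exp(θ) and C ~ exp(µ) independent, the full likelihood factorizes into a θ-part and a µ-part, so each parameter can be estimated from its own factor. Both estimators turn out to have the same "events over total time at risk" form.

```r
# Sketch: independent random censoring, C ~ exp(mu) independent of T ~ exp(theta)
set.seed(4)
theta.true <- 1    # illustrative failure rate
mu.true <- 0.5     # illustrative censoring rate
n <- 2000
T <- rexp(n, rate = theta.true)
C <- rexp(n, rate = mu.true)
t <- pmin(T, C)
delta <- as.numeric(T <= C)

# The g/G factors do not involve theta, so the theta-MLE comes from the
# familiar prod f(t_i)^delta_i S(t_i+)^(1 - delta_i) factor alone:
theta.hat <- sum(delta) / sum(t)      # failures over total time at risk

# Symmetrically, the censoring rate is estimable from the g/G factor:
mu.hat <- sum(1 - delta) / sum(t)     # censorings over total time at risk
```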
Type II censoring

Observe individuals until the r-th failure is observed, so that we observe the r smallest lifetimes t_(1) ≤ ⋯ ≤ t_(r):
- All n units start at the same time
- Follow-up stops at the time of the r-th failure
- Follow-up time is random

Using properties of order statistics, the likelihood is

( n! / (n − r)! ) ∏_{i=1}^r f(t_(i)) S(t_(r)+)^{n−r} ∝ ∏_{i=1}^n f(t_i)^{δ_i} S(t_i+)^{1−δ_i}.
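For exponential lifetimes, plugging into the Type II likelihood above and differentiating gives (one can check) λ̂ = r / (∑_{i=1}^r t_(i) + (n − r) t_(r)), the "total time on test" estimator. A simulation sketch (the rate, n, and r below are illustrative choices):

```r
# Sketch: Type II censoring, follow n units until the r-th failure
set.seed(5)
lambda.true <- 2   # illustrative true rate
n <- 1000
r <- 400
T <- sort(rexp(n, rate = lambda.true))

t.obs <- T[1:r]                      # the r smallest lifetimes
ttt <- sum(t.obs) + (n - r) * T[r]   # total time on test: observed lifetimes
                                     # plus (n - r) units censored at t_(r)
lambda.hat <- r / ttt
```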
Code break: go over mle.R in RStudio
For those who are adventurous
Counting process notation

Goal: show that the form of the likelihood for right-censoring applies in very general settings. For clarity we assume discrete time t = 0, 1, ....

Let h_i(t) and S_i(t) denote the hazard and survivor functions for the i-th subject, respectively. Further define

Y_i(t) ≜ 1{T_i ≥ t, i-th subj. not yet censored} = 1 if the i-th subject has not failed or been censored before t, and 0 otherwise;

if Y_i(t) = 1 then we say the i-th subject is at risk at time t.
Counting process notation cont'd

Define

dN_i(t) ≜ Y_i(t) 1{T_i = t} = 1 if the i-th subject is at risk and fails at t, and 0 otherwise;

dC_i(t) ≜ Y_i(t) 1{i-th subj. censored at t} = 1 if the i-th subject is at risk and censored at t, and 0 otherwise.

Claim: {dN_i(t), dC_i(t) : t ≥ 0} has a single 1 and the rest zeros.
Counting process notation cont'd

Even more definitions:

dN(t) ≜ (dN_1(t), ..., dN_n(t))
dC(t) ≜ (dC_1(t), ..., dC_n(t))
H(t) ≜ {(dN(s), dC(s)) : s = 0, 1, ..., t − 1};

we say H(t) is the history of the survival process up to time t.
Counting process notation: the likelihood

Note that lim_{t→∞} H(t) contains all the information in the collected data (why?); thus

P(Data) = P(dN(0)) P(dC(0) | dN(0)) × P(dN(1) | H(1)) P(dC(1) | dN(1), H(1)) × ⋯
        = ∏_{t=0}^∞ P(dN(t) | H(t)) P(dC(t) | dN(t), H(t)).
Counting process notation: the likelihood cont'd

To make this horrible expression tractable, we'll assume conditional independence across subjects given H(t) and

P(dN_i(t) = 1 | H(t)) = Y_i(t) h_i(t);

explain this expression to your stat buddy. We will also assume that the terms inside P(dC(t) | dN(t), H(t)) are not informative for the parameters in h_i(t). When the above assumptions hold we say the censoring is non-informative.
Counting process notation: the likelihood cont'd

Under the foregoing assumptions the likelihood is given by

∏_{i=1}^n ∏_{t=0}^∞ h_i(t)^{dN_i(t)} (1 − h_i(t))^{Y_i(t)(1 − dN_i(t))}.

To see this, we'll consider two cases.

Case 1: the i-th subject's failure time is observed at t_i; then they're at risk at t = 0, 1, ..., t_i and dN_i(t) = 1{t = t_i}, thus

∏_{t=0}^∞ h_i(t)^{dN_i(t)} (1 − h_i(t))^{Y_i(t)(1 − dN_i(t))} = h_i(t_i) ∏_{s=0}^{t_i − 1} (1 − h_i(s)) = f_i(t_i).

Case 2: the i-th subject is censored at time t_i; then they're at risk at times t = 0, 1, ..., t_i and dC_i(t) = 1{t = t_i}, thus

∏_{t=0}^∞ h_i(t)^{dN_i(t)} (1 − h_i(t))^{Y_i(t)(1 − dN_i(t))} = ∏_{t=0}^{t_i} (1 − h_i(t)) = S_i(t_i + 1).
Counting process notation: the likelihood cont'd

Putting it all together shows the likelihood is proportional to

∏_{i=1}^n f_i(t_i)^{δ_i} S_i(t_i + 1)^{1−δ_i}.

Limiting arguments show the likelihood is the same (with S_i(t_i + 1) replaced by S_i(t_i+)) in the continuous case. Note: we'll see later, using the framework of partial likelihood, that maximizing the above likelihood is appropriate in even more general settings.
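A tiny numeric sketch of the two cases above (the hazard value is an illustrative assumption): with a constant discrete-time hazard h, the lifetime is geometric, and the hazard-product factors h(t) ∏_{s<t} (1 − h(s)) and ∏_{s≤t} (1 − h(s)) reproduce the pmf f(t) and the survivor value S(t + 1) exactly.

```r
# Sketch: constant discrete-time hazard h => geometric lifetime on t = 0, 1, 2, ...
h <- 0.3       # illustrative constant hazard
tmax <- 10
t <- 0:tmax

# Case 1 factor: f(t) = h * prod_{s < t} (1 - h) = h * (1 - h)^t
f <- h * (1 - h)^t

# Case 2 factor: S(t + 1) = P(T > t) = prod_{s <= t} (1 - h) = (1 - h)^(t + 1)
S.next <- (1 - h)^(t + 1)
```

These should match R's built-in geometric pmf and upper-tail probabilities term by term.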
LH-based inference

For parametric models the likelihood provides an efficient framework for estimation and inference. Let θ ∈ Θ ⊆ R^p index the survival distribution of interest, and define

L(θ)  the likelihood
l(θ) = log L(θ)  the log-likelihood
u(θ) = (d/dθ) l(θ)  the score function
I(θ) = −(d²/dθ dθᵀ) l(θ)  the Fisher information
LH-based inference cont'd

Recall the maximum likelihood estimator

θ̂_n = arg max_{θ ∈ Θ} L(θ)

solves u(θ̂_n) = 0. Under mild regularity conditions,

√n (θ̂_n − θ*) ⇝ N(0, I^{−1}(θ*)),

where I(θ*) here denotes the Fisher information for a single observation.
LH-based inference example

Assume T_1, ..., T_n are iid exp(λ) and subject to noninformative censoring. Let δ_1, ..., δ_n denote the censoring indicators. Find the MLE for λ, the Fisher information, and a 95% confidence interval for λ.

The likelihood, L(λ), is

∏_{i=1}^n f(t_i; λ)^{δ_i} S(t_i; λ)^{1−δ_i} = ∏_{i=1}^n λ^{δ_i} exp{−λ t_i δ_i} exp{−λ t_i (1 − δ_i)} = λ^{∑ δ_i} exp{−λ ∑ t_i}.

The log-likelihood, l(λ), is

( ∑_{i=1}^n δ_i ) log λ − λ ∑_{i=1}^n t_i.
LH-based inference example cont'd

The score function u(λ) is given by

u(λ) = ( ∑_{i=1}^n δ_i ) / λ − ∑_{i=1}^n t_i;

setting this to zero and solving yields

λ̂_n = ∑_{i=1}^n δ_i / ∑_{i=1}^n t_i.
LH-based inference example cont'd

We take the negative derivative of u(λ) to get

I(λ) = ∑_{i=1}^n δ_i / λ².

How do we get a 95% confidence interval for λ?
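One standard answer, sketched in R (the simulation settings are illustrative assumptions): a Wald interval λ̂ ± 1.96 · se(λ̂), where se(λ̂) = I(λ̂)^{−1/2} = λ̂ / √(∑ δ_i) from the Fisher information above.

```r
# Sketch: MLE and Wald 95% CI for the censored-exponential rate
set.seed(6)
n <- 500
T <- rexp(n, rate = 1)            # illustrative failure times, true lambda = 1
C <- rexp(n, rate = 0.5)          # illustrative independent censoring times
t <- pmin(T, C)
delta <- as.numeric(T <= C)

lambda.hat <- sum(delta) / sum(t)           # MLE: failures / total time at risk
se.hat <- lambda.hat / sqrt(sum(delta))     # from I(lambda.hat) = sum(delta) / lambda.hat^2
ci <- lambda.hat + c(-1, 1) * qnorm(0.975) * se.hat
```

An alternative with better small-sample behavior is to build the interval on the log(λ) scale and exponentiate the endpoints.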