ST745: Survival Analysis: Nonparametric methods

1 ST745: Survival Analysis: Nonparametric methods Eric B. Laber Department of Statistics, North Carolina State University February 5, 2015

2 "The KM estimator is used ubiquitously in medical studies to estimate and depict the fraction of patients living for a certain amount of time after treatment. It has since been applied to data from clinical trials of therapies for every disease from cancer to cardiology to concussion." (Science Life)
"Paul Meier's work and the KM analysis have been responsible for saving millions of lives." (Significance)

3 Then and now
Last time we discussed
- max-lh with censoring
- Right-censoring schemes
- Left-truncation
- Interval censored data
- Current status data
- Estimating parametric models in R
- Large sample theory and inference
Today we'll discuss
- Kaplan-Meier estimator and inference
- Nelson-Aalen estimator and inference
- Using R for nonpar estimation

4 Warm-up
Explain to your stat buddy
1. What's the difference between left-censoring and left-truncation?
2. Give two examples of nonparametric estimators
3. Pros and cons of nonparametric methods relative to parametric methods
4. What is a confidence interval?
True or false:
(T/F) Paul Meier is still alive
(T/F) The bootstrap is an asymptotic approximation
(T/F) The integral symbol was invented by Gottfried Wilhelm Leibniz III

5 Things to recall
For a discrete distribution with failure times $t_1, t_2, \ldots$
$$S(t) = \prod_{j:\, t_j < t} \left[1 - h(t_j)\right],$$
where $h(t_j) = P(T = t_j \mid T \ge t_j)$
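As a quick numerical check of the product formula, the sketch below uses a made-up four-point distribution (the support `tj` and the probabilities `p` are illustrative assumptions, not from the lecture) and confirms that the product of $1 - h(t_j)$ over earlier failure times reproduces $P(T \ge t)$.

```r
# Hypothetical discrete distribution on t_1 < ... < t_4 (illustration only).
tj <- 1:4
p  <- c(0.1, 0.2, 0.3, 0.4)            # P(T = t_j)

surv_ge <- rev(cumsum(rev(p)))          # P(T >= t_j)
h       <- p / surv_ge                  # h(t_j) = P(T = t_j | T >= t_j)

# S(t_j) = product over t_i < t_j of [1 - h(t_i)]; the empty product at t_1 is 1.
S_prod <- c(1, cumprod(1 - h))[seq_along(tj)]

print(cbind(tj, h, surv_ge, S_prod))
```

The last two columns should agree, which is the identity on this slide evaluated at each support point.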

6 Family feud! I surveyed statisticians in SAS hall for the five most important steps in an applied statistical analysis. What are they?

7 Complications due to censoring
Consider making a simple visual display of lifetime data subject to right-censoring
- Why is this important?
- Consider making a histogram, what goes wrong?
- What about plotting the empirical CDF?
Today we'll see how to make these plots (and more!)

8 Product limit estimator: warm-up
Let $T_1, \ldots, T_n$ denote an iid sample (no censoring)
Empirical CDF
$$\hat F(t) = \frac{1}{n} \sum_{i=1}^n 1_{T_i \le t}$$
Empirical survival function (ESF)
$$\hat S(t) = \frac{1}{n} \sum_{i=1}^n 1_{T_i \ge t}$$
Does $1 - \hat F(t) = \hat S(t)$ everywhere?

9 Ex. ECDF and ESF
[Figure: two panels, $\hat F(x)$ against $x$ and $\hat S(x)$ against $x$]
How big are the steps above?

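10 Ex. ECDF and ESF cont'd
    n <- 100
    x <- rchisq(n, df = 4)   # simulated uncensored sample
    # two panels side by side
    par(mar = c(5, 5, 4, 1) + 0.1, mfrow = c(1, 2))
    # ECDF: jumps of size 1/n at each order statistic
    plot(stepfun(sort(x), c(0, (1:n) / n)), xlab = "x",
         ylab = expression(hat(F)(x)), main = "", lwd = 3)
    # ESF: starts at 1 and drops by 1/n at each order statistic
    plot(stepfun(sort(x), c(1, 1 - (1:n) / n)), xlab = "x",
         ylab = expression(hat(S)(x)), main = "", lwd = 3)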

11 Ex. ECDF and ESF cont'd
If $t_1 < t_2 < \cdots < t_k$ are distinct failure times
$$\hat S(t) = \frac{1}{n} \sum_{j=1}^k d_j\, 1_{t_j \ge t},$$
where $d_j$ is the number of observations equal to $t_j$
Why?
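The two ways of writing the ESF can be compared numerically. This is a minimal sketch on a made-up sample with one tie (the data `x` and the evaluation grid are assumptions for illustration), using the convention $\hat S(t) = n^{-1} \sum_i 1_{T_i \ge t}$.

```r
# Hypothetical uncensored sample with a tie at 1 (illustration only).
x <- c(3, 1, 4, 1, 5)
n <- length(x)

ecdf_hat <- function(t) mean(x <= t)    # F-hat(t)
esf_hat  <- function(t) mean(x >= t)    # S-hat(t), with S(t) = P(T >= t)

# Same ESF written with distinct times t_j and multiplicities d_j.
tj <- sort(unique(x))
dj <- as.vector(table(x))
esf_from_counts <- function(t) sum(dj[tj >= t]) / n

grid <- c(0.5, 1, 2, 3, 4.5, 5, 6)
print(data.frame(t           = grid,
                 one_minus_F = 1 - sapply(grid, ecdf_hat),
                 S_hat       = sapply(grid, esf_hat),
                 S_counts    = sapply(grid, esf_from_counts)))
```

In the printed table the `S_hat` and `S_counts` columns coincide, while `one_minus_F` differs from them only at observed times, where the gap is $d_j / n$.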

12 ECDF and ESF under censoring
When there is censoring
- Number of points in an interval $[a, b]$ is unknown
- Cannot compute ESF or ECDF
Kaplan-Meier (KM) estimator (aka product limit estimator) is an analog of the ESF for right-censored data
The original KM paper is the most highly cited statistics paper to date. What is the second most highly cited?

13 KM estimator
Let $\{(t_i, \delta_i)\}_{i=1}^n$ denote obs. data with distinct failure times $t_1 < t_2 < \cdots < t_k$ (these DO NOT include censoring times)
Define
- $d_j \triangleq \sum_{i=1}^n 1_{t_i = t_j,\, \delta_i = 1}$ to be the number of failures at $t_j$
- $n_j \triangleq \sum_{i=1}^n 1_{t_i \ge t_j}$ to be the number at risk at $t_j$
The KM estimator of $S(t)$ is
$$\hat S(t) = \prod_{j:\, t_j < t} \left(\frac{n_j - d_j}{n_j}\right)$$
Explain $\hat S(t)$ intuitively to your stat buddy

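14 Why does KM make sense?
Given $\{(t_i, \delta_i)\}_{i=1}^n$ how can we estimate $h(t_j)$? (Assume discrete for now)
$$h(t_j) = P(T = t_j \mid T \ge t_j) \approx \frac{\#\text{fail at } t_j}{\#\text{at risk at } t_j} = \frac{d_j}{n_j}$$
Apply $S(t) = \prod_{j:\, t_j < t} [1 - h(t_j)]$ to obtain
$$S(t) \approx \prod_{j:\, t_j < t} \left(1 - \frac{d_j}{n_j}\right) = \hat S(t)$$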

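15 Ex. compute the KM estimator
[Table: observed data columns $t$, $\delta$; working columns $t_j$, $n_j$, $d_j$, $(n_j - d_j)/n_j$, $\hat S(t_j+)$; entries not preserved]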
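A hand computation of this kind can be mirrored in a few lines of base R. The sketch below is an assumption-laden illustration: the six observations are made up (they are not the slide's example), and `km_table` is a hypothetical helper name. It builds the working columns $t_j$, $n_j$, $d_j$, $(n_j - d_j)/n_j$ and $\hat S(t_j+)$.

```r
# Hypothetical helper: KM working table from times and failure indicators.
km_table <- function(time, status) {
  tj <- sort(unique(time[status == 1]))                       # distinct failure times
  nj <- vapply(tj, function(u) sum(time >= u), numeric(1))    # number at risk at t_j
  dj <- vapply(tj, function(u) sum(time == u & status == 1), numeric(1))
  data.frame(tj = tj, nj = nj, dj = dj,
             factor = (nj - dj) / nj,
             surv   = cumprod((nj - dj) / nj))                # S-hat just after t_j
}

# Made-up right-censored data: status = 1 is a failure, 0 is censored.
time   <- c(1, 2, 2, 3, 4, 5)
status <- c(1, 1, 0, 1, 0, 1)
km <- km_table(time, status)
print(km)
```

Because the largest observation here is a failure, the last entry of `surv` is zero, in line with the product formula.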

16 Code break I: Computing KM in R See file firstkm.r

17 Sanity check Claim: The KM estimator reduces to the ESF when there is no censoring. Why? Answer on board.
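The claim can also be checked numerically before working through the argument. This is a sketch on made-up uncensored data (the sample `x` and the grid are assumptions), with a hypothetical helper `km_at` that evaluates $\hat S(t) = \prod_{j: t_j < t} (n_j - d_j)/n_j$ directly.

```r
# Hypothetical helper: KM estimate at a single time point t.
km_at <- function(t, time, status) {
  tj <- sort(unique(time[status == 1]))
  tj <- tj[tj < t]                                  # failure times strictly before t
  fac <- vapply(tj, function(u) {
    nj <- sum(time >= u)
    dj <- sum(time == u & status == 1)
    (nj - dj) / nj
  }, numeric(1))
  prod(fac)                                         # empty product is 1
}

# Made-up sample with no censoring (all status = 1), including a tie.
x      <- c(3, 1, 4, 1, 5, 9, 2, 6)
status <- rep(1, length(x))

grid <- seq(0, 10, by = 0.5)
km   <- sapply(grid, km_at, time = x, status = status)
esf  <- sapply(grid, function(t) mean(x >= t))      # empirical survival function

print(max(abs(km - esf)))
```

The printed maximum difference should be zero up to floating point error, since with no censoring the product telescopes to the fraction of observations at or beyond $t$.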

18 Code break II: Example from Lawless See file ex321.r

19 Variance estimation
A consistent estimator of the variance of $\hat S(t)$ is given by Greenwood's formula:
$$\hat\sigma_S^2(t) = \hat S^2(t) \sum_{j:\, t_j < t} \frac{d_j}{n_j (n_j - d_j)}$$
When there is no censoring, this reduces to $\hat S(t)(1 - \hat S(t))/n$. Why is this the right quantity?
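Greenwood's formula is straightforward to evaluate alongside the KM table. The sketch below uses made-up data and a hypothetical helper `greenwood`; it also checks the no-censoring reduction to $\hat S(1 - \hat S)/n$ on a small uncensored sample, leaving out the final failure time where $n_j = d_j$ and the summand is undefined.

```r
# Hypothetical helper: KM estimate and Greenwood variance just after each t_j.
greenwood <- function(time, status) {
  tj <- sort(unique(time[status == 1]))
  nj <- vapply(tj, function(u) sum(time >= u), numeric(1))
  dj <- vapply(tj, function(u) sum(time == u & status == 1), numeric(1))
  surv <- cumprod((nj - dj) / nj)
  # The summand d_j / (n_j (n_j - d_j)) is undefined when n_j = d_j.
  var <- surv^2 * cumsum(dj / (nj * (nj - dj)))
  data.frame(tj, nj, dj, surv, var)
}

# Made-up right-censored data: status = 1 is a failure, 0 is censored.
gw <- greenwood(time = c(1, 2, 2, 3, 4, 5), status = c(1, 1, 0, 1, 0, 0))
print(gw)

# No censoring: compare with S(1 - S)/n, dropping the last row (n_j = d_j).
x   <- c(1, 2, 3, 4, 5)
unc <- head(greenwood(x, rep(1, length(x))), -1)
print(cbind(greenwood = unc$var,
            binomial  = unc$surv * (1 - unc$surv) / length(x)))
```

The two columns in the second printout should match, which is the reduction stated on the slide.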

20 KM as nonparametric MLE
Recall our counting process notation
- $Y_i(t) = 1_{T_i \ge t,\ i\text{th subj not cens at } t}$
- $dN_i(t) = Y_i(t)\, 1_{T_i = t}$
- $dC_i(t) = Y_i(t)\, 1_{i\text{th subj cens at } t}$
We'll assume a discrete distribution with potential failure times $t = 0, 1, \ldots$
With your stat buddy prove $\sum_{i=1}^n dN_i(t) = \sum_{i=1}^n Y_i(t)\, dN_i(t)$

21 KM as nonparametric MLE cont'd
Recall from our work on non-informative censoring that
$$L \propto \prod_{i=1}^n \prod_{t=0}^{\infty} h(t)^{dN_i(t)} \left[1 - h(t)\right]^{Y_i(t)(1 - dN_i(t))}$$
Note: We saw this en route to simplifying to an expression involving $f(t)$ and $S(t)$; for our purposes it will be convenient to use the above form.

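22 KM as nonparametric MLE cont'd
The LH simplifies to
$$L \propto \prod_{t=0}^{\infty} h(t)^{d_t} \left[1 - h(t)\right]^{n_t - d_t},$$
where $d_t \triangleq \sum_{i=1}^n dN_i(t)$, $n_t \triangleq \sum_{i=1}^n Y_i(t)$
Why? Interchange products to obtain
$$\prod_{t=0}^{\infty} \prod_{i=1}^n h(t)^{dN_i(t)} \left[1 - h(t)\right]^{Y_i(t)(1 - dN_i(t))} = \prod_{t=0}^{\infty} h(t)^{\sum_{i=1}^n dN_i(t)} \left[1 - h(t)\right]^{\sum_{i=1}^n Y_i(t)(1 - dN_i(t))},$$
and use $\sum_{i=1}^n Y_i(t)\, dN_i(t) = \sum_{i=1}^n dN_i(t) = d_t$, so the second exponent is $n_t - d_t$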

23 KM as nonparametric MLE cont'd
To obtain the nonparametric MLE we view $(h(0), h(1), \ldots)$ as our parameter and maximize $L$
If $n_t = 0$ then there is no information about $h(t)$; let $\tau$ denote the largest $t$ s.t. $n_t > 0$, then
$$L \propto \prod_{t=0}^{\tau} h(t)^{d_t} \left[1 - h(t)\right]^{n_t - d_t},$$
and the log-lh is
$$\ell = \sum_{t=0}^{\tau} \left\{ d_t \log h(t) + (n_t - d_t) \log\left(1 - h(t)\right) \right\}$$

24 KM as nonparametric MLE cont'd
Differentiate $\ell$ wrt $h(t)$ to obtain
$$\frac{\partial \ell}{\partial h(t)} = \frac{d_t}{h(t)} - \frac{n_t - d_t}{1 - h(t)},$$
set this to zero and solve for $h(t)$ to obtain $\hat h(t) = d_t / n_t$
Then
$$\hat S(t) = \prod_{j:\, t_j < t} \left[1 - \hat h(t_j)\right] = \prod_{j:\, t_j < t} \left[1 - \frac{d_j}{n_j}\right]$$
is the MLE for $S(t)$ by the invariance property of the MLE
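The closed form $\hat h(t) = d_t / n_t$ can be confirmed by maximizing one term of the log-likelihood numerically. This is a small sketch with made-up counts (`n_t` and `d_t` are assumptions); because the log-likelihood is a sum of separate terms in each $h(t)$, maximizing a single term is enough.

```r
# Hypothetical counts at a single time t: n_t at risk, d_t failures.
n_t <- 10
d_t <- 3

# Contribution of time t to the log-likelihood, as a function of h = h(t).
loglik_t <- function(h) d_t * log(h) + (n_t - d_t) * log(1 - h)

opt <- optimize(loglik_t, interval = c(1e-6, 1 - 1e-6),
                maximum = TRUE, tol = 1e-10)

print(c(numerical = opt$maximum, closed_form = d_t / n_t))
```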

25 KM as nonpar MLE, enough already!
Some things to note
1. If the last obs time $\tau$ is a failure then $\hat S(t) = 0$ for all $t > \tau$
2. If the last obs time $\tau$ is a censoring time then $\hat S(t)$ is not defined for $t > \tau$
3. MLE formulation is powerful since large sample theory can be used to study efficiency and conduct statistical inference

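26 Fact from your past
Let $g$ be a smooth function from $\mathbb{R}$ into $\mathbb{R}$, then
$$g(\hat\theta_n) \approx g(\theta) + g'(\theta)(\hat\theta_n - \theta)$$
so that
$$\mathrm{Var}\, g(\hat\theta_n) \approx g'(\theta)^2\, \mathrm{Var}\, \hat\theta_n,$$
thus we can approximate the variance of $\hat\theta_n$ via
$$\mathrm{Var}\, \hat\theta_n \approx \frac{1}{g'(\theta)^2}\, \mathrm{Var}\, g(\hat\theta_n)$$
Ex. Let $g(u) = \log u$ to obtain
$$\mathrm{Var}\, \hat S(t) \approx S^2(t)\, \mathrm{Var} \log \hat S(t)$$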

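27 Computing Greenwood's formula
If we can approximate the variance of $\log \hat S(t)$ then we can use the preceding expansion to approximate $\mathrm{Var}\, \hat S(t)$
Recall the score function (derivative of log-lh) is
$$u(h(t)) = \frac{d_t}{h(t)} - \frac{n_t - d_t}{1 - h(t)},$$
so that
$$-u'(h(t)) = \frac{d_t}{h^2(t)} + \frac{n_t - d_t}{(1 - h(t))^2} = n_t \left[\frac{1}{h(t)} + \frac{1}{1 - h(t)}\right] = \frac{n_t}{h(t)(1 - h(t))},$$
where the second equality substitutes $d_t = n_t h(t)$, which holds at $h(t) = \hat h(t)$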

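28 Computing Greenwood's formula cont'd
Observed Fisher info is a diagonal matrix with entries
$$I_t = \frac{n_t}{h(t)(1 - h(t))}$$
Thus $(\hat h(0), \hat h(1), \ldots, \hat h(\tau))$ are asymptotically independent s.t.
$$\mathrm{Var} \log \hat S(t) = \mathrm{Var} \log \prod_{j:\, t_j < t} \left[1 - \hat h(t_j)\right] = \mathrm{Var} \sum_{j:\, t_j < t} \log\left[1 - \hat h(t_j)\right] \approx \sum_{j:\, t_j < t} \mathrm{Var} \log\left[1 - \hat h(t_j)\right]$$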

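29 Computing Greenwood's formula cont'd
We can estimate $\mathrm{Var} \log[1 - \hat h(t_j)]$ using our approx
$$\mathrm{Var} \log\left[1 - \hat h(t_j)\right] \approx \frac{\mathrm{Var}\, \hat h(t_j)}{(1 - \hat h(t_j))^2} \approx \frac{1}{I_{t_j}} \cdot \frac{1}{(1 - \hat h(t_j))^2} = \frac{\hat h(t_j)(1 - \hat h(t_j))}{n_{t_j} (1 - \hat h(t_j))^2} = \frac{\hat h(t_j)}{n_{t_j}(1 - \hat h(t_j))}$$
Putting it all together
$$\mathrm{Var}(\hat S(t)) \approx \hat S^2(t) \sum_{j:\, t_j < t} \frac{\hat h(t_j)}{n_{t_j}(1 - \hat h(t_j))} = \hat S^2(t) \sum_{j:\, t_j < t} \frac{d_j}{n_j (n_j - d_j)},$$
where we have used $n_{t_j} = n_j$ and $\hat h(t_j) = d_j / n_j$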

30 Computing Greenwood's formula epilogue
We glossed over some slippery technical details; for a rigorous treatment see advanced survival texts (e.g., Fleming and Harrington, 2005). For a treatment of infinite dimensional parameter spaces see Butch's semi-parametrics course.

31 Nelson-Aalen estimator
One could obtain an estimator of the cumulative hazard via $-\log \hat S(t)$ (why?) but the following estimator is typically preferred
$$\hat H(t) \triangleq \sum_{j:\, t_j \le t} \frac{d_j}{n_j},$$
this is called the Nelson-Aalen (pronounced OH-len) estimator

32 Ex. compute the NA estimator
[Table: observed data columns $t$, $\delta$; working columns $t_j$, $n_j$, $d_j$, $d_j/n_j$, $\hat H(t_j)$; entries not preserved]
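The Nelson-Aalen working table can be built the same way as the KM one. In this sketch the data are made up and `nelson_aalen` is a hypothetical helper name; the final column compares $\hat H(t_j)$ with $-\log \hat S(t_j+)$ from KM.

```r
# Hypothetical helper: Nelson-Aalen working table.
nelson_aalen <- function(time, status) {
  tj <- sort(unique(time[status == 1]))                       # distinct failure times
  nj <- vapply(tj, function(u) sum(time >= u), numeric(1))    # number at risk
  dj <- vapply(tj, function(u) sum(time == u & status == 1), numeric(1))
  data.frame(tj = tj, nj = nj, dj = dj,
             haz = dj / nj,
             H   = cumsum(dj / nj),                           # H-hat(t_j)
             neg_log_km = -log(cumprod((nj - dj) / nj)))      # -log S-hat(t_j+)
}

# Made-up right-censored data: status = 1 is a failure, 0 is censored.
na <- nelson_aalen(time = c(1, 2, 2, 3, 4, 5), status = c(1, 1, 0, 1, 0, 1))
print(na)
```

The two cumulative hazard columns are close while $d_j / n_j$ is small and drift apart as the risk set shrinks; at the last row the KM estimate is zero, so `neg_log_km` is infinite while `H` stays finite.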

33 Code break III: Computing NA in R See file firstna.r

34 Plotting the NA estimator
Plot of $\hat H(t)$ is informative for the shape of the hazard fn
- $H(t)$ linear implies constant hazard
- $H(t)$ convex implies increasing hazard (concave implies decreasing)
- Slope of $H(t)$ approximates $h(t)$

35 Match the NA estimator with the true hazard
[Figure: three panels of $\hat H(t)$ against time and three panels of $h(t)$ against time]

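36 Variance estimation
NA estimator is an MLE just like KM
Variance estimator for $\hat H(t)$ is
$$\hat\sigma_H^2(t) = \sum_{j:\, t_j \le t} \frac{d_j (n_j - d_j)}{n_j^3},$$
which can be derived using large-sample approximations: sum $\mathrm{Var}\, \hat h(t_j) \approx \hat h(t_j)(1 - \hat h(t_j))/n_j$ over the asymptotically independent $\hat h(t_j)$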
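A sketch of the variance computation, under the assumption that each $\hat h(t_j)$ is given the binomial-type variance $\hat h(t_j)(1 - \hat h(t_j))/n_j$ from the Fisher information derived earlier, so the summand is $d_j(n_j - d_j)/n_j^3$. The data and the helper name `na_variance` are hypothetical. For comparison the code also reports $\sum d_j / n_j^2$, another commonly used variance estimator for Nelson-Aalen that is not taken from these slides.

```r
# Hypothetical helper: Nelson-Aalen estimate with two variance estimators.
na_variance <- function(time, status) {
  tj <- sort(unique(time[status == 1]))
  nj <- vapply(tj, function(u) sum(time >= u), numeric(1))
  dj <- vapply(tj, function(u) sum(time == u & status == 1), numeric(1))
  data.frame(tj = tj,
             H         = cumsum(dj / nj),
             var_binom = cumsum(dj * (nj - dj) / nj^3),   # assumed binomial-type form
             var_aalen = cumsum(dj / nj^2))               # common alternative
}

# Made-up right-censored data: status = 1 is a failure, 0 is censored.
v <- na_variance(time = c(1, 2, 2, 3, 4, 5), status = c(1, 1, 0, 1, 0, 1))
print(v)
```

The binomial-type version is never larger than the alternative, since it multiplies each summand by $(n_j - d_j)/n_j \le 1$.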

37 Code break IV: NA on Example from Lawless See file ex321na.r

38 Confidence interval for S(t)
Fact: For any fixed $t > 0$
$$\frac{\hat S(t) - S(t)}{\hat\sigma_S(t)} \rightsquigarrow N(0, 1)$$
Stronger convergence results (simultaneous over all $t$) exist
$(1 - \alpha) \times 100\%$ CI based on Greenwood's formula
$$\hat S(t) \pm z_{1 - \alpha/2}\, \hat\sigma_S(t)$$

39 Alternative confidence intervals
Greenwood's formula is intuitive but has drawbacks
- CI generally does not perform well in small samples
- Can generate a CI with endpoints outside of (0, 1)
Recall our general strategy for modeling probabilities
1. Transform to take values in $\mathbb{R}$
2. Conduct estimation/inference on transformed scale
3. Transform back to (0, 1)

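40 Transformed confidence interval
Let $g(s)$ be a decreasing cts function from (0, 1) onto $\mathbb{R}$, construct a CI for $g(S(t))$ then transform back via Taylor approx
Define $\hat\psi(t) \triangleq g(\hat S(t))$ then
$$\hat\sigma_\psi^2(t) \triangleq \left[ g'\{\hat S(t)\} \right]^2 \hat\sigma_S^2(t)$$
Taylor series arguments show
$$P\left( -z_{1-\alpha/2} \le \frac{\hat\psi(t) - \psi(t)}{\hat\sigma_\psi(t)} \le z_{1-\alpha/2} \right) \approx 1 - \alpha$$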

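41 Transformed confidence interval cont'd
Rearrange terms to obtain
$$P\left( \hat\psi(t) - z_{1-\alpha/2}\, \hat\sigma_\psi(t) \le \psi(t) \le \hat\psi(t) + z_{1-\alpha/2}\, \hat\sigma_\psi(t) \right) \approx 1 - \alpha$$
Solve for $S(t)$ using $\psi(t) = g(S(t))$
$$P\left( g^{-1}\{\hat\psi(t) + z_{1-\alpha/2}\, \hat\sigma_\psi(t)\} \le S(t) \le g^{-1}\{\hat\psi(t) - z_{1-\alpha/2}\, \hat\sigma_\psi(t)\} \right) \approx 1 - \alpha$$
Note the arguments within $g^{-1}$ have flipped
Question: How do we know $g^{-1}$ exists and is decreasing?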

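42 Transformed confidence interval cont'd
If $g(s) = \log(-\log(s))$ the CI is
$$\left[ \exp\{-\exp(\hat\psi(t) + z_{1-\alpha/2}\, \hat\sigma_\psi)\},\ \exp\{-\exp(\hat\psi(t) - z_{1-\alpha/2}\, \hat\sigma_\psi)\} \right]$$
Variance is
$$\hat\sigma_\psi^2(t) = \frac{\hat\sigma_S^2(t)}{\left[\hat S(t) \log \hat S(t)\right]^2}$$
Another common choice is $g(s) = \log(s)$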
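The plain Greenwood interval and the log-log interval can be compared side by side. This sketch takes a point estimate and a standard error as given (the values `S_hat = 0.95` and `se_hat = 0.04` are made up, and `plain_ci` and `loglog_ci` are hypothetical helper names).

```r
# Plain interval: S-hat +/- z * se.
plain_ci <- function(S, se, alpha = 0.05) {
  z <- qnorm(1 - alpha / 2)
  c(lower = S - z * se, upper = S + z * se)
}

# Log-log interval: work on psi = log(-log S), then map back with exp(-exp(.)).
loglog_ci <- function(S, se, alpha = 0.05) {
  z      <- qnorm(1 - alpha / 2)
  psi    <- log(-log(S))
  se_psi <- se / abs(S * log(S))        # |g'(S)| * se with g(s) = log(-log s)
  c(lower = exp(-exp(psi + z * se_psi)),
    upper = exp(-exp(psi - z * se_psi)))
}

# Made-up estimate near the boundary, where the plain interval misbehaves.
S_hat  <- 0.95
se_hat <- 0.04
plain  <- plain_ci(S_hat, se_hat)
loglog <- loglog_ci(S_hat, se_hat)
print(rbind(plain, loglog))
```

With these inputs the plain upper endpoint exceeds 1, while the log-log interval stays inside (0, 1) and is asymmetric about the estimate.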

43 Bootstrap: AKA the boostarp
Eric draws a brilliant depiction of the bootstrap on the board
Applause subsides
A quiet moment of reflection reveals a new appreciation for the beauty of statistics in each of us

44 The boostarp cont'd
Let $D = \{(T_i, \delta_i)\}_{i=1}^n$ denote the observed data and $P_n$ the empirical distribution
A (nonparametric) bootstrap sample is a sample of size $n$, say $D^{(b)}$, drawn uniformly (with replacement) from $D$
- $D^{(b)}$ is an i.i.d. draw of size $n$ from $P_n$
- Other resample sizes are possible
Standard percentile bootstrap CI for $S(t)$
1. Draw $B$ nonparametric bootstrap samples, $D^{(1)}, \ldots, D^{(B)}$
2. Compute $\hat S^{(b)}(t)$, KM on $D^{(b)}$, $b = 1, \ldots, B$
3. Let $\hat l_{\alpha/2}$ and $\hat u_{1-\alpha/2}$ be the $(\alpha/2) \times 100$ and $(1 - \alpha/2) \times 100$ percentiles of $\hat S^{(1)}(t), \ldots, \hat S^{(B)}(t)$
4. Final $(1 - \alpha) \times 100\%$ CI is $[\hat l_{\alpha/2}, \hat u_{1-\alpha/2}]$
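The four steps above can be sketched in base R as follows. Everything specific here is an assumption for illustration: the simulated exponential failure and censoring times, the seed, the evaluation time `t0`, the number of resamples `B`, and the helper name `km_at`.

```r
set.seed(745)

# Hypothetical helper: KM estimate at a single time point t.
km_at <- function(t, time, status) {
  tj <- sort(unique(time[status == 1]))
  tj <- tj[tj < t]
  fac <- vapply(tj, function(u) {
    nj <- sum(time >= u)
    (nj - sum(time == u & status == 1)) / nj
  }, numeric(1))
  prod(fac)
}

# Simulated right-censored data (illustration only).
n      <- 60
T_true <- rexp(n, rate = 0.5)
C      <- rexp(n, rate = 0.25)
time   <- pmin(T_true, C)
status <- as.numeric(T_true <= C)

t0    <- 1
B     <- 500
alpha <- 0.05

# Steps 1 and 2: resample (T_i, delta_i) pairs with replacement, recompute KM at t0.
boot <- replicate(B, {
  idx <- sample.int(n, n, replace = TRUE)
  km_at(t0, time[idx], status[idx])
})

# Steps 3 and 4: percentile interval.
ci <- quantile(boot, c(alpha / 2, 1 - alpha / 2))
print(c(estimate = km_at(t0, time, status), ci))
```

Resampling whole pairs keeps each failure indicator attached to its observed time, which is what drawing from $P_n$ means for censored data.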

45 Simulated experiment: coverage probabilities
T ~ log-normal( 1, 2), C ~ exp(1.75)
Sample size of n = 200 and 10K MC replications
Compare coverage of Greenwood's formula with log-log transform
[Table: coverage at each $t$ for the Greenwood and log-log intervals; entries not preserved]
See coverageexample.r

46 Confidence intervals for quantiles
In some settings a quantile is of interest
- E.g., the median
- Quantiles are often easier to estimate than moments
Recall $t_p$ is the $p$th quantile of $T$
$$t_p = \inf\{t : 1 - S(t) \ge p\}$$
Given an estimator $\hat S(t)$ of $S(t)$ we obtain
$$\hat t_p = \inf\{t : 1 - \hat S(t) \ge p\}$$
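For a KM step function the infimum is attained at a failure time. Under the convention $\hat S(t) = \prod_{j: t_j < t}(n_j - d_j)/n_j$, the plug-in quantile is the first $t_j$ at which the estimate just after $t_j$ drops to $1 - p$ or below. The sketch uses made-up data and a hypothetical helper `km_quantile`, and returns `NA` when the KM curve never gets that low.

```r
# Hypothetical helper: plug-in p-th quantile from the KM estimator.
km_quantile <- function(p, time, status) {
  tj <- sort(unique(time[status == 1]))
  nj <- vapply(tj, function(u) sum(time >= u), numeric(1))
  dj <- vapply(tj, function(u) sum(time == u & status == 1), numeric(1))
  surv <- cumprod((nj - dj) / nj)          # S-hat just after each t_j
  hit  <- which(surv <= 1 - p)
  if (length(hit) == 0) NA_real_ else tj[hit[1]]
}

# Made-up right-censored data: status = 1 is a failure, 0 is censored.
time   <- c(1, 2, 2, 3, 4, 5)
status <- c(1, 1, 0, 1, 0, 1)

med <- km_quantile(0.50, time, status)
q25 <- km_quantile(0.25, time, status)
print(c(q25 = q25, median = med))
```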

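47 Confidence intervals for quantiles cont'd
For continuous $T$, $S(t_p) = 1 - p$
Suppose $t_L = t_L(\text{Data})$ satisfies
$$P\left(S(t_L) \ge 1 - p\right) \ge 1 - \alpha,$$
then $t_L$ is a lower confidence bound for $t_p$ (Why?)
For any fixed $t$, in large samples,
$$P\left( S(t) \ge \hat S(t) - z_{1-\alpha/2}\, \hat\sigma_S(t) \right) \ge 1 - \alpha,$$
solve $\hat S(t_L) - z_{1-\alpha/2}\, \hat\sigma_S(t_L) = 1 - p$ for $t_L$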
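Because $\hat S$ is a step function, the equation for $t_L$ generally has no exact solution. One workable reading, which is an assumption of this sketch rather than something stated on the slide, is to take the first failure time at which the lower Greenwood band falls to $1 - p$ or below. The data and the helper name `km_lower_bound` are hypothetical.

```r
# Hypothetical helper: lower confidence bound and plug-in estimate for t_p.
km_lower_bound <- function(p, time, status, alpha = 0.05) {
  tj <- sort(unique(time[status == 1]))
  nj <- vapply(tj, function(u) sum(time >= u), numeric(1))
  dj <- vapply(tj, function(u) sum(time == u & status == 1), numeric(1))
  surv <- cumprod((nj - dj) / nj)                        # S-hat just after t_j
  se   <- surv * sqrt(cumsum(dj / (nj * (nj - dj))))     # Greenwood; NaN if n_j = d_j
  band <- surv - qnorm(1 - alpha / 2) * se               # lower band for S
  first_at_or_below <- function(v) {
    hit <- which(v <= 1 - p)                             # which() skips NaN entries
    if (length(hit) == 0) NA_real_ else tj[hit[1]]
  }
  c(lower = first_at_or_below(band), estimate = first_at_or_below(surv))
}

# Made-up right-censored data; the largest observation is censored.
time   <- c(1, 2, 2, 3, 4, 5, 6, 7)
status <- c(1, 1, 0, 1, 0, 1, 1, 0)
res <- km_lower_bound(0.5, time, status)
print(res)
```

The lower band crosses $1 - p$ no later than $\hat S$ itself does, so the bound never exceeds the plug-in quantile.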
