Lecture 2: Martingale theory for univariate survival analysis

In this lecture T is assumed to be a continuous failure time. A core question in this lecture: how do we develop asymptotic properties when studying statistical methods for univariate survival data?
- Empirical process approach: a general tool for asymptotic theory
- Martingale theory: enjoys some advantages in variance simplification for right-censored data; widely used!
2.1 Notation
- f(t): density function of T
- F(t) = P(T ≤ t) = ∫_0^t f(u)du: cumulative distribution function
- S(t) = 1 − F(t) = ∫_t^∞ f(u)du: survival function of T
- S(t) = exp{−Λ(t)}, where Λ(t) = −log S(t): cumulative hazard function
- λ(t) = Λ′(t): hazard function
- C: censoring time
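As a quick numerical sanity check of the identities above (a minimal sketch; the Exponential distribution with rate 2 and the evaluation point are arbitrary choices for illustration):

```python
import numpy as np

# Exponential(rate = 2) as a concrete check of the identities among f, F, S, Λ, λ
lam = 2.0   # hazard rate (constant for the exponential)
t = 0.8     # arbitrary evaluation point

f = lam * np.exp(-lam * t)   # density f(t)
F = 1 - np.exp(-lam * t)     # CDF F(t)
S = np.exp(-lam * t)         # survival S(t)
Lam = lam * t                # cumulative hazard Λ(t)

assert np.isclose(S, 1 - F)            # S(t) = 1 - F(t)
assert np.isclose(S, np.exp(-Lam))     # S(t) = exp{-Λ(t)}
assert np.isclose(f / S, lam)          # hazard λ(t) = f(t)/S(t)
```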
More notation:
- i = 1, 2, ..., n: index for subjects
- X_i = min(T_i, C_i): observed survival time (possibly censored)
- δ_i = I(T_i ≤ C_i): censoring indicator
- X_(1) < X_(2) < ... < X_(k): ordered uncensored times
- R(t) = {(X_j, δ_j) : X_j ≥ t, j = 1, 2, ..., n}: risk set at t
- Y_i(t) = I(X_i ≥ t): at-risk indicator
- Y(t) = Σ_i Y_i(t): total no. of subjects at risk at t
- Z_i or Z_i(t): (time-varying) covariates
- N_i(t) = I(X_i ≤ t, δ_i = 1): counts subject i's failure event prior to time t
- N(t) = Σ_i N_i(t): total number of observed events prior to time t
- {(X_i, δ_i, Z_i) : i = 1, 2, ..., n}: collected data
2.2 Martingale theory: Initial setting
Probability space (Ω, F, P)
- Ω: sample space
- F: a class of subsets of Ω; the class is a σ-algebra
- P: a probability measure on (Ω, F)
Conditional expectation E[Y | X]
1. E[Y | X] is σ(X)-measurable
2. E[Y | X] = E[Y | σ(X)]: the average prediction of Y given all the information on X
3. E[Y] = E[E[Y | X]]
Stochastic process {X(t); 0 ≤ t ≤ τ} on (Ω, F, P)
1. a collection of random variables indexed by 0 ≤ t ≤ τ
2. for a fixed sample point ω ∈ Ω, X(t; ω) is a sample path as a function of t; for convenience, we usually write X(t) instead of X(t; ω)
- Define F_t as the σ-field generated by {X(u) : 0 ≤ u ≤ t}: the information about X(·) from 0 to t
- {W(t); t ≥ 0} is adapted to F_t if W(t) is F_t-measurable
- Define F as the collection of F_t, 0 ≤ t ≤ τ. It is called the filtration of the underlying probability space
A stochastic process M(t) is a martingale with respect to the stochastic process X(t), 0 ≤ t ≤ τ, if
- M(t) is adapted to the filtration F: for each t, the random variable M(t) is an F_t-measurable function
- E|M(t)| < ∞
- for s ≥ 0, E[M(t + s) | F_t] = M(t) (fair game)
Under the same setting, for s ≥ 0,
- M(t) is a submartingale if E[M(t + s) | F_t] ≥ M(t) (winning)
- M(t) is a supermartingale if E[M(t + s) | F_t] ≤ M(t) (losing)
If M(t) is a martingale, then
- E[dM(t)] = 0
- if E[M(0)] = 0, then E[M(s)] = 0 for all s ≥ 0
Predictable process W(t)
- W(t) is F_{t−}-measurable
- E[W(t) | F_{t−}] = W(t)
- a left-continuous process adapted to F_t is predictable
- example: the at-risk process, E[Y(t) | F_{t−}] = Y(t)
2.3 Doob-Meyer Decomposition Theorem
Doob-Meyer Decomposition Theorem. If X(t) is an adapted, right-continuous, non-negative submartingale, then there exists a unique right-continuous, increasing, predictable process A(t) such that A(0) = 0, E[A(t)] < ∞, and Q(t) = X(t) − A(t) is a martingale. A(t) is called the compensator:
dA(t) = E[dX(t) | F_{t−}]
Example.
- X(t) = N_i(t) is a submartingale.
- Y_i(t) = I(X_i ≥ t) is left-continuous with right-hand limits and is a predictable process.
- Define the predictable process A(t) = ∫_0^t Y_i(u)λ(u)du.
- Then dA(t) = Y_i(t)λ(t)dt = E[dN_i(t) | F_{t−}].
- Q(t) = M_i(t) = N_i(t) − ∫_0^t Y_i(u)λ(u)du is a martingale.
- We will show by a more direct proof that M_i(t) is a martingale in the next few slides.
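The compensated process above can be checked by simulation: with exponential failure and censoring times (rates chosen arbitrarily for illustration), the sample average of M_i(t0) = N_i(t0) − λ·min(X_i, t0) should be close to 0. A minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
lam, lam_c, n, t0 = 1.0, 0.5, 200_000, 0.7  # illustrative rates, sample size, time point

T = rng.exponential(1 / lam, n)    # failure times with constant hazard lam
C = rng.exponential(1 / lam_c, n)  # censoring times
X = np.minimum(T, C)
delta = (T <= C).astype(float)

# N_i(t0) = I(X_i <= t0, delta_i = 1)
N_t0 = ((X <= t0) & (delta == 1)).astype(float)
# compensator A(t0) = ∫_0^t0 Y_i(u) λ du = λ * min(X_i, t0)
A_t0 = lam * np.minimum(X, t0)
M_t0 = N_t0 - A_t0

print(abs(M_t0.mean()))  # should be near 0 (Monte Carlo error only)
```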
Consider a Hilbert space H, which is a complete metric space with respect to the distance function induced by the inner product ⟨X, Y⟩ = E[XY]. The norm of X ∈ H based on the inner product ⟨·, ·⟩ is defined as ‖X‖ = ⟨X, X⟩^{1/2} = (E[X²])^{1/2}.
- martingale: M(t) = N(t) − ∫_0^t Y(u)λ(u)du
- variance: var[M(t)] = E[M(t)²] = E[⟨M, M⟩(t)]
- Define: d_c⟨M, M⟩(t) = E[d(M(t))² | F_{t−}]
2.4 Variance and covariance of martingales
Calculation of E[dN_i(t) | F_{t−}] and E[dM_i(t) | F_{t−}]:
- dN_i(t) = 0 or 1
- E[dN_i(t) | F_{t−}] = Pr(dN_i(t) = 1 | F_{t−})
- given F_{t−}, Y_i(t) is known (predictable)
- if Y_i(t) = 0 (i.e., subject i has failed or been censored before t), then Pr(dN_i(t) = 1 | F_{t−}) = 0
- if Y_i(t) = 1 (i.e., subject i is at risk at t), then Pr(dN_i(t) = 1 | F_{t−}) = λ(t)dt under independent censoring. Exercise. Prove this result.
Therefore,
E[dN_i(t) | F_{t−}] = Y_i(t)λ(t)dt
E[dM_i(t) | F_{t−}] = E[{dN_i(t) − Y_i(t)λ(t)dt} | F_{t−}] = 0
Claim: M_i(t) = N_i(t) − ∫_0^t Y_i(u)λ(u)du is a martingale.
Proof.
- M_i(t) is adapted to F
- E|M_i(t)| < ∞
- Note that for s ≥ 0,
E[M_i(t + s) | F_t]
= E[∫_t^{t+s} {dN_i(u) − Y_i(u)λ(u)du} | F_t] + M_i(t)
= ∫_t^{t+s} E[dN_i(u) − Y_i(u)λ(u)du | F_t] + M_i(t)
= ∫_t^{t+s} E[ E[dN_i(u) − Y_i(u)λ(u)du | F_{u−}] | F_t] + M_i(t)
= ∫_t^{t+s} E[0 | F_t] + M_i(t) = M_i(t)
Similarly, M(t) = N(t) − ∫_0^t Y(u)λ(u)du is a martingale.
From now on, we use the martingales
M_i(t) = N_i(t) − ∫_0^t Y_i(u)λ(u)du, M(t) = N(t) − ∫_0^t Y(u)λ(u)du.
Note that
- E[M_i(t)] = 0, 0 ≤ t ≤ τ; E[dM_i(t) | F_{t−}] = 0, 0 < t ≤ τ
- E[M(t)] = 0, 0 ≤ t ≤ τ; E[dM(t) | F_{t−}] = 0, 0 < t ≤ τ
Calculation of variance and covariance: d_c⟨M_i, M_j⟩(t) = E[d(M_i(t)M_j(t)) | F_{t−}]
d[M_i(t)M_j(t)] = M_i(t)M_j(t) − M_i(t−)M_j(t−)
= [M_i(t−) + dM_i(t)][M_j(t−) + dM_j(t)] − M_i(t−)M_j(t−)
= M_j(t−)dM_i(t) + M_i(t−)dM_j(t) + dM_i(t)dM_j(t)
Thus,
E[d(M_i(t)M_j(t)) | F_{t−}] = E[dM_i(t)dM_j(t) | F_{t−}] = cov[dM_i(t), dM_j(t) | F_{t−}]
Note: when i = j,
d_c⟨M_i, M_i⟩(t) = E[(dM_i(t))² | F_{t−}] = var[dM_i(t) | F_{t−}]
Claim: d_c⟨M_i, M_i⟩(t) = Y_i(t)λ(t)dt + o_p(dt)
- dM_i(t) = dN_i(t) − Y_i(t)λ(t)dt
- Given F_{t−}, Y_i(t)λ(t)dt is a constant term
- var[dM_i(t) | F_{t−}] = var[dN_i(t) | F_{t−}]
= E[(dN_i(t))² | F_{t−}] − {E[dN_i(t) | F_{t−}]}²
= E[dN_i(t) | F_{t−}] − {E[dN_i(t) | F_{t−}]}² (since dN_i(t) is 0 or 1)
= Y_i(t)λ(t)dt − Y_i(t)[λ(t)dt]²
= Y_i(t)λ(t)dt + o_p(dt)
Thus, d_c⟨M_i, M_i⟩(t) = var[dM_i(t) | F_{t−}] = Y_i(t)λ(t)dt + o_p(dt); this is the predictable variation.
(Remark: o_p(dt)/dt → 0 as dt → 0.)
Claim: if j ≠ k, d_c⟨M_j, M_k⟩(t) = o_p(dt)
M_j(t) = N_j(t) − ∫_0^t Y_j(u)λ(u)du, M_k(t) = N_k(t) − ∫_0^t Y_k(u)λ(u)du
d_c⟨M_j, M_k⟩(t) = cov[dM_j(t), dM_k(t) | F_{t−}]
= E[(dN_j(t) − Y_j(t)λ(t)dt)(dN_k(t) − Y_k(t)λ(t)dt) | F_{t−}]
= E[dN_j(t)dN_k(t) | F_{t−}] − Y_j(t)λ(t)dt · E[dN_k(t) | F_{t−}] − E[dN_j(t) | F_{t−}] · Y_k(t)λ(t)dt + Y_j(t)Y_k(t)(λ(t)dt)²
= E[dN_j(t)dN_k(t) | F_{t−}] − Y_j(t)Y_k(t)(λ(t)dt)²
= E[dN_j(t)dN_k(t) | F_{t−}] + o_p(dt)
Note that E[dN_j(t)dN_k(t) | F_{t−}] = 0 as long as N_j(t) and N_k(t) do not jump at the same t (with positive probability). Thus, for continuous failure time models, E[dN_j(t)dN_k(t) | F_{t−}] = 0.
2.5 Martingale Central Limit Theorem (MCLT)
Property. Suppose M(t) is a martingale and H(t) is predictable. Then L(t) = ∫_0^t H(u)dM(u) is a martingale with respect to F_t.
Recall the Nelson-Aalen estimator Λ̂(t) = ∫_0^t dN(u)/Y(u), and
U_n(t) = √n [Λ̂(t) − Λ(t)] = √n ∫_0^t (1/Y(u)) Σ_{i=1}^n [dN_i(u) − Y_i(u)λ(u)du]
= ∫_0^t (√n/Y(u)) Σ_{i=1}^n dM_i(u) = ∫_0^t (√n/Y(u)) dM(u)
for t < τ = sup_t {t : Pr(X ≥ t) > 0}.
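The estimator Λ̂(t) = ∫_0^t dN(u)/Y(u) can be sketched in a few lines; the toy data below are hypothetical, and production implementations (e.g. in R's survival package or Python's lifelines) return the whole step-function path:

```python
import numpy as np

def nelson_aalen(x, delta, t):
    """Nelson-Aalen estimate of Λ(t) = ∫_0^t dN(u)/Y(u)."""
    x, delta = np.asarray(x), np.asarray(delta)
    cum = 0.0
    # loop over distinct uncensored times up to t
    for u in np.sort(np.unique(x[(delta == 1) & (x <= t)])):
        y_u = np.sum(x >= u)                     # Y(u): number at risk at u
        dn_u = np.sum((x == u) & (delta == 1))   # dN(u): events at u
        cum += dn_u / y_u
    return cum

# hypothetical data: observed times and censoring indicators
x = [2, 3, 3, 5, 7]
d = [1, 1, 0, 1, 0]
print(nelson_aalen(x, d, 6))  # 1/5 + 1/4 + 1/2 = 0.95
```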
What is the limiting distribution of U_n(t)? An application of the MCLT.
Assumptions:
1. convergent variance: ⟨U_n, U_n⟩(t) → v(t)
2. U_n(t) is smooth, i.e., its jumps become negligible (condition skipped)
Results:
1. U_n(t) converges weakly to U(t)
2. U(t) is a (time-transformed) Brownian motion:
- E[U(t)] = 0
- var[U(t)] = lim_n ⟨U_n, U_n⟩(t) = v(t)
- independent increments: U(s) and U(t) − U(s) are independent for s ≤ t
Using the results of the variance and covariance calculations:
- if u < v,
E[(√n/Y(u))dM_i(u) · (√n/Y(v))dM_j(v) | F_{v−}] = (√n/Y(u))dM_i(u) · E[(√n/Y(v))dM_j(v) | F_{v−}] = 0
- E[(√n/Y(u))²(dM(u))² | F_{u−}] = (n/Y(u)²) Σ_{i=1}^n var[dM_i(u) | F_{u−}] = (n/Y(u)²) Σ_{i=1}^n Y_i(u)λ(u)du
Also,
⟨U_n, U_n⟩(t) = ∫_0^t (n/Y(u)²) Σ_{i=1}^n Y_i(u)λ(u)du = ∫_0^t (n/Y(u)²) Y(u)λ(u)du
= ∫_0^t λ(u)du/(Y(u)/n) →_p ∫_0^t λ(u)du/E[Y_1(u)] = v(t),
where v(t) = ∫_0^t λ(u)du/E[Y_1(u)] = ∫_0^t λ(u)du/{S_C(u)S(u)}.
Thus, n[ˆλ(t) Λ(t)] converges weakly to Brownian motion U(t), where E[U(t)] = and var[u(t)] = t λ(u)du S. c(u)s(u) Note that S(t) = e Λ(t), and the Kaplan-Meier estimator satisfies Ŝ(t) = e t dn(u) Y (u) + o p (n 1/2 ) (reference: Breslow and Crowley, 1974). By the functional Delta method, n[ Ŝ(t) S(t)] = n[e ˆΛ(t) e Λ(t) ] + o p (1) d S(t) U(t) (weak convergence)
Results:
1. √n[Ŝ(t) − S(t)] converges weakly to V(t) = −S(t)·U(t)
2. V(t) is a mean-zero Gaussian process:
E[V(t)] = 0, var[V(t)] = [S(t)]² ∫_0^t λ(u)du/{S_C(u)S(u)}
Remark:
1. Estimation of var[V(t)]: with S_X(v) = P(X ≥ v) = S_C(v)S(v) and dF_U(v) = P(X ∈ dv, δ = 1) = S_C(v)S(v)λ(v)dv,
var[V(t)] = [S(t)]² ∫_0^t dF_U(v)/{S_X(v)S_C(v)S(v)} = [S(t)]² ∫_0^t dF_U(v)/S_X²(v).
Thus, by plugging in the KM and empirical distribution estimators, var[V(t)] can be estimated by
var̂[V(t)] = [Ŝ(t)]² ∫_0^t dF̂_U(v)/Ŝ_X²(v),
which is approximately Greenwood's formula:
ṽar(V(t)) = [Ŝ(t)]² ∫_0^t n·dN(u)/{Y(u)(Y(u) − dN(u))} = [Ŝ(t)]² ∫_0^t dF̂_U(u)/{Ŝ_X(u)Ŝ_X(u+)}
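Greenwood's formula can be sketched directly; the version below estimates var[Ŝ(t)] itself (i.e., without the n scaling used for V(t)), and the toy data are hypothetical:

```python
import numpy as np

def km_greenwood(x, delta, t):
    """Kaplan-Meier estimate Ŝ(t) and Greenwood variance estimate of var[Ŝ(t)]."""
    x, delta = np.asarray(x), np.asarray(delta)
    s, gw = 1.0, 0.0
    for u in np.sort(np.unique(x[(delta == 1) & (x <= t)])):
        y = np.sum(x >= u)                    # Y(u): number at risk
        dn = np.sum((x == u) & (delta == 1))  # dN(u): events at u
        s *= 1 - dn / y                       # product-limit factor
        gw += dn / (y * (y - dn))             # Greenwood increment
    return s, s**2 * gw

# hypothetical data
x = [2, 3, 3, 5, 7]
d = [1, 1, 0, 1, 0]
s_hat, var_hat = km_greenwood(x, d, 6)
print(s_hat)    # (1 - 1/5)(1 - 1/4)(1 - 1/2) = 0.3
```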
2. Both the survival and the censoring distributions play roles in var[V(t)].
Note: The censoring distribution does not play a role in the likelihood function, but does play a role implicitly in estimation and inference.
2.6 Proportional hazards model
Now consider the proportional hazards model
λ(t | Z) = λ_0(t)exp(βZ),
where Z is a 1-dimensional covariate. For simplicity of notation, we consider time-independent covariates only. The results can also be extended to time-dependent-covariate cases.
L = Π_(i) p(Z_(i) | H(x_(i)−), failure at x_(i)) {a residual likelihood}, where H(t−) = data history prior to t.
Partial likelihood:
L_p = Π_(i) p(Z_(i) | H(x_(i)−), failure at x_(i)) = Π_{i=1}^n ( e^{βZ_i} / Σ_{j=1}^n Y_j(t_i)e^{βZ_j} )^{δ_i}
Partial score equation:
S_n(β) = Σ_i ∫ {Z_i − Z̄(u; β)}dN_i(u),
where Z̄(u; β) = Σ_i Z_iY_i(u)exp(βZ_i) / Σ_i Y_i(u)exp(βZ_i)
(the expected value of Z_i in the risk set at u). The partial likelihood estimator β̂ is obtained by solving S_n(β) = 0.
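The partial score equation can be sketched as follows. Since S_n(β) is non-increasing in β, a simple bisection recovers the root β̂ (a minimal illustration on hypothetical data with no tied event times; real software maximizes the full partial likelihood by Newton-Raphson):

```python
import numpy as np

def partial_score(beta, x, delta, z):
    """S_n(β) = Σ_i δ_i {Z_i - Z̄(X_i; β)} for the Cox model (no tied event times)."""
    x, delta, z = map(np.asarray, (x, delta, z))
    s = 0.0
    for xi, di, zi in zip(x, delta, z):
        if di == 1:
            w = np.exp(beta * z[x >= xi])              # exp(βZ_j) over risk set at X_i
            zbar = np.sum(z[x >= xi] * w) / np.sum(w)  # Z̄(X_i; β)
            s += zi - zbar
    return s

# hypothetical data
x = [2, 3, 4, 5, 7, 8]
d = [1, 1, 0, 1, 1, 0]
z = [1, 0, 1, 0, 1, 0]

# S_n(β) is non-increasing in β, so bisection over a bracketing interval works
lo, hi = -5.0, 5.0
for _ in range(60):
    mid = 0.5 * (lo + hi)
    if partial_score(mid, x, d, z) > 0:
        lo = mid
    else:
        hi = mid
beta_hat = 0.5 * (lo + hi)   # solves S_n(β) = 0
```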
A representation of S_n(t; β):
S_n(t; β) = Σ_{i=1}^n ∫_0^t {Z_i − Z̄(u; β)}dN_i(u)
= Σ_{i=1}^n ∫_0^t {Z_i − Z̄(u; β)}[dN_i(u) − Y_i(u)exp(βZ_i)λ_0(u)du]
 + Σ_{i=1}^n ∫_0^t {Z_i − Z̄(u; β)}Y_i(u)exp(βZ_i)λ_0(u)du
= Σ_{i=1}^n ∫_0^t {Z_i − Z̄(u; β)}dM_i(u; β)
 + ∫_0^t {Σ_{i=1}^n Z_iY_i(u)exp(βZ_i) − Σ_{i=1}^n Y_i(u)exp(βZ_i) · Z̄(u; β)}λ_0(u)du
= Σ_{i=1}^n ∫_0^t {Z_i − Z̄(u; β)}dM_i(u; β)   (*)
(the second term vanishes by the definition of Z̄(u; β))
Let U_n(t) = n^{−1/2}S_n(t; β). Then under proper regularity conditions,
⟨U_n, U_n⟩(t) = (1/n) Σ_{i=1}^n ∫_0^t {Z_i − Z̄(u)}² Y_i(u)exp(βZ_i)λ_0(u)du
→ ∫_0^t E[{Z − μ_Z(u)}² Y(u)exp(βZ)]λ_0(u)du
Properties of S_n(t; β):
1. E[S_n(t; β)] = 0
2. using representation (*) to derive the independent increment property: for s ≤ t, cov[S_n(s; β), S_n(t; β) − S_n(s; β)] = 0
3. var[n^{−1/2}S_n(t; β)] = E[⟨U_n, U_n⟩(t)]
The partial likelihood estimator β̂_n is consistent (proof skipped).
Asymptotic normality:
partial score equation S_n(t; β) = Σ_i ∫_0^t {Z_i − Z̄(u; β)}dN_i(u)
Taylor expansion (mean value expansion): for β_n* lying between β̂_n and β,
0 = S_n(t; β̂_n) = S_n(t; β) + (∂S_n(β)/∂β)|_{β=β_n*}(β̂_n − β)
Thus,
n^{1/2}(β̂_n − β) = −{n^{−1}(∂S_n(β)/∂β)|_{β=β_n*}}^{−1} · n^{−1/2}S_n(β),
with n^{−1/2}S_n(β) →_d N(0, σ²).
Slope:
∂S_n(β)/∂β = −Σ_i ∫ [Σ_{j=1}^n {Z_j − Z̄(u; β)}² Y_j(u)exp(βZ_j) / Σ_{j=1}^n Y_j(u)exp(βZ_j)] dN_i(u)
Since β_n* → β,
n^{−1}(∂S_n(β)/∂β)|_{β=β_n*} → −∫ (E[{Z_j − Z̄(u; β)}² Y_j(u)exp(βZ_j)] / E[Y_j(u)exp(βZ_j)]) dP(X_j ≤ u, δ_j = 1) = −σ²
Thus, by likelihood theory,
√n(β̂_n − β) →_d N(0, σ^{−2})
Therefore, the standardized version:
(β̂_n − β) · [Σ_i ∫ (Σ_{j=1}^n {Z_j − Z̄(u; β̂_n)}² exp(β̂_nZ_j)Y_j(u) / Σ_{j=1}^n exp(β̂_nZ_j)Y_j(u)) dN_i(u)]^{1/2} →_D N(0, 1)
Estimation of σ²:
σ̂² = n^{−1} Σ_{i=1}^n δ_i · [Σ_{j=1}^n {Z_j − Z̄(X_i; β̂_n)}² Y_j(X_i)exp(β̂_nZ_j) / Σ_{j=1}^n Y_j(X_i)exp(β̂_nZ_j)]
Denote S_Z(u; β) = Σ_{j=1}^n {Z_j − Z̄(u; β)}² exp(βZ_j)Y_j(u) / Σ_{j=1}^n exp(βZ_j)Y_j(u).
A 95% confidence interval for β is
β̂_n ± 1.96 [Σ_i ∫ S_Z(u; β̂_n)dN_i(u)]^{−1/2}
Hypothesis testing of H_0: β = β_0; reject H_0 at 5% type-I error if:
Wald test:
|β̂_n − β_0| · [Σ_i ∫ S_Z(u; β̂_n)dN_i(u)]^{1/2} ≥ 1.96
The partial likelihood score test:
|n^{−1/2}S_n(β_0)| / {var[n^{−1/2}S_n(β_0)]}^{1/2} ≥ 1.96,
i.e.,
|Σ_i ∫ {Z_i − Z̄(u; β_0)}dN_i(u)| / [Σ_i ∫ S_Z(u; β_0)dN_i(u)]^{1/2} ≥ 1.96
What if β_0 = 0 and Z = 0/1 in the partial likelihood score test?
|Σ_i ∫ {Z_i − Z̄(u; β_0)}dN_i(u)| / [Σ_i ∫ S_Z(u; β_0)dN_i(u)]^{1/2},
where
S_Z(u; β_0) = Σ_{j=1}^n {Z_j − Z̄(u)}² exp(β_0Z_j)Y_j(u) / Σ_{j=1}^n exp(β_0Z_j)Y_j(u) = Z̄(u){1 − Z̄(u)}
and Z̄(u) is the fraction of treated subjects in the risk set at u.
Note: The partial likelihood score test is the log-rank test with a more general variance estimator. This variance estimator coincides with that of the log-rank test when the failure time T is continuous.
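A sketch of the score test at β_0 = 0 with a binary covariate, which reproduces the log-rank statistic (hypothetical data; the variance used is the score-test version S_Z(u; 0) = Z̄(u){1 − Z̄(u)}, which matches the classical log-rank variance for continuous failure times):

```python
import numpy as np

def logrank_score_test(x, delta, z):
    """Partial-likelihood score test of H0: β = 0 with binary Z (the log-rank test)."""
    x, delta, z = map(np.asarray, (x, delta, z))
    num, var = 0.0, 0.0
    for xi in np.sort(np.unique(x[delta == 1])):
        at_risk = x >= xi
        dn = np.sum((x == xi) & (delta == 1))           # dN(u)
        zbar = z[at_risk].mean()                        # fraction of treated in risk set
        num += np.sum(z[(x == xi) & (delta == 1)]) - dn * zbar
        var += dn * zbar * (1 - zbar)                   # S_Z(u; 0) = Z̄(u){1 - Z̄(u)}
    return num / np.sqrt(var)

# hypothetical data: treated group (z = 1) fails earlier
x = [1, 2, 3, 4]
d = [1, 1, 1, 1]
z = [1, 1, 0, 0]
stat = logrank_score_test(x, d, z)   # compare |stat| with 1.96
```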
Proportional hazards model for multivariate covariates Z ∈ R^p:
λ(t | Z) = λ_0(t)exp(β^TZ), where β = (β_1, ..., β_p)^T is a p-vector
partial score function:
S_n(β) = Σ_{i=1}^n ∫ {Z_i − Z̄(u; β)}dN_i(u),
where Z_i = (Z_i1, Z_i2, ..., Z_ip)^T and Z̄(u; β) = (Z̄_1(u; β), Z̄_2(u; β), ..., Z̄_p(u; β))^T are p × 1 vectors.
n^{−1/2}S_n(β) →_D N(0, Σ(β)), where Σ(β) is the limit of
n^{−1} Σ_i ∫ [Σ_{j=1}^n {Z_j − Z̄(u; β)}^{⊗2} exp(β^TZ_j)Y_j(u) / Σ_{j=1}^n exp(β^TZ_j)Y_j(u)] dN_i(u),
where v^{⊗0} = 1, v^{⊗1} = v, v^{⊗2} = vv^T.
partial derivative: −n^{−1} ∂S_n(β)/∂β^T → Σ(β) (p × p)
n^{1/2}(β̂_n − β) →_D N(0, Σ^{−1}(β))
Estimation of the baseline hazard function (Breslow estimator):
Λ̂_0(t; β̂_n) = ∫_0^t Σ_{i=1}^n dN_i(u) / Σ_{i=1}^n Y_i(u)exp(β̂_n^TZ_i)
- consistent
- asymptotic normality: n^{1/2}[Λ̂_0(t; β̂_n) − Λ_0(t)] equals
n^{1/2}[Λ̂_0(t; β) − Λ_0(t)] + n^{1/2}[Λ̂_0(t; β̂_n) − Λ̂_0(t; β)] = Term I + Term II
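A minimal sketch of the Breslow estimator on hypothetical data; with β = 0 it reduces to the Nelson-Aalen estimator:

```python
import numpy as np

def breslow(x, delta, z, beta, t):
    """Breslow estimate of Λ0(t) = ∫_0^t dN(u) / Σ_i Y_i(u) exp(β Z_i)."""
    x, delta, z = map(np.asarray, (x, delta, z))
    cum = 0.0
    for u in np.sort(np.unique(x[(delta == 1) & (x <= t)])):
        dn = np.sum((x == u) & (delta == 1))       # dN(u)
        denom = np.sum(np.exp(beta * z[x >= u]))   # Σ Y_i(u) exp(βZ_i)
        cum += dn / denom
    return cum

# hypothetical data; with beta = 0 the denominator is just Y(u)
x = [2, 3, 3, 5, 7]
d = [1, 1, 0, 1, 0]
z = [1, 0, 1, 0, 1]
print(breslow(x, d, z, 0.0, 6))   # 1/5 + 1/4 + 1/2 = 0.95
```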
Term I:
n^{1/2} Σ_{i=1}^n ∫_0^t dM_i(u; β) / Σ_{j=1}^n exp(β^TZ_j)Y_j(u),
which is a sum of martingale integrals with predictable variation
n Σ_{i=1}^n ∫_0^t Y_i(u)exp(β^TZ_i)λ_0(u)du / [Σ_{j=1}^n exp(β^TZ_j)Y_j(u)]²
→ ∫_0^t λ_0(u)du / E[exp(β^TZ)Y(u)] = σ₁²(t)
Term II: by a mean value expansion, with β_n* between β̂_n and β,
n^{1/2}[Λ̂_0(t; β̂_n) − Λ̂_0(t; β)]
= −n^{1/2}(β̂_n − β) ∫_0^t [Σ_i exp(β_n*^TZ_i)Z_iY_i(u) / {Σ_i exp(β_n*^TZ_i)Y_i(u)}²] dN(u),
where
∫_0^t [Σ_i exp(β_n*^TZ_i)Z_iY_i(u) / {Σ_i exp(β_n*^TZ_i)Y_i(u)}²] dN(u) →_p μ(t)
Recall
n^{1/2}(β̂_n − β) = [n^{−1} Σ_i ∫ S_Z(u; β)dN_i(u)]^{−1} · n^{−1/2} Σ_i ∫ {Z_i − Z̄(u; β)}dM_i(u; β) + o_p(1) →_D N(0, σ^{−2})
Covariance between Term I and Term II: it suffices to consider the predictable covariation between
Σ_i ∫_0^t dM_i(u; β)/Σ_j exp(β^TZ_j)Y_j(u) and Σ_i ∫_0^t {Z_i − Z̄(u)}dM_i(u; β),
which is
Σ_i ∫_0^t {Z_i − Z̄(u)}exp(β^TZ_i)Y_i(u)λ_0(u)du / Σ_j exp(β^TZ_j)Y_j(u) = 0
Term I and Term II are asymptotically uncorrelated. Thus,
n^{1/2}[Λ̂_0(t; β̂_n) − Λ_0(t)] →_D N(0, σ₁²(t) + μ(t)²σ^{−2})
Appendix: Consistency of the K-M estimator
We consider the general case where the failure time T could be partly continuous and partly discrete. Recall S(t) = P{T ≥ t}. Define S̄(t) = 1 − S(t) = P{T < t}, and let Y = min(T, C) denote the observed time. Define the subsurvival functions:
S*(t) = S_Y(t) = P{Y ≥ t} = [S(t)][S_C(t)]
S*_U(t) = P{Y ≥ t, δ = 1} = ∫_t^∞ [S_C(u)]dS̄(u)
S*_C(t) = P{Y ≥ t, δ = 0} = ∫_t^∞ [S(u)]dS̄_C(u)
Then, S*(t) = S*_U(t) + S*_C(t).
We will show that S(t) can be expressed as a function of S*_U(t) and S*_C(t).
(i) Suppose S*_U is continuous at u < t. Note dS*_U(u) = S_C(u)dS(u), so
∫_0^t dS*_U(u)/{S*_U(u) + S*_C(u)} = ∫_0^t S_C(u)dS(u)/{S(u)S_C(u)} = ∫_0^t dS(u)/S(u)
= log[S(u)] |_0^t = log S(t).
Therefore, [ t dsu S(t) = exp (u) ] SU (u) + S C (u). (ii) Suppose SU has a jump at t, but S C is continuous at t. log S U (t+) + S C (t+) S U (t ) + S C (t ) = log [S(t+)][S C(t+)] [S(t )][S C (t )], = log [S(t+)] [S(t )] (The second equality follows from the fact that S c is continuous at t so S C (t+) = S C (t ).)
Therefore,
S(t+)/S(t) = S(t+)/S(t−) = exp{ log [{S*_U(t+) + S*_C(t+)} / {S*_U(t−) + S*_C(t−)}] }.
(The first equality is due to the left continuity of the survival function.)
If the underlying distributions S and S_C have no common jumps, then from (i) and (ii),
S(t) = exp{ ∫_c dS*_U(u)/{S*_U(u) + S*_C(u)} + Σ_{d, u<t} log [{S*_U(u+) + S*_C(u+)} / {S*_U(u−) + S*_C(u−)}] },
where ∫_c denotes integration over the continuity intervals of S*_U and Σ_d denotes summation over the discrete jumps of S*_U. The above expression is called Peterson's representation, showing that S(t) can be represented as a function of S*_U, S*_C, and t; that is, S(t) = ψ(S*_U, S*_C; t).
Peterson's representation gives us a proof that the KM estimator Ŝ(t) is consistent. The proof proceeds as follows. Define the empirical sub-survival functions
Ŝ*_U(t) = (1/n) Σ_{i=1}^n I(y_i > t, δ_i = 1),
Ŝ*_C(t) = (1/n) Σ_{i=1}^n I(y_i > t, δ_i = 0).
It can be seen that the PL estimator is Ŝ(t) = ψ(Ŝ*_U, Ŝ*_C; t), provided any ties between uncensored and censored observations are interpreted as uncensored observations preceding censored ones.
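The empirical sub-survival functions and the decomposition S*(t) = S*_U(t) + S*_C(t) are straightforward to check by simulation (the Weibull/exponential distributions and the evaluation point are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000
T = rng.weibull(1.5, n)        # failure times (illustrative)
C = rng.exponential(1.0, n)    # censoring times (illustrative)
y = np.minimum(T, C)
delta = (T <= C).astype(int)

t = 0.5
S_U = np.mean((y > t) & (delta == 1))   # empirical S*_U(t)
S_C = np.mean((y > t) & (delta == 0))   # empirical S*_C(t)
S_star = np.mean(y > t)                 # empirical P(Y > t)

# the uncensored and censored sub-survival functions partition S*(t)
assert np.isclose(S_U + S_C, S_star)
```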
Notice that since Ŝ*_U is discrete, ψ(Ŝ*_U, Ŝ*_C; t) involves only the summation over the discrete jumps of Ŝ*_U. By the Glivenko-Cantelli theorem,
Ŝ*_U(t) →_{a.s.} S*_U(t), Ŝ*_C(t) →_{a.s.} S*_C(t), uniformly in t.
(The notation →_{a.s.} denotes "converges almost surely to.") Also, ψ is a continuous function of (S*_U, S*_C) in the sup norm: if sup_t |S*_U^{(n)}(t) − S*_U(t)| → 0 and sup_t |S*_C^{(n)}(t) − S*_C(t)| → 0, then ψ(S*_U^{(n)}, S*_C^{(n)}; t) → ψ(S*_U, S*_C; t). Therefore,
Ŝ(t) = ψ(Ŝ*_U, Ŝ*_C; t) →_{a.s.} ψ(S*_U, S*_C; t) = S(t).
REFERENCE Peterson, JASA (1977).
Functional delta method
Suppose that n^{1/2}(Λ̂(·) − Λ(·)) →_d U(·). If φ(·) is compactly differentiable, then
n^{1/2}[φ(Λ̂(·)) − φ(Λ(·))] →_d dφ(Λ(·); U(·)),
that is,
n^{1/2}[φ(Λ̂(·)) − φ(Λ(·))] ≈ dφ(Λ(·); n^{1/2}(Λ̂(·) − Λ(·))).
Note: dφ(F; G) means the derivative of φ at F in the direction of G.