Lecture 3. Truncation, length-bias and prevalence sampling

3.1 Prevalent sampling

Statistical techniques for truncated data have been integrated into survival analysis over the last two decades. Truncation in survival analysis refers to an incomplete data mechanism: it describes a sampling constraint under which a failure time variable is observable only if it falls in a certain region. When the failure time falls outside the region, the information about the variable is completely lost and therefore excluded from the data set. Truncated survival data typically arise when prevalent sampling is used for recruiting cohort subjects.
When conducting a natural history study, incident and prevalent sampling are the two designs commonly used for recruiting cohort subjects.

Incident cohort. An incident cohort is identified by randomly sampling subjects whose initial events (time origin 0) occur within a calendar time interval. The subjects are then followed until occurrence of the failure event, loss to follow-up, or end of study. The data collected from an incident cohort are the typical survival data subject to right censoring.
Prevalent cohort. When failure times are long, the incident cohort design is inefficient for natural history studies because a long follow-up time is usually needed to observe enough failure events. In contrast, a prevalent sampling design, which draws samples from the disease prevalent population, is more focused and thus more practical in real studies. The prevalent sample is formed by subjects whose initial events have occurred but who have not yet experienced the failure event at the time of recruitment, τ.
Prevalent sampling can be described by either a distributional truncation model (Model I) or a non-stationary Poisson process model (Model II):

I. Define T^0 as the time from disease incidence to the failure event for subjects who became diseased in a calendar time interval [0, τ], τ > 0. The variable W^0 is the time from disease incidence to the (potential) recruitment time, termed the left truncation time. Under prevalent sampling, the probability density of the observed (W, T) is the population probability density of (W^0, T^0) given T^0 ≥ W^0:

  p_{W,T}(w, t) = p_{W^0, T^0 | T^0 ≥ W^0}(w, t | T^0 ≥ W^0).
II. Let T^0 be the time from disease incidence to the failure event. Assume the initial events occur over calendar time as a non-stationary Poisson process with intensity/rate λ(u^0), u^0 ∈ [0, τ], τ > 0. The prevalent sampling population includes those {(u^0, t^0)} for which the disease incidence occurs within the calendar time interval [0, τ] and the failure event occurs after calendar time τ, i.e., u^0 + t^0 ≥ τ.
Remark: In this lecture (W^0, T^0) denote population variables and (W, T) denote observed variables subject to prevalent sampling.
Models I and II are connected through the distributions of disease incidence and failure times.

In Model I,
- g is the marginal density function of W^0;
- S_u(t) = Pr(T^0 ≥ t | W^0 = u) is the survival function of T^0 given W^0 = u;
- when T^0 is independent of W^0, S_u(t) = S(t) = Pr(T^0 ≥ t).

In Model II, define the pdf g^0(u) = λ(u) / ∫_0^τ λ(v) dv as the normalized rate for u ∈ [0, τ], the truncation time as W^0 = τ − U^0, and g(w^0) = g^0(τ − w^0).
Summary:
- Under Model II, conditioning on the number of initial events occurring in [0, τ], the unordered truncation times {W^0} can be considered a set of iid random variables with pdf g, and the corresponding bivariate vectors {(W^0, T^0)} form a set of iid bivariate random vectors. The prevalent population includes individuals with {(w^0, t^0)} for whom the failure event occurs after calendar time τ, i.e., t^0 ≥ w^0. Prevalent sampling collects untruncated data until n pairs of untruncated observations (W_1, T_1), ..., (W_n, T_n) are obtained.
- Under either Model I or Model II, the prevalent sample is thus formed by n iid untruncated pairs (W_1, T_1), ..., (W_n, T_n).
The independent truncation assumption refers to the assumption that T^0 and W^0 are independent of each other. In applications this condition is satisfied when the failure time is independent of the calendar time of incidence of the initiating event. The joint density of the observed (W, T) can then be expressed as

  p_{W,T}(w, t) = g(w) f(t) I(t ≥ w) / Pr(T^0 ≥ W^0) = g(w) f(t) I(t ≥ w) / ∫ S(u) g(u) du.   (1)
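The truncation mechanism behind (1) is easy to see in a small simulation: draw (W^0, T^0) from the population and keep only pairs with T^0 ≥ W^0. The following is a minimal sketch; the function name and the Exponential/Uniform choices are illustrative assumptions, not part of the lecture.

```python
import random

random.seed(1)

def draw_prevalent_sample(n, rate_t=1.0, tau=5.0):
    """Simulate prevalent sampling under independent truncation by
    rejection: keep (W0, T0) only when T0 >= W0, i.e. the subject is
    still event-free at recruitment. Illustrative assumptions:
    T0 ~ Exp(rate_t) and W0 ~ Uniform(0, tau), independent."""
    sample = []
    while len(sample) < n:
        w0 = random.uniform(0.0, tau)
        t0 = random.expovariate(rate_t)
        if t0 >= w0:                 # the truncation condition T0 >= W0
            sample.append((w0, t0))
    return sample

obs = draw_prevalent_sample(20000)
mean_obs_t = sum(t for _, t in obs) / len(obs)
# The population mean of T0 is 1, but the sampled T's are biased upward:
# long durations are more likely to be caught by prevalent sampling.
```

The observed mean of T comes out near 2 rather than 1 here, which previews the length-bias phenomenon of Section 3.2.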
Example. Suppose a random sample of women with breast cancer (b.c.) is recruited for observation of survival. The failure time T^0 is defined as the time from onset of b.c. to death. Suppose the time of recruitment, τ, is a fixed calendar time. Then g(w) can be interpreted as the occurrence density of b.c. at calendar time τ − w. The independent truncation condition is satisfied when the failure time T^0 (the residual lifetime from b.c. onset) is independent of the calendar time of incidence of b.c.
3.2 Length-biased sampling

Length-biased sampling can arise in many epidemiological studies when survival data are collected from a disease population. Assume:
(i) The rate of occurrence of disease incidence, λ(u), remains constant over time.
(ii) The probability distribution of the failure time T^0 is independent of when the incidence of disease occurred.
Conditions (i) and (ii) together are referred to as the equilibrium state; the two conditions define the so-called stable diseases.
The breast cancer example. In the breast cancer (b.c.) example, the prevalent sampling is length-biased if (i) the rate of occurrence of b.c., λ(·), remains constant over time, and (ii) the density function of the time from b.c. to death, f, is independent of when the b.c. occurred.
For simplicity of discussion, assume the failure time T^0 has finite support [0, τ]. When conditions (i) and (ii) are satisfied, the truncation time W^0 follows a uniform distribution over [0, τ]; note that this condition is satisfied if the disease incidence rate has been constant for a long time (≥ τ units) before the prevalent sample is drawn. The joint density of the observed (W, T) in (1) becomes

  p_{W,T}(w, t) = f(t) I(t ≥ w) / ∫_0^τ S(u) du = f(t) I(t ≥ w) / µ,

where µ = E[T^0].
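The marginal density of the observed T implied by this joint density is the length-biased density t f(t)/µ. A small rejection sampler illustrates the resulting bias; the function name, the Exp(1) choice for f, and the proposal bound tmax are illustrative assumptions.

```python
import random

random.seed(2)

def length_biased_draws(n, tmax=20.0):
    """Draw from the length-biased density t*f(t)/mu by rejection:
    propose T0 ~ Exp(1) (an assumed f) and accept with probability
    t/tmax, so acceptance is proportional to the duration t."""
    out = []
    while len(out) < n:
        t = random.expovariate(1.0)
        if t < tmax and random.random() < t / tmax:
            out.append(t)
    return out

draws = length_biased_draws(20000)
mean_lb = sum(draws) / len(draws)
inv_mean = sum(1.0 / t for t in draws) / len(draws)
# For Exp(1): mu = 1 and E[(T0)^2] = 2, so the length-biased mean is
# E[(T0)^2]/mu = 2, while E[1/T] under the biased density equals 1/mu = 1.
```

The check on E[1/T] previews the weighting idea used below: averaging 1/t_i over a length-biased sample recovers 1/µ.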
Define the forward failure time as R = T − W. The joint density function of (W, R) can be derived as

  p_{W,R}(w, r) = f(w + r) I(w ≥ 0, r ≥ 0) / µ,   (2)

which is known in renewal theory as the joint density function of the backward and forward recurrence times (Feller, 1971). In general, treating length-biased data as ordinary survival data leads to biased analytical results. The amount of bias can be large or small, depending on the model and the problem.
Length bias: a few marginal distributions

The sampling density functions of T, W and R can be derived, based on (2), as

  p_T(t) = t f(t) I(t ≥ 0) / µ,
  p_W(w) = S(w) I(w ≥ 0) / µ,
  p_R(r) = S(r) I(r ≥ 0) / µ.

The distribution of T is generally referred to as the length-biased distribution.

Exercise: Derive the marginal pdfs of T, W and R from the joint pdf p_{W,R}(w, r).
The survivorship function S can be estimated by a weighted empirical distribution function, with weight inversely proportional to t_i:

  Ŝ_n(t) = n^{-1} µ̂ Σ_{i=1}^n { t_i^{-1} I(t_i ≥ t) },

where µ̂ = { n^{-1} Σ_j t_j^{-1} }^{-1} serves as an appropriate estimate of µ, since n^{-1} Σ_j t_j^{-1} estimates µ^{-1}. The estimator Ŝ_n was first proposed by Cox (1969, Book) and can be proven to be the nonparametric maximum likelihood estimator of S, a special case of Vardi's selection bias models (1982, 1985, AS).
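The weighted estimator above can be sketched in a few lines; the function name and the toy data are illustrative, not from the lecture.

```python
def survival_weighted(ts):
    """Cox's weighted estimator of S(t) = Pr(T0 >= t) from uncensored
    length-biased failure times ts, weighting each observation by 1/t_i.
    Returns the estimator S_hat and the harmonic-mean estimate mu_hat."""
    n = len(ts)
    inv = [1.0 / t for t in ts]
    mu_hat = n / sum(inv)                      # {n^-1 sum 1/t_j}^-1

    def S_hat(t):
        return (mu_hat / n) * sum(w for w, ti in zip(inv, ts) if ti >= t)

    return S_hat, mu_hat

# Toy (hypothetical) length-biased data:
S_hat, mu_hat = survival_weighted([1.0, 2.0, 4.0])
# By construction S_hat(0) = (mu_hat/n) * sum(1/t_i) = 1 exactly.
```

Note that Ŝ_n is a genuine survival function at 0 by construction, since the normalizing constant µ̂ makes the weights sum to n/µ̂.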
Exercise: Identify assumptions under which ˆµ converges to µ in probability. Also prove the convergence result.
Density estimation for length-biased data. Following the same weighting procedure, a kernel estimator of the density function f can be derived as

  f̂_n(t) = n^{-1} µ̂ Σ_{i=1}^n { t_i^{-1} K_h(t − t_i) },

where K_h(x) = h^{-1} K(h^{-1} x), h > 0, with K a kernel function. Alternatively, one could first estimate the length-biased density p_T by an ordinary kernel estimator and then use the relationship between p_T and f to obtain an estimator of f.
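A sketch of the weighted kernel estimator with a Gaussian kernel follows; the function name and data are illustrative, and any kernel K could be substituted.

```python
import math

def f_hat_lb(t, ts, h):
    """Kernel density estimate of f from uncensored length-biased data ts:
    observation i receives weight mu_hat/(n * t_i); Gaussian kernel K_h."""
    n = len(ts)
    mu_hat = n / sum(1.0 / ti for ti in ts)
    gauss = lambda x: math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
    return (mu_hat / n) * sum(gauss((t - ti) / h) / (h * ti) for ti in ts)
```

Because the weights µ̂/(n t_i) sum to one, f̂_n integrates to one, so it is a proper density estimate.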
PHM for length-biased data. Under the proportional hazards model, for uncensored length-biased data, a risk-set sampling technique can be adopted for estimating the regression parameters. For t_j ≥ t_i, let Δ_j(t_i) be a binary variable which equals 1 with probability t_i / t_j and 0 with probability 1 − (t_i / t_j). The indicators Δ_j(t_i) are used to identify bias-adjusted risk sets and to construct pseudo-likelihood equations; the regression parameter estimates are then derived by solving the score equations. For censored length-biased data, statistical methods and theory have been further developed in the last few years for non- and semi-parametric models.
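The bias-adjusted risk set construction can be sketched as the following thinning step (function name and data are illustrative; only the Bernoulli retention rule comes from the text):

```python
import random

def adjusted_risk_set(i, ts, rng):
    """Bias-adjusted risk set at failure time t_i for uncensored
    length-biased data: subject j with t_j >= t_i is retained with
    probability t_i / t_j, via an independent Bernoulli draw.
    Subject i itself is always retained (probability t_i/t_i = 1)."""
    ti = ts[i]
    return [j for j, tj in enumerate(ts)
            if tj >= ti and rng.random() < ti / tj]

# Hypothetical failure times; subject 0 fails at t = 2.0.
risk = adjusted_risk_set(0, [2.0, 5.0, 1.0], random.Random(0))
```

The thinning removes the over-representation of long durations in the naive risk set, so the retained set behaves like a risk set from incident-cohort sampling.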
Prevalence = Incidence × Duration

Use Model II. Recall that λ(u) is the disease incidence rate at calendar time u, and S_u is the survival function of T^0 for those patients whose disease was initiated at u. The disease prevalence at calendar time τ is

  P(τ) = ∫_0^τ λ(u) S_u(τ − u) du.

When the equilibrium condition is satisfied, the incidence rate is a constant (λ(u) = λ_0) and the survival function is independent of u (S_u = S), so

  P(τ) = λ_0 ∫_0^τ S(τ − u) du = λ_0 ∫_0^τ S(u) du = λ_0 µ,

which is functionally independent of τ. Thus, writing P(τ) = p_0, we derive p_0 = λ_0 µ (Prevalence = Incidence × Duration).
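The identity p_0 = λ_0 µ can be checked numerically: simulate onsets as a Poisson process, give each case an iid duration, and count cases still alive at τ. All numerical values below are illustrative assumptions.

```python
import random

random.seed(4)

def prevalent_count(lam0, mu, tau):
    """One realization: onsets form a Poisson process of rate lam0 on
    [0, tau]; each case gets an independent Exp(mean mu) duration.
    Returns the number of cases still alive at calendar time tau."""
    count = 0
    u = random.expovariate(lam0)          # first onset time
    while u < tau:
        if u + random.expovariate(1.0 / mu) >= tau:
            count += 1
        u += random.expovariate(lam0)     # next onset
    return count

lam0, mu, tau, reps = 0.5, 2.0, 50.0, 4000
avg = sum(prevalent_count(lam0, mu, tau) for _ in range(reps)) / reps
# Theory: expected prevalent count ~ lam0 * mu = 1.0 once tau >> mu.
```

With λ_0 = 0.5 and µ = 2, the average prevalent count settles near λ_0 µ = 1, matching the equilibrium formula.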
Length bias: full and conditional likelihoods

Data: iid (w_1, t_1), ..., (w_n, t_n)
L_m: density function of {t_i}
L_c: density function of {w_i} conditioning on {t_i}

The full likelihood is

  L = L_m × L_c = { Π_{i=1}^n t_i f(t_i) / µ } × { Π_{i=1}^n t_i^{-1} }.

By the factorization theorem, the observed failure times {t_i} serve as sufficient statistics for f, whether f is a parametric or nonparametric function. Thus, conditioning on {t_i}, the truncation times {w_i} contain no additional information about f.
3.3 Left truncation

Length-biased data can be viewed as a special case of left-truncated data, since the conditional density of the observed t_i given w_i is

  f(t_i) I(t_i ≥ w_i) / S(w_i),   (3)

which is the density function of a left-truncated failure time. By viewing length-biased data as left-truncated data, we next consider how to analyze left-truncated data in a general setting. It is important to note that the validity of the truncated density in (3) depends only on the independent truncation assumption, (ii), and not on the constant incidence rate assumption, (i).
Notation
i = 1, 2, ..., n: index for subjects
(W_1, T_1), ..., (W_n, T_n): iid observations
R(t) = {(W_j, T_j) : W_j ≤ t ≤ T_j, j = 1, 2, ..., n}: risk set at t
Y_i(t) = I(W_i ≤ t ≤ T_i): at-risk indicator
Y(t) = Σ_{i=1}^n Y_i(t): total number of subjects at risk at t
N_i(t) = I(T_i ≤ t): counts subject i's failure event prior to time t
N(t) = Σ_{i=1}^n N_i(t): total number of observed events prior to time t
The conditional likelihood based on {(w_i, t_i)} given {w_i} can be written as L_c = Π_{i=1}^n { f(t_i) / S(w_i) }. Let t_(1) < ... < t_(J) be the distinct ordered values of t_1, ..., t_n and let w_(1), ..., w_(J) be the corresponding truncation times. By assigning probability mass to, and only to, {t_(j)}, we can write

  f(t_(j)) / S(w_(j)) = [S(t_(k_j+1)) / S(t_(k_j))] × [S(t_(k_j+2)) / S(t_(k_j+1))] × ... × [f(t_(j)) / S(t_(j))]
                      = (1 − λ(t_(k_j))) × (1 − λ(t_(k_j+1))) × ... × λ(t_(j)),   (4)

where t_(k_j) satisfies t_(k_j−1) < w_(j) ≤ t_(k_j). Note that (w_(j), t_(j)) falls in the risk sets at t_(k_j), t_(k_j+1), ..., and t_(j), and that the factorization in (4) contributes the factor λ(t_(j)) at t_(j) and a factor 1 − λ(t_(k)) whenever (w_(j), t_(j)) falls in the risk set at t_(k).
The risk set at t is R(t) = {(w_i, t_i) : w_i ≤ t ≤ t_i}; let dN(t_(j)) denote the number of failures at t_(j) and Y(t_(j)) the number of subjects in R(t_(j)). The conditional likelihood L_c can be reparameterized as

  L_c = Π_{j=1}^J [ f(t_(j)) / S(t_(j)) ]^{dN(t_(j))} [ S(t_(j+1)) / S(t_(j)) ]^{Y(t_(j)) − dN(t_(j))}
      = Π_{j=1}^J [ λ_(j) ]^{dN(t_(j))} [ 1 − λ_(j) ]^{Y(t_(j)) − dN(t_(j))}.
Thus, following an argument similar to that in Kaplan and Meier (1958), the unique mle of λ_(j) is dN(t_(j)) / Y(t_(j)), and the product-limit estimator

  Ŝ(t) = Π_{t_(j) < t} ( 1 − dN(t_(j)) / Y(t_(j)) )   (5)

is the unique nonparametric MLE maximizing L_c. It is also called the truncation product-limit estimator.

Exercise: Does the existence of the NPMLE require conditions?
Example

Data (w_i, t_i): (4, 5), (0, 4), (5, 7), (1, 2), (2, 8), (1, 5)

  t_(j)        2  4  5  7  8
  dN(t_(j))    1  1  2  1  1
  Y(t_(j))     4  4  4  2  1

R(t_(1)) = {(0, 4), (1, 2), (2, 8), (1, 5)}
R(t_(2)) = {(4, 5), (0, 4), (2, 8), (1, 5)}
...

The truncation product-limit estimates are thus

  Ŝ(2) = 1
  Ŝ(4) = (1 − 1/4) = 3/4
  Ŝ(5) = (1 − 1/4)(1 − 1/4) = 9/16
  Ŝ(7) = (1 − 1/4)(1 − 1/4)(1 − 2/4) = 9/32
  Ŝ(8) = (1 − 1/4)(1 − 1/4)(1 − 2/4)(1 − 1/2) = 9/64
  Ŝ(8+) = 0

Note: Unlike right-censored data, the risk sets for truncated data are usually not nested.
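The truncation product-limit computation above can be coded directly from (5); the function name is illustrative, and the data are the six pairs from the example.

```python
def truncation_product_limit(pairs, t):
    """Truncation product-limit estimate S_hat(t) for left-truncated,
    uncensored pairs (w_i, t_i): product over distinct failure times
    t_(j) < t of (1 - dN(t_(j)) / Y(t_(j))), where Y(t_(j)) counts
    subjects with w_i <= t_(j) <= t_i (the risk set)."""
    s = 1.0
    for tj in sorted({ti for _, ti in pairs}):
        if tj >= t:
            break
        d = sum(1 for _, ti in pairs if ti == tj)            # dN(t_(j))
        y = sum(1 for wi, ti in pairs if wi <= tj <= ti)     # Y(t_(j))
        s *= 1.0 - d / y
    return s

data = [(4, 5), (0, 4), (5, 7), (1, 2), (2, 8), (1, 5)]
# Reproduces the worked example, e.g. S_hat(8) = (3/4)(3/4)(2/4)(1/2) = 9/64.
```

Running it on the example data reproduces the hand computation above.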
3.4 Left Truncated and Right Censored Data

For i = 1, ..., n, assume the following independent censoring condition: conditional on the observed W_i = w_i, the time from recruitment τ_i to the failure event, R_i, is independent of the time from τ_i to censoring, D_i. This independent censoring condition does not, however, imply independence between the biased failure time T_i (= W_i + R_i) and the censoring time C_i (= W_i + D_i). Let x_i = min{t_i, c_i} be the observed time from the initiating event, and δ_i = I(x_i = t_i) the censoring indicator. Conditional on w_i, under the independent censoring condition, the density of (w_i, x_i, δ_i) is proportional to

  f(x_i)^{δ_i} S(x_i)^{1−δ_i} / S(w_i).
Notation
i = 1, 2, ..., n: index for subjects
(W_1, X_1, δ_1), ..., (W_n, X_n, δ_n): iid observations
R(t) = {(W_j, X_j, δ_j) : W_j ≤ t ≤ X_j, j = 1, 2, ..., n}: risk set at t
Y_i(t) = I(W_i ≤ t ≤ X_i): at-risk indicator
Y(t) = Σ_{i=1}^n Y_i(t): total number of subjects at risk at t
N_i(t) = I(X_i ≤ t, δ_i = 1): counts subject i's observed failure event prior to time t
N(t) = Σ_{i=1}^n N_i(t): total number of observed events prior to time t
Given a sample of iid observations (w_1, x_1, δ_1), ..., (w_n, x_n, δ_n), the full likelihood is L = L_c × L_m, where L_c is the conditional likelihood corresponding to the density of {(w, x, δ)} given {w}, and L_m is the marginal likelihood of {w}. The full likelihood can be expressed as

  L = L_c(S, S_D) × L_m(S, G)
    ∝ { Π_{i=1}^n [ f(x_i)^{δ_i} S(x_i)^{1−δ_i} / S(w_i) ] f_{D|W}(x_i − w_i | w_i)^{1−δ_i} S_{D|W}(x_i − w_i | w_i)^{δ_i} } × { Π_{i=1}^n S(w_i) g(w_i) / ∫ S(w) g(w) dw }.

Dropping the factors involving only the censoring distribution,

  L ∝ { Π_{i=1}^n f(x_i)^{δ_i} S(x_i)^{1−δ_i} / S(w_i) } × { Π_{i=1}^n S(w_i) g(w_i) / ∫ S(w) g(w) dw }.
Thus

  L_c ∝ Π_{i=1}^n { f(x_i)^{δ_i} S(x_i)^{1−δ_i} / S(w_i) },
  L_m = Π_{i=1}^n { S(w_i) g(w_i) / ∫ S(w) g(w) dw }.

Statistical approaches based on the conditional likelihood L_c are considered methods for left-truncated and right-censored data.
Estimation of S: Using techniques similar to those above, the product-limit estimator

  Ŝ(t) = Π_{x_(j) < t} ( 1 − dN(x_(j)) / Y(x_(j)) )   (6)

can be derived as the unique nonparametric MLE from the conditional likelihood L_c. This is also called the truncation product-limit estimator.
Estimation of G: By assigning probability to, and only to, {w_i}, the maximum value of the marginal likelihood L_m(S, G) equals the maximum value of a multinomial likelihood, for which the unique maximum likelihood estimator is an empirical-type distribution. With S fixed at Ŝ (the conditional MLE), the maximum value of L_m(Ŝ, G) is achieved by the inverse-weighting estimator

  Ĝ(w; Ŝ) = [ Σ_{i=1}^n Ŝ(w_i)^{-1} I(w_i ≤ w) ] / [ Σ_{i=1}^n Ŝ(w_i)^{-1} ].

The NPMLEs from the full likelihood L can then be derived as (Ŝ, Ĝ(·; Ŝ)).
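The inverse-weighting estimator Ĝ has a direct implementation; the function name and the test data are illustrative. Since a truncation time w is observed only with probability S(w), each w_i is upweighted by 1/Ŝ(w_i) to undo the over-representation of small truncation times.

```python
def G_hat(w, ws, S_hat):
    """Inverse-weighting estimator of the truncation-time distribution G:
    weight each observed w_i by 1 / S_hat(w_i), then normalize.
    ws is the list of observed truncation times; S_hat is any estimate
    of the survival function S (e.g. the truncation product-limit one)."""
    weights = [1.0 / S_hat(wi) for wi in ws]
    num = sum(wt for wt, wi in zip(weights, ws) if wi <= w)
    return num / sum(weights)
```

When Ŝ ≡ 1 (no truncation effect), Ĝ reduces to the ordinary empirical distribution function of the w_i, as it should.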
Risk-set-based methods. For left-truncated and right-censored data, we can use the revised risk sets to derive:
- a modified Greenwood's formula, which still holds for estimating the asymptotic variance of the truncation product-limit estimator;
- a modified partial likelihood method, which still holds for estimating β in the proportional hazards model;
- modified log-rank tests, which still hold for testing the difference between two groups.
Remarks:
1. Essentially, censoring and truncation share significant similarities in statistical analysis, especially in risk-set-based methods.
2. Nevertheless, there exist significant differences that are not all emphasized in this course. In general, censoring is a nuisance distribution, but the truncation distribution can be important in itself (e.g., the HIV infection distribution over calendar time) or can carry information about the failure time distribution.
Example. For length-biased failure time data (without censoring), the weighted estimator of S, Ŝ_n(t) = n^{-1} µ̂ Σ_{i=1}^n { t_i^{-1} I(t_i ≥ t) }, is much more efficient than the product-limit estimator.

Example. Knowing the truncation distribution completely or partially can substantially improve estimation of the distribution of T^0. Note that for censored survival data, knowledge of the censoring distribution does not improve estimation of the survival distribution. Some on-going research...
Exercise. Suppose a bivariate random vector (T, Y) has the joint pdf

  p_{T,Y}(t, y) = f(y) I(y ≥ t ≥ 0) / µ,

where f is a continuous pdf for a positive-valued random variable and µ = ∫ y f(y) dy (the mean based on f). Assume the moments based on f are finite up to the 4th order. Suppose (T_1, Y_1), ..., (T_n, Y_n) are iid random vectors with the same pdf as (T, Y).

Part I.
(a) Show that the marginal pdf of Y is y f(y) / µ (the length-biased distribution) and show that the pdf of Y conditional on T = t is f(y) I(y ≥ t) / (1 − F(t)).
(b) Find the expected value of 1/Y. Is E[Y] less than, equal to, or greater than µ?
(c) Does n^{-1} Σ_{i=1}^n (1/Y_i) converge to µ^{-1} in probability? Explain.
(d) Derive an asymptotically normally distributed estimator of µ and describe the estimator's asymptotic distribution.
Part II. Now assume f is the Gamma(γ, θ) density function with known γ > 0 and unknown θ > 0, where f(y) = θ^γ y^{γ−1} e^{−θy} / Γ(γ).
(e) Show that (Y_1, ..., Y_n) is a sufficient statistic for this family of distributions {p_{T,Y}}. Does your conclusion depend on the choice of the Gamma distribution? (That is, does your conclusion hold if the Gamma distribution model is replaced by a different distribution model?)
(f) Based on (t_1, y_1), ..., (t_n, y_n), find the mle of θ. Find the asymptotic distribution of the mle of θ.
(g) What is the mle of µ?
Part III. In this part assume f is the pdf of an exponential distribution: f(y; θ) = θ e^{−θy} I(y > 0), θ > 0.
(h) Consider the full likelihood L, based on the pdf of the observed {(t_1, y_1), ..., (t_n, y_n)}. Calculate the Fisher information for θ based on L.
(i) Consider the conditional likelihood L_c, based on the pdf of the observed {(t_1, y_1), ..., (t_n, y_n)} given the observed {t_1, ..., t_n}. Derive the MLE of θ from L_c and calculate the Fisher information for θ based on L_c.
(j) Compare the Fisher information for θ derived from L and from L_c, and discuss your result.
Remarks:
1. For the Gamma(γ, θ) distribution, the mean is µ = γ/θ and the variance is γ/θ^2.
2. If n is a positive integer, then the Gamma function satisfies Γ(n) = (n − 1)!.