Quantile Regression for Doubly Censored Data with Applications to Cystic Fibrosis Studies. Technical Report July 1, 2010


by Shuang Ji (1), Limin Peng (1), Yu Cheng (2), and HuiChuan Lai (3)

Technical Report 1-3, July 1, 2010

(1) Department of Biostatistics and Bioinformatics, Rollins School of Public Health, Emory University, 1518 Clifton Road, N.E., Atlanta, Georgia 3322, U.S.A.
(2) Department of Statistics, University of Pittsburgh, Pittsburgh, PA 1526, U.S.A.
(3) Departments of Nutritional Sciences and Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, WI 5376, U.S.A.

Telephone: (44) FAX: (44) lpeng@emory.edu

Abstract

Quantile regression is known for its flexibility in accommodating varying covariate effects and has attracted growing interest in survival analysis. Motivated by the work of Peng and Huang (2008) on quantile regression with randomly right censored data, we develop a quantile regression method tailored to a double censoring setting that is often encountered in registry studies, utilizing the embedded martingale structure. The proposed estimation and inference procedures are computationally simple and stable. We establish the uniform consistency and weak convergence of the resulting estimators. We also provide a sensible solution to the identifiability issues for regression quantiles at both tails, a feature unique to doubly censored data. The finite-sample performance of our approach is assessed by a series of simulation studies. An application to registry data on cystic fibrosis illustrates the practical utility of our method.

Key Words: Cystic Fibrosis; Double censoring; Empirical process; Martingale; Pseudomonas aeruginosa; Regression quantile; Varying coefficient.

1 Introduction

Double censoring occurs when the measurement of interest is subject to both left censoring and right censoring. It is often encountered in biomedical research. In

the Cystic Fibrosis Foundation Patient Registry (CFFPR) study, the onset time of the first Pseudomonas aeruginosa (PA) infection in CF patients is of interest. This event time is subject to double censoring since patients may enter the study with PA infection having occurred already, and the follow-up may be terminated before any PA infection is observed. In this study, the left censoring times, which correspond to ages at the first visit, are always recorded for each patient. However, this may not be true for the right censoring times because of random dropouts. Such doubly censored data, which can also arise in many other situations, are the focus of this paper.

Doubly censored data are more challenging to handle than right censored data. Turnbull (1974) studied the one-sample problem using self-consistent estimators. Gehan (1965) studied the two-sample problem by an extension of the Wilcoxon test. Ren (2008) proposed a weighted empirical likelihood-based semiparametric MLE as a unified approach for the two-sample problem with various censoring schemes including double censoring. In this paper we are interested in the regression setting. Among existing work on regression models for doubly censored data, Zhang and Li (1996) proposed the elegant Buckley-James-Ritov-type M-estimator. However, computational issues may arise because the estimating equation is neither monotone nor continuous. Ren and Gu (1997) and Ren (2003) proposed a parallel regression M-estimator. This approach requires the marginal independence of the survival time and the censoring time, which may be restrictive in practice. Assuming both the left censoring time and the right censoring time are always observed, Cai and Cheng (2004) adopted semiparametric transformation models, which can be viewed as a generalization of the proportional hazards model. Yan et al. (2009) adapted temporal process regression (Fine, Yan and Kosorok, 2004) to doubly censored data. These methods may have limited application to the CFFPR data.
We plan to relax the assumptions mentioned above and propose a methodology tailored to the CFFPR data. We adopt a quantile regression framework to tackle the double censoring problem. Quantile regression (Koenker and Bassett 1978) has served as a popular extension of the classical linear regression model for uncensored data. It is well known for its flexibility in allowing for varying covariate effects and for its robustness as a non-parametric method. These features are especially useful when there is a changing pattern of covariate effects or when heteroscedasticity exists in the data (Peng and Huang, 2008; Portnoy, 2003). Therefore quantile regression has attracted increased attention in survival analysis. For censored data, the quantile regression model can be viewed as a generalization of the traditional AFT model.

Most existing work on censored quantile regression has focused on right censored data. Among the earliest breakthroughs, Powell (1984, 1986) extended the least absolute deviation (LAD) method from traditional quantile regression to censored quantile regression, assuming the censoring variables are fixed or always observable. Subsequent efforts have been made to accommodate random censoring. Ying et al. (1995) modified LAD for median regression models and adopted a semiparametric procedure. Honore, Khan and Powell (2002) also adapted the LAD idea to random censoring. Approaches of this type share the assumptions of covariate-independent censoring and unconditional independence between the survival time and the censoring variables. Methods have also been developed under the relatively weak assumption of conditional independence between the survival time and the censoring time given the covariates. Yang (1999) proposed estimators based on weighted empirical survival and hazard functions, which, however, may not be suitable for more general heteroscedastic situations. Without imposing restrictions on the error term, Portnoy (2003) proposed a recursively reweighted estimator as a generalization of the Kaplan-Meier estimator. Because of its recursive nature, there are complications in both the computation and the asymptotic results. The asymptotic properties of a gridded version of the estimation procedure are studied in Neocleous, Vanden Branden and Portnoy (2006) and Portnoy and Lin (2010). Peng and Huang (2008) proposed an alternative estimating equation motivated by the martingale structure of right censored data. Their approach is well justified in theory and convenient to implement.

Motivated by Peng and Huang (2008), we propose a quantile regression model for doubly censored data, assuming conditionally independent censoring with known left censoring time and unknown right censoring time. By utilizing the martingale structure associated with doubly censored data, we propose a monotone estimating equation for the regression parameters and adopt a grid-based estimation procedure. It can be shown that our estimation procedure is equivalent to locating the minima of a sequence of $L_1$-type convex functions with unique solutions. This procedure can be reliably implemented with existing functions in R and S-PLUS.
In addition, we provide remedies for the potential identifiability issue, which is common and important for censored data, as discussed in Peng and Huang (2008). For doubly censored data the issue is more challenging, because both the lower and the upper tails of the event time distribution may suffer from non-identifiability. The difficulty cannot be bypassed solely by restricting the range of $\tau$. To solve this issue, we propose a conditional version of the quantile regression model that is appropriate when the lower tail is not identifiable. We also provide rigorous justifications for all our proposals and establish asymptotic properties of the estimated regression quantiles by using empirical process and stochastic integral techniques.

The rest of the article is organized as follows. In section 2 we present the unconditional model, the estimation procedure, asymptotic results and the inference procedure. In section 3 we present the conditional model and its associated properties. In section 4 we briefly describe an extension of our models. In section 5 we report simulation results, which show good finite-sample performance. In

section 6 we illustrate our proposed method through the CFFPR data analysis. We conclude the article with a discussion in section 7.

2 The Unconditional Quantile Regression Model

2.1 Data and Model

Let $T$ be the event time, $L$ the left censoring time, $U$ the right censoring time, and $Z = (1, \tilde{Z}^{\top})^{\top}$ a $(p+1) \times 1$ covariate vector. Define $X = \max\{L, \min(T, U)\}$, the at-risk indicator $R(t) = I(L \le t \le X)$, and the counting process $N(t) = I(X \le t, \delta = 1)$ (Fleming and Harrington 1991), where $\delta$ is the censoring indicator defined as

$$\delta = \begin{cases} 1, & \text{if } L \le T \le U; \\ 2, & \text{if } T < L; \\ 3, & \text{if } T > U. \end{cases}$$

We assume that $(L, U) \perp T \mid Z$. In addition, we denote the conditional CDF of $T$ by $F_T(t \mid Z) = \Pr(T \le t \mid Z)$, and the conditional cumulative hazard function of $T$ by $\Lambda_T(t \mid Z) = -\log\{1 - F_T(t \mid Z)\}$. Let $T_i$, $L_i$, $U_i$, $X_i$, $\delta_i$, $R_i(t)$ and $N_i(t)$ be sample analogs. The observed data consist of $n$ iid replicates of $(X, \delta, Z, L)$, denoted by $(X_i, \delta_i, Z_i, L_i)$, $i = 1, \ldots, n$.

Define $Q_T(\tau \mid Z) = \inf\{t : F_T(t \mid Z) \ge \tau\}$, the conditional $\tau$th quantile of $T$ given $Z$. The quantile regression model takes the form

$$Q_T(\tau \mid Z) = \exp\{Z^{\top}\beta(\tau)\}, \quad \tau \in (0, 1), \tag{1}$$

where $\beta(\tau)$ is a vector of unknown coefficients representing covariate effects on $Q_T(\tau \mid Z)$. By formulating the coefficients as functions of $\tau$, we allow the covariate effects to vary across quantiles of $T$. This model reduces to the classic AFT model when all covariate effects are assumed to be constant over $\tau$ and the intercept is the $\tau$th quantile of a random variate. Model (1) has been studied by Peng and Huang (2008) for data with random right censoring.

2.2 Estimation Procedure

We propose an estimating equation by utilizing properties of the martingale associated with the observed data. Define $M(t) = N(t) - \int_0^t R(u)\, d\Lambda_T(u \mid Z)$ and let $M_i(t)$ be the sample analog. For right censored data (Fleming and Harrington 1991) it has been established that, under the conditional independent censoring assumption, $M_i(t)$ is a martingale.
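The data structure of Section 2.1 can be sketched in a few lines of Python. This is a minimal illustration only: the function names `observe`, `at_risk` and `counting` are hypothetical, and the inequalities follow the definitions $X = \max\{L, \min(T, U)\}$, $R(t) = I(L \le t \le X)$ and $N(t) = I(X \le t, \delta = 1)$ given above.

```python
def observe(t, l, u):
    """Return (x, delta) for one subject with event time t and censoring times l <= u.

    delta = 1: event observed exactly (l <= t <= u)
    delta = 2: left censored  (t < l), so x = l
    delta = 3: right censored (t > u), so x = u
    """
    if l > u:
        raise ValueError("left censoring time must not exceed right censoring time")
    x = max(l, min(t, u))
    if t < l:
        delta = 2
    elif t > u:
        delta = 3
    else:
        delta = 1
    return x, delta


def at_risk(time, l, x):
    """At-risk indicator R(time) = I(l <= time <= x)."""
    return 1 if l <= time <= x else 0


def counting(time, x, delta):
    """Counting process N(time) = I(x <= time, delta = 1)."""
    return 1 if (x <= time and delta == 1) else 0
```

For example, a subject with $T = 0.5$, $L = 1$ and $U = 5$ is left censored, so only $X = L = 1$ and $\delta = 2$ are recorded, and the subject never contributes a jump to $N(t)$.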
For doubly censored data, however, the at-risk indicator $R(\cdot)$ is different. Nonetheless, by adopting a strategy similar to that for right censored

data we can still show that $M_i(t)$ is a martingale with respect to the filtration $\sigma\{N_i(u), R_i(u+) : i = 1, \ldots, n;\ u \le t\}$, assuming $(L_i, U_i) \perp T_i \mid Z_i$. The detailed proof is provided in the Appendix. It naturally follows that

$$E\{M_i(t) \mid Z_i\} = 0, \quad t \ge 0. \tag{2}$$

Note that, by adopting a variable transformation inside the integral, we have

$$M_i(t) = N_i(t) - \int_0^{t} I\{L_i \le u \le X_i\}\, \lambda_T(u \mid Z_i)\, du = N_i(t) - \int_0^{\tau} I\{L_i \le Q_T(v \mid Z_i) \le X_i\}\, dH(v),$$

where $\tau = F_T(t \mid Z_i)$ and $H(v) = -\log(1 - v)$ for $0 \le v < 1$. When evaluated at $t = Q_T(\tau \mid Z_i) = \exp\{Z_i^{\top}\beta_0(\tau)\}$, where $\beta_0(\cdot)$ is the true parameter vector, $M_i(t)$ becomes

$$M_i[\exp\{Z_i^{\top}\beta_0(\tau)\}] = N_i[\exp\{Z_i^{\top}\beta_0(\tau)\}] - \int_0^{\tau} I[L_i \le \exp\{Z_i^{\top}\beta_0(v)\} \le X_i]\, dH(v).$$

The above equality and (2) naturally lead to our proposed estimating equation

$$n^{1/2} S_n(\beta, \tau) = 0, \tag{3}$$

where

$$S_n(\beta, \tau) = \frac{1}{n} \sum_{i=1}^{n} Z_i \Big( N_i[\exp\{Z_i^{\top}\beta(\tau)\}] - \int_0^{\tau} I[L_i \le \exp\{Z_i^{\top}\beta(v)\} \le X_i]\, dH(v) \Big).$$

It is implied by (2) that $E\{S_n(\beta_0, \tau)\} = 0$. As suggested by the stochastic integral representation of $S_n(\beta, \tau)$, a grid-based estimation procedure is adopted. Define $\hat\beta(\tau)$ as a right-continuous step function that jumps only on a grid

$$\mathcal{G}_{L_n} = \{0 = \tau_0 < \tau_1 < \cdots < \tau_{L_n} = \tau_U < 1\}.$$

Here we confine the range of $\tau$ to $[0, \tau_U]$ to accommodate potential non-identifiability in the upper tail due to right censoring. Based on (3), we propose obtaining $\hat\beta(\tau_j)$, $j = 1, \ldots, L_n$, by sequentially solving for $\beta(\tau_j)$ in the equation

$$n^{-1/2} \sum_{i=1}^{n} Z_i \Big( N_i[\exp\{Z_i^{\top}\beta(\tau_j)\}] - \sum_{k=0}^{j-1} I[L_i \le \exp\{Z_i^{\top}\hat\beta(\tau_k)\} \le X_i]\, \{H(\tau_{k+1}) - H(\tau_k)\} \Big) = 0. \tag{4}$$

By the definition of $Q_T(\cdot \mid Z)$ and model (1), $\exp\{Z^{\top}\beta_0(0)\} = 0$. Therefore we always set $\exp\{Z^{\top}\hat\beta(0)\} = 0$. The estimators $\hat\beta(\tau_j)$ ($j = 1, \ldots, L_n$) are defined as

generalized solutions (Fygenson and Ritov 1994), since the discontinuity of the estimating function in (4) may preclude exact solutions. Note that (4) is a monotone estimating equation, and its left-hand side equals, up to a positive constant factor, the gradient of the following function:

$$l_j(h) = \sum_{i=1}^{n} \big| I(\delta_i = 1) \log X_i - h^{\top} I(\delta_i = 1) Z_i \big| + \Big| R - h^{\top} \sum_{l=1}^{n} \{ -I(\delta_l = 1) Z_l \} \Big| + \Big| R - h^{\top} \sum_{r=1}^{n} \Big( 2 Z_r \sum_{k=0}^{j-1} I[L_r \le \exp\{Z_r^{\top}\hat\beta(\tau_k)\} \le X_r]\, \{H(\tau_{k+1}) - H(\tau_k)\} \Big) \Big|, \tag{5}$$

where $R$ is a very large number. Therefore $\hat\beta(\tau_j)$ can be obtained as the minimizer of the $L_1$-type function $l_j(h)$. The minimization can be performed using the Barrodale-Roberts algorithm (Barrodale and Roberts 1974), which is implemented in standard statistical software including S-PLUS and R.

2.3 Asymptotic Results

The uniform consistency and weak convergence of our proposed estimators can be established by utilizing the stochastic integral representation of the estimation procedure, along the lines of Peng and Huang (2008). Under regularity conditions C1-C4 (stated in the Appendix), the following theorems hold.

Theorem 1. Assuming conditions C1-C4 hold and $\lim_{n \to \infty} \|\mathcal{S}_{L_n}\| = 0$, then uniform consistency holds: $\sup_{\tau \in [\nu, \tau_U]} \|\hat\beta(\tau) - \beta_0(\tau)\| \to_p 0$, where $0 < \nu < \tau_U$.

Theorem 2. Assuming conditions C1-C4 hold and $\lim_{n \to \infty} n^{1/2} \|\mathcal{S}_{L_n}\| = 0$, then $n^{1/2}\{\hat\beta(\tau) - \beta_0(\tau)\}$ converges weakly to a Gaussian process for $\tau \in [\nu, \tau_U]$, where $0 < \nu < \tau_U$.

We need to point out that the regularity condition C4 (see Appendix B) corresponds to the identifiability of $\beta_0(\tau)$, $\tau \in (0, \tau_U]$. In the simple one-sample case, this condition is equivalent to

$$f_T[\exp\{\beta_0(\tau)\}]\, \Pr[L < \exp\{\beta_0(\tau)\} \le U] > 0, \quad \tau \in [\nu, \tau_U],\ \nu \in (0, \tau_U].$$

Assuming $f_T(\cdot)$ is bounded away from 0, the above condition reduces to

$$\Pr\{L < Q_T(\tau) \le U\} > 0, \quad \tau \in [\nu, \tau_U],\ \nu \in (0, \tau_U],$$

which implies $\tau_U < F_T(U^{+})$ and $\nu > F_T(L^{-})$, with $U^{+}$ and $L^{-}$ representing the upper bound of the support of $U$ and the lower bound of the support of $L$, respectively. Since $\nu > 0$ is arbitrary, we need

$L^{-} \le T^{-}$, where $T^{-}$ is the lower bound of the support of $T$. For the regression setting, however, we only have implicit conditions on $L$ and $\tau_U$ to guarantee identifiability.

The proof of Theorem 1 can be easily adapted from Peng and Huang (2008), since the two essential elements of their proof are preserved in our case: the estimating function has expectation 0 when evaluated at the true parameter, and the boundary condition at $\tau = 0$, under which $\exp\{Z^{\top}\hat\beta(0)\}$ is set to 0, still applies. Therefore we omit the proof in this paper. The proof of Theorem 2 is sketched in the Appendix.

2.4 Inferences

The estimation of the covariance matrix of $\hat\beta(\tau)$ is complicated by the unknown density function of $X$. Hence we adopt the resampling approach proposed by Peng and Huang (2008), which generalizes the minimand perturbing technique (Jin et al. 2001). Specifically, the objective function (5) is perturbed by $\xi_1, \ldots, \xi_n$, a set of i.i.d. variates from a known nonnegative distribution with mean 1 and variance 1, for example, Exp(1). The resulting objective function is

$$l_j^{*}(h) = \sum_{i=1}^{n} \big| \xi_i I(\delta_i = 1) \log X_i - h^{\top} \xi_i I(\delta_i = 1) Z_i \big| + \Big| R - h^{\top} \sum_{l=1}^{n} \{ -\xi_l I(\delta_l = 1) Z_l \} \Big| + \Big| R - h^{\top} \sum_{r=1}^{n} \Big( 2 \xi_r Z_r \sum_{k=0}^{j-1} I[L_r \le \exp\{Z_r^{\top}\hat\beta(\tau_k)\} \le X_r]\, \{H(\tau_{k+1}) - H(\tau_k)\} \Big) \Big|, \tag{6}$$

$j = 1, \ldots, L_n$. The minimizer $\beta^{*}(\tau_j)$ of the perturbed function can be located sequentially, with the boundary value $\exp\{Z^{\top}\beta^{*}(0)\}$ set to 0. For a fixed $\tau$ we can approximate the variance of $\hat\beta(\tau)$ by repeatedly generating the variate sets $\{\xi_{k1}, \ldots, \xi_{kn}\}_{k=1}^{B}$ and obtaining the corresponding $\{\beta_k^{*}(\tau)\}_{k=1}^{B}$. A confidence interval for $\beta(\tau)$ can then be constructed using a normal approximation.

We can also carry out hypothesis testing and second-stage inferences similar to those proposed in Peng and Huang (2008). In particular, we are interested in testing (a) whether the effect of a specific covariate $Z_q$ ($2 \le q \le p+1$) is significant over a range $\tau \in [l, u]$, and (b) whether the effect of $Z_q$ is constant over $\tau \in [l, u]$.
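To make the grid-based procedure and the perturbation resampling concrete, the following Python sketch treats the one-sample case only ($Z_i = 1$, so $\exp\{\beta(\tau)\}$ is simply the $\tau$th quantile). It is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the sum over $k$ starts at the boundary $\tau_0 = 0$ with quantile 0, and the generalized solution of the monotone equation is found by a direct search over the ordered uncensored observations, which in the one-sample case plays the role of the $L_1$ minimization.

```python
import math
import random


def H(v):
    """H(v) = -log(1 - v)."""
    return -math.log(1.0 - v)


def estimate_quantiles(x, delta, left, grid, weights=None):
    """One-sample grid procedure; returns [q_hat(tau_1), ..., q_hat(tau_m)].

    grid is increasing with grid[0] > 0; tau_0 = 0 and q_hat(tau_0) = 0 are implicit.
    weights are optional perturbation variates (default 1 for every subject).
    Entries are None once no generalized solution exists (upper tail not identifiable).
    """
    n = len(x)
    w = weights if weights is not None else [1.0] * n
    # uncensored observations in increasing order, each with its weight
    events = sorted((x[i], w[i]) for i in range(n) if delta[i] == 1)
    q_prev, tau_prev, target = 0.0, 0.0, 0.0
    out = []
    for tau in grid:
        if q_prev is not None:
            # weighted size of the risk set {i : L_i <= q_prev <= X_i}
            risk = sum(w[i] for i in range(n) if left[i] <= q_prev <= x[i])
            target += risk * (H(tau) - H(tau_prev))
            # generalized solution: first event time at which the weighted
            # count of uncensored observations reaches the target
            cum, q_new = 0.0, None
            for time, wt in events:
                cum += wt
                if cum >= target:
                    q_new = time
                    break
            q_prev = q_new
        out.append(q_prev)
        tau_prev = tau
    return out


def perturbation_sd(x, delta, left, grid, n_resamples, rng):
    """SD of log q*(tau) at the last grid point over Exp(1)-perturbed fits."""
    logs = []
    for _ in range(n_resamples):
        xi = [rng.expovariate(1.0) for _ in x]
        q = estimate_quantiles(x, delta, left, grid, weights=xi)[-1]
        if q is not None:
            logs.append(math.log(q))
    if len(logs) < 2:
        raise ValueError("too few perturbed fits with a solution")
    mean = sum(logs) / len(logs)
    return math.sqrt(sum((v - mean) ** 2 for v in logs) / (len(logs) - 1))
```

With four uncensored observations 1, 2, 3, 4, all left censoring times equal to 0, and the coarse grid (0.25, 0.5), the procedure returns 2 and 3; a finer grid brings the estimates closer to the usual sample quantiles. A Wald interval for $\beta(\tau) = \log Q_T(\tau)$ would then be the point estimate plus or minus a normal critical value times the value returned by `perturbation_sd`.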
Consider the general hypothesis $H_0 : g\{\beta(\tau)\} = r_0(\tau)$, $\tau \in [l, u]$, where the operator function $g(\cdot)$ is often linear and $r_0(\tau)$ is a hypothesized value. Test (a) is a special case of testing $H_0$ with $g(x) = x_q$ and $r_0(\tau) = 0$ for $\tau \in [l, u]$. An integral test statistic may be adopted, namely $\Gamma = n^{1/2} \int_l^u [g\{\hat\beta(v)\} - r_0(v)]\, \Theta(v)\, dv$. It can be shown that $\Gamma$ yields a consistent

test under certain conditions, where $\Theta(\cdot)$ is a nonnegative weight function. The distribution of $\Gamma$ under $H_0$ can be approximated using the conditional distribution of the resampling-based test statistic $\Gamma^{*} = n^{1/2} \int_l^u [g\{\beta^{*}(v)\} - g\{\hat\beta(v)\}]\, \Theta(v)\, dv$.

The constancy test (b) essentially examines $\tilde{H}_0 : g\{\beta(\tau)\} = \eta_0$, $\tau \in [l, u]$, where $g(\cdot)$ is the same as in (a) and $\eta_0$ is an unspecified constant. We may set $\eta_0$ to be the average effect $\rho_0 = \int_l^u g\{\beta_0(v)\}\, dv / (u - l)$, and $\hat\rho = \int_l^u g\{\hat\beta(v)\}\, dv / (u - l)$ is a consistent estimator of $\rho_0$. Inferences on $\rho_0$ can be developed using the resampling-based realizations $\rho^{*}$. To test $\tilde{H}_0$ we adopt the test statistic $\tilde\Gamma = n^{1/2} \int_l^u [g\{\hat\beta(v)\} - \hat\rho]\, \tilde\Theta(v)\, dv$, where $\tilde\Theta(\cdot)$ is a nonconstant weight function. The distribution of $\tilde\Gamma$ under $\tilde{H}_0$ is equivalent to the conditional distribution of $\tilde\Gamma^{*} = n^{1/2} \int_l^u ([g\{\beta^{*}(v)\} - g\{\hat\beta(v)\}] - [r^{*}(v) - \hat{r}(v)])\, \tilde\Theta(v)\, dv$ given the observed data. We may reject $H_0$ ($\tilde{H}_0$) at level $\alpha$ if the value of $\Gamma$ ($\tilde\Gamma$) is extreme compared to its null distribution, i.e., greater than the $(1 - \alpha/2)$th quantile or less than the $(\alpha/2)$th quantile of $\Gamma^{*}$ ($\tilde\Gamma^{*}$).

3 The Conditional Quantile Regression Model

In the previous section the identifiability issue in the upper tail was dealt with, while the lower tail of $T$ was assumed to be identifiable. However, the latter identifiability may not always be guaranteed. For the simple one-sample case, we have established the condition for the identifiability of the lower tail, namely that the lower bound of the support of $L$ is no greater than the lower bound of the support of $T$. In practice this condition can be violated, which introduces non-identifiability. For the regression setting we also have an implicit condition, which may not always hold either. When the lower tail is not identifiable, we cannot simply restrict the range of $\tau$ of interest to $[\tau_L, \tau_U]$ where $\tau_L > 0$. The reason is that, as with left-truncated data (Tsai et al. 1987), identification of $\beta(\tau)$ is precluded in this case.
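The integral test of Section 2.4 can be sketched numerically as follows. This is a hedged illustration: the function names are hypothetical, $g\{\hat\beta(\cdot)\}$ is assumed to be available as a step function on a grid spanning $[l, u]$, the integral is approximated by a left-endpoint sum over that grid, and the empirical quantiles of the resampled statistics use a simple order-statistic rule.

```python
import math


def gamma_stat(n, g_path, r_path, grid, weight):
    """sqrt(n) * sum_k {g_path[k] - r_path[k]} * weight(grid[k]) * (grid[k+1] - grid[k]).

    g_path[k] and r_path[k] are the values on [grid[k], grid[k+1]); the last
    entry of each path is not used by the left-endpoint rule.
    """
    total = 0.0
    for k in range(len(grid) - 1):
        total += (g_path[k] - r_path[k]) * weight(grid[k]) * (grid[k + 1] - grid[k])
    return math.sqrt(n) * total


def resampling_test(n, g_hat, r0, g_star_paths, grid, weight, alpha=0.05):
    """Return (Gamma, reject), comparing Gamma with quantiles of Gamma*.

    Gamma* for each resample replaces g_hat by the perturbed path and r0 by g_hat.
    """
    gamma = gamma_stat(n, g_hat, r0, grid, weight)
    null = sorted(gamma_stat(n, gs, g_hat, grid, weight) for gs in g_star_paths)
    lo = null[int(math.floor((alpha / 2.0) * (len(null) - 1)))]
    hi = null[int(math.ceil((1.0 - alpha / 2.0) * (len(null) - 1)))]
    return gamma, (gamma < lo or gamma > hi)
```

For the constancy test, the same `gamma_stat` can be reused with `r_path` set to the constant path $\hat\rho$ and a nonconstant weight such as an indicator of the lower half of $[l, u]$.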
Motivated by the conditional estimator of the survival function proposed by Tsai et al. (1987), we propose a conditional version of the quantile regression model that formulates the covariate effects on conditional quantiles of $T$ given $T > t_0$. In essence we impose an artificial left truncation time $t_0$, a relatively early time point, to circumvent the difficulty in identifying the lower tail. The choice of $t_0$ depends on two aspects: (a) the mathematical condition for identifiability, given in the Appendix; and (b) scientific interest and logic. Let $T = T^{*} + t_0$; the conditional model takes the form

$$Q_T(\tau \mid Z, T > t_0) = \exp\{Z^{\top}\alpha(\tau)\} + t_0, \quad \tau \in (0, 1). \tag{7}$$

Here $Q_T(\tau \mid Z, T > t_0)$ is defined as $\inf\{t : F_T(t \mid Z, T > t_0) \ge \tau\}$, the conditional $\tau$th quantile of $T$ given $T > t_0$ and $Z$. Accordingly the coefficient vector $\alpha(\tau)$

corresponds to covariate effects on the conditional quantiles of $T$ given $T > t_0$ rather than on the marginal quantiles.

Model (7) necessitates estimation procedures different from those for the unconditional model. The estimating equation (3) cannot be borrowed without modification. First of all, we need to redefine the counting process and the risk set by setting $\check{N}(t) = I(t_0 < X \le t, \delta = 1)$ and $\check{R}(t) = I(t_0 \vee L \le t \le X)$ in order to reflect the condition $T > t_0$. What follows naturally is the modified process

$$\check{M}(t) = \check{N}(t) - \int_0^t \check{R}(u)\, d\Lambda_T(u \mid Z, T > t_0) = \check{N}(t) - \int_0^t I(L \le u \le X)\, d\Lambda_T(u \mid Z, T > t_0).$$

Note that, because of the complications that arise from the combination of left censoring and artificial left truncation, it is not straightforward to verify that $\check{M}(t)$ is a martingale. However, we can still show that $E\{\check{M}(t) \mid Z\} = 0$ by utilizing the fact that $\check{N}(t) = N(t) - N(t_0)$ for $t > t_0$ together with the established properties of $N(t)$. Specifically, we have shown that $E\{N(t) \mid Z\} = E\{\int_0^t I(L < u \le X)\, \lambda(u \mid Z)\, du \mid Z\}$, where the integral can be further expressed as $\Lambda_T(t \wedge X \mid Z) - \Lambda_T(t \wedge L \mid Z)$. Noting that $\Lambda_T(\cdot \mid Z) = -\log\{1 - F_T(\cdot \mid Z)\}$, we arrive at $\Lambda_T(t \wedge X \mid Z) - \Lambda_T(t_0 \wedge X \mid Z) = I(X > t_0)\, \Lambda_T(t \wedge X \mid T > t_0, Z)$. From these facts it follows that $E\{\check{N}(t) \mid Z\} = E\{\int_0^t I(L \le u \le X)\, d\Lambda_T(u \mid Z, T > t_0) \mid Z\}$ and hence $E\{\check{M}(t) \mid Z\} = 0$. The detailed proof is provided in the Appendix.

Setting $t = Q_T(\tau \mid Z, T > t_0)$, by a variable transformation inside the integral we can show that $E\{\tilde{S}_n(\alpha_0, \tau)\} = 0$, where $\alpha_0$ denotes the true value of the parameter vector $\alpha$ and

$$\tilde{S}_n(\alpha, \tau) = \frac{1}{n} \sum_{i=1}^{n} Z_i\, I(X_i > t_0) \Big\{ N_i[\exp\{Z_i^{\top}\alpha(\tau)\} + t_0] - \int_0^{\tau} I[L_i \le \exp\{Z_i^{\top}\alpha(v)\} + t_0 \le X_i]\, dH(v) \Big\}.$$

Therefore we propose the estimating equation

$$n^{1/2} \tilde{S}_n(\alpha, \tau) = 0. \tag{8}$$

Note that we still have the boundary condition $\exp\{Z^{\top}\hat\alpha(0)\} = 0$, which is critical in solving for $\alpha(\tau)$. As in the unconditional case, $\tilde{S}_n(\alpha, \tau)$ has a stochastic integral representation. Therefore we can again utilize the grid-based root-finding strategy and arrive at a monotone estimating equation whose equivalent objective function is

$$l_j(h) = \sum_{i=1}^{n} \big| I(\delta_i = 1, X_i > t_0) \log(X_i - t_0) - I(\delta_i = 1, X_i > t_0)\, h^{\top} Z_i \big| + \Big| R - h^{\top} \sum_{l=1}^{n} \{ -I(\delta_l = 1, X_l > t_0) Z_l \} \Big| + \Big| R - h^{\top} \sum_{r=1}^{n} \Big( 2 Z_r \sum_{k=0}^{j-1} I[L_r \le \exp\{Z_r^{\top}\hat\alpha(\tau_k)\} + t_0 \le X_r]\, \{H(\tau_{k+1}) - H(\tau_k)\} \Big) \Big|, \tag{9}$$

where $j = 1, \ldots, L_n$, and $R$ is a very large number. The aforementioned Barrodale-Roberts algorithm still applies.

Under regularity conditions C1'-C4' parallel to C1-C4 (stated in the Appendix), we have the following asymptotic results.

Theorem 3. Assuming conditions C1'-C4' hold and $\lim_{n \to \infty} \|\mathcal{S}_{L_n}\| = 0$, then $\sup_{\tau \in [\nu, \tau_U]} \|\hat\alpha(\tau) - \alpha_0(\tau)\| \to_p 0$, where $0 < \nu < \tau_U$.

Theorem 4. Assuming conditions C1'-C4' hold and $\lim_{n \to \infty} n^{1/2} \|\mathcal{S}_{L_n}\| = 0$, then $n^{1/2}\{\hat\alpha(\tau) - \alpha_0(\tau)\}$ converges weakly to a Gaussian process for $\tau \in [\nu, \tau_U]$, where $0 < \nu < \tau_U$.

To justify Theorems 3-4, we define $Z_i^{*} = Z_i I(X_i > t_0)$, $X_i^{*} = X_i - t_0$, and $L_i^{*} = L_i - t_0$. The counting process associated with $X_i^{*}$ is $N_i^{*}(t) = I(X_i^{*} \le t, \delta_i = 1) = I(X_i \le t + t_0, \delta_i = 1)$, and we have

$$\tilde{S}_n(\alpha, \tau) = \frac{1}{n} \sum_{i=1}^{n} Z_i^{*} \Big\{ N_i^{*}(\exp[Z_i^{*\top}\alpha(\tau)]) - \int_0^{\tau} I(L_i^{*} < \exp[Z_i^{*\top}\alpha(v)] \le X_i^{*})\, dH(v) \Big\}. \tag{10}$$

Notice that in (10) $\tilde{S}_n(\alpha, \tau)$ is rewritten as an estimating function for the unconditional case. Therefore we can utilize the established asymptotic results for the unconditional case and conclude that the uniform consistency (Theorem 3) and the weak convergence (Theorem 4) of $\hat\alpha$ still hold. It is also worth mentioning that the conditions C1'-C4' are essentially C1-C4 applied to the new data $Z_i^{*}$, $X_i^{*}$, $L_i^{*}$.

A resampling-based inference procedure similar to that for the unconditional case can be adopted here. The same perturbation can be applied to (10). We can also carry out hypothesis testing and second-stage inferences as in Section 2.
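The reduction used to justify Theorems 3-4 amounts to a simple data transformation, sketched below in Python. The function names are hypothetical, and the shifts $X_i^{*} = X_i - t_0$ and $L_i^{*} = L_i - t_0$ are assumptions about how the starred quantities are defined; subjects with $X_i \le t_0$ receive a zero covariate vector and therefore drop out of the estimating function.

```python
def to_conditional_data(x, left, z, t0):
    """Map (X_i, L_i, Z_i) to (X_i - t0, L_i - t0, Z_i * I(X_i > t0))."""
    x_star = [xi - t0 for xi in x]
    l_star = [li - t0 for li in left]
    z_star = [[zij if xi > t0 else 0.0 for zij in zi] for xi, zi in zip(x, z)]
    return x_star, l_star, z_star


def n_star(t, x_star_i, delta_i):
    """Counting process for the shifted data: I(X*_i <= t, delta_i = 1)."""
    return 1 if (x_star_i <= t and delta_i == 1) else 0
```

After this transformation, any routine written for the unconditional estimating equation can be applied to the starred data to estimate $\alpha(\tau)$.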

4 Extension to Doubly Censored Data with Left Truncation

In the previous section we proposed a conditional model to accommodate potential non-identifiability in the lower tail of $T$. Borrowing the idea from the popular technique for right censored data with left truncation, we essentially imposed an artificial left truncation on $T$. A straightforward extension is that, with slight modification, the proposed model can be applied to even more complicated data that are subject to double censoring and left truncation simultaneously. Specifically, suppose the data described in the previous section are also left truncated at $a$, a known time point. Let $t_0^{*} = t_0 \vee a$; then our model is

$$Q_T(\tau \mid Z, T > t_0^{*}) = \exp\{Z^{\top}\alpha(\tau)\} + t_0^{*}, \quad \tau \in (0, 1).$$

Since the above model is exactly the same as (7) except that $t_0$ is replaced with $t_0^{*}$, the estimating equation and root-finding procedure can be applied without difficulty.

5 Simulation Studies

We study the finite-sample performance of our proposed methods through Monte Carlo simulations. Two settings are illustrated: one with homoscedastic errors and one with heteroscedastic errors. In each setting we examine both the unconditional case, where the left tail of the event time is fully identifiable, and the conditional case, where the left tail of the event time is identifiable only after a pre-specified time. To guarantee identifiability in the unconditional case, we impose a positive probability mass $P_0$ at 0 for the left censoring time $L$.

In the first setting, event times are generated from an AFT model with iid errors:

$$\log T = b_1 Z_1 + b_2 Z_2 + \epsilon,$$

where $\epsilon$ follows the extreme value distribution. It can be shown that under this specific setup the true regression quantiles are the same for the unconditional case and the conditional case given $Z = (1, Z_1, Z_2)^{\top}$, i.e., $\beta_0(\tau) = \alpha_0(\tau) = \{Q_\epsilon(\tau), b_1, b_2\}^{\top}$. The covariates $Z_1$ and $Z_2$ are generated from $\mathrm{Unif}(0, 1)$ and $\mathrm{Bernoulli}(0.5)$, respectively.
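The claim that the unconditional and conditional regression quantiles coincide in the first setting can be checked numerically. The sketch below assumes the standard minimum extreme value distribution, $F_\epsilon(e) = 1 - \exp(-e^{e})$, so that $e^{\epsilon}$ is a unit exponential variable and $Q_\epsilon(\tau) = \log\{-\log(1 - \tau)\}$; this parametrisation, the function names, and the parameter values used in the check are assumptions made for illustration.

```python
import math
import random


def q_eps(tau):
    """tau-th quantile of the (assumed) standard minimum extreme value law."""
    return math.log(-math.log(1.0 - tau))


def true_beta(tau, b1, b2):
    """True coefficient vector {Q_eps(tau), b1, b2} under the AFT setting."""
    return (q_eps(tau), b1, b2)


def model_quantile(tau, z1, z2, b1, b2):
    """exp{Z' beta(tau)} with Z = (1, z1, z2)."""
    beta = true_beta(tau, b1, b2)
    return math.exp(beta[0] + beta[1] * z1 + beta[2] * z2)


def simulate_event_time(z1, z2, b1, b2, rng):
    """Draw T from log T = b1*z1 + b2*z2 + eps by inverting the error CDF."""
    u = rng.random()
    while u == 0.0:          # avoid log(0) in the inversion
        u = rng.random()
    return math.exp(b1 * z1 + b2 * z2 + q_eps(u))


def empirical_quantile(values, tau):
    """Simple order-statistic quantile."""
    ordered = sorted(values)
    return ordered[min(int(tau * len(ordered)), len(ordered) - 1)]
```

Because $T$ given $Z$ is a scaled exponential variable here, its lack of memory implies that the $\tau$th quantile of $T - t_0$ given $T > t_0$ equals $\exp\{Z^{\top}\beta_0(\tau)\}$, which is exactly the statement $\alpha_0(\tau) = \beta_0(\tau)$ under model (7).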
The right censoring time $U$ is $\mathrm{Unif}(.1\, I(Z_2 = 1), c_u)$, and the left censoring time $L$ is $\mathrm{Unif}(0, c_l) \times W$, where $W$ follows $\mathrm{Bernoulli}(1 - P_0)$. If $U < L$ in a pair of realizations, we regenerate $U$ until $U \ge L$. In the unconditional case we set $P_0 = .2$, $b_1 = 0$, $b_2 = -.5$, $c_l = .5$, $c_u = 3.8$, resulting in 2% right censoring and 2% left censoring. In the conditional case we set $P_0 = 0$, $b_1 = 0$, $b_2 = -1$,

$c_l = .3$, $c_u = 4.5$, resulting in 15% right censoring and 2% left censoring. Under each configuration we generate 1 data sets of sample sizes n = 2 and n = 4. We set B = 2 in the resampling procedures, with the disturbances $\{\zeta_i\}_{i=1}^{B}$ generated from Exp(1). An equally spaced grid on $\tau$ with size .1 is adopted when estimating $\beta(\tau)$. We also carry out tests of both the overall significance and the constant effect hypotheses for each covariate. In the latter test we adopt the weight function $\Theta(v) = I\{v \le (l + u)/2\}$. We set $l = .1$ and $u = .7$.

Table 1 presents the estimation results for the AFT setting under the unconditional model and the conditional model with sample size n = 2. Included are the biases (absolute values), empirical standard deviations (EmpSD), average resampling-based SD estimates (AvgSD), and the coverage rates of the 95% Wald confidence intervals for $\hat\beta(\tau)$ and $\hat\alpha(\tau)$, $\tau = .1, .3, .5, .7$. We see that the biases are negligible, the resampling-based SD estimates agree well with the empirical SD, and the coverage rates are close to the nominal level.

Table 1: Simulation results under AFT models (n=2)
Unconditional Case: $b_1 = 0$, $b_2 = -.5$, 2% rc, 2% lc. Conditional Case: $b_1 = 0$, $b_2 = -1$, 15% rc, 2% lc.
Columns: $\tau$; Bias, AvgSD, EmpSD, Cov95 for each component of $\hat\beta$ (unconditional); Bias, AvgSD, EmpSD, Cov95 for each component of $\hat\alpha$ (conditional).

Table 2 presents the hypothesis testing results. The empirical rejection rates (ERR) for both tests at level 0.05 are reported, together with the estimated average effects (AvgEst), the empirical SD of the average effects, and the average resampling-based SD estimates of the average effects. We see that the type I errors are close to the nominal level 0.05. The estimated average covariate effects of $Z_1$ and $Z_2$ are close to the true values (constants). The resampling-based SD estimates agree well with the empirical SD.
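The summary measures reported in the simulation tables can be computed from the replicate-level output along the following lines. This Python sketch is illustrative only: the function name `summarize` and its argument layout are hypothetical, and the Wald interval uses the usual normal critical value for a 95% level.

```python
import math


def summarize(estimates, sd_estimates, truth, z_crit=1.959964):
    """Bias, EmpSD, AvgSD and Wald coverage across simulation replicates.

    estimates[m]    : point estimate from replicate m
    sd_estimates[m] : resampling-based SD estimate from replicate m
    truth           : true parameter value
    """
    m = len(estimates)
    mean = sum(estimates) / m
    emp_sd = math.sqrt(sum((e - mean) ** 2 for e in estimates) / (m - 1))
    avg_sd = sum(sd_estimates) / m
    covered = sum(
        1 for e, s in zip(estimates, sd_estimates)
        if e - z_crit * s <= truth <= e + z_crit * s
    )
    return {
        "Bias": abs(mean - truth),   # reported in absolute value
        "EmpSD": emp_sd,
        "AvgSD": avg_sd,
        "Cov95": covered / m,
    }
```

Agreement between EmpSD and AvgSD indicates that the perturbation resampling captures the sampling variability, and Cov95 close to 0.95 indicates that the normal approximation is adequate.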

Table 2: Simulation results on hypothesis testing and second-stage inference under AFT models (n=2)
Unconditional Case: $b_1 = 0$, $b_2 = -.5$, 2% right censoring, 2% left censoring. Hypotheses: $H_0 : \beta(\tau) = 0$, $l \le \tau \le u$ (ERR, AvgEst, AvgSD, EmpSD) and $\tilde{H}_0 : \beta(\tau) = \eta_0$, $l \le \tau \le u$ (ERR), reported for each component of $\hat\beta$.
Conditional Case: $b_1 = 0$, $b_2 = -1$, 15% right censoring, 2% left censoring. Hypotheses: $H_0 : \alpha(\tau) = 0$, $l \le \tau \le u$ (ERR, AvgEst, AvgSD, EmpSD) and $\tilde{H}_0 : \alpha(\tau) = \eta_0$, $l \le \tau \le u$ (ERR), reported for each component of $\hat\alpha$.

In the second setting, we consider a log-linear model with heteroscedastic errors. Event times are generated from the model

$$\log T = b_1 Z_1 + b_2 Z_2\, \xi + \epsilon,$$

where $\epsilon$ follows $N(0, 1)$, $Z_1$ follows $\mathrm{Unif}(0, 1)$, $Z_2$ follows $\mathrm{Bernoulli}(0.5)$, and $\xi$ follows Exp(1). These variates are all independent. The left censoring time and right censoring time are generated in the same way as in the first setting. In the unconditional case we set $P_0 = .2$, $b_1 = 0$, $b_2 = -1.5$, $c_l = .7$, $c_u = 4.8$, resulting in 25% right censoring and 2% left censoring. In the conditional case we set $P_0 = 0$, $b_1 = 0$, $b_2 = -4.5$, $c_l = .5$, $c_u = 4.$, resulting in 15% right censoring and 25% left censoring. Tables 3-4 report the estimation results and hypothesis testing results in the same fashion as Tables 1-2. We can see that our proposed method also performs well in the heteroscedastic setting.

The simulation results for n = 4 for both settings are presented in the supplementary materials. Comparing the two sample sizes, we see that the larger one (4) has better overall performance.

We also performed simulations to demonstrate the advantage of our approach over the naive approach that simply discards all left-censored subjects. The same unconditional models and configurations as summarized in Table 1 (AFT model) and Table 3 (heteroscedastic model) are taken for illustration. Figure 1 displays the parameter estimates from the proposed approach and the naive approach together

Table 3: Simulation results under log-linear models with heteroscedastic errors (n=2)
Unconditional Case: $b_1 = 0$, $b_2 = -1.5$, 25% rc, 2% lc. Conditional Case: $b_1 = 0$, $b_2 = -4.5$, 15% rc, 25% lc.
Columns: $\tau$; Bias, AvgSD, EmpSD, Cov95 for each component of $\hat\beta$ (unconditional); Bias, AvgSD, EmpSD, Cov95 for each component of $\hat\alpha$ (conditional).

Table 4: Simulation results on hypothesis testing and second-stage inference under log-linear models with heteroscedastic errors (n=2)
Unconditional Case: $b_1 = 0$, $b_2 = -1.5$, 25% right censoring, 2% left censoring. Hypotheses: $H_0 : \beta(\tau) = 0$, $l \le \tau \le u$ (ERR, AvgEst, AvgSD, EmpSD) and $\tilde{H}_0 : \beta(\tau) = \eta_0$, $l \le \tau \le u$ (ERR), reported for each component of $\hat\beta$.
Conditional Case: $b_1 = 0$, $b_2 = -4.5$, 15% right censoring, 25% left censoring. Hypotheses: $H_0 : \alpha(\tau) = 0$, $l \le \tau \le u$ (ERR, AvgEst, AvgSD, EmpSD) and $\tilde{H}_0 : \alpha(\tau) = \eta_0$, $l \le \tau \le u$ (ERR), reported for each component of $\hat\alpha$.

with the true parameter values. This figure shows that the proposed estimators are virtually unbiased, while the naive estimators deviate greatly from the true values for non-zero parameters.

Figure 1: Comparing Estimates from Proposed Approach and Naive Approach. (Two rows of panels, "Estimates for AFT Model" and "Estimates for Heteroscedastic Model"; each row shows the intercept and the two slope coefficients against tau, with curves for the true parameters, the proposed estimator, and the naive estimator.)

6 CFFPR Data Example

In this section we apply our methods to the CFFPR data discussed in Section 1. Cystic Fibrosis (CF) is one of the most common life-shortening genetic disorders, affecting the lungs and digestive systems of about 30,000 children and adults in the United States and 70,000 worldwide (Cystic Fibrosis Foundation 21). Pseudomonas aeruginosa (PA), the predominant bacterial pathogen, infecting 80% of CF patients under age 18, accelerates lung malfunction and serves as an important predictor of mortality in CF (Retsch-Bogart et al. 2008). In this paper we used the CFFPR data collected during to investigate the association between onset ages of the first detected PA infection and several risk factors in CF patients diagnosed by age 10. Similar data were analyzed by Lai et al. (2004) under the Cox model and by Yan et al. (2009) based on temporal process regression.

In our analysis, the event time of interest $T$ corresponds to the patient's age at the first PA infection, which is subject to both left and right censoring. Among 12,818 CF patients diagnosed between 1986 and 2000, 3,343 (26.1%) patients had PA infection at study entry (i.e., left censored by the patient's age at the first recorded visit, $L$) and 2,213 (17.3%) patients had no PA infection documented by December 2005 (i.e., right censored by the age at the last follow-up before the cut-off date, $U$). To avoid the complication of delayed entry (i.e., left truncation), we restricted the study population to subjects who were diagnosed before age 10 and alive at age 1. The first restriction was imposed because the first 10 years were known to have the greatest potential to take advantage of early diagnosis (Campbell and White 2005). The second restriction was imposed to avoid left truncation due to mortality prior to CF diagnosis. Since the mortality rate before age 1 was very low, about 1.5%, we expect that excluding patients who died before age 1 would result in only a small deviation from the general CF population. The restricted sample contains 11,179 patients, for which the left and right censoring rates are 23.7% and 16.2%, respectively.

We applied the proposed quantile regression method to this doubly censored restricted CFFPR sample. The same set of covariates as in Yan et al. (2009) was considered, which included gender (1 for females and 0 for males), diagnosis mode (denoted by factor) and diagnosis year (denoted by "dx"). Diagnosis mode was defined according to common clinical practices that identify CF, with four categories: diagnosis at birth due to meconium ileus (MI), diagnosis shortly after birth by neonatal/prenatal screening (SCR), diagnosis at variable ages because of family history (FH), and diagnosis at variable ages from various symptoms (SYMP) other than MI (Lai et al. 2004, Yan et al. 2009).
Diagnosis year was classified into three cohorts that reflect medical progress in CF diagnosis and treatment: 1986-1989 (dx86), 1990-1993 (dx90), and 1994-2000 (dx94). Without loss of generality, we chose male patients who were diagnosed between 1986 and 1989 by symptoms other than MI as our reference level.

Figure 2 displays the coefficient estimates β̂(τ) as solid lines, along with their 95% pointwise Wald confidence intervals as dotted lines, for τ ∈ [l, u] with l = 0.1 and u = 0.65. The two endpoints 0.1 and 0.65 were chosen so that there would be enough data points to obtain reliable estimates at each quantile in between. As expected, the estimated intercept is an increasing step function; it suggests that 10% of male CF patients diagnosed between 1986 and 1989 by SYMP acquired their first PA infection by the age of 2 years and 65% of them by the age of 9 years. The estimated coefficients for MI are close to 0, those for FH and SCR are always positive, and those for dx90 and dx94 are always negative. This may suggest that, compared with patients diagnosed by symptoms, patients diagnosed because of MI had similar onset ages

of PA infection, while patients diagnosed due to family history or screening developed their first PA infection at later ages. Regarding the diagnosis cohort effect, patients who were diagnosed in the two later cohorts (dx90 and dx94) exhibited earlier onset of PA infection than patients diagnosed in 1986-1989. The coefficients for gender, though marginally significant around τ = 0.65, show no apparent trend of departing from 0.

A formal significance test was performed based on the average covariate effects across quantiles ranging from 0.1 to 0.65. The aforementioned protective effect of FH diagnosis and the accelerating effects of the newer diagnosis cohorts are significant on average, with p-values .2, .1 and <.1 (data not shown), respectively. The effect of SCR diagnosis is marginally significant (p-value=.7). We paid particular attention to testing the gender effect because of its cross-over pattern in Figure 2: the estimated coefficient is first positive and then changes sign around τ = 0.35. Hence we assessed the aggregated gender effect in two τ-intervals, τ ∈ [0.1, 0.35] and τ ∈ [0.35, 0.65]. The test is significant in neither interval, with both p-values around .2 (data not shown).

For the three significant covariates, i.e., FH, dx90 and dx94, we also conducted a constancy test using the weight function Θ(t) = I[t < (l + u)/2], in view of the rather monotone pattern of each coefficient estimate. The results show that only the effect of dx94 varies with τ, with p-value <.1 (data not shown). This result is consistent with our observation from Figure 2: though always negative, the coefficient for dx94 gradually decreases in magnitude as τ increases.
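The two summaries behind these tests can be illustrated numerically. The sketch below assumes that the average effect over [l, u] is the integral of the coefficient curve divided by u - l, and that the constancy test is driven by the Θ-weighted integral of the curve's deviation from that average; this form is an assumption made for illustration, and no standard errors or p-values are computed. The function names and the two coefficient curves are hypothetical.

```python
def trapezoid(xs, ys):
    """Integrate ys over the grid xs with the trapezoid rule."""
    return sum((xs[k + 1] - xs[k]) * (ys[k] + ys[k + 1]) / 2.0
               for k in range(len(xs) - 1))

def average_effect(taus, betas):
    """Average of the coefficient curve over [l, u] = [taus[0], taus[-1]]."""
    return trapezoid(taus, betas) / (taus[-1] - taus[0])

def constancy_contrast(taus, betas):
    """Theta-weighted deviation from the average, Theta(t) = I[t < (l + u) / 2]."""
    l, u = taus[0], taus[-1]
    eta = average_effect(taus, betas)
    mid = (l + u) / 2.0
    weighted = [(1.0 if t < mid else 0.0) * (b - eta)
                for t, b in zip(taus, betas)]
    return trapezoid(taus, weighted)

# Hypothetical coefficient curves on a grid over [0.10, 0.65].
taus = [0.10 + 0.55 * k / 110 for k in range(111)]
flat = [-0.3 for _ in taus]                # constant effect
sloped = [-0.6 + 0.5 * t for t in taus]    # negative effect shrinking with tau

avg_sloped = average_effect(taus, sloped)
contrast_flat = constancy_contrast(taus, flat)
contrast_sloped = constancy_contrast(taus, sloped)
```

For the flat curve the contrast is zero up to rounding, while for the sloped curve it is negative because the first half of the τ-range lies below the average. A curve shaped like the dx94 estimate, negative with shrinking magnitude, would behave like `sloped`.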
The scientific implication may be that patients diagnosed between 1994 and 2000 were more likely to be cultured more frequently, resulting in earlier detection of PA infection, than those diagnosed between 1986 and 1989. However, more frequent culturing had less effect on later onset ages; hence the difference was less dramatic for CF patients with late onset of PA infection.

7 Discussion

In this paper we propose a quantile regression method for doubly censored data. Quantile regression has become a popular alternative to the classical Cox and AFT models in survival analysis but, to the best of our knowledge, has not been adapted to doubly censored data. By formulating covariate effects on different quantiles of the outcome rather than on the mean, and by adopting non-parametric approaches, our model benefits from flexibility and robustness. The proposed estimation procedure takes advantage of the martingale structure, which facilitates both asymptotic studies and implementation in standard statistical software. We make the rather common assumption that (L, U) is conditionally independent of the event time given Z, where both L and

U are allowed to depend on Z. In addition, U is assumed to be unobservable, as is often the case in survival analysis. On the other hand, we do assume that L is known. Relaxing this assumption requires overcoming additional difficulties that may preclude the use of the martingale structure; nonetheless, this could be one of our future directions.

[Figure 2: Estimated Regression Quantiles. Panels: intercept, female, DX(MI), DX(FH), DX(SCR), dx90, dx94; each panel plots the regression quantile against tau.]

Identifiability is a subtle but important issue that is often encountered with censored data. Because event times are lost in both the lower and upper tails, doubly censored data may suffer from two-way non-identifiability. Our proposal tackles this issue with a two-fold strategy. By restricting attention to a range of τ away from 1, that is, (0, τ_U] with τ_U < 1, we bypass identification in the upper tail; by adopting a conditional version of the regression model, in which we condition on T > t_0, we overcome non-identifiability in the lower tail. In this paper we assume that τ_U and t_0 are fixed, but in practice they can be chosen adaptively or on scientific grounds.

Although our methodology is tailored to the CFFPR data used here as an example, this work has potentially broader applications. Double censoring may also arise in many other scenarios, for example, when measurements can only be taken accurately on a certain interval [L, U], or when the interest lies in the elapsed time between the occurrences of two events, both of which are subject to right

censoring. Our work can be easily applied to these situations. In addition, the proposed conditional model shares the essence of the solution for data with left truncation and right censoring, and therefore our methods can also be adapted to this type of data without difficulty.

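Appendix A

A.1 Justification of the Estimating Equations

A.1.1 The Unconditional Case

To justify the estimating equation (4) for the unconditional case, we need to prove (3) first. Recall that $dM_i(t) = dN_i(t) - R_i(t)\lambda_i(t)\,dt$ and that $\mathcal{F}_t$ is the filtration $\sigma\{N_i(u), R_i(u+), Z_i : i = 1, \ldots, n;\ u \le t\}$. We have

\[
\begin{aligned}
E[dN_i(t) \mid \mathcal{F}_t] &= \Pr[dN_i(t) = 1 \mid \mathcal{F}_t] \\
&= \Pr[t \le T_i < t + dt,\ R_i(t) = 1 \mid \mathcal{F}_t] \\
&= R_i(t)\,\Pr[t \le T_i < t + dt \mid R_i(t), Z_i] \\
&= R_i(t)\,\Pr[t \le T_i < t + dt \mid T_i \ge t,\ L_i < t,\ U_i \ge t,\ Z_i] \\
&= R_i(t)\,\lambda_i(t \mid Z_i)\,dt,
\end{aligned}
\]

assuming $(L_i, U_i) \perp T_i \mid Z_i$. Hence $M_i(t)$ is a martingale and $E\{M_i(t) \mid Z_i\} = 0$ for $t \ge 0$. On the other hand, taking $t = F_T^{-1}(\tau \mid Z_i)$ and substituting $v = F_T(u \mid Z_i)$, we have

\[
\int_0^t R_i(u)\lambda_i(u \mid Z_i)\,du
= \int_0^t I(L_i < u \le X_i)\,\lambda_i(u \mid Z_i)\,du
= \int_0^\tau I\{L_i < F_T^{-1}(v \mid Z_i) \le X_i\}\,dH(v),
\]

where $H(v) = -\log(1 - v)$ for $v < 1$. Then it follows immediately that $E[S_n\{\beta_0(\tau)\}] = 0$ for $\tau \in (0, 1)$.

A.1.2 The Conditional Case

We want to show that $E\{\check{M}(t) \mid Z\} = 0$, where

\[
\check{M}(t) = \check{N}(t) - \int_{t_0}^{t} I(L < u \le X)\,d\Lambda_T(u \mid Z, T > t_0),
\]

and $\check{N}(t) = N(t) - N(t_0)$ for $t > t_0$ and $0$ otherwise. From the arguments in the unconditional case we know that

\[
\begin{aligned}
E\{N(t) \mid Z\}
&= E\left[\int_0^t I(L < u \le X)\,\lambda(u \mid Z)\,du \,\Big|\, Z\right] \\
&= E\left[\int_0^t I(u \le X)\,\lambda(u \mid Z)\,du - \int_0^t I(u \le L)\,\lambda(u \mid Z)\,du \,\Big|\, Z\right] \\
&= E\left[\Lambda_T(t \wedge X \mid Z) - \Lambda_T(t \wedge L \mid Z) \,\Big|\, Z\right].
\end{aligned}
\]

Hence, for $t > t_0$, we have

\[
E\{N(t) - N(t_0) \mid Z\}
= E\left[\{\Lambda_T(t \wedge X \mid Z) - \Lambda_T(t_0 \wedge X \mid Z)\} - \{\Lambda_T(t \wedge L \mid Z) - \Lambda_T(t_0 \wedge L \mid Z)\} \,\Big|\, Z\right]. \tag{11}
\]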
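The unconditional identity $E\{N(t) \mid Z\} = E\{\Lambda_T(t \wedge X \mid Z) - \Lambda_T(t \wedge L \mid Z) \mid Z\}$ used above can be checked by simulation. The following is only a sketch under stated assumptions: there are no covariates, $T$ is exponential with unit hazard so that $\Lambda_T(s) = s$, the observed time is taken to be $X = \max\{L, \min(T, U)\}$ with $N(t) = I(X \le t, L < T \le U)$, and $(L, U)$ is drawn independently of $T$. The function name and all distributions are hypothetical choices made for illustration.

```python
import random

def simulate_identity(t, n=100000, seed=1):
    """Monte Carlo means of N(t) and of Lambda(t ^ X) - Lambda(t ^ L)."""
    rng = random.Random(seed)
    count_sum = 0.0
    comp_sum = 0.0
    for _ in range(n):
        event = rng.expovariate(1.0)          # T with hazard 1, so Lambda_T(s) = s
        left = rng.uniform(0.0, 0.5)          # L, drawn independently of T
        right = left + rng.uniform(0.5, 3.0)  # U > L, drawn independently of T
        obs = max(left, min(event, right))    # assumed observed time X
        exact = left < event <= right         # event observed without censoring
        count_sum += 1.0 if (exact and obs <= t) else 0.0
        comp_sum += min(t, obs) - min(t, left)
    return count_sum / n, comp_sum / n

lhs, rhs = simulate_identity(t=1.0)
```

With these choices `lhs` and `rhs` should agree up to Monte Carlo error, which is what the martingale argument predicts.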

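On the other hand, we have

\[
\begin{aligned}
\Lambda_T(t \wedge X \mid Z) - \Lambda_T(t_0 \wedge X \mid Z)
&= I(X > t_0)\{\Lambda_T(t \wedge X \mid Z) - \Lambda_T(t_0 \mid Z)\} \\
&= I(X > t_0)\left[-\log\{1 - \Pr(T \le t \wedge X \mid Z)\} + \log\{1 - \Pr(T \le t_0 \mid Z)\}\right] \\
&= I(X > t_0)\left[-\log\left\{1 - \frac{\Pr(T \le t \wedge X \mid Z) - \Pr(T \le t_0 \mid Z)}{1 - \Pr(T \le t_0 \mid Z)}\right\}\right] \\
&= I(X > t_0)\left[-\log\{1 - \Pr(T \le t \wedge X \mid T > t_0, Z)\}\right] \\
&= I(X > t_0)\,\Lambda_T(t \wedge X \mid T > t_0, Z).
\end{aligned}
\tag{12}
\]

Similarly we can show that

\[
\Lambda_T(t \wedge L \mid Z) - \Lambda_T(t_0 \wedge L \mid Z) = I(L > t_0)\,\Lambda_T(t \wedge L \mid T > t_0, Z). \tag{13}
\]

Plugging (12) and (13) into (11), and taking $t = F_T^{-1}(\tau \mid T > t_0, Z)$ in the last step, we have

\[
\begin{aligned}
E\{N(t) - N(t_0) \mid Z\}
&= E\left[I(X > t_0)\,\Lambda_T(t \wedge X \mid T > t_0, Z) - I(L > t_0)\,\Lambda_T(t \wedge L \mid T > t_0, Z) \,\Big|\, Z\right] \\
&= E\left[\int_{t_0}^{t} \{I(X > t_0) I(u \le X) - I(L > t_0) I(u \le L)\}\,\lambda_T(u \mid T > t_0, Z)\,du \,\Big|\, Z\right] \\
&= E\left[\int_{t_0}^{t} I(L < u \le X)\,\lambda_T(u \mid T > t_0, Z)\,du \,\Big|\, Z\right] \\
&= E\left[\int_0^{\tau} I\{L < F_T^{-1}(v \mid T > t_0, Z) \le X\}\,dH(v) \,\Big|\, Z\right].
\end{aligned}
\tag{14}
\]

From (11) and (14) we immediately have $E\{\check{M}(t) \mid Z\} = 0$.

Appendix B

B.1 Regularity Conditions and Proofs

B.1.1 Regularity Conditions for Theorems 1 and 2

Let $F(\cdot \mid Z)$ and $\bar{F}(\cdot \mid Z)$ be the conditional CDF and survival function of $X$ (the observed time) given $Z$, respectively; let $\tilde{F}(t \mid Z) = \Pr(t_0 < X \le t, \delta = 1 \mid Z)$; and let $F_{X,L}(\cdot, \cdot \mid Z)$ denote the joint distribution of $(X, L)$ given $Z$. Moreover, let $f(\cdot \mid Z)$, $\bar{f}(\cdot \mid Z)$, $\tilde{f}(\cdot \mid Z)$ and $f_{X,L}(\cdot, \cdot \mid Z)$ be the first order derivatives of $F(\cdot \mid Z)$, $\bar{F}(\cdot \mid Z)$, $\tilde{F}(\cdot \mid Z)$ and $F_{X,L}(\cdot, \cdot \mid Z)$, respectively.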

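Define:

\[
\begin{aligned}
\mu(b) &= E[Z N\{\exp(Z'b)\}], \\
B(b) &= E[Z^{\otimes 2}\,\tilde{f}\{\exp(Z'b) \mid Z\}\exp(Z'b)], \\
v_n(b) &= n^{-1}\sum_{i=1}^{n} Z_i N_i\{\exp(Z_i'b)\} - \mu(b), \\
\tilde{\mu}(b) &= E[Z I\{L < \exp(Z'b) \le X\}], \\
J(b) &= E[Z^{\otimes 2}\,(\bar{f}\{\exp(Z'b) \mid Z\} - f_{X,L}\{\exp(Z'b), \exp(Z'b) \mid Z\})\exp(Z'b)], \\
\tilde{v}_n(b) &= n^{-1}\sum_{i=1}^{n} Z_i I\{L_i < \exp(Z_i'b) \le X_i\} - \tilde{\mu}(b).
\end{aligned}
\]

The regularity conditions are stated as follows:

C1: The covariate space $\mathcal{Z}$ is compact, i.e., $\sup_i \|Z_i\| < \infty$.

C2: (a) Each component of $E(Z N[\exp\{Z'\beta_0(\tau)\}])$ is a Lipschitz function of $\tau$, and (b) $\tilde{f}(t \mid z)$ and $f(t \mid z)$ are bounded above uniformly in $t$ and $z$.

C3: (a) $\tilde{f}\{\exp(Z'b) \mid Z\} > 0$ for all $b \in \mathcal{B}(d_0)$, (b) $E(Z^{\otimes 2}) > 0$, (c) each component of $J(b)B(b)^{-1}$ is uniformly bounded in $b \in \mathcal{B}(d_0)$, where $\mathcal{B}(d_0)$ is a neighborhood containing $\{\beta_0(\tau), \tau \in (0, \tau_U)\}$, defined as $\mathcal{B}(d_0) = \{b \in R^p : \inf_{\tau \in (0, \tau_U]} \|\mu(b) - \mu(\beta_0(\tau))\| \le d_0\}$.

C4: $\inf_{\tau \in [\nu, \tau_U]} \mathrm{eigmin}\,B(\beta_0(\tau)) > 0$ for any $\nu \in (0, \tau_U)$, where $\mathrm{eigmin}(\cdot)$ denotes the minimal eigenvalue of a matrix.

B.1.2 Proof of Theorem 2

Following the proof of Lemma B.1. in Peng and Huang (2008) we can show that

\[
\sup_{\tau \in (0, \tau_U]} \left\| n^{-1/2}\sum_{i=1}^{n} Z_i\{N_i(\exp\{Z_i'\hat{\beta}(\tau)\}) - N_i(\exp\{Z_i'\beta_0(\tau)\})\} - n^{1/2}(\mu\{\hat{\beta}(\tau)\} - \mu\{\beta_0(\tau)\}) \right\| \to_p 0 \tag{15}
\]

and

\[
\sup_{\tau \in (0, \tau_U]} \left\| n^{-1/2}\sum_{i=1}^{n} Z_i\{I(L_i < \exp\{Z_i'\hat{\beta}(\tau)\} \le X_i) - I(L_i < \exp\{Z_i'\beta_0(\tau)\} \le X_i)\} - n^{1/2}(\tilde{\mu}\{\hat{\beta}(\tau)\} - \tilde{\mu}\{\beta_0(\tau)\}) \right\| \to_p 0. \tag{16}
\]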

(15) and (16), together with the uniform convergence of $\mu\{\hat{\beta}(\tau)\}$ for $\tau \in (0, \tau_U]$, imply a stochastic differential equation as mentioned in Peng and Huang (2008):

\[
-n^{1/2} S_n(\beta_0, \tau) = n^{1/2}[\mu\{\hat{\beta}(\tau)\} - \mu\{\beta_0(\tau)\}]
- \int_0^{\tau} [J\{\beta_0(u)\}B\{\beta_0(u)\}^{-1} + o_{(0, \tau_U]}(1)]\, n^{1/2}[\mu\{\hat{\beta}(u)\} - \mu\{\beta_0(u)\}]\,dH(u) + o_{(0, \tau_U]}(1).
\]

Using the product integration theory (Gill and Johansen 1990; Andersen et al. 1998), we have

\[
n^{1/2}[\mu\{\hat{\beta}(\tau)\} - \mu\{\beta_0(\tau)\}] = \phi\{-n^{1/2} S_n(\beta_0, \tau)\} + o_{(0, \tau_U]}(1), \tag{17}
\]

where $\phi$ is a linear operator. By the Donsker theorem, $-n^{1/2} S_n(\beta_0, \tau)$ converges weakly to a tight Gaussian process $G(\tau)$ for $\tau \in (0, \tau_U]$. Hence $n^{1/2}[\mu\{\hat{\beta}(\tau)\} - \mu\{\beta_0(\tau)\}]$ converges weakly to $\phi\{G(\tau)\}$, which is also a Gaussian process. Using Taylor expansions, we immediately have that $n^{1/2}\{\hat{\beta}(\tau) - \beta_0(\tau)\}$ converges weakly to the Gaussian process $B\{\beta_0(\tau)\}^{-1}\phi\{G(\tau)\}$ for $\tau \in [\nu, \tau_U]$, where the lower limit $\nu$ ensures that $B\{\beta_0(\tau)\}^{-1}$ is uniformly bounded.

B.1.3 Regularity Conditions for Theorems 3 and 4

Define:

\[
\begin{aligned}
\mu(a) &= E[Z \tilde{N}\{\exp(Z'a) + t_0\}], \\
B(a) &= E[Z^{\otimes 2}\,\tilde{f}\{\exp(Z'a) + t_0 \mid Z\}\exp(Z'a)], \\
v_n(a) &= n^{-1}\sum_{i=1}^{n} Z_i \tilde{N}_i\{\exp(Z_i'a) + t_0\} - \mu(a), \\
\tilde{\mu}(a) &= E[Z I\{L < \exp(Z'a) + t_0 \le X\}], \\
J(a) &= E[Z^{\otimes 2}\,(\bar{f}\{\exp(Z'a) + t_0 \mid Z\} - f_{X,L}\{\exp(Z'a) + t_0, \exp(Z'a) + t_0 \mid Z\})\exp(Z'a)], \\
\tilde{v}_n(a) &= n^{-1}\sum_{i=1}^{n} Z_i I\{L_i < \exp(Z_i'a) + t_0 \le X_i\} - \tilde{\mu}(a).
\end{aligned}
\]

The regularity conditions are stated as follows:

C1': The covariate space $\mathcal{Z}$ is compact, i.e., $\sup_i \|Z_i\| < \infty$.

C2': (a) Each component of $E(Z \tilde{N}(\exp\{Z'\alpha_0(\tau)\} + t_0))$ is a Lipschitz function of $\tau$, and (b) $\tilde{f}(t \mid z)$ and $f(t \mid z)$ are bounded above uniformly in $t$ and $z$.

C3': (a) $\tilde{f}\{\exp(Z'a) + t_0 \mid Z\} > 0$ for all $a \in \mathcal{B}(d_0)$, (b) $E(Z^{\otimes 2}) > 0$, (c) each

component of $J(a)B(a)^{-1}$ is uniformly bounded in $a \in \mathcal{B}(d_0)$, where $\mathcal{B}(d_0)$ is a neighborhood containing $\{\alpha_0(\tau), \tau \in (0, \tau_U)\}$, defined as $\mathcal{B}(d_0) = \{a \in R^p : \inf_{\tau \in (0, \tau_U]} \|\mu(a) - \mu(\alpha_0(\tau))\| \le d_0\}$.

C4': $\inf_{\tau \in [\nu, \tau_U]} \mathrm{eigmin}\,B(\alpha_0(\tau)) > 0$ for any $\nu \in (0, \tau_U)$, where $\mathrm{eigmin}(\cdot)$ denotes the minimal eigenvalue of a matrix.

Appendix C

C.1 Simulation Results for n = 4

[Table 5: Simulation results under AFT models (n=4). Unconditional case: b1 = 0, b2 = -.5, 2% rc, 2% lc. Conditional case: b1 = 0, b2 = -1, 15% rc, 2% lc. Columns for each case: Bias, AvgSD, EmpSD, Cov95; rows: the components of β̂ (unconditional) and α̂ (conditional) at each τ.]

[Table 6: Simulation results on hypothesis testing and second-stage inference under AFT models (n=4). Unconditional case: b1 = 0, b2 = -.5, 2% right censoring, 2% left censoring; hypotheses H0: β(τ) = 0, l ≤ τ ≤ u and H0: β(τ) = η, l ≤ τ ≤ u. Conditional case: b1 = 0, b2 = -1, 15% right censoring, 2% left censoring; hypotheses H0: α(τ) = 0, l ≤ τ ≤ u and H0: α(τ) = η, l ≤ τ ≤ u. Columns: ERR for the first hypothesis; AvgEst, AvgSD, EmpSD and ERR for the second; rows: the components of β̂ and α̂.]

[Table 7: Simulation results under log-linear models with heteroscedastic errors (n=4). Unconditional case: b1 = 0, b2 = -1.5, 25% rc, 2% lc. Conditional case: b1 = 0, b2 = -4.5, 15% rc, 25% lc. Columns for each case: Bias, AvgSD, EmpSD, Cov95; rows: the components of β̂ (unconditional) and α̂ (conditional) at each τ.]

[Table 8: Simulation results on hypothesis testing and second-stage inference under log-linear models with heteroscedastic errors (n=4). Unconditional case: b1 = 0, b2 = -1.5, 25% right censoring, 2% left censoring; hypotheses H0: β(τ) = 0, l ≤ τ ≤ u and H0: β(τ) = η, l ≤ τ ≤ u. Conditional case: b1 = 0, b2 = -4.5, 15% right censoring, 25% left censoring; hypotheses H0: α(τ) = 0, l ≤ τ ≤ u and H0: α(τ) = η, l ≤ τ ≤ u. Columns: ERR for the first hypothesis; AvgEst, AvgSD, EmpSD and ERR for the second; rows: the components of β̂ and α̂.]
