Estimation in the l1-Regularized Accelerated Failure Time Model
Estimation in the l1-Regularized Accelerated Failure Time Model

Brent Johnson, PhD
Technical Report, May 2008
Department of Biostatistics, Rollins School of Public Health
Emory University, 1518 Clifton Road, N.E., Atlanta, Georgia
Telephone: (404); FAX: (404); bajohn3@emory.edu
Estimation in the l1-regularized accelerated failure time model

Brent A. Johnson

Abstract

This note considers variable selection in the semiparametric linear regression model for censored data. Semiparametric linear regression for censored data is a natural extension of the linear model for uncensored data; however, random censoring introduces substantial theoretical and numerical challenges. By now, a number of authors have made significant contributions to estimation and inference in the semiparametric linear model, but none of these authors have considered regularized estimation and subsequent variable selection. Our estimator is defined as a consistent solution to a suitably penalized, weighted logrank estimating function. For a general weight function, this estimating function is known to be non-monotone in the regression coefficients and may contain multiple roots. Nevertheless, it underlies one of the more popular estimators that does not assume proportional hazards. The proposed method uses linear and quadratic programming techniques for l1-regularized estimation and can be implemented easily in R. We illustrate the utility of our approach in real and simulated data.

Keywords: Adaptive lasso; Lasso; Oracle property; Penalized least squares; Proportional hazards.
1 Introduction

Over the past several years, substantial attention has been paid to simultaneous estimation and variable selection in the linear model through so-called penalized least squares (PLS) estimators (e.g. Breiman, 1995; Tibshirani, 1996; Fan and Li, 2001; Zou and Hastie, 2005; Zou, 2006; Yuan and Lin, 2006). Because PLS estimators simultaneously shrink some coefficients to zero and estimate the non-zero coefficients, one can manage the theoretical properties of PLS estimators more easily than those of earlier proposals, such as stepwise deletion and subset selection. In addition to desirable asymptotic properties, elegant solutions to the resulting constrained optimization problems are now readily available (e.g. Osborne, Presnell, and Turlach, 2000; Efron, Johnstone, Hastie, and Tibshirani, 2004; Friedman and Popescu, 2004; Friedman, Hastie, Höfling, and Tibshirani, 2007). For the most part, the existing theoretical results and accompanying optimization algorithms derived for PLS estimators can be extended to general response variables through generalized linear models and penalized likelihood (cf. Tibshirani, 1996; Fu, 2003; Park and Hastie, 2006) and to censored outcome data through the proportional hazards (PH) model (Cox, 1972) and partial likelihood (Tibshirani, 1997; Park and Hastie, 2006; Zhang and Lu, 2007). However, neither the technical arguments nor the optimization algorithms apply to general penalized M- and Z-estimators because a convex loss function is absent and the Hessian may not be directly estimable. Recently, authors (Johnson, 2005; Johnson, Lin, and Zeng, 2007) extended the earlier notion of a penalized estimating function (Fu, 2003) and showed that many of the asymptotic results for penalized likelihood (Zou, 2006; Zhang and Lu, 2007) do, in fact, hold for a wide class of semi-parametric models under modest regularity conditions.
Despite these theoretical advances for penalized estimating functions and variable selection in semi-parametric models, efficient computational strategies are lacking and must typically be considered on a case-by-case basis.
Fan and Li (2001) proposed an algorithm that yields a local solution to a general constrained optimization problem using Newton-type steps. In the case of l1 penalties on the regression coefficients, Tibshirani (1996) dismissed this same algorithm as inefficient, especially when compared to linear and quadratic programming (QP) for the same constrained optimization problem. A survey of the literature on l1-regularization suggests that QP is the preferred method (e.g. Tibshirani, 1996, 1997; Yuan and Lin, 2006); furthermore, the Karush-Kuhn-Tucker conditions implied by the primal (and its dual) are key steps in deriving the entire l1-regularized solution path (Osborne et al., 2000; Efron et al., 2004; Park and Hastie, 2005; Yuan and Lin, 2006). Hence, we infer that QP is the gold standard for optimizing a general loss function subject to l1 constraints on the regression coefficients (however, see also Fu, 1998; Friedman et al., 2007). Unfortunately, it seems unlikely that QP techniques can be used for solving general penalized estimating functions. However, this paper considers one exceptional class of estimating function where QP may be used and shows how the l1-regularized estimating function can be efficiently computed using standard software. Unlike hazards regression for censored data, the accelerated failure time model is based on the linear model, a cornerstone of statistical modeling. This has led many prominent statisticians, most notably Sir D. R. Cox, to observe that in the accelerated failure time (AFT) model the estimated regression coefficients have "a rather direct physical interpretation" (Reid, 1994, p. 450). Moreover, it is well known that the PH and AFT models cannot simultaneously hold except in the case of extreme value error distributions.
Therefore, two reasons why statisticians give the AFT model serious consideration in censored data regression are (i) the direct interpretability of the regression coefficients, and (ii) the fact that the AFT model assumptions can hold when the PH model assumptions fail. These reasons have led instructors to include the AFT model as part of the standard graduate curriculum in statistics (cf. Kalbfleisch and Prentice, 2002) and have also led to numerous extensions and applications in biometry and econometrics. Therefore, addressing effective
variable selection and estimation strategies in the AFT model is a worthwhile goal. In this paper, we propose a QP algorithm for constrained estimation in the rank-based AFT model. In Section 2, we briefly review the AFT model assumptions and some recent contributions to variable selection methods within this modeling framework. This paper is concerned with a very special class of penalized estimating function derived from linear rank tests for censored data (Prentice, 1978; Tsiatis, 1990). The proposed methods are detailed in Section 3 and couple a novel estimation strategy by Jin, Lin, Wei, and Ying (2003) with a computational trick for penalized least absolute deviation (lad) regression (Wang, Li, and Jiang, 2007). Interestingly, the same trick can be used for a natural extension of the lad-lasso (Wang et al., 2007) to censored data regression (Zhou, 1992; Huang, Ma, and Xie, 2007), which we describe in Section 3.3. We demonstrate the utility of the proposed methods through two examples in Section 4.

2 Background

Consider the linear regression model

    y_i = x_i'\beta + \varepsilon_i,  (i = 1, ..., n),   (1)

where y_i is the response variable and x_i is a d-vector of fixed predictors for the i-th subject, \beta is a d-vector of regression coefficients, and (\varepsilon_1, ..., \varepsilon_n) are independent and identically distributed errors with absolutely continuous density f. Here, we assume that the predictors have been standardized to have mean zero and unit variance. The familiar lasso estimator for \beta is given by the minimizer of the objective function

    \|y - X\beta\|^2 + \lambda \sum_{j=1}^d |\beta_j|,   (2)
where y = (y_1, ..., y_n)', X = (x_1, ..., x_n)', and \lambda is a user-specified regularization parameter. The lasso solution to (2) is equivalent to the constrained optimization problem

    \min_\beta \|y - X\beta\|^2,  subject to  \sum_{j=1}^d |\beta_j| \le \tau,   (3)

for a user-specified parameter \tau. We note that there is a one-to-one correspondence between \lambda in (2) and \tau in (3); expression (2) is sometimes referred to as the Lagrangian equivalent of (3) (e.g. Friedman et al., 2007). Now, define the random observables

    Z_i = \min(y_i, C_i),  \delta_i = I(y_i \le C_i),  (i = 1, ..., n),

where C_i is a random censoring variable for the i-th subject and I(\cdot) denotes the indicator function. The goal is to estimate the regression coefficients \beta for a user-defined parameter \lambda (or \tau) using the observed data \{(Z_i, \delta_i, x_i), i = 1, ..., n\}. A natural extension of the PLS estimators via (2) to censored data is through Buckley and James (1979) statistics, whereby one essentially replaces the censored observation (y_i, \delta_i = 0) with an imputed value, say \hat{y}_i, and proceeds as usual. This method has been applied in some applications (Johnson, 2005; Datta et al., 2007) with varying levels of success. It is well known that Buckley-James statistics may admit multiple solutions in finite samples even if they appear to work well in simulation studies. At best, the solution to a penalized Buckley-James statistic can only be an approximate one (see Johnson et al., 2007). Therefore, it is desirable to consider other statistics where one can place more confidence in the coefficient estimates. To the best of our knowledge, there is one other research team actively developing methods for variable selection in AFT models (excluding groups working on modified Buckley-James statistics). Huang and colleagues (Huang et al., 2006; Huang et al., 2007; Xie and Huang, 2008) are currently developing methods based on lad regression for censored data.
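The paper solves problems of the form (2) and (3) by linear and quadratic programming via quantreg in R. Purely as an illustration of the lasso objective in (2), the sketch below minimizes it by cyclic coordinate descent with soft-thresholding in numpy; the function names are ours, not the paper's algorithm.

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding operator: the scalar solution of the one-dimensional lasso subproblem."""
    return np.sign(z) * max(abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Minimize ||y - X b||^2 + lam * sum_j |b_j| by cyclic coordinate descent.
    Assumes the columns of X are standardized, as in Section 2."""
    n, d = X.shape
    b = np.zeros(d)
    col_ss = (X ** 2).sum(axis=0)               # per-column sum of squares
    for _ in range(n_iter):
        for j in range(d):
            r_j = y - X @ b + X[:, j] * b[j]    # partial residual excluding coordinate j
            z = X[:, j] @ r_j
            b[j] = soft_threshold(z, lam / 2.0) / col_ss[j]
    return b
```

For large \lambda every coordinate is thresholded to zero, reproducing the shrinkage-to-zero behavior that motivates (2).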
One advantage of their approach is that it is computationally trivial to implement: weighted lad regression of Z = (Z_1, ..., Z_n)' on X, where the
weights (Zhou, 1992; Bang and Tsiatis, 2002; Zhou, 2005) are functions of the observed data and are calculated via the Kaplan-Meier estimator (Kaplan and Meier, 1958). A simple extension of a recent numerical trick yields the lasso solution for Huang's lad estimator for censored data; this is briefly described in Section 3.3. Previously, Huang et al. (2006) used gradient-directed search algorithms (Friedman and Popescu, 2005) to produce a similar solution. Recently, Johnson (2005) considered model selection within the rank-based AFT model framework. We note that Johnson's (2005) earlier methods are very different from the current proposal because he attempted to solve the penalized estimating function for a general penalty (i.e. Fan and Li, 2001). Penalized rank-based estimation in the AFT model is not trivial even without the task of variable selection; this difficulty is due, in part, to the fact that the original unpenalized estimating function is neither continuous nor component-wise monotone for general weight functions. Subsequently, the penalized estimating function is also very difficult to solve for general weight functions and general penalty functions. However, if we restrict our attention to l1 penalty functions, the story is quite different! The fact that an otherwise slippery l1-regularized estimating function can be efficiently solved using QP is an important finding, and this result forms our thesis in Section 3.

3 Methods

The weighted log-rank estimating function (Prentice, 1978; Tsiatis, 1990) is defined as

    \Psi^o_\phi(\beta) = \sum_{i=1}^n \phi\{e_i(\beta), \beta\} [x_i - \bar{x}\{e_i(\beta), \beta\}],

where e_i(\beta) = Z_i - x_i'\beta, \phi is a possibly data-dependent weight function satisfying condition A7 of Johnson (2005, Appendix 1),

    S^{(0)}(t, \beta) = n^{-1} \sum_{j=1}^n I\{e_j(\beta) \ge t\},
    S^{(1)}(t, \beta) = n^{-1} \sum_{j=1}^n x_j I\{e_j(\beta) \ge t\},
and \bar{x}(t, \beta) = S^{(1)}(t, \beta) / S^{(0)}(t, \beta). Define the penalized, weighted log-rank estimating function as

    \Psi_\phi(\beta) = \Psi^o_\phi(\beta) + n (\lambda_1 \mathrm{sgn}(\beta_1), ..., \lambda_d \mathrm{sgn}(\beta_d))',   (4)

where (\lambda_1, ..., \lambda_d) are coefficient-dependent regularization parameters (Zou, 2006; Wang et al., 2007; Zhang and Lu, 2007). The proposed estimator \hat\beta_\phi is defined as a consistent solution to the estimating equations \Psi_\phi(\beta) = 0. Two weight functions of substantial interest are \phi(t, \beta) = 1 and \phi(t, \beta) = S^{(0)}(t, \beta), which correspond to the log-rank (Mantel, 1966) and Gehan (1965) weights, respectively. Define the estimator \hat\beta^o_\phi as a consistent solution to the original estimating equations \Psi^o_\phi(\beta) = 0. It has been established that, under suitable regularity conditions, the random vector n^{1/2}(\hat\beta^o_\phi - \beta_0) converges in distribution to a mean-zero Gaussian random vector with covariance matrix A_\phi^{-1} B_\phi A_\phi^{-1}, where

    A_\phi = \lim_{n\to\infty} n^{-1} \sum_{i=1}^n \int \phi(t, \beta_0) \{x_i - \bar{x}(t, \beta_0)\}^{\otimes 2} \{\dot\lambda(t)/\lambda(t)\} dN_i(t, \beta_0),
    B_\phi = \lim_{n\to\infty} n^{-1} \sum_{i=1}^n \int \{\phi(t, \beta_0)\}^2 \{x_i - \bar{x}(t, \beta_0)\}^{\otimes 2} dN_i(t, \beta_0),

N_i(t, \beta) = I\{e_i(\beta) \le t, \delta_i = 1\}, \lambda(t) is the hazard function of the errors e_i(\beta_0), \dot\lambda(t) = (d/dt)\lambda(t), and M^{\otimes 2} = M M', the Kronecker product of M with itself. Throughout this paper, we shall assume that A_\phi is nonsingular.

3.1 Estimation and inference with Gehan-type weight functions

In the case where \phi(t, \beta) = S^{(0)}(t, \beta), the estimating function \Psi_\phi(\beta) in (4) simplifies to

    \Psi_G(\beta) = n^{-1} \sum_{i=1}^n \sum_{j=1}^n \delta_i (x_i - x_j) I\{e_i(\beta) \le e_j(\beta)\} + n (\lambda_1 \mathrm{sgn}(\beta_1), ..., \lambda_d \mathrm{sgn}(\beta_d))',   (5)
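To make the Gehan form concrete, the unpenalized double sum in (5) can be evaluated directly. The following is a minimal numpy sketch (the function name is ours; it materializes n-by-n arrays, so it assumes modest n):

```python
import numpy as np

def gehan_score(beta, Z, delta, X):
    """Unpenalized Gehan estimating function:
    n^{-1} sum_i sum_j delta_i (x_i - x_j) I{e_i(beta) <= e_j(beta)}."""
    n = len(Z)
    e = Z - X @ beta                          # residuals e_i(beta)
    ind = e[:, None] <= e[None, :]            # indicator I{e_i <= e_j}
    diff = X[:, None, :] - X[None, :, :]      # (x_i - x_j), shape (n, n, d)
    w = delta[:, None] * ind                  # delta_i times the indicator
    return (w[:, :, None] * diff).sum(axis=(0, 1)) / n
```

For a two-point toy data set with Z = (0, 1), x = (0, 1), both uncensored, the score at \beta = 0 is (x_1 - x_2)/2 = -0.5, which the sketch reproduces.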
where (5) follows from the definition of \bar{x}\{e_i(\beta), \beta\}. It is easy to check that \Psi_G(\beta) in (5) is the gradient of the following function:

    Q_G(\beta) = L_G(\beta) + n \sum_{j=1}^d \lambda_j |\beta_j|,   (6)
    L_G(\beta) = n^{-1} \sum_{i=1}^n \sum_{j=1}^n \delta_i \{e_i(\beta) - e_j(\beta)\}^-,   (7)

where \{c\}^- = \max(-c, 0). This leads to the following optimization problem for the proposed estimator \hat\beta_G:

    \min_{u, \beta} \sum_{i=1}^n \sum_{j=1}^n \delta_i u_{ij}   (8)
    s.t.  u_{ij} \ge 0,  u_{ij} \ge -\{e_i(\beta) - e_j(\beta)\}, for all i, j,
          |\beta_k| \le \tau_k,  \tau_k \ge 0,  k = 1, ..., d,

where (\tau_1, ..., \tau_d) are the coefficient-dependent constraints and correspond to (\lambda_1, ..., \lambda_d) in (6). It is of practical interest to know how one can solve (8) using standard software (namely, the quantreg package in R). Jin et al. (2003) note that minimizing L_G(\beta) (without variable selection) is equivalent to minimizing

    \sum_{i=1}^n \sum_{j=1}^n \delta_i |e_i(\beta) - e_j(\beta)| + |M - \beta' \sum_{k=1}^n \sum_{l=1}^n \delta_k (x_l - x_k)|,   (9)

for a large number M. A standard technique for solving (9) is to construct the n^2-dimensional pseudo-response vector W = (W_1', ..., W_n', W_{n+1})' and pseudo-design matrix \Omega = (\Omega_1', ..., \Omega_n', \omega_{n+1})', where

    W_i = \delta_i (Z_i - Z_1, Z_i - Z_2, ..., Z_i - Z_n)',  i = 1, ..., n,
    \Omega_i = \delta_i [(x_i - x_1), (x_i - x_2), ..., (x_i - x_n)]'.

The last elements of the response vector and design matrix are W_{n+1} = M and \omega_{n+1} = \sum_k \sum_l \delta_k (x_l - x_k), respectively. Finally, the optimization of L_G(\beta) is accomplished through the median regression of W on \Omega via quantreg in R. The numerical trick to solve (6) is to construct the
(n^2 + d)-dimensional pseudo-response vector W^* = (W', 0')' and (n^2 + d) \times d pseudo-design matrix \Omega^* = [\Omega', n^2 \mathrm{diag}(\lambda_1, ..., \lambda_d)]'. Then, one can efficiently compute the minimizer of (6) via median regression of W^* on \Omega^*. A similar numerical trick was offered by Wang et al. (2007) in lad regression for uncensored data. Theorem 1 states the main theoretical result for the Gehan-type estimator, including the existence of an n^{1/2}-consistent estimator, the sparsity of the estimator, and the asymptotic normality of the estimator. Let A denote the indices of the predictors in the true model, i.e. A = \{j : \beta_{0j} \ne 0\}. The vector of true values is denoted \beta_0. For simplicity, define the j-th regularization parameter \lambda_j = \pi_j \lambda for all j = 1, ..., d, where \pi_j = 1/|\hat\beta^o_{G,j}|, and \hat\beta^o_G = (\hat\beta^o_{G,1}, ..., \hat\beta^o_{G,d})' was defined as the solution to the unpenalized Gehan estimating equation \Psi^o_G(\beta) = 0. This weighting scheme is sometimes referred to as adaptive lasso (alasso) weighting (Zou, 2006; Zhang and Lu, 2007) and similarly motivated the choice of regularization parameters (\lambda_1, ..., \lambda_d) in Wang et al. (2007).

Theorem 1. Assume the regularity conditions A1-A6 of Johnson (2005). If n^{1/2} \lambda_n \to 0 and n \lambda_n \to \infty, then \|\hat\beta_G - \beta_0\| = O_p(n^{-1/2}), \lim_{n\to\infty} P(\hat\beta_{G,j} = 0) = 1 for every j \notin A, and n^{1/2}(\hat\beta_{G,A} - \beta_{0,A}) \to_d N(0, \Gamma_{G,A}), where \Gamma_G = A_G^{-1} B_G A_G^{-1} and \Gamma_{G,A} is the sub-matrix containing only the elements of \Gamma_G whose indices belong to A.

Remark 1. As pointed out by Zou (2006), Johnson et al. (2007), Zhang and Lu (2007), and Wang et al. (2007), the weight \pi_j is the key to obtaining the oracle property. In particular, conditions A1-A6 of Johnson (2005) imply condition C.2(i) of Johnson et al. (2007) and prevent the j-th element of the penalized estimating function from being dominated by the penalty term, \lambda_j \mathrm{sgn}(\beta_j), for \beta_{0j} \ne 0, because n^{1/2} \lambda_n \pi_j \mathrm{sgn}(\beta_j) vanishes. However, if \beta_{0j} = 0, n^{1/2} \lambda_n \pi_j \mathrm{sgn}(\beta_j) diverges to +\infty
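The construction of the augmented pseudo-data is mechanical and easy to get wrong by hand. The sketch below assembles W^* and \Omega^* in Python (the paper performs the subsequent median regression with quantreg in R); the artificial Jin et al. (2003) observation with response M is included, so the stacked system has n^2 + 1 + d rows. The function name and the default value of M are our choices.

```python
import numpy as np

def gehan_lasso_pseudodata(Z, delta, X, lam, M=1e6):
    """Build the augmented pseudo-response W* and pseudo-design Omega* of
    Section 3.1; median regression of W* on Omega* then yields the
    l1-regularized Gehan estimator.  lam is (lambda_1, ..., lambda_d)."""
    n, d = X.shape
    resp, rows = [], []
    for i in range(n):
        for j in range(n):
            resp.append(delta[i] * (Z[i] - Z[j]))    # delta_i (Z_i - Z_j)
            rows.append(delta[i] * (X[i] - X[j]))    # delta_i (x_i - x_j)
    # artificial observation of Jin et al. (2003): response M, covariate
    # sum_k sum_l delta_k (x_l - x_k)
    v = np.zeros(d)
    for k in range(n):
        for l in range(n):
            v += delta[k] * (X[l] - X[k])
    resp.append(M)
    rows.append(v)
    # lasso augmentation: d extra rows, the j-th equal to n^2 lambda_j e_j
    lam = np.asarray(lam, dtype=float)
    for j in range(d):
        resp.append(0.0)
        rows.append(n ** 2 * lam[j] * np.eye(d)[j])
    return np.asarray(resp), np.asarray(rows)
```

Rows belonging to censored subjects (delta_i = 0) are identically zero and contribute nothing to the median regression, matching the delta_i factor in (9).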
or -\infty, depending on the sign of \beta_j in a small neighborhood of \beta_{0j}. This is due to the fact that the weights are conveniently defined as \pi_j = 1/|\hat\beta^o_{G,j}|. From here, it is easy to see that, because n^{1/2}(\hat\beta^o_{G,j} - \beta_{0j}) = O_p(1), we have n^{1/2} \lambda_n \pi_j = n \lambda_n / |n^{1/2} \hat\beta^o_{G,j}| \to \infty in probability when \beta_{0j} = 0. The proof of Theorem 1 can be adapted from any number of proofs where one begins with a convex loss function and appends the lasso penalty (e.g. Zou, 2006; Zhang and Lu, 2007). The similarity of this proof to earlier proofs stems from the monotonicity of \Psi^o_G(\beta) and the fact that we have a convex objective function, L_G(\beta). Heuristically, L_G(\beta) plays the role of the negative log-likelihood and the remaining arguments follow in a straightforward fashion. Hence, the proof is omitted.

3.2 Estimation and inference for general weight functions

In the absence of variable selection, Jin et al. (2003) proposed a novel, iteratively reweighted optimization strategy to solve the estimating function \Psi^o_\phi(\beta). We exploit their strategy here to provide an elegant solution to the general l1-regularized, weighted logrank estimating function \Psi_\phi(\beta) using QP. Define the convex objective function

    Q_\phi(\beta; \tilde\beta) = L_\phi(\beta; \tilde\beta) + n \sum_{j=1}^d \lambda_j |\beta_j|,
    L_\phi(\beta; \tilde\beta) = n^{-1} \sum_{i=1}^n \sum_{j=1}^n \gamma\{e_i(\tilde\beta), \tilde\beta\} \delta_i \{e_i(\beta) - e_j(\beta)\}^-,

where \tilde\beta is a preliminary consistent estimator for \beta_0 and \gamma(t, \beta) = \phi(t, \beta)/S^{(0)}(t, \beta). Note that Q_\phi(\beta; \tilde\beta) is just like Q_G(\beta) but with weights \gamma which do not depend on \beta. Using similar reasoning to Jin et al. (2003), it is easy to see that Q_\phi(\beta; \tilde\beta) is a convex function subject to l1 constraints on the regression coefficients and can be efficiently minimized using QP along the lines discussed earlier. Note that the weights \gamma(t, \beta) are used to scale W and \Omega, but not the regularization parameters (\lambda_1, ..., \lambda_d). The definitions of W^* and \Omega^* follow as before, and we subsequently employ quantreg. We offer
the following iterative algorithm for estimating \beta_0: \hat\beta^{[k]} = \arg\min_\beta Q_\phi(\beta; \hat\beta^{[k-1]}), k \ge 1, with \hat\beta^{[0]} = \hat\beta_G. If \hat\beta^{[k]} converges to a limit as k \to \infty, then the limit must satisfy \Psi_\phi(\beta) = 0, i.e. it is a solution to the regularized estimating function.

Theorem 2. Assume the regularity conditions A1-A8 of Johnson (2005). If n^{1/2} \lambda_n \to 0 and n \lambda_n \to \infty, then \|\hat\beta_\phi - \beta_0\| = O_p(n^{-1/2}), \lim_{n\to\infty} P(\hat\beta_{\phi,j} = 0) = 1 for every j \notin A, and n^{1/2}(\hat\beta_{\phi,A} - \beta_{0,A}) \to_d N(0, \Gamma_{\phi,A}), where \Gamma_\phi = A_\phi^{-1} B_\phi A_\phi^{-1} and \Gamma_{\phi,A} is the sub-matrix containing only the elements of \Gamma_\phi whose indices belong to A.

Remark 2. As noted in Remark 1, the definition of the weight \pi_j is important; it can simply be defined as \pi_j = 1/|\hat\beta^o_{\phi,j}| for a general weight function \phi. However, there is no difficulty in using the following definition: \pi_j = 1/|\hat\beta^o_{G,j}|, that is, the unpenalized Gehan estimates. This is due to the fact that the Gehan estimates are themselves an n^{1/2}-consistent solution to the estimating equation \Psi^o_\phi(\beta) = 0. In the sequel, we use the latter definition of \pi_j with the Gehan estimates, \hat\beta^o_{G,j}. The proof of Theorem 2 does not follow as straightforwardly as that of Theorem 1. The principal difficulty lies in the non-monotonicity of \Psi^o_\phi(\beta) and, subsequently, the fact that we do not start with a well-behaved convex function like L_G(\beta) to optimize. Recently, Johnson (2005) obtained similar conclusions to those in Theorem 2 for penalized weighted logrank estimating functions with nonconcave penalty (Fan and Li, 2001). Johnson's arguments can be extended to the current situation with no significant difficulty. Now, however, an even more general theory for penalized estimating
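Each step of the iteration needs the weights \gamma\{e_i(\tilde\beta), \tilde\beta\} = \phi/S^{(0)} evaluated at the current residuals. For the log-rank choice \phi = 1, \gamma is simply the reciprocal of the at-risk fraction; a small numpy sketch of that computation (our helper; a full iteration step would pass these weights into the weighted Gehan minimization):

```python
import numpy as np

def logrank_gamma(beta_tilde, Z, X):
    """gamma(t, beta) = phi(t, beta) / S^(0)(t, beta) at the residuals
    e_i(beta_tilde), for the log-rank weight phi = 1, where
    S^(0)(t, beta) = n^{-1} sum_j I{e_j(beta) >= t} is the at-risk fraction."""
    e = Z - X @ beta_tilde
    # s0[i] = fraction of residuals e_j with e_j >= e_i
    s0 = (e[None, :] >= e[:, None]).mean(axis=1)
    return 1.0 / s0
```

With three distinct residuals, the at-risk fractions are 1, 2/3, and 1/3, so the weights are 1, 1.5, and 3; the smallest residual receives the smallest weight, as expected for log-rank reweighting.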
function can be used to establish conclusions like those contained in Theorems 1-2 (see Johnson et al., 2007, Theorem 1). Hence, the proof here is omitted, but we refer interested readers to the aforementioned papers.

3.3 An extension of the lad-lasso to censored data

One method of handling censored data comes from the missing data literature (Horvitz and Thompson, 1952), whereby one inversely weights an observed response by one minus the probability of missingness. This method has been successfully applied to censored data problems by Tsiatis and colleagues (cf. Zhao and Tsiatis, 1997; Bang and Tsiatis, 2002), among others. Let K(t) be the survivor distribution of the censoring random variable at time t, i.e. K(t) = P(C > t). Then, one can draw inference on \beta through the weighted loss function

    \sum_{i=1}^n w_i |Z_i - x_i'\beta|,

where w_i = \delta_i / K(Z_i). If K(t) is unknown and censoring is assumed to be independent of the covariate process, it may be nonparametrically estimated via the Kaplan-Meier estimator using the data \{(Z_i, 1 - \delta_i), i = 1, ..., n\}. Under the weaker assumption that the censoring distribution may depend on the covariate process, K(t) may be estimated semi-parametrically through the PH model, for example. We note that our definition of the weight w_i is different from the definition given by Huang et al. (2007), although one can show they are equivalent using a standard argument from survival analysis. Naturally, one can estimate \beta by regressing the scaled response Z_w = (w_1 Z_1, ..., w_n Z_n)' on a similarly scaled design matrix X_w via quantreg. Now, suppose one wishes to solve the constrained lad-lasso for censored data, that is, to minimize the following objective function:

    \sum_{i=1}^n w_i |Z_i - x_i'\beta| + n \sum_{j=1}^d \lambda_j |\beta_j|.   (10)
This can be accomplished using the same technique proposed by Wang et al. (2007). Namely, define the (n + d)-dimensional pseudo-response Z_w^* = (Z_w', 0')' and the (n + d) \times d pseudo-design matrix X_w^* = [X_w', n \mathrm{diag}(\lambda_1, ..., \lambda_d)]'. Finally, regress the pseudo-response Z_w^* on the pseudo-design X_w^* using quantreg. In this paragraph, we outline the asymptotic properties of the minimizer of (10). Let \hat\beta^o = (\hat\beta^o_1, ..., \hat\beta^o_d)' be the lad regression estimator for censored data. Define \hat\beta as the minimizer of (10) with \lambda_j = \lambda \pi_j, \pi_j = 1/|\hat\beta^o_j|. Then, under suitable regularity conditions, it can be shown that if n^{1/2} \lambda_n \to 0 and n \lambda_n \to \infty, then \|\hat\beta - \beta_0\| = O_p(n^{-1/2}), \lim_{n\to\infty} P(\hat\beta_j = 0) = 1 for every j \notin A, and n^{1/2}(\hat\beta_A - \beta_{0,A}) converges in distribution to a mean-zero Gaussian random vector with covariance V_A, where V is the asymptotic variance-covariance matrix of \hat\beta^o and V_A is the sub-matrix containing only the elements of V whose indices belong to A (Johnson et al., 2007, Theorem 1). Hence, the lad regression estimator for censored data in Huang et al. (2007) can be shown to possess an oracle property with weights (\pi_1, ..., \pi_d).

3.4 Parameter Tuning

For penalized least squares and penalized likelihood, it is fairly straightforward to implement cross-validation or generalized cross-validation (e.g. Tibshirani, 1996, 1997; Fan and Li, 2001), the Akaike information criterion (AIC; Akaike, 1973), or the Bayesian information criterion (BIC; Schwarz, 1978; Zou, Hastie, and Tibshirani, 2004). In the absence of a likelihood or an obvious loss function, defining a useful model selection criterion can be problematic. We summarize two strategies below: one based on viewing L_G(\beta) as a dispersion criterion which can then be used in cross-validation, and another based on statistical rules-of-thumb.
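The inverse-probability-of-censoring weights w_i = \delta_i / \hat{K}(Z_i) of this section can be computed from a Kaplan-Meier fit to the censoring indicator 1 - \delta_i. Below is a minimal numpy sketch assuming distinct observation times; evaluating \hat{K} at the left limit Z_i- (so an observed failure at a censoring time is not zero-weighted) is our convention, not spelled out in the text.

```python
import numpy as np

def ipw_weights(Z, delta):
    """Weights w_i = delta_i / K(Z_i-), where K(t) = P(C > t) is the
    Kaplan-Meier estimate of the censoring survivor function fitted to
    {(Z_i, 1 - delta_i)}.  Assumes distinct observation times."""
    n = len(Z)
    order = np.argsort(Z)
    censored = 1 - delta[order]        # censoring "event" indicator, time-ordered
    K = np.empty(n)
    surv = 1.0
    for i in range(n):
        K[i] = surv                    # K just before the i-th ordered time
        if censored[i] == 1:
            surv *= 1.0 - 1.0 / (n - i)
    w = np.zeros(n)
    w[order] = np.where(censored == 1, 0.0, 1.0 / K)
    return w
```

A convenient sanity check is that, with this convention, the weights sum to n in small examples with no late censoring: censored subjects get weight zero and uncensored subjects are up-weighted to compensate.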
Tuning via cross-validation. For simplicity of notation, we drop the subscript \phi for the weight function and let \hat\beta_\lambda denote an arbitrary regularized estimator with tuning parameter \lambda. The traditional BIC criterion (Zou et al., 2004) for model selection is defined as

    BIC_L(\lambda) = -2 \ell_n(\hat\beta_\lambda) + \log n \cdot d(\lambda),   (11)

where \ell_n(\beta) is the log-likelihood and d(\lambda) = |A(\lambda)|, the cardinality of the active set A(\lambda) for \hat\beta_\lambda. The definition in (11) extends naturally to other loss functions, such as squared error loss and absolute error loss. Furthermore, the AIC_L(\lambda) criterion is similarly defined but with 2 replacing \log n in the second expression on the right-hand side of (11). By this point, it is well established that BIC is asymptotically consistent in model selection but tends to produce models that are too sparse in finite samples. For penalized estimating functions, there is no natural substitute for \ell_n(\beta). However, in the rank-based regression models, L_G(\beta) may be a legitimate candidate loss function. In the absence of censoring, L_G(\beta) reduces to Jaeckel's (1972) convex dispersion function. Because L_G(\beta) can itself be viewed as a norm, it has a similar geometric interpretation to the residual sum of squares or the l1-norm. Hence, consider the criterion

    BIC(\lambda) = 2 \log L_G(\hat\beta_\lambda) + \log n \cdot d(\lambda).

Again, we define AIC(\lambda) similarly. Johnson (2005) used a version of AIC(\lambda) and showed that it worked well in practice. Another legitimate idea comes from lad regression, where a robust goodness-of-fit statistic can be some function of absolute error loss, n^{-1} \|y - \hat{y}_\lambda\|_1, where the predicted value is \hat{y}_\lambda = (\hat{y}_1, ..., \hat{y}_n)' (note: this prediction will depend on a robust estimate of E(\varepsilon_1), which can potentially be a nontrivial matter with censored data). In any case, inverse weighting methods suggest one use the large-sample approximation n^{-1} \sum_i w_i |Z_i - \hat{y}_i|. Our experience has been that cross-validating with an inversely weighted information criterion can be tricky, to say the least.
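The dispersion-based criterion above is easy to compute once L_G is available. A numpy sketch of L_G(\beta) from (7) and the resulting BIC(\lambda) follows (the helpers are ours, with d(\lambda) taken as the number of nonzero coefficients):

```python
import numpy as np

def gehan_loss(beta, Z, delta, X):
    """L_G(beta) = n^{-1} sum_i sum_j delta_i {e_i(beta) - e_j(beta)}^-,
    with {c}^- = max(-c, 0), as in (7)."""
    e = Z - X @ beta
    n = len(Z)
    # entry [i, j] is max(e_j - e_i, 0) = {e_i - e_j}^-
    return float((delta[:, None] * np.maximum(e[None, :] - e[:, None], 0.0)).sum() / n)

def gehan_bic(beta, Z, delta, X):
    """BIC(lambda) = 2 log L_G(beta_hat_lambda) + log(n) d(lambda), with
    d(lambda) counting the nonzero coefficients (the active set)."""
    n = len(Z)
    d_active = int(np.count_nonzero(beta))
    return 2.0 * np.log(gehan_loss(beta, Z, delta, X)) + np.log(n) * d_active
```

One then evaluates gehan_bic over a grid of \lambda values (each with its own fitted \hat\beta_\lambda) and selects the minimizer.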
Statistical rules-of-thumb. It is well known that cross-validation can be computationally expensive, especially in high dimensions. For this reason, several authors have proposed a variety of statistical rules-of-thumb for choosing \lambda. Recently, Wang et al. (2007) proposed a rule-of-thumb for uncensored lad regression with l1 penalty by appealing to a Bayesian argument (Tibshirani, 1996, sect. 5). Such an argument can be useful in a wide variety of settings and, hence, we summarize their strategy below. One can view a penalized likelihood estimator as a Bayesian estimator where each coefficient \beta_j has a double exponential prior with location zero and scale n\lambda_j. Then, the optimal \lambda_j is chosen to minimize the following negative posterior log-likelihood:

    -\ell_n(\beta) + \sum_{j=1}^d \{ n \lambda_j |\beta_j| - \log(n\lambda_j/2) \log(n) \}.   (12)

Wang et al. refer to the criterion in (12) as a BIC criterion; it leads to the optimal BIC tuning parameters \lambda_j = \log(n)/(n|\beta_j|). The optimal AIC parameters \lambda_j = 1/(n|\beta_j|) are derived similarly. Using our earlier notation, \lambda_j = \lambda \pi_j, which implies \lambda_{BIC} = n^{-1} \log(n) and \lambda_{AIC} = n^{-1}. For censored data, we do not expect these rules-of-thumb to be completely satisfactory because they ignore censoring; recall that, for a fixed sample size n, the information in the sample decreases as the censoring proportion increases. These parameter values may be reasonable starting points, with the understanding that they will be too small, in general, and hence result in models that are too complex. Alternatively, an ad-hoc adjustment for censoring could be as simple as dividing by the uncensored proportion, e.g. \lambda_{AIC} = (n\pi_U)^{-1} and \lambda_{BIC} = (n\pi_U)^{-1} \log n, with \pi_U = P(\delta = 1). We are currently investigating other strategies for selecting the regularization parameters in general penalized estimating functions that can be motivated along other lines of reasoning.
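The rules-of-thumb and the ad-hoc censoring adjustment amount to one line of arithmetic; a small sketch (function name ours):

```python
import numpy as np

def rule_of_thumb_lambdas(n, pi_u=1.0):
    """Rules-of-thumb of Section 3.4: lambda_AIC = 1/n and lambda_BIC = log(n)/n,
    with the ad-hoc censoring adjustment replacing n by n * pi_u in the
    denominator, where pi_u = P(delta = 1) is the uncensored proportion."""
    n_eff = n * pi_u
    return {"aic": 1.0 / n_eff, "bic": np.log(n) / n_eff}
```

For example, with n = 100 and no censoring, the AIC rule gives 0.01; with 25% censoring (pi_u = 0.75), both parameters grow by a factor of 4/3, reflecting the reduced information in the sample.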
4 Examples

4.1 Mayo Primary Biliary Cirrhosis Study

We consider the Mayo primary biliary cirrhosis (PBC) data (Fleming and Harrington, 1991, Appendix D.1). The data set contains information about the survival times and prognostic variables of 418 patients who were eligible to participate in a randomized study of the drug D-penicillamine. Of the 418 patients who met standard eligibility criteria, a total of 312 patients participated in the randomized portion of the study. The investigators used stepwise deletion to build a Cox proportional hazards model for the natural history of PBC (Dickson et al., 1989). Our summary of this data set is for descriptive purposes only and is intended to illustrate the continuous shrinkage and estimation feature of the proposed method. A more sophisticated analysis of the Mayo PBC data would include the longitudinal observations and perhaps some level of joint modeling; such a detailed analysis goes beyond the thesis of this paper. At the same time, the Mayo PBC data set seems a reasonable choice for illustrating the operating characteristics of estimators based on the AFT model, as Lin et al. (1993) have argued that the PH model does not fit the data well. Our summary of the Mayo PBC data consists of two analyses. First, we analyze a data set of 418 patients using the five predictors that define the natural history model: age, log(albumin), log(bilirubin), edema, and log(protime); see Fleming and Harrington (1991, Table 4.6.3). These five predictors have already been preselected to be highly correlated with survival (Dickson et al., 1989); hence, our goal is not variable selection per se but rather to contrast the differences between the Gehan and logrank coefficient estimates for each of lasso and alasso. Some authors (e.g.
Yuan and Lin, 2006) have suggested a connect-the-dots approximation to exact coefficient paths, whereby one repeats the optimization algorithm over a fine grid of regularization parameters, \{\lambda_1, ..., \lambda_M\}, and simply draws line segments between adjacent coefficient estimates. We calculated
[Figure 1 appears here, with four panels: (a) Gehan lasso, (b) logrank lasso, (c) Gehan alasso, (d) logrank alasso; each panel plots coefficient estimates against log lambda.]

Figure 1: Approximate coefficient paths for five independent variables in the natural history model using the Mayo primary biliary cirrhosis data. See text for which lines correspond to which independent variables.
this approximated coefficient path for the five predictors in the natural history model for each of four estimators (Gehan and logrank, each with lasso and alasso), using K = 5 steps in the logrank estimator. The results are summarized over the four displays in Figure 1. The same model, data set, and rank-based estimators were considered by Jin et al. (2003); however, no regularized estimation procedures were considered there. The unpenalized rank-based estimators correspond to \lambda = 0, that is, the left-hand side of each display in Figure 1. Each display in Figure 1 contains five lines, one for each independent variable: age (solid black line), albumin (dashed red line), bilirubin (dotted green line), edema (royal blue broken line), and protime (teal dashed line). One very insightful contribution of the coefficient paths in Figure 1 is that one can assess the relative importance of each variable compared to the remaining variables in the active set by observing which variable(s) are forced out as \lambda increases incrementally. For the Gehan estimator with lasso penalty, age is forced out first, followed by protime second and albumin third. For the logrank estimator with lasso penalty, protime is out first, followed by age second, then edema third and albumin fourth. For the alasso penalty, protime and albumin are forced out first for both the Gehan and logrank weight functions. Then, the Gehan path forces age out third and edema fourth, while the logrank path forces age and edema out in the reverse order. In other words, each combination of weights (\phi and \pi_j) affects the coefficient paths in subtle ways. For Gehan weight functions, it is well known that the coefficient estimates depend on the censoring distribution, and this fact is likely contributing to some of the observed differences between the Gehan and logrank coefficient estimates. Our second analysis considers ten predictors for the smaller cohort of patients (n = 312) from the randomized study.
The ten predictors are age, log(albumin), log(alkaline phosphatase), ascites, log(bilirubin), edema, hepatomegaly, log(protime), sex, and spiders. The original study investigators used stepwise deletion and the PH model on this data set to construct a natural
[Figure 2 appears here, with four panels: (a) Gehan lasso, (b) logrank lasso, (c) Gehan alasso, (d) logrank alasso; each panel plots coefficient estimates against log lambda.]

Figure 2: Approximate coefficient paths for ten independent variables in the Mayo primary biliary cirrhosis data.
[Table 1 about here: rows Age, Albumin, Alk. Phos., Ascites, Bilirubin, Edema, Hepatomegaly, Prothrombin, Sex, Spiders; columns lasso and alasso under each of the Gehan and logrank weight functions; the numeric entries did not survive extraction.]

Table 1: Order statistics for stage-wise variable selection procedures on the Mayo primary biliary cirrhosis data. Table entries refer to the order in which variables enter the rank-based, semiparametric linear model for censored data whose coefficient paths are given in Figure 2.

history model for PBC. Johnson (2005) analyzed the same data but used a simulated annealing algorithm to estimate the regression coefficients. Moreover, Johnson did not consider the alasso estimator, although it shares the same limiting distribution as the weighted logrank statistic with non-concave penalty (Fan and Li, 2001). The results of our second analysis are displayed in Figure 2.

Viewed from the opposite direction, the coefficient paths in Figure 2 may be regarded as approximate stage-wise forward selection procedures. In other words, we start with the null model for sufficiently large λ, and variables then enter the active set A(λ) one at a time as λ decreases incrementally. While our method is not a forward selection scheme in the sense of Efron et al. (2004), it may be construed as one in a loose sense (Yuan and Lin, 2006). The order in which variables enter the active set is displayed in Table 1. With the exception of the strongest predictor (bilirubin) and the weakest predictor (alk. phos.), every other variable enters the active set at a different point in the selection procedure depending on the combination of weight function φ and adaptive weights π_j. We note that the first five variables
generally correspond to the natural history model (age, albumin, bilirubin, edema, and protime), with the exception of Gehan lasso, where ascites enters the active set just before age. Of the remaining five independent variables, ascites is the next most important. The Gehan estimators then have spiders entering seventh, while logrank places spiders farther down the sequence. Regardless of the chosen combination of weights (φ and π_j), the coefficient paths can be insightful even after one has selected an optimal λ.

4.2 Simulation Study

To explore the operating characteristics of the proposed methods, we simulated 100 data sets of size n from the model

y_i = x_i'β_0 + σ ε_i, i = 1, ..., n,

where β_0 = (3, 1.5, 0, 0, 2, 0, 0, 0)', the errors ε_i are independent standard normal, and the covariates x_i are standard normal with correlation 0.5^|j−k| between the jth and kth components of x. This model was considered by Tibshirani (1996) and Fan and Li (2001). We set the censoring distribution to be uniform(0, τ), where τ was chosen to yield approximately 25% censoring. We compared the model error,

ME = (β̂_φ − β_0)' E(x x') (β̂_φ − β_0),

of the proposed penalized estimator to that of the original rank-based estimator using the ratio of median model errors (RMME). We also compared the average numbers of regression coefficients that are correctly or incorrectly shrunk to 0, that is, coefficients with |β̂_φ,j| below a small numerical tolerance. The results are presented in Table 2, where oracle pertains to the situation in which we know a priori which coefficients are non-zero. Deficiencies of the lasso are well known by now, and several authors have already compared alasso to lasso with uncensored data (e.g. Zou, 2006; Wang et al., 2007), with censored data in the PH model (Zhang and Lu, 2007), and in other missing data problems (Johnson et al., 2007). In Table 2,
Table 2: Simulation results on model selection with censored data. Table entries are the ratio of median model errors (RMME) and the average numbers of correct (C) and incorrect (I) zeros for each combination of lasso (L) or adaptive lasso (AL) penalty with AIC- or BIC-based tuning, under the Gehan and logrank weight functions.

[Table 2: rows L(AIC), L(BIC), AL(AIC), AL(BIC), and Oracle within each of four configurations — (n = 50, σ = 3, 20% censoring), (n = 50, σ = 1, 20% censoring), (n = 75, σ = 3, 40% censoring), and (n = 75, σ = 1, 40% censoring) — with columns RMME (%), C, and I under each of Gehan and logrank; the numeric entries did not survive extraction.]
we confirm that the method performs as advertised across different sample sizes and censoring distributions, and we illustrate the differences between Gehan and logrank estimates. We note that Johnson (2005) did not present any simulation results for the penalized logrank regression coefficients. We first comment that table entries for RMME are not directly comparable across weight functions (Gehan and logrank) because each is normalized by the median model error (MME) of the corresponding full-model estimate, the Gehan full-model MME and the logrank full-model MME, respectively. Having said this, penalized estimates for both the Gehan and logrank estimators have smaller model error than the full-model estimates, as one would hope. Furthermore, the advantage of using adaptive weights is more pronounced in models with large n, small error variance σ, and a few strong predictors. This result is consistent with the literature and can be seen in Table 2 for both the Gehan and logrank estimators. Finally, we note that AIC-selected models tend to be too complex, especially when compared with the BIC-selected models. Again, this result is consistent with what has already been reported in the literature; however, this is the first paper to report such results for the rank-based AFT model.

5 Remarks

This paper describes l1-regularized estimation in the rank-based accelerated failure time (AFT) model. The estimator is defined as a consistent solution to a penalized estimating function, where the estimating function satisfies modest regularity conditions. Furthermore, the operating characteristics of the proposed methods, such as root-n consistency and an oracle property, have been established elsewhere (Johnson, 2005; Johnson et al., 2007). Unlike earlier methods for variable selection in the rank-based AFT model, the proposed method optimizes a convex loss function and can be executed easily using quantreg in R by extending the
results of Jin et al. (2003). On the other hand, the weighted logrank estimating function with non-concave penalty will never correspond to the minimizer of a convex objective function, for any weight function φ. Interestingly, the numerical trick used in this paper also offers an elegant solution to the LAD-lasso for censored data (Huang et al., 2007).

When one considers statistical inference for censored data, the proportional hazards (PH) model is the most popular, and regularized variable selection in this model follows naturally from penalized likelihood theory and methods. In statistics, it is natural to want more than a single method for any given problem, as each method has built-in assumptions. For example, if the data do not support the PH assumption, the AFT model offers investigators a viable alternative. For this reason, statisticians have worked hard to overcome the many theoretical and computational challenges inherent in this model. This paper illustrates how one can extend current methods for ordinary statistical inference in the unpenalized rank-based AFT model to variable selection in the rank-based AFT model with an l1 penalty. The regression coefficients in the AFT model have a simple, direct interpretation, whereas coefficients in the PH model are interpreted on a relative-hazard scale. The latter interpretation may be awkward outside survival and lifetime regression analyses.
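The convexity just described makes the l1-penalized Gehan criterion a linear program: with e_i(β) = log y_i − x_i'β, the Gehan loss Σ_i Σ_j δ_i {e_i(β) − e_j(β)}^− plus λ Σ_k |β_k| becomes linear in the decision variables after introducing one slack u_ij per comparable pair and writing β = β⁺ − β⁻. The sketch below is a minimal illustration on simulated data with an arbitrary λ, solved with scipy's generic LP solver; it is not the paper's quantreg-based R implementation.

```python
import numpy as np
from scipy.optimize import linprog

# Simulated AFT data (illustrative only): log failure times plus an
# arbitrary censoring mechanism.
rng = np.random.default_rng(1)
n, p = 40, 3
X = rng.standard_normal((n, p))
beta_true = np.array([1.0, -1.0, 0.0])
logt = X @ beta_true + 0.5 * rng.standard_normal(n)   # log failure times
logc = X @ beta_true + rng.uniform(0.0, 2.0, n)       # hypothetical censoring
logy = np.minimum(logt, logc)
delta = (logt <= logc).astype(int)                    # event indicator

lam = 1.0  # l1 penalty level (illustrative; the path re-solves over a grid)

# LP: minimize sum of slacks u_ij + lam * ||beta||_1, where
#   u_ij >= (logy_j - logy_i) - (x_j - x_i)' beta,  u_ij >= 0,
# over pairs with delta_i = 1, and beta = bp - bm with bp, bm >= 0.
pairs = [(i, j) for i in range(n) if delta[i] == 1 for j in range(n) if j != i]
m = len(pairs)
# Variable order: [u (m slacks), bp (p), bm (p)].
c = np.concatenate([np.ones(m), lam * np.ones(2 * p)])
A = np.zeros((m, m + 2 * p))
b = np.zeros(m)
for r, (i, j) in enumerate(pairs):
    dx = X[j] - X[i]
    A[r, r] = -1.0           # -u_ij
    A[r, m:m + p] = -dx      # -(x_j - x_i)' bp
    A[r, m + p:] = dx        # +(x_j - x_i)' bm
    b[r] = -(logy[j] - logy[i])
res = linprog(c, A_ub=A, b_ub=b,
              bounds=[(0, None)] * (m + 2 * p), method="highs")
beta_hat = res.x[m:m + p] - res.x[m + p:]
```

Re-solving this LP over a grid of λ values traces the kind of approximate coefficient path shown in Section 4; Jin et al. (2003) obtain the unpenalized version of the same objective through a weighted l1 (quantile) regression, which is why quantreg applies.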
More informationUNIVERSITY OF CALIFORNIA, SAN DIEGO
UNIVERSITY OF CALIFORNIA, SAN DIEGO Estimation of the primary hazard ratio in the presence of a secondary covariate with non-proportional hazards An undergraduate honors thesis submitted to the Department
More information1 The problem of survival analysis
1 The problem of survival analysis Survival analysis concerns analyzing the time to the occurrence of an event. For instance, we have a dataset in which the times are 1, 5, 9, 20, and 22. Perhaps those
More informationPrediction & Feature Selection in GLM
Tarigan Statistical Consulting & Coaching statistical-coaching.ch Doctoral Program in Computer Science of the Universities of Fribourg, Geneva, Lausanne, Neuchâtel, Bern and the EPFL Hands-on Data Analysis
More informationModel Selection. Frank Wood. December 10, 2009
Model Selection Frank Wood December 10, 2009 Standard Linear Regression Recipe Identify the explanatory variables Decide the functional forms in which the explanatory variables can enter the model Decide
More informationRobust Variable Selection Through MAVE
Robust Variable Selection Through MAVE Weixin Yao and Qin Wang Abstract Dimension reduction and variable selection play important roles in high dimensional data analysis. Wang and Yin (2008) proposed sparse
More information