Path consistent model selection in additive risk model via Lasso

STATISTICS IN MEDICINE
Statist. Med. 2007; 26:3753–3770
Published online 16 February 2007 in Wiley InterScience (www.interscience.wiley.com)

Path consistent model selection in additive risk model via Lasso

Chenlei Leng 1 and Shuangge Ma 2,*,†

1 Department of Statistics and Applied Probability, National University of Singapore, Singapore
2 Department of Epidemiology and Public Health, Yale University, U.S.A.

SUMMARY

As a flexible alternative to the Cox model, the additive risk model assumes that the hazard function is the sum of the baseline hazard and a regression function of covariates. For right censored survival data where variable selection is needed along with model estimation, we propose a path consistent model selector using a modified Lasso approach, under the additive risk model assumption. We show that the proposed estimator possesses the oracle variable selection and estimation property. Applications of the proposed approach to three right censored survival data sets show that the proposed modified Lasso yields parsimonious models with satisfactory estimation and prediction results. Copyright © 2007 John Wiley & Sons, Ltd.

KEY WORDS: additive risk model; Lasso; oracle properties; variable selection

1. INTRODUCTION

In clinical studies, it is common to have multiple covariates measured along with censored survival outcomes. Our study is partly motivated by examples like the famous primary biliary cirrhosis (PBC) study [1], where 17 covariates along with a right censored survival outcome were recorded for 312 patients. For variable selection, Fleming and Harrington [1] applied a step-down procedure using the Wald test under the Cox model. It was shown that only a subset of the 17 covariates is in fact associated with the censored clinical outcome. From a clinical point of view, separating important covariates from unimportant ones may lead to a better understanding of the causal relationships and more interpretable results. Statistically speaking, including covariates not associated with the outcome in the model fitting leads to unstable estimates and loss of power. So for data sets like the PBC, variable selection is needed along with model estimation. In this article, we consider variable selection for right censored survival data with multiple covariates, under the additive risk model assumption.

* Correspondence to: Shuangge Ma, Department of Epidemiology and Public Health, Division of Biostatistics, Yale University, New Haven, CT, U.S.A.
† E-mail: shuangge.ma@yale.edu

Contract/grant sponsor: NUS
Contract/grant sponsor: NIH; contract/grant number: RR

Received 13 March 2006
Accepted 18 December 2006
Copyright © 2007 John Wiley & Sons, Ltd.

The additive risk model for time to event data assumes that the conditional hazard function for the failure time T of interest is associated with a p-vector covariate Z(·) via

$$\lambda(t; Z) = \lambda_0(t) + (\beta^0)' Z(t)$$

where $\lambda_0$ is the unknown baseline hazard function and $\beta^0$ is the unknown regression parameter. The additive risk model assumes that the risk factors contribute to the hazard in an additive manner and provides a useful alternative to the Cox model [2], especially when the proportional hazards assumption is violated. The additive risk model has been extensively studied and demonstrated to have satisfactory biological and statistical implications [3-8]. With right censored data and the additive risk model, Lin and Ying [7] demonstrated that the estimating equation has a simple least-squares form. This property, which is not shared by the Cox model, makes estimation with multiple covariates computationally affordable.

We are most interested in the case where some components of $\beta^0$ are extremely small or exactly zero, which are referred to as zero covariate effects (as opposed to non-zero covariate effects) in this article. In this case, it is essential to carry out variable selection and identify the important non-zero covariate effects. A comprehensive account of variable selection in linear regression can be found in [9]. Breiman [10, 11] demonstrated that traditional methods, such as best subset selection or stepwise approaches, may suffer from instability and lack of accuracy. A number of penalized approaches, where variable selection is achieved by penalizing the complexity of models, were proposed to tackle these two problems. In particular, Breiman [10] recommended the non-negative garrote estimate, which was further extended to the Lasso [12]. In right censored survival analysis, the Lasso method for the Cox model was investigated by Tibshirani [13]. Variable selection in the accelerated failure time (AFT) model via the Lasso and gradient descent regularization was studied in [14]. Fan and Li [15] proposed the SCAD estimator, which possesses the oracle property; namely, the model can be estimated as if the true sub-model were known in advance. Compared with traditional methods like the step-down approach, penalization methods have better theoretical grounding and more satisfactory finite sample performance.

Compared with the Cox model and the AFT model, studies of the additive risk model remain rare. In the gene selection context, Ma and Huang [16] proposed a simple Lasso approach to select relevant predictors for right censored survival data with multiple covariates under the additive risk model assumption. A careful examination shows that their implementation is a hybrid Lasso, in the sense that the Lasso is operated on the estimating equation. Recent theoretical studies show that the simple Lasso method is not generally path consistent [17-19], in the sense that, with probability greater than zero, the whole solution path of the Lasso may not contain the true model even if there is one, if the design matrix satisfies certain conditions. More troublesome is the fact that the Lasso does not select the right model when tuned by a prediction criterion, even if the solution path contains the true model [20]. In order to remedy those problems, Zou [17] and Wang et al.
[21] proposed a modified Lasso procedure that adaptively penalizes each coefficient and showed that the resulting estimator has the oracle property. Zou [17] further discussed various choices of adaptive penalties and demonstrated convincingly the usefulness of the improved Lasso method. In a related article, Yuan and Lin [18] showed that the non-negative garrote estimator yields path consistent models. Compared with other penalization methods, Lasso-based methods may be preferred since the objective function is convex and the computational cost is relatively small.

In this article, we consider variable selection for data like the PBC example, under the additive risk model assumption. We propose an estimating equation-based Lasso method to select relevant predictors. We show that a simple modification of the Lasso yields a path consistent estimate in terms of model selection and the root-n convergence rate for estimating the regression coefficients. The asymptotic properties of the estimator proposed by Ma and Huang [16] can in fact be established in a similar manner. Compared with [16], the proposed approach is theoretically path consistent without losing computational simplicity or sacrificing prediction power.

The rest of the article is organized as follows. Section 2 presents Lin and Ying's [7] estimator for the additive risk model, the Lasso and the modified Lasso estimates. In Section 3, we show that the modified Lasso estimator possesses the oracle property. In Section 4, we outline the fast Lars-Lasso algorithm to compute the whole solution path, which is significantly different from the L1 boosting in [16]. In addition, we propose choosing the penalty based on V-fold cross-validation. In Section 5, we analyse the PBC data and two other survival data sets. Section 6 gives a short conclusion. Proofs are relegated to Section 7.

2. ADDITIVE RISK MODEL WITH MODIFIED LASSO

2.1. Additive risk model

Consider right censored survival data, where the event time of interest T is subject to random censoring C. We also assume a length-p covariate Z is present. Consider a set of n independent subjects such that the observed counting process $\{N_i(t), t \ge 0\}$ counts the number of events for the ith subject up to time t. Let $\{Y_i(t), t \ge 0\}$ be the at-risk process for the ith individual. Further denote the cumulative baseline hazard function for $\lambda_0(t)$ as $\Lambda_0(t) = \int_0^t \lambda_0(u)\,du$. The intensity for $N_i(t)$ is then given by

$$Y_i(t)\,d\Lambda(t; Z_i) = Y_i(t)\{d\Lambda_0(t) + (\beta^0)' Z_i(t)\,dt\}$$

where $\Lambda$ is the conditional cumulative hazard function. For detailed data and model assumptions, please refer to [7]. Lin and Ying [7] proposed solving the following estimating equation to estimate $\beta^0$:

$$U(\beta) = \sum_{i=1}^n \int_0^\infty Z_i(t)\{dN_i(t) - Y_i(t)\,d\hat\Lambda_0(\beta, t) - Y_i(t)\beta' Z_i(t)\,dt\}$$

where

$$\hat\Lambda_0(\beta, t) = \int_0^t \frac{\sum_{j=1}^n \{dN_j(u) - Y_j(u)\beta' Z_j(u)\,du\}}{\sum_{j=1}^n Y_j(u)}$$

is an estimate of $\Lambda_0(t)$. Thus, the estimating equation can be written as

$$U(\beta) = \sum_{i=1}^n \int_0^\infty \{Z_i(t) - \bar Z(t)\}\{dN_i(t) - Y_i(t)\beta' Z_i(t)\,dt\}$$

where

$$\bar Z(t) = \sum_{j=1}^n Y_j(t) Z_j(t) \Big/ \sum_{j=1}^n Y_j(t)$$

The resulting estimator takes the explicit form

$$\hat\beta = \left[\sum_{i=1}^n \int_0^\infty Y_i(t)\{Z_i(t) - \bar Z(t)\}^{\otimes 2}\,dt\right]^{-1} \left[\sum_{i=1}^n \int_0^\infty \{Z_i(t) - \bar Z(t)\}\,dN_i(t)\right] = A_n^{-1} b_n \quad (1)$$

where $a^{\otimes 2} = aa'$ is a rank-one $p \times p$ matrix. Under mild regularity conditions, Lin and Ying [7] showed that the random vector $n^{-1/2} U(\beta^0)$ converges weakly to a normal variable with mean zero and a covariance matrix B, which can be consistently estimated by

$$\hat B = n^{-1} \sum_{i=1}^n \int_0^\infty \{Z_i(t) - \bar Z(t)\}^{\otimes 2}\,dN_i(t)$$

Moreover, $n^{1/2}(\hat\beta - \beta^0)$ converges weakly to a p-variate normally distributed variable with mean zero and a covariance matrix $\Sigma = A^{-1} B A^{-1}$, which can be consistently estimated by $\hat\Sigma = \hat A^{-1} \hat B \hat A^{-1}$, where

$$\hat A = n^{-1} \sum_{i=1}^n \int_0^\infty Y_i(t)\{Z_i(t) - \bar Z(t)\}^{\otimes 2}\,dt$$
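To make the estimator (1) concrete, the following is a minimal numerical sketch (our own illustration, not code from the paper), assuming time-independent covariates so that $\bar Z(t)$ is piecewise constant between the observed times; the function name lin_ying and its interface are hypothetical.

```python
import numpy as np

def lin_ying(time, event, Z):
    """Lin-Ying closed-form estimator beta_hat = A_n^{-1} b_n for the
    additive risk model, assuming time-independent covariates.
    time : (n,) observed times X_i = min(T_i, C_i)
    event: (n,) event indicators delta_i (1 = failure, 0 = censored)
    Z    : (n, p) covariate matrix
    """
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=int)
    Z = np.asarray(Z, dtype=float)
    n, p = Z.shape

    s = np.unique(time)                          # distinct observed times
    widths = np.diff(np.concatenate(([0.0], s))) # interval lengths

    A_n = np.zeros((p, p))
    b_n = np.zeros(p)
    for sk, dk in zip(s, widths):
        at_risk = time >= sk                     # Y_i(t) = 1 on (s_{k-1}, s_k]
        Zbar = Z[at_risk].mean(axis=0)           # risk-set average Zbar(t)
        D = Z[at_risk] - Zbar
        A_n += dk * D.T @ D                      # sum_i int Y_i {Z_i - Zbar}^{ox2} dt
        fail = (time == sk) & (event == 1)       # dN_i jumps at observed failures
        b_n += (Z[fail] - Zbar).sum(axis=0)      # sum_i int {Z_i - Zbar} dN_i
    beta_hat = np.linalg.solve(A_n, b_n)
    return beta_hat, A_n, b_n
```

Since the Lasso and mLasso criteria below depend on the data only through $A_n$ and $b_n$, the routine returns them alongside $\hat\beta$, e.g. `beta_hat, A_n, b_n = lin_ying(time, event, Z)`.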

2.2. Modified Lasso

In the case of multiple covariates with the additive risk model, especially when the number of covariates is large compared with the sample size, empirical studies reveal that the estimated variances for different coefficients may differ by several orders of magnitude, which indicates an unstable estimate (results not shown). Thus, dimension reduction or variable selection is needed along with estimation to remove zero covariate effects. Various dimension reduction methods such as principal component analysis [22] or partial least squares [23] can be applied. However, with dimension reduction methods, linear combinations of all covariates are usually used. This can lead to unclear interpretation of the model fitting results. Moreover, if certain covariates are not related to the outcome, it is important to remove those covariates from the model fitting. In this sense, variable selection techniques may be preferred.

Variable selection can be achieved with penalization methods. Inspired by the successes of the Lasso for the Cox and AFT models, we adopt the Lasso procedure and propose estimating $\beta$ by minimizing

$$\frac12[\beta' A_n \beta - 2\beta' b_n] + n\lambda_n \sum_{j=1}^p |\beta_j| \quad (2)$$

where $\lambda_n$ is the tuning parameter that essentially governs the bias-variance trade-off of the estimate. With a slight abuse of notation, we still denote the minimizer of (2) as $\hat\beta$. When $\lambda_n = 0$, we recover the Lin-Ying estimator in (1); when $\lambda_n = \infty$, $\hat\beta = 0$. The first part of (2) mimics the loss function in linear regression, especially if we can write $A_n = X'X$ and $b_n = X'y$, where X is the design matrix and y is the response vector. If $\beta$ is Lin-Ying's estimate, then the first part attains its minimum. So, loosely speaking, it measures the goodness-of-fit, and with a slight abuse of notation we refer to it as the loss function.

Since $A_n$ is positive definite, we can decompose it via the singular value decomposition as $A_n = V'QV$, where V is an orthogonal matrix and $Q = \mathrm{diag}(\sigma_1, \ldots, \sigma_p)$ with $\sigma_1 \ge \cdots \ge \sigma_p > 0$. By denoting $X = Q^{1/2} V$ and $y = Q^{-1/2} V b_n$, (2) takes the familiar Lasso form up to a constant:

$$\frac12 \|y - X\beta\|^2 + n\lambda_n \sum_{j=1}^p |\beta_j|$$

Note here that X is a $p \times p$ rather than an $n \times p$ matrix, and y is a p-vector, which is significantly different from the standard linear regression formulation. Ma and Huang [16] suggested replacing X by $A_n$ and y by $b_n$ and minimizing the objective function

$$\frac{1}{2n} \|b_n - A_n\beta\|^2 + n\lambda_n \sum_{j=1}^p |\beta_j| \quad (3)$$

Up to a scale constant, the objective function (3) is equivalent to

$$\frac{1}{2n} \|X'y - X'X\beta\|^2 + n\lambda_n \sum_{j=1}^p |\beta_j|$$

which can easily be seen as a variant of the proposed Lasso estimate. As pointed out in the Introduction, the formulation in (2) or (3) may fail to give a path consistent model selection result under certain assumptions on the design matrix [19]. In particular, it is possible that (1) with probability greater than zero, the whole path may not include the true parameter value; or (2) even if it is included, a prediction-based approach cannot identify the true model. If either of these scenarios happens, we conclude that the variable selection result is not consistent. A simple remedy is to replace the penalty $\sum_{j=1}^p |\beta_j|$ by a data-dependent weighted L1 norm $\sum_{j=1}^p w_j |\beta_j|$, resulting in

$$\frac12[\beta' A_n \beta - 2\beta' b_n] + n\lambda_n \sum_{j=1}^p w_j |\beta_j| \quad (4)$$

where the $w_j$'s are non-negative weights to be specified later. We refer to the estimate defined by the minimizer of (4) as the mLasso (modified Lasso) estimate and focus on its theoretical and empirical properties in this article. In the linear regression paradigm, the mLasso has been investigated in [17]. The proposed estimate for the additive risk model is partly motivated by that study. However, significant differences exist. For example, the design matrix X no longer contains independent covariates, and its dimension no longer depends on the sample size. Moreover, the mLasso estimate with the additive risk model does not have the simplified form of [12] even under an orthogonality assumption on the covariates.
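As a sketch of how (4) can be reduced to a standard Lasso fit, the snippet below builds the $p \times p$ pseudo-design from the decomposition $A_n = V'QV$, absorbs the weights by rescaling columns, and calls scikit-learn's Lasso as a generic inner solver (the paper's own algorithm, given in Section 4, is Lars-based); the function mlasso and the calibration of alpha are our assumptions.

```python
import numpy as np
from sklearn.linear_model import Lasso

def mlasso(A_n, b_n, beta0, lam, n):
    """Sketch of the mLasso estimate (4): turn the quadratic loss
    0.5*(beta' A_n beta - 2 beta' b_n) into a p x p least-squares problem
    and absorb the adaptive weights w_j = 1/|beta0_j| into the design.
    Assumes A_n is positive definite and beta0 has no zero entries."""
    w = 1.0 / np.abs(beta0)                # adaptive weights
    # eigendecomposition A_n = V' Q V (A_n symmetric positive definite)
    evals, evecs = np.linalg.eigh(A_n)
    V = evecs.T
    X = np.sqrt(evals)[:, None] * V        # X = Q^{1/2} V, so X'X = A_n
    y = (V @ b_n) / np.sqrt(evals)         # y = Q^{-1/2} V b_n, so X'y = b_n
    # weighted L1 penalty via column rescaling: beta_j = d_j / w_j
    Xw = X / w[None, :]
    # sklearn's Lasso minimizes (1/(2*m))||y - X d||^2 + alpha*||d||_1 with
    # m = p rows here, so alpha = n*lam/p matches (4) up to a constant
    p = len(b_n)
    fit = Lasso(alpha=n * lam / p, fit_intercept=False).fit(Xw, y)
    return fit.coef_ / w
```

The column rescaling is exactly the reparametrization $d_j = w_j\beta_j$ used in Section 4, so this sketch and the Lars-type algorithm there solve the same problem.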

3. PROPERTIES OF THE MLASSO ESTIMATE

As pointed out by Fan and Li [15], a good penalized approach should result in an estimator with the oracle property; namely, the estimator should perform as well as if the true underlying model were known in advance. We show in this section that the proposed mLasso estimator does possess the oracle property if the $w_j$'s are appropriately chosen. Thus, the proposed mLasso is capable of consistently identifying the true non-zero covariate effects, and can achieve the root-n convergence rate for the estimated non-zero coefficients.

Without loss of generality, write $\beta^0 = (\beta^0_1, \ldots, \beta^0_r, \beta^0_{r+1}, \ldots, \beta^0_p)^T$, where $\beta^0_{r+1}, \ldots, \beta^0_p$ are zero. Define $I = \{j : \beta^0_j \ne 0\}$. We propose setting $w_j = 1/|\hat\beta^0_j|$, where $\hat\beta^0$ is an initial non-zero, consistent estimate of $\beta^0$.

Theorem 1 (Oracle property)
If $\max_j |\hat\beta^0_j - \beta^0_j| = O_p(\delta_n)$, $\sqrt{n}\,\lambda_n \to 0$ and $\sqrt{n}\,\lambda_n/\delta_n \to \infty$, then the minimizer of (4) satisfies the oracle property as $n \to \infty$, i.e. $\hat\beta_j = 0$ for $j \notin I$, and $\{\hat\beta_j : j \in I\}$ has the same limiting distribution as when $\lambda_n = 0$.

Similarly, for the estimate in (3), one has

Corollary 2
The minimizer of (3) has the oracle property if $\max_j |\hat\beta^0_j - \beta^0_j| = O_p(\delta_n)$, $\sqrt{n}\,\lambda_n \to 0$, $\sqrt{n}\,\lambda_n/\delta_n \to \infty$, and $\sum_j |\beta_j|$ is replaced by $\sum_j |\beta_j|/|\hat\beta^0_j|$.

Intuitively, by choosing $w_j = 1/|\hat\beta^0_j|$, the amount of penalization each coefficient receives is inversely proportional to its pre-determined significance. Therefore, the important coefficients are penalized to a lesser extent than the unimportant (smaller) ones. In this article, we assume certain estimation consistency of $\hat\beta^0$, and hence of w, to simplify the asymptotic proof. Careful inspection of the proof reveals that the consistency assumption can be replaced by the looser zero-consistency assumption, i.e. it is only needed that the $w_j$'s go to infinity for the zero components of $\beta^0$ and are asymptotically bounded for the non-zero components. We do not pursue this issue further in this article.

The above results establish that the mLasso estimates have the oracle property as long as the initial estimate $\hat\beta^0$ is $\delta_n$-consistent. The initial estimate does not have to be the optimal $\sqrt{n}$-consistent estimator. As a consequence, $\hat\beta^0$ can be obtained via many familiar estimating methods, for example, the standard Lin-Ying approach or ridge regression. For ridge regression with a penalty $\tau$, $\hat\beta^0$ is estimated as $\hat\beta^0 = (A_n + \tau I)^{-1} b_n$. For practical data analyses, if the initial estimate has zero or extremely small components, we propose setting the corresponding mLasso components to zero.

As in [12, 15], we can estimate the covariance matrix of $\hat\beta_C$ ($C = \{j : \hat\beta_j \ne 0\}$) by the sandwich formula

$$\hat\Sigma_{CC} = \{\hat A_{CC} + n\lambda_n D\}^{-1} \hat B_{CC} \{\hat A_{CC} + n\lambda_n D\}^{-1}$$

where $D = \mathrm{diag}(w_{C,1}/|\hat\beta_{C,1}|, \ldots, w_{C,d}/|\hat\beta_{C,d}|)$, $d = \#\{C\}$, and $\hat A_{CC}$ and $\hat B_{CC}$ are the sub-matrices of $\hat A$ and $\hat B$ corresponding to the set C.
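A short sketch of the two ingredients just described, the ridge initial estimate $(A_n + \tau I)^{-1} b_n$ and the sandwich covariance for the selected coefficients; the function names are ours, and the code assumes $\hat A$ and $\hat B$ have already been formed as in Section 2.1.

```python
import numpy as np

def ridge_initial(A_n, b_n, tau):
    """Ridge initial estimate beta0 = (A_n + tau*I)^{-1} b_n."""
    return np.linalg.solve(A_n + tau * np.eye(len(b_n)), b_n)

def sandwich_cov(A_hat, B_hat, beta_hat, w, lam, n):
    """Sandwich covariance estimate for the selected coefficients,
    with C = {j : beta_hat_j != 0} and D as defined above."""
    C = np.flatnonzero(beta_hat != 0)
    D = np.diag(w[C] / np.abs(beta_hat[C]))
    M = np.linalg.inv(A_hat[np.ix_(C, C)] + n * lam * D)
    return M @ B_hat[np.ix_(C, C)] @ M
```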

4. COMPUTATION AND TUNING

By letting $d_j = w_j \beta_j$, $j = 1, \ldots, p$, the mLasso problem is equivalent to

$$\min_d\ \frac12 [d' \tilde A_n d - 2 d' \tilde b_n] + n\lambda_n \sum_{j=1}^p |d_j| \quad (5)$$

where $\tilde A_n = \{\mathrm{diag}(w_1, \ldots, w_p)\}^{-1} A_n \{\mathrm{diag}(w_1, \ldots, w_p)\}^{-1}$ and $\tilde b_n = \{\mathrm{diag}(w_1, \ldots, w_p)\}^{-1} b_n$. The fast Lars-Lasso algorithm [24] can then be used to solve for d, with the slight twist that only $\tilde A_n$ and $\tilde b_n$ are available instead of X and y. If the estimate minimizing (5) is $\hat d$, then the estimate solving (4) is simply $\hat\beta_j = |\hat\beta^0_j| \hat d_j$. This is analogous to the non-negative garrote, except that the $d_j$'s are not constrained to be non-negative. Mimicking [25], we propose the following computational algorithm.

Algorithm for computing (5)
1. Start with $d = 0$, the active set $C = \arg\max_j |\tilde b_{n,j}|$, and the direction $\gamma$, a p-vector with $\gamma_C = \mathrm{sgn}(\tilde b_n)_C$ and $\gamma_{C^c} = 0$.
2. Compute how far the algorithm can proceed before a new variable joins the active set:
$$a_1 = \min\{a > 0 : |\{\tilde A_n(d + a\gamma) - \tilde b_n\}_j| = |\{\tilde A_n(d + a\gamma) - \tilde b_n\}_C|,\ j \notin C\}$$
3. Compute how far the algorithm can proceed before some $d_j$ in C hits zero:
$$a_2 = \min\{a > 0 : (d + a\gamma)_j = 0,\ j \in C\}$$
4. Let $a = \min(a_1, a_2)$; if $a = a_1$, add the variable attaining equality at a to C; if $a = a_2$, remove the variable attaining 0 at a from C. Update $d \leftarrow d + a\gamma$ and compute $\gamma$ as $\gamma_C = \tilde A_{n,CC}^{-1}\,\mathrm{sgn}(d_C)$ and $\gamma_{C^c} = 0$.
5. Go to step 2 until $\tilde A_n d = \tilde b_n$.

In order to find the $\lambda_n$ corresponding to an estimate $\hat d(\lambda_n)_C$, we note that

$$\hat d(\lambda_n)_C = \tilde A_{n,CC}^{-1}\{\tilde b_{n,C} - n\lambda_n\,\mathrm{sgn}(\hat d(\lambda_n))_C\}$$

and thus

$$\lambda_n = \frac{[\tilde b_{n,C} - \tilde A_{n,CC}\,\hat d(\lambda_n)_C]_i}{n\,[\mathrm{sgn}(\hat d(\lambda_n))_C]_i}$$

where $[a]_i$ is the ith entry of a.

Denote $L(\beta) = \frac12(\beta' A_n \beta - 2 b_n' \beta)$. The optimal $\lambda$ can be determined by V-fold cross-validation, which works as follows. Denote the full data set by S, so that the training and testing sets are $S - S_v$ and $S_v$, respectively, for $v = 1, \ldots, V$. For each $\lambda$, estimate $\hat\beta^v(\lambda)$ using the training data $S - S_v$ and compute the loss $L(\hat\beta^v(\lambda))$ on the testing set $S_v$. Define the cross-validation score as

$$\mathrm{CV}(\lambda) = \sum_{v=1}^V L(\hat\beta^v(\lambda))$$

We then choose $\lambda$ to minimize $\mathrm{CV}(\lambda)$. The same V-fold cross-validation is used for $\tau$ if $\hat\beta^0$ is estimated via ridge regression.

In [16], the Lasso estimate is obtained using an L1 boosting algorithm. The Lars algorithm yields an exact solution, instead of the approximate solution from boosting. The L1 boosting-based approach is appropriate when the number of covariates is extremely large (for example, in microarray studies). However, in clinical studies such as the PBC example, the Lars approach is computationally faster and better behaved.
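The cross-validation step can be sketched as follows, assuming (as we read the description above) that $\beta$ is fitted on each training split and the quadratic loss L is evaluated with the held-out fold's $(A_n, b_n)$; the data structures and names are our own.

```python
import numpy as np

def cv_choose_lambda(folds, lams, fit):
    """V-fold cross-validation for lambda_n, a sketch.
    folds: list of V pairs ((A_tr, b_tr, n_tr), (A_va, b_va)), each (A, b)
           computed by lin_ying() on the corresponding subset of subjects
    lams : candidate values of lambda_n
    fit  : solver closed over the initial estimate, e.g.
           lambda A, b, lam, n: mlasso(A, b, beta0, lam, n)
    Returns the lambda minimizing CV(lam) = sum_v L_v(beta_v(lam))."""
    cv = np.zeros(len(lams))
    for k, lam in enumerate(lams):
        for (A_tr, b_tr, n_tr), (A_va, b_va) in folds:
            beta = fit(A_tr, b_tr, lam, n_tr)                      # fit on S - S_v
            cv[k] += 0.5 * (beta @ A_va @ beta - 2 * b_va @ beta)  # loss on S_v
    return lams[int(np.argmin(cv))]
```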

5. DATA ANALYSIS

We apply the proposed approach to three well-investigated public data sets, chosen partly to facilitate comparison with published results. All three data sets consist of right censored survival data with multiple covariates. We refer to the corresponding publications for more experimental and data details.

5.1. PBC data

The Mayo Clinic has established a database of 424 patients having primary biliary cirrhosis (PBC) of the liver. Seventeen covariates were collected for 312 randomized patients. We focus on the 276 patients with no missing values in the covariates. A more detailed account of the PBC data can be found in [26]. Following [13], we standardize all the regressors so that they are more comparable in the penalization scheme.

We calculate the Lasso estimate with Tibshirani's Lasso formulation, and two mLasso estimates, with the initial estimates obtained from Lin-Ying's estimating equation (mLasso 1) and from ridge regression (mLasso 2). The tuning parameter $\lambda$ (and $\tau$ for mLasso 2) is chosen by 10-fold cross-validation. The estimation results are summarized in Table I. The Lasso formulation has the advantage of generating the whole solution path for varying $\lambda$, and we present the solution paths, together with the cross-validation scores, in Figure 1. It can be seen that the cross-validation scores are well defined as a function of the tuning parameter and there exist well-separated, unique minimizers.

Several interesting observations arise. Firstly, 14 estimated coefficients of the full additive risk model have the same signs as those of the Cox model; we refer to Table I of [13] for detailed estimates from the Cox model. The three covariates with different signs are alk, chol and hep. In the full additive model, these three variables are not marginally significant. Moreover, they are estimated as unimportant variables if a Lasso procedure is applied to either the Cox or the additive risk model. This suggests that the Cox and additive models lead to similar biological conclusions. Secondly, compared with the 12 selected covariates in Table I of [13], the Lasso-additive formulation gives a final model with 10 predictors, a subset of the Lasso-Cox model excluding chol and sex. We note that the Lasso-Cox estimates of chol and sex in [13] are very small and marginally not significant. Thirdly, in terms of the signs of the estimated non-zero coefficients, the Lasso, mLasso 1 and mLasso 2 agree with the full additive model, so no biological conclusions are significantly changed by carrying out variable selection. Fourthly, as can be seen in the top row of Figure 1, the Lasso seems to over-shrink the estimates towards zero, and the mLasso (especially mLasso 1) applies shrinkage to a smaller degree. For example, variables such as bili and asc, which eventually have larger coefficients than others, have larger coefficients in the early part of the mLasso 1 and mLasso 2 paths than in the case of the regular Lasso. Since mLasso 2 employs a shrinkage estimator, the ridge regression, as the initial estimate, mLasso 1 has the least shrinkage, which can be seen clearly from Figure 1. There is one fewer covariate (spid) in mLasso 2 than in the Lasso, and two fewer covariates (spid and prot) in mLasso 1 than in the Lasso.

For model fitting comparison, we propose using the time-dependent receiver operating characteristic (ROC) approach for censored data.
The time-dependent ROC technique was investigated in [27] in the context of medical diagnosis and has been used as a criterion for censored data regression with high-dimensional covariates [28]. The essential idea is to treat the event indicator as a binary outcome at each time point and evaluate the classification performance at each time using the standard ROC technique.

Table I. Estimation results for the PBC data: the Lasso refers to the additive model with the original Lasso penalty; mLasso 1 refers to the Lasso solution with $\hat\beta^0$ being Lin-Ying's estimator; mLasso 2 refers to the Lasso solution with $\hat\beta^0$ being the ridge regression estimator. For each of the full model, Lasso, mLasso 1 and mLasso 2, the columns give the coefficient (x 10^-4), SE (x 10^-4) and Z statistic for the variables age, alb, alk, asc, bili, chol, ed, hep, plat, prot, sex, sgot, spid, stage, trt, trig and cop. [Numerical entries not reproduced.]

Figure 1. Analysis of PBC data. Top row: Lasso; middle row: mLasso 1; bottom row: mLasso 2. Left panels: the cross-validation scores; right panels: the solution paths. Dotted lines: optimal tuning parameter chosen by cross-validation. Labelled covariates on the paths include bili, asc, ed, age, cop, spid, prot, stage, sgot and alb.

In the ROC approach, the area under the curve (AUC) can be used as the evaluation/comparison criterion; a larger AUC at time t indicates better model fitting of the survival outcome at time t, as measured by the sensitivity and specificity evaluated at time t.
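For illustration only, here is a deliberately naive version of the time-dependent AUC idea: at each t, subjects with an observed event by t are treated as cases and subjects still at risk as controls. The estimators of [27] additionally adjust for censoring, which this sketch omits.

```python
import numpy as np

def naive_auc_at_t(risk_score, time, event, t):
    """Naive time-dependent AUC at time t: cases are observed failures
    by t, controls are subjects still at risk after t. Illustrates the
    binary-outcome-at-each-t idea only; not censoring-adjusted."""
    cases = risk_score[(time <= t) & (event == 1)]
    controls = risk_score[time > t]
    if len(cases) == 0 or len(controls) == 0:
        return np.nan
    # P(case score > control score), with ties counted as 1/2
    diff = cases[:, None] - controls[None, :]
    return (diff > 0).mean() + 0.5 * (diff == 0).mean()
```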

Figure 2. Kernel density estimates of the p-values (log10-transformed) and the time-dependent ROC curves for the three data sets (PBC, UIS and BMT). The ROC panels show the full additive model, the Lasso, mLasso 1 and mLasso 2; for PBC, the Lasso-Cox fit is also included.

We can see from Figure 2 that the proposed mLasso approaches have model fitting performance similar to the full model, while using only a subset of the covariates. To investigate the prediction performance of the proposed estimators, we randomly split the 276 observations into training sets of size 200 and testing sets of size 76. The estimation is carried out using the training sets only. The linear risk scores $\hat\beta' Z$ for the testing sets are calculated using the estimates

$\hat\beta$ obtained from the training sets. We generate two hypothetical risk groups according to the median of the linear risk scores in the testing set. The survival functions of the two groups are compared: a chi-squared statistic (with one degree of freedom) and the corresponding p-value are computed. A similar evaluation method is adopted by Ma and Huang [16] and Li and Gui [28]. We repeat this procedure 100 times and summarize the log10 p-values in Figure 2. We see that mLasso 1 and mLasso 2 yield p-values similar to the Lasso, while overall, the penalized additive risk models give smaller p-values than the full additive risk model. Out of the 100 random training/testing splits, the regular Lasso gives 75 p-values smaller than those of the full model; mLasso 1 yields 63 and mLasso 2 yields 73 p-values smaller than those of the full model. To compare whether the p-values differ significantly in location, we use pairwise Wilcoxon rank-sum tests. The p-value between the Lasso and the full model is 0.008, between mLasso 1 and the full model it is 0.10, and between mLasso 2 and the full model it is … . So the Lasso-based approaches are significantly better than the full model. However, there is no significant difference between the Lasso and mLasso 1 (p-value 0.26), or between the Lasso and mLasso 2 (p-value 0.88). The mean model size for the regular Lasso is 9.55 with standard error 0.16; for mLasso 1 it is 7.20 with standard error 0.33; for mLasso 2 it is 8.37 with standard error … . The two mLasso estimates give smaller models while maintaining the same differentiation power between the low and high risk groups.

As a simple comparison, we also consider the PBC data with the Lasso-Cox estimate. We show in Figure 2 the time-dependent ROC using the estimation results in [13]. We can see that the additive models have dominating AUCs, which suggests that the additive models fit better than the Cox model.
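The random-split evaluation used here (and again for the UIS and BMT data below) can be sketched with the lifelines package's log-rank test; the helper below is hypothetical and takes a fitted coefficient vector plus the held-out data.

```python
import numpy as np
from lifelines.statistics import logrank_test

def split_evaluation(beta_hat, Z_test, time_test, event_test):
    """One train/test replicate of the evaluation above: dichotomize the
    test set at the median linear risk score beta' Z and compare the two
    survival curves with a log-rank (chi-squared, 1 d.f.) test."""
    score = Z_test @ beta_hat
    high = score > np.median(score)
    res = logrank_test(time_test[high], time_test[~high],
                       event_observed_A=event_test[high],
                       event_observed_B=event_test[~high])
    return res.test_statistic, res.p_value
```

Repeating this over many random splits and collecting the p-values reproduces the kind of summary shown in Figure 2.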
5.2. UIS data

This data set is a subset of data from the University of Massachusetts Aids Research Unit Impact Study [29]. Observations with missing entries are excluded and we focus on the 528 samples with complete records. The purpose of this study was to compare treatment programs of different planned durations designed to reduce drug abuse and to prevent high-risk HIV behaviour. The UIS sought to study how the covariates determine the time to return to drug use. As suggested by Hosmer and Lemeshow [29], we create three dummy variables for hercoc (heroin or cocaine use) and two dummy variables for ivhx (IV drug use history at admission). A total of 12 covariates are considered for the additive risk model. More discussion of the UIS data set can be found in [29].

We apply the proposed Lasso approach. The plot of the model features is close to Figure 1 and is omitted here. The fitted coefficients and their standard errors are listed in Table II. We note that all the estimated coefficients agree in sign with their counterparts from the Cox model (results not shown), which indicates similar biological conclusions. The Lasso gives a model with 10 covariates, while both mLasso 1 and mLasso 2 give a final model with 8 covariates. Model fitting comparison based on the time-dependent ROC is shown in Figure 2; results similar to those for the PBC data are observed.

In order to assess the prediction performance of the three regularization methods, we randomly split the data into training sets of sample size 400 and testing sets of size 128. As in the PBC analysis, we create hypothetical risk groups by ranking the testing samples according to their estimated risk scores. We repeat the splitting 100 times and summarize the results. It is observed that over the 100 runs, the regular Lasso gives a model with average size 9.12, mLasso 1 gives 7.64 and mLasso 2 gives … . Again, we see that the two mLasso algorithms give smaller models. Pairwise Wilcoxon tests for differences between the p-values from the three methods show that there is no significant difference among the three methods. See Figure 2 for the kernel density estimates of the log10 p-values.

Table II. Estimation results for the UIS data: the Lasso refers to the additive model with the original Lasso penalty; mLasso 1 refers to the Lasso solution with $\hat\beta^0$ being Lin-Ying's estimator; mLasso 2 refers to the Lasso solution with $\hat\beta^0$ being the ridge regression estimator. For each of the full model, Lasso, mLasso 1 and mLasso 2, the columns give the coefficient (x 10^-4), SE (x 10^-4) and Z statistic for the variables age, becktota, ndrugtx, race, treat, site, los, the three hercoc dummies and the two ivhx dummies. [Numerical entries not reproduced.]

Table III. Estimation results for the BMT data: the Lasso refers to the additive model with the original Lasso penalty; mLasso 1 refers to the Lasso solution with $\hat\beta^0$ being Lin-Ying's estimator; mLasso 2 refers to the Lasso solution with $\hat\beta^0$ being the ridge regression estimator. For each of the full model, Lasso, mLasso 1 and mLasso 2, the columns give the coefficient (x 10^-4), SE (x 10^-4) and Z statistic for the 14 covariates Z1-Z14. [Numerical entries not reproduced.]

5.3. BMT data

The BMT data are provided and analysed in [30]. We study the disease-free time after bone marrow transplant for 137 patients. Of particular interest is detecting influential covariate effects among the 14 measured covariates. We employ the proposed Lasso approach, and the fitted coefficients are given in Table III. In this example, we observe that the full Cox model (results not shown) and the full additive model give 11 coefficients with the same signs. The three coefficients with different signs are Z2, Z10 and Z11, which again are not important variables when a Lasso procedure is applied. The time-dependent ROC curves are shown in Figure 2. The time-dependent AUCs are reasonably large, which suggests satisfactory model fitting.

The prediction performance is examined by splitting the total sample into training sets (with 100 samples) and testing sets (with 37 samples); we repeat this 100 times. The average model size for the regular Lasso is 4.90; for mLasso 1 it is 3.97; for mLasso 2 it is … . Pairwise Wilcoxon tests confirm that the three penalized regression methods yield significantly better separation between the low and high risk groups, while there are no significant differences among the three Lasso approaches (results not shown).

6. CONCLUDING REMARKS

The additive risk model provides a useful alternative to the Cox model. For data like the PBC, the additive risk model may provide a better fit than the Cox model, as evaluated with the time-dependent ROC. In the presence of multiple covariates, the Lasso methodology can be applied to build reliable parsimonious models. In particular, we propose the mLasso approach and show that it yields path consistent model selection. The fast Lars-Lasso algorithm is applied to compute the whole solution path of the estimates, which greatly facilitates the adaptive choice of the tuning parameter. The proposed algorithm can easily be implemented using existing software; development of a public software package will be pursued in the future. We also discuss a sandwich-type formula for inferring the standard errors of the estimated parameters. We analyse three survival data sets using the proposed approach. Comparisons with the full additive model and the Cox model suggest satisfactory estimation and prediction performance. Extensive simulation studies will be pursued in a separate study to systematically investigate the small sample performance of the Lasso-based approaches under the additive risk model.

Compared with previous variable selection studies with censored data, we consider the widely used but less-investigated additive risk model. Combined with the previous studies in [13, 14], a more comprehensive framework for variable selection in survival analysis is now available. Compared with [16], the proposed mLasso has a better theoretical basis and similar computational cost. Empirical studies show that the mLasso can yield models smaller than those from the regular Lasso, yet with similar prediction performance. Compared with the regular Lasso, the mLasso has the disadvantage of requiring a consistent initial estimate, although this usually does not pose a serious problem. In this article, we assume that the covariates act linearly on the additive hazard, but this need not be true. We are currently investigating the extension to the non-parametric additive model

$$\lambda(t; Z) = \lambda_0(t) + \eta(Z(t))$$

where $\eta$ is a non-parametric function which will be fitted using popular non-parametric methods such as smoothing splines [31] or local polynomials [32]. This study will be reported in a later manuscript.

7. PROOFS

Proof of Theorem 1
Write the initial estimate of $\beta^0$ as $\hat\beta^0$, where $\max_j |\hat\beta^0_j - \beta^0_j| = O_p(\delta_n)$. Denote $\beta = \beta^0 + u/\sqrt{n}$ and define

$$V_n^{(0)}(u) = \frac12(\beta' A_n \beta - 2 b_n' \beta) + n\lambda_n \sum_{j=1}^p \frac{|\beta_j|}{|\hat\beta^0_j|} = \frac12\left[\left(\beta^0 + \frac{u}{\sqrt n}\right)' A_n \left(\beta^0 + \frac{u}{\sqrt n}\right) - 2 b_n'\left(\beta^0 + \frac{u}{\sqrt n}\right)\right] + n\lambda_n \sum_{j=1}^p \frac{|\beta^0_j + u_j/\sqrt n|}{|\hat\beta^0_j|}$$

and $V_n(u) = V_n^{(0)}(u) - V_n^{(0)}(0)$. Note that $V_n(u)$ is minimized at $\sqrt n(\hat\beta - \beta^0)$. First note that

$$V_n(u) = \frac12 u^T \frac{A_n}{n} u - \frac{(b_n - A_n\beta^0)^T}{\sqrt n}\, u + \sum_{j=1}^p \frac{\sqrt n\,\lambda_n}{|\hat\beta^0_j|}\,\sqrt n\left(|\beta^0_j + u_j/\sqrt n| - |\beta^0_j|\right)$$

By standard counting process arguments [33, 7], $A_n/n \to A$ and $(b_n - A_n\beta^0)/\sqrt n$ converges weakly to $w \sim N(0, B)$. It is easy to see that

$$\sqrt n\left(|\beta_j + u_j/\sqrt n| - |\beta_j|\right) \to u_j\,\mathrm{sgn}(\beta_j)\,I(\beta_j \ne 0) + |u_j|\,I(\beta_j = 0)$$

Since $\sqrt n\,\lambda_n \to 0$ and $\sqrt n\,\lambda_n/\delta_n \to \infty$, it follows that

$$\sum_{j=1}^p \frac{\sqrt n\,\lambda_n}{|\hat\beta^0_j|}\,\sqrt n\left(|\beta^0_j + u_j/\sqrt n| - |\beta^0_j|\right) \to \begin{cases} o_p(1) & \text{if } u_j = 0 \text{ for all } j \notin I \\ +\infty & \text{otherwise} \end{cases}$$

Therefore, $V_n(\cdot) \to_d V(\cdot)$, where

$$V(u) = \begin{cases} \dfrac12 u^T A u - u^T w & \text{if } u_j = 0 \text{ for all } j \notin I \\ +\infty & \text{otherwise} \end{cases}$$

Now partition the matrix A, w and u as

$$A = \begin{pmatrix} A_{II} & A_{II^c} \\ A_{I^cI} & A_{I^cI^c} \end{pmatrix} \quad (6)$$

where $A_{II}$ is $r \times r$, $A_{I^cI^c}$ is $(p - r) \times (p - r)$ and $A_{I^cI} = A_{II^c}^T$, and

$$w = \begin{pmatrix} w_I \\ w_{I^c} \end{pmatrix}, \qquad u = \begin{pmatrix} u_I \\ u_{I^c} \end{pmatrix} \quad (7)$$

Since $V(\cdot)$ is convex, by applying the arguments in [34, 35] (see also [36]), $\sqrt n(\hat\beta - \beta^0) \to_d \arg\min(V)$. That is, $\hat u_I \to_d A_{II}^{-1} w_I$ and $\hat u_{I^c} \to_d 0$. Thus, the $\sqrt n$-consistency part is proved, and we have $P(\mathcal A = r) \to 1$, where $\mathcal A = \#\{j : \hat\beta_j \ne 0,\ \beta^0_j \ne 0\}$.

It remains to show that $P(\mathcal B > 0) \to 0$ as $n \to \infty$, where $\mathcal B = \#\{j : \hat\beta_j \ne 0,\ \beta^0_j = 0\}$. We have shown that $\hat\beta - \beta^0 = O_p(1/\sqrt n)$. It is sufficient to show that, for some small $\epsilon_n = C/\sqrt n$ and $j = r + 1, \ldots, p$,

$$\frac{\partial V_n^{(0)}(\beta)}{\partial \beta_j} > 0 \quad \text{for } 0 < \beta_j < \epsilon_n \quad (8)$$

$$\frac{\partial V_n^{(0)}(\beta)}{\partial \beta_j} < 0 \quad \text{for } -\epsilon_n < \beta_j < 0 \quad (9)$$

To show (8), by differentiation we have, for $\beta_j > 0$,

$$\frac{\partial V_n^{(0)}(\beta)}{\partial \beta_j} = \sqrt n\,\frac{A_{n,j}^T \beta - b_{n,j}}{\sqrt n} + \frac{n\lambda_n}{|\hat\beta^0_j|} \quad (10)$$

where $A_{n,j}$ is the jth column of $A_n$ and $b_{n,j}$ is the jth entry of $b_n$. Note that

$$\frac{A_{n,j}^T \beta - b_{n,j}}{\sqrt n} = \frac{A_{n,j}^T \beta^0 - b_{n,j}}{\sqrt n} + \frac{A_{n,j}^T (\beta - \beta^0)}{\sqrt n} = O_p(1)$$

by the assumption that $\|\beta - \beta^0\| = O_P(1/\sqrt n)$. Since $\sqrt n\,\lambda_n/\delta_n \to \infty$, (10) is dominated by $n\lambda_n/|\hat\beta^0_j| > 0$. Similarly we can prove (9). This completes the proof of $P(\mathcal B > 0) \to 0$.

ACKNOWLEDGEMENTS

We would like to thank the associate editor and two referees for insightful comments that have led to significant improvement of this paper. Chenlei Leng's research is partially supported by NUS research grants. Shuangge Ma would like to thank the Yale Center for High Performance Computation in Biology and Biomedicine (NIH grant RR, which funded the instrumentation) for computing support.

REFERENCES

1. Fleming TR, Harrington DP. Counting Processes and Survival Analysis. Wiley: New York, 1991.
2. Cox DR. Regression models and life-tables (with discussion). Journal of the Royal Statistical Society, Series B 1972; 34:187-220.
3. Aalen O. A Model for Regression Analysis of Counting Processes. Lecture Notes in Statistics, vol. 2. Springer: Berlin, 1980.
4. Cox DR, Oakes D. Analysis of Survival Data. Chapman & Hall: London, 1984.
5. McKeague IW, Sasieni PD. A partly parametric additive risk model. Biometrika 1994; 81:501-514.
6. Thomas DC. Use of auxiliary information in fitting nonproportional hazards models. In Modern Statistical Methods in Chronic Disease Epidemiology. Wiley: New York, 1986.
7. Lin D, Ying Z. Semiparametric analysis of the additive risk model. Biometrika 1994; 81:61-71.
8. Breslow NE, Day NE. Statistical Methods in Cancer Research, vol. 2. IARC: Lyon, 1987.
9. Miller A. Subset Selection in Regression. Chapman & Hall: London.
10. Breiman L. Better subset regression using the nonnegative garrote. Technometrics 1995; 37:373-384.
11. Breiman L. Heuristics of instability and stabilization in model selection. The Annals of Statistics 1996; 24:2350-2383.

12. Tibshirani R. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B 1996; 58:267-288.
13. Tibshirani R. The lasso method for variable selection in the Cox model. Statistics in Medicine 1997; 16:385-395.
14. Huang J, Ma S, Xie H. Regularized estimation in the accelerated failure time model with high dimensional covariates. Biometrics 2006; 62:813-820.
15. Fan J, Li R. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association 2001; 96:1348-1360.
16. Ma S, Huang J. Lasso method for additive risk models with high dimensional covariates. Technical Report 347, Department of Statistics and Actuarial Science, University of Iowa, Iowa City.
17. Zou H. The adaptive lasso and its oracle properties. Technical Report, University of Minnesota, Minneapolis.
18. Yuan M, Lin Y. On the nonnegative garrote estimator. Technical Report, School of Industrial and Systems Engineering, Georgia Institute of Technology.
19. Zhao P, Yu B. On model selection consistency of lasso. Manuscript.
20. Leng C, Lin Y, Wahba G. A note on the lasso and related procedures in model selection. Statistica Sinica 2006; 16(4):1273-1284.
21. Wang H, Li G, Jiang G. Robust regression shrinkage and consistent variable selection via the LAD-lasso. Manuscript.
22. Ma S, Kosorok MR, Fine JP. Additive risk models for survival data with high dimensional covariates. Biometrics 2006; 62:202-210.
23. Nguyen D, Rocke DM. Partial least squares proportional hazard regression for application to DNA microarray data. Bioinformatics 2002; 18:1625-1632.
24. Efron B, Hastie T, Johnstone I, Tibshirani R. Least angle regression (with discussion). The Annals of Statistics 2004; 32:407-499.
25. Rosset S, Zhu J. Piecewise linear regularized solution paths. Manuscript.
26. Dickson E, Grambsch P, Fleming T, Fisher LD, Langworthy A. Prognosis in primary biliary cirrhosis: model for decision making. Hepatology 1989; 10:1-7.
27. Heagerty PJ, Lumley T, Pepe M. Time-dependent ROC curves for censored survival data and a diagnostic marker. Biometrics 2000; 56:337-344.
28. Li H, Gui J. Partial Cox regression analysis for high-dimensional microarray gene expression data. Bioinformatics 2004; 20(Suppl. 1):i208-i215.
29. Hosmer DW, Lemeshow S. Applied Survival Analysis. Wiley: New York, 1999.
30. Klein JP, Moeschberger ML. Survival Analysis: Techniques for Censored and Truncated Data. Springer: Berlin.
31. Wahba G. Spline Models for Observational Data. CBMS-NSF Regional Conference Series in Applied Mathematics, vol. 59. SIAM: Philadelphia, PA, 1990.
32. Fan J, Gijbels I. Local Polynomial Modelling and its Applications. Chapman & Hall: London, 1996.
33. Andersen PK, Gill RD. Cox's regression model for counting processes: a large sample study. The Annals of Statistics 1982; 10:1100-1120.
34. Geyer C. On the asymptotics of constrained M-estimation. The Annals of Statistics 1994; 22:1993-2010.
35. Geyer C. On the asymptotics of convex stochastic optimization. Manuscript.
36. Knight K, Fu W. Asymptotics for lasso-type estimators. The Annals of Statistics 2000; 28:1356-1378.


More information

You know I m not goin diss you on the internet Cause my mama taught me better than that I m a survivor (What?) I m not goin give up (What?

You know I m not goin diss you on the internet Cause my mama taught me better than that I m a survivor (What?) I m not goin give up (What? You know I m not goin diss you on the internet Cause my mama taught me better than that I m a survivor (What?) I m not goin give up (What?) I m not goin stop (What?) I m goin work harder (What?) Sir David

More information

Simultaneous variable selection and class fusion for high-dimensional linear discriminant analysis

Simultaneous variable selection and class fusion for high-dimensional linear discriminant analysis Biostatistics (2010), 11, 4, pp. 599 608 doi:10.1093/biostatistics/kxq023 Advance Access publication on May 26, 2010 Simultaneous variable selection and class fusion for high-dimensional linear discriminant

More information

arxiv: v2 [stat.ml] 22 Feb 2008

arxiv: v2 [stat.ml] 22 Feb 2008 arxiv:0710.0508v2 [stat.ml] 22 Feb 2008 Electronic Journal of Statistics Vol. 2 (2008) 103 117 ISSN: 1935-7524 DOI: 10.1214/07-EJS125 Structured variable selection in support vector machines Seongho Wu

More information

Nonnegative Garrote Component Selection in Functional ANOVA Models

Nonnegative Garrote Component Selection in Functional ANOVA Models Nonnegative Garrote Component Selection in Functional ANOVA Models Ming Yuan School of Industrial and Systems Engineering Georgia Institute of Technology Atlanta, GA 3033-005 Email: myuan@isye.gatech.edu

More information

Lecture 5: Soft-Thresholding and Lasso

Lecture 5: Soft-Thresholding and Lasso High Dimensional Data and Statistical Learning Lecture 5: Soft-Thresholding and Lasso Weixing Song Department of Statistics Kansas State University Weixing Song STAT 905 October 23, 2014 1/54 Outline Penalized

More information

On Algorithms for Solving Least Squares Problems under an L 1 Penalty or an L 1 Constraint

On Algorithms for Solving Least Squares Problems under an L 1 Penalty or an L 1 Constraint On Algorithms for Solving Least Squares Problems under an L 1 Penalty or an L 1 Constraint B.A. Turlach School of Mathematics and Statistics (M19) The University of Western Australia 35 Stirling Highway,

More information

NEW METHODS FOR VARIABLE SELECTION WITH APPLICATIONS TO SURVIVAL ANALYSIS AND STATISTICAL REDUNDANCY ANALYSIS USING GENE EXPRESSION DATA SIMIN HU

NEW METHODS FOR VARIABLE SELECTION WITH APPLICATIONS TO SURVIVAL ANALYSIS AND STATISTICAL REDUNDANCY ANALYSIS USING GENE EXPRESSION DATA SIMIN HU NEW METHODS FOR VARIABLE SELECTION WITH APPLICATIONS TO SURVIVAL ANALYSIS AND STATISTICAL REDUNDANCY ANALYSIS USING GENE EXPRESSION DATA by SIMIN HU Submitted in partial fulfillment of the requirements

More information

Survival Prediction Under Dependent Censoring: A Copula-based Approach

Survival Prediction Under Dependent Censoring: A Copula-based Approach Survival Prediction Under Dependent Censoring: A Copula-based Approach Yi-Hau Chen Institute of Statistical Science, Academia Sinica 2013 AMMS, National Sun Yat-Sen University December 7 2013 Joint work

More information

Survival Regression Models

Survival Regression Models Survival Regression Models David M. Rocke May 18, 2017 David M. Rocke Survival Regression Models May 18, 2017 1 / 32 Background on the Proportional Hazards Model The exponential distribution has constant

More information

Univariate shrinkage in the Cox model for high dimensional data

Univariate shrinkage in the Cox model for high dimensional data Univariate shrinkage in the Cox model for high dimensional data Robert Tibshirani January 6, 2009 Abstract We propose a method for prediction in Cox s proportional model, when the number of features (regressors)

More information

Consistent Group Identification and Variable Selection in Regression with Correlated Predictors

Consistent Group Identification and Variable Selection in Regression with Correlated Predictors Consistent Group Identification and Variable Selection in Regression with Correlated Predictors Dhruv B. Sharma, Howard D. Bondell and Hao Helen Zhang Abstract Statistical procedures for variable selection

More information

Multi-state Models: An Overview

Multi-state Models: An Overview Multi-state Models: An Overview Andrew Titman Lancaster University 14 April 2016 Overview Introduction to multi-state modelling Examples of applications Continuously observed processes Intermittently observed

More information

Direct Learning: Linear Regression. Donglin Zeng, Department of Biostatistics, University of North Carolina

Direct Learning: Linear Regression. Donglin Zeng, Department of Biostatistics, University of North Carolina Direct Learning: Linear Regression Parametric learning We consider the core function in the prediction rule to be a parametric function. The most commonly used function is a linear function: squared loss:

More information

Regularization Paths

Regularization Paths December 2005 Trevor Hastie, Stanford Statistics 1 Regularization Paths Trevor Hastie Stanford University drawing on collaborations with Brad Efron, Saharon Rosset, Ji Zhu, Hui Zhou, Rob Tibshirani and

More information

Variable Selection for Highly Correlated Predictors

Variable Selection for Highly Correlated Predictors Variable Selection for Highly Correlated Predictors Fei Xue and Annie Qu Department of Statistics, University of Illinois at Urbana-Champaign WHOA-PSI, Aug, 2017 St. Louis, Missouri 1 / 30 Background Variable

More information

An Improved 1-norm SVM for Simultaneous Classification and Variable Selection

An Improved 1-norm SVM for Simultaneous Classification and Variable Selection An Improved 1-norm SVM for Simultaneous Classification and Variable Selection Hui Zou School of Statistics University of Minnesota Minneapolis, MN 55455 hzou@stat.umn.edu Abstract We propose a novel extension

More information

Part IV Extensions: Competing Risks Endpoints and Non-Parametric AUC(t) Estimation

Part IV Extensions: Competing Risks Endpoints and Non-Parametric AUC(t) Estimation Part IV Extensions: Competing Risks Endpoints and Non-Parametric AUC(t) Estimation Patrick J. Heagerty PhD Department of Biostatistics University of Washington 166 ISCB 2010 Session Four Outline Examples

More information

Marginal Screening and Post-Selection Inference

Marginal Screening and Post-Selection Inference Marginal Screening and Post-Selection Inference Ian McKeague August 13, 2017 Ian McKeague (Columbia University) Marginal Screening August 13, 2017 1 / 29 Outline 1 Background on Marginal Screening 2 2

More information

A Sparse Solution Approach to Gene Selection for Cancer Diagnosis Using Microarray Data

A Sparse Solution Approach to Gene Selection for Cancer Diagnosis Using Microarray Data A Sparse Solution Approach to Gene Selection for Cancer Diagnosis Using Microarray Data Yoonkyung Lee Department of Statistics The Ohio State University http://www.stat.ohio-state.edu/ yklee May 13, 2005

More information

On the Breslow estimator

On the Breslow estimator Lifetime Data Anal (27) 13:471 48 DOI 1.17/s1985-7-948-y On the Breslow estimator D. Y. Lin Received: 5 April 27 / Accepted: 16 July 27 / Published online: 2 September 27 Springer Science+Business Media,

More information

Variable Selection in Restricted Linear Regression Models. Y. Tuaç 1 and O. Arslan 1

Variable Selection in Restricted Linear Regression Models. Y. Tuaç 1 and O. Arslan 1 Variable Selection in Restricted Linear Regression Models Y. Tuaç 1 and O. Arslan 1 Ankara University, Faculty of Science, Department of Statistics, 06100 Ankara/Turkey ytuac@ankara.edu.tr, oarslan@ankara.edu.tr

More information

Issues on quantile autoregression

Issues on quantile autoregression Issues on quantile autoregression Jianqing Fan and Yingying Fan We congratulate Koenker and Xiao on their interesting and important contribution to the quantile autoregression (QAR). The paper provides

More information

Hypothesis Testing Based on the Maximum of Two Statistics from Weighted and Unweighted Estimating Equations

Hypothesis Testing Based on the Maximum of Two Statistics from Weighted and Unweighted Estimating Equations Hypothesis Testing Based on the Maximum of Two Statistics from Weighted and Unweighted Estimating Equations Takeshi Emura and Hisayuki Tsukuma Abstract For testing the regression parameter in multivariate

More information

Lecture 5 Models and methods for recurrent event data

Lecture 5 Models and methods for recurrent event data Lecture 5 Models and methods for recurrent event data Recurrent and multiple events are commonly encountered in longitudinal studies. In this chapter we consider ordered recurrent and multiple events.

More information

Nonconcave Penalized Likelihood with A Diverging Number of Parameters

Nonconcave Penalized Likelihood with A Diverging Number of Parameters Nonconcave Penalized Likelihood with A Diverging Number of Parameters Jianqing Fan and Heng Peng Presenter: Jiale Xu March 12, 2010 Jianqing Fan and Heng Peng Presenter: JialeNonconcave Xu () Penalized

More information

Dynamic Prediction of Disease Progression Using Longitudinal Biomarker Data

Dynamic Prediction of Disease Progression Using Longitudinal Biomarker Data Dynamic Prediction of Disease Progression Using Longitudinal Biomarker Data Xuelin Huang Department of Biostatistics M. D. Anderson Cancer Center The University of Texas Joint Work with Jing Ning, Sangbum

More information

Model Selection and Estimation in Regression with Grouped Variables 1

Model Selection and Estimation in Regression with Grouped Variables 1 DEPARTMENT OF STATISTICS University of Wisconsin 1210 West Dayton St. Madison, WI 53706 TECHNICAL REPORT NO. 1095 November 9, 2004 Model Selection and Estimation in Regression with Grouped Variables 1

More information

Published online: 10 Apr 2012.

Published online: 10 Apr 2012. This article was downloaded by: Columbia University] On: 23 March 215, At: 12:7 Publisher: Taylor & Francis Informa Ltd Registered in England and Wales Registered Number: 172954 Registered office: Mortimer

More information

Variable Selection in Cox s Proportional Hazards Model Using a Parallel Genetic Algorithm

Variable Selection in Cox s Proportional Hazards Model Using a Parallel Genetic Algorithm Variable Selection in Cox s Proportional Hazards Model Using a Parallel Genetic Algorithm Mu Zhu and Guangzhe Fan Department of Statistics and Actuarial Science University of Waterloo Waterloo, Ontario

More information

Survival Analysis Math 434 Fall 2011

Survival Analysis Math 434 Fall 2011 Survival Analysis Math 434 Fall 2011 Part IV: Chap. 8,9.2,9.3,11: Semiparametric Proportional Hazards Regression Jimin Ding Math Dept. www.math.wustl.edu/ jmding/math434/fall09/index.html Basic Model Setup

More information

A Magiv CV Theory for Large-Margin Classifiers

A Magiv CV Theory for Large-Margin Classifiers A Magiv CV Theory for Large-Margin Classifiers Hui Zou School of Statistics, University of Minnesota June 30, 2018 Joint work with Boxiang Wang Outline 1 Background 2 Magic CV formula 3 Magic support vector

More information

Iterative Selection Using Orthogonal Regression Techniques

Iterative Selection Using Orthogonal Regression Techniques Iterative Selection Using Orthogonal Regression Techniques Bradley Turnbull 1, Subhashis Ghosal 1 and Hao Helen Zhang 2 1 Department of Statistics, North Carolina State University, Raleigh, NC, USA 2 Department

More information

Introduction to Empirical Processes and Semiparametric Inference Lecture 01: Introduction and Overview

Introduction to Empirical Processes and Semiparametric Inference Lecture 01: Introduction and Overview Introduction to Empirical Processes and Semiparametric Inference Lecture 01: Introduction and Overview Michael R. Kosorok, Ph.D. Professor and Chair of Biostatistics Professor of Statistics and Operations

More information

Estimation in the l 1 -Regularized Accelerated Failure Time Model

Estimation in the l 1 -Regularized Accelerated Failure Time Model Estimation in the l 1 -Regularized Accelerated Failure Time Model by Brent Johnson, PhD Technical Report 08-01 May 2008 Department of Biostatistics Rollins School of Public Health 1518 Clifton Road, N.E.

More information

Efficiency Comparison Between Mean and Log-rank Tests for. Recurrent Event Time Data

Efficiency Comparison Between Mean and Log-rank Tests for. Recurrent Event Time Data Efficiency Comparison Between Mean and Log-rank Tests for Recurrent Event Time Data Wenbin Lu Department of Statistics, North Carolina State University, Raleigh, NC 27695 Email: lu@stat.ncsu.edu Summary.

More information

Saharon Rosset 1 and Ji Zhu 2

Saharon Rosset 1 and Ji Zhu 2 Aust. N. Z. J. Stat. 46(3), 2004, 505 510 CORRECTED PROOF OF THE RESULT OF A PREDICTION ERROR PROPERTY OF THE LASSO ESTIMATOR AND ITS GENERALIZATION BY HUANG (2003) Saharon Rosset 1 and Ji Zhu 2 IBM T.J.

More information

Proportional hazards regression

Proportional hazards regression Proportional hazards regression Patrick Breheny October 8 Patrick Breheny Survival Data Analysis (BIOS 7210) 1/28 Introduction The model Solving for the MLE Inference Today we will begin discussing regression

More information

Survival Analysis. Lu Tian and Richard Olshen Stanford University

Survival Analysis. Lu Tian and Richard Olshen Stanford University 1 Survival Analysis Lu Tian and Richard Olshen Stanford University 2 Survival Time/ Failure Time/Event Time We will introduce various statistical methods for analyzing survival outcomes What is the survival

More information

Classification Ensemble That Maximizes the Area Under Receiver Operating Characteristic Curve (AUC)

Classification Ensemble That Maximizes the Area Under Receiver Operating Characteristic Curve (AUC) Classification Ensemble That Maximizes the Area Under Receiver Operating Characteristic Curve (AUC) Eunsik Park 1 and Y-c Ivan Chang 2 1 Chonnam National University, Gwangju, Korea 2 Academia Sinica, Taipei,

More information

ISyE 691 Data mining and analytics

ISyE 691 Data mining and analytics ISyE 691 Data mining and analytics Regression Instructor: Prof. Kaibo Liu Department of Industrial and Systems Engineering UW-Madison Email: kliu8@wisc.edu Office: Room 3017 (Mechanical Engineering Building)

More information

Multivariate Survival Analysis

Multivariate Survival Analysis Multivariate Survival Analysis Previously we have assumed that either (X i, δ i ) or (X i, δ i, Z i ), i = 1,..., n, are i.i.d.. This may not always be the case. Multivariate survival data can arise in

More information

Lecture 6: Methods for high-dimensional problems

Lecture 6: Methods for high-dimensional problems Lecture 6: Methods for high-dimensional problems Hector Corrada Bravo and Rafael A. Irizarry March, 2010 In this Section we will discuss methods where data lies on high-dimensional spaces. In particular,

More information

arxiv: v1 [stat.me] 30 Dec 2017

arxiv: v1 [stat.me] 30 Dec 2017 arxiv:1801.00105v1 [stat.me] 30 Dec 2017 An ISIS screening approach involving threshold/partition for variable selection in linear regression 1. Introduction Yu-Hsiang Cheng e-mail: 96354501@nccu.edu.tw

More information