Estimation in the $\ell_1$-Regularized Accelerated Failure Time Model


Estimation in the $\ell_1$-Regularized Accelerated Failure Time Model

by Brent Johnson, PhD

Technical Report
May 2008

Department of Biostatistics
Rollins School of Public Health
1518 Clifton Road, N.E.
Emory University
Atlanta, Georgia

Telephone: (404)
FAX: (404)
bajohn3@emory.edu

Estimation in the $\ell_1$-regularized accelerated failure time model

by Brent A. Johnson

Abstract

This note considers variable selection in the semiparametric linear regression model for censored data. Semiparametric linear regression for censored data is a natural extension of the linear model for uncensored data; however, random censoring introduces substantial theoretical and numerical challenges. By now, a number of authors have made significant contributions to estimation and inference in the semiparametric linear model, but none of these authors have considered regularized estimation and subsequent variable selection. Our estimator is defined as a consistent solution to a suitably penalized, weighted logrank estimating function. For a general weight function, this estimating function is known to be non-monotone in the regression coefficients and may contain multiple roots. Nevertheless, it is one of the more popular estimators that does not assume proportional hazards. The proposed method uses linear and quadratic programming techniques for $\ell_1$-regularized estimation and can be implemented easily in R. We illustrate the utility of our approach in real and simulated data.

Keywords: Adaptive lasso; Lasso; Oracle property; Penalized least squares; Proportional hazards.

1 Introduction

Over the past several years, substantial attention has been paid to simultaneous estimation and variable selection in the linear model through so-called penalized least squares (PLS) estimators (e.g. Breiman, 1995; Tibshirani, 1996; Fan and Li, 2001; Zou and Hastie, 2005; Zou, 2006; Yuan and Lin, 2006). Because PLS estimators simultaneously shrink some coefficients to zero and estimate the non-zero coefficients, one can manage the theoretical properties of PLS estimators more easily than those of earlier proposals, such as stepwise deletion and subset selection. In addition to desirable asymptotic properties, elegant solutions to the resulting constrained optimization problems are now readily available (e.g. Osborne, Presnell, and Turlach, 2000; Efron, Johnstone, Hastie, and Tibshirani, 2004; Friedman and Popescu, 2004; Friedman, Hastie, Höfling, and Tibshirani, 2007). For the most part, the existing theoretical results and accompanying optimization algorithms derived for PLS estimators can be extended to general response variables through generalized linear models and penalized likelihood (cf. Tibshirani, 1996; Fu, 2003; Park and Hastie, 2006) and to censored outcome data through the proportional hazards (PH) model (Cox, 1972) and partial likelihood (Tibshirani, 1997; Park and Hastie, 2006; Zhang and Lu, 2007). However, neither the technical arguments nor the optimization algorithms apply to general penalized M- and Z-estimators because a convex loss function is absent and the Hessian may not be directly estimable. Recently, authors (Johnson, 2005; Johnson, Lin, and Zeng, 2007) extended the earlier notion of a penalized estimating function (Fu, 2003) and showed that many of the asymptotic results for penalized likelihood (Zou, 2006; Zhang and Lu, 2007) do, in fact, hold for a wide class of semiparametric models under modest regularity conditions. Despite these theoretical advances for penalized estimating functions and variable selection in semiparametric models, efficient computational strategies are lacking and must typically be considered on a case-by-case basis.

Fan and Li (2001) proposed an algorithm that yields a local solution to a general constrained optimization problem using Newton-type steps. In the case of $\ell_1$ penalties on the regression coefficients, Tibshirani (1996) dismissed this same algorithm as inefficient, especially when compared to linear and quadratic programming (QP) for the same constrained optimization problem. A survey of the literature on $\ell_1$-regularization suggests that QP is the preferred method (e.g. Tibshirani, 1996, 1997; Yuan and Lin, 2006); furthermore, the Karush-Kuhn-Tucker conditions implied by the primal (and its dual) are key steps in deriving the entire $\ell_1$-regularized solution path (Osborne et al., 2000; Efron et al., 2004; Park and Hastie, 2005; Yuan and Lin, 2006). Hence, we infer that QP is the gold standard for optimizing a general loss function subject to $\ell_1$ constraints on the regression coefficients (however, see also Fu, 1998; Friedman et al., 2007). Unfortunately, it seems unlikely that QP techniques may be used for solving general penalized estimating functions. This paper, however, considers one exceptional class of estimating function where QP may be used, and shows how the $\ell_1$-regularized estimating function can be efficiently computed using standard software.

Unlike hazards regression for censored data, the accelerated failure time (AFT) model is based on the linear model, a cornerstone of statistical modeling. This has led many prominent statisticians, most notably Sir D. R. Cox, to observe that in the AFT model the estimated regression coefficients have "a rather direct physical interpretation" (Reid, 1994, p. 450). Moreover, it is well known that the PH and AFT models cannot simultaneously hold except in the case of extreme value error distributions. Therefore, two reasons why statisticians give the AFT model serious consideration in censored data regression are (i) the direct interpretability of the regression coefficients, and (ii) the AFT model assumptions can hold when the PH model assumptions fail. These reasons have led instructors to include the AFT model in the standard graduate curriculum in statistics (cf. Kalbfleisch and Prentice, 2002) and have also led to numerous extensions and applications in biometry and econometrics. Therefore, addressing effective

variable selection and estimation strategies in the AFT model is a worthwhile goal. In this paper, we propose a QP algorithm for constrained estimation in the rank-based AFT model. In Section 2, we briefly review the AFT model assumptions and some recent contributions to variable selection methods within this modeling framework. This paper is concerned with a very special class of penalized estimating function derived from linear rank tests for censored data (Prentice, 1978; Tsiatis, 1990). The proposed methods are detailed in Section 3 and couple a novel estimation strategy by Jin, Lin, Wei, and Ying (2003) with a computational trick for penalized least absolute deviation (lad) regression (Wang, Li, and Jiang, 2007). Interestingly, the same trick can be used for a natural extension of the lad-lasso (Wang et al., 2007) to censored data regression (Zhou, 1992; Huang, Ma, and Xie, 2007), which we describe in Section 3.3. We demonstrate the utility of the proposed methods through two examples in Section 4.

2 Background

Consider the linear regression model
$$y_i = x_i^\top\beta + \varepsilon_i, \qquad i = 1,\dots,n, \tag{1}$$
where $y_i$ is the response variable, $x_i$ is a $d$-vector of fixed predictors for the $i$th subject, $\beta$ is a $d$-vector of regression coefficients, and $(\varepsilon_1,\dots,\varepsilon_n)$ are independent and identically distributed errors with absolutely continuous density $f$. Here, we assume that the predictors have been standardized to have mean zero and unit variance. The familiar lasso estimator for $\beta$ is given by the minimizer of the objective function
$$\|y - X\beta\|^2 + \lambda\sum_{j=1}^d |\beta_j|, \tag{2}$$
where $y = (y_1,\dots,y_n)^\top$, $X = (x_1,\dots,x_n)^\top$, and $\lambda$ is a user-specified regularization parameter. The lasso solution to (2) is equivalent to the constrained optimization problem
$$\min_\beta\ \|y - X\beta\|^2, \qquad \text{subject to } \sum_{j=1}^d |\beta_j| \le \tau, \tag{3}$$
for a user-specified parameter $\tau$. We note that there is a one-to-one correspondence between $\lambda$ in (2) and $\tau$ in (3); expression (2) is sometimes referred to as the Lagrangian equivalent of (3) (e.g. Friedman et al., 2007).
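As a concrete illustration of the Lagrangian form (2) on uncensored data, the following minimal R sketch fits a lasso path on simulated data; it assumes the glmnet package, which is not otherwise used in this paper, and the data are purely illustrative.

```r
## Minimal illustration of the lasso objective (2) on uncensored data.
## Assumes the glmnet package (not otherwise used in this paper).
library(glmnet)

set.seed(1)
n <- 100; d <- 5
X <- scale(matrix(rnorm(n * d), n, d))   # standardized predictors, as assumed above
y <- X %*% c(2, -1, 0, 0, 1) + rnorm(n)
fit <- glmnet(X, y, alpha = 1)           # alpha = 1 gives the pure l1 penalty
coef(fit, s = 0.1)                       # coefficients at lambda = 0.1
```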

Now, define the random observables
$$Z_i = \min(y_i, C_i), \qquad \delta_i = I(y_i \le C_i), \qquad i = 1,\dots,n,$$
where $C_i$ is a random censoring variable for the $i$th subject and $I(\cdot)$ denotes the indicator function. The goal is to estimate the regression coefficients $\beta$ for a user-defined parameter $\lambda$ (or $\tau$) using the observed data $\{(Z_i, \delta_i, x_i),\ i = 1,\dots,n\}$.

A natural extension of the PLS estimators via (2) for censored data is through Buckley-James statistics (Buckley and James, 1979), whereby one essentially replaces the censored observation $(y_i, \delta_i = 0)$ with an imputed value, say $\hat y_i$, and proceeds as usual. This method has been applied in some applications (Johnson, 2005; Datta et al., 2007) with varying levels of success. It is well known that Buckley-James statistics may admit multiple solutions in finite samples even if they appear to work well in simulation studies. At best, the solution to a penalized Buckley-James statistic can only be an approximate one (see Johnson et al., 2007). Therefore, it is desirable to consider other statistics where one can place more confidence in the coefficient estimates. To the best of our knowledge, there is one other research team actively developing methods for variable selection in AFT models (excluding groups working on modified Buckley-James statistics). Huang and colleagues (Huang et al., 2006; Huang et al., 2007; Xie and Huang, 2008) are currently developing methods based on lad regression for censored data. One advantage of their approach is that it is computationally trivial to implement: weighted lad regression of $Z = (Z_1,\dots,Z_n)^\top$ on $X$, where the

weights (Zhou, 1992; Bang and Tsiatis, 2002; Zhou, 2005) are functions of the observed data and calculated via the Kaplan-Meier estimator (Kaplan and Meier, 1958). A simple extension of a recent numerical trick yields the lasso solution for Huang's lad estimator for censored data; this is briefly described in Section 3.3. Previously, Huang et al. (2006) used gradient-directed search algorithms (Friedman and Popescu, 2005) to produce a similar solution.

Recently, Johnson (2005) considered model selection within the rank-based AFT model framework. We note that Johnson's (2005) earlier methods are very different from the current proposal because he attempted to solve the penalized estimating function for a general penalty (i.e. Fan and Li, 2001). Penalized rank-based estimation in the AFT model is not trivial even without the task of variable selection; this difficulty is due, in part, to the fact that the original unpenalized estimating function is neither continuous nor component-wise monotone for general weight functions. Subsequently, the penalized estimating function is also very difficult to solve for general weight functions and general penalty functions. However, if we restrict our attention to $\ell_1$ penalty functions, the story is quite different! The fact that an otherwise slippery $\ell_1$-regularized estimating function can be efficiently solved using QP is an important finding, and this result forms our thesis in Section 3.

3 Methods

The weighted log-rank estimating function (Prentice, 1978; Tsiatis, 1990) is defined as
$$\Psi^o_\phi(\beta) = \sum_{i=1}^n \delta_i\,\phi\{e_i(\beta),\beta\}\big[x_i - \bar x\{e_i(\beta),\beta\}\big],$$
where $e_i(\beta) = Z_i - x_i^\top\beta$, $\phi$ is a possibly data-dependent weight function satisfying condition A7 of Johnson (2005, Appendix 1),
$$S^{(0)}(t,\beta) = n^{-1}\sum_{j=1}^n I\{e_j(\beta) \ge t\}, \qquad S^{(1)}(t,\beta) = n^{-1}\sum_{j=1}^n x_j\, I\{e_j(\beta) \ge t\},$$

and $\bar x(t,\beta) = S^{(1)}(t,\beta)/S^{(0)}(t,\beta)$. Define the penalized, weighted log-rank estimating function as
$$\Psi_\phi(\beta) = \Psi^o_\phi(\beta) + n\begin{pmatrix} \lambda_1\,\mathrm{sgn}(\beta_1) \\ \vdots \\ \lambda_d\,\mathrm{sgn}(\beta_d) \end{pmatrix}, \tag{4}$$
where $(\lambda_1,\dots,\lambda_d)$ are coefficient-dependent regularization parameters (Zou, 2006; Wang et al., 2007; Zhang and Lu, 2007). The proposed estimator $\hat\beta_\phi$ is defined as a consistent solution to the estimating equation $\Psi_\phi(\beta) = 0$. Two weight functions of substantial interest are $\phi(t,\beta) = 1$ and $\phi(t,\beta) = S^{(0)}(t,\beta)$, which correspond to the log-rank (Mantel, 1966) and Gehan (1965) weights, respectively.

Define the estimator $\hat\beta^o_\phi$ as a consistent solution to the original estimating equation $\Psi^o_\phi(\beta) = 0$. It has been established that, under suitable regularity conditions, the random vector $n^{1/2}(\hat\beta^o_\phi - \beta_0)$ converges in distribution to a mean-zero Gaussian random vector with covariance matrix $A^{-1}_\phi B_\phi A^{-1}_\phi$, where
$$A_\phi = \lim_{n\to\infty} n^{-1}\sum_{i=1}^n \int \phi(t,\beta_0)\{x_i - \bar x(t,\beta_0)\}^{\otimes 2}\{-\dot\lambda(t)/\lambda(t)\}\,dN_i(t,\beta_0),$$
$$B_\phi = \lim_{n\to\infty} n^{-1}\sum_{i=1}^n \int \{\phi(t,\beta_0)\}^2\{x_i - \bar x(t,\beta_0)\}^{\otimes 2}\,dN_i(t,\beta_0),$$
$N_i(t,\beta) = I\{e_i(\beta) \le t,\ \delta_i = 1\}$, $\lambda(t)$ is the hazard function of the errors $e_i(\beta_0)$, $\dot\lambda(t) = (d/dt)\lambda(t)$, and $M^{\otimes 2} = M M^\top$. Throughout this paper, we shall assume that $A_\phi$ is nonsingular.

3.1 Estimation and inference with Gehan-type weight functions

In the case where $\phi(t,\beta) = S^{(0)}(t,\beta)$, the estimating function $\Psi_\phi(\beta)$ in (4) simplifies to
$$\Psi_G(\beta) = n^{-1}\sum_{i=1}^n\sum_{j=1}^n \delta_i (x_i - x_j)\, I\{e_i(\beta) \le e_j(\beta)\} + n\begin{pmatrix} \lambda_1\,\mathrm{sgn}(\beta_1) \\ \vdots \\ \lambda_d\,\mathrm{sgn}(\beta_d) \end{pmatrix}, \tag{5}$$

where (5) follows from the definition of $\bar x\{e_i(\beta),\beta\}$. It is easy to check that $\Psi_G(\beta)$ in (5) is the gradient of the following function:
$$Q_G(\beta) = L_G(\beta) + n\sum_{j=1}^d \lambda_j|\beta_j|, \tag{6}$$
$$L_G(\beta) = n^{-1}\sum_{i=1}^n\sum_{j=1}^n \delta_i\{e_i(\beta) - e_j(\beta)\}^-, \tag{7}$$
where $c^- = \max(-c, 0)$. This leads to the following optimization problem for the proposed estimator $\hat\beta_G$:
$$\min_{u,\beta}\ \sum_{i=1}^n\sum_{j=1}^n \delta_i u_{ij} \tag{8}$$
$$\text{s.t.}\quad u_{ij} \ge 0,\quad u_{ij} \ge -\{e_i(\beta) - e_j(\beta)\},\ \text{all } i,j;\qquad |\beta_k| \le \tau_k,\quad \tau_k \ge 0,\quad k = 1,\dots,d,$$
where $(\tau_1,\dots,\tau_d)$ are the coefficient-dependent constraints and correspond to $(\lambda_1,\dots,\lambda_d)$ in (6).

It is of considerable practical interest to know how one can solve (8) using standard software (namely, the quantreg package in R). Jin et al. (2003) note that minimizing $L_G(\beta)$ (without variable selection) is equivalent to minimizing
$$\sum_{i=1}^n\sum_{j=1}^n \delta_i\big|e_i(\beta) - e_j(\beta)\big| + \Big|M + \beta^\top\sum_{k=1}^n\sum_{l=1}^n \delta_k(x_l - x_k)\Big|, \tag{9}$$
for a large number $M$. A standard technique for solving (9) is to construct the $(n^2+1)$-dimensional pseudo-response vector $W = (W_1^\top,\dots,W_n^\top,W_{n+1})^\top$ and pseudo-design matrix $\Omega = (\Omega_1^\top,\dots,\Omega_n^\top,\omega_{n+1})^\top$, where
$$W_i = \delta_i(Z_i - Z_1,\, Z_i - Z_2,\,\dots,\, Z_i - Z_n)^\top, \qquad \Omega_i = \delta_i\big[(x_i - x_1),\,(x_i - x_2),\,\dots,\,(x_i - x_n)\big]^\top, \qquad i = 1,\dots,n.$$
The last elements of the response vector and design matrix are $W_{n+1} = M$ and $\omega_{n+1} = \sum_k\sum_l \delta_k(x_k - x_l)$, respectively. Finally, the optimization of $L_G(\beta)$ is accomplished through the median regression of $W$ on $\Omega$ via quantreg in R. The numerical trick to solve (6) is to construct the $(n^2+1+d)$-dimensional pseudo-response vector $W^* = (W^\top, 0_d^\top)^\top$ and the corresponding pseudo-design matrix $\Omega^* = [\Omega^\top, n^2\,\mathrm{diag}(\lambda_1,\dots,\lambda_d)]^\top$. Then, one can efficiently compute the minimizer of (6) via median regression of $W^*$ on $\Omega^*$. A similar numerical trick was offered by Wang et al. (2007) in lad regression for uncensored data.
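To make the construction concrete, the following R sketch implements the penalized median-regression trick just described. It is our own illustration, not the author's code: the function name and arguments are hypothetical, and it assumes the quantreg package with user-supplied Z (observed responses), delta (event indicators), X (standardized n x d design), and lambda (the d-vector of regularization parameters).

```r
## Sketch of the l1-regularized Gehan estimator via median regression of
## the pseudo-response W* on the pseudo-design Omega* (our illustration).
library(quantreg)

gehan_lasso <- function(Z, delta, X, lambda, M = 1e6) {
  n <- nrow(X); d <- ncol(X)
  ii <- rep(seq_len(n), each = n)         # index i of each pair (i, j)
  jj <- rep(seq_len(n), times = n)        # index j of each pair (i, j)
  keep <- delta[ii] == 1                  # pairs with delta_i = 1 contribute
  W <- Z[ii][keep] - Z[jj][keep]          # pseudo-responses delta_i (Z_i - Z_j)
  Omega <- X[ii[keep], , drop = FALSE] - X[jj[keep], , drop = FALSE]
  omega_M <- colSums(Omega)               # equals sum_{k,l} delta_k (x_k - x_l)
  W_star <- c(W, M, rep(0, d))            # append artificial and penalty rows
  Omega_star <- rbind(Omega, omega_M, n^2 * diag(lambda, nrow = d))
  ## median regression; method = "fn" (Frisch-Newton) may scale better for large n
  rq.fit(Omega_star, W_star, tau = 0.5)$coefficients
}
```

With all $\lambda_j$ equal, this gives the lasso fit; coefficient-specific adaptive weights, described next, enter through the lambda argument.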

Theorem 1 states the main theoretical result for the Gehan-type estimator, including the existence of an $n^{1/2}$-consistent estimator, the sparsity of the estimator, and the asymptotic normality of the estimator. Let $\mathcal A$ denote the indices of the predictors in the true model, i.e. $\mathcal A = \{j : \beta_{0j} \ne 0\}$. The vector of true values is denoted $\beta_0$. For simplicity, define the $j$th regularization parameter $\lambda_j = \pi_j\lambda$ for all $j = 1,\dots,d$, where $\pi_j = 1/|\hat\beta^o_{G,j}|$ and $\hat\beta^o_G = (\hat\beta^o_{G,1},\dots,\hat\beta^o_{G,d})^\top$ was defined as the solution to the unpenalized Gehan estimating equation $\Psi^o_G(\beta) = 0$. This weighting scheme is sometimes referred to as adaptive lasso (alasso) weighting (Zou, 2006; Zhang and Lu, 2007) and similarly motivated the choice of regularization parameters $(\lambda_1,\dots,\lambda_d)$ in Wang et al. (2007).

Theorem 1 Assume the regularity conditions A1-A6 of Johnson (2005). If $\sqrt n\,\lambda_n \to 0$ and $n\lambda_n \to \infty$, then $\|\hat\beta_G - \beta_0\| = O_p(n^{-1/2})$, $\lim_{n\to\infty} P(\hat\beta_{G,j} = 0) = 1$ for every $j \notin \mathcal A$, and
$$n^{1/2}(\hat\beta_{G,\mathcal A} - \beta_{0,\mathcal A}) \to_d N(0, \Gamma_{G,\mathcal A}),$$
where $\Gamma_G = A^{-1}_G B_G A^{-1}_G$ and $\Gamma_{G,\mathcal A}$ is the sub-matrix containing only the elements of $\Gamma_G$ whose indices belong to $\mathcal A$.

Remark 1. As pointed out by Zou (2006), Johnson et al. (2007), Zhang and Lu (2007), and Wang et al. (2007), the weight $\pi_j$ is the key to obtaining the oracle property. In particular, conditions A1-A6 of Johnson (2005) imply condition C.2(i) of Johnson et al. (2007) and prevent the $j$th element of the penalized estimating function from being dominated by the penalty term, $\lambda_j\,\mathrm{sgn}(\beta_j)$, for $\beta_{j0} \ne 0$, because $\sqrt n\,\lambda_n\pi_j\,\mathrm{sgn}(\beta_j)$ vanishes. However, if $\beta_{j0} = 0$, then $\sqrt n\,\lambda_n\pi_j\,\mathrm{sgn}(\beta_j)$ diverges to $+\infty$ or $-\infty$ depending on the sign of $\beta_j$ in a small neighborhood of $\beta_{j0}$. This is due to the fact that the weights are conveniently defined as $\pi_j = 1/|\hat\beta^o_{G,j}|$. From here, it is easy to see that, because $\sqrt n(\hat\beta^o_{G,j} - \beta_{0j}) = O_p(1)$, we have $\pi_j \ge M^{-1} n^{1/2}$ with probability tending to one for some $M > 0$, so that $\sqrt n\,\lambda_n\pi_j \ge M^{-1}\, n\lambda_n \to \infty$.

The proof of Theorem 1 can be adapted from any number of proofs where one begins with a convex loss function and appends the lasso penalty (e.g. Zou, 2006; Zhang and Lu, 2007). The similarity of this proof to earlier proofs stems from the monotonicity of $\Psi^o_G(\beta)$ and the fact that we have a convex objective function, $L_G(\beta)$. Heuristically, $L_G(\beta)$ plays the role of the negative log-likelihood, and the remaining arguments follow in a straightforward fashion. Hence, the proof is omitted.

3.2 Estimation and inference for general weight functions

In the absence of variable selection, Jin et al. (2003) proposed a novel, iteratively reweighted optimization strategy to solve the estimating function $\Psi^o_\phi(\beta)$. We exploit their strategy here to provide an elegant solution to the general $\ell_1$-regularized, weighted logrank estimating function $\Psi_\phi(\beta)$ using QP. Define the convex objective function
$$Q_\phi(\beta;\tilde\beta) = L_\phi(\beta;\tilde\beta) + n\sum_{j=1}^d \lambda_j|\beta_j|,$$
$$L_\phi(\beta;\tilde\beta) = n^{-1}\sum_{i=1}^n\sum_{j=1}^n \gamma\{e_i(\tilde\beta),\tilde\beta\}\,\delta_i\{e_i(\beta) - e_j(\beta)\}^-,$$
where $\tilde\beta$ is a preliminary consistent estimator for $\beta_0$ and $\gamma(t,\beta) = \phi(t,\beta)/S^{(0)}(t,\beta)$. Note that $Q_\phi(\beta;\tilde\beta)$ is just like $Q_G(\beta)$ but with weights $\gamma$ which do not depend on $\beta$. Using similar reasoning to Jin et al. (2003), it is easy to see that $Q_\phi(\beta;\tilde\beta)$ is a convex function subject to $\ell_1$ constraints on the regression coefficients and can be efficiently minimized using QP along the lines discussed earlier. Note that the weights $\gamma(t,\beta)$ are used to scale $W$ and $\Omega$, but not the regularization parameters $(\lambda_1,\dots,\lambda_d)$. The definitions of $W^*$ and $\Omega^*$ follow, and we subsequently employ quantreg. We offer the following iterative algorithm for estimating $\beta_0$:
$$\hat\beta^{[k]} = \arg\min_\beta\, Q_\phi(\beta;\hat\beta^{[k-1]}), \quad k \ge 1, \qquad \hat\beta^{[0]} = \hat\beta_G.$$
If $\hat\beta^{[k]}$ converges to a limit as $k \to \infty$, then the limit must satisfy $\Psi_\phi(\beta) = 0$, i.e. it is a solution to the regularized estimating function.
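A sketch of this iteratively reweighted scheme follows, reusing the hypothetical gehan_lasso function above; gamma_fun is a user-supplied function returning the weights $\gamma\{e_i(\tilde\beta),\tilde\beta\}$ at the residuals, and the loop length is an assumption of ours. Only the pseudo-data rows are rescaled by $\gamma$, not the penalty rows.

```r
## Iteratively reweighted l1-regularized estimation for a general weight
## function (our sketch, not the author's code).
weighted_rank_lasso <- function(Z, delta, X, lambda, gamma_fun,
                                n_iter = 10, M = 1e6) {
  n <- nrow(X); d <- ncol(X)
  ii <- rep(seq_len(n), each = n); jj <- rep(seq_len(n), times = n)
  keep <- delta[ii] == 1
  b <- gehan_lasso(Z, delta, X, lambda, M)    # beta^[0]: the Gehan lasso fit
  for (k in seq_len(n_iter)) {
    e <- drop(Z - X %*% b)                    # residuals e_i at the current beta
    g <- gamma_fun(e)[ii][keep]               # weights gamma_i, held fixed in beta
    W <- g * (Z[ii][keep] - Z[jj][keep])      # gamma scales W and Omega ...
    Omega <- g * (X[ii[keep], , drop = FALSE] - X[jj[keep], , drop = FALSE])
    W_star <- c(W, M, rep(0, d))              # ... but not the penalty rows
    Omega_star <- rbind(Omega, colSums(Omega), n^2 * diag(lambda, nrow = d))
    b <- quantreg::rq.fit(Omega_star, W_star, tau = 0.5)$coefficients
  }
  b
}

## e.g. log-rank weights (phi = 1, so gamma = 1 / S0):
## gamma_fun <- function(e) 1 / sapply(e, function(t) mean(e >= t))
```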

Theorem 2 Assume the regularity conditions A1-A8 of Johnson (2005). If $\sqrt n\,\lambda_n \to 0$ and $n\lambda_n \to \infty$, then $\|\hat\beta_\phi - \beta_0\| = O_p(n^{-1/2})$, $\lim_{n\to\infty} P(\hat\beta_{\phi,j} = 0) = 1$ for every $j \notin \mathcal A$, and
$$n^{1/2}(\hat\beta_{\phi,\mathcal A} - \beta_{0,\mathcal A}) \to_d N(0, \Gamma_{\phi,\mathcal A}),$$
where $\Gamma_\phi = A^{-1}_\phi B_\phi A^{-1}_\phi$ and $\Gamma_{\phi,\mathcal A}$ is the sub-matrix containing only the elements of $\Gamma_\phi$ whose indices belong to $\mathcal A$.

Remark 2. As noted in Remark 1, the definition of the weight $\pi_j$ is important, and it can simply be defined as $\pi_j = 1/|\hat\beta^o_{\phi,j}|$ for a general weight function $\phi$. However, there is no difficulty in using the following definition: $\pi_j = 1/|\hat\beta^o_{G,j}|$, that is, the unpenalized Gehan estimates. This is due to the fact that the Gehan estimates are themselves $n^{1/2}$-consistent estimators of $\beta_0$. In the sequel, we use the latter definition of $\pi_j$ with the Gehan estimates, $\hat\beta^o_{G,j}$.

The proof of Theorem 2 does not follow as straightforwardly as that of Theorem 1. The principal difficulty lies in the non-monotonicity of $\Psi^o_\phi(\beta)$ and, subsequently, the fact that we do not start with a well-behaved convex function like $L_G(\beta)$ to optimize. Recently, Johnson (2005) obtained similar conclusions to those in Theorem 2 for penalized weighted logrank estimating functions with nonconcave penalty (Fan and Li, 2001). Johnson's arguments can be extended to the current situation with no significant difficulty. Now, however, an even more general theory for penalized estimating functions

can be used to establish conclusions like those contained in Theorems 1-2 (see Johnson et al., 2007, Theorem 1). Hence, the proof here is omitted, but we refer interested readers to the aforementioned papers.

3.3 An extension of lad-lasso to censored data

One method of handling censored data comes from the missing data literature (Horvitz and Thompson, 1952), whereby one inversely weights an observed response by one minus the probability of missingness. This method has been successfully applied to censored data problems by Tsiatis and colleagues (cf. Zhao and Tsiatis, 1997; Bang and Tsiatis, 2002), among others. Let $K(t)$ be the survivor distribution for the censoring random variable at time $t$, i.e. $K(t) = P(C > t)$. Then, one can draw inference on $\beta$ through the weighted loss function
$$\sum_{i=1}^n w_i\big|Z_i - x_i^\top\beta\big|,$$
where $w_i = \delta_i/K(Z_i)$. If $K(t)$ is unknown and censoring is assumed to be independent of the covariate process, it may be estimated nonparametrically via the Kaplan-Meier estimator using the data $\{(Z_i, 1-\delta_i),\ i = 1,\dots,n\}$. Under the weaker assumption that the censoring distribution may depend on the covariate process, $K(t)$ may be estimated semiparametrically through the PH model, for example. We note that our definition of the weight $w_i$ differs from the definition given by Huang et al. (2007), although one can show they are equivalent using a standard argument from survival analysis. Naturally, one can estimate $\beta$ by regressing the scaled response $Z_w = (w_1Z_1,\dots,w_nZ_n)^\top$ on a similarly scaled design matrix $X_w$ via quantreg. Now, suppose one wishes to solve the constrained lad-lasso for censored data, that is, optimize the following objective function:
$$\sum_{i=1}^n w_i\big|Z_i - x_i^\top\beta\big| + n\sum_{j=1}^d \lambda_j|\beta_j|. \tag{10}$$
This can be accomplished using the same technique proposed by Wang et al. (2007). Namely, define the $(n+d)$-dimensional pseudo-response $Z^*_w = (Z_w^\top, 0_d^\top)^\top$ and the $(n+d)\times d$ pseudo-design matrix $X^*_w = [X_w^\top, n\,\mathrm{diag}(\lambda_1,\dots,\lambda_d)]^\top$. Finally, regress the pseudo-response $Z^*_w$ on the pseudo-design $X^*_w$ using quantreg.
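The following R sketch (ours, not the authors') carries out this recipe: Kaplan-Meier weights $w_i = \delta_i/\hat K(Z_i)$ followed by the pseudo-design median regression. It assumes the survival and quantreg packages and ignores ties and the boundary case $\hat K(Z_i) = 0$.

```r
## Sketch of the censored lad-lasso (10) with Kaplan-Meier censoring weights.
library(survival)
library(quantreg)

cens_lad_lasso <- function(Z, delta, X, lambda) {
  n <- nrow(X); d <- ncol(X)
  ## Kaplan-Meier estimate of K(t) = P(C > t), flipping the event indicator
  km <- survfit(Surv(Z, 1 - delta) ~ 1)
  Khat <- stepfun(km$time, c(1, km$surv))(Z)
  w <- delta / pmax(Khat, 1e-10)          # w_i = delta_i / Khat(Z_i), guarded
  Zw_star <- c(w * Z, rep(0, d))          # (n + d)-vector pseudo-response
  Xw_star <- rbind(w * X, n * diag(lambda, nrow = d))
  rq.fit(Xw_star, Zw_star, tau = 0.5)$coefficients
}
```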

In this paragraph, we outline the asymptotic properties of the minimizer of (10). Let $\hat\beta^o = (\hat\beta^o_1,\dots,\hat\beta^o_d)^\top$ be the lad regression estimator for censored data. Define $\hat\beta$ as the minimizer of (10) with $\lambda_j = \lambda\pi_j$, $\pi_j = 1/|\hat\beta^o_j|$. Then, under suitable regularity conditions, it can be shown that if $\sqrt n\,\lambda_n \to 0$ and $n\lambda_n \to \infty$, then $\|\hat\beta - \beta_0\| = O_p(n^{-1/2})$, $\lim_{n\to\infty} P(\hat\beta_j = 0) = 1$ for every $j \notin \mathcal A$, and $n^{1/2}(\hat\beta_{\mathcal A} - \beta_{0,\mathcal A})$ converges in distribution to a mean-zero Gaussian random vector with covariance $V_{\mathcal A}$, where $V$ is the asymptotic variance-covariance matrix of $\hat\beta^o$ and $V_{\mathcal A}$ is the sub-matrix containing only the elements of $V$ whose indices belong to $\mathcal A$ (Johnson et al., 2007, Theorem 1). Hence, the lad regression estimator for censored data in Huang et al. (2007) can be shown to possess an oracle property with weights $(\pi_1,\dots,\pi_d)$.

3.4 Parameter Tuning

For penalized least squares and penalized likelihood, it is fairly straightforward to implement cross-validation or generalized cross-validation (e.g. Tibshirani, 1996, 1997; Fan and Li, 2001), the Akaike information criterion (AIC; Akaike, 1973), or the Bayesian information criterion (BIC; Schwarz, 1978; Zou, Hastie, and Tibshirani, 2004). In the absence of a likelihood or an obvious loss function, defining a useful model selection criterion can be problematic. We summarize two strategies below: one based on viewing $L_G(\beta)$ as a dispersion criterion which can then be used in cross-validation, and another based on statistical rules-of-thumb.

Tuning via cross-validation. For simplicity of notation, we drop the subscript $\phi$ for the weight function and let $\hat\beta_\lambda$ denote an arbitrary regularized estimator with tuning parameter $\lambda$. The traditional BIC criterion (Zou et al., 2004) for model selection is defined as
$$\mathrm{BIC}_L(\lambda) = -2\,l_n(\hat\beta_\lambda) + \log n \cdot d(\lambda), \tag{11}$$
where $l_n(\beta)$ is the log-likelihood and $d(\lambda) = |\mathcal A(\lambda)|$, the cardinality of the active set $\mathcal A(\lambda)$ for $\hat\beta_\lambda$. The definition in (11) extends naturally to other loss functions such as squared error loss and absolute error loss. Furthermore, the $\mathrm{AIC}_L(\lambda)$ criterion is defined similarly but with 2 replacing $\log n$ in the second term on the right-hand side of (11). By this point, it is well established that BIC is asymptotically consistent in model selection but tends to produce models that are too sparse in finite samples.

For penalized estimating functions, there is no natural substitute for $l_n(\beta)$. However, in rank-based regression models, $L_G(\beta)$ may be a legitimate candidate loss function. In the absence of censoring, $L_G(\beta)$ reduces to Jaeckel's (1972) convex dispersion function. Because $L_G(\beta)$ can itself be viewed as a norm, it has a geometric interpretation similar to that of the residual sum of squares or the $\ell_1$-norm. Hence, consider the criterion
$$\mathrm{BIC}(\lambda) = 2\log L_G(\hat\beta_\lambda) + \log n \cdot d(\lambda).$$
Again, we define $\mathrm{AIC}(\lambda)$ similarly. Johnson (2005) used a version of $\mathrm{AIC}(\lambda)$ and showed that it worked well in practice. Another legitimate idea comes from lad regression, where a robust goodness-of-fit statistic can be some function of absolute error loss, $n^{-1}\|y - \hat y_\lambda\|_1$, where $\hat y_\lambda = (\hat y_1,\dots,\hat y_n)^\top$ is the vector of predicted values (note: this prediction will depend on a robust estimate of $E(\varepsilon_1)$, which can potentially be a nontrivial matter with censored data). In any case, inverse weighting methods suggest one use the large sample approximation $n^{-1}\sum_i w_i|Z_i - \hat y_i|$. Our experience has been that cross-validating with an inversely weighted information criterion can be tricky, to say the least.
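For illustration, a short sketch of the dispersion-based criterion: gehan_loss evaluates $L_G(\beta)$ in (7), and bic_gehan computes $\mathrm{BIC}(\lambda)$ for a fitted coefficient vector. The function names and the zero-threshold are ours.

```r
## BIC(lambda) = 2 log L_G(beta_hat) + log(n) d(lambda), per the text above.
gehan_loss <- function(Z, delta, X, beta) {
  e <- drop(Z - X %*% beta)
  D <- outer(e, e, "-")                    # D[i, j] = e_i - e_j
  sum(delta * pmax(-D, 0)) / length(e)     # n^{-1} sum_ij delta_i {e_i - e_j}^-
}

bic_gehan <- function(Z, delta, X, beta_hat, tol = 1e-6) {
  d_lambda <- sum(abs(beta_hat) > tol)     # |A(lambda)|, the active-set size
  2 * log(gehan_loss(Z, delta, X, beta_hat)) + log(length(Z)) * d_lambda
}
```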

Statistical rules-of-thumb. It is well known that cross-validation can be computationally expensive, especially in high dimensions. For this reason, several authors have proposed a variety of statistical rules-of-thumb for choosing $\lambda$. Recently, Wang et al. (2007) proposed a rule-of-thumb for uncensored lad regression with an $\ell_1$ penalty by appealing to a Bayesian argument (Tibshirani, 1996, sect. 5). Such an argument can be useful in a wide variety of settings and, hence, we summarize their strategy below. One can view a penalized likelihood estimator as a Bayesian estimator where each coefficient $\beta_j$ has a double exponential prior with location zero and scale $n\lambda_j$. Then, the optimal $\lambda_j$ is chosen to minimize the following negative posterior log-likelihood:
$$-l_n(\beta) + \sum_{j=1}^d\left\{n\lambda_j|\beta_j| - \log\!\Big(\frac{n\lambda_j}{2}\Big)\log(n)\right\}. \tag{12}$$
Wang et al. refer to (12) as a BIC criterion; minimizing it over $\lambda_j$ leads to the optimal BIC tuning parameters $\lambda_j = \log(n)/(n|\beta_j|)$. The optimal AIC parameters, $\lambda_j = 1/(n|\beta_j|)$, are derived similarly. Using our earlier notation, $\lambda_j = \lambda\pi_j$, which implies $\lambda_{\mathrm{BIC}} = n^{-1}\log(n)$ and $\lambda_{\mathrm{AIC}} = n^{-1}$. For censored data, we do not expect these rules-of-thumb to be completely satisfactory because they ignore censoring; recall that, for a fixed sample size $n$, the information in the sample decreases as the censoring proportion increases. These parameter values may be reasonable starting points, with the understanding that they will generally be too small and hence result in models that are too complex. Alternatively, an ad-hoc adjustment for censoring could be as simple as dividing by the uncensored proportion, e.g. $\lambda_{\mathrm{AIC}} = (n\pi_U)^{-1}$ and $\lambda_{\mathrm{BIC}} = (n\pi_U)^{-1}\log n$, with $\pi_U = P(\delta = 1)$. We are currently investigating other strategies for selecting the regularization parameters in general penalized estimating functions that can be motivated via other lines of reasoning.
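These rules-of-thumb reduce to one-liners in R; the censoring adjustment below uses the observed uncensored proportion as an estimate of $\pi_U$.

```r
## Rule-of-thumb global tuning parameters with the ad-hoc censoring adjustment.
n <- length(delta)
pi_U <- mean(delta)                 # observed uncensored proportion
lambda_AIC <- 1 / (n * pi_U)
lambda_BIC <- log(n) / (n * pi_U)
```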

4 Examples

4.1 Mayo Primary Biliary Cirrhosis Study

We consider the Mayo primary biliary cirrhosis (PBC) data (Fleming and Harrington, 1991, Appendix D.1). The data contain information about the survival time and prognostic variables for 418 patients who were eligible to participate in a randomized study of the drug D-penicillamine. Of the 418 patients who met standard eligibility criteria, a total of 312 participated in the randomized portion of the study. The investigators used stepwise deletion to build a Cox proportional hazards model for the natural history of PBC (Dickson et al., 1989). Our summary of this data set is for descriptive purposes only and illustrates the continuous shrinkage and estimation feature of the proposed method. A more sophisticated analysis of the Mayo PBC data would include the longitudinal observations and perhaps some level of joint modeling; such a detailed analysis goes beyond the thesis of this paper. At the same time, the Mayo PBC data set seems a reasonable choice for illustrating the operating characteristics of estimators based on the AFT model, as Lin et al. (1993) have argued that the PH model does not fit the data well.

Our summary of the Mayo PBC data consists of two analyses. First, we analyze a data set of 418 patients using five predictors that define the natural history model: age, log(albumin), log(bilirubin), edema, and log(protime); see Fleming and Harrington (1991, Table 4.6.3). These five predictors have already been preselected to be highly correlated with survival (Dickson et al., 1989); hence, our goal is not variable selection per se but rather to contrast the differences between the Gehan and logrank coefficient estimates for each of lasso and alasso. Some authors (e.g. Yuan and Lin, 2006) have suggested a connect-the-dots approximation to exact coefficient paths, whereby one repeats the optimization algorithm over a fine grid of regularization parameters, $\{\lambda_1,\dots,\lambda_M\}$, and simply draws line segments between adjacent coefficient estimates (a small sketch of this appears below).
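A connect-the-dots path takes only a few lines in R, reusing the hypothetical gehan_lasso sketch from Section 3.1 with a common multiplier over a grid; the grid endpoints here are arbitrary.

```r
## Connect-the-dots approximation to the Gehan lasso coefficient path,
## with data Z, delta, X as before.
lams <- exp(seq(log(1e-4), log(1), length.out = 25))   # grid lambda_1 < ... < lambda_M
path <- sapply(lams, function(l) gehan_lasso(Z, delta, X, rep(l, ncol(X))))
matplot(log(lams), t(path), type = "l",
        xlab = "log lambda", ylab = "Estimates")       # segments between estimates
```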

[Figure 1: Approximate coefficient paths for five independent variables in the natural history model using the Mayo primary biliary cirrhosis data. Panels: (a) Gehan lasso, (b) logrank lasso, (c) Gehan alasso, (d) logrank alasso; each panel plots the estimates against log lambda. See text for which lines correspond to which independent variables.]

We calculated this approximate coefficient path for the five predictors in the natural history model for each of four estimators (Gehan and logrank, each with lasso and alasso), with K = 5 steps in the logrank estimator. The results are summarized in the four displays of Figure 1. The same model, data set, and rank-based estimators were considered by Jin et al. (2003); however, no regularized estimation procedures were considered there. The unpenalized rank-based estimators correspond to $\lambda = 0$, that is, the left-hand side of each display in Figure 1. Each display in Figure 1 contains five lines, one for each independent variable: age (solid black line), albumin (dashed red line), bilirubin (dotted green line), edema (royal blue broken line), and protime (teal dashed line).

One very insightful contribution of the coefficient paths in Figure 1 is that one can assess the relative importance of each variable compared to the remaining variables in the active set by observing which variables are forced out as $\lambda$ increases incrementally. For the Gehan estimator with lasso penalty, age is forced out first, followed by protime second and albumin third. For the logrank estimator with lasso penalty, protime is out first, followed by age second, then edema third and albumin fourth. For the alasso penalty, protime and albumin are forced out first for both the Gehan and logrank weight functions. Then, the Gehan path forces age out third and edema fourth, while the logrank path forces age and edema out in the reverse order. In other words, each combination of weights ($\phi$ and $\pi_j$) affects the coefficient paths in subtle ways. For Gehan weight functions, it is well known that the coefficient estimates depend on the censoring distribution, and this fact is likely contributing to some of the observed differences between the Gehan and logrank coefficient estimates.

Our second analysis considers ten predictors for the smaller cohort of patients (n = 312) from the randomized study. The ten predictors include age, log(albumin), log(alkaline phosphatase), ascites, log(bilirubin), edema, hepatomegaly, log(protime), sex, and spiders. The original study investigators used stepwise deletion and the PH model on this data set to construct a natural history model for PBC.

[Figure 2: Approximate coefficient paths for ten independent variables in the Mayo primary biliary cirrhosis data. Panels: (a) Gehan lasso, (b) logrank lasso, (c) Gehan alasso, (d) logrank alasso; each panel plots the estimates against log lambda.]

[Table 1: Order statistics for stage-wise variable selection procedures on the Mayo primary biliary cirrhosis data. Rows: age, albumin, alk. phos., ascites, bilirubin, edema, hepatomegaly, prothrombin, sex, spiders; columns: lasso and alasso under each of the Gehan and logrank weight functions. Table entries refer to the order in which variables enter the rank-based, semiparametric linear model for censored data whose coefficients are given in Figure 2.]

Johnson (2005) analyzed the same data but used a simulated annealing algorithm to estimate the regression coefficients. Moreover, Johnson did not consider the alasso estimator, although it shares the same limiting distribution as the weighted logrank statistic with non-concave penalty (Fan and Li, 2001). The results of our second analysis are displayed in Figure 2.

Viewed from the opposite direction, the coefficient paths in Figure 2 may be regarded as approximate stage-wise forward selection procedures. In other words, we start with the null model for sufficiently large $\lambda$, and variables then enter the active set $\mathcal A(\lambda)$ one at a time as $\lambda$ decreases incrementally. While our method is not a forward selection scheme in the sense of Efron et al. (2004), it may be construed as one in a loose sense (Yuan and Lin, 2006). The order in which variables enter the active set is displayed in Table 1. With the exception of the strongest predictor (bilirubin) and the weakest predictor (alk. phos.), every other variable enters the active set at a different point in the selection procedure depending on the combination of weight function $\phi$ and adaptive weights $\pi_j$. We note that the first five variables

generally correspond to the natural history model (age, albumin, bilirubin, edema, and protime), with the exception of Gehan lasso, where ascites enters the active set just before age. Of the remaining five independent variables, ascites is the next most important. Then, the Gehan estimators have spiders entering seventh, while logrank has spiders farther down the sequence. Regardless of the chosen combination of weights ($\phi$ and $\pi_j$), the coefficient paths can be insightful even after one has selected the optimal parameter $\lambda$.

4.2 Simulation Study

To explore the operating characteristics of the proposed methods, we simulated 100 data sets of size $n$ from the model
$$y_i = x_i^\top\beta_0 + \sigma\varepsilon_i, \qquad i = 1,\dots,n,$$
where $\beta_0 = (3, 1.5, 0, 0, 2, 0, 0, 0)^\top$, and $\varepsilon_i$ and $x_i$ are independent standard normal with the correlation between the $j$th and $k$th components of $x$ equal to $0.5^{|j-k|}$. This model was considered by Tibshirani (1996) and Fan and Li (2001). We set the censoring distribution to be uniform$(0,\tau)$, where $\tau$ was chosen to yield approximately 25% censoring (a data-generation sketch appears below). We compared the model error, $\mathrm{ME} = (\hat\beta_\phi - \beta_0)^\top E(xx^\top)(\hat\beta_\phi - \beta_0)$, of the proposed penalized estimator to that of the original rank-based estimator using the ratio of median model error (RMME). We also compared the average numbers of regression coefficients that are correctly or incorrectly shrunk to 0, that is, with $|\hat\beta_{\phi,j}|$ smaller than a fixed small threshold. The results are presented in Table 2, where "oracle" pertains to the situation in which we know a priori which coefficients are non-zero.

Deficiencies of the lasso are well known by now, and several authors have already compared alasso to lasso with uncensored data (e.g. Zou, 2006; Wang et al., 2007), with censored data in the PH model (Zhang and Lu, 2007), and in other missing data problems (Johnson et al., 2007).
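A data-generation sketch for this design follows (our own illustration; tau_c is a placeholder for the censoring bound $\tau$, which the paper tunes to the target censoring rate).

```r
## Simulate one data set: AR(1)-correlated normal predictors (0.5^|j-k|),
## normal errors, and uniform(0, tau_c) censoring.
sim_data <- function(n, sigma = 1, tau_c = 10) {   # tau_c tuned for ~25% censoring
  beta0 <- c(3, 1.5, 0, 0, 2, 0, 0, 0); d <- length(beta0)
  Sigma <- 0.5^abs(outer(1:d, 1:d, "-"))
  X <- matrix(rnorm(n * d), n, d) %*% chol(Sigma)  # rows ~ N(0, Sigma)
  y <- drop(X %*% beta0) + sigma * rnorm(n)
  C <- runif(n, 0, tau_c)
  list(Z = pmin(y, C), delta = as.numeric(y <= C), X = X)
}
```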

[Table 2: Simulation results on model selection with censored data. Table entries are the ratio of median model error (RMME, %) and the average number of correct (C) and incorrect (I) zeros, under Gehan and logrank weights, for each combination of lasso (L) and adaptive lasso (AL) with tuning parameter chosen by AIC or BIC, plus the oracle fit. Four scenarios are reported: n = 50, sigma = 3, 20% censoring; n = 50, sigma = 1, 20% censoring; n = 75, sigma = 3, 40% censoring; and n = 75, sigma = 1, 40% censoring.]

In Table 2, we confirm that our method performs as advertised across different sample sizes and censoring distributions, and we illustrate the differences between the Gehan and logrank estimates. We note that Johnson (2005) did not present any simulation results for penalized logrank regression coefficients. We first comment that the table entries for RMME are not directly comparable across weight functions (Gehan and logrank) because each is divided by the median model error (MME) of the corresponding full model estimate, the Gehan full model MME and the logrank full model MME, respectively. Having said this, penalized estimates for both the Gehan and logrank estimators have smaller model error than the full model estimates, as one would hope. Furthermore, the advantage of using adaptive weights is more pronounced in models with large $n$, small error variance $\sigma$, and a few strong predictors. This result is consistent with the literature and can be seen in Table 2 for both the Gehan and logrank estimators. Finally, we note that AIC models tend to be too complex, especially when compared with the BIC models. Again, this result is consistent with what has already been reported in the literature; however, this is the first paper to report it for the rank-based AFT model.

5 Remarks

This paper describes $\ell_1$-regularized estimation in the rank-based accelerated failure time (AFT) model. The estimator is defined as a consistent solution to a penalized estimating function, where the estimating function satisfies modest regularity conditions. Furthermore, the operating characteristics of the proposed methods, such as root-$n$ consistency and an oracle property, have been established elsewhere (Johnson, 2005; Johnson et al., 2007).

Unlike earlier methods for variable selection in the rank-based AFT model, the proposed method optimizes a convex loss function and can be executed easily using quantreg in R by extending the

results of Jin et al. (2003). On the other hand, the weighted log-rank estimating function with non-concave penalty will never correspond to the minimizer of a convex objective function for any weight function $\phi$. Interestingly, a numerical trick similar to the one used in this paper offers an elegant solution to the lad-lasso for censored data (Huang et al., 2007).

When one considers statistical inference for censored data, the proportional hazards (PH) model is the most popular, and regularized variable selection in this model follows naturally from penalized likelihood theory and methods. In statistics, it is natural to want more than a single method for any given problem, as each method has built-in assumptions. For example, if the data do not support the PH assumption, the AFT model offers investigators a viable alternative. For this reason, statisticians have worked hard to overcome the many theoretical and computational challenges inherent in this model. This paper illustrates how one can extend current methods for ordinary statistical inference in the unpenalized rank-based AFT model to variable selection in the rank-based AFT model with an $\ell_1$ penalty. The regression coefficients in the AFT model have a simple, direct interpretation, whereas coefficients in the PH model are interpreted on a relative hazard scale. The latter interpretation may be awkward outside survival and lifetime regression analyses.


More information

Residuals and model diagnostics

Residuals and model diagnostics Residuals and model diagnostics Patrick Breheny November 10 Patrick Breheny Survival Data Analysis (BIOS 7210) 1/42 Introduction Residuals Many assumptions go into regression models, and the Cox proportional

More information

VARIABLE SELECTION AND STATISTICAL LEARNING FOR CENSORED DATA. Xiaoxi Liu

VARIABLE SELECTION AND STATISTICAL LEARNING FOR CENSORED DATA. Xiaoxi Liu VARIABLE SELECTION AND STATISTICAL LEARNING FOR CENSORED DATA Xiaoxi Liu A dissertation submitted to the faculty of the University of North Carolina at Chapel Hill in partial fulfillment of the requirements

More information

Properties of optimizations used in penalized Gaussian likelihood inverse covariance matrix estimation

Properties of optimizations used in penalized Gaussian likelihood inverse covariance matrix estimation Properties of optimizations used in penalized Gaussian likelihood inverse covariance matrix estimation Adam J. Rothman School of Statistics University of Minnesota October 8, 2014, joint work with Liliana

More information

VARIABLE SELECTION AND ESTIMATION WITH THE SEAMLESS-L 0 PENALTY

VARIABLE SELECTION AND ESTIMATION WITH THE SEAMLESS-L 0 PENALTY Statistica Sinica 23 (2013), 929-962 doi:http://dx.doi.org/10.5705/ss.2011.074 VARIABLE SELECTION AND ESTIMATION WITH THE SEAMLESS-L 0 PENALTY Lee Dicker, Baosheng Huang and Xihong Lin Rutgers University,

More information

Day 4: Shrinkage Estimators

Day 4: Shrinkage Estimators Day 4: Shrinkage Estimators Kenneth Benoit Data Mining and Statistical Learning March 9, 2015 n versus p (aka k) Classical regression framework: n > p. Without this inequality, the OLS coefficients have

More information

Tuning Parameter Selection in L1 Regularized Logistic Regression

Tuning Parameter Selection in L1 Regularized Logistic Regression Virginia Commonwealth University VCU Scholars Compass Theses and Dissertations Graduate School 2012 Tuning Parameter Selection in L1 Regularized Logistic Regression Shujing Shi Virginia Commonwealth University

More information

Chapter 3. Linear Models for Regression

Chapter 3. Linear Models for Regression Chapter 3. Linear Models for Regression Wei Pan Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, MN 55455 Email: weip@biostat.umn.edu PubH 7475/8475 c Wei Pan Linear

More information

Post-Selection Inference

Post-Selection Inference Classical Inference start end start Post-Selection Inference selected end model data inference data selection model data inference Post-Selection Inference Todd Kuffner Washington University in St. Louis

More information

Nonnegative Garrote Component Selection in Functional ANOVA Models

Nonnegative Garrote Component Selection in Functional ANOVA Models Nonnegative Garrote Component Selection in Functional ANOVA Models Ming Yuan School of Industrial and Systems Engineering Georgia Institute of Technology Atlanta, GA 3033-005 Email: myuan@isye.gatech.edu

More information

On the covariate-adjusted estimation for an overall treatment difference with data from a randomized comparative clinical trial

On the covariate-adjusted estimation for an overall treatment difference with data from a randomized comparative clinical trial Biostatistics Advance Access published January 30, 2012 Biostatistics (2012), xx, xx, pp. 1 18 doi:10.1093/biostatistics/kxr050 On the covariate-adjusted estimation for an overall treatment difference

More information

Survival Analysis. Lu Tian and Richard Olshen Stanford University

Survival Analysis. Lu Tian and Richard Olshen Stanford University 1 Survival Analysis Lu Tian and Richard Olshen Stanford University 2 Survival Time/ Failure Time/Event Time We will introduce various statistical methods for analyzing survival outcomes What is the survival

More information

Regularization: Ridge Regression and the LASSO

Regularization: Ridge Regression and the LASSO Agenda Wednesday, November 29, 2006 Agenda Agenda 1 The Bias-Variance Tradeoff 2 Ridge Regression Solution to the l 2 problem Data Augmentation Approach Bayesian Interpretation The SVD and Ridge Regression

More information

Variable Selection for Highly Correlated Predictors

Variable Selection for Highly Correlated Predictors Variable Selection for Highly Correlated Predictors Fei Xue and Annie Qu arxiv:1709.04840v1 [stat.me] 14 Sep 2017 Abstract Penalty-based variable selection methods are powerful in selecting relevant covariates

More information

Regression, Ridge Regression, Lasso

Regression, Ridge Regression, Lasso Regression, Ridge Regression, Lasso Fabio G. Cozman - fgcozman@usp.br October 2, 2018 A general definition Regression studies the relationship between a response variable Y and covariates X 1,..., X n.

More information

Lecture 22 Survival Analysis: An Introduction

Lecture 22 Survival Analysis: An Introduction University of Illinois Department of Economics Spring 2017 Econ 574 Roger Koenker Lecture 22 Survival Analysis: An Introduction There is considerable interest among economists in models of durations, which

More information

Stepwise Searching for Feature Variables in High-Dimensional Linear Regression

Stepwise Searching for Feature Variables in High-Dimensional Linear Regression Stepwise Searching for Feature Variables in High-Dimensional Linear Regression Qiwei Yao Department of Statistics, London School of Economics q.yao@lse.ac.uk Joint work with: Hongzhi An, Chinese Academy

More information

The Pennsylvania State University The Graduate School Eberly College of Science NEW PROCEDURES FOR COX S MODEL WITH HIGH DIMENSIONAL PREDICTORS

The Pennsylvania State University The Graduate School Eberly College of Science NEW PROCEDURES FOR COX S MODEL WITH HIGH DIMENSIONAL PREDICTORS The Pennsylvania State University The Graduate School Eberly College of Science NEW PROCEDURES FOR COX S MODEL WITH HIGH DIMENSIONAL PREDICTORS A Dissertation in Statistics by Ye Yu c 2015 Ye Yu Submitted

More information

A Confidence Region Approach to Tuning for Variable Selection

A Confidence Region Approach to Tuning for Variable Selection A Confidence Region Approach to Tuning for Variable Selection Funda Gunes and Howard D. Bondell Department of Statistics North Carolina State University Abstract We develop an approach to tuning of penalized

More information

Lecture 14: Variable Selection - Beyond LASSO

Lecture 14: Variable Selection - Beyond LASSO Fall, 2017 Extension of LASSO To achieve oracle properties, L q penalty with 0 < q < 1, SCAD penalty (Fan and Li 2001; Zhang et al. 2007). Adaptive LASSO (Zou 2006; Zhang and Lu 2007; Wang et al. 2007)

More information

Efficient Estimation of Censored Linear Regression Model

Efficient Estimation of Censored Linear Regression Model 2 3 4 5 6 7 8 9 2 3 4 5 6 7 8 9 2 2 22 23 24 25 26 27 28 29 3 3 32 33 34 35 36 37 38 39 4 4 42 43 44 45 46 47 48 Biometrika (2), xx, x, pp. 4 C 28 Biometrika Trust Printed in Great Britain Efficient Estimation

More information

Unified LASSO Estimation via Least Squares Approximation

Unified LASSO Estimation via Least Squares Approximation Unified LASSO Estimation via Least Squares Approximation Hansheng Wang and Chenlei Leng Peking University & National University of Singapore First version: May 25, 2006. Revised on March 23, 2007. Abstract

More information

STAT 331. Accelerated Failure Time Models. Previously, we have focused on multiplicative intensity models, where

STAT 331. Accelerated Failure Time Models. Previously, we have focused on multiplicative intensity models, where STAT 331 Accelerated Failure Time Models Previously, we have focused on multiplicative intensity models, where h t z) = h 0 t) g z). These can also be expressed as H t z) = H 0 t) g z) or S t z) = e Ht

More information

6 Pattern Mixture Models

6 Pattern Mixture Models 6 Pattern Mixture Models A common theme underlying the methods we have discussed so far is that interest focuses on making inference on parameters in a parametric or semiparametric model for the full data

More information

The Iterated Lasso for High-Dimensional Logistic Regression

The Iterated Lasso for High-Dimensional Logistic Regression The Iterated Lasso for High-Dimensional Logistic Regression By JIAN HUANG Department of Statistics and Actuarial Science, 241 SH University of Iowa, Iowa City, Iowa 52242, U.S.A. SHUANGE MA Division of

More information

Saharon Rosset 1 and Ji Zhu 2

Saharon Rosset 1 and Ji Zhu 2 Aust. N. Z. J. Stat. 46(3), 2004, 505 510 CORRECTED PROOF OF THE RESULT OF A PREDICTION ERROR PROPERTY OF THE LASSO ESTIMATOR AND ITS GENERALIZATION BY HUANG (2003) Saharon Rosset 1 and Ji Zhu 2 IBM T.J.

More information

LEAST ANGLE REGRESSION 469

LEAST ANGLE REGRESSION 469 LEAST ANGLE REGRESSION 469 Specifically for the Lasso, one alternative strategy for logistic regression is to use a quadratic approximation for the log-likelihood. Consider the Bayesian version of Lasso

More information

Accelerated Failure Time Models

Accelerated Failure Time Models Accelerated Failure Time Models Patrick Breheny October 12 Patrick Breheny University of Iowa Survival Data Analysis (BIOS 7210) 1 / 29 The AFT model framework Last time, we introduced the Weibull distribution

More information

Variable Selection for Highly Correlated Predictors

Variable Selection for Highly Correlated Predictors Variable Selection for Highly Correlated Predictors Fei Xue and Annie Qu Department of Statistics, University of Illinois at Urbana-Champaign WHOA-PSI, Aug, 2017 St. Louis, Missouri 1 / 30 Background Variable

More information

Machine Learning Linear Regression. Prof. Matteo Matteucci

Machine Learning Linear Regression. Prof. Matteo Matteucci Machine Learning Linear Regression Prof. Matteo Matteucci Outline 2 o Simple Linear Regression Model Least Squares Fit Measures of Fit Inference in Regression o Multi Variate Regession Model Least Squares

More information

Rank-based inference for the accelerated failure time model

Rank-based inference for the accelerated failure time model Biometrika (2003), 90, 2, pp. 341 353 2003 Biometrika Trust Printed in reat Britain Rank-based inference for the accelerated failure time model BY ZHEZHEN JIN Department of Biostatistics, Columbia University,

More information

Or How to select variables Using Bayesian LASSO

Or How to select variables Using Bayesian LASSO Or How to select variables Using Bayesian LASSO x 1 x 2 x 3 x 4 Or How to select variables Using Bayesian LASSO x 1 x 2 x 3 x 4 Or How to select variables Using Bayesian LASSO On Bayesian Variable Selection

More information

Professors Lin and Ying are to be congratulated for an interesting paper on a challenging topic and for introducing survival analysis techniques to th

Professors Lin and Ying are to be congratulated for an interesting paper on a challenging topic and for introducing survival analysis techniques to th DISCUSSION OF THE PAPER BY LIN AND YING Xihong Lin and Raymond J. Carroll Λ July 21, 2000 Λ Xihong Lin (xlin@sph.umich.edu) is Associate Professor, Department ofbiostatistics, University of Michigan, Ann

More information

The MNet Estimator. Patrick Breheny. Department of Biostatistics Department of Statistics University of Kentucky. August 2, 2010

The MNet Estimator. Patrick Breheny. Department of Biostatistics Department of Statistics University of Kentucky. August 2, 2010 Department of Biostatistics Department of Statistics University of Kentucky August 2, 2010 Joint work with Jian Huang, Shuangge Ma, and Cun-Hui Zhang Penalized regression methods Penalized methods have

More information

Nonconcave Penalized Likelihood with A Diverging Number of Parameters

Nonconcave Penalized Likelihood with A Diverging Number of Parameters Nonconcave Penalized Likelihood with A Diverging Number of Parameters Jianqing Fan and Heng Peng Presenter: Jiale Xu March 12, 2010 Jianqing Fan and Heng Peng Presenter: JialeNonconcave Xu () Penalized

More information

arxiv: v1 [stat.me] 30 Dec 2017

arxiv: v1 [stat.me] 30 Dec 2017 arxiv:1801.00105v1 [stat.me] 30 Dec 2017 An ISIS screening approach involving threshold/partition for variable selection in linear regression 1. Introduction Yu-Hsiang Cheng e-mail: 96354501@nccu.edu.tw

More information

UNIVERSITY OF CALIFORNIA, SAN DIEGO

UNIVERSITY OF CALIFORNIA, SAN DIEGO UNIVERSITY OF CALIFORNIA, SAN DIEGO Estimation of the primary hazard ratio in the presence of a secondary covariate with non-proportional hazards An undergraduate honors thesis submitted to the Department

More information

1 The problem of survival analysis

1 The problem of survival analysis 1 The problem of survival analysis Survival analysis concerns analyzing the time to the occurrence of an event. For instance, we have a dataset in which the times are 1, 5, 9, 20, and 22. Perhaps those

More information

Prediction & Feature Selection in GLM

Prediction & Feature Selection in GLM Tarigan Statistical Consulting & Coaching statistical-coaching.ch Doctoral Program in Computer Science of the Universities of Fribourg, Geneva, Lausanne, Neuchâtel, Bern and the EPFL Hands-on Data Analysis

More information

Model Selection. Frank Wood. December 10, 2009

Model Selection. Frank Wood. December 10, 2009 Model Selection Frank Wood December 10, 2009 Standard Linear Regression Recipe Identify the explanatory variables Decide the functional forms in which the explanatory variables can enter the model Decide

More information

Robust Variable Selection Through MAVE

Robust Variable Selection Through MAVE Robust Variable Selection Through MAVE Weixin Yao and Qin Wang Abstract Dimension reduction and variable selection play important roles in high dimensional data analysis. Wang and Yin (2008) proposed sparse

More information