Estimation in the l1-Regularized Accelerated Failure Time Model
Estimation in the l1-Regularized Accelerated Failure Time Model

Brent Johnson, PhD
Technical Report, May 2008
Department of Biostatistics, Rollins School of Public Health
Emory University, 1518 Clifton Road, N.E., Atlanta, Georgia
Telephone: (404); FAX: (404); bajohn3@emory.edu
Estimation in the l1-regularized accelerated failure time model

Brent A. Johnson

Abstract

This note considers variable selection in the semiparametric linear regression model for censored data. Semiparametric linear regression for censored data is a natural extension of the linear model for uncensored data; however, random censoring introduces substantial theoretical and numerical challenges. By now, a number of authors have made significant contributions to estimation and inference in the semiparametric linear model, but none of these authors have considered regularized estimation and subsequent variable selection. Our estimator is defined as a consistent solution to a suitably penalized, weighted logrank estimating function. For a general weight function, this estimating function is known to be non-monotone in the regression coefficients and may contain multiple roots. Nevertheless, it underlies one of the more popular estimators that does not assume proportional hazards. The proposed method uses linear and quadratic programming techniques for l1-regularized estimation and can be implemented easily in R. We illustrate the utility of our approach in real and simulated data.

Keywords: Adaptive lasso; Lasso; Oracle property; Penalized least squares; Proportional hazards.
1 Introduction

Over the past several years, substantial attention has been paid to simultaneous estimation and variable selection in the linear model through so-called penalized least squares (PLS) estimators (e.g. Breiman, 1995; Tibshirani, 1996; Fan and Li, 2001; Zou and Hastie, 2005; Zou, 2006; Yuan and Lin, 2006). Because PLS estimators simultaneously shrink some coefficients to zero and estimate the non-zero coefficients, one can manage the theoretical properties of PLS estimators more easily than those of earlier proposals, such as stepwise deletion and subset selection. In addition to desirable asymptotic properties, elegant solutions to the resulting constrained optimization problems are now readily available (e.g. Osborne, Presnell, and Turlach, 2000; Efron, Johnstone, Hastie, and Tibshirani, 2004; Friedman and Popescu, 2004; Friedman, Hastie, Höfling, and Tibshirani, 2007). For the most part, the existing theoretical results and accompanying optimization algorithms derived for PLS estimators can be extended to general response variables through generalized linear models and penalized likelihood (cf. Tibshirani, 1996; Fu, 2003; Park and Hastie, 2006) and to censored outcome data through the proportional hazards (PH) model (Cox, 1972) and partial likelihood (Tibshirani, 1997; Park and Hastie, 2006; Zhang and Lu, 2007). However, neither the technical arguments nor the optimization algorithms apply to general penalized M- and Z-estimators because a convex loss function is absent and the Hessian may not be directly estimable. Recently, authors (Johnson, 2005; Johnson, Lin, and Zeng, 2007) extended the earlier notion of a penalized estimating function (Fu, 2003) and showed that many of the asymptotic results for penalized likelihood (Zou, 2006; Zhang and Lu, 2007) do, in fact, hold for a wide class of semi-parametric models under modest regularity conditions.
Despite these theoretical advances for penalized estimating functions and variable selection in semi-parametric models, efficient computational strategies are lacking and must typically be considered on a case-by-case basis.
Fan and Li (2001) proposed an algorithm that yields a local solution to a general constrained optimization problem using Newton-type steps. In the case of l1 penalties on the regression coefficients, Tibshirani (1996) dismissed this same algorithm as inefficient, especially when compared to linear and quadratic programming (QP) for the same constrained optimization problem. A survey of the literature on l1-regularization suggests that QP is the preferred method (e.g. Tibshirani, 1996, 1997; Yuan and Lin, 2006); furthermore, the Karush-Kuhn-Tucker conditions implied by the primal (and its dual) are key steps in deriving the entire l1-regularized solution path (Osborne et al., 2000; Efron et al., 2004; Park and Hastie, 2005; Yuan and Lin, 2006). Hence, we infer that QP is the gold standard for optimizing a general loss function subject to l1 constraints on the regression coefficients (however, see also Fu, 1998; Friedman et al., 2007). Unfortunately, it seems unlikely that QP techniques can be used for solving general penalized estimating functions. However, this paper considers one exceptional class of estimating function where QP may be used and shows how the l1-regularized estimating function can be efficiently computed using standard software. Unlike hazards regression for censored data, the accelerated failure time model is based on the linear model, a cornerstone of statistical modeling. This has led many prominent statisticians, most notably Sir D. R. Cox, to observe that in the accelerated failure time (AFT) model the estimated regression coefficients have "a rather direct physical interpretation" (Reid, 1994, p. 450). Moreover, it is well known that the PH and AFT models cannot simultaneously hold except in the case of extreme value error distributions.
Therefore, two reasons why statisticians give the AFT model serious consideration in censored data regression are (i) the direct interpretability of the regression coefficients, and (ii) the fact that the AFT model assumptions can hold when the PH model assumptions fail. These reasons have led instructors to include the AFT model as part of the standard graduate curriculum in statistics (cf. Kalbfleisch and Prentice, 2002) and have also led to numerous extensions and applications in biometry and econometrics. Therefore, addressing effective
variable selection and estimation strategies in the AFT model is a worthwhile goal. In this paper, we propose a QP algorithm for constrained estimation in the rank-based AFT model. In Section 2, we briefly review the AFT model assumptions and some recent contributions to variable selection methods within this modeling framework. This paper is concerned with a very special class of penalized estimating function derived from linear rank tests for censored data (Prentice, 1978; Tsiatis, 1990). The proposed methods are detailed in Section 3 and couple a novel estimation strategy by Jin, Lin, Wei, and Ying (2003) with a computational trick for penalized least absolute deviation (lad) regression (Wang, Li, and Jiang, 2007). Interestingly, the same trick can be used for a natural extension of the lad-lasso (Wang et al., 2007) to censored data regression (Zhou, 1992; Huang, Ma, and Xie, 2007), which we describe in Section 3.3. We demonstrate the utility of the proposed methods through two examples in Section 4.

2 Background

Consider the linear regression model

    y_i = x_i'\beta + \varepsilon_i,  (i = 1, ..., n),   (1)

where y_i is the response variable and x_i is a d-vector of fixed predictors for the i-th subject, \beta is a d-vector of regression coefficients, and (\varepsilon_1, ..., \varepsilon_n) are independent and identically distributed errors with absolutely continuous density f. Here, we assume that the predictors have been standardized to have mean zero and unit variance. The familiar lasso estimator for \beta is given by the minimizer of the objective function

    \|y - X\beta\|^2 + \lambda \sum_{j=1}^d |\beta_j|,   (2)
where y = (y_1, ..., y_n)', X = (x_1, ..., x_n)', and \lambda is a user-specified regularization parameter. The lasso solution to (2) is equivalent to the constrained optimization problem

    \min_\beta \|y - X\beta\|^2,  subject to  \sum_{j=1}^d |\beta_j| \le \tau,   (3)

for a user-specified parameter \tau. We note that there is a one-to-one correspondence between \lambda in (2) and \tau in (3); expression (2) is sometimes referred to as the Lagrangian equivalent of (3) (e.g. Friedman et al., 2007). Now, define the random observables

    Z_i = \min(y_i, C_i),  \delta_i = I(y_i \le C_i),  (i = 1, ..., n),

where C_i is a random censoring variable for the i-th subject and I(\cdot) denotes the indicator function. The goal is to estimate the regression coefficients \beta for a user-defined parameter \lambda (or \tau) using the observed data \{(Z_i, \delta_i, x_i), i = 1, ..., n\}. A natural extension of the PLS estimators via (2) to censored data is through Buckley and James (1979) statistics, whereby one essentially replaces the censored observation (y_i, \delta_i = 0) with an imputed value, say \hat{y}_i, and proceeds as usual. This method has been applied in some applications (Johnson, 2005; Datta et al., 2007) with varying levels of success. It is well known that Buckley-James statistics may admit multiple solutions in finite samples even if they appear to work well in simulation studies. At best, the solution to a penalized Buckley-James statistic can only be an approximate one (see Johnson et al., 2007). Therefore, it is desirable to consider other statistics where one can place more confidence in the coefficient estimates. To the best of our knowledge, there is one other research team actively developing methods for variable selection in AFT models (excluding groups working on modified Buckley-James statistics). Huang and colleagues (Huang et al., 2006; Huang et al., 2007; Xie and Huang, 2008) are currently developing methods based on lad regression for censored data.
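The paper solves problems of the form (2) and (3) by linear and quadratic programming via quantreg in R. Purely as an illustration of the lasso objective in (2), the sketch below minimizes it by cyclic coordinate descent with soft-thresholding in numpy; the function names are ours, not the paper's algorithm.

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding operator: the scalar solution of the one-dimensional lasso subproblem."""
    return np.sign(z) * max(abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Minimize ||y - X b||^2 + lam * sum_j |b_j| by cyclic coordinate descent.
    Assumes the columns of X are standardized, as in Section 2."""
    n, d = X.shape
    b = np.zeros(d)
    col_ss = (X ** 2).sum(axis=0)               # per-column sum of squares
    for _ in range(n_iter):
        for j in range(d):
            r_j = y - X @ b + X[:, j] * b[j]    # partial residual excluding coordinate j
            z = X[:, j] @ r_j
            b[j] = soft_threshold(z, lam / 2.0) / col_ss[j]
    return b
```

For large \lambda every coordinate is thresholded to zero, reproducing the shrinkage-to-zero behavior that motivates (2).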
One advantage of their approach is that it is computationally trivial to implement: weighted lad regression of Z = (Z_1, ..., Z_n)' on X, where the
weights (Zhou, 1992; Bang and Tsiatis, 2002; Zhou, 2005) are functions of the observed data and are calculated via the Kaplan-Meier estimator (Kaplan and Meier, 1958). A simple extension of a recent numerical trick yields the lasso solution for Huang's lad estimator for censored data; this is briefly described in Section 3.3. Previously, Huang et al. (2006) used gradient-directed search algorithms (Friedman and Popescu, 2005) to produce a similar solution. Recently, Johnson (2005) considered model selection within the rank-based AFT model framework. We note that Johnson's (2005) earlier methods are very different from the current proposal because he attempted to solve the penalized estimating function for a general penalty (i.e. Fan and Li, 2001). Penalized rank-based estimation in the AFT model is not trivial even without the task of variable selection; this difficulty is due, in part, to the fact that the original unpenalized estimating function is neither continuous nor component-wise monotone for general weight functions. Subsequently, the penalized estimating function is also very difficult to solve for general weight functions and general penalty functions. However, if we restrict our attention to l1 penalty functions, the story is quite different! The fact that an otherwise slippery l1-regularized estimating function can be efficiently solved using QP is an important finding, and this result forms our thesis in Section 3.

3 Methods

The weighted log-rank estimating function (Prentice, 1978; Tsiatis, 1990) is defined as

    \Psi^o_\phi(\beta) = \sum_{i=1}^n \phi\{e_i(\beta), \beta\} [x_i - \bar{x}\{e_i(\beta), \beta\}],

where e_i(\beta) = Z_i - x_i'\beta, \phi is a possibly data-dependent weight function satisfying condition A7 of Johnson (2005, Appendix 1),

    S^{(0)}(t, \beta) = n^{-1} \sum_{j=1}^n I\{e_j(\beta) \ge t\},
    S^{(1)}(t, \beta) = n^{-1} \sum_{j=1}^n x_j I\{e_j(\beta) \ge t\},
and \bar{x}(t, \beta) = S^{(1)}(t, \beta) / S^{(0)}(t, \beta). Define the penalized, weighted log-rank estimating function as

    \Psi_\phi(\beta) = \Psi^o_\phi(\beta) + n (\lambda_1 \mathrm{sgn}(\beta_1), ..., \lambda_d \mathrm{sgn}(\beta_d))',   (4)

where (\lambda_1, ..., \lambda_d) are coefficient-dependent regularization parameters (Zou, 2006; Wang et al., 2007; Zhang and Lu, 2007). The proposed estimator \hat\beta_\phi is defined as a consistent solution to the estimating equations \Psi_\phi(\beta) = 0. Two weight functions of substantial interest are \phi(t, \beta) = 1 and \phi(t, \beta) = S^{(0)}(t, \beta), which correspond to the log-rank (Mantel, 1966) and Gehan (1965) weights, respectively. Define the estimator \hat\beta^o_\phi as a consistent solution to the original estimating equations \Psi^o_\phi(\beta) = 0. It has been established that, under suitable regularity conditions, the random vector n^{1/2}(\hat\beta^o_\phi - \beta_0) converges in distribution to a mean-zero Gaussian random vector with covariance matrix A_\phi^{-1} B_\phi A_\phi^{-1}, where

    A_\phi = \lim_{n\to\infty} n^{-1} \sum_{i=1}^n \int \phi(t, \beta_0) \{x_i - \bar{x}(t, \beta_0)\}^{\otimes 2} \{\dot\lambda(t)/\lambda(t)\} dN_i(t, \beta_0),
    B_\phi = \lim_{n\to\infty} n^{-1} \sum_{i=1}^n \int \{\phi(t, \beta_0)\}^2 \{x_i - \bar{x}(t, \beta_0)\}^{\otimes 2} dN_i(t, \beta_0),

N_i(t, \beta) = I\{e_i(\beta) \le t, \delta_i = 1\}, \lambda(t) is the hazard function of the errors e_i(\beta_0), \dot\lambda(t) = (d/dt)\lambda(t), and M^{\otimes 2} = M M', the Kronecker product of M with itself. Throughout this paper, we shall assume that A_\phi is nonsingular.

3.1 Estimation and inference with Gehan-type weight functions

In the case where \phi(t, \beta) = S^{(0)}(t, \beta), the estimating function \Psi_\phi(\beta) in (4) simplifies to

    \Psi_G(\beta) = n^{-1} \sum_{i=1}^n \sum_{j=1}^n \delta_i (x_i - x_j) I\{e_i(\beta) \le e_j(\beta)\} + n (\lambda_1 \mathrm{sgn}(\beta_1), ..., \lambda_d \mathrm{sgn}(\beta_d))',   (5)
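To make the Gehan form concrete, the unpenalized double sum in (5) can be evaluated directly. The following is a minimal numpy sketch (the function name is ours; it materializes n-by-n arrays, so it assumes modest n):

```python
import numpy as np

def gehan_score(beta, Z, delta, X):
    """Unpenalized Gehan estimating function:
    n^{-1} sum_i sum_j delta_i (x_i - x_j) I{e_i(beta) <= e_j(beta)}."""
    n = len(Z)
    e = Z - X @ beta                          # residuals e_i(beta)
    ind = e[:, None] <= e[None, :]            # indicator I{e_i <= e_j}
    diff = X[:, None, :] - X[None, :, :]      # (x_i - x_j), shape (n, n, d)
    w = delta[:, None] * ind                  # delta_i times the indicator
    return (w[:, :, None] * diff).sum(axis=(0, 1)) / n
```

For a two-point toy data set with Z = (0, 1), x = (0, 1), both uncensored, the score at \beta = 0 is (x_1 - x_2)/2 = -0.5, which the sketch reproduces.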
where (5) follows from the definition of \bar{x}\{e_i(\beta), \beta\}. It is easy to check that \Psi_G(\beta) in (5) is the gradient of the following function:

    Q_G(\beta) = L_G(\beta) + n \sum_{j=1}^d \lambda_j |\beta_j|,   (6)
    L_G(\beta) = n^{-1} \sum_{i=1}^n \sum_{j=1}^n \delta_i \{e_i(\beta) - e_j(\beta)\}^-,   (7)

where \{c\}^- = \max(-c, 0). This leads to the following optimization problem for the proposed estimator \hat\beta_G:

    \min_{u, \beta} \sum_{i=1}^n \sum_{j=1}^n \delta_i u_{ij}   (8)
    s.t.  u_{ij} \ge 0,  u_{ij} \ge -\{e_i(\beta) - e_j(\beta)\}, for all i, j,
          |\beta_k| \le \tau_k,  \tau_k \ge 0,  k = 1, ..., d,

where (\tau_1, ..., \tau_d) are the coefficient-dependent constraints and correspond to (\lambda_1, ..., \lambda_d) in (6). It is of practical interest to know how one can solve (8) using standard software (namely, the quantreg package in R). Jin et al. (2003) note that minimizing L_G(\beta) (without variable selection) is equivalent to minimizing

    \sum_{i=1}^n \sum_{j=1}^n \delta_i |e_i(\beta) - e_j(\beta)| + |M - \beta' \sum_{k=1}^n \sum_{l=1}^n \delta_k (x_l - x_k)|,   (9)

for a large number M. A standard technique for solving (9) is to construct the n^2-dimensional pseudo-response vector W = (W_1', ..., W_n', W_{n+1})' and pseudo-design matrix \Omega = (\Omega_1', ..., \Omega_n', \omega_{n+1})', where

    W_i = \delta_i (Z_i - Z_1, Z_i - Z_2, ..., Z_i - Z_n)',  i = 1, ..., n,
    \Omega_i = \delta_i [(x_i - x_1), (x_i - x_2), ..., (x_i - x_n)]'.

The last elements of the response vector and design matrix are W_{n+1} = M and \omega_{n+1} = \sum_k \sum_l \delta_k (x_l - x_k), respectively. Finally, the optimization of L_G(\beta) is accomplished through the median regression of W on \Omega via quantreg in R. The numerical trick to solve (6) is to construct the
(n^2 + d)-dimensional pseudo-response vector W^* = (W', 0')' and (n^2 + d) \times d pseudo-design matrix \Omega^* = [\Omega', n^2 \mathrm{diag}(\lambda_1, ..., \lambda_d)]'. Then, one can efficiently compute the minimizer of (6) via median regression of W^* on \Omega^*. A similar numerical trick was offered by Wang et al. (2007) in lad regression for uncensored data. Theorem 1 states the main theoretical result for the Gehan-type estimator, including the existence of an n^{1/2}-consistent estimator, the sparsity of the estimator, and the asymptotic normality of the estimator. Let A denote the indices of the predictors in the true model, i.e. A = \{j : \beta_{0j} \ne 0\}. The vector of true values is denoted \beta_0. For simplicity, define the j-th regularization parameter \lambda_j = \pi_j \lambda for all j = 1, ..., d, where \pi_j = 1/|\hat\beta^o_{G,j}|, and \hat\beta^o_G = (\hat\beta^o_{G,1}, ..., \hat\beta^o_{G,d})' was defined as the solution to the unpenalized Gehan estimating equation \Psi^o_G(\beta) = 0. This weighting scheme is sometimes referred to as adaptive lasso (alasso) weighting (Zou, 2006; Zhang and Lu, 2007) and similarly motivated the choice of regularization parameters (\lambda_1, ..., \lambda_d) in Wang et al. (2007).

Theorem 1. Assume the regularity conditions A1-A6 of Johnson (2005). If n^{1/2} \lambda_n \to 0 and n \lambda_n \to \infty, then \|\hat\beta_G - \beta_0\| = O_p(n^{-1/2}), \lim_{n\to\infty} P(\hat\beta_{G,j} = 0) = 1 for every j \notin A, and n^{1/2}(\hat\beta_{G,A} - \beta_{0,A}) \to_d N(0, \Gamma_{G,A}), where \Gamma_G = A_G^{-1} B_G A_G^{-1} and \Gamma_{G,A} is the sub-matrix containing only the elements of \Gamma_G whose indices belong to A.

Remark 1. As pointed out by Zou (2006), Johnson et al. (2007), Zhang and Lu (2007), and Wang et al. (2007), the weight \pi_j is the key to obtaining the oracle property. In particular, conditions A1-A6 of Johnson (2005) imply condition C.2(i) of Johnson et al. (2007) and prevent the j-th element of the penalized estimating function from being dominated by the penalty term, \lambda_j \mathrm{sgn}(\beta_j), for \beta_{0j} \ne 0, because n^{1/2} \lambda_n \pi_j \mathrm{sgn}(\beta_j) vanishes. However, if \beta_{0j} = 0, n^{1/2} \lambda_n \pi_j \mathrm{sgn}(\beta_j) diverges to +\infty
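The construction of the augmented pseudo-data is mechanical and easy to get wrong by hand. The sketch below assembles W^* and \Omega^* in Python (the paper performs the subsequent median regression with quantreg in R); the artificial Jin et al. (2003) observation with response M is included, so the stacked system has n^2 + 1 + d rows. The function name and the default value of M are our choices.

```python
import numpy as np

def gehan_lasso_pseudodata(Z, delta, X, lam, M=1e6):
    """Build the augmented pseudo-response W* and pseudo-design Omega* of
    Section 3.1; median regression of W* on Omega* then yields the
    l1-regularized Gehan estimator.  lam is (lambda_1, ..., lambda_d)."""
    n, d = X.shape
    resp, rows = [], []
    for i in range(n):
        for j in range(n):
            resp.append(delta[i] * (Z[i] - Z[j]))    # delta_i (Z_i - Z_j)
            rows.append(delta[i] * (X[i] - X[j]))    # delta_i (x_i - x_j)
    # artificial observation of Jin et al. (2003): response M, covariate
    # sum_k sum_l delta_k (x_l - x_k)
    v = np.zeros(d)
    for k in range(n):
        for l in range(n):
            v += delta[k] * (X[l] - X[k])
    resp.append(M)
    rows.append(v)
    # lasso augmentation: d extra rows, the j-th equal to n^2 lambda_j e_j
    lam = np.asarray(lam, dtype=float)
    for j in range(d):
        resp.append(0.0)
        rows.append(n ** 2 * lam[j] * np.eye(d)[j])
    return np.asarray(resp), np.asarray(rows)
```

Rows belonging to censored subjects (delta_i = 0) are identically zero and contribute nothing to the median regression, matching the delta_i factor in (9).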
or -\infty, depending on the sign of \beta_j in a small neighborhood of \beta_{0j}. This is due to the fact that the weights are conveniently defined as \pi_j = 1/|\hat\beta^o_{G,j}|. From here, it is easy to see that, because n^{1/2}(\hat\beta^o_{G,j} - \beta_{0j}) = O_p(1), we have n^{1/2} \lambda_n \pi_j = n \lambda_n / |n^{1/2} \hat\beta^o_{G,j}| \to \infty in probability when \beta_{0j} = 0. The proof of Theorem 1 can be adapted from any number of proofs where one begins with a convex loss function and appends the lasso penalty (e.g. Zou, 2006; Zhang and Lu, 2007). The similarity of this proof to earlier proofs stems from the monotonicity of \Psi^o_G(\beta) and the fact that we have a convex objective function, L_G(\beta). Heuristically, L_G(\beta) plays the role of the negative log-likelihood and the remaining arguments follow in a straightforward fashion. Hence, the proof is omitted.

3.2 Estimation and inference for general weight functions

In the absence of variable selection, Jin et al. (2003) proposed a novel, iteratively reweighted optimization strategy to solve the estimating function \Psi^o_\phi(\beta). We exploit their strategy here to provide an elegant solution to the general l1-regularized, weighted logrank estimating function \Psi_\phi(\beta) using QP. Define the convex objective function

    Q_\phi(\beta; \tilde\beta) = L_\phi(\beta; \tilde\beta) + n \sum_{j=1}^d \lambda_j |\beta_j|,
    L_\phi(\beta; \tilde\beta) = n^{-1} \sum_{i=1}^n \sum_{j=1}^n \gamma\{e_i(\tilde\beta), \tilde\beta\} \delta_i \{e_i(\beta) - e_j(\beta)\}^-,

where \tilde\beta is a preliminary consistent estimator for \beta_0 and \gamma(t, \beta) = \phi(t, \beta)/S^{(0)}(t, \beta). Note that Q_\phi(\beta; \tilde\beta) is just like Q_G(\beta) but with weights \gamma which do not depend on \beta. Using similar reasoning to Jin et al. (2003), it is easy to see that Q_\phi(\beta; \tilde\beta) is a convex function subject to l1 constraints on the regression coefficients and can be efficiently minimized using QP along the lines discussed earlier. Note that the weights \gamma(t, \beta) are used to scale W and \Omega, but not the regularization parameters (\lambda_1, ..., \lambda_d). The definitions of W^* and \Omega^* follow as before, and we subsequently employ quantreg. We offer
the following iterative algorithm for estimating \beta_0: \hat\beta^{[k]} = \arg\min_\beta Q_\phi(\beta; \hat\beta^{[k-1]}), k \ge 1, with \hat\beta^{[0]} = \hat\beta_G. If \hat\beta^{[k]} converges to a limit as k \to \infty, then the limit must satisfy \Psi_\phi(\beta) = 0, i.e. it is a solution to the regularized estimating function.

Theorem 2. Assume the regularity conditions A1-A8 of Johnson (2005). If n^{1/2} \lambda_n \to 0 and n \lambda_n \to \infty, then \|\hat\beta_\phi - \beta_0\| = O_p(n^{-1/2}), \lim_{n\to\infty} P(\hat\beta_{\phi,j} = 0) = 1 for every j \notin A, and n^{1/2}(\hat\beta_{\phi,A} - \beta_{0,A}) \to_d N(0, \Gamma_{\phi,A}), where \Gamma_\phi = A_\phi^{-1} B_\phi A_\phi^{-1} and \Gamma_{\phi,A} is the sub-matrix containing only the elements of \Gamma_\phi whose indices belong to A.

Remark 2. As noted in Remark 1, the definition of the weight \pi_j is important; it can simply be defined as \pi_j = 1/|\hat\beta^o_{\phi,j}| for a general weight function \phi. However, there is no difficulty in using the following definition: \pi_j = 1/|\hat\beta^o_{G,j}|, that is, the unpenalized Gehan estimates. This is due to the fact that the Gehan estimates are themselves an n^{1/2}-consistent solution to the estimating equation \Psi^o_\phi(\beta) = 0. In the sequel, we use the latter definition of \pi_j with the Gehan estimates, \hat\beta^o_{G,j}. The proof of Theorem 2 does not follow as straightforwardly as that of Theorem 1. The principal difficulty lies in the non-monotonicity of \Psi^o_\phi(\beta) and, subsequently, the fact that we do not start with a well-behaved convex function like L_G(\beta) to optimize. Recently, Johnson (2005) obtained similar conclusions to those in Theorem 2 for penalized weighted logrank estimating functions with nonconcave penalty (Fan and Li, 2001). Johnson's arguments can be extended to the current situation with no significant difficulty. Now, however, an even more general theory for penalized estimating
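Each step of the iteration needs the weights \gamma\{e_i(\tilde\beta), \tilde\beta\} = \phi/S^{(0)} evaluated at the current residuals. For the log-rank choice \phi = 1, \gamma is simply the reciprocal of the at-risk fraction; a small numpy sketch of that computation (our helper; a full iteration step would pass these weights into the weighted Gehan minimization):

```python
import numpy as np

def logrank_gamma(beta_tilde, Z, X):
    """gamma(t, beta) = phi(t, beta) / S^(0)(t, beta) at the residuals
    e_i(beta_tilde), for the log-rank weight phi = 1, where
    S^(0)(t, beta) = n^{-1} sum_j I{e_j(beta) >= t} is the at-risk fraction."""
    e = Z - X @ beta_tilde
    # s0[i] = fraction of residuals e_j with e_j >= e_i
    s0 = (e[None, :] >= e[:, None]).mean(axis=1)
    return 1.0 / s0
```

With three distinct residuals, the at-risk fractions are 1, 2/3, and 1/3, so the weights are 1, 1.5, and 3; the smallest residual receives the smallest weight, as expected for log-rank reweighting.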
function can be used to establish conclusions like those contained in Theorems 1-2 (see Johnson et al., 2007, Theorem 1). Hence, the proof here is omitted, but we refer interested readers to the aforementioned papers.

3.3 An extension of the lad-lasso to censored data

One method of handling censored data comes from the missing data literature (Horvitz and Thompson, 1952), whereby one inversely weights an observed response by one minus the probability of missingness. This method has been successfully applied to censored data problems by Tsiatis and colleagues (cf. Zhao and Tsiatis, 1997; Bang and Tsiatis, 2002), among others. Let K(t) be the survivor distribution of the censoring random variable at time t, i.e. K(t) = P(C > t). Then, one can draw inference on \beta through the weighted loss function

    \sum_{i=1}^n w_i |Z_i - x_i'\beta|,

where w_i = \delta_i / K(Z_i). If K(t) is unknown and censoring is assumed to be independent of the covariate process, it may be nonparametrically estimated via the Kaplan-Meier estimator using the data \{(Z_i, 1 - \delta_i), i = 1, ..., n\}. Under the weaker assumption that the censoring distribution may depend on the covariate process, K(t) may be estimated semi-parametrically through the PH model, for example. We note that our definition of the weight w_i is different from the definition given by Huang et al. (2007), although one can show they are equivalent using a standard argument from survival analysis. Naturally, one can estimate \beta by regressing the scaled response Z_w = (w_1 Z_1, ..., w_n Z_n)' on a similarly scaled design matrix X_w via quantreg. Now, suppose one wishes to solve the constrained lad-lasso for censored data, that is, to minimize the following objective function:

    \sum_{i=1}^n w_i |Z_i - x_i'\beta| + n \sum_{j=1}^d \lambda_j |\beta_j|.   (10)
This can be accomplished using the same technique proposed by Wang et al. (2007). Namely, define the (n + d)-dimensional pseudo-response Z_w^* = (Z_w', 0')' and the (n + d) \times d pseudo-design matrix X_w^* = [X_w', n \mathrm{diag}(\lambda_1, ..., \lambda_d)]'. Finally, regress the pseudo-response Z_w^* on the pseudo-design X_w^* using quantreg. In this paragraph, we outline the asymptotic properties of the minimizer of (10). Let \hat\beta^o = (\hat\beta^o_1, ..., \hat\beta^o_d)' be the lad regression estimator for censored data. Define \hat\beta as the minimizer of (10) with \lambda_j = \lambda \pi_j, \pi_j = 1/|\hat\beta^o_j|. Then, under suitable regularity conditions, it can be shown that if n^{1/2} \lambda_n \to 0 and n \lambda_n \to \infty, then \|\hat\beta - \beta_0\| = O_p(n^{-1/2}), \lim_{n\to\infty} P(\hat\beta_j = 0) = 1 for every j \notin A, and n^{1/2}(\hat\beta_A - \beta_{0,A}) converges in distribution to a mean-zero Gaussian random vector with covariance V_A, where V is the asymptotic variance-covariance matrix of \hat\beta^o and V_A is the sub-matrix containing only the elements of V whose indices belong to A (Johnson et al., 2007, Theorem 1). Hence, the lad regression estimator for censored data in Huang et al. (2007) can be shown to possess an oracle property with weights (\pi_1, ..., \pi_d).

3.4 Parameter Tuning

For penalized least squares and penalized likelihood, it is fairly straightforward to implement cross-validation or generalized cross-validation (e.g. Tibshirani, 1996, 1997; Fan and Li, 2001), the Akaike information criterion (AIC; Akaike, 1973), or the Bayesian information criterion (BIC; Schwarz, 1978; Zou, Hastie, and Tibshirani, 2004). In the absence of a likelihood or an obvious loss function, defining a useful model selection criterion can be problematic. We summarize two strategies below: one based on viewing L_G(\beta) as a dispersion criterion which can then be used in cross-validation, and another based on statistical rules-of-thumb.
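The inverse-probability-of-censoring weights w_i = \delta_i / \hat{K}(Z_i) of this section can be computed from a Kaplan-Meier fit to the censoring indicator 1 - \delta_i. Below is a minimal numpy sketch assuming distinct observation times; evaluating \hat{K} at the left limit Z_i- (so an observed failure at a censoring time is not zero-weighted) is our convention, not spelled out in the text.

```python
import numpy as np

def ipw_weights(Z, delta):
    """Weights w_i = delta_i / K(Z_i-), where K(t) = P(C > t) is the
    Kaplan-Meier estimate of the censoring survivor function fitted to
    {(Z_i, 1 - delta_i)}.  Assumes distinct observation times."""
    n = len(Z)
    order = np.argsort(Z)
    censored = 1 - delta[order]        # censoring "event" indicator, time-ordered
    K = np.empty(n)
    surv = 1.0
    for i in range(n):
        K[i] = surv                    # K just before the i-th ordered time
        if censored[i] == 1:
            surv *= 1.0 - 1.0 / (n - i)
    w = np.zeros(n)
    w[order] = np.where(censored == 1, 0.0, 1.0 / K)
    return w
```

A convenient sanity check is that, with this convention, the weights sum to n in small examples with no late censoring: censored subjects get weight zero and uncensored subjects are up-weighted to compensate.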
Tuning via cross-validation. For simplicity of notation, we drop the subscript \phi for the weight function and let \hat\beta_\lambda denote an arbitrary regularized estimator with tuning parameter \lambda. The traditional BIC criterion (Zou et al., 2004) for model selection is defined as

    BIC_L(\lambda) = -2 \ell_n(\hat\beta_\lambda) + \log n \cdot d(\lambda),   (11)

where \ell_n(\beta) is the log-likelihood and d(\lambda) = |A(\lambda)|, the cardinality of the active set A(\lambda) for \hat\beta_\lambda. The definition in (11) extends naturally to other loss functions, such as squared error loss and absolute error loss. Furthermore, the AIC_L(\lambda) criterion is similarly defined but with 2 replacing \log n in the second expression on the right-hand side of (11). By this point, it is well established that BIC is asymptotically consistent in model selection but tends to produce models that are too sparse in finite samples. For penalized estimating functions, there is no natural substitute for \ell_n(\beta). However, in the rank-based regression models, L_G(\beta) may be a legitimate candidate loss function. In the absence of censoring, L_G(\beta) reduces to Jaeckel's (1972) convex dispersion function. Because L_G(\beta) can itself be viewed as a norm, it has a similar geometric interpretation to the residual sum of squares or the l1-norm. Hence, consider the criterion

    BIC(\lambda) = 2 \log L_G(\hat\beta_\lambda) + \log n \cdot d(\lambda).

Again, we define AIC(\lambda) similarly. Johnson (2005) used a version of AIC(\lambda) and showed that it worked well in practice. Another legitimate idea comes from lad regression, where a robust goodness-of-fit statistic can be some function of absolute error loss, n^{-1} \|y - \hat{y}_\lambda\|_1, where the predicted value is \hat{y}_\lambda = (\hat{y}_1, ..., \hat{y}_n)' (note: this prediction will depend on a robust estimate of E(\varepsilon_1), which can potentially be a nontrivial matter with censored data). In any case, inverse weighting methods suggest one use the large-sample approximation n^{-1} \sum_i w_i |Z_i - \hat{y}_i|. Our experience has been that cross-validating with an inversely weighted information criterion can be tricky, to say the least.
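The dispersion-based criterion above is easy to compute once L_G is available. A numpy sketch of L_G(\beta) from (7) and the resulting BIC(\lambda) follows (the helpers are ours, with d(\lambda) taken as the number of nonzero coefficients):

```python
import numpy as np

def gehan_loss(beta, Z, delta, X):
    """L_G(beta) = n^{-1} sum_i sum_j delta_i {e_i(beta) - e_j(beta)}^-,
    with {c}^- = max(-c, 0), as in (7)."""
    e = Z - X @ beta
    n = len(Z)
    # entry [i, j] is max(e_j - e_i, 0) = {e_i - e_j}^-
    return float((delta[:, None] * np.maximum(e[None, :] - e[:, None], 0.0)).sum() / n)

def gehan_bic(beta, Z, delta, X):
    """BIC(lambda) = 2 log L_G(beta_hat_lambda) + log(n) d(lambda), with
    d(lambda) counting the nonzero coefficients (the active set)."""
    n = len(Z)
    d_active = int(np.count_nonzero(beta))
    return 2.0 * np.log(gehan_loss(beta, Z, delta, X)) + np.log(n) * d_active
```

One then evaluates gehan_bic over a grid of \lambda values (each with its own fitted \hat\beta_\lambda) and selects the minimizer.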
Statistical rules-of-thumb. It is well known that cross-validation can be computationally expensive, especially in high dimensions. For this reason, several authors have proposed a variety of statistical rules-of-thumb for choosing \lambda. Recently, Wang et al. (2007) proposed a rule-of-thumb for uncensored lad regression with l1 penalty by appealing to a Bayesian argument (Tibshirani, 1996, sect. 5). Such an argument can be useful in a wide variety of settings and, hence, we summarize their strategy below. One can view a penalized likelihood estimator as a Bayesian estimator where each coefficient \beta_j has a double exponential prior with location zero and scale n\lambda_j. Then, the optimal \lambda_j is chosen to minimize the following negative posterior log-likelihood:

    -\ell_n(\beta) + \sum_{j=1}^d \{ n \lambda_j |\beta_j| - \log(n\lambda_j/2) \log(n) \}.   (12)

Wang et al. refer to the criterion in (12) as a BIC criterion; it leads to the optimal BIC tuning parameters \lambda_j = \log(n)/(n|\beta_j|). The optimal AIC parameters \lambda_j = 1/(n|\beta_j|) are derived similarly. Using our earlier notation, \lambda_j = \lambda \pi_j, which implies \lambda_{BIC} = n^{-1} \log(n) and \lambda_{AIC} = n^{-1}. For censored data, we do not expect these rules-of-thumb to be completely satisfactory because they ignore censoring; recall that, for a fixed sample size n, the information in the sample decreases as the censoring proportion increases. These parameter values may be reasonable starting points, with the understanding that they will be too small, in general, and hence result in models that are too complex. Alternatively, an ad-hoc adjustment for censoring could be as simple as dividing by the uncensored proportion, e.g. \lambda_{AIC} = (n\pi_U)^{-1} and \lambda_{BIC} = (n\pi_U)^{-1} \log n, with \pi_U = P(\delta = 1). We are currently investigating other strategies for selecting the regularization parameters in general penalized estimating functions that can be motivated along other lines of reasoning.
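The rules-of-thumb and the ad-hoc censoring adjustment amount to one line of arithmetic; a small sketch (function name ours):

```python
import numpy as np

def rule_of_thumb_lambdas(n, pi_u=1.0):
    """Rules-of-thumb of Section 3.4: lambda_AIC = 1/n and lambda_BIC = log(n)/n,
    with the ad-hoc censoring adjustment replacing n by n * pi_u in the
    denominator, where pi_u = P(delta = 1) is the uncensored proportion."""
    n_eff = n * pi_u
    return {"aic": 1.0 / n_eff, "bic": np.log(n) / n_eff}
```

For example, with n = 100 and no censoring, the AIC rule gives 0.01; with 25% censoring (pi_u = 0.75), both parameters grow by a factor of 4/3, reflecting the reduced information in the sample.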
4 Examples

4.1 Mayo Primary Biliary Cirrhosis Study

We consider the Mayo primary biliary cirrhosis (PBC) data (Fleming and Harrington, 1991, Appendix D.1). The data set contains information about the survival times and prognostic variables of 418 patients who were eligible to participate in a randomized study of the drug D-penicillamine. Of the 418 patients who met standard eligibility criteria, a total of 312 patients participated in the randomized portion of the study. The investigators used stepwise deletion to build a Cox proportional hazards model for the natural history of PBC (Dickson et al., 1989). Our summary of this data set is for descriptive purposes only and is intended to illustrate the continuous shrinkage and estimation feature of the proposed method. A more sophisticated analysis of the Mayo PBC data would include the longitudinal observations and perhaps some level of joint modeling; such a detailed analysis goes beyond the thesis of this paper. At the same time, the Mayo PBC data set seems a reasonable choice for illustrating the operating characteristics of estimators based on the AFT model, as Lin et al. (1993) have argued that the PH model does not fit the data well. Our summary of the Mayo PBC data consists of two analyses. First, we analyze a data set of 418 patients using the five predictors that define the natural history model: age, log(albumin), log(bilirubin), edema, and log(protime); see Fleming and Harrington (1991, Table 4.6.3). These five predictors have already been preselected to be highly correlated with survival (Dickson et al., 1989); hence, our goal is not variable selection per se but rather to contrast the differences between the Gehan and logrank coefficient estimates for each of lasso and alasso. Some authors (e.g.
Yuan and Lin, 2006) have suggested a connect-the-dots approximation to exact coefficient paths, whereby one repeats the optimization algorithm over a fine grid of regularization parameters, \{\lambda_1, ..., \lambda_M\}, and simply draws line segments between adjacent coefficient estimates. We calculated
[Figure 1 appears here, with four panels: (a) Gehan lasso, (b) logrank lasso, (c) Gehan alasso, (d) logrank alasso; each panel plots coefficient estimates against log lambda.]

Figure 1: Approximate coefficient paths for five independent variables in the natural history model using the Mayo primary biliary cirrhosis data. See text for which lines correspond to which independent variables.
this approximated coefficient path for the five predictors in the natural history model for each of four estimators (Gehan and logrank, each with lasso and alasso), using K = 5 steps in the logrank estimator. The results are summarized over the four displays in Figure 1. The same model, data set, and rank-based estimators were considered by Jin et al. (2003); however, no regularized estimation procedures were considered there. The unpenalized rank-based estimators correspond to \lambda = 0, that is, the left-hand side of each display in Figure 1. Each display in Figure 1 contains five lines, one for each independent variable: age (solid black line), albumin (dashed red line), bilirubin (dotted green line), edema (royal blue broken line), and protime (teal dashed line). One very insightful contribution of the coefficient paths in Figure 1 is that one can assess the relative importance of each variable compared to the remaining variables in the active set by observing which variable(s) are forced out as \lambda increases incrementally. For the Gehan estimator with lasso penalty, age is forced out first, followed by protime second and albumin third. For the logrank estimator with lasso penalty, protime is out first, followed by age second, then edema third and albumin fourth. For the alasso penalty, protime and albumin are forced out first for both the Gehan and logrank weight functions. Then, the Gehan path forces age out third and edema fourth, while the logrank path forces age and edema out in the reverse order. In other words, each combination of weights (\phi and \pi_j) affects the coefficient paths in subtle ways. For Gehan weight functions, it is well known that the coefficient estimates depend on the censoring distribution, and this fact is likely contributing to some of the observed differences between the Gehan and logrank coefficient estimates. Our second analysis considers ten predictors for the smaller cohort of patients (n = 312) from the randomized study.
The ten predictors are age, log(albumin), log(alkaline phosphatase), ascites, log(bilirubin), edema, hepatomegaly, log(protime), sex, and spiders. The original study investigators used stepwise deletion and the PH model on this data set to construct a natural
[Figure 2 appears here, with four panels: (a) Gehan lasso, (b) logrank lasso, (c) Gehan alasso, (d) logrank alasso; each panel plots coefficient estimates against log lambda.]

Figure 2: Approximate coefficient paths for ten independent variables in the Mayo primary biliary cirrhosis data.
[Table 1 about here: rows Age, Albumin, Alk. Phos., Ascites, Bilirubin, Edema, Hepatomegaly, Prothrombin, Sex, Spiders; columns lasso and alasso under each of the Gehan and logrank weight functions; the numeric entries did not survive extraction.]

Table 1: Order statistics for stage-wise variable selection procedures on the Mayo primary biliary cirrhosis data. Table entries refer to the order in which variables enter the rank-based, semiparametric linear model for censored data whose coefficient paths are given in Figure 2.

history model for PBC. Johnson (2005) analyzed the same data but used a simulated annealing algorithm to estimate the regression coefficients. Moreover, Johnson did not consider the alasso estimator, although it shares the same limiting distribution as the weighted logrank statistic with non-concave penalty (Fan and Li, 2001). The results of our second analysis are displayed in Figure 2.

Viewed from the opposite direction, the coefficient paths in Figure 2 may be regarded as approximate stage-wise forward selection procedures. In other words, we start with the null model for sufficiently large λ, and variables then enter the active set A(λ) one at a time as λ decreases incrementally. While our method is not a forward selection scheme in the sense of Efron et al. (2004), it may be construed as one in a loose sense (Yuan and Lin, 2006). The order in which variables enter the active set is displayed in Table 1. With the exception of the strongest predictor (bilirubin) and the weakest predictor (alk. phos.), every other variable enters the active set at a different point in the selection procedure depending on the combination of weight function φ and adaptive weights π_j. We note that the first five variables
generally correspond to the natural history model (age, albumin, bilirubin, edema, and protime), with the exception of Gehan lasso, where ascites enters the active set just before age. Of the remaining five independent variables, ascites is the next most important. The Gehan estimators then have spiders entering seventh, while logrank places spiders farther down the sequence. Regardless of the chosen combination of weights (φ and π_j), the coefficient paths can be insightful even after one has selected an optimal λ.

4.2 Simulation Study

To explore the operating characteristics of the proposed methods, we simulated 100 data sets of size n from the model

y_i = x_i'β_0 + σ ε_i, i = 1, ..., n,

where β_0 = (3, 1.5, 0, 0, 2, 0, 0, 0)', the errors ε_i are independent standard normal, and the covariates x_i are standard normal with correlation 0.5^|j−k| between the jth and kth components of x. This model was considered by Tibshirani (1996) and Fan and Li (2001). We set the censoring distribution to be uniform(0, τ), where τ was chosen to yield approximately 25% censoring. We compared the model error,

ME = (β̂_φ − β_0)' E(x x') (β̂_φ − β_0),

of the proposed penalized estimator to that of the original rank-based estimator using the ratio of median model errors (RMME). We also compared the average numbers of regression coefficients that are correctly or incorrectly shrunk to 0, that is, coefficients with |β̂_φ,j| below a small numerical tolerance. The results are presented in Table 2, where oracle pertains to the situation in which we know a priori which coefficients are non-zero. Deficiencies of the lasso are well known by now, and several authors have already compared alasso to lasso with uncensored data (e.g. Zou, 2006; Wang et al., 2007), with censored data in the PH model (Zhang and Lu, 2007), and in other missing data problems (Johnson et al., 2007). In Table 2,
Table 2: Simulation results on model selection with censored data. Table entries are the ratio of median model errors (RMME) and the average numbers of correct (C) and incorrect (I) zeros for each combination of lasso (L) or adaptive lasso (AL) penalty with AIC- or BIC-based tuning, under the Gehan and logrank weight functions.

[Table 2: rows L(AIC), L(BIC), AL(AIC), AL(BIC), and Oracle within each of four configurations — (n = 50, σ = 3, 20% censoring), (n = 50, σ = 1, 20% censoring), (n = 75, σ = 3, 40% censoring), and (n = 75, σ = 1, 40% censoring) — with columns RMME (%), C, and I under each of Gehan and logrank; the numeric entries did not survive extraction.]
we confirm that the method performs as advertised across different sample sizes and censoring distributions, and we illustrate the differences between Gehan and logrank estimates. We note that Johnson (2005) did not present any simulation results for the penalized logrank regression coefficients. We first comment that table entries for RMME are not directly comparable across weight functions (Gehan and logrank) because each is normalized by the median model error (MME) of the corresponding full-model estimate, the Gehan full-model MME and the logrank full-model MME, respectively. Having said this, penalized estimates for both the Gehan and logrank estimators have smaller model error than the full-model estimates, as one would hope. Furthermore, the advantage of using adaptive weights is more pronounced in models with large n, small error variance σ, and a few strong predictors. This result is consistent with the literature and can be seen in Table 2 for both the Gehan and logrank estimators. Finally, we note that AIC-selected models tend to be too complex, especially when compared with the BIC-selected models. Again, this result is consistent with what has already been reported in the literature; however, this is the first paper to report such results for the rank-based AFT model.

5 Remarks

This paper describes l1-regularized estimation in the rank-based accelerated failure time (AFT) model. The estimator is defined as a consistent solution to a penalized estimating function, where the estimating function satisfies modest regularity conditions. Furthermore, the operating characteristics of the proposed methods, such as root-n consistency and an oracle property, have been established elsewhere (Johnson, 2005; Johnson et al., 2007). Unlike earlier methods for variable selection in the rank-based AFT model, the proposed method optimizes a convex loss function and can be executed easily using quantreg in R by extending the
results of Jin et al. (2003). On the other hand, the weighted logrank estimating function with non-concave penalty will never correspond to the minimizer of a convex objective function, for any weight function φ. Interestingly, the numerical trick used in this paper also offers an elegant solution to the LAD-lasso for censored data (Huang et al., 2007).

When one considers statistical inference for censored data, the proportional hazards (PH) model is the most popular, and regularized variable selection in this model follows naturally from penalized likelihood theory and methods. In statistics, it is natural to want more than a single method for any given problem, as each method has built-in assumptions. For example, if the data do not support the PH assumption, the AFT model offers investigators a viable alternative. For this reason, statisticians have worked hard to overcome the many theoretical and computational challenges inherent in this model. This paper illustrates how one can extend current methods for ordinary statistical inference in the unpenalized rank-based AFT model to variable selection in the rank-based AFT model with an l1 penalty. The regression coefficients in the AFT model have a simple, direct interpretation, whereas coefficients in the PH model are interpreted on a relative-hazard scale. The latter interpretation may be awkward outside survival and lifetime regression analyses.
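The convexity just described makes the l1-penalized Gehan criterion a linear program: with e_i(β) = log y_i − x_i'β, the Gehan loss Σ_i Σ_j δ_i {e_i(β) − e_j(β)}^− plus λ Σ_k |β_k| becomes linear in the decision variables after introducing one slack u_ij per comparable pair and writing β = β⁺ − β⁻. The sketch below is a minimal illustration on simulated data with an arbitrary λ, solved with scipy's generic LP solver; it is not the paper's quantreg-based R implementation.

```python
import numpy as np
from scipy.optimize import linprog

# Simulated AFT data (illustrative only): log failure times plus an
# arbitrary censoring mechanism.
rng = np.random.default_rng(1)
n, p = 40, 3
X = rng.standard_normal((n, p))
beta_true = np.array([1.0, -1.0, 0.0])
logt = X @ beta_true + 0.5 * rng.standard_normal(n)   # log failure times
logc = X @ beta_true + rng.uniform(0.0, 2.0, n)       # hypothetical censoring
logy = np.minimum(logt, logc)
delta = (logt <= logc).astype(int)                    # event indicator

lam = 1.0  # l1 penalty level (illustrative; the path re-solves over a grid)

# LP: minimize sum of slacks u_ij + lam * ||beta||_1, where
#   u_ij >= (logy_j - logy_i) - (x_j - x_i)' beta,  u_ij >= 0,
# over pairs with delta_i = 1, and beta = bp - bm with bp, bm >= 0.
pairs = [(i, j) for i in range(n) if delta[i] == 1 for j in range(n) if j != i]
m = len(pairs)
# Variable order: [u (m slacks), bp (p), bm (p)].
c = np.concatenate([np.ones(m), lam * np.ones(2 * p)])
A = np.zeros((m, m + 2 * p))
b = np.zeros(m)
for r, (i, j) in enumerate(pairs):
    dx = X[j] - X[i]
    A[r, r] = -1.0           # -u_ij
    A[r, m:m + p] = -dx      # -(x_j - x_i)' bp
    A[r, m + p:] = dx        # +(x_j - x_i)' bm
    b[r] = -(logy[j] - logy[i])
res = linprog(c, A_ub=A, b_ub=b,
              bounds=[(0, None)] * (m + 2 * p), method="highs")
beta_hat = res.x[m:m + p] - res.x[m + p:]
```

Re-solving this LP over a grid of λ values traces the kind of approximate coefficient path shown in Section 4; Jin et al. (2003) obtain the unpenalized version of the same objective through a weighted l1 (quantile) regression, which is why quantreg applies.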
More informationUNIVERSITY OF CALIFORNIA, SAN DIEGO
UNIVERSITY OF CALIFORNIA, SAN DIEGO Estimation of the primary hazard ratio in the presence of a secondary covariate with non-proportional hazards An undergraduate honors thesis submitted to the Department
More information1 The problem of survival analysis
1 The problem of survival analysis Survival analysis concerns analyzing the time to the occurrence of an event. For instance, we have a dataset in which the times are 1, 5, 9, 20, and 22. Perhaps those
More informationPrediction & Feature Selection in GLM
Tarigan Statistical Consulting & Coaching statistical-coaching.ch Doctoral Program in Computer Science of the Universities of Fribourg, Geneva, Lausanne, Neuchâtel, Bern and the EPFL Hands-on Data Analysis
More informationModel Selection. Frank Wood. December 10, 2009
Model Selection Frank Wood December 10, 2009 Standard Linear Regression Recipe Identify the explanatory variables Decide the functional forms in which the explanatory variables can enter the model Decide
More informationRobust Variable Selection Through MAVE
Robust Variable Selection Through MAVE Weixin Yao and Qin Wang Abstract Dimension reduction and variable selection play important roles in high dimensional data analysis. Wang and Yin (2008) proposed sparse
More information