Variable selection via generalized SELO-penalized linear regression models
Appl. Math. J. Chinese Univ. 2018, 33(2)

SHI Yue-yong 1,3   CAO Yong-xiu 2   YU Ji-chang 2   JIAO Yu-ling 2

Abstract. The seamless-L0 (SELO) penalty is a smooth function on [0, ∞) that very closely resembles the L0 penalty, and it has been demonstrated, both theoretically and practically, to be effective in nonconvex penalization for variable selection. In this paper, we first generalize SELO to a class of penalties retaining the good features of SELO, and then propose variable selection and estimation in linear models via the proposed generalized SELO (GSELO) penalized least squares (PLS) approach. We show that the GSELO-PLS procedure possesses the oracle property and consistently selects the true model under some regularity conditions in the presence of a diverging number of variables. The entire path of GSELO-PLS estimates can be efficiently computed through a smoothing quasi-Newton (SQN) method. A modified BIC coupled with a continuation strategy is developed to select the optimal tuning parameter. Simulation studies and an analysis of clinical data are carried out to evaluate the finite-sample performance of the proposed method. In addition, numerical experiments involving simulation studies and an analysis of microarray data are also conducted for GSELO-PLS in high-dimensional settings.

MR Subject Classification: 62F12, 62J05, 62J07.

1 Introduction

Consider the linear regression model

y = Xβ* + ϵ,   (1)

where y = (y_1, y_2, ..., y_n)^T ∈ R^n is a response vector, X = (x_ij) ∈ R^{n×d} is a design matrix, β* = (β*_1, β*_2, ..., β*_d)^T ∈ R^d is a vector of underlying regression coefficients, and ϵ = (ϵ_1, ϵ_2, ..., ϵ_n)^T ∈ R^n is a vector of random errors. We assume without loss of generality that y is centered and the columns of X are centered and √n-normalized, i.e., Σ_{i=1}^n y_i = 0, Σ_{i=1}^n x_ij = 0 and n^{-1} Σ_{i=1}^n x_ij^2 = 1. We also assume that β* is sparse in the sense that only
Keywords: continuation, coordinate descent, BIC, LLA, oracle property, SELO, smoothing quasi-Newton.
Digital Object Identifier (DOI):
Supported by the National Natural Science Foundation of China ( , , , ) and the Fundamental Research Funds for the Central Universities (CUGW150809).
Corresponding author.
a relatively small portion of the components of β* are nonzero, and our goal is to reconstruct the unknown vector β*. Let A = {j; β*_j ≠ 0} be the true model and suppose that s = |A| is the size of the true model (i.e., the sparsity level of β*), where |A| denotes the cardinality of A. To achieve sparsity in linear models, the penalization (or regularization) method, which optimizes a loss function term plus a penalty function term, has been widely used in the literature (cf., e.g., [27, 8, 10, 31-32, 29]). In this paper, we consider the following so-called SELO-penalized least squares (PLS) problem:

β̂ := β̂(λ, τ) = argmin_{β ∈ R^d} { Q_n(β) = (1/2n) ‖y − Xβ‖² + Σ_{j=1}^d p_{λ,τ}(β_j) },   (2)

where ‖·‖ denotes the L2 norm on Euclidean space and

p_{λ,τ}(β_j) = (λ / log 2) log( |β_j| / (|β_j| + τ) + 1 )

is the SELO penalty proposed by Dicker et al. [7]. Here λ and τ are two positive tuning (or regularization) parameters. In particular, λ is the sparsity tuning parameter yielding sparse solutions, and τ is the shape (or concavity) tuning parameter that makes SELO approach L0 as τ → 0+, where L0 admits p_λ(β_j) = λ I(β_j ≠ 0). The estimator β̂ = β̂(λ, τ) in (2) is called a SELO-PLS (SPLS) estimator. L0 regularization [9] directly penalizes the number of variables in the regression model, so it enjoys a nice interpretation as best subset selection, but it is not continuous at 0 and is computationally infeasible when d is moderately large. SELO is a good surrogate for L0 since it can explicitly mimic L0 via small τ values, and it is more stable than L0 due to the continuity of its penalty function. Figure 1 depicts SELO penalties for a few values of τ while fixing λ = 1.

Figure 1: Plot of SELO penalty functions. τ = 0 (L0, thick solid), τ = 0.1 (dotdash), τ = 1 (dashed), τ = 10 (dotted), and τ = (thin solid).

Since the introduction of SELO for linear models (LM) [7], the methodology has been extended to generalized linear models (GLM) [16], multivariate panel count data [30]
and quantile regression [6], among others. Under LM, Dicker et al. [7] show that SELO-LM estimators enjoy the oracle property [8] when both d and n tend to infinity with d/n → 0, and outperform other penalized estimators by various metrics in numerical simulations. They propose a SELO-LM-BIC procedure to select the tuning parameters and show that it is consistent for model selection. Under GLM, Li et al. [16] show that the SELO-GLM procedure enjoys the oracle property when both d and n tend to infinity with d^5/n → 0. They also establish model selection consistency results via a SELO-GLM-BIC procedure. It is noteworthy that both SELO-LM and SELO-GLM estimators can be efficiently calculated by coordinate descent (CD) algorithms. Zhang et al. [30] develop a SELO-penalized estimating equation approach for the regression analysis of multivariate panel count data, with a focus on variable selection and estimation of significant covariate effects, where the dimensionality d is assumed to be fixed. They use a BIC procedure to select tuning parameters and apply the classical Newton-Raphson algorithm in their numerical experiments. Ciuperca [6] introduces and studies the SELO quantile estimator in a linear model when both d and n tend to infinity with d/n → 0, and derives the convergence rate, oracle properties and BIC model selection consistency of the corresponding estimators. In this paper, we propose a generalized SELO (GSELO) penalized method for variable selection and parameter estimation in linear models. First, we generalize the SELO penalty to a class of penalties (i.e., GSELO penalties) closely resembling the L0 penalty and retaining the good features of SELO. Second, based on the proposed GSELO penalties, we develop the GSELO-PLS procedure for variable selection and parameter estimation in linear models.
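The SELO penalty at the heart of this construction is simple to evaluate. The following Python sketch (an illustration we add; the paper's own code is written in Matlab and R) shows that the penalty vanishes at the origin and approaches the constant λ for any nonzero coefficient once τ is small, which is exactly the L0-mimicking behavior described above:

```python
import math

def selo(beta, lam=1.0, tau=0.01):
    """SELO penalty p_{lam,tau}(beta) = (lam / log 2) * log(|beta|/(|beta|+tau) + 1)."""
    b = abs(beta)
    return (lam / math.log(2.0)) * math.log(b / (b + tau) + 1.0)

# As tau -> 0+, SELO mimics the L0 penalty lam * I(beta != 0):
print(selo(0.0))             # exactly 0 at the origin, for any tau
print(selo(2.0, tau=1e-6))   # approximately lam = 1 for a large coefficient
print(selo(0.05, tau=1e-6))  # ... and also for a small nonzero coefficient
```

For larger τ the penalty flattens out, which is the stabilizing effect visible in Figure 1.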
We establish consistency and asymptotic normality for GSELO-PLS and show that it performs as well as an oracle estimator when both d and n tend to infinity with d/n → 0. Third, we implement a smoothing quasi-Newton (SQN) method with a backtracking line search, which has a superlinear convergence rate, is insensitive to the choice of initial values, and, compared with the modified Newton-Raphson algorithm [8], avoids computing the sequence of inverses of the Hessian matrix, to calculate the proposed GSELO-PLS estimates. In particular, we couple our algorithm with a continuation strategy on the regularization parameter: given a decreasing sequence of parameters {λ_g}_g, we apply the algorithm to solve the λ_{g+1}-problem with the initial guess taken from the solution of the λ_g-problem. The idea of continuation is well established for iterative algorithms, for the purposes of warm starting and globalizing convergence. We adopt a modified BIC (MBIC) to select a suitable tuning parameter during the continuation process. Finally, we conduct numerical experiments to evaluate the performance of GSELO-PLS in high dimensions. To deal with the high-dimensional setting, we first employ a local linear approximation (LLA) [33] to the nonconvex GSELO penalties and then resort to an existing Gauss-Seidel type coordinate descent algorithm in [2] to obtain the solution path. Numerically, when coupled with the continuation strategy and a high-dimensional BIC (HBIC), the overall GSELO-PLS-HBIC procedure for high-dimensional data is very efficient. The remainder of this paper is organized as follows. In Section 2, we first describe our proposed GSELO method and then establish asymptotic theoretical results for the GSELO-PLS
procedure. In Section 3, we present the algorithm for computing the GSELO-PLS estimator, the standard error formulae for the estimated coefficients, and a modified BIC coupled with a continuation strategy to select the optimal tuning parameter. Simulation studies are conducted in Section 4 to evaluate the finite-sample performance of the proposed method, which is further illustrated with real clinical data. In Section 5, we numerically study the behavior of the proposed GSELO-PLS estimators in high dimensions, including the computational issues, the choice of the tuning parameter, simulation studies and the analysis of microarray data. We conclude the paper in Section 6. Proofs of the theorems are provided in the Appendix.

2 Generalized SELO-penalized linear regression models

2.1 Methodology

Let P denote the class of all GSELO penalties, and let f be an arbitrary function satisfying the following two hypotheses:

(H1) f(x) is continuous w.r.t. x and has first and second derivatives on [0, 1];

(H2) f′(x) ≥ 0 on the interval [0, 1] and lim_{x→0} f(x)/x = 1.

Then a GSELO penalty p_{λ,τ}(·) ∈ P is given by

p_{λ,τ}(β_j) = (λ / f(1)) f( |β_j| / (|β_j| + τ) ),

where λ (sparsity) and τ (concavity) are two positive tuning parameters.

Remark 1. (H1) is needed to guarantee the continuity of the penalty functions, and (H2) is used to make the penalties in P resemble L0. Obviously, SELO is a member of P: simply take f(x) = log(x + 1). Table 1 lists some representatives of P, and Figure 2 displays them with τ = 1 and τ = 0.01, respectively.

Table 1: Representatives of GSELO

Name | Type of function | f(x) | p_{λ,τ}(β_j)
LIN | linear | x | λ |β_j| / (|β_j| + τ)
SELO | logarithmic | log(x + 1) | (λ / log 2) log( |β_j| / (|β_j| + τ) + 1 )
EXP | exponential | 1 − exp(−x) | (λ / (1 − exp(−1))) [1 − exp( −|β_j| / (|β_j| + τ) )]
SIN | trigonometric | sin(x) | (λ / sin 1) sin( |β_j| / (|β_j| + τ) )
ATN | inverse trigonometric | arctan(x) | (λ / arctan 1) arctan( |β_j| / (|β_j| + τ) )
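The hypotheses and the entries of Table 1 can be checked numerically. The following Python sketch (our illustration, not the authors' code) implements the five generator functions f and the generic GSELO penalty, and verifies both the normalization lim_{x→0} f(x)/x = 1 and the L0-mimicking behavior for small τ:

```python
import math

# Generator functions f from Table 1; each satisfies f(x)/x -> 1 as x -> 0+.
GENERATORS = {
    "LIN":  lambda x: x,
    "SELO": lambda x: math.log(x + 1.0),
    "EXP":  lambda x: 1.0 - math.exp(-x),
    "SIN":  math.sin,
    "ATN":  math.atan,
}

def gselo(beta, f, lam=1.0, tau=0.01):
    """Generic GSELO penalty p_{lam,tau}(beta) = (lam / f(1)) * f(|beta|/(|beta|+tau))."""
    b = abs(beta)
    return (lam / f(1.0)) * f(b / (b + tau))

for name, f in GENERATORS.items():
    assert abs(f(1e-8) / 1e-8 - 1.0) < 1e-6           # (H2): f(x)/x -> 1 near 0
    assert gselo(0.0, f) == 0.0                       # no penalty at the origin
    assert abs(gselo(3.0, f, tau=1e-6) - 1.0) < 1e-4  # ~ lam for beta != 0, small tau
```

The normalization by f(1) makes all five penalties attain the same maximum value λ, so they are directly comparable along a common tuning grid.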
Remark 2. It is noteworthy that LIN is actually the transformed L1 penalty studied by Nikolova [21], which inspired Lv and Fan [18] to propose the SICA approach for sparse recovery and model selection.

Figure 2: GSELO penalty functions p_{λ,τ}(β_j). Left panel: λ = 1, τ = 1; right panel: λ = 1, τ = 0.01. L0 (thick solid), L1 (solid), LIN (dashed), SELO (dotted), EXP (dotdash), SIN (longdash) and ATN (twodash).

Based on the proposed GSELO penalties, the corresponding GSELO-PLS estimator is given by

β̂ := β̂(λ, τ) = argmin_{β ∈ R^d} { Q_n(β) = (1/2n) ‖y − Xβ‖² + Σ_{j=1}^d p_{λ,τ}(β_j) },   (3)

where p_{λ,τ}(·) ∈ P.

2.2 Theoretical results

We establish theoretical results for the GSELO-PLS estimator under the following regularity conditions.

(C1) n → ∞ and dσ²/n → 0.

(C2) There exist positive constants r, R ∈ R such that r < γ_min(n⁻¹XᵀX) ≤ γ_max(n⁻¹XᵀX) < R, where γ_min(n⁻¹XᵀX) and γ_max(n⁻¹XᵀX) are the smallest and largest eigenvalues of n⁻¹XᵀX, respectively.

(C3) τ = O(√(σ²/(dn))) and λτ [n/(dσ²)]^{3/2} → ∞.

(C4) ρ √(n/(dσ²)) → ∞ and λ/ρ² → 0, where ρ = min_{j∈A} |β*_j|.
6 150 Appl. Math. J. Chinese Univ. Vol. 33, No. 2 (C5) lim max d n 1 i n j=1 x2 ij = 0. (C6) E( ϵ i /σ 2+δ ) < M for some δ > 0 and M <. Remark 3. Conditions (C1)-(C6) coincide with the conditions in [7]. Please see more details therein. Theorem 1 (Existence of GSELO-PLS estimator). Under hypotheses (H1)-(H2) and conditions (C1)-(C6), then, with probability tending to one, there exists a local minimizer ˆβ of Q n (β), defined in (3), such that ˆβ β = O p ( dσ 2 /n), where denotes the Euclidean norm of a vector. Theorem 2 (Oracle property). Under hypotheses (H1)-(H2) and conditions (C1)-(C6), then, with probability tending to 1, the n/(dσ 2 )-consistent local minimizer ˆβ in Theorem 1 must be such that (i) lim P ({j; ˆβj 0} = A) = 1. n (ii) nb n (n 1 X T A X A/σ 2 ) 1/2 ( ˆβ A β A ) N(0, G) in distribution, where B n is an arbitrary q A matrix such that B n B T n G. To save space, we only state the main results here and relegate the proofs to the Appendix. Interested readers can refer to [10,7] for more details. 3.1 Algorithm 3 Computation Dicker et al. [7] use a coordinate descent (CD) algorithm procedure, which amounts to finding the roots of certain cubic equations, for obtaining SELO estimates. However, among the GSELO penalties taken into consideration in this paper (i.e., LIN, SELO, EXP, SIN and ATN in Table 1), only LIN and SELO can be implemented using the CD algorithm in [7]. To illustrate this point, we consider the one-dimensional PLS problem { ˆβ = arg min Q(β) = 1 } β R 2 (β β 0) 2 + p λ,τ (β), (4) where β 0 R is a constant and p λ,τ (β) is a penalty in Table 1. CD procedures ask for finding the nonzero stationary points (or critical points) of the objective function Q(β). Direct computation of Q (β) = 0 gives (LIN) β β 0 + λ τ sgn(β) ( β + τ) 2 = 0, (SELO) β β 0 + λ τ sgn(β) log(2) (2 β + τ)( β + τ) = 0 or (SIN) β β 0 + λ ( ) β τ sgn(β) sin(1) cos β + τ ( β + τ) 2 = 0.
It follows that the stationarity conditions for LIN and SELO can be transformed into cubic equations, while the one for SIN cannot, and EXP and ATN behave like SIN in this respect. Thus, for the sake of uniformity of computation, we use the smoothing quasi-Newton (SQN) method [22, 19, 24] to optimize Q_n(β) in (3). Since the GSELO penalty functions are singular at the origin, we first smooth them by replacing |β_j| with √(β_j² + ε), where ε is a small positive quantity; note that √(β_j² + ε) → |β_j| as ε → 0. Then, instead of (3), we solve

β̂ = β̂(λ, τ, ε) = argmin_{β ∈ R^d} { Q_n^ε(β) = (1/2n) ‖y − Xβ‖² + Σ_{j=1}^d p_{λ,τ,ε}(β_j) }   (5)

using the DFP quasi-Newton method with a backtracking line search, where p_{λ,τ,ε}(β_j) = p_{λ,τ}(√(β_j² + ε)). In practice, taking ε = 0.01 gives good results. We summarize the SQN-DFP procedure in Algorithm 1. More theoretical results on smoothing methods for nonsmooth and nonconvex minimization can be found in [4, 5].

Remark 4. Like the local quadratic approximation (LQA) algorithm of Fan and Li [8], the sequence β^k obtained from SQN-DFP may not be sparse for any fixed k, and hence is not directly suitable for variable selection. In practice, we set β_j^k = 0 if |β_j^k| < ε_0 for some sufficiently small tolerance level ε_0.

Algorithm 1 SQN-DFP
Input: initial values β^0 ∈ R^d and H_0 = I_d (the d × d identity matrix); line search parameters ρ ∈ (0, 1) and η ∈ (0, 1/2); stopping tolerance δ.
1: for k = 0, 1, 2, ..., k_max do
2:   compute g_k = ∇Q_n^ε(β^k),
3:   if ‖g_k‖ ≤ δ then
4:     stop and output β^k as the estimate of β in (5),
5:   else
6:     compute the direction d_k = −H_k g_k.
7:   end if
8:   for m = 0, 1, 2, ..., m_max do
9:     compute β_m^k = β^k + ρ^m d_k,
10:    if Q_n^ε(β_m^k) ≤ Q_n^ε(β^k) + η ρ^m g_kᵀ d_k then
11:      stop and output α_k = ρ^m.
12:    end if
13:  end for
14:  compute β^{k+1} = β^k + α_k d_k, g_{k+1} = ∇Q_n^ε(β^{k+1}), Δg_k = g_{k+1} − g_k, Δβ_k = β^{k+1} − β^k.
15:  if (Δβ_k)ᵀ Δg_k ≤ 0 then
16:    H_{k+1} = H_k;
17:  else
18:    H_{k+1} = H_k − (H_k Δg_k Δg_kᵀ H_k) / (Δg_kᵀ H_k Δg_k) + (Δβ_k Δβ_kᵀ) / ((Δβ_k)ᵀ Δg_k).
19:  end if
20: end for
Output: β̂, the estimate of β in equation (5).
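As a concrete illustration of Algorithm 1, the following Python/NumPy sketch (our re-implementation under the stated defaults ε = 0.01, ρ = 0.55, η = 0.4; the authors' code is in Matlab) runs DFP with a backtracking Armijo line search on the smoothed objective Q_n^ε, here with the SELO generator:

```python
import numpy as np

def sqn_dfp(X, y, lam, tau=0.01, eps=0.01, rho=0.55, eta=0.4,
            k_max=50, m_max=20, delta=1e-6):
    """DFP quasi-Newton with backtracking line search on the smoothed
    SELO-PLS objective Q_n^eps (a sketch of Algorithm 1)."""
    n, d = X.shape

    def Q(b):
        t = np.sqrt(b**2 + eps)  # smoothed |b_j|
        pen = (lam / np.log(2.0)) * np.log(t / (t + tau) + 1.0)
        return 0.5 / n * np.sum((y - X @ b)**2) + pen.sum()

    def grad(b):
        t = np.sqrt(b**2 + eps)
        dp_dt = (lam / np.log(2.0)) * tau / ((2.0 * t + tau) * (t + tau))
        return -X.T @ (y - X @ b) / n + dp_dt * b / t  # chain rule: dt/db = b/t

    beta, H = np.zeros(d), np.eye(d)
    g = grad(beta)
    for _ in range(k_max):
        if np.linalg.norm(g) <= delta:
            break
        dk = -H @ g
        alpha = rho**m_max
        for m in range(m_max + 1):            # backtracking (Armijo) search
            if Q(beta + rho**m * dk) <= Q(beta) + eta * rho**m * (g @ dk):
                alpha = rho**m
                break
        beta_new = beta + alpha * dk
        g_new = grad(beta_new)
        s, yk = beta_new - beta, g_new - g
        if s @ yk > 0:                        # DFP update; skip if curvature fails
            H += np.outer(s, s) / (s @ yk) - np.outer(H @ yk, H @ yk) / (yk @ H @ yk)
        beta, g = beta_new, g_new
    return beta
```

As in Remark 4, the returned β̂ is then hard-thresholded at a small ε_0 to obtain the sparse estimate.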
3.2 Covariance estimation

Following [7], we estimate the covariance matrix (i.e., the standard errors) of β̂ by the sandwich formula

ĉov(β̂_Â) = σ̂² {X_Âᵀ X_Â + n Σ_{Â,Â}^ε(β̂)}⁻¹ X_Âᵀ X_Â {X_Âᵀ X_Â + n Σ_{Â,Â}^ε(β̂)}⁻¹,   (6)

where σ̂² = (n − ŝ)⁻¹ ‖y − Xβ̂‖², ŝ = |Â|, Â = {j; β̂_j ≠ 0} and

Σ^ε(β) = diag{ p′_{λ,τ,ε}(β_1)/|β_1|, ..., p′_{λ,τ,ε}(β_d)/|β_d| }.   (7)

For variables with β̂_j = 0, the estimated standard errors are 0.

3.3 Tuning parameter selection

As suggested in [7], we fix τ = 0.01 and use a modified BIC (MBIC) procedure to tune λ via

λ̂ = argmin_λ { MBIC(λ) = log(σ̂²) + (k_n/n) ŝ },   (8)

where β̂ = β̂(λ, τ), σ̂² = (n − ŝ)⁻¹ ‖y − Xβ̂‖², ŝ = |Â|, and k_n is a positive number that depends on the sample size n and grows like log(n); in our numerical experiments, we set k_n = log(n). Since solving (5) is a nonconvex optimization problem, we couple SQN-DFP with a continuation strategy on the tuning parameter for efficient computation. To be precise, one chooses a starting value λ_0 for the parameter λ and a decreasing factor µ ∈ (0, 1) to obtain a decreasing sequence {λ_g}_g with λ_g = λ_0 µ^g, and then runs Algorithm 1 to solve the λ_{g+1}-problem initialized with the solution of the λ_g-problem. Summarizing this idea leads to Algorithm 2; see [14] and the references therein for more details. In practice, we use λ_0 = λ_max, where λ_max is an initial guess of λ, supposedly large enough to shrink all β_j's to zero, set λ_min = 10⁻⁵ λ_max, and then divide the interval [λ_min, λ_max] into G (the number of grid points) equally spaced subintervals on the logarithmic scale. Numerically, µ is determined by G: a larger G implies a larger decreasing factor µ. For sufficient resolution of the solution path, one usually takes G ≥ 50 (e.g., G = 100 or 200). Implementing Algorithm 1 for each value of τ and for the sequence λ_max = λ_0 > λ_1 > ... > λ_G = λ_min gives the entire GSELO-PLS solution path.
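The λ-grid and the MBIC criterion can be written compactly. The sketch below (ours, using k_n = log n and the 10⁻⁵ ratio from the text) builds the logarithmically spaced continuation sequence and scores a fitted coefficient vector:

```python
import numpy as np

def lambda_grid(lam_max, G=100, ratio=1e-5):
    """Continuation sequence lam_g = lam_max * mu**g, g = 0..G: G+1 values
    equally spaced on the log scale between lam_max and lam_min = ratio*lam_max."""
    return np.geomspace(lam_max, ratio * lam_max, G + 1)

def mbic(y, X, beta):
    """Modified BIC: log(sigma2_hat) + (k_n / n) * s_hat with k_n = log(n)."""
    n = len(y)
    s_hat = int(np.count_nonzero(beta))
    sigma2 = np.sum((y - X @ beta)**2) / (n - s_hat)
    return np.log(sigma2) + np.log(n) / n * s_hat
```

Running Algorithm 1 along `lambda_grid(...)`, warm-starting each solve from the previous one, and keeping the β̂ with the smallest `mbic` value reproduces the selection rule (8).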
Then we select the optimal λ from the candidate set Λ = {λ_1, λ_2, ..., λ_G} using MBIC (8).

Remark 5. Dicker et al. [7] show that the SELO-PLS-MBIC procedure consistently identifies the true model with a diverging number of parameters under some regularity conditions. It can be proved that the GSELO-PLS-MBIC procedure is also model selection consistent, by arguments similar to those in the proof of Theorem 2 of Dicker et al. [7]; the detailed proof is therefore omitted here.

Remark 6. In Algorithm 1, we set (ρ, η, m_max) = (0.55, 0.4, 20) following [19]. Thanks to the continuation strategy, the maximum number of outer iterations k_max need not be large in practice, so we set k_max = 50 in Algorithm 1. This procedure makes it possible
to substantially reduce the computational cost without noticeable loss of accuracy in the solutions.

Algorithm 2 Continuation strategy
Input: λ_0 and µ ∈ (0, 1). Let β(λ_0) = 0.
1: for g = 1, 2, 3, ..., G do
2:   Apply Algorithm 1 to problem (5) with λ_g = λ_0 µ^g, initialized with β^0 = β(λ_{g−1}).
3:   Compute the MBIC value.
4: end for
Output: Select λ̂ by (8).

4 Numerical experiments

4.1 Simulation studies

We present simulation studies to examine the finite-sample properties of GSELO-PLS-MBIC. All codes, available from the authors, are written in Matlab, and all experiments are performed in MATLAB R2010b on a quad-core laptop with an Intel Core i5 CPU (2.60 GHz) and 8 GB RAM running Windows 8.1 (64-bit). We simulate N = 1000 data sets from the linear model (1). β* is a d × 1 vector with β*_1 = 3, β*_2 = 1.5, β*_3 = 2 and the other β*_j's being 0; thus s = 3. The rows of the n × d matrix X are sampled as i.i.d. copies from N(0, Σ) with Σ = (0.5^{|j−k|}) for 1 ≤ j, k ≤ d. The components of the n × 1 vector ϵ are sampled from N(0, 1). In order to emphasize the dependence of the number of parameters on the sample size, we consider two sample sizes, n = 100 and n = 200, with d = ⌊n/(2 log n)⌋, where ⌊x⌋ denotes the integer part of x for x ≥ 0. To evaluate the variable selection performance of the proposed method, we consider the average model size N⁻¹ Σ_{s=1}^N |Â^(s)| (MS), the proportion of correct models N⁻¹ Σ_{s=1}^N I{Â^(s) = A} (CM), the average absolute error N⁻¹ Σ_{s=1}^N ‖β̂^(s) − β*‖ (AE), the average l2 relative error N⁻¹ Σ_{s=1}^N (‖β̂^(s) − β*‖_2 / ‖β*‖_2) (RE) and the average model error N⁻¹ Σ_{s=1}^N (β̂^(s) − β*)ᵀ Σ (β̂^(s) − β*) (ME). Simulation results for variable selection are summarized in Table 2. Since LIN, SELO, EXP, SIN and ATN all belong to the GSELO penalty family, it can be seen from Table 2 that the five penalties behave quite similarly to each other on all considered criteria.
With respect to MS, although all methods tend to slightly overestimate the true model size, they select the true model quite well, with reasonably small errors in terms of CM, AE, RE and ME. The results given in Table 3 are obtained under the same settings as in Table 2, but concern the estimation of the regression parameter β*. With respect to parameter estimation, Table 3 presents the average of the estimated nonzero coefficients (Mean), the average of the estimated standard errors (ESE) and the sample standard deviations (SSD). From Table 3, we can see that the Means are close to the corresponding true values, and the ESEs agree well with the SSDs, which indicates that the proposed covariance estimation formula is reasonable and reliable.
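For reference, the five criteria can be computed as follows (a Python sketch we add; the absolute error AE is computed here with the ℓ1 norm as one natural choice):

```python
import numpy as np

def selection_metrics(beta_hats, beta_star, Sigma):
    """MS, CM, AE, RE and ME averaged over N replications.
    beta_hats: (N, d) array of estimates; AE uses the l1 norm here."""
    A = set(np.flatnonzero(beta_star))      # true model
    N = len(beta_hats)
    out = {"MS": 0.0, "CM": 0.0, "AE": 0.0, "RE": 0.0, "ME": 0.0}
    for bh in beta_hats:
        diff = bh - beta_star
        out["MS"] += np.count_nonzero(bh)                       # model size
        out["CM"] += float(set(np.flatnonzero(bh)) == A)        # correct model?
        out["AE"] += np.sum(np.abs(diff))                       # absolute error
        out["RE"] += np.linalg.norm(diff) / np.linalg.norm(beta_star)
        out["ME"] += diff @ Sigma @ diff                        # model error
    return {k: v / N for k, v in out.items()}
```

Each quantity is accumulated per replication and then divided by N, matching the definitions above.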
Table 2: Simulation results for variable selection. d = ⌊n/(2 log n)⌋. Columns: MS, CM, AE, RE, ME; methods: LIN, SELO, EXP, SIN, ATN; settings: (n, d) = (100, 10) and (200, 18).

Table 3: Simulation results for parameter estimation. d = ⌊n/(2 log n)⌋. Columns: Mean, ESE, SSD for β*_1 = 3, β*_2 = 1.5 and β*_3 = 2; methods and settings as in Table 2.

4.2 Analysis of clinical data

We illustrate GSELO-PLS-MBIC with an analysis of a prostate cancer data set from [26]. This data set examines the correlation between the level of prostate-specific antigen and a number of clinical measures in 97 men with prostate cancer who were about to receive a radical prostatectomy. It has been analyzed in many texts on data mining (cf., e.g., [32, 13]) and is publicly available from the R package ElemStatLearn [13]. There are 97 observations (n = 97) and 9 variables (one quantitative response and d = 8 predictors) in the prostate cancer data set. The goal is to predict the response lpsa (log of prostate-specific antigen) from the predictors lcavol (log cancer volume), lweight (log prostate weight), age, lbph (log of benign prostatic hyperplasia amount), svi (seminal vesicle invasion), lcp (log of capsular penetration), gleason (Gleason score) and pgg45 (percent of Gleason scores 4 or 5). Many different model fitting and tuning parameter selection procedures have been applied to the prostate cancer data, which makes it challenging to choose among them, since the underlying true model is generally unknown in real data analyses. However, as previously stated, the minimizer of the L0 procedure (i.e., best subset selection) is the optimal solution and, when available, can be used as a gold standard for the evaluation of other approaches.
Hereafter, we regard the best subset model with the lowest BIC as the true model in order to assess the approaches. The five GSELO procedures (i.e., LIN, SELO, EXP, SIN and ATN) proposed in the previous sections are applied to the prostate cancer data. Additionally, the LS solution (computed by the R built-in function lm) and the LASSO solution (computed by the R function cv.glmnet with the lambda.1se rule and set.seed(0) from the R package glmnet [11]) are provided for comparison. The estimated regression parameters and the predictive mean squared errors (PMSE), calculated as n⁻¹ Σ_{i=1}^n (ŷ_i − y_i)², are provided in Table 4. One can see that the five GSELO penalties behave similarly, and they all select lcavol and lweight. In particular, SIN and ATN recover exactly the best subset selection result, which shows the good performance of the proposed GSELO-PLS-MBIC procedure.

Table 4: Analysis of the prostate cancer data set: estimated coefficients and PMSE of LS, Best Subset, LASSO, LIN, SELO, EXP, SIN and ATN for Intercept, lcavol, lweight, age, lbph, svi, lcp, gleason and pgg45; zero entries correspond to omitted variables.

5 High-dimensional case

In this section, we discuss how GSELO-PLS can be applied to high-dimensional data in which d > n. To solve (3) in high dimensions, we first apply the local linear approximation (LLA) [33] to p_{λ,τ}(·) ∈ P:

p_{λ,τ}(β_j) ≈ p_{λ,τ}(β_j^k) + p′_{λ,τ}(β_j^k)( |β_j| − |β_j^k| ),   (9)

where β_j^k is the kth estimate of β_j, j = 1, 2, ..., d, and p′_{λ,τ}(β_j) denotes the derivative of p_{λ,τ}(β_j) with respect to |β_j|. Given the current estimate β^k of β, we find the next estimate via

β^{k+1} = argmin_β { (1/2n) ‖y − Xβ‖² + Σ_{j=1}^d ω_j^{k+1} |β_j| },   (10)

where ω_j^{k+1} = p′_{λ,τ}(β_j^k). Then we use the Gauss-Seidel type coordinate descent (CD) algorithm of [2] to solve (10). We summarize the LLA-CD procedure in Algorithm 3.
For LLA-CD, we also couple it with the continuation strategy on the regularization parameter, in order to obtain accurate solutions.
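A minimal Python version of the inner coordinate-descent solve for (10) may look as follows (our sketch; it assumes the columns of X are standardized so that n⁻¹ x_jᵀ x_j = 1, under which the coordinate update β_j ← S(z_j, ω_j) is exact):

```python
import numpy as np

def soft(t, lam):
    """Soft-thresholding operator S(t, lam) = sgn(t) * (|t| - lam)_+."""
    return np.sign(t) * max(abs(t) - lam, 0.0)

def weighted_lasso_cd(X, y, weights, beta0, k_max=100, delta=1e-8):
    """Gauss-Seidel coordinate descent for one LLA step, problem (10):
    minimize (1/2n) ||y - X beta||^2 + sum_j weights[j] * |beta_j|."""
    n, d = X.shape
    beta = beta0.astype(float).copy()
    r = y - X @ beta                           # full residual, kept up to date
    for _ in range(k_max):
        beta_old = beta.copy()
        for j in range(d):
            z_j = X[:, j] @ r / n + beta[j]    # partial-residual coordinate
            b_j = soft(z_j, weights[j])
            r -= (b_j - beta[j]) * X[:, j]     # residual update in place
            beta[j] = b_j
        if np.linalg.norm(beta - beta_old) < delta:
            break
    return beta
```

In the full LLA-CD loop, `weights[j]` is refreshed as p′_{λ,τ}(|β_j^k|) between successive calls, and the whole procedure is wrapped in the continuation over λ.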
Algorithm 3 LLA-CD
Input: X ∈ R^{n×d}, y ∈ R^n, β^0 ∈ R^d, τ, λ, δ (tolerance) and k_max (the maximum number of iterations).
1: for k = 0, 1, 2, ... do
2:   while k < k_max do
3:     for j = 1, 2, ..., d do
4:       Calculate z_j = n⁻¹ x_jᵀ r_{−j} = n⁻¹ x_jᵀ r + β_j^k, where r = y − Xβ^k is the current residual, r_{−j} = y − X_{−j} β_{−j}^k is the partial residual of x_j, and the subscript −j refers to the portion that remains after the jth column (or element) is removed.
5:       Update β_j^{k+1} ← S(z_j, ω_j^{k+1}), where ω_j^{k+1} = p′_{λ,τ}(β_j^k) and S(t, λ) = sgn(t)(|t| − λ)_+ is the soft-thresholding operator.
6:       Update r ← r − (β_j^{k+1} − β_j^k) x_j.
7:     end for
8:     if ‖β^{k+1} − β^k‖ < δ then
9:       break with β̂ = β^{k+1}.
10:    else
11:      Update k ← k + 1.
12:    end if
13:  end while
14: end for
Output: β̂, the estimate of β in equation (10).

We use the sandwich formula (6) to estimate the covariance matrix of the LLA-CD estimate β̂, replacing Σ^ε(β) in (7) with Σ(β) = diag{ p′_{λ,τ}(|β_1|)/|β_1|, ..., p′_{λ,τ}(|β_d|)/|β_d| }. Since d is larger than n, the MBIC in (8) breaks down for the tuning of λ. Thus we adopt the high-dimensional BIC (HBIC) proposed by Wang et al. [28] to select the optimal tuning parameter λ̂ during the continuation process, which reads

λ̂ = argmin_{λ∈Λ} { HBIC(λ) = log( ‖y − Xβ̂(λ)‖² / n ) + (C_n log(d) / n) |M(λ)| },   (11)

where Λ is a subset of (0, +∞), M(λ) = {j : β̂_j(λ) ≠ 0}, |M(λ)| denotes the cardinality of M(λ), and C_n = log(log n).

5.1 Simulation studies in high dimensions

We illustrate the finite-sample properties of GSELO-PLS-HBIC in high dimensions with simulation studies. The implementation setting is the same as in Section 4.1, except for the two sample sizes n = 100 and n = 200 with d = ⌊n log(n)/2⌋. The results for variable selection and parameter estimation are reported in Table 5 and Table 6, respectively. It can be seen from the tables that the five GSELO penalties still perform reasonably well in terms of both variable selection and parameter estimation when d is larger than n.
Since the sparsity level of β* is fixed in our simulations (i.e., s = 3), better performance is associated with larger sample sizes.
Table 5: Simulation results for variable selection. d = ⌊n log(n)/2⌋. Columns: MS, CM, AE, RE, ME; methods: LIN, SELO, EXP, SIN, ATN; settings: (n, d) = (100, 230) and (200, 529).

Table 6: Simulation results for parameter estimation. d = ⌊n log(n)/2⌋. Columns: Mean, ESE, SSD for β*_1 = 3, β*_2 = 1.5 and β*_3 = 2; methods and settings as in Table 5.

5.2 Analysis of microarray data

We analyze the eyedata set, publicly available in the R package flare [15], to illustrate the application of GSELO-PLS-HBIC in high-dimensional settings. This data set contains gene expression data from the microarray experiments of mammalian eye tissue samples of [23]. The response variable y is a numeric vector of length 120 giving the expression level of gene TRIM32, which causes Bardet-Biedl syndrome (BBS). The design matrix X is a 120 × 200 matrix representing the data of 120 rats with 200 gene probes. We want to find the gene probes most related to TRIM32 in sparse high-dimensional regression models. For this data set, we consider ncvreg [2] (10-fold cv.ncvreg with seed 0) as the gold standard for comparison purposes. Table 7 lists the results of GSELO (LIN, SELO, EXP, SIN and ATN) and ncvreg. From Table 7, the six methods identify 5, 3, 6, 4, 3 and 4 probes, respectively, and have 3 probes in common. Notably, although the magnitudes of the estimates for those common probes are not equal, they have the same signs, which suggests similar biological conclusions. In addition, the methods have similar PMSEs, which implies that they give results of comparable accuracy.
Table 7: Analysis of the eyedata set: estimated coefficients (intercept and selected probes) and PMSE of ncvreg, LIN, SELO, EXP, SIN and ATN; zero entries correspond to omitted variables.

6 Concluding remarks

In this paper, we propose the GSELO-PLS procedure for variable selection and parameter estimation in linear models. We generalize SELO to GSELO and thus place the popular SELO penalization method within a more general framework. We establish the asymptotic properties of the proposed GSELO-PLS estimator in a setting where the dimension of the covariates grows with the sample size: the consistency and the oracle property of the proposed estimators are proved under some regularity conditions. When coupled with a continuation strategy and a modified BIC tuning parameter selector, our overall procedure is very efficient and accurate. In addition, when d is larger than n, we use an LLA-CD algorithm and a high-dimensional BIC, combined with a continuation strategy on the regularization parameter, to compute the GSELO solution paths in high dimensions. The results of simulation studies and real data examples demonstrate the effectiveness of the proposed approach. As a natural extension of SELO, the proposed GSELO method automatically inherits the merits of SELO and can be directly used to carry over existing results of the SELO-based literature, i.e., linear models [7], generalized linear models [16], multivariate panel count data [30] and quantile regression [6]. By the connection between SICA and GSELO, it is heuristically attractive to consider using GSELO for variable selection in other settings in the future, such as Cox models [24, 25] and additive hazards models [17]. Moreover, in regression problems, variables can often be thought of as grouped.
Following the group exponential LASSO of [1] for bi-level variable selection, it would be interesting to extend the GSELO results to structured-sparsity penalized models, which we also leave for future research.

Appendix

We follow steps similar to the proofs of Dicker et al. [7]. Hereafter, we write p(β_j) rather than p_{λ,τ}(β_j) for notational simplicity.
Proof of Theorem 1. Let α_n = √(dσ²/n). It suffices to show that, for any given ε > 0, there exists a large constant C such that

P{ inf_{‖u‖=1} Q_n(β* + C α_n u) > Q_n(β*) } ≥ 1 − ε.   (12)

Define D_n(u) = Q_n(β* + C α_n u) − Q_n(β*). We have

D_n(u) ≥ (1/2n) C² α_n² ‖Xu‖² − (1/n) C α_n ϵᵀ X u + Σ_{j∈K(u)} [p(β*_j + C α_n u_j) − p(β*_j)] = I_1 + I_2 + I_3,

where K(u) = {j : p(β*_j + C α_n u_j) − p(β*_j) < 0}. By (C2),

I_1 = (1/2n) C² α_n² ‖Xu‖² ≥ (γ_min(n⁻¹XᵀX)/2) C² α_n² = O_p(C² α_n²),   (13)

|I_2| = (1/n) C α_n |ϵᵀ X u| ≤ C α_n ‖Xᵀϵ‖/n = O_p(C α_n²),   (14)

since E‖Xᵀϵ/n‖² = (σ²/n) tr(n⁻¹XᵀX) ≤ (σ²/n) d γ_max(n⁻¹XᵀX) = O(α_n²). Condition (C4) implies ρ/α_n → ∞. This and the fact that p(·) is concave on [0, ∞) imply that, when n is sufficiently large,

p(β*_j + C α_n u_j) − p(β*_j) ≥ −C α_n |u_j| p′(|β*_j + C α_n u_j|),

where p′(t) denotes the derivative with respect to t. It follows that

I_3 ≥ −Σ_{j∈K(u)} C α_n |u_j| p′(|β*_j + C α_n u_j|)
   = −Σ_{j∈K(u)} C α_n |u_j| (λ/f(1)) f′( |β*_j + C α_n u_j| / (|β*_j + C α_n u_j| + τ) ) · τ / (|β*_j + C α_n u_j| + τ)²
   ≥ −Σ_{j∈K(u)} C α_n |u_j| O(1) λ τ / (ρ + τ)²
   ≥ −C α_n O(1) (λτ/ρ²) √d ‖u‖
   = O(C α_n) (λ/ρ²)(τ√d) = O(C α_n) o(1) O(α_n) = o(C α_n²)   (15)

by (C3)-(C4) and (H1)-(H2). From (13), (14) and (15), if C > 0 is large enough, I_2 and I_3 are dominated by I_1, which is positive. This proves (12).

Proof of Theorem 2 (i). Let β ∈ R^d with ‖β − β*‖ ≤ C α_n, where C is an arbitrary positive constant. For ε_n = C α_n > 0, it suffices to show that, for all j ∈ A^c, with probability tending to one as n → ∞,

∂Q_n(β)/∂β_j > 0 for 0 < β_j < ε_n,   (16)
∂Q_n(β)/∂β_j < 0 for −ε_n < β_j < 0.   (17)

By some algebra,

∂Q_n(β)/∂β_j = −(1/n) x_jᵀ (y − Xβ) + p′(|β_j|) sgn(β_j) = I_1 + I_2.

Note that E(Xᵀϵ/n) = 0 and

E‖Xᵀϵ/n‖² = tr{ E(Xᵀϵ ϵᵀX) / n² } = (σ²/n) tr(XᵀX/n) = O(dσ²/n)

by (C2). It follows that (1/n) ‖Xᵀ(y − Xβ)‖ = O_p(√(dσ²/n)) = O_p(α_n), and hence I_1 = O_p(α_n). On
16 160 Appl. Math. J. Chinese Univ. Vol. 33, No. 2 the other hand, p ( β j )/α n = 1 f(1) f β j ( β j + τ ) λτ/α n ( β j + τ) 2. Since β j Cα n with j A c, and (C3) implies α n /τ and λτ/α n ( β j + τ) 2 λτ/α n λτ = O( (Cα n + τ) 2 αn 3 ), we have p ( β j )/α n together with (H1) and (H2). Thus, the sign of Q n (β) = α n {o p (1) + p ( β j )/α n sgn(β j )} β j is completely determined by the sign of β j when n is large, and they always have same signs. Hence, (16) and (17) follow. Proof of Theorem 2 (ii). By (i) of Theorem 2, {j; ˆβj 0} = A. As the local minima of Q n (β), ˆβ must satisfy Q n (β) β= = 0, β A ˆβ i.e., 1 n XT A (y X ˆβ) + p A ( ˆβ) = 0, which implies ˆβ A βa = (X T AX A ) 1 X T Aϵ (n 1 X T AX A ) 1 p A( ˆβ). It follows that nbn (n 1 X T AX A /σ 2 ) 1/2 ( ˆβ A β A) = B n (σ 2 X T AX A ) 1/2 X T Aϵ nb n (σ 2 X T AX A ) 1/2 p A( ˆβ) = I 1 I 2 and I 2 = n/σ 2 B n (X T AX A /n) 1/2 p A( ˆβ). (18) By (C2), we have B n (X T AX A /n) 1/2 p A( ˆβ) = O p ( p A( ˆβ) ). (19) On the other hand, j A = {j; ˆβ j 0}, ( p ( ˆβ j ) = λ f(1) f ˆβ ) j τ sgn( ˆβ j ) ˆβ j + τ ( ˆβ j + τ), 2 and it follows p ( ˆβ) = O p ( λτ ρ ) according to (H1)-(H2) and (C3)-(C4). It is noteworthy that 2 v d v holds for all v R d, and then we have p A ( ˆβ) p ( ˆβ) = O p ( d λτ ρ ). Then, 2 (18), (19) and (C3)-(C4) imply I 2 = n/σ 2 O p ( d λτ ρ 2 ) = O p( nd/σ 2 λτ ρ 2 ) = O p( λ ρ 2 ) = o p(1). Thus, to complete proof (ii), it suffices to show d I 1 N(0, G), (20) according to the Slutsky s theorem. Denote I 1 = n w i,n, where w i,n = B n (σ 2 X T A X A) 1/2 x i,a ϵ i. Fix δ 0 > 0 and let η i,n = x T i,a (XT A X A) 1/2 Bn T B n (X T A X A) 1/2 x i,a. Then, using similar procedures in [7], we can show n E( w i,n 2 ; w i,n 2 > δ 0 ) 0. i=1 i=1
by (C5) and the fact that $\sum_{i=1}^n \eta_{i,n} = \mathrm{tr}(B_n^T B_n) \to \mathrm{tr}(G) < \infty$. Thus, the Lindeberg condition is satisfied and (20) holds.

References

[1] P Breheny. The group exponential lasso for bi-level variable selection, Biometrics, 2015, 71(3).
[2] P Breheny, J Huang. Coordinate descent algorithms for nonconvex penalized regression, with applications to biological feature selection, Ann Appl Stat, 2011, 5(1).
[3] E J Candes, M B Wakin, S P Boyd. Enhancing sparsity by reweighted l1 minimization, J Fourier Anal Appl, 2008, 14(5).
[4] X Chen. Superlinear convergence of smoothing quasi-Newton methods for nonsmooth equations, J Comput Appl Math, 1997, 80(1).
[5] X Chen. Smoothing methods for nonsmooth, nonconvex minimization, Math Program, 2012, 134(1).
[6] G Ciuperca. Model selection in high-dimensional quantile regression with seamless L0 penalty, Statist Probab Lett, 2015, 107.
[7] L Dicker, B Huang, X Lin. Variable selection and estimation with the seamless-L0 penalty, Statist Sinica, 2013, 23.
[8] J Fan, R Li. Variable selection via nonconcave penalized likelihood and its oracle properties, J Amer Statist Assoc, 2001, 96(456).
[9] J Fan, J Lv. A selective overview of variable selection in high dimensional feature space, Statist Sinica, 2010, 20(1).
[10] J Fan, H Peng. Nonconcave penalized likelihood with a diverging number of parameters, Ann Statist, 2004, 32(3).
[11] J Friedman, T Hastie, R Tibshirani. Regularization paths for generalized linear models via coordinate descent, J Stat Softw, 2010, 33(1).
[12] C Gao, N Wang, Q Yu, Z Zhang. A feasible nonconvex relaxation approach to feature selection, In: Proceedings of the Twenty-Fifth AAAI Conference on Artificial Intelligence.
[13] T Hastie, R Tibshirani, J Friedman. The Elements of Statistical Learning, Springer, Berlin.
[14] Y Jiao, B Jin, X Lu. A primal dual active set with continuation algorithm for the l0-regularized optimization problem, Appl Comput Harmon Anal, 2015, 39.
[15] X Li, T Zhao, X Yuan, H Liu. The flare package for high dimensional linear regression and precision matrix estimation in R, J Mach Learn Res, 2015, 16.
[16] Z Li, S Wang, X Lin. Variable selection and estimation in generalized linear models with the seamless L0 penalty, Canad J Statist, 2012, 40(4).
[17] W Lin, J Lv. High-dimensional sparse additive hazards regression, J Amer Statist Assoc, 2013, 108(501).
[18] J Lv, Y Fan. A unified approach to model selection and sparse recovery using regularized least squares, Ann Statist, 2009, 37(6A).
[19] C F Ma. Optimization Method and Its Matlab Program Design, Science Press, Beijing.
[20] R Mazumder, J Friedman, T Hastie. SparseNet: coordinate descent with nonconvex penalties, J Amer Statist Assoc, 2011, 106(495).
[21] M Nikolova. Local strong homogeneity of a regularized estimator, SIAM J Appl Math, 2000, 61(2).
[22] J Nocedal, S Wright. Numerical Optimization, 2nd ed, Springer, New York.
[23] T Scheetz, K Kim, R Swiderski, A Philp, T Braun, K Knudtson, A Dorrance, G DiBona, J Huang, T Casavant, V Sheffield, E Stone. Regulation of gene expression in the mammalian eye and its relevance to eye disease, Proc Natl Acad Sci USA, 2006, 103(39).
[24] Y Y Shi, Y X Cao, Y L Jiao, Y Y Liu. SICA for Cox's proportional hazards model with a diverging number of parameters, Acta Math Appl Sin Engl Ser, 2014, 30(4).
[25] Y Y Shi, Y L Jiao, L Yan, Y X Cao. A modified BIC tuning parameter selector for SICA-penalized Cox regression models with diverging dimensionality, J Math, 2017, 37(4).
[26] T A Stamey, J N Kabalin, J E McNeal, I M Johnstone, F Freiha, E A Redwine, N Yang. Prostate specific antigen in the diagnosis and treatment of adenocarcinoma of the prostate. II. Radical prostatectomy treated patients, J Urol, 1989, 141(5).
[27] R Tibshirani. Regression shrinkage and selection via the lasso, J R Stat Soc Ser B Stat Methodol, 1996, 58(1).
[28] L Wang, Y Kim, R Li. Calibrating nonconvex penalized regression in ultra-high dimension, Ann Statist, 2013, 41(5).
[29] C H Zhang. Nearly unbiased variable selection under minimax concave penalty, Ann Statist, 2010, 38(2).
[30] H Zhang, J Sun, D Wang. Variable selection and estimation for multivariate panel count data via the seamless-L0 penalty, Canad J Statist, 2013, 41(2).
[31] H Zou. The adaptive lasso and its oracle properties, J Amer Statist Assoc, 2006, 101(476).
[32] H Zou, T Hastie. Regularization and variable selection via the elastic net, J R Stat Soc Ser B Stat Methodol, 2005, 67(2).
[33] H Zou, R Li. One-step sparse estimates in nonconcave penalized likelihood models, Ann Statist, 2008, 36(4).

1 School of Economics and Management, China University of Geosciences, Wuhan, China.
2 School of Statistics and Mathematics, Zhongnan University of Economics and Law, Wuhan, China.
3 Center for Resources and Environmental Economic Research, China University of Geosciences, Wuhan, China.
Email: yulingjiaomath@whu.edu.cn
More informationAdaptive L p (0 <p<1) Regularization: Oracle Property and Applications
Adaptive L p (0
More informationRegularization and Variable Selection via the Elastic Net
p. 1/1 Regularization and Variable Selection via the Elastic Net Hui Zou and Trevor Hastie Journal of Royal Statistical Society, B, 2005 Presenter: Minhua Chen, Nov. 07, 2008 p. 2/1 Agenda Introduction
More informationA Significance Test for the Lasso
A Significance Test for the Lasso Lockhart R, Taylor J, Tibshirani R, and Tibshirani R Ashley Petersen June 6, 2013 1 Motivation Problem: Many clinical covariates which are important to a certain medical
More informationPre-Selection in Cluster Lasso Methods for Correlated Variable Selection in High-Dimensional Linear Models
Pre-Selection in Cluster Lasso Methods for Correlated Variable Selection in High-Dimensional Linear Models Niharika Gauraha and Swapan Parui Indian Statistical Institute Abstract. We consider variable
More information