Variable selection via generalized SELO-penalized linear regression models
Appl. Math. J. Chinese Univ. 2018, 33(2)

SHI Yue-yong 1,3   CAO Yong-xiu 2   YU Ji-chang 2   JIAO Yu-ling 2

Abstract. The seamless-L0 (SELO) penalty is a smooth function on [0, ∞) that very closely resembles the L0 penalty, and it has been demonstrated, both theoretically and practically, to be effective in nonconvex penalization for variable selection. In this paper, we first generalize SELO to a class of penalties retaining the good features of SELO, and then propose variable selection and estimation in linear models via the proposed generalized SELO (GSELO) penalized least squares (PLS) approach. We show that the GSELO-PLS procedure possesses the oracle property and consistently selects the true model under some regularity conditions in the presence of a diverging number of variables. The entire path of GSELO-PLS estimates can be efficiently computed through a smoothing quasi-Newton (SQN) method. A modified BIC coupled with a continuation strategy is developed to select the optimal tuning parameter. Simulation studies and an analysis of clinical data are carried out to evaluate the finite-sample performance of the proposed method. In addition, numerical experiments involving simulation studies and an analysis of microarray data are also conducted for GSELO-PLS in high-dimensional settings.

MR Subject Classification: 62F12, 62J05, 62J07.

1 Introduction

Consider the linear regression model

y = Xβ* + ϵ,   (1)

where y = (y_1, y_2, ..., y_n)^T ∈ R^n is a response vector, X = (x_ij) ∈ R^{n×d} is a design matrix, β* = (β*_1, β*_2, ..., β*_d)^T ∈ R^d is a vector of underlying regression coefficients, and ϵ = (ϵ_1, ϵ_2, ..., ϵ_n)^T ∈ R^n is a vector of random errors. We assume without loss of generality that y is centered and the columns of X are centered and √n-normalized, i.e., Σ_{i=1}^n y_i = 0, Σ_{i=1}^n x_ij = 0 and n^{-1} Σ_{i=1}^n x_ij^2 = 1. We also assume that β* is sparse in the sense that only
Keywords: continuation, coordinate descent, BIC, LLA, oracle property, SELO, smoothing quasi-Newton.
Digital Object Identifier (DOI):
Supported by the National Natural Science Foundation of China ( , , , ) and the Fundamental Research Funds for the Central Universities (CUGW150809).
Corresponding author.
a relatively small portion of the components of β* are nonzero, and our goal is to reconstruct the unknown vector β*. Let A = {j; β*_j ≠ 0} be the true model and suppose that s = |A| is the size of the true model (i.e., the sparsity level of β*), where |A| denotes the cardinality of A. To achieve sparsity in linear models, the penalization (or regularization) method, which optimizes a loss function term plus a penalty function term, has been widely used in the literature (cf., e.g., [27, 8, 10, 31-32, 29]). In this paper, we consider the following so-called SELO-penalized least squares (PLS) problem:

β̂ := β̂(λ, τ) = argmin_{β ∈ R^d} { Q_n(β) = (1/2n) ‖y − Xβ‖² + Σ_{j=1}^d p_{λ,τ}(β_j) },   (2)

where ‖·‖ denotes the L2 norm on Euclidean space and

p_{λ,τ}(β_j) = (λ / log 2) log( |β_j| / (|β_j| + τ) + 1 )

is the SELO penalty proposed by Dicker et al. [7]. Here λ and τ are two positive tuning (or regularization) parameters. In particular, λ is the sparsity tuning parameter yielding sparse solutions, and τ is the shape (or concavity) tuning parameter that makes SELO approach L0 as τ → 0+, where L0 admits p_λ(β_j) = λ I(β_j ≠ 0). The estimator β̂ = β̂(λ, τ) in (2) is called a SELO-PLS (SPLS) estimator. L0 regularization [9] directly penalizes the number of variables in the regression model, so it enjoys a nice interpretation as best subset selection, but it is not continuous at 0 and is computationally infeasible when d is moderately large. SELO is a good surrogate for L0 since it can explicitly mimic L0 via small τ values, and it is more stable than L0 due to the continuity of its penalty function. Figure 1 depicts SELO penalties for a few values of τ while fixing λ = 1.

Figure 1: Plot of SELO penalty functions. τ = 0 (L0, thick solid), τ = 0.1 (dotdash), τ = 1 (dashed), τ = 10 (dotted), and τ = (thin solid).

Since the introduction of SELO for linear models (LM) [7], the methodology has been extended to generalized linear models (GLM) [16], multivariate panel count data [30]
and quantile regression [6], among others. Under LM, Dicker et al. [7] show that SELO-LM estimators enjoy the oracle property [8] when both d and n tend to infinity with d/n → 0, and outperform other penalized estimators by various metrics in numerical simulations. They propose a SELO-LM-BIC procedure to select the tuning parameters and show that it is consistent for model selection. Under GLM, Li et al. [16] show that the SELO-GLM procedure enjoys the oracle property when both d and n tend to infinity with d^5/n → 0. They also establish model selection consistency results via a SELO-GLM-BIC procedure. It is noteworthy that both SELO-LM and SELO-GLM estimators can be efficiently calculated by coordinate descent (CD) algorithms. Zhang et al. [30] develop a SELO-penalized estimating equation approach for the regression analysis of multivariate panel count data, with a focus on variable selection and estimation of significant covariate effects, where the dimensionality d is assumed to be fixed. They use a BIC procedure to select tuning parameters and apply the classical Newton-Raphson algorithm in their numerical experiments. Ciuperca [6] introduces and studies the SELO quantile estimator in a linear model when both d and n tend to infinity with d/n → 0, and derives the convergence rate, oracle properties and BIC model selection consistency of the corresponding estimators. In this paper, we propose a generalized SELO (GSELO) penalized method for variable selection and parameter estimation in linear models. First, we generalize the SELO penalty to a class of penalties (i.e., GSELO penalties) closely resembling the L0 penalty and retaining the good features of SELO. Second, based on the proposed GSELO penalties, we develop the GSELO-PLS procedure for variable selection and parameter estimation in linear models.
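The SELO penalty at the heart of this construction is simple to evaluate. The following Python sketch (an illustration we add; the paper's own code is written in Matlab and R) shows that the penalty vanishes at the origin and approaches the constant λ for any nonzero coefficient once τ is small, which is exactly the L0-mimicking behavior described above:

```python
import math

def selo(beta, lam=1.0, tau=0.01):
    """SELO penalty p_{lam,tau}(beta) = (lam / log 2) * log(|beta|/(|beta|+tau) + 1)."""
    b = abs(beta)
    return (lam / math.log(2.0)) * math.log(b / (b + tau) + 1.0)

# As tau -> 0+, SELO mimics the L0 penalty lam * I(beta != 0):
print(selo(0.0))             # exactly 0 at the origin, for any tau
print(selo(2.0, tau=1e-6))   # approximately lam = 1 for a large coefficient
print(selo(0.05, tau=1e-6))  # ... and also for a small nonzero coefficient
```

For larger τ the penalty flattens out, which is the stabilizing effect visible in Figure 1.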
We establish consistency and asymptotic normality for GSELO-PLS and show that it performs as well as an oracle estimator when both d and n tend to infinity with d/n → 0. Third, we implement a smoothing quasi-Newton (SQN) method with a backtracking line search, which has a superlinear convergence rate, is insensitive to the choice of initial values, and, compared with the modified Newton-Raphson algorithm [8], avoids computing the sequence of inverses of the Hessian matrix, to calculate the proposed GSELO-PLS estimates. In particular, we couple our algorithm with a continuation strategy on the regularization parameter: given a decreasing sequence of parameters {λ_g}_g, we apply the algorithm to solve the λ_{g+1}-problem with the initial guess taken from the solution of the λ_g-problem. The idea of continuation is well established for iterative algorithms, for the purposes of warm starting and globalizing convergence. We adopt a modified BIC (MBIC) to select a suitable tuning parameter during the continuation process. Finally, we conduct numerical experiments to evaluate the performance of GSELO-PLS in high dimensions. To deal with the high-dimensional setting, we first employ a local linear approximation (LLA) [33] to the nonconvex GSELO penalties and then resort to an existing Gauss-Seidel type coordinate descent algorithm in [2] to obtain the solution path. Numerically, when coupled with the continuation strategy and a high-dimensional BIC (HBIC), the overall GSELO-PLS-HBIC procedure for high-dimensional data is very efficient. The remainder of this paper is organized as follows. In Section 2, we first describe our proposed GSELO method and then establish asymptotic theoretical results for the GSELO-PLS
procedure. In Section 3, we present the algorithm for computing the GSELO-PLS estimator, the standard error formulae for the estimated coefficients, and a modified BIC coupled with a continuation strategy to select the optimal tuning parameter. Simulation studies are conducted in Section 4 to evaluate the finite-sample performance of the proposed method, which is further illustrated with real clinical data. In Section 5, we numerically study the behavior of the proposed GSELO-PLS estimators in high dimensions, including the computational issues, the choice of the tuning parameter, simulation studies and the analysis of microarray data. We conclude the paper in Section 6. Proofs of the theorems are provided in the Appendix.

2 Generalized SELO-penalized linear regression models

2.1 Methodology

Let P denote the class of all GSELO penalties, and let f be an arbitrary function satisfying the following two hypotheses:

(H1) f(x) is continuous w.r.t. x and has first and second derivatives on [0, 1];

(H2) f′(x) ≥ 0 on the interval [0, 1] and lim_{x→0} f(x)/x = 1.

Then a GSELO penalty p_{λ,τ}(·) ∈ P is given by

p_{λ,τ}(β_j) = (λ / f(1)) f( |β_j| / (|β_j| + τ) ),

where λ (sparsity) and τ (concavity) are two positive tuning parameters.

Remark 1. (H1) is needed to guarantee the continuity of the penalty functions, and (H2) is used to make the penalties in P resemble L0. Obviously, SELO is a member of P: simply take f(x) = log(x + 1). Table 1 lists some representatives of P, and Figure 2 displays them with τ = 1 and τ = 0.01, respectively.

Table 1: Representatives of GSELO

Name | Type of function | f(x) | p_{λ,τ}(β_j)
LIN | linear | x | λ |β_j| / (|β_j| + τ)
SELO | logarithmic | log(x + 1) | (λ / log 2) log( |β_j| / (|β_j| + τ) + 1 )
EXP | exponential | 1 − exp(−x) | (λ / (1 − exp(−1))) [1 − exp( −|β_j| / (|β_j| + τ) )]
SIN | trigonometric | sin(x) | (λ / sin 1) sin( |β_j| / (|β_j| + τ) )
ATN | inverse trigonometric | arctan(x) | (λ / arctan 1) arctan( |β_j| / (|β_j| + τ) )
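The hypotheses and the entries of Table 1 can be checked numerically. The following Python sketch (our illustration, not the authors' code) implements the five generator functions f and the generic GSELO penalty, and verifies both the normalization lim_{x→0} f(x)/x = 1 and the L0-mimicking behavior for small τ:

```python
import math

# Generator functions f from Table 1; each satisfies f(x)/x -> 1 as x -> 0+.
GENERATORS = {
    "LIN":  lambda x: x,
    "SELO": lambda x: math.log(x + 1.0),
    "EXP":  lambda x: 1.0 - math.exp(-x),
    "SIN":  math.sin,
    "ATN":  math.atan,
}

def gselo(beta, f, lam=1.0, tau=0.01):
    """Generic GSELO penalty p_{lam,tau}(beta) = (lam / f(1)) * f(|beta|/(|beta|+tau))."""
    b = abs(beta)
    return (lam / f(1.0)) * f(b / (b + tau))

for name, f in GENERATORS.items():
    assert abs(f(1e-8) / 1e-8 - 1.0) < 1e-6           # (H2): f(x)/x -> 1 near 0
    assert gselo(0.0, f) == 0.0                       # no penalty at the origin
    assert abs(gselo(3.0, f, tau=1e-6) - 1.0) < 1e-4  # ~ lam for beta != 0, small tau
```

The normalization by f(1) makes all five penalties attain the same maximum value λ, so they are directly comparable along a common tuning grid.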
Remark 2. It is noteworthy that LIN is actually the transformed L1 penalty studied by Nikolova [21], which inspired Lv and Fan [18] to propose the SICA approach for sparse recovery and model selection.

Figure 2: GSELO penalty functions p_{λ,τ}(β_j). Left panel: λ = 1, τ = 1; right panel: λ = 1, τ = 0.01. L0 (thick solid), L1 (solid), LIN (dashed), SELO (dotted), EXP (dotdash), SIN (longdash) and ATN (twodash).

Based on the proposed GSELO penalties, the corresponding GSELO-PLS estimator is given by

β̂ := β̂(λ, τ) = argmin_{β ∈ R^d} { Q_n(β) = (1/2n) ‖y − Xβ‖² + Σ_{j=1}^d p_{λ,τ}(β_j) },   (3)

where p_{λ,τ}(·) ∈ P.

2.2 Theoretical results

We establish theoretical results for the GSELO-PLS estimator under the following regularity conditions.

(C1) n → ∞ and dσ²/n → 0.

(C2) There exist positive constants r, R ∈ R such that r < γ_min(n⁻¹XᵀX) ≤ γ_max(n⁻¹XᵀX) < R, where γ_min(n⁻¹XᵀX) and γ_max(n⁻¹XᵀX) are the smallest and largest eigenvalues of n⁻¹XᵀX, respectively.

(C3) τ = O(√(σ²/(dn))) and λτ [n/(dσ²)]^{3/2} → ∞.

(C4) ρ √(n/(dσ²)) → ∞ and λ/ρ² → 0, where ρ = min_{j∈A} |β*_j|.
6 150 Appl. Math. J. Chinese Univ. Vol. 33, No. 2 (C5) lim max d n 1 i n j=1 x2 ij = 0. (C6) E( ϵ i /σ 2+δ ) < M for some δ > 0 and M <. Remark 3. Conditions (C1)-(C6) coincide with the conditions in [7]. Please see more details therein. Theorem 1 (Existence of GSELO-PLS estimator). Under hypotheses (H1)-(H2) and conditions (C1)-(C6), then, with probability tending to one, there exists a local minimizer ˆβ of Q n (β), defined in (3), such that ˆβ β = O p ( dσ 2 /n), where denotes the Euclidean norm of a vector. Theorem 2 (Oracle property). Under hypotheses (H1)-(H2) and conditions (C1)-(C6), then, with probability tending to 1, the n/(dσ 2 )-consistent local minimizer ˆβ in Theorem 1 must be such that (i) lim P ({j; ˆβj 0} = A) = 1. n (ii) nb n (n 1 X T A X A/σ 2 ) 1/2 ( ˆβ A β A ) N(0, G) in distribution, where B n is an arbitrary q A matrix such that B n B T n G. To save space, we only state the main results here and relegate the proofs to the Appendix. Interested readers can refer to [10,7] for more details. 3.1 Algorithm 3 Computation Dicker et al. [7] use a coordinate descent (CD) algorithm procedure, which amounts to finding the roots of certain cubic equations, for obtaining SELO estimates. However, among the GSELO penalties taken into consideration in this paper (i.e., LIN, SELO, EXP, SIN and ATN in Table 1), only LIN and SELO can be implemented using the CD algorithm in [7]. To illustrate this point, we consider the one-dimensional PLS problem { ˆβ = arg min Q(β) = 1 } β R 2 (β β 0) 2 + p λ,τ (β), (4) where β 0 R is a constant and p λ,τ (β) is a penalty in Table 1. CD procedures ask for finding the nonzero stationary points (or critical points) of the objective function Q(β). Direct computation of Q (β) = 0 gives (LIN) β β 0 + λ τ sgn(β) ( β + τ) 2 = 0, (SELO) β β 0 + λ τ sgn(β) log(2) (2 β + τ)( β + τ) = 0 or (SIN) β β 0 + λ ( ) β τ sgn(β) sin(1) cos β + τ ( β + τ) 2 = 0.
It follows that the stationarity conditions for LIN and SELO can be transformed into cubic equations, while the one for SIN cannot, and EXP and ATN behave like SIN in this respect. Thus, for the sake of uniformity of computation, we use the smoothing quasi-Newton (SQN) method [22, 19, 24] to optimize Q_n(β) in (3). Since the GSELO penalty functions are singular at the origin, we first smooth them by replacing |β_j| with √(β_j² + ε), where ε is a small positive quantity; note that √(β_j² + ε) → |β_j| as ε → 0. Then, instead of (3), we solve

β̂ = β̂(λ, τ, ε) = argmin_{β ∈ R^d} { Q_n^ε(β) = (1/2n) ‖y − Xβ‖² + Σ_{j=1}^d p_{λ,τ,ε}(β_j) }   (5)

using the DFP quasi-Newton method with a backtracking line search, where p_{λ,τ,ε}(β_j) = p_{λ,τ}(√(β_j² + ε)). In practice, taking ε = 0.01 gives good results. We summarize the SQN-DFP procedure in Algorithm 1. More theoretical results on smoothing methods for nonsmooth and nonconvex minimization can be found in [4, 5].

Remark 4. Like the local quadratic approximation (LQA) algorithm of Fan and Li [8], the sequence β^k obtained from SQN-DFP may not be sparse for any fixed k, and hence is not directly suitable for variable selection. In practice, we set β_j^k = 0 if |β_j^k| < ε_0 for some sufficiently small tolerance level ε_0.

Algorithm 1 SQN-DFP
Input: initial values β^0 ∈ R^d and H_0 = I_d (the d × d identity matrix); line search parameters ρ ∈ (0, 1) and η ∈ (0, 1/2); stopping tolerance δ.
1: for k = 0, 1, 2, ..., k_max do
2:   compute g_k = ∇Q_n^ε(β^k),
3:   if ‖g_k‖ ≤ δ then
4:     stop and output β^k as the estimate of β in (5),
5:   else
6:     compute the direction d_k = −H_k g_k.
7:   end if
8:   for m = 0, 1, 2, ..., m_max do
9:     compute β_m^k = β^k + ρ^m d_k,
10:    if Q_n^ε(β_m^k) ≤ Q_n^ε(β^k) + η ρ^m g_kᵀ d_k then
11:      stop and output α_k = ρ^m.
12:    end if
13:  end for
14:  compute β^{k+1} = β^k + α_k d_k, g_{k+1} = ∇Q_n^ε(β^{k+1}), Δg_k = g_{k+1} − g_k, Δβ_k = β^{k+1} − β^k.
15:  if (Δβ_k)ᵀ Δg_k ≤ 0 then
16:    H_{k+1} = H_k;
17:  else
18:    H_{k+1} = H_k − (H_k Δg_k Δg_kᵀ H_k) / (Δg_kᵀ H_k Δg_k) + (Δβ_k Δβ_kᵀ) / ((Δβ_k)ᵀ Δg_k).
19:  end if
20: end for
Output: β̂, the estimate of β in equation (5).
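As a concrete illustration of Algorithm 1, the following Python/NumPy sketch (our re-implementation under the stated defaults ε = 0.01, ρ = 0.55, η = 0.4; the authors' code is in Matlab) runs DFP with a backtracking Armijo line search on the smoothed objective Q_n^ε, here with the SELO generator:

```python
import numpy as np

def sqn_dfp(X, y, lam, tau=0.01, eps=0.01, rho=0.55, eta=0.4,
            k_max=50, m_max=20, delta=1e-6):
    """DFP quasi-Newton with backtracking line search on the smoothed
    SELO-PLS objective Q_n^eps (a sketch of Algorithm 1)."""
    n, d = X.shape

    def Q(b):
        t = np.sqrt(b**2 + eps)  # smoothed |b_j|
        pen = (lam / np.log(2.0)) * np.log(t / (t + tau) + 1.0)
        return 0.5 / n * np.sum((y - X @ b)**2) + pen.sum()

    def grad(b):
        t = np.sqrt(b**2 + eps)
        dp_dt = (lam / np.log(2.0)) * tau / ((2.0 * t + tau) * (t + tau))
        return -X.T @ (y - X @ b) / n + dp_dt * b / t  # chain rule: dt/db = b/t

    beta, H = np.zeros(d), np.eye(d)
    g = grad(beta)
    for _ in range(k_max):
        if np.linalg.norm(g) <= delta:
            break
        dk = -H @ g
        alpha = rho**m_max
        for m in range(m_max + 1):            # backtracking (Armijo) search
            if Q(beta + rho**m * dk) <= Q(beta) + eta * rho**m * (g @ dk):
                alpha = rho**m
                break
        beta_new = beta + alpha * dk
        g_new = grad(beta_new)
        s, yk = beta_new - beta, g_new - g
        if s @ yk > 0:                        # DFP update; skip if curvature fails
            H += np.outer(s, s) / (s @ yk) - np.outer(H @ yk, H @ yk) / (yk @ H @ yk)
        beta, g = beta_new, g_new
    return beta
```

As in Remark 4, the returned β̂ is then hard-thresholded at a small ε_0 to obtain the sparse estimate.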
3.2 Covariance estimation

Following [7], we estimate the covariance matrix (i.e., the standard errors) of β̂ by the sandwich formula

ĉov(β̂_Â) = σ̂² {X_Âᵀ X_Â + n Σ_{Â,Â}^ε(β̂)}⁻¹ X_Âᵀ X_Â {X_Âᵀ X_Â + n Σ_{Â,Â}^ε(β̂)}⁻¹,   (6)

where σ̂² = (n − ŝ)⁻¹ ‖y − Xβ̂‖², ŝ = |Â|, Â = {j; β̂_j ≠ 0} and

Σ^ε(β) = diag{ p′_{λ,τ,ε}(β_1)/|β_1|, ..., p′_{λ,τ,ε}(β_d)/|β_d| }.   (7)

For variables with β̂_j = 0, the estimated standard errors are 0.

3.3 Tuning parameter selection

As suggested in [7], we fix τ = 0.01 and use a modified BIC (MBIC) procedure to tune λ via

λ̂ = argmin_λ { MBIC(λ) = log(σ̂²) + (k_n/n) ŝ },   (8)

where β̂ = β̂(λ, τ), σ̂² = (n − ŝ)⁻¹ ‖y − Xβ̂‖², ŝ = |Â|, and k_n is a positive number that depends on the sample size n and grows like log(n); in our numerical experiments, we set k_n = log(n). Since solving (5) is a nonconvex optimization problem, we couple SQN-DFP with a continuation strategy on the tuning parameter for efficient computation. To be precise, one chooses a starting value λ_0 for the parameter λ and a decreasing factor µ ∈ (0, 1) to obtain a decreasing sequence {λ_g}_g with λ_g = λ_0 µ^g, and then runs Algorithm 1 to solve the λ_{g+1}-problem initialized with the solution of the λ_g-problem. Summarizing this idea leads to Algorithm 2; see [14] and the references therein for more details. In practice, we use λ_0 = λ_max, where λ_max is an initial guess of λ, supposedly large enough to shrink all β_j's to zero, set λ_min = 10⁻⁵ λ_max, and then divide the interval [λ_min, λ_max] into G (the number of grid points) equally spaced subintervals on the logarithmic scale. Numerically, µ is determined by G: a larger G implies a larger decreasing factor µ. For sufficient resolution of the solution path, one usually takes G ≥ 50 (e.g., G = 100 or 200). Implementing Algorithm 1 for each value of τ and for the sequence λ_max = λ_0 > λ_1 > ... > λ_G = λ_min gives the entire GSELO-PLS solution path.
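The λ-grid and the MBIC criterion can be written compactly. The sketch below (ours, using k_n = log n and the 10⁻⁵ ratio from the text) builds the logarithmically spaced continuation sequence and scores a fitted coefficient vector:

```python
import numpy as np

def lambda_grid(lam_max, G=100, ratio=1e-5):
    """Continuation sequence lam_g = lam_max * mu**g, g = 0..G: G+1 values
    equally spaced on the log scale between lam_max and lam_min = ratio*lam_max."""
    return np.geomspace(lam_max, ratio * lam_max, G + 1)

def mbic(y, X, beta):
    """Modified BIC: log(sigma2_hat) + (k_n / n) * s_hat with k_n = log(n)."""
    n = len(y)
    s_hat = int(np.count_nonzero(beta))
    sigma2 = np.sum((y - X @ beta)**2) / (n - s_hat)
    return np.log(sigma2) + np.log(n) / n * s_hat
```

Running Algorithm 1 along `lambda_grid(...)`, warm-starting each solve from the previous one, and keeping the β̂ with the smallest `mbic` value reproduces the selection rule (8).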
Then we select the optimal λ from the candidate set Λ = {λ_1, λ_2, ..., λ_G} using MBIC (8).

Remark 5. Dicker et al. [7] show that the SELO-PLS-MBIC procedure consistently identifies the true model with a diverging number of parameters under some regularity conditions. It can be proved that the GSELO-PLS-MBIC procedure is also model selection consistent, by arguments similar to those in the proof of Theorem 2 of Dicker et al. [7]; the detailed proof is therefore omitted here.

Remark 6. In Algorithm 1, we set (ρ, η, m_max) = (0.55, 0.4, 20) following [19]. Thanks to the continuation strategy, the maximum number of outer iterations k_max need not be large in practice, so we set k_max = 50 in Algorithm 1. This procedure makes it possible
to substantially reduce the computational cost without noticeable loss of accuracy in the solutions.

Algorithm 2 Continuation strategy
Input: λ_0 and µ ∈ (0, 1). Let β(λ_0) = 0.
1: for g = 1, 2, 3, ..., G do
2:   Apply Algorithm 1 to problem (5) with λ_g = λ_0 µ^g, initialized with β^0 = β(λ_{g−1}).
3:   Compute the MBIC value.
4: end for
Output: Select λ̂ by (8).

4 Numerical experiments

4.1 Simulation studies

We present simulation studies to examine the finite-sample properties of GSELO-PLS-MBIC. All codes, available from the authors, are written in Matlab, and all experiments are performed in MATLAB R2010b on a quad-core laptop with an Intel Core i5 CPU (2.60 GHz) and 8 GB RAM running Windows 8.1 (64-bit). We simulate N = 1000 data sets from the linear model (1). β* is a d × 1 vector with β*_1 = 3, β*_2 = 1.5, β*_3 = 2 and the other β*_j's being 0; thus s = 3. The rows of the n × d matrix X are sampled as i.i.d. copies from N(0, Σ) with Σ = (0.5^{|j−k|}) for 1 ≤ j, k ≤ d. The components of the n × 1 vector ϵ are sampled from N(0, 1). In order to emphasize the dependence of the number of parameters on the sample size, we consider two sample sizes, n = 100 and n = 200, with d = ⌊n/(2 log n)⌋, where ⌊x⌋ denotes the integer part of x for x ≥ 0. To evaluate the variable selection performance of the proposed method, we consider the average model size N⁻¹ Σ_{s=1}^N |Â^(s)| (MS), the proportion of correct models N⁻¹ Σ_{s=1}^N I{Â^(s) = A} (CM), the average absolute error N⁻¹ Σ_{s=1}^N ‖β̂^(s) − β*‖ (AE), the average l2 relative error N⁻¹ Σ_{s=1}^N (‖β̂^(s) − β*‖_2 / ‖β*‖_2) (RE) and the average model error N⁻¹ Σ_{s=1}^N (β̂^(s) − β*)ᵀ Σ (β̂^(s) − β*) (ME). Simulation results for variable selection are summarized in Table 2. Since LIN, SELO, EXP, SIN and ATN all belong to the GSELO penalty family, it can be seen from Table 2 that the five penalties behave quite similarly to each other on all considered criteria.
With respect to MS, although all methods tend to slightly overestimate the true model size, they select the true model quite well, with reasonably small errors in terms of CM, AE, RE and ME. The results given in Table 3 are obtained under the same settings as in Table 2, but concern the estimation of the regression parameter β*. With respect to parameter estimation, Table 3 presents the average of the estimated nonzero coefficients (Mean), the average of the estimated standard errors (ESE) and the sample standard deviations (SSD). From Table 3, we can see that the Means are close to the corresponding true values, and the ESEs agree well with the SSDs, which indicates that the proposed covariance estimation formula is reasonable and reliable.
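For reference, the five criteria can be computed as follows (a Python sketch we add; the absolute error AE is computed here with the ℓ1 norm as one natural choice):

```python
import numpy as np

def selection_metrics(beta_hats, beta_star, Sigma):
    """MS, CM, AE, RE and ME averaged over N replications.
    beta_hats: (N, d) array of estimates; AE uses the l1 norm here."""
    A = set(np.flatnonzero(beta_star))      # true model
    N = len(beta_hats)
    out = {"MS": 0.0, "CM": 0.0, "AE": 0.0, "RE": 0.0, "ME": 0.0}
    for bh in beta_hats:
        diff = bh - beta_star
        out["MS"] += np.count_nonzero(bh)                       # model size
        out["CM"] += float(set(np.flatnonzero(bh)) == A)        # correct model?
        out["AE"] += np.sum(np.abs(diff))                       # absolute error
        out["RE"] += np.linalg.norm(diff) / np.linalg.norm(beta_star)
        out["ME"] += diff @ Sigma @ diff                        # model error
    return {k: v / N for k, v in out.items()}
```

Each quantity is accumulated per replication and then divided by N, matching the definitions above.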
Table 2: Simulation results for variable selection. d = ⌊n/(2 log n)⌋. Columns: MS, CM, AE, RE, ME; methods: LIN, SELO, EXP, SIN, ATN; settings: (n, d) = (100, 10) and (200, 18).

Table 3: Simulation results for parameter estimation. d = ⌊n/(2 log n)⌋. Columns: Mean, ESE, SSD for β*_1 = 3, β*_2 = 1.5 and β*_3 = 2; methods and settings as in Table 2.

4.2 Analysis of clinical data

We illustrate GSELO-PLS-MBIC with an analysis of a prostate cancer data set from [26]. This data set examines the correlation between the level of prostate-specific antigen and a number of clinical measures in 97 men with prostate cancer who were about to receive a radical prostatectomy. It has been analyzed in many texts on data mining (cf., e.g., [32, 13]) and is publicly available from the R package ElemStatLearn [13]. There are 97 observations (n = 97) and 9 variables (one quantitative response and d = 8 predictors) in the prostate cancer data set. The goal is to predict the response lpsa (log of prostate-specific antigen) from the predictors lcavol (log cancer volume), lweight (log prostate weight), age, lbph (log of benign prostatic hyperplasia amount), svi (seminal vesicle invasion), lcp (log of capsular penetration), gleason (Gleason score) and pgg45 (percent of Gleason scores 4 or 5). Many different model fitting and tuning parameter selection procedures have been applied to the prostate cancer data, which makes it challenging to choose among them, since the underlying true model is generally unknown in real data analyses. However, as previously stated, the minimizer of the L0 procedure (i.e., best subset selection) is the optimal solution and, when available, can be used as a gold standard for the evaluation of other approaches.
Hereafter, we regard the best subset model with the lowest BIC as the true model in order to assess the approaches. The five GSELO procedures (i.e., LIN, SELO, EXP, SIN and ATN) proposed in the previous sections are applied to the prostate cancer data. Additionally, the LS solution (computed by the R built-in function lm) and the LASSO solution (computed by the R function cv.glmnet with the lambda.1se rule and set.seed(0) from the R package glmnet [11]) are provided for comparison. The estimated regression parameters and the predictive mean squared errors (PMSE), calculated as n⁻¹ Σ_{i=1}^n (ŷ_i − y_i)², are provided in Table 4. One can see that the five GSELO penalties behave similarly, and they all select lcavol and lweight. In particular, SIN and ATN recover exactly the best subset selection result, which shows the good performance of the proposed GSELO-PLS-MBIC procedure.

Table 4: Analysis of the prostate cancer data set: estimated coefficients and PMSE of LS, Best Subset, LASSO, LIN, SELO, EXP, SIN and ATN for Intercept, lcavol, lweight, age, lbph, svi, lcp, gleason and pgg45; zero entries correspond to omitted variables.

5 High-dimensional case

In this section, we discuss how GSELO-PLS can be applied to high-dimensional data in which d > n. To solve (3) in high dimensions, we first apply the local linear approximation (LLA) [33] to p_{λ,τ}(·) ∈ P:

p_{λ,τ}(β_j) ≈ p_{λ,τ}(β_j^k) + p′_{λ,τ}(β_j^k)( |β_j| − |β_j^k| ),   (9)

where β_j^k is the kth estimate of β_j, j = 1, 2, ..., d, and p′_{λ,τ}(β_j) denotes the derivative of p_{λ,τ}(β_j) with respect to |β_j|. Given the current estimate β^k of β, we find the next estimate via

β^{k+1} = argmin_β { (1/2n) ‖y − Xβ‖² + Σ_{j=1}^d ω_j^{k+1} |β_j| },   (10)

where ω_j^{k+1} = p′_{λ,τ}(β_j^k). Then we use the Gauss-Seidel type coordinate descent (CD) algorithm of [2] to solve (10). We summarize the LLA-CD procedure in Algorithm 3.
For LLA-CD, we also couple it with the continuation strategy on the regularization parameter, in order to obtain accurate solutions.
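A minimal Python version of the inner coordinate-descent solve for (10) may look as follows (our sketch; it assumes the columns of X are standardized so that n⁻¹ x_jᵀ x_j = 1, under which the coordinate update β_j ← S(z_j, ω_j) is exact):

```python
import numpy as np

def soft(t, lam):
    """Soft-thresholding operator S(t, lam) = sgn(t) * (|t| - lam)_+."""
    return np.sign(t) * max(abs(t) - lam, 0.0)

def weighted_lasso_cd(X, y, weights, beta0, k_max=100, delta=1e-8):
    """Gauss-Seidel coordinate descent for one LLA step, problem (10):
    minimize (1/2n) ||y - X beta||^2 + sum_j weights[j] * |beta_j|."""
    n, d = X.shape
    beta = beta0.astype(float).copy()
    r = y - X @ beta                           # full residual, kept up to date
    for _ in range(k_max):
        beta_old = beta.copy()
        for j in range(d):
            z_j = X[:, j] @ r / n + beta[j]    # partial-residual coordinate
            b_j = soft(z_j, weights[j])
            r -= (b_j - beta[j]) * X[:, j]     # residual update in place
            beta[j] = b_j
        if np.linalg.norm(beta - beta_old) < delta:
            break
    return beta
```

In the full LLA-CD loop, `weights[j]` is refreshed as p′_{λ,τ}(|β_j^k|) between successive calls, and the whole procedure is wrapped in the continuation over λ.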
Algorithm 3 LLA-CD
Input: X ∈ R^{n×d}, y ∈ R^n, β^0 ∈ R^d, τ, λ, δ (tolerance) and k_max (the maximum number of iterations).
1: for k = 0, 1, 2, ... do
2:   while k < k_max do
3:     for j = 1, 2, ..., d do
4:       Calculate z_j = n⁻¹ x_jᵀ r_{−j} = n⁻¹ x_jᵀ r + β_j^k, where r = y − Xβ^k is the current residual, r_{−j} = y − X_{−j} β_{−j}^k is the partial residual of x_j, and the subscript −j refers to the portion that remains after the jth column (or element) is removed.
5:       Update β_j^{k+1} ← S(z_j, ω_j^{k+1}), where ω_j^{k+1} = p′_{λ,τ}(β_j^k) and S(t, λ) = sgn(t)(|t| − λ)_+ is the soft-thresholding operator.
6:       Update r ← r − (β_j^{k+1} − β_j^k) x_j.
7:     end for
8:     if ‖β^{k+1} − β^k‖ < δ then
9:       break with β̂ = β^{k+1}.
10:    else
11:      Update k ← k + 1.
12:    end if
13:  end while
14: end for
Output: β̂, the estimate of β in equation (10).

We use the sandwich formula (6) to estimate the covariance matrix of the LLA-CD estimate β̂, replacing Σ^ε(β) in (7) with Σ(β) = diag{ p′_{λ,τ}(|β_1|)/|β_1|, ..., p′_{λ,τ}(|β_d|)/|β_d| }. Since d is larger than n, the MBIC in (8) breaks down for the tuning of λ. Thus we adopt the high-dimensional BIC (HBIC) proposed by Wang et al. [28] to select the optimal tuning parameter λ̂ during the continuation process, which reads

λ̂ = argmin_{λ∈Λ} { HBIC(λ) = log( ‖y − Xβ̂(λ)‖² / n ) + (C_n log(d) / n) |M(λ)| },   (11)

where Λ is a subset of (0, +∞), M(λ) = {j : β̂_j(λ) ≠ 0}, |M(λ)| denotes the cardinality of M(λ), and C_n = log(log n).

5.1 Simulation studies in high dimensions

We illustrate the finite-sample properties of GSELO-PLS-HBIC in high dimensions with simulation studies. The implementation setting is the same as in Section 4.1, except for the two sample sizes n = 100 and n = 200 with d = ⌊n log(n)/2⌋. The results for variable selection and parameter estimation are reported in Table 5 and Table 6, respectively. It can be seen from the tables that the five GSELO penalties still perform reasonably well in terms of both variable selection and parameter estimation when d is larger than n.
Since the sparsity level of β* is fixed in our simulations (i.e., s = 3), better performance is associated with larger sample sizes.
Table 5: Simulation results for variable selection. d = ⌊n log(n)/2⌋. Columns: MS, CM, AE, RE, ME; methods: LIN, SELO, EXP, SIN, ATN; settings: (n, d) = (100, 230) and (200, 529).

Table 6: Simulation results for parameter estimation. d = ⌊n log(n)/2⌋. Columns: Mean, ESE, SSD for β*_1 = 3, β*_2 = 1.5 and β*_3 = 2; methods and settings as in Table 5.

5.2 Analysis of microarray data

We analyze the eyedata set, publicly available in the R package flare [15], to illustrate the application of GSELO-PLS-HBIC in high-dimensional settings. This data set contains gene expression data from the microarray experiments of mammalian eye tissue samples of [23]. The response variable y is a numeric vector of length 120 giving the expression level of gene TRIM32, which causes Bardet-Biedl syndrome (BBS). The design matrix X is a 120 × 200 matrix representing the data of 120 rats with 200 gene probes. We want to find the gene probes most related to TRIM32 in sparse high-dimensional regression models. For this data set, we consider ncvreg [2] (10-fold cv.ncvreg with seed 0) as the gold standard for comparison purposes. Table 7 lists the results of GSELO (LIN, SELO, EXP, SIN and ATN) and ncvreg. From Table 7, the six methods identify 5, 3, 6, 4, 3 and 4 probes, respectively, and have 3 probes in common. Notably, although the magnitudes of the estimates for those common probes are not equal, they have the same signs, which suggests similar biological conclusions. In addition, the methods have similar PMSEs, which implies that they give results of comparable accuracy.
Table 7: Analysis of the eyedata set: estimated coefficients (intercept and selected probes) and PMSE of ncvreg, LIN, SELO, EXP, SIN and ATN; zero entries correspond to omitted variables.

6 Concluding remarks

In this paper, we propose the GSELO-PLS procedure for variable selection and parameter estimation in linear models. We generalize SELO to GSELO and thus place the popular SELO penalization method within a more general framework. We establish the asymptotic properties of the proposed GSELO-PLS estimator in a setting where the dimension of the covariates grows with the sample size: the consistency and the oracle property of the proposed estimators are proved under some regularity conditions. When coupled with a continuation strategy and a modified BIC tuning parameter selector, our overall procedure is very efficient and accurate. In addition, when d is larger than n, we use an LLA-CD algorithm and a high-dimensional BIC, combined with a continuation strategy on the regularization parameter, to compute the GSELO solution paths in high dimensions. The results of simulation studies and real data examples demonstrate the effectiveness of the proposed approach. As a natural extension of SELO, the proposed GSELO method automatically inherits the merits of SELO and can be directly used to carry over existing results of the SELO-based literature, i.e., linear models [7], generalized linear models [16], multivariate panel count data [30] and quantile regression [6]. By the connection between SICA and GSELO, it is heuristically attractive to consider using GSELO for variable selection in other settings in the future, such as Cox models [24, 25] and additive hazards models [17]. Moreover, in regression problems, variables can often be thought of as grouped.
Following the group exponential LASSO of [1] for bi-level variable selection, it would be interesting to extend the GSELO results to structured-sparsity penalized models, which we also leave for future research.

Appendix

We follow steps similar to the proofs of Dicker et al. [7]. Hereafter, we write p(β_j) rather than p_{λ,τ}(β_j) for notational simplicity.
Proof of Theorem 1. Let α_n = √(dσ²/n). It suffices to show that, for any given ε > 0, there exists a large constant C such that

P{ inf_{‖u‖=1} Q_n(β* + C α_n u) > Q_n(β*) } ≥ 1 − ε.   (12)

Define D_n(u) = Q_n(β* + C α_n u) − Q_n(β*). We have

D_n(u) ≥ (1/2n) C² α_n² ‖Xu‖² − (1/n) C α_n ϵᵀ X u + Σ_{j∈K(u)} [p(β*_j + C α_n u_j) − p(β*_j)] = I_1 + I_2 + I_3,

where K(u) = {j : p(β*_j + C α_n u_j) − p(β*_j) < 0}. By (C2),

I_1 = (1/2n) C² α_n² ‖Xu‖² ≥ (γ_min(n⁻¹XᵀX)/2) C² α_n² = O_p(C² α_n²),   (13)

|I_2| = (1/n) C α_n |ϵᵀ X u| ≤ C α_n ‖Xᵀϵ‖/n = O_p(C α_n²),   (14)

since E‖Xᵀϵ/n‖² = (σ²/n) tr(n⁻¹XᵀX) ≤ (σ²/n) d γ_max(n⁻¹XᵀX) = O(α_n²). Condition (C4) implies ρ/α_n → ∞. This and the fact that p(·) is concave on [0, ∞) imply that, when n is sufficiently large,

p(β*_j + C α_n u_j) − p(β*_j) ≥ −C α_n |u_j| p′(|β*_j + C α_n u_j|),

where p′(t) denotes the derivative with respect to t. It follows that

I_3 ≥ −Σ_{j∈K(u)} C α_n |u_j| p′(|β*_j + C α_n u_j|)
   = −Σ_{j∈K(u)} C α_n |u_j| (λ/f(1)) f′( |β*_j + C α_n u_j| / (|β*_j + C α_n u_j| + τ) ) · τ / (|β*_j + C α_n u_j| + τ)²
   ≥ −Σ_{j∈K(u)} C α_n |u_j| O(1) λ τ / (ρ + τ)²
   ≥ −C α_n O(1) (λτ/ρ²) √d ‖u‖
   = O(C α_n) (λ/ρ²)(τ√d) = O(C α_n) o(1) O(α_n) = o(C α_n²)   (15)

by (C3)-(C4) and (H1)-(H2). From (13), (14) and (15), if C > 0 is large enough, I_2 and I_3 are dominated by I_1, which is positive. This proves (12).

Proof of Theorem 2 (i). Let β ∈ R^d with ‖β − β*‖ ≤ C α_n, where C is an arbitrary positive constant. For ε_n = C α_n > 0, it suffices to show that, for all j ∈ A^c, with probability tending to one as n → ∞,

∂Q_n(β)/∂β_j > 0 for 0 < β_j < ε_n,   (16)
∂Q_n(β)/∂β_j < 0 for −ε_n < β_j < 0.   (17)

By some algebra,

∂Q_n(β)/∂β_j = −(1/n) x_jᵀ (y − Xβ) + p′(|β_j|) sgn(β_j) = I_1 + I_2.

Note that E(Xᵀϵ/n) = 0 and

E‖Xᵀϵ/n‖² = tr{ E(Xᵀϵ ϵᵀX) / n² } = (σ²/n) tr(XᵀX/n) = O(dσ²/n)

by (C2). It follows that (1/n) ‖Xᵀ(y − Xβ)‖ = O_p(√(dσ²/n)) = O_p(α_n), and hence I_1 = O_p(α_n). On
16 160 Appl. Math. J. Chinese Univ. Vol. 33, No. 2 the other hand, p ( β j )/α n = 1 f(1) f β j ( β j + τ ) λτ/α n ( β j + τ) 2. Since β j Cα n with j A c, and (C3) implies α n /τ and λτ/α n ( β j + τ) 2 λτ/α n λτ = O( (Cα n + τ) 2 αn 3 ), we have p ( β j )/α n together with (H1) and (H2). Thus, the sign of Q n (β) = α n {o p (1) + p ( β j )/α n sgn(β j )} β j is completely determined by the sign of β j when n is large, and they always have same signs. Hence, (16) and (17) follow. Proof of Theorem 2 (ii). By (i) of Theorem 2, {j; ˆβj 0} = A. As the local minima of Q n (β), ˆβ must satisfy Q n (β) β= = 0, β A ˆβ i.e., 1 n XT A (y X ˆβ) + p A ( ˆβ) = 0, which implies ˆβ A βa = (X T AX A ) 1 X T Aϵ (n 1 X T AX A ) 1 p A( ˆβ). It follows that nbn (n 1 X T AX A /σ 2 ) 1/2 ( ˆβ A β A) = B n (σ 2 X T AX A ) 1/2 X T Aϵ nb n (σ 2 X T AX A ) 1/2 p A( ˆβ) = I 1 I 2 and I 2 = n/σ 2 B n (X T AX A /n) 1/2 p A( ˆβ). (18) By (C2), we have B n (X T AX A /n) 1/2 p A( ˆβ) = O p ( p A( ˆβ) ). (19) On the other hand, j A = {j; ˆβ j 0}, ( p ( ˆβ j ) = λ f(1) f ˆβ ) j τ sgn( ˆβ j ) ˆβ j + τ ( ˆβ j + τ), 2 and it follows p ( ˆβ) = O p ( λτ ρ ) according to (H1)-(H2) and (C3)-(C4). It is noteworthy that 2 v d v holds for all v R d, and then we have p A ( ˆβ) p ( ˆβ) = O p ( d λτ ρ ). Then, 2 (18), (19) and (C3)-(C4) imply I 2 = n/σ 2 O p ( d λτ ρ 2 ) = O p( nd/σ 2 λτ ρ 2 ) = O p( λ ρ 2 ) = o p(1). Thus, to complete proof (ii), it suffices to show d I 1 N(0, G), (20) according to the Slutsky s theorem. Denote I 1 = n w i,n, where w i,n = B n (σ 2 X T A X A) 1/2 x i,a ϵ i. Fix δ 0 > 0 and let η i,n = x T i,a (XT A X A) 1/2 Bn T B n (X T A X A) 1/2 x i,a. Then, using similar procedures in [7], we can show n E( w i,n 2 ; w i,n 2 > δ 0 ) 0. i=1 i=1
by (C5) and the fact that $\sum_{i=1}^n \eta_{i,n} = \mathrm{tr}(B_n^T B_n) \to \mathrm{tr}(G) < \infty$. Thus, the Lindeberg condition is satisfied and (20) holds.

References

[1] P Breheny. The group exponential lasso for bi-level variable selection, Biometrics, 2015, 71(3).
[2] P Breheny, J Huang. Coordinate descent algorithms for nonconvex penalized regression, with applications to biological feature selection, Ann Appl Stat, 2011, 5(1).
[3] E J Candes, M B Wakin, S P Boyd. Enhancing sparsity by reweighted l1 minimization, J Fourier Anal Appl, 2008, 14(5).
[4] X Chen. Superlinear convergence of smoothing quasi-Newton methods for nonsmooth equations, J Comput Appl Math, 1997, 80(1).
[5] X Chen. Smoothing methods for nonsmooth, nonconvex minimization, Math Program, 2012, 134(1).
[6] G Ciuperca. Model selection in high-dimensional quantile regression with seamless L0 penalty, Statist Probab Lett, 2015, 107.
[7] L Dicker, B Huang, X Lin. Variable selection and estimation with the seamless-L0 penalty, Statist Sinica, 2013, 23.
[8] J Fan, R Li. Variable selection via nonconcave penalized likelihood and its oracle properties, J Amer Statist Assoc, 2001, 96(456).
[9] J Fan, J Lv. A selective overview of variable selection in high dimensional feature space, Statist Sinica, 2010, 20(1).
[10] J Fan, H Peng. Nonconcave penalized likelihood with a diverging number of parameters, Ann Statist, 2004, 32(3).
[11] J Friedman, T Hastie, R Tibshirani. Regularization paths for generalized linear models via coordinate descent, J Stat Softw, 2010, 33(1).
[12] C Gao, N Wang, Q Yu, Z Zhang. A feasible nonconvex relaxation approach to feature selection, In: Proceedings of the Twenty-Fifth AAAI Conference on Artificial Intelligence.
[13] T Hastie, R Tibshirani, J Friedman. The Elements of Statistical Learning, Springer, Berlin.
[14] Y Jiao, B Jin, X Lu. A primal dual active set with continuation algorithm for the l0-regularized optimization problem, Appl Comput Harmon Anal, 2015, 39.
[15] X Li, T Zhao, X Yuan, H Liu. The flare package for high dimensional linear regression and precision matrix estimation in R, J Mach Learn Res, 2015, 16.
[16] Z Li, S Wang, X Lin. Variable selection and estimation in generalized linear models with the seamless L0 penalty, Canad J Statist, 2012, 40(4).
[17] W Lin, J Lv. High-dimensional sparse additive hazards regression, J Amer Statist Assoc, 2013, 108(501).
[18] J Lv, Y Fan. A unified approach to model selection and sparse recovery using regularized least squares, Ann Statist, 2009, 37(6A).
[19] C F Ma. Optimization Method and Its Matlab Program Design, Science Press, Beijing.
[20] R Mazumder, J Friedman, T Hastie. SparseNet: coordinate descent with nonconvex penalties, J Amer Statist Assoc, 2011, 106(495).
[21] M Nikolova. Local strong homogeneity of a regularized estimator, SIAM J Appl Math, 2000, 61(2).
[22] J Nocedal, S Wright. Numerical Optimization, 2nd ed, Springer, New York.
[23] T Scheetz, K Kim, R Swiderski, A Philp, T Braun, K Knudtson, A Dorrance, G DiBona, J Huang, T Casavant, V Sheffield, E Stone. Regulation of gene expression in the mammalian eye and its relevance to eye disease, Proc Natl Acad Sci USA, 2006, 103(39).
[24] Y Y Shi, Y X Cao, Y L Jiao, Y Y Liu. SICA for Cox's proportional hazards model with a diverging number of parameters, Acta Math Appl Sin Engl Ser, 2014, 30(4).
[25] Y Y Shi, Y L Jiao, L Yan, Y X Cao. A modified BIC tuning parameter selector for SICA-penalized Cox regression models with diverging dimensionality, J Math, 2017, 37(4).
[26] T A Stamey, J N Kabalin, J E McNeal, I M Johnstone, F Freiha, E A Redwine, N Yang. Prostate specific antigen in the diagnosis and treatment of adenocarcinoma of the prostate. II. Radical prostatectomy treated patients, J Urol, 1989, 141(5).
[27] R Tibshirani. Regression shrinkage and selection via the lasso, J R Stat Soc Ser B Stat Methodol, 1996, 58(1).
[28] L Wang, Y Kim, R Li. Calibrating nonconvex penalized regression in ultra-high dimension, Ann Statist, 2013, 41(5).
[29] C H Zhang. Nearly unbiased variable selection under minimax concave penalty, Ann Statist, 2010, 38(2).
[30] H Zhang, J Sun, D Wang. Variable selection and estimation for multivariate panel count data via the seamless-L0 penalty, Canad J Statist, 2013, 41(2).
[31] H Zou. The adaptive lasso and its oracle properties, J Amer Statist Assoc, 2006, 101(476).
[32] H Zou, T Hastie. Regularization and variable selection via the elastic net, J R Stat Soc Ser B Stat Methodol, 2005, 67(2).
[33] H Zou, R Li. One-step sparse estimates in nonconcave penalized likelihood models, Ann Statist, 2008, 36(4).

1 School of Economics and Management, China University of Geosciences, Wuhan, China.
2 School of Statistics and Mathematics, Zhongnan University of Economics and Law, Wuhan, China.
3 Center for Resources and Environmental Economic Research, China University of Geosciences, Wuhan, China.
Email: yulingjiaomath@whu.edu.cn
More informationAdaptive L p (0 <p<1) Regularization: Oracle Property and Applications
Adaptive L p (0
More informationRegularization and Variable Selection via the Elastic Net
p. 1/1 Regularization and Variable Selection via the Elastic Net Hui Zou and Trevor Hastie Journal of Royal Statistical Society, B, 2005 Presenter: Minhua Chen, Nov. 07, 2008 p. 2/1 Agenda Introduction
More informationA Significance Test for the Lasso
A Significance Test for the Lasso Lockhart R, Taylor J, Tibshirani R, and Tibshirani R Ashley Petersen June 6, 2013 1 Motivation Problem: Many clinical covariates which are important to a certain medical
More informationPre-Selection in Cluster Lasso Methods for Correlated Variable Selection in High-Dimensional Linear Models
Pre-Selection in Cluster Lasso Methods for Correlated Variable Selection in High-Dimensional Linear Models Niharika Gauraha and Swapan Parui Indian Statistical Institute Abstract. We consider variable
More information