Hard Thresholded Regression Via Linear Programming


Qiang Sun, Hongtu Zhu, and Joseph G. Ibrahim

Department of Biostatistics, The University of North Carolina at Chapel Hill, Chapel Hill, NC. Q. Sun is a Ph.D. student, H. Zhu is Professor of Biostatistics, and J. G. Ibrahim is Alumni Distinguished Professor of Biostatistics.

Abstract

The aim of this paper is to develop a hard thresholded regression (HTR) framework for simultaneous variable selection and unbiased estimation in high dimensional linear regression. This new framework is motivated by its close connection with $L_0$ regularization and best subset selection under an orthonormal design, while enjoying several key computational and theoretical advantages over many existing penalization methods (e.g., SCAD or Lasso). Computationally, HTR is a fast two-stage estimation procedure consisting of a first step that calculates a coarse initial estimator and a second step that solves a linear programming problem. Theoretically, under some mild conditions, the HTR estimator is shown to enjoy the strong oracle property and a thresholded property even when the number of covariates grows at an exponential rate. We also propose to incorporate a regularized covariance estimator into the estimation procedure in order to achieve a better trade-off between noise accumulation and correlation modeling. Under this framework, HTR includes Sure Independence Screening as a special case. Both simulation and real data results show that HTR outperforms other state-of-the-art methods.

Key words: correlation bias, finite sample bias, hard thresholded regression, linear programming.

1 Introduction

Consider the linear model
$$y_i = x_i^T \beta + \varepsilon_i, \quad i = 1, \ldots, n, \qquad (1)$$
where $y_i$ is a univariate response, $x_i = (x_{i,1}, \ldots, x_{i,p})^T$ is a $p$-dimensional covariate vector, $\beta = (\beta_1, \ldots, \beta_p)^T$ is a $p \times 1$ regression coefficient vector, and $\{\varepsilon_i : i = 1, \ldots, n\}$ are independent and identically distributed (i.i.d.) errors. The theory of linear models is well established for traditional applications, where the dimension $p$ is fixed and the sample size $n$ is much larger than $p$. With the development of many modern technologies, however, in many biological, medical, social, and economic studies $p$ is comparable with, or much larger than, $n$, making valid statistical inference a great challenge.

Let $S = \{j : \beta_j^o \neq 0\}$ and let $p_S$ be the cardinality of $S$, where $\beta^o = (\beta_1^o, \ldots, \beta_p^o)^T$ is the true value of $\beta$. For prediction accuracy and variable selection consistency, it is common to impose a sparsity assumption, that is, $p_S \ll p$. For model (1), many regularization methods for variable selection minimize
$$Q(\beta) = \frac{1}{2n}\|y - X\beta\|_2^2 + \sum_{j=1}^{p} p_\lambda(\beta_j), \qquad (2)$$
where $y = (y_1, \ldots, y_n)^T$, $X$ is an $n \times p$ non-stochastic matrix with $i$th row $x_i^T$, $\|\cdot\|_2$ denotes the $L_2$ norm, and $p_\lambda(\cdot)$ is a penalty function (e.g., SCAD or Lasso) that depends on a tuning parameter $\lambda > 0$. The most well-known best subset selection method is $L_0$ penalized regression, which achieves simultaneous parameter estimation and variable selection (Akaike, 1973; Schwarz, 1978). Subset selection methods coupled with different selection criteria, including the $C_p$ statistic, the Akaike information criterion (AIC), the Bayesian information criterion (BIC), minimum description length (MDL), and the risk inflation criterion (RIC), are special cases of $L_0$ penalized regression, resulting from the assignment of different values to $\lambda$.

However, solving the $L_0$ regularization problem with a fixed $\lambda$ is NP-hard, and computational methods based on an exhaustive search rapidly become impractical as the number of covariates increases (Huo and Ni, 2007; Fan and Peng, 2004; Fan and Li, 2001; Zhang, 2010). To address this computational issue, various convex/nonconvex penalty functions have been used in $Q(\beta)$ and extensively investigated in order to mimic the $L_0$ regularization (Tibshirani, 1996; Fan and Li, 2001; Fan and Peng, 2004; Zhang, 2010; Meinshausen and Bühlmann, 2006; Leng et al., 2004; Zou, 2006).

Instead of developing another penalty function, we develop a new hard thresholded regression (HTR) modeling framework for performing simultaneous variable selection and unbiased estimation in model (1). The key idea of HTR is to minimize
$$H(\beta) = \frac{1}{n}\|W X^T (y - X\beta)\|_1 + \lambda \|\beta\|_1, \qquad (3)$$
where $W$ is a $p \times p$ weight matrix based on some initial estimate of $\beta$, which will be introduced in Section 2. As shown in Sections 2 and 3, HTR simultaneously enjoys two key computational and theoretical properties: (i) since $H(\beta)$ is convex and HTR can be cast as a linear programming problem, minimizing $H(\beta)$ is computationally efficient even in high-dimensional settings; (ii) under some mild conditions, the HTR estimate, which minimizes $H(\beta)$, is an oracle estimator and achieves unbiased estimation within a wide range of $\lambda$. Due to the nice properties (i) and (ii), our HTR estimate is a novel addition to the extensive regularization literature.

Our HTR method in (3) shares some important similarities with existing regularization methods. The penalty function in $H(\beta)$ is the same as that of the popular Lasso (Tibshirani, 1996), obtained when $p_\lambda(\beta_j) = \lambda|\beta_j|$. As shown in Section 2, HTR has a strong connection with the $L_0$ and hard-thresholding regularizations (Akaike, 1973; Schwarz, 1978; Zheng et al., 2013), since all of them reduce to best subset selection under orthogonal designs, that is, $n^{-1}X^T X = I_p$, where $I_p$ is the $p \times p$ identity matrix. A comparison of the regularization paths of the $L_0$ regularization and HTR is shown in Figure 1.

Our HTR differs significantly from existing regularization methods in several major ways. A major advantage of HTR over nonconvex regularizations is its computational efficiency, property (i). Although there is a large literature on nonconvex regularization methods (Wang et al., 2013a; Kim and Kwon, 2012; Zhang and Zhang, 2012; Fan and Lv, 2011; Kim et al., 2008; Wang et al., 2013b), several important questions remain. Specifically, due to the nonconvexity of the penalty function, multiple local minima always exist, and hence it is difficult to identify the oracle estimator even if the oracle estimator is known to exist along the solution path. A major advantage of HTR over convex regularization methods is its theoretical property (ii). Due to the convexity of the penalty function, convex regularization methods, such as the Lasso, suffer from estimation bias and thus can be suboptimal in terms of risk estimation; see Fan and Li (2001) for detailed discussions. Moreover, the shrinkage bias introduced by convex regularization methods poses major challenges for statistical inference, such as constructing confidence intervals or testing hypotheses, in high dimensional settings (Zhang and Zhang, 2012; van de Geer et al., 2013; Chatterjee and Lahiri, 2011). There is also a major conflict between optimal prediction and consistent variable selection in the Lasso (Meinshausen and Bühlmann, 2006; Leng et al., 2004; Zou, 2006).

We make three major contributions in this paper. First, we systematically investigate a fast two-step estimation procedure for HTR: the first step calculates a ridge estimator and the second step solves a linear programming problem.

Fig. 1: Solution paths of the $L_0$ regularized regression and HTR. We consider a simple example in which $y_i = x_i^T\beta + \varepsilon_i$, where $\beta = (3, 2, 1.5, 0, 0, 0)^T$ and the $\varepsilon_i$'s are independently and identically distributed as $N(0, 1)$. We plot the estimates of the regression coefficients $\hat\beta_j$ for this example. Left panel: the $L_0$ penalized regression estimates as a function of $\lambda$; right panel: the hard thresholded regression estimates as a function of $\lambda$.

Second, we provide a comprehensive theoretical investigation of HTR. We show that the HTR estimator has the strong oracle property even when the number of covariates grows at an exponential rate. Third, we propose to incorporate a regularized covariance estimator into the estimation procedure in order to achieve a better trade-off between noise accumulation and correlation modeling.

The rest of the paper is organized as follows. In Section 2, we introduce HTR and its implementation, discuss its connections with other regularization methods, and show that the HTR estimator has the strong oracle property under some mild conditions. In Section 3, we discuss potential extensions to the ultra-high dimensional case. Numerical results from Monte Carlo simulations and a real data example are presented in Section 4. A discussion is presented in Section 5, and proofs are given in Section 6.

2 Methods

2.1 Hard Thresholded Regression (HTR)

Consider $n$ independent observations $\{(y_i, x_i) : i = 1, \ldots, n\}$ from model (1) with true parameter vector $\beta^o$. Without loss of generality, we standardize each column of $X = (\tilde{x}_1, \ldots, \tilde{x}_p)$ so that $\|\tilde{x}_k\|_2^2 = n$ for $k = 1, \ldots, p$. The target of HTR in (3) is to estimate $\beta^o$ from the data. Our HTR algorithm is a two-stage approach; a numerical sketch of the two stages is given after the steps below.

1. Compute an initial estimator of $\beta$, denoted by $\hat\beta_{init}$, with a reasonably small risk error bound. For instance, let $\hat\beta_{ridge} = (X^T X + \lambda_{init} I_p)^{-1} X^T y$ be a ridge estimator of $\beta$, where $I_p$ is the $p \times p$ identity matrix and $\lambda_{init} \geq 0$ is a tuning parameter. When $\lambda_{init} = 0$, $\hat\beta_{ridge}$ reduces to the ordinary least squares estimator of $\beta$. We will use $\hat\beta_{ridge}$ as a candidate for $\hat\beta_{init}$ and examine its risk error bound in Section 2.3.

2. Construct the weight matrix $\widehat{W}$ based on $\hat\beta_{init}$, and then compute the HTR estimator
$$\hat\beta_{HTR} = \mathop{\mathrm{argmin}}_{\beta}\ \frac{1}{n}\|\widehat{W} X^T (y - X\beta)\|_1 + \lambda \|\beta\|_1. \qquad (4)$$
Throughout the paper, we set $\widehat{W} = \mathrm{diag}(\hat{w}_j)$ with
$$\hat{w}_j = |\hat\beta_{init,j}|^\gamma \quad \text{for } j = 1, 2, \ldots, p, \qquad (5)$$
where $\gamma$ is a positive constant and $\hat\beta_{init,j}$ is the $j$th component of $\hat\beta_{init}$.
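A minimal numerical sketch of these two stages is given below, assuming a Python/NumPy environment. The function names (`ridge_initial`, `htr_weights`) and default tuning values are ours, not the paper's; this is an illustration of the two steps, not the authors' implementation.

```python
import numpy as np

def ridge_initial(X, y, lam_init=1.0):
    """Stage 1: ridge initial estimator (X'X + lam_init * I_p)^{-1} X'y.
    With lam_init = 0 this reduces to ordinary least squares (when X'X is invertible)."""
    n, p = X.shape
    return np.linalg.solve(X.T @ X + lam_init * np.eye(p), X.T @ y)

def htr_weights(beta_init, gamma=2.0):
    """Stage 2 weights as in (5): w_j = |beta_init_j|^gamma, the diagonal of W-hat.
    gamma > 0 is a user choice (the theory later asks for gamma > 2v/(1 - v))."""
    return np.abs(beta_init) ** gamma
```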

Numerically, the computation of $\hat\beta_{HTR}$ is very straightforward, since the objective function in (4) is convex and can be recast as a linear programming problem. Specifically, we introduce a $p \times 1$ slack vector $\eta = (\eta_1, \ldots, \eta_p)^T$ with $\eta_j \geq |n^{-1}[\widehat{W}X^T(y - X\beta)]_j|$, and write $\beta = \beta^+ - \beta^-$ with $\beta^+ = (\beta_1^+, \ldots, \beta_p^+)^T \geq 0$ and $\beta^- = (\beta_1^-, \ldots, \beta_p^-)^T \geq 0$. Then the minimization in (4) can be rewritten as
$$\min \sum_{j=1}^{p}\{\eta_j + \lambda(\beta_j^+ + \beta_j^-)\} \quad \text{subject to} \quad \eta \geq 0,\ \beta^+ \geq 0,\ \beta^- \geq 0,\ \text{and}\ -\eta \leq \frac{1}{n}\widehat{W}X^T(y - X\beta) \leq \eta,$$
where the optimization variables are $\eta$, $\beta^+$, and $\beta^-$ in $\mathbb{R}^p$. Finally, $\beta$ is recovered as $\beta = \beta^+ - \beta^-$.

There are at least two major motivations for HTR. The first comes from the score equation of the maximum likelihood estimator. Let $\ell_n(\beta)$ and $U_n(\beta)$ be, respectively, the likelihood (or quasi-likelihood) and score functions of $\beta$. The score equation and its weighted version are given by
$$U_n(\beta) = \partial_\beta \ell_n(\beta) = 0 \quad \text{and} \quad \widehat{W}U_n(\beta) = 0, \qquad (6)$$
which are equivalent to $\|U_n(\beta)\|_1 = 0$ and $\|\widehat{W}U_n(\beta)\|_1 = 0$, respectively. For model (1), $U_n(\beta)$ reduces to $X^T(y - X\beta)$, and thus $\hat\beta_{HTR}$ can be regarded as a penalized weighted score estimator with the $L_1$ penalty $\|\beta\|_1$. Moreover, $R(\beta) = (R_1(\beta), R_2(\beta), \ldots, R_p(\beta))^T = U_n(\beta)$ can be regarded as the risk function of $\beta$, and $\widehat{W}$ is the risk calibration weight matrix that imposes additional information learned from the first stage. Therefore, based on (6), it is possible to extend HTR to more general scenarios, such as generalized linear models.

The second motivation comes from the Dantzig selector (Candes and Tao, 2007) and the least absolute gradient selector (LAGS) (Yang, 2012). These two selectors are equivalent to minimizing the objective function
$$\hat\beta = \mathop{\mathrm{argmin}}_{\beta}\ \|n^{-1}X^T(y - X\beta)\|_a + \lambda\|V\beta\|_1, \qquad (7)$$
where $V$ is a $p \times p$ weight matrix. The Dantzig selector and LAGS correspond to $(a, V) = (\infty, I_p)$ and $(a, V) = (\infty, \mathrm{diag}(1/|\hat\beta_{init,1}|, \ldots, 1/|\hat\beta_{init,p}|))$, respectively.
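Returning to the linear-programming recast of (4) above, the following sketch solves it with `scipy.optimize.linprog`. It is an illustration of the formulation under stated assumptions, not the authors' implementation; the function name `htr_lp` is ours.

```python
import numpy as np
from scipy.optimize import linprog

def htr_lp(X, y, W_diag, lam):
    """Solve (4) as a linear program.

    Decision variables are stacked as [eta, beta_plus, beta_minus], each of length p,
    and the constraint |(1/n) W X'(y - X beta)| <= eta is written as two <= blocks.
    W_diag holds the diagonal of the weight matrix W-hat."""
    n, p = X.shape
    A = (W_diag[:, None] * (X.T @ X)) / n          # (1/n) W X'X, rows scaled by w_j
    b = (W_diag * (X.T @ y)) / n                   # (1/n) W X'y
    I = np.eye(p)
    # Objective: sum(eta) + lam * sum(beta_plus + beta_minus)
    c = np.concatenate([np.ones(p), lam * np.ones(p), lam * np.ones(p)])
    A_ub = np.vstack([np.hstack([-I, -A,  A]),     # b - A(beta+ - beta-) <= eta
                      np.hstack([-I,  A, -A])])    # -(b - A(beta+ - beta-)) <= eta
    b_ub = np.concatenate([-b, b])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=(0, None), method="highs")
    sol = res.x
    return sol[p:2 * p] - sol[2 * p:]              # beta = beta_plus - beta_minus

# Example usage, combining the two stages sketched earlier:
# beta_hat = htr_lp(X, y, htr_weights(ridge_initial(X, y, lam_init=1.0)), lam=0.5)
```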

As pointed out by Candes and Tao (2007), one would want to constrain the size of the correlated residual vector $X^T(y - X\beta)$ rather than the size of the residual vector $y - X\beta$, since such an estimation procedure is invariant under orthogonal transformations of $X$. Moreover, since the correlated residual vector measures the correlation between the predictors and the response, one would obviously want to include the explanatory variables that are highly correlated with the response $y$ in the model. A major drawback of the Dantzig selector is shrinkage bias, leading to suboptimal risk estimation, even though a double Dantzig selector can reduce the bias (James and Radchenko, 2009). Moreover, to address the same bias issue, similar to the adaptive Lasso (Zou, 2006), LAGS uses adaptive weights calculated from $\hat\beta_{init}$ to directly penalize different regression coefficients. An advantage of HTR is that it directly downweights the risk functions $R_j(\beta)$ associated with insignificant $\beta_j$'s in both estimation and variable selection. When $s \ll \min(p, n)$ and $p$ is comparable with $n$, we expect HTR to outperform LAGS in terms of bias and mean squared error; see Section 4 for details.

2.2 Orthonormal Design Case

We examine the orthonormal design case in order to delineate some connections between HTR and other existing regularization methods. In this case, we have $X^T X = nI_p$ and $\hat\beta^{ols} = (\hat\beta_1^{ols}, \ldots, \hat\beta_p^{ols})^T = n^{-1}X^T y$. Best subset selection of size $k$ reduces to choosing the $k$ largest coefficients in absolute value and setting the rest to 0. Specifically, for some value of $\lambda$, this is equivalent to
$$\hat\beta_j = \hat\beta_j^{ols}\, 1\{|\hat\beta_j^{ols}| > \lambda\} \quad \text{for } j = 1, \ldots, p, \qquad (8)$$
which has a strong connection with hard shrinkage. For the Lasso (Tibshirani, 1996), the solutions have the form
$$\hat\beta_{lasso,j} = \mathrm{sign}(\hat\beta_j^{ols})(|\hat\beta_j^{ols}| - \lambda)_+ \quad \text{for } j = 1, \ldots, p, \qquad (9)$$
which has a strong connection with the soft shrinkage proposals of Donoho and Johnstone (1994) and Donoho et al. (1995). However, (9) introduces a major shrinkage bias.

Many convex/nonconvex penalty functions in (2) have been proposed to reduce the effect of the shrinkage bias of the Lasso for statistical inference (Candes and Tao, 2007; Fan and Li, 2001; Zou, 2006; Zhang, 2010). For instance, with the hard-thresholding penalty $p_\lambda(t) = 0.5[\lambda^2 - (\lambda - |t|)_+^2]\,1(t \neq 0)$, we obtain the hard thresholding estimator in (8). In the case of an orthonormal design, the hard thresholding penalty is also equivalent to the $L_0$ penalty $p_\lambda(t) = 0.5\lambda^2\, 1(t \neq 0)$. However, for nonorthonormal designs, although nonconvex regularization can be beneficial for selecting important covariates in model (1), additional computational and theoretical questions arise due to the nonconvexity of the penalty function.

Both HTR and LAGS try to mimic best subset selection while avoiding various issues associated with the convex/nonconvex penalty functions used in $Q(\beta)$. Specifically, we keep the $L_1$ penalty $p_\lambda(t) = \lambda|t|$, whereas we replace the loss function in $Q(\beta)$ by the score equation (or risk function) of $\beta$. In the case of an orthonormal design, HTR reduces to
$$\mathop{\mathrm{argmin}}_{\beta}\ \sum_{j=1}^{p}\hat{w}_j|\beta_j - \hat\beta_j^{ols}| + \lambda\sum_{j=1}^{p}|\beta_j|, \qquad (10)$$
whose solutions are given by
$$\hat\beta_{HTR,j} = \hat\beta_j^{ols}\, 1(\lambda \leq \hat{w}_j) \quad \text{for } j = 1, \ldots, p. \qquad (11)$$
By taking the ridge estimator in the first stage, we obtain $\hat{w}_j = |\hat\beta_j^{ols}|/(1 + \lambda_{init})$, and thus $\hat\beta_{HTR}$ reduces to the hard thresholding estimator in (8) for some value of $\lambda$. We can also use the biased Lasso estimate $\hat\beta_{lasso,j}$ to construct $\hat{w}_j$ in the first stage and then calculate an unbiased estimator $\hat\beta_{HTR}$ by calibrating the bias in $\hat\beta_{lasso,j}$. Thus, for HTR, we only need a coarse initial estimator in the first stage, which helps us identify the active set $S$ of the true $\beta$.
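For intuition, the three orthonormal-design rules (8), (9), and (11) can be written in a few lines; this is a sketch for illustration, and the function names are ours.

```python
import numpy as np

def hard_threshold(beta_ols, lam):
    """Best subset / hard-thresholding rule (8): keep beta_ols_j when |beta_ols_j| > lam."""
    return beta_ols * (np.abs(beta_ols) > lam)

def soft_threshold(beta_ols, lam):
    """Lasso rule (9) under an orthonormal design: shrink every surviving coefficient by lam."""
    return np.sign(beta_ols) * np.maximum(np.abs(beta_ols) - lam, 0.0)

def htr_orthonormal(beta_ols, w, lam):
    """HTR rule (11): keep beta_ols_j unshrunken whenever lam <= w_j, otherwise set it to zero."""
    return beta_ols * (lam <= w)
```

Note that the HTR rule either keeps the OLS coefficient untouched or sets it to zero, so, unlike the soft-thresholding rule, it introduces no shrinkage bias on the retained coefficients.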

We note that HTR is different from the hard-thresholding procedure. Given $\hat\beta_{init}$ and $\lambda_n > 0$, the hard thresholding (HT) estimator $\hat\beta_{HT}$ is defined componentwise as
$$\hat\beta_{HT,j} = \begin{cases}\hat\beta_{init,j}, & \text{if } |\hat\beta_{init,j}| \geq \lambda_n,\\ 0, & \text{if } |\hat\beta_{init,j}| < \lambda_n.\end{cases} \qquad (12)$$
The hard-thresholding rule aims to remove false positives at the second stage, while largely preserving the estimator calculated in the first stage. In contrast, our HTR always re-estimates $\beta$ in order to calibrate the estimation bias introduced in the first stage. Therefore, a coarse initial estimator of $\beta$ is sufficient in the first stage of HTR.

2.3 Theoretical Results

We formally establish the strong oracle property of $\hat\beta_{HTR}$ when the number of parameters is large and grows with the sample size $n$. We start with the following regularity conditions, which are needed to facilitate the technical details, although they may not be the weakest possible conditions.

Regularity Conditions (RCs)

(RC1) $0 < b \leq \lambda_{\min}(n^{-1}X^T X) \leq \lambda_{\max}(n^{-1}X^T X) \leq B < \infty$.

(RC2) $\lim_{n\to\infty}\log(p)/\log(n) \leq v$ for some $0 \leq v < 1$.

(RC3) $\lambda n^{-1/2} \to 0$ and $\lambda n^{0.5(\gamma - v(\gamma+1))} \to \infty$.

(RC4) The initial estimator $\hat\beta_{init}$ satisfies $E[\|\hat\beta_{init} - \beta\|_2^2 \mid X] = O(pn^{-1})$.

Remarks. Condition (RC1) assumes that the predictor matrix is reasonably well behaved; it is also considered in Fan and Peng (2004). Condition (RC2) specifies that the growth rate of $p$ is at most polynomial, that is, $p = O(n^v)$ with $v < 1$. It is worth pointing out that Condition (RC2) is weaker than that used in Fan and Peng (2004), where $p$ is assumed to satisfy $p^3 = o(n)$. Condition (RC3) specifies the relationship between $\lambda$ and $n$. To construct the risk calibration weight matrix $\widehat{W}$, we take a fixed $\gamma$ such that $\gamma > 2v/(1 - v)$. Condition (RC4) requires that the initial estimator used in the first stage behave reasonably well in terms of its risk error bound. Such an error bound is generally available for many standard estimators of $\beta$. As an illustration, the following proposition shows that the ridge estimator used in the first stage satisfies (RC4).

Proposition 2.1 (Risk Error for Ridge Estimates) Under (RC1), $\hat\beta_{ridge}$ satisfies
$$E[\|\hat\beta_{ridge} - \beta\|_2^2 \mid X] \leq 2\,\frac{\lambda_{init}^2\|\beta\|_2^2 + \sigma^2 npB}{n^2 b^2}. \qquad (13)$$
Furthermore, if $\lambda_{init}^2\|\beta\|_2^2 = O(np)$, then we have
$$E[\|\hat\beta_{ridge} - \beta\|_2^2 \mid X] = O\Big(\frac{p}{n}\Big). \qquad (14)$$

We next study the strong oracle property of $\hat\beta_{HTR}$. Before stating the main theorem, we introduce the oracle estimator, denoted by $\hat\beta^*$, as
$$\hat\beta^* = \mathop{\mathrm{argmin}}_{\beta:\,\beta_j = 0,\ j\notin S}\ \frac{1}{2n}\sum_{i=1}^{n}(y_i - x_i^T\beta)^2 = \begin{pmatrix}(X_1^T X_1)^{-1}X_1^T y\\ 0\end{pmatrix}, \qquad (15)$$
where, without loss of generality, it is assumed that the first $s$ regression coefficients are nonzero and the remaining $p - s$ regression coefficients are zero, and $X_1$ is the corresponding design matrix for the first $s$ regression coefficients. Theorem 2.2 below provides the strong oracle property of $\hat\beta_{HTR}$.

Theorem 2.2 (Strong Oracle Property of $\hat\beta_{HTR}$) Assume that conditions (RC1)-(RC4) hold. Then, as $n\to\infty$, we have
$$\Pr(\hat\beta_{HTR} = \hat\beta^*) \to 1. \qquad (16)$$
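As an illustrative (not definitive) numerical check of Theorem 2.2, one can compare the HTR solution with the oracle least squares fit on simulated data, reusing the hypothetical `ridge_initial`, `htr_weights`, and `htr_lp` sketches above; the specific choices of $n$, $p$, $\lambda_{init}$, $\gamma$, and $\lambda$ below are ours.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 10
beta_true = np.array([3.0, 2.0, 1.5] + [0.0] * (p - 3))
X = rng.standard_normal((n, p))
X *= np.sqrt(n) / np.linalg.norm(X, axis=0)        # standardize columns: ||x_k||_2^2 = n
y = X @ beta_true + rng.standard_normal(n)

beta_init = ridge_initial(X, y, lam_init=1.0)      # stage 1
beta_htr = htr_lp(X, y, htr_weights(beta_init, gamma=2.0), lam=0.5)   # stage 2

S = np.flatnonzero(beta_true != 0.0)               # true active set
beta_oracle = np.zeros(p)
beta_oracle[S] = np.linalg.lstsq(X[:, S], y, rcond=None)[0]
# For lambda in a suitable range the two estimates typically coincide up to LP tolerance.
print(np.max(np.abs(beta_htr - beta_oracle)))
```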

Combining Proposition 2.1 and Theorem 2.2 yields the strong oracle property of $\hat\beta_{HTR}$ when we set $\hat\beta_{init} = (X^T X + \lambda_{init}I_p)^{-1}X^T y$ in the first stage. Our result gives the strong oracle property under very mild conditions, assuming only $\lambda n^{-1/2} \to 0$ and $\lambda n^{0.5(\gamma - v(\gamma+1))} \to \infty$ in (RC3). It is instructive to compare this result with the adaptive Lasso in the fixed-dimension setting. The adaptive Lasso achieves the oracle property by requiring $\lambda n^{1/2} \to 0$, or equivalently, that the bias term $\lambda$ go to 0 at a faster rate than $n^{-1/2}$. In HTR, by contrast, the bias term $\lambda$ can diverge to $\infty$, provided it does so no faster than $n^{1/2}$. This further highlights an advantage of the HTR estimator: the thresholding level $\lambda$ is used only to shut down the noise, without introducing any bias into the final estimator.

3 HTR under the Ultra-High Dimensional Setting

3.1 Ultra-High Dimensional HTR

We now discuss how to extend HTR to the ultra-high dimensional setting with $p \gg n$; for instance, it is common to assume that $p$ may grow at an exponential rate in $n$. In this case, the standard HTR in (4) may fail, because condition (RC1) no longer holds: $\lambda_{\min}(n^{-1}X^T X) = 0$ for $p > n$. Thus, we need to replace $n^{-1}X^T X$ in (4) by a positive definite estimator of the covariance matrix of the predictors $x$, denoted by $\widetilde\Sigma_X$. The use of a positive definite $\widetilde\Sigma_X$ in place of $n^{-1}X^T X$ is also very common in the regularization literature. For instance, in Zou and Hastie (2005), the elastic net estimator for model (1) is defined as
$$\mathop{\mathrm{argmin}}_{\beta}\ \{\beta^T\widetilde\Sigma_X\beta - 2n^{-1}y^T X\beta + \lambda\|\beta\|_1\}, \qquad (17)$$
in which $\widetilde\Sigma_X = n^{-1}(X^T X + \lambda_2 I_p)/(1 + \lambda_2)$ for some $\lambda_2 > 0$. Our new ultra-high dimensional HTR algorithm for $p \gg n$ is also a two-stage approach, as follows.

1. Compute $\hat\beta_{init}$ satisfying the estimation error bound
$$\|\hat\beta_{init} - \beta\|_2 \leq C_0\sqrt{\frac{p_S\log(p)}{n}} \qquad (18)$$
on a set $J_0$ of large probability, that is, $\Pr(J_0) = 1 - \delta_{n,p,p_S} \to 1$, or $\delta_{n,p,p_S} = o(1)$. Specifically, we use the Lasso estimator of $\beta$, denoted by $\hat\beta_{lasso}$, as a candidate for $\hat\beta_{init}$, since Zhang and Huang (2008) showed that (18) holds for $\hat\beta_{lasso}$ under the sparse Riesz condition. We may use other regularized estimators of $\beta$, such as the Dantzig estimator, since error bounds of the form (18) are widely available for them in the ultra-high dimensional framework.

2. Construct $\widehat{W}$ and estimate $\hat\beta_{HTR}$ according to
$$\hat\beta_{HTR} = \mathop{\mathrm{argmin}}_{\beta}\ \frac{1}{2}\|\widehat{W}(n^{-1}X^T y - \widetilde\Sigma_X\beta)\|_1 + \lambda\|\beta\|_1. \qquad (19)$$

We show below that our ultra-high dimensional HTR is a general framework for carrying out screening, variable selection, and estimation. We first establish a connection between ultra-high dimensional HTR and Sure Independence Screening (SIS) when $p$ is much larger than $n$. With a large dimensionality $p$, computational cost and estimation accuracy are major difficulties for any statistical method. To overcome such difficulties, Fan and Lv (2008) introduced the SIS methodology to reduce the dimensionality from an ultra-high $p$ to a relatively large scale $d_n$ with $d_n < n$. Specifically, let $\omega = n^{-1}X^T y = (\omega_1, \ldots, \omega_p)^T$ be the $p \times 1$ vector of marginal correlations of the predictors with the response variable. The standard SIS method selects features according to their marginal correlations with the response variable contained in $\omega$, filtering out those with weak marginal correlations. This SIS procedure is equivalent to a special case of HTR obtained by taking $\widehat{W} = \mathrm{diag}(|\omega_1|, \ldots, |\omega_p|)$ and $\widetilde\Sigma_X = n^{-1}\mathrm{diag}\{X^T X\} = I_p$ in (19). Then (19) reduces to
$$\hat\beta_{HTR,j} = \omega_j\,1(|\omega_j| \geq \lambda), \quad \text{where } \hat\beta_{HTR} = \mathop{\mathrm{argmin}}_{\beta}\ \sum_{j=1}^{p}\big[\,|\omega_j|\,|\omega_j - \beta_j| + \lambda|\beta_j|\,\big]. \qquad (20)$$
Without loss of generality, assume that $|\omega_1| > |\omega_2| > \cdots > |\omega_p|$. For any given $q \in (0, 1)$, we can select the covariates corresponding to the $[qn]$ largest $|\omega_j|$'s by taking $\lambda = |\omega_{[qn]}|$ in (20), where $[qn]$ denotes the integer part of $qn$. Furthermore, we may combine the ordering of $\{|\omega_j|\}_j$ learned from SIS with HTR (SIS+HTR) to recalculate $\hat\beta_{HTR}$.

Second, we show that our HTR procedure allows us to extend SIS to more complex settings in which the predictors may be highly correlated. Incorporating the correlation structure among the predictors is critical for better variable selection and estimation in model (1). An important strategy is to balance noise accumulation against correlation modeling. Without loss of generality, we assume that the true covariance matrix of $x$, denoted by $\Sigma_x$, has a geometric decay structure, and therefore we can use its regularized bandable covariance estimator, denoted by $\widetilde\Sigma_X$, to approximate $\Sigma_x$ (Bickel and Levina, 2008). Extensions to other covariance structures can be handled by using other regularized estimators from the literature (Cai et al., 2010; Lam and Fan, 2009; Rothman et al., 2009; Fan et al., 2013). Specifically, we set $\widetilde\omega = \widetilde\Sigma_X^{-1} n^{-1}X^T y$ and $\widehat{W} = \mathrm{diag}(|\widetilde\omega|)$. In this case, (19) reduces to
$$\hat\beta_{HTR} = \mathop{\mathrm{argmin}}_{\beta}\ \{\|\mathrm{diag}(|\widetilde\omega|)(n^{-1}X^T y - \widetilde\Sigma_X\beta)\|_1 + \lambda\|\beta\|_1\}. \qquad (21)$$
Since we explicitly account for the joint information of all the covariates by regularizing their covariance matrix estimate through a de-correlation procedure, instead of relying on the independence rule, we call (21) a Sure Correlation Screening (SCS) procedure; it avoids the faithfulness assumption used in Fan and Lv (2008). A small numerical sketch of the SIS special case (20) and of the banded covariance estimator used by SCS is given at the end of this subsection.
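The following sketch illustrates the SIS special case (20) and a banded covariance estimator in the spirit of Bickel and Levina (2008) that SCS can plug into (21). The function names and defaults are ours, and the commented lines only indicate how the pieces fit together under the assumption that the banded estimator is invertible.

```python
import numpy as np

def sis_screen(X, y, q=0.5):
    """SIS as in (20): rank features by the marginal statistics omega = X'y / n
    and keep the indices of the [q * n] largest |omega_j|."""
    n = X.shape[0]
    omega = X.T @ y / n
    d = int(q * n)
    keep = np.argsort(-np.abs(omega))[:d]
    return np.sort(keep), omega

def banded_covariance(X, k):
    """Banded covariance sketch: zero out sample-covariance entries more than
    k positions away from the diagonal (a simplified Bickel-Levina-type estimator)."""
    p = X.shape[1]
    S = np.cov(X, rowvar=False)
    band = np.abs(np.subtract.outer(np.arange(p), np.arange(p))) <= k
    return S * band

# SCS-style ingredients for (21) (assuming Sigma_tilde is positive definite;
# in practice a small ridge term may be added before solving):
# Sigma_tilde = banded_covariance(X, k=5)
# omega_tilde = np.linalg.solve(Sigma_tilde, X.T @ y / X.shape[0])
# W_diag = np.abs(omega_tilde)
```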

3.2 Theoretical Results

We formally investigate the strong oracle property of $\hat\beta_{HTR}$ in the ultra-high dimensional scenario. We start with the following regularity condition on $\Sigma_x$. Throughout the paper, it is assumed that $\Sigma_x$ belongs to the well-behaved covariance class $U(\varepsilon_0, \alpha, C_1)$ defined by
$$U(\varepsilon_0, \alpha, C_1) = \Big\{\Sigma = (\sigma_{jj'})_{p\times p}:\ \max_{j}\sum_{j'}\{|\sigma_{jj'}|: |j - j'| > k\} \leq C_1 k^{-\alpha}\ \text{for all } k > 0,\ \Sigma = \Sigma^T,\ 0 < \varepsilon_0 \leq \lambda_{\min}(\Sigma) \leq \lambda_{\max}(\Sigma) \leq 1/\varepsilon_0\Big\},$$
where $\varepsilon_0$, $C_1$, and $\alpha$ are positive scalars. The condition $\Sigma_x \in U(\varepsilon_0, \alpha, C_1)$ basically requires that $\Sigma_x$ be bandable. This condition on $\Sigma_x$ can be relaxed by employing different covariance estimators (Bickel and Levina, 2008; Cai et al., 2010; Lam and Fan, 2009; Rothman et al., 2009; Fan et al., 2013).

We also introduce the $L_\infty$ Correlation Condition (LCC) for model identifiability. For a given set $S$ with cardinality $p_S$ and its complement $S^C = \{1, \ldots, p\}\setminus S$ with cardinality $p_{S^C} = p - p_S$, we consider the partition of the $p \times p$ matrix $\Sigma_x$ according to $(S, S^C)$:
$$\Sigma_x = \begin{pmatrix}\Sigma_{x,SS} & \Sigma_{x,SS^C}\\ \Sigma_{x,S^C S} & \Sigma_{x,S^C S^C}\end{pmatrix},$$
where $\Sigma_{x,S_1 S_2}$ is the $p_{S_1}\times p_{S_2}$ submatrix corresponding to indices in $S_1$ and $S_2$, with $S_1$ and $S_2$ equal to either $S$ or $S^C$. We say that $(\Sigma_x, S)$ satisfies the $L_\infty$ correlation condition if there exists a $u_0(n, p, p_S) > 0$ such that
$$\min_{\|\tau_S\|_\infty = 1,\ \|\tau_{S^C}\|_\infty = 1}\ \|\Sigma_{x,SS}\tau_S + \Sigma_{x,SS^C}\tau_{S^C}\|_\infty > u_0(n, p, p_S), \qquad (22)$$
where $\tau_S$ and $\tau_{S^C}$ are $p_S\times 1$ and $(p - p_S)\times 1$ vectors, respectively.

The $L_\infty$ correlation condition rules out strong collinearity, in the same spirit as Condition 4 of Fan and Lv (2008). The sample version of the LCC closely resembles the irrepresentable condition first proposed by Zhao and Yu (2006).

The irrepresentable condition is equivalent to putting a regularization constraint on the regression coefficients of the irrelevant covariates $X_{S^C}$ on the relevant covariates $X_S$:
$$\|\Sigma_{x,SS}^{-1}\Sigma_{x,SS^C}\|_1 \leq 1 - u_0(n, p, p_S) \quad \text{for some constant } u_0(n, p, p_S) > 0.$$
Similar to the irrepresentable condition, if we put the constraint in the $L_\infty$ norm rather than the $L_1$ norm, i.e., $\|\Sigma_{x,SS}^{-1}\Sigma_{x,SS^C}\|_\infty \leq 1 - u_0(n, p, p_S)$, and hold $s$ fixed, this implies the LCC, by observing that
$$\min_{\tau\in\Omega_0}\|\Sigma_{x,SS}\tau_S + \Sigma_{x,SS^C}\tau_{S^C}\|_\infty = \min_{\tau\in\Omega_0}\|\Sigma_{x,SS}(\tau_S + \Sigma_{x,SS}^{-1}\Sigma_{x,SS^C}\tau_{S^C})\|_\infty \geq \frac{\lambda_{\min}(\Sigma_{x,SS})}{\sqrt{p_S}}\min_{\tau\in\Omega_0}\|\tau_S + \Sigma_{x,SS}^{-1}\Sigma_{x,SS^C}\tau_{S^C}\|_\infty \geq \frac{\varepsilon_0}{\sqrt{p_S}}\,u_0(n, p, p_S), \qquad (23)$$
where $\Omega_0 = \{\|\tau_S\|_\infty = 1,\ \|\tau_{S^C}\|_\infty = 1\}$. In general, we allow $u_0(n, p, p_S)$ to decay to 0.

We examine an example of $\Sigma_x$ for which the LCC holds but the irrepresentable condition does not. Specifically, consider $\Sigma_x^0$ with $\Sigma_{x,SS}^0 = I_{p_S}$, $\Sigma_{x,S^C S^C}^0 = I_{p - p_S}$, and $\Sigma_{x,SS^C}^0 = (\Sigma_{x,S^C S}^0)^T = [p_S^{-1}1_{p_S}, 0, \ldots, 0]$, where $1_{p_S}$ is the $p_S\times 1$ vector of ones. Therefore, the LCC allows us to go beyond the irrepresentable condition for consistent variable selection.

Proposition 3.1 For $S = \{1, \ldots, p_S\}$ and $\Sigma_x^0$ defined as above, $(\Sigma_x^0, S)$ satisfies the LCC but not the irrepresentable condition.

We define the oracle estimator of $\beta$ in the ultra-high dimensional setting as
$$\tilde\beta = \big(\{(\widetilde\Sigma_{X,SS})^{-1}n^{-1}X_S^T y\}^T,\ 0^T\big)^T, \qquad (24)$$
where $\widetilde\Sigma_{X,SS}$ denotes the submatrix of $\widetilde\Sigma_X$ corresponding to the indices in the true active set $S$. Note that the difference between $\tilde\beta$ and the oracle least squares estimator $\hat\beta^*$ is very small, since $\|\widetilde\Sigma_{X,SS} - \Sigma_{x,SS}\|_2 \leq \|\widetilde\Sigma_X - \Sigma_x\|_2 = O_p(\{n^{-1}\log(p)\}^{\alpha/(2(\alpha+1))})$ for $\Sigma_x \in U(\varepsilon_0, \alpha, C_1)$ (Bickel and Levina, 2008). Moreover, if the ordinary least squares estimator is desirable, especially when $s/n$ is moderate, we can first identify an initial active set, denoted by $S_n$, and then calculate $\hat\beta_{ref} = (X_{S_n}^T X_{S_n})^{-1}X_{S_n}^T y$.

Before presenting the main results, we let $\Sigma_{k_n} = B_{k_n}(\Sigma) = (\sigma_{ij}\,1(|i - j| \leq k_n))$ and define $\eta = \min_{j\in S}|\beta_j^o|$.

Theorem 3.2 (Strong Oracle Property of $\hat\beta_{HTR}$ for $p \gg n$, with thresholded property) Suppose that $\Sigma_x \in U(\varepsilon_0, \alpha, C_1)$, that (18) holds with a positive scalar $C_0$, and that $(B_{k_n}(\Sigma_x), S)$ satisfies the LCC. Suppose the tuning parameter $\lambda$ satisfies $m_n < \lambda < M_n$ for $k_n \asymp \{n/\log(p)\}^{1/(2(\alpha+1))}$, where
$$m_n \doteq C_0^\gamma(2k_n + 1)\{p_S\log(p)/n\}^{\gamma/2}\max\Big\{\varepsilon_0^{-1},\ \sqrt{\frac{2(\eta + 1)}{\gamma(\varepsilon_0, \delta)}}\Big(\frac{\log(p)}{n}\Big)^{1/2}\Big\}$$
and
$$M_n \doteq [u_0(n, p, p_S) - 2t_0]\big(\eta - C_0\sqrt{p_S\log(p)/n}\big)^\gamma.$$
Moreover, $t_0 = \sqrt{2(\eta + 1)\log(p\vee n)\{\gamma(\varepsilon_0, \delta)\}^{-1}n^{-1}}$ is defined in Lemma 6.1, where $\gamma(\varepsilon_0, \delta)$ is a constant not depending on $n$ and $p$. Then, with probability at least $1 - \delta_{n,p,p_S} - 3p^{-\eta}$, we have
$$S_n = S \quad \text{and} \quad \hat\beta_{HTR} = \tilde\beta. \qquad (25)$$

Theorem 3.2 quantifies the behavior of our HTR estimator in the ultra-high dimensional scenario. Assuming that $\eta > C_0\sqrt{p_S\log(p)/n}$ and that $u_0(n, p, p_S)$ is fixed, we roughly require $\eta^\gamma \gtrsim \lambda \gtrsim (p_S\log(p)/n)^{\gamma/2}(2k_n + 1)$. By comparison, in Wang et al. (2013a), the calibrated CCCP method identifies the oracle estimator when $\eta \gtrsim \lambda \gtrsim \sqrt{p_S\log(p)/n}$. We point out an interesting phenomenon: within the range $(m_n, M_n)$, with $m_n$ and $M_n$ defined in the theorem above, $\hat\beta_{HTR}$ stays at the oracle estimator $\tilde\beta$. This agrees with our intuition that the HTR solution path is piecewise constant. We mention that our result is not directly comparable with the calibrated CCCP method or any other method in the literature, since we only require $M_n > \lambda > m_n$ rather than $M_n \gg \lambda \gg m_n$. Finally, Theorem 3.2 is in line with the important theoretical properties of $L_0$ penalized regression considered in Zheng et al. (2013), which may further validate our HTR method.

4 Numerical Examples

4.1 Simulation Study

Continuous responses were generated according to model (1) with
$$\beta = (3, 2, 0, 0, 1.5, \underbrace{0, \ldots, 0}_{p-5})^T$$
and $n = 100$. Moreover, in model (1), $x_i$ follows the $N(0, \Sigma_x)$ distribution with covariance matrix $\Sigma_x$, and $\varepsilon_i$ is independent of $x_i$ and normally distributed with mean 0 and standard deviation $\sigma = 2$. Writing the correlation structure of $\Sigma_x$ as $(\rho_{ij})$, we consider three different designs:

Case 1: independent design with $(\rho_{ij}) = \mathrm{diag}(1, \ldots, 1)$;

Case 2: weak correlation design with $\rho_{ij} = 0.30^{|i-j|}$;

Case 3: relatively strong correlation design with $\rho_{ij} = 0.95^{|i-j|}$.

We consider both a relatively high dimension $p = 40$ and the ultra-high dimensional case $p = 2000 \gg n$. We investigate the sparsity recovery and estimation properties of the HTR estimator via numerical simulations. We compared the HTR estimator with the following estimators: the oracle estimator, which assumes the availability of the active set $S_0$; the adaptive Lasso estimator proposed by Zou (2006); the smoothly clipped absolute deviation (SCAD) estimator (Fan and Li, 2001); and the minimax concave penalty (MCP) estimator with $a = 3$ (Zhang, 2010). For SCAD, $n^{1/2}$-fold cross-validation was used to select the tuning parameter $\lambda$; for ALasso and HTR, the sequential tuning of Bühlmann and van de Geer (2011) was used.
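A sketch of the data-generating mechanism for the three correlation designs above is given below. This is our code, written under the assumption that $\Sigma_x$ has unit variances so that its entries coincide with $(\rho_{ij})$; the function name is hypothetical.

```python
import numpy as np

def simulate_case(n=100, p=40, rho=0.0, sigma=2.0, seed=0):
    """Generate (X, y) from model (1) with beta = (3, 2, 0, 0, 1.5, 0, ..., 0)^T,
    x_i ~ N(0, Sigma_x) with correlation rho^{|i-j|}, and noise sd sigma = 2.
    rho = 0.0, 0.30, 0.95 correspond to Cases 1-3."""
    rng = np.random.default_rng(seed)
    beta = np.zeros(p)
    beta[[0, 1, 4]] = [3.0, 2.0, 1.5]
    idx = np.arange(p)
    Sigma = rho ** np.abs(np.subtract.outer(idx, idx))
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    y = X @ beta + sigma * rng.standard_normal(n)
    return X, y, beta
```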

The MCP estimator was computed using the R package PLUS with the theoretically optimal tuning parameter value $\lambda = \sigma\sqrt{2\log(p)/n}$. For the case $p = 40$, we also computed regularized estimators based on LAGS. To estimate the bandable covariance matrix $\widetilde\Sigma_X$ used in HTR, the banding parameter was selected by cross-validation as described in Bickel and Levina (2008). To further demonstrate the effect of using the regularized covariance matrix, we also computed HTR estimates based on the sample covariance matrix and on the independence (identity) covariance matrix, denoted by HTRsam and HTRind, respectively.

For each simulation setting, we generated 100 simulated data sets and applied each estimator to every data set. We then calculated several summary statistics for each estimator, reported in Tables 1, 2, 3, and 4. To measure the downward shrinkage bias, we calculated the mean and median of $\hat\beta_i - \beta_i$ for $i = 1, 2, 3$. To measure sparsity recovery, we calculated the mean and median number of zero coefficients incorrectly estimated to be nonzero (false positives, FP) and the mean and median number of nonzero coefficients correctly estimated to be nonzero (true positives, TP). To measure estimation accuracy, we calculated the mean and median squared error (MSE) and the mean and median absolute error (MAE).

It is not surprising that the Lasso always overfits. The other procedures improve on the Lasso by reducing the estimation bias and the false positive rate. The best overall performance is achieved by the HTR estimator, with relatively small shrinkage bias, MSE, MAE, and FP. MCP and SCAD also perform well overall. In the relatively high dimensional ($p = 40$) example, HTR outperforms LAGS in all three cases. When the dimension is 2000, the HTR based on the sample covariance matrix favors false selections in all cases and thus performs worse than the version based on the regularized covariance estimator. When the correlation structure becomes stronger, ignoring it produces too sparse a solution and misses true variables, which verifies our conjecture and demonstrates the effectiveness of using the regularized covariance matrix in the regression procedure.
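For completeness, the selection and estimation summaries reported in Tables 1-4 can be computed per replication as in the following sketch. The function name, the thresholding tolerance, and the reading of the first three table columns as the errors on the nonzero coefficients are our assumptions.

```python
import numpy as np

def summarize_fit(beta_hat, beta_true, tol=1e-8):
    """Per-replication summaries: componentwise errors for the nonzero coefficients
    (one reading of the tables' first three columns), squared and absolute
    estimation errors, and TP/FP selection counts."""
    selected = np.abs(beta_hat) > tol
    truth = beta_true != 0.0
    return {
        "bias_signals": beta_hat[truth] - beta_true[truth],
        "MSE": float(np.sum((beta_hat - beta_true) ** 2)),
        "MAE": float(np.sum(np.abs(beta_hat - beta_true))),
        "TP": int(np.sum(selected & truth)),
        "FP": int(np.sum(selected & ~truth)),
    }
```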

4.2 Bardet-Biedl Syndrome Gene Expression Study

We applied HTR to the Bardet-Biedl syndrome gene expression study in Scheetz et al. (2006). For this data set, F1 animals were intercrossed and 120 twelve-week-old male offspring were selected for tissue harvesting from the eyes and for microarray analysis. The microarrays used to analyze the RNA from the eyes of these animals contain more than 31,042 different probe sets (Affymetrix GeneChip Rat Genome Array). The intensity values were normalized using the RMA (robust multi-chip averaging) method (Bolstad et al., 2003; Irizarry et al., 2003) to obtain summary expression values for each probe set. The outcome of interest is the expression of TRIM32, corresponding to probe at, a gene which has been shown to cause Bardet-Biedl syndrome (Chiang et al., 2006), a genetic disease of multiple organ systems including the retina. Following Scheetz et al. (2006), we focused on the 18,957 probes out of the 31,042 probe sets on the array that exhibited a sufficient signal for reliable analysis and at least 2-fold variation in expression.

The aim of this data analysis is to find the genes whose expression is correlated with that of gene TRIM32. We used model (1) to address this problem and applied different regularization methods in the analysis. We first standardized the probes so that they have mean zero and standard deviation 1. As in Huang et al. (2008), we focused on the 3000 probes with the largest variances among the 18,975 covariates and considered two approaches. The first approach is to regress on the $p = 3000$ probes. The second approach is to regress on the 200 probes among the 3000 with the largest marginal correlation coefficients with TRIM32. We randomly partitioned the data 100 times, each with a training set of size 80 and a test set of 40 observations. The prediction mean squared error was computed within the test set, while the scaled estimators and the Lasso estimator with a fixed penalty level $\lambda$ were computed based on the training set.

Table 1: Mean of simulation results for $p = 40$: $\hat\beta_1 - \beta_1$, $\hat\beta_2 - \beta_2$, $\hat\beta_3 - \beta_3$, MSE, MAE, TP, and FP. For each case, 100 simulated data sets were used. Methods compared within each of Cases 1-3: Oracle, Lasso, ALasso, SCAD, MCP, HTR, and LAGS (numerical entries omitted).

Table 2: Median of simulation results for $p = 40$: $\hat\beta_1 - \beta_1$, $\hat\beta_2 - \beta_2$, $\hat\beta_3 - \beta_3$, MSE, MAE, TP, and FP. For each case, 100 simulated data sets were used. Methods compared within each of Cases 1-3: Oracle, Lasso, ALasso, SCAD, MCP, HTR, and LAGS (numerical entries omitted).

Table 3: Mean of simulation results for $p = 2000$: we report $\hat\beta_1 - \beta_1$, $\hat\beta_2 - \beta_2$, $\hat\beta_3 - \beta_3$, MSE, MAE, TP, and FP. For each case, 100 simulated data sets were used. Methods compared within each of Cases 1-3: Oracle, Lasso, ALasso, SCAD, MCP, HTR, HTRsam, and HTRind (numerical entries omitted).

Table 4: Median of simulation results for $p = 2000$: we report $\hat\beta_1 - \beta_1$, $\hat\beta_2 - \beta_2$, $\hat\beta_3 - \beta_3$, MSE, MAE, TP, and FP. For each case, 100 simulated data sets were used. Methods compared within each of Cases 1-3: Oracle, Lasso, ALasso, SCAD, MCP, HTR, HTRsam, and HTRind (numerical entries omitted).

In addition, we compared the prediction performance of all the estimators mentioned in the simulations. In each replication, we computed all the regularization estimators based on the training set of 80 observations. The penalty level was selected by 5-fold cross-validation over the training data set. Table 5 reports the median of the downward prediction bias (DPBias), defined as $\sum_{i=1}^{n_{test}}(\hat{y}_i - y_i)$ over the test sample, the median of the mean squared prediction error (MSPE), and the average selected model size over the 100 replications for $p = 300$ and $3000$. For MCP, the tuning parameters were selected by cross-validation, since the standard deviation of the random error is unknown. HTR works at least as well as, if not better than, ALasso, SCAD, and MCP, with much sparser models and small prediction errors. It is worth pointing out that the HTR procedure produces the sparsest solution yet with a well controlled prediction error. Moreover, HTR controls the downward prediction bias well. The performance of the MCP procedure is satisfactory, but its optimal performance depends on an additional tuning parameter $a$. In screening or diagnostic applications, it is often important to develop an accurate diagnostic test using as few features as possible in order to control the cost. The same consideration also matters when selecting target genes in gene therapies.

Table 5: Gene expression data analysis. Columns: $p$, Method, MSPE, DPBias, and average selected model size; methods compared: Lasso, ALasso, SCAD, MCP, and HTR for each value of $p$ (numerical entries omitted).

5 Conclusions and Further Discussions

The main contributions of this paper are threefold. First, we have offered a new perspective for achieving unbiased estimation without using nonconvex penalized regression; the problem can be formulated as a linear program and thus is computationally tractable, and the global optimal solution is assured. Second, we have proposed a new framework that incorporates a covariance estimator into the regression procedure for a better trade-off between noise accumulation and correlation modeling, and that facilitates relaxing the conditions for consistent variable selection. Third, we have provided a comprehensive theoretical study of the HTR estimator. We show that the HTR estimator has the strong oracle property and exhibits a very interesting thresholded phenomenon: the HTR estimator is oracle throughout the interval $(m_n, M_n)$.

6 Proofs

We present the proofs of all theoretical results below.

Proof of Proposition 2.1. Note that
$$\hat\beta_{ridge}(\lambda) - \beta = -\lambda(X^T X + \lambda I)^{-1}\beta + (X^T X + \lambda I)^{-1}X^T\varepsilon. \qquad (26)$$
Then it follows from (RC1) that
$$\begin{aligned}
E[\|\hat\beta_{ridge}(\lambda) - \beta\|_2^2 \mid X]
&= E[\|-\lambda(X^T X + \lambda I)^{-1}\beta + (X^T X + \lambda I)^{-1}X^T\varepsilon\|_2^2 \mid X]\\
&\leq 2\lambda^2\{\lambda_{\min}(X^T X) + \lambda\}^{-2}\|\beta\|_2^2 + 2\{\lambda_{\min}(X^T X) + \lambda\}^{-2}E[\varepsilon^T XX^T\varepsilon]\\
&= 2\{\lambda_{\min}(X^T X) + \lambda\}^{-2}\{\lambda^2\|\beta\|_2^2 + \mathrm{Tr}(X^T X)\sigma^2\}\\
&\leq 2\,\frac{\lambda^2\|\beta\|_2^2 + \sigma^2 p\,\lambda_{\max}(X^T X)}{(\lambda_{\min}(X^T X) + \lambda)^2}
\leq 2\,\frac{\lambda^2\|\beta\|_2^2 + \sigma^2 npB}{(nb + \lambda)^2}
\leq 2\,\frac{\lambda^2\|\beta\|_2^2 + \sigma^2 npB}{n^2 b^2},
\end{aligned}$$
which yields the proof of Proposition 2.1.

Proof of Theorem 2.2. The proof of Theorem 2.2 consists of two steps. The first step is to show exact support recovery, that is,
$$\lim_{n\to\infty}P(S_n \subseteq S) = 1, \qquad (27)$$
where $S_n = \{j: \hat\beta_{HTR,j} \neq 0\}$. The second step is to show
$$\lim_{n\to\infty}P(S \subseteq S_n) = 1. \qquad (28)$$

We prove (27) as follows. Write $\widehat\Sigma_X = n^{-1}X^T X$. The Karush-Kuhn-Tucker (KKT) condition of (4) leads to
$$\widehat{W}\widehat\Sigma_X\,\mathrm{sign}(X^T(y - X\hat\beta_{HTR})) = \lambda\,\mathrm{sign}(\hat\beta_{HTR}), \qquad (29)$$
where $\mathrm{sign}(x)$ is the signum function of $x$. Thus, if $\hat\beta_{HTR,j} \neq 0$, then we have $\hat{w}_j[\widehat\Sigma_X\,\mathrm{sign}(X^T(y - X\beta))]_{(j)} = \lambda\,\mathrm{sign}(\hat\beta_{HTR,j})$, where $[a]_{(j)}$ denotes the $j$-th component of any vector $a$. Since $|[\widehat\Sigma_X\,\mathrm{sign}(X^T(y - X\beta))]_{(j)}| \leq \|\widehat\Sigma_X\|_\infty$, we have $\lambda|\mathrm{sign}(\hat\beta_{HTR,j})| \leq \hat{w}_j\|\widehat\Sigma_X\|_\infty$. Therefore, to prove (27), it suffices to show that, as $n\to\infty$,
$$P\Big(\bigcup_{j\in S^C}\{\hat{w}_j\|\widehat\Sigma_X\|_\infty \geq \lambda\}\Big) \to 0. \qquad (30)$$
We now bound the left-hand side (LHS) of (30) as follows:
$$P\Big(\bigcup_{j\in S^C}\{\hat{w}_j\|\widehat\Sigma_X\|_\infty \geq \lambda\}\Big)
\leq \sum_{j\in S^C}P\big(\hat{w}_j\|\widehat\Sigma_X\|_\infty \geq \lambda\big)
\leq \sum_{j\in S^C}P\Big(|\hat\beta_{init,j}|^2 \geq \big(\lambda/\|\widehat\Sigma_X\|_\infty\big)^{2/\gamma}\Big)
\leq \frac{E[\|\hat\beta_{init} - \beta\|_2^2]}{(\lambda/\|\widehat\Sigma_X\|_\infty)^{2/\gamma}}
= O\Big(\Big(\frac{1}{\lambda n^{\gamma/2 - v(\gamma+1)/2}}\Big)^{2/\gamma}\Big), \qquad (31)$$
which converges to zero by (RC3) and hence yields (27).

We prove (28) as follows. Rewrite the KKT condition as
$$\mathrm{sign}(X^T(y - X\hat\beta_{HTR})) = \lambda(\widehat\Sigma_X)^{-1}\widehat{W}^{-1}\mathrm{sign}(\hat\beta_{HTR}). \qquad (32)$$
Therefore, to prove (28), it suffices to show that, as $n\to\infty$,
$$P\Big(\lambda\|(\widehat\Sigma_X)^{-1}\|_\infty\max_{j\in S}\hat{w}_j^{-1} < 1\Big) \to 1. \qquad (33)$$
The LHS of (33) is bounded by
$$\text{LHS of (33)} \geq P\Big(\min_{j\in S}\hat{w}_j > \lambda\|(\widehat\Sigma_X)^{-1}\|_\infty\Big) = P\Big(\min_{j\in S}|\hat\beta_{init,j}| > \big[\lambda\|(\widehat\Sigma_X)^{-1}\|_\infty\big]^{1/\gamma}\Big). \qquad (34)$$

Since $\min_{j\in S}|\hat\beta_{init,j}| \geq \min_{j\in S}|\beta_j| - \|\hat\beta_{init,S} - \beta_S\|_\infty \geq \min_{j\in S}|\beta_j| - \|\hat\beta_{init} - \beta\|_2$, we have
$$\text{RHS of (34)} \geq P\Big(\min_{j\in S}|\beta_j| > \big[\lambda\|(\widehat\Sigma_X)^{-1}\|_\infty\big]^{1/\gamma} + \|\hat\beta_{init} - \beta\|_2\Big). \qquad (35)$$
Further, by assumption (RC3), we have
$$\big[\lambda\|(\widehat\Sigma_X)^{-1}\|_\infty\big]^{1/\gamma} + \|\hat\beta_{init} - \beta\|_2 = O_p\big((B\lambda n^{-1/2})^{1/\gamma}\big) + \sqrt{\frac{p}{n}}\,O_p(1) = o_p(1), \qquad (36)$$
yielding $\lim_{n\to\infty}P(S_n = S) = 1$. Denote the event $\{S_n = S\}$ by $J$. On $J$, we have
$$X^T(y - X_S\hat\beta_S) = \begin{pmatrix}X_S^T(y - X_S\hat\beta_S)\\ X_{S^C}^T(y - X_S\hat\beta_S)\end{pmatrix}. \qquad (37)$$
The KKT conditions yield
$$\hat\beta_{S_n} = \hat\beta_S = (X_S^T X_S)^{-1}X_S^T y = \hat\beta_S^*. \qquad (38)$$
This completes the proof of Theorem 2.2.

Proof of Proposition 3.1. It can be shown that $\|(\Sigma_{x,SS}^0)^{-1}\Sigma_{x,SS^C}^0\|_1 = 1$, and thus the irrepresentable condition fails. On the other hand, we have
$$\min_{\tau\in\Omega_0}\|\Sigma_{x,SS}^0\tau_S + \Sigma_{x,SS^C}^0\tau_{S^C}\|_\infty \geq \min_{\tau\in\Omega_0}\|\tau_S\|_\infty - \|\Sigma_{x,SS^C}^0\tau_{S^C}\|_\infty \geq 1 - p_S^{-1}, \qquad (39)$$
which finishes the proof.

Proof of Theorem 3.2. It suffices to show support recovery, i.e., $S_n = S$, with large probability, since it implies the strong oracle property by the KKT conditions. From here on, we work on the event $J_0$, on which $\|\hat\beta_{init} - \beta\|_2 \leq C_0\sqrt{p_S\log(p)/n}$.

Similar to the proof of Theorem 2.2, we first prove (27) by showing (30) as follows:
$$\begin{aligned}
P\big(\exists\,j\in S^C:\ \hat{w}_j\|\widetilde\Sigma_X\|_\infty > \lambda\big)
&\leq \sum_{j\in S^C}P\Big(|\hat\beta_{init,j}| \geq \big(\lambda/\|\widetilde\Sigma_X\|_\infty\big)^{1/\gamma}\Big)
\leq p\,P\Big(\|\widetilde\Sigma_X\|_\infty\,\|\hat\beta_{init} - \beta\|_2^\gamma \geq \lambda\Big)\\
&\leq p\,P\Big(\|\widetilde\Sigma_X - B_{k_n}(\Sigma_x)\|_{\max} \geq \frac{1}{2k_n + 1}\cdot\frac{\lambda}{(C_0\sqrt{p_S\log p/n})^\gamma} - \|B_{k_n}(\Sigma_x)\|_{\max}\Big)\\
&\quad + p\,P\Big(\|B_{k_n}(\Sigma_x)\|_{\max} \geq \frac{1}{2k_n + 1}\cdot\frac{\lambda}{(C_0\sqrt{p_S\log p/n})^\gamma}\Big)
\ \doteq\ (R1) + (R2). \qquad (40)
\end{aligned}$$
For (R2), since $\lambda > C_0^\gamma\varepsilon_0^{-1}(2k_n + 1)(p_S\log(p)/n)^{\gamma/2}$ and $\|B_{k_n}(\Sigma_x)\|_{\max} \leq \|\Sigma_x\|_{\max} \leq \|\Sigma_x\|_2$, it can be shown that (R2) is bounded from above by
$$p\,P\Big(\|B_{k_n}(\Sigma_x)\|_{\max} \geq \frac{1}{2k_n + 1}\cdot\frac{\lambda}{(C_0\sqrt{p_S\log(p)/n})^\gamma}\Big) \leq P\big(\|\Sigma_x\|_2 \geq \varepsilon_0^{-1}\big), \qquad (41)$$
which converges to zero by assumption. For (R1), it follows from Lemma 6.1 and $\lambda \geq m_n$ that $(R1) \leq 3p^{-\eta}$.

The second step is to show, by contradiction, that the probability of the event $\{S \subseteq S_n\}$ converges to one when $\lambda \leq M_n$. Before pursuing the proof, we introduce some notation. Let $J_1 = \{\|\widetilde\Sigma_X - B_{k_n}(\Sigma_x)\|_{\max} \leq t_0\}$, where $t_0$ is defined in Lemma 6.1, $\tau_S = (\tau_j)_{j\in S}$, $\eta = \min_{j\in S}|\beta_j^o|$, and $\tau = \mathrm{sign}(n^{-1}X^T y - \widetilde\Sigma_X\hat\beta_{HTR})$. Similar to (33), it suffices to show that
$$\|\tau_S\|_\infty < 1. \qquad (42)$$
If (42) did not hold, then we would have $\tau\in\Omega_0$ by the definition of $\Omega_0$. Further, by the KKT conditions, we have $\lambda\hat{w}_j^{-1} = |(\widetilde\Sigma_{X,SS}\tau_S + \widetilde\Sigma_{X,SS^C}\tau_{S^C})_j|$ for all $j$.

This yields
$$\begin{aligned}
\lambda\{\min_{j\in S}|\hat\beta_{init,j}|^\gamma\}^{-1}
&\geq \|\widetilde\Sigma_{X,SS}\tau_S + \widetilde\Sigma_{X,SS^C}\tau_{S^C}\|_\infty\\
&\geq \|\Sigma_{k_n,SS}\tau_S + \Sigma_{k_n,SS^C}\tau_{S^C}\|_\infty - \|\widetilde\Sigma_{X,SS} - \Sigma_{k_n,SS}\|_\infty - \|\widetilde\Sigma_{X,SS^C} - \Sigma_{k_n,SS^C}\|_\infty\\
&\geq u_0(n, p, p_S) - 2\|\widetilde\Sigma_X - \Sigma_{k_n}\|_{\max},
\end{aligned}$$
where the $\Sigma_{k_n,AA'}$ are the partitions of $\Sigma_{k_n} = B_{k_n}(\Sigma_x)$ corresponding to $A, A' = S$ or $S^C$. Therefore, on the event $J = J_0\cap J_1$, we have
$$\min_{j\in S}|\hat\beta_{init,j}|^\gamma \leq \frac{\lambda}{u_0(n, p, p_S) - 2t_0}. \qquad (43)$$
On the other hand, we have
$$\min_{j\in S}|\hat\beta_{init,j}| \geq \min_{j\in S}|\beta_j^o| - \|\hat\beta_{init} - \beta\|_\infty \geq \min_{j\in S}|\beta_j^o| - C_0\sqrt{p_S\log(p)/n}. \qquad (44)$$
Combining (43) and (44) leads to $\lambda \geq \{u_0(n, p, p_S) - 2t_0\}(\eta - C_0\sqrt{p_S\log(p)/n})^\gamma$, which contradicts the assumption $\lambda < M_n$. Finally, with probability at least $\Pr(J) \geq 1 - \delta_{n,p,p_S} - 3p^{-\eta}$, we have
$$S_n = S \quad \text{and} \quad \hat\beta_{HTR} = \tilde\beta. \qquad (45)$$

Lemma 6.1 For all $t \geq t_0 = \sqrt{2(\eta + 1)\log(p\vee n)\{\gamma(\varepsilon_0, \delta)\}^{-1}n^{-1}}$, we have
$$P\big(\|\widetilde\Sigma_X - B_{k_n}(\Sigma_x)\|_{\max} \leq t\big) \geq 1 - 3(p\vee n)^{-\eta}. \qquad (46)$$

Proof. Let $J_2 = \{\|\widetilde\Sigma_X - B_{k_n}(\Sigma_x)\|_{\max} > t\}$. It follows from Lemma A.3 of Bickel and Levina (2008) that
$$\begin{aligned}
P(J_2) &\leq (2k_n + 1)\,p\exp\{-n t_0^2\,\gamma(\varepsilon_0, \delta)\}
\leq (2k_n + 1)(p\vee n)\exp\{-2(\eta + 1)\log(p\vee n)\}\\
&\leq 3\{(p\vee n)k_n\}\exp\{-(\eta + 1)\log((p\vee n)k_n)\}
= 3\{(p\vee n)k_n\}^{-(\eta + 1) + 1} \leq 3(p\vee n)^{-\eta},
\end{aligned}$$
which finishes the proof of Lemma 6.1.

References

Akaike, H. (1973), Information theory and an extension of the maximum likelihood principle, 2nd International Symposium on Information Theory.

Bickel, P. J. and Levina, E. (2008), Regularized estimation of large covariance matrices, The Annals of Statistics, 36.

Bolstad, B., Irizarry, R., Åstrand, M., and Speed, T. (2003), A comparison of normalization methods for high density oligonucleotide array data based on variance and bias, Bioinformatics, 19.

Bühlmann, P. and van de Geer, S. (2011), Statistics for High-Dimensional Data: Methods, Theory and Applications.

Cai, T. T., Zhang, C.-H., and Zhou, H. H. (2010), Optimal rates of convergence for covariance matrix estimation, The Annals of Statistics, 38.

Candes, E. and Tao, T. (2007), The Dantzig selector: Statistical estimation when p is much larger than n, The Annals of Statistics, 35.

Chatterjee, A. and Lahiri, S. (2011), Bootstrapping Lasso estimators, Journal of the American Statistical Association, 106.

Chiang, A., Beck, J., et al. (2006), Homozygosity mapping with SNP arrays identifies TRIM32, an E3 ubiquitin ligase, as a Bardet-Biedl syndrome gene (BBS11), Proceedings of the National Academy of Sciences, 103.

Donoho, D. L. and Johnstone, I. M. (1994), Ideal spatial adaptation by wavelet shrinkage, Biometrika, 81.

Donoho, D. L., Johnstone, I. M., Kerkyacharian, G., and Picard, D. (1995), Wavelet shrinkage: Asymptopia? (with discussion), Journal of the Royal Statistical Society, Series B, 57.

Fan, J. and Li, R. (2001), Variable selection via nonconcave penalized likelihood and its oracle properties, Journal of the American Statistical Association, 96.

Fan, J., Liao, Y., and Mincheva, M. (2013), Large covariance estimation by thresholding principal orthogonal complements, Journal of the Royal Statistical Society, Series B, 75.

Fan, J. and Lv, J. (2008), Sure independence screening for ultrahigh dimensional feature space (with discussion), Journal of the Royal Statistical Society, Series B, 70.

Fan, J. and Lv, J. (2011), Nonconcave penalized likelihood with NP-dimensionality, IEEE Transactions on Information Theory, 57.

Fan, J. and Peng, H. (2004), On non-concave penalized likelihood with diverging number of parameters, The Annals of Statistics, 32.

Huang, J., Ma, S., and Zhang, C. (2008), Adaptive Lasso for sparse high-dimensional regression models, Statistica Sinica, 18.

Huo, X. and Ni, X. (2007), When do stepwise algorithms meet subset selection criteria? The Annals of Statistics, 35.

Irizarry, R., Hobbs, B., and Collin, F. (2003), Exploration, normalization, and summaries of high density oligonucleotide array probe level data, Biostatistics, 4.


More information

On High-Dimensional Cross-Validation

On High-Dimensional Cross-Validation On High-Dimensional Cross-Validation BY WEI-CHENG HSIAO Institute of Statistical Science, Academia Sinica, 128 Academia Road, Section 2, Nankang, Taipei 11529, Taiwan hsiaowc@stat.sinica.edu.tw 5 WEI-YING

More information

Feature selection with high-dimensional data: criteria and Proc. Procedures

Feature selection with high-dimensional data: criteria and Proc. Procedures Feature selection with high-dimensional data: criteria and Procedures Zehua Chen Department of Statistics & Applied Probability National University of Singapore Conference in Honour of Grace Wahba, June

More information

THE Mnet METHOD FOR VARIABLE SELECTION

THE Mnet METHOD FOR VARIABLE SELECTION Statistica Sinica 26 (2016), 903-923 doi:http://dx.doi.org/10.5705/ss.202014.0011 THE Mnet METHOD FOR VARIABLE SELECTION Jian Huang 1, Patrick Breheny 1, Sangin Lee 2, Shuangge Ma 3 and Cun-Hui Zhang 4

More information

Selection of Smoothing Parameter for One-Step Sparse Estimates with L q Penalty

Selection of Smoothing Parameter for One-Step Sparse Estimates with L q Penalty Journal of Data Science 9(2011), 549-564 Selection of Smoothing Parameter for One-Step Sparse Estimates with L q Penalty Masaru Kanba and Kanta Naito Shimane University Abstract: This paper discusses the

More information

Direct Learning: Linear Regression. Donglin Zeng, Department of Biostatistics, University of North Carolina

Direct Learning: Linear Regression. Donglin Zeng, Department of Biostatistics, University of North Carolina Direct Learning: Linear Regression Parametric learning We consider the core function in the prediction rule to be a parametric function. The most commonly used function is a linear function: squared loss:

More information

Forward Regression for Ultra-High Dimensional Variable Screening

Forward Regression for Ultra-High Dimensional Variable Screening Forward Regression for Ultra-High Dimensional Variable Screening Hansheng Wang Guanghua School of Management, Peking University This version: April 9, 2009 Abstract Motivated by the seminal theory of Sure

More information

Orthogonal Matching Pursuit for Sparse Signal Recovery With Noise

Orthogonal Matching Pursuit for Sparse Signal Recovery With Noise Orthogonal Matching Pursuit for Sparse Signal Recovery With Noise The MIT Faculty has made this article openly available. Please share how this access benefits you. Your story matters. Citation As Published

More information

Lecture 14: Variable Selection - Beyond LASSO

Lecture 14: Variable Selection - Beyond LASSO Fall, 2017 Extension of LASSO To achieve oracle properties, L q penalty with 0 < q < 1, SCAD penalty (Fan and Li 2001; Zhang et al. 2007). Adaptive LASSO (Zou 2006; Zhang and Lu 2007; Wang et al. 2007)

More information

MS-C1620 Statistical inference

MS-C1620 Statistical inference MS-C1620 Statistical inference 10 Linear regression III Joni Virta Department of Mathematics and Systems Analysis School of Science Aalto University Academic year 2018 2019 Period III - IV 1 / 32 Contents

More information

Reconstruction from Anisotropic Random Measurements

Reconstruction from Anisotropic Random Measurements Reconstruction from Anisotropic Random Measurements Mark Rudelson and Shuheng Zhou The University of Michigan, Ann Arbor Coding, Complexity, and Sparsity Workshop, 013 Ann Arbor, Michigan August 7, 013

More information

BAGUS: Bayesian Regularization for Graphical Models with Unequal Shrinkage

BAGUS: Bayesian Regularization for Graphical Models with Unequal Shrinkage BAGUS: Bayesian Regularization for Graphical Models with Unequal Shrinkage Lingrui Gan, Naveen N. Narisetty, Feng Liang Department of Statistics University of Illinois at Urbana-Champaign Problem Statement

More information

Sparse Linear Models (10/7/13)

Sparse Linear Models (10/7/13) STA56: Probabilistic machine learning Sparse Linear Models (0/7/) Lecturer: Barbara Engelhardt Scribes: Jiaji Huang, Xin Jiang, Albert Oh Sparsity Sparsity has been a hot topic in statistics and machine

More information

High-dimensional covariance estimation based on Gaussian graphical models

High-dimensional covariance estimation based on Gaussian graphical models High-dimensional covariance estimation based on Gaussian graphical models Shuheng Zhou Department of Statistics, The University of Michigan, Ann Arbor IMA workshop on High Dimensional Phenomena Sept. 26,

More information

Consistent Model Selection Criteria on High Dimensions

Consistent Model Selection Criteria on High Dimensions Journal of Machine Learning Research 13 (2012) 1037-1057 Submitted 6/11; Revised 1/12; Published 4/12 Consistent Model Selection Criteria on High Dimensions Yongdai Kim Department of Statistics Seoul National

More information

Generalized Elastic Net Regression

Generalized Elastic Net Regression Abstract Generalized Elastic Net Regression Geoffroy MOURET Jean-Jules BRAULT Vahid PARTOVINIA This work presents a variation of the elastic net penalization method. We propose applying a combined l 1

More information

Shrinkage Tuning Parameter Selection in Precision Matrices Estimation

Shrinkage Tuning Parameter Selection in Precision Matrices Estimation arxiv:0909.1123v1 [stat.me] 7 Sep 2009 Shrinkage Tuning Parameter Selection in Precision Matrices Estimation Heng Lian Division of Mathematical Sciences School of Physical and Mathematical Sciences Nanyang

More information

Regularization and Variable Selection via the Elastic Net

Regularization and Variable Selection via the Elastic Net p. 1/1 Regularization and Variable Selection via the Elastic Net Hui Zou and Trevor Hastie Journal of Royal Statistical Society, B, 2005 Presenter: Minhua Chen, Nov. 07, 2008 p. 2/1 Agenda Introduction

More information

ESL Chap3. Some extensions of lasso

ESL Chap3. Some extensions of lasso ESL Chap3 Some extensions of lasso 1 Outline Consistency of lasso for model selection Adaptive lasso Elastic net Group lasso 2 Consistency of lasso for model selection A number of authors have studied

More information

Pre-Selection in Cluster Lasso Methods for Correlated Variable Selection in High-Dimensional Linear Models

Pre-Selection in Cluster Lasso Methods for Correlated Variable Selection in High-Dimensional Linear Models Pre-Selection in Cluster Lasso Methods for Correlated Variable Selection in High-Dimensional Linear Models Niharika Gauraha and Swapan Parui Indian Statistical Institute Abstract. We consider variable

More information

Homogeneity Pursuit. Jianqing Fan

Homogeneity Pursuit. Jianqing Fan Jianqing Fan Princeton University with Tracy Ke and Yichao Wu http://www.princeton.edu/ jqfan June 5, 2014 Get my own profile - Help Amazing Follow this author Grace Wahba 9 Followers Follow new articles

More information

High dimensional thresholded regression and shrinkage effect

High dimensional thresholded regression and shrinkage effect J. R. Statist. Soc. B (014) 76, Part 3, pp. 67 649 High dimensional thresholded regression and shrinkage effect Zemin Zheng, Yingying Fan and Jinchi Lv University of Southern California, Los Angeles, USA

More information

VARIABLE SELECTION AND ESTIMATION WITH THE SEAMLESS-L 0 PENALTY

VARIABLE SELECTION AND ESTIMATION WITH THE SEAMLESS-L 0 PENALTY Statistica Sinica 23 (2013), 929-962 doi:http://dx.doi.org/10.5705/ss.2011.074 VARIABLE SELECTION AND ESTIMATION WITH THE SEAMLESS-L 0 PENALTY Lee Dicker, Baosheng Huang and Xihong Lin Rutgers University,

More information

Sparsity Models. Tong Zhang. Rutgers University. T. Zhang (Rutgers) Sparsity Models 1 / 28

Sparsity Models. Tong Zhang. Rutgers University. T. Zhang (Rutgers) Sparsity Models 1 / 28 Sparsity Models Tong Zhang Rutgers University T. Zhang (Rutgers) Sparsity Models 1 / 28 Topics Standard sparse regression model algorithms: convex relaxation and greedy algorithm sparse recovery analysis:

More information

Simultaneous regression shrinkage, variable selection, and supervised clustering of predictors with OSCAR

Simultaneous regression shrinkage, variable selection, and supervised clustering of predictors with OSCAR Simultaneous regression shrinkage, variable selection, and supervised clustering of predictors with OSCAR Howard D. Bondell and Brian J. Reich Department of Statistics, North Carolina State University,

More information

Shrinkage Methods: Ridge and Lasso

Shrinkage Methods: Ridge and Lasso Shrinkage Methods: Ridge and Lasso Jonathan Hersh 1 Chapman University, Argyros School of Business hersh@chapman.edu February 27, 2019 J.Hersh (Chapman) Ridge & Lasso February 27, 2019 1 / 43 1 Intro and

More information

A Blockwise Descent Algorithm for Group-penalized Multiresponse and Multinomial Regression

A Blockwise Descent Algorithm for Group-penalized Multiresponse and Multinomial Regression A Blockwise Descent Algorithm for Group-penalized Multiresponse and Multinomial Regression Noah Simon Jerome Friedman Trevor Hastie November 5, 013 Abstract In this paper we purpose a blockwise descent

More information

MSA220/MVE440 Statistical Learning for Big Data

MSA220/MVE440 Statistical Learning for Big Data MSA220/MVE440 Statistical Learning for Big Data Lecture 9-10 - High-dimensional regression Rebecka Jörnsten Mathematical Sciences University of Gothenburg and Chalmers University of Technology Recap from

More information

Machine Learning Linear Regression. Prof. Matteo Matteucci

Machine Learning Linear Regression. Prof. Matteo Matteucci Machine Learning Linear Regression Prof. Matteo Matteucci Outline 2 o Simple Linear Regression Model Least Squares Fit Measures of Fit Inference in Regression o Multi Variate Regession Model Least Squares

More information

SOLVING NON-CONVEX LASSO TYPE PROBLEMS WITH DC PROGRAMMING. Gilles Gasso, Alain Rakotomamonjy and Stéphane Canu

SOLVING NON-CONVEX LASSO TYPE PROBLEMS WITH DC PROGRAMMING. Gilles Gasso, Alain Rakotomamonjy and Stéphane Canu SOLVING NON-CONVEX LASSO TYPE PROBLEMS WITH DC PROGRAMMING Gilles Gasso, Alain Rakotomamonjy and Stéphane Canu LITIS - EA 48 - INSA/Universite de Rouen Avenue de l Université - 768 Saint-Etienne du Rouvray

More information

CS 4491/CS 7990 SPECIAL TOPICS IN BIOINFORMATICS

CS 4491/CS 7990 SPECIAL TOPICS IN BIOINFORMATICS CS 4491/CS 7990 SPECIAL TOPICS IN BIOINFORMATICS * Some contents are adapted from Dr. Hung Huang and Dr. Chengkai Li at UT Arlington Mingon Kang, Ph.D. Computer Science, Kennesaw State University Problems

More information

CALIBRATING NONCONVEX PENALIZED REGRESSION IN ULTRA-HIGH DIMENSION

CALIBRATING NONCONVEX PENALIZED REGRESSION IN ULTRA-HIGH DIMENSION The Annals of Statistics 2013, Vol. 41, No. 5, 2505 2536 DOI: 10.1214/13-AOS1159 Institute of Mathematical Statistics, 2013 CALIBRATING NONCONVEX PENALIZED REGRESSION IN ULTRA-HIGH DIMENSION BY LAN WANG

More information

On Mixture Regression Shrinkage and Selection via the MR-LASSO

On Mixture Regression Shrinkage and Selection via the MR-LASSO On Mixture Regression Shrinage and Selection via the MR-LASSO Ronghua Luo, Hansheng Wang, and Chih-Ling Tsai Guanghua School of Management, Peing University & Graduate School of Management, University

More information

The lasso. Patrick Breheny. February 15. The lasso Convex optimization Soft thresholding

The lasso. Patrick Breheny. February 15. The lasso Convex optimization Soft thresholding Patrick Breheny February 15 Patrick Breheny High-Dimensional Data Analysis (BIOS 7600) 1/24 Introduction Last week, we introduced penalized regression and discussed ridge regression, in which the penalty

More information

On Model Selection Consistency of Lasso

On Model Selection Consistency of Lasso On Model Selection Consistency of Lasso Peng Zhao Department of Statistics University of Berkeley 367 Evans Hall Berkeley, CA 94720-3860, USA Bin Yu Department of Statistics University of Berkeley 367

More information

Robust Variable Selection Methods for Grouped Data. Kristin Lee Seamon Lilly

Robust Variable Selection Methods for Grouped Data. Kristin Lee Seamon Lilly Robust Variable Selection Methods for Grouped Data by Kristin Lee Seamon Lilly A dissertation submitted to the Graduate Faculty of Auburn University in partial fulfillment of the requirements for the Degree

More information

A Survey of L 1. Regression. Céline Cunen, 20/10/2014. Vidaurre, Bielza and Larranaga (2013)

A Survey of L 1. Regression. Céline Cunen, 20/10/2014. Vidaurre, Bielza and Larranaga (2013) A Survey of L 1 Regression Vidaurre, Bielza and Larranaga (2013) Céline Cunen, 20/10/2014 Outline of article 1.Introduction 2.The Lasso for Linear Regression a) Notation and Main Concepts b) Statistical

More information

Least Absolute Gradient Selector: variable selection via Pseudo-Hard Thresholding

Least Absolute Gradient Selector: variable selection via Pseudo-Hard Thresholding arxiv:204.2353v4 [stat.ml] 9 Oct 202 Least Absolute Gradient Selector: variable selection via Pseudo-Hard Thresholding Kun Yang September 2, 208 Abstract In this paper, we propose a new approach, called

More information

A Modern Look at Classical Multivariate Techniques

A Modern Look at Classical Multivariate Techniques A Modern Look at Classical Multivariate Techniques Yoonkyung Lee Department of Statistics The Ohio State University March 16-20, 2015 The 13th School of Probability and Statistics CIMAT, Guanajuato, Mexico

More information

Regularization: Ridge Regression and the LASSO

Regularization: Ridge Regression and the LASSO Agenda Wednesday, November 29, 2006 Agenda Agenda 1 The Bias-Variance Tradeoff 2 Ridge Regression Solution to the l 2 problem Data Augmentation Approach Bayesian Interpretation The SVD and Ridge Regression

More information

The lasso, persistence, and cross-validation

The lasso, persistence, and cross-validation The lasso, persistence, and cross-validation Daniel J. McDonald Department of Statistics Indiana University http://www.stat.cmu.edu/ danielmc Joint work with: Darren Homrighausen Colorado State University

More information

MA 575 Linear Models: Cedric E. Ginestet, Boston University Regularization: Ridge Regression and Lasso Week 14, Lecture 2

MA 575 Linear Models: Cedric E. Ginestet, Boston University Regularization: Ridge Regression and Lasso Week 14, Lecture 2 MA 575 Linear Models: Cedric E. Ginestet, Boston University Regularization: Ridge Regression and Lasso Week 14, Lecture 2 1 Ridge Regression Ridge regression and the Lasso are two forms of regularized

More information

Institute of Statistics Mimeo Series No Simultaneous regression shrinkage, variable selection and clustering of predictors with OSCAR

Institute of Statistics Mimeo Series No Simultaneous regression shrinkage, variable selection and clustering of predictors with OSCAR DEPARTMENT OF STATISTICS North Carolina State University 2501 Founders Drive, Campus Box 8203 Raleigh, NC 27695-8203 Institute of Statistics Mimeo Series No. 2583 Simultaneous regression shrinkage, variable

More information

Lasso Maximum Likelihood Estimation of Parametric Models with Singular Information Matrices

Lasso Maximum Likelihood Estimation of Parametric Models with Singular Information Matrices Article Lasso Maximum Likelihood Estimation of Parametric Models with Singular Information Matrices Fei Jin 1,2 and Lung-fei Lee 3, * 1 School of Economics, Shanghai University of Finance and Economics,

More information

OWL to the rescue of LASSO

OWL to the rescue of LASSO OWL to the rescue of LASSO IISc IBM day 2018 Joint Work R. Sankaran and Francis Bach AISTATS 17 Chiranjib Bhattacharyya Professor, Department of Computer Science and Automation Indian Institute of Science,

More information

Data Mining Stat 588

Data Mining Stat 588 Data Mining Stat 588 Lecture 02: Linear Methods for Regression Department of Statistics & Biostatistics Rutgers University September 13 2011 Regression Problem Quantitative generic output variable Y. Generic

More information

Tractable Upper Bounds on the Restricted Isometry Constant

Tractable Upper Bounds on the Restricted Isometry Constant Tractable Upper Bounds on the Restricted Isometry Constant Alex d Aspremont, Francis Bach, Laurent El Ghaoui Princeton University, École Normale Supérieure, U.C. Berkeley. Support from NSF, DHS and Google.

More information

MS&E 226: Small Data

MS&E 226: Small Data MS&E 226: Small Data Lecture 6: Model complexity scores (v3) Ramesh Johari ramesh.johari@stanford.edu Fall 2015 1 / 34 Estimating prediction error 2 / 34 Estimating prediction error We saw how we can estimate

More information

High-dimensional regression modeling

High-dimensional regression modeling High-dimensional regression modeling David Causeur Department of Statistics and Computer Science Agrocampus Ouest IRMAR CNRS UMR 6625 http://www.agrocampus-ouest.fr/math/causeur/ Course objectives Making

More information

HIGH-DIMENSIONAL VARIABLE SELECTION WITH THE GENERALIZED SELO PENALTY

HIGH-DIMENSIONAL VARIABLE SELECTION WITH THE GENERALIZED SELO PENALTY Vol. 38 ( 2018 No. 6 J. of Math. (PRC HIGH-DIMENSIONAL VARIABLE SELECTION WITH THE GENERALIZED SELO PENALTY SHI Yue-yong 1,3, CAO Yong-xiu 2, YU Ji-chang 2, JIAO Yu-ling 2 (1.School of Economics and Management,

More information

Effective Linear Discriminant Analysis for High Dimensional, Low Sample Size Data

Effective Linear Discriminant Analysis for High Dimensional, Low Sample Size Data Effective Linear Discriant Analysis for High Dimensional, Low Sample Size Data Zhihua Qiao, Lan Zhou and Jianhua Z. Huang Abstract In the so-called high dimensional, low sample size (HDLSS) settings, LDA

More information

Biostatistics-Lecture 16 Model Selection. Ruibin Xi Peking University School of Mathematical Sciences

Biostatistics-Lecture 16 Model Selection. Ruibin Xi Peking University School of Mathematical Sciences Biostatistics-Lecture 16 Model Selection Ruibin Xi Peking University School of Mathematical Sciences Motivating example1 Interested in factors related to the life expectancy (50 US states,1969-71 ) Per

More information

DISCUSSION OF A SIGNIFICANCE TEST FOR THE LASSO. By Peter Bühlmann, Lukas Meier and Sara van de Geer ETH Zürich

DISCUSSION OF A SIGNIFICANCE TEST FOR THE LASSO. By Peter Bühlmann, Lukas Meier and Sara van de Geer ETH Zürich Submitted to the Annals of Statistics DISCUSSION OF A SIGNIFICANCE TEST FOR THE LASSO By Peter Bühlmann, Lukas Meier and Sara van de Geer ETH Zürich We congratulate Richard Lockhart, Jonathan Taylor, Ryan

More information

Sparse regression. Optimization-Based Data Analysis. Carlos Fernandez-Granda

Sparse regression. Optimization-Based Data Analysis.   Carlos Fernandez-Granda Sparse regression Optimization-Based Data Analysis http://www.cims.nyu.edu/~cfgranda/pages/obda_spring16 Carlos Fernandez-Granda 3/28/2016 Regression Least-squares regression Example: Global warming Logistic

More information

Regularization Path Algorithms for Detecting Gene Interactions

Regularization Path Algorithms for Detecting Gene Interactions Regularization Path Algorithms for Detecting Gene Interactions Mee Young Park Trevor Hastie July 16, 2006 Abstract In this study, we consider several regularization path algorithms with grouped variable

More information

High Dimensional Covariance and Precision Matrix Estimation

High Dimensional Covariance and Precision Matrix Estimation High Dimensional Covariance and Precision Matrix Estimation Wei Wang Washington University in St. Louis Thursday 23 rd February, 2017 Wei Wang (Washington University in St. Louis) High Dimensional Covariance

More information

The Risk of James Stein and Lasso Shrinkage

The Risk of James Stein and Lasso Shrinkage Econometric Reviews ISSN: 0747-4938 (Print) 1532-4168 (Online) Journal homepage: http://tandfonline.com/loi/lecr20 The Risk of James Stein and Lasso Shrinkage Bruce E. Hansen To cite this article: Bruce

More information

Consistent Selection of Tuning Parameters via Variable Selection Stability

Consistent Selection of Tuning Parameters via Variable Selection Stability Journal of Machine Learning Research 14 2013 3419-3440 Submitted 8/12; Revised 7/13; Published 11/13 Consistent Selection of Tuning Parameters via Variable Selection Stability Wei Sun Department of Statistics

More information

Group descent algorithms for nonconvex penalized linear and logistic regression models with grouped predictors

Group descent algorithms for nonconvex penalized linear and logistic regression models with grouped predictors Group descent algorithms for nonconvex penalized linear and logistic regression models with grouped predictors Patrick Breheny Department of Biostatistics University of Iowa Jian Huang Department of Statistics

More information

Semi-Penalized Inference with Direct FDR Control

Semi-Penalized Inference with Direct FDR Control Jian Huang University of Iowa April 4, 2016 The problem Consider the linear regression model y = p x jβ j + ε, (1) j=1 where y IR n, x j IR n, ε IR n, and β j is the jth regression coefficient, Here p

More information

Linear Model Selection and Regularization

Linear Model Selection and Regularization Linear Model Selection and Regularization Recall the linear model Y = β 0 + β 1 X 1 + + β p X p + ɛ. In the lectures that follow, we consider some approaches for extending the linear model framework. In

More information

Biostatistics Advanced Methods in Biostatistics IV

Biostatistics Advanced Methods in Biostatistics IV Biostatistics 140.754 Advanced Methods in Biostatistics IV Jeffrey Leek Assistant Professor Department of Biostatistics jleek@jhsph.edu Lecture 12 1 / 36 Tip + Paper Tip: As a statistician the results

More information

A Short Introduction to the Lasso Methodology

A Short Introduction to the Lasso Methodology A Short Introduction to the Lasso Methodology Michael Gutmann sites.google.com/site/michaelgutmann University of Helsinki Aalto University Helsinki Institute for Information Technology March 9, 2016 Michael

More information