Hard Thresholded Regression Via Linear Programming


Qiang Sun, Hongtu Zhu, and Joseph G. Ibrahim

Department of Biostatistics, The University of North Carolina at Chapel Hill, Chapel Hill, NC. Q. Sun is a Ph.D. student, H. Zhu is Professor of Biostatistics, and J. G. Ibrahim is Alumni Distinguished Professor of Biostatistics.

Abstract

The aim of this paper is to develop a hard thresholded regression (HTR) framework for simultaneous variable selection and unbiased estimation in high dimensional linear regression. This new framework is motivated by its close connection with $L_0$ regularization and best subset selection under an orthonormal design, while enjoying several key computational and theoretical advantages over many existing penalization methods (e.g., SCAD or Lasso). Computationally, HTR is a fast two-stage estimation procedure consisting of a first step that calculates a coarse initial estimator and a second step that solves a linear programming problem. Theoretically, under some mild conditions, the HTR estimator is shown to enjoy the strong oracle property and a thresholded property even when the number of covariates grows at an exponential rate. We also propose to incorporate a regularized covariance estimator into the estimation procedure in order to achieve a better trade-off between noise accumulation and correlation modeling. Under this framework, HTR includes Sure Independence Screening as a special case. Both simulation and real data results show that HTR outperforms other state-of-the-art methods.

Key words: correlation bias, finite sample bias, hard thresholded regression, linear programming.

1 Introduction

Consider the linear model
$$y_i = x_i^T \beta + \varepsilon_i, \quad i = 1, \ldots, n, \qquad (1)$$
where $y_i$ is a univariate response, $x_i = (x_{i,1}, \ldots, x_{i,p})^T$ is a $p$-dimensional covariate vector, $\beta = (\beta_1, \ldots, \beta_p)^T$ is a $p \times 1$ regression coefficient vector, and $\{\varepsilon_i : i = 1, \ldots, n\}$ are independent and identically distributed (i.i.d.) errors. The theory of linear models is well established for traditional applications, where the dimension $p$ is fixed and the sample size $n$ is much larger than $p$. With the development of many modern technologies, however, in many biological, medical, social, and economic studies $p$ is comparable with, or much larger than, $n$, making valid statistical inference a great challenge.

Let $S = \{j : \beta_j^o \neq 0\}$ and let $p_S$ be the cardinality of $S$, where $\beta^o = (\beta_1^o, \ldots, \beta_p^o)^T$ is the true value of $\beta$. For prediction accuracy and variable selection consistency, it is common to impose a sparsity assumption, that is, $p_S \ll p$. For model (1), many regularization methods for variable selection minimize
$$Q(\beta) = \frac{1}{2n}\|y - X\beta\|_2^2 + \sum_{j=1}^{p} p_\lambda(\beta_j), \qquad (2)$$
where $y = (y_1, \ldots, y_n)^T$, $X$ is an $n \times p$ non-stochastic matrix with $i$th row $x_i^T$, $\|\cdot\|_2$ denotes the $L_2$ norm, and $p_\lambda(\cdot)$ is a penalty function (e.g., SCAD or Lasso) that depends on a tuning parameter $\lambda > 0$. The most well-known best subset selection method is $L_0$ penalized regression, which achieves simultaneous parameter estimation and variable selection (Akaike, 1973; Schwarz, 1978). Subset selection methods coupled with different selection criteria, including the $C_p$ statistic, the Akaike information criterion (AIC), the Bayesian information criterion (BIC), minimum description length (MDL), and the risk inflation criterion (RIC), are special cases of $L_0$ penalized regression, resulting from the assignment of different values to $\lambda$.

However, solving the $L_0$ regularization problem with a fixed $\lambda$ is NP-hard, and computational methods based on an exhaustive search rapidly become impractical as the number of covariates increases (Huo and Ni, 2007; Fan and Peng, 2004; Fan and Li, 2001; Zhang, 2010). To address this computational issue, various convex/nonconvex penalty functions have been used in $Q(\beta)$ and extensively investigated in order to mimic the $L_0$ regularization (Tibshirani, 1996; Fan and Li, 2001; Fan and Peng, 2004; Zhang, 2010; Meinshausen and Bühlmann, 2006; Leng et al., 2004; Zou, 2006).

Instead of developing another penalty function, we develop a new hard thresholded regression (HTR) modeling framework for performing simultaneous variable selection and unbiased estimation in model (1). The key idea of HTR is to minimize
$$H(\beta) = \frac{1}{n}\|W X^T (y - X\beta)\|_1 + \lambda \|\beta\|_1, \qquad (3)$$
where $W$ is a $p \times p$ weight matrix based on some initial estimate of $\beta$, which will be introduced in Section 2. As shown in Sections 2 and 3, HTR simultaneously enjoys two key computational and theoretical properties: (i) since $H(\beta)$ is convex and HTR can be cast as a linear programming problem, minimizing $H(\beta)$ is computationally efficient even in high-dimensional settings; (ii) under some mild conditions, the HTR estimate, which minimizes $H(\beta)$, is an oracle estimator and achieves unbiased estimation within a wide range of $\lambda$. Due to the nice properties (i) and (ii), our HTR estimate is a novel addition to the extensive regularization literature.

Our HTR method in (3) shares some important similarities with existing regularization methods. The penalty function in $H(\beta)$ is the same as that of the popular Lasso (Tibshirani, 1996), obtained when $p_\lambda(\beta_j) = \lambda|\beta_j|$. As shown in Section 2, HTR has a strong connection with the $L_0$ and hard-thresholding regularizations (Akaike, 1973; Schwarz, 1978; Zheng et al., 2013), since all of them reduce to best subset selection under orthogonal designs, that is, $n^{-1}X^T X = I_p$, where $I_p$ is the $p \times p$ identity matrix. A comparison of the regularization paths of the $L_0$ regularization and HTR is shown in Figure 1.

Our HTR differs significantly from existing regularization methods in several major ways. A major advantage of HTR over nonconvex regularizations is its computational efficiency, property (i). Although there is a large literature on nonconvex regularization methods (Wang et al., 2013a; Kim and Kwon, 2012; Zhang and Zhang, 2012; Fan and Lv, 2011; Kim et al., 2008; Wang et al., 2013b), several important questions remain. Specifically, due to the nonconvexity of the penalty function, multiple local minima always exist, and hence it is difficult to identify the oracle estimator even if the oracle estimator is known to exist along the solution path. A major advantage of HTR over convex regularization methods is its theoretical property (ii). Due to the convexity of the penalty function, convex regularization methods, such as the Lasso, suffer from estimation bias and thus can be suboptimal in terms of risk estimation; see Fan and Li (2001) for detailed discussions. Moreover, the shrinkage bias introduced by convex regularization methods poses major challenges for statistical inference, such as constructing confidence intervals or testing hypotheses, in high dimensional settings (Zhang and Zhang, 2012; van de Geer et al., 2013; Chatterjee and Lahiri, 2011). There is also a major conflict between optimal prediction and consistent variable selection in the Lasso (Meinshausen and Bühlmann, 2006; Leng et al., 2004; Zou, 2006).

We make three major contributions in this paper. First, we systematically investigate a fast two-step estimation procedure for HTR: the first step calculates a ridge estimator and the second step solves a linear programming problem.

Fig. 1: Solution paths of the $L_0$ regularized regression and HTR. We consider a simple example in which $y_i = x_i^T\beta + \varepsilon_i$, where $\beta = (3, 2, 1.5, 0, 0, 0)^T$ and the $\varepsilon_i$'s are independently and identically distributed as $N(0, 1)$. We plot the estimates of the regression coefficients $\hat\beta_j$ for this example. Left panel: the $L_0$ penalized regression estimates as a function of $\lambda$; right panel: the hard thresholded regression estimates as a function of $\lambda$.

Second, we provide a comprehensive theoretical investigation of HTR. We show that the HTR estimator has the strong oracle property even when the number of covariates grows at an exponential rate. Third, we propose to incorporate a regularized covariance estimator into the estimation procedure in order to achieve a better trade-off between noise accumulation and correlation modeling.

The rest of the paper is organized as follows. In Section 2, we introduce HTR and its implementation, discuss its connections with other regularization methods, and show that the HTR estimator has the strong oracle property under some mild conditions. In Section 3, we discuss potential extensions to the ultra-high dimensional case. Numerical results from Monte Carlo simulations and a real data example are presented in Section 4. A discussion is presented in Section 5, and proofs are given in Section 6.

2 Methods

2.1 Hard Thresholded Regression (HTR)

Consider $n$ independent observations $\{(y_i, x_i) : i = 1, \ldots, n\}$ from model (1) with true parameter vector $\beta^o$. Without loss of generality, we standardize each column of $X = (\tilde{x}_1, \ldots, \tilde{x}_p)$ so that $\|\tilde{x}_k\|_2^2 = n$ for $k = 1, \ldots, p$. The target of HTR in (3) is to estimate $\beta^o$ from the data. Our HTR algorithm is a two-stage approach; a numerical sketch of the two stages is given after the steps below.

1. Compute an initial estimator of $\beta$, denoted by $\hat\beta_{init}$, with a reasonably small risk error bound. For instance, let $\hat\beta_{ridge} = (X^T X + \lambda_{init} I_p)^{-1} X^T y$ be a ridge estimator of $\beta$, where $I_p$ is the $p \times p$ identity matrix and $\lambda_{init} \geq 0$ is a tuning parameter. When $\lambda_{init} = 0$, $\hat\beta_{ridge}$ reduces to the ordinary least squares estimator of $\beta$. We will use $\hat\beta_{ridge}$ as a candidate for $\hat\beta_{init}$ and examine its risk error bound in Section 2.3.

2. Construct the weight matrix $\widehat{W}$ based on $\hat\beta_{init}$, and then compute the HTR estimator
$$\hat\beta_{HTR} = \mathop{\mathrm{argmin}}_{\beta}\ \frac{1}{n}\|\widehat{W} X^T (y - X\beta)\|_1 + \lambda \|\beta\|_1. \qquad (4)$$
Throughout the paper, we set $\widehat{W} = \mathrm{diag}(\hat{w}_j)$ with
$$\hat{w}_j = |\hat\beta_{init,j}|^\gamma \quad \text{for } j = 1, 2, \ldots, p, \qquad (5)$$
where $\gamma$ is a positive constant and $\hat\beta_{init,j}$ is the $j$th component of $\hat\beta_{init}$.
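A minimal numerical sketch of these two stages is given below, assuming a Python/NumPy environment. The function names (`ridge_initial`, `htr_weights`) and default tuning values are ours, not the paper's; this is an illustration of the two steps, not the authors' implementation.

```python
import numpy as np

def ridge_initial(X, y, lam_init=1.0):
    """Stage 1: ridge initial estimator (X'X + lam_init * I_p)^{-1} X'y.
    With lam_init = 0 this reduces to ordinary least squares (when X'X is invertible)."""
    n, p = X.shape
    return np.linalg.solve(X.T @ X + lam_init * np.eye(p), X.T @ y)

def htr_weights(beta_init, gamma=2.0):
    """Stage 2 weights as in (5): w_j = |beta_init_j|^gamma, the diagonal of W-hat.
    gamma > 0 is a user choice (the theory later asks for gamma > 2v/(1 - v))."""
    return np.abs(beta_init) ** gamma
```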

Numerically, the computation of $\hat\beta_{HTR}$ is very straightforward, since the objective function in (4) is convex and can be recast as a linear programming problem. Specifically, we introduce a $p \times 1$ slack vector $\eta = (\eta_1, \ldots, \eta_p)^T$ with $\eta_j \geq |n^{-1}[\widehat{W}X^T(y - X\beta)]_j|$, and write $\beta = \beta^+ - \beta^-$ with $\beta^+ = (\beta_1^+, \ldots, \beta_p^+)^T \geq 0$ and $\beta^- = (\beta_1^-, \ldots, \beta_p^-)^T \geq 0$. Then the minimization in (4) can be rewritten as
$$\min \sum_{j=1}^{p}\{\eta_j + \lambda(\beta_j^+ + \beta_j^-)\} \quad \text{subject to} \quad \eta \geq 0,\ \beta^+ \geq 0,\ \beta^- \geq 0,\ \text{and}\ -\eta \leq \frac{1}{n}\widehat{W}X^T(y - X\beta) \leq \eta,$$
where the optimization variables are $\eta$, $\beta^+$, and $\beta^-$ in $\mathbb{R}^p$. Finally, $\beta$ is recovered as $\beta = \beta^+ - \beta^-$.

There are at least two major motivations for HTR. The first comes from the score equation of the maximum likelihood estimator. Let $\ell_n(\beta)$ and $U_n(\beta)$ be, respectively, the likelihood (or quasi-likelihood) and score functions of $\beta$. The score equation and its weighted version are given by
$$U_n(\beta) = \partial_\beta \ell_n(\beta) = 0 \quad \text{and} \quad \widehat{W}U_n(\beta) = 0, \qquad (6)$$
which are equivalent to $\|U_n(\beta)\|_1 = 0$ and $\|\widehat{W}U_n(\beta)\|_1 = 0$, respectively. For model (1), $U_n(\beta)$ reduces to $X^T(y - X\beta)$, and thus $\hat\beta_{HTR}$ can be regarded as a penalized weighted score estimator with the $L_1$ penalty $\|\beta\|_1$. Moreover, $R(\beta) = (R_1(\beta), R_2(\beta), \ldots, R_p(\beta))^T = U_n(\beta)$ can be regarded as the risk function of $\beta$, and $\widehat{W}$ is the risk calibration weight matrix that imposes additional information learned from the first stage. Therefore, based on (6), it is possible to extend HTR to more general scenarios, such as generalized linear models.

The second motivation comes from the Dantzig selector (Candes and Tao, 2007) and the least absolute gradient selector (LAGS) (Yang, 2012). These two selectors are equivalent to minimizing the objective function
$$\hat\beta = \mathop{\mathrm{argmin}}_{\beta}\ \|n^{-1}X^T(y - X\beta)\|_a + \lambda\|V\beta\|_1, \qquad (7)$$
where $V$ is a $p \times p$ weight matrix. The Dantzig selector and LAGS correspond to $(a, V) = (\infty, I_p)$ and $(a, V) = (\infty, \mathrm{diag}(1/|\hat\beta_{init,1}|, \ldots, 1/|\hat\beta_{init,p}|))$, respectively.
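Returning to the linear-programming recast of (4) above, the following sketch solves it with `scipy.optimize.linprog`. It is an illustration of the formulation under stated assumptions, not the authors' implementation; the function name `htr_lp` is ours.

```python
import numpy as np
from scipy.optimize import linprog

def htr_lp(X, y, W_diag, lam):
    """Solve (4) as a linear program.

    Decision variables are stacked as [eta, beta_plus, beta_minus], each of length p,
    and the constraint |(1/n) W X'(y - X beta)| <= eta is written as two <= blocks.
    W_diag holds the diagonal of the weight matrix W-hat."""
    n, p = X.shape
    A = (W_diag[:, None] * (X.T @ X)) / n          # (1/n) W X'X, rows scaled by w_j
    b = (W_diag * (X.T @ y)) / n                   # (1/n) W X'y
    I = np.eye(p)
    # Objective: sum(eta) + lam * sum(beta_plus + beta_minus)
    c = np.concatenate([np.ones(p), lam * np.ones(p), lam * np.ones(p)])
    A_ub = np.vstack([np.hstack([-I, -A,  A]),     # b - A(beta+ - beta-) <= eta
                      np.hstack([-I,  A, -A])])    # -(b - A(beta+ - beta-)) <= eta
    b_ub = np.concatenate([-b, b])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=(0, None), method="highs")
    sol = res.x
    return sol[p:2 * p] - sol[2 * p:]              # beta = beta_plus - beta_minus

# Example usage, combining the two stages sketched earlier:
# beta_hat = htr_lp(X, y, htr_weights(ridge_initial(X, y, lam_init=1.0)), lam=0.5)
```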

As pointed out by Candes and Tao (2007), one would want to constrain the size of the correlated residual vector $X^T(y - X\beta)$ rather than the size of the residual vector $y - X\beta$, since such an estimation procedure is invariant under orthogonal transformations of $X$. Moreover, since the correlated residual vector measures the correlation between the predictors and the response, one would obviously want to include the explanatory variables that are highly correlated with the response $y$ in the model. A major drawback of the Dantzig selector is shrinkage bias, leading to suboptimal risk estimation, even though a double Dantzig selector can reduce the bias (James and Radchenko, 2009). Moreover, to address the same bias issue, similar to the adaptive Lasso (Zou, 2006), LAGS uses adaptive weights calculated from $\hat\beta_{init}$ to directly penalize different regression coefficients. An advantage of HTR is that it directly downweights the risk functions $R_j(\beta)$ associated with insignificant $\beta_j$'s in both estimation and variable selection. When $s \ll \min(p, n)$ and $p$ is comparable with $n$, we expect HTR to outperform LAGS in terms of bias and mean squared error; see Section 4 for details.

2.2 Orthonormal Design Case

We examine the orthonormal design case in order to delineate some connections between HTR and other existing regularization methods. In this case, we have $X^T X = nI_p$ and $\hat\beta^{ols} = (\hat\beta_1^{ols}, \ldots, \hat\beta_p^{ols})^T = n^{-1}X^T y$. Best subset selection of size $k$ reduces to choosing the $k$ largest coefficients in absolute value and setting the rest to 0. Specifically, for some value of $\lambda$, this is equivalent to
$$\hat\beta_j = \hat\beta_j^{ols}\, 1\{|\hat\beta_j^{ols}| > \lambda\} \quad \text{for } j = 1, \ldots, p, \qquad (8)$$
which has a strong connection with hard shrinkage. For the Lasso (Tibshirani, 1996), the solutions have the form
$$\hat\beta_{lasso,j} = \mathrm{sign}(\hat\beta_j^{ols})(|\hat\beta_j^{ols}| - \lambda)_+ \quad \text{for } j = 1, \ldots, p, \qquad (9)$$
which has a strong connection with the soft shrinkage proposals of Donoho and Johnstone (1994) and Donoho et al. (1995). However, (9) introduces a major shrinkage bias.

Many convex/nonconvex penalty functions in (2) have been proposed to reduce the effect of the shrinkage bias of the Lasso for statistical inference (Candes and Tao, 2007; Fan and Li, 2001; Zou, 2006; Zhang, 2010). For instance, with the hard-thresholding penalty $p_\lambda(t) = 0.5[\lambda^2 - (\lambda - |t|)_+^2]\,1(t \neq 0)$, we obtain the hard thresholding estimator in (8). In the case of an orthonormal design, the hard thresholding penalty is also equivalent to the $L_0$ penalty $p_\lambda(t) = 0.5\lambda^2\, 1(t \neq 0)$. However, for nonorthonormal designs, although nonconvex regularization can be beneficial for selecting important covariates in model (1), additional computational and theoretical questions arise due to the nonconvexity of the penalty function.

Both HTR and LAGS try to mimic best subset selection while avoiding various issues associated with the convex/nonconvex penalty functions used in $Q(\beta)$. Specifically, we keep the $L_1$ penalty $p_\lambda(t) = \lambda|t|$, whereas we replace the loss function in $Q(\beta)$ by the score equation (or risk function) of $\beta$. In the case of an orthonormal design, HTR reduces to
$$\mathop{\mathrm{argmin}}_{\beta}\ \sum_{j=1}^{p}\hat{w}_j|\beta_j - \hat\beta_j^{ols}| + \lambda\sum_{j=1}^{p}|\beta_j|, \qquad (10)$$
whose solutions are given by
$$\hat\beta_{HTR,j} = \hat\beta_j^{ols}\, 1(\lambda \leq \hat{w}_j) \quad \text{for } j = 1, \ldots, p. \qquad (11)$$
By taking the ridge estimator in the first stage, we obtain $\hat{w}_j = |\hat\beta_j^{ols}|/(1 + \lambda_{init})$, and thus $\hat\beta_{HTR}$ reduces to the hard thresholding estimator in (8) for some value of $\lambda$. We can also use the biased Lasso estimate $\hat\beta_{lasso,j}$ to construct $\hat{w}_j$ in the first stage and then calculate an unbiased estimator $\hat\beta_{HTR}$ by calibrating the bias in $\hat\beta_{lasso,j}$. Thus, for HTR, we only need a coarse initial estimator in the first stage, which helps us identify the active set $S$ of the true $\beta$.
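For intuition, the three orthonormal-design rules (8), (9), and (11) can be written in a few lines; this is a sketch for illustration, and the function names are ours.

```python
import numpy as np

def hard_threshold(beta_ols, lam):
    """Best subset / hard-thresholding rule (8): keep beta_ols_j when |beta_ols_j| > lam."""
    return beta_ols * (np.abs(beta_ols) > lam)

def soft_threshold(beta_ols, lam):
    """Lasso rule (9) under an orthonormal design: shrink every surviving coefficient by lam."""
    return np.sign(beta_ols) * np.maximum(np.abs(beta_ols) - lam, 0.0)

def htr_orthonormal(beta_ols, w, lam):
    """HTR rule (11): keep beta_ols_j unshrunken whenever lam <= w_j, otherwise set it to zero."""
    return beta_ols * (lam <= w)
```

Note that the HTR rule either keeps the OLS coefficient untouched or sets it to zero, so, unlike the soft-thresholding rule, it introduces no shrinkage bias on the retained coefficients.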

We note that HTR is different from the hard-thresholding procedure. Given $\hat\beta_{init}$ and $\lambda_n > 0$, the hard thresholding (HT) estimator $\hat\beta_{HT}$ is defined componentwise as
$$\hat\beta_{HT,j} = \begin{cases}\hat\beta_{init,j}, & \text{if } |\hat\beta_{init,j}| \geq \lambda_n,\\ 0, & \text{if } |\hat\beta_{init,j}| < \lambda_n.\end{cases} \qquad (12)$$
The hard-thresholding rule aims to remove false positives at the second stage, while largely preserving the estimator calculated in the first stage. In contrast, our HTR always re-estimates $\beta$ in order to calibrate the estimation bias introduced in the first stage. Therefore, a coarse initial estimator of $\beta$ is sufficient in the first stage of HTR.

2.3 Theoretical Results

We formally establish the strong oracle property of $\hat\beta_{HTR}$ when the number of parameters is large and grows with the sample size $n$. We start with the following regularity conditions, which are needed to facilitate the technical details, although they may not be the weakest possible conditions.

Regularity Conditions (RCs)

(RC1) $0 < b \leq \lambda_{\min}(n^{-1}X^T X) \leq \lambda_{\max}(n^{-1}X^T X) \leq B < \infty$.

(RC2) $\lim_{n\to\infty}\log(p)/\log(n) \leq v$ for some $0 \leq v < 1$.

(RC3) $\lambda n^{-1/2} \to 0$ and $\lambda n^{0.5(\gamma - v(\gamma+1))} \to \infty$.

(RC4) The initial estimator $\hat\beta_{init}$ satisfies $E[\|\hat\beta_{init} - \beta\|_2^2 \mid X] = O(pn^{-1})$.

Remarks. Condition (RC1) assumes that the predictor matrix is reasonably well behaved; it is also considered in Fan and Peng (2004). Condition (RC2) specifies that the growth rate of $p$ is at most polynomial, that is, $p = O(n^v)$ with $v < 1$. It is worth pointing out that Condition (RC2) is weaker than that used in Fan and Peng (2004), where $p$ is assumed to satisfy $p^3 = o(n)$. Condition (RC3) specifies the relationship between $\lambda$ and $n$. To construct the risk calibration weight matrix $\widehat{W}$, we take a fixed $\gamma$ such that $\gamma > 2v/(1 - v)$. Condition (RC4) requires that the initial estimator used in the first stage behave reasonably well in terms of its risk error bound. Such an error bound is generally available for many standard estimators of $\beta$. As an illustration, the following proposition shows that the ridge estimator used in the first stage satisfies (RC4).

Proposition 2.1 (Risk Error for Ridge Estimates) Under (RC1), $\hat\beta_{ridge}$ satisfies
$$E[\|\hat\beta_{ridge} - \beta\|_2^2 \mid X] \leq 2\,\frac{\lambda_{init}^2\|\beta\|_2^2 + \sigma^2 npB}{n^2 b^2}. \qquad (13)$$
Furthermore, if $\lambda_{init}^2\|\beta\|_2^2 = O(np)$, then we have
$$E[\|\hat\beta_{ridge} - \beta\|_2^2 \mid X] = O\Big(\frac{p}{n}\Big). \qquad (14)$$

We next study the strong oracle property of $\hat\beta_{HTR}$. Before stating the main theorem, we introduce the oracle estimator, denoted by $\hat\beta^*$, as
$$\hat\beta^* = \mathop{\mathrm{argmin}}_{\beta:\,\beta_j = 0,\ j\notin S}\ \frac{1}{2n}\sum_{i=1}^{n}(y_i - x_i^T\beta)^2 = \begin{pmatrix}(X_1^T X_1)^{-1}X_1^T y\\ 0\end{pmatrix}, \qquad (15)$$
where, without loss of generality, it is assumed that the first $s$ regression coefficients are nonzero and the remaining $p - s$ regression coefficients are zero, and $X_1$ is the corresponding design matrix for the first $s$ regression coefficients. Theorem 2.2 below provides the strong oracle property of $\hat\beta_{HTR}$.

Theorem 2.2 (Strong Oracle Property of $\hat\beta_{HTR}$) Assume that conditions (RC1)-(RC4) hold. Then, as $n\to\infty$, we have
$$\Pr(\hat\beta_{HTR} = \hat\beta^*) \to 1. \qquad (16)$$
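As an illustrative (not definitive) numerical check of Theorem 2.2, one can compare the HTR solution with the oracle least squares fit on simulated data, reusing the hypothetical `ridge_initial`, `htr_weights`, and `htr_lp` sketches above; the specific choices of $n$, $p$, $\lambda_{init}$, $\gamma$, and $\lambda$ below are ours.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 10
beta_true = np.array([3.0, 2.0, 1.5] + [0.0] * (p - 3))
X = rng.standard_normal((n, p))
X *= np.sqrt(n) / np.linalg.norm(X, axis=0)        # standardize columns: ||x_k||_2^2 = n
y = X @ beta_true + rng.standard_normal(n)

beta_init = ridge_initial(X, y, lam_init=1.0)      # stage 1
beta_htr = htr_lp(X, y, htr_weights(beta_init, gamma=2.0), lam=0.5)   # stage 2

S = np.flatnonzero(beta_true != 0.0)               # true active set
beta_oracle = np.zeros(p)
beta_oracle[S] = np.linalg.lstsq(X[:, S], y, rcond=None)[0]
# For lambda in a suitable range the two estimates typically coincide up to LP tolerance.
print(np.max(np.abs(beta_htr - beta_oracle)))
```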

Combining Proposition 2.1 and Theorem 2.2 yields the strong oracle property of $\hat\beta_{HTR}$ when we set $\hat\beta_{init} = (X^T X + \lambda_{init}I_p)^{-1}X^T y$ in the first stage. Our result gives the strong oracle property under very mild conditions, assuming only $\lambda n^{-1/2} \to 0$ and $\lambda n^{0.5(\gamma - v(\gamma+1))} \to \infty$ in (RC3). It is instructive to compare this result with the adaptive Lasso in the fixed-dimension setting. The adaptive Lasso achieves the oracle property by requiring $\lambda n^{1/2} \to 0$, or equivalently, that the bias term $\lambda$ go to 0 at a faster rate than $n^{-1/2}$. In HTR, by contrast, the bias term $\lambda$ can diverge to $\infty$, provided it does so no faster than $n^{1/2}$. This further highlights an advantage of the HTR estimator: the thresholding level $\lambda$ is used only to shut down the noise, without introducing any bias into the final estimator.

3 HTR under the Ultra-High Dimensional Setting

3.1 Ultra-High Dimensional HTR

We now discuss how to extend HTR to the ultra-high dimensional setting with $p \gg n$; for instance, it is common to assume that $p$ may grow at an exponential rate in $n$. In this case, the standard HTR in (4) may fail, because condition (RC1) no longer holds: $\lambda_{\min}(n^{-1}X^T X) = 0$ for $p > n$. Thus, we need to replace $n^{-1}X^T X$ in (4) by a positive definite estimator of the covariance matrix of the predictors $x$, denoted by $\widetilde\Sigma_X$. The use of a positive definite $\widetilde\Sigma_X$ in place of $n^{-1}X^T X$ is also very common in the regularization literature. For instance, in Zou and Hastie (2005), the elastic net estimator for model (1) is defined as
$$\mathop{\mathrm{argmin}}_{\beta}\ \{\beta^T\widetilde\Sigma_X\beta - 2n^{-1}y^T X\beta + \lambda\|\beta\|_1\}, \qquad (17)$$
in which $\widetilde\Sigma_X = n^{-1}(X^T X + \lambda_2 I_p)/(1 + \lambda_2)$ for some $\lambda_2 > 0$. Our new ultra-high dimensional HTR algorithm for $p \gg n$ is also a two-stage approach, as follows.

1. Compute $\hat\beta_{init}$ satisfying the estimation error bound
$$\|\hat\beta_{init} - \beta\|_2 \leq C_0\sqrt{\frac{p_S\log(p)}{n}} \qquad (18)$$
on a set $J_0$ of large probability, that is, $\Pr(J_0) = 1 - \delta_{n,p,p_S} \to 1$, or $\delta_{n,p,p_S} = o(1)$. Specifically, we use the Lasso estimator of $\beta$, denoted by $\hat\beta_{lasso}$, as a candidate for $\hat\beta_{init}$, since Zhang and Huang (2008) showed that (18) holds for $\hat\beta_{lasso}$ under the sparse Riesz condition. We may use other regularized estimators of $\beta$, such as the Dantzig estimator, since error bounds of the form (18) are widely available for them in the ultra-high dimensional framework.

2. Construct $\widehat{W}$ and estimate $\hat\beta_{HTR}$ according to
$$\hat\beta_{HTR} = \mathop{\mathrm{argmin}}_{\beta}\ \frac{1}{2}\|\widehat{W}(n^{-1}X^T y - \widetilde\Sigma_X\beta)\|_1 + \lambda\|\beta\|_1. \qquad (19)$$

We show below that our ultra-high dimensional HTR is a general framework for carrying out screening, variable selection, and estimation. We first establish a connection between ultra-high dimensional HTR and Sure Independence Screening (SIS) when $p$ is much larger than $n$. With a large dimensionality $p$, computational cost and estimation accuracy are major difficulties for any statistical method. To overcome such difficulties, Fan and Lv (2008) introduced the SIS methodology to reduce the dimensionality from an ultra-high $p$ to a relatively large scale $d_n$ with $d_n < n$. Specifically, let $\omega = n^{-1}X^T y = (\omega_1, \ldots, \omega_p)^T$ be the $p \times 1$ vector of marginal correlations of the predictors with the response variable. The standard SIS method selects features according to their marginal correlations with the response variable contained in $\omega$, filtering out those with weak marginal correlations. This SIS procedure is equivalent to a special case of HTR obtained by taking $\widehat{W} = \mathrm{diag}(|\omega_1|, \ldots, |\omega_p|)$ and $\widetilde\Sigma_X = n^{-1}\mathrm{diag}\{X^T X\} = I_p$ in (19). Then (19) reduces to
$$\hat\beta_{HTR,j} = \omega_j\,1(|\omega_j| \geq \lambda), \quad \text{where } \hat\beta_{HTR} = \mathop{\mathrm{argmin}}_{\beta}\ \sum_{j=1}^{p}\big[\,|\omega_j|\,|\omega_j - \beta_j| + \lambda|\beta_j|\,\big]. \qquad (20)$$
Without loss of generality, assume that $|\omega_1| > |\omega_2| > \cdots > |\omega_p|$. For any given $q \in (0, 1)$, we can select the covariates corresponding to the $[qn]$ largest $|\omega_j|$'s by taking $\lambda = |\omega_{[qn]}|$ in (20), where $[qn]$ denotes the integer part of $qn$. Furthermore, we may combine the ordering of $\{|\omega_j|\}_j$ learned from SIS with HTR (SIS+HTR) to recalculate $\hat\beta_{HTR}$.

Second, we show that our HTR procedure allows us to extend SIS to more complex settings in which the predictors may be highly correlated. Incorporating the correlation structure among the predictors is critical for better variable selection and estimation in model (1). An important strategy is to balance noise accumulation against correlation modeling. Without loss of generality, we assume that the true covariance matrix of $x$, denoted by $\Sigma_x$, has a geometric decay structure, and therefore we can use its regularized bandable covariance estimator, denoted by $\widetilde\Sigma_X$, to approximate $\Sigma_x$ (Bickel and Levina, 2008). Extensions to other covariance structures can be handled by using other regularized estimators from the literature (Cai et al., 2010; Lam and Fan, 2009; Rothman et al., 2009; Fan et al., 2013). Specifically, we set $\widetilde\omega = \widetilde\Sigma_X^{-1} n^{-1}X^T y$ and $\widehat{W} = \mathrm{diag}(|\widetilde\omega|)$. In this case, (19) reduces to
$$\hat\beta_{HTR} = \mathop{\mathrm{argmin}}_{\beta}\ \{\|\mathrm{diag}(|\widetilde\omega|)(n^{-1}X^T y - \widetilde\Sigma_X\beta)\|_1 + \lambda\|\beta\|_1\}. \qquad (21)$$
Since we explicitly account for the joint information of all the covariates by regularizing their covariance matrix estimate through a de-correlation procedure, instead of relying on the independence rule, we call (21) a Sure Correlation Screening (SCS) procedure; it avoids the faithfulness assumption used in Fan and Lv (2008). A small numerical sketch of the SIS special case (20) and of the banded covariance estimator used by SCS is given at the end of this subsection.
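The following sketch illustrates the SIS special case (20) and a banded covariance estimator in the spirit of Bickel and Levina (2008) that SCS can plug into (21). The function names and defaults are ours, and the commented lines only indicate how the pieces fit together under the assumption that the banded estimator is invertible.

```python
import numpy as np

def sis_screen(X, y, q=0.5):
    """SIS as in (20): rank features by the marginal statistics omega = X'y / n
    and keep the indices of the [q * n] largest |omega_j|."""
    n = X.shape[0]
    omega = X.T @ y / n
    d = int(q * n)
    keep = np.argsort(-np.abs(omega))[:d]
    return np.sort(keep), omega

def banded_covariance(X, k):
    """Banded covariance sketch: zero out sample-covariance entries more than
    k positions away from the diagonal (a simplified Bickel-Levina-type estimator)."""
    p = X.shape[1]
    S = np.cov(X, rowvar=False)
    band = np.abs(np.subtract.outer(np.arange(p), np.arange(p))) <= k
    return S * band

# SCS-style ingredients for (21) (assuming Sigma_tilde is positive definite;
# in practice a small ridge term may be added before solving):
# Sigma_tilde = banded_covariance(X, k=5)
# omega_tilde = np.linalg.solve(Sigma_tilde, X.T @ y / X.shape[0])
# W_diag = np.abs(omega_tilde)
```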

3.2 Theoretical Results

We formally investigate the strong oracle property of $\hat\beta_{HTR}$ in the ultra-high dimensional scenario. We start with the following regularity condition on $\Sigma_x$. Throughout the paper, it is assumed that $\Sigma_x$ belongs to the well-behaved covariance class $U(\varepsilon_0, \alpha, C_1)$ defined by
$$U(\varepsilon_0, \alpha, C_1) = \Big\{\Sigma = (\sigma_{jj'})_{p\times p}:\ \max_{j}\sum_{j'}\{|\sigma_{jj'}|: |j - j'| > k\} \leq C_1 k^{-\alpha}\ \text{for all } k > 0,\ \Sigma = \Sigma^T,\ 0 < \varepsilon_0 \leq \lambda_{\min}(\Sigma) \leq \lambda_{\max}(\Sigma) \leq 1/\varepsilon_0\Big\},$$
where $\varepsilon_0$, $C_1$, and $\alpha$ are positive scalars. The condition $\Sigma_x \in U(\varepsilon_0, \alpha, C_1)$ basically requires that $\Sigma_x$ be bandable. This condition on $\Sigma_x$ can be relaxed by employing different covariance estimators (Bickel and Levina, 2008; Cai et al., 2010; Lam and Fan, 2009; Rothman et al., 2009; Fan et al., 2013).

We also introduce the $L_\infty$ Correlation Condition (LCC) for model identifiability. For a given set $S$ with cardinality $p_S$ and its complement $S^C = \{1, \ldots, p\}\setminus S$ with cardinality $p_{S^C} = p - p_S$, we consider the partition of the $p \times p$ matrix $\Sigma_x$ according to $(S, S^C)$:
$$\Sigma_x = \begin{pmatrix}\Sigma_{x,SS} & \Sigma_{x,SS^C}\\ \Sigma_{x,S^C S} & \Sigma_{x,S^C S^C}\end{pmatrix},$$
where $\Sigma_{x,S_1 S_2}$ is the $p_{S_1}\times p_{S_2}$ submatrix corresponding to indices in $S_1$ and $S_2$, with $S_1$ and $S_2$ equal to either $S$ or $S^C$. We say that $(\Sigma_x, S)$ satisfies the $L_\infty$ correlation condition if there exists a $u_0(n, p, p_S) > 0$ such that
$$\min_{\|\tau_S\|_\infty = 1,\ \|\tau_{S^C}\|_\infty = 1}\ \|\Sigma_{x,SS}\tau_S + \Sigma_{x,SS^C}\tau_{S^C}\|_\infty > u_0(n, p, p_S), \qquad (22)$$
where $\tau_S$ and $\tau_{S^C}$ are $p_S\times 1$ and $(p - p_S)\times 1$ vectors, respectively.

The $L_\infty$ correlation condition rules out strong collinearity, in the same spirit as Condition 4 of Fan and Lv (2008). The sample version of the LCC closely resembles the irrepresentable condition first proposed by Zhao and Yu (2006).

The irrepresentable condition is equivalent to putting a regularization constraint on the regression coefficients of the irrelevant covariates $X_{S^C}$ on the relevant covariates $X_S$:
$$\|\Sigma_{x,SS}^{-1}\Sigma_{x,SS^C}\|_1 \leq 1 - u_0(n, p, p_S) \quad \text{for some constant } u_0(n, p, p_S) > 0.$$
Similar to the irrepresentable condition, if we put the constraint in the $L_\infty$ norm rather than the $L_1$ norm, i.e., $\|\Sigma_{x,SS}^{-1}\Sigma_{x,SS^C}\|_\infty \leq 1 - u_0(n, p, p_S)$, and hold $s$ fixed, this implies the LCC, by observing that
$$\min_{\tau\in\Omega_0}\|\Sigma_{x,SS}\tau_S + \Sigma_{x,SS^C}\tau_{S^C}\|_\infty = \min_{\tau\in\Omega_0}\|\Sigma_{x,SS}(\tau_S + \Sigma_{x,SS}^{-1}\Sigma_{x,SS^C}\tau_{S^C})\|_\infty \geq \frac{\lambda_{\min}(\Sigma_{x,SS})}{\sqrt{p_S}}\min_{\tau\in\Omega_0}\|\tau_S + \Sigma_{x,SS}^{-1}\Sigma_{x,SS^C}\tau_{S^C}\|_\infty \geq \frac{\varepsilon_0}{\sqrt{p_S}}\,u_0(n, p, p_S), \qquad (23)$$
where $\Omega_0 = \{\|\tau_S\|_\infty = 1,\ \|\tau_{S^C}\|_\infty = 1\}$. In general, we allow $u_0(n, p, p_S)$ to decay to 0.

We examine an example of $\Sigma_x$ for which the LCC holds but the irrepresentable condition does not. Specifically, consider $\Sigma_x^0$ with $\Sigma_{x,SS}^0 = I_{p_S}$, $\Sigma_{x,S^C S^C}^0 = I_{p - p_S}$, and $\Sigma_{x,SS^C}^0 = (\Sigma_{x,S^C S}^0)^T = [p_S^{-1}1_{p_S}, 0, \ldots, 0]$, where $1_{p_S}$ is the $p_S\times 1$ vector of ones. Therefore, the LCC allows us to go beyond the irrepresentable condition for consistent variable selection.

Proposition 3.1 For $S = \{1, \ldots, p_S\}$ and $\Sigma_x^0$ defined as above, $(\Sigma_x^0, S)$ satisfies the LCC but not the irrepresentable condition.

We define the oracle estimator of $\beta$ in the ultra-high dimensional setting as
$$\tilde\beta = \big(\{(\widetilde\Sigma_{X,SS})^{-1}n^{-1}X_S^T y\}^T,\ 0^T\big)^T, \qquad (24)$$
where $\widetilde\Sigma_{X,SS}$ denotes the submatrix of $\widetilde\Sigma_X$ corresponding to the indices in the true active set $S$. Note that the difference between $\tilde\beta$ and the oracle least squares estimator $\hat\beta^*$ is very small, since $\|\widetilde\Sigma_{X,SS} - \Sigma_{x,SS}\|_2 \leq \|\widetilde\Sigma_X - \Sigma_x\|_2 = O_p(\{n^{-1}\log(p)\}^{\alpha/(2(\alpha+1))})$ for $\Sigma_x \in U(\varepsilon_0, \alpha, C_1)$ (Bickel and Levina, 2008). Moreover, if the ordinary least squares estimator is desirable, especially when $s/n$ is moderate, we can first identify an initial active set, denoted by $S_n$, and then calculate $\hat\beta_{ref} = (X_{S_n}^T X_{S_n})^{-1}X_{S_n}^T y$.

Before presenting the main results, we let $\Sigma_{k_n} = B_{k_n}(\Sigma) = (\sigma_{ij}\,1(|i - j| \leq k_n))$ and define $\eta = \min_{j\in S}|\beta_j^o|$.

Theorem 3.2 (Strong Oracle Property of $\hat\beta_{HTR}$ for $p \gg n$, with thresholded property) Suppose that $\Sigma_x \in U(\varepsilon_0, \alpha, C_1)$, that (18) holds with a positive scalar $C_0$, and that $(B_{k_n}(\Sigma_x), S)$ satisfies the LCC. Suppose the tuning parameter $\lambda$ satisfies $m_n < \lambda < M_n$ for $k_n \asymp \{n/\log(p)\}^{1/(2(\alpha+1))}$, where
$$m_n \doteq C_0^\gamma(2k_n + 1)\{p_S\log(p)/n\}^{\gamma/2}\max\Big\{\varepsilon_0^{-1},\ \sqrt{\frac{2(\eta + 1)}{\gamma(\varepsilon_0, \delta)}}\Big(\frac{\log(p)}{n}\Big)^{1/2}\Big\}$$
and
$$M_n \doteq [u_0(n, p, p_S) - 2t_0]\big(\eta - C_0\sqrt{p_S\log(p)/n}\big)^\gamma.$$
Moreover, $t_0 = \sqrt{2(\eta + 1)\log(p\vee n)\{\gamma(\varepsilon_0, \delta)\}^{-1}n^{-1}}$ is defined in Lemma 6.1, where $\gamma(\varepsilon_0, \delta)$ is a constant not depending on $n$ and $p$. Then, with probability at least $1 - \delta_{n,p,p_S} - 3p^{-\eta}$, we have
$$S_n = S \quad \text{and} \quad \hat\beta_{HTR} = \tilde\beta. \qquad (25)$$

Theorem 3.2 quantifies the behavior of our HTR estimator in the ultra-high dimensional scenario. Assuming that $\eta > C_0\sqrt{p_S\log(p)/n}$ and that $u_0(n, p, p_S)$ is fixed, we roughly require $\eta^\gamma \gtrsim \lambda \gtrsim (p_S\log(p)/n)^{\gamma/2}(2k_n + 1)$. By comparison, in Wang et al. (2013a), the calibrated CCCP method identifies the oracle estimator when $\eta \gtrsim \lambda \gtrsim \sqrt{p_S\log(p)/n}$. We point out an interesting phenomenon: within the range $(m_n, M_n)$, with $m_n$ and $M_n$ defined in the theorem above, $\hat\beta_{HTR}$ stays at the oracle estimator $\tilde\beta$. This agrees with our intuition that the HTR solution path is piecewise constant. We mention that our result is not directly comparable with the calibrated CCCP method or any other method in the literature, since we only require $M_n > \lambda > m_n$ rather than $M_n \gg \lambda \gg m_n$. Finally, Theorem 3.2 is in line with the important theoretical properties of $L_0$ penalized regression considered in Zheng et al. (2013), which may further validate our HTR method.

4 Numerical Examples

4.1 Simulation Study

Continuous responses were generated according to model (1) with
$$\beta = (3, 2, 0, 0, 1.5, \underbrace{0, \ldots, 0}_{p-5})^T$$
and $n = 100$. Moreover, in model (1), $x_i$ follows the $N(0, \Sigma_x)$ distribution with covariance matrix $\Sigma_x$, and $\varepsilon_i$ is independent of $x_i$ and normally distributed with mean 0 and standard deviation $\sigma = 2$. Writing the correlation structure of $\Sigma_x$ as $(\rho_{ij})$, we consider three different designs:

Case 1: independent design with $(\rho_{ij}) = \mathrm{diag}(1, \ldots, 1)$;

Case 2: weak correlation design with $\rho_{ij} = 0.30^{|i-j|}$;

Case 3: relatively strong correlation design with $\rho_{ij} = 0.95^{|i-j|}$.

We consider both a relatively high dimension $p = 40$ and the ultra-high dimensional case $p = 2000 \gg n$. We investigate the sparsity recovery and estimation properties of the HTR estimator via numerical simulations. We compared the HTR estimator with the following estimators: the oracle estimator, which assumes the availability of the active set $S_0$; the adaptive Lasso estimator proposed by Zou (2006); the smoothly clipped absolute deviation (SCAD) estimator (Fan and Li, 2001); and the minimax concave penalty (MCP) estimator with $a = 3$ (Zhang, 2010). For SCAD, $n^{1/2}$-fold cross-validation was used to select the tuning parameter $\lambda$; for ALasso and HTR, the sequential tuning of Bühlmann and van de Geer (2011) was used.
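A sketch of the data-generating mechanism for the three correlation designs above is given below. This is our code, written under the assumption that $\Sigma_x$ has unit variances so that its entries coincide with $(\rho_{ij})$; the function name is hypothetical.

```python
import numpy as np

def simulate_case(n=100, p=40, rho=0.0, sigma=2.0, seed=0):
    """Generate (X, y) from model (1) with beta = (3, 2, 0, 0, 1.5, 0, ..., 0)^T,
    x_i ~ N(0, Sigma_x) with correlation rho^{|i-j|}, and noise sd sigma = 2.
    rho = 0.0, 0.30, 0.95 correspond to Cases 1-3."""
    rng = np.random.default_rng(seed)
    beta = np.zeros(p)
    beta[[0, 1, 4]] = [3.0, 2.0, 1.5]
    idx = np.arange(p)
    Sigma = rho ** np.abs(np.subtract.outer(idx, idx))
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    y = X @ beta + sigma * rng.standard_normal(n)
    return X, y, beta
```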

The MCP estimator was computed using the R package PLUS with the theoretically optimal tuning parameter value $\lambda = \sigma\sqrt{2\log(p)/n}$. For the case $p = 40$, we also computed regularized estimators based on LAGS. To estimate the bandable covariance matrix $\widetilde\Sigma_X$ used in HTR, the banding parameter was selected by cross-validation as described in Bickel and Levina (2008). To further demonstrate the effect of using the regularized covariance matrix, we also computed HTR estimates based on the sample covariance matrix and on the independence (identity) covariance matrix, denoted by HTRsam and HTRind, respectively.

For each simulation setting, we generated 100 simulated data sets and applied each estimator to every data set. We then calculated several summary statistics for each estimator, reported in Tables 1, 2, 3, and 4. To measure the downward shrinkage bias, we calculated the mean and median of $\hat\beta_i - \beta_i$ for $i = 1, 2, 3$. To measure sparsity recovery, we calculated the mean and median number of zero coefficients incorrectly estimated to be nonzero (false positives, FP) and the mean and median number of nonzero coefficients correctly estimated to be nonzero (true positives, TP). To measure estimation accuracy, we calculated the mean and median squared error (MSE) and the mean and median absolute error (MAE).

It is not surprising that the Lasso always overfits. The other procedures improve on the Lasso by reducing the estimation bias and the false positive rate. The best overall performance is achieved by the HTR estimator, with relatively small shrinkage bias, MSE, MAE, and FP. MCP and SCAD also perform well overall. In the relatively high dimensional ($p = 40$) example, HTR outperforms LAGS in all three cases. When the dimension is 2000, the HTR based on the sample covariance matrix favors false selections in all cases and thus performs worse than the version based on the regularized covariance estimator. When the correlation structure becomes stronger, ignoring it produces too sparse a solution and misses true variables, which verifies our conjecture and demonstrates the effectiveness of using the regularized covariance matrix in the regression procedure.
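For completeness, the selection and estimation summaries reported in Tables 1-4 can be computed per replication as in the following sketch. The function name, the thresholding tolerance, and the reading of the first three table columns as the errors on the nonzero coefficients are our assumptions.

```python
import numpy as np

def summarize_fit(beta_hat, beta_true, tol=1e-8):
    """Per-replication summaries: componentwise errors for the nonzero coefficients
    (one reading of the tables' first three columns), squared and absolute
    estimation errors, and TP/FP selection counts."""
    selected = np.abs(beta_hat) > tol
    truth = beta_true != 0.0
    return {
        "bias_signals": beta_hat[truth] - beta_true[truth],
        "MSE": float(np.sum((beta_hat - beta_true) ** 2)),
        "MAE": float(np.sum(np.abs(beta_hat - beta_true))),
        "TP": int(np.sum(selected & truth)),
        "FP": int(np.sum(selected & ~truth)),
    }
```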

4.2 Bardet-Biedl Syndrome Gene Expression Study

We applied HTR to the Bardet-Biedl syndrome gene expression study in Scheetz et al. (2006). For this data set, F1 animals were intercrossed and 120 twelve-week-old male offspring were selected for tissue harvesting from the eyes and for microarray analysis. The microarrays used to analyze the RNA from the eyes of these animals contain more than 31,042 different probe sets (Affymetrix GeneChip Rat Genome Array). The intensity values were normalized using the RMA (robust multi-chip averaging) method (Bolstad et al., 2003; Irizarry et al., 2003) to obtain summary expression values for each probe set. The outcome of interest is the expression of TRIM32, corresponding to probe at, a gene which has been shown to cause Bardet-Biedl syndrome (Chiang et al., 2006), a genetic disease of multiple organ systems including the retina. Following Scheetz et al. (2006), we focused on the 18,957 probes out of the 31,042 probe sets on the array that exhibited a sufficient signal for reliable analysis and at least 2-fold variation in expression.

The aim of this data analysis is to find the genes whose expression is correlated with that of gene TRIM32. We used model (1) to address this problem and applied different regularization methods in the analysis. We first standardized the probes so that they have mean zero and standard deviation 1. As in Huang et al. (2008), we focused on the 3000 probes with the largest variances among the 18,975 covariates and considered two approaches. The first approach is to regress on the $p = 3000$ probes. The second approach is to regress on the 200 probes among the 3000 with the largest marginal correlation coefficients with TRIM32. We randomly partitioned the data 100 times, each with a training set of size 80 and a test set of 40 observations. The prediction mean squared error was computed within the test set, while the scaled estimators and the Lasso estimator with a fixed penalty level $\lambda$ were computed based on the training set.

Table 1: Mean of simulation results for $p = 40$: $\hat\beta_1 - \beta_1$, $\hat\beta_2 - \beta_2$, $\hat\beta_3 - \beta_3$, MSE, MAE, TP, and FP. For each case, 100 simulated data sets were used. Methods compared within each of Cases 1-3: Oracle, Lasso, ALasso, SCAD, MCP, HTR, and LAGS (numerical entries omitted).

Table 2: Median of simulation results for $p = 40$: $\hat\beta_1 - \beta_1$, $\hat\beta_2 - \beta_2$, $\hat\beta_3 - \beta_3$, MSE, MAE, TP, and FP. For each case, 100 simulated data sets were used. Methods compared within each of Cases 1-3: Oracle, Lasso, ALasso, SCAD, MCP, HTR, and LAGS (numerical entries omitted).

Table 3: Mean of simulation results for $p = 2000$: we report $\hat\beta_1 - \beta_1$, $\hat\beta_2 - \beta_2$, $\hat\beta_3 - \beta_3$, MSE, MAE, TP, and FP. For each case, 100 simulated data sets were used. Methods compared within each of Cases 1-3: Oracle, Lasso, ALasso, SCAD, MCP, HTR, HTRsam, and HTRind (numerical entries omitted).

Table 4: Median of simulation results for $p = 2000$: we report $\hat\beta_1 - \beta_1$, $\hat\beta_2 - \beta_2$, $\hat\beta_3 - \beta_3$, MSE, MAE, TP, and FP. For each case, 100 simulated data sets were used. Methods compared within each of Cases 1-3: Oracle, Lasso, ALasso, SCAD, MCP, HTR, HTRsam, and HTRind (numerical entries omitted).

In addition, we compared the prediction performance of all the estimators mentioned in the simulations. In each replication, we computed all the regularization estimators based on the training set of 80 observations. The penalty level was selected by 5-fold cross-validation over the training data set. Table 5 reports the median of the downward prediction bias (DPBias), defined as $\sum_{i=1}^{n_{test}}(\hat{y}_i - y_i)$ over the test sample, the median of the mean squared prediction error (MSPE), and the average selected model size over the 100 replications for $p = 300$ and $3000$. For MCP, the tuning parameters were selected by cross-validation, since the standard deviation of the random error is unknown. HTR works at least as well as, if not better than, ALasso, SCAD, and MCP, with much sparser models and small prediction errors. It is worth pointing out that the HTR procedure produces the sparsest solution yet with a well controlled prediction error. Moreover, HTR controls the downward prediction bias well. The performance of the MCP procedure is satisfactory, but its optimal performance depends on an additional tuning parameter $a$. In screening or diagnostic applications, it is often important to develop an accurate diagnostic test using as few features as possible in order to control the cost. The same consideration also matters when selecting target genes in gene therapies.

Table 5: Gene expression data analysis. Columns: $p$, Method, MSPE, DPBias, and average selected model size; methods compared: Lasso, ALasso, SCAD, MCP, and HTR for each value of $p$ (numerical entries omitted).

5 Conclusions and Further Discussions

The main contributions of this paper are threefold. First, we have offered a new perspective for achieving unbiased estimation without using nonconvex penalized regression; the problem can be formulated as a linear program and thus is computationally tractable, and the global optimal solution is assured. Second, we have proposed a new framework that incorporates a covariance estimator into the regression procedure for a better trade-off between noise accumulation and correlation modeling, and that facilitates relaxing the conditions for consistent variable selection. Third, we have provided a comprehensive theoretical study of the HTR estimator. We show that the HTR estimator has the strong oracle property and exhibits a very interesting thresholded phenomenon: the HTR estimator is oracle throughout the interval $(m_n, M_n)$.

6 Proofs

We present the proofs of all theoretical results below.

Proof of Proposition 2.1. Note that
$$\hat\beta_{ridge}(\lambda) - \beta = -\lambda(X^T X + \lambda I)^{-1}\beta + (X^T X + \lambda I)^{-1}X^T\varepsilon. \qquad (26)$$
Then it follows from (RC1) that
$$\begin{aligned}
E[\|\hat\beta_{ridge}(\lambda) - \beta\|_2^2 \mid X]
&= E[\|-\lambda(X^T X + \lambda I)^{-1}\beta + (X^T X + \lambda I)^{-1}X^T\varepsilon\|_2^2 \mid X]\\
&\leq 2\lambda^2\{\lambda_{\min}(X^T X) + \lambda\}^{-2}\|\beta\|_2^2 + 2\{\lambda_{\min}(X^T X) + \lambda\}^{-2}E[\varepsilon^T XX^T\varepsilon]\\
&= 2\{\lambda_{\min}(X^T X) + \lambda\}^{-2}\{\lambda^2\|\beta\|_2^2 + \mathrm{Tr}(X^T X)\sigma^2\}\\
&\leq 2\,\frac{\lambda^2\|\beta\|_2^2 + \sigma^2 p\,\lambda_{\max}(X^T X)}{(\lambda_{\min}(X^T X) + \lambda)^2}
\leq 2\,\frac{\lambda^2\|\beta\|_2^2 + \sigma^2 npB}{(nb + \lambda)^2}
\leq 2\,\frac{\lambda^2\|\beta\|_2^2 + \sigma^2 npB}{n^2 b^2},
\end{aligned}$$
which yields the proof of Proposition 2.1.

Proof of Theorem 2.2. The proof of Theorem 2.2 consists of two steps. The first step is to show exact support recovery, that is,
$$\lim_{n\to\infty}P(S_n \subseteq S) = 1, \qquad (27)$$
where $S_n = \{j: \hat\beta_{HTR,j} \neq 0\}$. The second step is to show
$$\lim_{n\to\infty}P(S \subseteq S_n) = 1. \qquad (28)$$

We prove (27) as follows. Write $\widehat\Sigma_X = n^{-1}X^T X$. The Karush-Kuhn-Tucker (KKT) condition of (4) leads to
$$\widehat{W}\widehat\Sigma_X\,\mathrm{sign}(X^T(y - X\hat\beta_{HTR})) = \lambda\,\mathrm{sign}(\hat\beta_{HTR}), \qquad (29)$$
where $\mathrm{sign}(x)$ is the signum function of $x$. Thus, if $\hat\beta_{HTR,j} \neq 0$, then we have $\hat{w}_j[\widehat\Sigma_X\,\mathrm{sign}(X^T(y - X\beta))]_{(j)} = \lambda\,\mathrm{sign}(\hat\beta_{HTR,j})$, where $[a]_{(j)}$ denotes the $j$-th component of any vector $a$. Since $|[\widehat\Sigma_X\,\mathrm{sign}(X^T(y - X\beta))]_{(j)}| \leq \|\widehat\Sigma_X\|_\infty$, we have $\lambda|\mathrm{sign}(\hat\beta_{HTR,j})| \leq \hat{w}_j\|\widehat\Sigma_X\|_\infty$. Therefore, to prove (27), it suffices to show that, as $n\to\infty$,
$$P\Big(\bigcup_{j\in S^C}\{\hat{w}_j\|\widehat\Sigma_X\|_\infty \geq \lambda\}\Big) \to 0. \qquad (30)$$
We now bound the left-hand side (LHS) of (30) as follows:
$$P\Big(\bigcup_{j\in S^C}\{\hat{w}_j\|\widehat\Sigma_X\|_\infty \geq \lambda\}\Big)
\leq \sum_{j\in S^C}P\big(\hat{w}_j\|\widehat\Sigma_X\|_\infty \geq \lambda\big)
\leq \sum_{j\in S^C}P\Big(|\hat\beta_{init,j}|^2 \geq \big(\lambda/\|\widehat\Sigma_X\|_\infty\big)^{2/\gamma}\Big)
\leq \frac{E[\|\hat\beta_{init} - \beta\|_2^2]}{(\lambda/\|\widehat\Sigma_X\|_\infty)^{2/\gamma}}
= O\Big(\Big(\frac{1}{\lambda n^{\gamma/2 - v(\gamma+1)/2}}\Big)^{2/\gamma}\Big), \qquad (31)$$
which converges to zero by (RC3) and hence yields (27).

We prove (28) as follows. Rewrite the KKT condition as
$$\mathrm{sign}(X^T(y - X\hat\beta_{HTR})) = \lambda(\widehat\Sigma_X)^{-1}\widehat{W}^{-1}\mathrm{sign}(\hat\beta_{HTR}). \qquad (32)$$
Therefore, to prove (28), it suffices to show that, as $n\to\infty$,
$$P\Big(\lambda\|(\widehat\Sigma_X)^{-1}\|_\infty\max_{j\in S}\hat{w}_j^{-1} < 1\Big) \to 1. \qquad (33)$$
The LHS of (33) is bounded by
$$\text{LHS of (33)} \geq P\Big(\min_{j\in S}\hat{w}_j > \lambda\|(\widehat\Sigma_X)^{-1}\|_\infty\Big) = P\Big(\min_{j\in S}|\hat\beta_{init,j}| > \big[\lambda\|(\widehat\Sigma_X)^{-1}\|_\infty\big]^{1/\gamma}\Big). \qquad (34)$$

Since $\min_{j\in S}|\hat\beta_{init,j}| \geq \min_{j\in S}|\beta_j| - \|\hat\beta_{init,S} - \beta_S\|_\infty \geq \min_{j\in S}|\beta_j| - \|\hat\beta_{init} - \beta\|_2$, we have
$$\text{RHS of (34)} \geq P\Big(\min_{j\in S}|\beta_j| > \big[\lambda\|(\widehat\Sigma_X)^{-1}\|_\infty\big]^{1/\gamma} + \|\hat\beta_{init} - \beta\|_2\Big). \qquad (35)$$
Further, by assumption (RC3), we have
$$\big[\lambda\|(\widehat\Sigma_X)^{-1}\|_\infty\big]^{1/\gamma} + \|\hat\beta_{init} - \beta\|_2 = O_p\big((B\lambda n^{-1/2})^{1/\gamma}\big) + \sqrt{\frac{p}{n}}\,O_p(1) = o_p(1), \qquad (36)$$
yielding $\lim_{n\to\infty}P(S_n = S) = 1$. Denote the event $\{S_n = S\}$ by $J$. On $J$, we have
$$X^T(y - X_S\hat\beta_S) = \begin{pmatrix}X_S^T(y - X_S\hat\beta_S)\\ X_{S^C}^T(y - X_S\hat\beta_S)\end{pmatrix}. \qquad (37)$$
The KKT conditions yield
$$\hat\beta_{S_n} = \hat\beta_S = (X_S^T X_S)^{-1}X_S^T y = \hat\beta_S^*. \qquad (38)$$
This completes the proof of Theorem 2.2.

Proof of Proposition 3.1. It can be shown that $\|(\Sigma_{x,SS}^0)^{-1}\Sigma_{x,SS^C}^0\|_1 = 1$, and thus the irrepresentable condition fails. On the other hand, we have
$$\min_{\tau\in\Omega_0}\|\Sigma_{x,SS}^0\tau_S + \Sigma_{x,SS^C}^0\tau_{S^C}\|_\infty \geq \min_{\tau\in\Omega_0}\|\tau_S\|_\infty - \|\Sigma_{x,SS^C}^0\tau_{S^C}\|_\infty \geq 1 - p_S^{-1}, \qquad (39)$$
which finishes the proof.

Proof of Theorem 3.2. It suffices to show support recovery, i.e., $S_n = S$, with large probability, since it implies the strong oracle property by the KKT conditions. From here on, we work on the event $J_0$, on which $\|\hat\beta_{init} - \beta\|_2 \leq C_0\sqrt{p_S\log(p)/n}$.

Similar to the proof of Theorem 2.2, we first prove (27) by showing (30) as follows:
$$\begin{aligned}
P\big(\exists\,j\in S^C:\ \hat{w}_j\|\widetilde\Sigma_X\|_\infty > \lambda\big)
&\leq \sum_{j\in S^C}P\Big(|\hat\beta_{init,j}| \geq \big(\lambda/\|\widetilde\Sigma_X\|_\infty\big)^{1/\gamma}\Big)
\leq p\,P\Big(\|\widetilde\Sigma_X\|_\infty\,\|\hat\beta_{init} - \beta\|_2^\gamma \geq \lambda\Big)\\
&\leq p\,P\Big(\|\widetilde\Sigma_X - B_{k_n}(\Sigma_x)\|_{\max} \geq \frac{1}{2k_n + 1}\cdot\frac{\lambda}{(C_0\sqrt{p_S\log p/n})^\gamma} - \|B_{k_n}(\Sigma_x)\|_{\max}\Big)\\
&\quad + p\,P\Big(\|B_{k_n}(\Sigma_x)\|_{\max} \geq \frac{1}{2k_n + 1}\cdot\frac{\lambda}{(C_0\sqrt{p_S\log p/n})^\gamma}\Big)
\ \doteq\ (R1) + (R2). \qquad (40)
\end{aligned}$$
For (R2), since $\lambda > C_0^\gamma\varepsilon_0^{-1}(2k_n + 1)(p_S\log(p)/n)^{\gamma/2}$ and $\|B_{k_n}(\Sigma_x)\|_{\max} \leq \|\Sigma_x\|_{\max} \leq \|\Sigma_x\|_2$, it can be shown that (R2) is bounded from above by
$$p\,P\Big(\|B_{k_n}(\Sigma_x)\|_{\max} \geq \frac{1}{2k_n + 1}\cdot\frac{\lambda}{(C_0\sqrt{p_S\log(p)/n})^\gamma}\Big) \leq P\big(\|\Sigma_x\|_2 \geq \varepsilon_0^{-1}\big), \qquad (41)$$
which converges to zero by assumption. For (R1), it follows from Lemma 6.1 and $\lambda \geq m_n$ that $(R1) \leq 3p^{-\eta}$.

The second step is to show, by contradiction, that the probability of the event $\{S \subseteq S_n\}$ converges to one when $\lambda \leq M_n$. Before pursuing the proof, we introduce some notation. Let $J_1 = \{\|\widetilde\Sigma_X - B_{k_n}(\Sigma_x)\|_{\max} \leq t_0\}$, where $t_0$ is defined in Lemma 6.1, $\tau_S = (\tau_j)_{j\in S}$, $\eta = \min_{j\in S}|\beta_j^o|$, and $\tau = \mathrm{sign}(n^{-1}X^T y - \widetilde\Sigma_X\hat\beta_{HTR})$. Similar to (33), it suffices to show that
$$\|\tau_S\|_\infty < 1. \qquad (42)$$
If (42) did not hold, then we would have $\tau\in\Omega_0$ by the definition of $\Omega_0$. Further, by the KKT conditions, we have $\lambda\hat{w}_j^{-1} = |(\widetilde\Sigma_{X,SS}\tau_S + \widetilde\Sigma_{X,SS^C}\tau_{S^C})_j|$ for all $j$.

This yields
$$\begin{aligned}
\lambda\{\min_{j\in S}|\hat\beta_{init,j}|^\gamma\}^{-1}
&\geq \|\widetilde\Sigma_{X,SS}\tau_S + \widetilde\Sigma_{X,SS^C}\tau_{S^C}\|_\infty\\
&\geq \|\Sigma_{k_n,SS}\tau_S + \Sigma_{k_n,SS^C}\tau_{S^C}\|_\infty - \|\widetilde\Sigma_{X,SS} - \Sigma_{k_n,SS}\|_\infty - \|\widetilde\Sigma_{X,SS^C} - \Sigma_{k_n,SS^C}\|_\infty\\
&\geq u_0(n, p, p_S) - 2\|\widetilde\Sigma_X - \Sigma_{k_n}\|_{\max},
\end{aligned}$$
where the $\Sigma_{k_n,AA'}$ are the partitions of $\Sigma_{k_n} = B_{k_n}(\Sigma_x)$ corresponding to $A, A' = S$ or $S^C$. Therefore, on the event $J = J_0\cap J_1$, we have
$$\min_{j\in S}|\hat\beta_{init,j}|^\gamma \leq \frac{\lambda}{u_0(n, p, p_S) - 2t_0}. \qquad (43)$$
On the other hand, we have
$$\min_{j\in S}|\hat\beta_{init,j}| \geq \min_{j\in S}|\beta_j^o| - \|\hat\beta_{init} - \beta\|_\infty \geq \min_{j\in S}|\beta_j^o| - C_0\sqrt{p_S\log(p)/n}. \qquad (44)$$
Combining (43) and (44) leads to $\lambda \geq \{u_0(n, p, p_S) - 2t_0\}(\eta - C_0\sqrt{p_S\log(p)/n})^\gamma$, which contradicts the assumption $\lambda < M_n$. Finally, with probability at least $\Pr(J) \geq 1 - \delta_{n,p,p_S} - 3p^{-\eta}$, we have
$$S_n = S \quad \text{and} \quad \hat\beta_{HTR} = \tilde\beta. \qquad (45)$$

Lemma 6.1 For all $t \geq t_0 = \sqrt{2(\eta + 1)\log(p\vee n)\{\gamma(\varepsilon_0, \delta)\}^{-1}n^{-1}}$, we have
$$P\big(\|\widetilde\Sigma_X - B_{k_n}(\Sigma_x)\|_{\max} \leq t\big) \geq 1 - 3(p\vee n)^{-\eta}. \qquad (46)$$

Proof. Let $J_2 = \{\|\widetilde\Sigma_X - B_{k_n}(\Sigma_x)\|_{\max} > t\}$. It follows from Lemma A.3 of Bickel and Levina (2008) that
$$\begin{aligned}
P(J_2) &\leq (2k_n + 1)\,p\exp\{-n t_0^2\,\gamma(\varepsilon_0, \delta)\}
\leq (2k_n + 1)(p\vee n)\exp\{-2(\eta + 1)\log(p\vee n)\}\\
&\leq 3\{(p\vee n)k_n\}\exp\{-(\eta + 1)\log((p\vee n)k_n)\}
= 3\{(p\vee n)k_n\}^{-(\eta + 1) + 1} \leq 3(p\vee n)^{-\eta},
\end{aligned}$$
which finishes the proof of Lemma 6.1.

References

Akaike, H. (1973), Information theory and an extension of the maximum likelihood principle, 2nd International Symposium on Information Theory.

Bickel, P. J. and Levina, E. (2008), Regularized estimation of large covariance matrices, The Annals of Statistics, 36.

Bolstad, B., Irizarry, R., Åstrand, M., and Speed, T. (2003), A comparison of normalization methods for high density oligonucleotide array data based on variance and bias, Bioinformatics, 19.

Bühlmann, P. and van de Geer, S. (2011), Statistics for High-Dimensional Data: Methods, Theory and Applications.

Cai, T. T., Zhang, C.-H., and Zhou, H. H. (2010), Optimal rates of convergence for covariance matrix estimation, The Annals of Statistics, 38.

Candes, E. and Tao, T. (2007), The Dantzig selector: Statistical estimation when p is much larger than n, The Annals of Statistics, 35.

Chatterjee, A. and Lahiri, S. (2011), Bootstrapping Lasso estimators, Journal of the American Statistical Association, 106.

Chiang, A., Beck, J., et al. (2006), Homozygosity mapping with SNP arrays identifies TRIM32, an E3 ubiquitin ligase, as a Bardet-Biedl syndrome gene (BBS11), Proceedings of the National Academy of Sciences, 103.

Donoho, D. L. and Johnstone, I. M. (1994), Ideal spatial adaptation by wavelet shrinkage, Biometrika, 81.

Donoho, D. L., Johnstone, I. M., Kerkyacharian, G., and Picard, D. (1995), Wavelet shrinkage: Asymptopia? (with discussion), Journal of the Royal Statistical Society, Series B, 57.

Fan, J. and Li, R. (2001), Variable selection via nonconcave penalized likelihood and its oracle properties, Journal of the American Statistical Association, 96.

Fan, J., Liao, Y., and Mincheva, M. (2013), Large covariance estimation by thresholding principal orthogonal complements, Journal of the Royal Statistical Society, Series B, 75.

Fan, J. and Lv, J. (2008), Sure independence screening for ultrahigh dimensional feature space (with discussion), Journal of the Royal Statistical Society, Series B, 70.

Fan, J. and Lv, J. (2011), Nonconcave penalized likelihood with NP-dimensionality, IEEE Transactions on Information Theory, 57.

Fan, J. and Peng, H. (2004), On non-concave penalized likelihood with diverging number of parameters, The Annals of Statistics, 32.

Huang, J., Ma, S., and Zhang, C. (2008), Adaptive Lasso for sparse high-dimensional regression models, Statistica Sinica, 18.

Huo, X. and Ni, X. (2007), When do stepwise algorithms meet subset selection criteria? The Annals of Statistics, 35.

Irizarry, R., Hobbs, B., and Collin, F. (2003), Exploration, normalization, and summaries of high density oligonucleotide array probe level data, Biostatistics, 4.


More information

On High-Dimensional Cross-Validation

On High-Dimensional Cross-Validation On High-Dimensional Cross-Validation BY WEI-CHENG HSIAO Institute of Statistical Science, Academia Sinica, 128 Academia Road, Section 2, Nankang, Taipei 11529, Taiwan hsiaowc@stat.sinica.edu.tw 5 WEI-YING

More information

Feature selection with high-dimensional data: criteria and Proc. Procedures

Feature selection with high-dimensional data: criteria and Proc. Procedures Feature selection with high-dimensional data: criteria and Procedures Zehua Chen Department of Statistics & Applied Probability National University of Singapore Conference in Honour of Grace Wahba, June

More information

THE Mnet METHOD FOR VARIABLE SELECTION

THE Mnet METHOD FOR VARIABLE SELECTION Statistica Sinica 26 (2016), 903-923 doi:http://dx.doi.org/10.5705/ss.202014.0011 THE Mnet METHOD FOR VARIABLE SELECTION Jian Huang 1, Patrick Breheny 1, Sangin Lee 2, Shuangge Ma 3 and Cun-Hui Zhang 4

More information

Selection of Smoothing Parameter for One-Step Sparse Estimates with L q Penalty

Selection of Smoothing Parameter for One-Step Sparse Estimates with L q Penalty Journal of Data Science 9(2011), 549-564 Selection of Smoothing Parameter for One-Step Sparse Estimates with L q Penalty Masaru Kanba and Kanta Naito Shimane University Abstract: This paper discusses the

More information

Direct Learning: Linear Regression. Donglin Zeng, Department of Biostatistics, University of North Carolina

Direct Learning: Linear Regression. Donglin Zeng, Department of Biostatistics, University of North Carolina Direct Learning: Linear Regression Parametric learning We consider the core function in the prediction rule to be a parametric function. The most commonly used function is a linear function: squared loss:

More information

Forward Regression for Ultra-High Dimensional Variable Screening

Forward Regression for Ultra-High Dimensional Variable Screening Forward Regression for Ultra-High Dimensional Variable Screening Hansheng Wang Guanghua School of Management, Peking University This version: April 9, 2009 Abstract Motivated by the seminal theory of Sure

More information

Orthogonal Matching Pursuit for Sparse Signal Recovery With Noise

Orthogonal Matching Pursuit for Sparse Signal Recovery With Noise Orthogonal Matching Pursuit for Sparse Signal Recovery With Noise The MIT Faculty has made this article openly available. Please share how this access benefits you. Your story matters. Citation As Published

More information

Lecture 14: Variable Selection - Beyond LASSO

Lecture 14: Variable Selection - Beyond LASSO Fall, 2017 Extension of LASSO To achieve oracle properties, L q penalty with 0 < q < 1, SCAD penalty (Fan and Li 2001; Zhang et al. 2007). Adaptive LASSO (Zou 2006; Zhang and Lu 2007; Wang et al. 2007)

More information

MS-C1620 Statistical inference

MS-C1620 Statistical inference MS-C1620 Statistical inference 10 Linear regression III Joni Virta Department of Mathematics and Systems Analysis School of Science Aalto University Academic year 2018 2019 Period III - IV 1 / 32 Contents

More information

Reconstruction from Anisotropic Random Measurements

Reconstruction from Anisotropic Random Measurements Reconstruction from Anisotropic Random Measurements Mark Rudelson and Shuheng Zhou The University of Michigan, Ann Arbor Coding, Complexity, and Sparsity Workshop, 013 Ann Arbor, Michigan August 7, 013

More information

BAGUS: Bayesian Regularization for Graphical Models with Unequal Shrinkage

BAGUS: Bayesian Regularization for Graphical Models with Unequal Shrinkage BAGUS: Bayesian Regularization for Graphical Models with Unequal Shrinkage Lingrui Gan, Naveen N. Narisetty, Feng Liang Department of Statistics University of Illinois at Urbana-Champaign Problem Statement

More information

Sparse Linear Models (10/7/13)

Sparse Linear Models (10/7/13) STA56: Probabilistic machine learning Sparse Linear Models (0/7/) Lecturer: Barbara Engelhardt Scribes: Jiaji Huang, Xin Jiang, Albert Oh Sparsity Sparsity has been a hot topic in statistics and machine

More information

High-dimensional covariance estimation based on Gaussian graphical models

High-dimensional covariance estimation based on Gaussian graphical models High-dimensional covariance estimation based on Gaussian graphical models Shuheng Zhou Department of Statistics, The University of Michigan, Ann Arbor IMA workshop on High Dimensional Phenomena Sept. 26,

More information

Consistent Model Selection Criteria on High Dimensions

Consistent Model Selection Criteria on High Dimensions Journal of Machine Learning Research 13 (2012) 1037-1057 Submitted 6/11; Revised 1/12; Published 4/12 Consistent Model Selection Criteria on High Dimensions Yongdai Kim Department of Statistics Seoul National

More information

Generalized Elastic Net Regression

Generalized Elastic Net Regression Abstract Generalized Elastic Net Regression Geoffroy MOURET Jean-Jules BRAULT Vahid PARTOVINIA This work presents a variation of the elastic net penalization method. We propose applying a combined l 1

More information

Shrinkage Tuning Parameter Selection in Precision Matrices Estimation

Shrinkage Tuning Parameter Selection in Precision Matrices Estimation arxiv:0909.1123v1 [stat.me] 7 Sep 2009 Shrinkage Tuning Parameter Selection in Precision Matrices Estimation Heng Lian Division of Mathematical Sciences School of Physical and Mathematical Sciences Nanyang

More information

Regularization and Variable Selection via the Elastic Net

Regularization and Variable Selection via the Elastic Net p. 1/1 Regularization and Variable Selection via the Elastic Net Hui Zou and Trevor Hastie Journal of Royal Statistical Society, B, 2005 Presenter: Minhua Chen, Nov. 07, 2008 p. 2/1 Agenda Introduction

More information

ESL Chap3. Some extensions of lasso

ESL Chap3. Some extensions of lasso ESL Chap3 Some extensions of lasso 1 Outline Consistency of lasso for model selection Adaptive lasso Elastic net Group lasso 2 Consistency of lasso for model selection A number of authors have studied

More information

Pre-Selection in Cluster Lasso Methods for Correlated Variable Selection in High-Dimensional Linear Models

Pre-Selection in Cluster Lasso Methods for Correlated Variable Selection in High-Dimensional Linear Models Pre-Selection in Cluster Lasso Methods for Correlated Variable Selection in High-Dimensional Linear Models Niharika Gauraha and Swapan Parui Indian Statistical Institute Abstract. We consider variable

More information

Homogeneity Pursuit. Jianqing Fan

Homogeneity Pursuit. Jianqing Fan Jianqing Fan Princeton University with Tracy Ke and Yichao Wu http://www.princeton.edu/ jqfan June 5, 2014 Get my own profile - Help Amazing Follow this author Grace Wahba 9 Followers Follow new articles

More information

High dimensional thresholded regression and shrinkage effect

High dimensional thresholded regression and shrinkage effect J. R. Statist. Soc. B (014) 76, Part 3, pp. 67 649 High dimensional thresholded regression and shrinkage effect Zemin Zheng, Yingying Fan and Jinchi Lv University of Southern California, Los Angeles, USA

More information

VARIABLE SELECTION AND ESTIMATION WITH THE SEAMLESS-L 0 PENALTY

VARIABLE SELECTION AND ESTIMATION WITH THE SEAMLESS-L 0 PENALTY Statistica Sinica 23 (2013), 929-962 doi:http://dx.doi.org/10.5705/ss.2011.074 VARIABLE SELECTION AND ESTIMATION WITH THE SEAMLESS-L 0 PENALTY Lee Dicker, Baosheng Huang and Xihong Lin Rutgers University,

More information

Sparsity Models. Tong Zhang. Rutgers University. T. Zhang (Rutgers) Sparsity Models 1 / 28

Sparsity Models. Tong Zhang. Rutgers University. T. Zhang (Rutgers) Sparsity Models 1 / 28 Sparsity Models Tong Zhang Rutgers University T. Zhang (Rutgers) Sparsity Models 1 / 28 Topics Standard sparse regression model algorithms: convex relaxation and greedy algorithm sparse recovery analysis:

More information

Simultaneous regression shrinkage, variable selection, and supervised clustering of predictors with OSCAR

Simultaneous regression shrinkage, variable selection, and supervised clustering of predictors with OSCAR Simultaneous regression shrinkage, variable selection, and supervised clustering of predictors with OSCAR Howard D. Bondell and Brian J. Reich Department of Statistics, North Carolina State University,

More information

Shrinkage Methods: Ridge and Lasso

Shrinkage Methods: Ridge and Lasso Shrinkage Methods: Ridge and Lasso Jonathan Hersh 1 Chapman University, Argyros School of Business hersh@chapman.edu February 27, 2019 J.Hersh (Chapman) Ridge & Lasso February 27, 2019 1 / 43 1 Intro and

More information

A Blockwise Descent Algorithm for Group-penalized Multiresponse and Multinomial Regression

A Blockwise Descent Algorithm for Group-penalized Multiresponse and Multinomial Regression A Blockwise Descent Algorithm for Group-penalized Multiresponse and Multinomial Regression Noah Simon Jerome Friedman Trevor Hastie November 5, 013 Abstract In this paper we purpose a blockwise descent

More information

MSA220/MVE440 Statistical Learning for Big Data

MSA220/MVE440 Statistical Learning for Big Data MSA220/MVE440 Statistical Learning for Big Data Lecture 9-10 - High-dimensional regression Rebecka Jörnsten Mathematical Sciences University of Gothenburg and Chalmers University of Technology Recap from

More information

Machine Learning Linear Regression. Prof. Matteo Matteucci

Machine Learning Linear Regression. Prof. Matteo Matteucci Machine Learning Linear Regression Prof. Matteo Matteucci Outline 2 o Simple Linear Regression Model Least Squares Fit Measures of Fit Inference in Regression o Multi Variate Regession Model Least Squares

More information

SOLVING NON-CONVEX LASSO TYPE PROBLEMS WITH DC PROGRAMMING. Gilles Gasso, Alain Rakotomamonjy and Stéphane Canu

SOLVING NON-CONVEX LASSO TYPE PROBLEMS WITH DC PROGRAMMING. Gilles Gasso, Alain Rakotomamonjy and Stéphane Canu SOLVING NON-CONVEX LASSO TYPE PROBLEMS WITH DC PROGRAMMING Gilles Gasso, Alain Rakotomamonjy and Stéphane Canu LITIS - EA 48 - INSA/Universite de Rouen Avenue de l Université - 768 Saint-Etienne du Rouvray

More information

CS 4491/CS 7990 SPECIAL TOPICS IN BIOINFORMATICS

CS 4491/CS 7990 SPECIAL TOPICS IN BIOINFORMATICS CS 4491/CS 7990 SPECIAL TOPICS IN BIOINFORMATICS * Some contents are adapted from Dr. Hung Huang and Dr. Chengkai Li at UT Arlington Mingon Kang, Ph.D. Computer Science, Kennesaw State University Problems

More information

CALIBRATING NONCONVEX PENALIZED REGRESSION IN ULTRA-HIGH DIMENSION

CALIBRATING NONCONVEX PENALIZED REGRESSION IN ULTRA-HIGH DIMENSION The Annals of Statistics 2013, Vol. 41, No. 5, 2505 2536 DOI: 10.1214/13-AOS1159 Institute of Mathematical Statistics, 2013 CALIBRATING NONCONVEX PENALIZED REGRESSION IN ULTRA-HIGH DIMENSION BY LAN WANG

More information

On Mixture Regression Shrinkage and Selection via the MR-LASSO

On Mixture Regression Shrinkage and Selection via the MR-LASSO On Mixture Regression Shrinage and Selection via the MR-LASSO Ronghua Luo, Hansheng Wang, and Chih-Ling Tsai Guanghua School of Management, Peing University & Graduate School of Management, University

More information

The lasso. Patrick Breheny. February 15. The lasso Convex optimization Soft thresholding

The lasso. Patrick Breheny. February 15. The lasso Convex optimization Soft thresholding Patrick Breheny February 15 Patrick Breheny High-Dimensional Data Analysis (BIOS 7600) 1/24 Introduction Last week, we introduced penalized regression and discussed ridge regression, in which the penalty

More information

On Model Selection Consistency of Lasso

On Model Selection Consistency of Lasso On Model Selection Consistency of Lasso Peng Zhao Department of Statistics University of Berkeley 367 Evans Hall Berkeley, CA 94720-3860, USA Bin Yu Department of Statistics University of Berkeley 367

More information

Robust Variable Selection Methods for Grouped Data. Kristin Lee Seamon Lilly

Robust Variable Selection Methods for Grouped Data. Kristin Lee Seamon Lilly Robust Variable Selection Methods for Grouped Data by Kristin Lee Seamon Lilly A dissertation submitted to the Graduate Faculty of Auburn University in partial fulfillment of the requirements for the Degree

More information

A Survey of L 1. Regression. Céline Cunen, 20/10/2014. Vidaurre, Bielza and Larranaga (2013)

A Survey of L 1. Regression. Céline Cunen, 20/10/2014. Vidaurre, Bielza and Larranaga (2013) A Survey of L 1 Regression Vidaurre, Bielza and Larranaga (2013) Céline Cunen, 20/10/2014 Outline of article 1.Introduction 2.The Lasso for Linear Regression a) Notation and Main Concepts b) Statistical

More information

Least Absolute Gradient Selector: variable selection via Pseudo-Hard Thresholding

Least Absolute Gradient Selector: variable selection via Pseudo-Hard Thresholding arxiv:204.2353v4 [stat.ml] 9 Oct 202 Least Absolute Gradient Selector: variable selection via Pseudo-Hard Thresholding Kun Yang September 2, 208 Abstract In this paper, we propose a new approach, called

More information

A Modern Look at Classical Multivariate Techniques

A Modern Look at Classical Multivariate Techniques A Modern Look at Classical Multivariate Techniques Yoonkyung Lee Department of Statistics The Ohio State University March 16-20, 2015 The 13th School of Probability and Statistics CIMAT, Guanajuato, Mexico

More information

Regularization: Ridge Regression and the LASSO

Regularization: Ridge Regression and the LASSO Agenda Wednesday, November 29, 2006 Agenda Agenda 1 The Bias-Variance Tradeoff 2 Ridge Regression Solution to the l 2 problem Data Augmentation Approach Bayesian Interpretation The SVD and Ridge Regression

More information

The lasso, persistence, and cross-validation

The lasso, persistence, and cross-validation The lasso, persistence, and cross-validation Daniel J. McDonald Department of Statistics Indiana University http://www.stat.cmu.edu/ danielmc Joint work with: Darren Homrighausen Colorado State University

More information

MA 575 Linear Models: Cedric E. Ginestet, Boston University Regularization: Ridge Regression and Lasso Week 14, Lecture 2

MA 575 Linear Models: Cedric E. Ginestet, Boston University Regularization: Ridge Regression and Lasso Week 14, Lecture 2 MA 575 Linear Models: Cedric E. Ginestet, Boston University Regularization: Ridge Regression and Lasso Week 14, Lecture 2 1 Ridge Regression Ridge regression and the Lasso are two forms of regularized

More information

Institute of Statistics Mimeo Series No Simultaneous regression shrinkage, variable selection and clustering of predictors with OSCAR

Institute of Statistics Mimeo Series No Simultaneous regression shrinkage, variable selection and clustering of predictors with OSCAR DEPARTMENT OF STATISTICS North Carolina State University 2501 Founders Drive, Campus Box 8203 Raleigh, NC 27695-8203 Institute of Statistics Mimeo Series No. 2583 Simultaneous regression shrinkage, variable

More information

Lasso Maximum Likelihood Estimation of Parametric Models with Singular Information Matrices

Lasso Maximum Likelihood Estimation of Parametric Models with Singular Information Matrices Article Lasso Maximum Likelihood Estimation of Parametric Models with Singular Information Matrices Fei Jin 1,2 and Lung-fei Lee 3, * 1 School of Economics, Shanghai University of Finance and Economics,

More information

OWL to the rescue of LASSO

OWL to the rescue of LASSO OWL to the rescue of LASSO IISc IBM day 2018 Joint Work R. Sankaran and Francis Bach AISTATS 17 Chiranjib Bhattacharyya Professor, Department of Computer Science and Automation Indian Institute of Science,

More information

Data Mining Stat 588

Data Mining Stat 588 Data Mining Stat 588 Lecture 02: Linear Methods for Regression Department of Statistics & Biostatistics Rutgers University September 13 2011 Regression Problem Quantitative generic output variable Y. Generic

More information

Tractable Upper Bounds on the Restricted Isometry Constant

Tractable Upper Bounds on the Restricted Isometry Constant Tractable Upper Bounds on the Restricted Isometry Constant Alex d Aspremont, Francis Bach, Laurent El Ghaoui Princeton University, École Normale Supérieure, U.C. Berkeley. Support from NSF, DHS and Google.

More information

MS&E 226: Small Data

MS&E 226: Small Data MS&E 226: Small Data Lecture 6: Model complexity scores (v3) Ramesh Johari ramesh.johari@stanford.edu Fall 2015 1 / 34 Estimating prediction error 2 / 34 Estimating prediction error We saw how we can estimate

More information

High-dimensional regression modeling

High-dimensional regression modeling High-dimensional regression modeling David Causeur Department of Statistics and Computer Science Agrocampus Ouest IRMAR CNRS UMR 6625 http://www.agrocampus-ouest.fr/math/causeur/ Course objectives Making

More information

HIGH-DIMENSIONAL VARIABLE SELECTION WITH THE GENERALIZED SELO PENALTY

HIGH-DIMENSIONAL VARIABLE SELECTION WITH THE GENERALIZED SELO PENALTY Vol. 38 ( 2018 No. 6 J. of Math. (PRC HIGH-DIMENSIONAL VARIABLE SELECTION WITH THE GENERALIZED SELO PENALTY SHI Yue-yong 1,3, CAO Yong-xiu 2, YU Ji-chang 2, JIAO Yu-ling 2 (1.School of Economics and Management,

More information

Effective Linear Discriminant Analysis for High Dimensional, Low Sample Size Data

Effective Linear Discriminant Analysis for High Dimensional, Low Sample Size Data Effective Linear Discriant Analysis for High Dimensional, Low Sample Size Data Zhihua Qiao, Lan Zhou and Jianhua Z. Huang Abstract In the so-called high dimensional, low sample size (HDLSS) settings, LDA

More information

Biostatistics-Lecture 16 Model Selection. Ruibin Xi Peking University School of Mathematical Sciences

Biostatistics-Lecture 16 Model Selection. Ruibin Xi Peking University School of Mathematical Sciences Biostatistics-Lecture 16 Model Selection Ruibin Xi Peking University School of Mathematical Sciences Motivating example1 Interested in factors related to the life expectancy (50 US states,1969-71 ) Per

More information

DISCUSSION OF A SIGNIFICANCE TEST FOR THE LASSO. By Peter Bühlmann, Lukas Meier and Sara van de Geer ETH Zürich

DISCUSSION OF A SIGNIFICANCE TEST FOR THE LASSO. By Peter Bühlmann, Lukas Meier and Sara van de Geer ETH Zürich Submitted to the Annals of Statistics DISCUSSION OF A SIGNIFICANCE TEST FOR THE LASSO By Peter Bühlmann, Lukas Meier and Sara van de Geer ETH Zürich We congratulate Richard Lockhart, Jonathan Taylor, Ryan

More information

Sparse regression. Optimization-Based Data Analysis. Carlos Fernandez-Granda

Sparse regression. Optimization-Based Data Analysis.   Carlos Fernandez-Granda Sparse regression Optimization-Based Data Analysis http://www.cims.nyu.edu/~cfgranda/pages/obda_spring16 Carlos Fernandez-Granda 3/28/2016 Regression Least-squares regression Example: Global warming Logistic

More information

Regularization Path Algorithms for Detecting Gene Interactions

Regularization Path Algorithms for Detecting Gene Interactions Regularization Path Algorithms for Detecting Gene Interactions Mee Young Park Trevor Hastie July 16, 2006 Abstract In this study, we consider several regularization path algorithms with grouped variable

More information

High Dimensional Covariance and Precision Matrix Estimation

High Dimensional Covariance and Precision Matrix Estimation High Dimensional Covariance and Precision Matrix Estimation Wei Wang Washington University in St. Louis Thursday 23 rd February, 2017 Wei Wang (Washington University in St. Louis) High Dimensional Covariance

More information

The Risk of James Stein and Lasso Shrinkage

The Risk of James Stein and Lasso Shrinkage Econometric Reviews ISSN: 0747-4938 (Print) 1532-4168 (Online) Journal homepage: http://tandfonline.com/loi/lecr20 The Risk of James Stein and Lasso Shrinkage Bruce E. Hansen To cite this article: Bruce

More information

Consistent Selection of Tuning Parameters via Variable Selection Stability

Consistent Selection of Tuning Parameters via Variable Selection Stability Journal of Machine Learning Research 14 2013 3419-3440 Submitted 8/12; Revised 7/13; Published 11/13 Consistent Selection of Tuning Parameters via Variable Selection Stability Wei Sun Department of Statistics

More information

Group descent algorithms for nonconvex penalized linear and logistic regression models with grouped predictors

Group descent algorithms for nonconvex penalized linear and logistic regression models with grouped predictors Group descent algorithms for nonconvex penalized linear and logistic regression models with grouped predictors Patrick Breheny Department of Biostatistics University of Iowa Jian Huang Department of Statistics

More information

Semi-Penalized Inference with Direct FDR Control

Semi-Penalized Inference with Direct FDR Control Jian Huang University of Iowa April 4, 2016 The problem Consider the linear regression model y = p x jβ j + ε, (1) j=1 where y IR n, x j IR n, ε IR n, and β j is the jth regression coefficient, Here p

More information

Linear Model Selection and Regularization

Linear Model Selection and Regularization Linear Model Selection and Regularization Recall the linear model Y = β 0 + β 1 X 1 + + β p X p + ɛ. In the lectures that follow, we consider some approaches for extending the linear model framework. In

More information

Biostatistics Advanced Methods in Biostatistics IV

Biostatistics Advanced Methods in Biostatistics IV Biostatistics 140.754 Advanced Methods in Biostatistics IV Jeffrey Leek Assistant Professor Department of Biostatistics jleek@jhsph.edu Lecture 12 1 / 36 Tip + Paper Tip: As a statistician the results

More information

A Short Introduction to the Lasso Methodology

A Short Introduction to the Lasso Methodology A Short Introduction to the Lasso Methodology Michael Gutmann sites.google.com/site/michaelgutmann University of Helsinki Aalto University Helsinki Institute for Information Technology March 9, 2016 Michael

More information