Variable Selection for Highly Correlated Predictors

1 Variable Selection for Highly Correlated Predictors
Fei Xue and Annie Qu
Department of Statistics, University of Illinois at Urbana-Champaign
WHOA-PSI, August 2017, St. Louis, Missouri
1 / 30

2 Background
Variable selection: detect relevant predictors
Important in model building with a large number of predictors:
  Sparsity
  Interpretability
  Model selection consistency
Strong correlations between predictors

3 A motivating example
Gene expression data of 90 Asians from the international Haplotype Map (HapMap) project
Response: the gene CHRNA6 (nicotine addiction)
Potential predictors: 47,292 probes
17,656,192 correlations between potential predictors have absolute value greater than [value missing in source]

4 A motivating example
A correlated subset: 6743 probes
Fig. 1: Correlations between all probes in the subset

5 Traditional variable selection methods
Penalized least squares methods: Lasso (Tibshirani, 1996), adaptive Lasso (Zou, 2006), SCAD (Fan and Li, 2001), elastic net (Zou and Hastie, 2005)
Screening methods: sure independence screening (Fan and Lv, 2008), forward regression (Wang, 2009), forward-lasso adaptive shrinkage (Radchenko and James, 2011)
These methods lack variable selection consistency for strongly correlated data

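6 Failure of irrepresentable condition (Zhao and Yu, 2006)
Weak irrepresentable condition (element-wise): $|C_{21}(C_{11})^{-1}\operatorname{sign}(\beta_{(1)})| < 1$
Example: $C = \begin{pmatrix} C_{11} & C_{12} \\ C_{21} & C_{22} \end{pmatrix}$, where $C_{11}$ and $C_{22}$ have exchangeable structure with correlations $\alpha_1$ and $\alpha_3$ respectively, and $C_{12} = (\alpha_2)_{q \times (p-q)}$.
Failure of the irrepresentable condition $\Rightarrow$ $\alpha_2 > \alpha_1 \left( q \Big/ \left|\sum_{i=1}^{q} \operatorname{sign}(\beta_i)\right| \right)$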
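
The block-exchangeable example can be checked numerically. The following is a minimal sketch, assuming NumPy and using the values $q = 10$, $p = 150$, $\alpha = (0.5, 0.7, 0.9)$ that appear on the next slide; the helper names `block_exchangeable` and `irrepresentable_value` are hypothetical. It also compares the result with the closed form $\alpha_2 |\sum_i \operatorname{sign}(\beta_i)| / (1 + (q-1)\alpha_1)$, which follows from the exchangeable structure of $C_{11}$.

```python
import numpy as np


def block_exchangeable(q, p, a1, a2, a3):
    """Correlation matrix with exchangeable diagonal blocks and a constant cross block."""
    C = np.empty((p, p))
    C[:q, :q] = a1          # relevant-relevant block C11
    C[q:, q:] = a3          # irrelevant-irrelevant block C22
    C[:q, q:] = a2          # cross block C12
    C[q:, :q] = a2          # cross block C21
    np.fill_diagonal(C, 1.0)
    return C


def irrepresentable_value(C, q, signs):
    """Largest entry of |C21 C11^{-1} sign(beta_(1))|."""
    C11, C21 = C[:q, :q], C[q:, :q]
    return np.max(np.abs(C21 @ np.linalg.solve(C11, signs)))


q, p = 10, 150
a1, a2, a3 = 0.5, 0.7, 0.9
signs = np.ones(q)                      # assumption: all relevant coefficients are positive
C = block_exchangeable(q, p, a1, a2, a3)

value = irrepresentable_value(C, q, signs)
closed_form = a2 * abs(signs.sum()) / (1 + (q - 1) * a1)
print(value, closed_form)               # a value of 1 or more means the condition fails
```

With these settings the quantity exceeds 1, so the weak irrepresentable condition does not hold.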

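7 Failure of irrepresentable condition
A sparse linear setting: $y_{n \times 1} = X_{n \times p}\,\beta_{p \times 1} + \varepsilon_{n \times 1}$, with $n = 80$, $p = 150$
$\beta_j = \beta_s$ if $1 \le j \le 10$; $\beta_j = 0$ if $11 \le j \le p$
Block-exchangeable covariance matrix with $\alpha_1 = 0.5$, $\alpha_2 = 0.7$, $\alpha_3 = 0.9$
Table 1: FNR (false negative rate) and FPR (false positive rate) by signal strength. Columns: $\beta_s$ | Lasso (FNR, FPR) | Adaptive Lasso (FNR, FPR) | SCAD (FNR, FPR). The numeric entries are missing in the source.
These methods are unable to identify the signals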

8 Another motivation: Direct effect?
Mediator $= X + \varepsilon_1$
Mediator variable: transmits indirect effects from $X$ and has direct effects on the response variable
Fig. 2: $Y = \text{Mediator} + \varepsilon_2$, conditional independence
Fig. 3: $Y = X + \text{Mediator} + \varepsilon_2$
Example: $X$: sex; Mediator: qualification; $Y$: hiring decision
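
The conditional-independence case (Fig. 2) can be illustrated with a small simulation. This is a sketch under stated assumptions, not part of the slides: standard normal $X$, $\varepsilon_1$, $\varepsilon_2$, unit coefficients, and a sample size chosen only for illustration. $X$ is marginally correlated with $Y$, yet its coefficient in the joint regression on ($X$, Mediator) is close to zero.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20000                                   # illustrative sample size (assumption)

x = rng.normal(size=n)
mediator = x + rng.normal(size=n)           # Mediator = X + eps_1
y = mediator + rng.normal(size=n)           # Y = Mediator + eps_2, no direct effect of X

# Marginal association: X and Y are clearly correlated.
marginal_corr = np.corrcoef(x, y)[0, 1]

# Joint least squares of Y on (X, Mediator): the coefficient of X is near zero.
design = np.column_stack([x, mediator])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)

print("corr(X, Y):", marginal_corr)
print("coefficients of (X, Mediator):", coef)
```

A selection criterion based on marginal association would keep $X$ here, while one based on the conditional (direct) effect would not.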

9 Existing methods dealing with correlated predictors
Nonconvex penalties combined with ridge regression (Wang and Wang, 2014)
Gauss-Lasso selector (Javanmard and Montanari, 2013)
Preconditioning the Lasso (Jia et al., 2015)
PC-simple algorithm (Bühlmann et al., 2010)
  Requires partial faithfulness: if the partial correlation between $Y$ and $X_j$ is nonzero, then the conditional correlation between $Y$ and $X_j$ given any subset of the other predictors is nonzero.

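10 Partial correlation
Partial correlation between $Y$ and $X_j$: $\rho_j = \dfrac{\operatorname{Cov}(Y, X_j \mid X_{-j})}{\sqrt{\operatorname{Var}(Y \mid X_{-j})\,\operatorname{Var}(X_j \mid X_{-j})}}$
Limited range: weakens the strength of signal coefficients
$s_j = \sqrt{\operatorname{Var}(Y \mid X_{-j})}$ is larger for relevant covariates
Standard deviation of $Y$ conditional on $X_{-j}$: $s_j = \sqrt{\dfrac{1}{\sigma^{yy}(1 - \rho_j^2)}} = \sqrt{\dfrac{1}{\sigma^{yy}} + \dfrac{\beta_j^2}{d_{jj}}}$,
where $d_{jj}$ is the $j$th diagonal element of the precision matrix of the predictors, and $\sigma^{yy}$ is the first diagonal element of $\Sigma^{-1} = \operatorname{Cov}(Y, X_1, \ldots, X_p)^{-1}$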
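
The expression for $s_j$ can be derived in a few steps. The derivation below is a sketch that assumes the linear model $Y = \sum_k X_k \beta_k + \varepsilon$ with $\varepsilon$ uncorrelated with the predictors, and uses the standard identities $\operatorname{Var}(X_j \mid X_{-j}) = 1/d_{jj}$ and $\operatorname{Var}(Y \mid X_1, \ldots, X_p) = 1/\sigma^{yy}$.

```latex
\begin{align*}
\operatorname{Cov}(Y, X_j \mid X_{-j}) &= \beta_j \operatorname{Var}(X_j \mid X_{-j}) = \frac{\beta_j}{d_{jj}}, \\
s_j^2 = \operatorname{Var}(Y \mid X_{-j}) &= \beta_j^2 \operatorname{Var}(X_j \mid X_{-j}) + \operatorname{Var}(\varepsilon)
  = \frac{\beta_j^2}{d_{jj}} + \frac{1}{\sigma^{yy}}, \\
\rho_j^2 &= \frac{(\beta_j / d_{jj})^2}{s_j^2 \cdot (1 / d_{jj})} = \frac{\beta_j^2}{d_{jj}\, s_j^2}, \\
1 - \rho_j^2 &= \frac{s_j^2 - \beta_j^2 / d_{jj}}{s_j^2} = \frac{1}{\sigma^{yy} s_j^2}
  \quad\Longrightarrow\quad s_j^2 = \frac{1}{\sigma^{yy}(1 - \rho_j^2)}, \\
\rho_j s_j &= \frac{\beta_j}{\sqrt{d_{jj}}} = \beta_j \sqrt{\operatorname{Var}(X_j \mid X_{-j})}.
\end{align*}
```

The last line shows why multiplying $\rho_j$ by $s_j$ removes the dependence on the conditional variance of $Y$ and leaves a quantity proportional to $\beta_j$.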

11 Semi-standard partial covariance (SPAC)
Semi-standard PArtial Covariance (SPAC) between $Y$ and $X_j$: $\gamma_j = \rho_j s_j = \beta_j \sqrt{\operatorname{Var}(X_j \mid X_{-j})}$
$\gamma_j = 0$ if and only if $\beta_j = 0$
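
The identity $\gamma_j = \rho_j s_j = \beta_j \sqrt{\operatorname{Var}(X_j \mid X_{-j})}$ can be verified numerically at the population level. The sketch below assumes NumPy; the three-predictor covariance matrix, coefficients, and noise variance are made-up illustrative values, and `spac_from_joint` is a hypothetical helper that computes $\rho_j s_j$ from the conditional covariance of $(Y, X_j)$ given the remaining predictors.

```python
import numpy as np

# Illustrative population quantities (assumptions, not from the slides).
cov_x = np.array([[1.0, 0.6, 0.3],
                  [0.6, 1.0, 0.5],
                  [0.3, 0.5, 1.0]])
beta = np.array([1.0, 0.0, -0.5])
sigma2 = 1.0                                     # noise variance

# Joint covariance of (Y, X_1, ..., X_p) under Y = X beta + eps.
cov_xy = cov_x @ beta
var_y = beta @ cov_x @ beta + sigma2
joint = np.block([[np.array([[var_y]]), cov_xy[None, :]],
                  [cov_xy[:, None], cov_x]])


def spac_from_joint(joint, j):
    """rho_j * s_j from the conditional covariance of (Y, X_j) given X_{-j}."""
    p = joint.shape[0] - 1
    keep = [0, j + 1]                            # positions of Y and X_j
    rest = [k + 1 for k in range(p) if k != j]   # positions of X_{-j}
    s_aa = joint[np.ix_(keep, keep)]
    s_ab = joint[np.ix_(keep, rest)]
    s_bb = joint[np.ix_(rest, rest)]
    cond = s_aa - s_ab @ np.linalg.solve(s_bb, s_ab.T)
    rho = cond[0, 1] / np.sqrt(cond[0, 0] * cond[1, 1])
    s = np.sqrt(cond[0, 0])
    return rho * s


d = np.diag(np.linalg.inv(cov_x))                # d_jj, so Var(X_j | X_{-j}) = 1 / d_jj
gamma_direct = beta / np.sqrt(d)
gamma_partial = np.array([spac_from_joint(joint, j) for j in range(3)])
print(gamma_direct)
print(gamma_partial)
```

Both routes give the same vector, and the entry for the predictor with $\beta_j = 0$ is zero.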

12 Partial correlation and SPAC
$Y = \beta_1 X_1 + \varepsilon$; $X_2$ is correlated with $X_1$
Partial correlation: $\cos\omega_1$ and $\cos\omega_2$
Semi-standard PArtial Covariance (SPAC): the projections $\gamma_1$ and $\gamma_2$
Fig. 4: $\gamma_1$ and $\cos\omega_1$
Fig. 5: $\gamma_2$ and $\cos\omega_2$

13 Partial correlation and SPAC
SPAC has an unrestricted range
SPAC incorporates the magnitude of coefficients
SPAC differentiates signals from noise

14 Coefficients and SPAC
$\gamma_j = \rho_j s_j = \beta_j \sqrt{\operatorname{Var}(X_j \mid X_{-j})} = \beta_j \sqrt{1 - R_j^2}$ (the last equality is for standardized $X_j$), where $R_j$ is the multiple correlation coefficient between $X_j$ and all other covariates
Encourages selection of predictors that are important to the response but are not correlated with other covariates
Discourages selection of irrelevant covariates that are highly correlated with relevant predictors

15 SPAC variable selection method
Original penalized least squares function: $L(\beta) = \dfrac{1}{2} \left\| y - \sum_{j=1}^{p} X_j \beta_j \right\|^2 + \sum_{j=1}^{p} p_\lambda(\beta_j)$
Replace the coefficients $\beta$ in the above function by SPACs $\gamma$ ($\gamma_j = \beta_j / \sqrt{d_{jj}}$):
$L(\gamma, \hat d) = \dfrac{1}{2} \left\| y - \sum_{j=1}^{p} X_j \sqrt{\hat d_{jj}}\, \gamma_j \right\|^2 + \sum_{j=1}^{p} \hat d_{jj}\, p_\lambda(\gamma_j)$
Possible choices of the penalty $p_\lambda$: Lasso (SPAC-Lasso), adaptive Lasso (SPAC-ALasso), and SCAD (SPAC-SCAD)
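
With the Lasso penalty, the SPAC objective is an ordinary Lasso on rescaled columns $X_j \sqrt{\hat d_{jj}}$ with penalty weights $\hat d_{jj}$, so it can be minimized by coordinate descent. The following is a minimal sketch, not the authors' implementation: the function names, the fixed number of sweeps, the toy data, and the use of the inverse sample covariance for $\hat d$ (which requires $n > p$) are all assumptions made for illustration. In the high-dimensional settings of the slides, $\hat d$ would need a different precision-matrix estimator.

```python
import numpy as np


def soft_threshold(z, t):
    """Soft-thresholding operator S(z, t) = sign(z) * max(|z| - t, 0)."""
    return np.sign(z) * max(abs(z) - t, 0.0)


def spac_lasso(X, y, d_hat, lam, n_iter=200):
    """Coordinate descent for
    0.5 * ||y - sum_j X_j sqrt(d_j) gamma_j||^2 + lam * sum_j d_j * |gamma_j|."""
    Z = X * np.sqrt(d_hat)                  # rescale column j by sqrt(d_jj)
    col_norm2 = (Z ** 2).sum(axis=0)
    gamma = np.zeros(X.shape[1])
    resid = y.astype(float).copy()          # residual for gamma = 0
    for _ in range(n_iter):
        for j in range(X.shape[1]):
            resid += Z[:, j] * gamma[j]     # remove coordinate j from the current fit
            rho = Z[:, j] @ resid
            gamma[j] = soft_threshold(rho, lam * d_hat[j]) / col_norm2[j]
            resid -= Z[:, j] * gamma[j]     # put the updated coordinate back
    return gamma


# Toy usage (illustrative data): one signal, one highly correlated noise variable.
rng = np.random.default_rng(1)
n, p = 200, 5
X = rng.normal(size=(n, p))
X[:, 1] = 0.8 * X[:, 0] + 0.6 * rng.normal(size=n)
beta = np.array([1.0, 0.0, 0.0, 0.0, 0.0])
y = X @ beta + 0.5 * rng.normal(size=n)

d_hat = np.diag(np.linalg.inv(np.cov(X, rowvar=False)))   # assumption: n > p
gamma_hat = spac_lasso(X, y, d_hat, lam=20.0)
beta_hat = gamma_hat * np.sqrt(d_hat)       # back-transform: beta_j = gamma_j * sqrt(d_jj)
print(gamma_hat)
print(beta_hat)
```

Swapping the soft-thresholding step for an adaptive Lasso or SCAD update would give the other two variants named above; that substitution is not shown here.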

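16 Example: the block-exchangeable case
$C$ is block-exchangeable
$q_n \to \infty$ and $p_n - q_n \to \infty$ as $n \to \infty$, where $p_n$ is the number of all predictors and $q_n$ is the number of relevant predictors
$m_n = \left|\sum_{i=1}^{q_n} \operatorname{sign}(\gamma_i)\right|$; $L$: limit inferior of $q_n / m_n$
Proposition 1: If $|C_{21}(C_{11})^{-1}\operatorname{sign}(\beta_{(1)})| \ge 1$ (the irrepresentable conditions do not hold), then $\alpha_2 \ge \alpha_1 L \ge \alpha_1$.
That is, the irrepresentable conditions fail only when $\alpha_2$ is at least as large as $\alpha_1$, as in settings with $\alpha_3 \ge \alpha_2 > \alpha_1$.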

17 Example: the block-exchangeable case
Proposition 2: If there exists a positive constant $\eta$ such that
$\alpha_2 < (1 - \eta)\, \alpha_1 L \sqrt{\dfrac{1 - \alpha_1}{1 - \alpha_3}}$,   (1)
then the SPAC-Lasso is strongly sign consistent when $q_n$ and $p_n - q_n$ increase as $n \to \infty$.
The SPAC-Lasso can still be strongly sign consistent when the Lasso does NOT have variable selection consistency
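
To see how much room condition (1) leaves beyond the Lasso threshold $\alpha_1 L$ from Proposition 1, the two bounds can be compared for a concrete setting. This is a small sketch using the values $\alpha = (0.5, 0.7, 0.9)$ from the earlier slide; $L = 1$ (all relevant coefficients share a sign) and $\eta = 0.1$ are assumptions chosen for illustration, and the function names are hypothetical.

```python
import math


def lasso_threshold(a1, L):
    """Proposition 1: failure of the irrepresentable condition implies a2 >= a1 * L."""
    return a1 * L


def spac_lasso_bound(a1, a3, L, eta):
    """Right-hand side of condition (1) in Proposition 2."""
    return (1 - eta) * a1 * L * math.sqrt((1 - a1) / (1 - a3))


a1, a2, a3 = 0.5, 0.7, 0.9
L = 1.0        # assumption: all relevant coefficients have the same sign
eta = 0.1      # assumption: illustrative margin

in_lasso_failure_region = a2 >= lasso_threshold(a1, L)
satisfies_condition_1 = a2 < spac_lasso_bound(a1, a3, L, eta)
print(lasso_threshold(a1, L), spac_lasso_bound(a1, a3, L, eta))
print(in_lasso_failure_region, satisfies_condition_1)
```

Because $\alpha_3 > \alpha_1$ here, the square-root factor is larger than 1, so condition (1) admits values of $\alpha_2$ above $\alpha_1 L$.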

18 Transformed Strong Irrepresentable Condition
Original Strong Irrepresentable Condition: there exists a positive constant vector $\eta$ such that
$|C_{21}(C_{11})^{-1}\operatorname{sign}(\beta_{(1)})| \le \mathbf{1} - \eta$
Transformed Strong Irrepresentable Condition: there exists a positive constant vector $\eta$ such that
$|V(2)\, C_{21}(C_{11})^{-1}\, V(1)^{-1} \operatorname{sign}(\gamma_{(1)})| \le \mathbf{1} - \eta$,
where $V(1) = \operatorname{diag}\{1/\sqrt{d_{11}}, \ldots, 1/\sqrt{d_{qq}}\}$ and $V(2) = \operatorname{diag}\{1/\sqrt{d_{q+1,q+1}}, \ldots, 1/\sqrt{d_{pp}}\}$
Incorporates more situations with highly correlated predictors
Covers $C$ with larger correlations between relevant and irrelevant predictors than correlations between relevant predictors
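
The two conditions can be compared on a finite block-exchangeable matrix. The sketch below assumes NumPy and reuses $q = 10$, $p = 150$, $\alpha = (0.5, 0.7, 0.9)$ from the earlier slide, with all relevant signs positive (an assumption). It uses the population precision matrix for $d_{jj}$ rather than an estimate.

```python
import numpy as np

q, p = 10, 150
a1, a2, a3 = 0.5, 0.7, 0.9

# Block-exchangeable correlation matrix C.
C = np.full((p, p), a3)
C[:q, :q] = a1
C[:q, q:] = a2
C[q:, :q] = a2
np.fill_diagonal(C, 1.0)

signs = np.ones(q)                          # assumption: sign(beta_(1)) = sign(gamma_(1)) = +1
d = np.diag(np.linalg.inv(C))               # diagonal of the precision matrix
C11, C21 = C[:q, :q], C[q:, :q]

# Original condition: |C21 C11^{-1} sign(beta_(1))|
original = np.abs(C21 @ np.linalg.solve(C11, signs))

# Transformed condition: |V(2) C21 C11^{-1} V(1)^{-1} sign(gamma_(1))|,
# where V(1)^{-1} = diag(sqrt(d_jj)) for j <= q and V(2) = diag(1/sqrt(d_jj)) for j > q.
transformed = np.abs(
    (C21 @ np.linalg.solve(C11, np.sqrt(d[:q]) * signs)) / np.sqrt(d[q:])
)

print(original.max(), transformed.max())
```

In this setting the original quantity is above 1 while the transformed one stays below 1, which matches the claim that the transformed condition covers more highly correlated designs.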

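19 General theoretical results for the SPAC-Lasso
Theorem 1: Let $\hat d = \{\hat d_{11}, \ldots, \hat d_{p_n p_n}\}$. Under the Transformed Strong Irrepresentable Condition and regularity conditions, there exists a constant $M_0$ such that, for any $\delta > 0$, the following holds with probability at least $1 - O(n^{-\delta})$:
(1) There exists a solution $\hat\gamma = \hat\gamma(\lambda_n, \hat d)$
(2) Strong sign consistency: $\hat\gamma =_s \gamma$, i.e. $\operatorname{sign}(\hat\gamma) = \operatorname{sign}(\gamma)$
(3) Estimation consistency: $\|\hat\gamma - \gamma\|_2 \le M_0 \sqrt{q_n}\, \lambda_n$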

20 Simulation studies
$Y = X\beta + N(0, \sigma^2 I_n)$, with $n = 100$, $p = 200$, $q = 10$
$\beta = (\underbrace{\beta_s, \ldots, \beta_s}_{q}, \underbrace{0, \ldots, 0}_{p-q})$, where $\beta_s$ ranges from 0.1 to 1
$C$ is block-exchangeable with $\alpha = (\alpha_1, \alpha_2, \alpha_3)^T$
  $\alpha_1$: correlation between relevant predictors
  $\alpha_2$: correlation between relevant and irrelevant predictors
  $\alpha_3$: correlation between irrelevant predictors
$\lambda$ is tuned by the BIC
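
One way to generate a data set from this design is sketched below. It assumes NumPy, Gaussian predictors, and $\sigma = 1$, none of which are stated on the slide; the function name `simulate` and the default $\alpha = (0.3, 0.5, 0.8)$ (one of the settings shown in the later figures) are likewise illustrative.

```python
import numpy as np


def simulate(n=100, p=200, q=10, beta_s=0.3, alpha=(0.3, 0.5, 0.8), sigma=1.0, seed=0):
    """Draw (X, y) from the sparse linear model with block-exchangeable C."""
    a1, a2, a3 = alpha
    C = np.full((p, p), a3)                 # irrelevant-irrelevant correlations
    C[:q, :q] = a1                          # relevant-relevant correlations
    C[:q, q:] = a2                          # relevant-irrelevant correlations
    C[q:, :q] = a2
    np.fill_diagonal(C, 1.0)

    rng = np.random.default_rng(seed)
    chol = np.linalg.cholesky(C)            # raises LinAlgError if C is not positive definite
    X = rng.normal(size=(n, p)) @ chol.T    # rows are N(0, C) under the Gaussian assumption
    beta = np.concatenate([np.full(q, beta_s), np.zeros(p - q)])
    y = X @ beta + sigma * rng.normal(size=n)
    return X, y, beta, C


X, y, beta, C = simulate()
print(X.shape, y.shape, int(np.count_nonzero(beta)))
```

The Cholesky step doubles as a check that the chosen $(\alpha_1, \alpha_2, \alpha_3)$ gives a valid correlation matrix for the given $p$ and $q$.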

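21 Simulation results
FNR: false negative rate; FPR: false positive rate
Ratio: (FNR + FPR) of the existing method / (FNR + FPR) of the corresponding proposed method
Table 2: $\alpha_1 = 0.3$, $\alpha_2 = 0.5$, $\alpha_3 =$ [value missing in source]
Upper panel columns: $\beta_s$ | Lasso (FNR, FPR) | SPAC-Lasso (FNR, FPR) | Ratio | ALasso (FNR, FPR) | SPAC-ALasso (FNR, FPR) | Ratio
Lower panel columns: $\beta_s$ | SCAD (FNR, FPR) | SPAC-SCAD (FNR, FPR) | Ratio | PC-simple (FNR, FPR) | SPAC-ALasso (FNR, FPR) | Ratio
The numeric entries are missing in the source; the only surviving entry is a Ratio of ">50".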
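
The three reported quantities can be computed as follows. This sketch assumes the usual definitions, which the slide does not spell out: FNR is the share of truly relevant predictors that were not selected, and FPR is the share of truly irrelevant predictors that were selected. The function names and the toy coefficient vectors are illustrative.

```python
import numpy as np


def fnr_fpr(beta_true, beta_hat, tol=0.0):
    """False negative and false positive rates of the selected support."""
    relevant = beta_true != 0
    selected = np.abs(beta_hat) > tol
    fnr = np.mean(~selected[relevant])      # relevant predictors that were missed
    fpr = np.mean(selected[~relevant])      # irrelevant predictors that were selected
    return fnr, fpr


def ratio(existing, proposed):
    """(FNR + FPR) of the existing method over (FNR + FPR) of the proposed method."""
    return sum(existing) / sum(proposed)


beta_true = np.array([1.0, 1.0, 0.0, 0.0, 0.0, 0.0])
hat_existing = np.array([0.9, 0.0, 0.2, 0.0, 0.0, 0.0])   # misses one signal, adds one noise variable
hat_proposed = np.array([0.8, 0.7, 0.0, 0.0, 0.0, 0.1])   # finds both signals, adds one noise variable

existing = fnr_fpr(beta_true, hat_existing)
proposed = fnr_fpr(beta_true, hat_proposed)
print(existing, proposed, ratio(existing, proposed))
```

A ratio above 1 favors the proposed method. The ratio is undefined when the proposed method makes no errors at all, so a real evaluation would need to handle a zero denominator.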

22 Simulation results
Fig. 6: $\beta_s = 0.3$, $\alpha_1 = 0.3$, $\alpha_2 = 0.5$, $\alpha_3 =$ [value missing in source]

23 Simulation results
False Positive Rate + False Negative Rate
Fig. 7: $\alpha_1 = 0.3$, $\alpha_2 = 0.5$, $\alpha_3 = 0.8$
Fig. 8: $\alpha_1 = 0.5$, $\alpha_2 = 0.7$, $\alpha_3 =$ [value missing in source]

24 Simulation summary
The SPAC methods produce smaller FNRs and FPRs (with smaller variation) than traditional penalty-based methods in all the settings
The SPAC-ALasso outperforms the PC-simple algorithm
Under highly correlated settings ($\alpha_2 = 0.7$), SPAC methods perform significantly better than traditional methods when signals are strong

25 Real data
Gene expression data of 90 Asians from the international HapMap project (ftp://ftp.sanger.ac.uk/pub/genevar/)
A highly correlated subset: 6743 probes
Randomly split the data into a training set (90%) and a testing set (10%), repeated 100 times
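
The repeated 90/10 splitting scheme can be sketched as below. The code assumes NumPy; the function name `repeated_split_pmse`, the synthetic data, and the ordinary least squares placeholder are illustrative only. In the actual analysis the `fit` and `predict` arguments would be one of the selection methods under comparison, and the data would be the gene expression matrix.

```python
import numpy as np


def repeated_split_pmse(X, y, fit, predict, n_splits=100, test_frac=0.10, seed=0):
    """Prediction mean squared error on the test set over repeated random splits."""
    rng = np.random.default_rng(seed)
    n = len(y)
    n_test = max(1, int(round(test_frac * n)))
    pmse = []
    for _ in range(n_splits):
        perm = rng.permutation(n)
        test, train = perm[:n_test], perm[n_test:]
        model = fit(X[train], y[train])
        pmse.append(np.mean((y[test] - predict(model, X[test])) ** 2))
    return np.array(pmse)


# Placeholder estimator (assumption): ordinary least squares.
def fit_ols(X, y):
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


def predict_ols(coef, X):
    return X @ coef


# Synthetic stand-in for the 90 observations.
rng = np.random.default_rng(2)
X = rng.normal(size=(90, 3))
y = X @ np.array([1.0, 0.0, -1.0]) + 0.1 * rng.normal(size=90)

scores = repeated_split_pmse(X, y, fit_ols, predict_ols)
print(scores.mean(), scores.std())
```

With 90 observations and a 10% test fraction, each split holds out 9 observations.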

26 Real data
Means of the number of selected probes (NS) and of the prediction mean squared error (PMSE) for the testing set
The proposed methods select fewer probes and have smaller prediction error than the corresponding original methods
Table 3: Columns: Lasso | SPAC-L | ALasso | SPAC-AL | SCAD | SPAC-SCAD | PCL. Rows: Mean of NS, SD of NS, Mean of PMSE. The numeric entries are missing in the source.
PCL: the PC-simple algorithm with Lasso; SPAC-L: the SPAC with Lasso penalty; SPAC-AL: the SPAC with adaptive Lasso penalty

27 Real data
Apply SPAC-Lasso to all observations
Fig. 9: Correlations between relevant probes and irrelevant probes based on the SPAC-Lasso.

28 Conclusions
SPAC reduces correlation effects from other predictors
Compared with partial correlation, SPAC incorporates the magnitude of coefficients
SPAC facilitates choosing predictors with direct association with the response variable
Asymptotic theory: the SPAC-Lasso has model selection consistency for highly correlated data
Numerical studies: the SPAC method outperforms existing competing methods with other penalty functions for highly correlated data

29 The end
Thank You!

30 References
Bühlmann, P., Kalisch, M., and Maathuis, M. H. (2010). Variable selection in high-dimensional linear models: partially faithful distributions and the PC-simple algorithm. Biometrika, 97(2).
Fan, J. and Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456).
Fan, J. and Lv, J. (2008). Sure independence screening for ultrahigh dimensional feature space. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70(5).
Javanmard, A. and Montanari, A. (2013). Model selection for high-dimensional regression under the generalized irrepresentability condition. Advances in Neural Information Processing Systems.
Jia, J., Rohe, K., et al. (2015). Preconditioning the lasso for sign consistency. Electronic Journal of Statistics, 9(1).
Radchenko, P., James, G. M., et al. (2011). Improved variable selection with forward-lasso adaptive shrinkage. The Annals of Applied Statistics, 5(1).
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1).
Wang, H. (2009). Forward regression for ultra-high dimensional variable screening. Journal of the American Statistical Association, 104(488).
Wang, X. and Leng, C. (2015). High dimensional ordinary least squares projection for screening variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 78(3).
Wang, X. and Wang, M. (2014). Combination of nonconvex penalties and ridge regression for high-dimensional linear models. Journal of Mathematical Research with Applications, 34(6).
Zhang, C.-H. (2010). Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics, 38(2).
Zhao, P. and Yu, B. (2006). On model selection consistency of lasso. Journal of Machine Learning Research, 7(Nov).
Zou, H. (2006). The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101(476).
Zou, H. and Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2).

Variable Selection for Highly Correlated Predictors. Fei Xue and Annie Qu. arXiv:1709.04840v1 [stat.ME], 14 Sep 2017. Abstract: Penalty-based variable selection methods are powerful in selecting relevant covariates