High-dimensional Ordinary Least-squares Projection for Screening Variables

1 / 38 High-dimensional Ordinary Least-squares Projection for Screening Variables. Chenlei Leng, joint work with Xiangyu Wang (Duke). Conference on Nonparametric Statistics for Big Data and Celebration to Honor Professor Grace Wahba, 4-6 June 2014.

2 / 38 The Story: Started working as a project assistant for Grace; intended to work on spline density estimation (but never completed); Thursday group meetings; fortunately graduated!!

3 / 38 Back to the future

4 / 38 Outline: Introduction and motivation; Theory; Simulation and application; Conclusion and future research; An open question.

5 / 38 The setup. Consider the linear regression model y = β_1 x_1 + β_2 x_2 + ... + β_p x_p + ε. With data we write Y = Xβ + ε, where Y ∈ R^n, X ∈ R^{n×p}, and ε ∈ R^n consists of i.i.d. errors. Notation: M = {x_1, ..., x_p} is the full model; M_S is the true model, where S = {j : β_j ≠ 0, j = 1, ..., p} and s = |S|.

6 / 38 Introduction. In high-dimensional data analysis: the dimension p is much larger than the sample size n; the number of important variables s is often much smaller still, s ≪ n; the goal is to identify these important variables. Two approaches. One-stage: selection and estimation, often optimisation-based. Two-stage: screening followed by some one-stage approach.

7 / 38 One-stage methods. Penalised likelihood with a sparsity-inducing penalty: Lasso (Tibshirani, 1996), SCAD (Fan and Li, 2001), elastic net (Zou and Hastie, 2005), group Lasso (Yuan and Lin, 2006), COSSO (Lin and Zhang, 2006), Dantzig selector (Candes and Tao, 2007), and so on. Convex and non-convex optimisation. Different conditions are needed for estimation and selection consistency: difficult to achieve both (Leng, Lin and Wahba, 2006). Subsampling approaches for consistency are computationally intensive.

8 / 38 Two-stage methods. Screen first, refine next. Intuition: choosing a superset is easier than estimating the exact set. Actually widely used: in cancer classification, for example, marginal t tests are routine. Fan and Lv (2008) put forward a theory for marginal screening in linear regression by retaining variables with large marginal correlations (with the response): sure independence screening (SIS). Marginal: generalised to many models including GLMs, Cox's model, GAMs, varying-coefficient models, etc. Correlation: generalised notions of correlation. Alternative iterative procedures: forward selection (Wang, 2009) and tilting (Cho and Fryzlewicz, 2012).

9 / 38 Elements of screening. Two elements. Computational: the key; otherwise we could use optimisation-based approaches such as the Lasso for screening too! Theoretical: the screening property; the superset must contain all the important variables with probability tending to one (the sure screening property). Remark: ideally the sure screening property should hold under general conditions.

10 / 38 Motivation. Let's look at a class of estimates of β of the form β̂ = AY, where A ∈ R^{p×n}. Screening procedure: choose a submodel M_d that retains the d ≪ p largest entries of β̂, M_d = {x_j : |β̂_j| is among the largest d of all |β̂_j|'s}. Ideally, β̂ maintains the rank order of the entries of β: the nonzero entries of β are relatively large in β̂, and the zero entries of β are relatively small in β̂.

11 / 38 Signal-noise analysis. Note β̂ = AY = A(Xβ + ε) = (AX)β + Aε: signal (AX)β plus noise Aε. The noise part is stochastically small. In order for β̂ to preserve the rank order of β, ideally AX = I, or AX ≈ I. This discussion motivated us to use some inverse of X. SIS in Fan and Lv (2008) sets A = X^T and thus β̂ = X^T Y.

12 / 38 Inverse of X. Look for A such that AX ≈ I. When p < n, A = (X^T X)^{-1} X^T gives rise to the OLS estimator. When p > n, take the Moore-Penrose inverse of X, A = X^T (XX^T)^{-1}, unique to high-dimensional data. We use β̂ = X^T (XX^T)^{-1} Y, named the High-dimensional Ordinary Least-squares Projector (HOLP): high-dimensional OLS.
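A minimal sketch of the computation (my own NumPy illustration, not code from the talk; the names holp, X, Y, beta_hat are assumptions): only an n × n linear system is solved, which is what keeps the p ≫ n case cheap.

```python
import numpy as np

def holp(X, Y):
    """HOLP estimator beta_hat = X^T (X X^T)^{-1} Y for an n x p design with p > n."""
    return X.T @ np.linalg.solve(X @ X.T, Y)   # solve the n x n system, never form a p x p inverse

rng = np.random.default_rng(0)
n, p = 50, 1000
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:5] = 5.0                                 # five true signals
Y = X @ beta + rng.standard_normal(n)
beta_hat = holp(X, Y)
print(np.argsort(np.abs(beta_hat))[::-1][:5])  # the five true variables should dominate
```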

13 / 38 Remarks. Write β̂ = X^T (XX^T)^{-1} Xβ + X^T (XX^T)^{-1} ε: HOLP projects β onto the row space of X, whereas OLS projects Y onto the column space of X. Straightforward to implement. Can be computed efficiently, in O(n^2 p), as opposed to O(np) for SIS.

14 / 38 A comparison of the screening matrices. The screening matrix AX in AY = (AX)β + Aε: HOLP uses X^T (XX^T)^{-1} X; SIS uses X^T X. A quick simulation: n = 50, p = 1000, x ~ N(0, Σ). Three setups. Independent: Σ = I. Compound symmetry (CS): σ_jk = 0.6 for j ≠ k. AR(1): σ_jk = 0.995^{|j−k|}.

15 / 38 Screening matrices. [Figure: heatmaps of the screening matrix AX for SIS and HOLP under the Independent, CS, and AR(1) designs.]
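For readers who want to reproduce the comparison behind this figure, here is a hedged sketch under the slide-14 setup (my own code; make_sigma and offdiag_mean are illustrative helpers, not from the talk). It generates X under the three designs and summarises how far AX is from diagonal for SIS and HOLP.

```python
import numpy as np

def make_sigma(p, design, rho):
    """Covariance matrices for the three designs on slide 14."""
    if design == "independent":
        return np.eye(p)
    if design == "cs":                               # compound symmetry
        return rho * np.ones((p, p)) + (1 - rho) * np.eye(p)
    if design == "ar1":                              # AR(1): sigma_jk = rho^{|j-k|}
        idx = np.arange(p)
        return rho ** np.abs(idx[:, None] - idx[None, :])
    raise ValueError(design)

def offdiag_mean(M):
    """Average absolute size of the off-diagonal entries."""
    return np.abs(M - np.diag(np.diag(M))).mean()

rng = np.random.default_rng(1)
n, p = 50, 1000
for design, rho in [("independent", 0.0), ("cs", 0.6), ("ar1", 0.995)]:
    L = np.linalg.cholesky(make_sigma(p, design, rho) + 1e-8 * np.eye(p))
    X = rng.standard_normal((n, p)) @ L.T            # rows of X ~ N(0, Sigma)
    AX_sis = X.T @ X / n                             # SIS screening matrix (scaled)
    AX_holp = X.T @ np.linalg.solve(X @ X.T, X)      # HOLP screening matrix
    print(design, "mean off-diagonal |AX|: SIS",
          round(offdiag_mean(AX_sis), 3), "HOLP", round(offdiag_mean(AX_holp), 4))
```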

16 / 38 Theory. Assumptions: p > n and log p = O(n^γ) for some γ > 0; conditions on the eigenvalues of XΣ^{-1}X^T/p and on the distribution of Σ^{-1/2}x, where Σ = var(x); conditions on the magnitude of the smallest |β_j| for j ∈ S; conditions on s and the condition number of Σ. However, we do not need the marginal correlation assumption, which requires y and the important x_j with j ∈ S to satisfy min_{j∈S} |cov(β_j^{-1} y, x_j)| ≥ c.

17 / 38 Marginal screening. The marginal correlation assumption is vital to all marginal screening approaches. In SIS, AY = X^T Y = X^T Xβ + X^T ε. The SIS signal X^T Xβ ≈ Σβ (up to scaling): β_j nonzero does not imply (Σβ)_j nonzero. For HOLP, X^T (XX^T)^{-1} Xβ ≈ Iβ = β.
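A toy illustration of why the marginal correlation assumption can fail (my own example, not from the slides): with two correlated predictors and β chosen so that (Σβ)_1 = 0, x_1 has a nonzero coefficient yet essentially zero marginal covariance with y, so a marginal screen would miss it.

```python
import numpy as np

rng = np.random.default_rng(2)
n, rho = 100_000, 0.5
Sigma = np.array([[1.0, rho], [rho, 1.0]])
beta = np.array([1.0, -1.0 / rho])               # (Sigma @ beta)[0] == 0 by construction
X = rng.multivariate_normal(np.zeros(2), Sigma, size=n)
y = X @ beta + rng.standard_normal(n)

print("Sigma @ beta =", Sigma @ beta)            # first entry exactly 0
print("empirical cov(y, x_1) ~", np.cov(y, X[:, 0])[0, 1])   # near 0: marginal screening misses x_1
print("joint OLS fit ~", np.linalg.solve(X.T @ X, X.T @ y))  # p < n here, so OLS recovers beta
```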

18 / 38 Theorem 1 (Screening property of HOLP). Under mild conditions, if we choose the submodel size d properly, the M_d chosen by HOLP satisfies P(M_S ⊂ M_d) = 1 − O(exp(−n^{c_1}/log n)). Theorem 2 (Screening consistency of HOLP). Under mild conditions, the HOLP estimator satisfies P(min_{j∈S} |β̂_j| > max_{j∉S} |β̂_j|) = 1 − O(exp(−n^{c_2}/log n)).

19 / 38 Another motivation for HOLP. The ridge regression estimator is β̂(r) = (rI + X^T X)^{-1} X^T Y, where r is the ridge parameter. Letting r → ∞ gives r β̂(r) → X^T Y, the SIS. Letting r → 0 gives β̂(r) → (X^T X)^+ X^T Y. An application of the Sherman-Morrison-Woodbury formula gives (rI + X^T X)^{-1} X^T Y = X^T (rI + XX^T)^{-1} Y. Then letting r → 0 gives (X^T X)^+ X^T Y = X^T (XX^T)^{-1} Y, which is HOLP.
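A quick numerical check of the Sherman-Morrison-Woodbury step and the r → 0 limit (my own sketch; the variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, r = 30, 200, 1e-6
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

lhs = np.linalg.solve(r * np.eye(p) + X.T @ X, X.T @ y)   # p x p solve
rhs = X.T @ np.linalg.solve(r * np.eye(n) + X @ X.T, y)   # n x n solve (much cheaper when p >> n)
holp = X.T @ np.linalg.solve(X @ X.T, y)                  # the r -> 0 limit

print(np.allclose(lhs, rhs))                              # Woodbury identity holds
print(np.max(np.abs(rhs - holp)))                         # small: ridge-HOLP ~ HOLP as r -> 0
```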

20 / 38 Ridge regression. Theorem 3 (Screening consistency of ridge regression). Under mild conditions, with a proper ridge parameter r, the ridge regression estimator satisfies P(min_{j∈S} |β̂_j(r)| > max_{j∉S} |β̂_j(r)|) = 1 − O(exp(−n^{c_3}/log n)). Remark: the theorem holds in particular when the ridge parameter r is fixed. Potential to generalise to GLMs, Cox's model, etc. Ongoing.

21 / 38 Simulation. (p, n) = (1000, 100) or (10000, 200); signal-to-noise ratio R^2 = 0.5 or 0.9. Choices of Σ and β: (i) Independent predictors: β_i = (−1)^{u_i}(|N(0, 1)| + 4 log n/√n), where u_i ~ Bernoulli(0.4) for i ∈ S, and β_i = 0 for i ∉ S. (ii) Compound symmetry: β_i = 5 for i = 1, ..., 5 and β_i = 0 otherwise; ρ = 0.3, 0.6, 0.9. (iii) Autoregressive correlation: β_1 = 3, β_4 = 1.5, β_7 = 2, and β_i = 0 otherwise.

22 / 38 More setups. (iv) Factor model: x_i = Σ_{j=1}^k φ_j f_{ij} + η_i, where the f_{ij}, η_i and φ_j are i.i.d. normal; coefficients as in CS. (v) Group structure: 15 true variables in three groups, x_{j+3m} = z_j + N(0, δ^2), where m = 0, ..., 4, j = 1, 2, 3, and δ^2 is 0.01, 0.05 or 0.1; β_i = 3 for i ≤ 15 and β_i = 0 for i > 15. (vi) Extreme correlation: x_i = (z_i + w_i)/√2 for i = 1, ..., 5 and x_i = (z_i + Σ_{j=1}^5 w_j)/2 for i = 16, ..., p; coefficients as in (ii). The response is more correlated with a large number of unimportant variables. To make it even harder, x_{i+s}, x_{i+2s} = x_i + N(0, 0.01) for i = 1, ..., 5.
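As an illustration of how such designs can be generated, here is a sketch of my reading of setup (v), the group structure; group_design is an assumed helper name, and details (e.g. treating the remaining columns as independent noise) may differ from the authors' implementation.

```python
import numpy as np

def group_design(n, p, delta2=0.05, rng=None):
    """Three latent variables z_1..z_3, each copied five times with N(0, delta^2) perturbations."""
    rng = rng or np.random.default_rng()
    X = rng.standard_normal((n, p))            # columns 16..p treated as independent noise (assumption)
    Z = rng.standard_normal((n, 3))            # latent group variables z_1, z_2, z_3
    for m in range(5):                         # m = 0, ..., 4
        for j in range(3):                     # j = 1, 2, 3  ->  columns j + 3m
            X[:, j + 3 * m] = Z[:, j] + np.sqrt(delta2) * rng.standard_normal(n)
    beta = np.zeros(p)
    beta[:15] = 3.0                            # beta_i = 3 for i <= 15
    y = X @ beta + rng.standard_normal(n)
    return X, y, beta

X, y, beta = group_design(n=100, p=1000, delta2=0.05, rng=np.random.default_rng(4))
```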

23 / 38 (p, n) = (1000, 100): R^2 = 0.5

Example                  HOLP    SIS     ISIS    FR      Tilting
(i) Ind.                 0.570   0.565   0.270   0.370   0.340
(ii) CS, ρ = 0.3         0.150   0.160   0.050   0.005   0.000
(ii) CS, ρ = 0.6         0.005   0.005   0.005   0.000   0.000
(ii) CS, ρ = 0.9         0.000   0.000   0.000   0.000   0.010
(iii) AR(1), ρ = 0.3     0.810   0.810   0.510   0.555   0.525
(iii) AR(1), ρ = 0.6     0.970   0.985   0.560   0.390   0.355
(iii) AR(1), ρ = 0.9     0.990   1.000   0.500   0.185   0.160
(iv) Factor, k = 2       0.295   0.000   0.045   0.135   0.105
(iv) Factor, k = 10      0.060   0.000   0.000   0.000   0.025
(iv) Factor, k = 20      0.010   0.000   0.000   0.000   0.000
(v) Group, δ^2 = 0.1     0.935   0.970   0.000   0.000   0.000
(v) Group, δ^2 = 0.05    0.950   0.970   0.000   0.000   0.000
(v) Group, δ^2 = 0.01    0.960   0.980   0.000   0.000   0.000
(vi) Extreme             0.305   0.000   0.000   0.000   0.020

24 / 38 (p, n) = (1000, 100): R^2 = 0.9

Example                  HOLP    SIS     ISIS    FR      Tilting
(i) Ind.                 0.935   0.910   0.990   1.000   1.000
(ii) CS, ρ = 0.3         0.980   0.855   0.955   1.000   0.990
(ii) CS, ρ = 0.6         0.830   0.260   0.305   0.575   0.490
(ii) CS, ρ = 0.9         0.050   0.010   0.005   0.000   0.050
(iii) AR(1), ρ = 0.3     0.990   0.965   1.000   1.000   1.000
(iii) AR(1), ρ = 0.6     1.000   1.000   1.000   1.000   1.000
(iii) AR(1), ρ = 0.9     1.000   1.000   0.970   0.985   1.000
(iv) Factor, k = 2       0.940   0.015   0.490   0.950   0.960
(iv) Factor, k = 10      0.715   0.000   0.115   0.370   0.455
(iv) Factor, k = 20      0.430   0.000   0.015   0.105   0.225
(v) Group, δ^2 = 0.1     1.000   1.000   0.000   0.000   0.000
(v) Group, δ^2 = 0.05    1.000   1.000   0.000   0.000   0.000
(v) Group, δ^2 = 0.01    1.000   1.000   0.000   0.000   0.000
(vi) Extreme             0.905   0.000   0.000   0.150   0.110

25 / 38 (p, n) = (10000, 200): R^2 = 0.5

Example                  HOLP    SIS     ISIS    FR      Tilting
(i) Ind.                 0.680   0.680   0.620   0.570   –
(ii) CS, ρ = 0.3         0.310   0.310   0.060   0.020   –
(ii) CS, ρ = 0.6         0.010   0.010   0.000   0.000   –
(ii) CS, ρ = 0.9         0.000   0.000   0.000   0.000   –
(iii) AR(1), ρ = 0.3     0.810   0.860   0.740   0.740   –
(iii) AR(1), ρ = 0.6     0.990   0.990   0.580   0.680   –
(iii) AR(1), ρ = 0.9     1.000   1.000   0.480   0.390   –
(iv) Factor, k = 2       0.450   0.010   0.020   0.240   –
(iv) Factor, k = 10      0.050   0.000   0.000   0.010   –
(iv) Factor, k = 20      0.030   0.000   0.000   0.000   –
(v) Group, δ^2 = 0.1     1.000   1.000   0.000   0.000   –
(v) Group, δ^2 = 0.05    0.990   0.990   0.000   0.000   –
(v) Group, δ^2 = 0.01    1.000   1.000   0.000   0.000   –
(vi) Extreme             0.580   0.000   0.000   0.040   –

26 / 38 (p, n) = (10000, 200): R^2 = 0.9

Example                  HOLP    SIS     ISIS    FR      Tilting
(i) Ind.                 0.960   0.960   1.000   1.000   –
(ii) CS, ρ = 0.3         1.000   0.920   1.000   1.000   –
(ii) CS, ρ = 0.6         0.960   0.280   0.420   0.960   –
(ii) CS, ρ = 0.9         0.100   0.000   0.000   0.000   –
(iii) AR(1), ρ = 0.3     0.990   0.990   1.000   1.000   –
(iii) AR(1), ρ = 0.6     1.000   1.000   1.000   1.000   –
(iii) AR(1), ρ = 0.9     1.000   1.000   1.000   1.000   –
(iv) Factor, k = 2       0.980   0.000   0.350   0.990   –
(iv) Factor, k = 10      0.850   0.000   0.060   0.700   –
(iv) Factor, k = 20      0.540   0.000   0.010   0.230   –
(v) Group, δ^2 = 0.1     1.000   1.000   0.000   0.000   –
(v) Group, δ^2 = 0.05    1.000   1.000   0.000   0.000   –
(v) Group, δ^2 = 0.01    1.000   1.000   0.000   0.000   –
(vi) Extreme             1.000   0.000   0.000   0.210   –

27 / 38 A demonstration of Theorems 2 and 3. We set p = 4 [exp(n^{1/3})] for all examples except Example (vi), and p = 20 [exp(n^{1/4})] for Example (vi); s = 1.5 [n^{1/4}] for R^2 = 90% and s = [n^{1/4}] for R^2 = 50%.

28 / 38 Theorem 2. Figure 1: HOLP: P(min_{i∈S} |β̂_i| > max_{i∉S} |β̂_i|) versus the sample size n (100 to 500), with panels for R^2 = 90% and R^2 = 50%, for Examples (i)-(vi).

29 / 38 Theorem 3. Figure 2: ridge-HOLP (r = 10): P(min_{i∈S} |β̂_i(r)| > max_{i∉S} |β̂_i(r)|) versus the sample size n (100 to 500), with panels for R^2 = 90% and R^2 = 50%, for Examples (i)-(vi).

30 / 38 Computational efficiency: varying submodel size. Figure 3: Computational time (seconds) against the submodel size when (p, n) = (1000, 100), comparing Tilting, Forward regression, ISIS, HOLP and SIS; the right panel excludes Tilting.

31 / 38 Computational efficiency: varying p. Figure 4: Computational time (seconds) against the total number of covariates when (d, n) = (50, 100), comparing Tilting, Forward regression, ISIS, HOLP and SIS; the right panel excludes Tilting.

32 / 38 A data analysis. The mammalian eye disease data (Scheetz et al., 2006): gene expression measured on eye tissues from 120 twelve-week-old male F2 rats. The response is the expression of the gene TRIM32, which is responsible for causing Bardet-Biedl syndrome. We focus on the 5000 genes (out of about 19,000) with the highest sample variance.
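A sketch of this pre-processing followed by HOLP screening (my own code; expr and trim32 are hypothetical array names, and the authors' actual data handling may differ):

```python
import numpy as np

def screen_eye_data(expr, trim32, n_keep=5000, d=None):
    """expr: n x p expression matrix (n = 120 rats); trim32: length-n response."""
    top = np.argsort(expr.var(axis=0))[::-1][:n_keep]   # keep the 5000 most variable probes
    X = expr[:, top]
    X = (X - X.mean(0)) / X.std(0)                      # normalise columns
    y = trim32 - trim32.mean()
    beta_hat = X.T @ np.linalg.solve(X @ X.T, y)        # HOLP
    d = d if d is not None else len(y)                  # default submodel size d = n
    keep = np.argsort(np.abs(beta_hat))[::-1][:d]
    return top[keep]                                    # indices into the original probes
```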

33 / 38 Table 1: The 10-fold cross-validation error for nine different methods

Method        Mean error   Standard error   Final size
Lasso         0.015        0.023            5
SCAD          0.019        0.025            6
ISIS-SCAD     0.016        0.023            2
SIS-SCAD      0.014        0.025            4
FR-Lasso      0.016        0.021            2
FR-SCAD       0.018        0.023            3
HOLP-Lasso    0.014        0.016            5
HOLP-SCAD     0.014        0.017            4
Tilting       0.017        0.021            6
NULL          0.021        0.027            0

34 / 38 Table 2: Commonly selected genes for different methods.
Probe ID (gene name): 1376747 (BE107075), 1381902 (Zfp292), 1390539 (BF285569), 1382673 (BE115812).
Lasso: yes yes yes
SCAD: yes yes yes
ISIS-SCAD: yes
SIS-SCAD: yes yes
FR-Lasso: yes yes
FR-SCAD: yes yes yes
HOLP-Lasso: yes yes
HOLP-SCAD: yes yes
Tilting:

35 / 38 Take-home message. For a linear model Y = Xβ + ε with normalised data, variable screening takes three steps. 1. Compute the HOLP estimator β̂ = X^T (XX^T)^{-1} Y. 2. Retain the d variables (usually d = n works) corresponding to the d largest entries of |β̂|. 3. Screening is done! Start thinking about building a refined model based on the remaining d variables.
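A compact sketch of these three steps (my own implementation of the recipe above, with d defaulting to n; holp_screening is an assumed name):

```python
import numpy as np

def holp_screening(X, Y, d=None):
    Xc = (X - X.mean(0)) / X.std(0)                       # normalise the data
    Yc = Y - Y.mean()
    beta_hat = Xc.T @ np.linalg.solve(Xc @ Xc.T, Yc)      # step 1: HOLP estimator
    d = d if d is not None else X.shape[0]                # step 2: usually d = n
    return np.argsort(np.abs(beta_hat))[::-1][:d]         # indices of the retained variables

# Step 3: refit on the retained columns, e.g. with the Lasso or SCAD,
# using X[:, holp_screening(X, Y)].
```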

36 / 38 Conclusion. HOLP is computationally efficient, theoretically appealing, methodologically simple, and generalisable via its ridge version. Future work (ongoing): GLMs, Cox's model, screening for compressed sensing, grouped variable screening, GAMs, ...

37 / 38 An open question

38 / 38 He who teaches me for one day is my father for life. Thank you!