1 / 38 High-dimensional Ordinary Least-squares Projection for Screening Variables
Chenlei Leng, joint with Xiangyu Wang (Duke)
Conference on Nonparametric Statistics for Big Data and Celebration to Honor Professor Grace Wahba, 4-6 June, 2014
2 / 38 The Story
- Started working as a project assistant for Grace;
- Intended to work on spline density estimation (but never completed);
- Thursday group meetings;
- Fortunately graduated!!
Back to the future 3 / 38
4 / 38 Outline
- Introduction and motivation
- Theory
- Simulation and application
- Conclusion and future research
- An open question
5 / 38 The setup
Consider the linear regression model
$y = \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \varepsilon$.
With data we write $Y = X\beta + \varepsilon$, where $Y \in \mathbb{R}^n$, $X \in \mathbb{R}^{n \times p}$, and $\varepsilon \in \mathbb{R}^n$ consists of i.i.d. errors.
Notation:
- $\mathcal{M} = \{x_1, \ldots, x_p\}$ is the full model;
- $\mathcal{M}_S$ is the true model, where $S = \{j : \beta_j \neq 0,\ j = 1, \ldots, p\}$ with $s = |S|$.
6 / 38 Introduction
In high-dimensional data analysis:
- The dimension $p$ is much larger than the sample size $n$;
- The number of important variables $s$ is often much smaller still: $s \ll n$;
- The goal is to identify these important variables.
Two approaches:
- One-stage: selection and estimation together, often optimisation based;
- Two-stage: screening followed by some one-stage approach.
7 / 38 One-stage methods
- Penalised likelihood with a sparsity-inducing penalty.
- Lasso (Tibshirani, 1996), SCAD (Fan and Li, 2001), elastic net (Zou and Hastie, 2005), grouped Lasso (Yuan and Lin, 2006), COSSO (Lin and Zhang, 2006), Dantzig selector (Candes and Tao, 2007), and so on.
- Convex and non-convex optimisation.
- Different conditions for estimation and selection consistency: difficult to achieve both (Leng, Lin and Wahba, 2006).
- Subsampling approaches for consistency: computationally intensive.
8 / 38 Two-stage methods
- Screen first, refine next.
- Intuition: choosing a superset is easier than estimating the exact set.
- Actually widely used: in cancer classification, marginal t tests for example.
- Fan and Lv (2008) put forward a theory for marginal screening in linear regression by retaining variables with large marginal correlations (with the response): sure independence screening (SIS).
- Marginal: generalised to many models including GLMs, Cox's model, GAMs, varying-coefficient models, etc.
- Correlation: generalised notions of correlation.
- Alternative iterative procedures: forward selection (Wang, 2009) and tilting (Cho and Fryzlewicz, 2012).
9 / 38 Elements of screening
Two elements:
- Computational: key; otherwise we can use optimisation-based approaches such as the Lasso for screening too!
- Theoretical: the screening property; the superset must contain the important variables (with probability tending to one), the sure screening property.
Remark: ideally the sure screening property should hold under general conditions.
10 / 38 Motivation
Let's look at a class of estimates of $\beta$ of the form
$\tilde\beta = AY$, where $A \in \mathbb{R}^{p \times n}$.
Screening procedure: choose the submodel $\mathcal{M}_d$ that retains the $d \ll p$ variables with the largest entries of $\tilde\beta$,
$\mathcal{M}_d = \{x_j : |\tilde\beta_j| \text{ is among the largest } d \text{ of all } |\tilde\beta_j|\text{'s}\}$.
Ideally, $\tilde\beta$ maintains the rank order of the entries of $\beta$:
- the nonzero entries of $\beta$ are relatively large in $\tilde\beta$;
- the zero entries of $\beta$ are relatively small in $\tilde\beta$.
11 / 38 Signal-noise analysis
Note that
$\tilde\beta = AY = A(X\beta + \varepsilon) = (AX)\beta + A\varepsilon$,
i.e. signal $(AX)\beta$ plus noise $A\varepsilon$.
- The noise part is stochastically small.
- In order for $\tilde\beta$ to preserve the rank order of $\beta$, ideally $AX = I$, or $AX \approx I$.
- The above discussion motivated us to use some inverse of $X$.
- The SIS in Fan and Lv (2008) sets $A = X^T$ and thus $\tilde\beta = X^T Y$.
12 / 38 Inverse of X
Look for $A$ such that $AX \approx I$.
- When $p < n$, $A = (X^T X)^{-1} X^T$ gives rise to the OLS estimator.
- When $p > n$, the Moore-Penrose inverse of $X$ gives $A = X^T (XX^T)^{-1}$, unique to high-dimensional data.
We use
$\hat\beta = X^T (XX^T)^{-1} Y$,
named the High-dimensional Ordinary Least-squares Projector (HOLP): high-d OLS.
13 / 38 Remarks
Write $\hat\beta = X^T (XX^T)^{-1} X\beta + X^T (XX^T)^{-1} \varepsilon$:
- HOLP projects $\beta$ onto the row space of $X$; OLS projects $Y$ onto the column space of $X$.
- Straightforward to implement.
- Can be efficiently computed: $O(n^2 p)$, as opposed to $O(np)$ of SIS. (A small timing sketch follows below.)
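To make the cost remark concrete, here is a rough timing sketch (the sizes and code are our own illustration, not from the talk): HOLP only needs an $n \times n$ solve plus one matrix-vector product, so it never forms a $p \times p$ object.

```python
# Rough timing illustration (sizes assumed): HOLP needs only an n x n solve
# plus one matrix-vector product, so no p x p object is ever formed.
import time
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 10000
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

t0 = time.perf_counter()
beta_holp = X.T @ np.linalg.solve(X @ X.T, y)   # HOLP: X'(XX')^{-1} y, O(n^2 p)
t1 = time.perf_counter()
beta_sis = X.T @ y                              # SIS: a single O(np) matvec
t2 = time.perf_counter()
print(f"HOLP: {t1 - t0:.3f}s   SIS: {t2 - t1:.3f}s")
```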
14 / 38 A comparison of the screening matrices
The screening matrix $AX$ in $AY = (AX)\beta + A\varepsilon$:
- HOLP: $X^T (XX^T)^{-1} X$
- SIS: $X^T X$
A quick simulation: $n = 50$, $p = 1000$, $x \sim N(0, \Sigma)$. Three setups:
- Independent: $\Sigma = I$
- CS: $\sigma_{jk} = 0.6$ for $j \neq k$
- AR(1): $\sigma_{jk} = 0.995^{|j-k|}$
15 / 38 Screening matrices
[Figure: heatmaps of the screening matrix $AX$ for SIS (panels: Ind, CS, AR(1)) and HOLP (panels: Ind, CS, AR(1)).]
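The comparison can be reproduced in a few lines; the sketch below (our own code, using the slide's $n = 50$, $p = 1000$ and the three covariance setups) computes both screening matrices and summarises how close each is to a diagonal matrix via an off-diagonal to diagonal magnitude ratio.

```python
# Sketch of the slide-14 comparison (our code; n = 50, p = 1000, x ~ N(0, Sigma)):
# compare the screening matrices AX for SIS (X'X) and HOLP (X'(XX')^{-1}X).
import numpy as np

def make_sigma(p, kind):
    if kind == "independent":
        return np.eye(p)
    if kind == "cs":                               # compound symmetry, 0.6 off-diagonal
        return 0.6 * np.ones((p, p)) + 0.4 * np.eye(p)
    if kind == "ar1":                              # AR(1) with rho = 0.995
        idx = np.arange(p)
        return 0.995 ** np.abs(idx[:, None] - idx[None, :])
    raise ValueError(kind)

rng = np.random.default_rng(0)
n, p = 50, 1000
for kind in ["independent", "cs", "ar1"]:
    Sigma = make_sigma(p, kind)
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    sis = X.T @ X                                  # SIS screening matrix
    holp = X.T @ np.linalg.solve(X @ X.T, X)       # HOLP screening matrix
    off = ~np.eye(p, dtype=bool)
    ratio = lambda M: np.abs(M[off]).mean() / np.abs(np.diag(M)).mean()
    print(f"{kind:12s}  SIS off/diag: {ratio(sis):.3f}   HOLP off/diag: {ratio(holp):.3f}")
```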
16 / 38 Theory
Assumptions:
- $p > n$ and $\log p = O(n^\gamma)$ for some $\gamma > 0$.
- Conditions on the eigenvalues of $X\Sigma^{-1}X^T/p$ and the distribution of $\Sigma^{-1/2}x$, where $\Sigma = \mathrm{var}(x)$.
- Conditions on the magnitude of the smallest $|\beta_j|$ for $j \in S$.
- Conditions on $s$ and the condition number of $\Sigma$.
However, we don't need the marginal correlation assumption, which requires $y$ and the important $x_j$ with $j \in S$ to satisfy
$\min_{j \in S} |\mathrm{cov}(\beta_j^{-1} y, x_j)| \geq c$.
17 / 38 Marginal screening
The marginal correlation assumption is vital to all marginal screening approaches.
In SIS, $AY = X^T Y = X^T X\beta + X^T \varepsilon$.
- The SIS signal $X^T X\beta$ behaves like a multiple of $\Sigma\beta$: $\beta_j$ nonzero doesn't imply $(\Sigma\beta)_j$ nonzero.
- For HOLP, $X^T (XX^T)^{-1} X\beta \approx I\beta = \beta$.
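As a concrete (hypothetical) illustration of why this matters, consider two correlated important variables with coefficients chosen so that $(\Sigma\beta)_1 = 0$: the marginal signal for $x_1$ vanishes even though $\beta_1 \neq 0$, while the HOLP signal does not. The construction below is our own, not from the talk.

```python
# Hypothetical two-signal example (our construction): beta is chosen so that
# (Sigma beta)_1 = 0, so x_1 has no marginal signal although beta_1 != 0.
import numpy as np

rng = np.random.default_rng(1)
n, p = 100, 300
Sigma = np.eye(p)
Sigma[0, 1] = Sigma[1, 0] = 0.5            # only x_1 and x_2 are correlated
beta = np.zeros(p)
beta[0], beta[1] = 1.0, -2.0               # true model S = {1, 2}
print((Sigma @ beta)[:2])                  # [ 0.  -1.5]: marginal signal of x_1 is zero

X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
y = X @ beta + rng.normal(size=n)
sis_score = np.abs(X.T @ y)                             # SIS ranks by |X'y|
holp_score = np.abs(X.T @ np.linalg.solve(X @ X.T, y))  # HOLP ranks by |X'(XX')^{-1}y|
print("rank of x_1 under SIS :", 1 + int(np.sum(sis_score > sis_score[0])))
print("rank of x_1 under HOLP:", 1 + int(np.sum(holp_score > holp_score[0])))
```

In runs like this, the marginal screener tends to rank $x_1$ deep among the noise variables, whereas HOLP keeps it near the top.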
18 / 38 Theorem 1 (Screening property of HOLP)
Under mild conditions, if we choose the submodel size $d \ll p$ properly, the $\mathcal{M}_d$ chosen by HOLP satisfies
$P(\mathcal{M}_S \subset \mathcal{M}_d) = 1 - O\{\exp(-c_1 n / \log n)\}$.
Theorem 2 (Screening consistency of HOLP)
Under mild conditions, the HOLP estimator satisfies
$P\left(\min_{j \in S} |\hat\beta_j| > \max_{j \notin S} |\hat\beta_j|\right) = 1 - O\{\exp(-c_2 n / \log n)\}$.
19 / 38 Another motivation for HOLP
The ridge regression estimator is
$\hat\beta(r) = (rI + X^T X)^{-1} X^T Y$, where $r$ is the ridge parameter.
- Letting $r \to \infty$ gives $r\hat\beta(r) \to X^T Y$, the SIS.
- Letting $r \to 0$ gives $\hat\beta(r) \to (X^T X)^{+} X^T Y$.
An application of the Sherman-Morrison-Woodbury formula gives
$(rI + X^T X)^{-1} X^T Y = X^T (rI + XX^T)^{-1} Y$.
Then letting $r \to 0$ gives
$(X^T X)^{+} X^T Y = X^T (XX^T)^{-1} Y$,
which gives us HOLP.
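A quick numerical sanity check of the two displays above (the random design and sizes are our own assumptions): the Sherman-Morrison-Woodbury form agrees with the direct ridge solution, and as $r \to 0$ it approaches HOLP.

```python
# Numerical sanity check (random design assumed) of the two displays above:
# the SMW identity for ridge regression, and its r -> 0 limit being HOLP.
import numpy as np

rng = np.random.default_rng(2)
n, p = 50, 200
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

r = 0.5
ridge_pxp = np.linalg.solve(r * np.eye(p) + X.T @ X, X.T @ y)   # (rI + X'X)^{-1} X'y
ridge_nxn = X.T @ np.linalg.solve(r * np.eye(n) + X @ X.T, y)   # X'(rI + XX')^{-1} y
print(np.allclose(ridge_pxp, ridge_nxn))                        # SMW identity holds

holp = X.T @ np.linalg.solve(X @ X.T, y)                        # X'(XX')^{-1} y
tiny_r = X.T @ np.linalg.solve(1e-8 * np.eye(n) + X @ X.T, y)   # ridge with r ~ 0
print(np.max(np.abs(tiny_r - holp)))                            # essentially zero
```

The $n \times n$ form on the right-hand side is also how HOLP and ridge-HOLP are computed cheaply in practice.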
20 / 38 Ridge regression
Theorem 3 (Screening consistency of ridge regression)
Under mild conditions, with a proper ridge parameter $r$, the ridge regression estimator satisfies
$P\left(\min_{j \in S} |\hat\beta_j(r)| > \max_{j \notin S} |\hat\beta_j(r)|\right) = 1 - O\{\exp(-c_3 n / \log n)\}$.
Remarks:
- The theorem holds in particular when the ridge parameter $r$ is fixed.
- Potential to generalise to GLMs, Cox's model, etc. Ongoing.
21 / 38 Simulation
- $(p, n) = (1000, 100)$ or $(10000, 200)$
- Signal-to-noise level: $R^2 = 0.5$ or $0.9$
- $\Sigma$ and $\beta$:
(i) Independent predictors: $\beta_i = (-1)^{u_i}(|N(0, 1)| + 4\log n/\sqrt{n})$ where $u_i \sim \mathrm{Ber}(0.4)$ for $i \in S$, and $\beta_i = 0$ for $i \notin S$.
(ii) Compound symmetry: $\beta_i = 5$ for $i = 1, \ldots, 5$ and $\beta_i = 0$ otherwise; $\rho = 0.3, 0.6, 0.9$.
(iii) Autoregressive correlation: $\beta_1 = 3$, $\beta_4 = 1.5$, $\beta_7 = 2$, and $\beta_i = 0$ otherwise.
22 / 38 More setups
(iv) Factor model: $x_i = \sum_{j=1}^{k} \varphi_j f_{ij} + \eta_i$, where the $f_{ij}$, $\eta_i$ and $\varphi_j$ are i.i.d. normal. Coefficients as in CS.
(v) Group structure: 15 true variables in three groups, $x_{j+3m} = z_j + N(0, \delta^2)$ where $m = 0, \ldots, 4$, $j = 1, 2, 3$, and $\delta^2$ is 0.01, 0.05 or 0.1. $\beta_i = 3$ for $i \leq 15$; $\beta_i = 0$ for $i > 15$.
(vi) Extreme correlation: $x_i = (z_i + w_i)/\sqrt{2}$, $i = 1, \ldots, 5$, and $x_i = (z_i + \sum_{j=1}^{5} w_j)/2$, $i = 16, \ldots, p$. Coefficients as in (ii). The response variable is more correlated with a large number of unimportant variables. To make it even harder, $x_{i+s}, x_{i+2s} = x_i + N(0, 0.01)$, $i = 1, \ldots, 5$.
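Example (vi) is the hardest setup to read off the slide, so here is a sketch of one possible generator under our reading (with $s = 5$ and indices $6, \ldots, 15$ holding the noisy copies $x_{i+s}, x_{i+2s}$); the index conventions are assumptions, not taken from the talk.

```python
# One possible generator for Example (vi) under our reading of the slide
# (s = 5; indices 6-15 hold the noisy copies x_{i+s}, x_{i+2s}); the index
# conventions here are assumptions, not taken from the talk.
import numpy as np

def extreme_correlation_design(n, p, rng):
    z = rng.normal(size=(n, p))
    w = rng.normal(size=(n, 5))
    X = np.empty((n, p))
    X[:, :5] = (z[:, :5] + w) / np.sqrt(2)                        # x_i, i = 1..5
    X[:, 15:] = (z[:, 15:] + w.sum(axis=1, keepdims=True)) / 2.0  # x_i, i = 16..p
    for i in range(5):                                            # noisy duplicates
        X[:, i + 5] = X[:, i] + rng.normal(scale=0.1, size=n)     # x_{i+s}; N(0, 0.01)
        X[:, i + 10] = X[:, i] + rng.normal(scale=0.1, size=n)    # x_{i+2s}
    return X

rng = np.random.default_rng(3)
X = extreme_correlation_design(n=100, p=1000, rng=rng)
beta = np.zeros(1000)
beta[:5] = 5.0                                                    # coefficients as in (ii)
y = X @ beta + rng.normal(size=100)
```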
23 / 38 $(p, n) = (1000, 100)$: $R^2 = 0.5$
Example                     HOLP   SIS    ISIS   FR     Tilting
(i) Ind.                    0.570  0.565  0.270  0.370  0.340
(ii) CS, ρ = 0.3            0.150  0.160  0.050  0.005  0.000
(ii) CS, ρ = 0.6            0.005  0.005  0.005  0.000  0.000
(ii) CS, ρ = 0.9            0.000  0.000  0.000  0.000  0.010
(iii) AR(1), ρ = 0.3        0.810  0.810  0.510  0.555  0.525
(iii) AR(1), ρ = 0.6        0.970  0.985  0.560  0.390  0.355
(iii) AR(1), ρ = 0.9        0.990  1.000  0.500  0.185  0.160
(iv) Factor, k = 2          0.295  0.000  0.045  0.135  0.105
(iv) Factor, k = 10         0.060  0.000  0.000  0.000  0.025
(iv) Factor, k = 20         0.010  0.000  0.000  0.000  0.000
(v) Group, δ² = 0.1         0.935  0.970  0.000  0.000  0.000
(v) Group, δ² = 0.05        0.950  0.970  0.000  0.000  0.000
(v) Group, δ² = 0.01        0.960  0.980  0.000  0.000  0.000
(vi) Extreme                0.305  0.000  0.000  0.000  0.020
24 / 38 $(p, n) = (1000, 100)$: $R^2 = 0.9$
Example                     HOLP   SIS    ISIS   FR     Tilting
(i) Ind.                    0.935  0.910  0.990  1.000  1.000
(ii) CS, ρ = 0.3            0.980  0.855  0.955  1.000  0.990
(ii) CS, ρ = 0.6            0.830  0.260  0.305  0.575  0.490
(ii) CS, ρ = 0.9            0.050  0.010  0.005  0.000  0.050
(iii) AR(1), ρ = 0.3        0.990  0.965  1.000  1.000  1.000
(iii) AR(1), ρ = 0.6        1.000  1.000  1.000  1.000  1.000
(iii) AR(1), ρ = 0.9        1.000  1.000  0.970  0.985  1.000
(iv) Factor, k = 2          0.940  0.015  0.490  0.950  0.960
(iv) Factor, k = 10         0.715  0.000  0.115  0.370  0.455
(iv) Factor, k = 20         0.430  0.000  0.015  0.105  0.225
(v) Group, δ² = 0.1         1.000  1.000  0.000  0.000  0.000
(v) Group, δ² = 0.05        1.000  1.000  0.000  0.000  0.000
(v) Group, δ² = 0.01        1.000  1.000  0.000  0.000  0.000
(vi) Extreme                0.905  0.000  0.000  0.150  0.110
25 / 38 $(p, n) = (10000, 200)$: $R^2 = 0.5$
Example                     HOLP   SIS    ISIS   FR
(i) Ind.                    0.680  0.680  0.620  0.570
(ii) CS, ρ = 0.3            0.310  0.310  0.060  0.020
(ii) CS, ρ = 0.6            0.010  0.010  0.000  0.000
(ii) CS, ρ = 0.9            0.000  0.000  0.000  0.000
(iii) AR(1), ρ = 0.3        0.810  0.860  0.740  0.740
(iii) AR(1), ρ = 0.6        0.990  0.990  0.580  0.680
(iii) AR(1), ρ = 0.9        1.000  1.000  0.480  0.390
(iv) Factor, k = 2          0.450  0.010  0.020  0.240
(iv) Factor, k = 10         0.050  0.000  0.000  0.010
(iv) Factor, k = 20         0.030  0.000  0.000  0.000
(v) Group, δ² = 0.1         1.000  1.000  0.000  0.000
(v) Group, δ² = 0.05        0.990  0.990  0.000  0.000
(v) Group, δ² = 0.01        1.000  1.000  0.000  0.000
(vi) Extreme                0.580  0.000  0.000  0.040
26 / 38 $(p, n) = (10000, 200)$: $R^2 = 0.9$
Example                     HOLP   SIS    ISIS   FR
(i) Ind.                    0.960  0.960  1.000  1.000
(ii) CS, ρ = 0.3            1.000  0.920  1.000  1.000
(ii) CS, ρ = 0.6            0.960  0.280  0.420  0.960
(ii) CS, ρ = 0.9            0.100  0.000  0.000  0.000
(iii) AR(1), ρ = 0.3        0.990  0.990  1.000  1.000
(iii) AR(1), ρ = 0.6        1.000  1.000  1.000  1.000
(iii) AR(1), ρ = 0.9        1.000  1.000  1.000  1.000
(iv) Factor, k = 2          0.980  0.000  0.350  0.990
(iv) Factor, k = 10         0.850  0.000  0.060  0.700
(iv) Factor, k = 20         0.540  0.000  0.010  0.230
(v) Group, δ² = 0.1         1.000  1.000  0.000  0.000
(v) Group, δ² = 0.05        1.000  1.000  0.000  0.000
(v) Group, δ² = 0.01        1.000  1.000  0.000  0.000
(vi) Extreme                1.000  0.000  0.000  0.210
27 / 38 A demonstration of Theorems 2 and 3
We set
$p = 4\,[\exp(n^{1/3})]$ for all examples except Example (vi), and $p = 20\,[\exp(n^{1/4})]$ for Example (vi);
$s = 1.5\,[n^{1/4}]$ for $R^2 = 90\%$, and $s = [n^{1/4}]$ for $R^2 = 50\%$.
28 / 38 Theorem 2
[Figure 1: HOLP, $P(\min_{i \in S} |\hat\beta_i| > \max_{i \notin S} |\hat\beta_i|)$ versus the sample size $n$ (100 to 500) for Examples (i)-(vi); left panel $R^2 = 90\%$, right panel $R^2 = 50\%$.]
29 / 38 Theorem 3
[Figure 2: ridge-HOLP ($r = 10$), $P(\min_{i \in S} |\hat\beta_i(r)| > \max_{i \notin S} |\hat\beta_i(r)|)$ versus the sample size $n$ (100 to 500) for Examples (i)-(vi); left panel $R^2 = 90\%$, right panel $R^2 = 50\%$.]
30 / 38 Computational efficiency: varying submodel size
[Figure 3: Computational time (sec) against the submodel size (0 to 100) for Tilting, Forward regression, ISIS, HOLP and SIS when $(p, n) = (1000, 100)$; the right panel excludes tilting.]
31 / 38 Computational efficiency: varying p
[Figure 4: Computational time (sec) against the total number of covariates (500 to 2500) for Tilting, Forward regression, ISIS, HOLP and SIS when $(d, n) = (50, 100)$; the right panel excludes tilting.]
32 / 38 A data analysis
- The mammalian eye disease data (Scheetz et al., 2006);
- Gene expressions from the eye tissues of 120 twelve-week-old male F2 rats;
- The gene TRIM32 is responsible for causing Bardet-Biedl syndrome;
- Focused on the 5000 genes (out of about 19K genes) with the highest sample variance.
33 / 38 Table 1: The 10-fold cross-validation error for nine different methods
Method        Mean error   Standard error   Final size
Lasso         0.015        0.023            5
SCAD          0.019        0.025            6
ISIS-SCAD     0.016        0.023            2
SIS-SCAD      0.014        0.025            4
FR-Lasso      0.016        0.021            2
FR-SCAD       0.018        0.023            3
HOLP-Lasso    0.014        0.016            5
HOLP-SCAD     0.014        0.017            4
Tilting       0.017        0.021            6
NULL          0.021        0.027            0
34 / 38 Table 2: Commonly selected genes for different methods
Probe ID:    1376747, 1381902, 1390539, 1382673
Gene name:   BE107075, Zfp292, BF285569, BE115812
Lasso:       yes yes yes
SCAD:        yes yes yes
ISIS-SCAD:   yes
SIS-SCAD:    yes yes
FR-Lasso:    yes yes
FR-SCAD:     yes yes yes
HOLP-Lasso:  yes yes
HOLP-SCAD:   yes yes
Tilting:
35 / 38 Take-home message
For a linear model $Y = X\beta + \varepsilon$ with normalized data, variable screening takes three steps (a code sketch follows below).
1. Compute the HOLP estimator $\hat\beta = X^T (XX^T)^{-1} Y$.
2. Retain the $d$ variables (usually one can take $d = n$) corresponding to the $d$ largest entries of $|\hat\beta|$.
3. Screening is done! Start thinking about building a refined model based on the remaining $d$ variables.
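A minimal end-to-end sketch of steps 1-3 (the helper name and synthetic data are ours; the columns of $X$ are assumed to be normalized already):

```python
# Minimal sketch of steps 1-3 (helper name and synthetic data are ours; columns
# of X are assumed to be normalized already).
import numpy as np

def holp_screen(X, y, d=None):
    """Indices of the d variables retained by HOLP screening."""
    n, p = X.shape
    d = n if d is None else d                     # slide: usually take d = n
    beta_hat = X.T @ np.linalg.solve(X @ X.T, y)  # step 1: HOLP = X'(XX')^{-1} y
    keep = np.argsort(-np.abs(beta_hat))[:d]      # step 2: d largest |beta_hat_j|
    return np.sort(keep)                          # step 3: hand off to a refined fit

# usage on synthetic data
rng = np.random.default_rng(4)
n, p, s = 100, 1000, 5
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:s] = 3.0
y = X @ beta + rng.normal(size=n)
kept = holp_screen(X, y)
print(set(range(s)).issubset(kept.tolist()))      # sure screening holds here: True
```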
36 / 38 Conclusion
HOLP is:
- Computationally efficient
- Theoretically appealing
- Methodologically simple
- Generalisable via its ridge version
Future work (ongoing):
- GLMs
- Cox's model
- Screening for compressed sensing
- Grouped variable screening, GAMs, ...
An open question 37 / 38
38 / 38 He who teaches me for one day is my father for life. Thank you!