A Consistent Model Selection Criterion for $L_2$-Boosting in High-Dimensional Sparse Linear Models


A Consistent Model Selection Criterion for $L_2$-Boosting in High-Dimensional Sparse Linear Models

Tze Leung Lai, Stanford University
Ching-Kang Ing, Academia Sinica, Taipei
Zehao Chen, Lehman Brothers

High-Dimensional Regression

$$y_{in} = \sum_{j=1}^{p_n} \beta_{jn} x_{ij}(n) + \varepsilon_i, \qquad i = 1, 2, \ldots, n,$$

where $p_n = O(\exp(n^{\xi}))$ for some $0 < \xi < 1$. The $n$'s in $\beta_{jn}$, $y_{in}$ and $x_{ij}(n)$ may be suppressed.

Some typical examples with $n \ll p$:

- Gene expression data: $n$ = sample size, $p$ = # genes.
- Sparse signal reconstruction: $n$ = # measurements, $p$ = # signals.
- Image recovery: $n$ = # grid points, $p$ = # basis functions.
- Portfolio selection: $p$ = # assets, $n$ = # time points.

$L_2$-Boosting: Pure Greedy Algorithm (PGA)

Step 1. Define $R_0 = (y_1, \ldots, y_n)^{\top}$ and $\mathbf{x}_j = (x_{1j}, \ldots, x_{nj})^{\top}$. Find a variable among $\{\mathbf{x}_1, \ldots, \mathbf{x}_{p_n}\}$ that is most correlated with $R_0$. Call this variable $\mathbf{x}_{\hat{s}_1}$ and generate the residual vector $R_1 = R_0 - \hat{\beta}_{\hat{s}_1} \mathbf{x}_{\hat{s}_1}$, where $\hat{\beta}_{\hat{s}_1}$ is the least squares estimate obtained by regressing $R_0$ on $\mathbf{x}_{\hat{s}_1}$.

Step 2. Find a variable among $\{\mathbf{x}_1, \ldots, \mathbf{x}_{p_n}\}$ that is most correlated with $R_1$. Call this variable $\mathbf{x}_{\hat{s}_2}$ and let $R_2 = R_1 - \hat{\beta}_{\hat{s}_2} \mathbf{x}_{\hat{s}_2}$. Continue in the same fashion. If the iterations stop at step $m$, the new outcome $y_{n+1}$ is predicted by

$$\hat{y}_{\hat{s}_1, \ldots, \hat{s}_m} = \hat{\beta}_{\hat{s}_1} x_{n+1,\hat{s}_1} + \cdots + \hat{\beta}_{\hat{s}_m} x_{n+1,\hat{s}_m}.$$
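The PGA update is easy to sketch in NumPy. The helper below is illustrative only (the function name and interface are my own, not the authors' code); it assumes the columns of X are centered so that the correlation criterion reduces to a normalized inner product:

```python
import numpy as np

def pga(X, y, m):
    """m steps of the Pure Greedy Algorithm (componentwise L2-boosting).

    Assumes the columns of X are centered; correlation with the current
    residual is then proportional to |x_j' R| / ||x_j||. Returns the
    selection path (indices may repeat) and the step coefficients.
    """
    R = y.astype(float).copy()
    norms = np.linalg.norm(X, axis=0)
    path, coefs = [], []
    for _ in range(m):
        j = int(np.argmax(np.abs(X.T @ R) / norms))   # most correlated column
        b = (X[:, j] @ R) / (X[:, j] @ X[:, j])       # LS of R on x_j alone
        R -= b * X[:, j]                              # update residual
        path.append(j)
        coefs.append(b)
    return path, coefs
```

A new observation is then predicted by summing $\hat{\beta}_{\hat{s}_k} x_{n+1,\hat{s}_k}$ over the path, as in the display above.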

$L_2$-Boosting: Orthogonal Greedy Algorithm (OGA)

Step 1. Define $\mathbf{y}_n = (y_1, \ldots, y_n)^{\top}$ and $\mathbf{x}_j = (x_{1j}, \ldots, x_{nj})^{\top}$. Find a variable among $\{\mathbf{x}_1, \ldots, \mathbf{x}_{p_n}\}$ that is most correlated with $\mathbf{y}_n$. Call this variable $\mathbf{x}_{\hat{s}^o_1}$ and generate the residual $R_1 = \mathbf{y}_n - M_{\hat{s}^o_1} \mathbf{y}_n$, where $M_{\hat{s}^o_1}$ is the projection matrix onto $L(\mathbf{x}_{\hat{s}^o_1})$.

Step 2. Find a variable among $\{\mathbf{x}_1, \ldots, \mathbf{x}_{p_n}\}$ that is most correlated with $R_1$. Call this variable $\mathbf{x}_{\hat{s}^o_2}$ and let $R_2 = \mathbf{y}_n - M_{\hat{s}^o_1, \hat{s}^o_2} \mathbf{y}_n$, where $M_{\hat{s}^o_1, \hat{s}^o_2}$ is the projection matrix onto $L(\mathbf{x}_{\hat{s}^o_1}, \mathbf{x}_{\hat{s}^o_2})$. Continue in the same fashion. If the iterations stop at step $m$, the new outcome $y_{n+1}$ is predicted by

$$\hat{y}_{\hat{s}^o_1, \ldots, \hat{s}^o_m} = \hat{\beta}^o_{\hat{s}^o_1} x_{n+1,\hat{s}^o_1} + \cdots + \hat{\beta}^o_{\hat{s}^o_m} x_{n+1,\hat{s}^o_m},$$

where $\hat{\beta}^o_{\hat{s}^o_1}, \ldots, \hat{\beta}^o_{\hat{s}^o_m}$ are the least squares estimates from regressing $\mathbf{y}_n$ on $(\mathbf{x}_{\hat{s}^o_1}, \ldots, \mathbf{x}_{\hat{s}^o_m})$.
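For comparison, a minimal OGA sketch under the same assumptions (again a hypothetical helper, not the authors' code); the only change from PGA is the joint least squares refit on all selected columns:

```python
import numpy as np

def oga(X, y, m):
    """m steps of the Orthogonal Greedy Algorithm.

    At each step the column most correlated with the current residual
    enters the model, and y is re-projected onto the span of all
    selected columns (so selected indices never repeat).
    """
    R = y.astype(float).copy()
    norms = np.linalg.norm(X, axis=0)
    S, beta = [], None
    for _ in range(m):
        corr = np.abs(X.T @ R) / norms
        corr[S] = -np.inf                 # selected columns are orthogonal to R
        S.append(int(np.argmax(corr)))
        beta, *_ = np.linalg.lstsq(X[:, S], y, rcond=None)  # joint LS refit
        R = y - X[:, S] @ beta            # residual after projection
    return S, beta
```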

Convergence Rates in $L_2$-Boosting

Assumptions:

(K.1) $E e^{t \varepsilon_1} + \sup_j E \exp(s x_{1j}) < \infty$ for $|t| \le t_0$, $|s| \le s_0$.

(K.2) $0 < \iota_1 < \lambda_{\min}(\Gamma_n) \le \lambda_{\max}(\Gamma_n) < \iota_2 < \infty$ for all $n$, where $\Gamma_n = E(\mathbf{x}_1 \mathbf{x}_1^{\top})$.

(K.2*) $0 < \eta_1 \le \min_{1 \le j \le p_n} E(x_{1j}^2) \le \max_{1 \le j \le p_n} E(x_{1j}^2) < \eta_2 < \infty$.

(K.3) $p_n = O(\exp(C n^{\xi}))$ for some $0 < \xi < 1$ and $C > 0$.

(K.4) $\sup_{n \ge 1} \sum_{j=1}^{p_n} |\beta_j| < \infty$.

Theorem 1. Assume (K.1), (K.2), (K.3) and (K.4). Then, for any choice of $m = m_n$ satisfying $m_n = O(n^l)$ with $0 < l < (1-\xi)/5$,

$$E\{(y_{n+1} - \hat{y}_{\hat{s}^o_1, \ldots, \hat{s}^o_m})^2 \mid \mathbf{y}_n, \mathbf{x}_1, \ldots, \mathbf{x}_{p_n}\} - \sigma^2 = O_p(m^{-1}).$$

Theorem 2. Assume (K.1), (K.2*), (K.3) and (K.4). Then, for any choice of $m = m_n$ satisfying $m_n = o(\log n)$ and any $0 < \theta < 1/3$,

$$E\{(y_{n+1} - \hat{y}^P_{\hat{s}_1, \ldots, \hat{s}_m})^2 \mid \mathbf{y}_n, \mathbf{x}_1, \ldots, \mathbf{x}_{p_n}\} - \sigma^2 = O_p(m^{-\theta}).$$

Theorem 3. Assume (K.1), (K.2), (K.3), (K.4) and

(K.5) for some $0 \le \gamma < (1-\xi)/5$, $\liminf_{n \to \infty} n^{\gamma} \min_{j \in O_n} \beta_j^2 > C > 0$.

Then, for any choice of $m = m_n$ satisfying $m_n = O(n^l)$ with $0 < l < (1-\xi)/5$ and $\lim_n m_n / n^{\gamma} = \infty$,

$$\lim_{n \to \infty} P(D^o_n) = 1,$$

where

$$O_n = \{j : 1 \le j \le p_n,\ \beta_j \ne 0\}, \qquad D^o_n = \{O_n \subseteq \{\hat{s}^o_1, \ldots, \hat{s}^o_{m_n}\}\}.$$

Theorem 3 indicates that, with probability approaching 1, the variables chosen by OGA after $m_n$ steps contain all relevant variables. To prevent overfitting, the algorithm should be stopped the first time all relevant variables are included, i.e., one should choose the smallest correct model along the boosting path.

High-Dimensional Hannan-Quinn Criterion

$$\mathrm{HQ}(k) = \log \hat{\sigma}^2(\mathbf{x}_{\hat{s}^o_1}, \ldots, \mathbf{x}_{\hat{s}^o_k}) + \frac{(\log p_n)\, C_n\, k}{n},$$

where $C_n \to \infty$ (rate conditions are given in Theorem 4) and

$$\hat{k}_{\mathrm{HQ}} = \arg\min_{1 \le k \le m_n} \mathrm{HQ}(k), \qquad \hat{\sigma}^2(\mathbf{x}_{\hat{s}^o_1}, \ldots, \mathbf{x}_{\hat{s}^o_k}) = \frac{1}{n}\, \mathbf{y}_n^{\top}(I - M_{\hat{s}^o_1, \ldots, \hat{s}^o_k})\, \mathbf{y}_n.$$
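A sketch of how the criterion can be evaluated along an OGA path, reusing the design matrix and path from the OGA sketch above (hq_select is a hypothetical helper, not the authors' code):

```python
import numpy as np

def hq_select(X, y, path, Cn):
    """Return k_hat minimizing HQ(k) over k = 1..m_n along an OGA path."""
    n, p = X.shape
    hq = np.empty(len(path))
    for k in range(1, len(path) + 1):
        Xk = X[:, path[:k]]
        res = y - Xk @ np.linalg.lstsq(Xk, y, rcond=None)[0]  # (I - M)y
        hq[k - 1] = np.log(res @ res / n) + np.log(p) * Cn * k / n
    return int(np.argmin(hq)) + 1
```

With Cn = np.log(n), this becomes the HDBIC choice discussed in Theorem 4 below.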

The major difference between HQ and the conventional Hannan-Quinn criterion is the additional factor $\log p_n$, which is related to the maximum of i.i.d. $V_1, \ldots, V_{p_n} \sim \chi^2(1)$:

$$\frac{\max_{1 \le i \le p_n} V_i}{2 \log p_n} \to 1 \quad \text{in probability.}$$

Define $A_j = \{\hat{s}^o_1, \ldots, \hat{s}^o_j\}$, $j \ge 1$, and

$$k^o_n = \begin{cases} m_n, & \text{if } O_n \not\subseteq A_{m_n}, \\ \min\{j : 1 \le j \le m_n,\ O_n \subseteq A_j\}, & \text{if } O_n \subseteq A_{m_n}. \end{cases}$$
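This extreme-value fact is easy to check by Monte Carlo (note the convergence in $p_n$ is slow):

```python
import numpy as np

rng = np.random.default_rng(0)
for p in (10**3, 10**5, 10**7):
    V = rng.chisquare(df=1, size=p)        # V_i = Z_i^2 with Z_i ~ N(0, 1)
    print(p, V.max() / (2 * np.log(p)))    # ratio tends to 1 (slowly) as p grows
```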

Theorem 4. Assume (K.1), (K.2), (K.3), (K.4) and (K.5). Then, for any choice of $m = m_n$ satisfying $m_n = O(n^l)$ with $0 < l < (1-\xi)/5$ and $\lim_n m_n / n^{\gamma} = \infty$, and any choice of $C_n$ satisfying $C_n \log p_n = o(n^{1-2\gamma})$ and $n^{\gamma}/C_n = o(1)$,

$$\lim_{n \to \infty} P(A_{\hat{k}_{\mathrm{HQ}}} = A_{k^o_n}) = 1.$$

For $\gamma = 0$, one can take $C_n = \log n$, giving the high-dimensional version of BIC (HDBIC).

Sketch of Proof

1. Bound the difference between OGA (or PGA) and its population version via exponential bounds.
2. Use the convergence rates in Theorems 1 and 2 to prove Theorem 3.
3. Combine Theorem 3, the exponential bounds, and an extension of Hannan-Quinn-type arguments to prove Theorem 4.

Finite-Sample Performance

Example 1. $(\beta_1, \ldots, \beta_5) = (3, 3.5, 4, 2.8, 3.2)$ and $\beta_j = 0$ for $j \ge 6$ in

$$y_i = \sum_{j=1}^{5} \beta_j x_{ij} + \sum_{j=6}^{p} \beta_j x_{ij} + \varepsilon_i, \qquad i = 1, \ldots, n.$$

$\varepsilon_i$ i.i.d. $N(0, 1)$; $\mathbf{x}_i = (x_{i1}, \ldots, x_{ip})^{\top}$ independent of $\varepsilon_i$; $x_{ij} = z_{ij} + \eta w_i$, where $\eta > 0$ and $(\mathbf{z}_i^{\top}, w_i)^{\top} \sim N(0, I)$. 1000 simulation runs for each result.
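A sketch of this data-generating mechanism (the helper name is hypothetical, and the specific $\eta$, $n$, $p$ settings of Table 1 are not all recoverable from the transcription):

```python
import numpy as np

def gen_example1(n, p, eta, rng):
    """One draw from the Example 1 design: x_ij = z_ij + eta * w_i."""
    beta = np.zeros(p)
    beta[:5] = [3.0, 3.5, 4.0, 2.8, 3.2]
    Z = rng.standard_normal((n, p))           # z_i ~ N(0, I_p)
    w = rng.standard_normal(n)                # shared factor inducing correlation
    X = Z + eta * w[:, None]
    y = X @ beta + rng.standard_normal(n)     # eps_i ~ N(0, 1)
    return X, y, beta
```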

Table 1: Frequency of all 5 relevant and $i$ additional variables selected ($m_n = 30$) in 1000 simulations, comparing OGA+HDBIC with OGA+BIC across settings of $\eta$, $n$ and $p$. [Table entries not recoverable from the transcription.]

Example 2. $q = 10$, $n = 400$, $p = 4000$ in

$$y_i = \sum_{j=1}^{q} \beta_j x_{ij} + \sum_{j=q+1}^{p} \beta_j x_{ij} + \varepsilon_i, \qquad i = 1, \ldots, n,$$

with $(\beta_1, \ldots, \beta_{10}) = (3.2, 3.2, 3.2, 3.2, 4.4, 4.4, 3.5, 3.5, 3.5, 3.5)$ and $\beta_j = 0$ for $j > 10$. $\varepsilon_i$ i.i.d. $N(0, 2.25)$; $\mathbf{x}_i = (x_{i1}, \ldots, x_{ip})^{\top}$ generated as in Example 1. 100 simulation runs for each result.

Table 2: Frequency of all 10 relevant and $i$ additional variables selected ($m_n = 40$) in 100 simulations, comparing OGA+HDBIC with LASSO across settings of $\eta$. [Table entries not recoverable from the transcription.]

Example 3. Example 2 satisfies the neighborhood stability condition of Meinshausen and Bühlmann (2006), an extension to random covariates of the irrepresentable condition of Zhao and Yu (2006) that ensures model selection consistency of LASSO. This condition is violated here: take $q = 10$ as in Example 2, with $(\beta_1, \ldots, \beta_q) = (3, 3.75, 4.5, 5.25, 6, 6.75, 7.5, 8.25, 9, 9.75)$ and $\beta_j = 0$ for $j \ge 11$. $(x_{i1}, \ldots, x_{iq})^{\top}$ i.i.d. $N(0, I_q)$; $\varepsilon_i$ i.i.d. $N(0, 1)$ and independent of $(x_{i1}, \ldots, x_{ip})^{\top}$; $x_{ij} = z_{ij} + b \sum_{l=1}^{q} x_{il}$ for $q < j \le p$, with $\mathbf{z}_i \sim N(0, I_{p-q}/4)$ and $q b^2 + 1/4 = 1$ (i.e., $b = \sqrt{3/(4q)}$).
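A sketch of the Example 3 design, where the constraint $q b^2 + 1/4 = 1$ gives every covariate unit variance (helper name hypothetical):

```python
import numpy as np

def gen_example3(n=400, p=4000, q=10, rng=None):
    """One draw from the Example 3 design, where every irrelevant
    covariate loads on the sum of the q relevant ones."""
    rng = rng or np.random.default_rng()
    beta = np.zeros(p)
    beta[:q] = 3 + 0.75 * np.arange(q)         # 3, 3.75, ..., 9.75
    b = np.sqrt(3.0 / (4 * q))                 # solves q b^2 + 1/4 = 1
    Xq = rng.standard_normal((n, q))           # relevant covariates ~ N(0, I_q)
    Z = 0.5 * rng.standard_normal((n, p - q))  # z_i ~ N(0, I/4)
    X = np.hstack([Xq, Z + b * Xq.sum(axis=1, keepdims=True)])
    y = X @ beta + rng.standard_normal(n)
    return X, y, beta
```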

Table 3: Frequency of all 10 relevant and $i$ additional variables selected ($m_n = 40$) in 100 simulations, comparing OGA+HDBIC with LASSO (final column: "$i$ or more" additional variables). [Table entries not recoverable from the transcription.]

Conclusion

By obtaining the convergence rates of orthogonal greedy $L_2$-boosting (OGA), classical model selection criteria such as BIC can be modified to handle a large number $p_n$ of regressors in sparse regression models. Consistency of model selection is formulated as capturing all, and only, the variables with nonzero coefficients under (K.5).

OGA with consistent model selection means that the resulting estimate is asymptotically equivalent to OLS on the correct set of covariates. In the absence of (K.5), consistency of model selection can be linked to the objective of the regression model, e.g., prediction, or estimation of a high-dimensional covariance matrix via the modified Cholesky decomposition. For the latter, consistent model selection can be carried out via thresholding (Bickel & Levina, 2008, and extensions).
