Stepwise Searching for Feature Variables in High-Dimensional Linear Regression


1 Stepwise Searching for Feature Variables in High-Dimensional Linear Regression
Qiwei Yao, Department of Statistics, London School of Economics
Joint work with:
Hongzhi An, Chinese Academy of Sciences
Da Huang, Peking University
Cun-Hui Zhang, Rutgers University

2 Regression with p >> n: (some) recent developments
Algorithms: stepwise addition and deletion; new information criteria BICP and BICC
Numerical results: simulation with independent and dependent regressors; comparison with Lasso
Asymptotic results: consistency for BICP

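3 Consider the linear model
$y = X\beta + \varepsilon = (x_1, \ldots, x_p)\beta + \varepsilon$,
where $X$ is an $n \times p$ design matrix and $\varepsilon \sim N(0, \sigma^2 I_n)$.
Assume $\beta \equiv \beta_n = (\beta_1, \ldots, \beta_p)^\top$ varies with $n$, and $p \equiv p_n \to \infty$ together with $n \to \infty$.
Let $d \equiv d_n = |I_n|$, where $I_n = \{1 \le i \le p : \beta_{n,i} \equiv \beta_i \ne 0\}$. Then d n.
Sparsity: $d \ll p$.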

4 Lasso estimator (Tibshirani 1996):
$\hat\beta^{\mathrm{lasso}} = \arg\min_\beta \{ \|y - X\beta\|^2 + \lambda \sum_{j=1}^p |\beta_j| \}$,
where $\lambda > 0$ is a constant.
Due to the $L_1$ penalty, some $\beta_j$ are shrunk to exactly 0 for large $\lambda$; this is how sparsity is achieved.
For a given $\lambda$, the Lasso can be solved by quadratic programming.
LARS (Efron et al. 2004): solves the whole Lasso solution path (for all $\lambda > 0$) in the same order of computations as a single LS fit; the $\hat\beta_j(\lambda)$ are piecewise linear.
Adaptive Lasso (Zou 2006): uses $|\beta_j|^{1-\gamma}$ instead of $|\beta_j|$ to achieve the oracle properties.

5 Dantzig selector (Candes and Tao 2007): $\hat\beta^{DS}$ is the solution of the $\ell_1$-regularization problem
$\min_{\beta \in R^p} \sum_{i=1}^p |\beta_i|$ subject to $\|X^\top (y - X\beta)\|_\infty \le \lambda_p \sigma$.
Usually we take $\lambda_p = \sqrt{2 \log p}$.
MSE($\hat\beta^{DS}$) is within a $\log p$ factor of the MSE of an oracle estimator.
The problem can be recast as a linear programming problem.
Approximately equivalent to the Lasso (Bickel, Ritov and Tsybakov 2008).

6 Sure independence (correlation) screening (Fan and Lv 2008):
(i) Marginal regression: choose the $p_0$ regressors that are most correlated individually with $y$, for some $d \ll p_0 \ll p$.
(ii) Apply other methods, such as adaptive Lasso, SCAD or the Dantzig selector, to identify the sparse model among the $p_0$ candidate regressors.
Computationally the most efficient; may apply when $p$ is huge.
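Step (i) of the screening can be sketched in a few lines of NumPy. This is only an illustration under assumptions: the function name `marginal_screen`, the toy dimensions and the coefficient values are hypothetical, and step (ii) is not shown.

```python
import numpy as np

def marginal_screen(X, y, p0):
    """Return the indices of the p0 columns of X with the largest |corr(x_j, y)|."""
    Xc = X - X.mean(axis=0)          # centre each regressor
    yc = y - y.mean()                # centre the response
    corr = (Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))
    return np.argsort(-np.abs(corr))[:p0]

# Toy data (hypothetical sizes): two strong signals among 500 regressors.
rng = np.random.default_rng(1)
n, p = 200, 500
X = rng.standard_normal((n, p))
y = 4.0 * X[:, 3] - 4.0 * X[:, 10] + rng.standard_normal(n)
kept = marginal_screen(X, y, p0=20)
```

The indices in `kept` would then be handed to a second-stage selector.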

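7 Consider the linear model
$y = X\beta + \varepsilon = (x_1, \ldots, x_p)\beta + \varepsilon$,
where $X$ is an $n \times p$ design matrix and $\varepsilon \sim N(0, \sigma^2 I_n)$.
Assume $\beta \equiv \beta_n = (\beta_{n,1}, \ldots, \beta_{n,p})^\top$ varies with $n$, and $p \equiv p_n \to \infty$ together with $n \to \infty$.
Let $d \equiv d_n = |I_n|$, where $I_n = \{1 \le i \le p : \beta_{n,i} \ne 0\}$.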

8 Notation
For $J \subset \{1, \ldots, p\}$, put
$X_J$: the $n \times |J|$ matrix consisting of the columns of $X$ corresponding to the indices in $J$;
$\beta_J$: the $|J|$-vector consisting of the components of $\beta$ corresponding to the indices in $J$;
$P_J = X_J (X_J^\top X_J)^{-} X_J^\top$;
$L_{u,v}(J) = u^\top (I_n - P_J) v$, $u, v \in R^n$.
Then $L_{y,y}(J)$ is the SSR (sum of squared residuals) from the LS fit $\hat y = X_J \hat\beta_J = P_J y$.
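As a small numerical check of this notation, the sketch below evaluates $L_{u,v}(J)$ through a least-squares fit rather than by forming $P_J$ explicitly. The helper name `L` and the toy data are assumptions made for illustration.

```python
import numpy as np

def L(u, v, X, J):
    """L_{u,v}(J) = u' (I_n - P_J) v, with P_J the projection onto the columns X_J."""
    XJ = X[:, list(J)]
    # lstsq returns a least-squares solution even if X_J is rank deficient,
    # so XJ @ coef is the projection P_J v.
    coef, *_ = np.linalg.lstsq(XJ, v, rcond=None)
    return float(u @ (v - XJ @ coef))

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 6))
y = X[:, [0, 2]] @ np.array([1.5, -2.0]) + rng.standard_normal(50)

J = [0, 2]
ssr = L(y, y, X, J)                      # L_{y,y}(J)
coef, *_ = np.linalg.lstsq(X[:, J], y, rcond=None)
resid = y - X[:, J] @ coef               # residuals of the LS fit on X_J
```

Here `ssr` agrees with `resid @ resid`, the sum of squared residuals of the fit.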

11 Algorithm
Stage I: Forward addition
1. Let $j_1 = \arg\min_{1 \le i \le p} L_{y,y}(\{i\})$ and $J_1 = \{j_1\}$. Put $\mathrm{BICP}_1 = \log\{L_{y,y}(J_1)/n\} + (2 \log p)/n$.
2. Continue with $k = 2, 3, \ldots$, provided $\mathrm{BICP}_k < \mathrm{BICP}_{k-1}$, where
$\mathrm{BICP}_k = \log\{L_{y,y}(J_k)/n\} + \frac{2k}{n} \log p$.
In the above expression, $J_k = J_{k-1} \cup \{j_k\}$, and
$j_k = \arg\max_{i \notin J_{k-1}} [L_{y,y}(J_{k-1}) - L_{y,y}(J_{k-1} \cup \{i\})] = \arg\max_{i \notin J_{k-1}} L^2_{y,x_i}(J_{k-1}) / L_{x_i,x_i}(J_{k-1})$.
3. For $\mathrm{BICP}_k \ge \mathrm{BICP}_{k-1}$, let $\tilde k = k - 1$ and $\hat I_{n,1} = J_{\tilde k}$.
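The forward stage can be sketched as follows. This is an illustrative implementation, not the authors' code: it keeps the residualised columns $(I_n - P_J)x_i$ explicitly instead of using the sweep operation, and the function name `forward_bicp`, the tolerance `1e-10` and the toy data are assumptions.

```python
import numpy as np

def forward_bicp(X, y):
    """Stage I sketch: add the variable with the largest SSR reduction while BICP decreases."""
    n, p = X.shape
    J = []                                   # selected indices, in order of entry
    selected = np.zeros(p, dtype=bool)
    r = y.astype(float).copy()               # (I - P_J) y
    Z = X.astype(float).copy()               # columns (I - P_J) x_i
    ssr = float(r @ r)
    bicp_prev = np.inf                       # so that the first variable always enters
    while len(J) < min(n - 1, p):
        num = (Z.T @ r) ** 2                 # L_{y,x_i}(J)^2
        den = np.einsum("ij,ij->j", Z, Z)    # L_{x_i,x_i}(J)
        gain = np.where(den > 1e-10, num / np.maximum(den, 1e-10), 0.0)
        gain[selected] = 0.0
        j = int(np.argmax(gain))
        new_ssr = ssr - gain[j]
        if new_ssr <= 0:
            break
        bicp = np.log(new_ssr / n) + 2 * (len(J) + 1) * np.log(p) / n
        if bicp >= bicp_prev:                # step 3: stop, keep the previous model
            break
        q = Z[:, j] / np.sqrt(den[j])        # unit vector spanning the new direction
        r = r - q * (q @ r)                  # update (I - P_J) y
        Z = Z - np.outer(q, q @ Z)           # update (I - P_J) x_i for all i
        J.append(j)
        selected[j] = True
        ssr, bicp_prev = new_ssr, bicp
    return J

rng = np.random.default_rng(2)
n, p = 200, 50
X = rng.standard_normal((n, p))
y = 4 * X[:, 5] - 4 * X[:, 17] + 4 * X[:, 30] + rng.standard_normal(n)
J_forward = forward_bicp(X, y)
```

On this toy example the three true regressors enter first; a few extra variables may follow before the criterion stops decreasing.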

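14 Stage II: Backward deletion
4. Let $\mathrm{BICP}^*_{\tilde k} = \mathrm{BICP}_{\tilde k}$ and $J^*_{\tilde k} = \hat I_{n,1}$.
5. Continue with $k = \tilde k - 1, \tilde k - 2, \ldots$, provided $\mathrm{BICP}^*_k \le \mathrm{BICP}^*_{k+1}$, where
$\mathrm{BICP}^*_k = \log\{L_{y,y}(J^*_k)/n\} + \frac{2k}{n} \log p$.
In the above expression, $J^*_k = J^*_{k+1} \setminus \{j^*_k\}$, and
$j^*_k = \arg\min_{i \in J^*_{k+1}} [L_{y,y}(J^*_{k+1} \setminus \{i\}) - L_{y,y}(J^*_{k+1})]$.
6. For $\mathrm{BICP}^*_k > \mathrm{BICP}^*_{k+1}$, let $\hat k = k + 1$ and $\hat I_{n,2} = J^*_{\hat k}$.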
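A direct, unoptimised sketch of the backward stage is given below. It refits each candidate submodel by least squares rather than using the sweep operation; the names `ssr_of` and `backward_bicp`, the starting set and the toy data are hypothetical.

```python
import numpy as np

def ssr_of(X, y, J):
    """Sum of squared residuals L_{y,y}(J) of the LS fit of y on the columns X_J."""
    if not J:
        return float(y @ y)
    coef, *_ = np.linalg.lstsq(X[:, J], y, rcond=None)
    res = y - X[:, J] @ coef
    return float(res @ res)

def backward_bicp(X, y, J_start):
    """Stage II sketch: delete the least useful variable while BICP does not increase."""
    n, p = X.shape
    J = list(J_start)
    bicp = np.log(ssr_of(X, y, J) / n) + 2 * len(J) * np.log(p) / n
    while len(J) > 1:
        # Candidate deletion: the variable whose removal increases the SSR least.
        new_ssr, drop = min((ssr_of(X, y, [m for m in J if m != i]), i) for i in J)
        new_bicp = np.log(new_ssr / n) + 2 * (len(J) - 1) * np.log(p) / n
        if new_bicp > bicp:                  # step 6: stop, keep the current set
            break
        J.remove(drop)
        bicp = new_bicp
    return J

rng = np.random.default_rng(3)
n, p = 200, 50
X = rng.standard_normal((n, p))
y = 4 * X[:, 1] - 4 * X[:, 7] + rng.standard_normal(n)
J_backward = backward_bicp(X, y, [1, 7, 20, 33])   # start from an over-fitted set
```

The two true regressors survive the deletion; the spurious ones are typically removed.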

15 Implementation: Sweep operation
Forward addition: Set $L^0 = (X, y)^\top (X, y) \equiv (l^0_{i,j})$, a $(p+1) \times (p+1)$ matrix. Adding one variable, say $x_i$, in the $k$-th step corresponds to transforming $L^{k-1} = (l^{k-1}_{i,j})$ into $L^k = (l^k_{i,j})$ by the sweep operation:
$l^k_{i,i} = 1 / l^{k-1}_{i,i}$,
$l^k_{j,m} = l^{k-1}_{j,m} - l^{k-1}_{i,m} l^{k-1}_{j,i} / l^{k-1}_{i,i}$ for $j \ne i$ and $m \ne i$,
$l^k_{i,j} = l^{k-1}_{i,j} / l^{k-1}_{i,i}$ and $l^k_{j,i} = -l^{k-1}_{j,i} / l^{k-1}_{i,i}$ for $j \ne i$.
Then
$L_{y,y}(J_{k-1}) - L_{y,y}(J_{k-1} \cup \{i\}) = (l^{k-1}_{i,p+1})^2 / l^{k-1}_{i,i}$, $i \notin J_{k-1}$.
Backward deletion: Same as the above with $L^0 = L^{\tilde k}$ obtained in Stage I. For $k = \tilde k - 1, \tilde k - 2, \ldots$,
$L_{y,y}(J^*_{k+1} \setminus \{i\}) - L_{y,y}(J^*_{k+1}) = (l^{\tilde k - k - 1}_{i,p+1})^2 / l^{\tilde k - k - 1}_{i,i}$, $i \in J^*_{k+1}$,
where $J^*_{k+1}$ is the Stage II set of size $k + 1$.
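The sweep identities can be checked numerically. The sketch below assumes one particular sign convention (plus on the pivot row, minus on the pivot column), under which sweeping the same index twice restores the matrix; the function name `sweep` and the toy data are hypothetical, and indices are 0-based, so the slide's index $p+1$ is column `p` here.

```python
import numpy as np

def sweep(A, i):
    """Sweep the square matrix A on pivot i (assumed convention: minus sign on column i)."""
    d = A[i, i]
    row = A[i, :].copy()
    col = A[:, i].copy()
    B = A - np.outer(col, row) / d   # l_{j,m} - l_{j,i} l_{i,m} / l_{i,i} for j, m != i
    B[i, :] = row / d                # pivot row
    B[:, i] = -col / d               # pivot column
    B[i, i] = 1.0 / d                # pivot element
    return B

def ssr_of(X, y, J):
    coef, *_ = np.linalg.lstsq(X[:, J], y, rcond=None)
    res = y - X[:, J] @ coef
    return float(res @ res)

rng = np.random.default_rng(4)
X = rng.standard_normal((30, 5))
y = X[:, 0] + rng.standard_normal(30)
p = X.shape[1]
M = np.column_stack([X, y])
L0 = M.T @ M                         # (X, y)'(X, y), a (p+1) x (p+1) matrix
Lk = sweep(sweep(L0, 0), 3)          # J = {0, 3}

forward_gain = Lk[1, p] ** 2 / Lk[1, 1]    # SSR reduction from adding x_1
backward_loss = Lk[3, p] ** 2 / Lk[3, 3]   # SSR increase from deleting x_3
```

After the two sweeps, `Lk[p, p]` holds the SSR of the fit on columns 0 and 3, and sweeping index 3 again undoes its addition.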

18 Remarks
1. $\mathrm{BICP}_k = \log\{L_{y,y}(J_k)/n\} + k\,(2\log p)/n$ replaces the penalty $(\log n)/n$ in the standard BIC by $(2\log p)/n$; it is designed for the cases with $p \approx n$ or $p > n$.
2. An alternative, BICC:
$\mathrm{BICC}_k = \log\{L_{y,y}(J_k)/n + c_0\} + k\,(\log n)/n$, where $c_0 > 0$ is a constant.
Why insert $c_0$? For $k$ close to $n$, $L_{y,y}(J_k) \approx 0$, and
$\log\{L_{y,y}(J_{k-1})\} - \log\{L_{y,y}(J_k)\} \approx \{L_{y,y}(J_{k-1}) - L_{y,y}(J_k)\}/L_{y,y}(J_k)$
may be very large, even when $L_{y,y}(J_{k-1}) - L_{y,y}(J_k)$ is negligible.
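The effect of $c_0$ is easy to see with numbers. In the sketch below the two SSR values, $n$ and $c_0 = 0.1$ are made-up illustrative values, not taken from the slides.

```python
import math

n = 100
c0 = 0.1                         # hypothetical constant c_0 > 0
ssr_prev, ssr_cur = 1e-6, 1e-9   # near-saturated fits: both SSRs are practically 0

# Drop in the log term without c_0: huge, although the SSR barely changes.
drop_plain = math.log(ssr_prev / n) - math.log(ssr_cur / n)

# Drop in the log term of BICC: negligible, as it should be.
drop_bicc = math.log(ssr_prev / n + c0) - math.log(ssr_cur / n + c0)
```

Without $c_0$ the drop is $\log 1000 \approx 6.9$, which would dominate any penalty of order $(\log n)/n$; with $c_0$ it is close to zero.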

20 Remarks
3. In the forward search, $x_i$ should be excluded from the further search if $L_{y,y}(J_{k-1}) - L_{y,y}(J_{k-1} \cup \{i\})$ is practically 0. This may improve the computational efficiency.
4. When $p_n \ge n$, the true mean $\mu_n = X_{I_n} \beta_{I_n}$ may be represented as a linear combination of the columns of any full-ranked $n \times n$ submatrix of $X$. In practice, we may start the forward search from the genuinely optimal regression subset with $j$ regressors, where $j \ge 1$ is a small integer. This should effectively eliminate the possibility that $\hat I_{n,1}$ ends as a non-sparse set.

21 Simulation: Example 1
Consider the model $y = X\beta + \varepsilon$, where the $x_{ij}$ and $\varepsilon_i$ are independent $N(0, 1)$.
Setting I: $n = 200$, $p = 1000$ or $2000$, and $d = 10$ or $25$.
Setting II: $n = 800$, $p = 10000$ or $20000$, and $d = 25$ or $40$.
Non-zero $\beta_i$ are of the form $(-1)^u (2.5\sqrt{2 \log p / n} + |v|)$, with $v \sim N(0, 1)$ and $P(u = 1) = P(u = 0) = 0.5$.
Replication: 200 times.
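One replication of this design could be generated as sketched below. The function name `simulate_example1` is hypothetical, and placing the non-zero coefficients at randomly chosen positions is an assumption; the slide does not say where they sit.

```python
import numpy as np

def simulate_example1(n, p, d, rng):
    """Draw (X, y) with independent N(0,1) regressors and noise, and d non-zero coefficients."""
    X = rng.standard_normal((n, p))
    beta = np.zeros(p)
    support = rng.choice(p, size=d, replace=False)   # assumption: random positions
    u = rng.integers(0, 2, size=d)                   # P(u = 1) = P(u = 0) = 0.5
    v = rng.standard_normal(d)
    beta[support] = (-1.0) ** u * (2.5 * np.sqrt(2 * np.log(p) / n) + np.abs(v))
    y = X @ beta + rng.standard_normal(n)
    return X, y, beta, support

rng = np.random.default_rng(5)
X, y, beta, support = simulate_example1(n=200, p=1000, d=10, rng=rng)
```

Every non-zero coefficient is at least $2.5\sqrt{2\log p/n}$ in absolute value, so the signals sit above the noise level implied by the BICP penalty.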

22 [Figure: No. of selected regressors in forward and backward searches. Panels: BICP and BICC, each for (n, p) = (200, 1000), (200, 2000), (800, 10000) and (800, 20000).]

23 Comparison with Lasso
The Lasso estimator is defined as the minimizer of
$\frac{1}{2n}\|y - X\beta\|^2 + \lambda \sum_{j=1}^p |\beta_j|$, with $\lambda = \sqrt{2(\log p)/n}$.
We standardize the data: $\|x_j\|^2 = n$ for all $j$.
For a fitted model, the relative error is defined as
$r = \frac{1}{d}$ (number of selected wrong variables + number of unselected true variables).
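The relative error is a simple set computation. The sketch below uses a hypothetical helper `relative_error` and made-up index sets.

```python
def relative_error(selected, true_support):
    """r = (selected wrong variables + unselected true variables) / d."""
    selected, true_support = set(selected), set(true_support)
    wrong = len(selected - true_support)     # selected but not in the true model
    missed = len(true_support - selected)    # in the true model but not selected
    return (wrong + missed) / len(true_support)

# True model {1, 2, 3, 4}; the fit picks 1, 2 and the spurious 5.
r = relative_error([1, 2, 5], [1, 2, 3, 4])  # (1 + 2) / 4 = 0.75
```

A perfect fit gives $r = 0$; $r$ can exceed 1 when many wrong variables are selected.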

24-25 [Tables: Mean and STD of two summary measures (column headers "d d r" in the source), by n, p, d and Method (BICP, BICC, LASSO). The numerical entries did not survive.]

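26 BICP and LASSO: comparable?
BICP adds a new variable by performing an F-test:
$\mathrm{BICP}_k < \mathrm{BICP}_{k-1} \iff \log\{L_{y,y}(J_k)/L_{y,y}(J_{k-1})\} + (2\log p)/n < 0 \iff \{L_{y,y}(J_{k-1}) - L_{y,y}(J_k)\}/L_{y,y}(J_k) > e^{(2\log p)/n} - 1$,
i.e. $F_{1,n-k-1} > (n-k-1)\{e^{(2\log p)/n} - 1\} \approx 2\log p$.
LASSO selects $x_j$ by performing approximately a z-test. Writing $X_{-j}$ and $\beta_{-j}$ for $X$ and $\beta$ without the $j$-th column and component,
$\frac{1}{2n}\|y - X_{-j}\beta_{-j}\|^2 - \frac{1}{2n}\|y - X\beta\|^2 > \lambda|\beta_{n,j}|$
$\iff \frac{1}{n}\beta_{n,j}\, x_j^\top (y - X_{-j}\beta_{-j}) - \frac{1}{2n}\beta_{n,j}^2 \|x_j\|^2 > \lambda|\beta_{n,j}|$
$\iff \frac{1}{n}\beta_{n,j}\, x_j^\top (y - X_{-j}\beta_{-j}) - \frac{1}{2}\beta_{n,j}^2 > \lambda|\beta_{n,j}|$ (since $\|x_j\|^2 = n$),
i.e. approximately $|x_j^\top (y - X_{-j}\beta_{-j})| > n\lambda$ when $\beta_{n,j}$ is small.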

27 As $y - X_{-j}\beta_{-j} \approx \varepsilon \sim N(0, \sigma^2 I_n)$, $x_j^\top (y - X_{-j}\beta_{-j}) \sim N(0, \sigma^2 \|x_j\|^2)$. Hence the test is approximately
$\chi^2_1 > n^2\lambda^2 / \{\|x_j\|^2 \sigma^2\} = 2\log p$.
Note $\sigma^2 = 1$ in our example.
As $F_{1,q} \approx \chi^2_1$ for large $q$, the two methods are approximately comparable.
Remark. The methods which penalize $\log(\|y - X\beta\|^2)$ (such as BIC) are F-test based and do not require knowledge of $\sigma^2$. The methods which penalize $\|y - X\beta\|^2$ directly (such as LASSO) are z-test based and do require information on $\sigma^2$.
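The closeness of the two thresholds can be checked numerically. The sketch below compares the BICP F-test threshold $(n-k-1)\{e^{(2\log p)/n} - 1\}$ with the chi-square threshold $2\log p$; the function name and the values $n = 200$, $p = 1000$, $k = 10$ are chosen here for illustration.

```python
import math

def bicp_f_threshold(n, p, k):
    """Threshold on F_{1,n-k-1} implied by BICP_k < BICP_{k-1}."""
    return (n - k - 1) * math.expm1(2 * math.log(p) / n)

n, p, k = 200, 1000, 10
f_thr = bicp_f_threshold(n, p, k)   # roughly 13.5
chi_thr = 2 * math.log(p)           # roughly 13.8
ratio = f_thr / chi_thr
```

For these values the two thresholds differ by only a few percent, in line with the claim that the two selection rules are approximately comparable.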

28 Simulation: Example 2
Same setting as in Example 1 with added dependence: for $1 \le k \le n$ and $1 \le i \ne j \le d$,
$\mathrm{Corr}(X_{ki}, X_{kj}) = (-1)^{u_1} (0.5)^{|i-j|}$,
$\mathrm{Corr}(X_{ki}, X_{k,i+d}) = (-1)^{u_2} \rho$,
$\mathrm{Corr}(X_{ki}, X_{k,i+2d}) = (-1)^{u_3} (1 - \rho^2)^{1/2}$,
where $\rho \sim U[0.2, 0.8]$, and $u_1$, $u_2$ and $u_3$ are independent draws from the uniform distribution on the two points $\{1, 0\}$.
The first $d$ regression variables have the non-zero coefficients.

29 [Figure: No. of selected regressors in forward and backward searches (Example 2). Panels: BICP and BICC, each for (n, p) = (200, 1000), (200, 2000), (800, 10000) and (800, 20000).]

30 Asymptotic results: Consistency
Goal: $P\{\hat I_{n,1} \supset I_n = \hat I_{n,2}\} \to 1$.
Major difficulty: the number of candidate models.
Key idea: find a series of collections $C_k$ ($k \ge 1$) of deterministic models such that
(a) the number of candidate models in $C_k$ does not diverge too fast, and
(b) the models selected in the forward path are always in $C_k$, $k \ge 1$.
Note. The construction of $\{C_k\}$ is for deriving the consistency; it is not required for practical implementation.

31 Heuristics for the forward search
Let $C_k$ be a collection of deterministic models of size $k$, satisfying two conditions:
1. Let $k^*$ be an integer for which $I_n \subset J$ for all $J \in C_{k^*-1}$, and
$P\{\exists\, k < k^* : J_{k-1} \in C_{k-1},\ I_n \not\subset J_{k-1},\ J_k \notin C_k\} \to 0$.
2. $P\{\exists\, k \le k^* : J_{k-1} \in C_{k-1},\ I_n \subset J_{k-1},\ \mathrm{BICP}_k < \mathrm{BICP}_{k-1}\} \to 0$.
Condition 1 sets $k^*$ as an upper bound for the size of the selected model, which may go to $\infty$; furthermore, $J_k \in C_k$ unless $I_n \subset J_{k-1}$.
Condition 2 requires the search to stop with $\hat I_{n,1} = J_{k-1}$ as soon as $I_n \subset J_{k-1}$.

32 The stopping rule is effectively an F-test:
$\mathrm{BICP}_k \ge \mathrm{BICP}_{k-1} \iff \log\{L_{y,y}(J_k)/L_{y,y}(J_{k-1})\} + (2\log p)/n \ge 0 \iff \{L_{y,y}(J_{k-1}) - L_{y,y}(J_k)\}/L_{y,y}(J_k) \le e^{(2\log p)/n} - 1$.
For a deterministic model $J \supset I_n$ and $j \notin J$,
$F(J, j) = \{L_{y,y}(J) - L_{y,y}(J \cup \{j\})\} / \{L_{y,y}(J \cup \{j\})/(n - |J| - 1)\} \sim F_{1,\,n-|J|-1}$.

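33 Since $j_k$ is selected among $p - k + 1$ choices, we have
$P\{\exists\, k \le k^* : J_{k-1} \in C_{k-1},\ I_n \subset J_{k-1},\ \mathrm{BICP}_k < \mathrm{BICP}_{k-1}\}$
$\le \sum_{k=d_n+1}^{k^*} P\{J_{k-1} \in C_{k-1},\ I_n \subset J_{k-1},\ \mathrm{BICP}_k < \mathrm{BICP}_{k-1}\}$
$\le \sum_{k=d_n+1}^{k^*} P\{\max_{J \in C_{k-1}} \max_{j \notin J} F(J, j) > (n - k)(e^{(2\log p)/n} - 1)\}$
$\le \sum_{k=d_n+1}^{k^*} |C_{k-1}|\,(p - k + 1)\, P\{F_{1,n-k} > (n - k)(e^{(2\log p)/n} - 1)\}$.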

34 Based on the tail property of the F-distribution, the RHS of the above is bounded by
$\sum_{k=d_n+1}^{k^*} 2|C_{k-1}|\,(p - k + 1)\, \exp\{-(\log p)(1 - (k+1)/n)\} / \sqrt{(1 - (k+1)/n)\,\pi \log p}$,
which converges to 0 if
$k^*/p \to 0$, $k^*(\log p)/n = O(1)$, $\sum_{k=d_n+1}^{k^*} |C_{k-1}| / \sqrt{\log p} \to 0$.
The above condition needs to be relaxed, though it may easily hold if $k^*$ is bounded (when $n, p \to \infty$).
Chen and Chen (2008): extended BIC
Huang and Wang (2008): BICC

35 Regularity conditions
1. The sparse Riesz condition:
$0 < c_* \le \lambda_{\min}(X_J^\top X_J / n) \le \lambda_{\max}(X_J^\top X_J / n) \le c^* < \infty$
for any $J \subset \{1, \ldots, p\}$ with $|J| \le d^*_n$, where $c_*$, $c^*$ are fixed constants and $d^*_n \to \infty$ (e.g. $d^*_n \log(p/d^*_n) = a_0 n$ for small $a_0 > 0$).
This ensures that the sparse representation of the model is unique.
If the underlying distribution of all the $p$ regressors is non-degenerate, any $n \times n$ submatrix of $X$ is full-ranked with probability 1.

36 Let $\mu = X\beta = X_{I_n}\beta_{I_n}$, $d_n = \#\{\beta_{n,j} \ne 0,\ 1 \le j \le p\}$ and $\beta_* = \min_{\beta_{n,j} \ne 0} |\beta_{n,j}|$.
For some constant $\gamma \in (0, 1)$, the upper bound for the size of estimated sets is defined as
$k^* = [\, d_n \log\{\|\mu\|^2 / (n c \beta_*^2)\} / \{c (1 - \gamma)^2\} \,]$.
2. Condition on $\beta_*$, $d_n$ and $p = p_n$: $(1 + d_n) \log p_n = o(n)$, and
β 2 2(1 + ε 0) 3 σ 2 γ 2 c 3 log p n, max{ d n + k 2, d n c c 2 (1 γ) 2 } < d n, (1 + d n ) log(2 + d n) log p n log ( µ 2 nc β 2 k { log k + log ( d n c )} log pn c 2 (1 γ) 2, ) 0,
where $\epsilon_0 > 0$ is a constant.

37 3. Condition (adjustment) for BICP:
$\mathrm{BICP}_k = \log\{L_{y,y}(J_k)/n\} + \sum_{l=1}^{k} \frac{2(1 + \eta_0) \log p}{n - l - 1}$,
where $\eta_0 \in (0,\ (1 - \gamma)^2 (1 + \epsilon_0)^3 / (\gamma^2 c) - 1)$ is a small constant.
Remark. For $k \le k^*$,
$\sum_{l=1}^{k} \frac{2(1 + \eta_0) \log p}{n - l - 1} = \{1 + \eta_0 + o(1)\}\, \frac{2k \log p}{n}$.
The adjustment increases the penalty by a factor of $(1 + \eta_0)$.
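The remark can be illustrated numerically by comparing the adjusted penalty with the simple penalty $(2k\log p)/n$. The function names and the values $n = 800$, $p = 10000$, $k = 25$, $\eta_0 = 0.1$ in the sketch below are chosen for illustration.

```python
import math

def adjusted_penalty(n, p, k, eta0):
    """Sum over l = 1..k of 2 (1 + eta0) log p / (n - l - 1)."""
    return sum(2 * (1 + eta0) * math.log(p) / (n - l - 1) for l in range(1, k + 1))

def simple_penalty(n, p, k):
    """The simple BICP penalty 2 k log p / n."""
    return 2 * k * math.log(p) / n

n, p, k, eta0 = 800, 10000, 25, 0.1
ratio = adjusted_penalty(n, p, k, eta0) / simple_penalty(n, p, k)
```

For these values the ratio is slightly above $1 + \eta_0 = 1.1$, the excess being the $o(1)$ term that comes from replacing $n - l - 1$ by $n$.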

38 Theorem. Under conditions 1, 2 and 3,
$P\{\hat I_{n,1} \supset I_n = \hat I_{n,2},\ \hat k \le \tilde k < k^*\} \to 1$.
Final remark. The consistency was proved for a slightly aggressive BICP penalty, while the proof used conservative Bonferroni estimates of multiple testing errors in all stages. The simple BICP penalty $(2k \log p)/n$ is recommended in practice.
We repeated the simulation for Example 1 with the adjusted BICP; the results are good, but not as good as those with the simple penalty $(2k \log p)/n$.

39 [Figure: No. of selected regressors in forward and backward searches. Panels: $\eta_0 = 0$ and $\eta_0 = 0.1$, each for n = 200 (p = 1000 and p = 2000) and n = 800 (p values not recoverable from the source).]
