arxiv: v1 [stat.me] 30 Dec 2017

Size: px

Start display at page:

Download "arxiv: v1 [stat.me] 30 Dec 2017"

Alvin Little
5 years ago
Views:

1 arxiv: v1 [stat.me] 30 Dec 2017 An ISIS screening approach involving threshold/partition for variable selection in linear regression 1. Introduction Yu-Hsiang Cheng and Tzee-Ming Huang and Su-Yun Huang In the last ten years, the problem of variable selection has received increasing attention, especially in the analysis of large scale data sets such as gene data sets (e.g. [2]). In this paper, we consider the variable selection problem in the linear regression model p Y = β j X j +ε, j=1 where Y is the response variable, X 1,..., X p are p predictors, and ε is the error, which is of mean zero. Suppose that we have n independent observations for(y,x 1,...,X p ). When the sample size n is greaterthan p, it is easyto obtain an estimator for β = (β 1,..., β p ) T using least square approach. However, when p n, finding an estimator for β is a challenging problem since the usual least square approach fails. To overcome the estimation difficulty for the large p small n case, it is common to assume that only few predictors are influential in the model ([4], [6]) and then perform variable selection. Many variable selection methods for linear regression model can be found in the literature. For some of these methods, variable selection and parameter estimation are performed simultaneously, such as forward regression, selection based on criteria like BIC or AIC, LASSO and it variants. Alternatively, one can first perform variable screening without estimating the parameter vector β to reduce the number of predictors, and then perform usual variable selection Corresponding author. 1

2 Cheng and Huang and Huang/ISIS variable selection 2 methods. An example of screening method is SIS (sure independence screening), proposed by Fan and Lv [1], where predictors with large absolute sample correlations to the response variable are selected. Another example of screening method, SIRS (sure independent ranking and screening) is proposed by Zhu et al. [5], which can be used for variable selection for many extensions of linear regression models. When p is very large, it is suitable to include the screening step as a part of the variable selection process so that the selection process can be done efficiently. Fan and Lv [1] also propose an extension of SIS called ISIS (iterative sure independence screening) to improve selection accuracy. In each iterative step in ISIS, SIS is performed to reduce the number of predictors and then an additional variable selection method, such as adaptive LASSO or SCAD, is applied to the predictors obtained from SIS. Here the screening is based on the absolute sample correlation between a predictor and the residuals obtained from regressing the response on the predictors selected in the previous iteration. The iterative procedure stops until a pre-specified number of predictors are selected. While ISIS performs much better than SIS according to the simulation results in [1], its implementation requires one to determine the number of predictors to be selected, as well as the tuning parameter(s) in the additional variable selection method, which may take extra efforts. For instance, if one would like to use ISIS with LASSO, and select the λ parameter by cross-validation in each iteration, then this task can be computationally expansive. To solve the difficulty in the implementation of ISIS, we propose an iterative SIS procedure with specific screening threshold in each iteration and the iteration stops automatically, where the threshold in the first iteration is based on the (1 α) quantile G 1 (1 α Ỹ,F 1,...,F q ), where G( Ỹ,F 1,...,F q ) is the cumulative distribution function for the maximum of absolute sample correlations between Ỹ: the observation vector for the response variable and observations for q variables that are independent of the response variable, where the q variables are independent and have CDF s F 1,..., F q respectively. When p is not too large, we take q = p and take F j to be ˆF Xj, the empirical CDF of X j, for j = 1,..., p for the first iteration. If one uses the quantile G 1 (1 α Ỹ, ˆF X1,..., ˆF Xp ) asscreening threshold for SIS, then one can compare the effect of a single predictor (measured in absolute correlation to the response) to the maximum effect of p unimportant predictors. If a predictor has stronger effect than the maximum effect of unimportant predictors, then the predictor can be important. By using G 1 (1 α Ỹ, ˆF X1,..., ˆF Xp ) as screening threshold, we expect that all unimportant predictors are screened out with probability at least (1 α), so there is a good chance to remove unimportant predictors effectively. In our proposed procedures, we use approximation of G 1 (1 α Ỹ, ˆF X1,..., ˆF Xp ) as screening threshold with α = 0.5 instead of a very small value, so that predictors with weaker association with the response are not easily screened out. Zhu et al. [5] also use a similar approach to determine the threshold for

3 Cheng and Huang and Huang/ISIS variable selection 3 their SIRS method. They propose a predictor ranking score to measure variable contribution and use a random threshold that is the maximum score for unimportant variables, which include d auxiliary variables that are generated independently from the data. Their d auxiliary variables play the same role as our q variables with cumulative distribution functions F 1,..., F q respectively that are independent of the response variable. When p is very large, the quantile G 1 (1 α Ỹ, ˆF X1,..., ˆF Xp ) can be very large since it increases with p. As a result, we may fail to detect some important predictors using an approximation of G 1 (1 α Ỹ, ˆF X1,..., ˆF Xp ) as the screening threshold in the first iteration. To solve this problem, we propose the second screening procedure, which includes a partitioning step that divides predictors into smaller subgroups and performs screening on subgroups. Simulation results show that the second screening procedure works better than the first one when p is large. The rest of the paper is organized as follows. In Section 2, we provide a theorem that gives justification for using G 1 (1 α Ỹ, ˆF X1,..., ˆF Xp ) as screening threshold for SIS when p is not very large, and describe our two screening procedures, which are designed for moderate p and very large p respectively. Simulation results are given in Section 3. The proof of the theorem is given in Section Screening based on maximum absolute correlation In this section, we first provide in Section 2.1 a theorem that gives justification for using G 1 (1 α Ỹ, ˆF X1,..., ˆF Xp ) as the screening threshold in SIS without iteration. Next, we state in Section 2.2 our first screening procedure, which is an iterative SIS procedure, where in the first iteration, an approximation of G 1 (0.5 Ỹ, ˆF X1,..., ˆF Xp ) is used the screening threshold. Our first screening procedure works for p = O(n δ ) for some δ (0,2). When p is very large, we propose the second screening procedure, which includes an additional partitioning step to divide predictors into smaller subgroups. The second screening procedure is described in Section SIS using G 1 (1 α Ỹ, ˆF X1,..., ˆF Xp ) as threshold In this section, we consider the case where p = O(n δ ) for some δ (0,2) and all p predictors are independent. We will first state Theorem 1, which justifies the use of G 1 (1 α Ỹ, ˆF X1,..., ˆF Xp ) as the screening threshold in SIS without iteration. Theorem 1. Suppose that max 1 j p ( 1 n n ( ) ) 6 Xi,j µ j,n = O p (1), σ j,n

4 Cheng and Huang and Huang/ISIS variable selection 4 where X i,j is the i-th observation for X j, µ j,n = n X i,j/n and σ 2 j,n = n (X i,j µ j,n ) 2 /n. Suppose that p Cn δ for some constant C > 0 and 0 < δ < 2, then for t > 0, G(t Ỹ, ˆF X1,..., ˆF Xp ) 1 in probability as n. The proof of Theorem 1 is given in Section 5. Note that under the conditions intheorem1,ifthenumberofimportantpredictorsiso(1),thenthereexistst > 0 such that the event that the sample correlations between Ỹ and all important predictors exceed t, denoted by A n (t), occurs with probability tending to one as n, and when A n (t) occurs, all the important predictors are selected if G(t Ỹ, ˆF X1,..., ˆF Xp ) > 1 α. Therefore, under the conditions in Theorem 1, if the number of important predictors is O(1), then all the important predictors are selected with probability tending to one as n. To evaluate G 1 (1 α Ỹ, ˆF X1,..., ˆF Xp ), we consider two approaches: bootstrap approximation and normal approximation. For bootstrap approximation, for j = 1,..., p, we first obtain m bootstrap samples of size n based on the observations for X j, and compute the sample correlations to Ỹ to obtain m maximum absolute correlations. Compute the (1 α) quantile of the m maximum absolute correlations and denote it by G 1 boot (1 α Ỹ, ˆF X1,..., ˆF Xp ). Then G 1 boot (1 α Ỹ, ˆF X1,..., ˆF Xp )isusedtoapproximateg 1 (1 α Ỹ, ˆF X1,..., ˆF Xp ). When n is large, we consider using normal approximation when evaluating G 1 (1 α Ỹ, ˆF X1,..., ˆF Xp ). In such case, since the distribution of the sample correlation between an unimportant predictor and Ỹ can be approximated by N(0,1/n), the resulting approximation for G 1 (1 α Ỹ, ˆF X1,..., ˆF Xp ) is given by z(n,p,α) def = Φ 1( ) (1 (1 α) 1/p ) n. Note that under the condition that p Cn δ for δ (0,2), it can be shown that lim n z(n,p,α) = 0, so the sample correlation between Y and an important predictor also exceeds z(n,p,α) with probability tending to one as n and we do not need to be concerned about screening out important predictors when using z(n, p, α) as the screening threshold Iterative screening when p is O(n δ ) When the predictors are correlated, it is possible that some important predictors are only weakly correlated with the response, but they may have stronger association with the response jointly. In such case, performing iterative screening can help improve the variable selection ability of screening. This point is mentioned in both [1] and [5]. In this section, we will describe our first screening procedure, which is an iterative SIS procedure. The algorithm is stated as

5 Cheng and Huang and Huang/ISIS variable selection 5 follows. Algorithm 1 (for moderate p): Suppose that X 1,..., X p are predictors and Y is the response. Let S all denote the collection of all p predictors. (1) Calculate the sample correlation between the observations for Y and that for X j and denote it by ρ j for j = 1,..., p. (2) For j = 1,..., p, X j is considered an important predictor if ρ j > approximate G 1 (1 α Ỹ, ˆF X1,..., ˆF Xp ), where α (0,1) is pre-specified. (3) Let S sel denote the set of the selected predictors. Let S c sel = S all \ S sel be the set of predictors that are not yet selected. Suppose that there are q predictors Z 1,..., Z q in S c sel. Regress Y on all the predictors in S sel based on the observations and let y resid denote the residuals. Calculate the sample correlation between y resid and Z j and denote it by ρ j for j = 1,...q. (4) Z j is considered an important predictor if ρ j > approximate G 1 (1 α y resid, ˆF Z1,..., ˆF Zq ), and we add Z j to S sel. Here for j = 1,..., q, ˆFZj denote the empirical cumulative distribution function of Z j. (5) Carry out Steps (3) and (4) until no more important predictors can be obtained or the number of selected predictors is greater than or equal to n. (6) The predictors selected are the predictors in S sel. Note. As mentioned in the previous section, G 1 (1 α Ỹ, ˆF X1,..., ˆF Xp ) is approximated using bootstrap approximation when n is small, and is approximated using the normal approximation z(n, p, α) for large n. The approximation of G 1 (1 α y resid, ˆF Z1,..., ˆF Zq ) is obtained similarly. In our simulation studies, we use normal approximation for n 200 and bootstrap approximation for n < Large p case When p is very large, the screening threshold G 1 (1 α Ỹ, ˆF X1,..., ˆF Xp ) or z(n,p,α) can be improper. In such case, we can partition the predictors into smaller subgroups and perform screening on subgroups. Indeed, {X 1,...,X p } will be randomly partitioned into k disjoint subsets of sizes p 1,..., p k respectively several times, where p 1,..., p k are approximately equal and each p i does not exceed n 1.99 for i = 1,..., k. Below is our second iterative screening procedure, where the random partition will be carried out T times. Here we assume n 200 and normal approximation is used when evaluating the screening

6 Cheng and Huang and Huang/ISIS variable selection 6 threshold. Algorithm 2 (for large p): Suppose that X 1,..., X p are predictors and Y is the response. (1) Let Ω = {X 1,...,X p }. For t = 1,..., T, carry out the tasks in (a) and (b) below: (a) Partition Ω into k disjoint several subsets Π 1,..., Π k, where Π i is of size p i for i = 1,..., k. (b) i. Take S kernel =, S sel = and y resid be the observations for Y. ii. For i = 1,..., k, let Π i = Π i S kernel and let Z i,1,...,z i,p i denote the predictors in Π i. Calculate the sample correlation between y resid and Z i,j and denote it by ρ i,j for j = 1,...p i. Let A i = {Z i,j : ρ i,j > z(n,p i,α),j {1,...,p i}}. Calculate the adjusted R-squared when regressing Y on predictors in A i S kernel. Add the predictors in A i to S sel. iii. Let A be the A i with the largest adjusted R-squared in Step iii among A 1,..., A k. Add the predictors in A to S kernel. iv. Take y resid to be the residuals obtained by regressing Y on all the predictors in S kernel. v. Repeat Steps ii iv until no more predictors are added to S sel, the largest adjusted R-squared does not increase, or the number of predictors in S sel exceeds n. vi. Take S t = S sel. (2) The predictors selected are the predictors in S 1... S T. 3. Simulation studies In this section, we carry out two sets of simulation studies to check the performance for the proposed screening procedures in Sections 2.2 and 2.3 respectively. Throughout this section, the p predictors X 1,...,X p are generated from N(0,1) with specific covariance matrix Σ, and for each predictor, n observations are generated. Here we consider Σ {Σ 1,Σ 2,Σ 3 }, where Σ 1 is an identity matrix, and the remaining covariance matrices corresponding two different dependent structures for the predictors, which will be described later. Suppose that X = (X 1,...,X p ) n p, the response Y is given by Y = Xβ +ε, (1) where β T = (1 + β 1,...,1 + β 10,0,...,0) is the coefficient vector and β i s are generated from the uniform distribution on [ 0.5, 0.5]. Elements in the error

7 Cheng and Huang and Huang/ISIS variable selection 7 vector ε are generated independently from N(0,σ 2 ), where variance σ 2 is set such that E ( Var(Xβ β) ) E ( Var(Xβ β) ) +σ 2 = r. Note that when β is not random, r is the sum of squares for the linear model in (1), which is used in [5] to indicate the relative noise level in the model. In order to evaluate the selection accuracy for a given variable selection method result Ŝ, we use the following accuracy measure #(S I Ŝ) #S I, (2) where #S denotes the number of elements in a set S, Ŝ is the collection of the variables selected by the selection method, and S I is the collection of all influential variables. In our simulation studies, S I = {X 1,...,X 10 } and #S I = 10. We will compare the performances of LASSO, adaptive LASSO, ISIS-SCAD and our screening approach. For LASSO and adaptive LASSO approaches, the L 1 penalty is multiplied by a positive parameter λ, which can be determined using cross-validation. However, according to our experience, using crossvalidation to choose λ for LASSO often leads to choosing too many predictors, so in our simulation studies, the penalty parameter λ for LASSO/adaptive LASSO is chosen so that the number of predictors selected by LASSO/adaptive LASSO is approximately the same (or a little larger) as the number of predictors selected by our method, if our method selects at least 10 predictors. In cases where our method selects less than 10 predictors, the penalty parameter λ for LASSO/adaptive LASSO is chosen so that the number of predictors selected by LASSO/adaptive LASSO is approximately 10(or a little larger). When performing ISIS-SCAD in the following studies, we use the function SIS in R package SIS and all default settings are used exepct that the method for tuning the regularization parameter is SCAD p is O(n δ ) In this section, we present simulation results for the case where p is less than n 2. We carry out Algorithm 1 in Section 2.2 for four combinations of n and p: (n, p) {(100, 2000),(200, 34000),(300, 75000),(400, )}, where p is closed to n 1.97 in the latter three cases. In the proposed approach, we use different screening thresholds for different sample sizes. When the sample size n is small (n < 200), we use the approximate quantile based on bootstrap as the screening threshold. When n is greaterthan or equal to 200, we use z(n,p,α) with α = 0.5 as the screening threshold. We carry out 500 replications for each combination of (n,p). Below we will present results for Σ = I p p, Σ 2 and Σ 3 respectively. We first considerthe case Σ = I p p, where all the predictors areindependent. Table 1 shows the averages for accuracy measure in (2) based on 500 trials for

8 Cheng and Huang and Huang/ISIS variable selection 8 Algorithm 1, LASSO, adaptive LASSO and ISIS-SCAD. The medians for the numbers of selected predictors are also included in parentheses. Based on Table 1, we find that the performance of our Algorithm 1 is better than those of LASSO, adaptive LASSO and ISIS-SCAD for all two r values, especially when the sample size n is small. r n = 100 n = 200 n = 300 n = Algorithm (12) (12) 1(12) 1(12) LASSO (12) (12) (12) (12) ad. LASSO (12) (12) (12) (12) ISIS-SCAD (21) (37) (52) (66) Algorithm (13) 1(12) 1(11) 1(12) LASSO (13) (12) (11) 1(12) ad. LASSO (13) (12) (11) (12) ISIS-SCAD (21) (37) (52) 1(66) Table 1 Screening results when Σ = I p p Next, we consider cases where predictors are dependent. Two different covariance matrices for predictors, Σ 2 and Σ 3, are considered. The matrix Σ 2 is a p p matrix with the (i,j)-th element = 0.75 i j for 1 i, j p. The matrix Σ 3 is a p p matrix with the (i,j)-th element is ρ 1 if i 10,j 10; 0.05 if 11 i p,11 j p; 0 otherwise. Covariance matrices of the same type of Σ 2 or Σ 3 are also considered in [5]. The simulation results of our method, LASSO, adaptive LASSO and ISIS- SCAD when Σ = Σ 2 are presented in Table 2. The results when Σ = Σ 3 with ρ 1 = 0.5andρ 1 = 0.3arepresentedin Tables3and 4respectively.Comparingto the results for Σ = I p p, all four methods can detect more influential predictors with Σ = Σ 2 or Σ = Σ 3 under the same r level. From the results in Tables 2, 3 and 4, it is clear that our method performs better than LASSO, adaptive LASSO and ISIS-SCAD for all two values of r. The performances of all four methods improve as r or n increases. It is worthy to note that adaptive LASSO outperforms LASSO and ISIS-SCAD when Σ = Σ 2 or Σ = Σ 3 with ρ 1 = 0.5, which is quite different from the independent case (Σ = I p p ) p is large In this section, the case p n is considered and Algorithm 2 is applied to perform predictor screening. For comparison purpose, we also present the results for Algorithm 1 and ISIS-SCAD. The data generating processes are the same as

9 Cheng and Huang and Huang/ISIS variable selection 9 r n = 100 n = 200 n = 300 n = Algorithm (11) (11) (12) (12) LASSO (11) (11) (12) (12) ad. LASSO (11) (11) (12) (12) ISIS-SCAD (21) (37) (52) (66) Algorithm (11) (11) (12) (12) LASSO (11) (11) (12) (12) ad. LASSO (11) (11) (12) (12) ISIS-SCAD (21) (37) (52) (66) Table 2 Screening results when Σ = Σ 2 r n = 100 n = 200 n = 300 n = Algorithm (11) (11) 1(11) 1(11) LASSO (11) (11) (11) (11) ad. LASSO (11) (11) (11) (11) ISIS-SCAD (21) (37) (52) (66) Algorithm (11) 1(11) 1(11) 1(11) LASSO (11) (11) (11) (11) ad. LASSO (11) (11) (11) (11) ISIS-SCAD (21) (37) (52) (66) Table 3 Screening results when Σ = Σ 3 and ρ 1 = 0.5 in Section 3.1 but with different parameter values. The covariance matrix Σ can be I p p, Σ 2 or Σ 3. We set r = 0.8 for Σ = I p p and r = 0.3 for Σ = Σ 2. When Σ = Σ 3, we consider ρ 1 {0.3,0.5} and set r = 0.4 for ρ 1 = 0.3 and r = 0.3 for ρ 1 = 0.5. The sample size n is 200. Several values for p are considered: p = 34000P 0, where P 0 {1,2,4,6,8}. Note that here we choose smaller r values than those in previous simulation studies, because under such set-up, we can observe clearly that the screening accuracy decreases as p increases for Algorithm 1 and ISIS-SCAD, and further investigate whether the same pattern occurs for Algorithm 2. ForeachΣand eachp 0 {1,2,4,6,8},wecarryout ISIS-SCAD, Algorithm1 and Algorithm 2 with T {1,2,3} for 100 simulated data sets and compute the corresponding accuracy measures in (2). We also report medians of the numbers of selected predictors in parentheses. The results are given in Tables 5 8. As shown in Tables 5 8, for ISIS-SCAD and Algorithm 1, the average screening accuracy tends to decrease as P 0 increases. For Algorithm 2 with T = 3, the average screening accuracy does not always decrease as P 0 increases and the accuracy is better than that for ISIS-SCAD and Algorithm 1. Moreover, for Algorithm 2 with T = 3 and P 0 {2,4,6,8}, the screening accuracy is not

10 Cheng and Huang and Huang/ISIS variable selection 10 r n = 100 n = 200 n = 300 n = Algorithm (9.5) (10.5) (11) (11) LASSO (10) (11) (11) (11) ad. LASSO (10) (10.5) (11) (11) ISIS-SCAD (21) (37) (52) (66) Algorithm (10) (11) (11) 1(11) LASSO (10) (11) (11) (11) ad. LASSO (10) (11) (11) (11) ISIS-SCAD (21) (37) (52) (66) Table 4 Screening results when Σ = Σ 3 and ρ 1 = 0.3 a lot worse than that for Algorithm 1 with P 0 = 1, so the performance for Algorithm 2 is very satisfactory for the large p case. The effect of T is also of our interest. Based on the results in Tables 5 8, we find that the screening accuracy can be improved by increasing T. However, based on results not presented here, we find that such improvement is not sigificant when T 3. Thus we suggest to use T = 3. n = 200 P 0 = 1 P 0 = 2 P 0 = 4 P 0 = 6 P 0 = 8 ISIS-SCAD 0.800(37) 0.727(37) 0.677(37) 0.632(37) 0.609(37) Algorithm (11) 0.829(10) 0.802(10) 0.758(10) 0.740(10) Algorithm 2, T = (13) 0.826(21.5) 0.768(30) 0.725(65.5) Algorithm 2, T = (16) 0.866(30) 0.848(53) 0.807(117.5) Algorithm 2, T = (19.5) 0.891(38) 0.873(73) 0.840(177.5) Table 5 Screening results for various P 0 and T when Σ = I p p and r = 0.8 n = 200 P 0 = 1 P 0 = 2 P 0 = 4 P 0 = 6 P 0 = 8 ISIS-SCAD 0.558(37) 0.587(37) 0.537(37) 0.541(37) 0.518(37) Algorithm (9) 0.816(9) 0.757(8) 0.733(8.5) 0.705(8) Algorithm 2, T = (10) 0.821(13.5) 0.814(20) 0.808(32) Algorithm 2, T = (11) 0.821(16) 0.814(31) 0.810(55) Algorithm 2, T = (11) 0.821(21) 0.814(39) 0.811(83.5) Table 6 Screening results for various P 0 s and Ts when Σ = Σ 2 and r = 0.3

11 Cheng and Huang and Huang/ISIS variable selection 11 n = 200 P 0 = 1 P 0 = 2 P 0 = 4 P 0 = 6 P 0 = 8 ISIS-SCAD 0.705(37) 0.716(37) 0.688(37) 0.669(37) 0.654(37) Algorithm (10) 0.918(10) 0.876(10) 0.856(10) 0.862(10) Algorithm 2, T = (11) 0.928(14) 0.930(20) 0.946(25) Algorithm 2, T = (12.5) 0.928(16) 0.931(28) 0.946(44.5) Algorithm 2, T = (13) 0.928(19) 0.931(37) 0.946(61.5) Table 7 Screening results for various P 0 s and Ts when Σ = Σ 3, ρ 1 = 0.5 and r = 0.3 n = 200 P 0 = 1 P 0 = 2 P 0 = 4 P 0 = 6 P 0 = 8 ISIS-SCAD 0.848(37) 0.831(37) 0.810(37) 0.797(37) 0.792(37) Algorithm (10) 0.842(10) 0.792(9) 0.761(9) 0.765(9) Algorithm 2, T = (11) 0.877(14) 0.865(19) 0.877(29) Algorithm 2, T = (12) 0.877(17) 0.866(26) 0.878(45.5) Algorithm 2, T = (13) 0.877(21) 0.866(34) 0.879(68.5) Table 8 Screening results for various P 0 s and Ts when Σ = Σ 3, ρ 1 = 0.3 and r = Concluding remarks In this paper, we cosider the problem of predictor screening in linear regression. We propose two iterative SIS-based screening procedures, Algorithm 1 and Algorithm 2, for predictor selection. For Algorithm 1, we provide a screening threshold for SIS in each iteration. The use of the proposed screening threshold is justified in Theorem 1 when p = O(n δ ) for some δ (0,2). When p is very large, we include in our procedure a step of partitioning the predictors into smaller subgroups to perform screening, which gives our Algorithm 2. Simulation results show that Algorithm 1 works well when p n 1.97, and the performance of Algorithm 2 is also satisfactory when p is very large. In particular, Algorithm 2 outperforms Algorithm 1 when p is very large, which shows that including the partitioning step is important. 5. Proof of Theorem 1 In this section, we provide the proof of Theorem 1. Suppose that given Ỹ, Z i,j: 1 i n, 1 j p are independent, and Z 1,j,..., Z n,j are IID from ˆF Xj for j = 1,..., p. Let Z j = n Z i,j/n for j = 1,..., p. Suppose that the observation vector Ỹ = (Y 1,...,Y n ) and Ȳ = n Y i/n. Let X denote the

12 Cheng and Huang and Huang/ISIS variable selection 12 n p data matrix whose (i,j)-th component is X i,j. Then G(t Ỹ, ˆF X1,..., ˆF Xp ) ( n = P max (Z i,j Z ) j )(Y i Ȳ) 1 j p ( n (Z i,j Z j ) 2) 1/2( n (Y i Ȳ)2) 1/2 t X Ỹ, ( p ) n = P (Z i,j µ j,n )(Y i Ȳ) ( n (Z i,j Z j ) 2) 1/2( n (Y i Ȳ)2) 1/2 t X, Ỹ, j=1 where for j = 1,..., p, µ j,n = n X i,j/n, and ( ) n P (Z i,j µ j,n )(Y i Ȳ) ( n (Z i,j Z j ) 2) 1/2( n (Y i Ȳ)2) 1/2 t X Ỹ, n P (Z i,j µ j,n )(Y i Ȳ) ( nσj,n 2 n (Y i Ȳ)2) 1/2 t(1 δ 1) Ỹ, X }{{} nσ 2 j,n P ( n (Z i,j Z j ) 2) 1/2 > 1 1 δ 1 X }{{} II j nσ 2 j,n I j for δ 1 (0,1) and σj,n 2 = n (X i,j µ j,n ) 2 /n. Let Φ denote the CDF for N(0,1). Below we will give bounds for I j and II j respectively by comparing them with probabilities related to N(0, 1). We will first show that max 1 j p II j = O p (1/n 2 ). For δ 2 (0,1 (1 δ 1 ) 2 ), we have nσ 2 j,n II j = P ( n (Z i,j Z j ) 2) 1/2 > 1 1 δ 1 X ( n ) ( ) P (Z i,j µ j,n ) 2 < (1 δ 1 ) 2 n( Z j µ j,n ) δ 2 X 2 +P > δ 2 X. nσ 2 j,n From Theorem 2 in Osipov and Petrov [3], ( n ) P (Z i,j µ j,n ) 2 < (1 δ 1 ) 2 δ 2 X +P nσ 2 j,n ( n( Z j µ j,n ) 2 nσ 2 j,n > δ 2 X ) can be expressed as Φ( n((1 δ 1 ) 2 δ 2 1)/ v j,n )+(1 2Φ( nδ 2 ))+a n,j S n,1

13 Cheng and Huang and Huang/ISIS variable selection 13 for 1 j p, where ( ) (Z i,j µ j,n ) 2 v j,n = Var X = 1 n σ 2 j,n max 1 j p a n,j = O(1/n 2 ), 1 S n,1 = max 1 j p n n (X i,j µ j,n ) 2 1 σ 2 j,n 3 n ( (X i,j µ j,n ) max 1 j p n σ 2 j,n 1) 2, n X i,j µ j,n σ j,n Since max 1 j p v j,n = O p (1) and S n,1 = O p (1), we have max 1 j p II j = O p (1/n 2 ). To control I j, note that P n (Z i,j µ j,n )(Y i Ȳ) ( nσj,n 2 n (Y i Ȳ)2) 1/2 t(1 δ 1) Ỹ, X ( 1 2Φ(t(1 δ 1 ) n) ) n P (Z i,j µ j,n )(Y i ( Ȳ) t(1 δ nσj,n 2 n (Y i Ȳ)2) 1/2 1) Ỹ, X ( 1 2Φ(t(1 δ 1 ) n) ) + P n (Z i,j µ j,n )(Y i ( Ȳ) t(1 δ nσj,n 2 n (Y i Ȳ)2) 1/2 1) Ỹ, X ( 1 2Φ(t(1 δ 1 ) n) ) 2C 1( n X i,j µ j,n 3 /n) n Y ( ) 3 i ( Ȳ 3 1 σj,n 3 n Y i Ȳ 2) 3/2 1+ n t(1 δ 1 ) ( ) n ( 2C 1 S n,1 Y ) 3 i Ȳ 3 /n ( n n Y i Ȳ 2 /n ) 1 3/2 1+ (3) n t(1 δ 1 ) 3. for some absolute constant C 1. Here the last inequality follows from Theorem 2 in [3]. Note that in (3), the quantity so S n,1 n ( Y i Ȳ 3 /n ( n (Y i Ȳ)2 /n ) 3/2 = O p(1), max Ij ( 1 2Φ(t(1 δ 1 ) n) ) = Op (1/n 2 ), 1 j p which, together with the fact that max 1 j p II j = O p (1/n 2 )), implies that G(t Ỹ, ˆF X1,..., ˆF Xp ) p (I j II j ) converges to one in probability as n. The proof of Theorem 1 is complete. j=1

14 Cheng and Huang and Huang/ISIS variable selection 14 References [1] Jianqing Fan and Jinchi Lv. Sure independence screening for ultrahigh dimensional feature space. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70(5): , [2] Qianchuan He and Dan-Yu Lin. A variable selection method for genome-wide association studies. Bioinformatics, 27(1):1 8, [3] V. V. Osipov, L. V. & Petrov. On an estimate of the remainder term in the central limit theorem. Theory of Probability and its Applications, 12(2): , [4] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58: , [5] Li-Ping Zhu, Lexin Li, Runze Li, and Lixing Zhu. Model-free feature screening for ultrahigh dimensional data. Journal of the American Statistical Association, 106(496): , [6] Hui Zou. The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101: , 2006.

The Adaptive Lasso and Its Oracle Properties Hui Zou (2006), JASA

The Adaptive Lasso and Its Oracle Properties Hui Zou (2006), JASA Presented by Dongjun Chung March 12, 2010 Introduction Definition Oracle Properties Computations Relationship: Nonnegative Garrote Extensions: