On High-Dimensional Cross-Validation


On High-Dimensional Cross-Validation

BY WEI-CHENG HSIAO
Institute of Statistical Science, Academia Sinica, 128 Academia Road, Section 2, Nankang, Taipei 11529, Taiwan

WEI-YING WU
Department of Applied Mathematics, National Dong Hwa University, Hualien 97401, Taiwan

AND CHING-KANG ING
Institute of Statistical Science, Academia Sinica, 128 Academia Road, Section 2, Nankang, Taipei 11529, Taiwan

SUMMARY

Cross-validation (CV) is one of the most popular methods for model selection. By splitting n data points into a training sample of size n_c and a validation sample of size n_v with n_v/n → 1 and n_c → ∞, Shao (1993) showed that subset selection based on CV is consistent in a regression model with p candidate variables when p << n. In the case of p >> n, however, not only does the consistency of CV remain undeveloped, but all-subset selection is also practically infeasible. In this paper, we fill this gap by using CV as a backward-elimination tool for eliminating variables included by high-dimensional variable screening methods possessing the sure screening property. By choosing an n_v such that n_v/n converges to 1 at a rate faster than the one in Shao (1993), we establish the consistency of our selection procedure. We also illustrate the finite-sample performance of the proposed procedure using Monte Carlo simulation.

Some key words:

1. CROSS-VALIDATION

Consider the linear model
$$y = X\beta + e,$$
where $y = (y_1, \ldots, y_n)^T$ is the $n \times 1$ response vector, $X = (x_1, \ldots, x_n)^T$ with $x_t$ a $p \times 1$ vector of covariates (predictors) for $1 \le t \le n$, $\beta$ is the $p \times 1$ vector of unknown regression coefficients, and $e = (e_1, \ldots, e_n)^T$ consists of uncorrelated random disturbances with mean zero and variance $\sigma^2$. In our setting, p >> n and many of the parameters in β may be zero, which is the sparsity condition.

Cross-validation (CV) is a popular method for comparing models according to their prediction performance. The data set is usually split into two parts:

{(y_t, x_t) : t ∈ s} and {(y_t, x_t) : t ∈ s^c}, where s, a subset of {1, ..., n} of size n_v, indexes the validation (testing) set and its complement s^c indexes the training set of size n_c = n − n_v. For the well-known leave-one-out cross-validation (n_v = 1), the CV criterion for the model indexed by a subset J_α of {1, ..., p} with size d_{J_α} is
$$\hat\Gamma_{J_\alpha, n} = n^{-1} \sum_{s=1}^{n} \bigl(y_s - X_{J_\alpha, s}\, \hat\beta_{J_\alpha, s^c}\bigr)^2,$$
where X_{J_α} is the n × d_{J_α} matrix consisting of the columns of X indexed by J_α, X_{J_α,s} is the n_v × d_{J_α} matrix containing the rows of X_{J_α} indexed by s, and β̂_{J_α,s^c} is the least squares estimate obtained from the remaining data after the s-th observation is removed. However, leave-one-out cross-validation is not variable-selection consistent; see Efron (1986) and Shao (1993). To overcome this drawback, Shao (1993) proposed, for the fixed-p case, a model selection method based on multifold cross-validation and verified its consistency (i.e., selecting all relevant variables and no irrelevant variables with probability approaching 1).

A brief review of Shao (1993) is given for balanced incomplete sampling and Monte Carlo sampling. For the balanced incomplete case, let B be a collection of b subsets of {1, ..., n}, each of size n_v, chosen according to the following balance conditions: (i) every t ∈ {1, ..., n} appears in the same number of subsets in B; and (ii) every pair (t, t') with 1 ≤ t < t' ≤ n appears in the same number of subsets in B. All nonempty subsets J_α of {1, ..., p} are compared through the average squared prediction error over the subsets s in B, abbreviated BICV(n_v), and a model is chosen by minimizing
$$\hat\Gamma^{BICV}_{J_\alpha, n} = \frac{1}{n_v b} \sum_{s \in B} \bigl\| y_s - \hat y_{J_\alpha, s^c} \bigr\|^2, \qquad (1)$$
where ŷ_{J_α,s^c} = X_{J_α,s} β̂_{J_α,s^c} and β̂_{J_α,s^c} is the least squares estimator of β_{J_α} computed from the training data. For practical implementation, Shao (1993) further suggested Monte Carlo sampling, abbreviated MCCV(n_v). MCCV(n_v) selects a model by minimizing
$$\hat\Gamma^{MCCV}_{J_\alpha, n} = \frac{1}{n_v b} \sum_{s \in R} \bigl\| y_s - \hat y_{J_\alpha, s^c} \bigr\|^2, \qquad (2)$$
where R is a collection of b randomly drawn subsets of {1, ..., n} of size n_v. For the high-dimensional problem considered here, however, model selection through the above cross-validation criteria is infeasible because of the heavy combinatorial search over all subsets.

2. CROSS-VALIDATION IN THE HIGH-DIMENSIONAL PROBLEM

For the high-dimensional problem, many popular variable screening methods, such as Lasso (Tibshirani, 1996), LARS (Efron et al., 2004), Adaptive Lasso (Zou, 2006), SIS (Fan and Lv, 2008), and OGA (Ing and Lai, 2011), can search for relevant variables. After screening, however, the selected submodel may contain not only the relevant variables but also many irrelevant ones. To remove the included irrelevant variables, different model selection criteria can be applied, for example AIC (Akaike, 1974), BIC (Schwarz, 1978), the extended BIC (Chen and Chen, 2008), and HDIC (Ing and Lai, 2011).
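The MCCV criterion in (2) requires only repeated least squares fits on randomly drawn training sets, so it is inexpensive to evaluate for any fixed submodel. The following sketch (illustrative Python of our own, not the authors' code; the function name mccv_score and its arguments are assumptions) computes the criterion in (2) for a submodel given by a set of column indices.

```python
import numpy as np

def mccv_score(y, X, cols, n_v, b, rng=None):
    """Monte Carlo CV criterion (2) for the submodel indexed by `cols`.

    Averages the squared prediction error over b random validation sets of
    size n_v; the submodel is refitted by least squares on the complementary
    training set of size n_c = n - n_v each time.
    """
    rng = np.random.default_rng(rng)
    n = len(y)
    total = 0.0
    for _ in range(b):
        s = rng.choice(n, size=n_v, replace=False)        # validation indices
        sc = np.setdiff1d(np.arange(n), s)                 # training indices
        X_tr, X_va = X[np.ix_(sc, cols)], X[np.ix_(s, cols)]
        beta_hat, *_ = np.linalg.lstsq(X_tr, y[sc], rcond=None)
        total += np.sum((y[s] - X_va @ beta_hat) ** 2)
    return total / (n_v * b)
```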

In this paper, the cross-validation idea of Shao (1993) is applied along the screening path to search for the true model, which contains all relevant variables and only those. The proposed procedure consists of the following steps.

Step 1 (Screening). Apply a screening method to the full model {1, ..., p} to obtain a reduced submodel, denoted by Ĵ_{m_n} = {ĵ_1, ..., ĵ_{m_n}}, where m_n = #(Ĵ_{m_n}).

Step 2 (Trim). Apply backward stepwise elimination to exclude the irrelevant variables in the truncated submodel Ĵ_{k̂_n} of size k̂_n (obtained from Ĵ_{m_n}, possibly after a further truncation along the screening path, for instance by HDBIC; if no truncation is used, k̂_n = m_n), by comparing the corresponding MCCV values in (2) computed with a common choice of n_v and b. The trimmed submodel is
$$\hat N_n = \bigl\{ \hat j_l : \hat\Gamma^{MCCV}_n(\hat J_{\hat k_n} - \{\hat j_l\}) > \hat\Gamma^{MCCV}_n(\hat J_{\hat k_n}),\; 1 \le l \le \hat k_n \bigr\}.$$

For Step 1, many well-known screening methods can be used, for example Lasso (Tibshirani, 1996), LARS (Efron et al., 2004), Adaptive Lasso (Zou, 2006), SIS (Fan and Lv, 2008), and OGA (Ing and Lai, 2011). Step 2 uses cross-validation to trim off the irrelevant variables that were included, and the combination of Steps 1 and 2 is abbreviated HDCV. In applications, several screening methods M_1, M_2, ..., M_s may be considered in the HDCV procedure, yielding trimmed submodels N̂_{n,M_1}, ..., N̂_{n,M_s}. This information is integrated in the following Step 3 by comparing the corresponding MCCV values, and an appropriate model is then suggested.

Step 3 (Combination). The final model, selected by comparing the s trimmed models N̂_{n,M_1}, ..., N̂_{n,M_s}, where M denotes the screening method used in Step 1, is
$$N_f = \arg\min_{\hat J_\alpha \in \{\hat N_{n,M_1}, \ldots, \hat N_{n,M_s}\}} \hat\Gamma^{MCCV}_{\hat J_\alpha, n},$$
with the criterion defined by (2).

Through the HDCV procedure, the heavy combinatorial search is avoided. Moreover, when the screening step possesses the sure screening property (i.e., it includes all relevant variables with probability approaching 1), the consistency of the overall variable selection can be established under some regularity conditions.

2.1. The High-Dimensional Balanced Incomplete CV(n_v) Method

As in Shao (1993), the variable-selection consistency of the proposed procedure is verified first under balanced sampling and then under random sampling. Let m_n be the number of variables retained after sure screening, and let R_J = E(X_J^T X_J) denote the covariance matrix of the covariates in J. Assume the following conditions:

(C1) E|ε_t|^q < ∞ and max_{1≤j≤p_n} E|x_{tj}|^q < ∞ for some large q;
(C2) p_n = O(n^{s_1}) for some s_1 > 0;
(C3) m_n = O(n^{s_2}) = o(n_c^{1/2−ξ}) for some 0 < s_2 < 1/3 and any ξ > 0;
(C4) n_c = O(n^{s_3}) with 0 < s_3 < 1 − s_2;
(C5) max_{1≤#(J)≤m_n} ||R_J^{-1}||_1 = O(1); note that (C5) implies max_{1≤#(J)≤m_n} ||R_J^{-1}||_2 = O(1);
(C6) #(B) = O(n^{s_4}) for some s_4 > 0.

In (C1), only q-th moment bounds on the measurement error and on the covariates are required. Although (C2) does not allow p_n to grow exponentially with the sample size n, this assumption still covers many practical settings. (C3) restricts the number of covariates surviving the screening step, and (C4) restricts the training sample size. (C5) requires the minimum eigenvalue of the covariance matrix of the covariates to be bounded away from zero. (C6) is a condition on the number of replications used in the cross-validation procedure.
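To make Steps 2 and 3 concrete, the sketch below (again our own illustrative code, reusing the hypothetical mccv_score helper defined above) trims a screened submodel exactly as in the display defining N̂_n, keeping a variable only when its deletion increases the MCCV value, and then combines several trimmed submodels by picking the one with the smallest MCCV value.

```python
def hdcv_trim(y, X, screened, n_v, b, seed=0):
    """Step 2 (Trim): keep only those screened variables whose removal
    would increase the MCCV criterion of the screened submodel."""
    full_score = mccv_score(y, X, screened, n_v, b, rng=seed)
    kept = []
    for j in screened:
        reduced = [k for k in screened if k != j]
        # Same seed so all comparisons use the same random splits.
        if mccv_score(y, X, reduced, n_v, b, rng=seed) > full_score:
            kept.append(j)          # removing j hurts prediction, so j stays
    return kept

def hdcv_combine(y, X, trimmed_models, n_v, b, seed=0):
    """Step 3 (Combination): among trimmed submodels from different
    screening methods, return the one with the smallest MCCV value."""
    return min(trimmed_models,
               key=lambda cols: mccv_score(y, X, cols, n_v, b, rng=seed))
```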

The following theorem shows that the proposed procedure, sure screening followed by HDCV, is variable-selection consistent when (C1)-(C6) hold. Suppose that m_n variables are selected by the screening method, giving the submodel Ĵ_{m_n}, and let J_{α_0} denote the model that includes all relevant variables and no irrelevant variables.

THEOREM 1. Suppose that (C1)-(C6) hold, and let Ĵ_α ⊆ Ĵ_{m_n} be any submodel with #(Ĵ_α) = d_{α̂}.
(a) If J_{α_0} ⊆ Ĵ_α, then
$$\hat\Gamma^{BICV}_{\hat\alpha, n} = n^{-1} e^T e + n_c^{-1} d_{\hat\alpha} \sigma^2 + o_p(n_c^{-1}).$$
(b) If J_{α_0} is not contained in Ĵ_α, then there exists a positive constant φ such that
$$\hat\Gamma^{BICV}_{\hat\alpha, n} \ge n^{-1} e^T e + \Delta_{\hat\alpha, n} + o_p(1),$$
where P(Δ_{α̂,n} > φ) → 1 as n → ∞.

2.2. The High-Dimensional Monte Carlo CV(n_v) Method

As mentioned previously, the balance conditions on the collection B are sometimes hard to satisfy. A collection of subsets obtained by random drawing, denoted by R, is more feasible in practice. The following theorem establishes the consistency of the proposed variable selection procedure for the random collection under conditions (C1)-(C6), together with an additional condition on the number of repetitions b, analogous to that in Theorem 2 of Shao (1993).

THEOREM 2. Suppose the conditions of Theorem 1 hold and R is a collection of b randomly selected subsets of {1, 2, ..., n}, each of size n_v. Assume b satisfies
$$b^{-1} n_c^{-2} n^{2+\xi} m_n^2 \to 0$$
for an arbitrarily small ξ > 0, and let Ĵ_α ⊆ Ĵ_{m_n} with #(Ĵ_α) = d_{α̂}.
(a) If J_{α_0} ⊆ Ĵ_α, then
$$\hat\Gamma^{MCCV}_{\hat\alpha, n} = \frac{1}{n_v b} \sum_{s \in R} e_s^T e_s + n_c^{-1} d_{\hat\alpha} \sigma^2 + o_p(n_c^{-1}).$$
(b) If J_{α_0} is not contained in Ĵ_α, then there exists a positive constant φ such that
$$\hat\Gamma^{MCCV}_{\hat\alpha, n} \ge \frac{1}{n_v b} \sum_{s \in R} e_s^T e_s + \Delta_{\hat\alpha, n} + o_p(1),$$
where P(Δ_{α̂,n} > φ) → 1.

3. SIMULATION STUDIES

In this section, four simulation examples are used to examine the performance of HDCV based on various screening methods: OGA, ISIS-SCAD, LARS, and the adaptive Lasso with three k-fold sizes (5, 10, and 15), abbreviated OGA, ISIS-SCAD, LARS, and Adaptive Lasso (k), respectively. To implement ISIS-SCAD, LARS, and the adaptive Lasso, we use the SIS (Fan et al., 2010), glmnet (Friedman, Hastie, and Tibshirani, 2010), lars (Hastie and Efron, 2007), and parcor (Kraemer and Schaefer, 2010) packages in R. Because OGA and LARS select variables stepwise, the high-dimensional information criterion of Ing and Lai (2011) is employed to improve the screening performance; the penalty w_n in our simulation studies is chosen as log(n), which corresponds to HDBIC in Ing and Lai (2011). For Step 2 of HDCV (Trim), in the cross-validation setting the training-set size is n_c = n^{0.6} and the number of randomly drawn replications is b = n^{1.5}, where n is the sample size; these choices guarantee conditions (C3) and (C6). Each result is based on 1000 simulated data sets.
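The tuning choices just described are simple functions of the sample size. The small sketch below (our own; the floor/ceiling rounding is an assumption, since the text does not specify how non-integer values are rounded) computes the training size, validation size, and replication count used in the trimming step.

```python
import math

def hdcv_tuning(n):
    """Tuning choices used in the simulations: n_c = n^0.6 training points,
    n_v = n - n_c validation points, and b = n^1.5 random splits."""
    n_c = max(2, math.floor(n ** 0.6))   # training-set size
    n_v = n - n_c                        # validation-set size (n_v / n -> 1)
    b = math.ceil(n ** 1.5)              # number of Monte Carlo replications
    return n_c, n_v, b

# e.g. for n = 100: n_c = 15, n_v = 85, b = 1000
print(hdcv_tuning(100))
```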

Table 1. Frequency, over 1000 simulations, of selecting exactly the 3 relevant variables (E), of selecting all relevant variables plus i irrelevant variables (E+i, with a pooled column for i > 5), and of including all 3 relevant variables (Correct), together with the MSPE. Methods compared at ρ = 0.5 and ρ = 0.9: OGA, OGA+Trim, OGA+HDBIC+Trim, ISIS-SCAD, ISIS-SCAD+Trim, LARS, LARS+Trim, LARS+HDBIC+Trim, and Adaptive Lasso (k) and Adaptive Lasso (k)+Trim for k = 5, 10, 15. [The numerical entries of Table 1 are not recoverable from the extracted text.]

Define the mean squared prediction error
$$MSPE = \frac{1}{1000} \sum_{l=1}^{1000} \Bigl( \sum_{j=1}^{p} \beta_j x^{(l)}_{n+1,j} - \hat y^{(l)}_{n+1} \Bigr)^2,$$
in which x^{(l)}_{n+1,1}, ..., x^{(l)}_{n+1,p} are the regressors associated with y^{(l)}_{n+1}, the new outcome in the l-th simulation run, and ŷ^{(l)}_{n+1} denotes the prediction of y^{(l)}_{n+1}. The first example comes from Fan and Lv (2008), and the second and third examples are taken from Ing and Lai (2011). In the final example, we generate the data from a location and dispersion model using covariates from a real data set provided by a semiconductor manufacturing company.

Example 1. The following model is employed:
$$y_t = \sum_{j=1}^{q} \beta_j x_{tj} + \sum_{j=q+1}^{p} \beta_j x_{tj} + \varepsilon_t, \qquad t = 1, \ldots, n, \qquad (3)$$
where β_{q+1} = ... = β_p = 0 and the ε_t are i.i.d. N(0, σ²), independent of the x_{tj}. We consider p = 1000, q = 3, n = 100, and β_1 = β_2 = β_3 = 5 with standard Gaussian noise. The sample of (x_1, ..., x_p) of size n was drawn from a multivariate normal distribution N(0, Σ) whose covariance matrix Σ = (σ_ij)_{p×p} has entries σ_ii = 1 and σ_ij = ρ for i ≠ j; two designs are obtained by taking ρ = 0.5 and 0.9.
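A minimal data-generating sketch for Example 1 (our own illustrative code, not the authors'; the one-factor representation of the equicorrelated design is an implementation choice) is given below.

```python
import numpy as np

def generate_example1(n=100, p=1000, q=3, rho=0.5, beta_value=5.0,
                      sigma=1.0, seed=0):
    """Equicorrelated Gaussian design: unit variances, off-diagonal
    correlation rho; only the first q coefficients are nonzero."""
    rng = np.random.default_rng(seed)
    # x_tj = sqrt(rho)*w_t + sqrt(1-rho)*z_tj gives Var = 1, Cov = rho.
    w = rng.standard_normal((n, 1))
    z = rng.standard_normal((n, p))
    X = np.sqrt(rho) * w + np.sqrt(1.0 - rho) * z
    beta = np.zeros(p)
    beta[:q] = beta_value
    y = X @ beta + sigma * rng.standard_normal(n)
    return y, X, beta
```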

Table 2. Frequency, over 1000 simulations, of selecting exactly the 5 relevant variables (E), of selecting all relevant variables plus i irrelevant variables (E+i, with a pooled column for i > 5), and of including all 5 relevant variables (Correct), together with the MSPE, for η = 0, 2 and (n, p) = (50, 1000), (100, 2000). Methods compared: OGA, OGA+HDBIC+Trim, Adaptive Lasso (k) and Adaptive Lasso (k)+Trim for k = 5, 10, 15, and the combined selections N_f(n_c) and N_f(0.8 n_c). [The numerical entries of Table 2 are not recoverable from the extracted text.]

As shown in Table 1, adding Trim to ISIS-SCAD and LARS removes not only irrelevant variables but also some relevant ones. OGA+Trim fails to delete any variables, whereas OGA+HDBIC+Trim removes all irrelevant variables, and Adaptive Lasso (k)+Trim keeps all relevant variables in all simulations. In addition, the MSPE of OGA+HDBIC+Trim and of Adaptive Lasso (k)+Trim is close to the oracle value qσ²/n = 0.03. In the following simulation studies we therefore focus on OGA+HDBIC+Trim and Adaptive Lasso (k)+Trim because of their superior performance.

Example 2. In this example, we consider covariates x_tj = d_tj + η w_t, 1 ≤ j ≤ p, in which η ≥ 0 and (d_t1, ..., d_tp, w_t), 1 ≤ t ≤ n, are i.i.d. normal with mean (1, ..., 1, 0)^T and identity covariance matrix I. Hence Corr(x_tj, x_tk) = η²/(1 + η²), which increases with η > 0. We choose q = 5, (β_1, ..., β_5) = (3, 3.5, 4, 2.8, 3.2), assume σ = 1, and take model (3) to hold. The cases η = 0, 2 and (n, p) = (50, 1000), (100, 2000) are considered here to accommodate a much larger number of candidate variables and to allow substantial correlations among them. To combine the information from OGA and Adaptive Lasso, Step 3 (Combination) is further applied to the trimmed submodels; the model finally chosen by minimizing MCCV is denoted by N_f(n_c), where n_c is the training size used in Step 2. Since the irrelevant variables remaining in a trimmed model may be highly correlated with the relevant ones, we also consider a smaller training size, 0.8 n_c, in Step 3; the corresponding final model is abbreviated N_f(0.8 n_c).
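Example 2's design is a one-factor construction, so every pair of covariates shares the correlation η²/(1 + η²). The following sketch (our own illustrative code; names are assumptions) generates one such data set.

```python
import numpy as np

def generate_example2(n=50, p=1000, q=5, eta=2.0, sigma=1.0, seed=0):
    """Example 2 design: x_tj = d_tj + eta * w_t with a shared latent factor
    w_t, so Corr(x_tj, x_tk) = eta^2 / (1 + eta^2) for j != k."""
    rng = np.random.default_rng(seed)
    d = 1.0 + rng.standard_normal((n, p))    # d_tj ~ N(1, 1), independent
    w = rng.standard_normal((n, 1))          # shared factor w_t ~ N(0, 1)
    X = d + eta * w
    beta = np.zeros(p)
    beta[:q] = [3.0, 3.5, 4.0, 2.8, 3.2]
    y = X @ beta + sigma * rng.standard_normal(n)
    return y, X, beta
```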

Table 3. Frequency, over 1000 simulations, of selecting exactly the 10 relevant variables (E), of selecting all relevant variables plus i irrelevant variables (E+i, with a pooled column for i > 5), and of including all 10 relevant variables (Correct), together with the MSPE. Methods compared: OGA, OGA+HDBIC+Trim, Adaptive Lasso (k) and Adaptive Lasso (k)+Trim for k = 5, 10, 15, and the combined selections N_f(n_c) and N_f(0.8 n_c). [The numerical entries of Table 3 are not recoverable from the extracted text.]

Table 2 shows that, for sample size 50, applying Step 3 increases the frequency of identifying the parsimonious model from 863 for Adaptive Lasso (15) to 955 for N_f(0.8 n_c) (and 894 for N_f(n_c)). When the sample size is 100, the frequency with which N_f(0.8 n_c) identifies the parsimonious model is similar to that of OGA+HDBIC+Trim and Adaptive Lasso (k)+Trim.

Example 3. In this example, we set σ = 1 and let x_t1, ..., x_tq be i.i.d. standard normal, with
$$x_{tj} = d_{tj} + b_x \sum_{l=1}^{q} x_{tl}, \qquad q + 1 \le j \le p,$$
where b_x = {3/(4q)}^{0.5} and (d_{t,q+1}, ..., d_{tp})^T is multivariate normal with mean 0 and covariance matrix (1/4)I, independent across t and independent of the x_{tj} for 1 ≤ j ≤ q. We set (n, p) = (400, 4000) with randomly generated covariates, and let q = 10 and (β_1, ..., β_10) = (3, 3.75, 4.5, 5.25, 6, 6.75, 7.5, 8.25, 9, 9.75). This example is considered in Ing and Lai (2011) to illustrate an inherent difficulty in analyzing high-dimensional data when irrelevant variables are substantially correlated with relevant ones. As noted by Ing and Lai (2011), under this setting the Adaptive Lasso loses the sure screening property, since it relies on an initial Lasso estimate and may actually perform worse; this phenomenon can also be observed in Table 3. Ing and Lai (2011) also show that the first iteration of OGA selects an irrelevant variable, which remains in the OGA path until the last iteration. OGA+HDBIC+Trim chooses all relevant variables without including any irrelevant ones in all 1000 simulations, and its MSPE is close to the oracle value qσ²/n = 0.025. Similarly, N_f(n_c) and N_f(0.8 n_c) choose the smallest correct model in all simulations.

Example 4. The covariates are taken from a real data set provided by United Microelectronics Corporation, a semiconductor manufacturing company in Taiwan. The original data were collected from over 900 pieces of equipment and 300 defective rates. After deleting replicated and extreme variables, 405 covariates are used, and the response values are generated from the following location and dispersion model:
$$y_t = \beta_0 + \sum_{j=1}^{405} \beta_j x_{tj} + \Bigl( \alpha_0 + \sum_{j=1}^{405} \alpha_j x_{tj} \Bigr)^{1/2} \varepsilon_t, \qquad t = 1, \ldots, 300.$$

Table 4. Frequency, over 1000 simulations, of selecting exactly the 2 relevant variables of the location model (E), of selecting all relevant variables plus i irrelevant variables (E+i, with a pooled column for i > 5), and of including both relevant variables (Correct). Methods compared: OGA, OGA+HDBIC+Trim, Adaptive Lasso (k) and Adaptive Lasso (k)+Trim for k = 5, 10, 15, and N_f(0.8 n_c). [The numerical entries of Table 4 are not recoverable from the extracted text.]

Table 5. Frequency, over 1000 simulations, of selecting exactly the unique relevant variable of the dispersion model (E), of selecting the relevant variable plus i irrelevant variables (E+i, with a pooled column for i > 5), and of including the relevant variable (Correct). Residuals are based on N_f(0.8 n_c) at the first stage. Methods compared: OGA, OGA+HDBIC+Trim, Adaptive Lasso (k) and Adaptive Lasso (k)+Trim for k = 5, 10, 15, and N_f(0.8 n_c). [The numerical entries of Table 5 are not recoverable from the extracted text.]

We set (β_0, β_4, β_159, α_0, α_160) = (1, 2.3, 4.5, 3, 2.5), with all other coefficients equal to zero, and the ε_t are i.i.d. standard normal, independent of the covariates x_tj. A two-stage model selection procedure is implemented for this location and dispersion model. At the first stage we select the relevant variables of the location model and compute the fitted values ŷ_{t,Ĵ}, t = 1, ..., 300, where Ĵ is the set of selected variables. The responses at the second stage are the log-transformed squared residuals, log{(y_t − ŷ_{t,Ĵ})²}, t = 1, ..., 300, and the relevant variables of the dispersion model are then searched for on these transformed data. In the two-stage procedure, N_f(0.8 n_c) is applied in the combination step.

In the 1000 simulations summarized in Table 4, OGA+HDBIC+Trim identifies the sparse location model more often than Adaptive Lasso (k)+Trim, and N_f(0.8 n_c) selects the parsimonious location model 93.5% of the time. In Table 5, N_f(0.8 n_c) correctly catches all relevant variables of the dispersion model 89.9% of the time (about 80% for OGA+HDBIC+Trim and Adaptive Lasso (10)+Trim). Combining Tables 4 and 5, the percentage with which N_f(0.8 n_c) identifies both the parsimonious location model and the parsimonious dispersion model exceeds 84% (0.935 × 0.899 ≈ 0.84). Our recommended procedure can therefore be flexibly applied to select exactly the correct location and dispersion models.
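The two-stage procedure can be summarized in a few lines of code. The sketch below (our own illustration, not the authors' implementation) assumes a generic variable-selection routine select(y, X), for example the HDCV procedure sketched earlier, and applies it first to the responses and then to the log squared residuals.

```python
import numpy as np

def two_stage_location_dispersion(y, X, select):
    """Stage 1: select location-model variables and compute fitted values.
    Stage 2: regress log squared residuals on the covariates to select
    the dispersion-model variables."""
    loc_cols = select(y, X)                                  # location model
    beta_hat, *_ = np.linalg.lstsq(X[:, loc_cols], y, rcond=None)
    resid = y - X[:, loc_cols] @ beta_hat
    z = np.log(resid ** 2 + 1e-12)                           # guard against log(0)
    disp_cols = select(z, X)                                 # dispersion model
    return loc_cols, disp_cols
```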

REFERENCES

AKAIKE, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control 19, 716-723.
CHEN, J. & CHEN, Z. (2008). Extended Bayesian information criteria for model selection with large model spaces. Biometrika 95, 759-771.
EFRON, B. (1986). How biased is the apparent error rate of a prediction rule? Journal of the American Statistical Association 81, 461-470.
EFRON, B., HASTIE, T., JOHNSTONE, I. & TIBSHIRANI, R. (2004). Least angle regression (with discussion). The Annals of Statistics 32, 407-499.
FAN, J. & LV, J. (2008). Sure independence screening for ultrahigh dimensional feature space (with discussion). Journal of the Royal Statistical Society: Series B 70, 849-911.
FRIEDMAN, J., HASTIE, T. & TIBSHIRANI, R. (2010). glmnet: Lasso and elastic-net regularized generalized linear models. R package.
HASTIE, T. & EFRON, B. (2007). lars: Least Angle Regression, Lasso and Forward Stagewise. R package.
ING, C. K. & LAI, T. L. (2011). A stepwise regression method and consistent model selection for high-dimensional sparse linear models. Statistica Sinica 21, 1473-1513.
KRAEMER, N. & SCHAEFER, J. (2010). parcor: Regularized estimation of partial correlation matrices. R package.
SCHWARZ, G. (1978). Estimating the dimension of a model. The Annals of Statistics 6, 461-464.
SHAO, J. (1993). Linear model selection by cross-validation. Journal of the American Statistical Association 88, 486-494.
TIBSHIRANI, R. (1996). Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society: Series B 58, 267-288.
ZOU, H. (2006). The adaptive Lasso and its oracle properties. Journal of the American Statistical Association 101, 1418-1429.
