Model Selection via Bayesian Information Criterion for Quantile Regression Models


Eun Ryung Lee, Hohsuk Noh, Byeong U. Park

final version for JASA, August 5, 2013

Abstract

Bayesian Information Criterion (BIC) is known to identify the true model consistently as long as the predictor dimension is finite. Recently, its moderate modifications have been shown to be consistent in model selection even when the number of variables diverges. Those works have been done mostly in mean regression, but rarely in quantile regression. The best known results about BIC for quantile regression are for linear models with a fixed number of variables. In this paper, we investigate how BIC can be adapted to high-dimensional linear quantile regression and show that a modified BIC is consistent in model selection when the number of variables diverges as the sample size increases. We also discuss how it can be used for choosing the regularization parameters of penalized approaches that are designed to conduct variable selection and shrinkage estimation simultaneously. Moreover, we extend the results to structured nonparametric quantile models with a diverging number of covariates. We illustrate our theoretical results via some simulated examples and a real data analysis on human eye disease.

Eun Ryung Lee is postdoctoral researcher, Department of Economics, University of Mannheim, Mannheim, Germany. Hohsuk Noh is Assistant Professor, Department of Statistics, Sookmyung Women's University, Seoul, South Korea. Byeong U. Park is Professor, Department of Statistics, Seoul National University, Seoul, South Korea. E. R. Lee acknowledges financial support from the Collaborative Research Center SFB 884 "Political Economy of Reforms", funded by the German Research Foundation (DFG). H. Noh acknowledges financial support from the European Research Council under the European Community's Seventh Framework Programme (FP7) ERC Grant agreement, and the IAP research network P7/06 of the Belgian Government (Belgian Science Policy). Research of B. U. Park was supported by the NRF Grant funded by the Korea government. The authors would like to thank two referees, the Associate Editor and the Co-editor for their valuable suggestions, which have significantly improved the paper. This research was partly done while the second author was visiting the first author at the University of Mannheim. The first and second authors appreciate the support from Prof. Enno Mammen and Prof. Ingrid Van Keilegom regarding their research visits.

Key words: high-dimension, linear quantile regression, nonparametric quantile regression, model selection consistency, regularization parameter selection, shrinkage method

1 Introduction

In regression analysis, model selection is necessary in that an underfitted model brings out seriously biased results, whereas an overfitted model leads to substantial loss in estimation efficiency. Hence finding a parsimonious model with good prediction ability is an important task. For a traditional linear mean regression model, the Bayesian Information Criterion (BIC) has been used for model selection because it has been well understood that best subset selection with the BIC identifies the true model consistently (Nishii, 1984; Shao, 1997). Recently, Wang et al. and Chen and Chen (2008) found that the ordinary BIC is too liberal for large model spaces and proposed modifications that work when the number of variables increases with the sample size. The selection consistency of BIC has been extended to the M-estimation setting by several authors such as Machado (1993) and Wu and Zen (1999), particularly when the predictor dimension is finite. However, best subset selection with such criteria is computationally expensive, especially when the number of variables is large. To tackle this problem, various shrinkage methods such as the Least Absolute Shrinkage and Selection Operator (LASSO) and the Smoothly Clipped Absolute Deviation (SCAD) penalty have been proposed and recognized as a promising way of doing estimation and variable selection simultaneously. An interesting point is that, along with the popularity of such shrinkage methods, BIC has gained attention as a useful way of determining the amount of shrinkage. The seminal work of Wang and Leng (2007), followed by Wang et al. (2007b) and Wang et al., suggested using a BIC-type criterion for choosing the shrinkage parameter in penalized least squares and established the model selection consistency of the resulting procedure. Zhang et al. showed that BIC can play the same role in penalized likelihood settings.

As shrinkage methods gained popularity in mean regression, many researchers have turned their interest to variable selection in quantile regression. Wang et al. (2007a) and Wu and Liu (2009) considered shrinkage estimators for variable selection in linear quantile regression models, and Li and Zhu (2008) proposed an efficient algorithm to compute the entire solution path of L_1-norm regularized quantile regression. Recently, Li et al. and Wang et al. (2012) investigated SCAD penalization for variable selection in high-dimensional median and quantile regression, respectively. In the context of

nonparametric models, Noh et al. and Noh and Lee (2012) proposed variable selection methods for varying coefficient models and additive models in the quantile regression setting, respectively. Although some of these works considered BIC-type criteria to select the shrinkage parameter, apparently there has been no sound theory about why we should consider BIC-type criteria for model selection in quantile regression models. This is the main motivation of the present work.

Model selection in quantile regression has two interesting features. One is that, considering median regression, it can be seen as a way of achieving robustness in model selection. Along this direction, Machado (1993) proposed a BIC for median regression and showed that it identifies the true model consistently if the number of variables is finite. The other feature is that when heterogeneity exists due to either heteroscedastic variance or other types of non-location-scale covariate effects, the sets of relevant covariates can vary depending on the segment of interest in the conditional distribution. Wang et al. (2012) showed that the SCAD-penalized quantile regression estimator is able to detect this heterogeneity of the relevance pattern in high dimension. However, to the best of our knowledge no theoretical work about BIC has been done in a quantile regression context for the case where the number of variables is diverging, which is encountered in many applications. From this observation, we are motivated to establish a more general and systematic theory of BIC in quantile regression including high-dimensional cases.

The rest of this paper is organized as follows. Section 2 briefly describes how a BIC for linear quantile regression is derived and proposes its modification in the case of a diverging number of variables. We also show the selection consistency of the modified BIC there. In Section 3, we propose BICs for covariate selection in some structured nonparametric models and show that they identify the true model consistently when the number of covariates diverges. Finally, we illustrate the theoretical results with some simulated examples and a real data analysis in Sections 4 and 5, respectively. All the theoretical details are given in the Appendix.

2 BIC for Linear Quantile Regression

In this section, we describe how a BIC for a linear quantile regression model can be derived and then propose its extension to the cases where the number of variables diverges as the sample size increases.

2.1 Motivation of BIC for linear quantile regression

Consider the following linear quantile regression model:

Y_i = X_i^T β + U_i,  i = 1, ..., n,   (2.1)

where (Y_i, X_i, U_i) are independent and identically distributed as (Y, X, U) ∈ R × R^p × R and P(U ≤ 0 | X = x) = τ for almost every x. Let X_i = (X_1^i, ..., X_p^i)^T and β = (β_1, ..., β_p)^T. The number of variables p = p_n is allowed to increase with the sample size n. We omit such dependence on n in the notation for simplicity whenever it causes no confusion. We consider situations where only d covariates among the X_j^i's are relevant, i.e., p − d of the coefficients are zero in the model (2.1). This brings out the statistical problem of selecting the relevant variables.

To derive a BIC for the model (2.1), suppose that the error U_i follows an asymmetric Laplace distribution whose density is given by

f(u) = (τ(1 − τ)/σ) exp{ −ρ_τ(u)/(2σ) }   (2.2)

with the check loss ρ_τ(u) = u(2τ − 2I(u < 0)), and that U_i is independent of X_i. Since P(U_i < 0) = τ, the conditional τ-quantile of Y_i given X_i = x_i is x_i^T β.

Before deriving a BIC, we introduce some notation. The generic notation S = {j_1, ..., j_d} ⊂ {1, ..., p} denotes an arbitrary candidate model, which includes X_{j_1}, ..., X_{j_d} as relevant predictors, and we let X_S = (X_{j_1}, ..., X_{j_d})^T. Additionally, we define |S| to be the cardinality d of S. Based on these notations and setup, we first note that the maximum likelihood estimator of (β_S, σ) for a model S is given as (β̂_S, σ̂_S), where β̂_S = argmin_{β ∈ R^{|S|}} Σ_{i=1}^n ρ_τ(Y_i − X_{S,i}^T β) and σ̂_S = (2n)^{-1} Σ_{i=1}^n ρ_τ(Y_i − X_{S,i}^T β̂_S). From the definition of BIC in the form of a penalized log-likelihood (Schwarz, 1978), we obtain the following BIC for linear quantile regression:

BIC^O_L(S) = log( Σ_{i=1}^n ρ_τ(Y_i − X_{S,i}^T β̂_S) ) + |S| (log n)/(2n).   (2.3)

The above BIC for τ = 1/2 has already been considered as a robust alternative to the traditional BIC by Machado (1993), who showed its robustness in consistent model selection over a range of error distributions. Actually, Wu and Zen (1999) considered another type of BIC for median regression in the context of M-estimation.
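To make the criterion concrete, the following minimal sketch (in Python, not from the paper) computes BIC^O_L(S) for one candidate model S; it assumes the design is held in a NumPy array and uses the quantile regression fitter from statsmodels for the unpenalized fit.

```python
import numpy as np
import statsmodels.api as sm

def check_loss(u, tau):
    # rho_tau(u) = u * (2*tau - 2*I(u < 0)), the (doubled) check loss in (2.2)
    return u * (2.0 * tau - 2.0 * (u < 0))

def bic_o_l(y, X, S, tau):
    """Ordinary BIC (2.3) for a candidate model S (a list of column indices of X)."""
    n = len(y)
    X_S = X[:, list(S)]
    beta_S = sm.QuantReg(y, X_S).fit(q=tau).params      # hat{beta}_S
    loss = np.sum(check_loss(y - X_S @ beta_S, tau))
    return np.log(loss) + len(S) * np.log(n) / (2.0 * n)
```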

If we set σ = 1 in (2.2) when deriving the BIC, it leads us to one of the BICs proposed by Wu and Zen (1999), which is

Σ_{i=1}^n ρ_τ(Y_i − X_{S,i}^T β̂_S) + |S| log n.   (2.4)

Wu and Zen (1999) showed that the BIC with τ = 1/2 at (2.4) is strongly consistent under certain conditions. The criterion (2.3) is invariant under scale changes of the response variable. To see this, note that changing Y_i to cY_i for some constant c > 0 results in changing β̂_S to cβ̂_S for each S and thus adds the same constant log c to the criterion for every S. This means the criterion (2.3) chooses the same model regardless of the dependent variable's underlying scale of units, whereas (2.4) does not. For this reason, we concentrate on the criterion (2.3) in this paper, although it is also possible to extend (2.4) to quantile regression.

2.2 Modified BIC for high-dimension

Regarding the BIC in (2.3), recently Lian (2012) proved its consistency in model selection when the number of variables p is finite. However, when p is diverging with the sample size it does not guarantee a consistent result. The reason is that the ordinary BIC is too liberal for model selection and tends to seriously overfit in such a situation, as observed in our simulations. A recent attempt to remedy this problem is to put more penalty on the complexity of the model in the BIC so that it is strengthened in overfit resistance. This type of modification has been shown to be successful in the context of mean regression by Wang et al. and Chen and Chen (2008). Adopting this idea, we consider the following modification of the BIC in (2.3) for the case where p is diverging:

BIC^H_L(S) = log( Σ_{i=1}^n ρ_τ(Y_i − X_{S,i}^T β̂_S) ) + |S| (log n)/(2n) C_n,   (2.5)

where C_n is some positive constant which diverges to infinity as n increases.

Additionally, to facilitate the theoretical analysis of the BIC in (2.5) in the high-dimensional setting where the number of variables exceeds the sample size, we consider setting an upper bound on the cardinality of S, say s_n, and searching for the best model among those sub-models whose cardinality is smaller than or equal to s_n. This idea of restricting model spaces was considered in Chen and Chen (2008). We define the modified BIC with restriction as

Ŝ = argmin_{S ∈ M(s_n)} BIC^H_L(S),   (2.6)

where M(s_n) = {S ⊂ {1, ..., p_n} : |S| ≤ s_n}. In Section 2.4, we will show that the modified BIC with restriction identifies the true model consistently when p_n = O(n^κ) and s_n = O(n^α), where κ is an arbitrary positive number and 0 < α < 1/2. Without the restriction on the model space, i.e., with s_n = p_n, the modified BIC may enjoy selection consistency only when p_n = O(n^κ) for some 0 < κ < 1/2. Hereafter, we will only consider the modified BIC with restriction and omit the term "with restriction" whenever it causes no confusion.
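For small p, the restricted search in (2.6) can be carried out by direct enumeration, as in the following sketch (hypothetical code, not the authors' implementation); for large p it is replaced by the path-based construction of Section 4.1.

```python
from itertools import combinations
import numpy as np
import statsmodels.api as sm

def bic_h_l(y, X, S, tau, C_n):
    """Modified BIC (2.5) for a candidate model S."""
    n = len(y)
    X_S = X[:, list(S)]
    beta_S = sm.QuantReg(y, X_S).fit(q=tau).params
    u = y - X_S @ beta_S
    loss = np.sum(u * (2.0 * tau - 2.0 * (u < 0)))           # check loss of (2.2)
    return np.log(loss) + len(S) * np.log(n) / (2.0 * n) * C_n

def restricted_best_subset(y, X, tau, s_n, C_n):
    """Exhaustive version of (2.6): minimize BIC^H_L over all S with |S| <= s_n."""
    p = X.shape[1]
    candidates = (S for k in range(1, s_n + 1) for S in combinations(range(p), k))
    return min(candidates, key=lambda S: bic_h_l(y, X, S, tau, C_n))

# e.g. restricted_best_subset(y, X, tau=0.5, s_n=3, C_n=np.log(X.shape[1]))
```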

2.3 Main challenges in high-dimension

Showing the validity of the modified BIC at (2.5) in quantile regression is much more difficult and challenging than in mean regression, due to an intrinsic feature of quantile regression. To show the consistency of BIC^H_L, a crucial and often difficult step is to prove that BIC^H_L is resistant to overfitting, in other words,

P( min_{S: S ⊇ S*, S ≠ S*} BIC^H_L(S) > BIC^H_L(S*) ) → 1 as n → ∞,   (2.7)

where S* is the true model. Note that this is not equivalent to proving P(BIC^H_L(S) > BIC^H_L(S*)) → 1 for each overfitted candidate model S (S ⊇ S*, S ≠ S*) separately, particularly when p is diverging, which leads to a difficulty in high dimension. One simple and effective idea for establishing (2.7) is to use the following universal lower bound, valid for every overfitted model S:

min_{S: S ⊇ S*, S ≠ S*} BIC^H_L(S) − BIC^H_L(S*)
  ≥ log( Σ_{i=1}^n ρ_τ(Y_i − X_{F,i}^T β̂_F) ) − log( Σ_{i=1}^n ρ_τ(Y_i − X_{S*,i}^T β̂_{S*}) ) + (log n)/(2n) C_n
  = A(β̂_F, β̂_{S*}) + (log n)/(2n) C_n,   (2.8)

where F = {1, ..., p} is the full model. It can be shown using standard arguments in quantile regression that, when p = O(n^κ) with 0 < κ < 1/2,

A(β̂_F, β̂_{S*}) = O_p(p/n).   (2.9)

For a proof of (2.9), see the Appendix. Consequently, the universal lower bound at (2.8) is useful enough to prove (2.7) as long as p is slowly diverging with an order less than or equal to log n, since it is typically assumed that C_n → ∞ as n → ∞. However, if the order of p is larger than log n, then the lower bound is too rough, so it is no longer useful. It is possible to prove that BIC^H_L with C_n of the order of p is overfit-resistant with such a bound. However, such a modification is too naive, hence it has a considerable risk of suffering from underfit inflation in many situations.

Because of the reason discussed above, for each overfitted model S one should obtain an S-specific lower bound for BIC^H_L(S) − BIC^H_L(S*), more precisely for Σ_{i=1}^n ρ_τ(Y_i − X_{S,i}^T β̂_S) − Σ_{i=1}^n ρ_τ(Y_i − X_{S*,i}^T β̂_{S*}), as Wang et al. did in linear mean regression. Because the estimators β̂_S and β̂_{S*} have no explicit forms due to the non-differentiability of the check function at zero, obtaining such a bound is not feasible without an approximation of n^{-1} Σ_{i=1}^n ρ_τ(Y_i − X_{S,i}^T β_S) − n^{-1} Σ_{i=1}^n ρ_τ(Y_i − X_{S,i}^T β_S*) that is uniform for all β_S with β_S − β_S* lying in some compact set C_S ⊂ R^{|S|} and for all S with S ⊇ S*. Thus, getting such a uniform approximation is a usual practice in quantile regression. However, even if we get an S-specific lower bound from a uniform approximation, it does not help us prove (2.7) unless the probability that β̂_S − β_S* ∈ C_S for all S: S ⊇ S* is large enough. Proving the latter is very challenging since the number of candidate models increases exponentially with p. One of the key elements in our work is to show that there exists a common compact set C ⊂ R^p such that a rescaled version of (β̂_S, 0_{S^c}) − (β_{S*}, 0_{S*^c}) for every S: S ⊇ S* lies in the set C with large probability, where 0_{S^c} is the zero vector in R^{|S^c|}; see the supplementary file for details.

2.4 Consistency of BIC^H_L in model selection

In this section, we show that the modified BIC at (2.5) identifies the true model consistently in high-dimensional quantile regression models and discuss how the established theory can be used in the context of shrinkage methods. The generic notations S* and Ŝ, respectively, denote the true model and the model selected by BIC^H_L. Recall that S* = {j : β_j ≠ 0, 1 ≤ j ≤ p} and d = |S*| under the model (2.1). We make the following assumptions in order to facilitate the theoretical analysis of BIC^H_L.

(A1) The conditional distribution F_{U|X}(·|x) of the error U, given X = x, has a density f_{U|X} which satisfies: (i) sup_{u,x} f_{U|X}(u|x) < ∞; (ii) there exist positive constants δ_1 and δ_2 such that inf_x inf_{|u| ≤ δ_1} f_{U|X}(u|x) ≥ δ_2.

(A2) The variables X_j satisfy:

(i) max_{1≤j≤p} |X_j| ≤ M < ∞;
(ii) the eigenvalues of Σ_S = E(X_S X_S^T) are uniformly bounded away from zero and infinity over S: |S| ≤ 2s_n, that is,

0 < l_min ≤ inf_{|S| ≤ 2s_n} l_min(Σ_S) ≤ sup_{|S| ≤ 2s_n} l_max(Σ_S) ≤ l_max < ∞,

where l_min(A) and l_max(A) denote the smallest and largest eigenvalues of a real-valued square matrix A, respectively.

(A3) d is fixed, p_n = O(n^κ) for some κ > 0, and d ≤ s_n = O(n^α) for some 0 ≤ α < 1/2.

(A4) C_n → ∞ and C_n log n / n → 0.

(A5) E|U| < ∞.

The boundedness condition (i) of (A2) is commonly used for high-dimensional analysis, as in Wang et al. (2012). The condition (ii) of (A2) is the sparse Riesz condition assumed by Zhang and Huang (2008) and Chen and Chen (2008), which is one of the well-known conditions to deal with the situation where the number of regression coefficients exceeds the number of observations (p > n). The assumption on d in (A3) is rather strong considering that recent works in high-dimensional data analysis try to allow d to grow with n. The main reason for considering fixed d is the technical difficulty in dealing with the maximum of |S \ S*|^{-1} { n^{-1} Σ_{i=1}^n ρ_τ(Y_i − X_{S,i}^T β̂_S) − n^{-1} Σ_{i=1}^n ρ_τ(Y_i − X_{S*,i}^T β̂_{S*}) } over all overfitted candidate models S: S ⊇ S*, S ≠ S*, which arises from the implicit form of the quantile regression estimator. The condition (A5) is assumed for conciseness of the proof. It can be relaxed to include some heavy-tailed distributions not satisfying (A5). For example, if we add to (A4) an additional assumption on C_n, namely (log n)^4 C_n^2 / n → 0, the established theory also holds for the case where U follows the Cauchy distribution, which does not have a first moment.

The following theorem demonstrates that the modified BIC has consistency in model selection in the high-dimensional linear quantile regression model.

Theorem 2.1 Under the model (2.1), suppose that (A1)-(A5) hold. Then, we have P(Ŝ = S*) → 1 as n → ∞.

Remark. Regarding the choice of C_n, any positive sequence that satisfies (A4) works in theory. In our simulations and real data analysis, C_n = log p is used and this choice seems to work reasonably well in a wide range of settings.

The BIC^H_L may also be used to select a regularization parameter in shrinkage estimation of quantile regression models. Let β̂_λ = (β̂_{λ,1}, ..., β̂_{λ,p})^T be a penalized estimator with a regularization parameter λ that determines the amount of shrinkage. Adopting the idea of the BIC for selecting λ that was developed in the mean regression setting, we suggest choosing λ for β̂_λ as

λ̂ = argmin_λ BIC^H_L(λ) = argmin_λ [ log( Σ_{i=1}^n ρ_τ(Y_i − X_i^T β̂_λ) ) + |Ŝ_λ| (log n)/(2n) C_n ],   (2.10)

where the argmin runs over all λ > 0 such that the subset Ŝ_λ = {j : β̂_{λ,j} ≠ 0, 1 ≤ j ≤ p} selected by the penalized estimator β̂_λ has cardinality |Ŝ_λ| ≤ s_n. It can be easily shown that, as long as there exists a sequence λ_n such that the shrinkage estimator β̂_{λ_n} satisfies

lim_{n→∞} P(Ŝ_{λ_n} = S*) = 1,   (2.11)

BIC^H_L(λ) at (2.10) works as a criterion for selecting λ for β̂_λ that gives model selection consistency with a diverging number of variables, i.e., P(Ŝ_{λ̂} = S*) → 1. The property (2.11) holds for some sequence λ_n if β̂_λ possesses the oracle property in the sense of Fan and Li (2001). One example of such an estimator is the SCAD-penalized median regression estimator when p = o(n^{1/2}) (Li et al.).
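The rule (2.10) can be sketched as follows; here an L1-penalized quantile regression from scikit-learn stands in for a generic penalized estimator β̂_λ (the paper's numerical work uses SCAD), and the λ grid, the support threshold and the choice C_n = log p are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import QuantileRegressor

def select_lambda(y, X, tau, lambdas, C_n, s_n):
    """Pick lambda by BIC^H_L(lambda) as in (2.10)."""
    n = len(y)
    best = None
    for lam in lambdas:
        fit = QuantileRegressor(quantile=tau, alpha=lam, fit_intercept=False,
                                solver="highs").fit(X, y)
        S = np.flatnonzero(np.abs(fit.coef_) > 1e-8)        # hat{S}_lambda
        if len(S) == 0 or len(S) > s_n:
            continue                                        # outside the allowed range
        u = y - fit.predict(X)
        loss = np.sum(u * (2.0 * tau - 2.0 * (u < 0)))      # check loss of (2.2)
        crit = np.log(loss) + len(S) * np.log(n) / (2.0 * n) * C_n
        if best is None or crit < best[0]:
            best = (crit, lam, S)
    return best   # (criterion value, selected lambda, selected support)
```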

3 BIC for Nonparametric Quantile Regression

In this section we consider two nonparametric models and their estimation based on basis function approximation. We propose some extensions of the BIC criterion derived in Section 2 to these nonparametric models. We note that Huang and Yang (2004) developed a BIC for model selection in nonparametric mean regression models in the time series context. An extension of such work to the nonparametric quantile setting has not been done even in the case where the number of covariates is fixed. Here, we consider the situation where the number of variables p = p_n is allowed to increase with the sample size n. Although we consider two examples of nonparametric quantile models, the principle of our extension is quite general and applicable to other nonparametric quantile models.

3.1 BICs for structured nonparametric models

3.1.1 Varying coefficient model

We consider the varying coefficient quantile regression model

Y_i = X_i^T β(Z_i) + U_i = Σ_{j=0}^p X_j^i β_j(Z_i) + U_i,  i = 1, ..., n,   (3.1)

where β(z) = (β_0(z), ..., β_p(z))^T is a coefficient vector, X_i = (X_0^i, X_1^i, ..., X_p^i)^T with X_0^i ≡ 1, and P(U_i ≤ 0 | X_i = x_i, Z_i = z_i) = τ for almost every (x_i, z_i). Assume that the data (Y_i, X_i, Z_i) for 1 ≤ i ≤ n are independent and identically distributed as (Y, X, Z) ∈ R × R^{p+1} × R. We assume that only d covariates among X_1, ..., X_p are relevant in the model (3.1).

To fit the model (3.1), one may approximate each coefficient function β_j using basis functions {B_jl, l = 1, ..., q_j},

β_j(z) ≈ Σ_{l=1}^{q_j} γ_jl B_jl(z),  j = 0, ..., p,   (3.2)

and then estimate the γ_jl by minimizing Σ_{i=1}^n ρ_τ( Y_i − Σ_{j=0}^p Σ_{l=1}^{q_j} γ_jl X_j^i B_jl(Z_i) ). This gives the estimators β̂_j = Σ_{l=1}^{q_j} γ̂_jl B_jl for j = 0, ..., p. Statistical properties of the estimators β̂_j were established by Kim (2007).

To define a criterion, we need estimates of the coefficient functions for each candidate model. We always include the baseline function β_0 in the model (3.1). For each S = {j_1, ..., j_d} ⊂ {1, ..., p}, define β̂_{S,j} = Σ_{l=1}^{q_j} γ̂_{S,jl} B_jl for j ∈ S ∪ {0}, where

γ̂_S = argmin_γ Σ_{i=1}^n ρ_τ( Y_i − Σ_{j ∈ S ∪ {0}} Σ_{l=1}^{q_j} γ_jl X_j^i B_jl(Z_i) ).   (3.3)

Let N_S denote the number of basis functions B_jl, j ∈ S ∪ {0}, 1 ≤ l ≤ q_j, used to fit the model S with the approximation (3.2), i.e., N_S = q_0 + Σ_{j ∈ S} q_j. Following the idea in Section 2, we propose the following extension of BIC^H_L for the model (3.1):

BIC^H_VC(S) = log( Σ_{i=1}^n ρ_τ( Y_i − Σ_{j ∈ S ∪ {0}} X_j^i β̂_{S,j}(Z_i) ) ) + N_S (log n)/(2n) C_n,   (3.4)

where C_n is a positive constant which diverges to infinity as n increases. The diverging order of C_n will be specified in the asymptotic theory. As in linear quantile regression, we select the model which minimizes BIC^H_VC within the restricted model space M(s_n).
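A small sketch of (3.3)-(3.4): each coefficient function is expanded in a spline basis, the expanded columns X_j^i B_l(Z_i) are passed to a linear quantile regression, and BIC^H_VC is evaluated. A cubic truncated-power basis is used below purely for simplicity (it spans the same function space as B-splines); the knot placement and the use of statsmodels are illustrative assumptions. The criterion (3.6) for the additive model below is computed in the same way, with blocks B_l(X_j^i) in place of X_j^i B_l(Z_i).

```python
import numpy as np
import statsmodels.api as sm

def spline_basis(z, knots):
    # cubic truncated-power basis: 1, z, z^2, z^3, (z - kappa)_+^3 for each knot
    cols = [np.ones_like(z), z, z**2, z**3]
    cols += [np.clip(z - k, 0.0, None) ** 3 for k in knots]
    return np.column_stack(cols)                     # n x q

def bic_h_vc(y, X, z, S, tau, C_n, knots):
    """BIC^H_VC in (3.4) for a candidate model S; the baseline beta_0 is always kept."""
    n = len(y)
    B = spline_basis(z, knots)                       # q basis functions in Z
    blocks = [B] + [X[:, [j]] * B for j in S]        # X_j^i * B_l(Z_i) blocks, j in S
    design = np.hstack(blocks)
    fit = sm.QuantReg(y, design).fit(q=tau)
    u = y - design @ fit.params
    loss = np.sum(u * (2.0 * tau - 2.0 * (u < 0)))
    N_S = design.shape[1]                            # N_S = (|S| + 1) q
    return np.log(loss) + N_S * np.log(n) / (2.0 * n) * C_n
```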

3.1.2 Additive model

In a similar spirit to the BIC in (3.4), we may also develop a BIC for the additive quantile regression model

Y_i = μ + Σ_{j=1}^p m_j(X_j^i) + U_i,  i = 1, ..., n,   (3.5)

where (Y_i, X_i, U_i), 1 ≤ i ≤ n, are independent and identically distributed as (Y, X, U) ∈ R × [0, 1]^p × R and P(U ≤ 0 | X = x) = τ for almost every x. For identification, we assume that ∫_0^1 m_j(x) dx = 0 for j = 1, ..., p. Also, we assume that only d covariates among the X_j's are relevant in the model (3.5). With a basis {B_jl}_{l=1}^{q_j} for approximating m_j, the estimators for a candidate model S are given by m̂_{S,j} = Σ_{l=1}^{q_j} γ̂_{S,jl} B_jl for j ∈ S, where

γ̂_S = argmin_γ Σ_{i=1}^n ρ_τ( Y_i − γ_0 − Σ_{j ∈ S} Σ_{l=1}^{q_j} γ_jl B_jl(X_j^i) ).

We always include the intercept μ in the model (3.5), and take μ̂_S = γ̂_{S,0}. Similarly to the BIC at (3.4), we propose to use the following BIC as a model selection criterion for the model (3.5):

BIC^H_AD(S) = log( Σ_{i=1}^n ρ_τ( Y_i − μ̂_S − Σ_{j ∈ S} m̂_{S,j}(X_j^i) ) ) + N_S (log n)/(2n) C_n,   (3.6)

where N_S = 1 + Σ_{j ∈ S} q_j. Here again, we find the optimal model over the model space M(s_n).

3.2 Consistency of the BICs in covariate selection

If one approximates a nonparametric model by a parametric model, then the problem of estimating the approximate model is essentially the same as the estimation of a parametric model. In that sense, the two estimation problems are methodologically similar. However, from the model selection point of view, the two are different since, for a nonparametric model, the selection of the coefficients of the variables in the approximate model is done groupwise. For example, in our varying coefficient model at (3.1), the variables {X_j B_jl(Z) : 1 ≤ l ≤ q_j} enter and leave the model together. From a theoretical point of view, one also needs to take care of the approximation error in the nonparametric setting.

To facilitate our theoretical analysis of BIC^H_VC and BIC^H_AD, we introduce some notation. Let X denote the predictor variable (X, Z) or X, and X^i its ith observation. The true index sets in the nonparametric models (3.1) and (3.5) are S* = {1 ≤ j ≤ p : ||β_j||_{L_2} ≠ 0} and S* = {1 ≤ j ≤ p : ||m_j||_{L_2} ≠ 0}, respectively, where ||·||_{L_2} denotes the L_2-norm. To present theoretical results for the two models in one framework, we use the following generic notations, whose definitions change according to the model. Let π_j = (B_j1, ..., B_jq_j)^T be the vector of basis functions used to approximate the jth coefficient function β_j in (3.1) or the jth additive component function m_j in (3.5). Suppose that the π_j^T γ_j* are the best approximations, in the sup-norm, of the β_j's or the m_j's. Let r_nj denote the approximation error, that is, r_nj = sup_z |π_j(z)^T γ_j* − β_j(z)| or r_nj = sup_x |π_j(x)^T γ_j* − m_j(x)|. With some slight abuse of notation we denote, by Π = (Π_0, ..., Π_p), both Π(x, z) = (π_0(z)^T, x_1 π_1(z)^T, ..., x_p π_p(z)^T)^T under the varying coefficient model (3.1) and Π(x) = (1, π_1(x_1)^T, ..., π_p(x_p)^T)^T under the additive model (3.5). Let Π^i be the value of Π at the ith observation X^i. Also, let Π_S = (Π_0^T, Π_{j_1}^T, ..., Π_{j_d}^T)^T where S = {j_1, ..., j_d}. Define r_n* = R_n(γ*), where γ* = (γ_0*^T, γ_1*^T, ..., γ_p*^T)^T and R_n(γ) is defined to be sup_{x,z} |Σ_{j=0}^p x_j β_j(z) − Π(x, z)^T γ| under (3.1) or sup_x |μ + Σ_{j=1}^p m_j(x_j) − Π(x)^T γ| under (3.5). Then, Π^T γ* is the best approximation of the true τth conditional quantile in the function space generated by the chosen basis functions. Using these notations, the problems of estimating γ for the two nonparametric models can be written in the single form of the minimization problem

γ̂_S = argmin_γ Σ_{i=1}^n ρ_τ( Y_i − Π_S^{iT} γ ).   (3.7)

For the approximation of the jth coefficient or additive component function, the number of basis functions, q_j, should tend to infinity as n → ∞, so the q_j depend on n. However, we omit such dependence whenever it causes no confusion. Under the mild condition that lim sup_n max_j q_j / min_j q_j < ∞, without loss of generality we can assume q_j = q for all j in our asymptotic analysis.

We make the following assumptions to establish the model selection consistency results of the BICs for the nonparametric quantile regression models.

(B1) The conditional distribution F_{U|X}(·|x) of the error U, given X = x, has a density f_{U|X} that satisfies (i) and (ii) of (A1).

(B2) Let q = q_n ≍ n^{1/(2r+1)}, where r is the constant in (iii) below and a_n ≍ b_n means that the ratio a_n/b_n is bounded away from zero and infinity. The matrices Π^i and Π_S satisfy:
(i) max_{1≤i≤n} max_{0≤j≤p} ||Π_j^i|| = O(√q);

(ii) all the eigenvalues of Σ_S = E(Π_S Π_S^T) are uniformly bounded away from zero and infinity over S: |S| ≤ 2s_n, that is,

0 < l_min ≤ inf_{|S| ≤ 2s_n} l_min(Σ_S) ≤ sup_{|S| ≤ 2s_n} l_max(Σ_S) ≤ l_max < ∞;

(iii) r_nj = O(q^{-r}) and r_n* = O(q^{-r}) for all j and some r > 1/2.

(B3) d is fixed, p = O(n^κ) for some κ > 0, and d ≤ s_n = O(n^α) for some 0 ≤ α < (2r − 1)/(4r + 2).

(B4) C_n → ∞ and C_n q log n / n → 0.

(B5) E|U| < ∞.

The conditions (i) and (iii) of (B2) are not strong and are satisfied in most nonparametric estimation problems based on spline basis approximation; see Kim (2007) and Horowitz and Lee (2005), for example. In particular, the constant r in (iii) depends on the degree of smoothness of the coefficient or component functions as well as on the basis functions. Specifically, when the dth order derivative of β_j or m_j for every j satisfies the Hölder condition of order γ, B-splines of order d + 1 give r = d + γ (Schumaker, 1981). The condition (ii) of (B2) is an extension of (ii) of (A2) to the nonparametric models, which Wei et al. have already used for the analysis of high-dimensional varying coefficient models. When the covariates are independent, the condition (ii) of (B2) holds for the B-spline basis that Kim (2007) and Horowitz and Lee (2005) used. With these assumptions, we obtain the model selection consistency of BIC^H_VC and BIC^H_AD when the number of covariates is diverging.

Theorem 3.1 Under the model (3.1) or (3.5), suppose that (B1)-(B5) hold. Then, we have P(Ŝ = S*) → 1 as n → ∞.

As we discussed at the end of Section 2 in the case of linear quantile regression, the above result justifies the use of BIC^H_VC and BIC^H_AD for the selection of a tuning parameter in nonparametric penalized quantile regression estimation. Additionally, our proof of Theorem 3.1 reveals that BIC^O_VC and BIC^O_AD, which are BIC^H_VC and BIC^H_AD with C_n = 1 respectively, work when the number of covariates p is finite.

Remark. In the structured nonparametric models, BIC may be used to determine the number of basis functions as well as to choose significant variables. In our work, we focus on the use of the

BICs in choosing significant covariates when the appropriate number of basis functions for each component or coefficient is given. When there is no need of covariate selection, the BICs can also be used to determine the number of basis functions in practice. Such examples can be found in He and Shi (1996), Doksum and Koo (2000), Horowitz and Lee (2005) and Kim (2007).

4 Simulation Studies

Simulation experiments are conducted to confirm our theoretical results. We consider three quantile regression models introduced in Sections 2 and 3, one of which is a parametric linear model and the others are nonparametric additive and varying coefficient models. For all models, the error ε_i is independent of the covariate vector X_i or (X_i, Z_i). We examine the performance of the modified BICs in various scenarios, changing the number of variables p relative to the sample size n. We consider the case (n, p) = (100, 10), where p is much smaller than n and thus the ordinary BICs BIC^O_L, BIC^O_VC and BIC^O_AD are applicable. For high-dimensional models, we consider (n, p) = (100, 100) and (100, 200) for the parametric model, and (n, p) = (150, 100) for the nonparametric models. In the latter scenarios, since p is comparable to or larger than n, the ordinary BICs are expected to fail in identifying the true model and the modified ones to excel. As for s_n, we set s_n = 50 for BIC^H_L and s_n = 20 for BIC^H_VC and BIC^H_AD. Here are the detailed descriptions of the models.

Model L (Linear model):

Y_i = 3X_1^i − 1.5X_2^i + 2X_5^i + 2X_8^i ε_i,  i = 1, ..., n,   (4.1)

where the covariates (X_1^i, ..., X_p^i) are generated from a multivariate normal distribution with mean 0 and covariance matrix Σ = (σ_{j_1, j_2}) with σ_{j_1, j_2} = 0.5^{|j_1 − j_2|}. Then, the eighth covariate X_8^i is replaced by a random number from the uniform distribution U[0.5, 1]. For ε_i, we consider the standard normal distribution and the t distribution with 2 degrees of freedom.

Model VC (Varying coefficient model):

Y_i = β_0(Z_i) + β_1(Z_i) X_1^i + β_2(Z_i) X_2^i + X_3^i ε_i(τ),  i = 1, ..., n,   (4.2)

where the covariates X_i = (X_1^i, ..., X_p^i) are generated from a multivariate normal distribution with mean 0 and covariance matrix Σ = (σ_{j_1, j_2}) with σ_{j_1, j_2} = 0.7^{|j_1 − j_2|}, and Z_i is drawn from the uniform distribution

U[0, 1], and ε_i(τ) = ε_i − F_ε^{-1}(τ). Here, the subtraction of F_ε^{-1}(τ) from ε_i is to make the τth quantile of ε_i(τ) equal to zero. As for the distribution F_ε, we consider N(0, 3). Finally, we set β_0(u) = 4u, β_1(u) = 2 sin(2πu), and β_2(u) = 1 + 3u(1 − u).

Model AD (Additive model):

Y_i = f_1(X_1^i) + f_2(X_2^i) + f_3(X_3^i) + σ_4(X_4^i) ε_i,  i = 1, ..., n,   (4.3)

where f_1(x) = 5x, f_2(x) = 4 sin(2πx)/(2 − sin(2πx)), f_3(x) = 6{0.1 sin(2πx) + 0.2 cos(2πx) + 0.3 sin²(2πx) + 0.4 cos³(2πx) + 0.5 sin³(2πx)} and σ_4(x) = 4.5(2x − 1)². The error ε follows a standard normal distribution and the covariates X_1^i, ..., X_p^i have a compound symmetry covariance structure: X_k = (W_k + tU)/(1 + t), k = 1, ..., p, where W_1, ..., W_p and U are i.i.d. from U[0, 1]. We set t =

4.1 Construction of sub-models

When p is large, it is not computationally feasible to calculate BIC^H_L, as well as BIC^H_VC and BIC^H_AD, for all S ∈ M(s_n). A simple solution to this problem is to construct a sequence of sub-models and choose the best model among them. A typical approach for this is to use the regularization path of a penalized estimator. Recently, many efficient methods for finding the path have been developed, such as the LARS algorithm (Efron et al., 2004) and coordinate-descent based algorithms (Mazumder et al., 2011; Breheny and Huang). Below we give some details about the model selection procedure with BIC^H_L at (2.5) and a sparse penalized regression method. For a given penalty p_λ indexed by λ > 0, compute the regularization path of a penalized estimator {β̂_λ : λ > 0}, where

β̂_λ = argmin_β Σ_{i=1}^n ρ_τ(Y_i − X_i^T β) + n Σ_{j=1}^p p_λ(|β_j|).   (4.4)

Let Γ = {λ > 0 : |Ŝ_λ| ≤ s_n}. Choose the best model Ŝ among the Ŝ_λ with λ ∈ Γ, i.e.,

Ŝ = argmin_{Ŝ_λ, λ ∈ Γ} BIC^H_L(Ŝ_λ) = argmin_{Ŝ_λ, λ ∈ Γ} [ log( Σ_{i=1}^n ρ_τ(Y_i − X_{Ŝ_λ,i}^T β̂_{Ŝ_λ}) ) + |Ŝ_λ| (log n)/(2n) C_n ].

Here, β̂_{Ŝ_λ} should not be confused with β̂_λ defined at (4.4). The former is the minimizer of Σ_{i=1}^n ρ_τ(Y_i − X_{Ŝ_λ,i}^T β_{Ŝ_λ}) over β_{Ŝ_λ} ∈ R^{|Ŝ_λ|}. In a similar manner, we construct sub-models for BIC^H_VC and BIC^H_AD using the penalized quantile regression estimators in the varying coefficient and additive models.

Throughout all numerical simulations, we use the SCAD-penalized quantile regression estimator to construct a sequence of sub-models; a code sketch of this construction is given at the end of this subsection. To implement the regularization path, we compute the penalized estimators over a fine grid of λ by applying the local linear approximation algorithm iteratively, starting from the zero initial vector. For numerical details, see Wang et al. and Noh et al. In the case of high-dimensional linear mean regression, Kim and Kwon (2012) and Zhang (2010) established the consistency of the solution path, which means that the regularization path includes the true model for a non-convex penalized estimator with either the SCAD penalty or the minimax concave penalty, under certain conditions. However, such results are not available in high-dimensional quantile regression. Because of this, in our Monte Carlo simulations we report the proportion of cases where the regularization path includes the true model (the column "incl." in Tables 1 and 2). We observe that the proportion of inclusion of the true model is generally high. Additionally, to see whether a failure of the BICs can be attributed to the exclusion of the true model from the path, we add the true model to the sequence of candidate sub-models whenever it is not in the path, and evaluate the model selection performance of the BICs. We find that the model selection performance does not seem to be significantly affected by whether or not we artificially add the true model to the candidate models; see the online supplementary file for the simulation results about this.
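The construction just described can be sketched as follows (hypothetical code, not the authors' implementation): the SCAD derivative supplies coefficient-specific weights, one local linear approximation (LLA) step is solved as a weighted L1-penalized quantile regression (here via column rescaling and scikit-learn's L1 quantile fitter, an assumed stand-in for the solver actually used), and the distinct supports along the λ grid form the candidate sub-models to be ranked by BIC^H_L.

```python
import numpy as np
from sklearn.linear_model import QuantileRegressor

def scad_derivative(x, lam, a=3.7):
    # derivative of the SCAD penalty on [0, infinity)
    x = np.abs(x)
    return lam * (x <= lam) + np.clip(a * lam - x, 0.0, None) / (a - 1.0) * (x > lam)

def lla_step(y, X, tau, lam, beta_init, a=3.7):
    """One local linear approximation step for the SCAD-penalized fit (4.4)."""
    w = scad_derivative(beta_init, lam, a)                 # per-coefficient weights
    scale = np.maximum(w, 1e-6 * lam) / lam                # avoid exactly zero weights
    fit = QuantileRegressor(quantile=tau, alpha=lam, fit_intercept=False,
                            solver="highs").fit(X / scale, y)
    return fit.coef_ / scale                               # undo the column rescaling

def candidate_submodels(y, X, tau, lambdas, s_n, n_iter=3):
    """Collect the supports visited along the regularization path."""
    supports = set()
    for lam in lambdas:
        beta = np.zeros(X.shape[1])                        # zero initial vector
        for _ in range(n_iter):                            # iterate the LLA step
            beta = lla_step(y, X, tau, lam, beta)
        S = tuple(np.flatnonzero(np.abs(beta) > 1e-8))
        if 0 < len(S) <= s_n:
            supports.add(S)
    return supports                                        # rank these by BIC^H_L
```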

4.2 Model selection consistency of the modified BICs

To calculate the BICs for the nonparametric models (Models VC and AD), we use cubic splines and set the number of knots to be 3, which we find enables the corresponding basis to approximate all the coefficient and component functions reasonably well in our simulations. For the assessment of model selection, we say that S is correct if S = S*; S overfits if S ⊇ S* and S ≠ S*; and S underfits if S ⊉ S*, and we report the percentage of each case out of 200 Monte Carlo replications. Additionally, we report the average number of correctly selected nonzero components in the column labeled NC and the average number of zero components incorrectly selected into the final model in the column labeled NIC. The standard deviations based on the 200 replications are presented in parentheses.

Tables 1 and 2 summarize the model selection results of BIC^O_L and BIC^H_L based on 200 replications when the error is normal and t-distributed, respectively. Note that, due to the heteroscedastic error of the model, the index set of the relevant variables changes depending on the quantile level τ: it is {1, 2, 5} for τ = 0.5 and {1, 2, 5, 8} for τ ≠ 0.5. When the predictor dimension is moderate (p = 10), both BICs seem to work reasonably well. However, in high-dimensional situations (p = 100 and 200) the ordinary BIC tends to overfit seriously, which was already observed in the mean regression setting by Chen and Chen (2008), whereas the modified BIC performs quite well in resisting overfitting without losing much efficiency in detecting the true nonzero variables overall. Additionally, we observe that when τ = 0.25 and 0.75 both BICs tend to underfit more than when τ = 0.5. This is because it is more difficult to identify the variable X_8 involved in the heteroscedasticity than the other variables. Detecting such heterogeneity of relevant variables seems to become even more difficult in high dimension, especially considering the results for t errors with p = 200 at τ = 0.25 and 0.75.

Concerning BIC^H_VC and BIC^H_AD, the model selection results are presented in Table 3. Note that S* = {1, 2} and S* = {1, 2, 3} for Models VC and AD, respectively, when τ = 0.5. Because we find that C_n = log p makes the corresponding BICs for both nonparametric models excessively overfit-resistant, resulting in underfit inflation, we also take the choices C_n = (log p)/2 and (log p)/3. Additionally, we present the results of the case where C_n = 1, which corresponds to the ordinary BIC. From Table 3, we learn the same lesson as in the parametric case: in high-dimensional situations the modified BICs for the nonparametric models manage to control the proliferation of overfitting without losing much of the sensitivity for detecting significant covariates. However, this advantage over the ordinary BICs is dependent on the choice of C_n, which is somewhat subjective in our framework. This issue deserves further investigation in the future.
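The summary measures used in the tables can be computed from a selected index set with a few lines (a sketch, not the authors' code):

```python
def selection_summary(S, S_star):
    """Classify a selected set S against the true set S_star and count NC / NIC."""
    S, S_star = set(S), set(S_star)
    if S == S_star:
        status = "correct"
    elif S >= S_star:
        status = "overfit"            # contains S_star but is strictly larger
    else:
        status = "underfit"           # misses at least one relevant variable
    nc = len(S & S_star)              # correctly selected nonzero components (NC)
    nic = len(S - S_star)             # irrelevant variables selected (NIC)
    return status, nc, nic
```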

Table 1: Model selection results when n = 100 with the normal errors.
[For each τ ∈ {0.25, 0.5, 0.75} and p ∈ {10, 100, 200}, the table reports the inclusion proportion of the true model in the path (incl.) and, for each of BIC^O_L and BIC^H_L, the percentages C, O, U together with NC and NIC; the numerical entries are not recoverable from this copy.]
Note: The numbers under C, O and U, respectively, are the percentages of the cases in which the selected index sets are correct, overfitted and underfitted.

Table 2: Model selection results when n = 100 with the t errors.
[Same layout as Table 1; the numerical entries are not recoverable from this copy.]
Note: The numbers under C, O and U, respectively, are the percentages of the cases in which the selected index sets are correct, overfitted and underfitted.

4.3 Effectiveness as a regularization parameter selector

To evaluate the performance of the BIC defined at (2.10) as a regularization parameter selector for penalized estimators, we use 200 replications of random samples of n = 100 and p = 50 from Model L. For illustration, we consider the one-step SCAD-penalized estimator

β̂_λ = (β̂_{λ,1}, ..., β̂_{λ,p})^T = argmin_β Σ_{i=1}^n ρ_τ(Y_i − X_i^T β) + n Σ_{j=1}^p ṗ_λ(|β̃_j|) |β_j|,   (4.5)

which is implemented in the spirit of Zou and Li (2008). Here, the function ṗ_λ is the derivative of the SCAD penalty function, defined on R_+ as

ṗ_λ(x) = λ I(x ≤ λ) + (aλ − x)_+ /(a − 1) I(x > λ)

for some constant a > 2, where I is the indicator function and β̃ = (β̃_1, ..., β̃_p)^T is the unpenalized estimator obtained with λ = 0 in (4.5). Table 4 gives model selection results of the estimator in (4.5) when the regularization parameter λ is chosen by BIC^H_L(λ) at (2.10) and by

BIC^O_L(λ) = log( Σ_{i=1}^n ρ_τ(Y_i − X_i^T β̂_λ) ) + |Ŝ_λ| (log n)/(2n).   (4.6)

We also present the results when λ is selected by cross-validation (5-fold in our simulation) based on the check loss ρ_τ, as was considered in Wu and Liu (2009). We observe that BIC^H_L(λ), as a regularization parameter selector, is as effective as BIC^H_L, as a model selection criterion, when the number of variables is comparable to the sample size, whereas the cross-validation and BIC^O_L(λ) lead to considerable overfitting.

5 Real Data Example

For illustration of the modified BIC for high-dimensional linear quantile regression, we use the data analyzed in Scheetz et al. (2006), which contain gene expression values of 31,042 probe sets on 120 rats. Gene expression levels were analyzed on a log scale with base 2. The main objective of this analysis is to study how the expression of gene TRIM32 (measured by one probe set), known to cause human hereditary diseases of the retina, depends on the expression of other genes. Following Scheetz et al. (2006), we exclude probes that were not expressed in the eye or were not sufficiently variable from the 31,042 probe sets: we remove each probe for which the maximum expression value among the 120 rats is less than the 25th percentile of the entire set of expression values, and choose probes that exhibited at least 2-fold variation in expression level among the 120 rats. After this process, we have 18,986 probes left.
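The probe screening just described amounts to two simple filters; a sketch follows, assuming the log2 expression values are held in a probes-by-rats NumPy array (variable names are illustrative).

```python
import numpy as np

def screen_probes(expr):
    """Keep probes that are expressed and sufficiently variable (Scheetz et al. rule)."""
    q25 = np.quantile(expr, 0.25)                  # 25th percentile of all expression values
    expressed = expr.max(axis=1) >= q25            # maximum expression over the 120 rats
    variable = (expr.max(axis=1) - expr.min(axis=1)) >= 1.0   # 2-fold variation on log2 scale
    return expr[expressed & variable]
```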

Table 3: Model selection results of BIC^H_VC and BIC^H_AD when n = 100 (τ = 0.5).
[For p ∈ {10, 100} and the choices C_n = 1, log p, (log p)/2 and (log p)/3, the table reports C, O, U, NC and NIC for each of BIC^H_VC and BIC^H_AD; the numerical entries are not recoverable from this copy.]
Note: The numbers under C, O and U, respectively, are the percentages of the cases in which the selected index sets are correct, overfitted and underfitted.

Table 4: Model selection results of the penalized estimator (τ = 0.5) when n = 100 and p = 50.

                 N(0, 1)                              t(2)
Method       C        O        U       NC   NIC      C      O        U        NC   NIC
BIC^O_L      56.0%    44.0%    0.0%    .    .        .      42.0%    8.5%     .    .
BIC^H_L      97.0%    0.5%     2.5%    .    .        .      5.0%     27.0%    .    .
CV           71.0%    28.0%    1.0%    .    .        .      33.5%    20.0%    .    .

[Entries marked "." are not recoverable from this copy.]
Note: The numbers under C, O and U, respectively, are the percentages of the cases in which the selected index sets are correct, overfitted and underfitted.

Among these probes, as in Huang et al. (2008), we select the 3000 genes with the largest variance in expression value, and then choose the top 300 genes whose expression values have the largest absolute correlation with the expression of gene TRIM32. We then apply several methods to find genes relevant to TRIM32 at different quantiles τ = 0.25, 0.5, 0.75. To be more specific, we consider the following linear quantile regression model as in Wang et al. (2012):

Q_τ(Y_i | X_i) = β_0(τ) + Σ_{j=1}^{300} β_j(τ) X_j^i,

where Q_τ(Y | X) denotes the τth conditional quantile of the expression of gene TRIM32 (Y) given the expressions of the other genes X = (X_1, ..., X_300)^T. The appropriateness of the linear quantile regression model for these data has been checked in Wang et al. (2012) via the simulation-based graphical method of Wei and He (2006). Since it is impossible to do an exhaustive search over 300 genes for the best model, we rely on the regularization path of the penalized estimator as in Section 4 and restrict the cardinality of the sub-models to 20. With the selected sub-models, we find relevant genes at the quantiles τ = 0.25, 0.5, 0.75 using BIC^H_L and the 5-fold cross-validation (CV1). For comparison, we also present the result of the SCAD-penalized estimator in Wang et al. (2012) with the shrinkage parameter λ selected by the 5-fold cross-validation (CV2). To assess the quality of the model selection results, we conduct 50 random partitions. For each partition, we randomly choose 80 rats as the training data and 40 rats as the test data. With the training data, we conduct model selection based on the three methods, and we report the average number of nonzero coefficients and the average prediction error calculated on the test data in Table 5. The standard errors based on the 50 replications are presented in parentheses. As a measure of prediction error, we use 40^{-1} Σ_{i=1}^{40} ρ_τ(Y_i − Ŷ_i). From Table 5, we observe that the model chosen by the modified BIC is much simpler than the ones chosen by cross-validation but shows comparable or even better prediction error performance across all the quantiles we consider, which suggests that the proposed BIC is a sensible criterion for model selection in high-dimensional linear quantile regression.
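A sketch of the remaining screening steps and of the prediction-error measure used in Table 5 follows; the rows of expr are the retained probes, y is the TRIM32 expression over the same rats, and all names are illustrative assumptions.

```python
import numpy as np

def build_design(expr, y, n_var=3000, n_cor=300):
    """Keep the 3000 highest-variance probes, then the 300 most correlated with y."""
    idx_var = np.argsort(expr.var(axis=1))[-n_var:]
    cors = np.array([abs(np.corrcoef(expr[i], y)[0, 1]) for i in idx_var])
    idx = idx_var[np.argsort(cors)[-n_cor:]]
    return expr[idx].T                             # n x 300 design matrix

def prediction_error(y_test, y_hat, tau):
    """Average check loss on the test rats, as reported in Table 5."""
    u = y_test - y_hat
    return np.mean(u * (2.0 * tau - 2.0 * (u < 0)))
```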

Table 5: Analysis of the gene expression dataset.
[For each quantile level τ = 0.25, 0.5, 0.75 and each method (BIC^H_L, CV1, CV2), the table reports the average number of nonzero coefficients and the average prediction error; the numerical entries are not recoverable from this copy.]

6 Discussion

Even though the BIC for linear quantile regression has been known to select the true model consistently when the predictor dimension is finite, it has not been understood how a diverging number of variables would affect the selection consistency of BIC in quantile regression. In this work we propose several extensions of BIC in quantile regression for the case of a diverging number of variables, both in parametric and nonparametric models, and prove their consistency in model selection.

Since we extend the BIC to nonparametric models via a basis approximation, it is natural to ask whether such an extension is also possible for the kernel smoothing approach. One can think of a BIC for nonparametric quantile regression in the kernel smoothing context by modifying the proposed nonparametric BICs based on the well-known correspondence h ≍ q^{-1} between the bandwidth h and the number of basis functions q. Additionally, it seems possible to obtain the extension from the BIC for parametric quantile regression using the effective sample size nh in kernel smoothing, instead of the original sample size n, as Wang and Xia (2009) did in varying coefficient mean regression models.

In our simulations, we find that the finite sample performance of the modified BICs for high-dimensional quantile regression seems to vary with the choice of C_n. Although we provide a quite wide range of C_n that works in theory, a practical choice of C_n can be subjective, so it deserves further research. As one of the referees commented, it is worthwhile to investigate whether a rigorous Bayesian formulation is available for quantile regression as an extension of the work of Chen and Chen (2008) in the mean regression setting. Based on such an investigation, one may discover another BIC with a Bayesian justification, which may provide a suitable choice of C_n for our modified BICs.

Our theoretical results about model selection consistency can be extended to the ultra-high dimensional case, depending on the cardinality restriction s_n of the candidate sub-models. To be specific, our modified BICs continue to have selection consistency when p = O(exp(n^κ)) for 0 < κ < 1/2 − α, where α is the constant in (A3) for the linear model, and the one in (B3) for the structured nonparametric models. This result only requires a stronger condition on C_n, namely C_n log n / log p → ∞ instead of C_n → ∞, in the conditions (A4) and (B4). Further, when the model dimension is very large, it is worthwhile to consider reducing the number of covariates using a recently developed screening procedure for quantile regression, such as He et al. (2013), before applying our proposed BICs to choose the final model. Finally, we remark that our modified BICs with a broad range of C_n can also be understood within the framework of the generalized information criterion.

Appendix

We present only the proof of Theorem 3.1, because in nonparametric quantile regression one should handle the model approximation error as well, so that it is more complicated than the proof in the linear quantile regression setting. The proof of Theorem 2.1, which considers the linear quantile model, can be found in the online supplementary file.

A.1 Proof of Theorem 3.1

Let M*(s_n) = {S ∈ M(s_n) : S ⊇ S*} and B_S(δ) = {θ ∈ R^{N_S} : ||θ|| ≤ δ}, where ||·|| is the Euclidean norm. Recall that N_S is the number of basis functions used to fit the model S, i.e., N_S = (|S| + 1)q under the varying coefficient model (3.1), or N_S = |S| q + 1 under the additive model (3.5). We denote the maximum of N_S over S ∈ M(s_n) by N. Let R_n = Π^T γ* − Q_τ(X) and R_n^i = Π^{iT} γ* − Q_τ(X^i), where Q_τ(x) is the τth conditional quantile of the response variable Y given X = x. Define l̂_max = sup_{|S| ≤ 2s_n} l_max(Σ̂_S) and l̂_min = inf_{|S| ≤ 2s_n} l_min(Σ̂_S), where Σ̂_S = n^{-1} Σ_{i=1}^n Π_S^i Π_S^{iT}. For a matrix A, let ||A||_M = sup_{u: ||u|| = 1} ||Au|| denote the operator norm of A. Using Theorem 1.6 of Tropp (2012) with the assumption (B2), one has that there exists a constant C > 0 such that for all t ≥ 0,

P( || n^{-1} Σ_{i=1}^n Π_j^i Π_k^{iT} − E(Π_j Π_k^T) ||_M > t ) ≤ Cq exp( −nt² / (Cq² + qt) ).

This gives sup_{0 ≤ j,k ≤ p} || n^{-1} Σ_{i=1}^n Π_j^i Π_k^{iT} − E(Π_j Π_k^T) ||_M = O_p( (log p / n)^{1/2} q ). Since

|| Σ̂_S − Σ_S ||_M = sup_{u,v: ||u|| = ||v|| = 1} | u^T (Σ̂_S − Σ_S) v | ≤ (2s_n + 1) sup_{0 ≤ j,k ≤ p} || n^{-1} Σ_{i=1}^n Π_j^i Π_k^{iT} − E(Π_j Π_k^T) ||_M = o_p(1)

uniformly over S with |S| ≤ 2s_n, we may assume without loss of generality that there exist positive constants c and C such that 0 < c < l̂_min ≤ l̂_max < C < ∞.

23 models. This result only requires a stronger condition on C n, which is C n log n/ log p instead of C n in the conditions A4 and B4. Further, when the model dimension is very large, it is worthwhile to consider reducing the number of the covariates using a recently developed screening procedure for quantile regression, such as He et al. 2013, before applying our proposed BICs to choose the final model. Finally, we remark that our modified BICs with a broad range of C n can be also understood in a framework of the generalized information criterion. Appendix We present only the proof of Theorem 3.1, because in nonparametric quantile regression one should handle the model approximation error as well so that it is more complicated than the one in the linear quantile regression setting. The proof of Theorem 2.1, which considers the linear quantile model, can be found in the online supplementary file. A.1 Proof of Theorem 3.1 Let M s n = { Ms n : } and B δ = {θ R N : θ δ}, where is the Euclidean norm. Recall that N is the number of the basis functions used to fit the model, i.e., N = q + 1 under the varying coefficient model 3.1, or N = q + 1 under the additive model 3.5. We denote the maximum of N over Ms n by N. Let R n = Π γ Q τ X and R i n = Π i γ Q τ X i, where Q τ x is the τth conditional quantile of the response variable Y given X = x. Define ˆl max = sup sn l max ˆΣ and ˆl min = inf sn l min ˆΣ where ˆΣ = n 1 n Πi Πi. For a matrix A, let A M = sup u: u =1 Au denote the operator norm of A. Using Theorem 1.6 of Tropp 2012 with the assumption B2, one has that there exists a constant C > 0 such that for all t 0, P n 1 Π i jπ i k EΠ jπ k > t M This gives sup 0 j,k p n Πi j Πi k /n EΠ jπ k sup u,v: u = v =1 u, ˆΣ Σ v Cq exp nt 2 Cq 2 + qt sup 0 j,k p n Πi j Πi k /n EΠ jπ k. M = O p log p/n 1/2 q. ince ˆΣ Σ M = s n, we may assume M without loss of generality that there exist positive constants c and C such that 0 < c < ˆl min ˆl max < C <. 23

24 Lemma A.1 uppose that B1,B2 and B3 hold. Then, for any sequence {L n } satisfying 1 L n N δ 0/10 for some δ 0 > 0 with N 2+δ 0 = on, we have sup sup M s n γ : γ γ Ln N /n N 1 { ρ τ U i Π i γ γ R i n ρ τ U i R i n + Π i γ γ 2τ 2IU i < 0 E U i X i ρ τ U i Π i γ γ R i n ρ τ U i R i n} = o p 1. A.1 Proof. The lemma can be proved by technical arguments similar to those in the proof of Lemma 3.2 in He and hi Let Z i = n 1/2 Π i and θ = n 1/2 N 1/2 γ γ. Note that A.1 is equivalent to sup sup M s n θ B Ln N 1 { ρ τ U i N 1/2 Zi θ Rn i ρ τ U i Rn i + N 1/2 Zi θ 2τ 2IU i < 0 E U i Z iρ τ U i N 1/2 Zi θ Rn i ρ τ U i Rn} i = o p 1. A.2 By the assumption B3, we can choose δ 0 > 0 such that N 2+δ 0 = on. With such δ 0, let d n = N δ 0/10 L 2 n max M s n max 1 i n Z i N 1/2 L n. ince dn MN 2+4δ 0/5 /n 1/2 = o1 for some constant M > 0, we may assume that d n W for some constant W > 0. Then, it suffices to show that for any ɛ > 0, where P sup sup M s n θ B 1 N 1 h θ > ɛ, d n W 0 as n, h θ = = h i θ { U i N 1/2 L nz i θ Rn i U i Rn i + N 1/2 L nz i θ 2τ 2IU i < 0 E U i Z i U i N 1/2 L nz i θ Rn i U i Rn }. i A.3 Take the minimal number of balls with radius q 0 = ɛn /24W n, say Γ 1,..., Γ Kn, that covers B 1. Actually, the balls and the radius depend on but we omit such dependence in notation for 24

WEIGHTED QUANTILE REGRESSION THEORY AND ITS APPLICATION. Abstract

WEIGHTED QUANTILE REGRESSION THEORY AND ITS APPLICATION. Abstract Journal of Data Science,17(1). P. 145-160,2019 DOI:10.6339/JDS.201901_17(1).0007 WEIGHTED QUANTILE REGRESSION THEORY AND ITS APPLICATION Wei Xiong *, Maozai Tian 2 1 School of Statistics, University of

More information

Variable Selection for Highly Correlated Predictors

Variable Selection for Highly Correlated Predictors Variable Selection for Highly Correlated Predictors Fei Xue and Annie Qu arxiv:1709.04840v1 [stat.me] 14 Sep 2017 Abstract Penalty-based variable selection methods are powerful in selecting relevant covariates

More information

Nonconcave Penalized Likelihood with A Diverging Number of Parameters

Nonconcave Penalized Likelihood with A Diverging Number of Parameters Nonconcave Penalized Likelihood with A Diverging Number of Parameters Jianqing Fan and Heng Peng Presenter: Jiale Xu March 12, 2010 Jianqing Fan and Heng Peng Presenter: JialeNonconcave Xu () Penalized

More information

Analysis Methods for Supersaturated Design: Some Comparisons

Analysis Methods for Supersaturated Design: Some Comparisons Journal of Data Science 1(2003), 249-260 Analysis Methods for Supersaturated Design: Some Comparisons Runze Li 1 and Dennis K. J. Lin 2 The Pennsylvania State University Abstract: Supersaturated designs

More information

Single Index Quantile Regression for Heteroscedastic Data

Single Index Quantile Regression for Heteroscedastic Data Single Index Quantile Regression for Heteroscedastic Data E. Christou M. G. Akritas Department of Statistics The Pennsylvania State University SMAC, November 6, 2015 E. Christou, M. G. Akritas (PSU) SIQR

More information

Smoothly Clipped Absolute Deviation (SCAD) for Correlated Variables

Smoothly Clipped Absolute Deviation (SCAD) for Correlated Variables Smoothly Clipped Absolute Deviation (SCAD) for Correlated Variables LIB-MA, FSSM Cadi Ayyad University (Morocco) COMPSTAT 2010 Paris, August 22-27, 2010 Motivations Fan and Li (2001), Zou and Li (2008)

More information

High-dimensional regression

High-dimensional regression High-dimensional regression Advanced Methods for Data Analysis 36-402/36-608) Spring 2014 1 Back to linear regression 1.1 Shortcomings Suppose that we are given outcome measurements y 1,... y n R, and

More information

Existence and Uniqueness

Existence and Uniqueness Chapter 3 Existence and Uniqueness An intellect which at a certain moment would know all forces that set nature in motion, and all positions of all items of which nature is composed, if this intellect

More information

Semi-Penalized Inference with Direct FDR Control

Semi-Penalized Inference with Direct FDR Control Jian Huang University of Iowa April 4, 2016 The problem Consider the linear regression model y = p x jβ j + ε, (1) j=1 where y IR n, x j IR n, ε IR n, and β j is the jth regression coefficient, Here p

More information

High-dimensional regression with unknown variance

High-dimensional regression with unknown variance High-dimensional regression with unknown variance Christophe Giraud Ecole Polytechnique march 2012 Setting Gaussian regression with unknown variance: Y i = f i + ε i with ε i i.i.d. N (0, σ 2 ) f = (f

More information

Lasso Maximum Likelihood Estimation of Parametric Models with Singular Information Matrices

Lasso Maximum Likelihood Estimation of Parametric Models with Singular Information Matrices Article Lasso Maximum Likelihood Estimation of Parametric Models with Singular Information Matrices Fei Jin 1,2 and Lung-fei Lee 3, * 1 School of Economics, Shanghai University of Finance and Economics,

More information

The MNet Estimator. Patrick Breheny. Department of Biostatistics Department of Statistics University of Kentucky. August 2, 2010

The MNet Estimator. Patrick Breheny. Department of Biostatistics Department of Statistics University of Kentucky. August 2, 2010 Department of Biostatistics Department of Statistics University of Kentucky August 2, 2010 Joint work with Jian Huang, Shuangge Ma, and Cun-Hui Zhang Penalized regression methods Penalized methods have

More information

The deterministic Lasso

The deterministic Lasso The deterministic Lasso Sara van de Geer Seminar für Statistik, ETH Zürich Abstract We study high-dimensional generalized linear models and empirical risk minimization using the Lasso An oracle inequality

More information

TECHNICAL REPORT NO. 1091r. A Note on the Lasso and Related Procedures in Model Selection

TECHNICAL REPORT NO. 1091r. A Note on the Lasso and Related Procedures in Model Selection DEPARTMENT OF STATISTICS University of Wisconsin 1210 West Dayton St. Madison, WI 53706 TECHNICAL REPORT NO. 1091r April 2004, Revised December 2004 A Note on the Lasso and Related Procedures in Model

More information

Chap 1. Overview of Statistical Learning (HTF, , 2.9) Yongdai Kim Seoul National University

Chap 1. Overview of Statistical Learning (HTF, , 2.9) Yongdai Kim Seoul National University Chap 1. Overview of Statistical Learning (HTF, 2.1-2.6, 2.9) Yongdai Kim Seoul National University 0. Learning vs Statistical learning Learning procedure Construct a claim by observing data or using logics

More information

The Adaptive Lasso and Its Oracle Properties Hui Zou (2006), JASA

The Adaptive Lasso and Its Oracle Properties Hui Zou (2006), JASA The Adaptive Lasso and Its Oracle Properties Hui Zou (2006), JASA Presented by Dongjun Chung March 12, 2010 Introduction Definition Oracle Properties Computations Relationship: Nonnegative Garrote Extensions:

More information

Can we do statistical inference in a non-asymptotic way? 1

Can we do statistical inference in a non-asymptotic way? 1 Can we do statistical inference in a non-asymptotic way? 1 Guang Cheng 2 Statistics@Purdue www.science.purdue.edu/bigdata/ ONR Review Meeting@Duke Oct 11, 2017 1 Acknowledge NSF, ONR and Simons Foundation.

More information

Sparse Linear Models (10/7/13)

Sparse Linear Models (10/7/13) STA56: Probabilistic machine learning Sparse Linear Models (0/7/) Lecturer: Barbara Engelhardt Scribes: Jiaji Huang, Xin Jiang, Albert Oh Sparsity Sparsity has been a hot topic in statistics and machine

More information

A Consistent Information Criterion for Support Vector Machines in Diverging Model Spaces

A Consistent Information Criterion for Support Vector Machines in Diverging Model Spaces Journal of Machine Learning Research 17 (2016) 1-26 Submitted 6/14; Revised 5/15; Published 4/16 A Consistent Information Criterion for Support Vector Machines in Diverging Model Spaces Xiang Zhang Yichao

More information

VARIABLE SELECTION AND ESTIMATION WITH THE SEAMLESS-L 0 PENALTY

VARIABLE SELECTION AND ESTIMATION WITH THE SEAMLESS-L 0 PENALTY Statistica Sinica 23 (2013), 929-962 doi:http://dx.doi.org/10.5705/ss.2011.074 VARIABLE SELECTION AND ESTIMATION WITH THE SEAMLESS-L 0 PENALTY Lee Dicker, Baosheng Huang and Xihong Lin Rutgers University,

More information

Forward Regression for Ultra-High Dimensional Variable Screening

Forward Regression for Ultra-High Dimensional Variable Screening Forward Regression for Ultra-High Dimensional Variable Screening Hansheng Wang Guanghua School of Management, Peking University This version: April 9, 2009 Abstract Motivated by the seminal theory of Sure

More information

High-dimensional Ordinary Least-squares Projection for Screening Variables

High-dimensional Ordinary Least-squares Projection for Screening Variables 1 / 38 High-dimensional Ordinary Least-squares Projection for Screening Variables Chenlei Leng Joint with Xiangyu Wang (Duke) Conference on Nonparametric Statistics for Big Data and Celebration to Honor

More information

The Iterated Lasso for High-Dimensional Logistic Regression

The Iterated Lasso for High-Dimensional Logistic Regression The Iterated Lasso for High-Dimensional Logistic Regression By JIAN HUANG Department of Statistics and Actuarial Science, 241 SH University of Iowa, Iowa City, Iowa 52242, U.S.A. SHUANGE MA Division of

More information

Bayesian variable selection via. Penalized credible regions. Brian Reich, NCSU. Joint work with. Howard Bondell and Ander Wilson

Bayesian variable selection via. Penalized credible regions. Brian Reich, NCSU. Joint work with. Howard Bondell and Ander Wilson Bayesian variable selection via penalized credible regions Brian Reich, NC State Joint work with Howard Bondell and Ander Wilson Brian Reich, NCSU Penalized credible regions 1 Motivation big p, small n

More information

On High-Dimensional Cross-Validation

On High-Dimensional Cross-Validation On High-Dimensional Cross-Validation BY WEI-CHENG HSIAO Institute of Statistical Science, Academia Sinica, 128 Academia Road, Section 2, Nankang, Taipei 11529, Taiwan hsiaowc@stat.sinica.edu.tw 5 WEI-YING

More information

ISyE 691 Data mining and analytics

ISyE 691 Data mining and analytics ISyE 691 Data mining and analytics Regression Instructor: Prof. Kaibo Liu Department of Industrial and Systems Engineering UW-Madison Email: kliu8@wisc.edu Office: Room 3017 (Mechanical Engineering Building)

More information

Cross-Validation with Confidence

Cross-Validation with Confidence Cross-Validation with Confidence Jing Lei Department of Statistics, Carnegie Mellon University UMN Statistics Seminar, Mar 30, 2017 Overview Parameter est. Model selection Point est. MLE, M-est.,... Cross-validation

More information

Machine Learning for OR & FE

Machine Learning for OR & FE Machine Learning for OR & FE Regression II: Regularization and Shrinkage Methods Martin Haugh Department of Industrial Engineering and Operations Research Columbia University Email: martin.b.haugh@gmail.com

More information

Iterative Selection Using Orthogonal Regression Techniques

Iterative Selection Using Orthogonal Regression Techniques Iterative Selection Using Orthogonal Regression Techniques Bradley Turnbull 1, Subhashis Ghosal 1 and Hao Helen Zhang 2 1 Department of Statistics, North Carolina State University, Raleigh, NC, USA 2 Department

More information

arxiv: v2 [stat.me] 4 Jun 2016

arxiv: v2 [stat.me] 4 Jun 2016 Variable Selection for Additive Partial Linear Quantile Regression with Missing Covariates 1 Variable Selection for Additive Partial Linear Quantile Regression with Missing Covariates Ben Sherwood arxiv:1510.00094v2

More information

Outlier detection and variable selection via difference based regression model and penalized regression

Outlier detection and variable selection via difference based regression model and penalized regression Journal of the Korean Data & Information Science Society 2018, 29(3), 815 825 http://dx.doi.org/10.7465/jkdi.2018.29.3.815 한국데이터정보과학회지 Outlier detection and variable selection via difference based regression

More information

STAT 992 Paper Review: Sure Independence Screening in Generalized Linear Models with NP-Dimensionality J.Fan and R.Song

STAT 992 Paper Review: Sure Independence Screening in Generalized Linear Models with NP-Dimensionality J.Fan and R.Song STAT 992 Paper Review: Sure Independence Screening in Generalized Linear Models with NP-Dimensionality J.Fan and R.Song Presenter: Jiwei Zhao Department of Statistics University of Wisconsin Madison April

More information

Robust Bayesian Variable Selection for Modeling Mean Medical Costs

Robust Bayesian Variable Selection for Modeling Mean Medical Costs Robust Bayesian Variable Selection for Modeling Mean Medical Costs Grace Yoon 1,, Wenxin Jiang 2, Lei Liu 3 and Ya-Chen T. Shih 4 1 Department of Statistics, Texas A&M University 2 Department of Statistics,

More information

Paper Review: Variable Selection via Nonconcave Penalized Likelihood and its Oracle Properties by Jianqing Fan and Runze Li (2001)

Paper Review: Variable Selection via Nonconcave Penalized Likelihood and its Oracle Properties by Jianqing Fan and Runze Li (2001) Paper Review: Variable Selection via Nonconcave Penalized Likelihood and its Oracle Properties by Jianqing Fan and Runze Li (2001) Presented by Yang Zhao March 5, 2010 1 / 36 Outlines 2 / 36 Motivation

More information

Learning Theory. Ingo Steinwart University of Stuttgart. September 4, 2013

Learning Theory. Ingo Steinwart University of Stuttgart. September 4, 2013 Learning Theory Ingo Steinwart University of Stuttgart September 4, 2013 Ingo Steinwart University of Stuttgart () Learning Theory September 4, 2013 1 / 62 Basics Informal Introduction Informal Description

More information

Discussion of the paper Inference for Semiparametric Models: Some Questions and an Answer by Bickel and Kwon

Discussion of the paper Inference for Semiparametric Models: Some Questions and an Answer by Bickel and Kwon Discussion of the paper Inference for Semiparametric Models: Some Questions and an Answer by Bickel and Kwon Jianqing Fan Department of Statistics Chinese University of Hong Kong AND Department of Statistics

More information

Chapter 9. Non-Parametric Density Function Estimation

Chapter 9. Non-Parametric Density Function Estimation 9-1 Density Estimation Version 1.1 Chapter 9 Non-Parametric Density Function Estimation 9.1. Introduction We have discussed several estimation techniques: method of moments, maximum likelihood, and least

More information

Cross-Validation with Confidence

Cross-Validation with Confidence Cross-Validation with Confidence Jing Lei Department of Statistics, Carnegie Mellon University WHOA-PSI Workshop, St Louis, 2017 Quotes from Day 1 and Day 2 Good model or pure model? Occam s razor We really

More information

Median Cross-Validation

Median Cross-Validation Median Cross-Validation Chi-Wai Yu 1, and Bertrand Clarke 2 1 Department of Mathematics Hong Kong University of Science and Technology 2 Department of Medicine University of Miami IISA 2011 Outline Motivational

More information

On Mixture Regression Shrinkage and Selection via the MR-LASSO

On Mixture Regression Shrinkage and Selection via the MR-LASSO On Mixture Regression Shrinage and Selection via the MR-LASSO Ronghua Luo, Hansheng Wang, and Chih-Ling Tsai Guanghua School of Management, Peing University & Graduate School of Management, University

More information

Inference For High Dimensional M-estimates. Fixed Design Results

Inference For High Dimensional M-estimates. Fixed Design Results : Fixed Design Results Lihua Lei Advisors: Peter J. Bickel, Michael I. Jordan joint work with Peter J. Bickel and Noureddine El Karoui Dec. 8, 2016 1/57 Table of Contents 1 Background 2 Main Results and

More information

Comparisons of penalized least squares. methods by simulations

Comparisons of penalized least squares. methods by simulations Comparisons of penalized least squares arxiv:1405.1796v1 [stat.co] 8 May 2014 methods by simulations Ke ZHANG, Fan YIN University of Science and Technology of China, Hefei 230026, China Shifeng XIONG Academy

More information

arxiv: v1 [stat.me] 30 Dec 2017

arxiv: v1 [stat.me] 30 Dec 2017 arxiv:1801.00105v1 [stat.me] 30 Dec 2017 An ISIS screening approach involving threshold/partition for variable selection in linear regression 1. Introduction Yu-Hsiang Cheng e-mail: 96354501@nccu.edu.tw

More information

Inference For High Dimensional M-estimates: Fixed Design Results

Inference For High Dimensional M-estimates: Fixed Design Results Inference For High Dimensional M-estimates: Fixed Design Results Lihua Lei, Peter Bickel and Noureddine El Karoui Department of Statistics, UC Berkeley Berkeley-Stanford Econometrics Jamboree, 2017 1/49

More information

Regression, Ridge Regression, Lasso

Regression, Ridge Regression, Lasso Regression, Ridge Regression, Lasso Fabio G. Cozman - fgcozman@usp.br October 2, 2018 A general definition Regression studies the relationship between a response variable Y and covariates X 1,..., X n.

More information

STAT 200C: High-dimensional Statistics

STAT 200C: High-dimensional Statistics STAT 200C: High-dimensional Statistics Arash A. Amini May 30, 2018 1 / 57 Table of Contents 1 Sparse linear models Basis Pursuit and restricted null space property Sufficient conditions for RNS 2 / 57

More information

Statistical Inference

Statistical Inference Statistical Inference Liu Yang Florida State University October 27, 2016 Liu Yang, Libo Wang (Florida State University) Statistical Inference October 27, 2016 1 / 27 Outline The Bayesian Lasso Trevor Park

More information

Selection Criteria Based on Monte Carlo Simulation and Cross Validation in Mixed Models

Selection Criteria Based on Monte Carlo Simulation and Cross Validation in Mixed Models Selection Criteria Based on Monte Carlo Simulation and Cross Validation in Mixed Models Junfeng Shang Bowling Green State University, USA Abstract In the mixed modeling framework, Monte Carlo simulation

More information

Chapter 9. Non-Parametric Density Function Estimation

Chapter 9. Non-Parametric Density Function Estimation 9-1 Density Estimation Version 1.2 Chapter 9 Non-Parametric Density Function Estimation 9.1. Introduction We have discussed several estimation techniques: method of moments, maximum likelihood, and least

More information

Covariance function estimation in Gaussian process regression

Covariance function estimation in Gaussian process regression Covariance function estimation in Gaussian process regression François Bachoc Department of Statistics and Operations Research, University of Vienna WU Research Seminar - May 2015 François Bachoc Gaussian

More information

Model comparison and selection

Model comparison and selection BS2 Statistical Inference, Lectures 9 and 10, Hilary Term 2008 March 2, 2008 Hypothesis testing Consider two alternative models M 1 = {f (x; θ), θ Θ 1 } and M 2 = {f (x; θ), θ Θ 2 } for a sample (X = x)

More information

Linear Model Selection and Regularization

Linear Model Selection and Regularization Linear Model Selection and Regularization Recall the linear model Y = β 0 + β 1 X 1 + + β p X p + ɛ. In the lectures that follow, we consider some approaches for extending the linear model framework. In

More information

Asymptotic Statistics-III. Changliang Zou

Asymptotic Statistics-III. Changliang Zou Asymptotic Statistics-III Changliang Zou The multivariate central limit theorem Theorem (Multivariate CLT for iid case) Let X i be iid random p-vectors with mean µ and and covariance matrix Σ. Then n (

More information

Smooth simultaneous confidence bands for cumulative distribution functions

Smooth simultaneous confidence bands for cumulative distribution functions Journal of Nonparametric Statistics, 2013 Vol. 25, No. 2, 395 407, http://dx.doi.org/10.1080/10485252.2012.759219 Smooth simultaneous confidence bands for cumulative distribution functions Jiangyan Wang

More information

Generalization theory

Generalization theory Generalization theory Daniel Hsu Columbia TRIPODS Bootcamp 1 Motivation 2 Support vector machines X = R d, Y = { 1, +1}. Return solution ŵ R d to following optimization problem: λ min w R d 2 w 2 2 + 1

More information

Or How to select variables Using Bayesian LASSO

Or How to select variables Using Bayesian LASSO Or How to select variables Using Bayesian LASSO x 1 x 2 x 3 x 4 Or How to select variables Using Bayesian LASSO x 1 x 2 x 3 x 4 Or How to select variables Using Bayesian LASSO On Bayesian Variable Selection

More information

Semi-Nonparametric Inferences for Massive Data

Semi-Nonparametric Inferences for Massive Data Semi-Nonparametric Inferences for Massive Data Guang Cheng 1 Department of Statistics Purdue University Statistics Seminar at NCSU October, 2015 1 Acknowledge NSF, Simons Foundation and ONR. A Joint Work

More information

Feature selection with high-dimensional data: criteria and Proc. Procedures

Feature selection with high-dimensional data: criteria and Proc. Procedures Feature selection with high-dimensional data: criteria and Procedures Zehua Chen Department of Statistics & Applied Probability National University of Singapore Conference in Honour of Grace Wahba, June

More information

Lecture 7 Introduction to Statistical Decision Theory

Lecture 7 Introduction to Statistical Decision Theory Lecture 7 Introduction to Statistical Decision Theory I-Hsiang Wang Department of Electrical Engineering National Taiwan University ihwang@ntu.edu.tw December 20, 2016 1 / 55 I-Hsiang Wang IT Lecture 7

More information

Statistical Data Mining and Machine Learning Hilary Term 2016

Statistical Data Mining and Machine Learning Hilary Term 2016 Statistical Data Mining and Machine Learning Hilary Term 2016 Dino Sejdinovic Department of Statistics Oxford Slides and other materials available at: http://www.stats.ox.ac.uk/~sejdinov/sdmml Naïve Bayes

More information

Lecture 14: Variable Selection - Beyond LASSO

Lecture 14: Variable Selection - Beyond LASSO Fall, 2017 Extension of LASSO To achieve oracle properties, L q penalty with 0 < q < 1, SCAD penalty (Fan and Li 2001; Zhang et al. 2007). Adaptive LASSO (Zou 2006; Zhang and Lu 2007; Wang et al. 2007)

More information

Does k-th Moment Exist?

Does k-th Moment Exist? Does k-th Moment Exist? Hitomi, K. 1 and Y. Nishiyama 2 1 Kyoto Institute of Technology, Japan 2 Institute of Economic Research, Kyoto University, Japan Email: hitomi@kit.ac.jp Keywords: Existence of moments,

More information

Statistics for high-dimensional data: Group Lasso and additive models

Statistics for high-dimensional data: Group Lasso and additive models Statistics for high-dimensional data: Group Lasso and additive models Peter Bühlmann and Sara van de Geer Seminar für Statistik, ETH Zürich May 2012 The Group Lasso (Yuan & Lin, 2006) high-dimensional

More information

Boosting Methods: Why They Can Be Useful for High-Dimensional Data

Boosting Methods: Why They Can Be Useful for High-Dimensional Data New URL: http://www.r-project.org/conferences/dsc-2003/ Proceedings of the 3rd International Workshop on Distributed Statistical Computing (DSC 2003) March 20 22, Vienna, Austria ISSN 1609-395X Kurt Hornik,

More information

Model Selection and Geometry

Model Selection and Geometry Model Selection and Geometry Pascal Massart Université Paris-Sud, Orsay Leipzig, February Purpose of the talk! Concentration of measure plays a fundamental role in the theory of model selection! Model

More information

Bayesian Grouped Horseshoe Regression with Application to Additive Models

Bayesian Grouped Horseshoe Regression with Application to Additive Models Bayesian Grouped Horseshoe Regression with Application to Additive Models Zemei Xu, Daniel F. Schmidt, Enes Makalic, Guoqi Qian, and John L. Hopper Centre for Epidemiology and Biostatistics, Melbourne

More information

Lecture 3: Introduction to Complexity Regularization

Lecture 3: Introduction to Complexity Regularization ECE90 Spring 2007 Statistical Learning Theory Instructor: R. Nowak Lecture 3: Introduction to Complexity Regularization We ended the previous lecture with a brief discussion of overfitting. Recall that,

More information

Additive Isotonic Regression

Additive Isotonic Regression Additive Isotonic Regression Enno Mammen and Kyusang Yu 11. July 2006 INTRODUCTION: We have i.i.d. random vectors (Y 1, X 1 ),..., (Y n, X n ) with X i = (X1 i,..., X d i ) and we consider the additive

More information

Selection of Smoothing Parameter for One-Step Sparse Estimates with L q Penalty

Selection of Smoothing Parameter for One-Step Sparse Estimates with L q Penalty Journal of Data Science 9(2011), 549-564 Selection of Smoothing Parameter for One-Step Sparse Estimates with L q Penalty Masaru Kanba and Kanta Naito Shimane University Abstract: This paper discusses the

More information

COMPARISON OF GMM WITH SECOND-ORDER LEAST SQUARES ESTIMATION IN NONLINEAR MODELS. Abstract

COMPARISON OF GMM WITH SECOND-ORDER LEAST SQUARES ESTIMATION IN NONLINEAR MODELS. Abstract Far East J. Theo. Stat. 0() (006), 179-196 COMPARISON OF GMM WITH SECOND-ORDER LEAST SQUARES ESTIMATION IN NONLINEAR MODELS Department of Statistics University of Manitoba Winnipeg, Manitoba, Canada R3T

More information

Stability and the elastic net

Stability and the elastic net Stability and the elastic net Patrick Breheny March 28 Patrick Breheny High-Dimensional Data Analysis (BIOS 7600) 1/32 Introduction Elastic Net Our last several lectures have concentrated on methods for

More information

Issues on quantile autoregression

Issues on quantile autoregression Issues on quantile autoregression Jianqing Fan and Yingying Fan We congratulate Koenker and Xiao on their interesting and important contribution to the quantile autoregression (QAR). The paper provides

More information

high-dimensional inference robust to the lack of model sparsity

high-dimensional inference robust to the lack of model sparsity high-dimensional inference robust to the lack of model sparsity Jelena Bradic (joint with a PhD student Yinchu Zhu) www.jelenabradic.net Assistant Professor Department of Mathematics University of California,

More information

Consistent high-dimensional Bayesian variable selection via penalized credible regions

Consistent high-dimensional Bayesian variable selection via penalized credible regions Consistent high-dimensional Bayesian variable selection via penalized credible regions Howard Bondell bondell@stat.ncsu.edu Joint work with Brian Reich Howard Bondell p. 1 Outline High-Dimensional Variable

More information

Identification and Estimation for Generalized Varying Coefficient Partially Linear Models

Identification and Estimation for Generalized Varying Coefficient Partially Linear Models Identification and Estimation for Generalized Varying Coefficient Partially Linear Models Mingqiu Wang, Xiuli Wang and Muhammad Amin Abstract The generalized varying coefficient partially linear model

More information

Shrinkage Methods: Ridge and Lasso

Shrinkage Methods: Ridge and Lasso Shrinkage Methods: Ridge and Lasso Jonathan Hersh 1 Chapman University, Argyros School of Business hersh@chapman.edu February 27, 2019 J.Hersh (Chapman) Ridge & Lasso February 27, 2019 1 / 43 1 Intro and

More information

M-estimation in high-dimensional linear model

M-estimation in high-dimensional linear model Wang and Zhu Journal of Inequalities and Applications 208 208:225 https://doi.org/0.86/s3660-08-89-3 R E S E A R C H Open Access M-estimation in high-dimensional linear model Kai Wang and Yanling Zhu *

More information

Sparsity Models. Tong Zhang. Rutgers University. T. Zhang (Rutgers) Sparsity Models 1 / 28

Sparsity Models. Tong Zhang. Rutgers University. T. Zhang (Rutgers) Sparsity Models 1 / 28 Sparsity Models Tong Zhang Rutgers University T. Zhang (Rutgers) Sparsity Models 1 / 28 Topics Standard sparse regression model algorithms: convex relaxation and greedy algorithm sparse recovery analysis:

More information

Analysis of Greedy Algorithms

Analysis of Greedy Algorithms Analysis of Greedy Algorithms Jiahui Shen Florida State University Oct.26th Outline Introduction Regularity condition Analysis on orthogonal matching pursuit Analysis on forward-backward greedy algorithm

More information

Machine Learning Linear Regression. Prof. Matteo Matteucci

Machine Learning Linear Regression. Prof. Matteo Matteucci Machine Learning Linear Regression Prof. Matteo Matteucci Outline 2 o Simple Linear Regression Model Least Squares Fit Measures of Fit Inference in Regression o Multi Variate Regession Model Least Squares

More information

Stepwise Searching for Feature Variables in High-Dimensional Linear Regression

Stepwise Searching for Feature Variables in High-Dimensional Linear Regression Stepwise Searching for Feature Variables in High-Dimensional Linear Regression Qiwei Yao Department of Statistics, London School of Economics q.yao@lse.ac.uk Joint work with: Hongzhi An, Chinese Academy

More information

Statistical inference on Lévy processes

Statistical inference on Lévy processes Alberto Coca Cabrero University of Cambridge - CCA Supervisors: Dr. Richard Nickl and Professor L.C.G.Rogers Funded by Fundación Mutua Madrileña and EPSRC MASDOC/CCA student workshop 2013 26th March Outline

More information

ASYMPTOTIC PROPERTIES OF BRIDGE ESTIMATORS IN SPARSE HIGH-DIMENSIONAL REGRESSION MODELS

ASYMPTOTIC PROPERTIES OF BRIDGE ESTIMATORS IN SPARSE HIGH-DIMENSIONAL REGRESSION MODELS ASYMPTOTIC PROPERTIES OF BRIDGE ESTIMATORS IN SPARSE HIGH-DIMENSIONAL REGRESSION MODELS Jian Huang 1, Joel L. Horowitz 2, and Shuangge Ma 3 1 Department of Statistics and Actuarial Science, University

More information

A New Combined Approach for Inference in High-Dimensional Regression Models with Correlated Variables

A New Combined Approach for Inference in High-Dimensional Regression Models with Correlated Variables A New Combined Approach for Inference in High-Dimensional Regression Models with Correlated Variables Niharika Gauraha and Swapan Parui Indian Statistical Institute Abstract. We consider the problem of

More information

Bi-level feature selection with applications to genetic association

Bi-level feature selection with applications to genetic association Bi-level feature selection with applications to genetic association studies October 15, 2008 Motivation In many applications, biological features possess a grouping structure Categorical variables may

More information

Some Theories about Backfitting Algorithm for Varying Coefficient Partially Linear Model

Some Theories about Backfitting Algorithm for Varying Coefficient Partially Linear Model Some Theories about Backfitting Algorithm for Varying Coefficient Partially Linear Model 1. Introduction Varying-coefficient partially linear model (Zhang, Lee, and Song, 2002; Xia, Zhang, and Tong, 2004;

More information

On the Behavior of Marginal and Conditional Akaike Information Criteria in Linear Mixed Models

On the Behavior of Marginal and Conditional Akaike Information Criteria in Linear Mixed Models On the Behavior of Marginal and Conditional Akaike Information Criteria in Linear Mixed Models Thomas Kneib Department of Mathematics Carl von Ossietzky University Oldenburg Sonja Greven Department of

More information

Tuning Parameter Selection in Regularized Estimations of Large Covariance Matrices

Tuning Parameter Selection in Regularized Estimations of Large Covariance Matrices Tuning Parameter Selection in Regularized Estimations of Large Covariance Matrices arxiv:1308.3416v1 [stat.me] 15 Aug 2013 Yixin Fang 1, Binhuan Wang 1, and Yang Feng 2 1 New York University and 2 Columbia

More information

Multicategory Vertex Discriminant Analysis for High-Dimensional Data

Multicategory Vertex Discriminant Analysis for High-Dimensional Data Multicategory Vertex Discriminant Analysis for High-Dimensional Data Tong Tong Wu Department of Epidemiology and Biostatistics University of Maryland, College Park October 8, 00 Joint work with Prof. Kenneth

More information

arxiv: v1 [stat.me] 23 Dec 2017 Abstract

arxiv: v1 [stat.me] 23 Dec 2017 Abstract Distribution Regression Xin Chen Xuejun Ma Wang Zhou Department of Statistics and Applied Probability, National University of Singapore stacx@nus.edu.sg stamax@nus.edu.sg stazw@nus.edu.sg arxiv:1712.08781v1

More information

Theoretical results for lasso, MCP, and SCAD

Theoretical results for lasso, MCP, and SCAD Theoretical results for lasso, MCP, and SCAD Patrick Breheny March 2 Patrick Breheny High-Dimensional Data Analysis (BIOS 7600) 1/23 Introduction There is an enormous body of literature concerning theoretical

More information

INSTRUMENTAL-VARIABLE CALIBRATION ESTIMATION IN SURVEY SAMPLING

INSTRUMENTAL-VARIABLE CALIBRATION ESTIMATION IN SURVEY SAMPLING Statistica Sinica 24 (2014), 1001-1015 doi:http://dx.doi.org/10.5705/ss.2013.038 INSTRUMENTAL-VARIABLE CALIBRATION ESTIMATION IN SURVEY SAMPLING Seunghwan Park and Jae Kwang Kim Seoul National Univeristy

More information

Frontier estimation based on extreme risk measures

Frontier estimation based on extreme risk measures Frontier estimation based on extreme risk measures by Jonathan EL METHNI in collaboration with Ste phane GIRARD & Laurent GARDES CMStatistics 2016 University of Seville December 2016 1 Risk measures 2

More information

Inference After Variable Selection

Inference After Variable Selection Department of Mathematics, SIU Carbondale Inference After Variable Selection Lasanthi Pelawa Watagoda lasanthi@siu.edu June 12, 2017 Outline 1 Introduction 2 Inference For Ridge and Lasso 3 Variable Selection

More information

The Australian National University and The University of Sydney. Supplementary Material

The Australian National University and The University of Sydney. Supplementary Material Statistica Sinica: Supplement HIERARCHICAL SELECTION OF FIXED AND RANDOM EFFECTS IN GENERALIZED LINEAR MIXED MODELS The Australian National University and The University of Sydney Supplementary Material

More information

P-Values for High-Dimensional Regression

P-Values for High-Dimensional Regression P-Values for High-Dimensional Regression Nicolai einshausen Lukas eier Peter Bühlmann November 13, 2008 Abstract Assigning significance in high-dimensional regression is challenging. ost computationally

More information

On Bayesian Computation

On Bayesian Computation On Bayesian Computation Michael I. Jordan with Elaine Angelino, Maxim Rabinovich, Martin Wainwright and Yun Yang Previous Work: Information Constraints on Inference Minimize the minimax risk under constraints

More information

On Model Selection Consistency of Lasso

On Model Selection Consistency of Lasso On Model Selection Consistency of Lasso Peng Zhao Department of Statistics University of Berkeley 367 Evans Hall Berkeley, CA 94720-3860, USA Bin Yu Department of Statistics University of Berkeley 367

More information

Class 2 & 3 Overfitting & Regularization

Class 2 & 3 Overfitting & Regularization Class 2 & 3 Overfitting & Regularization Carlo Ciliberto Department of Computer Science, UCL October 18, 2017 Last Class The goal of Statistical Learning Theory is to find a good estimator f n : X Y, approximating

More information

Discussion of High-dimensional autocovariance matrices and optimal linear prediction,

Discussion of High-dimensional autocovariance matrices and optimal linear prediction, Electronic Journal of Statistics Vol. 9 (2015) 1 10 ISSN: 1935-7524 DOI: 10.1214/15-EJS1007 Discussion of High-dimensional autocovariance matrices and optimal linear prediction, Xiaohui Chen University

More information