Efficient Regressions via Optimally Combining Quantile Information


Zhibiao Zhao (Penn State University) and Zhijie Xiao (Boston College)

September 29, 2011

Abstract

We study efficient estimation of regression models via quantile regression. Both the classical parametric linear regression model and the nonparametric regression model are investigated. We argue that it is crucial to optimally combine information over quantiles. We investigate both the weighted composite quantile regression estimator based on a quantile weighted loss function, and the weighted quantile average estimator based on a weighted average of quantile regression estimators at single quantiles. We show that efficient regression estimators can be constructed using quantile regression. In the case of non-regular statistical estimation, the proposed estimators lead to super-efficient estimation. Asymptotic properties of the proposed estimators are studied, and efficiency comparisons are conducted among these estimators and other estimators such as the ordinary least squares estimator and the least-absolute-deviation estimator. We also conduct Monte Carlo experiments to investigate the sampling performance of the proposed estimators.

AMS 2000: Primary 62J05, 62G08; Secondary 62G20.

Keywords: Asymptotic normality, Bahadur representation, Composite quantile regression, Efficiency, Fisher information, Quantile regression, Super-efficiency.

We thank Roger Koenker and Steve Portnoy for their very helpful comments. Zhibiao Zhao: Department of Statistics, Penn State University, University Park, PA 16802; zuz13@stat.psu.edu. Zhijie Xiao: Department of Economics, Boston College, Chestnut Hill, MA 02467; zhijie.xiao@bc.edu.

1 Introduction

We are interested in the problem of efficient estimation of regression models:

Y = m(X, θ) + ε,  (1)

where X is the p-dimensional covariate vector, Y ∈ R is the response, ε is the noise, and θ is the vector of unknown parameters. In particular, although our method may be extended to the estimation of other statistical models (see the discussion in Section 6), we consider in this paper two leading models that are widely used in statistical applications: the classical parametric linear regression model, where θ is finite dimensional, and the nonparametric regression model, where the parameter is of infinite dimension.

In practice, model (1) is usually estimated via an ordinary least squares (OLS) regression (for the parametric model) or a local least squares regression (in the nonparametric case) of Y on X based on a collection of observations. When ε is normally distributed, the widely used OLS estimator has a likelihood interpretation and is the most efficient estimator. In the absence of Gaussianity, the OLS estimator is usually less efficient than methods that exploit the distributional information, although it may still be consistent under appropriate regularity conditions. Without these regularity conditions, the OLS estimator may not even be consistent, for example, when the error has a heavy-tailed distribution such as the Cauchy distribution. Monte Carlo evidence indicates that the OLS estimator can be very sensitive to outliers. In empirical analysis, many applications (such as finance and economics) involve data whose distributions are heavy-tailed or skewed, and estimators based on OLS may perform poorly in these cases. It is therefore important to develop robust and efficient estimation procedures for general innovation distributions. If the error distribution were known, the maximum likelihood estimator (MLE) could be constructed.
Under appropriate regularity conditions, the MLE is asymptotically normal and asymptotically efficient in the sense that the limiting covariance matrix attains the Cramér-Rao lower bound. In practice, the true density function is generally unknown and so the MLE is not feasible. However, the MLE and the Cramér-Rao efficiency bound serve as a standard against which we should measure our estimators. The existence and construction of asymptotically efficient regression estimators have attracted much research attention; see, e.g., Stein (1956), Beran (1974), and Stone (

In his Wald lecture, based on a nonparametric estimate of the score function and a one-step Newton-Raphson approximation, Bickel (1982) proposed an asymptotically efficient estimator of linear regressions with unknown symmetric error distribution. He also showed that, in the presence of an intercept term, an asymptotically efficient estimator of the slope parameters can be obtained without the symmetry assumption.

We believe that the quantile regression technique (Koenker and Bassett, 1978) can provide a useful method for efficient statistical estimation. Intuitively, an estimation method that exploits the distributional information can potentially provide an efficient estimator. Since quantile regression provides a way of estimating the whole conditional distribution, appropriately using quantile regressions may improve estimation efficiency. Under regularity assumptions, the least-absolute-deviation (LAD) regression (a quantile regression at the median) can provide better estimators than the OLS in the presence of heavy-tailed distributions. In addition, for certain distributions, a quantile regression at a non-median quantile may deliver a more efficient estimator than the LAD method. Furthermore, additional efficiency gains may be achieved by aggregating information over multiple quantiles.

Although using quantile regression over multiple quantiles can potentially improve the efficiency of estimation, this is often much easier to say than to do in a satisfactory way. To combine information from quantile regression, one may consider combining information over different quantiles via the criterion function of the estimation procedure, or combining information based on estimators at different quantiles. An important contribution along the second direction is Portnoy and Koenker (1989), who proposed an asymptotically efficient estimator for the slope parameters of the classical linear regression model.
Chamberlain (1994) and Xiao and Koenker (2009) considered more complicated models using minimum distance estimation to improve efficiency, although without reaching the efficiency bound. Zou and Yuan (2008) and Kai, Li and Zou (2010) recently considered the composite quantile regression to incorporate information across quantiles along the first direction mentioned above. In particular, they use an aggregate loss function based on a simple average of the individual loss functions over multiple quantiles. In this paper, we argue that, although a composite quantile regression estimator based on the aggregated criterion function may outperform the OLS estimator for some non-normal distributions, a simple average is in general not an efficient way of using distributional information from quantile regressions.

Information at different quantiles is correlated, and improperly using multiple quantile information may even reduce efficiency. Roughly speaking, the simple average delivers good estimators when the error distribution is close to normal. In fact, for nonparametric regression, a simple-averaging-based composite quantile regression estimator is asymptotically equivalent to the local least squares estimator. However, the main purpose of combining quantile information is to improve efficiency when the error distribution is not normal and OLS does not work well. It is therefore important to combine quantile information appropriately to achieve an efficiency gain.

We study the optimal combination of quantile regression information. We investigate both the weighted composite quantile regression estimator based on weighted quantile loss functions, and the weighted quantile average estimator based on a weighted average of quantile regression estimators at single quantiles. Optimal weighting schemes that lead to efficient regression estimators are proposed. In the case of non-regular statistical estimation, the proposed estimators lead to super-efficient estimation. Asymptotic properties of the proposed estimators are studied, and efficiency comparisons are conducted among these estimators and other estimators such as the OLS and LAD estimators.

The rest of this paper is organized as follows: Section 2 contains a preliminary result that is useful for our subsequent asymptotic theory. We study the linear regression model in Section 3 and the nonparametric regression model in Section 4. We briefly discuss estimation in non-regular cases in Section 5. Extensions and future work are discussed in Section 6. Section 7 contains simulation studies. Proofs are gathered in Section 8. Throughout, we consider combining information over k quantiles τ_i = i/(k+1), i = 1, ..., k.
2 A Preliminary Result

Given the regression model (1), we denote the distribution, density, and quantile functions of the error term ε by F, f, and Q, respectively. Denote the Fisher information of f by I(f), i.e.,

I(f) = ∫_R [∂ log f(u)/∂u]² f(u) du = ∫_R [f′(u)]²/f(u) du.

Throughout we assume I(f) < ∞. Our goal is to improve the efficiency of regression estimators (upon traditional methods) by optimally using information at the different quantiles τ_1, ..., τ_k. Throughout this paper,

we use the vector q and the matrices Γ and H defined as follows:

q = (q_1, ..., q_k)^T, where q_i = f(Q(τ_i)),
Γ = [min(τ_i, τ_j) − τ_i τ_j]_{1≤i,j≤k},
H = [ (min(τ_i, τ_j) − τ_i τ_j) / (f(Q(τ_i)) f(Q(τ_j))) ]_{1≤i,j≤k}.

For convenience we write q_0 = q_{k+1} = 0. We also make the following assumptions.

Assumption 1. f is positive, twice continuously differentiable, and bounded on {u : 0 < F(u) < 1}.

Assumption 2. Let g(τ) = f(Q(τ)). Then

lim_{τ→0} [g²(τ) + g²(1−τ)]/τ = 0 and lim_{τ→0} τ ∫_τ^{1−τ} [g′(t)]² dt = 0.

Assumptions 1 and 2 are conventionally imposed in the study of efficient statistical estimation. Assumption 1 assumes smoothness of the density function. Assumption 2 is needed for the Cramér-Rao type efficiency result. It requires that the error density approaches zero at the boundary at a sufficiently fast rate; otherwise one may estimate the parameters at a faster rate. See, e.g., Akahira and Takeuchi (1995) for a discussion of this issue, and also the discussion in Section 5.

Given a random sample {(X_t, Y_t)}_{t=1}^n, we consider efficient estimation of θ using quantile regressions. As will become clear later, for many quantile regression estimators that admit an appropriate Bahadur representation, the covariance matrix of the estimator based on the optimal quantile combination depends on Λ_k defined below:

Λ_k = e_k^T H^{-1} e_k, where e_k = (1, ..., 1)^T.  (2)

Theorem 1 below gives an important preliminary result on combining quantile information: it indicates that as we use a large number of quantiles (k → ∞), Λ_k converges to the Fisher information.

Theorem 1. Under Assumptions 1 and 2, lim_{k→∞} Λ_k = I(f).
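Theorem 1 is easy to check numerically. The sketch below (our illustration, not the paper's code) computes Λ_k = e_k^T H^{-1} e_k for the logistic error distribution, for which f(Q(τ)) = τ(1−τ) in closed form and I(f) = 1/3; the choice of the logistic distribution is ours, made purely because it avoids any special-function evaluation:

```python
import numpy as np

def lambda_k(k):
    """Lambda_k = e_k^T H^{-1} e_k of (2) for a logistic error, where
    H_ij = (min(tau_i, tau_j) - tau_i*tau_j) / (q_i*q_j) and q_i = f(Q(tau_i))."""
    tau = np.arange(1, k + 1) / (k + 1)              # tau_i = i/(k+1)
    q = tau * (1 - tau)                              # logistic: f(Q(tau)) = tau(1-tau)
    G = np.minimum.outer(tau, tau) - np.outer(tau, tau)   # Brownian-bridge covariance
    H = G / np.outer(q, q)
    e = np.ones(k)
    return float(e @ np.linalg.solve(H, e))

# Lambda_k increases toward I(f) = 1/3 as k grows.
print(lambda_k(9), lambda_k(99))
```

With k = 9 the value is already within about 1% of I(f) = 1/3, and k = 99 is closer still, in line with Theorem 1.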

Assumption 2 is an important assumption that ensures the Cramér-Rao efficiency result. Proposition 1 below presents alternative sufficient conditions for Assumption 2.

Proposition 1. Under Assumption 1, if {u : 0 < F(u) < 1} = R, then Assumption 2 holds if either one of the following two conditions holds:

lim_{τ→0} { [g²(τ) + g²(1−τ)]/τ + [|g′(τ)| + |g′(1−τ)|] τ^{3/2} } = 0,  (3)

or, equivalently,

lim_{u→±∞} { f²(u)/min{1−F(u), F(u)} + |∂²[log f(u)]/∂u²| · min{1−F(u), F(u)}^{3/2}/f(u) } = 0.  (4)

By (4) in Proposition 1, for a symmetric density f, Assumption 2 is satisfied if

lim_{u→∞} { f²(u)/[1−F(u)] + |∂²[log f(u)]/∂u²| · [1−F(u)]^{3/2}/f(u) } = 0.  (5)

We write a ≍ b if a/b is bounded away from zero and infinity.

Corollary 1. Suppose that there exist ν > 1 and β > 0 such that f(u) ≍ |u|^{−ν} and |∂²[log f(u)]/∂u²| ≍ |u|^{−β} as |u| → ∞. Then Assumption 2 is satisfied if β + 3(ν−1)/2 > 1. In particular, the latter holds if β ≥ 1.

Many commonly used distributions with support on R satisfy Assumption 2, hence yielding asymptotic efficiency by Theorem 1. (i) If f is the standard normal density, then ∂²[log f(u)]/∂u² = −1 and 1−F(u) = [1+o(1)] f(u)/u as u → ∞, so it is easy to verify (5). (ii) For the Laplace distribution with density f(u) = 0.5 exp(−|u|), u ∈ R, we have 1−F(u) = 0.5 exp(−u) for u > 0 and ∂²[log f(u)]/∂u² = 0 for u ≠ 0, so (5) trivially holds. (iii) For the logistic distribution, we can show g(τ) = c(τ − τ²) for some constant c > 0, so (3) holds. (iv) For the Student-t distribution with α > 0 degrees of freedom, Corollary 1 holds with ν = α + 1 and β = 2. (v) For the symmetric α-stable distribution with index α ∈ (0, 2), Corollary 1 also holds with ν = α + 1 and β = 2. In this case, OLS is consistent, with rate n^{1−1/α}, only for α > 1 (for α ≤ 1, OLS is not consistent), and the limiting distribution is non-normal and again α-stable. The proposed quantile regression method still achieves efficiency with asymptotic normality. (vi) For the mixture normal distribution θN(µ_1, σ_1²) + (1−θ)N(µ_2, σ_2²), µ_1, µ_2 ∈ R, σ_1², σ_2² > 0, θ ∈ [0, 1], we can verify (4).

3 Efficient Estimation of the Linear Model

We first consider in this section the problem of estimating the coefficients of the linear model

Y = α + Xβ + ε,  (6)

where X = (X(1), ..., X(p)) is the p-dimensional covariate vector, Y is the response, ε is the noise independent of X, α is the intercept, and β = (β_1, ..., β_p)^T is the slope vector. We first look at β. Given a random sample {(X_t, Y_t)}_{t=1}^n, we consider two ways of combining quantile information in Sections 3.1 and 3.2, respectively; additional discussion is given in Section 3.3. Section 3.4 considers α. For convenience and without loss of generality, we assume that the regressors X are centered around 0.

Assumption 3. E(X) = 0 and the covariance matrix Σ_X = E(X^T X) is positive definite.

3.1 The Weighted Composite Quantile Regression Estimation

The first approach combines quantiles via the criterion function. For τ ∈ (0, 1), denote by Q_Y(τ|X) the conditional τ-th quantile of Y given X; then model (6) has the following quantile representation:

Q_Y(τ|X) = α(τ) + Xβ with α(τ) = α + Q(τ),  (7)

where Q(τ) is the τ-th quantile of ε. We consider estimating the model over {τ_i, i = 1, ..., k} jointly based on the following weighted quantile loss function:

({α̂(τ_i)}_{i=1}^k, β̂_WCQR) = argmin_{a_1,...,a_k,b} Σ_{i=1}^k w_i Σ_{t=1}^n ρ_{τ_i}(Y_t − a_i − X_t b),  (8)

where w_i ≥ 0 with Σ_{i=1}^k w_i = 1 are weights and ρ_τ(z) = z(τ − 1{z < 0}) is the quantile loss function (Koenker and Bassett, 1978). If w_i ≡ 1/k, (8) reduces to the (unweighted) Composite Quantile Regression (CQR) of Zou and Yuan (2008). For this reason, we call (8) the Weighted CQR (WCQR). Theorem 2 below gives the limiting distribution of the WCQR estimator.
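As a concrete illustration (a sketch of ours, not the authors' code, with hypothetical function names), the check loss and the WCQR criterion in (8) can be written down directly; here `X` is an n×p design array, `a` collects the k candidate intercepts, and `b` is the candidate slope vector:

```python
import numpy as np

def rho(tau, z):
    """Quantile check loss: rho_tau(z) = z * (tau - 1{z < 0})."""
    z = np.asarray(z, dtype=float)
    return z * (tau - (z < 0))

def wcqr_objective(a, b, X, Y, taus, w):
    """Weighted CQR criterion (8): sum_i w_i * sum_t rho_{tau_i}(Y_t - a_i - X_t b)."""
    resid = Y - X @ b                        # n-vector of residuals net of the slope part
    return sum(wi * rho(t, resid - ai).sum()
               for ai, t, wi in zip(a, taus, w))
```

Minimizing this criterion over (a, b) is a linear program; the snippet only evaluates the objective, which is the part that differs from the unweighted CQR.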

Theorem 2. Let w = (w_1, ..., w_k)^T, and let Γ and q be defined as in Section 2. Under Assumptions 1 and 3, as the sample size n → ∞,

√n (β̂_WCQR − β) → N(0, Σ_X^{-1} J(w)), where J(w) = (w^T Γ w)/(w^T q)².  (9)

If we use equal weights w_i = 1/k over all quantiles, the asymptotic covariance matrix of the unweighted CQR is given by Σ_X^{-1} J_CQR, where

J_CQR = ( Σ_{i=1}^k f(Q(τ_i)) )^{-2} Σ_{i=1}^k Σ_{j=1}^k [min(τ_i, τ_j) − τ_i τ_j].  (10)

However, the amounts of information at different quantiles differ, and the information at different quantiles is correlated, depending on the error distribution. Thus, it is important to combine these quantiles optimally (instead of by simple averaging). By Theorem 2, the covariance matrix of β̂_WCQR depends on the weights through J(w). Thus a natural way to select optimal weights in the WCQR is to minimize J(w) in (9). Theorem 3 below studies the optimal WCQR estimator.

Assumption 4. The log density function log f(u) of ε is concave.

Theorem 3. Under Assumptions 1, 3 and 4, the optimal weights w* minimizing J(w) are

w*_i = (2q_i − q_{i−1} − q_{i+1})/(q_1 + q_k) ≥ 0, i = 1, ..., k.  (11)

For the corresponding optimal WCQR estimator β̂*_WCQR, we have

√n (β̂*_WCQR − β) → N(0, Σ_X^{-1} Λ_k^{-1}),

where Λ_k is given by (2). By the result in Theorem 1, under the additional Assumption 2, Λ_k → I(f): as we use more quantiles (k → ∞), the optimal WCQR achieves the Cramér-Rao efficiency bound.

Remark. For general weights w_i = w(τ_i) generated by a function w(·), as k → ∞,

J(w) → [ 2 ∫_0^1 ∫_0^r w(r) w(s) s(1−r) ds dr ] / [ ∫_0^1 w(r) f(Q(r)) dr ]².
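The optimal weights (11) are a second-difference formula in the q_i and are trivial to compute. In the following sketch (ours, not the paper's) we evaluate them for the logistic error, q_i = τ_i(1−τ_i); for this distribution the formula happens to return exactly uniform weights, which is consistent with the observation that simple averaging (the unweighted CQR) works well for near-logistic/normal errors:

```python
import numpy as np

def wcqr_weights(q):
    """Optimal WCQR weights (11): w_i = (2q_i - q_{i-1} - q_{i+1}) / (q_1 + q_k),
    with the convention q_0 = q_{k+1} = 0."""
    qpad = np.concatenate(([0.0], q, [0.0]))
    return (2 * qpad[1:-1] - qpad[:-2] - qpad[2:]) / (q[0] + q[-1])

k = 9
tau = np.arange(1, k + 1) / (k + 1)
q = tau * (1 - tau)                  # logistic error: f(Q(tau)) = tau(1 - tau)
w = wcqr_weights(q)
print(np.round(w, 4), w.sum())       # sums to one by the telescoping in (11)
```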

3.2 The Weighted Quantile Average Estimation

The WCQR estimator combines information at different quantiles via the criterion function. We next consider an alternative approach that combines information based on estimators at different quantiles. For each τ, we estimate β via the quantile regression:

(α̂(τ), β̂(τ)) = argmin_{a,b} Σ_{t=1}^n ρ_τ(Y_t − a − X_t b).  (12)

Under Assumptions 1 and 3, β̂(τ) is asymptotically normal:

√n (β̂(τ) − β) → N(0, Σ_X^{-1} τ(1−τ)/f²(Q(τ))).  (13)

To combine quantile information, we define the weighted quantile average estimator (WQAE):

β̂_WQAE = Σ_{i=1}^k ϖ_i β̂(τ_i), subject to Σ_{i=1}^k ϖ_i = 1.  (14)

Estimators of the above form have a long history in statistics, dating back to Mosteller (1946). Portnoy and Koenker (1989) studied estimators of the form ∫ β̂(τ) dν(τ) with a general weighting function. An important difference between the weights ϖ in the WQAE and the weights w in the WCQR is that the ϖ_i are not restricted to be non-negative.

Just as in the case of the WCQR, properly weighting different quantiles in the WQAE is crucial for efficiently combining distributional information to improve efficiency. As shown by Theorem 4(i) below, the optimal weights can be obtained by minimizing a quadratic function of ϖ. The optimal weights and the corresponding asymptotic distribution of the optimal WQAE are given in Theorem 4(ii). Recall the matrix H and the vector e_k from Section 2.

Theorem 4. (i) Denote ϖ = (ϖ_1, ..., ϖ_k)^T. Under Assumptions 1 and 3, as the sample size n → ∞,

√n (β̂_WQAE − β) → N(0, Σ_X^{-1} S(ϖ)), where S(ϖ) = ϖ^T H ϖ.  (15)

(ii) The optimal weights that minimize S(ϖ) are unique and are given by

ϖ* = H^{-1} e_k / (e_k^T H^{-1} e_k).  (16)

Recall Λ_k in (2). For the corresponding optimal WQAE β̂*_WQAE, we have

√n (β̂*_WQAE − β) → N(0, Σ_X^{-1} S(ϖ*)), where S(ϖ*) = Λ_k^{-1}.
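Once the q_i = f(Q(τ_i)) are known or estimated, the optimal weights (16) amount to one linear solve. A minimal numerical sketch (ours, not the paper's; the logistic error with q_i = τ_i(1−τ_i) is again used purely for illustration):

```python
import numpy as np

def wqae_weights(tau, q):
    """Optimal WQAE weights (16): varpi* = H^{-1} e_k / (e_k^T H^{-1} e_k),
    with H_ij = (min(tau_i, tau_j) - tau_i*tau_j) / (q_i*q_j)."""
    H = (np.minimum.outer(tau, tau) - np.outer(tau, tau)) / np.outer(q, q)
    v = np.linalg.solve(H, np.ones(len(tau)))
    return v / v.sum()

k = 9
tau = np.arange(1, k + 1) / (k + 1)
w = wqae_weights(tau, tau * (1 - tau))   # logistic error: q_i = tau_i (1 - tau_i)
print(np.round(w, 4))                    # symmetric weights summing to one
```

For the logistic error these weights come out proportional to q_i itself; for heavy-tailed errors such as the Cauchy, some components are negative, as discussed above.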

By Theorem 1, lim_{k→∞} Λ_k = I(f) under Assumption 2. Thus, as we use more quantiles, the variance of the optimal WQAE estimator approaches the Cramér-Rao lower bound, corroborating the result of Portnoy and Koenker (1989).

Notice that the optimal weights ϖ* allow for negative components. The appearance of a negative weight in ϖ* is due to very inaccurate estimation at some quantile (say τ) combined with the existence of very accurate estimates at other quantiles that are highly correlated with the estimator at this quantile τ. For example, consider the Cauchy distribution and two quantiles τ_1 = 0.15, τ_2 = 0.5; then β̂(τ_2) is a much more accurate estimator than β̂(τ_1), and they are positively correlated. In this case, the correlation effect is stronger than the averaging effect, and an optimal combination of β̂(τ_1) and β̂(τ_2) would impose a negative weight on β̂(τ_1).

3.3 A Comparison of Asymptotic Efficiency

The WQAE β̂_WQAE differs from the WCQR estimator β̂_WCQR in several aspects. While the WCQR estimator in (8) is based on an aggregation of several quantile loss functions, the WQAE in (14) is based on a weighted average of separate estimates from different quantiles. As a result, computing the WQAE only involves k separate (p+1)-parameter minimization problems, whereas the WCQR requires solving a larger (p+k)-parameter minimization problem. In addition, to ensure a proper loss function, the weights w_i of β̂_WCQR are restricted to be non-negative; by contrast, the weights ϖ_1, ..., ϖ_k in (14) can be negative. It is computationally appealing to impose fewer constraints on the weights. By Theorems 1 and 4, the optimal WQAE is asymptotically efficient under Assumptions 1 and 2. By contrast, asymptotic efficiency of the optimal WCQR requires the additional Assumption 4, which is due to the constraint of non-negative weights in the WCQR. Theorem 5 below shows that in general the optimal WQAE is asymptotically more efficient than the WCQR estimator.
Only when Assumption 4 holds is the optimal WCQR asymptotically equivalent to the optimal WQAE, and then both are asymptotically efficient.

Theorem 5. Let S(ϖ*) be defined as in Theorem 4(ii) and J(w) as in (9). Then S(ϖ*) ≤ min_w J(w), with equality at w = w* (given by (11)) under Assumption 4.

Next we study the asymptotic relative efficiency (ARE) of the optimal WQAE with respect to the OLS estimator, the quantile regression (QR) estimator β̂(τ) in (12) at a single quantile

τ, and the unweighted CQR estimator. They all have asymptotic normality with limiting distribution of the form √n(β̂ − β) → N(0, s² Σ_X^{-1}), where s² = var(ε) (assumed finite) for the OLS, τ(1−τ)/f²(Q(τ)) for the QR, J_CQR in (10) for the unweighted CQR, and S(ϖ*) in Theorem 4(ii) for the optimal WQAE. For comparison, we use the optimal WQAE as the benchmark and define its ARE relative to an estimator β̂ as

ARE(β̂) = s²/S(ϖ*) = (e_k^T H^{-1} e_k) s², where β̂ ∈ {OLS, QR(τ), CQR}.

ARE ≥ 1 indicates better performance of the optimal WQAE. Clearly, ARE(CQR) ≥ 1, ARE(QR(τ_i)) ≥ 1, and lim_{k→∞} ARE(OLS) ≥ 1. We discuss a few examples below using k = 9 quantiles.

Example 1. Let ε be Student-t distributed. Then ARE(CQR) = 1.03, 1.12, 1.58 for t (or N(0,1)), t_2, and t_1 errors, respectively. As the tail of the distribution gets heavier, the optimal WQAE becomes increasingly more efficient than the CQR. For t_1 (the Cauchy distribution), the optimal weights are ϖ* = {−0.03, −0.04, 0.08, 0.29, 0.40, 0.29, 0.08, −0.04, −0.03}: the quantiles τ = 0.4, 0.5, 0.6 contribute almost all the information, whereas the quantiles τ = 0.1, 0.2, 0.8, 0.9 are penalized heavily with negative weights. In this case, Zou and Yuan's CQR with uniform weights would not perform well. On the other hand, for N(0,1), ϖ* = {0.13, 0.11, 0.11, 0.10, 0.10, 0.10, 0.11, 0.11, 0.13} is close to the uniform weights. Therefore the optimal WQAE and the CQR have comparable performance, and both significantly outperform the single QR estimator in (12). See Table 1 for details.

Example 2. Let ε be a mixture of two normal distributions. Mixture 1 (different variances): 0.5N(0,1) + 0.5N(0, 0.5⁶); Mixture 2 (different means): 0.5N(−2,1) + 0.5N(2,1). For Mixture 1, ϖ* = {−0.002, −0.102, 0.183, 0.277, 0.287, 0.277, 0.183, −0.102, −0.002}: the quantiles 0.3, 0.4, 0.5, 0.6, 0.7 contain substantial information, whereas the quantiles 0.2 and 0.8 receive negative weights. For Mixture 2, ϖ* = {0.185, 0.156, 0.153, 0.078, −0.144, 0.078, 0.153, 0.156, 0.185} exhibits a quite different pattern.
For Mixture 1, ARE(OLS) = 10.17 and ARE(CQR) = 2.43; for Mixture 2, ARE(OLS) = 3.28, with ARE(CQR) as reported in Table 1. Thus, the optimal WQAE offers substantial efficiency gains. See Table 1 for details.

Example 3. Let ε be a Gamma random variable with shape parameter α > 0. For α = 1 (the exponential distribution), ϖ*_1 = 1.152, ϖ*_2 = −0.124, and ϖ*_i ≈ 0 for i = 3, ..., 9: the quantiles 0.1 and 0.2 contain almost all the information. See Table 1 for details.

Insert Table 1 about here.

As shown in the examples above, different quantiles may carry substantially different amounts of information, and failing to properly utilize such information may result in a substantial loss of efficiency. This phenomenon provides strong evidence in favor of our proposed optimally weighted quantile-based estimators.

3.4 Efficient Estimation of the Intercept Term

Efficient estimation of the intercept α in (6) requires additional assumptions such as symmetry; see, e.g., Bickel (1982). In the quantile domain representation (7), without further assumptions, α is unidentifiable from α(τ). Following the previous literature, we assume that f is symmetric, which holds for many distributions, such as the Student-t distribution, the uniform distribution on symmetric intervals, the Laplace distribution, the symmetric α-stable distribution, many normal mixture distributions, and their truncated versions on symmetric intervals.

Assumption 5. The distribution of ε is symmetric.

Recall α(τ) in (7) and α̂(τ) in (12). We consider the following WQAE for α:

α̂_WQAE = Σ_{i=1}^k ϖ_i α̂(τ_i), subject to Σ_{i=1}^k ϖ_i = 1 and ϖ_i = ϖ_{k+1−i}, i = 1, ..., k.  (17)

The asymptotic distribution of α̂_WQAE is given in Theorem 6 below. Again, the optimal WQAE of α can be constructed by minimizing the limiting variance of α̂_WQAE. Since the limiting variance of α̂_WQAE depends on S(ϖ) defined in Theorem 4, and the optimal weights ϖ* in Theorem 4 satisfy the condition ϖ*_i = ϖ*_{k+1−i} under Assumption 5, we have:

Theorem 6. (i) Under Assumptions 1, 3 and 5, we have the asymptotic normality

√n (α̂_WQAE − α) → N(0, S(ϖ)), where S(ϖ) = ϖ^T H ϖ.  (18)

(ii) The optimal weights that minimize the asymptotic variance are given by ϖ* in (16), and the optimal WQAE α̂*_WQAE has the limiting distribution

√n (α̂*_WQAE − α) → N(0, Λ_k^{-1}),

where Λ_k is defined in (2). By Theorem 1, under the additional Assumption 2, the variance of the optimal WQAE of α approaches the Cramér-Rao lower bound as the number of quantiles k → ∞.

3.5 Estimating f(Q(τ))

To compute the optimal weights, we need to estimate f(Q(τ)). Since f(Q(τ)) is invariant under location shifts of ε, the intercept α in (6) can be absorbed into ε. We propose the following procedure:

(i) Use uniform weights ϖ_i = 1/k to obtain the preliminary estimator β̃, and compute the residuals (a combination of both α and ε) as ε̃_t = Y_t − X_t β̃.

(ii) Use the nonparametric kernel density estimator to estimate f(u):

f̂(u) = (1/(nb)) Σ_{t=1}^n K((u − ε̃_t)/b),  (19)

where K(·) is a non-negative kernel function and b > 0 is a bandwidth.

(iii) Estimate f(Q(τ)) by f̂(Q̃(τ)), where Q̃(τ) is the sample τ-quantile of ε̃_1, ..., ε̃_n.

In (19), we use the Gaussian kernel for K. Following Silverman (1986, p. 48), we choose

b = 0.9 min{ sd(ε̃_1, ..., ε̃_n), IQR(ε̃_1, ..., ε̃_n)/1.34 } n^{−1/5},

where sd and IQR denote the sample standard deviation and the sample interquartile range. In estimating the intercept term, we symmetrize the weights; see Section 3.4.

4 Efficient Nonparametric Regression

In this section, we study efficient estimation of the nonparametric regression model

Y = m(X) + σ(X)ε,  (20)

where X and Y represent the covariate and the response, respectively, ε is the error independent of X, m(·) is an unknown nonparametric function of interest, and σ(·) is the scale function. Our goal is to obtain an efficient estimate of m(x) at a given x. For simplicity, we consider the case where X is one-dimensional and use local linear smoothing. Our analysis also applies to the case with multivariate X and to other nonparametric smoothing methods. For convenience, write Z_n ≈ N(c_n, s_n²) if s_n^{-1}{Z_n − c_n[1 + o_p(1)]} → N(0, 1) in distribution. Let K(·) be a kernel function and h a bandwidth parameter, and denote

µ_K = ∫_R u² K(u) du, φ_K = ∫_R K²(u) du.

Assumption 6. (i) The density function p_X(·) > 0 of X is differentiable, m(·) is three times differentiable, and σ(·) > 0 is differentiable in the neighborhood of x; the density function f of ε is symmetric, positive, and twice continuously differentiable on its support. (ii) The kernel function K(·) integrates to one, is symmetric, and has bounded support. (iii) The bandwidth h satisfies h → 0 and nh → ∞ as n → ∞.

Assumption 6 is similar to the conditions of Kai, Li and Zou (2010). As in Section 3.4, the symmetry assumption in Assumption 6(i) is imposed to ensure identifiability. Based on the observations (X_1, Y_1), ..., (X_n, Y_n), the widely used local linear least squares regression is

(m̂_LS(x), b̂) = argmin_{a,b} Σ_{t=1}^n {Y_t − a − b(X_t − x)}² K((X_t − x)/h).  (21)

If we assume that var(ε) < ∞, then under some regularity conditions [see, e.g., Fan and Gijbels (1996)] the local linear least squares estimator is asymptotically normal:

m̂_LS(x) − m(x) ≈ N( (1/2) m″(x) µ_K h², φ_K σ²(x) var(ε) / (nh p_X(x)) ).

When the ε_i are Gaussian, local linear least squares estimation corresponds to a weighted likelihood criterion. In the absence of Gaussianity, the asymptotic results for the above estimator generally still hold, but this estimator is less efficient in terms of mean-squared error than estimators that exploit the distributional information. In the presence of heavy-tailed data, local median (or, more generally, local τ-th quantile) regression has been proposed as a robust method for estimating m(x); see, e.g., Yu and Jones (1998):

(m̂_LAD(x), b̂) = argmin_{a,b} Σ_{t=1}^n |Y_t − a − b(X_t − x)| K((X_t − x)/h).  (22)

Under Assumption 6, we have

m̂_LAD(x) − m(x) ≈ N( (1/2) m″(x) µ_K h², φ_K σ²(x) / (nh p_X(x)) · 1/(4f²(0)) ).

If the error density f were known, we could replace {Y_t − a − b(X_t − x)}² in (21) by the negative log-likelihood −log f(Y_t − a − b(X_t − x)) and obtain a likelihood-based estimator, denoted

by m̂_MLE(x); see, e.g., Fan, Farmen and Gijbels (1998). Under some regularity conditions,

m̂_MLE(x) − m(x) ≈ N( (1/2) m″(x) µ_K h², φ_K σ²(x) / (nh p_X(x)) · I(f)^{-1} ),  (23)

where I(f) is the Fisher information of the error distribution. The local likelihood estimator has lower variance than the least squares or least-absolute-deviation based local linear estimators, as follows from the classical Cramér-Rao inequality. In practice, f is unknown and the local likelihood estimator is infeasible.

In this section, we propose a quantile average nonparametric estimator for m(x) and show that, as the number of quantiles k → ∞, the asymptotic variance of the proposed estimator converges to the asymptotic variance of the local likelihood procedure. This is true regardless of the dimensionality of the regressors. In this sense, our estimator is efficient.¹ In many applications, the error distribution is likely to be non-normal, and perhaps so far from normal that the local likelihood estimator is much more efficient than the least squares estimator. As a modeling issue, it is hard to believe that we have better information about the error density than about the shape of the main regression effect, and so it is quite natural to treat the error density as an unknown parameter. Thus our results are comforting in that they say that we can still use information from the error distribution to improve the performance of nonparametric estimates.

4.1 The Weighted Quantile Averaging Nonparametric Estimator

Let Q(τ) be the τ-th quantile of ε, and let Q_{Y|X}(τ|x) be the conditional τ-th quantile of Y given X = x; then the nonparametric regression model (20) has the quantile regression representation Q_{Y|X}(τ|x) = m(x) + σ(x)Q(τ). Denote Q_{Y|X}(τ|x) by m(τ, x). Consider the local linear nonparametric quantile regression:

(m̂(τ, x), b̂) = argmin_{a,b} Σ_{t=1}^n ρ_τ{Y_t − a − b(X_t − x)} K((X_t − x)/h).  (24)

The solution for a in the above optimization, m̂(τ, x), provides an estimate of Q_{Y|X}(τ|x).
¹ This is not to say that our method has any special properties with regard to minimax risk; as is well known, it is not possible to achieve the lower bound here; see Fan (

We construct the following weighted quantile averaging nonparametric (WQAN) estimator for m(x) based on a weighted average of m̂(τ_i, x) over τ_i, i = 1, ..., k:

m̂_WQAN(x) = Σ_{i=1}^k ϖ_i m̂(τ_i, x), subject to Σ_{i=1}^k ϖ_i = 1 and ϖ_i = ϖ_{k+1−i}, i = 1, ..., k.  (25)

Theorem 7. Let S(ϖ) be defined as in Theorem 4(i). Under Assumption 6, as n → ∞,

m̂_WQAN(x) − m(x) ≈ N( (1/2) m″(x) µ_K h², φ_K σ²(x) / (nh p_X(x)) · S(ϖ) ).  (26)

If we simply use equal weights, the resulting unweighted quantile average nonparametric estimator has the asymptotic normality in Theorem 7 with S(ϖ) replaced by

R_k = (1/k²) Σ_{i=1}^k Σ_{j=1}^k [min(τ_i, τ_j) − τ_i τ_j] / [f(Q(τ_i)) f(Q(τ_j))].  (27)

This is the same as the limiting distribution of the nonparametric CQR estimator of Kai, Li and Zou (2010). Assuming that var(ε) < ∞, by a result of Kai, Li and Zou (2010),

lim_{k→∞} R_k = ∫_0^1 ∫_0^1 [min(r, s) − rs] / [f(Q(r)) f(Q(s))] dr ds = var(ε),

so the unweighted quantile averaging nonparametric regression estimator is asymptotically equivalent to the local least squares estimator! This result indicates that if we use a simple average over τ, then even as we use more and more quantiles, there is no efficiency gain from combining quantile information; simple averaging does not work when it is most needed. See the simulation studies in Section 7.2. Proper weighting over different quantiles is thus very important in the nonparametric case.

Next we show that a large efficiency gain can be obtained by optimal weighting in the WQAN. Since the asymptotic bias of m̂_WQAN(x) does not depend on the weights by Theorem 7, it is natural to find the optimal weights in m̂_WQAN(x) that minimize the asymptotic variance subject to the conditions on ϖ in (25). Notice that the limiting variance of the WQAN estimator depends on ϖ through S(ϖ), which is the same as in Theorem 4. In addition, under Assumption 6(i), the optimal weights ϖ* in Theorem 4 satisfy the condition ϖ*_i = ϖ*_{k+1−i}. Therefore, the optimal WQAN estimator of m(x) is given by

m̂*_WQAN(x) = (m̂(τ_1, x), ..., m̂(τ_k, x)) ϖ* = (m̂(τ_1, x), ..., m̂(τ_k, x)) H^{-1} e_k / (e_k^T H^{-1} e_k).  (28)
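The no-gain phenomenon for equal weights is easy to check numerically. The sketch below (our illustration, not from the paper) evaluates R_k in (27) for the logistic error, where f(Q(τ)) = τ(1−τ) and var(ε) = π²/3, so R_k should approach π²/3 ≈ 3.29 rather than the Fisher bound I(f)^{-1} = 3:

```python
import numpy as np

def r_k(k):
    """R_k in (27) (equal weights) for a logistic error, f(Q(tau)) = tau(1-tau)."""
    tau = np.arange(1, k + 1) / (k + 1)
    q = tau * (1 - tau)
    G = np.minimum.outer(tau, tau) - np.outer(tau, tau)
    return float((G / np.outer(q, q)).sum() / k**2)

# R_k tends to var(eps) = pi^2/3 for the logistic: no gain over local least squares.
print(r_k(400), np.pi**2 / 3)
```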

By Theorem 7 and Theorem 4, we have the following result.

Theorem 8. Recall $\Lambda_k$ in (2). Under Assumption 6, as $n \to \infty$,
\[
\widehat m^*_{\mathrm{WQAN}}(x) - m(x) \Rightarrow N\Big( \tfrac12 m''(x)\mu_K h^2,\ \frac{\phi_K \sigma^2(x)}{nh\, p_X(x)}\, \Lambda_k^{-1} \Big).
\]
By Theorem 1, under the additional Assumption 2, the optimal WQAN estimator $\widehat m^*_{\mathrm{WQAN}}(x)$ achieves asymptotic efficiency in the sense that its asymptotic bias and variance are the same as those of the likelihood-based estimator in ( ).

4.2 Comparison of Asymptotic Efficiency

In this section we compare the proposed optimal WQAN estimator $\widehat m^*_{\mathrm{WQAN}}(x)$ in (28) with other nonparametric regression estimators: the local linear least squares estimator $\widehat m_{\mathrm{LS}}(x)$ in (21), the local linear least-absolute-deviation estimator $\widehat m_{\mathrm{LAD}}(x)$ in (22), and the local linear CQR estimator $\widehat m_{\mathrm{CQR}}(x)$ of Kai, Li and Zou (2010). All four estimators have the same asymptotic bias $m''(x)\mu_K h^2/2$ but different variances. In Theorem 9 we show that $\widehat m^*_{\mathrm{WQAN}}(x)$ is more efficient than the local CQR estimator $\widehat m_{\mathrm{CQR}}(x)$.

Theorem 9. Under Assumptions 2 and 6, the optimal WQAN estimator $\widehat m^*_{\mathrm{WQAN}}(x)$ is asymptotically more efficient than the nonparametric CQR estimator $\widehat m_{\mathrm{CQR}}(x)$ for any non-normal distribution; for the normal distribution, they are asymptotically equivalent.

It is easy to see that the proposed optimal WQAN estimator is also more efficient than the local OLS and LAD estimators. Thus the proposed optimal WQAN estimator is asymptotically more efficient than all three of these estimators.

Next we calculate asymptotic relative efficiencies of these nonparametric methods. All four estimators have asymptotic normality of the following form, with different $s^2$:
\[
\widehat m(x) - m(x) \Rightarrow N\Big( \tfrac12 m''(x)\mu_K h^2,\ \frac{\phi_K \sigma^2(x)}{nh\, p_X(x)}\, s^2 \Big).
\]
We define the asymptotic mean-squared error
\[
\mathrm{AMSE}\{\widehat m(x)\} = \Big\{ \tfrac12 m''(x)\mu_K h^2 \Big\}^2 + \frac{\phi_K \sigma^2(x)}{nh\, p_X(x)}\, s^2.
\]
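The AMSE just defined has the form $Ah^4 + B/(nh)$ with $A = \{m''(x)\mu_K/2\}^2$ and $B = \phi_K\sigma^2(x)s^2/p_X(x)$, so its minimizer can be checked numerically against the closed-form bandwidth. A minimal sketch (Python; the constants `curv` and `vconst` are arbitrary illustrative values, not quantities from the paper):

```python
def amse(h, curv, vconst, n):
    """AMSE(h) = (curv * h^2)^2 + vconst/(n*h), where curv stands for
    m''(x) mu_K / 2 and vconst for phi_K sigma^2(x) s^2 / p_X(x)."""
    return (curv * h * h) ** 2 + vconst / (n * h)

def argmin_grid(f, lo, hi, steps=100000):
    """Crude grid search, enough to confirm the closed-form minimizer."""
    best_h, best_v = lo, f(lo)
    for i in range(1, steps + 1):
        h = lo + (hi - lo) * i / steps
        v = f(h)
        if v < best_v:
            best_h, best_v = h, v
    return best_h

# illustrative constants
curv, vconst, n = 1.3, 0.8, 400
h_grid = argmin_grid(lambda h: amse(h, curv, vconst, n), 0.01, 1.0)
# setting d/dh [A h^4 + B/(n h)] = 0 gives h* = (B / (4 A n))^(1/5), A = curv^2, B = vconst
h_star = (vconst / (4 * curv ** 2 * n)) ** 0.2
```

The grid minimizer and the closed form agree to grid resolution, matching the bandwidth formula derived next in the text.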

Minimizing the AMSE, we obtain the optimal bandwidth
\[
h^* = \operatorname*{argmin}_h \mathrm{AMSE}\{\widehat m(x)\} = \{m''(x)\mu_K\}^{-2/5} \Big\{ \frac{\phi_K \sigma^2(x)}{p_X(x)}\, s^2 \Big\}^{1/5} n^{-1/5}, \tag{29}
\]
and the associated optimal AMSE evaluated at the optimal bandwidth $h^*$,
\[
\mathrm{AMSE}^*\{\widehat m(x)\} = \frac{5}{4}\, \{m''(x)\mu_K\}^{2/5} \Big\{ \frac{\phi_K \sigma^2(x)}{p_X(x)}\, s^2 \Big\}^{4/5} n^{-4/5}, \tag{30}
\]
which is proportional to $(s^2)^{4/5}$. Therefore, results similar to those in Section 3.3 also apply here.

Remark. Our analysis also extends to the multivariate case in which X is d-dimensional. In that case, the variance component is of order $(nh^d)^{-1}$.

4.3 Estimating f(Q(τ))

As in Section 3.5, we need to estimate the unknown quantity $f(Q(\tau))$. Notice that the optimal weights $\varpi^*$ are invariant under the transformation $a\varepsilon$ of the error for any $a > 0$, and hence we can assume without loss of generality that $|\varepsilon|$ has median one. Then the conditional median of $|Y - m(X)|$ given X is $\sigma(X)$, and therefore we can apply local median quantile regression to estimate $\sigma(\cdot)$. We propose the following procedure.

(i) Use (25) with uniform weights to obtain a preliminary estimator $\widetilde m(\cdot)$.

(ii) Compute $|Y_i - \widetilde m(X_i)|$ and estimate $\sigma(\cdot)$ by local linear median quantile regression:
\[
(\widehat\sigma(x), \widehat b) = \operatorname*{argmin}_{\sigma, b} \sum_{i=1}^n \rho_{0.5}\big\{ |Y_i - \widetilde m(X_i)| - \sigma - b(X_i - x) \big\} K\Big( \frac{X_i - x}{l} \Big). \tag{31}
\]

(iii) Compute the errors $\widehat\varepsilon_i = [Y_i - \widetilde m(X_i)]/\widehat\sigma(X_i)$ and use (19) to obtain the estimate $\widehat f(u)$.

(iv) Denote the sample τ-quantile of $\widehat\varepsilon_1, \ldots, \widehat\varepsilon_n$ by $\widehat Q(\tau)$, obtain the estimate of $f(Q(\tau))$ as $\widehat f(\widehat Q(\tau))$, use (16) to compute $\widehat\varpi_1, \ldots, \widehat\varpi_k$, and finally obtain the symmetrized weights $\widetilde\varpi_j = (\widehat\varpi_j + \widehat\varpi_{k+1-j})/2$, $j = 1, \ldots, k$.

In (31), following Yu and Jones (1998), we use $l = l_{\mathrm{LS}} (\pi/2)^{1/5}$, where $l_{\mathrm{LS}}$ is the plug-in bandwidth (Ruppert, Sheather and Wand, 1995) for local linear least squares regression based on the data $(X_i, |Y_i - \widetilde m(X_i)|^2)$, $i = 1, \ldots, n$.
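Steps (iii) and (iv) above can be sketched as follows, in a deliberately simplified homoscedastic setting ($\sigma(\cdot) \equiv 1$, so the local median regression of steps (i) and (ii) is skipped). A Gaussian kernel density estimate with a normal-reference bandwidth stands in for the estimator in (19), and `f_at_quantiles` is our own helper name.

```python
import math, random

def kde(data, u, h):
    """Gaussian kernel density estimate  fhat(u) = (1/(n h)) sum_i phi((u - e_i)/h)."""
    n = len(data)
    s = sum(math.exp(-0.5 * ((u - e) / h) ** 2) for e in data)
    return s / (n * h * math.sqrt(2.0 * math.pi))

def f_at_quantiles(resid, k):
    """Sketch of steps (iii)-(iv): estimate f(Q(tau_i)) by fhat(Qhat(tau_i)) for
    tau_i = i/(k+1), with Qhat the sample quantile and a rule-of-thumb bandwidth."""
    n = len(resid)
    srt = sorted(resid)
    mean = sum(resid) / n
    sd = math.sqrt(sum((e - mean) ** 2 for e in resid) / (n - 1))
    h = 1.06 * sd * n ** (-0.2)          # normal-reference bandwidth, an assumption here
    qhat = [srt[min(n - 1, int(i * n / (k + 1)))] for i in range(1, k + 1)]
    return [kde(resid, q, h) for q in qhat]

random.seed(1)
eps = [random.gauss(0.0, 1.0) for _ in range(2000)]
fq = f_at_quantiles(eps, 9)
# for a standard normal, the true value at the median is f(Q(0.5)) = 1/sqrt(2*pi), about 0.399
```

The symmetrization $\widetilde\varpi_j = (\widehat\varpi_j + \widehat\varpi_{k+1-j})/2$ of step (iv) is then a one-line postprocessing of the weights computed from these density values.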

5 Super-efficient Estimation via Quantile Regression

Under Assumptions 1, 2 and 3, asymptotically efficient estimators of β in model (6) exist: they are root-n consistent and asymptotically normal, with limiting covariance matrix attaining the Fisher information lower bound. Similarly, under Assumptions 6 and 2, the local likelihood estimator of m(x) is $\sqrt{nh}$-consistent and asymptotically normal, with limiting covariance matrix equal to the Fisher information bound. These cases are usually described by the terminology of regular statistical estimation. The corresponding regularity conditions are not mathematical trivialities (for ease of proof) but genuine restrictions needed to obtain the efficiency results. When these usual regularity conditions do not hold, the previously discussed efficiency results may fail, and results different from those of likelihood-based estimation may arise: for example, a different rate of convergence may be obtained, and it is well known that, in general, asymptotically efficient estimators do not exist. Such cases are sometimes called non-regular statistical estimation.

In this section, we briefly discuss the case of non-regular estimation. We show that, in this case, by using quantile regression with optimal weighting, super-efficient estimators may be obtained, in the sense that the limiting variance is smaller than the conventional Cramér-Rao bound, or even zero.

Assumption 2 guarantees that the density f vanishes at the boundary at a sufficiently fast rate, so that the properties of regular estimation hold. If this assumption does not hold, conventional estimators such as the OLS or the MLE may still be root-n consistent and asymptotically normal. However, we show in Theorem 10 below that the optimally weighted quantile averaging estimators have smaller variance than the conventional lower bound, and one may even estimate β (or m(x)) at a rate better than root-n (or $\sqrt{nh}$).

Theorem 10.
Let g(τ = f(q(τ, and Λ k be defined by (2, and assume g 2 (τ + g 2 (1 τ lim τ 0 τ = c. (32 (i If 0 < c < and lim τ 0 τ 2 1 τ τ g (t 2 dt = 0, then lim k Λ k = [c + I(f] 1. (ii If c =, then lim k Λ k = 0. Assumption 2 covers the regular case corresponding to c = 0 in (32. Theorem 10 indicates that, for the non-regular case c > 0 in (32, under appropriate regularity condi- 19

20 tions, for large k, the variances of the (standardized optimally weighted quantile regression based estimators are smaller than the Cramér-Rao bound. In particular, if c =, as the number of quantiles k increases, the variance of the weighted quantile regression based estimators approaches zero. In this sense, the optimal quantile regression based estimators are super-efficient. Corollary 2 below concerns a special case that the density f is positive at the boundary. Corollary 2. Denote the support of f by D, then lim k Λ k = 0 in any of the following three cases: (i D = [D 1, D 2 ] with f(d 1 + f(d 2 > 0; (ii D = [D 1, with f(d 1 > 0; or (iii D = (, D 2 ] with f(d 2 > 0. Example 4. For the truncated version of many commonly used distributions, we have lim k Λ k = 0. For example, for truncated normal on [ 1, 1], Corollary 2 (i applies. For uniform distribution on [0, 1], we can show Λ k = 1/(2k Example 5. Let ε have Beta distribution with parameters α 1, α 2 > 0. For α 1 = α 2 = 1 (uniform distribution, ϖ1 = ϖk = 0.5 and ϖ 2 = = ϖk 1 = 0, which is not surprising: the sufficient statistic for independent uniform samples on [θ 1, θ 2 ] is the two extreme observations. Also, ARE(CQR = (k +1(k +2/(6k, so the CQR with uniform weights is substantially inferior to the optimal WQAE for large k. 6 Extensions Our proposed methodology provides a general framework for constructing efficient estimator under a broad variety of settings. Here we briefly discuss some general ideas and mention some possible directions. Different types of models may involve different assumptions and have different results - depending on the model and the subject of interest. We wish to explore these in more detail in future research. In general, one may consider efficient estimation of a statistical model that depends on {Y t, X t, α, β}. 
Suppose that the model has a quantile domain representation, say
\[
Q_{Y_t}(\tau \mid X_t) = g(X_t, \alpha(\tau), \beta),
\]
where $\alpha(\tau)$ are local unknown parameters that are related to α and vary over quantiles, and β are global parameters that remain the same over all quantiles. We are

interested in more efficient estimation of the parameters. Under appropriate regularity conditions, the quantile regression estimators $\widehat\theta(\tau) = (\widehat\alpha(\tau), \widehat\beta(\tau))$ are consistent estimators of $\theta(\tau) = (\alpha(\tau), \beta)$ and have a Bahadur representation, say
\[
\widehat\theta(\tau) = \theta(\tau) + \frac{C}{m f(Q(\tau))} \sum_{t=1}^n Z_t \big[ \tau - 1_{\{\varepsilon_t < Q(\tau)\}} \big] + \text{negligible error},
\]
where $m = m(n)$, C is a $p \times p$ positive definite matrix that does not depend on τ, $Z_1, \ldots, Z_n \in \mathbb{R}^p$ are IID p-dimensional column random vectors, and $Q(\tau)$ is the τ-th quantile of $\varepsilon_t$. An efficient estimator for β based on combined quantile regressions can be constructed in a way similar to those discussed in this paper. Under additional assumptions on $\alpha(\tau)$, say that $\alpha(\tau) = \alpha + cQ(\tau)$ is a quantile shift parameter and the error distribution is symmetric, an estimator for α based on combined quantile regressions can also be constructed. We briefly discuss a few possible directions below.

(i) Efficient estimation of derivatives. For model (20), our proposed methodology can also be used to construct an efficient estimator of $m'(x)$ by considering linear combinations of $\widehat b$ in (24) over different quantiles.

(ii) Efficient estimation of varying-coefficient models. Consider
\[
Y_t = m_1(U_t) X_{t,1} + \cdots + m_p(U_t) X_{t,p} + \sigma(U_t)\, \varepsilon_t, \tag{33}
\]
where $m_1(\cdot), \ldots, m_p(\cdot)$ are the nonparametric functional coefficients of interest. Denote by $Q_{Y|U}(\tau \mid u)$ the conditional τ-th quantile of Y given U = u, and by $Q(\tau)$ the τ-th quantile of ε. Then (33) has the quantile representation
\[
Q_{Y|U}(\tau \mid u) = m_1(u) X_{t,1} + \cdots + m_p(u) X_{t,p} + \sigma(u) Q(\tau),
\]
which can be estimated by a local linear quantile regression similar to (24). Note that $m_1(u), \ldots, m_p(u)$ are global parameters and $\sigma(u)Q(\tau)$ are local parameters, so our methodology can be applied here.

(iii) Efficient estimation of the conditional variance function. Consider $Y = \sigma(X)\varepsilon$. Then $Y^2 = \sigma^2(X)\varepsilon^2$, leading to the quantile representation $Q_{Y^2|X}(\tau \mid x) = \sigma^2(x)\, Q_{\varepsilon^2}(\tau)$. To ensure identifiability, assume $\sigma(0) = 1$.
Then Q Y 2 X(τ x/q Y 2 X(τ 0 = σ 2 (x for all τ (0, 1, and efficient estimator of σ 2 ( can be constructed via combining quantiles. 21

(iv) Efficient estimation with dependent data. Our asymptotic theories are developed for independent data for technical convenience, but they can be extended to dependent data, such as time series or even longitudinal data, under appropriate mixing conditions.

7 Monte Carlo Studies

7.1 Linear regression model

In (6), let α = 0, β = 1, and let the covariate X ~ N(0, 1). For ε, we consider the normal distribution, the Student-t distribution with one ($t_1$) and two ($t_2$) degrees of freedom, the two normal mixture distributions in Example 2, the Laplace distribution, the Beta distributions Beta(1, b), b = 0.5, 1, 2, 3, and the Gamma distributions Gamma(γ), γ = 0.5, 1, 2, 3. Throughout this section we consider uniformly spaced quantiles $\tau_i = i/(k+1)$, $i = 1, \ldots, k = 9$.

We compare six estimation methods. OLS: the ordinary least squares estimator; LAD: the median quantile estimator with τ = 0.5 in (12); QAU, QAO, QAE: the quantile average estimator in (14) with uniform weights, the theoretical optimal weights, and the estimated optimal weights (cf. Section 3.5), respectively; CQR: Zou and Yuan (2008)'s CQR estimator. Over the simulation realizations, we use QAE as a benchmark against which the five other methods are compared via the empirical relative efficiency
\[
\mathrm{RE}(\text{Method}) = \frac{\mathrm{MSE}(\text{Method})}{\mathrm{MSE}(\mathrm{QAE})}, \quad \text{with } \mathrm{MSE} = \text{the average of } [\widehat\beta(j) - \beta]^2 \text{ over realizations},
\]
where Method stands for any of the other five estimators and $\widehat\beta(j)$ is the estimate from the j-th realization. A value of RE larger than one indicates better performance of QAE. The results are summarized in Table 2 for two sample sizes, n = 100 and 300.

For the N(0, 1), Student-$t_2$ and Laplace distributions, QAE and CQR are comparable; for all other distributions, QAE significantly outperforms CQR. Also, QAE outperforms OLS for all non-normal distributions, whereas the two are comparable for N(0, 1). For the larger sample size, the superior performance of QAE is even more remarkable, which agrees with our theoretical findings.
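The relative-efficiency criterion above is easy to reproduce. The sketch below (Python) computes RE from stored per-realization estimates; as a stand-in for the regression estimators, it uses the intercept-only model Y = θ + ε with Laplace errors, where the sample mean plays the role of OLS and the sample median that of LAD (for Laplace errors the asymptotic RE of mean versus median is 2). The helper names are ours, not the paper's.

```python
import random, statistics

def mse(estimates, truth):
    """MSE = average of (theta_hat_j - theta)^2 over the simulation realizations."""
    return sum((b - truth) ** 2 for b in estimates) / len(estimates)

def relative_efficiency(est_method, est_benchmark, truth):
    """RE(Method) = MSE(Method) / MSE(benchmark); RE > 1 favours the benchmark."""
    return mse(est_method, truth) / mse(est_benchmark, truth)

# Toy instance: Y = theta + eps with theta = 0 and Laplace errors
# (generated as the difference of two unit exponentials).
random.seed(0)
means, medians = [], []
for _ in range(2000):
    eps = [random.expovariate(1.0) - random.expovariate(1.0) for _ in range(100)]
    means.append(sum(eps) / len(eps))        # "OLS" analogue
    medians.append(statistics.median(eps))   # "LAD" analogue
re = relative_efficiency(means, medians, 0.0)
# re should be near 2, the asymptotic mean-vs-median RE for Laplace errors
```

In the paper's study the stored estimates would instead come from the six regression methods listed above.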
For almost all cases considered, QAE substantially outperforms the LAD estimator, and the relative efficiency can be as high as 2845%. It is worth pointing out that, for the Beta and

Gamma distributions, the relative efficiencies are much higher than for the other distributions considered, owing to the super-efficiency phenomenon in Theorem 10. We conclude that the proposed optimal QAE offers a more efficient alternative to existing methods.

Insert Table 2 about here.

7.2 Nonparametric regression model

In our data analysis, we use the standard Gaussian kernel for K(·). We now address the bandwidth selection issue. By (29), the optimal bandwidth h is proportional to $(s^2)^{1/5}$. Denote by $h_{\mathrm{LS}}$ and $h_{\mathrm{WQAN}}$ the bandwidths for the least squares estimator and the proposed optimal WQAN estimator. Then $h_{\mathrm{WQAN}} = h_{\mathrm{LS}} \{\Lambda_k^{-1}/\mathrm{var}(\varepsilon)\}^{1/5}$, where $\Lambda_k$ is defined in (2). In practice, to select $h_{\mathrm{LS}}$, we can use the plug-in bandwidth selector of Ruppert, Sheather and Wand (1995), implemented via the command dpill of the R package KernSmooth. We then select $h_{\mathrm{WQAN}}$ by plugging in estimates of $\Lambda_k$ and var(ε) using the methods in Section 4.3; for the purpose of comparison, however, we use their true values in our simulation studies. Similarly, we can choose the optimal bandwidths for the other two estimators. Kai, Li and Zou (2010) adopted the same strategy. In step (i) of Section 4.3, for the preliminary estimator $\widetilde m(\cdot)$, we use the plug-in bandwidth selector $h_{\mathrm{LS}}$.

We compare the empirical performance of the four methods listed at the beginning of Section 4.2. With 1000 realizations, we use the least squares estimator $\widehat m_{\mathrm{LS}}$ as the benchmark against which the other three methods are compared via the relative efficiency
\[
\mathrm{RE}(\widehat m) = \frac{\mathrm{MISE}(\widehat m_{\mathrm{LS}})}{\mathrm{MISE}(\widehat m)}, \quad \text{with } \mathrm{MISE}(\widehat m) = \frac{1}{1000} \sum_{r=1}^{1000} \int_{l_1}^{l_2} \{\widehat m_r(x) - m(x)\}^2\, dx,
\]
where $\widehat m_r$ is the estimate from the r-th realization and $[l_1, l_2]$ is the interval over which m is estimated. A relative efficiency greater than one indicates that $\widehat m$ outperforms the LS estimator. To facilitate computation, the integral is approximated using 20 grid points.

Consider n = 200 samples from the model
\[
\text{Model I:} \quad Y = \sin(2X) + 2\exp(-16X^2) + 0.5\varepsilon, \quad X \sim \mathrm{Unif}[-1.6, 1.6].
\]
(34) The same model was also used in Kai, Li and Zou (2010), with the normal design X ~ N(0, 1). Here we use a uniform design to avoid some computational issues. For ε, we consider

nine symmetric distributions: N(0, 1), the normal truncated to [-1, 1], the Cauchy truncated to [-10, 10], the Cauchy truncated to [-1, 1], the Student-t with 3 degrees of freedom ($t_3$), the standard Laplace distribution, the uniform distribution on [-0.5, 0.5], and the two normal mixtures 0.5N(2, 1) + 0.5N(-2, 1) and 0.95N(0, 1) + 0.05N(0, 9). The first normal mixture can be used to model a two-cluster population, whereas the second can be viewed as a noise contamination model. Let $[l_1, l_2] = [-1.5, 1.5]$.

The relative efficiencies of the four methods, with $\widehat m_{\mathrm{LS}}(x)$ as the benchmark, are summarized in Table 3(a). Overall, the proposed optimal WQAN either significantly outperforms or is comparable to the other three methods. For example, for N(0, 1) and 0.95N(0, 1) + 0.05N(0, 9), WQAN is comparable to LS; for the other distributions, WQAN has about a 20% efficiency gain over LS in most cases, and a gain of more than 60% for 0.5N(2, 1) + 0.5N(-2, 1). When compared with CQR, WQAN outperforms CQR for all but four distributions (N(0, 1), Student-$t_3$, Laplace, and 0.95N(0, 1) + 0.05N(0, 9)), for which the two are comparable. While WQAN underperforms LAD for the Cauchy truncated to [-10, 10], it has substantial efficiency gains for N(0, 1), the normal truncated to [-1, 1], the Cauchy truncated to [-1, 1], the uniform on [-0.5, 0.5], and 0.5N(2, 1) + 0.5N(-2, 1).

The empirical performance of the proposed method in Table 3(a) is not as impressive as its theoretical performance in (30). For example, for the normal truncated to [-1, 1], the theoretical AREs according to (30) are 1, 0.48, 0.86, 0.93, 0.95, 1.13, 1.69, 2.17, compared with 1, 0.54, 0.93, 0.98, 0.99, 1.14, 1.25, 1.26 in Table 3(a). To explain this phenomenon, note that a plot (not included here) of the function m(x) = sin(2x) + 2exp(-16x²) exhibits large curvature and sharp changes on [-0.5, 0.5], so a large estimation bias can easily offset the asymptotic efficiency improvements, especially for a moderate sample size.
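The Model I design and the 20-grid-point MISE approximation described above can be sketched as follows (Python; the noise is taken to be standard normal for illustration, the 0.5 noise scale is the one that also appears in Model II, and `model_I`/`mise_on_grid` are our own helper names):

```python
import math, random

def m_true(x):
    """True regression function of Model I: m(x) = sin(2x) + 2 exp(-16 x^2)."""
    return math.sin(2 * x) + 2 * math.exp(-16 * x * x)

def model_I(n, rng):
    """Draw n samples with X ~ Unif[-1.6, 1.6] and Y = m(X) + 0.5 eps (eps here N(0,1))."""
    xs = [rng.uniform(-1.6, 1.6) for _ in range(n)]
    ys = [m_true(x) + 0.5 * rng.gauss(0.0, 1.0) for x in xs]
    return xs, ys

def mise_on_grid(fits, m, l1=-1.5, l2=1.5, ngrid=20):
    """MISE(mhat): average over realizations of the integral of (mhat_r(x) - m(x))^2,
    the integral approximated by a midpoint rule on ngrid equally spaced points."""
    grid = [l1 + (l2 - l1) * (j + 0.5) / ngrid for j in range(ngrid)]
    width = (l2 - l1) / ngrid
    return sum(sum((mhat(x) - m(x)) ** 2 for x in grid) * width
               for mhat in fits) / len(fits)

rng = random.Random(0)
xs, ys = model_I(200, rng)
# sanity check: a "fit" off by a constant 0.1 has integral (0.1)^2 * 3 = 0.03 over [-1.5, 1.5]
err = mise_on_grid([lambda x: m_true(x) + 0.1], m_true)
```

In a full replication, `fits` would hold the 1000 fitted curves of each competing estimator rather than a constant-shift toy fit.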
To appreciate this, use the same X and ε as in (34) and consider the model
\[
\text{Model II:} \quad Y = 1.8X + 0.5\varepsilon. \tag{35}
\]
Then the bias term $m''(x)\mu_K h^2/2 = 0$ vanishes and the variance plays the dominating role. For all four estimation methods, we use the same bandwidth: the plug-in bandwidth selector for local linear regression. We summarize the relative efficiencies in Table 3(b). The overall pattern of the empirical relative efficiencies is consistent with that of the theoretical ones in (30), and the proposed WQAN significantly outperforms the other methods for almost all distributions considered. Also, using more quantiles (k = 29) significantly

improves the performance of WQAN for the normal truncated to [-1, 1], the Cauchy truncated to [-1, 1], and the uniform on [-0.5, 0.5], a property not shared by the CQR method. In summary, for most non-normal distributions considered, the proposed method can deliver substantial efficiency improvements over the other methods, and its empirical performance is consistent with our asymptotic theory.

Insert Table 3 about here.

8 Proofs

The following Lemma 1 is needed to prove Theorem 1.

Lemma 1. Let h(·) be any differentiable function on the interval [a, b]. Then
\[
\Big[ \int_a^b h(x)\, dx \Big]^2 \ge (b-a) \int_a^b h^2(x)\, dx - \frac{(b-a)^3}{2} \int_a^b [h'(x)]^2\, dx. \tag{36}
\]

Proof. It is easy to verify the identity
\[
V := \Big[ \int_a^b h(x)\, dx \Big]^2 - (b-a) \int_a^b h^2(x)\, dx = -\frac{1}{2} \int_a^b \int_a^b [h(x) - h(y)]^2\, dx\, dy. \tag{37}
\]
For all $x, y \in [a, b]$, we have $|h(x) - h(y)| = |\int_y^x h'(u)\, du| \le \int_a^b |h'(u)|\, du$, uniformly. Thus, by the Cauchy-Schwarz inequality,
\[
\max_{x, y \in [a, b]} |h(x) - h(y)|^2 \le \Big[ \int_a^b |h'(u)|\, du \Big]^2 \le (b-a) \int_a^b [h'(u)]^2\, du.
\]
Applying this inequality to (37), we obtain
\[
V \ge -\frac{(b-a)^2}{2} \max_{x, y \in [a, b]} |h(x) - h(y)|^2 \ge -\frac{(b-a)^3}{2} \int_a^b [h'(u)]^2\, du,
\]
completing the proof.

Proof of Theorem 1. Recall $\Gamma = [\min(\tau_i, \tau_j) - \tau_i \tau_j]_{1 \le i, j \le k}$, $\tau_i = i/(k+1)$. We can verify that $\Gamma^{-1}$ is the tridiagonal matrix
\[
\Gamma^{-1} = (k+1) \begin{pmatrix} 2 & -1 & & \\ -1 & 2 & \ddots & \\ & \ddots & \ddots & -1 \\ & & -1 & 2 \end{pmatrix}, \tag{38}
\]
with 2(k + 1) on the diagonal, -(k + 1) on the super-/sub-diagonals, and 0 elsewhere. Define the k × k diagonal matrix
\[
P = \mathrm{diag}\{q_1, \ldots, q_k\} \quad \text{with } q_i = f(Q(\tau_i)). \tag{39}
\]
We have $H = P^{-1} \Gamma P^{-1}$ and $H^{-1} = P \Gamma^{-1} P$. Using (38), we can obtain
\[
\Lambda_k = e_k^T H^{-1} e_k = e_k^T P \Gamma^{-1} P e_k = (k+1) \Big[ q_1^2 + q_k^2 + \sum_{i=2}^k (q_i - q_{i-1})^2 \Big]. \tag{40}
\]
Recall $g(\tau) = f(Q(\tau))$ in Assumption 2. Define $u = Q(\tau)$. Then $\partial\tau/\partial u = \partial F(u)/\partial u = f(u)$, and by the chain rule, $g'(\tau) = (\partial g/\partial u)(\partial u/\partial \tau) = f'(u)/f(u)$. Thus,
\[
\int_0^1 [g'(\tau)]^2\, d\tau = \int_{\mathbb{R}} \Big[ \frac{f'(u)}{f(u)} \Big]^2 f(u)\, du = I(f).
\]
Applying the chain rule again,
\[
g''(\tau) = \frac{\partial \{f'(u)/f(u)\}}{\partial u} \frac{\partial u}{\partial \tau} = \frac{f''(u) f(u) - f'(u)^2}{f^3(u)}. \tag{41}
\]
Let $\Delta = 1/(k+1)$. In (40), by Assumption 2, $(k+1)(q_1^2 + q_k^2) \to 0$. Define
\[
W_k = (k+1) \sum_{i=2}^k (q_i - q_{i-1})^2 - \int_\Delta^{1-\Delta} [g'(t)]^2\, dt.
\]
To prove $\Lambda_k \to I(f)$, it suffices to show $W_k \to 0$. Since $\tau_1 = \Delta$, $\tau_k = 1 - \Delta$, and $\tau_i - \tau_{i-1} = \Delta$, we can rewrite $W_k$ as
\[
W_k = (k+1) \sum_{i=2}^k \Big\{ \Big[ \int_{\tau_{i-1}}^{\tau_i} g'(t)\, dt \Big]^2 - (\tau_i - \tau_{i-1}) \int_{\tau_{i-1}}^{\tau_i} [g'(t)]^2\, dt \Big\}.
\]
Applying Lemma 1 and Assumption 2, we get
\[
|W_k| \le \frac{\Delta^2}{2} \sum_{i=2}^k \int_{\tau_{i-1}}^{\tau_i} [g''(t)]^2\, dt = \frac{\Delta^2}{2} \int_{\tau_1}^{\tau_k} [g''(t)]^2\, dt \to 0,
\]
completing the proof.

Proof of Proposition 1. First, we prove that (3) implies Assumption 2. Let $\epsilon > 0$ be any given number. By the condition $\lim_{\tau \to 0} [\,|g'(\tau)| + |g'(1-\tau)|\,] \tau^{3/2} = 0$, there exists


Minimum distance tests and estimates based on ranks Minimum distance tests and estimates based on ranks Authors: Radim Navrátil Department of Mathematics and Statistics, Masaryk University Brno, Czech Republic (navratil@math.muni.cz) Abstract: It is well

More information

Approximate Median Regression via the Box-Cox Transformation

Approximate Median Regression via the Box-Cox Transformation Approximate Median Regression via the Box-Cox Transformation Garrett M. Fitzmaurice,StuartR.Lipsitz, and Michael Parzen Median regression is used increasingly in many different areas of applications. The

More information

Spatially Smoothed Kernel Density Estimation via Generalized Empirical Likelihood

Spatially Smoothed Kernel Density Estimation via Generalized Empirical Likelihood Spatially Smoothed Kernel Density Estimation via Generalized Empirical Likelihood Kuangyu Wen & Ximing Wu Texas A&M University Info-Metrics Institute Conference: Recent Innovations in Info-Metrics October

More information

Statistical signal processing

Statistical signal processing Statistical signal processing Short overview of the fundamentals Outline Random variables Random processes Stationarity Ergodicity Spectral analysis Random variable and processes Intuition: A random variable

More information

DESIGN-ADAPTIVE MINIMAX LOCAL LINEAR REGRESSION FOR LONGITUDINAL/CLUSTERED DATA

DESIGN-ADAPTIVE MINIMAX LOCAL LINEAR REGRESSION FOR LONGITUDINAL/CLUSTERED DATA Statistica Sinica 18(2008), 515-534 DESIGN-ADAPTIVE MINIMAX LOCAL LINEAR REGRESSION FOR LONGITUDINAL/CLUSTERED DATA Kani Chen 1, Jianqing Fan 2 and Zhezhen Jin 3 1 Hong Kong University of Science and Technology,

More information

Master s Written Examination

Master s Written Examination Master s Written Examination Option: Statistics and Probability Spring 016 Full points may be obtained for correct answers to eight questions. Each numbered question which may have several parts is worth

More information

Bayesian spatial quantile regression

Bayesian spatial quantile regression Brian J. Reich and Montserrat Fuentes North Carolina State University and David B. Dunson Duke University E-mail:reich@stat.ncsu.edu Tropospheric ozone Tropospheric ozone has been linked with several adverse

More information

Boosting Methods: Why They Can Be Useful for High-Dimensional Data

Boosting Methods: Why They Can Be Useful for High-Dimensional Data New URL: http://www.r-project.org/conferences/dsc-2003/ Proceedings of the 3rd International Workshop on Distributed Statistical Computing (DSC 2003) March 20 22, Vienna, Austria ISSN 1609-395X Kurt Hornik,

More information

Eco517 Fall 2004 C. Sims MIDTERM EXAM

Eco517 Fall 2004 C. Sims MIDTERM EXAM Eco517 Fall 2004 C. Sims MIDTERM EXAM Answer all four questions. Each is worth 23 points. Do not devote disproportionate time to any one question unless you have answered all the others. (1) We are considering

More information

O Combining cross-validation and plug-in methods - for kernel density bandwidth selection O

O Combining cross-validation and plug-in methods - for kernel density bandwidth selection O O Combining cross-validation and plug-in methods - for kernel density selection O Carlos Tenreiro CMUC and DMUC, University of Coimbra PhD Program UC UP February 18, 2011 1 Overview The nonparametric problem

More information

Additive Isotonic Regression

Additive Isotonic Regression Additive Isotonic Regression Enno Mammen and Kyusang Yu 11. July 2006 INTRODUCTION: We have i.i.d. random vectors (Y 1, X 1 ),..., (Y n, X n ) with X i = (X1 i,..., X d i ) and we consider the additive

More information

Fixed Effects, Invariance, and Spatial Variation in Intergenerational Mobility

Fixed Effects, Invariance, and Spatial Variation in Intergenerational Mobility American Economic Review: Papers & Proceedings 2016, 106(5): 400 404 http://dx.doi.org/10.1257/aer.p20161082 Fixed Effects, Invariance, and Spatial Variation in Intergenerational Mobility By Gary Chamberlain*

More information

Quantile Regression for Panel/Longitudinal Data

Quantile Regression for Panel/Longitudinal Data Quantile Regression for Panel/Longitudinal Data Roger Koenker University of Illinois, Urbana-Champaign University of Minho 12-14 June 2017 y it 0 5 10 15 20 25 i = 1 i = 2 i = 3 0 2 4 6 8 Roger Koenker

More information

,..., θ(2),..., θ(n)

,..., θ(2),..., θ(n) Likelihoods for Multivariate Binary Data Log-Linear Model We have 2 n 1 distinct probabilities, but we wish to consider formulations that allow more parsimonious descriptions as a function of covariates.

More information

Default Priors and Effcient Posterior Computation in Bayesian

Default Priors and Effcient Posterior Computation in Bayesian Default Priors and Effcient Posterior Computation in Bayesian Factor Analysis January 16, 2010 Presented by Eric Wang, Duke University Background and Motivation A Brief Review of Parameter Expansion Literature

More information

Kernel Density Based Linear Regression Estimate

Kernel Density Based Linear Regression Estimate Kernel Density Based Linear Regression Estimate Weixin Yao and Zibiao Zao Abstract For linear regression models wit non-normally distributed errors, te least squares estimate (LSE will lose some efficiency

More information

Robust and Efficient Regression

Robust and Efficient Regression Clemson University TigerPrints All Dissertations Dissertations 5-203 Robust and Efficient Regression Qi Zheng Clemson University, qiz@clemson.edu Follow this and additional works at: https://tigerprints.clemson.edu/all_dissertations

More information

INFERENCE APPROACHES FOR INSTRUMENTAL VARIABLE QUANTILE REGRESSION. 1. Introduction

INFERENCE APPROACHES FOR INSTRUMENTAL VARIABLE QUANTILE REGRESSION. 1. Introduction INFERENCE APPROACHES FOR INSTRUMENTAL VARIABLE QUANTILE REGRESSION VICTOR CHERNOZHUKOV CHRISTIAN HANSEN MICHAEL JANSSON Abstract. We consider asymptotic and finite-sample confidence bounds in instrumental

More information

Estimation de mesures de risques à partir des L p -quantiles

Estimation de mesures de risques à partir des L p -quantiles 1/ 42 Estimation de mesures de risques à partir des L p -quantiles extrêmes Stéphane GIRARD (Inria Grenoble Rhône-Alpes) collaboration avec Abdelaati DAOUIA (Toulouse School of Economics), & Gilles STUPFLER

More information

Model-free prediction intervals for regression and autoregression. Dimitris N. Politis University of California, San Diego

Model-free prediction intervals for regression and autoregression. Dimitris N. Politis University of California, San Diego Model-free prediction intervals for regression and autoregression Dimitris N. Politis University of California, San Diego To explain or to predict? Models are indispensable for exploring/utilizing relationships

More information

Minimum Hellinger Distance Estimation in a. Semiparametric Mixture Model

Minimum Hellinger Distance Estimation in a. Semiparametric Mixture Model Minimum Hellinger Distance Estimation in a Semiparametric Mixture Model Sijia Xiang 1, Weixin Yao 1, and Jingjing Wu 2 1 Department of Statistics, Kansas State University, Manhattan, Kansas, USA 66506-0802.

More information

Stat 5101 Lecture Notes

Stat 5101 Lecture Notes Stat 5101 Lecture Notes Charles J. Geyer Copyright 1998, 1999, 2000, 2001 by Charles J. Geyer May 7, 2001 ii Stat 5101 (Geyer) Course Notes Contents 1 Random Variables and Change of Variables 1 1.1 Random

More information

Web Appendix for Hierarchical Adaptive Regression Kernels for Regression with Functional Predictors by D. B. Woodard, C. Crainiceanu, and D.

Web Appendix for Hierarchical Adaptive Regression Kernels for Regression with Functional Predictors by D. B. Woodard, C. Crainiceanu, and D. Web Appendix for Hierarchical Adaptive Regression Kernels for Regression with Functional Predictors by D. B. Woodard, C. Crainiceanu, and D. Ruppert A. EMPIRICAL ESTIMATE OF THE KERNEL MIXTURE Here we

More information

Econometrics I, Estimation

Econometrics I, Estimation Econometrics I, Estimation Department of Economics Stanford University September, 2008 Part I Parameter, Estimator, Estimate A parametric is a feature of the population. An estimator is a function of the

More information

Function of Longitudinal Data

Function of Longitudinal Data New Local Estimation Procedure for Nonparametric Regression Function of Longitudinal Data Weixin Yao and Runze Li Abstract This paper develops a new estimation of nonparametric regression functions for

More information

A Few Notes on Fisher Information (WIP)

A Few Notes on Fisher Information (WIP) A Few Notes on Fisher Information (WIP) David Meyer dmm@{-4-5.net,uoregon.edu} Last update: April 30, 208 Definitions There are so many interesting things about Fisher Information and its theoretical properties

More information

On Modifications to Linking Variance Estimators in the Fay-Herriot Model that Induce Robustness

On Modifications to Linking Variance Estimators in the Fay-Herriot Model that Induce Robustness Statistics and Applications {ISSN 2452-7395 (online)} Volume 16 No. 1, 2018 (New Series), pp 289-303 On Modifications to Linking Variance Estimators in the Fay-Herriot Model that Induce Robustness Snigdhansu

More information

A Bootstrap Test for Conditional Symmetry

A Bootstrap Test for Conditional Symmetry ANNALS OF ECONOMICS AND FINANCE 6, 51 61 005) A Bootstrap Test for Conditional Symmetry Liangjun Su Guanghua School of Management, Peking University E-mail: lsu@gsm.pku.edu.cn and Sainan Jin Guanghua School

More information

Bayesian estimation of bandwidths for a nonparametric regression model with a flexible error density

Bayesian estimation of bandwidths for a nonparametric regression model with a flexible error density ISSN 1440-771X Australia Department of Econometrics and Business Statistics http://www.buseco.monash.edu.au/depts/ebs/pubs/wpapers/ Bayesian estimation of bandwidths for a nonparametric regression model

More information

COMPUTER-AIDED MODELING AND SIMULATION OF ELECTRICAL CIRCUITS WITH α-stable NOISE

COMPUTER-AIDED MODELING AND SIMULATION OF ELECTRICAL CIRCUITS WITH α-stable NOISE APPLICATIONES MATHEMATICAE 23,1(1995), pp. 83 93 A. WERON(Wroc law) COMPUTER-AIDED MODELING AND SIMULATION OF ELECTRICAL CIRCUITS WITH α-stable NOISE Abstract. The aim of this paper is to demonstrate how

More information

Efficient and Robust Scale Estimation

Efficient and Robust Scale Estimation Efficient and Robust Scale Estimation Garth Tarr, Samuel Müller and Neville Weber School of Mathematics and Statistics THE UNIVERSITY OF SYDNEY Outline Introduction and motivation The robust scale estimator

More information

Tobit and Interval Censored Regression Model

Tobit and Interval Censored Regression Model Global Journal of Pure and Applied Mathematics. ISSN 0973-768 Volume 2, Number (206), pp. 98-994 Research India Publications http://www.ripublication.com Tobit and Interval Censored Regression Model Raidani

More information

Lecture 4: Types of errors. Bayesian regression models. Logistic regression

Lecture 4: Types of errors. Bayesian regression models. Logistic regression Lecture 4: Types of errors. Bayesian regression models. Logistic regression A Bayesian interpretation of regularization Bayesian vs maximum likelihood fitting more generally COMP-652 and ECSE-68, Lecture

More information

Bayesian estimation of the discrepancy with misspecified parametric models

Bayesian estimation of the discrepancy with misspecified parametric models Bayesian estimation of the discrepancy with misspecified parametric models Pierpaolo De Blasi University of Torino & Collegio Carlo Alberto Bayesian Nonparametrics workshop ICERM, 17-21 September 2012

More information

9. Model Selection. statistical models. overview of model selection. information criteria. goodness-of-fit measures

9. Model Selection. statistical models. overview of model selection. information criteria. goodness-of-fit measures FE661 - Statistical Methods for Financial Engineering 9. Model Selection Jitkomut Songsiri statistical models overview of model selection information criteria goodness-of-fit measures 9-1 Statistical models

More information

Optimal bandwidth selection for the fuzzy regression discontinuity estimator

Optimal bandwidth selection for the fuzzy regression discontinuity estimator Optimal bandwidth selection for the fuzzy regression discontinuity estimator Yoichi Arai Hidehiko Ichimura The Institute for Fiscal Studies Department of Economics, UCL cemmap working paper CWP49/5 Optimal

More information

IV Quantile Regression for Group-level Treatments, with an Application to the Distributional Effects of Trade

IV Quantile Regression for Group-level Treatments, with an Application to the Distributional Effects of Trade IV Quantile Regression for Group-level Treatments, with an Application to the Distributional Effects of Trade Denis Chetverikov Brad Larsen Christopher Palmer UCLA, Stanford and NBER, UC Berkeley September

More information

ECAS Summer Course. Fundamentals of Quantile Regression Roger Koenker University of Illinois at Urbana-Champaign

ECAS Summer Course. Fundamentals of Quantile Regression Roger Koenker University of Illinois at Urbana-Champaign ECAS Summer Course 1 Fundamentals of Quantile Regression Roger Koenker University of Illinois at Urbana-Champaign La Roche-en-Ardennes: September, 2005 An Outline 2 Beyond Average Treatment Effects Computation

More information

Computational statistics

Computational statistics Computational statistics Markov Chain Monte Carlo methods Thierry Denœux March 2017 Thierry Denœux Computational statistics March 2017 1 / 71 Contents of this chapter When a target density f can be evaluated

More information

Nonparametric Econometrics

Nonparametric Econometrics Applied Microeconometrics with Stata Nonparametric Econometrics Spring Term 2011 1 / 37 Contents Introduction The histogram estimator The kernel density estimator Nonparametric regression estimators Semi-

More information

Defect Detection using Nonparametric Regression

Defect Detection using Nonparametric Regression Defect Detection using Nonparametric Regression Siana Halim Industrial Engineering Department-Petra Christian University Siwalankerto 121-131 Surabaya- Indonesia halim@petra.ac.id Abstract: To compare

More information

Working Paper No Maximum score type estimators

Working Paper No Maximum score type estimators Warsaw School of Economics Institute of Econometrics Department of Applied Econometrics Department of Applied Econometrics Working Papers Warsaw School of Economics Al. iepodleglosci 64 02-554 Warszawa,

More information

Likelihood Ratio Test in High-Dimensional Logistic Regression Is Asymptotically a Rescaled Chi-Square

Likelihood Ratio Test in High-Dimensional Logistic Regression Is Asymptotically a Rescaled Chi-Square Likelihood Ratio Test in High-Dimensional Logistic Regression Is Asymptotically a Rescaled Chi-Square Yuxin Chen Electrical Engineering, Princeton University Coauthors Pragya Sur Stanford Statistics Emmanuel

More information

Lecture 8: Information Theory and Statistics

Lecture 8: Information Theory and Statistics Lecture 8: Information Theory and Statistics Part II: Hypothesis Testing and I-Hsiang Wang Department of Electrical Engineering National Taiwan University ihwang@ntu.edu.tw December 23, 2015 1 / 50 I-Hsiang

More information

Local Polynomial Wavelet Regression with Missing at Random

Local Polynomial Wavelet Regression with Missing at Random Applied Mathematical Sciences, Vol. 6, 2012, no. 57, 2805-2819 Local Polynomial Wavelet Regression with Missing at Random Alsaidi M. Altaher School of Mathematical Sciences Universiti Sains Malaysia 11800

More information

Sparse Nonparametric Density Estimation in High Dimensions Using the Rodeo

Sparse Nonparametric Density Estimation in High Dimensions Using the Rodeo Outline in High Dimensions Using the Rodeo Han Liu 1,2 John Lafferty 2,3 Larry Wasserman 1,2 1 Statistics Department, 2 Machine Learning Department, 3 Computer Science Department, Carnegie Mellon University

More information

Quantile regression and heteroskedasticity

Quantile regression and heteroskedasticity Quantile regression and heteroskedasticity José A. F. Machado J.M.C. Santos Silva June 18, 2013 Abstract This note introduces a wrapper for qreg which reports standard errors and t statistics that are

More information

Statistical Properties of Numerical Derivatives

Statistical Properties of Numerical Derivatives Statistical Properties of Numerical Derivatives Han Hong, Aprajit Mahajan, and Denis Nekipelov Stanford University and UC Berkeley November 2010 1 / 63 Motivation Introduction Many models have objective

More information

MA 575 Linear Models: Cedric E. Ginestet, Boston University Mixed Effects Estimation, Residuals Diagnostics Week 11, Lecture 1

MA 575 Linear Models: Cedric E. Ginestet, Boston University Mixed Effects Estimation, Residuals Diagnostics Week 11, Lecture 1 MA 575 Linear Models: Cedric E Ginestet, Boston University Mixed Effects Estimation, Residuals Diagnostics Week 11, Lecture 1 1 Within-group Correlation Let us recall the simple two-level hierarchical

More information

NEW EFFICIENT AND ROBUST ESTIMATION IN VARYING-COEFFICIENT MODELS WITH HETEROSCEDASTICITY

NEW EFFICIENT AND ROBUST ESTIMATION IN VARYING-COEFFICIENT MODELS WITH HETEROSCEDASTICITY Statistica Sinica 22 202, 075-0 doi:http://dx.doi.org/0.5705/ss.200.220 NEW EFFICIENT AND ROBUST ESTIMATION IN VARYING-COEFFICIENT MODELS WITH HETEROSCEDASTICITY Jie Guo, Maozai Tian and Kai Zhu 2 Renmin

More information

Local Polynomial Modelling and Its Applications

Local Polynomial Modelling and Its Applications Local Polynomial Modelling and Its Applications J. Fan Department of Statistics University of North Carolina Chapel Hill, USA and I. Gijbels Institute of Statistics Catholic University oflouvain Louvain-la-Neuve,

More information

Lecture 3 September 1

Lecture 3 September 1 STAT 383C: Statistical Modeling I Fall 2016 Lecture 3 September 1 Lecturer: Purnamrita Sarkar Scribe: Giorgio Paulon, Carlos Zanini Disclaimer: These scribe notes have been slightly proofread and may have

More information

Testing Equality of Nonparametric Quantile Regression Functions

Testing Equality of Nonparametric Quantile Regression Functions International Journal of Statistics and Probability; Vol. 3, No. 1; 2014 ISSN 1927-7032 E-ISSN 1927-7040 Published by Canadian Center of Science and Education Testing Equality of Nonparametric Quantile

More information

Nonparametric Methods

Nonparametric Methods Nonparametric Methods Michael R. Roberts Department of Finance The Wharton School University of Pennsylvania July 28, 2009 Michael R. Roberts Nonparametric Methods 1/42 Overview Great for data analysis

More information

A General Overview of Parametric Estimation and Inference Techniques.

A General Overview of Parametric Estimation and Inference Techniques. A General Overview of Parametric Estimation and Inference Techniques. Moulinath Banerjee University of Michigan September 11, 2012 The object of statistical inference is to glean information about an underlying

More information

Brief Review on Estimation Theory

Brief Review on Estimation Theory Brief Review on Estimation Theory K. Abed-Meraim ENST PARIS, Signal and Image Processing Dept. abed@tsi.enst.fr This presentation is essentially based on the course BASTA by E. Moulines Brief review on

More information

regression Lie Wang Abstract In this paper, the high-dimensional sparse linear regression model is considered,

regression Lie Wang Abstract In this paper, the high-dimensional sparse linear regression model is considered, L penalized LAD estimator for high dimensional linear regression Lie Wang Abstract In this paper, the high-dimensional sparse linear regression model is considered, where the overall number of variables

More information

Nonparametric estimation of extreme risks from heavy-tailed distributions

Nonparametric estimation of extreme risks from heavy-tailed distributions Nonparametric estimation of extreme risks from heavy-tailed distributions Laurent GARDES joint work with Jonathan EL METHNI & Stéphane GIRARD December 2013 1 Introduction to risk measures 2 Introduction

More information

Estimation and Hypothesis Testing in LAV Regression with Autocorrelated Errors: Is Correction for Autocorrelation Helpful?

Estimation and Hypothesis Testing in LAV Regression with Autocorrelated Errors: Is Correction for Autocorrelation Helpful? Journal of Modern Applied Statistical Methods Volume 10 Issue Article 13 11-1-011 Estimation and Hypothesis Testing in LAV Regression with Autocorrelated Errors: Is Correction for Autocorrelation Helpful?

More information

MA 575 Linear Models: Cedric E. Ginestet, Boston University Non-parametric Inference, Polynomial Regression Week 9, Lecture 2

MA 575 Linear Models: Cedric E. Ginestet, Boston University Non-parametric Inference, Polynomial Regression Week 9, Lecture 2 MA 575 Linear Models: Cedric E. Ginestet, Boston University Non-parametric Inference, Polynomial Regression Week 9, Lecture 2 1 Bootstrapped Bias and CIs Given a multiple regression model with mean and

More information