The Pennsylvania State University The Graduate School Eberly College of Science A NON-ITERATIVE METHOD FOR FITTING THE SINGLE


The Pennsylvania State University
The Graduate School
Eberly College of Science

A NON-ITERATIVE METHOD FOR FITTING THE SINGLE INDEX QUANTILE REGRESSION MODEL WITH UNCENSORED AND CENSORED DATA

A Dissertation in Statistics
by
Eliana Christou

© 2016 Eliana Christou

Submitted in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

May 2016

The dissertation of Eliana Christou was reviewed and approved by the following:

Michael G. Akritas, Professor of Statistics, Dissertation Advisor, Chair of Committee
Bing Li, Professor of Statistics
Zhibiao Zhao, Associate Professor of Statistics
Spiro E. Stefanou, Professor Emeritus of Agricultural Economics, The Pennsylvania State University; Professor and Chair of Food and Resource Economics, University of Florida
Aleksandra B. Slavković, Associate Head for Graduate Studies, Professor of Statistics

Signatures are on file in the Graduate School.

Abstract

Quantile regression (QR) is becoming increasingly popular due to its relevance in many scientific investigations. Linear and nonlinear QR models have been studied extensively, while recent research focuses on the single index quantile regression (SIQR) model. Compared to the single index mean regression (SIMR) problem, the fitting and the asymptotic theory of the SIQR model are more complicated due to the lack of closed form expressions for estimators of conditional quantiles. Consequently, existing methods are necessarily iterative. We propose a non-iterative estimation algorithm, and derive the asymptotic distribution of the proposed estimator under heteroscedasticity. For identifiability, we use a parametrization that sets the first coefficient to 1 instead of the typical condition which restricts the norm of the parametric component. This distinction is more than cosmetic, as it critically affects the correspondence between the estimator derived and the asymptotic theory.

The ubiquity of high dimensional data has led to a number of variable selection methods for linear/nonlinear QR models and, recently, for the SIQR model. We propose a new algorithm for simultaneous variable selection and parameter estimation that is also applicable to heteroscedastic data. The proposed algorithm, which is non-iterative, consists of two steps. Step 1 performs an initial variable selection. Step 2 uses the results of Step 1 to obtain better estimates of the conditional quantiles and, using them, performs simultaneous variable selection and estimation of the parametric component of the SIQR model. It is shown that the initial variable selection method of Step 1 consistently estimates the relevant variables, and that the estimated parametric component derived in Step 2 satisfies the oracle property.

Furthermore, QR is particularly relevant for the analysis of censored survival data as an alternative to proportional hazards and accelerated failure time models. Such data occur frequently in biostatistics, environmental sciences, social sciences and econometrics. There is a large body of work on linear/nonlinear QR models for censored data, but it is only recently that the SIQR model has received some attention. However, the only existing method for fitting the SIQR model uses an iterative algorithm, and no asymptotic theory for the resulting estimator of the Euclidean parameter is given. We propose a new non-iterative estimation algorithm, and derive the asymptotic distribution of the proposed estimator under heteroscedasticity.

Table of Contents

List of Figures
List of Tables
List of Symbols
Acknowledgments

Chapter 1 Introduction to Quantile Regression
  1.1 Linear Quantile Regression
  1.2 Nonparametric Quantile Regression
  1.3 Semiparametric Quantile Regression
  1.4 Outline of Thesis

Chapter 2 Single Index Quantile Regression for Heteroscedastic Data
  2.1 Introduction
  2.2 The Proposed Estimator
  2.3 Main Results
  2.4 Numerical Studies
    2.4.1 Computational Remarks
    2.4.2 Simulation Results
    2.4.3 Boston Housing Data
  2.5 Conclusions

Chapter 3 Variable Selection in Heteroscedastic Single Index Quantile Regression
  3.1 Introduction
  3.2 The Proposed Estimator
  3.3 Main Results
  3.4 Numerical Studies
    3.4.1 Computational Remarks
    3.4.2 Simulation Results
    3.4.3 Boston Housing Data
    3.4.4 An Application to Genomic Data
  3.5 Conclusions

Chapter 4 Single Index Quantile Regression for Censored Data
  4.1 Introduction
  4.2 The Proposed Estimator
  4.3 Main Results
  4.4 Numerical Studies
    4.4.1 Computational Remarks
    4.4.2 Simulation Results
    4.4.3 A Real Example
  4.5 Conclusions

Appendix Assumptions and Proofs of Main Results
  A.1 Assumptions
  A.2 Some General Lemmas
  A.3 Proofs for Chapter 2
    A.3.1 Some Lemmas
    A.3.2 Proof of Proposition 2.3.1
    A.3.3 Proof of Proposition 2.3.2
    A.3.4 Proof of Theorem 2.3.3
  A.4 Proofs for Chapter 3
    A.4.1 Some Lemmas
    A.4.2 Proof of Theorem
    A.4.3 Proof of Proposition
    A.4.4 Proof of Proposition
    A.4.5 Proof of Theorem
  A.5 Proofs for Chapter 4
    A.5.1 Some Lemmas
    A.5.2 Proof of Proposition
    A.5.3 Proof of Proposition
    A.5.4 Proof of Theorem

Bibliography

List of Figures

1.1 Quantile Regression ρ check function
2.1 Boxplot of estimated parametric component for Model (2.18) for the three estimators; the true β is 2
2.2 Estimated SIQR for Boston housing data for Model (2.22). The dots are the observations and the curve is the estimated quantile function
2.3 Estimated SIQR for Boston housing data for Model (2.23). The dots are the observations and the curve is the estimated quantile function
3.1 Estimated SIQR for Boston housing data for Model (3.11). The dots are the observations and the curve is the conditional quantile function

List of Tables

2.1 Mean values and standard errors (in parentheses), $R(\hat\beta)$ and average $R_\tau(\hat Q^{LL}_{\tau,\hat\beta_1})$, defined in (2.19), for Model (2.18)
2.2 Mean values and standard errors (in parentheses), $R(\hat\beta)$ and average $R_\tau(\hat Q^{LL}_{\tau,\hat\beta_1})$, defined in (2.19), for Model (2.20). Also, 95% coverage probability for NWQR, WYY-2 and the second estimated coefficient of Wu et al. (2010), denoted by WYY
2.3 95% coverage probability for NWQR, WYY-2, and the second estimated coefficient of Wu et al. (2010), denoted by WYY, for Model (2.21)
2.4 Proposed parametric vector estimates and standard errors (in parentheses) for Boston housing data for Model (2.22) with five different quantile levels
2.5 Proposed parametric vector estimates and standard errors (in parentheses) for Boston housing data for Model (2.23) with five different quantile levels
2.6 Mean check based absolute residuals, $R_\tau(\hat Q^{LL}_{\tau,\hat\beta_1})$, defined in (2.19), for Models (2.22) and (2.23); WYY denotes the method proposed by Wu et al. (2010) and NWQR denotes the proposed methodology
3.1 Mean values and standard deviations (in parentheses) for the size and the number of correct and incorrect zeros of the estimated parametric component $\hat\beta^{SCAD}$. Also, mean values for $AR(\hat\beta^{SCAD})$ and $R_\tau(\hat Q^{LL}_{\tau,\hat\beta_1^{SCAD}})$ for Model (3.7)
3.2 Mean values and standard deviations (in parentheses) for the MSE$(\hat\beta_1^\top X)$ for the SCAD-NWQR, LASSO-AY, and ALASSO-AY estimated parametric components for Model (3.7)
3.3 Mean values and standard deviations (in parentheses) for the size and the number of correct and incorrect zeros of the estimated parametric component $\hat\beta^{SCAD}$. Also, mean values for $AR(\hat\beta^{SCAD})$ and $R_\tau(\hat Q^{LL}_{\tau,\hat\beta_1^{SCAD}})$ for Model (3.8)
3.4 Mean values and standard deviations (in parentheses) for the MSE$(\hat\beta_1^\top X)$ for the SCAD-NWQR, LASSO-AY, and ALASSO-AY estimated parametric components for Model (3.8)
3.5 Mean values and standard deviations (in parentheses) for the size and the number of correct and incorrect zeros of the estimated parametric component $\hat\beta^{SCAD}$. Also, mean values for $AR(\hat\beta^{SCAD})$ and $R_\tau(\hat Q^{LL}_{\tau,\hat\beta_1^{SCAD}})$ for Model (3.9)
3.6 Mean values and standard deviations (in parentheses) for the MSE$(\hat\beta_1^\top X)$ for the SCAD-SIR, LASSO-AY, and ALASSO-AY estimated parametric components for Model (3.9)
3.7 Mean values and standard deviations (in parentheses) for the size and the number of correct and incorrect zeros of the estimated parametric component $\hat\beta^{SCAD}$ for Model (3.10)
3.8 Mean values and standard deviations (in parentheses) for the MSE$(\hat\beta_1^\top X)$, average $AR(\hat\beta)$ and average $R_\tau(\hat Q^{LL}_{\tau,\hat\beta_1})$ for NWQR and SCAD-NWQR estimated parametric components for Model (3.10)
3.9 Proposed parametric vector estimates for Boston housing data for Model (3.11) with five different quantile levels
3.10 Number of zero coefficients for SCAD-NWQR, LASSO-AY, and ALASSO-AY for Boston housing data for Model (3.11) with five different quantile levels
3.11 Mean check based absolute residuals, $R_\tau(\hat Q^{LL}_{\tau,\hat\beta_1})$, defined in (2.19), for SCAD-NWQR, LASSO-AY, and ALASSO-AY for Boston housing data for Model (3.11) with five different quantile levels
3.12 Average estimated coefficients for States 1, 3 and 6. These states make up the cold or warm bulk of the genome; they have low or intermediate divergence rates, and are located on the autosomes away from the telomeres
3.13 Average estimated coefficients for States 2, 4 and 5. State 2 has very low divergence rates and is located exclusively on chromosome X. States 4 and 5 have very high divergence rates, and the former is located in the telomeric regions of autosomes (the latter is interspersed throughout the autosomes)
4.1 Mean values and standard errors (in parentheses), average $R_\tau(\hat Q^{LL}_{\tau,\hat\beta_1})$, defined in (2.19), and coverage probabilities for the CD-NWQR for Model (4.10)
4.2 Trimmed mean values and standard errors (in parentheses), and average $R_\tau(\hat Q^{LL}_{\tau,\hat\beta_1})$, defined in (2.19), for the CD-BGK for Model (4.10)
4.3 Mean and trimmed mean values along with their corresponding standard errors (in parentheses) for the estimated parametric component, as well as average $R_\tau(\hat Q^{LL}_{\tau,\hat\beta_1})$, defined in (2.19), for the CD-NWQR and CD-BGK estimators for Model (4.11)
4.4 Parametric vector estimates and standard errors (in parentheses) for Model (4.12) for five different quantile levels
4.5 Mean check absolute residuals, $R_\tau(\hat Q^{LL}_{\tau,\hat\beta_1})$, defined in (2.19), for the CD-NWQR and CD-BGK estimators for Model (4.12)

List of Symbols

$\tau$: The Greek letter, p. 1, denotes the quantile of interest, where $0 < \tau < 1$.
$\rho_\tau$: The Greek letter, p. 1, denotes the check function, defined for $0 < \tau < 1$ as $\rho_\tau(u) = (\tau - I(u < 0))u$.
$\beta_1$: The Greek letter, p. 6, denotes a $d$-dimensional vector of unknown parameters, where $\beta_1 = (1, \beta^\top)^\top$ and $\beta \in \mathbb{R}^{d-1}$.
$\epsilon$: The Greek letter, p. 8, denotes the error term.
$\hat\beta$: The Greek letter, p. 9, denotes an estimator of $\beta$.

Acknowledgments

I would like to express my special appreciation and thanks to my advisor, Professor Michael G. Akritas; you have been a tremendous mentor for me. I would like to thank you for the useful comments, remarks and engagement through the learning process of this thesis, as well as your continuous encouragement and support throughout these years.

I would also like to thank the Associate Head for Graduate Studies, Professor Aleksandra Slavković, for supporting me throughout all my academic years and helping me whenever I had the need. I would like to thank all the faculty in the department for encouraging all the students, and especially Professor Naomi Altman and Professor David Hunter for being next to me throughout all my steps. I would also like to thank an undergraduate professor of mine, Dr. Tasos Christofides, for encouraging me to continue my studies at the graduate level and providing me guidance throughout all my decisions.

A special thanks to my family. Words cannot express how grateful I am to my mother and my father for all of the sacrifices that you've made on my behalf. I have worked hard to complete my graduate studies, but I wouldn't have been able to achieve my goals if it weren't for your support, love, and encouragement. Your prayer for me was what sustained me thus far. Thank you from the bottom of my heart. Many thanks to my brother, my sister, and my grandparents for always being there for me. I lost my grandmother two months before my graduation, but I owe her great thanks for all the times she gave me her support through her calls. Thank you, grandma!

I would also like to thank a special friend of mine, Sotiria Marathovounioti, who was always my support, and Vasiliki Vasileiou, who supported me in writing and pushed me to strive towards my goal.

Dedication

To my mother, Andri Christou, my father, Constantinos Christou, and my grandmother, Avgi Stefanou. Thank you for everything!

Chapter 1
Introduction to Quantile Regression

Ordinary least squares regression plays a prominent role in a wide variety of fields and is a very popular method for modeling the relationship between a $d$-dimensional vector of covariates $X$ and the conditional mean of the response variable $Y$ given $X = x$. However, mean regression, linear or not, provides only a single summary measure of the conditional distribution of the response given the covariates. Moreover, the sensitivity of the least squares estimator to even a modest number of outliers makes it a very poor estimator for many non-Gaussian, and especially long-tailed, distributions.

Quantile regression (QR), first introduced by Koenker and Bassett (1978), is a method for completing the regression picture by focusing on specific conditional quantiles of the distribution. QR models the relationship between a $d$-dimensional vector of covariates $X$ and the $\tau$th, for $0 < \tau < 1$, conditional quantile of the response variable $Y$ given $X = x$. When the error term is heteroscedastic, a direct approach for estimating conditional quantiles has a number of advantages. Many real data sets motivate the use of QR, especially cases where extremes are important. For example, QR can be used in environmental studies, where the upper quantiles of pollution levels are critical from a public health perspective.

Koenker and Bassett (1978) introduced the loss function $\rho_\tau(\cdot)$, also known as the check function, defined for $0 < \tau < 1$ as

$$\rho_\tau(u) = (\tau - I(u < 0))\,u,$$

which simply gives different weights to positive and negative values; see Figure 1.1. It can be easily shown that minimizing the function $E\rho_\tau(X - q)$ with respect to $q$ gives the $\tau$th quantile of the random variable $X$. This idea motivated Koenker and Bassett (1978) to introduce linear QR.

Figure 1.1. Quantile Regression ρ check function

1.1 Linear Quantile Regression

Let

$$Q_\tau(Y \mid x) \equiv Q_\tau(Y \mid X = x) = \inf\{y : P(Y \le y \mid X = x) \ge \tau\} \tag{1.1}$$

denote the $\tau$th conditional quantile of $Y$ given $X = x$ and consider the linear QR model

$$Q_\tau(Y \mid x) = \beta^\top x, \tag{1.2}$$

where $\beta$ is a $d$-dimensional vector of unknown parameters. Koenker and Bassett (1978) used the representation

$$Q_\tau(Y \mid x) = \arg\min_{q} E\big(\rho_\tau(Y - q) \mid X = x\big), \tag{1.3}$$

where $\rho_\tau(\cdot)$ is the check function, to define the estimator $\hat\beta$ as

$$\hat\beta = \arg\min_{b \in \mathbb{R}^d} \sum_{i=1}^{n} \rho_\tau(Y_i - b^\top X_i). \tag{1.4}$$

Thus, $\hat\beta^\top x$ gives the estimator of the $\tau$th conditional quantile under the linear QR model. Observe that for $\tau = 1/2$ the objective function is the $L_1$ norm, that is,

$$\arg\min_{b \in \mathbb{R}^d} \frac{1}{2} \sum_{i=1}^{n} |Y_i - b^\top X_i|,$$

which gives the estimated conditional median. It turns out that QR inherits the well known robustness properties of median regression; see Pollard (1991).

Koenker and Bassett (1978) studied the asymptotic behavior of the estimated conditional regression quantiles, while Koenker (1994) studied confidence intervals for the regression quantiles based on the asymptotic theory. He suggested three different methods for constructing the confidence interval: sparsity estimation, which involves direct estimation of the sparsity function; inversion of rank tests, which computes confidence intervals by inverting the rank score test; and resampling. Later, Koenker and Hallock (2001) presented some practical implementations of linear QR. Specifically, they considered two data sets, on quantile Engel curves and on QR of infant birthweights, and demonstrated that there were characteristics that were not captured by the least squares estimators.

1.2 Nonparametric Quantile Regression

Because the linearity assumption of model (1.2) is quite strict, several authors considered the completely flexible nonparametric estimation of the conditional quantiles. For the model $Q_\tau(Y \mid x) = h(x)$, where $h : \mathbb{R}^d \to \mathbb{R}$ is a fully nonparametric function, Truong (1989) showed that, under conditions, local median estimators achieve the global optimal rates of Stone (1982) with respect to $L_m$ norms, $0 < m \le \infty$. Chaudhuri (1991) constructed local polynomial estimators for conditional quantile functions and their derivatives, and also showed that they achieve the optimal nonparametric rates of convergence of Stone (1982) under mild conditions. A local Bahadur type representation was also established by Chaudhuri (1991) for the uniform kernel function, and this result was later extended to general kernels by Hong (2003). Fan et al. (1994) considered a general convex loss function, which includes the mean, median, quantiles, and other robust functionals, and constructed local linear estimators. See also Yu and Jones (1998), who proposed inverting a local linear conditional distribution estimator. Takeuchi et al. (2006) presented a nonparametric version of a quantile estimator, which can be obtained by solving a simple quadratic programming problem, and provided uniform convergence results. Kong et al. (2010) extended Chaudhuri's (1991) and Hong's (2003) pointwise Bahadur representation results by deriving a strong uniform (with respect to $x$) Bahadur representation, also for dependent observations. Guerre and Sabbah (2012) investigated the bias and the weak Bahadur representation of a local polynomial estimator of the conditional quantile function and its derivatives, uniformly with respect to the quantile level, the covariates and the smoothing parameter. They also showed that the local polynomial quantile estimator achieves the global optimal rates of Stone (1982) for the $L_m$ and uniform norms.

1.3 Semiparametric Quantile Regression

The rate of convergence of completely nonparametric estimators of conditional quantiles, however, decreases with increasing dimensionality of the covariate vector. This motivated the study of a number of semiparametric models, and of variable selection methods, for QR. Koenker (2011) considered the additive model for QR, which includes both parametric and nonparametric components. Lin et al. (2013) considered variable selection for nonparametric QR via smoothing spline ANOVA (SS-ANOVA). See Su and Zhang (2012) for a literature review. The single index quantile regression (SIQR) model has received particular attention and is the subject of this thesis; see Chapter 2 for the definition of a SIQR model.

1.4 Outline of Thesis

The main complication faced by methods that are based on minimization of a semiparametric objective function lies in the lack of a closed form expression for the conditional quantile; see Chapter 2, Sections 2.1 and 2.2 for details. Thus, existing methods are necessarily iterative; this is further clarified in the next chapter. Here we propose a new check-function based objective function, which can be minimized non-iteratively for parameter estimation. In Chapter 2 we present the proposed algorithm for estimating the parametric component of a SIQR model for heteroscedastic data. In Chapter 3 we extend the non-iterative algorithm to simultaneous variable selection and parameter estimation, and in Chapter 4 we present the SIQR model for censored data.
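The claim in the introduction of this chapter, that minimizing the expected check loss over $q$ yields the $\tau$th quantile, can be checked numerically at the sample level. The following is a minimal sketch in plain Python; the function names, the exponential sample and the choice $\tau = 0.75$ are illustrative assumptions and are not taken from the thesis.

```python
import random

def check_loss(u, tau):
    """Check function rho_tau(u) = (tau - I(u < 0)) * u."""
    return (tau - (1.0 if u < 0 else 0.0)) * u

def empirical_risk(q, sample, tau):
    """Sample analogue of E[rho_tau(X - q)]."""
    return sum(check_loss(x - q, tau) for x in sample) / len(sample)

def argmin_over_sample(sample, tau):
    """Minimize the empirical risk over candidate values q taken from the sample.

    The empirical risk is piecewise linear and convex in q, so a minimizer
    is always attained at one of the observations.
    """
    return min(sorted(sample), key=lambda q: empirical_risk(q, sample, tau))

# Illustrative data (an assumption): 501 draws from an exponential with mean 2.
random.seed(0)
sample = [random.expovariate(0.5) for _ in range(501)]
tau = 0.75
q_hat = argmin_over_sample(sample, tau)
```

With $n = 501$ and $\tau = 0.75$ the product $n\tau$ is not an integer, so the minimizer is unique and coincides with the order statistic of rank $\lceil n\tau \rceil$. The linear QR estimator (1.4) replaces the scalar $q$ by $b^\top X_i$ and is usually computed by linear programming rather than by enumeration.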

Chapter 2
Single Index Quantile Regression for Heteroscedastic Data

2.1 Introduction

The single index quantile regression (SIQR) model specifies that

$$Q_\tau(Y \mid x) = Q_{\tau,\beta_1}(Y \mid \beta_1^\top x), \tag{2.1}$$

where $\beta_1$ is a $d$-dimensional vector of unknown parameters, $Q_\tau(Y \mid x)$ is the $\tau$th conditional quantile of the response $Y$ given $X = x$ defined in (1.1), and, for any $d$-dimensional vector $b_1$,

$$Q_{\tau,b_1}(Y \mid b_1^\top x) = \inf\{y : P(Y \le y \mid b_1^\top X = b_1^\top x) \ge \tau\}. \tag{2.2}$$

A SIQR model is very useful since it maintains some nonparametric flexibility while, at the same time, reducing the dimensionality. For identifiability one imposes certain conditions on $\beta_1$, the most common of which is to assume that $\|\beta_1\| = 1$, with its first coordinate positive. In this work, we propose the parametrization which assumes that

$$\beta_1 = (1, \beta^\top)^\top, \quad \beta \in \mathbb{R}^{d-1}; \tag{2.3}$$

this parametrization is also used in the R package np for the single index mean regression (SIMR) model. This distinction is more than cosmetic, as it critically affects the correspondence between the estimator derived and the asymptotic theory. The advantages of the proposed parametrization are demonstrated in the simulations; see Section 2.4.

Existing literature considers the SIQR model under homoscedasticity (Wu et al. 2010), restricted heteroscedasticity (Chaudhuri et al. 1997) and general heteroscedasticity (Kong and Xia 2012). Chaudhuri et al. (1997) considered the average derivative quantile regression estimator which, under a SIQR model where the variance function depends only on $\beta_1^\top x$, estimates the direction of $\beta_1$. Wu et al. (2010) and Kong and Xia (2012) estimate $\beta_1$ by minimizing an objective function. Compared with SIMR (see Li and Racine 2007, Chapter 8), the main complication faced by this approach lies in the lack of a closed form expression for the estimator of conditional quantiles. Thus, these methods are necessarily iterative. Wu et al. (2010) proposed an algorithm which, starting from an initial value $b_1^0$ for the parametric component, iteratively estimates the nonparametric component and its derivative using local linear QR, and the parametric component using (essentially) linear QR. Kong and Xia (2012) criticized the convergence properties of the algorithm in Wu et al. (2010) and proposed an improved iterative algorithm by introducing a penalty term that assures its almost sure convergence. (The outliers in the boxplot of the Wu et al. (2010) estimator shown in Section 2.4 are probably a consequence of the iteration issues of their algorithm.) In addition, Kong and Xia (2012) allowed general heteroscedasticity, but the covariance matrix of the limiting normal distribution they obtained depends on the true value of the parametric component in an explicit manner.

In this work we propose a non-iterative method, based on minimization of a check-function based objective function, for estimating the parametric component of the SIQR model. The proposed estimator is shown to have an asymptotically normal distribution, with a simple expression for the covariance matrix, under general heteroscedasticity. In Section 2.2 we present the proposed estimator, while in Section 2.3 we present the main results, which include the $\sqrt{n}$-consistency and the asymptotic normality of the estimated parametric component. In Section 2.4 we present results from several simulation examples and a real data application to the Boston housing data. Some concluding remarks are given in Section 2.5.
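The two identifiability conventions discussed in this section describe the same direction, because the link function in (2.1) is unspecified and so the index vector matters only up to a nonzero scale factor. The sketch below converts between the unit-norm convention and parametrization (2.3); the function names are hypothetical, and the conversion to (2.3) assumes the first coordinate is nonzero.

```python
import math

def to_unit_norm(b1):
    """Rescale an index vector to unit Euclidean norm with positive first coordinate."""
    norm = math.sqrt(sum(c * c for c in b1))
    sign = 1.0 if b1[0] > 0 else -1.0
    return [sign * c / norm for c in b1]

def to_first_coordinate_one(b1):
    """Rescale an index vector so that its first coordinate equals 1, as in (2.3)."""
    if b1[0] == 0:
        raise ValueError("first coordinate must be nonzero for parametrization (2.3)")
    return [c / b1[0] for c in b1]

beta1 = [1.0, 2.0]                     # (1, beta) with beta = 2
unit = to_unit_norm(beta1)             # (1, 2) / sqrt(5)
back = to_first_coordinate_one(unit)   # recovers (1, 2) up to rounding
```

Dividing a unit-norm estimate by its first component, as in the second function, is a purely algebraic change of scale; it does not by itself transfer the asymptotic theory from one parametrization to the other.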

2.2 The Proposed Estimator

Let $\{Y_i, X_i\}_{i=1}^{n}$ be independent and identically distributed (iid) observations that satisfy

$$Y_i = Q_{\tau,\beta_1}(Y \mid \beta_1^\top X_i) + \epsilon_i, \tag{2.4}$$

where $Q_{\tau,\beta_1}(Y \mid \beta_1^\top X_i)$ is defined in (2.2), and the error term $\epsilon_i$ satisfies $Q_\tau(\epsilon_i \mid X_i) = 0$. The quantities $\beta_1$ and $\epsilon_i$ are specific to the $\tau$th quantile, but we omit the subscript $\tau$ for notational convenience. Note that (2.4) is an equivalent way of specifying the SIQR model (2.1). Relation (1.3) implies that the true parametric vector $\beta$ (recall the parametrization used in (2.3)) satisfies

$$\beta = \arg\min_{b} E\big(\rho_\tau(Y - Q_{\tau,b_1}(Y \mid b_1^\top X))\big), \tag{2.5}$$

where, in a notation that will be used throughout this thesis, $b_1 = (1, b^\top)^\top$, $b \in \mathbb{R}^{d-1}$. The sample level version of (2.5) consists of minimizing

$$\sum_{i=1}^{n} \rho_\tau\big(Y_i - Q_{\tau,b_1}(Y \mid b_1^\top X_i)\big). \tag{2.6}$$

As in the SIMR problem (cf. Ichimura 1993, Newey and Stoker 1993), the unknown $Q_{\tau,b_1}(Y \mid b_1^\top X_i)$ must be replaced with an estimator. Unlike the SIMR problem, however, there is no closed form expression for the estimator of $Q_{\tau,b_1}(Y \mid b_1^\top X_i)$, and this has led to iterative algorithms for estimating $\beta$; see the literature review in Section 2.1.

To overcome this difficulty, we define, for any given $b \in \mathbb{R}^{d-1}$, the function $g(\cdot \mid b) : \mathbb{R} \to \mathbb{R}$ as

$$g(t \mid b) = E\big(Q_\tau(Y \mid X) \mid b_1^\top X = t\big),$$

where $b_1 = (1, b^\top)^\top$. Noting that, under the SIQR model (2.1), $Q_\tau(Y \mid X) = Q_{\tau,\beta_1}(Y \mid \beta_1^\top X) = g(\beta_1^\top X \mid \beta)$, it follows that $\beta$ also satisfies

$$\beta = \arg\min_{b} E\big(\rho_\tau(Y - g(b_1^\top X \mid b))\big). \tag{2.7}$$
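The step from the definition of $g$ to (2.7) can be spelled out as follows. This is a sketch of the argument, assuming the expectations involved are finite; uniqueness of the minimizer additionally relies on the identifiability conditions of the thesis and is not addressed here.

```latex
% Step 1: under (2.1), Q_tau(Y | X) depends on X only through beta_1' X.
\begin{align*}
g(t \mid \beta)
  &= E\big( Q_\tau(Y \mid X) \mid \beta_1^\top X = t \big) \\
  &= E\big( Q_{\tau,\beta_1}(Y \mid \beta_1^\top X) \mid \beta_1^\top X = t \big)
     && \text{by (2.1)} \\
  &= Q_{\tau,\beta_1}(Y \mid \beta_1^\top X = t),
\end{align*}
% so that g(beta_1' X | beta) = Q_tau(Y | X).
%
% Step 2: apply (1.3) conditionally on X, for an arbitrary b.
\begin{align*}
E\,\rho_\tau\big( Y - g(b_1^\top X \mid b) \big)
  &= E\Big[ E\big( \rho_\tau( Y - g(b_1^\top X \mid b) ) \mid X \big) \Big] \\
  &\ge E\Big[ E\big( \rho_\tau( Y - Q_\tau(Y \mid X) ) \mid X \big) \Big] \\
  &= E\,\rho_\tau\big( Y - g(\beta_1^\top X \mid \beta) \big).
\end{align*}
```

The inequality holds because, by (1.3), $Q_\tau(Y \mid X)$ minimizes the conditional expected check loss over all functions of $X$, and $g(b_1^\top X \mid b)$ is one such function. Hence $\beta$ is a minimizer of the objective in (2.7).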

The sample level version of (2.7) consists of minimizing

$$S_n(\tau, b) = \sum_{i=1}^{n} \rho_\tau\big(Y_i - g(b_1^\top X_i \mid b)\big). \tag{2.8}$$

Again, $g(\cdot \mid b)$ is unknown but it can be estimated, in a non-iterative fashion, by first obtaining estimators $\hat Q_\tau(Y \mid X_i)$, for $i = 1, \ldots, n$, and forming the Nadaraya-Watson-type estimator

$$\hat g^{NW}(t \mid b) = \frac{\sum_{i=1}^{n} \hat Q_\tau(Y \mid X_i)\, K\!\left(\dfrac{t - b_1^\top X_i}{h}\right)}{\sum_{k=1}^{n} K\!\left(\dfrac{t - b_1^\top X_k}{h}\right)}, \tag{2.9}$$

where $K(\cdot)$ is a univariate kernel function and $h$ is a bandwidth. The different methods for constructing nonparametric estimators $\hat Q_\tau(Y \mid X_i)$ are summarized in Racine and Li (2014), who also introduced a new direct method. In this thesis, we will use the local polynomial conditional quantile estimator, which was studied in Guerre and Sabbah (2012). Specifically, let $k$ denote the order of the local polynomial estimator and define, for $v = (v_1, \ldots, v_d)$ where $v_1, \ldots, v_d$ are integers, $|v| = v_1 + \cdots + v_d$. Then, for $\alpha = (\alpha_0, \alpha_1^\top)^\top \in \mathbb{R}^P$, where $P$ is the number of $v$'s with $|v| \le k$, a multivariate kernel function $K(x) = K(x_1, \ldots, x_d)$, and a univariate bandwidth $h$, let

$$L_n((\alpha_0, \alpha_1); \tau, x) = \frac{1}{n h^{d}} \sum_{i=1}^{n} \rho_\tau\big(Y_i - U(X_i - x)^\top \alpha\big)\, K\!\left(\frac{X_i - x}{h}\right), \tag{2.10}$$

where $U(z) = (z^v / v!,\ |v| \le k)$, for $z^v = z_1^{v_1} \cdots z_d^{v_d}$ and $v! = \prod_{i=1}^{d} v_i!$. Define $\hat Q_\tau(Y \mid x)$ as $\hat\alpha_0(\tau; x)$, where $\hat\alpha_0(\tau; x)$ is defined through

$$(\hat\alpha_0(\tau; x), \hat\alpha_1(\tau; x)) = \arg\min_{(\alpha_0, \alpha_1)} L_n((\alpha_0, \alpha_1); \tau, x). \tag{2.11}$$

Remark. For high dimensional data it is possible to improve the estimation of conditional quantiles by employing variable selection methods; see Chapter 3 for details.

Thus, the proposed estimator is obtained by

$$\hat\beta = \arg\min_{b \in \Theta} \hat S_n(\tau, b), \tag{2.12}$$

where $\Theta \subset \mathbb{R}^{d-1}$ is a compact set assumed to contain the true value of $\beta$, and

$$\hat S_n(\tau, b) = \sum_{i=1}^{n} \rho_\tau\big(Y_i - \hat g^{NW}(b_1^\top X_i \mid b)\big). \tag{2.13}$$

For technical reasons that have to do with the uniform convergence of the Nadaraya-Watson estimator, a trimming function is usually introduced in the objective function (2.13). To avoid complicating the notation, we will assume that the support $\mathcal{X}_0$ of $X$ is compact and the density $f_b$ of $b_1^\top X$ stays bounded away from zero on $\mathcal{T}_b = \{t : t = b_1^\top x,\ x \in \mathcal{X}_0\}$, uniformly in $b \in \Theta$.

2.3 Main Results

The first two results, which have to do with the uniform, in both $t$ and $b$, consistency of $\hat g^{NW}(t \mid b)$, and the $\sqrt{n}$-consistency of $\hat\beta$, are needed for the proof of Theorem 2.3.3.

PROPOSITION 2.3.1 Let $\hat g^{NW}(t \mid b)$ be as defined in (2.9). Assume that for some $r > 2$, $E|Q_\tau(Y \mid X)|^r < \infty$ and $\sup_{t \in \mathcal{T}_b} E\big(|Q_\tau(Y \mid X)|^r \mid b_1^\top X = t\big) f_b(t) < \infty$ holds for all $b \in \Theta$, where $\mathcal{T}_b = \{t : t = b_1^\top x,\ x \in \mathcal{X}_0\}$, $\mathcal{X}_0$ is the compact support of $X$ and $f_b$ is the density of $b_1^\top X$. Moreover, assume that $Q_\tau(Y \mid x)$ is in $H_s(\mathcal{X}_0)$ for some $s$ with $[s] \le k$, where $H_s(\mathcal{X}_0)$ is defined in Appendix A.1 and $k$ is the order of the local polynomial conditional quantile estimators $\hat Q_\tau(Y \mid X_i)$ (used in (2.9)). Under Assumptions GS1-GS3 and Assumptions A1-A5 given in Appendix A.1, we have

$$\sup_{b \in \Theta,\ t \in \mathcal{T}_b} \big|\hat g^{NW}(t \mid b) - g(t \mid b)\big| = O_p\big(a_n + a_n^{*} + h^2\big),$$

where $a_n = (\log n / n)^{s/(2s+d)}$ and $a_n^{*} = (\log n / (nh))^{1/2}$.

The proof of Proposition 2.3.1 is given in Appendix A.3.2.

PROPOSITION 2.3.2 Let $\hat\beta$ be as defined in (2.12). Then, under the assumptions of Proposition 2.3.1, Assumptions A6 and A7 given in Appendix A.1, and the condition $nh^4 = o(1)$, where $h$ is the bandwidth in (2.9), $\hat\beta$ is a $\sqrt{n}$-consistent estimator of $\beta$.
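The non-iterative structure of (2.9), (2.12) and (2.13) can be sketched in plain Python for $d = 2$, where $b$ is a scalar. This is an illustrative sketch under stated assumptions, not the implementation used in the thesis: the pilot values `q_pilot` stand in for the local polynomial estimates $\hat Q_\tau(Y \mid X_i)$ and are simply supplied by the caller, the kernel is Gaussian, the minimization in (2.12) is replaced by a search over a finite grid, and all function names are hypothetical.

```python
import math

def check_loss(u, tau):
    """Check function rho_tau(u) = (tau - I(u < 0)) * u."""
    return (tau - (1.0 if u < 0 else 0.0)) * u

def gaussian_kernel(u):
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def nw_estimate(t, index_values, q_pilot, h):
    """Nadaraya-Watson-type estimate (2.9) of g(t | b) from pilot quantiles."""
    weights = [gaussian_kernel((t - s) / h) for s in index_values]
    total = sum(weights)
    return sum(w * q for w, q in zip(weights, q_pilot)) / total

def objective(b, X, Y, q_pilot, tau, h):
    """Sample objective (2.13) for scalar b, i.e. d = 2 and b_1 = (1, b)."""
    index_values = [x[0] + b * x[1] for x in X]
    return sum(
        check_loss(y - nw_estimate(t, index_values, q_pilot, h), tau)
        for y, t in zip(Y, index_values)
    )

def fit_by_grid(X, Y, q_pilot, tau, h, grid):
    """Single pass over candidate b values; the pilot quantiles are never updated."""
    return min(grid, key=lambda b: objective(b, X, Y, q_pilot, tau, h))

# Toy check (assumption: noise-free data, identity link, true b = 2).
grid_pts = [0.0, 0.25, 0.5, 0.75, 1.0]
X = [(x1, x2) for x1 in grid_pts for x2 in grid_pts]
q_pilot = [x1 + 2.0 * x2 for x1, x2 in X]   # stands in for the pilot estimates
Y = list(q_pilot)                           # no noise in this toy example
b_hat = fit_by_grid(X, Y, q_pilot, tau=0.5, h=0.05, grid=[0.0, 1.0, 2.0, 3.0, 4.0])
```

The point of the sketch is the order of operations: the pilot quantile estimates are computed once and do not depend on $b$, so evaluating the objective at a new $b$ only requires re-smoothing them along the new index. No alternation between the parametric and nonparametric components is needed.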

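The proof of Proposition 2.3.2 is given in Appendix A.3.3.

THEOREM 2.3.3 Let $\hat\beta$ be as defined in (2.12). Then, under the assumptions of Proposition 2.3.2,

$$\sqrt{n}(\hat\beta - \beta) = V^{-1} W_n + o_p(1),$$

where

$$V = E\Big( g'(\beta_1^\top X \mid \beta)^2\, \big(X_{-1} - E(X_{-1} \mid \beta_1^\top X)\big)\big(X_{-1} - E(X_{-1} \mid \beta_1^\top X)\big)^\top f_{\epsilon \mid X}(0 \mid X) \Big) \tag{2.14}$$

for $g'(t \mid b) = (\partial / \partial t)\, g(t \mid b)$, $X_{-1}$ the $(d-1)$-dimensional vector consisting of coordinates $2, \ldots, d$ of $X$, and $f_{\epsilon \mid X}(\cdot \mid x)$ the conditional probability density function of $\epsilon$ given $X = x$, and

$$W_n = n^{-1/2} \sum_{i=1}^{n} \rho_\tau'(Y_i^{*})\, g'(\beta_1^\top X_i \mid \beta)\, \big(X_{i,-1} - E(X_{-1} \mid \beta_1^\top X)\big), \tag{2.15}$$

for $Y_i^{*} = Y_i - \hat g^{NW}(\beta_1^\top X_i \mid \beta)$. Furthermore,

$$\sqrt{n}(\hat\beta - \beta) \xrightarrow{d} N\big(0,\ \tau(1-\tau)\, V^{-1} \Sigma V^{-1}\big),$$

where

$$\Sigma = E\Big( g'(\beta_1^\top X \mid \beta)^2\, \big(X_{-1} - E(X_{-1} \mid \beta_1^\top X)\big)\big(X_{-1} - E(X_{-1} \mid \beta_1^\top X)\big)^\top \Big). \tag{2.16}$$

The proof of Theorem 2.3.3 is given in Appendix A.3.4.

Next, let $\hat\beta_1 = (1, \hat\beta^\top)^\top$ and define

$$(\hat a(\tau; x), \hat b(\tau; x)) = \arg\min_{(a, b)} \sum_{i=1}^{n} \rho_\tau\big(Y_i - a - b\, \hat\beta_1^\top (X_i - x)\big)\, K\!\left(\frac{\hat\beta_1^\top (X_i - x)}{h}\right),$$

where $K(\cdot)$ is a univariate kernel function and $h$ is a bandwidth, and define

$$\hat Q^{LL}_\tau(Y \mid x) = \hat Q^{LL}_{\tau,\hat\beta_1}(Y \mid \hat\beta_1^\top x) = \hat a(\tau; x), \tag{2.17}$$

as the quantile estimator based on the assumption of the SIQR model (2.1).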

COROLLARY Let $\hat Q^{LL}_\tau(Y \mid x)$ be as defined in (2.17), where $K(\cdot)$ is a symmetric, second order density kernel with compact support and a bounded first derivative that satisfies $\int t^j K^2(t)\,dt < \infty$ for $j = 0, 1, 2$. Then, under the assumptions of Proposition 2.3.2 and Assumption A8 given in Appendix A.1,

$$\sqrt{nh}\,\Big(\hat Q^{LL}_\tau(Y \mid x) - Q_\tau(Y \mid x) - h^2 I(\beta_1^\top x)\Big) \xrightarrow{d} N\big(0,\ \omega^2(\beta_1^\top x)\big),$$

where $I(\beta_1^\top x) = (1/2)\, g''(\beta_1^\top x \mid \beta) \int t^2 K(t)\,dt$ and

$$\omega^2(\beta_1^\top x) = \frac{\tau(1-\tau) \int K^2(t)\,dt}{f_\beta(\beta_1^\top x)\, f_{\epsilon \mid \beta}(0 \mid \beta_1^\top x)^2},$$

for $f_{\epsilon \mid \beta}(\cdot \mid t)$ the conditional density function of $\epsilon$ given that $\beta_1^\top X$ takes the value $t$, $h \to 0$ and $nh \to \infty$ as $n \to \infty$.

The corollary follows from the fact that the proposed estimator $\hat\beta$ is $\sqrt{n}$-consistent and from the proof of Theorem 2 of Wu et al. (2010).

Remark. It can be shown that, when using the parametrization adopted here (see (2.3)), the asymptotic normality result for the parametric vector of Wu et al. (2010) is a special case (pertaining to the homoscedastic case) of Theorem 2.3.3. Oberhofer and Haupt (2015) considered nonlinear QR (thus, a known link function) in a fixed design with heteroscedastic errors which are allowed to be weakly dependent. Taking these differences into consideration, the form of the limiting covariance matrix they obtained is also related to that of Theorem 2.3.3. Finally, Kong and Xia (2012), who considered the heteroscedastic case and used a penalty term to ensure the convergence of their iterative algorithm, obtained a covariance matrix that depends on the true value of the parametric component in an explicit manner and whose form is not directly comparable to that of the previous literature or ours.
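To see how the covariance in Theorem 2.3.3 simplifies in the homoscedastic case mentioned in the remark, one can work through the sandwich formula directly. The sketch below assumes that the conditional density of $\epsilon$ at zero does not depend on $X$; the symbols $A$ and $f_\epsilon(0)$ are introduced here only for illustration, and $\Sigma$ is assumed invertible.

```latex
% Shorthand (introduced here): A is the random matrix inside (2.14) and (2.16).
\begin{align*}
A &= g'(\beta_1^\top X \mid \beta)^2\,
     \big(X_{-1} - E(X_{-1} \mid \beta_1^\top X)\big)
     \big(X_{-1} - E(X_{-1} \mid \beta_1^\top X)\big)^\top, \\
\Sigma &= E(A), \qquad V = E\big(A\, f_{\epsilon \mid X}(0 \mid X)\big).
\end{align*}
% Homoscedastic-type assumption: f_{eps|X}(0 | x) = f_eps(0) > 0 for all x.
\begin{align*}
V &= f_\epsilon(0)\, E(A) = f_\epsilon(0)\, \Sigma, \\
\tau(1-\tau)\, V^{-1} \Sigma V^{-1}
  &= \tau(1-\tau)\, f_\epsilon(0)^{-2}\, \Sigma^{-1} \Sigma\, \Sigma^{-1}
   = \frac{\tau(1-\tau)}{f_\epsilon(0)^{2}}\, \Sigma^{-1}.
\end{align*}
```

Under this assumption the sandwich collapses to a single inverse, so heteroscedasticity enters the general result only through the weighting by $f_{\epsilon \mid X}(0 \mid X)$ inside $V$.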

2.4 Numerical Studies

This section contains simulation results and the analysis of a real data set, contrasting the proposed estimator with that of Wu et al. (2010).

2.4.1 Computational Remarks

For the computation of the proposed estimator, the conditional quantile estimator $\hat Q_\tau(Y \mid x)$ (used in (2.9)) is a multivariate local linear estimator, computed using an extension of the code for the function lprq in the R package quantreg (which applies only to univariate covariates). The bandwidth $h$ used in (2.10) is selected to be the rule-of-thumb bandwidth for the local linear conditional quantile estimator derived in Yu and Jones (1998). The bandwidth $h$ used in (2.9) is selected using the optimal rate $Cn^{-1/5}$, where $C$ is chosen to be the standard deviation of $(b_1^0)^\top x$ for $b_1^0$ an initial value. The Gaussian kernel was used for estimation of the conditional quantiles $\hat Q_\tau(Y \mid X_i)$ and the Nadaraya-Watson-type estimator $\hat g^{NW}(t \mid b)$. The function nlrq of the same R package was used for minimizing the objective function (2.13).

For the computation of the covariance matrix derived in Theorem 2.3.3, note that the expressions for $V$ and $\Sigma$ in (2.14) and (2.16), respectively, involve the quantities $g'(\beta_1^\top X \mid \beta)\big(X_{-1} - E(X_{-1} \mid \beta_1^\top X)\big)$ and $f_{\epsilon \mid X}(0 \mid X)$. The estimator used for the first quantity is based on the observation that, under the single index (SI) model,

$$\frac{\partial}{\partial b}\, g(b_1^\top X \mid b)\Big|_{b = \beta} = g'(\beta_1^\top X \mid \beta)\big(X_{-1} - E(X_{-1} \mid \beta_1^\top X)\big),$$

and, therefore, it can be estimated as $\frac{\partial}{\partial b}\, \hat g^{NW}(b_1^\top X \mid b)\big|_{b = \hat\beta}$. Finally, the estimation of $f_{\epsilon \mid X}(0 \mid X)$ uses a Gaussian kernel and a bandwidth chosen according to the R function bw.nrd0 in the stats package. The resulting code is available from the first author.

For the computation of the Wu et al. (2010) estimator we used the code provided by these authors. Because the two estimators being compared are derived using different parametrizations for ensuring identifiability, namely (2.3) versus the constraint of unit norm, for the sake of comparison we found it necessary to introduce two modifications of the Wu et al. (2010) estimator. For the first modification, we divide their estimator by its first component, and for the second modification we use our proposed parametrization in conjunction with the iterative algorithm of Wu et al. (2010) for estimating the remaining $d - 1$ coefficients. In what follows, NWQR denotes the proposed estimator, and WYY, WYY-1 and WYY-2 denote, respectively, the estimator in Wu et al. (2010) and its first and second modifications. All simulation results in this chapter use a sample size of $n = 400$ and are based on $N = 100$ simulation runs.

2.4.2 Simulation Results

Example 1 (Asymmetric Homoscedastic Errors): Here the data are generated according to the model

$$Y = 5\cos(\beta_1^\top X) + \exp\big(-(\beta_1^\top X)^2\big) + \epsilon, \tag{2.18}$$

where $X = (X_1, X_2)^\top$, $X_i \sim U(0, 1)$ are iid, $\beta_1 = (1, 2)^\top$, the residual $\epsilon$ follows an exponential distribution with mean 2, and the $X_i$'s and $\epsilon$ are mutually independent. In this example, we fit the single index median regression model using the proposed method and that of Wu et al. (2010). The boxplots presented in Figure 2.1 show 100 coefficient estimates of the parametric component $\beta$ (whose true value is 2), using the proposed methodology and the two modifications of the Wu et al. (2010) estimator. Observe that the boxplot for NWQR is more closely concentrated around the true value of 2 than the other two boxplots, while the boxplot for WYY-1 displays the widest variability around the true value. The observed outliers in the boxplot for the WYY-2 estimator (which uses the proposed parametrization) are probably a consequence of the iterative algorithm, which stops after a maximum number of iterations.

For further comparison, Table 2.1 reports the observed, over the $N = 100$ simulation runs, mean values and standard deviations for the three estimators of the parametric component $\beta$, as well as the mean squared error, $R(\hat\beta)$, and the average mean check based absolute residuals, $R_\tau(\hat Q^{LL}_{\tau,\hat\beta_1})$, defined as

$$R_\tau(\hat Q^{LL}_{\tau,\hat\beta_1}) = \frac{1}{n} \sum_{i=1}^{n} \rho_\tau\big(Y_i - \hat Q^{LL}_{\tau,\hat\beta_1}(Y \mid \hat\beta_1^\top X_i)\big), \tag{2.19}$$

where $\hat{\beta}_1$ denotes an estimator of the Euclidean parameter $\beta_1$, and $\hat{Q}^{LL}_{\tau,\hat{\beta}_1}(Y|\hat{\beta}_1^\top X_i)$ is defined in (2.17).

Figure 2.1. Boxplot of the estimated parametric component $\hat{\beta}$ for Model (2.18) for the three estimators (NWQR, WYY-1, WYY-2); the true $\beta$ is 2.

The findings in Table 2.1 further confirm the conclusions drawn from the boxplot in Figure 2.1. We observe that NWQR gives the smallest bias, the smallest value of $R(\hat{\beta})$, and the smallest average $R_{\tau}(\hat{Q}^{LL}_{\tau,\hat{\beta}_1})$, followed by WYY-2, which uses the proposed constraint, and then WYY-1. Finally, we compare the coverage of the asymptotic 95% confidence intervals based on NWQR, WYY-2, and the second estimated coefficient of Wu et al. (2010) (which estimates $2/\sqrt{5}$). The WYY-1 estimator was not considered due to the additional complication presented by the ratio. The observed coverage probabilities are 0.96, 0.91 and 0.68 for NWQR, WYY-2 and the second estimated coefficient of Wu et al. (2010), respectively. The p-values corresponding to 0.91 and 0.68 are 0.07 and , respectively. The conclusion is that the marginal standard error formula given in Wu et al. (2010) for the second component of $\beta_1$ is appropriate for the parametrization used in the present work.

Table 2.1. Mean values (and standard errors in parentheses), $R(\hat{\beta})$ and average $R_{\tau}(\hat{Q}^{LL}_{\tau,\hat{\beta}_1})$, defined in (2.19), for Model (2.18). Columns: $\hat{\beta}$, $R(\hat{\beta})$, $R_{\tau}(\hat{Q}^{LL}_{\tau,\hat{\beta}_1})$; rows: NWQR, WYY-1, WYY-2.

Example 2 (Symmetric Homoscedastic Errors): Here the data are generated according to the model

$$Y = \exp(\beta_1^\top X) + \epsilon, \qquad (2.20)$$

where $X = (X_1, X_2)^\top$, $X_i \sim N(0, 1)$ are iid, $\beta_1 = (1, 2)^\top$, the residual $\epsilon$ follows a standard normal distribution, and the $X_i$'s and $\epsilon$ are mutually independent. In this example, we fit the SIQR model for five different quantile levels, $\tau = 0.1, 0.25, 0.5, 0.75, 0.9$, using the proposed method and that of Wu et al. (2010). The boxplots for Model (2.20) are omitted since they lead to conclusions similar to those of Example 1.

Table 2.2 presents the observed, over the $N = 100$ simulation runs, mean values and standard deviations for the three estimators of the parametric component $\beta$, as well as the mean squared error, $R(\hat{\beta})$, and the average mean check based absolute residuals, $R_{\tau}(\hat{Q}^{LL}_{\tau,\hat{\beta}_1})$. NWQR is more closely concentrated around 2 and gives the smallest $R(\hat{\beta})$ and average $R_{\tau}(\hat{Q}^{LL}_{\tau,\hat{\beta}_1})$ values for all quantile levels. Note that for $\tau = 0.9$, the NWQR estimator has a mean value of , while WYY-2 has a mean value of . This is due to an outlier present for NWQR, while it still gives the smallest bias. Also, observe the large values of $\hat{\beta}$ and $R(\hat{\beta})$ for $\tau = 0.5$ for WYY-1. This is due to a very extreme outlier (around 140), which can be observed from the boxplot and which is a consequence of the iterative algorithm. Table 2.2 also reports the coverage of the asymptotic 95% confidence intervals based on NWQR, WYY-2, and the second estimated coefficient of Wu et al. (2010) (denoted by WYY in the table). Conclusions similar to those in Example 1 can be drawn regarding the performance of these confidence intervals, with the additional remark that the coverage probability of the WYY-2 intervals deteriorates for the 75th and 90th percentiles.

Table 2.2. Mean values (and standard errors in parentheses), $R(\hat{\beta})$ and average $R_{\tau}(\hat{Q}^{LL}_{\tau,\hat{\beta}_1})$, defined in (2.19), for Model (2.20). Also, 95% coverage probability for NWQR, WYY-2 and the second estimated coefficient of Wu et al. (2010), denoted by WYY. Columns: quantile level $\tau$; row blocks: $\hat{\beta}$, $R(\hat{\beta})$ and $R_{\tau}(\hat{Q}^{LL}_{\tau,\hat{\beta}_1})$, each for NWQR, WYY-1, WYY-2, and 95% coverage probability for NWQR, WYY-2, WYY.

Example 3 (Asymmetric Heteroscedastic Errors): Here the data are generated according to the model Y = sin2πβ 1X) β 2X) 2 ) ɛ, 2.21) 4 where $X = (X_1, X_2)^\top$, $X_i \sim U(0, 1)$ are iid, $\beta_1 = (1, 2)^\top$, $\beta_2 = (1, 1)^\top$, and the residual $\epsilon$ follows an exponential distribution with mean 1. We fit the SIQR model for five different quantile levels, $\tau = 0.1, 0.25, 0.5, 0.75, 0.9$, using the proposed method and that of Wu et al. (2010). Table 2.3 reports the coverage of the asymptotic 95% confidence intervals based on NWQR, WYY-2, and WYY. The observed, over the $N = 100$ simulation runs, mean values and standard deviations, as well as $R(\hat{\beta})$ and average $R_{\tau}(\hat{Q}^{LL}_{\tau,\hat{\beta}_1})$ for the three estimators of the parametric component, are not reported here since they reveal the same trends as in the previous two examples. From Table 2.3 we observe that the coverage probabilities of the NWQR intervals are close to the nominal value of 0.95, while those of the WYY-2 estimator tend to be smaller than the nominal value, probably a consequence of the heteroscedasticity.

Table 2.3. 95% coverage probability for NWQR, WYY-2, and the second estimated coefficient of Wu et al. (2010), denoted by WYY, for Model (2.21). Columns: quantile level $\tau$; rows: NWQR, WYY-2, WYY.

Boston Housing Data

For this example we consider an application to the Boston housing data. The data contain 506 observations on 14 variables, for which the dependent variable of interest is medv, the median value of owner-occupied homes in \$1000s, and the other thirteen variables are statistical measurements on the 506 census tracts in suburban Boston from the 1970 census. The data were originally published by Harrison and Rubinfeld (1978). This data set can be found in the MASS library in R. QR is appropriate for this data set because the response variable is the median price of homes and the y-values larger than or equal to \$50,000 have been recorded as \$50,000. As was noted in Chaudhuri et al. (1997, p. 724), such a truncation in the upper tail of the response makes quantile regression, which is not influenced very much by extreme values of the response, a very appropriate methodology.

Due to the collinearity in the data set, Breiman and Friedman (1985) applied their alternating conditional expectation (ACE) method for selecting the relevant variables and selected the four covariates RM, TAX, PTRATIO, and LSTAT, for which the description is given below. Many regression studies (Opsomer and Ruppert 1998; Yu and Lu 2004; Wu et al. 2010) have used this data set and, using a logarithmic transformation on the covariates TAX and LSTAT, found a potential relationship between the response medv and these four covariates. Opsomer and Ruppert (1998) considered mean regression and fitted the additive model after removing the observations with outliers on the covariates TAX and LSTAT. Yu and Lu (2004) fitted an additive QR model and Wu et al. (2010) considered the SIQR model. In addition, many studies (Chaudhuri et al. 1997; Wu et al. 2010) considered the relationship between medv and the three covariates RM, LSTAT, and DIS, for which the description is given below. Chaudhuri et al. (1997) considered the quantile average derivative estimate (qade) regression, while Wu et al. (2010) considered the SIQR.

We apply our proposed methodology using the above two sets of predictors. First, consider the four covariates:

RM: average number of rooms per house in the area
TAX: full-value property tax (in dollars) per \$10,000
PTRATIO: pupil-teacher ratio by town
LSTAT: percentage of the population having lower economic status in the area.

Following previous studies, we take logarithmic transformations of TAX and LSTAT, and center the dependent variable. Let $X_1$, $X_2$, $X_3$ and $X_4$ denote the standardized RM, log(TAX), PTRATIO and log(LSTAT), respectively, and set $X = (X_1, X_2, X_3, X_4)^\top$. We consider the SIQR model:

$$Q_{\tau}(\mathrm{medv}|X) = g(X_1 + \beta_2 X_2 + \beta_3 X_3 + \beta_4 X_4|\beta), \qquad (2.22)$$

for five different quantile levels $\tau = 0.1, 0.25, 0.5, 0.75, 0.9$, which assumes that RM is a significant predictor, an assumption which is confirmed by the previous studies. An analysis assuming a different significant predictor gave the same results. Note that the dependence of $g$ and $\beta$ on $\tau$ is not made explicit.
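The preprocessing behind Model (2.22) can be sketched as follows: log-transform TAX and LSTAT, standardize each covariate, center the response, and form the index $X_1 + \beta_2 X_2 + \beta_3 X_3 + \beta_4 X_4$ with the first coefficient fixed at 1. The data rows and the coefficient vector below are made-up placeholders for illustration, not records from the Boston housing data or estimates from the thesis, and the helper names are hypothetical.

```python
import math
from statistics import mean, stdev


def standardize(values):
    """Center a column and scale it to unit sample standard deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


# Placeholder rows (rm, tax, ptratio, lstat, medv); NOT real Boston housing records.
rows = [
    (6.5, 300.0, 15.0, 5.0, 24.0),
    (6.0, 240.0, 18.0, 9.0, 21.0),
    (7.1, 220.0, 17.5, 4.0, 34.0),
    (5.8, 400.0, 20.0, 15.0, 16.0),
    (6.2, 310.0, 19.0, 12.0, 20.0),
]
rm, tax, ptratio, lstat, medv = (list(col) for col in zip(*rows))

# X1..X4: standardized RM, log(TAX), PTRATIO, log(LSTAT).
x1 = standardize(rm)
x2 = standardize([math.log(v) for v in tax])
x3 = standardize(ptratio)
x4 = standardize([math.log(v) for v in lstat])

# Centered response.
y = [v - mean(medv) for v in medv]

# Hypothetical (beta_2, beta_3, beta_4); the coefficient of X1 is fixed at 1.
beta = (-0.3, -0.2, -0.5)
index = [a + beta[0] * b + beta[1] * c + beta[2] * d
         for a, b, c, d in zip(x1, x2, x3, x4)]
```

Fixing the coefficient of $X_1$ at 1 is what makes the remaining coefficients identifiable without a norm constraint; the link $g$ is then estimated nonparametrically as a function of `index`.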

Table 2.4 gives the estimators and their standard errors resulting from the proposed methodology. The conclusions derived from this table are: (a) PTRATIO seems to have a significant contribution only on the 90th quantile; (b) none of TAX, PTRATIO, LSTAT appears to have a significant contribution on the 50th and 75th quantiles; (c) TAX is more significant than LSTAT for the 10th and 25th quantiles; (d) LSTAT is more significant than TAX for the 90th quantile; (e) RM seems to have an effect opposite to that of the other predictors. Conclusions (a)-(d) differ from those in the aforementioned literature. The difference with the conclusions in Yu and Lu (2004) is largely because they considered only the absolute value of the coefficients instead of the coefficients' t-values. Similarly, the conclusions of Wu et al. (2010) regarding the relative significance of the predictors are based on the absolute value of their coefficients, even though they did compute standard errors (based on the bootstrap instead of their variance formulas). Finally, the results of Opsomer and Ruppert (1998) are not directly comparable to ours because they considered mean regression using an additive model.

Table 2.4. Proposed parametric vector estimates (and standard errors in parentheses) for Boston housing data for Model (2.22) with five different quantile levels. Columns: $\tau$, RM, log(TAX), PTRATIO, log(LSTAT).

In order to compare the performance of the proposed estimator with that of Wu et al. (2010), we consider the mean check based absolute residuals statistic, $R_{\tau}(\hat{Q}^{LL}_{\tau,\hat{\beta}_1})$, as defined in (2.19). From the left panel of Table 2.6, we note that both methods perform similarly, with the proposed estimator achieving somewhat smaller values for all quantiles.
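Conclusions (a)-(d) rest on comparing t-values, that is, each coefficient divided by its standard error, rather than the raw coefficients. A minimal sketch of that comparison follows. The estimates and standard errors are invented placeholders (the entries of Table 2.4 are not reproduced here), and the 1.96 cutoff for a two-sided 5% test is an assumption about the significance threshold, which the text does not state.

```python
def t_values(estimates, std_errors):
    """Return coefficient / standard error for each named predictor."""
    return {name: estimates[name] / std_errors[name] for name in estimates}


def significant(tvals, cutoff=1.96):
    """Flag predictors whose |t| exceeds the (assumed) normal cutoff."""
    return {name: abs(t) > cutoff for name, t in tvals.items()}


# Placeholder numbers for one quantile level; NOT taken from Table 2.4.
est = {"log(TAX)": -0.40, "PTRATIO": -0.10, "log(LSTAT)": -0.90}
se = {"log(TAX)": 0.10, "PTRATIO": 0.20, "log(LSTAT)": 0.60}

tv = t_values(est, se)
flags = significant(tv)
# A larger |coefficient| does not imply a larger |t|: log(LSTAT) has the
# largest coefficient here but a smaller |t| than log(TAX).
```

With these placeholder values, ranking by absolute coefficient and ranking by absolute t-value disagree, which is the kind of discrepancy the comparison with Yu and Lu (2004) and Wu et al. (2010) turns on.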

Figure 2.2. Estimated SIQR for Boston housing data for Model (2.22). The dots are the observations and the curve is the estimated quantile function. (Five panels; axes: index and medv.)

Each panel in Figure 2.2 superimposes the scatterplots of $(Y_i, \hat{\beta}_1^\top X_i)$ and $(\hat{Q}^{LL}_{\tau,\hat{\beta}_1}(Y|\hat{\beta}_1^\top X_i), \hat{\beta}_1^\top X_i)$, $i = 1, \ldots, 506$. These plots suggest that the estimated conditional quantile function provides a good fit to the data. They also indicate the presence of heteroscedasticity.

Next, we consider the three covariates RM, LSTAT and DIS, where:

RM: average number of rooms per house in the area
LSTAT: percentage of the population having lower economic status in the area
DIS: weighted distance to five Boston employment centers from houses of the area.

Let $X_1$, $X_2$ and $X_3$ denote the standardized RM, LSTAT and DIS, respectively, and set $X = (X_1, X_2, X_3)^\top$. We consider the SIQR model:

$$Q_{\tau}(\mathrm{medv}|X) = g(X_1 + \beta_2 X_2 + \beta_3 X_3|\beta), \qquad (2.23)$$

for five different quantile levels $\tau = 0.1, 0.25, 0.5, 0.75, 0.9$, which assumes that RM is a significant predictor, an assumption which is confirmed by the previous studies. An analysis assuming that LSTAT is the significant predictor gave the same results.

Table 2.5 gives the estimators and their standard errors resulting from the proposed methodology. The conclusions derived from this table are: (a) LSTAT is the most important covariate for all quantile levels, as seen by comparing the coefficients' t-values; (b) the effect of LSTAT is essentially stable across different quantiles; (c) DIS has a significant contribution only on the 90th quantile. Conclusions (a) and (b) are in conformity with those of Chaudhuri et al. (1997) and Wu et al. (2010), with the difference that the aforementioned authors draw their conclusions by comparing only the absolute values of the normalized coefficients without calculating the coefficients' t-values. Our conclusion (c), however, is a more definitive statement regarding the relevance of DIS, since none of the above investigators mentioned anything about the significance of DIS.

Table 2.5. Proposed parametric vector estimates (and standard errors in parentheses) for Boston housing data for Model (2.23) with five different quantile levels. Columns: $\tau$, RM, LSTAT, DIS.

The right panel of Table 2.6 gives $R_{\tau}(\hat{Q}^{LL}_{\tau,\hat{\beta}_1})$ for the five different quantile levels, contrasting the proposed estimator with that of Wu et al. (2010). We remark that Wu et al. (2010) display the $R_{\tau}(\hat{Q}^{LL}_{\tau,\hat{\beta}_1})$ values resulting from the qade of Chaudhuri et al. (1997). Because these values are considerably higher, they are not displayed in Table 2.6 for either model. For $\tau = 0.1, 0.25, 0.5, 0.75$ the two methods give similar $R_{\tau}(\hat{Q}^{LL}_{\tau,\hat{\beta}_1})$ values, with NWQR resulting in somewhat larger values for $\tau = 0.1, 0.25$ and somewhat smaller values for $\tau = 0.5, 0.75$; for $\tau = 0.9$, however, NWQR results in a considerably lower $R_{\tau}(\hat{Q}^{LL}_{\tau,\hat{\beta}_1})$ value. In terms of comparing the two models, Model (2.22) seems better for $\tau = 0.1, 0.25, 0.5$ according to both WYY and NWQR, while Model (2.23) seems better for $\tau = 0.75$ according to WYY and for $\tau = 0.75, 0.9$ according to NWQR. Lastly, our method suggests that the best fit corresponds to the lower quantile $\tau = 0.1$ for both models, a conclusion which is in conformity with the aforementioned literature.

Table 2.6. Mean check based absolute residuals, $R_{\tau}(\hat{Q}^{LL}_{\tau,\hat{\beta}_1})$, defined in (2.19), for Models (2.22) and (2.23); WYY denotes the method proposed by Wu et al. (2010) and NWQR denotes the proposed methodology. Columns: $\tau$; WYY and NWQR for Model (2.22); WYY and NWQR for Model (2.23).

Finally, each panel in Figure 2.3 superimposes the scatterplots of $(Y_i, \hat{\beta}_1^\top X_i)$ and $(\hat{Q}^{LL}_{\tau,\hat{\beta}_1}(Y|\hat{\beta}_1^\top X_i), \hat{\beta}_1^\top X_i)$, $i = 1, \ldots, 506$. Again, these plots suggest that the estimated conditional quantile function provides a good fit to the data, and they also indicate the presence of heteroscedasticity.
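The statistic reported in Table 2.6 averages the check loss of the residuals. Assuming that $\rho_{\tau}$ in (2.19) is the usual check function $\rho_{\tau}(u) = u(\tau - 1\{u < 0\})$, it can be computed as in the sketch below. The responses and fitted values are made-up placeholders rather than thesis data, and the function names are hypothetical.

```python
def check_loss(u, tau):
    """Check function rho_tau(u) = u * (tau - 1{u < 0})."""
    return u * (tau - (1.0 if u < 0 else 0.0))


def mean_check_residual(y, fitted, tau):
    """R_tau: average check loss of the residuals y_i - fitted_i."""
    if len(y) != len(fitted):
        raise ValueError("y and fitted must have the same length")
    return sum(check_loss(yi - qi, tau) for yi, qi in zip(y, fitted)) / len(y)


# Placeholder responses and fitted conditional quantiles (not thesis data).
y_obs = [1.0, 2.0, 4.0, 3.0]
q_hat = [1.5, 1.5, 3.0, 3.0]

r_median = mean_check_residual(y_obs, q_hat, tau=0.5)
r_upper = mean_check_residual(y_obs, q_hat, tau=0.9)
```

For $\tau = 0.5$ the statistic is half the mean absolute residual; for other levels, positive and negative residuals are weighted by $\tau$ and $1 - \tau$, so values are comparable across methods at a fixed $\tau$ but not across quantile levels without care.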
