Practical Bayesian Quantile Regression. Keming Yu University of Plymouth, UK


Keming Yu, University of Plymouth, UK (kyu@plymouth.ac.uk). A brief summary of some recent work by Keming Yu, Rana Moyeed and Julian Stander.

Summary. We develop a Bayesian framework for quantile regression, including Tobit quantile regression. We discuss the selection of the likelihood, and families of prior distributions on the quantile regression parameter vector that lead to proper posterior distributions with finite moments. We show how the posterior distribution can be sampled and summarized by Markov chain Monte Carlo (MCMC) methods. A method for quantile regression model choice is also developed. In an empirical comparison, our approach outperformed some common classical estimators.

1. Background. Linear regression models: Aim: estimate E(Y | X); Method: least-squares minimization; Suitability: Gaussian errors, symmetric distributions; Weaknesses: skewed conditional distributions, outliers, tail behaviour.

Quantile regression models (Koenker and Bassett, 1978): Aim: estimate the pth quantile of Y given X (0 < p < 1), and explore a complete relationship between Y and X; Special case: median regression (p = 0.5); Method: check function minimization; Check function: ρ_p(u) = u(p − I(u < 0)); Applications: reference charts in medicine (Cole and Green, 1992), survival analysis (Koenker and Geling, 2001), Value at Risk (Bassett and Chen, 2001), labour economics (Buchinsky, 1995), flood return periods (Yu et al., 2003).
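As a small illustration, the check function in R (the name rho_p is our choice, not from the paper's released code):

```r
# Check function rho_p(u) = u * (p - I(u < 0)); a minimal sketch
rho_p <- function(u, p) u * (p - (u < 0))

rho_p(c(-2, 1, 3), p = 0.5)   # 1.0 0.5 1.5: half the absolute errors at the median
```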

2. Basic setting. Consider the standard linear model y_i = µ(x_i) + ε_i, where typically µ(x_i) = x_i'β for a vector of coefficients β. The pth (0 < p < 1) quantile of ε_i is the value q_p for which P(ε_i < q_p) = p. The pth conditional quantile of y_i given x_i is then q_p(y_i | x_i) = x_i'β(p). (1)

To make inference about the parameter β(p), given p and the observations on (X, Y), we use the posterior distribution of β(p), π(β | y), given by π(β | y) ∝ L(y | β) π(β), (2) where π(β) is the prior distribution of β and L(y | β) is the likelihood function.

3. Selecting the likelihood function. There are several ways to select L(y | β) in the set-up above. One is to model the error distribution with a mixture distribution together with a Dirichlet process or Pólya tree prior (Walker et al., 1999; Kottas and Gelfand, 2001). However, as Richardson (1999) and others commented, this introduces extra parameters into the inference, complicates the associated computations, and makes the choice of the partition of the prior space difficult. Another option is to use a substitute likelihood (Dunson et al., 2003, Biometrics), but this is not a proper likelihood.

A simple and natural likelihood is based on the asymmetric Laplace distribution with probability density, for 0 < p < 1, f_p(u) = p(1 − p) exp{−ρ_p(u)}. (3) Link: minimization of the check function loss is exactly equivalent to maximization of the likelihood function below. Likelihood function: L(y | β) = p^n (1 − p)^n exp{ −Σ_i ρ_p(y_i − x_i'β) }. (4) Feature: apart from the parameter β, there are no extra parameters to infer, so we only need to set priors for β.
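A minimal R sketch of the log of likelihood (4); the helper name loglik_ald is an illustrative choice, not the authors' released code:

```r
# Log of the working likelihood (4) under the asymmetric Laplace error model.
# X is the n x k design matrix (including an intercept column), y the response,
# p the quantile of interest; rho_p is the check function sketched earlier.
rho_p <- function(u, p) u * (p - (u < 0))

loglik_ald <- function(beta, y, X, p) {
  u <- as.vector(y - X %*% beta)
  length(y) * log(p * (1 - p)) - sum(rho_p(u, p))
}
```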

Once we have a prior for β(p) and a data set, we can make posterior inference for the parameters. How? Since there is no conjugate prior, we use MCMC techniques to sample from the posterior. Why Bayes? Classical methods rely on asymptotics, requiring either a large sample or the bootstrap. In contrast, MCMC sampling enables us to make exact inference for any sample size without resorting to asymptotic approximations.

For example, the asymptotic covariance matrices of the classical estimators for these quantile regression models depend on the error density of ε and are therefore difficult to estimate reliably. In the Bayesian framework, variance estimates, as well as any other posterior summaries, come out as by-products of the MCMC sampler and are therefore trivial to obtain once samples from the posterior distribution are available. Moreover, the Bayesian paradigm enables us to incorporate prior information in a natural way, whereas the frequentist paradigm does not, and it takes the uncertainty of the parameters fully into account.

4. Bayesian posterior computation via MCMC. An MCMC scheme constructs a Markov chain whose equilibrium distribution is the posterior π(β | y). After running the Markov chain for a burn-in period so that it can reach equilibrium, one obtains samples from π(β | y). One popular method for constructing such a chain is the Metropolis-Hastings (MH) algorithm, in which a candidate is generated from an auxiliary distribution and then accepted or rejected with some probability. The candidate-generating distribution q(β* | β_c) can depend on the current state β_c of the Markov chain.

A candidate β* is accepted with a certain acceptance probability α(β*, β_c), which also depends on the current state β_c and is given by
α(β*, β_c) = min{ [π(β*) L(y | β*) q(β_c | β*)] / [π(β_c) L(y | β_c) q(β* | β_c)], 1 }.
If, for example, a simple random walk is used to generate β* from β_c, then the ratio q(β_c | β*)/q(β* | β_c) = 1. In general the steps of the MH algorithm are:
Step 0: Start with an arbitrary value β^(0).
For n from 1 to N:
Step n: Generate β* from q(· | β_c) and u from U(0, 1);
if u ≤ α(β*, β_c), set β^(n) = β* (acceptance);
if u > α(β*, β_c), set β^(n) = β^(n−1) (rejection).
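A minimal R sketch of this random-walk MH scheme for posterior (2) under a flat prior, so the prior and proposal ratios cancel; it reuses loglik_ald from the sketch above, and the function name, proposal scale and chain length are illustrative choices, not the authors' released S-PLUS/R code:

```r
# Random-walk Metropolis sketch for the posterior (2) with pi(beta) propto 1,
# so the acceptance probability reduces to a likelihood ratio.
mh_quantreg <- function(y, X, p, n_iter = 3000, step = 0.05) {
  k <- ncol(X)
  draws <- matrix(NA_real_, n_iter, k)
  beta <- rep(0, k)                      # arbitrary starting value beta^(0)
  ll <- loglik_ald(beta, y, X, p)
  for (n in 1:n_iter) {
    cand <- beta + rnorm(k, sd = step)   # symmetric proposal, so q-ratio = 1
    ll_cand <- loglik_ald(cand, y, X, p)
    if (log(runif(1)) < ll_cand - ll) {  # accept with probability alpha
      beta <- cand
      ll <- ll_cand
    }
    draws[n, ] <- beta                   # rejection keeps the current state
  }
  draws
}
```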

As mentioned above after running this procedure for a certain burn-in period, the samples obtained may be through as coming from the posterior distribution. Hence, we can estimate posterior moments, standard deviations and credible intervals from this posterior sample. We found that convergence was very rapid. Remark: We may use the set-up for Tobit quantile regression: suppose that y and y are random variables connected by the censoring relationship y = max { y 0, y }, where y 0 is a known censoring point. In this case, we have found it simplifies the algorithm to assume zero to be the fixed censoring points by a simple transformation of any non-zero censoring points.
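A tiny R illustration of that transformation (all objects below are hypothetical, for illustration only):

```r
# Reducing a known non-zero censoring point to zero:
# if y = max(y0, y*), then y - y0 = max(0, y* - y0).
y0     <- 0.75                            # hypothetical known censoring point
y_star <- rnorm(5, mean = 1)              # hypothetical latent responses
y      <- pmax(y0, y_star)                # observed (censored) responses
all.equal(y - y0, pmax(0, y_star - y0))   # TRUE: censoring shifted to zero
```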

S-PLUS or R code to implement the algorithms is available (free).

5. Some theoretical results, including prior selection. As Richardson (1999) and others have noted, popular forms of priors tend to be those whose parameters can be set straightforwardly and which lead to posteriors of a relatively immediate form. Although a standard conjugate prior distribution is not available for the quantile regression formulation, MCMC methods may be used to draw samples from the posterior distribution. This, in principle, allows us to use virtually any prior distribution. However, we should select priors that yield proper posteriors: choose the prior π(β) from a class of known distributions for which the posterior is proper.

First, the posterior is proper if and only if 0 < ∫_{R^{T+1}} π(β | y) dβ < ∞, (5) or, equivalently, if and only if 0 < ∫_{R^{T+1}} L(y | β) π(β) dβ < ∞. Moreover, we require that all posterior moments exist, that is, E[ ∏_{j=0}^{T} |β_j|^{r_j} | y ] < ∞, (6) where (r_0, ..., r_T) denotes the orders of the moments of β = (β_0, ..., β_T). We now establish a bound for the integral ∫_{R^{T+1}} ∏_{j=0}^{T} |β_j|^{r_j} L(y | β) π(β) dβ that allows us to obtain proper posterior moments.

Theorem 1 (basic lemma). Let g(t) = exp(−|t|), and let h_1(p) = min{p, 1 − p} and h_2(p) = max{p, 1 − p}. Then all posterior moments exist if and only if ∫_{R^{T+1}} ∏_{j=0}^{T} |β_j|^{r_j} ∏_{i=1}^{n} g{ h_k(p)(y_i − x_i'β) } π(β) dβ is finite for both k = 1 and k = 2.

Theorem 2 establishes that in the absence of any realistic prior information we may legitimately use an improper uniform prior distribution for all the components of β. This choice may be appealing as the resulting posterior distribution is proportional to the likelihood surface. Theorem 2: Assume that the prior for β is improper and uniform, that is, π(β) ∝ 1; then all posterior moments of β exist.

Theorem 3: When the elements of β are assumed a priori independent and each π(β_i) ∝ exp(−|β_i − µ_i| / λ_i), a double-exponential with fixed µ_i and λ_i > 0, all posterior moments of β exist. Theorem 4: Assume that the prior for β is multivariate normal N(µ, Σ) with fixed µ and Σ; then all posterior moments of β exist. In particular, when the elements of β are assumed a priori independent and univariate normal, all posterior moments of β exist.

6. Empirical comparison. Buchinsky and Hahn (1998) performed Monte Carlo experiments to compare their estimator with the one proposed by Powell (1986) for Tobit quantile regression estimation. One of the models they used was y = max{0.75, y*}, with y* = 1 + x_1 + 0.5 x_2 + ε, where the regressors x_1 and x_2 were each drawn from a standard normal distribution and the error term has multiplicative heteroskedasticity obtained by taking ε = ξ v(x) with ξ ~ N(0, 25) and v(x) = 1 + 0.5(x_1 + x_1^2 + x_2 + x_2^2).
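A hedged R sketch of simulating one replicate from this design (the helper name sim_bh and the returned data frame layout are ours; the censoring point is taken as stated above):

```r
# Simulate one data set from the Buchinsky-Hahn design described above.
sim_bh <- function(n) {
  x1 <- rnorm(n)
  x2 <- rnorm(n)
  v  <- 1 + 0.5 * (x1 + x1^2 + x2 + x2^2)  # multiplicative heteroskedasticity
  eps <- rnorm(n, sd = 5) * v              # xi ~ N(0, 25), so sd(xi) = 5
  y_star <- 1 + x1 + 0.5 * x2 + eps
  y <- pmax(0.75, y_star)                  # censoring point as stated in the text
  data.frame(y = y, x1 = x1, x2 = x2)
}

dat <- sim_bh(100)                         # one replicate of the n = 100 case
```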

For estimating the median regression for this model, Table 1 summarizes the biases, root mean square errors (RMSE) and 95% credible intervals for β_0 and β_1 obtained from three approaches: the BH method (Buchinsky and Hahn, 1998), Powell's estimator (Powell, 1986), and the proposed Bayesian method with a uniform prior. The values relating to BH and Powell were also reported in Table 1 of Buchinsky and Hahn (1998). In particular, the BH method used log-likelihood cross-validated bandwidth selection for kernel estimation of the censoring probability, and the 95% confidence intervals for both the BH and Powell estimators are based on their asymptotic normality theory. The results from the Bayesian inference are based on a burn-in of 1000 iterations followed by 2000 sample values (see Figure).

Table 1. Bias, root mean square error (RMSE) and 95% intervals for the parameters β_0 and β_1 of the median regression. Samples were generated from the model considered by Buchinsky and Hahn (1998). Three approaches were used: the BH method (Buchinsky and Hahn, 1998), the Powell estimator (Powell, 1986) and the proposed Bayesian method with a uniform prior.

Size   Statistic   β_0: BH   Powell    Bayes     β_1: BH   Powell    Bayes
100    Bias           0.14    -0.08    -0.08        0.31     0.33     0.12
       RMSE           2.88     4.11     0.18        2.16     2.85     0.25
       2.5%          -4.49    -6.00     0.57       -3.13    -4.55     0.73
       97.5%          6.76     9.40     1.20        5.65     7.41     1.65
400    Bias           0.20     0.19    -0.04       -0.06    -0.45    -0.01
       RMSE           0.58     0.68     0.08        0.61     0.66     0.08
       2.5%          -0.85    -0.83     0.82       -0.82    -1.12     0.83
       97.5%          4.41     4.31     1.13        2.24     2.36     1.17
600    Bias           0.18     0.20    -0.01       -0.06    -0.47    -0.05
       RMSE           0.48     0.49     0.05        0.50     0.57     0.08
       2.5%          -1.33    -0.14     0.91       -0.37    -0.89     0.83
       97.5%          4.67     3.42     1.09        2.09     1.89     1.08

Clearly, the proposed Bayesian method outperformed the BH and Powell methods: it yielded considerably lower biases, lower root mean square errors and much narrower intervals. S-PLUS code to implement the method for this comparison is available.

7. Reference chart for Immunoglobulin-G. This data set gives the serum concentration (grams per litre) of immunoglobulin-G (IgG) in 298 children aged from 6 months to 6 years (Isaacs et al., 1983). The relationship of IgG with age is quite weak, with some visual evidence of positive skewness. We took the response variable Y to be the IgG concentration and used a quadratic model in age, x, to fit the quantile regression: q_p(y | x) = β_0(p) + β_1(p) x + β_2(p) x^2, for 0 < p < 1. The Figure shows a plot of the data along with the fitted quantile regression curves. Each point on the curves is the mean of the posterior predictive distribution. We could also obtain credible intervals around these curves using the MCMC samples of β(p).
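A sketch of obtaining posterior-mean curves at a few quantiles with the sampler from Section 4, assuming the IgG data sit in a data frame igg with columns age and conc (our naming; the Isaacs et al. data are not reproduced here):

```r
# Quadratic design matrix in age and posterior means of beta(p) at three quantiles.
X <- cbind(1, igg$age, igg$age^2)
fits <- sapply(c(0.05, 0.50, 0.95), function(p) {
  draws <- mh_quantreg(igg$conc, X, p, n_iter = 3000)
  colMeans(draws[-(1:1000), ])            # posterior means after 1000 burn-in draws
})
colnames(fits) <- c("p=0.05", "p=0.50", "p=0.95")
```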

8. Model choice: using marginal likelihood and Bayes factors. We consider the problem of comparing a collection of models {M_1, ..., M_L} that reflect competing hypotheses about the regression form. The issue of model choice can be dealt with by calculating Bayes factors. Under model M_k, suppose that the pth quantile model is given by y | M_k = x^(k)'β_{p(k)} + ε_{p(k)}; then the marginal likelihood arising from estimating β_{p(k)} is defined as m(y | M_k) = ∫ L(y | M_k, β_{p(k)}) π(β_{p(k)} | M_k) dβ_{p(k)}, which is the normalizing constant of the posterior density. The calculation of the marginal likelihood has attracted considerable interest in the recent MCMC literature. In particular, Chib (1995) and Chib and Jeliazkov (2001) have developed a simple approach for estimating the marginal likelihood using the output from the Gibbs sampler and the MH algorithm, respectively. Here, for any fixed point β*_{p(k)}, we have log m(y | M_k) = log L(y | M_k, β*_{p(k)}) + log π(β*_{p(k)} | M_k) − log π(β*_{p(k)} | y, M_k), from which the marginal likelihood can be estimated by finding an estimate of the posterior ordinate π(β*_{p(k)} | y, M_k).

We denote this estimate by π̂(β*_{p(k)} | y, M_k). For estimation efficiency, β*_{p(k)} is generally taken to be a point of high density in the support of the posterior. Substituting this estimate into log m(y | M_k), we get
log m̂(y | M_k) = n log(p(1 − p)) − Σ_i ρ_p(y_i − x_i^(k)'β*_{p(k)}) + log π(β*_{p(k)} | M_k) − log π̂(β*_{p(k)} | y, M_k),
in which the first term n log(p(1 − p)) is constant and the sum is over all data points. Once the posterior ordinate is estimated, we can estimate the Bayes factor of any two models M_k and M_l by B̂_kl = exp{ log m̂(y | M_k) − log m̂(y | M_l) }. For an improper prior, a simulation-consistent estimate of π̂(β*_{p(k)} | y, M_k) is given by
π̂(β* | y) = [ G^{-1} Σ_{g=1}^{G} α(β^(g), β*) q(β^(g), β*) ] / [ J^{-1} Σ_{j=1}^{J} α(β*, β^(j)) ],
in which α(β, β*) = min{ 1, [π(β*)/π(β)] L*(β, β*) },

and L*(β, β*) = exp{ −( Σ_i ρ_p(y_i − x_i^(k)'β*) − Σ_i ρ_p(y_i − x_i^(k)'β) ) }, where {β^(j)} are samples drawn from q(β*, ·) and {β^(g)} are samples drawn from the posterior distribution. For a proper prior,
π̂(β* | y) = ∫ α(β, β*) π(β | y) dβ / ∫ α*(β*, β) π(β) dβ,
in which α*(β*, β) = min{ 1/π(β), [1/π(β*)] L*(β*, β) }. This implies that a simulation-consistent estimate of the posterior ordinate is given by
π̂(β* | y) = [ G^{-1} Σ_{g=1}^{G} α(β^(g), β*) ] / [ J^{-1} Σ_{j=1}^{J} α*(β*, β^(j)) ],
where {β^(j)} are samples drawn from the proper prior distribution and {β^(g)} are samples drawn from the posterior distribution.
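As a small illustration, a hedged R sketch of turning an estimated posterior ordinate into log m̂(y | M_k) and a Bayes factor; the function names and the arguments log_prior_ord and log_post_ord (the log prior ordinate and estimated log posterior ordinate at β*) are our labels, not code from the paper:

```r
# Log marginal likelihood from the identity above, evaluated at beta_star,
# given the log prior and (estimated) log posterior ordinates at beta_star.
log_marglik <- function(y, X, beta_star, p, log_prior_ord, log_post_ord) {
  length(y) * log(p * (1 - p)) - sum(rho_p(y - X %*% beta_star, p)) +
    log_prior_ord - log_post_ord
}

# Bayes factor of model M_k against M_l from their estimated log marginal likelihoods.
bayes_factor <- function(log_m_k, log_m_l) exp(log_m_k - log_m_l)
```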

9. Inference with a scale parameter. One may be interested in introducing a scale parameter into the likelihood function L(y | β) for the proposed Bayesian inference. Suppose σ > 0 is the scale parameter; then
L(y | β, σ) = [p^n (1 − p)^n / σ^n] exp{ −Σ_{i=1}^{n} ρ_p( (y_i − x_i'β) / σ ) }.
The corresponding posterior distribution π(β, σ | y) can be written as π(β, σ | y) ∝ L(y | β, σ) π(β, σ), where π(β, σ) is the prior distribution of (β, σ) for a particular p. As what interests us is the regression parameter β, and σ is a nuisance parameter, we may integrate out σ and investigate the marginal posterior π(β | y) only. For example, we have considered a reference prior π(β, σ) ∝ 1/σ, which gives
π(β | y) ∝ ( Σ_{i=1}^{n} ρ_p(y_i − x_i'β) )^{−n}, or log π(β | y) = constant − n log Σ_{i=1}^{n} ρ_p(y_i − x_i'β).
Implementing the MCMC algorithm on this posterior form, we have found that the simulation results are more or less the same as those obtained using the posterior based on the likelihood (4).
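A minimal R sketch of this marginal posterior (up to an additive constant), reusing rho_p from Section 3; the function name is ours:

```r
# Log of the marginal posterior of beta with the scale integrated out under the
# reference prior pi(beta, sigma) propto 1/sigma (additive constant omitted).
log_marg_post <- function(beta, y, X, p) {
  -length(y) * log(sum(rho_p(y - X %*% beta, p)))
}
# This could stand in for loglik_ald in the random-walk sampler sketched in Section 4.
```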

References
Bilias, Y., Chen, S. and Ying, Z. (2000): Simple Resampling Methods for Censored Regression Quantiles, Journal of Econometrics, 99, 373–386.
Buchinsky, M. (1998): Recent Advances in Quantile Regression Models, Journal of Human Resources, 33, 88–126.
Chib, S. (1992): Bayes Inference in the Tobit Censored Regression Model, Journal of Econometrics, 51, 79–99.
Isaacs, D., Altman, D.G., Tidmarsh, C.E., Valman, H.B. and Webster, A.D.B. (1983): Serum Immunoglobulin Concentrations in Preschool Children Measured by Laser Nephelometry: Reference Ranges for IgG, IgA, IgM, Journal of Clinical Pathology, 36, 1193–1196.
Koenker, R. and Bassett, G.S. (1978): Regression Quantiles, Econometrica, 46, 33–50.
Hahn, J. (1995): Bootstrapping Quantile Regression Estimators, Econometric Theory, 11, 105–121.
Huang, H. (2001): Bayesian Analysis of the SUR Tobit Model, Applied Economics Letters, 8, 617–622.
Kottas, A. and Gelfand, A.E. (2001): Bayesian Semiparametric Median Regression Modeling, Journal of the American Statistical Association, 91, 689–698.

Powell, J. (1986a): Censored Regression Quantiles, Journal of Econometrics, 32, 143–155.
Powell, J. (1986b): Symmetrically Trimmed Least Squares Estimation for Tobit Models, Econometrica, 54, 1435–1460.
Richardson, S. (1999): Contribution to the Discussion of Walker et al., Bayesian Nonparametric Inference for Random Distributions and Related Functions, Journal of the Royal Statistical Society, Series B, 61, 485–527.
Walker, S.G., Damien, P., Laud, P.W. and Smith, A.F.M. (1999): Bayesian Nonparametric Inference for Random Distributions and Related Functions, Journal of the Royal Statistical Society, Series B, 61, 485–527.
Powell, J.L. (2001): Semiparametric Estimation of Censored Selection Models, in Hsiao, C., Morimune, K. and Powell, J.L. (eds.), Nonlinear Statistical Modelling, Cambridge University Press.
Yu, K. and Moyeed, R.A. (2001): Bayesian Quantile Regression, Statistics and Probability Letters, 54, 437–447.
Yu, K., Lu, Z. and Stander, J. (2003): Quantile Regression: Applications and Current Research Areas, The Statistician, 52, 331–350.