Practical Bayesian Quantile Regression. Keming Yu University of Plymouth, UK


Keming Yu, University of Plymouth, UK (kyu@plymouth.ac.uk). A brief summary of some recent work by Keming Yu, Rana Moyeed and Julian Stander.

Summary. We develop a Bayesian framework for quantile regression, including Tobit quantile regression. We discuss the selection of the likelihood, and families of prior distributions on the quantile regression parameter vector that lead to proper posterior distributions with finite moments. We show how the posterior distribution can be sampled and summarized by Markov chain Monte Carlo (MCMC) methods. A method for quantile regression model choice is also developed. In an empirical comparison, our approach outperformed some common classical estimators.

1. Background. Linear regression models: Aim: estimate E(Y | X); Method: least-squares minimization; Suitability: Gaussian errors, symmetric distributions; Weaknesses: skewed conditional distributions, outliers, tail behaviour.

Quantile regression models (Koenker and Bassett, 1978): Aim: estimate the pth quantile of Y given X (0 < p < 1), and explore a complete relationship between Y and X; Special case: median regression (p = 0.5); Method: check function minimization; Check function: ρ_p(u) = u(p − I(u < 0)); Applications: reference charts in medicine (Cole and Green, 1992), survival analysis (Koenker and Geling, 2001), Value at Risk (Bassett and Chen, 2001), labour economics (Buchinsky, 1995), flood return periods (Yu et al., 2003).
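As a small illustration, the check function in R (the name rho_p is our choice, not from the paper's released code):

```r
# Check function rho_p(u) = u * (p - I(u < 0)); a minimal sketch
rho_p <- function(u, p) u * (p - (u < 0))

rho_p(c(-2, 1, 3), p = 0.5)   # 1.0 0.5 1.5: half the absolute errors at the median
```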

2. Basic setting. Consider the standard linear model y_i = µ(x_i) + ε_i, where typically µ(x_i) = x_i'β for a vector of coefficients β. The pth (0 < p < 1) quantile of ε_i is the value q_p for which P(ε_i < q_p) = p. The pth conditional quantile of y_i given x_i is then q_p(y_i | x_i) = x_i'β(p). (1)

To make inference about the parameter β(p), given p and the observations on (X, Y), we use the posterior distribution of β(p), π(β | y), given by π(β | y) ∝ L(y | β) π(β), (2) where π(β) is the prior distribution of β and L(y | β) is the likelihood function.

3. Selecting the likelihood function. There are several ways to select L(y | β) in the set-up above. One is to model the error distribution with a mixture distribution together with a Dirichlet process or Pólya tree prior (Walker et al., 1999; Kottas and Gelfand, 2001). However, as Richardson (1999) and others commented, this introduces extra parameters into the inference, complicates the associated computations, and makes the choice of the partition of the prior space difficult. Another option is to use a substitute likelihood (Dunson et al., 2003, Biometrics), but this is not a proper likelihood.

A simple and natural likelihood is based on the asymmetric Laplace distribution with probability density, for 0 < p < 1, f_p(u) = p(1 − p) exp{−ρ_p(u)}. (3) Link: minimization of the check function loss is exactly equivalent to maximization of the likelihood function below. Likelihood function: L(y | β) = p^n (1 − p)^n exp{ −Σ_i ρ_p(y_i − x_i'β) }. (4) Feature: apart from the parameter β, there are no extra parameters to infer, so we only need to set priors for β.
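A minimal R sketch of the log of likelihood (4); the helper name loglik_ald is an illustrative choice, not the authors' released code:

```r
# Log of the working likelihood (4) under the asymmetric Laplace error model.
# X is the n x k design matrix (including an intercept column), y the response,
# p the quantile of interest; rho_p is the check function sketched earlier.
rho_p <- function(u, p) u * (p - (u < 0))

loglik_ald <- function(beta, y, X, p) {
  u <- as.vector(y - X %*% beta)
  length(y) * log(p * (1 - p)) - sum(rho_p(u, p))
}
```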

Once we have a prior for β(p) and a data set, we can make posterior inference for the parameters. How? Since there is no conjugate prior, we use MCMC techniques to sample from the posterior. Why Bayes? Classical methods rely on asymptotics, requiring either a large sample or the bootstrap. In contrast, MCMC sampling enables us to make exact inference for any sample size without resorting to asymptotic approximations.

For example, the asymptotic covariance matrices of the classical estimators for these quantile regression models depend on the error density of ε and are therefore difficult to estimate reliably. In the Bayesian framework, variance estimates, as well as any other posterior summaries, come out as by-products of the MCMC sampler and are therefore trivial to obtain once samples from the posterior distribution are available. Moreover, the Bayesian paradigm enables us to incorporate prior information in a natural way, whereas the frequentist paradigm does not, and it takes the uncertainty of the parameters fully into account.

4. Bayesian posterior computation via MCMC. An MCMC scheme constructs a Markov chain whose equilibrium distribution is the posterior π(β | y). After running the Markov chain for a burn-in period so that it can reach equilibrium, one obtains samples from π(β | y). One popular method for constructing such a chain is the Metropolis-Hastings (MH) algorithm, in which a candidate is generated from an auxiliary distribution and then accepted or rejected with some probability. The candidate-generating distribution q(β* | β_c) can depend on the current state β_c of the Markov chain.

A candidate β* is accepted with a certain acceptance probability α(β*, β_c), which also depends on the current state β_c and is given by
α(β*, β_c) = min{ [π(β*) L(y | β*) q(β_c | β*)] / [π(β_c) L(y | β_c) q(β* | β_c)], 1 }.
If, for example, a simple random walk is used to generate β* from β_c, then the ratio q(β_c | β*)/q(β* | β_c) = 1. In general the steps of the MH algorithm are:
Step 0: Start with an arbitrary value β^(0).
For n from 1 to N:
Step n: Generate β* from q(· | β_c) and u from U(0, 1);
if u ≤ α(β*, β_c), set β^(n) = β* (acceptance);
if u > α(β*, β_c), set β^(n) = β^(n−1) (rejection).
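A minimal R sketch of this random-walk MH scheme for posterior (2) under a flat prior, so the prior and proposal ratios cancel; it reuses loglik_ald from the sketch above, and the function name, proposal scale and chain length are illustrative choices, not the authors' released S-PLUS/R code:

```r
# Random-walk Metropolis sketch for the posterior (2) with pi(beta) propto 1,
# so the acceptance probability reduces to a likelihood ratio.
mh_quantreg <- function(y, X, p, n_iter = 3000, step = 0.05) {
  k <- ncol(X)
  draws <- matrix(NA_real_, n_iter, k)
  beta <- rep(0, k)                      # arbitrary starting value beta^(0)
  ll <- loglik_ald(beta, y, X, p)
  for (n in 1:n_iter) {
    cand <- beta + rnorm(k, sd = step)   # symmetric proposal, so q-ratio = 1
    ll_cand <- loglik_ald(cand, y, X, p)
    if (log(runif(1)) < ll_cand - ll) {  # accept with probability alpha
      beta <- cand
      ll <- ll_cand
    }
    draws[n, ] <- beta                   # rejection keeps the current state
  }
  draws
}
```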

As mentioned above after running this procedure for a certain burn-in period, the samples obtained may be through as coming from the posterior distribution. Hence, we can estimate posterior moments, standard deviations and credible intervals from this posterior sample. We found that convergence was very rapid. Remark: We may use the set-up for Tobit quantile regression: suppose that y and y are random variables connected by the censoring relationship y = max { y 0, y }, where y 0 is a known censoring point. In this case, we have found it simplifies the algorithm to assume zero to be the fixed censoring points by a simple transformation of any non-zero censoring points.
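A tiny R illustration of that transformation (all objects below are hypothetical, for illustration only):

```r
# Reducing a known non-zero censoring point to zero:
# if y = max(y0, y*), then y - y0 = max(0, y* - y0).
y0     <- 0.75                            # hypothetical known censoring point
y_star <- rnorm(5, mean = 1)              # hypothetical latent responses
y      <- pmax(y0, y_star)                # observed (censored) responses
all.equal(y - y0, pmax(0, y_star - y0))   # TRUE: censoring shifted to zero
```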

S-PLUS or R code to implement the algorithms is available (free).

5. Some theoretical results, including prior selection. As Richardson (1999) and others have noted, popular forms of priors tend to be those whose parameters can be set straightforwardly and which lead to posteriors of a relatively immediate form. Although a standard conjugate prior distribution is not available for the quantile regression formulation, MCMC methods may be used to draw samples from the posterior distribution. This, in principle, allows us to use virtually any prior distribution. However, we should select priors that yield proper posteriors: choose the prior π(β) from a class of known distributions for which the posterior is proper.

First, the posterior is proper if and only if 0 < ∫_{R^{T+1}} π(β | y) dβ < ∞, (5) or, equivalently, if and only if 0 < ∫_{R^{T+1}} L(y | β) π(β) dβ < ∞. Moreover, we require that all posterior moments exist, that is, E[ ∏_{j=0}^{T} |β_j|^{r_j} | y ] < ∞, (6) where (r_0, ..., r_T) denotes the orders of the moments of β = (β_0, ..., β_T). We now establish a bound for the integral ∫_{R^{T+1}} ∏_{j=0}^{T} |β_j|^{r_j} L(y | β) π(β) dβ that allows us to obtain proper posterior moments.

Theorem 1 (basic lemma). Let g(t) = exp(−|t|), and let h_1(p) = min{p, 1 − p} and h_2(p) = max{p, 1 − p}. Then all posterior moments exist if and only if ∫_{R^{T+1}} ∏_{j=0}^{T} |β_j|^{r_j} ∏_{i=1}^{n} g{ h_k(p)(y_i − x_i'β) } π(β) dβ is finite for both k = 1 and k = 2.

Theorem 2 establishes that in the absence of any realistic prior information we may legitimately use an improper uniform prior distribution for all the components of β. This choice may be appealing as the resulting posterior distribution is proportional to the likelihood surface. Theorem 2: Assume that the prior for β is improper and uniform, that is, π(β) ∝ 1; then all posterior moments of β exist.

Theorem 3: When the elements of β are assumed a priori independent and each π(β_i) ∝ exp(−|β_i − µ_i| / λ_i), a double-exponential with fixed µ_i and λ_i > 0, all posterior moments of β exist. Theorem 4: Assume that the prior for β is multivariate normal N(µ, Σ) with fixed µ and Σ; then all posterior moments of β exist. In particular, when the elements of β are assumed a priori independent and univariate normal, all posterior moments of β exist.

6. Empirical comparison. Buchinsky and Hahn (1998) performed Monte Carlo experiments to compare their estimator with the one proposed by Powell (1986) for Tobit quantile regression estimation. One of the models they used was y = max{0.75, y*}, with y* = 1 + x_1 + 0.5 x_2 + ε, where the regressors x_1 and x_2 were each drawn from a standard normal distribution and the error term has multiplicative heteroskedasticity obtained by taking ε = ξ v(x) with ξ ~ N(0, 25) and v(x) = 1 + 0.5(x_1 + x_1^2 + x_2 + x_2^2).
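A hedged R sketch of simulating one replicate from this design (the helper name sim_bh and the returned data frame layout are ours; the censoring point is taken as stated above):

```r
# Simulate one data set from the Buchinsky-Hahn design described above.
sim_bh <- function(n) {
  x1 <- rnorm(n)
  x2 <- rnorm(n)
  v  <- 1 + 0.5 * (x1 + x1^2 + x2 + x2^2)  # multiplicative heteroskedasticity
  eps <- rnorm(n, sd = 5) * v              # xi ~ N(0, 25), so sd(xi) = 5
  y_star <- 1 + x1 + 0.5 * x2 + eps
  y <- pmax(0.75, y_star)                  # censoring point as stated in the text
  data.frame(y = y, x1 = x1, x2 = x2)
}

dat <- sim_bh(100)                         # one replicate of the n = 100 case
```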

For estimating the median regression for this model, Table 1 summarizes the biases, root mean square errors (RMSE) and 95% credible intervals for β_0 and β_1 obtained from three approaches: the BH method (Buchinsky and Hahn, 1998), Powell's estimator (Powell, 1986), and the proposed Bayesian method with a uniform prior. The values relating to BH and Powell were also reported in Table 1 of Buchinsky and Hahn (1998). In particular, the BH method used log-likelihood cross-validated bandwidth selection for kernel estimation of the censoring probability, and the 95% confidence intervals for both the BH and Powell estimators are based on their asymptotic normality theory. The results from the Bayesian inference are based on a burn-in of 1000 iterations followed by 2000 sample values (see Figure).

Table 1. Bias, root mean square error (RMSE) and 95% intervals for the parameters β_0 and β_1 of the median regression. Samples were generated from the model considered by Buchinsky and Hahn (1998). Three approaches were used: the BH method (Buchinsky and Hahn, 1998), the Powell estimator (Powell, 1986) and the proposed Bayesian method with a uniform prior.

Size   Statistic   β_0: BH   Powell    Bayes     β_1: BH   Powell    Bayes
100    Bias           0.14    -0.08    -0.08        0.31     0.33     0.12
       RMSE           2.88     4.11     0.18        2.16     2.85     0.25
       2.5%          -4.49    -6.00     0.57       -3.13    -4.55     0.73
       97.5%          6.76     9.40     1.20        5.65     7.41     1.65
400    Bias           0.20     0.19    -0.04       -0.06    -0.45    -0.01
       RMSE           0.58     0.68     0.08        0.61     0.66     0.08
       2.5%          -0.85    -0.83     0.82       -0.82    -1.12     0.83
       97.5%          4.41     4.31     1.13        2.24     2.36     1.17
600    Bias           0.18     0.20    -0.01       -0.06    -0.47    -0.05
       RMSE           0.48     0.49     0.05        0.50     0.57     0.08
       2.5%          -1.33    -0.14     0.91       -0.37    -0.89     0.83
       97.5%          4.67     3.42     1.09        2.09     1.89     1.08

Clearly, the proposed Bayesian method outperformed the BH and Powell methods: it yielded considerably lower biases, lower root mean square errors and much narrower intervals. S-PLUS code to implement the method for this comparison is available.

7. Reference chart for Immunoglobulin-G. This data set gives the serum concentration (grams per litre) of immunoglobulin-G (IgG) in 298 children aged from 6 months to 6 years (Isaacs et al., 1983). The relationship of IgG with age is quite weak, with some visual evidence of positive skewness. We took the response variable Y to be the IgG concentration and used a quadratic model in age, x, to fit the quantile regression: q_p(y | x) = β_0(p) + β_1(p) x + β_2(p) x^2, for 0 < p < 1. The Figure shows a plot of the data along with the fitted quantile regression curves. Each point on the curves is the mean of the posterior predictive distribution. We could also obtain credible intervals around these curves using the MCMC samples of β(p).
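A sketch of obtaining posterior-mean curves at a few quantiles with the sampler from Section 4, assuming the IgG data sit in a data frame igg with columns age and conc (our naming; the Isaacs et al. data are not reproduced here):

```r
# Quadratic design matrix in age and posterior means of beta(p) at three quantiles.
X <- cbind(1, igg$age, igg$age^2)
fits <- sapply(c(0.05, 0.50, 0.95), function(p) {
  draws <- mh_quantreg(igg$conc, X, p, n_iter = 3000)
  colMeans(draws[-(1:1000), ])            # posterior means after 1000 burn-in draws
})
colnames(fits) <- c("p=0.05", "p=0.50", "p=0.95")
```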

8. Model choice: using marginal likelihood and Bayes factors. We consider the problem of comparing a collection of models {M_1, ..., M_L} that reflect competing hypotheses about the regression form. The issue of model choice can be dealt with by calculating Bayes factors. Under model M_k, suppose that the pth quantile model is given by y | M_k = x^(k)'β_{p(k)} + ε_{p(k)}; then the marginal likelihood arising from estimating β_{p(k)} is defined as m(y | M_k) = ∫ L(y | M_k, β_{p(k)}) π(β_{p(k)} | M_k) dβ_{p(k)}, which is the normalizing constant of the posterior density. The calculation of the marginal likelihood has attracted considerable interest in the recent MCMC literature. In particular, Chib (1995) and Chib and Jeliazkov (2001) have developed a simple approach for estimating the marginal likelihood using the output from the Gibbs sampler and the MH algorithm, respectively. Here, for any fixed point β*_{p(k)}, we have log m(y | M_k) = log L(y | M_k, β*_{p(k)}) + log π(β*_{p(k)} | M_k) − log π(β*_{p(k)} | y, M_k), from which the marginal likelihood can be estimated by finding an estimate of the posterior ordinate π(β*_{p(k)} | y, M_k).

We denote this estimate by π̂(β*_{p(k)} | y, M_k). For estimation efficiency, β*_{p(k)} is generally taken to be a point of high density in the support of the posterior. Substituting this estimate into log m(y | M_k), we get
log m̂(y | M_k) = n log(p(1 − p)) − Σ_i ρ_p(y_i − x_i^(k)'β*_{p(k)}) + log π(β*_{p(k)} | M_k) − log π̂(β*_{p(k)} | y, M_k),
in which the first term n log(p(1 − p)) is constant and the sum is over all data points. Once the posterior ordinate is estimated, we can estimate the Bayes factor of any two models M_k and M_l by B̂_kl = exp{ log m̂(y | M_k) − log m̂(y | M_l) }. For an improper prior, a simulation-consistent estimate of π̂(β*_{p(k)} | y, M_k) is given by
π̂(β* | y) = [ G^{-1} Σ_{g=1}^{G} α(β^(g), β*) q(β^(g), β*) ] / [ J^{-1} Σ_{j=1}^{J} α(β*, β^(j)) ],
in which α(β, β*) = min{ 1, [π(β*)/π(β)] L*(β, β*) },

and L*(β, β*) = exp{ −( Σ_i ρ_p(y_i − x_i^(k)'β*) − Σ_i ρ_p(y_i − x_i^(k)'β) ) }, where {β^(j)} are samples drawn from q(β*, ·) and {β^(g)} are samples drawn from the posterior distribution. For a proper prior,
π̂(β* | y) = ∫ α(β, β*) π(β | y) dβ / ∫ α*(β*, β) π(β) dβ,
in which α*(β*, β) = min{ 1/π(β), [1/π(β*)] L*(β*, β) }. This implies that a simulation-consistent estimate of the posterior ordinate is given by
π̂(β* | y) = [ G^{-1} Σ_{g=1}^{G} α(β^(g), β*) ] / [ J^{-1} Σ_{j=1}^{J} α*(β*, β^(j)) ],
where {β^(j)} are samples drawn from the proper prior distribution and {β^(g)} are samples drawn from the posterior distribution.
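As a small illustration, a hedged R sketch of turning an estimated posterior ordinate into log m̂(y | M_k) and a Bayes factor; the function names and the arguments log_prior_ord and log_post_ord (the log prior ordinate and estimated log posterior ordinate at β*) are our labels, not code from the paper:

```r
# Log marginal likelihood from the identity above, evaluated at beta_star,
# given the log prior and (estimated) log posterior ordinates at beta_star.
log_marglik <- function(y, X, beta_star, p, log_prior_ord, log_post_ord) {
  length(y) * log(p * (1 - p)) - sum(rho_p(y - X %*% beta_star, p)) +
    log_prior_ord - log_post_ord
}

# Bayes factor of model M_k against M_l from their estimated log marginal likelihoods.
bayes_factor <- function(log_m_k, log_m_l) exp(log_m_k - log_m_l)
```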

9. Inference with a scale parameter. One may be interested in introducing a scale parameter into the likelihood function L(y | β) for the proposed Bayesian inference. Suppose σ > 0 is the scale parameter; then
L(y | β, σ) = [p^n (1 − p)^n / σ^n] exp{ −Σ_{i=1}^{n} ρ_p( (y_i − x_i'β) / σ ) }.
The corresponding posterior distribution π(β, σ | y) can be written as π(β, σ | y) ∝ L(y | β, σ) π(β, σ), where π(β, σ) is the prior distribution of (β, σ) for a particular p. As what interests us is the regression parameter β, and σ is a nuisance parameter, we may integrate out σ and investigate the marginal posterior π(β | y) only. For example, we have considered a reference prior π(β, σ) ∝ 1/σ, which gives
π(β | y) ∝ ( Σ_{i=1}^{n} ρ_p(y_i − x_i'β) )^{−n}, or log π(β | y) = constant − n log Σ_{i=1}^{n} ρ_p(y_i − x_i'β).
Implementing the MCMC algorithm on this posterior form, we have found that the simulation results are more or less the same as those obtained using the posterior based on the likelihood (4).
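A minimal R sketch of this marginal posterior (up to an additive constant), reusing rho_p from Section 3; the function name is ours:

```r
# Log of the marginal posterior of beta with the scale integrated out under the
# reference prior pi(beta, sigma) propto 1/sigma (additive constant omitted).
log_marg_post <- function(beta, y, X, p) {
  -length(y) * log(sum(rho_p(y - X %*% beta, p)))
}
# This could stand in for loglik_ald in the random-walk sampler sketched in Section 4.
```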

References
Bilias, Y., Chen, S. and Ying, Z. (2000): Simple Resampling Methods for Censored Regression Quantiles, Journal of Econometrics, 99, 373–386.
Buchinsky, M. (1998): Recent Advances in Quantile Regression Models, Journal of Human Resources, 33, 88–126.
Chib, S. (1992): Bayes Inference in the Tobit Censored Regression Model, Journal of Econometrics, 51, 79–99.
Isaacs, D., Altman, D.G., Tidmarsh, C.E., Valman, H.B. and Webster, A.D.B. (1983): Serum Immunoglobulin Concentrations in Preschool Children Measured by Laser Nephelometry: Reference Ranges for IgG, IgA, IgM, Journal of Clinical Pathology, 36, 1193–1196.
Koenker, R. and Bassett, G.S. (1978): Regression Quantiles, Econometrica, 46, 33–50.
Hahn, J. (1995): Bootstrapping Quantile Regression Estimators, Econometric Theory, 11, 105–121.
Huang, H. (2001): Bayesian Analysis of the SUR Tobit Model, Applied Economics Letters, 8, 617–622.
Kottas, A. and Gelfand, A.E. (2001): Bayesian Semiparametric Median Regression Modeling, Journal of the American Statistical Association, 91, 689–698.

Powell, J. (1986a): Censored Regression Quantiles, Journal of Econometrics, 32, 143–155.
Powell, J. (1986b): Symmetrically Trimmed Least Squares Estimation for Tobit Models, Econometrica, 54, 1435–1460.
Richardson, S. (1999): Contribution to the Discussion of Walker et al., Bayesian Nonparametric Inference for Random Distributions and Related Functions, Journal of the Royal Statistical Society, Series B, 61, 485–527.
Walker, S.G., Damien, P., Laud, P.W. and Smith, A.F.M. (1999): Bayesian Nonparametric Inference for Random Distributions and Related Functions, Journal of the Royal Statistical Society, Series B, 61, 485–527.
Powell, J.L. (2001): Semiparametric Estimation of Censored Selection Models, in Hsiao, C., Morimune, K. and Powell, J.L. (eds.), Nonlinear Statistical Modelling, Cambridge University Press.
Yu, K. and Moyeed, R.A. (2001): Bayesian Quantile Regression, Statistics and Probability Letters, 54, 437–447.
Yu, K., Lu, Z. and Stander, J. (2003): Quantile Regression: Applications and Current Research Areas, The Statistician, 52, 331–350.