Bayesian Regression with Heteroscedastic Error Density and Parametric Mean Function


Justinas Pelenis
Institute for Advanced Studies, Vienna

May 8, 2013

Abstract

This paper considers Bayesian estimation of restricted conditional moment models, with linear regression as a particular example. A common practice in the Bayesian literature on linear regression and other semi-parametric models is to use flexible families of distributions for the errors and to assume that the errors are independent of the covariates. However, a model with flexible covariate-dependent error distributions should be preferred for the following reason: assuming that the error distribution is independent of the predictors might lead to inconsistent estimation of the parameters of interest when the errors and covariates are dependent. To address this issue, we develop a Bayesian regression model with a parametric mean function defined by a conditional moment condition and flexible predictor-dependent error densities. Sufficient conditions to achieve posterior consistency of the regression parameters and conditional error densities are provided. In experiments, the proposed method compares favorably with classical and alternative Bayesian estimation methods for the estimation of the regression coefficients and conditional densities.

Keywords: Bayesian semi-parametrics, Bayesian conditional density estimation, heteroscedastic linear regression.

Previously circulated under the title "Bayesian Semiparametric Regression." I am very thankful to Andriy Norets, Bo Honore, Jia Li, Ulrich Müller and Chris Sims, as well as seminar participants at Princeton, Royal Holloway, the Institute for Advanced Studies (Vienna), Wirtschaftsuniversität Wien, the European Seminar on Bayesian Econometrics, the Seminar on Bayesian Inference in Econometrics and Statistics (SBIES) and the Cowles summer conferences, for helpful discussions and multiple suggestions that greatly contributed to improving the quality of the manuscript. All remaining errors are mine.

1 Introduction

Estimation of regression coefficients in linear regression models can be consistent but inefficient if heteroscedasticity is ignored. Furthermore, the regression curve only provides a summary of the mean effects and does not provide any information about the conditional error distributions, which might be of interest to the decision maker. Estimation of conditional error distributions is useful in settings where forecasting and out-of-sample prediction are the objects of interest. In this paper we propose a novel Bayesian method for consistent estimation of both linear regression coefficients and conditional residual distributions when the data generating process satisfies a linear conditional moment restriction $E[y \mid x] = x'\beta$, or the more general restricted conditional moment condition $E[y \mid x] = h(x, \theta)$ for some known function $h$. The contribution of this proposal is that the model is correctly specified for a large class of true data generating processes without imposing specific restrictions on the conditional error distributions. Hence consistent and efficient estimation of the parameters of interest might be expected.

The most widely used method to estimate the mean of a continuous response variable as a function of predictors is, without doubt, the linear regression model. Often the models considered impose the assumptions of constant variance and/or symmetric and unimodal error distributions. Such restrictions are often inappropriate for real-life datasets in which conditional variability, skewness and asymmetry are present. The prediction intervals obtained using models with constant variance and/or symmetric error distributions are likely to be inferior to the prediction intervals obtained from models with predictor-dependent residual densities. To achieve full inference on the parameters of interest and the conditional error densities, we propose a semi-parametric Bayesian model with a parametric mean function and a non-parametric prior for covariate-dependent error densities, which simultaneously estimates both the regression coefficients and the predictor-dependent error densities. A Bayesian approach might be more effective in small samples as it enables

exact inference given the observed data instead of relying on asymptotic approximations.

There is a substantial semi-parametric Bayesian literature that focuses on non-parametric priors for the mean function and parametric priors on the error distributions. An alternative common strand of the literature places either parametric or non-parametric priors on the mean function and explicitly non-parametric priors on the error distributions, and it is this choice of the priors on the error distributions that motivates the discussion in this manuscript. The common assumption is that the errors are generated independently of the regressors $x$ and usually satisfy either a median or a quantile restriction. Estimation and consistency of such models are discussed in Kottas and Gelfand (2001), Hirano (2002), Amewou-Atisso et al. (2003), Conley et al. (2008) and Wu and Ghosal (2008), among others. However, estimation of the parameters and error densities might be inconsistent if the errors and covariates are dependent. For example, under heteroscedasticity or conditional asymmetry of the error distributions, the pseudo-true values of the regression coefficients in a linear model with errors generated by covariate-independent mixtures of normals are not generally equal to the true parameter values. One of the contributions of this paper is to show that the model proposed in this manuscript, which incorporates predictor-dependent residual densities, is flexible and leads to consistent estimation of both the parameters of interest $\theta$ and the conditional error densities. Other Bayesian proposals that incorporate predictor-dependent residual density modeling into parametric models are by Pati and Dunson (forthcoming), where the residual density is restricted to be symmetric; by Kottas and Krnjajic (2009) for quantile regression, but without accompanying consistency theorems; and by Leslie et al. (2007), who accommodate heteroscedasticity by multiplying the error term by a predictor-dependent factor. However, none of these papers address the issue of conditional error asymmetry, and the estimation of regression coefficients by these methods might be inconsistent in the presence of residual asymmetry, as the proposed models are misspecified.

Flexible models with covariate-dependent error densities might lead to a more efficient estimator of the regression coefficients. For a linear regression problem, often only the regression coefficient $\beta$ is of interest. It is a well known fact that if the conditional moment restriction holds, then the weighted least squares estimator is more efficient than the ordinary least squares estimator under heteroscedasticity. It is known that in parametric models, by assertion of Le Cam's parametric Bernstein-von Mises theorem, the posterior behaves as if one had observed a normally distributed maximum likelihood estimator with variance equal to the inverse of the Fisher information; see van der Vaart (1998). Semiparametric versions of the Bernstein-von Mises theorem have been obtained by Shen (2002) and Kleijn and Bickel (2012), but the conditions are hard to verify. Nonetheless there is an expectation that the posterior distribution of $\beta$ is normal and centered at the true value in correctly specified semi-parametric models if the priors are carefully chosen. Since the most popular frequentist approach of using OLS with a heteroscedasticity-robust covariance matrix (White (1982)) is suboptimal in a linear regression model with a conditional moment restriction, one should expect to achieve a more efficient estimator by estimating the correctly specified model proposed here. Simulation results presented in Section 4 support the hypothesis that the proposed model gives a more efficient estimator of the regression coefficients under heteroscedasticity.

The defining feature of the proposed model is that we impose a zero mean restriction on the conditional error densities at every predictor value. Imposing this conditional restriction on the error distributions can be expected to be beneficial, as a more efficient estimation of the parameters of interest might be expected. We model the residual distributions flexibly as finite or infinite mixtures of a base kernel. The base kernel for the residual density is a mixture of two normal distributions with a joint mean of 0. The probability weights in both the finite and infinite mixtures are predictor dependent and vary smoothly with changes in predictor values. We consider a finite smoothly mixing

regression model similar to the ones considered by Geweke and Keane (2007) and Norets (2010) and show that estimation would be consistent if the number of mixture components is allowed to increase. In such models, an appropriate number of mixture components needs to be selected, which presents an additional complication. To avoid such complications, an alternative is to estimate a fully nonparametric model (i.e. an infinite mixture). We consider the kernel stick-breaking process as a fully non-parametric approach to inference in a restricted moment model defined by a conditional moment restriction. This flexible approach leads to consistent estimation of both the parameters of interest and the conditional error densities. Another contribution of this paper is to provide weak posterior consistency theorems for conditional density estimation in a Bayesian framework for a large class of true data generating processes using the kernel stick-breaking process (KSBP) with an exponential kernel proposed by Dunson and Park (2008).

There are two alternative approaches to conditional density estimation in the Bayesian literature. The first general approach is to use dependent Dirichlet processes (MacEachern (1999), De Iorio et al. (2004), Griffin and Steel (2006) and others) to model the conditional density directly. The second approach is to model the joint unconditional distribution (Müller et al. (1996), Norets and Pelenis (2012) and others) and extract the conditional densities of interest from the joint distribution of observables. Even though many varying approaches for direct modeling of conditional distributions have been considered, consistency properties have been largely unstudied, and only the recent studies of Tokdar et al. (2010), Norets and Pelenis (forthcoming) and Pati et al. (2013) address this question using different setups. We provide a set of sufficient conditions to ensure weak posterior consistency of the conditional residual densities using the KSBP with an exponential kernel and mixtures of Gaussian distributions, and indirectly achieve posterior consistency of the parametric part.

The outline of the paper is as follows: Section 2 introduces the finite and infinite models for estimation of a semi-parametric linear regression with a conditional moment

constraint. Section 3 provides theoretical results regarding the posterior consistency of both the parametric and nonparametric components of the model. Section 4 contains small sample simulation results where we show that the proposed method performs well for both conditional density and parameter estimation and does not suffer from the possible misspecification biases in the way that other approaches do. The proofs and details of posterior computation are contained in the Appendix.

2 Restricted Moment Model

The data consist of $N$ observations $(Y^N, X^N) = \{(y_1, x_1), (y_2, x_2), \ldots, (y_N, x_N)\}$, where $y_i \in \mathcal{Y} \subseteq \mathbb{R}$ is a response variable and $x_i \in \mathcal{X} \subseteq \mathbb{R}^d$ are the covariates. The observations are independently and identically distributed, $(y_i, x_i) \sim F_0$, under the assumption that the data generating process (DGP) satisfies $E_{F_0}[y \mid x] = h(x, \theta_0)$ for all $x \in \mathcal{X}$ for some known function $h : \mathcal{X} \times \Theta \to \mathcal{Y}$. Alternatively, the restricted moment model can be written as

$$y_i = h(x_i, \theta_0) + \epsilon_i, \quad (y_i, x_i) \sim F_0, \quad i = 1, \ldots, n,$$

with $E_{F_0}[\epsilon \mid x] = 0$ for all $x \in \mathcal{X}$. The unknown parameters of this semi-parametric model are $(\theta, f_{\epsilon|x})$, where $\theta$ is the finite dimensional parameter of interest and $f_{\epsilon|x}$ is the infinite dimensional parameter. Let $\Xi = \mathcal{F}_{\epsilon|x} \times \Theta$ be the parameter space, where $\Theta$ denotes the space of $\theta$ and $\mathcal{F}_{\epsilon|x}$ the space of conditional densities with mean zero. That is, $\theta \in \Theta \subseteq \mathbb{R}^p$ and

$$\mathcal{F}_{\epsilon|x} = \left\{ f_{\epsilon|x} : \mathbb{R} \times \mathcal{X} \to [0, \infty) : \int_{\mathbb{R}} f_{\epsilon|x}(\epsilon, x)\, d\epsilon = 1, \ \int_{\mathbb{R}} \epsilon f_{\epsilon|x}(\epsilon, x)\, d\epsilon = 0 \ \ \forall x \in \mathcal{X} \right\}.$$

The primary objective is to construct a model to consistently estimate the parameter of interest $\theta_0$, while consistent estimation of the conditional error densities $f_{0,\epsilon|x}$ is of

secondary interest. This joint objective is achieved by proposing a flexible covariate-dependent model for the residual densities that allows the residual density to vary with the predictors $x \in \mathcal{X}$. We will show that the proposed model is correctly specified under weak restrictions on $\mathcal{F}_{\epsilon|x}$ and leads to consistent estimation of both $\theta_0$ and the conditional error densities. Furthermore, the simulation results in Section 4 show that this flexible approach might help achieve a more efficient estimate of the parameter of interest $\theta_0$.

2.1 Finite Smoothly Mixing Regression

First, we define a density $f_2(\cdot)$ which is a mixture of two normal distributions with a joint mean of zero. Given the parameters $\{\pi, \mu, \sigma_1, \sigma_2\}$, the density $f_2$ is defined as

$$f_2(\epsilon; \pi, \mu, \sigma_1, \sigma_2) = \pi\, \phi(\epsilon; \mu, \sigma_1^2) + (1 - \pi)\, \phi\!\left(\epsilon; -\frac{\mu \pi}{1 - \pi}, \sigma_2^2\right),$$

where $\phi(\epsilon; \mu, \sigma^2)$ is the normal density with mean $\mu$ and variance $\sigma^2$, evaluated at $\epsilon$. Note that by construction a random variable $\epsilon$ with probability density function $f_2$ has an expected value of 0, as desired. In Section 3 we show that any density belonging to a large class of densities with mean 0 can be approximated by a countable collection of mixtures of $f_2$.
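To make the base kernel concrete, here is a minimal numerical sketch (not part of the paper; the parameter values are illustrative) of $f_2$ that checks the zero-mean construction: the second component's mean $-\mu\pi/(1-\pi)$ exactly offsets the first, so the mixture integrates to one and has mean zero.

```python
# Sketch of the base kernel f_2: a two-component normal mixture whose second
# mean is chosen so that the mixture mean is exactly zero.
import numpy as np
from scipy.stats import norm

def f2_pdf(eps, pi, mu, sigma1, sigma2):
    """f_2(eps; pi, mu, sigma1, sigma2); pi*mu + (1 - pi)*mu2 = 0 by construction."""
    mu2 = -pi * mu / (1.0 - pi)
    return pi * norm.pdf(eps, loc=mu, scale=sigma1) + (1 - pi) * norm.pdf(eps, loc=mu2, scale=sigma2)

# Numerical check: the kernel integrates to one and has mean zero.
grid = np.linspace(-30.0, 30.0, 200001)
dens = f2_pdf(grid, pi=0.3, mu=1.2, sigma1=0.8, sigma2=1.5)
print(np.trapz(dens, grid))         # ~ 1.0
print(np.trapz(grid * dens, grid))  # ~ 0.0
```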

The proposed finite smoothly mixing regression model that imposes a conditional moment restriction is a special case of the mixtures of experts introduced by Jacobs et al. (1991). Let the proposed model be defined by a set of parameters $(\eta^k, \theta)$, where $\theta$ is the parameter of interest and $\eta^k$ are the nuisance parameters that induce the conditional densities $f_{\epsilon|x}$. The density of the observable $y_i$ is modeled as:

$$p(y_i \mid x_i, \theta, \eta^k) = \sum_{j=1}^{k} \alpha_j(x_i)\, f_2(y_i - h(x_i, \theta); \pi_j, \mu_j, \sigma_{j1}, \sigma_{j2}), \tag{1}$$

$$\sum_{j=1}^{k} \alpha_j(x_i) = 1, \quad \forall x_i \in \mathcal{X},$$

where $\alpha_j(x_i)$ is a regressor-dependent, smoothly varying probability weight. Note that by construction $E_p[y \mid x] = h(x, \theta)$, as desired. The conditional distribution of the residuals is modeled as a flexible countable mixture of densities $f_2$ with predictor-dependent mixing weights.

Modeling of $\alpha_j(x)$ is the choice of the econometrician, and there are a few available alternatives. We will use the linear logit regression considered by Norets (2010), as it has desirable theoretical properties. The mixing probabilities $\alpha_j(x_i)$ are constructed as

$$\alpha_j(x_i) = \frac{\exp(\rho_j + \gamma_j' x_i)}{\sum_{l=1}^{k} \exp(\rho_l + \gamma_l' x_i)}. \tag{2}$$

The linear logit regression is not the unique choice, as Geweke and Keane (2007) considered a multinomial probit model for $\alpha_j(x)$, and a number of alternative possibilities have been considered in the predictor-dependent stick-breaking process literature. Generally, this finite mixture model can be considered as a special case of a smoothly mixing regression model for conditional density estimation that imposes a linear mean but leaves the residual densities unconstrained.

The full finite mixture model is characterized by the parameter of interest $\theta$ and the nuisance parameters $\eta^k \equiv \{\pi_j, \mu_j, \sigma_{j1}, \sigma_{j2}, \rho_j, \gamma_j\}_{j=1}^{k}$. To complete the characterization of this model, one would specify a prior $\Pi_\theta$ on $\Theta$ and a prior $\Pi_\eta$ on the parameters $\eta^k$ that induces a prior $\Pi_{f_{\epsilon|x}}$ on $\mathcal{F}_{\epsilon|x}$. These priors induce a joint prior $\Pi = \Pi_\theta \times \Pi_{f_{\epsilon|x}}$ on $\Xi$.
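The finite model in Equations (1)-(2) can be evaluated directly. The sketch below (the parameter values are hypothetical, chosen only for illustration) computes the logit mixing weights and the resulting conditional density of $y$ around the linear mean $h(x, \theta) = x'\theta$.

```python
# Sketch of the finite smoothly mixing regression density, Eqs. (1)-(2):
# logit weights alpha_j(x) over k zero-mean f_2 kernels centered at x'theta.
import numpy as np
from scipy.stats import norm

def f2_pdf(eps, pi, mu, s1, s2):
    return pi * norm.pdf(eps, mu, s1) + (1 - pi) * norm.pdf(eps, -pi * mu / (1 - pi), s2)

def logit_weights(x, rho, gamma):
    """Eq. (2): alpha_j(x) proportional to exp(rho_j + gamma_j' x)."""
    z = rho + gamma @ x
    z -= z.max()                     # stabilize the softmax numerically
    w = np.exp(z)
    return w / w.sum()

def clr_density(y, x, theta, rho, gamma, pi, mu, s1, s2):
    """Eq. (1): p(y | x) = sum_j alpha_j(x) f_2(y - x'theta; ...)."""
    alpha = logit_weights(x, rho, gamma)
    return float(np.sum(alpha * f2_pdf(y - x @ theta, pi, mu, s1, s2)))

# Illustration with k = 3 components and d = 2 covariates.
k, d = 3, 2
rng = np.random.default_rng(0)
rho, gamma = rng.normal(size=k), rng.normal(size=(k, d))
pi, mu = np.full(k, 0.4), rng.normal(size=k)
s1, s2 = np.full(k, 0.7), np.full(k, 1.3)
x, theta = np.array([1.0, -0.2]), np.array([0.3, 1.1])
print(clr_density(0.5, x, theta, rho, gamma, pi, mu, s1, s2))
```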

In Section 3 we show that for any true DGP $F_0$ there exist $k$ large enough and parameters $(\theta, \eta^k)$ such that the proposed model is arbitrarily close in KL divergence to the true DGP. This property can be used to show that consistent estimation of $\theta_0$ would be obtained as $k \to \infty$.

2.2 Infinite Smoothly Mixing Regression

Estimation of a finite mixture model introduces the additional complication of having to estimate the number of mixture components $k$. An alternative solution is to consider an infinite smoothly mixing regression. The conditional density of the observable $y_i$ is modeled as:

$$p(y_i \mid x_i, \theta, \eta) = \sum_{j=1}^{\infty} p_j(x_i)\, f_2(y_i - h(x_i, \theta); \pi_j, \mu_j, \sigma_{j1}, \sigma_{j2}),$$

where $\eta$ are nuisance parameters to be specified later, $p_j(x_i)$ is a predictor-dependent probability weight, and $\sum_{j=1}^{\infty} p_j(x) = 1$ a.s. for all $x \in \mathcal{X}$. To construct this infinite mixture model we will employ predictor-dependent stick-breaking processes. Similarly to the choice of $\alpha_j(x)$ in the finite smoothly mixing regressions, various constructions of $p_j(x)$ have been considered in the literature. Those methods include the order-based dependent Dirichlet processes ($\pi$DDP) proposed by Griffin and Steel (2006), the probit stick-breaking process (Chung and Dunson (2009)), the kernel stick-breaking process (Dunson and Park (2008)) and the local Dirichlet process (lDP) (Chung and Dunson (2011)), which is a special case of the kernel stick-breaking process.

We will be employing the kernel stick-breaking process introduced by Dunson and Park (2008). It is defined using a countable sequence of mutually independent random components $V_j \sim \text{Beta}(a_j, b_j)$ and $\Gamma_j \sim H$, independently for each $j = 1, \ldots$. The covariate-dependent mixing weights are

defined as:

$$p_j(x) = V_j K_\varphi(x, \Gamma_j) \prod_{l < j} \left(1 - V_l K_\varphi(x, \Gamma_l)\right), \quad \text{for all } x \in \mathcal{X},$$

where $K : \mathbb{R}^d \times \mathbb{R}^d \to [0, 1]$ is any bounded kernel function. Kernel functions that have been considered in practice are $K_\varphi(x, \Gamma_j) = \exp(-\varphi \|x - \Gamma_j\|^2)$ and $K_\varphi(x, \Gamma_j) = 1(\|x - \Gamma_j\| < \varphi)$, where $\|\cdot\|$ is the Euclidean distance. We will focus on the exponential kernel in this manuscript. Jointly, the conditional density of $y_i$ given the covariate $x_i$ is defined as

$$p(y_i \mid x_i, \theta, \eta) = \sum_{j=1}^{\infty} p_j(x_i)\, f_2(y_i - h(x_i, \theta); \pi_j, \mu_j, \sigma_{j1}, \sigma_{j2}), \tag{3}$$

$$p_j(x) = V_j K_\varphi(x, \Gamma_j) \prod_{l < j} \left(1 - V_l K_\varphi(x, \Gamma_l)\right).$$

For Bayesian analysis the parameters are endowed with these priors: $\{\pi_j, \mu_j, \sigma_{j1}^2, \sigma_{j2}^2\} \sim G_0$, $\Gamma_j \sim H$, $V_j \sim \text{Beta}(a_j, b_j)$, $\varphi \sim \Pi_\varphi$ and $\theta \sim \Pi_\theta$. The nuisance parameter is $\eta = \{\varphi, \{\pi_j, \mu_j, \sigma_{j1}, \sigma_{j2}, V_j, \Gamma_j\}_{j=1}^{\infty}\}$, and jointly these priors on the nuisance parameters induce a prior $\Pi_{f_{\epsilon|x}}$ on $\mathcal{F}_{\epsilon|x}$. This is a very flexible model for predictor-dependent conditional densities; however, it also imposes the desired property that the conditional error densities have a mean of zero, in order to identify the parameter of interest $\theta$. We will show that this is a correctly specified model for the DGP, as the posterior concentrates on the true parameter $\theta_0$ and on a weak neighborhood of the true conditional densities $f_{0,\epsilon|x}$ for a particular choice of kernel function. The exponential kernel is chosen as it is closely related to the linear logit regression used in the finite mixture model.
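The stick-breaking weights are easy to compute once the process is truncated. The following sketch (the truncation level $J$ is a computational device assumed here, not part of the theory above) evaluates $p_j(x)$ under the exponential kernel; the leftover stick mass beyond $J$ is the amount by which the truncated weights fall short of one.

```python
# Sketch of truncated kernel stick-breaking weights with the exponential kernel
# K_phi(x, Gamma_j) = exp(-phi * ||x - Gamma_j||^2), as in Dunson and Park (2008).
import numpy as np

def ksbp_weights(x, V, Gamma, phi):
    """p_j(x) = V_j K(x,Gamma_j) * prod_{l<j} (1 - V_l K(x,Gamma_l)), j <= J."""
    K = np.exp(-phi * np.sum((x - Gamma) ** 2, axis=1))   # kernel values, shape (J,)
    piece = V * K                                          # stick broken at each j
    survive = np.concatenate(([1.0], np.cumprod(1.0 - piece)[:-1]))
    return piece * survive

rng = np.random.default_rng(1)
J, d = 50, 2
V = rng.beta(1.0, 1.0, size=J)              # V_j ~ Beta(a_j, b_j)
Gamma = rng.uniform(-2.0, 2.0, size=(J, d)) # Gamma_j ~ H
w = ksbp_weights(np.array([0.5, -1.0]), V, Gamma, phi=1.0)
print(w.sum())   # < 1; the remainder is the stick mass beyond the truncation
```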

Even though practical suggestions have been plentiful, theoretical results regarding the consistency properties of predictor-dependent stick-breaking processes are scarce. Related theoretical results are presented by Tokdar et al. (2010), Pati et al. (2013) and Norets and Pelenis (forthcoming). Key contributions of this paper are Theorem 2 and Corollary 1 in Section 3, which show that kernel stick-breaking processes with an exponential kernel can be used to consistently estimate flexible unrestricted conditional densities and the parameters of interest. One hopes that consistent estimation of the error densities could lead to more efficient estimation of the parameters of interest, such as linear regression coefficients, compared to other methods that do not directly impose a conditional moment restriction.

3 Consistency Properties

We provide general sufficient conditions on the true data generating process that lead to posterior consistency in estimating the regression parameters and conditional residual densities. We show that the residual densities induced by the proposed models can be chosen to be arbitrarily close in Kullback-Leibler divergence to true conditional densities that satisfy the conditional moment restriction. That is, the Kullback-Leibler (KL) closure of the models proposed in Section 2 includes all true data generating distributions that satisfy a set of general conditions presented in this section.

To motivate the necessity of the consistency properties of the models proposed in Section 2, we provide a stylized data generating example where alternative flexible constructions might fail to deliver consistent estimation of the regression function. Suppose that one is estimating a model $y = g(x) + \epsilon$, where $g(\cdot)$ is either an unspecified function or some known function $h(x, \theta)$, such as $h(x, \theta) = x'\theta$. One common piece of textbook advice is to assume that $\epsilon$ is independent of the covariates and to model the error distribution as a Dirichlet process mixture to account for possible non-Gaussianity. Such a model is proposed by Chib and Greenberg (2010). An alternative approach by Pati and Dunson (forthcoming) allows the error distribution to be covariate dependent, but restricts it to be symmetric. We hope

that the stylized example below will highlight the issues that both of these approaches might face when the true error distribution is covariate dependent and asymmetric.

Consider this particular discrete stylized data generating example, such that $E[y \mid x] = x\theta_0 = 0$ and $\theta_0 = 0$. Suppose that $x \in \{-1, 1\}$, as then only two regression means have to be estimated, $g(x = -1)$ and $g(x = 1)$, and hence $g(x)$ can be estimated as $g(x) = x\theta$. Let the true data generating process be $P(x = -1) = P(x = 1) = 0.5$; $P(y = 2 \mid x = 1) = 1/3$ and $P(y = -1 \mid x = 1) = 2/3$; and $P(y = 1 \mid x = -1) = 2/3$ and $P(y = -2 \mid x = -1) = 1/3$. Then $E[y \mid x] = 0$ by design; however, the pseudo-true value of $\theta$ would be $\tilde\theta = 0.5$, or alternatively $\tilde{E}[y \mid x = 1] = 0.5$ and $\tilde{E}[y \mid x = -1] = -0.5$, if one assumes that the residuals are either covariate independent, or covariate dependent and symmetric, and allows the residual distribution to be otherwise fully flexible. This inconsistency arises because in both cases the conditional error distribution would concentrate around the probability distribution $P(\epsilon = 1.5 \mid x) = 0.5$ and $P(\epsilon = -1.5 \mid x) = 0.5$ for both $x = -1$ and $x = 1$. This example illustrates the possible risk of misspecification of the error density; the effect of such misspecification on the estimates of the regression coefficients will be explored in a simulation example in Section 4. Therefore, we advocate a model with flexible, asymmetric, covariate-dependent error distributions, as this allows us to obtain consistent estimates of the parameters of interest.
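The pseudo-true value in this example can be checked numerically. The sketch below is a heuristic of our own (not the paper's estimation procedure): since a flexible covariate-independent error model effectively fits the pooled residual distribution, the pseudo-true $\theta$ approximately maximizes the expected log-density of the pooled residual atoms, smoothed with a small Gaussian bandwidth to make the criterion continuous.

```python
# Heuristic check of the stylized example: the maximizer is ~0.5, not theta_0 = 0.
import numpy as np
from scipy.stats import norm

# Atoms (y, x, prob) of the stylized DGP with E[y|x] = 0 and theta_0 = 0.
atoms = [(2.0, 1.0, 1/6), (-1.0, 1.0, 1/3), (1.0, -1.0, 1/3), (-2.0, -1.0, 1/6)]
p = np.array([pr for _, _, pr in atoms])
sigma = 0.3   # smoothing bandwidth for the atoms

def expected_loglik(theta):
    # Residuals of each atom; the fitted covariate-independent error density is
    # approximated by the pooled (x-marginal) smoothed residual distribution.
    res = np.array([y - theta * x for y, x, _ in atoms])
    dens = (p[None, :] * norm.pdf(res[:, None] - res[None, :], scale=sigma)).sum(axis=1)
    return float(p @ np.log(dens))

grid = np.linspace(-1.0, 1.0, 201)
print(grid[np.argmax([expected_loglik(t) for t in grid])])   # 0.5
```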

Next, we proceed to show that, under very general assumptions on the true data generating process and the prior, estimation of the models defined in Sections 2.1 and 2.2 leads to posterior consistency. Let $p(y \mid x, M)$ be the conditional density of $y$ given $x$ implied by some model $M$. Let the true data generating joint density of $(y, x)$ be $f_0(y \mid x) f_0(x)$; then the joint marginal density induced by the model $M$ is $p(y \mid x, M) f_0(x)$. Note that in the models considered in Section 2 we modeled only the conditional error density and left the data generating density $f_0(x)$ of $x \in \mathcal{X}$ unspecified. The KL divergence between $f_0(y \mid x) f_0(x)$ and $p(y \mid x, M) f_0(x)$ is defined as

$$d_{KL}(f_0, p_M) = \int \log \frac{f_0(y \mid x) f_0(x)}{p(y \mid x, M) f_0(x)}\, F_0(dy, dx) = \int \log \frac{f_0(y \mid x)}{p(y \mid x, M)}\, F_0(dy, dx). \tag{4}$$

Given the true conditional data generating density $f_0(y \mid x)$, define $f_{0,\epsilon|x}$ as $f_{0,\epsilon|x}(\epsilon \mid x) = f_0(\epsilon + h(x, \theta_0) \mid x)$. We say that the posterior is consistent for estimating $(f_{0,\epsilon|x}, \theta_0)$ if $\Pi(W \mid Y^n, X^n)$ converges to 1 in $P_{F_0}^n$ probability as $n \to \infty$ for any neighborhood $W$ of $(f_{0,\epsilon|x}, \theta_0)$ when the true data generating distribution is $F_0$. We define a weak neighborhood $U_\delta(f_{\epsilon|x})$ as

$$U_\delta(f_{\epsilon|x}) = \Big\{ \tilde{f}_{\epsilon|x} \in \mathcal{F}_{\epsilon|x} : \Big| \int_{\mathbb{R} \times \mathcal{X}} g \tilde{f}_{\epsilon|x}(\epsilon, x) f_0(x)\, d\epsilon\, dx - \int_{\mathbb{R} \times \mathcal{X}} g f_{\epsilon|x}(\epsilon, x) f_0(x)\, d\epsilon\, dx \Big| < \delta,$$
$$g : \mathbb{R} \times \mathcal{X} \to \mathbb{R} \text{ bounded and uniformly continuous} \Big\}.$$

We consider a neighborhood $W$ of $(f_{0,\epsilon|x}, \theta_0)$ of the form $U_\delta(f_{0,\epsilon|x}) \times \{\theta : \|\theta - \theta_0\| < \rho\}$ for any $\delta > 0$ and $\rho > 0$.

First, we consider the case of the finite model described in Section 2.1 by Equations (1) and (2). Let the proposed model be defined by the parameters $(\eta^k, \theta)$. We show that there exist $k$ large enough and a set of parameters $(\eta^k, \theta)$ such that the KL divergence between the true conditional densities and the ones implied by the finite model is arbitrarily close to 0.

Theorem 1. Assume that

1. $E[y \mid x] = h(x, \theta_0)$ a.s. $F_0$.
2. $f_0(y \mid x)$ is continuous in $(y, x)$ a.s. $F_0$.
3. $\mathcal{X}$ has bounded support and $E_{F_0}[y^2 \mid x] < \infty$ for all $x \in \mathcal{X}$.
4. $h$ is Lipschitz continuous in $x$.

5. There exists $\delta > 0$ such that

$$\int \log \frac{f_0(y \mid x)}{\inf_{|y - z| < \delta,\, \|x - t\| < \delta} f_0(z \mid t)}\, F_0(dy, dx) < \infty. \tag{5}$$

Let $p(\cdot \mid \cdot, \theta, \eta^k)$ be defined as in Equations (1) and (2). Then, for any $\epsilon > 0$ there exists $(\eta^k, \theta)$ such that $d_{KL}(f_0(\cdot \mid \cdot), p(\cdot \mid \cdot, \theta, \eta^k)) < \epsilon$.

Theorem 1 is proved in the Appendix. The general idea behind this approximation result is as follows. First we show that the mixing weights $\alpha_j(x)$ are flexible enough that a subset of the mixing weights approximates an indicator function on the neighborhood of some particular covariate value $x$, while the other mixing weights for that particular $x$ approach 0. Then, given this particular point $x$, we can approximate the conditional error density at this $x$ by a finite mixture of $f_2$ densities associated with that particular subset of mixing weights. As the covariate $x$ is varied, different mixing weights approximate the indicator function on the neighborhood of $x$, and different mixtures of $f_2$ densities are used to approximate the conditional error density at that particular value of $x$.

The results above imply the existence of a large number $k$ of mixture components such that the induced conditional densities are close to the true values of the DGP. However, this does not provide a direct method of estimating $k$, the number of mixture components, to be used in applications. Furthermore, one can show that any finite model could have pseudo-true values of $\theta$ different from the true values for some data generating distributions that belong to the general class $\mathcal{F}$ of DGPs. Therefore, we would like to establish that as the KL approximation improves, our estimate of $\theta$ has to approach the true value $\theta_0$. We show that these concerns are addressed when an infinite smoothly mixing regression model induced by a predictor-dependent stick-breaking process prior is used for inference. We provide the general conditions under which estimation of the

infinite mixture model leads to posterior consistency of $f_{0,\epsilon|x}$ and $\theta_0$. Hence, we provide the necessary theoretical foundation for the use of the infinite mixture model if consistency is a desired requirement of the estimation procedure.

For the infinite mixture model defined in Equation (3), the priors $G_0$, $H$, $\Pi_V$, $\Pi_\varphi$, $\Pi_\theta$ and a choice of kernel function $K_\varphi$ induce a prior $\Pi$ on $\Xi$. By choosing the priors on the nuisance parameters, we induce a prior on the set of conditional error densities with mean 0. A conditional density function $f_{\cdot|x}$ is said to be in the KL support of the prior $\Pi$ (i.e. $f_{\cdot|x} \in KL(\Pi)$) if for all $\epsilon > 0$, $\Pi(K_\epsilon(f_{\cdot|x})) > 0$, where $K_\epsilon(f_{\cdot|x}) \equiv \{(\theta, \eta) : d_{KL}(f_{\cdot|x}(\cdot), p(\cdot \mid \cdot, \theta, \eta)) < \epsilon\}$ and $d_{KL}(\cdot, \cdot)$ is defined in Equation (4). The next theorem shows that if the true data generating distribution $F_0$ satisfies the assumptions of Theorem 1, then $f_0$ belongs to the KL support of $\Pi$ under general conditions on the prior distributions and for a particular kernel function.

Theorem 2. Assume $F_0$ satisfies the assumptions of Theorem 1 and $f_0(\cdot \mid \cdot)$ are the covariate-dependent conditional densities of $y \in \mathcal{Y}$ induced by $F_0$. Let $K_\varphi(x, \Gamma) = \exp(-\varphi \|x - \Gamma\|^2)$ and let the prior $\Pi$ be induced by the priors $G_0$, $H$, $\Pi_V$, $\Pi_\varphi$, $\Pi_\theta$ and the model defined in Equation (3). If the priors are such that $\theta_0$ is an interior point of the support of $\Pi_\theta$, $\Pi(\sigma_{j1} < \delta) > 0$ for any $\delta > 0$, and $\mathcal{X} \subseteq \text{supp}(H)$, then $f_0 \in KL(\Pi)$.

Theorem 2 is proved in the Appendix. The intuition is that for any given set of parameters of the finite mixture model, there is a set of nuisance parameters of the infinite model that approximates the finite model in the KL sense. The minimal restrictions on the priors ensure that any set of parameters in a neighborhood of this particular instance of the infinite model is arbitrarily close in the KL sense, and that this neighborhood has positive prior support. Once the KL support property is established, we proceed to use Schwartz's posterior consistency theorem (Schwartz (1965)) to show that the posterior is weakly consistent at $f_{0,\epsilon|x}$ and $\theta_0$. First, we consider the case of linear regression with

$h(x, \theta) = x'\theta$ as an illustrative example of the additional assumptions that are necessary to achieve posterior consistency. Furthermore, assume that a constant is included in the regressors $x$.

Theorem 3 (an (almost) immediate implication of Schwartz (1965)). Given the model defined by Equation (3), suppose that $F_0$ satisfies the assumptions of Theorem 1, that the prior distributions satisfy the requirements of Theorem 2, and that $E_{F_0}[y \mid x] = x'\theta_0$. Furthermore, suppose that $E_{F_0}[xx']$ is a non-singular matrix. The prior is restricted so that there exists a large $L$ such that $E_f[\epsilon^2 \mid x] < L$ for all $x \in \mathcal{X}$ and all $f \in \text{supp}(\Pi_{f_{\epsilon|x}})$. Let $W = U_\delta(f_{0,\epsilon|x}) \times \{\theta : \|\theta - \theta_0\| < \rho\}$ for some $\delta > 0$ and $\rho > 0$; then $\Pi(W^c \mid Y^N, X^N) \to 0$ a.s. $P_{F_0}^\infty$.

The assumption that $E_{F_0}[xx']$ is non-singular allows us to identify the unique parameter of interest $\theta_0$. If $E_{F_0}[xx']$ were singular, then in the presence of multi-collinearity the parameter of interest is only set-identified, and one could expect the posterior to concentrate on this set, but this conjecture is not formally explored. The theorem is proved rigorously in the Appendix. The proof consists of the construction of exponentially consistent tests for testing $H_0 : (f_{\epsilon|x}, \theta) = (f_{0,\epsilon|x}, \theta_0)$ against the alternative hypothesis $H_1 : (f_{\epsilon|x}, \theta) \in W^c$. Once that is accomplished, it is a straightforward application of Schwartz's posterior consistency theorem, as the KL support property is already proved in Theorem 2.

Theorem 3 can be extended to other restricted moment models beyond the linear regression case with an additional assumption. Let $\gamma \in \{-1, 1\}^p$ and define a quadrant $Q_\gamma = \{z \in \mathbb{R}^p : z_j \gamma_j > 0 \text{ for all } j = 1, \ldots, p\}$.

Corollary 1. Suppose that $F_0$ satisfies the assumptions of Theorem 1 and that the prior distributions satisfy the requirements of Theorem 2. Additionally, assume that for each $\gamma$ and $\theta - \theta_0 \in Q_\gamma$ there exists $X_{h,\gamma,\xi} = \{x \in \mathcal{X} : |h(x, \theta_0) - h(x, \theta)| > \xi \|\theta - \theta_0\| \ \forall\, \theta - \theta_0 \in Q_\gamma, \ \|x\| > \xi\}$ with $F_0(X_{h,\gamma,\xi}) > \zeta > 0$ for some $\xi > 0$ small enough. Furthermore, the prior is restricted so that there exists a large $L$ such that $E_f[\epsilon^2 \mid x] < L$ for all $x \in \mathcal{X}$ and

all $f \in \text{supp}(\Pi_{f_{\epsilon|x}})$. Let $W = U_\delta(f_{0,\epsilon|x}) \times \{\theta : \|\theta - \theta_0\| < \rho\}$ for some $\delta > 0$ and $\rho > 0$; then $\Pi(W^c \mid Y^N, X^N) \to 0$ a.s. $P_{F_0}^\infty$.

This corollary is proved in the Appendix, being a simple extension of Theorem 3. However, the assumptions of this corollary require some restrictions on the possible functions $h$ and on the distribution of the covariates. This assumption could be satisfied for a number of generalized linear models with a monotonic link function. The corollary establishes that consistent estimation of the parameter of interest and the conditional error densities is achieved if the true data generating process satisfies a conditional moment restriction. Given the desirable theoretical property that both the parametric and nuisance parts are consistently estimated, it achieves the two objectives. Firstly, the estimation of the parameter of interest is consistent. Secondly, consistent estimation of the nuisance parameter, which here is the set of conditional error densities, might lead to more efficient estimation of the parameter of interest, which is a justification for estimating the full semi-parametric model as opposed to some alternative simplified model.

There have been a number of papers that estimate the mean regression function $E[y \mid x]$ without specific parametric assumptions on the form of the regression mean. For example, Chan et al. (2006) and Chib and Greenberg (2010) used splines to flexibly model the regression mean, while Pati and Dunson (forthcoming) used a Gaussian process prior for the mean regression function and proved consistency properties of their proposal for the case of symmetric residual densities. Unsurprisingly, using the results from Norets (2010) one could show that a well behaved mean regression function can be approximated by finite order polynomials in the covariates, and Theorems 1 and 2 could be extended to show that $f_0$ and $E[y \mid x] = g(x)$ are in the Kullback-Leibler support of $\Pi$ under some general conditions. Extending these results to posterior consistency is less straightforward due to the construction of exponentially consistent tests for a non-parametric

mean. Such tests based on a conditional mean restriction were built to test hypotheses on particular chosen subspaces of a finite dimensional space $\Theta$, and the choice of such subspaces is not straightforward in the case of an unrestricted regression mean function. However, similar exponentially consistent tests based on a conditional median restriction have been constructed by Pati and Dunson (forthcoming); one would hope that such an extension is possible for testing a conditional mean restriction as well, and it is an interesting future research topic.

4 Simulation Examples

A number of simulation examples are considered to assess the performance of the method proposed in this paper. Consider a linear regression model with

$$y_i = \alpha + x_i'\beta + \epsilon_i, \quad (y_i, x_i) \sim \text{iid } F_0, \quad i = 1, \ldots, N,$$

and $E_{F_0}[\epsilon_i \mid x_i] = 0$, where $x$ is one-dimensional and $N = 200$. We consider four alternative data generating processes (DGPs), with the first three suggested by Müller (forthcoming); a simulation sketch is given after the list.

1. Case (i): $y_i = x_i + \epsilon_i$, $\epsilon_i \sim N(0, 1)$.

2. Case (ii): $y_i = x_i + \epsilon_i$, $\epsilon_i \mid x_i \sim N(0, a^2(|x_i| + 0.5)^2)$, where the constant $a$ is calibrated so that the normalization below holds.

3. Case (iii): $y_i = x_i + \epsilon_i$, $\epsilon_i \mid x_i, s \sim N([1 - 2 \cdot 1(x_i < 0)]\mu_s, \sigma_s^2)$, where $P(s = 1) = 0.8$, $P(s = 2) = 0.2$, $\mu_1 = 0.25$, $\sigma_1 = 0.75$, $\mu_2 = -1$, and $\sigma_2$ is calibrated so that the normalization below holds.

4. Case (iv): $y_i = 0 + 0 \cdot x_i + \epsilon_i$, $\epsilon_i \mid x_i \sim N(x_i \mu_s, \sigma^2)$, where $P(s = 1) = P(s = 2) = 0.5$ and $\mu_1 = -\mu_2$.

All four DGPs are such that $E[(x_i \epsilon_i)^2] = 1$ and $x_i \sim N(0, 1)$, so that inference based on heteroscedasticity-robust OLS and infeasible GLS is comparable across the four cases.
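For concreteness, the following sketch simulates the four DGPs. The source's numeric values for $a$ in Case (ii) and $\sigma_2$ in Case (iii) are backed out from the normalization $E[(x_i\epsilon_i)^2] = 1$ stated above; the Case (iv) value of $\mu_1$ is garbled in the source, so an illustrative value is assumed and $\sigma$ is derived from the same normalization.

```python
# Sketch of the four simulation DGPs; see the lead-in for which constants are
# derived from the normalization E[(x*eps)^2] = 1 and which are assumed.
import numpy as np

rng = np.random.default_rng(0)

def simulate(case, n=200):
    x = rng.normal(size=n)
    if case == 1:                                    # (i) homoscedastic N(0,1) errors
        y = x + rng.normal(size=n)
    elif case == 2:                                  # (ii) sd(eps|x) = a*(|x| + 0.5)
        # a^2 * E[x^2 (|x|+0.5)^2] = a^2 * (E x^4 + E|x|^3 + 0.25 E x^2) = 1
        a = 1.0 / np.sqrt(3.0 + 2.0 * np.sqrt(2.0 / np.pi) + 0.25)
        y = x + a * (np.abs(x) + 0.5) * rng.normal(size=n)
    elif case == 3:                                  # (iii) skewed, sign flips at x < 0
        s = rng.choice([0, 1], size=n, p=[0.8, 0.2])
        mu = np.array([0.25, -1.0])[s]
        sd = np.array([0.75, np.sqrt(1.5)])[s]       # sigma_2^2 = 1.5 from normalization
        y = x + np.where(x < 0.0, -1.0, 1.0) * mu + sd * rng.normal(size=n)
    else:                                            # (iv) beta = 0, bimodal eps | x
        mu1 = 0.4                                    # assumed (source value garbled)
        sd = np.sqrt(1.0 - 3.0 * mu1 ** 2)           # E[x^2 eps^2] = 3 mu1^2 + sd^2 = 1
        m = rng.choice([mu1, -mu1], size=n)
        y = x * m + sd * rng.normal(size=n)
    return y, x

y, x = simulate(case=3)
print(y[:3], x[:3])
```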

Inference is based on the following methods. First, inference based on infeasible generalized least squares (GLS) with a correct covariance matrix specification. Second, Bayesian inference based on the artificial sandwich posterior (OLS) as proposed by Müller (forthcoming). Let $\theta = (\alpha, \beta)$; then $\theta \sim N(\hat\theta, \hat\Sigma)$, where $\hat\theta$ is the ordinary least squares coefficient and $\hat\Sigma$ is the sandwich covariance matrix. Note that inference based on this sandwich posterior is asymptotically equivalent to inference using the Bayesian bootstrap (Lancaster (2003)), so there is a Bayesian alternative to this frequentist-inspired procedure when the regression coefficient is the object of interest. Third, Bayesian inference based on a normal regression model (NLR), where $\epsilon_i \mid x_i \sim N(0, \sigma^2)$ with priors $\theta \sim N(0, (0.01 I_2)^{-1})$ and $\sigma^2 \sim \text{InvGamma}(3/2, 2)$. Fourth, inference based on the Robinson (1987) asymptotically efficient uniform-weight k-NN estimator with $k_n = n^{4/5}$. Finally, Bayesian inference based on the conditional linear regression model (CLR) proposed in this paper. At this stage, for computational reasons, we consider the finite model with $k = 5$ states, as defined in Equations (1) and (2).

We will consider three versions of the CLR model. The first version is the unrestricted regular CLR with priors set to $\theta \sim N(0, (0.01 I_2)^{-1})$, $\rho_j \sim N(0, 1)$, $\gamma_j \sim N(0, 4)$, $\sigma_{ji} \sim IG(3/2, 2)$, $\mu_j \sim N(0, 4)$ and $(\pi_j, 1 - \pi_j) \sim \text{Dir}(5, 5)$ for all $j$ and $i$. The second version is an unconditional linear regression (ULR) model, which allows for a flexible error distribution that is assumed to be independent of the covariates. Such a model is estimated by restricting $\gamma_j = 0$ for all $j$, with the same prior distributions as for the CLR. The third version is a symmetric conditional linear model (SymCLR), in which the error densities are restricted to be symmetric (but possibly heteroscedastic) by setting $\pi_j = 0.5$ and $\sigma_{j1} = \sigma_{j2}$ for all $j$; the rest of the priors are the same. By comparing these three model choices we hope to pinpoint the possible shortcomings of the homoscedasticity and symmetry restrictions discussed in Section 3.
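The artificial sandwich posterior used as the OLS benchmark above is simple to implement. A minimal sketch (assuming the textbook heteroscedasticity-robust covariance, without finite-sample corrections):

```python
# Sketch of the artificial sandwich posterior: theta | data ~ N(theta_hat, Sigma_hat)
# with theta_hat the OLS estimate and Sigma_hat the sandwich covariance.
import numpy as np

def sandwich_posterior_draws(y, x, ndraws=5000, seed=0):
    rng = np.random.default_rng(seed)
    X = np.column_stack([np.ones_like(x), x])   # constant plus scalar covariate
    XtX_inv = np.linalg.inv(X.T @ X)
    theta_hat = XtX_inv @ X.T @ y               # OLS coefficients (alpha, beta)
    u = y - X @ theta_hat                       # OLS residuals
    meat = X.T @ (X * u[:, None] ** 2)          # sum_i u_i^2 x_i x_i'
    Sigma_hat = XtX_inv @ meat @ XtX_inv        # sandwich covariance matrix
    return rng.multivariate_normal(theta_hat, Sigma_hat, size=ndraws)

rng = np.random.default_rng(1)
x = rng.normal(size=200)
y = x + rng.normal(size=200)
draws = sandwich_posterior_draws(y, x)
print(np.quantile(draws[:, 1], [0.025, 0.975]))  # 95% interval for beta
```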

The priors for $\mu$, $\sigma$, $\pi$ are chosen to reflect the range and variance of the observable $y$ at possible covariate values. Furthermore, the priors for $\rho$, $\gamma$ control the sharing of information across covariate values, and the prior for $\gamma$ should be of a magnitude similar to the range of the covariates $x$, so as to allow the conditional error density to differ across covariate values. While no preprocessing steps were taken for this simulation exercise, in some other instances it would be beneficial to standardize the covariates so that the prior expectation of the influence of each predictor variable is equal. Posterior computation for the finite model and a full description of the priors are contained in the Appendix. Posterior simulation for the infinite mixture model can be accomplished using the retrospective or slice sampling methods proposed by Papaspiliopoulos and Roberts (2008), Walker (2007) and Kalli et al. (2011). The approach used in this paper for infinite mixture sampling is also presented in the Appendix.

The parameter of interest is $\beta \in \mathbb{R}$ and we consider three separate criteria for the evaluation of the performance of the proposed estimators. First, we compute the root mean squared error (RMSE). Second, we construct 95% Bayesian credibility regions using the 0.025 and 0.975 quantiles of the posterior distribution and report coverage probabilities. Third, we consider the length of a credibility region as another indicator of the performance of an estimator, and we compare the average interval lengths of the competing methods. Similar approaches for evaluating performance have been considered by Conley et al. (2008). We repeat the simulation exercise 1000 times for each of the four DGPs. The results are displayed in Table 1.
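The three criteria are computed from the Monte Carlo replications as follows (a minimal sketch; the stand-in posterior draws are synthetic):

```python
# Sketch of the evaluation criteria: RMSE of the posterior-mean estimate of beta,
# coverage of the 95% equal-tailed credibility region, and its average length.
import numpy as np

def evaluate(draws_per_rep, beta0):
    """draws_per_rep: one 1-d array of posterior draws of beta per replication."""
    est = np.array([d.mean() for d in draws_per_rep])
    lo = np.array([np.quantile(d, 0.025) for d in draws_per_rep])
    hi = np.array([np.quantile(d, 0.975) for d in draws_per_rep])
    rmse = float(np.sqrt(np.mean((est - beta0) ** 2)))
    coverage = float(np.mean((lo <= beta0) & (beta0 <= hi)))
    length = float(np.mean(hi - lo))
    return rmse, coverage, length

rng = np.random.default_rng(2)
reps = [rng.normal(1.0, 0.1, size=5000) for _ in range(1000)]  # synthetic posteriors
print(evaluate(reps, beta0=1.0))
```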

Table 1: Simulation results. For each DGP (Cases (i)-(iv), in columns) and each method of inference (GLS, OLS, NLR, k-NN, CLR, ULR, SymCLR, in rows), the table reports the RMSE, the coverage of the 95% Bayesian credibility region, and the average interval length of the credibility region (numeric entries omitted here). Notes: DGPs are in columns and methods of inference in rows. Bayesian inference for each method is implemented by a Gibbs sampler, with the first 5000 draws discarded as burn-in, for each of the 1000 datasets drawn from each DGP.

Relative performance of the methods is similar whether RMSE, coverage, or interval length is used as the evaluation criterion. The results show that the conditional linear regression model proposed in this paper performs better than the alternatives in Cases (ii) and (iv), in the presence of heteroscedasticity, and performs comparably in the other cases. In Cases (i) and (iii) the best performing models should be OLS and NLR, since it is well known that the OLS estimator achieves semi-parametric efficiency

under homoscedasticity. Note that under heteroscedasticity, in Cases (ii) and (iv), the 95% coverage of the highest posterior density interval of the Bayesian NLR is well below 95% due to the misspecification of the error distribution. The unconditional ULR and symmetric SymCLR versions perform worse in Case (iii) due to the (conditional) asymmetries of the error distribution. When the true conditional error density is skewed, the misspecified models lead to inconsistent estimation of the regression coefficients, as conjectured in Section 3. Furthermore, note that the unconditional ULR performs poorly in Cases (ii) and (iv) in the presence of heteroscedasticity, as expected. As demonstrated in this simulation example, estimation of linear models with conditional error densities modeled as covariate-independent mixtures of normals, or with flexible symmetric residual densities as proposed in Pati and Dunson (forthcoming), might be misguided if the regression coefficients are the object of interest. As expected, the CLR model proposed in this paper outperforms, or performs comparably to, the other alternatives, even when compared to the asymptotically efficient k-NN estimator.

Furthermore, the proposed methods are able to provide estimates of the conditional error densities as well, and this is the aspect that we explore below. First, we present some suggestive evidence that the infinite mixture model defined in Equation (3) with an exponential kernel function is capable of consistently estimating conditional error densities. Furthermore, we illustrate the possible shortcoming of an infinite mixture model that is restricted to be symmetric. We consider three different finite sample sizes: 200, 800, … . The priors used for the infinite mixture model are set to $\theta \sim N(0, (0.01 I_2)^{-1})$, $\varphi \sim \text{Exp}(2)$, $V_j \sim \text{Beta}(1, 1)$, $\Gamma_j \sim U(\min\{X\}, \max\{X\})$, $\sigma_{ji} \sim IG(3/2, 2)$, $\mu_j \sim N(0, 4)$ and $(\pi_j, 1 - \pi_j) \sim \text{Dir}(5, 5)$ for all $j$ and $i$ for the unrestricted infinite model. For the symmetric model we enforce $\pi_j = 1/2$ and $\sigma_{j1} = \sigma_{j2}$, as it can be shown that the infinite mixture model defined in Equation (3) with these restrictions can consistently estimate any general symmetric conditional error density, as proposed by Pati et al. (2013). The models are estimated via MCMC Gibbs sampling

methods as described in the Appendix, keeping the last 5000 draws with the first 5000 discarded as burn-in. Below, we explore in more detail the asymmetric DGP (iii). In Figure 1 we present the results of the conditional density estimation. It can be seen that with an increasing sample size the unconstrained infinite mixture model consistently estimates the conditional error densities. In addition, in Figure 2 we present the contour plots for the largest sample size. From Figure 2, one can note that the constrained symmetric infinite mixture model, depicted in the center panel, is unable to estimate this skewed distribution. Moreover, one can infer that the pseudo-true value of the linear regression coefficient in this case is indeed negative, as has been suggested by Müller (forthcoming). Hence, the proposed unconstrained infinite CLR model consistently estimates both the coefficients and the conditional error densities, as desired, while in some cases, such as DGP (iii), the flexible heteroscedastic symmetric restriction might lead to inconsistent estimation of both the parameters of interest and the conditional densities.

Figure 1: Estimated conditional densities for different covariate values and sample sizes. The solid lines are the true values, the red dotted lines are the posterior means, and the blue dashed lines are pointwise 95% equal-tailed credible intervals.

Figure 2: Left panel: Posterior means of $p(y \mid x)$ for the unconstrained CLR. Center panel: Posterior means of $p(y \mid x)$ for the symmetric SymCLR. Right panel: True $p(y \mid x)$.

We can also compare the performance of the unconstrained CLR and the symmetric CLR for DGP (iv). As expected, the unconstrained CLR is able to consistently estimate the conditional error densities, as can be seen from Figure 3. However, in this case the symmetric CLR correctly estimates the conditional residual densities as well, as can be seen from the contour plots in Figure 4, since the true error densities are symmetric. As these examples show, the approach considered in this paper is capable of consistently estimating heteroscedastic error densities and a parametric mean function, while some other common specifications of non-parametric priors for residual densities might be misspecified. The unfortunate consequence of such approaches is that even when only the error density is misspecified and the parametric mean function is correctly specified, the parameters of interest, such as linear regression coefficients, may still be estimated inconsistently. The models considered in this paper propose a new specification of a heteroscedastic error density in order to obtain posterior consistency for both the mean regression function and the conditional error densities.

Figure 3: Estimated conditional densities for different covariate values and sample sizes. The solid lines are the true values, the red dotted lines are the posterior means, and the dashed lines are pointwise 95% equal-tailed credible intervals.

Figure 4: Left panel: Posterior means of $p(y \mid x)$ for the unconstrained CLR. Center panel: Posterior means of $p(y \mid x)$ for the symmetric SymCLR. Right panel: True $p(y \mid x)$.

5 Appendix

5.1 Proofs

Proof. (Theorem 1) Let $A_j^m$, $j = 0, 1, \ldots, m$, be a partition of $\mathcal{Y} \subseteq \mathbb{R}$, where $A_1^m, \ldots, A_m^m$ are adjacent intervals with side length $h_m$ and $A_0^m$ is the rest of the set $\mathcal{Y}$. Let $B_j^m$, $j = 1, \ldots, N(m)$, be a partition of the bounded set $\mathcal{X}$, where $B_1^m, \ldots, B_{N(m)}^m$ are adjacent cubes with side length $\lambda_m$ that cover the whole of $\mathcal{X}$ and $N(m) = m^d$. Furthermore, let $x_j^m$ be the center of the cube $B_j^m$. We require that, by construction, $h_m, \lambda_m \to 0$ as $m \to \infty$.

We will construct a model $(\theta, \eta^k)$ such that $d_{KL}(f_0(\cdot \mid \cdot), p(\cdot \mid \cdot, \theta, \eta^k)) < \epsilon$. First choose $\theta = \theta_0$ and consider a model defined by some nuisance parameters $\eta^m$ as

$$p(y \mid x, \theta_0, \eta^m) = \sum_{j=1}^{N(m)} \alpha_j(x) \sum_{i=0}^{m} F_0(A_i^m \mid x_j^m)\, \phi\left(y - h(x, \theta_0); \mu_{ji}, \sigma_{ji}^2\right),$$

$$\alpha_j(x) = \frac{\exp\left(-c_m (x_j^{m\prime} x_j^m - 2 x_j^{m\prime} x)\right)}{\sum_{l=0}^{N(m)} \exp\left(-c_m (x_l^{m\prime} x_l^m - 2 x_l^{m\prime} x)\right)}, \tag{6}$$

such that $\sum_{i=0}^{m} F_0(A_i^m \mid x_j^m)\, \mu_{ji} = 0$ for all $j \in \{1, \ldots, N(m)\}$, so as to impose the conditional moment restriction in the proposed model. We will show that, under appropriate conditions on $c_m$, some collection of the $\alpha_j(x)$ approximates $1_{B_j^m}(x)$. Given $x \in \mathcal{X}$, define $I_m(x, s_m) = \{j : \|x_j^m - x\|^2 < s_m\}$, where $s_m = \lambda_m^2 d$ is the squared diagonal of $B_j^m$. Using the arguments of the proof of Proposition 4.1 of Norets (2010), we know that for $m$ large enough,

$$\sum_{j \in I_m(x, s_m)} \alpha_j(x) \geq 1 - \exp\{-c_m s_m\}/N(m). \tag{7}$$

One can always pick $c_m$ such that $\exp\{-c_m s_m\}/N(m) \to 0$, and then the collection of $\alpha_j(x)$, $j \in I_m(x, s_m)$, approximates $1_{B_j^m}(x)$.

By Assumption 4 there exists $L$ large enough such that $|h(x, \theta_0) - h(x_j^m, \theta_0)| \leq L \|x - x_j^m\|$, as $h$ is Lipschitz continuous. Let $\bar\delta_m = \delta_m + L s_m^{1/2}$, where $\delta_m \to 0$. Then for any $j \in I_m(x, s_m)$ and $A_i^m \subseteq C_{\bar\delta_m}(y)$, where $C_\delta(y)$ is an interval centered at $y$ with length $\delta$, the following holds:

$$F_0(A_i^m \mid x_j^m) \geq \lambda(A_i^m) \inf_{z \in C_{\bar\delta_m}(y),\, \|t - x\|^2 \leq s_m} f_0(z \mid t). \tag{8}$$

In order to impose a conditional mean of 0 we require that, for each $x_j^m$, the parameters $\{\mu_{ji}\}_{i=0}^{m}$ satisfy $\sum_{i=0}^{m} F_0(A_i^m \mid x_j^m)\, \mu_{ji} = 0$. Then, given $x_j^m$ and $j$, for each $i = 0, \ldots, m$ define $\mu_{ji}$ as

$$\mu_{ji} = \frac{\int_{A_i^m} f_0(y \mid x_j^m)\left(y - h(x_j^m, \theta_0)\right) dy}{F_0(A_i^m \mid x_j^m)}$$

if $F_0(A_i^m \mid x_j^m) > 0$. If $F_0(A_i^m \mid x_j^m) = 0$, then define $\mu_{ji} = \bar\mu_i - h(x_j^m, \theta_0)$, where $\bar\mu_i$ is the center of the interval $A_i^m$; for $i = 0$ define $\mu_{j0} = 0$. Then by construction $\sum_{i=0}^{m} F_0(A_i^m \mid x_j^m)\, \mu_{ji} = 0$ for each $j = 1, \ldots, N(m)$.

For some $j \in I_m(x, s_m)$ let $y_j = y - (h(x, \theta_0) - h(x_j^m, \theta_0))$; then $C_{\delta_m}(y_j) \subseteq C_{\bar\delta_m}(y)$ by the definitions of $\delta_m$ and $\bar\delta_m$. Let $\sigma_{ji} = \sigma_m$ if $i > 0$ and $\sigma_{j0} = \sigma_0$, to be chosen later. For $m$ large enough,

$$\sum_{i :\, A_i^m \subseteq C_{\delta_m}(y_j)} \lambda(A_i^m)\, \phi\left(y - h(x, \theta_0); \mu_{ji}, \sigma_m^2\right) = \sum_{i :\, A_i^m \subseteq C_{\delta_m}(y_j)} \lambda(A_i^m)\, \phi\left(y_j; \tilde\mu_{ji}, \sigma_m^2\right) \geq 1 - \frac{3 h_m}{(2\pi)^{1/2} \sigma_m} - \frac{8 \sigma_m}{(2\pi)^{1/2} \delta_m}, \tag{9}$$

with the last inequality derived from Lemmas 1 and 2 of Norets and Pelenis (2012),$^1$ where $\tilde\mu_{ji} = \int_{A_i^m} f_0(y \mid x_j^m)\, y\, dy \,/\, F_0(A_i^m \mid x_j^m)$.

Then equation (6) and inequalities (7), (8) and (9) can be combined to show that for any given $(x, y)$ there exists $m$ large enough such that

$$p(y \mid x, \theta_0, \eta^m) > \sum_{j \in I_m(x, s_m)} \alpha_j(x) \sum_{i :\, A_i^m \subseteq C_{\delta_m}(y_j)} F_0(A_i^m \mid x_j^m)\, \phi\left(y - h(x, \theta_0); \mu_{ji}, \sigma_m^2\right)$$
$$\geq \inf_{z \in C_{\bar\delta_m}(y),\, \|t - x\|^2 \leq s_m} f_0(z \mid t) \left(1 - \frac{3 h_m}{(2\pi)^{1/2} \sigma_m} - \frac{8 \sigma_m}{(2\pi)^{1/2} \delta_m}\right) \left(1 - \frac{\exp\{-c_m s_m\}}{N(m)}\right).$$

Let $\{\delta_m, \sigma_m, h_m, c_m, s_m\}$ satisfy the following: $\delta_m \to 0$, $\sigma_m / \delta_m \to 0$, $h_m / \sigma_m \to 0$, $c_m \to \infty$, $s_m \to 0$, $\exp\{-c_m s_m\}/N(m) \to 0$. Hence for any given $(x, y)$ and a given $\epsilon > 0$ there exists $M_1$ large enough such that $m > M_1$ implies

$$p(y \mid x, \theta_0, \eta^m) > \inf_{z \in C_{\bar\delta_m}(y),\, \|t - x\|^2 \leq s_m} f_0(z \mid t)\, (1 - \epsilon).$$

By Assumption 2, $f_0(y \mid x)$ is continuous in $(y, x)$, and if $f_0(y \mid x) > 0$, then there exists $M_2$ large enough such that $m > M_2$ implies

$$\frac{f_0(y \mid x)}{\inf_{z \in C_{\bar\delta_m}(y),\, \|t - x\|^2 \leq s_m} f_0(z \mid t)} \leq 1 + \epsilon,$$

since $s_m \to 0$ and $\bar\delta_m \to 0$. Then for any $m \geq \max\{M_1, M_2\}$,

$$\max\left\{1, \frac{f_0(y \mid x)}{p(y \mid x, \theta_0, \eta^m)}\right\} \leq \frac{f_0(y \mid x)}{\inf_{z \in C_{\bar\delta_m}(y),\, \|t - x\|^2 \leq s_m} f_0(z \mid t)\, (1 - \epsilon)} \leq \frac{1 + \epsilon}{1 - \epsilon}.$$

Henceforth, $\log \max\{1, f_0(y \mid x)/p(y \mid x, \theta_0, \eta^m)\} \to 0$ a.s. $F_0$, as long as $f_0(y \mid x)$ is continuous

$^1$ With a minor adjustment in the proofs, since $\tilde\mu_{ji} \in A_i^m$ rather than being the center of $A_i^m$.

in $(y, x)$ a.s. $F_0$. This result establishes pointwise convergence. Furthermore, we would like to apply the Dominated Convergence Theorem (DCT) to show that $d_{KL}(f_0(\cdot \mid \cdot), p(\cdot \mid \cdot, \theta_0, \eta^m)) \to 0$ as $m \to \infty$. To apply the DCT we need to establish an integrable upper bound. We know from equation (7) that for any $x \in \mathcal{X}$ there exists $m$ large enough such that $\sum_{j \in I_m(x, s_m)} \alpha_j(x) > 1/2$. Similarly, from equation (9), for any $(y, x)$ the bound on the Riemann sum is bounded below by $1/2$ for $m$ large enough. These facts will be used in deriving an integrable upper bound. Further note that pointwise convergence is obtained if, instead of $F_0(A_0^m \mid x_j^m)\, \phi(y - h(x, \theta_0); \mu_{j0}, \sigma_0^2)$ in the mixture model, we consider $0.5\, F_0(A_0^m \mid x_j^m)\left(\phi(y - h(x, \theta_0); 2\mu_{j0}, \sigma_0^2) + \phi(y - h(x, \theta_0); 0, \sigma_0^2)\right)$. Combining these three observations, we know that for $m$ large enough and $j \in I_m(x, s_m)$,

$$p(y \mid x, \theta_0, \eta^m) = \sum_{j=1}^{N(m)} \alpha_j(x) \sum_{i=0}^{m} F_0(A_i^m \mid x_j^m)\, \phi\left(y - h(x, \theta_0); \mu_{ji}, \sigma_{ji}^2\right)$$
$$> 0.25\, \delta\, \phi\left(y - h(x, \theta_0); 0, \sigma_0^2\right) \inf_{z \in C_\delta(y),\, \|t - x\|^2 \leq \delta} f_0(z \mid t),$$

where the inequality is derived by the same reasoning as in the proof of Proposition 2.1 of Norets (2010). Then

$$\log \max\left\{1, \frac{f_0(y \mid x)}{p(y \mid x, \theta_0, \eta^m)}\right\} \leq \log \max\left\{1, \frac{f_0(y \mid x)}{0.25\, \delta\, \phi(y - h(x, \theta_0); 0, \sigma_0^2)\, \inf_{z \in C_\delta(y),\, \|t - x\|^2 \leq \delta} f_0(z \mid t)}\right\}$$
$$\leq \log \frac{f_0(y \mid x)}{0.25\, \delta\, \phi(y - h(x, \theta_0); 0, \sigma_0^2)} + \log \frac{f_0(y \mid x)}{\inf_{z \in C_\delta(y),\, \|t - x\|^2 \leq \delta} f_0(z \mid t)},$$

and the first logarithm above is integrable by Assumption 3, while the second logarithm is integrable by Assumption 5. Therefore, applying the DCT we get that $d_{KL}(f_0(\cdot \mid \cdot), p(\cdot \mid \cdot, \theta_0, \eta^m)) \to 0$.


More information

Lecture 35: December The fundamental statistical distances

Lecture 35: December The fundamental statistical distances 36-705: Intermediate Statistics Fall 207 Lecturer: Siva Balakrishnan Lecture 35: December 4 Today we will discuss distances and metrics between distributions that are useful in statistics. I will be lose

More information

Priors for the frequentist, consistency beyond Schwartz

Priors for the frequentist, consistency beyond Schwartz Victoria University, Wellington, New Zealand, 11 January 2016 Priors for the frequentist, consistency beyond Schwartz Bas Kleijn, KdV Institute for Mathematics Part I Introduction Bayesian and Frequentist

More information

Bayesian Nonparametric Modelling with the Dirichlet Process Regression Smoother

Bayesian Nonparametric Modelling with the Dirichlet Process Regression Smoother Bayesian Nonparametric Modelling with the Dirichlet Process Regression Smoother J. E. Griffin and M. F. J. Steel University of Warwick Bayesian Nonparametric Modelling with the Dirichlet Process Regression

More information

Bayesian Heteroskedasticity-Robust Regression. Richard Startz * revised February Abstract

Bayesian Heteroskedasticity-Robust Regression. Richard Startz * revised February Abstract Bayesian Heteroskedasticity-Robust Regression Richard Startz * revised February 2015 Abstract I offer here a method for Bayesian heteroskedasticity-robust regression. The Bayesian version is derived by

More information

Notes, March 4, 2013, R. Dudley Maximum likelihood estimation: actual or supposed

Notes, March 4, 2013, R. Dudley Maximum likelihood estimation: actual or supposed 18.466 Notes, March 4, 2013, R. Dudley Maximum likelihood estimation: actual or supposed 1. MLEs in exponential families Let f(x,θ) for x X and θ Θ be a likelihood function, that is, for present purposes,

More information

Markov Chain Monte Carlo (MCMC)

Markov Chain Monte Carlo (MCMC) Markov Chain Monte Carlo (MCMC Dependent Sampling Suppose we wish to sample from a density π, and we can evaluate π as a function but have no means to directly generate a sample. Rejection sampling can

More information

BAYESIAN INFERENCE IN A CLASS OF PARTIALLY IDENTIFIED MODELS

BAYESIAN INFERENCE IN A CLASS OF PARTIALLY IDENTIFIED MODELS BAYESIAN INFERENCE IN A CLASS OF PARTIALLY IDENTIFIED MODELS BRENDAN KLINE AND ELIE TAMER UNIVERSITY OF TEXAS AT AUSTIN AND HARVARD UNIVERSITY Abstract. This paper develops a Bayesian approach to inference

More information

A Bayesian perspective on GMM and IV

A Bayesian perspective on GMM and IV A Bayesian perspective on GMM and IV Christopher A. Sims Princeton University sims@princeton.edu November 26, 2013 What is a Bayesian perspective? A Bayesian perspective on scientific reporting views all

More information

Flexible Estimation of Treatment Effect Parameters

Flexible Estimation of Treatment Effect Parameters Flexible Estimation of Treatment Effect Parameters Thomas MaCurdy a and Xiaohong Chen b and Han Hong c Introduction Many empirical studies of program evaluations are complicated by the presence of both

More information

Identification and Estimation of Partially Linear Censored Regression Models with Unknown Heteroscedasticity

Identification and Estimation of Partially Linear Censored Regression Models with Unknown Heteroscedasticity Identification and Estimation of Partially Linear Censored Regression Models with Unknown Heteroscedasticity Zhengyu Zhang School of Economics Shanghai University of Finance and Economics zy.zhang@mail.shufe.edu.cn

More information

Multivariate Versus Multinomial Probit: When are Binary Decisions Made Separately also Jointly Optimal?

Multivariate Versus Multinomial Probit: When are Binary Decisions Made Separately also Jointly Optimal? Multivariate Versus Multinomial Probit: When are Binary Decisions Made Separately also Jointly Optimal? Dale J. Poirier and Deven Kapadia University of California, Irvine March 10, 2012 Abstract We provide

More information

Latent Variable Models for Binary Data. Suppose that for a given vector of explanatory variables x, the latent

Latent Variable Models for Binary Data. Suppose that for a given vector of explanatory variables x, the latent Latent Variable Models for Binary Data Suppose that for a given vector of explanatory variables x, the latent variable, U, has a continuous cumulative distribution function F (u; x) and that the binary

More information

Estimation of Dynamic Regression Models

Estimation of Dynamic Regression Models University of Pavia 2007 Estimation of Dynamic Regression Models Eduardo Rossi University of Pavia Factorization of the density DGP: D t (x t χ t 1, d t ; Ψ) x t represent all the variables in the economy.

More information

Inconsistency of Bayesian inference when the model is wrong, and how to repair it

Inconsistency of Bayesian inference when the model is wrong, and how to repair it Inconsistency of Bayesian inference when the model is wrong, and how to repair it Peter Grünwald Thijs van Ommen Centrum Wiskunde & Informatica, Amsterdam Universiteit Leiden June 3, 2015 Outline 1 Introduction

More information

Motivation Scale Mixutres of Normals Finite Gaussian Mixtures Skew-Normal Models. Mixture Models. Econ 690. Purdue University

Motivation Scale Mixutres of Normals Finite Gaussian Mixtures Skew-Normal Models. Mixture Models. Econ 690. Purdue University Econ 690 Purdue University In virtually all of the previous lectures, our models have made use of normality assumptions. From a computational point of view, the reason for this assumption is clear: combined

More information

Lawrence D. Brown* and Daniel McCarthy*

Lawrence D. Brown* and Daniel McCarthy* Comments on the paper, An adaptive resampling test for detecting the presence of significant predictors by I. W. McKeague and M. Qian Lawrence D. Brown* and Daniel McCarthy* ABSTRACT: This commentary deals

More information

Gaussian kernel GARCH models

Gaussian kernel GARCH models Gaussian kernel GARCH models Xibin (Bill) Zhang and Maxwell L. King Department of Econometrics and Business Statistics Faculty of Business and Economics 7 June 2013 Motivation A regression model is often

More information

Gibbs Sampling in Linear Models #1

Gibbs Sampling in Linear Models #1 Gibbs Sampling in Linear Models #1 Econ 690 Purdue University Justin L Tobias Gibbs Sampling #1 Outline 1 Conditional Posterior Distributions for Regression Parameters in the Linear Model [Lindley and

More information

Semi-Parametric Importance Sampling for Rare-event probability Estimation

Semi-Parametric Importance Sampling for Rare-event probability Estimation Semi-Parametric Importance Sampling for Rare-event probability Estimation Z. I. Botev and P. L Ecuyer IMACS Seminar 2011 Borovets, Bulgaria Semi-Parametric Importance Sampling for Rare-event probability

More information

401 Review. 6. Power analysis for one/two-sample hypothesis tests and for correlation analysis.

401 Review. 6. Power analysis for one/two-sample hypothesis tests and for correlation analysis. 401 Review Major topics of the course 1. Univariate analysis 2. Bivariate analysis 3. Simple linear regression 4. Linear algebra 5. Multiple regression analysis Major analysis methods 1. Graphical analysis

More information

Parametric Inference Maximum Likelihood Inference Exponential Families Expectation Maximization (EM) Bayesian Inference Statistical Decison Theory

Parametric Inference Maximum Likelihood Inference Exponential Families Expectation Maximization (EM) Bayesian Inference Statistical Decison Theory Statistical Inference Parametric Inference Maximum Likelihood Inference Exponential Families Expectation Maximization (EM) Bayesian Inference Statistical Decison Theory IP, José Bioucas Dias, IST, 2007

More information

Eco517 Fall 2014 C. Sims FINAL EXAM

Eco517 Fall 2014 C. Sims FINAL EXAM Eco517 Fall 2014 C. Sims FINAL EXAM This is a three hour exam. You may refer to books, notes, or computer equipment during the exam. You may not communicate, either electronically or in any other way,

More information

What s New in Econometrics. Lecture 13

What s New in Econometrics. Lecture 13 What s New in Econometrics Lecture 13 Weak Instruments and Many Instruments Guido Imbens NBER Summer Institute, 2007 Outline 1. Introduction 2. Motivation 3. Weak Instruments 4. Many Weak) Instruments

More information

A Course in Applied Econometrics Lecture 18: Missing Data. Jeff Wooldridge IRP Lectures, UW Madison, August Linear model with IVs: y i x i u i,

A Course in Applied Econometrics Lecture 18: Missing Data. Jeff Wooldridge IRP Lectures, UW Madison, August Linear model with IVs: y i x i u i, A Course in Applied Econometrics Lecture 18: Missing Data Jeff Wooldridge IRP Lectures, UW Madison, August 2008 1. When Can Missing Data be Ignored? 2. Inverse Probability Weighting 3. Imputation 4. Heckman-Type

More information

Can we do statistical inference in a non-asymptotic way? 1

Can we do statistical inference in a non-asymptotic way? 1 Can we do statistical inference in a non-asymptotic way? 1 Guang Cheng 2 Statistics@Purdue www.science.purdue.edu/bigdata/ ONR Review Meeting@Duke Oct 11, 2017 1 Acknowledge NSF, ONR and Simons Foundation.

More information

Statistical Inference

Statistical Inference Statistical Inference Robert L. Wolpert Institute of Statistics and Decision Sciences Duke University, Durham, NC, USA Week 12. Testing and Kullback-Leibler Divergence 1. Likelihood Ratios Let 1, 2, 2,...

More information

Foundations of Nonparametric Bayesian Methods

Foundations of Nonparametric Bayesian Methods 1 / 27 Foundations of Nonparametric Bayesian Methods Part II: Models on the Simplex Peter Orbanz http://mlg.eng.cam.ac.uk/porbanz/npb-tutorial.html 2 / 27 Tutorial Overview Part I: Basics Part II: Models

More information

STATS 200: Introduction to Statistical Inference. Lecture 29: Course review

STATS 200: Introduction to Statistical Inference. Lecture 29: Course review STATS 200: Introduction to Statistical Inference Lecture 29: Course review Course review We started in Lecture 1 with a fundamental assumption: Data is a realization of a random process. The goal throughout

More information

Statistics: Learning models from data

Statistics: Learning models from data DS-GA 1002 Lecture notes 5 October 19, 2015 Statistics: Learning models from data Learning models from data that are assumed to be generated probabilistically from a certain unknown distribution is a crucial

More information

Dirichlet Process Mixtures of Generalized Linear Models

Dirichlet Process Mixtures of Generalized Linear Models Dirichlet Process Mixtures of Generalized Linear Models Lauren A. Hannah Department of Operations Research and Financial Engineering Princeton University Princeton, NJ 08544, USA David M. Blei Department

More information

Bayesian Methods for Machine Learning

Bayesian Methods for Machine Learning Bayesian Methods for Machine Learning CS 584: Big Data Analytics Material adapted from Radford Neal s tutorial (http://ftp.cs.utoronto.ca/pub/radford/bayes-tut.pdf), Zoubin Ghahramni (http://hunch.net/~coms-4771/zoubin_ghahramani_bayesian_learning.pdf),

More information

A Bayesian Nonparametric Approach to Monotone Missing Data in Longitudinal Studies with Informative Missingness

A Bayesian Nonparametric Approach to Monotone Missing Data in Longitudinal Studies with Informative Missingness A Bayesian Nonparametric Approach to Monotone Missing Data in Longitudinal Studies with Informative Missingness A. Linero and M. Daniels UF, UT-Austin SRC 2014, Galveston, TX 1 Background 2 Working model

More information

On the Support of MacEachern s Dependent Dirichlet Processes and Extensions

On the Support of MacEachern s Dependent Dirichlet Processes and Extensions Bayesian Analysis (2012) 7, Number 2, pp. 277 310 On the Support of MacEachern s Dependent Dirichlet Processes and Extensions Andrés F. Barrientos, Alejandro Jara and Fernando A. Quintana Abstract. We

More information

Robustness to Parametric Assumptions in Missing Data Models

Robustness to Parametric Assumptions in Missing Data Models Robustness to Parametric Assumptions in Missing Data Models Bryan Graham NYU Keisuke Hirano University of Arizona April 2011 Motivation Motivation We consider the classic missing data problem. In practice

More information

Invariant HPD credible sets and MAP estimators

Invariant HPD credible sets and MAP estimators Bayesian Analysis (007), Number 4, pp. 681 69 Invariant HPD credible sets and MAP estimators Pierre Druilhet and Jean-Michel Marin Abstract. MAP estimators and HPD credible sets are often criticized in

More information

Computational statistics

Computational statistics Computational statistics Markov Chain Monte Carlo methods Thierry Denœux March 2017 Thierry Denœux Computational statistics March 2017 1 / 71 Contents of this chapter When a target density f can be evaluated

More information

Approximating high-dimensional posteriors with nuisance parameters via integrated rotated Gaussian approximation (IRGA)

Approximating high-dimensional posteriors with nuisance parameters via integrated rotated Gaussian approximation (IRGA) Approximating high-dimensional posteriors with nuisance parameters via integrated rotated Gaussian approximation (IRGA) Willem van den Boom Department of Statistics and Applied Probability National University

More information

Verifying Regularity Conditions for Logit-Normal GLMM

Verifying Regularity Conditions for Logit-Normal GLMM Verifying Regularity Conditions for Logit-Normal GLMM Yun Ju Sung Charles J. Geyer January 10, 2006 In this note we verify the conditions of the theorems in Sung and Geyer (submitted) for the Logit-Normal

More information

Statistical Methods for Handling Incomplete Data Chapter 2: Likelihood-based approach

Statistical Methods for Handling Incomplete Data Chapter 2: Likelihood-based approach Statistical Methods for Handling Incomplete Data Chapter 2: Likelihood-based approach Jae-Kwang Kim Department of Statistics, Iowa State University Outline 1 Introduction 2 Observed likelihood 3 Mean Score

More information

Dirichlet Process Mixtures of Generalized Linear Models

Dirichlet Process Mixtures of Generalized Linear Models Lauren A. Hannah David M. Blei Warren B. Powell Department of Computer Science, Princeton University Department of Operations Research and Financial Engineering, Princeton University Department of Operations

More information

BAYESIAN NONPARAMETRIC MODELLING WITH THE DIRICHLET PROCESS REGRESSION SMOOTHER

BAYESIAN NONPARAMETRIC MODELLING WITH THE DIRICHLET PROCESS REGRESSION SMOOTHER Statistica Sinica 20 (2010), 1507-1527 BAYESIAN NONPARAMETRIC MODELLING WITH THE DIRICHLET PROCESS REGRESSION SMOOTHER J. E. Griffin and M. F. J. Steel University of Kent and University of Warwick Abstract:

More information

Web Appendix for Hierarchical Adaptive Regression Kernels for Regression with Functional Predictors by D. B. Woodard, C. Crainiceanu, and D.

Web Appendix for Hierarchical Adaptive Regression Kernels for Regression with Functional Predictors by D. B. Woodard, C. Crainiceanu, and D. Web Appendix for Hierarchical Adaptive Regression Kernels for Regression with Functional Predictors by D. B. Woodard, C. Crainiceanu, and D. Ruppert A. EMPIRICAL ESTIMATE OF THE KERNEL MIXTURE Here we

More information

UNIFORMLY MOST POWERFUL CYCLIC PERMUTATION INVARIANT DETECTION FOR DISCRETE-TIME SIGNALS

UNIFORMLY MOST POWERFUL CYCLIC PERMUTATION INVARIANT DETECTION FOR DISCRETE-TIME SIGNALS UNIFORMLY MOST POWERFUL CYCLIC PERMUTATION INVARIANT DETECTION FOR DISCRETE-TIME SIGNALS F. C. Nicolls and G. de Jager Department of Electrical Engineering, University of Cape Town Rondebosch 77, South

More information

Sparse Nonparametric Density Estimation in High Dimensions Using the Rodeo

Sparse Nonparametric Density Estimation in High Dimensions Using the Rodeo Outline in High Dimensions Using the Rodeo Han Liu 1,2 John Lafferty 2,3 Larry Wasserman 1,2 1 Statistics Department, 2 Machine Learning Department, 3 Computer Science Department, Carnegie Mellon University

More information

Testing Restrictions and Comparing Models

Testing Restrictions and Comparing Models Econ. 513, Time Series Econometrics Fall 00 Chris Sims Testing Restrictions and Comparing Models 1. THE PROBLEM We consider here the problem of comparing two parametric models for the data X, defined by

More information

A Note on Bootstraps and Robustness. Tony Lancaster, Brown University, December 2003.

A Note on Bootstraps and Robustness. Tony Lancaster, Brown University, December 2003. A Note on Bootstraps and Robustness Tony Lancaster, Brown University, December 2003. In this note we consider several versions of the bootstrap and argue that it is helpful in explaining and thinking about

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistical Sciences! rsalakhu@cs.toronto.edu! h0p://www.cs.utoronto.ca/~rsalakhu/ Lecture 7 Approximate

More information

Bayesian estimation of bandwidths for a nonparametric regression model with a flexible error density

Bayesian estimation of bandwidths for a nonparametric regression model with a flexible error density ISSN 1440-771X Australia Department of Econometrics and Business Statistics http://www.buseco.monash.edu.au/depts/ebs/pubs/wpapers/ Bayesian estimation of bandwidths for a nonparametric regression model

More information

Bayesian nonparametric regression with varying residual density

Bayesian nonparametric regression with varying residual density Ann Inst Stat Math 014) 66:1 31 DOI 10.1007/s10463-013-0415-z Bayesian nonparametric regression with varying residual density Debdeep Pati David B. Dunson Received: 0 May 010 / Revised: 7 March 013 / Published

More information

Bayesian Aggregation for Extraordinarily Large Dataset

Bayesian Aggregation for Extraordinarily Large Dataset Bayesian Aggregation for Extraordinarily Large Dataset Guang Cheng 1 Department of Statistics Purdue University www.science.purdue.edu/bigdata Department Seminar Statistics@LSE May 19, 2017 1 A Joint Work

More information

Statistics - Lecture One. Outline. Charlotte Wickham 1. Basic ideas about estimation

Statistics - Lecture One. Outline. Charlotte Wickham  1. Basic ideas about estimation Statistics - Lecture One Charlotte Wickham wickham@stat.berkeley.edu http://www.stat.berkeley.edu/~wickham/ Outline 1. Basic ideas about estimation 2. Method of Moments 3. Maximum Likelihood 4. Confidence

More information

Averaging Estimators for Regressions with a Possible Structural Break

Averaging Estimators for Regressions with a Possible Structural Break Averaging Estimators for Regressions with a Possible Structural Break Bruce E. Hansen University of Wisconsin y www.ssc.wisc.edu/~bhansen September 2007 Preliminary Abstract This paper investigates selection

More information

Hypothesis Testing. Econ 690. Purdue University. Justin L. Tobias (Purdue) Testing 1 / 33

Hypothesis Testing. Econ 690. Purdue University. Justin L. Tobias (Purdue) Testing 1 / 33 Hypothesis Testing Econ 690 Purdue University Justin L. Tobias (Purdue) Testing 1 / 33 Outline 1 Basic Testing Framework 2 Testing with HPD intervals 3 Example 4 Savage Dickey Density Ratio 5 Bartlett

More information

Bayesian nonparametric regression. with varying residual density

Bayesian nonparametric regression. with varying residual density Bayesian nonparametric regression with varying residual density Debdeep Pati, David B. Dunson Department of Statistical Science, Duke University, NC 27708 email: debdeep.pati@stat.duke.edu, dunson@stat.duke.edu

More information

NONPARAMETRIC BAYESIAN INFERENCE ON PLANAR SHAPES

NONPARAMETRIC BAYESIAN INFERENCE ON PLANAR SHAPES NONPARAMETRIC BAYESIAN INFERENCE ON PLANAR SHAPES Author: Abhishek Bhattacharya Coauthor: David Dunson Department of Statistical Science, Duke University 7 th Workshop on Bayesian Nonparametrics Collegio

More information

Approximate Bayesian computation for spatial extremes via open-faced sandwich adjustment

Approximate Bayesian computation for spatial extremes via open-faced sandwich adjustment Approximate Bayesian computation for spatial extremes via open-faced sandwich adjustment Ben Shaby SAMSI August 3, 2010 Ben Shaby (SAMSI) OFS adjustment August 3, 2010 1 / 29 Outline 1 Introduction 2 Spatial

More information

Regression I: Mean Squared Error and Measuring Quality of Fit

Regression I: Mean Squared Error and Measuring Quality of Fit Regression I: Mean Squared Error and Measuring Quality of Fit -Applied Multivariate Analysis- Lecturer: Darren Homrighausen, PhD 1 The Setup Suppose there is a scientific problem we are interested in solving

More information

GWAS V: Gaussian processes

GWAS V: Gaussian processes GWAS V: Gaussian processes Dr. Oliver Stegle Christoh Lippert Prof. Dr. Karsten Borgwardt Max-Planck-Institutes Tübingen, Germany Tübingen Summer 2011 Oliver Stegle GWAS V: Gaussian processes Summer 2011

More information

Chapter 2: Fundamentals of Statistics Lecture 15: Models and statistics

Chapter 2: Fundamentals of Statistics Lecture 15: Models and statistics Chapter 2: Fundamentals of Statistics Lecture 15: Models and statistics Data from one or a series of random experiments are collected. Planning experiments and collecting data (not discussed here). Analysis:

More information

AN EMPIRICAL LIKELIHOOD RATIO TEST FOR NORMALITY

AN EMPIRICAL LIKELIHOOD RATIO TEST FOR NORMALITY Econometrics Working Paper EWP0401 ISSN 1485-6441 Department of Economics AN EMPIRICAL LIKELIHOOD RATIO TEST FOR NORMALITY Lauren Bin Dong & David E. A. Giles Department of Economics, University of Victoria

More information

Single Index Quantile Regression for Heteroscedastic Data

Single Index Quantile Regression for Heteroscedastic Data Single Index Quantile Regression for Heteroscedastic Data E. Christou M. G. Akritas Department of Statistics The Pennsylvania State University JSM, 2015 E. Christou, M. G. Akritas (PSU) SIQR JSM, 2015

More information

Bayesian Nonparametric Analysis of Longitudinal. Studies in the Presence of Informative Missingness

Bayesian Nonparametric Analysis of Longitudinal. Studies in the Presence of Informative Missingness Bayesian Nonparametric Analysis of Longitudinal Studies in the Presence of Informative Missingness Antonio R. Linero Department of Statistics, Florida State University, 214 Rogers Building, 117 N. Woodward

More information

ECE521 week 3: 23/26 January 2017

ECE521 week 3: 23/26 January 2017 ECE521 week 3: 23/26 January 2017 Outline Probabilistic interpretation of linear regression - Maximum likelihood estimation (MLE) - Maximum a posteriori (MAP) estimation Bias-variance trade-off Linear

More information

Statistical Inference of Covariate-Adjusted Randomized Experiments

Statistical Inference of Covariate-Adjusted Randomized Experiments 1 Statistical Inference of Covariate-Adjusted Randomized Experiments Feifang Hu Department of Statistics George Washington University Joint research with Wei Ma, Yichen Qin and Yang Li Email: feifang@gwu.edu

More information

Bayesian Interpretations of Heteroskedastic Consistent Covariance Estimators Using the Informed Bayesian Bootstrap

Bayesian Interpretations of Heteroskedastic Consistent Covariance Estimators Using the Informed Bayesian Bootstrap Bayesian Interpretations of Heteroskedastic Consistent Covariance Estimators Using the Informed Bayesian Bootstrap Dale J. Poirier University of California, Irvine September 1, 2008 Abstract This paper

More information

Overall Objective Priors

Overall Objective Priors Overall Objective Priors Jim Berger, Jose Bernardo and Dongchu Sun Duke University, University of Valencia and University of Missouri Recent advances in statistical inference: theory and case studies University

More information

POSTERIOR ANALYSIS OF THE MULTIPLICATIVE HETEROSCEDASTICITY MODEL

POSTERIOR ANALYSIS OF THE MULTIPLICATIVE HETEROSCEDASTICITY MODEL COMMUN. STATIST. THEORY METH., 30(5), 855 874 (2001) POSTERIOR ANALYSIS OF THE MULTIPLICATIVE HETEROSCEDASTICITY MODEL Hisashi Tanizaki and Xingyuan Zhang Faculty of Economics, Kobe University, Kobe 657-8501,

More information

Inference For High Dimensional M-estimates. Fixed Design Results

Inference For High Dimensional M-estimates. Fixed Design Results : Fixed Design Results Lihua Lei Advisors: Peter J. Bickel, Michael I. Jordan joint work with Peter J. Bickel and Noureddine El Karoui Dec. 8, 2016 1/57 Table of Contents 1 Background 2 Main Results and

More information

Dirichlet Process Mixtures of Generalized Linear Models

Dirichlet Process Mixtures of Generalized Linear Models Dirichlet Process Mixtures of Generalized Linear Models Lauren A. Hannah Department of Operations Research and Financial Engineering Princeton University Princeton, NJ 08544, USA David M. Blei Department

More information

Single Index Quantile Regression for Heteroscedastic Data

Single Index Quantile Regression for Heteroscedastic Data Single Index Quantile Regression for Heteroscedastic Data E. Christou M. G. Akritas Department of Statistics The Pennsylvania State University SMAC, November 6, 2015 E. Christou, M. G. Akritas (PSU) SIQR

More information

Lecture Notes based on Koop (2003) Bayesian Econometrics

Lecture Notes based on Koop (2003) Bayesian Econometrics Lecture Notes based on Koop (2003) Bayesian Econometrics A.Colin Cameron University of California - Davis November 15, 2005 1. CH.1: Introduction The concepts below are the essential concepts used throughout

More information

Pattern Recognition and Machine Learning. Bishop Chapter 2: Probability Distributions

Pattern Recognition and Machine Learning. Bishop Chapter 2: Probability Distributions Pattern Recognition and Machine Learning Chapter 2: Probability Distributions Cécile Amblard Alex Kläser Jakob Verbeek October 11, 27 Probability Distributions: General Density Estimation: given a finite

More information