Introduction to Bayesian nonparametrics
Andriy Norets
Department of Economics, Princeton University
September 2010

Outline
- Dirichlet process (DP) - a prior on the space of probability measures.
- Some applications of the DP.
- DP mixtures.
- Role of consistency in Bayesian nonparametrics.
- Schwartz posterior consistency theorem and extensions.
- Applications to conditional density estimation (consistency, MCMC estimation methods, comparison with classical methods on real data).

Books on Bayesian nonparametrics
- Ghosh and Ramamoorthi (2003) - foundations and theory up to 2003.
- Dey et al. (1998) - applications and MCMC estimation methods.
- Hjort (2010) - collection of up-to-date reviews of theoretical and applied work in different subfields.

Notation and discrete example
(X, B, P) - probability space for observations. M(X) - the space of probability measures on X (P ∈ M(X)). For Bayesian nonparametric inference we need to put a prior π on M(X).

Dirichlet process prior for X = {1,…,n} and M(X) = {p = (p_1,…,p_n) : p_i ≥ 0, Σ_i p_i = 1}:

  π(p_1,…,p_n) = [Γ(Σ_i α_i) / ∏_i Γ(α_i)] p_1^{α_1 − 1} ⋯ p_n^{α_n − 1},

known as the Dirichlet distribution with parameter α, D(α).
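The Dirichlet distribution above can be explored numerically. A minimal sketch (the parameter α is an arbitrary illustrative choice): draws from D(α) are points on the probability simplex, with mean α_i / Σ_j α_j.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = np.array([2.0, 3.0, 5.0])  # illustrative Dirichlet parameter

# Each draw from D(alpha) is a probability vector on {1, ..., n}.
draws = rng.dirichlet(alpha, size=100_000)

assert np.allclose(draws.sum(axis=1), 1.0)  # every draw is a p.m.f.
# Monte Carlo estimate of E[p_i] = alpha_i / sum(alpha) = (0.2, 0.3, 0.5)
print(draws.mean(axis=0))
```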
Discrete example continued
The likelihood has the same functional form as the prior:

  l(y_1,…,y_K | p) = p_1^{Σ_k 1{y_k = 1}} ⋯ p_n^{Σ_k 1{y_k = n}}.

Thus, the posterior is also a Dirichlet distribution:

  π(p | y) = D(α_1 + Σ_k 1{y_k = 1}, …, α_n + Σ_k 1{y_k = n}).

If we treat the vector α as a measure on X (α(i) = α_i),

  π(p | y) = D(α + Σ_{i=1}^K δ_{y_i}),

where δ_{y_i} is a point mass at y_i.

Ferguson (1973) construction
Kolmogorov's extension theorem for the general case. Define finite dimensional distributions first: for a finite partition X = ∪_{i=1}^n A_i, let

  (P(A_1),…,P(A_n)) ~ Dirichlet(α(A_1),…,α(A_n)),

where α is a measure on X (it is the parameter of the DP). Next, using properties of the Dirichlet distribution, show that an extension of these finite dimensional distributions to M(X), DP(α), is unique and that, as in the discrete case, the posterior is also DP(α + Σ_{i=1}^K δ_{y_i}).

Sethuraman (1994) construction
Sethuraman showed that the DP can be represented as

  DP(α) = Σ_{i=1}^∞ p_i δ_{θ_i},

where
  θ_i ~ α(·)/α(X), iid,
  p_1 = q_1, p_i = q_i ∏_{j<i} (1 − q_j)  (stick breaking),
  q_i ~ Beta(1, α(X)), iid.

Thus, the DP puts probability 1 on discrete probability measures.

Applications of DP prior
Ferguson (1973) showed that the posterior point estimates of the mean, variance, covariance, and CDF all converge to their sample analogs.
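The stick-breaking representation is easy to simulate after truncating the infinite sum. A hedged sketch (the base measure N(0,1) and the truncation level are illustrative choices, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(1)

def stick_breaking(concentration, base_sampler, n_atoms, rng):
    """Truncated Sethuraman representation of a draw from DP(alpha).

    concentration: alpha(X); base_sampler: draws from alpha(.)/alpha(X).
    Weights: p_1 = q_1, p_i = q_i * prod_{j<i}(1 - q_j), q_i ~ Beta(1, alpha(X)).
    """
    q = rng.beta(1.0, concentration, size=n_atoms)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - q)[:-1]))
    weights = q * remaining
    atoms = base_sampler(n_atoms, rng)
    return atoms, weights

atoms, weights = stick_breaking(
    concentration=5.0,
    base_sampler=lambda n, rng: rng.standard_normal(n),  # base measure N(0,1)
    n_atoms=1000,
    rng=rng,
)
# The truncated weights nearly exhaust the unit stick: the realized measure
# is discrete, as the construction implies.
print(weights.sum())
```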
Application of DP prior to estimation of the CDF
Under squared loss, the prior point estimate is

  F̂(t) = ∫ P(−∞, t] DP_α(dP) = α(−∞, t] / α(ℝ),

because (−∞, t] and (t, ∞) form a partition of ℝ and under DP_α,

  P(−∞, t] ~ Beta(α(−∞, t], α(t, ∞)).

To get the posterior point estimate, use α + Σ_{i=1}^K δ_{y_i} instead of α:

  F̂(t | y_1,…,y_K) = [α(ℝ)/(α(ℝ)+K)] · α(−∞, t]/α(ℝ)   (prior CDF)
                    + [K/(α(ℝ)+K)] · (empirical CDF).

Multinomial-Dirichlet model
Several authors use a multinomial likelihood with support given by the observations in the sample (or by all possible values for the observations) and an improper Dirichlet prior on the multinomial probabilities. This can be seen as a limit of the DP as α(X) → 0: the prior becomes uninformative and DP_{α + Σ_{i=1}^K δ_{y_i}} → D(Σ_{i=1}^K δ_{y_i}). A parameter of interest is defined as a function of (p_1,…,p_K) (parameters of the multinomial likelihood) and (y_1,…,y_K) (observations); e.g., the mean is Σ_i p_i y_i.

Applications of the multinomial-Dirichlet model
- Rubin (1981) (Bayesian bootstrap): the multinomial-Dirichlet model delivers a posterior distribution for certain parameters (mean, variance) which is almost identical to the classical bootstrap distribution.
- Lo (1987): asymptotic results for the Bayesian bootstrap.
- Lancaster (2003): OLS with the White heteroskedasticity-robust variance-covariance matrix can be justified by the Bayesian bootstrap if the parameter of interest is the minimizer of E(y − βx)².
- Poirier (2008) obtains different versions of White's estimator used in the classical literature under different prior specifications.
- Chamberlain and Imbens (2003) suggest the multinomial-Dirichlet model with added prior moment equality restrictions as a Bayesian way to handle moment equality models.

DP mixtures (most common use of DP)

  y_i | m_i, s_i ~ N(m_i, s_i²),
  (m_i, s_i) ~ F,
  F ~ DP(α).

By the Sethuraman construction, this is equivalent to a countable mixture of normals:

  y_i ~ Σ_{j=1}^∞ p_j N(m_j, s_j²).

Practical MCMC algorithms are available (Escobar and West (1995), Neal (2000)).
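The posterior mean CDF above is a convex combination of the prior CDF and the empirical CDF. A minimal sketch, assuming (for illustration only) a base measure α proportional to N(0,1) with total mass a0:

```python
import numpy as np
from math import erf

def prior_cdf(t):
    # For alpha = a0 * N(0,1), alpha(-inf, t] / alpha(R) is the N(0,1) CDF.
    return 0.5 * (1.0 + erf(t / np.sqrt(2.0)))

def posterior_mean_cdf(t, y, a0):
    """Posterior point estimate of the CDF under DP(alpha) and squared loss:
    weight a0/(a0+K) on the prior CDF, K/(a0+K) on the empirical CDF."""
    K = len(y)
    empirical = np.mean(np.asarray(y) <= t)
    w = a0 / (a0 + K)
    return w * prior_cdf(t) + (1.0 - w) * empirical

y = [-0.3, 0.1, 0.5, 1.2]
# K = 4, w = 1/3; prior CDF at 0 is 0.5, empirical CDF at 0 is 0.25,
# so the estimate is 1/3 * 0.5 + 2/3 * 0.25 = 1/3.
print(posterior_mean_cdf(0.0, y, a0=2.0))
```

As a0 → 0 the estimate reduces to the empirical CDF, which is the multinomial-Dirichlet limit discussed next.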
The model is similar to a finite mixture of normals, which we'll consider below.
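Combining the stick-breaking weights with a base measure over component parameters gives a direct way to simulate from a DP mixture of normals. A sketch under illustrative assumptions (the base measure over (m, s) below is an arbitrary choice, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(2)

def sample_dp_mixture(n, concentration, n_atoms, rng):
    """Draw n observations from a DP mixture of normals using a truncated
    stick-breaking draw of F ~ DP(alpha). Illustrative base measure over
    (m, s): m ~ N(0, 4), s ~ |N(0, 1)| + 0.1."""
    q = rng.beta(1.0, concentration, size=n_atoms)
    w = q * np.concatenate(([1.0], np.cumprod(1.0 - q)[:-1]))
    w = w / w.sum()                          # renormalize after truncation
    means = 2.0 * rng.standard_normal(n_atoms)
    sds = np.abs(rng.standard_normal(n_atoms)) + 0.1
    comp = rng.choice(n_atoms, size=n, p=w)  # mixture component labels
    return means[comp] + sds[comp] * rng.standard_normal(n)

y = sample_dp_mixture(n=500, concentration=1.0, n_atoms=200, rng=rng)
print(y.shape)
```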
Applications of DP mixtures
- Dey et al. (1998) - a collection of applications.
- Hirano (2002) (and Hirano's dissertation) - flexible error term distributions in panel data models.
- Conley et al. (2008) - IV with flexible error term distributions.
- Burda et al. (2008) - discrete choice model with heterogeneous coefficients (the heterogeneity distribution is modeled by DP mixtures).

Motivation for studying consistency in the Bayesian framework
Priors in infinite (high) dimensional problems are hard to construct/elicit. There are known examples of seemingly reasonable priors that lead to unreasonable posteriors (Diaconis and Freedman (1986)). Posterior consistency (the posterior concentrates around the true parameter values asymptotically) provides an additional check on prior distributions.

(A version of) Schwartz posterior consistency theorem
For a measure μ on X, let L_μ be the space of probability densities w.r.t. μ. Let π be a prior on L_μ. If any Kullback-Leibler neighborhood of the true density f_0 ∈ L_μ has positive prior π probability, then the posterior is consistent in the weak topology:

  π(U | y_1,…,y_n) → 1, a.s. F_0^∞,

for any weak topology neighborhood U of f_0.

Generalizations of Schwartz theorem
- Barron (1988), Barron et al. (1999), Ghosal et al. (1999) - consistency for the total variation distance rather than the weak topology.
- Ghosal et al. (2000) - posterior convergence rates.
- Ghosal and van der Vaart (2007) - some Bayesian nonparametric procedures achieve minimax rates in estimation of a univariate density.
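Posterior concentration can be illustrated numerically in the simplest conjugate case (a Bernoulli model with a Beta posterior; this is only an illustration of the concentration phenomenon, not of the nonparametric theorem itself). The posterior mass of a fixed neighborhood of the true parameter grows toward 1 with the sample size:

```python
import numpy as np

rng = np.random.default_rng(3)
p0 = 0.3  # true Bernoulli parameter

def posterior_prob_neighborhood(n, eps=0.05, n_draws=200_000):
    """Posterior mass of {|p - p0| < eps} under a uniform Beta(1,1) prior
    after n Bernoulli(p0) observations, estimated by Monte Carlo."""
    y = rng.binomial(1, p0, size=n)
    draws = rng.beta(1 + y.sum(), 1 + n - y.sum(), size=n_draws)
    return np.mean(np.abs(draws - p0) < eps)

small, large = posterior_prob_neighborhood(20), posterior_prob_neighborhood(2000)
print(small, large)
```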
Two approaches based on mixtures
- Model the joint distribution by finite mixtures of multivariate normals (FMMN) and extract the conditional distributions of interest from the model (West, Mueller, and Escobar (1994), Wu and Ghosal (2009), Norets and Pelenis (2009)).
- Model the conditional distribution directly by smooth mixtures of regressions (SMR), in which the mixing probabilities are functions of covariates (Jacobs et al. (1991), Jiang and Tanner (1999), Geweke and Keane (2007), Villani et al. (2009), Norets (2010)).

FMMN and specifications of SMR
(Conditional) observables density:

  p(y | x, θ, M) = Σ_{m=1}^M α_m(x; θ) φ(y; μ_m(x; θ), σ_m(x; θ)² I),

where φ is the normal pdf, μ_m(x; θ) the mean, and σ_m(x; θ)² I the variance (a non-diagonal variance will be denoted by H). How to specify the mixing probabilities α_m(x; θ), means μ_m(x; θ), and variances σ_m(x; θ)²? They can be independent of x, depend on linear functions of x, or be flexible, e.g., (transformations of) polynomials in x.

Consistency results
If f(y|x) is in the KL closure of {p(y | x, θ, m), m = 1, 2, …}, then we can put a prior on the number of mixture components m and apply the Schwartz posterior consistency theorem to obtain consistency of the posterior in the weak topology. The results can be generalized from i.i.d. to Markov process settings as in Ghosal and Tang (2006).

Summary of results from Norets (2010)
- Simplest flexible specification: the α_m's are multinomial logit with linear indices in x, and (μ_m, σ_m) are independent of x.
- Polynomial terms in the logit reduce the number of mixture components m.
- Alternatively, (α_m, σ_m) independent of x and μ_m polynomial in x also works when y is univariate.
- However, making α_m rather than μ_m flexible in x improves the convergence rate.
- Making σ_m a function of x weakens some restrictions on F.
- Modeling conditional distributions directly is better.
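The "simplest flexible specification" above can be evaluated directly. A sketch of the SMR conditional density with multinomial-logit mixing weights linear in x and (μ_m, σ_m) free of x (all parameter values below are illustrative):

```python
import numpy as np

def smr_density(y, x, gammas, mus, sigmas):
    """Smooth mixture of regressions density p(y|x): mixing weights
    alpha_m(x) from a multinomial logit with linear index gammas[m] @ x;
    component means and standard deviations independent of x."""
    logits = gammas @ x
    logits -= logits.max()                   # numerical stability
    alpha = np.exp(logits) / np.exp(logits).sum()
    dens = np.exp(-0.5 * ((y - mus) / sigmas) ** 2) / (np.sqrt(2 * np.pi) * sigmas)
    return float(alpha @ dens)

gammas = np.array([[0.5, -1.0], [-0.5, 1.0]])  # M = 2 components, x in R^2
mus = np.array([-1.0, 2.0])
sigmas = np.array([0.8, 1.5])
x = np.array([1.0, 0.3])

# As a density in y, p(y|x) should integrate to ~1 on a fine grid.
grid = np.linspace(-10, 10, 4001)
vals = np.array([smr_density(t, x, gammas, mus, sigmas) for t in grid])
total = vals.sum() * (grid[1] - grid[0])
print(total)
```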
Assumptions on the target distribution F
1. f(y|x) is continuous in y on Y for all x ∈ X.
2. The second moments of y are finite.
3. Unbounded support (f(y|x) > 0): for a cube C_r(y) with center y and side length r > 0,

  ∫ log [ f(y|x) / inf_{z ∈ C_r(y)} f(z|x) ] F(dy, dx) < ∞,

or

  ∫ sup_{z ∈ C_r(y)} ‖ d log f(z|x)/dz ‖ F(dy, dx) < ∞.

4. Bounded support can be accommodated by using some other shapes instead of the cubes C_r(y).

Examples satisfying the assumptions
- Exponential: f(y|x) = γ(x) exp{−γ(x) y}, γ(x) > 0, and ∫ γ² dF < ∞.
- Student t: f(y|x) ∝ [ν + ((y − b(x))/c(x))²]^{−(ν+1)/2}, ν > 2, with b(x)² and c(x)² integrable w.r.t. f(x).
- Uniformly bounded support: f(y|x) is continuous in y, ∞ > f̄ ≥ f(y|x) ≥ f̲ > 0. The bounded-away-from-zero condition can be substituted with monotonicity at the boundary.

Estimation for FMMN
Latent states (Diebolt and Robert (1994)): s_i ∈ {1,…,m},

  y_i | s_i, θ, M ~ φ(·; μ_{s_i}, H_{s_i}⁻¹),   P(s_i = j | θ, M) = α_j.

The joint distribution of observables and unobservables

  p({y_i, s_i}_{i=1}^T; {α_j, μ_j, H_j}_{j=1}^m | M) ∝
    ∏_{i=1}^T α_{s_i} |H_{s_i}|^{0.5} exp[−0.5 (y_i − μ_{s_i})′ H_{s_i} (y_i − μ_{s_i})]
    × α_1^{a−1} ⋯ α_m^{a−1}   (Dirichlet prior)
    × ∏_{j=1}^m |H_j|^{0.5} exp{−0.5 (μ_j − μ̄)′ λ H_j (μ_j − μ̄)} |H_j|^{(ν−d−1)/2} exp{−0.5 tr(S H_j)}   (Normal-Wishart prior),

where (ν, S, μ̄, λ, a) are hyperparameters.
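The latent-state representation is easy to simulate in the univariate case, which also produces test data for the sampler discussed next. A sketch with illustrative parameter values:

```python
import numpy as np

rng = np.random.default_rng(4)

# Latent-state representation of a finite mixture of univariate normals:
# s_i ~ Categorical(alpha), y_i | s_i ~ N(mu_{s_i}, 1 / h_{s_i}).
# Parameter values are illustrative.
alpha = np.array([0.3, 0.7])
mu = np.array([-2.0, 1.0])
h = np.array([1.0, 4.0])          # precisions (univariate analog of H_j)

T = 10_000
s = rng.choice(len(alpha), size=T, p=alpha)      # latent states
y = mu[s] + rng.standard_normal(T) / np.sqrt(h[s])
print(y.mean())   # ~ alpha @ mu = 0.1
```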
Gibbs sampler
The joint density above yields conditionally conjugate Gibbs sampler blocks:

  p(s_i = j | …) ∝ α_j |H_j|^{0.5} exp[−0.5 (y_i − μ_j)′ H_j (y_i − μ_j)],

  p(α | …) ∝ α_1^{Σ_i 1{s_i = 1} + a − 1} ⋯ α_m^{Σ_i 1{s_i = m} + a − 1}   (Dirichlet),

  p(μ_j, H_j | …) - Normal-Wishart distribution.

Gibbs sampler, discrete variables
Map discrete variables into intervals of ℝ and model continuous latent variables y*_{i,k}, with y*_{i,k} ∈ [a_{i,k}, b_{i,k}], by the mixture. Add a block for the latent variables:

  p(y*_{i,k} | …) ∝ exp[−0.5 (y*_i − μ_{s_i})′ H_{s_i} (y*_i − μ_{s_i})] · 1_{[a_{i,k}, b_{i,k}]}(y*_{i,k}),

which is a truncated normal.
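The blocks can be sketched in code for the univariate case, where a normal-gamma prior stands in for the normal-Wishart; the hyperparameters (a, ν, S, μ̄, λ) and the simulated data below are illustrative, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(5)

def gibbs_mixture(y, m, n_iter, a=1.0, nu=2.0, S=1.0, mu_bar=0.0, lam=0.1):
    """Gibbs sampler for a univariate m-component normal mixture:
    blocks for states s, weights alpha (Dirichlet), and (mu_j, h_j)
    (normal-gamma, the univariate analog of normal-Wishart)."""
    T = len(y)
    alpha = np.full(m, 1.0 / m)
    mu = rng.choice(y, size=m)
    h = np.ones(m)
    for _ in range(n_iter):
        # Block 1: p(s_i = j | ...) prop. to alpha_j h_j^{1/2} exp(-h_j (y_i - mu_j)^2 / 2)
        logp = np.log(alpha) + 0.5 * np.log(h) - 0.5 * h * (y[:, None] - mu) ** 2
        p = np.exp(logp - logp.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)
        s = (p.cumsum(axis=1) > rng.random((T, 1))).argmax(axis=1)
        counts = np.bincount(s, minlength=m)
        # Block 2: p(alpha | ...) = Dirichlet(counts + a)
        alpha = rng.dirichlet(counts + a)
        # Block 3: p(mu_j, h_j | ...) normal-gamma conjugate updates
        for j in range(m):
            yj = y[s == j]
            n_j = len(yj)
            ybar = yj.mean() if n_j else 0.0
            lam_n = lam + n_j
            mu_n = (lam * mu_bar + n_j * ybar) / lam_n
            S_n = S + ((yj - ybar) ** 2).sum() + lam * n_j * (ybar - mu_bar) ** 2 / lam_n
            h[j] = rng.gamma((nu + n_j) / 2.0, 2.0 / S_n)
            mu[j] = mu_n + rng.standard_normal() / np.sqrt(lam_n * h[j])
    return alpha, mu, h, s

y = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 0.5, 700)])
alpha, mu, h, s = gibbs_mixture(y, m=2, n_iter=200)
print(np.sort(mu))   # roughly the two component means
```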
Gibbs sampler, SMR
- If μ_m is linear or polynomial in x, the Gibbs sampler is almost the same.
- If σ_m depends on x, then the Gibbs sampler blocks for the parameters of σ_m have a non-standard form. One needs to use more sophisticated MCMC algorithms, e.g., Metropolis-Hastings.
- If α_m is defined by a multinomial logit, then the Gibbs sampler block for the logit parameters is similar to maximum likelihood estimation of a logit model.

Evaluating model quality
The marginal likelihood, p(Y_T | M), is difficult to compute. Cross-validated log scoring rule:

  Σ_{t=1}^T log p(y_t | Y_{T/t}, M) ≈ Σ_{t=1}^T log [ (1/K) Σ_{k=1}^K p(y_t | Y_{T/t}, θ_k, M) ],

where θ_k, k = 1,…,K, are posterior draws. We use a modified cross-validated log scoring rule: randomly order the sample observations and use the first T_1 observations for inference and the rest for evaluation,

  Σ_{t=T_1+1}^T log p(y_t | Y_{T_1}, M).

Repeat this process several times and compare means or medians.

Labor market participation
- Gerfin (1996) cross-section dataset. Compare probit, kernel (Hall et al. (2004)), and FMMN.
- Binary dependent variable: labor force participation dummy.
- Independent variables: log of non-labor income, age, education, number of young children, number of old children, foreign dummy.
- Number of observations: T = 872. Split into two random samples of T_1 = 650 and T_2 = 222 observations. Use T_1 as an estimation sample and T_2 as a prediction sample, for 50 different random splits.

Comparison of different models

Table: Cross-validated classification rates for labor force participation

  Model        %Correct
  Probit       66.37%
  Kernel       66.89%
  FMMN(m=1)    66.22%
  FMMN(m=2)    68.47%
  FMMN(m=3)    67.79%

Classification rates for probit, kernel, and FMMN models evaluated at 50 random evaluation samples of T_2 = 222 observations.
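The modified cross-validated log scoring rule can be sketched as follows; here a simple normal model fitted by maximum likelihood stands in for the predictive density p(y_t | Y_{T_1}, M), and the data are simulated for illustration:

```python
import numpy as np

rng = np.random.default_rng(6)

def modified_cv_log_score(y, n_train, n_splits, rng):
    """Modified cross-validated log scoring rule: randomly reorder the sample,
    use the first n_train observations for inference, sum the log predictive
    densities over the rest, repeat over several random splits, and report
    the median."""
    scores = []
    for _ in range(n_splits):
        perm = rng.permutation(len(y))
        train, holdout = y[perm[:n_train]], y[perm[n_train:]]
        mu, sd = train.mean(), train.std() + 1e-9  # stand-in predictive model
        logpdf = -0.5 * np.log(2 * np.pi * sd**2) - 0.5 * ((holdout - mu) / sd) ** 2
        scores.append(logpdf.sum())
    return float(np.median(scores))

y = rng.standard_normal(300)
score = modified_cv_log_score(y, n_train=200, n_splits=50, rng=rng)
print(score)
```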
Boston housing data
- The data were analysed by kernel methods in Li and Racine (2008). Compare kernel methods from Hall et al. (2004) (cross-validation) with FMMN and SMR.
- Dependent variable: price of the house in a given area.
- Independent variables: average number of rooms in the area, percentage of the population having lower economic status in the area, weighted distance to five Boston employment centers.
- Number of observations: T = 506. Split into two random samples of T_1 = 400 and T_2 = 106 observations. Use T_1 as an estimation sample and T_2 as a prediction sample, for 100 different random splits.
- Models compared by cross-validated log scores (mean and median): nonparametric kernel, FMMN(m=3), FMMN(m=4), FMMN(m=5), FMMN(m=6), FMMN(m=ML), SMR(m=4), SMR(m=7).

Daily returns on S&P 500
Results from Geweke and Keane (2007): out-of-sample comparisons of predictive likelihoods for SMR, t-GARCH(1,1), threshold EGARCH(1,1), GARCH(1,1), and Normal iid models.

Summary
- The Bayesian framework provides a rich collection of non- and semi-parametric methods.
- Some of these methods deliver familiar classical estimators. This can help understand the assumptions under which the estimators are appropriate in finite samples.
- In small samples, mixture-based methods compare favorably with classical parametric and nonparametric alternatives.
- We still have a lot to learn: posterior convergence rates, flexible priors in structural models, moment (in)equality models, ...
References

Barron, A. (1988): "The Exponential Convergence of Posterior Probabilities with Implications for Bayes Estimators of Density Functions."
Barron, A., M. J. Schervish, and L. Wasserman (1999): "The Consistency of Posterior Distributions in Nonparametric Problems," The Annals of Statistics, 27.
Burda, M., M. Harding, and J. Hausman (2008): "A Bayesian Mixed Logit-Probit Model for Multinomial Choice," Journal of Econometrics, 147.
Chamberlain, G. and G. W. Imbens (2003): "Nonparametric Applications of Bayesian Inference," Journal of Business and Economic Statistics, 21.
Conley, T. G., C. B. Hansen, R. E. McCulloch, and P. E. Rossi (2008): "A Semi-parametric Bayesian Approach to the Instrumental Variable Problem," Journal of Econometrics, 144.
Dey, D., P. Müller, and D. Sinha, eds. (1998): Practical Nonparametric and Semiparametric Bayesian Statistics, Lecture Notes in Statistics, Vol. 133, Springer.
Diaconis, P. and D. Freedman (1986): "On Inconsistent Bayes Estimates of Location," The Annals of Statistics, 14.
Diebolt, J. and C. P. Robert (1994): "Estimation of Finite Mixture Distributions through Bayesian Sampling," Journal of the Royal Statistical Society, Series B (Methodological), 56.
Escobar, M. and M. West (1995): "Bayesian Density Estimation and Inference Using Mixtures," Journal of the American Statistical Association, 90.
Ferguson, T. S. (1973): "A Bayesian Analysis of Some Nonparametric Problems," The Annals of Statistics, 1.
Gerfin, M. (1996): "Parametric and Semi-Parametric Estimation of the Binary Response Model of Labour Market Participation," Journal of Applied Econometrics, 11.
Geweke, J. and M. Keane (2007): "Smoothly Mixing Regressions," Journal of Econometrics, 138.
Ghosal, S., J. K. Ghosh, and R. V. Ramamoorthi (1999): "Posterior Consistency of Dirichlet Mixtures in Density Estimation," The Annals of Statistics, 27.
Ghosal, S., J. K. Ghosh, and A. W. van der Vaart (2000): "Convergence Rates of Posterior Distributions," The Annals of Statistics, 28.
Ghosal, S. and Y. Tang (2006): "Bayesian Consistency for Markov Processes," Sankhya, 68.
Ghosal, S. and A. van der Vaart (2007): "Posterior Convergence Rates of Dirichlet Mixtures at Smooth Densities," The Annals of Statistics, 35.
Ghosh, J. and R. Ramamoorthi (2003): Bayesian Nonparametrics, Springer.
Hall, P., J. Racine, and Q. Li (2004): "Cross-Validation and the Estimation of Conditional Probability Densities," Journal of the American Statistical Association, 99.
Hirano, K. (2002): "Semiparametric Bayesian Inference in Autoregressive Panel Data Models," Econometrica, 70.
Hjort, N. L., ed. (2010): Bayesian Nonparametrics, Cambridge Series in Statistical and Probabilistic Mathematics, Cambridge University Press.
Jacobs, R. A., M. I. Jordan, S. J. Nowlan, and G. E. Hinton (1991): "Adaptive Mixtures of Local Experts," Neural Computation, 3.
Jiang, W. and M. Tanner (1999): "Hierarchical Mixtures-of-Experts for Exponential Family Regression Models: Approximation and Maximum Likelihood Estimation," The Annals of Statistics, 27.
Lancaster, T. (2003): "A Note on Bootstraps and Robustness."
Li, Q. and J. S. Racine (2008): "Nonparametric Estimation of Conditional CDF and Quantile Functions With Mixed Categorical and Continuous Data," Journal of Business and Economic Statistics, 26.
Lo, A. Y. (1987): "A Large Sample Study of the Bayesian Bootstrap," The Annals of Statistics, 15.
Neal, R. (2000): "Markov Chain Sampling Methods for Dirichlet Process Mixture Models," Journal of Computational and Graphical Statistics, 9.
Norets, A. (2010): "Approximation of Conditional Densities by Smooth Mixtures of Regressions," The Annals of Statistics, 38.
Norets, A. and J. Pelenis (2009): "Bayesian Modeling of Joint and Conditional Distributions."
Poirier, D. (2008): "Bayesian Interpretations of Heteroskedastic Consistent Covariance Estimators Using the Informed Bayesian Bootstrap," Working Paper, University of California-Irvine, Department of Economics.
Rubin, D. (1981): "The Bayesian Bootstrap," The Annals of Statistics, 9.
Sethuraman, J. (1994): "A Constructive Definition of Dirichlet Priors," Statistica Sinica, 4.
Villani, M., R. Kohn, and P. Giordani (2009): "Regression Density Estimation Using Smooth Adaptive Gaussian Mixtures," Journal of Econometrics, 153.
Wu, Y. and S. Ghosal (2009): "L1-Consistency of Dirichlet Mixtures in Multivariate Bayesian Density Estimation," working paper.
More informationGaussian kernel GARCH models
Gaussian kernel GARCH models Xibin (Bill) Zhang and Maxwell L. King Department of Econometrics and Business Statistics Faculty of Business and Economics 7 June 2013 Motivation A regression model is often
More informationBayesian Heteroskedasticity-Robust Regression. Richard Startz * revised February Abstract
Bayesian Heteroskedasticity-Robust Regression Richard Startz * revised February 2015 Abstract I offer here a method for Bayesian heteroskedasticity-robust regression. The Bayesian version is derived by
More informationSTA 4273H: Statistical Machine Learning
STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 3 Linear
More informationMachine Learning using Bayesian Approaches
Machine Learning using Bayesian Approaches Sargur N. Srihari University at Buffalo, State University of New York 1 Outline 1. Progress in ML and PR 2. Fully Bayesian Approach 1. Probability theory Bayes
More informationBayesian Interpretations of Heteroskedastic Consistent Covariance. Estimators Using the Informed Bayesian Bootstrap
Bayesian Interpretations of Heteroskedastic Consistent Covariance Estimators Using the Informed Bayesian Bootstrap Dale J. Poirier University of California, Irvine May 22, 2009 Abstract This paper provides
More informationLearning Bayesian network : Given structure and completely observed data
Learning Bayesian network : Given structure and completely observed data Probabilistic Graphical Models Sharif University of Technology Spring 2017 Soleymani Learning problem Target: true distribution
More informationECO 2403 TOPICS IN ECONOMETRICS
ECO 2403 TOPICS IN ECONOMETRICS Department of Economics. University of Toronto Winter 2019 Instructors: Victor Aguirregabiria Phone: 416-978-4358 Office: 150 St. George Street, Room 309 E-mail: victor.aguirregabiria@utoronto.ca
More informationBayesian nonparametrics
Bayesian nonparametrics 1 Some preliminaries 1.1 de Finetti s theorem We will start our discussion with this foundational theorem. We will assume throughout all variables are defined on the probability
More informationPattern Recognition and Machine Learning. Bishop Chapter 2: Probability Distributions
Pattern Recognition and Machine Learning Chapter 2: Probability Distributions Cécile Amblard Alex Kläser Jakob Verbeek October 11, 27 Probability Distributions: General Density Estimation: given a finite
More informationStock index returns density prediction using GARCH models: Frequentist or Bayesian estimation?
MPRA Munich Personal RePEc Archive Stock index returns density prediction using GARCH models: Frequentist or Bayesian estimation? Ardia, David; Lennart, Hoogerheide and Nienke, Corré aeris CAPITAL AG,
More informationLabor-Supply Shifts and Economic Fluctuations. Technical Appendix
Labor-Supply Shifts and Economic Fluctuations Technical Appendix Yongsung Chang Department of Economics University of Pennsylvania Frank Schorfheide Department of Economics University of Pennsylvania January
More informationComputer Vision Group Prof. Daniel Cremers. 10a. Markov Chain Monte Carlo
Group Prof. Daniel Cremers 10a. Markov Chain Monte Carlo Markov Chain Monte Carlo In high-dimensional spaces, rejection sampling and importance sampling are very inefficient An alternative is Markov Chain
More informationMarginal Specifications and a Gaussian Copula Estimation
Marginal Specifications and a Gaussian Copula Estimation Kazim Azam Abstract Multivariate analysis involving random variables of different type like count, continuous or mixture of both is frequently required
More informationNon-Parametric Bayes
Non-Parametric Bayes Mark Schmidt UBC Machine Learning Reading Group January 2016 Current Hot Topics in Machine Learning Bayesian learning includes: Gaussian processes. Approximate inference. Bayesian
More informationDirichlet Process Mixtures of Generalized Linear Models
Dirichlet Process Mixtures of Generalized Linear Models Lauren A. Hannah Department of Operations Research and Financial Engineering Princeton University Princeton, NJ 08544, USA David M. Blei Department
More informationOn Consistency of Nonparametric Normal Mixtures for Bayesian Density Estimation
On Consistency of Nonparametric Normal Mixtures for Bayesian Density Estimation Antonio LIJOI, Igor PRÜNSTER, andstepheng.walker The past decade has seen a remarkable development in the area of Bayesian
More informationEconometric Analysis of Cross Section and Panel Data
Econometric Analysis of Cross Section and Panel Data Jeffrey M. Wooldridge / The MIT Press Cambridge, Massachusetts London, England Contents Preface Acknowledgments xvii xxiii I INTRODUCTION AND BACKGROUND
More informationGibbs Sampling in Endogenous Variables Models
Gibbs Sampling in Endogenous Variables Models Econ 690 Purdue University Outline 1 Motivation 2 Identification Issues 3 Posterior Simulation #1 4 Posterior Simulation #2 Motivation In this lecture we take
More informationBayesian Statistics. Debdeep Pati Florida State University. April 3, 2017
Bayesian Statistics Debdeep Pati Florida State University April 3, 2017 Finite mixture model The finite mixture of normals can be equivalently expressed as y i N(µ Si ; τ 1 S i ), S i k π h δ h h=1 δ h
More informationBayesian Semiparametric GARCH Models
Bayesian Semiparametric GARCH Models Xibin (Bill) Zhang and Maxwell L. King Department of Econometrics and Business Statistics Faculty of Business and Economics xibin.zhang@monash.edu Quantitative Methods
More informationBayesian Semiparametric GARCH Models
Bayesian Semiparametric GARCH Models Xibin (Bill) Zhang and Maxwell L. King Department of Econometrics and Business Statistics Faculty of Business and Economics xibin.zhang@monash.edu Quantitative Methods
More informationOnline Appendix. Online Appendix A: MCMC Algorithm. The model can be written in the hierarchical form: , Ω. V b {b k }, z, b, ν, S
Online Appendix Online Appendix A: MCMC Algorithm The model can be written in the hierarchical form: U CONV β k β, β, X k, X β, Ω U CTR θ k θ, θ, X k, X θ, Ω b k {U CONV }, {U CTR b }, X k, X b, b, z,
More informationBayesian Regularization
Bayesian Regularization Aad van der Vaart Vrije Universiteit Amsterdam International Congress of Mathematicians Hyderabad, August 2010 Contents Introduction Abstract result Gaussian process priors Co-authors
More informationHierarchical Dirichlet Processes with Random Effects
Hierarchical Dirichlet Processes with Random Effects Seyoung Kim Department of Computer Science University of California, Irvine Irvine, CA 92697-34 sykim@ics.uci.edu Padhraic Smyth Department of Computer
More informationINFINITE MIXTURES OF MULTIVARIATE GAUSSIAN PROCESSES
INFINITE MIXTURES OF MULTIVARIATE GAUSSIAN PROCESSES SHILIANG SUN Department of Computer Science and Technology, East China Normal University 500 Dongchuan Road, Shanghai 20024, China E-MAIL: slsun@cs.ecnu.edu.cn,
More informationPriors for the frequentist, consistency beyond Schwartz
Victoria University, Wellington, New Zealand, 11 January 2016 Priors for the frequentist, consistency beyond Schwartz Bas Kleijn, KdV Institute for Mathematics Part I Introduction Bayesian and Frequentist
More informationStat 542: Item Response Theory Modeling Using The Extended Rank Likelihood
Stat 542: Item Response Theory Modeling Using The Extended Rank Likelihood Jonathan Gruhl March 18, 2010 1 Introduction Researchers commonly apply item response theory (IRT) models to binary and ordinal
More informationDensity estimation Nonparametric conditional mean estimation Semiparametric conditional mean estimation. Nonparametrics. Gabriel Montes-Rojas
0 0 5 Motivation: Regression discontinuity (Angrist&Pischke) Outcome.5 1 1.5 A. Linear E[Y 0i X i] 0.2.4.6.8 1 X Outcome.5 1 1.5 B. Nonlinear E[Y 0i X i] i 0.2.4.6.8 1 X utcome.5 1 1.5 C. Nonlinearity
More informationInfinite-State Markov-switching for Dynamic. Volatility Models : Web Appendix
Infinite-State Markov-switching for Dynamic Volatility Models : Web Appendix Arnaud Dufays 1 Centre de Recherche en Economie et Statistique March 19, 2014 1 Comparison of the two MS-GARCH approximations
More informationAn Alternative Infinite Mixture Of Gaussian Process Experts
An Alternative Infinite Mixture Of Gaussian Process Experts Edward Meeds and Simon Osindero Department of Computer Science University of Toronto Toronto, M5S 3G4 {ewm,osindero}@cs.toronto.edu Abstract
More informationPart 8: GLMs and Hierarchical LMs and GLMs
Part 8: GLMs and Hierarchical LMs and GLMs 1 Example: Song sparrow reproductive success Arcese et al., (1992) provide data on a sample from a population of 52 female song sparrows studied over the course
More informationPractical Bayesian Quantile Regression. Keming Yu University of Plymouth, UK
Practical Bayesian Quantile Regression Keming Yu University of Plymouth, UK (kyu@plymouth.ac.uk) A brief summary of some recent work of us (Keming Yu, Rana Moyeed and Julian Stander). Summary We develops
More informationAnalysing geoadditive regression data: a mixed model approach
Analysing geoadditive regression data: a mixed model approach Institut für Statistik, Ludwig-Maximilians-Universität München Joint work with Ludwig Fahrmeir & Stefan Lang 25.11.2005 Spatio-temporal regression
More informationOn Bayesian Computation
On Bayesian Computation Michael I. Jordan with Elaine Angelino, Maxim Rabinovich, Martin Wainwright and Yun Yang Previous Work: Information Constraints on Inference Minimize the minimax risk under constraints
More informationLecture 5: Spatial probit models. James P. LeSage University of Toledo Department of Economics Toledo, OH
Lecture 5: Spatial probit models James P. LeSage University of Toledo Department of Economics Toledo, OH 43606 jlesage@spatial-econometrics.com March 2004 1 A Bayesian spatial probit model with individual
More informationNonparametric Bayes tensor factorizations for big data
Nonparametric Bayes tensor factorizations for big data David Dunson Department of Statistical Science, Duke University Funded from NIH R01-ES017240, R01-ES017436 & DARPA N66001-09-C-2082 Motivation Conditional
More informationImplementation of Bayesian Inference in Dynamic Discrete Choice Models
Implementation of Bayesian Inference in Dynamic Discrete Choice Models Andriy Norets anorets@princeton.edu Department of Economics, Princeton University August 30, 2009 Abstract This paper experimentally
More informationSparse Nonparametric Density Estimation in High Dimensions Using the Rodeo
Sparse Nonparametric Density Estimation in High Dimensions Using the Rodeo Han Liu John Lafferty Larry Wasserman Statistics Department Computer Science Department Machine Learning Department Carnegie Mellon
More informationCMPS 242: Project Report
CMPS 242: Project Report RadhaKrishna Vuppala Univ. of California, Santa Cruz vrk@soe.ucsc.edu Abstract The classification procedures impose certain models on the data and when the assumption match the
More informationAsymptotic properties of posterior distributions in nonparametric regression with non-gaussian errors
Ann Inst Stat Math (29) 61:835 859 DOI 1.17/s1463-8-168-2 Asymptotic properties of posterior distributions in nonparametric regression with non-gaussian errors Taeryon Choi Received: 1 January 26 / Revised:
More informationWhy experimenters should not randomize, and what they should do instead
Why experimenters should not randomize, and what they should do instead Maximilian Kasy Department of Economics, Harvard University Maximilian Kasy (Harvard) Experimental design 1 / 42 project STAR Introduction
More informationA Bayesian Nonparametric Hierarchical Framework for Uncertainty Quantification in Simulation
Submitted to Operations Research manuscript Please, provide the manuscript number! Authors are encouraged to submit new papers to INFORMS journals by means of a style file template, which includes the
More informationQuantile methods. Class Notes Manuel Arellano December 1, Let F (r) =Pr(Y r). Forτ (0, 1), theτth population quantile of Y is defined to be
Quantile methods Class Notes Manuel Arellano December 1, 2009 1 Unconditional quantiles Let F (r) =Pr(Y r). Forτ (0, 1), theτth population quantile of Y is defined to be Q τ (Y ) q τ F 1 (τ) =inf{r : F
More information