ICES REPORT 14-21
August 2014

Model Misspecification and Plausibility

by

Kathryn Farrell and J. Tinsley Oden

The Institute for Computational Engineering and Sciences
The University of Texas at Austin
Austin, Texas 78712

Reference: Kathryn Farrell and J. Tinsley Oden, "Model Misspecification and Plausibility," ICES REPORT 14-21, The Institute for Computational Engineering and Sciences, The University of Texas at Austin, August 2014.

Model Misspecification and Plausibility

Kathryn Farrell^{a,*}, J. Tinsley Oden^{a}

^{a} Institute for Computational Engineering and Sciences, The University of Texas at Austin

*Corresponding author. Email addresses: kfarrell@ices.utexas.edu (Kathryn Farrell), oden@ices.utexas.edu (J. Tinsley Oden)

Abstract

We address in this communication relationships between Bayesian posterior model plausibilities and the Kullback-Leibler divergence between the so-called true observational distribution and that predictable by a parametric model.

Keywords: Bayesian model calibration, Kullback-Leibler divergence, model plausibility

1. Introduction

Generally, we consider a set $M(Y)$ of probability measures $\mu$, defined on a (metric) space $Y$ of physical observations. A target quantity of interest, $Q : M(Y) \to \mathbb{R}$, is selected, e.g. $Q(\mu) = \mu[X \geq a]$, $X$ being a random variable and $a$ a threshold value. We seek a particular measure, $\mu^*$, from which the true value of the quantity of interest, $Q(\mu^*)$, is computed. For observational data, we draw independent and identically distributed (i.i.d.) samples $\mathbf{y} = \{y_1, y_2, \ldots, y_n\} \in Y^n$ from $\mu^*$. We wish to predict $Q(\mu^*)$ using a parametric model $P : \Theta \to M(Y)$, where $\Theta$ is the space of parameters. If there exists a parameter $\theta \in \Theta$ such that $P(\theta) = \mu^*$, the model $P$ is said to be well-specified. Otherwise, if $\mu^* \notin P(\Theta)$, the model is misspecified. Bayesian frameworks provide a general setting for the analysis of such models and their predictive capabilities in the presence of uncertainties.
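As a concrete illustration of this setup, consider a minimal Python sketch in which the true distribution $\mu^*$ is taken, purely for illustration and not from the report, to be a Gamma distribution, while the parametric model is a unit-variance Gaussian location family; the model is then misspecified, and its prediction of the threshold quantity of interest $Q(\mu) = \mu[X \geq a]$ differs from the true value.

```python
# Minimal sketch (hypothetical choices, not from the report):
# true measure mu* = Gamma(shape=3, scale=1); model class P = {N(theta, 1) : theta real}.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

a = 5.0                                    # threshold in Q(mu) = mu[X >= a]
g = stats.gamma(a=3.0, scale=1.0)          # "true" observational density g
y = g.rvs(size=200, random_state=rng)      # i.i.d. data y = {y_1, ..., y_n}

# The Gaussian family cannot reproduce g exactly, so the model is misspecified.
theta_hat = y.mean()                       # a point estimate within the model class
model = stats.norm(loc=theta_hat, scale=1.0)

print("true  Q(mu*)      :", 1.0 - g.cdf(a))       # P[X >= a] under mu*
print("model Q(P(theta)) :", 1.0 - model.cdf(a))   # prediction of the misspecified model
```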

For any fixed i.i.d. observational data $\mathbf{y}$, let us consider a set $M$ of $m$ parametric models, each with its own parameter space,
$$
M = \{P_1(\theta_1),\, P_2(\theta_2),\, \ldots,\, P_m(\theta_m)\}, \qquad \theta_i \in \Theta_i.
$$
Prior information regarding parameters $\theta_i$ may be collected and characterized by the prior probability density function $\pi(\theta_i \mid P_i, M)$. This information may be updated via Bayes's Rule,
$$
\pi(\theta_i \mid \mathbf{y}, P_i, M) = \frac{\pi(\mathbf{y} \mid \theta_i, P_i, M)\,\pi(\theta_i \mid P_i, M)}{\pi(\mathbf{y} \mid P_i, M)}, \tag{1}
$$
where the likelihood probability density, $\pi(\mathbf{y} \mid \theta_i, P_i, M)$, captures how well the parameters are able to reproduce the given data $\mathbf{y}$, and the posterior density $\pi(\theta_i \mid \mathbf{y}, P_i, M)$ contains the updated information about the parameters. The denominator in (1), called the model evidence, is the marginalization of the product of the likelihood and prior densities over the parameters,
$$
\pi(\mathbf{y} \mid P_i, M) = \int_{\Theta_i} \pi(\mathbf{y} \mid \theta_i, P_i, M)\,\pi(\theta_i \mid P_i, M)\,d\theta_i. \tag{2}
$$

An important observation presents itself at this point: the evidence (2) can be viewed as the likelihood in a higher form of Bayes's Rule,
$$
\pi(P_i \mid \mathbf{y}, M) = \frac{\pi(\mathbf{y} \mid P_i, M)\,\pi(P_i \mid M)}{\pi(\mathbf{y} \mid M)}. \tag{3}
$$
The left-hand side of (3) is the posterior model plausibility and may be denoted $\rho_i = \pi(P_i \mid \mathbf{y}, M)$ for simplicity. Taking the place of the prior density is the prior model plausibility $\pi(P_i \mid M)$, and $\pi(\mathbf{y} \mid M)$ is a normalizing factor such that $\sum_{i=1}^{m} \rho_i = 1$. Among the models in $M$, the model with plausibility $\rho_i$ closest to unity is deemed the most plausible model.
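As a minimal numerical illustration of (1)-(3), the sketch below, with hypothetical one-parameter Gaussian models and priors chosen purely for illustration, approximates the evidence (2) of each model by quadrature over its parameter space and normalizes to obtain posterior plausibilities $\rho_i$, assuming equal prior plausibilities $\pi(P_i \mid M) = 1/2$.

```python
# Minimal sketch (hypothetical models and priors): evidence (2) by quadrature and
# plausibilities (3) with equal prior plausibilities pi(P_i | M) = 1/2.
#   P1: y_j ~ N(theta, 1^2),  P2: y_j ~ N(theta, 2^2),  prior: theta ~ N(0, 2^2).
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
y = rng.normal(loc=0.5, scale=1.0, size=50)      # fixed i.i.d. observational data

theta = np.linspace(-10.0, 10.0, 4001)           # quadrature grid over the parameter space
dtheta = theta[1] - theta[0]
prior = stats.norm(0.0, 2.0).pdf(theta)

def evidence(sigma):
    # sum_j log pi(y_j | theta, P_i) on the grid, then marginalize over theta as in (2)
    loglik = stats.norm(theta[:, None], sigma).logpdf(y[None, :]).sum(axis=1)
    return np.sum(np.exp(loglik) * prior) * dtheta

ev = np.array([evidence(1.0), evidence(2.0)])
rho = ev / ev.sum()                              # posterior plausibilities rho_i from (3)
print("evidences          :", ev)
print("plausibilities rho :", rho)
```

For larger data sets the integrand should be handled in log space (e.g. via a log-sum-exp) rather than exponentiated directly, to avoid numerical underflow.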

2. Misspecified Models

Let us now suppose that the model $P$ (or $P_i$) is misspecified, i.e. $\mu^* \notin P(\Theta)$. Suppose $\mu^*$ is absolutely continuous with respect to the Lebesgue measure and that $g(\mathbf{y})$ is the probability density associated with $\mu^*$. Then the best approximation to $g$ in $P(\Theta)$ is the model with the parameter
$$
\theta^* = \operatorname*{argmin}_{\theta \in \Theta} D_{KL}\bigl(g \,\|\, \pi(\cdot \mid \theta, P, M)\bigr), \tag{4}
$$
where $D_{KL}(\cdot\,\|\,\cdot)$ is the Kullback-Leibler divergence, or relative entropy, between two densities: if $p(\mathbf{y})$ and $q(\mathbf{y})$ are two probability densities,
$$
D_{KL}(p \,\|\, q) = \int_{Y} p(\mathbf{y}) \log \frac{p(\mathbf{y})}{q(\mathbf{y})}\,d\mathbf{y}. \tag{5}
$$
The parameter $\theta^*$ yields a probability measure, $P(\theta^*)$, that is as close as possible to $\mu^*$ in the $D_{KL}$ pseudo-distance. It is easily shown that $\theta^*$ maximizes the expected value of the log-likelihood relative to the true density $g$, and is therefore the population-level analogue of the maximum likelihood estimate:
$$
\begin{aligned}
\theta^* &= \operatorname*{argmin}_{\theta \in \Theta}\left[\int_{Y^n} g(\mathbf{y}) \log g(\mathbf{y})\,d\mathbf{y} - \int_{Y^n} g(\mathbf{y}) \log \pi(\mathbf{y} \mid \theta)\,d\mathbf{y}\right] \\
&= \operatorname*{argmin}_{\theta \in \Theta}\left[-\int_{Y^n} g(\mathbf{y}) \log \pi(\mathbf{y} \mid \theta)\,d\mathbf{y}\right] \\
&= \operatorname*{argmax}_{\theta \in \Theta}\int_{Y^n} g(\mathbf{y}) \log \pi(\mathbf{y} \mid \theta)\,d\mathbf{y} \\
&= \operatorname*{argmax}_{\theta \in \Theta}\,\mathbb{E}_g\bigl[\log \pi(\mathbf{y} \mid \theta)\bigr], \tag{6}
\end{aligned}
$$
where the negative self-entropy $\int_{Y^n} g \log g\,d\mathbf{y}$ was eliminated since it does not depend on $\theta$ and therefore does not affect the optimization.

It should be noted that the parameter $\theta^*$ is of fundamental significance in the theories of the asymptotic behavior of parameter estimates in both Bayesian and frequentist statistics. Under suitable smoothness assumptions, as more data become available, the posterior density $\pi(\theta \mid \{y_1, y_2, \ldots, y_n\}, P, M)$ produced by Bayes's Rule converges in $L^1$ (in probability) to a normal distribution, $N(\theta^*, V(\theta^*))$, centered at $\theta^*$ with covariance matrix given by [2, 3]
$$
V(\theta^*) = -\,\mathbb{E}_g\!\left[\frac{\partial^2}{\partial\theta\,\partial\theta^T}\,\log \pi(\mathbf{y} \mid \theta, P, M)\right]_{\theta = \theta^*}. \tag{7}
$$
This result is referred to as the Bernstein-von Mises Theorem for misspecified models (see e.g. [1, 3, 4]). Of course, if $P$ is well-specified, $\theta^*$ coincides with the true parameter, the posterior produced by Bayes's Rule converges in probability to a normal distribution centered at $\theta^*$ as the number of data samples increases, and $V(\theta^*)$ becomes the Fisher information matrix at $\theta^*$ [2].
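The optimal parameter $\theta^*$ of (4) and (6) can be computed numerically in simple cases. The sketch below, again with a hypothetical Gamma true density and a Gaussian location model that are not taken from the report, approximates $\mathbb{E}_g[\log \pi(y \mid \theta)]$ by quadrature and maximizes it; for a fixed-variance Gaussian model this recovers $\theta^* = \mathbb{E}_g[Y]$.

```python
# Minimal sketch (hypothetical g and model): the pseudo-true parameter theta* of (4)/(6)
# for g = Gamma(3, 1) and the misspecified model pi(y | theta) = N(y; theta, 1).
import numpy as np
from scipy import stats, optimize

g = stats.gamma(a=3.0, scale=1.0)
ygrid = np.linspace(1e-6, 40.0, 40001)            # quadrature grid covering the support of g
dy = ygrid[1] - ygrid[0]
w = g.pdf(ygrid)

def neg_expected_loglik(theta):
    # -E_g[log pi(y | theta)], approximated by quadrature; minimizing it is the argmax in (6)
    return -np.sum(w * stats.norm(theta, 1.0).logpdf(ygrid)) * dy

res = optimize.minimize_scalar(neg_expected_loglik, bounds=(0.0, 20.0), method="bounded")
print("theta* =", res.x, "  mean of g =", g.mean())
# For a Gaussian location model with fixed variance, theta* is exactly E_g[Y]: the
# D_KL-closest member of P(Theta) matches the first moment of the true density g.
```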

3. Plausibility-$D_{KL}$ Theory

Let us now suppose that we have two misspecified models, $P_1$ and $P_2$. We may compare these models in the Bayesian setting through the concept of model plausibility: if $P_1$ is more plausible than $P_2$, then $\rho_1 > \rho_2$. In the maximum likelihood setting, the model that yields a probability measure closer to $\mu^*$ is considered the better model. That is, if
$$
D_{KL}\bigl(g \,\|\, \pi(\cdot \mid \theta_1^*, P_1, M)\bigr) < D_{KL}\bigl(g \,\|\, \pi(\cdot \mid \theta_2^*, P_2, M)\bigr), \tag{8}
$$
it can be said that model $P_1$ is better than model $P_2$. The theorems presented here define the relationship between these two notions of model comparison.

However, Bayesian and frequentist methods fundamentally differ in the way they view the model parameters. Bayesian methods consider parameters to be stochastic, characterized by probability density functions, while frequentist approaches seek a single, deterministic parameter value. To bridge this gap in methodology, we note that considering parameters as deterministic vectors, for example $\theta_0$, is akin to assigning them delta functions as their posterior probability distributions, which result from delta prior distributions. In this case, the evidence (2) becomes
$$
\pi(\mathbf{y} \mid P_i, M) = \int_{\Theta_i} \pi(\mathbf{y} \mid \theta, P_i, M)\,\delta(\theta - \theta_0)\,d\theta = \pi(\mathbf{y} \mid \theta_0, P_i, M). \tag{9}
$$
In particular, if we consider the optimal parameter $\theta_i^*$ for model $P_i$, then $\pi(\mathbf{y} \mid P_i, M) = \pi(\mathbf{y} \mid \theta_i^*, P_i, M)$. We can then take the ratio of posterior model plausibilities,
$$
\frac{\rho_1}{\rho_2} = \frac{\pi(\mathbf{y} \mid P_1, M)\,\pi(P_1 \mid M)}{\pi(\mathbf{y} \mid P_2, M)\,\pi(P_2 \mid M)}
= \frac{\pi(\mathbf{y} \mid \theta_1^*, P_1, M)\,\pi(P_1 \mid M)}{\pi(\mathbf{y} \mid \theta_2^*, P_2, M)\,\pi(P_2 \mid M)}
= \frac{\pi(\mathbf{y} \mid \theta_1^*, P_1, M)}{\pi(\mathbf{y} \mid \theta_2^*, P_2, M)}\, O_{12}, \tag{10}
$$
where $O_{12} = \pi(P_1 \mid M)/\pi(P_2 \mid M)$ is the prior odds ratio and is often assumed to be one.
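The ratio (10) is readily evaluated once the optimal parameters $\theta_i^*$ are known. The sketch below, with hypothetical models of Gamma-distributed data chosen purely for illustration, computes $\log(\rho_1/\rho_2)$ under delta parameter priors with $O_{12} = 1$ alongside the two divergences appearing in (8); it illustrates the correspondence that the theorems below make precise.

```python
# Minimal sketch (hypothetical models of Gamma(3,1) data): the plausibility ratio (10)
# under delta parameter priors with O_12 = 1, next to the divergences appearing in (8).
#   P1: N(theta, sqrt(3)^2),  P2: N(theta, 1^2),  with theta_i* = E_g[Y] for both.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
g = stats.gamma(a=3.0, scale=1.0)
y = g.rvs(size=400, random_state=rng)             # i.i.d. data from the true density g

ygrid = np.linspace(1e-6, 40.0, 40001)
dy = ygrid[1] - ygrid[0]
w = g.pdf(ygrid)

def kl_to_model(sigma):
    # D_KL(g || N(E_g[Y], sigma)) by quadrature, as in (5), at the KL-optimal location
    q = stats.norm(g.mean(), sigma).pdf(ygrid)
    return np.sum(w * (np.log(w) - np.log(q))) * dy

def log_plausibility_ratio(sigma1, sigma2):
    # log(rho_1 / rho_2) from (10): a log-likelihood ratio at the optimal parameters, O_12 = 1
    return (stats.norm(g.mean(), sigma1).logpdf(y).sum()
            - stats.norm(g.mean(), sigma2).logpdf(y).sum())

print("D_KL(g||P1), D_KL(g||P2):", kl_to_model(np.sqrt(3.0)), kl_to_model(1.0))
print("log(rho_1 / rho_2)      :", log_plausibility_ratio(np.sqrt(3.0), 1.0))
# With enough data, the model with the smaller divergence typically attains the larger
# plausibility -- the correspondence made precise by the theorems below.
```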

1 < ρ 1 = π(y θ 1, P 1, M) ρ 2 π(y θ 2, P 2, M) O 12 π(y θ 1, P 1, M) π(y θ 2, P 2, M) Equivalently, the reciprocal of the far right hand side is less than one, so log π(y θ 2, P 2, M) π(y θ 1, P 1, M) (11) < 0. (12) Since g(y) is a probability measure, it is always non-negative. Thus g(y) log π(y θ 2, P 2, M) π(y θ 1, P 1, M) < 0 g(y) log π(y θ 2, P 2, M) Y n π(y θ 1, P 1, M) dy < 0. (13) This can be expanded into g(y) log π(y θ 2, P 2, M) dy Y n g(y) log π(y θ 1, P 1, M) dy < 0, Y n (14) which means g(y) log π(y θ 1, P 1, M) dy < Y n g(y) log π(y θ 2, P 2, M) dy. Y n (15) By adding the quantity Y n g log g dy to both sides, the desired result (8) immediately follows. This theorem demonstrates that if model P 1 is better than model P 2 in the Bayesian sense, it is also a better deterministic model in the sense of (8). However, the reverse implication requires much stronger conditions. The assertion (8) can be equivalently written as g(y) log π(y θ 2, P 2, M) dy < 0. (16) Y n π(y θ 1, P 1, M) For this inequality to hold, the relationship π(y θ 2, P 2, M) π(y θ 1, P 1, M) < 1 (17) does not necessarily need to be true for every point y Y n. 5

One way to proceed is to invoke the Mean Value Theorem for integrals: if $|Y^n| < \infty$ and under suitable smoothness conditions, there exists some $\bar{\mathbf{y}} \in Y^n$ such that
$$
\int_{Y^n} g(\mathbf{y}) \log \frac{\pi(\mathbf{y} \mid \theta_2^*, P_2, M)}{\pi(\mathbf{y} \mid \theta_1^*, P_1, M)}\,d\mathbf{y}
= |Y^n|\, g(\bar{\mathbf{y}}) \log \frac{\pi(\bar{\mathbf{y}} \mid \theta_2^*, P_2, M)}{\pi(\bar{\mathbf{y}} \mid \theta_1^*, P_1, M)}. \tag{18}
$$
Combining (16) and (18),
$$
|Y^n|\, g(\bar{\mathbf{y}}) \log \frac{\pi(\bar{\mathbf{y}} \mid \theta_2^*, P_2, M)}{\pi(\bar{\mathbf{y}} \mid \theta_1^*, P_1, M)} < 0. \tag{19}
$$
Since $|Y^n| > 0$ and $g(\bar{\mathbf{y}}) > 0$,
$$
\log \frac{\pi(\bar{\mathbf{y}} \mid \theta_2^*, P_2, M)}{\pi(\bar{\mathbf{y}} \mid \theta_1^*, P_1, M)} < 0
\;\Longrightarrow\;
\frac{\pi(\bar{\mathbf{y}} \mid \theta_2^*, P_2, M)}{\pi(\bar{\mathbf{y}} \mid \theta_1^*, P_1, M)} < 1
\;\Longrightarrow\;
\frac{\pi(\bar{\mathbf{y}} \mid \theta_1^*, P_1, M)}{\pi(\bar{\mathbf{y}} \mid \theta_2^*, P_2, M)} > 1. \tag{20}
$$
If $O_{12} \geq 1$,
$$
\frac{\pi(\bar{\mathbf{y}} \mid \theta_1^*, P_1, M)}{\pi(\bar{\mathbf{y}} \mid \theta_2^*, P_2, M)}\, O_{12} > 1
\;\Longrightarrow\;
\frac{\rho_1}{\rho_2} > 1. \tag{21}
$$
Thus $P_1$ is more plausible than $P_2$ for the given data $\bar{\mathbf{y}}$. In summary, we have:

Theorem 2. If $D_{KL}\bigl(g \,\|\, \pi(\cdot \mid \theta_1^*, P_1, M)\bigr) < D_{KL}\bigl(g \,\|\, \pi(\cdot \mid \theta_2^*, P_2, M)\bigr)$, if $|Y^n| < \infty$, and if the integrand in (18) is continuous, then there exists a $\bar{\mathbf{y}} \in Y^n$ such that $P_1$ is more plausible than $P_2$, given that $O_{12} \geq 1$.

Acknowledgement: The support of this work by the US Department of Energy Applied Mathematics program MMICC's effort under Award Number DE-SC0009286 is gratefully acknowledged.

References

[1] Bernstein, S. Theory of Probability. Moscow, 1917.
[2] Freedman, D.A. On the so-called Huber sandwich estimator and robust standard errors. The American Statistician 60(4) (2006).
[3] Kleijn, B.J.K. and van der Vaart, A.W. The Bernstein-Von-Mises theorem under misspecification. Electronic Journal of Statistics 6 (2012), 354-381.
[4] Von Mises, R. Wahrscheinlichkeitsrechnung. Springer-Verlag, Berlin, 1931.