Bayesian estimation of the discrepancy with misspecified parametric models


Bayesian estimation of the discrepancy with misspecified parametric models
Pierpaolo De Blasi, University of Torino & Collegio Carlo Alberto
Bayesian Nonparametrics workshop, ICERM, 17-21 September 2012
Joint work with S. Walker

Outline
- Semiparametric density estimation
- Asymptotics and illustration
- References

BNP density estimation

Let $X_1, \dots, X_n$ be exchangeable (i.e. conditionally iid) observations from an unknown density $f$ on the real line. If $\mathcal{F}$ is the density space and $\Pi(df)$ the prior, Bayes' theorem gives
$$\Pi(A \mid X_1, \dots, X_n) = \frac{\int_A \prod_{i=1}^n f(X_i)\, \Pi(df)}{\int_{\mathcal{F}} \prod_{i=1}^n f(X_i)\, \Pi(df)}.$$
There is a wealth of Bayesian nonparametric (BNP) models: Dirichlet process mixtures of continuous densities, log-spline models, Bernstein polynomials, log Gaussian processes. All have well-studied asymptotic properties, e.g. posterior concentration rates
$$\Pi(f : d(f, f_0) > M\epsilon_n \mid X_1, \dots, X_n) \to 0 \quad \text{as } n \to \infty,$$
when $X_1, X_2, \dots$ are iid from some true $f_0$.
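To make the abstract update concrete, here is a toy sketch of my own (not from the talk): when the prior $\Pi$ charges only finitely many candidate densities, the posterior $\Pi(\cdot \mid X_1, \dots, X_n)$ reduces to likelihood-reweighted prior masses.

```python
# Toy Bayes update over a finite prior on densities (assumed example):
# posterior weight of each candidate f is proportional to prior weight times likelihood.
import numpy as np
from scipy.stats import beta

rng = np.random.default_rng(0)
data = rng.beta(2.0, 5.0, size=200)                      # observations from an unknown f_0
candidates = [beta(a, b) for a in (1, 2, 3) for b in (2, 5, 8)]
prior = np.full(len(candidates), 1.0 / len(candidates))
loglik = np.array([f.logpdf(data).sum() for f in candidates])
post = prior * np.exp(loglik - loglik.max())             # subtract max for numerical stability
post /= post.sum()
for f, p in zip(candidates, post):
    print(f.args, round(float(p), 3))                    # posterior mass on each candidate density
```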

Discrepancy from a parametric model

Suppose now we have a favorite parametric family $f_\theta(x)$, $\theta \in \Theta \subseteq \mathbb{R}^p$, likely to be misspecified: there is no $\theta$ such that $f_0 = f_\theta$. We want to learn about the best parameter value $\theta_0$, the one minimizing the Kullback-Leibler divergence from the true $f_0$:
$$\theta_0 = \arg\min_{\theta \in \Theta} \int f_0 \log(f_0 / f_\theta).$$
A nonparametric component $W$ is introduced to model the discrepancy between $f_0$ and the closest density $f_{\theta_0}$:
$$f_{\theta, W}(x) \propto f_\theta(x)\, W(x),$$
so that
$$C(x) := \frac{W(x)}{\int W(s)\, f_\theta(s)\, ds}$$
is designed to estimate $C_0(x) = f_0(x) / f_{\theta_0}(x)$.
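As a concrete illustration, here is a minimal sketch of my own (using the truncated-exponential family that appears later in Illustration 1) of computing $\theta_0$ by numerically minimizing the Kullback-Leibler divergence.

```python
# Numerically locate theta_0 = argmin_theta KL(f_0 || f_theta) for an assumed example:
# f_0(x) = 2(1 - x) on [0, 1] and the truncated exponential family f_theta.
import numpy as np
from scipy.integrate import quad
from scipy.optimize import minimize_scalar

f0 = lambda x: 2.0 * (1.0 - x)                                       # true density on [0, 1]
f_theta = lambda x, th: th * np.exp(-th * x) / (1.0 - np.exp(-th))   # misspecified family

def kl(th):
    # \int_0^1 f_0 log(f_0 / f_theta); stop just short of x = 1, where f_0 vanishes
    # and the integrand tends to 0 anyway.
    return quad(lambda x: f0(x) * np.log(f0(x) / f_theta(x, th)), 0.0, 1.0 - 1e-9)[0]

res = minimize_scalar(kl, bounds=(0.1, 10.0), method="bounded")
print("theta_0 ~", round(res.x, 3))   # about 2.15, the value quoted later in the talk
```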

Related works - Frequentist

Hjort and Glad (1995): start with a parametric density estimate $f_{\hat\theta}(x)$, $\hat\theta$ being, e.g., the MLE of $\theta$ maximizing the log-likelihood $\sum_{i=1}^n \log f_\theta(x_i)$. Then multiply it by a nonparametric kernel-type estimate of the correction function $r(x) = f_0(x)/f_{\hat\theta}(x)$:
$$\hat f(x) = f_{\hat\theta}(x)\, \hat r(x) = \frac{1}{n} \sum_{i=1}^n K_h(x_i - x)\, \frac{f_{\hat\theta}(x)}{f_{\hat\theta}(x_i)}$$
in a two-stage sequential analysis. $\hat f$ is shown to be more precise than the traditional kernel density estimator in a broad neighborhood of the parametric family, while losing little when $f_0$ is far from the parametric family.
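A short sketch of this estimator follows; the Gaussian parametric start, the gamma-distributed data and the bandwidth are assumptions of mine, not choices from the talk.

```python
# Hjort-Glad (1995) parametric-start kernel estimator, assumed setup:
# fit a parametric model by MLE, then kernel-estimate the correction r(x) = f_0(x)/f_thetahat(x).
import numpy as np
from scipy.stats import norm

def hjort_glad(x_obs, x_grid, h):
    mu, sigma = x_obs.mean(), x_obs.std()                  # Gaussian MLE as parametric start
    f_hat = lambda x: norm.pdf(x, mu, sigma)
    # fhat(x) = (1/n) sum_i K_h(x_i - x) f_thetahat(x) / f_thetahat(x_i), Gaussian kernel
    K = norm.pdf((x_obs[None, :] - x_grid[:, None]) / h) / h
    return (K * (f_hat(x_grid)[:, None] / f_hat(x_obs)[None, :])).mean(axis=1)

rng = np.random.default_rng(0)
x_obs = rng.gamma(shape=3.0, scale=1.0, size=500)          # data from a skewed "true" f_0
grid = np.linspace(0.01, 12.0, 200)
f_est = hjort_glad(x_obs, grid, h=0.5)                     # corrected density estimate on the grid
```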

Related works - Bayes

Nonparametric priors built around a parametric model via
$$f(x) = f_\theta(x)\, g(F_\theta(x)),$$
where $F_\theta$ is the cdf of $f_\theta$ and $g$ is a density on $[0, 1]$ with prior $\Pi$.
- Verdinelli and Wasserman (1998): $\Pi$ as an infinite exponential family; application to goodness-of-fit testing.
- Rousseau (2008): $\Pi$ as a mixture of betas; application to goodness-of-fit testing.
- Tokdar (2007): $\Pi$ as a log Gaussian process prior; application to posterior inference for densities with unbounded support.
For $g(u) = e^{Z(u)} / \int_0^1 e^{Z(s)}\, ds$ and $Z$ a Gaussian process with covariance $\sigma(\cdot, \cdot)$, $f(x)$ can be written
$$f(x) \propto f_\theta(x)\, e^{Z(x)} = f_\theta(x)\, W(x)$$
for $Z$ a Gaussian process with covariance $\sigma(F_\theta(\cdot), F_\theta(\cdot))$.
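A minimal sketch of this construction, under assumptions of mine (squared-exponential covariance, an Exp($\theta$) base density and a fixed length-scale): draw a Gaussian process indexed by $F_\theta(x)$ and multiply it, exponentiated, into $f_\theta$.

```python
# Log Gaussian process tilt of a parametric density, assumed example on a grid:
# f(x) proportional to f_theta(x) exp{Z(F_theta(x))}, normalized numerically.
import numpy as np
from scipy.stats import expon

theta = 2.0
x = np.linspace(0.01, 3.0, 300)                         # grid on the sample space
base = expon(scale=1.0 / theta)                         # f_theta = Exp(theta), an assumed choice
u = base.cdf(x)                                         # F_theta(x) in [0, 1]
cov = np.exp(-0.5 * ((u[:, None] - u[None, :]) / 0.2) ** 2)   # sigma(F_theta(x), F_theta(x'))
rng = np.random.default_rng(1)
z = rng.multivariate_normal(np.zeros(x.size), cov + 1e-8 * np.eye(x.size))
unnorm = base.pdf(x) * np.exp(z)                        # f_theta(x) W(x) with W = e^Z
f = unnorm / (unnorm.sum() * (x[1] - x[0]))             # crude normalization on the grid
```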

Posterior updating

$$f_{\theta, W}(x) \propto f_\theta(x)\, W(x), \qquad C(x) := \frac{W(x)}{\int W(s)\, f_\theta(s)\, ds}.$$
Truly semiparametric: the aim is first to learn about the best parameter $\theta_0$, and then to see how close $f_{\theta_0}$ is to $f_0$ via $C(x) = W(x) / \int W(s)\, f_\theta(s)\, ds$. This is a situation in which the updating from prior to posterior may be seen as problematic: the model $f_{\theta, W}$ is intrinsically not identified in $(\theta, C)$.
- The full Bayesian update $\pi(\theta, W \mid x_1, \dots, x_n) \propto \pi(\theta)\, \pi(W) \prod_{i=1}^n f_{\theta, W}(x_i)$ is appropriate for learning about $f_0$; it is not so for learning about $(\theta_0, C_0)$.
- The marginal posterior $\pi(\theta \mid x_1, \dots, x_n) = \int \pi(\theta, W \mid x_1, \dots, x_n)\, dW$ has no clear interpretation: it is not identified which parameter value this posterior is targeting.

Posterior updating

What removes us from the formal Bayes set-up is the desire to specifically learn about $\theta_0$: $\theta_0$ is defined without any reference to $W$ or $C$. Whether or not we are interested in learning about $C_0$, our beliefs about $\theta_0$ should not change. Hence the appropriate update for $\theta$ is the parametric one:
$$\pi(\theta \mid x_1, \dots, x_n) \propto \pi(\theta) \prod_{i=1}^n f_\theta(x_i).$$
We keep updating $W$ according to the semiparametric model, so our updating scheme is
$$\pi(W \mid \theta, x_1, \dots, x_n) \propto \pi(W) \prod_{i=1}^n f_{\theta, W}(x_i), \qquad \pi(\theta, W \mid x_1, \dots, x_n) = \pi(W \mid \theta, x_1, \dots, x_n)\, \pi(\theta \mid x_1, \dots, x_n),$$
a non-full Bayesian update.
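A minimal sketch of the parametric update (my own random-walk Metropolis implementation, using the data-generating setup of Illustration 1 below): only $f_\theta$ enters the likelihood, so $\theta$ targets $\theta_0$ regardless of $W$; $W$ would then be updated conditionally on each retained $\theta$.

```python
# Parametric Bayes update pi(theta | x) ∝ pi(theta) prod_i f_theta(x_i), assumed example:
# f_0(x) = 2(1 - x) (i.e. Beta(1, 2)) data, truncated-exponential family, prior 1/theta.
import numpy as np

rng = np.random.default_rng(2)
x = rng.beta(1.0, 2.0, size=500)                # observations from f_0(x) = 2(1 - x)

def log_post(th):
    if th <= 0:
        return -np.inf
    # log prior -log(theta) plus truncated-exponential log-likelihood
    return -np.log(th) + np.sum(np.log(th) - th * x - np.log1p(-np.exp(-th)))

th, lp, draws = 2.0, log_post(2.0), []
for _ in range(20000):
    prop = th + 0.1 * rng.standard_normal()     # random-walk proposal
    lp_prop = log_post(prop)
    if np.log(rng.uniform()) < lp_prop - lp:
        th, lp = prop, lp_prop
    draws.append(th)
print("posterior mean of theta:", round(float(np.mean(draws[5000:])), 3))  # near theta_0 = 2.15
```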

Posterior updating

$$\pi(\theta, W \mid x_1, \dots, x_n) = \pi(W \mid \theta, x_1, \dots, x_n)\, \pi(\theta \mid x_1, \dots, x_n).$$
$(\theta, W)$ are estimated sequentially, with $W$ reflecting the additional uncertainty on $\theta$. The marginal posterior of $W$ is well defined,
$$\pi(W \mid x_1, \dots, x_n) = \int_\Theta \pi(W \mid \theta, x_1, \dots, x_n)\, \pi(d\theta \mid x_1, \dots, x_n),$$
since $\pi(\theta \mid x_1, \dots, x_n)$ describes the beliefs about the real parameter $\theta_0$. Coherence is about properly defining the quantities of interest and showing that the Bayesian updates provide learning about these quantities, and this is checked by what is yielded asymptotically. Hence we seek frequentist validation: we show that the posterior of $(\theta, C)$ converges to a point mass at $(\theta_0, C_0)$.

Lenk (2003)

Let $I$ be a compact interval on the real line and $Z$ a Gaussian process. Lenk (2003) considers the semiparametric density model
$$f(x) = \frac{f_\theta(x)\, e^{Z(x)}}{\int_I f_\theta(s)\, e^{Z(s)}\, ds}$$
for $f_\theta(x)$ a member of the exponential family. In the Karhunen-Loève expansion of $Z(x)$, the orthogonal basis is chosen so that the sample paths integrate to zero. A further assumption for identification: the orthogonal basis does not contain any of the canonical statistics of $f_\theta(x)$. Estimation is based on truncation of the series expansion or on imputation of the Gaussian process at a fixed grid of points, see Tokdar (2007).
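A minimal sketch of a truncated series whose sample paths integrate to zero over $I = [0, 1]$; the cosine basis is an assumed stand-in of mine, not Lenk's exact construction.

```python
# Truncated series Z(x) = sum_k z_k phi_k(x) with zero-integral basis functions, assumed example.
import numpy as np

rng = np.random.default_rng(3)
K = 20
x = np.linspace(0.0, 1.0, 401)
# phi_k(x) = sqrt(2) cos(k pi x), k = 1..K: each integrates to zero over [0, 1];
# one would additionally drop any phi_k matching a canonical statistic of f_theta.
phi = np.sqrt(2.0) * np.cos(np.pi * np.arange(1, K + 1)[:, None] * x[None, :])
z_coef = rng.standard_normal(K) / np.arange(1, K + 1)    # decaying random coefficients
Z = z_coef @ phi                                          # one truncated sample path on the grid
print("grid average of Z (~ its integral over [0, 1]):", round(float(Z.mean()), 6))
```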

Bounded W(x)

Building upon Lenk (2003), we keep working with Gaussian processes and consider
$$f_{\theta, W}(x) = \frac{f_\theta(x)\, W(x)}{\int_I f_\theta(s)\, W(s)\, ds}, \qquad W(x) = \Psi(Z(x)),$$
where $\Psi(u)$ is a cdf having a smooth, unimodal, symmetric density $\psi(u)$ on the real line. With an additional condition on $\Psi(u)$, we can show that $W(x)$ preserves the asymptotic properties of the log Gaussian process prior. On the other hand, with $W(x) \le 1$, Walker (2011) describes a latent model which can deal with the intractable normalizing constant. It is based on
$$\sum_{k=0}^{\infty} \binom{n+k-1}{k} \left[ \int f_\theta(s)\, (1 - W(s))\, ds \right]^k = \left( \int W(s)\, f_\theta(s)\, ds \right)^{-n}.$$
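A quick numerical check of my own (assuming the reconstruction of the series identity above): with $a$ standing in for $\int f_\theta(s)(1 - W(s))\, ds$, the negative-binomial series recovers the $n$-th power of the reciprocal normalizing constant, which is what lets the intractable constant be traded for a latent index $k$.

```python
# Check sum_{k>=0} C(n+k-1, k) a^k = (1 - a)^{-n} for 0 < a < 1 (assumed stand-in value of a).
from math import comb

n, a = 5, 0.3                      # a plays the role of \int f_theta(s)(1 - W(s)) ds
series = sum(comb(n + k - 1, k) * a**k for k in range(200))
closed_form = (1.0 - a) ** (-n)    # equals (\int W(s) f_theta(s) ds)^{-n}
print(series, closed_form)         # the two agree to many decimal places
```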

Link function Ψ(u)

Lipschitz condition on $\log \Psi(u)$: $\psi(u)/\Psi(u) \le m$ uniformly on $\mathbb{R}$. This is satisfied by the standard Laplace, logistic and Cauchy cdfs, but not by the standard normal cdf. For fixed $\theta$, write $p_z = f_{\theta, \Psi(z)}$. It can be shown that, when $\|z_1 - z_2\|_\infty < \epsilon$,
$$h(p_{z_1}, p_{z_2}) \le m\epsilon\, e^{m\epsilon/2}, \qquad K(p_{z_1}, p_{z_2}) \le m^2 \epsilon^2 e^{m\epsilon} (1 + m\epsilon).$$
The posterior asymptotic results of van der Vaart and van Zanten (2008) carry over to this setting: if $\Psi^{-1}(f_0/f_\theta)$ is contained in the support of $Z$, then
$$\Pi\{p_z : h(p_z, f_0) > \epsilon \mid X_1, \dots, X_n\} \to 0, \quad F_0\text{-a.s.}$$
Results on the posterior contraction rate can also be derived.
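A small numerical check of my own of why the normal link fails the condition: the ratio $\psi(u)/\Psi(u)$ stays bounded for the Laplace, logistic and Cauchy cdfs but grows roughly like $|u|$ for the standard normal.

```python
# Evaluate psi(u)/Psi(u) in log space for several candidate links on a negative range of u.
import numpy as np
from scipy.stats import laplace, logistic, cauchy, norm

u = np.linspace(-30.0, 0.0, 7)
for name, dist in [("laplace", laplace), ("logistic", logistic),
                   ("cauchy", cauchy), ("normal", norm)]:
    ratio = np.exp(dist.logpdf(u) - dist.logcdf(u))   # psi(u)/Psi(u)
    print(f"{name:8s} max ratio on [-30, 0]: {ratio.max():.3f}")   # bounded except for the normal
```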

Conditional posterior of W

(A) Lipschitz condition on $\log \Psi(u)$;
(B) $f_\theta(x)$ is continuous and bounded away from zero;
(C) the support of $Z$ contains $C(I)$, the space of continuous functions on $I$.

Theorem 1. Under assumptions (A), (B) and (C), the conditional posterior of $W$ given $\theta$ is exponentially consistent at all $f_0 \in C(I)$, i.e. for any $\epsilon > 0$,
$$\pi\{W : h(f_{\theta, W}, f_0) > \epsilon \mid \theta, X_1, \dots, X_n\} \le e^{-dn}, \quad F_0\text{-a.s.},$$
for some $d > 0$ as $n \to \infty$.

As a corollary, for fixed $\theta$, the posterior of $C(x) = W(x)/\int_I f_\theta(s)\, W(s)\, ds$ consistently estimates the discrepancy $f_0(x)/f_\theta(x)$. The exponential convergence to 0 is a by-product of standard techniques for proving posterior consistency.

Marginal posterior of θ

For given $f_0$, let $\theta_0$ be the parameter value that minimizes $\int_I f_0 \log(f_0/f_\theta)$:
$$\theta_0 = \arg\min_{\theta \in \Theta} \int f_0 \log(f_0 / f_\theta).$$
Under some regularity conditions on the family $f_\theta$ and on the prior at $\theta_0$, the posterior accumulates at $\theta_0$ at rate $\sqrt{n}$:
$$\pi\left(\|\theta - \theta_0\| > M_n n^{-1/2} \mid X_1, \dots, X_n\right) \to 0, \quad F_0\text{-a.s.};$$
see Kleijn and van der Vaart (2012). One of the key regularity conditions on $f_\theta$ is the existence of an open neighborhood $U$ of $\theta_0$ and a square-integrable function $m_{\theta_0}(x)$ such that, for all $\theta_1, \theta_2 \in U$,
$$|\log(f_{\theta_1}/f_{\theta_2})| \le m_{\theta_0} \|\theta_1 - \theta_2\|, \quad P_0\text{-a.s.}$$
For our purposes, we focus on a different local property: there exist $\alpha > 0$ and an open neighborhood $U$ of $\theta_0$ such that, for all $\theta_1, \theta_2 \in U$,
$$\|\log(f_{\theta_1}/f_{\theta_2})\|_\infty \le \|\theta_1 - \theta_2\|^\alpha. \tag{D}$$

Marginal posterior of W

Assume the regularity conditions on $f_\theta$ and $\pi(\theta)$ are satisfied for $f_0 \in C(I)$. Recall the definition of the marginal posterior of $W$,
$$\pi(W \mid x_1, \dots, x_n) = \int_\Theta \pi(W \mid \theta, x_1, \dots, x_n)\, \pi(d\theta \mid x_1, \dots, x_n).$$

Theorem 2. Under assumptions (A), (B), (C) and (D), the marginal posterior of $W(x)$ satisfies, as $n \to \infty$,
$$\pi\{W : h(f_{\theta_0, W}, f_0) > \epsilon \mid X_1, \dots, X_n\} \to 0, \quad F_0\text{-a.s.}$$

The marginal posterior of $W$ is evaluated outside a neighborhood defined in terms of $\theta_0$. Clearly, if $\pi(\theta)$ is degenerate at $\theta_0$, the result follows directly from Theorem 1. Hint of the proof: it is sufficient to consider the posterior when the prior is restricted to $\{\|\theta - \theta_0\| \le M_n n^{-1/2}\}$. We then manipulate numerator and denominator by using (D) together with the inequalities
$$\exp\{-\|\log(f_{\theta_0}/f_\theta)\|_\infty\} \le \frac{\int_I f_{\theta_0}(x)\, W(x)\, dx}{\int_I f_\theta(x)\, W(x)\, dx} \le \exp\{\|\log(f_{\theta_0}/f_\theta)\|_\infty\}.$$

Marginal posterior of C

Recall the definition
$$C(x) = C_{\theta, W}(x) = \frac{W(x)}{\int W(s)\, f_\theta(s)\, ds},$$
which is designed to estimate $C_0(x) = f_0(x)/f_{\theta_0}(x)$.

Corollary. Under the hypotheses of Theorem 2, as $n \to \infty$,
$$\pi\left\{\int_I |C - C_0| > \epsilon \;\Big|\; X_1, \dots, X_n\right\} \to 0, \quad F_0\text{-a.s.}$$

Together with $\pi(\|\theta - \theta_0\| > M_n n^{-1/2} \mid X_1, \dots, X_n) \to 0$, we conclude that the posterior of $(\theta, C)$ converges to $(\theta_0, C_0)$. Hint of the proof: Theorem 2 implies that $\int_I |C_{\theta_0, W} - C_0| \to 0$. By the triangle inequality, it is then sufficient to show that, uniformly over $\|\theta - \theta_0\| \le M n^{-1/2}$, $\int_I |C_{\theta, W} - C_{\theta_0, W}| \to 0$; this we show by using (D).

Illustration 1

$n = 500$ observations from $f_0(x) = 2(1 - x)$; $f_\theta(x) = \theta e^{-\theta x}/(1 - e^{-\theta})$ with improper prior $\pi(\theta) \propto 1/\theta$.

Figure: Estimated (bold) and true (dashed) discrepancy function $C_0(x)$ on $[0, 1]$.
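For reference, the dashed curve in the figure can be reproduced directly; a short sketch of my own, taking $\theta_0 = 2.15$ from the next slide.

```python
# True discrepancy C_0(x) = f_0(x)/f_{theta_0}(x) for Illustration 1 on a grid.
import numpy as np

theta0 = 2.15
x = np.linspace(0.0, 1.0, 101)
f0 = 2.0 * (1.0 - x)
f_theta0 = theta0 * np.exp(-theta0 * x) / (1.0 - np.exp(-theta0))
C0 = f0 / f_theta0     # vanishes at x = 1 and peaks a bit above 1.2 near x ~ 0.5
```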

Parametric Bayes update

Simulation from the posterior of $\theta$. The minimum Kullback-Leibler parameter value is $\theta_0 = 2.15$.

Figure: Posterior distribution of $\theta$ with the parametric Bayes update. Posterior mean 2.13.
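A quick check of the stated value: since $\int_0^1 x\, f_0(x)\, dx = 1/3$, minimizing $\int f_0 \log(f_0/f_\theta)$ for this family is equivalent to maximizing $g(\theta) = \log\theta - \theta/3 - \log(1 - e^{-\theta})$, and the stationarity condition $1/\theta - 1/3 - e^{-\theta}/(1 - e^{-\theta}) = 0$ is indeed solved at $\theta_0 \approx 2.15$.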

Full Bayes update

Using the proper conditional posterior $\pi(\theta \mid W, x_1, \dots, x_n) \propto \pi(\theta) \prod_i f_{\theta, W}(x_i)$.

Figure: Posterior distribution of $\theta$ with the formal Bayes update. Posterior mean 1.97.

Illustration 2

$n = 500$ observations from $f_0(x) = 2x$; $f_\theta(x) = \theta x^{\theta - 1}$ with $0 < \theta < 1$ and a uniform prior for $\theta$; here $\theta_0 = 1$.

Figure: Posterior distributions of $\theta$ with the parametric Bayes update (top) and the formal Bayes update (bottom).
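A quick check of my own that $\theta_0$ sits at the boundary here: for $f_0(x) = 2x$ the Kullback-Leibler divergence decreases monotonically in $\theta$ over $(0, 1)$, so the constrained minimizer is $\theta_0 = 1$.

```python
# KL(f_0 || f_theta) for f_0(x) = 2x and f_theta(x) = theta x^(theta - 1), evaluated on a few theta.
import numpy as np
from scipy.integrate import quad

kl = lambda th: quad(lambda x: 2 * x * np.log(2 * x / (th * x ** (th - 1))), 1e-9, 1.0)[0]
for th in (0.7, 0.8, 0.9, 0.99, 1.0):
    print(th, round(kl(th), 4))      # monotonically decreasing toward theta = 1
```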

Discussion

Both the proposed update and the formal Bayes update seem to provide a suitable estimate of $C$, which is not surprising given the flexibility of the model in estimating $C_0$ with alternative values of $\theta$. Yet the posterior for $\theta$ is more accurate under the parametric Bayes update. This shows that semiparametric models need to be thought about carefully: the parametric part needs to define which $\theta$ is being targeted.

Future work:
- $f_\theta$ with unbounded support;
- extension to posterior contraction rates;
- connections with asymptotic properties of empirical Bayes;
- use of the $C(x)$ function for model selection.

References

De Blasi & Walker (2012). Bayesian estimation of the discrepancy with misspecified parametric models. Tech. Rep., submitted.
Hjort & Glad (1995). Nonparametric density estimation with a parametric start. Ann. Statist. 23, 882-904.
Kleijn & van der Vaart (2012). The Bernstein-von Mises theorem under misspecification. Electron. J. Stat. 6, 354-381.
Lenk (1988). The logistic normal distribution for Bayesian, nonparametric, predictive densities. J. Amer. Statist. Assoc. 83, 509-516.
Rousseau (2008). Approximating interval hypothesis: p-values and Bayes factors. In Bayesian Statistics 8, 417-452.
Tokdar (2007). Towards a faster implementation of density estimation with logistic Gaussian process priors. J. Comput. Graph. Statist. 16, 633-655.
van der Vaart & van Zanten (2008). Rates of contraction of posterior distributions based on Gaussian process priors. Ann. Statist. 36, 1435-1463.
Verdinelli & Wasserman (1998). Bayesian goodness of fit testing using infinite dimensional exponential families. Ann. Statist. 26, 1215-1241.
Walker (2011). Posterior sampling when the normalizing constant is unknown. Comm. Statist. Simulation Comput. 40, 784-792.