FREQUENTIST BEHAVIOR OF FORMAL BAYESIAN INFERENCE

Donald A. Pierce, Oregon State Univ (Emeritus), RERF Hiroshima (Retired), Oregon Health Sciences Univ (Adjunct)
Ruggero Bellio, Univ of Udine
Perugia talk, 7 July 2008. Slides are at ftp://www.science.oregonstate.edu/~piercedo/

INTRODUCTION
- We are interested in relations between frequentist inference and Bayesian inference using diffuse priors, but differently than in terms of matching priors.
- We find that posterior densities can provide approximations to certain pseudo-likelihoods, allowing quite accurate frequentist inference from the Bayesian calculations.
- We are particularly interested in Bayesian MCMC methods, which are useful when evaluating the likelihood function involves intractable integrals; e.g. GLMMs, covariate errors, imputation of missing data, spatial (and image) analysis, and so forth.
- Many of those using MCMC seem really to be seeking frequentist inference, using the MCMC only to resolve difficulties in evaluating the likelihood.

- For very large n, frequentist and Bayesian inference coincide, since $(\hat\theta - \theta)/\mathrm{SE}$ has the same Gaussian limiting distribution from either perspective.
- This is not so comforting if either of these distributions is clearly not Gaussian.
- Modern frequentist likelihood asymptotics points to deeper connections between the two modes of inference. This pertains largely to the treatment of nuisance parameters.
- Usually, frequentist methods maximize nuisance parameters out of the likelihood function, while Bayesian methods integrate them out. But modern (frequentist) likelihood asymptotics also suggests integrating out nuisance parameters.
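To make the large-n coincidence concrete, here is a minimal R sketch (our illustration, not from the talk) for a binomial model: the standardized quantity is approximately N(0,1) both as a frequentist sampling distribution of the MLE and, under a flat prior, as a posterior distribution.

# Frequentist pivot: repeated samples, fixed theta
set.seed(1)
n <- 500; theta <- 0.3
y <- rbinom(5000, n, theta)
theta.hat <- y / n
se <- sqrt(theta.hat * (1 - theta.hat) / n)
z.freq <- (theta.hat - theta) / se
# Bayesian analogue: one data set, flat (Beta(1,1)) prior, posterior draws
y1 <- rbinom(1, n, theta)
post <- rbeta(5000, y1 + 1, n - y1 + 1)
z.bayes <- (post - y1 / n) / sqrt((y1 / n) * (1 - y1 / n) / n)
c(mean(z.freq), sd(z.freq), mean(z.bayes), sd(z.bayes))  # all near (0, 1)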

- Inference about $\psi(\theta)$, where $\theta$ is multi-dimensional. Reparametrize as $(\psi, \lambda)$, realizing that the choice of representation of $\lambda$ is largely arbitrary.
- Profile likelihood (PL) is fundamental and does not depend on that arbitrary matter:
$L_P(\psi) = \max_{\lambda} L(\psi, \lambda; y) = \max_{\{\theta:\, \psi(\theta) = \psi\}} L(\theta; y)$
- Modern likelihood asymptotics points to two types of modification of PL, dealing with the effects of fitting nuisance parameters.
- Approximate conditional likelihood (ACL), in terms of an orthogonal nuisance parameter $\lambda$ (not unique):
$L_{AC}(\psi) = L_P(\psi)\, \mathrm{var}(\hat\lambda;\, \psi \text{ known})^{1/2}$ (using the asymptotic variance, observed information)
- Modified profile likelihood (MPL), which does not depend on the representation of the nuisance parameters:
$L_{MP}(\psi) = L_{AC}(\psi)\, \bigl|\partial\hat\lambda / \partial\hat\lambda_\psi\bigr|$, where $\hat\lambda_\psi$ is the constrained MLE.
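A minimal R sketch of these quantities for a normal model we choose purely for illustration, with interest parameter $\psi = \sigma$ and nuisance $\lambda = \mu$ (which happens to be orthogonal to $\sigma$, so here $L_{AC} = L_{MP}$):

set.seed(1)
y <- rnorm(30, mean = 2, sd = 1.5); n <- length(y)
loglik <- function(mu, sigma) sum(dnorm(y, mu, sigma, log = TRUE))
sigma.grid <- seq(0.8, 3, length.out = 100)
# Profile: mu.hat_psi = mean(y) for every sigma in this model
lp <- sapply(sigma.grid, function(s) loglik(mean(y), s))
# ACL: add log var(mu.hat; sigma known)^{1/2} = log(sigma / sqrt(n))
lac <- lp + 0.5 * log(sigma.grid^2 / n)
# Since mu.hat_psi does not depend on sigma, the Jacobian factor is 1 and
# L_MP = L_AC: the adjustment turns the sigma^{-n} profile into a
# sigma^{-(n-1)} marginal (REML-type) likelihood for sigma.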

Logistic regression example: 77 binary observations with 6 covariates. The ACL here uses the ACL formula but with the non-orthogonal canonical parameter. When using the orthogonal mean parameter, the ACL is the same as the MPL.

Orthogonal parameters:
- The main point is that the MLE of $\lambda$ when $\psi$ is known does not depend strongly on $\psi$. This will be true if the expected information matrix for $(\psi, \lambda)$ is block diagonal.
- This leads to a PDE whose solution gives an orthogonal $\lambda$; a solution always exists when $\psi$ is scalar, and it depends on both $\lambda$ and $\psi$.
- The Jacobian $\bigl|\partial\hat\lambda / \partial\hat\lambda_\psi\bigr|$ above is a quantity whose precise definition is complicated, but it can be approximated adequately in terms of the covariance of the score vector at two different parameter values, with a simpler approximation when the data can be represented as stochastically independent subsets.
- Orthogonal parameters are important to what follows, with interesting Bayesian and MCMC implications.
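Orthogonality is easy to check numerically. In this R sketch (again our illustrative normal model, not from the talk) the expected information, estimated as the Monte Carlo covariance of the score vector, comes out diagonal:

set.seed(1)
mu <- 2; sigma <- 1.5; n <- 50
# Score of the normal log-likelihood in (mu, sigma)
score <- function(y) c(sum(y - mu) / sigma^2,                   # d/d mu
                       -n / sigma + sum((y - mu)^2) / sigma^3)  # d/d sigma
S <- t(replicate(20000, score(rnorm(n, mu, sigma))))
round(cov(S), 2)  # approx diag(n/sigma^2, 2n/sigma^2); off-diagonal near 0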

- The ACL is based on the common approach of conditioning to eliminate nuisance parameters. It depends on the non-unique choice of the orthogonal parameter.
- The MPL is invariant to that choice, approximating a conditional likelihood when that is appropriate and a marginal likelihood when that is appropriate.
- It appears that the MPL is generally approximating an idealized integrated likelihood
$L_I^{ideal}(\psi) = \int L(\psi, \lambda; y)\, w(\lambda)\, d\lambda$,
where the weight function $w(\lambda)$ is, in principle, a suitably invariant distribution on $\psi^{-1}(\psi)$; this is difficult to deal with explicitly.
- Severini (2007) developed a good approximation to that aim, but essentially with a complicated weight function $w(\lambda)$ depending on the data.
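For the illustrative normal model, an integrated likelihood can be computed by direct numerical integration. This R sketch uses a flat weight $w(\lambda) = 1$ as a crude stand-in for the ideal invariant weight, which is hard to write down:

set.seed(1)
y <- rnorm(30, 2, 1.5); n <- length(y)
loglik <- function(mu, sigma) sum(dnorm(y, mu, sigma, log = TRUE))
sigma.grid <- seq(0.8, 3, length.out = 60)
log.Lint <- sapply(sigma.grid, function(s) {
  c0 <- loglik(mean(y), s)  # rescale by the profile value to avoid underflow
  f <- function(mu) sapply(mu, function(m) exp(loglik(m, s) - c0))
  c0 + log(integrate(f, mean(y) - 10, mean(y) + 10)$value)
})
# In this model the mu-integral is Gaussian, so log.Lint equals
# log L_P(sigma) + log(sigma / sqrt(n)) up to a constant: the ACL/MPL again.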

Bayesian calculations: If $\lambda$ and $\psi$ are a priori independent, then
$\pi(\psi \mid y) \propto \int L(\psi, \lambda; y)\, \pi(\psi)\, d\pi(\lambda) = \pi(\psi) \int L(\psi, \lambda; y)\, d\pi(\lambda)$,
so $\pi(\psi \mid y) \propto \pi(\psi)\, L_I^{\pi}(\psi; y)$,
where this special integrated likelihood depends on the representation of the nuisance parameter and the prior for it (related matters).
- Note that the prior $\pi(\psi)$ has a very simple effect, not very important for our considerations, since we are mainly concerned with some kind of pseudo-likelihood for $\psi$.
- An important Laplace approximation (Sweeting 1987) is
$L_I^{\pi}(\psi; y) \approx L_P(\psi)\, \mathrm{var}(\hat\lambda;\, \psi \text{ known})^{1/2}\, \pi(\hat\lambda_\psi)$,
which is the formula for the ACL but without imposing orthogonality.
- An integrated likelihood using an orthogonal parameter may approximate well a version of the ACL, and perhaps even the MPL.
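This R sketch checks the Laplace approximation in the illustrative normal model, with a hypothetical $N(0, 10^2)$ prior on the nuisance parameter $\mu$ (our choice, not from the talk):

set.seed(1)
y <- rnorm(30, 2, 1.5); n <- length(y)
loglik <- function(mu, sigma) sum(dnorm(y, mu, sigma, log = TRUE))
prior <- function(mu) dnorm(mu, 0, 10)
s <- 1.5                                # one value of psi = sigma
c0 <- loglik(mean(y), s)                # log L_P(s), for rescaling
exact <- c0 + log(integrate(function(mu)
           sapply(mu, function(m) exp(loglik(m, s) - c0) * prior(m)),
           mean(y) - 10, mean(y) + 10)$value)
# Laplace: log L_P + log var(mu.hat; s known)^{1/2} + log prior(mu.hat_s)
laplace <- c0 + 0.5 * log(s^2 / n) + log(prior(mean(y)))
c(exact, laplace)  # differ by the constant log(sqrt(2*pi)), which is
                   # irrelevant for likelihood ratios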

- It is often fairly easy with MCMC to obtain samples from $\pi(\psi, \lambda \mid y) \propto \pi(\psi)\, \pi(\lambda)\, p(y \mid \psi, \lambda)$.
- As noted above, if $\pi(\psi)$ is uniform then the marginal density of $\psi$ from this is an integrated likelihood $L_I^{\pi}(\psi; y)$, and this will depend little on the choice of a diffuse $\pi(\lambda)$.
- Recall that for diffuse $\pi(\lambda)$,
$L_I^{\pi}(\psi; y) \approx L_P(\psi)\, \mathrm{var}(\hat\lambda;\, \psi \text{ known})^{1/2}$,
but this is not the ACL unless $\lambda$ is orthogonal.
- It seems impractical to ask that the MCMC be done, generally, in an orthogonal parametrization.
- It seems not so difficult to approximate the final factor from analysis of the MCMC samples, and thus to remove it and obtain the PL.
- We suspect, but less confidently, that one can generally approximate from the samples a desired adjustment factor $\mathrm{var}(\hat\lambda;\, \psi \text{ known})^{1/2}$.
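A minimal R sketch of the MCMC route for the illustrative normal model: a two-block Gibbs sampler with a flat prior on $\mu$ and a $1/\sigma$ prior on $\sigma$ (flat in $\log\sigma$); the $\sigma$ margin of the draws is then an integrated likelihood, up to that prior.

set.seed(1)
y <- rnorm(30, 2, 1.5); n <- length(y)
B <- 20000; mu <- mean(y)
draws <- matrix(NA, B, 2, dimnames = list(NULL, c("mu", "sigma")))
for (b in 1:B) {
  # sigma^2 | mu, y is inverse-gamma(n/2, sum((y - mu)^2)/2) under this prior
  sig2 <- 1 / rgamma(1, n / 2, rate = sum((y - mu)^2) / 2)
  # mu | sigma^2, y is normal(ybar, sigma^2/n) under the flat prior
  mu <- rnorm(1, mean(y), sqrt(sig2 / n))
  draws[b, ] <- c(mu, sqrt(sig2))
}
plot(density(draws[, "sigma"]))  # compare with log.Lint from the earlier sketch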

- In the example given above, the MCMC posterior density is essentially the same as the wrong ACL when, as usual, $\lambda$ is represented in terms of canonical parameters.
- If the MCMC were done taking $\lambda$ as the orthogonal mean parameter, the posterior density would be essentially the same as the MPL.
- So one can obtain useful likelihood quantities from the MCMC, but it is crucial to use something close to the orthogonal representation of nuisance parameters.
- How to do this can be worked out for some important classes of problems, e.g. GLMMs, but it is difficult to do in general, so we need a way to fix up the damage from using non-orthogonal parameters.

Some comments:
- There is much written about the choice of $\pi(\psi)$, but for our intentions this is not really an issue. If it is not taken as uniform, one can divide by it at the end; doing this may affect the convergence of the MCMC.
- It is not really necessary to sample $\psi$ at all: one can take a grid of values for it and sample only $\lambda$. In WinBUGS, say, make many separate runs with $\psi$ fixed.
- As for orthogonal parameters, it will not work simply to transform the samples after the simulation; there are complications regarding the Jacobian that possibly could be dealt with.
- Even when a form of orthogonal parameter is known in simple form, it can be difficult to use it in the MCMC specifications.
- The quantities $\log \mathrm{var}(\hat\lambda;\, \psi \text{ known})^{1/2}$ are very smooth functions, often essentially linear in $\psi$.

Example: logistic regression with a random intercept for clusters:
$\mathrm{logit}(p_{ij}) = z_{ij}\beta + \sigma u_i$
- When taking $\psi = \sigma$, it turns out that a nearly orthogonal parametrization replaces $\beta$ by $\beta^{*} = \beta / \sqrt{1 + 0.3\sigma^2}$. This derives from considering the usual attenuation in marginal estimation with mixed models.
- Things work fine, i.e. the WinBUGS posterior density for $\sigma$ is essentially the MPL, as desired; the problem is that rather special considerations were necessary to obtain $\beta^{*}$. (When $\psi = \beta_j$, the original lack of orthogonality is minor.)
- Essentially the same approach works for any GLMM with random intercepts for clusters, and for the identity, log, or logit link (as above). But can we fix things up regarding non-orthogonal parameters in a more general manner?
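The attenuation behind this reparametrization is easy to see by simulation. In this R sketch (our illustration) the constant 0.346 is the usual $(16\sqrt{3}/(15\pi))^2$ value from the probit approximation to the logistic, of which the slide's 0.3 appears to be a rounded version:

set.seed(1)
eta <- 1.2; sigma <- 2
u <- rnorm(1e6)
marginal <- mean(plogis(eta + sigma * u))           # margin over u ~ N(0,1)
attenuated <- plogis(eta / sqrt(1 + 0.346 * sigma^2))
c(marginal, attenuated)                             # close agreement
# Hence beta.star = beta / sqrt(1 + 0.3 * sigma^2) varies little with sigma,
# and (beta.star, sigma) is nearly orthogonal; in the MCMC specification one
# would write the linear predictor with beta.star * sqrt(1 + 0.3 * sigma^2).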

Bacteria data: binary data on 50 individuals, with repeated observations at 0, 2, 4, 6, 11 weeks (some missing), for 220 total binary readings.
- Model: 3 parameters for treatments and 4 parameters for time variation; a logistic mixed model with an $N(0, \sigma^2)$ term on the logit scale for each individual, to allow for within-individual correlation.
- WinBUGS analysis, and HOA likelihood analysis using an R program prepared by Bellio in work with Brazzale.
- We consider here only inference about $\sigma$.

[Figure: likelihood curves against $\sigma$ for the bacteria data. The 3 left curves are the PL, the MPL, and the WinBUGS posterior under the orthogonal parametrization; the 2 right curves are the APL formula (Laplace) and WinBUGS under the original parametrization.]

[Figure: posterior samples of (sigma, const) in the orthogonal parameters. Orthogonality is seen as a constant regression of const on sigma. Variation of Var(const | sigma) would provide for the adjustment from PL to APL, but none is seen.]

[Figure: posterior samples of (sigma, const) in the original parameters. The lack of orthogonality is seen as a non-constant regression. Var(const | sigma) varies too much with sigma, including unwanted variation due to E(const | sigma).]

- Recall that
$L_I^{\pi}(\psi; y) \approx L_P(\psi)\, \mathrm{var}(\hat\lambda;\, \psi \text{ known})^{1/2}$,
and consider approximating the final factor from the MCMC samples, so that we can (first) remove it to obtain an approximation to the PL.
- From the MCMC samples we can compute (estimate arbitrarily well) $\mathrm{var}(\lambda \mid y;\, \psi \text{ known}) = \mathrm{var}(\lambda \mid \psi = \psi_0, y)$. (Recall that for our needs we could either fix $\psi$ at specified values or sample it from $\pi(\psi)$.)
- With a diffuse prior we can expect that $\mathrm{var}(\hat\lambda;\, \psi \text{ known}) \approx \mathrm{var}(\lambda \mid y;\, \psi \text{ known})$.
- Making this substitution, estimating the conditional variance function from the posterior samples, and back-calculating the PL works quite well in the various examples we have tried.
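A minimal R sketch of this back-calculation, continuing the Gibbs draws from the earlier sketch (matrix "draws" with columns mu and sigma): estimate $\log \mathrm{var}(\lambda \mid \psi, y)$ by binning on $\psi = \sigma$, smooth it, and remove both it and the non-uniform $1/\sigma$ prior from the posterior density.

br  <- quantile(draws[, "sigma"], probs = seq(0, 1, length.out = 21))
bin <- cut(draws[, "sigma"], br, include.lowest = TRUE)
v   <- tapply(draws[, "mu"], bin, var)       # var(mu | sigma in bin, y)
mid <- (head(br, -1) + tail(br, -1)) / 2     # bin midpoints
vfit <- lm(log(v) ~ mid)                     # log-variance is very smooth here
d <- density(draws[, "sigma"])
logadj <- log(d$y) + log(d$x) -              # + log(sigma) divides out the 1/sigma prior
  0.5 * predict(vfit, newdata = data.frame(mid = d$x))
plot(d$x, exp(logadj - max(logadj)), type = "l",
     xlab = "sigma", ylab = "back-calculated profile likelihood (rescaled)")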

[Figure: profile likelihood (solid), posterior density (rightmost dashed), and the posterior adjusted by the estimated variance function (leftmost dashed), recovering essentially the PL from the MCMC. One can actually do better than this with modification of details.]

Can we go further and approximate an ACL or the MPL? We would need to estimate the function $\mathrm{var}(\lambda \mid \psi = \psi_0, y)$.
[Figure: posterior samples of (sigma, const) in the original parameters, as above.]
One possibility is to remove from the variance function the variation due to the unwanted dependence $E(\lambda \mid \psi = \psi_0, y)$.
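A short sketch of that first possibility (hypothetical, again using the draws from the Gibbs example): remove the trend $E(\lambda \mid \psi, y)$ with a smooth fit, then estimate the variance function from the residuals.

fit <- loess(mu ~ sigma, data = as.data.frame(draws))  # E(mu | sigma, y)
res <- residuals(fit)                                  # detrended lambda draws
# Binning res^2 on sigma as before estimates var(mu | sigma, y) free of the
# unwanted variation contributed by the moving conditional mean.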

Another possibility is to estimate more directly from the posterior samples the key aspects of the transformation to orthogonal parameters. Possible resolutions of these issues are currently underway.
So where are we so far?
- We can use MCMC to approximate the PL, and likely an ACL or the MPL.
- Distinctions between using the posterior as a density and using it as a likelihood are readily explored (often they will not matter much).
- We can obtain very good frequentist inferences without the complications of choosing a prior for the interest parameter.
- As a practical matter, this provides for frequentist inference when the likelihood is difficult to evaluate.
- As a theoretical matter, it clarifies connections between the two modes of inference.