Some Curiosities Arising in Objective Bayesian Analysis

Some Curiosities Arising in Objective Bayesian Analysis

Jim Berger, Duke University and the Statistical and Applied Mathematical Sciences Institute

Yale University, May 15, 2009

Three vignettes related to John's work:
- Some puzzling things concerning invariant priors
- Some puzzling things concerning multiplicity
- What is the effective sample size?

I. Objective Priors: Why Can't We Have It All?

[Figure 1: John at Princeton in 1965.]

An Example: Inference for the Correlation Coefficient

The bivariate normal distribution of (x_1, x_2) has mean (µ_1, µ_2) and covariance matrix
$$\Sigma = \begin{pmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{pmatrix},$$
where ρ is the correlation between x_1 and x_2.

For a sample (x_{11}, x_{21}), (x_{12}, x_{22}), ..., (x_{1n}, x_{2n}), the sufficient statistics are $\bar{x} = (\bar{x}_1, \bar{x}_2)$, where $\bar{x}_i = n^{-1}\sum_{j=1}^{n} x_{ij}$, and
$$S = \sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})' = \begin{pmatrix} s_{11} & r\sqrt{s_{11}s_{22}} \\ r\sqrt{s_{11}s_{22}} & s_{22} \end{pmatrix},$$
where $s_{ij} = \sum_{k=1}^{n} (x_{ik} - \bar{x}_i)(x_{jk} - \bar{x}_j)$ and $r = s_{12}/\sqrt{s_{11}s_{22}}$.
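As a quick illustration of these sufficient statistics, here is a minimal sketch in Python/NumPy (the function name and interface are my own, not from the talk) that computes x̄, S and r from a bivariate sample:

```python
import numpy as np

def bivariate_sufficient_stats(x):
    """x: (n, 2) array of bivariate observations.
    Returns the sample mean, the matrix S of centered sums of squares,
    and the sample correlation r."""
    xbar = x.mean(axis=0)              # (x̄_1, x̄_2)
    centered = x - xbar
    S = centered.T @ centered          # s_ij = sum_k (x_ik - x̄_i)(x_jk - x̄_j)
    r = S[0, 1] / np.sqrt(S[0, 0] * S[1, 1])
    return xbar, S, r
```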

Three interesting priors for inference concerning ρ:

- The reference prior (Lindley and Bayarri):
$$\pi^{R}(\mu_1, \mu_2, \sigma_1, \sigma_2, \rho) = \frac{1}{\sigma_1 \sigma_2 (1-\rho^2)}.$$
- The right-Haar prior
$$\pi^{RH_1}(\mu_1, \mu_2, \sigma_1, \sigma_2, \rho) = \frac{1}{\sigma_2^2 (1-\rho^2)},$$
which is right-Haar with respect to the lower triangular matrix group action $(a_1, a_2, T_l) \circ (x_1, x_2) = T_l (x_1, x_2) + (a_1, a_2)$.
- The right-Haar prior
$$\pi^{RH_2}(\mu_1, \mu_2, \sigma_1, \sigma_2, \rho) = \frac{1}{\sigma_1^2 (1-\rho^2)},$$
which is right-Haar with respect to the upper triangular matrix, $T_u$, group action.

Credible intervals for ρ, under either right-Haar prior, can be approximated by the following steps (a code sketch follows below):
- drawing independent $Z \sim N(0,1)$, $\chi^2_{n-1}$ and $\chi^2_{n-2}$;
- setting $\rho = Y/\sqrt{1+Y^2}$, where
$$Y = \frac{Z}{\sqrt{\chi^2_{n-1}}} + \sqrt{\frac{\chi^2_{n-2}}{\chi^2_{n-1}}}\,\frac{r}{\sqrt{1-r^2}};$$
- repeating this process 10,000 times;
- using the $\alpha/2\%$ upper and lower percentiles of these generated ρ to form the desired confidence limits.

Credible intervals for ρ, under the reference prior, can be found by replacing the first two steps above by:
- Generate $\Sigma = \begin{pmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{pmatrix} \sim \text{Inverse Wishart}(S^{-1}, n-1)$.
- Generate $u \sim \text{Uniform}(0,1)$. If $u \le 1-\rho^2$, record ρ; otherwise repeat.
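A minimal Monte Carlo sketch of the right-Haar sampler above, in Python/NumPy; the function name and defaults are illustrative, and the formula used is the reconstruction given on this slide:

```python
import numpy as np

def right_haar_rho_interval(r, n, alpha=0.05, n_draws=10_000, seed=None):
    """Equal-tailed approximate credible limits for rho under a right-Haar prior,
    given the sample correlation r and the sample size n."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(n_draws)
    chi2_n1 = rng.chisquare(n - 1, size=n_draws)
    chi2_n2 = rng.chisquare(n - 2, size=n_draws)
    # Y = Z/sqrt(chi2_{n-1}) + sqrt(chi2_{n-2}/chi2_{n-1}) * r/sqrt(1 - r^2)
    y = z / np.sqrt(chi2_n1) + np.sqrt(chi2_n2 / chi2_n1) * r / np.sqrt(1 - r**2)
    rho = y / np.sqrt(1 + y**2)
    return np.quantile(rho, [alpha / 2, 1 - alpha / 2])
```

By Lemma 1 on the next slide, intervals produced this way also have exact frequentist coverage 1 − α.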

Lemma 1.
1. The Bayesian credible set for ρ, from either right-Haar prior, has exact frequentist coverage 1 − α.
2. This credible set is the fiducial confidence interval for ρ of Fisher (1930).

So we have it all: a simultaneously objective Bayesian, fiducial, and exact frequentist confidence set. Or do we?

In the 1960s, Brillinger showed that, if one starts with the density $f(r \mid \rho)$ of r, there is no prior distribution for ρ whose posterior equals the fiducial distribution. Geisser and Cornfield (1963) thus conjectured that fiducial and Bayesian inference could not agree here. (They do.)

But, since $\pi^{RH}(\rho \mid x)$ can be shown to depend on the data only through r, we have a marginalization paradox (Dawid, Stone and Zidek, 1973):
$$\pi^{RH}(\rho \mid x) = g(\rho, r) \not\propto f(r \mid \rho)\,\pi(\rho) \quad \text{for any } \pi(\cdot).$$

Can We Trust Bayesian Truisms with Improper Priors?

Two truisms: if considering various priors, either average over them (or go hierarchical), or choose the empirical Bayes prior that maximizes the marginal likelihood.

1. Consider the symmetrized right-Haar prior
$$\pi^{S}(\mu_1,\mu_2,\sigma_1,\sigma_2,\rho) = \pi^{RH_1}(\mu_1,\mu_2,\sigma_1,\sigma_2,\rho) + \pi^{RH_2}(\mu_1,\mu_2,\sigma_1,\sigma_2,\rho) = \frac{1}{\sigma_1^2(1-\rho^2)} + \frac{1}{\sigma_2^2(1-\rho^2)}.$$

2. Any rotation Γ of coordinates yields a new right-Haar prior. The empirical Bayes prior, $\pi^{EB}$, is the right-Haar prior for the rotation for which $\tilde{s}_{12} = 0$, where
$$\tilde{S} = \begin{pmatrix} \tilde{s}_{11} & \tilde{s}_{12} \\ \tilde{s}_{12} & \tilde{s}_{22} \end{pmatrix} = \Gamma S \Gamma'.$$
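The rotation singled out by the empirical Bayes choice is the one that diagonalizes S, so it can be found with an eigendecomposition; a minimal sketch (Python/NumPy, illustrative names):

```python
import numpy as np

def eb_rotation(S):
    """Return an orthogonal matrix Gamma such that Gamma @ S @ Gamma.T is diagonal,
    i.e. the rotated off-diagonal element s~_12 is zero."""
    _, eigvecs = np.linalg.eigh(S)   # S symmetric: S = V diag(eigenvalues) V'
    return eigvecs.T                 # rows are eigenvectors (orthogonal, possibly a reflection)
```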

(σ_1, σ_2, ρ)   R(Σ̂_1)   R(Σ̂_2)   R(Σ̂_S)   R(Σ̂_EB)
(1, 1, 0)        .4287     .4288     .4452     .6052
(1, 2, 0)        .4278     .4270     .4424     .5822
(1, 5, 0)        .4285     .4287     .4391     .5404
(1, 50, 0)       .4254     .4250     .4272     .5100
(1, 1, .1)       .4255     .4266     .4424     .5984
(1, 1, .5)       .4274     .4275     .4403     .5607
(1, 1, .9)       .4260     .4255     .4295     .5159
(1, 1, −.9)      .4242     .4243     .4280     .5119

Table 1: Estimated frequentist risks of various estimates of Σ, under Stein's loss and when n = 10; Σ̂_1 and Σ̂_2 are the right-Haar estimates, Σ̂_S is the symmetrized-prior estimate, and Σ̂_EB is the empirical Bayes estimate.

II. Bayesian Multiplicity Issues in Variable Selection

[Figure 2: John Hartigan with one of his multiple grandchildren.]

Problem: Data X arises from a normal linear regression model, with m possible regressors having associated unknown regression coefficients β_i, i = 1, ..., m, and unknown variance σ².

Models: Consider selection from among the submodels M_i, i = 1, ..., 2^m, each having only k_i regressors with coefficients β_i (a subset of (β_1, ..., β_m)) and resulting density $f_i(x \mid \beta_i, \sigma^2)$.

Prior density under M_i: Zellner–Siow priors $\pi_i(\beta_i, \sigma^2)$.

Marginal likelihood of M_i: $m_i(x) = \int f_i(x \mid \beta_i, \sigma^2)\,\pi_i(\beta_i, \sigma^2)\,d\beta_i\,d\sigma^2$.

Prior probability of M_i: $P(M_i)$.

Posterior probability of M_i: $P(M_i \mid x) = \dfrac{P(M_i)\,m_i(x)}{\sum_j P(M_j)\,m_j(x)}$.

Common Choices of the P(M_i)

Equal prior probabilities: $P(M_i) = 2^{-m}$.

Bayes exchangeable variable inclusion: Each variable β_i is independently in the model with unknown probability p (called the prior inclusion probability), and p has a Beta(p | a, b) distribution. (We use a = b = 1, the uniform distribution, as did Jeffreys (1961), who also suggested alternative choices of the P(M_i). Probably a = b = 1/2 is better.) Then, since k_i is the number of variables in model M_i,
$$P(M_i) = \int_0^1 p^{k_i}(1-p)^{m-k_i}\,\text{Beta}(p \mid a,b)\,dp = \frac{\text{Beta}(a+k_i,\, b+m-k_i)}{\text{Beta}(a,\,b)}.$$

Empirical Bayes exchangeable variable inclusion: Find the MLE $\hat{p}$ by maximizing the marginal likelihood of p, $\sum_j p^{k_j}(1-p)^{m-k_j}\,m_j(x)$, and use $P(M_i) = \hat{p}^{k_i}(1-\hat{p})^{m-k_i}$ as the prior model probabilities.
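A minimal sketch of the Bayes exchangeable prior model probability (Python with SciPy; the function name is illustrative), computed on the log scale for numerical stability:

```python
import numpy as np
from scipy.special import betaln

def prior_model_prob(k, m, a=1.0, b=1.0):
    """P(M_i) = Beta(a + k, b + m - k) / Beta(a, b) for a model including k of the
    m candidate variables, under a Beta(a, b) prior on the inclusion probability p."""
    return np.exp(betaln(a + k, b + m - k) - betaln(a, b))

# With a = b = 1 and m = 10, a model with k = 3 included variables gets prior
# probability 1 / (11 * C(10, 3)): each model size gets total probability 1/(m+1),
# shared equally among the models of that size.
```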

Controlling for multiplicity in variable selection

- Equal prior probabilities: $P(M_i) = 2^{-m}$ does not control for multiplicity; it corresponds to a fixed prior inclusion probability p = 1/2 for each variable.
- Empirical Bayes exchangeable variable inclusion does control for multiplicity, in that $\hat{p}$ will be small if there are many β_i that are zero.
- Bayes exchangeable variable inclusion also controls for multiplicity (see Scott and Berger, 2008), even though the P(M_i) are fixed.

Note: The control of multiplicity by Bayes and EB variable inclusion usually reduces model complexity, but it is different from the usual Bayesian Ockham's-razor effect that reduces model complexity. The Bayesian Ockham's razor operates through the effect of the model priors $\pi_i(\beta_i, \sigma^2)$ on $m_i(x)$, penalizing models with more parameters. Multiplicity correction occurs through the choice of the P(M_i).

                    Equal model probabilities       Bayes variable inclusion
                    Number of noise variables       Number of noise variables
Signal               1     10     40     90          1     10     40     90
β_1:  −1.08        .999   .999   .999   .999       .999   .999   .999   .999
β_2:  −0.84        .999   .999   .999   .999       .999   .999   .999   .988
β_3:  −0.74        .999   .999   .999   .999       .999   .999   .999   .998
β_4:  −0.51        .977   .977   .999   .999       .991   .948   .710   .345
β_5:  −0.30        .292   .289   .288   .127       .552   .248   .041   .008
β_6:  +0.07        .259   .286   .055   .008       .519   .251   .039   .011
β_7:  +0.18        .219   .248   .244   .275       .455   .216   .033   .009
β_8:  +0.35        .773   .771   .994   .999       .896   .686   .307   .057
β_9:  +0.41        .927   .912   .999   .999       .969   .861   .567   .222
β_10: +0.63        .995   .995   .999   .999       .996   .990   .921   .734
False positives      0      2      5     10          0      1      0      0

Table 2: Posterior inclusion probabilities for 10 real variables in a simulated data set.

Comparison of Bayes and Empirical Bayes Approaches

Theorem 1. In the variable-selection problem, if the null model (or the full model) has the largest marginal likelihood m(x) among all models, then the MLE of p is $\hat{p} = 0$ (or $\hat{p} = 1$). (The naive EB approach, which assigns $P(M_i) = \hat{p}^{k_i}(1-\hat{p})^{m-k_i}$, then concludes that the null (full) model has probability 1.)

A simulation with 10,000 repetitions to gauge the severity of the problem:
- m = 14 covariates, orthogonal design matrix;
- p drawn from Uniform(0, 1); regression coefficients are 0 with probability p and drawn from a Zellner–Siow prior with probability (1 − p);
- n = 16, 60, and 120 observations drawn from the given regression model.

Case        $\hat{p} = 0$    $\hat{p} = 1$
n = 16           820              781
n = 60           783              766
n = 120          723              747

Is empirical Bayes at least asymptotically accurate as m → ∞?

Posterior model probabilities, given p:
$$P(M_i \mid x, p) = \frac{p^{k_i}(1-p)^{m-k_i}\,m_i(x)}{\sum_j p^{k_j}(1-p)^{m-k_j}\,m_j(x)}.$$

Posterior distribution of p:
$$\pi(p \mid x) = K \sum_j p^{k_j}(1-p)^{m-k_j}\,m_j(x).$$

This does concentrate about the true p as m → ∞, so one might expect that
$$P(M_i \mid x) = \int_0^1 P(M_i \mid x, p)\,\pi(p \mid x)\,dp \approx P(M_i \mid x, \hat{p}) \propto m_i(x)\,\hat{p}^{k_i}(1-\hat{p})^{m-k_i}.$$

This is not necessarily true; indeed,
$$\int_0^1 P(M_i \mid x, p)\,\pi(p \mid x)\,dp = \int_0^1 \frac{m_i(x)\,p^{k_i}(1-p)^{m-k_i}}{\pi(p \mid x)/K}\,\pi(p \mid x)\,dp = K\,m_i(x) \int_0^1 p^{k_i}(1-p)^{m-k_i}\,dp \propto m_i(x)\,P(M_i).$$

Caveat: Some EB techniques have been justified; see Efron and Tibshirani (2001), Johnstone and Silverman (2004), Cui and George (2006), and Bogdan et al. (2008).

III. What is the Effective Sample Size in Generalized BIC?

[Figure 3: John Hartigan in Sin City.]

Data: Independent vectors $x_i \sim g_i(x_i \mid \theta)$, for i = 1, ..., n.

Unknown: θ = (θ_1, ..., θ_p); $\hat{\theta}$ is the MLE.

Log-likelihood function: $l(\theta) = \log f(x \mid \theta) = \log\left(\prod_{i=1}^{n} g_i(x_i \mid \theta)\right)$, where x = (x_1, ..., x_n).

Usual BIC: $\text{BIC} = 2\,l(\hat{\theta}) - p \log n$ (Schwarz, 1978).

Generalization of BIC:
$$2\,l(\hat{\theta}) - \sum_{i=1}^{p} \log(1+n_i) + 2 \sum_{i=1}^{p} \log\!\left(\frac{1-e^{-v_i}}{\sqrt{2}\,v_i}\right), \qquad v_i = \frac{\hat{\xi}_i^2}{d_i(1+n_i)},$$
where the $d_i^{-1}$ are the eigenvalues of the observed information matrix, the $\xi_i$ are the coordinates of an orthogonally transformed θ, and $n_i$ is the effective sample size corresponding to $\xi_i$. What should these effective sample sizes be?
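A minimal sketch of evaluating this generalized criterion as reconstructed above (Python/NumPy; the function name and argument layout are my own, and the √2 in the correction term is an assumption about a symbol lost in transcription):

```python
import numpy as np

def generalized_bic(loglik_hat, xi_hat, d, n_eff):
    """2*l(theta_hat) - sum_i log(1 + n_i) + 2*sum_i log((1 - exp(-v_i)) / (sqrt(2)*v_i)),
    with v_i = xi_hat_i**2 / (d_i * (1 + n_i)).
    loglik_hat: l(theta_hat); xi_hat, d, n_eff: length-p arrays.
    NOTE: the sqrt(2) is an assumption about the transcribed formula."""
    xi_hat, d, n_eff = (np.asarray(a, dtype=float) for a in (xi_hat, d, n_eff))
    v = xi_hat**2 / (d * (1.0 + n_eff))
    return (2.0 * loglik_hat
            - np.sum(np.log(1.0 + n_eff))
            + 2.0 * np.sum(np.log((1.0 - np.exp(-v)) / (np.sqrt(2.0) * v))))
```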

Ex. Group means: For i = 1, ..., p and l = 1, ..., r, $X_{il} = \mu_i + \epsilon_{il}$, where $\epsilon_{il} \sim N(0, \sigma^2)$.

It might seem that n = pr but, if one followed Schwarz, one would have (defining $\mu = (\mu_1, \ldots, \mu_p)'$) that $X_l = (X_{1l}, \ldots, X_{pl})' \sim N_p(\mu, \sigma^2 I)$, l = 1, ..., r, so that the sample size appearing in BIC should be r.

The effective sample size for each $\mu_i$ is r, but the effective sample size for σ² is pr, so effective sample size is parameter-dependent.

One could easily be in the situation where p → ∞ but the effective sample size r stays fixed.

Ex. Random effects group means: $\mu_i \sim N(\xi, \tau^2)$, with ξ and τ² unknown. What is the number of parameters (see also Pauler, 1998)?

(1) If τ² = 0, there is only one parameter, ξ.
(2) If τ² is huge, is the number of parameters p + 2?
(3) But, if one integrates out $\mu = (\mu_1, \ldots, \mu_p)$, then
$$f(x \mid \sigma^2, \xi, \tau^2) = \int f(x \mid \mu, \sigma^2)\,\pi(\mu \mid \xi, \tau^2)\,d\mu \propto \frac{1}{\sigma^{p(r-1)}} \exp\!\left\{-\frac{\hat{\sigma}^2}{2\sigma^2}\right\} \prod_{i=1}^{p} \exp\!\left\{-\frac{(\bar{x}_i - \xi)^2}{2\left(\frac{\sigma^2}{r} + \tau^2\right)}\right\},$$
so p = 3 if one can work directly with $f(x \mid \sigma^2, \xi, \tau^2)$.

Note: In this example the effective sample sizes should be pr for σ², p for ξ and τ², and r for the $\mu_i$'s.

Ex. Common mean, differing variances: Suppose n/2 of the $Y_i$ are N(θ, 1), while n/2 are N(θ, 1000). Clearly the effective sample size is roughly n/2.

Ex. ANOVA: $Y = (Y_1, \ldots, Y_n)' \sim N_n(X\beta, \sigma^2 I)$, where X is a given n × p matrix of 1's and −1's with orthogonal columns, and where $\beta = (\beta_1, \ldots, \beta_p)'$ and σ² are unknown. Then the information matrix for θ = (β, σ²) is
$$\hat{I} = \begin{pmatrix} \frac{n}{\hat{\sigma}^2} I_{p \times p} & 0 \\ 0 & \frac{n}{2\hat{\sigma}^4} \end{pmatrix},$$
so that now the effective sample size appears to be n for all parameters.

Note: The group-means problem and ANOVA are both linear models, so one can have effective sample sizes ranging from r = 1 to n for parameters in a linear model.

Defining the effective sample size $n_j$ for $\xi_j$: For the case where no variables are integrated out, a possible general definition for the effective sample size follows from considering the information associated with observation $x_i$, arising from the single-observation expected information matrix
$$I_i = \left[I_{i,jk}\right] = O'\left[-E\!\left(\frac{\partial^2}{\partial\theta_j\,\partial\theta_k}\log f_i(x_i \mid \theta)\right)\right]O, \qquad \text{evaluated at } \theta = \hat{\theta},$$
where O is the orthogonal transformation defining the coordinates ξ.

Since $I_{jj} = \sum_{i=1}^{n} I_{i,jj}$ is the expected information about $\xi_j$, a reasonable way to define $n_j$ is:
- define information weights $w_{ij} = I_{i,jj} / \sum_{k=1}^{n} I_{k,jj}$;
- define the effective sample size for $\xi_j$ as
$$n_j = \frac{I_{jj}}{\sum_{i=1}^{n} w_{ij}\, I_{i,jj}} = \frac{\left(I_{jj}\right)^2}{\sum_{i=1}^{n} \left(I_{i,jj}\right)^2}.$$

Intuitively, $\sum_i w_{ij} I_{i,jj}$ is a weighted measure of the information per observation, and dividing the total information about $\xi_j$ by this information per case seems plausible as an effective sample size.
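A minimal sketch of this definition (Python/NumPy; the input is a hypothetical array of per-observation information values $I_{i,jj}$ for a single coordinate $\xi_j$):

```python
import numpy as np

def effective_sample_size(info_per_obs):
    """info_per_obs: length-n array of single-observation informations I_{i,jj}
    about one coordinate xi_j.  Returns n_j = (sum_i I_{i,jj})^2 / sum_i I_{i,jj}^2."""
    info = np.asarray(info_per_obs, dtype=float)
    return info.sum() ** 2 / np.sum(info ** 2)

# Common-mean, differing-variances example from above: n/2 observations of
# information 1 and n/2 of information 1/1000 give n_j close to n/2.
print(effective_sample_size([1.0] * 50 + [0.001] * 50))   # roughly 50 for n = 100
```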

THANKS ALL

THANKS JOHN