Overall Objective Priors
Jim Berger, Jose Bernardo and Dongchu Sun
Duke University, University of Valencia and University of Missouri
Recent advances in statistical inference: theory and case studies
University of Padova, March 21-23, 2013

Outline
- Background
- Previous approaches to development of an overall prior
- New approaches to development of an overall prior
- Specifics of the reference distance approach
- Specifics of the prior modeling approach
- Summary

Background
- Objective Bayesian methods have priors defined by the model (or model structure).
- In models with a single unknown parameter, the acclaimed objective prior is the Jeffreys-rule prior (more generally, the reference prior).
- In multiparameter models, the optimal objective (e.g., reference or matching) prior depends on the quantity of interest, e.g., the parameter concerning which inference is being performed.
- But often one needs a single overall prior:
  - for prediction
  - for decision analysis, when the user might consider non-standard quantities of interest
  - for computational simplicity
  - for sociological reasons

Example: Bivariate normal distribution, with mean parameters µ1 and µ2, standard deviations σ1 and σ2, and correlation ρ.
- Berger and Sun (AOS 2008) studied priors that had been considered for 21 quantities of interest (original parameters and derived ones such as µ1/σ1). An optimal prior for each quantity of interest was suggested.
- An overall prior was also suggested. The primary criterion used to judge candidate overall priors was reasonable frequentist coverage properties of resulting credible intervals for the most important quantities of interest.
- The prior (from Lindley and Bayarri)
    π^O(µ1, µ2, σ1, σ2, ρ) = 1 / [σ1 σ2 (1 − ρ²)]
  was the suggested overall prior.

Previous Approaches to Development of an Overall Prior
I. Group-invariance priors
II. Constant or vague proper priors
III. The Jeffreys-rule prior

Notation:
- Data: x
- Unknown model parameters: θ
- Data density: p(x | θ)
- Prior density: π(θ)
- Marginal (predictive) density: p(x) = ∫ p(x | θ) π(θ) dθ
- Posterior density: π(θ | x) = p(x | θ) π(θ) / p(x)

I. Group-invariance priors: If p(x | θ) has a group invariance structure, then the recommended objective prior is typically the right-Haar prior.
- It often works well for all parameters that define the invariance structure.
  Example: If the sampling model is N(xᵢ | µ, σ), the right-Haar prior is π(µ, σ) = 1/σ, and this is fine for either µ or σ (yielding the usual objective posteriors).
- But it may be poor for other parameters.
  Example: For the bivariate normal problem, one right-Haar prior is
    π1(µ1, µ2, σ1, σ2, ρ) = 1 / [σ1² (1 − ρ²)],
  which is fine for µ1, σ1 and ρ, but leads to problematical posteriors for µ2 and σ2 (Berger and Sun, 2008).
- And it may not be unique.
  Example: For the bivariate normal problem, another right-Haar prior is
    π2(µ1, µ2, σ1, σ2, ρ) = 1 / [σ2² (1 − ρ²)].

The situation can be even worse if the right-Haar prior is used for derived parameters.
Example: Multi-normal means: Let xᵢ be independent normal with mean µᵢ and variance 1, for i = 1, …, m. The right-Haar (actually Haar) prior for µ = (µ1, …, µm) is π(µ) = 1.
- It results in a sensible N(µᵢ | xᵢ, 1) posterior for each individual µᵢ.
- But it is terrible for θ = (1/m) |µ|² = (1/m) Σ_{i=1}^m µᵢ² (Stein). The posterior mean of θ is 1 + (1/m) Σ_{i=1}^m xᵢ²; this converges to θ + 2 as m → ∞; indeed, the posterior concentrates sharply around θ + 2 and so is badly inconsistent.
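To see the inconsistency numerically, here is a minimal Monte Carlo sketch (not from the talk; it assumes numpy and uses illustrative variable names): under the flat prior each µᵢ has posterior N(xᵢ, 1), so the posterior mean of θ is (1/m) Σ (xᵢ² + 1), which sits near θ + 2 for large m.

```python
import numpy as np

rng = np.random.default_rng(0)

m = 100_000                      # number of means
mu = rng.normal(0.0, 1.0, m)     # arbitrary "true" means
theta_true = np.mean(mu**2)      # quantity of interest: (1/m) * sum(mu_i^2)

x = rng.normal(mu, 1.0)          # one observation per mean, variance 1

# Flat prior on mu => posterior for each mu_i is N(x_i, 1), so
# E[mu_i^2 | x] = x_i^2 + 1 and the posterior mean of theta is:
theta_post_mean = np.mean(x**2 + 1.0)

print(f"true theta     : {theta_true:.3f}")
print(f"posterior mean : {theta_post_mean:.3f}")
print(f"true theta + 2 : {theta_true + 2:.3f}")   # what the posterior concentrates on
```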

II. Constant or vague proper priors are often used as the overall prior.
- The problems of a constant prior are well-documented, including lack of invariance to transformation (the original problem with Laplace's "inverse probability"), frequent posterior impropriety (as in the first full Bayesian analyses of Gaussian spatial models with an exponential correlation structure, when constant priors were used for the range parameter), and possibly terrible performance (as in the previous example).
- Vague proper priors (such as a constant prior over a large compact set) are at best equivalent to use of a constant prior (and so inherit the flaws of a constant prior); they can be worse, in that they can hide problems such as a lack of posterior propriety.

III. The Jeffreys-rule prior: If the data model density is p(x | θ), the Jeffreys-rule prior for the unknown θ = {θ1, …, θk} has the form
    |I(θ)|^{1/2} dθ1 … dθk,
where I(θ) is the k × k Fisher information matrix with (i, j) element
    I(θ)ᵢⱼ = −E_{x|θ}[ ∂² log p(x | θ) / ∂θᵢ ∂θⱼ ].
This is the optimal objective prior (from many perspectives) for regular one-parameter models, but has problems for multi-parameter models:
- The right-Haar prior in the earlier multi-normal mean problem is also the Jeffreys-rule prior there, and yielded inconsistent estimators. (It also yields inconsistent estimators in the Neyman-Scott problem.)
- For the N(xᵢ | µ, σ) model, the Jeffreys-rule prior is π(µ, σ) = 1/σ², which results in posterior inferences for µ and σ that have degrees of freedom equal to n, not the correct n − 1.
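As a quick check of the N(xᵢ | µ, σ) statement above, the Fisher information matrix can be computed symbolically and the square root of its determinant taken; this is a sketch using sympy (my own illustration, not part of the talk):

```python
import sympy as sp

x = sp.symbols('x', real=True)
mu = sp.symbols('mu', real=True)
sigma = sp.symbols('sigma', positive=True)

# log density of N(x | mu, sigma)
logp = -sp.log(sigma) - sp.Rational(1, 2)*sp.log(2*sp.pi) - (x - mu)**2 / (2*sigma**2)
density = sp.exp(logp)
params = (mu, sigma)

# Fisher information: I_ij = -E[ d^2 log p / d theta_i d theta_j ]
I = sp.zeros(2, 2)
for i, a in enumerate(params):
    for j, b in enumerate(params):
        hess = sp.diff(logp, a, b)
        I[i, j] = sp.simplify(-sp.integrate(hess * density, (x, -sp.oo, sp.oo)))

print(I)                              # diag(1/sigma^2, 2/sigma^2)
print(sp.simplify(sp.sqrt(I.det())))  # sqrt(2)/sigma^2, i.e. Jeffreys-rule prior ∝ 1/sigma^2
```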

- For the bivariate normal example, the Jeffreys-rule prior is 1 / [σ1² σ2² (1 − ρ²)²]; it yields the natural marginal posteriors for the means and standard deviations, but results in quite inferior objective posteriors for ρ and various derived parameters (Berger and Sun, 2008).
- In p-variate normal problems, the Jeffreys-rule prior for a covariance matrix can be very bad (Stein, Yang and Berger, 1992).
- It can overwhelm the data.
  Example: Multinomial distribution: Suppose x = (x1, …, xm) is multinomial Mu(x | n; θ1, …, θm), where Σ_{i=1}^m θᵢ = 1. If the sample size n is small relative to the number of classes m, we have a large sparse table. The Jeffreys-rule prior,
    π(θ1, …, θm) ∝ Π_{i=1}^m θᵢ^{−1/2},
  is a proper prior that can overwhelm the data.

Suppose n = 3 and m = 1000, with x240 = 2, x876 = 1, and the other xᵢ = 0. The posterior means resulting from the Jeffreys prior are
    E[θᵢ | x] = (xᵢ + 1/2) / Σ_{j=1}^m (xⱼ + 1/2) = (xᵢ + 1/2) / (n + m/2) = (xᵢ + 1/2) / 503,
so E[θ240 | x] = 2.5/503, E[θ876 | x] = 1.5/503, and E[θᵢ | x] = 0.5/503 otherwise.
Thus cells 240 and 876 have total posterior probability of only 4/503 ≈ 0.008, even though all 3 observations are in these cells.
The problem is that the Jeffreys-rule prior added 1/2 to all the zero cells, making them much more important than the cells with data! Note that the uniform prior on the simplex is even worse, since it adds 1 to each cell. The prior Π_i θᵢ^{−1} adds zero to each cell, but the posterior is improper unless all cells have nonzero entries.
For specific problems there have been improvements such as the independence Jeffreys-rule prior, but such prescriptions have been ad hoc and have not led to a general alternative definition.
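A few lines of code reproduce these numbers for the hypothetical n = 3, m = 1000 table (a sketch assuming numpy):

```python
import numpy as np

m, n = 1000, 3
x = np.zeros(m)
x[239] = 2          # cell 240 (0-based index 239)
x[875] = 1          # cell 876

a_jeffreys = 0.5    # Jeffreys-rule prior adds 1/2 to every cell
post_mean = (x + a_jeffreys) / (n + m * a_jeffreys)

print(post_mean[239], post_mean[875])    # 2.5/503 and 1.5/503
print(post_mean[239] + post_mean[875])   # ~ 0.008: the observed cells get almost no mass
```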

New Approaches to Development of an Overall Prior
A. The reference distance approach
B. The hierarchical approach
  B1. Prior averaging
  B2. Prior modeling approach

A. The Reference Distance Approach: Choose a prior that yields marginal posteriors for all parameters that are close to the reference posteriors for the parameters, in an average distance sense (to be specified).
Example: Multinomial example (continued): The reference prior, when θᵢ is of interest, differs for each θᵢ. It results in a Beta reference posterior Be(θᵢ | xᵢ + 1/2, n − xᵢ + 1/2).
Goal: identify a single joint prior for θ whose marginal posteriors could be expected to be close to each of the reference posteriors just described, in some average sense.
Consider, as an overall prior, the Dirichlet Di(θ | a, …, a) distribution, having density proportional to Π_i θᵢ^{a−1}. The marginal posterior for θᵢ is then Be(θᵢ | xᵢ + a, n − xᵢ + (m − 1)a). The goal is to choose a so these are, in an average sense, close to the reference posteriors Be(θᵢ | xᵢ + 1/2, n − xᵢ + 1/2).

The recommended choice is (approximately) a = 1/m. This prior adds only 1/m = 0.001 to each cell in the earlier example; thus
    E[θᵢ | x] = (xᵢ + 1/m) / Σ_{j=1}^m (xⱼ + 1/m) = (xᵢ + 1/m) / (n + 1) = (xᵢ + 0.001) / 4,
so that E[θ240 | x] ≈ 0.5, E[θ876 | x] ≈ 0.25, and E[θᵢ | x] ≈ 1/4000 otherwise, all sensible (recall x240 = 2, x876 = 1, and the other xᵢ = 0).
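The same computation with the recommended a = 1/m (again a sketch with numpy, using the hypothetical table above) gives sensible cell estimates:

```python
import numpy as np

m, n = 1000, 3
x = np.zeros(m)
x[239], x[875] = 2, 1

a = 1.0 / m                              # recommended overall Dirichlet parameter
post_mean = (x + a) / (n + m * a)        # note m * a = 1, so the denominator is n + 1 = 4

print(post_mean[239])                    # ~ 0.50
print(post_mean[875])                    # ~ 0.25
print(post_mean[0])                      # ~ 1/4000 for each empty cell
```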

B. The Hierarchical Approach: Utilize hierarchical modeling to transfer the reference prior problem to a higher level.
B1. Prior Averaging: Starting with a collection of reference (or other) priors {πᵢ(θ), i = 1, …, k} for differing parameters or quantities of interest, use the average prior, such as
    π(θ) = (1/k) Σ_{i=1}^k πᵢ(θ).
This is hierarchical, as it coincides with giving each prior an equal prior probability of being correct and averaging out over this hyperprior.

Example: Bivariate normal example (continued): Faced with the two right-Haar priors, a natural prior to consider is their average,
    π(µ1, µ2, σ1, σ2, ρ) = 1 / [2σ1² (1 − ρ²)] + 1 / [2σ2² (1 − ρ²)].
It is shown in Sun and Berger (2007) that this prior is worse than either right-Haar prior alone, suggesting that averaging improper priors is not a good idea.
Interestingly, the geometric average of these two priors is the recommended overall prior for the bivariate normal,
    π^O(µ1, µ2, σ1, σ2, ρ) = 1 / [σ1 σ2 (1 − ρ²)],
but justification for geometric averaging is currently lacking.

Another problem with prior averaging is that there can be too many reference priors to average.
Example: Multinomial example (continued): The reference prior πᵢ(θ), when θᵢ is the parameter of interest, depends on the parameter ordering chosen in the derivation (e.g., {θᵢ, θ1, θ2, …, θm}). All choices lead to the same marginal reference posterior Be(θᵢ | xᵢ + 1/2, n − xᵢ + 1/2). In constructing an overall prior by prior averaging, each of the orderings would have to be considered, so there are m! reference priors to be averaged.
Conclusion: For the reasons indicated above, we do not recommend the prior averaging approach.

B2. Prior Modeling Approach: In this approach one
- chooses a class of proper priors π(θ | a) that reflects the desired structure of the problem;
- forms the marginal likelihood p(x | a) = ∫ p(x | θ) π(θ | a) dθ;
- finds the reference prior, π^R(a), for a in this marginal model.
Thus the overall prior becomes
    π^O(θ) = ∫ π(θ | a) π^R(a) da,
although computation is typically easier in the hierarchical formulation.

Example: Multinomial (continued): The Dirichlet Di(θ | a, …, a) class of priors is natural, reflecting the desire to treat all the θᵢ similarly. The marginal model is then
    p(x | a) = ∫ [n! / (x1! ⋯ xm!)] Π_{i=1}^m θᵢ^{xᵢ} · [Γ(ma) / Γ(a)^m] Π_{i=1}^m θᵢ^{a−1} dθ
             = [n! / (x1! ⋯ xm!)] · [Γ(ma) / Γ(a)^m] · Π_{i=1}^m Γ(xᵢ + a) / Γ(n + ma).
The reference prior π^R(a) would just be the Jeffreys-rule prior for this marginal model, and is given later. The overall prior for θ is
    π(θ) = ∫ Di(θ | a, …, a) π^R(a) da.
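Numerically, this marginal likelihood is best evaluated on the log scale; the sketch below uses scipy's gammaln, and the function name is mine:

```python
import numpy as np
from scipy.special import gammaln

def log_marginal(x, a):
    """log p(x | a) for the Dirichlet-multinomial with symmetric parameter a."""
    x = np.asarray(x, dtype=float)
    n, m = x.sum(), x.size
    log_multinom = gammaln(n + 1) - gammaln(x + 1).sum()
    return (log_multinom
            + gammaln(m * a) - m * gammaln(a)
            + gammaln(x + a).sum() - gammaln(n + m * a))

# Hypothetical sparse table from the example: n = 3, m = 1000
x = np.zeros(1000); x[239], x[875] = 2, 1
for a in (0.5, 0.1, 1e-3):
    print(a, log_marginal(x, a))
```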

Specifics of the Reference Distance Approach

Defining a distance (divergence): the intrinsic discrepancy (Bernardo and Rueda, 2002; Bernardo, 2005, 2001).
Definition 1. The intrinsic discrepancy δ{p1, p2} between two probability distributions for the random vector ψ, with densities p1(ψ), ψ ∈ Ψ1, and p2(ψ), ψ ∈ Ψ2, is
    δ{p1, p2} = min{ ∫_{Ψ1} p1(ψ) log[p1(ψ)/p2(ψ)] dψ, ∫_{Ψ2} p2(ψ) log[p2(ψ)/p1(ψ)] dψ },
assuming that at least one of the integrals exists.
The (non-symmetric) Kullback-Leibler logarithmic divergence,
    κ{p1 | p2} = ∫_{Ψ2} p2(ψ) log[p2(ψ)/p1(ψ)] dψ,
is another reasonable choice in scenarios where there is a true distribution p2(ψ) (and is usually equivalent to the intrinsic discrepancy).

The exact solution scenario: If a prior π^O(θ) yields marginal posteriors that are equal to the reference posteriors for each of the quantities of interest, then the resulting intrinsic discrepancies are zero and π^O(θ) is a natural choice for the overall prior.
Example: Univariate normal distribution: For the N(xᵢ | µ, σ) distribution, suppose µ and σ are the quantities of interest; π^O(µ, σ) = σ^{−1} is the reference prior when either µ or σ is the quantity of interest; hence π^O is an optimal overall prior.
Suppose, in addition to µ and σ, the centrality parameter θ = µ/σ is also a quantity of interest. The reference prior for θ is (Bernardo, 1979)
    π_θ(θ, σ) = (1 + θ²/2)^{−1/2} σ^{−1};
this yields different marginal posteriors than does π^O(µ, σ) = σ^{−1}; hence we would not have an exact solution.

General (proper) situation: Suppose the model is p(x | ω) and the quantities of interest are {θ1, …, θm}, with proper reference priors {πᵢ^R(ω)}_{i=1}^m.
- {πᵢ^R(θᵢ | x)}_{i=1}^m are the corresponding marginal reference posteriors.
- pᵢ^R(x) = ∫_Ω p(x | ω) πᵢ^R(ω) dω are the corresponding (proper) marginal densities or prior predictives.
- {wᵢ}_{i=1}^m are weights giving the importance of each quantity of interest.
- A family of priors F = {π(ω | a), a ∈ A} is considered.
The best overall prior within F is defined to be that which minimizes, over a ∈ A, the average expected intrinsic loss
    d(a) = Σ_{i=1}^m wᵢ ∫_X δ{πᵢ^R(· | x), πᵢ(· | x, a)} pᵢ^R(x) dx.
Big issue: When the reference priors are not proper (the usual case), there is no assurance that d(a) is finite. There is no clear way to proceed otherwise, so we are studying whether d(a) is often finite in the improper case.

Example: Multinomial model: Consider the multinomial model with m cells and parameters {θ1, …, θm}, with Σ_{i=1}^m θᵢ = 1. We seek the Di(θ | a, …, a) prior that minimizes the average expected intrinsic loss.
- The reference posterior for each of the θᵢ's is Be(θᵢ | xᵢ + 1/2, n − xᵢ + 1/2).
- The marginal posterior of θᵢ for the Dirichlet prior is Be(θᵢ | xᵢ + a, n − xᵢ + (m − 1)a).
- The intrinsic discrepancy between these marginal posteriors is
    δᵢ{a | x, m, n} = δ_Be{xᵢ + 1/2, n − xᵢ + 1/2; xᵢ + a, n − xᵢ + (m − 1)a},
  where
    δ_Be{a1, β1; a2, β2} = min[ κ_Be{a2, β2 | a1, β1}, κ_Be{a1, β1 | a2, β2} ],
    κ_Be{a2, β2 | a1, β1} = ∫_0^1 Be(θ | a1, β1) log[ Be(θ | a1, β1) / Be(θ | a2, β2) ] dθ
                          = log[ Γ(a1 + β1) Γ(a2) Γ(β2) / (Γ(a2 + β2) Γ(a1) Γ(β1)) ]
                            + (a1 − a2) ψ(a1) + (β1 − β2) ψ(β1) − ((a1 + β1) − (a2 + β2)) ψ(a1 + β1),
  and ψ(·) is the digamma function.

The discrepancy δᵢ{a | xᵢ, m, n} between the two posteriors of θᵢ depends on the data only through xᵢ, and the reference predictive for xᵢ is
    p(xᵢ | n) = ∫_0^1 Bi(xᵢ | n, θᵢ) Be(θᵢ | 1/2, 1/2) dθᵢ = (1/π) Γ(xᵢ + 1/2) Γ(n − xᵢ + 1/2) / [Γ(xᵢ + 1) Γ(n − xᵢ + 1)],
because the sampling distribution of xᵢ is Bi(xᵢ | n, θᵢ) and the marginal reference prior for θᵢ is πᵢ(θᵢ) = Be(θᵢ | 1/2, 1/2).
Noting that each θᵢ yields the same expected loss, the average expected intrinsic loss is
    d(a | m, n) = Σ_{x=0}^n δ{a | x, m, n} p(x | n).
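Putting the pieces together, d(a | m, n) can be evaluated directly and minimized over a; the following sketch (assuming numpy/scipy, with helper names of my choosing) reproduces the behaviour summarized in Figure 1, where the minimizing a is close to 1/m:

```python
import numpy as np
from scipy.special import gammaln, digamma
from scipy.optimize import minimize_scalar

def kl_beta(a1, b1, a2, b2):
    """KL divergence kappa{a2, b2 | a1, b1} = KL( Be(a1, b1) || Be(a2, b2) )."""
    return (gammaln(a2) + gammaln(b2) - gammaln(a2 + b2)
            - gammaln(a1) - gammaln(b1) + gammaln(a1 + b1)
            + (a1 - a2) * digamma(a1) + (b1 - b2) * digamma(b1)
            - (a1 + b1 - a2 - b2) * digamma(a1 + b1))

def intrinsic_beta(a1, b1, a2, b2):
    """Intrinsic discrepancy: the smaller of the two KL directions."""
    return min(kl_beta(a1, b1, a2, b2), kl_beta(a2, b2, a1, b1))

def ref_predictive(x, n):
    """p(x | n): Beta-binomial(n, 1/2, 1/2) reference predictive for one cell."""
    return np.exp(gammaln(x + 0.5) + gammaln(n - x + 0.5)
                  - gammaln(x + 1.0) - gammaln(n - x + 1.0)) / np.pi

def d(a, m, n):
    """Average expected intrinsic loss of the Di(a, ..., a) prior."""
    xs = np.arange(n + 1)
    losses = [intrinsic_beta(x + 0.5, n - x + 0.5, x + a, n - x + (m - 1) * a) for x in xs]
    return float(np.dot(losses, ref_predictive(xs, n)))

for m, n in [(10, 25), (100, 25)]:
    res = minimize_scalar(d, bounds=(1e-4, 1.0), args=(m, n), method='bounded')
    print(m, n, res.x)   # close to 1/m (0.091 and 0.0085 are the values reported for n = 25)
```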

Figure 1: Expected intrinsic losses d(a | m, n) of using a Dirichlet prior with parameter {a, …, a} in a multinomial model with m cells, for sample sizes n = 5, 10, 25, 100 and 500. Left panel, m = 10; right panel, m = 100. In both cases, the optimal value for all sample sizes is a ≈ 1/m. (Exact optimal values for n = 25 are 0.091 and 0.0085.)

Specifics of the Prior Modeling Approach
- Multinomial example
- Bivariate normal example

Example: Multinomial (continued): The Dirichlet Di(θ | a, …, a) class of priors is natural, reflecting the desire to treat all the θᵢ similarly. The marginal model is then
    p(x | a) = ∫ [n! / (x1! ⋯ xm!)] Π_{i=1}^m θᵢ^{xᵢ} · [Γ(ma) / Γ(a)^m] Π_{i=1}^m θᵢ^{a−1} dθ
             = [n! / (x1! ⋯ xm!)] · [Γ(ma) / Γ(a)^m] · Π_{i=1}^m Γ(xᵢ + a) / Γ(n + ma).
The reference prior π^R(a) would just be the Jeffreys-rule prior for this marginal model, and is given later. The overall prior for θ is
    π(θ) = ∫ Di(θ | a, …, a) π^R(a) da.

Derivation of π^R(a): p(x | a) is a regular one-parameter model, so the reference prior is the Jeffreys-rule prior. The marginal (predictive) density of any of the xᵢ's is
    p1(xᵢ | a, m, n) = [n! / (xᵢ! (n − xᵢ)!)] Γ(xᵢ + a) Γ(n − xᵢ + (m − 1)a) Γ(ma) / [Γ(a) Γ((m − 1)a) Γ(n + ma)].
Computation yields
    π^R(a | m, n) ∝ { Σ_{j=0}^{n−1} [ Q(j | a, m, n) / (a + j)² − m / (ma + j)² ] }^{1/2},
where Q(j | a, m, n) = Σ_{l=j+1}^n p1(l | a, m, n), for j = 0, …, n − 1.

π^R(a) can be shown to be a proper prior. Why did that happen? It can be shown that
    p(x | a) = O(a^{r−1}) as a → 0,   and   p(x | a) → [n! / (x1! ⋯ xm!)] m^{−n} as a → ∞,
where r is the number of nonzero xᵢ. Thus the likelihood is constant at ∞, so the prior must be proper at infinity for the posterior to exist.
It can be shown that, for sparse tables, where m/n is relatively large, the reference prior is well approximated by the proper prior
    π*(a | m, n) = [n / (2m)] a^{−1/2} (a + n/m)^{−3/2}.
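A direct numerical evaluation of π^R(a | m, n) and of the approximation π*(a | m, n), in the spirit of Figure 2, might look as follows (a sketch; the grid, truncation and normalization choices are mine):

```python
import numpy as np
from scipy.special import gammaln

def log_p1(l, a, m, n):
    """log p1(l | a, m, n): the Beta-binomial(n, a, (m-1)a) marginal of one cell."""
    return (gammaln(n + 1) - gammaln(l + 1) - gammaln(n - l + 1)
            + gammaln(l + a) + gammaln(n - l + (m - 1) * a) + gammaln(m * a)
            - gammaln(a) - gammaln((m - 1) * a) - gammaln(n + m * a))

def ref_prior_unnorm(a, m, n):
    """pi^R(a | m, n) up to a normalizing constant."""
    l = np.arange(n + 1)
    p1 = np.exp(log_p1(l, a, m, n))
    Q = np.cumsum(p1[::-1])[::-1][1:]          # Q(j) = sum_{l > j} p1(l), j = 0..n-1
    j = np.arange(n)
    s = np.sum(Q / (a + j) ** 2 - m / (m * a + j) ** 2)
    return np.sqrt(max(s, 0.0))                # guard against tiny negative rounding error

def approx_prior(a, m, n):
    """Proper approximation (n/2m) a^{-1/2} (a + n/m)^{-3/2} for sparse tables."""
    c = n / m
    return 0.5 * c / (np.sqrt(a) * (a + c) ** 1.5)

m, n = 150, 10
grid = np.linspace(0.005, 2.0, 400)
step = grid[1] - grid[0]
exact = np.array([ref_prior_unnorm(a, m, n) for a in grid])
exact /= exact.sum() * step                    # normalize both curves on the truncated grid
approx = approx_prior(grid, m, n)
approx /= approx.sum() * step
for k in (0, 20, 60, 150, 399):                # the two curves should be close (sparse table)
    print(f"a = {grid[k]:.3f}   exact ~ {exact[k]:.3f}   approx ~ {approx[k]:.3f}")
```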

Figure 2: Reference priors π^R(a | m, n) (solid lines) and their approximations (dotted lines) for (m = 150, n = 10) (upper curve) and (m = 500, n = 10) (lower curve).

Computation with the hierarchical reference prior:
1. The obvious MCMC sampler is:
   Step 1. Use a Metropolis-Hastings move to sample from the marginal posterior π^R(a | x) ∝ π^R(a) p(x | a).
   Step 2. Given a, sample θ from the conjugate Dirichlet posterior π(θ | a, x).
2. The empirical Bayes approximation is to fix a at its posterior mode â_R, which exists and is nonzero if r ≥ 2. Using the ordinary empirical Bayes estimate from maximizing p(x | a) is problematical, since the likelihood does not go to zero at ∞. For instance, if all xᵢ = 1, the likelihood p(x | a) is increasing in a.
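A minimal version of that sampler might look as follows (a sketch; the random-walk proposal on log a and the use of the approximate prior π* in place of the exact π^R are my choices, not prescribed by the talk):

```python
import numpy as np
from scipy.special import gammaln

def log_post_a(a, x, n, m):
    """log pi*(a) + log p(x | a), using the approximate proper reference prior pi*."""
    c = n / m
    log_prior = np.log(0.5 * c) - 0.5 * np.log(a) - 1.5 * np.log(a + c)
    log_lik = gammaln(m * a) - m * gammaln(a) + gammaln(x + a).sum() - gammaln(n + m * a)
    return log_prior + log_lik

def hierarchical_sampler(x, n_iter=2000, seed=0):
    x = np.asarray(x, dtype=float)
    n, m = x.sum(), x.size
    rng = np.random.default_rng(seed)
    a = 1.0 / m                                   # starting value
    a_draws, theta_draws = [], []
    for _ in range(n_iter):
        # Step 1: Metropolis-Hastings on log(a) with a random-walk proposal
        prop = a * np.exp(0.3 * rng.normal())
        log_ratio = (log_post_a(prop, x, n, m) - log_post_a(a, x, n, m)
                     + np.log(prop) - np.log(a))  # Jacobian of the log transform
        if np.log(rng.uniform()) < log_ratio:
            a = prop
        # Step 2: given a, the conditional posterior of theta is Dirichlet(x + a)
        theta = rng.dirichlet(x + a)
        a_draws.append(a)
        theta_draws.append(theta)
    return np.array(a_draws), np.array(theta_draws)

x = np.zeros(1000); x[239], x[875] = 2, 1         # the hypothetical sparse table again
a_draws, theta_draws = hierarchical_sampler(x)
print(a_draws.mean(), theta_draws[:, 239].mean())  # posterior means of a and of theta_240
```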

Asymptotic posterior mode as m and n go to ∞ with n/m → 0:
    â ≈ (r − 1.5) / (m log n)     if r/n → 0,
    â ≈ c n/m                     if r/n → k with 0 < k < 1, where c solves c log(1 + 1/c) = k,
    â ≈ n² / [2m(n − r)]          if r/n → 1 and (n − r)² ≪ n,
where r is the number of nonzero xᵢ. While â is of O(1/m), it also depends on r and n. For instance, suppose r = n/2 (i.e., there are n/2 nonzero entries); then â ≈ 0.40 n/m.
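Under this reading of the middle case, the constant for the r = n/2 example can be checked numerically (a small sketch using scipy):

```python
from math import log
from scipy.optimize import brentq

# Solve c * log(1 + 1/c) = 1/2, the r/n = 1/2 case used in the example above
c = brentq(lambda c: c * log(1 + 1 / c) - 0.5, 1e-6, 10.0)
print(c)   # ~ 0.40, so a_hat ~ 0.40 * n / m
```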

Example: Bivariate normal (continued): There is actually a continuum of right-Haar priors, given as follows. For the orthogonal (rotation) matrix
    Γ = ( cos β   sin β
         −sin β   cos β ),   −π/2 < β ≤ π/2,
the right-Haar prior based on the transformed data ΓX is
    π(µ1, µ2, σ1, σ2, ρ | β) = [ sin²(β) σ1² + cos²(β) σ2² + 2 sin(β) cos(β) ρ σ1 σ2 ] / [ σ1² σ2² (1 − ρ²) ].
We thus have a class of priors indexed by a hyperparameter β. The natural prior distribution on β is the (proper) uniform distribution (being uniform over the set of rotations is natural). The resulting prior is
    π^O(µ1, µ2, σ1, σ2, ρ) = (1/π) ∫_{−π/2}^{π/2} π(µ1, µ2, σ1, σ2, ρ | β) dβ = (1/2) (1/σ1² + 1/σ2²) · 1/(1 − ρ²),
the same bad prior as the average of the original two right-Haar priors.
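The averaging over β can be verified symbolically; this is a quick sympy check of the integral above (my own illustration):

```python
import sympy as sp

beta = sp.symbols('beta', real=True)
s1, s2 = sp.symbols('sigma1 sigma2', positive=True)
rho = sp.symbols('rho', real=True)

# right-Haar prior indexed by the rotation angle beta
num = sp.sin(beta)**2 * s1**2 + sp.cos(beta)**2 * s2**2 + 2*sp.sin(beta)*sp.cos(beta)*rho*s1*s2
prior_beta = num / (s1**2 * s2**2 * (1 - rho**2))

# average over the uniform distribution on (-pi/2, pi/2]
avg = sp.simplify(sp.integrate(prior_beta, (beta, -sp.pi/2, sp.pi/2)) / sp.pi)
print(avg)   # (sigma1**2 + sigma2**2) / (2*sigma1**2*sigma2**2*(1 - rho**2))
             # = (1/2) * (1/sigma1**2 + 1/sigma2**2) / (1 - rho**2)
```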

Empirical hierarchical approach: Find the empirical Bayes estimate β̂ and use π(µ1, µ2, σ1, σ2, ρ | β̂) as the overall prior. This was shown in Sun and Berger (2007) to result in a terrible overall prior, much worse than either of the individual reference priors or even the bad prior average.

Summary
- There is an important need for overall objective priors for models.
- The reference distance approach is natural, and seems to work well when reference priors are proper. It is unclear whether the reference distance approach can be used when the reference priors are improper.
- The prior averaging approach is not recommended when the reference priors are improper, and can be computationally difficult even when they are proper.
- The prior modeling approach seems excellent (as usual), and is recommended if one can find a natural class of proper priors to initiate the hierarchical analysis.
- The failure of the hierarchical approach for the right-Haar priors in the bivariate normal example was dramatic, suggesting that using improper priors at the bottom level of a hierarchy is a bad idea.

Thanks!