Module 22: Bayesian Methods Lecture 9 A: Default prior selection

Peter Hoff, Departments of Statistics and Biostatistics, University of Washington

Outline: Jeffreys prior; unit information priors; empirical Bayes priors.

Independent binary sequence. Suppose researcher A has data of the following type: M_A: y_1, ..., y_n i.i.d. binary(θ), θ ∈ [0, 1]. A asks you to do a Bayesian analysis, but either doesn't have any prior information about θ or wants you to obtain objective Bayesian inference for θ. You need to come up with some prior π_A(θ) to use for this analysis.

Independent binary sequence. Suppose researcher B has data of the following type: M_B: y_1, ..., y_n i.i.d. binary(e^γ/(1 + e^γ)), γ ∈ (-∞, ∞). B asks you to do a Bayesian analysis, but either doesn't have any prior information about γ or wants you to obtain objective Bayesian inference for γ. You need to come up with some prior π_B(γ) to use for this analysis.

Prior generating procedures. Suppose we have a procedure for generating priors from models: Procedure(M) → π. Applying the procedure to model M_A should generate a prior for θ: Procedure(M_A) → π_A(θ). Applying the procedure to model M_B should generate a prior for γ: Procedure(M_B) → π_B(γ). What should the relationship between π_A and π_B be?

Induced priors. Note that a prior π_A(θ) over θ induces a prior π_A(γ) over γ = log(θ/(1-θ)). This induced prior can be obtained via calculus or via simulation.

Induced priors.

theta <- rbeta(5000, 1, 1)
gamma <- log(theta / (1 - theta))

[Figure: histograms of the 5000 simulated values; the left panel shows θ, roughly uniform over [0, 1], and the right panel shows the induced γ = log(θ/(1-θ)), spread over roughly (-10, 10).]
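To connect the two routes (calculus and simulation): for θ ~ beta(1, 1), i.e. uniform, the change-of-variables formula gives the induced density of γ = log(θ/(1-θ)) as e^γ/(1 + e^γ)², the standard logistic density. A minimal R check of this (my own addition, reusing theta and gamma from the code above):

hist(gamma, breaks = 50, freq = FALSE, xlim = c(-10, 10),
     main = "induced prior on gamma", xlab = expression(gamma))
curve(dlogis(x), from = -10, to = 10, add = TRUE)  # analytic induced density; matches the histogram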

Internally consistent procedures. This fact creates a small conundrum. We could generate a prior for γ via the induced prior on θ: Procedure(M_A) → π_A(θ) → π_A(γ). Alternatively, a prior for γ could be obtained directly from M_B: Procedure(M_B) → π_B(γ). Both π_A(γ) and π_B(γ) are obtained from the Procedure. Which one should we use?

Jeffreys principle. Jeffreys (1949) says that any default Procedure should be internally consistent in the sense that the two priors on γ should be the same. More generally, his principle states that if M_B is a reparameterization of M_A, then π_A(γ) = π_B(γ). Of course, all of this logic applies to the model in terms of θ as well: Procedure(M_A) → π_A(θ); Procedure(M_B) → π_B(γ) → π_B(θ); and the principle requires π_A(θ) = π_B(θ).

Jeffreys prior. It turns out that Jeffreys principle leads to a unique Procedure:

π_J(θ) ∝ √( E[ (d log p(y|θ) / dθ)² ] ),

the square root of the Fisher information. Example: binomial/binary model. If y_1, ..., y_n i.i.d. binary(θ), then π_J(θ) ∝ θ^(-1/2) (1-θ)^(-1/2). We recognize this prior as a beta(1/2, 1/2) distribution: θ ~ beta(1/2, 1/2). Default Bayesian inference is then based on the following posterior: θ | y_1, ..., y_n ~ beta(1/2 + Σ y_i, 1/2 + Σ (1 - y_i)).
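A minimal R sketch of default inference under the Jeffreys beta(1/2, 1/2) prior (the data here are simulated for illustration and are not from the slides):

set.seed(1)
y <- rbinom(20, size = 1, prob = 0.3)      # hypothetical binary data
a_post <- 1/2 + sum(y)                     # posterior: beta(1/2 + sum(y), 1/2 + sum(1 - y))
b_post <- 1/2 + sum(1 - y)
a_post / (a_post + b_post)                 # posterior mean
qbeta(c(.025, .975), a_post, b_post)       # 95% posterior (credible) interval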

Jeffreys prior. Example: Poisson model. If y_1, ..., y_n i.i.d. Poisson(θ), then π_J(θ) ∝ 1/√θ. Recall that our conjugate prior for θ in this case was a gamma(a, b) density (with rate b): π(θ | a, b) ∝ θ^(a-1) e^(-bθ). For the Poisson model and gamma prior, θ ~ gamma(a, b) gives θ | y_1, ..., y_n ~ gamma(a + Σ y_i, b + n). What about under the Jeffreys prior? π_J(θ) looks like a gamma distribution with (a, b) = (1/2, 0). It follows that under π_J, θ | y_1, ..., y_n ~ gamma(1/2 + Σ y_i, n). (Note: π_J is not an actual gamma density; it is not a probability density at all!)
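A minimal R sketch of the Jeffreys-prior posterior for the Poisson model, using the rate parameterization of the gamma (data simulated for illustration, not from the slides):

set.seed(1)
n <- 30
y <- rpois(n, lambda = 4)                         # hypothetical count data
a_post <- 1/2 + sum(y)                            # posterior: gamma(1/2 + sum(y), n)
a_post / n                                        # posterior mean
qgamma(c(.025, .975), shape = a_post, rate = n)   # 95% credible interval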

Jeffreys prior. Example: normal model. If y_1, ..., y_n i.i.d. normal(µ, σ²), then π_J(µ, σ²) ∝ 1/σ² (this is a particular version of Jeffreys prior for multiparameter problems). It is very interesting to note that the resulting posterior for µ satisfies (µ - ȳ)/(s/√n) | y_1, ..., y_n ~ t_(n-1). This means that a 95% objective Bayesian confidence interval for µ is ȳ ± t_(.975, n-1) s/√n, exactly the same as the usual t confidence interval for a normal mean.
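A minimal R sketch computing this interval and checking that it matches the standard t interval (simulated data, not from the slides):

set.seed(1)
y <- rnorm(25, mean = 10, sd = 3)                            # hypothetical normal data
n <- length(y)
mean(y) + c(-1, 1) * qt(.975, df = n - 1) * sd(y) / sqrt(n)  # objective Bayes 95% interval
t.test(y)$conf.int                                           # identical to the interval above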

Notes on Jeffreys prior. 1. Jeffreys principle leads to Jeffreys prior. 2. Jeffreys prior isn't always a proper prior distribution. 3. Improper priors can lead to proper posteriors. These often lead to Bayesian interpretations of frequentist procedures.

Data-based priors. Recall from the binary/beta analysis: if θ ~ beta(a, b) and y_1, ..., y_n | θ i.i.d. binary(θ), then θ | y_1, ..., y_n ~ beta(a + Σ y_i, b + Σ (1 - y_i)). Under this posterior,

E[θ | y_1, ..., y_n] = (a + Σ y_i)/(a + b + n) = [(a + b)/(a + b + n)] × a/(a + b) + [n/(a + b + n)] × ȳ,

where a/(a + b) is the prior guess at what θ is and a + b is the confidence in that guess.

Data-based priors. We may be reluctant to guess at what θ is. Wouldn't ȳ be better than a guess? Idea: set a/(a + b) = ȳ. Problem: this is cheating! Using ȳ for your prior misrepresents the amount of information you have. Solution: cheat as little as possible: set a/(a + b) = ȳ and a + b = 1. This implies a = ȳ, b = 1 - ȳ. The amount of cheating has the information content of only one observation.
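A minimal R sketch of the unit information beta prior for binary data and the resulting posterior (simulated data, not from the slides):

set.seed(1)
y <- rbinom(40, size = 1, prob = 0.3)     # hypothetical binary data
n <- length(y)
a <- mean(y); b <- 1 - mean(y)            # unit information prior: a/(a+b) = ybar, a + b = 1
a_post <- a + sum(y); b_post <- b + sum(1 - y)
a_post / (a_post + b_post)                # posterior mean
qbeta(c(.025, .975), a_post, b_post)      # 95% credible interval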

Unit information principle. If you don't have prior information about θ, then 1. obtain an MLE/OLS estimator θ̂ of θ; 2. make the prior π(θ) weakly centered around θ̂, with the information equivalent of one observation. Again, such a prior leads to double use of the information in your sample. However, the amount of cheating is small and decreases with n.

Poisson example: y_1, ..., y_n i.i.d. Poisson(θ). Under the gamma(a, b) prior,

E[θ | y_1, ..., y_n] = (a + Σ y_i)/(b + n) = [b/(b + n)] × (a/b) + [n/(b + n)] × ȳ.

Unit information prior: a/b = ȳ and b = 1, so (a, b) = (ȳ, 1).
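A corresponding R sketch for the Poisson unit information prior (simulated counts, not from the slides):

set.seed(1)
y <- rpois(30, lambda = 4)                # hypothetical count data
n <- length(y)
a <- mean(y); b <- 1                      # unit information prior: a/b = ybar, b = 1
(a + sum(y)) / (b + n)                    # posterior mean
qgamma(c(.025, .975), shape = a + sum(y), rate = b + n)   # 95% credible interval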

Comparison to Jeffreys prior. [Figure: two panels plotting, against sample size n (20 to 100), the CI width (left) and the CI coverage probability (right, roughly 0.93 to 0.95) for intervals based on the Jeffreys prior (j) and the unit information prior (u).]
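A hedged R sketch of the kind of simulation behind such a comparison, here for the Poisson model with settings of my own choosing (θ = 4, 2000 replicates); the slide's figure may use a different model and settings:

set.seed(1)
theta <- 4; nsim <- 2000
ci_stats <- function(n) {
  width <- cover <- matrix(NA, nsim, 2, dimnames = list(NULL, c("jeffreys", "unit.info")))
  for (s in 1:nsim) {
    y <- rpois(n, theta)
    ab <- rbind(c(1/2, 0), c(mean(y), 1))   # prior (shape, rate): Jeffreys gamma(1/2, 0); UI gamma(ybar, 1)
    for (k in 1:2) {
      ci <- qgamma(c(.025, .975), shape = ab[k, 1] + sum(y), rate = ab[k, 2] + n)
      width[s, k] <- diff(ci)
      cover[s, k] <- (ci[1] <= theta && theta <= ci[2])
    }
  }
  rbind(width = colMeans(width), coverage = colMeans(cover))
}
ci_stats(20)   # repeat for n = 40, 60, 80, 100 to trace out curves like those in the figure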

Notes on UI priors. 1. UI priors weakly concentrate around a data-based estimator. 2. Inference under UI priors is anti-conservative, but this bias decreases with n. 3. They can be used in multiparameter settings and are related to BIC.

Normal means problem. Task: estimate θ = (θ_1, ..., θ_p). An odd problem: y_j = θ_j + ε_j, with ε_1, ..., ε_p i.i.d. normal(0, 1). What does estimation of θ_j have to do with estimation of θ_k? There is only one observation y_j per parameter θ_j - how well can we do? Where the problem comes from: comparison of two groups A and B on p variables (e.g. expression levels). For each variable j, construct a two-sample t-statistic y_j = (x̄_{A,j} - x̄_{B,j})/(s_j/√n). For each j, y_j is approximately normal with mean θ_j = √n (µ_{A,j} - µ_{B,j})/σ_j and variance 1.
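A minimal R sketch constructing such approximately normal(θ_j, 1) statistics from simulated two-group data (my own hypothetical settings, not from the slides; here s_j is taken so that s_j/√n is the standard error of the mean difference, one convention consistent with the variance-1 claim above):

set.seed(1)
p <- 100; n <- 20                                # p variables, n observations per group
mu_diff <- c(rep(2, 10), rep(0, p - 10))         # a few variables truly differ between groups
xA <- matrix(rnorm(n * p, mean = mu_diff), n, p, byrow = TRUE)
xB <- matrix(rnorm(n * p, mean = 0), n, p, byrow = TRUE)
s <- sqrt(apply(xA, 2, var) + apply(xB, 2, var)) # s_j such that s_j/sqrt(n) is the SE of the difference
y <- (colMeans(xA) - colMeans(xB)) / (s / sqrt(n))   # approximately normal(theta_j, 1), one per variable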

Normal means problem. y_j = θ_j + ε_j, with ε_1, ..., ε_p i.i.d. normal(0, 1). One obvious estimator of θ = (θ_1, ..., θ_p) is y = (y_1, ..., y_p): y is the MLE; y is unbiased and the UMVUE. However, it turns out that y is not so great in terms of risk: R(y, θ) = E[ Σ_{j=1}^p (y_j - θ_j)² ]. When p > 2 we can find an estimator that beats y for every value of θ, and is much better when p is large. This estimator has been referred to as an empirical Bayes estimator.

Bayesian normal means problem. y_j = θ_j + ε_j, with ε_1, ..., ε_p i.i.d. normal(0, 1). Consider the following prior on θ: θ_1, ..., θ_p i.i.d. normal(0, τ²). Under this prior, θ̂_j = E[θ_j | y_1, ..., y_p] = [τ²/(τ² + 1)] y_j. This is a type of shrinkage prior: it shrinks the estimates towards zero, away from y_j, and it is particularly good if many of the true θ_j's are very small or zero.

Empirical Bayes. θ̂_j = [τ²/(τ² + 1)] y_j. We might know we want to shrink towards zero. We might not know the appropriate amount of shrinkage. Solution: estimate τ² from the data! If y_j = θ_j + ε_j with ε_j ~ N(0, 1) and θ_j ~ N(0, τ²), then marginally y_j ~ N(0, τ² + 1). We should have Σ y_j² ≈ p(τ² + 1), i.e. Σ y_j²/p - 1 ≈ τ². Idea: use τ̂² = Σ y_j²/p - 1 for the shrinkage estimator. Modification: use τ̂² = Σ y_j²/(p - 2) - 1 for the shrinkage estimator.

James-Stein estimation. θ̂_j = [τ̂²/(τ̂² + 1)] y_j, with τ̂² = Σ y_j²/(p - 2) - 1. It has been shown theoretically that, from a non-Bayesian perspective, this estimator beats y in terms of risk for all θ: R(θ̂, θ) < R(y, θ) for all θ. Also, from a Bayesian perspective, this estimator is almost as good as the optimal Bayes estimator under a known τ².
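A minimal R sketch of this empirical Bayes / James-Stein estimator together with a quick risk comparison against the MLE y (simulated θ and settings of my own, not from the slides):

set.seed(1)
p <- 50; nsim <- 1000
theta <- rnorm(p, 0, 1)                            # one hypothetical truth, held fixed
risk_mle <- risk_js <- numeric(nsim)
for (s in 1:nsim) {
  y <- theta + rnorm(p)                            # y_j ~ normal(theta_j, 1)
  tau2_hat <- sum(y^2) / (p - 2) - 1               # EB estimate of tau^2
  shrink <- tau2_hat / (tau2_hat + 1)              # equals 1 - (p - 2)/sum(y^2); often truncated at 0 in practice
  theta_js <- shrink * y                           # James-Stein / EB estimate
  risk_mle[s] <- sum((y - theta)^2)
  risk_js[s] <- sum((theta_js - theta)^2)
}
c(MLE = mean(risk_mle), James.Stein = mean(risk_js))   # the JS risk should be noticeably smaller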

Comparison of risks. [Figure: Bayes risk (0 to 1) plotted against τ² (0 to 10) for p ∈ {3, 5, 10, 20}. The Bayes risk of the JSE is between that of the MLE y and that of the Bayes estimator.]

Empirical Bayes in general. Model: p(y | θ), θ ∈ Θ. Prior class: π(θ | ψ), ψ ∈ Ψ. What value of ψ to choose? Empirical Bayes: 1. obtain the marginal likelihood p(y | ψ) = ∫ p(y | θ) π(θ | ψ) dθ; 2. find an estimator ψ̂ based on p(y | ψ); 3. use the prior π(θ | ψ̂).
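A hedged R sketch of this general recipe for grouped binomial data (my own hypothetical setup, not from the slides): y_j | θ_j ~ binomial(m, θ_j) with θ_j ~ beta(a, b), and ψ = (a, b) is estimated by maximizing the marginal (beta-binomial) likelihood.

set.seed(1)
J <- 50; m <- 25
theta <- rbeta(J, 4, 8)                       # hypothetical true group-level rates
y <- rbinom(J, size = m, prob = theta)
# step 1: marginal log-likelihood of psi = (a, b), with theta integrated out
neg_marg_loglik <- function(log_ab) {
  a <- exp(log_ab[1]); b <- exp(log_ab[2])
  -sum(lchoose(m, y) + lbeta(a + y, b + m - y) - lbeta(a, b))
}
# step 2: estimate psi-hat by maximizing the marginal likelihood
fit <- optim(c(0, 0), neg_marg_loglik)
a_hat <- exp(fit$par[1]); b_hat <- exp(fit$par[2])
# step 3: use the prior beta(a_hat, b_hat); each group's posterior mean is then
(a_hat + y) / (a_hat + b_hat + m)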

Notes on empirical Bayes. 1. Empirical Bayes procedures are obtained by estimating hyperparameters from the data. 2. Often these procedures behave well from both Bayesian and frequentist perspectives. 3. They work best when the number of parameters is large and the hyperparameters are distinguishable.