A Very Brief Summary of Bayesian Inference, and Examples

Trinity Term 2009
Prof. Gesine Reinert

Our starting point is the data $x = (x_1, x_2, \ldots, x_n)$, which we view as realisations of random variables $X_1, X_2, \ldots, X_n$ with distribution (model) $f(x_1, x_2, \ldots, x_n \mid \theta)$. In the Bayesian framework, $\theta \in \Theta$ is random, and follows a prior distribution $\pi(\theta)$.

1 Priors, Posteriors, Likelihood, and Sufficiency

The posterior distribution of $\theta$ given $x$ is
$$\pi(\theta \mid x) = \frac{f(x \mid \theta)\,\pi(\theta)}{\int f(x \mid \theta)\,\pi(\theta)\,d\theta}.$$
Abbreviating, we write: posterior $\propto$ prior $\times$ likelihood.

The (prior) predictive distribution of the data $x$ on the basis of $\pi$ is
$$p(x) = \int f(x \mid \theta)\,\pi(\theta)\,d\theta.$$
Suppose data $x_1$ is available, and we want to predict additional data: the predictive distribution is
$$p(x \mid x_1) = \int f(x \mid \theta)\,\pi(\theta \mid x_1)\,d\theta.$$
Note that $x$ and $x_1$ are assumed conditionally independent given $\theta$. They are not, in general, unconditionally independent.
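To make the normalising integral in Bayes' formula concrete, here is a minimal numerical sketch (assuming Python with NumPy and SciPy; the Binomial likelihood with a Beta(2, 2) prior and the numbers $n = 20$, $x = 13$ are illustrative choices, not part of these notes). It evaluates prior $\times$ likelihood on a grid and divides by the prior predictive $p(x)$.

```python
import numpy as np
from scipy import stats

# Illustrative setup (not from the notes): Binomial likelihood, Beta(2, 2) prior.
n, x_obs = 20, 13                       # sample size and observed number of successes
theta = np.linspace(0.001, 0.999, 999)  # grid over the parameter space Theta = (0, 1)

prior = stats.beta.pdf(theta, 2, 2)             # pi(theta)
likelihood = stats.binom.pmf(x_obs, n, theta)   # f(x | theta)

# Posterior: prior times likelihood, normalised by the (prior) predictive p(x).
unnormalised = prior * likelihood
p_x = np.trapz(unnormalised, theta)             # p(x) = int f(x|theta) pi(theta) dtheta
posterior = unnormalised / p_x

print("prior predictive p(x):", p_x)
print("posterior mean       :", np.trapz(theta * posterior, theta))
```

Replacing $\pi(\theta)$ by $\pi(\theta \mid x_1)$ in the same integral gives the predictive distribution $p(x \mid x_1)$ for additional data.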

Example. Suppose $y_1, y_2, \ldots, y_n$ are independent normally distributed random variables, each with variance 1 and with means $\beta x_1, \ldots, \beta x_n$, where $\beta$ is an unknown real-valued parameter and $x_1, x_2, \ldots, x_n$ are known constants. Suppose as prior $\pi(\beta) = \mathcal{N}(\mu, \alpha^2)$. Then the likelihood is
$$L(\beta) = (2\pi)^{-n/2} \prod_{i=1}^n e^{-(y_i - \beta x_i)^2/2}$$
and
$$\pi(\beta) = \frac{1}{\sqrt{2\pi}\,\alpha}\, e^{-\frac{(\beta - \mu)^2}{2\alpha^2}} \propto \exp\left\{-\frac{\beta^2}{2\alpha^2} + \frac{\beta\mu}{\alpha^2}\right\}.$$
The posterior of $\beta$ given $y_1, y_2, \ldots, y_n$ can be calculated as
$$\pi(\beta \mid y) \propto \exp\left\{-\frac{1}{2}\sum (y_i - \beta x_i)^2 - \frac{1}{2\alpha^2}(\beta - \mu)^2\right\} \propto \exp\left\{\beta \sum x_i y_i - \frac{1}{2}\beta^2 \sum x_i^2 - \frac{\beta^2}{2\alpha^2} + \frac{\beta\mu}{\alpha^2}\right\}.$$
Abbreviate $s_{xx} = \sum x_i^2$ and $s_{xy} = \sum x_i y_i$; then
$$\pi(\beta \mid y) \propto \exp\left\{\beta s_{xy} - \frac{1}{2}\beta^2 s_{xx} - \frac{\beta^2}{2\alpha^2} + \frac{\beta\mu}{\alpha^2}\right\} = \exp\left\{-\frac{\beta^2}{2}\left(s_{xx} + \frac{1}{\alpha^2}\right) + \beta\left(s_{xy} + \frac{\mu}{\alpha^2}\right)\right\},$$
which we recognize as
$$\mathcal{N}\left(\frac{s_{xy} + \mu/\alpha^2}{s_{xx} + 1/\alpha^2},\ \left(s_{xx} + \frac{1}{\alpha^2}\right)^{-1}\right).$$

To summarize information about $\theta$ we find a minimal sufficient statistic $t(x)$; $T = t(X)$ is sufficient for $\theta$ if and only if, for all $x$ and $\theta$,
$$\pi(\theta \mid x) = \pi(\theta \mid t(x)).$$
As in the frequentist approach, posterior inference is based on sufficient statistics.

Choice of prior

There are many ways of choosing a prior distribution, for example by using a coherent belief system. Often it is convenient to restrict the class of priors to a particular family of distributions. When the posterior is in the same family of models as the prior, i.e. when one has a conjugate prior, updating the distribution under new data is particularly convenient.
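The normal prior in the example above is conjugate: the posterior is again normal, with the updated parameters just derived. A minimal sketch of that update (assuming Python with NumPy; the simulated design, the true value $\beta = 1.5$ and the prior $\mathcal{N}(0, 4)$ are illustrative choices, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative data (not from the notes): y_i ~ N(beta * x_i, 1), prior beta ~ N(mu, alpha^2).
mu, alpha2 = 0.0, 4.0
x = np.linspace(-2, 2, 30)
y = 1.5 * x + rng.standard_normal(x.size)

s_xx = np.sum(x**2)
s_xy = np.sum(x * y)

# Conjugate update derived above: the posterior of beta is normal with these parameters.
post_var = 1.0 / (s_xx + 1.0 / alpha2)
post_mean = (s_xy + mu / alpha2) * post_var
print(f"posterior for beta: N({post_mean:.3f}, {post_var:.4f})")
```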

For the regular $k$-parameter exponential family,
$$f(x \mid \theta) = f(x)\, g(\theta)\, \exp\left\{\sum_{i=1}^k c_i\, \phi_i(\theta)\, h_i(x)\right\}, \quad x \in \mathcal{X},$$
conjugate priors can be derived in a straightforward manner, using the sufficient statistics.

Non-informative priors favour no particular values of the parameter over others. If $\Theta$ is finite, we choose the uniform prior. If $\Theta$ is infinite, there are several ways (some of which give improper priors):

For a location density $f(x \mid \theta) = f(x - \theta)$, the non-informative location-invariant prior is $\pi(\theta) = \pi(0)$, constant; we usually choose $\pi(\theta) = 1$ for all $\theta$ (improper prior).

For a scale density $f(x \mid \sigma) = \frac{1}{\sigma} f\!\left(\frac{x}{\sigma}\right)$ for $\sigma > 0$, the scale-invariant non-informative prior is $\pi(\sigma) \propto \frac{1}{\sigma}$; usually we choose $\pi(\sigma) = \frac{1}{\sigma}$ (improper).

The Jeffreys prior is $\pi(\theta) \propto I(\theta)^{1/2}$ if the Fisher information $I(\theta)$ exists; it is invariant under reparametrization; the Jeffreys prior may or may not be improper.

Example continued. Suppose $y_1, y_2, \ldots, y_n$ are independent, $y_i \sim \mathcal{N}(\beta x_i, 1)$, where $\beta$ is an unknown parameter and $x_1, x_2, \ldots, x_n$ are known. Then
$$I(\beta) = -E\left(\frac{\partial^2}{\partial \beta^2} \log L(\beta, y)\right) = -E\left(\frac{\partial}{\partial \beta} \sum (y_i - \beta x_i) x_i\right) = s_{xx},$$
and so the Jeffreys prior is $\pi(\beta) \propto \sqrt{s_{xx}}$: constant and improper. With this Jeffreys prior, the calculation of the posterior distribution is equivalent to putting $\alpha^2 = \infty$ in the previous calculation, yielding
$$\pi(\beta \mid y) = \mathcal{N}\left(\frac{s_{xy}}{s_{xx}},\ (s_{xx})^{-1}\right).$$
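As a numerical check of the last claim, the sketch below (assuming Python with NumPy; simulated data as in the earlier sketch, all values illustrative) computes the posterior of $\beta$ on a grid under the flat Jeffreys prior and compares its mean and variance with the closed form $\mathcal{N}(s_{xy}/s_{xx}, (s_{xx})^{-1})$.

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative data (not from the notes): y_i ~ N(beta * x_i, 1), flat (Jeffreys) prior on beta.
x = np.linspace(-2, 2, 30)
y = 1.5 * x + rng.standard_normal(x.size)
s_xx, s_xy = np.sum(x**2), np.sum(x * y)

# Grid posterior under pi(beta) = 1: proportional to the likelihood.
beta_grid = np.linspace(0, 3, 2001)
log_lik = np.array([-0.5 * np.sum((y - b * x) ** 2) for b in beta_grid])
post = np.exp(log_lik - log_lik.max())
post /= np.trapz(post, beta_grid)

mean_grid = np.trapz(beta_grid * post, beta_grid)
var_grid = np.trapz((beta_grid - mean_grid) ** 2 * post, beta_grid)
print("grid posterior mean, variance :", mean_grid, var_grid)
print("closed form s_xy/s_xx, 1/s_xx :", s_xy / s_xx, 1 / s_xx)
```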

If $\Theta$ is discrete, and the prior is required to satisfy the constraints $E_\pi\, g_k(\theta) = \mu_k$, $k = 1, \ldots, m$, then we choose the distribution with the maximum entropy under these constraints; it is given by
$$\pi(\theta_i) = \frac{\exp\left(\sum_{k=1}^m \lambda_k g_k(\theta_i)\right)}{\sum_i \exp\left(\sum_{k=1}^m \lambda_k g_k(\theta_i)\right)},$$
where the $\lambda_k$ are determined by the constraints. There is a similar formula for the continuous case, maximizing the entropy relative to a particular reference distribution $\pi_0$ under constraints; for $\pi_0$ one would choose the natural invariant non-informative prior.

For inference, we check the influence of the choice of prior, for example by trying out different priors.

2 Point Estimation

Under suitable regularity conditions, and random sampling, when $n$ is large the posterior is approximately $\mathcal{N}(\hat\theta, (n I_1(\hat\theta))^{-1})$, where $\hat\theta$ is the m.l.e., provided that the prior is non-zero in a region surrounding $\hat\theta$. In particular, if $\theta_0$ is the true parameter, the posterior will become more and more concentrated around $\theta_0$. Often reporting the posterior distribution is preferred to point estimation.

Bayes estimators are constructed to minimize the integrated risk. Recall from decision theory: $\Theta$ is the set of all possible states of nature (values of the parameter), $D$ is the set of all possible decisions (actions); a loss function is any function $L : \Theta \times D \to [0, \infty)$. For point estimation we choose $D = \Theta$, and $L(\theta, d)$ is the loss in reporting $d$ when $\theta$ is true.

For a prior $\pi$ and data $x \in \mathcal{X}$, the posterior expected loss of a decision is
$$\rho(\pi, d \mid x) = \int_\Theta L(\theta, d)\,\pi(\theta \mid x)\,d\theta.$$
For a prior $\pi$ the integrated risk of a decision rule $\delta$ is
$$r(\pi, \delta) = \int_\Theta \int_{\mathcal{X}} L(\theta, \delta(x))\, f(x \mid \theta)\,dx\,\pi(\theta)\,d\theta.$$
An estimator minimizing $r(\pi, \delta)$ can be obtained by selecting, for every $x \in \mathcal{X}$, the value $\delta(x)$ that minimizes $\rho(\pi, \delta(x) \mid x)$.
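The recipe "for every $x$, minimise the posterior expected loss" can be carried out numerically. In the sketch below (assuming Python with NumPy and SciPy; the Beta(8, 4) posterior is an arbitrary illustrative choice, not from the notes), $\rho(\pi, d \mid x)$ is evaluated on a grid of decisions for squared and absolute error loss; the minimisers approximate the posterior mean and the posterior median, in line with the statements that follow.

```python
import numpy as np
from scipy import stats

# Illustrative posterior (not from the notes): Beta(8, 4) for theta in (0, 1).
theta = np.linspace(0.001, 0.999, 999)
posterior = stats.beta.pdf(theta, 8, 4)
posterior /= np.trapz(posterior, theta)

def posterior_expected_loss(d, loss):
    """rho(pi, d | x) = integral of L(theta, d) * pi(theta | x) dtheta."""
    return np.trapz(loss(theta, d) * posterior, theta)

decisions = np.linspace(0.001, 0.999, 999)
squared = lambda t, d: (t - d) ** 2
absolute = lambda t, d: np.abs(t - d)

d_sq = decisions[np.argmin([posterior_expected_loss(d, squared) for d in decisions])]
d_abs = decisions[np.argmin([posterior_expected_loss(d, absolute) for d in decisions])]

print("minimizer under squared loss :", d_sq, "(posterior mean 8/12 =", 8 / 12, ")")
print("minimizer under absolute loss:", d_abs, "(posterior median =", stats.beta.ppf(0.5, 8, 4), ")")
```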

A Bayes estimator associated with prior $\pi$ and loss $L$ is any estimator $\delta^\pi$ which minimizes $r(\pi, \delta)$. Then $r(\pi) = r(\pi, \delta^\pi)$ is the Bayes risk. This is valid for proper priors, and for improper priors if $r(\pi) < \infty$. If $r(\pi) = \infty$, we define a generalized Bayes estimator as the minimizer, for every $x$, of $\rho(\pi, d \mid x)$.

For strictly convex loss functions, Bayes estimators are unique. For squared error loss $L(\theta, d) = (\theta - d)^2$, the Bayes estimator $\delta^\pi$ associated with prior $\pi$ is the posterior mean. For absolute error loss $L(\theta, d) = |\theta - d|$, the posterior median is a Bayes estimator.

Example. Let $x$ be a single observation from $\mathcal{N}(\mu, 1)$, and let $\mu$ have prior distribution $\pi(\mu) \propto e^{\mu}$ on the whole real line. What is the generalized Bayes estimator for $\mu$ under squared error loss?

First we find the posterior distribution of $\mu$ given $x$:
$$\pi(\mu \mid x) \propto \exp\left(\mu - \frac{1}{2}(\mu - x)^2\right) \propto \exp\left(-\frac{1}{2}\bigl(\mu - (x + 1)\bigr)^2\right),$$
and so $\pi(\mu \mid x) = \mathcal{N}(x + 1, 1)$. The generalized Bayes estimator under squared error loss is the posterior mean (from lectures), and so we have $\mu^B = x + 1$.

Note that, in this example, if we consider $X \sim \mathcal{N}(\mu, 1)$ with $\mu$ fixed, then for $\mu^B$ we have
$$\mathrm{MSE}(\mu^B) = E(X + 1 - \mu)^2 = 1 + E(X - \mu)^2 = 2 > E(X - \mu)^2 = 1.$$
The m.l.e. for $\mu$ is $X$, so the mean squared error for the Bayes estimator is larger than the MSE for the maximum likelihood estimator. The Bayes estimator is far from optimal here.
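A quick Monte Carlo check of this MSE comparison (assuming Python with NumPy; the fixed value $\mu = 0.7$ and the simulation size are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(3)

# Monte Carlo check (illustrative, not from the notes) of MSE(mu_B) = 2 vs MSE(mle) = 1.
mu_fixed = 0.7                                  # any fixed true mean will do
X = mu_fixed + rng.standard_normal(200_000)     # X ~ N(mu, 1)

mu_bayes = X + 1                                # generalized Bayes estimator under squared error loss
mu_mle = X                                      # maximum likelihood estimator

print("MSE of Bayes estimator:", np.mean((mu_bayes - mu_fixed) ** 2))  # approximately 2
print("MSE of MLE            :", np.mean((mu_mle - mu_fixed) ** 2))    # approximately 1
```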

3 Hypothesis Testing

For testing $H_0: \theta \in \Theta_0$, we set $D = \{\text{accept } H_0, \text{reject } H_0\} = \{1, 0\}$, where 1 stands for acceptance. We choose as loss function
$$L(\theta, \phi) = \begin{cases} 0 & \text{if } \theta \in \Theta_0,\ \phi = 1, \\ a_0 & \text{if } \theta \in \Theta_0,\ \phi = 0, \\ 0 & \text{if } \theta \notin \Theta_0,\ \phi = 0, \\ a_1 & \text{if } \theta \notin \Theta_0,\ \phi = 1. \end{cases} \qquad (1)$$
Under this loss function, the Bayes decision rule associated with a prior distribution $\pi$ is
$$\phi^\pi(x) = \begin{cases} 1 & \text{if } P^\pi(\theta \in \Theta_0 \mid x) > \frac{a_1}{a_0 + a_1}, \\ 0 & \text{otherwise.} \end{cases}$$

The Bayes factor for testing $H_0: \theta \in \Theta_0$ against $H_1: \theta \in \Theta_1$ is
$$B^\pi(x) = \frac{P^\pi(\theta \in \Theta_0 \mid x) / P^\pi(\theta \in \Theta_1 \mid x)}{P^\pi(\theta \in \Theta_0) / P^\pi(\theta \in \Theta_1)} = \frac{p(x \mid \theta \in \Theta_0)}{p(x \mid \theta \in \Theta_1)};$$
this is the ratio of how likely the data are under $H_0$ to how likely they are under $H_1$. The Bayes factor measures the extent to which the data $x$ will change the odds of $\Theta_0$ relative to $\Theta_1$. Note that we have to make sure that our prior distribution puts mass on $H_0$ (and on $H_1$). If $H_0$ is simple, this is usually achieved by choosing a prior that has some point mass on $H_0$ and otherwise lives on $H_1$.

Example. Let $X$ be binomially distributed with sample size $n$ and success probability $\theta$. Suppose that we would like to investigate $H_0: \theta = \theta_0$ against $H_1: \theta \ne \theta_0$, where $\theta_0$ is specified. Our prior distribution is $\pi(H_0) = \beta$, $\pi(H_1) = 1 - \beta$, and given $H_1$, $\theta$ is Beta-distributed with parameters $\alpha$ and $\beta$, i.e.
$$\pi(\theta \mid H_1) = \frac{\theta^{\alpha - 1}(1 - \theta)^{\beta - 1}}{B(\alpha, \beta)},$$
where $B$ is the Beta function,
$$B(\alpha, \beta) = \int_0^1 \theta^{\alpha - 1}(1 - \theta)^{\beta - 1}\,d\theta = \frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha + \beta)}.$$

1. Find the Bayes factor for this test problem.
2. Find the Bayes decision rule for this test problem under the loss function (1).

For the Bayes factor we calculate $p(x \mid \theta \in \Theta_0)$ and $p(x \mid \theta \in \Theta_1)$. Firstly,
$$p(x \mid \theta \in \Theta_0) = p(x \mid \theta = \theta_0) = \binom{n}{x} \theta_0^x (1 - \theta_0)^{n - x}.$$
For the alternative, we use the Beta prior,
$$p(x \mid \theta \in \Theta_1) = \binom{n}{x} \frac{1}{B(\alpha, \beta)} \int_0^1 \theta^x (1 - \theta)^{n - x}\, \theta^{\alpha - 1}(1 - \theta)^{\beta - 1}\,d\theta = \binom{n}{x} \frac{1}{B(\alpha, \beta)} \int_0^1 \theta^{x + \alpha - 1}(1 - \theta)^{n - x + \beta - 1}\,d\theta = \binom{n}{x} \frac{B(x + \alpha, n - x + \beta)}{B(\alpha, \beta)}.$$
Thus the Bayes factor is
$$B^\pi(x) = \frac{p(x \mid \theta \in \Theta_0)}{p(x \mid \theta \in \Theta_1)} = \frac{B(\alpha, \beta)}{B(\alpha + x, \beta + n - x)}\, \theta_0^x (1 - \theta_0)^{n - x}.$$
The Bayes decision rule is
$$\phi^\pi(x) = \begin{cases} 1 & \text{if } P^\pi(\theta \in \Theta_0 \mid x) > \frac{a_1}{a_0 + a_1}, \\ 0 & \text{otherwise,} \end{cases}$$
and from lectures,
$$\phi^\pi(x) = 1 \iff B^\pi(x) > \frac{a_1}{a_0}\,\frac{1 - \beta}{\beta} \iff \frac{B(\alpha, \beta)}{B(\alpha + x, \beta + n - x)}\,\theta_0^x(1 - \theta_0)^{n - x} > \frac{a_1}{a_0}\,\frac{1 - \beta}{\beta} \iff \frac{\beta\, B(\alpha, \beta)}{(1 - \beta)\, B(\alpha + x, \beta + n - x)}\,\theta_0^x(1 - \theta_0)^{n - x} > \frac{a_1}{a_0}.$$
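The Bayes factor and the decision rule just derived are straightforward to evaluate. A small sketch (assuming Python with NumPy and SciPy; the values $n = 20$, $x = 15$, $\theta_0 = 0.5$, a uniform Beta(1, 1) prior under $H_1$, prior mass $1/2$ on $H_0$ and losses $a_0 = a_1 = 1$ are illustrative choices, not from the notes):

```python
import numpy as np
from scipy.special import betaln

# Illustrative numbers (not from the notes): test H0: theta = 0.5 in a Binomial(n, theta) model.
n, x = 20, 15
theta0 = 0.5
a_prior, b_prior = 1.0, 1.0      # Beta(alpha, beta) prior parameters under H1 (here uniform)
prior_H0 = 0.5                   # prior mass on H0
a0, a1 = 1.0, 1.0                # losses in loss function (1)

# Bayes factor  B(alpha, beta) / B(alpha + x, beta + n - x) * theta0^x (1 - theta0)^(n - x),
# computed on the log scale for numerical stability.
log_bf = (betaln(a_prior, b_prior) - betaln(a_prior + x, b_prior + n - x)
          + x * np.log(theta0) + (n - x) * np.log(1 - theta0))
bayes_factor = np.exp(log_bf)

# Bayes decision rule: accept H0 iff  B^pi(x) > (a1 / a0) * (1 - prior_H0) / prior_H0.
threshold = (a1 / a0) * (1 - prior_H0) / prior_H0
print("Bayes factor:", bayes_factor)            # about 0.31 for these numbers
print("accept H0?  :", bool(bayes_factor > threshold))
```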

For robustness we consider least favourable Bayesian answers. Suppose $H_0: \theta = \theta_0$, $H_1: \theta \ne \theta_0$, and the prior probability on $H_0$ is $\rho_0 = 1/2$. For $G$ a family of priors on $H_1$ we put
$$B(x, G) = \inf_{g \in G} \frac{f(x \mid \theta_0)}{\int_\Theta f(x \mid \theta)\, g(\theta)\,d\theta}$$
and
$$P(x, G) = \left(1 + \frac{1}{B(x, G)}\right)^{-1}.$$
Any Bayesian prior $g \in G$ on $H_1$ will then give posterior probability at least $P(x, G)$ to $H_0$ (for $\rho_0 = 1/2$). If $\hat\theta$ is the m.l.e. of $\theta$, and $G_A$ is the set of all prior distributions, then
$$B(x, G_A) = \frac{f(x \mid \theta_0)}{f(x \mid \hat\theta(x))} \quad \text{and} \quad P(x, G_A) = \left(1 + \frac{f(x \mid \hat\theta(x))}{f(x \mid \theta_0)}\right)^{-1}.$$

4 Credible intervals

A $(1 - \alpha)$ (posterior) credible interval (region) is an interval (region) of $\theta$ values within which $1 - \alpha$ of the posterior probability lies. Often we would like to find an HPD (highest posterior density) region: a $(1 - \alpha)$ credible region that has minimal volume. When the posterior density is unimodal, this is often straightforward.

Example continued. Suppose $y_1, y_2, \ldots, y_n$ are independent normally distributed random variables, each with variance 1 and with means $\beta x_1, \ldots, \beta x_n$, where $\beta$ is an unknown real-valued parameter and $x_1, x_2, \ldots, x_n$ are known constants. Under the Jeffreys prior, the posterior is $\mathcal{N}\left(\frac{s_{xy}}{s_{xx}}, (s_{xx})^{-1}\right)$. Hence a 95% credible interval for $\beta$ is
$$\frac{s_{xy}}{s_{xx}} \pm 1.96\, \frac{1}{\sqrt{s_{xx}}}.$$
The interpretation in Bayesian statistics is conditional on the observed $x$; the randomness relates to the distribution of $\theta$. In contrast, a frequentist confidence interval applies before $x$ is observed; the randomness relates to the distribution of $x$.
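The interval is immediate to compute. A short sketch (assuming Python with NumPy and SciPy; simulated data as in the earlier sketches, all numbers illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)

# Illustrative data (not from the notes): y_i ~ N(beta * x_i, 1), Jeffreys (flat) prior on beta.
x = np.linspace(-2, 2, 30)
y = 1.5 * x + rng.standard_normal(x.size)
s_xx, s_xy = np.sum(x**2), np.sum(x * y)

post_mean, post_sd = s_xy / s_xx, 1 / np.sqrt(s_xx)

# Equal-tailed 95% credible interval; for a symmetric unimodal (normal) posterior
# this is also the HPD region.
z = stats.norm.ppf(0.975)          # approximately 1.96
lower, upper = post_mean - z * post_sd, post_mean + z * post_sd
print(f"95% credible interval for beta: ({lower:.3f}, {upper:.3f})")
```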

5 Not to forget about: Nuisance parameters

If $\theta = (\psi, \lambda)$, where $\lambda$ is a nuisance parameter, and $\pi(\theta \mid x) = \pi((\psi, \lambda) \mid x)$, then we base our inference on the marginal posterior of $\psi$:
$$\pi(\psi \mid x) = \int \pi(\psi, \lambda \mid x)\,d\lambda.$$
That is, we just integrate out the nuisance parameter.
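Numerically, integrating out a nuisance parameter is just a quadrature (or sum) over the joint posterior. A sketch (assuming Python with NumPy and SciPy; the normal model with mean $\psi$ of interest and nuisance scale $\lambda$, and the improper priors used, are illustrative choices, not from the notes):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)

# Illustrative setup (not from the notes): data ~ N(psi, lambda^2), with lambda a nuisance
# scale parameter; priors pi(psi) = 1 and pi(lambda) = 1/lambda (both improper).
data = rng.normal(loc=2.0, scale=1.5, size=25)

psi = np.linspace(0, 4, 401)                 # grid for the parameter of interest
lam = np.linspace(0.3, 4, 371)               # grid for the nuisance parameter
PSI, LAM = np.meshgrid(psi, lam, indexing="ij")

# Joint posterior on the grid: prior * likelihood, then normalised numerically.
log_post = (-np.log(LAM)                     # prior 1/lambda on the nuisance scale
            + np.sum(stats.norm.logpdf(data[:, None, None], PSI, LAM), axis=0))
joint = np.exp(log_post - log_post.max())
joint /= np.trapz(np.trapz(joint, lam, axis=1), psi)

# Marginal posterior of psi: integrate the joint posterior over lambda.
marginal_psi = np.trapz(joint, lam, axis=1)
print("marginal posterior mean of psi:", np.trapz(psi * marginal_psi, psi))
```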