
Weakness of Beta priors (or conjugate priors in general)

They can only represent a limited range of prior beliefs. For example: there are no bimodal beta distributions (except when the modes are at 0 and 1); there are no beta distributions which are roughly constant on (.3, .7) and drop off rapidly to zero outside this range; etc.

Use of arbitrary priors

An arbitrary prior requires numerical integration (or Monte Carlo). For example,

E(\theta \mid x) = \int \theta \, \pi(\theta \mid x) \, d\theta = \frac{\int \theta \, f(x \mid \theta) \, \pi(\theta) \, d\theta}{\int f(x \mid \theta) \, \pi(\theta) \, d\theta}.

For E(h(\theta) | x), replace \theta by h(\theta) above. To compute Var(\theta | x) = E(\theta^2 | x) - [E(\theta | x)]^2 we also need E(\theta^2 | x). When \theta is high-dimensional, Monte Carlo (as in WinBUGS) is often easier than numerical integration.

Mixtures of Beta priors (or conjugate priors in general)

Mixture priors allow representation of more general prior beliefs while retaining some of the convenience of conjugate priors.
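Before turning to the mixture example, here is a minimal Python sketch (not part of the original notes) of the numerical-integration route just described. The data (14 heads in 20 tosses) and the plateau-shaped prior, roughly constant on (0.3, 0.7), are hypothetical choices; the prior is exactly the kind of shape a single beta distribution cannot represent.

# Sketch (not from the notes): posterior moments under an arbitrary prior by
# numerical integration, for binomial data x ~ Binomial(n, theta).
# The prior is hypothetical: roughly flat on (0.3, 0.7), dropping rapidly outside.
import numpy as np
from scipy.integrate import quad
from scipy.stats import binom

n, t = 20, 14                      # illustrative data: 14 heads in 20 tosses

def prior(theta):
    # smooth "plateau" prior on (0.3, 0.7); unnormalized is fine
    return 1.0 / ((1 + np.exp(-60 * (theta - 0.3))) * (1 + np.exp(60 * (theta - 0.7))))

def unnorm_post(theta):
    return binom.pmf(t, n, theta) * prior(theta)

Z, _ = quad(unnorm_post, 0, 1)                          # normalizing constant
m1, _ = quad(lambda th: th * unnorm_post(th), 0, 1)     # E(theta | x) * Z
m2, _ = quad(lambda th: th**2 * unnorm_post(th), 0, 1)  # E(theta^2 | x) * Z
post_mean = m1 / Z
post_var = m2 / Z - post_mean**2
print(post_mean, post_var)

Because \theta is one-dimensional here, quadrature is cheap; in higher dimensions the same expectations would typically be estimated by Monte Carlo instead.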

For example, for Bayesian estimation of \theta = P(heads) from t heads observed in n tosses, suppose the prior is a two-component mixture of betas:

\pi(\theta) = p_1 f_1(\theta) + p_2 f_2(\theta),

where f_1 \sim Beta(\alpha_1, \beta_1), f_2 \sim Beta(\alpha_2, \beta_2), p_1, p_2 > 0 and p_1 + p_2 = 1. Then the posterior is again a mixture of betas,

\pi(\theta \mid x) = \tilde{p}_1 \tilde{f}_1(\theta) + \tilde{p}_2 \tilde{f}_2(\theta),

where \tilde{f}_1 \sim Beta(\alpha_1 + t, \beta_1 + n - t), \tilde{f}_2 \sim Beta(\alpha_2 + t, \beta_2 + n - t), \tilde{p}_1, \tilde{p}_2 > 0 and \tilde{p}_1 + \tilde{p}_2 = 1, with \tilde{p}_1, \tilde{p}_2 being functions of t (given below). The posterior weights are

\tilde{p}_i = \frac{\psi_i}{\psi_1 + \psi_2}, \qquad \psi_i = \frac{p_i \, B(\alpha_i + t, \beta_i + n - t)}{B(\alpha_i, \beta_i)},

with B(\alpha, \beta) = \Gamma(\alpha)\Gamma(\beta) / \Gamma(\alpha + \beta).
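A minimal Python sketch (not from the notes) of this mixture update; the prior components, weights, and data are hypothetical. Working with log Beta functions avoids overflow in the \psi_i.

# Sketch (not from the notes): posterior for a two-component mixture of beta
# priors, using the weight formula above. Prior components and data are
# hypothetical choices for illustration (the prior here is bimodal).
import numpy as np
from scipy.special import betaln
from scipy.stats import beta

a = np.array([2.0, 20.0])    # alpha_1, alpha_2
b = np.array([20.0, 2.0])    # beta_1, beta_2
p = np.array([0.5, 0.5])     # prior mixture weights
n, t = 20, 14                # data: t heads in n tosses

# psi_i = p_i * B(alpha_i + t, beta_i + n - t) / B(alpha_i, beta_i), in log space
log_psi = np.log(p) + betaln(a + t, b + n - t) - betaln(a, b)
w = np.exp(log_psi - log_psi.max())
w /= w.sum()                 # posterior mixture weights p~_1, p~_2

# posterior density at one point: w_1 * Beta(a1+t, b1+n-t) + w_2 * Beta(a2+t, b2+n-t)
theta = 0.6
post_pdf = np.sum(w * beta.pdf(theta, a + t, b + n - t))
print(w, post_pdf)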

Example: Suppose \lambda \sim \pi and, conditional on \lambda, X_1, \ldots, X_n are iid Poisson(\lambda). What is the family of conjugate priors in this situation? Examine the likelihood function and see what priors fit well with it: multiplying likelihood times prior should produce something in the same family (ignoring constants).

L(\lambda \mid x) = \prod_{i=1}^{n} \frac{\lambda^{x_i} e^{-\lambda}}{x_i!} = \frac{\lambda^{\sum x_i} e^{-n\lambda}}{\prod x_i!} \propto \lambda^{t} e^{-n\lambda}, \qquad \text{where } t = \sum_{i=1}^{n} x_i,

which has the general form \lambda^{a} e^{-b\lambda}, the kernel of a gamma density (as a function of \lambda).

Textbook parameterization of the Gamma density:

f(x \mid \alpha, \beta) = \frac{x^{\alpha - 1} e^{-x/\beta}}{\beta^{\alpha} \Gamma(\alpha)} \propto x^{\alpha - 1} e^{-x/\beta}.

Another common parameterization (call it Gamma(a, b)) substitutes b = 1/\beta:

f(x \mid a, b) = \frac{b^{a} x^{a-1} e^{-bx}}{\Gamma(a)} \propto x^{a-1} e^{-bx}.

(Substitute x \to \lambda in both pdfs.)

Suppose the prior \pi is Gamma(a, b). (Using Gamma(a, b) leads to a somewhat simpler updating rule.) Then

\pi(\lambda \mid x) \propto L(\lambda \mid x) \, \pi(\lambda) \propto \lambda^{t} e^{-n\lambda} \cdot \lambda^{a-1} e^{-b\lambda} = \lambda^{(a+t)-1} e^{-(b+n)\lambda},

which is proportional to the Gamma(a + t, b + n) pdf, so the posterior distribution is Gamma(a + t, b + n). The Gamma family is closed under sampling and forms a conjugate family.
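The conjugate update can be checked numerically. Below is a short Python sketch (not from the notes) with a hypothetical prior and data set: it compares the closed-form Gamma(a + t, b + n) posterior with a brute-force likelihood-times-prior computation on a grid.

# Sketch (not from the notes): Gamma(a, b) -> Gamma(a + t, b + n) update for
# Poisson data, checked against likelihood x prior evaluated on a grid.
# The prior (a, b) and the data are hypothetical choices for illustration.
import numpy as np
from scipy.stats import gamma, poisson

a, b = 2.0, 1.0                        # prior: Gamma(shape a, rate b)
x = np.array([3, 5, 2, 4, 6])          # iid Poisson(lambda) observations
n, t = len(x), x.sum()

# conjugate posterior; scipy's gamma uses a scale (= 1/rate) parameter
post = gamma(a + t, scale=1.0 / (b + n))

# numerical check: likelihood x prior on a grid, normalized by a Riemann sum
lam = np.linspace(1e-6, 15, 4000)
unnorm = gamma.pdf(lam, a, scale=1.0 / b) * np.prod(poisson.pmf(x[:, None], lam), axis=0)
numeric = unnorm / (unnorm.sum() * (lam[1] - lam[0]))

print(np.max(np.abs(numeric - post.pdf(lam))))   # max discrepancy; should be small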

Bayesian Estimation and Sufficient Statistics

Suppose T(X) is a sufficient statistic for \theta.

Fact: The posterior distribution depends on the data X only through the sufficient statistic T(X).

Corollary: E(\theta | X) and Var(\theta | X) (and any other posterior quantity) depend on the data X only through T(X).

Proof: By the factorization criterion, there exist g and h such that f(x \mid \theta) = g(T(x), \theta) \, h(x) for all x and \theta. Thus

\pi(\theta \mid x) \propto f(x \mid \theta) \, \pi(\theta) = g(T(x), \theta) \, h(x) \, \pi(\theta) \propto g(T(x), \theta) \, \pi(\theta),

so that

\pi(\theta \mid x) = \frac{g(T(x), \theta) \, \pi(\theta)}{\int_{\Theta} g(T(x), \theta') \, \pi(\theta') \, d\theta'},

which depends on x only through T(x).
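As a quick numerical illustration (not from the notes): two hypothetical Poisson data sets with the same n and the same t = \sum x_i produce identical posteriors, even under a non-conjugate prior (a half-Cauchy prior on \lambda is used here purely for illustration).

# Sketch (not from the notes): data sets with the same sufficient statistic
# t = sum(x) (and the same n) give the same posterior, checked on a grid.
import numpy as np
from scipy.stats import poisson, halfcauchy

x1 = np.array([1, 7, 2, 4])        # t = 14, n = 4
x2 = np.array([5, 3, 5, 1])        # t = 14, n = 4 as well

lam = np.linspace(1e-6, 20, 4000)

def posterior_on_grid(x):
    unnorm = halfcauchy.pdf(lam) * np.prod(poisson.pmf(x[:, None], lam), axis=0)
    return unnorm / (unnorm.sum() * (lam[1] - lam[0]))

print(np.max(np.abs(posterior_on_grid(x1) - posterior_on_grid(x2))))  # ~ 0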

A Simple Bayesian Hierarchical Model

Situation: 10 coins are sampled from a population of coins. Each coin is tossed 20 times. We desire to estimate p_1, p_2, \ldots, p_{10}, the probability of heads for each coin. Let X_i be the number of heads in 20 tosses for coin i. A Bayesian can incorporate prior knowledge about the similarity of the coins into the prior distribution.

Bayesian Model (i = 1, \ldots, 10 throughout):

X_i \sim Binomial(p_i, 20)
\eta_i \sim Normal(\mu, 1/\tau), where p_i = \frac{e^{\eta_i}}{1 + e^{\eta_i}} and \eta_i = \log\!\left(\frac{p_i}{1 - p_i}\right)
\mu \sim Normal(0, 1), \quad \tau \sim Gamma(10, b)

The random variables at each level are conditionally independent given the random variables at lower levels. With X = (X_1, X_2, \ldots, X_{10}) and \theta = (\eta_1, \eta_2, \ldots, \eta_{10}, \mu, \tau), the posterior is

\pi(\theta \mid X) \propto \prod_{i=1}^{10} \binom{20}{X_i} p_i^{X_i} (1 - p_i)^{20 - X_i} \; \prod_{i=1}^{10} g(\eta_i \mid \mu, \tau) \; h(\mu) \, k(\tau),

where g(\cdot \mid \mu, \tau) is the N(\mu, 1/\tau) density, h(\cdot) is the N(0, 1) density, and k(\cdot) is the Gamma(10, b) density. Note: Gamma(a, b) has mean a/b and variance a/b^2.
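For concreteness, here is a minimal Python sketch (not part of the notes) of the unnormalized log posterior above. The hyperparameter b is fixed at 2, one of the three values examined later, purely for illustration.

# Sketch (not part of the notes): the unnormalized log posterior of the
# hierarchical coin model, as a function of theta = (eta_1..eta_10, mu, tau).
import numpy as np
from scipy.stats import binom, norm, gamma

x = np.array([14, 8, 11, 14, 11, 11, 8, 10, 12, 8])   # heads out of 20 per coin
b = 2.0

def log_post(theta):
    eta, mu, tau = theta[:10], theta[10], theta[11]
    if tau <= 0:
        return -np.inf                                 # tau must be positive
    p = 1.0 / (1.0 + np.exp(-eta))                     # p_i = e^eta_i / (1 + e^eta_i)
    log_lik = np.sum(binom.logpmf(x, 20, p))           # product of 10 binomials
    log_prior = (np.sum(norm.logpdf(eta, mu, 1.0 / np.sqrt(tau)))  # g(eta_i | mu, tau)
                 + norm.logpdf(mu, 0.0, 1.0)                       # h(mu): N(0, 1)
                 + gamma.logpdf(tau, 10.0, scale=1.0 / b))         # k(tau): Gamma(10, b)
    return log_lik + log_prior

# evaluate at eta = 0, mu = 0, tau = 1 (so all p_i = 0.5)
print(log_post(np.concatenate([np.zeros(10), [0.0, 1.0]])))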

BUGS code describing the model:

model {
  for(i in 1:N) {
    x[i] ~ dbin(p[i], 20)
    logit(p[i]) <- eta[i]
    eta[i] ~ dnorm(mu, tau)
  }
  mu ~ dnorm(0.0, 1.0)
  tau ~ dgamma(10.0, xxx)   # Changing the scale only.
}

Data:  list(N=10, x=c(14,8,11,14,11,11,8,10,12,8))
Inits: list(mu=0, tau=1, eta=c(0,0,0,0,0,0,0,0,0,0))

--------------------------------------------------
Story: There are 10 coins. Each is tossed 20 times. The data
x[1] = 14, x[2] = 8, ..., x[9] = 12, x[10] = 8
are the numbers of heads observed on each coin.

Goal: Estimate p[1], ..., p[10], the probability of heads for each coin.
--------------------------------------------------

Frequentist Answers
-------------------
Assuming all p[i] are equal leads to:
  phat = sum(x)/200 = 0.535
  std. dev. of phat = sqrt(.535*(1-.535)/200) = 0.0353

Estimating each p[i] separately leads to:
  p[1]hat = 14/20 = 0.7
  p[2]hat = 8/20  = 0.4
  std. dev. of p[1]hat = sqrt(.7*(1-.7)/20) = 0.1025
  std. dev. of p[2]hat = sqrt(.4*(1-.4)/20) = 0.1095

Compare with the Bayesian answers below.
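A short Python sketch (not from the notes) reproducing the frequentist numbers above.

# Sketch (not from the notes): pooled and separate frequentist estimates.
import numpy as np

x = np.array([14, 8, 11, 14, 11, 11, 8, 10, 12, 8])
n_per_coin, n_total = 20, 200

phat = x.sum() / n_total
print(phat, np.sqrt(phat * (1 - phat) / n_total))       # 0.535, 0.0353

p_sep = x / n_per_coin
se_sep = np.sqrt(p_sep * (1 - p_sep) / n_per_coin)
print(p_sep[0], se_sep[0])                              # 0.700, 0.1025
print(p_sep[1], se_sep[1])                              # 0.400, 0.1095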

The likelihood function:

L(\theta) = f(X \mid \theta) = \prod_{i=1}^{10} \binom{20}{X_i} p_i^{X_i} (1 - p_i)^{20 - X_i}, \qquad \text{where } p_i = \frac{e^{\eta_i}}{1 + e^{\eta_i}}.

The prior:

\pi(\theta) = \prod_{i=1}^{10} g(\eta_i \mid \mu, \tau) \; h(\mu) \, k(\tau),

where g(\cdot \mid \mu, \tau) is the N(\mu, 1/\tau) density, h(\cdot) is the N(0, 1) density, and k(\cdot) is the Gamma(10, b) density.

The posterior:

\pi(\theta \mid X) \propto f(X \mid \theta) \, \pi(\theta).

The Bayesian's belief after collecting the data is given by the posterior distribution. BUGS uses MCMC (Markov chain Monte Carlo) to generate a sample of many thousands (as many as you like) of \theta vectors from the posterior distribution. Using this sample, BUGS can print summary information for each component of \theta = (\eta_1, \eta_2, \ldots, \eta_{10}, \mu, \tau), giving Monte Carlo estimates of the posterior mean, posterior variance, posterior median, etc. BUGS can also print summary information for the posterior distributions of other quantities, such as p_1, p_2, \ldots, p_{10}.

The output below summarizes the posterior distributions of three different Bayesians, each with a different prior. The three priors differ only in the value of b, which takes the three values b = 10^{-3}, 10^{4}, and 2.

Using: tau ~ dgamma(10.0, .001)

node   mean      sd        MC error   2.5%       median    97.5%
mu     0.1316    0.1296    0.008272   -0.1277    0.135     0.3998
tau    10001.3   3194.89   17.9655    4757.75    9670.29   17168.7
p[1]   0.5328    0.0322    0.00205    0.4680     0.5338    0.5986
p[2]   0.5326    0.0322    0.00205    0.4678     0.5336    0.5983

--------------------------------------------------
Using: tau ~ dgamma(10.0, 10000)

node   mean      sd        MC error   2.5%       median    97.5%
mu     -0.0054   0.9929    0.004380   -1.9522    -0.0013   1.9352
tau    0.001497  3.86E-4   1.62E-6    8.377E-4   0.001464  0.002344
p[1]   0.7006    0.09994   4.151E-4   0.4871     0.7077    0.8744
p[2]   0.4008    0.10689   4.637E-4   0.2030     0.3975    0.6181

--------------------------------------------------
Using: tau ~ dgamma(10.0, 2.0)

node   mean      sd        MC error   2.5%       median    97.5%
mu     0.1398    0.196     0.00191    -0.2462    0.1409    0.524
tau    5.512     1.603     0.01041    2.872      5.34      9.101
p[1]   0.6122    0.07851   5.701E-4   0.4547     0.614     0.7604
p[2]   0.4698    0.08052   5.129E-4   0.3135     0.4696    0.6262
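For comparison with the BUGS output above, here is a minimal random-walk Metropolis sketch in Python (not from the notes, and not the Gibbs-style sampler BUGS actually uses). With b = 2 it should roughly reproduce the last block of output above, up to Monte Carlo error; the proposal is not carefully tuned, so treat it only as an illustration of the MCMC idea.

# Sketch (not from the notes): random-walk Metropolis sampling from the
# hierarchical-coin posterior, then posterior summaries for mu, tau, p[1], p[2].
import numpy as np
from scipy.stats import binom, norm, gamma

x = np.array([14, 8, 11, 14, 11, 11, 8, 10, 12, 8])
b = 2.0                                               # prior: tau ~ Gamma(10, 2)

def log_post(theta):
    # theta = (eta_1..eta_10, mu, tau); log of the unnormalized posterior
    eta, mu, tau = theta[:10], theta[10], theta[11]
    if tau <= 0:
        return -np.inf
    p = 1.0 / (1.0 + np.exp(-eta))
    return (np.sum(binom.logpmf(x, 20, p))                        # likelihood
            + np.sum(norm.logpdf(eta, mu, 1.0 / np.sqrt(tau)))    # eta_i | mu, tau
            + norm.logpdf(mu, 0.0, 1.0)                           # mu ~ N(0, 1)
            + gamma.logpdf(tau, 10.0, scale=1.0 / b))             # tau ~ Gamma(10, b)

rng = np.random.default_rng(0)
theta = np.concatenate([np.zeros(10), [0.0, 1.0]])    # the BUGS inits
lp = log_post(theta)
step = np.concatenate([np.full(10, 0.3), [0.2, 1.0]]) # rough per-block proposal scales
draws = []

for it in range(50000):
    prop = theta + step * rng.normal(size=12)         # symmetric random-walk proposal
    lp_prop = log_post(prop)
    if np.log(rng.uniform()) < lp_prop - lp:          # Metropolis accept/reject
        theta, lp = prop, lp_prop
    if it >= 10000:                                   # discard burn-in
        draws.append(theta.copy())

draws = np.array(draws)
summaries = {"mu": draws[:, 10], "tau": draws[:, 11],
             "p[1]": 1.0 / (1.0 + np.exp(-draws[:, 0])),
             "p[2]": 1.0 / (1.0 + np.exp(-draws[:, 1]))}
for name, s in summaries.items():
    q = np.percentile(s, [2.5, 50, 97.5])
    print(f"{name:5s} mean {s.mean():8.4f}  sd {s.std():7.4f}  "
          f"2.5% {q[0]:8.4f}  median {q[1]:8.4f}  97.5% {q[2]:8.4f}")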