PARAMETER ESTIMATION: BAYESIAN APPROACH

These notes summarize the lectures on Bayesian parameter estimation.

Weakness of Beta priors (or conjugate priors in general)

They can only represent a limited range of prior beliefs. For example:
- There are no bimodal beta distributions (except when the modes are at 0 and 1).
- There are no beta distributions which are roughly constant on (0.3, 0.7) and drop off rapidly to zero outside this range.
- etc.

Use of arbitrary priors

Requires numerical integration (or Monte Carlo). For example,

E(θ | x) = ∫ θ π(θ | x) dθ = [∫ θ f(x | θ) π(θ) dθ] / [∫ f(x | θ) π(θ) dθ].

For E(h(θ) | x), replace θ by h(θ) above. To compute Var(θ | x) = E(θ² | x) − [E(θ | x)]², we need E(θ² | x). When θ is high-dimensional, Monte Carlo (as in WinBUGS) is often easier than numerical integration.

Mixtures of Beta priors (or conjugate priors in general)

Allow representation of more general prior beliefs while retaining some of the convenience of conjugate priors.

For example, for Bayesian estimation of θ = P(heads) based on t heads in n tosses: if the prior is the mixture

π(θ) = p₁ f₁(θ) + p₂ f₂(θ),

where f₁ is the Beta(α₁, β₁) density, f₂ is the Beta(α₂, β₂) density, p₁, p₂ > 0 and p₁ + p₂ = 1, then the posterior is

π(θ | x) = p₁* f₁*(θ) + p₂* f₂*(θ),

where f₁* is the Beta(α₁ + t, β₁ + n − t) density, f₂* is the Beta(α₂ + t, β₂ + n − t) density, p₁*, p₂* > 0 and p₁* + p₂* = 1, with the weights p₁*, p₂* being functions of t (see below). The posterior weights are

pᵢ* = ψᵢ / (ψ₁ + ψ₂), where ψᵢ = pᵢ B(αᵢ + t, βᵢ + n − t) / B(αᵢ, βᵢ)

and B(α, β) = Γ(α)Γ(β) / Γ(α + β).
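As an illustration of the updating rule above, here is a minimal Python sketch (not part of the original notes) that computes the posterior mixture weights and posterior mean; the data (t = 14 heads in n = 20 tosses) and the prior settings are made-up values for illustration.

import numpy as np
from scipy.special import betaln

n, t = 20, 14                        # made-up data: t heads in n tosses

# Mixture-of-Betas prior: weights p_i and Beta(alpha_i, beta_i) components
# (hypothetical values).
p     = np.array([0.5, 0.5])
alpha = np.array([10.0, 2.0])
beta  = np.array([2.0, 10.0])

# psi_i = p_i * B(alpha_i + t, beta_i + n - t) / B(alpha_i, beta_i),
# computed on the log scale for numerical stability, then normalized.
log_psi = np.log(p) + betaln(alpha + t, beta + n - t) - betaln(alpha, beta)
p_star  = np.exp(log_psi - log_psi.max())
p_star /= p_star.sum()

# Posterior is sum_i p_star[i] * Beta(alpha_i + t, beta_i + n - t), so the
# posterior mean is the weighted average of the component means.
post_mean = np.sum(p_star * (alpha + t) / (alpha + beta + n))
print(p_star, post_mean)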

Example: Suppose λ ~ π and, conditional on λ, X₁, ..., Xₙ are iid Poisson(λ). What is the family of conjugate priors in this situation? Examine the likelihood function and see what priors fit well with it: multiplying likelihood times prior should produce something in the same family (ignoring constants).

L(λ | x) = ∏_{i=1}^n λ^{xᵢ} e^{−λ} / xᵢ! = λ^{∑ xᵢ} e^{−nλ} / ∏ xᵢ! ∝ λᵗ e^{−nλ}, where t = ∑_{i=1}^n xᵢ,

which has the general form λᵃ e^{−bλ}, the kernel of a gamma density (as a function of λ).

Textbook parameterization of the Gamma density:

f(x | α, β) = x^{α−1} e^{−x/β} / (β^α Γ(α)) ∝ x^{α−1} e^{−x/β}.

Another common parameterization (call it Gamma(a, b)) substitutes b = 1/β:

f(x | a, b) = bᵃ x^{a−1} e^{−bx} / Γ(a) ∝ x^{a−1} e^{−bx}.

(Substitute x → λ in both pdfs.)

Suppose the prior π is Gamma(a, b). (Using Gamma(a, b) leads to a somewhat simpler updating rule.) Then

π(λ | x) ∝ L(λ | x) π(λ) ∝ λᵗ e^{−nλ} · λ^{a−1} e^{−bλ} = λ^{(a+t)−1} e^{−(b+n)λ} ∝ Gamma(a + t, b + n) pdf,

so that the posterior distribution is Gamma(a + t, b + n). The Gamma family is closed under sampling and forms a conjugate family.
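A minimal Python sketch of this conjugate update (not from the notes; the data and prior values are made up): given Poisson counts and a Gamma(a, b) prior in the rate parameterization, the posterior is Gamma(a + t, b + n).

import numpy as np

x = np.array([3, 5, 2, 4, 6])   # hypothetical Poisson counts
n, t = len(x), x.sum()

a, b = 2.0, 1.0                 # hypothetical Gamma(a, b) prior (rate parameterization)

a_post, b_post = a + t, b + n   # conjugate update: posterior is Gamma(a + t, b + n)

post_mean = a_post / b_post     # Gamma(a, b) has mean a/b
post_var  = a_post / b_post**2  # and variance a/b^2
print(a_post, b_post, post_mean, post_var)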

Bayesian Estimation and Sufficient Statistics

Suppose T(X) is a sufficient statistic for θ.

Fact: The posterior distribution depends on the data X only through the sufficient statistic T(X).

Corollary: E(θ | X) and Var(θ | X) (and any other posterior quantity) depend on the data X only through T(X).

Proof: By the factorization criterion, there exist g and h such that f(x | θ) = g(T(x), θ) h(x) for all x and θ. Thus

π(θ | x) ∝ f(x | θ) π(θ) = g(T(x), θ) h(x) π(θ) ∝ g(T(x), θ) π(θ),

so that

π(θ | x) = g(T(x), θ) π(θ) / ∫_Θ g(T(x), θ′) π(θ′) dθ′,

which depends on x only through T(x).
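The Fact can be checked numerically. The following Python sketch (an illustration, not part of the notes; the prior and the datasets are arbitrary choices) computes the posterior mean for Bernoulli(θ) data under an arbitrary prior by numerical integration; two datasets with the same value of the sufficient statistic t = ∑ xᵢ give identical answers.

import numpy as np
from scipy.integrate import quad

def prior(theta):
    # An arbitrary (unnormalized) prior on (0, 1), chosen for illustration.
    return np.exp(-10 * (theta - 0.4)**2)

def post_mean(x):
    # E(theta | x) = ∫ theta f(x|theta) pi(theta) dtheta / ∫ f(x|theta) pi(theta) dtheta
    t, n = sum(x), len(x)
    lik = lambda th: th**t * (1 - th)**(n - t)
    num, _ = quad(lambda th: th * lik(th) * prior(th), 0, 1)
    den, _ = quad(lambda th: lik(th) * prior(th), 0, 1)
    return num / den

x1 = [1, 1, 0, 1, 0, 0, 1, 0]   # t = 4, n = 8
x2 = [0, 0, 1, 1, 1, 0, 1, 0]   # same t and n, different ordering
print(post_mean(x1), post_mean(x2))   # identical, as the Fact predicts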

A Simple Bayesian Hierarchical Model

Situation: 10 coins are sampled from a population of coins. Each coin is tossed 20 times. We desire to estimate p₁, p₂, ..., p₁₀, the probability of heads for each coin. Let Xᵢ be the number of heads in 20 tosses for coin i. A Bayesian can incorporate prior knowledge about the similarity of the coins into the prior distribution.

Bayesian Model: (i = 1, ..., 10 throughout)

Xᵢ ~ Binomial(pᵢ, 20)
ηᵢ ~ Normal(µ, 1/τ), where pᵢ = e^{ηᵢ} / (1 + e^{ηᵢ}) and ηᵢ = log(pᵢ / (1 − pᵢ))
µ ~ Normal(0, 1), τ ~ Gamma(10, b)

The r.v.'s at each level are conditionally independent given the r.v.'s at lower levels. Write X = (X₁, X₂, ..., X₁₀) and θ = (η₁, η₂, ..., η₁₀, µ, τ). The posterior is

π(θ | X) ∝ ∏_{i=1}^{10} (20 choose Xᵢ) pᵢ^{Xᵢ} (1 − pᵢ)^{20−Xᵢ} · ∏_{i=1}^{10} g(ηᵢ | µ, τ) · h(µ) · k(τ),

where g(· | µ, τ) is the N(µ, 1/τ) density, h(·) is the N(0, 1) density, and k(·) is the Gamma(10, b) density.

Note: Gamma(a, b) has mean a/b and variance a/b².
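Before turning to the BUGS implementation below, here is a minimal Python sketch (not part of the notes) that evaluates the unnormalized log posterior written above; the value b = 2.0 is one of the prior choices used in the runs below, and the evaluation point is arbitrary.

import numpy as np
from scipy import stats

x = np.array([14, 8, 11, 14, 11, 11, 8, 10, 12, 8])   # heads out of 20 per coin
m = 20                                                 # tosses per coin
b = 2.0                                                # Gamma(10, b) prior on tau (one choice from the runs below)

def log_posterior(eta, mu, tau):
    # Unnormalized log posterior for theta = (eta_1, ..., eta_10, mu, tau).
    p = 1.0 / (1.0 + np.exp(-eta))                     # p_i = e^{eta_i} / (1 + e^{eta_i})
    ll = np.sum(stats.binom.logpmf(x, m, p))           # X_i ~ Binomial(p_i, 20)
    lp = np.sum(stats.norm.logpdf(eta, mu, 1/np.sqrt(tau)))  # eta_i ~ N(mu, 1/tau)
    lp += stats.norm.logpdf(mu, 0, 1)                  # mu ~ N(0, 1)
    lp += stats.gamma.logpdf(tau, 10, scale=1/b)       # tau ~ Gamma(10, b), b a rate
    return ll + lp

print(log_posterior(np.zeros(10), 0.0, 5.0))           # arbitrary evaluation point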

BUGS code describing the model:

model {
  for(i in 1:N) {
    x[i] ~ dbin(p[i], 20)
    logit(p[i]) <- eta[i]
    eta[i] ~ dnorm(mu, tau)
  }
  mu ~ dnorm(0.0, 1.0)
  tau ~ dgamma(10.0, xxx)   # Changing the scale only.
}

Data: list(N=10, x=c(14,8,11,14,11,11,8,10,12,8))
Inits: list(mu=0, tau=1, eta=c(0,0,0,0,0,0,0,0,0,0))

--------------------------------------------------

Story: There are 10 coins. Each is tossed 20 times. The data x[1] = 14, x[2] = 8, ..., x[9] = 12, x[10] = 8 are the numbers of heads observed on each coin.

Goal: Estimate p[1], ..., p[10], the probability of heads for each coin.

--------------------------------------------------

Frequentist Answers
-------------------

Assuming all p[i] are equal leads to:
  phat = sum(x)/200 = 0.535
  std. dev. of phat = sqrt(.535*(1-.535)/200) = 0.0353

Estimating each p[i] separately leads to:
  p[1]hat = 14/20 = 0.7
  p[2]hat = 8/20 = 0.4
  std. dev. of p[1]hat = sqrt(.7*(1-.7)/20) = 0.1025
  std. dev. of p[2]hat = sqrt(.4*(1-.4)/20) = 0.1095

Compare with the answers below.
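The frequentist calculations above can be reproduced with a few lines of Python (an illustrative check, not part of the notes):

import numpy as np

x = np.array([14, 8, 11, 14, 11, 11, 8, 10, 12, 8])
m = 20                                   # tosses per coin

# Pooled estimate (all p[i] assumed equal)
phat = x.sum() / (len(x) * m)
se_pooled = np.sqrt(phat * (1 - phat) / (len(x) * m))
print(phat, se_pooled)                   # 0.535, 0.0353

# Separate estimates
phat_i = x / m
se_i = np.sqrt(phat_i * (1 - phat_i) / m)
print(phat_i[:2], se_i[:2])              # 0.70, 0.40 and 0.1025, 0.1095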

Using: tau ~ dgamma(10.0, 0.001)

node   mean      sd        MC error   2.5%       median     97.5%
mu     0.1316    0.1296    0.008272   -0.1277    0.135      0.3998
tau    10001.3   3194.89   17.9655    4757.75    9670.29    17168.7
p[1]   0.5328    0.0322    0.00205    0.4680     0.5338     0.5986
p[2]   0.5326    0.0322    0.00205    0.4678     0.5336     0.5983

--------------------------------------------------

Using: tau ~ dgamma(10.0, 10000)

node   mean      sd        MC error   2.5%       median     97.5%
mu     -0.0054   0.9929    0.004380   -1.9522    -0.0013    1.9352
tau    0.001497  3.86E-4   1.62E-6    8.377E-4   0.001464   0.002344
p[1]   0.7006    0.09994   4.151E-4   0.4871     0.7077     0.8744
p[2]   0.4008    0.10689   4.637E-4   0.2030     0.3975     0.6181

--------------------------------------------------

Using: tau ~ dgamma(10.0, 2.0)

node   mean      sd        MC error   2.5%       median     97.5%
mu     0.1398    0.196     0.00191    -0.2462    0.1409     0.524
tau    5.512     1.603     0.01041    2.872      5.34       9.101
p[1]   0.6122    0.07851   5.701E-4   0.4547     0.614      0.7604
p[2]   0.4698    0.08052   5.129E-4   0.3135     0.4696     0.6262

Note: Since Gamma(a, b) has mean a/b, the prior mean of τ is 10000 when b = 0.001, 0.001 when b = 10000, and 5 when b = 2.0. A large τ forces the ηᵢ (and hence the pᵢ) to be nearly equal, so the first run gives estimates close to the pooled frequentist answer (0.535); a tiny τ lets the pᵢ vary freely, so the second run essentially reproduces the separate frequentist estimates (0.70 and 0.40); the third run gives partial pooling between these extremes.