Introduction to Bayesian Methods


We develop the Bayesian paradigm for parametric inference. To this end, suppose we conduct (or wish to design) a study in which the parameter θ is of inferential interest. Here θ may be vector valued. For example,

1. θ = difference in treatment means
2. θ = hazard ratio
3. θ = vector of regression coefficients
4. θ = probability a treatment is effective

In parametric inference, we specify a parametric model for the data, indexed by the parameter θ. Letting x denote the data, we denote this model (density) by p(x | θ). The likelihood function of θ is any function proportional to p(x | θ), i.e., L(θ) ∝ p(x | θ).

Example. Suppose x | θ ~ Binomial(N, θ). Then

$$p(x \mid \theta) = \binom{N}{x}\theta^x(1-\theta)^{N-x}, \qquad x = 0, 1, \ldots, N.$$

We can take L(θ) = θ^x (1 − θ)^{N−x}. The parameter θ is unknown. In the Bayesian mind-set, we express our uncertainty about quantities by specifying distributions for them. Thus, we express our uncertainty about θ by specifying a prior distribution for it. We denote the prior density of θ by π(θ). The word "prior" is used to denote that it is the density of θ before the data x are observed. By Bayes theorem, we can construct the distribution of θ | x, which is called the posterior distribution of θ. We denote the posterior distribution of θ by p(θ | x).

By Bayes theorem,

$$p(\theta \mid x) = \frac{p(x \mid \theta)\,\pi(\theta)}{\int_\Theta p(x \mid \theta)\,\pi(\theta)\,d\theta},$$

where Θ denotes the parameter space of θ. The quantity

$$p(x) = \int_\Theta p(x \mid \theta)\,\pi(\theta)\,d\theta$$

is the normalizing constant of the posterior distribution. For most inference problems, p(x) does not have a closed form. Bayesian inference about θ is primarily based on the posterior distribution of θ, p(θ | x).

For example, one can compute various posterior summaries, such as the mean, median, mode, variance, and quantiles. For example, the posterior mean of θ is given by

$$E(\theta \mid x) = \int_\Theta \theta\, p(\theta \mid x)\,d\theta.$$

Example 1. Given θ, suppose x_1, x_2, ..., x_n are i.i.d. Binomial(1, θ), and θ ~ Beta(α, λ). The parameters of the prior distribution are often called the hyperparameters. Let us derive the posterior distribution of θ. Let x = (x_1, x_2, ..., x_n), and thus,

$$p(x \mid \theta) = \prod_{i=1}^n p(x_i \mid \theta) = \prod_{i=1}^n \theta^{x_i}(1-\theta)^{1-x_i} = \theta^{\sum x_i}(1-\theta)^{n-\sum x_i},$$

where $\sum x_i = \sum_{i=1}^n x_i$. Also, the prior density is

$$\pi(\theta) = \frac{\Gamma(\alpha+\lambda)}{\Gamma(\alpha)\Gamma(\lambda)}\,\theta^{\alpha-1}(1-\theta)^{\lambda-1}.$$

Now, we can write the kernel of the posterior density as

$$p(\theta \mid x) \propto \theta^{\sum x_i}\,\theta^{\alpha-1}\,(1-\theta)^{n-\sum x_i}\,(1-\theta)^{\lambda-1} = \theta^{\sum x_i+\alpha-1}(1-\theta)^{n-\sum x_i+\lambda-1}.$$

Thus $p(\theta \mid x) \propto \theta^{\sum x_i+\alpha-1}(1-\theta)^{n-\sum x_i+\lambda-1}$. We can recognize this kernel as a beta kernel with parameters $(\sum x_i + \alpha,\; n - \sum x_i + \lambda)$. Thus,

$$\theta \mid x \sim \mathrm{Beta}\!\left(\sum x_i + \alpha,\; n - \sum x_i + \lambda\right)$$

and therefore

$$p(\theta \mid x) = \frac{\Gamma(\alpha+n+\lambda)}{\Gamma(\sum x_i+\alpha)\,\Gamma(n-\sum x_i+\lambda)}\,\theta^{\sum x_i+\alpha-1}(1-\theta)^{n-\sum x_i+\lambda-1}.$$
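To make this concrete, here is a minimal sketch of the Beta update in Python; the data vector and the hyperparameters α = λ = 2 below are hypothetical, chosen only for illustration.

```python
import numpy as np
from scipy import stats

# Hypothetical Bernoulli (Binomial(1, theta)) data and Beta(alpha, lam) prior
x = np.array([1, 0, 1, 1, 0, 1, 1, 0, 1, 1])
alpha, lam = 2.0, 2.0

n, s = len(x), x.sum()
# theta | x ~ Beta(sum x_i + alpha, n - sum x_i + lam)
posterior = stats.beta(s + alpha, n - s + lam)

print("posterior mean:", posterior.mean())                      # E(theta | x)
print("95% equal-tail credible interval:", posterior.interval(0.95))
```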

Remark. In deriving posterior densities, an often-used technique is to try to recognize the kernel of the posterior density of θ. This avoids direct computation of p(x) and saves a lot of time in derivations. If the kernel cannot be recognized, then p(x) must be computed directly. In this example we have

$$p(x) = p(x_1, \ldots, x_n) \propto \int_0^1 \theta^{\sum x_i+\alpha-1}(1-\theta)^{n-\sum x_i+\lambda-1}\,d\theta = \frac{\Gamma(\sum x_i+\alpha)\,\Gamma(n-\sum x_i+\lambda)}{\Gamma(\alpha+n+\lambda)}.$$
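If one did want p(x) by brute force here, the closed form can be checked numerically; a small sketch (again with made-up data and hyperparameters), assuming scipy is available:

```python
import numpy as np
from scipy import integrate, special

# Hypothetical Bernoulli data and Beta(alpha, lam) prior hyperparameters
x = np.array([1, 0, 1, 1, 0, 1, 1, 0, 1, 1])
alpha, lam = 2.0, 2.0
n, s = len(x), int(x.sum())

# Brute-force p(x): integrate the likelihood times the prior density over (0, 1)
def integrand(t):
    likelihood = t**s * (1 - t)**(n - s)
    prior = t**(alpha - 1) * (1 - t)**(lam - 1) / special.beta(alpha, lam)
    return likelihood * prior

p_x_numeric, _ = integrate.quad(integrand, 0.0, 1.0)

# Closed form from above: B(sum x_i + alpha, n - sum x_i + lam) / B(alpha, lam)
p_x_closed = special.beta(s + alpha, n - s + lam) / special.beta(alpha, lam)

print(p_x_numeric, p_x_closed)  # should agree to quadrature accuracy
```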

Thus,

$$p(x_1, \ldots, x_n) = \frac{\Gamma(\alpha+\lambda)}{\Gamma(\alpha)\Gamma(\lambda)}\cdot\frac{\Gamma(\sum x_i+\alpha)\,\Gamma(n-\sum x_i+\lambda)}{\Gamma(\alpha+n+\lambda)}$$

for x_i = 0, 1 and i = 1, ..., n.

Suppose A_1, A_2, ... are events such that A_i ∩ A_j = ∅ for i ≠ j and ∪_{i=1}^∞ A_i = Ω, where Ω denotes the sample space. Let B denote an event in Ω. Then Bayes theorem for events can be written as

$$P(A_i \mid B) = \frac{P(B \mid A_i)\,P(A_i)}{\sum_{j=1}^\infty P(B \mid A_j)\,P(A_j)}.$$

P(A_i) is the prior probability of A_i, and P(A_i | B) is the posterior probability of A_i given that B has occurred.

Example 2. Bayes theorem is often used in diagnostic tests for cancer. A young person was diagnosed as having a type of cancer that occurs extremely rarely in young people. Naturally, he was very upset. A friend told him that it was probably a mistake. His friend reasoned as follows. No medical test is perfect: there are always incidences of false positives and false negatives.

Let C stand for the event that he has cancer, and let + stand for the event that an individual responds positively to the test. Assume P(C) = 1/1,000,000 = 10^{-6} and P(+ | C^c) = 0.01. (So only one per million people his age has the disease, and the test is extremely good relative to most medical tests, giving only 1% false positives and 1% false negatives.) Find the probability that he has cancer given that he has a positive response. (After you make this calculation you will not be surprised to learn that he did not have cancer.)

$$P(C \mid +) = \frac{P(+ \mid C)\,P(C)}{P(+ \mid C)\,P(C) + P(+ \mid C^c)\,P(C^c)} = \frac{(.99)(10^{-6})}{(.99)(10^{-6}) + (.01)(.999999)} = \frac{.00000099}{.01000098} = .00009899.$$
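The arithmetic is easy to verify directly; a tiny sketch using the numbers stated above:

```python
# Bayes theorem for the diagnostic-test example, using the numbers above
p_cancer = 1e-6               # P(C)
p_pos_given_cancer = 0.99     # P(+ | C): 1% false negatives
p_pos_given_healthy = 0.01    # P(+ | C^c): 1% false positives

numerator = p_pos_given_cancer * p_cancer
denominator = numerator + p_pos_given_healthy * (1 - p_cancer)
print(numerator / denominator)  # about 0.000099
```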

Example 3. Suppose x_1, ..., x_n is a random sample from N(µ, σ²).

i) Suppose σ² is known and µ ~ N(µ_0, σ_0²). The posterior density of µ is given by

$$p(\mu \mid x) \propto \left[\prod_{i=1}^n p(x_i \mid \mu, \sigma^2)\right]\pi(\mu) \propto \exp\!\left\{-\frac{1}{2\sigma^2}\sum (x_i-\mu)^2\right\}\exp\!\left\{-\frac{1}{2\sigma_0^2}(\mu-\mu_0)^2\right\}$$

$$\propto \exp\!\left\{-\frac{1}{2}\left[\left(\frac{n\sigma_0^2+\sigma^2}{\sigma_0^2\sigma^2}\right)\mu^2 - 2\mu\left(\frac{\sigma_0^2\sum x_i + \mu_0\sigma^2}{\sigma_0^2\sigma^2}\right)\right]\right\}$$
$$= \exp\!\left\{-\frac{1}{2}\left(\frac{n\sigma_0^2+\sigma^2}{\sigma_0^2\sigma^2}\right)\left[\mu^2 - 2\mu\left(\frac{\sigma_0^2\sum x_i + \mu_0\sigma^2}{n\sigma_0^2+\sigma^2}\right)\right]\right\}$$
$$\propto \exp\!\left\{-\frac{1}{2}\left(\frac{n\sigma_0^2+\sigma^2}{\sigma_0^2\sigma^2}\right)\left[\mu - \frac{\sigma_0^2\sum x_i + \mu_0\sigma^2}{n\sigma_0^2+\sigma^2}\right]^2\right\}$$

We can recognize this as a normal kernel with mean

$$\mu_{\text{post}} = \frac{\sigma_0^2\sum x_i + \mu_0\sigma^2}{n\sigma_0^2+\sigma^2}$$

and variance

$$\sigma^2_{\text{post}} = \left(\frac{n\sigma_0^2+\sigma^2}{\sigma_0^2\sigma^2}\right)^{-1} = \frac{\sigma_0^2\sigma^2}{n\sigma_0^2+\sigma^2}.$$

Thus

$$\mu \mid x \sim N\!\left(\frac{\sigma_0^2\sum x_i + \mu_0\sigma^2}{n\sigma_0^2+\sigma^2},\; \frac{\sigma_0^2\sigma^2}{n\sigma_0^2+\sigma^2}\right).$$
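A minimal sketch of this update, with a hypothetical data vector and made-up values for σ², µ_0, and σ_0²:

```python
import numpy as np

# Hypothetical data; sigma^2 known; prior mu ~ N(mu0, sigma0^2)
x = np.array([4.9, 5.6, 5.1, 4.4, 5.3])
sigma2 = 1.0
mu0, sigma02 = 0.0, 10.0

n = len(x)
mu_post = (sigma02 * x.sum() + mu0 * sigma2) / (n * sigma02 + sigma2)
var_post = (sigma02 * sigma2) / (n * sigma02 + sigma2)

print("posterior mean:", mu_post)
print("posterior variance:", var_post)
```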

ii) Suppose µ is known and σ² is unknown. Let τ = 1/σ²; τ is often called the precision parameter. Suppose τ ~ gamma(δ_0/2, γ_0/2), so that

$$\pi(\tau) \propto \tau^{\delta_0/2 - 1}\exp\!\left(-\frac{\tau\gamma_0}{2}\right).$$

Let us derive the posterior distribution of τ:

$$p(\tau \mid x) \propto \tau^{n/2}\exp\!\left\{-\frac{\tau}{2}\sum (x_i-\mu)^2\right\}\tau^{\delta_0/2-1}\exp\!\left\{-\frac{\tau\gamma_0}{2}\right\} \propto \tau^{\frac{n+\delta_0}{2}-1}\exp\!\left\{-\frac{\tau}{2}\left(\gamma_0 + \sum (x_i-\mu)^2\right)\right\}.$$

Thus

$$\tau \mid x \sim \mathrm{gamma}\!\left(\frac{n+\delta_0}{2},\; \frac{\gamma_0 + \sum (x_i-\mu)^2}{2}\right).$$
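A corresponding sketch for case ii), with hypothetical data, a known mean, and made-up hyperparameters δ_0 and γ_0 (scipy's gamma distribution is parameterized by shape and scale = 1/rate):

```python
import numpy as np
from scipy import stats

# Hypothetical data with known mean mu; prior tau ~ gamma(delta0/2, gamma0/2)
x = np.array([4.9, 5.6, 5.1, 4.4, 5.3])
mu = 5.0
delta0, gamma0 = 2.0, 2.0

n = len(x)
shape = (n + delta0) / 2.0
rate = (gamma0 + np.sum((x - mu) ** 2)) / 2.0

# scipy's gamma uses shape and scale = 1/rate
tau_post = stats.gamma(a=shape, scale=1.0 / rate)
print("posterior mean of tau:", tau_post.mean())
print("95% credible interval for tau:", tau_post.interval(0.95))
```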

iii) Now suppose µ and σ² are both unknown. Suppose we specify the joint prior

$$\pi(\mu, \tau) = \pi(\mu \mid \tau)\,\pi(\tau),$$

where

$$\mu \mid \tau \sim N(\mu_0,\; \tau^{-1}\sigma_0^2), \qquad \tau \sim \mathrm{gamma}\!\left(\frac{\delta_0}{2},\; \frac{\gamma_0}{2}\right).$$

The joint posterior density of (µ, τ) is given by

$$p(\mu, \tau \mid x) \propto \left(\tau^{n/2}\exp\!\left\{-\frac{\tau}{2}\sum (x_i-\mu)^2\right\}\right)\left(\tau^{1/2}\exp\!\left\{-\frac{\tau}{2\sigma_0^2}(\mu-\mu_0)^2\right\}\right)\left(\tau^{\delta_0/2-1}\exp\!\left\{-\frac{\tau\gamma_0}{2}\right\}\right)$$
$$= \tau^{\frac{n+\delta_0+1}{2}-1}\exp\!\left\{-\frac{\tau}{2}\left(\gamma_0 + \frac{(\mu-\mu_0)^2}{\sigma_0^2} + \sum (x_i-\mu)^2\right)\right\}.$$

The joint posterior does not have a clear recognizable form. Thus, we need to compute p(x) by brute force.

Up to the multiplicative constants dropped above,

$$p(x) \propto \int_0^\infty\!\!\int_{-\infty}^\infty \tau^{\frac{n+\delta_0+1}{2}-1}\exp\!\left\{-\frac{\tau}{2}\left(\gamma_0 + \frac{(\mu-\mu_0)^2}{\sigma_0^2} + \sum (x_i-\mu)^2\right)\right\} d\mu\, d\tau$$
$$= \int_0^\infty\!\!\int_{-\infty}^\infty \tau^{\frac{n+\delta_0+1}{2}-1}\exp\!\left\{-\frac{\tau}{2}\left(\gamma_0 + \mu^2(n + 1/\sigma_0^2) - 2\mu\left(\textstyle\sum x_i + \mu_0/\sigma_0^2\right) + \mu_0^2/\sigma_0^2 + \textstyle\sum x_i^2\right)\right\} d\mu\, d\tau$$
$$= \int_0^\infty \tau^{\frac{n+\delta_0+1}{2}-1}\exp\!\left\{-\frac{\tau}{2}\left(\gamma_0 + \mu_0^2/\sigma_0^2 + \sum x_i^2\right)\right\}\left[\int_{-\infty}^\infty \exp\!\left\{-\frac{\tau}{2}\left(\mu^2(n+1/\sigma_0^2) - 2\mu\left(\textstyle\sum x_i + \mu_0/\sigma_0^2\right)\right)\right\} d\mu\right] d\tau.$$

The integral with respect to µ can be evaluated by completing the square:

$$\int_{-\infty}^\infty \exp\!\left\{-\frac{\tau(n+1/\sigma_0^2)}{2}\left[\mu - \frac{\sum x_i + \mu_0/\sigma_0^2}{n+1/\sigma_0^2}\right]^2\right\}\exp\!\left\{\frac{\tau}{2}\cdot\frac{\left(\sum x_i + \mu_0/\sigma_0^2\right)^2}{n+1/\sigma_0^2}\right\} d\mu$$
$$= \exp\!\left\{\frac{\tau}{2}\cdot\frac{\left(\sum x_i + \mu_0/\sigma_0^2\right)^2}{n+1/\sigma_0^2}\right\}(2\pi)^{1/2}\,\tau^{-1/2}\,(n+1/\sigma_0^2)^{-1/2}.$$

Now we need to evaluate

$$\int_0^\infty (2\pi)^{1/2}(n+1/\sigma_0^2)^{-1/2}\,\tau^{-1/2}\,\tau^{\frac{n+\delta_0+1}{2}-1}\exp\!\left\{-\frac{\tau}{2}\left[\gamma_0 + \mu_0^2/\sigma_0^2 + \sum x_i^2\right]\right\}\exp\!\left\{\frac{\tau}{2}\cdot\frac{\left(\sum x_i + \mu_0/\sigma_0^2\right)^2}{n+1/\sigma_0^2}\right\} d\tau$$
$$= (2\pi)^{1/2}(n+1/\sigma_0^2)^{-1/2}\int_0^\infty \tau^{\frac{n+\delta_0}{2}-1}\exp\!\left\{-\frac{\tau}{2}\left[\gamma_0 + \mu_0^2/\sigma_0^2 + \sum x_i^2 - \frac{\left(\sum x_i + \mu_0/\sigma_0^2\right)^2}{n+1/\sigma_0^2}\right]\right\} d\tau$$

$$= (2\pi)^{1/2}\,\Gamma\!\left(\frac{n+\delta_0}{2}\right)(n+1/\sigma_0^2)^{-1/2}\left[\frac{1}{2}\left(\gamma_0 + \mu_0^2/\sigma_0^2 + \sum x_i^2 - \frac{\left(\sum x_i + \mu_0/\sigma_0^2\right)^2}{n+1/\sigma_0^2}\right)\right]^{-\frac{n+\delta_0}{2}}$$
$$= (2\pi)^{1/2}\,\Gamma\!\left(\frac{n+\delta_0}{2}\right)2^{\frac{n+\delta_0}{2}}(n+1/\sigma_0^2)^{-1/2}\left[\gamma_0 + \mu_0^2/\sigma_0^2 + \sum x_i^2 - \frac{\left(\sum x_i + \mu_0/\sigma_0^2\right)^2}{n+1/\sigma_0^2}\right]^{-\frac{n+\delta_0}{2}} \equiv p^*(x).$$

Thus,

$$p(x) = (2\pi)^{-(n+1)/2}\,\sigma_0^{-1}\,\frac{(\gamma_0/2)^{\delta_0/2}}{\Gamma(\delta_0/2)}\,p^*(x),$$

where the leading factor collects the normalizing constants of the likelihood and the prior that were dropped above.

The joint posterior density of (µ, τ) given x can also be obtained in this case by deriving p(µ, τ | x) = p(µ | τ, x) p(τ | x). Exercise: find p(µ | τ, x) and p(τ | x). It is of great interest to find the marginal posterior distributions of µ and τ.

$$p(\mu \mid x) = \int_0^\infty p(\mu, \tau \mid x)\, d\tau \propto \int_0^\infty \tau^{\frac{n+\delta_0+1}{2}-1}\exp\!\left\{-\frac{\tau}{2}\left[\gamma_0 + \mu_0^2/\sigma_0^2 + \sum x_i^2\right]\right\}\exp\!\left\{-\frac{\tau}{2}\left[\mu^2(n+1/\sigma_0^2) - 2\mu\left(\textstyle\sum x_i + \mu_0/\sigma_0^2\right)\right]\right\} d\tau$$

$$= \int_0^\infty \tau^{\frac{n+\delta_0+1}{2}-1}\exp\!\left\{-\frac{\tau}{2}\left[\gamma_0 + \mu_0^2/\sigma_0^2 + \sum x_i^2\right]\right\}\exp\!\left\{-\frac{\tau(n+1/\sigma_0^2)}{2}\left[\mu - \frac{\sum x_i + \mu_0/\sigma_0^2}{n+1/\sigma_0^2}\right]^2\right\}\exp\!\left\{\frac{\tau}{2}\cdot\frac{\left(\sum x_i + \mu_0/\sigma_0^2\right)^2}{n+1/\sigma_0^2}\right\} d\tau.$$

Let $a = \dfrac{\sum x_i + \mu_0/\sigma_0^2}{n+1/\sigma_0^2}$. Then, we can write the integral as

$$= \int_0^\infty \tau^{\frac{n+\delta_0+1}{2}-1}\exp\!\left\{-\frac{\tau}{2}\left[\gamma_0 + \mu_0^2/\sigma_0^2 + \sum x_i^2 + (n+1/\sigma_0^2)(\mu-a)^2 - (n+1/\sigma_0^2)a^2\right]\right\} d\tau$$
$$= \Gamma\!\left(\frac{n+\delta_0+1}{2}\right)\left\{\frac{1}{2}\left[\gamma_0 + \mu_0^2/\sigma_0^2 + \sum x_i^2 + (n+1/\sigma_0^2)(\mu-a)^2 - (n+1/\sigma_0^2)a^2\right]\right\}^{-\frac{n+\delta_0+1}{2}}$$
$$\propto \left[1 + \frac{c(\mu-a)^2}{b - ca^2}\right]^{-\frac{n+\delta_0+1}{2}},$$

where $c = n + 1/\sigma_0^2$ and $b = \gamma_0 + \mu_0^2/\sigma_0^2 + \sum x_i^2$. We recognize this kernel as that of a t-distribution with location parameter a, dispersion parameter $\left(\frac{(n+\delta_0)\,c}{b - ca^2}\right)^{-1}$, and $n+\delta_0$ degrees of freedom.

Definition. Let y = (y_1, ..., y_p)′ be a p × 1 random vector. Then y is said to have a p-dimensional multivariate t distribution with d degrees of freedom, location parameter m, and p × p dispersion matrix Σ if y has density

$$p(y) = \frac{\Gamma\!\left(\frac{d+p}{2}\right)}{\Gamma\!\left(\frac{d}{2}\right)(\pi d)^{p/2}\,|\Sigma|^{1/2}}\left[1 + \frac{1}{d}(y-m)'\Sigma^{-1}(y-m)\right]^{-\frac{d+p}{2}}.$$

We write this as y ~ S_p(d, m, Σ). In our problem, p = 1, d = n + δ_0, m = a, $\Sigma^{-1} = \frac{(n+\delta_0)c}{b - ca^2}$, and $\Sigma = \left(\frac{(n+\delta_0)c}{b-ca^2}\right)^{-1}$.
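As a sanity check on this definition, the density can be coded directly and compared with scipy's multivariate t; the values of y, m, Σ, and d below are arbitrary, chosen only for illustration.

```python
import numpy as np
from scipy import stats
from scipy.special import gammaln

# Hand-coded S_p(d, m, Sigma) log-density from the definition above
def mvt_logpdf(y, d, m, Sigma):
    y, m = np.atleast_1d(y), np.atleast_1d(m)
    Sigma = np.atleast_2d(Sigma)
    p = len(y)
    quad = (y - m) @ np.linalg.solve(Sigma, y - m)
    return (gammaln((d + p) / 2) - gammaln(d / 2)
            - (p / 2) * np.log(np.pi * d) - 0.5 * np.linalg.slogdet(Sigma)[1]
            - ((d + p) / 2) * np.log1p(quad / d))

# Compare with scipy's multivariate t (requires a reasonably recent scipy)
y = np.array([0.3, -1.2])
m = np.zeros(2)
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
d = 5

print(np.exp(mvt_logpdf(y, d, m, Sigma)))
print(stats.multivariate_t.pdf(y, loc=m, shape=Sigma, df=d))  # should match
```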

The marginal posterior distribution of τ is given by

$$p(\tau \mid x) \propto \int_{-\infty}^\infty \tau^{\frac{n+\delta_0+1}{2}-1}\exp\!\left\{-\frac{\tau}{2}\left[\gamma_0 + \mu_0^2/\sigma_0^2 + \sum x_i^2\right]\right\}\exp\!\left\{\frac{\tau}{2}(n+1/\sigma_0^2)a^2\right\}\exp\!\left\{-\frac{\tau(n+1/\sigma_0^2)}{2}(\mu-a)^2\right\} d\mu$$
$$\propto \tau^{\frac{n+\delta_0+1}{2}-1}\,\tau^{-1/2}\exp\!\left\{-\frac{\tau}{2}\left[\gamma_0 + \mu_0^2/\sigma_0^2 + \sum x_i^2 - (n+1/\sigma_0^2)a^2\right]\right\}$$
$$= \tau^{\frac{n+\delta_0}{2}-1}\exp\!\left\{-\frac{\tau}{2}\left[\gamma_0 + \mu_0^2/\sigma_0^2 + \sum x_i^2 - (n+1/\sigma_0^2)a^2\right]\right\}.$$

Thus,

$$\tau \mid x \sim \mathrm{gamma}\!\left(\frac{n+\delta_0}{2},\; \frac{1}{2}\left[\gamma_0 + \mu_0^2/\sigma_0^2 + \sum x_i^2 - (n+1/\sigma_0^2)a^2\right]\right).$$
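Putting the pieces of Example 3 iii) together, here is a sketch that computes a, b, c and the resulting marginal posteriors of µ and τ; the data and hyperparameters are hypothetical.

```python
import numpy as np
from scipy import stats

# Hypothetical data and hyperparameters for case iii): mu and tau both unknown,
# mu | tau ~ N(mu0, sigma0^2 / tau), tau ~ gamma(delta0/2, gamma0/2)
x = np.array([4.9, 5.6, 5.1, 4.4, 5.3])
mu0, sigma02 = 0.0, 100.0
delta0, gamma0 = 2.0, 2.0

n = len(x)
c = n + 1.0 / sigma02                          # c = n + 1/sigma0^2
a = (x.sum() + mu0 / sigma02) / c              # posterior location a
b = gamma0 + mu0**2 / sigma02 + np.sum(x**2)   # b = gamma0 + mu0^2/sigma0^2 + sum x_i^2
df = n + delta0

# mu | x ~ t with df degrees of freedom, location a, dispersion (b - c*a^2)/(df*c)
scale = np.sqrt((b - c * a**2) / (df * c))
mu_interval = stats.t(df=df, loc=a, scale=scale).interval(0.95)

# tau | x ~ gamma(df/2, rate (b - c*a^2)/2); scipy uses scale = 1/rate
tau_post = stats.gamma(a=df / 2.0, scale=2.0 / (b - c * a**2))

print("95% credible interval for mu:", mu_interval)
print("posterior mean of tau:", tau_post.mean())
```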

Remark. A t distribution can be obtained as a scale mixture of normals. That is, if x | τ ~ N_p(m, τ^{-1}Σ) and τ ~ gamma(δ_0/2, γ_0/2), then

$$p(x) = \int_0^\infty p(x \mid \tau)\,\pi(\tau)\,d\tau$$

is the $S_p\!\left(\delta_0, m, \frac{\gamma_0}{\delta_0}\Sigma\right)$ density. That is, $x \sim S_p\!\left(\delta_0, m, \frac{\gamma_0}{\delta_0}\Sigma\right)$. Note:

$$p(x \mid \tau) = (2\pi)^{-p/2}\,\tau^{p/2}\,|\Sigma|^{-1/2}\exp\!\left\{-\frac{\tau}{2}(x-m)'\Sigma^{-1}(x-m)\right\}.$$
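A quick simulation check of this remark in the scalar case (p = 1), with arbitrary values of δ_0, γ_0, m, and Σ: drawing τ from the gamma prior and then x from N(m, Σ/τ) should reproduce the quantiles of the stated t distribution.

```python
import numpy as np

rng = np.random.default_rng(0)

# Scale mixture: tau ~ gamma(delta0/2, rate gamma0/2), then x | tau ~ N(m, Sigma/tau)
delta0, gamma0, m, Sigma = 5.0, 3.0, 0.0, 2.0   # illustrative scalar case (p = 1)
tau = rng.gamma(shape=delta0 / 2.0, scale=2.0 / gamma0, size=100_000)
x = rng.normal(loc=m, scale=np.sqrt(Sigma / tau))

# Direct draws from the claimed marginal S_1(delta0, m, (gamma0/delta0)*Sigma):
# a standard t with delta0 df, rescaled by sqrt((gamma0/delta0)*Sigma) and shifted by m
t_direct = m + np.sqrt((gamma0 / delta0) * Sigma) * rng.standard_t(df=delta0, size=100_000)

# The two samples should have essentially the same quantiles
print(np.percentile(x, [5, 25, 50, 75, 95]))
print(np.percentile(t_direct, [5, 25, 50, 75, 95]))
```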

Remark. Note that in Examples 1 and 3 i), ii), the posterior distribution is of the same family as the prior distribution. When the posterior distribution of a parameter is of the same family as the prior distribution, such prior distributions are called conjugate prior distributions. In Example 1, a Beta prior on θ led to a Beta posterior for θ. In Example 3 i), a normal prior for µ yielded a normal posterior for µ. In Example 3 ii), a gamma prior for τ yielded a gamma posterior for τ. More on conjugate priors later.

Advantages of Bayesian Methods

1. Interpretation. Having a distribution for your unknown parameter θ is easier to understand than a point estimate and a standard error. In addition, consider the following example of a confidence interval. A 95% confidence interval for a population mean θ can be written as $\bar{x} \pm 1.96\, s/\sqrt{n}$. Thus P(a < θ < b) ≈ 0.95, where (a, b) denotes this interval.

We have to rely on a repeated-sampling interpretation to make a probability statement such as the one above. Thus, after observing the data, we cannot make a statement like "the true θ has a 95% chance of falling in $\bar{x} \pm 1.96\, s/\sqrt{n}$," although we are tempted to say this.

2. Bayes Inference Obeys the Likelihood Principle. The likelihood principle: if two distinct sampling plans (designs) yield proportional likelihood functions for θ, then inference about θ should be identical under these two designs. Frequentist inference does not obey the likelihood principle, in general.

Example. Suppose that in 12 independent tosses of a coin, 9 heads and 3 tails are observed. We wish to test the null hypothesis H_0: θ = 1/2 vs. H_1: θ > 1/2, where θ is the true probability of heads.

Consider the following choices for the likelihood function:

a) Binomial: n = 12 (fixed), x = number of heads, x ~ Binomial(12, θ), and the likelihood is

$$L_1(\theta) = \binom{n}{x}\theta^x(1-\theta)^{n-x} = \binom{12}{9}\theta^9(1-\theta)^3.$$

b) Negative binomial: n is not fixed; flip until the third tail appears. Here x is the number of heads observed when the experiment is complete, x ~ NegBinomial(r = 3, θ).

$$L_2(\theta) = \binom{r+x-1}{x}\theta^x(1-\theta)^r = \binom{11}{9}\theta^9(1-\theta)^3.$$

Note that L_1(θ) ∝ L_2(θ). From a Bayesian perspective, the posterior distribution of θ is the same under either design. That is,

$$p(\theta \mid x) = \frac{L_1(\theta)\pi(\theta)}{\int L_1(\theta)\pi(\theta)\,d\theta} = \frac{L_2(\theta)\pi(\theta)}{\int L_2(\theta)\pi(\theta)\,d\theta}.$$

However, under the frequentist paradigm, inferences about θ are quite different under the two designs. The p-value based on the binomial likelihood is

$$P(x \geq 9 \mid \theta = 1/2) = \sum_{j=9}^{12}\binom{12}{j}\theta^j(1-\theta)^{12-j} \approx 0.073,$$

while for the negative binomial likelihood, the p-value is

$$P(x \geq 9 \mid \theta = 1/2) = \sum_{j=9}^{\infty}\binom{2+j}{j}\theta^j(1-\theta)^3 \approx 0.033.$$

The two designs lead to different decisions at the 0.05 level: rejecting H_0 under design b) and not under design a).
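These two tail probabilities are easy to reproduce; a short sketch using scipy (where a "tail" is treated as the negative binomial's success, so x counts heads before the third tail):

```python
from scipy import stats

# P-value under the binomial design: 12 tosses fixed, observe 9 heads
p_binom = stats.binom.sf(8, n=12, p=0.5)    # P(X >= 9 | theta = 1/2)

# P-value under the negative binomial design: toss until the 3rd tail;
# scipy's nbinom counts "failures" (here heads) before the n-th "success" (tail)
p_nbinom = stats.nbinom.sf(8, n=3, p=0.5)   # P(X >= 9 | theta = 1/2)

print(p_binom, p_nbinom)   # roughly 0.073 and 0.033: different decisions at the 5% level
```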

3. Bayesian Inference Does Not Lead to Absurd Results. Absurd results can be obtained when doing UMVUE estimation. Suppose x ~ Poisson(λ), and we want to estimate θ = e^{−2λ}, 0 < θ < 1. It can be shown that the UMVUE of θ is (−1)^x. Thus, if x is even the UMVUE of θ is 1, and if x is odd the UMVUE of θ is −1!!

4. Bayes Theorem Is a Formula for Learning. Suppose you conduct an experiment and collect observations x_1, ..., x_n. Then

$$p(\theta \mid x) = \frac{p(x \mid \theta)\,\pi(\theta)}{\int_\Theta p(x \mid \theta)\,\pi(\theta)\,d\theta},$$

where x = (x_1, ..., x_n). Suppose you collect an additional observation x_{n+1} in a new study. Then,

$$p(\theta \mid x, x_{n+1}) = \frac{p(x_{n+1} \mid \theta)\,p(\theta \mid x)}{\int_\Theta p(x_{n+1} \mid \theta)\,p(\theta \mid x)\,d\theta}.$$

So your prior in the new study is the posterior from the previous one.
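A minimal Beta-Bernoulli illustration of this updating rule, with made-up data and hyperparameters: updating on all observations at once and updating sequentially (using the first posterior as the new prior) give the same posterior.

```python
import numpy as np
from scipy import stats

# Beta-Bernoulli illustration of sequential updating (hypothetical data and prior)
alpha, lam = 2.0, 2.0
x = np.array([1, 0, 1, 1, 0, 1, 1, 0, 1, 1])   # first study
x_new = 1                                      # one additional observation in a new study

# Posterior after the first study: Beta(a1, b1)
a1, b1 = alpha + x.sum(), lam + len(x) - x.sum()

# New study: the old posterior plays the role of the prior
a2, b2 = a1 + x_new, b1 + (1 - x_new)

# Batch analysis of all n+1 observations with the original prior gives the same answer
a_batch = alpha + x.sum() + x_new
b_batch = lam + (len(x) + 1) - (x.sum() + x_new)

print((a2, b2), (a_batch, b_batch))            # identical parameter pairs
posterior = stats.beta(a2, b2)                 # final posterior either way
print("posterior mean:", posterior.mean())
```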

5. Bayes Inference Does Not Require Large-Sample Theory. With modern computing advances, posterior calculations can be carried out using Markov chain Monte Carlo (MCMC) methods. Bayes methods do not require asymptotics for valid inference. Thus, small-sample Bayesian inference proceeds in the same way as if one had a large sample.

6. Bayes Inference Often Has Frequentist Inference as a Special Case. Often one can obtain frequentist answers by choosing a uniform prior for the parameters, i.e., π(θ) ∝ 1, so that p(θ | x) ∝ L(θ). In such cases, frequentist answers can be obtained from such a posterior distribution.