Introduction to Bayesian Methods
We develop the Bayesian paradigm for parametric inference. To this end, suppose we conduct (or wish to design) a study in which the parameter $\theta$ is of inferential interest. Here $\theta$ may be vector valued. For example:

1. $\theta$ = difference in treatment means
2. $\theta$ = hazard ratio
3. $\theta$ = vector of regression coefficients
4. $\theta$ = probability a treatment is effective
In parametric inference, we specify a parametric model for the data, indexed by the parameter $\theta$. Letting $x$ denote the data, we denote this model (density) by $p(x \mid \theta)$. The likelihood function of $\theta$ is any function proportional to $p(x \mid \theta)$, i.e., $L(\theta) \propto p(x \mid \theta)$.

Example. Suppose $x \mid \theta \sim \text{Binomial}(N, \theta)$. Then
$$p(x \mid \theta) = \binom{N}{x} \theta^x (1-\theta)^{N-x}, \quad x = 0, 1, \ldots, N.$$
We can take $L(\theta) = \theta^x (1-\theta)^{N-x}$. The parameter $\theta$ is unknown. In the Bayesian mind-set, we express our uncertainty about quantities by specifying distributions for them. Thus, we express our uncertainty about $\theta$ by specifying a prior distribution for it. We denote the prior density of $\theta$ by $\pi(\theta)$. The word "prior" denotes that it is the density of $\theta$ before the data $x$ are observed. By Bayes theorem, we can construct the distribution of $\theta \mid x$, which is called the posterior distribution of $\theta$. We denote the posterior density of $\theta$ by $p(\theta \mid x)$.
By Bayes theorem,
$$p(\theta \mid x) = \frac{p(x \mid \theta)\,\pi(\theta)}{\int_\Theta p(x \mid \theta)\,\pi(\theta)\, d\theta},$$
where $\Theta$ denotes the parameter space of $\theta$. The quantity
$$p(x) = \int_\Theta p(x \mid \theta)\,\pi(\theta)\, d\theta$$
is the normalizing constant of the posterior distribution. For most inference problems, $p(x)$ does not have a closed form. Bayesian inference about $\theta$ is primarily based on the posterior distribution $p(\theta \mid x)$.
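When $p(x)$ has no closed form, a one-dimensional posterior can be approximated by normalizing $L(\theta)\pi(\theta)$ on a grid. The following is a minimal sketch (our own illustration, with hypothetical data: the binomial model above with a uniform prior); the function name `grid_posterior` is ours:

```python
import numpy as np
from math import comb

def grid_posterior(like, prior, grid):
    """Approximate p(theta | x) by normalizing like(theta)*prior(theta) on a grid."""
    unnorm = like(grid) * prior(grid)
    h = grid[1] - grid[0]
    px = float(np.sum(unnorm) * h)   # Riemann-sum approximation of p(x)
    return unnorm / px, px

# Hypothetical data: x = 7 successes in N = 10 trials, uniform Beta(1, 1) prior
N, x = 10, 7
grid = np.linspace(1e-6, 1 - 1e-6, 100_001)
post, px = grid_posterior(lambda t: comb(N, x) * t**x * (1 - t)**(N - x),
                          lambda t: np.ones_like(t), grid)
post_mean = float(np.sum(grid * post) * (grid[1] - grid[0]))
print(round(px, 4), round(post_mean, 4))  # exact values are 1/11 and 2/3
```

With a uniform prior the answers are known in closed form ($p(x) = 1/(N+1)$ and posterior mean $(x+1)/(N+2)$), which makes the grid approximation easy to sanity-check.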
For example, one can compute various posterior summaries, such as the mean, median, mode, variance, and quantiles. For instance, the posterior mean of $\theta$ is given by
$$E(\theta \mid x) = \int_\Theta \theta\, p(\theta \mid x)\, d\theta.$$

Example 1. Given $\theta$, suppose $x_1, x_2, \ldots, x_n$ are i.i.d. $\text{Binomial}(1, \theta)$, and $\theta \sim \text{Beta}(\alpha, \lambda)$. The parameters of the prior distribution are often called the hyperparameters. Let us derive the posterior distribution of $\theta$. Let $x = (x_1, x_2, \ldots, x_n)$. Then
$$p(x \mid \theta) = \prod_{i=1}^n p(x_i \mid \theta) = \prod_{i=1}^n \theta^{x_i} (1-\theta)^{1-x_i} = \theta^{\sum x_i} (1-\theta)^{n - \sum x_i},$$
where $\sum x_i = \sum_{i=1}^n x_i$. Also,
$$\pi(\theta) = \frac{\Gamma(\alpha+\lambda)}{\Gamma(\alpha)\Gamma(\lambda)}\, \theta^{\alpha-1} (1-\theta)^{\lambda-1}.$$
Now, we can write the kernel of the posterior density as
$$p(\theta \mid x) \propto \theta^{\sum x_i} (1-\theta)^{n-\sum x_i}\, \theta^{\alpha-1} (1-\theta)^{\lambda-1} = \theta^{\sum x_i + \alpha - 1} (1-\theta)^{n - \sum x_i + \lambda - 1}.$$
We can recognize this kernel as a beta kernel with parameters $\left(\sum x_i + \alpha,\; n - \sum x_i + \lambda\right)$. Thus,
$$\theta \mid x \sim \text{Beta}\!\left(\sum x_i + \alpha,\; n - \sum x_i + \lambda\right),$$
and therefore
$$p(\theta \mid x) = \frac{\Gamma(\alpha + n + \lambda)}{\Gamma\!\left(\sum x_i + \alpha\right)\Gamma\!\left(n - \sum x_i + \lambda\right)}\, \theta^{\sum x_i + \alpha - 1} (1-\theta)^{n - \sum x_i + \lambda - 1}.$$
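The conjugate update reduces to simple arithmetic on the hyperparameters. A sketch (hypothetical data and hyperparameter values; the function name is ours) applying the $\text{Beta}(\sum x_i + \alpha,\, n - \sum x_i + \lambda)$ result:

```python
def beta_bernoulli_posterior(xs, alpha, lam):
    """Beta(alpha, lam) prior + Bernoulli data -> Beta posterior parameters."""
    s = sum(xs)          # sum of the x_i
    n = len(xs)
    return alpha + s, lam + n - s

# hypothetical data: 6 successes out of 10, with a Beta(2, 2) prior
xs = [1, 1, 0, 1, 0, 1, 1, 0, 1, 0]
a_post, l_post = beta_bernoulli_posterior(xs, 2.0, 2.0)
post_mean = a_post / (a_post + l_post)   # mean of a Beta(a, l) distribution
print(a_post, l_post, post_mean)  # Beta(8, 6), posterior mean 8/14
```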
Remark. In deriving posterior densities, an often-used technique is to try to recognize the kernel of the posterior density of $\theta$. This avoids direct computation of $p(x)$ and saves a great deal of time in derivation. If the kernel cannot be recognized, then $p(x)$ must be computed directly. In this example we have
$$p(x) = p(x_1, \ldots, x_n) \propto \int_0^1 \theta^{\sum x_i + \alpha - 1} (1-\theta)^{n - \sum x_i + \lambda - 1}\, d\theta = \frac{\Gamma\!\left(\sum x_i + \alpha\right)\Gamma\!\left(n - \sum x_i + \lambda\right)}{\Gamma(\alpha + n + \lambda)}.$$
Thus
$$p(x_1, \ldots, x_n) = \frac{\Gamma(\alpha+\lambda)}{\Gamma(\alpha)\Gamma(\lambda)} \cdot \frac{\Gamma\!\left(\sum x_i + \alpha\right)\Gamma\!\left(n - \sum x_i + \lambda\right)}{\Gamma(\alpha + n + \lambda)}$$
for $x_i = 0, 1$ and $i = 1, \ldots, n$.

Suppose $A_1, A_2, \ldots$ are events such that $A_i \cap A_j = \emptyset$ for $i \neq j$ and $\bigcup_{i=1}^\infty A_i = \Omega$, where $\Omega$ denotes the sample space. Let $B$ denote an event in $\Omega$. Then Bayes theorem for events can be written as
$$P(A_i \mid B) = \frac{P(B \mid A_i)\, P(A_i)}{\sum_{i=1}^\infty P(B \mid A_i)\, P(A_i)}.$$
$P(A_i)$ is the prior probability of $A_i$, and $P(A_i \mid B)$ is the posterior probability of $A_i$ given that $B$ has occurred.

Example. Bayes theorem is often used in diagnostic tests for cancer. A young person was diagnosed as having a type of cancer that occurs extremely rarely in young people. Naturally, he was very upset. A friend told him that it was probably a mistake. His friend reasoned as follows. No medical test is perfect: there are always incidences of false positives and false negatives.
Let $C$ stand for the event that he has cancer and let $+$ stand for the event that an individual responds positively to the test. Assume $P(C) = 1/1{,}000{,}000 = 10^{-6}$, $P(+ \mid C) = .99$, and $P(+ \mid C^c) = .01$. (So only one per million people his age have the disease, and the test is extremely good relative to most medical tests, giving only 1% false positives and 1% false negatives.) Find the probability that he has cancer given that he has a positive response. (After you make this calculation you will not be surprised to learn that he did not have cancer.)
$$P(C \mid +) = \frac{P(+ \mid C)\,P(C)}{P(+ \mid C)\,P(C) + P(+ \mid C^c)\,P(C^c)} = \frac{(.99)(10^{-6})}{(.99)(10^{-6}) + (.01)(.999999)}$$
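This calculation is easy to check numerically. The sketch below (function and argument names are ours) simply evaluates Bayes theorem for events:

```python
def posterior_prob(prevalence, sensitivity, false_pos_rate):
    """P(C | +) via Bayes theorem for events."""
    num = sensitivity * prevalence
    return num / (num + false_pos_rate * (1.0 - prevalence))

p = posterior_prob(prevalence=1e-6, sensitivity=0.99, false_pos_rate=0.01)
print(round(p, 8))  # about 9.9e-05: a positive test still leaves P(cancer) < 0.0001
```

The rarity of the disease dominates the accuracy of the test, which is the point of the example.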
$$P(C \mid +) = \frac{.00000099}{.01000098} = .00009899$$

Example 3. Suppose $x_1, \ldots, x_n$ is a random sample from $N(\mu, \sigma^2)$.

i) Suppose $\sigma^2$ is known and $\mu \sim N(\mu_o, \sigma_o^2)$. The posterior density of $\mu$ is given by
$$p(\mu \mid x) \propto \left[\prod_{i=1}^n p(x_i \mid \mu, \sigma^2)\right] \pi(\mu) \propto \exp\left\{-\frac{1}{2\sigma^2} \sum (x_i - \mu)^2\right\} \exp\left\{-\frac{1}{2\sigma_o^2}(\mu - \mu_o)^2\right\}$$
$$\propto \exp\left\{-\frac{1}{2}\left[\left(\frac{n\sigma_o^2 + \sigma^2}{\sigma_o^2 \sigma^2}\right)\mu^2 - 2\mu\left(\frac{\sigma_o^2 \sum x_i + \mu_o \sigma^2}{\sigma_o^2 \sigma^2}\right)\right]\right\}$$
$$= \exp\left\{-\frac{1}{2}\left(\frac{n\sigma_o^2 + \sigma^2}{\sigma_o^2 \sigma^2}\right)\left[\mu^2 - 2\mu\left(\frac{\sigma_o^2 \sum x_i + \mu_o \sigma^2}{n\sigma_o^2 + \sigma^2}\right)\right]\right\}$$
$$\propto \exp\left\{-\frac{1}{2}\left(\frac{n\sigma_o^2 + \sigma^2}{\sigma_o^2 \sigma^2}\right)\left[\mu - \frac{\sigma_o^2 \sum x_i + \mu_o \sigma^2}{n\sigma_o^2 + \sigma^2}\right]^2\right\}$$
We can recognize this as a normal kernel with mean
$$\mu_{\text{post}} = \frac{\sigma_o^2 \sum x_i + \mu_o \sigma^2}{n\sigma_o^2 + \sigma^2}$$
and variance
$$\sigma^2_{\text{post}} = \left(\frac{n\sigma_o^2 + \sigma^2}{\sigma_o^2 \sigma^2}\right)^{-1} = \frac{\sigma_o^2 \sigma^2}{n\sigma_o^2 + \sigma^2}.$$
Thus
$$\mu \mid x \sim N\left(\frac{\sigma_o^2 \sum x_i + \mu_o \sigma^2}{n\sigma_o^2 + \sigma^2},\; \frac{\sigma_o^2 \sigma^2}{n\sigma_o^2 + \sigma^2}\right).$$
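The closed-form posterior above is a precision-weighted compromise between the data and the prior. A small sketch (hypothetical numbers; the function name is ours) computing $\mu_{\text{post}}$ and $\sigma^2_{\text{post}}$:

```python
def normal_mean_posterior(xs, sigma2, mu0, sigma02):
    """Posterior of mu for N(mu, sigma2) data (sigma2 known), N(mu0, sigma02) prior."""
    n, s = len(xs), sum(xs)
    denom = n * sigma02 + sigma2
    mu_post = (sigma02 * s + mu0 * sigma2) / denom
    var_post = sigma02 * sigma2 / denom
    return mu_post, var_post

# hypothetical data, sigma^2 = 1 known, vague N(0, 10) prior
mu_post, var_post = normal_mean_posterior([1.2, 0.8, 1.5, 0.9], 1.0, 0.0, 10.0)
print(mu_post, var_post)  # mean close to x-bar = 1.1, variance close to 1/n
```

With a vague prior ($\sigma_o^2$ large), the posterior mean is pulled only slightly from $\bar{x}$ toward $\mu_o$.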
ii) Suppose $\mu$ is known and $\sigma^2$ is unknown. Let $\tau = 1/\sigma^2$; $\tau$ is often called the precision parameter. Suppose $\tau \sim \text{gamma}\!\left(\frac{\delta_o}{2}, \frac{\gamma_o}{2}\right)$, so that
$$\pi(\tau) \propto \tau^{\delta_o/2 - 1} \exp\left(-\frac{\tau \gamma_o}{2}\right).$$
Let us derive the posterior distribution of $\tau$:
$$p(\tau \mid x) \propto \tau^{n/2} \exp\left\{-\frac{\tau}{2} \sum (x_i - \mu)^2\right\} \tau^{\delta_o/2 - 1} \exp\left\{-\frac{\tau \gamma_o}{2}\right\} = \tau^{\frac{n + \delta_o}{2} - 1} \exp\left\{-\frac{\tau}{2}\left(\gamma_o + \sum (x_i - \mu)^2\right)\right\}$$
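This kernel already shows the gamma form, so the posterior shape and rate are simple functions of the data. A sketch with hypothetical numbers (the function name is ours):

```python
def precision_posterior(xs, mu, delta0, gamma0):
    """Posterior of tau = 1/sigma^2: gamma((n+delta0)/2, (gamma0 + sum(x_i-mu)^2)/2)."""
    n = len(xs)
    ss = sum((x - mu) ** 2 for x in xs)
    return (n + delta0) / 2.0, (gamma0 + ss) / 2.0   # (shape, rate)

shape, rate = precision_posterior([2.0, 1.0, 3.0, 2.0], mu=2.0, delta0=2.0, gamma0=1.0)
print(shape, rate, shape / rate)  # 3.0 1.5 2.0 (posterior mean of tau = shape/rate)
```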
Thus
$$\tau \mid x \sim \text{gamma}\left(\frac{n + \delta_o}{2},\; \frac{\gamma_o + \sum (x_i - \mu)^2}{2}\right).$$

iii) Now suppose $\mu$ and $\sigma^2$ are both unknown. Suppose we specify the joint prior
$$\pi(\mu, \tau) = \pi(\mu \mid \tau)\, \pi(\tau),$$
where
$$\mu \mid \tau \sim N(\mu_o, \tau^{-1}\sigma_o^2), \qquad \tau \sim \text{gamma}\left(\frac{\delta_o}{2}, \frac{\gamma_o}{2}\right).$$
The joint posterior density of $(\mu, \tau)$ is given by
$$p(\mu, \tau \mid x) \propto \left(\tau^{n/2} \exp\left\{-\frac{\tau}{2} \sum (x_i - \mu)^2\right\}\right) \left(\tau^{1/2} \exp\left\{-\frac{\tau}{2\sigma_o^2} (\mu - \mu_o)^2\right\}\right) \left(\tau^{\delta_o/2 - 1} \exp\left\{-\frac{\tau \gamma_o}{2}\right\}\right)$$
$$= \tau^{\frac{n + \delta_o + 1}{2} - 1} \exp\left\{-\frac{\tau}{2}\left(\gamma_o + \frac{(\mu - \mu_o)^2}{\sigma_o^2} + \sum (x_i - \mu)^2\right)\right\}$$
The joint posterior does not have a clearly recognizable form. Thus, we need to compute $p(x)$ by brute force.
Up to the normalizing constants of the likelihood and the prior, we must evaluate
$$p^*(x) = \int_0^\infty \int_{-\infty}^\infty \tau^{\frac{n + \delta_o + 1}{2} - 1} \exp\left\{-\frac{\tau}{2}\left(\gamma_o + \frac{(\mu - \mu_o)^2}{\sigma_o^2} + \sum (x_i - \mu)^2\right)\right\} d\mu\, d\tau$$
$$= \int_0^\infty \int_{-\infty}^\infty \tau^{\frac{n + \delta_o + 1}{2} - 1} \exp\left\{-\frac{\tau}{2}\left[\gamma_o + \mu^2\!\left(n + 1/\sigma_o^2\right) - 2\mu\!\left(\sum x_i + \mu_o/\sigma_o^2\right) + \left(\mu_o^2/\sigma_o^2 + \sum x_i^2\right)\right]\right\} d\mu\, d\tau$$
$$= \int_0^\infty \tau^{\frac{n + \delta_o + 1}{2} - 1} \exp\left\{-\frac{\tau}{2}\left(\gamma_o + \mu_o^2/\sigma_o^2 + \sum x_i^2\right)\right\} \left[\int_{-\infty}^\infty \exp\left\{-\frac{\tau}{2}\left(\mu^2(n + 1/\sigma_o^2) - 2\mu\left(\sum x_i + \mu_o/\sigma_o^2\right)\right)\right\} d\mu\right] d\tau$$
The integral with respect to $\mu$ can be evaluated by completing the square:
$$\int_{-\infty}^\infty \exp\left\{-\frac{\tau (n + 1/\sigma_o^2)}{2}\left[\mu - \frac{\sum x_i + \mu_o/\sigma_o^2}{n + 1/\sigma_o^2}\right]^2\right\} d\mu \;\cdot\; \exp\left\{\frac{\tau \left(\sum x_i + \mu_o/\sigma_o^2\right)^2}{2\,(n + 1/\sigma_o^2)}\right\}$$
$$= (2\pi)^{1/2}\, \tau^{-1/2}\, (n + 1/\sigma_o^2)^{-1/2} \exp\left\{\frac{\tau \left(\sum x_i + \mu_o/\sigma_o^2\right)^2}{2\,(n + 1/\sigma_o^2)}\right\}$$
Now we need to evaluate
$$(2\pi)^{1/2}\, (n + 1/\sigma_o^2)^{-1/2} \int_0^\infty \tau^{\frac{n + \delta_o}{2} - 1} \exp\left\{-\frac{\tau}{2}\left[\gamma_o + \mu_o^2/\sigma_o^2 + \sum x_i^2 - \frac{\left(\sum x_i + \mu_o/\sigma_o^2\right)^2}{n + 1/\sigma_o^2}\right]\right\} d\tau$$
$$= (2\pi)^{1/2}\, \Gamma\!\left(\frac{n + \delta_o}{2}\right) (n + 1/\sigma_o^2)^{-1/2} \left[\frac{1}{2}\left(\gamma_o + \mu_o^2/\sigma_o^2 + \sum x_i^2 - \frac{\left(\sum x_i + \mu_o/\sigma_o^2\right)^2}{n + 1/\sigma_o^2}\right)\right]^{-\frac{n + \delta_o}{2}} \equiv p^*(x)$$
Thus,
$$p(x) = (2\pi)^{-(n+1)/2}\, \sigma_o^{-1}\, \frac{(\gamma_o/2)^{\delta_o/2}}{\Gamma(\delta_o/2)}\; p^*(x).$$
The joint posterior density of $(\mu, \tau) \mid x$ can also be obtained in this case by deriving $p(\mu, \tau \mid x) = p(\mu \mid \tau, x)\, p(\tau \mid x)$. Exercise: find $p(\mu \mid \tau, x)$ and $p(\tau \mid x)$.

It is of great interest to find the marginal posterior distributions of $\mu$ and $\tau$:
$$p(\mu \mid x) = \int_0^\infty p(\mu, \tau \mid x)\, d\tau \propto \int_0^\infty \tau^{\frac{n + \delta_o + 1}{2} - 1} \exp\left\{-\frac{\tau}{2}\left[\gamma_o + \mu_o^2/\sigma_o^2 + \sum x_i^2\right]\right\} \exp\left\{-\frac{\tau}{2}\left[\mu^2 (n + 1/\sigma_o^2) - 2\mu\left(\sum x_i + \mu_o/\sigma_o^2\right)\right]\right\} d\tau$$
Completing the square in $\mu$,
$$= \int_0^\infty \tau^{\frac{n + \delta_o + 1}{2} - 1} \exp\left\{-\frac{\tau}{2}\left[\gamma_o + \mu_o^2/\sigma_o^2 + \sum x_i^2\right]\right\} \exp\left\{-\frac{\tau (n + 1/\sigma_o^2)}{2}\left[\mu - \frac{\sum x_i + \mu_o/\sigma_o^2}{n + 1/\sigma_o^2}\right]^2\right\} \exp\left\{\frac{\tau}{2} \cdot \frac{\left(\sum x_i + \mu_o/\sigma_o^2\right)^2}{n + 1/\sigma_o^2}\right\} d\tau$$
Let $a = \dfrac{\sum x_i + \mu_o/\sigma_o^2}{n + 1/\sigma_o^2}$. Then we can write the integral as
$$= \int_0^\infty \tau^{\frac{n + \delta_o + 1}{2} - 1} \exp\left\{-\frac{\tau}{2}\left[\gamma_o + \mu_o^2/\sigma_o^2 + \sum x_i^2 + (n + 1/\sigma_o^2)(\mu - a)^2 - (n + 1/\sigma_o^2)a^2\right]\right\} d\tau$$
$$= \frac{\Gamma\!\left(\frac{n + \delta_o + 1}{2}\right)}{\left\{\frac{1}{2}\left[\gamma_o + \mu_o^2/\sigma_o^2 + \sum x_i^2 + (n + 1/\sigma_o^2)(\mu - a)^2 - (n + 1/\sigma_o^2)a^2\right]\right\}^{\frac{n + \delta_o + 1}{2}}}$$
$$\propto \left[1 + \frac{c(\mu - a)^2}{b - ca^2}\right]^{-\frac{n + \delta_o + 1}{2}},$$
where $c = n + 1/\sigma_o^2$ and $b = \gamma_o + \mu_o^2/\sigma_o^2 + \sum x_i^2$. We recognize this kernel as that of a $t$-distribution with location parameter $a$, dispersion parameter $\left(\frac{(n + \delta_o)c}{b - ca^2}\right)^{-1}$, and $n + \delta_o$ degrees of freedom.
Definition. Let $y = (y_1, \ldots, y_p)'$ be a $p \times 1$ random vector. Then $y$ is said to have a $p$-dimensional multivariate $t$ distribution with $d$ degrees of freedom, location parameter $m$, and dispersion matrix $\Sigma_{p \times p}$ if $y$ has density
$$p(y) = \frac{\Gamma\!\left(\frac{d + p}{2}\right)}{(\pi d)^{p/2}\, |\Sigma|^{1/2}\, \Gamma\!\left(\frac{d}{2}\right)} \left[1 + \frac{1}{d}(y - m)'\, \Sigma^{-1} (y - m)\right]^{-\frac{d + p}{2}}.$$
We write this as $y \sim S_p(d, m, \Sigma)$. In our problem, $p = 1$, $d = n + \delta_o$, $m = a$, $\Sigma^{-1} = \frac{(n + \delta_o)c}{b - ca^2}$, and $\Sigma = \left(\frac{(n + \delta_o)c}{b - ca^2}\right)^{-1}$.
The marginal posterior distribution of $\tau$ is given by
$$p(\tau \mid x) \propto \int_{-\infty}^\infty \tau^{\frac{n + \delta_o + 1}{2} - 1} \exp\left\{-\frac{\tau}{2}\left[\gamma_o + \mu_o^2/\sigma_o^2 + \sum x_i^2\right]\right\} \exp\left\{\frac{\tau}{2}(n + 1/\sigma_o^2)\, a^2\right\} \exp\left\{-\frac{\tau (n + 1/\sigma_o^2)}{2} (\mu - a)^2\right\} d\mu$$
$$\propto \tau^{\frac{n + \delta_o + 1}{2} - 1}\, \tau^{-1/2} \exp\left\{-\frac{\tau}{2}\left[\gamma_o + \mu_o^2/\sigma_o^2 + \sum x_i^2 - (n + 1/\sigma_o^2)\, a^2\right]\right\}$$
$$= \tau^{\frac{n + \delta_o}{2} - 1} \exp\left\{-\frac{\tau}{2}\left[\gamma_o + \mu_o^2/\sigma_o^2 + \sum x_i^2 - (n + 1/\sigma_o^2)\, a^2\right]\right\}$$
Thus,
$$\tau \mid x \sim \text{gamma}\left(\frac{n + \delta_o}{2},\; \frac{1}{2}\left[\gamma_o + \mu_o^2/\sigma_o^2 + \sum x_i^2 - (n + 1/\sigma_o^2)\, a^2\right]\right).$$
Remark. A $t$ distribution can be obtained as a scale mixture of normals. That is, if $x \mid \tau \sim N_p(m, \tau^{-1}\Sigma)$ and $\tau \sim \text{gamma}\!\left(\frac{\delta_o}{2}, \frac{\gamma_o}{2}\right)$, then
$$p(x) = \int_0^\infty p(x \mid \tau)\, \pi(\tau)\, d\tau$$
is the $S_p\!\left(\delta_o, m, \frac{\gamma_o}{\delta_o}\Sigma\right)$ density. That is, $x \sim S_p\!\left(\delta_o, m, \frac{\gamma_o}{\delta_o}\Sigma\right)$. Note:
$$p(x \mid \tau) = (2\pi)^{-p/2}\, \tau^{p/2}\, |\Sigma|^{-1/2} \exp\left\{-\frac{\tau}{2}(x - m)'\, \Sigma^{-1} (x - m)\right\}.$$
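The scale-mixture representation can be checked by Monte Carlo: draw $\tau$ from the gamma prior, then $x \mid \tau$ from the normal, and compare the sample moments with those of the claimed $t$ marginal. A sketch for $p = 1$ with hypothetical hyperparameter values (tolerances are loose because the check is stochastic):

```python
import math, random
random.seed(7)

# If tau ~ gamma(delta0/2, rate gamma0/2) and x | tau ~ N(m, 1/tau) (Sigma = 1),
# the marginal should be t with delta0 df, location m, dispersion (gamma0/delta0).
delta0, gamma0, m = 6.0, 6.0, 1.0
N = 200_000
xs = []
for _ in range(N):
    tau = random.gammavariate(delta0 / 2.0, 2.0 / gamma0)   # (shape, scale = 1/rate)
    xs.append(random.gauss(m, 1.0 / math.sqrt(tau)))

mean = sum(xs) / N
var = sum((x - mean) ** 2 for x in xs) / N
# theoretical t_d variance: s^2 * d/(d-2) with s^2 = gamma0/delta0 = 1, d = 6
print(mean, var)  # approximately 1.0 and 1.5
```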
Remark. Note that in Examples 1 and 3 i), ii), the posterior distribution is of the same family as the prior distribution. When the posterior distribution of a parameter is of the same family as the prior distribution, such prior distributions are called conjugate prior distributions. In Example 1, a beta prior for $\theta$ led to a beta posterior for $\theta$. In Example 3 i), a normal prior for $\mu$ yielded a normal posterior for $\mu$. In Example 3 ii), a gamma prior for $\tau$ yielded a gamma posterior for $\tau$. More on conjugate priors later.
Advantages of Bayesian Methods

1. Interpretation. Having a distribution for your unknown parameter $\theta$ is easier to understand than a point estimate and a standard error. In addition, consider the following example of a confidence interval. A 95% confidence interval $(a, b)$ for a population mean $\theta$ can be written as $\bar{x} \pm 1.96\, s/\sqrt{n}$. It is tempting to conclude that $P(a < \theta < b) = 0.95$.
However, we have to rely on a repeated-sampling interpretation to make a probability statement like the one above. Thus, after observing the data, we cannot make a statement like "the true $\theta$ has a 95% chance of falling in $\bar{x} \pm 1.96\, s/\sqrt{n}$," although we are tempted to say this.
Advantages of Bayesian Methods. Bayes Inference Obeys the Likelihood Principal The likelihood principle: If two distinct sampling plans (designs) yield proportional likelihood functions for θ, then inference about θ should be identical from these two designs. Frequentist inference does not obey the likelihood principle, in general. Example Suppose in 1 independent tosses of a coin, 9 heads and 3 tails are observed. I wish to test the null hypothesis H o : θ = 1/ vs.h o : θ > 1/, where θ is the true probability of heads. Introduction to Bayesian Methods p.31/??
Consider the following choices for the likelihood function:

a) Binomial: $n = 12$ (fixed), $x$ = number of heads, $x \sim \text{Binomial}(12, \theta)$, and the likelihood is
$$L_1(\theta) = \binom{n}{x} \theta^x (1-\theta)^{n-x} = \binom{12}{9} \theta^9 (1-\theta)^3.$$

b) Negative binomial: $n$ is not fixed; flip until the third tail appears. Here $x$ is the number of heads observed when the experiment ends, $x \sim \text{NegBinomial}(r = 3, \theta)$.
$$L_2(\theta) = \binom{r + x - 1}{x} \theta^x (1-\theta)^r = \binom{11}{9} \theta^9 (1-\theta)^3.$$
Note that $L_1(\theta) \propto L_2(\theta)$. From a Bayesian perspective, the posterior distribution of $\theta$ is the same under either design. That is,
$$p(\theta \mid x) = \frac{L_1(\theta)\,\pi(\theta)}{\int L_1(\theta)\,\pi(\theta)\, d\theta} = \frac{L_2(\theta)\,\pi(\theta)}{\int L_2(\theta)\,\pi(\theta)\, d\theta}.$$
However, under the frequentist paradigm, inferences about $\theta$ are quite different under each design. The p-value based on the binomial likelihood is
$$P(x \geq 9 \mid \theta = 1/2) = \sum_{j=9}^{12} \binom{12}{j} \theta^j (1-\theta)^{12-j} = 0.073,$$
while for the negative binomial likelihood, the p-value is
$$P(x \geq 9 \mid \theta = 1/2) = \sum_{j=9}^{\infty} \binom{2 + j}{j} \theta^j (1-\theta)^3 = 0.0327.$$
The two designs lead to different decisions at the 0.05 level: rejecting $H_o$ under design b) but not under design a).
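Both tail probabilities can be computed directly; a sketch (the infinite negative binomial tail is evaluated via the complement of a finite sum):

```python
from math import comb

theta = 0.5
# Design a): Binomial(12, theta), observed x = 9 heads; p-value = P(X >= 9)
p_binom = sum(comb(12, j) * theta**j * (1 - theta)**(12 - j) for j in range(9, 13))

# Design b): flip until r = 3 tails; x = number of heads; p-value = P(X >= 9)
r = 3
p_negbin = 1.0 - sum(comb(j + r - 1, j) * theta**j * (1 - theta)**r for j in range(0, 9))

print(round(p_binom, 4), round(p_negbin, 4))  # 0.073 0.0327
```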
3. Bayesian Inference Does Not Lead to Absurd Results. Absurd results can be obtained when doing UMVUE estimation. Suppose $x \sim \text{Poisson}(\lambda)$, and we want to estimate $\theta = e^{-2\lambda}$, $0 < \theta < 1$. It can be shown that the UMVUE of $\theta$ is $(-1)^x$. Thus, if $x$ is even the UMVUE of $\theta$ is 1, and if $x$ is odd the UMVUE of $\theta$ is $-1$!
4. Bayes Theorem Is a Formula for Learning. Suppose you conduct an experiment and collect observations $x_1, \ldots, x_n$. Then
$$p(\theta \mid x) = \frac{p(x \mid \theta)\,\pi(\theta)}{\int_\Theta p(x \mid \theta)\,\pi(\theta)\, d\theta},$$
where $x = (x_1, \ldots, x_n)$. Suppose you collect an additional observation $x_{n+1}$ in a new study. Then
$$p(\theta \mid x, x_{n+1}) = \frac{p(x_{n+1} \mid \theta)\, \pi(\theta \mid x)}{\int_\Theta p(x_{n+1} \mid \theta)\, \pi(\theta \mid x)\, d\theta}.$$
So your prior in the new study is the posterior from the previous one.
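For the Beta-Bernoulli model of Example 1, this learning rule is easy to demonstrate: updating one observation at a time, using each posterior as the next prior, gives the same answer as a single batch update (a sketch with hypothetical data and hyperparameters):

```python
def update(alpha, lam, x):
    """One Bayes-theorem step for a Bernoulli observation with a Beta(alpha, lam) prior."""
    return alpha + x, lam + 1 - x

data = [1, 0, 1, 1, 0, 1]

# sequential: yesterday's posterior is today's prior
a, l = 2.0, 2.0
for x in data:
    a, l = update(a, l, x)

# batch: update once with all the data
a_batch = 2.0 + sum(data)
l_batch = 2.0 + len(data) - sum(data)
print((a, l), (a_batch, l_batch))  # identical posterior parameters
```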
5. Bayesian Inference Does Not Require Large-Sample Theory. With modern computing advances, posterior calculations can be carried out using Markov chain Monte Carlo (MCMC) methods. Bayesian methods do not require asymptotics for valid inference; thus, small-sample Bayesian inference proceeds in the same way as if one had a large sample.
6. Bayesian Inference Often Has Frequentist Inference as a Special Case. Often one can obtain frequentist answers by choosing a uniform prior for the parameters, i.e., $\pi(\theta) \propto 1$, so that $p(\theta \mid x) \propto L(\theta)$. In such cases, frequentist answers can be obtained from the posterior distribution.