MAS3301 Bayesian Statistics

MAS3301 Bayesian Statistics
M. Farrow
School of Mathematics and Statistics
Newcastle University
Semester 2,

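11 Conjugate Priors IV: The Dirichlet distribution and multinomial observations

11.1 The Dirichlet distribution

The Dirichlet distribution is a distribution for a set of quantities $\theta_1, \ldots, \theta_m$ where $\theta_i \geq 0$ and $\sum_{i=1}^m \theta_i = 1$. An obvious application is to a set of probabilities for a partition (i.e. for an exhaustive set of mutually exclusive events).

The probability density function is

$$f(\theta_1, \ldots, \theta_m) = \frac{\Gamma(A)}{\prod_{i=1}^m \Gamma(a_i)} \prod_{i=1}^m \theta_i^{a_i - 1}$$

where $A = \sum_{i=1}^m a_i$ and $a_1, \ldots, a_m$ are parameters with $a_i > 0$ for $i = 1, \ldots, m$.

Clearly, if $m = 2$, we obtain a beta$(a_1, a_2)$ distribution as a special case.

The mean of $\theta_j$ is

$$E(\theta_j) = \frac{a_j}{A},$$

the variance of $\theta_j$ is

$$\mathrm{var}(\theta_j) = \frac{a_j}{A(A+1)} - \frac{a_j^2}{A^2(A+1)}$$

and the covariance of $\theta_j$ and $\theta_k$, where $j \neq k$, is

$$\mathrm{covar}(\theta_j, \theta_k) = -\frac{a_j a_k}{A^2(A+1)}.$$

Also the marginal distribution of $\theta_j$ is beta$(a_j, A - a_j)$.

Note that the space of the parameters $\theta_1, \ldots, \theta_m$ has only $m - 1$ dimensions because of the constraint $\sum_{i=1}^m \theta_i = 1$, so that, for example, $\theta_m = 1 - \sum_{i=1}^{m-1} \theta_i$. Therefore, when we integrate over this space, the integration has only $m - 1$ dimensions.

Proof (mean)

The mean is

$$\begin{aligned}
E(\theta_j) &= \int \cdots \int \theta_j \, \frac{\Gamma(A)}{\prod_{i=1}^m \Gamma(a_i)} \prod_{i=1}^m \theta_i^{a_i - 1} \, d\theta_1 \cdots d\theta_{m-1} \\
&= \frac{\Gamma(A)}{\Gamma(A+1)} \, \frac{\Gamma(a_j + 1)}{\Gamma(a_j)} \int \cdots \int \frac{\Gamma(A+1)}{\prod_{i=1}^m \Gamma(a_i')} \prod_{i=1}^m \theta_i^{a_i' - 1} \, d\theta_1 \cdots d\theta_{m-1} \\
&= \frac{\Gamma(A)}{\Gamma(A+1)} \, \frac{\Gamma(a_j + 1)}{\Gamma(a_j)} = \frac{a_j}{A}
\end{aligned}$$

where $a_i' = a_i$ when $i \neq j$ and $a_j' = a_j + 1$. The remaining integral equals 1 because its integrand is the density of a Dirichlet$(a_1', \ldots, a_m')$ distribution.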
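The moment formulas above can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the function name `dirichlet_moments` and the parameter vector `a = [2, 3, 5]` are illustrative choices and do not come from the notes.

```python
import numpy as np

def dirichlet_moments(a):
    """Mean vector and covariance matrix of a Dirichlet(a_1, ..., a_m)."""
    a = np.asarray(a, dtype=float)
    A = a.sum()
    mean = a / A
    # Off-diagonal entries: -a_j a_k / (A^2 (A + 1)).
    cov = -np.outer(a, a) / (A**2 * (A + 1))
    # Diagonal entries: a_j / (A (A + 1)) - a_j^2 / (A^2 (A + 1)).
    cov[np.diag_indices_from(cov)] = a / (A * (A + 1)) - a**2 / (A**2 * (A + 1))
    return mean, cov

a = [2.0, 3.0, 5.0]  # illustrative parameters, not taken from the notes
mean, cov = dirichlet_moments(a)

# Monte Carlo comparison using NumPy's own Dirichlet sampler.
rng = np.random.default_rng(0)
draws = rng.dirichlet(a, size=200_000)
print("formula mean:  ", mean)
print("simulated mean:", draws.mean(axis=0))
print("formula covariance:\n", cov)
print("simulated covariance:\n", np.cov(draws, rowvar=False))
```

Each row of the covariance matrix sums to zero, which reflects the constraint that the $\theta_i$ sum to 1.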

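Proof (variance)

Similarly

$$E(\theta_j^2) = \frac{\Gamma(A)}{\Gamma(A+2)} \, \frac{\Gamma(a_j + 2)}{\Gamma(a_j)} = \frac{(a_j + 1) a_j}{(A+1)A}$$

so

$$\mathrm{var}(\theta_j) = \frac{(a_j + 1) a_j}{(A+1)A} - \left( \frac{a_j}{A} \right)^2 = \frac{a_j}{A(A+1)} - \frac{a_j^2}{A^2(A+1)}.$$

Proof (covariance)

Also

$$E(\theta_j \theta_k) = \frac{\Gamma(A)}{\Gamma(A+2)} \, \frac{\Gamma(a_j + 1)}{\Gamma(a_j)} \, \frac{\Gamma(a_k + 1)}{\Gamma(a_k)} = \frac{a_j a_k}{(A+1)A}$$

so

$$\mathrm{covar}(\theta_j, \theta_k) = \frac{a_j a_k}{(A+1)A} - \frac{a_j}{A} \, \frac{a_k}{A} = -\frac{a_j a_k}{A^2(A+1)}.$$

Proof (marginal)

We can write the joint density of $\theta_1, \ldots, \theta_m$ as

$$f_1(\theta_1) \, f_2(\theta_2 \mid \theta_1) \, f_3(\theta_3 \mid \theta_1, \theta_2) \cdots f_{m-1}(\theta_{m-1} \mid \theta_1, \ldots, \theta_{m-2}).$$

(We do not need to include a final term in this for $\theta_m$ because $\theta_m$ is fixed once $\theta_1, \ldots, \theta_{m-1}$ are fixed.)

In fact we can write the joint density as

$$\begin{aligned}
& \frac{\Gamma(A)}{\Gamma(a_1)\Gamma(A - a_1)} \, \theta_1^{a_1 - 1} (1 - \theta_1)^{A - a_1 - 1} \\
& \times \frac{\Gamma(A - a_1)}{\Gamma(a_2)\Gamma(A - a_1 - a_2)} \, \frac{\theta_2^{a_2 - 1} (1 - \theta_1 - \theta_2)^{A - a_1 - a_2 - 1}}{(1 - \theta_1)^{A - a_1 - 1}} \\
& \times \cdots \\
& \times \frac{\Gamma(A - a_1 - \cdots - a_{m-2})}{\Gamma(a_{m-1})\Gamma(A - a_1 - \cdots - a_{m-1})} \, \frac{\theta_{m-1}^{a_{m-1} - 1} \, \theta_m^{a_m - 1}}{(1 - \theta_1 - \cdots - \theta_{m-2})^{a_{m-1} + a_m - 1}}.
\end{aligned}$$

A bit of cancelling shows that this simplifies to the correct Dirichlet density.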
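The sequential factorisation above can also be read as a sampling recipe: each factor says that $\theta_j$, divided by what is left after $\theta_1, \ldots, \theta_{j-1}$, has a beta$(a_j, A - a_1 - \cdots - a_j)$ distribution. The sketch below draws Dirichlet vectors this way and compares the result with the stated moments. It assumes NumPy; the function name `dirichlet_via_betas` and the parameter values are hypothetical and chosen only for illustration.

```python
import numpy as np

def dirichlet_via_betas(a, size, rng):
    """Draw Dirichlet(a) vectors one coordinate at a time using beta draws."""
    a = np.asarray(a, dtype=float)
    m = len(a)
    theta = np.zeros((size, m))
    remaining = np.ones(size)  # 1 - theta_1 - ... - theta_{j-1}
    tail = a.sum()             # A - a_1 - ... - a_{j-1}
    for j in range(m - 1):
        tail -= a[j]
        # theta_j / remaining ~ beta(a_j, A - a_1 - ... - a_j)
        fraction = rng.beta(a[j], tail, size=size)
        theta[:, j] = fraction * remaining
        remaining = remaining - theta[:, j]
    theta[:, m - 1] = remaining  # theta_m is fixed by the constraint
    return theta

a = np.array([2.0, 3.0, 5.0])  # illustrative parameters, not from the notes
A = a.sum()
rng = np.random.default_rng(1)
theta = dirichlet_via_betas(a, size=200_000, rng=rng)

print("simulated means:", theta.mean(axis=0), "expected:", a / A)
print("simulated variances:", theta.var(axis=0))
print("beta(a_j, A - a_j) variances:", a * (A - a) / (A**2 * (A + 1)))
```

The last two printed lines compare each coordinate's simulated variance with the variance of a beta$(a_j, A - a_j)$ distribution, which is the marginal claimed in the notes.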

Thus we can see that the marginal distribution of $\theta_1$ is a beta$(a_1, A - a_1)$ distribution and similarly that the marginal distribution of $\theta_j$ is a beta$(a_j, A - a_j)$ distribution.

We can also deduce the distribution of a subset of $\theta_1, \ldots, \theta_m$. For example, if $\tilde{\theta}_3 = 1 - \theta_1 - \theta_2 - \theta_3$, then the distribution of $\theta_1, \theta_2, \theta_3, \tilde{\theta}_3$ is Dirichlet$(a_1, a_2, a_3, \tilde{a}_3)$ where $\tilde{a}_3 = A - a_1 - a_2 - a_3$.

11.2 Multinomial observations

11.2.1 Model

Suppose that we will observe $X_1, \ldots, X_m$ where these are the frequencies for categories $1, \ldots, m$, the total $N = \sum_{i=1}^m X_i$ is fixed and the probabilities for these categories are $\theta_1, \ldots, \theta_m$ where $\sum_{i=1}^m \theta_i = 1$. Then, given $\theta$, where $\theta = (\theta_1, \ldots, \theta_m)^T$, the distribution of $X_1, \ldots, X_m$ is multinomial with

$$\Pr(X_1 = x_1, \ldots, X_m = x_m) = \frac{N!}{\prod_{i=1}^m x_i!} \prod_{i=1}^m \theta_i^{x_i}.$$

Notice that, with $m = 2$, this is just a binomial$(N, \theta_1)$ distribution for $X_1$.

Then the likelihood is

$$L(\theta; x) = \frac{N!}{\prod_{i=1}^m x_i!} \prod_{i=1}^m \theta_i^{x_i} \propto \prod_{i=1}^m \theta_i^{x_i}.$$

The conjugate prior is a Dirichlet distribution which has a pdf proportional to

$$\prod_{i=1}^m \theta_i^{a_i - 1}.$$

The posterior pdf is proportional to

$$\prod_{i=1}^m \theta_i^{a_i - 1} \times \prod_{i=1}^m \theta_i^{x_i} = \prod_{i=1}^m \theta_i^{a_i + x_i - 1}.$$

This is proportional to the pdf of a Dirichlet distribution with parameters $a_1 + x_1, a_2 + x_2, \ldots, a_m + x_m$.

11.2.2 Example

In a survey 1000 English voters are asked to say for which party they would vote if there were a general election next week. The choices offered are 1: Labour, 2: Liberal, 3: Conservative, 4: Other, 5: None, 6: Undecided. We assume that the population is large enough that the responses may be considered independent given the true underlying proportions. Let $\theta_1, \ldots, \theta_6$ be the probabilities that a randomly selected voter would give each of the responses. Our prior distribution for $\theta_1, \ldots, \theta_6$ is a Dirichlet(5, 3, 5, 1, 2, 4) distribution, so $A = 20$. This gives the following summary of the prior distribution, with each entry following from the formulas in Section 11.1.

Response        a_i   Prior mean   Prior var.   Prior sd.
Labour            5         0.25       0.0089      0.0945
Liberal           3         0.15       0.0061      0.0779
Conservative      5         0.25       0.0089      0.0945
Other             1         0.05       0.0023      0.0476
None              2         0.10       0.0043      0.0655
Undecided         4         0.20       0.0076      0.0873
Total            20         1.00

Suppose our observed data are as follows.

Labour   Liberal   Conservative   Other   None   Undecided

Then we can summarise the posterior distribution as follows.

Response        a_i + x_i   Posterior mean   Posterior var.   Posterior sd.
Labour
Liberal
Conservative
Other
None
Undecided
Total
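The conjugate update in this example is simple arithmetic: add the observed frequencies to the prior parameters and recompute the summaries. The sketch below does this for the Dirichlet(5, 3, 5, 1, 2, 4) prior. It assumes NumPy, and the `counts` vector is HYPOTHETICAL: it is chosen only so that it sums to 1000 and is not the survey data from the notes. The function name `dirichlet_summary` is likewise an illustrative choice.

```python
import numpy as np

def dirichlet_summary(a):
    """Marginal mean, variance and sd for each component of a Dirichlet(a)."""
    a = np.asarray(a, dtype=float)
    A = a.sum()
    mean = a / A
    # Equivalent to a_j / (A (A + 1)) - a_j^2 / (A^2 (A + 1)).
    var = a * (A - a) / (A**2 * (A + 1))
    return mean, var, np.sqrt(var)

labels = ["Labour", "Liberal", "Conservative", "Other", "None", "Undecided"]
prior = np.array([5, 3, 5, 1, 2, 4])               # prior stated in the example
counts = np.array([256, 131, 266, 38, 114, 195])   # HYPOTHETICAL counts, sum 1000
posterior = prior + counts                         # a_i + x_i

for name, params in (("prior", prior), ("posterior", posterior)):
    mean, var, sd = dirichlet_summary(params)
    print(name)
    for lab, p, mu, v, s in zip(labels, params, mean, var, sd):
        print(f"  {lab:<13}{int(p):>6d}{mu:10.4f}{v:12.6f}{s:10.4f}")
```

With 1000 observations against a prior total of $A = 20$, the posterior means sit close to the observed proportions and the posterior standard deviations are much smaller than the prior ones.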

12 Sufficiency

12.1 Introduction

Consider the following problem. We are going to observe two random variables $X_1, X_2$. In each case, given the value of $\mu$, we have $X_i \mid \mu \sim N(\mu, V)$ where the variance $V$ is known but we wish to learn about the value of $\mu$. Further, given $\mu$, the two variables $X_1, X_2$ are independent. The likelihood comes from the joint pdf of $X_1, X_2$ but an exactly equivalent observation would be $Y_1, Y_2$ where

$$Y_1 = X_1 + X_2, \qquad Y_2 = X_1 - X_2.$$

It is easily seen that

$$Y_1 \sim N(2\mu, 2V), \qquad Y_2 \sim N(0, 2V)$$

and that $Y_1$ and $Y_2$ are independent. Therefore the distribution of $Y_2$ does not depend on $\mu$ and its value cannot tell us anything about $\mu$. On the other hand the value of $Y_1$ tells us everything which we can learn from the data about $\mu$. We say that $Y_1$ is sufficient for $\mu$ and $Y_2$ is ancillary for $\mu$.

12.2 Definition

Suppose we have an unknown (e.g. a parameter) $\theta$ and we will observe data $Y$. The density (or probability) of $Y$ given $\theta$ is $f_{Y \mid \theta}(y \mid \theta)$ and this gives us the likelihood, $L(\theta; y)$. Suppose we have a statistic $T(Y)$, with value $t$. Since, once we know $Y$, we can calculate $T$, we can always write

$$f_{Y \mid \theta}(y \mid \theta) = f_{Y, T \mid \theta}(y, t \mid \theta) = f_{T \mid \theta}(t \mid \theta) \, f_{Y \mid t, \theta}(y \mid t, \theta).$$

In some cases $f_{Y \mid t, \theta}(y \mid t, \theta)$ does not depend on $\theta$ so

$$f_{Y \mid t, \theta}(y \mid t, \theta) = f_{Y \mid t}(y \mid t).$$

In this case

$$f_{Y \mid \theta}(y \mid \theta) = f_{T \mid \theta}(t \mid \theta) \, f_{Y \mid t}(y \mid t). \qquad (9)$$

In such a case we say that $T(Y)$ is a sufficient statistic for $\theta$ given $Y$. Often we simply say that $T$ is sufficient for $\theta$.

12.3 Factorisation theorem

Another way to express (9) is to say that $T$ is sufficient for $\theta$ if and only if there are functions $g, h$ such that

$$f_{Y \mid \theta}(y \mid \theta) = g(\theta, t) \, h(y) \qquad (10)$$

where $h(y)$ does not depend on $\theta$. This is known as Neyman's factorisation theorem.

Proof: If $T$ is sufficient for $\theta$ then we can write $g(\theta, t) = f_{T \mid \theta}(t \mid \theta)$ and $h(y) = f_{Y \mid t}(y \mid t)$.

To prove the converse we start by integrating (or summing) (10) over all values of $y$ where $T(y) = t$. This gives

$$f_{T \mid \theta}(t \mid \theta) = g(\theta, t) \, H(t)$$

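for some function $H(t)$. This gives us

$$g(\theta, t) = \frac{f_{T \mid \theta}(t \mid \theta)}{H(t)}$$

which we substitute in (10) to obtain

$$f_{Y \mid \theta}(y \mid \theta) = \frac{f_{T \mid \theta}(t \mid \theta) \, h(y)}{H(t)}.$$

Now

$$f_{Y \mid t, \theta}(y \mid t, \theta) = \frac{f_{Y, T \mid \theta}(y, t \mid \theta)}{f_{T \mid \theta}(t \mid \theta)} = \frac{f_{Y \mid \theta}(y \mid \theta)}{f_{T \mid \theta}(t \mid \theta)}$$

so

$$f_{Y \mid t, \theta}(y \mid t, \theta) = \frac{h(y)}{H(t)}.$$

The right hand side of this equation does not depend on $\theta$ so the theorem is proved.

12.4 Sufficiency principle

From (9) we can see that, if $T$ is sufficient for $\theta$, then the likelihood for $\theta$ from $y$ is proportional to the likelihood for $\theta$ from $t$. Therefore, instead of using the likelihood for the full data we can use the likelihood based simply on the distribution of $T$.

12.5 Examples

12.5.1 Poisson

Suppose that we observe random variables $Y_1, \ldots, Y_n$ where, given the value of the parameter $\lambda$, $Y_i$ is independent of $Y_j$ for $i \neq j$ and $Y_i \sim \text{Poisson}(\lambda)$ for $i = 1, \ldots, n$. Then the likelihood is

$$L(\lambda; y) = \prod_{i=1}^n \frac{e^{-\lambda} \lambda^{y_i}}{y_i!} = e^{-n\lambda} \lambda^S \prod_{i=1}^n \frac{1}{y_i!} = g(\lambda, S) \, h(y)$$

where $S = \sum_{i=1}^n y_i$, $g(\lambda, S) = e^{-n\lambda} \lambda^S$ and $h(y) = \prod_{i=1}^n \frac{1}{y_i!}$. So $S$ is sufficient for $\lambda$.

Furthermore $S \sim \text{Poisson}(n\lambda)$ so an equivalent likelihood is

$$L_S(\lambda; y) = \frac{e^{-n\lambda} (n\lambda)^S}{S!} \propto e^{-n\lambda} \lambda^S.$$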
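A quick numerical check of the Poisson example: two samples of the same size with the same total $S$ should give log-likelihoods that differ only by a constant in $\lambda$, and the same should hold between the full-data likelihood and the likelihood based on $S \sim \text{Poisson}(n\lambda)$. The sketch below uses only the Python standard library; the two data vectors and the function names are illustrative assumptions, not taken from the notes.

```python
import math

def loglik_full(lam, y):
    """Log-likelihood of lambda from the full sample y_1, ..., y_n."""
    return sum(-lam + yi * math.log(lam) - math.lgamma(yi + 1) for yi in y)

def loglik_sum(lam, s, n):
    """Log-likelihood of lambda from S ~ Poisson(n * lambda) alone."""
    return -n * lam + s * math.log(n * lam) - math.lgamma(s + 1)

y_a = [0, 4, 2, 1, 3]  # illustrative sample with n = 5 and S = 10
y_b = [2, 2, 2, 2, 2]  # a different sample with the same n and S
lambdas = [0.5, 1.0, 2.0, 4.0]

# Both lists should be constant across lambda: the differences are
# log h(y) terms that do not involve the parameter.
gap_between_samples = [loglik_full(l, y_a) - loglik_full(l, y_b) for l in lambdas]
gap_full_vs_sum = [loglik_full(l, y_a) - loglik_sum(l, sum(y_a), len(y_a)) for l in lambdas]
print(gap_between_samples)
print(gap_full_vs_sum)
```

Because the gaps do not vary with $\lambda$, any prior combined with either likelihood leads to the same posterior distribution.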

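12.5.2 Normal

Suppose that we observe random variables $Y_1, \ldots, Y_n$ where, given the value of the parameters $\mu, \sigma^2$, $Y_i$ is independent of $Y_j$ for $i \neq j$ and $Y_i \sim N(\mu, \sigma^2)$ for $i = 1, \ldots, n$. Here the parameter is $\theta = (\mu, \sigma^2)^T$. The likelihood is

$$\begin{aligned}
L(\mu, \sigma^2; y) &= \prod_{i=1}^n (2\pi\sigma^2)^{-1/2} \exp\left\{ -\frac{1}{2\sigma^2} (y_i - \mu)^2 \right\} \\
&= (2\pi\sigma^2)^{-n/2} \exp\left\{ -\frac{1}{2\sigma^2} \sum_{i=1}^n (y_i - \mu)^2 \right\} \\
&= (2\pi\sigma^2)^{-n/2} \exp\left\{ -\frac{1}{2\sigma^2} \sum_{i=1}^n (y_i - \bar{y} + \bar{y} - \mu)^2 \right\} \\
&= (2\pi\sigma^2)^{-n/2} \exp\left\{ -\frac{1}{2\sigma^2} \left[ \sum_{i=1}^n (y_i - \bar{y})^2 + n(\bar{y} - \mu)^2 \right] \right\} \\
&= (2\pi\sigma^2)^{-n/2} \exp\left\{ -\frac{1}{2\sigma^2} \left[ S + n(\bar{y} - \mu)^2 \right] \right\} \\
&= g(\theta, T) \, h(y)
\end{aligned}$$

where $h(y) = 1$, $T = (\bar{y}, S)^T$,

$$\bar{y} = \frac{1}{n} \sum_{i=1}^n y_i \quad \text{and} \quad S = \sum_{i=1}^n (y_i - \bar{y})^2.$$

Hence $\bar{y}$ and $S$ are sufficient for $\mu$ and $\sigma^2$.

Furthermore, in the case where $\sigma^2$ is known, $\bar{y}$ is sufficient for $\mu$ since

$$L(\mu; y) = \exp\left\{ -\frac{n}{2\sigma^2} (\bar{y} - \mu)^2 \right\} (2\pi\sigma^2)^{-n/2} \exp\left\{ -\frac{S}{2\sigma^2} \right\} = g(\mu, \bar{y}) \, h(y)$$

with

$$h(y) = (2\pi\sigma^2)^{-n/2} \exp\left\{ -\frac{S}{2\sigma^2} \right\}.$$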
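The normal factorisation can be checked in the same spirit: the log-likelihood computed from all $n$ observations should agree exactly with the one computed from $n$, $\bar{y}$ and $S$ alone, since $h(y) = 1$. This is a minimal sketch assuming NumPy; the simulated data, the seed and the function names are illustrative choices rather than anything from the notes.

```python
import numpy as np

def loglik_full(mu, sigma2, y):
    """Normal log-likelihood using every observation."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    return -0.5 * n * np.log(2 * np.pi * sigma2) - np.sum((y - mu) ** 2) / (2 * sigma2)

def loglik_sufficient(mu, sigma2, n, ybar, S):
    """Normal log-likelihood using only n, ybar and S = sum (y_i - ybar)^2."""
    return -0.5 * n * np.log(2 * np.pi * sigma2) - (S + n * (ybar - mu) ** 2) / (2 * sigma2)

rng = np.random.default_rng(2)
y = rng.normal(loc=3.0, scale=1.5, size=25)  # illustrative simulated data
n = len(y)
ybar = y.mean()
S = np.sum((y - ybar) ** 2)

for mu, sigma2 in [(0.0, 1.0), (3.0, 2.25), (5.0, 0.5)]:
    print(mu, sigma2, loglik_full(mu, sigma2, y), loglik_sufficient(mu, sigma2, n, ybar, S))
```

The two columns of log-likelihood values agree up to floating-point rounding for every $(\mu, \sigma^2)$ pair tried, which is what sufficiency of $(\bar{y}, S)$ requires.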
