Statistical Theory MT 2007

Problems 4: Solution sketches

1. Consider a 1-parameter exponential family model with density
\[
f(x \mid \theta) = f(x)\, g(\theta)\, \exp\{c\,\phi(\theta) h(x)\}, \qquad x \in X.
\]
Suppose that the prior distribution has the mixture form
\[
\pi(\theta \mid \alpha, \tau_0, \tau_1) = \sum_{l=1}^{d} \alpha_l\, \pi(\theta \mid \tau_0^{(l)}, \tau_1^{(l)}), \qquad \alpha_1 + \cdots + \alpha_d = 1,
\]
where the densities $\pi(\theta \mid \tau_0^{(l)}, \tau_1^{(l)})$ are members of the conjugate family associated with $f(x \mid \theta)$. Show that after observing a sample $x_1, \ldots, x_n$, the posterior for $\theta$ also has a mixture form, and identify the (posterior) mixing probabilities and the (posterior) hyperparameters.

Solution: Conjugate priors for the likelihood $f(x \mid \theta)$ are of the form
\[
\pi(\theta \mid \tau) = g(\theta)^{\tau_0} \exp\{c\,\phi(\theta)\tau_1\}/K(\tau),
\]
where $K(\tau) = K(\tau_0, \tau_1)$ is the normalising constant. After observing a sample $x_1, \ldots, x_n$ the posterior density is
\[
\pi(\theta \mid x, \tau) = g(\theta)^{\tau_0 + H_0} \exp\{c\,\phi(\theta)(\tau_1 + H_1)\}/K(\tau + H),
\]
where $H_0 = n$, $H_1 = \sum_{i=1}^{n} h(x_i)$, and $H = (H_0, H_1)$. If the prior has the mixture form, then the posterior is proportional to
\begin{align*}
\sum_{l=1}^{d} \alpha_l\, \pi(\theta \mid \tau^{(l)})\, g(\theta)^{n} \exp\{c\,\phi(\theta) H_1\}
&= \sum_{l=1}^{d} \alpha_l\, g(\theta)^{\tau_0^{(l)} + H_0} \exp\{c\,\phi(\theta)(\tau_1^{(l)} + H_1)\}/K(\tau^{(l)}) \\
&= \sum_{l=1}^{d} \alpha_l\, \pi(\theta \mid x, \tau^{(l)})\, K(\tau^{(l)} + H)/K(\tau^{(l)}).
\end{align*}
The normalising constant is the integral of this expression, and since each of the component densities $\pi(\theta \mid x, \tau^{(l)})$ has integral 1, the normalising constant is $\sum_{l=1}^{d} \alpha_l K(\tau^{(l)} + H)/K(\tau^{(l)})$, so that the posterior density is
\[
\sum_{l=1}^{d} \beta_l\, \pi(\theta \mid x, \tau^{(l)}), \qquad
\beta_l = \frac{\alpha_l K(\tau^{(l)} + H)/K(\tau^{(l)})}{\sum_{m=1}^{d} \alpha_m K(\tau^{(m)} + H)/K(\tau^{(m)})}.
\]
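As a concrete illustration (not part of the question), the mixture update can be sketched numerically for a Poisson likelihood with a mixture of Gamma priors, where $K(\tau)$ is available in closed form: a $\mathrm{Gamma}(a, b)$ component has normalising constant $\Gamma(a)/b^{a}$. The function name and parametrisation below are illustrative choices, not from the problem sheet.

```python
import numpy as np
from scipy.special import gammaln

def posterior_mixture(alphas, a, b, x):
    """Update a mixture-of-Gamma(a_l, b_l) prior for a Poisson mean.

    Each component updates conjugately to Gamma(a_l + S, b_l + n), and the
    mixing weight alpha_l is reweighted by K(tau^(l) + H) / K(tau^(l)),
    computed on the log scale for numerical stability.
    """
    x = np.asarray(x)
    n, S = len(x), x.sum()
    a_post, b_post = a + S, b + n
    # log K(tau) for a Gamma(a, b) component: log Gamma(a) - a log b
    log_ratio = (gammaln(a_post) - a_post * np.log(b_post)) \
              - (gammaln(a) - a * np.log(b))
    log_beta = np.log(alphas) + log_ratio
    betas = np.exp(log_beta - log_beta.max())
    betas /= betas.sum()
    return betas, a_post, b_post
```

With, say, a component centred near $\theta = 1$ and another near $\theta = 10$, data with sample mean near 10 shift essentially all posterior weight onto the second component, as the formula for $\beta_l$ predicts.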
2. Suppose that $X$ has a Poisson distribution with unknown mean $\theta$. Determine the conjugate prior family, and associated posterior distribution, for $\theta$. Determine the Jeffreys prior $\pi_J$ for $\theta$, and discuss whether the scale-invariant prior $\pi_0(\theta) = 1/\theta$ might be preferable as a noninformative prior. Using $\pi_J$ as the reference measure, find the maximum entropy prior under the constraints that the prior mean and variance of $\theta$ are both 1. (Just write it in terms of the constants $\lambda_1$ and $\lambda_2$ from lectures; do not solve for these.) Repeat, for the reference measure $\pi_0$. (Again, just write it in terms of $\lambda_1$ and $\lambda_2$; do not solve for these.)

Solution: The Poisson probability mass function is
\[
f(x \mid \theta) = e^{-\theta}\frac{\theta^{x}}{x!}, \qquad x = 0, 1, \ldots
\]
An element $\pi(\theta)$ of the conjugate prior family is given by
\[
\pi(\theta) \propto e^{-\tau_0\theta}\theta^{\tau_1}, \qquad \theta > 0.
\]
For this to be a proper prior we require $\tau_0 > 0$ and $\tau_1 > -1$. The posterior density of $\theta$ given $x$ is then given by
\[
\pi(\theta \mid x) \propto e^{-(\tau_0 + 1)\theta}\theta^{\tau_1 + x}, \qquad \theta > 0,
\]
which is a Gamma distribution with parameters $\alpha = \tau_1 + x + 1$ and $\beta = \tau_0 + 1$.

To obtain the Jeffreys prior we need the Fisher information $I(\theta)$. Note that
\[
\frac{\partial}{\partial\theta}\log f(x \mid \theta) = \frac{x}{\theta} - 1,
\]
so
\[
I(\theta) = E\left(\frac{\partial\log f(x \mid \theta)}{\partial\theta}\right)^{2} = \frac{\operatorname{Var}(x)}{\theta^{2}} = \frac{\theta}{\theta^{2}} = \frac{1}{\theta}.
\]
It follows that the Jeffreys prior is
\[
\pi_J(\theta) \propto \theta^{-1/2}, \qquad \theta > 0.
\]
Comparison with the scale-invariant prior: If your relative beliefs about any two parameter values $\theta_1$ and $\theta_2$ depend only on their ratio $\theta_1/\theta_2$, then you will be led to the scale-invariant prior, $\pi_0 = 1/\theta$. If on the other hand you insist that the prior distribution of $\theta$ is invariant under parametric transformations, then you are led to the Jeffreys prior. In this example, these two philosophies are incompatible.

Maximum entropy prior: The constraints are $E_\pi(\theta) = 1$ and $E_\pi\big((\theta - 1)^{2}\big) = 1$, and the reference measure is $\pi_{\mathrm{ref}}(\theta) = \pi_J(\theta) \propto \theta^{-1/2}$. By definition, the prior density that
maximises the entropy relative to the reference density $\pi_{\mathrm{ref}}$ and satisfies the $m$ constraints is
\[
\pi(\theta) = \frac{\pi_{\mathrm{ref}}(\theta)\exp\big(\sum_{k=1}^{m}\lambda_k g_k(\theta)\big)}{\int \pi_{\mathrm{ref}}(\theta)\exp\big(\sum_{k=1}^{m}\lambda_k g_k(\theta)\big)\,d\theta},
\]
and in our case $m = 2$, $g_1(\theta) = \theta$, $g_2(\theta) = (\theta - 1)^{2}$, and so
\[
\pi(\theta) \propto \theta^{-1/2}\exp\big(\lambda_1\theta + \lambda_2(\theta - 1)^{2}\big).
\]
Similarly, when the reference measure is $\pi_{\mathrm{ref}} = \pi_0$, the maximum entropy prior is
\[
\pi(\theta) \propto \theta^{-1}\exp\big(\lambda_1\theta + \lambda_2(\theta - 1)^{2}\big).
\]
Side remark (Poisson process): The probability density of the inter-arrival times in a Poisson process is $f(x \mid \lambda) = \lambda e^{-\lambda x}$. For this density the Fisher information is
\[
I(\lambda) = E\left(\frac{\partial}{\partial\lambda}(\log\lambda - \lambda x)\right)^{2} = E\left(\frac{1}{\lambda} - x\right)^{2} = \frac{1}{\lambda^{2}},
\]
so that the Jeffreys prior for $\lambda$ is proportional to $1/\lambda$. Since $\theta = \lambda T$ and $T$ is a known constant, this implies that, from the perspective of inter-arrival times, the prior density for $\theta$ should also be proportional to $1/\theta$.

3. Consider a random sample of size $n$ from a normal distribution with unknown mean $\theta$ and unknown variance $\phi$, with the improper prior
\[
f(\theta, \phi) = \phi^{-1}, \qquad \phi > 0,
\]
for $(\theta, \phi)$. Calculate the joint posterior distribution and both marginal posteriors for $(\theta, \phi)$. Show that the $100(1 - \alpha)\%$ HPD credible interval for $\theta$ in the posterior is
\[
\bar{x} \pm t_{\alpha/2}\sqrt{\frac{\sum(x_i - \bar{x})^{2}}{n(n - 1)}},
\]
where $t_{\alpha/2}$ is the upper $\alpha/2$-quantile of the $t_{n-1}$-distribution. Compare this with the classical confidence interval for $\theta$.

Solution: The joint posterior p.d.f. of $\theta$ and $\phi$ is proportional to
\[
e^{-\sum(x_i - \theta)^{2}/(2\phi)}\,(2\pi\phi)^{-n/2}\,\frac{1}{\phi}.
\]
Let $\lambda = 1/\phi$; then, using the change of variables formula (with Jacobian $|d\phi/d\lambda| = 1/\lambda^{2}$), the joint posterior p.d.f. of $\theta$ and $\lambda$ is given by
\[
\pi(\theta, \lambda) \propto \exp\left\{-\frac{\lambda}{2}\sum(x_i - \theta)^{2}\right\}\lambda^{n/2 - 1}.
\]
Now use the identity $\sum(x_i - \theta)^{2} = \sum(x_i - \bar{x})^{2} + n(\bar{x} - \theta)^{2}$ to give
\[
\pi(\theta, \lambda) \propto \exp\left\{-\frac{\lambda}{2}\sum(x_i - \bar{x})^{2}\right\}\lambda^{n/2 - 1}\exp\left\{-\frac{\lambda}{2}n(\bar{x} - \theta)^{2}\right\}. \tag{1}
\]
Now integrate $\theta$ out, using
\[
\int \exp\left\{-\frac{\lambda}{2}n(\bar{x} - \theta)^{2}\right\}d\theta = \sqrt{\frac{2\pi}{\lambda n}},
\]
to give the marginal density
\[
\pi(\lambda) \propto \exp\left\{-\frac{\lambda}{2}\sum(x_i - \bar{x})^{2}\right\}\lambda^{(n-1)/2 - 1}.
\]
It follows that $\lambda$ has the Gamma density $\mathrm{Gamma}(\alpha, \beta)$ with $\alpha = \frac{n-1}{2}$ and $\beta = \frac{1}{2}\sum(x_i - \bar{x})^{2}$. Changing variables, with $y = \lambda\sum(x_i - \bar{x})^{2}$, we have
\[
\pi(y) = \frac{y^{(n-1)/2 - 1}e^{-y/2}}{\Gamma\big((n-1)/2\big)\,2^{(n-1)/2}}, \qquad y > 0.
\]
In other words, the distribution of $\phi^{-1}\sum(x_i - \bar{x})^{2}$ is $\chi^{2}_{n-1}$.

From (1) the conditional posterior distribution of $\theta$ given $\lambda$ is $N\big(\bar{x}, (\lambda n)^{-1}\big)$, i.e.
\[
(\theta - \bar{x})\sqrt{\lambda n} \,\big|\, \lambda \sim N(0, 1).
\]
It follows that (a posteriori) $(\theta - \bar{x})\sqrt{\lambda n}$ is independent of $\lambda$, so that
\[
\frac{(\theta - \bar{x})\sqrt{\lambda n}}{\sqrt{\lambda\sum(x_i - \bar{x})^{2}/(n-1)}} \sim t_{n-1},
\qquad\text{i.e.}\qquad
\frac{\theta - \bar{x}}{\sqrt{\sum(x_i - \bar{x})^{2}/\big(n(n-1)\big)}} \sim t_{n-1},
\]
the point being that this function of $\theta$ and $x$ has the same distribution when viewed either as a random variable dependent on $x$ with $\theta$ fixed, or as a random variable dependent on $\theta$ with $x$ fixed. This key result can also be obtained by integrating $\lambda$ out of (1).

Finally, let $t_{\alpha/2}$ be the upper $\alpha/2$-quantile of the $t_{n-1}$-distribution; then the $100(1 - \alpha)\%$ HPD credible interval is
\[
\bar{x} \pm t_{\alpha/2}\sqrt{\frac{\sum(x_i - \bar{x})^{2}}{n(n - 1)}},
\]
which is exactly the same as the classical $100(1 - \alpha)\%$ confidence interval. However, the interpretation is different (see lecture notes).
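The coincidence of the HPD interval with the classical $t$ interval is easy to check numerically. The sketch below (the function name is an illustrative choice) computes the interval directly from the formula and compares it with the standard `scipy.stats` confidence interval.

```python
import numpy as np
from scipy import stats

def hpd_interval(x, alpha=0.05):
    """100(1 - alpha)% HPD credible interval for theta under the improper
    prior pi(theta, phi) = 1/phi, i.e.
    xbar +/- t_{alpha/2} * sqrt(sum((x_i - xbar)^2) / (n(n - 1)))."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    xbar = x.mean()
    half = stats.t.ppf(1 - alpha / 2, df=n - 1) * np.sqrt(
        np.sum((x - xbar) ** 2) / (n * (n - 1)))
    return xbar - half, xbar + half

# Numerically identical to the classical t confidence interval,
# since sum((x_i - xbar)^2)/(n(n-1)) = s^2/n with s^2 the sample variance:
x = [4.2, 3.9, 5.1, 4.4, 4.8, 4.0]
print(hpd_interval(x))
print(stats.t.interval(0.95, len(x) - 1, loc=np.mean(x), scale=stats.sem(x)))
```

The two printed intervals agree to machine precision; only their interpretation (credible versus confidence) differs.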
4. Suppose that $x_1, \ldots, x_n$ is a random sample from a Poisson distribution with unknown mean $\theta$. Two models for the prior distribution of $\theta$ are contemplated:
\[
\pi_1(\theta) = e^{-\theta}, \quad \theta > 0, \qquad\text{and}\qquad \pi_2(\theta) = e^{-\theta}\theta, \quad \theta > 0.
\]
(a) Calculate the Bayes estimator of $\theta$ under both models, with quadratic loss function.

(b) The prior probabilities of model 1 and model 2 are assessed at probability $1/2$ each. Calculate the Bayes factor for $H_0$: model 1 applies, against $H_1$: model 2 applies.

Solution: (a) We calculate
\[
\pi_1(\theta \mid x) \propto e^{-n\theta}\theta^{\sum x_i}e^{-\theta} = e^{-(n+1)\theta}\theta^{\sum x_i},
\]
which we recognize as $\mathrm{Gamma}(\sum x_i + 1,\; n + 1)$. The Bayes estimator is the expected value of the posterior distribution,
\[
\frac{\sum x_i + 1}{n + 1}.
\]
For model 2,
\[
\pi_2(\theta \mid x) \propto e^{-n\theta}\theta^{\sum x_i}e^{-\theta}\theta = e^{-(n+1)\theta}\theta^{\sum x_i + 1},
\]
which we recognize as $\mathrm{Gamma}(\sum x_i + 2,\; n + 1)$. The Bayes estimator is the expected value of the posterior distribution,
\[
\frac{\sum x_i + 2}{n + 1}.
\]
Note that the first model has greater weight for smaller values of $\theta$, so its posterior distribution is shifted to the left.

(b) The prior probabilities of model 1 and model 2 are assessed at probability $1/2$ each. Then the Bayes factor is
\[
B(x) = \frac{\int_0^\infty e^{-n\theta}\theta^{\sum x_i}e^{-\theta}\,d\theta}{\int_0^\infty e^{-n\theta}\theta^{\sum x_i}e^{-\theta}\theta\,d\theta}
= \frac{\Gamma(\sum x_i + 1)/(n+1)^{\sum x_i + 1}}{\Gamma(\sum x_i + 2)/(n+1)^{\sum x_i + 2}}
= \frac{n + 1}{\sum x_i + 1}.
\]
Note that in this setting
\[
B(x) = \frac{P(\text{model 1} \mid x)}{P(\text{model 2} \mid x)} = \frac{P(\text{model 1} \mid x)}{1 - P(\text{model 1} \mid x)},
\]
so that
\[
P(\text{model 1} \mid x) = \big(1 + B(x)^{-1}\big)^{-1}.
\]
Hence
\[
P(\text{model 1} \mid x) = \left(1 + \frac{\sum x_i + 1}{n + 1}\right)^{-1} = \left(1 + \frac{\bar{x} + 1/n}{1 + 1/n}\right)^{-1},
\]
which is decreasing as $\bar{x}$ increases.

5. Let $\theta$ be a real-valued parameter and let $f(x \mid \theta)$ be the probability density function of an observation $x$, given $\theta$. The prior distribution of $\theta$ has a discrete component that gives probability $\beta$ to the point null hypothesis $H_0: \theta = \theta_0$. The remainder of the distribution is continuous, and conditional on $\theta \neq \theta_0$, its density is $g(\theta)$.

(1) Derive an expression for $\pi(\theta_0 \mid x)$, the posterior probability of $H_0$.

(2) Derive the Bayes factor $B(x)$ for the null hypothesis against the alternative.

(3) Express $\pi(\theta_0 \mid x)$ in terms of $B(x)$.

(4) Explain how you would use $B(x)$ to construct a most powerful test of size $\alpha$ for $H_0$, against the alternative $\theta \neq \theta_0$.

Suppose that $x_1, \ldots, x_n$ is a sample from a normal distribution with mean $\theta$ and variance $v$. Let $H_0: \theta = 0$, let $\beta = 1/2$, and let $g(\theta) = (2\pi w^{2})^{-1/2}\exp\{-\theta^{2}/(2w^{2})\}$, for $-\infty < \theta < \infty$. Show that, if the sample mean is observed to be $10(v/n)^{1/2}$, then

(5) the most powerful test of size $\alpha = 0.05$ will reject $H_0$ for any value of $n$;

(6) the posterior probability of $H_0$ converges to 1, as $n \to \infty$.

Solution: (1) Following lectures, the posterior probability of $\theta_0$ is
\[
\pi(\theta_0 \mid x) = \frac{\beta f(x \mid \theta_0)}{\beta f(x \mid \theta_0) + (1 - \beta)m(x)},
\qquad\text{where}\qquad
m(x) = \int f(x \mid \theta)g(\theta)\,d\theta.
\]
(2) The Bayes factor is
\[
B(x) = \frac{P(\theta = \theta_0 \mid x)\,/\,P(\theta \neq \theta_0 \mid x)}{P(\theta = \theta_0)\,/\,P(\theta \neq \theta_0)}
= \frac{\beta f(x \mid \theta_0)}{(1 - \beta)m(x)} \cdot \frac{1 - \beta}{\beta}
= \frac{f(x \mid \theta_0)}{m(x)}.
\]
(3) From lectures,
\[
\pi(\theta_0 \mid x) = \left(1 + \frac{1 - \beta}{\beta B(x)}\right)^{-1}.
\]
(4) The simple hypothesis $\theta = \theta_0$ can be tested against the simple hypothesis that the density of $x$ is $\int f(x \mid \theta)g(\theta)\,d\theta$; i.e. $H_0: x \sim f(x \mid \theta_0)$ and $H_1: x \sim m(x)$. The Neyman-Pearson Lemma says that the most powerful test of size $\alpha$ rejects $H_0$ when $f(x \mid \theta_0)/m(x) < C_\alpha$, where $C_\alpha$ is chosen such that
\[
\int_{\{x:\, f(x \mid \theta_0)/m(x) < C_\alpha\}} f(x \mid \theta_0)\,dx = \alpha.
\]
So we reject when $B(x) < C_\alpha$.

(5) We now have $x = (x_1, \ldots, x_n)$. The statistic $\bar{x}$ is sufficient. Under $H_0$, $\bar{x} \sim N(0, v/n)$, and under $H_1$, $\bar{x} \sim N(0, v/n + w^{2})$. The most powerful test rejects when $\bar{x}/\sqrt{v/n} > c_\alpha$. For a 5% test, $c_\alpha$ is certainly less than 2; since the observed value gives $\bar{x}/\sqrt{v/n} = 10$, the test will reject $H_0$ at the 5% level, whatever the value of $n$.

(6) However, the Bayes factor is
\[
B(x) = \frac{(2\pi v/n)^{-1/2}\exp\left\{-\dfrac{100\,(v/n)}{2(v/n)}\right\}}{\big(2\pi(v/n + w^{2})\big)^{-1/2}\exp\left\{-\dfrac{100\,(v/n)}{2(v/n + w^{2})}\right\}}
= \sqrt{\frac{v/n + w^{2}}{v/n}}\;\exp\left\{-50 + \frac{50\,(v/n)}{v/n + w^{2}}\right\} \to \infty
\]
as $n \to \infty$. Therefore the posterior probability of $H_0$ converges to 1 (contrasting with the conclusion of the classical test).
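The contrast between (5) and (6) can be sketched numerically. In the code below the values $v = w = 1$ are illustrative (the problem leaves them unspecified), $\beta = 1/2$ as in the question, and the function name is hypothetical. Note that with $\bar{x}$ fixed at 10 standard errors from 0, $B(x) \approx \sqrt{n w^{2}/v}\,e^{-50}$, so the posterior probability of $H_0$ only becomes large once $n$ exceeds roughly $e^{100}$; the convergence to 1 is real but extremely slow.

```python
import numpy as np
from scipy import stats

def posterior_prob_null(n, v=1.0, w=1.0, beta=0.5):
    """Posterior probability of H0: theta = 0 when xbar = 10*sqrt(v/n).

    Uses B(x) = f(xbar | theta_0) / m(xbar), where xbar ~ N(0, v/n) under
    H0 and xbar ~ N(0, v/n + w^2) under the alternative, together with
    pi(theta_0 | x) = (1 + (1 - beta) / (beta * B(x)))^{-1}.
    """
    xbar = 10 * np.sqrt(v / n)
    log_B = (stats.norm.logpdf(xbar, scale=np.sqrt(v / n))
             - stats.norm.logpdf(xbar, scale=np.sqrt(v / n + w ** 2)))
    return 1.0 / (1.0 + (1 - beta) / beta * np.exp(-log_B))

# The classical statistic xbar / sqrt(v/n) = 10 rejects H0 at the 5% level
# for every n, yet pi(theta_0 | x) -> 1 as n -> infinity:
for n in (10, 10 ** 10, 10 ** 50):
    print(n, posterior_prob_null(n))
```

This is the Jeffreys-Lindley phenomenon: a fixed number of standard errors from the null is overwhelming evidence for the classical test at every $n$, but, for large enough $n$, evidence in favour of the point null for the Bayesian.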