
7. Estimation and hypothesis testing

Objective

In this chapter, we first show how the selection of estimators can be represented as a decision problem. Secondly, we consider the problem of hypothesis testing from a Bayesian viewpoint and illustrate the similarities and differences between Bayesian and classical procedures.

Recommended reading

Kass, R.E. and Raftery, A.E. (1995). Bayes factors. Journal of the American Statistical Association, 90, 773-795.

Point estimation

Assume that $\theta$ is univariate. For Bayesians, the problem of estimation of $\theta$ is a decision problem. Associated with an estimator, $T$, of $\theta$, there is a loss function $L(T, \theta)$ which reflects the difference between the values of $T$ and $\theta$. Then, for an expert, $E$, with distribution $p(\theta)$, the Bayes estimator of $\theta$ is that which minimizes the expected loss
$$E[L(T, \theta)] = \int L(T, \theta)\, p(\theta) \, d\theta.$$

Definition 18. The Bayes estimator, $B$, of $\theta$ is such that $E[L(B, \theta)] \leq E[L(T, \theta)]$ for any $T \neq B$.

Bayes estimators for some given loss functions

The mean, median and mode of $E$'s distribution for $\theta$ can be justified as estimators given certain specific loss functions.

Theorem 29. Suppose that $E$'s density for $\theta$ is $p(\theta)$. Then:

1. If $L(T, \theta) = (T - \theta)^2$, then $B$ is $E$'s mean, $B = E[\theta]$.
2. If $L(T, \theta) = |T - \theta|$, then $B$ is $E$'s median for $\theta$.
3. If $\Theta$ is discrete and $L(T, \theta) = \begin{cases} 0 & \text{if } T = \theta \\ 1 & \text{if } T \neq \theta, \end{cases}$ then $B$ is $E$'s modal estimator for $\theta$.

Proof

1. $$E[L(T, \theta)] = \int (T - \theta)^2 p(\theta) \, d\theta = \int (T - E[\theta] + E[\theta] - \theta)^2 p(\theta) \, d\theta = (T - E[\theta])^2 + V[\theta],$$
which is minimized when $T = E[\theta]$, so $B = E[\theta]$ is the Bayes estimator.
2. Exercise.
3. Exercise.

In the case that $\Theta$ is continuous, if we consider the loss function
$$L(T, \theta) = \begin{cases} 0 & \text{if } |T - \theta| < \epsilon \\ 1 & \text{otherwise,} \end{cases}$$
then it is easy to see that $B$ is the centre of the interval of width $2\epsilon$ with highest probability and, letting $\epsilon \to 0$, $B$ approaches the mode.
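To make Theorem 29 concrete, the following sketch checks numerically that the mean minimizes the expected quadratic loss and the median minimizes the expected absolute loss. It is a minimal Monte Carlo illustration; the Gamma(2, 1) density is an arbitrary stand-in for $E$'s distribution $p(\theta)$.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = rng.gamma(2.0, 1.0, size=100_000)  # draws from a stand-in for p(theta)

# Approximate E[L(T, theta)] on a grid of candidate estimators T
grid = np.linspace(0.01, 8.0, 800)
sq_loss = [np.mean((t - theta) ** 2) for t in grid]
abs_loss = [np.mean(np.abs(t - theta)) for t in grid]

print("argmin quadratic loss:", grid[np.argmin(sq_loss)], "vs mean:", theta.mean())
print("argmin absolute loss: ", grid[np.argmin(abs_loss)], "vs median:", np.median(theta))
```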

Interval estimation

An expert's 95% credible interval for a variable is simply an interval to which the expert, $E$, assigns 95% probability. We have already used such intervals in earlier chapters.

Definition 19. If $p(\theta)$ is $E$'s density for $\theta$, we say that $(a, b)$ is a $100(1 - \alpha)\%$ credible interval for $\theta$ if
$$P(a \leq \theta \leq b) = \int_a^b p(\theta) \, d\theta = 1 - \alpha.$$

It is clear that, in general, $E$ will have (infinitely) many credible intervals for $\theta$. The shortest possible credible interval is called a highest posterior density (HPD) interval.

Definition 20. The $100(1 - \alpha)\%$ HPD interval is an interval of the form $C = \{\theta : p(\theta) \geq c(\alpha)\}$, where $c(\alpha)$ is the largest number such that $P(C) \geq 1 - \alpha$.

Example 53. $X \mid \mu \sim N(\mu, 1)$. Let $p(\mu) \propto 1$. Therefore, $\mu \mid \mathbf{x} \sim N(\bar{x}, 1/n)$ and some 95% posterior credible intervals are $(-\infty, \bar{x} + 1.64/\sqrt{n})$ or $(\bar{x} - 1.64/\sqrt{n}, \infty)$ or $\bar{x} \pm 1.96/\sqrt{n}$, which is the HPD interval.

We can generalize the definition of a credible interval to multivariate densities $p(\boldsymbol{\theta})$. In this case, we can define a credible region $C$ such that $P(\boldsymbol{\theta} \in C) = 1 - \alpha$.
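The intervals in Example 53 are easy to reproduce numerically. A small sketch, with hypothetical values for $n$ and $\bar{x}$:

```python
import numpy as np
from scipy import stats

n, xbar = 25, 0.3  # hypothetical sample size and sample mean
post = stats.norm(loc=xbar, scale=1 / np.sqrt(n))  # mu | x ~ N(xbar, 1/n)

print("one-sided :", (-np.inf, post.ppf(0.95)))  # (-inf, xbar + 1.64/sqrt(n))
print("one-sided :", (post.ppf(0.05), np.inf))   # (xbar - 1.64/sqrt(n), inf)
print("HPD       :", post.interval(0.95))        # xbar +/- 1.96/sqrt(n), the shortest
```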

Hypothesis testing

Assume now that we wish to test the hypothesis $H_0: \theta \in \Theta_0$ versus the alternative $H_1: \theta \in \Theta_1$, where $\Theta_0 \cup \Theta_1 = \Theta$ and $\Theta_0 \cap \Theta_1 = \emptyset$. In theory, this is straightforward. Given a sample of data, $\mathbf{x}$, we can calculate the posterior probabilities $P(H_0 \mid \mathbf{x})$ and $P(H_1 \mid \mathbf{x})$ and, under a suitable loss function, decide to accept or reject the null hypothesis $H_0$.

Example 54. Given the all or nothing loss
$$L(H_0, \theta) = \begin{cases} 0 & \text{if } H_0 \text{ is true} \\ 1 & \text{if } H_1 \text{ is true,} \end{cases}$$
we accept $H_0$ if $P(H_0 \mid \mathbf{x}) > P(H_1 \mid \mathbf{x})$.

For point null and alternative hypotheses, or for one-tailed tests, Bayesian and classical solutions are often similar.

Example 55. Let $X \mid \theta \sim N(\theta, 1)$. We wish to test $H_0: \theta \leq 0$ against $H_1: \theta > 0$. Given the uniform prior, $p(\theta) \propto 1$, we know that $\theta \mid \mathbf{x} \sim N(\bar{x}, 1/n)$. Therefore,
$$P(H_0 \mid \mathbf{x}) = P(\theta \leq 0 \mid \mathbf{x}) = \Phi(-\sqrt{n}\,\bar{x}),$$
where $\Phi(\cdot)$ is the standard normal cdf.

This probability is equal to the usual, classical p-value for the test of $H_0: \theta = 0$ against $H_1: \theta > 0$:
$$P(\bar{X} \geq \bar{x} \mid H_0) = P(\sqrt{n}\,\bar{X} \geq \sqrt{n}\,\bar{x} \mid H_0) = 1 - \Phi(\sqrt{n}\,\bar{x}) = \Phi(-\sqrt{n}\,\bar{x}).$$
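The agreement between the posterior probability and the classical p-value in Example 55 can be verified directly; a sketch with hypothetical data summaries:

```python
import numpy as np
from scipy import stats

n, xbar = 16, 0.4  # hypothetical sample size and sample mean

posterior_prob_H0 = stats.norm.cdf(-np.sqrt(n) * xbar)  # P(theta <= 0 | x)
classical_p_value = stats.norm.sf(np.sqrt(n) * xbar)    # P(Xbar >= xbar | theta = 0)
print(posterior_prob_H0, classical_p_value)             # identical: Phi(-sqrt(n) xbar)
```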

The Lindley/Jeffreys paradox

For two-tailed tests, Bayesian and classical results can be very different.

Example 56. Let $X \mid \theta \sim N(\theta, 1)$ and suppose that we wish to test the hypothesis $H_0: \theta = 0$ versus the alternative $H_1: \theta \neq 0$. Assume first the prior probabilities $p_0 = P(H_0) = 0.5 = P(H_1) = p_1$ and, under the alternative, the normal prior distribution
$$\theta \mid H_1 \sim N(0, 1),$$
and suppose that we observe the mean $\bar{x}$ of a sample of $n$ data with likelihood
$$l(\theta \mid \bar{x}) = \left(\frac{n}{2\pi}\right)^{\frac{1}{2}} \exp\left(-\frac{n}{2}(\theta - \bar{x})^2\right).$$

Then, we can calculate the posterior probability $\alpha_0 = P(H_0 \mid \bar{x})$:
$$\alpha_0 = \frac{p_0\, l(\theta = 0 \mid \bar{x})}{p_0\, l(\theta = 0 \mid \bar{x}) + p_1 \int l(\theta \mid \bar{x})\, p(\theta \mid H_1) \, d\theta} = \frac{l(\theta = 0 \mid \bar{x})}{l(\theta = 0 \mid \bar{x}) + \int l(\theta \mid \bar{x})\, p(\theta \mid H_1) \, d\theta} = K \left(\frac{n}{2\pi}\right)^{\frac{1}{2}} \exp\left(-\frac{n\bar{x}^2}{2}\right)$$
for a constant
$$K^{-1} = l(\theta = 0 \mid \bar{x}) + \int l(\theta \mid \bar{x})\, p(\theta \mid H_1) \, d\theta.$$

We can also evaluate the second term in the denominator:
$$l(\theta \mid \bar{x})\, p(\theta \mid H_1) = \left(\frac{n}{2\pi}\right)^{\frac{1}{2}} \exp\left(-\frac{n}{2}(\theta - \bar{x})^2\right) \left(\frac{1}{2\pi}\right)^{\frac{1}{2}} \exp\left(-\frac{\theta^2}{2}\right) = \frac{\sqrt{n}}{2\pi} \exp\left(-\frac{1}{2}\left[n(\bar{x} - \theta)^2 + \theta^2\right]\right) = \frac{\sqrt{n}}{2\pi} \exp\left(-\frac{1}{2}\left[(n+1)\left(\theta - \frac{n\bar{x}}{n+1}\right)^2 + \frac{n\bar{x}^2}{n+1}\right]\right)$$
and, integrating, we have
$$\int l(\theta \mid \bar{x})\, p(\theta \mid H_1) \, d\theta = \left(\frac{n}{(n+1)2\pi}\right)^{\frac{1}{2}} \exp\left(-\frac{1}{2}\frac{n\bar{x}^2}{n+1}\right).$$

Therefore, we have
$$K^{-1} = \left(\frac{n}{2\pi}\right)^{\frac{1}{2}} \exp\left(-\frac{n\bar{x}^2}{2}\right) + \left(\frac{n}{(n+1)2\pi}\right)^{\frac{1}{2}} \exp\left(-\frac{1}{2}\frac{n\bar{x}^2}{n+1}\right) = \left(\frac{n}{2\pi}\right)^{\frac{1}{2}} \exp\left(-\frac{n\bar{x}^2}{2}\right)\left(1 + \frac{1}{\sqrt{n+1}} \exp\left(\frac{n^2\bar{x}^2}{2(n+1)}\right)\right),$$
so the posterior probability of $H_0$ is
$$\alpha_0 = \left\{1 + \frac{1}{\sqrt{n+1}} \exp\left(\frac{n^2\bar{x}^2}{2(n+1)}\right)\right\}^{-1}.$$
Now suppose that $\bar{x} = 2/\sqrt{n} > 1.96/\sqrt{n}$.

Therefore, for the classical test of $H_0$ against $H_1$ at a fixed 5% significance level, the result is significant and we reject the null hypothesis $H_0$. However, in this case, the posterior probability is
$$\alpha_0 = \left\{1 + \frac{1}{\sqrt{n+1}} \exp\left(\frac{2n}{n+1}\right)\right\}^{-1} \to 1 \quad \text{as } n \to \infty.$$
Using Bayesian methods, a sample that leads us to reject $H_0$ using a classical test gives a posterior probability of $H_0$ that approaches 1 as $n \to \infty$. This is called the Lindley/Jeffreys paradox. See Lindley (1957).
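A quick computation makes the paradox tangible: for $\bar{x} = 2/\sqrt{n}$, the classical test rejects $H_0$ at the 5% level for every $n$, while $\alpha_0$ climbs towards 1. A minimal sketch of Example 56:

```python
import numpy as np

def alpha0(n):
    """P(H0 | xbar) from Example 56, evaluated at xbar = 2/sqrt(n)."""
    return 1.0 / (1.0 + np.exp(2.0 * n / (n + 1.0)) / np.sqrt(n + 1.0))

for n in [10, 100, 10_000, 1_000_000]:
    print(f"n = {n:>9}: rejected classically at 5%, but alpha0 = {alpha0(n):.4f}")
```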

Comments

The choice of the prior variance of $\theta$ is quite important, but the example also illustrates that it is not very sensible, from a classical viewpoint, to use a fixed significance level as $n$ increases. (In practice, smaller values of $\alpha$ are virtually always used when $n$ increases so that the power of the test is maintained.)

Also, point null hypotheses do not appear to be very sensible from a Bayesian viewpoint. The Lindley/Jeffreys paradox can be avoided by replacing the point null with an interval $(-\epsilon, \epsilon)$ and placing a prior distribution over this interval; see e.g. Berger and Delampady (1987). The main practical difficulty then lies in the choice of $\epsilon$.

Bayes factors

The original idea for the Bayes factor stems from Good (1958), who attributes it to Turing. The motivation is to find a more objective measure than the simple posterior probability.

Motivation

An important problem in many contexts, such as regression modeling, is that of selecting a model $M_i$ from a given class $\mathcal{M} = \{M_i : i = 1, \ldots, k\}$. In this case, given a prior distribution $P(\cdot)$ over the model space, we have
$$P(M_i \mid \mathbf{x}) = \frac{P(M_i) f(\mathbf{x} \mid M_i)}{\sum_{j=1}^k P(M_j) f(\mathbf{x} \mid M_j)},$$
and, given the all or nothing loss function, we should select the most probable model. However, the prior model probabilities are strongly influential in this choice and, if the class of possible models $\mathcal{M}$ is large, it is often difficult or impossible to specify the model probabilities $P(M_i)$ precisely. Thus, it is necessary to introduce another concept which is less strongly dependent on the prior information.

Consider two hypotheses (or models) $H_0$ and $H_1$ and let $p_0 = P(H_0)$, $p_1 = P(H_1)$, $\alpha_0 = P(H_0 \mid \mathbf{x})$ and $\alpha_1 = P(H_1 \mid \mathbf{x})$. Then Jeffreys (1961) defines the Bayes factor comparing the two hypotheses as follows.

Definition 21. The Bayes factor in favour of $H_0$ is defined to be
$$B_{01} = \frac{\alpha_0/\alpha_1}{p_0/p_1} = \frac{\alpha_0 p_1}{\alpha_1 p_0}.$$

The Bayes factor is simply the posterior odds in favour of $H_0$ divided by the prior odds. It tells us how the data change our relative beliefs about the two models. The Bayes factor is almost an objective measure and partially eliminates the influence of the prior distribution, in that it is independent of $p_0$ and $p_1$.

Proof
$$\alpha_0 = P(H_0 \mid \mathbf{x}) = \frac{p_0 f(\mathbf{x} \mid H_0)}{p_0 f(\mathbf{x} \mid H_0) + p_1 f(\mathbf{x} \mid H_1)}, \qquad \alpha_1 = \frac{p_1 f(\mathbf{x} \mid H_1)}{p_0 f(\mathbf{x} \mid H_0) + p_1 f(\mathbf{x} \mid H_1)}$$
$$\Rightarrow \frac{\alpha_0}{\alpha_1} = \frac{p_0 f(\mathbf{x} \mid H_0)}{p_1 f(\mathbf{x} \mid H_1)} \Rightarrow B_{01} = \frac{f(\mathbf{x} \mid H_0)}{f(\mathbf{x} \mid H_1)},$$
which is independent of $p_0$ and $p_1$.

The Bayes factor is strongly related to the likelihood ratio, often used in classical statistics for model choice. If $H_0$ and $H_1$ are point null hypotheses, then the Bayes factor is exactly equal to the likelihood ratio.

Example 57. Suppose that we wish to test $H_0: \lambda = 6$ versus $H_1: \lambda = 3$ and that we observe a sample of size $n$ from an exponential density with rate $\lambda$, $f(x \mid \lambda) = \lambda e^{-\lambda x}$. Then the Bayes factor in favour of $H_0$ in this case is
$$B_{01} = \frac{l(\lambda = 6 \mid \mathbf{x})}{l(\lambda = 3 \mid \mathbf{x})} = \frac{6^n e^{-6n\bar{x}}}{3^n e^{-3n\bar{x}}} = 2^n e^{-3n\bar{x}}.$$
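A short simulation of Example 57, assuming the data really are generated under $H_0$ ($\lambda = 6$); the log scale avoids overflow for larger $n$:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20
x = rng.exponential(scale=1/6, size=n)  # simulate under H0: rate lambda = 6

xbar = x.mean()
log_B01 = n * np.log(2) - 3 * n * xbar  # log of B01 = 2^n exp(-3 n xbar)
print("B01 =", np.exp(log_B01))         # typically >> 1: the data favour H0
```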

For composite hypotheses, however, the Bayes factor depends on the prior parameter distributions. For example, in the case that $H_0$ is composite, the marginal likelihood is
$$f(\mathbf{x} \mid H_0) = \int f(\mathbf{x} \mid \theta, H_0)\, p(\theta \mid H_0) \, d\theta,$$
which is dependent on the prior parameter distribution, $p(\theta \mid H_0)$.
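When $H_0$ is composite, the marginal likelihood can be approximated by averaging the likelihood over draws from the prior. A rough Monte Carlo sketch for exponential data, using a hypothetical Gamma(2, 1) prior for the rate under $H_0$ (naive prior sampling; fine for illustration, though high-variance in practice):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.exponential(scale=1/4, size=15)  # observed data (simulated here)

lam = rng.gamma(shape=2.0, scale=1.0, size=200_000)  # draws from p(theta | H0)
loglik = x.size * np.log(lam) - lam * x.sum()        # exponential log-likelihood

# f(x | H0) = E_prior[ l(theta | x) ], approximated by the sample average
print("estimated f(x | H0):", np.exp(loglik).mean())
```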

Consistency of the Bayes factor

The Bayes factor always takes values between 0 and infinity. Furthermore, it is obvious that $B_{01} = \infty$ if $P(H_0 \mid \mathbf{x}) = 1$ and $B_{01} = 0$ if $P(H_0 \mid \mathbf{x}) = 0$. The Bayes factor is consistent, so that if $H_0$ is true, then $B_{01} \to \infty$ as $n \to \infty$, and if $H_1$ is true, then $B_{01} \to 0$ as $n \to \infty$.

Returning to Example 57, it can be seen that if $\lambda = 6$, then as $n \to \infty$ we have $\bar{x} \to 1/6$ and hence $B_{01} \approx 2^n e^{-n/2} \to \infty$.

Bayes factors and scales of evidence

We can interpret the statistic $2 \log B_{01}$ as a Bayesian version of the classical log likelihood ratio statistic. Kass and Raftery (1995) suggest using the following table of values of $B_{01}$ to represent the evidence against $H_1$.

2 log B01     B01          Evidence against H1
0 to 2        1 to 3       Hardly worth a mention
2 to 6        3 to 20      Positive
6 to 10       20 to 150    Strong
> 10          > 150        Very strong

Jeffreys (1961) suggests a similar, alternative scale of evidence.
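A trivial helper that maps a Bayes factor onto the table above, sketching how the scale is read in practice:

```python
import math

def kass_raftery_evidence(B01):
    """Evidence against H1 on the Kass and Raftery (1995) scale."""
    s = 2 * math.log(B01)
    if s < 2:
        return "hardly worth a mention"
    if s < 6:
        return "positive"
    if s < 10:
        return "strong"
    return "very strong"

print(kass_raftery_evidence(25))  # 2 log 25 = 6.4 -> "strong"
```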

Relation of the Bayes factor to standard model selection criteria

The Schwarz (1978) criterion or Bayesian information criterion for evaluating a model $M$ is
$$\text{BIC} = -2 \log l(\hat{\theta}_M \mid \mathbf{x}) + d \log n,$$
where $\hat{\theta}$ is the MLE and $d$ is the dimension of the parameter space $\Theta$ for $M$. Then, it is possible to show that as the sample size $n \to \infty$,
$$\text{BIC}_0 - \text{BIC}_1 \approx -2 \log B_{01},$$
where $\text{BIC}_i$ represents the Bayesian information criterion for model $i$ and $B_{01}$ is the Bayes factor. See Kass and Raftery (1995).
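The BIC approximation is simple to compute. A sketch comparing two nested Gaussian models (M0: zero mean, M1: free mean, with the variance estimated in both), where the difference of BICs approximates $-2 \log B_{01}$ for large $n$:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(loc=0.2, scale=1.0, size=200)  # simulated data
n = x.size

def gauss_loglik(x, mu, sigma):
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2) - (x - mu)**2 / (2 * sigma**2))

# M0: mean fixed at 0, sigma estimated by its MLE (d = 1)
sigma0 = np.sqrt(np.mean(x**2))
bic0 = -2 * gauss_loglik(x, 0.0, sigma0) + 1 * np.log(n)

# M1: mean and sigma both estimated by their MLEs (d = 2)
bic1 = -2 * gauss_loglik(x, x.mean(), x.std()) + 2 * np.log(n)

print("BIC0 - BIC1 =", bic0 - bic1)  # approximates -2 log B01
```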

The deviance information criterion (DIC)

The DIC (Spiegelhalter et al 2002) is a Bayesian alternative to the BIC, AIC etc., appropriate for use in hierarchical models, where the Bayes factor is difficult to calculate. For a model $M$ and sample data $\mathbf{x}$, the deviance is
$$D(\mathbf{x} \mid M, \theta) = -2 \log l(\theta \mid M, \mathbf{x}).$$
The expected deviance, $\bar{D} = E[D \mid M, \mathbf{x}]$, taken with respect to the posterior distribution of $\theta$, is a measure of the lack of fit of the model. The effective number of model parameters is
$$p_D = \bar{D} - D(\mathbf{x} \mid M, \bar{\theta}),$$
where $\bar{\theta}$ is the posterior mean. Then the deviance information criterion is
$$\text{DIC} = \bar{D} + p_D.$$
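These definitions translate directly into code. A minimal sketch for the $N(\mu, 1)$ model with a flat prior, where the posterior is available in closed form and $p_D$ should come out close to 1 (one free parameter):

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(loc=1.0, scale=1.0, size=50)  # data, sigma = 1 known
n, xbar = x.size, x.mean()

# Posterior draws under the flat prior: mu | x ~ N(xbar, 1/n)
mu = rng.normal(loc=xbar, scale=1 / np.sqrt(n), size=20_000)

def deviance(mu):
    # D = -2 log l(mu | x) for the N(mu, 1) model; mu may be an array
    return n * np.log(2 * np.pi) + np.sum((x[:, None] - mu) ** 2, axis=0)

Dbar = deviance(mu).mean()                 # expected deviance
pD = Dbar - deviance(np.array([xbar]))[0]  # posterior mean of mu is xbar here
print("pD =", pD, " DIC =", Dbar + pD)
```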

An advantage of the DIC is that it is easy to calculate when using Gibbs sampling, and the criterion is implemented automatically in WinBUGS; see Spiegelhalter et al (2002) for more details. However, the DIC has some strange properties which make it inappropriate in certain contexts; for example, it is not guaranteed that $p_D$ is positive. For alternatives, see e.g. Celeux et al (2006).
