Theory of Inference: Homework 4

1. Here is a slightly technical question about hypothesis tests. Suppose that Y1, ..., Yn are IID Poisson(λ), with λ > 0. The two competing hypotheses are H0 : λ = λ0 versus H1 : λ = λ1, for specified values of λ0 and λ1.

(a) Derive the form of the likelihood ratio, f0(y)/f1(y).

Answer. First, we need the PMF for y, which is

\[ f(y; \lambda) = \prod_{i=1}^{n} \frac{e^{-\lambda} \lambda^{y_i}}{y_i!} \propto e^{-n\lambda} \lambda^{\sum_i y_i} = e^{-n\lambda} \lambda^{n\bar{y}}, \]

where \(\bar{y}\) is the sample mean; we can ignore multiplicative terms that do not depend on λ (since they will cancel in ratios). Then

\[ \frac{f_0(y)}{f_1(y)} = e^{-n(\lambda_0 - \lambda_1)} \left( \frac{\lambda_0}{\lambda_1} \right)^{n\bar{y}}. \]

(b) Evaluate this ratio for the values λ0 = 0.50, λ1 = 1.25 and the dataset y_obs = {1, 0, 2, 2, 0, 1, 1, 1, 1, 1, 3}. Thinking like a Bayesian, what does the dataset say about the two hypotheses?

Answer. Here is the R code I used to generate the dataset and evaluate the likelihood ratio.
set.seed(101)
lambda0 <- 0.50
lambda1 <- 1.25
n <- 11
y <- rpois(n, lambda1); print(paste(y, collapse = ", "))
ybar <- mean(y)
loglr <- -n * (lambda0 - lambda1) + (n * ybar) * log(lambda0 / lambda1)
print(exp(loglr))  # 0.02568676

This is a likelihood ratio of about 1/40. To a Bayesian who did not have strong prior beliefs about λ, this would indicate strong support for H1 over H0. This is reassuring, because you will see from the code that the true hypothesis is H1.

(c) (Harder; maybe just think about this one over Easter.) Evaluate the Type 1 and Type 2 errors of the rejection region

\[ R := \{ y : f_0(y)/f_1(y) \le k \} \]

for k = 0.5. Hint: you will find the Normal approximation to the Poisson very helpful for a pen-and-paper calculation (leave your results expressed in terms of the Normal distribution function), or else you can do the exact calculation in R, using the ppois function.

Answer. Taking λ0 < λ1, we have

\[ y \in R \iff n\bar{y} \ge \frac{\log k + n(\lambda_0 - \lambda_1)}{\log(\lambda_0/\lambda_1)} =: c \]

(make sure you check this!). If the Y's are IID Poisson with rate λ, then the sum of the Y's is also Poisson (prove this using the Moment Generating Function), and its rate is its expectation, nλ. Using the Normal approximation to the Poisson, we then have nȳ ~ N(· ; nλ, nλ), approximately, remembering that the variance of a Poisson is the same as the
expectation. Therefore the Type 1 error is approximately

\[ \alpha := \Pr_0(Y \in R) = \Pr_0(n\bar{y} \ge c) \approx 1 - \Phi(c \,;\, n\lambda_0, n\lambda_0), \]

where Φ is the Normal distribution function, with specified expectation and variance. The Type 2 error is

\[ \beta := \Pr_1(Y \notin R) \approx \Phi(c \,;\, n\lambda_1, n\lambda_1). \]

Here is some R code for the Type 1 and the Type 2 error:

k <- 0.5
c <- (log(k) + n * (lambda0 - lambda1)) / log(lambda0 / lambda1)
print(c)  # 9.760163
alpha <- pnorm(c, n * lambda0, sqrt(n * lambda0), lower.tail = FALSE)
print(alpha)  # 0.03464381
beta <- pnorm(c, n * lambda1, sqrt(n * lambda1))
print(beta)  # 0.1409683

Now let's also do the exact calculation in R:

alpha <- ppois(floor(c), n * lambda0, lower.tail = FALSE)
print(alpha)  # 0.05377747
beta <- ppois(c, n * lambda1)
print(beta)  # 0.1217738

showing that the Normal approximation is accurate to about 2 percentage points, in this case.

(d) (Another one to think about.) Produce a plot of the Type 1 error (x-axis) versus the Type 2 error (y-axis) for a range of values for k. This curve is known as the operating characteristics curve for this model and this pair of hypotheses. It is obvious that the operating characteristics curve goes through (0, 1) and (1, 0), and it is fairly easy to prove that it is always convex (this follows from the Neyman-Pearson Lemma).
Answer.

kvals <- seq(from = -2, to = 2, length = 101)
kvals <- 10^kvals
opchar <- sapply(kvals, function(k) {
  c <- (log(k) + n * (lambda0 - lambda1)) / log(lambda0 / lambda1)
  alpha <- ppois(floor(c), n * lambda0, lower.tail = FALSE)
  beta <- ppois(c, n * lambda1)
  c(k = k, alpha = alpha, beta = beta)
})
opchar <- t(opchar)  # has columns: k, alpha, beta
opchar <- rbind(c(0, 0, 1), opchar, c(Inf, 1, 0))  # always have these endpoints
plot(opchar[, c("alpha", "beta")], type = "l", col = "black", asp = 1,
  xlim = c(0, 1), ylim = c(0, 1),
  xlab = expression(paste("values for ", alpha)),
  ylab = expression(paste("values for ", beta)),
  main = "Operating characteristics curve")
dev.print(pdf, "opchar.pdf")
[Figure: the operating characteristics curve, with values for α on the x-axis and values for β on the y-axis, both running from 0 to 1.]

2. Here is an exam-style revision question about hypothesis tests. Part 2d is unseen and a bit harder. Each part carries five marks.

(a) Describe the general framework for Hypothesis Testing, and give illustrations of the types of hypotheses that might be tested.

Answer. I'll just sketch the answers, without reference to the notes. A hypothesis is a statement about the values of the parameters in a parametric model for some observables. In a hypothesis test we assess the evidential support for two competing hypotheses. So if the parametric model is Y ~ p(· ; θ) for θ ∈ Ω, then the two hypotheses can be written, in general, as

\[ H_0 : \theta \in \Omega_0 \quad \text{versus} \quad H_1 : \theta \in \Omega_1, \]
where Ω0, Ω1 ⊆ Ω, and Ω0 ∩ Ω1 = ∅. If Ωi is a single point, then Hi is a simple hypothesis; otherwise it is a composite hypothesis. One illustration is for two competing models: Y ~ p0(·) versus Y ~ p1(·), which can be thought of as special cases of a single more general model with θ ∈ {0, 1}. In this case both H0 and H1 are simple. Another common case is where some value θ0 divides the scalar parameter space into two parts. In this case we have H0 : θ ≤ θ0 versus H1 : θ > θ0, and both H0 and H1 are composite. One special case of this is where Ω = [θ0, ∞), in which case H0 is simple and H1 is composite. A third illustration is where a hypothesis concerns only one of the parameters, say H0 : µ = µ0 versus H1 : µ > µ0, in the case where Y ~ N(· ; µ, σ²). In this case both H0 and H1 are composite, because the value of σ² is not specified.

(b) Explain how a Bayesian might choose between two simple hypotheses.

Answer. Let the simple hypotheses be H0 : Y ~ p0(·) versus H1 : Y ~ p1(·). According to Bayes's theorem in odds form,

\[ \frac{\Pr(H_0 \mid Y = y)}{\Pr(H_1 \mid Y = y)} = \frac{p_0(y)}{p_1(y)} \cdot \frac{\Pr(H_0)}{\Pr(H_1)}. \]

The Bayesian will typically favour the hypothesis with the larger posterior probability; i.e. H0 if the left-hand side (the posterior odds for H0 versus H1) is greater than one, and H1 if it is less than one. The first term on the right-hand side is the likelihood ratio. If this is much larger or much smaller than one, then it will typically dominate the second term, the prior odds. So the Bayesian will typically favour H0 if the likelihood ratio is large, say more than 20, and H1 if the likelihood ratio is small, say less than 1/20. In the middle, he will need to include explicit prior probabilities: if he does not have these, then he remains undecided.

(c) State and prove the Neyman-Pearson Lemma.

Answer. You can look this one up in the lecture notes! My proof was adapted
from section 8.2 of M. DeGroot and M. Schervish, 2002, Probability and Statistics, 3rd edn, Addison-Wesley.

(d) Consider two simple hypotheses, H0 and H1, and the rejection region

\[ R := \{ y : f_0(y)/f_1(y) \le k \}. \]

Let α be the Type 1 error level of R, and β be the Type 2 error level (these are both functions of k). Show that Δα/Δβ ≈ −k, where Δ indicates the change that arises from a small change in k.

Answer. Imagine ordering all the values of y ∈ Y in terms of the values of f0(y)/f1(y); denote this ordering y(1), y(2), and so on. Fix a value of k and let y(j) be the value in Y for which

\[ \frac{f_0(y_{(j-1)})}{f_1(y_{(j-1)})} < k < \frac{f_0(y_{(j)})}{f_1(y_{(j)})}, \quad \text{or} \quad f_0(y_{(j)})/f_1(y_{(j)}) \approx k. \]

Now slightly increase k, so that the rejection region now includes y(j). The increase in the Type 1 error is f0(y(j)), and the increase in the Type 2 error is −f1(y(j)); i.e. the Type 2 error goes down, because the rejection region is bigger. So

\[ \frac{\Delta\alpha}{\Delta\beta} = \frac{f_0(y_{(j)})}{-f_1(y_{(j)})} \approx -k, \]

as stated. [Comment: this is an important and useful result, but we did not prove it in the lectures, and it is not in the notes. If you look back to the operating characteristics picture above, the gradient of the curve is Δβ/Δα ≈ −1/k, so you can almost read the value of k from the curve, providing that the aspect ratio of the plot is 1, which it is, because I specified asp = 1 as an argument to the plot function.]
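This result can be illustrated numerically with the Poisson example from question 1. A small sketch in Python, using only the standard library; the parameter values are taken from question 1, and the boundary point j = 10 is an arbitrary illustrative choice:

```python
import math

# Parameter values assumed from question 1
lambda0, lambda1, n = 0.50, 1.25, 11

def pois_pmf(i, mu):
    """Poisson PMF at integer i, with mean mu."""
    return math.exp(-mu) * mu ** i / math.factorial(i)

# s = n*ybar is sufficient, and f0(s)/f1(s) = pmf0(s)/pmf1(s) because
# the factorials cancel. Pick an illustrative boundary point j:
j = 10
k_at_j = pois_pmf(j, n * lambda0) / pois_pmf(j, n * lambda1)

# Increasing k just past k_at_j brings s = j into the rejection region,
# so alpha rises by pmf0(j) while beta falls by pmf1(j):
d_alpha = pois_pmf(j, n * lambda0)
d_beta = -pois_pmf(j, n * lambda1)

print(d_alpha / d_beta + k_at_j)  # 0.0: the ratio of changes is exactly -k
```

The agreement is exact here because k_at_j is defined as the likelihood ratio at the boundary point; for a k strictly between two boundary values the rejection region, and hence α and β, do not change at all, which is why the result is stated with Δ rather than as a derivative.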
(e) Explain briefly how we might examine the hypotheses H0 : θ = θ0 versus H1 : θ > θ0 in the case of the model Y ~ p(· ; θ) with θ ≥ θ0.

Answer. [Note: I have corrected an ambiguity in the question, by writing θ ≥ θ0.]

Briefly, the lower bound on the likelihood ratio, over all possible prior distributions for θ with support on H1, is

\[ \frac{p_0(y)}{p_1(y)} \ge \frac{p(y; \theta_0)}{p(y; \hat{\theta}_1(y))}, \quad \text{where} \quad \hat{\theta}_1(y) := \operatorname*{argmax}_{t > \theta_0} p(y; t). \]

Now if this lower bound is quite large, say above 0.3, then this rules out the possibility of strong evidence against H0, and so, if there is a reasonable preference for H0 over H1 in the prior (say, because H0 is the current theory and H1 is a speculative new theory), then H0 will be chosen. On the other hand, if this lower bound is very small, say below 0.001, then the likelihood ratio could easily be small, say about 1/20, in which case we would want to choose H1. But it could also be large, say 20, in which case we would want to choose H0. It depends on the prior for θ under H1, which we are reluctant to specify. So in this case we are undecided.

This homework is not due until the first Tue of next Term (21 Apr). In the meantime I will circulate another homework on p-values.

Jonathan Rougier
Mar 2015