HYPOTHESIS TESTING: FREQUENTIST APPROACH.

These notes summarize the lectures on (the frequentist approach to) hypothesis testing. You should be familiar with standard hypothesis testing from previous statistics classes. Here, we will explain where this approach comes from and develop new ideas (all within the context of the parametric set-up).

1. Set-Up

The basic set-up of (the Neyman-Pearson approach to) hypothesis testing is as follows. There are two hypotheses you are trying to decide between: the null (H_0) and the alternative (H_A). If a hypothesis fully determines the behaviour (pdf/pmf or other) of the random variables, then it is called simple; otherwise it is known as composite. The hypothesis test will reject H_0 in favour of H_A if a test statistic T = T(X) falls into a rejection region (RR). We therefore have the following possibilities:

                  accept H_0       reject H_0
  H_0 true        correct          Type I error
  H_A true        Type II error    correct

The probability of a Type I error is denoted by α and is also known as the significance level of the test. The probability of a Type II error is denoted by β. The power of a test is 1 − β: the probability of doing the correct thing under H_A. Note that if H_A is composite, then both β and the power depend on the particular member of H_A which holds. In this case we will often plot the power function. Ideally we would have α = β = 0. However, in practice we most often find that decreasing α drives up β, and vice versa.

2. Neyman-Pearson

The main idea behind Neyman-Pearson is to fix α in advance (choosing α to be small) and then to find a test which yields a small value of β. The Neyman-Pearson lemma tells us that in such a set-up, the likelihood ratio test (LRT) is the most powerful of all possible tests. This only works for two simple hypotheses.

Date: November 25, 2007.
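To make α, β, and power concrete before moving on, here is a small numerical sketch of a simple-vs-simple binomial test. The hypotheses, sample size, and cutoff below are illustrative choices, not taken from the notes:

```r
# Illustrative simple-vs-simple test: H_0: p = 0.5 vs H_A: p = 0.7,
# n = 20 Bernoulli trials, reject H_0 when the total T >= 14.
alpha <- 1 - pbinom(13, 20, 0.5)   # Type I error: P(T >= 14 | H_0)
beta  <- pbinom(13, 20, 0.7)       # Type II error: P(T <= 13 | H_A)
power <- 1 - beta                  # P(reject | H_A)
round(c(alpha = alpha, beta = beta, power = power), 3)
# alpha ~ 0.058, beta ~ 0.392, power ~ 0.608
```

Note that with a discrete test statistic we cannot hit a target level such as α = 0.05 exactly; moving the cutoff from 14 to 13 would lower β but raise α, illustrating the trade-off described above.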
Thus, assume that H_0 and H_A are both simple, and let f_0(x) denote the pdf/pmf (likelihood) of the data under H_0 (and f_A(x) under H_A). The LRT is the test which rejects if

    f_0(x) / f_A(x) < c,

where c is chosen in such a way that P(reject | H_0) = α.

Lemma 2.1 (Neyman-Pearson Lemma). Any other test with significance level α* ≤ α has power less than or equal to that of the likelihood ratio test.

First of all, note that this is a very sensible thing to do (we reject H_0 if the data has a bigger likelihood under H_A). Thus, the basic idea is similar to that of maximum likelihood estimation. We next need to take the LRT and translate it into something easier to handle.

Example. ESP example (Bernoulli, sample size 10, H_0: p = 0.25). We have that P(Total ≥ 6 | H_0) ≈ 0.02 and P(Total ≥ 5 | H_0) ≈ 0.078 (for sample size 10); we therefore cannot choose α = 0.05 exactly. We will choose the rejection region to be {6, 7, 8, 9, 10}. In this case, the power function is given in Figure 1. The R code used to generate Figure 1 was:

x <- rep(0, 250)
for (i in 1:250) {
  # power at p = 0.25 + i/250 * 0.75, i.e. P(Total >= 6 | p)
  x[i] <- 1 - pbinom(5, 10, 0.25 + i/250 * 0.75)
}
plot(x)

What happens if we do n independent tests at the same time?

Example. Population = exponential.

Example. Population = normal, variance known.

3. P-values

Performing an α-level test is not very informative as to the amount of evidence for or against the alternative hypothesis. The quantity that does allow us to measure this is the p-value. The p-value is defined as the smallest value of α for which the null hypothesis would be rejected. Typically it is calculated as the probability of obtaining a test statistic as extreme as, or more extreme than, what was actually observed; "extreme" is dictated by the form of the rejection region. For a specific example, in the ESP Bernoulli case, if
we observed 5 total successes, then since we reject for T = Total large, the p-value is calculated as P(T ≥ 5 | H_0).

[Figure 1. Power function in the ESP example.]

In my opinion, you should always report the p-value for a hypothesis test in your research.

Example. Show that, under the null hypothesis, the distribution of the p-value is Uniform[0, 1] (for a continuous test statistic).
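The uniformity of the p-value under H_0 can be illustrated by simulation. Here is a sketch using a two-sided z-test for a normal mean with known variance; the sample size, number of replications, and seed are arbitrary choices:

```r
set.seed(1)
n <- 20
pvals <- replicate(10000, {
  x <- rnorm(n)             # data generated under H_0: mu = 0, sigma = 1
  z <- sqrt(n) * mean(x)    # z-statistic
  2 * (1 - pnorm(abs(z)))   # two-sided p-value
})
hist(pvals)                 # approximately flat on [0, 1]
mean(pvals < 0.05)          # approximately 0.05
```

The last line is the key consequence: because the p-value is Uniform[0, 1] under H_0, rejecting whenever the p-value falls below α gives a test of level α for any α.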
4. Generalized Likelihood Ratio Test (GLRT)

The LRT is optimal for testing a simple hypothesis against a simple hypothesis. However, often we wish to compare simple vs. composite, or two composite hypotheses. As the name implies, the generalized LRT is a generalization of the LRT which allows us to handle composite hypotheses. Although no optimality results exist for the generalized version, we do have some nice asymptotic results, and it is easily motivated as a natural extension of the LRT.

The set-up for the GLRT is as follows. Let f(x | θ) denote the pdf/pmf of the data if the parameter θ (possibly multivariate) is known. Notice that f(x | θ) is actually the likelihood. The null hypothesis specifies that θ ∈ Θ_0 and the alternative says that θ ∈ Θ_A. We let Θ denote Θ_0 ∪ Θ_A. The GLRT rejects the null if

    Λ* = [max over θ ∈ Θ_0 of f(x | θ)] / [max over θ ∈ Θ_A of f(x | θ)]

is small. Indeed, this is very reasonable. In practice, however, it is often easier to work with

    Λ = [max over θ ∈ Θ_0 of f(x | θ)] / [max over θ ∈ Θ of f(x | θ)]

and reject H_0 if this is small. Since Λ = min(Λ*, 1), both versions actually do the same thing. We take the latter as our official definition of the GLRT.

Example. Two-sided normal, unknown variance.

Theorem 4.1. Under smoothness assumptions on the underlying pdf/pmf, the null distribution of −2 log Λ converges to a χ² distribution with degrees of freedom equal to dim Θ − dim Θ_0 as the sample size tends to infinity.

Since we reject for small values of Λ, we would reject for large values of −2 log Λ.

Example. Compare this to the two-sided normal, unknown variance.

5. Power

In the previous sections we have really avoided the issue of power. The LRT chooses the test with the highest power for a fixed significance level, but what if this isn't good enough? In practice it is often the case that increasing the sample size increases power. The following examples are designed to illustrate this.

Example. Normal, variance known.

Example. Normal, variance unknown.
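The χ² limit in Theorem 4.1 can be checked by simulation. Here is a sketch for testing H_0: λ = 1 in an Exponential(λ) model (not one of the worked examples above; the sample size, replication count, and seed are arbitrary choices). A short calculation with the exponential likelihood gives −2 log Λ = 2n(x̄ − 1 − log x̄) under this null, and dim Θ − dim Θ_0 = 1:

```r
set.seed(2)
n <- 50
w <- replicate(10000, {
  x <- rexp(n, rate = 1)      # data generated under H_0: lambda = 1
  m <- mean(x)
  2 * n * (m - 1 - log(m))    # -2 log Lambda for H_0: lambda = 1
})
mean(w)                       # close to 1, the mean of a chi-squared_1
mean(w > qchisq(0.95, 1))     # rejection rate close to the nominal 0.05
```

Comparing a histogram of w to the χ²₁ density gives the same picture: already at n = 50 the asymptotic null distribution is a good approximation.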
(To calculate the power in the unknown-variance case, approximate using the normal distribution!)
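For the known-variance case, the power has a closed form: for a one-sided z-test of H_0: µ = µ_0 at level α (rejecting for a large sample mean), the power at a true mean µ > µ_0 is Φ(√n (µ − µ_0)/σ − z_{1−α}). A sketch, with illustrative numbers:

```r
# Power of the one-sided z-test of H_0: mu = mu0 (reject for large sample mean)
power_z <- function(n, mu, mu0 = 0, sigma = 1, alpha = 0.05) {
  pnorm(sqrt(n) * (mu - mu0) / sigma - qnorm(1 - alpha))
}
round(power_z(c(10, 30, 100), mu = 0.5), 3)
# power increases with n: ~ 0.475, 0.863, 1.000
```

At a fixed alternative, the power tends to 1 as n grows, which is exactly the phenomenon the examples above are meant to illustrate.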
6. Duality of Confidence Intervals and Hypothesis Tests

A confidence interval (or set, in general) can be obtained by inverting a hypothesis test, and vice versa.

Example. Normal with known variance.

Theorem 6.1. Suppose that for every value θ_0 in Θ there is a test at level α of the hypothesis H_0: θ = θ_0. Denote the acceptance region of the test as A(θ_0). Then the set

    C(X) = {θ : X ∈ A(θ)}

is a 100(1 − α)% confidence region for θ.

In words, a 100(1 − α)% confidence region for θ consists of all those values of θ_0 for which the hypothesis that θ = θ_0 will not be rejected at level α.

Theorem 6.2. Suppose that C(X) is a 100(1 − α)% confidence region for θ; that is, for every θ_0,

    P(θ_0 ∈ C(X) | θ = θ_0) = 1 − α.

Then an acceptance region for a test at level α of the hypothesis H_0: θ = θ_0 is

    A(θ_0) = {X : θ_0 ∈ C(X)}.

In words, this says that the hypothesis that θ = θ_0 is accepted if θ_0 lies in the confidence region.

This duality works exactly for the t-test and z-tests and the associated confidence intervals. For other tests that are typically used (e.g. testing a proportion, or the LRT for the Poisson, say), the typical tests do not invert exactly to the usual confidence interval, and vice versa. This is not because duality fails in these cases, but because the test used is not an exact inversion of the confidence set.

Example. Suppose a t-test rejects the two-sided hypothesis test for µ = 0 at the 5% level. Would the 90% CI contain zero?

Prepared by Hanna Jankowski
Department of Statistics, University of Washington
Box 354322, Seattle, WA 98195-4322, U.S.A.
e-mail: hanna@stat.washington.edu