F79SM STATISTICAL METHODS SUMMARY NOTES

9 Hypothesis testing

9.1 Introduction

As before we have a random sample x of size n of a population r.v. X with pdf/pf f(x; θ). The distribution we assign to X is our model for the process which has generated our data, e.g. X ~ N(µ,1), X ~ Poisson(λ).

A hypothesis H is a statement about the distribution of X; in particular, in this chapter, it is a statement about the unknown value of a parameter θ (or a vector of parameters θ).

A simple hypothesis is a statement which completely specifies the distribution: e.g. if X ~ N(µ,1) then H: µ = 5 is a simple hypothesis. If H is not simple, it is composite, e.g. H: µ > 5.

A test of H is a rule which partitions the sample space into two subsets:
critical region: data in this subset are not consistent with H and we reject H
acceptance region: data in this subset are consistent with H and we accept H.

The null hypothesis H0 represents the current theory (the "status quo"), e.g.
H0: θ = 0, H0: θ = θ0, H0: P < 0.4, H0: P > P0, H0: µ = 5, H0: µ > 5, H0: µ < µ0
H0: µ1 − µ2 = 0 (this is the "no difference" or "no treatment effect" hypothesis)
H0: σ1² = σ2² (this is the "equal variances" or homoscedasticity hypothesis)

The null hypothesis H0 is contrasted with an alternative hypothesis H1 and our test is written, for example, as follows:
H0: θ = θ0 v H1: θ = θ1, a test with simple null and alternative hypotheses
H0: θ = θ0 v H1: θ > θ0, a one-sided test with simple null and composite alternative hypotheses
H0: θ ≤ θ0 v H1: θ > θ0, a one-sided test with composite null and alternative hypotheses
H0: θ = θ0 v H1: θ ≠ θ0, a two-sided test with simple null and composite alternative hypotheses

The fundamental questions we are asking are:
do our data provide strong enough evidence to justify our rejecting the null hypothesis?
how strong is our evidence against the null hypothesis?
and, on a more general plane of enquiry,
how good is our procedure: given the data we have, are we using the best available test?

RJG
is there perhaps a better experimental procedure we could have used, to make a better testing approach available to us?

The decision is based on the value of an appropriate function of the data called the test statistic (e.g. the sample mean x̄, sample proportion P, sample variance s², maximum value in the sample), whose distribution is completely known under H0, that is, when H0 is true.

9.2 Classical (Neyman-Pearson) methodology

(a) Simple H0 v simple H1

There are two types of testing errors we are exposed to when making our decision:
type I error: reject H0 when it is true
type II error: accept H0 when it is false

The probabilities of making these errors are conventionally denoted α and β:
α = P(commit a type I error) = P(reject H0 | H0 true)
β = P(commit a type II error) = P(accept H0 | H0 false)
1 − β = P(reject H0 | H0 false) is called the power of the test: it is the probability of making a correct decision to reject the null hypothesis. It measures the effectiveness of the test at detecting departures from the null hypothesis.

We want both α and β to be small but, for a fixed sample size, it is not possible to lower both probabilities of error simultaneously; we can of course lower the probabilities by increasing the sample size.

The classical, Neyman-Pearson, approach to testing is as follows: when testing H0: θ = θ0 v H1: θ = θ1,
(i) fix/choose the value of α; once chosen, α is called the level of significance of the test (popular choices are α = 0.05, giving a "5% test", and α = 0.01, giving a "1% test"), and then
(ii) use the test for which β is smallest, that is to say the test with the highest power, i.e. choose the most powerful available test of level α.

The method for finding this best test is based on the likelihood function; the result is the Neyman-Pearson Lemma, which is expressed in terms of the likelihood ratio L(θ0)/L(θ1) = L0/L1 for short. The lemma states that the form of the best test is given by finding the form of the critical region C such that C = {x; L0 ≤ kL1} for some constant k. The exact specification of C comes from the chosen level of the test α, and depends on θ0. The power of the resulting test depends on θ1. The criterion comes down in practice to defining C in terms of a range of values of the test statistic; for a formal statement and proof of the Lemma see Miller & Miller 1.4.

A test with pre-assigned level is often called a significance test. If the level is α and our decision is "reject H0" we say that our result is statistically significant at 100α%.

(b) Composite hypotheses

In some cases we can use the N-P Lemma when we have a composite alternative hypothesis. We may be able to find a test which is best for every value of the parameter specified by H1. Such a test, if it exists, is said to be uniformly most powerful (UMP).
Ex9.1 Random sample, size n, of X ~ N(µ,1). Test of H0: µ = µ0 v H1: µ > µ0.

Consider first a test of H0: µ = µ0 v H1: µ = µ1 (where µ1 > µ0).

f(x; µ) = (2π)^(−1/2) exp{−(x − µ)²/2}, so

L0/L1 = exp{−(1/2)Σ(xi − µ0)²} / exp{−(1/2)Σ(xi − µ1)²} = exp{−n x̄(µ1 − µ0) + (n/2)(µ1² − µ0²)}

The best test has critical region defined by those data values x such that L0/L1 ≤ k, which is true for −n x̄(µ1 − µ0) + (n/2)(µ1² − µ0²) ≤ log k, that is for x̄(µ1 − µ0) ≥ (1/2)(µ1² − µ0²) − (log k)/n, that is, since µ1 − µ0 > 0, for x̄ ≥ K.

So, the best test is such that we reject H0 if x̄ exceeds some value, i.e. we reject H0 for large values of X̄.
Suppose we want to perform a test at the 100α% level. Under H0, X̄ ~ N(µ0, 1/n).

α = P(type I error) = P(X̄ ≥ K | µ = µ0) = 1 − Φ(√n(K − µ0)), so √n(K − µ0) = z_α, i.e. K = µ0 + z_α/√n.

[For a 5% test, z_0.05 = 1.645 and we reject H0 for X̄ > µ0 + 1.645/√n.
For the case µ0 = 1, µ1 = 1.5, and n = 25, we reject H0 for X̄ > 1.329. The power of the test is then given by P(X̄ > 1.329 | µ = µ1), which is P(Z > (1.329 − 1.5)/(1/√25)) = P(Z > −0.855) = 0.804.]

The test is best whatever the particular value of µ1 specified in H1 and so is UMP for testing H0: µ = µ0 v H1: µ > µ0. No UMP test exists for testing H0: µ = µ0 v H1: µ ≠ µ0.

The power function of a test (which generalises the concept of power we met earlier) is a function of the parameter given by π(θ) = P(reject H0 | θ).

Ex9.1 continued
Consider testing H0: µ ≤ µ0 v H1: µ > µ0 at the 5% level of significance. The best 5% test is as above and is UMP; it rejects H0 for X̄ > µ0 + 1.645/√n. For general µ, X̄ ~ N(µ, 1/n).

π(µ) = P(reject H0 | µ) = P(X̄ > µ0 + 1.645/√n | µ) = P(Z > 1.645 − √n(µ − µ0)) = 1 − Φ(1.645 − √n(µ − µ0))

[Figure: graph of the power function π(µ) against µ, in the case n = 9, µ0 = 1. At µ = µ0 = 1, power = 0.05 = level of the test of H0: µ = 1 v H1: µ = µ1 (> 1).]

When working with composite hypotheses, the largest value of the power function π(θ) under H0 is called the size of the test (this generalises the concept of the level of the test).
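The numbers in Ex9.1 can be checked with a short script (a sketch using Python's statistics.NormalDist in place of the normal tables; the values µ0 = 1, µ1 = 1.5, n = 25 are those of the example):

```python
from statistics import NormalDist
from math import sqrt

Z = NormalDist()              # standard normal, in place of the N(0,1) tables

mu0, mu1, n = 1.0, 1.5, 25
se = 1 / sqrt(n)              # sigma = 1, so X-bar ~ N(mu, 1/n)

# critical value of the 5% one-sided test: K = mu0 + z_0.05/sqrt(n)
K = mu0 + Z.inv_cdf(0.95) * se
print(round(K, 3))            # 1.329

# power at mu = mu1: P(X-bar > K | mu = mu1)
power = 1 - Z.cdf((K - mu1) / se)
print(round(power, 3))        # 0.804
```

At µ = µ0 the same formula returns the level 0.05, which is the left-hand anchor of the power function graphed above.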
9.3 Some standard cases

9.3.1 Testing a population mean

Suppose X ~ N(µ, σ²), random sample, size n, testing H0: µ = µ0.

(a) σ² known
Test statistic is (X̄ − µ0)/(σ/√n), which is ~ N(0,1) under H0.

(b) σ² unknown
Test statistic is (X̄ − µ0)/(S/√n), which is ~ t with n − 1 df under H0; this gives the famous "t test".

Large samples from any distribution:
Test statistic is (X̄ − µ0)/(S/√n), which is ~ N(0,1) (approximately) under H0.

9.3.2 Testing a population variance

Suppose X ~ N(µ, σ²), random sample, size n, testing H0: σ² = σ0².
Test statistic is (n − 1)S²/σ0², which is ~ χ² with n − 1 df under H0.

9.3.3 Testing a population proportion

Let X be the number of successes in n Bernoulli trials with P(success) = θ, testing H0: θ = θ0.
Test statistic is X, which is ~ b(n, θ0) under H0, and, for large n, (X − nθ0)/√(nθ0(1 − θ0)) is ~ N(0,1) (approximately) under H0.

9.3.4 Testing a Poisson mean

Suppose X ~ Poisson(λ), random sample, size n, testing H0: λ = λ0.
Test statistic is ΣXi, which is ~ Poisson(nλ0) under H0, and, for large n, ΣXi ~ N(nλ0, nλ0), or X̄ ~ N(λ0, λ0/n), (approximately) under H0.

Ex9.2 Random sample of X ~ N(µ, σ²). We want to test H0: µ = 10.5 v H1: µ < 10.5 at the 5% level. We have data from a random sample of size 10: Σx = 92.1, Σx² = 877.47.

We reject H0 for small values of X̄. The test statistic is (X̄ − µ0)/(S/√n), which is ~ t with 9 df under H0.

The data give x̄ = 9.21, s² = (1/9)(877.47 − 92.1²/10) = 3.2477.

Lower 5% point for t with 9 df is −1.833, so we reject H0 for (X̄ − 10.5)/(S/√10) < −1.833. This defines the critical region.

For our sample x̄ = 9.21, s² = 3.2477; the test statistic has value −2.26 so we do reject H0 and accept H1.
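A quick numerical check of Ex9.2 (a sketch: it computes the summary statistics from Σx and Σx², then compares the observed t with the tabulated 5% point −1.833, since the t distribution itself is not in Python's standard library):

```python
from math import sqrt

n, sum_x, sum_x2 = 10, 92.1, 877.47
mu0 = 10.5

xbar = sum_x / n                          # sample mean, 9.21
s2 = (sum_x2 - sum_x**2 / n) / (n - 1)    # sample variance, 3.2477
t = (xbar - mu0) / sqrt(s2 / n)           # observed t statistic

print(round(xbar, 2), round(s2, 4), round(t, 2))   # 9.21 3.2477 -2.26

# lower 5% point of t with 9 df (from tables) is -1.833
print(t < -1.833)   # True: reject H0
```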
An alternative, and simpler, approach is to calculate the observed value of the test statistic for the sample in hand and compare it with the tabulated percentage point (or go further: see P-values later). Here, our observed t = (9.21 − 10.5)/(3.2477/10)^(1/2) = −2.26, which is lower than the relevant percentage point (−1.833): our observed value is low enough to be in the tail of the reference distribution and we reject H0.

Ex9.3 A coin is tossed 20 times and lands heads 5 times and tails 15 times. Investigate whether the coin is fair or biased in favour of tails (i.e. do we have strong enough evidence to conclude that the coin is biased in favour of tails?)

Let X be the number of heads. Then X ~ b(20, θ) where P(head) = θ. We will test H0: θ = 0.5 v H1: θ < 0.5 at 5%. We reject H0 for small values of X. From NCST, P(X ≤ 5 | θ = 0.5) = 0.0207, which is less than 0.05. Our observation x = 5 is in the lower tail of the reference binomial distribution so we reject H0. We conclude that the coin is biased in favour of tails.

Suppose the coin was tossed 20 times and landed heads 8 times. P(X ≤ 8 | θ = 0.5) = 0.2517. This is far too high to provide evidence against H0, which can stand.

But suppose now that the coin was tossed 100 times and landed heads 40 times (same proportion of heads, but on many more tosses). Now X ~ b(100, θ) and we can use the test statistic (X − nθ0)/√(nθ0(1 − θ0)), which is ~ N(0,1) (approximately) under H0.

Our observed statistic is (40 − 50)/5 = −2, which is less than the lower 5% point of the N(0,1) distribution (−1.645): our observed value is in the tail of the reference distribution, and this time we have sufficiently strong evidence against H0 to justify our rejecting it. We reject H0 and conclude that the coin is biased in favour of tails (but see next section for improved methodology which allows naturally for the use of a continuity correction).

9.4 Significance and P-values

A typical conclusion of a significance test is simply "reject H0 at the 5% level of significance", or just "reject H0 at 5%".
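The tail probabilities quoted in Ex9.3 can be reproduced exactly (a sketch; math.comb gives the binomial probabilities directly, in place of the NCST tables):

```python
from math import comb, sqrt

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ b(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

print(round(binom_cdf(5, 20, 0.5), 4))   # 0.0207 -> reject H0 at 5%
print(round(binom_cdf(8, 20, 0.5), 4))   # 0.2517 -> H0 can stand

# large-sample version: 100 tosses, 40 heads
n, x, theta0 = 100, 40, 0.5
z = (x - n * theta0) / sqrt(n * theta0 * (1 - theta0))
print(z)   # -2.0, below the lower 5% point -1.645
```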
This is not as informative as we can be. It is more informative to quantify the strength of the evidence the data provide against H0. We do this by calculating the probability value (P-value) of our observed test statistic.

The P-value is the observed significance level of the test statistic: it is the probability, assuming H0 is true, of observing a value of the test statistic as extreme (that is, as inconsistent with H0) as the value we have actually observed.

The P-value is the probability of the smallest critical region which includes the observed test statistic. Given the data we have, the P-value is the lowest level at which we can reject H0. The smaller the P-value, the stronger our evidence against H0.

The use of P-values is very widespread in published statistical work and is strongly recommended.
In Ex9.1, consider again the case µ0 = 1, µ1 = 1.5, and n = 25. Suppose we observe x̄ = 1.41. This value is in the critical region (which is x̄ > 1.329) and has P-value given by P(X̄ ≥ 1.41 | µ0) = P(Z > 2.05) = 0.02 (or 2%). So we have strong enough evidence to justify rejecting H0, at levels of testing down to 2%.

Suppose however we observe x̄ = 1.27. This value is not in the critical region and has P-value given by P(X̄ ≥ 1.27 | µ0) = P(Z > 1.35) = 0.089 (or 8.9%). The P-value is higher and the evidence is not strong enough to justify rejecting H0.

In Ex9.2, the observed test statistic is −2.26 and the P-value of this statistic is P(t9 < −2.26) = 0.025 (from NCST). So we have strong enough evidence to justify rejecting H0, at levels of testing down to 2.5%.

In Ex9.3 with 100 tosses, under H0, X ~ N(50, 25) approximately, and the P-value of our observation 40 heads is calculated as
P(X ≤ 40 | H0) = P(Z < (40.5 − 50)/5) = P(Z < −1.9) = 0.029.
We have strong enough evidence to justify rejecting H0, at levels of testing down to about 3%.

[Note the use of the continuity correction when using the normal distribution (which is continuous) to calculate an approximation to a probability for the binomial distribution (which is discrete).]

P-value     Suitable language for your conclusions (in most applications)
> 0.05      insufficient evidence against H0 to justify rejecting it / evidence not strong enough to justify rejecting H0 / H0 can stand
< 0.05      we have some evidence against H0 / we can reject H0 at the 5% level of testing
< 0.01      we have strong evidence against H0 / we can reject H0 at the 1% level of testing / we can reject H0 at levels of testing down to 1%
< 0.001     we have overwhelming evidence against H0 / we can reject H0 at the 0.1% level of testing / we can reject H0 at levels of testing down to 0.1%

9.5 Two sample situations: see over
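The P-values above follow directly from the normal cdf (a sketch using statistics.NormalDist; the continuity-corrected binomial case is the 100-toss version of Ex9.3):

```python
from statistics import NormalDist

Z = NormalDist()
se = 0.2                     # Ex9.1: sigma = 1, n = 25, so se = 1/sqrt(25)

# one-sided P-values for observed means 1.41 and 1.27 under mu0 = 1
pv_141 = 1 - Z.cdf((1.41 - 1.0) / se)
pv_127 = 1 - Z.cdf((1.27 - 1.0) / se)
print(round(pv_141, 3), round(pv_127, 3))   # 0.02 0.089

# Ex9.3: X ~ b(100, 0.5) ~ N(50, 25) under H0; observed 40 heads,
# continuity correction replaces P(X <= 40) by P(Z < (40.5 - 50)/5)
pv_coin = Z.cdf((40.5 - 50) / 5)
print(round(pv_coin, 3))                    # 0.029
```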
9.5 Two sample situations

9.5.1 Difference between two population means

Random sample size n1 from X1 ~ N(µ1, σ1²); random sample size n2 from X2 ~ N(µ2, σ2²). All variables are independent. Sample means X̄1 and X̄2 and variances S1² and S2². We want to test hypotheses about µ1 − µ2.

H0: µ1 − µ2 = δ (δ = 0 is the "no difference" or "no treatment effect" hypothesis).

(a) Population variances known
Test statistic is (X̄1 − X̄2 − δ)/√(σ1²/n1 + σ2²/n2), which is ~ N(0,1) under H0.

(b) Common population variance σ1² = σ2² = σ²
Test statistic is (X̄1 − X̄2 − δ)/(Sp √(1/n1 + 1/n2)), which is ~ t with n1 + n2 − 2 df under H0
(recall the pooled estimator of σ² is Sp² = [(n1 − 1)S1² + (n2 − 1)S2²]/(n1 + n2 − 2)).
This gives the famous "two sample t test".

Large samples from any distribution:
Test statistic is (X̄1 − X̄2 − δ)/(Sp √(1/n1 + 1/n2)), or (X̄1 − X̄2 − δ)/√(S1²/n1 + S2²/n2), both of which are ~ N(0,1) (approximately) under H0.

9.5.2 Ratio of two population variances

Random sample size n1 from X1 ~ N(µ1, σ1²); random sample size n2 from X2 ~ N(µ2, σ2²). All variables are independent. Sample variances S1² and S2². We want to test hypotheses about σ1²/σ2².

H0: σ1²/σ2² = 1 (i.e. σ1² = σ2²: this is the homoscedasticity hypothesis)
Test statistic is S1²/S2², which is ~ F with (n1 − 1, n2 − 1) df under H0.

9.5.3 Difference between two population proportions

X1 ~ b(n1, θ1), X2 ~ b(n2, θ2); large samples; sample proportions P1 and P2 respectively.
H0: θ1 − θ2 = δ (δ = 0 is the "no difference" hypothesis in regard to the population proportions)
Test statistic is (P1 − P2 − δ)/√(P1(1 − P1)/n1 + P2(1 − P2)/n2), which is ~ N(0,1) under H0.

In the case δ = 0, H0 specifies a common population proportion θ = θ1 = θ2, and, under H0, X1 + X2 ~ b(n1 + n2, θ). The MLE of the common proportion is then θ̂ = (X1 + X2)/(n1 + n2).
In this case the estimated standard error of P1 − P2 under H0 is ese(P1 − P2) = √(θ̂(1 − θ̂)(1/n1 + 1/n2)), and the test statistic is (P1 − P2)/ese(P1 − P2), which is ~ N(0,1) (approximately) under H0.

9.5.4 Difference between two Poisson means

Random sample size n1 from X1 ~ Poisson(λ1), random sample size n2 from X2 ~ Poisson(λ2); large samples; all variables independent. Sample means X̄1 and X̄2. H0: λ1 = λ2.

The test statistic normally used is (X̄1 − X̄2)/√(X̄1/n1 + X̄2/n2), which is ~ N(0,1) (approximately) under H0.

Noting that under H0 the MLE of λ = λ1 = λ2 is λ̂ = (ΣX1i + ΣX2i)/(n1 + n2), one can also argue for the test statistic (X̄1 − X̄2)/ese(X̄1 − X̄2), where ese(X̄1 − X̄2) = √(λ̂(1/n1 + 1/n2)).

9.5.5 Paired data (non-independent samples)

Data arise as physical pairs (xi, yi), i = 1, 2, …, n, with differences di = xi − yi. H0: µD = µX − µY = 0. The problem reverts to the one-sample problem of 9.3.1.

Ex9.4 See Ex8.1
Test H0: µ1 = µ2 v H1: µ1 ≠ µ2.
Test statistic is t = (X̄1 − X̄2)/(Sp √(1/n1 + 1/n2)) and, for a 5% test, we reject H0 for |t| > 2.08 (t has 21 df).
For our data, t = 1.87/0.7317 = 2.56, and we reject H0. The P-value of our statistic is P(|t| > 2.56) = 2 × 0.009 = 0.018 (approx., from NCST).

Ex9.5 See Ex8.11
Test H0: θ1 = θ2 v H1: θ1 ≠ θ2.
P1 = 0.18, P2 = 0.115, P1 − P2 = 0.065.
Under H0, θ̂ = 77/500 = 0.154 and ese(P1 − P2) = √(0.154 × 0.846 × (1/300 + 1/200)) = 0.03295.
Test statistic = 0.065/0.03295 = 1.973.
P-value of result = 2 × P(Z > 1.973) = 2 × 0.024 = 0.048.
We reject H0 at levels of testing down to 4.8%.
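Ex9.5's arithmetic, reproduced step by step (a sketch; the sample sizes n1 = 300 and n2 = 200 are inferred from the quoted proportions and the pooled count 77/500, so treat them as an assumption):

```python
from math import sqrt
from statistics import NormalDist

x1, n1 = 54, 300      # gives P1 = 0.18   (inferred counts)
x2, n2 = 23, 200      # gives P2 = 0.115
p1, p2 = x1 / n1, x2 / n2

# pooled MLE of the common proportion under H0
theta_hat = (x1 + x2) / (n1 + n2)                      # 77/500 = 0.154
ese = sqrt(theta_hat * (1 - theta_hat) * (1 / n1 + 1 / n2))

z = (p1 - p2) / ese
p_value = 2 * (1 - NormalDist().cdf(z))                # two-sided

print(round(z, 3))        # 1.973
print(round(p_value, 3))  # 0.049 (the notes' 0.048 comes from two-decimal table values)
```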
9.6 Tests and confidence intervals

A CI for a parameter θ is a set of values which, given the data we have, are plausible for the parameter. So any value θ0 contained in the CI should be such that the hypothesis H0: θ = θ0 will be accepted in a corresponding hypothesis test. This is in fact generally the case.

For example, sampling from N(µ,1). A 95% two-sided CI for µ is given by (X̄ − 1.96/√n, X̄ + 1.96/√n), and this interval contains µ0 if and only if −1.96 < (X̄ − µ0)/(1/√n) < 1.96, which is the condition under which H0: µ = µ0 is accepted in a 5% significance test when testing H0: µ = µ0 v H1: µ ≠ µ0.

In general there is this direct link between the two-sided 100(1 − α)% CI and the 100α% two-sided test.

Similarly, one-sided CIs correspond to one-sided tests. For example, consider again sampling from N(µ,1). A 95% lower CI for µ is given by (X̄ − 1.645/√n, ∞), and this interval contains precisely those values of µ0 which, when specified under H0 in the 5% test of H0: µ = µ0 v H1: µ > µ0, result in H0 being accepted.

If a CI has already been calculated for a parameter, then many questions which arise in a hypothesis-testing framework are answerable immediately, at least in so far as giving us a basic "reject" or "accept" decision.

Ex9.6 Ex9.1 revisited: N(µ,1), n = 25, H0: µ = 1 v H1: µ > 1.

Suppose we observe x̄ = 1.4. Then a lower 95% CI for µ is given by (1.4 − 1.645/√25, ∞), i.e. (1.071, ∞). This interval does not contain the value µ0 = 1, which we therefore reject as being implausible (inconsistent with the value of the sample mean): it is too low. This is the same conclusion we come to in the test, for which the critical region is X̄ > 1.329.

Suppose we observe x̄ = 1.3. Then a lower 95% CI for µ is given by (1.3 − 1.645/√25, ∞), i.e. (0.971, ∞). This interval does contain the value µ0 = 1, which we therefore accept as being plausible (consistent with the value of the sample mean). This again is the same conclusion we come to in the test.
For a general x̄, the lower limit of the CI is x̄ − 0.329, and so any hypothesised value for µ such that µ0 > x̄ − 0.329 is contained in the CI, i.e. we accept a hypothesised µ0 provided x̄ < µ0 + 0.329, and hence reject it for x̄ > µ0 + 0.329, as in Ex9.1.

Ex9.7 Ex9.2 revisited: an upper 95% CI for µ is given by (−∞, X̄ + 1.833 S/√10), which with x̄ = 9.21 and s² = 3.2477 gives (−∞, 10.25). The interval does not contain the value µ0 = 10.5, which we therefore reject as being inconsistent with the value of the sample mean: it is too high. This is the same conclusion we come to in the test.
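The duality in Ex9.6 is easy to verify numerically (a sketch: for each observed mean, the one-sided test decision and the question "is µ0 inside the lower CI?" give the same answer):

```python
from math import sqrt

n, mu0, z05 = 25, 1.0, 1.645
results = []
for xbar in (1.4, 1.3):
    ci_lower = xbar - z05 / sqrt(n)               # lower 95% CI is (ci_lower, infinity)
    reject_by_ci = mu0 <= ci_lower                # mu0 falls outside the CI
    reject_by_test = xbar > mu0 + z05 / sqrt(n)   # critical region: x-bar > 1.329
    results.append((round(ci_lower, 3), reject_by_ci, reject_by_test))

print(results)   # [(1.071, True, True), (0.971, False, False)]
```

In both cases the two decisions agree, as the general argument above says they must.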
Ex9.8 See Ex8. Testing a population proportion: H0: θ = 0.38 v H1: θ ≠ 0.38, based on the result that a random sample of 1200 includes 420 with the property.

We reject H0 for extreme values of X, where X ~ b(1200, 0.38) ≈ N(456, 282.7).
P(X ≤ 420) = P[Z < (420.5 − 456)/√282.7] = P(Z < −2.11) = 0.0174, so the P-value of this (two-sided) test is 0.035. So at 5% we reject H0.

The 95% CI for θ is 0.35 ± 0.027, i.e. (0.323, 0.377): the value 0.38 is not contained in this interval.

9.7 Other matters

(a) When a single best test (in the Neyman-Pearson sense) is not available, another, more general approach is used. The test statistic and critical region are found by setting an upper bound on the ratio max L0 / max L, where max L0 is the maximum value of the likelihood L under the restrictions imposed by H0, and max L is the unrestricted maximum value of L. This method produces tests called likelihood ratio tests. For example, in sampling from N(µ, σ²) and testing H0: µ = µ0, the method leads to the t test of 9.3.1.

(b) We may be able to reject H0 at a specified level simply by using so much data that our test statistic has a small enough standard error to enable us to detect a departure from H0. This departure may, however, be of little or no physical significance.

(c) A failure to reject H0 does not imply that H0 is true. It indicates that we have failed to reject it: our data do not provide sufficiently strong evidence against it. H0 represents a theory which lives on to fight another day.

(d) Good practice in testing. State:
your hypotheses
the test statistic
the distribution of the test statistic under H0
the observed value of the test statistic
the P-value (at least approximately) of the test statistic
your conclusion as regards the hypotheses
your conclusion in words which relate to the physical situation concerned
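Ex9.8's normal approximation and confidence interval, checked numerically (a sketch; n = 1200 and x = 420 are recovered from the quoted mean nθ0 = 456 and sample proportion 0.35, so treat them as an assumption):

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()
n, x, theta0 = 1200, 420, 0.38

mean = n * theta0                      # 456
sd = sqrt(n * theta0 * (1 - theta0))   # sqrt(282.7)

# continuity-corrected lower tail, doubled for the two-sided P-value
p_two = 2 * Z.cdf((x + 0.5 - mean) / sd)
print(round(p_two, 3))                 # 0.035

# 95% CI for theta from the sample proportion
p_hat = x / n                          # 0.35
half = 1.96 * sqrt(p_hat * (1 - p_hat) / n)
print(round(p_hat - half, 3), round(p_hat + half, 3))   # 0.323 0.377
```

As the notes observe, θ0 = 0.38 lies outside the CI, in agreement with the 5% rejection.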
Appendix

R code to produce the display in Ex9.1 continued (power function for n = 9, µ0 = 1; the dashed lines mark the level 0.05 at µ = µ0):

x = c(-2:6)*0.5
y = 1 - pnorm(1.6449 - 3*(x - 1))   # power = 1 - Phi(z_0.05 - sqrt(n)*(mu - mu0)), n = 9
c = c(-1, 1)                        # horizontal dashed line at power = 0.05
d = c(0.05, 0.05)
e = c(1, 1)                         # vertical dashed line at mu = mu0 = 1
f = c(0, 0.05)
plot(x, y, type="l", xlab="mu", ylab="power", main="Power function")
lines(c, d, lty=2)
lines(e, f, lty=2)