Optimal Tests of Hypotheses (Hogg Chapter Eight)


STAT 406-01: Mathematical Statistics II, Spring Semester 2016

Contents
1 Most Powerful Tests
  1.1 Review of Hypothesis Testing
  1.2 The Neyman-Pearson Lemma
  1.3 Example: Gamma Distribution
  1.4 Uniformly Most Powerful Tests
    1.4.1 Example: One-Sided Hypothesis
    1.4.2 Example: Two-Sided Hypothesis
    1.4.3 Example: Composite Null Hypothesis
    1.4.4 The Karlin-Rubin Theorem
    1.4.5 Example: Exponential Family
2 Likelihood Ratio Tests and UMP Unbiased Tests
  2.1 Gamma Example Continued
  2.2 Unbiased Tests
    2.2.1 Illustration for Gamma example
3 Composite Hypothesis Testing with Priors
  3.1 Example: Mean of a Normal Distribution
    3.1.1 Case 1: Uniform Prior
    3.1.2 Case 2: Non-Uniform Prior

Copyright 2016, John T. Whelan, and all that

Tuesday 26 April 2016
Read Section 8.1 of Hogg

1 Most Powerful Tests

We turn now to another application of likelihood methods: as tests for hypotheses. First, a reminder of some of the notation and definitions associated with classical frequentist hypothesis testing.

1.1 Review of Hypothesis Testing

In the classical framework, we define a test to reject or not reject a null hypothesis H_0 in the context of an alternative hypothesis H_1, based on a realization x of the n-dimensional random vector X, which is often (but not always) a sample drawn from a univariate distribution. We refer to the support space for the whole sample as S, i.e., X ∈ S. We can define the test as partitioning the sample space into a critical region C and its complement C^c.

If x ∈ C we reject H_0, while if x ∉ C (i.e., x ∈ C^c) we do not reject H_0. (Strictly speaking, we should not refer to the outcome of the test as "accepting" one hypothesis or the other.) We often think of this in terms of a parametrized distribution f_X(x; θ) and define H_0 to specify a value θ = θ_0 (Hogg calls this θ') for the parameter and H_1 to specify some other value θ = θ_1 (which Hogg calls θ''). Since each hypothesis only specifies a single value for θ, we call these point hypotheses, and the parameter formalism is actually unnecessary; f_X(x; θ_0) = f_X(x|H_0) and f_X(x; θ_1) = f_X(x|H_1) can just be thought of as two different probability distributions relevant to H_0 and H_1, respectively.

We define the size α of the critical region, also known as the significance or false-alarm probability of the test, as the probability that we will reject H_0 if it is true:

    α = P(X ∈ C|H_0) = ∫_C f_X(x|H_0) d^n x .    (1.1)

The power of the test is the probability that we'll reject H_0 if H_1 is true:

    γ = P(X ∈ C|H_1) = ∫_C f_X(x|H_1) d^n x .    (1.2)

We can also write this in terms of β = 1 − γ, which is known as the false-dismissal probability. Given different tests with the same false-alarm probability α, we'd naturally rather use the one with the lowest false-dismissal probability β = 1 − γ, i.e., the highest power γ. This is known as the most powerful test.

1.2 The Neyman-Pearson Lemma

There is a theorem, usually known as the Neyman-Pearson lemma, that shows how the most powerful test of one point hypothesis H_0 against another H_1 can be constructed from the likelihood ratio

    Λ(x) = f_X(x|H_0)/f_X(x|H_1) = L(θ_0; x)/L(θ_1; x) .    (1.3)

We define C so that x ∈ C if and only if Λ(x) ≤ k, where k is defined by

    P(Λ(X) ≤ k|H_0) = ∫_C f_X(x|H_0) d^n x = α ,    (1.4)

which ensures that the critical region is of size α. I.e., we reject H_0 if and only if Λ(x) ≤ k. The Neyman-Pearson lemma states that the power of this test,

    γ = P(Λ(X) ≤ k|H_1) = ∫_C f_X(x|H_1) d^n x ,    (1.5)

is greater than or equal to the power of any other test with the same significance.
I.e., if A is some other critical region with size α, so that

    ∫_A f_X(x|H_0) d^n x = α ,    (1.6)

the Neyman-Pearson lemma says that

    γ = ∫_C f_X(x|H_1) d^n x ≥ ∫_A f_X(x|H_1) d^n x .    (1.7)

The demonstration of the Neyman-Pearson lemma involves breaking up the regions C and A in terms of their overlap C ∩ A. Evidently, we can write

    C = (C ∩ A^c) ∪ (C ∩ A)    (1.8a)
    A = (C^c ∩ A) ∪ (C ∩ A)    (1.8b)

The contributions to both α and γ from C ∩ A cancel out of any comparison between C and A. So the Neyman-Pearson lemma is equivalent to the condition that

    ∫_{C∩A^c} f_X(x|H_1) d^n x ≥ ∫_{C^c∩A} f_X(x|H_1) d^n x .    (1.9)

If we can prove that, we've proved the lemma. Now, by definition,

    f_X(x|H_0)/f_X(x|H_1) ≤ k for x ∈ C ,    (1.10)

so

    ∫_{C∩A^c} f_X(x|H_1) d^n x ≥ (1/k) ∫_{C∩A^c} f_X(x|H_0) d^n x ,    (1.11)

while

    f_X(x|H_0)/f_X(x|H_1) ≥ k for x ∈ C^c ,    (1.12)

so

    ∫_{C^c∩A} f_X(x|H_1) d^n x ≤ (1/k) ∫_{C^c∩A} f_X(x|H_0) d^n x .    (1.13)

But since the tests defined by C and A both have the same significance α, the integrals of f_X(x|H_0) over the non-overlapping regions must be the same:

    α − ∫_{C∩A} f_X(x|H_0) d^n x = ∫_{C∩A^c} f_X(x|H_0) d^n x = ∫_{C^c∩A} f_X(x|H_0) d^n x ,    (1.14)

so the right-hand sides of (1.13) and (1.11) must be equal, which means

    ∫_{C∩A^c} f_X(x|H_1) d^n x ≥ ∫_{C^c∩A} f_X(x|H_1) d^n x ,    (1.15)

which, as we've argued above, means that the power of the likelihood-ratio test defined by C is greater than or equal to that defined by A.

1.3 Example: Gamma Distribution

Let X be a sample of size n from the distribution

    f(x; θ) = x^4/(Γ(5) θ^5) e^{−x/θ} ,  0 < x < ∞ .

Let H_0: θ = 1 and H_1: θ = 2. The likelihood is

    L(θ; x) = (∏_{i=1}^n x_i^4) / ([Γ(5)]^n θ^{5n}) exp(−∑_{i=1}^n x_i/θ) ,    (1.16)

so the likelihood ratio is

    Λ(x) = L(1; x)/L(2; x) = 2^{5n} exp(−(1/2) ∑_{i=1}^n x_i) = 2^{5n} e^{−y/2} ,    (1.17)

where y = ∑_{i=1}^n x_i. Note that the likelihood ratio depends only on the sufficient statistic y. A moment's thought will show that this will be true in general: if there are one or more sufficient statistics for the parameters which distinguish H_0 from H_1, the likelihood ratio must be a function of those alone. Note that in this case, Y is Gamma(5n, 1) if H_0 is true and Gamma(5n, 2) if H_1 is true, so we can figure out the significance and power of the test by using the percentiles of the Gamma (or equivalently chi-square) distribution.
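As a short sketch (not part of the notes), the threshold and power of this likelihood-ratio test can be computed with scipy.stats, as used later in these notes; the values of n and alpha here are illustrative choices:

```python
# Sketch: significance and power of the LR test for H0: theta=1 vs H1: theta=2.
# Since Lambda is decreasing in y, Lambda <= k is equivalent to Y >= a.
from scipy.stats import gamma as gammadist

n = 4          # sample size (illustrative)
alpha = 0.10   # desired significance (illustrative)

# Under H0, Y = sum(X_i) ~ Gamma(5n, scale=1); choose a with P(Y >= a | H0) = alpha
a = gammadist(5 * n).isf(alpha)

# Power: P(Y >= a | H1), where Y ~ Gamma(5n, scale=2) under H1
power = gammadist(5 * n, scale=2).sf(a)
print(a, power)
```

The same pattern (isf for the threshold, sf for a tail probability) recurs in the plotting sessions below.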

Thursday 28 April 2016
Read Section 8.2 of Hogg

1.4 Uniformly Most Powerful Tests

The Neyman-Pearson lemma shows that when comparing point hypotheses H_0 and H_1, a test based on the likelihood ratio f_X(x|H_0)/f_X(x|H_1) gives the most powerful test (highest γ = P(X ∈ C|H_1)) at a given significance α = P(X ∈ C|H_0). We often wish to compare composite hypotheses, in which H_1 and/or H_0 can correspond to parametrized families of distributions. In particular, if X is a sample drawn from a distribution f(x; θ), we may have hypotheses H_0: θ ∈ ω and H_1: θ ∈ Ω. If we first consider the case where H_0 is a point hypothesis θ = θ_0 and H_1 is a composite hypothesis parametrized by θ, the power of a given test will depend on θ:

    γ(θ) = P(X ∈ C; θ) = ∫_C f_X(x; θ) d^n x .    (1.18)

If the same test is most powerful at a given α for every choice of θ, we say it is uniformly most powerful (UMP). For example, recall the example from the last class, where X is a sample of size n from a Gamma(5, θ) distribution and H_0 is θ = θ_0 = 1. Now let H_1 be the composite hypothesis θ > 1. We know from the Neyman-Pearson lemma that at each θ the most powerful test will be the one given by the likelihood ratio:

    Λ(x; θ) = L(θ_0; x)/L(θ; x) ≤ k(θ) ,    (1.19)

where the constant k(θ) is defined by

    P(Λ(X; θ) ≤ k(θ)|H_0) = α .    (1.20)

1.4.1 Example: One-Sided Hypothesis

We saw last time that

    L(θ; x) = (∏_{i=1}^n x_i^4)/([Γ(5)]^n θ^{5n}) e^{−y/θ} ,    (1.21)

where y is the sufficient statistic y = ∑_{i=1}^n x_i, and Y is a Gamma(5n, θ) random variable. The likelihood ratio is

    Λ(x; θ) = L(θ_0; x)/L(θ; x) = (θ/θ_0)^{5n} exp[−(1/θ_0 − 1/θ) y] = θ^{5n} exp(−(θ−1)y/θ) .    (1.22)

As long as θ > 1, this is a decreasing function of y, so a test which rejects H_0 if Λ ≤ k(θ) will be equivalent to one which does so if Y ≥ a for some corresponding a. This can be defined so that the critical region is of size α:

    α = P(Y ≥ a|H_0) = γ(θ_0) = ∫_a^∞ y^{5n−1}/(Γ(5n) θ_0^{5n}) e^{−y/θ_0} dy = Γ(5n, a)/Γ(5n) ,    (1.23)

where

    Γ(s, x) = ∫_x^∞ t^{s−1} e^{−t} dt    (1.24)

is the upper incomplete Gamma function. In any event, the value of a doesn't depend on θ, as long as θ > 1, so Y ≥ a is the UMP test of H_0 in light of the composite hypothesis H_1.
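As a sketch (not part of the notes), the identity between the significance and the regularized upper incomplete Gamma function can be checked numerically; scipy exposes Γ(s, x)/Γ(s) as scipy.special.gammaincc, and the n and alpha values are illustrative:

```python
# Sketch: the survival function of Gamma(5n, 1) at the threshold a equals the
# regularized upper incomplete gamma function Gamma(5n, a)/Gamma(5n).
from scipy.stats import gamma as gammadist
from scipy.special import gammaincc

n = 4
alpha = 0.10
a = gammadist(5 * n).isf(alpha)   # threshold with P(Y >= a | H0) = alpha
print(gammaincc(5 * n, a))        # agrees with alpha up to rounding
```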
1.4.2 Example: Two-Sided Hypothesis

Note that there does not always exist a UMP test. Suppose instead the hypothesis H_1 is θ ≠ θ_0 = 1. For θ > 1, the most

powerful test is to reject H_0 if Y ≥ a as defined above. For 0 < θ < 1, the likelihood ratio

    Λ(x; θ) = θ^{5n} exp((1−θ)y/θ)    (1.25)

is a monotonically increasing function of y, so Λ ≤ k(θ) is equivalent to Y ≤ b, where b is defined by

    α = P(Y ≤ b|H_0) = γ(θ_0) = ∫_0^b y^{5n−1}/(Γ(5n) θ_0^{5n}) e^{−y/θ_0} dy = γ(5n, b)/Γ(5n) ,    (1.26)

where

    γ(s, x) = ∫_0^x t^{s−1} e^{−t} dt    (1.27)

is the lower incomplete Gamma function. Thus the most powerful test depends on the value of θ: rejecting H_0 when Y ≥ a is most powerful if θ > θ_0, while rejecting H_0 when Y ≤ b is most powerful if θ < θ_0. To illustrate, we'll plot the power function for the two tests assuming α = 0.10 and n = 4. We'll get the percentiles and tail probabilities for the Gamma distribution using the SciPy stats package. For any distribution, the method sf is the survival function (one minus the cdf), cdf is the cdf, and isf and ppf are the inverses of these functions.

In [1]: from scipy.stats import gamma as gammadist
In [2]: alpha = 0.10
In [3]: n = 4
In [4]: a = gammadist(5*n).isf(alpha)
In [5]: b = gammadist(5*n).ppf(alpha)

We check that the significance of the two tests is indeed 10%:

In [6]: print gammadist(5*n).sf(a)
0.1
In [7]: print gammadist(5*n).cdf(b)
0.1
In [8]: theta = linspace(0,3,1000)[1:]

The [1:] is to omit the first element in the array, so that we're not actually trying to use θ = 0 as a value of the scale parameter.

In [9]: gamma_a = gammadist(5*n,scale=theta).sf(a)
In [10]: gamma_b = gammadist(5*n,scale=theta).cdf(b)
In [11]: figure();
In [12]: plot(theta,gamma_a,'b-',label=r'$Y\geq a$');
In [13]: plot(theta,gamma_b,'r--',lw=2,label=r'$Y\leq b$');
In [14]: legend(loc='center right');
In [15]: xlabel(r'$\theta$');
In [16]: ylabel(r'$\gamma(\theta)$');
In [17]: grid(True)
In [18]: savefig('notes08_power.eps',bbox_inches='tight')

Here are the two power curves:

[Figure: power curves γ(θ) for the tests Y ≥ a and Y ≤ b]

Note that they intersect at γ = 0.1, which makes sense since any test with significance α will satisfy γ(θ_0) = α. And as we can see, the power of the test Y ≥ a (solid blue) is higher for θ > 1, while the power of the test Y ≤ b (dashed red) is higher for θ < 1.

1.4.3 Example: Composite Null Hypothesis

Returning to the case where H_1 is θ > 1, we can extend the discussion to comparison between two composite hypotheses by letting H_0 be θ ≤ 1. As a matter of convention, we define the significance α for a test with a composite null hypothesis as the worst-case false-alarm rate:

    α = max_{θ∈ω} P(X ∈ C; θ) .    (1.28)

If we consider the test Y ≥ a,

    P(X ∈ C; θ) = ∫_a^∞ y^{5n−1}/(Γ(5n) θ^{5n}) e^{−y/θ} dy = ∫_{a/θ}^∞ t^{5n−1}/Γ(5n) e^{−t} dt .    (1.29)

Since the integrand is positive definite and independent of θ, we see that we can maximize the integral by making θ as large as possible subject to H_0, so that the lower limit a/θ is as small as possible and the integral is maximized, i.e., θ = θ_0 = 1. So the significance of the test for null hypothesis H_0: θ ≤ θ_0 = 1 is the same as for H_0: θ = θ_0 = 1, and the rest of the problem proceeds as before, i.e., Y ≥ a defines the UMP test for H_0: θ ≤ 1 vs H_1: θ > 1.

1.4.4 The Karlin-Rubin Theorem

This example is an illustration of a result known as the Karlin-Rubin theorem, which extends the Neyman-Pearson lemma to tests of one-sided composite hypotheses. If the likelihood function L(θ; x) can be written in terms of a single sufficient statistic y = u(x) and has the property that, for any θ_0 and θ_1 in the parameter space satisfying θ_0 < θ_1, the likelihood ratio

    Λ(x; θ_0, θ_1) = L(θ_0; x)/L(θ_1; x)    (1.30)

is a monotonically increasing (or monotonically decreasing) function of y, then there exists a uniformly most powerful test for H_0: θ ≤ θ' versus H_1: θ > θ'. We call this situation a monotone likelihood ratio (MLR).

1.4.5 Example: Exponential Family

The conditions for a monotone likelihood ratio seem a little restrictive, but there is a decent class of models which satisfy them.
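Before that, here is a quick numerical sanity check (not part of the notes; the function name is illustrative) of the monotone-likelihood-ratio property in the Gamma example above: for θ_0 < θ_1, the ratio L(θ_0; x)/L(θ_1; x) = (θ_1/θ_0)^{5n} exp(−(1/θ_0 − 1/θ_1) y) should be strictly decreasing in y.

```python
# Sketch: verify numerically that the Gamma(5, theta) family has a monotone
# (decreasing) likelihood ratio in the sufficient statistic y = sum(x_i).
import numpy as np

def likelihood_ratio(y, theta0, theta1, n):
    """L(theta0; x)/L(theta1; x) as a function of y, for a sample of size n."""
    return (theta1 / theta0) ** (5 * n) * np.exp(-(1 / theta0 - 1 / theta1) * y)

n = 4
y = np.linspace(0.1, 100.0, 1000)
lam = likelihood_ratio(y, theta0=1.0, theta1=2.0, n=n)

# successive differences are all negative: monotonically decreasing in y
assert np.all(np.diff(lam) < 0)
```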

If we consider a distribution which is a member of the regular exponential family, so that

    f(x; θ) = exp[η(θ) K(x) + H(x) + q(θ)] ,    (1.31)

then the likelihood for a sample of size n is

    L(θ; x) = exp[η(θ) ∑_{i=1}^n K(x_i) + ∑_{i=1}^n H(x_i) + n q(θ)] ,    (1.32)

and the likelihood ratio is

    Λ(x; θ_0, θ_1) = L(θ_0; x)/L(θ_1; x) = exp{[η(θ_0) − η(θ_1)] y + n[q(θ_0) − q(θ_1)]} ,    (1.33)

where the dependence on the data is via the sufficient statistic y = ∑_{i=1}^n K(x_i). We see that:

- If η(θ) is a monotonically increasing function of θ, then η(θ_0) − η(θ_1) < 0 for θ_0 < θ_1, and Λ(x; θ_0, θ_1) is a monotonically decreasing function of y.
- If η(θ) is a monotonically decreasing function of θ, then η(θ_0) − η(θ_1) > 0 for θ_0 < θ_1, and Λ(x; θ_0, θ_1) is a monotonically increasing function of y.

I.e., if η(θ) is a monotone function of θ, we have a monotone likelihood ratio, and the Karlin-Rubin theorem tells us there's a UMP test to distinguish between one-sided hypotheses.

Tuesday 3 May 2016
Read Section 8.3 of Hogg

2 Likelihood Ratio Tests and UMP Unbiased Tests

2.1 Gamma Example Continued

Recall that last time, we saw that, for a sample of size n drawn from a Gamma(5, θ) distribution, a test which rejected H_0: θ = θ_0 = 1 if ∑_{i=1}^n X_i ≥ a was uniformly most powerful for comparing H_0 to the one-sided alternative hypothesis θ > θ_0 = 1, while one which rejected H_0 if ∑_{i=1}^n X_i ≤ b was UMP if the alternative hypothesis was θ < θ_0 = 1; but if the alternative hypothesis was H_1: θ ≠ θ_0 = 1, there was no single test which was most powerful for every θ allowed under H_1. One method we've considered before for comparing composite hypotheses is a likelihood ratio test, where we maximize the likelihood subject to each hypothesis. In this case, where the null hypothesis is still the point hypothesis θ = θ_0, this becomes

    Λ(x) = L(θ_0; x)/L(θ̂; x) ;    (2.1)

specifically, in this example,

    L(θ; x) = (∏_{i=1}^n x_i^4)/([Γ(5)]^n θ^{5n}) e^{−y/θ} ,    (2.2)

and the maximum likelihood solution θ̂ is the solution to

    0 = ∂/∂θ [−5n ln θ − y/θ] = −5n/θ + y/θ² ,    (2.3)

i.e., θ̂ = y/(5n), which makes the likelihood ratio

    Λ(x) = L(1; x)/L(y/(5n); x) = (y/(5n))^{5n} e^{−y+5n} .    (2.4)

Since this goes to zero as y goes to zero or infinity, a test which rejects H_0 if Λ(x) ≤ k will be a two-sided test on Y, rejecting if

    Y ≤ c_1 or Y ≥ c_2 ,    (2.5)

where y = c_1 and y = c_2 correspond to equal values of the likelihood ratio, i.e.,

    c_1^{5n} e^{−c_1} = c_2^{5n} e^{−c_2} ,    (2.6)

which, together with the specified significance α, determines c_1 and c_2 via

    α = 1 − ∫_{c_1}^{c_2} f_Y(y; 1) dy = 1 − (1/Γ(5n)) ∫_{c_1}^{c_2} y^{5n−1} e^{−y} dy .    (2.7)

Note that this likelihood ratio test is slightly different from the equal-tailed test defined by

    ∫_0^{c_1^sym} f_Y(y; 1) dy = α/2 = ∫_{c_2^sym}^∞ f_Y(y; 1) dy ,    (2.8)

which is slightly easier to construct: c_1^sym and c_2^sym can be written as the 100(α/2)th and 100(1 − α/2)th percentiles of the Gamma(5n, 1) distribution which applies to Y under the null hypothesis H_0. If the distribution function f_Y(y; 1) and the likelihood ratio Λ(x) were symmetric functions in y, of course, the likelihood ratio test would be equal-tailed. This is the case in several of the examples in Hogg. We can examine the power functions of the different options (the two one-sided tests, the equal-tailed test, and the likelihood ratio test) by continuing our plotting example from last time:

In [19]: c2_sym = gammadist(5*n).isf(0.5*alpha)
In [20]: c1_sym = gammadist(5*n).ppf(0.5*alpha)
In [21]: gamma_sym = (
   ...:     gammadist(5*n,scale=theta).cdf(c1_sym)
   ...:     + gammadist(5*n,scale=theta).sf(c2_sym))
In [22]: plot(theta,gamma_sym,'k:',lw=2,
   ...:     label=r'$Y\leq c_1^{\rm sym}$ '
   ...:     + r'or $Y\geq c_2^{\rm sym}$');

For the maximum likelihood test, we have to find the test which satisfies (2.6) from among the choices for which P(Y ≤ c_1|H_0) + P(Y ≥ c_2|H_0) = α:

In [23]: alphaleft = linspace(0,alpha,1000)[1:-1]
In [24]: c1 = gammadist(5*n).ppf(alphaleft)
In [25]: c2 = gammadist(5*n).isf(alpha-alphaleft)

c1 and c2 are both 998-element arrays giving the lower and upper ends of the range of y values for which we should not reject H_0.
Rather than doing some sophisticated root-finding, we just brute-force it and pick the pair which has the lowest value of |c_1^{5n} e^{−c_1} − c_2^{5n} e^{−c_2}|:

In [26]: ind_ml = argmin(abs(c1**(5*n)*exp(-c1)
   ...:     - c2**(5*n)*exp(-c2)))
In [27]: c1_ml = c1[ind_ml]
In [28]: c2_ml = c2[ind_ml]
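For reference, a root-finding version is not much more work. The following sketch (not part of the notes; variable names are illustrative) uses scipy.optimize.brentq to solve for the left-tail probability p such that c_1 and c_2, defined by P(Y ≤ c_1|H_0) = p and P(Y ≥ c_2|H_0) = α − p, satisfy the equal-likelihood condition above:

```python
# Sketch: solve the equal-likelihood condition c1^{5n} e^{-c1} = c2^{5n} e^{-c2}
# directly, working with its logarithm for numerical stability.
import numpy as np
from scipy.optimize import brentq
from scipy.stats import gamma as gammadist

alpha = 0.10
n = 4

def mismatch(p):
    """Difference of log-likelihood-ratio values at the two cut points."""
    c1 = gammadist(5 * n).ppf(p)
    c2 = gammadist(5 * n).isf(alpha - p)
    return (5 * n * np.log(c1) - c1) - (5 * n * np.log(c2) - c2)

# the root lies strictly inside (0, alpha)
p_root = brentq(mismatch, 1e-6, alpha - 1e-6)
c1_root = gammadist(5 * n).ppf(p_root)
c2_root = gammadist(5 * n).isf(alpha - p_root)
```

By construction the resulting test has total tail probability α under H_0, so it reproduces the brute-force c1_ml and c2_ml to within the grid spacing.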

In [29]: gamma_ml = (
   ...:     gammadist(5*n,scale=theta).cdf(c1_ml)
   ...:     + gammadist(5*n,scale=theta).sf(c2_ml))
In [30]: plot(theta,gamma_ml,'g-',
   ...:     label=r'$Y\leq c_1^{\rm ML}$ '
   ...:     + r'or $Y\geq c_2^{\rm ML}$');
In [31]: legend(loc='center right');
In [32]: savefig('notes08_ml.eps',bbox_inches='tight')

[Figure: power curves γ(θ) for Y ≥ a, Y ≤ b, the equal-tailed test, and the likelihood-ratio test]

We see that the equal-tailed test is slightly different from the maximum likelihood test, thanks to the asymmetric distribution, but it's close. Both have the property of having a power function that goes to unity as you go away from θ = θ_0 in either direction, as opposed to the one-sided tests, whose power function goes to zero on the wrong side of θ = θ_0.

2.2 Unbiased Tests

One of the unsatisfactory properties of the one-sided hypothesis tests is that while they are the most powerful tests for some ranges of the parameter θ, for other ranges the power is actually below the false-alarm probability α. For instance, the test which rejects H_0 when Y ≥ a has γ(θ) < α when θ < θ_0 = 1. This means the test is more likely to reject H_0 when it's true than when H_1 is true, if it happens that θ < 1. A test with this undesirable property is called biased. Conversely, an unbiased test is one for which γ(θ) ≥ α for all θ allowed by H_1. It is often the case that, while no uniformly most powerful test exists, there is an unbiased test which, at all values of θ, is more powerful than any other unbiased test. This is known as the uniformly most powerful unbiased (UMPU) test. It can be shown that in many cases (specifically the regular exponential family), the UMPU test is a two-sided test on the sufficient statistic Y, which rejects H_0 if Y ≤ c_1 or Y ≥ c_2.

2.2.1 Illustration for Gamma example

Returning to the example of a sample from the Gamma(5, θ) distribution, consider a test which satisfies γ(θ) ≥ α when θ ≠ θ_0 = 1. Since γ(θ_0) = α, that means the power function γ(θ) must have a minimum at θ = θ_0. Recalling that

Y ~ Gamma(5n, θ), we have

    γ(θ) = 1 − ∫_{c_1}^{c_2} f_Y(y; θ) dy = 1 − (1/Γ(5n)) ∫_{c_1}^{c_2} y^{5n−1} e^{−y/θ}/θ^{5n} dy = 1 − (1/Γ(5n)) ∫_{c_1/θ}^{c_2/θ} t^{5n−1} e^{−t} dt ,    (2.9)

where we have made the substitution t = y/θ in the last step. Requiring this to be a minimum at θ = θ_0 gives us

    0 = γ'(θ_0) = (1/(Γ(5n) θ_0²)) [c_2 (c_2/θ_0)^{5n−1} e^{−c_2/θ_0} − c_1 (c_1/θ_0)^{5n−1} e^{−c_1/θ_0}] ,    (2.10)

which gives us, using θ_0 = 1,

    c_1^{5n} e^{−c_1} = c_2^{5n} e^{−c_2} .    (2.11)

If we look back, we see this is just the condition (2.6) associated with the likelihood ratio test, so that test is in fact the UMP unbiased test for this model. We can verify this by zooming in on the power function plot above:

In [33]: xlim([0.95,1.05]);
In [34]: ylim([0.08,0.12]);
In [35]: savefig('notes08_mlzoom.eps',bbox_inches='tight')

[Figure: the same power curves zoomed in near θ = 1]

We see that the equal-tailed test has a power function which dips below α = 0.10 for a small range of θ values, so in fact that test is not unbiased. We see that the likelihood ratio test is indeed unbiased, and it turns out to be the UMPU test.

Thursday 5 May 2016
See the preprint by Searle

3 Composite Hypothesis Testing with Priors

Consider now a slightly different situation. Suppose that the null hypothesis H_0 is a point hypothesis, but the alternative hypothesis H_1 is a composite hypothesis which allows for a range

of values of the model parameters θ, but comes with a prior distribution f_Θ(θ|H_1) on those parameters. We include the possibility of a p-dimensional parameter space, but it may also be that p = 1. If we define a test which rejects H_0 when

    f_X(x|H_0)/f_X(x|H_1) = L(θ_0; x) / ∫ L(θ; x) f_Θ(θ|H_1) d^p θ ≤ k ,    (3.1)

the Neyman-Pearson lemma tells us this is the most powerful test of H_0 versus H_1. This is most powerful in the sense of maximizing the power function

    γ(H_1) = P(X ∈ C|H_1) = ∫_C f_X(x|H_1) d^n x = ∫_C [∫ L(θ; x) f_Θ(θ|H_1) d^p θ] d^n x = ∫ γ(θ) f_Θ(θ|H_1) d^p θ .    (3.2)

A few points to note:

1. The test statistic in (3.1) is just the Bayes factor, which we've already motivated using Bayes's theorem in the form

       P(H_i|x) = f_X(x|H_i) P(H_i) / f_X(x)    (3.3)

   to write

       P(H_0|x)/P(H_1|x) = [f_X(x|H_0)/f_X(x|H_1)] [P(H_0)/P(H_1)] .    (3.4)

2. If there is a uniformly most powerful test which covers any θ in the support of f_Θ(θ|H_1), this will also be the most powerful test of H_0 against H_1 for any prior distribution.

3. A possible objection is that the frequentist hypothesis-testing formalism only applies to the outcomes of repeated experiments, and isn't supposed to know about any prior distribution. The preprint by Searle referenced above actually refers to the outcome of a Monte Carlo experiment, where θ is also randomly generated along with the realization of X, so the relevant joint distribution is f_{XΘ}(x, θ|H_1) = f_X(x; θ) f_Θ(θ|H_1). In that context, the Bayes factor constructed using the same prior as the Monte Carlo simulation gives the most powerful test, when attention is restricted to tests which define C using only X and not Θ.

3.1 Example: Mean of a Normal Distribution

Suppose that the data vector is X = (X_1, X_2), but rather than being a sample from a given distribution, they are independent random variables X_1 ~ N(θ_1, 1) and X_2 ~ N(θ_2, 1). Let the null hypothesis be H_0: θ_1 = 0 = θ_2 and the alternative hypothesis be H_1: θ_1² + θ_2² > 0.
The likelihood function is

    L(θ; x) = (1/(2π)) exp(−(x_1 − θ_1)²/2 − (x_2 − θ_2)²/2) .    (3.5)

The maximum likelihood solution is θ̂_1 = x_1 and θ̂_2 = x_2, so the likelihood ratio test would reject H_0 if

    Λ(x) = L(0; x)/L(θ̂; x) = exp(−(x_1² + x_2²)/2) ≤ k ,    (3.6)

which means, for a test of significance α, rejecting H_0 if

    X_1² + X_2² ≥ χ²_{2,α} .    (3.7)

On the other hand, if we have a prior f_Θ(θ_1, θ_2|H_1), the Neyman-Pearson lemma tells us the optimal test statistic is the Bayes factor

    exp(−(x_1² + x_2²)/2) / ∫∫ f_Θ(θ_1, θ_2|H_1) exp(−(x_1 − θ_1)²/2 − (x_2 − θ_2)²/2) dθ_1 dθ_2 .    (3.8)
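As a small sketch (not part of the notes; the alpha value and sample count are illustrative choices), the chi-squared threshold for the maximum-likelihood test above can be computed with scipy.stats.chi2, and its false-alarm probability checked by simulating (X_1, X_2) under H_0:

```python
# Sketch: under H0, X1, X2 ~ N(0, 1), so X1^2 + X2^2 ~ chi-squared with 2 dof.
import numpy as np
from scipy.stats import chi2

alpha = 0.10
threshold = chi2(2).isf(alpha)   # for 2 dof this is -2 ln(alpha)

rng = np.random.default_rng(0)
x1 = rng.standard_normal(200000)   # X1 under H0
x2 = rng.standard_normal(200000)   # X2 under H0
false_alarm = np.mean(x1**2 + x2**2 >= threshold)
print(false_alarm)   # should be close to alpha = 0.10
```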

3.1.1 Case 1: Uniform Prior

Suppose that the prior f_Θ(θ_1, θ_2|H_1) is uniform in θ_1 and θ_2. In that case the denominator is proportional to

    ∫∫ exp(−(x_1 − θ_1)²/2 − (x_2 − θ_2)²/2) dθ_1 dθ_2 = constant ,    (3.9)

and the Bayes factor is proportional to the maximum-likelihood ratio

    exp(−(x_1² + x_2²)/2) ,    (3.10)

so using x_1² + x_2² as a test statistic does give the optimal test in light of the prior.

3.1.2 Case 2: Non-Uniform Prior

It could be that the prior on θ_1 and θ_2 is more complicated. For instance, suppose f_Θ(θ_1, θ_2) ∝ θ_1² θ_2². (This is an improper prior, so technically the normalization factor will be zero and the Bayes factor will be infinite, but we're only interested in the x dependence of the Bayes factor anyway. If we want to be careful, we can use a prior that is, e.g., Gaussian with some large width, much greater than any of the other values in the problem. E.g., for the gravitational-wave signal from a rotating or orbiting system, the amplitudes of the left- and right-circularly polarized parts of the signal are proportional to (1 ± cos ι)², where ι is an inclination angle which is generally unknown, with a uniform prior on cos ι ∈ [−1, 1]; this produces a non-trivial prior distribution for the two amplitudes. See Whelan et al, Classical and Quantum Gravity 31.) Then the denominator of the Bayes factor is proportional to

    ∫∫ θ_1² θ_2² exp(−(x_1 − θ_1)²/2 − (x_2 − θ_2)²/2) dθ_1 dθ_2 ∝ (1 + x_1²)(1 + x_2²) ,    (3.11)

which means the Bayes factor is proportional to

    exp(−(x_1² + x_2²)/2) / [(1 + x_1²)(1 + x_2²)] ,    (3.12)

and the optimal test rejects H_0 if

    X_1² + X_2² + 2 ln(1 + X_1²) + 2 ln(1 + X_2²) ≥ c .    (3.13)

We can show the difference between the two families of critical regions on a contour plot:

In [1]: x = linspace(-4,4,200)
In [2]: xgrid1, xgrid2 = meshgrid(x,x)
In [3]: Fgrid = xgrid1**2 + xgrid2**2
In [4]: Bgrid = (Fgrid
   ...:     + 2*log(1+xgrid1**2) + 2*log(1+xgrid2**2))
In [5]: xlevels = arange(1,6)
In [6]: Blevels = xlevels**2 + 2*log(1+xlevels**2)
In [7]: Flevels = xlevels**2
In [8]: figure(figsize=(5,5));
In [9]: contour(xgrid1,xgrid2,Fgrid,Flevels,colors='r',
   ...:     linewidths=2,linestyles='dashed');
In [10]: contour(xgrid1,xgrid2,Bgrid,Blevels,colors='b');
In [11]: grid(True)

In [12]: xlabel(r'$x_1$');
In [13]: ylabel(r'$x_2$');
In [14]: savefig('notes08_contours.eps',
   ...:     bbox_inches='tight')

[Figure: dashed red contours of X_1² + X_2² and solid blue contours of the Bayes-factor statistic in the (x_1, x_2) plane]

Note that we drew the contours at arbitrary levels; to draw them at the same significance for the two tests, we'd probably need to estimate the significance of the Bayes-factor test numerically. See figure 3 of Whelan et al 2014 for an example of this.

Tuesday 10 May 2016 and Thursday 12 May 2016
Review for Final Exam

The exam is comprehensive, but with relatively more emphasis on chapter eight and the second half of chapter seven. Please come with questions and topics you'd like to go over.


HYPOTHESIS TESTING: FREQUENTIST APPROACH. HYPOTHESIS TESTING: FREQUENTIST APPROACH. These notes summarize the lectures on (the frequentist approach to) hypothesis testing. You should be familiar with the standard hypothesis testing from previous

More information

Statistics Masters Comprehensive Exam March 21, 2003

Statistics Masters Comprehensive Exam March 21, 2003 Statistics Masters Comprehensive Exam March 21, 2003 Student Name: 1. Answer 8 out of 12 problems. Mark the problems you selected in the following table. 1 2 3 4 5 6 7 8 9 10 11 12 2. Write your answer

More information

STAT 135 Lab 5 Bootstrapping and Hypothesis Testing

STAT 135 Lab 5 Bootstrapping and Hypothesis Testing STAT 135 Lab 5 Bootstrapping and Hypothesis Testing Rebecca Barter March 2, 2015 The Bootstrap Bootstrap Suppose that we are interested in estimating a parameter θ from some population with members x 1,...,

More information

Primer on statistics:

Primer on statistics: Primer on statistics: MLE, Confidence Intervals, and Hypothesis Testing ryan.reece@gmail.com http://rreece.github.io/ Insight Data Science - AI Fellows Workshop Feb 16, 018 Outline 1. Maximum likelihood

More information

E. Santovetti lesson 4 Maximum likelihood Interval estimation

E. Santovetti lesson 4 Maximum likelihood Interval estimation E. Santovetti lesson 4 Maximum likelihood Interval estimation 1 Extended Maximum Likelihood Sometimes the number of total events measurements of the experiment n is not fixed, but, for example, is a Poisson

More information

Lecture 2: Basic Concepts of Statistical Decision Theory

Lecture 2: Basic Concepts of Statistical Decision Theory EE378A Statistical Signal Processing Lecture 2-03/31/2016 Lecture 2: Basic Concepts of Statistical Decision Theory Lecturer: Jiantao Jiao, Tsachy Weissman Scribe: John Miller and Aran Nayebi In this lecture

More information

Direction: This test is worth 250 points and each problem worth points. DO ANY SIX

Direction: This test is worth 250 points and each problem worth points. DO ANY SIX Term Test 3 December 5, 2003 Name Math 52 Student Number Direction: This test is worth 250 points and each problem worth 4 points DO ANY SIX PROBLEMS You are required to complete this test within 50 minutes

More information

Lecture 12 November 3

Lecture 12 November 3 STATS 300A: Theory of Statistics Fall 2015 Lecture 12 November 3 Lecturer: Lester Mackey Scribe: Jae Hyuck Park, Christian Fong Warning: These notes may contain factual and/or typographic errors. 12.1

More information

7. Estimation and hypothesis testing. Objective. Recommended reading

7. Estimation and hypothesis testing. Objective. Recommended reading 7. Estimation and hypothesis testing Objective In this chapter, we show how the election of estimators can be represented as a decision problem. Secondly, we consider the problem of hypothesis testing

More information

Parameter estimation and forecasting. Cristiano Porciani AIfA, Uni-Bonn

Parameter estimation and forecasting. Cristiano Porciani AIfA, Uni-Bonn Parameter estimation and forecasting Cristiano Porciani AIfA, Uni-Bonn Questions? C. Porciani Estimation & forecasting 2 Temperature fluctuations Variance at multipole l (angle ~180o/l) C. Porciani Estimation

More information

The University of Hong Kong Department of Statistics and Actuarial Science STAT2802 Statistical Models Tutorial Solutions Solutions to Problems 71-80

The University of Hong Kong Department of Statistics and Actuarial Science STAT2802 Statistical Models Tutorial Solutions Solutions to Problems 71-80 The University of Hong Kong Department of Statistics and Actuarial Science STAT2802 Statistical Models Tutorial Solutions Solutions to Problems 71-80 71. Decide in each case whether the hypothesis is simple

More information

Statistical Inference

Statistical Inference Statistical Inference Robert L. Wolpert Institute of Statistics and Decision Sciences Duke University, Durham, NC, USA Spring, 2006 1. DeGroot 1973 In (DeGroot 1973), Morrie DeGroot considers testing the

More information

Mathematical Statistics

Mathematical Statistics Mathematical Statistics MAS 713 Chapter 8 Previous lecture: 1 Bayesian Inference 2 Decision theory 3 Bayesian Vs. Frequentist 4 Loss functions 5 Conjugate priors Any questions? Mathematical Statistics

More information

Lecture 16 November Application of MoUM to our 2-sided testing problem

Lecture 16 November Application of MoUM to our 2-sided testing problem STATS 300A: Theory of Statistics Fall 2015 Lecture 16 November 17 Lecturer: Lester Mackey Scribe: Reginald Long, Colin Wei Warning: These notes may contain factual and/or typographic errors. 16.1 Recap

More information

40.530: Statistics. Professor Chen Zehua. Singapore University of Design and Technology

40.530: Statistics. Professor Chen Zehua. Singapore University of Design and Technology Singapore University of Design and Technology Lecture 9: Hypothesis testing, uniformly most powerful tests. The Neyman-Pearson framework Let P be the family of distributions of concern. The Neyman-Pearson

More information

Composite Hypotheses and Generalized Likelihood Ratio Tests

Composite Hypotheses and Generalized Likelihood Ratio Tests Composite Hypotheses and Generalized Likelihood Ratio Tests Rebecca Willett, 06 In many real world problems, it is difficult to precisely specify probability distributions. Our models for data may involve

More information

Statistical Theory MT 2007 Problems 4: Solution sketches

Statistical Theory MT 2007 Problems 4: Solution sketches Statistical Theory MT 007 Problems 4: Solution sketches 1. Consider a 1-parameter exponential family model with density f(x θ) = f(x)g(θ)exp{cφ(θ)h(x)}, x X. Suppose that the prior distribution has the

More information

Topic 15: Simple Hypotheses

Topic 15: Simple Hypotheses Topic 15: November 10, 2009 In the simplest set-up for a statistical hypothesis, we consider two values θ 0, θ 1 in the parameter space. We write the test as H 0 : θ = θ 0 versus H 1 : θ = θ 1. H 0 is

More information

Answers to the 8th problem set. f(x θ = θ 0 ) L(θ 0 )

Answers to the 8th problem set. f(x θ = θ 0 ) L(θ 0 ) Answers to the 8th problem set The likelihood ratio with which we worked in this problem set is: Λ(x) = f(x θ = θ 1 ) L(θ 1 ) =. f(x θ = θ 0 ) L(θ 0 ) With a lower-case x, this defines a function. With

More information

Stat 5421 Lecture Notes Fuzzy P-Values and Confidence Intervals Charles J. Geyer March 12, Discreteness versus Hypothesis Tests

Stat 5421 Lecture Notes Fuzzy P-Values and Confidence Intervals Charles J. Geyer March 12, Discreteness versus Hypothesis Tests Stat 5421 Lecture Notes Fuzzy P-Values and Confidence Intervals Charles J. Geyer March 12, 2016 1 Discreteness versus Hypothesis Tests You cannot do an exact level α test for any α when the data are discrete.

More information

LECTURE 10: NEYMAN-PEARSON LEMMA AND ASYMPTOTIC TESTING. The last equality is provided so this can look like a more familiar parametric test.

LECTURE 10: NEYMAN-PEARSON LEMMA AND ASYMPTOTIC TESTING. The last equality is provided so this can look like a more familiar parametric test. Economics 52 Econometrics Professor N.M. Kiefer LECTURE 1: NEYMAN-PEARSON LEMMA AND ASYMPTOTIC TESTING NEYMAN-PEARSON LEMMA: Lesson: Good tests are based on the likelihood ratio. The proof is easy in the

More information

Hypothesis Testing. BS2 Statistical Inference, Lecture 11 Michaelmas Term Steffen Lauritzen, University of Oxford; November 15, 2004

Hypothesis Testing. BS2 Statistical Inference, Lecture 11 Michaelmas Term Steffen Lauritzen, University of Oxford; November 15, 2004 Hypothesis Testing BS2 Statistical Inference, Lecture 11 Michaelmas Term 2004 Steffen Lauritzen, University of Oxford; November 15, 2004 Hypothesis testing We consider a family of densities F = {f(x; θ),

More information

Stat 135, Fall 2006 A. Adhikari HOMEWORK 6 SOLUTIONS

Stat 135, Fall 2006 A. Adhikari HOMEWORK 6 SOLUTIONS Stat 135, Fall 2006 A. Adhikari HOMEWORK 6 SOLUTIONS 1a. Under the null hypothesis X has the binomial (100,.5) distribution with E(X) = 50 and SE(X) = 5. So P ( X 50 > 10) is (approximately) two tails

More information

LECTURE NOTES 57. Lecture 9

LECTURE NOTES 57. Lecture 9 LECTURE NOTES 57 Lecture 9 17. Hypothesis testing A special type of decision problem is hypothesis testing. We partition the parameter space into H [ A with H \ A = ;. Wewrite H 2 H A 2 A. A decision problem

More information

4.5.1 The use of 2 log Λ when θ is scalar

4.5.1 The use of 2 log Λ when θ is scalar 4.5. ASYMPTOTIC FORM OF THE G.L.R.T. 97 4.5.1 The use of 2 log Λ when θ is scalar Suppose we wish to test the hypothesis NH : θ = θ where θ is a given value against the alternative AH : θ θ on the basis

More information

McGill University. Faculty of Science. Department of Mathematics and Statistics. Part A Examination. Statistics: Theory Paper

McGill University. Faculty of Science. Department of Mathematics and Statistics. Part A Examination. Statistics: Theory Paper McGill University Faculty of Science Department of Mathematics and Statistics Part A Examination Statistics: Theory Paper Date: 10th May 2015 Instructions Time: 1pm-5pm Answer only two questions from Section

More information

ECE531 Lecture 6: Detection of Discrete-Time Signals with Random Parameters

ECE531 Lecture 6: Detection of Discrete-Time Signals with Random Parameters ECE531 Lecture 6: Detection of Discrete-Time Signals with Random Parameters D. Richard Brown III Worcester Polytechnic Institute 26-February-2009 Worcester Polytechnic Institute D. Richard Brown III 26-February-2009

More information

6.867 Machine Learning

6.867 Machine Learning 6.867 Machine Learning Problem set 1 Due Thursday, September 19, in class What and how to turn in? Turn in short written answers to the questions explicitly stated, and when requested to explain or prove.

More information

A Very Brief Summary of Statistical Inference, and Examples

A Very Brief Summary of Statistical Inference, and Examples A Very Brief Summary of Statistical Inference, and Examples Trinity Term 2009 Prof. Gesine Reinert Our standard situation is that we have data x = x 1, x 2,..., x n, which we view as realisations of random

More information

MAT 271E Probability and Statistics

MAT 271E Probability and Statistics MAT 7E Probability and Statistics Spring 6 Instructor : Class Meets : Office Hours : Textbook : İlker Bayram EEB 3 ibayram@itu.edu.tr 3.3 6.3, Wednesday EEB 6.., Monday D. B. Bertsekas, J. N. Tsitsiklis,

More information

Lecture 5: Likelihood ratio tests, Neyman-Pearson detectors, ROC curves, and sufficient statistics. 1 Executive summary

Lecture 5: Likelihood ratio tests, Neyman-Pearson detectors, ROC curves, and sufficient statistics. 1 Executive summary ECE 830 Spring 207 Instructor: R. Willett Lecture 5: Likelihood ratio tests, Neyman-Pearson detectors, ROC curves, and sufficient statistics Executive summary In the last lecture we saw that the likelihood

More information

Lecture 26: Likelihood ratio tests

Lecture 26: Likelihood ratio tests Lecture 26: Likelihood ratio tests Likelihood ratio When both H 0 and H 1 are simple (i.e., Θ 0 = {θ 0 } and Θ 1 = {θ 1 }), Theorem 6.1 applies and a UMP test rejects H 0 when f θ1 (X) f θ0 (X) > c 0 for

More information

4 Hypothesis testing. 4.1 Types of hypothesis and types of error 4 HYPOTHESIS TESTING 49

4 Hypothesis testing. 4.1 Types of hypothesis and types of error 4 HYPOTHESIS TESTING 49 4 HYPOTHESIS TESTING 49 4 Hypothesis testing In sections 2 and 3 we considered the problem of estimating a single parameter of interest, θ. In this section we consider the related problem of testing whether

More information

Bayesian Statistics (Hogg Chapter Eleven)

Bayesian Statistics (Hogg Chapter Eleven) Bayesian Statistics (Hogg Chapter Eleven) STAT 406-0: Mathematical Statistics II Spring Semester 206 Contents 0 Preliminaries 2 0. Administrata.................... 2 0.2 Outline........................

More information

Lecture Testing Hypotheses: The Neyman-Pearson Paradigm

Lecture Testing Hypotheses: The Neyman-Pearson Paradigm Math 408 - Mathematical Statistics Lecture 29-30. Testing Hypotheses: The Neyman-Pearson Paradigm April 12-15, 2013 Konstantin Zuev (USC) Math 408, Lecture 29-30 April 12-15, 2013 1 / 12 Agenda Example:

More information

Line Broadening. φ(ν) = Γ/4π 2 (ν ν 0 ) 2 + (Γ/4π) 2, (3) where now Γ = γ +2ν col includes contributions from both natural broadening and collisions.

Line Broadening. φ(ν) = Γ/4π 2 (ν ν 0 ) 2 + (Γ/4π) 2, (3) where now Γ = γ +2ν col includes contributions from both natural broadening and collisions. Line Broadening Spectral lines are not arbitrarily sharp. There are a variety of mechanisms that give them finite width, and some of those mechanisms contain significant information. We ll consider a few

More information

5.1 Uniformly Most Accurate Families of Confidence

5.1 Uniformly Most Accurate Families of Confidence Chapter 5 Confidence Estimation Let (X, B X, P ), P {P θ θ Θ} be sample space and {Θ, B Θ, µ} be measure space with some σ finite measure µ. 5.1 Uniformly Most Accurate Families of Confidence Sets Def.

More information

More Empirical Process Theory

More Empirical Process Theory More Empirical Process heory 4.384 ime Series Analysis, Fall 2008 Recitation by Paul Schrimpf Supplementary to lectures given by Anna Mikusheva October 24, 2008 Recitation 8 More Empirical Process heory

More information

Hypothesis testing: theory and methods

Hypothesis testing: theory and methods Statistical Methods Warsaw School of Economics November 3, 2017 Statistical hypothesis is the name of any conjecture about unknown parameters of a population distribution. The hypothesis should be verifiable

More information

14.30 Introduction to Statistical Methods in Economics Spring 2009

14.30 Introduction to Statistical Methods in Economics Spring 2009 MIT OpenCourseWare http://ocw.mit.edu.30 Introduction to Statistical Methods in Economics Spring 009 For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms. .30

More information

Master s Written Examination - Solution

Master s Written Examination - Solution Master s Written Examination - Solution Spring 204 Problem Stat 40 Suppose X and X 2 have the joint pdf f X,X 2 (x, x 2 ) = 2e (x +x 2 ), 0 < x < x 2

More information

Partitioning the Parameter Space. Topic 18 Composite Hypotheses

Partitioning the Parameter Space. Topic 18 Composite Hypotheses Topic 18 Composite Hypotheses Partitioning the Parameter Space 1 / 10 Outline Partitioning the Parameter Space 2 / 10 Partitioning the Parameter Space Simple hypotheses limit us to a decision between one

More information

UNIVERSITY OF TORONTO SCARBOROUGH Department of Computer and Mathematical Sciences FINAL EXAMINATION, APRIL 2013

UNIVERSITY OF TORONTO SCARBOROUGH Department of Computer and Mathematical Sciences FINAL EXAMINATION, APRIL 2013 UNIVERSITY OF TORONTO SCARBOROUGH Department of Computer and Mathematical Sciences FINAL EXAMINATION, APRIL 2013 STAB57H3 Introduction to Statistics Duration: 3 hours Last Name: First Name: Student number:

More information

ECE531 Screencast 11.4: Composite Neyman-Pearson Hypothesis Testing

ECE531 Screencast 11.4: Composite Neyman-Pearson Hypothesis Testing ECE531 Screencast 11.4: Composite Neyman-Pearson Hypothesis Testing D. Richard Brown III Worcester Polytechnic Institute Worcester Polytechnic Institute D. Richard Brown III 1 / 8 Basics Hypotheses H 0

More information

Stat 710: Mathematical Statistics Lecture 27

Stat 710: Mathematical Statistics Lecture 27 Stat 710: Mathematical Statistics Lecture 27 Jun Shao Department of Statistics University of Wisconsin Madison, WI 53706, USA Jun Shao (UW-Madison) Stat 710, Lecture 27 April 3, 2009 1 / 10 Lecture 27:

More information

Statistical Theory MT 2006 Problems 4: Solution sketches

Statistical Theory MT 2006 Problems 4: Solution sketches Statistical Theory MT 006 Problems 4: Solution sketches 1. Suppose that X has a Poisson distribution with unknown mean θ. Determine the conjugate prior, and associate posterior distribution, for θ. Determine

More information

2. What are the tradeoffs among different measures of error (e.g. probability of false alarm, probability of miss, etc.)?

2. What are the tradeoffs among different measures of error (e.g. probability of false alarm, probability of miss, etc.)? ECE 830 / CS 76 Spring 06 Instructors: R. Willett & R. Nowak Lecture 3: Likelihood ratio tests, Neyman-Pearson detectors, ROC curves, and sufficient statistics Executive summary In the last lecture we

More information

6.867 Machine Learning

6.867 Machine Learning 6.867 Machine Learning Problem set 1 Solutions Thursday, September 19 What and how to turn in? Turn in short written answers to the questions explicitly stated, and when requested to explain or prove.

More information

Distribution-Free Procedures (Devore Chapter Fifteen)

Distribution-Free Procedures (Devore Chapter Fifteen) Distribution-Free Procedures (Devore Chapter Fifteen) MATH-5-01: Probability and Statistics II Spring 018 Contents 1 Nonparametric Hypothesis Tests 1 1.1 The Wilcoxon Rank Sum Test........... 1 1. Normal

More information

Let us first identify some classes of hypotheses. simple versus simple. H 0 : θ = θ 0 versus H 1 : θ = θ 1. (1) one-sided

Let us first identify some classes of hypotheses. simple versus simple. H 0 : θ = θ 0 versus H 1 : θ = θ 1. (1) one-sided Let us first identify some classes of hypotheses. simple versus simple H 0 : θ = θ 0 versus H 1 : θ = θ 1. (1) one-sided H 0 : θ θ 0 versus H 1 : θ > θ 0. (2) two-sided; null on extremes H 0 : θ θ 1 or

More information

Statistics. Lecture 4 August 9, 2000 Frank Porter Caltech. 1. The Fundamentals; Point Estimation. 2. Maximum Likelihood, Least Squares and All That

Statistics. Lecture 4 August 9, 2000 Frank Porter Caltech. 1. The Fundamentals; Point Estimation. 2. Maximum Likelihood, Least Squares and All That Statistics Lecture 4 August 9, 2000 Frank Porter Caltech The plan for these lectures: 1. The Fundamentals; Point Estimation 2. Maximum Likelihood, Least Squares and All That 3. What is a Confidence Interval?

More information

1 A simple example. A short introduction to Bayesian statistics, part I Math 217 Probability and Statistics Prof. D.

1 A simple example. A short introduction to Bayesian statistics, part I Math 217 Probability and Statistics Prof. D. probabilities, we ll use Bayes formula. We can easily compute the reverse probabilities A short introduction to Bayesian statistics, part I Math 17 Probability and Statistics Prof. D. Joyce, Fall 014 I

More information

Statistical Data Analysis Stat 3: p-values, parameter estimation

Statistical Data Analysis Stat 3: p-values, parameter estimation Statistical Data Analysis Stat 3: p-values, parameter estimation London Postgraduate Lectures on Particle Physics; University of London MSci course PH4515 Glen Cowan Physics Department Royal Holloway,

More information

Review. December 4 th, Review

Review. December 4 th, Review December 4 th, 2017 Att. Final exam: Course evaluation Friday, 12/14/2018, 10:30am 12:30pm Gore Hall 115 Overview Week 2 Week 4 Week 7 Week 10 Week 12 Chapter 6: Statistics and Sampling Distributions Chapter

More information

ST495: Survival Analysis: Hypothesis testing and confidence intervals

ST495: Survival Analysis: Hypothesis testing and confidence intervals ST495: Survival Analysis: Hypothesis testing and confidence intervals Eric B. Laber Department of Statistics, North Carolina State University April 3, 2014 I remember that one fateful day when Coach took

More information

TUTORIAL 8 SOLUTIONS #

TUTORIAL 8 SOLUTIONS # TUTORIAL 8 SOLUTIONS #9.11.21 Suppose that a single observation X is taken from a uniform density on [0,θ], and consider testing H 0 : θ = 1 versus H 1 : θ =2. (a) Find a test that has significance level

More information

Joint Probability Distributions and Random Samples (Devore Chapter Five)

Joint Probability Distributions and Random Samples (Devore Chapter Five) Joint Probability Distributions and Random Samples (Devore Chapter Five) 1016-345-01: Probability and Statistics for Engineers Spring 2013 Contents 1 Joint Probability Distributions 2 1.1 Two Discrete

More information

Frequentist-Bayesian Model Comparisons: A Simple Example

Frequentist-Bayesian Model Comparisons: A Simple Example Frequentist-Bayesian Model Comparisons: A Simple Example Consider data that consist of a signal y with additive noise: Data vector (N elements): D = y + n The additive noise n has zero mean and diagonal

More information

CSC321 Lecture 18: Learning Probabilistic Models

CSC321 Lecture 18: Learning Probabilistic Models CSC321 Lecture 18: Learning Probabilistic Models Roger Grosse Roger Grosse CSC321 Lecture 18: Learning Probabilistic Models 1 / 25 Overview So far in this course: mainly supervised learning Language modeling

More information

3 Continuous Random Variables

3 Continuous Random Variables Jinguo Lian Math437 Notes January 15, 016 3 Continuous Random Variables Remember that discrete random variables can take only a countable number of possible values. On the other hand, a continuous random

More information

A Very Brief Summary of Statistical Inference, and Examples

A Very Brief Summary of Statistical Inference, and Examples A Very Brief Summary of Statistical Inference, and Examples Trinity Term 2008 Prof. Gesine Reinert 1 Data x = x 1, x 2,..., x n, realisations of random variables X 1, X 2,..., X n with distribution (model)

More information

http://www.math.uah.edu/stat/hypothesis/.xhtml 1 of 5 7/29/2009 3:14 PM Virtual Laboratories > 9. Hy pothesis Testing > 1 2 3 4 5 6 7 1. The Basic Statistical Model As usual, our starting point is a random

More information

Problem Set 2. MAS 622J/1.126J: Pattern Recognition and Analysis. Due: 5:00 p.m. on September 30

Problem Set 2. MAS 622J/1.126J: Pattern Recognition and Analysis. Due: 5:00 p.m. on September 30 Problem Set MAS 6J/1.16J: Pattern Recognition and Analysis Due: 5:00 p.m. on September 30 [Note: All instructions to plot data or write a program should be carried out using Matlab. In order to maintain

More information

Charles Geyer University of Minnesota. joint work with. Glen Meeden University of Minnesota.

Charles Geyer University of Minnesota. joint work with. Glen Meeden University of Minnesota. Fuzzy Confidence Intervals and P -values Charles Geyer University of Minnesota joint work with Glen Meeden University of Minnesota http://www.stat.umn.edu/geyer/fuzz 1 Ordinary Confidence Intervals OK

More information

Derivation of Monotone Likelihood Ratio Using Two Sided Uniformly Normal Distribution Techniques

Derivation of Monotone Likelihood Ratio Using Two Sided Uniformly Normal Distribution Techniques Vol:7, No:0, 203 Derivation of Monotone Likelihood Ratio Using Two Sided Uniformly Normal Distribution Techniques D. A. Farinde International Science Index, Mathematical and Computational Sciences Vol:7,

More information

8 Basics of Hypothesis Testing

8 Basics of Hypothesis Testing 8 Basics of Hypothesis Testing 4 Problems Problem : The stochastic signal S is either 0 or E with equal probability, for a known value E > 0. Consider an observation X = x of the stochastic variable X

More information

This does not cover everything on the final. Look at the posted practice problems for other topics.

This does not cover everything on the final. Look at the posted practice problems for other topics. Class 7: Review Problems for Final Exam 8.5 Spring 7 This does not cover everything on the final. Look at the posted practice problems for other topics. To save time in class: set up, but do not carry

More information

Detection Theory. Composite tests

Detection Theory. Composite tests Composite tests Chapter 5: Correction Thu I claimed that the above, which is the most general case, was captured by the below Thu Chapter 5: Correction Thu I claimed that the above, which is the most general

More information

Statistical Distribution Assumptions of General Linear Models

Statistical Distribution Assumptions of General Linear Models Statistical Distribution Assumptions of General Linear Models Applied Multilevel Models for Cross Sectional Data Lecture 4 ICPSR Summer Workshop University of Colorado Boulder Lecture 4: Statistical Distributions

More information