Some General Types of Tests

Austin Thornton
5 years ago
1 Some General Types of Tests

We may not be able to find a UMP or UMPU test in a given situation. In that case, we may use a test from some general class of tests that often have good asymptotic properties.

2 Likelihood Ratio Tests

We see that the Neyman-Pearson Lemma leads directly to use of the ratio of the likelihoods in constructing tests. Now we want to generalize this approach and to study the properties of tests based on that ratio. Although, as we have emphasized, the likelihood is a function of the distribution rather than of the random variable, we want to study its properties under the distribution of the random variable.

3 Likelihood Ratio Tests

Using the idea of the ratio as in the test of $H_0: \theta \in \Theta_0$, but inverting that ratio and including both hypotheses in the denominator, we define the likelihood ratio as

$$\lambda(X) = \frac{\sup_{\theta \in \Theta_0} L(\theta; X)}{\sup_{\theta \in \Theta} L(\theta; X)}.$$

The test rejects $H_0$ if $\lambda(X) < c$, where $c$ is some value in $[0, 1]$. (The inequality goes in the opposite direction because we have inverted the ratio.) Tests such as this are called likelihood ratio tests.
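As a concrete instance of the definition, here is a minimal sketch for a model that is not in the slides: an i.i.d. normal sample with known standard deviation and the point null $H_0: \mu = \mu_0$. The function names and the data are hypothetical. In this case the supremum in the denominator is attained at the sample mean, so the ratio has the closed form $\exp(-n(\bar{x} - \mu_0)^2 / (2\sigma^2))$.

```python
import math


def normal_log_likelihood(mu, xs, sigma=1.0):
    """Log-likelihood of an i.i.d. N(mu, sigma^2) sample xs (sigma known)."""
    n = len(xs)
    ss = sum((x - mu) ** 2 for x in xs)
    return -0.5 * n * math.log(2.0 * math.pi * sigma ** 2) - ss / (2.0 * sigma ** 2)


def likelihood_ratio(xs, mu0, sigma=1.0):
    """lambda(x) for H0: mu = mu0.

    Numerator: likelihood at mu0 (Theta_0 is a single point).
    Denominator: likelihood at the unrestricted MLE, the sample mean.
    """
    mu_hat = sum(xs) / len(xs)
    log_lambda = (normal_log_likelihood(mu0, xs, sigma)
                  - normal_log_likelihood(mu_hat, xs, sigma))
    return math.exp(log_lambda)


sample = [0.3, -0.1, 0.8, 1.2, 0.5]  # hypothetical data
lam = likelihood_ratio(sample, mu0=0.0)
```

The value always lies in $[0, 1]$, and it equals 1 exactly when the hypothesized mean coincides with the sample mean.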

4 Likelihood Ratio Tests

We should note that there are other definitions of a likelihood ratio; in particular, in TSH3 its denominator is the sup over the alternative hypothesis. If the alternative hypothesis does not specify $\Theta - \Theta_0$, such a definition requires specification of both $H_0$ and $H_1$, whereas the definition above requires specification only of $H_0$.

The likelihood ratio may not exist, but if it is well defined, clearly it is in the interval $[0, 1]$; values close to 1 provide evidence that the null hypothesis is true, and values close to 0 provide evidence that it is false.

5 Asymptotic Likelihood Ratio Tests

Some of the most important properties of LR tests are asymptotic ones. There are various ways of using the likelihood to build practical tests. Some are asymptotic tests that use MLEs (or RLEs).

6 Asymptotic Properties of Tests

For use of asymptotic approximations for hypothesis testing, we need a concept of asymptotic significance. We assume a family of distributions $\mathcal{P}$ and a sequence of statistics $\{\delta_n\}$ based on a random sample $X_1, \ldots, X_n$.

In hypothesis testing, the standard setup is that we have an observable random variable with a distribution in the family $\mathcal{P}$. Our hypotheses concern a specific member $P \in \mathcal{P}$. We want to test

$$H_0: P \in \mathcal{P}_0 \quad \text{versus} \quad H_1: P \in \mathcal{P}_1,$$

where $\mathcal{P}_0 \subset \mathcal{P}$, $\mathcal{P}_1 \subset \mathcal{P}$, and $\mathcal{P}_0 \cap \mathcal{P}_1 = \emptyset$.

7 Asymptotic Properties of Tests

Letting $\beta(\delta_n, P) = \Pr(\delta_n = 1)$, we define $\limsup_n \beta(\delta_n, P)\ \forall P \in \mathcal{P}_0$, if it exists, as the asymptotic size of the test.

If $\limsup_n \beta(\delta_n, P) \le \alpha\ \forall P \in \mathcal{P}_0$, then $\alpha$ is an asymptotic significance level of the test.

$\delta_n$ is consistent for the test iff $\limsup_n (1 - \beta(\delta_n, P)) = 0\ \forall P \in \mathcal{P}_1$; that is, the probability of a type II error goes to 0.

$\delta_n$ is Chernoff-consistent for the test iff $\delta_n$ is consistent and, furthermore, $\limsup_n \beta(\delta_n, P) = 0\ \forall P \in \mathcal{P}_0$.

The asymptotic distribution of a maximum of a likelihood is a chi-squared and the ratio of two is asymptotically an F.

8 Regularity Conditions

The interesting asymptotic properties of LR tests depend on the Le Cam regularity conditions, which go slightly beyond the Fisher information regularity conditions. These are the conditions that ensure that superefficiency can only occur over a set of Lebesgue measure 0 (Shao Theorem 4.16), the asymptotic efficiency of RLEs (Shao Theorem 4.17), and the chi-squared asymptotic significance of LR tests (Shao Theorem 6.5).

9 Asymptotic Significance of LR Tests

We consider a general form of the null hypothesis,

$$H_0: R(\theta) = 0,$$

versus the alternative

$$H_1: R(\theta) \ne 0,$$

for a continuously differentiable function $R(\theta)$ from $\mathbb{R}^k$ to $\mathbb{R}^r$. (Shao's notation, $H_0: \theta = g(\vartheta)$ where $\vartheta$ is a $(k - r)$-vector, although slightly different, is equivalent.)

10 Asymptotic Significance of LR Tests

The key result is Theorem 6.5 in Shao, which, assuming the Le Cam regularity conditions, says that under $H_0$,

$$-2 \log(\lambda_n) \stackrel{d}{\to} \chi^2_r,$$

where $\chi^2_r$ is a random variable with a chi-squared distribution with $r$ degrees of freedom and $r$ is the number of elements in $R(\theta)$. (In the simple case, $r$ is the number of equations in the null hypothesis.)

This allows us to determine the asymptotic significance of an LR test. It is also the basis for constructing asymptotically correct confidence sets.
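To show how the chi-squared limit is used in practice, here is a minimal sketch for a model chosen only for illustration: an i.i.d. Poisson sample with the single-equation null $H_0$: rate $=$ rate0, so $r = 1$. The function names and the counts are hypothetical. With one degree of freedom the chi-squared tail probability reduces to a normal tail, which the standard library provides through `math.erfc`.

```python
import math


def poisson_lr_statistic(counts, rate0):
    """-2 log(lambda_n) for H0: rate = rate0 in an i.i.d. Poisson sample."""
    n = len(counts)
    total = sum(counts)
    rate_hat = total / n  # unrestricted MLE
    if total == 0:
        return 2.0 * n * rate0  # limiting value when the MLE is 0
    return 2.0 * (total * math.log(rate_hat / rate0) - n * (rate_hat - rate0))


def chi2_sf_1df(x):
    """Pr(chi^2_1 > x), computed from the normal tail as erfc(sqrt(x / 2))."""
    return math.erfc(math.sqrt(x / 2.0))


counts = [2, 0, 3, 1, 4, 2, 1, 3]  # hypothetical data
stat = poisson_lr_statistic(counts, rate0=1.0)
p_value = chi2_sf_1df(stat)  # asymptotic p-value with r = 1
```

The p-value is only asymptotically valid; for a sample this small it should be read as a rough guide.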

11 Wald Tests and Score Tests

There are two types of tests that arise from likelihood ratio tests. These are called Wald tests and score tests. Score tests are also called Rao tests or Lagrange multiplier tests.

These tests are asymptotically equivalent. They are consistent under the Le Cam regularity conditions, and they are Chernoff-consistent if $\alpha = \alpha_n$ is chosen so that, as $n \to \infty$, $\alpha_n \to 0$ and $\chi^2_{r,\alpha_n} = o(n)$.

12 Wald Tests

The Wald test uses the test statistic

$$W_n = \left(R(\hat\theta)\right)^T \left( \left(S(\hat\theta)\right)^T \left(I_n(\hat\theta)\right)^{-1} S(\hat\theta) \right)^{-1} R(\hat\theta),$$

where $S(\theta) = \partial R(\theta)/\partial\theta$ and $I_n(\theta)$ is the Fisher information matrix, and these two quantities are evaluated at an MLE or RLE $\hat\theta$. The test rejects the null hypothesis when this value is large.

Notice that for the simple hypothesis $H_0: \theta = \theta_0$, this simplifies to

$$(\hat\theta - \theta_0)^T I_n(\hat\theta) (\hat\theta - \theta_0).$$

13 Wald Tests

An asymptotic test can be constructed because $W_n \stackrel{d}{\to} Y$, where $Y \sim \chi^2_r$ and $r$ is the number of elements in $R(\theta)$. (This is proved in Theorem 6.6 of Shao, page 434.)

The test rejects at the $\alpha$ level if $W_n > \chi^2_{r,1-\alpha}$, where $\chi^2_{r,1-\alpha}$ is the $1 - \alpha$ quantile of the chi-squared distribution with $r$ degrees of freedom. (Note that Shao denotes this quantity as $\chi^2_{r,\alpha}$.)
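For a scalar parameter the Wald statistic is easy to compute by hand. The following is a minimal sketch for a case not treated in the slides: $n$ Bernoulli trials with the simple null $H_0: p = p_0$, where the Fisher information is $I_n(p) = n / (p(1 - p))$. The function name and the data are hypothetical; the constant is the 0.95 quantile of $\chi^2_1$.

```python
CHI2_1_QUANTILE_095 = 3.841458820694124  # 0.95 quantile of chi-squared, 1 df


def wald_statistic_bernoulli(successes, n, p0):
    """(p_hat - p0)^2 * I_n(p_hat), the simple-hypothesis form of W_n."""
    p_hat = successes / n
    if p_hat <= 0.0 or p_hat >= 1.0:
        raise ValueError("Fisher information is undefined at p_hat in {0, 1}")
    fisher_info = n / (p_hat * (1.0 - p_hat))  # evaluated at the MLE
    return (p_hat - p0) ** 2 * fisher_info


w = wald_statistic_bernoulli(successes=62, n=100, p0=0.5)  # hypothetical data
reject = w > CHI2_1_QUANTILE_095  # asymptotic level-0.05 test
```

Note that the information is evaluated at the unrestricted estimate, which is what distinguishes the Wald statistic from the score statistic.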

14 Score Tests

A related test is the Rao score test, sometimes called a Lagrange multiplier test. It is based on an MLE or RLE $\tilde\theta$ under the restriction that $R(\theta) = 0$, and rejects $H_0$ when the following is large:

$$R_n = \left(s_n(\tilde\theta)\right)^T \left(I_n(\tilde\theta)\right)^{-1} s_n(\tilde\theta),$$

where $s_n(\theta) = \partial l_L(\theta)/\partial\theta$ is called the score function.

The information matrix can be either the Fisher information matrix (that is, the expected values of the derivatives) evaluated at the RLEs or the observed information matrix, in which the observed values are used instead of expected values.

An asymptotic test can be constructed because $R_n \stackrel{d}{\to} Y$, where $Y \sim \chi^2_r$ and $r$ is the number of elements in $R(\theta)$. This is proved in Theorem 6.6 (ii) of Shao.
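A scalar example makes the role of the restricted estimate visible. This is a minimal sketch for a case not treated in the slides: $n$ Bernoulli trials with $H_0: p = p_0$, so the restricted estimate is $p_0$ itself. The score is $(x - n p_0)/(p_0(1 - p_0))$ and the expected information is $n/(p_0(1 - p_0))$. The function name and the data are hypothetical.

```python
def score_statistic_bernoulli(successes, n, p0):
    """s_n(p0)^2 / I_n(p0), using the expected (Fisher) information at p0."""
    variance_term = p0 * (1.0 - p0)
    score = (successes - n * p0) / variance_term  # d/dp of the log-likelihood at p0
    fisher_info = n / variance_term               # evaluated under the restriction
    return score ** 2 / fisher_info


r_n = score_statistic_bernoulli(successes=62, n=100, p0=0.5)  # hypothetical data
```

Everything here is evaluated at the null value, so no unrestricted fit is needed; the statistic simplifies to $(x - n p_0)^2 / (n p_0 (1 - p_0))$.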

15 Example 1: tests in a linear model

Consider a general regression model:

$$X_i = f(z_i, \beta) + \epsilon_i, \quad \text{where } \epsilon_i \stackrel{\text{i.i.d.}}{\sim} N(0, \sigma^2).$$

For a given $r \times k$ matrix $L$, we want to test

$$H_0: L\beta = \beta_0.$$

Let $X$ be the sample (it's an $n$-vector). Let $Z$ be the matrix whose rows are the $z_i$.

The log likelihood is

$$\log l(\beta; X) = c(\sigma^2) - \frac{1}{2\sigma^2} (X - f(z, \beta))^T (X - f(z, \beta)).$$

The MLE is the LSE, $\hat\beta$.

16

Let $\tilde\beta$ be the maximizer of the log likelihood under the restriction $L\beta = \beta_0$.

The log of the likelihood ratio is the difference in the log likelihoods.

The maximum of the unrestricted log likelihood (minus a constant) corresponds to the minimum of the residual sum of squares:

$$-\frac{1}{2\sigma^2} (X - f(z, \hat\beta))^T (X - f(z, \hat\beta)) = -\frac{1}{2\sigma^2} \mathrm{SSE}(\hat\beta),$$

and likewise, for the restricted:

$$-\frac{1}{2\sigma^2} (X - f(z, \tilde\beta))^T (X - f(z, \tilde\beta)) = -\frac{1}{2\sigma^2} \mathrm{SSE}(\tilde\beta).$$

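17

Now, the difference,

$$\frac{\mathrm{SSE}(\tilde\beta) - \mathrm{SSE}(\hat\beta)}{\sigma^2},$$

has an asymptotic $\chi^2(r)$ distribution. (Note that the 2 goes away.) The restricted sum of squares comes first because it can be no smaller than the unrestricted one.

We also have that

$$\frac{\mathrm{SSE}(\hat\beta)}{\sigma^2}$$

has an asymptotic $\chi^2(n - k)$ distribution.

So for the likelihood ratio test we get an F-type statistic:

$$\frac{(\mathrm{SSE}(\tilde\beta) - \mathrm{SSE}(\hat\beta))/r}{\mathrm{SSE}(\hat\beta)/(n - k)}.$$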
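The F-type statistic can be illustrated with the simplest special case, sketched here under stated assumptions: $f(z_i, \beta) = \beta_1 + \beta_2 z_i$ with $k = 2$, and the restriction $\beta_2 = 0$ with $r = 1$, so the restricted fit is the intercept-only model. The function names and the data are hypothetical.

```python
def sse_intercept_only(responses):
    """SSE of the restricted fit (slope fixed at 0)."""
    mean = sum(responses) / len(responses)
    return sum((x - mean) ** 2 for x in responses)


def sse_simple_linear(covariates, responses):
    """SSE of the unrestricted least-squares line."""
    n = len(responses)
    z_bar = sum(covariates) / n
    x_bar = sum(responses) / n
    s_zz = sum((z - z_bar) ** 2 for z in covariates)
    s_zx = sum((z - z_bar) * (x - x_bar) for z, x in zip(covariates, responses))
    slope = s_zx / s_zz
    intercept = x_bar - slope * z_bar
    return sum((x - intercept - slope * z) ** 2
               for z, x in zip(covariates, responses))


def f_type_statistic(sse_restricted, sse_full, r, n, k):
    """((SSE restricted - SSE full) / r) / (SSE full / (n - k))."""
    return ((sse_restricted - sse_full) / r) / (sse_full / (n - k))


zs = [1.0, 2.0, 3.0, 4.0, 5.0]  # hypothetical covariates
xs = [1.1, 1.9, 3.2, 3.8, 5.1]  # hypothetical responses
sse_r = sse_intercept_only(xs)
sse_f = sse_simple_linear(zs, xs)
f_stat = f_type_statistic(sse_r, sse_f, r=1, n=len(xs), k=2)
```

The unknown $\sigma^2$ cancels in the ratio, which is why the statistic can be computed from the two residual sums of squares alone.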

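18

Use the unrestricted MLE $\hat\beta$ and consider $L\hat\beta - \beta_0$. We have

$$V(\hat\beta) \to \left(J^T_{f(\hat\beta)} J_{f(\hat\beta)}\right)^{-1} \sigma^2,$$

and so

$$V(L\hat\beta) \to L \left(J^T_{f(\hat\beta)} J_{f(\hat\beta)}\right)^{-1} L^T \sigma^2,$$

where $J_{f(\hat\beta)}$ is the $n \times k$ Jacobian matrix.

Hence, with $\sigma^2$ replaced by an estimate $s^2$, we can write an asymptotic $\chi^2(r)$ statistic as

$$(L\hat\beta - \beta_0)^T \left( L \left(J^T_{f(\hat\beta)} J_{f(\hat\beta)}\right)^{-1} L^T s^2 \right)^{-1} (L\hat\beta - \beta_0).$$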

19

We can form a Wishart-type statistic from this.

If $r = 1$, $L$ is just a vector (the linear combination), and we can take the square root and form a "pseudo t":

$$\frac{L^T \hat\beta - \beta_0}{s \sqrt{L^T \left(J^T_{f(\hat\beta)} J_{f(\hat\beta)}\right)^{-1} L}}.$$

Get the MLE with the restriction $L\beta = \beta_0$ using a Lagrange multiplier, $\lambda$, of length $r$. Minimize

$$\frac{1}{2\sigma^2} (X - f(z, \beta))^T (X - f(z, \beta)) + \frac{1}{\sigma^2} (L\beta - \beta_0)^T \lambda.$$

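20

To minimize, differentiate and set the result equal to 0:

$$-J^T_{f(\tilde\beta)} (X - f(z, \tilde\beta)) + L^T \lambda = 0,$$
$$L\tilde\beta - \beta_0 = 0.$$

$J^T_{f(\tilde\beta)} (X - f(z, \tilde\beta))$ is called the score vector. It is of length $k$.

Now $V(X - f(z, \tilde\beta)) \to \sigma^2 I_n$, so the variance of the score vector, and hence also of $L^T \lambda$, goes to $\sigma^2 J^T_{f(\beta)} J_{f(\beta)}$. (Note this is the true $\beta$ in this expression.)

Estimate the variance of the score vector with $\tilde\sigma^2 J^T_{f(\tilde\beta)} J_{f(\tilde\beta)}$, where $\tilde\sigma^2 = \mathrm{SSE}(\tilde\beta)/(n - k + r)$.

Hence, we use $L^T \lambda$ and its estimated variance (previous slide). Get

$$\frac{1}{\tilde\sigma^2} \lambda^T L \left(J^T_{f(\tilde\beta)} J_{f(\tilde\beta)}\right)^{-1} L^T \lambda.$$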

21

It is asymptotically $\chi^2(r)$. This is the Lagrange multiplier form.

Another form: use $J^T_{f(\tilde\beta)} (X - f(z, \tilde\beta))$ in place of $L^T \lambda$. Get

$$\frac{1}{\tilde\sigma^2} (X - f(z, \tilde\beta))^T J_{f(\tilde\beta)} \left(J^T_{f(\tilde\beta)} J_{f(\tilde\beta)}\right)^{-1} J^T_{f(\tilde\beta)} (X - f(z, \tilde\beta)).$$

This is the score form. Except for the method of computing it, it is the same as the Lagrange multiplier form.

This is the SSReg in the AOV for a regression model.

22 Example 2: an anomalous score test

Morgan, Palmer, and Ridout (2007) illustrate some interesting issues using a simple example of counts of numbers of stillbirths in each of a sample of litters of laboratory animals. They suggest that a zero-inflated Poisson is an appropriate model. This distribution is an $\omega$ mixture of a point mass at 0 and a Poisson distribution. The CDF (in a notation we will use often later) is

$$P_{0,\omega}(x \mid \lambda) = (1 - \omega) P(x \mid \lambda) + \omega I_{[0,\infty[}(x),$$

where $P(x \mid \lambda)$ is the Poisson CDF with parameter $\lambda$.

(Write the PDF (under the counting measure). Is this a reasonable probability model? What are the assumptions? Do the litter sizes matter?)

23

If we denote the number of litters in which the number of observed stillbirths is $i$ by $n_i$, the log-likelihood function is

$$l(\omega, \lambda) = n_0 \log\left(\omega + (1 - \omega) e^{-\lambda}\right) + \sum_{i=1}^{\infty} n_i \log(1 - \omega) - \sum_{i=1}^{\infty} n_i \lambda + \sum_{i=1}^{\infty} i n_i \log(\lambda).$$

Suppose we want to test the null hypothesis that $\omega = 0$.

The score test has the form

$$s^T J^{-1} s,$$

where $s$ is the score vector and $J$ is either the observed or the expected information matrix. For each we substitute $\omega = 0$ and $\lambda = \hat\lambda_0$, where $\hat\lambda_0 = \sum_{i=1}^{\infty} i n_i / n$ with $n = \sum_{i=0}^{\infty} n_i$, which is the MLE when $\omega = 0$.

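24

Let

$$n_+ = \sum_{i=1}^{\infty} n_i \quad \text{and} \quad d = \sum_{i=0}^{\infty} i n_i.$$

The frequency of 0s is important. Let $f_0 = n_0/n$.

Taking the derivatives and setting $\omega = 0$, we have

$$\frac{\partial l}{\partial \omega} = n_0 e^{\lambda} - n,$$
$$\frac{\partial l}{\partial \lambda} = -n + d/\lambda,$$
$$\frac{\partial^2 l}{\partial \omega^2} = -n - n_0 e^{2\lambda} + 2 n_0 e^{\lambda},$$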

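25 continued...

and

$$\frac{\partial^2 l}{\partial \omega \, \partial \lambda} = n_0 e^{\lambda},$$
$$\frac{\partial^2 l}{\partial \lambda^2} = -d/\lambda^2.$$

So, substituting the observed data and the restricted MLE, we have the observed information matrix

$$O(0, \hat\lambda_0) = n \begin{bmatrix} 1 + f_0 e^{2\hat\lambda_0} - 2 f_0 e^{\hat\lambda_0} & -f_0 e^{\hat\lambda_0} \\ -f_0 e^{\hat\lambda_0} & 1/\hat\lambda_0 \end{bmatrix}.$$

Now, for the expected information matrix when $\omega = 0$, we first observe that $E(n_0) = n e^{-\lambda}$, $E(d) = n\lambda$, and $E(n_+) = n(1 - e^{-\lambda})$; hence

$$I(0, \hat\lambda_0) = n \begin{bmatrix} e^{\hat\lambda_0} - 1 & -1 \\ -1 & 1/\hat\lambda_0 \end{bmatrix}.$$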

26 continued...

Hence, the score test statistic can be written as

$$\kappa(\hat\lambda_0) (n_0 e^{\hat\lambda_0} - n)^2,$$

where $\kappa(\hat\lambda_0)$ is the (1,1) element of the inverse of either $O(0, \hat\lambda_0)$ or $I(0, \hat\lambda_0)$.
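The reduction to a single squared term follows because the $\lambda$ component of the score vanishes at the restricted MLE. A short worked step, using only the derivatives and the definition $\hat\lambda_0 = d/n$ given above:

```latex
\begin{align*}
\left.\frac{\partial l}{\partial \lambda}\right|_{(0,\hat\lambda_0)}
  &= -n + \frac{d}{\hat\lambda_0} = -n + \frac{d}{d/n} = 0, \\
s &= \begin{pmatrix} n_0 e^{\hat\lambda_0} - n \\ 0 \end{pmatrix}, \\
s^T J^{-1} s
  &= \left[J^{-1}\right]_{11} \left(n_0 e^{\hat\lambda_0} - n\right)^2
   = \kappa(\hat\lambda_0)\left(n_0 e^{\hat\lambda_0} - n\right)^2 .
\end{align*}
```

Only the (1,1) element of the inverse matters, so the two versions of the test differ solely through that element.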

27

Inverting the matrices (they are $2 \times 2$), we have as the test statistic for the score test, either

$$s_I = \frac{n e^{-\hat\lambda_0} (1 - \theta)^2}{1 - e^{-\hat\lambda_0} - \hat\lambda_0 e^{-\hat\lambda_0}}$$

or

$$s_O = \frac{n e^{-\hat\lambda_0} (1 - \theta)^2}{e^{-\hat\lambda_0} + \theta - 2\theta e^{-\hat\lambda_0} - \theta^2 \hat\lambda_0 e^{-\hat\lambda_0}},$$

where $\theta = f_0 e^{\hat\lambda_0}$, which is the ratio of the observed proportion of 0 counts to the estimated probability of a zero count under the Poisson model. (If $n_0$ is actually the number expected under the Poisson model, then $\theta = 1$.)
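The two closed forms are straightforward to evaluate. The sketch below computes the expected-information and observed-information versions from a vector of frequencies. The function name is hypothetical, and the counts are invented for illustration; they are not the data of Morgan, Palmer, and Ridout.

```python
import math


def zip_score_statistics(counts):
    """Score statistics for H0: omega = 0 in the zero-inflated Poisson model.

    counts[i] is the number of litters with i stillbirths (index 0 first).
    Returns (s_expected, s_observed, theta).
    """
    n = sum(counts)
    d = sum(i * n_i for i, n_i in enumerate(counts))
    lam0 = d / n                      # restricted MLE of lambda when omega = 0
    f0 = counts[0] / n                # observed proportion of zeros
    theta = f0 * math.exp(lam0)       # observed / Poisson-fitted zero proportion
    e = math.exp(-lam0)
    numer = n * e * (1.0 - theta) ** 2
    s_expected = numer / (1.0 - e - lam0 * e)
    s_observed = numer / (e + theta - 2.0 * theta * e - theta ** 2 * lam0 * e)
    return s_expected, s_observed, theta


hypothetical_counts = [60, 25, 10, 5]  # invented frequencies, not the published data
s_i, s_o, theta = zip_score_statistics(hypothetical_counts)
```

The denominator of the observed-information version is not guaranteed to be positive, so it is worth checking its sign before interpreting that statistic.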

28

Now consider the actual data reported by Morgan, Palmer, and Ridout (2007) for stillbirths in each litter of a sample of 402 litters of laboratory animals.

[Table: No. stillbirths versus No. litters; the entries are missing from this copy.]

For these data, we have $n = 402$, $d = 185$, $\hat\lambda_0 =$ [value missing], $e^{-\hat\lambda_0} =$ [value missing], and $\theta =$ [value missing].

What is interesting is the difference in $s_I$ and $s_O$.

In this particular example, if all $n_i$ for $i \ge 1$ are held constant at the observed values, but different values of $n_0$ are considered, as $n_0$ increases the ratio $s_I/s_O$ increases from about 1/4 to 1 (when $n_0$ is the expected number under the Poisson model; i.e., $\theta = 1$), and then decreases, actually becoming negative (around $n_0 = 100$).

29

This example illustrates an interesting case. The score test is inconsistent because the observed information generates negative variance estimates at the MLE under the null hypothesis.

The score test can also be inconsistent if the expected likelihood equation has spurious roots.

30 Sequential Tests

In the simplest formulation of statistical hypothesis testing, corresponding to the setup of the Neyman-Pearson lemma, we test a given hypothesized distribution versus another given distribution.

After setting some ground rules regarding the probability of falsely rejecting the null hypothesis, and then determining the optimal test in the case of simple hypotheses, we determined more general optimal tests in cases for which they exist, and for other cases, we determined optimal tests among classes of tests that had certain desirable properties.

In some cases, the tests involved regions within the sample space in which the decision between the two hypotheses was made randomly; that is, based on a random process over and above the randomness of the distributions of interest.

31

Another logical approach to take when the data generated by the process of interest do not lead to a clear decision is to decide to take more observations. Recognizing at the outset that this is a possibility, we may decide to design the test as a sequential procedure.

We take a small number of observations, and if the evidence is strong enough either to accept the null hypothesis or to reject it, the test is complete and we make the appropriate decision. On the other hand, if the evidence from the small sample is not strong enough, we take some additional observations and perform the test again. We repeat these steps as necessary.

32 Multiple Tests

There are several ways to measure the error rate. Let $m$ be the number of tests, $F$ the number of false positives, $T$ the number of true positives, and $S = F + T$ the total number of discoveries, or rejected null hypotheses. Three error measures are the per-comparison error rate, the family-wise error rate, and the false discovery rate:

$$\mathrm{PCER} = E(F)/m; \quad \mathrm{FWER} = \Pr(F \ge 1); \quad \mathrm{FDR} = E(F/S).$$

33

The Benjamini-Hochberg (BH) method for controlling FDR works as follows. First, order the $m$ p-values from the tests: $p_1 \le \cdots \le p_m$. Then determine a threshold value for rejection by finding the largest integer $j$ such that $p_j \le j\alpha/m$. Finally, reject any hypothesis whose p-value is smaller than or equal to $p_j$. Benjamini and Hochberg (1995) prove that this procedure is guaranteed to force $\mathrm{FDR} \le \alpha$.

Storey (2003) proposed use of the proportion of false positives for any hypothesis (feature) incurred, on average, when that feature defines the threshold value; this quantity is the q-value. The q-value can be calculated for each feature under investigation.
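The three BH steps translate directly into code. The following is a minimal sketch, with a hypothetical function name and invented p-values, that returns one rejection flag per hypothesis in the original input order.

```python
def benjamini_hochberg(p_values, alpha):
    """Return a list of booleans, True where the hypothesis is rejected."""
    m = len(p_values)
    # Indices of the hypotheses, sorted by increasing p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Largest rank j (1-based) with p_(j) <= j * alpha / m; 0 if none.
    largest_j = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * alpha / m:
            largest_j = rank
    # Reject every hypothesis ranked at or below j.
    rejected = [False] * m
    for idx in order[:largest_j]:
        rejected[idx] = True
    return rejected


p_vals = [0.025, 0.039, 0.029, 0.005, 0.5]  # invented p-values
flags = benjamini_hochberg(p_vals, alpha=0.05)
```

In this example the p-value 0.025 exceeds its own threshold $2\alpha/m = 0.02$ but is still rejected, because larger p-values further down the ordering meet theirs; the threshold is set by the largest qualifying rank, not the first failure.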
