Lecture 17: Likelihood ratio and asymptotic tests

Likelihood ratio

When both $H_0$ and $H_1$ are simple (i.e., $\Theta_0 = \{\theta_0\}$ and $\Theta_1 = \{\theta_1\}$), Theorem 6.1 applies and a UMP test rejects $H_0$ when $f_{\theta_1}(X)/f_{\theta_0}(X) > c_0$ for some $c_0 > 0$. The following definition is a natural extension of this idea.

Definition 6.2
Let $\ell(\theta) = f_\theta(X)$ be the likelihood function. For testing $H_0: \theta \in \Theta_0$ versus $H_1: \theta \in \Theta_1$, a likelihood ratio (LR) test is any test that rejects $H_0$ if and only if $\lambda(X) < c$, where $c \in [0,1]$ and $\lambda(X)$ is the likelihood ratio defined by
$$\lambda(x) = \sup_{\theta \in \Theta_0} \ell(\theta) \Big/ \sup_{\theta \in \Theta} \ell(\theta).$$

UW-Madison (Statistics), Stat 710, Lecture 17, Jan 2019
Discussions

- If $\lambda(x)$ is well defined, then $\lambda(x) \le 1$.
- The rationale behind LR tests is that when $H_0$ is true, $\lambda(x)$ tends to be close to 1, whereas when $H_1$ is true, $\lambda(x)$ tends to be away from 1.
- If there is a sufficient statistic, then $\lambda(x)$ depends only on the sufficient statistic.
- LR tests are as widely applicable as MLE's in §4.4 and, in fact, they are closely related to MLE's. If $\hat\theta$ is an MLE of $\theta$ and $\hat\theta_0$ is an MLE of $\theta$ subject to $\theta \in \Theta_0$ (i.e., $\Theta_0$ is treated as the parameter space), then
$$\lambda(x) = \ell(\hat\theta_0)\big/\ell(\hat\theta).$$
- For a given $\alpha \in (0,1)$, if there exists a $c_\alpha \in [0,1]$ such that
$$\sup_{\theta \in \Theta_0} P_\theta\big(\lambda(X) < c_\alpha\big) = \alpha,$$
then an LR test of size $\alpha$ can be obtained.
- Even when the c.d.f. of $\lambda(X)$ is continuous or randomized LR tests are introduced, it is still possible that such a $c_\alpha$ does not exist.
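The relation $\lambda(x) = \ell(\hat\theta_0)/\ell(\hat\theta)$ makes $\lambda$ easy to compute whenever both MLEs are available. As an illustration not taken from the lecture, the following sketch (with invented numbers) computes $\lambda$ for an assumed Binomial$(n, p)$ model and $H_0: p \le p_0$, where the restricted MLE is simply $\min(\hat p, p_0)$:

```python
import math

def loglik(p, y, n):
    # Binomial log-likelihood (additive constant dropped)
    if p <= 0.0 or p >= 1.0:
        return float("-inf") if 0 < y < n else 0.0
    return y * math.log(p) + (n - y) * math.log(1 - p)

def lr_statistic(y, n, p0):
    """lambda(x) = sup_{p <= p0} l(p) / sup_p l(p) for H0: p <= p0."""
    p_hat = y / n              # unrestricted MLE
    p_tilde = min(p_hat, p0)   # MLE restricted to Theta_0 = [0, p0]
    return math.exp(loglik(p_tilde, y, n) - loglik(p_hat, y, n))

print(lr_statistic(3, 10, 0.5))   # p_hat = 0.3 satisfies H0, so lambda = 1
print(lr_statistic(8, 10, 0.5))   # p_hat = 0.8 > 0.5, so lambda < 1
```

As the two calls show, $\lambda = 1$ whenever the unrestricted MLE already lies in $\Theta_0$, in line with the first bullet above.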
Optimality

When a UMP or UMPU test exists, an LR test is often the same as this optimal test.

Proposition 6.5
Suppose that $X$ has a p.d.f. in a one-parameter exponential family:
$$f_\theta(x) = \exp\{\eta(\theta)Y(x) - \xi(\theta)\}h(x)$$
w.r.t. a $\sigma$-finite measure $\nu$, where $\eta$ is a strictly increasing and differentiable function of $\theta$.
(i) For testing $H_0: \theta \le \theta_0$ versus $H_1: \theta > \theta_0$, there is an LR test whose rejection region is the same as that of the UMP test $T_*$ given in Theorem 6.2.
(ii) For testing $H_0: \theta \le \theta_1$ or $\theta \ge \theta_2$ versus $H_1: \theta_1 < \theta < \theta_2$, there is an LR test whose rejection region is the same as that of the UMP test $T_*$ given in Theorem 6.3.
(iii) For testing the other two-sided hypotheses, there is an LR test whose rejection region is equivalent to $Y(X) < c_1$ or $Y(X) > c_2$ for some constants $c_1$ and $c_2$.
Proof
We prove (i) only. Let $\hat\theta$ be the MLE of $\theta$. Note that $\ell(\theta)$ is increasing when $\theta \le \hat\theta$ and decreasing when $\theta > \hat\theta$. Thus,
$$\lambda(x) = \begin{cases} 1 & \hat\theta \le \theta_0 \\ \ell(\theta_0)/\ell(\hat\theta) & \hat\theta > \theta_0. \end{cases}$$
Then $\lambda(x) < c$ is the same as $\hat\theta > \theta_0$ and $\ell(\theta_0)/\ell(\hat\theta) < c$.
From the property of exponential families, $\hat\theta$ is a solution of the likelihood equation
$$\frac{\partial \log \ell(\theta)}{\partial \theta} = \eta'(\theta)Y(X) - \xi'(\theta) = 0$$
and $\psi(\theta) = \xi'(\theta)/\eta'(\theta)$ has a positive derivative $\psi'(\theta)$.
Since $\eta'(\hat\theta)Y - \xi'(\hat\theta) = 0$, $\hat\theta$ is an increasing function of $Y$ and $d\hat\theta/dY > 0$. Consequently, for any $\theta_0 \in \Theta$,
$$\begin{aligned}
\frac{d}{dY}\big[\log \ell(\hat\theta) - \log \ell(\theta_0)\big]
&= \frac{d}{dY}\big[\eta(\hat\theta)Y - \xi(\hat\theta) - \eta(\theta_0)Y + \xi(\theta_0)\big] \\
&= \frac{d\hat\theta}{dY}\,\eta'(\hat\theta)Y + \eta(\hat\theta) - \frac{d\hat\theta}{dY}\,\xi'(\hat\theta) - \eta(\theta_0) \\
&= \frac{d\hat\theta}{dY}\big[\eta'(\hat\theta)Y - \xi'(\hat\theta)\big] + \eta(\hat\theta) - \eta(\theta_0) \\
&= \eta(\hat\theta) - \eta(\theta_0),
\end{aligned}$$
which is positive (or negative) if $\hat\theta > \theta_0$ (or $\hat\theta < \theta_0$), i.e., $\log \ell(\hat\theta) - \log \ell(\theta_0)$ is strictly increasing in $Y$ when $\hat\theta > \theta_0$ and strictly decreasing in $Y$ when $\hat\theta < \theta_0$. Hence, for any $c \in (0,1)$, "$\hat\theta > \theta_0$ and $\ell(\theta_0)/\ell(\hat\theta) < c$" is equivalent to $Y > d$ for some $d \in \mathbb{R}$.

Example 6.20
Consider the testing problem $H_0: \theta = \theta_0$ versus $H_1: \theta \neq \theta_0$ based on i.i.d. $X_1,\dots,X_n$ from the uniform distribution $U(0,\theta)$. We now show that the UMP test with rejection region $X_{(n)} > \theta_0$ or $X_{(n)} \le \theta_0\alpha^{1/n}$ given in Exercise 19(c) is an LR test.
Note that $\ell(\theta) = \theta^{-n} I_{(X_{(n)},\,\infty)}(\theta)$. Hence
$$\lambda(x) = \begin{cases} (X_{(n)}/\theta_0)^n & X_{(n)} \le \theta_0 \\ 0 & X_{(n)} > \theta_0 \end{cases}$$
and $\lambda(x) < c$ is equivalent to $X_{(n)} > \theta_0$ or $X_{(n)}/\theta_0 < c^{1/n}$. Taking $c = \alpha$ ensures that the LR test has size $\alpha$.

Example 6.21
Consider the normal linear model $X \sim N_n(Z\beta, \sigma^2 I_n)$ and the hypotheses $H_0: L\beta = 0$ versus $H_1: L\beta \neq 0$, where $L$ is an $s \times p$ matrix of rank $s \le r$ and all rows of $L$ are in $\mathcal{R}(Z)$. The likelihood function in this problem is
$$\ell(\theta) = \left(\frac{1}{2\pi\sigma^2}\right)^{n/2} \exp\left\{-\frac{1}{2\sigma^2}\|X - Z\beta\|^2\right\}, \qquad \theta = (\beta, \sigma^2).$$
Since $\|X - Z\beta\|^2 \ge \|X - Z\hat\beta\|^2$ for any $\beta$ and the LSE $\hat\beta$,
$$\ell(\theta) \le \left(\frac{1}{2\pi\sigma^2}\right)^{n/2} \exp\left\{-\frac{1}{2\sigma^2}\|X - Z\hat\beta\|^2\right\}.$$
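As a quick numerical check, not part of the lecture, a Monte Carlo sketch can confirm that with $c = \alpha$ the rejection region in Example 6.20 has size $\alpha$ under $\theta = \theta_0$; the constants $\theta_0$, $\alpha$, and $n$ below are arbitrary choices:

```python
import random

def lr_reject(xs, theta0, alpha):
    """LR test of Example 6.20: reject iff X_(n) > theta0
    or X_(n) <= theta0 * alpha**(1/n)."""
    n = len(xs)
    x_max = max(xs)
    return x_max > theta0 or x_max <= theta0 * alpha ** (1 / n)

random.seed(0)
theta0, alpha, n = 1.0, 0.05, 5
reps = 200_000
rejections = sum(
    lr_reject([random.uniform(0, theta0) for _ in range(n)], theta0, alpha)
    for _ in range(reps)
)
print(rejections / reps)  # empirical size, close to alpha = 0.05
```

Under $\theta = \theta_0$ the event $X_{(n)} > \theta_0$ has probability zero, and $P(X_{(n)} \le \theta_0\alpha^{1/n}) = \alpha$ exactly, which is what the simulation reflects.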
Treating the right-hand side of this expression as a function of $\sigma^2$, it is easy to show that it has a maximum at $\sigma^2 = \hat\sigma^2 = \|X - Z\hat\beta\|^2/n$, and
$$\sup_{\theta \in \Theta} \ell(\theta) = (2\pi\hat\sigma^2)^{-n/2} e^{-n/2}.$$
Similarly, let $\hat\beta_{H_0}$ be the LSE under $H_0$ and $\hat\sigma^2_{H_0} = \|X - Z\hat\beta_{H_0}\|^2/n$; then
$$\sup_{\theta \in \Theta_0} \ell(\theta) = (2\pi\hat\sigma^2_{H_0})^{-n/2} e^{-n/2}.$$
Thus,
$$\lambda(x) = \left(\hat\sigma^2/\hat\sigma^2_{H_0}\right)^{n/2} = \left(\frac{\|X - Z\hat\beta\|^2}{\|X - Z\hat\beta_{H_0}\|^2}\right)^{n/2}.$$
For a two-sample problem, we let $n = n_1 + n_2$, $\beta = (\mu_1, \mu_2)$, and
$$Z = \begin{pmatrix} J_{n_1} & 0 \\ 0 & J_{n_2} \end{pmatrix}.$$
Testing $H_0: \mu_1 = \mu_2$ versus $H_1: \mu_1 \neq \mu_2$ is the same as testing $H_0: L\beta = 0$ versus $H_1: L\beta \neq 0$ with $L = (1\ \ {-1})$. The LR test is the same as the two-sample two-sided t-tests in §6.2.3.
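To make the connection with the t-test concrete, here is a hedged sketch (the data and helper name are invented) that computes $\lambda$ for the two-sample problem and checks the exact identity $\lambda = (1 + t^2/(n-2))^{-n/2}$; since $\lambda$ is a decreasing function of $|t|$, the LR test and the two-sided t-test share the same rejection region:

```python
import math

def two_sample_lr_and_t(x1, x2):
    """Compute lambda(x) = (SSE/SSE_H0)^{n/2} from Example 6.21
    and the pooled two-sample t statistic."""
    n1, n2 = len(x1), len(x2)
    n = n1 + n2
    m1 = sum(x1) / n1
    m2 = sum(x2) / n2
    sse = sum((v - m1) ** 2 for v in x1) + sum((v - m2) ** 2 for v in x2)
    m = (sum(x1) + sum(x2)) / n                 # grand mean under H0
    sse_h0 = sum((v - m) ** 2 for v in x1 + x2)
    lam = (sse / sse_h0) ** (n / 2)
    sp2 = sse / (n - 2)                         # pooled variance estimate
    t = (m1 - m2) / math.sqrt(sp2 * (1 / n1 + 1 / n2))
    return lam, t

x1 = [1.2, 0.8, 1.5, 1.1]
x2 = [2.0, 1.7, 2.3, 1.9]
lam, t = two_sample_lr_and_t(x1, x2)
n = len(x1) + len(x2)
# lambda is a monotone function of |t|: lambda = (1 + t^2/(n-2))^{-n/2}
print(abs(lam - (1 + t ** 2 / (n - 2)) ** (-n / 2)) < 1e-9)  # True
```

The identity follows from $\mathrm{SSE}_{H_0} = \mathrm{SSE} + \frac{n_1 n_2}{n}(\bar x_1 - \bar x_2)^2$, a standard decomposition not derived on the slide.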
Example: Exercise 6.84
Let $F$ and $G$ be two known cumulative distribution functions on $\mathbb{R}$ and let $X$ be a single observation from the cumulative distribution function $\theta F(x) + (1-\theta)G(x)$, where $\theta \in [0,1]$ is unknown.
We first derive the likelihood ratio $\lambda(X)$ for testing $H_0: \theta \le \theta_0$ versus $H_1: \theta > \theta_0$, where $\theta_0 \in [0,1)$ is a known constant.
Let $f$ and $g$ be the probability densities of $F$ and $G$, respectively, with respect to the measure corresponding to $F + G$. Then, the likelihood function is
$$\ell(\theta) = \theta[f(X) - g(X)] + g(X)$$
and
$$\sup_{0 \le \theta \le 1} \ell(\theta) = \begin{cases} f(X) & f(X) \ge g(X) \\ g(X) & f(X) < g(X). \end{cases}$$
For $\theta_0 \in [0,1)$,
$$\sup_{0 \le \theta \le \theta_0} \ell(\theta) = \begin{cases} \theta_0[f(X) - g(X)] + g(X) & f(X) \ge g(X) \\ g(X) & f(X) < g(X). \end{cases}$$
Hence,
$$\lambda(X) = \begin{cases} \dfrac{\theta_0[f(X) - g(X)] + g(X)}{f(X)} & f(X) \ge g(X) \\ 1 & f(X) < g(X). \end{cases}$$
Choose a constant $c$ with $\theta_0 \le c < 1$. Then $\lambda(X) \le c$ is the same as
$$\frac{g(X)}{f(X)} \le \frac{c - \theta_0}{1 - \theta_0}.$$
We may find a $c$ with $P(\lambda(X) \le c) = \alpha$ when $\theta = \theta_0$.
Consider next $H_0: \theta_1 \le \theta \le \theta_2$ versus $H_1: \theta < \theta_1$ or $\theta > \theta_2$, where $0 \le \theta_1 < \theta_2 \le 1$ are known constants. For such $\theta_1$ and $\theta_2$,
$$\sup_{\theta_1 \le \theta \le \theta_2} \ell(\theta) = \begin{cases} \theta_2[f(X) - g(X)] + g(X) & f(X) \ge g(X) \\ \theta_1[f(X) - g(X)] + g(X) & f(X) < g(X). \end{cases}$$
Hence,
$$\lambda(X) = \begin{cases} \dfrac{\theta_2[f(X) - g(X)] + g(X)}{f(X)} & f(X) \ge g(X) \\ \dfrac{\theta_1[f(X) - g(X)] + g(X)}{g(X)} & f(X) < g(X). \end{cases}$$
Choose a constant $c$ with $\max\{1 - \theta_1,\, \theta_2\} \le c < 1$. Then $\lambda(X) \le c$ is the same as
$$\frac{g(X)}{f(X)} \le \frac{c - \theta_2}{1 - \theta_2} \quad \text{or} \quad \frac{g(X)}{f(X)} \ge \frac{\theta_1}{c - (1 - \theta_1)}.$$
How do we find a $c$ with $\sup_{\theta_1 \le \theta \le \theta_2} P(\lambda(X) \le c) = \alpha$?
Finally, consider $H_0: \theta \le \theta_1$ or $\theta \ge \theta_2$ versus $H_1: \theta_1 < \theta < \theta_2$, where $0 \le \theta_1 < \theta_2 \le 1$ are known constants. Since $\ell(\theta)$ is linear in $\theta$, its supremum over $[0,1]$ is attained at $\theta = 0$ or $\theta = 1$, both of which lie in $\Theta_0$; hence
$$\sup_{0 \le \theta \le \theta_1 \text{ or } \theta_2 \le \theta \le 1} \ell(\theta) = \sup_{0 \le \theta \le 1} \ell(\theta)$$
and $\lambda(X) \equiv 1$. This means that, unless we consider randomization, we cannot find a $c$ such that $\sup_{\theta \le \theta_1 \text{ or } \theta \ge \theta_2} P(\lambda(X) \le c) = \alpha$.
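A small sketch, not part of the exercise, evaluates $\lambda(X)$ for the first (one-sided) problem $H_0: \theta \le \theta_0$ under the assumed choices $F = N(0,1)$ and $G = N(1,1)$. Since $\lambda$ depends on $f$ and $g$ only through the ratio $g(X)/f(X)$, the ordinary normal densities can be used in place of the densities with respect to $F + G$:

```python
import math

def phi(x, mu=0.0):
    # N(mu, 1) density; an assumed choice for illustration only
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2 * math.pi)

def lr_mixture(x, theta0, mu_f=0.0, mu_g=1.0):
    """lambda(X) for H0: theta <= theta0 in the mixture model
    theta*F + (1-theta)*G, with F, G taken as N(mu_f,1), N(mu_g,1)."""
    f, g = phi(x, mu_f), phi(x, mu_g)
    if f >= g:
        return (theta0 * (f - g) + g) / f
    return 1.0

print(lr_mixture(-1.0, 0.3))  # f(X) > g(X) here, so lambda < 1
print(lr_mixture(2.0, 0.3))   # f(X) < g(X) here, so lambda = 1
```

Observations favoring $F$ (small $x$ here) can push $\lambda$ below a cutoff $c$, while observations favoring $G$ give $\lambda = 1$ and can never lead to rejection.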
It is often difficult to construct a test with exactly size $\alpha$ or level $\alpha$. Tests whose rejection regions are constructed using asymptotic theory (so that these tests have asymptotic level $\alpha$) are called asymptotic tests; they are useful when a test of exact size $\alpha$ is difficult to find.

Definition 2.13 (asymptotic tests)
Let $X = (X_1,\dots,X_n)$ be a sample from $P \in \mathcal{P}$ and $T_n(X)$ be a test for $H_0: P \in \mathcal{P}_0$ versus $H_1: P \in \mathcal{P}_1$.
(i) If $\limsup_n \alpha_{T_n}(P) \le \alpha$ for any $P \in \mathcal{P}_0$, then $\alpha$ is an asymptotic significance level of $T_n$.
(ii) If $\lim_n \sup_{P \in \mathcal{P}_0} \alpha_{T_n}(P)$ exists, it is called the limiting size of $T_n$.
(iii) $T_n$ is consistent iff the type II error probability converges to 0.

If $\mathcal{P}_0$ is not a parametric family, the limiting size of $T_n$ may be 1. This is the reason why we consider the weaker requirement in (i).
If $\alpha \in (0,1)$ is a pre-assigned level of significance for the problem, then a consistent test $T_n$ having asymptotic significance level $\alpha$ is called asymptotically correct, and a consistent test having limiting size $\alpha$ is called strongly asymptotically correct.
In the i.i.d. case we can obtain the asymptotic distribution (under $H_0$) of the likelihood ratio $\lambda(X)$, so that an LR test having asymptotic significance level $\alpha$ can be obtained.

Theorem 6.5 (asymptotic distribution of the likelihood ratio)
Assume the conditions in Theorem 4.16. Suppose that $H_0: \theta = g(\vartheta)$, where $\vartheta$ is a $(k-r)$-vector of unknown parameters and $g$ is a continuously differentiable function from $\mathbb{R}^{k-r}$ to $\mathbb{R}^k$ with a full rank $\partial g(\vartheta)/\partial \vartheta$. Under $H_0$,
$$-2\log\lambda_n \to_d \chi^2_r,$$
where $\lambda_n = \lambda(X)$ and $\chi^2_r$ is a random variable having the chi-square distribution $\chi^2_r$.
Consequently, the LR test with rejection region $\lambda_n < e^{-\chi^2_{r,\alpha}/2}$ has asymptotic significance level $\alpha$, where $\chi^2_{r,\alpha}$ is the $(1-\alpha)$th quantile of the chi-square distribution $\chi^2_r$.
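A quick Monte Carlo sketch, an assumed example rather than part of the lecture, illustrates Theorem 6.5 in the simplest setting: $X_1,\dots,X_n$ i.i.d. $N(\mu,1)$ with $H_0: \mu = 0$, so $k = r = 1$. Here $\lambda_n = \exp(-n\bar X^2/2)$, so $-2\log\lambda_n = n\bar X^2$, which is $\chi^2_1$ exactly, not just asymptotically:

```python
import random

# Simulate -2 log(lambda_n) = n * xbar^2 under H0: mu = 0 and compare
# its empirical behavior with the chi^2_1 distribution (r = 1).
random.seed(1)
n, reps = 20, 40_000
stats = []
for _ in range(reps):
    xbar = sum(random.gauss(0, 1) for _ in range(n)) / n
    stats.append(n * xbar ** 2)

print(sum(stats) / reps)                     # near 1, the chi^2_1 mean
print(sum(s > 3.841 for s in stats) / reps)  # near 0.05; 3.841 ~ chi^2_{1,0.05}
```

The rejection region $-2\log\lambda_n > \chi^2_{1,\alpha}$ (equivalently $\lambda_n < e^{-\chi^2_{1,\alpha}/2}$) is thus seen to reject about $100\alpha\%$ of the time under $H_0$.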
Proof
Without loss of generality, we assume that there exist an MLE $\hat\theta$ and an MLE $\hat\vartheta$ under $H_0$ such that
$$\lambda_n = \sup_{\theta \in \Theta_0} \ell(\theta) \Big/ \sup_{\theta \in \Theta} \ell(\theta) = \ell(g(\hat\vartheta)) \big/ \ell(\hat\theta).$$
Let $s_n(\theta) = \partial \log \ell(\theta)/\partial \theta$ and let $I_1(\theta)$ be the Fisher information about $\theta$ contained in $X_1$. Following the proof of Theorem 4.17 in §4.5.2, we can obtain
$$\sqrt{n}\, I_1(\theta)(\hat\theta - \theta) = n^{-1/2} s_n(\theta) + o_p(1)$$
and
$$2[\log \ell(\hat\theta) - \log \ell(\theta)] = n(\hat\theta - \theta)^\tau I_1(\theta)(\hat\theta - \theta) + o_p(1).$$
Then
$$2[\log \ell(\hat\theta) - \log \ell(\theta)] = n^{-1}[s_n(\theta)]^\tau [I_1(\theta)]^{-1} s_n(\theta) + o_p(1).$$
Similarly, under $H_0$,
$$2[\log \ell(g(\hat\vartheta)) - \log \ell(g(\vartheta))] = n^{-1}[\tilde s_n(\vartheta)]^\tau [\tilde I_1(\vartheta)]^{-1} \tilde s_n(\vartheta) + o_p(1),$$
where $\tilde s_n(\vartheta) = \partial \log \ell(g(\vartheta))/\partial \vartheta = D(\vartheta) s_n(g(\vartheta))$, $D(\vartheta) = \partial g(\vartheta)/\partial \vartheta$, and $\tilde I_1(\vartheta)$ is the Fisher information about $\vartheta$ (under $H_0$) contained in $X_1$.
Combining these results, we obtain that, under $H_0$,
$$-2\log\lambda_n = 2[\log \ell(\hat\theta) - \log \ell(g(\hat\vartheta))] = n^{-1}[s_n(g(\vartheta))]^\tau B(\vartheta)\, s_n(g(\vartheta)) + o_p(1),$$
where
$$B(\vartheta) = [I_1(g(\vartheta))]^{-1} - [D(\vartheta)]^\tau [\tilde I_1(\vartheta)]^{-1} D(\vartheta).$$
By the CLT, $n^{-1/2}[I_1(\theta)]^{-1/2} s_n(\theta) \to_d Z$, where $Z \sim N_k(0, I_k)$. Then, it follows from Theorem 1.10(iii) that, under $H_0$,
$$-2\log\lambda_n \to_d Z^\tau [I_1(g(\vartheta))]^{1/2} B(\vartheta) [I_1(g(\vartheta))]^{1/2} Z.$$
Let $D = D(\vartheta)$, $B = B(\vartheta)$, $A = I_1(g(\vartheta))$, and $C = \tilde I_1(\vartheta)$. Then
$$\begin{aligned}
(A^{1/2} B A^{1/2})^2 &= A^{1/2} B A B A^{1/2} \\
&= A^{1/2}(A^{-1} - D^\tau C^{-1} D) A (A^{-1} - D^\tau C^{-1} D) A^{1/2} \\
&= (I_k - A^{1/2} D^\tau C^{-1} D A^{1/2})(I_k - A^{1/2} D^\tau C^{-1} D A^{1/2})
\end{aligned}$$
$$\begin{aligned}
&= I_k - 2A^{1/2} D^\tau C^{-1} D A^{1/2} + A^{1/2} D^\tau C^{-1} D A D^\tau C^{-1} D A^{1/2} \\
&= I_k - A^{1/2} D^\tau C^{-1} D A^{1/2} \\
&= A^{1/2} B A^{1/2},
\end{aligned}$$
where the fourth equality follows from the fact that $C = D A D^\tau$. This shows that $A^{1/2} B A^{1/2}$ is a projection matrix. The rank of $A^{1/2} B A^{1/2}$ is
$$\operatorname{tr}(A^{1/2} B A^{1/2}) = \operatorname{tr}(I_k - D^\tau C^{-1} D A) = k - \operatorname{tr}(C^{-1} D A D^\tau) = k - \operatorname{tr}(C^{-1} C) = k - (k - r) = r.$$
Thus, by Exercise 51 in §1.6,
$$Z^\tau [I_1(g(\vartheta))]^{1/2} B(\vartheta) [I_1(g(\vartheta))]^{1/2} Z \sim \chi^2_r.$$
Asymptotic tests based on likelihoods

There are two popular asymptotic tests based on likelihoods that are asymptotically equivalent to LR tests.
The hypothesis $H_0: \theta = g(\vartheta)$ is equivalent to a set of $r \le k$ equations:
$$H_0: R(\theta) = 0,$$
where $R(\theta)$ is a continuously differentiable function from $\mathbb{R}^k$ to $\mathbb{R}^r$.
Wald (1943) introduced a test that rejects $H_0$ when the value of
$$W_n = [R(\hat\theta)]^\tau \big\{[C(\hat\theta)]^\tau [I_n(\hat\theta)]^{-1} C(\hat\theta)\big\}^{-1} R(\hat\theta)$$
is large, where $C(\theta) = \partial R(\theta)/\partial \theta$, $I_n(\theta)$ is the Fisher information matrix based on $X_1,\dots,X_n$, and $\hat\theta$ is an MLE or RLE of $\theta$.
Rao (1947) introduced a score test that rejects $H_0$ when the value of
$$R_n = [s_n(\tilde\theta)]^\tau [I_n(\tilde\theta)]^{-1} s_n(\tilde\theta)$$
is large, where $s_n(\theta) = \partial \log \ell(\theta)/\partial \theta$ is the score function and $\tilde\theta$ is an MLE or RLE of $\theta$ under $H_0: R(\theta) = 0$.
Wald's test requires computing $\hat\theta$, not $\tilde\theta = g(\hat\vartheta)$. Rao's score test requires computing $\tilde\theta$, not $\hat\theta$.
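As an illustration not given in the lecture, the three statistics can be compared side by side in an assumed Binomial$(n, p)$ model with $H_0: p = p_0$ (so $k = r = 1$ and there is no nuisance parameter $\vartheta$); the data below are invented:

```python
import math

def loglik(p, y, n):
    # Binomial log-likelihood (additive constant dropped)
    return y * math.log(p) + (n - y) * math.log(1 - p)

def wald_score_lr(y, n, p0):
    """Wald, Rao score, and -2 log(lambda_n) for H0: p = p0.
    Here C(p) = 1, I_n(p) = n/(p(1-p)), and both Wald and score
    statistics reduce to n*(p_hat - p0)^2 over a variance estimate."""
    p_hat = y / n                                         # unrestricted MLE
    wald = (p_hat - p0) ** 2 * n / (p_hat * (1 - p_hat))  # evaluated at p_hat
    score = (p_hat - p0) ** 2 * n / (p0 * (1 - p0))       # evaluated at p0 only
    lr = 2 * (loglik(p_hat, y, n) - loglik(p0, y, n))
    return wald, score, lr

w, s, lr = wald_score_lr(58, 100, 0.5)
# All three are asymptotically chi^2_1 under H0 and nearly equal here
print(round(w, 3), round(s, 3), round(lr, 3))
```

Note how the division of labor matches the last two sentences above: the Wald statistic needs only $\hat p$, while the score statistic is evaluated entirely at the hypothesized $p_0$.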
Theorem 6.6
Assume the conditions in Theorem 4.16.
(i) Under $H_0: R(\theta) = 0$, where $R(\theta)$ is a continuously differentiable function from $\mathbb{R}^k$ to $\mathbb{R}^r$, $W_n \to_d \chi^2_r$ and, therefore, the test that rejects $H_0$ if and only if $W_n > \chi^2_{r,\alpha}$ has asymptotic significance level $\alpha$, where $\chi^2_{r,\alpha}$ is the $(1-\alpha)$th quantile of the chi-square distribution $\chi^2_r$.
(ii) The result in (i) still holds if $W_n$ is replaced by $R_n$.

Proof
(i) Using Theorems 1.12 and 4.17,
$$\sqrt{n}\,[R(\hat\theta) - R(\theta)] \to_d N_r\big(0,\ [C(\theta)]^\tau [I_1(\theta)]^{-1} C(\theta)\big),$$
where $I_1(\theta)$ is the Fisher information about $\theta$ contained in $X_1$. Under $H_0$, $R(\theta) = 0$ and, therefore (by Theorem 1.10),
$$n[R(\hat\theta)]^\tau \big\{[C(\theta)]^\tau [I_1(\theta)]^{-1} C(\theta)\big\}^{-1} R(\hat\theta) \to_d \chi^2_r.$$
Then the result follows from Slutsky's theorem (Theorem 1.11) and the fact that $\hat\theta \to_p \theta$ and that $I_1(\theta)$ and $C(\theta)$ are continuous at $\theta$.
(ii) See the textbook.