Chapter 8: Hypothesis Testing
Lecture 9: Likelihood ratio tests

Throughout this chapter we consider a sample X taken from a population indexed by θ ∈ Θ ⊂ R^k. Instead of estimating the unknown parameter, we want to make a decision on whether the unknown θ is in Θ_0, a subset of Θ.

Definition 8.1.1.
A hypothesis is a statement about the population parameter θ. It is specified as H: θ ∈ Θ_0 for a Θ_0 ⊂ Θ, where H stands for a hypothesis.

Definition 8.1.2.
For any hypothesis H_0: θ ∈ Θ_0, its complementary hypothesis is H_1: θ ∈ Θ_1 = Θ_0^c. H_0 is called the null hypothesis and H_1 is called the alternative hypothesis.

Based on a sample from the population, we want to decide which of the two complementary hypotheses is true, i.e., to test

    H_0: θ ∈ Θ_0  versus  H_1: θ ∈ Θ_1 = Θ_0^c

UW-Madison (Statistics), Stat 610, Lecture 9, 2016
Definition 8.1.3.
A hypothesis test (or simply a test) is a rule that specifies for which sample values H_0 is accepted or rejected (H_1 is accepted). The subset of the sample space for which H_0 is rejected is called the rejection region or critical region; its complement is called the acceptance region.

In this course, rejecting H_0 results in accepting H_1, and accepting H_0 results in rejecting H_1. Also, "not rejecting" is the same as "accepting".

Typically, a test is specified in terms of a test statistic T(X) = T(X_1, ..., X_n), a function of the sample X. For example, a test might specify that H_0 is to be rejected if the sample mean X̄ is greater than 3.

We introduce a method of using the likelihood function to construct tests, which is applicable as long as a likelihood is available. Properties and optimality of these tests will be studied later.
Definition 8.2.1.
The likelihood ratio test statistic for testing H_0: θ ∈ Θ_0 versus H_1: θ ∈ Θ_0^c is

    λ(x) = sup_{θ∈Θ_0} L(θ|x) / sup_{θ∈Θ} L(θ|x),

where L(θ|x) is the likelihood function based on X = x. A likelihood ratio test (LRT) is any test that has a rejection region of the form {x: λ(x) ≤ c}, where c is a constant satisfying 0 ≤ c ≤ 1.

The rationale behind LRTs is that λ(x) is likely to be small if there are parameter points in Θ_0^c for which x is much more likely than for any parameter in Θ_0.

Note that in the denominator of the likelihood ratio, the sup is taken over Θ, not Θ_0^c. The method of determining c is discussed later.
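As a concrete illustration of Definition 8.2.1, λ(x) can be computed numerically by maximizing the log-likelihood over Θ_0 and over Θ. The sketch below does this for iid N(θ, 1) data with Θ_0 = {θ_0}; the data, seed, and function names are hypothetical, not from the lecture.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def log_lik(theta, x):
    # log-likelihood of N(theta, 1), up to the additive constant -(n/2) log(2*pi),
    # which cancels in the likelihood ratio
    return -0.5 * np.sum((x - theta) ** 2)

rng = np.random.default_rng(0)
x = rng.normal(loc=0.3, scale=1.0, size=20)   # hypothetical sample
theta0 = 0.0

# Numerator: sup over Theta_0 = {theta0} is just L(theta0 | x).
num = log_lik(theta0, x)

# Denominator: sup over Theta = R, found numerically (here the MLE is xbar).
res = minimize_scalar(lambda t: -log_lik(t, x))
den = -res.fun

lam = np.exp(num - den)   # likelihood ratio statistic, 0 <= lam <= 1
print(lam)
```

Small values of `lam` indicate that some θ outside Θ_0 explains the data much better than θ_0 does, which is exactly when an LRT rejects.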
The likelihood ratio method is related to the MLE discussed in Section 7.2.2. Suppose that θ̂ is the MLE of θ and θ̂_0 is the MLE of θ when Θ_0 is treated as the parameter space (the so-called restricted MLE). Then

    λ(x) = L(θ̂_0|x) / L(θ̂|x)

Example 8.2.2.
Let X_1, ..., X_n be iid from N(µ, σ²) with unknown µ ∈ R. Let µ_0 be a constant and consider

    H_0: µ = µ_0  versus  H_1: µ ≠ µ_0

First, assume σ² = 1. Θ_0 has only one element, µ_0, and the numerator of λ(x) is L(µ_0|x). Since the MLE of µ is X̄,

    λ(x) = (2π)^{−n/2} exp(−Σ_{i=1}^n (x_i − µ_0)²/2) / [(2π)^{−n/2} exp(−Σ_{i=1}^n (x_i − x̄)²/2)]
    = exp(−(1/2) Σ_{i=1}^n (x_i − µ_0)² + (1/2) Σ_{i=1}^n (x_i − x̄)²) = exp(−(n/2)(x̄ − µ_0)²)

Hence,

    {x: λ(x) ≤ c} = {x: |x̄ − µ_0| ≥ √(−2(log c)/n)}

Next, we consider the situation where σ² is unknown, i.e., θ = (µ, σ²) ∈ R × (0, ∞) and Θ_0 = {µ_0} × (0, ∞). Under H_0, the MLE of θ is (µ_0, σ̂_0²), and the MLE without restriction is θ̂ = (X̄, σ̂²), where

    σ̂_0² = (1/n) Σ_{i=1}^n (X_i − µ_0)²  and  σ̂² = (1/n) Σ_{i=1}^n (X_i − X̄)²

Then

    λ(x) = (2πσ̂_0²)^{−n/2} exp(−Σ_{i=1}^n (x_i − µ_0)²/(2σ̂_0²)) / [(2πσ̂²)^{−n/2} exp(−Σ_{i=1}^n (x_i − x̄)²/(2σ̂²))]
         = σ̂^n e^{−n/2} / (σ̂_0^n e^{−n/2})
         = [Σ_{i=1}^n (x_i − x̄)² / (Σ_{i=1}^n (x_i − x̄)² + n(x̄ − µ_0)²)]^{n/2}
         = [1 + n(x̄ − µ_0)²/((n−1)s²)]^{−n/2}
where s² is the observed value of the sample variance S². Hence,

    {x: λ(x) ≤ c} = {x: √n |x̄ − µ_0| / (√(n−1) s) ≥ √(c^{−2/n} − 1)}

Example 8.2.3.
Let X_1, ..., X_n be iid from exponential(θ, 1) with unknown θ ∈ R and pdf f_θ(x) = e^{−(x−θ)}, x > θ. Let θ_0 be a fixed constant and the hypotheses be

    H_0: θ ≤ θ_0  versus  H_1: θ > θ_0

The likelihood function with X = x = (x_1, ..., x_n) is

    L(θ|x) = e^{−(x_1−θ)} ··· e^{−(x_n−θ)} if θ ≤ x_i, i = 1, ..., n, and 0 otherwise
           = exp(−Σ_{i=1}^n x_i + nθ) if θ ≤ x_(1), and 0 if θ > x_(1)

Clearly, L(θ|x) is an increasing function of θ on (−∞, x_(1)].
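The algebra in Example 8.2.2 can be checked numerically: in the unknown-variance case, λ(x) equals (1 + t²/(n−1))^{−n/2}, where t is the usual one-sample t statistic, so the LRT is equivalent to a two-sided t-test. A minimal sketch with hypothetical data:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=1.0, scale=2.0, size=15)   # hypothetical sample
mu0, n = 0.0, len(x)

xbar = x.mean()
s2 = x.var(ddof=1)                            # sample variance s^2

# LRT statistic from Example 8.2.2 (unknown variance)
lam = (1 + n * (xbar - mu0) ** 2 / ((n - 1) * s2)) ** (-n / 2)

# One-sample t statistic
t = np.sqrt(n) * (xbar - mu0) / np.sqrt(s2)

print(lam, t)
```

Since λ is strictly decreasing in |t|, rejecting for small λ(x) is the same as rejecting for large |t|.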
Thus, the unrestricted MLE of θ is X_(1) and

    sup_{θ∈R} L(θ|x) = exp(−Σ_{i=1}^n x_i + n x_(1))

The restricted MLE of θ depends on whether x_(1) ≤ θ_0; if yes, the restricted MLE is still x_(1); if x_(1) > θ_0, the restricted MLE is θ_0 since L(θ|x) increases for θ ≤ x_(1). Hence,

    sup_{θ≤θ_0} L(θ|x) = exp(−Σ_{i=1}^n x_i + n x_(1)) if x_(1) ≤ θ_0, and exp(−Σ_{i=1}^n x_i + nθ_0) if x_(1) > θ_0

    λ(x) = 1 if x_(1) ≤ θ_0, and exp(−n(x_(1) − θ_0)) if x_(1) > θ_0

    {x: λ(x) ≤ c} = {x: exp(−n(x_(1) − θ_0)) ≤ c} = {x: x_(1) ≥ θ_0 − n^{−1} log c}

In both examples, λ(x) is a function of sufficient statistics, which is true in general.
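A short numerical sketch of the statistic in Example 8.2.3 (the data, seed, and function name are hypothetical):

```python
import numpy as np

def lrt_shifted_exponential(x, theta0):
    # lambda(x) for H_0: theta <= theta0 in the exponential(theta, 1) family:
    # 1 if x_(1) <= theta0, exp(-n*(x_(1) - theta0)) otherwise
    n, x1 = len(x), float(np.min(x))   # x_(1) = smallest order statistic
    return 1.0 if x1 <= theta0 else float(np.exp(-n * (x1 - theta0)))

rng = np.random.default_rng(2)
x = rng.exponential(scale=1.0, size=10) + 0.5   # hypothetical data, true theta = 0.5

lam = lrt_shifted_exponential(x, theta0=0.0)
print(lam)
# Reject H_0 when lam <= c, i.e. when x_(1) >= theta0 - log(c)/n.
```

Note that λ depends on the data only through x_(1), a sufficient statistic for θ, in line with the remark above.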
Theorem 8.2.4.
If T(X) is a sufficient statistic for θ and λ*(T) and λ(x) are the likelihood ratios based on T and X, respectively, then λ*(T(x)) = λ(x) for every x ∈ X (the sample space).

Proof.
From the factorization theorem, L(θ|x) = g_θ(T(x)) h(x) for some functions g_θ and h. Then

    λ(x) = sup_{θ∈Θ_0} L(θ|x) / sup_{θ∈Θ} L(θ|x)
         = sup_{θ∈Θ_0} g_θ(T(x)) h(x) / sup_{θ∈Θ} g_θ(T(x)) h(x)
         = sup_{θ∈Θ_0} g_θ(T(x)) / sup_{θ∈Θ} g_θ(T(x))
         = sup_{θ∈Θ_0} L*(θ|T(x)) / sup_{θ∈Θ} L*(θ|T(x))
         = λ*(T(x))

where the 3rd equality holds because h(x) does not depend on θ and the 4th equality holds because it can be shown that g_θ(T(x)) is proportional to the pdf or pmf of T(X).
One-way analysis of variance (ANOVA)
Consider independent data

    Y_{i1}, ..., Y_{i n_i} ~ N(µ_i, σ²),  i = 1, ..., k

We want to test whether the populations have a common mean,

    H_0: µ_i = µ for all i  versus  H_1: the µ_i's are not all equal

We now derive the LRT for testing H_0. Under H_0, the Y_ij's are iid N(µ, σ²), and the result about a normal population tells us that the MLE of (µ, σ²) under H_0 is θ̂_0 = (Ȳ, SST/n), where

    Ȳ = (1/n) Σ_{i=1}^k Σ_{j=1}^{n_i} Y_ij,  SST = Σ_{i=1}^k Σ_{j=1}^{n_i} (Y_ij − Ȳ)²

are respectively the sample mean of the total sample and the sample variance of the total sample multiplied by n − 1, and n = Σ_{i=1}^k n_i is the total sample size. In general, the MLE of (µ_1, ..., µ_k, σ²) is (Ȳ_1, ..., Ȳ_k, SSW/n), where

    Ȳ_i = (1/n_i) Σ_{j=1}^{n_i} Y_ij,  SSW = Σ_{i=1}^k Σ_{j=1}^{n_i} (Y_ij − Ȳ_i)²
are respectively the sample mean of the ith sample (from population i) and the so-called within population (group) sum of squares. Then, the likelihood ratio based on X = (Y_ij, j = 1, ..., n_i, i = 1, ..., k) is

    λ(X) = (2πSST/n)^{−n/2} e^{−n/2} / [(2πSSW/n)^{−n/2} e^{−n/2}] = (SSW/SST)^{n/2}

To derive the cut-off value c for the LRT that rejects H_0 iff λ(X) < c, we need to find the distribution of a monotone function of λ(X). Note that

    SST = Σ_{i=1}^k Σ_{j=1}^{n_i} (Y_ij − Ȳ_i)² + Σ_{i=1}^k n_i (Ȳ_i − Ȳ)² = SSW + SSB

where SSB is the sum of squares between populations (groups). This identity is a special case of the so-called ANOVA decomposition. Then

    λ(X) = [1 + (k−1)F/(n−k)]^{−n/2},  F = [SSB/(k−1)] / [SSW/(n−k)]

i.e., λ(X) is a strictly decreasing function of F.
We may show that F has an F-distribution with degrees of freedom k − 1 and n − k, and this F-distribution is central under H_0, so that we reject H_0 if F > the upper α quantile of the central F-distribution. But this result follows from the general result we will establish next.

In applications, it is common to use the following ANOVA table.

    ANOVA table
    Source    df      SS     MS                F
    between   k − 1   SSB    MSB = SSB/(k−1)   MSB/MSW
    within    n − k   SSW    MSW = SSW/(n−k)
    total     n − 1   SST
    df = degrees of freedom

Example 11.2.12

    Source    df   SS        MS       F
    between   3    995.90    331.97   26.09
    within    15   190.83    12.72
    total     18   1186.73
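The F statistic above can be computed directly from the group sums of squares. The sketch below does this for hypothetical data (group means and seed are mine) and checks the result against scipy's built-in one-way ANOVA:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
groups = [rng.normal(loc=m, scale=2.0, size=6) for m in (0.0, 1.0, 3.0)]  # hypothetical samples

k = len(groups)
n = sum(len(g) for g in groups)
grand_mean = np.concatenate(groups).mean()

ssb = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)  # between-group sum of squares
ssw = sum(((g - g.mean()) ** 2).sum() for g in groups)            # within-group sum of squares

F = (ssb / (k - 1)) / (ssw / (n - k))
p_value = stats.f.sf(F, k - 1, n - k)   # reject H_0 at level alpha when F > F_{k-1, n-k, alpha}

# scipy computes the same one-way ANOVA F statistic
F_scipy, p_scipy = stats.f_oneway(*groups)
print(F, p_value)
```

The rejection rule "λ(X) < c" and "F > F_{k−1,n−k,α}" are equivalent because λ is strictly decreasing in F.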
Hypotheses concerning linear functions of β in a linear model
Under the linear model with Y ~ N(Xβ, σ²I_n), we consider H_0: Lβ = 0 versus H_1: Lβ ≠ 0, where L is an s × p matrix of rank s ≤ p. For the special case of one-way ANOVA with H_0: µ_1 = µ_2 = ··· = µ_k, we have p = k, s = k − 1, and

    L = [ 1 0 ··· 0 −1
          0 1 ··· 0 −1
          ⋮         ⋮
          0 0 ··· 1 −1 ]

We consider the likelihood ratio test. The likelihood function is

    L(β, σ²) = (2πσ²)^{−n/2} exp(−‖Y − Xβ‖²/(2σ²)) ≤ (2πσ²)^{−n/2} exp(−‖Y − Xβ̂‖²/(2σ²))
since ‖Y − Xβ‖² ≥ ‖Y − Xβ̂‖² for any β and the LSE β̂. Treating the right-hand side of this expression as a function of σ², it is easy to show that it has a maximum at σ² = σ̂² = ‖Y − Xβ̂‖²/n and

    sup_{β,σ²} L(β, σ²) = (2πσ̂²)^{−n/2} e^{−n/2}

Similarly, let β̂_{H_0} be the LSE under H_0 and σ̂²_{H_0} = ‖Y − Xβ̂_{H_0}‖²/n. Then

    sup_{Lβ=0, σ²} L(β, σ²) = (2πσ̂²_{H_0})^{−n/2} e^{−n/2}

Thus, the LR is

    λ = (σ̂²/σ̂²_{H_0})^{n/2} = ‖Y − Xβ̂‖^n / ‖Y − Xβ̂_{H_0}‖^n

Next, we derive the distribution of a monotone function of λ under Lβ = 0 so that we can obtain a size α test. For the given L, there exists a (p − s) × p matrix M such that G = (M′ L′)′ is nonsingular. Let γ = Gβ, whose last s components are 0's when Lβ = 0.
Note that

    Xβ = XG^{−1}Gβ = (XG^{−1})γ = Wγ,  W = XG^{−1}

Then the problem reduces to testing whether the last s components of γ are 0's, with Y ~ N(Wγ, σ²I_n), γ̂ = Gβ̂, and Xβ̂ = Wγ̂. Under H_0,

    Xβ̂_{H_0} = W_1(W_1′W_1)^{−1}W_1′Y,

where W_1 is the first p − s columns of W, i.e., W = (W_1 W_2). Then

    ‖Y − Xβ̂‖² = ‖Y − W(W′W)^{−1}W′Y‖² = Y′(I_n − H_W)Y,

where H_W = W(W′W)^{−1}W′, a projection matrix of rank p, and

    ‖Y − Xβ̂_{H_0}‖² = ‖Y − W_1(W_1′W_1)^{−1}W_1′Y‖² = Y′(I_n − H_{W_1})Y,

where H_{W_1} = W_1(W_1′W_1)^{−1}W_1′, a projection matrix of rank p − s. Since

    Y′Y = Y′H_{W_1}Y + Y′(H_W − H_{W_1})Y + Y′(I_n − H_W)Y

and the sum of the ranks of H_{W_1}, H_W − H_{W_1}, and I_n − H_W is (p − s) + s + (n − p) = n, by Cochran's theorem, Y′(H_W − H_{W_1})Y and Y′(I_n − H_W)Y are independently chi-square distributed.
Hence

    F = [Y′(H_W − H_{W_1})Y / s] / [Y′(I_n − H_W)Y / (n − p)]
      = {[‖Y − Xβ̂_{H_0}‖² − ‖Y − Xβ̂‖²] / s} / {‖Y − Xβ̂‖² / (n − p)}

which is a decreasing function of the LR λ, has the noncentral F-distribution with degrees of freedom s and n − p and noncentrality parameter δ = γ′W′(H_W − H_{W_1})Wγ/σ². Under H_0, δ = 0 (exercise) and, hence, the LR test of size α rejects H_0 when F > F_{s,n−p,α}, and the power of this test is related to the noncentral F-distribution.

Examples
For the one-way ANOVA with H_0: µ_1 = µ_2 = ··· = µ_k, the result holds with s = k − 1 and p = k.
Consider the two-way balanced ANOVA with H_0: α_1 = ··· = α_{a−1} = 0. The statistic F can be calculated using the SSR under the two-way balanced ANOVA model and the SSR under the reduced model with the α_i's equal to 0.
Alternatively, we can use the following decomposition:

    Σ_{i,j,k} (Y_ijk − Ȳ)² = bc Σ_i α̂_i² + ac Σ_j β̂_j² + c Σ_{i,j} γ̂_ij² + Σ_{i,j,k} (Y_ijk − Ȳ_ij)²
                            = SSA + SSB + SSAB + SSR

The LR tests of size α can be obtained using the following table.

    ANOVA table
    Source     df            SS     MS                         F
    A          a − 1         SSA    MSA = SSA/(a−1)            MSA/MSR
    B          b − 1         SSB    MSB = SSB/(b−1)            MSB/MSR
    AB         (a−1)(b−1)    SSAB   MSAB = SSAB/((a−1)(b−1))   MSAB/MSR
    Residual   ab(c − 1)     SSR    MSR = SSR/(ab(c−1))
    total      abc − 1       SST
    df = degrees of freedom
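The projection-matrix form of F derived for the general linear model can be sketched numerically. In the example below the design matrix, coefficients, seed, and helper name are all hypothetical; H_0 states that the last s coefficients are 0, so the reduced model keeps the first p − s columns.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n, p, s = 40, 4, 2
# hypothetical design: intercept plus p-1 random covariates
X = np.column_stack([np.ones(n)] + [rng.normal(size=n) for _ in range(p - 1)])
beta = np.array([1.0, 0.5, 0.0, 0.0])   # last s components are 0, so H_0 holds
Y = X @ beta + rng.normal(size=n)

def hat(W):
    # projection (hat) matrix W (W'W)^{-1} W'
    return W @ np.linalg.solve(W.T @ W, W.T)

H_full = hat(X)              # rank p
H_null = hat(X[:, :p - s])   # rank p - s: columns allowed under H_0

rss_full = Y @ (np.eye(n) - H_full) @ Y   # ||Y - X bhat||^2
rss_null = Y @ (np.eye(n) - H_null) @ Y   # ||Y - X bhat_{H_0}||^2

F = ((rss_null - rss_full) / s) / (rss_full / (n - p))
p_value = stats.f.sf(F, s, n - p)   # central F_{s, n-p} under H_0
print(F, p_value)
```

Since the fuller model can only reduce the residual sum of squares, rss_null ≥ rss_full and F ≥ 0, matching the chi-square quadratic forms in the derivation.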