Economics 520: Econometrics
Professor N.M. Kiefer

LECTURE 10: NEYMAN-PEARSON LEMMA AND ASYMPTOTIC TESTING

NEYMAN-PEARSON LEMMA:

Lesson: Good tests are based on the likelihood ratio.

The proof is easy in the case of simple hypotheses:

$H_0$: $x \sim p_0(x) = f(x|\theta_0)$
$H_1$: $x \sim p_1(x) = f(x|\theta_1)$

The last equality is provided so this can look like a more familiar parametric test. Suppose we have a sample $x = (x_1, \ldots, x_n) \in R^n$ and we want to choose between $H_0$ and $H_1$. (Note that $p_i$ is the likelihood function.) Define a decision function $d: R^n \to \{0, 1\}$ such that $d(x) = 0$ when $H_0$ is accepted and $d(x) = 1$ when $H_1$ is accepted. Thus, $d$ defines a partition of the sample space. (The diagrams illustrating the case $n = 2$ are not reproduced here.) Let $A$ be the region in which $d = 0$; $A^c$ is the complement of $A$ in $R^n$. Then the error probabilities are

$\alpha = P(d = 1 | H_0) = \int_{A^c} p_0(x)\, dx$
$\beta = P(d = 0 | H_1) = \int_A p_1(x)\, dx$.

Note: $\alpha$ is the size of the test - the probability of an error of the first type - and $\beta$ is the operating characteristic of the test - the probability of an error of the second type. $(1 - \beta)$ is the power of the test. (You had this in ECON 519.)

You would like to choose a test minimizing both error probabilities, but there are tradeoffs. $\alpha$ can be set to 0, its minimum, by choosing $d = 0$ always; but then $\beta = 1$. This is the only way $\alpha$ can be assured to be 0. Similarly, $\beta = 0$ if $d = 1$ always, but then $\alpha = 1$. Now, $\alpha = 1/2$ and $\beta = 1/2$ can be obtained by flipping a coin and ignoring the data. Thus we have 3 points on the "frontier" available without data. The "information budget constraint" with no data is the solid line in the figure (not reproduced here): the line joining $(0, 1)$ and $(1, 0)$, whose points are available by randomizing the decision with different probabilities. Good tests using data will get a constraint like the dashed line, bowed in toward the origin (of course, $(0, 1)$ and $(1, 0)$ are always the endpoints). (Exercise: Why does this constraint have this shape?) This is like an income effect - information gives a better tradeoff between the two types of errors. We will continue this analogy at the end of the lecture and add indifference curves to this diagram. Note that the "budget" constraint is nonlinear, so the problem appears harder than the simple consumer's problem with two goods. It will turn out that the problem is not harder - the indifference curves are linear!
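To make the frontier concrete, here is a small sketch (my addition, not in the notes) that traces attainable $(\alpha, \beta)$ pairs for a specific pair of simple hypotheses, $N(0,1)$ versus $N(1,1)$, with a single observation:

```python
# Sketch (not from the notes): trace the alpha-beta "budget constraint" for
# testing H0: x ~ N(0,1) against H1: x ~ N(1,1) with one observation.
# The likelihood ratio p0/p1 is monotone decreasing in x, so the region
# A(T) = {p0/p1 > T} is an interval {x < c}; sweeping the cutoff c traces
# the frontier attainable with data.
from scipy.stats import norm

for c in [-1.0, 0.0, 0.5, 1.0, 2.0]:
    alpha = 1 - norm.cdf(c, loc=0.0)   # P(reject H0 | H0) = P(x >= c | H0)
    beta = norm.cdf(c, loc=1.0)        # P(accept H0 | H1) = P(x < c | H1)
    print(f"cutoff {c:5.2f}: alpha = {alpha:.3f}, beta = {beta:.3f}")

# Without data, only randomization is available: rejecting H0 with fixed
# probability q gives alpha = q and beta = 1 - q, the straight line joining
# (0,1) and (1,0). The data-based frontier above lies strictly inside it.
```

Every cutoff gives $\alpha + \beta < 1$, strictly inside the no-data randomization line - the "income effect" of observing the data.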
Definition: $p_0/p_1$ is the likelihood ratio, where $p_i = f(x|\theta_i)$ is the joint distribution of the data.

Let $A(T) = \{x: p_0/p_1 > T\}$ (a set in $R^n$) and

$\alpha^* = \int_{A^c} p_0(x)\, dx$; $\beta^* = \int_A p_1(x)\, dx$.

$A$ defines a decision rule: $d = 0$ if $x \in A$ and $d = 1$ if $x \in A^c$. Let $B$ be any other region in $R^n$ with error probabilities $\alpha$ and $\beta$. Then:

Neyman-Pearson Lemma: If $\alpha \le \alpha^*$, then $\beta \ge \beta^*$.

What does this say?

Proof: Define $I_A(x) = 1$ if $x \in A$ (0 otherwise) and $I_B(x) = 1$ if $x \in B$ (0 otherwise). Then

$(I_A - I_B)(p_0(x) - Tp_1(x)) \ge 0$.

To check this, look at both cases: if $x \in A$, then $I_A = 1 \ge I_B$ and $p_0/p_1 > T$, so both factors are nonnegative; if $x \notin A$, then $I_A = 0 \le I_B$ and $p_0 - Tp_1 \le 0$, so both factors are nonpositive. Multiplication yields

$I_A p_0 - I_A Tp_1 - I_B p_0 + I_B Tp_1 \ge 0$.

If this holds for any given $x$, it certainly holds on average. Thus

$\int_A (p_0 - Tp_1)\, dx - \int_B (p_0 - Tp_1)\, dx \ge 0$.

Hence (recall the definitions of $\alpha$, $\beta$, $\alpha^*$, $\beta^*$),

$(1 - \alpha^*) - T\beta^* - (1 - \alpha) + T\beta = T(\beta - \beta^*) + (\alpha - \alpha^*) \ge 0$.

Thus, if $\beta < \beta^*$, $\alpha$ must be $> \alpha^*$, and vice versa.
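A numerical check of the lemma (my addition; same $N(0,1)$ versus $N(1,1)$ setup as above): fix the size $\alpha$ and compare the likelihood-ratio test's $\beta^*$ with the $\beta$ of an arbitrary non-LR region of the same size.

```python
# Sketch (my addition): check the lemma numerically for H0: N(0,1) vs
# H1: N(1,1). The LR region A = {x < c} accepts H0; compare it with a
# non-LR acceptance region B = {|x| < b}, with b chosen so both tests
# have the same size alpha. The lemma says beta_B >= beta_star.
from scipy.stats import norm
from scipy.optimize import brentq

c = 1.0
alpha_star = 1 - norm.cdf(c, loc=0.0)            # size of the LR test
beta_star = norm.cdf(c, loc=1.0)                 # its type II error

# choose b so that P(|x| >= b | H0) = alpha_star
size_gap = lambda b: (norm.cdf(-b) + 1 - norm.cdf(b)) - alpha_star
b = brentq(size_gap, 0.01, 10.0)
beta_B = norm.cdf(b, loc=1.0) - norm.cdf(-b, loc=1.0)  # P(accept H0 | H1)

print(f"alpha = {alpha_star:.3f}: LR test beta* = {beta_star:.3f}, "
      f"two-sided region beta = {beta_B:.3f}")   # beta_B >= beta*
```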
The result says that when designing tests we should look at the likelihood ratio.

Indifference curves for error probabilities: Let $(\alpha_0, \beta_0)$ and $(\alpha_1, \beta_1)$ be error probabilities associated with two different tests. Suppose you are indifferent between these tests. Then you do not care if the choice is made with a coin flip. But this defines another test, with error probabilities $\alpha_2 = \frac{1}{2}\alpha_0 + \frac{1}{2}\alpha_1$ and $\beta_2 = \frac{1}{2}\beta_0 + \frac{1}{2}\beta_1$, and you are indifferent between this new test and the others. Continuing, you derive a linear indifference curve. Note that the practice of fixing $\alpha$ (e.g., 0.05) for all sample sizes (and hence all values of $\beta$) corresponds to lexicographic preferences, which are not continuous and therefore illogical in this setting.

Example: Consider the following composite hypothesis:

$H_0$: $\theta = \theta_0$ (null hypothesis)
$H_1$: $\theta \ne \theta_0$ (alternative hypothesis)

(Note that $\theta_0$ is the hypothesized value.) Here we find the ML estimator $\hat{\theta}$ and consider the likelihood ratio $f(x|\theta_0)/f(x|\hat{\theta})$. Basically, we are choosing the "best" value under the alternative hypothesis for the denominator.
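As an illustration (my addition), here is the composite-hypothesis likelihood ratio for an i.i.d. normal mean with known variance, where the ML estimator is the sample mean:

```python
# Sketch (my addition): composite-hypothesis likelihood ratio for a normal
# mean with known variance 1, where the ML estimator is the sample mean.
import numpy as np

rng = np.random.default_rng(0)
theta_true, theta0, n = 0.3, 0.0, 100
x = rng.normal(theta_true, 1.0, size=n)

def loglik(theta, x):
    # log f(x | theta) for iid N(theta, 1) data, dropping constants
    return -0.5 * np.sum((x - theta) ** 2)

theta_hat = x.mean()                                   # ML estimator
lr = np.exp(loglik(theta0, x) - loglik(theta_hat, x))  # f(x|theta0)/f(x|theta_hat)
print(f"theta_hat = {theta_hat:.3f}, LR = {lr:.4f}")   # LR <= 1 by construction
```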
Exercise: Consider the regression model $y = X_1\beta_1 + X_2\beta_2 + \varepsilon$ where $\varepsilon \sim N(0, \sigma^2 I)$. Is the F-test for $\beta_2 = 0$ a likelihood ratio test?

ASYMPTOTIC TESTING:

In this section, we will study "the" three tests: the Likelihood Ratio (LR), Wald, and Score (Lagrange Multiplier - LM) tests.

Background (Asymptotics): $l(\theta) = \sum \ln p(x|\theta)$ is the log likelihood function. Define the score function

$s(\theta) = \frac{dl}{d\theta}$

and the (per-observation) information

$i(\theta) = -E\left[\frac{d^2 \ln p}{d\theta^2}\right] = E\left[\frac{d \ln p}{d\theta}\frac{d \ln p}{d\theta}'\right]$.

By the CLT, $n^{-1/2} s_0 \sim N(0, i_0)$ asymptotically, where $\theta_0$ is the true value, $s_0 = s(\theta_0)$, and $i_0 = i(\theta_0)$. (Refer to Lecture 9 for this result.)

Testing: Let $\hat{\theta}$ be the ML estimator. Let $d = \hat{\theta} - \theta_0$ denote the vector of deviations. Then (again referring to Lecture 9),

$n^{-1/2} s_0 = i_0\, n^{1/2} d$ asymptotically.

Note that this is the same as $n^{1/2} d = i_0^{-1}\, n^{-1/2} s_0$. Further,

$2[l(\hat{\theta}) - l(\theta_0)] = n\, d' i_0 d$ asymptotically.

(To get this result, expand $l(\hat{\theta})$ around $\theta_0$ and take probability limits.)
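A quick Monte Carlo check of the score CLT (my addition), using a Poisson($\theta$) model, for which the per-observation score is $x/\theta - 1$ and $i(\theta) = 1/\theta$:

```python
# Sketch (my addition): Monte Carlo check that n^{-1/2} s(theta_0) is
# approximately N(0, i_0). Model: x_j ~ Poisson(theta), for which
# d ln p / d theta = x/theta - 1 and i(theta) = 1/theta.
import numpy as np

rng = np.random.default_rng(1)
theta0, n, reps = 2.0, 500, 5000

x = rng.poisson(theta0, size=(reps, n))
scores = (x / theta0 - 1).sum(axis=1)      # s(theta_0) in each replication
z = scores / np.sqrt(n)                    # n^{-1/2} s_0

print(f"mean = {z.mean():.4f} (should be near 0)")
print(f"variance = {z.var():.4f} (should be near i_0 = {1/theta0:.4f})")
```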
Consider the hypothesis:

$H_0$: $\theta = \theta_0$
$H_1$: $\theta \ne \theta_0$

Note that the restriction is $\theta = \theta_0$.

Likelihood Ratio Test: The likelihood ratio is

$LR = p(x|\theta_0)/\max_\theta p(x|\theta) = p(x|\theta_0)/p(x|\hat{\theta})$.

The test statistic is $-2 \ln LR = 2[l(\hat{\theta}) - l(\theta_0)]$, and it is distributed as $\chi^2$ (with degrees of freedom equal to the number of restrictions imposed) under the null hypothesis.

Wald Test: The test statistic is $n\, d'\, i(\hat{\theta})\, d$, and it is distributed as $\chi^2$ under the null hypothesis.

Score Test: The test statistic is $n^{-1} s_0'\, i_0^{-1} s_0$, and it is distributed as $\chi^2$ under the null hypothesis.

Note: plim $i(\hat{\theta}) = i(\theta_0) = i_0$ when the restriction is true, and recall that plim$(n\, d' i_0 d - n^{-1} s_0' i_0^{-1} s_0) = 0$ since $n^{1/2} d = i_0^{-1} n^{-1/2} s_0$ asymptotically. So the tests are asymptotically equivalent. This is a fairly rapid development, and you should convince yourself of these asymptotic equivalences. Note that the Wald and LM tests are appealing because of their asymptotic equivalence to the LR test, which is an optimal test in the Neyman-Pearson sense.
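To see the equivalence numerically, here is a sketch (my addition) computing all three statistics for $H_0: \theta = \theta_0$ in the same Poisson model; under the null each is approximately $\chi^2(1)$, and they agree closely in large samples:

```python
# Sketch (my addition): the three statistics for H0: theta = theta0 in the
# Poisson model, where theta_hat = x_bar, s(theta) = sum(x)/theta - n, and
# i(theta) = 1/theta. All three should be close, and chi^2(1) under H0.
import numpy as np

rng = np.random.default_rng(2)
theta0, n = 2.0, 500
x = rng.poisson(theta0, size=n)          # data generated under the null
theta_hat = x.mean()

def loglik(theta):
    # Poisson log likelihood, dropping the ln(x!) constant
    return np.sum(x * np.log(theta) - theta)

lr = 2 * (loglik(theta_hat) - loglik(theta0))
wald = n * (theta_hat - theta0) ** 2 / theta_hat   # n d' i(theta_hat) d
s0 = x.sum() / theta0 - n                          # score at theta0
lm = (s0 ** 2) * theta0 / n                        # n^{-1} s' i0^{-1} s

print(f"LR = {lr:.4f}, Wald = {wald:.4f}, LM = {lm:.4f}")
```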
Discussion:
- What are the computational requirements for these tests?
- Which is best?

Geometry: For illustrative purposes, $\theta$ is one-dimensional. (The figures for each test are not reproduced here.)

Likelihood Ratio test: Here, we look at the change in the log likelihood function $l(\theta)$ evaluated at $\hat{\theta}$ and $\theta_0$. If the difference between $l(\hat{\theta})$ and $l(\theta_0)$ is too large, we reject $H_0$.

Wald test: Here, we look at the deviation in parameter space.
The difference between $\hat{\theta}$ and $\theta_0$ implies a larger difference between $l(\hat{\theta})$ and $l(\theta_0)$ for the more curved log likelihood function. Evidence against the hypothesized value $\theta_0$ depends on the curvature of the log likelihood function, measured by $n\, i(\hat{\theta})$. Hence the test statistic is $n(\hat{\theta} - \theta_0)^2\, i(\hat{\theta})$.

Score test: Here, we look at the slope of the log likelihood function at the hypothesized value $\theta_0$. Since two log likelihood functions can have equal values of $s$ with different distances between $\hat{\theta}$ and $\theta_0$, $s$ must be weighted by the change in slope (i.e., curvature). A bigger change in slope implies less evidence against the hypothesized value $\theta_0$. Hence the test statistic is $n^{-1} s_0^2\, i_0^{-1}$.

Why is the score test also called the Lagrange Multiplier test? The log likelihood function is maximized subject to the restriction $\theta = \theta_0$:

$\max_\theta\; l(\theta) - \lambda(\theta - \theta_0)$.

This gives $\hat{\theta} = \theta_0$ and $\lambda = s(\theta_0) = \frac{\partial l}{\partial \theta}\Big|_{\theta_0}$.
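To close the loop (my addition; a standard step left implicit in the notes): since the constrained maximization gives $\lambda = s(\theta_0)$, the score statistic is exactly a quadratic form in the Lagrange multiplier, which is where the name comes from:

```latex
% My addition: the standard step connecting the multiplier to the statistic.
% The constrained first-order condition gives \lambda = s(\theta_0), so the
% score statistic is a quadratic form in the Lagrange multiplier:
\[
  n^{-1} s_0' \, i_0^{-1} \, s_0 \;=\; n^{-1} \lambda' \, i_0^{-1} \, \lambda .
\]
% The test asks whether \lambda, the "shadow price" of imposing the
% restriction \theta = \theta_0, is significantly different from zero.
```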