Asymptotic Tests and Likelihood Ratio Tests


Dennis D. Cox
Department of Statistics, Rice University
P. O. Box 1892, Houston, Texas 77251
Email: dcox@stat.rice.edu
November 21, 2004

1 Chapter 6, Section 5: Asymptotic Tests

Here we consider asymptotic tests and, in particular, Wald tests, likelihood ratio tests, and score tests.

1.1 General Considerations; Wald Tests.

Definition 1.1 Let $X_n$ denote the observation vector at stage $n$ and consider a test of $H_0: \theta \in \Theta_0$ vs. $H_1: \theta \in \Theta_1$. A sequence of tests $\phi_n(X_n)$ is called an asymptotic (sequence of) level $\alpha$ test(s) of $H_0$ vs. $H_1$ if
$$\forall\, \theta \in \Theta_0, \quad \limsup_{n\to\infty} E_\theta[\phi_n(X_n)] \le \alpha,$$
that is,
$$\forall\, \theta \in \Theta_0,\ \forall\, \epsilon > 0,\ \exists\, N \text{ such that } \forall\, n \ge N,\ E_\theta[\phi_n(X_n)] < \alpha + \epsilon. \tag{1}$$

Example 1.1 Let $X_1, X_2, \ldots$ be i.i.d. with finite second moments, and let $\mu = E[X]$ and $\sigma^2 = \mathrm{Var}[X]$. Suppose we want to test $H_0: \mu = \mu_0$ vs. $H_1: \mu \ne \mu_0$, where $\mu_0$ is given. Letting $\bar{X}_n$ and $S_n$ be the sample mean and sample standard deviation based on a sample of size $n$, we know from previous results that
$$\frac{\bar{X}_n - \mu}{S_n/\sqrt{n}} \xrightarrow{D} N(0,1).$$
Let
$$T_n = \frac{\bar{X}_n - \mu_0}{S_n/\sqrt{n}}.$$
If $H_0$ is true, then
$$P[\,T_n < -z_{\alpha/2} \text{ or } T_n > z_{\alpha/2}\,] \to \alpha,$$
so rejecting $H_0$ for $T_n < -z_{\alpha/2}$ or $T_n > z_{\alpha/2}$ gives an asymptotic level $\alpha$ test.
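As a quick illustration, here is a small simulation sketch (our own; the Exponential(1) population and all variable names are our choices, not from the text) that estimates the rejection probability of the test in Example 1.1 under $H_0$ for a few sample sizes. The empirical size should approach $\alpha = 0.05$ as $n$ grows, though it need not equal $\alpha$ for any fixed $n$:

# Empirical size of the asymptotic t-test of Example 1.1.
# Population: Exponential(1), so mu0 = 1 under H0 (a skewed case,
# chosen so the approach to level alpha is not instantaneous).
alpha = 0.05
mu0 = 1
for (n in c(10, 50, 250)) {
   reject = replicate(5000, {
      x = rexp(n, rate = 1)
      Tn = (mean(x) - mu0) / (sd(x) / sqrt(n))
      abs(Tn) > qnorm(1 - alpha/2)
   })
   cat("n =", n, " empirical size =", mean(reject), "\n")
}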

It is important to recognize that an asymptotic level $\alpha$ test may be nowhere near level $\alpha$ for any sample size. In fact, for the test in the previous example, it follows from a theorem due to Bahadur and Savage (1956) that if the model consists of all distributions with finite second moments, the size of the test is 1 for all $n$. The main thing to remember is that the $N$ in (1) may depend on $\theta \in \Theta_0$.

The test statistic of the previous example is in fact a special instance of a more general method. Suppose we want to test $H_0: g(\theta) = \gamma_0$, where $g(\theta)$ is some estimand. Let $\hat{\gamma}_n$ be a consistent and asymptotically normal estimator with
$$\sqrt{n}(\hat{\gamma}_n - g(\theta)) \xrightarrow{D} N(0, \sigma^2(\theta)),$$
if $\theta$ is the true value of the parameter, and let $S_n$ be a consistent estimator of $\sigma(\theta)$. Then, by Slutsky's theorem,
$$\frac{\hat{\gamma}_n - g(\theta)}{S_n/\sqrt{n}} \xrightarrow{D} N(0,1).$$
Thus, rejecting $H_0$ when $T_n = (\hat{\gamma}_n - \gamma_0)/(S_n/\sqrt{n})$ satisfies $T_n < -z_{\alpha/2}$ or $T_n > z_{\alpha/2}$ gives an asymptotic level $\alpha$ test. In particular, if we use the MLE for $\hat{\gamma}_n$, i.e. $g(\hat{\theta}_n)$ where $\hat{\theta}_n$ is the MLE of $\theta$, and the consistent estimator of the variance $\nabla g(\theta)^t I^{-1}(\theta) \nabla g(\theta)$ of the asymptotic normal distribution is obtained by plugging in $\hat{\theta}_n$, then the test is referred to as a Wald test.
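In code, the general recipe is short; the following helper is our own illustrative sketch (the name wald_test and the example data are hypothetical, not from the text):

# Generic Wald test: given an estimate of g(theta), a consistent
# standard error (already divided by sqrt(n)), and the null value.
wald_test = function(gammahat, se, gamma0) {
   z = (gammahat - gamma0) / se
   c(statistic = z, p.value = 2 * pnorm(-abs(z)))
}
# Example: test H0: mu = 1 from a hypothetical sample.
x = c(0.4, 1.9, 0.7, 2.3, 1.1, 0.2, 1.6, 0.9)
wald_test(mean(x), sd(x)/sqrt(length(x)), 1)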

Example 1.2 Let $X_i \sim B(n_i, p_i)$, $i = 1, 2$, be independent, and suppose we want to test $H_0: p_1 = p_2$ vs. $H_1: p_1 \ne p_2$. Note that the null hypothesis can be restated as $g(p_1, p_2) = p_1 - p_2 = 0$. Letting
$$\hat{p}_i = \frac{X_i}{n_i}, \tag{2}$$
the MLE for $g(p_1, p_2)$ is $\hat{\gamma} = \hat{p}_1 - \hat{p}_2$.

As this is a two sample problem, the asymptotics require a little consideration. The setting is not covered by our previous theory, since there we assumed we had a single i.i.d. sample. In particular, we will need to consider whether it is sufficient for both sample sizes $n_1$ and $n_2$ to go to $\infty$, or whether they have to be tied together somehow. We have
$$\sqrt{n_i}\,(\hat{p}_i - p_i) \xrightarrow{D} N(0,\ p_i(1-p_i)),$$
by the CLT. We will rewrite this as
$$\hat{p}_i \stackrel{D}{=} p_i + Z_i\sqrt{p_i(1-p_i)/n_i} + o_P(n_i^{-1/2}), \quad i = 1, 2, \quad Z_i \text{ i.i.d. } N(0,1).$$
This follows from Skorohod's theorem: for each $i$ we can find $Y_{in} \stackrel{D}{=} \sqrt{n_i}(\hat{p}_i - p_i)$ such that $Y_{in} \xrightarrow{P} \sqrt{p_i(1-p_i)}\,Z_i$, and we may take $(Z_1, Y_{11}, Y_{12}, \ldots)$ and $(Z_2, Y_{21}, Y_{22}, \ldots)$ independent, so that in fact $(Y_{1n_1}, Y_{2n_2}) \stackrel{D}{=} (\sqrt{n_1}[\hat{p}_1 - p_1], \sqrt{n_2}[\hat{p}_2 - p_2])$. Thus,
$$\hat{\gamma} = \hat{p}_1 - \hat{p}_2 \stackrel{D}{=} p_1 - p_2 + Z_1\sqrt{p_1(1-p_1)/n_1} - Z_2\sqrt{p_2(1-p_2)/n_2} + o_P(\min\{n_1,n_2\}^{-1/2})$$
$$\stackrel{D}{=} p_1 - p_2 + Z\sqrt{\frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}} + o_P(\min\{n_1,n_2\}^{-1/2}),$$
where $Z \sim N(0,1)$. Note that
$$\sqrt{\frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}} \ \ge\ \min_i \sqrt{p_i(1-p_i)}\ \cdot\ \min\{n_1,n_2\}^{-1/2}, \tag{3}$$
so the $o_P$ term is negligible relative to the standard deviation. Hence, we have
$$\frac{\hat{\gamma} - (p_1 - p_2)}{\sqrt{\dfrac{p_1(1-p_1)}{n_1} + \dfrac{p_2(1-p_2)}{n_2}}} \xrightarrow{D} N(0,1), \quad \text{as } \min\{n_1,n_2\} \to \infty. \tag{4}$$
Now $(\hat{p}_1, \hat{p}_2) \xrightarrow{P} (p_1, p_2)$, so by the Continuous Mapping Principle,
$$\bigl(\hat{p}_1(1-\hat{p}_1),\ \hat{p}_2(1-\hat{p}_2)\bigr) \xrightarrow{P} \bigl(p_1(1-p_1),\ p_2(1-p_2)\bigr) = (v_1, v_2).$$
Thus,
$$\frac{\dfrac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \dfrac{\hat{p}_2(1-\hat{p}_2)}{n_2}}{\dfrac{p_1(1-p_1)}{n_1} + \dfrac{p_2(1-p_2)}{n_2}} - 1 = \frac{\dfrac{\hat{p}_1(1-\hat{p}_1) - v_1}{n_1} + \dfrac{\hat{p}_2(1-\hat{p}_2) - v_2}{n_2}}{\dfrac{v_1}{n_1} + \dfrac{v_2}{n_2}} = \frac{\dfrac{o_P(1)}{n_1} + \dfrac{o_P(1)}{n_2}}{\dfrac{v_1}{n_1} + \dfrac{v_2}{n_2}}.$$

At this point, if we don't recall our original objective, we will be obliged to introduce $\max\{n_1,n_2\}$ to estimate the denominator. However, recall that we only need to establish the convergence result under $H_0: p_1 = p_2$. So, letting the common values be denoted $p = p_1 = p_2$ and $v = v_1 = v_2 = p(1-p)$, and continuing with the computation from above, we have
$$\frac{\dfrac{o_P(1)}{n_1} + \dfrac{o_P(1)}{n_2}}{\dfrac{v_1}{n_1} + \dfrac{v_2}{n_2}} = \frac{\dfrac{o_P(1)}{n_1} + \dfrac{o_P(1)}{n_2}}{v\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)} \le \frac{\dfrac{o_P(1)}{n_1} + \dfrac{o_P(1)}{n_2}}{v\,\dfrac{1}{\min\{n_1,n_2\}}}$$
(since dropping a term makes the denominator smaller)
$$\le \frac{o_P(1)\,\dfrac{2}{\min\{n_1,n_2\}}}{v\,\dfrac{1}{\min\{n_1,n_2\}}} = \frac{o_P(1)}{v} = o_P(1).$$
Thus, as long as
$$\min\{n_1, n_2\} \to \infty, \tag{5}$$
then
$$\frac{\dfrac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \dfrac{\hat{p}_2(1-\hat{p}_2)}{n_2}}{\dfrac{p_1(1-p_1)}{n_1} + \dfrac{p_2(1-p_2)}{n_2}} \xrightarrow{P} 1. \tag{6}$$
With this result, the Continuous Mapping Principle (applied to the function $\phi(x) = x^{-1/2}$, $x > 0$), Slutsky's Theorem, and (4), we have
$$\frac{\hat{\gamma} - (p_1 - p_2)}{\sqrt{\dfrac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \dfrac{\hat{p}_2(1-\hat{p}_2)}{n_2}}} = \sqrt{\frac{\dfrac{p_1(1-p_1)}{n_1} + \dfrac{p_2(1-p_2)}{n_2}}{\dfrac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \dfrac{\hat{p}_2(1-\hat{p}_2)}{n_2}}}\ \cdot\ \frac{\hat{\gamma} - (p_1 - p_2)}{\sqrt{\dfrac{p_1(1-p_1)}{n_1} + \dfrac{p_2(1-p_2)}{n_2}}} \xrightarrow{D} 1 \cdot N(0,1) = N(0,1). \tag{7}$$
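A quick simulation sketch (our own, with arbitrary parameter choices) makes (7) concrete: the standardized difference with the plug-in standard error should behave like a standard normal draw even when $n_1 \ne n_2$:

# Check (7) by simulation: standardized difference with plug-in SE.
n1 = 40; n2 = 100; p1 = 0.3; p2 = 0.3   # H0 true; unequal n's
z = replicate(10000, {
   p1hat = rbinom(1, n1, p1) / n1
   p2hat = rbinom(1, n2, p2) / n2
   se = sqrt(p1hat*(1-p1hat)/n1 + p2hat*(1-p2hat)/n2)
   (p1hat - p2hat) / se
})
c(mean = mean(z), var = var(z))   # should be near 0 and 1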

Letting
$$T = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\dfrac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \dfrac{\hat{p}_2(1-\hat{p}_2)}{n_2}}}, \tag{8}$$
we obtain an asymptotic level $\alpha$ test by
$$\text{rejecting } H_0 \text{ if } T < -z_{\alpha/2} \text{ or } T > z_{\alpha/2}. \tag{9}$$
In practice, as long as both sample sizes are reasonably large (say, $\min\{n_1,n_2\} \ge 30$) and the true $p_i$'s are not too close to 0 or 1, the test should be reasonably accurate. An assessment of the accuracy for given $n_1$, $n_2$, and $p = p_1 = p_2$ is presented in Figure 1.

To describe the figure, we did two sets of 1,000 Monte Carlo trials. In the first set, we used $p_1 = p_2 = 0.2$ and $n_1 = n_2 = 20$. The results of this simulation are shown in the top two plots of the figure. To produce these plots, we took the values of the Wald statistic from each run and converted them to approximate p-values using the asymptotic normal distribution. If the approximation is accurate, then the sorted p-values should look like the order statistics of a Unif(0,1) sample. We have plotted all of the sorted p-values vs. the expected values of the order statistics of a Unif(0,1) sample: if $U_{(i)}$ is the $i$th order statistic of $N$ i.i.d. Unif(0,1) random variables, then one can show $E[U_{(i)}] = i/(N+1)$. We have also put in the reference line $y = x$. If the p-values are accurate, they should fall on this line, except for random variation.

In the upper left panel, we have plotted all of the sorted p-values. We see that the observed p-values generally tend to fall a little above the line, except at the end. Note that a vertical run of p-values indicates a point where many values were tied. Since the original data were binomial, the test statistic, and hence the p-value, is discrete, and where there is a lump of probability mass with a large probability, we expect to see a lot of tied values. The sort function will just list all the tied values in a big block, which gives a vertical line in the plot.

The upper right panel is more useful: it shows a blow-up for the observed (simulated) p-values which are $\le 0.10$, which is generally the range where we might consider a p-value interesting or significant. Because $\alpha = 0.05$ is the usual level of significance, we have shown it here as a vertical line. Thus, we would reject $H_0$ for p-values $\le 0.05$.

[Figure 1: Plots of empirical quantiles of p-values computed for the Wald test of $H_0: p_1 = p_2$ in the two sample Binomial setting. Four panels: full Q-Q plots and blow-ups of p-values $\le 0.10$, for $n_1 = n_2 = 20$ (top row) and $n_1 = n_2 = 50$ (bottom row), with $p_1 = p_2 = 0.2$.]

Looking at the p-values $\le 0.05$, we see that a little over 0.08 of them are $\le 0.05$, and since we simulated under one of the parameter values in the null hypothesis ($p_1 = p_2 = 0.2$), we would falsely reject $H_0$ about 8% of the time. Thus, for a nominal level of 0.05, the size of the test is at least 0.08. The bottom two plots in the figure show the same results for 1,000 simulations with $n_1 = n_2 = 50$ and $p_1 = p_2 = 0.2$. Here, we see that the Unif(0,1) Q-Q (quantile-quantile) plot is much better. In particular, we see that with a nominal $\alpha = 0.05$, the actual type I error probability for this null value is only slightly above 0.05.

Interestingly, if we compare the results from the two different sample sizes, it looks like for p-values $< 0.01$ the smaller sample size is actually more accurate than the larger sample size. We can speculate that this is due to the discreteness of the observations.

In general, if you have a practical problem and you want to know how accurate the p-value is for your problem, a reasonable way of assessing this is to simulate from the maximum likelihood estimate under $H_0$, using the same design (sample sizes, in this case), convert the simulated test statistics to p-values, and look in the range of p-values you have obtained. If you see that the theoretical quantiles of the simulated p-values are somewhat above their nominal level, then it suggests that a correction is in order.

Another thought that comes to mind is to replace the standard error estimator in the denominator of $T$ by an estimator under $H_0$, i.e. to use
$$T = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}(1-\hat{p})\left[\dfrac{1}{n_1} + \dfrac{1}{n_2}\right]}}, \quad \text{where } \hat{p} = \frac{X_1 + X_2}{n_1 + n_2} \tag{10}$$
is the so-called pooled estimate of $p$, since it comes from pooling the samples of Bernoulli trials. If $H_0$ is true, we should have a better estimate of the standard error by pooling, so it intuitively makes some sense to do this. We shall return to this test later.

One issue with Wald tests is that they are not unique: there is usually more than one estimand $g(\theta)$ which gives the same null hypothesis. For example, to test $H_0: p_1 = p_2$ in the two sample Binomial model, we could use $g(p_1, p_2) = p_1/p_2$ and test the null hypothesis $g(p_1, p_2) = 1$, which generally yields a different test statistic.
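To see the non-uniqueness concretely, here is a sketch of our own (the counts are hypothetical, and the delta-method standard error for the ratio is computed from the gradient $(1/p_2,\ -p_1/p_2^2)$, an assumption we supply rather than a formula from the text):

# Two Wald statistics for the same H0: p1 = p2 (hypothetical counts).
x1 = 12; n1 = 40; x2 = 30; n2 = 60
p1hat = x1/n1; p2hat = x2/n2
# (a) difference estimand g = p1 - p2, null value 0
se.diff = sqrt(p1hat*(1-p1hat)/n1 + p2hat*(1-p2hat)/n2)
z.diff = (p1hat - p2hat) / se.diff
# (b) ratio estimand g = p1/p2, null value 1; delta-method SE
se.ratio = sqrt(p1hat*(1-p1hat)/(n1*p2hat^2) +
                p1hat^2*(1-p2hat)/(n2*p2hat^3))
z.ratio = (p1hat/p2hat - 1) / se.ratio
c(z.diff = z.diff, z.ratio = z.ratio)   # different statistics, same H0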

1.2 Likelihood Ratio Tests.

The Likelihood Ratio Test (LRT) for a testing problem is easy to define. We shall try to motivate it with some intuition, which suggests that the intuitive definition is probably wrong; but in the cases where the standard theory applies, it is nearly certainly OK. Consider a general hypothesis testing problem $H_0: \theta \in \Theta_0$ vs. $H_1: \theta \in \Theta_1$, where $\{\Theta_0, \Theta_1\}$ is a partition of the parameter space $\Theta$ (i.e., $\Theta_0 \cup \Theta_1 = \Theta$ and $\Theta_0 \cap \Theta_1 = \emptyset$). If both $\Theta_0$ and $\Theta_1$ were singleton sets, then we would apply Neyman-Pearson and have an optimal level $\alpha$ test. When one or both are composite, life is not so simple. However, suppose we could reduce each of the $\Theta_i$'s to a singleton set by selecting our best guess under the assumption that $\theta \in \Theta_i$, and then apply Neyman-Pearson to these best guesses. If our method of selecting the best guess were a good one, then it makes sense that our test might not be optimal, but it should be close. Unfortunately, our best guess will depend on the data, so we may have some trouble getting a single distribution to use under $H_0$ to select a critical value.

There are then two problems with this idea: (1) what is our best guess; and (2) how do we get a critical value achieving the desired level of significance. The first problem is relatively easy: use a good point estimate of the parameter under the assumption $\theta \in \Theta_i$, $i = 0, 1$. The most general (frequentist) method for getting good point estimates is maximum likelihood. Its justification is primarily asymptotic, and the best we can hope for in solving problem (2) is an asymptotic level $\alpha$ test. In a remarkable tour-de-force of statistical asymptotics, Wilks (1938) showed that this choice of point estimate ("best guess") in fact leads to a single asymptotic null distribution, under any $\theta \in \Theta_0$, provided $\Theta_0$ satisfies regularity conditions.

The intuition above motivates the statistic
$$Y = \frac{\sup_{\theta \in \Theta_1} f_\theta(X)}{\sup_{\theta \in \Theta_0} f_\theta(X)}, \tag{11}$$
and we would reject for large values of $Y$. The usual form of the LRT statistic is instead
$$\lambda = -2\log\left(\frac{\sup_{\theta \in \Theta_0} f_\theta(X)}{\sup_{\theta \in \Theta} f_\theta(X)}\right). \tag{12}$$

Again, we reject for large values of $\lambda$. Note that $\lambda \ge 0$ always. In general, we will let
$$L_n(\theta) = \log f_\theta(X_n) \tag{13}$$
$$\hat{\theta}_n = \arg\max_{\theta \in \Theta} L_n(\theta) \tag{14}$$
$$\hat{\theta}_{0n} = \arg\max_{\theta \in \Theta_0} L_n(\theta) \tag{15}$$
$$\hat{\theta}_{1n} = \arg\max_{\theta \in \Theta_1} L_n(\theta) \tag{16}$$
denote the log likelihood, the full MLE, the MLE under $H_0$, and the MLE under $H_1$, all for the $n$th entry in the sequence of experiments (e.g., for sample size $n$ under i.i.d. sampling). If the sample size is not important, we will drop the $n$. We shall also use the same subscripting notation if the parameter (or a component thereof) is denoted by some symbol other than $\theta$.

Example 1.3 Let $X_1, X_2, \ldots, X_n$ be i.i.d. $N(\mu, \sigma^2)$ where both $\mu$ and $\sigma^2$ are unknown. We first consider testing
$$H_0: \mu = \mu_0 \quad \text{vs.} \quad H_1: \mu \ne \mu_0. \tag{18}$$
Letting $\theta = (\mu, \sigma^2)$, the various MLEs are
$$\hat{\mu} = \bar{X} = \frac{1}{n}\sum_{i=1}^n X_i, \qquad \hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^n (X_i - \bar{X})^2,$$
$$\hat{\mu}_0 = \mu_0, \qquad \hat{\sigma}_0^2 = \frac{1}{n}\sum_{i=1}^n (X_i - \mu_0)^2,$$
$$\hat{\mu}_1 = \bar{X} \text{ a.s.}, \qquad \hat{\sigma}_1^2 = \frac{1}{n}\sum_{i=1}^n (X_i - \bar{X})^2 \text{ a.s.}$$

Note that in this case the two forms of the LRT statistic are essentially equivalent, since the MLEs under $H_1$ and unrestricted are essentially the same. This is typically the case under the regularity conditions for which the asymptotic $\chi^2$ distribution of $\lambda$ holds, at least with probability approaching 1. Now, the maximized likelihood is
$$\mathcal{L}(\hat{\mu}, \hat{\sigma}^2) = (2\pi)^{-n/2}(\hat{\sigma}^2)^{-n/2}\exp\left[-\frac{1}{2\hat{\sigma}^2}\sum_{i=1}^n (X_i - \hat{\mu})^2\right] = (2\pi)^{-n/2}(\hat{\sigma}^2)^{-n/2}\exp[-n/2]. \tag{19}$$
This is the typical form for maximized normal likelihoods (something the student should commit to memory). Under the restriction of $H_0$:
$$\mathcal{L}(\hat{\mu}_0, \hat{\sigma}_0^2) = (2\pi)^{-n/2}(\hat{\sigma}_0^2)^{-n/2}\exp[-n/2].$$
Thus, the two forms of the LRT statistic are
$$Y = \left(\frac{\hat{\sigma}_0^2}{\hat{\sigma}^2}\right)^{n/2}, \qquad \lambda = n\log\frac{\hat{\sigma}_0^2}{\hat{\sigma}^2}.$$
Rejecting $H_0$ for large values of either of these test statistics is equivalent to rejecting for large values of
$$\frac{\hat{\sigma}_0^2}{\hat{\sigma}^2} = \frac{\sum_i (X_i - \mu_0)^2}{\sum_i (X_i - \bar{X})^2} = \frac{\sum_i (X_i - \bar{X} + \bar{X} - \mu_0)^2}{\sum_i (X_i - \bar{X})^2} = \frac{\sum_i (X_i - \bar{X})^2 + n(\bar{X} - \mu_0)^2}{\sum_i (X_i - \bar{X})^2} = 1 + \frac{T^2}{n-1},$$
where
$$T = \frac{\bar{X} - \mu_0}{S/\sqrt{n}}, \qquad S^2 = \frac{1}{n-1}\sum_i (X_i - \bar{X})^2,$$
is Student's t-statistic.
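A numerical sanity check (our own sketch, with simulated data) of the identity $\hat{\sigma}_0^2/\hat{\sigma}^2 = 1 + T^2/(n-1)$ and of the formula $\lambda = n\log(\hat{\sigma}_0^2/\hat{\sigma}^2)$:

# Verify the LRT / t-statistic identity on simulated normal data.
set.seed(1)
n = 25; mu0 = 0
x = rnorm(n, mean = 0.3, sd = 2)
sig2hat  = mean((x - mean(x))^2)      # unrestricted MLE of sigma^2
sig2hat0 = mean((x - mu0)^2)          # MLE of sigma^2 under H0
Tstat = (mean(x) - mu0) / (sd(x)/sqrt(n))
c(ratio = sig2hat0/sig2hat,
  identity = 1 + Tstat^2/(n-1),       # should match 'ratio' exactly
  lambda = n * log(sig2hat0/sig2hat))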

Now, we can reject for large values of $T^2$, or $|T|$, or simply reject for both large and small values of $T$, i.e.,
$$\text{Reject } H_0 \text{ if } T < -t_{(n-1),\alpha/2} \text{ or } T > t_{(n-1),\alpha/2}.$$
Thus, we see that the LRT is equivalent to the classical two sided t-test, which is in fact a UMP unbiased test.

Now suppose we change to a one-sided test, e.g.
$$H_0: \mu \le \mu_0 \quad \text{vs.} \quad H_1: \mu > \mu_0.$$
A difference between the two forms of the LRT statistic will emerge in this case. The full MLE remains the same, but now
$$\hat{\mu}_0 = \begin{cases} \mu_0 & \text{if } \bar{X} \ge \mu_0 \\ \bar{X} & \text{if } \bar{X} < \mu_0 \end{cases}, \qquad \hat{\sigma}_0^2 = \frac{1}{n}\sum_{i=1}^n (X_i - \hat{\mu}_0)^2,$$
$$\hat{\mu}_1 = \begin{cases} \bar{X} & \text{if } \bar{X} > \mu_0 \\ \mu_0 & \text{if } \bar{X} \le \mu_0 \end{cases}, \qquad \hat{\sigma}_1^2 = \frac{1}{n}\sum_{i=1}^n (X_i - \hat{\mu}_1)^2.$$
(Technically speaking, $\hat{\mu}_1 = \mu_0$ is not allowed since $\Theta_1 = \{(\mu, \sigma^2): \mu > \mu_0\}$, but this is where the likelihood is maximized over the closure of $\Theta_1$ when $\bar{X} \le \mu_0$, so plugging it in gives us $\sup_{\theta \in \Theta_1} \mathcal{L}(\theta)$.)

Notice that the full MLE satisfies $\hat{\mu} = \hat{\mu}_0$ when $\bar{X} \le \mu_0$. If this happens, the $\lambda$ LRT statistic equals 0, and this happens with probability 1/2 if the true $\mu = \mu_0$ (with greater probability for $\mu < \mu_0$). It is only a problem if we want to use a level of significance $\alpha > 1/2$, which in practice is never an issue. Thus, ignoring this little difficulty for large significance levels (i.e., assuming $\alpha \le 1/2$), we will reject $H_0$ in either case if $\bar{X}$ is larger than $\mu_0$ and the square of Student's t statistic $T^2$ is too large. Equivalently, we may specify our rejection region as
$$\text{reject } H_0 \text{ if } T > t_{(n-1),\alpha},$$
where $T$ has the same definition as in the two sided testing problem.
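The following sketch (ours; the function name lrt_onesided is hypothetical) computes the one-sided $\lambda$ and shows it is exactly 0 whenever $\bar{X} \le \mu_0$:

# One-sided LRT statistic for H0: mu <= mu0 in the normal model.
lrt_onesided = function(x, mu0) {
   n = length(x); xbar = mean(x)
   sig2hat  = mean((x - xbar)^2)                 # full MLE of sigma^2
   muhat0   = if (xbar >= mu0) mu0 else xbar     # MLE of mu under H0
   sig2hat0 = mean((x - muhat0)^2)
   n * log(sig2hat0 / sig2hat)                   # lambda; 0 if xbar <= mu0
}
set.seed(2)
lrt_onesided(rnorm(30, mean =  0.5), mu0 = 0)  # xbar likely > 0: lambda > 0
lrt_onesided(rnorm(30, mean = -0.5), mu0 = 0)  # xbar likely < 0: lambda = 0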

where T has the same definition as the in the two sided testing problem. The above example illustrates some facts about the LRT principle for deriving a test statistic. Firstly, although we are primarily considering it as a statistic for an asymptotic test, in many situations its exact distribution can be derived, and of course then we should use that rather than an asymptotic approximation. Secondly, when there is a good or even best test, the LRT will typically give the same test. The asymptotic distribution result we will state holds for a special class of null hypotheses. Suppose Θ IR p is open. We will consider a test of H 0 : g(θ) = γ 0 vs. H 1 : g(θ) γ 0, (20) where g : Θ IR p IR k, k p. (21) Note that this null hypothesis amounts to saying that θ satisfies k (possibly nonlinear) constraints. For a null hypothesis as in (20), if g satisfies certain regularity conditions, then Θ 0 = {θ : g(θ) = γ 0 }. is a smooth k-dimensional manifold. If p = 2 or p = 3 and k = 1, then Θ 0 is a curve. If p = 3 and k = 2, the Θ 0 is a surface. To speak of a smooth manifold means that if one magnifies a small piece of the manifold, it looks like a linear manifold. More discussion is given in Subsectio.4 Example 1.4 We return to the two sample Binomial model of the previous example and derive the LRT. For convenience let b(x; n, p) = n x p x (1 p) n x, x = 0, 1,..., n, 12

P values for LRT Blow up For LRT Theoretical Quantiles 0.0 0.2 0.4 0.6 0.8 1.0 Theoretical Quantiles 0.00 0.02 0.04 0.06 0.08 0.10 0.12 0.0 0.2 0.4 0.6 0.8 1.0 1000 sorted p values n1 = 20 n2 = 20 p1=p2 = 0.2 P values for LRT 0.00 0.02 0.04 0.06 0.08 0.10 1000 sorted p values n1 = 20 n2 = 20 p1=p2 = 0.2 Blow up For LRT Theoretical Quantiles 0.0 0.2 0.4 0.6 0.8 1.0 Theoretical Quantiles 0.00 0.02 0.04 0.06 0.08 0.0 0.2 0.4 0.6 0.8 1.0 1000 sorted p values n1 = 50 n2 = 50 p1=p2 = 0.2 0.00 0.02 0.04 0.06 0.08 0.10 1000 sorted p values n1 = 50 n2 = 50 p1=p2 = 0.2 Figure 2: Plots of empirical quantiles of p-values computed for the LRT test of H 0 : p 1 = p 2 in the two sample Binomial setting. denote the B(n, p) p.m.f. Then λ = 2 log b(x 1;, ˆp)b(X 2 ;, ˆp) b(x 1 ;, ˆp 1 )b(x 2 ;, ˆp 2 ), where ˆp and the ˆp i are given in (10) and (2). While some simplification in the formula for λ is possible, it does not reduce to a particularly useful expression. 13
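Using the function Bin2lrt listed in the appendix, or computing directly as below, one can obtain $\lambda$ and its asymptotic $\chi^2_1$ p-value for given counts; the counts here are the same hypothetical ones as before:

# LRT for H0: p1 = p2 with hypothetical two-sample binomial counts.
x1 = 12; n1 = 40; x2 = 30; n2 = 60
p1hat = x1/n1; p2hat = x2/n2; phat = (x1+x2)/(n1+n2)
num    = dbinom(x1, n1, phat)  * dbinom(x2, n2, phat)
denom  = dbinom(x1, n1, p1hat) * dbinom(x2, n2, p2hat)
lambda = -2 * log(num/denom)
c(lambda = lambda, p.value = 1 - pchisq(lambda, df = 1))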

1.3 The Score Test

We now introduce yet another test whose justification is primarily asymptotic. The score test is harder to justify from an intuitive point of view than the LRT; nonetheless, it is asymptotically equivalent to the LRT, and it often gives a test statistic with a simpler formula, frequently similar to a Wald test.

The test statistic for the score test is based on the derivative of the (full) log likelihood evaluated at the MLE under $H_0$. If $\theta_0$ denotes the true value of the parameter, then
$$E[\nabla L_n(\theta_0)] = 0, \tag{22}$$
$$\mathrm{Cov}[\nabla L_n(\theta_0)] = I_n(\theta_0). \tag{23}$$
Since $\hat{\theta}_{0n}$ is a consistent estimator of $\theta_0$ under $H_0$, we have under the regularity conditions for asymptotic normality of the MLE that
$$I_n(\theta_0)^{-1/2}\, \nabla L_n(\theta_0) \xrightarrow{D} N(0, I).$$
If we evaluate the information at a consistent estimator, the same result holds; but when the score is evaluated at the MLE $\hat{\theta}_{0n}$ under $H_0$, the resulting asymptotic normal distribution is clearly singular (since $\hat{\theta}_{0n}$ is constrained to lie on the null manifold $\Theta_0$, which is locally approximately a $(p-k)$-dimensional linear manifold). However, one can show that under the same regularity conditions that give the asymptotic $\chi^2_k$ distribution for $\lambda_n$, the score statistic
$$S_n = \nabla L_n(\hat{\theta}_{0n})^t\, I_n(\hat{\theta}_{0n})^{-1}\, \nabla L_n(\hat{\theta}_{0n}) \tag{24}$$
is asymptotically equivalent to $\lambda_n$, the LRT statistic. Hence,
$$S_n \xrightarrow{D} \chi^2_k \quad \text{under } H_0. \tag{25}$$
The main reason for preferring the score statistic over the LRT is that in many cases it is easier to compute, and it often yields simpler formulae. To compute the score statistic $S_n$, one need only find $\hat{\theta}_{0n}$, whereas computation of $\lambda$ requires the full MLE $\hat{\theta}_n$ as well as the null MLE.
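In code, (24) is just a quadratic form; here is a tiny helper of our own (the name score_stat is hypothetical) that we will reuse in the next example:

# Score statistic (24): quadratic form of the score in the inverse
# information, both evaluated at the null MLE.
score_stat = function(score, info) {
   # score: gradient vector of the log likelihood at the null MLE
   # info:  Fisher information matrix at the null MLE
   drop(t(score) %*% solve(info, score))
}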

Example 1.5 Let us return to the two sample binomial problem ($X_i \sim B(n_i, p_i)$ independent, $i = 1, 2$) and consider the test of $H_0: p_1 = p_2$. The log likelihood (except for unimportant constants that don't depend on $(p_1, p_2)$) is given by
$$L(p_1, p_2) = x_1\log p_1 + (n_1 - x_1)\log(1-p_1) + x_2\log p_2 + (n_2 - x_2)\log(1-p_2),$$
and differentiating, we obtain
$$\nabla L(p_1, p_2) = \begin{pmatrix} \dfrac{x_1}{p_1} - \dfrac{n_1 - x_1}{1-p_1} \\[6pt] \dfrac{x_2}{p_2} - \dfrac{n_2 - x_2}{1-p_2} \end{pmatrix},$$
$$-D^2 L(p_1, p_2) = \begin{pmatrix} \dfrac{x_1}{p_1^2} + \dfrac{n_1 - x_1}{(1-p_1)^2} & 0 \\ 0 & \dfrac{x_2}{p_2^2} + \dfrac{n_2 - x_2}{(1-p_2)^2} \end{pmatrix},$$
$$I(p_1, p_2) = \begin{pmatrix} \dfrac{n_1}{p_1(1-p_1)} & 0 \\ 0 & \dfrac{n_2}{p_2(1-p_2)} \end{pmatrix}.$$
Plugging into the general formula for the score statistic, with the null MLE $(\hat{p}, \hat{p})$, we obtain
$$S = \frac{(x_1 - n_1\hat{p})^2}{n_1\,\hat{p}(1-\hat{p})} + \frac{(x_2 - n_2\hat{p})^2}{n_2\,\hat{p}(1-\hat{p})}.$$
One can show with a little algebra that this reduces to
$$S = \frac{(\hat{p}_1 - \hat{p}_2)^2}{\hat{p}(1-\hat{p})\left[\dfrac{1}{n_1} + \dfrac{1}{n_2}\right]}. \tag{26}$$
Note that the constraint is $k = 1$ dimensional, so we reject $H_0$ if $S > \chi^2_{1,\alpha}$.

In Figure 3 we show the results of the level study for the score statistic under conditions identical to those of the previous two simulation studies. One sees that the results are indistinguishable from those for the LRT, even in the smaller sample size case. As the score test in this setting is easier to motivate (use the Wald test with a pooled estimate for the standard error), it is more widely used, especially by nonstatisticians.

[Figure 3: Plots of empirical quantiles of p-values computed for the score test of $H_0: p_1 = p_2$ in the two sample Binomial setting; the layout is the same as in Figure 1.]
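A quick numerical check (ours, with the same hypothetical counts as before) that the plug-in quadratic form agrees with the reduced expression (26), using the score_stat helper defined above:

# Check that the general score statistic reduces to (26).
x1 = 12; n1 = 40; x2 = 30; n2 = 60
phat = (x1+x2)/(n1+n2)
score = c(x1/phat - (n1-x1)/(1-phat),
          x2/phat - (n2-x2)/(1-phat))
info  = diag(c(n1, n2) / (phat*(1-phat)))
S.general = score_stat(score, info)
S.reduced = (x1/n1 - x2/n2)^2 / (phat*(1-phat)*(1/n1 + 1/n2))
c(S.general = S.general, S.reduced = S.reduced)   # equal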

Note that one can use the signed square root of $S$, namely
$$T = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}(1-\hat{p})\left[\dfrac{1}{n_1} + \dfrac{1}{n_2}\right]}},$$
for this problem as well. Its asymptotic null distribution is $N(0,1)$. We reject the (two sided) null hypothesis $H_0: p_1 = p_2$ if either $T > z_{\alpha/2}$ or $T < -z_{\alpha/2}$. This test statistic can also be used to test one sided hypotheses, such as $H_0: p_1 \le p_2$, with a corresponding one sided rejection region. In general, this test statistic achieves the level $\alpha$ constraint more accurately than the corresponding Wald test based on the statistic in (8).

1.4 Proofs of Asymptotic Distribution Results.

We begin by stating an important theorem from advanced calculus. Unfortunately, it takes a little further argument to show that the full rank condition on the derivative of the estimand is enough to give that the null hypothesis set is locally a smooth manifold.

Theorem 1.1 (Implicit Function Theorem) Consider a continuously differentiable map $g: A \to \mathbb{R}^k$, where $A \subset \mathbb{R}^p$ is open and $k \le p$. Let $(a, b) \in A$ with $a \in \mathbb{R}^k$, $b \in \mathbb{R}^{p-k}$, and $g(a, b) = y$. Assume further that $\det(D) \ne 0$, where $D$ is the $k \times k$ matrix
$$D_{ij} = \frac{\partial g_i}{\partial x_j}(a, b), \quad 1 \le i, j \le k.$$
Then there exist a neighborhood $W$ of $b$ in $\mathbb{R}^{p-k}$ and a unique continuously differentiable function $h: W \to \mathbb{R}^k$ such that $h(b) = a$ and $g(h(t), t) = y$ for all $t \in W$.

Essentially, the theorem says that if the determinant of the partials with respect to the first $k$ variables of an $\mathbb{R}^k$-valued function is nonzero, then, given the value $y$ of the function, we can locally solve for those $k$ variables in terms of the remaining $p-k$ variables, and the map from the remaining $p-k$ variables to the first $k$ is continuously differentiable (locally almost linear). Now, if the function $g: A \to \mathbb{R}^k$, where $A \subset \mathbb{R}^p$ is open, simply has $\mathrm{rank}(Dg(x)) = k$ for all $x \in A$, then at any point $x_0$ there are exactly $k$ linearly independent columns in the $k \times p$ matrix $Dg(x_0)$, and we can rearrange the variables so that the first $k$ columns satisfy this condition.

Returning to our null hypothesis $H_0: g(\theta) = \gamma_0$, the set of points satisfying this constraint is
$$\Theta_0 = \{\theta \in \Theta: g(\theta) = \gamma_0\}.$$
Assume that there is an open set $A$ with $\Theta_0 \subset A$ such that $Dg(\theta)$ has rank $k$ for all $\theta \in A$. Then for any $\theta_0 \in \Theta_0$ we can find a neighborhood of $\theta_0$ and $k$ components of $\theta$ such that, on $\Theta_0$ within this neighborhood, those $k$ components are continuously differentiable functions of the remaining $p-k$ components. Hence, near $\theta_0$, $\Theta_0$ is the graph of a smooth function of $p-k$ of the coordinates, and this graph will be almost linear in a small enough neighborhood. For instance, in the two sample Binomial model with $g(p_1, p_2) = p_1 - p_2$, we have $Dg = (1, -1)$, of rank $k = 1$ everywhere, and $\Theta_0 = \{(p_1, p_2): p_1 = p_2\}$ is indeed a smooth curve (the diagonal) of dimension $p - k = 1$.

Corollary 1.2 Let $\Theta_0 = \{\theta \in \Theta: g(\theta) = \gamma_0\}$, where $g: \Theta \subset \mathbb{R}^p \to \mathbb{R}^k$. Assume that there is an open set $A$ with $\Theta_0 \subset A$ such that $Dg(\theta)$ has rank $k$ for all $\theta \in A$. Let $\theta_0 \in \Theta_0$ be given. Then, in a neighborhood of $\theta_0$, $\Theta_0$ coincides with the graph of a continuously differentiable function of $p-k$ of the coordinates; in particular, $\Theta_0$ is locally a smooth $(p-k)$-dimensional manifold.

We begin by outlining the main steps in the proof in a nonrigorous manner, leaving the details to be filled in later.

1.5 Appendix: Listing of R Code.

We wrote 3 individual functions to compute the test statistics. Another function called these test statistic functions in a for loop to generate the matrix of p-values. Finally, a script file called the last function and generated the plots. The files are separated by comment lines. Continuation lines are indented 3 spaces.

Bin2wald <- function(x1,n1,x2,n2) {
   # computes Wald stat for testing equality of 2 binomial p's
   p1hat = x1/n1
   p2hat = x2/n2
   se = sqrt( p1hat*(1-p1hat)/n1 + p2hat*(1-p2hat)/n2 )
   z = (p1hat - p2hat)/se
   return(z)
}

###
Bin2lrt <- function(x1,n1,x2,n2) {
   # computes LRT stat for testing equality of 2 binomial p's
   p1hat = x1/n1
   p2hat = x2/n2
   phat = (x1+x2)/(n1+n2)
   num = dbinom(x1,n1,phat)*dbinom(x2,n2,phat)
   denom = dbinom(x1,n1,p1hat)*dbinom(x2,n2,p2hat)
   lambda = -2 * log( num/denom )
   return(lambda)
}

###
Bin2score <- function(x1,n1,x2,n2) {
   # computes score stat (signed square root, pooled SE) for testing
   # equality of 2 binomial p's
   p1hat = x1/n1
   p2hat = x2/n2
   phat = (x1+x2)/(n1+n2)
   se = sqrt( phat*(1-phat)*(1/n1 + 1/n2) )
   z = (p1hat - p2hat)/se
   return(z)
}

###
Runsim = function(n1,p1,n2,p2,nmc) {
   # simulates nmc trials of 2 sample binomial x1 ~ B(n1,p1), x2 ~ B(n2,p2)
   # computes wald, lrt, and score p-values for H0: p1 = p2 on each trial
   x1 = rbinom(nmc,n1,p1)
   x2 = rbinom(nmc,n2,p2)
   pvals = matrix(NA,nrow=nmc,ncol=3)
   dimnames(pvals) = list(NULL,c("wald","lrt","score"))
   for(i in 1:nmc) {
      zwald = Bin2wald(x1[i],n1,x2[i],n2)
      pw = 2*pnorm(-abs(zwald))
      lambda = Bin2lrt(x1[i],n1,x2[i],n2)
      pl = 1-pchisq(lambda,1)
      zscore = Bin2score(x1[i],n1,x2[i],n2)
      ps = 2*pnorm(-abs(zscore))
      pvals[i,] = c(pw,pl,ps)
   }
   return(pvals)
}

###
# script file to run simulations of the 3 tests for p1=p2
# under H0 and plot the results
p1 = .2
p2 = p1
n1a = 20
n2a = 20
nmc = 1000
pvalsa = Runsim(n1a,p1,n2a,p2,nmc)
for(j in 1:3) pvalsa[,j] = sort(pvalsa[,j])
#### 2nd simulation, same p, different n's
n1b = 50
n2b = 50
pvalsb = Runsim(n1b,p1,n2b,p2,nmc)
for(j in 1:3) pvalsb[,j] = sort(pvalsb[,j])
##############################################################################
# making plots
xpoints = (1:nmc)/(nmc+1)
m = round(nmc/10)
subtitlea = paste("n1 =",as.character(n1a)," n2 =",
   as.character(n2a)," p1=p2 =",as.character(p1))
subtitleb = paste("n1 =",as.character(n1b)," n2 =",
   as.character(n2b)," p1=p2 =",as.character(p1))
postscript("fig01.ps")
par(mfrow=c(2,2))
plot(xpoints,pvalsa[,1],
   xlab="Theoretical Quantiles",ylab="sorted p-values",
   main="P-values for Wald Test", sub=subtitlea)
abline(0,1)
plot(xpoints[1:m],pvalsa[1:m,1],
   xlab="Theoretical Quantiles",ylab="sorted p-values",
   main="Blow-up For Wald Test", sub=subtitlea)
abline(0,1)
plot(xpoints,pvalsb[,1],
   xlab="Theoretical Quantiles",ylab="sorted p-values",
   main="P-values for Wald Test", sub=subtitleb)
abline(0,1)
plot(xpoints[1:m],pvalsb[1:m,1],
   xlab="Theoretical Quantiles",ylab="sorted p-values",
   main="Blow-up For Wald Test", sub=subtitleb)
abline(0,1)
graphics.off()
postscript("fig02.ps")
par(mfrow=c(2,2))
plot(xpoints,pvalsa[,2],
   xlab="Theoretical Quantiles",ylab="sorted p-values",
   main="P-values for LRT", sub=subtitlea)
abline(0,1)
plot(xpoints[1:m],pvalsa[1:m,2],
   xlab="Theoretical Quantiles",ylab="sorted p-values",
   main="Blow-up For LRT", sub=subtitlea)
abline(0,1)
plot(xpoints,pvalsb[,2],
   xlab="Theoretical Quantiles",ylab="sorted p-values",
   main="P-values for LRT", sub=subtitleb)
abline(0,1)
plot(xpoints[1:m],pvalsb[1:m,2],
   xlab="Theoretical Quantiles",ylab="sorted p-values",
   main="Blow-up For LRT", sub=subtitleb)
abline(0,1)
graphics.off()
postscript("fig03.ps")
par(mfrow=c(2,2))
plot(xpoints,pvalsa[,3],
   xlab="Theoretical Quantiles",ylab="sorted p-values",
   main="P-values for Score Test", sub=subtitlea)
abline(0,1)
plot(xpoints[1:m],pvalsa[1:m,3],
   xlab="Theoretical Quantiles",ylab="sorted p-values",
   main="Blow-up For Score Test", sub=subtitlea)
abline(0,1)
plot(xpoints,pvalsb[,3],
   xlab="Theoretical Quantiles",ylab="sorted p-values",
   main="P-values for Score Test", sub=subtitleb)
abline(0,1)
plot(xpoints[1:m],pvalsb[1:m,3],
   xlab="Theoretical Quantiles",ylab="sorted p-values",
   main="Blow-up For Score Test", sub=subtitleb)
abline(0,1)
graphics.off()

2 Exercises for Section 6.5

2.1 Show that for a one parameter exponential family, the LRT is equivalent to the UMP test of $H_0: \theta \le \theta_0$ vs. $H_1: \theta > \theta_0$, provided we either use the $Y$ statistic in (11) or use reasonable values of $\alpha$ (asymptotically, $\alpha < 1/2$).

2.2 Verify the steps leading up to (26).

2.3 Consider the two sample exponential model: $X_{ij} \sim \mathrm{Expo}(\mu_i)$, $1 \le j \le n_i$, $i = 1, 2$, all mutually independent. We want to test $H_0: \mu_1 = \mu_2$ vs. $H_1: \mu_1 \ne \mu_2$.
(a) Derive a Wald test, the LRT, and the score test for this problem. You should give explicit formulae for each test statistic and for the critical region. Where possible, use a two sided region based on the $N(0,1)$ distribution rather than a $\chi^2$ distribution.
(b) Perform a level study similar to the one given in the text for the two sample binomial setting, to compare how well the test statistics achieve the level $\alpha$ constraint for $0 < \alpha < 1$.
(c) Verify directly that the LRT and score tests are asymptotically equivalent.