
3. Hypothesis Testing

Pure significance tests

Data x = (x_1, ..., x_n) from f(x; θ). A hypothesis H_0 restricts f(x; θ). Are the data consistent with H_0? H_0 is called the null hypothesis; it is simple if it completely specifies the density of x, and composite otherwise.

A pure significance test is a means of examining whether the data are consistent with H_0, where the only distribution of the data that is explicitly formulated is that under H_0. It uses a test statistic T = t(X). Suppose: the larger t(x), the more inconsistent the data with H_0.

For simple H_0, the p-value of x is p = P(T ≥ t(x); H_0). Small p indicates more inconsistency with H_0.

Remark: View the p-value as the realisation of a random variable P. Denote the c.d.f. of T under H_0 by F_{T,H_0}; then, if F_{T,H_0} is continuous, P = 1 − F_{T,H_0}(T(X)). Probability integral transform: if F_{T,H_0} is continuous and one-to-one, then P is uniformly distributed on [0, 1], P ∼ U[0, 1].

To derive this result, suppose that X has a continuous distribution with invertible c.d.f. F, and let Y = F(X). Then 0 ≤ Y ≤ 1, and for 0 ≤ y ≤ 1,

P(Y ≤ y) = P(F(X) ≤ y) = P(X ≤ F^{−1}(y)) = F(F^{−1}(y)) = y,

showing that Y ∼ U[0, 1]. It also follows that 1 − Y ∼ U[0, 1]. Now note that F_{T,H_0}(t) = P(T ≤ t; H_0), so F_{T,H_0}(T(X)) is of the form F(X) for some random variable X.
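A quick simulation illustrates the probability integral transform; this is a minimal sketch with a hypothetical N(3, 2²) sample, so all numbers are made up.

```python
import random
from statistics import NormalDist

# Hypothetical illustration: draw X ~ N(3, 2^2), apply its own c.d.f. F,
# and check that Y = F(X) looks uniform on [0, 1].
random.seed(0)
F = NormalDist(mu=3, sigma=2).cdf
y = [F(random.gauss(3, 2)) for _ in range(100_000)]

# The empirical c.d.f. of Y at a few points should be close to the points themselves.
for q in (0.1, 0.5, 0.9):
    frac = sum(yi <= q for yi in y) / len(y)
    print(f"P(Y <= {q}) ~ {frac:.3f}")
```

With 100,000 draws the empirical fractions agree with q to about two decimal places, as the U[0, 1] claim predicts.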

For composite H_0: if S is sufficient for θ, then the distribution of X conditional on S does not depend on θ; the p-value of x is then p = P(T ≥ t(x) | S; H_0).

Example: dispersion of the Poisson distribution. H_0: X_1, ..., X_n i.i.d. Poisson(µ), with µ unknown. Under H_0, Var X_i = E X_i = µ, so we would expect T = t(X) = S²/X̄ to be close to 1. The statistic T is also called the dispersion index. Suspect that the X_i may be over-dispersed, that is, Var X_i > E X_i: discrepancy with H_0 would then correspond to large T.

Recall that X̄ is sufficient for the Poisson distribution; the p-value is then p = P(S²/X̄ ≥ t(x) | X̄ = x̄; H_0), which under the Poisson hypothesis does not depend on the unknown µ. Given X̄ = x̄ and H_0, we have S²/X̄ approximately ∼ χ²_{n−1}/(n − 1) (see Chapter 5 later), and so p ≈ P(χ²_{n−1}/(n − 1) ≥ t(x)).
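The dispersion test can be sketched as follows; the counts are made up, and the χ² tail probability is approximated by Monte Carlo (using that χ²_k is a Gamma(k/2, 2) distribution) to stay within the standard library.

```python
import random
from statistics import mean, variance

# Hypothetical Poisson counts; are they over-dispersed?
x = [3, 0, 5, 2, 7, 1, 4, 6, 0, 8, 2, 3]   # made-up data
n = len(x)
T = variance(x) / mean(x)                   # dispersion index S^2 / x-bar

# p ~ P(chi^2_{n-1}/(n-1) >= T); chi^2 tail by Monte Carlo, chi^2_k = Gamma(k/2, 2)
random.seed(1)
draws = [random.gammavariate((n - 1) / 2, 2) for _ in range(200_000)]
p = sum(d >= (n - 1) * T for d in draws) / len(draws)
print(f"T = {T:.2f}, approximate p-value = {p:.3f}")
```

Here T is about 2, roughly double the Poisson prediction of 1, and the small p-value points to over-dispersion.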

Possible alternatives to H_0 guide the choice and interpretation of T. What is a best test?

Simple null and alternative hypotheses

General setting: random sample X_1, ..., X_n from f(x; θ), null hypothesis H_0: θ ∈ Θ_0, alternative hypothesis H_1: θ ∈ Θ_1, where Θ_1 = Θ \ Θ_0 and Θ denotes the whole parameter space. Choose a rejection region or critical region R: reject H_0 ⇔ X ∈ R.

Suppose H_0: θ = θ_0, H_1: θ = θ_1. Type I error: reject H_0 when it is true; α = P(reject H_0 | H_0) is also known as the size of the test. Type II error: accept H_0 when it is false; β = P(accept H_0 | H_1). The power of the test is 1 − β = P(accept H_1 | H_1).

Usually we fix α (e.g., α = 0.05, 0.01, etc.) and look for a test which minimizes β: the most powerful, or best, test of size α. Intuitively: reject H_0 in favour of H_1 if the likelihood of θ_1 is much larger than the likelihood of θ_0, given the data.

Neyman–Pearson Lemma (see, e.g., Casella and Berger, p. 366): The most powerful test of H_0 versus H_1 has rejection region

R = { x : L(θ_1; x)/L(θ_0; x) ≥ k_α },

where the constant k_α is chosen so that P(X ∈ R | H_0) = α. This test is called the likelihood ratio (LR) test.

Often we simplify the condition L(θ_1; x)/L(θ_0; x) ≥ k_α to t(x) ≥ c_α, for some constant c_α and some statistic t(x); we determine c_α from the equation P(T ≥ c_α | H_0) = α, where T = t(X). The test is then: reject H_0 if and only if T ≥ c_α. For data x the p-value is p = P(T ≥ t(x) | H_0).

Example: normal means, one-sided. X_1, ..., X_n ∼ N(µ, σ²) with σ² known; H_0: µ = µ_0, H_1: µ = µ_1, with µ_1 > µ_0. Then

L(µ_1; x)/L(µ_0; x) ≥ k
⇔ l(µ_1; x) − l(µ_0; x) ≥ log k
⇔ −Σ[(x_i − µ_1)² − (x_i − µ_0)²] ≥ 2σ² log k
⇔ (µ_1 − µ_0) x̄ ≥ k′
⇔ x̄ ≥ c (since µ_1 > µ_0),

where k′ and c are constants independent of x.

Choose t(x) = x̄, and for a size-α test choose c so that P(X̄ ≥ c | H_0) = α; equivalently, such that

P( (X̄ − µ_0)/(σ/√n) ≥ (c − µ_0)/(σ/√n) | H_0 ) = α.

Hence we want (c − µ_0)/(σ/√n) = z_{1−α} (where Φ(z_{1−α}) = 1 − α, with Φ the standard normal c.d.f.), i.e. c = µ_0 + σ z_{1−α}/√n.

So our test "reject H_0 if and only if X̄ ≥ c" becomes "reject H_0 if and only if X̄ ≥ µ_0 + σ z_{1−α}/√n". This is the most powerful test of H_0 versus H_1.
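The one-sided test above can be sketched in a few lines; the sample, µ_0 and σ below are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean

# Sketch of the most powerful one-sided test; data and sigma are made up.
sigma, mu0, alpha = 2.0, 5.0, 0.05
x = [6.1, 4.8, 7.2, 5.9, 6.5, 5.1, 6.8, 5.4]          # hypothetical sample
n = len(x)

z_crit = NormalDist().inv_cdf(1 - alpha)               # z_{1-alpha}
c = mu0 + sigma * z_crit / sqrt(n)                     # rejection threshold for x-bar
xbar = mean(x)
p = 1 - NormalDist().cdf((xbar - mu0) / (sigma / sqrt(n)))

print(f"x-bar = {xbar:.3f}, threshold c = {c:.3f}, reject: {xbar >= c}, p = {p:.4f}")
```

For these numbers x̄ falls below c, so H_0 is retained at the 5% level even though the p-value is fairly small.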

Recall the notation for standard normal quantiles: if Z ∼ N(0, 1) then P(Z ≤ z_α) = α and P(Z ≥ z(α)) = α, and note that z(α) = z_{1−α}. Thus P(Z ≥ z_{1−α}) = 1 − (1 − α) = α.

Example: Bernoulli, probability of success, one-sided. X_1, ..., X_n i.i.d. Bernoulli(θ); then L(θ) = θ^r (1 − θ)^{n−r}, where r = Σ x_i. Test H_0: θ = θ_0 against H_1: θ = θ_1, where θ_1 > θ_0. Now θ_1/θ_0 > 1 and (1 − θ_1)/(1 − θ_0) < 1, and

L(θ_1; x)/L(θ_0; x) = (θ_1/θ_0)^r ((1 − θ_1)/(1 − θ_0))^{n−r},

and so L(θ_1; x)/L(θ_0; x) ≥ k_α ⇔ r ≥ r_α.

So the best test rejects H_0 for large r. For any given critical value r_c,

α = Σ_{j=r_c}^{n} (n choose j) θ_0^j (1 − θ_0)^{n−j};

this gives the p-value if we set r_c = r(x) = Σ x_i, the observed value.

Note: the distribution is discrete, so we may not be able to achieve a level-α test exactly (unless we use additional randomization). For example, if R ∼ Binomial(10, 0.5), then P(R ≥ 9) = 0.011 and P(R ≥ 8) = 0.055, so there is no c such that P(R ≥ c) = 0.05. A solution is to randomize: if R ≥ 9 reject the null hypothesis, if R ≤ 7 accept the null hypothesis, and if R = 8 flip a (biased) coin to achieve the exact level of 0.05.
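The exact binomial tail sum can be computed directly; `binom_upper_p` is an illustrative helper name, and the θ_0 and data are hypothetical.

```python
from math import comb

# Exact one-sided binomial p-value, as in the Bernoulli example above.
def binom_upper_p(r, n, theta0):
    """P(R >= r) when R ~ Binomial(n, theta0)."""
    return sum(comb(n, j) * theta0**j * (1 - theta0)**(n - j)
               for j in range(r, n + 1))

# With R ~ Binomial(10, 0.5): no exact 0.05-level cut-off exists.
print(binom_upper_p(9, 10, 0.5))   # ~ 0.0107
print(binom_upper_p(8, 10, 0.5))   # ~ 0.0547
```

The two printed tails straddle 0.05, which is exactly why randomization at R = 8 is needed for an exact 0.05-level test.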

Composite alternative hypotheses

Suppose θ is scalar and H_0: θ = θ_0. One-sided alternative hypotheses: H_1^−: θ < θ_0 and H_1^+: θ > θ_0; two-sided alternative: H_1: θ ≠ θ_0.

The power function of a test, power(θ) = P(X ∈ R; θ), is the probability of rejecting H_0 as a function of the true value of the parameter θ; it depends on α, the size of the test. Main uses: comparing alternative tests, choosing sample size.

A test of size α is uniformly most powerful (UMP) if its power function satisfies power(θ) ≥ power′(θ) for all θ ∈ Θ_1, where power′(θ) is the power function of any other size-α test.

Consider H_0 against H_1^+. For exponential family problems, usually the rejection region of the LR test does not depend on the particular θ_1 > θ_0; at the same time, the test is most powerful for every single θ_1 larger than θ_0. Hence the test derived for one such value of θ_1 is UMP for H_0 versus H_1^+.

Example: normal mean, composite one-sided alternative. X_1, ..., X_n ∼ N(µ, σ²) i.i.d., σ² known; H_0: µ = µ_0, H_1^+: µ > µ_0. First pick an arbitrary µ_1 > µ_0: the most powerful test of µ = µ_0 versus µ = µ_1 has rejection region of the form X̄ ≥ µ_0 + σ z_{1−α}/√n for a test of size α.

This rejection region does not depend on µ_1, hence the test is UMP for H_0 versus H_1^+. The power of the test is

power(µ) = P(X̄ ≥ µ_0 + σ z_{1−α}/√n ; µ)
= P( (X̄ − µ)/(σ/√n) ≥ (µ_0 − µ)/(σ/√n) + z_{1−α} | X̄ ∼ N(µ, σ²/n) )
= P(Z ≥ z_{1−α} − (µ − µ_0)√n/σ), with Z ∼ N(0, 1)
= 1 − Φ( z_{1−α} − (µ − µ_0)√n/σ ).

The power increases from 0 up to α at µ = µ_0 and then to 1 as µ increases. The power also increases as α increases.

Sample size calculation in the normal example. Suppose we want to be near-certain to reject H_0 when µ = µ_0 + δ, say, with size 0.05. Suppose we want to fix n to force power(µ) = 0.99 at µ = µ_0 + δ:

0.99 = 1 − Φ(1.645 − δ√n/σ), so that 0.01 = Φ(1.645 − δ√n/σ).

Solving this equation (using tables) gives −2.326 = 1.645 − δ√n/σ, i.e. n = σ²(1.645 + 2.326)²/δ² is the required sample size.
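The power formula and the sample-size rule can be checked numerically; σ and δ below are hypothetical choices.

```python
from math import ceil, sqrt
from statistics import NormalDist

Z = NormalDist()

def power(mu, mu0, sigma, n, alpha):
    """power(mu) = 1 - Phi(z_{1-alpha} - (mu - mu0) sqrt(n) / sigma)."""
    return 1 - Z.cdf(Z.inv_cdf(1 - alpha) - (mu - mu0) * sqrt(n) / sigma)

# Sample size forcing power 0.99 at mu = mu0 + delta with size 0.05:
# n = sigma^2 (z_{0.95} + z_{0.99})^2 / delta^2, rounded up.
sigma, delta, alpha = 1.0, 0.5, 0.05
n = ceil(sigma**2 * (Z.inv_cdf(0.95) + Z.inv_cdf(0.99))**2 / delta**2)
pw = power(delta, 0.0, sigma, n, alpha)
print(f"n = {n}, power at mu0 + delta = {pw:.4f}")
```

Note that power(µ_0) equals α exactly, which is a useful sanity check on any power calculation.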

UMP tests are not always available. If not, options include:
1. the Wald test;
2. the locally most powerful test (score test);
3. the generalized likelihood ratio test.

Wald test

The Wald test is directly based on the asymptotic normality of the m.l.e. θ̂ = θ̂_n: often θ̂ ≈ N(θ, I_n^{−1}(θ)) if θ is the true parameter. Recall that for a simple random sample, I_n(θ) = n i_1(θ). Also it is often true that, asymptotically, we may replace θ by θ̂ in the Fisher information: θ̂ ≈ N(θ, I_n^{−1}(θ̂)). So we can construct a test based on

W = √(I_n(θ̂)) (θ̂ − θ_0) ≈ N(0, 1) under H_0.

If θ is scalar, squaring gives W² ≈ χ²_1, so equivalently we could use a chi-square test. For higher-dimensional θ we can base a test on the quadratic form (θ̂ − θ_0)^T I_n(θ̂)(θ̂ − θ_0), which is approximately chi-square distributed in large samples.

If we would like to test H_0: g(θ) = 0, where g is a differentiable function, then the delta method gives as test statistic

W = g(θ̂)^T { G(θ̂) I_n(θ̂)^{−1} G(θ̂)^T }^{−1} g(θ̂), where G(θ) = ∂g(θ)/∂θ^T.

Example: non-invariance of the Wald test. Suppose that θ̂ is scalar and approximately N(θ, I_n(θ)^{−1})-distributed; then for testing H_0: θ = 0 the Wald statistic becomes θ̂ √(I_n(θ̂)), which would be approximately standard normal.

If instead we tested H_0: θ³ = 0, then the delta method with g(θ) = θ³, so that g′(θ) = 3θ², gives Var(g(θ̂)) ≈ 9θ̂⁴ (I_n(θ̂))^{−1}, and as Wald statistic

θ̂ √(I_n(θ̂)) / 3,

which would again be approximately standard normal. The two null hypotheses are equivalent, yet the p-values for a finite sample may be quite different!
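A small numeric illustration of this non-invariance, with a hypothetical MLE and Fisher information:

```python
from math import sqrt
from statistics import NormalDist

# Same fit, two parametrizations of the same null (all numbers hypothetical).
theta_hat, info = 0.4, 25.0          # MLE and Fisher information I_n(theta_hat)

w1 = theta_hat * sqrt(info)          # Wald statistic for H0: theta = 0
w2 = theta_hat * sqrt(info) / 3      # Wald statistic for H0: theta^3 = 0

p1 = 2 * NormalDist().cdf(-abs(w1))  # two-sided normal p-values
p2 = 2 * NormalDist().cdf(-abs(w2))
print(f"p-value for theta = 0: {p1:.4f}; for theta^3 = 0: {p2:.4f}")
```

The same data reject H_0 under one parametrization and fall far short under the other, which is the practical cost of the Wald test's non-invariance.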

Locally most powerful test (score test)

Test H_0: θ = θ_0 against H_1: θ = θ_0 + δ for some small δ > 0. The most powerful test has rejection region of the form l(θ_0 + δ) − l(θ_0) ≥ k. Taylor expansion gives

l(θ_0 + δ) ≈ l(θ_0) + δ ∂l(θ_0)/∂θ, i.e. l(θ_0 + δ) − l(θ_0) ≈ δ ∂l(θ_0)/∂θ.

So a locally most powerful (LMP) test has rejection region

R = { x : ∂l(θ_0)/∂θ ≥ k_α }.

This is also called the score test: ∂l/∂θ is known as the score function. Under certain regularity conditions,

E[∂l/∂θ] = 0, Var[∂l/∂θ] = I_n(θ).

As l is usually a sum of independent components, so is ∂l(θ_0)/∂θ, and the CLT (central limit theorem) can be applied.

Example: Cauchy parameter. X_1, ..., X_n is a random sample from Cauchy(θ), with density f(x; θ) = [π(1 + (x − θ)²)]^{−1} for −∞ < x < ∞. Test H_0: θ = θ_0 against H_1^+: θ > θ_0. Then

∂l(θ_0; x)/∂θ = 2 Σ (x_i − θ_0)/(1 + (x_i − θ_0)²).

Fact: under H_0, the expression ∂l(θ_0; X)/∂θ has mean 0 and variance I_n(θ_0) = n/2.

The CLT applies, ∂l(θ_0; X)/∂θ ≈ N(0, n/2) under H_0, so for the LMP test,

P(N(0, n/2) ≥ k_α) = P(N(0, 1) ≥ k_α √(2/n)) ≈ α

gives k_α ≈ z_{1−α} √(n/2), and as rejection region with approximate size α,

R = { x : 2 Σ (x_i − θ_0)/(1 + (x_i − θ_0)²) > √(n/2) z_{1−α} }.
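The Cauchy score test can be sketched directly; `cauchy_score_test` is an illustrative name, and the sample is generated as a ratio of two independent standard normals (which is a standard Cauchy draw), shifted to centre 1.

```python
import random
from math import sqrt
from statistics import NormalDist

def cauchy_score_test(x, theta0, alpha=0.05):
    """LMP (score) test of H0: theta = theta0 vs H1+: theta > theta0, Cauchy data."""
    score = sum(2 * (xi - theta0) / (1 + (xi - theta0) ** 2) for xi in x)
    k = sqrt(len(x) / 2) * NormalDist().inv_cdf(1 - alpha)   # approx critical value
    return score, k, score > k

# Hypothetical sample from Cauchy centred at 1; test H0: theta = 0.
random.seed(2)
x = [1 + random.gauss(0, 1) / random.gauss(0, 1) for _ in range(200)]
score, k, reject = cauchy_score_test(x, theta0=0.0)
print(f"score = {score:.1f}, critical value = {k:.1f}, reject: {reject}")
```

Since the true centre is 1, not 0, the score lands far above the critical value and the test rejects.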

Generalized likelihood ratio (LR) test

Test H_0: θ = θ_0 against H_1^+: θ > θ_0; use as rejection region

R = { x : max_{θ ≥ θ_0} L(θ; x) / L(θ_0; x) ≥ k_α }.

If L has one mode, at the m.l.e. θ̂, then the likelihood ratio in the definition of R is either 1, if θ̂ ≤ θ_0, or L(θ̂; x)/L(θ_0; x), if θ̂ > θ_0. The case H_1^− is similar, with fairly obvious changes of signs and directions of inequalities.

Two-sided tests

Test H_0: θ = θ_0 against H_1: θ ≠ θ_0. If the one-sided tests of size α have symmetric rejection regions R^+ = {x : t > c} and R^− = {x : t < −c}, then a two-sided test (of size 2α) takes the rejection region R = {x : |t| > c}; this test has p-value p = P(|t(X)| ≥ |t| | H_0), where t is the observed value.

The two-sided (generalized) LR test uses

T = 2 log[ max_θ L(θ; X) / L(θ_0; X) ] = 2 log[ L(θ̂; X) / L(θ_0; X) ]

and rejects H_0 for large T. Fact: T ≈ χ²_1 under H_0 (see later). Where possible, the exact distribution of T, or of a statistic equivalent to T, should be used.

If θ is a vector, there is no such thing as a one-sided alternative hypothesis. For the alternative θ ≠ θ_0 we use an LR test based on

T = 2 log[ L(θ̂; X) / L(θ_0; X) ].

Under H_0, T ≈ χ²_p, where p is the dimension of θ (see Chapter 5).

For the score test we use as statistic

l′(θ_0)^T [I(θ_0)]^{−1} l′(θ_0),

where I(θ) is the expected Fisher information matrix: [I(θ)]_{jk} := [I_n(θ)]_{jk} = −E[∂²l/∂θ_j ∂θ_k]. If the CLT applies to the score function, then this quadratic form is again approximately χ²_p under H_0 (see Chapter 5).

Example: Pearson's chi-square statistic. We have a random sample of size n, with p categories; P(X_j = i) = π_i, for i = 1, ..., p, j = 1, ..., n. As Σ π_i = 1, we take θ = (π_1, ..., π_{p−1}). The likelihood function is then Π π_i^{n_i}, where n_i is the number of observations in category i (so Σ n_i = n). The m.l.e. is θ̂ = n^{−1}(n_1, ..., n_{p−1}). Test H_0: θ = θ_0, where θ_0 = (π_{1,0}, ..., π_{p−1,0}), against H_1: θ ≠ θ_0.

The score vector is the vector of partial derivatives of

l(θ) = Σ_{i=1}^{p−1} n_i log π_i + n_p log( 1 − Σ_{k=1}^{p−1} π_k )

with respect to π_1, ..., π_{p−1}:

∂l/∂π_i = n_i/π_i − n_p/(1 − Σ_{k=1}^{p−1} π_k).

The matrix of second derivatives has entries

∂²l/∂π_i ∂π_k = −n_i δ_{ik}/π_i² − n_p/(1 − Σ_{i=1}^{p−1} π_i)²,

where δ_{ik} = 1 if i = k and δ_{ik} = 0 if i ≠ k. Minus the expectation of this, using E[N_i] = nπ_i, gives

I(θ) = n diag(π_1^{−1}, ..., π_{p−1}^{−1}) + n 1 1^T π_p^{−1},

where 1 is a (p − 1)-dimensional vector of ones.

One can compute

l′(θ_0)^T [I(θ_0)]^{−1} l′(θ_0) = Σ_{i=1}^{p} (n_i − n π_{i,0})² / (n π_{i,0});

this statistic is called the chi-squared statistic, T say. The CLT for the score vector gives that T ≈ χ²_{p−1} under H_0.

Note: the chi-squared statistic is of the form Σ (O_i − E_i)²/E_i, where O_i and E_i refer to observed and expected frequencies in category i. This is known as Pearson's chi-square statistic.
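For p = 2 categories the statistic has a χ²_1 null distribution, whose tail can be written via the standard normal: P(χ²_1 ≥ T) = 2Φ(−√T). A sketch with hypothetical counts (`pearson_chi2` is an illustrative helper name):

```python
from math import sqrt
from statistics import NormalDist

def pearson_chi2(observed, probs):
    """Pearson's statistic sum_i (O_i - E_i)^2 / E_i with E_i = n * pi_{i,0}."""
    n = sum(observed)
    return sum((o - n * p) ** 2 / (n * p) for o, p in zip(observed, probs))

# Hypothetical data: 2 categories (p = 2, so T ~ chi^2_1), H0: pi = (0.5, 0.5).
T = pearson_chi2([60, 40], [0.5, 0.5])
p_value = 2 * NormalDist().cdf(-sqrt(T))   # P(chi^2_1 >= T)
print(f"T = {T:.2f}, p-value = {p_value:.4f}")
```

With 60 versus 40 observations the statistic is 4, giving a p-value just under 0.05.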

Composite null hypotheses

Let θ = (ψ, λ), where λ is a nuisance parameter; we want a test not depending on the unknown value of λ. We extend two of the previous methods.

Generalized likelihood ratio test: composite null hypothesis. H_0: θ ∈ Θ_0 against H_1: θ ∈ Θ_1 = Θ \ Θ_0. The (generalized) LR test uses the likelihood ratio statistic

T = max_{θ∈Θ} L(θ; X) / max_{θ∈Θ_0} L(θ; X)

and rejects H_0 for large values of T.

Now θ = (ψ, λ); assume that ψ is scalar, and test H_0: ψ = ψ_0 against H_1^+: ψ > ψ_0. The LR statistic T is

T = max_{ψ≥ψ_0, λ} L(ψ, λ) / max_λ L(ψ_0, λ) = max_{ψ≥ψ_0} L_P(ψ) / L_P(ψ_0),

where L_P(ψ) is the profile likelihood for ψ.

For H_0 against H_1: ψ ≠ ψ_0,

T = max_{ψ,λ} L(ψ, λ) / max_λ L(ψ_0, λ) = L(ψ̂, λ̂) / L_P(ψ_0).

Often (see Chapter 5): 2 log T ≈ χ²_p, where p is the dimension of ψ. Important requirement: the dimension of λ must not depend on n.

Example: normal distribution and Student's t-test. Random sample of size n from N(µ, σ²), with both µ and σ unknown; H_0: µ = µ_0. Ignoring an irrelevant additive constant,

l(θ) = −n log σ − [n(x̄ − µ)² + (n − 1)s²] / (2σ²).

Maximizing this w.r.t. σ with µ fixed gives

l_P(µ) = −(n/2) log( [(n − 1)s² + n(x̄ − µ)²] / n ).

If H_1^+: µ > µ_0, maximize l_P(µ) over µ ≥ µ_0: if x̄ ≤ µ_0 the maximum is at µ = µ_0; if x̄ > µ_0 the maximum is at µ = x̄. So log T = 0 when x̄ ≤ µ_0, and when x̄ > µ_0 it is

−(n/2) log( (n − 1)s²/n ) + (n/2) log( [(n − 1)s² + n(x̄ − µ_0)²] / n )
= (n/2) log( 1 + n(x̄ − µ_0)² / ((n − 1)s²) ).

Thus the LR rejection region is of the form R = {x : t(x) ≥ c_α}, where

t(x) = √n (x̄ − µ_0) / s.

This statistic is called Student's t statistic. Under H_0, t(X) ∼ t_{n−1}, and for a size-α test we set c_α = t_{n−1,1−α}; the p-value is p = P(t_{n−1} ≥ t(x)). Here we use the quantile notation P(t_{n−1} ≥ t_{n−1,1−α}) = α.

The two-sided test of H_0 against H_1: µ ≠ µ_0 is easier, as unconstrained maxima are used. The size-α test has rejection region R = {x : |t(x)| ≥ t_{n−1,1−α/2}}.
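A sketch of the one-sided t-test with hypothetical data; since the t_{n−1} c.d.f. is not in the standard library, the null distribution is approximated by simulating standard normal samples (valid because t(X) under H_0 does not depend on µ_0 or σ).

```python
import random
from math import sqrt
from statistics import mean, stdev

def t_stat(x, mu0):
    """Student's t statistic sqrt(n) (x-bar - mu0) / s."""
    return sqrt(len(x)) * (mean(x) - mu0) / stdev(x)

# Hypothetical data; H0: mu = 0 against H1+: mu > 0.
x = [0.9, -0.2, 1.4, 0.3, 0.8, 1.1, -0.1, 0.6]
t_obs = t_stat(x, 0.0)

# p = P(t_{n-1} >= t_obs), approximated by Monte Carlo under H0.
random.seed(3)
n = len(x)
sims = sum(t_stat([random.gauss(0, 1) for _ in range(n)], 0.0) >= t_obs
           for _ in range(100_000))
p_mc = sims / 100_000
print(f"t = {t_obs:.3f}, approximate p-value = {p_mc:.4f}")
```

With exact tables one would instead read the p-value from the t_7 distribution; the simulation agrees to Monte Carlo accuracy.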

Score test: composite null hypothesis

Now θ = (ψ, λ) with ψ scalar; test H_0: ψ = ψ_0 against H_1^+: ψ > ψ_0 or H_1^−: ψ < ψ_0. The score test statistic is

T = ∂l(ψ_0, λ̂_0; X)/∂ψ,

where λ̂_0 is the MLE for λ when H_0 is true. Large positive values of T indicate H_1^+, and large negative values indicate H_1^−. Thus the rejection regions are of the form T ≥ k_α^+ when testing against H_1^+, and T ≤ k_α^− when testing against H_1^−.

Recall the derivation of the score test:

l(θ_0 + δ) − l(θ_0) ≈ δ ∂l(θ_0)/∂θ = δT.

If δ > 0, i.e. for H_1^+, we reject if T is large; if δ < 0, i.e. for H_1^−, we reject if T is small.

Sometimes the exact null distribution of T is available; more often we use that T is approximately normal (by the CLT, see Chapter 5), with zero mean. To find the approximate variance:

1. compute I(ψ_0, λ);
2. invert to I^{−1};
3. take the diagonal element corresponding to ψ;
4. invert this element;
5. replace λ by the null hypothesis MLE λ̂_0.

Denote the result by v; then Z = T/√v ≈ N(0, 1) under H_0.

A considerable advantage is that the unconstrained MLE ψ̂ is not required.

Example: linear or non-linear model. We can extend the linear model Y_j = x_j^T β + ε_j, where ε_1, ..., ε_n are i.i.d. N(0, σ²), to a non-linear model Y_j = (x_j^T β)^ψ + ε_j with the same ε's. Test H_0: ψ = 1 (the usual linear model) against, say, H_1^−: ψ < 1. Here the nuisance parameters are λ^T = (β^T, σ²).

Write η_j = x_j^T β, and denote the usual linear model fitted values by η̂_{j0} = x_j^T β̂_0, where the estimates are obtained under H_0. As Y_j ∼ N(η_j, σ²), we have, up to an irrelevant additive constant,

l(ψ, β, σ) = −n log σ − (1/(2σ²)) Σ (y_j − η_j^ψ)²,

and so

∂l/∂ψ = (1/σ²) Σ (y_j − η_j^ψ) η_j^ψ log η_j,

yielding that the null MLEs are the usual least-squares estimates (LSEs), which are

β̂_0 = (X^T X)^{−1} X^T Y, σ̂² = n^{−1} Σ (Y_j − x_j^T β̂_0)².

So the score test statistic becomes

T = (1/σ̂²) Σ (Y_j − η̂_{j0}) η̂_{j0} log η̂_{j0}.

We reject H_0 for large negative values of T.

Compute the approximate null variance (see below): with u_j = η_j log η_j,

I(ψ_0, β, σ) = (1/σ²) [ Σ u_j²     Σ u_j x_j^T   0
                        Σ u_j x_j   Σ x_j x_j^T   0
                        0           0             2n ].

The (1, 1) element of the inverse of I has reciprocal

( u^T u − u^T X (X^T X)^{−1} X^T u ) / σ²,

where u^T = (u_1, ..., u_n). Substitute η̂_{j0} for η_j and σ̂² for σ² to get v. For the approximate p-value, calculate z = t/√v and set p = Φ(z).

Calculation trick: to compute the (1, 1) element of the inverse of I above, note that if

A = [ a   x^T
      x   B   ],

where a is a scalar, x is an (n − 1) × 1 vector and B is an (n − 1) × (n − 1) matrix, then (A^{−1})_{11} = 1/(a − x^T B^{−1} x). Recall also:

∂/∂ψ η^ψ = ∂/∂ψ e^{ψ ln η} = (ln η) e^{ψ ln η} = η^ψ ln η.
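The block-inverse trick is easy to verify on the smallest non-trivial case, where x and B are scalars (so A is 2 × 2 and can be inverted directly via its adjugate); the numbers are arbitrary.

```python
# Check the (1,1)-inverse trick on a tiny 2x2 case where B is a scalar:
# A = [[a, x], [x, B]]  =>  (A^{-1})_{11} = 1 / (a - x * B^{-1} * x).
a, x, B = 5.0, 2.0, 4.0

# Direct 2x2 inversion: A^{-1} = adj(A) / det(A), so (A^{-1})_{11} = B / det(A).
det = a * B - x * x
direct = B / det

trick = 1 / (a - x * (1 / B) * x)
print(direct, trick)   # both 0.25
```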

For the (1, 1) entry of the information matrix, we calculate

∂²l/∂ψ² = (1/σ²) Σ { −(η_j^ψ log η_j) η_j^ψ log η_j + (y_j − η_j^ψ) η_j^ψ (log η_j)² },

and as Y_j ∼ N(η_j, σ²) we have, at ψ = 1,

−E[ ∂²l/∂ψ² ] = (1/σ²) Σ η_j^ψ log η_j · η_j^ψ log η_j = (1/σ²) Σ u_j²,

as required. The off-diagonal terms in the information matrix can be calculated in a similar way, using that ∂η_j/∂β = x_j.

Multiple tests

When many tests are applied to the same data, there is a tendency for some p-values to be small. Suppose P_1, ..., P_m are the random p-values for m independent tests at level α (before seeing the data); for each i, suppose that P(P_i ≤ α) = α if the null hypothesis is true. Then the probability that at least one of the null hypotheses is rejected, when m independent tests are carried out and all the null hypotheses are true, is

1 − P(none rejected) = 1 − (1 − α)^m.

Example: if α = 0.05 and m = 10, then P(at least one rejected | all H_0 true) ≈ 0.4013. Thus with high probability at least one significant result will be found even when all the null hypotheses are true.

Bonferroni: the Bonferroni inequality gives that

P(min P_i ≤ α | H_0) ≤ Σ_{i=1}^{m} P(P_i ≤ α | H_0) ≤ mα.

A cautious approach for an overall level α is therefore to declare the most significant of m test results significant at level α only if min p_i ≤ α/m. Example: if α = 0.05 and m = 10, then reject only if the smallest p-value is less than 0.005.
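Both the multiplicity problem and the Bonferroni fix can be checked by simulating uniform p-values under H_0 (a hypothetical setup with m = 10 independent tests):

```python
import random

# Family-wise error under m independent level-alpha tests, and the Bonferroni fix.
alpha, m = 0.05, 10
print(1 - (1 - alpha) ** m)           # exact P(at least one rejection | all H0 true)

random.seed(4)
trials = 100_000
naive = sum(any(random.random() <= alpha for _ in range(m))
            for _ in range(trials)) / trials
bonf = sum(any(random.random() <= alpha / m for _ in range(m))
           for _ in range(trials)) / trials
print(naive, bonf)                    # naive ~ 0.40, Bonferroni stays near 0.05
```

The naive procedure rejects something about 40% of the time under the global null, while the Bonferroni-corrected procedure keeps the family-wise error at (in fact slightly below) the nominal 0.05.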

Combining independent tests

Suppose we have k independent experiments/studies of the same null hypothesis. If only the p-values are reported, and if the test statistics have continuous distributions, we may use that under H_0 each p-value is uniformly distributed, P_i ∼ U[0, 1]. This gives that

−2 Σ_{i=1}^{k} log P_i ∼ χ²_{2k}

(exactly) under H_0, so p_comb = P(χ²_{2k} ≥ −2 Σ log p_i).
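This combination rule (Fisher's method) is easy to implement because the χ²_{2k} tail has a closed form for even degrees of freedom; the helper names and the three p-values below are hypothetical.

```python
from math import exp, log

def chi2_sf_even(x, df):
    """P(chi^2_df >= x) for even df: exp(-x/2) * sum_{j < df/2} (x/2)^j / j!."""
    term, total = 1.0, 1.0
    for j in range(1, df // 2):
        term *= (x / 2) / j
        total += term
    return exp(-x / 2) * total

def fisher_combine(pvals):
    """Fisher's method: -2 sum log p_i ~ chi^2_{2k} under H0 (independent tests)."""
    stat = -2 * sum(log(p) for p in pvals)
    return chi2_sf_even(stat, 2 * len(pvals))

# Hypothetical p-values from three independent studies of the same H0:
p_comb = fisher_combine([0.08, 0.12, 0.30])
print(f"combined p-value = {p_comb:.4f}")
```

Note how three individually non-significant p-values combine into moderately stronger, though here still not decisive, evidence against H_0.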

If each test is based on a statistic T_i such that T_i ≈ N(0, v_i), then the best combination statistic is

Z = Σ (T_i / v_i) / √( Σ v_i^{−1} ).

If H_0 is a hypothesis about a common parameter ψ, then the best combination of evidence is Σ l_{P,i}(ψ), the sum of the profile log-likelihoods, and the combined test would be derived from this (e.g., an LR or score test).

Advice Even though a test may initially be focussed on departures in one direction, it is usually a good idea not to totally disregard departures in the other direction, even if they are unexpected. 70

Warning: Not rejecting the null hypothesis does not mean that the null hypothesis is true! Rather it means that there is not enough evidence to reject the null hypothesis; the data are consistent with the null hypothesis. The p-value is not the probability that the null hypothesis is true. 71

Nonparametric tests Sometimes we do not have a parametric model available, and the null hypothesis is phrased in terms of arbitrary distributions, for example concerning only the median of the underlying distribution. Such tests are called nonparametric or distribution-free; treating these would go beyond the scope of these lectures. 72