Lecture 10: Generalized likelihood ratio test


Stat 200: Introduction to Statistical Inference, Autumn 2018/19
Lecture 10: Generalized likelihood ratio test
Lecturer: Art B. Owen, October 25

Disclaimer: These notes have not been subjected to the usual scrutiny reserved for formal publications. They are meant as a memory aid for students who took Stat 200 at Stanford University. They may be distributed outside this class only with the permission of the instructor. Also, Stanford University holds the copyright.

Abstract: These notes are mnemonics about what was covered in class. They don't replace being present or reading the book. Reading ahead in the book is very effective. In this lecture we looked at the generalized likelihood ratio test (GLRT), with an emphasis on multinomial models.

10.1 Generalized likelihood ratio test

The GLRT is used to test H_0 versus an alternative H_A. In parametric models we describe H_0 as a set of θ values. If H_0 is simple there is only one θ ∈ H_0. For a composite H_0 there can be many such θ. The alternative H_A is similarly a set of θ values, and we let H_1 = H_0 ∪ H_A. That is, H_1 includes both the null and the alternative. A generalized likelihood ratio test of H_0 versus H_A rejects H_0 when

$$\Lambda = \frac{\max_{\theta \in H_0} L(\theta)}{\max_{\theta \in H_1} L(\theta)} \le \lambda_0$$

for some cutoff value λ_0. Because any θ ∈ H_0 is also in H_1, we always have Λ ≤ 1. Because the θ values in H_0 are a subset of those in H_1, we often call H_0 a submodel.

To get a test with type I error α (such as α = 0.01) we have to work out the distribution of Λ under H_0. That can be extremely hard. When H_0 is composite, the distribution of Λ could depend on which θ ∈ H_0 is the true one. There is a setting where the distribution of Λ can be known approximately. If the data X_1, ..., X_n are IID with PDF f(x; θ) or PMF p(x; θ), and that f (or p) satisfies regularity conditions like the ones we had for Fisher information, then we get (Theorem 9.4A of Rice)

$$-2\log\Lambda \to \chi^2_{(d)} \quad (\text{under } H_0)$$

as n → ∞, where d = (number of free parameters in H_1) − (number of free parameters in H_0).
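As a concrete check of this theorem (my own sketch, not part of the notes), consider the simplest case: X_i ~ N(µ, 1) with H_0: µ = 0 against an unrestricted mean. For this model −2 log Λ works out to n X̄² (an identity of this kind is derived later in these notes), and d = 1, so under H_0 the statistic should behave like a χ²(1) draw. In particular it should exceed the 0.95 quantile of χ²(1), about 3.84, roughly 5% of the time:

```python
import random
import statistics

# Monte Carlo sketch of the chi-squared limit for X_i ~ N(mu, 1), H0: mu = 0.
# Here -2 log(Lambda) = n * xbar^2, which under H0 behaves like chi-squared(1),
# so it should exceed 3.841 (the 0.95 quantile) about 5% of the time.
random.seed(0)
n, reps = 50, 10_000
exceed = 0
for _ in range(reps):
    x = [random.gauss(0.0, 1.0) for _ in range(n)]
    stat = n * statistics.fmean(x) ** 2   # -2 log(Lambda) for this model
    if stat > 3.841:                      # 0.95 quantile of chi-squared(1)
        exceed += 1
rate = exceed / reps
print(rate)   # close to 0.05
```

For normal data with known variance this distribution is exact for every n, not just in the limit; the simulation is still a useful template for models where no exact result is available.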
The number of free parameters is the dimensionality of the space that the parameter vector can move in. Here is an example of counting parameters. Suppose that X ~ N(2, 1) under H_0, that X ~ N(1 + σ², σ²) under H_1, and that X ~ N(µ, σ²) under H_2. Then H_2 has two free parameters: the vector θ = (µ, σ) can move freely over a two-dimensional region. Hypothesis H_1 has only one free parameter because θ = (1 + σ², σ²) lies on just one curve. Hypothesis H_0 has no free parameters. It includes just one point. Applying that theorem we find

$$-2\log\frac{\max_{\theta \in H_0} L(\theta)}{\max_{\theta \in H_1} L(\theta)} \to \chi^2_{(1)} \quad \text{under } H_0,$$

$$-2\log\frac{\max_{\theta \in H_1} L(\theta)}{\max_{\theta \in H_2} L(\theta)} \to \chi^2_{(1)} \quad \text{under } H_1, \text{ and}$$

$$-2\log\frac{\max_{\theta \in H_0} L(\theta)}{\max_{\theta \in H_2} L(\theta)} \to \chi^2_{(2)} \quad \text{under } H_0.$$

Note the middle example carefully: we used H_1 as a null there. Intuitively, the more free parameters you add to the alternative hypothesis, the better the alternative is at explaining the data relative to the null, and hence the larger −2 log(Λ) needs to be to provide evidence against the null.

To illustrate free parameters some more, suppose that θ = (θ_1, θ_2, θ_3). If your hypothesis is θ_1² + θ_2² + θ_3² ≤ 1, then θ lies in the inside of a three-dimensional ball and it has three free parameters. If your hypothesis is θ_1² + θ_2² + θ_3² = 1, then θ lies on the surface of a sphere and it only has two free parameters. If we added another equality constraint like θ_1 = θ_2, we would be down to one free parameter. The multinomial distributions that we consider next have a similar issue. They involve a set of probabilities that add up to one. Then the last one is one minus the sum of the others and is not free to vary.

10.2 Multinomial

One way to generate a multinomial sample is to place n items randomly into m distinct groups: each item is placed independently, the chance that an item lands in group j is p_j for j = 1, ..., m, and the multinomial data are (n_1, n_2, ..., n_m), where n_j is the number of items that landed in group j. People often visualize it as balls dropping into bins. It is presented on page 73 of Rice and then again on page 272 and also page 341, where the data are written (x_1, x_2, ..., x_m). Same thing. The multinomial probability is

$$\frac{n!}{n_1!\, n_2! \cdots n_m!} \prod_{j=1}^m p_j^{n_j}$$

if the n_j ≥ 0 are integers summing to n (and zero otherwise). Using Lagrange multipliers (p. 272) gives the MLE p̂_j = n_j/n. Although there are m parameters p_1, ..., p_m, we must have Σ_j p_j = 1, so p_m = 1 − Σ_{1≤j<m} p_j. We are free to choose any m − 1 of the p_j, but then the last one is determined. So there are m − 1 free parameters.

Suppose that p_j = p_j(θ) for some parameter θ. In class and also in Rice there is the Hardy-Weinberg equilibrium model. It is a multinomial for 0, 1 or 2 copies of a given gene in some plant or animal. It is more natural to use a multinomial indexed by j = 0, 1, 2 for this distribution. It has p_0(θ) = (1 − θ)², p_1(θ) = 2θ(1 − θ)

and p_2(θ) = θ². We could have these hypotheses:

H_0: p_0 = 1/9, p_1 = 4/9, p_2 = 4/9;
H_1: p_0 = (1 − θ)², p_1 = 2θ(1 − θ), p_2 = θ², for 0 ≤ θ ≤ 1;
H_2: p_0 = θ_0, p_1 = θ_1, p_2 = 1 − θ_0 − θ_1.

Now H_0 is a simple hypothesis with no free parameters, H_1 is the Hardy-Weinberg equilibrium model with one free parameter, and H_2 has two free parameters with legal values θ_0 ≥ 0, θ_1 ≥ 0 subject to θ_0 + θ_1 ≤ 1. Hypotheses H_0, H_1 and H_2 have 0, 1 and 2 free parameters, respectively. Because H_0 is contained in H_1, which is contained in H_2, we can use the log likelihood ratio to test H_0 versus H_1 and H_1 versus H_2. Testing H_1 versus H_2 is a goodness of fit test. It tests whether any Hardy-Weinberg model could be true. If it is rejected, then we conclude that the data did not come from any Hardy-Weinberg model.

When we have a submodel of the multinomial (like Hardy-Weinberg) then the likelihood is

$$L = \frac{n!}{n_1!\, n_2! \cdots n_m!} \prod_{j=1}^m p_j(\theta)^{n_j}.$$

The log likelihood is

$$\ell = c + \sum_{j=1}^m n_j \log(p_j(\theta)),$$

which we then maximize over θ to get the MLE. For instance, with Hardy-Weinberg,

$$\ell = c + n_0 \log((1-\theta)^2) + n_1 \log(2\theta(1-\theta)) + n_2 \log(\theta^2),$$

which we can easily maximize over θ. (Do it if you have not already.)

We can test a null hypothesis submodel p_j = p_j(θ) by a GLRT versus the model where there are m − 1 free parameters. In that model p̂_j = n_j/n. In the submodel we get an MLE θ̂ and then we can form p_j(θ̂). Write the observed count in group j as O_j = n_j = n p̂_j and the expected count under the submodel as E_j = n p_j(θ̂). Rice shows that

$$-2\log(\Lambda) = 2\sum_{j=1}^m O_j \log\Bigl(\frac{O_j}{E_j}\Bigr)$$

and that by a Taylor approximation

$$-2\log(\Lambda) \approx \sum_{j=1}^m \frac{(O_j - E_j)^2}{E_j}.$$

The numerator (O_j − E_j)² is the squared difference between what we saw and what is expected under the best fitting choice of θ for the submodel. So it makes sense that larger values are stronger evidence against H_0.
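To make the recipe concrete, here is a sketch (mine, not from the notes) of the goodness of fit test of Hardy-Weinberg (H_1) against the unrestricted trinomial (H_2). Setting the derivative of the Hardy-Weinberg log likelihood to zero gives the closed form θ̂ = (n_1 + 2n_2)/(2n), which is worth verifying by hand:

```python
import math

def hw_glrt(n0, n1, n2):
    """-2 log(Lambda) for Hardy-Weinberg versus the unrestricted trinomial.

    Asymptotically chi-squared with 2 - 1 = 1 degree of freedom when the
    Hardy-Weinberg model holds.
    """
    n = n0 + n1 + n2
    theta = (n1 + 2 * n2) / (2 * n)   # MLE under Hardy-Weinberg
    probs = [(1 - theta) ** 2, 2 * theta * (1 - theta), theta ** 2]
    observed = [n0, n1, n2]
    expected = [n * p for p in probs]
    # Terms with O_j = 0 contribute 0 in the limit, so they are skipped.
    return 2 * sum(o * math.log(o / e)
                   for o, e in zip(observed, expected) if o > 0)

print(hw_glrt(25, 50, 25))   # perfect Hardy-Weinberg fit: statistic is 0
print(hw_glrt(30, 50, 20))   # mild departure: small positive statistic
```

Since the first sample fits a Hardy-Weinberg model exactly (θ̂ = 1/2 reproduces the observed proportions), its statistic is zero; any departure makes the statistic strictly positive.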
As for the denominator, larger E_j implies higher average values of (O_j − E_j)² under the submodel, and the denominator roughly compensates for that.

10.3 Fisher vs Mendel

We looked at the peas example. The χ² statistic compared a fixed distribution (no free parameters) under the null to a general alternative. That has 3 degrees of freedom because there is a 2 × 2 table of probabilities

(four parameters) that must add up to one (reducing it to three free parameters). Then −2 log(Λ) = 0.618. The p-value is p = Pr(χ²(3) ≥ 0.618) ≈ 0.9. We would get this much departure or more about 90% of the time even if H_0 were true. So we have no reason to reject Mendel's H_0.

Fisher pooled information from many of Mendel's data sets. Since the data sets are independent, the sum of their chi-square test statistics is also chi-square, with degrees of freedom equal to the sum of the degrees of freedom of the individual tests. The result was p = 0.99996. Now the data fit the model seemingly too well. As we saw some lectures ago, under H_0 the p-value has a U(0, 1) distribution. (That is exactly true for a continuous random variable, and it will be close to true for Mendel's data.)

In class somebody asked how we can put this p-value into a testing framework. Here is one way. Mendel has a null hypothesis about how the peas will come out; then he gets data on peas and a p-value p_Mendel, where he does not reject his H_0. Then later Fisher has a null hypothesis about Mendel. The null is that Mendel used a p-value with the U(0, 1) distribution, and the alternative is that Mendel did something that was less likely to reject. Fisher's data is p_Mendel, and he rejects his null for large values of p_Mendel. For instance, he could reject at the 0.05 level if p_Mendel ≥ 0.95. Under this argument Fisher gets a p-value p_Fisher = 1 − 0.99996 = 0.00004. Of course Mendel didn't really compute p-values, because the χ² test came much later.

Rice has a hunch that Mendel kept sampling until the data looked good and then stopped. Maybe. I wonder how likely one would be to get there that way. Also, I don't think Mendel computed any of those test statistics. My different hunch is that Mendel may have had several sets of data, discarded the ones that didn't look right, perhaps thinking something had gone wrong, and then the retained data fit too well.
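The p ≈ 0.9 figure can be checked without any statistics library (my own sketch, not part of the notes): for 3 degrees of freedom the χ² survival function has the closed form P(χ²(3) > x) = 2(1 − Φ(√x)) + √(2x/π) e^{−x/2}, where Φ is the standard normal CDF.

```python
import math

def chi2_sf_3(x):
    # P(chi-squared(3) > x) = 2*(1 - Phi(sqrt(x))) + sqrt(2x/pi)*exp(-x/2),
    # written with erfc since 1 - Phi(z) = erfc(z / sqrt(2)) / 2.
    z = math.sqrt(x)
    return math.erfc(z / math.sqrt(2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)

print(chi2_sf_3(0.618))   # about 0.89, matching the p ≈ 0.9 in the notes
```

With a statistics library available, `scipy.stats.chi2.sf(0.618, 3)` would give the same number without the hand-derived formula.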
10.4 Practical vs statistical significance

Consider these three multinomial samples:

  n_1      n_2      n_3      n_4      n_5      −2 log(Λ)   p
  103      101      97       103      96       0.44        0.51
  1030     1010     970      1030     960      4.40        0.035
  10300    10100    9700     10300    9600     44.0        3.1 × 10⁻¹¹

The null hypothesis is that p_j = 1/5 for j = 1, ..., 5. The given chi-squared statistic has 4 degrees of freedom because the null has 0 free parameters and the alternative has 4. The third row is extremely statistically significant. That does not mean that the result is important. It looks like the counts vary by up to 3 or 4 percent from equality. That size of fluctuation might not matter. If it does not matter, then the fact that it is statistically significant does not make it matter. Statistical significance captures how unlikely something is under H_0. With more data, even a small difference will eventually become statistically significant. For instance, suppose that X_i ~ N(µ, 1) independently and that H_0 has µ = µ_0 = 1. Suppose µ is really 1.001. Then the likelihood ratio statistic is

$$\Lambda = \prod_{i=1}^n \frac{\frac{1}{\sqrt{2\pi}} e^{-(X_i - \mu_0)^2/2}}{\frac{1}{\sqrt{2\pi}} e^{-(X_i - \bar X)^2/2}}$$

because we know the MLE is µ̂ = X̄. Now

$$\begin{aligned}
-2\log(\Lambda) &= \sum_{i=1}^n (X_i - \mu_0)^2 - \sum_{i=1}^n (X_i - \bar X)^2 \\
&= \sum_{i=1}^n (2X_i - \bar X - \mu_0)(\bar X - \mu_0) \\
&= (\bar X - \mu_0)\sum_{i=1}^n (2X_i - \bar X - \mu_0) \\
&= (\bar X - \mu_0)(2n\bar X - n\bar X - n\mu_0) \\
&= n(\bar X - \mu_0)^2.
\end{aligned}$$

Now the LLN gives X̄ → µ ≠ µ_0. So when µ ≠ µ_0, this χ² statistic goes off to infinity and the p-value goes to 0.

For a real-valued parameter we can use confidence intervals to study practical and statistical significance together. Let L and U be the end points of the confidence interval. When L < θ_0 < U we don't reject. If there is a practical difference between the values L and U, then it is clear that we don't know enough about θ. If possible we should get more data. If instead the practical consequences of θ = L and θ = U (and any value in between) are not materially different, then we can proceed as if θ = θ_0.
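The algebra above can be checked numerically (a sketch of mine, not from the notes): on simulated data, the definition of −2 log(Λ) and the closed form n(X̄ − µ_0)² agree to floating-point precision.

```python
import random
import statistics

# Numerical check of the identity
#   -2 log(Lambda) = sum (X_i - mu0)^2 - sum (X_i - xbar)^2 = n*(xbar - mu0)^2
# for testing H0: mu = mu0 with N(mu, 1) data.
random.seed(0)
mu0 = 1.0
n = 1000
x = [random.gauss(1.001, 1.0) for _ in range(n)]   # true mu = 1.001, as in the text
xbar = statistics.fmean(x)

stat_direct = sum((xi - mu0) ** 2 for xi in x) - sum((xi - xbar) ** 2 for xi in x)
stat_identity = n * (xbar - mu0) ** 2
print(stat_direct, stat_identity)   # the two values agree
```

Rerunning with larger n shows the statistic drifting upward when µ ≠ µ_0, which is the point of the section: enough data makes even a 0.001 shift statistically significant.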