STATS 300A: Theory of Statistics                                        Fall 2015
Lecture 12: November 3
Lecturer: Lester Mackey                  Scribe: Jae Hyuck Park, Christian Fong

Warning: These notes may contain factual and/or typographic errors.

12.1 Summary

In this lecture, we briefly review the concepts developed for tackling the problem of optimal inference in the context of point estimation. We then consider a different kind of inference setting, that of hypothesis testing. We develop the Neyman-Pearson paradigm and show how to find optimal tests for so-called simple problems.

12.2 Point Estimation Recap

In the first part of the course, we focused on optimal inference in the setting of point estimation (see Figure 12.1). We formulated this problem in the framework of decision theory and focused on finite-sample criteria of optimality. We immediately discovered that uniform optimality was seldom attainable in practice, and thus we developed our theory of optimality along two lines: constraining and collapsing.

To restrict ourselves to interesting subclasses of estimators, we first introduced the notion of unbiasedness, which led us to UMRUEs/UMVUEs. Then we considered certain symmetry constraints in the context of location-invariant decision problems; this was formalized via the concept of equivariance, which led us to MREs. The situation is similar in hypothesis testing: to develop a useful notion of optimality, we will need to impose constraints on tests. These constraints arise in the form of risk bounds, unbiasedness, and equivariance.

Another way to achieve optimality was to collapse the risk function. We introduced the notions of average risk (optimized by Bayes estimators) and worst-case risk (optimized by minimax estimators). We saw that Bayes estimators have many desirable properties and provide tools for reasoning about both concepts.

We also considered a few different model families, such as exponential families and location and scale families, each with its own notion of optimality: optimal unbiased estimators are easy to find in exponential families with complete sufficient statistics, and for location-scale families we defined best equivariant estimators.

12.3 Hypothesis Testing

We now shift our attention to the problem of hypothesis testing. As discussed above, many of the same issues we encountered in defining and achieving optimality for point estimation will reemerge here. However, we do encounter two entirely new objects: the null hypothesis, H_0, and the alternative hypothesis, H_1. It is this development which will lead us to introduce the notion of risk bounds.

Figure 12.1: Review of point estimation.

12.3.1 Model Setup

Hypothesis testing is just a particular type of decision problem. As usual, we assume that the data is sampled according to X ~ P_θ and that P_θ belongs to the model P = {P_θ : θ ∈ Ω}. In addition to the standard setup, we divide the models in P into two disjoint subclasses known as hypotheses:

    H_0 : θ ∈ Ω_0 ⊂ Ω          (null hypothesis)
    H_1 : θ ∈ Ω_1 = Ω \ Ω_0    (alternative hypothesis)

Our goal is to infer which hypothesis is correct. This can be cast as classification, so our decision space is D = {accept H_0, reject H_0}.

Example 1. You have developed a new anti-itch cream, and you suspect that it may have grave side effects. We can test this hypothesis on a population of mice by applying anti-itch cream to some and no cream to others. In particular, we would like to see if there is a change in life expectancy when the cream is used. Thus, the two hypotheses are H_0: no change in life expectancy, and H_1: change in life expectancy after application of the cream. Based on the results of the experiments, we can choose to accept or reject the null hypothesis.

While one could imagine equipping this decision problem with a variety of loss functions, there is a canonical loss function L(θ, d) that gives rise to the optimality goals classically espoused in hypothesis testing. We specify this loss in matrix form in Table 12.1. The columns represent the true state of nature, i.e., whether θ ∈ Ω_0 or θ ∈ Ω_1, while the rows indicate the decision that was made.

                   θ ∈ Ω_0              θ ∈ Ω_1
    Reject H_0     1 (Type I error)     0 (good)
    Accept H_0     0 (good)             1 (Type II error)

Table 12.1: Canonical loss function L(θ, d).

We have two types of error which induce a loss. A Type I error or false positive occurs when we reject H_0 when it is in fact true. Similarly, a Type II error or false negative occurs when we accept H_0 when it is false. In practice, one might assign different loss values to the two types of error, as they could have significantly different consequences. This would lead us back towards point estimation, where the target is now a binary parameter indicating whether H_0 or H_1 is true. What we do below is different from this formulation: we in fact induce a more stringent form of asymmetry between Type I and Type II errors through the introduction of risk bounds in the next section. For now, we will forgo discussing the advantages and disadvantages of the respective formulations.

Because our loss function is this kind of indicator function, throughout our discussion of hypothesis testing we will allow for randomized decision procedures δ_φ, which we specify in terms of a test function φ(x) ∈ [0, 1] (also known as the critical function). Recall that randomized decision procedures are functions of the data and an independent source of randomness U ~ Unif[0, 1]. The test function φ indicates that δ_φ(X, U) rejects H_0 with probability φ(X). In other words,

    φ(x) = P(δ_φ(X, U) = reject H_0 | X = x).

The function φ completely specifies the behavior of δ_φ, and it will be convenient to work directly with φ in place of δ_φ. While we could safely ignore randomized procedures when considering optimality under convex loss functions (due to the Rao-Blackwell theorem), we must account for improvements due to randomization under our non-convex loss L.

In order to start reasoning about optimality in this context, we introduce the following definition.

Definition 1. The power function of a test φ is β(θ) = E_θ[φ(X)] = P_θ(reject H_0).

Indeed, we can describe our risk function wholly in terms of this power function:

    If θ_0 ∈ Ω_0, then β(θ_0) = R(θ_0, δ_φ) = Type I error rate.
    If θ_1 ∈ Ω_1, then β(θ_1) = 1 − R(θ_1, δ_φ) = 1 − Type II error rate.

Our ideal optimality goal is to minimize β(θ_0) uniformly over all θ_0 ∈ Ω_0 and maximize β(θ_1) uniformly over all θ_1 ∈ Ω_1. Unfortunately, it is typically impossible to minimize all errors across all parameters. So what can we do? We can constrain the form of the procedures that we are allowed to use, leading to the Neyman-Pearson paradigm of hypothesis testing.
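To make the test function φ and the power function β(θ) concrete, here is a minimal Python sketch (illustrative, not part of the notes) that Monte Carlo estimates β(θ) for a nonrandomized test in a N(θ, 1) location model; the cutoff 1.645 (roughly the 0.95 normal quantile), the sample size, and the model itself are arbitrary choices made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def phi(x, cutoff=1.645):
    """Test function: probability of rejecting H0 given data x.
    Here a nonrandomized test that rejects when sqrt(n)*xbar exceeds the cutoff
    (sigma = 1 assumed), so phi(x) is either 0 or 1."""
    n = len(x)
    return float(np.sqrt(n) * np.mean(x) > cutoff)

def power(theta, n=10, reps=20_000):
    """Monte Carlo estimate of beta(theta) = E_theta[phi(X)]."""
    rejections = [phi(rng.normal(loc=theta, scale=1.0, size=n)) for _ in range(reps)]
    return float(np.mean(rejections))

print(power(0.0))   # Type I error rate when Omega_0 = {0}; close to 0.05 here
print(power(0.5))   # power beta(0.5) at an alternative theta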

12.4 The Neyman-Pearson Paradigm

We are going to start by fixing a value α ∈ (0, 1), which we call the level of significance. We will require that our procedures satisfy the following risk bound:

    sup_{θ_0 ∈ Ω_0} E_{θ_0}[φ(X)] = sup_{θ_0 ∈ Ω_0} β(θ_0) ≤ α.

The quantity sup_{θ_0 ∈ Ω_0} β(θ_0) is called the size of the test. If the size of the test φ is bounded by α, φ is called a level α test. The level of the test represents the tolerance we have for falsely rejecting the null hypothesis. Essentially, instead of trying to minimize the Type I error, the Neyman-Pearson paradigm simply bounds it and focuses on minimizing the Type II error. Our new optimality goal can be summarized as follows.

Optimality Goal: Find a level α test that maximizes the power β(θ_1) = E_{θ_1}[φ(X)] for each θ_1 ∈ Ω_1. Such a test is called uniformly most powerful (UMP).

We still need to maximize power uniformly over all alternatives θ_1, so what have we gained by working under this risk bound? It turns out that in special cases, UMP (i.e., optimal) tests, formulated in this way, exist. The intuition is as follows. A test is determined by the region of the sample space on which it rejects the null hypothesis (the rejection region). The differences between data generated from θ ∈ Ω_0 and data generated from θ ∈ Ω_1 determine the shape of effective versus ineffective rejection regions. If it turns out that a certain shape is optimal for all pairs θ_0 ∈ Ω_0 and θ_1 ∈ Ω_1, then we will have a UMP test. When UMP tests do not exist, we will introduce additional constraints such as unbiasedness and equivariance, as we did with point estimation.

12.5 MP Tests for the Simple Case

Definition 2. A hypothesis H_0 is called simple if |Ω_0| = 1; otherwise it is called composite. The same terminology applies to H_1.

The simple case in hypothesis testing is the case in which both H_0 and H_1 are simple. In this case, we will use the notation

    H_0 : X ~ p_0
    H_1 : X ~ p_1

where p_0, p_1 denote the densities of P_{θ_0}, P_{θ_1} with respect to some common measure µ, and we call E_{p_1}[φ(X)] the power of the test φ. Our goal in the simple case can be compactly described as:

    max_φ E_{p_1}[φ(X)]   subject to   E_{p_0}[φ(X)] ≤ α.

Any maximizing test φ is called most powerful (MP). (We drop the word "uniformly" as we are only interested in maximizing the power under a single alternative distribution.) In this setting, it turns out that it is not too difficult to find such a test; the Neyman-Pearson lemma below exhibits one. (See also the numerical sketch that follows.)
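Before stating the lemma, one way to see what the constrained problem above is asking for: when the sample space is finite, maximizing E_{p_1}[φ(X)] subject to E_{p_0}[φ(X)] ≤ α over test functions φ(x) ∈ [0, 1] is a linear program. The following sketch (not from the notes; the densities p0, p1 and the level α are made up for illustration) solves one such instance with scipy.

```python
import numpy as np
from scipy.optimize import linprog

# Finite sample space with hypothetical densities p0 (null) and p1 (alternative).
p0 = np.array([0.5, 0.3, 0.2])
p1 = np.array([0.2, 0.3, 0.5])
alpha = 0.10

# Decision variables: phi(x) in [0, 1] for each sample point x.
# Maximize sum_x p1(x) phi(x)  <=>  minimize -p1 . phi,
# subject to the size constraint sum_x p0(x) phi(x) <= alpha.
res = linprog(c=-p1, A_ub=p0.reshape(1, -1), b_ub=[alpha], bounds=[(0, 1)] * 3)
phi = res.x
print("MP test function phi:", np.round(phi, 3))
print("size:", float(p0 @ phi), "power:", float(p1 @ phi))
```

The optimal φ rejects outright on the point with the largest likelihood ratio p1(x)/p0(x) and is fractional (randomized) where the Type I budget runs out, foreshadowing the likelihood-ratio structure the lemma establishes.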

Lemma 1 (Neyman-Pearson).

(i) Existence. For testing H_0 : p_0 vs. H_1 : p_1, there exist a test φ(x) and a constant k such that:

    (a) E_{p_0}[φ(X)] = α    (size = level);

    (b) φ(x) = 1 if p_1(x)/p_0(x) > k    (always reject if the likelihood ratio is > k),
        φ(x) = 0 if p_1(x)/p_0(x) < k    (always accept if the likelihood ratio is < k).

Such a test is called a likelihood ratio test (LRT).

(ii) Sufficiency. If a test satisfies (a) and (b) for some constant k, it is most powerful for testing H_0 : p_0 vs. H_1 : p_1 at level α. (Hence, the LRT from part (i) is most powerful.)

(iii) Necessity. If a test φ is MP at level α, then it satisfies (b) for some k, and it also satisfies (a) unless there exists a test of size < α with power 1. (In the latter case, we did not need to expend all of our budgeted Type I error.)

This is a comprehensive lemma that essentially solves the hypothesis testing problem in the simple case. Note that the lemma makes no explicit mention of how the test behaves when p_1(x)/p_0(x) = k. When p_0 and p_1 are continuous with respect to Lebesgue measure, this region has measure 0 and hence is of no consequence to us. Otherwise, the behavior in this region is usually determined by the desire to satisfy the size = level constraint of part (a), as we will see next time when we prove the NP lemma.

Example 2. As a first example, consider a situation where you observe the first symptom X of an illness, and the goal is to distinguish between two possible illnesses (two different distributions over X). The problem parameters are given in the following table:

    X                      sneezing   fever    fainting   sore throat   runny nose
    H_0 : p_0 (cold)       1/4        1/100    1/100      3/100         70/100
    H_1 : p_1 (flu)        1/2        10/100   2/100      5/100         33/100
    r(x) = p_1(x)/p_0(x)   2          10       2          5/3           33/70

We want to come up with a most powerful test for this model. According to the NP lemma, we need to compute the likelihood ratio r(x) = p_1(x)/p_0(x). Now, suppose that the test rejects the cold hypothesis iff X ∈ {fever}. Is this MP for some level α? This test satisfies part (b) of the NP lemma for any k ∈ (2, 10), say k = 5. The size is P_{p_0}(X = fever) = 1/100, which implies this is a most powerful test at α = 1/100.

The flexibility to specify the outcome of the test when r(x) = k is important. To see this, suppose φ rejects if X ∈ {fever, fainting}. This satisfies (b) of the NP lemma for k = 2; simply specify that φ(fainting) = 1 and φ(sneezing) = 0. The size is P_{p_0}(X ∈ {fever, fainting}) = 2/100, so it is most powerful at α = 2/100. (A short computational sketch of this construction appears below.)
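The following Python sketch (illustrative, not part of the notes) carries out the construction of Example 2 with exact fractions: it ranks symptoms by likelihood ratio and fills the rejection region greedily, randomizing on the last symptom used so that the size equals α exactly.

```python
from fractions import Fraction as F

symptoms = ["sneezing", "fever", "fainting", "sore throat", "runny nose"]
p0 = dict(zip(symptoms, [F(1, 4), F(1, 100), F(1, 100), F(3, 100), F(70, 100)]))   # cold
p1 = dict(zip(symptoms, [F(1, 2), F(10, 100), F(2, 100), F(5, 100), F(33, 100)]))  # flu

# Rank outcomes by likelihood ratio r(x) = p1(x)/p0(x), most "flu-like" first.
ranked = sorted(symptoms, key=lambda x: p1[x] / p0[x], reverse=True)

def mp_test(alpha):
    """Greedy Neyman-Pearson construction on a finite sample space: reject on the
    largest-ratio outcomes until the Type I budget alpha is spent, randomizing on
    the last outcome used so that the size equals alpha exactly (part (a))."""
    phi = {x: F(0) for x in symptoms}
    budget = alpha
    for x in ranked:
        phi[x] = min(F(1), budget / p0[x])
        budget -= phi[x] * p0[x]
    return phi

phi = mp_test(F(2, 100))
size = sum(phi[x] * p0[x] for x in symptoms)
power = sum(phi[x] * p1[x] for x in symptoms)
print(phi)           # fever rejected outright; randomization on a ratio-2 symptom
print(size, power)   # size = 2/100, power = 12/100
```

Because sneezing and fainting are tied at r(x) = 2, this greedy construction randomizes on sneezing rather than rejecting on fainting; both choices satisfy part (b) with k = 2 and attain the same power, 12/100, illustrating the non-uniqueness at r(x) = k discussed above.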

Next, we will look at a setting involving continuous density functions.

Example 3. Let X_1, ..., X_n be i.i.d. N(µ, σ²), with σ known. Consider the following two hypotheses:

    H_0 : µ = 0   and   H_1 : µ = µ_1,

where µ_1 is known. Since this is a simple case (only two distributions), we calculate the likelihood ratio:

    r(x) = p_1(x)/p_0(x)
         = [ ∏_{i=1}^n (1/(√(2π) σ)) exp(−(x_i − µ_1)²/(2σ²)) ] / [ ∏_{i=1}^n (1/(√(2π) σ)) exp(−x_i²/(2σ²)) ]
         = exp( (µ_1/σ²) ∑_{i=1}^n x_i − nµ_1²/(2σ²) ).

We observe that the likelihood ratio is only a function of the sufficient statistic ∑_i x_i (this is true in general by the Factorization Criterion). We have the following equivalences:

    r(x) > k  ⟺  (µ_1/σ²) ∑ x_i − nµ_1²/(2σ²) > log k
              ⟺  µ_1 ∑ x_i > k′
              ⟺  ∑ x_i > k″ if µ_1 > 0,   or   ∑ x_i < k″ if µ_1 < 0.

Let us focus on the first case, where µ_1 > 0 (but it is important to note that the two cases µ_1 > 0 and µ_1 < 0 induce different rejection regions). We can rewrite the test in a different form so that the left-hand side of the inequality has a standard normal distribution under the null:

    √n x̄ / σ > k.

Note: Here the sufficient statistic essentially determines an MP test. Also, observe that for a given level α, the constant involved in the LRT is uniquely determined by the constraint E_{p_0}[φ(X)] = α and thus depends only on the null distribution and not at all on µ_1 (provided that µ_1 > 0). Thus, we have by the NP lemma that the test

    reject H_0 iff √n x̄ / σ > k(α)

is MP, where we pick k = k(α) = z_{1−α}, the (1 − α) quantile of the standard normal, so that the size equals the target level. This value is uniquely determined by the size constraint:

    E_{p_0}[φ(X)] = α = P_{µ=0}( √n X̄ / σ > k ).

Now an important observation: for every µ_1 > 0 the MP test is the same, which means that it is actually UMP at level α for testing

    H_0 : µ = 0   and   H_1 : µ > 0.

There is no µ_1 dependence here, which is convenient: the same test is most powerful against every alternative µ_1 > 0 at once. Similarly, we could derive a distinct UMP test for testing against H_1 : µ < 0. Unfortunately, no UMP test exists for testing

    H_0 : µ = 0   and   H_1 : µ ≠ 0,

because the µ > 0 test dominates the µ < 0 test in the µ > 0 scenario and vice versa; i.e., the shape of the most powerful rejection region (whether we reject for x̄ > k or for x̄ < k) depends on the sign of µ_1.
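As a concrete companion to Example 3, here is a minimal Python sketch (illustrative, not part of the notes) of the one-sided UMP test: it computes the cutoff k(α) = z_{1−α} with scipy and evaluates the power β(µ_1) = 1 − Φ(z_{1−α} − √n µ_1/σ) at a chosen alternative; the sample size, σ, and µ_1 values below are arbitrary.

```python
import numpy as np
from scipy.stats import norm

def ump_one_sided(x, sigma, alpha):
    """UMP level-alpha test of H0: mu = 0 vs H1: mu > 0 (sigma known):
    reject iff sqrt(n) * xbar / sigma > z_{1-alpha}."""
    n = len(x)
    z = np.sqrt(n) * np.mean(x) / sigma
    return z > norm.ppf(1 - alpha)

def power(mu1, n, sigma, alpha):
    """beta(mu1) = P_{mu1}( sqrt(n) Xbar / sigma > z_{1-alpha} )
                 = 1 - Phi( z_{1-alpha} - sqrt(n) mu1 / sigma )."""
    return 1 - norm.cdf(norm.ppf(1 - alpha) - np.sqrt(n) * mu1 / sigma)

rng = np.random.default_rng(1)
x = rng.normal(loc=0.3, scale=1.0, size=25)      # data drawn with mu = 0.3 (illustrative)
print(ump_one_sided(x, sigma=1.0, alpha=0.05))   # reject H0?
print(power(0.3, n=25, sigma=1.0, alpha=0.05))   # power of the same test at mu1 = 0.3
```

Note that the rejection rule itself involves only α, n, and σ; only the power calculation references a particular µ_1, mirroring the UMP observation above.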