
Lecture notes on statistical decision theory
Econ 2110, fall 2013
Maximilian Kasy
March 10, 2014

These lecture notes are roughly based on Robert, C. (2007). The Bayesian choice: from decision-theoretic foundations to computational implementation. Springer Verlag, chapter 2. (Robert is very passionately Bayesian - read critically!) You might also want to look at Casella, G. and Berger, R. (2001). Statistical inference. Duxbury Press, chapter 7.3.4.

What I would like you to take away from this section of class:

1. A general framework to think about what makes a good estimator, test, etc.
2. How the foundations of statistics relate to those of microeconomic theory.
3. In what sense the set of Bayesian estimators contains most reasonable estimators.

1 Basic definitions

A general statistical decision problem has the following components:

- Observed data X
- A statistical decision a
- A state of the world θ
- A loss function L(a, θ) (the negative of utility)
- A statistical model f(X|θ)
- A decision function a = δ(X)

The underlying state of the world θ influences the distribution of the observation X. The decision maker picks a decision a after observing X. She wants to pick a decision that minimizes loss L(a, θ), for the unknown state of the world θ. X is useful because it reveals some information about θ, at least if f(X|θ) does depend on θ. The problem of statistical decision theory is to find decision functions which are good in the sense of making loss small.

Examples of loss functions:

1. In estimation, we want to find an a which is close to some function µ of θ, such as for instance µ(θ) = E[X]. Loss is larger if the difference between our estimate and the true value is larger. A commonly used loss in this case is the squared error,
   L(a, θ) = (a − µ(θ))².   (1)

2. An alternative to squared error loss is absolute error loss,
   L(a, θ) = |a − µ(θ)|.   (2)

3. In testing, we want to decide whether some statement H₀ : θ ∈ Θ₀ about the state of the world is true. We might evaluate such a decision a ∈ {0, 1} based on the loss
   L(a, θ) = 1 if a = 1 and θ ∈ Θ₀,
             c if a = 0 and θ ∉ Θ₀,
             0 else.   (3)
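To make these loss functions concrete, here is a minimal Python sketch (not part of the original notes; the function names and the example values are illustrative) implementing squared error loss, absolute error loss, and the testing loss with penalty c from equation (3).

```python
import numpy as np

def squared_error_loss(a, mu_theta):
    # L(a, theta) = (a - mu(theta))^2
    return (a - mu_theta) ** 2

def absolute_error_loss(a, mu_theta):
    # L(a, theta) = |a - mu(theta)|
    return np.abs(a - mu_theta)

def testing_loss(a, theta_in_null, c=1.0):
    # a = 1: reject H0; a = 0: do not reject.
    # Loss 1 for a Type I error (reject although theta is in Theta_0),
    # loss c for a Type II error (fail to reject although theta is outside Theta_0).
    if a == 1 and theta_in_null:
        return 1.0
    if a == 0 and not theta_in_null:
        return c
    return 0.0

# Example: estimating mu(theta) = 2 with the estimate a = 2.5
print(squared_error_loss(2.5, 2.0))             # 0.25
print(absolute_error_loss(2.5, 2.0))            # 0.5
print(testing_loss(1, theta_in_null=True))      # 1.0 (a Type I error)
```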

Risk function: An important intermediate object in evaluating a decision function δ is the risk function R. It measures the expected loss for a given true state of the world θ,
   R(δ, θ) = E_θ[L(δ(X), θ)].   (4)
Note the dependence of R on the state of the world - a decision function might be good for some values of θ, but bad for other values. The next section will discuss how we might deal with this trade-off.

Examples of risk functions:

1. We observe X ∈ {0, ..., k}, multinomially distributed with P(X = x) = f(x). We want to estimate f(0) and loss is squared error loss. Taking the estimator δ(X) = 1(X = 0), which is unbiased for f(0), we get the risk function
   R(δ, f) = E[(δ(X) − f(0))²] = Var(δ(X)) = f(0)(1 − f(0)).   (5)

2. We observe X ∼ N(µ, 1), and want to estimate µ. (This example is important because of the asymptotic approximations we will discuss in the next part of class.) Loss is again squared error loss. Consider the estimator
   δ(X) = a + b·X.
We get the risk function
   R(δ, µ) = E[(δ(X) − µ)²] = Var(δ(X)) + Bias(δ(X))²
           = b²·Var(X) + (a + b·E[X] − E[X])²
           = b² + (a + (b − 1)µ)².   (6)
Choosing a and b involves a trade-off of bias and variance, and this trade-off depends on µ. (A numerical check of this example is sketched below.)

2 Optimality criteria

Throughout it will be helpful for intuition to consider the case where θ can only take two values, and to consider a 2D-graph where each axis corresponds to R(δ, θ) for θ = θ₀, θ₁.
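As promised above, here is a numerical check of Example 2 (not in the original notes; the values of a, b, and µ are arbitrary illustrative choices). The sketch estimates the risk of the linear estimator δ(X) = a + bX by Monte Carlo and compares it with the analytic formula b² + (a + (b − 1)µ)².

```python
import numpy as np

def simulated_risk(a, b, mu, n_sim=1_000_000, seed=0):
    # Monte Carlo estimate of R(delta, mu) = E[(a + b*X - mu)^2] with X ~ N(mu, 1)
    rng = np.random.default_rng(seed)
    X = rng.normal(loc=mu, scale=1.0, size=n_sim)
    return np.mean((a + b * X - mu) ** 2)

def analytic_risk(a, b, mu):
    # Variance plus squared bias: b^2 + (a + (b - 1)*mu)^2
    return b ** 2 + (a + (b - 1) * mu) ** 2

a, b = 0.5, 0.8   # a "shrinkage" estimator; values chosen for illustration only
for mu in [0.0, 1.0, 5.0]:
    print(mu, simulated_risk(a, b, mu), analytic_risk(a, b, mu))
```

The simulated and analytic risks agree up to Monte Carlo error, and comparing across values of µ makes the bias-variance trade-off visible: shrinking (b < 1) lowers the variance term but introduces a bias that grows with the distance of µ from a/(1 − b).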

The ranking provided by the risk function is multidimensional; it provides a ranking of performance between decision functions for every θ. In order to get a global comparison of their performance, we have to somehow aggregate this ranking into a global ranking. We will now discuss three alternative ways to do this:

1. The partial ordering provided by considering a decision function better relative to another one if it is better for every θ,
2. the complete ordering provided if we trade off risk across θ with prespecified weights, where the weights are given in the form of a prior distribution, and
3. the complete ordering provided if we evaluate a decision function under the worst-case scenario.

Admissibility: A decision function δ is said to dominate another function δ′ if
   R(δ, θ) ≤ R(δ′, θ)   (7)
for all θ, and
   R(δ, θ) < R(δ′, θ)   (8)
for at least one θ. It seems natural to only consider decision functions which are not dominated. Such decision functions are called admissible, all other decision functions are inadmissible. (picture with Pareto frontier!) Note that dominance only generates a partial ordering of decision functions. In general there will be many different admissible decision functions.

Bayes optimality: An approach which comes naturally to economists is to trade off risk across different θ by assigning weights π(θ) to each θ. Such an approach evaluates decision functions based on the integrated risk
   R(δ, π) = ∫ R(δ, θ) π(θ) dθ.   (9)

[Figure 1: Feasible and admissible risk functions. Axes: R(·, θ₀), R(·, θ₁); labels: feasible, admissible.]

In this context, if π can be normalized such that ∫ π(θ) dθ = 1 and π ≥ 0, then π is called a prior distribution. A Bayes decision function δ* minimizes integrated risk,
   δ* = argmin_δ R(δ, π).   (10)
(picture with linear indifference curves!)

If π is a prior distribution, we can define the posterior expected loss
   R(δ, π|X) := ∫ L(δ(X), θ) π(θ|X) dθ,   (11)
where π(θ|X) is the posterior distribution
   π(θ|X) = f(X|θ) π(θ) / m(X),   (12)
and m(X) = ∫ f(X|θ) π(θ) dθ is the normalizing constant, the prior likelihood of X. It is easy to see that any Bayes decision function can be obtained by minimizing R(δ, π|X) through choice of δ(X) for every X, since
   R(δ, π) = ∫ R(δ(X), π|X) m(X) dX.   (13)
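A minimal sketch of this pointwise minimization (not part of the original notes; the conjugate normal-normal model and all parameter values are assumptions chosen for illustration). Under squared error loss, minimizing the posterior expected loss (11) for a given X amounts to choosing the posterior mean, which the grid search below confirms numerically.

```python
import numpy as np

# Model: X | theta ~ N(theta, sigma^2), prior theta ~ N(m0, tau0^2).
# Under squared error loss the posterior expected loss E[(a - theta)^2 | X]
# is minimized by a = E[theta | X], the posterior mean.

def posterior_params(x, m0=0.0, tau0=2.0, sigma=1.0):
    # Standard conjugate update for a single observation x
    precision = 1.0 / tau0**2 + 1.0 / sigma**2
    mean = (m0 / tau0**2 + x / sigma**2) / precision
    return mean, np.sqrt(1.0 / precision)

def posterior_expected_loss(a, x, n_draws=200_000, seed=0):
    # Monte Carlo version of (11): integrate squared error loss against the posterior
    rng = np.random.default_rng(seed)
    post_mean, post_sd = posterior_params(x)
    theta = rng.normal(post_mean, post_sd, size=n_draws)
    return np.mean((a - theta) ** 2)

x = 1.5
grid = np.linspace(-1, 3, 81)
losses = [posterior_expected_loss(a, x) for a in grid]
print("argmin over grid:", grid[np.argmin(losses)])
print("posterior mean:  ", posterior_params(x)[0])   # the two should agree approximately
```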

[Figure 2: Bayes optimality. Axes: R(·, θ₀), R(·, θ₁); labels: R(δ*, ·), π(θ).]

Minimaxity: An approach which does not rely on the choice of a prior is minimaxity. This approach evaluates decision functions based on the worst-case risk,
   R̄(δ) = sup_θ R(δ, θ).   (14)
(picture with Leontief indifference curves!)
A minimax decision function, if it exists, solves the problem
   δ* = argmin_δ R̄(δ) = argmin_δ sup_θ R(δ, θ).   (15)

Some relationships between these concepts

Much can be said about the relationship between the concepts of admissibility, Bayes optimality, and minimaxity. We will discuss some basic relationships between them in this and the next section.

1. If δ* is admissible with constant risk, then it is a minimax decision function.
   Proof:

[Figure 3: Minimaxity. Axes: R(·, θ₀), R(·, θ₁); label: R(δ*, ·).]

   Suppose that some δ′ had smaller minimax risk than δ*. Then for all θ′,
      R(δ′, θ′) ≤ sup_θ R(δ′, θ) < sup_θ R(δ*, θ) = R(δ*, θ′),
   where we used constant risk in the last equality. But this contradicts admissibility of δ*. (picture!)

2. If the prior distribution π(θ) is strictly positive, and the Bayes decision function δ* has finite risk and risk is continuous in θ, then δ* is admissible.
   Sketch of proof: Suppose it is not admissible. Then it is dominated by some δ′. But then
      R(δ′, π) = ∫ R(δ′, θ) π(θ) dθ < ∫ R(δ*, θ) π(θ) dθ = R(δ*, π),
   since R(δ′, θ) ≤ R(δ*, θ) for all θ with strict inequality for some θ. This contradicts δ* being a Bayes decision function. (picture!)

3. We will prove the reverse of this statement in the next section.

4. The Bayes risk R(π) := inf_δ R(δ, π) is always (weakly) smaller than the minimax risk R̄ := inf_δ sup_θ R(δ, θ).
   Proof:

   R(π) = inf_δ R(δ, π) ≤ sup_π inf_δ R(δ, π) ≤ inf_δ sup_π R(δ, π) = inf_δ sup_θ R(δ, θ) = R̄.

If there exists a prior π* such that inf_δ R(δ, π*) = sup_π inf_δ R(δ, π), then π* is called the least favorable distribution.

Analogies to microeconomics

Concluding this section, note the correspondence between the concepts discussed here and those of microeconomics.

First, there is an analogy between statistical decision theory and social welfare analysis. The analogy maps different parameter values θ to different people i, and risk R(·, θ) to individual i's utility u_i(·). For this correspondence:

1. Our dominance is analogous to Pareto dominance.
2. Admissibility, in particular, is analogous to Pareto efficiency.
3. Bayes risk corresponds to weighted sums of utilities as in standard social welfare functions in public finance. The prior corresponds to welfare weights (distributional preferences).
4. Minimaxity corresponds to Rawlsian inequality aversion.

Second, consider choice under uncertainty, and in particular choice in strategic interactions. For this correspondence:

1. Dominance of decision functions corresponds to dominance of strategies in games.
2. Bayes risk corresponds to expected utility (after changing signs), and Bayes optimality corresponds to expected utility maximization.
3. Minimaxity is an extreme form of so-called ambiguity aversion. It is also familiar from early game theory, in particular from the theory of zero-sum games.
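Before moving on, here is a small numerical illustration of the inequality R(π) ≤ R̄ and of result 1 above (not in the original notes; the binomial model, the candidate estimators, and the discretized uniform prior are all illustrative choices). It compares the maximum likelihood estimator X/n with the textbook constant-risk estimator (X + √n/2)/(n + √n) for a binomial proportion under squared error loss.

```python
import numpy as np
from scipy.stats import binom

n = 20
theta_grid = np.linspace(0.001, 0.999, 199)   # discretized parameter space
x = np.arange(n + 1)

def risk(estimator, theta):
    # R(delta, theta) = E_theta[(delta(X) - theta)^2] under squared error loss
    pmf = binom.pmf(x, n, theta)
    return np.sum(pmf * (estimator(x) - theta) ** 2)

mle = lambda x: x / n
# Constant-risk estimator for a binomial proportion (minimax under squared error)
constant_risk = lambda x: (x + np.sqrt(n) / 2) / (n + np.sqrt(n))

prior = np.full(theta_grid.shape, 1.0 / len(theta_grid))   # discretized uniform prior

for name, est in [("MLE X/n", mle), ("constant-risk", constant_risk)]:
    risks = np.array([risk(est, t) for t in theta_grid])
    print(f"{name:>13}: integrated risk = {np.sum(prior * risks):.5f}, "
          f"worst-case risk = {risks.max():.5f}")
```

For each decision function the integrated risk is weakly below its worst-case risk, so in particular the Bayes risk under the uniform prior is weakly below the minimax risk; the second estimator has (essentially) constant risk across θ, the situation described in result 1 above, and correspondingly a lower worst-case risk than the MLE.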

3 Two justifications of the Bayesian approach

In the last section we saw that, under some conditions, every Bayes decision function is admissible. An important result states that the reverse also holds true under certain conditions - every admissible decision function is Bayes, or the limit of Bayes decision functions. This can be taken to say that we might restrict our attention in the search for good decision functions to the Bayesian ones. We state a simple version of this type of result.

We call the set of risk functions that correspond to some δ the risk set,
   R := {r(·) = R(·, δ) for some δ}.   (16)
We will assume convexity of R, which is no big restriction, since we can always randomly mix decision functions. We say a class of decision functions is a complete class if it contains every admissible decision function.

Complete class theorem: Suppose the set Θ of possible values for θ is compact, that the risk set R is convex, and that all decision functions have continuous risk. Then the Bayes decision functions constitute a complete class, in the sense that for every admissible decision function δ there exists a prior distribution π such that δ is a Bayes decision function for π.

Proof: (picture!!) The proof is essentially an application of the separating hyperplane theorem, applied to the space of functions of θ with the inner product
   ⟨f, g⟩ = ∫ f(θ) g(θ) dθ.
(For intuition, focus on the case where Θ is finite and the inner product is given by ⟨f, g⟩ = Σ_θ f(θ) g(θ).)

Let δ be admissible. Then the corresponding risk function R(·, δ) belongs to the lower boundary of the risk set R by the definition of admissibility. The separating hyperplane theorem (in combination with convexity of R) implies that there exists a function π̃ (with finite integral) such that for all δ′,
   ⟨R(·, δ), π̃⟩ ≤ ⟨R(·, δ′), π̃⟩.

[Figure 4: Complete class theorem. Axes: R(·, θ₀), R(·, θ₁); labels: R(·, δ), π(θ).]

By admissibility, π̃ can be chosen such that π̃ ≥ 0, and thus π := π̃ / ∫ π̃(θ) dθ defines a prior distribution. δ minimizes ⟨R(·, δ′), π⟩ = R(δ′, π) over δ′ in the set of feasible decision functions, and is therefore the optimal Bayesian decision function for the prior π.

A second general justification for Bayesian decision theory is given by subjective probability theory, going back to Savage (1954) and Anscombe and Aumann (1963). This theory is discussed in chapter 6 of Mas-Colell, A., Whinston, M., and Green, J. (1995). Microeconomic theory. Oxford University Press, and maybe in Econ 2010.

Suppose a decision maker ranks risk functions R(·, δ) by a preference relationship ≽. If this preference relationship is complete, monotonic, and satisfies the independence condition
   R₁ ≽ R₂ ⟺ αR₁ + (1 − α)R₃ ≽ αR₂ + (1 − α)R₃   (17)

for all R₁, R₂, R₃ and α ∈ [0, 1], then there exists a prior π such that
   R(·, δ₁) ≽ R(·, δ₂)  ⟺  R(δ₁, π) ≤ R(δ₂, π).

Sketch of proof: Using independence repeatedly, we can show that for all R₁, R₂, R₃ ∈ R^X and all α > 0,

1. R₁ ≽ R₂ iff αR₁ ≽ αR₂,
2. R₁ ≽ R₂ iff R₁ + R₃ ≽ R₂ + R₃,
3. {R : R ≽ R₁} = {R : R ≽ 0} + R₁,
4. {R : R ≽ 0} is a convex cone,
5. {R : R ≽ 0} is a half space.

The last claim requires completeness. It immediately implies the existence of π. Monotonicity implies that π is not negative.

4 Testing and the Neyman-Pearson lemma

A special statistical decision problem is the problem of testing, where we want to decide whether some statement H₀ : θ ∈ Θ₀ about the state of the world is true, or whether alternatively H₁ : θ ∈ Θ₁ is true. A statistical test is then a decision function ϕ : X → {0, 1}, where ϕ = 1 corresponds to rejecting the null hypothesis. We might more generally allow for randomized tests ϕ : X → [0, 1] which reject H₀ with probability ϕ(X).

We can make two types of classification errors when we try to distinguish H₀ from H₁:

1. We can decide that H₁ is true when in fact H₀ is true. This is called a Type I error.
2. We can decide that H₀ is true when in fact H₁ is true. This is called a Type II error.

Suppose that the data are distributed according to the density (or probability mass function) f_θ(x). Then the probability of rejecting H₀ given θ is equal to
   β(θ) = E_θ[ϕ(X)] = ∫ ϕ(x) f_θ(x) dx.   (18)

Suppose that θ has only two points of support, θ₀ and θ₁. Then

1. The probability of a Type I error is equal to β(θ₀). This is also called the level or significance of the test, and is often denoted α.
2. The probability of a Type II error is equal to 1 − β(θ₁). β(θ₁) is called the power of the test, and is often denoted β.

We would like to make the probabilities of both types of errors small, that is, to have a small α and a large β. In the case where θ has only two points of support, there is a simple characterization of the solution to this problem. Consider a 2D-graph where the first axis corresponds to β(θ₀) and the second to 1 − β(θ₁). This is very similar to the risk functions we drew before. We can restate the problem as finding a ϕ that solves
   max_ϕ β(θ₁)  s.t.  β(θ₀) = α
for a prespecified level α. The solution to this problem is given by the Neyman-Pearson Lemma:
   ϕ*(x) = 1  if f₁(x) > λ f₀(x),
           κ  if f₁(x) = λ f₀(x),
           0  if f₁(x) < λ f₀(x),
where λ and κ are chosen such that ∫ ϕ*(x) f₀(x) dx = α.

Proof: Let ϕ(x) be any other test of level α, i.e., ∫ ϕ(x) f₀(x) dx = α. We need to show that ∫ ϕ*(x) f₁(x) dx ≥ ∫ ϕ(x) f₁(x) dx. Note that
   ∫ (ϕ*(x) − ϕ(x)) (f₁(x) − λ f₀(x)) dx ≥ 0,
since ϕ*(x) = 1 ≥ ϕ(x) for all x such that f₁(x) − λ f₀(x) > 0, and ϕ*(x) = 0 ≤ ϕ(x) for all x such that f₁(x) − λ f₀(x) < 0. Therefore, using α = ∫ ϕ(x) f₀(x) dx = ∫ ϕ*(x) f₀(x) dx,

   0 ≤ ∫ (ϕ*(x) − ϕ(x)) (f₁(x) − λ f₀(x)) dx
     = ∫ (ϕ*(x) − ϕ(x)) f₁(x) dx
     = ∫ ϕ*(x) f₁(x) dx − ∫ ϕ(x) f₁(x) dx,
so ∫ ϕ*(x) f₁(x) dx ≥ ∫ ϕ(x) f₁(x) dx, as required. The proof in the discrete case is identical, with all integrals replaced by summations.
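A minimal sketch of the Neyman-Pearson construction for two simple hypotheses (not part of the original notes; the Gaussian densities f₀ = N(0, 1) and f₁ = N(1, 1), the level α = 0.05, and the simulation sizes are illustrative choices). Because the likelihood ratio is continuous here, choosing the cutoff in x to hit level α implicitly fixes λ, and no randomization (κ) is needed.

```python
import numpy as np
from scipy.stats import norm

alpha = 0.05

# Two simple hypotheses: f0 = N(0, 1), f1 = N(1, 1).
# The likelihood ratio f1(x)/f0(x) = exp(x - 1/2) is increasing in x, so the
# Neyman-Pearson test rejects for large x; the cutoff c with P_0(X > c) = alpha
# corresponds to a particular lambda in the lemma.
cutoff = norm.ppf(1 - alpha, loc=0, scale=1)

def phi_star(x):
    # phi*(x) = 1{x > cutoff}, the level-alpha likelihood ratio test
    return (x > cutoff).astype(float)

rng = np.random.default_rng(0)
x0 = rng.normal(0.0, 1.0, size=1_000_000)   # data generated under H0
x1 = rng.normal(1.0, 1.0, size=1_000_000)   # data generated under H1

print("size  beta(theta_0) ~", phi_star(x0).mean())          # close to alpha
print("power beta(theta_1) ~", phi_star(x1).mean())          # simulated power
print("analytic power       ", 1 - norm.cdf(cutoff - 1.0))   # = 1 - Phi(c - 1)
```

The simulated rejection probability under f₀ matches the prescribed level α, and the simulated power matches the analytic value, illustrating the trade-off between β(θ₀) and β(θ₁) that the lemma resolves optimally.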