Notes on Decision Theory and Prediction


Notes on Decision Theory and Prediction
Ronald Christensen
Professor of Statistics
Department of Mathematics and Statistics
University of New Mexico
October 7, 2014

1. Decision Theory

Decision theory is a very general theory that allows one to examine Bayesian estimation and hypothesis testing as well as Neyman-Pearson hypothesis testing and many aspects of frequentist estimation. I am not aware that it has anything to say about Fisherian significance testing.

In decision theory we start with states of nature $\theta \in \Theta$, potential actions $a \in A$, and a loss function $L(\theta, a)$ that takes real values. We are interested in taking actions that will reduce our losses. Some formulations of decision theory incorporate a utility function $U(\theta, a)$ and seek actions that increase utility. The formulations are interchangeable by simply taking $U(\theta, a) \equiv -L(\theta, a)$. Eventually, we will want to incorporate data in the form of a random vector $X$ taking values in $\mathcal{X}$ and having density $f(x|\theta)$. The distribution of $X$ is called the sampling distribution. We will focus on three special cases.

Estimation of a scalar state of nature involves scalar actions with $\Theta = A = \mathbb{R}$. Three commonly used loss functions are:

Squared error, $L(\theta, a) = (\theta - a)^2$;

Weighted squared error, $L(\theta, a) = w(\theta)(\theta - a)^2$, wherein $w(\theta)$ is a known weighting function taking positive values;

Absolute error, $L(\theta, a) = |\theta - a|$.

Estimation of a vector involves $\Theta = A = \mathbb{R}^r$. Three commonly used loss functions are:

$L(\theta, a) = (\theta - a)'(\theta - a) = \|\theta - a\|^2$;

$L(\theta, a) = w(\theta)\|\theta - a\|^2$, $w(\theta) > 0$;

$L(\theta, a) = \sum_{j=1}^{r} |\theta_j - a_j|$.

Hypothesis testing involves two hypotheses, say $\Theta = \{\theta_0, \theta_1\}$, and two corresponding actions $A = \{a_0, a_1\}$. What is key in this problem is that there are only two states of nature in $\Theta$, which we can think of as the null and alternative hypotheses, and two corresponding actions in $A$, which we can think of as accepting the null (rejecting the alternative) and accepting the alternative (rejecting the null). The standard loss function is

$L(\theta, a)$:   $a_0$   $a_1$
$\theta_0$:        0       1
$\theta_1$:        1       0

A more general loss function is

$L(\theta, a)$:   $a_0$      $a_1$
$\theta_0$:       $c_{00}$   $c_{01}$
$\theta_1$:       $c_{10}$   $c_{11}$

wherein, presumably, $c_{00} < c_{01}$ and $c_{10} > c_{11}$.
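For readers who like to compute, the losses above translate directly into code. Here is a minimal Python sketch; the function names and the weight example are mine, not the notes'.

```python
import numpy as np

def sq_loss(theta, a):
    """Squared error loss."""
    return (theta - a) ** 2

def wtd_sq_loss(theta, a, w):
    """Weighted squared error loss; w is a known positive weight function."""
    return w(theta) * (theta - a) ** 2

def abs_loss(theta, a):
    """Absolute error loss."""
    return np.abs(theta - a)

# General two-point testing loss: row i is the state theta_i, column j the action a_j.
c = np.array([[0.0, 1.0],    # c_00, c_01
              [1.0, 0.0]])   # c_10, c_11  (the standard 0-1 table)

def test_loss(i, j):
    """Loss from taking action a_j when the true state is theta_i."""
    return c[i, j]
```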

2. Optimal Prior Actions

If $\theta$ is random, i.e., if $\theta$ has a prior distribution, then the optimal action is defined to be the action that minimizes the expected loss,

$$E[L(\theta, a)] \equiv E_\theta[L(\theta, a)].$$

Proposition 1: For $\Theta = A = \mathbb{R}$ and $L(\theta, a) = (\theta - a)^2$, if $\theta$ is random, the optimal action is $\hat{a} \equiv E(\theta)$.

Proof: It is enough to show that

$$E[(\theta - a)^2] = E[(\theta - \hat{a})^2] + (\hat{a} - a)^2,$$

because then the minimizing value of $a$ occurs when $\hat{a} = a$. As is so often the case, the proof proceeds by subtracting and adding the correct answer.

$$E[(\theta - a)^2] = E[(\{\theta - \hat{a}\} + \{\hat{a} - a\})^2]$$
$$= E[(\theta - \hat{a})^2] + 2E[(\theta - \hat{a})(\hat{a} - a)] + E[(\hat{a} - a)^2]$$
$$= E[(\theta - \hat{a})^2] + 2(\hat{a} - a)E[\theta - \hat{a}] + (\hat{a} - a)^2$$
$$= E[(\theta - \hat{a})^2] + (\hat{a} - a)^2.$$

The third equality holds because $(\hat{a} - a)^2$ is a constant and the fourth holds because $E[\theta - E(\theta)] = 0$.

Proposition 2: For $\Theta = A = \mathbb{R}$ and $L(\theta, a) = w(\theta)(\theta - a)^2$, if $\theta$ is random, the optimal action is $\hat{a} \equiv E[w(\theta)\theta]/E[w(\theta)]$.

Proof: The proof is an exercise. Write $E[w(\theta)(\theta - a)^2] = E[w(\theta)(\theta - \hat{a} + \hat{a} - a)^2]$.
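A quick numerical check of Proposition 2 (my own example, not from the notes): with $\theta \sim \mathrm{Exp}(1)$ and $w(\theta) = \theta$, the formula gives $E[w(\theta)\theta]/E[w(\theta)] = E[\theta^2]/E[\theta] = 2$, and a grid search over candidate actions agrees.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = rng.exponential(scale=1.0, size=1_000_000)
w = theta  # weight function w(theta) = theta

a_hat = np.mean(w * theta) / np.mean(w)              # Proposition 2's formula
grid = np.linspace(0.5, 3.5, 301)                    # candidate actions
risk = [np.mean(w * (theta - a) ** 2) for a in grid] # expected weighted loss

print(a_hat)                   # ~ 2.0
print(grid[np.argmin(risk)])   # grid minimizer agrees with a_hat
```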

Proposition 3: For $\Theta = A = \mathbb{R}$ and $L(\theta, a) = |\theta - a|$, if $\theta$ is random, the optimal action is $\hat{a} \equiv m \equiv \mathrm{Median}(\theta)$.

Proof: (I changed some notation in this proof; I had generated it some time ago.) Without loss of generality, assume $a$ is greater than the median $m$ of $\theta$. Compare the two losses according to where $\theta$ falls:

for $\theta \le m$: $|\theta - a| - |\theta - m| = (a - \theta) - (m - \theta) = a - m$;

for $\theta \ge a$: $|\theta - a| - |\theta - m| = (\theta - a) - (\theta - m) = m - a$;

for $m < \theta < a$: $|\theta - a| - |\theta - m| = (a - \theta) - (\theta - m) = a + m - 2\theta > m - a$.

Taking expectations over the three regions,

$$E[|\theta - a|] - E[|\theta - m|] \ge (a - m)\Pr[\theta \le m] + (m - a)\Pr[\theta > m] = (a - m)\{\Pr[\theta \le m] - \Pr[\theta > m]\} \ge 0,$$

where the final inequality holds because $m$ is a median, so $\Pr[\theta \le m] \ge 0.5 \ge \Pr[\theta > m]$. Thus $E[|\theta - a|] \ge E[|\theta - m|]$ for any $a > m$; the case $a < m$ is symmetric.
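Propositions 1 and 3 are easy to see numerically on a skewed distribution, where the mean and median differ substantially. A sketch (the lognormal choice is my own illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
theta = rng.lognormal(mean=0.0, sigma=1.0, size=1_000_000)
mean, med = theta.mean(), np.median(theta)

for a, label in [(mean, "mean"), (med, "median")]:
    print(label,
          "E(theta-a)^2 =", np.mean((theta - a) ** 2),
          "E|theta-a| =", np.mean(np.abs(theta - a)))
# The mean minimizes the squared-error column, the median the absolute-error column.
```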

Proposition 4: For $\Theta = \{\theta_0, \theta_1\}$, $A = \{a_0, a_1\}$, and the standard 0-1 loss $L(\theta_i, a_j) = I(i \ne j)$, the optimal action is

$$\hat{a} = \begin{cases} a_0 & \text{if } \Pr(\theta = \theta_0) > 0.5 \\ a_1 & \text{if } \Pr(\theta = \theta_0) < 0.5. \end{cases}$$

Proof: Note that

$$E[L(\theta, a_0)] = L(\theta_0, a_0)\Pr(\theta = \theta_0) + L(\theta_1, a_0)\Pr(\theta = \theta_1) = \Pr(\theta = \theta_1)$$

and

$$E[L(\theta, a_1)] = L(\theta_0, a_1)\Pr(\theta = \theta_0) + L(\theta_1, a_1)\Pr(\theta = \theta_1) = \Pr(\theta = \theta_0).$$

If $\Pr(\theta = \theta_1) < \Pr(\theta = \theta_0)$ the optimal action is $a_0$, and if $\Pr(\theta = \theta_1) > \Pr(\theta = \theta_0)$ the optimal action is $a_1$. However, $\Pr(\theta = \theta_0) + \Pr(\theta = \theta_1) = 1$, so $\Pr(\theta = \theta_1) < \Pr(\theta = \theta_0)$ if and only if $\Pr(\theta = \theta_0) > 0.5$.

3. Optimal Posterior Actions

Suppose we have a data vector $X$ with density $f(x|\theta)$. If $\theta$ is random, i.e., if $\theta$ has a prior density $p(\theta)$, a Bayesian updates the distribution of $\theta$ using the data and Bayes' Theorem to get the posterior density

$$p(\theta|X) = \frac{f(X|\theta)p(\theta)}{\int f(X|\theta)p(\theta)\,d\mu(\theta)}.$$

CLASS: think of $d\mu(\theta) = d\theta$.

The Bayes action is defined to be the action that minimizes the expected loss,

$$E[L(\theta, a)|X] \equiv E_{\theta|X}[L(\theta, a)].$$

The Bayes action is just the optimal action when the distribution on $\theta$ is the posterior distribution given $X$. Recognizing this fact, the previous section provides a number of results immediately.

Proposition 1a: For $\Theta = A = \mathbb{R}$, data $X = x$, and $L(\theta, a) = (\theta - a)^2$, if $\theta$ is random, the Bayes action is $\hat{a} \equiv E_{\theta|X}(\theta) = E(\theta|X = x)$.
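As a concrete illustration of the Bayes actions in a conjugate model (the Beta(2, 2) prior and the data here are hypothetical, my own example): with $\theta \sim \mathrm{Beta}(2, 2)$ and $X|\theta \sim \mathrm{Bin}(n, \theta)$, the posterior is $\mathrm{Beta}(2 + x, 2 + n - x)$, so the Bayes action under squared error is its mean and under absolute error its median.

```python
from scipy.stats import beta

n, x = 20, 14                      # hypothetical observed data
a_post, b_post = 2 + x, 2 + n - x  # conjugate Beta posterior parameters

bayes_sq = a_post / (a_post + b_post)    # Proposition 1a: posterior mean
bayes_abs = beta.median(a_post, b_post)  # posterior median (see Proposition 3a below)
print(bayes_sq, bayes_abs)
```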

Proposition 2a: For $\Theta = A = \mathbb{R}$, data $X = x$, and $L(\theta, a) = w(\theta)(\theta - a)^2$, if $\theta$ is random, the Bayes action is $\hat{a} \equiv E[w(\theta)\theta|X = x]/E[w(\theta)|X = x]$.

Proposition 3a: For $\Theta = A = \mathbb{R}$, data $X = x$, and $L(\theta, a) = |\theta - a|$, if $\theta$ is random, the Bayes action is $\hat{a} \equiv m \equiv \mathrm{Median}(\theta|X = x)$.

Proposition 4a: For $\Theta = \{\theta_0, \theta_1\}$, data $X = x$, $A = \{a_0, a_1\}$, and the 0-1 loss $L(\theta_i, a_j) = I(i \ne j)$, the Bayes action is

$$\hat{a} = \begin{cases} a_0 & \text{if } \Pr(\theta = \theta_0|X = x) > 0.5 \\ a_1 & \text{if } \Pr(\theta = \theta_0|X = x) < 0.5. \end{cases}$$

4. Traditional Decision Theory

With states of nature $\theta \in \Theta$, potential actions $a \in A$, and a data vector $X$ taking values in $\mathcal{X}$ and having density $f(x|\theta)$, a decision function $\delta$ is defined as a mapping of the data into the action space, i.e., $\delta: \mathcal{X} \to A$. With a loss function $L(\theta, a)$, the risk function is defined as

$$R(\theta, \delta) \equiv E_{X|\theta}\{L[\theta, \delta(X)]\}.$$

To frequentists, the risk function is the soul of decision theory.

The Bayes risk is a frequentist idea of what a Bayesian should worry about. With a prior distribution, call it $p$, on $\theta$, the Bayes risk is defined as

$$r(p, \delta) \equiv E_\theta[R(\theta, \delta)].$$

Frequentists think that Bayesians should be concerned about finding the Bayes decision rule that minimizes the Bayes risk.
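Both objects are easy to approximate by simulation. A sketch (my own $\mathrm{Bin}(n, \theta)$ illustration, not from the notes) comparing two rules under squared error, the MLE $X/n$ and the posterior mean $(X + 2)/(n + 4)$ under a $\mathrm{Beta}(2, 2)$ prior: the risk functions cross, so neither rule dominates, but the Bayes-risk comparison picks a winner.

```python
import numpy as np

rng = np.random.default_rng(2)
n, reps = 20, 200_000
rules = {"MLE x/n": lambda x: x / n,
         "Beta(2,2) posterior mean": lambda x: (x + 2) / (n + 4)}

def risk(delta, theta):
    """Monte Carlo estimate of R(theta, delta) = E_{X|theta}(theta - delta(X))^2."""
    x = rng.binomial(n, theta, size=reps)
    return np.mean((theta - delta(x)) ** 2)

# The risk function: each rule has smaller risk for some values of theta.
for theta in [0.1, 0.3, 0.5, 0.7, 0.9]:
    print(theta, {name: round(risk(d, theta), 5) for name, d in rules.items()})

# The Bayes risk r(p, delta) = E_theta[R(theta, delta)] under the Beta(2, 2) prior:
# the rule built from posterior Bayes actions attains the smaller value.
theta_draws = rng.beta(2, 2, size=reps)
x_draws = rng.binomial(n, theta_draws)
for name, d in rules.items():
    print(name, np.mean((theta_draws - d(x_draws)) ** 2))
```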

Formally, for a prior $p$, the Bayes rule is a decision function $\delta_p$ with

$$r(p, \delta_p) = \inf_\delta r(p, \delta).$$

Bayesians think that they should be concerned with finding the Bayes action given the data, as discussed in the previous section. Fortunately, these amount to the same thing. To minimize the Bayes risk, you pick $\delta(x)$ to minimize

$$r(p, \delta) = E_\theta[R(\theta, \delta)] = E_\theta\left(E_{X|\theta}\{L[\theta, \delta(X)]\}\right) = E_X\left(E_{\theta|X}\{L[\theta, \delta(X)]\}\right).$$

This can be minimized by picking $\delta(x)$ to be the Bayes action that minimizes $E_{\theta|X=x}\{L[\theta, \delta(x)]\}$ for every value of $x$.

One exception to Bayesians being concerned about the Bayes action rather than the Bayes decision rule is when a Bayesian is trying to design an experiment, hence is concerned with possible data rather than already observed data.

5. Prediction Theory

In prediction theory one wishes to predict an unobserved random vector $y$ based on an observed random vector $x$. Let's say that $y$ has $q$ dimensions and that $x$ has $p - 1$ dimensions. We assume that the joint distribution of $x$ and $y$ is known. Any predictor of $y$ is some function of $x$, say $\tilde{y}(x)$. We define a predictive loss function $L[y, \tilde{y}(x)]$ and seek to find a predictor $\hat{y}(x)$ that minimizes the expected prediction loss, $E\{L[y, \tilde{y}(x)]\}$, where the expectation is over both $y$ and $x$. Note that

$$E_{x,y}\{L[y, \tilde{y}(x)]\} = E_x\left(E_{y|x}\{L[y, \tilde{y}(x)]\}\right)$$

or in alternative notation

$$E\{L[y, \tilde{y}(x)]\} = E\left(E\{L[y, \tilde{y}(x)]\,|\,x\}\right).$$

In particular, there is a one-to-one correspondence between prediction theory and the approach of traditional decision theory to Bayesian analysis. We associate $y$ with $\theta$ and $x$ with $X$. In prediction we assume a joint distribution for $x$ and $y$, whereas in Bayesian analysis we specify the sampling distribution and the prior that together determine the joint distribution of $\theta$ and $X$. A predictor $\tilde{y}(x)$ is analogous to a decision rule. The expected prediction error $E_{x,y}\{L[y, \tilde{y}(x)]\}$ is analogous to the Bayes risk. Just as in Bayesian analysis, the way to find the best predictor is, for each value of $x$, to find the value of $\tilde{y}(x)$ that minimizes $E\{L[y, \tilde{y}(x)]\,|\,x\}$.

The most common prediction problem is similar to linear regression in that $y$ takes values in $\mathbb{R}$ and uses squared error loss, $L[y, \tilde{y}(x)] = [y - \tilde{y}(x)]^2$. We want to minimize the expected prediction error $E\{L[y, \tilde{y}(x)]\} = E\{[y - \tilde{y}(x)]^2\}$, where the expectation is over both $y$ and $x$. Identifying prediction with decision and conditioning on $x$, we see that Proposition 1a implies

Proposition 1b: For data $(x, y)$, $y \in \mathbb{R}$, and $L(y, \tilde{y}(x)) = [y - \tilde{y}(x)]^2$, the best predictor is $\hat{y} = E(y|x)$.

Regression, both linear and nonparametric, is about estimating the optimal predictor $E(y|x)$. Note that this result holds even when $y$ is Bernoulli, in which case the best predictor under squared error loss is $E(y|x) = \Pr[y = 1|x]$. Using squared error loss with a Bernoulli variable $y$ is essentially using Brier scores.
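A simulation sketch of Proposition 1b (the quadratic model is my own illustration, not from the notes): when the true conditional mean is known, no other function of $x$, here the best linear predictor, can beat it under squared error.

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(size=1_000_000)
y = x ** 2 + rng.normal(size=x.size)  # true E(y | x) = x^2, noise variance 1

cond_mean = x ** 2            # the best predictor E(y | x)
b1, b0 = np.polyfit(x, y, 1)  # best linear predictor, fit by least squares
linear = b0 + b1 * x

print(np.mean((y - cond_mean) ** 2))  # ~ 1.0, the irreducible noise variance
print(np.mean((y - linear) ** 2))     # strictly larger
```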

Similarly we can get other best predictors.

Proposition 2b: For data $(x, y)$, $y \in \mathbb{R}$, and $L(y, \tilde{y}(x)) = w(y)[y - \tilde{y}(x)]^2$, the best predictor is $\hat{y} = E[w(y)y|x]/E[w(y)|x]$.

Proposition 3b: For data $(x, y)$, $y \in \mathbb{R}$, and $L(y, \tilde{y}(x)) = |y - \tilde{y}(x)|$, the best predictor is $\hat{y} = m = \mathrm{Median}(y|x)$.

When $y$ takes values in $\{0, 1\}$, an alternative loss function is the so-called Hamming loss, $L[y, \tilde{y}(x)] = I[y \ne \tilde{y}(x)]$, wherein a predictor $\tilde{y}(x)$ also needs to take values in $\{0, 1\}$. We want to minimize the expected prediction error $E\{L[y, \tilde{y}(x)]\} = E\{I[y \ne \tilde{y}(x)]\}$, where the expectation is over both $y$ and $x$. We see that Proposition 4a implies

Proposition 4b: For data $(x, y)$, $y \in \{0, 1\}$, and $L(y, \tilde{y}(x)) = I(y \ne \tilde{y}(x))$, the best predictor is

$$\hat{y}(x) = \begin{cases} 0 & \text{if } \Pr(y = 0|x) > 0.5 \\ 1 & \text{if } \Pr(y = 0|x) < 0.5. \end{cases}$$

In binary regression people tend to focus on the probability of getting a 1, rather than a 0 (which is analogous to a null hypothesis), so it is more common to think of the optimal predictor as

$$\hat{y}(x) = \begin{cases} 0 & \text{if } \Pr(y = 1|x) < 0.5 \\ 1 & \text{if } \Pr(y = 1|x) > 0.5. \end{cases}$$

Binomial (logistic/probit) regression is about estimating the probability $\Pr(y = 1|x)$. For squared error loss, this gives the estimated optimal predictor. For Hamming loss, the estimated optimal predictor is 0 or 1 depending on whether the estimated value of $\Pr(y = 1|x)$ is less than 0.5.
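A sketch of Proposition 4b (the logistic model is my own hypothetical example): with $\Pr(y = 1|x)$ known, thresholding that probability at 0.5 minimizes the expected Hamming loss, while other cutoffs do worse.

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.normal(size=1_000_000)
p1 = 1.0 / (1.0 + np.exp(-2.0 * x))  # Pr(y = 1 | x), a logistic model
y = rng.binomial(1, p1)

for cut in [0.3, 0.5, 0.7]:
    y_hat = (p1 > cut).astype(int)
    print(cut, np.mean(y_hat != y))  # misclassification rate, smallest at 0.5
```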

Fisher argued (similarly to Bayesians) that prediction problems should be considered entirely as conditional on the predictor vector $x$. However, there are some predictive measures, such as the coefficient of determination, that are defined with respect to the distribution of $x$. Measures that depend on the distribution of $x$ are inappropriate for comparison when the distribution of $x$ changes. Thus it is common to argue that $R^2$ values for the same model on different data are not comparable. In fact, that is only true if the $x$ data have been sampled from a different population, which is usually the case.

6. Minimax Rules

Definition 1: A decision rule $\delta_0$ is a minimax rule if

$$\sup_\theta R(\theta, \delta_0) = \inf_\delta \sup_\theta R(\theta, \delta).$$

Definition 2: A prior distribution on $\theta$, say $g^*$, is a least favorable distribution if

$$\inf_\delta r(g^*, \delta) = \sup_g \inf_\delta r(g, \delta).$$

If $\delta^*$ is a Bayes rule with respect to $g^*$, then

$$r(g^*, \delta^*) = \inf_\delta r(g^*, \delta) = \sup_g \inf_\delta r(g, \delta).$$

We present without proof the Minimax Theorem.

Theorem 3:

$$\inf_\delta \sup_g r(g, \delta) = \sup_g \inf_\delta r(g, \delta).$$

Corollary 4: For any $\delta$,

$$\sup_\theta R(\theta, \delta) = \sup_g r(g, \delta).$$

Proof: Observe that

$$r(g, \delta) = E_\theta[R(\theta, \delta)] \le E_\theta\left[\sup_\theta R(\theta, \delta)\right] = \sup_\theta R(\theta, \delta),$$

so $\sup_g r(g, \delta) \le \sup_\theta R(\theta, \delta)$. Conversely, by considering the subset of priors that take on the value $\theta$ with probability one, say $g_\theta$, note that $r(g_\theta, \delta) = R(\theta, \delta)$ and

$$\sup_g r(g, \delta) \ge \sup_{g_\theta : \theta \in \Theta} r(g_\theta, \delta) = \sup_\theta R(\theta, \delta).$$

Proposition 5: If the Minimax Theorem holds, $\delta_0$ is a minimax rule, and $g^*$ is a least favorable distribution with corresponding Bayes rule $\delta^*$, then $\delta_0$ is also a Bayes rule with respect to the least favorable distribution. (If the Bayes rule happens to be unique, we must have $\delta_0 = \delta^*$.)

Proof: Using Corollary 4, Definition 1, Corollary 4 again, the Minimax Theorem 3, and Definition 2,

$$r(g^*, \delta_0) \le \sup_g r(g, \delta_0) = \sup_\theta R(\theta, \delta_0) = \inf_\delta \sup_\theta R(\theta, \delta) = \inf_\delta \sup_g r(g, \delta) = \sup_g \inf_\delta r(g, \delta) = \inf_\delta r(g^*, \delta) = r(g^*, \delta^*).$$

This must be an equality since we know by definition of the Bayes rule that

$$r(g^*, \delta_0) \ge r(g^*, \delta^*).$$

Since $\delta_0$ and $\delta^*$ have the same Bayes risk, they must both be Bayes rules.

The point is that a Bayes rule for a least favorable distribution isn't necessarily a minimax rule, but a minimax rule, if it exists, is necessarily a Bayes rule for a least favorable distribution.

Definition 6: $\delta_0$ is an equalizer rule if for some constant $K$, $R(\theta, \delta_0) = K$ for all $\theta$.

Proposition 7: If the Minimax Theorem holds and if $\delta_0$ is both an equalizer rule and the Bayes rule for some prior distribution $g_0$, then $\delta_0$ is minimax.

Proof:

$$\inf_\delta \sup_\theta R(\theta, \delta) \le \sup_\theta R(\theta, \delta_0) = K = r(g_0, \delta_0) = \inf_\delta r(g_0, \delta) \le \sup_g \inf_\delta r(g, \delta).$$

By the Minimax Theorem (together with Corollary 4), the two ends of this chain are equal, so all of these are equal; in particular,

$$\inf_\delta \sup_\theta R(\theta, \delta) = \sup_\theta R(\theta, \delta_0).$$

Exercise: Let $X \sim \mathrm{Bin}(n, \theta)$ and $\theta \sim \mathrm{Beta}(\alpha, \beta)$. Assume that the Minimax Theorem holds! For squared error loss, find the Bayes rule, say $\delta_{\alpha\beta}$. Find $R(\theta, \delta_{\alpha\beta})$. Pick $\alpha$ and $\beta$ so that $\delta_{\alpha\beta}$ is an equalizer rule. Establish that $\delta_{\alpha\beta}$ is a minimax rule.
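A numerical companion to the exercise, not a substitute for the derivation (the choice $\alpha = \beta = \sqrt{n}/2$ is the standard answer for this problem, stated here as an assumption to be checked): simulating the risk of the Bayes rule $\delta_{\alpha\beta}(X) = (X + \alpha)/(n + \alpha + \beta)$ over a grid of $\theta$ values should show an approximately constant risk, i.e., an equalizer rule.

```python
import numpy as np

rng = np.random.default_rng(6)
n, reps = 25, 500_000
alpha = beta = np.sqrt(n) / 2  # candidate equalizer choice to verify

for theta in np.linspace(0.05, 0.95, 7):
    x = rng.binomial(n, theta, size=reps)
    delta = (x + alpha) / (n + alpha + beta)          # Bayes rule: posterior mean
    print(round(theta, 2), np.mean((theta - delta) ** 2))  # ~ constant in theta
```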