Notes on Decision Theory and Prediction

Ronald Christensen
Professor of Statistics
Department of Mathematics and Statistics
University of New Mexico

October 7, 2014

1. Decision Theory

Decision theory is a very general theory that allows one to examine Bayesian estimation and hypothesis testing as well as Neyman-Pearson hypothesis testing and many aspects of frequentist estimation. I am not aware that it has anything to say about Fisherian significance testing.

In decision theory we start with states of nature $\theta \in \Theta$, potential actions $a \in A$, and a loss function $L(\theta, a)$ that takes real values. We are interested in taking actions that will reduce our losses. Some formulations of decision theory incorporate a utility function $U(\theta, a)$ and seek actions that increase utility. The formulations are interchangeable by simply taking $U(\theta, a) = -L(\theta, a)$. Eventually, we will want to incorporate data in the form of a random vector $X$ taking values in $\mathcal{X}$ and having density $f(x|\theta)$. The distribution of $X$ is called the sampling distribution. We will focus on three special cases.
Estimation of a scalar state of nature involves scalar actions with $\Theta = A = \mathbb{R}$. Three commonly used loss functions are:

Squared error, $L(\theta, a) = (\theta - a)^2$;

Weighted squared error, $L(\theta, a) = w(\theta)(\theta - a)^2$, wherein $w(\theta)$ is a known weighting function taking positive values;

Absolute error, $L(\theta, a) = |\theta - a|$.

Estimation of a vector involves $\Theta = A = \mathbb{R}^r$. Three commonly used loss functions are
\[ L(\theta, a) = (\theta - a)'(\theta - a) = \|\theta - a\|^2, \]
\[ L(\theta, a) = w(\theta)\|\theta - a\|^2, \quad w(\theta) > 0, \]
\[ L(\theta, a) = \sum_{j=1}^{r} |\theta_j - a_j|. \]

Hypothesis testing involves two hypotheses, say $\Theta = \{\theta_0, \theta_1\}$, and two corresponding actions $A = \{a_0, a_1\}$. What is key in this problem is that there are only two states of nature in $\Theta$ that we can think of as the null and alternative hypotheses, and two corresponding actions in $A$ that we can think of as accepting the null (rejecting the alternative) and accepting the alternative (rejecting the null). The standard loss function is
\[ \begin{array}{c|cc} L(\theta, a) & a_0 & a_1 \\ \hline \theta_0 & 0 & 1 \\ \theta_1 & 1 & 0 \end{array} \]
A more general loss function is
\[ \begin{array}{c|cc} L(\theta, a) & a_0 & a_1 \\ \hline \theta_0 & c_{00} & c_{01} \\ \theta_1 & c_{10} & c_{11} \end{array} \]
wherein, presumably, $c_{00} < c_{01}$ and $c_{10} > c_{11}$.
2. Optimal Prior Actions

If $\theta$ is random, i.e., if $\theta$ has a prior distribution, then the optimal action is defined to be the action that minimizes the expected loss,
\[ E[L(\theta, a)] \equiv E_\theta[L(\theta, a)]. \]

Proposition 1: For $\Theta = A = \mathbb{R}$ and $L(\theta, a) = (\theta - a)^2$, if $\theta$ is random, the optimal action is $\hat{a} = E(\theta)$.

Proof: It is enough to show that
\[ E[(\theta - a)^2] = E[(\theta - \hat{a})^2] + (\hat{a} - a)^2, \]
because then the minimizing value of $a$ occurs when $a = \hat{a}$. As is so often the case, the proof proceeds by subtracting and adding the correct answer.
\begin{align*}
E[(\theta - a)^2] &= E[(\{\theta - \hat{a}\} + \{\hat{a} - a\})^2] \\
&= E[(\theta - \hat{a})^2] + 2E[(\theta - \hat{a})(\hat{a} - a)] + E[(\hat{a} - a)^2] \\
&= E[(\theta - \hat{a})^2] + 2(\hat{a} - a)E[(\theta - \hat{a})] + (\hat{a} - a)^2 \\
&= E[(\theta - \hat{a})^2] + (\hat{a} - a)^2.
\end{align*}
The third equality holds because $(\hat{a} - a)^2$ is a constant and the fourth holds because $E[\theta - E(\theta)] = 0$.

Proposition 2: For $\Theta = A = \mathbb{R}$ and $L(\theta, a) = w(\theta)(\theta - a)^2$, if $\theta$ is random, the optimal action is $\hat{a} = E[\theta w(\theta)]/E[w(\theta)]$.

Proof: The proof is an exercise. Write $E[w(\theta)(\theta - a)^2] = E[w(\theta)(\theta - \hat{a} + \hat{a} - a)^2]$.
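Propositions 1 and 2 are easy to check numerically. The sketch below discretizes a Beta(3, 2)-shaped prior on a grid (an arbitrary choice for illustration) and locates the minimizer of the expected loss by brute force.

```python
import numpy as np

# Discretized prior: unnormalized Beta(3, 2) weights on a grid.
theta = np.linspace(0.0, 1.0, 201)
prior = theta ** 2 * (1 - theta)
prior = prior / prior.sum()

def expected_loss(a, w=None):
    wt = np.ones_like(theta) if w is None else w(theta)
    return np.sum(prior * wt * (theta - a) ** 2)

grid = np.linspace(0.0, 1.0, 2001)

# Proposition 1: the minimizer of E(theta - a)^2 is E(theta).
a_sq = grid[np.argmin([expected_loss(a) for a in grid])]
print(a_sq, np.sum(prior * theta))          # both near 0.6

# Proposition 2: with w(theta) = theta, the minimizer is
# E[theta w(theta)] / E[w(theta)].
w = lambda t: t
a_wt = grid[np.argmin([expected_loss(a, w) for a in grid])]
print(a_wt, np.sum(prior * w(theta) * theta) / np.sum(prior * w(theta)))
```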
Proposition 3: For $\Theta = A = \mathbb{R}$ and $L(\theta, a) = |\theta - a|$, if $\theta$ is random, the optimal action is $\hat{a} = m \equiv \mathrm{Median}(\theta)$.

Proof: (The notation in this proof differs slightly from the rest of these notes.) Without loss of generality assume that $a$ is greater than the median $m$ of $\theta$, so that
\[ p_a \equiv \Pr[\theta > a] \le 0.5, \]
and, for simplicity, that $\Pr[\theta \le m] = \Pr[\theta > m] = 0.5$. Then
\begin{align*}
E[|\theta - a|] &= \int |\theta - a|\,dP \\
&= \int_a^\infty (\theta - a)\,dP + \int_{-\infty}^a (a - \theta)\,dP \\
&= \int_a^\infty (\theta - a)\,dP + \int_{-\infty}^m (a - \theta)\,dP + \int_m^a (a - \theta)\,dP \\
&= \int_a^\infty (\theta - m)\,dP + (m - a)p_a + \int_{-\infty}^m (m - \theta)\,dP + 0.5(a - m) + \int_m^a (a - \theta)\,dP \\
&= \int_a^\infty (\theta - m)\,dP + \int_{-\infty}^m (m - \theta)\,dP + (0.5 - p_a)(a - m) + \int_m^a (\theta - m)\,dP + \int_m^a (m + a - 2\theta)\,dP \\
&= \int |\theta - m|\,dP + (0.5 - p_a)(a - m) + \int_m^a (m + a - 2\theta)\,dP \\
&\ge \int |\theta - m|\,dP + (0.5 - p_a)(a - m) + (m - a)\int_m^a dP \\
&= \int |\theta - m|\,dP + (0.5 - p_a)(a - m) + (0.5 - p_a)(m - a) \\
&= \int |\theta - m|\,dP = E[|\theta - m|].
\end{align*}
The inequality holds because $m + a - 2\theta \ge m - a$ for $m \le \theta \le a$, and the following equality uses $\int_m^a dP = \Pr[\theta > m] - \Pr[\theta > a] = 0.5 - p_a$.
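A quick Monte Carlo check of Proposition 3 (a sketch; the exponential prior is just a convenient skewed example where the mean and median differ):

```python
import numpy as np

rng = np.random.default_rng(1)
theta = rng.exponential(scale=1.0, size=200_000)  # mean 1, median log 2

# Brute-force minimizer of the (empirical) expected absolute error.
grid = np.linspace(0.0, 2.0, 401)
a_abs = grid[np.argmin([np.mean(np.abs(theta - a)) for a in grid])]

print(a_abs, np.median(theta))  # both near log 2 = 0.693, not the mean 1
```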
Proposition 4: For $\Theta = \{\theta_0, \theta_1\}$, $A = \{a_0, a_1\}$, and $L(\theta_i, a_j) = I(i \ne j)$, the optimal action is
\[ \hat{a} = \begin{cases} a_0 & \text{if } \Pr(\theta = \theta_0) > 0.5 \\ a_1 & \text{if } \Pr(\theta = \theta_0) < 0.5. \end{cases} \]

Proof: Note that
\[ E[L(\theta, a_0)] = L(\theta_0, a_0)\Pr(\theta = \theta_0) + L(\theta_1, a_0)\Pr(\theta = \theta_1) = \Pr(\theta = \theta_1) \]
and
\[ E[L(\theta, a_1)] = L(\theta_0, a_1)\Pr(\theta = \theta_0) + L(\theta_1, a_1)\Pr(\theta = \theta_1) = \Pr(\theta = \theta_0). \]
If $\Pr(\theta = \theta_1) < \Pr(\theta = \theta_0)$ the optimal action is $a_0$, and if $\Pr(\theta = \theta_1) > \Pr(\theta = \theta_0)$ the optimal action is $a_1$. However, $\Pr(\theta = \theta_0) + \Pr(\theta = \theta_1) = 1$, so $\Pr(\theta = \theta_1) < \Pr(\theta = \theta_0)$ if and only if $\Pr(\theta = \theta_0) > 0.5$.

3. Optimal Posterior Actions

Suppose we have a data vector $X$ with density $f(x|\theta)$. If $\theta$ is random, i.e., if $\theta$ has a prior density $p(\theta)$, a Bayesian updates the distribution of $\theta$ using the data and Bayes Theorem to get the posterior density
\[ p(\theta|X) = \frac{f(x|\theta)p(\theta)}{\int f(x|\theta)p(\theta)\,d\mu(\theta)}. \]
CLASS: think of $d\mu(\theta) = d\theta$.

The Bayes action is defined to be the action that minimizes the expected loss,
\[ E[L(\theta, a)|X] \equiv E_{\theta|X}[L(\theta, a)]. \]
The Bayes action is just the optimal action when the distribution on $\theta$ is the posterior distribution given $X$. Recognizing this fact, the previous section provides a number of results immediately.

Proposition 1a: For $\Theta = A = \mathbb{R}$, data $X = x$, and $L(\theta, a) = (\theta - a)^2$, if $\theta$ is random, the Bayes action is $\hat{a} = E_{\theta|X}(\theta) = E(\theta|X = x)$.
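Bayes Theorem and Proposition 1a can be checked on a grid. This sketch uses a Beta(2, 2) prior with $X|\theta \sim \mathrm{Bin}(10, \theta)$ (numbers chosen purely for illustration); conjugacy gives the exact posterior mean $(x+2)/(n+4)$ to compare against.

```python
import numpy as np

n, x = 10, 7
theta = np.linspace(0.0, 1.0, 4001)[1:-1]   # grid, avoiding the endpoints
prior = theta * (1 - theta)                 # Beta(2, 2), unnormalized
like = theta ** x * (1 - theta) ** (n - x)  # f(x | theta); constants cancel
post = prior * like
post = post / post.sum()                    # Bayes Theorem on the grid

bayes_action = float(np.sum(post * theta))  # posterior mean: Bayes action
print(bayes_action, (x + 2) / (n + 4))      # both about 0.643
```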
Proposition 2a: For $\Theta = A = \mathbb{R}$, data $X = x$, and $L(\theta, a) = w(\theta)(\theta - a)^2$, if $\theta$ is random, the Bayes action is $\hat{a} = E[\theta w(\theta)|X = x]/E[w(\theta)|X = x]$.

Proposition 3a: For $\Theta = A = \mathbb{R}$, data $X = x$, and $L(\theta, a) = |\theta - a|$, if $\theta$ is random, the Bayes action is $\hat{a} = m \equiv \mathrm{Median}(\theta|X = x)$.

Proposition 4a: For $\Theta = \{\theta_0, \theta_1\}$, data $X = x$, $A = \{a_0, a_1\}$, and $L(\theta_i, a_j) = I(i \ne j)$, the Bayes action is
\[ \hat{a} = \begin{cases} a_0 & \text{if } \Pr(\theta = \theta_0|X = x) > 0.5 \\ a_1 & \text{if } \Pr(\theta = \theta_0|X = x) < 0.5. \end{cases} \]

4. Traditional Decision Theory

With states of nature $\theta \in \Theta$, potential actions $a \in A$, and a data vector $X$ taking values in $\mathcal{X}$ and having density $f(x|\theta)$, a decision function is defined as a mapping of the data into the action space, i.e., $\delta: \mathcal{X} \to A$. With a loss function $L(\theta, a)$, the risk function is defined as
\[ R(\theta, \delta) = E_{X|\theta}\{L[\theta, \delta(X)]\}. \]
To frequentists, the risk function is the soul of decision theory.

The Bayes risk is a frequentist idea of what a Bayesian should worry about. With a prior distribution, call it $p$, on $\theta$, the Bayes risk is defined as
\[ r(p, \delta) = E_\theta[R(\theta, \delta)]. \]
Frequentists think that Bayesians should be concerned about finding the Bayes decision rule that minimizes the Bayes risk.
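For a concrete risk calculation (a sketch; the rule and prior are chosen only for illustration): with $X|\theta \sim \mathrm{Bin}(n, \theta)$, squared error loss, and $\delta(X) = X/n$, the risk is $R(\theta, \delta) = \theta(1-\theta)/n$, and under a uniform prior the Bayes risk is $r(p, \delta) = 1/6n$. Both are confirmed by simulation below.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 20

def risk_mc(theta, reps=200_000):
    # Monte Carlo estimate of R(theta, delta) for delta(X) = X / n.
    x = rng.binomial(n, theta, size=reps)
    return np.mean((theta - x / n) ** 2)

theta0 = 0.3
r_hat = risk_mc(theta0)
print(r_hat, theta0 * (1 - theta0) / n)        # both near 0.0105

# Bayes risk: average R(theta, delta) over theta ~ Uniform(0, 1).
thetas = rng.uniform(size=100_000)
bayes_hat = np.mean(thetas * (1 - thetas) / n)
print(bayes_hat, 1 / (6 * n))                  # both near 0.00833
```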
Formally, for a prior $p$, the Bayes rule is a decision function $\delta_p$ with
\[ r(p, \delta_p) = \inf_\delta r(p, \delta). \]
Bayesians think that they should be concerned with finding the Bayes action given the data, as discussed in the previous section. Fortunately, these amount to the same thing. To minimize the Bayes risk, you pick $\delta(x)$ to minimize
\[ r(p, \delta) = E_\theta[R(\theta, \delta)] = E_\theta\left( E_{X|\theta}\{L[\theta, \delta(X)]\} \right) = E_X\left( E_{\theta|X}\{L[\theta, \delta(X)]\} \right). \]
This can be minimized by picking $\delta(x)$ to be the Bayes action that minimizes $E_{\theta|X=x}\{L[\theta, \delta(x)]\}$ for every value of $x$.

One exception to Bayesians being concerned about the Bayes action rather than the Bayes decision rule is when a Bayesian is trying to design an experiment, hence is concerned with possible data rather than already observed data.

5. Prediction Theory

In prediction theory one wishes to predict an unobserved random vector $y$ based on an observed random vector $x$. Let's say that $y$ has $q$ dimensions and that $x$ has $p - 1$ dimensions. We assume that the joint distribution of $x$ and $y$ is known. Any predictor of $y$ is some function of $x$, say $\tilde{y}(x)$. We define a predictive loss function $L[y, \tilde{y}(x)]$ and seek to find a predictor $\hat{y}(x)$ that minimizes the expected prediction loss, $E\{L[y, \tilde{y}(x)]\}$, where the expectation is over both $y$ and $x$. Note that
\[ E_{x,y}\{L[y, \tilde{y}(x)]\} = E_x\left( E_{y|x}\{L[y, \tilde{y}(x)]\} \right) \]
or in alternative notation
\[ E\{L[y, \tilde{y}(x)]\} = E\left( E\{L[y, \tilde{y}(x)]\,|\,x\} \right). \]
In particular, there is a one-to-one correspondence between prediction theory and the approach of traditional decision theory to Bayesian analysis. We associate $y$ with $\theta$ and $x$ with $X$. In prediction we assume a joint distribution for $x$ and $y$, whereas in Bayesian analysis we specify the sampling distribution and the prior that together determine the joint distribution of $\theta$ and $X$. A predictor $\tilde{y}(x)$ is analogous to a decision rule. The expected prediction error $E_{x,y}\{L[y, \tilde{y}(x)]\}$ is analogous to the Bayes risk. Just as in Bayesian analysis, the way to find the best predictor is, for each value of $x$, to find the value of $\tilde{y}(x)$ that minimizes $E\{L[y, \tilde{y}(x)]\,|\,x\}$.

The most common prediction problem, similar to linear regression, has $y$ taking values in $\mathbb{R}$ and uses squared error loss, $L[y, \tilde{y}(x)] = [y - \tilde{y}(x)]^2$. We want to minimize the expected prediction error $E\{L[y, \tilde{y}(x)]\} = E\{[y - \tilde{y}(x)]^2\}$, where the expectation is over both $y$ and $x$. Identifying prediction with decision and conditioning on $x$, we see that Proposition 1a implies

Proposition 1b: For data $(x, y)$, $y \in \mathbb{R}$, and $L[y, \tilde{y}(x)] = [y - \tilde{y}(x)]^2$, the best predictor is $\hat{y} = E(y|x)$.

Regression, both linear and nonparametric, is about estimating the optimal predictor $E(y|x)$. Note that this result holds even when $y$ is Bernoulli, in which case the best predictor under squared error loss is $E(y|x) = \Pr[y = 1|x]$. Using squared error loss with a Bernoulli variable $y$ is essentially using Brier scores.
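A simulation sketch of Proposition 1b with Bernoulli $y$ (the logistic form of $\Pr(y = 1|x)$ is an arbitrary assumed model): the conditional mean attains a smaller Brier score than the miscalibrated and constant competitors it is compared against.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=300_000)
p = 1 / (1 + np.exp(-2 * x))        # Pr(y = 1 | x), an assumed example form
y = rng.binomial(1, p)

def brier(pred):
    # Mean squared error of a probability forecast (Brier score).
    return np.mean((y - pred) ** 2)

b_opt = brier(p)                            # the best predictor E(y | x)
b_shift = brier(np.clip(p + 0.1, 0, 1))     # a miscalibrated competitor
b_const = brier(np.full_like(p, y.mean()))  # ignores x entirely
print(b_opt, b_shift, b_const)
```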
Similarly we can get other best predictors.

Proposition 2b: For data $(x, y)$, $y \in \mathbb{R}$, and $L[y, \tilde{y}(x)] = w(y)[y - \tilde{y}(x)]^2$, the best predictor is $\hat{y} = E[w(y)y|x]/E[w(y)|x]$.

Proposition 3b: For data $(x, y)$, $y \in \mathbb{R}$, and $L[y, \tilde{y}(x)] = |y - \tilde{y}(x)|$, the best predictor is $\hat{y} = m \equiv \mathrm{Median}(y|x)$.

When $y$ takes values in $\{0, 1\}$, an alternative loss function is the so-called Hamming loss,
\[ L[y, \tilde{y}(x)] = I[y \ne \tilde{y}(x)], \]
wherein a predictor $\tilde{y}(x)$ also needs to take values in $\{0, 1\}$. We want to minimize the expected prediction error $E\{L[y, \tilde{y}(x)]\} = E\{I[y \ne \tilde{y}(x)]\}$, where the expectation is over both $y$ and $x$. We see that Proposition 4a implies

Proposition 4b: For data $(x, y)$, $y \in \{0, 1\}$, and $L[y, \tilde{y}(x)] = I[y \ne \tilde{y}(x)]$, the best predictor is
\[ \hat{y}(x) = \begin{cases} 0 & \text{if } \Pr(y = 0|x) > 0.5 \\ 1 & \text{if } \Pr(y = 0|x) < 0.5. \end{cases} \]

In binary regression people tend to focus on the probability of getting a 1, rather than getting a 0 (which is analogous to a null hypothesis), so it is more common to think of the optimal predictor as
\[ \hat{y}(x) = \begin{cases} 0 & \text{if } \Pr(y = 1|x) < 0.5 \\ 1 & \text{if } \Pr(y = 1|x) > 0.5. \end{cases} \]
Binomial (logistic/probit) regression is about estimating the probability $\Pr(y = 1|x)$. For squared error loss, this gives the estimated optimal predictor. For Hamming loss, the estimated optimal predictor is 0 or 1 depending on whether the estimated value of $\Pr(y = 1|x)$ is less than or greater than 0.5.
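Proposition 4b can be illustrated the same way (a sketch under an assumed logistic model for $\Pr(y = 1|x)$): among cutoffs for $\Pr(y = 1|x)$, thresholding at 0.5 yields the smallest misclassification rate under Hamming loss.

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(size=300_000)
p = 1 / (1 + np.exp(-2 * x))        # Pr(y = 1 | x), an assumed example form
y = rng.binomial(1, p)

def error_rate(threshold):
    # Hamming (0-1) loss of the predictor I[Pr(y = 1 | x) > threshold].
    yhat = (p > threshold).astype(int)
    return np.mean(yhat != y)

rates = {t: error_rate(t) for t in (0.2, 0.35, 0.5, 0.65, 0.8)}
print(rates)                         # the 0.5 cutoff has the smallest rate
```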
Fisher argued (similarly to Bayesians) that prediction problems should be considered entirely as conditional on the predictor vector $x$. However, there are some predictive measures, such as the coefficient of determination $R^2$, that are defined with respect to the distribution of $x$. Measures that depend on the distribution of $x$ are inappropriate to compare when the distribution of $x$ changes. Thus it is common to argue that $R^2$ values for the same model on different data are not comparable. In fact, that is only true if the $x$ data have been sampled from a different population, which is usually the case.
6. Minimax Rules

Definition 1: A decision rule $\delta_0$ is a minimax rule if
\[ \sup_\theta R(\theta, \delta_0) = \inf_\delta \sup_\theta R(\theta, \delta). \]

Definition 2: A prior distribution on $\theta$, say $\tilde{g}$, is a least favorable distribution if
\[ \inf_\delta r(\tilde{g}, \delta) = \sup_g \inf_\delta r(g, \delta). \]
If $\tilde{\delta}$ is a Bayes rule with respect to $\tilde{g}$ then
\[ r(\tilde{g}, \tilde{\delta}) = \inf_\delta r(\tilde{g}, \delta) = \sup_g \inf_\delta r(g, \delta). \]

We present without proof the Minimax Theorem.

Theorem 3: $\inf_\delta \sup_g r(g, \delta) = \sup_g \inf_\delta r(g, \delta)$.

Corollary 4: For any $\delta$, $\sup_\theta R(\theta, \delta) = \sup_g r(g, \delta)$.

Proof: Observe that
\[ r(g, \delta) = E_\theta[R(\theta, \delta)] \le E_\theta[\sup_\theta R(\theta, \delta)] = \sup_\theta R(\theta, \delta), \]
so $\sup_g r(g, \delta) \le \sup_\theta R(\theta, \delta)$. Conversely, by considering the subset of priors $g_\theta$, $\theta \in \Theta$, that take on the value $\theta$ with probability one, note that $r(g_\theta, \delta) = R(\theta, \delta)$ and
\[ \sup_g r(g, \delta) \ge \sup_{\theta \in \Theta} r(g_\theta, \delta) = \sup_\theta R(\theta, \delta). \]
Proposition 5: If the Minimax Theorem holds, $\delta_0$ is a minimax rule, and $\tilde{g}$ is a least favorable distribution with corresponding Bayes rule $\tilde{\delta}$, then $\delta_0$ is also a Bayes rule with respect to the least favorable distribution. (If the Bayes rule happens to be unique, we must have $\delta_0 = \tilde{\delta}$.)

Proof: Using Corollary 4, Definition 1, Corollary 4, the Minimax Theorem 3, and Definition 2,
\begin{align*}
r(\tilde{g}, \delta_0) &\le \sup_g r(g, \delta_0) \\
&= \sup_\theta R(\theta, \delta_0) \\
&= \inf_\delta \sup_\theta R(\theta, \delta) \\
&= \inf_\delta \sup_g r(g, \delta) \\
&= \sup_g \inf_\delta r(g, \delta) \\
&= \inf_\delta r(\tilde{g}, \delta) \\
&\le r(\tilde{g}, \tilde{\delta}).
\end{align*}
This must be an equality since we know by definition of the Bayes rule that $r(\tilde{g}, \tilde{\delta}) \le r(\tilde{g}, \delta_0)$. Since $\delta_0$ and $\tilde{\delta}$ have the same Bayes risk, they must both be Bayes rules.

The point is that a Bayes rule for a least favorable distribution isn't necessarily a minimax rule, but a minimax rule, if it exists, is necessarily a Bayes rule for a least favorable distribution.
Definition 6: $\delta_0$ is an equalizer rule if, for some constant $K$, $R(\theta, \delta_0) = K$ for all $\theta$.

Proposition 7: If the Minimax Theorem holds and if $\delta_0$ is both an equalizer rule and the Bayes rule for some prior distribution $g_0$, then $\delta_0$ is minimax.

Proof:
\[ \inf_\delta \sup_\theta R(\theta, \delta) \le \sup_\theta R(\theta, \delta_0) = K = r(g_0, \delta_0) = \inf_\delta r(g_0, \delta) \le \sup_g \inf_\delta r(g, \delta). \]
By the Minimax Theorem, all of these are equal, so in particular
\[ \inf_\delta \sup_\theta R(\theta, \delta) = \sup_\theta R(\theta, \delta_0). \]

Exercise: Let $X \sim \mathrm{Bin}(n, \theta)$ and $\theta \sim \mathrm{Beta}(\alpha, \beta)$. Assume that the Minimax Theorem holds! For squared error loss, find the Bayes rule, say $\delta_{\alpha\beta}$. Find $R(\theta, \delta_{\alpha\beta})$. Pick $\alpha$ and $\beta$ so that $\delta_{\alpha\beta}$ is an equalizer rule. Establish that $\delta_{\alpha\beta}$ is a minimax rule.
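As a numerical aid for the exercise (a sketch, not the solution): under squared error the Bayes rule is the posterior mean $\delta_{\alpha\beta}(X) = (X + \alpha)/(n + \alpha + \beta)$, and its risk decomposes as variance plus squared bias. The code evaluates that risk on a grid of $\theta$ values so you can experiment with $\alpha$ and $\beta$ until the risk curve is flat.

```python
import numpy as np

def risk(theta, n, alpha, beta):
    # R(theta, delta) = Var[delta(X)] + bias^2
    # for delta(X) = (X + alpha) / (n + alpha + beta).
    d_mean = (n * theta + alpha) / (n + alpha + beta)
    d_var = n * theta * (1 - theta) / (n + alpha + beta) ** 2
    return d_var + (d_mean - theta) ** 2

n = 25
thetas = np.linspace(0.01, 0.99, 50)
print(risk(thetas, n, 1.0, 1.0).round(5))  # uniform prior: risk varies with theta
```

When the printed risks are all equal for your choice of $\alpha$ and $\beta$, $\delta_{\alpha\beta}$ is an equalizer rule and, by Proposition 7, minimax.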