Topics in Data Analysis
Steven N. Durlauf
University of Wisconsin

Lecture Notes 1: Decisions and Data

In these notes, I describe some basic ideas in decision theory. Decision theory is constructed from

Data: $d \in D$
Unknown(s): $\theta \in \Theta$
Choices: $c \in C$
Loss function: $l(\theta, d, c)$

The goal is to construct a decision rule $c(d)$, a mapping from $D$ to the set of choices $C$. The decision rule may be stochastic, a possibility I will ignore for expositional purposes.

1. Statistical decision theory

The standard version of statistical decision theory is formulated along Bayesian lines: all unknowns are associated with posterior probability densities, and the decision rule is determined by minimization of expected loss. Specifically, one implicitly chooses $c(d)$ by solving the problem

$$\min_{c \in C} \int_{\Theta} l(\theta, d, c)\, \mu(\theta \mid d)\, d\theta \tag{1.1}$$

for each value of $d$.
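The expected loss minimization above can be sketched for a finite set of unknowns and choices. This is only an illustration: the loss table and the posterior probabilities below are hypothetical numbers, and the function name `bayes_decision` is not from the notes.

```python
import numpy as np

def bayes_decision(loss, posterior):
    """Return the index of the choice with the smallest posterior expected loss.

    loss[i, j]   : loss l(theta_i, d, c_j) evaluated at the observed d
    posterior[i] : posterior probability mu(theta_i | d)
    """
    expected = posterior @ loss          # expected loss of each choice c_j
    return int(np.argmin(expected)), expected

# Hypothetical numbers, chosen only for illustration.
loss = np.array([[0.0, 4.0],     # theta_1: losses under c_1, c_2
                 [10.0, 1.0]])   # theta_2: losses under c_1, c_2
posterior = np.array([0.8, 0.2])

best, expected = bayes_decision(loss, posterior)
print(best, expected)
```

With a continuous unknown the sum becomes an integral, but the logic is the same: the data enter only through the posterior, and the rule is found separately for each value of $d$.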

This problem illustrates the key question facing an expected loss minimizing policy analyst: how does one construct $\mu(\theta \mid d)$? To understand this conditional probability, observe that

$$\mu(\theta \mid d) = \frac{\mu(\theta, d)}{\mu(d)} = \frac{\mu(d \mid \theta)\, \mu(\theta)}{\mu(d)} \tag{1.2}$$

Since $\mu(d)$ is not a function of $\theta$ (it is really nothing more than a normalization that ensures that the relevant probabilities sum to 1), one can rewrite (1.2) as

$$\mu(\theta \mid d) \propto \mu(d \mid \theta)\, \mu(\theta) \tag{1.3}$$

This is the classic statement that the posterior probability measure for some unknown, $\mu(\theta \mid d)$, is proportional to the product of the likelihood function $\mu(d \mid \theta)$ and the prior probability measure $\mu(\theta)$. The prior represents the information the analyst possesses about the unknown before (i.e. prior to!) observing the data. For some types of unknowns (e.g. parameters) and associated regularity conditions, it will be the case that as the number of observations grows, the likelihood will swamp the prior.

loss functions and standard statistical exercises

The statistical decision formulation does not, at first glance, appear closely related to the standard exercises of producing an estimate of a parameter, etc. However, these types of exercises may be reproduced in their essentials by a suitable choice of the loss function. Consider the loss function

$$l(\theta, d, c) = \left(c(d) - \theta\right)^2 \tag{1.4}$$

Under this loss function, the decision $c(d)$ will equal $E(\theta \mid d)$, i.e. it will produce a parameter estimate that is equal to the posterior mean of the parameter. (The proof is left as an exercise.) This posterior mean is the Bayesian analog to a point estimate in frequentist analysis. (More on this later.)

priors

The assignment of priors is problematic, since in many (most?) contexts, the analyst does not have a good basis for constructing $\mu(\theta)$. This is true in several senses. First, for many problems one may simply have nothing to say about the ex ante relative probability of different realizations. A second problem is that what information is possessed may not quantify easily.

Within Bayesian statistics, a number of approaches exist to addressing this difficulty. For our purposes, I will focus on the question of constructing priors in the case where one is ignorant, i.e. one does not have any basis for discriminating between different values of the unknown. As an example, I will consider analysis of the mean and variance of a sequence of i.i.d. random variables with mean $\kappa$ and variance $\sigma^2$.

The classical approach to specifying priors in the presence of ignorance is the principle of insufficient reason, an idea which is often attributed to Bernoulli and Laplace (although the name of the principle postdates them). The basic idea of the principle is that the absence of any reason to discriminate between two values of an unknown is interpreted as assigning equal prior probabilities to them.

How does the principle of insufficient reason translate into a prior for $\kappa$? If the support of the mean, under ignorance, is $(-\infty, \infty)$, then the associated prior must be constant, i.e. a uniform density on the real line, i.e. $\mu(\kappa) \propto 1$. Notice that in this case, the prior is improper, which means that it does not integrate to 1.
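The claim that squared-error loss leads to the posterior mean can be checked numerically. The sketch below discretizes the unknown on a grid, forms a posterior as likelihood times a flat prior, and minimizes expected squared-error loss by brute force. The data values, the unit-variance normal likelihood and the grid are all assumptions made for illustration.

```python
import numpy as np

# Grid of candidate values for the unknown theta (an assumed discretization).
theta = np.linspace(-5.0, 5.0, 2001)

# Hypothetical data: three draws assumed to be N(theta, 1).
d = np.array([0.3, 1.1, 0.7])

# Flat prior on the grid, standing in for a uniform prior under ignorance.
prior = np.ones_like(theta)

# Log-likelihood of the data at each grid value, up to an additive constant.
loglik = -0.5 * ((d[:, None] - theta[None, :]) ** 2).sum(axis=0)

# Posterior is proportional to likelihood times prior; normalize on the grid.
post = np.exp(loglik - loglik.max()) * prior
post /= post.sum()

post_mean = (theta * post).sum()

# Expected squared-error loss for every candidate choice c on the same grid:
# rows index the choice c, columns index theta.
exp_loss = ((theta[None, :] - theta[:, None]) ** 2 * post[None, :]).sum(axis=1)
c_star = theta[np.argmin(exp_loss)]

print(post_mean, c_star)
```

Up to the grid spacing, the minimizer `c_star` coincides with `post_mean`. With the flat prior, the posterior mean is also essentially the sample mean of the data.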

The case of the variance $\sigma^2$ is more complicated. One triviality: the parameter that is typically studied is the standard deviation $\sigma$. Since the variance cannot be negative, the support of $\sigma$ is, under ignorance, taken to be $[0, \infty)$. For this case, the standard ignorant prior is taken to be $\mu(\sigma) \propto 1/\sigma$. The motivation of this formulation is that it assigns a uniform prior to $\log \sigma$, whose support is $(-\infty, \infty)$.

This example suggests a more general problem in formulating ignorance in terms of a uniform prior. Consider a parameter vector $\theta$ and its associated transformation into another parameter vector $\phi = f(\theta)$. Assuming all probability measures can be represented as densities, for a given prior $\mu(\theta)$, it must be the case that the associated prior for $\phi$ is

$$\mu(\phi) = \mu\left(f^{-1}(\phi)\right) \left| \frac{d f^{-1}(\phi)}{d\phi} \right| \tag{1.5}$$

This means that the uniform prior is not invariant across nonlinear transformations of the unknowns, which makes little sense if the prior really captures ignorance, since ignorance about $\theta$ presumably implies ignorance about $\phi$.

One way to impose invariance of this type is via Jeffreys' prior. Recalling that Fisher's information matrix is defined as

$$I(\theta) = -E\left[\frac{\partial^2 \log \mu(d \mid \theta)}{\partial\theta\, \partial\theta'}\right] = -\int_D \frac{\partial^2 \log \mu(d \mid \theta)}{\partial\theta\, \partial\theta'}\, \mu(d \mid \theta)\, dd \tag{1.6}$$

Jeffreys' prior is defined as

$$\mu(\theta) \propto \left| I(\theta) \right|^{1/2} \tag{1.7}$$
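As a worked illustration of how Jeffreys' prior operates, consider the standard deviation of a normal random variable. The derivation below assumes a single observation $x \sim N(\kappa, \sigma^2)$ with $\kappa$ known; this simplification is mine and is not taken from the notes. It recovers a prior that is uniform in $\log \sigma$.

```latex
% Assumption: one observation x ~ N(\kappa, \sigma^2), with \kappa known.
\begin{align*}
\log \mu(x \mid \sigma) &= -\log \sigma - \frac{(x - \kappa)^2}{2\sigma^2} + \text{const}, \\
\frac{\partial^2 \log \mu(x \mid \sigma)}{\partial\sigma\, \partial\sigma}
  &= \frac{1}{\sigma^2} - \frac{3 (x - \kappa)^2}{\sigma^4}, \\
I(\sigma) &= -E\left[\frac{\partial^2 \log \mu(x \mid \sigma)}{\partial\sigma\, \partial\sigma}\right]
  = -\frac{1}{\sigma^2} + \frac{3\sigma^2}{\sigma^4} = \frac{2}{\sigma^2}, \\
\mu(\sigma) &\propto I(\sigma)^{1/2} \propto \frac{1}{\sigma}.
\end{align*}
```

With $K$ independent observations the information is multiplied by $K$, which does not change the proportionality.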

This prior addresses the invariance problem in that the form of the prior can be shown to be unchanged, but raises difficulties in terms of interpretability. One issue is that the prior depends on the form of the likelihood, which seems odd since the prior is supposed to describe beliefs about parameters.

Example 1.1. Priors and posteriors for mean of normal random variables

Suppose that $x_i \sim N(\kappa, \sigma^2)$, $i = 1, \ldots, K$, with $\sigma^2$ known; the prior for $\kappa$ is $N(\kappa_0, \sigma_0^2)$. It is straightforward to show that the posterior density for the mean is

$$\mu\left(\kappa \mid \sigma^2, \kappa_0, \sigma_0^2, x_1, \ldots, x_K\right) = N\left( \frac{\dfrac{\kappa_0}{\sigma_0^2} + \dfrac{K \bar{x}}{\sigma^2}}{\dfrac{1}{\sigma_0^2} + \dfrac{K}{\sigma^2}},\ \frac{1}{\dfrac{1}{\sigma_0^2} + \dfrac{K}{\sigma^2}} \right) \tag{1.8}$$

where $\bar{x}$ is the sample mean of the data. Notice that as $K \to \infty$, the effects of the prior on the posterior disappear, i.e. the posterior mean converges to the sample mean.

Example 1.2. Linear regression with diffuse prior

Suppose that $y_i = x_i \beta + \varepsilon_i$ with $\varepsilon_i \sim N(0, \sigma^2)$, $\sigma^2$ known, $\varepsilon_i$ uncorrelated, and $x_i$ nonstochastic. Assume that the prior on $\beta$ is uniform across $R^L$, as described above. One can show that the posterior density of $\beta$ is

$$N\left(\hat{\beta}_{OLS},\ \sigma^2 (X'X)^{-1}\right) \tag{1.9}$$

where $\hat{\beta}_{OLS}$ is the OLS estimator. For this special case, there is a nice relationship between the posterior density computed by a Bayesian and the OLS objects computed by a frequentist.

2. Decision theory without probabilities

In this section, I discuss decision criteria when probabilities are not available for the unobservables of interest.

minimax

The minimax solution is to choose $c$ to solve

$$\min_{c \in C} \max_{\theta \in \Theta} l(\theta, d, c) \tag{1.10}$$

The minimax approach is usually associated with Abraham Wald (1950). An important axiomatization is due to Itzhak Gilboa and David Schmeidler (1989). The criterion implicitly means that the policymaker is extremely risk averse.

There are interesting applications of the criterion in philosophy, notably John Rawls's (1971) difference principle. The idea of the difference principle is the following. Suppose that individuals are asked to rank different distributions of economic outcomes in a society, under the proviso that they do not know which of the outcomes will turn out to be theirs. Rawls argues that individuals will choose the distribution that maximizes the utility of the worst off person. As pointed out by Kenneth Arrow (1973), this is a minimax argument. It can further be understood as the limit of a welfarist analysis. For example, suppose that individual utility is defined by $u_i = \omega_i^{\alpha}$, where $\omega_i$ denotes a scalar measure of individual $i$'s outcome. Further, suppose that the social state $\omega$ is assigned the

social welfare measure $W = \left(\sum_i \omega_i^{\alpha}\right)^{1/\alpha}$. Then, as $\alpha \to -\infty$, $W \to \min_i \omega_i$. The limit $\alpha \to -\infty$ implies an arbitrary degree of risk aversion. But this must be true for any monotonic transformation of the social welfare function, so it is also true for a utilitarian welfare calculation.

minimax regret

An alternative approach to decisionmaking without assigning probabilities to unknowns is to choose $c$ to minimize the maximum regret. Minimax regret was originally proposed by Leonard J. Savage (1951) as an alternative to minimax which is less conservative. Charles Manski is the leading advocate of its use among current researchers; he has produced numerous important papers applying minimax regret to different contexts. See Manski (2008) for a conceptual defense.

Regret is defined as

$$r(\theta, d, c) = l(\theta, d, c) - \min_{c' \in C} l(\theta, d, c') \tag{1.11}$$

The minimax regret solution is

$$\min_{c \in C} \max_{\theta \in \Theta} r(\theta, d, c) \tag{1.12}$$

A recent axiomatization is due to Jörg Stoye (2006a,b).

Example 1.3. minimax, minimax regret, and expected loss

Suppose that there are two possible actions by the policymaker, $c_1$ and $c_2$, and two possible states of the world, $\theta_1$ and $\theta_2$. The following table gives the losses for each policy and state of the world, as well as the maximum loss and maximum regret.

[Table: losses of policies $c_1$ and $c_2$ in states $\theta_1$ and $\theta_2$, with columns for max loss and max regret; numerical entries not recoverable from the source.]

As the table indicates, the minimax criterion and the minimax regret criterion select the same policy. A Bayesian would address the policy problem by assigning probability $p$ ($1 - p$) to $\theta_1$ ($\theta_2$) and computing expected losses under the two policies. It is obvious that if $p$ is close enough to one end of the unit interval, the choice is determined by the losses in the correspondingly likely state.

Example 1.4. minimax and minimax regret producing different policy choices

Consider the alternative loss structure

[Table: losses of policies $c_1$ and $c_2$ in states $\theta_1$ and $\theta_2$, with columns for max loss and max regret; numerical entries not recoverable from the source.]

Here the minimax criterion and the minimax regret criterion select different policies. This example illustrates the intuitive appeal of minimax regret in that the criterion focuses on differences in policy effects rather than absolute levels.

Example 1.5. minimax regret and the independence of irrelevant alternatives property

This example modifies Example 1.3 by adding a third policy $c_3$.

[Table: losses of policies $c_1$, $c_2$ and $c_3$ in states $\theta_1$ and $\theta_2$, with columns for max loss and max regret; numerical entries not recoverable from the source.]

The introduction of $c_3$ changes the minimax regret choice. This is troubling; the fact that $c_3$ is available has changed the ranking of $c_1$ and $c_2$. This violates the property of independence of irrelevant alternatives (IIA), which is often taken to be a natural axiom for decisionmaking. The recognition that minimax regret violates IIA is due to Herman Chernoff (1954).

One way to interpret this violation is that context matters for decisionmaking, an idea that is found in a number of contexts studied in behavioral economics. It is not clear, though, that the findings on context dependent decisionmaking map that closely to the IIA violations found in minimax regret. This is something worth investigating.

Example 1.6. Another minimax regret IIA violation

Here is another case where a violation of IIA occurs. Compare the analysis of two policies

[Table: losses of policies $c_1$ and $c_2$ in states $\theta_1$ and $\theta_2$, with columns for max loss and max regret; numerical entries not recoverable from the source.]

with the comparison of 3 policies

[Table: losses of policies $c_1$, $c_2$ and $c_3$ in states $\theta_1$ and $\theta_2$, with columns for max loss and max regret; numerical entries not recoverable from the source.]

One policy is the minimax choice in the two policy case; but if $c_3$ (respectively, the policy with the symmetric structure) were not available, that other policy (respectively, $c_3$) would be the minimax solution. This seems especially paradoxical (at least to me) since the two policies have symmetric structures.

It is the case (see Stoye (2006a,b) for formal analysis) that if one is willing to forgo IIA, one can derive minimax regret under an axiom system which includes an assumption of independence of never-optimal alternatives. This means that adding a policy that is not optimal in any state of nature (when compared to the others) cannot lead one to change one's choice among the original set.

Hurwicz criterion

One can mix the different approaches that have been described. For example, one can consider problems that incorporate both expected loss and minimax considerations:

$$\min_{c \in C} \left( \alpha \max_{\theta \in \Theta} l(\theta, d, c) + (1 - \alpha)\, E\, l(\theta, d, c) \right) \tag{1.13}$$

This was originally proposed by Leo Hurwicz (1951).

3. Decision rules and risk
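Before turning to risk, the criteria discussed above (minimax, minimax regret, and the mixture of maximum and expected loss) can be put side by side in a short Python sketch. The loss tables are hypothetical numbers constructed for illustration; they are not the entries of the examples in these notes. They are chosen so that minimax and minimax regret disagree, and so that adding a third policy reverses the minimax regret ranking of the first two.

```python
import numpy as np

def minimax(loss):
    """Index of the policy whose worst-case loss is smallest."""
    return int(np.argmin(loss.max(axis=1)))

def regret(loss):
    """Regret: loss minus the smallest loss attainable in each state."""
    return loss - loss.min(axis=0, keepdims=True)

def minimax_regret(loss):
    """Index of the policy whose worst-case regret is smallest."""
    return int(np.argmin(regret(loss).max(axis=1)))

def mixed_criterion(loss, probs, alpha):
    """Index minimizing alpha * max loss + (1 - alpha) * expected loss."""
    score = alpha * loss.max(axis=1) + (1 - alpha) * (loss @ probs)
    return int(np.argmin(score))

# Rows are policies, columns are states of the world (hypothetical losses).
two_policies = np.array([[2.0, 8.0],     # c_1
                         [6.0, 5.0]])    # c_2
three_policies = np.vstack([two_policies, [12.0, 0.0]])   # add c_3

print(minimax(two_policies))             # policy index under minimax
print(minimax_regret(two_policies))      # policy index under minimax regret
print(minimax_regret(three_policies))    # minimax regret once c_3 is available
print(mixed_criterion(two_policies, np.array([0.5, 0.5]), alpha=0.5))
```

Indices are zero-based, so index 0 is $c_1$. In this constructed table $c_3$ is never chosen, yet its presence lowers the best attainable loss in the second state. That raises the regret of $c_1$ there and moves the minimax regret choice from $c_1$ to $c_2$.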

The risk function, defined as

$$R\left(\theta, c(\cdot)\right) = \int_D l\left(\theta, d, c(d)\right) \mu(d \mid \theta)\, dd \tag{1.14}$$

is a metric for understanding the performance of a rule across the data space, conditional on the unknown. Notice that this calculation weights performance across the data space by probabilities.

Relative to the set of potential decision rules, a rule $c(\cdot)$ is (strictly) dominant if it produces a (strictly) lower risk than any alternative rule for all values of $\theta$. A decision rule is said to be inadmissible if there exists an alternative rule with lower risk for all values of $\theta$.

There is no guarantee that a unique dominant rule exists. This lack of uniqueness leads to criteria such as minimax risk.

The Bayes risk of a rule is defined as

$$\int_\Theta R\left(\theta, c(\cdot)\right) \mu(\theta)\, d\theta = \int_\Theta \int_D l\left(\theta, d, c(d)\right) \mu(d \mid \theta)\, \mu(\theta)\, dd\, d\theta \tag{1.15}$$

Recall that $\mu(d \mid \theta)\, \mu(\theta) = \mu(\theta \mid d)\, \mu(d)$. Exchanging the order of integration (ok under uninteresting regularity assumptions) and substituting this identity into (1.15), the Bayes risk is

$$\int_D \left[ \int_\Theta l\left(\theta, d, c(d)\right) \mu(\theta \mid d)\, d\theta \right] \mu(d)\, dd \tag{1.16}$$

It is evident that minimizing the double integral occurs by minimizing the inner integral at each data point. Minimization of the inner integral is equivalent to the original Bayesian decision theory solution. Notice that the risk function becomes superfluous from this perspective.
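The risk function and the Bayes risk can be computed exactly in a small discrete setting. The sketch below assumes binomial data, squared-error loss, and two candidate rules (the sample proportion, and the posterior mean under a uniform prior). None of these choices come from the notes; they are picked only because the sums are finite.

```python
import numpy as np
from math import comb

n = 10                                   # hypothetical number of Bernoulli trials
d = np.arange(n + 1)                     # data space D: number of successes
thetas = np.linspace(0.0, 1.0, 101)      # grid for the unknown theta

# Two candidate decision rules c(d), both assumptions made for illustration.
rules = {
    "sample_mean": d / n,
    "posterior_mean_uniform_prior": (d + 1) / (n + 2),
}

def risk(rule, theta):
    """Risk under squared-error loss: sum over d of l(theta, d, c(d)) * mu(d | theta)."""
    pmf = np.array([comb(n, int(k)) * theta**int(k) * (1 - theta)**(n - int(k)) for k in d])
    return float(((rule - theta) ** 2 * pmf).sum())

risk_table = {name: np.array([risk(rule, t) for t in thetas])
              for name, rule in rules.items()}

# Maximum risk over theta, the quantity compared under minimax risk.
max_risk = {name: r.max() for name, r in risk_table.items()}

# Bayes risk under a uniform prior, approximated by averaging over the grid.
bayes_risk = {name: r.mean() for name, r in risk_table.items()}

print(max_risk)
print(bayes_risk)
```

In this setting neither rule dominates: the sample proportion has lower risk near the boundaries of the parameter space, and the posterior mean has lower risk in the middle. This is why a further criterion, such as maximum risk or Bayes risk, is needed to rank them.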

A general result is that any decision rule which minimizes Bayes risk relative to a proper prior is admissible. A form of the converse is also true: any admissible rule minimizes Bayes risk under some prior.

4. Fiducial inference

One way to understand the differences between Bayesian and frequentist approaches is that Bayesian approaches construct probabilities of unobservables based on observables, whereas frequentist approaches are based on the analysis of the probabilities of observables based on unobservables. Statistical decision theory is based on the former.

Fiducial inference is a method, originally proposed by Ronald Fisher, to address this distinction. Consider a sequence of $K$ Normal$(\lambda, \sigma^2)$ random variables $x_i$. Assume the variance is known. Then the sample mean $\bar{x}$ has the probability density $N\left(\lambda, \sigma^2 / K\right)$. It is straightforward to show that this implies that $\bar{x} - \lambda$ has the probability density $N\left(0, \sigma^2 / K\right)$, which is also the density of $\lambda - \bar{x}$. $\lambda - \bar{x}$ is an example of what is known as a pivotal quantity, which means that its distribution does not depend on $\lambda$. The fiducial argument is that since the deviation of $\lambda$ from $\bar{x}$ is defined by a pivotal quantity, this pivotal quantity characterizes the uncertainty associated with the parameter, i.e.

$$\lambda - \bar{x} \sim N\left(0, \frac{\sigma^2}{K}\right) \ \text{implies} \ \lambda \sim N\left(\bar{x}, \frac{\sigma^2}{K}\right) \tag{1.17}$$

The argument is generally rejected since, without justification, it represents a transformation of a nonrandom object, in this case $\lambda$, into a random one.

A variation of Fisher's ideas has been developed by Donald Fraser (see Fraser

(1968) for a comprehensive statement), called structural inference. It is also controversial and has yet to impact empirical work. However, unlike Fisher's original formulation, I do not believe one can say it is regarded as a failure. I personally find the structural inference ideas very intriguing, although quite hard to fully understand.
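The pivotal-quantity step in the fiducial argument above can be checked by simulation: the distribution of the deviation of the mean parameter from the sample mean should look the same whatever the true mean is. The sample size, standard deviation, number of replications and the values tried for the mean are all hypothetical choices for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
K, sigma = 25, 2.0            # hypothetical sample size and known standard deviation
reps = 20_000                 # number of simulated samples per value of the mean

def pivot_draws(lam):
    """Simulate lam - xbar for samples of K draws from N(lam, sigma^2)."""
    x = rng.normal(loc=lam, scale=sigma, size=(reps, K))
    return lam - x.mean(axis=1)

for lam in (-3.0, 0.0, 10.0):
    p = pivot_draws(lam)
    # The mean and standard deviation should be near 0 and sigma / sqrt(K) for every lam.
    print(lam, round(p.mean(), 3), round(p.std(), 3), sigma / np.sqrt(K))
```

The simulation only confirms the frequentist fact that the pivot's distribution is free of the mean parameter. The contested step in the fiducial argument, reading that distribution as a statement about the parameter given the observed sample mean, is not something a simulation of this kind can settle.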

References

Arrow, K., (1973), "Some Ordinalist-Utilitarian Notes on Rawls's Theory of Justice," Journal of Philosophy, LXX, 9.

Chernoff, H., (1954), "Rational Selection of Decision Functions," Econometrica, 22.

Fraser, D., (1968), The Structure of Inference, New York: John Wiley.

Gilboa, I. and D. Schmeidler, (1989), "Maxmin Expected Utility with Non-Unique Prior," Journal of Mathematical Economics, 18.

Hurwicz, L., (1951), "Some Specification Problems and Applications to Econometric Models," Econometrica, 19.

Manski, C., (2008), "Actualizing Rationality," mimeo, Northwestern University.

Rawls, J., (1971), A Theory of Justice, Cambridge: Harvard University Press.

Savage, L. J., (1951), "The Theory of Statistical Decision," Journal of the American Statistical Association, 46.

Stoye, J., (2006a), "Statistical Decisions Under Ambiguity," mimeo, NYU.

Stoye, J., (2006b), "Axioms for Minimax Regret Choice Correspondences," mimeo, NYU.

Wald, A., (1950), Statistical Decision Functions, New York: John Wiley.
