Topics in Data Analysis
Steven N. Durlauf
University of Wisconsin

Lecture Notes 1: Decisions and Data

In these notes, I describe some basic ideas in decision theory. The theory is constructed from:

  The data: $d$
  Unknown(s): $\theta$
  Choices: $c \in C$
  Loss function: $l(\theta, d, c)$

The goal is to construct a decision rule $c(d)$, a mapping from $D$ to the set of choices $C$. The decision rule may be stochastic, a possibility I will ignore for expositional purposes.

1. Statistical decision theory

The standard version of statistical decision theory is formulated along Bayesian lines: all unknowns are associated with posterior probability densities, and the decision rule is determined by minimization of expected loss. Specifically, one implicitly chooses $c(d)$ by solving the problem

\[
\min_{c \in C} \int_{\Theta} l(\theta, d, c)\,\mu(\theta \mid d)\,d\theta \tag{1.1}
\]

for each value of $d$.
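The following is a minimal numerical sketch of (1.1), not part of the original notes: it discretizes $\theta$, posits an illustrative posterior, and picks the choice with smallest expected loss. The particular posterior and the asymmetric loss function are assumptions made purely for illustration.

```python
import numpy as np

# Sketch of (1.1): choose c in C to minimize posterior expected loss.
theta_grid = np.linspace(-3.0, 3.0, 601)            # discretized unknown theta
posterior = np.exp(-0.5 * (theta_grid - 0.8) ** 2)  # unnormalized N(0.8, 1) posterior
posterior /= posterior.sum()                        # normalize to a probability vector

choices = np.linspace(-3.0, 3.0, 601)               # candidate choices c in C

def expected_loss(c, loss):
    """Posterior expected loss of choice c under loss(theta, c)."""
    return np.sum(loss(theta_grid, c) * posterior)

# An asymmetric loss: overshooting theta is penalized twice as heavily.
loss = lambda theta, c: np.where(c > theta, 2.0, 1.0) * (c - theta) ** 2

c_star = min(choices, key=lambda c: expected_loss(c, loss))
print(c_star)  # lies below the posterior mean because overshooting is costlier
```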

This problem illustrates the key question facing an expected loss minimizing policy analyst: how does one construct $\mu(\theta \mid d)$? To understand this conditional probability, observe that

\[
\mu(\theta \mid d) = \frac{\mu(\theta, d)}{\mu(d)} = \frac{\mu(d \mid \theta)\,\mu(\theta)}{\mu(d)} \tag{1.2}
\]

Since $\mu(d)$ is not a function of $\theta$ (it is really nothing more than a normalization that ensures that the relevant probabilities sum to 1), one can rewrite (1.2) as

\[
\mu(\theta \mid d) \propto \mu(d \mid \theta)\,\mu(\theta) \tag{1.3}
\]

This is the classic statement that the posterior probability measure for some unknown, $\mu(\theta \mid d)$, is proportional to the product of the likelihood function $\mu(d \mid \theta)$ and the prior probability measure $\mu(\theta)$. The prior represents the information the analyst possesses about the unknown before (i.e. prior!) seeing the data. For some types of unknowns (e.g. parameters) and associated regularity conditions, it will be the case that as the number of observations grows, the likelihood will swamp the prior.

loss functions and standard statistical exercises

The statistical decision formulation does not, at first glance, appear closely related to the standard exercises of producing an estimate of a parameter, etc. However, these types of exercises may be reproduced in their essentials by a suitable choice of the loss function. Consider the loss function

\[
l(\theta, d, c) = (c - \theta)^2 \tag{1.4}
\]

Under this loss function, the decision $c(d)$ will equal $E(\theta \mid d)$; that is, the analyst will produce a parameter estimate equal to the posterior mean of the parameter. (The proof is left as an exercise; a sketch appears below.) This posterior mean is the Bayesian analog to a point estimate in frequentist analysis. (More on this later.)
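For completeness, the exercise takes one line. Expanding the conditional expectation of (1.4) and minimizing over $c$:

\[
E\big[(c - \theta)^2 \mid d\big] = c^2 - 2c\,E(\theta \mid d) + E(\theta^2 \mid d),
\]
\[
\frac{\partial}{\partial c}\,E\big[(c - \theta)^2 \mid d\big] = 2c - 2E(\theta \mid d) = 0 \;\Rightarrow\; c(d) = E(\theta \mid d),
\]

and the second derivative equals $2 > 0$, confirming a minimum.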

priors

The assignment of priors is problematic, since in many (most?) contexts the analyst does not have a good basis for constructing $\mu(\theta)$. This is true in several senses: first, for many problems one may simply have nothing to say about the ex ante relative probability of different realizations. A second problem is that what information is possessed may not quantify easily. Within Bayesian statistics, a number of approaches exist to address this difficulty. For our purposes, I will focus on the question of constructing priors in the case where one is ignorant, i.e. one does not have any basis for discriminating between different values of the unknown. As an example, I will consider analysis of the mean and variance of a sequence of i.i.d. random variables with mean $\kappa$ and variance $\sigma^2$.

The classical approach to specifying priors in the presence of ignorance is the principle of insufficient reason, an idea often attributed to Bernoulli and Laplace (although the name of the principle postdates them). The basic idea of the principle is that the absence of any reason to discriminate between two values of an unknown is interpretable as saying that we assign equal prior probabilities to them.

How does the principle of insufficient reason translate into a prior for $\kappa$? If the support of the mean, under ignorance, is $(-\infty, \infty)$, then the associated prior must be constant, i.e. a uniform density on the real line, $\mu(\kappa) \propto 1$. Notice that in this case the prior is improper, which means that it does not integrate to 1.

The case of the variance $\sigma^2$ is more complicated. One triviality: the parameter that is typically studied is the standard deviation $\sigma$. Since the standard deviation cannot be negative, the support of $\sigma$ is, under ignorance, taken to be $[0, \infty)$. For this case, the standard ignorance prior is taken to be $\mu(\sigma) \propto 1/\sigma$. The motivation for this formulation is that it assigns a uniform prior to $\log \sigma$, whose support is $(-\infty, \infty)$.

This example suggests a more general problem in formulating ignorance in terms of a uniform prior. Consider a parameter vector $\theta$ and its associated transformation into another parameter vector $\varphi = f(\theta)$. Assuming all probability measures can be represented as densities, for a given prior $\mu(\theta)$, it must be the case that the associated prior for $\varphi$ is

\[
\mu(\varphi) = \mu\big(f^{-1}(\varphi)\big)\left|\frac{d f^{-1}(\varphi)}{d\varphi}\right| \tag{1.5}
\]

This means that the uniform prior is not invariant across nonlinear transformations of the unknowns, which makes little sense if the prior really captures ignorance, since ignorance about $\theta$ presumably implies ignorance about $\varphi$.
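To make the non-invariance concrete, here is a small worked instance of (1.5); the transformation is chosen purely for illustration. Let $\mu(\theta) \propto 1$ on $(-\infty, \infty)$ and let $\varphi = f(\theta) = e^{\theta}$, so that $f^{-1}(\varphi) = \log \varphi$. Then

\[
\mu(\varphi) = \mu(\log \varphi)\left|\frac{d \log \varphi}{d\varphi}\right| \propto \frac{1}{\varphi}, \qquad \varphi \in (0, \infty).
\]

A flat prior on $\theta$ is thus decidedly non-flat on $\varphi$. This is exactly the relationship between the flat prior on $\log \sigma$ and the ignorance prior $\mu(\sigma) \propto 1/\sigma$ above.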

One way to impose invariance of this type is via the Jeffreys prior. Recalling that Fisher's information matrix is defined as

\[
I(\theta) = -E_D\!\left[\frac{\partial^2 \log \mu(d \mid \theta)}{\partial \theta\,\partial \theta'}\right] = \int_D \frac{\partial \log \mu(d \mid \theta)}{\partial \theta}\,\frac{\partial \log \mu(d \mid \theta)}{\partial \theta'}\,\mu(d \mid \theta)\,dd \tag{1.6}
\]

the Jeffreys prior is defined as

\[
\mu(\theta) \propto \det\big(I(\theta)\big)^{1/2} \tag{1.7}
\]

This prior addresses the invariance problem in that the form of the prior can be shown to be unchanged under reparameterization, but it raises difficulties in terms of interpretability. One issue is that the prior depends on the form of the likelihood, which seems odd since the prior is supposed to describe beliefs about parameters.

Example 1.1. Priors and posteriors for the mean of normal random variables

Suppose that $x_i \sim N(\kappa, \sigma^2)$, $i = 1, \ldots, K$, with $\sigma^2$ known; the prior for $\kappa$ is $N(\kappa_0, \sigma_0^2)$. It is straightforward to show that the posterior density for the mean is

\[
\mu(\kappa \mid \sigma^2, \kappa_0, \sigma_0^2, x_1, \ldots, x_K) = N\!\left(\frac{\dfrac{\kappa_0}{\sigma_0^2} + \dfrac{K \bar{x}}{\sigma^2}}{\dfrac{1}{\sigma_0^2} + \dfrac{K}{\sigma^2}},\; \left(\frac{1}{\sigma_0^2} + \frac{K}{\sigma^2}\right)^{-1}\right) \tag{1.8}
\]

where $\bar{x}$ is the sample mean of the data. Notice that as $K \rightarrow \infty$, the effects of the prior on the posterior disappear, i.e. the posterior mean converges to the sample mean.

Example 1.2. Linear regression with diffuse prior

Suppose that $y_i = \beta' x_i + \varepsilon_i$ with $\varepsilon_i \sim N(0, \sigma^2)$, $\sigma^2$ known, the $\varepsilon_i$ uncorrelated, and the $x_i$ nonstochastic. Assume that the prior on $\beta$ is uniform across $R^L$, as described above. One can show that the posterior density of $\beta$ is

\[
N\!\left(\hat{\beta}_{OLS},\; \sigma^2 (X'X)^{-1}\right) \tag{1.9}
\]

where $\hat{\beta}_{OLS}$ is the OLS estimator. For this special case, there is a nice relationship between the posterior density computed by a Bayesian and the OLS objects computed by a frequentist.
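A numerical check of Example 1.1's updating formula (1.8) is immediate; the data-generating values and prior parameters below are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
kappa_true, sigma = 1.5, 2.0     # data-generating mean and known std deviation
kappa0, sigma0 = 0.0, 1.0        # prior: kappa ~ N(kappa0, sigma0^2)

for K in (5, 50, 5000):
    x = rng.normal(kappa_true, sigma, size=K)
    xbar = x.mean()
    # Posterior mean and variance from (1.8): a precision-weighted average.
    post_var = 1.0 / (1.0 / sigma0**2 + K / sigma**2)
    post_mean = post_var * (kappa0 / sigma0**2 + K * xbar / sigma**2)
    print(K, round(post_mean, 3), round(xbar, 3))
# As K grows the posterior mean approaches the sample mean: the prior washes out.
```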

2. Decision theory without probabilities

In this section, I discuss decision criteria for situations in which probabilities are not available for the unobservables of interest.

minimax

The minimax solution is to choose $c$ so as to solve

\[
\min_{c \in C} \max_{\theta} l(\theta, d, c) \tag{1.10}
\]

The minimax approach is usually associated with Abraham Wald (1950). An important axiomatization is due to Itzhak Gilboa and David Schmeidler (1989). The criterion implicitly means that the policymaker is extremely risk averse.

There are interesting applications of the criterion in philosophy, notably John Rawls' (1971) difference principle. The idea of the difference principle is the following. Suppose that individuals are asked to rank different distributions of economic outcomes in a society, under the proviso that they do not know which of the outcomes will turn out to be theirs. Rawls argues that individuals will choose the distribution that maximizes the utility of the worst off person. As pointed out by Kenneth Arrow (1973), this is a minimax argument. It can further be understood as the limit of a welfarist analysis. For example, suppose that individual utility $u_i$ is defined over $\omega_i$, a scalar measure of individual $i$'s outcome. Further, suppose that the social state $\omega$ is assigned the social welfare measure

\[
W = \Big(\sum_i u_i^{\alpha}\Big)^{1/\alpha}.
\]

Then, as $\alpha \rightarrow -\infty$, $W \rightarrow \min_i u_i$. The limit $\alpha \rightarrow -\infty$ implies an arbitrary degree of risk aversion. But the ranking of social states is preserved under monotonic transformations of the social welfare function, so the same limiting conclusion holds for a utilitarian welfare calculation of the form $\sum_i u_i^{\alpha}$ as well.
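A quick numerical check of this limiting claim; the utility values are assumed purely for illustration.

```python
import numpy as np

# The power-mean welfare measure W = (sum_i u_i^alpha)^(1/alpha)
# approaches min_i u_i as alpha -> -infinity.
u = np.array([1.0, 2.0, 5.0])   # illustrative utilities for three individuals

for alpha in (-1.0, -5.0, -25.0, -100.0):
    W = np.sum(u ** alpha) ** (1.0 / alpha)
    print(alpha, W)
# W tends to min(u) = 1: the welfare ranking becomes maximin.
```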

minimax regret

An alternative approach to decisionmaking without assigning probabilities to unknowns is to choose $c$ to minimize the maximum regret. Minimax regret was originally proposed by Leonard J. Savage (1951) as a less conservative alternative to minimax. Charles Manski is the leading advocate of its use among current researchers; he has produced numerous important papers applying minimax regret to different contexts; see Manski (2008) for a conceptual defense. Regret is defined as

\[
r(\theta, d, c) = l(\theta, d, c) - \min_{c' \in C} l(\theta, d, c') \tag{1.11}
\]

The minimax regret solution is

\[
\min_{c \in C} \max_{\theta} r(\theta, d, c) \tag{1.12}
\]

A recent axiomatization is due to Jörg Stoye (2006a,b).

Example 1.3. minimax, minimax regret, and expected loss

Suppose that there are two possible actions by the policymaker, $c_1$ and $c_2$, and two possible states of the world, $\theta_1$ and $\theta_2$. The following table reports the losses for each policy and state of the world, as well as the maximum loss and maximum regret.

         θ₁    θ₂    max loss    max regret
c₁       14     2       14            5
c₂        9     5        9            3

As the table indicates, by both the minimax criterion and the minimax regret criterion, the policymaker should choose $c_2$. A Bayesian would address the policy problem by assigning probability $p$ ($1-p$) to $\theta_1$ ($\theta_2$) and computing expected losses under the two policies. It is obvious that if $1-p$ is close enough to 1, the policymaker should choose $c_1$.

Example 1.4. minimax and minimax regret producing different policy choices

Consider the alternative loss structure

         θ₁    θ₂    max loss    max regret
c₁       14    13       14            8
c₂       15     5       15            1

Here the minimax policy choice is $c_1$ whereas the minimax regret choice is $c_2$. This example illustrates the intuitive appeal of minimax regret in that the criterion focuses on differences in policy effects rather than absolute levels: $c_2$ has the slightly larger worst-case loss, but it is never far from the best available policy in any state, while $c_1$ can be much worse.
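Both criteria are mechanical enough to verify in a few lines. The sketch below is not from the notes; it encodes Example 1.3's losses as a matrix with policies in rows and states in columns.

```python
import numpy as np

L = np.array([[14.0, 2.0],    # losses for c1 in states theta1, theta2
              [ 9.0, 5.0]])   # losses for c2

def minimax(L):
    return np.argmin(L.max(axis=1))        # smallest worst-case loss, (1.10)

def minimax_regret(L):
    regret = L - L.min(axis=0)             # regret state by state, (1.11)
    return np.argmin(regret.max(axis=1))   # smallest worst-case regret, (1.12)

print(minimax(L), minimax_regret(L))       # both print index 1, i.e. choose c2
```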

Example 1.5. minimax regret and the independence of irrelevant alternatives property

This example modifies Example 1.3 by adding a third policy $c_3$:

         θ₁    θ₂    max loss    max regret
c₁       14     2       14            5
c₂        9     5        9            5
c₃       50     0       50           41

The introduction of $c_3$ implies that $c_1$ is now also a minimax regret choice. This is troubling; the fact that $c_3$ is available has changed the ranking of $c_1$ and $c_2$. This violates the property of independence of irrelevant alternatives (IIA), which is often taken to be a natural axiom for decisionmaking. The recognition that minimax regret violates IIA is due to Herbert Chernoff (1954). One way to interpret this violation is that context matters for decisionmaking, an idea that appears in a number of settings studied in behavioral economics. It is not clear, though, that the findings on context dependent decisionmaking map that closely to the IIA violations found in minimax regret. This is something worth investigating.
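Continuing the snippet above, appending Example 1.5's third policy shows how the regret column shifts:

```python
# Appending c3 changes the best achievable loss in state theta2,
# and with it the regrets of the original two policies.
L3 = np.vstack([L, [50.0, 0.0]])
regret = L3 - L3.min(axis=0)
print(regret.max(axis=1))   # [ 5.  5. 41.]: c1 and c2 now tie,
                            # so the mere availability of c3 altered their ranking
```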

Example 1.6. Another minimax regret IIA violation

Here is another case where a violation of IIA occurs. Compare the analysis of two policies

         θ₁    θ₂    max loss    max regret
c₁       30    10       30            9
c₂       21    21       21           11

with the comparison of three policies

         θ₁    θ₂    max loss    max regret
c₁       30    10       30           20
c₂       21    21       21           11
c₃       10    30       30           20

$c_1$ is the minimax regret choice in the two policy case, but $c_2$ is the choice when all three policies are available. Moreover, if $c_3$ ($c_1$) were not available in the three policy case, $c_1$ ($c_3$) would be the minimax regret solution. This seems especially paradoxical (at least to me) since $c_1$ and $c_3$ have symmetric structures.

It is the case (see Stoye (2006a,b) for formal analysis) that if one is willing to forgo IIA, one can derive minimax regret under an axiom system which includes an axiom of independence of never-optimal alternatives. This means that if one adds a policy that is not optimal in any state of nature (when compared to the others), it cannot lead one to change one's choice among the original set.

Hurwicz criterion

One can mix the different approaches that have been described. For example, one can consider problems that incorporate both expected loss and minimax considerations:

\[
\min_{c \in C} \left[\alpha \max_{\theta} l(\theta, d, c) + (1 - \alpha)\,E\,l(\theta, d, c)\right] \tag{1.13}
\]

This type of criterion was originally proposed by Leonid Hurwicz (1951).
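As a sketch of how (1.13) interpolates between the two criteria, the snippet below applies the mixture to Example 1.4's losses; the state probabilities used for the expected loss term are an assumption for illustration.

```python
import numpy as np

L = np.array([[14.0, 13.0],   # Example 1.4: losses for c1
              [15.0,  5.0]])  # losses for c2
p = np.array([0.5, 0.5])      # assumed state probabilities for the E l(.) term

for alpha in (0.0, 0.5, 1.0):
    score = alpha * L.max(axis=1) + (1 - alpha) * (L @ p)
    print(alpha, np.argmin(score))
# alpha = 0 is pure expected loss (picks c2); alpha = 1 is pure minimax (picks c1).
```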

3. Decision rules and risk

The risk function, defined as

\[
R(\theta, c(\cdot)) = \int_D l(\theta, d, c(d))\,\mu(d \mid \theta)\,dd \tag{1.14}
\]

is a metric for understanding the performance of a decision rule across the data space, conditional on the unknown. Notice that this calculation weights performance across the data space by the probabilities $\mu(d \mid \theta)$.

Relative to the set of potential decision rules, a rule $c(\cdot)$ is (strictly) dominant if it produces a (strictly) lower risk than any alternative rule for all values of $\theta$. A decision rule is said to be inadmissible if there exists an alternative rule whose risk is never higher for any value of $\theta$ and is strictly lower for some. There is no guarantee that a unique dominant rule exists. This lack of uniqueness leads to criteria such as minimax risk.

The Bayes risk of a rule is defined as

\[
\int_{\Theta} R(\theta, c(\cdot))\,\mu(\theta)\,d\theta = \int_{\Theta} \int_D l(\theta, d, c(d))\,\mu(d \mid \theta)\,\mu(\theta)\,dd\,d\theta \tag{1.15}
\]

Recall that $\mu(d \mid \theta)\,\mu(\theta) = \mu(\theta \mid d)\,\mu(d)$. Exchanging the order of integration (valid under regularity assumptions that are not interesting here) and substituting this identity into (1.15), the Bayes risk is

\[
\int_D \left[\int_{\Theta} l(\theta, d, c(d))\,\mu(\theta \mid d)\,d\theta\right] \mu(d)\,dd \tag{1.16}
\]

It is evident that minimizing the double integral is accomplished by minimizing the inner integral at each data point. Minimization of the inner integral is equivalent to the original Bayesian decision theory solution. Notice that the risk function becomes superfluous from this perspective.
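To make the risk function concrete, here is a small simulation in which all specifics are assumed for illustration: the data are $K$ draws from $N(\theta, 1)$, one rule reports the sample mean, a second shrinks it halfway toward zero, and risk is estimated under squared loss at each $\theta$.

```python
import numpy as np

rng = np.random.default_rng(1)
K, reps = 10, 20000
thetas = np.linspace(-3.0, 3.0, 13)

risk_mean, risk_shrink = [], []
for theta in thetas:
    xbar = rng.normal(theta, 1.0, size=(reps, K)).mean(axis=1)
    risk_mean.append(np.mean((xbar - theta) ** 2))          # risk of c(d) = xbar
    risk_shrink.append(np.mean((0.5 * xbar - theta) ** 2))  # risk of c(d) = xbar / 2

# Neither rule dominates: shrinkage wins near theta = 0, loses for large |theta|.
for t, a, b in zip(thetas, risk_mean, risk_shrink):
    print(f"{t:+.1f}  {a:.3f}  {b:.3f}")

# Bayes risk (1.15) weights these risk functions by a prior, here N(0,1) on a grid.
prior = np.exp(-0.5 * thetas ** 2)
prior /= prior.sum()
print(np.dot(prior, risk_mean), np.dot(prior, risk_shrink))
```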

A general result is that any decision rule which minimizes Bayes risk relative to a proper prior is admissible. A form of the converse is also true: any admissible rule minimizes Bayes risk under some prior.

4. Fiducial inference

One way to understand the differences between Bayesian and frequentist approaches is that Bayesian approaches construct probabilities of unobservables conditional on observables, whereas frequentist approaches are based on the analysis of probabilities of observables conditional on unobservables. Statistical decision theory is based on the former. Fiducial inference is a method, originally proposed by Ronald Fisher, that attempts to bridge this distinction.

Consider a sequence of $N(\lambda, \sigma^2)$ random variables $x_i$, $i = 1, \ldots, K$, and assume the variance is known. Then the sample mean $\bar{x}$ has the probability density $N(\lambda, \sigma^2/K)$. It is straightforward to show that this implies that $\bar{x} - \lambda$ has the probability density $N(0, \sigma^2/K)$, which is also the density of $\lambda - \bar{x}$. Here $\lambda - \bar{x}$ is an example of what is known as a pivotal quantity, which means that its distribution does not depend on $\lambda$. The fiducial argument is that since the deviation of $\lambda$ from $\bar{x}$ is defined by a pivotal quantity, this pivotal quantity characterizes the uncertainty associated with the parameter, i.e.

\[
\lambda - \bar{x} \sim N\!\left(0, \frac{\sigma^2}{K}\right) \text{ implies } \lambda \sim N\!\left(\bar{x}, \frac{\sigma^2}{K}\right) \tag{1.17}
\]

The argument is generally rejected since, without justification, it represents a transformation of a nonrandom object, in this case $\lambda$, into a random one.
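The pivotal premise of (1.17) is easy to exhibit by simulation; the sample size and parameter values below are assumed for illustration. The distribution of $\bar{x} - \lambda$ is the same whatever the value of $\lambda$:

```python
import numpy as np

rng = np.random.default_rng(2)
K, sigma, reps = 25, 2.0, 100000

for lam in (-10.0, 0.0, 7.0):
    xbar = rng.normal(lam, sigma, size=(reps, K)).mean(axis=1)
    dev = xbar - lam
    print(lam, round(dev.mean(), 4), round(dev.std(), 4))
# In every case the deviation has mean ~ 0 and std ~ sigma / sqrt(K) = 0.4.
# The fiducial step is the further, contested move of treating lambda itself
# as distributed N(xbar, sigma^2 / K).
```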

A variation of Fisher's ideas, called structural inference, has been developed by Donald Fraser (see Fraser (1968) for a comprehensive statement). It is also controversial and has yet to have much impact on empirical work. However, unlike Fisher's original formulation, I do not believe one can say it is regarded as a failure. I personally find the structural inference ideas very intriguing, although quite hard to fully understand.

References

Arrow, K., (1973), "Some Ordinalist-Utilitarian Notes on Rawls's Theory of Justice," Journal of Philosophy, LXX, 9, 245-263.

Chernoff, H., (1954), "Rational Selection of Decision Functions," Econometrica, 22, 422-443.

Fraser, D., (1968), The Structure of Inference, New York: John Wiley.

Gilboa, I. and D. Schmeidler, (1989), "Maxmin Expected Utility with Non-unique Prior," Journal of Mathematical Economics, 18, 141-153.

Hurwicz, L., (1951), "Some Specification Problems and Applications to Econometric Models," Econometrica, 19, 343-344.

Manski, C., (2008), "Actualizing Rationality," mimeo, Northwestern University.

Rawls, J., (1971), A Theory of Justice, Cambridge: Harvard University Press.

Savage, L. J., (1951), "The Theory of Statistical Decision," Journal of the American Statistical Association, 46, 55-67.

Stoye, J., (2006a), "Statistical Decisions Under Ambiguity," mimeo, NYU.

Stoye, J., (2006b), "Axioms for Minimax Regret Choice Correspondences," mimeo, NYU.

Wald, A., (1950), Statistical Decision Functions, New York: John Wiley.