Lecture Slides - Part 1
Bengt Holmstrom, MIT
February 2, 2016

Going to raise the level a little because 14.281 is now taught by Juuso, so it is also higher level. Books: MWG (main book), BDT specifically for contract theory, others. MWG's mechanism design section is outdated.

Comparison of Distributions

First order stochastic dominance (FOSD): Definition: Take two distributions F, G. Then we say that F >₁ G (F first order stochastically dominates G) iff any of the following hold:

1. For every non-decreasing u, ∫ u(x) dF(x) ≥ ∫ u(x) dG(x).
2. F(x) ≤ G(x) for all x.
3. There are random variables x, z such that z ≥ 0, x ~ G, x + z ~ F, and z ~ H(z|x) (z's distribution may be conditional on x).

All these definitions are equivalent.
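A quick numerical check of these equivalences for discrete distributions; this is a sketch with made-up densities f and g, not numbers from the slides:

```python
import numpy as np

# Hypothetical discrete distributions on a common support (illustration
# only): F puts more mass on high outcomes than G.
xs = np.array([0.0, 1.0, 2.0, 3.0])
f = np.array([0.1, 0.2, 0.3, 0.4])   # density of F
g = np.array([0.4, 0.3, 0.2, 0.1])   # density of G

# Definition 2: F(x) <= G(x) for every x.
F_cdf, G_cdf = np.cumsum(f), np.cumsum(g)
assert np.all(F_cdf <= G_cdf + 1e-12)

# Definition 1, spot-checked on a few non-decreasing utilities.
for u in (lambda x: x, lambda x: np.sqrt(x), lambda x: np.minimum(x, 2)):
    assert np.sum(u(xs) * f) >= np.sum(u(xs) * g)
print("F first-order stochastically dominates G on these checks")
```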

Second order stochastic dominance (SOSD): Take two distributions F, G with the same mean. Definition: We say that F >₂ G (F SOSDs G) iff any of the following hold:

1. For every concave and nondecreasing u, ∫ u(x) dF(x) ≥ ∫ u(x) dG(x). (F has less risk, thus is worth more to a risk-averse agent.)
2. ∫₀ˣ G(t) dt ≥ ∫₀ˣ F(t) dt for all x.
3. There are random variables x, z such that x ~ F, x + z ~ G and E(z|x) = 0. (x + z is a mean-preserving spread of x.)

All these definitions are equivalent.
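Definition 3 suggests a direct construction: start from F and spread each atom with mean-zero noise to get G. A sketch with hypothetical atoms:

```python
import numpy as np

# Build G as a mean-preserving spread of F: each atom of F at 1 and 3
# is moved one unit up or down with equal probability. Values invented.
xs = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
f = np.array([0.0, 0.5, 0.0, 0.5, 0.0])     # F: mass on 1 and 3
g = np.array([0.25, 0.0, 0.5, 0.0, 0.25])   # G: mass on 0, 2, 4

assert np.isclose(np.sum(xs * f), np.sum(xs * g))   # same mean

# Definition 2 (unit grid spacing, so cumsum of the cdf approximates
# the integral of the cdf): the integral of G dominates that of F.
F_int = np.cumsum(np.cumsum(f))
G_int = np.cumsum(np.cumsum(g))
assert np.all(G_int >= F_int - 1e-12)

# Definition 1, spot-checked on concave nondecreasing utilities.
for u in (lambda x: np.sqrt(x), lambda x: np.log1p(x)):
    assert np.sum(u(xs) * f) >= np.sum(u(xs) * g)
```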

Monotone likelihood ratio property (MLRP): Let F, G be distributions given by densities f, g respectively. Let l(x) = f(x)/g(x). Intuitively, the statistician observes a draw x from a random variable that may have distribution F or G and asks: given the realization, is it more likely to come from F or from G? l(x) turns out to be the ratio by which we multiply the prior odds to get the posterior odds.

Definition: The pair (f, g) satisfies MLRP if l(x) is non-decreasing. Intuitively, the higher the realized value x, the more likely that it was drawn from the high distribution, F. MLRP implies FOSD, but it is a stronger condition. You could have FOSD and still there might be some high signal values that likely come from G. For example: suppose f(0) = f(2) = 0.5 and g(1) = g(3) = 0.5. Then g FOSDs f but MLRP fails (1 is likely to come from g, 2 is likely to come from f).
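The slide's counterexample is easy to verify numerically; the code below re-creates f and g and shows the posterior is not monotone in the signal:

```python
import numpy as np

xs = np.array([0, 1, 2, 3])
f = np.array([0.5, 0.0, 0.5, 0.0])   # f from the slide: mass on 0 and 2
g = np.array([0.0, 0.5, 0.0, 0.5])   # g from the slide: mass on 1 and 3

# g FOSDs f: G(x) <= F(x) everywhere.
assert np.all(np.cumsum(g) <= np.cumsum(f) + 1e-12)

# But the likelihood ratio is not monotone: with a 50/50 prior,
# observing x = 1 makes g certain, while the *higher* signal x = 2
# makes f certain, so higher x does not always favor g.
posterior_g = g / (f + g)   # P[distribution = g | x], flat prior
print(posterior_g)          # [0. 1. 0. 1.] -- not monotone in x
```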

This is often used in models of moral hazard, adverse selection, etc., like so: Let F(x|a) be a family of distributions parameterized/indexed by a. Here a is an action (e.g. effort) or type (e.g. ability) of an agent, and x is the outcome (e.g. the amount produced). MLRP tells us that if x₂ > x₁ and a₂ > a₁ then
$$\frac{f(x_2 \mid a_2)}{f(x_2 \mid a_1)} \geq \frac{f(x_1 \mid a_2)}{f(x_1 \mid a_1)}.$$
In other words, if the principal observes a higher x, it will guess a higher likelihood that it came about due to a higher a.

Decision making under uncertainty

Premise: you see a signal and then need to take an action. How should we react to the information? Goals:

Look for an optimal decision rule.
Calculate the value of the information we get. (How much more utility do we get vs. choosing under ignorance?)
Can information systems (experiments) be preference-ordered? (So you can say experiment A is more useful to me than B.)

Basic structure:

θ: state of the world, e.g., market demand
y: the information/signal/experimental outcome, e.g., sales forecast
a: (final) action, e.g., amount produced
u(a, θ): payoff from choice a under state θ, e.g., profits

This may be money based: e.g., x(a, θ) is the money generated and u(a, θ) = ũ(x(a, θ)), where ũ(x) is the utility created by having x money.

Figure: A decision problem (a tree: a signal y₁ or y₂ is observed, an action a₁ or a₂ is chosen, and the state θ₁ or θ₂ determines the payoff u(aᵢ, θⱼ))

A strategy is a function a : Y → A, where Y is the codomain of the signal, and a(y) defines the chosen action after observing y. θ : Ω → Θ is a random variable and y : Ω → Y is the signal. Ω gives the entire probability space; Θ is the set of payoff-relevant states of the world, but the agent does not observe θ directly, so he must condition on y instead.

How does the agent do this? He knows the joint distribution p(y, θ) of y and θ. In particular he has a prior belief about the state of the world, p(θ) = Σ_y p(y, θ). And he can calculate likelihoods p(y|θ) by Bayes' rule, p(y, θ) = p(θ)p(y|θ). As stated, the random variables with their joint distribution are the primitives and we back out the likelihoods. But since the experiment is fully described by these likelihoods, it can be cleaner to take them as the primitives.

In deciding what action to take, the agent will need the reverse likelihoods p(θ|y) = p(y, θ)/p(y). These are the posterior beliefs, which tell the agent which states θ are more likely given the realization of the experiment y. IMPORTANT: every experiment induces a distribution over posteriors.

By the Law of Total Probability, p(θ) = Σ_y p(y) p(θ|y): the weighted average of the posteriors must equal the prior. In other words, p(θ), viewed as a random vector, is a martingale. Can also take posteriors as primitives! Every collection of posteriors {p(θ|y)}, y ∈ Y, that is consistent with the prior and signal probabilities (i.e., p₀(θ) = Σ_y p(θ|y)p(y)) corresponds to an experiment.
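A minimal sketch of this machinery with a hypothetical 2x2 joint distribution: derive the prior, likelihoods, and posteriors, and check the martingale property:

```python
import numpy as np

# A hypothetical joint distribution p(y, θ): rows index signals y,
# columns index states θ. Entries sum to 1.
p_joint = np.array([[0.30, 0.10],
                    [0.20, 0.40]])

p_theta = p_joint.sum(axis=0)      # prior p(θ) = Σ_y p(y, θ)
p_y     = p_joint.sum(axis=1)      # signal marginal p(y)
lik     = p_joint / p_theta        # likelihoods p(y | θ)
post    = p_joint / p_y[:, None]   # posteriors p(θ | y)

# Martingale / law of total probability: Σ_y p(y) p(θ|y) = p(θ).
assert np.allclose(p_y @ post, p_theta)
```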

An example: coin toss

A coin may be biased towards heads (θ₁) or tails (θ₂)
p(θ₁) = p(θ₂) = 0.5
p(H|θ₁) = 0.8, p(T|θ₁) = 0.2
p(H|θ₂) = 0.4, p(T|θ₂) = 0.6

We can then find:

p(H) = 0.8 · 0.5 + 0.4 · 0.5 = 0.6, p(T) = 0.4
p(θ₁|H) = 0.8 · 0.5 / p(H) = 2/3
p(θ₁|T) = 0.2 · 0.5 / p(T) = 1/4

Figure: Updating after coin toss (beliefs on the interval [0, 1]: prior p = 0.5, with posteriors p'(R) = p(θ₁|H) to its right and p'(L) = p(θ₁|T) to its left)
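The same arithmetic in code, using the numbers from the slide:

```python
# Numbers straight from the slide: a coin biased toward heads (theta1)
# or tails (theta2), with a flat prior.
p_theta1 = 0.5
p_H_given = {"theta1": 0.8, "theta2": 0.4}

p_H = p_H_given["theta1"] * p_theta1 + p_H_given["theta2"] * (1 - p_theta1)
p_T = 1 - p_H
post_theta1_H = p_H_given["theta1"] * p_theta1 / p_H
post_theta1_T = (1 - p_H_given["theta1"]) * p_theta1 / p_T

print(p_H, p_T)                      # 0.6 0.4
print(post_theta1_H, post_theta1_T)  # 0.666... 0.25
```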

Sequential Updating

Suppose we have signals y₁ and y₂ coming from two experiments (which may be correlated). It does not matter if you update based on experiment A first, then update on B, or vice versa; or even if you take the joint results (y₁, y₂) as a single experiment and update on that. (However, if the first experiment conditions how or whether you do the second one, then of course this is no longer true.)
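A sketch verifying order-invariance with a randomly generated joint distribution p(θ, y₁, y₂); the joint here is arbitrary, for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)
# A hypothetical joint p(θ, y1, y2); the two signals may be correlated.
joint = rng.random((2, 2, 2))
joint /= joint.sum()

y1, y2 = 0, 1                                 # observed realizations

# One-shot update on the pair (y1, y2):
post_joint = joint[:, y1, y2] / joint[:, y1, y2].sum()

# Sequential update: condition on y1 first, then on y2 given (θ, y1).
p_theta_y1 = joint[:, y1, :].sum(axis=1)      # ∝ p(θ, y1)
p_theta_y1 /= p_theta_y1.sum()
lik_y2 = joint[:, y1, y2] / joint[:, y1, :].sum(axis=1)  # p(y2 | θ, y1)
post_seq = p_theta_y1 * lik_y2
post_seq /= post_seq.sum()

assert np.allclose(post_joint, post_seq)      # updating order is irrelevant
```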

E.g., suppose that θ is the health of a patient, θ₁ = healthy, θ₂ = sick, and y₁, y₂ = + or − (positive or negative) are the results of two experiments (e.g. doctor's exam and blood test).

Figure: Sequential Updating (beliefs on the interval [0, 1], starting from the prior p and moving as each + or − result arrives)

Lecture 2

Note: experiments can be defined independently of prior beliefs about θ. If we take an experiment as a set of likelihoods p(y|θ), these can be used regardless of p₀(θ). (But, of course, they will generate a different set of posteriors p(θ|y), depending on the prior.) If you have a blood test for a disease, you can run it regardless of the fraction of sick people in the population, and its probabilities of type 1 and type 2 errors will be the same, but you will get different beliefs about the probability of sickness after a positive (or negative) test.

One type of experiment is where y = θ + ε. In particular, when θ ~ N(µ, σ²_θ) and ε ~ N(0, σ²_ε), this is very tractable because the distribution of y, the distribution of y|θ, and the distribution of θ|y are all normal. Useful to define the precision of a random variable: Σ_θ = 1/σ²_θ. The lower the variance, the higher the precision. Precision shows up in calculations of posteriors with normal distributions: in this example,
$$\theta \mid y \sim N\!\left( \frac{\Sigma_\theta}{\Sigma_\theta + \Sigma_\varepsilon}\,\mu + \frac{\Sigma_\varepsilon}{\Sigma_\theta + \Sigma_\varepsilon}\,y,\; (\Sigma_\theta + \Sigma_\varepsilon)^{-1} \right),$$
i.e., the posterior mean is a precision-weighted average of prior mean and signal, and the posterior precision is Σ_θ + Σ_ε.
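A sketch of the precision-weighted update with hypothetical parameter values, plus a Monte Carlo sanity check of the formula:

```python
import numpy as np

# Normal-normal updating in precision form; the numbers are hypothetical.
mu, sigma2_theta, sigma2_eps = 1.0, 4.0, 1.0
prec_theta, prec_eps = 1 / sigma2_theta, 1 / sigma2_eps

y = 3.0                                    # observed signal y = θ + ε
w = prec_theta / (prec_theta + prec_eps)   # weight on the prior mean
post_mean = w * mu + (1 - w) * y
post_var = 1 / (prec_theta + prec_eps)     # posterior precisions add up

print(post_mean, post_var)                 # 2.6 0.8

# Monte Carlo sanity check: condition on y ≈ 3 by rejection.
rng = np.random.default_rng(1)
theta = rng.normal(mu, np.sqrt(sigma2_theta), 1_000_000)
ys = theta + rng.normal(0, np.sqrt(sigma2_eps), theta.size)
sel = np.abs(ys - y) < 0.05
print(theta[sel].mean(), theta[sel].var())  # ≈ 2.6, ≈ 0.8
```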

Statistics and Sufficient Statistics

A statistic is any (vector-valued) function T mapping y to T(y). Statistics are meant to aggregate the information contained in y. Now suppose that p(y|θ) = p(y|T(y)) p(T(y)|θ). Then T(y) contains all the relevant information that y gives me to figure out θ. In that case, we say T is a sufficient statistic. Formally,
$$p(\theta|y) = \frac{p(y|\theta)\,p(\theta)}{p(y)} = \frac{p(y|T(y))\,p(T(y)|\theta)\,p(\theta)}{p(y|T(y))\,p(T(y))} = \frac{p(T(y)|\theta)\,p(\theta)}{p(T(y))} = p(\theta|T(y)).$$
Here, we use that p(y) = p(y, T(y)) because T(y) is a function of y.
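A small sketch of the definition: for i.i.d. Bernoulli(θ) draws (a hypothetical example, with θ restricted to two values for simplicity), T(y) = Σ yᵢ is sufficient, so any two sequences with the same sum must yield the same posterior:

```python
import numpy as np
from itertools import product

# Two hypothetical parameter values with a flat prior.
thetas, prior = np.array([0.3, 0.7]), np.array([0.5, 0.5])

def posterior(y):
    """Posterior over θ after observing the Bernoulli sequence y."""
    lik = np.array([th ** sum(y) * (1 - th) ** (len(y) - sum(y))
                    for th in thetas])
    p = lik * prior
    return p / p.sum()

# All length-4 sequences with the same T(y) give the same posterior.
for y in product([0, 1], repeat=4):
    same_T = [z for z in product([0, 1], repeat=4) if sum(z) == sum(y)]
    assert all(np.allclose(posterior(y), posterior(z)) for z in same_T)
```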

This is useful: if, e.g., the mean of a vector of estimates is a sufficient statistic for θ, then I can forget about the vector and simplify the calculations. A minimal sufficient statistic is intuitively the simplest/coarsest possible. Formally, T(y) is a minimal sufficient statistic if it is sufficient and, for any S(y) that is sufficient, there is a function f such that T(y) = f(S(y)).

Decision Analysis

Remember our framework: we want to take an action a to maximize u(a, θ), which depends on the state of the world. We will condition on our information y, given by the likelihoods p(y|θ). Three questions:

How do we find the optimal decision rule?
What is the value of information? For a particular problem, this is given by how much your utility increases from getting the information.
Can we say anything about experiments in general? E.g., experiment A will always give you weakly higher utility than B, regardless of your decision problem.

We can solve for the optimal decision ex post or ex ante:

Either observe y and then calculate a*(y) = argmax_a ∫_Θ u(a, θ) p(θ|y) dθ (ex post),
or build a strategy a*(·) = argmax_{a(·)} ∫_Y ∫_Θ u(a(y), θ) p(y, θ) dθ dy before seeing y (ex ante).

In decision theory problems with only one agent, both are equivalent (as long as all y's have positive probability).

Note: given a y, we can think of the ex post problem as a generic problem of the form V(p') = max_a v(a, p'), where v(a, p') = ∫_Θ u(a, θ) p'(θ) dθ and p'(θ) = p(θ|y).

Note: v(a, p) is linear in p. Exercise: show that V(p) is convex in p.

Definition: V_Y = ∫_Y V(p_y) p(y) dy is the maximal utility I can get from information system Y (by taking optimal actions).

Definition: Z_Y = V_Y − V(p₀) is the value of information system Y (over just knowing the prior).

Consider an example with states θ₁, θ₂; actions a₁, a₂; and signal outcomes L, R.

Figure: Value of Information (on the belief axis p = prob(θ₁): the lines v(a₁, p) and v(a₂, p) run from u(aᵢ, θ₂) at p = 0 to u(aᵢ, θ₁) at p = 1; V(p) is their upper envelope; the posteriors p(L) and p(R) straddle the prior p, and the gap Z_Y = V_Y − V(p) appears at the prior)

In the graph, p = p₀(θ₁). For generic a and p, v(a, p) = p·u(a, θ₁) + (1 − p)·u(a, θ₂). In the graph, u(a₁, θ₁) > u(a₂, θ₁) but u(a₂, θ₂) > u(a₁, θ₂) (we want to match the action to the state). Under the prior, a₁ is the better action. With information, we want to choose a(L) = a₂, a(R) = a₁. V_Y is a weighted average of the resulting payoffs.
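The following sketch puts hypothetical numbers on the figure's logic (the payoffs are invented, chosen so that a₁ is optimal under the prior while each posterior flips or keeps the choice). Because the optimal action is picked pointwise at each posterior, it also illustrates the ex post / ex ante equivalence:

```python
import numpy as np

# Invented payoffs: a1 matches theta1, a2 does better under theta2.
u = np.array([[1.0, 0.0],     # u(a1, θ1), u(a1, θ2)
              [0.1, 0.7]])    # u(a2, θ1), u(a2, θ2)

def V(p):
    """Upper envelope over actions at belief p = prob(θ1)."""
    return np.max(u @ np.array([p, 1 - p]))

p0 = 0.5                               # prior
pL, pR = 0.2, 0.8                      # posteriors after signals L, R
qL = (p0 - pR) / (pL - pR)             # p(L) s.t. posteriors average to p0
V_Y = qL * V(pL) + (1 - qL) * V(pR)    # ex-ante value with information
Z_Y = V_Y - V(p0)                      # value of the information system

print(V(p0), V_Y, Z_Y)                 # 0.5 0.69 0.19 -- information helps
```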

This illustrates the maximal gap (between deciding with just the prior vs. exactly knowing the state):

Figure: Value of Perfect Information (the line joining V(0) and V(1) gives the expected payoff from learning the state exactly; its vertical distance above V(p) is the value of perfect information at prior p)

Lecture 3

Life lesson: even if two ways of writing a model are mathematically equivalent, it may make a huge difference how you think about it.

Comparison of Experiments

Question: given two experiments A, B, when is Y_A > Y_B regardless of your decision problem?

Answer 1: Iff the distribution of posteriors from Y_A is a MPS (mean-preserving spread) of the distribution of posteriors from Y_B.

In this example with two states and two two-outcome experiments, the one with more extreme posteriors gives higher utility.

Figure: Mean-preserving spread of posteriors (on the belief axis: experiment A's posteriors p_A(L), p_A(R) lie outside experiment B's posteriors p_B(B), p_B(G), with both pairs averaging to the prior p; the connecting segments need not be parallel)

Idea: if Y_A's posterior distribution is a MPS of Y_B's, it is as though knowing the outcome of experiment A amounts to knowing the outcome of B, then being given some extra info (which generates the MPS). Note: a MPS of the signals means less information, but a MPS of the resulting posteriors means more information! Formally, since V is convex, averaging V over a more dispersed set of posteriors gives a higher result (by Jensen's inequality).

Given two experiments that do not bracket each other (neither one's posteriors are a MPS of the other's), we can find decision problems for which either one is better. In the graph, the experiment with outcomes B, G is uninformative for our purposes, hence worse than the one with outcomes L, R. But the conclusion is reversed if we change the payoff structure.

Figure: Unordered experiments (the payoff lines v(a₁, p) and v(a₂, p), with the posteriors L, R of one experiment and B, G of the other marked on the belief axis)

Second attempt: use the concept of Blackwell garbling.

Definition: Y_B is a garbling of Y_A if P_B = M P_A, where
P_B = [p^B_ij], with p^B_ij = P[y_B = i | θ = j],
P_A = [p^A_kl], with p^A_kl = P[y_A = k | θ = l],
M = [m_ik] is a Markov matrix (its columns add up to 1).

The idea: B can be construed as an experiment that takes the outcomes of A and then mixes them up probabilistically. This makes B less informative (even if it e.g. had more outcomes than A).

Answer 2: Y_A > Y_B iff B is a Blackwell garbling of A.

Corollary: B is a Blackwell garbling of A iff the posterior distribution of A is a MPS of B's.
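A sketch of garbling in action, reusing the hypothetical decision problem from the value-of-information example above: build P_B = M P_A and confirm the garbled experiment is worth weakly less:

```python
import numpy as np

# A hypothetical experiment A (columns = states, rows = outcomes) and a
# Markov garbling M whose columns sum to 1.
P_A = np.array([[0.8, 0.3],
                [0.2, 0.7]])          # P[y_A = i | θ = j]
M = np.array([[0.9, 0.2],
              [0.1, 0.8]])            # P[y_B = i | y_A = k]
P_B = M @ P_A                         # B mixes up A's outcomes

prior = np.array([0.5, 0.5])
u = np.array([[1.0, 0.0],             # the invented payoffs from before
              [0.1, 0.7]])

def value(P):
    """Ex-ante value V_Y of an experiment with likelihood matrix P."""
    joint = P * prior                 # p(y, θ)
    return sum(np.max(u @ (row / row.sum())) * row.sum() for row in joint)

assert value(P_A) >= value(P_B) - 1e-12   # garbling never adds value
print(value(P_A), value(P_B))
```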

For two-outcome experiments, it is easy to show that a garbling has posteriors bracketed by those of the original experiment.

Figure: Garbled posteriors (on the belief axis, the garbling's posteriors B, G lie between the original experiment's posteriors L, R)

MIT OpenCourseWare
https://ocw.mit.edu
14.124 Microeconomic Theory IV, Spring 2017
For information about citing these materials or our Terms of Use, visit: https://ocw.mit.edu/terms.