STAT 830 Decision Theory and Bayesian Methods

Example: Decide between 4 modes of transportation to work: B = Ride my bike. C = Take the car. T = Use public transit. H = Stay home. Costs depend on the weather: R = Rain or S = Sun.

Ingredients of the Decision Problem in the no-data case:

Decision space D = {B, C, T, H} of possible actions.

Parameter space Θ = {R, S} of possible states of nature.

Loss function L = L(d, θ): the loss incurred if I do d and θ is the true state of nature.

In the example we might use the following table for L:

        C    B    T    H
   R    3    8    5    25
   S    5    0    2    25

Notice that if it rains I will be glad if I drove, and if it is sunny I will be glad if I rode my bike. In any case staying at home is expensive.

In general we study this problem by comparing various functions of θ. In this problem a function of θ has only two values, one for rain and one for sun, so we can plot any such function as a point in the plane. We do so to indicate the geometry of the problem before stating the general theory.

[Figure: Losses of the deterministic rules. Axes: Sun and Rain; the four points are labelled B, C, T, H.]
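A minimal Python sketch of how one might encode this loss table and list the (Sun, Rain) loss pairs that the figure plots; the dictionary layout and variable names are my own, not part of the notes.

```python
# Loss table for the transportation example: LOSS[action][state].
LOSS = {
    "B": {"R": 8, "S": 0},    # bike
    "C": {"R": 3, "S": 5},    # car
    "T": {"R": 5, "S": 2},    # transit
    "H": {"R": 25, "S": 25},  # stay home
}

# Each action corresponds to a point (loss under Sun, loss under Rain).
for a, losses in LOSS.items():
    print(f"{a}: (Sun, Rain) = ({losses['S']}, {losses['R']})")
```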

Statistical Decision Theory

Statistical problems have another ingredient: the data. We observe X, a random variable taking values in a sample space X. We may make our decision d depend on X. A decision rule is a function δ(x) from X to D. We want L(δ(X), θ) to be small for all θ. Since X is random we quantify this by averaging over X and compare procedures δ in terms of the risk function

    R_δ(θ) = E_θ[L(δ(X), θ)].

To compare two procedures we must compare two functions of θ and pick the smaller one. But typically the two functions cross each other and there is no unique smaller one.

Example: In estimation theory, to estimate a real parameter θ we use D = Θ and L(d, θ) = (d − θ)², and find that the risk of an estimator θ̂(X) is

    R_θ̂(θ) = E_θ[(θ̂ − θ)²],

which is just the Mean Squared Error of θ̂. We have already seen that there is no unique best estimator in the sense of MSE.

How do we compare risk functions in general?

Minimax methods choose δ to minimize the worst-case risk sup{R_δ(θ); θ ∈ Θ}. We call δ minimax if

    sup_θ R_δ(θ) = inf_δ′ sup_θ R_δ′(θ).

Usually the sup and inf are achieved and we write max for sup and min for inf; this is the source of the name minimax.

Bayes methods choose δ to minimize an average

    r_π(δ) = ∫ R_δ(θ) π(θ) dθ

for a suitable density π. We call π a prior density and r_π(δ) the Bayes risk of δ for the prior π.
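As an illustration of risk functions crossing, here is a small Python sketch (my own example, not from the notes) comparing the exact MSE of two estimators of a Bernoulli success probability p based on X ~ Binomial(n, p): the MLE X/n and the Bayes-flavoured shrinkage estimator (X + 1)/(n + 2). Neither dominates the other.

```python
# Exact MSE risk functions of two estimators of a Bernoulli probability p.
n = 20

def mse_mle(p):
    # risk of X/n: pure variance, no bias
    return p * (1 - p) / n

def mse_shrink(p):
    # risk of (X + 1)/(n + 2): variance plus squared bias
    return (n * p * (1 - p) + (1 - 2 * p) ** 2) / (n + 2) ** 2

for p in [0.1, 0.3, 0.5, 0.7, 0.9]:
    print(f"p={p}: MLE {mse_mle(p):.5f}  shrinkage {mse_shrink(p):.5f}")
# The two risk functions cross: shrinkage wins near p = 1/2,
# the MLE wins near p = 0 or 1, so neither is uniformly better.
```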

Example: My transportation problem has no data, so the only possible (nonrandomized) decisions are the four actions B, C, T, H. For B and T the worst case is rain; for C it is sun; for H rain and sun are equivalent. We have the following table:

             C    B    T    H
   R         3    8    5    25
   S         5    0    2    25
   Maximum   5    8    5    25

To get the smallest maximum: take the car, or transit. Thus the minimax action is either to take the car or to take public transit.

Now imagine I toss a coin with probability λ of getting Heads, take my car if I get Heads, and otherwise take transit. The long-run average daily loss is 3λ + 5(1 − λ) when it rains and 5λ + 2(1 − λ) when it is sunny. Call this procedure d_λ; add its losses to the graph for each value of λ. Varying λ from 0 to 1 gives a straight line running from (5, 2) (always transit) to (3, 5) (always car). The two losses are equal when λ = 3/5. For smaller λ the worst-case risk is for rain; for larger λ the worst-case risk is for sun.

Added to the graph: the loss functions of the procedures d_λ (a straight line segment) and the set of (x, y) pairs for which max(x, y) = 3.8, the worst-case risk of d_λ when λ = 3/5.

[Figure: Losses. Axes: Sun and Rain; the points B, C, T, H, the line of randomized procedures d_λ, and the level set max(x, y) = 3.8.]
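A quick numerical check of the λ = 3/5 claim; this grid-search sketch is mine, not part of the notes.

```python
# Grid search over randomized procedures d_lambda: take the car with
# probability lam, transit otherwise, and track the worst-case loss.
best = None
for k in range(1001):
    lam = k / 1000
    rain = 3 * lam + 5 * (1 - lam)   # expected loss if it rains
    sun = 5 * lam + 2 * (1 - lam)    # expected loss if it is sunny
    worst = max(rain, sun)
    if best is None or worst < best[1]:
        best = (lam, worst)

print(best)  # approximately (0.6, 3.8): the minimax randomization
```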

The figure then shows that d_{3/5} is actually the minimax procedure when randomized procedures are permitted.

In general we might consider using a four-sided coin, taking action B with probability λ_B, C with probability λ_C, and so on. The loss function of such a procedure is a convex combination of the losses of the four basic procedures, making the set of risks achievable with the aid of randomization look like the following:

[Figure: Losses. Axes: Sun and Rain; the convex set of risks achievable by randomizing over B, C, T, H.]
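To visualize that set numerically, one could sample random probability vectors over the four actions and compute the resulting expected-loss pairs. This is a minimal sketch under my own construction (normalized exponentials, i.e., a uniform Dirichlet draw); it is not part of the notes.

```python
import random

# (rain, sun) losses of the four pure actions
PURE = {"B": (8, 0), "C": (3, 5), "T": (5, 2), "H": (25, 25)}

def random_mixture():
    # Normalized exponentials give a uniform (Dirichlet(1,1,1,1))
    # random probability vector over the four actions.
    w = [random.expovariate(1.0) for _ in PURE]
    s = sum(w)
    probs = [x / s for x in w]
    rain = sum(p * loss[0] for p, loss in zip(probs, PURE.values()))
    sun = sum(p * loss[1] for p, loss in zip(probs, PURE.values()))
    return rain, sun

# Each sampled point lies in the convex hull of the four pure-loss points.
for rain, sun in (random_mixture() for _ in range(5)):
    print(f"(rain, sun) = ({rain:.2f}, {sun:.2f})")
```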

Randomization in decision problems permits the assumption that the set of possible risk functions is convex, an important technical conclusion used to prove many basic decision-theory results.

The graph shows that many points in the picture correspond to bad decision procedures. Rain or shine, taking my car to work has a lower loss than staying home; the decision to stay home is inadmissible.

Definition: A decision rule δ is inadmissible if there is a rule δ′ such that R_δ′(θ) ≤ R_δ(θ) for all θ, with strict inequality for at least one value of θ. A rule which is not inadmissible is called admissible.

Admissible procedures have risks on the lower left of the graph: the lines connecting B to T and T to C are the admissible procedures.

Connection between Bayes procedures and admissible procedures

A prior distribution in the example is specified by two probabilities, π_S and π_R, which add up to 1. If L = (L_S, L_R) is the risk function of some procedure then the Bayes risk is

    r_π = π_R L_R + π_S L_S.

Consider the set of L such that this Bayes risk is equal to some constant. In our picture this is a line with slope −π_S/π_R.

Now consider three priors, each written as (π_S, π_R): π_1 = (0.9, 0.1), π_2 = (0.5, 0.5) and π_3 = (0.1, 0.9). For π_1: imagine a line with slope −9 = −0.9/0.1 starting at the far left of the picture and sliding right until it bumps into the convex set of possible losses in the previous picture. It does so at the point B, as shown in the next graph. Sliding the line to the right corresponds to making r_π larger and larger, so when it just touches the convex set we have found the Bayes procedure.

[Figure: Losses. Axes: Sun and Rain; the line with slope −9 for the prior π_1, touching the convex loss set at B.]

Here is a picture showing the same lines for the three priors above.

[Figure: Losses. Axes: Sun and Rain; the supporting lines for the three priors π_1, π_2, π_3.]

The Bayes procedure for π_1 (a prior which says you are pretty sure it will be sunny) is to ride your bike. If it is a toss-up between R and S you take transit. If R is very likely you take your car. The prior (0.6, 0.4) produces the line shown here:

[Figure: Losses. Axes: Sun and Rain; the line for the prior (0.6, 0.4), lying along the segment BT.]
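A small Python check of these Bayes actions (my own sketch; priors written as (π_S, π_R) as above):

```python
# (rain, sun) losses of the pure actions
LOSS = {"B": (8, 0), "C": (3, 5), "T": (5, 2), "H": (25, 25)}

def bayes_risks(prior_sun, prior_rain):
    return {a: prior_rain * l_rain + prior_sun * l_sun
            for a, (l_rain, l_sun) in LOSS.items()}

for prior in [(0.9, 0.1), (0.5, 0.5), (0.1, 0.9), (0.6, 0.4)]:
    r = bayes_risks(*prior)
    best = min(r, key=r.get)
    print(prior, {a: round(v, 2) for a, v in r.items()}, "->", best)
# Bike for (0.9, 0.1), transit for (0.5, 0.5), car for (0.1, 0.9).
# For (0.6, 0.4) both B and T attain Bayes risk 3.2, so any
# mixture of B and T is Bayes for that prior.
```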

Any point on the line BT is Bayes for this prior.

Decision Theory and Bayesian Methods: Summary for the no-data case

Decision space: the set of possible actions I might take. We assume it is convex, typically by expanding a basic decision space D to the space D* of all probability distributions on D.

Parameter space Θ of possible states of nature.

Loss function L = L(d, θ): the loss I incur if I do d and θ is the true state of nature.

We call δ minimax if

    max_θ L(δ, θ) = min_δ′ max_θ L(δ′, θ).

A prior is a probability distribution π on Θ. The Bayes risk of a decision δ for a prior π is

    r_π(δ) = E_π[L(δ, θ)] = ∫ L(δ, θ) π(θ) dθ

if the prior has a density; for finite parameter spaces Θ the integral is a sum.

A decision δ* is Bayes for a prior π if

    r_π(δ*) ≤ r_π(δ)

for every decision δ.

For infinite parameter spaces: π(θ) > 0 on Θ is a proper prior if ∫ π(θ) dθ < ∞; divide π by this integral to get a density. If ∫ π(θ) dθ = ∞, π is an improper prior density.

A decision δ is inadmissible if there is a δ′ such that L(δ′, θ) ≤ L(δ, θ) for all θ, with strict inequality for at least one value of θ. A decision which is not inadmissible is called admissible.

Every admissible procedure is Bayes, perhaps only for an improper prior. (The proof uses the Separating Hyperplane Theorem from functional analysis.)

Every Bayes procedure with finite Bayes risk (for a prior with density > 0 for all θ) is admissible. Proof: If δ is Bayes for π but not admissible, there is a δ′ such that

    L(δ′, θ) ≤ L(δ, θ) for all θ.

Multiply by the prior density and integrate:

    r_π(δ′) ≤ r_π(δ).

If there is a θ at which the inequality involving L is strict, and the density of π is positive at that θ, then the inequality for r_π is strict, contradicting the hypothesis that δ is Bayes for π. Notice that the theorem actually requires the extra hypotheses: positive prior density, and continuity of the risk functions of δ and δ′.

A minimax procedure is admissible. (Actually there can be several minimax procedures, and the claim is that at least one of them is admissible. When the parameter space is infinite it might happen that the set of possible risk functions is not closed; if not, we have to replace the notion of admissible by some notion of nearly admissible.)

The minimax procedure has constant risk. More precisely, the admissible minimax procedure is Bayes for some π, and its risk is constant on the set of θ for which the prior density is positive.
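To make the last two points concrete in the transportation example: the minimax randomized rule d_{3/5} has constant risk 3.8 and is Bayes for the prior (π_S, π_R) = (0.4, 0.6), the least favorable prior here. The check below is my own sketch, not from the notes.

```python
# Check: d_{3/5} (car w.p. 3/5, transit w.p. 2/5) has constant risk
# and attains the Bayes risk for the prior (pi_S, pi_R) = (0.4, 0.6).
LOSS = {"B": (8, 0), "C": (3, 5), "T": (5, 2), "H": (25, 25)}  # (rain, sun)
lam = 3 / 5
risk_rain = lam * LOSS["C"][0] + (1 - lam) * LOSS["T"][0]
risk_sun = lam * LOSS["C"][1] + (1 - lam) * LOSS["T"][1]
print(risk_rain, risk_sun)  # 3.8 3.8 -- constant risk

pi_sun, pi_rain = 0.4, 0.6
bayes = {a: pi_rain * r + pi_sun * s for a, (r, s) in LOSS.items()}
print(min(bayes.values()))  # 3.8: no pure action does better, and Bayes
# risk is linear in the mixing weights, so no randomized rule does either.
```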

Decision Theory and Bayesian Methods: Summary when there is data

Decision space: the set of possible actions I might take. We assume it is convex, typically by expanding a basic decision space D to the space D* of all probability distributions on D.

Parameter space Θ of possible states of nature.

Loss function L = L(d, θ): the loss I incur if I do d and θ is the true state of nature.

Add data X ∈ X with model {P_θ; θ ∈ Θ} and model density f(x|θ). A procedure is a map δ: X → D.

The risk function of δ is the expected loss:

    R_δ(θ) = R(δ, θ) = E_θ[L(δ(X), θ)].

We call δ minimax if

    max_θ R(δ, θ) = min_δ′ max_θ R(δ′, θ).

A prior is a probability distribution π on Θ. The Bayes risk of a decision δ for a prior π is

    r_π(δ) = E_π[R(δ, θ)] = ∫∫ L(δ(x), θ) f(x|θ) π(θ) dx dθ

if the prior has a density; for finite parameter spaces Θ the integral over θ is a sum.

A decision δ* is Bayes for a prior π if

    r_π(δ*) ≤ r_π(δ)

for every decision δ.

For infinite parameter spaces: π(θ) > 0 on Θ is a proper prior if ∫ π(θ) dθ < ∞; divide π by this integral to get a density. If ∫ π(θ) dθ = ∞, π is an improper prior density.

A decision δ is inadmissible if there is a δ′ such that R(δ′, θ) ≤ R(δ, θ) for all θ, with strict inequality for at least one value of θ. A decision which is not inadmissible is called admissible.

Every admissible procedure is Bayes, perhaps only for an improper prior.
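A minimal Monte Carlo sketch of the with-data Bayes risk r_π(δ) (my own example, assuming a Beta(1, 1) prior, a Binomial(n, p) model, and squared-error loss): draw (p, X) from the joint distribution and average the loss, comparing the posterior-mean rule (X + 1)/(n + 2) with the MLE X/n.

```python
import random

# Monte Carlo estimate of r_pi(delta) for X ~ Binomial(n, p),
# p ~ Beta(1, 1) (uniform prior), squared-error loss.
random.seed(0)
n, reps = 20, 100_000

def draw_binomial(n, p):
    return sum(random.random() < p for _ in range(n))

tot_bayes = tot_mle = 0.0
for _ in range(reps):
    p = random.random()       # p ~ Uniform(0, 1) = Beta(1, 1)
    x = draw_binomial(n, p)   # X | p ~ Binomial(n, p)
    tot_bayes += ((x + 1) / (n + 2) - p) ** 2
    tot_mle += (x / n - p) ** 2

print("posterior mean:", tot_bayes / reps)  # about 1/(6(n+2)) ~ 0.0076
print("MLE:           ", tot_mle / reps)    # about 1/(6n)     ~ 0.0083
```

The posterior-mean rule minimizes the Bayes risk for this prior, so it beats the MLE on average over the prior even though the two risk functions cross in θ.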

If every risk function is continuous, then every Bayes procedure with finite Bayes risk (for a prior with density > 0 for all θ) is admissible.

A minimax procedure is admissible, and the minimax procedure has constant risk: the admissible minimax procedure is Bayes for some π, and its risk is constant on the set of θ for which the prior density is positive.