Bayesian statistics: Inference and decision theory

Patric Müller and Francesco Antognini
Seminar über Statistik FS 2008
3.3.2008

Contents
1 Introduction and basic definitions
2 Bayes Method
3 Two optimalities: minimaxity and admissibility
4 Alternatives
5 Appendix
6 Bibliography

1 Introduction and basic definitions

Notation: Let X ∈ χ be an observable random variable (the data), where χ is the space of all observations. We denote by P := {P_θ : θ ∈ Θ} the model class of distributions with unknown parameter θ ∈ Θ. We will assume X ∼ P ∈ P.

Let A be the action space. Typical choices are:
- A = R for estimating a parameter γ := g(θ) ∈ R,
- A = {0, 1} for testing,
- A = [0, 1] for randomized tests.

Definition: A decision is a map d : χ → A; d(x) is the action taken when x is observed.

We need to define an evaluation criterion in order to compare two different decisions.

Definition: A loss function is a map L : Θ × A → R₊, with L(θ, a) being the loss when the parameter value is θ and one takes action a.

Remark: The actual determination of the loss function is often awkward in practice, in particular because determining the consequences of each action for each value of θ is usually impossible when A or Θ are large sets, for instance when they have an infinite number of elements. The complexity of determining the subjective loss function of the decision maker often prompts the statistician to use classical (or canonical) losses, selected for their simplicity and mathematical tractability. It is still better to take a decision in finite time using an approximate criterion than to spend an infinite time determining the proper loss function.

Suppose our model class P := {P_θ : θ ∈ Θ} is dominated by a σ-finite measure ν, and let X ∼ P ∈ P. Let p_θ := dP_θ/dν denote the densities. We now think of p_θ as the density of X given the value of θ and write

  p_θ(x) = p(x|θ), x ∈ χ.

A fundamental basis of Bayesian decision theory is that statistical inference should start with the rigorous determination of three factors:
- the distribution family for the observations, p(x|θ) = p_θ(x);
- the prior distribution for the parameter, Π(θ);
- the loss associated with the decisions, L(θ, d).
The prior, the loss, and sometimes even the sampling distribution are derived from partly subjective considerations.

Definition: The risk of the decision d is

  R(θ, d) := E_θ[L(θ, d(X))] = ∫_χ L(θ, d(x)) p_θ(x) dν(x).

Examples: some classical loss functions and their risks.

a) Estimation of g(θ) ∈ R; A = R:

  L(θ, a) := w(θ) |g(θ) − a|^r, r ≥ 1,
  R(θ, d) = w(θ) E_θ[|g(θ) − d(X)|^r].

Special case: for r = 2 and w(θ) ≡ 1, R(θ, d) = E_θ[|g(θ) − d(X)|²] is the mean square error; L is called quadratic loss.

b) Testing the hypothesis H₀ : θ ∈ Θ₀ vs H₁ : θ ∈ Θ₁, with A = {0, 1}:

  L(θ, a) := 1 if θ ∈ Θ₀ and a = 1,
             c (> 0) if θ ∈ Θ₁ and a = 0,
             0 otherwise.

Writing d = ϕ,

  R(θ, ϕ) = P_θ(ϕ(X) = 1) if θ ∈ Θ₀,
            c P_θ(ϕ(X) = 0) if θ ∈ Θ₁.

Remark: The risk R(θ, d) is a function of the parameter θ. Therefore, the frequentist approach does not induce a total ordering on the set of procedures.

Definition: A decision d′ is called strictly better than d if R(θ, d′) ≤ R(θ, d) for all θ

and there exists θ such that R(θ, d′) < R(θ, d). When there exists a d′ that is strictly better than d, then d is called inadmissible.

Example: Let X = (X₁, ..., X_n) be i.i.d. with E_θ[X₁] = µ and Var_θ(X₁) = 1 for all θ. Consider

  d(X) = X̄_{n−1} = (1/(n−1)) Σ_{i=1}^{n−1} X_i,   d′(X) = X̄_n = (1/n) Σ_{i=1}^{n} X_i,

with loss L(θ, a) = |µ − a|². Then R(θ, d) = 1/(n−1) and R(θ, d′) = 1/n, so d is inadmissible.

2 Bayes Method

Suppose the parameter space Θ is a measure space. We can then equip it with a probability measure Π. We call Π the a priori distribution.

Definition: The Bayes risk (with respect to the probability measure Π) is

  r(Π, d) := ∫ R(θ, d) dΠ(θ).

A decision d is called Bayes (with respect to Π) if r(Π, d) = inf_{d′} r(Π, d′).

Remark: If Π has density w := dΠ/dµ with respect to some dominating measure µ, we may write

  r(Π, d) = ∫ R(θ, d) w(θ) dµ(θ) =: r_w(d).

The Bayes risk may be thought of as a weighted average of the risks. We call w(θ) the prior density. Moreover, the marginal density of X is

  p(x) := ∫ p(x|θ) w(θ) dµ(θ).

Definition: The a posteriori density of θ is

  w(θ|x) = p(x|θ) w(θ) / p(x), θ ∈ Θ, x ∈ χ.
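The inadmissibility in the example above is easy to confirm by simulation. The sketch below estimates both risks by Monte Carlo; the concrete values (µ = 2, n = 10, standard normal observations so that Var_θ(X₁) = 1) and all function names are illustrative choices, not from the text.

```python
import random

# Monte Carlo check of the inadmissibility example: with Var(X_1) = 1,
# the mean of the first n-1 observations (risk 1/(n-1)) is beaten by
# the mean of all n observations (risk 1/n), for every theta.

def empirical_risks(mu=2.0, n=10, reps=20000, seed=0):
    rng = random.Random(seed)
    se_partial = se_full = 0.0
    for _ in range(reps):
        x = [rng.gauss(mu, 1.0) for _ in range(n)]
        d = sum(x[: n - 1]) / (n - 1)   # X-bar_{n-1}
        d_prime = sum(x) / n            # X-bar_n
        se_partial += (d - mu) ** 2
        se_full += (d_prime - mu) ** 2
    return se_partial / reps, se_full / reps

r_partial, r_full = empirical_risks()
print(r_partial, r_full)   # roughly 1/9 = 0.111... and 1/10 = 0.1
```

The empirical mean square errors land near the theoretical risks 1/(n−1) and 1/n, with the full-sample mean uniformly better.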

Theorem: Given the data X = x, consider θ as a random variable with density w(θ|x). Let

  l(x, a) := E[L(θ, a) | X = x] = ∫ L(θ, a) w(θ|x) dµ(θ)

and d(x) := arg min_a l(x, a). Then d is a Bayes decision.

Proof: For any decision d′, using Fubini and p(x|θ) w(θ) = w(θ|x) p(x),

  r_w(d′) = ∫ R(θ, d′) w(θ) dµ(θ)
          = ∫ [ ∫_χ L(θ, d′(x)) p(x|θ) dν(x) ] w(θ) dµ(θ)
          = ∫_χ [ ∫ L(θ, d′(x)) w(θ|x) dµ(θ) ] p(x) dν(x)
          = ∫_χ l(x, d′(x)) p(x) dν(x)
          ≥ ∫_χ l(x, d(x)) p(x) dν(x) = r_w(d). □

Notice 1: From a strictly Bayesian point of view, only the posterior expected loss l(x, a) is important. The previous theorem gives us a constructive tool for the determination of Bayes estimators.

Notice 2: For strictly convex losses, the Bayes estimator is unique.

Remark: Let X ∈ R with E[X] = µ and Var(X) < ∞. If a ∈ R, then

  E[(X − a)²] = Var(X) + (a − µ)²,

since

  E[(X − a)²] = E[(X − µ − (a − µ))²]
              = E[(X − µ)²] − 2(a − µ) E[X − µ] + (a − µ)²
              = Var(X) + (a − µ)².

Hence arg min_a E[(X − a)²] = E[X].

Example 1: Θ ⊆ R, A = R, L(θ, a) := (θ − a)². Then

  d_Bayes(x) = E[θ | X = x] = ∫ θ w(θ|x) dµ(θ).

Example 2: An alternative to the quadratic loss in dimension one is the absolute error loss L(θ, a) = |θ − a|, Θ ⊆ R, A = R. Then d_Bayes(x) = median of w(θ|x).

Example 3: The 0–1 loss. Θ = {θ₀, θ₁}, testing H₀ : θ = θ₀ vs H₁ : θ = θ₁ with θ₀ ≠ θ₁ and A = [0, 1]. Let d = ϕ(x) = probability of rejecting H₀. Then

  R(θ, ϕ) = E_{θ₀}[ϕ(X)] if θ = θ₀,
            1 − E_{θ₁}[ϕ(X)] if θ = θ₁.

Let ν dominate P_{θ₀} and P_{θ₁}, and set p₀ := dP_{θ₀}/dν, p₁ := dP_{θ₁}/dν. Let c ≥ 0 and q ∈ [0, 1]. Then

  ϕ_NP := 1 if p₁ > c p₀,
          q if p₁ = c p₀,
          0 if p₁ < c p₀

is called the Neyman-Pearson test. With the same assumptions as in the Bayes method, we now have the prior weights

  w₀ = Π(θ = θ₀), 0 < w₀ < 1,  w₁ = Π(θ = θ₁) = 1 − w₀,
  r_w(ϕ) = w₀ R(θ₀, ϕ) + w₁ R(θ₁, ϕ).

Lemma:

  ϕ_Bayes = 1 if p₁ > (w₀/w₁) p₀,
            q if p₁ = (w₀/w₁) p₀,
            0 if p₁ < (w₀/w₁) p₀,

where q ∈ [0, 1] is arbitrary.

Proof: We have

  r_w(ϕ) = w₀ E_{θ₀}[ϕ(X)] + w₁ (1 − E_{θ₁}[ϕ(X)])
         = w₀ ∫ ϕ p₀ dν + w₁ (1 − ∫ ϕ p₁ dν)
         = ∫ (w₀ p₀(x) − w₁ p₁(x)) ϕ(x) dν(x) + w₁.

This implies

  w₀ p₀(x) − w₁ p₁(x) < 0 ⇒ ϕ_Bayes(x) = 1,
  w₀ p₀(x) − w₁ p₁(x) > 0 ⇒ ϕ_Bayes(x) = 0,
  w₀ p₀(x) − w₁ p₁(x) = 0 ⇒ ϕ_Bayes(x) = q,

with q an arbitrary value in [0, 1]. □

Example 4: Θ = A = R, L(θ, a) = 1{|θ − a| > c} for a given constant c > 0. Then

  l(x, a) = Π(|θ − a| > c | X = x) = ∫_{|θ−a|>c} w(θ|x) dµ(θ),

and

  d_Bayes(x) = arg min_a l(x, a) = arg max_a (1 − l(x, a)) = arg max_a (1/(2c)) ∫_{|θ−a|≤c} w(θ|x) dµ(θ).

If µ is the Lebesgue measure,

  lim_{c→0} (1/(2c)) ∫_{|θ−a|≤c} w(θ|x) dµ(θ) = w(a|x).

Conclusion: as c → 0, d_Bayes(x) → d_MAP(x) := arg max_a w(a|x), called the maximum a posteriori estimator.

Note: Since w(a|x) = p(x|a) w(a) / p(x), we have arg max_a w(a|x) = arg max_θ [p(x|θ) w(θ)]. In particular, MAP = MLE if w(θ) is constant!
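Examples 1, 2 and 4 can be illustrated numerically on a discretized parameter grid. The sketch below uses a Poisson(θ) likelihood with an Exponential(1) prior and observation x = 3 — all illustrative choices — so the exact posterior is Gamma(4, 2), with mean 2, median ≈ 1.84 and mode 1.5; posterior mean, median and mode are read off the grid.

```python
import math

# Bayes estimators on a theta-grid: the posterior mean (quadratic loss,
# Example 1), posterior median (absolute error loss, Example 2) and the
# posterior mode / MAP (small-c limit of the 0-1 window loss, Example 4).

x = 3                                       # observed count (illustrative)
grid = [i / 100 for i in range(1, 2001)]    # theta-grid on (0, 20]

def likelihood(theta):
    return math.exp(-theta) * theta ** x / math.factorial(x)

weights = [likelihood(t) * math.exp(-t) for t in grid]   # p(x|theta) * w(theta)
z = sum(weights)
post = [w / z for w in weights]             # grid approximation of w(theta|x)

post_mean = sum(t * w for t, w in zip(grid, post))

cum = 0.0
for t, w in zip(grid, post):                # smallest t with cumulative mass >= 1/2
    cum += w
    if cum >= 0.5:
        post_median = t
        break

post_map = max(zip(grid, post), key=lambda tw: tw[1])[0]

print(post_mean, post_median, post_map)     # near 2.0, 1.84, 1.5
```

Note how the three classical losses pick out three different summaries of the same posterior.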

3 Two optimalities: minimaxity and admissibility

Having seen the concept of admissibility as a way to judge how good or bad our decisions are, we now introduce the concept of minimaxity.

Definition: Let d be a decision; d is called minimax if the minimax risk

  R* := inf_{δ∈D} sup_θ R(θ, δ)

is equal to sup_θ R(θ, d). In other words, minimax means: the best decision in the worst possible case.

Remark: D is the set of randomized estimators.

Example: The first oil-drilling platforms were designed according to a minimax principle: they were constructed to resist the worst gale together with the worst storm ever observed, and this at the minimal temperature ever measured. Naturally such a platform is quite safe, but very expensive to build. Nowadays companies tend to use other strategies to reduce costs.

Up to now we have only seen methods to compare decisions; now we want to improve decisions.

Definition: A statistic S is called sufficient if the conditional distribution of X given S does not depend on θ.

Theorem (Rao-Blackwell): Let A ⊆ R^p be convex, let a ↦ L(θ, a) be convex for all θ, let S be sufficient, and let d : χ → A be a decision. Suppose d′(s) := E[d(X) | S = s] exists. Then

  R(θ, d′) ≤ R(θ, d) for all θ. (1)

Proof: Use Jensen's inequality (for convex g, E[g(X)] ≥ g(E[X]), applied conditionally) and the iterated expectation lemma:

  R(θ, d) = E[L(θ, d(X))] = E[ E[L(θ, d(X)) | S] ] ≥ E[L(θ, d′(S))] = R(θ, d′). □
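A Monte Carlo sketch of the Rao-Blackwell theorem, under illustrative assumptions: Bernoulli(θ) data, the crude estimator d(X) = X₁ of θ, and the sufficient statistic S = Σᵢ Xᵢ, for which d′(X) = E[X₁ | S] = S/n is the sample mean. The parameter values and names below are made up for the demonstration.

```python
import random

# Rao-Blackwellization for Bernoulli(theta) data: conditioning the crude
# estimator d(X) = X_1 on the sufficient statistic S = sum(X) yields the
# sample mean, whose quadratic risk is never larger (here: n times smaller).

def risks(theta=0.3, n=10, reps=50000, seed=1):
    rng = random.Random(seed)
    se_crude = se_rb = 0.0
    for _ in range(reps):
        x = [1 if rng.random() < theta else 0 for _ in range(n)]
        d = x[0]                  # crude estimator, risk theta*(1-theta)
        d_rb = sum(x) / n         # Rao-Blackwellized version, risk /n
        se_crude += (d - theta) ** 2
        se_rb += (d_rb - theta) ** 2
    return se_crude / reps, se_rb / reps

r_crude, r_rb = risks()
print(r_crude, r_rb)   # roughly 0.21 and 0.021
```

The empirical risks match θ(1−θ) = 0.21 and θ(1−θ)/n = 0.021, as inequality (1) predicts.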

Remarks:
I. As loss function we often use L(θ, a) := |θ − a|², which is convex, and [0, 1] or R (both convex sets) as action space. So it is not difficult to satisfy the conditions of the theorem.
II. The new decision is better than the old one, and depends only on s.
III. With R_* := sup_Π r(Π), where r(Π) := inf_{δ∈D} r(Π, δ), we have R_* ≤ R* = inf_{δ∈D} sup_θ R(θ, δ).

We can now look at some relations between Bayes, minimax and admissible decisions.

Proposition 1: If δ* is Bayes (with respect to an a priori distribution Π* on Θ) and if R(θ, δ*) ≤ r(Π*) for all θ ∈ supp(Π*), then δ* is minimax.

Proposition 2: If (Π_n)_{n∈N} is a sequence of proper prior distributions such that

  R(θ, δ*) ≤ lim_{n→∞} r(Π_n) < ∞ for all θ,

then δ* is minimax.

Proposition 3: If there exists a unique minimax estimator δ*, then it is admissible.

Proof: Let δ ≠ δ* be another decision. By uniqueness of the minimax estimator,

  sup_θ R(θ, δ) > sup_θ R(θ, δ*),

so there exists θ with R(θ, δ) > R(θ, δ*). Hence no δ is strictly better than δ*, and δ* is admissible. □

Note that uniqueness is essential.

Example: Θ = [0, 1], R(θ, δ₁) := 1 and R(θ, δ₂) := θ. Then sup_θ R(θ, δ₁) = sup_θ R(θ, δ₂) = 1: both δᵢ are minimax, even though δ₂ is strictly better than δ₁ (so δ₁ is also inadmissible).

Proposition 4: If δ* is an admissible decision with constant risk, R(θ, δ*) ≡ R(δ*), then δ* is the unique minimax decision.

Proof: Since the risk is constant, sup_θ R(θ, δ*) = R(θ̃, δ*) for every θ̃ (*). Since δ* is admissible, for any decision δ ≠ δ* there exists θ̃ such that R(θ̃, δ) > R(θ̃, δ*). Hence

  sup_θ R(θ, δ) ≥ R(θ̃, δ) > R(θ̃, δ*) = sup_θ R(θ, δ*)

by (*), so δ* is minimax. Uniqueness follows directly from the strict inequality, which holds for every δ ≠ δ*. □

Proposition 5: Let Π be a strictly positive a priori distribution on Θ, let R(θ, δ) be continuous in θ for every δ, and let r_Π(δ_Π) < ∞. Then Bayes implies admissible.

Proof: Assume δ_Π is Bayes but inadmissible; then there exists a strictly better decision δ₁:

  R(θ, δ₁) ≤ R(θ, δ_Π) for all θ, and R(θ̃, δ₁) < R(θ̃, δ_Π) for some θ̃. (2)

Because of the continuity of the risk functions, there is an open set σ ⊆ Θ on which the strict inequality in (2) holds. So we have

  r_Π(δ₁) = ∫_Θ R(θ, δ₁) w(θ) dθ
          = ∫_{Θ\σ} R(θ, δ₁) w(θ) dθ + ∫_σ R(θ, δ₁) w(θ) dθ
          ≤ ∫_{Θ\σ} R(θ, δ_Π) w(θ) dθ + ∫_σ R(θ, δ₁) w(θ) dθ
          < ∫_{Θ\σ} R(θ, δ_Π) w(θ) dθ + ∫_σ R(θ, δ_Π) w(θ) dθ = r_Π(δ_Π),

where the strict inequality uses Π(σ) > 0, which holds because Π is strictly positive. This contradicts the fact that δ_Π is Bayes. □

Proposition 6: If the Bayes estimator associated with a prior Π is unique, it is admissible.

Proposition 7: A Bayes estimator δ_Π (with proper prior Π), convex loss function and finite Bayes risk is admissible.
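The example following Proposition 3 (constant risk 1 versus risk θ on Θ = [0, 1]) can be restated numerically; the grid below is an illustrative discretization of Θ, not part of the text's argument.

```python
# Two decisions with the same worst-case risk: delta_1 has constant risk 1,
# delta_2 has risk theta on Theta = [0, 1]. Both are minimax, yet delta_2
# is strictly better, so minimaxity alone does not rule out delta_1.

thetas = [i / 1000 for i in range(1001)]    # grid for Theta = [0, 1]

risk_1 = [1.0 for t in thetas]              # R(theta, delta_1) = 1
risk_2 = [t for t in thetas]                # R(theta, delta_2) = theta

sup_1, sup_2 = max(risk_1), max(risk_2)
print(sup_1, sup_2)                          # both 1.0

never_worse = all(r2 <= r1 for r1, r2 in zip(risk_1, risk_2))
somewhere_better = any(r2 < r1 for r1, r2 in zip(risk_1, risk_2))
print(never_worse, somewhere_better)         # delta_1 is inadmissible
```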

4 Alternatives

Up to now we have only analysed cases where the loss and the distributions were given. But in reality this is not always so: often we do not have a completely determined loss function. There are then several possibilities to deal with this problem, for example:
- choosing arbitrarily the subjectively best loss among the candidates;
- modelling the unknown parameters of the loss by a random variable ω ∼ F, so that the Bayes risk becomes

    ∫_Θ ∫_Ω L(θ, δ, ω) dF(ω) dΠ(θ|x);

- reducing the number of possible loss functions by taking only a few of them into account, and then looking for estimators that perform well for all these losses.

5 Appendix

A numerical example to show the utility of the Bayes method.

Definition: Z > 0 has a Gamma(k, λ) distribution if it has density

  f_Z(z) = z^{k−1} e^{−λz} λ^k / Γ(k), where Γ(k) = ∫_0^∞ z^{k−1} e^{−z} dz.

Remark: E[Z] = k/λ.

Definition: Z ∈ {0, 1, ...} has a Negative Binomial(k, p) distribution if

  P(Z = z) = Γ(z + k) / (Γ(k) z!) · p^k (1 − p)^z, z = 0, 1, ...

Remarks: For k = 1 this is the geometric distribution with parameter p; E[Z] = k(1 − p)/p and Var(Z) = k(1 − p)/p².

Example: Suppose p(x|θ) = e^{−θ} θ^x / x!, x ∈ {0, 1, ...} (the Poisson(θ) distribution), and that θ ∼ Gamma(k, λ).

Result 0: The marginal is p(x) = Negative Binomial(k, p) with p = λ/(1 + λ).

Proof:

  p(x) = ∫ p(x|θ) w(θ) dµ(θ)
       = ∫_0^∞ (e^{−θ} θ^x / x!) · (θ^{k−1} e^{−λθ} λ^k / Γ(k)) dθ
       = (λ^k / (Γ(k) x!)) ∫_0^∞ e^{−θ(λ+1)} θ^{x+k−1} dθ.

We compute the last integral by partial integration:

  ∫_0^∞ e^{−θ(λ+1)} θ^{x+k−1} dθ
    = [−e^{−θ(λ+1)}/(λ+1) · θ^{x+k−1}]_0^∞ + ((x+k−1)/(λ+1)) ∫_0^∞ e^{−θ(λ+1)} θ^{x+k−2} dθ
    = ((x+k−1)/(λ+1)) ∫_0^∞ e^{−θ(λ+1)} θ^{x+k−2} dθ,

since the boundary term vanishes. Iterating (for integer k) down to ∫_0^∞ e^{−θ(λ+1)} dθ = 1/(λ+1) gives

  ∫_0^∞ e^{−θ(λ+1)} θ^{x+k−1} dθ = ((x+k−1)(x+k−2) ··· 2 · 1) / (λ+1)^{x+k−1} · 1/(λ+1) = Γ(x+k) / (λ+1)^{x+k}.

Finally we can write

  p(x) = (Γ(x+k) / (Γ(k) x!)) · (λ/(1+λ))^k · (1/(1+λ))^x.

Result 1: w(θ|x) is Gamma(x + k, 1 + λ).

Proof:

  w(θ|x) = p(x|θ) w(θ) / p(x)
         = (e^{−θ} θ^x / x!) · (θ^{k−1} e^{−λθ} λ^k / Γ(k)) · (1 / p(x))
         = θ^{x+k−1} e^{−(λ+1)θ} (λ+1)^{x+k} / Γ(x+k),

by Result 0. □
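Result 0 can be sanity-checked numerically. The sketch below compares a midpoint-rule approximation of the marginal integral with the Negative Binomial pmf; the parameter values k = 3, λ = 2 and the integration grid are illustrative choices.

```python
import math

# Numeric check of Result 0: the Poisson(theta) marginal under a
# Gamma(k, lam) prior equals the Negative Binomial(k, p) pmf,
# with p = lam / (1 + lam).

k, lam = 3.0, 2.0

def marginal_numeric(x, grid=200000, upper=60.0):
    # integrate p(x|theta) * prior(theta) over theta by the midpoint rule
    h = upper / grid
    total = 0.0
    for i in range(grid):
        t = (i + 0.5) * h
        poi = math.exp(-t) * t ** x / math.factorial(x)
        prior = t ** (k - 1) * math.exp(-lam * t) * lam ** k / math.gamma(k)
        total += poi * prior * h
    return total

def negbin_pmf(x):
    p = lam / (1 + lam)
    return math.gamma(x + k) / (math.gamma(k) * math.factorial(x)) * p ** k * (1 - p) ** x

print(marginal_numeric(2), negbin_pmf(2))   # both near 48/243 = 0.1975...
```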

Remarks:
- θ̂₁ := E[θ | X = x] = (x + k)/(1 + λ) is the Bayes estimator in the case of quadratic loss.
- MLE: θ̂₂ = x.
- MAP: θ̂₃ = arg max_θ e^{−(λ+1)θ} θ^{x+k−1} = (x + k − 1)/(1 + λ).

6 Bibliography

- Robert, Ch.: The Bayesian Choice, 2nd ed.
- Lecture notes of "Grundlagen der mathematischen Statistik" by Prof. Sara van de Geer.