STA 732: Inference. Notes 10. Parameter Estimation from a Decision Theoretic Angle


1 Statistical rules, loss and risk

We saw that a major focus of classical statistics is comparing various inference methods, e.g., comparing the power and efficiency of testing rules. A formal study of this was initiated by Abraham Wald (Ann. Math. Stat., 1939) and later dubbed Decision Theory by Erich Lehmann in the 1950s. Decision theory provides a synthesis of many concepts in statistics, including frequentist properties, loss functions, Bayesian procedures and so on. These notes provide a brief overview of the concepts and results.

2 Decisions, loss and risk

Consider data X ∈ S modeled as X ~ f(x | θ), θ ∈ Θ. Decision theory deals with the case where, based on observed data X = x, one has to choose an action a from a pre-specified action space A. A typical example is hypothesis testing, where A = {reject H_0, accept H_0}. A statistical decision method, or rule, is a map δ from S to A.

A comparison between various rules is relevant only when the consequences of such decision making have been well laid out in the form of a loss function L(θ, a), which measures the loss incurred when the true parameter value is θ and action a is taken. We will work with non-negative loss functions. For any decision rule δ, its risk function is defined as the expected loss:

    R(θ, δ) := E{L(θ, δ(X)) | θ}.

As we have seen with testing rules, it is not meaningful to seek the rule with uniformly smallest risk at every θ. Decision theory deals with other, more reasonable modes of comparison and offers some interesting results on various rule generation processes, such as the Bayesian approach, the MLE, and shrinkage.

Keeping with convention, we will restrict our discussion to the task of parameter estimation, where one is trying to provide a point estimate of a quantity η = η(θ) ∈ E. In this case A = E and the loss function is some sort of distance measure between η(θ) and the reported estimate a.
Common loss functions are:

    L(θ, a) = (η(θ) − a)²,           squared-error loss;
    L(θ, a) = |η(θ) − a|^p,          L_p loss, for some p > 0;
    L(θ, a) = I{|η(θ) − a| > c},     0-1 loss, for some c > 0.

I do not view point estimation as a statistical inference task in itself, but it is often the first step toward hypothesis testing or interval estimation.
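For concreteness, the three losses above can be coded directly. This is a small sketch of my own; the function names are not from the notes.

```python
import numpy as np

def squared_error_loss(eta, a):
    """Squared-error loss (eta(theta) - a)^2."""
    return (eta - a) ** 2

def lp_loss(eta, a, p):
    """L_p loss |eta(theta) - a|^p, for some p > 0."""
    return np.abs(eta - a) ** p

def zero_one_loss(eta, a, c):
    """0-1 loss I{|eta(theta) - a| > c}, for some c > 0."""
    return float(np.abs(eta - a) > c)

# Example: true value eta(theta) = 2.0, reported estimate a = 3.5
print(squared_error_loss(2.0, 3.5))    # 2.25
print(lp_loss(2.0, 3.5, p=1))          # 1.5
print(zero_one_loss(2.0, 3.5, c=1.0))  # 1.0 (the error exceeds c)
```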

[Figure 1: plot of the risk functions R(θ, δ) (vertical axis, 0 to 15) against θ (horizontal axis, −4 to 4) for the five rules listed in the caption.]

Figure 1: Risk functions under squared-error loss for 5 rules: δ_1(X) = X (black), δ_2(X) ≡ 1 (red), δ_3(X) = X I(|X| > 2) (green), δ_4(X) = 0.5X + 1 (blue) and δ_5(X) = 1.1X (cyan), for scalar data X modeled as X ~ N(θ, 1), where the quantity of interest is θ itself.

3 Risk comparison

An obvious difficulty with comparing rules by their risk functions is that the latter are not scalars, and hence not necessarily well ordered. Figure 1 shows this for scalar data X modeled as X ~ N(θ, 1), θ ∈ R. Various notions have been proposed to overcome this difficulty.

3.1 Admissibility

A rule is called inadmissible if there exists another rule whose risk function is everywhere smaller or equal, and strictly smaller at some θ. In Figure 1, δ_5 is inadmissible because δ_1 dominates it everywhere. A rule is admissible if it is not inadmissible. Admissibility does not rule out rules like δ_2, which wagers all its money on θ = 1 and cannot be beaten at that θ value.

3.2 Best among a restricted class

One might seek the rule that has uniformly the lowest risk within a smaller class of rules. For point estimation, there are two popular choices for such restrictions:

1. Unbiased estimators. Restrict to estimators δ(X) such that E{δ(X) | θ} = η(θ) for all θ ∈ Θ. For squared-error loss:

    R(θ, δ) = Var{δ(X) | θ} + [E{δ(X) | θ} − η(θ)]²

and hence the risk of an unbiased estimator at θ equals its variance at θ. The minimum-risk estimator within this class, if it exists, is called the uniformly minimum variance unbiased (UMVU) estimator. We will later see that δ_1(X) = X is the UMVU estimator for the problem in Figure 1.

2. Equivariant estimators. A number of statistical problems come with some notion of invariance under certain transformations. Consider, for example, measuring the mean population height of 6-year-old girls. It is desirable that an estimation rule does not lead to conflicting estimates depending on whether measurements are taken in inches or centimeters. Rules with such invariance properties are called equivariant rules. For simple geometric transformations, such as location shifts, scaling or rotations, one can often find a rule with uniformly smallest risk among the class of equivariant rules. For the problem in Figure 1, the rule δ(X) = X is also the minimum-risk equivariant estimator under location shifts and/or scale changes.

3.3 Scalar risk summaries

One approach to avoid the ordering problem is to summarize the entire risk function of a rule into a single number. Two popular summaries are:

1. Weighted average & Bayes rules. One could use a weighting function w(θ) on Θ to obtain a single-number summary:

    r(w, δ) = ∫ R(θ, δ) w(θ) dθ.

If w has a finite integral over Θ, then it is equivalent to consider only the normalized version π of w. If π is a pdf, then any rule δ that minimizes r(π, δ) is called a Bayes rule w.r.t. π. A Bayes rule, when it exists, can be found pointwise as:

    δ_π(x) = argmin_{a ∈ A} ∫ L(θ, a) π(θ | x) dθ.

For squared-error loss, the (unique) Bayes rule is simply the posterior mean:

    δ_π(x) = E{η(θ) | x} = ∫ η(θ) π(θ | x) dθ.

The rules δ_2(x) ≡ 1 and δ_4(x) = 0.5x + 1 in Figure 1 are both Bayes rules, with π(θ) = N(1, 0) (i.e., the Dirac measure at 1) and π(θ) = N(2, 1) respectively. The rule δ_1(x) = x is a formal Bayes rule under the improper prior π(θ) ∝ 1. It is also a generalized Bayes rule, i.e., the limit of a sequence of Bayes rules generated from a sequence of proper priors.

2. Maximum risk & minimax rules. The weighting approach requires some background knowledge of which θ values should weigh more heavily than others in defining the average risk. A more omni-purpose strategy is to guard against the worst-case scenario, given by the maximum risk sup_{θ ∈ Θ} R(θ, δ) associated with a rule δ. A rule with the minimum maximum risk is called a minimax rule. In Figure 1, δ_1 has the lowest maximum risk among the five rules. It turns out that δ_1 is indeed the minimax rule for this problem. We will later see, however, that a minimax rule need not necessarily be admissible.
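The risk comparisons of this section can be checked by simulation. The sketch below (my own construction, not from the notes) estimates R(θ, δ) by Monte Carlo for the five rules of Figure 1, and also verifies that δ_4 is the posterior mean under the N(2, 1) prior.

```python
import numpy as np

rng = np.random.default_rng(0)

# The five rules of Figure 1, for scalar data X ~ N(theta, 1), eta(theta) = theta
rules = {
    "d1": lambda x: x,
    "d2": lambda x: np.ones_like(x),
    "d3": lambda x: x * (np.abs(x) > 2),
    "d4": lambda x: 0.5 * x + 1,
    "d5": lambda x: 1.1 * x,
}

def risk(rule, theta, n_mc=200_000):
    """Monte Carlo estimate of R(theta, delta) = E{(delta(X) - theta)^2 | theta}."""
    x = rng.normal(theta, 1.0, size=n_mc)
    return np.mean((rule(x) - theta) ** 2)

thetas = np.linspace(-4, 4, 17)
R = {name: np.array([risk(r, t) for t in thetas]) for name, r in rules.items()}

# d5 is dominated by d1 at every theta on the grid, matching its inadmissibility
assert np.all(R["d5"] > R["d1"])

# d1 has the lowest maximum risk among the five rules (it is in fact minimax)
max_risks = {name: Rv.max() for name, Rv in R.items()}
print(max_risks)  # d1's max risk stays near 1; the others blow up

# d4 is the posterior mean under the prior N(2, 1): E(theta | x) = (1*x + 2)/(1 + 1)
x = 0.7
assert np.isclose((1 * x + 2) / (1 + 1), rules["d4"](np.array([x]))[0])
```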

4 More on UMVU

4.1 Sufficiency & Rao-Blackwellization

A statistic T(X) is called sufficient for θ if the conditional distribution of X given T(X) does not depend on θ, i.e., for any t ∈ T, the set of values T(X) can assume,

    P(X ∈ A | θ = θ_1, T(X) = t) = P(X ∈ A | θ = θ_2, T(X) = t)

for every θ_1, θ_2 ∈ Θ and A ⊂ S.

Any unbiased estimator δ(X) that is not a function of a sufficient statistic T(X) can be improved by considering its Rao-Blackwellized version δ̃(X) = E{δ(X) | T(X)}, which is also unbiased. Note that, by sufficiency, this conditional expectation can be carried out under any value of θ. The improvement holds because of the usual total variance identity:

    Var{δ(X) | θ} = E[Var{δ(X) | T(X), θ} | θ] + Var[E{δ(X) | T(X), θ} | θ]
                  = a positive number + Var{δ̃(X) | θ},

and hence Var{δ̃(X) | θ} < Var{δ(X) | θ}.

Sufficient statistics are easiest to identify by the factorization theorem: T(X) is sufficient if and only if f(x | θ) = h(x) g(T(x), θ) for some function g(t, θ). This technique can be used to argue that for X_i iid ~ Poisson(µ), µ ∈ R_+, T(X) = Σ_{i=1}^n X_i is a sufficient statistic for µ. More generally, if X ~ f(x | θ) = h(x) exp{θ T(x) − A(θ)} is an exponential family model, then T(X) is a sufficient statistic for θ. As a non-exponential-family example, for the model X_i iid ~ Uniform(0, θ), θ ∈ R_+, T(X) = max(X_1, ..., X_n) is a sufficient statistic.

4.2 The information inequality

There is a limit to how much an unbiased estimator can be improved, essentially because there is a limit to how compressed a sufficient statistic can be. Another approach that gives such a limit is the information inequality. If δ(X) is unbiased for η(θ), then

    Var{δ(X) | θ} ≥ η′(θ)² / I_F(θ),

where I_F(θ) is the Fisher information at θ.¹

[¹ There is a multivariate version of the information inequality. Suppose dim(θ) = k and dim(δ(X)) = dim(η(θ)) = m. Then Var{δ(X) | θ} ⪰ ∇η(θ)^T I_F(θ)^{−1} ∇η(θ), where ∇η(θ) is the k × m matrix of first-order partial derivatives ∂η_j(θ)/∂θ_i and, for two square matrices A and B, A ⪰ B means that A − B is non-negative definite.]

To prove this at a fixed θ, let

    S(X) = ∂/∂θ log f(X | θ)

denote the score function. Then, under regularity conditions, E{S(X) | θ} = 0, Var{S(X) | θ} = I_F(θ) and

    E{δ(X) S(X) | θ} = ∫ δ(x) [∂f(x | θ)/∂θ] / f(x | θ) · f(x | θ) dx = ∂/∂θ ∫ δ(x) f(x | θ) dx = η′(θ),
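A quick numerical illustration of Rao-Blackwellization in the Poisson model above (the crude estimator δ(X) = X_1 is my own illustrative choice, not from the notes): conditioning X_1 on the sufficient statistic T(X) = Σ X_i gives, by symmetry, E{X_1 | T} = T/n, the sample mean, with a much smaller variance.

```python
import numpy as np

rng = np.random.default_rng(1)
mu, n, n_mc = 3.0, 10, 100_000

# n_mc replicate datasets of n iid Poisson(mu) draws; T(X) = sum(X) is sufficient
x = rng.poisson(mu, size=(n_mc, n))

delta = x[:, 0].astype(float)  # crude unbiased estimator: delta(X) = X_1
delta_rb = x.mean(axis=1)      # Rao-Blackwellized version: E{X_1 | T} = T / n

# Both are unbiased for mu ...
print(delta.mean(), delta_rb.mean())  # both ~ 3.0
# ... but conditioning on T reduces the variance from ~mu to ~mu/n:
print(delta.var(), delta_rb.var())    # ~ 3.0 vs ~ 0.3
```

Note that Var(δ̃) ≈ µ/n = 0.3 here, which is exactly the information bound η′(µ)²/I_F(µ) = µ/n for unbiased estimation of µ from n iid Poisson observations, anticipating Section 4.2.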

from which the information inequality follows by the Cauchy-Schwarz inequality: Cov(W, V)² ≤ Var(W) Var(V).

The proof above indicates that for a canonical exponential family model f(x | θ) = h(x) exp{θ T(x) − A(θ)}, the lower bound is indeed achieved by δ(X) = T(X) for estimating η(θ) = A′(θ). That is, T(X) is the UMVU estimator of A′(θ). It can be proved that, for such models, a UMVU estimator exists for every η(θ) that admits an unbiased estimator.

4.3 Non-existence of UMVU estimators

Consider data X ∈ Z, where Z denotes the set of all integers (positive, negative or zero), modeled as X ~ f(x | θ), θ ∈ Z, where f(x | θ) = 1/3 if x ∈ {θ − 1, θ, θ + 1} and f(x | θ) = 0 otherwise. Then there does not exist a UMVU estimator for any non-constant η(θ).

4.4 Bayes rules are biased

Interestingly, a (squared-error loss) Bayes estimator is never unbiased! For if the Bayes rule δ(X) w.r.t. π were unbiased, i.e., E{δ(X) | θ} = η(θ) for all θ, then, taking expectations over the joint distribution of (X, θ),

    E{δ(X) η(θ)} = E[E{δ(X) | θ} η(θ)] = E{η²(θ)},   and
    E{δ(X) η(θ)} = E[δ(X) E{η(θ) | X}] = E{δ²(X)},

the second identity using δ(X) = E{η(θ) | X}. Therefore

    E{δ(X) − η(θ)}² = E{δ²(X)} + E{η²(θ)} − 2E[δ(X) E{η(θ) | X}] = 0,

which cannot happen unless η(θ) ≡ const.

5 Minimax and Bayes

Interestingly, the most systematic way to find minimax rules comes through consideration of Bayes rules that have constant risk functions. For any prior π, its Bayes risk is defined as r_π = inf_δ r(π, δ), the average risk associated with the corresponding Bayes rule. A prior π with maximum Bayes risk is called a least favorable prior. It turns out that Bayes rules of least favorable priors have close relations to minimaxity.

Theorem 1. Suppose a prior π with associated Bayes rule δ_π satisfies

    r_π = sup_{θ ∈ Θ} R(θ, δ_π).

Then δ_π is minimax and π is least favorable.

The condition of the theorem is the same as saying that the risk function R(θ, δ_π) must be constant over the support of π.
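The conclusion of Section 4.4 is easy to see numerically. In the Beta-Binomial model (a standard conjugate setup; the specific numbers below are my own), the squared-error Bayes rule is the posterior mean (a + X)/(a + b + n), whose frequentist expectation (a + np)/(a + b + n) differs from p except at a single point.

```python
# Bias of the squared-error Bayes rule in the Beta-Binomial model
def bayes_rule_mean(p, n, a, b):
    """E{delta(X) | p} for delta(X) = (a + X)/(a + b + n) and X ~ Binomial(n, p)."""
    return (a + n * p) / (a + b + n)

n, a, b = 20, 1.0, 1.0  # Binomial(20, p) with a uniform Beta(1, 1) prior
for p in [0.1, 0.3, 0.5, 0.7, 0.9]:
    bias = bayes_rule_mean(p, n, a, b) - p
    print(p, round(bias, 4))
# The bias vanishes only at p = a/(a + b) = 0.5; elsewhere the rule
# shrinks toward the prior mean, so it cannot be unbiased for all p.
```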

Example (Binomial model). Consider the model X ~ Binomial(n, p), p ∈ (0, 1), where we want to estimate p under the squared-error loss (δ(X) − p)². For π = Beta(a, b), the Bayes estimator is δ_{a,b}(X) = (a + X)/(a + b + n), whose risk function is given by

    R(p, δ_{a,b}) = [np(1 − p) + {a(1 − p) − bp}²] / (a + b + n)².

A straightforward calculation shows that for a = b = √n/2, the corresponding Bayes rule has a constant risk function. Hence Beta(√n/2, √n/2) is a least favorable prior and the corresponding Bayes rule δ(X) = (X + √n/2)/(n + √n) is a minimax estimator. This is not a very interesting estimator (unless you really feel strongly that π should put most of its mass around p = 1/2). For moderately large n, this minimax rule has a higher risk than the conventional estimator X/n at most p outside a tiny interval around 1/2.

If instead we use the scaled squared-error loss (δ(X) − p)²/{p(1 − p)}, then δ(X) = X/n is the Bayes rule associated with π = Uniform(0, 1) and has a constant risk function. So X/n is minimax under this loss.

For dealing with unbounded Θ, the above theorem can be relaxed to the following.

Theorem 2. Suppose π_n is a sequence of priors such that r := lim_n r_{π_n} is finite. If a rule δ satisfies sup_{θ ∈ Θ} R(θ, δ) = r, then δ is minimax and r ≥ r_π for every π.

Example (Normal mean). Suppose X ~ N_p(θ, σ² I_p), θ ∈ R^p. Let π_n = N_p(0, n I_p). Then, under the squared-error loss L(θ, a) = ||θ − a||²,

    δ_{π_n}(X) = (1 + σ²/n)^{−1} X,    r_{π_n} = pσ² (1 + σ²/n)^{−1},

so r = pσ². Now δ(X) = X has risk R(θ, X) = pσ² = r. So X is minimax! Of course, X is also the UMVU estimator, the MLE and the Bayes estimator under the reference Bayes analysis. But below we will see that it is not admissible when p ≥ 3!

6 Normal means, James-Stein & shrinkage

Consider again the problem of estimating θ ∈ R^p from data X ~ N_p(θ, σ² I_p), where σ is known. The Bayes estimator and Bayes risk under π = N_p(0, τ² I_p) are:

    δ_τ(X) = τ²/(τ² + σ²) X,    r_τ = pσ² τ²/(τ² + σ²).
Under this prior, the marginal pdf of X is N_p(0, (σ² + τ²) I_p), under which an unbiased estimator of ρ = τ²/(τ² + σ²) can be obtained as

    ρ̂ = 1 − (p − 2)σ²/||X||².

Plugging this into δ_τ, we get the famous James-Stein estimator:

    δ^{JS}(X) = (1 − (p − 2)σ²/||X||²) X,

with risk function

    R(θ, δ^{JS}) = pσ² − (p − 2)²σ⁴ E{1/||X||² | θ},

which is uniformly smaller than the risk function of X, R(θ, X) = pσ². That is, the UMVU, ML, reference-Bayes estimator X is not admissible! It turns out that the James-Stein estimator is itself inadmissible, and is dominated by the positive-part estimator:

    δ⁺(X) = (1 − (p − 2)σ²/||X||²)₊ X,

where z₊ = max(z, 0). The James-Stein and positive-part estimators triggered great interest in shrinkage (the factor in front of X shrinks the ML estimator X toward zero when θ is close to zero) and empirical Bayes (plugging an estimate of τ into the Bayes estimator).

7 Bayes & Admissibility

It is easy to see that all Bayes rules are essentially admissible. A more precise statement is as follows.

Theorem 3. Suppose δ_π is a Bayes estimator having finite Bayes risk with respect to a prior density π which is positive for all θ ∈ Θ, and suppose the risk function of every estimator δ is continuous in θ. Then δ_π is admissible.

The result is easy to prove, but a crucial assumption is the continuity of the risk function of every estimator δ. This, however, is not a very strong condition. For example, for a canonical exponential family model f(x | θ) = h(x) exp{θ^T T(x) − A(θ)} and any loss function L(θ, a) for which R(θ, δ) is finite, R(θ, δ) is also continuous in θ.

We conclude this discussion with a partial converse of the above result.

Theorem 4. Suppose f(x | θ) > 0 for every x ∈ S, θ ∈ Θ. Let L(θ, a) be continuous and strictly convex in a for every θ, and satisfy

    lim_{|a| → ∞} L(θ, a) = ∞   for all θ ∈ Θ.

Then every admissible rule δ(X) is a generalized Bayes rule, i.e., there is a sequence π_n of prior distributions, each with support on a finite subset of Θ, such that δ_{π_n}(X) → δ(X) [almost surely].

The two theorems together say that under some regularity conditions on the model and the loss function, every Bayes rule is admissible and every admissible rule is generalized Bayes.
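The dominance claims of Section 6 can be checked with a short simulation. This is my own sketch; θ = 0 is chosen because shrinkage gains are largest there (the exact James-Stein risk at θ = 0 is 2σ² for any p ≥ 3), but the risk ordering MLE > James-Stein ≥ positive-part holds at every θ.

```python
import numpy as np

rng = np.random.default_rng(2)

def james_stein(x, sigma2=1.0, positive_part=False):
    """James-Stein estimator (1 - (p-2)*sigma^2/||x||^2) x for a single
    observation x ~ N_p(theta, sigma^2 I_p); optionally its positive part."""
    p = x.shape[-1]
    shrink = 1.0 - (p - 2) * sigma2 / np.sum(x ** 2, axis=-1, keepdims=True)
    if positive_part:
        shrink = np.maximum(shrink, 0.0)
    return shrink * x

p, n_mc = 10, 200_000
theta = np.zeros(p)
x = rng.normal(theta, 1.0, size=(n_mc, p))  # n_mc replicate draws of X

def risk(est):
    """Monte Carlo estimate of E ||est - theta||^2."""
    return np.mean(np.sum((est - theta) ** 2, axis=-1))

r_mle = risk(x)                                  # ~ p = 10, the risk of X itself
r_js = risk(james_stein(x))                      # ~ 2 at theta = 0
r_pp = risk(james_stein(x, positive_part=True))  # smaller still
print(r_mle, r_js, r_pp)
assert r_pp < r_js < r_mle
```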