ST5215: Advanced Statistical Theory

Department of Statistics & Applied Probability Wednesday, October 5, 2011

Lecture 13: Basic elements and notions in decision theory

Basic elements
- $X$: a sample from a population $P \in \mathcal{P}$
- Decision: an action we take after observing $X$
- $\mathbb{A}$: the set of allowable actions; $(\mathbb{A}, \mathcal{F}_{\mathbb{A}})$: the action space
- $\mathcal{X}$: the range of $X$
- Decision rule: a measurable function (a statistic) $T$ from $(\mathcal{X}, \mathcal{F}_{\mathcal{X}})$ to $(\mathbb{A}, \mathcal{F}_{\mathbb{A}})$; if $X$ is observed, then we take the action $T(X) \in \mathbb{A}$

Performance criterion: loss function
- Loss function $L(P, a)$: a function from $\mathcal{P} \times \mathbb{A}$ to $[0, \infty)$ such that $L(P, a)$ is Borel for each $P$
- If $X = x$ is observed and our decision rule is $T$, then our loss is $L(P, T(x))$

Risk
It is difficult to compare $L(P, T_1(X))$ and $L(P, T_2(X))$ for two decision rules, $T_1$ and $T_2$, since both of them are random. The average (expected) loss is defined as
$$R_T(P) = E[L(P, T(X))] = \int_{\mathcal{X}} L(P, T(x))\, dP_X(x).$$
If $\mathcal{P}$ is a parametric family indexed by $\theta$, the loss and risk are denoted by $L(\theta, a)$ and $R_T(\theta)$.

Comparisons
For decision rules $T_1$ and $T_2$, $T_1$ is as good as $T_2$ iff $R_{T_1}(P) \le R_{T_2}(P)$ for any $P \in \mathcal{P}$, and $T_1$ is better than $T_2$ if, in addition, $R_{T_1}(P) < R_{T_2}(P)$ for at least one $P \in \mathcal{P}$. Two decision rules $T_1$ and $T_2$ are equivalent iff $R_{T_1}(P) = R_{T_2}(P)$ for all $P \in \mathcal{P}$.
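To make the comparison criterion concrete, here is a minimal Monte Carlo sketch (not part of the original notes; the normal model, sample size, and the two rules are illustrative assumptions) that approximates $R_T(\theta)$ for two estimators of a normal mean under squared error loss:

```python
import numpy as np

def mc_risk(rule, theta, n=20, reps=100_000, seed=0):
    """Monte Carlo approximation of R_T(theta) = E[(theta - T(X))^2]
    for X_1, ..., X_n iid N(theta, 1). All choices here are illustrative."""
    rng = np.random.default_rng(seed)
    X = rng.normal(theta, 1.0, size=(reps, n))  # reps independent samples of size n
    return ((theta - rule(X)) ** 2).mean()      # average squared error loss

sample_mean = lambda X: X.mean(axis=1)
sample_median = lambda X: np.median(X, axis=1)

for theta in [0.0, 1.0, 5.0]:
    print(theta, mc_risk(sample_mean, theta), mc_risk(sample_median, theta))
```

For this family the sample mean's risk is $1/n$ at every $\theta$, while the median's is larger (about $\pi/(2n)$ for large $n$), so the mean is as good as, in fact better than, the median in the sense just defined.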

Optimal rule
If $T$ is as good as any other rule in $\mathcal{I}$, a class of allowable decision rules, then $T$ is $\mathcal{I}$-optimal (or optimal if $\mathcal{I}$ contains all possible rules).

Randomized decision rules
A randomized decision rule is a function $\delta$ on $\mathcal{X} \times \mathcal{F}_{\mathbb{A}}$ such that, for every $A \in \mathcal{F}_{\mathbb{A}}$, $\delta(\cdot, A)$ is a Borel function and, for every $x \in \mathcal{X}$, $\delta(x, \cdot)$ is a probability measure on $(\mathbb{A}, \mathcal{F}_{\mathbb{A}})$. If $X = x$ is observed, we have a distribution of actions: $\delta(x, \cdot)$. A nonrandomized decision rule $T$ as previously discussed can be viewed as a special randomized decision rule with $\delta(x, \{a\}) = I_{\{a\}}(T(x))$, $a \in \mathbb{A}$, $x \in \mathcal{X}$. To choose an action in $\mathbb{A}$ when a randomized rule $\delta$ is used, we need to simulate a pseudorandom element of $\mathbb{A}$ according to $\delta(x, \cdot)$. Thus, an alternative way to describe a randomized rule is to specify the method of simulating the action from $\mathbb{A}$ for each $x \in \mathcal{X}$.

Randomized decision rules (cont.)
A randomized rule can be a discrete distribution: $\delta(x, \cdot)$ assigns probability $p_j(x)$ to a nonrandomized decision rule $T_j(x)$, $j = 1, 2, \ldots$, in which case the rule $\delta$ can be equivalently defined as a rule taking the value $T_j(x)$ with probability $p_j(x)$, i.e.,
$$T(X) = \begin{cases} T_1(X) & \text{with probability } p_1(X) \\ \;\vdots & \\ T_k(X) & \text{with probability } p_k(X). \end{cases}$$
The loss function for a randomized rule $\delta$ is defined as
$$L(P, \delta, x) = \int_{\mathbb{A}} L(P, a)\, d\delta(x, a),$$
which reduces to the loss function discussed earlier when $\delta$ is a nonrandomized rule. The risk of a randomized rule $\delta$ is then
$$R_\delta(P) = E[L(P, \delta, X)] = \int_{\mathcal{X}} \int_{\mathbb{A}} L(P, a)\, d\delta(x, a)\, dP_X(x).$$
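Since a randomized rule can be described by how the action is simulated, a discrete rule of the above form is easy to implement. A minimal sketch (my illustration, not from the notes; the component rules and the probabilities are arbitrary choices):

```python
import numpy as np

def randomized_rule(x, rules, probs, seed=None):
    """Simulate one action from delta(x, .): choose index j with
    probability p_j(x), then take the action T_j(x)."""
    rng = np.random.default_rng(seed)
    p = probs(x)                      # (p_1(x), ..., p_k(x)), summing to 1
    j = rng.choice(len(rules), p=p)   # simulate the random index J
    return rules[j](x)

rules = [np.mean, np.median]               # T_1, T_2 (illustrative)
probs = lambda x: np.array([0.7, 0.3])     # here p_j(x) does not depend on x
x = np.random.default_rng(1).normal(size=20)
print(randomized_rule(x, rules, probs, seed=2))
```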

Randomized decision rules (cont.)
For
$$T(X) = \begin{cases} T_1(X) & \text{with probability } p_1(X) \\ \;\vdots & \\ T_k(X) & \text{with probability } p_k(X), \end{cases}$$
the loss and risk are
$$L(P, T, x) = \sum_{j=1}^{k} L(P, T_j(x))\, p_j(x) \quad \text{and} \quad R_T(P) = \sum_{j=1}^{k} E[L(P, T_j(X))\, p_j(X)].$$

Example 2.19
Let $X = (X_1, \ldots, X_n)$ be a vector of iid measurements for a parameter $\theta \in \mathbb{R}$. We want to estimate $\theta$.
Action space: $(\mathbb{A}, \mathcal{F}_{\mathbb{A}}) = (\mathbb{R}, \mathcal{B})$.
A common loss function in this problem is the squared error loss $L(P, a) = (\theta - a)^2$, $a \in \mathbb{A}$.

Example 2.19 (continued)
Let $T(X) = \bar{X}$, the sample mean. The loss for $\bar{X}$ is $(\bar{X} - \theta)^2$. If the population has mean $\mu$ and variance $\sigma^2 < \infty$, then
$$R_{\bar{X}}(P) = E(\theta - \bar{X})^2 = (\theta - E\bar{X})^2 + E(E\bar{X} - \bar{X})^2 = (\theta - E\bar{X})^2 + \mathrm{Var}(\bar{X}) = (\mu - \theta)^2 + \frac{\sigma^2}{n}.$$
If $\theta$ is in fact the mean of the population, then
$$R_{\bar{X}}(P) = \frac{\sigma^2}{n},$$
which is an increasing function of the population variance $\sigma^2$ and a decreasing function of the sample size $n$.
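A quick numerical check of the risk formula just derived (a sketch under assumed values of $\mu$, $\sigma$, $n$, and $\theta$; none of these numbers come from the notes):

```python
import numpy as np

mu, sigma, n, theta, reps = 2.0, 3.0, 25, 1.5, 200_000
rng = np.random.default_rng(0)

X = rng.normal(mu, sigma, size=(reps, n))
mc = ((theta - X.mean(axis=1)) ** 2).mean()   # Monte Carlo E(theta - Xbar)^2
exact = (mu - theta) ** 2 + sigma ** 2 / n    # (mu - theta)^2 + sigma^2 / n

print(mc, exact)  # the two values agree up to Monte Carlo error
```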

The problem in Example 2.19 is a special case of a general problem called estimation. In an estimation problem, a decision rule $T$ is called an estimator. The following example describes another important type of problem, called hypothesis testing.

Example 2.20
Let $\mathcal{P}$ be a family of distributions, $\mathcal{P}_0 \subset \mathcal{P}$, and $\mathcal{P}_1 = \{P \in \mathcal{P} : P \notin \mathcal{P}_0\}$. A hypothesis testing problem can be formulated as that of deciding which of the following two statements is true:
$$H_0: P \in \mathcal{P}_0 \quad \text{versus} \quad H_1: P \in \mathcal{P}_1.$$
Here, $H_0$ is called the null hypothesis and $H_1$ is called the alternative hypothesis. The action space for this problem contains only two elements, i.e., $\mathbb{A} = \{0, 1\}$, where 0 is the action of accepting $H_0$ and 1 is the action of rejecting $H_0$.

Example 2.20 (continued)
A decision rule in this problem is called a test. Since a test $T(X)$ is a function from $\mathcal{X}$ to $\{0, 1\}$, $T(X)$ must have the form $I_C(X)$, where $C \in \mathcal{F}_{\mathcal{X}}$ is called the rejection region or critical region for testing $H_0$ versus $H_1$.

0-1 loss: $L(P, a) = 0$ if a correct decision is made and 1 if an incorrect decision is made, i.e., $L(P, j) = 0$ for $P \in \mathcal{P}_j$ and $L(P, j) = 1$ otherwise, $j = 0, 1$. Under this loss, the risk is
$$R_T(P) = \begin{cases} P(T(X) = 1) = P(X \in C), & P \in \mathcal{P}_0 \\ P(T(X) = 0) = P(X \notin C), & P \in \mathcal{P}_1. \end{cases}$$
The 0-1 loss implies that the losses for the two types of incorrect decision (accepting $H_0$ when $P \in \mathcal{P}_1$ and rejecting $H_0$ when $P \in \mathcal{P}_0$) are the same. In some cases, one might assume unequal losses: $L(P, j) = 0$ for $P \in \mathcal{P}_j$, $L(P, 0) = c_0$ when $P \in \mathcal{P}_1$, and $L(P, 1) = c_1$ when $P \in \mathcal{P}_0$.
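As a concrete instance of this risk function (an illustrative one-sided normal-mean test, not from the notes), take $X_1, \ldots, X_n$ iid $N(\mu, 1)$, $H_0: \mu \le 0$ versus $H_1: \mu > 0$, and the test $T(X) = I\{\bar{X} > c\}$. The risk under 0-1 loss can then be computed exactly:

```python
from math import erf, sqrt

def norm_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def risk_01(mu, n=25, c=0.3):
    """Risk of the test T(X) = I{Xbar > c} under 0-1 loss, for
    X_1, ..., X_n iid N(mu, 1) and H_0: mu <= 0 (illustrative setup)."""
    z = (c - mu) * sqrt(n)   # standardized threshold: Xbar ~ N(mu, 1/n)
    if mu <= 0:              # P in P_0: risk = P(reject) = P(Xbar > c)
        return 1.0 - norm_cdf(z)
    else:                    # P in P_1: risk = P(accept) = P(Xbar <= c)
        return norm_cdf(z)

for mu in [-0.5, 0.0, 0.2, 0.5, 1.0]:
    print(mu, round(risk_01(mu), 4))
```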

Definition 2.7 (Admissibility)
Let $\mathcal{I}$ be a class of decision rules (randomized or nonrandomized). A decision rule $T \in \mathcal{I}$ is called $\mathcal{I}$-admissible (or admissible when $\mathcal{I}$ contains all possible rules) iff there does not exist any $S \in \mathcal{I}$ that is better than $T$ (in terms of the risk).

Remarks
- If a decision rule $T$ is inadmissible, then there exists a rule better than $T$, and $T$ should not be used in principle. However, an admissible decision rule is not necessarily good. For example, in an estimation problem a silly estimator $T(X) \equiv c$ for a constant $c$ may be admissible (see the sketch following these remarks).
- If $T$ is $\mathcal{I}$-optimal, then it is $\mathcal{I}$-admissible.
- If $T$ is $\mathcal{I}$-optimal and $T_0$ is $\mathcal{I}$-admissible, then $T_0$ is also $\mathcal{I}$-optimal and is equivalent to $T$.
- If there are two $\mathcal{I}$-admissible rules that are not equivalent, then there does not exist any $\mathcal{I}$-optimal rule.
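The silly-but-admissible remark can be seen numerically. Under squared error loss, the constant rule $T(X) \equiv 0$ has risk $\theta^2$, which is zero at $\theta = 0$; any rule better than it would also need zero risk at $\theta = 0$, which for a family with common support (such as the normal location family) forces it to coincide with the constant rule. A small sketch (illustrative numbers, not from the notes) comparing its risk curve with that of $\bar{X}$ for $N(\theta, 1)$ data:

```python
import numpy as np

n = 20
thetas = np.linspace(-2.0, 2.0, 9)
risk_mean = np.full_like(thetas, 1.0 / n)  # R_Xbar(theta) = sigma^2 / n, sigma = 1
risk_const = thetas ** 2                   # R_0(theta) = (theta - 0)^2

for t, rm, rc in zip(thetas, risk_mean, risk_const):
    print(f"theta = {t:+.1f}   Xbar: {rm:.3f}   constant 0: {rc:.3f}")
```

Neither risk function dominates the other, so neither rule is better than the other; by the last remark above, two non-equivalent admissible rules like these rule out an optimal rule.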

Theorem 2.5
Suppose that $\mathbb{A}$ is a convex subset of $\mathbb{R}^k$ and that, for any $P \in \mathcal{P}$, $L(P, a)$ is a convex function of $a$.
(i) Let $\delta$ be a randomized rule satisfying $\int_{\mathbb{A}} \|a\|\, d\delta(x, a) < \infty$ for any $x \in \mathcal{X}$, and let $T_1(x) = \int_{\mathbb{A}} a\, d\delta(x, a)$. Then $L(P, T_1(x)) \le L(P, \delta, x)$ (or $L(P, T_1(x)) < L(P, \delta, x)$ if $L$ is strictly convex in $a$ and $\delta$ is not degenerate) for any $x \in \mathcal{X}$ and $P \in \mathcal{P}$.
(ii) (Rao-Blackwell theorem) Let $T$ be a sufficient statistic for $P \in \mathcal{P}$, let $T_0$ be a nonrandomized rule with values in $\mathbb{R}^k$ satisfying $E\|T_0\| < \infty$, and let $T_1 = E[T_0(X) \mid T]$. Then $R_{T_1}(P) \le R_{T_0}(P)$ for any $P \in \mathcal{P}$. If $L$ is strictly convex in $a$ and $T_0$ is not a function of $T$, then $T_0$ is inadmissible.

Remark
Under the conditions of Theorem 2.5, we can concentrate on nonrandomized rules. The proof of Theorem 2.5 is an application of Jensen's inequality.
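A numerical illustration of part (ii) (a sketch; the Bernoulli setup is my choice, not from the notes). For $X_1, \ldots, X_n$ iid Bernoulli$(\theta)$, $T = \sum_i X_i$ is sufficient, $T_0 = X_1$ is a crude unbiased rule, and conditioning gives $T_1 = E[T_0 \mid T] = \bar{X}$; under squared error loss the risk drops from $\theta(1-\theta)$ to $\theta(1-\theta)/n$:

```python
import numpy as np

theta, n, reps = 0.3, 10, 200_000
rng = np.random.default_rng(0)

X = rng.binomial(1, theta, size=(reps, n))
T0 = X[:, 0]           # T_0 = X_1, uses only the first observation
T1 = X.mean(axis=1)    # T_1 = E[X_1 | sum X_i] = Xbar (Rao-Blackwellized)

print(((T0 - theta) ** 2).mean())  # ~ theta (1 - theta)     = 0.21
print(((T1 - theta) ** 2).mean())  # ~ theta (1 - theta) / n = 0.021
```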

Definition 2.8 (Unbiasedness)
In an estimation problem, the bias of an estimator $T(X)$ of a real-valued parameter $\vartheta$ of the unknown population is defined to be $b_T(P) = E[T(X)] - \vartheta$ (denoted by $b_T(\theta)$ when $\mathcal{P}$ is a parametric family indexed by $\theta$). An estimator $T(X)$ is said to be unbiased for $\vartheta$ iff $b_T(P) = 0$ for any $P \in \mathcal{P}$.

Definition 2.9 (Invariance)
Let $X$ be a sample from $P \in \mathcal{P}$.
(i) A class $G$ of one-to-one transformations of $X$ is called a group iff $g_i \in G$, $i = 1, 2$, implies $g_1 \circ g_2 \in G$ and $g_i^{-1} \in G$.
(ii) We say that $\mathcal{P}$ is invariant under $G$ iff $\bar{g}(P_X) = P_{g(X)}$ defines a one-to-one transformation $\bar{g}$ from $\mathcal{P}$ onto $\mathcal{P}$ for each $g \in G$.

Definition 2.9 (continued)
(iii) A decision problem is said to be invariant iff $\mathcal{P}$ is invariant under $G$ and the loss $L(P, a)$ is invariant in the sense that, for every $g \in G$ and every $a \in \mathbb{A}$, there exists a unique $g(a) \in \mathbb{A}$ such that $L(P_X, a) = L\big(P_{g(X)}, g(a)\big)$. (Note that $g(X)$ and $g(a)$ are different functions in general.)
(iv) A decision rule $T(x)$ is said to be invariant iff, for every $g \in G$ and every $x \in \mathcal{X}$, $T(g(x)) = g(T(x))$.

Remarks
Invariance means that our decision is not affected by one-to-one transformations of the data. In a problem where the distribution of $X$ is in a location-scale family $\mathcal{P}$ on $\mathbb{R}^k$, we often consider location-scale transformations of the data $X$ of the form $g(X) = AX + c$, where $c \in \mathcal{C} \subset \mathbb{R}^k$ and $A \in \mathcal{T}$, a class of invertible $k \times k$ matrices.
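A one-line check of (iv) for the transformations in the remark (a sketch with $k = 1$ and arbitrary illustrative values of $A = a$ and $c$): the sample mean satisfies $\bar{X}(aX + c) = a\bar{X}(X) + c$, i.e., $T(g(x)) = g(T(x))$ when $g$ acts on actions the same way it acts on the data.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=30)     # an arbitrary data vector
a, c = 2.5, -1.0            # an invertible scale and a shift (k = 1 case)

lhs = np.mean(a * x + c)    # T(g(x))
rhs = a * np.mean(x) + c    # g(T(x))
print(np.isclose(lhs, rhs)) # True: the sample mean is an invariant rule
```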

Bayes rule
Consider an average of $R_T(P)$ over $P \in \mathcal{P}$:
$$r_T(\Pi) = \int_{\mathcal{P}} R_T(P)\, d\Pi(P),$$
where $\Pi$ is a known probability measure on $(\mathcal{P}, \mathcal{F}_{\mathcal{P}})$ with an appropriate $\sigma$-field $\mathcal{F}_{\mathcal{P}}$. $r_T(\Pi)$ is called the Bayes risk of $T$ w.r.t. $\Pi$. If $T^* \in \mathcal{I}$ and $r_{T^*}(\Pi) \le r_T(\Pi)$ for any $T \in \mathcal{I}$, then $T^*$ is called an $\mathcal{I}$-Bayes rule (or Bayes rule when $\mathcal{I}$ contains all possible rules) w.r.t. $\Pi$.

Minimax rule
Consider the worst situation, i.e., $\sup_{P \in \mathcal{P}} R_T(P)$. If $T^* \in \mathcal{I}$ and
$$\sup_{P \in \mathcal{P}} R_{T^*}(P) \le \sup_{P \in \mathcal{P}} R_T(P)$$
for any $T \in \mathcal{I}$, then $T^*$ is called an $\mathcal{I}$-minimax rule (or minimax rule when $\mathcal{I}$ contains all possible rules).
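To close, a sketch contrasting the two criteria (the Bernoulli model, uniform prior, grid, and shrinkage rule are all illustrative choices, not from the notes): approximate the Bayes risk and the worst-case risk of $\bar{X}$ and of the shrinkage estimator $(\sum_i X_i + \sqrt{n}/2)/(n + \sqrt{n})$ for a Bernoulli $\theta$ under squared error loss.

```python
import numpy as np

n, reps = 10, 50_000
rng = np.random.default_rng(0)

def mc_risk(rule, theta):
    """Monte Carlo risk under squared error loss."""
    X = rng.binomial(1, theta, size=(reps, n))
    return ((rule(X) - theta) ** 2).mean()

xbar = lambda X: X.mean(axis=1)
shrink = lambda X: (X.sum(axis=1) + np.sqrt(n) / 2) / (n + np.sqrt(n))

grid = np.linspace(0.01, 0.99, 25)              # grid over the parameter space
R1 = np.array([mc_risk(xbar, t) for t in grid])
R2 = np.array([mc_risk(shrink, t) for t in grid])

# Averaging over an equally spaced grid approximates r_T(Pi) for a uniform
# prior Pi; the maximum approximates sup_theta R_T(theta).
print("approx. Bayes risk:", R1.mean(), R2.mean())
print("worst-case risk:   ", R1.max(), R2.max())
```

The shrinkage rule has constant risk $1/\big(4(\sqrt{n}+1)^2\big)$ and is the classical minimax estimator in this problem, while $\bar{X}$ has smaller risk near $\theta = 0$ or $1$; which rule is preferred depends on the criterion.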