Lecture 7 Introduction to Statistical Decision Theory


Lecture 7: Introduction to Statistical Decision Theory

I-Hsiang Wang
Department of Electrical Engineering
National Taiwan University
ihwang@ntu.edu.tw

December 20, 2016

1 / 55 I-Hsiang Wang IT Lecture 7

In the rest of this course we switch gears to the interplay between information theory and statistics.

1 In this lecture, we introduce the basic elements of statistical decision theory:
  It is about how to make decisions from data samples collected from a statistical model.
  It is about how to evaluate decision-making algorithms (decision rules) under a statistical model.
  It also serves as an overview of the contents to be covered in the follow-up lectures.

2 In the follow-up lectures, we will go into the details of several topics, including
  Hypothesis testing: large-sample asymptotic performance limits
  Point estimation: Bayes vs. Minimax, lower-bounding techniques, high-dimensional problems, etc.

Alongside, we will introduce tools and techniques for investigating the asymptotic performance of several statistical problems, and show their interplay with information theory.
  Tools from probability theory: large deviations, concentration inequalities, etc.
  Elements from information theory: information measures, lower-bounding techniques, etc.

Overview of this Lecture

The goal of this lecture is to establish the basics of statistical decision theory.

1 We begin by setting up the framework of statistical decision theory, including:
  Statistical experiment: parameter space, data samples, statistical model
  Decision rule: deterministic vs. randomized
  Performance evaluation: loss function, risk, minimax vs. Bayes

2 Next, we introduce two basic statistical decision-making problems:
  Hypothesis testing
  Point estimation

Outline

1 Statistical Model and Decision Making: Basic Framework; Examples; Paradigms
2 Hypothesis Testing: Basics
3 Estimation: Mean-Squared Error (MSE) and Cramér-Rao Lower Bound; Maximum Likelihood Estimator, Consistency, and Efficiency

1 Statistical Model and Decision Making: Basic Framework

[Diagram: a statistical experiment generates data X ∼ P_θ, which is fed to a decision-making rule producing τ(X) = T̂.]

Statistical Experiment

Statistical Model: a collection of data-generating distributions P ≜ {P_θ : θ ∈ Θ}, where Θ is called the parameter space; it can be finite, countably infinite, or uncountable. P_θ(·) is a probability distribution that accounts for the implicit randomness in experiments, sampling, or making observations.

Data (Sample/Outcome/Observation): X is generated by a random draw from P_θ, that is, X ∼ P_θ. X can be a random variable, vector, matrix, process, etc.

Inference Task

Objective: T(θ), a function of the parameter θ. From the data X ∼ P_θ, one would like to infer T(θ).

Decision Rule

Deterministic decision rule: τ(·) is a function of X; T̂ = τ(X) is the inferred result.
Randomized decision rule: τ(·,·) is a function of (X, U), where U is external randomness; T̂ = τ(X, U) is the inferred result.

Performance Evaluation: how good is a decision rule τ?

Loss function: l(T(θ), τ(X)) measures how bad the decision rule τ is at a specific data point X. Note: since X is random, l(T(θ), τ(X)) is also random.

Risk: L_θ(τ) ≜ E_{X∼P_θ}[l(T(θ), τ(X))] measures, on average, how bad the decision rule τ is when the true parameter is θ.
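As a small numerical sketch of these definitions, the following toy example (the model P_θ, rule τ, and loss are all invented here for illustration, not taken from the lecture) computes the risk L_θ(τ) of a deterministic rule under the 0-1 loss, with T(θ) = θ:

```python
# Toy statistical model: theta in {0, 1}, X in {0, 1, 2}, with the two
# data-generating distributions below (numbers invented for illustration).
P = {
    0: {0: 0.6, 1: 0.3, 2: 0.1},  # P_0(x)
    1: {0: 0.1, 1: 0.3, 2: 0.6},  # P_1(x)
}

def tau(x):
    """A hypothetical deterministic decision rule: guess theta = 1 iff x = 2."""
    return 1 if x == 2 else 0

def risk(theta, rule):
    """L_theta(rule) = E_{X ~ P_theta}[ 1{ theta != rule(X) } ] under 0-1 loss."""
    return sum(p for x, p in P[theta].items() if rule(x) != theta)

print(risk(0, tau))  # error probability when the true parameter is theta = 0
print(risk(1, tau))  # error probability when the true parameter is theta = 1
```

Note how the risk is a function of the hidden parameter θ: the same rule can be good for one θ and poor for another, which is exactly what motivates the Bayes and Minimax paradigms below.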

Performance Evaluation: what if the decision rule τ is randomized?

The loss function becomes l(T(θ), τ(X, U)).
The risk becomes L_θ(τ) ≜ E_{U, X∼P_θ}[l(T(θ), τ(X, U))].

1 Statistical Model and Decision Making: Examples

Sometimes we care about the inferred object itself.

Example: Decoding

Decoding in channel coding over a DMC is an example we are familiar with.

Parameter is the message: θ ≜ m
Parameter space is the message set: Θ ≜ {1, 2, ..., 2^{NR}}
Data is the received sequence: X ≜ Y^N
Statistical model is Encoder + Channel: P_θ(x) ≜ ∏_{i=1}^{N} P_{Y|X}(y_i | x_i(m))
Task is to decode the message: T(θ) ≜ m
Decision rule is the decoding algorithm: τ(X) ≜ dec(Y^N)
Loss function is the 0-1 loss: l(T(θ), τ(X)) ≜ 1{m ≠ dec(Y^N)}
Risk is the decoding error probability: L_θ(τ) ≜ λ_{m,dec} ≜ P{m ≠ dec(Y^N) | m is sent}

Example: Hypothesis Testing

Decoding in channel coding belongs to a more general class of problems called hypothesis testing.

Parameter space is a finite set: |Θ| < ∞
Task is to infer the parameter θ: T(θ) = θ
Loss function is the 0-1 loss: l(T(θ), τ(X)) = 1{θ ≠ τ(X)}
Risk is the probability of error: L_θ(τ) = P_{X∼P_θ}{θ ≠ τ(X)}

Example: Density Estimation

Estimate the probability density function from the collected samples.

Parameter space is a (huge) set of density functions: Θ ≜ F = {f : R → [0, +∞) | f is concave/continuous/Lipschitz continuous/etc.}
Data is the observed i.i.d. sequence: X ≜ X^n, with X_i i.i.d. ∼ f(·)
Task is to infer the density function f(·): T(θ) ≜ f
Decision rule is the density estimator: τ(X) ≜ f̂_{X^n}(·)
Loss function is some sort of divergence: l(T(θ), τ(X)) ≜ D(f ‖ f̂_{x^n})
Risk is the expected loss: L_θ(τ) ≜ E_{X^n ∼ f^n}[ D(f ‖ f̂_{X^n}) ]

Sometimes we care about the utility of the inferred object.

Example: Classification/Prediction

A basic problem in learning is to train a classifier that predicts the category of a new object.

Parameter space is a collection of labelings: Θ ≜ H = {h : X → [1 : K]}
Data is the training data set: X ≜ (X^n, Y^n), with labels Y_i ∈ [1 : K]
Statistical model is the noisy labeling: P_θ(x) ≜ ∏_{i=1}^{n} P_X(x_i) P_{Y|h(X)}(y_i | h(x_i))
Task is to infer the true labeling h ∈ H: T(θ) ≜ h(·), τ(X) ≜ ĥ_{X^n,Y^n}(·)
Loss function is the prediction error probability: l(T, τ(X)) ≜ E_{X∼P_X}[ 1{h(X) ≠ ĥ(X)} ]
(Note: this is still random, as ĥ depends on the randomly drawn training data (X, Y)^n.)
Risk is the loss averaged over the training data: L_θ(τ) ≜ E_{(X^n,Y^n)∼(P_X P_{Y|h(X)})^n}[ E_{X∼P_X}[ 1{h(X) ≠ ĥ(X)} ] ]

Example: Regression

Another example that we are very familiar with is regression under mean squared error.

Parameter space is a collection of functions: Θ ≜ F = {f : R^p → R}
Data is the training data: X ≜ (X^n, Y^n)
Statistical model is the noisy observation: P_θ(x) ≜ ∏_{i=1}^{n} P_X(x_i) P_Z(y_i − f(x_i)), where Y = f(X) + Z and Z is the observation noise
Task is to infer f ∈ F: T(θ) ≜ f, τ(X) ≜ f̂_{X^n,Y^n}(·)
Loss function is the mean squared error: l(T, τ(X)) ≜ E_{(X,Y)∼P_f}[ (Y − f̂(X))² ]
(Note: this is still random, as f̂ depends on the randomly drawn training data (X, Y)^n.)
Risk is the loss averaged over the training data: L_θ(τ) ≜ E_{(X^n,Y^n)∼P_f^n}[ E_{(X,Y)∼P_f}[ (Y − f̂(X))² ] ]

Example: Linear Regression

For linear regression, the underlying model is linear: f(x) = βᵀx + γ.

Parameter is (β, γ) ∈ R^p × R: θ ≜ (β, γ), Θ ≜ R^p × R
Data is the training data: X ≜ (X^n, Y^n)
Statistical model is the noisy linear model: P_θ(x) ≜ ∏_{i=1}^{n} P_X(x_i) P_Z(y_i − γ − βᵀx_i)
Task is to infer (β, γ): T(θ) ≜ (β, γ), τ(X) ≜ (β̂, γ̂)_{X^n,Y^n}
Loss function is the mean squared error: l(T, τ(X)) ≜ E_{(X,Y)∼P_{β,γ}}[ (Y − β̂ᵀX − γ̂)² ]
(Note: this is still random, as (β̂, γ̂) depends on the randomly drawn training data (X, Y)^n.)
Risk is the loss averaged over the random training data: L_θ(τ) ≜ E_{(X^n,Y^n)∼P_{β,γ}^n}[ E_{(X,Y)∼P_{β,γ}}[ (Y − β̂ᵀX − γ̂)² ] ]

1 Statistical Model and Decision Making: Paradigms

How to Determine the Best Estimator?

Recall: the risk of a decision rule τ is L_θ(τ), defined according to the underlying statistical model, the task, and the loss function. Note that the risk depends on the hidden parameter θ.

If we could find a decision rule τ* such that L_θ(τ*) ≤ L_θ(τ) for all θ ∈ Θ and all other decision rules τ, then τ* would obviously be the best.

Unfortunately, this is unlikely to happen. For example, suppose the loss function is the l₂-norm between θ and θ̂ = τ(X). Then the constant rule τ(x) ≡ θ minimizes the risk L_θ(τ) when the true parameter is θ. This means that no single τ can simultaneously minimize L_θ for all θ ∈ Θ!

There are two main paradigms that resolve this issue: the average-case paradigm (Bayes) and the worst-case paradigm (Minimax).

Bayes Paradigm

In the Bayes paradigm, a prior distribution π(·) over the parameter space Θ is required, and the performance of a decision rule is evaluated by an average-case analysis with respect to π.

Definition 1 (Bayes Risk)
The average risk of a decision rule τ with respect to a prior π is defined as R_π(τ) ≜ E_{Θ∼π}[L_Θ(τ)]. The Bayes risk for a prior π is defined as the minimum average risk, that is,
    R*_π ≜ inf_τ R_π(τ),    (1)
which is attained by a Bayes decision rule τ*_π (which may not be unique).

Note that in information theory most coding theorems are derived in the Bayes paradigm: we assume random sources and messages. We will give more results later when we discuss hypothesis testing and point estimation.

Minimax Paradigm

In the Minimax paradigm, the performance of a decision rule is evaluated by its worst-case risk over the entire parameter space.

Definition 2 (Minimax Risk)
The worst-case risk of a decision rule τ is defined as R*(τ) ≜ sup_{θ∈Θ} L_θ(τ). The Minimax risk is defined as the minimum worst-case risk, that is,
    R* ≜ inf_τ R*(τ) = inf_τ sup_{θ∈Θ} L_θ(τ),    (2)
which is attained by a Minimax decision rule τ* (which may not be unique).

A main criticism of the Bayes paradigm is that in many applications there is no prior over the parameter space, because the statistical task is done one-shot. In such cases, the Minimax paradigm provides a conservative but robust evaluation and theoretical guarantee.

Bayes vs. Minimax

A simple yet fundamental relationship between Minimax and Bayes is stated in the theorem below.

Theorem 1 (Minimax Risk ≥ Worst-Case Bayes Risk)
In general, the Minimax risk is not smaller than any Bayes risk, that is,
    R* ≥ sup_π R*_π.    (3)
Furthermore, the inequality becomes an equality in the following two cases:
1 The parameter space Θ and the data alphabet X are both finite.
2 |Θ| < ∞ and the loss function is bounded from below.

Remark: When deriving lower bounds on the minimax risk, inequality (3) is useful.
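Inequality (3) can be checked numerically on a small finite problem. The sketch below uses a toy binary model invented here (not from the lecture) and, for simplicity, restricts attention to deterministic rules; with that restriction the inequality can be strict, which is what happens in this example:

```python
# Toy check of R* >= sup_pi R*_pi on a binary parameter / binary data model.
from itertools import product

X_ALPHABET = [0, 1]
P = {0: {0: 0.8, 1: 0.2},   # P_0 (numbers invented for illustration)
     1: {0: 0.3, 1: 0.7}}   # P_1

# All deterministic rules tau : {0,1} -> {0,1}, encoded as tuples (tau(0), tau(1)).
rules = list(product([0, 1], repeat=len(X_ALPHABET)))

def L(theta, rule):
    """Risk under 0-1 loss: L_theta(tau) = P_{X ~ P_theta}{ theta != tau(X) }."""
    return sum(p for x, p in P[theta].items() if rule[x] != theta)

# Minimax risk over deterministic rules: inf_tau sup_theta L_theta(tau).
minimax = min(max(L(th, r) for th in (0, 1)) for r in rules)

# Worst-case Bayes risk over a grid of priors: sup_pi inf_tau E_pi[L_Theta(tau)].
worst_bayes = max(
    min(pi1 * L(1, r) + (1 - pi1) * L(0, r) for r in rules)
    for pi1 in [k / 100 for k in range(101)]
)

print(minimax, worst_bayes)  # the first is never smaller than the second
```

Allowing randomized rules (and, in general, the conditions of Theorem 1) is what closes the remaining gap between the two quantities.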

pf: Note that R* = inf_τ sup_{θ∈Θ} L_θ(τ) and sup_π R*_π = sup_π inf_τ E_{Θ∼π}[L_Θ(τ)].

For any decision rule τ and prior distribution π, we have sup_{θ∈Θ} L_θ(τ) ≥ E_{Θ∼π}[L_Θ(τ)]. Hence, for any prior distribution π,
    R* = inf_τ sup_{θ∈Θ} L_θ(τ) ≥ inf_τ E_{Θ∼π}[L_Θ(τ)] = R*_π.
Therefore, R* ≥ sup_π R*_π.

The proof of the two sufficient conditions for equality requires some knowledge of convex optimization theory. Essentially, these are two sufficient conditions under which "strong duality" holds.

Deterministic vs. Randomized Decision Rules

Randomization does not always help. Below we give two such scenarios.

Proposition 1
1 If the loss function t̂ ↦ l(T, t̂) is convex in the decision, then randomization does not help.
2 In the Bayes paradigm, there always exists a deterministic decision rule that is Bayes optimal.

pf: First, for any randomized decision rule τ(X, U), by Jensen's inequality we have
    L_θ(τ) = E_{U, X∼P_θ}[l(T(θ), τ(X, U))] ≥ E_{X∼P_θ}[l(T(θ), E_U[τ(X, U) | X])] = L_θ(E_U[τ(X, U) | X]).
Second, for any randomized decision rule τ(X, U) and prior distribution Θ ∼ π,
    R_π(τ) = E_{Θ,U,X}[l(T(Θ), τ(X, U))] = E_U[R_π(τ(·, U))] ≥ R_π(τ(·, u)), for some u.
Hence, the average risk of a randomized decision rule is always lower bounded by that of a deterministic one; namely, τ(·, u) with a u attaining R_π(τ(·, u)) no larger than the average.

Parametric vs. Non-Parametric Models

Conventionally, a parametric model refers to the case where the parameter of interest is finite-dimensional, while in a non-parametric model the parameter is infinite-dimensional (e.g., a function).

The parametric framework is useful when one is familiar with certain properties of the data and has a good statistical model for it. The parameter space Θ is then fixed, not scaling with the number of data samples. In contrast, if such knowledge about the underlying data is insufficient, the non-parametric framework may be more suitable. However, this distinction is vague and not always well-defined.

Parametric: hypothesis testing, mean estimation, covariance matrix estimation, etc.
Non-parametric: density estimation, regression, etc.

2 Hypothesis Testing

2 Hypothesis Testing: Basics

Basic Setup

We begin with the simplest setup: binary hypothesis testing.

1 Two hypotheses regarding the observation X, indexed by θ ∈ {0, 1}:
  H_0 : X ∼ P_0 (Null Hypothesis, θ = 0)
  H_1 : X ∼ P_1 (Alternative Hypothesis, θ = 1)

2 Goal: design a decision-making algorithm ϕ : X → {0, 1}, x ↦ θ̂, to choose one of the two hypotheses based on the observed realization of X, so that a certain risk is minimized.

3 The loss function is the 0-1 loss, giving rise to two kinds of error probabilities:
  Probability of false alarm (false positive; type I error): α_ϕ ≜ P_FA(ϕ) ≜ P{H_1 is chosen | H_0}.
  Probability of miss detection (false negative; type II error): β_ϕ ≜ P_MD(ϕ) ≜ P{H_0 is chosen | H_1}.

Deterministic Testing Algorithm and Decision Regions

A test ϕ : X → {0, 1} is equivalently characterized by its acceptance (decision) regions:
    A_θ̂(ϕ) ≜ ϕ^{-1}(θ̂) = {x ∈ X : ϕ(x) = θ̂},  θ̂ = 0, 1,
where A_1(ϕ) is the acceptance region of H_1 and A_0(ϕ) is the acceptance region of H_0, partitioning the observation space X.

The two types of error probability can be equivalently represented as
    α_ϕ = Σ_{x ∈ A_1(ϕ)} P_0(x) = Σ_{x ∈ X} ϕ(x) P_0(x),
    β_ϕ = Σ_{x ∈ A_0(ϕ)} P_1(x) = Σ_{x ∈ X} (1 − ϕ(x)) P_1(x).

When the context is clear, we often drop the dependency on the test ϕ when writing the acceptance regions A_θ̂.

Likelihood Ratio Test

Definition 3 (Likelihood Ratio Test)
A (deterministic) likelihood ratio test (LRT) is a test ϕ_τ, parametrized by a constant τ > 0 (called the threshold), defined as follows:
    ϕ_τ(x) = 1 if P_1(x) > τ P_0(x);  0 if P_1(x) ≤ τ P_0(x).

For x ∈ supp P_0, the likelihood ratio is L(x) ≜ P_1(x)/P_0(x). Hence, an LRT is a thresholding algorithm on the likelihood ratio L(x).

Remark: For computational convenience, one often works with the log-likelihood ratio (LLR) log L(x) = log P_1(x) − log P_0(x).

Trade-Off Between α (P_FA) and β (P_MD)

Theorem 2 (Neyman-Pearson Lemma)
For a likelihood ratio test ϕ_τ and any other deterministic test ϕ,
    α_ϕ ≤ α_{ϕ_τ} ⟹ β_ϕ ≥ β_{ϕ_τ}.

pf: Observe that for all x ∈ X, 0 ≤ (ϕ_τ(x) − ϕ(x))(P_1(x) − τ P_0(x)), because
  if P_1(x) − τ P_0(x) > 0, then ϕ_τ(x) = 1, so ϕ_τ(x) − ϕ(x) ≥ 0;
  if P_1(x) − τ P_0(x) ≤ 0, then ϕ_τ(x) = 0, so ϕ_τ(x) − ϕ(x) ≤ 0.
Summing over all x ∈ X, we get
    0 ≤ (1 − β_{ϕ_τ}) − (1 − β_ϕ) − τ (α_{ϕ_τ} − α_ϕ) = (β_ϕ − β_{ϕ_τ}) + τ (α_ϕ − α_{ϕ_τ}).
Since τ > 0, from the above we conclude that α_ϕ ≤ α_{ϕ_τ} ⟹ β_ϕ ≥ β_{ϕ_τ}.
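The lemma can be verified exhaustively on a small alphabet. The sketch below (distributions and threshold invented here for illustration) builds the deterministic LRT of Definition 3 and checks Theorem 2 against every deterministic test on a 3-point alphabet:

```python
# Exhaustive check of the Neyman-Pearson lemma on toy distributions.
from itertools import product

P0 = {0: 0.5, 1: 0.3, 2: 0.2}
P1 = {0: 0.1, 1: 0.3, 2: 0.6}

def lrt(tau):
    """Deterministic LRT phi_tau: decide 1 iff P1(x) > tau * P0(x)."""
    return {x: 1 if P1[x] > tau * P0[x] else 0 for x in P0}

def alpha(phi):  # P_FA = sum_x phi(x) P0(x)
    return sum(phi[x] * P0[x] for x in P0)

def beta(phi):   # P_MD = sum_x (1 - phi(x)) P1(x)
    return sum((1 - phi[x]) * P1[x] for x in P1)

phi_tau = lrt(tau=1.5)          # threshold chosen arbitrarily for the demo
a_lrt, b_lrt = alpha(phi_tau), beta(phi_tau)

# Any deterministic test with alpha <= alpha_LRT must have beta >= beta_LRT.
for decisions in product([0, 1], repeat=3):
    phi = dict(zip(P0, decisions))
    if alpha(phi) <= a_lrt + 1e-12:
        assert beta(phi) >= b_lrt - 1e-12
print(a_lrt, b_lrt)
```

The exhaustive loop is only feasible because the alphabet is tiny; the lemma itself is what spares us this search in general.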

[Figure: trade-off curves between β (P_MD) and α (P_FA), with both axes ranging over [0, 1].]

Question: What is the optimal trade-off curve? What is the optimal test achieving the curve?

Randomized Testing Algorithm

Randomized tests include deterministic tests as special cases.

Definition 4 (Randomized Test)
A randomized test decides θ̂ = 1 with probability ϕ(x) and θ̂ = 0 with probability 1 − ϕ(x), where ϕ is a mapping ϕ : X → [0, 1].

Note: a randomized test is characterized by ϕ, just as in the deterministic case.

Definition 5 (Randomized LRT)
A randomized likelihood ratio test (LRT) is a test ϕ_{τ,γ}, parametrized by constants τ > 0 and γ ∈ (0, 1), defined as follows:
    ϕ_{τ,γ}(x) = 1 if P_1(x) > τ P_0(x);  γ if P_1(x) = τ P_0(x);  0 if P_1(x) < τ P_0(x).

Randomized LRT Achieves the Optimal Trade-Off

Consider the following optimization problem:

Neyman-Pearson Problem
    minimize over ϕ : X → [0, 1] the type II error probability β_ϕ,
    subject to α_ϕ ≤ α.

Theorem 3 (Neyman-Pearson)
A randomized LRT ϕ_{τ*,γ*} with parameters (τ*, γ*) satisfying α = α_{ϕ_{τ*,γ*}} attains optimality for the Neyman-Pearson Problem.

pf: First, one argues that for any α ∈ (0, 1), one can find (τ*, γ*) such that
    α = α_{ϕ_{τ*,γ*}} = Σ_{x∈X} ϕ_{τ*,γ*}(x) P_0(x) = Σ_{x : L(x)>τ*} P_0(x) + γ* Σ_{x : L(x)=τ*} P_0(x).
For any test ϕ, by an argument similar to that in Theorem 2, we have for all x ∈ X,
    (ϕ_{τ*,γ*}(x) − ϕ(x)) (P_1(x) − τ* P_0(x)) ≥ 0.
Summing over all x ∈ X, we similarly get
    (β_ϕ − β_{ϕ_{τ*,γ*}}) + τ* (α_ϕ − α_{ϕ_{τ*,γ*}}) ≥ 0.
Hence, for any feasible test ϕ with α_ϕ ≤ α = α_{ϕ_{τ*,γ*}}, its type II error probability satisfies β_ϕ ≥ β_{ϕ_{τ*,γ*}}.
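The first step of this proof, finding (τ*, γ*) for a given target α, can be sketched constructively. The example below (distributions and target α invented here) scans the distinct likelihood-ratio values as candidate thresholds and solves for the randomization weight γ*:

```python
# Constructing (tau*, gamma*) so that the randomized LRT meets alpha exactly.
P0 = {0: 0.5, 1: 0.3, 2: 0.2}
P1 = {0: 0.1, 1: 0.3, 2: 0.6}
L = {x: P1[x] / P0[x] for x in P0}        # likelihood ratio L(x)

def np_params(alpha_target):
    """Find (tau, gamma) with P0{L > tau} + gamma * P0{L = tau} = alpha_target."""
    for tau in sorted(set(L.values()), reverse=True):
        p_above = sum(P0[x] for x in P0 if L[x] > tau)
        p_at = sum(P0[x] for x in P0 if L[x] == tau)
        if p_above <= alpha_target <= p_above + p_at:
            return tau, (alpha_target - p_above) / p_at
    raise ValueError("no threshold found")

def alpha_of(tau, gamma):
    """False-alarm probability of the randomized LRT phi_{tau,gamma}."""
    return sum(P0[x] * (1 if L[x] > tau else gamma if L[x] == tau else 0)
               for x in P0)

tau_s, gamma_s = np_params(0.35)
print(tau_s, gamma_s, alpha_of(tau_s, gamma_s))
```

Randomization on the boundary set {x : L(x) = τ*} is exactly what lets the test hit a target α that no deterministic threshold can reach.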

Bayes Risk

Sometimes the prior probabilities of the two hypotheses are known:
    π_θ ≜ P{H_θ is true}, θ = 0, 1, with π_0 + π_1 = 1.
In this case, one can view the index Θ as a (binary) random variable with prior distribution P{Θ = θ} = π_θ, for θ = 0, 1.

With prior probabilities, it then makes sense to talk about the average probability of error of a test ϕ, or more generally, its average risk:
    P_e(ϕ) ≜ π_0 α_ϕ + π_1 β_ϕ = E_{Θ,X}[ 1{Θ ≠ ϕ(X)} ];  R_π(ϕ) ≜ E_{Θ,X}[ l(Θ, ϕ(X)) ].

The Bayes hypothesis testing problem is to test the two hypotheses with knowledge of the prior probabilities so that the average probability of error (or, in general, a risk function) is minimized.

Minimizing Bayes Risk

Consider the following problem of minimizing the Bayes risk.

Bayes Problem
    minimize over ϕ : X → [0, 1] the average risk R_π(ϕ) ≜ E_{Θ,X}[ l(Θ, ϕ(X)) ],
    with (π_0, π_1) and l(θ, θ̂) known.

Theorem 4 (LRT is an Optimal Bayes Test)
Assume l(0,0) < l(0,1) and l(1,1) < l(1,0). A deterministic LRT ϕ_{τ*} with threshold
    τ* = ( (l(0,1) − l(0,0)) π_0 ) / ( (l(1,0) − l(1,1)) π_1 )
attains optimality for the Bayes Problem.

pf:
R_π(ϕ) = Σ_{x∈X} l(0,0) π_0 P_0(x) (1 − ϕ(x)) + Σ_{x∈X} l(0,1) π_0 P_0(x) ϕ(x)
       + Σ_{x∈X} l(1,0) π_1 P_1(x) (1 − ϕ(x)) + Σ_{x∈X} l(1,1) π_1 P_1(x) ϕ(x)
= l(0,0) π_0 + Σ_{x∈X} (l(0,1) − l(0,0)) π_0 P_0(x) ϕ(x) + l(1,0) π_1 + Σ_{x∈X} (l(1,1) − l(1,0)) π_1 P_1(x) ϕ(x)
= Σ_{x∈X} [ (l(0,1) − l(0,0)) π_0 P_0(x) − (l(1,0) − l(1,1)) π_1 P_1(x) ] ϕ(x)    (*)
  + l(0,0) π_0 + l(1,0) π_1.
For each x ∈ X, we shall choose ϕ(x) ∈ [0, 1] such that (*) is minimized. It is then obvious that we should choose
    ϕ(x) = 1 if (l(0,1) − l(0,0)) π_0 P_0(x) − (l(1,0) − l(1,1)) π_1 P_1(x) < 0;  0 otherwise.
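Theorem 4 can be sanity-checked by brute force on a small alphabet. In the sketch below (distributions, priors, and loss values all invented here), the LRT at the threshold τ* from the theorem is compared against every deterministic test:

```python
# Brute-force check that the Bayes-optimal threshold from Theorem 4 works.
from itertools import product

P0 = {0: 0.5, 1: 0.3, 2: 0.2}
P1 = {0: 0.1, 1: 0.3, 2: 0.6}
pi0, pi1 = 0.6, 0.4
loss = {(0, 0): 0.0, (0, 1): 1.0, (1, 0): 2.0, (1, 1): 0.0}  # l(theta, theta_hat)

def bayes_risk(phi):
    """R_pi(phi) = E_{Theta,X}[ l(Theta, phi(X)) ] for a deterministic test."""
    return sum(pi0 * P0[x] * loss[(0, phi[x])] + pi1 * P1[x] * loss[(1, phi[x])]
               for x in P0)

tau_star = (loss[(0, 1)] - loss[(0, 0)]) * pi0 / ((loss[(1, 0)] - loss[(1, 1)]) * pi1)
phi_star = {x: 1 if P1[x] > tau_star * P0[x] else 0 for x in P0}

best = min(bayes_risk(dict(zip(P0, d))) for d in product([0, 1], repeat=3))
print(tau_star, bayes_risk(phi_star), best)  # LRT risk matches the brute-force optimum
```

With the 0-1 loss this threshold reduces to the familiar τ* = π_0/π_1, i.e. the MAP rule.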

Discussions

For binary hypothesis testing problems, the likelihood ratio L(x) ≜ P_1(x)/P_0(x) is a sufficient statistic. Moreover, a likelihood ratio test (LRT) is optimal in both the Neyman-Pearson and the Bayes settings.

Extensions include:
  M-ary hypothesis testing
  Minimax risk optimization (with unknown prior)
  Composite hypothesis testing, etc.

We do not pursue these directions further at the moment. Some of them are left to later lectures, including the asymptotic behavior of hypothesis testing and its close connection with information theory, in particular the information divergence.

3 Estimation

3 Estimation: Mean-Squared Error (MSE) and Cramér-Rao Lower Bound

Estimator, Bias, Mean Squared Error

Definition 6 (Estimator)
Consider data x randomly generated from X ∼ P_θ, where θ ∈ Θ is an unknown parameter. An estimator of θ based on the observed x is a mapping ϕ : X → Θ, x ↦ θ̂. An estimator of a function z(θ) is a mapping ζ : X → z(Θ), x ↦ ẑ.

When Θ ⊆ R or R^n, it is reasonable to consider the following two measures of performance.

Definition 7 (Bias, Mean Squared Error)
For an estimator ϕ(x) of θ,
    Bias_θ(ϕ) ≜ E_{X∼P_θ}[ϕ(X)] − θ,
    MSE_θ(ϕ) ≜ E_{X∼P_θ}[ |ϕ(X) − θ|² ].

Fact 1 (MSE = Variance + (Bias)²)
For an estimator ϕ(x) of θ, MSE_θ(ϕ) = Var_{P_θ}[ϕ(X)] + (Bias_θ(ϕ))².

pf:
MSE_θ(ϕ) ≜ E_{P_θ}[ |ϕ(X) − θ|² ]
        = E_{P_θ}[ (ϕ(X) − E_{P_θ}[ϕ(X)] + E_{P_θ}[ϕ(X)] − θ)² ]
        = Var_{P_θ}[ϕ(X)] + (Bias_θ(ϕ))² + 2 Bias_θ(ϕ) E_{P_θ}[ ϕ(X) − E_{P_θ}[ϕ(X)] ],
and the last term vanishes since E_{P_θ}[ ϕ(X) − E_{P_θ}[ϕ(X)] ] = 0.

Note: the MSE is the risk of an estimator when the loss function is the squared-error loss.

In the following we provide a parameter-dependent lower bound on the MSE of unbiased estimators, namely the Cramér-Rao inequality.
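Fact 1 can be checked exactly on a finite toy model. The sketch below (distribution and estimator invented here for illustration) uses a deliberately biased estimator ϕ(X) = X with X uniform on {θ−1, θ, θ+2}:

```python
# Exact finite check of MSE = Var + Bias^2 for a biased toy estimator.
theta = 1.5
support = [theta - 1.0, theta, theta + 2.0]   # each value with probability 1/3
probs = [1/3, 1/3, 1/3]

mean = sum(p * x for p, x in zip(probs, support))              # E[phi(X)]
bias = mean - theta                                            # Bias = 1/3
var = sum(p * (x - mean) ** 2 for p, x in zip(probs, support)) # Var[phi(X)]
mse = sum(p * (x - theta) ** 2 for p, x in zip(probs, support))

print(bias, var, mse)
assert abs(mse - (var + bias ** 2)) < 1e-12    # Fact 1 holds exactly
```

The decomposition also explains the usual trade-off: shrinking an estimator toward a fixed point adds bias but can reduce variance enough to lower the MSE.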

Lower Bound on MSE of Unbiased Estimators

Below we deal with densities and hence change notation from P_θ to f_θ.

Definition 8 (Fisher Information)
The Fisher information of θ is defined as J(θ) ≜ E_{f_θ}[ (∂/∂θ ln f_θ(X))² ].

Definition 9 (Unbiased Estimator)
An estimator ϕ is unbiased if Bias_θ(ϕ) = 0 for all θ ∈ Θ.

Now we are ready to state the theorem.

Theorem 5 (Cramér-Rao)
For any unbiased estimator ϕ, we have MSE_θ(ϕ) ≥ 1/J(θ), for all θ ∈ Θ.

pf: The proof is essentially an application of the Cauchy-Schwarz inequality.

Begin with the observation that J(θ) = Var_{f_θ}[s_θ(X)], where the score s_θ(X) ≜ ∂/∂θ ln f_θ(X) = (1/f_θ(X)) ∂f_θ(X)/∂θ, because
    E_{f_θ}[s_θ(X)] = ∫ f_θ(x) (1/f_θ(x)) ∂f_θ(x)/∂θ dx = ∫ ∂f_θ(x)/∂θ dx = d/dθ ∫ f_θ(x) dx = 0.
Hence, by the Cauchy-Schwarz inequality, we have
    (Cov_{f_θ}(s_θ(X), ϕ(X)))² ≤ Var_{f_θ}[s_θ(X)] Var_{f_θ}[ϕ(X)].
Since Bias_θ(ϕ) = 0, we have MSE_θ(ϕ) = Var_{f_θ}[ϕ(X)], and hence
    MSE_θ(ϕ) J(θ) ≥ (Cov_{f_θ}(s_θ(X), ϕ(X)))².

It remains to prove that Cov_{f_θ}(s_θ(X), ϕ(X)) = 1:
    Cov_{f_θ}(s_θ(X), ϕ(X)) = E_{f_θ}[s_θ(X) ϕ(X)] − E_{f_θ}[s_θ(X)] E_{f_θ}[ϕ(X)]   (the second term is 0)
    = E_{f_θ}[s_θ(X) ϕ(X)] = E_{f_θ}[ (1/f_θ(X)) ∂f_θ(X)/∂θ · ϕ(X) ]
    = ∫ ∂f_θ(x)/∂θ · ϕ(x) dx = d/dθ ∫ f_θ(x) ϕ(x) dx = d/dθ E_{f_θ}[ϕ(X)] (a)= d/dθ θ = 1,
where (a) holds because ϕ is unbiased. The proof is complete.

Remark: The Cramér-Rao inequality can be extended to vector estimators, biased estimators, estimators of a function of θ, etc.
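A Monte Carlo sanity check of Theorem 5 (all numbers invented here): for X_1, ..., X_n i.i.d. N(θ, σ²), the per-sample Fisher information is J(θ) = 1/σ², so with n samples the bound is σ²/n, and the unbiased sample mean attains it:

```python
# Empirical MSE of the sample mean vs. the Cramer-Rao bound sigma^2/n.
import random

random.seed(0)
theta, sigma, n, trials = 2.0, 2.0, 10, 20000

crlb = sigma ** 2 / n            # 1 / J_n(theta), with J_n(theta) = n / sigma^2

sq_errors = []
for _ in range(trials):
    xs = [random.gauss(theta, sigma) for _ in range(n)]
    xbar = sum(xs) / n           # unbiased estimator of theta
    sq_errors.append((xbar - theta) ** 2)
mse_hat = sum(sq_errors) / trials

print(crlb, mse_hat)             # empirical MSE should be close to the bound
```

For the Gaussian mean the bound is met exactly at every n, not just asymptotically; for most other models only efficiency in the limit (Theorem 7 below) can be hoped for.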

Extensions of Cramér-Rao Inequality

Below we list some extensions and leave the proofs as exercises.

Exercise 1 (Cramér-Rao Inequality for Unbiased Functional Estimators)
Prove that for any unbiased estimator ζ of z(θ),
    MSE_θ(ζ) ≥ (1/J(θ)) (dz(θ)/dθ)².

Exercise 2 (Cramér-Rao Inequality for Biased Estimators)
Prove that for any estimator ϕ of the parameter θ,
    MSE_θ(ϕ) ≥ (1/J(θ)) (1 + d Bias_θ(ϕ)/dθ)² + (Bias_θ(ϕ))².

Exercise 3 (Attainment of Cramér-Rao)
Show that a necessary and sufficient condition for an unbiased estimator ϕ to attain the Cramér-Rao lower bound is that there exists some function g such that, for all x,
    g(θ) (ϕ(x) − θ) = ∂/∂θ ln f_θ(x).

More on Fisher Information

Fisher information plays a key role in the Cramér-Rao lower bound. We make a few further remarks.

1 J(θ) ≜ E_{f_θ}[ (s_θ(X))² ] = Var_{f_θ}[s_θ(X)], where the score of θ, s_θ(X) ≜ ∂/∂θ ln f_θ(X) = (1/f_θ(X)) ∂f_θ(X)/∂θ, is zero-mean.

2 Suppose X_i i.i.d. ∼ f_θ. Then, for the estimation problem with observation X^n, the Fisher information is J_n(θ) = n J(θ), where J(θ) is the Fisher information when the observation is just X ∼ f_θ.

3 For an exponential family {f_θ | θ ∈ Θ}, it can be shown that J(θ) = −E_{f_θ}[ ∂²/∂θ² ln f_θ(X) ], which makes the computation of J(θ) simpler.

3 Estimation: Maximum Likelihood Estimator, Consistency, and Efficiency

Maximum Likelihood Estimator

The Maximum Likelihood Estimator (MLE) is a widely used estimator.

Definition 10 (Maximum Likelihood Estimator)
The Maximum Likelihood Estimator (MLE) for estimating θ from a randomly drawn X ∼ P_θ is
    ϕ_MLE(x) ≜ arg max_{θ∈Θ} P_θ(x).
Here P_θ(x) is called the likelihood function.

Exercise 4 (MLE of Gaussian with Unknown Mean and Variance)
Consider X_i i.i.d. ∼ N(μ, σ²) for i = 1, 2, ..., n, where θ ≜ (μ, σ²) denotes the unknown parameter. Let x̄ ≜ (1/n) Σ_{i=1}^n x_i. Show that
    ϕ_MLE(x^n) = ( x̄, (1/n) Σ_{i=1}^n (x_i − x̄)² ).
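A quick numerical check of Exercise 4 on a tiny made-up data set: the closed-form MLE should not be beaten in log-likelihood by nearby parameter values.

```python
# Closed-form Gaussian MLE and a local-optimality check of the log-likelihood.
import math

data = [1.0, 2.0, 4.0, 5.0]
n = len(data)

mu_hat = sum(data) / n                                   # sample mean
var_hat = sum((x - mu_hat) ** 2 for x in data) / n       # note: 1/n, not 1/(n-1)

def loglik(mu, var):
    """Gaussian log-likelihood of the sample under N(mu, var)."""
    return sum(-0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)
               for x in data)

best = loglik(mu_hat, var_hat)
for dmu, dvar in [(0.1, 0), (-0.1, 0), (0, 0.1), (0, -0.1), (0.05, -0.05)]:
    assert best >= loglik(mu_hat + dmu, var_hat + dvar)
print(mu_hat, var_hat)
```

The 1/n variance estimator is biased downward (the unbiased version divides by n−1), which is a standard example of the MLE trading a little bias for likelihood.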

In the following we consider observing n i.i.d. samples X_i ∼ P_θ, i = 1, ..., n, and give two ways of evaluating the performance of a sequence of estimators {ϕ_n(x^n) | n ∈ N} as n → ∞.

1 Consistency: the estimator output converges to the true parameter as the sample size n → ∞.
2 Efficiency: the estimator output achieves the Cramér-Rao lower bound on the MSE as the sample size n → ∞.

We will see that the MLE is not only consistent but also asymptotically efficient.

Asymptotic Evaluations: Consistency

Definition 11 (Consistency)
A sequence of estimators {ζ_n(x^n) | n ∈ N} is consistent if, for every ε > 0,
    lim_{n→∞} P_{X_i i.i.d. ∼ P_θ}{ |ζ_n(X^n) − z(θ)| < ε } = 1, for all θ ∈ Θ.
In other words, ζ_n(X^n) →p z(θ) for all θ ∈ Θ.

Theorem 6 (MLE is Consistent)
For a family of densities {f_θ | θ ∈ Θ}, under some regularity conditions on f_θ(x), the plug-in estimator z(ϕ_MLE(x^n)) is a consistent estimator of z(θ), where z is a continuous function of θ.
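A simulation sketch of Definition 11, with a setup invented here: for Bernoulli(p) data the MLE of p is the sample mean, and its error shrinks as n grows.

```python
# Watching the Bernoulli MLE (the sample mean) converge to the true p.
import random

random.seed(1)
p = 0.3  # true parameter, chosen arbitrarily for the demo

def mle_bernoulli(n):
    """Sample mean of n Bernoulli(p) draws = MLE of p."""
    return sum(random.random() < p for _ in range(n)) / n

for n in [100, 10_000, 100_000]:
    print(n, abs(mle_bernoulli(n) - p))  # absolute error for each sample size
```

By the law of large numbers the error is of order 1/√n, which is also the scaling that the efficiency statement below makes precise.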

Asymptotic Evaluations: Efficiency

Definition 12 (Efficiency)
A sequence of estimators {ζ_n(x^n) | n ∈ N} is asymptotically efficient if
    √n (ζ_n(X^n) − z(θ)) →d N( 0, (1/J(θ)) (dz(θ)/dθ)² ) as n → ∞.

Theorem 7 (MLE is Asymptotically Efficient)
For a family of densities {f_θ | θ ∈ Θ}, under some regularity conditions on f_θ(x), the plug-in estimator z(ϕ_MLE(x^n)) is an asymptotically efficient estimator of z(θ), where z is a continuous function of θ.