Lecture 8: Information Theory and Statistics


Lecture 8: Information Theory and Statistics
Part II: Hypothesis Testing and Estimation

I-Hsiang Wang, Department of Electrical Engineering, National Taiwan University
ihwang@ntu.edu.tw
December 23, 2015

Outline

1. Hypothesis Testing
2. Performance Evaluation of Estimators
   MLE, Asymptotics, and Bayesian Estimators


Basic Setup

We begin with the simplest setup, binary hypothesis testing:
1. Two hypotheses regarding the observation X, indexed by θ ∈ {0, 1}:
   H_0 : X ∼ P_0 (null hypothesis, θ = 0)
   H_1 : X ∼ P_1 (alternative hypothesis, θ = 1)
2. Goal: design a decision-making algorithm ϕ : 𝒳 → {0, 1}, x ↦ θ̂, that chooses one of the two hypotheses based on the observed realization of X, so that a certain cost (or risk) is minimized.
3. A popular measure of the cost is based on the probabilities of error:
   Probability of false alarm (false positive; type I error): α_ϕ ≜ P_FA(ϕ) ≜ P{H_1 is chosen | H_0}.
   Probability of miss detection (false negative; type II error): β_ϕ ≜ P_MD(ϕ) ≜ P{H_0 is chosen | H_1}.

Deterministic Testing Algorithm: Decision Regions

[Figure: the observation space 𝒳 partitioned into A_1(ϕ), the acceptance region of H_1, and A_0(ϕ), the acceptance region of H_0.]

A test ϕ : 𝒳 → {0, 1} is equivalently characterized by its acceptance (decision) regions:
  A_θ̂(ϕ) ≜ ϕ^{-1}(θ̂) = { x ∈ 𝒳 : ϕ(x) = θ̂ },  θ̂ = 0, 1.
Hence the two types of error probability can be equivalently represented as
  α_ϕ = ∑_{x ∈ A_1(ϕ)} P_0(x) = ∑_{x ∈ 𝒳} ϕ(x) P_0(x),
  β_ϕ = ∑_{x ∈ A_0(ϕ)} P_1(x) = ∑_{x ∈ 𝒳} (1 − ϕ(x)) P_1(x).
When the context is clear, we often drop the dependency on the test ϕ and simply write A_θ̂.

Likelihood Ratio Test

Definition 1 (Likelihood Ratio Test)
A (deterministic) likelihood ratio test (LRT) is a test ϕ_τ, parametrized by a constant τ > 0 (called the threshold), defined as follows:
  ϕ_τ(x) = 1 if P_1(x) > τ P_0(x),
           0 if P_1(x) ≤ τ P_0(x).

For x ∈ supp P_0, the likelihood ratio is L(x) ≜ P_1(x)/P_0(x). Hence the LRT is a thresholding algorithm on the likelihood ratio L(x).

Remark: For computational convenience, one often works with the log-likelihood ratio (LLR) log L(x) = log P_1(x) − log P_0(x).
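As a minimal Python sketch of this rule (the pmfs and the threshold below are made-up examples, not from the lecture):

import math

def lrt(x, P0, P1, tau):
    # decide H1 (return 1) iff P1(x) > tau * P0(x); otherwise decide H0 (return 0)
    return int(P1[x] > tau * P0[x])

def llr(x, P0, P1):
    # log-likelihood ratio log P1(x) - log P0(x), often numerically preferable
    return math.log(P1[x]) - math.log(P0[x])

P0 = [0.5, 0.3, 0.2]   # example pmf under H0 (illustrative)
P1 = [0.2, 0.3, 0.5]   # example pmf under H1 (illustrative)
print(lrt(2, P0, P1, tau=1.0))   # prints 1, since P1(2) = 0.5 > 0.2 = tau * P0(2)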

Trade-Off Between α (P_FA) and β (P_MD)

Theorem 1 (Neyman-Pearson Lemma)
For a likelihood ratio test ϕ_τ and any other deterministic test ϕ,
  α_ϕ ≤ α_{ϕ_τ}  ⟹  β_ϕ ≥ β_{ϕ_τ}.

pf: Observe that for all x ∈ 𝒳,
  0 ≤ (ϕ_τ(x) − ϕ(x)) (P_1(x) − τ P_0(x)),
because
  if P_1(x) − τ P_0(x) > 0  ⟹  ϕ_τ(x) = 1  ⟹  ϕ_τ(x) − ϕ(x) ≥ 0;
  if P_1(x) − τ P_0(x) ≤ 0  ⟹  ϕ_τ(x) = 0  ⟹  ϕ_τ(x) − ϕ(x) ≤ 0.
Summing over all x ∈ 𝒳, we get
  0 ≤ (1 − β_{ϕ_τ}) − (1 − β_ϕ) − τ (α_{ϕ_τ} − α_ϕ) = (β_ϕ − β_{ϕ_τ}) + τ (α_ϕ − α_{ϕ_τ}).
Since τ > 0, from the above we conclude that α_ϕ ≤ α_{ϕ_τ} ⟹ β_ϕ ≥ β_{ϕ_τ}.

[Figure: two sketches of achievable (α, β) pairs, with α (P_FA) on the horizontal axis and β (P_MD) on the vertical axis, both ranging over [0, 1].]

Question: What is the optimal trade-off curve? What is the optimal test achieving the curve?

Randomized Testing Algorithm

Randomized tests include deterministic tests as special cases.

Definition 2 (Randomized Test)
A randomized test decides θ̂ = 1 with probability ϕ(x) and θ̂ = 0 with probability 1 − ϕ(x), where ϕ is a mapping ϕ : 𝒳 → [0, 1].

Note: A randomized test is characterized by ϕ, just as a deterministic test is.

Definition 3 (Randomized LRT)
A randomized likelihood ratio test (LRT) is a test ϕ_{τ,γ}, parametrized by constants τ > 0 and γ ∈ (0, 1), defined as follows:
  ϕ_{τ,γ}(x) = 1 if P_1(x) > τ P_0(x),
               γ if P_1(x) = τ P_0(x),
               0 if P_1(x) < τ P_0(x).

Randomized LRT Achieves the Optimal Trade-Off

Consider the following optimization problem.

Neyman-Pearson Problem
  minimize over ϕ : 𝒳 → [0, 1]:  β_ϕ
  subject to  α_ϕ ≤ α

Theorem 2 (Neyman-Pearson)
A randomized LRT ϕ_{τ*,γ*} with parameters (τ*, γ*) satisfying α = α_{ϕ_{τ*,γ*}} attains optimality for the Neyman-Pearson Problem.

pf: First argue that for any α ∈ (0, 1), one can find (τ*, γ*) such that
  α = α_{ϕ_{τ*,γ*}} = ∑_{x ∈ 𝒳} ϕ_{τ*,γ*}(x) P_0(x) = ∑_{x : L(x) > τ*} P_0(x) + γ* ∑_{x : L(x) = τ*} P_0(x).
For any test ϕ, by the same argument as in Theorem 1, we have, for all x ∈ 𝒳,
  (ϕ_{τ*,γ*}(x) − ϕ(x)) (P_1(x) − τ* P_0(x)) ≥ 0.
Summing over all x ∈ 𝒳, we similarly get
  (β_ϕ − β_{ϕ_{τ*,γ*}}) + τ* (α_ϕ − α_{ϕ_{τ*,γ*}}) ≥ 0.
Hence, for any feasible test ϕ with α_ϕ ≤ α = α_{ϕ_{τ*,γ*}}, its probability of type II error satisfies β_ϕ ≥ β_{ϕ_{τ*,γ*}}.
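A small Python sketch of the first step, choosing (τ, γ) so that the randomized LRT meets a target false-alarm level α exactly; the pmfs and the target level are illustrative, and we assume P_0(x) > 0 for all x:

import numpy as np

def np_threshold(P0, P1, alpha):
    # Find (tau, gamma) with P0{L > tau} + gamma * P0{L = tau} = alpha,
    # where L(x) = P1(x)/P0(x).
    L = P1 / P0
    taus = np.unique(L)[::-1]        # candidate thresholds, largest first
    above = 0.0                      # running value of P0{L > tau}
    for tau in taus:
        on = P0[L == tau].sum()      # P0{L = tau}
        if above + on >= alpha:
            return tau, (alpha - above) / on
        above += on
    return taus[-1], 1.0             # alpha >= 1: always decide H1

P0 = np.array([0.5, 0.3, 0.2])
P1 = np.array([0.2, 0.3, 0.5])
tau, gamma = np_threshold(P0, P1, alpha=0.1)
print(tau, gamma)                    # (2.5, 0.5) here: P0{L > 2.5} + 0.5 * P0{L = 2.5} = 0.1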

Bayesian Setup

Sometimes the prior probabilities of the two hypotheses are known:
  π_θ ≜ P{H_θ is true}, θ = 0, 1,  with π_0 + π_1 = 1.
In this sense, one can view the index Θ as a (binary) random variable with prior distribution P{Θ = θ} = π_θ, θ = 0, 1.
With prior probabilities, it then makes sense to talk about the average probability of error of a test ϕ, or more generally its average cost (risk):
  P_e(ϕ) ≜ π_0 α_ϕ + π_1 β_ϕ = E_{Θ,X}[ 1{Θ ≠ Θ̂} ],
  R(ϕ) ≜ E_{Θ,X}[ r_{Θ,Θ̂} ].
The Bayesian hypothesis testing problem is to test the two hypotheses, with knowledge of the prior probabilities, so that the average probability of error (or, in general, a risk function) is minimized.

Minimizing Bayes Risk

Consider the following problem of minimizing the Bayes risk.

Bayesian Problem
  minimize over ϕ : 𝒳 → [0, 1]:  R(ϕ) ≜ E_{Θ,X}[ r_{Θ,Θ̂} ]
  with (π_0, π_1) and the costs r_{θ,θ̂} known

Theorem 3 (LRT is an Optimal Bayesian Test)
Assume r_{0,0} < r_{0,1} and r_{1,1} < r_{1,0}. A deterministic LRT ϕ_τ with threshold
  τ = ( (r_{0,1} − r_{0,0}) π_0 ) / ( (r_{1,0} − r_{1,1}) π_1 )
attains optimality for the Bayesian Problem.
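A minimal Python sketch of Theorem 3's rule; the priors, costs, and pmfs are made-up illustrations, and r[i][j] stands for r_{i,j}, the cost of deciding H_j when H_i holds:

def bayes_lrt_threshold(pi0, pi1, r):
    # tau = (r01 - r00) * pi0 / ((r10 - r11) * pi1), assuming r00 < r01 and r11 < r10
    return (r[0][1] - r[0][0]) * pi0 / ((r[1][0] - r[1][1]) * pi1)

def bayes_decide(x, P0, P1, pi0, pi1, r):
    tau = bayes_lrt_threshold(pi0, pi1, r)
    return int(P1[x] > tau * P0[x])   # decide H1 iff P1(x) > tau * P0(x)

# With 0-1 costs (r00 = r11 = 0, r01 = r10 = 1) the threshold is pi0/pi1,
# i.e. the MAP rule: decide H1 iff pi1 * P1(x) > pi0 * P0(x).
P0, P1 = [0.5, 0.3, 0.2], [0.2, 0.3, 0.5]
print(bayes_decide(2, P0, P1, pi0=0.6, pi1=0.4, r=[[0, 1], [1, 0]]))   # prints 1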

pf:
  R(ϕ) = r_{0,0} π_0 ∑_{x∈𝒳} P_0(x)(1 − ϕ(x)) + r_{0,1} π_0 ∑_{x∈𝒳} P_0(x) ϕ(x)
       + r_{1,0} π_1 ∑_{x∈𝒳} P_1(x)(1 − ϕ(x)) + r_{1,1} π_1 ∑_{x∈𝒳} P_1(x) ϕ(x)
     = r_{0,0} π_0 + (r_{0,1} − r_{0,0}) π_0 ∑_{x∈𝒳} P_0(x) ϕ(x)
       + r_{1,0} π_1 + (r_{1,1} − r_{1,0}) π_1 ∑_{x∈𝒳} P_1(x) ϕ(x)
     = ∑_{x∈𝒳} [ (r_{0,1} − r_{0,0}) π_0 P_0(x) + (r_{1,1} − r_{1,0}) π_1 P_1(x) ] ϕ(x) + r_{0,0} π_0 + r_{1,0} π_1,
where the bracketed term is (∗). For each x ∈ 𝒳, we shall choose ϕ(x) ∈ [0, 1] such that (∗) is minimized. It is then obvious that we should choose
  ϕ(x) = 1 if (r_{0,1} − r_{0,0}) π_0 P_0(x) + (r_{1,1} − r_{1,0}) π_1 P_1(x) < 0,
         0 otherwise,
which is exactly the deterministic LRT ϕ_τ with the claimed threshold.

Discussions

For binary hypothesis testing problems, the likelihood ratio L(x) ≜ P_1(x)/P_0(x) turns out to be a sufficient statistic. Moreover, a likelihood ratio test (LRT) is optimal in both the Bayesian and the Neyman-Pearson settings.

Extensions include:
  M-ary hypothesis testing
  Minimax risk optimization (with unknown prior)
  Composite hypothesis testing, etc.
Here we do not pursue these directions further. Instead, we explore the asymptotic behavior of hypothesis testing and its connection with information-theoretic tools.


i.i.d. Observations

So far we have focused on the general setting where the observation space 𝒳 can be an arbitrary alphabet. In the following, we consider the product space 𝒳^n and a length-n observation sequence X^n drawn i.i.d. from one of the two distributions; the two hypotheses are
  H_0 : X_i i.i.d. ∼ P_0, i = 1, 2, ..., n
  H_1 : X_i i.i.d. ∼ P_1, i = 1, 2, ..., n
The corresponding error probabilities are denoted by
  α^{(n)} ≜ P_FA^{(n)} ≜ P{H_1 is chosen | H_0},
  β^{(n)} ≜ P_MD^{(n)} ≜ P{H_0 is chosen | H_1}.
Throughout the lecture we assume 𝒳 = {a_1, a_2, ..., a_d} is a finite set.

LRT under i.i.d. Observations (1)

With i.i.d. observations, the likelihood ratio of a sequence x^n ∈ 𝒳^n is
  L(x^n) = ∏_{i=1}^n P_1(x_i)/P_0(x_i) = ∏_{a∈𝒳} ( P_1(a)/P_0(a) )^{N(a|x^n)} = ∏_{a∈𝒳} ( P_1(a)/P_0(a) )^{n π(a|x^n)},
where N(a|x^n) ≜ the number of occurrences of a in x^n, and π(a|x^n) ≜ (1/n) N(a|x^n) is the relative frequency of symbol a in the sequence x^n.

Note: From the above manipulation, we see that the collection of relative frequencies (as an |𝒳|-dimensional probability vector)
  Π_{x^n} ≜ [ π(a_1|x^n) π(a_2|x^n) ... π(a_d|x^n) ]^T,
called the type of the sequence x^n, is a sufficient statistic for all the previously mentioned hypothesis testing problems.

LRT under i.i.d. Observations (2)

Let us further manipulate the LRT by taking the log-likelihood ratio:
  L(x^n) ≷ τ_n
  ⟺ log L(x^n) ≷ log τ_n
  ⟺ ∑_{a∈𝒳} n π(a|x^n) log( P_1(a)/P_0(a) ) ≷ log τ_n
  ⟺ ∑_{a∈𝒳} π(a|x^n) log( π(a|x^n)/P_0(a) ) − ∑_{a∈𝒳} π(a|x^n) log( π(a|x^n)/P_1(a) ) ≷ (1/n) log τ_n
  ⟺ D(Π_{x^n} ‖ P_0) − D(Π_{x^n} ‖ P_1) ≷ (1/n) log τ_n
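A Python sketch of this type-based form of the LRT (the pmfs, sample size, and threshold are made-up; we assume both pmfs have full support so the KL terms are finite):

import numpy as np

def kl(q, p):
    # D(q || p) in nats over a finite alphabet; terms with q(a) = 0 contribute 0
    mask = q > 0
    return float(np.sum(q[mask] * np.log(q[mask] / p[mask])))

def type_of(xn, d):
    # empirical distribution (type) Pi_xn of a sequence over alphabet {0, ..., d-1}
    return np.bincount(xn, minlength=d) / len(xn)

def lrt_via_type(xn, P0, P1, tau):
    q = type_of(xn, len(P0))
    stat = kl(q, P0) - kl(q, P1)              # D(Pi||P0) - D(Pi||P1)
    return int(stat > np.log(tau) / len(xn))  # same decision as P1^n(x^n) > tau * P0^n(x^n)

rng = np.random.default_rng(0)
P0, P1 = np.array([0.5, 0.3, 0.2]), np.array([0.2, 0.3, 0.5])
xn = rng.choice(3, size=1000, p=P1)           # hypothetical data drawn under H1
print(lrt_via_type(xn, P0, P1, tau=1.0))      # typically prints 1 (decide H1)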

Hypothesis Testing on the Probability Simplex

[Figure: the acceptance regions A_0, A_1 ⊆ 𝒳^n in the observation space correspond, through the type map x^n ↦ Π_{x^n}, to decision regions F_0^{(n)}, F_1^{(n)} on the probability simplex P(𝒳) containing P_0 and P_1 respectively: x^n ∈ A_i ⟺ Π_{x^n} ∈ F_i^{(n)} ⟺ decide H_i.]

Probability Simplex and Sanov's Theorem

[Figure: the simplex P(𝒳) with P_1 ∈ F_1^{(n)}, P_0 ∈ F_0^{(n)}, and a distribution P* on the boundary between the two decision regions.]

By Sanov's Theorem, we know that
  α^{(n)} = P_0^n( F_1^{(n)} ) ≈ 2^{−n D(P* ‖ P_0)},  β^{(n)} = P_1^n( F_0^{(n)} ) ≈ 2^{−n D(P* ‖ P_1)},
where P* denotes the KL-closest distribution of the corresponding decision region (the point marked in the figure).

Asymptotic Behaviors

1. Neyman-Pearson: β(n, ε) ≜ min_{ϕ_n : 𝒳^n → [0,1]} β^{(n)}_{ϕ_n}, subject to α^{(n)}_{ϕ_n} ≤ ε.
   It turns out that for all ε ∈ (0, 1),
     lim_{n→∞} { −(1/n) log β(n, ε) } = D(P_0 ‖ P_1).

2. Bayesian: P_e^{(n)} ≜ min_{ϕ_n : 𝒳^n → [0,1]} { π_0 α^{(n)}_{ϕ_n} + π_1 β^{(n)}_{ϕ_n} }.
   It turns out that
     lim_{n→∞} { −(1/n) log P_e^{(n)} } = D(P_{λ*} ‖ P_0) = D(P_{λ*} ‖ P_1),
   where P_λ(a) ≜ (P_0(a))^{1−λ} (P_1(a))^λ / ∑_{x∈𝒳} (P_0(x))^{1−λ} (P_1(x))^λ for a ∈ 𝒳, and λ* ∈ (0, 1) is such that D(P_{λ*} ‖ P_0) = D(P_{λ*} ‖ P_1).
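A Python sketch that computes both exponents for a pair of made-up pmfs; it finds λ* by bisection, using the standard fact that D(P_λ‖P_0) − D(P_λ‖P_1) increases in λ from −D(P_0‖P_1) to D(P_1‖P_0):

import numpy as np

def kl(p, q):
    return float(np.sum(p * np.log2(p / q)))      # in bits; assumes full support

def tilted(P0, P1, lam):
    w = P0**(1 - lam) * P1**lam
    return w / w.sum()                            # the tilted distribution P_lambda

def chernoff_exponent(P0, P1, tol=1e-12):
    lo, hi = 0.0, 1.0                             # bisection on D(P_lam||P0) - D(P_lam||P1)
    while hi - lo > tol:
        lam = 0.5 * (lo + hi)
        P = tilted(P0, P1, lam)
        if kl(P, P0) < kl(P, P1):
            lo = lam                              # difference still negative: move right
        else:
            hi = lam
    return kl(tilted(P0, P1, 0.5 * (lo + hi)), P0)

P0, P1 = np.array([0.5, 0.3, 0.2]), np.array([0.2, 0.3, 0.5])
print(kl(P0, P1))                 # Chernoff-Stein exponent D(P0||P1), ~0.40 bits here
print(chernoff_exponent(P0, P1))  # Chernoff information, ~0.10 bits here (always <= D(P0||P1))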

Error Exponent in the Neyman-Pearson Setup

Theorem 4 (Chernoff-Stein)
For all ε ∈ (0, 1),  lim_{n→∞} { −(1/n) log β(n, ε) } = D(P_0 ‖ P_1).

pf: We prove the achievability and the converse parts separately.
  Achievability: construct a sequence of tests {ϕ_n} with α^{(n)}_{ϕ_n} ≤ ε for n sufficiently large, such that lim inf_{n→∞} { −(1/n) log β^{(n)}_{ϕ_n} } ≥ D(P_0 ‖ P_1).
  Converse: for any sequence of tests {ϕ_n} with α^{(n)}_{ϕ_n} ≤ ε for n sufficiently large, show that lim sup_{n→∞} { −(1/n) log β^{(n)}_{ϕ_n} } ≤ D(P_0 ‖ P_1).
We use the method of types to prove both the achievability and the converse. Alternatively, Chapter 11.8 of Cover & Thomas [1] uses a kind of weak typicality to prove the theorem.

Achievability: Consider the deterministic test
  ϕ_n(x^n) = 1{ D(Π_{x^n} ‖ P_0) ≥ δ_n },  δ_n ≜ (1/n) ( log(1/ε) + d log(n + 1) ).
In other words, it decides H_1 if D(Π_{x^n} ‖ P_0) ≥ δ_n, and H_0 otherwise.

Check the probability of type I error: by Prop. 4 in Part I, we have
  α^{(n)}_{ϕ_n} = P_{X_i i.i.d. ∼ P_0}{ D(Π_{X^n} ‖ P_0) ≥ δ_n } ≤ 2^{−n ( δ_n − d log(n+1)/n )} (a)= ε,
where (a) is due to our construction.

Analyze the probability of type II error ((b) is due to Prop. 3 in Part I):
  β^{(n)}_{ϕ_n} = ∑_{Q ∈ P_n : D(Q‖P_0) < δ_n} P_1^n( T_n(Q) ) ≤(b) ∑_{Q ∈ P_n : D(Q‖P_0) < δ_n} 2^{−n D(Q‖P_1)} ≤ |P_n| 2^{−n D_n},
where D_n ≜ min_{Q ∈ P_n : D(Q‖P_0) < δ_n} D(Q ‖ P_1).
Since lim_{n→∞} δ_n = 0, we have lim_{n→∞} D_n = D(P_0 ‖ P_1), and the achievability part is done.

Converse: We prove the converse for deterministic tests; the extension to randomized tests is left as an exercise (HW6).

Let A_i^{(n)} ≜ {x^n : ϕ_n(x^n) = i}, the acceptance region of H_i, for i = 0, 1.
Let B^{(n)} ≜ {x^n : D(Π_{x^n} ‖ P_0) < ε_n}, with ε_n ≜ 2d log(n+1)/n. By Prop. 4, we have
  P_0^n( B^{(n)} ) = 1 − P_{X_i i.i.d. ∼ P_0}{ D(Π_{X^n} ‖ P_0) ≥ ε_n } ≥ 1 − 2^{−n ( ε_n − d log(n+1)/n )} = 1 − 2^{−d log(n+1)} → 1 as n → ∞.
Hence, for sufficiently large n, both P_0^n( B^{(n)} ) and P_0^n( A_0^{(n)} ) are > 1 − ε, and
  P_0^n( B^{(n)} ∩ A_0^{(n)} ) = P_0^n( B^{(n)} ) + P_0^n( A_0^{(n)} ) − P_0^n( B^{(n)} ∪ A_0^{(n)} ) > 2(1 − ε) − 1 = 1 − 2ε.
Note that B^{(n)} = ∪_{Q ∈ P_n : D(Q‖P_0) < ε_n} T_n(Q). Hence there exists Q_n ∈ P_n with D(Q_n ‖ P_0) < ε_n such that
  P_0^n( T_n(Q_n) ∩ A_0^{(n)} ) > (1 − 2ε) P_0^n( T_n(Q_n) ).   (1)

Key Observation: the probability of each sequence in the same type class is the same under any product distribution. Hence (1) is equivalent to
  | T_n(Q_n) ∩ A_0^{(n)} | > (1 − 2ε) | T_n(Q_n) |,
which implies P_1^n( T_n(Q_n) ∩ A_0^{(n)} ) > (1 − 2ε) P_1^n( T_n(Q_n) ).
Hence, for sufficiently large n, there exists Q_n ∈ P_n with D(Q_n ‖ P_0) < ε_n such that
  P_1^n( A_0^{(n)} ) ≥ P_1^n( T_n(Q_n) ∩ A_0^{(n)} ) > (1 − 2ε) P_1^n( T_n(Q_n) ) ≥(c) (1 − 2ε) |P_n|^{−1} 2^{−n D(Q_n ‖ P_1)},
where (c) is due to Prop. 3.
Finally, since lim_{n→∞} ε_n = 0, we have lim_{n→∞} D(Q_n ‖ P_1) = D(P_0 ‖ P_1), and the converse proof is done.

Error Exponent in the Bayesian Setup

Theorem 5 (Chernoff)
  lim_{n→∞} { −(1/n) log P_e^{(n)} } = D(P_{λ*} ‖ P_0) = D(P_{λ*} ‖ P_1),
where
  max_{λ∈[0,1]} ( −log ∑_{x∈𝒳} (P_0(x))^{1−λ} (P_1(x))^λ )   [the Chernoff information CI(P_0, P_1)]   = D(P_{λ*} ‖ P_0),
  P_λ(a) ≜ (P_0(a))^{1−λ} (P_1(a))^λ / ∑_{x∈𝒳} (P_0(x))^{1−λ} (P_1(x))^λ,  a ∈ 𝒳,
and λ* ∈ (0, 1) is such that D(P_{λ*} ‖ P_0) = D(P_{λ*} ‖ P_1).

Note: The optimal Bayesian test (for minimizing P_e) is the maximum a posteriori (MAP) test:
  ϕ_MAP(x^n) = 1{ π_1 P_1^n(x^n) ≥ π_0 P_0^n(x^n) }.

[Figure: the curves D(P_λ‖P_0) and D(P_λ‖P_1) plotted versus λ; they intersect at only one point, and it lies in [0, 1].]

[Figure: the curves D(P_λ‖P_0), D(P_λ‖P_1), and min{D(P_λ‖P_0), D(P_λ‖P_1)} plotted versus λ ∈ [0, 1].]

pf: The proof is based on applying large deviations to the analysis of the optimal (MAP) test
  ϕ_MAP(x^n) = 1{ π_1 P_1^n(x^n) ≥ π_0 P_0^n(x^n) }.

Analysis of the error probabilities of the MAP test:
  α^{(n)} = P_0^n( F_1^{(n)} ),  β^{(n)} = P_1^n( F_0^{(n)} ),
where
  F_1^{(n)} ≜ { Q ∈ P(𝒳) : D(Q‖P_0) − D(Q‖P_1) ≥ (1/n) log(π_0/π_1) },
  F_0^{(n)} ≜ { Q ∈ P(𝒳) : D(Q‖P_0) − D(Q‖P_1) < (1/n) log(π_0/π_1) }.

By Sanov's Theorem, we have
  lim_{n→∞} { −(1/n) log α^{(n)} } = min_{Q ∈ F_1} D(Q‖P_0),
  lim_{n→∞} { −(1/n) log β^{(n)} } = min_{Q ∈ F_0} D(Q‖P_1),
where F_1 ≜ { Q ∈ P(𝒳) : D(Q‖P_0) − D(Q‖P_1) ≥ 0 } and F_0 ≜ { Q ∈ P(𝒳) : D(Q‖P_0) − D(Q‖P_1) ≤ 0 }.

Exponents: Characterizing the two exponents is equivalent to solving the two (convex) optimization problems

  min_{Q ∈ F_1} D(Q‖P_0):
    minimize over (Q_1, ..., Q_d):  ∑_{l=1}^d Q_l log( Q_l / P_0(a_l) )
    subject to  ∑_{l=1}^d Q_l log( P_1(a_l)/P_0(a_l) ) ≥ 0,  Q_l ≥ 0 for l = 1, ..., d,  ∑_{l=1}^d Q_l = 1

  min_{Q ∈ F_0} D(Q‖P_1):
    minimize over (Q_1, ..., Q_d):  ∑_{l=1}^d Q_l log( Q_l / P_1(a_l) )
    subject to  ∑_{l=1}^d Q_l log( P_1(a_l)/P_0(a_l) ) ≤ 0,  Q_l ≥ 0 for l = 1, ..., d,  ∑_{l=1}^d Q_l = 1

It turns out that both problems have a common optimal solution
  P_{λ*}(a) = (P_0(a))^{1−λ*} (P_1(a))^{λ*} / ∑_{x∈𝒳} (P_0(x))^{1−λ*} (P_1(x))^{λ*},  a ∈ 𝒳,
with λ* ∈ [0, 1] such that D(P_{λ*}‖P_0) = D(P_{λ*}‖P_1).

Hence both types of error probability have the same exponent, and so does the average error probability. This completes the proof of the first part.

Chernoff Information: To show that
  CI(P_0, P_1) ≜ max_{λ∈[0,1]} ( −log ∑_{x∈𝒳} (P_0(x))^{1−λ} (P_1(x))^λ ) = D(P_{λ*} ‖ P_0),
simply observe that the condition D(P_λ ‖ P_0) = D(P_λ ‖ P_1) is equivalent to
  ∑_{a∈𝒳} (P_0(a))^{1−λ} (P_1(a))^λ ( log P_0(a) − log P_1(a) ) = 0,
which is exactly the stationarity condition for the maximization over λ, and that at such a λ
  D(P_λ ‖ P_0) = D(P_λ ‖ P_1) = −log ∑_{x∈𝒳} (P_0(x))^{1−λ} (P_1(x))^λ.
Proof complete.


Parametric Estimation

In this lecture we focus on parametric estimation, where the data samples are assumed to be drawn from a family of distributions on an alphabet 𝒳,
  { P_θ ∈ P(𝒳) : θ ∈ Θ },
where θ is called the parameter and Θ is the parameter set. (In this lecture we mainly focus on 𝒳 = ℝ or ℝ^n, where the P_θ are densities.)
Such a parametric framework is useful when one is familiar with certain properties of the data and has a good statistical model for them. The parameter set Θ is hence fixed and does not scale with the number of data samples. In contrast, if such knowledge about the underlying data is insufficient, the non-parametric framework may be more suitable.

Outline

Parametric estimation is itself a vast area. In this lecture we go through some basic results and then draw some connections between estimation theory and information theory. Topics to be discussed in this lecture:
1. Performance evaluation of estimators
   Bias, mean squared error, and the Cramér-Rao lower bound
   Risk function optimization
2. Maximum likelihood estimator (MLE)
3. Asymptotic evaluation
   Consistency
   Efficiency
4. Bayesian estimators


Estimator, Bias, Mean Squared Error

Definition 4 (Estimator)
Consider X ∼ P_θ randomly generating the observed sample x, where θ is an unknown parameter lying in the parameter set Θ. An estimator of θ based on the observed x is a mapping ϕ : 𝒳 → Θ, x ↦ θ̂. An estimator of a function z(θ) is a mapping ζ : 𝒳 → z(Θ), x ↦ ẑ.

For the case 𝒳 = ℝ or ℝ^n, it is reasonable to consider the following two measures of the performance of an estimator.

Definition 5 (Bias, Mean Squared Error)
For an estimator ϕ(x) of θ,
  Bias_θ(ϕ) ≜ E_{P_θ}[ϕ(X)] − θ,
  MSE_θ(ϕ) ≜ E_{P_θ}[ |ϕ(X) − θ|² ].

Risk Function

Fact 1 (MSE = Variance + (Bias)²)
For an estimator ϕ(x) of θ,  MSE_θ(ϕ) = Var_{P_θ}[ϕ(X)] + (Bias_θ(ϕ))².

pf:
  MSE_θ(ϕ) ≜ E_{P_θ}[ |ϕ(X) − θ|² ]
           = E_{P_θ}[ ( ϕ(X) − E_{P_θ}[ϕ(X)] + E_{P_θ}[ϕ(X)] − θ )² ]
           = Var_{P_θ}[ϕ(X)] + (Bias_θ(ϕ))² + 2 Bias_θ(ϕ) E_{P_θ}[ ϕ(X) − E_{P_θ}[ϕ(X)] ],
and the last expectation is zero.

MSE is a special case of the risk function of an estimator.

Definition 6 (Risk Function)
Let r : Θ × Θ → ℝ denote the risk (cost) of estimating θ with θ̂. The risk function of an estimator ϕ is defined as R_θ(ϕ) ≜ E_{P_θ}[ r(θ, ϕ(X)) ].
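A quick Monte Carlo check of Fact 1 in Python, using a deliberately biased estimator (the 1/n sample variance of a Gaussian); all numbers are made up for illustration:

import numpy as np

rng = np.random.default_rng(0)
theta, n, trials = 2.0, 20, 200_000           # true variance, sample size, MC runs

x = rng.normal(0.0, np.sqrt(theta), size=(trials, n))
est = ((x - x.mean(axis=1, keepdims=True))**2).mean(axis=1)   # biased variance estimator

mse  = np.mean((est - theta)**2)
bias = np.mean(est) - theta                   # close to -theta/n = -0.1
var  = np.var(est)
print(mse, var + bias**2)                     # the two agree up to Monte Carlo error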

With risk functions as the performance measures of estimators, it is then possible to ask the following questions: What is the best estimator that minimizes the risk? What is the minimum risk?

But these questions are not yet explicit: optimal in what sense?
  Minimax: the worst-case risk (over Θ) is minimized.
  Bayesian: with a prior distribution {π(θ) : θ ∈ Θ}, the expected risk (Bayes risk) is minimized.
In the following we do not pursue these directions further (a detailed treatment can be found in decision theory). Instead, we provide a parameter-dependent lower bound on the MSE of unbiased estimators, namely the Cramér-Rao inequality. Later, we shall also briefly introduce results in the Bayesian setup.

Lower Bound on the MSE of Unbiased Estimators

Below we deal with densities, and hence change notation from P_θ to f_θ.

Definition 7 (Fisher Information)
The Fisher information of θ is defined as J(θ) ≜ E_{f_θ}[ ( ∂/∂θ ln f_θ(X) )² ].

Definition 8 (Unbiased Estimator)
An estimator ϕ is unbiased if Bias_θ(ϕ) = 0 for all θ ∈ Θ.

Now we are ready to state the theorem.

Theorem 6 (Cramér-Rao)
For any unbiased estimator ϕ, we have MSE_θ(ϕ) ≥ 1/J(θ) for all θ ∈ Θ.
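A small numerical illustration in Python: for X^n i.i.d. N(θ, σ²) with known σ², the Fisher information of a single sample is J(θ) = 1/σ², so the n-sample bound is σ²/n, and the (unbiased) sample mean attains it; the specific numbers below are arbitrary:

import numpy as np

rng = np.random.default_rng(1)
theta, sigma, n, trials = 1.5, 2.0, 50, 100_000

est = rng.normal(theta, sigma, size=(trials, n)).mean(axis=1)   # sample means
print(np.mean((est - theta)**2))   # empirical MSE of the sample mean
print(sigma**2 / n)                # Cramér-Rao lower bound 1/J_n(theta) = sigma^2/n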

Proof of the Cramér-Rao Inequality

pf: The proof is essentially an application of the Cauchy-Schwarz inequality. Let us begin with the observation that J(θ) = Var_{f_θ}[ s_θ(X) ], where the score
  s_θ(X) ≜ ∂/∂θ ln f_θ(X) = (1/f_θ(X)) ∂/∂θ f_θ(X)
has zero mean, because
  E_{f_θ}[ s_θ(X) ] = ∫ f_θ(x) (1/f_θ(x)) ∂/∂θ f_θ(x) dx = ∫ ∂/∂θ f_θ(x) dx = d/dθ ∫ f_θ(x) dx = 0.
Hence, by the Cauchy-Schwarz inequality, we have
  ( Cov_{f_θ}( s_θ(X), ϕ(X) ) )² ≤ Var_{f_θ}[ s_θ(X) ] · Var_{f_θ}[ ϕ(X) ].
Since Bias_θ(ϕ) = 0, we have MSE_θ(ϕ) = Var_{f_θ}[ϕ(X)], and hence
  MSE_θ(ϕ) · J(θ) ≥ ( Cov_{f_θ}( s_θ(X), ϕ(X) ) )².

It remains to prove that Cov_{f_θ}( s_θ(X), ϕ(X) ) = 1:
  Cov_{f_θ}( s_θ(X), ϕ(X) ) = E_{f_θ}[ s_θ(X) ϕ(X) ] − E_{f_θ}[ s_θ(X) ] E_{f_θ}[ ϕ(X) ]   (the second term is 0)
    = E_{f_θ}[ s_θ(X) ϕ(X) ]
    = E_{f_θ}[ (1/f_θ(X)) ( ∂/∂θ f_θ(X) ) ϕ(X) ]
    = ∫ ( ∂/∂θ f_θ(x) ) ϕ(x) dx = d/dθ ∫ f_θ(x) ϕ(x) dx = d/dθ E_{f_θ}[ ϕ(X) ] (a)= d/dθ θ = 1,
where (a) holds because ϕ is unbiased. The proof is complete.

Remark: The Cramér-Rao inequality can be extended to vector estimators, biased estimators, estimators of a function of θ, etc.

Extensions of the Cramér-Rao Inequality

Below we list some extensions and leave the proofs as exercises.

Exercise 1 (Cramér-Rao Inequality for Unbiased Functional Estimators)
Prove that for any unbiased estimator ζ of z(θ),  MSE_θ(ζ) ≥ (1/J(θ)) ( d/dθ z(θ) )².

Exercise 2 (Cramér-Rao Inequality for Biased Estimators)
Prove that for any estimator ϕ of the parameter θ,
  MSE_θ(ϕ) ≥ (1/J(θ)) ( 1 + d/dθ Bias_θ(ϕ) )² + (Bias_θ(ϕ))².

Exercise 3 (Attainment of the Cramér-Rao Bound)
Show that a necessary and sufficient condition for an unbiased estimator ϕ to attain the Cramér-Rao lower bound is that there exists some function g such that, for all x,
  g(θ) ( ϕ(x) − θ ) = ∂/∂θ ln f_θ(x).

More on Fisher Information

Fisher information plays a key role in the Cramér-Rao lower bound. We make a few further remarks.
1. J(θ) ≜ E_{f_θ}[ (s_θ(X))² ] = Var_{f_θ}[ s_θ(X) ], where the score of θ, s_θ(X) ≜ ∂/∂θ ln f_θ(X) = (1/f_θ(X)) ∂/∂θ f_θ(X), is zero-mean.
2. Suppose X_i i.i.d. ∼ f_θ. Then, for the estimation problem with observation X^n, the Fisher information is J_n(θ) = n J(θ), where J(θ) is the Fisher information when the observation is a single X ∼ f_θ.
3. For an exponential family {f_θ : θ ∈ Θ}, it can be shown that
  J(θ) = −E_{f_θ}[ ∂²/∂θ² ln f_θ(X) ],
which makes the computation of J(θ) simpler.
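A quick numerical check of remark 3 in Python for a Bernoulli(θ) family (an exponential family), whose Fisher information has the closed form J(θ) = 1/(θ(1−θ)); by remark 2, n i.i.d. samples would carry n/(θ(1−θ)). The derivatives are approximated by finite differences:

import numpy as np

theta = 0.3
x = np.array([0.0, 1.0])
p = np.array([1 - theta, theta])               # Bernoulli(theta) pmf

def loglik(t, x):
    return x * np.log(t) + (1 - x) * np.log(1 - t)

h = 1e-4
score = (loglik(theta + h, x) - loglik(theta - h, x)) / (2 * h)                       # d/dtheta ln f
curv  = (loglik(theta + h, x) - 2 * loglik(theta, x) + loglik(theta - h, x)) / h**2   # second derivative

print(np.sum(p * score**2))         # E[(score)^2]
print(-np.sum(p * curv))            # -E[second derivative]; matches by remark 3
print(1.0 / (theta * (1 - theta)))  # closed form, about 4.76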


Maximum Likelihood Estimator

The maximum likelihood estimator (MLE) is a widely used estimator.

Definition 9 (Maximum Likelihood Estimator)
The maximum likelihood estimator (MLE) of θ from a randomly drawn X ∼ P_θ is defined as
  ϕ_MLE(x) ≜ arg max_{θ ∈ Θ} { P_θ(x) }.
Here P_θ(x) is called the likelihood function.

Exercise 4 (MLE of a Gaussian with Unknown Mean and Variance)
Consider X_i i.i.d. ∼ N(µ, σ²) for i = 1, 2, ..., n, where θ ≜ (µ, σ²) denotes the unknown parameter. Let x̄ ≜ (1/n) ∑_{i=1}^n x_i. Show that
  ϕ_MLE(x^n) = ( x̄, (1/n) ∑_{i=1}^n (x_i − x̄)² ).
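A Python sketch of the closed form in Exercise 4, checked on synthetic data (the true parameters below are arbitrary):

import numpy as np

def gaussian_mle(xn):
    mu_hat = np.mean(xn)
    sigma2_hat = np.mean((xn - mu_hat)**2)   # note the 1/n factor, not 1/(n-1)
    return mu_hat, sigma2_hat

rng = np.random.default_rng(2)
xn = rng.normal(3.0, 1.5, size=1000)          # hypothetical data with mu = 3, sigma = 1.5
print(gaussian_mle(xn))                        # close to (3.0, 2.25)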

Asymptotic Evaluations: Consistency

In the following we consider an observation of n i.i.d. samples X_i i.i.d. ∼ P_θ, i = 1, ..., n, and give two ways of evaluating the performance of a sequence of estimators {ϕ_n(x^n) : n ∈ ℕ} as n → ∞.

Definition 10 (Consistency)
A sequence of estimators {ζ_n(x^n) : n ∈ ℕ} is consistent if, for all ε > 0,
  lim_{n→∞} P_{X_i i.i.d. ∼ P_θ}{ |ζ_n(X^n) − z(θ)| < ε } = 1,  ∀ θ ∈ Θ.
In other words, ζ_n(X^n) → z(θ) in probability for all θ ∈ Θ.

Theorem 7 (MLE is Consistent)
For a family of densities {f_θ : θ ∈ Θ}, under some regularity conditions on f_θ(x), the plug-in estimator z(ϕ_MLE(x^n)) is a consistent estimator of z(θ), where z is a continuous function of θ.

Asymptotic Evaluations: Efficiency

Motivated by the Cramér-Rao inequality, we would like to see whether the lower bound is asymptotically attainable.

Definition 11 (Efficiency)
A sequence of estimators {ζ_n(x^n) : n ∈ ℕ} is asymptotically efficient if
  √n ( ζ_n(X^n) − z(θ) ) → N( 0, (1/J(θ)) ( d/dθ z(θ) )² ) in distribution, as n → ∞.

Theorem 8 (MLE is Asymptotically Efficient)
For a family of densities {f_θ : θ ∈ Θ}, under some regularity conditions on f_θ(x), the plug-in estimator z(ϕ_MLE(x^n)) is an asymptotically efficient estimator of z(θ), where z is a continuous function of θ.
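A Monte Carlo illustration in Python for the Bernoulli MLE (the sample mean): the variance of √n(θ̂ − θ) should approach 1/J(θ) = θ(1−θ); the parameter, sample size, and number of trials are made up:

import numpy as np

rng = np.random.default_rng(3)
theta, n, trials = 0.3, 2000, 50_000

theta_hat = rng.binomial(n, theta, size=trials) / n   # Bernoulli MLE = sample mean
print(np.var(np.sqrt(n) * (theta_hat - theta)))       # empirical variance
print(theta * (1 - theta))                            # limit 1/J(theta) = 0.21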

Bayesian Estimators

In the Bayesian setting, the prior distribution of the parameter, π(θ) for θ ∈ Θ, is known, and hence the joint distribution of (θ, X) is π(θ) P_θ(x). The goal is then to find an estimator that minimizes the Bayes risk, defined as the risk function averaged over the random θ:
  R(ϕ) ≜ E_{θ∼π}[ R_θ(ϕ) ] = E_{(θ,X)∼π·P_θ}[ r(θ, ϕ(X)) ].
The optimal Bayesian estimator is ϕ*(·) ≜ arg min_{ϕ : 𝒳 → Θ} { R(ϕ) }.
Below we give some examples of Bayesian estimators for several kinds of risk: the 0-1 risk, the squared-error risk, and the absolute-error risk.

1. 0-1 risk r(θ, θ̂) = 1{θ ≠ θ̂}: this kind of risk is reasonable for finite Θ, and the Bayes risk is the same as the average probability of error. The optimal Bayesian estimator is
   ϕ_MAP(x) = arg max_{θ ∈ Θ} { π(θ) P_θ(x) },
   called the maximum a posteriori (MAP) estimator.
2. Squared-error risk r(θ, θ̂) = |θ − θ̂|²: the optimal Bayesian estimator is
   ϕ_MMSE(x) = E_{θ ∼ π(θ|X=x)}[ θ | X = x ],
   where π(θ|X=x) ≜ π(θ) P_θ(x) / ∑_{θ'∈Θ} π(θ') P_{θ'}(x) is the posterior distribution.
3. Absolute-error risk r(θ, θ̂) = |θ − θ̂|: the optimal Bayesian estimator is the median of π(θ|X=x).
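A Python sketch of the three estimators for a conjugate Beta-Bernoulli model, assuming SciPy is available; the prior hyperparameters and the data (k successes in n trials) are made-up, and for the continuous parameter the 0-1-risk estimator is taken as the posterior mode (the density analogue of MAP):

import numpy as np
from scipy import stats

a, b, n, k = 2.0, 2.0, 10, 7
post = stats.beta(a + k, b + n - k)           # posterior is Beta(a + k, b + n - k)

theta_map  = (a + k - 1) / (a + b + n - 2)    # posterior mode (0-1 risk, density version)
theta_mmse = post.mean()                      # posterior mean (squared-error risk)
theta_med  = post.median()                    # posterior median (absolute-error risk)
print(theta_map, theta_mmse, theta_med)       # roughly 0.667, 0.643, 0.648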