Chapter 2. Binary and M-ary Hypothesis Testing

2.1 Introduction (Levy 2.1)


Detection problems can usually be cast as binary or M-ary hypothesis testing problems.

Applications:

This chapter: the simple hypothesis testing problem, in which the probability distribution of the observations under each hypothesis is assumed to be known exactly. Example:

Composite hypothesis testing: problems involving unknown parameters (Chapter 4). Example:

Objectives:

1. Design testing rules that are optimal in some appropriate sense, based on the amount of information available.
2. Analyze the performance of the test.

Structure of this chapter:

2.2 Binary Hypothesis Testing Problem Formulation: problem modeling, notation, performance measures, etc.
2.3 Bayesian Test: a-priori probabilities known; cost structure known.
2.4 Minimax Test: a-priori probabilities unknown; cost structure known.
2.5 Neyman-Pearson Test: a-priori probabilities unknown; cost structure unknown. Receiver operating characteristic (ROC).
2.6 Gaussian Detection
2.7 M-ary Hypothesis Testing

2.2 Binary Hypothesis Testing Problem Formulation (Levy 2.2 and 2.4)

Binary hypothesis testing is the problem of deciding between two hypotheses based on a (random) observation. The model contains:

1. Hypotheses and a-priori probabilities
2. Observation
3. Connection between hypotheses and observation
4. Decision function
5. Performance measure

1. Hypotheses and a-priori probabilities: hypotheses H_0 and H_1, with a-priori probabilities π_0 = P(H_0) and π_1 = P(H_1) = 1 − π_0.

2. Observation: a random vector Y with sample space Y. An observation is a sample vector y of Y.

3. Connection between hypotheses and observation: the distributions of Y under H_0 and H_1, assumed known in this chapter.

For continuous Y, the PDF under each hypothesis:
H_0: Y ~ f(y|H_0)
H_1: Y ~ f(y|H_1)

For discrete Y, the PMF under each hypothesis:
H_0: P(Y = y|H_0) = p(y|H_0)
H_1: P(Y = y|H_1) = p(y|H_1)

4. Decision function: decide whether H_0 or H_1 is true given an observation. A map δ from Y to {0, 1}:

δ(y) = 1 if we decide H_1; 0 if we decide H_0.

Decision regions:
Y_0 ≜ {y : δ(y) = 0}
Y_1 ≜ {y : δ(y) = 1}

We have Y_0 ∩ Y_1 = ∅ and Y_0 ∪ Y_1 = Y, so a decision function induces a partition of the sample space of Y.

Examples of decision rules:

Goal: obtain the decision rule that is optimal in some sense. This requires:

5. An optimality/performance measure.

Bayesian formulation: all uncertainties are quantifiable, and the costs and benefits of outcomes can be measured.

Cost function: C_ij, for i = 0, 1 and j = 0, 1, is the cost of deciding H_i when H_j holds. The value of C_ij depends on the application/nature of the problem. Examples of cost functions:

Assumption: making a correct decision is always less costly than making a mistake, i.e., C_00 < C_10 and C_11 < C_01.

Bayes risk of a decision function δ:

Risk under H_0:
R(δ|H_0) = C_00 P(δ(Y) = 0|H_0) + C_10 P(δ(Y) = 1|H_0)
         = C_00 P(Y_0|H_0) + C_10 P(Y_1|H_0)

Risk under H_1:
R(δ|H_1) = C_01 P(δ(Y) = 0|H_1) + C_11 P(δ(Y) = 1|H_1)
         = C_01 P(Y_0|H_1) + C_11 P(Y_1|H_1)

Continuous Y: P(Y_i|H_j) = ∫_{Y_i} f(y|H_j) dy.
Discrete Y: P(Y_i|H_j) = Σ_{y ∈ Y_i} P(y|H_j).
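The two conditional risks are easy to evaluate numerically. A minimal sketch (the Gaussian pair, the threshold rule, and the cost values below are illustrative assumptions, not taken from the notes):

```python
from math import erf, sqrt

# Illustrative assumptions (not from the notes): H0: Y ~ N(0,1), H1: Y ~ N(2,1),
# decision rule delta(y) = 1 iff y >= eta, and uniform error costs.
C00, C10, C01, C11 = 0.0, 1.0, 1.0, 0.0
mu0, mu1, eta = 0.0, 2.0, 1.0

def gauss_tail(x, mu):
    """P(Y >= x) for Y ~ N(mu, 1)."""
    return 0.5 * (1.0 - erf((x - mu) / sqrt(2.0)))

p1_h0 = gauss_tail(eta, mu0)   # P(Y_1 | H0): probability of deciding 1 under H0
p1_h1 = gauss_tail(eta, mu1)   # P(Y_1 | H1): probability of deciding 1 under H1

R_h0 = C00 * (1.0 - p1_h0) + C10 * p1_h0   # conditional risk under H0
R_h1 = C01 * (1.0 - p1_h1) + C11 * p1_h1   # conditional risk under H1
print(R_h0, R_h1)
```

Because eta sits at the midpoint of the two means and the costs are symmetric, the two conditional risks coincide here; moving eta trades one against the other.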

Bayes risk:

R(δ) = R(δ|H_0) P(H_0) + R(δ|H_1) P(H_1)
     = π_0 C_00 P(Y_0|H_0) + π_0 C_10 P(Y_1|H_0) + π_1 C_01 P(Y_0|H_1) + π_1 C_11 P(Y_1|H_1)
     = Σ_{i=0}^{1} Σ_{j=0}^{1} π_j C_ij P(Y_i|H_j).

Since P(Y_0|H_0) + P(Y_1|H_0) = P(Y_0|H_1) + P(Y_1|H_1) = 1,

R(δ) = π_0 C_00 + π_0 (C_10 − C_00) P(Y_1|H_0) + π_1 C_01 + π_1 (C_11 − C_01) P(Y_1|H_1).

Optimal δ: the δ that minimizes the Bayes risk R(δ).

False alarm: H_0 is true but H_1 is decided (Type I error).
Detection: H_1 is true and H_1 is decided.
Miss: H_1 is true but H_0 is decided (Type II error).

Probability of detection: P_D(δ) = P(Y_1|H_1).
Probability of false alarm: P_F(δ) = P(Y_1|H_0).
Probability of miss: P_M(δ) = P(Y_0|H_1) = 1 − P_D(δ).

R(δ) = π_0 C_00 + π_0 (C_10 − C_00) P_F(δ) + π_1 C_01 + π_1 (C_11 − C_01) P_D(δ).

Ideally, we want P_D(δ) → 1 and P_F(δ) → 0 simultaneously.

Receiver operating characteristic (ROC): the upper boundary between the achievable and unachievable regions in the (P_F, P_D) unit square.

2.3 Bayesian Testing (Levy 2.2)

Assume:
1. The a-priori probabilities (π_0, π_1) are known.
2. The cost structure (C_ij) is known.

Find the optimal decision rule δ that minimizes the Bayes risk R(δ). Note: the distribution of Y under each hypothesis is also known.

2.3.1 Likelihood-Ratio Test (LRT)

From the previous derivation,

R(δ) = π_0 C_00 + π_0 (C_10 − C_00) P(Y_1|H_0) + π_1 C_01 + π_1 (C_11 − C_01) P(Y_1|H_1)
     = π_0 C_00 + π_1 C_01 + ∫_{Y_1} [π_0 (C_10 − C_00) f(y|H_0) + π_1 (C_11 − C_01) f(y|H_1)] dy.

R(δ) is minimized if and only if Y_1 contains exactly the points y where the integrand is negative (points where it vanishes may be assigned to either region):

π_0 (C_10 − C_00) f(y|H_0) + π_1 (C_11 − C_01) f(y|H_1) ≤ 0
⟺ π_1 (C_01 − C_11) f(y|H_1) ≥ π_0 (C_10 − C_00) f(y|H_0)
⟺ f(y|H_1) / f(y|H_0) ≥ (C_10 − C_00) π_0 / [(C_01 − C_11) π_1], since C_01 > C_11.

Define the likelihood ratio

L(y) ≜ f(y|H_1) / f(y|H_0)

and the threshold

τ ≜ (C_10 − C_00) π_0 / [(C_01 − C_11) π_1].

The optimal decision rule is the LRT:

δ_B(y) = 1 if L(y) ≥ τ; 0 if L(y) < τ.

For discrete Y, similarly, the optimal decision rule is the LRT with L(y) ≜ P(y|H_1) / P(y|H_0) compared against the same threshold τ.
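The Bayes threshold and the LRT can be coded directly. A small sketch; the Gaussian densities, priors, and costs are illustrative assumptions, not from the notes:

```python
from math import exp, log

# Assumed example: H0: Y ~ N(0,1), H1: Y ~ N(2,1), with priors and costs below.
pi0, pi1 = 0.7, 0.3
C00, C10, C01, C11 = 0.0, 1.0, 2.0, 0.0

tau = (C10 - C00) * pi0 / ((C01 - C11) * pi1)   # Bayes threshold

def likelihood_ratio(y, mu1=2.0):
    """L(y) = f(y|H1)/f(y|H0) for unit-variance Gaussians with means 0 and mu1."""
    return exp(mu1 * y - mu1 * mu1 / 2.0)

def delta_B(y):
    """Bayes-optimal decision: 1 iff L(y) >= tau."""
    return 1 if likelihood_ratio(y) >= tau else 0

# For this Gaussian pair the LRT reduces to a threshold on y itself:
# L(y) >= tau  <=>  y >= ln(tau)/mu1 + mu1/2
y_threshold = log(tau) / 2.0 + 1.0
print(tau, y_threshold, delta_B(0.5), delta_B(2.5))
```

The reduction of the LRT to a threshold on y is a recurring simplification for exponential-family densities: the monotone map y ↦ L(y) lets the comparison be done in the observation domain.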

Maximum A-Posteriori (MAP) Rule: consider the cost structure

C_00 = C_11 = 0, C_01 = C_10 = 1, i.e., C_ij = 1 − δ_ij,

where δ_ij is the Kronecker delta (δ_ij = 1 if i = j, 0 if i ≠ j). Every error incurs unit cost, so minimizing the Bayes risk becomes minimizing the probability of error. The LRT becomes

L(y) = f(y|H_1) / f(y|H_0) ≷ (C_10 − C_00) π_0 / [(C_01 − C_11) π_1] = π_0 / π_1,

equivalently π_1 f(y|H_1) ≷ π_0 f(y|H_0), i.e., P(H_1|Y = y) ≷ P(H_0|Y = y): choose the hypothesis with the larger a-posteriori probability.

Maximum Likelihood (ML) Rule: if, furthermore, π_0 = π_1 = 1/2 (equiprobable hypotheses), the LRT becomes f(y|H_1) ≷ f(y|H_0): choose the hypothesis with the larger likelihood function value.

2.3.2 Examples
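The difference between the two rules shows up whenever the priors are unequal. A hedged sketch (the Gaussian pair, the priors, and the evaluation point y = 0.8 are all assumed for illustration):

```python
from math import exp

# Assumed example: H0: Y ~ N(0,1), H1: Y ~ N(1,1), priors pi0 = 0.8, pi1 = 0.2.
pi0, pi1, mu = 0.8, 0.2, 1.0

def f(y, m):
    """Unnormalized N(m,1) density; the common constant cancels in comparisons."""
    return exp(-0.5 * (y - m) ** 2)

def map_rule(y):
    """MAP: choose the hypothesis with larger posterior, pi1*f(y|H1) vs pi0*f(y|H0)."""
    return 1 if pi1 * f(y, mu) >= pi0 * f(y, 0.0) else 0

def ml_rule(y):
    """ML: choose the larger likelihood (equivalent to MAP with equal priors)."""
    return 1 if f(y, mu) >= f(y, 0.0) else 0

# At y = 0.8 the likelihood favors H1 (0.8 > mu/2 = 0.5), but the heavy prior
# pi0 = 0.8 outweighs it, so MAP still decides H0.
print(ml_rule(0.8), map_rule(0.8))
```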

2.3.3 Asymptotic Performance of the LRT (Levy 3.2)

Consider a binary hypothesis testing problem with a sequence of i.i.d. random observations Y_1, Y_2, …, Y_N, where Y_k ∈ R^n, stacked as Y = [Y_1, Y_2, …, Y_N]. Assume Y is continuous. The likelihood ratio factorizes:

L(y) = f(y|H_1) / f(y|H_0) = Π_{k=1}^{N} f(y_k|H_1) / f(y_k|H_0) = Π_{k=1}^{N} L(y_k),

so the LRT L(y) ≷ τ(N) is equivalent to

(1/N) Σ_{k=1}^{N} ln L(y_k) ≷ (1/N) ln τ(N) ≜ γ(N).

Let Z_k ≜ ln L(Y_k) = ln [f(Y_k|H_1) / f(Y_k|H_0)] and S_N ≜ (1/N) Σ_{k=1}^{N} Z_k. The LRT becomes S_N ≷ γ(N). Notice that the Z_k are i.i.d. and S_N is the sample mean of the Z_k. As N → ∞, by the strong law of large numbers,

under H_1: S_N → E[Z_k|H_1] = ∫ ln [f(y|H_1)/f(y|H_0)] f(y|H_1) dy a.s.
under H_0: S_N → E[Z_k|H_0] = ∫ ln [f(y|H_1)/f(y|H_0)] f(y|H_0) dy a.s.

Def. For two PDFs f and g, the Kullback-Leibler (KL) divergence is

D(f‖g) = ∫ f(x) ln [f(x)/g(x)] dx.

It is a natural notion of distance between distributions, but not a true distance metric.

Properties:
1. D(f‖g) ≥ 0, with equality if and only if f = g.
2. Non-symmetric: in general D(f‖g) ≠ D(g‖f).
3. Does not satisfy the triangle inequality.

Let f_0(y) ≜ f(y|H_0) and f_1(y) ≜ f(y|H_1). As N → ∞,

under H_1: S_N → ∫ ln [f_1(y)/f_0(y)] f_1(y) dy = D(f_1‖f_0) > 0 a.s.
under H_0: S_N → ∫ ln [f_1(y)/f_0(y)] f_0(y) dy = −D(f_0‖f_1) < 0 a.s.

Thus P_D(N) → 1 and P_F(N) → 0.

* As long as we are willing to collect an arbitrarily large number of independent observations, we can separate H_0 and H_1 perfectly, regardless of π_0 and C_ij.

How fast do P_D(N) → 1 and P_F(N) → 0? Exponentially in N.
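The two almost-sure limits can be checked by Monte Carlo. A sketch under assumed densities (H_0: N(0,1) vs H_1: N(1,1), for which both KL divergences equal mu^2/2 = 0.5):

```python
import random

# Assumed example: H0: N(0,1) vs H1: N(1,1); D(f1||f0) = D(f0||f1) = mu^2/2.
random.seed(0)
mu, N = 1.0, 200_000

def log_lr(y):
    """Z = ln[f(y|H1)/f(y|H0)] = mu*y - mu^2/2 for unit-variance Gaussians."""
    return mu * y - mu * mu / 2.0

# Sample mean S_N of the log-likelihood ratios under each hypothesis.
S_under_h1 = sum(log_lr(random.gauss(mu, 1.0)) for _ in range(N)) / N
S_under_h0 = sum(log_lr(random.gauss(0.0, 1.0)) for _ in range(N)) / N

print(S_under_h1)   # close to  D(f1||f0) = +0.5
print(S_under_h0)   # close to -D(f0||f1) = -0.5
```

The two empirical means land on opposite sides of zero, which is exactly why any fixed threshold γ between −D(f_0‖f_1) and D(f_1‖f_0) eventually separates the hypotheses.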

2.4 Minimax Hypothesis Testing (Levy 2.5)

Assume:
1. The a-priori probabilities (π_0, π_1) are unknown.
2. The cost structure (C_ij) is known.

Possible solutions:
1. Guess the a-priori probabilities. May lead to bad performance.
2. Design the test conservatively: assume the least-favorable choice of a-priori probabilities and select the test that minimizes the Bayes risk for this choice. This minimizes the maximum risk and guarantees a minimum level of performance independent of the a-priori probabilities.

Problem statement: find the test δ_M and a-priori value π_0M that solve the minimax problem

(δ_M, π_0M) = arg min_δ max_{π_0 ∈ [0,1]} R(δ, π_0).

Approach: saddle-point method.

Def. A saddle point is a point in the domain of a function that is a stationary point but not a local extremum. If a point (δ_M, π_0M) satisfies

R(δ_M, π_0) ≤ R(δ_M, π_0M) ≤ R(δ, π_0M) for all δ and all π_0 ∈ [0,1],   (1)

then it is a saddle point of the function R, and it is the solution of the minimax problem.

Proof outline:
Step 1: A saddle point of the form (1) exists.
Step 2: The saddle point is the solution (saddle-point property).
Step 3: Construct the saddle point.

Minimax equation:

(C_01 − C_00) + (C_11 − C_01) P_D(δ_M) − (C_10 − C_00) P_F(δ_M) = 0.

Comments:
1. If C_00 = C_11, the minimax equation becomes P_D = 1 − [(C_10 − C_00)/(C_01 − C_11)] P_F, a line through (0, 1) of the (P_F, P_D) square. If C_ij = 1 − δ_ij, the minimax equation becomes P_D = 1 − P_F.
2. The minimax test corresponds to the intersection of the ROC with the line of the minimax equation.
3. The LRT threshold τ_M of the minimax test equals the slope of the ROC at this intersection point. The corresponding a-priori probability is

π_0M = [1 + (C_10 − C_00)/((C_01 − C_11) τ_M)]^{−1}.

4. Another way of finding π_0M: π_0M = arg max_{π_0} min_δ R(δ, π_0). Define V(π_0) ≜ min_δ R(δ, π_0), the minimum Bayes risk with a-priori probability π_0, achieved by the corresponding LRT. Then π_0M = arg max_{π_0} V(π_0).
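Comment 4 suggests a direct numerical route: evaluate V(π_0) on a grid and maximize. A sketch under assumed symmetric Gaussian hypotheses with 0-1 costs (everything below the header comment is an illustrative choice, not from the notes):

```python
from math import erf, sqrt, log

# Assumed example: H0: N(0,1) vs H1: N(1,1), costs C_ij = 1 - delta_ij.
# V(pi0) = pi0*P_F + (1-pi0)*P_M, achieved by the LRT with threshold
# tau = pi0/(1-pi0), i.e. y-threshold eta = ln(tau)/mu + mu/2.
mu = 1.0

def Phi(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def V(pi0):
    eta = log(pi0 / (1.0 - pi0)) / mu + mu / 2.0
    P_F = 1.0 - Phi(eta)        # P(Y >= eta | H0)
    P_M = Phi(eta - mu)         # P(Y <  eta | H1)
    return pi0 * P_F + (1.0 - pi0) * P_M

# Least-favorable prior: maximize the concave function V on a grid.
grid = [i / 1000.0 for i in range(1, 1000)]
pi0M = max(grid, key=V)
print(pi0M, V(pi0M))
```

For this symmetric problem V(π_0) = V(1 − π_0), so the maximizer is π_0M = 1/2, consistent with the minimax equation P_D = 1 − P_F.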

Examples:

2.5 Neyman-Pearson (NP) Testing (Levy 2.4.1)

Assume:
1. The a-priori probabilities (π_0, π_1) are unknown.
2. The cost structure (C_ij) is unknown.

NP testing problem: select the test δ that maximizes P_D(δ) while ensuring that the probability of false alarm P_F(δ) is no more than α:

D_α ≜ {δ : P_F(δ) ≤ α}, δ_NP = arg max_{δ ∈ D_α} P_D(δ).

Approach: the Lagrangian method for constrained optimization.

δ_NP = arg max_δ P_D(δ) subject to P_F(δ) ≤ α.

Consider the Lagrangian

L(δ, λ) ≜ P_D(δ) − λ (P_F(δ) − α).

A test δ is optimal if it maximizes L(δ, λ) for some λ ≥ 0 with P_F(δ) ≤ α and λ(α − P_F(δ)) = 0 (complementary slackness).

L(δ, λ) = ∫_{Y_1} f(y|H_1) dy + λα − λ ∫_{Y_1} f(y|H_0) dy
        = ∫_{Y_1} [f(y|H_1) − λ f(y|H_0)] dy + λα.

L(δ, λ) is maximized when

δ(y) = 1 if f(y|H_1) > λ f(y|H_0); 0 if f(y|H_1) < λ f(y|H_0); 0 or 1 if f(y|H_1) = λ f(y|H_0),

equivalently,

δ(y) = 1 if L(y) > λ; 0 if L(y) < λ; 0 or 1 if L(y) = λ.

Thus δ_NP has to be an LRT, and λ must satisfy the KKT conditions.

Let F_L(l|H_0) ≜ P(L ≤ l|H_0) be the CDF of the likelihood ratio L = L(Y) under H_0, and let f_0 ≜ F_L(0|H_0) = P(L = 0|H_0). Define two tests:

δ_{L,λ}(y) = 1 if L(y) > λ; 0 if L(y) ≤ λ
δ_{U,λ}(y) = 1 if L(y) ≥ λ; 0 if L(y) < λ

Case 1: If 1 − α < f_0, let λ = 0 and δ_NP = δ_{L,0}.

Case 2: If 1 − α ≥ f_0 and there exists a λ such that F_L(λ|H_0) = 1 − α, i.e., 1 − α is in the range of F_L(·|H_0), choose this λ as the LRT threshold and let δ_NP = δ_{L,λ}.

Case 3: If 1 − α ≥ f_0 and 1 − α is not in the range of F_L(·|H_0), i.e., there is a discontinuity point λ > 0 of F_L(·|H_0) such that F_L(λ⁻|H_0) < 1 − α < F_L(λ|H_0), choose this λ as the LRT threshold; the NP test is then the randomized

test: choose δ_{U,λ} with probability p and δ_{L,λ} with probability 1 − p; equivalently,

δ_NP(y) = 1 if L(y) > λ; 0 if L(y) < λ; 1 w.p. p and 0 w.p. 1 − p if L(y) = λ,

where p is chosen so that P_F(δ_NP) = α exactly.

Comments:
1. When Y is discrete, F_L(·|H_0) is discontinuous, so a randomized test is usually needed.
2. Similarly, we could consider minimizing P_F under the constraint P_M(δ) ≤ β; a similar solution is obtained. That problem is called an NP test of Type II; the one discussed above is an NP test of Type I.

Example:
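Case 3 can be made concrete with a discrete observation. A sketch under assumed binomial hypotheses (the parameters n, p0, p1, and α are illustrative); since L(y) is increasing in y here, thresholding L is equivalent to thresholding y itself:

```python
from math import comb

# Assumed discrete example: H0: Y ~ Binomial(10, 0.3), H1: Y ~ Binomial(10, 0.6).
n, p0, p1, alpha = 10, 0.3, 0.6, 0.1

def pmf(y, p):
    return comb(n, y) * p**y * (1 - p)**(n - y)

# Find the smallest k with P(Y > k | H0) <= alpha, then randomize at Y = k
# so that P_F hits alpha exactly.
k = 0
while sum(pmf(y, p0) for y in range(k + 1, n + 1)) > alpha:
    k += 1
tail = sum(pmf(y, p0) for y in range(k + 1, n + 1))   # P(Y > k | H0)
p_rand = (alpha - tail) / pmf(k, p0)                  # randomization prob. at Y = k

P_F = tail + p_rand * pmf(k, p0)                      # equals alpha by construction
P_D = sum(pmf(y, p1) for y in range(k + 1, n + 1)) + p_rand * pmf(k, p1)
print(k, round(p_rand, 4), round(P_F, 4), round(P_D, 4))
```

Without randomization the test would have to settle for P_F ≈ 0.047 (deciding 1 only for Y > k) or overshoot to P_F ≈ 0.150 (for Y ≥ k); the coin flip at Y = k interpolates to the exact size α.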

2.5.1 ROC Properties

Finding the ROC is naturally an NP test problem, so each point on it is achieved by an LRT L(y) ≷ τ. Let f_L(l|H_j) denote the PDF of L under H_j. Then

P_D(τ) = ∫_τ^∞ f_L(l|H_1) dl, P_F(τ) = ∫_τ^∞ f_L(l|H_0) dl.   (2)

As τ varies from 0 to ∞, (P_F(τ), P_D(τ)) moves continuously along the ROC curve.

1. Let τ = 0. Then δ(y) = 1 always, so P_D(δ) = P_F(δ) = 1: the point (1, 1) belongs to the ROC.
2. Let τ = ∞. Then δ(y) = 0 always, so P_D(δ) = P_F(δ) = 0: the point (0, 0) belongs to the ROC.
3. The slope of the ROC at the point (P_F(τ), P_D(τ)) equals τ.

4. The ROC curve is concave, i.e., the domain of achievable pairs (P_F, P_D) is convex.
5. All points on the ROC curve satisfy P_D ≥ P_F.
6. The region of feasible tests is symmetric about the point (1/2, 1/2), i.e., if (P_F, P_D) is feasible, so is (1 − P_F, 1 − P_D).

Example:
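Several of these properties can be spot-checked numerically by tracing the ROC of an assumed Gaussian problem (H_0: N(0,1) vs H_1: N(1,1), where the LRT reduces to a threshold η on y); this sketch verifies the endpoints, P_D ≥ P_F, and concavity:

```python
from math import erf, sqrt

# Assumed example: H0: Y ~ N(0,1) vs H1: Y ~ N(1,1).
mu = 1.0

def Q(x):
    """Standard normal tail probability P(N(0,1) >= x)."""
    return 0.5 * (1.0 - erf(x / sqrt(2.0)))

# Sweep the y-threshold eta; each eta gives one ROC point
# (P_F, P_D) = (Q(eta), Q(eta - mu)).
points = [(Q(i / 10.0), Q(i / 10.0 - mu)) for i in range(-30, 31)]

assert all(pd >= pf for pf, pd in points)   # property 5: P_D >= P_F

# Property 4 (concavity): every interior point lies on or above the chord
# joining its neighbors.
for (pf1, pd1), (pf2, pd2), (pf3, pd3) in zip(points, points[1:], points[2:]):
    t = (pf2 - pf1) / (pf3 - pf1)
    assert pd2 >= pd1 + t * (pd3 - pd1) - 1e-12

print(points[0])    # eta = -3: near (1, 1)
print(points[-1])   # eta = +3: near (0, 0)
```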