ECE531 Lecture 2b: Bayesian Hypothesis Testing


D. Richard Brown III
Worcester Polytechnic Institute
29-January-2009

Minimizing a Weighted Sum of Conditional Risks

We would like to minimize all of the conditional risks R_0(D), ..., R_{N-1}(D), but usually we have to make some sort of tradeoff. A standard approach to solving a multi-objective optimization problem is to form a linear combination of the functions to be minimized:

    r(D, λ) := Σ_{j=0}^{N-1} λ_j R_j(D) = λ^T R(D)

where λ_j ≥ 0 and Σ_j λ_j = 1. The problem is then to solve the single-objective optimization problem:

    D* = arg min_{D ∈ 𝒟} r(D, λ) = arg min_{D ∈ 𝒟} Σ_{j=0}^{N-1} λ_j c_j^T D p_j

where we are focusing on the case of finite Y for now.

Minimizing a Weighted Sum of Conditional Risks

The problem

    D* = arg min_{D ∈ 𝒟} r(D, λ) = arg min_{D ∈ 𝒟} Σ_{j=0}^{N-1} λ_j c_j^T D p_j

is a finite-dimensional linear programming problem, since the variables that we control (the D_ij) are constrained by the linear inequalities and equalities

    D_ij ≥ 0 for all i ∈ {0, ..., M-1} and j ∈ {0, ..., L-1}
    Σ_{i=0}^{M-1} D_ij = 1 for all j ∈ {0, ..., L-1}

and the objective function r(D, λ) is linear in the variables D_ij. For notational convenience, we call this minimization problem LPF(λ).

Decomposition: Part 1

Since

    R_j(D) = c_j^T D p_j = Σ_{i=0}^{M-1} Σ_{l=0}^{L-1} D_il P_lj C_ij

the problem LPF(λ) can be written as

    D* = arg min_{D ∈ 𝒟} Σ_{j=0}^{N-1} λ_j Σ_{i=0}^{M-1} Σ_{l=0}^{L-1} D_il P_lj C_ij
       = arg min_{D ∈ 𝒟} Σ_{l=0}^{L-1} [ Σ_{i=0}^{M-1} D_il Σ_{j=0}^{N-1} λ_j C_ij P_lj ]

where the bracketed term is denoted r_l(d, λ). Note that we can minimize each term r_l(d, λ) separately.

Decomposition: Part 2

The l-th subproblem is then

    d* = arg min_{d ∈ R^M} r_l(d, λ) = arg min_{d ∈ R^M} Σ_{i=0}^{M-1} d_i Σ_{j=0}^{N-1} λ_j C_ij P_lj

subject to the constraints that d_i ≥ 0 and Σ_i d_i = 1. Note that d_i corresponds to D_il in the original LPF(λ) problem. For notational convenience, we call this subproblem LPF_l(λ).

Some Intuition: The Hiker

Suppose you are going on a hike with a one-liter drinking bottle and you must completely fill the bottle before your hike. At the general store, you can fill your bottle with any combination of:

    Tap water: $0.25 per liter
    Iced tea: $0.50 per liter
    Soda: $1.00 per liter
    Red Bull: $5.00 per liter

You want to minimize your cost for the hike. What should you do? We can define a per-liter cost vector h = [0.25, 0.50, 1.00, 5.00]^T and a proportional purchase vector ν ∈ R^4 with elements satisfying ν_i ≥ 0 and Σ_i ν_i = 1. Clearly, your cost h^T ν is minimized when ν = [1, 0, 0, 0]^T.

Solution to the Subproblem LPF_l(λ)

Theorem. For fixed h ∈ R^n, the function h^T ν is minimized over the set {ν ∈ R^n : ν_i ≥ 0 and Σ_i ν_i = 1} by the vector ν* with ν*_m = 1 and ν*_k = 0 for all k ≠ m, where m is any index with h_m = min_{i ∈ {0,...,n-1}} h_i.

Note that, if two or more commodities have the same minimum price, the solution is not unique.

Back to our subproblem LPF_l(λ):

    d* = arg min_{d ∈ R^M} Σ_{i=0}^{M-1} d_i Σ_{j=0}^{N-1} λ_j C_ij P_lj

subject to the constraints that d_i ≥ 0 and Σ_i d_i = 1. The theorem tells us how to easily solve LPF_l(λ). We just have to find the index

    m_l = arg min_{i ∈ {0,...,M-1}} Σ_{j=0}^{N-1} λ_j C_ij P_lj

and then set d*_{m_l} = 1 and d*_i = 0 for all i ≠ m_l.
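The index-selection recipe above is straightforward to implement. Below is a minimal NumPy sketch; the function name and array layout are illustrative assumptions, not part of the lecture:

```python
import numpy as np

def solve_lpf_subproblem(lam, C, P, l):
    """Solve LPF_l(lambda) for observation y_l.
    lam: (N,) risk weights; C: (M, N) costs C[i, j]; P: (L, N) with
    P[l, j] = Prob(Y = y_l | state x_j).  (Layout is an assumption.)"""
    # Weighted cost of deciding H_i: sum_j lam[j] * C[i, j] * P[l, j]
    costs = C @ (lam * P[l])
    m_l = int(np.argmin(costs))        # index of the cheapest decision
    d = np.zeros(C.shape[0])
    d[m_l] = 1.0                       # deterministic: always decide H_{m_l}
    return d, float(costs[m_l])
```

The returned vector d is one column of the optimal decision matrix; assembling the columns for l = 0, ..., L-1 solves the full problem LPF(λ).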

Solution to the Problem LPF(λ)

Solving the subproblem LPF_l(λ) for l = 0, ..., L-1 and assembling the results of the subproblems into M × L decision matrices yields a non-empty, finite set of deterministic decision matrices that solve LPF(λ). All deterministic decision matrices in the set have the same minimal cost:

    r*(λ) = Σ_{l=0}^{L-1} min_i Σ_{j=0}^{N-1} λ_j C_ij P_lj

It is easy to show that any randomized decision rule in the convex hull of this set of deterministic decision matrices also achieves r*(λ), and hence is also optimal.

Example Part 1: Bayesian Hypothesis Testing

Let G_il := Σ_j λ_j C_ij P_lj and let G ∈ R^{M×L} be the matrix composed of the elements G_il. Suppose

    G = [ 0.3  0.5  0.2  0.8
          0.4  0.2  0.1  0.5
          0.5  0.1  0.7  0.6 ]

What is the solution to the subproblem LPF_0(λ)? d* = [1, 0, 0]^T.

What is the solution to the problem LPF(λ)?

    D* = [ 1  0  0  0
           0  0  1  1
           0  1  0  0 ]

Is this solution unique? Yes. What is r*(λ)? Summing the column minima, r*(λ) = 0.3 + 0.1 + 0.1 + 0.5 = 1.0.
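The example can be reproduced numerically: a short sketch that finds the column-wise minima of G, assembles the deterministic decision matrix D*, and sums the column minima to get r*(λ) (the variable names are mine):

```python
import numpy as np

# G[i, l] = sum_j lam[j] * C[i, j] * P[l, j], taken from the example above
G = np.array([[0.3, 0.5, 0.2, 0.8],
              [0.4, 0.2, 0.1, 0.5],
              [0.5, 0.1, 0.7, 0.6]])

m = G.argmin(axis=0)               # cheapest decision for each observation y_l
D = np.zeros_like(G)
D[m, np.arange(G.shape[1])] = 1.0  # deterministic decision matrix D*
r_star = G.min(axis=0).sum()       # minimal weighted risk r*(lambda)

print(m.tolist())  # [0, 2, 1, 1]
print(r_star)      # 1.0
```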

Example Part 2: Bayesian Hypothesis Testing

What happens if

    G = [ 0.3  0.5  0.2  0.8
          0.4  0.2  0.2  0.5
          0.5  0.1  0.7  0.6 ] ?

The solution to LPF_2(λ) is no longer unique. Both

    D* = [ 1  0  0  0            D* = [ 1  0  1  0
           0  0  1  1     or            0  0  0  1
           0  1  0  0 ]                 0  1  0  0 ]

solve LPF(λ) and achieve r*(λ) = 1.1. In fact, any decision matrix of the form

    D* = [ 1  0  α    0
           0  0  1-α  1
           0  1  0    0 ]

for α ∈ [0, 1] will also achieve r*(λ) = 1.1.

Remarks

Note that the solution to the subproblem LPF_l(λ) always yields at least one deterministic decision rule for the case y = y_l. If n_l elements of the column G_l are equal to the minimum of that column, then there are n_l deterministic decision rules solving the subproblem LPF_l(λ), and hence Π_l n_l ≥ 1 deterministic decision rules that solve LPF(λ). If there is more than one deterministic solution to LPF(λ), then there is also a corresponding convex hull of randomized decision rules that achieve r*(λ).

Working Example: Effect of Changing the Weighting λ

[Figure: scatter plot of the conditional risk pairs (R_0, R_1) of the deterministic decision rules, both axes from 0 to 1, with the rules D_0, D_8, D_12, D_14, and D_15 labeled along the optimal tradeoff boundary traced out as λ varies.]

Weighted Risk Level Sets

Notice in the last example that, by varying our risk weighting λ, we can create a family of level sets that cover every point of the optimal tradeoff surface. Given a constant c ∈ R, the level set of value c is defined as

    L_c^λ := {x ∈ R^N : λ^T x = c}

[Figure: level sets in the (x_0, x_1) plane for λ = [0.7, 0.3]^T and λ = [0.4, 0.6]^T; each level set is a line of constant λ^T x, with slope determined by λ.]

Infinite Observation Spaces: Part 1

When the observations are specified by conditional pdfs, our conditional risk function becomes

    R_j(ρ) = ∫_{y ∈ Y} [ Σ_{i=0}^{M-1} ρ_i(y) C_ij ] p_j(y) dy

Recall that the randomized decision rule ρ_i(y) specifies the probability of deciding H_i when the observation is y. We still have N < ∞ states, so, as before, we can write a linear combination of the conditional risk functions as

    r(ρ, λ) := Σ_{j=0}^{N-1} λ_j R_j(ρ)

where λ_j ≥ 0 and Σ_j λ_j = 1.

Infinite Observation Spaces: Part 2

The problem is then to solve

    ρ* = arg min_{ρ ∈ 𝒟} r(ρ, λ) = arg min_{ρ ∈ 𝒟} Σ_{j=0}^{N-1} λ_j ∫_{y ∈ Y} [ Σ_{i=0}^{M-1} ρ_i(y) C_ij ] p_j(y) dy

where 𝒟 is the set of all valid decision rules satisfying ρ_i(y) ≥ 0 for all i = 0, ..., M-1 and all y ∈ Y, as well as Σ_{i=0}^{M-1} ρ_i(y) = 1 for all y ∈ Y. This is an infinite-dimensional linear programming problem. We refer to this problem as LP(λ).

Infinite Observation Spaces: Decomposition Part 1

As before, we can decompose the big problem LP(λ) into lots of little subproblems, one for each observation y ∈ Y. We rewrite LP(λ) as

    ρ* = arg min_{ρ ∈ 𝒟} Σ_{j=0}^{N-1} λ_j ∫_{y ∈ Y} [ Σ_{i=0}^{M-1} ρ_i(y) C_ij ] p_j(y) dy
       = arg min_{ρ ∈ 𝒟} ∫_{y ∈ Y} Σ_{i=0}^{M-1} ρ_i(y) [ Σ_{j=0}^{N-1} λ_j C_ij p_j(y) ] dy

To minimize the integral, we can minimize the term

    (...) = Σ_{i=0}^{M-1} ρ_i(y) [ Σ_{j=0}^{N-1} λ_j C_ij p_j(y) ]

at each fixed value of y.

Infinite Observation Spaces: Decomposition Part 2

The subproblem of minimizing the term (...) inside the integral when y = y_0 is a finite-dimensional linear programming problem (since X is still finite):

    ν* = arg min_ν Σ_{i=0}^{M-1} ν_i Σ_{j=0}^{N-1} λ_j C_ij p_j(y_0)

subject to ν_i ≥ 0 for all i = 0, ..., M-1 and Σ_{i=0}^{M-1} ν_i = 1. We already know how to do this. We just find the index that has the lowest commodity cost,

    m_{y_0} = arg min_{i ∈ {0,...,M-1}} Σ_{j=0}^{N-1} λ_j C_ij p_j(y_0)

and set ν*_{m_{y_0}} = 1 and all other ν*_i = 0. This is the same technique used to solve LPF_l(λ).

Infinite Observation Spaces: Minimum Weighted Risk

As we saw with finite Y, solving LP(λ) will yield at least one deterministic decision rule that achieves the minimum weighted risk

    r*(λ) = ∫_{y ∈ Y} min_{i ∈ {0,...,M-1}} Σ_{j=0}^{N-1} λ_j C_ij p_j(y) dy = ∫_{y ∈ Y} min_{i ∈ {0,...,M-1}} g_i(y, λ) dy

where g_i(y, λ) := Σ_{j=0}^{N-1} λ_j C_ij p_j(y).

1. The existence of at least one deterministic decision rule solving LPF(λ) or LP(λ) implies the existence of an optimal partition of Y.
2. Since the problems LPF(λ) and LP(λ) always yield at least one deterministic decision rule, another way to think of these problems is: find a partition of the observation space that minimizes the weighted risk.

Infinite Observation Spaces: Putting it All Together

A decision rule that solves LP(λ) is then

    ρ*_i(y) = { 1 if i = arg min_{m ∈ {0,...,M-1}} g_m(y, λ)
                0 otherwise

where ρ*(y) = [ρ*_0(y), ..., ρ*_{M-1}(y)]^T. Remarks:

1. The solution to LP(λ) may not be unique.
2. If there is more than one deterministic solution to LP(λ), then there is also a corresponding convex hull of randomized decision rules that achieve r*(λ).
3. Note that the standard convention for deterministic decision rules is δ : Y → Z. An equivalent way of specifying a deterministic decision rule that solves LP(λ) is

    δ*(y) = arg min_{i ∈ {0,...,M-1}} g_i(y, λ).

The Bayesian Approach

We assume a prior state distribution π ∈ P_N such that

    Prob(state is x_j) = π_j

Note that this is our model of the state probabilities prior to the observation. We define the Bayes risk of the decision rule ρ as

    r(ρ, π) = Σ_{j=0}^{N-1} π_j R_j(ρ) = Σ_{j=0}^{N-1} π_j ∫_{y ∈ Y} Σ_{i=0}^{M-1} p_j(y) C_ij ρ_i(y) dy.

This is simply the weighted overall risk, or average risk, given our prior belief of the state probabilities. A decision rule that minimizes this risk is called a Bayes decision rule for the prior π.

Solving Bayesian Hypothesis Testing Problems

We know how to solve this problem. For finite Y, we just find the index

    m_l = arg min_{i ∈ {0,...,M-1}} Σ_{j=0}^{N-1} π_j C_ij P_lj    (the inner sum is G_il)

and set D^{Bπ}_{m_l, l} = 1 and D^{Bπ}_{i, l} = 0 for all i ≠ m_l. For infinite Y, we just find the index

    m_{y_0} = arg min_{i ∈ {0,...,M-1}} Σ_{j=0}^{N-1} π_j C_ij p_j(y_0)    (the inner sum is g_i(y_0, π))

and set δ^{Bπ}(y_0) = m_{y_0} (or, equivalently, set ρ^{Bπ}_{m_{y_0}}(y_0) = 1 and ρ^{Bπ}_i(y_0) = 0 for all i ≠ m_{y_0}).

Prior and Posterior Probabilities

The conditional probability that H_j is true given the observation y is

    π_j(y) := Prob(H_j is true | Y = y) = p_j(y) π_j / p(y)

where

    p(y) = Σ_{j=0}^{N-1} π_j p_j(y).

Recall that π_j is the prior probability of H_j, before we have any observations. The quantity π_j(y) is the posterior probability of H_j, conditioned on the observation y.
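Computing the posteriors from the priors and likelihoods is one application of Bayes' rule. A small illustrative sketch (the helper name posteriors is mine):

```python
import numpy as np

def posteriors(pi, likelihoods):
    """pi_j(y) = pi_j p_j(y) / p(y), where p(y) = sum_j pi_j p_j(y).
    likelihoods[j] holds p_j(y) evaluated at the observed y."""
    joint = np.asarray(pi) * np.asarray(likelihoods)
    return joint / joint.sum()          # normalize by the evidence p(y)

print(posteriors([0.5, 0.5], [0.2, 0.6]))  # [0.25 0.75]
```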

A Bayes Decision Rule Minimizes the Posterior Cost

General expression for a deterministic Bayes decision rule:

    δ^{Bπ}(y) = arg min_{i ∈ {0,...,M-1}} Σ_{j=0}^{N-1} π_j C_ij p_j(y)

But π_j p_j(y) = π_j(y) p(y), hence

    δ^{Bπ}(y) = arg min_{i ∈ {0,...,M-1}} Σ_{j=0}^{N-1} C_ij π_j(y) p(y) = arg min_{i ∈ {0,...,M-1}} Σ_{j=0}^{N-1} C_ij π_j(y)

since p(y) does not affect the minimizer. What does this mean? The quantity Σ_j C_ij π_j(y) is the average cost of choosing hypothesis H_i given Y = y, i.e. the posterior cost of choosing H_i. The Bayes decision rule chooses the hypothesis that yields the minimum expected posterior cost.

Bayesian Hypothesis Testing with UCA: Part 1

The uniform cost assignment (UCA):

    C_ij = { 0 if x_j ∈ H_i
             1 otherwise

The conditional risk R_j(ρ) under the UCA is simply the probability of not choosing the hypothesis that contains x_j, i.e. the probability of error when the state is x_j. The Bayes risk in this case is

    r(ρ, π) = Σ_{j=0}^{N-1} π_j R_j(ρ) = Prob(error).

Bayesian Hypothesis Testing with UCA: Part 2

Under the UCA, a Bayes decision rule can be written in terms of the posterior probabilities as

    δ^{Bπ}(y) = arg min_{i ∈ {0,...,M-1}} Σ_{j=0}^{N-1} C_ij π_j(y)
              = arg min_{i ∈ {0,...,M-1}} Σ_{j : x_j ∉ H_i} π_j(y)
              = arg min_{i ∈ {0,...,M-1}} [ 1 − Σ_{j : x_j ∈ H_i} π_j(y) ]
              = arg max_{i ∈ {0,...,M-1}} Σ_{j : x_j ∈ H_i} π_j(y)

Hence, for hypothesis tests under the UCA, the Bayes decision rule is the MAP (maximum a posteriori) decision rule. When the hypothesis test is simple, δ^{Bπ}(y) = arg max_i π_i(y).
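The MAP rule under the UCA can be sketched as follows, for a hypothetical partition of the states into hypotheses (the function name and data layout are assumptions for illustration):

```python
import numpy as np

def map_decision(post, hypotheses):
    """MAP rule under the UCA: choose the hypothesis H_i whose member
    states carry the largest total posterior probability.
    post: length-N array of posteriors pi_j(y).
    hypotheses: hypotheses[i] lists the state indices j with x_j in H_i."""
    scores = [sum(post[j] for j in H) for H in hypotheses]
    return int(np.argmax(scores))

# Toy composite test: states {x_0, x_1} form H_0 and {x_2, x_3} form H_1
post = np.array([0.1, 0.2, 0.3, 0.4])
print(map_decision(post, [[0, 1], [2, 3]]))  # 1, since 0.3 + 0.4 > 0.1 + 0.2
```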

Bayesian Hypothesis Testing with UCA and Uniform Prior

Under the UCA and a uniform prior, i.e. π_j = 1/N for all j = 0, ..., N-1,

    δ^{Bπ}(y) = arg max_{i ∈ {0,...,M-1}} Σ_{j : x_j ∈ H_i} π_j(y)
              = arg max_{i ∈ {0,...,M-1}} Σ_{j : x_j ∈ H_i} π_j p_j(y)
              = arg max_{i ∈ {0,...,M-1}} Σ_{j : x_j ∈ H_i} p_j(y)

since π_j = 1/N does not affect the maximizer. In this case, δ^{Bπ}(y) selects the most likely hypothesis, i.e. the hypothesis which best explains the observation y. This is called the maximum likelihood (ML) decision rule. Which is better, MAP or ML?

Simple Binary Bayesian Hypothesis Testing: Part 1

We have two states x_0 and x_1 and two hypotheses H_0 and H_1. For each y ∈ Y, our problem is to compute

    m_y = arg min_{i ∈ {0,1}} g_i(y, π)

where

    g_0(y, π) = π_0 C_00 p_0(y) + π_1 C_01 p_1(y)
    g_1(y, π) = π_0 C_10 p_0(y) + π_1 C_11 p_1(y)

We only have two things to compare, so we can simplify this comparison (deciding H_1 when the left side exceeds the right on each line):

    g_0(y, π) ≷ g_1(y, π)
    π_0 C_00 p_0(y) + π_1 C_01 p_1(y) ≷ π_0 C_10 p_0(y) + π_1 C_11 p_1(y)
    p_1(y) π_1 (C_01 − C_11) ≷ p_0(y) π_0 (C_10 − C_00)
    p_1(y) / p_0(y) ≷ π_0 (C_10 − C_00) / [π_1 (C_01 − C_11)]

where we have assumed that C_01 > C_11 to get the final result. The expression L(y) := p_1(y)/p_0(y) is known as the likelihood ratio.

Simple Binary Bayesian Hypothesis Testing: Part 2

Given L(y) := p_1(y)/p_0(y) and τ := π_0 (C_10 − C_00) / [π_1 (C_01 − C_11)], our Bayes decision rule is then simply

    δ^{Bπ}(y) = { 1 if L(y) > τ
                  0/1 if L(y) = τ
                  0 if L(y) < τ

Remark: For any y ∈ Y that results in L(y) = τ, the Bayes risk is the same whether we decide H_0 or H_1. You can deal with this by always deciding H_0 in this case, or always deciding H_1, or flipping a coin, etc.
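A minimal sketch of this likelihood ratio test, assuming C_01 > C_11 as above and breaking ties toward H_0 (one of the equally optimal choices); the function name and the Gaussian example densities are mine:

```python
from math import exp, pi, sqrt

def bayes_lrt(p0, p1, y, pi0, C):
    """Simple binary Bayes rule: decide 1 iff L(y) = p1(y)/p0(y) > tau,
    where tau = pi0 (C10 - C00) / (pi1 (C01 - C11)).  C[i][j] is the cost
    of deciding H_i when the state is x_j; assumes C[0][1] > C[1][1]."""
    tau = pi0 * (C[1][0] - C[0][0]) / ((1 - pi0) * (C[0][1] - C[1][1]))
    return 1 if p1(y) / p0(y) > tau else 0   # ties broken toward H_0

# Unit-variance Gaussian densities centered at a0 = 0 and a1 = 1, UCA costs
p0 = lambda y: exp(-y**2 / 2) / sqrt(2 * pi)
p1 = lambda y: exp(-(y - 1)**2 / 2) / sqrt(2 * pi)
print(bayes_lrt(p0, p1, 0.9, 0.5, [[0, 1], [1, 0]]))  # 1
```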

Performance of Simple Binary Bayesian Hypothesis Testing

Since the observation Y is a random variable, so is the likelihood ratio L(Y). Assume that L(Y) has a conditional pdf Λ_j(t) when the state is x_j. When the state is x_0, the probability of making an error (deciding H_1) is

    P_false-positive = Prob(L(Y) > τ | state is x_0) = ∫_τ^∞ Λ_0(t) dt

Similarly, when the state is x_1, the probability of making an error (deciding H_0) is

    P_false-negative = Prob(L(Y) ≤ τ | state is x_1) = ∫_0^τ Λ_1(t) dt

What is the overall (unconditional) probability of error?

Simple Binary Bayesian Hypothesis Testing with UCA

Uniform cost assignment: C_00 = C_11 = 0 and C_01 = C_10 = 1. In this case, the discriminant functions are simply

    g_0(y, π) = π_1 p_1(y) = π_1(y) p(y)
    g_1(y, π) = π_0 p_0(y) = π_0(y) p(y)

and a Bayes decision rule can be written in terms of the posterior probabilities as

    δ^{Bπ}(y) = { 1 if π_1(y) > π_0(y)
                  0/1 if π_1(y) = π_0(y)
                  0 if π_1(y) < π_0(y)

In this case, it should be clear that the Bayes decision rule is the MAP (maximum a posteriori) decision rule.

Example: Coherent Detection of BPSK

Suppose a transmitter sends one of two scalar signals and the signals arrive at a receiver corrupted by zero-mean additive white Gaussian noise (AWGN) with variance σ². We want to use Bayesian hypothesis testing to determine which signal was sent. Signal model conditioned on state x_j:

    Y = a_j + η

where a_j is the scalar signal and η ~ N(0, σ²). Hence

    p_j(y) = (1/√(2πσ²)) exp(−(y − a_j)² / (2σ²))

Hypotheses:

    H_0 : a_0 was sent, i.e. Y ~ N(a_0, σ²)
    H_1 : a_1 was sent, i.e. Y ~ N(a_1, σ²)

Example: Coherent Detection of BPSK

This is just a simple binary hypothesis testing problem. Under the UCA, we can write

    R_0(ρ) = ∫_{Y_1} (1/√(2πσ²)) exp(−(y − a_0)² / (2σ²)) dy
    R_1(ρ) = ∫_{Y_0} (1/√(2πσ²)) exp(−(y − a_1)² / (2σ²)) dy

where Y_j = {y ∈ Y : ρ_j(y) = 1}. Intuitively, the decision regions partition the real line around the signal points: observations near a_0 fall in Y_0 and observations near a_1 fall in Y_1.

Example: Coherent Detection of BPSK

Given a prior π_0, π_1 = 1 − π_0, we can write the Bayes risk as

    r(ρ, π) = π_0 R_0(ρ) + (1 − π_0) R_1(ρ)

The Bayes decision rule can be found by computing the likelihood ratio and comparing it to the threshold τ (each line below is the condition for deciding H_1):

    p_1(y) / p_0(y) > τ
    exp(−(y − a_1)² / (2σ²)) / exp(−(y − a_0)² / (2σ²)) > π_0 / π_1
    [(y − a_0)² − (y − a_1)²] / (2σ²) > ln(π_0 / π_1)
    y > (a_0 + a_1)/2 + (σ² / (a_1 − a_0)) ln(π_0 / π_1)

where the last step assumes a_1 > a_0.

Example: Coherent Detection of BPSK

Let γ := (a_0 + a_1)/2 + (σ² / (a_1 − a_0)) ln(π_0 / π_1). Our Bayes decision rule is then:

    δ^{Bπ}(y) = { 1 if y > γ
                  0/1 if y = γ
                  0 if y < γ

Using this decision rule, the Bayes risk is then

    r(δ^{Bπ}, π) = π_0 Q((γ − a_0)/σ) + (1 − π_0) Q((a_1 − γ)/σ)

where Q(x) := ∫_x^∞ (1/√(2π)) e^{−t²/2} dt. Remarks:

When π_0 = π_1 = 1/2, the decision boundary is simply (a_0 + a_1)/2, i.e. the midpoint between a_0 and a_1, and r(δ^{Bπ}, π) = Q((a_1 − a_0)/(2σ)).
When σ is very small, the prior has little effect on the decision boundary. When σ is very large, the prior becomes more important.
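The threshold γ and the resulting Bayes risk are easy to evaluate numerically. A sketch using math.erf to implement Q(x) (the function names are mine):

```python
from math import erf, log, sqrt

def Q(x):
    """Gaussian tail probability Q(x) = Prob(N(0, 1) > x)."""
    return 0.5 * (1.0 - erf(x / sqrt(2.0)))

def bpsk_bayes(a0, a1, sigma, pi0):
    """Threshold gamma and Bayes risk for coherent BPSK under the UCA
    (assumes a1 > a0 and 0 < pi0 < 1)."""
    gamma = (a0 + a1) / 2 + sigma**2 / (a1 - a0) * log(pi0 / (1 - pi0))
    risk = pi0 * Q((gamma - a0) / sigma) + (1 - pi0) * Q((a1 - gamma) / sigma)
    return gamma, risk

# Equal priors: boundary at the midpoint, risk = Q((a1 - a0) / (2 sigma))
gamma, risk = bpsk_bayes(-1.0, 1.0, 1.0, 0.5)
print(gamma)           # 0.0
print(round(risk, 4))  # 0.1587
```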

Composite Binary Bayesian Hypothesis Testing

We have N > 2 states x_0, ..., x_{N-1} and two hypotheses H_0 and H_1 that form a valid partition of these states. For each y ∈ Y, we compute

    m_y = arg min_{i ∈ {0,1}} g_i(y, π), where g_i(y, π) = Σ_{j=0}^{N-1} π_j C_ij p_j(y)

As was the case for simple binary Bayesian hypothesis testing, we only have two things to compare. The comparison boils down to

    g_0(y, π) ≷ g_1(y, π)  ⟺  J(y) := [Σ_j π_j C_0j p_j(y)] / [Σ_j π_j C_1j p_j(y)] ≷ 1

Hence, a Bayes decision rule for the composite binary hypothesis testing problem is then simply

    δ^{Bπ}(y) = { 1 if J(y) > 1
                  0/1 if J(y) = 1
                  0 if J(y) < 1

Composite Binary Bayesian Hypothesis Testing with UCA

For i = 0, 1 and j = 0, ..., N-1, the UCA is:

    C_ij = { 0 if x_j ∈ H_i
             1 otherwise

Hence, the discriminant functions are

    g_i(y, π) = Σ_{j : x_j ∈ H_i^c} π_j p_j(y) = [ 1 − Σ_{j : x_j ∈ H_i} π_j(y) ] p(y)

Under the UCA, the hypothesis test comparison boils down to

    g_0(y, π) ≷ g_1(y, π)  ⟺  J(y) := [Σ_{j : x_j ∈ H_1} π_j(y)] / [Σ_{j : x_j ∈ H_0} π_j(y)] ≷ 1

and the Bayes decision rule δ^{Bπ}(y) is the same as before. Note that

    J(y) = [Σ_{j : x_j ∈ H_1} π_j(y)] / [Σ_{j : x_j ∈ H_0} π_j(y)] = Prob(H_1 | y) / Prob(H_0 | y)

so, as before, the Bayes decision rule is the MAP decision rule.
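The composite binary Bayes rule can be sketched directly from J(y), deciding H_1 when J(y) > 1 (ties broken toward H_0, which is equally optimal; the function name and the toy constant densities are illustrative):

```python
import numpy as np

def composite_bayes_decision(pi, densities, H1_states, y):
    """Composite binary Bayes rule under the UCA: decide H_1 iff
    J(y) = sum_{x_j in H_1} pi_j p_j(y) / sum_{x_j in H_0} pi_j p_j(y) > 1.
    pi: prior over the N states; densities[j] is the pdf p_j;
    H1_states: indices of the states belonging to H_1."""
    weighted = np.array([pi[j] * densities[j](y) for j in range(len(pi))])
    in_H1 = np.isin(np.arange(len(pi)), H1_states)
    J = weighted[in_H1].sum() / weighted[~in_H1].sum()
    return 1 if J > 1 else 0   # ties broken toward H_0

# Toy example: three states, constant density values at the observed y
densities = [lambda y: 1.0, lambda y: 0.5, lambda y: 2.0]
print(composite_bayes_decision([0.5, 0.25, 0.25], densities, [2], 0.3))  # 0
```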

Final Remarks on Bayesian Hypothesis Testing

1. A Bayes decision rule minimizes the Bayes risk r(ρ, π) = Σ_j π_j R_j(ρ) under the prior state distribution π.
2. At least one deterministic decision rule δ^{Bπ}(y) achieves the minimum Bayes risk for any prior state distribution π.
3. The prior state distribution π can also be thought of as a parameter. By varying π, we can generate a family of Bayes decision rules δ^{Bπ} parameterized by π. This family of Bayes decision rules can achieve any operating point on the optimal tradeoff surface (or ROC curve).
4. In some applications, a prior state distribution is difficult to determine.

Bayesian Hypothesis Testing with an Incorrect Prior

[Figure: conditional risk vectors (R_0, R_1) of the deterministic decision rules, with level-set lines marking the minimum Bayes risk under our assumed prior [0.3, 0.7], the minimum Bayes risk under the true prior [0.7, 0.3], and the actual Bayes risk achieved by our rule; the values r = 0.2339 and r = 0.3812 are marked.]

Conclusions and Motivation for the Minimax Criterion

If the true prior state distribution π_true is far from our belief π, a Bayes detector δ^{Bπ} may suffer a significant loss in performance. One way to select a decision rule that is robust with respect to uncertainty in the prior is to minimize

    max_π r(ρ, π) = max_j R_j(ρ).

The minimax criterion leads to a decision rule that minimizes the Bayes risk for the least favorable prior distribution. This criterion is conservative, but it allows one to guarantee performance over all possible priors.