ECE531 Lecture 2b: Bayesian Hypothesis Testing


D. Richard Brown III
Worcester Polytechnic Institute
29-January-2009

Minimizing a Weighted Sum of Conditional Risks

We would like to minimize all of the conditional risks R_0(D), ..., R_{N-1}(D), but usually we have to make some sort of tradeoff. A standard approach to solving a multi-objective optimization problem is to form a linear combination of the functions to be minimized:

    r(D, λ) := Σ_{j=0}^{N-1} λ_j R_j(D) = λ^T R(D)

where λ_j ≥ 0 and Σ_j λ_j = 1. The problem is then to solve the single-objective optimization problem:

    D* = arg min_{D ∈ 𝒟} r(D, λ) = arg min_{D ∈ 𝒟} Σ_{j=0}^{N-1} λ_j c_j^T D p_j

where we are focusing on the case of finite Y for now.

Minimizing a Weighted Sum of Conditional Risks

The problem

    D* = arg min_{D ∈ 𝒟} r(D, λ) = arg min_{D ∈ 𝒟} Σ_{j=0}^{N-1} λ_j c_j^T D p_j

is a finite-dimensional linear programming problem, since the variables that we control (the D_ij) are constrained by the linear inequalities and equalities

    D_ij ≥ 0 for all i ∈ {0, ..., M-1} and j ∈ {0, ..., L-1}
    Σ_{i=0}^{M-1} D_ij = 1 for all j ∈ {0, ..., L-1}

and the objective function r(D, λ) is linear in the variables D_ij. For notational convenience, we call this minimization problem LPF(λ).

Decomposition: Part 1

Since

    R_j(D) = c_j^T D p_j = Σ_{i=0}^{M-1} Σ_{l=0}^{L-1} D_il P_lj C_ij

the problem LPF(λ) can be written as

    D* = arg min_{D ∈ 𝒟} Σ_{j=0}^{N-1} λ_j Σ_{i=0}^{M-1} Σ_{l=0}^{L-1} D_il P_lj C_ij
       = arg min_{D ∈ 𝒟} Σ_{l=0}^{L-1} [ Σ_{i=0}^{M-1} D_il Σ_{j=0}^{N-1} λ_j C_ij P_lj ]

where the bracketed term is denoted r_l(d, λ). Note that we can minimize each term r_l(d, λ) separately.

Decomposition: Part 2

The l-th subproblem is then

    d* = arg min_{d ∈ R^M} r_l(d, λ) = arg min_{d ∈ R^M} Σ_{i=0}^{M-1} d_i Σ_{j=0}^{N-1} λ_j C_ij P_lj

subject to the constraints that d_i ≥ 0 and Σ_i d_i = 1. Note that d_i corresponds to D_il in the original LPF(λ) problem. For notational convenience, we call this subproblem LPF_l(λ).

Some Intuition: The Hiker

Suppose you are going on a hike with a one-liter drinking bottle and you must completely fill the bottle before your hike. At the general store, you can fill your bottle with any combination of:

    Tap water: $0.25 per liter
    Iced tea: $0.50 per liter
    Soda: $1.00 per liter
    Red Bull: $5.00 per liter

You want to minimize your cost for the hike. What should you do? We can define a per-liter cost vector h = [0.25, 0.50, 1.00, 5.00]^T and a proportional purchase vector ν ∈ R^4 with elements satisfying ν_i ≥ 0 and Σ_i ν_i = 1. Clearly, your cost h^T ν is minimized when ν = [1, 0, 0, 0]^T.

Solution to the Subproblem LPF_l(λ)

Theorem. For fixed h ∈ R^n, the function h^T ν is minimized over the set {ν ∈ R^n : ν_i ≥ 0 and Σ_i ν_i = 1} by the vector ν* with ν*_m = 1 and ν*_k = 0 for all k ≠ m, where m is any index with h_m = min_{i ∈ {0,...,n-1}} h_i.

Note that, if two or more commodities have the same minimum price, the solution is not unique.

Back to our subproblem LPF_l(λ):

    d* = arg min_{d ∈ R^M} Σ_{i=0}^{M-1} d_i Σ_{j=0}^{N-1} λ_j C_ij P_lj

subject to the constraints that d_i ≥ 0 and Σ_i d_i = 1. The theorem tells us how to easily solve LPF_l(λ). We just have to find the index

    m_l = arg min_{i ∈ {0,...,M-1}} Σ_{j=0}^{N-1} λ_j C_ij P_lj

and then set d*_{m_l} = 1 and d*_i = 0 for all i ≠ m_l.
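The index-selection recipe above is straightforward to implement. Below is a minimal NumPy sketch; the function name and array layout are illustrative assumptions, not part of the lecture:

```python
import numpy as np

def solve_lpf_subproblem(lam, C, P, l):
    """Solve LPF_l(lambda) for observation y_l.
    lam: (N,) risk weights; C: (M, N) costs C[i, j]; P: (L, N) with
    P[l, j] = Prob(Y = y_l | state x_j).  (Layout is an assumption.)"""
    # Weighted cost of deciding H_i: sum_j lam[j] * C[i, j] * P[l, j]
    costs = C @ (lam * P[l])
    m_l = int(np.argmin(costs))        # index of the cheapest decision
    d = np.zeros(C.shape[0])
    d[m_l] = 1.0                       # deterministic: always decide H_{m_l}
    return d, float(costs[m_l])
```

The returned vector d is one column of the optimal decision matrix; assembling the columns for l = 0, ..., L-1 solves the full problem LPF(λ).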

Solution to the Problem LPF(λ)

Solving the subproblem LPF_l(λ) for l = 0, ..., L-1 and assembling the results of the subproblems into M × L decision matrices yields a non-empty, finite set of deterministic decision matrices that solve LPF(λ). All deterministic decision matrices in the set have the same minimal cost:

    r*(λ) = Σ_{l=0}^{L-1} min_i Σ_{j=0}^{N-1} λ_j C_ij P_lj

It is easy to show that any randomized decision rule in the convex hull of this set of deterministic decision matrices also achieves r*(λ), and hence is also optimal.

Example Part 1: Bayesian Hypothesis Testing

Let G_il := Σ_j λ_j C_ij P_lj and let G ∈ R^{M×L} be the matrix composed of the elements G_il. Suppose

    G = [ 0.3  0.5  0.2  0.8
          0.4  0.2  0.1  0.5
          0.5  0.1  0.7  0.6 ]

What is the solution to the subproblem LPF_0(λ)? d* = [1, 0, 0]^T.

What is the solution to the problem LPF(λ)?

    D* = [ 1  0  0  0
           0  0  1  1
           0  1  0  0 ]

Is this solution unique? Yes. What is r*(λ)? Summing the column minima, r*(λ) = 0.3 + 0.1 + 0.1 + 0.5 = 1.0.
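The example can be reproduced numerically: a short sketch that finds the column-wise minima of G, assembles the deterministic decision matrix D*, and sums the column minima to get r*(λ) (the variable names are mine):

```python
import numpy as np

# G[i, l] = sum_j lam[j] * C[i, j] * P[l, j], taken from the example above
G = np.array([[0.3, 0.5, 0.2, 0.8],
              [0.4, 0.2, 0.1, 0.5],
              [0.5, 0.1, 0.7, 0.6]])

m = G.argmin(axis=0)               # cheapest decision for each observation y_l
D = np.zeros_like(G)
D[m, np.arange(G.shape[1])] = 1.0  # deterministic decision matrix D*
r_star = G.min(axis=0).sum()       # minimal weighted risk r*(lambda)

print(m.tolist())  # [0, 2, 1, 1]
print(r_star)      # 1.0
```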

Example Part 2: Bayesian Hypothesis Testing

What happens if

    G = [ 0.3  0.5  0.2  0.8
          0.4  0.2  0.2  0.5
          0.5  0.1  0.7  0.6 ] ?

The solution to LPF_2(λ) is no longer unique. Both

    D* = [ 1  0  0  0            D* = [ 1  0  1  0
           0  0  1  1     or            0  0  0  1
           0  1  0  0 ]                 0  1  0  0 ]

solve LPF(λ) and achieve r*(λ) = 1.1. In fact, any decision matrix of the form

    D* = [ 1  0  α    0
           0  0  1-α  1
           0  1  0    0 ]

for α ∈ [0, 1] will also achieve r*(λ) = 1.1.

Remarks

Note that the solution to the subproblem LPF_l(λ) always yields at least one deterministic decision rule for the case y = y_l. If n_l elements of the column G_l are equal to the minimum of that column, then there are n_l deterministic decision rules solving the subproblem LPF_l(λ), and hence Π_l n_l ≥ 1 deterministic decision rules that solve LPF(λ). If there is more than one deterministic solution to LPF(λ), then there is also a corresponding convex hull of randomized decision rules that achieve r*(λ).

Working Example: Effect of Changing the Weighting λ

[Figure: scatter plot of the conditional risk pairs (R_0, R_1) of the deterministic decision rules, both axes from 0 to 1, with the rules D_0, D_8, D_12, D_14, and D_15 labeled along the optimal tradeoff boundary traced out as λ varies.]

Weighted Risk Level Sets

Notice in the last example that, by varying our risk weighting λ, we can create a family of level sets that cover every point of the optimal tradeoff surface. Given a constant c ∈ R, the level set of value c is defined as

    L_c^λ := {x ∈ R^N : λ^T x = c}

[Figure: level sets in the (x_0, x_1) plane for λ = [0.7, 0.3]^T and λ = [0.4, 0.6]^T; each level set is a line of constant λ^T x, with slope determined by λ.]

Infinite Observation Spaces: Part 1

When the observations are specified by conditional pdfs, our conditional risk function becomes

    R_j(ρ) = ∫_{y ∈ Y} [ Σ_{i=0}^{M-1} ρ_i(y) C_ij ] p_j(y) dy

Recall that the randomized decision rule ρ_i(y) specifies the probability of deciding H_i when the observation is y. We still have N < ∞ states, so, as before, we can write a linear combination of the conditional risk functions as

    r(ρ, λ) := Σ_{j=0}^{N-1} λ_j R_j(ρ)

where λ_j ≥ 0 and Σ_j λ_j = 1.

Infinite Observation Spaces: Part 2

The problem is then to solve

    ρ* = arg min_{ρ ∈ 𝒟} r(ρ, λ) = arg min_{ρ ∈ 𝒟} Σ_{j=0}^{N-1} λ_j ∫_{y ∈ Y} [ Σ_{i=0}^{M-1} ρ_i(y) C_ij ] p_j(y) dy

where 𝒟 is the set of all valid decision rules satisfying ρ_i(y) ≥ 0 for all i = 0, ..., M-1 and all y ∈ Y, as well as Σ_{i=0}^{M-1} ρ_i(y) = 1 for all y ∈ Y. This is an infinite-dimensional linear programming problem. We refer to this problem as LP(λ).

Infinite Observation Spaces: Decomposition Part 1

As before, we can decompose the big problem LP(λ) into lots of little subproblems, one for each observation y ∈ Y. We rewrite LP(λ) as

    ρ* = arg min_{ρ ∈ 𝒟} Σ_{j=0}^{N-1} λ_j ∫_{y ∈ Y} [ Σ_{i=0}^{M-1} ρ_i(y) C_ij ] p_j(y) dy
       = arg min_{ρ ∈ 𝒟} ∫_{y ∈ Y} Σ_{i=0}^{M-1} ρ_i(y) [ Σ_{j=0}^{N-1} λ_j C_ij p_j(y) ] dy

To minimize the integral, we can minimize the term

    (...) = Σ_{i=0}^{M-1} ρ_i(y) [ Σ_{j=0}^{N-1} λ_j C_ij p_j(y) ]

at each fixed value of y.

Infinite Observation Spaces: Decomposition Part 2

The subproblem of minimizing the term (...) inside the integral when y = y_0 is a finite-dimensional linear programming problem (since X is still finite):

    ν* = arg min_ν Σ_{i=0}^{M-1} ν_i Σ_{j=0}^{N-1} λ_j C_ij p_j(y_0)

subject to ν_i ≥ 0 for all i = 0, ..., M-1 and Σ_{i=0}^{M-1} ν_i = 1. We already know how to do this. We just find the index that has the lowest commodity cost,

    m_{y_0} = arg min_{i ∈ {0,...,M-1}} Σ_{j=0}^{N-1} λ_j C_ij p_j(y_0)

and set ν*_{m_{y_0}} = 1 and all other ν*_i = 0. This is the same technique used to solve LPF_l(λ).

Infinite Observation Spaces: Minimum Weighted Risk

As we saw with finite Y, solving LP(λ) will yield at least one deterministic decision rule that achieves the minimum weighted risk

    r*(λ) = ∫_{y ∈ Y} min_{i ∈ {0,...,M-1}} Σ_{j=0}^{N-1} λ_j C_ij p_j(y) dy = ∫_{y ∈ Y} min_{i ∈ {0,...,M-1}} g_i(y, λ) dy

where g_i(y, λ) := Σ_{j=0}^{N-1} λ_j C_ij p_j(y).

1. The existence of at least one deterministic decision rule solving LPF(λ) or LP(λ) implies the existence of an optimal partition of Y.
2. Since the problems LPF(λ) and LP(λ) always yield at least one deterministic decision rule, another way to think of these problems is: find a partition of the observation space that minimizes the weighted risk.

Infinite Observation Spaces: Putting it All Together

A decision rule that solves LP(λ) is then

    ρ*_i(y) = { 1 if i = arg min_{m ∈ {0,...,M-1}} g_m(y, λ)
                0 otherwise

where ρ*(y) = [ρ*_0(y), ..., ρ*_{M-1}(y)]^T. Remarks:

1. The solution to LP(λ) may not be unique.
2. If there is more than one deterministic solution to LP(λ), then there is also a corresponding convex hull of randomized decision rules that achieve r*(λ).
3. Note that the standard convention for deterministic decision rules is δ : Y → Z. An equivalent way of specifying a deterministic decision rule that solves LP(λ) is

    δ*(y) = arg min_{i ∈ {0,...,M-1}} g_i(y, λ).

The Bayesian Approach

We assume a prior state distribution π ∈ P_N such that

    Prob(state is x_j) = π_j

Note that this is our model of the state probabilities prior to the observation. We define the Bayes risk of the decision rule ρ as

    r(ρ, π) = Σ_{j=0}^{N-1} π_j R_j(ρ) = Σ_{j=0}^{N-1} π_j ∫_{y ∈ Y} Σ_{i=0}^{M-1} p_j(y) C_ij ρ_i(y) dy.

This is simply the weighted overall risk, or average risk, given our prior belief of the state probabilities. A decision rule that minimizes this risk is called a Bayes decision rule for the prior π.

Solving Bayesian Hypothesis Testing Problems

We know how to solve this problem. For finite Y, we just find the index

    m_l = arg min_{i ∈ {0,...,M-1}} Σ_{j=0}^{N-1} π_j C_ij P_lj    (the inner sum is G_il)

and set D^{Bπ}_{m_l, l} = 1 and D^{Bπ}_{i, l} = 0 for all i ≠ m_l. For infinite Y, we just find the index

    m_{y_0} = arg min_{i ∈ {0,...,M-1}} Σ_{j=0}^{N-1} π_j C_ij p_j(y_0)    (the inner sum is g_i(y_0, π))

and set δ^{Bπ}(y_0) = m_{y_0} (or, equivalently, set ρ^{Bπ}_{m_{y_0}}(y_0) = 1 and ρ^{Bπ}_i(y_0) = 0 for all i ≠ m_{y_0}).

Prior and Posterior Probabilities

The conditional probability that H_j is true given the observation y is

    π_j(y) := Prob(H_j is true | Y = y) = p_j(y) π_j / p(y)

where

    p(y) = Σ_{j=0}^{N-1} π_j p_j(y).

Recall that π_j is the prior probability of H_j, before we have any observations. The quantity π_j(y) is the posterior probability of H_j, conditioned on the observation y.
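Computing the posteriors from the priors and likelihoods is one application of Bayes' rule. A small illustrative sketch (the helper name posteriors is mine):

```python
import numpy as np

def posteriors(pi, likelihoods):
    """pi_j(y) = pi_j p_j(y) / p(y), where p(y) = sum_j pi_j p_j(y).
    likelihoods[j] holds p_j(y) evaluated at the observed y."""
    joint = np.asarray(pi) * np.asarray(likelihoods)
    return joint / joint.sum()          # normalize by the evidence p(y)

print(posteriors([0.5, 0.5], [0.2, 0.6]))  # [0.25 0.75]
```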

A Bayes Decision Rule Minimizes the Posterior Cost

General expression for a deterministic Bayes decision rule:

    δ^{Bπ}(y) = arg min_{i ∈ {0,...,M-1}} Σ_{j=0}^{N-1} π_j C_ij p_j(y)

But π_j p_j(y) = π_j(y) p(y), hence

    δ^{Bπ}(y) = arg min_{i ∈ {0,...,M-1}} Σ_{j=0}^{N-1} C_ij π_j(y) p(y) = arg min_{i ∈ {0,...,M-1}} Σ_{j=0}^{N-1} C_ij π_j(y)

since p(y) does not affect the minimizer. What does this mean? The quantity Σ_j C_ij π_j(y) is the average cost of choosing hypothesis H_i given Y = y, i.e. the posterior cost of choosing H_i. The Bayes decision rule chooses the hypothesis that yields the minimum expected posterior cost.

Bayesian Hypothesis Testing with UCA: Part 1

The uniform cost assignment (UCA):

    C_ij = { 0 if x_j ∈ H_i
             1 otherwise

The conditional risk R_j(ρ) under the UCA is simply the probability of not choosing the hypothesis that contains x_j, i.e. the probability of error when the state is x_j. The Bayes risk in this case is

    r(ρ, π) = Σ_{j=0}^{N-1} π_j R_j(ρ) = Prob(error).

Bayesian Hypothesis Testing with UCA: Part 2

Under the UCA, a Bayes decision rule can be written in terms of the posterior probabilities as

    δ^{Bπ}(y) = arg min_{i ∈ {0,...,M-1}} Σ_{j=0}^{N-1} C_ij π_j(y)
              = arg min_{i ∈ {0,...,M-1}} Σ_{j : x_j ∉ H_i} π_j(y)
              = arg min_{i ∈ {0,...,M-1}} [ 1 − Σ_{j : x_j ∈ H_i} π_j(y) ]
              = arg max_{i ∈ {0,...,M-1}} Σ_{j : x_j ∈ H_i} π_j(y)

Hence, for hypothesis tests under the UCA, the Bayes decision rule is the MAP (maximum a posteriori) decision rule. When the hypothesis test is simple, δ^{Bπ}(y) = arg max_i π_i(y).
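The MAP rule under the UCA can be sketched as follows, for a hypothetical partition of the states into hypotheses (the function name and data layout are assumptions for illustration):

```python
import numpy as np

def map_decision(post, hypotheses):
    """MAP rule under the UCA: choose the hypothesis H_i whose member
    states carry the largest total posterior probability.
    post: length-N array of posteriors pi_j(y).
    hypotheses: hypotheses[i] lists the state indices j with x_j in H_i."""
    scores = [sum(post[j] for j in H) for H in hypotheses]
    return int(np.argmax(scores))

# Toy composite test: states {x_0, x_1} form H_0 and {x_2, x_3} form H_1
post = np.array([0.1, 0.2, 0.3, 0.4])
print(map_decision(post, [[0, 1], [2, 3]]))  # 1, since 0.3 + 0.4 > 0.1 + 0.2
```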

Bayesian Hypothesis Testing with UCA and Uniform Prior

Under the UCA and a uniform prior, i.e. π_j = 1/N for all j = 0, ..., N-1,

    δ^{Bπ}(y) = arg max_{i ∈ {0,...,M-1}} Σ_{j : x_j ∈ H_i} π_j(y)
              = arg max_{i ∈ {0,...,M-1}} Σ_{j : x_j ∈ H_i} π_j p_j(y)
              = arg max_{i ∈ {0,...,M-1}} Σ_{j : x_j ∈ H_i} p_j(y)

since π_j = 1/N does not affect the maximizer. In this case, δ^{Bπ}(y) selects the most likely hypothesis, i.e. the hypothesis which best explains the observation y. This is called the maximum likelihood (ML) decision rule. Which is better, MAP or ML?

Simple Binary Bayesian Hypothesis Testing: Part 1

We have two states x_0 and x_1 and two hypotheses H_0 and H_1. For each y ∈ Y, our problem is to compute

    m_y = arg min_{i ∈ {0,1}} g_i(y, π)

where

    g_0(y, π) = π_0 C_00 p_0(y) + π_1 C_01 p_1(y)
    g_1(y, π) = π_0 C_10 p_0(y) + π_1 C_11 p_1(y)

We only have two things to compare, so we can simplify this comparison (deciding H_1 when the left side exceeds the right on each line):

    g_0(y, π) ≷ g_1(y, π)
    π_0 C_00 p_0(y) + π_1 C_01 p_1(y) ≷ π_0 C_10 p_0(y) + π_1 C_11 p_1(y)
    p_1(y) π_1 (C_01 − C_11) ≷ p_0(y) π_0 (C_10 − C_00)
    p_1(y) / p_0(y) ≷ π_0 (C_10 − C_00) / [π_1 (C_01 − C_11)]

where we have assumed that C_01 > C_11 to get the final result. The expression L(y) := p_1(y)/p_0(y) is known as the likelihood ratio.

Simple Binary Bayesian Hypothesis Testing: Part 2

Given L(y) := p_1(y)/p_0(y) and τ := π_0 (C_10 − C_00) / [π_1 (C_01 − C_11)], our Bayes decision rule is then simply

    δ^{Bπ}(y) = { 1 if L(y) > τ
                  0/1 if L(y) = τ
                  0 if L(y) < τ

Remark: For any y ∈ Y that results in L(y) = τ, the Bayes risk is the same whether we decide H_0 or H_1. You can deal with this by always deciding H_0 in this case, or always deciding H_1, or flipping a coin, etc.
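A minimal sketch of this likelihood ratio test, assuming C_01 > C_11 as above and breaking ties toward H_0 (one of the equally optimal choices); the function name and the Gaussian example densities are mine:

```python
from math import exp, pi, sqrt

def bayes_lrt(p0, p1, y, pi0, C):
    """Simple binary Bayes rule: decide 1 iff L(y) = p1(y)/p0(y) > tau,
    where tau = pi0 (C10 - C00) / (pi1 (C01 - C11)).  C[i][j] is the cost
    of deciding H_i when the state is x_j; assumes C[0][1] > C[1][1]."""
    tau = pi0 * (C[1][0] - C[0][0]) / ((1 - pi0) * (C[0][1] - C[1][1]))
    return 1 if p1(y) / p0(y) > tau else 0   # ties broken toward H_0

# Unit-variance Gaussian densities centered at a0 = 0 and a1 = 1, UCA costs
p0 = lambda y: exp(-y**2 / 2) / sqrt(2 * pi)
p1 = lambda y: exp(-(y - 1)**2 / 2) / sqrt(2 * pi)
print(bayes_lrt(p0, p1, 0.9, 0.5, [[0, 1], [1, 0]]))  # 1
```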

Performance of Simple Binary Bayesian Hypothesis Testing

Since the observation Y is a random variable, so is the likelihood ratio L(Y). Assume that L(Y) has a conditional pdf Λ_j(t) when the state is x_j. When the state is x_0, the probability of making an error (deciding H_1) is

    P_false-positive = Prob(L(Y) > τ | state is x_0) = ∫_τ^∞ Λ_0(t) dt

Similarly, when the state is x_1, the probability of making an error (deciding H_0) is

    P_false-negative = Prob(L(Y) ≤ τ | state is x_1) = ∫_0^τ Λ_1(t) dt

What is the overall (unconditional) probability of error?

Simple Binary Bayesian Hypothesis Testing with UCA

Uniform cost assignment: C_00 = C_11 = 0 and C_01 = C_10 = 1. In this case, the discriminant functions are simply

    g_0(y, π) = π_1 p_1(y) = π_1(y) p(y)
    g_1(y, π) = π_0 p_0(y) = π_0(y) p(y)

and a Bayes decision rule can be written in terms of the posterior probabilities as

    δ^{Bπ}(y) = { 1 if π_1(y) > π_0(y)
                  0/1 if π_1(y) = π_0(y)
                  0 if π_1(y) < π_0(y)

In this case, it should be clear that the Bayes decision rule is the MAP (maximum a posteriori) decision rule.

Example: Coherent Detection of BPSK

Suppose a transmitter sends one of two scalar signals and the signals arrive at a receiver corrupted by zero-mean additive white Gaussian noise (AWGN) with variance σ². We want to use Bayesian hypothesis testing to determine which signal was sent. Signal model conditioned on state x_j:

    Y = a_j + η

where a_j is the scalar signal and η ~ N(0, σ²). Hence

    p_j(y) = (1/√(2πσ²)) exp(−(y − a_j)² / (2σ²))

Hypotheses:

    H_0 : a_0 was sent, i.e. Y ~ N(a_0, σ²)
    H_1 : a_1 was sent, i.e. Y ~ N(a_1, σ²)

Example: Coherent Detection of BPSK

This is just a simple binary hypothesis testing problem. Under the UCA, we can write

    R_0(ρ) = ∫_{Y_1} (1/√(2πσ²)) exp(−(y − a_0)² / (2σ²)) dy
    R_1(ρ) = ∫_{Y_0} (1/√(2πσ²)) exp(−(y − a_1)² / (2σ²)) dy

where Y_j = {y ∈ Y : ρ_j(y) = 1}. Intuitively, the decision regions partition the real line around the signal points: observations near a_0 fall in Y_0 and observations near a_1 fall in Y_1.

Example: Coherent Detection of BPSK

Given a prior π_0, π_1 = 1 − π_0, we can write the Bayes risk as

    r(ρ, π) = π_0 R_0(ρ) + (1 − π_0) R_1(ρ)

The Bayes decision rule can be found by computing the likelihood ratio and comparing it to the threshold τ (each line below is the condition for deciding H_1):

    p_1(y) / p_0(y) > τ
    exp(−(y − a_1)² / (2σ²)) / exp(−(y − a_0)² / (2σ²)) > π_0 / π_1
    [(y − a_0)² − (y − a_1)²] / (2σ²) > ln(π_0 / π_1)
    y > (a_0 + a_1)/2 + (σ² / (a_1 − a_0)) ln(π_0 / π_1)

where the last step assumes a_1 > a_0.

Example: Coherent Detection of BPSK

Let γ := (a_0 + a_1)/2 + (σ² / (a_1 − a_0)) ln(π_0 / π_1). Our Bayes decision rule is then:

    δ^{Bπ}(y) = { 1 if y > γ
                  0/1 if y = γ
                  0 if y < γ

Using this decision rule, the Bayes risk is then

    r(δ^{Bπ}, π) = π_0 Q((γ − a_0)/σ) + (1 − π_0) Q((a_1 − γ)/σ)

where Q(x) := ∫_x^∞ (1/√(2π)) e^{−t²/2} dt. Remarks:

When π_0 = π_1 = 1/2, the decision boundary is simply (a_0 + a_1)/2, i.e. the midpoint between a_0 and a_1, and r(δ^{Bπ}, π) = Q((a_1 − a_0)/(2σ)).
When σ is very small, the prior has little effect on the decision boundary. When σ is very large, the prior becomes more important.
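The threshold γ and the resulting Bayes risk are easy to evaluate numerically. A sketch using math.erf to implement Q(x) (the function names are mine):

```python
from math import erf, log, sqrt

def Q(x):
    """Gaussian tail probability Q(x) = Prob(N(0, 1) > x)."""
    return 0.5 * (1.0 - erf(x / sqrt(2.0)))

def bpsk_bayes(a0, a1, sigma, pi0):
    """Threshold gamma and Bayes risk for coherent BPSK under the UCA
    (assumes a1 > a0 and 0 < pi0 < 1)."""
    gamma = (a0 + a1) / 2 + sigma**2 / (a1 - a0) * log(pi0 / (1 - pi0))
    risk = pi0 * Q((gamma - a0) / sigma) + (1 - pi0) * Q((a1 - gamma) / sigma)
    return gamma, risk

# Equal priors: boundary at the midpoint, risk = Q((a1 - a0) / (2 sigma))
gamma, risk = bpsk_bayes(-1.0, 1.0, 1.0, 0.5)
print(gamma)           # 0.0
print(round(risk, 4))  # 0.1587
```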

Composite Binary Bayesian Hypothesis Testing

We have N > 2 states x_0, ..., x_{N-1} and two hypotheses H_0 and H_1 that form a valid partition of these states. For each y ∈ Y, we compute

    m_y = arg min_{i ∈ {0,1}} g_i(y, π), where g_i(y, π) = Σ_{j=0}^{N-1} π_j C_ij p_j(y)

As was the case for simple binary Bayesian hypothesis testing, we only have two things to compare. The comparison boils down to

    g_0(y, π) ≷ g_1(y, π)  ⟺  J(y) := [Σ_j π_j C_0j p_j(y)] / [Σ_j π_j C_1j p_j(y)] ≷ 1

Hence, a Bayes decision rule for the composite binary hypothesis testing problem is then simply

    δ^{Bπ}(y) = { 1 if J(y) > 1
                  0/1 if J(y) = 1
                  0 if J(y) < 1

Composite Binary Bayesian Hypothesis Testing with UCA

For i = 0, 1 and j = 0, ..., N-1, the UCA is:

    C_ij = { 0 if x_j ∈ H_i
             1 otherwise

Hence, the discriminant functions are

    g_i(y, π) = Σ_{j : x_j ∈ H_i^c} π_j p_j(y) = [ 1 − Σ_{j : x_j ∈ H_i} π_j(y) ] p(y)

Under the UCA, the hypothesis test comparison boils down to

    g_0(y, π) ≷ g_1(y, π)  ⟺  J(y) := [Σ_{j : x_j ∈ H_1} π_j(y)] / [Σ_{j : x_j ∈ H_0} π_j(y)] ≷ 1

and the Bayes decision rule δ^{Bπ}(y) is the same as before. Note that

    J(y) = [Σ_{j : x_j ∈ H_1} π_j(y)] / [Σ_{j : x_j ∈ H_0} π_j(y)] = Prob(H_1 | y) / Prob(H_0 | y)

so, as before, the Bayes decision rule is the MAP decision rule.
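The composite binary Bayes rule can be sketched directly from J(y), deciding H_1 when J(y) > 1 (ties broken toward H_0, which is equally optimal; the function name and the toy constant densities are illustrative):

```python
import numpy as np

def composite_bayes_decision(pi, densities, H1_states, y):
    """Composite binary Bayes rule under the UCA: decide H_1 iff
    J(y) = sum_{x_j in H_1} pi_j p_j(y) / sum_{x_j in H_0} pi_j p_j(y) > 1.
    pi: prior over the N states; densities[j] is the pdf p_j;
    H1_states: indices of the states belonging to H_1."""
    weighted = np.array([pi[j] * densities[j](y) for j in range(len(pi))])
    in_H1 = np.isin(np.arange(len(pi)), H1_states)
    J = weighted[in_H1].sum() / weighted[~in_H1].sum()
    return 1 if J > 1 else 0   # ties broken toward H_0

# Toy example: three states, constant density values at the observed y
densities = [lambda y: 1.0, lambda y: 0.5, lambda y: 2.0]
print(composite_bayes_decision([0.5, 0.25, 0.25], densities, [2], 0.3))  # 0
```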

Final Remarks on Bayesian Hypothesis Testing

1. A Bayes decision rule minimizes the Bayes risk r(ρ, π) = Σ_j π_j R_j(ρ) under the prior state distribution π.
2. At least one deterministic decision rule δ^{Bπ}(y) achieves the minimum Bayes risk for any prior state distribution π.
3. The prior state distribution π can also be thought of as a parameter. By varying π, we can generate a family of Bayes decision rules δ^{Bπ} parameterized by π. This family of Bayes decision rules can achieve any operating point on the optimal tradeoff surface (or ROC curve).
4. In some applications, a prior state distribution is difficult to determine.

Bayesian Hypothesis Testing with an Incorrect Prior

[Figure: conditional risk vectors (R_0, R_1) of the deterministic decision rules, with level-set lines marking the minimum Bayes risk under our assumed prior [0.3, 0.7], the minimum Bayes risk under the true prior [0.7, 0.3], and the actual Bayes risk achieved by our rule; the values r = 0.2339 and r = 0.3812 are marked.]

Conclusions and Motivation for the Minimax Criterion

If the true prior state distribution π_true is far from our belief π, a Bayes detector δ^{Bπ} may suffer a significant loss in performance. One way to select a decision rule that is robust with respect to uncertainty in the prior is to minimize

    max_π r(ρ, π) = max_j R_j(ρ).

The minimax criterion leads to a decision rule that minimizes the Bayes risk for the least favorable prior distribution. This criterion is conservative, but it allows one to guarantee performance over all possible priors.