ECE 830 Spring 2017    Instructor: R. Willett

Lecture 5: Likelihood ratio tests, Neyman-Pearson detectors, ROC curves, and sufficient statistics

Executive summary

In the last lecture we saw that the likelihood ratio statistic is optimal for testing between two simple hypotheses: the test simply compares the likelihood ratio to a threshold. The optimal threshold is a function of the prior probabilities and the costs assigned to different errors. The choice of costs is subjective and depends on the nature of the problem, but the prior probabilities must be known. In practice, we face several questions:

1. Unfortunately, the prior probabilities are often not known precisely, and thus the correct setting for the threshold is unclear. How should we proceed?
2. What are the tradeoffs among different measures of error (e.g., probability of false alarm, probability of miss)?
3. Is the LRT still optimal for different error criteria?
4. Do we really need to store all the observed data, or can we get by with some summary statistic?

We will address these questions in these notes. To explore them, we will look at a simple example.

1 Example: Detecting ET

We observe a sampled radio signal from outer space. Is this just cosmic radiation/noise, or are we receiving a message from extra-terrestrial intelligence (ETI)?

    H_0 (no ETI):  x_i iid N(0, σ²),  i = 1, ..., n                  (1)
    H_1 (ETI):     x_i iid N(µ, σ²),  µ > 0,  i = 1, ..., n          (2)

Assume that σ² > 0 is known. The first hypothesis is simple: it involves a fixed and known distribution. The second hypothesis is simple if µ is known. However, if all we know is that µ > 0, then the second hypothesis is the composite of many alternative distributions, i.e., the collection {N(µ, σ²)}_{µ>0}. In this case, H_1 is called a composite hypothesis. The likelihood ratio test takes the form

    [∏_{i=1}^n (2πσ²)^{-1/2} e^{-(x_i - µ)²/(2σ²)}] / [∏_{i=1}^n (2πσ²)^{-1/2} e^{-x_i²/(2σ²)}]
        = [(2πσ²)^{-n/2} e^{-Σ_{i=1}^n (x_i - µ)²/(2σ²)}] / [(2πσ²)^{-n/2} e^{-Σ_{i=1}^n x_i²/(2σ²)}]  ≷_{H_0}^{H_1}  γ
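As a quick numerical sanity check of this test (a minimal sketch using numpy; the parameter values, seed, and threshold below are illustrative choices, not part of the notes), we can evaluate the raw likelihood ratio directly and confirm that it agrees with the simplified sum-based test derived in the next section:

```python
import numpy as np

def gaussian_likelihood(x, mean, sigma2):
    """Joint density of iid N(mean, sigma2) samples."""
    return np.prod(np.exp(-(x - mean) ** 2 / (2 * sigma2))
                   / np.sqrt(2 * np.pi * sigma2))

rng = np.random.default_rng(0)
mu, sigma2, n, gamma = 1.0, 1.0, 20, 2.0
x = rng.normal(mu, np.sqrt(sigma2), size=n)   # simulate data under H_1

# Raw LRT: compare the ratio of joint densities to the threshold gamma.
lrt_decision = (gaussian_likelihood(x, mu, sigma2)
                / gaussian_likelihood(x, 0.0, sigma2)) >= gamma

# Equivalent simplified test on the sum, with nu = (sigma^2/mu) ln(gamma) + n mu/2.
nu = (sigma2 / mu) * np.log(gamma) + n * mu / 2
assert lrt_decision == (x.sum() >= nu)
```

The assertion holds for any µ > 0 because taking logarithms and rearranging is a monotonic transformation of the same comparison.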
The inequality is preserved if we apply a monotonic transformation to both sides, so we can simplify the expression by taking the logarithm, giving us the log-likelihood ratio test

    (1/(2σ²)) (2µ Σ_{i=1}^n x_i − nµ²)  ≷_{H_0}^{H_1}  log(γ)

Assuming µ > 0, this is equivalent to

    Σ_{i=1}^n x_i  ≷_{H_0}^{H_1}  ν,

with ν = (σ²/µ) ln γ + nµ/2.

2 Sufficient Statistics

Our test statistic t := Σ_{i=1}^n x_i is called a sufficient statistic for the mean of a normal distribution. Let's rewrite our hypotheses in terms of the sufficient statistic:

    H_0 : t ∼ N(0, nσ²),                 (3)
    H_1 : t ∼ N(nµ, nσ²),  µ > 0         (4)

We call t a sufficient statistic because t is sufficient for performing our likelihood ratio test. More formally, a sufficient statistic is defined as follows:

Definition: Sufficient statistic
Let X be an n-dimensional random vector and let θ denote a p-dimensional parameter of the distribution of X. The statistic t := T(X) is a sufficient statistic for θ if and only if the conditional distribution of X given T(X) is independent of θ.

More details are available at http://willett.ece.wisc.edu/wp-uploads/2016/02/04-SuffStats.pdf.

If our data are drawn from an exponential family probability model parameterized by θ with the form

    p_θ(x) = exp[a(θ) b(x) + c(x) + d(θ)]

then t(x) = Σ_{i=1}^n b(x_i) is a sufficient statistic. Thus if we are faced with a hypothesis test of the form

    H_0 : x_i iid p_{θ_0}(x)             (5)
    H_1 : x_i iid p_{θ_1}(x)             (6)

then our test statistic is a simple function of the data, and we do not need to know the functions a, c, and d to compute it (though we may need these functions to choose an appropriate threshold).
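Sufficiency can be illustrated concretely (an illustrative sketch; the particular sample values and parameters are made up for this example): two different datasets with the same value of t = Σ_{i=1}^n x_i yield exactly the same likelihood ratio, so the test cannot distinguish them.

```python
import numpy as np

def gaussian_likelihood(x, mean, sigma2):
    """Joint density of iid N(mean, sigma2) samples."""
    return np.prod(np.exp(-(x - mean) ** 2 / (2 * sigma2))
                   / np.sqrt(2 * np.pi * sigma2))

# Two different datasets with the same sufficient statistic t = sum(x_i) = 4.0.
x1 = np.array([0.5, 1.5, 2.0])
x2 = np.array([2.0, 1.0, 1.0])

mu, sigma2 = 0.7, 1.3
lr1 = gaussian_likelihood(x1, mu, sigma2) / gaussian_likelihood(x1, 0.0, sigma2)
lr2 = gaussian_likelihood(x2, mu, sigma2) / gaussian_likelihood(x2, 0.0, sigma2)

# The likelihood ratio depends on the data only through t = sum(x), so
# the two (otherwise different) datasets produce identical ratios.
assert np.isclose(lr1, lr2)
```

This is exactly the exponential-family statement above with b(x) = x for the Gaussian with known variance.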
3 Choosing thresholds

Recall that our ideal threshold is ν = (σ²/µ) ln γ + nµ/2. However, in our ETI setting we have no way of knowing the prior probabilities π_0, π_1 needed to set γ, AND we have no way of knowing a good value for µ. To deal with this, consider an alternative design specification. Let's design a test that minimizes one type of error subject to a constraint on the other type of error. This constrained optimization criterion requires neither knowledge of prior probabilities nor cost assignments. It only requires a specification of the maximum allowable value for one type of error, which is sometimes even more natural than assigning costs to the different errors. A classic result due to Neyman and Pearson shows that the solution to this type of optimization is again a likelihood ratio test.

4 Neyman-Pearson Lemma

Assume that we observe a random variable distributed according to one of two distributions:

    H_0 : X ∼ p_0
    H_1 : X ∼ p_1

In many problems, H_0 is considered to be a sort of baseline or default model and is called the null hypothesis. H_1 is a different model and is called the alternative hypothesis. If a test chooses H_1 when in fact the data were generated by H_0, the error is called a false-positive or false-alarm, since we mistakenly accepted the alternative hypothesis. The error of deciding H_0 when H_1 was the correct model is called a false-negative or miss.

Let T denote a testing procedure based on an observation of X, and let R_T denote the subset of the range of X where the test chooses H_1. The probability of a false-positive is denoted by

    P_0(R_T) := ∫_{R_T} p_0(x) dx.

The probability of a false-negative is 1 − P_1(R_T), where

    P_1(R_T) := ∫_{R_T} p_1(x) dx

is the probability of correctly deciding H_1, often called the probability of detection.

Consider likelihood ratio tests of the form

    p_1(x) / p_0(x)  ≷_{H_0}^{H_1}  λ.
The subset of the range of X where this test decides H_1 is denoted

    R_LR(λ) := {x : p_1(x) > λ p_0(x)},

and therefore the probability of a false-positive decision is

    P_0(R_LR(λ)) := ∫_{R_LR(λ)} p_0(x) dx = ∫_{{x : p_1(x) > λ p_0(x)}} p_0(x) dx.
This probability is a function of the threshold λ; the set R_LR(λ) shrinks/grows as λ increases/decreases. We can select λ to achieve a desired probability of error.

Neyman-Pearson Lemma
Consider the likelihood ratio test

    p_1(x) / p_0(x)  ≷_{H_0}^{H_1}  λ

with λ > 0 chosen so that P_0(R_LR(λ)) = α. There does not exist another test T with P_0(R_T) ≤ α and P_1(R_T) > P_1(R_LR(λ)). That is, the LRT is the most powerful test with probability of false-positive less than or equal to α.

Proof. Let T be any test with P_0(R_T) = α and let NP denote the LRT with λ chosen so that P_0(R_LR(λ)) = α. To simplify the notation we will use R_NP to denote the region R_LR(λ). For any subset R of the range of X define

    P_i(R) := ∫_R p_i(x) dx.

This is simply the probability of X ∈ R under hypothesis H_i. Note that

    P_i(R_NP) = P_i(R_NP ∩ R_T) + P_i(R_NP ∩ R_T^c)
    P_i(R_T)  = P_i(R_NP ∩ R_T) + P_i(R_NP^c ∩ R_T)

where the superscript c indicates the complement of the set. By assumption P_0(R_NP) = P_0(R_T) = α, therefore

    P_0(R_NP ∩ R_T^c) = P_0(R_NP^c ∩ R_T).

Now, we want to show

    P_1(R_NP) ≥ P_1(R_T),

which holds if

    P_1(R_NP ∩ R_T^c) ≥ P_1(R_NP^c ∩ R_T).

To see that this is indeed the case,

    P_1(R_NP ∩ R_T^c) = ∫_{R_NP ∩ R_T^c} p_1(x) dx
                      ≥ λ ∫_{R_NP ∩ R_T^c} p_0(x) dx = λ P_0(R_NP ∩ R_T^c)
                      = λ P_0(R_NP^c ∩ R_T) = λ ∫_{R_NP^c ∩ R_T} p_0(x) dx
                      ≥ ∫_{R_NP^c ∩ R_T} p_1(x) dx = P_1(R_NP^c ∩ R_T),

where the first inequality uses p_1(x) > λ p_0(x) on R_NP and the second uses p_1(x) ≤ λ p_0(x) on R_NP^c.
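The lemma can be checked numerically in a scalar Gaussian instance (a sketch assuming scipy; the level α = 0.1, the mean µ = 2, and the two-sided competitor test are illustrative choices, not from the notes): any other level-α test, here a two-sided region with the same false-alarm probability, has strictly less power than the one-sided LRT.

```python
from scipy.stats import norm

# Scalar Gaussian test: H0: X ~ N(0,1) vs H1: X ~ N(2,1), level alpha = 0.1.
alpha, mu = 0.1, 2.0

# NP/LRT test: decide H1 when x >= nu, with nu chosen so that P_FA = alpha.
nu = norm.isf(alpha)          # Q^{-1}(alpha), since Q(z) = norm.sf(z)
pd_lrt = norm.sf(nu - mu)     # power (P_D) of the LRT

# A competing level-alpha test: decide H1 when |x| >= c (two-sided region).
c = norm.isf(alpha / 2)       # chosen so its P_FA under H0 is also alpha
pd_competitor = norm.sf(c - mu) + norm.cdf(-c - mu)

# Neyman-Pearson: no level-alpha test can have more power than the LRT.
assert pd_lrt > pd_competitor
```

The competitor "wastes" false-alarm budget on the left tail, where the likelihood ratio is small, which is exactly what the proof above forbids an optimal test from doing.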
The probability of a false-positive is also called the probability of false-alarm, which we will denote by P_FA in the following examples. We will also denote the probability of detection (1 − probability of a false-negative) by P_D. The NP test maximizes P_D subject to a constraint on P_FA.

Example: Detecting a DC Signal in Additive White Gaussian Noise

Return to our ETI example from earlier. Assuming µ > 0, our LRT amounts to

    Σ_{i=1}^n x_i  ≷_{H_0}^{H_1}  ν,

with ν = (σ²/µ) ln γ + nµ/2, and since γ was ours to choose, we can equivalently choose ν to trade off between the two types of error. Let's now determine P_FA and P_D for the log-likelihood ratio test:

    P_FA = ∫_ν^∞ (2πnσ²)^{-1/2} e^{-t²/(2nσ²)} dt = Q(ν / √(nσ²)),

where Q(z) = ∫_z^∞ (2π)^{-1/2} e^{-u²/2} du is the tail probability of the standard normal distribution. Similarly,

    P_D = ∫_ν^∞ (2πnσ²)^{-1/2} e^{-(t − nµ)²/(2nσ²)} dt = Q((ν − nµ) / √(nσ²)).

In both cases the expression in terms of the Q function is the result of a simple change of variables in the integration. The Q function is invertible, so we can solve for the value of ν in terms of P_FA, that is, ν = √(nσ²) Q^{-1}(P_FA). Using this we can express P_D as

    P_D = Q(Q^{-1}(P_FA) − √(nµ²/σ²)),

where √(nµ²/σ²) is simply the square root of the signal-to-noise ratio (SNR). Since Q is monotonically decreasing, it is easy to see that the probability of detection increases as µ and/or n increases.

5 Receiver Operating Characteristic (ROC) curves

Consider the binary hypothesis test

    H_0 : X = W
    H_1 : X = S + W
where W ∼ N(0, σ² I_{n×n}) and S = [s_1, s_2, ..., s_n]^T is the known signal. Then

    P_0(X) = (2πσ²)^{-n/2} exp[−X^T X / (2σ²)]
    P_1(X) = (2πσ²)^{-n/2} exp[−(X − S)^T (X − S) / (2σ²)]

Apply the likelihood ratio test (LRT):

    log Λ(X) = log [P_1(X) / P_0(X)] = (1/(2σ²)) [2 X^T S − S^T S]  ≷_{H_0}^{H_1}  γ

After simplification, we have

    X^T S  ≷_{H_0}^{H_1}  σ² γ + S^T S / 2 =: γ'

The test statistic X^T S is usually called a matched filter. The LR detector filters the data by projecting them onto the signal subspace.

5.1 Quality of the classifier

Question: How can we assess the quality of a detector?

Example:

    H_0 : X ∼ N(0, 1)
    H_1 : X ∼ N(2, 1)

[Figure: the densities p_0(x) and p_1(x) plotted versus x.]

P_FA and P_D characterize the performance of the detector. As γ increases, P_FA decreases (good) and P_D decreases (bad).
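This tradeoff can be traced out explicitly for the example above (a sketch assuming scipy; the threshold grid is an arbitrary choice): sweeping the threshold γ produces the pairs (P_FA, P_D) = (Q(γ), Q(γ − 2)), which are exactly the points of the ROC curve defined next.

```python
import numpy as np
from scipy.stats import norm

# H0: X ~ N(0,1) vs H1: X ~ N(2,1). For the test "decide H1 when x >= gamma",
# P_FA = Q(gamma) and P_D = Q(gamma - 2), where Q = norm.sf.
gammas = np.linspace(-4.0, 6.0, 201)
pfa = norm.sf(gammas)
pd = norm.sf(gammas - 2.0)

# Raising the threshold decreases both P_FA (good) and P_D (bad).
assert np.all(np.diff(pfa) <= 0) and np.all(np.diff(pd) <= 0)
# The detector beats random guessing: the curve lies above the diagonal.
assert np.all(pd >= pfa)
```

Plotting pd against pfa (e.g. with matplotlib) reproduces the ROC curve shown in the notes.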
Definition: Receiver Operating Characteristic (ROC) Curve
An ROC curve is a plot that illustrates the performance of a detector (binary classifier) by plotting its P_D vs. P_FA at various threshold settings.

First use: World War II. The ROC curve was first developed by electrical and radar engineers for detecting aircraft from radar signals after the attack on Pearl Harbor.

To compute the ROC curve, vary the threshold level γ and compute P_FA and P_D.

[Figure: left, the densities p_0(x) and p_1(x); right, the resulting ROC curve plotting P_D versus P_FA.]

The ROC curve starts from (0, 0) and ends at (1, 1) (unless p_i(±∞) > 0). The diagonal line from (0, 0) to (1, 1) corresponds to random guessing. The curve depends on signal strength, noise strength, noise type, etc.

5.2 The AWGN Assumption

AWGN is Gaussian distributed as W ∼ N(0, σ² I). Is real-world noise really additive, white, and Gaussian? Noise in many applications (e.g., communications and radar) arises from several independent sources, all adding together at the sensors and contributing additively to the measurement.

Central Limit Theorem
If x_1, ..., x_n are independent random variables with means µ_i and variances σ_i² < ∞, then

    Z_n = (1/√n) Σ_{i=1}^n (x_i − µ_i)/σ_i → N(0, 1)

in distribution as n → ∞.

Thus, it is quite reasonable to model noise as additive and Gaussian in many applications. However, whiteness is not always a good assumption.

5.3 Colored Gaussian Noise
Example: Correlated noise

    W = S_1 + S_2 + ... + S_k,

where S_1, S_2, ..., S_k are interfering signals that are not of interest, and each of them is structured/correlated in time. W ∼ N(0, Σ) is called correlated or colored noise, where Σ is a structured covariance matrix.

Consider the binary hypothesis test in this case:

    H_0 : X = S_0 + W
    H_1 : X = S_1 + W

where W ∼ N(0, Σ) and S_0 and S_1 are known signals. So we can rewrite the hypotheses as

    H_0 : X ∼ N(S_0, Σ)
    H_1 : X ∼ N(S_1, Σ)

The probability density under each hypothesis is

    P_i(X) = (2π)^{-n/2} |Σ|^{-1/2} exp[−(1/2)(X − S_i)^T Σ^{-1} (X − S_i)],  i = 0, 1.

The log likelihood ratio is

    log [P_1(X) / P_0(X)] = −(1/2)[(X − S_1)^T Σ^{-1} (X − S_1) − (X − S_0)^T Σ^{-1} (X − S_0)]
                          = X^T Σ^{-1} (S_1 − S_0) − (1/2) S_1^T Σ^{-1} S_1 + (1/2) S_0^T Σ^{-1} S_0  ≷_{H_0}^{H_1}  γ

Equivalently,

    (S_1 − S_0)^T Σ^{-1} X  ≷_{H_0}^{H_1}  γ + (1/2) S_1^T Σ^{-1} S_1 − (1/2) S_0^T Σ^{-1} S_0 =: γ'

Letting t(X) = (S_1 − S_0)^T Σ^{-1} X, we get

    H_0 : t ∼ N((S_1 − S_0)^T Σ^{-1} S_0, (S_1 − S_0)^T Σ^{-1} (S_1 − S_0))
    H_1 : t ∼ N((S_1 − S_0)^T Σ^{-1} S_1, (S_1 − S_0)^T Σ^{-1} (S_1 − S_0))

The probability of false alarm is

    P_FA = Q( (γ' − (S_1 − S_0)^T Σ^{-1} S_0) / [(S_1 − S_0)^T Σ^{-1} (S_1 − S_0)]^{1/2} ).

In this case it is natural to define

    SNR = (S_1 − S_0)^T Σ^{-1} (S_1 − S_0).
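A small numerical sketch of this colored-noise detector (assuming numpy/scipy; the AR(1)-style covariance and the signals S_0 = 0, S_1 = 1 below are invented for illustration, not from the notes) computes t(X)'s parameters and confirms that, with the midpoint threshold, P_FA reduces to Q(√SNR / 2), the same form as in the white-noise case:

```python
import numpy as np
from scipy.stats import norm

n = 8
# A structured covariance (AR(1)-style correlation) standing in for Sigma.
Sigma = 0.9 ** np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
S0 = np.zeros(n)
S1 = np.ones(n)

d = np.linalg.solve(Sigma, S1 - S0)   # Sigma^{-1} (S1 - S0)
snr = (S1 - S0) @ d                   # SNR = (S1-S0)^T Sigma^{-1} (S1-S0)

# Under H_i, t(X) = d^T X is Gaussian with mean d^T S_i and variance
# d^T Sigma d = snr. Choose the symmetric threshold gamma' = d^T (S0+S1)/2:
gamma_p = d @ (S0 + S1) / 2
pfa = norm.sf((gamma_p - d @ S0) / np.sqrt(snr))

# The midpoint threshold gives P_FA = Q(sqrt(SNR)/2).
assert np.isclose(pfa, norm.sf(np.sqrt(snr) / 2))
```

Note that `np.linalg.solve` is used instead of explicitly inverting Σ, which is the standard numerically stable way to apply Σ^{-1} to a vector.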