Topic 3: Hypothesis Testing
CS 8850: Advanced Machine Learning                                      Fall 2017

Topic 3: Hypothesis Testing

Instructor: Daniel L. Pimentel-Alarcón                              © Copyright

3.1 Introduction

One of the simplest inference problems is that of deciding between two options (hypotheses).

Example 3.1 (Healthy vs. diabetic). The blood glucose level (in mg/dl) of a healthy person can be modeled as N(95, σ²), while that of a diabetic can be modeled as N(140, σ²). Given a new patient with glucose level x, you want to decide between two hypotheses:

    H₀ : x ~ N(95, σ²)     healthy,
    H₁ : x ~ N(140, σ²)    diabetic.

H₀ and H₁ are often called the null and alternative hypotheses.

Example 3.2 (Radar). A radar constantly emits a signal and monitors whether it bounces back (see Figure 3.1). The signal x that the radar receives can be modeled as N(0, σ²) if there is nothing there (the signal doesn't bounce back), and as N(µ, σ²) for some µ > 0 if an object is present (the signal bounces back). Thus it needs to decide between:

    H₀ : x ~ N(0, σ²)            nothing there,
    H₁ : x ~ N(µ, σ²), µ > 0     something there.

Example 3.3 (Astrophysics). NASA wants you to determine whether two meteorites (one that fell in Roswell, New Mexico, and one that fell in Chelyabinsk, Russia) came from the same asteroid in space. With help from the materials expert on your interdisciplinary team, you are able to determine that if two meteorites come from the same asteroid, the difference x of their magnesium compositions

Figure 3.1: A radar is constantly receiving a signal, and needs to decide whether an object is present or not. See Example 3.2.
Figure 3.2: Gene microarrays are data matrices indicating gene activation levels. Each row corresponds to one gene, and each column corresponds to one individual. We want to know which genes are related to a disease. See Example 3.4.

can be modeled as N(0, σ²), and as N(µ, σ²) otherwise, for some unknown µ ≠ 0. Hence you need to decide between:

    H₀ : x ~ N(0, σ²)            same asteroid,
    H₁ : x ~ N(µ, σ²), µ ≠ 0     different asteroids.

Example 3.4 (Genetics). Have you wondered how geneticists determine which genes are associated with which diseases? Essentially, they compare the average activation levels of a gene in healthy and sick individuals (see Figure 3.2). The difference x between these activation levels can be modeled as N(0, σ²) if the gene is unrelated to the disease, and as N(µ, σ²) for some unknown µ ≠ 0 if the gene is related to the disease. We thus have to decide between:

    H₀ : x ~ N(0, σ²)            gene unrelated to disease,
    H₁ : x ~ N(µ, σ²), µ ≠ 0     gene related to disease.

Example 3.5 (Treatment design). Scientists often want to design a treatment (e.g., a drug or procedure) for a disease (e.g., diabetes or cancer). To this end they measure the disease presence (e.g., glucose level or tumor size) before and after treatment in N patients. The differences x₁, …, x_N can be modeled as independent and identically distributed (i.i.d.) N(0, σ²) if the treatment is ineffective, and N(µ, σ²) for some µ < 0 if the treatment is effective. Hence we have:

    H₀ : x₁, …, x_N ~ iid N(0, σ²)            treatment is ineffective,
    H₁ : x₁, …, x_N ~ iid N(µ, σ²), µ < 0     treatment is effective.

Example 3.6 (Neural activity). Scientists want to determine which regions of the brain are related to certain tasks using functional magnetic resonance imaging (fMRI), which essentially creates a video of the brain using magnetic fields that map hydrogen density. For example, say they want to know which region of the brain controls the thumb. Then they take an individual, ask her to move her thumb
Figure 3.3: Left: Signal of the thumb µ ∈ ℝᴰ, usually a sinusoid with the periodicity of the thumb movement. Right: Each pixel produces one signal vector x_ij ∈ ℝᴰ containing the brain measurements in that pixel over time. Some pixels may show neural activity correlated with the signal of the thumb. We want to find such pixels. See Example 3.6.

periodically, and take an fMRI video of her brain. Then scientists analyze one pixel at a time. The (i, j)th pixel produces a signal vector x_ij ∈ ℝᴰ containing the brain measurements in that pixel over time. Some pixels will show neural activity correlated with the signal of the thumb µ ∈ ℝᴰ, usually a sinusoid with the periodicity of the thumb movement (see Figure 3.3). Then x_ij can be modeled as N(0, σ²I) if the pixel is uncorrelated with the thumb movement, and N(µ, σ²I) if the pixel is correlated (I denotes the identity matrix of compatible size, in this case D × D). Hence for each pixel (i, j) they have to decide:

    H₀ : x_ij ~ N(0, σ²I)     (i, j)th pixel is uncorrelated,
    H₁ : x_ij ~ N(µ, σ²I)     (i, j)th pixel is correlated.

Definition 3.1 (Hypothesis test). A hypothesis test is a function t : Ω → {H₀, H₁}.

3.2 The Likelihood Ratio Test

In general, hypothesis testing is all about deciding between two options. We observe a random variable x, and want to decide whether:

    H₀ : x ~ p₀(x),
    H₁ : x ~ p₁(x).

If your hunch is to simply pick whichever is larger between p₀(x) and p₁(x), your intuition is correct. That is essentially the likelihood ratio test (LRT) in its most elemental form:

    p₁(x)/p₀(x)  ≷  1,

where ≷ means: decide H₁ if the left-hand side is larger, and H₀ otherwise. In words: if the likelihood ratio Λ(x) := p₁(x)/p₀(x) is larger than 1 (meaning p₁ is larger), then pick H₁. Similarly, if Λ(x) < 1 (meaning p₀ is larger), then pick H₀.
Figure 3.4: Likelihood ratio test x ≷ µ/2 in Example 3.7.

Remark 3.1 (Likelihood). The term likelihood is often a source of confusion. To be more precise, in hypothesis testing we observe a realization of the random variable x, and we want to decide which of two distributions (p₀ or p₁) is more likely to have generated this data. The likelihood that p₀ generated the data is essentially p₀ evaluated at the observed value, and similarly for p₁.

Example 3.7 (Radar). Consider the hypothesis problem in Example 3.2. Then

    Λ(x) = p₁(x)/p₀(x)
         = [(1/√(2πσ²)) e^{−(x−µ)²/2σ²}] / [(1/√(2πσ²)) e^{−(x−0)²/2σ²}]
         = e^{−(x²−2xµ+µ²)/2σ²} / e^{−x²/2σ²}
         = e^{µ(2x−µ)/2σ²}  ≷  1.

Since both sides are positive, taking logs we obtain:

    µ(2x−µ)/2σ²  ≷  0,

and since µ > 0, this further simplifies into the following test:

    x  ≷  µ/2.

In words, this LRT tells us to decide H₁ if our observed data x is larger than µ/2, and H₀ otherwise (see Figure 3.4).

3.3 Outcomes and Decision Regions

A test has four possible outcomes, depending on what we decide and the truth: (0|0), (0|1), (1|0), (1|1). See Table 3.1. Sometimes it is desirable to reduce the probability of one particular kind of error.
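The elemental LRT of Example 3.7 is straightforward to implement. The sketch below (pure Python; the values µ = 2 and σ = 1 are made-up for illustration, not from the notes) compares the two Gaussian likelihoods directly, and checks that the decision agrees with the simplified threshold test x ≷ µ/2:

```python
import math

def gauss_pdf(x, mu, sigma):
    # Density of N(mu, sigma^2)
    return math.exp(-(x - mu)**2 / (2 * sigma**2)) / math.sqrt(2 * math.pi * sigma**2)

def lrt(x, mu, sigma):
    # Elemental likelihood ratio test: decide H1 (return 1) iff p1(x)/p0(x) > 1
    return 1 if gauss_pdf(x, mu, sigma) / gauss_pdf(x, 0.0, sigma) > 1 else 0

mu, sigma = 2.0, 1.0   # made-up example values
for x in (-1.0, 0.5, 0.99, 1.01, 3.0):
    # The LRT decision matches the simplified test x > mu/2
    assert lrt(x, mu, sigma) == (1 if x > mu / 2 else 0)
    print(x, lrt(x, mu, sigma))
```

Both routes give the same decision because the likelihood ratio e^{µ(2x−µ)/2σ²} exceeds 1 exactly when x > µ/2.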
Example 3.8. In Example 3.5, scientists want to avoid a (1|0) error, which would mean that they believe their treatment cures a disease when it really doesn't.

The probabilities of the four outcomes are determined by the decision regions of the test.

Definition 3.2 (Decision regions). The decision regions R₀ and R₁ of a test t : Ω → {H₀, H₁} are the inverse images of H₀ and H₁, i.e.,

    R₀ := {x ∈ Ω : t(x) = H₀},     R₁ := {x ∈ Ω : t(x) = H₁}.

In words, R₀ is the region of the domain of x where we will decide H₀, and similarly for R₁.

Example 3.9. The decision regions of the test x ≷ µ/2 are R₀ = (−∞, µ/2) and R₁ = (µ/2, ∞).

Decision regions determine the probability of each outcome as follows:

    p_gh := ∫_{R_g} p_h(x) dx,     for g, h ∈ {0, 1}.

Example 3.10. The test x ≷ µ/2 has the following outcome probabilities (see Figure 3.5):

    p₀₀ = 1 − Q_{0,σ}(µ/2),     p₀₁ = 1 − Q_{µ,σ}(µ/2),
    p₁₀ = Q_{0,σ}(µ/2),         p₁₁ = Q_{µ,σ}(µ/2),

Table 3.1: Four possible outcomes of a test. Depending on the field, they come under different names. We will use (0|0), (0|1), (1|0), (1|1) for simplicity.

                 Truth H₀                          Truth H₁
    Decide H₀    (0|0) True negative,              (0|1) False negative,
                 no-alarm, accept H₀               miss, Type II error
    Decide H₁    (1|0) False positive,             (1|1) True positive,
                 false alarm, Type I error         detect, reject H₀
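The outcome probabilities of Example 3.10 can be sanity-checked by simulation. A minimal sketch (pure Python; µ = 2 and σ = 1 are made-up values) estimates each p_gh by drawing samples under the corresponding hypothesis and compares against the closed-form tail probabilities:

```python
import math
import random

random.seed(0)
mu, sigma, n = 2.0, 1.0, 100_000   # made-up example values

def Q(t):
    # Tail probability of the standard normal N(0, 1)
    return 0.5 * math.erfc(t / math.sqrt(2))

# Decision regions of the test x > mu/2: R1 = (mu/2, inf), R0 = (-inf, mu/2)
decide = lambda x: 1 if x > mu / 2 else 0

# p[(g, h)]: estimated probability of deciding g when h is true
means = {0: 0.0, 1: mu}
p = {}
for h in (0, 1):
    draws = [decide(random.gauss(means[h], sigma)) for _ in range(n)]
    p[(1, h)] = sum(draws) / n
    p[(0, h)] = 1 - p[(1, h)]

# Closed forms: p10 = Q(mu/(2 sigma)), p11 = Q(-mu/(2 sigma))
print(p[(1, 0)], Q(mu / (2 * sigma)))    # both ≈ 0.159
print(p[(1, 1)], Q(-mu / (2 * sigma)))   # both ≈ 0.841
```

Note that the columns of the outcome table each sum to one: p₀₀ + p₁₀ = 1 and p₀₁ + p₁₁ = 1, since under either truth exactly one decision is made.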
Figure 3.5: Outcome probabilities p_gh. Left: test x ≷ µ/2 from Example 3.7. Right: test x ≷ τ from Example 3.10; we can pick τ such that p₁₀ ≤ α.

Figure 3.6: Q_{µ,σ}(τ) = Q((τ − µ)/σ), where Q is shorthand for Q_{0,1}.

where Q_{µ,σ}(τ) is the tail probability of the N(µ, σ²) distribution, i.e.,

    Q_{µ,σ}(τ) := ∫_τ^∞ (1/√(2πσ²)) e^{−(x−µ)²/2σ²} dx.

If we want to bound p₁₀, we could modify our test to

    x  ≷  τ,

where τ is a threshold selected to make p₁₀ smaller than the desired probability of error α (see Figure 3.5). This new test has p₁₀ = Q_{0,σ}(τ). Furthermore, a simple change of variable shows that

    Q_{µ,σ}(τ) = Q((τ − µ)/σ),                                          (3.1)

where we use Q as shorthand for Q_{0,1} (see Figure 3.6 to build some intuition), so p₁₀ = Q(τ/σ). Finally, since Q is invertible, if we want p₁₀ ≤ α, we can pick τ = σQ⁻¹(α).

3.4 The Neyman-Pearson Lemma

As mentioned in Example 3.8, there are some cases where we want to bound a certain probability of error, say p₁₀. One way to do this is by increasing R₀. However, as R₀ grows, our accuracy p₁₁ decreases (see
Figure 3.5 to build some intuition). The Neyman-Pearson lemma tells us that the LRT is optimal in the sense that there exists no other test that has both a lower probability of error p₁₀ and a higher accuracy p₁₁.

Lemma 3.1 (Neyman-Pearson). Consider the likelihood ratio test t given by

    p₁(x)/p₀(x)  ≷  τ,

with τ chosen such that p₁₀ = α. Then there exists no other test t′ with p′₁₀ ≤ α and p′₁₁ > p₁₁.

Proof. For any region R ⊂ Ω, let P_h(R) be the cumulative probability of p_h(x) over R, i.e., P_h(R) = ∫_R p_h(x) dx. Let R₀, R₁ be the decision regions of t, and R′₀, R′₁ those of t′. Then for h ∈ {0, 1},

    p₁ₕ = P_h(R₁ ∩ R′₁) + P_h(R₁ ∩ R′₀),
    p′₁ₕ = P_h(R₁ ∩ R′₁) + P_h(R₀ ∩ R′₁).                               (3.2)

Now suppose p′₁₀ ≤ α. We need to show that p′₁₁ ≤ p₁₁. By (3.2), this is equivalent to showing that P₁(R₁ ∩ R′₀) ≥ P₁(R₀ ∩ R′₁), so write

    P₁(R₁ ∩ R′₀) = ∫_{R₁ ∩ R′₀} p₁(x) dx ≥ ∫_{R₁ ∩ R′₀} τp₀(x) dx = τP₀(R₁ ∩ R′₀),   (3.3)

where the inequality follows because in R₁, p₁(x) ≥ τp₀(x). By assumption, p′₁₀ ≤ α = p₁₀. This, together with (3.2), implies P₀(R₁ ∩ R′₀) ≥ P₀(R₀ ∩ R′₁), so

    (3.3) ≥ τP₀(R₀ ∩ R′₁) = ∫_{R₀ ∩ R′₁} τp₀(x) dx ≥ ∫_{R₀ ∩ R′₁} p₁(x) dx = P₁(R₀ ∩ R′₁),

where the last inequality follows because in R₀, p₁(x) ≤ τp₀(x).  ∎

Example 3.11. Consider

    H₀ : x ~ N(0, σ₀²),
    H₁ : x ~ N(0, σ₁²),

where σ₀ < σ₁ are known. The likelihood ratio test is

    Λ(x) = p₁(x)/p₀(x) = [(1/√(2πσ₁²)) e^{−x²/2σ₁²}] / [(1/√(2πσ₀²)) e^{−x²/2σ₀²}]
         = (σ₀/σ₁) e^{(x²/2)(1/σ₀² − 1/σ₁²)}  ≷  τ.
Figure 3.7: Left: Threshold τ′ selected to achieve probability of error p₁₀ = α in the test x² ≷ τ′ of Example 3.11, where H₀ : x²/σ₀² ~ χ²₁. This is equivalent to the test |x| ≷ √τ′ with H₀ : x ~ N(0, σ₀²), as on the right.

Or equivalently,

    e^{(x²/2)(1/σ₀² − 1/σ₁²)}  ≷  (σ₁/σ₀) τ,

    (x²/2) · (σ₁² − σ₀²)/(σ₀²σ₁²)  ≷  log((σ₁/σ₀) τ),

    x²  ≷  (2σ₀²σ₁²/(σ₁² − σ₀²)) log((σ₁/σ₀) τ) =: τ′.

Now recall that if y ~ N(0, 1), then y² ~ χ²₁. So we can rewrite our hypotheses as

    H₀ : x²/σ₀² ~ χ²₁,
    H₁ : x²/σ₁² ~ χ²₁.

Then p₁ₕ (with h ∈ {0, 1}) is simply the probability that a σₕ²-scaled χ²₁ random variable is larger than τ′ (see Figure 3.7), i.e.,

    p₁ₕ = Q_{χ²}(τ′/σₕ²),

where Q_{χ²} is the tail probability of the χ²₁ distribution. Since Q_{χ²} is invertible, if we want p₁₀ ≤ α, we can pick τ′ = σ₀² Q_{χ²}⁻¹(α), and then p₁₁ = Q_{χ²}((σ₀²/σ₁²) Q_{χ²}⁻¹(α)). The Neyman-Pearson lemma tells us that there exists no other test that has both a lower probability of error p₁₀ and a higher accuracy p₁₁.

3.5 Multiple Observations

We now study what happens when we have several observations instead of just one.

Example 3.12. Consider the hypotheses in Example 3.5, or equivalently, in vector form:

    H₀ : x ∈ ℝᴺ ~ N(0, σ²I),
    H₁ : x ∈ ℝᴺ ~ N(µ1, σ²I),    µ < 0,
where 1 denotes the all-ones vector of compatible size, in this case N × 1. The likelihood ratio is given by

    Λ(x) = p₁(x)/p₀(x)
         = [(1/(√(2π)σ)ᴺ) e^{−(x−µ1)ᵀ(x−µ1)/2σ²}] / [(1/(√(2π)σ)ᴺ) e^{−xᵀx/2σ²}]
         = e^{−(xᵀx − 2µxᵀ1 + µ²1ᵀ1)/2σ²} / e^{−xᵀx/2σ²}
         = e^{(µ/σ²)(xᵀ1 − Nµ/2)}.

Taking logs we obtain the log-likelihood ratio test:

    (µ/σ²)(xᵀ1 − Nµ/2)  ≷  log τ,

    xᵀ1  ≶  (σ²/µ) log τ + Nµ/2 =: τ′.

Notice that the direction of the inequalities in the test was inverted because µ < 0. Next observe that m := xᵀ1 = ∑ᵢ₌₁ᴺ xᵢ, and since sums of Gaussians are Gaussian, we can rewrite our hypotheses as

    H₀ : m ~ N(0, Nσ²),
    H₁ : m ~ N(Nµ, Nσ²),    µ < 0.

Then our log-likelihood ratio test becomes m ≶ τ′, and since τ′ < 0,

    p₁₀ = Φ_{0,Nσ²}(τ′),     p₁₁ = Φ_{Nµ,Nσ²}(τ′),

where Φ_{µ,σ²} is the cumulative distribution function (CDF) of a N(µ, σ²) random variable (see Figure 3.8). Equivalently, with a change of variable similar to (3.1), we can write this in terms of the CDF Φ of the standard normal N(0, 1):

    p₁₀ = Φ(τ′/(√N σ)),     p₁₁ = Φ((τ′ − Nµ)/(√N σ)).                  (3.4)

Since Φ is invertible, if we want p₁₀ ≤ α, we can pick τ′ = √N σ Φ⁻¹(α). Plugging this into (3.4), we obtain

    p₁₁ = Φ((√N σ Φ⁻¹(α) − Nµ)/(√N σ)) = Φ(Φ⁻¹(α) − √N µ/σ).

The quantity √N µ/σ is often known as the signal-to-noise ratio. Since Φ(τ) → 1 as τ → ∞, and since µ < 0 by assumption, it is easy to see that p₁₁ increases with N and |µ|, but decreases with σ.

3.6 Multiple Testing

In many applications we run multiple tests, and we want to bound the probability of making one or more mistakes.
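To see why multiple tests need care, consider running K independent tests, each with individual false-alarm probability α, when every null hypothesis is true. A quick simulation (pure Python; K = 20, α = 0.05 are made-up values) shows that the chance of at least one (1|0) error is far larger than α, while running each test at level α/K (the Bonferroni correction discussed below) keeps it under α:

```python
import random

random.seed(0)
K, alpha, trials = 20, 0.05, 20_000   # made-up example values

def fwer(per_test_level):
    # Fraction of trials in which at least one of K independent
    # level-`per_test_level` tests raises a false alarm under the null
    bad = 0
    for _ in range(trials):
        if any(random.random() < per_test_level for _ in range(K)):
            bad += 1
    return bad / trials

print(fwer(alpha))       # ≈ 1 - (1 - 0.05)^20 ≈ 0.64
print(fwer(alpha / K))   # stays below alpha = 0.05
```

The simulation only covers the independent case; the union bound used below requires no independence assumption at all.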
Figure 3.8: Outcome probabilities p_gh of the test m ≶ τ′. See Example 3.12.

Example 3.13. In Example 3.6 we have a family of K tests (where K is the number of pixels), and we want to be confident that all the identified pixels are truly correlated with the thumb's movement.

Definition 3.3 (Family-wise error rate (FWER)). The family-wise error rate is the probability of making one or more (1|0) errors. More precisely, for a family of K tests,

    FWER = P( ⋃ₖ₌₁ᴷ {1ₖ|0ₖ} ),

where {1ₖ|0ₖ} denotes the event of deciding H₁ in the kth test given that H₀ is true.

Lemma 3.2 (Bonferroni correction). Consider a family of K tests. Setting p₁₀ = α/K for each test achieves FWER ≤ α.

Proof. As a simple consequence of the union bound, we have:

    FWER = P( ⋃ₖ₌₁ᴷ {1ₖ|0ₖ} ) ≤ ∑ₖ₌₁ᴷ P({1ₖ|0ₖ}) = ∑ₖ₌₁ᴷ p₁₀ = K · α/K = α.  ∎

3.7 Composite Hypotheses

So far we have studied simple hypotheses, as in Examples 3.1, 3.6 and 3.11, where the distributions and their parameters are known. However, in many practical situations this is not the case. For instance, in H₁ of Example 3.2, all we know is that µ > 0.
In this case, H₁ is composed of the collection of distributions {N(µ, σ²)}_{µ>0}. More generally, a composite problem has the form:

    H₀ : x ~ p₀(x | θ₀),    θ₀ ∈ Θ₀,
    H₁ : x ~ p₁(x | θ₁),    θ₁ ∈ Θ₁,

where the notation p_h(x | θ_h) means that θ_h is a parameter of the distribution p_h, and Θ_h is the set of all possible values of the parameter θ_h. In general, p₀ and p₁ may be entirely different distributions, and the sets Θ₀, Θ₁ may be entirely different.

Even though Examples 3.2 and 3.5 are composite problems, since we know the sign of µ, we were still able to derive their LRTs in Examples 3.7 and 3.12. This is not always the case. There are more complicated cases, like Examples 3.3 and 3.4, where µ is completely unknown, and this can complicate things.

Example 3.14 (Wald's test). Consider Examples 3.3 and 3.4. The likelihood ratio is

    Λ(x) = p₁(x)/p₀(x)
         = [(1/√(2πσ²)) e^{−(x−µ)²/2σ²}] / [(1/√(2πσ²)) e^{−(x−0)²/2σ²}]
         = e^{µ(2x−µ)/2σ²}  ≷  1.

Taking logs and with some minor algebra we obtain:

    xµ  ≷  µ²/2.

However, since we don't know the sign of µ, we cannot continue as in Example 3.7, as dividing by µ could reverse the direction of the inequalities in the test. Hence this test is uncomputable, or undetermined. So how can we proceed?

For example, let's say we decide to use the test from Example 3.10: x ≷ τ. It could happen that we are lucky and µ > 0. Then our test will be optimal (as shown by the Neyman-Pearson lemma) with p₁₀ = Q(τ/σ) and p₁₁ = Q((τ − µ)/σ) (see Example 3.10 and Figure 3.5). However, if we are unlucky and µ < 0, our test would be doing something terribly insensible, and would have terrible accuracy p₁₁ = Q((τ + |µ|)/σ); see Figure 3.9 to build some intuition.

One good compromise is to use Wald's test:

    |x|  ≷  τ,

which has

    p₁₀ = 2Q(τ/σ),     p₁₁ = Q((τ − µ)/σ) + Q((τ + µ)/σ).

Wald's test is not optimal, but it is sensible. It has a higher probability of error p₁₀ than if we are lucky and guess the sign of µ correctly, but it also has a higher accuracy p₁₁ than if we are unlucky and guess it incorrectly.
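The formulas for Wald's test can be checked numerically. A minimal sketch (pure Python; σ = 1, α = 0.05, and µ = ±1.5 are made-up values) picks τ so that p₁₀ = 2Q(τ/σ) equals α, then compares the closed-form accuracy with a simulation; note that it gives the same accuracy for µ and −µ, which is exactly why it is a sensible choice when the sign of µ is unknown:

```python
import math
import random

random.seed(0)

def Q(t):
    # Standard normal tail probability
    return 0.5 * math.erfc(t / math.sqrt(2))

def Q_inv(p, lo=-10.0, hi=10.0, iters=100):
    # Bisection inverse of Q (Q is strictly decreasing)
    for _ in range(iters):
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if Q(mid) > p else (lo, mid)
    return (lo + hi) / 2

sigma, alpha = 1.0, 0.05          # made-up example values
tau = sigma * Q_inv(alpha / 2)    # Wald's test |x| > tau then has p10 = 2 Q(tau/sigma) = alpha

# Accuracy p11 = Q((tau - mu)/sigma) + Q((tau + mu)/sigma); symmetric in mu
for mu in (1.5, -1.5):
    p11 = Q((tau - mu) / sigma) + Q((tau + mu) / sigma)
    n = 100_000
    sim = sum(abs(random.gauss(mu, sigma)) > tau for _ in range(n)) / n
    print(mu, round(p11, 3), round(sim, 3))   # closed form vs simulation
```

With these values τ ≈ 1.96, the familiar two-sided 5% threshold for a standard normal.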
Figure 3.9: Composite hypothesis test where µ ≠ 0 is unknown. Left: Probabilities p₁₀ and p₁₁ of the test x ≷ τ. If we are lucky and µ > 0, this test will be optimal, but if µ < 0, our test will be terrible. Right: Wald's test |x| ≷ τ. Wald's test is not optimal, but it is sensible. It has a higher probability of error p₁₀ than if we are lucky and guess the sign of µ correctly, but it also has a higher accuracy p₁₁ than if we are unlucky and guess it incorrectly. See Example 3.14 and Figure 3.10.

Figure 3.10: Left: ROC curves of the test x ≷ τ for different values of µ. Consistent with our analysis from Example 3.12, we can see that p₁₁ grows with µ. Right: ROC curves for the test x ≷ τ when µ > 0 (optimal), when µ < 0 (terrible), and for Wald's test. This shows that Wald's test is suboptimal but sensible. See Example 3.14.

3.8 ROC Curves

As shown in Example 3.14, it is not always possible to devise an optimal test. It is thus reasonable to ask how good a test is. For example, how good is Wald's test? One way to answer this is with receiver operating characteristic (ROC) curves, which measure a test's performance by plotting its p₁₁ as a function of its p₁₀. ROC curves are widely used in laboratories to measure a test's ability to discriminate diseased cases from normal cases, and also to compare the performance of two or more tests.

3.9 Generalized Likelihood Ratio Test

Wald's test was an intuitive solution to the simplest composite problem. However, Wald's test also has a solid statistical foundation. In fact, Wald's test is the result of estimating µ and then using this estimate in a likelihood ratio test. We will come back to Wald's test and its generalization, the generalized likelihood ratio test (GLRT), but first we will need to learn about estimation, which is our next topic.
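As a closing illustration of the ROC curves from Section 3.8: an ROC curve can be traced by sweeping the threshold τ and recording the pair (p₁₀, p₁₁) at each value. A minimal sketch (pure Python; µ = 1.5 and σ = 1 are made-up values) traces the curves of the one-sided test x > τ and of Wald's test |x| > τ when the true µ happens to be positive:

```python
import math

def Q(t):
    # Standard normal tail probability
    return 0.5 * math.erfc(t / math.sqrt(2))

mu, sigma = 1.5, 1.0   # made-up example values (true mu > 0)

def roc_point_one_sided(tau):
    # Test x > tau: returns (p10, p11)
    return Q(tau / sigma), Q((tau - mu) / sigma)

def roc_point_wald(tau):
    # Wald's test |x| > tau (tau >= 0): returns (p10, p11)
    return 2 * Q(tau / sigma), Q((tau - mu) / sigma) + Q((tau + mu) / sigma)

# Sweep tau to trace each ROC curve (p11 as a function of p10)
taus = [i / 4 for i in range(-8, 17)]
one_sided = [roc_point_one_sided(t) for t in taus]
wald = [roc_point_wald(max(t, 0.0)) for t in taus]

for (a, b), (c, d) in zip(one_sided, wald):
    print(f"one-sided: p10={a:.3f} p11={b:.3f}   wald: p10={c:.3f} p11={d:.3f}")
# Both curves lie on or above the diagonal p11 = p10 (better than random
# guessing); since the true mu is positive here, the one-sided test traces
# the higher curve, consistent with the Neyman-Pearson lemma.
```

Since the true parameter was assumed positive, swapping the sign of µ would flip the picture: the one-sided curve would fall below the diagonal, while Wald's curve would stay unchanged.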
Statistical Inference: Estimation and Confidence Intervals Hypothesis Testing 1 In most statistics problems, we assume that the data have been generated from some unknown probability distribution. We desire
More informationMath Review Sheet, Fall 2008
1 Descriptive Statistics Math 3070-5 Review Sheet, Fall 2008 First we need to know about the relationship among Population Samples Objects The distribution of the population can be given in one of the
More informationReview. December 4 th, Review
December 4 th, 2017 Att. Final exam: Course evaluation Friday, 12/14/2018, 10:30am 12:30pm Gore Hall 115 Overview Week 2 Week 4 Week 7 Week 10 Week 12 Chapter 6: Statistics and Sampling Distributions Chapter
More informationBinary Logistic Regression
The coefficients of the multiple regression model are estimated using sample data with k independent variables Estimated (or predicted) value of Y Estimated intercept Estimated slope coefficients Ŷ = b
More informationBayesian Decision Theory
Bayesian Decision Theory Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Fall 2017 CS 551, Fall 2017 c 2017, Selim Aksoy (Bilkent University) 1 / 46 Bayesian
More informationSome General Types of Tests
Some General Types of Tests We may not be able to find a UMP or UMPU test in a given situation. In that case, we may use test of some general class of tests that often have good asymptotic properties.
More informationPolitical Science 236 Hypothesis Testing: Review and Bootstrapping
Political Science 236 Hypothesis Testing: Review and Bootstrapping Rocío Titiunik Fall 2007 1 Hypothesis Testing Definition 1.1 Hypothesis. A hypothesis is a statement about a population parameter The
More informationF2E5216/TS1002 Adaptive Filtering and Change Detection. Course Organization. Lecture plan. The Books. Lecture 1
Adaptive Filtering and Change Detection Bo Wahlberg (KTH and Fredrik Gustafsson (LiTH Course Organization Lectures and compendium: Theory, Algorithms, Applications, Evaluation Toolbox and manual: Algorithms,
More informationLECTURE NOTE #3 PROF. ALAN YUILLE
LECTURE NOTE #3 PROF. ALAN YUILLE 1. Three Topics (1) Precision and Recall Curves. Receiver Operating Characteristic Curves (ROC). What to do if we do not fix the loss function? (2) The Curse of Dimensionality.
More informationTopic 2: Review of Probability Theory
CS 8850: Advanced Machine Learning Fall 2017 Topic 2: Review of Probability Theory Instructor: Daniel L. Pimentel-Alarcón c Copyright 2017 2.1 Why Probability? Many (if not all) applications of machine
More informationMethods for Statistical Prediction Financial Time Series I. Topic 1: Review on Hypothesis Testing
Methods for Statistical Prediction Financial Time Series I Topic 1: Review on Hypothesis Testing Hung Chen Department of Mathematics National Taiwan University 9/26/2002 OUTLINE 1. Fundamental Concepts
More informationStatistics 3858 : Contingency Tables
Statistics 3858 : Contingency Tables 1 Introduction Before proceeding with this topic the student should review generalized likelihood ratios ΛX) for multinomial distributions, its relation to Pearson
More informationMachine Learning Linear Classification. Prof. Matteo Matteucci
Machine Learning Linear Classification Prof. Matteo Matteucci Recall from the first lecture 2 X R p Regression Y R Continuous Output X R p Y {Ω 0, Ω 1,, Ω K } Classification Discrete Output X R p Y (X)
More informationProbability and Statistics. Joyeeta Dutta-Moscato June 29, 2015
Probability and Statistics Joyeeta Dutta-Moscato June 29, 2015 Terms and concepts Sample vs population Central tendency: Mean, median, mode Variance, standard deviation Normal distribution Cumulative distribution
More informationStat 5421 Lecture Notes Fuzzy P-Values and Confidence Intervals Charles J. Geyer March 12, Discreteness versus Hypothesis Tests
Stat 5421 Lecture Notes Fuzzy P-Values and Confidence Intervals Charles J. Geyer March 12, 2016 1 Discreteness versus Hypothesis Tests You cannot do an exact level α test for any α when the data are discrete.
More informationTopic 17: Simple Hypotheses
Topic 17: November, 2011 1 Overview and Terminology Statistical hypothesis testing is designed to address the question: Do the data provide sufficient evidence to conclude that we must depart from our
More informationProblem Set 2. MAS 622J/1.126J: Pattern Recognition and Analysis. Due: 5:00 p.m. on September 30
Problem Set 2 MAS 622J/1.126J: Pattern Recognition and Analysis Due: 5:00 p.m. on September 30 [Note: All instructions to plot data or write a program should be carried out using Matlab. In order to maintain
More information1 Proof techniques. CS 224W Linear Algebra, Probability, and Proof Techniques
1 Proof techniques Here we will learn to prove universal mathematical statements, like the square of any odd number is odd. It s easy enough to show that this is true in specific cases for example, 3 2
More informationHypothesis testing I. - In particular, we are talking about statistical hypotheses. [get everyone s finger length!] n =
Hypothesis testing I I. What is hypothesis testing? [Note we re temporarily bouncing around in the book a lot! Things will settle down again in a week or so] - Exactly what it says. We develop a hypothesis,
More informationExamples and Limits of the GLM
Examples and Limits of the GLM Chapter 1 1.1 Motivation 1 1.2 A Review of Basic Statistical Ideas 2 1.3 GLM Definition 4 1.4 GLM Examples 4 1.5 Student Goals 5 1.6 Homework Exercises 5 1.1 Motivation In
More informationLecture 8: Information Theory and Statistics
Lecture 8: Information Theory and Statistics Part II: Hypothesis Testing and I-Hsiang Wang Department of Electrical Engineering National Taiwan University ihwang@ntu.edu.tw December 23, 2015 1 / 50 I-Hsiang
More informationECE521 week 3: 23/26 January 2017
ECE521 week 3: 23/26 January 2017 Outline Probabilistic interpretation of linear regression - Maximum likelihood estimation (MLE) - Maximum a posteriori (MAP) estimation Bias-variance trade-off Linear
More informationIntroduction to Statistical Inference
Structural Health Monitoring Using Statistical Pattern Recognition Introduction to Statistical Inference Presented by Charles R. Farrar, Ph.D., P.E. Outline Introduce statistical decision making for Structural
More informationReview of Probability Theory
Review of Probability Theory Arian Maleki and Tom Do Stanford University Probability theory is the study of uncertainty Through this class, we will be relying on concepts from probability theory for deriving
More informationChapter 9: Hypothesis Testing Sections
Chapter 9: Hypothesis Testing Sections 9.1 Problems of Testing Hypotheses 9.2 Testing Simple Hypotheses 9.3 Uniformly Most Powerful Tests Skip: 9.4 Two-Sided Alternatives 9.6 Comparing the Means of Two
More informationCherry Blossom run (1) The credit union Cherry Blossom Run is a 10 mile race that takes place every year in D.C. In 2009 there were participants
18.650 Statistics for Applications Chapter 5: Parametric hypothesis testing 1/37 Cherry Blossom run (1) The credit union Cherry Blossom Run is a 10 mile race that takes place every year in D.C. In 2009
More informationLet us first identify some classes of hypotheses. simple versus simple. H 0 : θ = θ 0 versus H 1 : θ = θ 1. (1) one-sided
Let us first identify some classes of hypotheses. simple versus simple H 0 : θ = θ 0 versus H 1 : θ = θ 1. (1) one-sided H 0 : θ θ 0 versus H 1 : θ > θ 0. (2) two-sided; null on extremes H 0 : θ θ 1 or
More informationBias Variance Trade-off
Bias Variance Trade-off The mean squared error of an estimator MSE(ˆθ) = E([ˆθ θ] 2 ) Can be re-expressed MSE(ˆθ) = Var(ˆθ) + (B(ˆθ) 2 ) MSE = VAR + BIAS 2 Proof MSE(ˆθ) = E((ˆθ θ) 2 ) = E(([ˆθ E(ˆθ)]
More informationTopic 10: Hypothesis Testing
Topic 10: Hypothesis Testing Course 003, 2016 Page 0 The Problem of Hypothesis Testing A statistical hypothesis is an assertion or conjecture about the probability distribution of one or more random variables.
More information14.30 Introduction to Statistical Methods in Economics Spring 2009
MIT OpenCourseWare http://ocw.mit.edu.30 Introduction to Statistical Methods in Economics Spring 009 For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms. .30
More informationCONTENTS OF DAY 2. II. Why Random Sampling is Important 10 A myth, an urban legend, and the real reason NOTES FOR SUMMER STATISTICS INSTITUTE COURSE
1 2 CONTENTS OF DAY 2 I. More Precise Definition of Simple Random Sample 3 Connection with independent random variables 4 Problems with small populations 9 II. Why Random Sampling is Important 10 A myth,
More informationHypothesis Testing. BS2 Statistical Inference, Lecture 11 Michaelmas Term Steffen Lauritzen, University of Oxford; November 15, 2004
Hypothesis Testing BS2 Statistical Inference, Lecture 11 Michaelmas Term 2004 Steffen Lauritzen, University of Oxford; November 15, 2004 Hypothesis testing We consider a family of densities F = {f(x; θ),
More information401 Review. 6. Power analysis for one/two-sample hypothesis tests and for correlation analysis.
401 Review Major topics of the course 1. Univariate analysis 2. Bivariate analysis 3. Simple linear regression 4. Linear algebra 5. Multiple regression analysis Major analysis methods 1. Graphical analysis
More informationSTAT 801: Mathematical Statistics. Hypothesis Testing
STAT 801: Mathematical Statistics Hypothesis Testing Hypothesis testing: a statistical problem where you must choose, on the basis o data X, between two alternatives. We ormalize this as the problem o
More informationsimple if it completely specifies the density of x
3. Hypothesis Testing Pure significance tests Data x = (x 1,..., x n ) from f(x, θ) Hypothesis H 0 : restricts f(x, θ) Are the data consistent with H 0? H 0 is called the null hypothesis simple if it completely
More informationHypothesis Testing. 1 Definitions of test statistics. CB: chapter 8; section 10.3
Hypothesis Testing CB: chapter 8; section 0.3 Hypothesis: statement about an unknown population parameter Examples: The average age of males in Sweden is 7. (statement about population mean) The lowest
More informationSolving with Absolute Value
Solving with Absolute Value Who knew two little lines could cause so much trouble? Ask someone to solve the equation 3x 2 = 7 and they ll say No problem! Add just two little lines, and ask them to solve
More informationThis does not cover everything on the final. Look at the posted practice problems for other topics.
Class 7: Review Problems for Final Exam 8.5 Spring 7 This does not cover everything on the final. Look at the posted practice problems for other topics. To save time in class: set up, but do not carry
More informationMasters Comprehensive Examination Department of Statistics, University of Florida
Masters Comprehensive Examination Department of Statistics, University of Florida May 6, 003, 8:00 am - :00 noon Instructions: You have four hours to answer questions in this examination You must show
More information1 Probabilities. 1.1 Basics 1 PROBABILITIES
1 PROBABILITIES 1 Probabilities Probability is a tricky word usually meaning the likelyhood of something occuring or how frequent something is. Obviously, if something happens frequently, then its probability
More informationStat 206: Estimation and testing for a mean vector,
Stat 206: Estimation and testing for a mean vector, Part II James Johndrow 2016-12-03 Comparing components of the mean vector In the last part, we talked about testing the hypothesis H 0 : µ 1 = µ 2 where
More information