CS 885: Advanced Machine Learning — Fall 2017
Topic 5: Generalized Likelihood Ratio Test
Instructor: Daniel L. Pimentel-Alarcón. Copyright 2017.

5.1 Introduction

We already started studying composite hypothesis problems of the form:

H₀ : x ∼ p(x | θ₀),  θ₀ ∈ Θ₀,
H₁ : x ∼ p(x | θ₁),  θ₁ ∈ Θ₁,      (5.1)

where the parameters θ₀ and/or θ₁ are unknown.

Example 5.1. In Example 3.3 we wanted to determine whether two meteorites came from the same asteroid in space using the difference x of their magnesium composition. In Example 3.4 we wanted to determine whether a certain gene was associated with a disease, using the difference x of the gene's average activation levels between healthy and sick people. Both of these problems can be modeled as

H₀ : x ∼ N(0, σ²)   (same asteroid / gene unrelated to disease),
H₁ : x ∼ N(µ, σ²)   (different asteroids / gene related to disease),

where µ is unknown.

Example 5.2. In Example 4.1 we use an array of sensors to record small voltages generated by your brain and store them in a signal vector x ∈ Rᴰ. A machine (phone, computer, server, etc.) should

Figure 5.1: Gene microarrays are data matrices indicating gene activation levels. Each row corresponds to one gene, and each column corresponds to one individual. We want to know which genes are related to a disease. See Example 5.1.
Figure 5.2: Sensors record small voltages generated by your brain and store them in a signal vector x ∈ Rᴰ. A machine (phone, computer, server, etc.) should interpret x and skip the song if that is what you thought about. See Example 5.2.

interpret x and skip the song if that is what you thought about. This can be modeled as

H₀ : x ∼ N(0, σ²I)   (do nothing),
H₁ : x ∼ N(µ, σ²I)   (skip song),

where µ is unknown.

In Section 3.7 we already studied the hypothesis problems in Example 5.1, which led us to the intuitive answer of Wald's test in Example 3.4. We now generalize this using our knowledge from estimation theory to obtain the generalized likelihood ratio test (GLRT).

Definition 5.1 (Generalized likelihood ratio test (GLRT)). Consider a hypothesis problem as in (5.1), where θ₀ and/or θ₁ are unknown. The generalized likelihood ratio statistic is defined as:

Λ̂(x) := max_{θ₁∈Θ₁} p(x | θ₁) / max_{θ₀∈Θ₀} p(x | θ₀),

and the generalized likelihood ratio test is defined as

Λ̂(x) ≷ τ   (decide H₁ if above the threshold, H₀ otherwise).

The idea behind the GLRT is actually quite simple: if you don't know a parameter, first estimate it using maximum likelihood, and then use the estimate as if it were the true parameter in a likelihood ratio test (LRT).

Example 5.3 (Derivation of Wald's test as a GLRT). Let us show that Wald's test is just one particular case of the GLRT. Consider the setup in Example 5.1. Here θ₀ = 0 is known, and θ₁ = µ, so Θ₁ = R.
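The plug-in recipe of Definition 5.1 can be sketched numerically. The following is a minimal illustration in Python (the appendix code of these notes is MATLAB; the function name `glrt_statistic` is mine, not from the notes) for the Gaussian-mean setup of Example 5.1, where the MLE of µ under H₁ is simply µ̂ = x:

```python
import numpy as np
from scipy.stats import norm

def glrt_statistic(x, sigma=1.0):
    """GLRT statistic for H0: x ~ N(0, sigma^2) vs H1: x ~ N(mu, sigma^2),
    mu unknown. Following Definition 5.1: maximize the H1 likelihood over
    mu (the MLE is mu_hat = x), then form the ratio against H0."""
    mu_hat = x                                   # MLE of mu under H1
    num = norm.pdf(x, loc=mu_hat, scale=sigma)   # max over Theta_1
    den = norm.pdf(x, loc=0.0, scale=sigma)      # H0 likelihood
    return num / den

# This matches the closed form exp(x^2 / (2 sigma^2)) worked out
# in Example 5.3:
lam = glrt_statistic(1.7)  # equals np.exp(1.7**2 / 2)
```

Large |x| makes the H₁ likelihood much larger than the H₀ likelihood, so Λ̂(x) grows and the test decides H₁, which is exactly the intuition behind Wald's test.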
Then

argmax_{θ₁∈Θ₁} p(x | θ₁) = argmax_{µ∈R} p(x | µ)
 = argmax_{µ∈R} (1/√(2πσ²)) e^{−(x−µ)²/(2σ²)}
 = argmax_{µ∈R} [ log(1/√(2πσ²)) + log e^{−(x−µ)²/(2σ²)} ]   (first term is constant)
 = argmax_{µ∈R} −(x−µ)²/(2σ²)
 = argmin_{µ∈R} (x−µ)².

To find this maximum we use our usual tricks: take the derivative with respect to (w.r.t.) µ, set it to zero, and solve for µ, which yields µ̂ = x. Then

Λ̂(x) = max_{θ₁∈Θ₁} p(x | θ₁) / max_{θ₀∈Θ₀} p(x | θ₀)
 = max_{µ∈R} p(x | µ) / p(x | 0)
 = p(x | µ̂) / p(x | 0)
 = e^{−(x−µ̂)²/(2σ²)} / e^{−x²/(2σ²)}
 = e^{−(x−x)²/(2σ²)} e^{x²/(2σ²)}
 = e^{x²/(2σ²)},

and our GLRT is

e^{x²/(2σ²)} ≷ τ   (decide H₁ if above).

Taking log on both sides this becomes x²/(2σ²) ≷ log τ, or equivalently x² ≷ τ′, where τ′ = 2σ² log τ, which is precisely Wald's test from Example 3.4.

5.2 Asymptotics

Recall that in hypothesis testing we often want to bound the probability p₁₀ of deciding H₁ when H₀ was true, and so we select our test threshold τ accordingly. For instance, in Wald's test, p₁₀ = P(x² > τ′) given that H₀ is true. Since H₀ : x ∼ N(0, σ²), if we want p₁₀ ≤ α, all we need to do is find the τ′ such that the probability of the tails of a N(0, σ²) is α (see Figure 5.3 to build some intuition). We can do this because we know the distribution of x. Similarly, if we had a test like:

Σᵢ₌₁ᴺ xᵢ² ≷ τ,      (5.2)

where H₀ : xᵢ ∼ N(0, 1), we would also know that Σᵢ₌₁ᴺ xᵢ² ∼ χ²(N). Again, if we wanted p₁₀ ≤ α, all we would need to do is find the τ such that the probability of the tail of a χ²(N) is α (see Figure 5.3 for more intuition). However, we don't always know the distribution of our test. For example, if we had the same test in (5.2), but with H₀ : xᵢ ∼ Poisson, or Cauchy, or Weibull, then what is the distribution of Σᵢ₌₁ᴺ xᵢ²? The following theorem states that if N is large enough, we don't need to worry about it, because under H₀, the GLRT is
Figure 5.3: We want our test to have p₁₀ ≤ α. Left: If our test is x² ≷ τ′ and x ∼ N(0, σ²), then we need to find the τ′ such that the probability of the tails of a N(0, σ²) is α. Right: if our test is Σᵢ₌₁ᴺ xᵢ² ≷ τ and xᵢ ∼ N(0, 1), then we need to find the τ such that the probability of the tail of a χ²(N) is α. We can do this because we know the distribution of our test. What happens if we don't? Wilks' Theorem gives us an answer.

asymptotically χ²-distributed.

Theorem 5.1 (Wilks' Theorem (Asymptotic distribution of the GLRT)). Consider a composite hypothesis problem of the form:

H₀ : x₁, ..., x_N ∼ p(x | θ₀),  θ₀ ∈ R^{K₀},
H₁ : x₁, ..., x_N ∼ p(x | θ₁),  θ₁ ∈ R^{K₁},

where p has the same form in both hypotheses, and the K₀ unknown parameters in θ₀ are a subset of the K₁ unknown parameters in θ₁ (these are called nested hypotheses). Suppose that for every k, l ∈ {1, ..., K₁}, ∂p(x|θ)/∂θₖ and ∂²p(x|θ)/∂θₖ∂θₗ exist, and that E[∂ log p(x|θ)/∂θₖ] = 0 (this guarantees that the MLE θ̂_ML converges to the true θ). Then under H₀,

2 log Λ̂(x) → χ²(K₁ − K₀)  as N → ∞.

Proof. The proof follows as a consequence of the central limit theorem. For a detailed proof see Theorems 10.3.1 and 10.3.3 in Statistical Inference by George Casella and Roger L. Berger, second edition.

Example 5.4. Consider

H₀ : x₁, ..., x_N ∼ N(µ, σ₀²),
H₁ : x₁, ..., x_N ∼ N(µ, σ²),  σ² > 0,

where µ and σ₀² are known, and σ² is unknown.
The MLE of σ² is

argmax_{θ₁∈Θ₁} p(x | θ₁) = argmax_{σ²∈R⁺} p(x | σ²)
 = argmax_{σ²∈R⁺} (1/(2πσ²))^{N/2} e^{−(x−µ)ᵀ(x−µ)/(2σ²)}
 = argmax_{σ²∈R⁺} [ −(N/2) log(2π) − (N/2) log σ² − (x−µ)ᵀ(x−µ)/(2σ²) ]   (first term is constant)
 = argmax_{σ²∈R⁺} [ −(N/2) log σ² − (x−µ)ᵀ(x−µ)/(2σ²) ].

Taking the derivative w.r.t. σ² and setting it to zero we have:

−N/(2σ²) + (x−µ)ᵀ(x−µ)/(2σ⁴) = 0.

Solving for σ² we obtain the MLE:

σ̂² = (1/N)(x−µ)ᵀ(x−µ) = (1/N) Σᵢ₌₁ᴺ (xᵢ − µ)².

Then

Λ̂(x) = max_{θ₁∈Θ₁} p(x | θ₁) / max_{θ₀∈Θ₀} p(x | θ₀)
 = max_{σ²∈R⁺} p(x | σ²) / p(x | σ₀²)
 = p(x | σ̂²) / p(x | σ₀²)
 = (2πσ̂²)^{−N/2} e^{−(x−µ)ᵀ(x−µ)/(2σ̂²)} / [ (2πσ₀²)^{−N/2} e^{−(x−µ)ᵀ(x−µ)/(2σ₀²)} ]
 = (σ₀²/σ̂²)^{N/2} e^{−N/2} e^{(x−µ)ᵀ(x−µ)/(2σ₀²)},

where we used (x−µ)ᵀ(x−µ) = Nσ̂², so that (x−µ)ᵀ(x−µ)/(2σ̂²) = N/2. Taking 2 log on both sides we obtain:

2 log Λ̂(x) = −N log( (1/N) Σᵢ₌₁ᴺ (xᵢ−µ)²/σ₀² ) + Σᵢ₌₁ᴺ (xᵢ−µ)²/σ₀² − N,

where under H₀, each sum Σᵢ₌₁ᴺ (xᵢ−µ)²/σ₀² ∼ χ²(N).

Now the question is: do you know the distribution of −N log( (1/N) χ²(N) ) + χ²(N) − N? Sure as hell I don't! Luckily, there are no unknown parameters under H₀, and one unknown parameter under H₁ (namely σ²), so these are nested hypotheses, and hence we can use Theorem 5.1 to know that under H₀,

2 log Λ̂(x) → χ²(1)  as N → ∞.

Hence, if we want p₁₀ ≤ α, it suffices to select the threshold τ for which the tail probability of a χ²(1) random variable is α.
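The last step can be sketched in a few lines. The following Python snippet (the notes' own code is MATLAB; the function name `two_log_glrt` is mine) picks the χ²(1) threshold for a given α and evaluates the statistic of Example 5.4 on data drawn under H₀:

```python
import numpy as np
from scipy.stats import chi2

alpha = 0.05
# By Wilks' Theorem, 2 log Lambda_hat -> chi2(1) under H0, so choose tau
# with P(chi2(1) > tau) = alpha, i.e. the (1 - alpha)-quantile.
tau = chi2.ppf(1 - alpha, df=1)

def two_log_glrt(x, mu, sigma0):
    """2 log Lambda_hat from Example 5.4:
    H0: sigma = sigma0 (known) vs H1: sigma unknown."""
    z2 = np.sum(((x - mu) / sigma0) ** 2)   # chi2(N) under H0
    n = len(x)
    return -n * np.log(z2 / n) + z2 - n

rng = np.random.default_rng(0)
x = rng.normal(5.0, 1.0, size=500)          # data generated under H0
stat = two_log_glrt(x, mu=5.0, sigma0=1.0)
reject = stat > tau                          # decide H1 iff True
```

Note that the statistic is always nonnegative (it is N(y − 1 − log y) for y = σ̂²/σ₀²), and it is zero exactly when the sample variance matches σ₀².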
Figure 5.4: Results of the simulation of Example 5.4. In black is the histogram of 2 log Λ̂(x) over T trials. In blue is the asymptotic distribution of 2 log Λ̂(x). In red is the probability of error p₁₀, set to α = 0.05. The code for this simulation is in the appendix.

5.3 Experiments

Let us now run some simulations to verify Theorem 5.1 and our results from Example 5.4. We will generate x₁, ..., x_N ∼ N(µ, σ₀²) and compute 2 log Λ̂(x) as in Example 5.4. We will repeat this for T trials, plot a histogram, and compare it with the χ²(1) distribution predicted by Theorem 5.1. Figure 5.4 shows some results. The code for this simulation is in the appendix.
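The same experiment can be checked without plotting: if Wilks' Theorem holds, then rejecting whenever 2 log Λ̂(x) exceeds the χ²(1) threshold should produce a rejection rate close to α under H₀. A self-contained Python sketch of this sanity check (the appendix code is MATLAB; the constants below are my choices, not the notes'):

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(1)
T, N = 2000, 100          # trials and samples per trial (illustrative values)
mu, sigma0 = 5.0, 1.0     # known mean and H0 standard deviation

stats = np.empty(T)
for t in range(T):
    x = rng.normal(mu, sigma0, size=N)          # sample under H0
    z2 = np.sum(((x - mu) / sigma0) ** 2)
    stats[t] = -N * np.log(z2 / N) + z2 - N     # 2 log Lambda_hat

# Empirical rejection rate at the chi2(1) threshold should be near alpha.
alpha = 0.05
tau = chi2.ppf(1 - alpha, df=1)
emp = np.mean(stats > tau)
```

With N = 100 the empirical rate already lands near 0.05, which is the numerical counterpart of the histogram-vs-χ²(1) agreement in Figure 5.4.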
A Code for Simulation of Example 5.4

clear all; close all; clc; warning('off','all');

% === Code to simulate the asymptotic distribution of 2 log Lambda hat ===
% ===== See Example 5.4.
% (Several numeric constants below were lost in extraction; the values
% shown are reasonable reconstructions, not the original ones.)

T = 10000;        % Number of trials.
N = 100;          % Number of samples.
mu = 5;           % Known mean under both hypotheses.
sigma_star0 = 1;  % True standard deviation of x_i under H0.
alpha = 0.05;     % Allowed probability of error (p10).

log_Lambda_hat = zeros(T,1);  % log Lambda hat over trials.
for t = 1:T,
    X = randn(N,1)*sigma_star0 + mu;  % Sample x_1,...,x_N under H0.
    log_Lambda_hat(t) = -N/2 * log( 1/N * sum(((X-mu)/sigma_star0).^2) ) ...
        + 1/2 * sum(((X-mu)/sigma_star0).^2) ...
        - N/2;
end

% Plot results.
figure(1);
axes('box','on');
hold on;
[h,rangeh] = hist(2*log_Lambda_hat);
stem(rangeh, h/T, 'k', 'linewidth', 4, 'markersize', 10);

% Asymptotic distribution of 2*log Lambda hat.
rangeh = min(rangeh):.01:max(rangeh);
chi = chi2pdf(rangeh, 1);
plot(rangeh, chi, 'b', 'linewidth', 4);

% Highlight the probability alpha.
tau = chi2inv(1-alpha, 1);
rangeh = tau:.05:max(rangeh);
chi = chi2pdf(rangeh, 1);
stem(rangeh, chi, 'r', 'linewidth', 5, 'markersize', 10);

% Make figure look sexy.
axis tight;
set(gca,'xtick',[],'xticklabel',[]);
set(gca,'ytick',[],'yticklabel',[]);
ylabel('$\mathsf{P}(2\log\hat{\Lambda}(\textbf{x}))$','interpreter','latex','fontsize',25);
xlabel('$2\log\hat{\Lambda}(\textbf{x})$','interpreter','latex','fontsize',30);

% Save figure.
set(gcf, 'renderer','default');
figure_name = 'wilks.pdf';
saveas(gcf, figure_name);