Composite Hypotheses and Generalized Likelihood Ratio Tests

Rebecca Willett, 06

In many real-world problems, it is difficult to precisely specify probability distributions. Our models for data may involve unknown parameters or other characteristics. Here are a few motivating examples.

Example: Unknown amplitudes/delays in wireless communications. We don't always know how many relays a signal will go through, how strong the signal will be at each receiver, the distance between relay stations, etc.

Example: Unknown signal amplitudes in functional brain imaging.

Example: Unknown expression levels in gene microarray experiments.
1 Composite Hypothesis Tests

We can represent uncertainty by specifying a collection of possible models for each hypothesis. The collections are indexed by a parameter:

$$H_0 : X \sim p_0(x|\theta_0), \quad \theta_0 \in \Theta_0$$
$$H_1 : X \sim p_1(x|\theta_1), \quad \theta_1 \in \Theta_1$$

In general, the distributions $p_0$ and $p_1$ may have different parametric forms. The sets $\Theta_0$ and $\Theta_1$ represent the possible values for the parameters. If a set contains a single element (i.e., a single value for the parameter), then we have a simple hypothesis, as discussed in past lectures. When a set contains more than one parameter value, then the hypothesis is called a composite hypothesis, because it involves more than one model. The name is even clearer if we consider the following equivalent expression for the hypotheses above:

$$H_0 : X \sim p_0, \quad p_0 \in \{p_0(x|\theta_0)\}_{\theta_0 \in \Theta_0}$$
$$H_1 : X \sim p_1, \quad p_1 \in \{p_1(x|\theta_1)\}_{\theta_1 \in \Theta_1}$$

Example: Brain imaging. Recall the brain imaging problem:

$$H_0 : X \sim \mathcal{N}(0, 1)$$
$$H_1 : X \sim \mathcal{N}(\mu, 1), \quad \mu > 0 \text{ but otherwise unknown},$$

or equivalently $X \sim p_1$, $p_1 \in \{\mathcal{N}(\mu, 1)\}_{\mu > 0}$. In this example, $H_0$ is simple and $H_1$ is composite.

2 Uniformly Most Powerful Tests

Let us begin by considering special cases in which the usual likelihood ratio test is computable and optimal. Here is an example:

$$H_0 : x_1, \ldots, x_n \text{ iid } \mathcal{N}(0, 1)$$
$$H_1 : x_1, \ldots, x_n \text{ iid } \mathcal{N}(\mu, 1), \quad \mu > 0$$

Log LRT:

$$\log \prod_{i=1}^n \frac{\frac{1}{\sqrt{2\pi}} e^{-(x_i - \mu)^2/2}}{\frac{1}{\sqrt{2\pi}} e^{-x_i^2/2}} = -\frac{1}{2} \sum_{i=1}^n (x_i - \mu)^2 + \frac{1}{2} \sum_{i=1}^n x_i^2 = \mu \sum_{i=1}^n x_i - \frac{n \mu^2}{2}$$
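The algebraic simplification above is easy to sanity-check numerically; a minimal sketch (the sample size, value of $\mu$, and seed are arbitrary):

```python
import random

random.seed(1)
n, mu = 10, 0.7
x = [random.gauss(mu, 1.0) for _ in range(n)]

# Direct log likelihood ratio: sum over i of the log of the Gaussian
# density ratio exp(-(x_i - mu)^2 / 2) / exp(-x_i^2 / 2).
direct = sum(-(xi - mu) ** 2 / 2 + xi ** 2 / 2 for xi in x)

# Simplified form from the derivation: mu * sum(x_i) - n * mu^2 / 2.
simplified = mu * sum(x) - n * mu ** 2 / 2

print(abs(direct - simplified) < 1e-9)   # the two expressions agree
```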
Test statistic:

$$\mu \sum_{i=1}^n x_i \overset{H_1}{\underset{H_0}{\gtrless}} \gamma \quad \Longleftrightarrow \quad \sum_{i=1}^n x_i \overset{H_1}{\underset{H_0}{\gtrless}} \gamma/\mu =: \gamma'$$

We were able to divide both sides by $\mu$ since $\mu > 0$. We do not need to know the exact value of $\mu$ in order to compute the test $\sum_i x_i \gtrless \gamma'$ for any value of $\gamma'$. Let $t = \sum_{i=1}^n x_i$ denote the test statistic. It is easy to determine its distribution(s) under each hypothesis (a composite in the case of $H_1$):

$$H_0 : t \sim \mathcal{N}(0, n)$$
$$H_1 : t \sim \mathcal{N}(n\mu, n), \quad \mu > 0 \text{ unknown}$$

Since the distribution of $t$ under $H_0$ is known, we can choose the threshold to control $P_{FA}$:

$$P_{FA} = Q\left( \frac{\gamma'}{\sqrt{n}} \right) \quad \Longrightarrow \quad \gamma' = \sqrt{n}\, Q^{-1}(P_{FA})$$

This is the optimal detector (most powerful) according to the Neyman–Pearson lemma. Several ROC curves corresponding to different values of the unknown parameter $\mu > 0$ are depicted below. We cannot know which curve we are operating on, but we can choose a threshold for a desired $P_{FA}$ and the resulting $P_D$ is the best possible (for the unknown value of $\mu$). In such cases we say that the test is uniformly most powerful, that is, most powerful no matter what the value of the unknown parameter.

Figure 1: ROC for various $\mu > 0$ for the simple case.

Definition: Uniformly Most Powerful Test
A uniformly most powerful (UMP) test is a hypothesis test which has the greatest power (i.e., greatest probability of detection) among all possible tests yielding a given false alarm rate, regardless of the underlying true parameter(s).
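Setting the threshold and evaluating the power for candidate values of $\mu$ can be sketched as follows (Python's `statistics.NormalDist` supplies the Gaussian CDF and its inverse; the values of $n$, $P_{FA}$, and $\mu$ are illustrative choices):

```python
from statistics import NormalDist

norm = NormalDist()

def Q(z):
    """Gaussian tail probability Q(z) = P(N(0,1) > z)."""
    return 1.0 - norm.cdf(z)

def Q_inv(p):
    """Inverse of the tail probability Q."""
    return norm.inv_cdf(1.0 - p)

n, p_fa = 25, 0.05
gamma = (n ** 0.5) * Q_inv(p_fa)   # threshold gamma' = sqrt(n) Q^{-1}(P_FA)

# Under H1, t ~ N(n*mu, n), so P_D = Q((gamma' - n*mu) / sqrt(n)).
for mu in (0.2, 0.5, 1.0):
    p_d = Q((gamma - n * mu) / n ** 0.5)
    print(f"mu = {mu}: P_D = {p_d:.3f}")
```

Note that the same threshold works for every $\mu > 0$; only the achieved $P_D$ changes, which is exactly the UMP property.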
3 Two-sided Tests

To see how special the UMP condition is, consider the following simple generalization of the testing problems above:

$$H_0 : x \sim \mathcal{N}(0, 1)$$
$$H_1 : x \sim \mathcal{N}(\mu, 1), \quad \mu \neq 0$$

The log-likelihood ratio statistic is

$$\log \Lambda(x) = -\frac{(x - \mu)^2}{2} + \frac{x^2}{2} = \mu x - \mu^2/2$$

and the log-LRT has the form

$$\mu x - \mu^2/2 \overset{H_1}{\underset{H_0}{\gtrless}} \gamma.$$

We can move the term $\mu^2/2$ to the other side and absorb it into the threshold, but this leaves us with a test of the form $\mu x \gtrless \gamma'$. Since $\mu$ is unknown (and not necessarily positive) the test is uncomputable. How can we proceed? Look at the two densities in the microarray experiment. Intuitively the test

$$|x| \overset{H_1}{\underset{H_0}{\gtrless}} \gamma$$

seems reasonable. This is called the Wald test. The $P_{FA}$ of the Wald test is

$$P_{FA} = 2 Q(\gamma) \quad \Longrightarrow \quad \gamma = Q^{-1}(P_{FA}/2)$$

and the detection probability is

$$P_D = \int_{-\infty}^{-\gamma} \frac{1}{\sqrt{2\pi}} e^{-(x - \mu)^2/2}\, dx + \int_{\gamma}^{\infty} \frac{1}{\sqrt{2\pi}} e^{-(x - \mu)^2/2}\, dx.$$

With the substitution $y = x - \mu$,

$$P_D = \int_{-\infty}^{-\gamma - \mu} \mathcal{N}(0, 1)\, dy + \int_{\gamma - \mu}^{\infty} \mathcal{N}(0, 1)\, dy = Q(\gamma + \mu) + Q(\gamma - \mu).$$
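A short sketch of the Wald test threshold and power calculation (the values of $P_{FA}$ and $\mu$ are illustrative); note that $P_D$ is symmetric in the sign of $\mu$, as the formula suggests:

```python
from statistics import NormalDist

norm = NormalDist()

def Q(z):
    """Gaussian tail probability Q(z) = P(N(0,1) > z)."""
    return 1.0 - norm.cdf(z)

p_fa = 0.05
gamma = norm.inv_cdf(1.0 - p_fa / 2)   # gamma = Q^{-1}(P_FA / 2)

def p_d(mu):
    """Detection probability of the Wald test |x| > gamma."""
    return Q(gamma - mu) + Q(gamma + mu)

for mu in (-1.5, -0.5, 0.5, 1.5):
    print(f"mu = {mu}: P_D = {p_d(mu):.3f}")
```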
Figure 2: The $P_D$ depends on $\mu$, which is unknown.

Model $\mu$ as a deterministic, but unknown, parameter. Estimate $\mu$ from the data and plug the estimate into the LRT. Under $H_1$ the distribution is $X \sim \mathcal{N}(\mu, 1)$, so a natural estimate for $\mu$ is $\hat{\mu} = x$, the observation itself. Plugging this into the likelihood ratio yields

$$\Lambda(x) = \frac{p(x|\hat{\mu})}{p(x|0)} = \frac{\exp(-(x - \hat{\mu})^2/2)}{\exp(-x^2/2)} = e^{x^2/2}.$$

This is the generalized likelihood ratio. In effect, this compares the best-fitting model in the composite hypothesis $H_1$ with the $H_0$ model. Taking the log yields the test

$$\log \Lambda(x) = \frac{x^2}{2} \overset{H_1}{\underset{H_0}{\gtrless}} \gamma,$$

which is equivalent to the Wald test.

4 The Generalized Likelihood Ratio Test (GLRT)

Consider a composite hypothesis test of the form

$$H_0 : X \sim p_0(x|\theta_0), \quad \theta_0 \in \Theta_0$$
$$H_1 : X \sim p_1(x|\theta_1), \quad \theta_1 \in \Theta_1$$

The parametric densities $p_0$ and $p_1$ need not have the same form. The generalized likelihood ratio test (GLRT) is a general procedure for composite testing problems. The basic idea is to compare the best model in class $H_1$ to the best in $H_0$, which is formalized as follows.

Definition: Generalized Likelihood Ratio Test (GLRT)
The GLRT based on an observation $x$ of $X$ is

$$\Lambda(x) = \frac{\max_{\theta_1 \in \Theta_1} p_1(x|\theta_1)}{\max_{\theta_0 \in \Theta_0} p_0(x|\theta_0)} \overset{H_1}{\underset{H_0}{\gtrless}} \gamma,$$

or equivalently $\log \Lambda(x) \gtrless \log \gamma$.

Example: We observe a vector $X \in \mathbb{R}^n$, and consider two hypotheses:

$$H_0 : X \sim \mathcal{N}(0, \sigma^2 I_n)$$
$$H_1 : X \sim \mathcal{N}(H\theta, \sigma^2 I_n),$$

where $\sigma^2 > 0$ is known, $\theta \in \mathbb{R}^k$ is unknown, and $H \in \mathbb{R}^{n \times k}$ is known and full rank. The mean vector $H\theta$ is a model for a signal that lies in the $k$-dimensional subspace spanned by the columns of $H$ (e.g., a narrowband subspace, polynomial subspace, etc.). In other words, the signal has the representation

$$s = \sum_{i=1}^k \theta_i h_i, \quad H = [h_1, \ldots, h_k].$$

The null hypothesis is that no signal is present (noise only). The log likelihood ratio is

$$\log \Lambda(x) = -\frac{1}{2\sigma^2} \left( (x - H\theta)^\top (x - H\theta) - x^\top x \right) = \frac{1}{2\sigma^2} \left( 2 \theta^\top H^\top x - \theta^\top H^\top H \theta \right).$$

Thus we may consider the test

$$\theta^\top H^\top x \overset{H_1}{\underset{H_0}{\gtrless}} \gamma,$$

but this test is not computable without knowledge of $\theta$. Recall that

$$H_1 : x \sim \mathcal{N}(H\theta, \sigma^2 I), \quad \theta \in \mathbb{R}^k \quad \Longleftrightarrow \quad H_1 : x \sim p, \quad p \in \{\mathcal{N}(H\theta, \sigma^2 I)\}_{\theta \in \mathbb{R}^k}.$$

We want to pick the $p$ in $\{\mathcal{N}(H\theta, \sigma^2 I)\}$ that matches $x$ the best:

$$p(x|\theta, H_1) = \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\left\{ -\frac{1}{2\sigma^2} (x - H\theta)^\top (x - H\theta) \right\}$$
Find the $\theta$ that maximizes the likelihood of observing $x$:

$$\hat{\theta} = \arg\min_{\theta \in \mathbb{R}^k} (x - H\theta)^\top (x - H\theta).$$

Taking the gradient with respect to $\theta$ and setting it to zero:

$$\nabla_\theta \left( x^\top x - 2 \theta^\top H^\top x + \theta^\top H^\top H \theta \right) = -2 H^\top x + 2 H^\top H \theta = 0 \quad \Longrightarrow \quad \hat{\theta} = (H^\top H)^{-1} H^\top x$$

Plugging $\hat{\theta}$ into the test statistic $\theta^\top H^\top x$, we have

$$\hat{\theta}^\top H^\top x = x^\top H (H^\top H)^{-1} H^\top x \overset{H_1}{\underset{H_0}{\gtrless}} \gamma$$

Recall that the projection matrix onto the column space of $H$ is defined as $P_H := H (H^\top H)^{-1} H^\top$, so

$$\log \Lambda(x) = \frac{1}{2\sigma^2} x^\top P_H x = \frac{1}{2\sigma^2} \|P_H x\|^2.$$

Thus the GLRT computes the energy in the signal subspace, and if the energy is large enough, then $H_1$ is accepted. In other words, we are taking the projection of $x$ onto the subspace and measuring its energy. The expected value of this energy under $H_0$ (noise only) is

$$\mathbb{E}_{H_0}\left[ \|P_H X\|^2 \right] = k \sigma^2,$$

since a fraction $k/n$ of the total noise energy $n\sigma^2$ falls into this subspace.

The performance of the subspace energy detector can be quantified as follows. We choose a $\gamma$ for the desired $P_{FA}$:

$$\frac{1}{\sigma^2} x^\top P_H x \overset{H_1}{\underset{H_0}{\gtrless}} \gamma$$

What is the distribution of $x^\top P_H x$ under $H_0$? First use the decomposition $P_H = U U^\top$, where $U \in \mathbb{R}^{n \times k}$ has orthonormal columns spanning the column space of $H$, and let $y := U^\top x$. Then

$$\frac{1}{\sigma^2} x^\top P_H x = \frac{1}{\sigma^2} x^\top U U^\top x = \frac{1}{\sigma^2} y^\top y.$$

Under $H_0$, $y \sim \mathcal{N}(0, \sigma^2 U^\top U) = \mathcal{N}(0, \sigma^2 I_{k \times k})$, so $y_i/\sigma$, $i = 1, \ldots, k$, are iid $\mathcal{N}(0, 1)$ and

$$\frac{y^\top y}{\sigma^2} \sim \chi^2_k,$$

chi-squared with $k$ degrees of freedom.
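The subspace energy statistic can be sketched numerically; here $H$ is a hypothetical polynomial basis (not from the notes), numpy is assumed available, and the $\chi^2_k$ behavior under $H_0$ is checked by Monte Carlo:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, sigma = 50, 3, 1.0

# Hypothetical full-rank subspace matrix H: polynomial basis 1, t, t^2.
t = np.linspace(0.0, 1.0, n)
H = np.vander(t, k, increasing=True)

# Projection onto the column space of H: P_H = H (H^T H)^{-1} H^T.
P_H = H @ np.linalg.solve(H.T @ H, H.T)

def statistic(x):
    """Subspace energy statistic ||P_H x||^2 / sigma^2."""
    Px = P_H @ x
    return float(Px @ Px) / sigma**2

# Under H0 the statistic is chi-squared with k degrees of freedom, so its
# average over many noise-only draws should be close to k.
h0_draws = [statistic(sigma * rng.standard_normal(n)) for _ in range(2000)]
print(sum(h0_draws) / len(h0_draws))   # close to k = 3
```

The `solve` call avoids explicitly forming $(H^\top H)^{-1}$, which is the usual numerically safer way to build the projection.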
Under $H_0$,

$$\frac{1}{\sigma^2} x^\top P_H x \sim \chi^2_k \quad \Longrightarrow \quad P_{FA} = \mathbb{P}(\chi^2_k > \gamma)$$

To calculate the tails of $\chi^2_k$ distributions you can use software such as Matlab (chi2cdf(x,k), chi2inv(p,k)). Remember the mean of a $\chi^2_k$ distribution is $k$, so we want to choose a $\gamma$ bigger than $k$ to produce a small $P_{FA}$.

5 Wilks' Theorem

Wilks' Theorem (1938). Consider a composite hypothesis testing problem

$H_0$ : $X_1, X_2, \ldots, X_n$ iid $p(x|\theta_0)$, where $\theta_{0,1}, \ldots, \theta_{0,l} \in \mathbb{R}$ are free parameters and $\theta_{0,l+1} = a_{l+1}, \ldots, \theta_{0,k} = a_k$ are fixed at the values $a_{l+1}, \ldots, a_k$

$H_1$ : $X_1, X_2, \ldots, X_n$ iid $p(x|\theta_1)$, where $\theta_1 \in \mathbb{R}^k$ are all free parameters

and the parametric density has the same form in each hypothesis. In this case the family of models in $H_0$ is a subset of those in $H_1$, and we say that the hypotheses are nested. (This is a key condition that must hold for this theorem.) If the 1st and 2nd order derivatives of $p(x|\theta_i)$ with respect to $\theta_i$ exist and if

$$\mathbb{E}\left[ \frac{\partial \log p(x|\theta_i)}{\partial \theta_i} \right] = 0$$

(which guarantees that the MLE $\hat{\theta}_i \to \theta_i$ as $n \to \infty$), then the generalized likelihood ratio statistic, based on an observation $X = (X_1, \ldots, X_n)$,

$$\Lambda_n(X) = \frac{\max_{\theta_1} p(X|\theta_1)}{\max_{\theta_0} p(X|\theta_0)} \qquad (1)$$
has the following asymptotic distribution under $H_0$:

$$2 \log \Lambda_n(X) \overset{D}{\longrightarrow} \chi^2_{k-l} \quad \text{as } n \to \infty,$$

i.e., $2 \log \Lambda_n(X)$ converges in distribution to a chi-squared random variable with $k - l$ degrees of freedom.

Proof: (Sketch) Under the conditions of the theorem, the log GLRT tends to the log GLRT in a Gaussian setting according to the Central Limit Theorem (CLT).

Example: Nested Condition

$$H_0 : x_i \text{ iid } \mathcal{N}(0, 1)$$
$$H_1 : x_i \text{ iid } \mathcal{N}(0, \sigma^2), \quad i = 1, 2, \ldots, n, \quad \sigma^2 > 0 \text{ unknown}$$

Log LR (for a given $\sigma^2$):

$$\log \Lambda = \sum_{i=1}^n \left( -\frac{1}{2} \log \sigma^2 - \frac{x_i^2}{2\sigma^2} + \frac{x_i^2}{2} \right)$$

MLE of $\sigma^2$:

$$\hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^n x_i^2$$

Log GLR under $H_0$:

$$2 \log \Lambda_n = \sum_{i=1}^n x_i^2 - n \log\left( \frac{1}{n} \sum_{i=1}^n x_i^2 \right) - n \overset{D}{\longrightarrow} \chi^2_1,$$

consistent with the theorem, since $H_1$ has $k = 1$ free parameter while $H_0$ has $l = 0$.
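As a numerical illustration of Wilks' theorem for this example, one can simulate $2 \log \Lambda_n$ under $H_0$ and compare its sample mean to $1$, the mean of a $\chi^2_1$ distribution (the sample size, trial count, and seed below are arbitrary):

```python
import math
import random

random.seed(0)
n, trials = 200, 2000

def two_log_glr(x):
    """2 log GLR for H0: N(0,1) vs H1: N(0, sigma^2), sigma^2 unknown.

    Equals sum(x_i^2) - n*log(sigma_hat^2) - n, rewritten here with
    sigma_hat^2 = mean of the x_i^2 (the MLE under H1).
    """
    s2 = sum(xi * xi for xi in x) / len(x)
    return len(x) * (s2 - 1.0 - math.log(s2))

stats = [two_log_glr([random.gauss(0.0, 1.0) for _ in range(n)])
         for _ in range(trials)]
print(sum(stats) / trials)   # close to 1, the mean of chi2_1
```

The statistic is always nonnegative (since $t - 1 - \log t \ge 0$ for $t > 0$), as a likelihood ratio of nested models must be.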