EECS 750. Hypothesis Testing with Communication Constraints
Dinesh Krithivasan

Abstract

In this report, we study a modification of the classical statistical problem of bivariate hypothesis testing. The statistician has to make his decisions based on remotely collected data, available to him through a noiseless channel of finite rate. The problem is to determine the minimum value $\beta_n$ of the error probability of the second kind under the condition that the error probability of the first kind satisfies $\alpha_n \le \epsilon$ for a given $0 < \epsilon < 1$. We first establish the basic results using techniques of information theory. We then derive a single-letter lower bound for the general bivariate testing case in the presence of one-sided data compression. Special cases of this bound for the test against independence, and extensions to two-sided data compression, are discussed.
1 Introduction

One of the most fundamental and standard problems in statistics is to decide which of two available explanations best explains the observed data $x^n = (x_1, x_2, \ldots, x_n)$. For example, based on $n$ measurements of sensor data, one might want to determine whether or not an earthquake occurred. In the simplest hypothesis testing problem, we assume that the data is generated i.i.d. from either the distribution $P = (P(x))_{x \in \mathcal{X}}$ (hypothesis $H_0$) or the distribution $Q = (Q(x))_{x \in \mathcal{X}}$ (hypothesis $H_1$). The statistician has to decide between $H_0$ and $H_1$ based on a sample of size $n$. The most general decision rule is to declare $H_0$ if $x^n \in A \subseteq \mathcal{X}^n$ and declare $H_1$ otherwise. Two kinds of error are involved in this problem. A type-1 error occurs if the statistician declares $H_1$ when the data was in fact generated according to $H_0$; a type-2 error is defined analogously. For the decision rule given above,

$$\alpha_n \triangleq P(\text{error of type 1}) = P^n(A^c), \qquad \beta_n \triangleq P(\text{error of type 2}) = Q^n(A).$$

It is clear from these definitions of $\alpha_n$ and $\beta_n$ that there is a tradeoff between them.

1.1 Neyman-Pearson Lemma

The Neyman-Pearson lemma identifies the optimal test between two hypotheses, i.e., the optimal way to choose the set $A$. The lemma can be stated as follows.

Lemma 1.1. For $T \ge 0$, define the region

$$A_n(T) = \left\{ x^n : \frac{P^n(x^n)}{Q^n(x^n)} > T \right\}.$$

Then no other decision region $B_n \subseteq \mathcal{X}^n$ achieves lower error probabilities for both the type-1 and type-2 error events than $A_n(T)$. In other words, the optimal test is a likelihood-ratio threshold test

$$\frac{P^n(x^n)}{Q^n(x^n)} \gtrless T,$$

where the choice of $T$ determines the tradeoff between $\alpha_n$ and $\beta_n$.
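To make the tradeoff concrete, the following Python sketch (an illustration added to this write-up; the Bernoulli distributions, block length, and thresholds are arbitrary choices) implements the likelihood-ratio test of Lemma 1.1 and estimates both error probabilities by Monte Carlo:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative choice: P = Bernoulli(0.3) under H0, Q = Bernoulli(0.6) under H1.
P, Q = np.array([0.7, 0.3]), np.array([0.4, 0.6])
n, trials = 50, 20_000
llr_per_symbol = np.log(P / Q)                # log P(x)/Q(x) for x = 0, 1

def log_lr(samples):
    """Log likelihood ratio log[P^n(x^n)/Q^n(x^n)] of each row of i.i.d. samples."""
    return llr_per_symbol[samples].sum(axis=1)

x_H0 = rng.choice(2, size=(trials, n), p=P)   # data generated under H0
x_H1 = rng.choice(2, size=(trials, n), p=Q)   # data generated under H1

for logT in (-5.0, 0.0, 5.0):                 # the threshold log T sweeps the tradeoff
    alpha = np.mean(log_lr(x_H0) <= logT)     # type-1: declare H1 although H0 is true
    beta = np.mean(log_lr(x_H1) > logT)       # type-2: declare H0 although H1 is true
    print(f"log T = {logT:+.1f}: alpha_n ~ {alpha:.4f}, beta_n ~ {beta:.4f}")
```

Sweeping $\log T$ from $-\infty$ to $+\infty$ traces out the full tradeoff curve between $\alpha_n$ and $\beta_n$.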
1.2 Stein's Lemma

Very often, $H_1$ denotes a highly critical event (such as the occurrence of an earthquake) while $H_0$ denotes its negation (i.e., the non-occurrence of the earthquake). Consequently, type-2 error events are more critical than type-1 error events. We therefore want $\beta_n$ to decay to 0 as fast as possible, while only requiring that $\alpha_n$ remain small as $n \to \infty$. We need to find $A_n \subseteq \mathcal{X}^n$ with $P^n(A_n) \ge 1 - \epsilon$ and $Q^n(A_n) = \beta_n(\epsilon)$, where $0 < \epsilon < 1$ is given and

$$\beta_n(\epsilon) \triangleq \min \left\{ Q^n(A) : A \subseteq \mathcal{X}^n,\ P^n(A) > 1 - \epsilon \right\}.$$

The rate of convergence of $\beta_n(\epsilon)$ as $n \to \infty$ is given by Stein's lemma, stated below.

Lemma 1.2. For any $\epsilon \in (0,1)$, we have

$$\lim_{n \to \infty} \frac{1}{n} \log \beta_n(\epsilon) = -D(P \,\|\, Q),$$

where

$$D(P \,\|\, Q) \triangleq \sum_{x \in \mathcal{X}} P(x) \log \frac{P(x)}{Q(x)}$$

is the familiar Kullback-Leibler divergence encountered in information theory. Note that this result is similar in spirit to a strong converse in information theory, since the limit holds for all $\epsilon \in (0,1)$ and not just as $\epsilon \to 0$. A simple proof of this lemma using typicality is sketched below.

Proof. We need to show achievability of the stated exponent, and show that it cannot be improved upon. We first show the direct part. Let the random variables $X$ and $\bar{X}$ be distributed according to $P$ (under $H_0$) and $Q$ (under $H_1$) respectively. Let the acceptance region $A_n$ be the typical set $T_\mu^n(X)$. By the strong law of large numbers, we have $\alpha_n = P(X^n \in A_n^c) \le \epsilon$ if we choose $\mu < \epsilon$. Also, for any $x^n$,

$$P(\bar{X}^n = x^n) = \exp\left[ -n \left( H(P_{x^n}) + D(P_{x^n} \,\|\, Q) \right) \right],$$
where $P_{x^n}$ is the type (empirical distribution) of $x^n$. Using this, we can bound $\beta_n(\epsilon)$ as

$$\begin{aligned}
\beta_n &= \sum_{x^n \in A_n} P(\bar{X}^n = x^n) = \sum_{x^n \in A_n} \exp\left[ -n \left( H(P_{x^n}) + D(P_{x^n} \,\|\, Q) \right) \right] \\
&\le |A_n| \max_{x^n \in A_n} \exp\left[ -n \left( H(P_{x^n}) + D(P_{x^n} \,\|\, Q) \right) \right] \\
&\le \exp[n(H(P) + \epsilon_1)] \, \exp[-n(H(P) + D(P \,\|\, Q))] \qquad \text{(by continuity of entropy)} \\
&= \exp[-n(D(P \,\|\, Q) - \epsilon_1)],
\end{aligned}$$

where $\epsilon_1 \to 0$ as $\epsilon \to 0$. This shows the achievability of the exponent $D(P \,\|\, Q)$.

To show the converse part, suppose we are given an optimal acceptance region $A_n$ which minimizes $\beta_n$ subject to $\alpha_n \le \epsilon$. Set $B_n = A_n \cap T_\mu^n(X)$. Then $P(X^n \in B_n) \ge 1 - \epsilon - \mu$, which gives a bound on the cardinality of $B_n$: $|B_n| \ge (1 - \epsilon - \mu) \exp[n(H(X) - \mu)]$. Hence,

$$\begin{aligned}
\beta_n = P(\bar{X}^n \in A_n) \ge P(\bar{X}^n \in B_n) = \sum_{x^n \in B_n} P(\bar{X}^n = x^n) &\ge |B_n| \exp[-n(H(P) + D(P \,\|\, Q) + \delta)] \\
&\ge (1 - \epsilon - \mu) \exp[-n(D(P \,\|\, Q) + \delta + \mu)],
\end{aligned}$$

where $\delta \to 0$ as $\mu \to 0$. It now follows from the assumed optimality of $A_n$ that the optimal exponent $-\frac{1}{n} \log \beta_n$ is at most $D(P \,\|\, Q) + \delta + \mu$. Since $\mu$ is arbitrary, this establishes Stein's lemma.
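The exponent in Stein's lemma is easy to check numerically. The sketch below (again an illustration with arbitrary distributions) fixes $\alpha_n \approx \epsilon$ by placing the threshold at the empirical $\epsilon$-quantile of the log-likelihood ratio under $H_0$, and compares the empirical exponent $-\frac{1}{n} \log \beta_n$ with $D(P \,\|\, Q)$; block lengths are kept small so that $\beta_n$ remains estimable from a finite number of trials:

```python
import numpy as np

rng = np.random.default_rng(1)
P, Q = np.array([0.7, 0.3]), np.array([0.4, 0.6])
eps, trials = 0.1, 200_000
llr_per_symbol = np.log(P / Q)

D_PQ = np.sum(P * llr_per_symbol)               # D(P||Q) = sum_x P(x) log(P(x)/Q(x))
print(f"D(P||Q) = {D_PQ:.4f} nats")

for n in (5, 10, 20, 40):
    llr_H0 = llr_per_symbol[rng.choice(2, size=(trials, n), p=P)].sum(axis=1)
    llr_H1 = llr_per_symbol[rng.choice(2, size=(trials, n), p=Q)].sum(axis=1)
    T = np.quantile(llr_H0, eps)                # accept H0 iff llr > T, so alpha_n ~ eps
    beta = max(np.mean(llr_H1 > T), 1 / trials) # floor avoids log(0) at larger n
    print(f"n = {n:2d}: -(1/n) log beta_n ~ {-np.log(beta) / n:.4f}")
```

The empirical exponent approaches $D(P \,\|\, Q) \approx 0.18$ nats as $n$ grows; the convergence is slow, with a correction that decays like $1/\sqrt{n}$.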
2 Hypothesis Testing with Communication Constraints

So far, we have discussed the classical version of the hypothesis testing problem. We now add a new dimension to this problem by assuming that the statistician does not have direct access to the data. Rather, the data is remotely collected and made available to him/her through noiseless channels of finite capacity, as illustrated in Figure 1.

[Figure 1: Hypothesis testing with two-sided compression. The sequence $X^n$ is compressed by one encoder into $f(X^n)$ and $Y^n$ by another into $g(Y^n)$; the statistician decides between $H_0$ and $H_1$ from the two encoder outputs.]

Note that in the case of single-user hypothesis testing, the statistician requires only one bit of information (whether $x^n \in A_n$ or not) to make an optimal decision; hence that problem is entirely trivial. The simplest non-trivial case arises when there are at least two sources, as illustrated in Figure 1. This system was first studied by Ahlswede and Csiszár [1] in a simplified form, where they considered the test against independence with one-sided compression. They also derived a lower bound for the case of more general bivariate testing. This lower bound was later improved by Te Sun Han [2], whose formula subsumes all other known bounds. In this report, we shall mainly be concerned with this lower bound. A good survey of statistical inference problems under communication constraints can be found in [3].

There are a couple of important differences between multi-terminal information theory and statistical testing with multi-terminal data compression. In the hypothesis testing problem, we are not interested in recovering the original data in any sense; we are only interested in designing the best possible test to differentiate between $H_0$ and $H_1$. The performance measures here are the rate needed to transmit the data and the best error exponent achievable from the compressed data. Indeed, information theory tells us that if the rates $R_1$ and $R_2$ are not large enough, the probability of error in decoding the sources goes to 1 exponentially. But even in such cases, it is possible to do very well in the hypothesis testing problem. In fact, even the case $(R_1, R_2) = (0, 0)$ yields a non-zero exponent in the hypothesis testing problem (see [2]).
2.1 Bivariate Hypotheses with One-sided Compression

Let us first consider a simpler case than the system in Figure 1, in which one of the data streams (say $y^n$) is available to the statistician without compression. Let the null hypothesis be $H_0 : P_{XY}$ and the alternative hypothesis be $H_1 : P_{\bar{X}\bar{Y}}$. The main result here is a good lower bound on the exponent

$$\theta_n(\epsilon) \triangleq -\frac{1}{n} \log \beta_n(\epsilon).$$

Before we proceed to the main theorem, we need the following lemma.

Lemma 2.1. Suppose that $\eta > 0$ and $\delta > 0$ are arbitrary but fixed, and let $U, X, Y$ be finite-alphabet random variables such that $U \to X \to Y$ forms a Markov chain. Let $M = \exp[n(I(U;X) + \eta)]$. Then there exist $u_1, u_2, \ldots, u_M \in T_\mu^n(U)$ and $M$ disjoint subsets $C_1, \ldots, C_M$ with $C_i \subseteq T_\mu^n(X|u_i)$ for which

$$P\left\{ (X^n, Y^n) \in \bigcup_{i=1}^M C_i \times T_\mu^n(Y|u_i) \right\} \ge 1 - \delta. \tag{1}$$

Proof. The lemma can be proved using standard ideas from multi-user information theory and is omitted here; it can be found in [2].

We now state the main result of this section. Define two sets of auxiliary random variables:

$$\mathcal{S}(R) = \{ U : I(U;X) \le R,\ U \to X \to Y \},$$
$$\mathcal{L}(U) = \{ \tilde{U}\tilde{X}\tilde{Y} : P_{\tilde{U}\tilde{X}} = P_{UX},\ P_{\tilde{U}\tilde{Y}} = P_{UY} \}.$$

Also define the random variable $\bar{U}$ to satisfy $\bar{U} \to \bar{X} \to \bar{Y}$ and $P_{\bar{U}|\bar{X}} = P_{U|X}$. Define the function

$$\theta_L(R) = \sup_{U \in \mathcal{S}(R)} \; \min_{\tilde{U}\tilde{X}\tilde{Y} \in \mathcal{L}(U)} D(\tilde{U}\tilde{X}\tilde{Y} \,\|\, \bar{U}\bar{X}\bar{Y}).$$

Then we have the following theorem.

Theorem 2.1. Let $\theta(R, \epsilon)$ be the largest achievable exponent for the case of one-sided data compression at rate $R$. Then, for all $R \ge 0$ and $0 < \epsilon < 1$,

$$\theta(R, \epsilon) \ge \theta_L(R).$$
Proof. It is sufficient to show that, for each $U \in \mathcal{S}(R)$,

$$\theta(R, \epsilon) \ge \min_{\tilde{U}\tilde{X}\tilde{Y} \in \mathcal{L}(U)} D(\tilde{U}\tilde{X}\tilde{Y} \,\|\, \bar{U}\bar{X}\bar{Y}).$$

Let $M$, $u_1, u_2, \ldots, u_M \in T_\mu^n(U)$, and $C_i \subseteq T_\mu^n(X|u_i)$ be as specified in Lemma 2.1. The $X$-encoder is described by

$$f(x^n) = \begin{cases} i & \text{if } x^n \in C_i, \\ 0 & \text{otherwise.} \end{cases}$$

Note that since $M = \exp[n(I(U;X) + \eta)]$, the rate constraint is automatically satisfied by this encoder. The statistician has access to $i \in \{1, 2, \ldots, M\}$ and $y^n$ with which to make a decision. The decision rule is to declare $H_0$ if $y^n \in T_\mu^n(Y|u_i)$. Globally, this translates into the acceptance region

$$A_n = \bigcup_{i=1}^M \left( C_i \times T_\mu^n(Y|u_i) \right). \tag{2}$$

Note that this acceptance region $A_n$ is defined for computational purposes only; no single module in our system has all the information required to determine whether indeed $(x^n, y^n) \in A_n$. We now compute the error probabilities $\alpha_n$ and $\beta_n$.

Evaluation of $\alpha_n$: By the definition of $A_n$ and from equation (1), we have

$$\alpha_n = P\left( (X^n, Y^n) \in A_n^c \right) \le \delta.$$

Evaluation of $\beta_n$: The error probability of the second kind can be evaluated as

$$\beta_n = \sum_{(x^n, y^n) \in A_n} P\left( (\bar{X}^n, \bar{Y}^n) = (x^n, y^n) \right) = \sum_{(x^n, y^n) \in A_n} \exp\left[ -n \left( H\left(P_{(x^n,y^n)}\right) + D\left(P_{(x^n,y^n)} \,\|\, P_{\bar{X}\bar{Y}}\right) \right) \right]. \tag{3}$$

We now convert the summation over individual sequences into a summation over types. This involves estimating the number of distinct types $P_{(x^n,y^n)}$ occurring in $A_n$ and the number of sequences of each type.
Let $U^{(n)}, X^{(n)}, Y^{(n)}$ denote the type variables of the $n$-length sequences $u^n, x^n, y^n$ respectively. Let $K(X^{(n)} Y^{(n)})$ be the number of pairs $(x^n, y^n) \in A_n$ with type $(X^{(n)} Y^{(n)})$, and let $K_i(X^{(n)} Y^{(n)})$, $i = 1, 2, \ldots, M$, be the number of pairs $(x^n, y^n) \in C_i \times T_\mu^n(Y|u_i)$ with that type. By the disjointness of the $C_i$'s, we have $K(X^{(n)} Y^{(n)}) = \sum_{i=1}^M K_i(X^{(n)} Y^{(n)})$. Using these in equation (3), we get

$$\begin{aligned}
\beta_n &= \sum_{(X^{(n)}, Y^{(n)})} K\left(X^{(n)} Y^{(n)}\right) \exp\left[ -n \left( H\left(P_{(X^{(n)}, Y^{(n)})}\right) + D\left(P_{(X^{(n)}, Y^{(n)})} \,\|\, P_{\bar{X}\bar{Y}}\right) \right) \right] \\
&= \sum_{(U^{(n)}, X^{(n)}, Y^{(n)})} \sum_{i=1}^M K_i\left(X^{(n)} Y^{(n)}\right) \exp\left[ -n \left( H\left(P_{(X^{(n)}, Y^{(n)})}\right) + D\left(P_{(X^{(n)}, Y^{(n)})} \,\|\, P_{\bar{X}\bar{Y}}\right) \right) \right].
\end{aligned}$$

A simple upper bound on $K_i(X^{(n)} Y^{(n)})$ is given by

$$K_i\left(X^{(n)} Y^{(n)}\right) \le \left| T_\mu^n(Y|u_i) \right| \exp\left[ n H\left(X^{(n)} \mid U^{(n)} Y^{(n)}\right) \right].$$

Using this bound and the fact that $M = \exp[n(I(U;X) + \eta)]$, we get

$$\begin{aligned}
\beta_n &\le \sum_{(U^{(n)}, X^{(n)}, Y^{(n)})} \exp\left[ -n \left( d(U^{(n)} X^{(n)} Y^{(n)}) - 2\mu - \eta \right) \right] \\
&\le (n+1)^{|\mathcal{U}||\mathcal{X}||\mathcal{Y}|} \max_{(U^{(n)}, X^{(n)}, Y^{(n)})} \exp\left[ -n \left( d(U^{(n)} X^{(n)} Y^{(n)}) - 2\mu - \eta \right) \right], \tag{4}
\end{aligned}$$

where

$$d(U^{(n)} X^{(n)} Y^{(n)}) \triangleq H(X^{(n)} Y^{(n)}) + D(X^{(n)} Y^{(n)} \,\|\, \bar{X}\bar{Y}) - H(X^{(n)} \mid Y^{(n)} U^{(n)}) - H(Y|U) - I(U;X). \tag{5}$$

We now determine the allowable types $(U^{(n)} X^{(n)} Y^{(n)})$ in $A_n$. If $(x^n, y^n) \in A_n$, then by the definition of $A_n$, $x^n \in C_i \subseteq T_\mu^n(X|u_i)$ and $y^n \in T_\mu^n(Y|u_i)$ for some $i \in \{1, 2, \ldots, M\}$. But since the $u_i$ themselves are chosen from the typical set $T_\mu^n(U)$, this implies that $(x^n, u_i) \in T_\mu^n(X, U)$ and $(y^n, u_i) \in T_\mu^n(Y, U)$. Hence the type variables $(U^{(n)} X^{(n)})$ must be close in distribution to $(U, X)$, and similarly $(U^{(n)} Y^{(n)})$ must be close to $(U, Y)$. Using the continuity of all the information-theoretic quantities involved, we can now rewrite equation (5) as

$$d(U^{(n)} X^{(n)} Y^{(n)}) = H(\tilde{X}\tilde{Y}) + D(\tilde{X}\tilde{Y} \,\|\, \bar{X}\bar{Y}) - H(\tilde{X} \mid \tilde{U}\tilde{Y}) - H(\tilde{Y} \mid \tilde{U}) - I(\tilde{U}; \tilde{X}) + \delta' \tag{6}$$
for some auxiliary random variables $\tilde{U}, \tilde{X}, \tilde{Y}$ such that $P_{\tilde{U}\tilde{X}} = P_{UX}$ and $P_{\tilde{U}\tilde{Y}} = P_{UY}$, where $\delta' \to 0$ as $\mu \to 0$. This can be further simplified as

$$d(U^{(n)} X^{(n)} Y^{(n)}) = D(\tilde{U}\tilde{X}\tilde{Y} \,\|\, \bar{U}\bar{X}\bar{Y}) + \delta',$$

where $\bar{U}$ is the random variable uniquely defined through the conditions $P_{\bar{U}|\bar{X}} = P_{U|X}$ and $\bar{U} \to \bar{X} \to \bar{Y}$. Substituting this simplified form of $d(U^{(n)} X^{(n)} Y^{(n)})$ into equation (4), we can conclude that

$$\min_{\tilde{U}\tilde{X}\tilde{Y} \in \mathcal{L}(U)} D(\tilde{U}\tilde{X}\tilde{Y} \,\|\, \bar{U}\bar{X}\bar{Y})$$

is an achievable lower bound to $\theta(R, \epsilon)$ for every $U \in \mathcal{S}(R)$. This completes the proof.

Thus, we have succeeded in deriving a single-letter lower bound for the power exponent in the case of hypothesis testing with one-sided data compression. The extension to the case of two-sided data compression is straightforward and involves the introduction of another pair of auxiliary random variables $V$ and $\tilde{V}$. The proof techniques and results are analogous to the ones discussed above.
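For a fixed auxiliary channel $P_{U|X}$, the inner minimization in Theorem 2.1 is a finite-dimensional convex program: minimize $D(Q \,\|\, P_{\bar{U}\bar{X}\bar{Y}})$ over joint distributions $Q$ on $\mathcal{U} \times \mathcal{X} \times \mathcal{Y}$ whose $(U,X)$ and $(U,Y)$ marginals are pinned to $P_{UX}$ and $P_{UY}$. The sketch below evaluates this minimum for one choice of $U$; the binary alphabets, the particular distributions, and the use of scipy's SLSQP solver are all assumptions of the demo rather than part of the original analysis:

```python
import numpy as np
from scipy.optimize import minimize

# All distributions below are assumptions made for the demo (binary U, X, Y).
P_XY = np.array([[0.35, 0.15],
                 [0.10, 0.40]])        # H0 joint on (x, y)
P_XbYb = np.array([[0.25, 0.25],
                   [0.25, 0.25]])      # H1 joint on (x, y)
P_U_X = np.array([[0.9, 0.1],
                  [0.2, 0.8]])         # test channel P(u|x), rows indexed by x

# P(u,x,y) = P(u|x) P(x,y): U-X-Y Markov chain under H0.
P_UXY = P_U_X.T[:, :, None] * P_XY[None, :, :]
P_UX, P_UY = P_UXY.sum(axis=2), P_UXY.sum(axis=1)
# Reference P(u,x,y) = P(u|x) P_bar(x,y): the law of (U_bar, X_bar, Y_bar).
P_bar = P_U_X.T[:, :, None] * P_XbYb[None, :, :]

def objective(q):
    """D(Q || P_bar), with a small floor to keep the logs finite."""
    q = np.maximum(q.reshape(2, 2, 2), 1e-12)
    return float(np.sum(q * (np.log(q) - np.log(P_bar))))

constraints = [  # the marginal constraints that define L(U)
    {"type": "eq", "fun": lambda q: q.reshape(2, 2, 2).sum(axis=2).ravel() - P_UX.ravel()},
    {"type": "eq", "fun": lambda q: q.reshape(2, 2, 2).sum(axis=1).ravel() - P_UY.ravel()},
]
res = minimize(objective, P_UXY.ravel(), method="SLSQP",
               bounds=[(1e-12, 1.0)] * 8, constraints=constraints)
print(f"inner minimum for this U: {res.fun:.4f} nats")
print(f"value at Q = P_UXY (an upper bound): {objective(P_UXY.ravel()):.4f} nats")
```

Maximizing the printed minimum over all admissible channels $P_{U|X}$ with $I(U;X) \le R$ would then give $\theta_L(R)$ itself.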
2.2 Special Cases

We now describe several results that are easily derived from the above theorem.

Corollary 1. If $R \ge H(X)$, then for all $\epsilon \in (0,1)$,

$$\theta_L(R) = D(XY \,\|\, \bar{X}\bar{Y}).$$

Proof. If $R \ge H(X)$, we can choose the auxiliary random variable $U \in \mathcal{S}(R)$ to be identical to $X$. In that case, we have $P_{\tilde{U}\tilde{X}\tilde{Y}} = P_{UXY}$, and hence $\theta_L(R) = D(XY \,\|\, \bar{X}\bar{Y})$. By Stein's lemma, this is also the best obtainable power exponent.

Corollary 2. For any $U \in \mathcal{S}(R)$ and $0 < \epsilon < 1$,

$$\theta_L(R) \ge D(X \,\|\, \bar{X}) + D(UY \,\|\, U\hat{Y}),$$

where $\hat{Y}$ is the random variable with $U \to X \to \hat{Y}$ and $P_{\hat{Y}|X} = P_{\bar{Y}|\bar{X}}$. This lower bound was originally derived by Ahlswede and Csiszár in [1].

Proof. In the proof, we will make use of the fact that for any finite-alphabet random variables $X_1, X_2, Y_1, Y_2$,

$$D(X_1 Y_1 \,\|\, X_2 Y_2) \ge D(X_1 \,\|\, X_2). \tag{7}$$

Also note that, by the definition of $\hat{Y}$, we have $P_{U\hat{Y}|X} = P_{U|X} P_{\hat{Y}|X} = P_{U|X} P_{\bar{Y}|\bar{X}}$. For any $U \in \mathcal{S}(R)$ and $\tilde{U}\tilde{X}\tilde{Y} \in \mathcal{L}(U)$,

$$\begin{aligned}
D(\tilde{U}\tilde{X}\tilde{Y} \,\|\, \bar{U}\bar{X}\bar{Y}) &= D(X \,\|\, \bar{X}) + \sum_{u,x,y} P_{\tilde{U}\tilde{X}\tilde{Y}}(u, x, y) \log \frac{P_{\tilde{U}\tilde{Y}|\tilde{X}}(u, y \mid x)}{P_{\bar{U}\bar{Y}|\bar{X}}(u, y \mid x)} \\
&= D(X \,\|\, \bar{X}) + \sum_{u,x,y} P_{\tilde{U}\tilde{X}\tilde{Y}}(u, x, y) \log \frac{P_{\tilde{U}\tilde{X}\tilde{Y}}(u, x, y)}{P_{\bar{U}|\bar{X}}(u \mid x) P_{\bar{Y}|\bar{X}}(y \mid x) P_X(x)} \\
&= D(X \,\|\, \bar{X}) + \sum_{u,x,y} P_{\tilde{U}\tilde{X}\tilde{Y}}(u, x, y) \log \frac{P_{\tilde{U}\tilde{X}\tilde{Y}}(u, x, y)}{P_{U\hat{Y}|X}(u, y \mid x) P_X(x)} \\
&\ge D(X \,\|\, \bar{X}) + \sum_{u,y} P_{UY}(u, y) \log \frac{P_{UY}(u, y)}{P_{U\hat{Y}}(u, y)} \qquad \text{(from equation (7))} \\
&= D(X \,\|\, \bar{X}) + D(UY \,\|\, U\hat{Y}).
\end{aligned}$$

Corollary 3. Suppose $P_{\bar{X}\bar{Y}} = P_X P_Y$. Then for any $0 < \epsilon < 1$,

$$\theta_L(R) \ge \max_{U \in \mathcal{S}(R)} I(U; Y).$$

This is the problem of testing against independence, which was completely resolved in [1].

Proof. From equation (7) we have $D(\tilde{U}\tilde{X}\tilde{Y} \,\|\, \bar{U}\bar{X}\bar{Y}) \ge D(\tilde{U}\tilde{Y} \,\|\, \bar{U}\bar{Y})$. Since $P_{\tilde{U}\tilde{Y}} = P_{UY}$, $P_{\bar{U}} = P_U$, $P_{\bar{Y}} = P_Y$, and $\bar{U}, \bar{Y}$ are independent, the right-hand side reduces to

$$D(\tilde{U}\tilde{Y} \,\|\, \bar{U}\bar{Y}) = D(P_{UY} \,\|\, P_U P_Y) = I(U; Y).$$

Note that in [1] the authors also provided a strong converse to the above result, using a very different proof technique (based on results from [4]).
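The optimization $\max\{ I(U;Y) : I(U;X) \le R,\ U \to X \to Y \}$ appearing in Corollary 3 can be brute-forced for small alphabets. The grid search below is an illustrative sketch; the joint $P_{XY}$, the rate budget, and the restriction to a binary $U$ are choices made for the demo (in general, an auxiliary alphabet of size $|\mathcal{X}| + 1$ suffices by the usual cardinality-bounding arguments):

```python
import numpy as np

def mutual_information(P):
    """I(A;B) in nats from a joint distribution matrix P(a, b)."""
    P = np.maximum(P, 1e-15)
    Pa = P.sum(axis=1, keepdims=True)
    Pb = P.sum(axis=0, keepdims=True)
    return float(np.sum(P * np.log(P / (Pa * Pb))))

P_XY = np.array([[0.35, 0.15],
                 [0.10, 0.40]])        # H0 joint; under H1, X and Y are independent
P_X = P_XY.sum(axis=1)
R = 0.05                               # rate budget in nats

best = 0.0
grid = np.linspace(0.0, 1.0, 101)
for a in grid:                         # binary U: P(U=0|X=0) = a, P(U=0|X=1) = b
    for b in grid:
        P_U_X = np.array([[a, 1 - a],
                          [b, 1 - b]])                 # rows indexed by x
        P_UX = (P_U_X * P_X[:, None]).T                # joint (u, x)
        if mutual_information(P_UX) > R:
            continue                   # U violates the rate constraint I(U;X) <= R
        P_UY = P_U_X.T @ P_XY          # P(u,y) = sum_x P(u|x) P(x,y) by U-X-Y Markovity
        best = max(best, mutual_information(P_UY))

print(f"max I(U;Y) subject to I(U;X) <= {R}: {best:.4f} nats")
```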
2.3 Improvement of the Power Exponent

So far, we have looked at an encoding scheme designed to reproduce the joint types $P_{(u^n, x^n)}$ and $P_{(u^n, y^n)}$ exactly at the decoder with zero error probability. One can also consider a wider class of encoders that only guarantee exponentially decaying (non-zero) error probabilities. In this case, the source of error is two-fold, and there is a tradeoff between the two. The error events are:

- Error in hypothesis testing given an acceptance region. Since there is now a possibility of error within the acceptance region, the set $\mathcal{L}(U)$ is enlarged to

$$\mathcal{L}'(U) = \{ \tilde{U}\tilde{X}\tilde{Y} : P_{\tilde{U}\tilde{X}} = P_{UX},\ P_{\tilde{Y}} = P_Y,\ H(\tilde{U}|\tilde{Y}) \ge H(U|Y) \}.$$

The resulting exponent is

$$\rho_1'(U) = \min_{\tilde{U}\tilde{X}\tilde{Y} \in \mathcal{L}'(U)} D(\tilde{U}\tilde{X}\tilde{Y} \,\|\, \bar{U}\bar{X}\bar{Y}).$$

- Encoding error in specifying an acceptance region. By allowing for this error event, the set $\mathcal{S}(R)$ of allowed $U$'s can be enlarged to

$$\mathcal{S}'(R) = \{ U : R \ge I(U; X|Y),\ U \to X \to Y \}.$$

The resulting exponent is

$$\rho_2'(U) = \begin{cases} +\infty & \text{if } R \ge I(U;X), \\ \rho_2(U) & \text{otherwise,} \end{cases}$$

where

$$\rho_2(U) = [R - I(U; X|Y)]^+ + \min_{\tilde{U}\tilde{X}\tilde{Y} \in \mathcal{L}'(U)} D(\tilde{U}\tilde{X}\tilde{Y} \,\|\, \bar{U}\bar{X}\bar{Y}).$$

Clearly the minimum of the two exponents is achievable for every $U \in \mathcal{S}'(R)$, and so

$$\theta_L'(R) = \sup_{U \in \mathcal{S}'(R)} \min\left( \rho_1'(U), \rho_2'(U) \right)$$

is an achievable exponent. Since this scheme allows for a much wider class of encoders, this lower bound is in general tighter than the previous bound given in Theorem 2.1.
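Which branch of $\rho_2'$ applies, and the size of the bracket term $[R - I(U;X|Y)]^+$, depend only on mutual informations that can be read off directly from the joint distribution. A small sketch (reusing the illustrative toy joint from the earlier demos; the divergence minimization over $\mathcal{L}'(U)$ is left abstract here):

```python
import numpy as np

def H(p):
    """Entropy in nats of a (flattened) probability vector."""
    p = p[p > 1e-15]
    return float(-np.sum(p * np.log(p)))

# Same illustrative toy joint as in the earlier sketches.
P_XY = np.array([[0.35, 0.15], [0.10, 0.40]])
P_U_X = np.array([[0.9, 0.1], [0.2, 0.8]])
P_UXY = P_U_X.T[:, :, None] * P_XY[None, :, :]      # P(u,x,y), U-X-Y Markov

# I(U;X) = H(U) + H(X) - H(U,X).
I_UX = (H(P_UXY.sum(axis=(1, 2))) + H(P_UXY.sum(axis=(0, 2)))
        - H(P_UXY.sum(axis=2).ravel()))
# I(U;X|Y) = H(U,Y) + H(X,Y) - H(U,X,Y) - H(Y).
I_UX_given_Y = (H(P_UXY.sum(axis=1).ravel()) + H(P_UXY.sum(axis=0).ravel())
                - H(P_UXY.ravel()) - H(P_UXY.sum(axis=(0, 1))))

for R in (0.02, 0.2, 0.6):
    if R >= I_UX:
        print(f"R = {R}: rho_2' = +inf (rate exceeds I(U;X) = {I_UX:.3f})")
    else:
        print(f"R = {R}: bracket term [R - I(U;X|Y)]+ = {max(R - I_UX_given_Y, 0.0):.4f}")
```

For rates $R \ge I(U;X)$, the encoder of Theorem 2.1 already conveys the codeword index essentially error-free, which is why $\rho_2'(U)$ is unbounded in that regime.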
3 Conclusion

Thus far, we have dealt with the multi-terminal hypothesis testing problem in the presence of one-sided data compression. Two single-letter lower bounds, $\theta_L(R)$ and $\theta_L'(R)$, were derived. However, the problem is not completely resolved, since it is not known whether these lower bounds are tight. More general hypothesis testing problems appear to be quite intractable, with very few single-letter characterizations available. The only fully solved model is that of testing against independence, studied in [1]. The divergence-characterization approach of Ahlswede and Csiszár has formidable complexity in proving the direct part, but lends itself to converses (using ideas such as the blowing-up lemma). The approach detailed here is useful for establishing achievability results, but seems insufficient for establishing converses. Te Sun Han and Amari [3] have also studied other statistical inference problems, such as parameter estimation and pattern classification, under similar rate constraints.

References

[1] R. Ahlswede and I. Csiszár, "Hypothesis testing with communication constraints," IEEE Transactions on Information Theory, vol. IT-32, pp. 533-542, July 1986.

[2] T. S. Han, "Hypothesis testing with multiterminal data compression," IEEE Transactions on Information Theory, vol. IT-33, pp. 759-772, November 1987.

[3] T. S. Han and S.-I. Amari, "Statistical inference under multiterminal data compression," IEEE Transactions on Information Theory, vol. 44, pp. 2300-2324, October 1998.

[4] R. Ahlswede and J. Körner, "Source coding with side information and a converse for degraded broadcast channels," IEEE Transactions on Information Theory, vol. IT-21, pp. 629-637, November 1975.