INFORMATION PROCESSING ABILITY OF BINARY DETECTORS AND BLOCK DECODERS

Michael A. Lexa and Don H. Johnson
Rice University, Department of Electrical and Computer Engineering, Houston, TX
amlexa@rice.edu, dhj@rice.edu

ABSTRACT

This paper applies the concepts of information processing [1] to the study of binary detectors and block decoders in a single-user digital communication system. We quantify performance in terms of the information transfer ratio, which measures how well systems preserve discrimination information between two stochastic signals. We investigate hard decision detectors and minimum distance decoders in various additive noise environments. We show that likelihood ratio detectors maximize the information transfer ratio across binary detectors.

1. INTRODUCTION

In our theory of information processing, information is defined only with respect to the ultimate receiver. Consequently, no single objective measure can quantify the information a signal expresses. For example, this paper (presumably) means more to a signal processing researcher than it does to a Shakespearean scholar. To probe how well systems process information, we instead calculate how well an informational change at the input is expressed in the output. The complete theoretical basis of this theory can be found elsewhere [1]. Briefly, to quantify an informational change, we calculate the information-theoretic distance, specifically the Kullback-Leibler (KL) distance, between the probability distributions characterizing the signals that encode two pieces of information. We assume the signals, but not the information, are stochastic. The Data Processing Theorem [2] says that the KL distance between the outputs of any system responding to the two inputs must be less than or equal to the distance calculated at the input. Here, we use this framework to characterize how well likelihood ratio detectors and block decoders process the information encoded in their inputs.
This work was supported by the National Science Foundation under Grant CCR-5558.

The word "distance" does not imply a metric, since the KL distance is not symmetric in its arguments and does not satisfy the triangle inequality.

We adopt the digital communication system model shown schematically in Figure 1. The input binary data word u_α of length K represents the information the receiver ultimately wants. The encoder simply maps the data word into a code word of length N (u_α → v_α) and passes the code word to the modulator. The modulator maps the code word into its signal representation (v_α → s_α) and transmits a continuous-time signal using an antipodal signal set. The channel adds white noise, and the total transmission interval for each data word is KT seconds. Viewed from the framework of information processing, we say that the information is encoded in the received signal vector r_α. Obviously, we use the word "encode" in an untraditional sense. What we mean is the following. The theory of information processing assumes that information does not exist in a tangible form; rather, it is always contained within a signal. Thus, the received signal vector contains information about the data word. Normally, we would describe the received signal as a noisy version of the transmitted signal, but viewing the information as being encoded makes it easier to think about this theory in an arbitrary setting. We calculate three KL distances. The first is between the two received signal vectors r_α1, r_α2 at the input to the detector. (The subscripts α1 and α2 distinguish the two transmitted pieces of information.) The second is between the detected binary words w_α1, w_α2 at the output of the detector (input to the decoder), and the third is between the decoded binary words û_α1, û_α2. We denote these distances by D_r(α1‖α2), D_w(α1‖α2), and D_û(α1‖α2), respectively. These distances represent the informational change between these particular signals.
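As a concrete sketch of these distance calculations, assuming white Gaussian noise and antipodal signaling (the function names and numbers below are ours, purely illustrative, not from the paper):

```python
import math

def kl_gaussian(m1, m2, var):
    """KL distance (nats) between N(m1, var) and N(m2, var)."""
    return (m1 - m2) ** 2 / (2.0 * var)

def kl_received(v1, v2, amp, var):
    """KL distance between the received-vector distributions induced by two
    binary words v1, v2 under antipodal signaling with independent Gaussian
    noise samples: per-sample distances add, and only positions where the
    words differ contribute."""
    s1 = [amp if b else -amp for b in v1]
    s2 = [amp if b else -amp for b in v2]
    return sum(kl_gaussian(a, b, var) for a, b in zip(s1, s2))

# Two length-3 words differing in two positions; each differing position
# contributes (2*amp)^2 / (2*var) = 4.0 nats here.
D = kl_received([0, 0, 0], [1, 1, 0], amp=1.0, var=0.5)
```

Identical words yield zero distance, reflecting that no informational change is present to be discriminated.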
Certainly, a change in information (that is, a change in data words) induces the distance, but more importantly, through Stein's Lemma [2], the KL distance is the exponential decay rate of the false alarm probability of an optimum Neyman-Pearson detector. Thus, these distances quantify our ability to discriminate between the two information-bearing signals at the input and output of the detector and decoder. Because of the Data Processing Theorem [2], the detector and the decoder can at best preserve the distance presented at their input and at worst reduce it to zero, causing the ultimate recipient of the transmission to lose all ability to discern the informational change.

[Figure 1 block diagram: u_α → encoder → v_α → modulator → s_α(t) → (+ n(t)) → r_α(t) → demodulator → r_α → detector → w_α → decoder → û_α, with KL distances computed at the detector input, the detector output, and the decoder output.]

Fig. 1. Two binary data blocks u_α1, u_α2 are separately transmitted. The Kullback-Leibler distance between the distributions induced by each of the data blocks is calculated at the input and output of the detector and the decoder. The ratios of the input and output distances provide a measure of how well the detector and decoder preserve the informational change encoded in their input signals.

The performance criterion we use is the information transfer ratio, denoted by γ and defined as the ratio of the KL distances at the output and input of any system. It is a number between zero and one and reflects the fraction of the informational change preserved across a system. Here, we study the information transfer ratio of the detector and the decoder.

γ_det = D_w(α1‖α2) / D_r(α1‖α2)    (1)

γ_dec = D_û(α1‖α2) / D_w(α1‖α2)    (2)

Ideally, the information transfer ratios across each of these systems would equal one, indicating no informational loss. In reality, however, we expect informational losses because the probability of error is never zero. The overall information transfer ratio across both the detector and decoder is simply expressed as the product of the individual information transfer ratios [3].

γ_overall = γ_det · γ_dec

2. KULLBACK-LEIBLER DISTANCE CALCULATIONS

Each transmitted data word induces a probability distribution on the received signal vectors at the output of the demodulator. For example, if the channel adds white Gaussian noise, then each element of the received vector r_α is normally distributed with mean ±√(KE_b/N) and variance N_0/2, depending upon whether a zero or one is transmitted. (E_b is the energy per data bit.)
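The cascade property of the information transfer ratio can be illustrated with a few lines of code (the distance values below are hypothetical, chosen only to show the factorization):

```python
def info_transfer_ratio(d_out, d_in):
    """Fraction of the discrimination information preserved across a system."""
    return d_out / d_in

# Hypothetical KL distances at the detector input, detector output, and
# decoder output (illustrative numbers only, not computed from any model).
D_r, D_w, D_u = 4.0, 2.4, 2.0
g_det = info_transfer_ratio(D_w, D_r)
g_dec = info_transfer_ratio(D_u, D_w)
g_overall = info_transfer_ratio(D_u, D_r)
# The ratio across the cascade factors into the product of the stage ratios,
# and the Data Processing Theorem keeps each factor at or below one.
```

The factorization is immediate because the intermediate distance D_w cancels in the product.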
The statistical independence of the received vector elements allows us to write the KL distance at the input of the detector as a sum of the distances between the individual received vector elements [3].

D_r(α1‖α2) = Σ_{j=1}^{N} D_rj(α1‖α2)    (3)

Because D_rj(α1‖α2) = 0 whenever the jth bits of the two code words are the same, we can rewrite this expression in terms of the Hamming distance between v_α1 and v_α2:

D_r(α1‖α2) = d_H(v_α1, v_α2) · D_rj(α1‖α2)    (4)

where D_rj(α1‖α2) now denotes the common per-element distance at a position where the code words differ. Table 1 lists these per-element KL distances for various noise distributions as a function of SNR. The detector compares each received sample r_α1j (j = 1, ..., N) to a threshold and declares as its output either a one or a zero. The detected binary word w_α1 is the collection of N such outputs. The decoder maps w_α1 to estimates of the transmitted data words (w_α1 → û_α1, w_α2 → û_α2). Specifically, its output is the code word closest in Hamming distance to w_α1 (minimum distance decoding). We calculate the KL distance at the output of the detector by viewing each binary vector w_n (n = 1, ..., 2^N) as the output of a binary symmetric channel with error probability P_e. (See Table 1 for expressions of P_e for different noise distributions.) Accordingly, the probability of receiving w_n when we transmit v_α1 (or equivalently u_α1) is

Pr[w_n | u_α1] = P_e^{d_H(w_n, v_α1)} (1 − P_e)^{N − d_H(w_n, v_α1)}.

These probabilities define the discrete distribution over the output of the detector; thus, by definition we obtain

D_w(α1‖α2) = Σ_{j=1}^{N} D_wj(α1‖α2) = Σ_{n=1}^{2^N} Pr[w_n | u_α1] log ( Pr[w_n | u_α1] / Pr[w_n | u_α2] )    (5)
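The collapse of the 2^N-term sum in equation (5) into Hamming-distance-many copies of a per-bit distance can be verified by brute force (a sketch; the code words and P_e value are illustrative, not from the paper):

```python
import math
from itertools import product

def d_hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def kl_detected(v1, v2, pe):
    """KL distance between the detector-output distributions when code word
    v1 or v2 is sent through an effective BSC with crossover probability pe,
    computed by direct enumeration over all 2^N detected words."""
    N = len(v1)
    total = 0.0
    for w in product((0, 1), repeat=N):
        p1 = pe ** d_hamming(w, v1) * (1 - pe) ** (N - d_hamming(w, v1))
        p2 = pe ** d_hamming(w, v2) * (1 - pe) ** (N - d_hamming(w, v2))
        total += p1 * math.log(p1 / p2)
    return total

# By bit-wise independence, the sum equals d_H(v1, v2) copies of the per-bit
# KL distance between Bernoulli(pe) and Bernoulli(1 - pe).
pe = 0.1
per_bit = (1 - pe) * math.log((1 - pe) / pe) + pe * math.log(pe / (1 - pe))
D_w = kl_detected((0, 0, 0), (1, 1, 0), pe)
```

The enumeration is exponential in N and is meant only as a numerical check of the independence argument.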
Noise Distribution | D_rj(α1‖α2) | P_e | γ_det (SNR → 0) | γ_det (SNR → ∞)
Gaussian | 4ξ | Q(√(2ξ)) | 2/π | 1/4
Laplacian | 4√ξ + e^(−4√ξ) − 1 | (1/2) e^(−2√ξ) | 1 | 1/2
Hyperbolic secant | 2 ln[cosh(√(2ξ))] | 1/2 − (1/π) tan⁻¹[sinh(√(2ξ))] | 8/π² | 1/2
Cauchy | ln(1 + 4ξ) | 1/2 − (1/π) tan⁻¹(2√ξ) | 8/π² | 1/2

Table 1. The Kullback-Leibler distances between the received random variables r_α1j and r_α2j and the detector's hard decision bit error probabilities are shown in columns two and three for various noise distributions. In each expression ξ = KE_b/(N N_0), where the signal-to-noise ratio per bit (SNR) equals E_b/N_0. For the Cauchy distribution, the quantity N_0 is understood to be the width parameter. The fourth and fifth columns list the asymptotic values of the information transfer ratio across the detector. (See Figure 2.) When no error control coding is employed, K = N.

Calculation of the KL distance at the output of the decoder hinges on the decoding probabilities. Assuming u_α1 is transmitted, the probability of decoding it as û_m (m = 1, ..., 2^K) is the total probability mass of the decoding sphere of v_m:

Pr[û_m | u_α1] = Σ_{l=1}^{L_m} Pr[w_l | u_α1]

Here, l indexes the L_m binary words within the decoding sphere of v_m. To ensure that the KL distance at the output of the decoder is defined, we assume there are no failure-to-decode events; in other words, we assume each w_n lies within a decoding sphere. Similarly to equation (5), we have

D_û(α1‖α2) = Σ_{m=1}^{2^K} Pr[û_m | u_α1] log ( Pr[û_m | u_α1] / Pr[û_m | u_α2] ).    (6)

In the special case when no error control coding is employed, v_α = u_α, N = K, and the decoder performs no function. The output of the detector is then the estimate of the transmitted data word. The expression for the KL distance at the input of the detector remains unchanged except that u_α replaces v_α in equation (4). The KL distance at the output of the detector can be written like equation (6) but with Pr[û_m | u_α1] replaced by

Pr[û_m | u_α1] = P_e^{d_H(u_m, u_α1)} (1 − P_e)^{K − d_H(u_m, u_α1)}.

In this case we can also simplify the output KL distance in much the same way as equation (3).
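The decoding-sphere probabilities and the decoder-output distance of equation (6) can be sketched numerically for a (3,1) repetition code (our own minimal example; the P_e value is illustrative):

```python
import math
from itertools import product

def d_hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

# (3,1) repetition code: the decoding spheres partition {0,1}^3 by nearest
# code word (minimum distance decoding; no failure-to-decode events).
CODE = {1: (0, 0, 0), 2: (1, 1, 1)}

def decode(w):
    return min(CODE, key=lambda m: d_hamming(w, CODE[m]))

def p_decode(m, v, pe):
    """Total probability mass of the decoding sphere of code word m when v is
    sent through an effective BSC with crossover probability pe."""
    return sum(
        pe ** d_hamming(w, v) * (1 - pe) ** (len(v) - d_hamming(w, v))
        for w in product((0, 1), repeat=len(v)) if decode(w) == m
    )

def kl_decoded(pe):
    """Decoder-output KL distance in the style of equation (6)."""
    return sum(
        p_decode(m, CODE[1], pe)
        * math.log(p_decode(m, CODE[1], pe) / p_decode(m, CODE[2], pe))
        for m in CODE
    )
```

For pe = 0.1, the sphere of (0,0,0) collects all detected words with at most one bit error, giving probability (1 − pe)^3 + 3 pe (1 − pe)^2 = 0.972.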
Because the bit estimates are statistically independent when there is no coding, we can write

D_û(α1‖α2) = d_H(u_α1, u_α2) [ (1 − P_e) log ( (1 − P_e)/P_e ) + P_e log ( P_e/(1 − P_e) ) ]    (7)

The bracketed term is the KL distance between the binary distributions that result every time a data bit is transmitted.

3. EXAMPLES AND DISCUSSION

We study three fundamental examples. We investigate performance when no error control coding is used (the uncoded case), and then consider two Hamming codes, (3,1) and (7,4). To make fair comparisons between the uncoded and coded cases, we maintain constant data rates. This requirement constrains the total transmission time of the N coded bits of an (N, K) code to KT seconds. (It takes KT seconds to transmit K data bits in the uncoded cases.) We plot the information transfer ratios for four noise distributions in Figure 2 for the uncoded case and list their respective asymptotic values of γ_det in Table 1. These curves show the informational loss incurred by making hard decisions at the detector. Notice the decrease in performance as the SNR increases. It is due not to the output KL distances decreasing but to the growing proportional difference between the input and the output distances. (See Figure 3.) This fact means the detector preserves the informational change better at lower SNR values than at higher values. However, even though the detector is less efficient as SNR grows, the loss is not great. Because the information transfer ratio across the detector is completely independent of the input data words, it is, in particular, independent of the input data word length for the uncoded case. We prove in Appendix A that a likelihood ratio digital demodulator maximizes the information transfer ratio across binary detectors. Thus, the curves in Figure 2 represent the best achievable performance across any hard decision detector. Figure 4 plots γ_det, γ_dec, and γ_overall when we use the (3,1) and (7,4) Hamming codes.
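The decrease of the detector's information transfer ratio with SNR is easy to check numerically from the Gaussian entries of Table 1 (a sketch; the function names and sample SNR values are ours):

```python
import math

def Q(x):
    """Gaussian tail probability."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def gamma_det_gaussian(xi):
    """Information transfer ratio of a hard-decision detector in Gaussian
    noise, per differing bit position: the ratio of the binary (output) KL
    distance with P_e = Q(sqrt(2*xi)) to the input distance 4*xi."""
    pe = Q(math.sqrt(2.0 * xi))
    d_out = (1 - pe) * math.log((1 - pe) / pe) + pe * math.log(pe / (1 - pe))
    return d_out / (4.0 * xi)

# The ratio falls with SNR: near 2/pi at low SNR, approaching 1/4 at high SNR.
g_lo = gamma_det_gaussian(1e-4)
g_hi = gamma_det_gaussian(30.0)
```

Both KL distances grow with SNR; it is their widening proportional gap that drives the ratio down.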
The top row exhibits the losses across the detector; the middle row, across the decoder; and the bottom row, across both systems. Because of the constant data rate constraint, the information transfer ratio curves across the detector are scaled versions of the curves in Figure 2. The examples studied here show a relatively constant additional loss across the decoder. These curves are identical when plotted against probability of error. Why they are not monotonic is an issue we are studying. Apparently, a more efficient code, here the (7,4) code, yields larger information transfer ratios. The performance across the decoder depends upon the choice of the transmitted code words. In general, the dependence is related in a complicated way to Hamming distance, but for the (7,4) code studied here, greater distance implies better performance. For example, instead of choosing two code words with a Hamming distance of 4 as in Figure 4, we could choose two with a Hamming distance of 7. As shown in Figure 5, compared to the right middle panel of Figure 4, better fidelity results at high SNR. Within the framework of information processing, the concept of coding gain does not exist. Because of the Data Processing Theorem, error control coding simply cannot regain the informational loss across the detector. Once the loss occurs, no post-processing can be performed to compensate for it. More powerful codes and decoding schemes could conceivably improve the informational efficiency across the decoder. At present, however, no methods or even approaches exist for designing codes and decoding schemes to maximize γ across the decoder. Improvements can be made across the detector if we introduce soft decision detectors. In fact, it is not difficult to think of examples in which this is the case. Such investigations could possibly lead to using γ, for example, to systematically study soft decision decoding.

[Figure 2: two panels of γ_det versus SNR for the four noise distributions, uncoded case.]

Fig. 2. The performance of the detector in terms of the information transfer ratio is shown for the uncoded case. The performance is independent of the data word length K.

[Figure 3: KL distances at the detector input and output versus SNR.]

Fig. 3. The widening gap between the Kullback-Leibler distances at the input and output of the hard decision detector illustrates why the information transfer ratio decreases with increasing SNR. This particular plot is generated with Gaussian noise and with K = 4.

[Figure 5: γ_dec versus SNR for a (7,4) code.]

Fig. 5. The information transfer ratio across the decoder for a (7,4) code is shown (α1 = 1 (0000000), α2 = 16 (1111111)). The improved performance, compared with the right middle plot of Figure 4, is due to the increase in Hamming distance from 4 to 7.

A. APPENDIX

Consider a general binary detection problem where r_α1 and r_α2 are the two possible received signal vectors presented at the input of the detector under hypotheses α1 and α2, respectively. Let p(r | α1) and p(r | α2) be the conditional probability density functions associated with each hypothesis. Denote the output decisions of the detector as Λ1 and Λ2. The information transfer ratio equals

γ = D_Λ(α2‖α1) / D_r(α2‖α1) = [ P_D log(P_D/P_F) + (1 − P_D) log( (1 − P_D)/(1 − P_F) ) ] / ∫ p(r | α2) log ( p(r | α2) / p(r | α1) ) dr

where P_D is the probability of detection and P_F is the probability of false alarm. Explicitly,

p(Λ1 | α1) = 1 − P_F    p(Λ2 | α1) = P_F
p(Λ1 | α2) = 1 − P_D    p(Λ2 | α2) = P_D.

Maximizing γ is equivalent to maximizing the numerator, which translates into finding values of P_D and P_F that maximize

P_D log(P_D/P_F) + (1 − P_D) log( (1 − P_D)/(1 − P_F) ) = −H(P_D) − P_D log P_F − (1 − P_D) log(1 − P_F),    (8)

where H(·) denotes the binary entropy function.
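Before carrying the maximization of equation (8) through analytically, the claim can be checked numerically: for the same false-alarm probability, a likelihood ratio (threshold) decision region yields a larger output KL distance than a non-likelihood-ratio region. This is our own sketch, assuming unit-variance Gaussian noise with antipodal means, and the threshold values are illustrative:

```python
import math

def Q(x):
    """Gaussian tail probability."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def kl_bernoulli(p, q):
    """KL distance (nats) between Bernoulli(p) and Bernoulli(q)."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

# Binary detection with antipodal means -1/+1 in unit-variance Gaussian noise.
# Two decision regions with (nearly) the same false-alarm probability:
# a likelihood ratio (threshold) region, and a deliberately non-LR region.
m = 1.0
t = -0.1584   # LR test: decide alpha2 when r > t  (P_F about 0.2)
u = -1.8416   # non-LR region: decide alpha2 when r < u (same P_F)

pf_lr, pd_lr = Q(t + m), Q(t - m)
pf_bad, pd_bad = 1 - Q(u + m), 1 - Q(u - m)

d_lr = kl_bernoulli(pd_lr, pf_lr)     # output KL distance, LR region
d_bad = kl_bernoulli(pd_bad, pf_bad)  # output KL distance, non-LR region
# With equal P_F, the LR region preserves strictly more discrimination
# information, hence a larger information transfer ratio.
```

The same comparison can be repeated at any false-alarm level; the LR region always wins, consistent with the argument below.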
[Figure 4: six panels of information transfer ratio versus SNR; left column (3,1) code, right column (7,4) code; rows labeled "Across detector," "Across decoder," and "Overall."]

Fig. 4. Plots of the information transfer ratio across the detector and decoder for a (3,1) (left column) and a (7,4) (right column) Hamming code are shown for various noise distributions. The detector makes hard decisions and the decoder uses minimum distance decoding. For the (3,1) code α1 = 1, α2 = 2; for the (7,4) code α1 = 1, α2 = 5. We arbitrarily reference the plots to the all-zero code words.

Since P_D and P_F are coupled, they cannot be independently optimized, so without loss of generality assume P_F = a and P_D = a + l. Substituting these values into equation (8) and setting its derivative with respect to l equal to zero, we obtain

log [ (a + l − (a² + al)) / (a − (a² + al)) ] = 0.

For a given value of a (P_F), we note that the derivative is positive for l > 0, negative for l < 0, and zero when l = 0 (a minimum). Thus, to maximize the numerator of equation (8), we choose the largest possible l, constrained to l ≤ 1 − a. The upper bound results from the fact that P_D and P_F are probabilities and thus must be between zero and one. Formally, for a given false-alarm probability,

max_{l ≤ 1 − a} l = max (P_D − P_F) = max_{Λ2} ∫_{Λ2} [ p(r | α2) − p(r | α1) ] dr.

Therefore Λ2 should be defined as Λ2 = {r : p(r | α2) > p(r | α1)}, which is exactly the condition of the likelihood ratio test. This result is general and holds for all noise distributions.

B. REFERENCES

[1] S. Sinanović and D. H. Johnson, "Toward a theory of information processing," submitted to IEEE Trans. on Signal Processing, June 2002.
[2] T. M. Cover and J. A. Thomas, Elements of Information Theory, Wiley, 1991.
[3] S. Sinanović, Toward a Theory of Information Processing, Master of Science thesis, Rice University, Houston, TX, 1999.