Chapter I: Fundamental Information Theory


ECE-S622/T62 Notes, Ruifeng Zhang, Dept. of Electrical & Computer Eng., Drexel University

1.1 Information Source

Information is the outcome of some physical process. Though information sources are highly complex and diverse in the real world, we study them from the viewpoint of communications engineering, using probabilistic models. An information source is modeled as a stochastic process $\{X(t); p(\cdot)\}$, where $p(\cdot)$ is the (possibly infinite-dimensional) distribution of the process $X(t)$. According to the time index $t$ and the value domain of $X(t)$, information sources fall into different categories. If $t$ is continuous, the source is a continuous-time source, usually called a waveform source. If $t$ is discrete, the source is a discrete-time source. The set of values that $X(t)$ takes is denoted $\mathcal{X}$. If $\mathcal{X}$ is continuous, the source is a continuous source; otherwise, it is a discrete source. The value set $\mathcal{X}$ of a discrete source is called the alphabet of the source, and each of its elements is called a symbol or letter. Discrete sources are of primary interest in the discussion of digital communications, so we study them first.

Example 1.1. An information source of dialed telephone numbers has the source alphabet $\mathcal{X} = \{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, \#, *\}$. An information source of text consists of the letters (both lower and upper case), the space, and various punctuation symbols: $\mathcal{X} = \{a, \ldots, z, A, \ldots, Z\}$ together with the space, comma, period, semicolon, colon, exclamation and question marks. An information source representing the on/off status of a switch or actuator is a binary information source: $\mathcal{X} = \{\text{off}, \text{on}\}$, or simply $\mathcal{X} = \{0, 1\}$, with 0 representing off and 1 representing on.

Generally, we need an infinite-dimensional joint distribution to fully characterize a stochastic process and therefore an information source. However, if the values of $X(t)$ at different time instants $t$ are statistically independent, a one-dimensional distribution suffices. Such a source is called a memoryless source. We further impose the identical-distribution condition on memoryless sources because we want them to be stationary as well (Why?).

Definition 1.1 (Discrete Memoryless Source). A discrete source whose samples from its alphabet are independent and identically distributed is called a Discrete Memoryless Source (DMS).

For a DMS we can omit the time index of the stochastic process $X(t)$ and describe the source by a random variable $X$ with alphabet $\mathcal{X}$ and probability mass function $p_X(x) = P[X = x]$, $x \in \mathcal{X}$. Our discussion of information theory starts from the DMS.
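As a concrete companion to this model, here is a minimal Python sketch of a DMS: an alphabet, a pmf, and i.i.d. draws. The alphabet and probabilities are illustrative assumptions, not taken from the notes.

```python
import random

# A DMS is fully described by an alphabet and a probability mass function (pmf).
# The symbols and probabilities below are illustrative only.
alphabet = ["off", "on"]
pmf = [0.7, 0.3]          # p_X(off) = 0.7, p_X(on) = 0.3

def emit(k):
    """Draw k symbols from the DMS; memorylessness means independent draws."""
    return random.choices(alphabet, weights=pmf, k=k)

print(emit(10))           # e.g. ['off', 'on', 'off', ...]
```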

1.2 Self-information

When we say that something is informative, we imply that we have some unknowns or are not sure about that thing. Therefore, the information contained in an event must be associated in some sense with the uncertainty of that event; the higher the uncertainty of an event, the more information it carries. We therefore need a measure of uncertainty in order to quantify information. There is not yet a unanimous measure of uncertainty (and I do not think one exists). However, for the stochastically modeled information source described above, it is possible to measure the uncertainty, and thus the information, conveniently through the probability distribution.

We are considering DMSes. We first want a proper definition of the information measure of the event that the source outputs a specific symbol, i.e., $X = x \in \mathcal{X}$. Denote the information of this event by $I(X = x)$. According to the discussion above, $I(X = x)$ should depend on the probability $p = P[X = x]$. For obvious reasons of usefulness, we desire the following properties (axioms, in mathematical language) regarding $I(X = x)$ and $p$:

1. The information of an event is a function of its probability, i.e., $I(X = x) = F(p)$;
2. The information is differentiable with respect to the probability, i.e., $\frac{dF(p)}{dp}$ exists;
3. The information is monotonically decreasing in the probability, i.e., $\frac{dF(p)}{dp} < 0$;
4. Suppose that the source outputs two symbols in a row independently. The information of the event that the first symbol is $x$ and the second is $y$ equals the sum of the information of the two individual events $X = x$ and $X = y$, i.e., $I(X(1) = x, X(2) = y) = F(pq) = F(p) + F(q) = I(X = x) + I(X = y)$, where $q = P[X = y]$;
5. Deterministic events contain zero information and impossible events contain infinite information, i.e., $F(1) = 0$ and $F(0) = \infty$.

One can see that these requirements do match our intuition about the relation between information and uncertainty. Interestingly, only the following definition of $I(X = x)$ satisfies all of the above requirements.

Definition 1.2 (Self-information). The self-information of an event $X = x$ with probability $p = P[X = x]$ is
$$I(X = x) = \log \frac{1}{p}. \qquad (1.1)$$

That this definition meets all the desired properties is easily verified. We further note that the choice of the base of the logarithm does not matter much; it only affects the unit of the information measure. If base 2 is chosen, the resulting unit is called the bit (standing for binary digit). If the natural base $e$ is chosen, the resulting unit is called the nat (standing for natural digit). If base 10 is chosen, the resulting unit is called the Hartley (also called the dit). In most cases we use base 2.
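A quick numerical illustration of Definition 1.2 and the three units (a sketch; the probability value is arbitrary):

```python
import math

p = 0.25                     # probability of the event X = x (illustrative value)
I_bits = math.log2(1 / p)    # base 2  -> bits
I_nats = math.log(1 / p)     # base e  -> nats
I_dits = math.log10(1 / p)   # base 10 -> Hartleys (dits)

print(I_bits, I_nats, I_dits)     # 2.0, 1.386..., 0.602...
print(I_nats / math.log(2))       # nats converted to bits: equals I_bits
```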

The subtle reason for this choice will become clear as we proceed. Sometimes the natural base offers more convenience in mathematical analysis and is preferred there. For simplicity, we denote the base-2 logarithm by $\log$, the natural logarithm by $\ln$, and the base-10 logarithm by $\lg$ (though we will rarely use it). Different information units are easily converted into one another; using the change-of-base formula for logarithms, $1\ \text{nat} = \log e$ bits and $1\ \text{dit} = \log 10$ bits.

The last paragraph of this section is devoted to the mathematical justification of Definition 1.2, and is only for those who are curious about the uniqueness of Definition 1.2 and insist on mathematical rigor.

Proof: From the fourth property in the list above, differentiating both sides of $F(pq) = F(p) + F(q)$ with respect to $p$, we obtain
$$\frac{dF(pq)}{d(pq)} = \frac{1}{q}\,\frac{dF(p)}{dp}.$$
Similarly, differentiating with respect to $q$,
$$\frac{dF(pq)}{d(pq)} = \frac{1}{p}\,\frac{dF(q)}{dq}.$$
Combining the above two equations, we get
$$p\,\frac{dF(p)}{dp} = q\,\frac{dF(q)}{dq}.$$
Since this holds for arbitrary $p$ and $q$, $F(p)$ must satisfy
$$p\,\frac{dF(p)}{dp} = C,$$
where $C$ is a constant. Consequently,
$$F(p) = C \ln p + D,$$
where $D$ is another constant. Since we require $F(1) = 0$, we have $D = 0$; and since we require $\frac{dF(p)}{dp} < 0$, we have $C < 0$. We are in favor of the base-2 logarithm, so we set $C = -1/\ln 2$ and the desired result follows.

1.3 Entropy

Self-information is only a measure of the information of the outcome of a specific symbol in the alphabet of a source. To characterize the information content of the source itself, we need to (statistically) average the self-information over all symbols. This average information is called the entropy of the source.

Definition 1.3 (Entropy). The average information (per symbol), or entropy, $H(X)$ of a DMS $X$ with alphabet $\mathcal{X} = \{x_i,\ i = 1, \ldots, n\}$ and probability mass function $p_X(x_i) = P[X = x_i] = p_i$, is
$$H(X) = E[I(X = x_i)] = \sum_{i=1}^{n} p_i \log \frac{1}{p_i} = -\sum_{i=1}^{n} p_i \log p_i. \qquad (1.2)$$
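Equation (1.2) translates directly into a small helper, sketched below (the function name is mine); it is reused informally in later examples of this chapter.

```python
import math

def entropy(pmf, base=2.0):
    """Entropy of a DMS given its pmf. Zero-probability symbols contribute
    nothing, by the usual convention 0 * log(1/0) = 0."""
    return sum(p * math.log(1.0 / p, base) for p in pmf if p > 0)

print(entropy([0.5, 0.25, 0.25]))   # 1.5 bits (this pmf reappears in Example 1.3)
```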

The word entropy (from the Greek entropē, meaning change) was borrowed from thermodynamics, where it was first used by Clausius to measure the irreversible increase of non-disposable energy. Indeed, (1.2) is very similar to Boltzmann's statistical definition of entropy.

Example 1.2. Consider the binary source $X = \{0 : p,\ 1 : (1-p)\}$. Its entropy is
$$H(X) = p \log \frac{1}{p} + (1-p) \log \frac{1}{1-p}.$$
If the source is equiprobable, i.e., $p = 1/2$, then $H(X) = 1$. The equiprobable binary source thus carries an average of 1 bit of information per source symbol. That is why we call a binary symbol (0 or 1) a bit: a message represented with $N$ binary symbols contains $N$ bits of information, and, conversely, a message containing $N$ bits of information needs $N$ binary symbols to describe it. The binary source is the simplest and therefore most convenient source for both theoretical and practical purposes, which is why the base-2 logarithm, with the bit as information unit, is prevalent.

Example 1.3. Consider the source $X = \{x_1 : 0.5,\ x_2 : 0.25,\ x_3 : 0.25\}$. Then
$$H(X) = 0.5 \log 2 + 0.25 \log 4 + 0.25 \log 4 = 1.5 \text{ bits}.$$
Thus a typical message from this source contains 1.5 bits of information per symbol; consequently, one symbol of the given source is equivalent in information content to 1.5 binary symbols.

Example 1.4. Listed below are the letters of the English alphabet with their relative frequencies. Treating English as a DMS with these probabilities, its entropy is $H(X) \approx 4.1$ bits per letter.

Table 1.1: English alphabet letter frequencies
Letter  Frequency    Letter  Frequency
A       0.0856       N       0.0707
B       0.0139       O       0.0797
C       0.0279       P       0.0199
D       0.0378       Q       0.0012
E       0.1304       R       0.0677
F       0.0289       S       0.0607
G       0.0199       T       0.1045
H       0.0528       U       0.0249
I       0.0627       V       0.0092
J       0.0013       W       0.0149
K       0.0042       X       0.0017
L       0.0339       Y       0.0199
M       0.0249       Z       0.0008

The information entropy has the following properties.

Theorem 1.1 (Minimum and Maximum Entropy).
1. $H(X) \geq 0$, with equality when $p_i = 1$ for one of the symbols $x_i \in \mathcal{X}$.
2. $H(X) \leq \log n$ for an alphabet of $n$ symbols, with equality when $p_i = 1/n$ for all symbols.
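The value quoted in Example 1.4 can be checked directly from Table 1.1 (a sketch using the frequencies above):

```python
import math

freq = {'A': .0856, 'B': .0139, 'C': .0279, 'D': .0378, 'E': .1304, 'F': .0289,
        'G': .0199, 'H': .0528, 'I': .0627, 'J': .0013, 'K': .0042, 'L': .0339,
        'M': .0249, 'N': .0707, 'O': .0797, 'P': .0199, 'Q': .0012, 'R': .0677,
        'S': .0607, 'T': .1045, 'U': .0249, 'V': .0092, 'W': .0149, 'X': .0017,
        'Y': .0199, 'Z': .0008}

H = sum(p * math.log2(1 / p) for p in freq.values())
print(round(sum(freq.values()), 4))   # the frequencies sum to 1.0
print(round(H, 2))                    # roughly 4.1 bits per letter
```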

Theorem 1.1 gives the minimum and maximum of the entropy of an information source. The first result states that entropy is non-negative, which means that there is no negative information. This is quite intuitive: we will not lose information by receiving a message, even if we may get nothing from it. Zero entropy occurs when the source alphabet loses its randomness, because we get no information from a deterministic event. The second result tells us that the equiprobable source has the maximum entropy. Now we prove Theorem 1.1; before doing so, we first prove the following lemma.

Lemma 1.1 (Information theory fundamental inequality).
$$\ln x \leq x - 1. \qquad (1.3)$$
Proof: Let $f(x) = \ln x - x + 1$. We have $f'(x) = 1/x - 1$, so $f'(1) = 0$, and $f''(x) = -1/x^2 < 0$. Therefore $x = 1$ is the maximum of $f(x)$, i.e., $f(x) \leq f(1) = 0$. The desired result follows.

A corollary of Lemma 1.1 is that
$$\ln \frac{1}{x} \geq 1 - x, \qquad (1.4)$$
which is obtained by replacing $x$ with $1/x$ in (1.3).

Now let us prove Theorem 1.1.
Proof: The first result is derived as
$$H(X) = \sum_{i=1}^{n} p_i \log \frac{1}{p_i} = \frac{1}{\ln 2}\sum_{i=1}^{n} p_i \ln \frac{1}{p_i} \geq \frac{1}{\ln 2}\sum_{i=1}^{n} p_i (1 - p_i) = \frac{1}{\ln 2}\Big(1 - \sum_{i=1}^{n} p_i^2\Big) \geq 0,$$
and the condition for equality is easily verified. For the second result, consider
$$H(X) - \log n = \sum_{i=1}^{n} p_i \log \frac{1}{p_i} - \sum_{i=1}^{n} p_i \log n = \sum_{i=1}^{n} p_i \log \frac{1}{n p_i} \leq \frac{1}{\ln 2}\sum_{i=1}^{n} p_i \Big(\frac{1}{n p_i} - 1\Big) = 0,$$
and the desired result follows. Note that equality holds if and only if $np_i = 1$, i.e., $p_i = 1/n$ for all $i$.

Example 1.5. Using the expression obtained in Example 1.2, let us plot the entropy of the binary source as a function of the probability $p$. The plot is shown in Figure 1.1; the maximum entropy is reached at $p = 1/2$, i.e., when the binary source is equiprobable.

Figure 1.1: Plot of $H(X) = -p\log p - (1-p)\log(1-p)$ versus $p$.

1.4 Joint Entropy

Joint entropy is the natural extension of entropy when we need to study two or more information sources.

Definition 1.4 (Joint Entropy). The joint entropy of $k$ information sources $X_j \in \mathcal{X}_j$, $j = 1, \ldots, k$, with joint probability mass function $p_{X_1 \ldots X_k}(x_1, \ldots, x_k) = P[X_1 = x_1 \in \mathcal{X}_1, \ldots, X_k = x_k \in \mathcal{X}_k]$, is defined as
$$H(X_1, \ldots, X_k) = -\sum_{x_1 \in \mathcal{X}_1} \cdots \sum_{x_k \in \mathcal{X}_k} p_{X_1 \ldots X_k}(x_1, \ldots, x_k) \log p_{X_1 \ldots X_k}(x_1, \ldots, x_k). \qquad (1.5)$$
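Returning to Theorem 1.1 and Example 1.5, both are easy to check numerically; a sketch evaluating the binary entropy on a few values of p (the bound log 2 = 1 bit is attained only at p = 1/2):

```python
import math

def H2(p):
    """Binary entropy H(p) = -p log2 p - (1-p) log2 (1-p), with H(0) = H(1) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

for p in [0.0, 0.1, 0.25, 0.5, 0.75, 1.0]:
    print(p, round(H2(p), 4))     # nonnegative, at most 1, maximal at p = 0.5
```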

Example 1.6. Consider two binary sources $X_1$ and $X_2$ sharing the same alphabet $\mathcal{X} = \{0, 1\}$ but having different probability mass functions: for $X_1$, $p_1(0) = p_1(1) = 0.5$, while for $X_2$, $p_2(0) = 1/3$ and $p_2(1) = 2/3$. We assume the two sources are independent, i.e., $p_{X_1,X_2}(x_1, x_2) = p_1(x_1)\, p_2(x_2)$. Then
$$H(X_1, X_2) = \frac{1}{6}\log 6 + \frac{1}{6}\log 6 + \frac{1}{3}\log 3 + \frac{1}{3}\log 3 = 1.918 \text{ bits per symbol pair}.$$
We emphasize that $H(X_1, X_2)$ is defined as the average information in bits per pair of symbols, in contrast with $H(X)$, which is defined as the average information in bits per symbol. If the information per symbol of the joint source is of interest, we can divide $H(X_1, X_2)$ by 2, which gives 0.959 bits of information per symbol. You may compare this number with $H(X_1) = 1$ and $H(X_2) = 0.918$ to see how the entropy computed jointly relates to the entropies computed individually.
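The numbers in Example 1.6 can be reproduced in a few lines (a sketch that builds the joint pmf from independence and evaluates (1.5)):

```python
import math

def entropy(pmf):
    return sum(p * math.log2(1 / p) for p in pmf if p > 0)

p1 = {0: 0.5, 1: 0.5}          # source X1
p2 = {0: 1/3, 1: 2/3}          # source X2
joint = {(a, b): p1[a] * p2[b] for a in p1 for b in p2}   # independence

H12 = entropy(joint.values())
print(round(H12, 4))           # about 1.9183 bits per symbol pair
print(round(H12 / 2, 4))       # about 0.9591 bits per symbol
print(entropy(p1.values()), round(entropy(p2.values()), 4))   # 1.0 and about 0.9183
```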

An important application of joint entropy is to describe the information content of extended sources.

Definition 1.5 (The $k$-th Extension of an Information Source). Let $X$ be a source with alphabet $\mathcal{X} = \{x_1, \ldots, x_n\}$. The $k$-th extension of $X$, denoted $X^k$, is a source with alphabet $\mathcal{X}^k = \{\sigma_1, \ldots, \sigma_{n^k}\}$, each $\sigma_i$ corresponding to a length-$k$ block of symbols from $\mathcal{X}$, i.e., $\sigma_i = (x_{i,1}, \ldots, x_{i,k})$, $x_{i,j} \in \mathcal{X}$.

Example 1.7. Consider the binary source $\mathcal{X} = \{0, 1\}$. Its 2nd extension is $\mathcal{X}^2 = \{00, 01, 10, 11\}$ and its 3rd extension is $\mathcal{X}^3 = \{000, 001, 010, 011, 100, 101, 110, 111\}$.

The entropy of the extended source $X^k$ can be computed using the formula for joint entropy, viewing the $k$ positions in a block as $k$ sources sharing the same alphabet. The complexity lies in the fact that we need to know the $k$-dimensional distribution of $X$. However, for a DMS the $k$-dimensional probability follows simply from the one-dimensional mass function:
$$p_{X^k}(\sigma_i) = p_X(x_{i,1}) \cdots p_X(x_{i,k}).$$
Therefore, the entropy of the $k$-th extension of a DMS $X$ is
$$
\begin{aligned}
H(X^k) &= \sum_{i=1}^{n^k} p_{X^k}(\sigma_i) \log \frac{1}{p_{X^k}(\sigma_i)} \\
&= \sum_{i=1}^{n^k} p_X(x_{i,1}) \cdots p_X(x_{i,k}) \log \frac{1}{p_X(x_{i,1}) \cdots p_X(x_{i,k})} \\
&= \sum_{i_1=1}^{n} \cdots \sum_{i_k=1}^{n} p_X(x_{i_1}) \cdots p_X(x_{i_k}) \sum_{j=1}^{k} \log \frac{1}{p_X(x_{i_j})} \\
&= \sum_{j=1}^{k} \sum_{i_j=1}^{n} p_X(x_{i_j}) \log \frac{1}{p_X(x_{i_j})} \\
&= k\,H(X) \text{ bits per symbol block.} \qquad (1.6)
\end{aligned}
$$
Note that $H(X^k)/k = H(X)$: each symbol contains the same amount of information in the extended source as in the original source. This fact holds for DMSes. For sources with memory, however, each symbol contains less information in the extended source. (Why?)

1.5 Conditional Entropy

Conditional entropy is used to quantify the information of a source when the information of other sources is available. Consider two sources $X$ and $Y$ with alphabets $\mathcal{X}$ and $\mathcal{Y}$. The conditional probability that $Y = y \in \mathcal{Y}$ given $X = x \in \mathcal{X}$ is $p_{Y|X}(y|x)$. By simple analogy with self-information, the conditional self-information of $Y = y$ given $X = x$ is
$$I(Y = y \mid X = x) = \log \frac{1}{p_{Y|X}(y|x)}.$$
The average information of $Y$ conditioned on $X = x$ can then be written as
$$H(Y \mid X = x) = \sum_{y \in \mathcal{Y}} p_{Y|X}(y|x) \log \frac{1}{p_{Y|X}(y|x)}.$$
Furthermore, we want the average information of $Y$ conditioned on all possible symbols $x \in \mathcal{X}$:
$$H(Y|X) = \sum_{x \in \mathcal{X}} \Bigg[ \sum_{y \in \mathcal{Y}} p_{Y|X}(y|x) \log \frac{1}{p_{Y|X}(y|x)} \Bigg] p_X(x) = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p_{X,Y}(x,y) \log \frac{1}{p_{Y|X}(y|x)}. \qquad (1.7)$$
The above equation is the definition of the conditional entropy of $Y$ given $X$. It is easily generalized to the case of multiple conditioning sources.

Definition 1.6 (Conditional Entropy). The conditional entropy of the information source $X_k$ given the sources $X_1, \ldots, X_{k-1}$ is
$$H(X_k \mid X_1, \ldots, X_{k-1}) = -\sum_{x_1 \in \mathcal{X}_1} \cdots \sum_{x_k \in \mathcal{X}_k} p_{X_1 \ldots X_k}(x_1, \ldots, x_k) \log p_{X_k | X_1 \ldots X_{k-1}}(x_k \mid x_1, \ldots, x_{k-1}). \qquad (1.8)$$

Example 1.8. Consider the sources $X \in \{0 : 0.5,\ 1 : 0.5\}$ and $Y \in \{0, 1\}$. The conditional probability of $Y$ given $X$ is
$$p_{Y|X}(0|0) = 0.25, \quad p_{Y|X}(0|1) = 0.6, \quad p_{Y|X}(1|0) = 0.75, \quad p_{Y|X}(1|1) = 0.4.$$
The conditional entropy of $Y$ given $X$ is then
$$H(Y|X) = (0.25)(0.5)\log\frac{1}{0.25} + (0.75)(0.5)\log\frac{1}{0.75} + (0.6)(0.5)\log\frac{1}{0.6} + (0.4)(0.5)\log\frac{1}{0.4} \approx 0.891.$$
We can also derive the marginal probability of $Y$ from the conditional probabilities and the probability of $X$: $p_Y(0) = 0.425$, $p_Y(1) = 0.575$. Then the (unconditional) entropy of $Y$ is $H(Y) = 0.9837$ bits per symbol. Note that the conditional entropy is smaller than the (unconditional) entropy.

The fact noted in the above example is an instance of a general result about conditional entropy.

Theorem 1.2 (Conditioning Reduces Uncertainty).
$$H(X|Y) \leq H(X). \qquad (1.9)$$
Proof:
$$
\begin{aligned}
H(X|Y) - H(X) &= -\sum_{y \in \mathcal{Y}} \sum_{x \in \mathcal{X}} p_{XY}(x,y) \log p_{X|Y}(x|y) + \sum_{x \in \mathcal{X}} p_X(x) \log p_X(x) \\
&= \sum_{y \in \mathcal{Y}} \sum_{x \in \mathcal{X}} p_{XY}(x,y) \log \frac{p_X(x)}{p_{X|Y}(x|y)}
 = \sum_{y \in \mathcal{Y}} \sum_{x \in \mathcal{X}} p_{XY}(x,y) \log \frac{p_X(x)\, p_Y(y)}{p_{XY}(x,y)} \\
&\leq \frac{1}{\ln 2} \sum_{y \in \mathcal{Y}} \sum_{x \in \mathcal{X}} p_{XY}(x,y) \left[ \frac{p_X(x)\, p_Y(y)}{p_{XY}(x,y)} - 1 \right] = 0.
\end{aligned}
$$
This property is very intuitive: conditioning gives us some information about the source under consideration and thus reduces its remaining uncertainty.
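The computations of Example 1.8 can be verified in a few lines (a sketch following the definitions above):

```python
import math

pX = {0: 0.5, 1: 0.5}
pY_given_X = {(0, 0): 0.25, (1, 0): 0.75,      # keyed as (y, x)
              (0, 1): 0.60, (1, 1): 0.40}

# H(Y|X) = sum_x p(x) sum_y p(y|x) log 1/p(y|x)
HYX = sum(pX[x] * pY_given_X[(y, x)] * math.log2(1 / pY_given_X[(y, x)])
          for x in pX for y in (0, 1))

# marginal of Y and its (unconditional) entropy
pY = {y: sum(pY_given_X[(y, x)] * pX[x] for x in pX) for y in (0, 1)}
HY = sum(p * math.log2(1 / p) for p in pY.values())

print(round(HYX, 3), pY, round(HY, 4))   # about 0.891, {0: 0.425, 1: 0.575}, about 0.9837
```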

Conditional entropy and joint entropy are related by the chain rule.

Theorem 1.3 (Chain Rule).
$$H(X,Y) = H(X) + H(Y|X). \qquad (1.10)$$
Proof:
$$
\begin{aligned}
H(X,Y) &= -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p_{XY}(x,y) \log p_{XY}(x,y) \\
&= -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p_{XY}(x,y) \log \big[ p_{Y|X}(y|x)\, p_X(x) \big] \\
&= -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p_{XY}(x,y) \log p_X(x) - \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p_{XY}(x,y) \log p_{Y|X}(y|x) \\
&= H(X) + H(Y|X).
\end{aligned}
$$

Corollary 1.1.
$$H(X,Y|Z) = H(X|Z) + H(Y|X,Z). \qquad (1.11)$$
Proof: The proof is along the same lines as the theorem.

Example 1.9. Continuing from Example 1.8, we can derive the joint probabilities as
$$p_{XY}(0,0) = p_{Y|X}(0|0)\,p_X(0) = (0.25)(0.5) = 0.125, \quad p_{XY}(0,1) = 0.375, \quad p_{XY}(1,0) = 0.3, \quad p_{XY}(1,1) = 0.2.$$
From these we can compute the joint entropy of $X$ and $Y$: $H(X,Y) = 1.891$. In addition, the entropy of $X$ is obviously $H(X) = 1$, and in Example 1.8 we found $H(Y|X) = 0.891$. We thus verify that $H(X,Y) = H(X) + H(Y|X)$.

A final remark: in general $H(Y|X) \neq H(X|Y)$. However, $H(X) - H(X|Y) = H(Y) - H(Y|X)$, a property that we shall exploit later.

1.6 Communication Channels

Shannon gave a very abstract but precise model for communication systems: an information source, an output, and a communication channel between them. This model is shown in Figure 1.2. We consider the so-called discrete channel as a first step. A discrete channel is one for which both the source and the output are discrete processes. It can be modeled simply as a probabilistic mapping of the source alphabet to the output alphabet. If the discrete channel is additionally memoryless (from one source symbol to another), the mappings of successive source symbols are independent and the description becomes simple.

Definition 1.7 (Discrete Memoryless Channel). A discrete memoryless channel (DMC) between the information source $X \in \mathcal{X} = \{x_1, \ldots, x_n\}$ and the output $Y \in \mathcal{Y} = \{y_1, \ldots, y_m\}$ is a set of conditional probabilities $p_{Y|X}(y_j | x_i) = p_{ij}$, the probability that the output symbol $y_j$ is received when the source symbol $x_i$ is sent. Note that the channel may change a transmitted symbol into another one or introduce new symbols.

Figure 1.2: Information-theoretic model of a communication system: an information source $X$, a channel, and the information output $Y$.

It is convenient to organize the conditional probabilities into a matrix,
$$P = \begin{bmatrix} p_{Y|X}(y_1|x_1) & p_{Y|X}(y_2|x_1) & \cdots & p_{Y|X}(y_m|x_1) \\ p_{Y|X}(y_1|x_2) & p_{Y|X}(y_2|x_2) & \cdots & p_{Y|X}(y_m|x_2) \\ \vdots & \vdots & & \vdots \\ p_{Y|X}(y_1|x_n) & p_{Y|X}(y_2|x_n) & \cdots & p_{Y|X}(y_m|x_n) \end{bmatrix}
 = \begin{bmatrix} p_{11} & p_{12} & \cdots & p_{1m} \\ p_{21} & p_{22} & \cdots & p_{2m} \\ \vdots & \vdots & & \vdots \\ p_{n1} & p_{n2} & \cdots & p_{nm} \end{bmatrix}.$$
This matrix is usually referred to as the channel matrix. Each row of the channel matrix $P$ corresponds to an input of the channel, and each column of $P$ corresponds to an output of the channel. Since, if we send $x_i$, we must receive some $y_j$, we have
$$\sum_{j=1}^{m} p_{ij} = 1, \quad i = 1, \ldots, n.$$
We also usually represent the channel graphically as in Figure 1.3.

Figure 1.3: Channel transition graph: inputs $x_1, \ldots, x_n$ connected to outputs $y_1, \ldots, y_m$ by branches labeled with the transition probabilities $p_{ij}$.
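A channel matrix is just a row-stochastic array; a minimal sketch of how one might store a DMC and sanity-check the row-sum condition (the numbers are the asymmetric binary channel that appears in Example 1.10 below):

```python
# Rows = inputs x_i, columns = outputs y_j, entries p_{Y|X}(y_j | x_i).
P = [[0.8, 0.2],
     [0.3, 0.7]]

# Every row must sum to 1: if x_i is sent, some y_j is always received.
assert all(abs(sum(row) - 1.0) < 1e-12 for row in P)
print("valid channel matrix with", len(P), "inputs and", len(P[0]), "outputs")
```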

Example 1.10. A source emits the symbols $\{0, 1\}$ and the receiver receives the symbols $\{0, 1\}$ as well.

1. If the channel is noiseless and deterministic, then $p(0|0) = p(1|1) = 1$ and $p(1|0) = p(0|1) = 0$. The channel matrix is
$$P = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}.$$
This is called the binary deterministic channel.

2. If the channel introduces 1% bit-inversion errors, then $p(0|0) = p(1|1) = 0.99$ and $p(1|0) = p(0|1) = 0.01$, and the channel matrix is
$$P = \begin{bmatrix} 0.99 & 0.01 \\ 0.01 & 0.99 \end{bmatrix}.$$
The general case $P(0|0) = P(1|1) = 1-\epsilon$ and $P(1|0) = P(0|1) = \epsilon$ is called the binary symmetric channel (BSC). The binary deterministic channel described above is a special case of the BSC.

3. In general, the errors in a binary channel may depend on the symbol transmitted, i.e., $P(1|0) \neq P(0|1)$, for example
$$P = \begin{bmatrix} 0.8 & 0.2 \\ 0.3 & 0.7 \end{bmatrix}.$$

Example 1.11 (Binary Erasure Channel (BEC)). A binary erasure channel has a binary input $\{0, 1\}$ and a ternary output $\{0, ?, 1\}$, where $?$ means that a decision cannot be made on whether a 0 or a 1 was sent (the output is erased):
$$P = \begin{bmatrix} 1-q & q & 0 \\ 0 & q & 1-q \end{bmatrix}.$$

If the source symbols are sent with probabilities $p_X(x_i)$, $i = 1, \ldots, n$, the output symbols appear with some other set of probabilities $p_Y(y_j)$, $j = 1, \ldots, m$, which can be derived from the total probability formula
$$p_Y(y_j) = \sum_{i=1}^{n} p_{Y|X}(y_j | x_i)\, p_X(x_i) \qquad (1.12)$$
for any given channel $p_{Y|X}(y_j|x_i)$. From Bayes' law we also get
$$p_{X|Y}(x_i | y_j) = \frac{p_{XY}(x_i, y_j)}{p_Y(y_j)} = \frac{p_{Y|X}(y_j|x_i)\, p_X(x_i)}{p_Y(y_j)}, \qquad (1.13)$$
which is the probability that the input $x_i$ was sent given that the output $y_j$ was received. We call $p_{X|Y}(x_i|y_j)$ the backward probability and $p_{Y|X}(y_j|x_i)$ the forward probability. Note that if we are given $p_Y(y_j)$ and $p_{Y|X}(y_j|x_i)$, it may not be possible to invert (1.12) to determine $p_X(x_i)$: there may be many source distributions $p_X(x_i)$ that lead to the same output distribution $p_Y(y_j)$ for a given channel. But if we are given $p_X(x_i)$ and $p_{Y|X}(y_j|x_i)$, we always obtain a unique $p_Y(y_j)$.

Example 1.12. A binary channel with channel matrix
$$P = \begin{bmatrix} 2/3 & 1/3 \\ 1/10 & 9/10 \end{bmatrix}$$
and source probabilities $p_X(0) = 3/4$ and $p_X(1) = 1/4$ can be represented diagrammatically by its transition graph. We can now derive the output probabilities,
$$p_Y(0) = (2/3)(3/4) + (1/10)(1/4) = 21/40, \qquad p_Y(1) = (9/10)(1/4) + (1/3)(3/4) = 19/40,$$
and the backward probabilities,
$$p_{X|Y}(0|0) = \frac{(2/3)(3/4)}{21/40} = \frac{20}{21}, \quad p_{X|Y}(1|0) = \frac{(1/10)(1/4)}{21/40} = \frac{1}{21}, \quad p_{X|Y}(0|1) = \frac{(1/3)(3/4)}{19/40} = \frac{10}{19}, \quad p_{X|Y}(1|1) = \frac{(9/10)(1/4)}{19/40} = \frac{9}{19}.$$
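The forward/backward bookkeeping of Example 1.12 is a direct application of (1.12) and (1.13); a sketch using exact fractions:

```python
from fractions import Fraction as F

P = [[F(2, 3), F(1, 3)],        # p_{Y|X}(y_j|x_i): row = input, column = output
     [F(1, 10), F(9, 10)]]
pX = [F(3, 4), F(1, 4)]

# Output probabilities, eq. (1.12): p_Y(y_j) = sum_i p_{Y|X}(y_j|x_i) p_X(x_i)
pY = [sum(P[i][j] * pX[i] for i in range(2)) for j in range(2)]
print(pY)                        # [21/40, 19/40]

# Backward probabilities, eq. (1.13): p_{X|Y}(x_i|y_j) = p_{Y|X}(y_j|x_i) p_X(x_i) / p_Y(y_j)
back = [[P[i][j] * pX[i] / pY[j] for j in range(2)] for i in range(2)]
print(back)                      # [[20/21, 10/19], [1/21, 9/19]]
```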

1.7 Equivocation and Mutual Information

Now let us study how the information content of a source is affected by a communication channel. We first need to give the information measure of the source before and after the channel (i.e., when the output is available). The a priori entropy of a source $X$ is just its ordinary entropy,
$$H(X) = -\sum_{i=1}^{n} P(x_i) \log P(x_i). \qquad (1.14)$$
It is easy to guess that the a posteriori entropy should be the conditional entropy of $X$ given the channel output $Y$, which we call the equivocation of $X$ with respect to $Y$.

Definition 1.8 (Equivocation). The equivocation of $X$ with respect to $Y$ is the conditional entropy of $X$ given $Y$:
$$H(X|Y) = -\sum_{i=1}^{n} \sum_{j=1}^{m} p_{XY}(x_i, y_j) \log p_{X|Y}(x_i \mid y_j). \qquad (1.15)$$

The equivocation gives the information of the source $X$ after its response at the output of the channel is available. From the discussion of conditional entropy in the previous section, we know that $H(X|Y) \leq H(X)$; that is, after transmission through the channel $p_{Y|X}(y_j|x_i)$, the uncertainty about $X$ has decreased. If we remember that information is uncertainty, then we have less uncertainty about the source $X$ after we obtain the channel output $Y$; in other words, we are more certain about the source after we observe it through the channel output. This agrees completely with our intuition. It is then natural to regard the difference between $H(X)$ and $H(X|Y)$ as the information we extract about $X$ through knowing $Y$. In other words, $H(X) - H(X|Y)$ gives the information about $X$ conveyed by $Y$. This quantity is named the mutual information.

Definition 1.9 (Mutual Information). The mutual information of $X$ and $Y$ is defined as
$$I(X;Y) = H(X) - H(X|Y). \qquad (1.16)$$
Alternative expressions for $I(X;Y)$ can be obtained as follows:
$$
\begin{aligned}
I(X;Y) &= -\sum_{i=1}^{n} p_X(x_i) \log p_X(x_i) + \sum_{i=1}^{n} \sum_{j=1}^{m} p_{XY}(x_i, y_j) \log p_{X|Y}(x_i|y_j) \\
&= \sum_{i=1}^{n} \sum_{j=1}^{m} p_{XY}(x_i, y_j) \log \frac{p_{X|Y}(x_i|y_j)}{p_X(x_i)} \\
&= \sum_{i=1}^{n} \sum_{j=1}^{m} p_{XY}(x_i, y_j) \log \frac{p_{XY}(x_i, y_j)}{p_X(x_i)\, p_Y(y_j)} \\
&= \sum_{i=1}^{n} \sum_{j=1}^{m} p_{XY}(x_i, y_j) \log \frac{p_{Y|X}(y_j|x_i)}{p_Y(y_j)}. \qquad (1.17)
\end{aligned}
$$

The following properties of the mutual information $I(X;Y)$ can be observed.

Theorem 1.4.
$$I(X;Y) \geq 0, \qquad (1.18)$$
with equality if and only if $p_{XY}(x_i, y_j) = p_X(x_i)\, p_Y(y_j)$ for all $i, j$.
Proof: Use the information theory fundamental inequality (1.3).

Theorem 1.5. $I(X;X) = H(X)$. Source entropy can therefore be viewed as a special case of mutual information.
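Mutual information as given by (1.16)–(1.17) is easy to evaluate from a joint pmf; a sketch using the joint probabilities derived in Example 1.9:

```python
import math

pXY = {(0, 0): 0.125, (0, 1): 0.375, (1, 0): 0.3, (1, 1): 0.2}   # from Example 1.9

pX = {x: sum(p for (a, _), p in pXY.items() if a == x) for x in (0, 1)}
pY = {y: sum(p for (_, b), p in pXY.items() if b == y) for y in (0, 1)}

# I(X;Y) = sum_{x,y} p(x,y) log [ p(x,y) / (p(x) p(y)) ]
I = sum(p * math.log2(p / (pX[x] * pY[y])) for (x, y), p in pXY.items() if p > 0)
print(round(I, 4))   # about 0.093 bits: positive, since X and Y are dependent
```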

Theorem 1.6.
$$I(X;Y) = I(Y;X). \qquad (1.19)$$
That is, mutual information is symmetric with respect to $X$ and $Y$, as is easily seen from (1.17).

It is interesting to investigate the relations among the entropies $H(X)$ and $H(Y)$, the joint entropy $H(X,Y)$, the equivocations $H(X|Y)$ and $H(Y|X)$, and the mutual information $I(X;Y) = I(Y;X)$. They are summarized in Figure 1.4 and the following theorem.

Theorem 1.7.
$$H(X,Y) = H(X) + H(Y) - I(X;Y) = H(X) + H(Y|X) = H(Y) + H(X|Y). \qquad (1.20)$$
The proof of this relationship is just an exercise in manipulating probabilities, joint probabilities, and conditional probabilities.

Figure 1.4: Relationship between the entropies $H(X)$, $H(Y)$, the equivocations $H(X|Y)$, $H(Y|X)$, the mutual information $I(X;Y) = I(Y;X)$, and the joint entropy $H(X,Y)$.

Example 1.13. Use the specifications given in Example 1.12. Before seeing an output from the channel, our a priori knowledge of the information source is
$$H(X) = (3/4)\log(4/3) + (1/4)\log 4 = 0.811.$$
After seeing the channel output, say $y_j = 0$, our knowledge of the input source becomes
$$H(X \mid y_j = 0) = (20/21)\log(21/20) + (1/21)\log 21 = 0.276.$$
We can similarly obtain $H(X \mid y_j = 1) = 0.998$. We see that we are more certain that a 0 was sent when we observe a 0, because the uncertainty (entropy) about the source $X$ is reduced from 0.811 to 0.276 when we receive a 0. On the other hand, when we receive a 1, we are more uncertain about what $x_i$ was (with almost equal uncertainty about whether it was a 0 or a 1). On average, the equivocation of $X$ with respect to $Y$ is
$$H(X|Y) = 0.276 \cdot (21/40) + 0.998 \cdot (19/40) = 0.619 < 0.811 = H(X).$$
The mutual information of $X$ and $Y$ is $I(X;Y) = 0.811 - 0.619 = 0.192$; that is, we obtain 0.192 bits of information per received symbol. Other quantities are $H(Y) = (21/40)\log(40/21) + (19/40)\log(40/19) = 0.9982$, $H(Y|X) = H(Y) - I(Y;X) = H(Y) - I(X;Y) = 0.806$, and $H(X,Y) = H(X) + H(Y) - I(X;Y) = 1.617$.

Example 1.14. For a BSC (cf. Example 1.10), the following probabilities are obtained:
$p_X(x_i)$: $p_X(0) = p$, $p_X(1) = 1-p$;

$p_{Y|X}(y_j|x_i)$: $p_{Y|X}(0|0) = p_{Y|X}(1|1) = 1-\epsilon$, $p_{Y|X}(1|0) = p_{Y|X}(0|1) = \epsilon$;
$p_{XY}(x_i, y_j)$: $p_{XY}(0,0) = p(1-\epsilon)$, $p_{XY}(1,1) = (1-p)(1-\epsilon)$, $p_{XY}(0,1) = p\epsilon$, $p_{XY}(1,0) = (1-p)\epsilon$;
$p_Y(y_j)$: $p_Y(0) = p(1-\epsilon) + (1-p)\epsilon$, $p_Y(1) = p\epsilon + (1-p)(1-\epsilon)$;
$p_{X|Y}(x_i|y_j)$: $p_{X|Y}(0|0) = \dfrac{p(1-\epsilon)}{p(1-\epsilon)+(1-p)\epsilon}$, $p_{X|Y}(1|0) = \dfrac{(1-p)\epsilon}{p(1-\epsilon)+(1-p)\epsilon}$, $p_{X|Y}(0|1) = \dfrac{p\epsilon}{p\epsilon+(1-p)(1-\epsilon)}$, $p_{X|Y}(1|1) = \dfrac{(1-p)(1-\epsilon)}{p\epsilon+(1-p)(1-\epsilon)}$.

From these we can compute the various entropies and the mutual information. Here we just show an easy route to the mutual information:
$$I(X;Y) = H(Y) - H(Y|X) = F\big(p(1-\epsilon) + (1-p)\epsilon\big) - F(\epsilon),$$
where $F(x) = x\log\frac{1}{x} + (1-x)\log\frac{1}{1-x}$. What would you expect $I(X;Y)$ to be for a BSC if $p = 0.5$, $p = 1$, $\epsilon = 0.5$, or $\epsilon = 0$? You do not need to calculate!

Example 1.15 (Noiseless Channel). A channel in which each output symbol can be produced by the occurrence of only one particular source symbol is called a noiseless channel: there is no noise or ambiguity about which input caused the output. An example is the channel
$$P = \begin{bmatrix} 1/2 & 1/2 & 0 & 0 & 0 \\ 0 & 0 & 3/5 & 3/10 & 1/10 \end{bmatrix}.$$
We see that for a noiseless channel the channel matrix has one and only one nonzero element in each column. Also note that there may be more output symbols than source symbols; however, there cannot be fewer (Why?). For a noiseless channel, when we observe the output $y_j$ we know with probability 1 which input, say $x_0$, was sent; that is, $p_{X|Y}(x_0|y_j) = 1$ and $p_{X|Y}(x_i|y_j) = 0$ for all other $x_i \neq x_0$. The equivocation $H(X|Y)$ is therefore
$$H(X|Y) = -\sum_{i=1}^{n}\sum_{j=1}^{m} P(x_i, y_j)\log P(x_i|y_j) = -\sum_{j=1}^{m} P(y_j) \sum_{i=1}^{n} P(x_i|y_j)\log P(x_i|y_j) = 0,$$
because for each $y_j$ only one $P(x_i|y_j)$ equals 1 and the others are zero. We then have the following result: for noiseless channels, $I(X;Y) = H(X)$. With a noiseless channel there is no uncertainty about the input upon observing the output, and the amount of information transmitted through the channel equals the information contained in the source. That is why we are in favor of noiseless channels.
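Returning to the BSC of Example 1.14, the closed form for I(X;Y) is easy to evaluate; a sketch that also answers the rhetorical question for the special values of p and ε:

```python
import math

def F(x):
    """Binary entropy function."""
    return 0.0 if x in (0.0, 1.0) else -x * math.log2(x) - (1 - x) * math.log2(1 - x)

def I_bsc(p, eps):
    """I(X;Y) = H(Y) - H(Y|X) for a BSC with P[X=0] = p and crossover eps."""
    return F(p * (1 - eps) + (1 - p) * eps) - F(eps)

print(I_bsc(0.5, 0.1))   # maximal over p: 1 - F(0.1)
print(I_bsc(1.0, 0.1))   # 0: a deterministic source carries no information
print(I_bsc(0.3, 0.5))   # 0: eps = 1/2 makes the output independent of the input
print(I_bsc(0.3, 0.0))   # F(0.3): a noiseless channel delivers all of H(X)
```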

Example 1.16 (Deterministic Channel). A channel in which there may be more input symbols than output symbols, but where each input symbol is capable of producing only one of the output symbols, is called a deterministic channel. A channel matrix of this form, for instance
$$P = \begin{bmatrix} 1 & 0 & 0 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix},$$
has one and only one nonzero element in each row, and that element equals 1. For a deterministic channel we know with probability 1 which output symbol, say $y_j$, will be produced when $x_i$ is sent. Therefore $P(y_j|x_i) = 1$ for that $y_j$ and $P(y_{j'}|x_i) = 0$ for all other $y_{j'}$. The equivocation $H(Y|X) = 0$, following the same derivation as in the previous example. Hence we have the following result: for deterministic channels, $I(X;Y) = H(Y)$.

1.8 Cascaded Channel

A cascade of two channels is shown in Figure 1.5: the output of channel 1 is connected to the input of channel 2. When $x_i$ is sent through channel 1, the output is $y_j$; the same $y_j$ forms the input to channel 2, which produces the output $z_k$. If we know that the intermediate symbol is $y_j$, then the probability of obtaining $z_k$ at the output depends solely on $y_j$ and not on $x_i$. That is,
$$p_{Z|XY}(z_k \mid x_i, y_j) = p_{Z|Y}(z_k \mid y_j), \quad \forall i, j, k.$$
This relationship can in fact be viewed as the definition of a cascade of two channels. In the reverse direction we have
$$p_{X|YZ}(x_i \mid y_j, z_k) = p_{X|Y}(x_i \mid y_j).$$

Figure 1.5: Cascade of two channels: $X \to$ Channel 1 $\to Y \to$ Channel 2 $\to Z$.

Let us look at
$$
\begin{aligned}
H(X|Z) - H(X|Y) &= -\sum_{X,Z} P(x,z)\log P(x|z) + \sum_{X,Y} P(x,y)\log P(x|y) \\
&= \sum_{X,Y,Z} P(x,y,z)\log \frac{P(x|y)}{P(x|z)}
 = \sum_{X,Y,Z} P(x,y,z)\log \frac{P(x|y,z)}{P(x|z)} \\
&= \sum_{Y,Z} P(y,z) \sum_{X} P(x|y,z)\log \frac{P(x|y,z)}{P(x|z)} \\
&\geq \frac{1}{\ln 2}\sum_{Y,Z} P(y,z) \sum_{X} P(x|y,z)\left(1 - \frac{P(x|z)}{P(x|y,z)}\right) = 0,
\end{aligned}
$$
where we used $P(x|y) = P(x|y,z)$ for a cascade, and the inequality follows from (1.4). Hence $H(X|Z) \geq H(X|Y)$, with equality iff $p_{X|Z}(x|z) = p_{X|YZ}(x|y,z)$, i.e., $p_{X|Z}(x|z) = p_{X|Y}(x|y)$. Consequently, we have the following result.

Theorem 1.8. For the cascade of channels $X \to Y$ and $Y \to Z$,
$$I(X;Y) \geq I(X;Z), \qquad (1.21)$$
with equality iff $p_{X|Z}(x|z) = p_{X|Y}(x|y)$.

This result implies that information channels tend to leak: the information coming out at the end of a cascaded system can be no greater than (and is probably less than) the information available at an intermediate point.

Example 1.17. If the channel $Y \to Z$ is noiseless, then $p_{X|Z}(x|z) = p_{X|Y}(x|y)$, because $p_{X|Z}(x|z) = \sum_{y \in \mathcal{Y}} p_{X|Y}(x|y)\, p_{Y|Z}(y|z)$ and $p_{Y|Z}(y|z) = 1$ for only one specific $y$, by the property of noiseless channels. However, the condition $p_{X|Z}(x|z) = p_{X|Y}(x|y)$ can also be satisfied by noisy channels, as the next example shows.
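Theorem 1.8 is easy to check numerically by composing two channel matrices; a sketch using a cascade of two BSCs (the crossover probabilities are illustrative):

```python
import math

def mutual_info(pX, P):
    """I(X;Y) for input distribution pX and channel matrix P (rows = inputs)."""
    pY = [sum(pX[i] * P[i][j] for i in range(len(pX))) for j in range(len(P[0]))]
    return sum(pX[i] * P[i][j] * math.log2(P[i][j] / pY[j])
               for i in range(len(pX)) for j in range(len(P[0])) if P[i][j] > 0)

def compose(P1, P2):
    """Channel matrix of the cascade: P_XZ = P_XY * P_YZ (ordinary matrix product)."""
    return [[sum(P1[i][k] * P2[k][j] for k in range(len(P2)))
             for j in range(len(P2[0]))] for i in range(len(P1))]

bsc = lambda e: [[1 - e, e], [e, 1 - e]]
pX = [0.5, 0.5]
P_XY, P_YZ = bsc(0.1), bsc(0.2)

print(mutual_info(pX, P_XY))                  # I(X;Y)
print(mutual_info(pX, compose(P_XY, P_YZ)))   # I(X;Z), never larger: the cascade leaks
```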

Example 1.18. Consider the cascade of the channel $X \to Y$ with matrix
$$P_{XY} = \begin{bmatrix} 1/3 & 1/3 & 1/3 \\ 0 & 1/2 & 1/2 \end{bmatrix}$$
and the channel $Y \to Z$ with matrix
$$P_{YZ} = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 2/3 & 1/3 \\ 0 & 1/3 & 2/3 \end{bmatrix}.$$
You can verify that the channel $Y \to Z$ is not noiseless, but surprisingly it does not leak information, because $I(X;Z) = I(X;Y)$. Indeed,
$$P_{XZ} = P_{XY} P_{YZ} = P_{XY}.$$

1.9 Continuous Sources and Channels

Models for continuous sources and channels are necessary when we study analog communication systems. Even in digital communication systems, the signal transmission between the modulator and the demodulator is continuous in nature. A continuous source is a random process $X(t)$ with continuous amplitude. It can be memoryless or have memory, but we mainly consider memoryless sources, which can be described by a random variable $X$ representing one snapshot of the source. The associated probability density function (pdf) of $X$, $f_X(x)$, fully specifies the continuous information source.

A continuous channel maps a continuous source $X$ to an output $Y$ that is also continuous. The mapping is probabilistic. Such a channel is also called a waveform channel. Again, we consider memoryless channels, which can be described by the conditional pdf $f_{Y|X}(y|x)$.

Example 1.19 (Additive White Gaussian Noise Channel). The most popular channel model is the additive white Gaussian noise (AWGN) channel shown in Figure 1.6. It is described by the conditional pdf
$$f_{Y|X}(y|x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(y-x)^2}{2\sigma^2}}. \qquad (1.22)$$

Figure 1.6: Additive white Gaussian noise (AWGN) channel: $Y = X + W$, where $W$ is the Gaussian noise.

There are also semi-continuous channels, in which one of $X$ or $Y$ is continuous and the other is discrete.

1.10 Mutual Information and Differential Entropy

Let us first consider extending the mutual information from discrete channels to continuous ones. Quantize the source $X$ and the output $Y$ by dividing their value sets into small intervals $\delta x$ and $\delta y$, respectively, and concentrating each small interval onto one value: $x_i = i\,\delta x$, $y_j = j\,\delta y$. The probabilities associated with $x_i$ and $y_j$ are related to the pdfs as
$$P(x_i) = f_X(i\delta x)\,\delta x, \qquad P(y_j) = f_Y(j\delta y)\,\delta y, \qquad P(x_i, y_j) = f_{XY}(i\delta x, j\delta y)\,\delta x\,\delta y.$$
We can use the mutual information of the quantized pair $X_d = \{x_i\}$, $Y_d = \{y_j\}$ to approximate the mutual information of $X$ and $Y$:
$$I(X;Y) \approx I(X_d;Y_d) = \sum_i \sum_j f_{XY}(i\delta x, j\delta y)\,\delta x\,\delta y\, \log \frac{f_{XY}(i\delta x, j\delta y)}{f_X(i\delta x)\, f_Y(j\delta y)}.$$
Letting $\delta x, \delta y \to 0$, we can expect the approximation to become exact; the limiting procedure turns the double summation into a double integration. We thus arrive at the following definition of mutual information for continuous channels.

Definition 1.10 (Mutual Information). The mutual information of $X$ and $Y$ is
$$I(X;Y) = \int_{\mathcal{X}} \int_{\mathcal{Y}} f_{XY}(x,y) \log \frac{f_{XY}(x,y)}{f_X(x)\, f_Y(y)}\, dx\, dy. \qquad (1.23)$$
Equation (1.23) also admits the alternative expressions
$$I(X;Y) = \int_{\mathcal{X}} \int_{\mathcal{Y}} f_X(x)\, f_{Y|X}(y|x) \log \frac{f_{Y|X}(y|x)}{f_Y(y)}\, dx\, dy = \int_{\mathcal{X}} \int_{\mathcal{Y}} f_Y(y)\, f_{X|Y}(x|y) \log \frac{f_{X|Y}(x|y)}{f_X(x)}\, dx\, dy. \qquad (1.24)$$

However, this quantization method does not carry over to the entropy, because
$$H(X_d) = -\sum_i f_X(i\delta x)\,\delta x\, \log\big[f_X(i\delta x)\,\delta x\big] = -\sum_i f_X(i\delta x) \log f_X(i\delta x)\;\delta x - \sum_i f_X(i\delta x)\,\delta x\, \log \delta x,$$
where the second term does not converge as $\delta x \to 0$. The solution is to take only the first, well-behaved term as the entropy of the continuous source $X$, which we call the differential entropy.

Definition 1.11 (Differential Entropy).
$$h(X) = \int_{\mathcal{X}} f_X(x) \log \frac{1}{f_X(x)}\, dx. \qquad (1.25)$$

Similarly, we have the joint differential entropy
$$h(X,Y) = \int_{\mathcal{X}} \int_{\mathcal{Y}} f_{XY}(x,y) \log \frac{1}{f_{XY}(x,y)}\, dx\, dy \qquad (1.26)$$
and the conditional differential entropy
$$h(Y|X) = \int_{\mathcal{X}} \int_{\mathcal{Y}} f_{XY}(x,y) \log \frac{1}{f_{Y|X}(y|x)}\, dx\, dy. \qquad (1.27)$$
Though these definitions look like a mere replacement of probabilities with pdfs, they no longer possess the mathematical elegance of the discrete ones. For example, $h(X)$ may be negative. More importantly, the usefulness of $h(X)$ depends on the existence of the integral, which is not necessarily finite. On the other hand, as long as $h(X)$ and $h(X|Y)$ exist, the relationship
$$I(X;Y) = h(X) - h(X|Y) \qquad (1.28)$$
holds.

Example 1.20. The differential entropy of a Gaussian source,
$$f_X(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-m)^2}{2\sigma^2}},$$
can be derived as
$$h(X) = E\left[ -\log\left( \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-m)^2}{2\sigma^2}} \right) \right] = \log(\sqrt{2\pi}\,\sigma) + \frac{\log e}{2}\, E\left[ \frac{(x-m)^2}{\sigma^2} \right] = \frac{1}{2}\log(2\pi e \sigma^2). \qquad (1.29)$$
If we view $\sigma^2$ as the power of $X$, we see that the differential entropy of a Gaussian source is determined by its power. The result (1.29) extends to $k$ jointly Gaussian sources:
$$h(X_1, \ldots, X_k) = \frac{1}{2}\log\big[(2\pi e)^k\, |\mathbf{P}|\big],$$
where $\mathbf{P}$ is the covariance matrix of $X_1, \ldots, X_k$.

Besides the Gaussian distribution, the uniform distribution, $f_X(x) = 1/(b-a)$, $a \leq x \leq b$, is another important distribution. Its differential entropy is
$$h(X) = \log(b-a), \qquad (1.30)$$
determined by the length of the interval. If $b - a < 1$, then $h(X) < 0$; so the non-negativity of $h(X)$ does not carry over to differential entropy.
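The closed forms (1.29) and (1.30) can be sanity-checked against a crude Monte Carlo estimate of h(X) = E[-log2 f_X(X)]; a sketch with arbitrary parameters and sample size:

```python
import math, random

sigma, m = 2.0, 0.0
h_gauss_formula = 0.5 * math.log2(2 * math.pi * math.e * sigma**2)   # eq. (1.29)

def f_gauss(x):
    return math.exp(-(x - m) ** 2 / (2 * sigma**2)) / (math.sqrt(2 * math.pi) * sigma)

samples = [random.gauss(m, sigma) for _ in range(200_000)]
h_gauss_mc = sum(-math.log2(f_gauss(x)) for x in samples) / len(samples)

a, b = 0.0, 0.5
h_unif = math.log2(b - a)        # eq. (1.30): negative here, since b - a < 1

print(round(h_gauss_formula, 3), round(h_gauss_mc, 3), round(h_unif, 3))
```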


More information

Shannon meets Wiener II: On MMSE estimation in successive decoding schemes

Shannon meets Wiener II: On MMSE estimation in successive decoding schemes Shannon meets Wiener II: On MMSE estimation in successive decoding schemes G. David Forney, Jr. MIT Cambridge, MA 0239 USA forneyd@comcast.net Abstract We continue to discuss why MMSE estimation arises

More information

Information Theory and Coding Prof. S. N. Merchant Department of Electrical Engineering Indian Institute of Technology, Bombay

Information Theory and Coding Prof. S. N. Merchant Department of Electrical Engineering Indian Institute of Technology, Bombay Information Theory and Coding Prof. S. N. Merchant Department of Electrical Engineering Indian Institute of Technology, Bombay Lecture - 13 Competitive Optimality of the Shannon Code So, far we have studied

More information

EE/Stats 376A: Homework 7 Solutions Due on Friday March 17, 5 pm

EE/Stats 376A: Homework 7 Solutions Due on Friday March 17, 5 pm EE/Stats 376A: Homework 7 Solutions Due on Friday March 17, 5 pm 1. Feedback does not increase the capacity. Consider a channel with feedback. We assume that all the recieved outputs are sent back immediately

More information

Discrete Probability Refresher

Discrete Probability Refresher ECE 1502 Information Theory Discrete Probability Refresher F. R. Kschischang Dept. of Electrical and Computer Engineering University of Toronto January 13, 1999 revised January 11, 2006 Probability theory

More information

Classical Information Theory Notes from the lectures by prof Suhov Trieste - june 2006

Classical Information Theory Notes from the lectures by prof Suhov Trieste - june 2006 Classical Information Theory Notes from the lectures by prof Suhov Trieste - june 2006 Fabio Grazioso... July 3, 2006 1 2 Contents 1 Lecture 1, Entropy 4 1.1 Random variable...............................

More information

Bounds on Mutual Information for Simple Codes Using Information Combining

Bounds on Mutual Information for Simple Codes Using Information Combining ACCEPTED FOR PUBLICATION IN ANNALS OF TELECOMM., SPECIAL ISSUE 3RD INT. SYMP. TURBO CODES, 003. FINAL VERSION, AUGUST 004. Bounds on Mutual Information for Simple Codes Using Information Combining Ingmar

More information

Investigation of the Elias Product Code Construction for the Binary Erasure Channel

Investigation of the Elias Product Code Construction for the Binary Erasure Channel Investigation of the Elias Product Code Construction for the Binary Erasure Channel by D. P. Varodayan A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF BACHELOR OF APPLIED

More information

ECE531: Principles of Detection and Estimation Course Introduction

ECE531: Principles of Detection and Estimation Course Introduction ECE531: Principles of Detection and Estimation Course Introduction D. Richard Brown III WPI 22-January-2009 WPI D. Richard Brown III 22-January-2009 1 / 37 Lecture 1 Major Topics 1. Web page. 2. Syllabus

More information

Appendix B Information theory from first principles

Appendix B Information theory from first principles Appendix B Information theory from first principles This appendix discusses the information theory behind the capacity expressions used in the book. Section 8.3.4 is the only part of the book that supposes

More information

Entropy and Ergodic Theory Lecture 4: Conditional entropy and mutual information

Entropy and Ergodic Theory Lecture 4: Conditional entropy and mutual information Entropy and Ergodic Theory Lecture 4: Conditional entropy and mutual information 1 Conditional entropy Let (Ω, F, P) be a probability space, let X be a RV taking values in some finite set A. In this lecture

More information

Review of Probability Theory

Review of Probability Theory Review of Probability Theory Arian Maleki and Tom Do Stanford University Probability theory is the study of uncertainty Through this class, we will be relying on concepts from probability theory for deriving

More information

CHAPTER 12 Boolean Algebra

CHAPTER 12 Boolean Algebra 318 Chapter 12 Boolean Algebra CHAPTER 12 Boolean Algebra SECTION 12.1 Boolean Functions 2. a) Since x 1 = x, the only solution is x = 0. b) Since 0 + 0 = 0 and 1 + 1 = 1, the only solution is x = 0. c)

More information

Electrical and Information Technology. Information Theory. Problems and Solutions. Contents. Problems... 1 Solutions...7

Electrical and Information Technology. Information Theory. Problems and Solutions. Contents. Problems... 1 Solutions...7 Electrical and Information Technology Information Theory Problems and Solutions Contents Problems.......... Solutions...........7 Problems 3. In Problem?? the binomial coefficent was estimated with Stirling

More information

CS 630 Basic Probability and Information Theory. Tim Campbell

CS 630 Basic Probability and Information Theory. Tim Campbell CS 630 Basic Probability and Information Theory Tim Campbell 21 January 2003 Probability Theory Probability Theory is the study of how best to predict outcomes of events. An experiment (or trial or event)

More information

LECTURE 3. Last time:

LECTURE 3. Last time: LECTURE 3 Last time: Mutual Information. Convexity and concavity Jensen s inequality Information Inequality Data processing theorem Fano s Inequality Lecture outline Stochastic processes, Entropy rate

More information

A CLASSROOM NOTE: ENTROPY, INFORMATION, AND MARKOV PROPERTY. Zoran R. Pop-Stojanović. 1. Introduction

A CLASSROOM NOTE: ENTROPY, INFORMATION, AND MARKOV PROPERTY. Zoran R. Pop-Stojanović. 1. Introduction THE TEACHING OF MATHEMATICS 2006, Vol IX,, pp 2 A CLASSROOM NOTE: ENTROPY, INFORMATION, AND MARKOV PROPERTY Zoran R Pop-Stojanović Abstract How to introduce the concept of the Markov Property in an elementary

More information

Complex Systems Methods 2. Conditional mutual information, entropy rate and algorithmic complexity

Complex Systems Methods 2. Conditional mutual information, entropy rate and algorithmic complexity Complex Systems Methods 2. Conditional mutual information, entropy rate and algorithmic complexity Eckehard Olbrich MPI MiS Leipzig Potsdam WS 2007/08 Olbrich (Leipzig) 26.10.2007 1 / 18 Overview 1 Summary

More information

DEEP LEARNING CHAPTER 3 PROBABILITY & INFORMATION THEORY

DEEP LEARNING CHAPTER 3 PROBABILITY & INFORMATION THEORY DEEP LEARNING CHAPTER 3 PROBABILITY & INFORMATION THEORY OUTLINE 3.1 Why Probability? 3.2 Random Variables 3.3 Probability Distributions 3.4 Marginal Probability 3.5 Conditional Probability 3.6 The Chain

More information

EE/Stat 376B Handout #5 Network Information Theory October, 14, Homework Set #2 Solutions

EE/Stat 376B Handout #5 Network Information Theory October, 14, Homework Set #2 Solutions EE/Stat 376B Handout #5 Network Information Theory October, 14, 014 1. Problem.4 parts (b) and (c). Homework Set # Solutions (b) Consider h(x + Y ) h(x + Y Y ) = h(x Y ) = h(x). (c) Let ay = Y 1 + Y, where

More information

William Stallings Copyright 2010

William Stallings Copyright 2010 A PPENDIX F M EASURES OF S ECRECY AND S ECURITY William Stallings Copyright 2010 F.1 PERFECT SECRECY...2! F.2 INFORMATION AND ENTROPY...8! Information...8! Entropy...10! Properties of the Entropy Function...12!

More information

Lecture 8: Shannon s Noise Models

Lecture 8: Shannon s Noise Models Error Correcting Codes: Combinatorics, Algorithms and Applications (Fall 2007) Lecture 8: Shannon s Noise Models September 14, 2007 Lecturer: Atri Rudra Scribe: Sandipan Kundu& Atri Rudra Till now we have

More information

6.1 Main properties of Shannon entropy. Let X be a random variable taking values x in some alphabet with probabilities.

6.1 Main properties of Shannon entropy. Let X be a random variable taking values x in some alphabet with probabilities. Chapter 6 Quantum entropy There is a notion of entropy which quantifies the amount of uncertainty contained in an ensemble of Qbits. This is the von Neumann entropy that we introduce in this chapter. In

More information

Chapter 7. Error Control Coding. 7.1 Historical background. Mikael Olofsson 2005

Chapter 7. Error Control Coding. 7.1 Historical background. Mikael Olofsson 2005 Chapter 7 Error Control Coding Mikael Olofsson 2005 We have seen in Chapters 4 through 6 how digital modulation can be used to control error probabilities. This gives us a digital channel that in each

More information

Lecture 8: Channel and source-channel coding theorems; BEC & linear codes. 1 Intuitive justification for upper bound on channel capacity

Lecture 8: Channel and source-channel coding theorems; BEC & linear codes. 1 Intuitive justification for upper bound on channel capacity 5-859: Information Theory and Applications in TCS CMU: Spring 23 Lecture 8: Channel and source-channel coding theorems; BEC & linear codes February 7, 23 Lecturer: Venkatesan Guruswami Scribe: Dan Stahlke

More information

An Alternative Proof of Channel Polarization for Channels with Arbitrary Input Alphabets

An Alternative Proof of Channel Polarization for Channels with Arbitrary Input Alphabets An Alternative Proof of Channel Polarization for Channels with Arbitrary Input Alphabets Jing Guo University of Cambridge jg582@cam.ac.uk Jossy Sayir University of Cambridge j.sayir@ieee.org Minghai Qin

More information

Chapter 8: Differential entropy. University of Illinois at Chicago ECE 534, Natasha Devroye

Chapter 8: Differential entropy. University of Illinois at Chicago ECE 534, Natasha Devroye Chapter 8: Differential entropy Chapter 8 outline Motivation Definitions Relation to discrete entropy Joint and conditional differential entropy Relative entropy and mutual information Properties AEP for

More information

Information Sources. Professor A. Manikas. Imperial College London. EE303 - Communication Systems An Overview of Fundamentals

Information Sources. Professor A. Manikas. Imperial College London. EE303 - Communication Systems An Overview of Fundamentals Information Sources Professor A. Manikas Imperial College London EE303 - Communication Systems An Overview of Fundamentals Prof. A. Manikas (Imperial College) EE303: Information Sources 24 Oct. 2011 1

More information