Electrical and Information Technology

Information Theory
Problems and Solutions

Contents
  Problems
  Solutions
Problems

30. In a previous problem the binomial coefficient was estimated with Stirling's
approximation. Use the strong form of Stirling's approximation,

    sqrt(2 pi n) (n/e)^n  <=  n!  <=  sqrt(2 pi n) (n/e)^n e^{1/(12n)},

and justify the steps in the following calculations, where np is an integer and
0 < p < 1, q = 1 - p:

    C(n, np)  <=(a)  1/sqrt(2 pi npq) * p^{-np} q^{-nq} e^{1/(12n)}
               =(b)  1/sqrt(2 pi npq) * 2^{nh(p)} e^{1/(12n)}
              <=(c)  1/sqrt(pi npq) * 2^{nh(p)}

    C(n, np)  >=(d)  1/sqrt(2 pi npq) * p^{-np} q^{-nq} e^{-1/(12npq)}
               =     1/sqrt(2 pi npq) * 2^{nh(p)} e^{-1/(12npq)}
              >=(e)  1/sqrt(8npq) * 2^{nh(p)}

where, in the last inequality, it can be assumed that npq >= 9. The above
calculations give a useful bound on the binomial coefficient,

    1/sqrt(8npq) * 2^{nh(p)}  <=  C(n, np)  <=  1/sqrt(pi npq) * 2^{nh(p)}.

31. Encode the text

    IF IF = THEN THEN THEN = ELSE ELSE ELSE = IF;

using the LZ77 algorithm with S = 7 and B = 7. How many code symbols were
generated? If each letter in the text is translated to binary form with eight
bits, what is the compression ratio?

32. Encode the text

    IF IF = THEN THEN THEN = ELSE ELSE ELSE = IF;

using the LZ78 algorithm. How many code symbols were generated? If each letter
in the text is translated to binary form with eight bits, what is the
compression ratio?
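The two-sided bound on the binomial coefficient above can be checked numerically. The following is a minimal sketch (the helper names are ours, not from the text), assuming base-2 logarithms as in the bound:

```python
import math

def h(p):
    # binary entropy function in bits
    q = 1 - p
    return -p * math.log2(p) - q * math.log2(q)

def binom_bounds(n, p):
    # lower bound / exact value / upper bound for C(n, np);
    # n*p must be an integer for the bound to apply as stated
    q = 1 - p
    k = round(n * p)
    exact = math.comb(n, k)
    lower = 2 ** (n * h(p)) / math.sqrt(8 * n * p * q)
    upper = 2 ** (n * h(p)) / math.sqrt(math.pi * n * p * q)
    return lower, exact, upper

lo, mid, up = binom_bounds(100, 0.3)   # npq = 21 >= 9, so the bound applies
```

For n = 100, p = 0.3 the three values come out in the right order and within a small constant factor of each other, which is exactly what the bound promises.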
33. Consider the sequence

    Nat the bat swat at Matt the gnat

Encode and calculate the compression rate using

(a) LZ77 with S = 10 and B = 3.
(b) LZSS with S = 10 and B = 3.
(c) LZ78.
(d) LZW with a predefined alphabet of size 256.

34. Consider the sequence

    gegeven.een.eend

(a) What source alphabet should be used?
(b) Use LZ77 with a window size N = 8 to encode and decode the sequence with a
    binary code alphabet.
(c) Use LZ78 to encode and decode the sequence with a binary code alphabet.
(d) How many code symbols were generated?

35. Use the LZ78 algorithm to encode and decode the string

    THE FRIEND IN NEED IS THE FRIEND INDEED

with a binary code alphabet. What is the minimal source alphabet? How many
code symbols were generated?

36. Show that for all jointly eps-typical sequences, (x, y) in
A_eps^{(n)}(X, Y), we get

    2^{-n(H(X|Y) + 2 eps)}  <=  p(x|y)  <=  2^{-n(H(X|Y) - 2 eps)}

37. One is given a communication channel with transition probabilities p(y|x)
and channel capacity C = max_{p(x)} I(X; Y). A helpful statistician
preprocesses the output by forming Y~ = g(Y). He claims that this will
strictly improve the capacity.

(a) Show that he is wrong.
(b) Under what conditions does he not strictly decrease the capacity?

38. Let X in {0, 1, ..., 10} be a random variable used as input to an additive
channel, Y = X + Z mod 11, where p(Z = 1) = p(Z = 2) = p(Z = 3) = 1/3. Assume
that X and Z are statistically independent.
(a) What is the capacity of the channel?
(b) What distribution on p(x) gives the capacity?

39. Consider the discrete memoryless channel Y = X * Z, where X and Z are
independent binary random variables. Let P(Z = 1) = alpha. Find the capacity
of this channel and the maximizing distribution on X.

40. In Shannon's original paper from 1948, the following discrete memoryless
channels are given. Calculate their channel capacities.

(a) Noisy typewriter: inputs and outputs {0, 1, 2, 3}; input i is received as
    i or i + 1 (mod 4), each with probability 1/2.

(b) Soft decoding: binary input, quaternary output; each row of the transition
    matrix is a permutation of (1/3, 1/3, 1/6, 1/6), and all column sums are
    equal.

(c) 3-ary channel: ternary input and output; each row of the transition matrix
    is a permutation of (1/2, 1/3, 1/6), and so is each column.

41. Consider the binary erasure channel below: binary input, ternary output
{0, e, 1}. An input is received correctly with probability 1 - p - q, erased
with probability q, and flipped with probability p. Calculate the channel
capacity.
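The three channels above are all symmetric or weakly symmetric, for which the capacity has the closed form C = log|Y| - H(row of the transition matrix). A small numerical sketch (function names are ours) under that assumption:

```python
import math

def H(dist):
    # entropy in bits of a probability vector
    return -sum(p * math.log2(p) for p in dist if p > 0)

def weakly_symmetric_capacity(row, n_outputs):
    # C = log2(|Y|) - H(row), valid when all rows are permutations of
    # each other and all column sums are equal (weakly symmetric channel)
    return math.log2(n_outputs) - H(row)

C_typewriter = weakly_symmetric_capacity([1/2, 1/2, 0, 0], 4)   # 1 bit
C_soft = weakly_symmetric_capacity([1/3, 1/3, 1/6, 1/6], 4)     # ~ 0.0817
C_3ary = weakly_symmetric_capacity([1/2, 1/3, 1/6], 3)          # ~ 0.126
```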
42. Determine the channel capacity for the following Z-channel: the input 0 is
always received correctly, while the input 1 is received as 0 or 1, each with
probability 1/2.

Hint: d/dp h(p) = log((1 - p)/p).

43. Cascade two binary symmetric channels, each with crossover probability p,
as in the following picture. Determine the channel capacity.

    X --BSC(p)--> Y --BSC(p)--> Z

44. Consider a linear encoder where three information bits u = (u1, u2, u3)
are complemented with three parity bits according to

    v1 = u1 + u2
    v2 = u1 + u3
    v3 = u2 + u3

(mod 2). Hence, an information word u = (u1, u2, u3) is encoded to the
codeword x = (u1, u2, u3, v1, v2, v3).

(a) What is the code rate R?
(b) Find a generator matrix G.
(c) What is the minimum distance, d_min, of the code?
(d) Find a parity check matrix H, such that G H^T = 0.
(e) Construct a syndrome table for decoding.
(f) Make an example where a three bit vector is encoded, transmitted over a
    channel and decoded.

45. Show that if d_min >= lambda + gamma + 1 for a linear code, it is capable
of correcting lambda errors and simultaneously detecting gamma errors, where
gamma > lambda.

46. Derive the differential entropy for the following distributions:

(a) Rectangular distribution: f(x) = 1/(b - a), a <= x <= b.
(b) Normal distribution:
    f(x) = (1/sqrt(2 pi sigma^2)) e^{-(x - mu)^2/(2 sigma^2)},
    -inf < x < inf.
(c) Exponential distribution: f(x) = lambda e^{-lambda x}, x >= 0.
(d) Laplace distribution: f(x) = (lambda/2) e^{-lambda |x|}, -inf < x < inf.
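The Z-channel and cascade problems above both reduce to one-variable calculations: with A = P(X = 1) the Z-channel's mutual information is I(A) = h(A/2) - A, and two cascaded BSC(p) act as a single BSC with crossover probability 2p(1 - p). A numerical sketch under those closed forms (names are ours):

```python
import math

def h(p):
    # binary entropy in bits
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

# Z-channel with crossover 1/2 on the input 1:
# I(A) = h(A/2) - A; the optimum input probability is A = 2/5
A = 2 / 5
C_z = h(A / 2) - A          # = log2(5) - 2 ~ 0.3219 bits

def C_cascade(p):
    # two cascaded BSC(p) behave as a single BSC(2p(1-p))
    return 1 - h(2 * p * (1 - p))
```

A grid search over A confirms that A = 2/5 is indeed the maximizer of h(A/2) - A.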
47. Let X1 and X2 be two independent, normally distributed random variables
with distributions N(mu1, sigma1^2) and N(mu2, sigma2^2), respectively.
Construct a new random variable X = X1 + X2.

(a) What is the distribution of X?
(b) Derive the differential entropy of X.

48. An additive channel, Y = X + Z, has the input alphabet
X = {-2, -1, 0, 1, 2}. The additive random variable Z is uniformly distributed
over the interval [-1, 1]. Thus, the input is a discrete random variable and
the output is a continuous random variable. Derive the capacity
C = max_{p(x)} I(X; Y).

49. The length X of a stick that is manufactured in a poorly managed company
is uniformly distributed.

(a) The length varies between 0 and 1 meters, i.e.

        f(x) = 1,  0 <= x <= 1,  and 0 otherwise.

    Derive the differential entropy H(X).

(b) The length varies between 0 and 100 cm, i.e.

        f(x) = 0.01,  0 <= x <= 100,  and 0 otherwise.

    Derive the differential entropy H(X).

50. Consider an additive channel where the output is Y = X + Z, and the noise
is normally distributed, Z ~ N(0, sigma^2). The channel has an output power
constraint E[Y^2] <= P. Derive the channel capacity for the channel.

51. Consider a channel with binary input where P(X = 0) = p and
P(X = 1) = 1 - p. During the transmission a uniformly distributed noise
parameter Z on the interval [0, a], where a > 1, is added to X, i.e.
Y = X + Z.

(a) Calculate the mutual information according to I(X; Y) = H(X) - H(X|Y).
(b) Calculate the mutual information according to I(X; Y) = H(Y) - H(Y|X).
(c) Calculate the capacity by maximizing over p.
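The differential-entropy problems above all have closed-form answers: log2(b - a) for the rectangular case, (1/2)log2(2 pi e sigma^2) for the normal, log2(e/lambda) for the exponential, and log2(2e/lambda) for the Laplace density. A sketch that collects these and cross-checks the exponential case against a midpoint-rule integration (all names are ours):

```python
import math

def h_uniform(a, b):
    # differential entropy of U(a, b): log2(b - a)
    return math.log2(b - a)

def h_normal(sigma2):
    # (1/2) log2(2*pi*e*sigma2)
    return 0.5 * math.log2(2 * math.pi * math.e * sigma2)

def h_exponential(lam):
    # log2(e/lam)
    return math.log2(math.e / lam)

def h_laplace(lam):
    # log2(2e/lam) for f(x) = (lam/2) exp(-lam*|x|)
    return math.log2(2 * math.e / lam)

# midpoint-rule cross-check of the exponential case, lam = 2
lam, dx = 2.0, 1e-3
def f(x):
    return lam * math.exp(-lam * x)
num = -sum(f((k + 0.5) * dx) * math.log2(f((k + 0.5) * dx)) * dx
           for k in range(20000))   # integrate -f log2 f over [0, 20]
```

The tail beyond x = 20 is negligible, so the numerical value agrees with log2(e/2) to several decimals.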
52. Consider four independent, parallel, time discrete, additive Gaussian
channels. The variance of the noise in the i-th channel is sigma_i^2 = i^2,
i = 1, 2, 3, 4. The total power of the used signals is limited by

    P1 + P2 + P3 + P4 <= 17.

Determine the channel capacity for this parallel combination.
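Several of the channel problems above revolve around I(X; Y) computed from a joint distribution, and Problem 37 around the fact that deterministic post-processing of the output cannot increase it. A small helper (hypothetical names) demonstrating both on a BSC(0.1) example:

```python
import math

def mutual_information(pxy):
    # I(X;Y) in bits from a joint pmf given as {(x, y): prob}
    px, py = {}, {}
    for (x, y), p in pxy.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    return sum(p * math.log2(p / (px[x] * py[y]))
               for (x, y), p in pxy.items() if p > 0)

# X uniform on {0,1} through a BSC(0.1): I(X;Y) = 1 - h(0.1) ~ 0.531 bits
pxy = {(0, 0): 0.45, (0, 1): 0.05, (1, 0): 0.05, (1, 1): 0.45}
I_xy = mutual_information(pxy)

# post-processing with the constant map g(y) = 0 destroys all information,
# consistent with the data processing inequality I(X; g(Y)) <= I(X; Y)
pxg = {(0, 0): 0.5, (1, 0): 0.5}
I_xg = mutual_information(pxg)
```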
Solutions

30. (a) Follows from Stirling's approximation.
(b) p^{-np} q^{-nq} = 2^{n log(p^{-p} q^{-q})} = 2^{n(-p log p - q log q)}
    = 2^{nh(p)}.
(c) Follows from e^{1/(12n)} <= e^{1/12} < sqrt(2), together with
    sqrt(2)/sqrt(2 pi) = 1/sqrt(pi).
(d) Follows from Stirling's approximation.
(e) With npq >= 9 we can use
    e^{-1/(12npq)} >= e^{-1/(12*9)} = e^{-1/108} > sqrt(2 pi / 8),
    which gives the factor 1/sqrt(8npq).

31. The encoding procedure can be viewed in the following table. The colon in
the B-buffer denotes the stop of the encoded letters for that codeword.

    S-buffer    B-buffer      Codeword
    [IF IF =]   [ T:HEN T]    (2,1,T)
    [ IF = T]   [H:EN THE]    (0,0,H)
    [IF = TH]   [E:N THEN]    (0,0,E)
    [F = THE]   [N: THEN ]    (0,0,N)
    [ = THEN]   [ THEN TH:]   (5,7,H)
    [THEN TH]   [EN =: EL]    (5,3,=)
    [ THEN =]   [ E:LSE E]    (2,1,E)
    [HEN = E]   [L:SE ELS]    (0,0,L)
    [EN = EL]   [S:E ELSE]    (0,0,S)
    [N = ELS]   [E :ELSE ]    (3,1, )
    [= ELSE ]   [ELSE ELS:]   (5,7,S)
    [LSE ELS]   [E =: IF ]    (5,2,=)
    [ ELSE =]   [ I:F ]       (2,1,I)
    [LSE = I]   [F:; ]        (0,0,F)
    [SE = IF]   [;: ]         (0,0,;)

There are 15 codewords. In the uncoded text there are 45 letters, which
corresponds to 360 bits. In the coded sequence we first have the buffer of 7
letters, which gives 56 bits. Then, each codeword requires 3 + 3 + 8 = 14
bits. With 15 codewords we get 7*8 + 15(3 + 3 + 8) = 266 bits. The compression
rate becomes R = 266/360 ~ 0.7389.

32. The encoding procedure can be viewed in the following table. The colon in
the binary representation of the codeword shows where the index stops and the
character code begins. This separator is not necessary in the final code
string.
    Index  Codeword  Dictionary  Binary
    1      (0,I)     [I]         :01001001
    2      (0,F)     [F]         0:01000110
    3      (0, )     [ ]         00:00100000
    4      (1,F)     [IF]        01:01000110
    5      (3,=)     [ =]        011:00111101
    6      (3,T)     [ T]        011:01010100
    7      (0,H)     [H]         000:01001000
    8      (0,E)     [E]         000:01000101
    9      (0,N)     [N]         0000:01001110
    10     (6,H)     [ TH]       0110:01001000
    11     (8,N)     [EN]        1000:01001110
    12     (10,E)    [ THE]      1010:01000101
    13     (9, )     [N ]        1001:00100000
    14     (0,=)     [=]         0000:00111101
    15     (3,E)     [ E]        0011:01000101
    16     (0,L)     [L]         0000:01001100
    17     (0,S)     [S]         00000:01010011
    18     (8, )     [E ]        01000:00100000
    19     (8,L)     [EL]        01000:01001100
    20     (17,E)    [SE]        10001:01000101
    21     (15,L)    [ EL]       01111:01001100
    22     (20, )    [SE ]       10100:00100000
    23     (14, )    [= ]        01110:00100000
    24     (4,;)     [IF;]       00100:00111011

In the uncoded text there are 45 letters, which corresponds to 360 bits. In
the coded sequence there are in total 0 + 1 + 2*2 + 4*3 + 8*4 + 8*5 = 89 bits
for the indexes and 24*8 = 192 bits for the characters of the codewords. In
total the code sequence is 89 + 192 = 281 bits. The compression rate becomes
R = 281/360 ~ 0.7806.

33. (a)

    S-buffer        B-buffer    Codeword
    [Nat the ba]    [t s:]      (8,2,s)
    [ the bat s]    [w:at]      (0,0,w)
    [the bat sw]    [at a:]     (5,3,a)
    [bat swat a]    [t M:]      (3,2,M)
    [ swat at M]    [att:]      (4,2,t)
    [at at Matt]    [ t:h]      (5,1,t)
    [ at Matt t]    [h:e ]      (0,0,h)
    [at Matt th]    [e: g]      (0,0,e)
    [t Matt the]    [ g:n]      (4,1,g)
    [Matt the g]    [n:at]      (0,0,n)
    [att the gn]    [at:]       (10,1,t)

Text: 264 bits, Code: 234 bits, Rate: 0.886364
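The hand-encodings in the LZ tables above can be imitated in code. The sketch below is a generic variant — tie-breaking, window preloading, and pointer conventions differ between textbooks, so the emitted codewords need not match the tables symbol for symbol — but encoding followed by decoding must always reproduce the text:

```python
def lz77_encode(text, S=7, B=7):
    # (offset, length, next symbol) triples; offsets count backwards.
    # Sketch only: no preloaded window, first-best tie-breaking.
    i, out = 0, []
    while i < len(text):
        best_len, best_off = 0, 0
        for j in range(max(0, i - S), i):
            L = 0
            # a match may run into the lookahead region (overlap is allowed)
            while L < B - 1 and i + L < len(text) - 1 and text[j + L] == text[i + L]:
                L += 1
            if L > best_len:
                best_len, best_off = L, i - j
        out.append((best_off, best_len, text[i + best_len]))
        i += best_len + 1
    return out

def lz77_decode(triples):
    s = []
    for off, length, ch in triples:
        for _ in range(length):
            s.append(s[-off])   # copies may overlap the region being written
        s.append(ch)
    return "".join(s)

def lz78_encode(text):
    # (index, next symbol) pairs; index 0 denotes the empty phrase
    dictionary, out, w = {"": 0}, [], ""
    for ch in text:
        if w + ch in dictionary:
            w += ch
        else:
            out.append((dictionary[w], ch))
            dictionary[w + ch] = len(dictionary)
            w = ""
    if w:   # flush a pending phrase that is already in the dictionary
        out.append((dictionary[w[:-1]], w[-1]))
    return out

def lz78_decode(pairs):
    entries, out = [""], []
    for idx, ch in pairs:
        phrase = entries[idx] + ch
        entries.append(phrase)
        out.append(phrase)
    return "".join(out)
```

On the IF/THEN/ELSE text the LZ78 encoder produces 24 pairs, in agreement with the dictionary size worked out by hand.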
(b)

    S-buffer        B-buffer    Codeword
    [Nat the ba]    [t :s]      (1,8,2)
    [t the bat ]    [s:wa]      (0,s)
    [ the bat s]    [w:at]      (0,w)
    [the bat sw]    [at :]      (1,5,3)
    [ bat swat ]    [at :]      (1,3,3)
    [t swat at ]    [M:at]      (0,M)
    [ swat at M]    [at:t]      (1,4,2)
    [wat at Mat]    [t :t]      (1,5,2)
    [t at Matt ]    [t:he]      (1,10,1)
    [ at Matt t]    [h:e ]      (0,h)
    [at Matt th]    [e: g]      (0,e)
    [t Matt the]    [ :gn]      (1,4,1)
    [ Matt the ]    [g:na]      (0,g)
    [Matt the g]    [n:at]      (0,n)
    [att the gn]    [at:]       (1,10,2)

Text: 264 bits, Code: 199 bits, Rate: 0.7538

(c)

    Index  Codeword  Dictionary
    1      (0,N)     [N]
    2      (0,a)     [a]
    3      (0,t)     [t]
    4      (0, )     [ ]
    5      (3,h)     [th]
    6      (0,e)     [e]
    7      (4,b)     [ b]
    8      (2,t)     [at]
    9      (4,s)     [ s]
    10     (0,w)     [w]
    11     (8, )     [at ]
    12     (11,M)    [at M]
    13     (8,t)     [att]
    14     (4,t)     [ t]
    15     (0,h)     [h]
    16     (6, )     [e ]
    17     (0,g)     [g]
    18     (0,n)     [n]
    19     (2,t)     --

Text: 264 bits, Code: 216 bits, Rate: 0.818

(d)
    Index  Output code  Dictionary
    32                  [ ]
    77                  [M]
    78                  [N]
    97                  [a]
    98                  [b]
    101                 [e]
    103                 [g]
    104                 [h]
    110                 [n]
    115                 [s]
    116                 [t]
    119                 [w]
    256    78           [Na]
    257    97           [at]
    258    116          [t ]
    259    32           [ t]
    260    116          [th]
    261    104          [he]
    262    101          [e ]
    263    32           [ b]
    264    98           [ba]
    265    257          [at ]
    266    32           [ s]
    267    115          [sw]
    268    119          [wa]
    269    265          [at a]
    270    265          [at M]
    271    77           [Ma]
    272    257          [att]
    273    258          [t t]
    274    260          [the]
    275    262          [e g]
    276    103          [gn]
    277    110          [na]
    278    257          --

The middle column is the code emitted when the dictionary entry in that row
was created; the final row emits the last code without creating a new entry.
There are 23 output codes; the first fits in 8 bits, and the remaining 22 need
9 bits once the dictionary exceeds 256 entries, giving 8 + 22*9 = 206 bits.

Text: 264 bits, Code: 206 bits, Rate: 0.7803

34.

35. Encoding
    Step  Codeword  Dictionary  Binary
    1     (0,T)     [T]         :01010100
    2     (0,H)     [H]         0:01001000
    3     (0,E)     [E]         00:01000101
    4     (0, )     [ ]         00:00100000
    5     (0,F)     [F]         000:01000110
    6     (0,R)     [R]         000:01010010
    7     (0,I)     [I]         000:01001001
    8     (3,N)     [EN]        011:01001110
    9     (0,D)     [D]         0000:01000100
    10    (4,I)     [ I]        0100:01001001
    11    (0,N)     [N]         0000:01001110
    12    (4,N)     [ N]        0100:01001110
    13    (3,E)     [EE]        0011:01000101
    14    (9, )     [D ]        1001:00100000
    15    (7,S)     [IS]        0111:01010011
    16    (4,T)     [ T]        0100:01010100
    17    (2,E)     [HE]        00010:01000101
    18    (4,F)     [ F]        00100:01000110
    19    (6,I)     [RI]        00110:01001001
    20    (8,D)     [END]       01000:01000100
    21    (10,N)    [ IN]       01010:01001110
    22    (9,E)     [DE]        01001:01000101
    23    (3,D)     [ED]        00011:01000100

The length of the code sequence is 84 + 23*8 = 268 bits. Assume that the
source alphabet is ASCII; then the source sequence is of length 39*8 = 312
bits. There are only ten different symbols in the sequence, so we could
instead use the alphabet {T, H, E, ' ', F, R, I, N, D, S} with four bits per
letter. In that case we get 39*4 = 156 bits as the source sequence.

36. The definition of jointly typical sequences can be rewritten as

    2^{-n(H(X,Y) + eps)}  <=  p(x, y)  <=  2^{-n(H(X,Y) - eps)}

and

    2^{-n(H(Y) + eps)}  <=  p(y)  <=  2^{-n(H(Y) - eps)}

Dividing these and using the chain rule concludes the proof.

37. (a) According to the data processing inequality we have that
I(X; Y) >= I(X; Y~), where X -> Y -> Y~ forms a Markov chain. Now if p~(x)
maximizes I(X; Y~) we have that

    C = max_{p(x)} I(X; Y) >= I(X; Y)|_{p(x) = p~(x)}
      >= I(X; Y~)|_{p(x) = p~(x)} = max_{p(x)} I(X; Y~) = C~
(b) The capacity is not decreased only if we have equality in the data
processing inequality, that is, when X -> Y~ -> Y forms a Markov chain.

38. (a) Since X and Z are independent,
H(Y|X) = H(X + Z|X) = H(Z|X) = H(Z) = log 3. The capacity becomes

    C = max_{p(x)} I(X; Y) = max_{p(x)} H(Y) - log 3 = log 11 - log 3
      = log(11/3)

(b) This is achieved for uniform Y, which by symmetry is achieved for uniform
X, i.e. p(x_i) = 1/11.

39. Assume that P(X = 1) = p and P(X = 0) = 1 - p. Then

    P(Y = 1) = P(X = 1) P(Z = 1) = alpha p
    P(Y = 0) = 1 - alpha p

Then

    I(X; Y) = H(Y) - H(Y|X) = h(alpha p) - ((1 - p) h(0) + p h(alpha))
            = h(alpha p) - p h(alpha)

Differentiating with respect to p gives the maximizing
p* = 1/(alpha (2^{h(alpha)/alpha} + 1)). The capacity is

    C = h(alpha p*) - p* h(alpha)
      = log(2^{h(alpha)/alpha} + 1) - h(alpha)/alpha

40. (a) C = log 4 - h(1/2) = 2 - 1 = 1
(b) C = log 4 - H(1/3, 1/3, 1/6, 1/6) ~ 0.0817
(c) C = log 3 - H(1/2, 1/3, 1/6) ~ 0.126

41. By assuming that P(X = 0) = pi and P(X = 1) = 1 - pi we get the following:

    H(Y) = H(pi(1 - p - q) + (1 - pi)p, q, (1 - pi)(1 - p - q) + pi p)
         = H(pi - p pi - q pi + p, q, 1 - p - q - pi + p pi + q pi)
         = h(q) + (1 - q) H((pi - p pi - q pi + p)/(1 - q),
                            (1 - p - q - pi + p pi + q pi)/(1 - q))
        <= h(q) + (1 - q)

with equality if pi = 1/2, where H(1/2, 1/2) = 1. Hence

    C = max_{p(x)} I(X; Y) = max_{p(x)} (H(Y) - H(Y|X))
      = h(q) + (1 - q) - H(p, q, 1 - p - q)
      = (1 - q) (1 - H(p/(1 - q), (1 - p - q)/(1 - q)))

42. Assume that P(X = 0) = 1 - A and P(X = 1) = A. Then

    H(Y) = H((1 - A) + A/2, A/2) = H(1 - A/2, A/2) = h(A/2)
    H(Y|X) = P(X = 0) H(Y|X = 0) + P(X = 1) H(Y|X = 1) = A h(1/2) = A
and we conclude

    C = max_{p(x)} {h(A/2) - A}

Differentiation with respect to A gives the optimal A* = 2/5, and

    C = h(A*/2) - A* = h(1/5) - 2/5 ~ 0.32

43. By cascading two BSCs we get the following probabilities:

    P(Z = 0|X = 0) = (1 - p)^2 + p^2
    P(Z = 1|X = 0) = p(1 - p) + (1 - p)p = 2p(1 - p)
    P(Z = 0|X = 1) = 2p(1 - p)
    P(Z = 1|X = 1) = (1 - p)^2 + p^2

This channel can be seen as a new BSC with crossover probability
eps = 2p(1 - p). The capacity for this channel becomes

    C = 1 - h(eps) = 1 - h(2p(1 - p)).

44. (a) R = 3/6 = 1/2

(b) Find the codewords for u = (100), u = (010) and u = (001) and form the
generator matrix

        [ 1 0 0 1 1 0 ]
    G = [ 0 1 0 1 0 1 ]
        [ 0 0 1 0 1 1 ]

(c) List all codewords:

    u     x          u     x
    000   000000     100   100110
    001   001011     101   101101
    010   010101     110   110011
    011   011110     111   111000

Then we get d_min = min_{x != 0} {w_H(x)} = 3.

(d) From part (b) we note that G = (I P). Since

    G H^T = (I P) (P^T I)^T = P + P = 0

we get

                  [ 1 1 0 1 0 0 ]
    H = (P^T I) = [ 1 0 1 0 1 0 ]
                  [ 0 1 1 0 0 1 ]
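The generator/parity-check construction for the (6,3) code can be verified mechanically. The sketch below assumes the parity rules v1 = u1 + u2, v2 = u1 + u3, v3 = u2 + u3 (one consistent choice) and also builds a single-error syndrome decoder:

```python
import itertools

# Assumed parity pairing: row i lists which parity bits u_{i+1} feeds
P = [(1, 1, 0),
     (1, 0, 1),
     (0, 1, 1)]

def encode(u):
    v = [(u[0] * P[0][j] + u[1] * P[1][j] + u[2] * P[2][j]) % 2 for j in range(3)]
    return list(u) + v

# H = (P^T | I3), so every codeword x satisfies x H^T = 0
H = [[P[0][i], P[1][i], P[2][i]] + [int(i == k) for k in range(3)]
     for i in range(3)]

def syndrome(y):
    return tuple(sum(y[j] * H[i][j] for j in range(6)) % 2 for i in range(3))

# syndrome table: map each single-bit error pattern to the flipped position
table = {syndrome([int(j == k) for j in range(6)]): k for k in range(6)}

def decode(y):
    s = syndrome(y)
    if s == (0, 0, 0):
        return list(y[:3])
    y = list(y)
    y[table[s]] ^= 1          # flip the bit the syndrome points at
    return y[:3]

codewords = [encode(u) for u in itertools.product((0, 1), repeat=3)]
d_min = min(sum(c) for c in codewords if any(c))   # minimum weight
```

With distinct nonzero parity columns the minimum weight is 3, so every single-bit error has a unique syndrome and is corrected.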
(e) List the most probable error patterns and their syndromes s = e H^T:

    e        s
    000000   000
    100000   110
    010000   101
    001000   011
    000100   100
    000010   010
    000001   001
    100001   111

where the last row is one of the weight-two vectors that gives the syndrome
(111).

(f) One (correctable) error:

    u = (101)
    x = (101101)
    e = (010000)
    y = x + e = (111101)
    s = y H^T = (101)
    e^ = (010000)
    x^ = y + e^ = (101101)
    u^ = (101) = u

An uncorrectable error:

    u = (101)
    x = (101101)
    e = (010010)
    y = x + e = (111111)
    s = y H^T = (111)
    e^ = (100001)
    x^ = y + e^ = (011110)
    u^ = (011) != u

45. Consider the graphical interpretation of F^n and two codewords x_i and x_j
at distance d_min. (Figure: a sphere of radius lambda around each codeword,
with a margin of gamma + 1 between one codeword and the sphere around the
other.)

A received word that is at Hamming distance at most lambda from a codeword is
corrected to that codeword. This is indicated by a sphere with radius lambda
around each codeword. Received words that lie outside every sphere are
detected to be erroneous. For gamma errors to be detected, the distance from
one codeword to the sphere around another codeword must be at least gamma + 1,
so the minimal distance between two codewords must be at least
gamma + 1 + lambda. Hence, d_min >= lambda + gamma + 1.

46. According to the definition of differential entropy,
H(X) = -int f(x) log f(x) dx, we get:
(a) H(X) = -int_a^b (1/(b - a)) log(1/(b - a)) dx = log(b - a)

(b) For the normal distribution,

    H(X) = -int f(x) [ -log sqrt(2 pi sigma^2)
                       - ((x - mu)^2 / (2 sigma^2)) log e ] dx
         = (1/2) log(2 pi sigma^2) + (log e / (2 sigma^2)) E[(X - mu)^2]
         = (1/2) log(2 pi sigma^2) + (1/2) log e
         = (1/2) log(2 pi e sigma^2)

(c) H(X) = -int_0^inf lambda e^{-lambda x} log(lambda e^{-lambda x}) dx
         = -int_0^inf lambda e^{-lambda x} (log lambda - lambda x log e) dx
         = -log lambda + lambda log e * E[X]
         = -log lambda + log e = log(e/lambda)

    since E[X] = 1/lambda.

(d) H(X) = -int (lambda/2) e^{-lambda |x|} log((lambda/2) e^{-lambda |x|}) dx
         = -log(lambda/2) + lambda log e * E[|X|]
         = log(2/lambda) + log e = log(2e/lambda)

    since E[|X|] = 1/lambda.

47. (a) The sum of two independent normal random variables is normally
distributed with distribution N(mu1 + mu2, sigma1^2 + sigma2^2).
(b) According to Problem 46(b) the differential entropy becomes
(1/2) log(2 pi e (sigma1^2 + sigma2^2)).

48. The mutual information is

    I(X; Y) = H(Y) - H(Y|X) = H(Y) - H(X + Z|X) = H(Y) - H(Z)

where H(Z) = log(1 - (-1)) = log 2 = 1. The output Y ranges from -3 to 3, with
density p_{-2}/2 for -3 <= y <= -2, (p_{-2} + p_{-1})/2 for -2 <= y <= -1, and
so on. The maximum of H(Y) is therefore obtained for a uniform Y. This can be
achieved if the distribution of X is (1/3, 0, 1/3, 0, 1/3). Then
H(Y) = log(3 - (-3)) = log 6, and we conclude that

    C = log 6 - log 2 = log 3

49. The differential entropy for a uniformly distributed variable between a
and b is H(X) = log(b - a).
(a) H(X) = log(1 - 0) = log 1 = 0
(b) H(X) = log(100 - 0) = log 100 ~ 6.644

50. The capacity of this additive white Gaussian noise channel with the output
power constraint E[Y^2] <= P is

    C = max_{f(x): E[Y^2] <= P} I(X; Y)
      = max_{f(x): E[Y^2] <= P} (H(Y) - H(Y|X))
      = max_{f(x): E[Y^2] <= P} (H(Y) - H(Z))

Here the maximum differential entropy of Y is achieved by a normal
distribution, and the power constraint on Y is satisfied if we choose the
distribution of X as N(0, P - sigma^2). The capacity is

    C = (1/2) log(2 pi e (P - sigma^2 + sigma^2)) - (1/2) log(2 pi e sigma^2)
      = (1/2) log(2 pi e P) - (1/2) log(2 pi e sigma^2)
      = (1/2) log(P / sigma^2)

51. From the problem we have that P(X = 0) = p, P(X = 1) = 1 - p and that
f(z) = 1/a, 0 <= z <= a, where a > 1. This gives the conditional densities for
y

    f(y|X = 0) = 1/a,  0 <= y <= a
    f(y|X = 1) = 1/a,  1 <= y <= a + 1

which gives the density of Y as

           { p/a,        0 <= y <= 1
    f(y) = { 1/a,        1 <= y <= a
           { (1 - p)/a,  a <= y <= a + 1

(a) H(X) = h(p), and

    H(X|Y) = H(X | 0 <= y <= 1) P(0 <= y <= 1)            [= 0]
           + H(X | 1 <= y <= a) P(1 <= y <= a)            [= h(p) (a - 1)/a]
           + H(X | a <= y <= a + 1) P(a <= y <= a + 1)    [= 0]
           = h(p) (a - 1)/a

    I(X; Y) = H(X) - H(X|Y) = h(p)/a
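The two capacity expressions just derived — C = (1/2) log2(P/sigma^2) for the output-power-constrained Gaussian channel, and I(X; Y) = h(p)/a for the binary input with uniform noise — can be wrapped in small helpers (names are ours):

```python
import math

def awgn_capacity_output_power(P, sigma2):
    # C = (1/2) log2(P/sigma2) under the output power constraint E[Y^2] <= P
    return 0.5 * math.log2(P / sigma2)

def I_binary_uniform_noise(p, a):
    # I(X;Y) = h(p)/a for binary input plus Z ~ U[0, a], a > 1;
    # maximizing over p gives C = 1/a at p = 1/2
    hp = 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)
    return hp / a
```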
(b)

    H(Y) = -int_0^1 (p/a) log(p/a) dy - int_1^a (1/a) log(1/a) dy
           - int_a^{a+1} ((1 - p)/a) log((1 - p)/a) dy
         = -(p/a) log(p/a) + ((a - 1)/a) log a - ((1 - p)/a) log((1 - p)/a)
         = h(p)/a + log a

    H(Y|X) = sum_x H(Y|X = x) P(X = x) = log a

    I(X; Y) = H(Y) - H(Y|X) = h(p)/a

(c) C = max_p I(X; Y) = 1/a, attained for p = 1/2.

52. We can use the total power P1 + P2 + P3 + P4 = 17, and for the four
channels the noise powers are N1 = 1, N2 = 4, N3 = 9, N4 = 16. Let
B = P_i + N_i for the used channels. Since
(16 - 1) + (16 - 4) + (16 - 9) > 17, we should not use channel four when
reaching capacity. Similarly, since (9 - 1) + (9 - 4) < 17, we should use the
remaining three channels. These tests are marked as dashed lines in the figure
below. Hence,

    B = P1 + 1 = P2 + 4 = P3 + 9

which leads to

    B = (1/3)(P1 + P2 + P3 + 14) = (1/3)(17 + 14) = 31/3

The capacity becomes

    C = sum_{i=1}^{3} (1/2) log(1 + P_i/N_i) = sum_{i=1}^{3} (1/2) log(B/N_i)
      = (1/2) log(31/3) + (1/2) log(31/12) + (1/2) log(31/27) ~ 2.4689

(Figure: water-filling diagram with noise levels N1 = 1, N2 = 4, N3 = 9,
N4 = 16 and water level B = 31/3.)
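The water-filling computation above generalizes to any set of noise levels. A sketch (function name is ours) that reproduces B = 31/3 and C ~ 2.4689 for N = (1, 4, 9, 16) and total power 17:

```python
import math

def water_fill(noises, total_power):
    # iteratively drop channels whose noise level exceeds the water level B
    active = sorted(noises)
    while True:
        B = (total_power + sum(active)) / len(active)
        if B >= active[-1]:
            break
        active.pop()          # the noisiest remaining channel gets no power
    powers = [max(0.0, B - N) for N in noises]
    C = sum(0.5 * math.log2(B / N) for N in noises if B > N)
    return B, powers, C

B, powers, C = water_fill([1, 4, 9, 16], 17.0)
```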