Solutions to Homework Set #1: Sanov's Theorem, Rate Distortion

1. Sanov's theorem: Prove the simple version of Sanov's theorem for binary random variables, i.e., let $X_1, X_2, \ldots, X_n$ be a sequence of binary random variables, drawn i.i.d. according to the distribution

$\Pr(X = 1) = q, \qquad \Pr(X = 0) = 1 - q.$  (1)

Let the proportion of 1's in the sequence $X_1, X_2, \ldots, X_n$ be

$p_{X^n} = \frac{1}{n} \sum_{i=1}^{n} X_i.$  (2)

By the law of large numbers, we would expect $p_{X^n}$ to be close to $q$ for large $n$. Sanov's theorem deals with the probability that $p_{X^n}$ is far away from $q$. In particular, for concreteness, if we take $p > q > \tfrac{1}{2}$, Sanov's theorem states that

$-\frac{1}{n} \log \Pr\{(X_1, X_2, \ldots, X_n) : p_{X^n} \ge p\} \to p \log \frac{p}{q} + (1-p) \log \frac{1-p}{1-q} = D((p, 1-p) \,\|\, (q, 1-q)).$  (3)

Justify the following steps:

$\Pr\{(X_1, X_2, \ldots, X_n) : p_{X^n} \ge p\} \le \sum_{i=\lceil np \rceil}^{n} \binom{n}{i} q^i (1-q)^{n-i}.$  (4)

Argue that the term corresponding to $i = \lceil np \rceil$ is the largest term in the sum on the right-hand side of the last equation. Show that this term is approximately $2^{-nD}$. Prove an upper bound on the probability in Sanov's theorem using the above steps. Use similar arguments to prove a lower bound and complete the proof of Sanov's theorem.

Solution: Sanov's theorem. Since $n\bar{X}_n$ has a binomial distribution, we have

$\Pr(n\bar{X}_n = i) = \binom{n}{i} q^i (1-q)^{n-i}$  (5)

and therefore

$\Pr\{(X_1, \ldots, X_n) : p_{X^n} \ge p\} = \sum_{i=\lceil np \rceil}^{n} \binom{n}{i} q^i (1-q)^{n-i}.$  (6)

The ratio of consecutive terms in the sum is

$\frac{\Pr(n\bar{X}_n = i+1)}{\Pr(n\bar{X}_n = i)} = \frac{\binom{n}{i+1} q^{i+1} (1-q)^{n-i-1}}{\binom{n}{i} q^i (1-q)^{n-i}} = \frac{n-i}{i+1} \cdot \frac{q}{1-q}.$  (7)

This ratio is less than 1 if $(n-i)q < (i+1)(1-q)$, i.e., if $i > nq - (1-q)$. Thus the maximum of the terms occurs at $i = \lceil np \rceil$, since $p > q$. From Example 11.1.3 (the method of types),

$\binom{n}{\lceil np \rceil} \doteq 2^{nH(p)},$  (8)

and hence the largest term in the sum is

$\binom{n}{\lceil np \rceil} q^{\lceil np \rceil} (1-q)^{n-\lceil np \rceil} \doteq 2^{n(-p\log p - (1-p)\log(1-p)) + np\log q + n(1-p)\log(1-q)} = 2^{-nD(p\|q)}.$  (9)

From the above results, it follows that

$\Pr\{(X_1, \ldots, X_n) : p_{X^n} \ge p\} = \sum_{i=\lceil np \rceil}^{n} \binom{n}{i} q^i (1-q)^{n-i}$  (10)
$\le (n - \lceil np \rceil + 1) \binom{n}{\lceil np \rceil} q^{\lceil np \rceil} (1-q)^{n-\lceil np \rceil}$  (11)
$\le (n(1-p) + 1)\, 2^{-nD(p\|q)},$  (12)

where the second inequality follows from the fact that the sum is less than the largest term times the number of terms. Taking the logarithm, dividing by $n$, and taking the limit as $n \to \infty$, we obtain

$\lim_{n\to\infty} \frac{1}{n} \log \Pr\{(X_1, \ldots, X_n) : p_{X^n} \ge p\} \le -D(p\|q).$  (13)

Similarly, using the fact that the sum of the terms is larger than the largest term, we obtain

$\Pr\{(X_1, \ldots, X_n) : p_{X^n} \ge p\} = \sum_{i=\lceil np \rceil}^{n} \binom{n}{i} q^i (1-q)^{n-i}$  (14)
$\ge \binom{n}{\lceil np \rceil} q^{\lceil np \rceil} (1-q)^{n-\lceil np \rceil}$  (15)
$\doteq 2^{-nD(p\|q)}$  (16)

and

$\lim_{n\to\infty} \frac{1}{n} \log \Pr\{(X_1, \ldots, X_n) : p_{X^n} \ge p\} \ge -D(p\|q).$  (17)

Combining these two results, we obtain the special case of Sanov's theorem

$\lim_{n\to\infty} -\frac{1}{n} \log \Pr\{(X_1, \ldots, X_n) : p_{X^n} \ge p\} = D(p\|q).$  (18)
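The exponential rate in (18) can be checked numerically. The following Python sketch (not part of the original solution; the choice $p = 0.8$, $q = 0.6$ and the function names are illustrative) computes the exact binomial tail probability and compares $-\frac{1}{n}\log_2 \Pr\{p_{X^n} \ge p\}$ with $D(p\|q)$ as $n$ grows.

```python
# Numerical check of the binary Sanov exponent (illustrative sketch): for X_i ~ Bern(q)
# i.i.d., compare -(1/n) log2 Pr{p_{X^n} >= p} with D(p||q).
import math

def kl_bern(p: float, q: float) -> float:
    """Binary divergence D((p,1-p) || (q,1-q)) in bits."""
    return p * math.log2(p / q) + (1 - p) * math.log2((1 - p) / (1 - q))

def tail_exponent(n: int, p: float, q: float) -> float:
    """Exact -(1/n) log2 Pr{ sum_i X_i >= ceil(n p) } for X_i ~ Bern(q)."""
    lo = math.ceil(n * p)
    # log of each binomial pmf term, combined with log-sum-exp to avoid underflow
    logs = [math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
            + i * math.log(q) + (n - i) * math.log(1 - q)
            for i in range(lo, n + 1)]
    m = max(logs)
    log_tail = m + math.log(sum(math.exp(v - m) for v in logs))
    return -log_tail / (n * math.log(2))

p, q = 0.8, 0.6
for n in (100, 1_000, 10_000):
    print(f"n={n:6d}: exponent = {tail_exponent(n, p, q):.4f}, D(p||q) = {kl_bern(p, q):.4f}")
```

The convergence is slow because of the polynomial factor $n(1-p)+1$ in (12), but the trend toward $D(p\|q)$ is clearly visible.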

2. Strong typicality. Let $X^n$ be drawn i.i.d. $\sim P(x)$. Prove that for each $x^n \in T_\delta(X)$,

$2^{-n(H(X) + \delta')} \le P^n(x^n) \le 2^{-n(H(X) - \delta')}$

for some $\delta' = \delta'(\delta)$ that vanishes as $\delta \to 0$.

Solution: Strong typicality. From our familiar trick, we have

$P^n(x^n) = 2^{-n \left( \sum_{a \in \mathcal{X}} P_{x^n}(a) \log \frac{1}{P(a)} \right)}.$

But, since $x^n \in T_\delta(X)$, we have $|P_{x^n}(a) - P(a)| < \delta$ for all $a$. Therefore,

$H(X) - \delta \sum_{a \in \mathcal{X}} \log \frac{1}{P(a)} < \sum_{a \in \mathcal{X}} P_{x^n}(a) \log \frac{1}{P(a)} < H(X) + \delta \sum_{a \in \mathcal{X}} \log \frac{1}{P(a)}.$

Hence,

$2^{-n(H(X) + \delta')} \le P^n(x^n) \le 2^{-n(H(X) - \delta')},$

where $\delta' = \delta \sum_{a \in \mathcal{X}} \log \frac{1}{P(a)}$.
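The two-sided bound can be sanity-checked by simulation. The sketch below (an illustration, not part of the original solution; the distribution $P$ and the parameters $n$, $\delta$ are arbitrary choices) draws sequences from $P$, keeps those that land in $T_\delta(X)$, and verifies that $-\frac{1}{n}\log_2 P^n(x^n)$ stays within $\delta' = \delta \sum_a \log_2 \frac{1}{P(a)}$ of $H(X)$.

```python
# Empirical check of the strong-typicality probability bound (illustrative sketch).
import math
import random
from collections import Counter

P = {"a": 0.5, "b": 0.3, "c": 0.2}          # assumed example distribution
n, delta = 2000, 0.02
H = sum(p * math.log2(1 / p) for p in P.values())
delta_prime = delta * sum(math.log2(1 / p) for p in P.values())

random.seed(0)
symbols, weights = zip(*P.items())
for _ in range(5):
    xn = random.choices(symbols, weights=weights, k=n)
    emp = Counter(xn)
    if all(abs(emp[a] / n - P[a]) < delta for a in P):          # x^n in T_delta(X)?
        neg_log_prob = -sum(math.log2(P[a]) for a in xn) / n    # -(1/n) log2 P^n(x^n)
        assert H - delta_prime <= neg_log_prob <= H + delta_prime
        print(f"in T_delta: -(1/n)log2 P^n = {neg_log_prob:.4f}, H = {H:.4f} +/- {delta_prime:.4f}")
```

The assertion can never fail for a sequence inside $T_\delta(X)$; it simply restates the inequality proved above.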

3. Weak typicality vs. strong typicality. In this problem, we compare the weakly typical set $A_\epsilon(P)$ and the strongly typical set $T_\delta(P)$. To recall, the definitions of the two sets are as follows:

$A_\epsilon(P) = \left\{ x^n \in \mathcal{X}^n : \left| -\frac{1}{n} \log P^n(x^n) - H(P) \right| \le \epsilon \right\}$

$T_\delta(P) = \left\{ x^n \in \mathcal{X}^n : \max_{a \in \mathcal{X}} |P_{x^n}(a) - P(a)| \le \delta \right\}$

(a) Suppose $P$ is such that $P(a) > 0$ for all $a \in \mathcal{X}$. Then there is an inclusion relationship between the weakly typical set $A_\epsilon(P)$ and the strongly typical set $T_\delta(P)$ for an appropriate choice of $\epsilon$. Which statement is true: $A_\epsilon(P) \subseteq T_\delta(P)$ or $A_\epsilon(P) \supseteq T_\delta(P)$? What is the appropriate relation between $\delta$ and $\epsilon$?

(b) Give a description of the sequences that belong to $A_\epsilon(P)$, vs. the sequences that belong to $T_\delta(P)$, when the source is uniformly distributed, i.e., $P(a) = \frac{1}{|\mathcal{X}|}$ for all $a \in \mathcal{X}$. (Assume $\delta$ is sufficiently small.)

(c) Can you explain why $T_\delta(P)$ is called the strongly typical set and $A_\epsilon(P)$ the weakly typical set?

Solution: Weak typicality vs. strong typicality.

(a) From Problem 2, we can see that if $x^n \in T_\delta(P)$, then

$\left| -\frac{1}{n} \log P^n(x^n) - H(P) \right| \le \delta \sum_{a \in \mathcal{X}} \log \frac{1}{P(a)}.$

Therefore, if $\epsilon > \delta \sum_{a \in \mathcal{X}} \log \frac{1}{P(a)}$, then $A_\epsilon(P) \supseteq T_\delta(P)$. To show that the reverse inclusion does not hold, see the next part.

(b) When $P$ is the uniform distribution, $A_\epsilon(P)$ includes every possible sequence $x^n$. To see this, note that $P^n(x^n) = (1/|\mathcal{X}|)^n$, and thus $-\frac{1}{n} \log P^n(x^n) = \log |\mathcal{X}| = H(P)$. Thus $x^n \in A_\epsilon(P)$ for every $x^n$ and every $\epsilon > 0$. However, among the sequences in $A_\epsilon(P)$, only those whose type satisfies $\frac{1}{|\mathcal{X}|} - \delta \le P_{x^n}(a) \le \frac{1}{|\mathcal{X}|} + \delta$ for each letter $a \in \mathcal{X}$ are in $T_\delta(P)$. Therefore, for sufficiently small $\delta$, there always exist sequences that are in $A_\epsilon(P)$ but not in $T_\delta(P)$.

(c) From the above, the strongly typical set is contained in the weakly typical set for an appropriate choice of $\delta$ and $\epsilon$. Thus the condition defining the strongly typical set is stronger than the condition defining the weakly typical set.
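For a small alphabet and a short block length, the inclusion in part (a) can be verified exhaustively. The sketch below (an illustration, not part of the original solution; the distribution and parameters are arbitrary) enumerates all sequences of length $n$ and checks that every strongly typical sequence is also weakly typical once $\epsilon > \delta \sum_a \log_2 \frac{1}{P(a)}$.

```python
# Exhaustive check that T_delta(P) is contained in A_epsilon(P)
# when epsilon > delta * sum_a log2(1/P(a)).
import math
from itertools import product
from collections import Counter

P = {"a": 0.6, "b": 0.3, "c": 0.1}       # assumed example distribution
n, delta = 8, 0.1
H = sum(p * math.log2(1 / p) for p in P.values())
epsilon = delta * sum(math.log2(1 / p) for p in P.values()) + 1e-9

weak = strong = 0
for xn in product(P, repeat=n):          # all |X|^n sequences
    emp = Counter(xn)
    neg_log_prob = -sum(math.log2(P[a]) for a in xn) / n
    in_weak = abs(neg_log_prob - H) <= epsilon
    in_strong = all(abs(emp[a] / n - P[a]) <= delta for a in P)
    weak += in_weak
    strong += in_strong
    assert not in_strong or in_weak      # strong typicality implies weak typicality

print(f"|A_eps| = {weak}, |T_delta| = {strong}, every strongly typical sequence is weakly typical")
```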

4. Rate distortion for a uniform source with Hamming distortion. Consider a source $X$ uniformly distributed on the set $\{1, 2, \ldots, m\}$. Find the rate distortion function for this source with Hamming distortion, i.e., $d(x, \hat{x}) = 0$ if $x = \hat{x}$, and $d(x, \hat{x}) = 1$ if $x \ne \hat{x}$.

Solution: Rate distortion for a uniform source with Hamming distortion. $X$ is uniformly distributed on the set $\{1, 2, \ldots, m\}$, and the distortion measure is $d(x, \hat{x}) = 0$ if $x = \hat{x}$, $d(x, \hat{x}) = 1$ if $x \ne \hat{x}$. Consider any joint distribution that satisfies the distortion constraint $D$. Since $D = \Pr(X \ne \hat{X})$, we have by Fano's inequality

$H(X | \hat{X}) \le H(D) + D \log(m-1),$  (19)

and hence

$I(X; \hat{X}) = H(X) - H(X | \hat{X})$  (20)
$\ge \log m - H(D) - D \log(m-1).$  (21)

We can achieve this lower bound by choosing $p(\hat{x})$ to be the uniform distribution, and the conditional distribution $p(x|\hat{x})$ to be

$p(x | \hat{x}) = 1 - D$ if $x = \hat{x}$, and $p(x | \hat{x}) = D/(m-1)$ if $x \ne \hat{x}$.  (22)

It is easy to verify that this gives the right distribution on $X$ and satisfies the bound with equality for $D < \frac{m-1}{m}$. Hence

$R(D) = \log m - H(D) - D\log(m-1)$ if $0 \le D \le \frac{m-1}{m}$, and $R(D) = 0$ if $D > \frac{m-1}{m}$.  (23)
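As a quick check of (21)-(23), the sketch below (illustrative only; the choices $m = 4$ and $D = 0.2$ are arbitrary) builds the reverse test channel of (22) with a uniform $\hat{X}$, confirms that the induced $X$ is uniform and the expected distortion is $D$, and compares $I(X; \hat{X})$ with $\log m - H(D) - D\log(m-1)$.

```python
# Verify that the test channel in (22) meets the rate-distortion bound with equality.
import math

def check_uniform_hamming(m: int, D: float) -> None:
    # p(xhat) uniform; p(x | xhat) = 1-D on the diagonal, D/(m-1) off the diagonal.
    p_joint = [[(1.0 / m) * ((1 - D) if x == xh else D / (m - 1)) for x in range(m)]
               for xh in range(m)]
    p_x = [sum(p_joint[xh][x] for xh in range(m)) for x in range(m)]   # marginal of X
    distortion = sum(p_joint[xh][x] for xh in range(m) for x in range(m) if x != xh)
    mi = sum(p_joint[xh][x] * math.log2(p_joint[xh][x] / (p_x[x] * (1.0 / m)))
             for xh in range(m) for x in range(m) if p_joint[xh][x] > 0)
    hd = -D * math.log2(D) - (1 - D) * math.log2(1 - D)                # binary entropy H(D)
    rd_formula = math.log2(m) - hd - D * math.log2(m - 1)
    uniform = all(abs(px - 1 / m) < 1e-12 for px in p_x)
    print(f"m={m}, D={D}: X uniform? {uniform}, E d = {distortion:.4f}, "
          f"I = {mi:.4f}, log m - H(D) - D log(m-1) = {rd_formula:.4f}")

check_uniform_hamming(m=4, D=0.2)
```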

5. Erasure distortion. Consider $X \sim$ Bernoulli$(\tfrac{1}{2})$, and let the distortion measure be given by the matrix

$d(x, \hat{x}) = \begin{bmatrix} 0 & 1 & \infty \\ \infty & 1 & 0 \end{bmatrix},$  (24)

where the rows correspond to $x \in \{0, 1\}$ and the columns to $\hat{x} \in \{0, e, 1\}$, with $e$ the erasure symbol. Calculate the rate distortion function for this source. Can you suggest a simple scheme to achieve any value of the rate distortion function for this source?

Solution: Erasure distortion. The infinite distortion constrains $p(x = 0, \hat{x} = 1) = p(x = 1, \hat{x} = 0) = 0$. Hence by symmetry the joint distribution of $(X, \hat{X})$ is of the form shown in Figure 1.

[Figure 1: Joint distribution for erasure rate distortion of a binary source. Each input $x \in \{0, 1\}$ is reproduced as itself with probability $1 - a$ and as the erasure symbol $e$ with probability $a$.]

For this joint distribution, it is easy to calculate that the distortion is $D = a$ and that $I(X; \hat{X}) = H(X) - H(X | \hat{X}) = 1 - a$. Hence we have $R(D) = 1 - D$ for $0 \le D \le 1$, and $R(D) = 0$ for $D > 1$. It is easy to see how we could achieve this rate distortion function. If $D$ is rational, say $k/n$, then we send only the first $n - k$ bits of any block of $n$ bits. We reproduce these bits exactly and reproduce the remaining bits as erasures. Hence we can send information at rate $1 - D$ and achieve a distortion $D$. If $D$ is irrational, we can get arbitrarily close to $D$ by using longer and longer block lengths.
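The simple achievability scheme described above is easy to simulate. The sketch below (illustrative; the block length and target distortion are arbitrary) keeps the first $n - k$ bits of each length-$n$ block, marks the rest as erasures, and reports the resulting rate and average distortion.

```python
# Simulate the "send the first n-k bits, erase the rest" scheme for erasure distortion.
import random

def erasure_scheme(bits, n, k):
    """Describe each n-block by its first n-k bits; reproduce the rest as 'e'."""
    reproduction, distortion = [], 0
    for start in range(0, len(bits), n):
        block = bits[start:start + n]
        reproduction.extend(block[:n - k])                     # reproduced exactly, distortion 0
        reproduction.extend(["e"] * (len(block) - (n - k)))    # erased, distortion 1 each
        distortion += len(block) - (n - k)
    rate = (n - k) / n                                         # described bits per source bit
    return reproduction, rate, distortion / len(bits)

random.seed(1)
source = [random.randint(0, 1) for _ in range(10_000)]
n, k = 10, 3                                                   # target distortion D = k/n = 0.3
_, rate, avg_d = erasure_scheme(source, n, k)
print(f"rate = {rate:.2f} (= 1 - D), average distortion = {avg_d:.3f} (target D = {k / n})")
```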

6. Rate distortion. A memoryless source $U$ is uniformly distributed on $\{0, \ldots, r-1\}$. The distortion function is given by

$d(u, v) = 0$ if $u = v$; $\quad d(u, v) = 1$ if $u = v \pm 1 \bmod r$; $\quad d(u, v) = \infty$ otherwise.

Show that the rate distortion function is

$R(D) = \log r - D - h(D)$ for $D \le \tfrac{2}{3}$, and $R(D) = \log r - \log 3$ for $D > \tfrac{2}{3}$,

where $h(\cdot)$ denotes the binary entropy function.

Solution: Rate distortion. From the symmetry of the problem, we can assume the conditional distribution $p(v|u)$ to be of the form

$p(v|u) = 1 - p$ if $v = u$; $\quad p(v|u) = p/2$ if $v = u \pm 1 \bmod r$; $\quad p(v|u) = 0$ otherwise.

Then $E[d(U, V)] = p$. Therefore, the rate distortion function is

$R(D) = \min_{p \le D} I(U; V).$

Now, we know that

$I(U; V) = H(V) - H(V|U) = \log r - H(1-p, p/2, p/2),$

since $U$ is uniform and, by symmetry, $V$ is also uniform. We know that $H(1-p, p/2, p/2) \le \log 3$, and this is achieved when $p = 2/3$. Therefore $R(D) = \log r - \log 3$ if $D > 2/3$.

Now consider the case $D \le 2/3$. Denote

$f(p) = H(1-p, p/2, p/2) = -(1-p)\log(1-p) - p\log(p/2) = -(1-p)\log(1-p) - p\log p + p.$

We know that $f(p)$ is a concave function. Differentiating with respect to $p$,

$\frac{df(p)}{dp} = \log(1-p) - \log p + 1 = \log \frac{2(1-p)}{p},$

and setting $f'(p) = 0$ shows that $f(p)$ is maximized at $p = 2/3$. Therefore, for $D \le 2/3$, $f(p)$ is an increasing function of $p$ on $[0, D]$. Thus

$R(D) = \log r - D - h(D), \quad \text{if } D \le 2/3.$
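Within the symmetric family used in the solution, the closed form can be cross-checked by a brute-force minimization over $p$. The sketch below (an illustration, not part of the original solution; $r$ and the grid resolution are arbitrary, and it only searches the symmetric conditionals assumed above) compares $\min_{p \le D} [\log r - H(1-p, p/2, p/2)]$ with the claimed piecewise formula.

```python
# Brute-force check of R(D) for the circular-distortion source within the symmetric family.
import math

def h(p: float) -> float:
    """Binary entropy in bits (h(0) = h(1) = 0)."""
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def ternary_entropy(p: float) -> float:
    """H(1-p, p/2, p/2) in bits, which simplifies to h(p) + p."""
    return h(p) + p

def rate_formula(r: int, D: float) -> float:
    return math.log2(r) - D - h(D) if D <= 2 / 3 else math.log2(r) - math.log2(3)

r = 5
for D in (0.1, 0.3, 0.5, 0.8):
    grid = [i / 10_000 for i in range(10_001) if i / 10_000 <= D]
    brute = min(math.log2(r) - ternary_entropy(p) for p in grid)
    print(f"D={D}: grid minimum = {brute:.4f}, formula = {rate_formula(r, D):.4f}")
```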

7. Adding a column to the distortion matrix. Let $R(D)$ be the rate distortion function for an i.i.d. process with probability mass function $p(x)$ and distortion function $d(x, \hat{x})$, $x \in \mathcal{X}$, $\hat{x} \in \hat{\mathcal{X}}$. Now suppose that we add a new reproduction symbol $\hat{x}_0$ to $\hat{\mathcal{X}}$ with associated distortion $d(x, \hat{x}_0)$, $x \in \mathcal{X}$. Can this increase $R(D)$? Explain.

Solution: Adding a column. Let the new rate distortion function be denoted $\tilde{R}(D)$, and note that we can still achieve $R(D)$ by restricting the support of $p(x, \hat{x})$, i.e., by simply ignoring the new symbol. Thus $\tilde{R}(D) \le R(D)$: adding a reproduction symbol can never increase the rate distortion function. Finally, note the duality to the problem in which we added a row to the channel transition matrix to obtain a channel with no smaller capacity (Problem 7.).

8. Simplification. Suppose $\mathcal{X} = \{1, 2, 3, 4\}$, $\hat{\mathcal{X}} = \{1, 2, 3, 4\}$, $p(i) = \tfrac{1}{4}$, $i = 1, 2, 3, 4$, and $X_1, X_2, \ldots$ are i.i.d. $\sim p(x)$. The distortion matrix $d(x, \hat{x})$ is given by

  x\x̂   1   2   3   4
   1    0   0   1   1
   2    0   0   1   1
   3    1   1   0   0
   4    1   1   0   0

(a) Find $R(0)$, the rate necessary to describe the process with zero distortion.
(b) Find the rate distortion function $R(D)$. (Hint: The distortion measure allows you to simplify the problem into one you have already seen.)
(c) Suppose we have a nonuniform distribution $p(i) = p_i$, $i = 1, 2, 3, 4$. What is $R(D)$?

Solution: Simplification.

(a) We can achieve zero distortion if we output $\hat{X} = 1$ whenever $X = 1$ or $2$, and $\hat{X} = 3$ whenever $X = 3$ or $4$. Thus if we set $Y = 1$ if $X \in \{1, 2\}$ and $Y = 2$ if $X \in \{3, 4\}$, we can recover $Y$ exactly if the rate is greater than $H(Y) = 1$ bit. It is also not hard to see that any zero-distortion code would be able to recover $Y$ exactly, and thus $R(0) = 1$.

(b) If we define $Y$ as in the previous part, and $\hat{Y}$ similarly from $\hat{X}$, we see that the distortion between $X$ and $\hat{X}$ equals the Hamming distortion between $Y$ and $\hat{Y}$. Therefore, if the rate is greater than the Hamming rate distortion function of the binary uniform source $Y$, we can recover $X$ to within distortion $D$. Thus $R(D) = 1 - H(D)$ for $0 \le D \le \tfrac{1}{2}$ (and $R(D) = 0$ for $D > \tfrac{1}{2}$).

(c) If the distribution of $X$ is not uniform, the same arguments hold; $Y$ now has the distribution $(p_1 + p_2, p_3 + p_4)$, and the rate distortion function is $R(D) = H(p_1 + p_2) - H(D)$ for $0 \le D \le \min(p_1 + p_2, p_3 + p_4)$.
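The reduction in part (b) can be cross-checked numerically with the Blahut-Arimoto algorithm for rate distortion (a standard method; the sketch below is an illustration under the stated $4 \times 4$ distortion matrix and uniform $p(x)$, not part of the original solution). For each Lagrange multiplier $\beta$ it returns one point $(D, R)$ on the rate distortion curve, which should satisfy $R \approx 1 - H(D)$.

```python
# Blahut-Arimoto sketch for R(D) of the 4-symbol source with the block distortion matrix.
import math

def blahut_arimoto(p_x, d, beta, iters=500):
    """Return one (D, R) point on the rate-distortion curve for multiplier beta (R in bits)."""
    n, m = len(p_x), len(d[0])
    q = [1.0 / m] * m                                   # output marginal q(xhat)
    for _ in range(iters):
        # p(xhat | x) proportional to q(xhat) * exp(-beta * d(x, xhat))
        w = [[q[j] * math.exp(-beta * d[i][j]) for j in range(m)] for i in range(n)]
        cond = [[w[i][j] / sum(w[i]) for j in range(m)] for i in range(n)]
        q = [sum(p_x[i] * cond[i][j] for i in range(n)) for j in range(m)]
    D = sum(p_x[i] * cond[i][j] * d[i][j] for i in range(n) for j in range(m))
    R = sum(p_x[i] * cond[i][j] * math.log2(cond[i][j] / q[j])
            for i in range(n) for j in range(m) if cond[i][j] > 0)
    return D, R

def h(p):
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

d = [[0, 0, 1, 1], [0, 0, 1, 1], [1, 1, 0, 0], [1, 1, 0, 0]]
p_x = [0.25] * 4
for beta in (1.0, 2.0, 4.0):
    D, R = blahut_arimoto(p_x, d, beta)
    print(f"beta={beta}: D={D:.4f}, R={R:.4f}, 1 - H(D) = {1 - h(D):.4f}")
```

Sweeping $\beta$ traces out the whole curve; each value of $\beta$ selects one distortion level.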

9. Rate distortion for two independent sources. Can one simultaneously compress two independent sources better than by compressing the sources individually? The following problem addresses this question. Let the pairs $\{(X_i, Y_i)\}$ be i.i.d. $\sim p(x, y)$. The distortion measure for $X$ is $d(x, \hat{x})$ and its rate distortion function is $R_X(D)$. Similarly, the distortion measure for $Y$ is $d(y, \hat{y})$ and its rate distortion function is $R_Y(D)$. Suppose we now wish to describe the process $\{(X_i, Y_i)\}$ subject to the distortion constraints $\lim_{n\to\infty} E\, d(X^n, \hat{X}^n) \le D_1$ and $\lim_{n\to\infty} E\, d(Y^n, \hat{Y}^n) \le D_2$. Our rate distortion theorem can be shown to naturally extend to this setting and imply that the minimum rate required to achieve these distortions is given by

$R_{X,Y}(D_1, D_2) = \min_{p(\hat{x}, \hat{y} | x, y) :\, E d(X, \hat{X}) \le D_1,\, E d(Y, \hat{Y}) \le D_2} I(X, Y; \hat{X}, \hat{Y}).$

Now, suppose the $\{X_i\}$ process and the $\{Y_i\}$ process are independent of each other.

(a) Show that

$R_{X,Y}(D_1, D_2) \ge R_X(D_1) + R_Y(D_2).$

(b) Does equality hold? Now answer the question.

Solution: Rate distortion for two independent sources.

(a) Given that $X$ and $Y$ are independent, we have

$p(x, y, \hat{x}, \hat{y}) = p(x) p(y) p(\hat{x}, \hat{y} | x, y).$  (25)

Then

$I(X, Y; \hat{X}, \hat{Y}) = H(X, Y) - H(X, Y | \hat{X}, \hat{Y})$  (26)
$= H(X) + H(Y) - H(X | \hat{X}, \hat{Y}) - H(Y | X, \hat{X}, \hat{Y})$  (27)
$\ge H(X) + H(Y) - H(X | \hat{X}) - H(Y | \hat{Y})$  (28)
$= I(X; \hat{X}) + I(Y; \hat{Y}),$  (29)

where the inequality follows from the fact that conditioning reduces entropy. Therefore

$R_{X,Y}(D_1, D_2) = \min_{p(\hat{x}, \hat{y} | x, y) :\, E d(X, \hat{X}) \le D_1,\, E d(Y, \hat{Y}) \le D_2} I(X, Y; \hat{X}, \hat{Y})$  (30)
$\ge \min_{p(\hat{x}, \hat{y} | x, y) :\, E d(X, \hat{X}) \le D_1,\, E d(Y, \hat{Y}) \le D_2} \left( I(X; \hat{X}) + I(Y; \hat{Y}) \right)$  (31)
$= \min_{p(\hat{x} | x) :\, E d(X, \hat{X}) \le D_1} I(X; \hat{X}) + \min_{p(\hat{y} | y) :\, E d(Y, \hat{Y}) \le D_2} I(Y; \hat{Y})$  (32)
$= R_X(D_1) + R_Y(D_2).$  (33)

(b) If

$p(x, y, \hat{x}, \hat{y}) = p(x) p(y) p(\hat{x} | x) p(\hat{y} | y),$  (34)

then

$I(X, Y; \hat{X}, \hat{Y}) = H(X, Y) - H(X, Y | \hat{X}, \hat{Y})$  (35)
$= H(X) + H(Y) - H(X | \hat{X}, \hat{Y}) - H(Y | X, \hat{X}, \hat{Y})$  (36)
$= H(X) + H(Y) - H(X | \hat{X}) - H(Y | \hat{Y})$  (37)
$= I(X; \hat{X}) + I(Y; \hat{Y}).$  (38)

Let $p(x, \hat{x})$ be a distribution that achieves the rate distortion $R_X(D_1)$ at distortion $D_1$, and let $p(y, \hat{y})$ be a distribution that achieves the rate distortion $R_Y(D_2)$ at distortion $D_2$. Then for the product distribution $p(x, y, \hat{x}, \hat{y}) = p(x, \hat{x}) p(y, \hat{y})$, where the component distributions achieve the rate-distortion pairs $(D_1, R_X(D_1))$ and $(D_2, R_Y(D_2))$, the mutual information corresponding to the product distribution is $R_X(D_1) + R_Y(D_2)$. Thus

$R_{X,Y}(D_1, D_2) = \min_{p(\hat{x}, \hat{y} | x, y) :\, E d(X, \hat{X}) \le D_1,\, E d(Y, \hat{Y}) \le D_2} I(X, Y; \hat{X}, \hat{Y}) = R_X(D_1) + R_Y(D_2).$  (39)

Thus by using the product distribution, we can achieve the sum of the rates. Therefore the total rate at which we encode two independent sources together with distortions $D_1$ and $D_2$ is the same as if we encoded each of them separately.

10. One-bit quantization of a single Gaussian random variable. Let $X \sim \mathcal{N}(0, \sigma^2)$ and let the distortion measure be squared error. Here we do not allow block descriptions. Show that the optimum reproduction points for 1-bit quantization are $\pm \sqrt{\frac{2}{\pi}}\,\sigma$, and that the expected distortion for 1-bit quantization is $\frac{\pi - 2}{\pi}\sigma^2$. Compare this with the distortion rate bound $D = \sigma^2 2^{-2R}$ for $R = 1$.

Solution: One-bit quantization of a single Gaussian random variable. With one-bit quantization, the obvious reconstruction regions are the positive and negative real axes. The reconstruction point is the centroid of each region. For example, for the positive real line, the centroid $a$ is given by

$a = \int_0^\infty x \, \frac{2}{\sqrt{2\pi\sigma^2}} e^{-\frac{x^2}{2\sigma^2}} \, dx = \sigma \sqrt{\frac{2}{\pi}} \int_0^\infty e^{-y} \, dy = \sigma \sqrt{\frac{2}{\pi}},$

using the substitution $y = x^2 / 2\sigma^2$. The expected distortion for one-bit quantization is

$D = \int_{-\infty}^{0} \left( x + \sigma\sqrt{\tfrac{2}{\pi}} \right)^2 \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{x^2}{2\sigma^2}} \, dx + \int_{0}^{\infty} \left( x - \sigma\sqrt{\tfrac{2}{\pi}} \right)^2 \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{x^2}{2\sigma^2}} \, dx$
$= \int_{-\infty}^{\infty} \left( x^2 + \tfrac{2}{\pi}\sigma^2 \right) \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{x^2}{2\sigma^2}} \, dx - 2 \int_{0}^{\infty} 2 x \sigma\sqrt{\tfrac{2}{\pi}} \, \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{x^2}{2\sigma^2}} \, dx$
$= \sigma^2 + \tfrac{2}{\pi}\sigma^2 - \tfrac{4}{\pi}\sigma^2$
$= \frac{\pi - 2}{\pi}\sigma^2 \approx 0.3634\,\sigma^2,$

which is much larger than the distortion rate bound $D(1) = \sigma^2/4$.
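A quick Monte Carlo experiment confirms both the reproduction points and the resulting distortion. The sketch below (illustrative; the sample size and $\sigma$ are arbitrary choices) quantizes Gaussian samples to $\pm \sigma\sqrt{2/\pi}$ and compares the empirical mean squared error with $\frac{\pi-2}{\pi}\sigma^2$ and with the distortion rate bound $\sigma^2/4$.

```python
# Monte Carlo check of one-bit quantization of N(0, sigma^2) (illustrative sketch).
import math
import random

random.seed(0)
sigma, N = 2.0, 200_000
a = sigma * math.sqrt(2 / math.pi)              # optimal reproduction points are +/- a

mse = 0.0
for _ in range(N):
    x = random.gauss(0.0, sigma)
    xhat = a if x >= 0 else -a                  # one-bit quantizer: sign of x
    mse += (x - xhat) ** 2
mse /= N

print(f"empirical distortion        = {mse:.4f}")
print(f"(pi-2)/pi * sigma^2         = {(math.pi - 2) / math.pi * sigma ** 2:.4f}")
print(f"rate-distortion bound (R=1) = {sigma ** 2 / 4:.4f}")
```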

11. Side information. A memoryless source generates i.i.d. pairs of random variables $(U_i, S_i)$, $i = 1, 2, \ldots$ on finite alphabets, according to

$p(u^n, s^n) = \prod_{i=1}^{n} p(u_i, s_i).$

We are interested in describing $U$ when $S$, called side information, is known both at the encoder and at the decoder. Prove that the rate distortion function is given by

$R_{U|S}(D) = \min_{p(v | u, s) :\, E[d(U, V)] \le D} I(U; V | S).$

Compare $R_{U|S}(D)$ with the ordinary rate distortion function $R_U(D)$ without any side information. What can you say about the influence of the correlation between $U$ and $S$ on $R_{U|S}(D)$?

Solution: Side information. Since both the encoder and the decoder have the common side information $S$, we can benefit from the correlation between the source $U$ and the side information $S$. The idea for achievability is to use a different rate distortion code for the source symbols that have different corresponding side information. The symbols whose side information is $S = s$ are covered by a rate distortion code of rate

$R_{U|S=s}(D) = \min_{p(v | u, S=s) :\, E[d(U, V)] \le D} I(U; V | S = s)$

and expected distortion $D$. The total rate will then be

$R = \sum_{s=1}^{|\mathcal{S}|} P_{S^n}(s) \, R_{U|S=s}(D),$

where $P_{S^n}(s)$ is the empirical distribution of $S^n$. As $n$ becomes large, $P_{S^n}(s)$ will be very close to the true distribution of $S$, $p(s)$; i.e., with high probability $S^n$ will be in the strongly typical set $T_P(\delta)$, where

$T_P(\delta) = \left\{ s^n : \max_{s \in \mathcal{S}} |P_{s^n}(s) - p(s)| < \frac{\delta}{|\mathcal{S}|} \right\}.$

Therefore, if $n$ is sufficiently large and $\delta$ is sufficiently small, we will achieve the rate

$R = \sum_{s \in \mathcal{S}} p(s) \, I(U; V | S = s) = I(U; V | S).$

It is clear that the overall rate distortion code has expected distortion $D$. For the converse, we have

$nR \ge H(V^n | S^n)$
$= I(U^n; V^n | S^n)$
$= H(U^n | S^n) - H(U^n | V^n, S^n)$
$= \sum_{i=1}^{n} \left( H(U_i | S_i) - H(U_i | U^{i-1}, V^n, S^n) \right)$
$\ge \sum_{i=1}^{n} \left( H(U_i | S_i) - H(U_i | V_i, S_i) \right)$
$= \sum_{i=1}^{n} I(U_i; V_i | S_i)$
$\ge \sum_{i=1}^{n} R_{U|S}(E\, d(U_i, V_i))$
$\ge n \, R_{U|S}(E\, d(U^n, V^n))$
$\ge n \, R_{U|S}(D),$

where each step has the same justification as in the original converse for the rate distortion theorem; in particular, the last two inequalities use the convexity and monotonicity of $R_{U|S}(\cdot)$.

Now, to compare $R_{U|S}(D)$ and $R_U(D)$, consider the following:

$R_{U|S}(D) = \min_{p(v | u, s) :\, E[d(U, V)] \le D} I(U; V | S)$
$\le \min_{p(v | u) :\, E[d(U, V)] \le D} I(U; V | S)$  (40)
$\le \min_{p(v | u) :\, E[d(U, V)] \le D} I(U; V)$  (41)
$= R_U(D).$

Here, the inequality in (40) follows from the fact that the minimization is over a smaller set. The inequality in (41) follows from the fact that $S - U - V$ forms a Markov chain under the joint distribution resulting from the minimization in (40). Indeed, we have

$I(U, S; V) = I(U; V) + I(S; V | U) = I(S; V) + I(U; V | S).$

Since $I(S; V | U) = 0$ and $I(S; V) \ge 0$, we have the inequality in (41). Therefore, we can only gain from the side information. Intuitively, this also makes sense: we can always just ignore the side information and use the original rate distortion code, but the rate may be worse. The two rates will be equal if and only if $U$ and $S$ are independent, which achieves equality in both (40) and (41).