Entropy and Ergodic Theory Lecture 7: Rate-distortion theory


1 Coupling source coding to channel coding

Let $[\Sigma, p]$ be a memoryless source and let $[A, \theta, B]$ be a DMC. Here are the two coding problems we have discussed so far:

A length-$n$ block code for $[\Sigma, p]$ using the alphabet $A$ may be represented by the diagram
\[ \Sigma^n \xrightarrow{\ f\ } C \xrightarrow{\ g\ } \Sigma^n, \tag{1} \]
where $C \subseteq A^n$ is the set of codewords. The objective is to make
\[ p^{\times n}\{g(f(\sigma)) \neq \sigma\} < \varepsilon \]
for some small $\varepsilon$. Roughly put, this is possible if
\[ \mathrm{rate}(C) := \frac{1}{n} \log |C| > H(p) \]
and $n$ is sufficiently large. (To be more precise, the gap in this inequality must be large enough, but by an amount which is $o_\varepsilon(1)$ as $n \to \infty$.) See Lecture 3.

A length-$n$ channel code consists of a set of codewords $C \subseteq A^n$ and a decoder $\varphi : B^n \to A^n$. The objective is to make
\[ \theta^n(\{\varphi \neq x\} \mid x) < \varepsilon \]
for some small $\varepsilon$, uniformly over all $x \in C$. Roughly put, this is possible if
\[ \mathrm{rate}(C) < \mathrm{cap}(\theta) := \sup_{p \in \mathrm{Prob}(A)} I(\theta ; p). \]
See Lecture 6.
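The three quantities compared above, $\mathrm{rate}(C)$, $H(p)$ and $\mathrm{cap}(\theta)$, can be evaluated numerically in small cases. The following is a minimal sketch, not part of the lecture: the function names are hypothetical, logarithms are taken to base 2, and the supremum defining the capacity is only approximated by a grid search over input distributions on a two-letter alphabet.

```python
import math

def entropy(p):
    """Shannon entropy in bits of a probability vector p."""
    return -sum(x * math.log2(x) for x in p if x > 0)

def code_rate(num_codewords, n):
    """rate(C) = (1/n) log2 |C| for a code with |C| codewords of length n."""
    return math.log2(num_codewords) / n

def mutual_information(p, theta):
    """I(theta; p) in bits, where theta[a][b] is the probability of output b given input a."""
    num_outputs = len(theta[0])
    # Output distribution induced by the input distribution p and the kernel theta.
    out = [sum(p[a] * theta[a][b] for a in range(len(p))) for b in range(num_outputs)]
    # Average entropy of the rows of theta, i.e. the entropy of the output given the input.
    conditional = sum(p[a] * entropy(theta[a]) for a in range(len(p)))
    return entropy(out) - conditional

def capacity_binary_input(theta, steps=1000):
    """Grid-search approximation of sup_p I(theta; p) for a channel with two input letters."""
    return max(
        mutual_information([t / steps, 1 - t / steps], theta)
        for t in range(steps + 1)
    )

# Illustrative channel (an assumption): binary symmetric channel with crossover probability 0.1.
bsc = [[0.9, 0.1], [0.1, 0.9]]
cap = capacity_binary_input(bsc)
print("approximate capacity in bits:", cap)
```

For the symmetric channel used here the grid contains the uniform input distribution, which attains the supremum, so the approximation agrees with the closed form $1 - h(0.1)$ up to rounding. For a general channel the grid search only gives a lower bound on the capacity.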

By combining the coding theorems from Lectures 3 and 6, it follows that if
\[ H(p) < \mathrm{cap}(\theta), \tag{2} \]
then we may construct source and channel codes as above with the same set $C$ for both. Then we can use the channel $[A, \theta, B]$ to transmit random sourcewords from $[\Sigma, p]$ with small probability of error: given $\sigma \in \Sigma^n$, form $f(\sigma) \in C$, transmit it through the channel to produce a noisy output $Y \sim \theta^n(\,\cdot \mid f(\sigma))$, apply $\varphi$ to get
\[ \varphi(Y) = f(\sigma) \]
with probability at least $1 - \varepsilon$, and now apply $g$ to recover the original sourceword,
\[ g(\varphi(Y)) = \sigma, \]
with probability at least $1 - 2\varepsilon$.

2 Rate-distortion theory

What if the reverse inequality holds in (2)? Then it turns out that reliable communication, in the sense described above, is not possible, no matter how large an $n$ we choose. Put another way: there is no way to transmit information from $[\Sigma, p]$ through $[A, \theta, B]$ reliably which gives a better rate than by applying the source and channel coding theorems separately and then combining them, as described above. This is Shannon's source-channel separation theorem. (The case of equality in (2) is more complicated, and I do not know of a simple analysis.)

We do not prove the separation theorem here. Instead, we focus on a new problem: what to do if (2) is reversed, but we are willing to accept some fraction of errors in our recovery of the original sourceword?

To answer this, we again separate the problem into a source-coding part and a channel-coding part. We still ask for reliable transmission of codewords through the channel, as above, but now we are willing to use a source code that enables recovery of the original sourceword only with a certain fraction of errors. A more general version of the source-channel separation theorem shows that this separation into a noisy source code and a good channel code is again optimal.
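To see which side of (2) a concrete pair of source and channel falls on, here is a small sketch under stated assumptions that are not taken from the lecture: the source is Bernoulli($q$) on $\{0, 1\}$, the channel is a binary symmetric channel with crossover probability $e$, and its capacity is taken from the standard closed form $1 - h(e)$, where $h$ is the binary entropy function. The function names are hypothetical.

```python
import math

def h2(x):
    """Binary entropy function in bits."""
    if x in (0.0, 1.0):
        return 0.0
    return -x * math.log2(x) - (1 - x) * math.log2(1 - x)

def bsc_capacity(e):
    """Capacity in bits of a binary symmetric channel with crossover probability e."""
    return 1.0 - h2(e)

def regime(q, e):
    """Compare H(p) with cap(theta) for a Bernoulli(q) source and a BSC(e).

    Returns "reliable" if (2) holds, "lossy" if the reverse strict inequality
    holds (the rate-distortion setting), and "boundary" in the case of equality.
    """
    source_entropy = h2(q)
    cap = bsc_capacity(e)
    if source_entropy < cap:
        return "reliable"
    if source_entropy > cap:
        return "lossy"
    return "boundary"

print(regime(0.1, 0.05))  # a biased source over a fairly clean channel
print(regime(0.5, 0.1))   # a uniform source over a noisier channel
```

A uniform binary source has entropy 1 bit, so over any genuinely noisy binary symmetric channel it lands in the "lossy" regime, which is the situation studied in the rest of this lecture.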

Thus, set $R := \mathrm{cap}(\theta)$, and assume that $H(p) > R$. Our new task is to design a source code as in (1) such that the expected distortion
\[ \sum_{\sigma \in \Sigma^n} p^{\times n}(\sigma)\, \frac{|\{i = 1, 2, \dots, n : g(f(\sigma))_i \neq \sigma_i\}|}{n} \]
is as small as possible, given that we are restricted to codes in which $\mathrm{rate}(C) \le R$. This is the basic rate-distortion problem.

Some more notation makes for a better understanding of this quantity, which is rewritten as (3) below. First, by replacing $C$ with its image $g(C)$ and $f$ with $g \circ f$, we reduce the rate-distortion problem to the case $\Sigma = A$ and $g = \mathrm{id}_C$. Next, let us write
\[ d_n(a, a') := \frac{|\{i : a_i \neq a'_i\}|}{n} \]
for $a = (a_1, \dots, a_n)$, $a' = (a'_1, \dots, a'_n) \in A^n$. This defines a metric on $A^n$ (exercise!), although we do not use this fact in the present lecture. It is called the (normalized) Hamming metric.

With these simplifications, our task is to find an encoder $f : A^n \to A^n$ for which $|f(A^n)| \le 2^{Rn}$ and for which the expected distortion
\[ \mathbf{E}[d_n(X, f(X))] \tag{3} \]
is as small as possible when $X \sim p^{\times n}$.

3 Shannon's rate-distortion theorem

It turns out to be slightly easier to fix an upper bound on the expected distortion that we are willing to accept, and find the least rate bound $R$ for which that expected distortion bound can be satisfied. This is done by the following theorem.

Theorem 1 (Shannon's rate-distortion theorem). For $\delta > 0$ let
\[ R(\delta) := \inf\big\{ I(X ; Y) : X, Y \text{ are } A\text{-valued RVs with } p_X = p \text{ and } \mathbf{P}(X \neq Y) \le \delta \big\} \]
\[ \phantom{R(\delta) :}= \inf\Big\{ I(\theta ; p) : \text{channels } [A, \theta, A] \text{ satisfying } \sum_{a \neq a' \text{ in } A} p(a)\theta(a' \mid a) \le \delta \Big\}. \tag{4} \]

1. (Existence.) If $R' > R(\delta)$ and $\delta' > \delta$, then there is a sequence of encoders $f_n : A^n \to A^n$ such that $|f_n(A^n)| < 2^{R'n}$ and with expected distortions less than $\delta'$ for all sufficiently large $n$.

2. (Converse.) If $f : A^n \to A^n$ is any encoder with expected distortion at most $\delta$, then $|f(A^n)| \ge 2^{R(\delta)n}$.

The function $\delta \mapsto R(\delta)$ defined above is called the rate-distortion function of the source $[A, p]$.

Thus, as with channel coding, the answer is obtained by solving a certain optimization problem for mutual information. This time, the source is fixed, and one obtains its rate-distortion function by optimizing over kernels: this is the opposite of the channel coding theorem. However, as in channel coding, the basis for Theorem 1 is the asymptotic picture of $q^{\times n}$ for $q \in \mathrm{Prob}(A \times A)$ that we obtained from our study of conditional entropy.

Our first step is the following lemma, an obvious cousin of Feinstein's lemma. Feinstein's lemma was about packing disjoint sets $V_i$ into $B^n$ so as to mostly cover the distributions $\theta^n(\,\cdot \mid x_i)$. This time we need to cover $A^n$ with sets $U_i$, not necessarily disjoint, which all have small diameter according to $d_n$. It suffices to do this using balls for the Hamming metric $d_n$. Given $x \in A^n$ and $r > 0$, let
\[ B_r(x) := \{y \in A^n : d_n(x, y) < r\}. \]
(This notation leaves the alphabet $A$ and dimension $n$ to the reader's understanding.)

Lemma 2. Let $R' > R(\delta)$, $\delta' > \delta$ and $\varepsilon > 0$. Then for every sufficiently large $n$ (depending on $R'$ and $\delta'$) there exist a positive integer $N < 2^{R'n}$ and elements $v_1, \dots, v_N \in A^n$ such that
\[ p^{\times n}\big(B_{\delta'}(v_1) \cup \cdots \cup B_{\delta'}(v_N)\big) > 1 - \varepsilon. \tag{5} \]

Proof. Choose a kernel $\theta$ from $A$ to $A$ so that if $(X, Y) \sim p \ltimes \theta$ then
\[ \mathbf{P}(X \neq Y) < \delta' \tag{6} \]
and $I(X ; Y) < R'$. Then $p_X = p$, and let us write $q := p \ltimes \theta$. This joint distribution may also be written in terms of its second marginal $p_Y$ and a different kernel, say $\lambda$: we denote this representation by $\lambda \rtimes p_Y$. Let $\kappa > 0$; we will specify it later.

Step 1: Setting up. For any $y \in B^n$ (here $B = A$), let
\[ H_y := B_{\delta'}(y) \cap T_{n,\kappa}(\lambda, y), \]
and now let
\[ H := \bigcup_{y \in T_{n,\kappa}(p_Y)} H_y \times \{y\}. \]
Using the inclusions from the end of Lecture 5, we have
\[ H \supseteq T_{n,\kappa/2}(q) \cap \{(x, y) : d_n(x, y) < \delta'\}. \]
The LLN for types gives $q^{\times n}[T_{n,\kappa/2}(q)] \to 1$ as $n \to \infty$. On the other hand, if $(X, Y) \sim q^{\times n}$, then the quantity
\[ d_n(X, Y) = \frac{1}{n} \sum_{i=1}^n 1_{\{X_i \neq Y_i\}} \]
is also an average of bounded i.i.d. RVs, so the upper bound (6) and another appeal to the LLN give that
\[ q^{\times n}(H) \to 1 \quad \text{as } n \to \infty. \]
Henceforth assume that $n$ is large enough to satisfy
\[ q^{\times n}(H) > 1 - \varepsilon/2. \tag{7} \]

Step 2: Completion. Our proof actually gives
\[ p^{\times n}(H_{v_1} \cup \cdots \cup H_{v_N}) > 1 - \varepsilon \tag{8} \]
instead of (5). Since $H_y \subseteq B_{\delta'}(y)$, this implies (5).

We now use a recursion to find a finite list of elements $v_i \in T_{n,\kappa}(p_Y)$ such that for each $i$ we have
\[ \lambda^n\big(H_{v_i} \setminus (H_{v_1} \cup \cdots \cup H_{v_{i-1}}) \,\big|\, v_i\big) > \varepsilon/2. \tag{9} \]
We do this so that: (i) at each step, either (5) is also satisfied, or the recursion can be continued for another step; and (ii) the recursion must stop before $2^{R'n}$ steps.

Thus, suppose we have already found $v_1, \dots, v_m$ which satisfy (9). If (8) is also satisfied with $N = m$, then we stop the recursion. Otherwise, let
\[ U := H_{v_1} \cup \cdots \cup H_{v_m}, \]
and observe from (7) and the negation of (8) that
\[ \sum_{y \in T_{n,\kappa}(p_Y)} p_Y^{\times n}(y)\, \lambda^n(H_y \setminus U \mid y) = q^{\times n}\big(H \setminus (U \times B^n)\big) > \varepsilon/2. \]

So there must be some $y \in T_{n,\kappa}(p_Y)$ such that $\lambda^n(H_y \setminus U \mid y) > \varepsilon/2$. Letting $v_{m+1} := y$, it follows that (9) also holds with $i = m + 1$. This continues the recursion.

This recursion terminates only once (8) is satisfied, so it remains to show that this happens after fewer than $2^{R'n}$ steps. To see this, suppose that at step $m$ the recursion has not yet terminated. Then on the one hand, since $v_i \in T_{n,\kappa}(p_Y)$, we have $T_{n,\kappa}(\lambda, v_i) \subseteq T_{n,2\kappa}(p_X)$, and so
\[ H_{v_1} \cup \cdots \cup H_{v_m} \subseteq T_{n,2\kappa}(p_X) \implies |H_{v_1} \cup \cdots \cup H_{v_m}| \le |T_{n,2\kappa}(p_X)| \le 2^{H(X)n + \Delta(\kappa)n + o(n)}, \]
where $\Delta(\kappa)$ denotes an error term that tends to $0$ as $\kappa \to 0$. On the other hand, from Lecture 5 we have
\[ \mathrm{cov}_{1 - \varepsilon/2}\big(\lambda^n(\,\cdot \mid v_i)\big) \ge 2^{H(\lambda \mid p_{v_i})n - o(n)} \ge 2^{H(X \mid Y)n - \Delta(\kappa)n - o(n)}, \]
where the error estimates do not depend on $v_i$ provided that $v_i \in T_{n,\kappa}(p_Y)$. In view of (9), this implies that
\[ |H_{v_i} \setminus (H_{v_1} \cup \cdots \cup H_{v_{i-1}})| \ge 2^{H(X \mid Y)n - \Delta(\kappa)n - o(n)}. \]
Comparing the above inequalities gives
\[ m \cdot 2^{H(X \mid Y)n - \Delta(\kappa)n - o(n)} \le \sum_{i=1}^m |H_{v_i} \setminus (H_{v_1} \cup \cdots \cup H_{v_{i-1}})| = |H_{v_1} \cup \cdots \cup H_{v_m}| \le 2^{H(X)n + \Delta(\kappa)n + o(n)} \]
and hence
\[ m \le 2^{I(X ; Y)n + \Delta(\kappa)n + o(n)}, \]
where the estimates in the two error terms depend on $A$ and $\lambda$ but nothing else. Since $I(X ; Y) < R'$, we may choose $\kappa$ small enough to deduce from this that $N < 2^{R'n}$ for all sufficiently large $n$.

Proof of existence in Theorem 1. For any sufficiently large $n$, let $v_i$, $i = 1, 2, \dots, N_n$, be given by the previous lemma. By forming set differences, we may now find subsets $U_i \subseteq B_{\delta'}(v_i)$ which are disjoint and still cover more than $1 - \varepsilon$ of $p^{\times n}$. Define
\[ f(x) := v_i \quad \text{whenever } x \in U_i, \ i = 1, 2, \dots, N_n, \]
and extend this arbitrarily to a function $f : A^n \to \{v_1, \dots, v_{N_n}\}$. The lemma tells us that $N_n < 2^{R'n}$, and if $X \sim p^{\times n}$ then
\[ \mathbf{E}\, d_n(X, f(X)) \le \mathbf{E}\big[d_n(X, f(X))\, 1_{\{X \in \bigcup_i U_i\}}\big] + p^{\times n}\big(A^n \setminus (U_1 \cup \cdots \cup U_{N_n})\big) < \delta' + \varepsilon. \]

Since $\delta'$ may be chosen arbitrarily close to $\delta$ and $\varepsilon$ may be chosen arbitrarily small, this completes the proof.

4 The converse to the rate-distortion theorem

One can give a counting proof of the converse to the rate-distortion theorem, similarly to the channel-coding theorem. But this time let us instead exhibit a more formula-based proof. It begins with the following lemma.

Lemma 3. Given $[A, p]$, the rate-distortion function $R(\delta)$ is non-increasing and convex.

Proof. The property of being non-increasing is obvious: if $\delta_1 < \delta_2$, then $R(\delta_2)$ is defined by an infimum over a larger set of possible values than $R(\delta_1)$.

So now let $\delta_i > 0$ for $i = 1, 2$ and let $0 < t < 1$. We need to show that
\[ R(t\delta_1 + (1 - t)\delta_2) \le tR(\delta_1) + (1 - t)R(\delta_2). \]
If either of the values $R(\delta_i)$ is infinite, then this is trivial, so suppose they are both finite. Let $\theta_i$ for $i = 1, 2$ be kernels from $A$ to $A$ such that
\[ I(\theta_i ; p) = R(\delta_i) \quad \text{and} \quad \sum_{a \neq a'} p(a)\theta_i(a' \mid a) \le \delta_i \quad \text{for } i = 1, 2. \]
These exist because $I(\theta ; p)$ is continuous as a function of $\theta$ and the space of kernels is compact, so the infimum which defines $R(\delta)$ is always achieved. Define a new kernel by
\[ \theta(a' \mid a) := t\theta_1(a' \mid a) + (1 - t)\theta_2(a' \mid a). \]
It satisfies
\[ \sum_{a \neq a'} p(a)\theta(a' \mid a) = t \sum_{a \neq a'} p(a)\theta_1(a' \mid a) + (1 - t) \sum_{a \neq a'} p(a)\theta_2(a' \mid a) \le t\delta_1 + (1 - t)\delta_2 \]
and, because $I(\theta ; p)$ is a convex function of the kernel $\theta$ when $p$ is fixed (a standard fact; see [CT06, Chapter 2]),
\[ I(\theta ; p) \le t\, I(\theta_1 ; p) + (1 - t)\, I(\theta_2 ; p) = tR(\delta_1) + (1 - t)R(\delta_2). \]
So this $\theta$ witnesses the required inequality, and hence $R$ is convex.
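Lemma 3 can be checked numerically in the smallest case $A = \{0, 1\}$. The following is a rough sketch rather than anything from the lecture: it approximates the infimum in (4) by a grid search over kernels $\theta$ parametrized by $a = \theta(1 \mid 0)$ and $b = \theta(0 \mid 1)$, and all function names are hypothetical. For the uniform source it can be compared with the standard closed form $R(\delta) = 1 - h(\delta)$ for $0 \le \delta \le 1/2$, where $h$ is the binary entropy function.

```python
import math

def entropy(p):
    """Shannon entropy in bits of a probability vector p."""
    return -sum(x * math.log2(x) for x in p if x > 0)

def mutual_information(p, theta):
    """I(theta; p) in bits, where theta[a][b] is the probability of output b given input a."""
    out = [sum(p[a] * theta[a][b] for a in range(len(p))) for b in range(len(theta[0]))]
    return entropy(out) - sum(p[a] * entropy(theta[a]) for a in range(len(p)))

def rate_distortion_binary(p0, delta, steps=100):
    """Grid-search approximation of R(delta) for the source p = (p0, 1 - p0) on {0, 1}.

    The kernel is theta(1|0) = a and theta(0|1) = b, so the distortion constraint
    in (4) reads p0 * a + (1 - p0) * b <= delta.
    """
    p = [p0, 1 - p0]
    best = float("inf")
    for i in range(steps + 1):
        a = i / steps
        for j in range(steps + 1):
            b = j / steps
            # Small tolerance so that grid points on the boundary are not lost to rounding.
            if p[0] * a + p[1] * b <= delta + 1e-12:
                theta = [[1 - a, a], [b, 1 - b]]
                best = min(best, mutual_information(p, theta))
    return best

r_low = rate_distortion_binary(0.5, 0.05)
r_mid = rate_distortion_binary(0.5, 0.10)
r_high = rate_distortion_binary(0.5, 0.15)
print(r_low, r_mid, r_high)
```

The three printed values should decrease, and the middle one should not exceed the average of the other two, in line with the two properties asserted by Lemma 3. Because every grid point is a feasible kernel, the grid search can only overestimate $R(\delta)$; here the optimal kernel $a = b = \delta$ happens to lie on the grid.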

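Lemma 4. Let $(X, Y) = (X_1, \dots, X_n, Y_1, \dots, Y_n)$ be a RV taking values in $A^n \times B^n$, and assume that $X_1, \dots, X_n$ are independent. Then
\[ I(X ; Y) \ge \sum_{i=1}^n I(X_i ; Y_i). \]

Proof. First, the definition of mutual information and the independence of $X_1, \dots, X_n$ give
\[ I(X ; Y) = H(X) - H(X \mid Y) = \sum_{i=1}^n H(X_i) - H(X \mid Y). \]
On the other hand,
\[ H(X \mid Y) = H(X_1, \dots, X_n \mid Y) \le \sum_{i=1}^n H(X_i \mid Y) \quad \text{(subadditivity)} \]
\[ \phantom{H(X \mid Y)} \le \sum_{i=1}^n H(X_i \mid Y_i) \quad \text{(data processing: $Y$ determines $Y_i$ for each $i$).} \]
Therefore
\[ I(X ; Y) = \sum_{i=1}^n H(X_i) - H(X \mid Y) \ge \sum_{i=1}^n \big[H(X_i) - H(X_i \mid Y_i)\big] = \sum_{i=1}^n I(X_i ; Y_i). \]

Proof of converse in Theorem 1. Assume that for some $n$ we have an encoder $f : A^n \to A^n$ such that if $X \sim p^{\times n}$ then
\[ \mathbf{E}\, d_n(X, f(X)) = \frac{1}{n} \sum_{i=1}^n \mathbf{P}(f(X)_i \neq X_i) \le \delta. \tag{10} \]
Let $f(X) = Y = (Y_1, \dots, Y_n)$, and let $\delta_i := \mathbf{P}(X_i \neq Y_i)$ for each $i$. Then the above inequality may be re-written as
\[ \frac{1}{n}(\delta_1 + \cdots + \delta_n) \le \delta. \]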

On the other hand, since the coordinates of $X$ are independent and $X$ determines $Y$, Lemma 4 gives
\[ \log_2 |f(A^n)| \ge H(Y) = H(Y) - H(Y \mid X) = I(Y ; X) \ge \sum_{i=1}^n I(Y_i ; X_i). \]
By the definition of the rate-distortion function, this satisfies
\[ \sum_{i=1}^n I(Y_i ; X_i) \ge \sum_{i=1}^n R(\delta_i), \]
and by Lemma 3 this right-hand side is at least
\[ nR\Big(\frac{1}{n}(\delta_1 + \cdots + \delta_n)\Big) \ge nR(\delta). \]
Putting the above inequalities together gives $|f(A^n)| \ge 2^{R(\delta)n}$.

5 Notes and remarks

Sources for this lecture: [CT06, Chapter 10].

Further reading:

See [CT06, Sections 7.13 and 10.5] for the source-channel separation theorem.

Rate-distortion theory is easily generalized to handle different notions of similarity between a sourceword and its attempted recovery. To do this, one now considers two possibly different alphabets $A$ and $B$ and a general distortion function
\[ d : A \times B \to [0, \infty). \]
Then the normalized Hamming metric is replaced by the function
\[ d_n(x, y) := \frac{1}{n} \sum_{i=1}^n d(x_i, y_i) \]
of $x \in A^n$, $y \in B^n$. This is the theory covered in [CT06, Chapter 10]. See also [McE02, Chapter 3], and especially [McE02, Section 6.3] for a discussion of several further directions of research extending these results, such as the problem of explicitly constructing low-distortion codes.
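The averaged distortion $d_n$ and the converse bound $|f(A^n)| \ge 2^{R(\delta)n}$ can both be tried out on a toy example. The sketch below is illustrative only and rests on assumptions not made in the lecture: the source is uniform on $\{0, 1\}$, the encoder is a hypothetical majority-vote map on blocks of length 3, and $R(\delta)$ is taken from the standard closed form $1 - h(\delta)$ for that source. The per-letter distortion `d` is a parameter, with the Hamming distortion as the default.

```python
import itertools
import math

def hamming(a, b):
    """Per-letter Hamming distortion: 0 if the letters agree, 1 otherwise."""
    return 0.0 if a == b else 1.0

def avg_distortion(x, y, d=hamming):
    """d_n(x, y) = (1/n) * sum_i d(x_i, y_i) for words x, y of equal length n."""
    return sum(d(a, b) for a, b in zip(x, y)) / len(x)

def expected_distortion(p, n, f, d=hamming):
    """E[d_n(X, f(X))] for X drawn i.i.d. from p, computed by enumerating all words."""
    total = 0.0
    for x in itertools.product(range(len(p)), repeat=n):
        prob = math.prod(p[a] for a in x)
        total += prob * avg_distortion(x, f(x), d)
    return total

def majority_encoder(x):
    """Hypothetical encoder: replace a binary word by the constant word of its majority bit."""
    bit = 1 if 2 * sum(x) > len(x) else 0
    return (bit,) * len(x)

def h2(t):
    """Binary entropy function in bits."""
    if t in (0.0, 1.0):
        return 0.0
    return -t * math.log2(t) - (1 - t) * math.log2(1 - t)

n = 3
p = [0.5, 0.5]
delta = expected_distortion(p, n, majority_encoder)
image = {majority_encoder(x) for x in itertools.product(range(2), repeat=n)}
# Converse bound 2^{R(delta) n}, with R(delta) = 1 - h2(delta) assumed for the uniform binary source.
lower_bound = 2 ** ((1 - h2(delta)) * n)
print(delta, len(image), lower_bound)
```

The encoder uses only two codewords, and the printed lower bound is below 2, so the example is consistent with the converse. Passing a different function for `d` gives the more general averaged distortion described in the remarks above.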

Information theory analyses many other models than just finite strings over finite alphabets. For instance, a simple extension of the basic model is to allow larger collections of sources and destinations, connected by several encoders and decoders arranged in some directed network. Such a model is often called a multi-terminal communication system. The analysis of such a network (principally, finding its possible rates of data transmission subject to various constraints) is a rich and difficult subject. Some kinds of network have known complete solutions, often involving new and surprising ideas, such as fork networks, which are treated by the Slepian-Wolf theorem. Others do not seem to have tractable complete solutions. See [CT06, Chapter 15] for an introduction to this large area.

More recent developments in information theory have increasingly been interwoven into other branches of applied mathematics and especially theoretical computer science. MacKay's book [Mac03] can serve as a good introduction to these connections.

References

[CT06] Thomas M. Cover and Joy A. Thomas. Elements of information theory. Wiley-Interscience [John Wiley & Sons], Hoboken, NJ, second edition.

[Mac03] David J. C. MacKay. Information theory, inference and learning algorithms. Cambridge University Press, New York.

[McE02] R. J. McEliece. The theory of information and coding, volume 86 of Encyclopedia of Mathematics and its Applications. Cambridge University Press, Cambridge, second edition.

TIM AUSTIN
COURANT INSTITUTE OF MATHEMATICAL SCIENCES, NEW YORK UNIVERSITY, 251 MERCER ST, NEW YORK NY 10012, USA
tim@cims.nyu.edu
URL: cims.nyu.edu/ tim
