Subset Universal Lossy Compression

Or Ordentlich, Tel Aviv University, ordent@eng.tau.ac.il
Ofer Shayevitz, Tel Aviv University, ofersha@eng.tau.ac.il

Abstract

A lossy source code $\mathcal{C}$ with rate $R$ for a discrete memoryless source $S$ is called subset universal if for every $0 < R' < R$, almost every subset of $2^{nR'}$ of its codewords achieves average distortion close to the source's distortion-rate function $D(R')$. In this paper we prove the asymptotic existence of such codes. Moreover, we show the asymptotic existence of a code that is subset universal with respect to all sources with the same alphabet.

I. INTRODUCTION

To motivate the topic studied in this work, let us consider the following scenario. Let $S$ be a discrete source with a probability mass function (pmf) $P_S$ over an alphabet $\mathcal{S}$. Assume one wants to convey $n$ i.i.d. instances of $S$ over a deterministic channel $f : \mathcal{X}^n \to \mathcal{Y}^n$, with the smallest possible distortion. If the channel law $f(\cdot)$ is known at both ends, the channel corresponds to a bit pipe whose capacity is the normalized logarithm of the function's image size, i.e., $C = \frac{1}{n}\log|f(\mathcal{X}^n)|$. In this case a separation approach, where the source is first compressed with rate $C$ and then the compression index is transmitted over the channel, is optimal and achieves the best possible average distortion $D_{P_S}(C)$, where $D_{P_S}(\cdot)$ is the distortion-rate function of the source $P_S$.

When only the encoder knows the function $f(\cdot)$ but the decoder does not, it may be possible to learn the channel law if the class of possible functions is sufficiently small. For instance, if $f(\cdot)$ tensorizes, i.e., if $f(x^n) = (h(x_1),\ldots,h(x_n))$, it is possible to learn $f(\cdot)$ with negligible cost for $n$ large enough. In this case, the separation-based approach continues to be asymptotically optimal. Moreover, sometimes even partial knowledge of $f(\cdot)$ at the decoder is sufficient for separation to be optimal.
For example, if the channel is a binary memory device with $pn$ cells stuck at either 0 or 1, and whose locations are known only to the encoder, a reliable communication rate approaching $\frac{1}{n}\log|f(\mathcal{X}^n)|$ can be achieved via Gelfand-Pinsker coding [1], [2], and subsequently separation can asymptotically attain the optimal distortion.

A separation approach, however, is completely useless when $f(\cdot)$ can be any arbitrary function known only to the encoder. This follows from the fact that the capacity of the compound channel $\mathcal{F}$ that consists of all $|\mathcal{Y}^n|^{|\mathcal{X}^n|}$ possible functions from $\mathcal{X}^n$ to $\mathcal{Y}^n$ is zero. It is therefore tempting to jump to the conclusion that the smallest average distortion that can be attained in this setting is $D_{P_S}(0)$. But is this indeed the case? We will show that the answer is negative, and that in fact there exists a coding scheme that achieves average distortion $D_{P_S}\left(\frac{1}{n}\log|f(\mathcal{X}^n)|\right)$ for almost every deterministic channel.

(The work of O. Ordentlich was supported by the Adams Fellowship Program of the Israel Academy of Sciences and Humanities, a fellowship from The Yitzhak and Chaya Weinstein Research Institute for Signal Processing at Tel Aviv University and the Feder Family Award. The work of O. Shayevitz was supported by the Israel Science Foundation under grant agreement no. 367/4.)

Note that the encoder, which knows the function $f(\cdot)$, can deterministically impose any output from the set $f(\mathcal{X}^n) \subseteq \mathcal{Y}^n$. However, due to its ignorance, the decoder does not know how to translate the possible outputs to a vector of $\log|f(\mathcal{X}^n)|$ bits. Instead, the encoder and decoder can agree in advance on a mapping $g : \mathcal{Y}^n \to \hat{\mathcal{S}}^n$ from each possible channel output to a source reconstruction sequence. The set $g(\mathcal{Y}^n)$ of all possible reconstructions can be thought of as a source code $\mathcal{C}$ of rate $R = \log|\mathcal{Y}|$ bits per source symbol. The channel's effect is in diluting $\mathcal{C}$ to the source code $\mathcal{C}' = g(f(\mathcal{X}^n)) \subseteq \mathcal{C}$ whose rate is $R' = \frac{1}{n}\log|f(\mathcal{X}^n)|$ bits per symbol.
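The dilution of $\mathcal{C}$ into $\mathcal{C}'$ can be illustrated with a toy sketch. Everything below is a hypothetical setup chosen for illustration (binary alphabets, block length 4, a randomly drawn function table for $f$, and the identity map for $g$); none of these choices come from the paper.

```python
import math
import random
from itertools import product

def channel_image_rate(f, input_alphabet, n):
    """Return the image f(X^n) and the rate (1/n) * log2 |f(X^n)|."""
    image = {f(x) for x in product(input_alphabet, repeat=n)}
    return image, math.log2(len(image)) / n

def diluted_codebook(g, image):
    """Sub-codebook C' = g(f(X^n)) that remains reachable through the channel."""
    return {g(y) for y in image}

# Hypothetical toy setup: binary input and output alphabets, n = 4.
n = 4
rng = random.Random(0)
outputs = list(product((0, 1), repeat=n))
# An arbitrary deterministic channel, stored as a lookup table (assumption).
table = {x: rng.choice(outputs) for x in product((0, 1), repeat=n)}
f = table.__getitem__
g = lambda y: y  # identity reconstruction map (assumption)

image, rate = channel_image_rate(f, (0, 1), n)
sub = diluted_codebook(g, image)
print(f"|C'| = {len(sub)} codewords, R' = {rate:.3f} bits per symbol")
```

With an injective $g$, the number of surviving codewords equals the image size of $f$, so the rate of $\mathcal{C}'$ is exactly the bit-pipe capacity of the channel.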
In other words, a deterministic channel $f(\cdot)$ with capacity $R'$ chooses a subset of $2^{nR'}$ codewords from the $2^{nR}$ codewords in $\mathcal{C}$. This motivates the study of subset universal source codes. A subset universal source code of rate $R$ has the property that for any $0 < R' < R$, almost every subset of $2^{nR'}$ of its codewords is close to being optimal, in the sense that its induced average distortion is close to the distortion-rate function $D_{P_S}(R')$. Our main result is the asymptotic existence of such codes. In fact, we show the asymptotic existence of a code that is subset universal with respect to any i.i.d. source distribution over the alphabet $\mathcal{S}$. Returning to our discussion on joint source-channel coding over a deterministic channel known only at the encoder, the existence of a subset-universal code implies that for almost every choice of $f(\cdot)$, an average distortion of $D_{P_S}\left(\frac{1}{n}\log|f(\mathcal{X}^n)|\right)$ can be asymptotically achieved.

II. MAIN RESULT

The entropy of a random variable $Y$ with probability mass function (pmf) $P_Y$ over the alphabet $\mathcal{Y}$ is defined as

$$H(Y) = -\sum_{y\in\mathcal{Y}} P_Y(y)\log P_Y(y).$$

For a pair of random variables $(Y,Z)$ with a joint pmf $P_{YZ} = P_Y P_{Z|Y}$ over the product alphabet $\mathcal{Y}\times\mathcal{Z}$, the conditional entropy is defined as

$$H(Z|Y) = -\sum_{(y,z)\in\mathcal{Y}\times\mathcal{Z}} P_{YZ}(y,z)\log P_{Z|Y}(z|y),$$

and the mutual information is defined as

$$I(Y;Z) = H(Y) - H(Y|Z) = H(Z) - H(Z|Y).$$

For two distributions $P$ and $Q$ defined on the same alphabet $\mathcal{Z}$, the KL-divergence is defined as

$$D(P\|Q) = \sum_{z\in\mathcal{Z}} P(z)\log\frac{P(z)}{Q(z)}.$$

Let $S$ be a discrete memoryless source (DMS) over the alphabet $\mathcal{S}$ with pmf $P_S$, $\hat{\mathcal{S}}$ a reconstruction alphabet, and $d : \mathcal{S}\times\hat{\mathcal{S}} \to \mathbb{R}^+$ a bounded single-letter distortion measure. A $(2^{nR}, n)$ lossy source code consists of an encoder that assigns an index $m(s^n) \in \{1,2,\ldots,2^{nR}\}$ to each $n$-dimensional vector $s^n \in \mathcal{S}^n$, and a decoder that assigns an estimate $\hat{s}^n(m) \in \hat{\mathcal{S}}^n$ to each index $m \in \{1,2,\ldots,2^{nR}\}$. The set $\mathcal{C} = \{\hat{s}^n(1),\ldots,\hat{s}^n(2^{nR})\}$ constitutes the codebook. We will assume throughout that the encoder assigns to each source sequence the index $m^*(s^n)$ according to the optimal rule

$$m^*(s^n) = \arg\min_{m\in\{1,2,\ldots,2^{nR}\}} \frac{1}{n}\sum_{i=1}^{n} d\big(s_i, \hat{s}_i(m)\big),$$

where $\hat{s}_i(m)$ is the $i$th entry of $\hat{s}^n(m)$, such that the lossy source code is completely specified by the codebook $\mathcal{C}$ and the distortion measure $d(\cdot,\cdot)$.

The expected distortion associated with a lossy source code $\mathcal{C}$ is defined as

$$E_{S^n}\big(d(S^n,\hat{S}^n)\big) = E_{S^n}\left(\frac{1}{n}\sum_{i=1}^{n} d(S_i,\hat{S}_i)\right).$$

A distortion-rate pair $(D,R)$ is said to be achievable if there exists a sequence of $(2^{nR},n)$ codes with $\limsup_{n\to\infty} E_{S^n}\big(d(S^n,\hat{S}^n)\big) \le D$. The distortion-rate function $D_{P_S}(R)$ of the source $P_S$ is the infimum of distortions $D$ such that $(D,R)$ is achievable, and is given by [3], [4]

$$D_{P_S}(R) = \min_{P_{\hat{S}|S}\,:\,I(S;\hat{S})\le R}\ \sum_{s\in\mathcal{S},\,\hat{s}\in\hat{\mathcal{S}}} P_S(s)P_{\hat{S}|S}(\hat{s}|s)\,d(s,\hat{s}). \tag{1}$$

For a sequence of $(2^{nR},n)$ codes and $0 < R' < R$, we say that almost every subset with cardinality $2^{nR'}$ satisfies a certain property if the fraction of subsets with cardinality $2^{nR'}$ that do not satisfy the property vanishes with $n$.

In [5], Ziv proved that there exists a codebook with rate $R$ that asymptotically achieves the distortion-rate function $D(R)$, regardless of the underlying source distribution. His result holds for any stationary source and even for a certain class of nonstationary sources. Our main result, stated below, only deals with i.i.d.
sources with unknown distribution, and is hence less general than [5] in terms of the assumptions made on the source's distribution. However, it extends [5] in the sense that the distortion attained by subsets, and not just the full codebook, is shown to be universally optimal.

Theorem 1: Let $S$ be a DMS over the alphabet $\mathcal{S}$, $\hat{\mathcal{S}}$ a reconstruction alphabet, and $d : \mathcal{S}\times\hat{\mathcal{S}} \to \mathbb{R}^+$ a bounded distortion measure. For any $R > 0$, $\delta > 0$, there exists a sequence of $(2^{nR},n)$ codebooks $\mathcal{C} \subseteq \hat{\mathcal{S}}^n$ with the property that for any source pmf $P_S$ on $\mathcal{S}$ and any $0 < R' < R$, almost every subset of $2^{nR'}$ codewords from $\mathcal{C}$ achieves an expected distortion no greater than $D_{P_S}(R' - \delta)$.

Remark 1: Ziv's result [5] is based on first splitting the source sequence into $\ell$ blocks of length $k$ ($\ell \gg k$). Then, the best $(2^{kR},k)$ lossy source code, with respect to the average distortion among the $\ell$ blocks, is found, and conveyed to the receiver. Afterwards, each block is encoded using this source code and its index is sent to the receiver. If the number of blocks of length $k$ is large enough, the overhead for transmitting the codebook becomes negligible, which allows the scheme to perform well for a large class of distributions. This construction, however, is not subset universal. To see this, note that the empirical distribution of most codewords in the obtained codebook is that of the largest type. Thus, a random subset of $2^{nR'}$ codewords will contain a negligible fraction of codewords from other types, and in general cannot attain the distortion-rate function unless the optimal reconstruction distribution at rate $R'$ is equal to that of the largest type as well.

The proof of Theorem 1 relies on independently drawing the codewords of $\mathcal{C}$ from a mixture distribution, and will be given in Section IV. In the next section we lay the ground by proving certain properties of mixture distributions.

III. PROPERTIES OF MIXTURE DISTRIBUTIONS

We follow the notation of [6], and define the empirical pmf of an $n$-dimensional sequence $y^n$ with elements from $\mathcal{Y}$ as

$$\pi(y|y^n) = \frac{1}{n}\big|\{i : y_i = y\}\big| \quad \text{for } y\in\mathcal{Y}.$$

Similarly, the empirical pmf of a pair of $n$-dimensional sequences $(y^n,z^n)$ with elements from $\mathcal{Y}\times\mathcal{Z}$ is defined as

$$\pi(y,z|y^n,z^n) = \frac{1}{n}\big|\{i : (y_i,z_i) = (y,z)\}\big| \quad \text{for } (y,z)\in\mathcal{Y}\times\mathcal{Z}.$$

Let $P_{YZ} = P_Y P_{Z|Y}$ be a joint pmf on $\mathcal{Y}\times\mathcal{Z}$. The set of $\epsilon$-typical $n$-dimensional sequences w.r.t. $P_Y$ is defined as

$$T_\epsilon^{(n)}(P_Y) = \big\{y^n : |\pi(y|y^n) - P_Y(y)| \le \epsilon P_Y(y)\ \ \forall y\in\mathcal{Y}\big\},$$

and the set of jointly $\epsilon$-typical $n$-dimensional sequences w.r.t. $P_{YZ}$ is defined as

$$T_\epsilon^{(n)}(P_{YZ}) = \big\{(y^n,z^n) : |\pi(y,z|y^n,z^n) - P_{YZ}(y,z)| \le \epsilon P_{YZ}(y,z)\ \ \forall (y,z)\in\mathcal{Y}\times\mathcal{Z}\big\}.$$

We also define the set of conditionally $\epsilon$-typical $n$-dimensional sequences w.r.t. $P_{YZ}$ as

$$T_\epsilon^{(n)}(P_{YZ}\mid y^n) = \big\{z^n : (y^n,z^n)\in T_\epsilon^{(n)}(P_{YZ})\big\}.$$

The next statement follows from the definitions above [6].
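The empirical pmf and the typicality tests defined above translate almost directly into code. The following is a minimal sketch; the function names and the dictionary representation of a pmf are illustrative assumptions, not notation from the paper.

```python
from collections import Counter

def empirical_pmf(seq, alphabet):
    """pi(a | seq): fraction of positions in seq that equal a."""
    counts = Counter(seq)
    n = len(seq)
    return {a: counts[a] / n for a in alphabet}

def is_typical(seq, pmf, eps):
    """Check |pi(a|seq) - P(a)| <= eps * P(a) for every symbol a.

    pmf is a dict mapping symbols to probabilities (assumed representation).
    A sequence containing a symbol outside the support is never typical.
    """
    if any(a not in pmf for a in seq):
        return False
    pi = empirical_pmf(seq, pmf.keys())
    return all(abs(pi[a] - p) <= eps * p for a, p in pmf.items())

def is_jointly_typical(y_seq, z_seq, joint_pmf, eps):
    """Joint typicality: treat (y_i, z_i) pairs as symbols of one sequence."""
    return is_typical(list(zip(y_seq, z_seq)), joint_pmf, eps)

def is_conditionally_typical(z_seq, y_seq, joint_pmf, eps):
    """z^n is conditionally typical given y^n iff the pair is jointly typical."""
    return is_jointly_typical(y_seq, z_seq, joint_pmf, eps)

# Small usage example with a uniform binary pmf and a uniform joint pmf.
P_Y = {0: 0.5, 1: 0.5}
P_YZ = {(y, z): 0.25 for y in (0, 1) for z in (0, 1)}
balanced = is_typical([0, 0, 1, 1], P_Y, eps=0.1)
skewed = is_typical([0, 0, 0, 1], P_Y, eps=0.1)
joint = is_jointly_typical([0, 0, 1, 1], [0, 1, 0, 1], P_YZ, eps=0.1)
```

Because the tolerance is multiplicative in $P_Y(y)$, symbols with zero probability must not appear at all, which is why sequences containing out-of-support symbols are rejected.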

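Proposition 1: Let $y^n \in \mathcal{Y}^n$. For every $z^n \in T_\epsilon^{(n)}(P_{YZ}\mid y^n)$ we have $z^n \in T_\epsilon^{(n)}(P_Z)$. If in addition $y^n \in T_{\epsilon'}^{(n)}(P_Y)$ for some $\epsilon' < \epsilon$, then for $n$ large enough

$$\big|T_\epsilon^{(n)}(P_{YZ}\mid y^n)\big| \ge (1-\epsilon)\,2^{n(1-\epsilon)H(Z|Y)}.$$

Let $\mathcal{P}^{\mathcal{Z}}$ denote the simplex containing all probability mass functions on $\mathcal{Z}$. For every $\theta \in \mathcal{P}^{\mathcal{Z}}$, let $P_\theta(z)$ be the corresponding pmf evaluated at $z$. Let $w(\theta)$ be some probability density function on $\mathcal{P}^{\mathcal{Z}}$. We may now define the mixture distribution $Q$ as [7]

$$Q(z^n) = \int_{\theta\in\mathcal{P}^{\mathcal{Z}}} w(\theta)\prod_{i=1}^{n} P_\theta(z_i)\,d\theta. \tag{2}$$

The following propositions are proved in the appendix.

Proposition 2: Let $Z^n$ be a random $n$-dimensional sequence drawn according to $Q(z^n)$ defined in (2). Let $P_{YZ}$ be some pmf on $\mathcal{Y}\times\mathcal{Z}$, and let $y^n \in T_{\epsilon'}^{(n)}(P_Y)$ for some $\epsilon' < \epsilon$. For $n$ large enough, we have

$$\Pr\big(Z^n \in T_\epsilon^{(n)}(P_{YZ}\mid y^n)\big) \ge (1-\epsilon)\int_{\theta\in\mathcal{P}^{\mathcal{Z}}} w(\theta)\,2^{-n\left((1+\epsilon)\left(I(Y;Z)+D(P_Z\|P_\theta)\right)+2\epsilon H(Z|Y)\right)}\,d\theta.$$

Proposition 3: Let $P_Z$ be some distribution in $\mathcal{P}^{\mathcal{Z}}$. For any $0 < \xi < 1/|\mathcal{Z}|^2$, there exists a subset $V \subseteq \mathcal{P}^{\mathcal{Z}}$ with Lebesgue measure $\xi^{|\mathcal{Z}|-1}$ on $\mathcal{P}^{\mathcal{Z}}$, such that for all $P_\theta \in V$

$$D(P_Z\|P_\theta) < -\log\big(1-\xi|\mathcal{Z}|^2\big).$$

Combining Propositions 2 and 3 yields the following lemma.

Lemma 1: Let $Q(z^n)$ be as defined in (2), with $w(\theta)$ taken as the uniform distribution on $\mathcal{P}^{\mathcal{Z}}$, and let $Z^n$ be a random $n$-dimensional sequence drawn according to $Q(z^n)$. Let $P_{YZ}$ be some pmf on $\mathcal{Y}\times\mathcal{Z}$, and let $y^n \in T_{\epsilon'}^{(n)}(P_Y)$ for some $\epsilon' < \epsilon$. For $n$ large enough, we have

$$\Pr\big(Z^n \in T_\epsilon^{(n)}(P_{YZ}\mid y^n)\big) \ge 2^{-n\left(I(Y;Z)+\delta(\epsilon)\right)},$$

where $\delta(\epsilon) \to 0$ for $\epsilon \to 0$.

Proof. Let $c_{\mathcal{Z}}$ be the Lebesgue measure of the simplex $\mathcal{P}^{\mathcal{Z}}$. Clearly, we have that $w(\theta) = 1/c_{\mathcal{Z}}$ for $\theta \in \mathcal{P}^{\mathcal{Z}}$ and $w(\theta) = 0$ otherwise. Setting $\xi = (1-2^{-\epsilon})/|\mathcal{Z}|^2$ in Proposition 3 implies that there exists a set $V \subseteq \mathcal{P}^{\mathcal{Z}}$ with volume $\big((1-2^{-\epsilon})/|\mathcal{Z}|^2\big)^{|\mathcal{Z}|-1}$ such that $D(P_Z\|P_\theta) < \epsilon$ for any $P_\theta \in V$. Combining this with Proposition 2 gives

$$\begin{aligned}
\Pr\big(Z^n \in T_\epsilon^{(n)}(P_{YZ}\mid y^n)\big)
&> (1-\epsilon)\int_{\theta\in V} w(\theta)\,2^{-n\left((1+\epsilon)(I(Y;Z)+\epsilon)+2\epsilon H(Z|Y)\right)}\,d\theta \\
&= \frac{1-\epsilon}{c_{\mathcal{Z}}}\left(\frac{1-2^{-\epsilon}}{|\mathcal{Z}|^2}\right)^{|\mathcal{Z}|-1} 2^{-n\left((1+\epsilon)(I(Y;Z)+\epsilon)+2\epsilon H(Z|Y)\right)} \\
&\ge 2^{-n\left(I(Y;Z)+\delta(\epsilon)\right)},
\end{aligned}$$

where $\delta(\epsilon) \to 0$ for $\epsilon \to 0$.

IV. PROOF OF THEOREM 1

Random codebook generation: Let

$$Q(\hat{s}^n) = \int_{\theta\in\mathcal{P}^{\hat{\mathcal{S}}}} w(\theta)\prod_{i=1}^{n} P_\theta(\hat{s}_i)\,d\theta,$$

where $\mathcal{P}^{\hat{\mathcal{S}}}$ is the simplex containing all probability mass functions on $\hat{\mathcal{S}}$ and $w(\theta)$ is the uniform distribution on $\mathcal{P}^{\hat{\mathcal{S}}}$. Randomly and independently generate $2^{nR}$ sequences $\hat{s}^n(m)$, $m \in \{1,2,\ldots,2^{nR}\}$, each according to $Q(\hat{s}^n)$. These sequences constitute the full codebook $\mathcal{C}$. A subset of the codebook, indexed by $(\mathcal{I},R')$, consists of an index set $\mathcal{I} \subseteq \{1,2,\ldots,2^{nR}\}$ with cardinality $|\mathcal{I}| = 2^{nR'}$, $0 < R' < R$, and the corresponding sequences $\hat{s}^n(m)$, $m \in \mathcal{I}$. An arbitrary $(\mathcal{I},R')$ subset of $\mathcal{C}$ is revealed to the encoder and the decoder.

Encoding: Given the source sequence $s^n$ and a subset $(\mathcal{I},R')$ of $\mathcal{C}$, the optimal encoder sends the index of the codeword that achieves the minimal distortion, i.e., it sends

$$m^* = \arg\min_{m\in\mathcal{I}} \frac{1}{n}\sum_{i=1}^{n} d\big(s_i,\hat{s}_i(m)\big). \tag{3}$$

Note that the optimal encoder (3) is universal, i.e., it does not have to know the true underlying source pmf, and that such knowledge can in no way improve its performance.

Decoding: Upon receiving the index $m^*$, the decoder simply sets the reconstruction sequence as $\hat{s}^n(m^*)$.

Analysis: Let us define a quantized grid of rates in $[0,R)$ whose resolution is $\Delta > 0$:

$$\mathcal{R}_\Delta = \big\{R_j : R_j = j\Delta,\ j = 1,\ldots,\lfloor R/\Delta\rfloor\big\}.$$

We further define a set of quantized distributions as follows. For $0 < \Delta < 1$ and $0 < p_{\min} < 1$ define the set of points

$$G_{\Delta,p_{\min},|\mathcal{S}|} = p_{\min}\cdot\left\{1,\ \Big(1+\tfrac{\Delta}{|\mathcal{S}|}\Big),\ \Big(1+\tfrac{\Delta}{|\mathcal{S}|}\Big)^2,\ \ldots,\ \Big(1+\tfrac{\Delta}{|\mathcal{S}|}\Big)^{\left\lceil \frac{-\log p_{\min}}{\log(1+\Delta/|\mathcal{S}|)}\right\rceil}\right\}.$$

Our set of quantized distributions is denoted by $\mathcal{P}^{\mathcal{S}}_\Delta$ and consists of all vectors in the simplex $\mathcal{P}^{\mathcal{S}}$ with at least $|\mathcal{S}|-1$ components that belong to $G_{\Delta,p_{\min},|\mathcal{S}|}$. Clearly, the cardinality $|\mathcal{P}^{\mathcal{S}}_\Delta|$ of this set is a function of only $|\mathcal{S}|$, $p_{\min}$ and $\Delta$. Moreover, for any pmf $P_\theta \in \mathcal{P}^{\mathcal{S}}$ with the property that $\min_{s\in\mathcal{S}} P_\theta(s) > (1-\Delta/|\mathcal{S}|)\,p_{\min}$, there exists a pmf $\bar{P}_\theta \in \mathcal{P}^{\mathcal{S}}_\Delta$ such that $|P_\theta(s) - \bar{P}_\theta(s)| \le \Delta P_\theta(s)$ for all $s \in \mathcal{S}$. To see this, construct $\bar{P}_\theta$ by quantizing all but the greatest entry of $P_\theta$ to the nearest point in $G_{\Delta,p_{\min},|\mathcal{S}|}$ and set the remaining entry such that $\sum_{s\in\mathcal{S}} \bar{P}_\theta(s) = 1$.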
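The random codebook generation and the optimal encoder (3) from the beginning of this section can be sketched in a few lines. This is an illustrative sketch only: the block length, rates, binary alphabet, Hamming distortion and source pmf below are assumptions chosen to keep the example small. Drawing $\theta$ uniformly from the simplex is done by normalizing independent unit-rate exponentials, which yields a Dirichlet$(1,\ldots,1)$ vector.

```python
import random

def draw_mixture_codeword(alphabet, n, rng):
    """Draw theta uniformly on the simplex, then n i.i.d. symbols from P_theta."""
    # Normalized i.i.d. Exp(1) variables are Dirichlet(1,...,1), i.e. uniform.
    g = [rng.expovariate(1.0) for _ in alphabet]
    total = sum(g)
    theta = [x / total for x in g]
    return tuple(rng.choices(alphabet, weights=theta, k=n))

def generate_codebook(alphabet, n, rate, rng):
    """Independently draw about 2^{n * rate} codewords from the mixture Q."""
    size = max(1, round(2 ** (n * rate)))
    return [draw_mixture_codeword(alphabet, n, rng) for _ in range(size)]

def distortion(s, s_hat, d):
    """Per-letter average distortion (1/n) * sum_i d(s_i, s_hat_i)."""
    return sum(d(a, b) for a, b in zip(s, s_hat)) / len(s)

def encode(s, codebook, subset, d):
    """Optimal encoder (3): index in the subset with minimal distortion."""
    return min(subset, key=lambda m: distortion(s, codebook[m], d))

def hamming(a, b):
    return 0.0 if a == b else 1.0

# Hypothetical toy parameters (not from the paper).
rng = random.Random(1)
n, R, R_sub = 12, 0.75, 0.5
codebook = generate_codebook((0, 1), n, R, rng)
subset = rng.sample(range(len(codebook)), round(2 ** (n * R_sub)))
s = tuple(rng.choices((0, 1), weights=(0.8, 0.2), k=n))
m_star = encode(s, codebook, subset, hamming)
s_hat = codebook[m_star]  # decoder output
```

Note that `encode` never looks at the source statistics, which mirrors the universality of the optimal encoder: only the revealed subset and the distortion measure are needed.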

Let $\mathcal{P}^{\mathcal{S}}_{p_{\min}}$ be the set of all probability mass functions on $\mathcal{S}$ whose minimal mass is smaller than $p_{\min}$. We can partition the remaining part of the simplex as

$$\mathcal{P}^{\mathcal{S}} \setminus \mathcal{P}^{\mathcal{S}}_{p_{\min}} = \bigcup_{\bar{P}_S\in\mathcal{P}^{\mathcal{S}}_\Delta} \mathcal{P}(\bar{P}_S),$$

where the sets $\mathcal{P}(\bar{P}_S) \subseteq \mathcal{P}^{\mathcal{S}}$ are disjoint and satisfy the property that for any $P_\theta \in \mathcal{P}(\bar{P}_S)$ we have $|P_\theta(s) - \bar{P}_S(s)| < \Delta\bar{P}_S(s)$ for all $s \in \mathcal{S}$.

Let $R' \in \mathcal{R}_\Delta$ and $\bar{P}_S \in \mathcal{P}^{\mathcal{S}}_\Delta$. We will next show that almost every codebook in our ensemble satisfies the property that almost every subset of $2^{nR'}$ of its codewords achieves average distortion no greater than $D_{\bar{P}_S}(R'-\delta)$ with respect to all sources whose pmf belongs to $\mathcal{P}(\bar{P}_S)$. Since $|\mathcal{R}_\Delta|\cdot|\mathcal{P}^{\mathcal{S}}_\Delta|$ does not increase with $n$, there must exist a codebook that simultaneously satisfies this property for all $R' \in \mathcal{R}_\Delta$ and $\bar{P}_S \in \mathcal{P}^{\mathcal{S}}_\Delta$. The theorem will then follow by taking $\Delta \to 0$, $p_{\min} \to 0$, and using the continuity of $D_{P_S}(R)$ w.r.t. $R$ and $P_S$ for a bounded distortion measure $d$ [8].

Let $R' \in \mathcal{R}_\Delta$, $\bar{P}_S \in \mathcal{P}^{\mathcal{S}}_\Delta$ and $P_S \in \mathcal{P}(\bar{P}_S)$. We upper bound the average distortion of a given subset $(\mathcal{I},R')$ of $\mathcal{C}$ over the ensemble of codebooks and a DMS $S^n$ with pmf $P_S$. To that end, rather than analyzing the average distortion attained by the encoder (3), we analyze the average distortion attained by the following suboptimal encoder. The suboptimal encoder first solves the minimization problem

$$P^{R'}_{\hat{S}|S} = \arg\min_{P_{\hat{S}|S}\,:\,I(S;\hat{S})\le R'-\delta}\ \sum_{s\in\mathcal{S},\,\hat{s}\in\hat{\mathcal{S}}} \bar{P}_S(s)P_{\hat{S}|S}(\hat{s}|s)\,d(s,\hat{s}),$$

and sets $P^{R'}_{S\hat{S}} = \bar{P}_S P^{R'}_{\hat{S}|S}$ as the target joint pmf. Then, given a source sequence $s^n$, it looks for the smallest index $m \in \mathcal{I}$ such that $(s^n,\hat{s}^n(m)) \in T_\epsilon^{(n)}(P^{R'}_{S\hat{S}})$ and sends it to the decoder. If no such index is found, the smallest $m \in \mathcal{I}$ is sent to the decoder. Note that this suboptimal encoder is not universal, as it requires knowledge of $P_S$. Nevertheless, its average distortion is by definition no smaller than that of the optimal encoder, which is universal, and will therefore indeed serve as an upper bound on the distortion of the universal encoder. Let $S^n$ be a random source drawn i.i.d.
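according to $P_S$ and define $\hat{S}^n$ as the reconstruction obtained by applying the suboptimal encoder (and the decoder described above). Define the error event $E(\mathcal{I},R')$ as

$$E(\mathcal{I},R') = \big\{(S^n,\hat{s}^n(m)) \notin T_\epsilon^{(n)}(P^{R'}_{S\hat{S}})\ \text{for all } m\in\mathcal{I}\big\}. \tag{4}$$

By the definition of $P^{R'}_{S\hat{S}}$ and the definition of the typical set, we have that if $E(\mathcal{I},R')$ did not occur

$$d(S^n,\hat{S}^n) \le (1+\epsilon)\,D_{\bar{P}_S}(R'-\delta), \tag{5}$$

where $D_{\bar{P}_S}(\cdot)$ is the distortion-rate function (1) with respect to $\bar{P}_S$. Further, the error event satisfies $E(\mathcal{I},R') \subseteq E_1 \cup E_2(\mathcal{I},R')$, where

$$\begin{aligned}
E_1 &= \big\{S^n \notin T_{\epsilon'}^{(n)}(\bar{P}_S)\big\}, \\
E_2(\mathcal{I},R') &= \big\{S^n \in T_{\epsilon'}^{(n)}(\bar{P}_S),\ \hat{s}^n(m) \notin T_\epsilon^{(n)}(P^{R'}_{S\hat{S}}\mid S^n)\ \text{for all } m\in\mathcal{I}\big\},
\end{aligned} \tag{6}$$

and $0 < \Delta < \epsilon' < \epsilon$. Note that $E_1$ is independent of $\mathcal{I}$ and that $\Pr(E_1) \to 0$ as $n \to \infty$ by the (weak) law of large numbers and the fact that $\Delta < \epsilon'$. The second error event does depend on $\mathcal{I}$. We will now show that its expected probability with respect to the ensemble of codebooks, $E_{\mathcal{C}}\big(\Pr(E_2(\mathcal{I},R'))\big)$, vanishes for almost all $\mathcal{I} \subseteq \{1,2,\ldots,2^{nR}\}$ of cardinality $2^{nR'}$.

Let $U$ be a random subset of indices drawn from the uniform distribution on all subsets of $\{1,\ldots,2^{nR}\}$ with cardinality $2^{nR'}$. By the random symmetric generation process of the codewords in $\mathcal{C}$, the value of $E_{\mathcal{C}|U}\big(\Pr(E_2(U,R'))\mid U=\mathcal{I}\big)$ depends on the index set $\mathcal{I}$ only through its cardinality $2^{nR'}$. We therefore have

$$\begin{aligned}
E_{\mathcal{C},U}\big(\Pr(E_2(U,R'))\big)
&= E_U\,E_{\mathcal{C}|U}\big(\Pr(E_2(U,R'))\mid U\big) \\
&= E_U \sum_{s^n\in T_{\epsilon'}^{(n)}(\bar{P}_S)} P_{S^n}(s^n)\,\Pr\Big(\hat{S}^n(m) \notin T_\epsilon^{(n)}(P^{R'}_{S\hat{S}}\mid s^n)\ \text{for all } m\in U \,\Big|\, U\Big) \\
&= E_U \sum_{s^n\in T_{\epsilon'}^{(n)}(\bar{P}_S)} P_{S^n}(s^n) \prod_{m\in U} \Pr\Big(\hat{S}^n(m) \notin T_\epsilon^{(n)}(P^{R'}_{S\hat{S}}\mid s^n) \,\Big|\, U\Big) \\
&= \sum_{s^n\in T_{\epsilon'}^{(n)}(\bar{P}_S)} P_{S^n}(s^n) \Big(\Pr\big(\hat{S}^n(1) \notin T_\epsilon^{(n)}(P^{R'}_{S\hat{S}}\mid s^n)\big)\Big)^{2^{nR'}}.
\end{aligned} \tag{7}$$

By Lemma 1 we have

$$\Pr\big(\hat{S}^n(1) \in T_\epsilon^{(n)}(P^{R'}_{S\hat{S}}\mid s^n)\big) \ge 2^{-n\left(I(S;\hat{S}^{R'})+\delta(\epsilon)\right)},$$

where $I(S;\hat{S}^{R'})$ is the mutual information between $S$ and $\hat{S}$ under the target joint pmf $P^{R'}_{S\hat{S}}$. This implies that

$$\begin{aligned}
\Big(\Pr\big(\hat{S}^n(1) \notin T_\epsilon^{(n)}(P^{R'}_{S\hat{S}}\mid s^n)\big)\Big)^{2^{nR'}}
&\le \Big(1-2^{-n\left(I(S;\hat{S}^{R'})+\delta(\epsilon)\right)}\Big)^{2^{nR'}} \\
&\le \exp\Big\{-2^{n\left(R'-I(S;\hat{S}^{R'})-\delta(\epsilon)\right)}\Big\} && (8) \\
&\le \exp\Big\{-2^{n(\delta-\delta(\epsilon))}\Big\}, && (9)
\end{aligned}$$

where (8) follows from the inequality $(1-x)^k \le e^{-kx}$, which holds for $x \in [0,1]$, and (9) follows from the definition of $P^{R'}_{\hat{S}|S}$, which guarantees $I(S;\hat{S}^{R'}) \le R'-\delta$. Combining (7) and (9) we have

$$E_{\mathcal{C},U}\big(\Pr(E_2(U,R'))\big) \le \exp\Big\{-2^{n(\delta-\delta(\epsilon))}\Big\}. \tag{10}$$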
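Step (8) and the resulting double-exponential decay can be checked numerically. The sketch below evaluates both sides of $(1-x)^k \le e^{-kx}$ with $x = 2^{-n(I+\delta(\epsilon))}$ and $k = 2^{nR'}$; the numeric values of $R'$, $I$ and $\delta(\epsilon)$ are arbitrary illustrative assumptions, chosen so that $R' - I - \delta(\epsilon) > 0$.

```python
import math

def miss_bounds(n, rate_sub, mutual_info, delta_eps):
    """Return ((1 - x)^k, exp(-k x)) for x = 2^{-n(I + delta)}, k = 2^{n R'}."""
    x = 2.0 ** (-n * (mutual_info + delta_eps))
    k = 2.0 ** (n * rate_sub)
    # (1 - x)^k computed as exp(k * log(1 - x)) for numerical stability.
    power_form = math.exp(k * math.log1p(-x))
    exp_form = math.exp(-k * x)
    return power_form, exp_form

# Hypothetical values: R' = 0.5, I(S; S_hat) = 0.4, delta(eps) = 0.02.
block_lengths = (10, 20, 40, 80)
results = [miss_bounds(n, 0.5, 0.4, 0.02) for n in block_lengths]
for n, (power_form, exp_form) in zip(block_lengths, results):
    print(f"n={n:3d}  (1-x)^k={power_form:.3e}  exp(-kx)={exp_form:.3e}")
```

The exponent $kx = 2^{n(R'-I-\delta(\epsilon))}$ itself grows exponentially in $n$, which is why the bound on the miss probability collapses so quickly once the subset rate exceeds the target mutual information by a positive margin.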

Thus, for any $\delta > 0$ we may take $0 < \Delta < \epsilon' < \epsilon$ sufficiently small such that $\delta(\epsilon) < \delta$, and (10) can be made arbitrarily small by increasing $n$. The proof is concluded by applying the following arguments.

By Markov's inequality, (10) implies that for every $P_S \in \mathcal{P}(\bar{P}_S)$ and almost every subset $\mathcal{I} \subseteq \{1,2,\ldots,2^{nR}\}$ of cardinality $2^{nR'}$ the expectation $E_{\mathcal{C}}\big(\Pr(E_2(\mathcal{I},R'))\big)$ vanishes for large $n$.

Applying Markov's inequality again, this time with respect to $\mathcal{C}$, we see that almost all codebooks in the ensemble satisfy that for every $P_S \in \mathcal{P}(\bar{P}_S)$ and almost every subset $\mathcal{I} \subseteq \{1,2,\ldots,2^{nR}\}$ of cardinality $2^{nR'}$ the probability $\Pr(E_2(\mathcal{I},R'))$ vanishes for large $n$.

As discussed above, $\Pr(E_1) \to 0$ as $n \to \infty$ for every $P_S \in \mathcal{P}(\bar{P}_S)$, regardless of $\mathcal{I}$ and $\mathcal{C}$.

By the union bound, $\Pr(E(\mathcal{I},R')) \le \Pr(E_1) + \Pr(E_2(\mathcal{I},R'))$. We therefore have that almost all codebooks in the ensemble satisfy that for every $P_S \in \mathcal{P}(\bar{P}_S)$ and almost every subset $\mathcal{I} \subseteq \{1,2,\ldots,2^{nR}\}$ of cardinality $2^{nR'}$ the probability $\Pr(E(\mathcal{I},R'))$ vanishes for large $n$.

Since the distortion measure $d$ is bounded, this together with (5) implies that almost all codebooks in the ensemble satisfy that for every $P_S \in \mathcal{P}(\bar{P}_S)$ and almost every subset $\mathcal{I} \subseteq \{1,2,\ldots,2^{nR}\}$ of cardinality $2^{nR'}$ the average distortion approaches $(1+\epsilon)D_{\bar{P}_S}(R'-\delta)$ as $n$ increases.

Since $|\mathcal{R}_\Delta|\cdot|\mathcal{P}^{\mathcal{S}}_\Delta|$ does not increase with $n$, there must exist a sequence of codebooks whose average distortion approaches $(1+\epsilon)D_{\bar{P}_S}(R'-\delta)$ for almost every subset $\mathcal{I} \subseteq \{1,2,\ldots,2^{nR}\}$ of cardinality $2^{nR'}$, simultaneously for all $R' \in \mathcal{R}_\Delta$, $\bar{P}_S \in \mathcal{P}^{\mathcal{S}}_\Delta$ and $P_S \in \mathcal{P}(\bar{P}_S)$.

The theorem follows by taking $\epsilon \to 0$, $p_{\min} \to 0$ and using the continuity of the function $D_{P_S}(R)$ with respect to $P_S$ and $R$ for a bounded distortion measure $d$ [8].

APPENDIX

Proof of Proposition 2. From Proposition 1 and the definition of $\epsilon$-typical sequences, we have that $\pi(z|z^n) \le (1+\epsilon)P_Z(z)$ for any $z^n \in T_\epsilon^{(n)}(P_{YZ}\mid y^n)$.
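We can therefore write

$$\begin{aligned}
\Pr\big(Z^n \in T_\epsilon^{(n)}(P_{YZ}\mid y^n)\big)
&= \sum_{z^n\in T_\epsilon^{(n)}(P_{YZ}\mid y^n)} Q(z^n) \\
&= \sum_{z^n\in T_\epsilon^{(n)}(P_{YZ}\mid y^n)} \int w(\theta)\prod_{i=1}^{n} P_\theta(z_i)\,d\theta \\
&\ge \sum_{z^n\in T_\epsilon^{(n)}(P_{YZ}\mid y^n)} \int w(\theta)\prod_{z\in\mathcal{Z}} P_\theta(z)^{n(1+\epsilon)P_Z(z)}\,d\theta \\
&\ge (1-\epsilon)\,2^{n(1-\epsilon)H(Z|Y)} \int w(\theta)\,2^{n(1+\epsilon)\sum_{z\in\mathcal{Z}} P_Z(z)\log P_\theta(z)}\,d\theta \\
&= (1-\epsilon)\int w(\theta)\,2^{nE}\,d\theta,
\end{aligned}$$

where the last inequality follows from Proposition 1, and

$$\begin{aligned}
E &= (1-\epsilon)H(Z|Y) + (1+\epsilon)\sum_{z\in\mathcal{Z}} P_Z(z)\log P_\theta(z) \\
&= -2\epsilon H(Z|Y) + (1+\epsilon)\big(H(Z|Y) - D(P_Z\|P_\theta) - H(Z)\big) \\
&= -\Big((1+\epsilon)\big(I(Y;Z) + D(P_Z\|P_\theta)\big) + 2\epsilon H(Z|Y)\Big),
\end{aligned}$$

as desired.

Proof of Proposition 3. Without loss of generality, we may assume $\mathcal{Z} = \{1,\ldots,|\mathcal{Z}|\}$ and that $P_Z(1) \le P_Z(2) \le \cdots \le P_Z(|\mathcal{Z}|)$. Note that under these assumptions $1/|\mathcal{Z}| \le P_Z(|\mathcal{Z}|)$, and $P_Z(|\mathcal{Z}|-1) \le 1/2$. Let us define the perturbation set

$$U = \Big\{(u_1,\ldots,u_{|\mathcal{Z}|}) : 0 \le u_i < \xi \ \text{for } i = 1,\ldots,|\mathcal{Z}|-1,\ \ u_{|\mathcal{Z}|} = -\sum_{i=1}^{|\mathcal{Z}|-1} u_i\Big\},$$

and the set $V = P_Z + U$, where the sum is in the Minkowski sense. Clearly, $V \subseteq \mathcal{P}^{\mathcal{Z}}$ for any $0 < \xi < 1/|\mathcal{Z}|^2$, and its Lebesgue measure on the simplex $\mathcal{P}^{\mathcal{Z}}$ is $|V| = \xi^{|\mathcal{Z}|-1}$. In addition, for any $P_\theta \in V$ we have

$$\begin{aligned}
D(P_Z\|P_\theta) &= \sum_{z\in\mathcal{Z}} P_Z(z)\log\frac{P_Z(z)}{P_\theta(z)} \\
&< P_Z(|\mathcal{Z}|)\log\frac{P_Z(|\mathcal{Z}|)}{P_Z(|\mathcal{Z}|)-\xi|\mathcal{Z}|} \\
&\le \log\frac{1/|\mathcal{Z}|}{1/|\mathcal{Z}|-\xi|\mathcal{Z}|} \\
&= -\log\big(1-\xi|\mathcal{Z}|^2\big).
\end{aligned}$$

REFERENCES

[1] S. I. Gelfand and M. S. Pinsker, "Coding for channels with random parameters," Problems of Control and Information Theory, vol. 9, no. 1, pp. 19-31, 1980.
[2] C. Heegard and A. El Gamal, "On the capacity of computer memory with defects," IEEE Transactions on Information Theory, vol. 29, no. 5, pp. 731-739, Sep. 1983.
[3] T. Berger, Rate Distortion Theory: A Mathematical Basis for Data Compression. Englewood Cliffs, NJ: Prentice-Hall, 1971.
[4] T. Berger and J. Gibson, "Lossy source coding," IEEE Transactions on Information Theory, vol. 44, no. 6, pp. 2693-2723, Oct. 1998.
[5] J. Ziv, "Coding of sources with unknown statistics II: Distortion relative to a fidelity criterion," IEEE Transactions on Information Theory, vol. 18, no. 3, pp. 389-394, May 1972.
[6] A. El Gamal and Y.-H. Kim, Network Information Theory. Cambridge University Press, 2011.
[7] N. Merhav and M. Feder, "Universal prediction," IEEE Transactions on Information Theory, vol. 44, no. 6, pp. 2124-2147, Oct. 1998.
[8] I. Csiszár and J. Körner, Information Theory: Coding Theorems for Discrete Memoryless Systems. New York: Academic Press, 1982.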