The Empirical Distribution of Rate-Constrained Source Codes


The Empirical Distribution of Rate-Constrained Source Codes

Tsachy Weissman, Erik Ordentlich
HP Laboratories Palo Alto
HPL, December 8th, 2003*
tsachy@stanford.edu, eord@hpl.hp.com

Keywords: rate-distortion theory, sample converse, denoising, empirical distribution

Abstract: Let $\mathbf{X} = (X_1, X_2, \ldots)$ be a stationary ergodic finite-alphabet source, $X^n$ denote its first $n$ symbols, and $Y^n$ be the codeword assigned to $X^n$ by a lossy source code. The empirical $k$-th-order joint distribution $\hat{Q}_k[X^n, Y^n](x^k, y^k)$ is defined as the frequency of appearances of pairs of $k$-strings $(x^k, y^k)$ along the pair $(X^n, Y^n)$. Our main interest is in the sample behavior of this random distribution. Letting $I_k(Q)$ denote the mutual information $I(X^k; Y^k)$ when $(X^k, Y^k) \sim Q$, we show that for any sequence of lossy source codes of rate $R$,
  $\limsup_{n\to\infty} \frac{1}{k} I_k(\hat{Q}_k[X^n, Y^n]) \le R + \frac{1}{k} H(X^k) - H(\mathbf{X})$  a.s.,
where $H(\mathbf{X})$ denotes the entropy rate of $\mathbf{X}$. This is shown to imply, for a large class of sources including all i.i.d. sources and all sources satisfying the Shannon lower bound with equality, that for any sequence of codes which is good in the sense of asymptotically attaining a point on the rate distortion curve,
  $\hat{Q}_k[X^n, Y^n] \stackrel{d}{\to} P_{X^k, \tilde{Y}^k}$  a.s.,
whenever $P_{X^k, \tilde{Y}^k}$ is the unique distribution attaining the minimum in the definition of the $k$-th-order rate distortion function. Further consequences of these results are explored. These include a simple proof of Kieffer's sample converse to lossy source coding, as well as pointwise performance bounds for compression-based denoisers.

* Internal Accession Date Only. Approved for External Publication.
Copyright Hewlett-Packard Company 2003

The Empirical Distribution of Rate-Constrained Source Codes

Tsachy Weissman    Erik Ordentlich

November 6, 2003

T. Weissman is with the Electrical Engineering Department, Stanford University, CA, USA (tsachy@stanford.edu). Part of this work was done while T. Weissman was visiting Hewlett-Packard Laboratories, Palo Alto, CA, USA. T. Weissman was partially supported by NSF grants DMS and CCR. E. Ordentlich is with Hewlett-Packard Laboratories, Palo Alto, CA, USA (eord@hpl.hp.com).

Abstract

Let $\mathbf{X} = (X_1, X_2, \ldots)$ be a stationary ergodic finite-alphabet source, $X^n$ denote its first $n$ symbols, and $Y^n$ be the codeword assigned to $X^n$ by a lossy source code. The empirical $k$-th-order joint distribution $\hat{Q}_k[X^n, Y^n](x^k, y^k)$ is defined as the frequency of appearances of pairs of $k$-strings $(x^k, y^k)$ along the pair $(X^n, Y^n)$. Our main interest is in the sample behavior of this random distribution. Letting $I_k(Q)$ denote the mutual information $I(X^k; Y^k)$ when $(X^k, Y^k) \sim Q$, we show that for any sequence of lossy source codes of rate $R$,
  $\limsup_{n\to\infty} \frac{1}{k} I_k(\hat{Q}_k[X^n, Y^n]) \le R + \frac{1}{k} H(X^k) - H(\mathbf{X})$  a.s.,
where $H(\mathbf{X})$ denotes the entropy rate of $\mathbf{X}$. This is shown to imply, for a large class of sources including all i.i.d. sources and all sources satisfying the Shannon lower bound with equality, that for any sequence of codes which is good in the sense of asymptotically attaining a point on the rate distortion curve,
  $\hat{Q}_k[X^n, Y^n] \stackrel{d}{\to} P_{X^k, \tilde{Y}^k}$  a.s.,
whenever $P_{X^k, \tilde{Y}^k}$ is the unique distribution attaining the minimum in the definition of the $k$-th-order rate distortion function. Further consequences of these results are explored. These include a simple proof of Kieffer's sample converse to lossy source coding, as well as pointwise performance bounds for compression-based denoisers.

1 Introduction

Loosely speaking, a rate distortion code sequence is considered good for a given source if it attains a point on its rate distortion curve. The existence of good code sequences with empirical distributions close to those achieving the minimum mutual information in the definition of the rate distortion function is a consequence of the random coding argument at the heart of the achievability part of rate distortion theory. It turns out, however, in ways that we quantify in this work, that any good code sequence must have this property.

This behavior of the empirical distribution of good rate distortion codes is somewhat analogous to that of good channel codes, which was characterized by Shamai and Verdú in [SV97]. Defining the $k$-th-order empirical distribution of a channel code as the proportion of $k$-strings anywhere in the codebook equal to every given $k$-string, this empirical distribution was shown to converge to the capacity-achieving channel input distribution. The analogy is more than merely qualitative. For example, the growth rate of $k$ with $n$ at which this convergence was shown in [SV97] to break down will be seen to have an analogue in our setting. A slight difference between the problems is that in the setting of [SV97] the whole codebook of a good code was shown to be well-behaved, in the sense that the empirical distribution

is defined as an average over all codewords. In the source coding analogue, it is clear that no meaningful statement can be made about the empirical distribution obtained when averaging, with equal weights, over all codewords in the codebook. The reason is that any approximately good codebook will remain approximately good when appending to it a polynomial number of additional codebooks of the same size, so essentially any empirical distribution induced by the new codebook can be created while maintaining the goodness of the codebook. We therefore adopt in this work what seems to be a more meaningful analogue to the empirical distribution notion of [SV97], namely the empirical distribution of the codeword corresponding to the particular source realization, or the joint empirical distribution of the source realization and its associated codeword. We shall show several strong probabilistic senses in which the latter entities converge to the distributions attaining the minimum in the associated rate distortion function.

The basic phenomenon we quantify in this work is that a rate constraint on a sequence of codes imposes a severe limitation on the mutual information associated with its empirical distribution. This is independent of the notion of distortion, or of the goodness of a code. The described behavior of the empirical distribution of good codes is a consequence of it.

Our work consists of two main parts. The first part, Section 3, considers the distribution of the pair of $k$-tuples obtained by looking at the source-reconstruction pair $(X^n, Y^n)$ through a $k$-window at a uniformly sampled location. More specifically, we look at the distribution of $(X_I^{I+k-1}, Y_I^{I+k-1})$, where $I$ is chosen uniformly from $\{1, \ldots, n-k+1\}$, independently of everything else. We refer to this distribution as the $k$-th-order marginalization of the distribution of $(X^n, Y^n)$. In subsection A we show that the normalized mutual information between $X_I^{I+k-1}$ and $Y_I^{I+k-1}$ is essentially (up to an $o(1)$ term, independent of the particular code) upper bounded by $R + \frac{1}{k} H(X^k) - H(\mathbf{X})$, $R$ denoting the rate of the code and $H(\mathbf{X})$ the source entropy rate. This is seen in subsection B to imply the convergence in distribution of $(X_I^{I+k-1}, Y_I^{I+k-1})$ to the joint distribution attaining $R(X^k, D)$, the $k$-th-order rate distortion function, whenever it is unique and satisfies the relation $R(X^k, D) = R(\mathbf{X}, D) + \frac{1}{k} H(X^k) - H(\mathbf{X})$, with $R(\mathbf{X}, D)$ denoting the rate distortion function of the source. This includes all i.i.d. sources, as well as the large family of sources for which the Shannon lower bound holds with equality (cf. [Gra70, Gra71, BG98] and references therein). In subsection C we show that convergence in any reasonable sense cannot hold for any code sequence when $k$ increases with $n$ such that $k/n > R(\mathbf{X}, D)/H(\tilde{\mathbf{Y}})$, where $H(\tilde{\mathbf{Y}})$ is the entropy rate of the reconstruction process attaining the minimum mutual information defining the rate distortion function. In subsection D we apply the results to obtain performance bounds for compression-based denoising. We extend results from [Don02] by showing that any additive noise distribution induces a distortion measure such that if optimal lossy compression of the noisy signal is performed under it, at a distortion level matched to the level of the noise, then the marginalized joint distribution of the noisy source and the reconstruction converges to that of the noisy source and the underlying clean source. This is shown to lead to bounds on the performance of such compression-based denoising schemes.

Section 4 presents the primary part of our work, which consists of pointwise analogues to the results of Section 3.
More specifically, we look at properties of the random empirical $k$-th-order joint distribution $\hat{Q}_k[X^n, Y^n]$ induced by the source-reconstruction pair of $n$-tuples $(X^n, Y^n)$. The main result of subsection A (Theorem 7) asserts that

$R + \frac{1}{k} H(X^k) - H(\mathbf{X})$ is not only an upper bound on the normalized mutual information between $X_I^{I+k-1}$ and $Y_I^{I+k-1}$, as established in Section 3, but is essentially, with probability one (in the limit of large $n$), an upper bound on the normalized mutual information under $\hat{Q}_k[X^n, Y^n]$. This is seen in subsection B to imply the sample converse to lossy source coding of [Kie91], avoiding the use of the ergodic theorem of [Kie89]. In subsection C this is used to establish the almost sure convergence of $\hat{Q}_k$ to the joint distribution attaining the $k$-th-order rate distortion function of the source, under the conditions stipulated for the convergence in the setting of Section 3. In subsection D we apply these almost sure convergence results to derive a pointwise analogue of the performance bound for compression-based denoising derived in subsection 3.D. In subsection E we show that a simple post-processing derandomization scheme performed on the output of the previously analyzed compression-based denoisers results in essentially optimum denoising performance.

The empirical distribution of good lossy source codes was first considered by Kanlis, Khudanpur and Narayan in [KSP96]. For memoryless sources they showed that any good sequence of codebooks must have an exponentially non-negligible fraction of codewords which are typical with respect to the $R(D)$-achieving output distribution. It was later further shown in [Kan97] that this subset of typical codewords carries most of the probabilistic mass, in that the probability of a source word having a non-typical codeword is negligible. That work also sketched a proof of the fact that the source word and its reconstruction are, with high probability, jointly typical with respect to the joint distribution attaining $R(D)$. More recently, Donoho established distributional properties of good rate distortion codes for certain families of processes and applied them to performance analysis of compression-based denoisers [Don02]. The main innovation in the parts of the present work related to good source codes is in considering the pointwise behavior of any $k$-th-order joint empirical distribution for general stationary ergodic processes.

Other than Sections 3 and 4, which were detailed above, we introduce notation and conventions in Section 2, and summarize the paper with a few of its open directions in Section 5.

2 Notation and Conventions

$\mathcal{X}$ and $\mathcal{Y}$ will denote, respectively, the source and reconstruction alphabets, which we assume throughout to be finite. We shall also use the notation $x^n$ for $(x_1, \ldots, x_n)$ and $x_i^j = (x_i, \ldots, x_j)$. We denote the set of probability measures on a finite set $A$ by $\mathcal{M}(A)$. For $P_1, P_2 \in \mathcal{M}(A)$ we let $\|P_1 - P_2\| = \max_{a \in A} |P_1(a) - P_2(a)|$. $B(P, \delta)$ will denote the ball around $P$, i.e., $B(P, \delta) = \{P' : \|P' - P\| \le \delta\}$. For $\{P_n\}, P$ with $P_n, P \in \mathcal{M}(A)$, the notations $P_n \stackrel{d}{\to} P$, $P_n \Rightarrow P$, and $P_n \to P$ will all stand for $\lim_{n\to\infty} \|P_n - P\| = 0$.

For $Q \in \mathcal{M}(\mathcal{X} \times \mathcal{Y})$, $I(Q)$ will denote the mutual information $I(X; Y)$ when $(X, Y) \sim Q$. Similarly, for $Q \in \mathcal{M}(\mathcal{X}^k \times \mathcal{Y}^k)$, $I_k(Q)$ will stand for $I(X^k; Y^k)$ when $(X^k, Y^k) \sim Q$. For $P \in \mathcal{M}(A)$ and a stochastic matrix $W$ from $A$ into $B$, $P \circ W \in \mathcal{M}(A \times B)$ is the distribution on $(A, B)$ under which $\Pr(A = a, B = b) = P(a) W(b|a)$.
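To make the notation above concrete, the following small Python sketch (illustrative only; the function names and the example distribution are ours, not the paper's) computes $I(Q)$ for a joint distribution $Q$ given as a 2-D array, along with the distance $\|P_1 - P_2\|$ just defined.

```python
import numpy as np

def sup_dist(P1, P2):
    """The distance ||P1 - P2|| = max_a |P1(a) - P2(a)| used throughout."""
    return float(np.max(np.abs(np.asarray(P1) - np.asarray(P2))))

def mutual_information(Q):
    """I(Q): mutual information (in nats) of a joint pmf Q over X x Y,
    given as a 2-D array with Q[x, y] = Pr(X = x, Y = y)."""
    Q = np.asarray(Q, dtype=float)
    Px = Q.sum(axis=1, keepdims=True)   # marginal of X
    Py = Q.sum(axis=0, keepdims=True)   # marginal of Y
    mask = Q > 0
    return float(np.sum(Q[mask] * np.log(Q[mask] / (Px @ Py)[mask])))

if __name__ == "__main__":
    # Example (ours): X ~ Bernoulli(1/2) observed through a BSC(0.1).
    Q = np.array([[0.45, 0.05],
                  [0.05, 0.45]])
    print("I(Q) =", mutual_information(Q), "nats")
```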

$H_{P \circ W}(X|Y)$ and $H(W|P)$ will both denote $H(X|Y)$ when $(X, Y) \sim P \circ W$. $I(P; W)$ will stand for $I(X; Y)$ when $(X, Y) \sim P \circ W$. When dealing with expectations of functions, or with functionals of random variables, we shall sometimes subscript the distributions of the associated random variables. Thus, for example, for any $f : A \to \mathbb{R}$ and $P \in \mathcal{M}(A)$ we shall write $E_P f(A)$ for the expectation of $f(A)$ when $A$ is distributed according to $P$.

$\mathcal{M}_n(A)$ will denote the subset of $\mathcal{M}(A)$ consisting of $n$-th-order empirical types, i.e., $P \in \mathcal{M}_n(A)$ if and only if $P(a)$ is an integer multiple of $1/n$ for all $a \in A$ (equivalently, the set of $P \in \mathcal{M}(A)$ for which $T_n(P)$, defined below, is non-empty). For any $P \in \mathcal{M}_n(\mathcal{X})$, $\mathcal{C}_n(P)$ will denote the set of stochastic matrices (conditional distributions) from $\mathcal{X}$ into $\mathcal{Y}$ for which $P \circ W \in \mathcal{M}_n(\mathcal{X} \times \mathcal{Y})$. For $P \in \mathcal{M}_n(A)$ we let $T_n(P)$, or simply $T_P$, denote the associated type class, i.e., the set of sequences $u^n \in A^n$ with
  $\frac{1}{n} |\{1 \le i \le n : u_i = a\}| = P(a)$  for all $a \in A$.   (3)
For $\delta \ge 0$, $T^n_{[P]_\delta}$, or simply $T_{[P]_\delta}$, will denote the set of sequences $u^n \in A^n$ with
  $\left| \frac{1}{n} |\{1 \le i \le n : u_i = a\}| - P(a) \right| \le \delta$  for all $a \in A$.   (4)

For a random element, $P$ subscripted by the element will denote its distribution. Thus, for example, for the process $\mathbf{X} = (X_1, X_2, \ldots)$, $P_{X_1}$, $P_{X^n}$ and $P_{\mathbf{X}}$ will denote, respectively, the distribution of its first component, the distribution of its first $n$ components, and the distribution of the process itself. If, say, $X_0, Z_i^j$ are jointly distributed, then $P_{X_0 | z_i^j} \in \mathcal{M}(\mathcal{X})$ will denote the conditional distribution of $X_0$ given $Z_i^j = z_i^j$. This will also hold for the cases with $i = -\infty$ and/or $j = \infty$ by taking $P_{X_0 | z_i^j}$ to be a regular version of the conditional distribution of $X_0$ given $Z_i^j$, evaluated at $z_i^j$. For a sequence of random variables $\{X_n\}$, $X_n \stackrel{d}{\to} X$ will stand for $P_{X_n} \Rightarrow P_X$.

For $x^n \in \mathcal{X}^n$ and $y^n \in \mathcal{Y}^n$ we let $\hat{Q}_k[x^n] \in \mathcal{M}(\mathcal{X}^k)$, $\hat{Q}_k[y^n] \in \mathcal{M}(\mathcal{Y}^k)$, and $\hat{Q}_k[x^n, y^n] \in \mathcal{M}(\mathcal{X}^k \times \mathcal{Y}^k)$ denote the $k$-th-order empirical distributions defined by
  $\hat{Q}_k[x^n](u^k) = \frac{|\{0 \le i \le n-k : x_{i+1}^{i+k} = u^k\}|}{n-k+1}$,   (5)
  $\hat{Q}_k[y^n](v^k) = \frac{|\{0 \le i \le n-k : y_{i+1}^{i+k} = v^k\}|}{n-k+1}$,   (6)
and
  $\hat{Q}_k[x^n, y^n](u^k, v^k) = \frac{|\{0 \le i \le n-k : x_{i+1}^{i+k} = u^k, \ y_{i+1}^{i+k} = v^k\}|}{n-k+1}$.   (7)
$\hat{Q}_k[y^n | x^n]$ will denote the conditional distribution $P_{Y^k | X^k}$ when $(X^k, Y^k) \sim \hat{Q}_k[x^n, y^n]$. In accordance with the notation defined above, for example, $E_{\hat{Q}_k[x^n, y^n]} f(X^k, Y^k)$ will denote the expectation of $f(X^k, Y^k)$ when $(X^k, Y^k) \sim \hat{Q}_k[x^n, y^n]$.

Definition 1 A fixed-rate $n$-block code is a pair $(\mathcal{C}_n, \phi_n)$ where $\mathcal{C}_n \subseteq \mathcal{Y}^n$ is the codebook and $\phi_n : \mathcal{X}^n \to \mathcal{C}_n$. The rate of the block code is given by $\frac{1}{n} \log |\mathcal{C}_n|$. The rate of a sequence of block codes is defined by $\limsup_{n\to\infty} \frac{1}{n} \log |\mathcal{C}_n|$. $Y^n = Y^n(X^n) = \phi_n(X^n)$ will denote the reconstruction sequence when the $n$-block code is applied to the source sequence $X^n$.
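As an illustration of (5)-(7) and Definition 1 (a minimal sketch with our own naming, not the paper's code), the following Python fragment computes the $k$-th-order empirical joint distribution of a pair of sequences and the rate $\frac{1}{n}\log|\mathcal{C}_n|$ of a fixed-rate block code.

```python
import math
from collections import Counter

def empirical_joint(x, y, k):
    """hat{Q}_k[x^n, y^n]: frequency of each pair of k-strings occupying the
    same window of x^n and y^n, as in equation (7)."""
    n = len(x)
    assert len(y) == n and 1 <= k <= n
    counts = Counter((tuple(x[i:i + k]), tuple(y[i:i + k])) for i in range(n - k + 1))
    total = n - k + 1
    return {pair: c / total for pair, c in counts.items()}

def code_rate(codebook, n):
    """Rate (1/n) log |C_n| of a fixed-rate n-block code (Definition 1), in nats per symbol."""
    return math.log(len(codebook)) / n

if __name__ == "__main__":
    x = [0, 1, 1, 0, 1, 0, 0, 1]
    y = [0, 1, 0, 0, 1, 0, 0, 0]
    print(empirical_joint(x, y, 2))
    print("rate of a 16-codeword code of blocklength 8:", code_rate(range(16), 8), "nats/symbol")
```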

Definition 2 A variable-rate code for $n$-blocks is a triple $(\mathcal{C}_n, \phi_n, l_n)$ with $\mathcal{C}_n$ and $\phi_n$ as in Definition 1. The code operates by mapping a source $n$-tuple $X^n$ into $\mathcal{C}_n$ via $\phi_n$, and then encoding the corresponding member of the codebook (denoted $Y^n$ in Definition 1) using a uniquely decodable binary code. Letting $l_n(X^n)$ denote the associated length function, the rate of the code is defined by $E \frac{1}{n} l_n(X^n)$, and the rate of a sequence of codes for the source $\mathbf{X} = (X_1, X_2, \ldots)$ is defined by $\limsup_{n\to\infty} E \frac{1}{n} l_n(X^n)$.

Note that the notion of a code in the above definitions does not involve a distortion measure according to which the goodness of reconstruction is judged. We assume throughout a given single-letter loss function $\rho : \mathcal{X} \times \mathcal{Y} \to [0, \infty)$ satisfying
  for every $x \in \mathcal{X}$ there exists $y \in \mathcal{Y}$ s.t. $\rho(x, y) = 0$.   (8)
For $x^n \in \mathcal{X}^n$ and $y^n \in \mathcal{Y}^n$ we let $\rho_n(x^n, y^n) = \frac{1}{n} \sum_{i=1}^n \rho(x_i, y_i)$. We shall also let $\delta_{Y|X}$ denote a conditional distribution of $Y$ given $X$ with the property that for all $x \in \mathcal{X}$
  $\sum_{y \in \mathcal{Y}} \delta_{Y|X}(y|x) \rho(x, y) = 0$   (9)
(note that (8) implies the existence of at least one such conditional distribution). The rate distortion function associated with the random variable $X \in \mathcal{X}$ is defined by
  $R(X, D) = \min_{E\rho(X, Y) \le D} I(X; Y)$,   (10)
where the minimum is over all joint distributions of the pair $(X, Y) \in \mathcal{X} \times \mathcal{Y}$ consistent with the given distribution of $X$. Letting $\rho_n$ denote the normalized distortion measure between $n$-tuples induced by $\rho$,
  $\rho_n(x^n, y^n) = \frac{1}{n} \sum_{i=1}^n \rho(x_i, y_i)$,   (11)
the rate distortion function of $X^n \in \mathcal{X}^n$ is defined by
  $R(X^n, D) = \frac{1}{n} \min_{E\rho_n(X^n, Y^n) \le D} I(X^n; Y^n)$.   (12)
The rate distortion function of the stationary ergodic process $\mathbf{X} = (X_1, X_2, \ldots)$ is defined by
  $R(\mathbf{X}, D) = \lim_{k\to\infty} R(X^k, D)$.   (13)
Note that our assumptions on the distortion measure, together with the assumption of finite source alphabets, combined with its well-known convexity, imply that $R(\mathbf{X}, D)$, as a function of $D \in [0, \infty)$, is:
1. Non-negative valued and bounded.
2. Continuous.
3. Strictly decreasing in the range $0 \le D \le D^{\mathbf{X}}_{\max}$, where $D^{\mathbf{X}}_{\max} = \min\{D : R(\mathbf{X}, D) = 0\}$, and identically 0 for $D \ge D^{\mathbf{X}}_{\max}$,
properties that will be tacitly relied on in the proofs of some of the results.
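For a single random variable $X$, the minimization in (10) can be carried out numerically by the standard Blahut-Arimoto algorithm. The following Python sketch is a generic illustration under assumed inputs (a pmf `p_x`, a distortion matrix `rho`, and a Lagrange slope `s` chosen by the user); it is not part of the paper.

```python
import numpy as np

def blahut_arimoto(p_x, rho, s, n_iter=500):
    """Approximate a point (D, R(X, D)) on the rate-distortion curve of (10)
    via Blahut-Arimoto, parameterized by the Lagrange slope s >= 0.
    p_x : source pmf over X (1-D array); rho : |X| x |Y| distortion matrix."""
    p_x = np.asarray(p_x, float)
    rho = np.asarray(rho, float)
    q_y = np.full(rho.shape[1], 1.0 / rho.shape[1])    # initial output marginal
    for _ in range(n_iter):
        W = q_y[None, :] * np.exp(-s * rho)            # optimal test channel for current q_y
        W /= W.sum(axis=1, keepdims=True)
        q_y = p_x @ W                                  # updated output marginal
    joint = p_x[:, None] * W
    D = float(np.sum(joint * rho))
    with np.errstate(divide="ignore", invalid="ignore"):
        ratio = np.where(joint > 0, W / q_y[None, :], 1.0)
    R = float(np.sum(joint * np.log(ratio)))
    return D, R

if __name__ == "__main__":
    # Bernoulli(0.2) source with Hamming distortion; R(D) should come out close to h(0.2) - h(D).
    D, R = blahut_arimoto([0.8, 0.2], [[0.0, 1.0], [1.0, 0.0]], s=2.0)
    print("D ~", D, "  R(D) ~", R, "nats")
```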

The set $\{(R(\mathbf{X}, D), D) : 0 \le D \le D^{\mathbf{X}}_{\max}\}$ will be referred to as the rate distortion curve. The distortion rate function is the inverse of $R(\mathbf{X}, \cdot)$, which we formally define, to account for the trivial regions, as
  $D(\mathbf{X}, R) = \begin{cases} \text{the unique } D \text{ such that } R(\mathbf{X}, D) = R & \text{for } R \in (0, R(\mathbf{X}, 0)] \\ D^{\mathbf{X}}_{\max} & \text{for } R = 0 \\ 0 & \text{for } R > R(\mathbf{X}, 0). \end{cases}$   (14)

3 Distributional Properties of Rate-Constrained Codes

Throughout this section codes can be taken to mean either fixed- or variable-rate codes.

A  A Bound on the Mutual Information Induced by a Rate-Constrained Code

The following result and its proof are implicit in the proof of the converse of [DW03, Theorem 1].

Theorem 1 Let $\mathbf{X} = (X_1, X_2, \ldots)$ be stationary and $Y^n = Y^n(X^n)$ be the reconstruction of an $n$-block code of rate $R$. Let $J$ be uniformly distributed on $\{1, \ldots, n\}$ and independent of $\mathbf{X}$. Then
  $I(X_J; Y_J) \le R + H(X_1) - \frac{1}{n} H(X^n)$.   (15)
In particular, $I(X_J; Y_J) \le R$ when $\mathbf{X}$ is memoryless.

Proof:
  $H(Y^n) \ge I(X^n; Y^n) = H(X^n) - H(X^n | Y^n)$
  $\quad = H(X^n) - n H(X_1) + \sum_{i=1}^n [H(X_i) - H(X_i | X^{i-1}, Y^n)]$
  $\quad \ge H(X^n) - n H(X_1) + \sum_{i=1}^n [H(X_i) - H(X_i | Y_i)]$
  $\quad = H(X^n) - n H(X_1) + \sum_{i=1}^n I(X_i; Y_i)$
  $\quad \ge H(X^n) - n H(X_1) + n I(X_J; Y_J)$,   (16)
where the last inequality follows from the facts that, by stationarity, $X_J \stackrel{d}{=} X_i$ for all $i$, that $P_{Y_J | X_J} = \frac{1}{n} \sum_{i=1}^n P_{Y_i | X_i}$, and the convexity of the mutual information in the conditional distribution (cf. [CT91, Theorem 2.7.4]). The fact that $Y^n$ is the reconstruction of an $n$-block code of rate $R$ implies $H(Y^n) \le nR$, completing the proof when combined with (16).

Remark: It is readily verified that (15) remains true even when the requirement of a stationary $\mathbf{X}$ is relaxed to the requirement of equal first-order marginals. To see that this cannot be relaxed much further, note that when the source is an individual sequence the right side of (15) equals $R$. On the other hand, if this individual sequence is non-constant, one can take a code of rate 0 consisting of one codeword whose joint empirical distribution with the source sequence has positive mutual information, violating the bound.

Theorem 2 Let $\mathbf{X} = (X_1, X_2, \ldots)$ be stationary and $Y^n = Y^n(X^n)$ be the reconstruction of an $n$-block code of rate $R$. Let $k \le n$ and $J$ be uniformly distributed on $\{1, \ldots, n-k+1\}$, and independent of $\mathbf{X}$. Then
  $\frac{1}{k} I(X_J^{J+k-1}; Y_J^{J+k-1}) \le \frac{n}{n-k+1} \left[ R + \frac{1}{k} H(X^k) - \frac{1}{n} H(X^n) + \frac{2k}{n} \log |\mathcal{X}| \right]$.   (17)

Proof: Let, for $1 \le j \le k$, $S_j = \{1 \le i \le n-k+1 : i \equiv j \pmod{k}\}$. Note that $\frac{n-2k+1}{k} \le |S_j| \le \frac{n+1}{k}$. Let $P^j_{X^k, Y^k}$ be the distribution defined by
  $P^j_{X^k, Y^k}(x^k, y^k) = \frac{1}{|S_j|} \sum_{i \in S_j} \Pr\{X_i^{i+k-1} = x^k, Y_i^{i+k-1} = y^k\}$.   (18)
Note, in particular, that
  $P_{X_J^{J+k-1}, Y_J^{J+k-1}} = \sum_{j=1}^k \frac{|S_j|}{n-k+1} P^j_{X^k, Y^k}$.   (19)
Now
  $nR \ge H(Y^n) \ge H(Y_j^{j+k|S_j|-1}) \ge H(X_j^{j+k|S_j|-1}) - |S_j| H(X^k) + |S_j| I(P^j_{X^k, Y^k})$,   (20)
where the first inequality is due to the rate constraint of the code and the last one follows from the bound established in (16) with the assignment $n \leftarrow |S_j|$, $(X^n, Y^n) \leftarrow (X_j^{j+k|S_j|-1}, Y_j^{j+k|S_j|-1})$ viewed as $|S_j|$ super-symbols of length $k$, $(X_i, Y_i) \leftarrow (X_{j+(i-1)k}^{j+ik-1}, Y_{j+(i-1)k}^{j+ik-1})$. Rearranging terms,
  $I(P^j_{X^k, Y^k}) \le \frac{n}{|S_j|} R + H(X^k) - \frac{1}{|S_j|} H(X_j^{j+k|S_j|-1}) \le \frac{kn}{n-k+1} \left[ R + \frac{1}{k} H(X^k) - \frac{1}{n} H(X^n) + \frac{2k}{n} \log |\mathcal{X}| \right]$,   (21)
where in the second inequality we have used the fact that
  $H(X^n) \le H(X_j^{j+k|S_j|-1}) + H(X^{j-1}, X_{j+k|S_j|}^n) \le H(X_j^{j+k|S_j|-1}) + 2k \log |\mathcal{X}|$.
Inequality (17) now follows from (21), (19), the fact that $P_{X_J^{J+k-1}} = P^j_{X^k}$ (both equaling the distribution of a source $k$-tuple) for all $j$, and the convexity of the mutual information in the conditional distribution.

An immediate consequence of Theorem 2 is

Corollary 1 Let $\mathbf{X} = (X_1, X_2, \ldots)$ be stationary and $\{Y^n\}$ be a sequence of block codes of rate $R$. For $k \le n$ let $J_{n,k}$ be uniformly distributed on $\{1, \ldots, n-k+1\}$, and independent of $\mathbf{X}$. Consider the pair $(X^n, Y^n) = (X^n, Y^n(X^n))$ and denote
  $(\tilde{X}_{n,k}, \tilde{Y}_{n,k}) = (X_{J_{n,k}}^{J_{n,k}+k-1}, Y_{J_{n,k}}^{J_{n,k}+k-1})$.   (22)
Then
  $\limsup_{k\to\infty} \limsup_{n\to\infty} \frac{1}{k} I(\tilde{X}_{n,k}; \tilde{Y}_{n,k}) \le R$.   (23)

Proof: Theorem 2 implies
  $\limsup_{n\to\infty} \frac{1}{k} I(\tilde{X}_{n,k}; \tilde{Y}_{n,k}) \le R + \frac{1}{k} H(X^k) - H(\mathbf{X})$,   (24)
completing the proof by letting $k \to \infty$.

B  The Empirical Distribution of Good Codes

The converse in rate distortion theory (cf. [Ber71], [Gal68]) asserts that if $\mathbf{X}$ is stationary ergodic and $(R, D)$ is a point on its rate distortion curve, then any code sequence of rate $R$ must satisfy
  $\liminf_{n\to\infty} E\rho_n(X^n, Y^n(X^n)) \ge D$.   (25)
The direct part, on the other hand, guarantees the existence of codes that are good in the following sense.

Definition 3 Let $\mathbf{X}$ be stationary ergodic, $0 \le D \le D^{\mathbf{X}}_{\max}$, and $(R, D)$ be a point on its rate distortion curve. The sequence of codes with associated mappings $\{Y^n\}$ will be said to be good for the source $\mathbf{X}$ at rate $R$ (or at distortion level $D$) if it has rate $R$ and
  $\limsup_{n\to\infty} E\rho_n(X^n, Y^n(X^n)) \le D$.   (26)

Note that the converse to rate distortion coding implies that the limit supremum in the above definition is actually a limit, and the rate of the corresponding good code sequence is exactly $R$.

Theorem 3 Let $\{Y^n\}$ correspond to a good code sequence for the stationary ergodic source $\mathbf{X}$ (at distortion level $D$) and let $(\tilde{X}_{n,k}, \tilde{Y}_{n,k})$ be defined as in Corollary 1. Then:
1. For every $k$,
  $\lim_{n\to\infty} E\rho_k(\tilde{X}_{n,k}, \tilde{Y}_{n,k}) = D$,   (27)
and
  $R(X^k, D) \le \liminf_{n\to\infty} \frac{1}{k} I(\tilde{X}_{n,k}; \tilde{Y}_{n,k}) \le \limsup_{n\to\infty} \frac{1}{k} I(\tilde{X}_{n,k}; \tilde{Y}_{n,k}) \le R(\mathbf{X}, D) + \frac{1}{k} H(X^k) - H(\mathbf{X})$,   (28)
so, in particular,
  $\lim_{k\to\infty} \liminf_{n\to\infty} \frac{1}{k} I(\tilde{X}_{n,k}; \tilde{Y}_{n,k}) = \lim_{k\to\infty} \limsup_{n\to\infty} \frac{1}{k} I(\tilde{X}_{n,k}; \tilde{Y}_{n,k}) = R(\mathbf{X}, D)$.   (29)
2. If it also holds that $R(X^k, D) = R(\mathbf{X}, D) + \frac{1}{k} H(X^k) - H(\mathbf{X})$, then
  $\lim_{n\to\infty} \frac{1}{k} I(\tilde{X}_{n,k}; \tilde{Y}_{n,k}) = R(X^k, D)$.   (30)
3. If, additionally, $R(X^k, D)$ is uniquely achieved in distribution by the pair $(X^k, \tilde{Y}^k)$, then
  $(\tilde{X}_{n,k}, \tilde{Y}_{n,k}) \stackrel{d}{\to} (X^k, \tilde{Y}^k)$  as $n \to \infty$.   (31)

The condition in the second item clearly holds for any memoryless source. It holds, however, much beyond memoryless sources. Any source for which the Shannon lower bound holds with equality is readily seen to satisfy $R(X^k, D) = R(\mathbf{X}, D) + \frac{1}{k} H(X^k) - H(\mathbf{X})$ for all $k$. This is a rich family of sources that includes Markov processes, hidden Markov processes, auto-regressive sources, and, in the continuous alphabet setting, Gaussian processes; cf. [Gra70, Gra71, BG98, EM02, WM03].

The first part of Theorem 3 will be seen to follow from Theorem 2. Its last part will be a consequence of the following lemma, whose proof is given in Appendix A.

Lemma 1 Let $R(D)$ be uniquely achieved by the joint distribution $P_{X,Y}$. Let further $\{P^n_{X,Y}\}$ be a sequence of distributions on $\mathcal{X} \times \mathcal{Y}$ satisfying
  $P^n_X \stackrel{d}{\to} P_X$,   (32)
  $\lim_{n\to\infty} I(P^n_{X,Y}) = R(D)$,   (33)
and
  $\lim_{n\to\infty} E_{P^n_{X,Y}} \rho(X, Y) = D$.   (34)
Then
  $P^n_{X,Y} \stackrel{d}{\to} P_{X,Y}$.   (35)

Proof of Theorem 3: For any code sequence $\{Y^n\}$, an immediate consequence of the definition of $(\tilde{X}_{n,k}, \tilde{Y}_{n,k})$ is that
  $\lim_{n\to\infty} \left[ E\rho_k(\tilde{X}_{n,k}, \tilde{Y}_{n,k}) - E\rho_n(X^n, Y^n) \right] = 0$.   (36)
Combined with the fact that $\{Y^n\}$ is a good code, this implies
  $\lim_{n\to\infty} E\rho_k(\tilde{X}_{n,k}, \tilde{Y}_{n,k}) = \lim_{n\to\infty} E\rho_n(X^n, Y^n) = D$,   (37)
proving (27). To prove (28), note that since $\{Y^n\}$ is a sequence of codes for $\mathbf{X}$ at rate $R(\mathbf{X}, D)$, Theorem 2 implies
  $\limsup_{n\to\infty} \frac{1}{k} I(\tilde{X}_{n,k}; \tilde{Y}_{n,k}) \le R(\mathbf{X}, D) + \frac{1}{k} H(X^k) - H(\mathbf{X})$.   (38)
On the other hand, since $\tilde{X}_{n,k} \stackrel{d}{=} X^k$, it follows from the definition of $R(X^k, \cdot)$ that
  $\frac{1}{k} I(\tilde{X}_{n,k}; \tilde{Y}_{n,k}) \ge R(X^k, E\rho_k(\tilde{X}_{n,k}, \tilde{Y}_{n,k}))$,   (39)
implying
  $\liminf_{n\to\infty} \frac{1}{k} I(\tilde{X}_{n,k}; \tilde{Y}_{n,k}) \ge R(X^k, D)$   (40)
by (37) and the continuity of $R(X^k, \cdot)$. This completes the proof of (28). Displays (29) and (30) are immediate consequences. The convergence in (31) is a direct consequence of (30) and Lemma 1 applied with the assignment $X \leftarrow X^k$, $Y \leftarrow Y^k$ and $\rho \leftarrow \rho_k$.

Remark: Note that to conclude the convergence in (31) in the above proof a weaker version of Lemma 1 would have sufficed, one that assumes $P^n_X = P_X$ for all $n$ instead of $P^n_X \stackrel{d}{\to} P_X$ in (32). This is because, clearly, by construction of $\tilde{X}_{n,k}$, $\tilde{X}_{n,k} \stackrel{d}{=} X^k$ for all $n$ and $k$. The stronger form of Lemma 1 given will be instrumental in the derivation of a pointwise analogue to (31) in Section 4 (third item of Theorem 9).

C  A Lower Bound on the Convergence Rate

Assume a stationary ergodic source $\mathbf{X}$ that satisfies, for each $k$, $R(X^k, D) = R(\mathbf{X}, D) + \frac{1}{k} H(X^k) - H(\mathbf{X})$, and for which $R(X^k, D)$ is uniquely achieved by the pair $(X^k, \tilde{Y}^k)$. Theorem 3 implies that any good code sequence for $\mathbf{X}$ at rate $R(\mathbf{X}, D)$ must satisfy
  $(\tilde{X}_{n,k}, \tilde{Y}_{n,k}) \stackrel{d}{\to} (X^k, \tilde{Y}^k)$  as $n \to \infty$   (41)

(recall Corollary 1 for the definition of the pair $(\tilde{X}_{n,k}, \tilde{Y}_{n,k})$). This implies, in particular,
  $H(\tilde{Y}_{n,k}) \to H(\tilde{Y}^k)$  as $n \to \infty$.   (42)
It follows from (42) that for $k_n$ increasing slowly enough with $n$,
  $\frac{1}{k_n} \left[ H(\tilde{Y}_{n,k_n}) - H(\tilde{Y}^{k_n}) \right] \to 0$  as $n \to \infty$.   (43)
Display (43) is equivalent to
  $\frac{1}{k_n} H(\tilde{Y}_{n,k_n}) \to H(\tilde{\mathbf{Y}})$  as $n \to \infty$,   (44)
where we denote $\lim_{k\to\infty} \frac{1}{k} H(\tilde{Y}^k)$ by $H(\tilde{\mathbf{Y}})$, a limit which exists (see the footnote below) under our hypothesis that $R(X^k, D)$ is uniquely achieved by $(X^k, \tilde{Y}^k)$ for every $k$. The rate at which $k_n$ may be allowed to increase while maintaining the convergence in (44) may depend on the particular code sequence through which $(\tilde{X}_{n,k}, \tilde{Y}_{n,k})$ are defined. We now show, however, that if $k_n$ increases more rapidly than a certain fraction of $n$, then the convergence in (44) fails for any code sequence. To see this, note that for any $k \le n$,
  $H(\tilde{Y}_{n,k}) = H(Y_{J_{n,k}}^{J_{n,k}+k-1})$   (45)
  $\quad \le H(Y_{J_{n,k}}^{J_{n,k}+k-1}, J_{n,k})$   (46)
  $\quad \le H(Y^n, J_{n,k})$   (47)
  $\quad \le H(Y^n) + H(J_{n,k})$   (48)
  $\quad \le H(Y^n) + \log n$.   (49)
Now, since $Y^n$ is associated with a good code,
  $\limsup_{n\to\infty} \frac{1}{n} H(Y^n) \le R(\mathbf{X}, D)$.   (50)
Letting $k_n = \alpha n$, (49) and (50) imply
  $\limsup_{n\to\infty} \frac{1}{k_n} H(\tilde{Y}_{n,k_n}) \le \frac{R(\mathbf{X}, D)}{\alpha}$,   (51)
excluding the possibility that (44) holds whenever
  $\alpha > \frac{R(\mathbf{X}, D)}{H(\tilde{\mathbf{Y}})}$.   (52)
This is analogous to the situation in the empirical distribution of good channel codes, where it was shown in [SV97, Section III] for the memoryless case that convergence is excluded whenever $k_n = \alpha n$ for $\alpha > C/H$, where $H$ stands for the entropy of the capacity-achieving channel input distribution. To sum up, we see that convergence takes place whenever $k = O(1)$ (Theorem 3), and does not take place whenever $k_n = \alpha n$ for $\alpha$ satisfying (52). For $k$ growing with $n$ at the intermediate rates, convergence depends on the particular code sequence, and examples can be constructed both for convergence and for lack thereof, analogous to [SV97, Examples 4, 5].

Footnote: In particular, for a memoryless source $H(\tilde{\mathbf{Y}}) = H(\tilde{Y}_1)$. For sources for which the Shannon lower bound is tight, it is the entropy rate of the source which results in $\mathbf{X}$ when corrupted by the max-entropy white noise tailored to the corresponding distortion measure and level.

D  Applications to Compression-Based Denoising

The point of view that compression may facilitate denoising was put forth by Natarajan in [Nat95]. It is based on the intuition that the noise constitutes that part of a noisy signal which is least compressible. Thus, lossy compression of the noisy signal, under the right distortion measure and at the right distortion level, should lead to effective denoising. Compression-based denoising schemes have since been suggested and studied under various settings and assumptions (cf. [Nat93, Nat95, NKH98, CYV97, JY01, Don02, TRA02, Ris00, TPB99] and references therein). In this subsection we consider compression-based denoising when the clean source is corrupted by additive white noise. We will give a new performance bound for denoisers that optimally lossily compress the noisy signal under the distortion measure and level induced by the noise.

For simplicity, we restrict attention throughout this section to the case where the alphabets of the clean, noisy, and reconstructed source are all equal to the $M$-ary alphabet $A = \{0, 1, \ldots, M-1\}$. Addition and subtraction between elements of this alphabet should be understood modulo $M$ throughout. The results we develop below have analogues for cases where the alphabet is the real line. We consider the case of a stationary ergodic source $\mathbf{X}$ corrupted by additive white noise $\mathbf{N}$. That is, we assume the $N_i$ are i.i.d. $\sim N$, independent of $\mathbf{X}$, and that the noisy observation sequence $\mathbf{Z}$ is given by
  $Z_i = X_i + N_i$.   (53)
By the channel matrix we refer to the Toeplitz matrix whose, say, first row is $(\Pr(N = 0), \ldots, \Pr(N = M-1))$. We assume that $\Pr(N = a) > 0$ for all $a \in A$ and associate with it a difference distortion measure $\rho^N$ defined by
  $\rho^N(a) = \log \frac{1}{\Pr(N = a)}$.   (54)
We shall omit the superscript and let $\rho$ stand for $\rho^N$ throughout this section.

Fact 1  $\max\{H(X) : X \text{ is } A\text{-valued and } E\rho(X) \le H(N)\}$ is uniquely achieved in distribution by $N$.

Proof: $N$ is clearly in the feasible set by the definition of $\rho$. Now, for any $A$-valued $X$ in the feasible set with $X \not\stackrel{d}{=} N$,
  $H(N) \ge E\rho(X) = \sum_a P_X(a) \log \frac{1}{\Pr(N = a)} > H(X)$,
where the strict inequality follows from $D(P_X \| P_N) > 0$.
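The construction in (53)-(54) is easy to simulate. The following Python sketch (an assumed toy setup of our own, not the paper's) generates the modulo-$M$ noisy sequence and the noise-induced distortion measure $\rho^N$; the empirical distortion between $Z^n$ and $X^n$ then concentrates around $H(N)$, the matched distortion level.

```python
import numpy as np

def induced_distortion(noise_pmf):
    """rho^N(a) = log 1/Pr(N = a), as in (54), returned as a length-M array."""
    return -np.log(np.asarray(noise_pmf, float))

def corrupt(x, noise_pmf, rng=None):
    """Z_i = X_i + N_i (mod M) with i.i.d. noise N, as in (53)."""
    rng = np.random.default_rng() if rng is None else rng
    M = len(noise_pmf)
    n = rng.choice(M, size=len(x), p=noise_pmf)
    return (np.asarray(x) + n) % M

if __name__ == "__main__":
    M, n = 4, 100000
    noise_pmf = np.array([0.85, 0.05, 0.05, 0.05])
    rng = np.random.default_rng(0)
    x = rng.choice(M, size=n)               # a toy (here memoryless) clean source
    z = corrupt(x, noise_pmf, rng)
    rho = induced_distortion(noise_pmf)
    print("empirical distortion rho(Z - X):", rho[(z - x) % M].mean())
    print("H(N) =", float(-(noise_pmf * np.log(noise_pmf)).sum()))
```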

Theorem 4 Let $R(Z^k, \cdot)$ denote the $k$-th-order rate distortion function of $\mathbf{Z}$ under the distortion measure in (54). Then
  $R(Z^k, H(N)) = \frac{1}{k} H(Z^k) - H(N)$.   (55)
Furthermore, it is uniquely achieved in distribution by the pair $(Z^k, X^k)$ whenever the channel matrix is invertible.

Proof: Let $(Z^k, Y^k)$ achieve $R(Z^k, H(N))$. Then
  $k R(Z^k, H(N)) = I(Z^k; Y^k) = H(Z^k) - H(Z^k | Y^k) = H(Z^k) - H(Z^k - Y^k | Y^k) \ge H(Z^k) - H(Z^k - Y^k) \ge H(Z^k) - k H(N)$,   (56)
where the last inequality follows from the fact that $E\rho_k(Z^k, Y^k) \le H(N)$ (otherwise $(Z^k, Y^k)$ would not be an achiever of $R(Z^k, H(N))$) and Fact 1. The first part of the theorem follows since the pair $(Z^k, X^k)$ satisfies $E\rho_k(Z^k, X^k) = H(N)$ and is readily seen to satisfy all the inequalities in (56) with equality. For the uniqueness part, note that it follows from the chain of inequalities in (56) that if $(Z^k, Y^k)$ achieves $R(Z^k, H(N))$ then $Z^k$ is the output of the memoryless additive channel with noise components $\sim N$ whose input is $Y^k$. The invertibility of the channel matrix guarantees uniqueness of the distribution of the $Y^k$ satisfying this for any given output distribution (cf., e.g., [WOS+03]).

Let now $\Lambda : A \times A \to [0, \infty)$ be the loss function according to which denoising performance is measured. For $u^n, v^n \in A^n$ let $\Lambda_n(u^n, v^n) = \frac{1}{n} \sum_{i=1}^n \Lambda(u_i, v_i)$. Define $\Gamma : \mathcal{M}(A) \times \mathcal{M}(A) \to [0, \infty)$ by
  $\Gamma(P, Q) = \max_{(U, V) : U \sim P, V \sim Q} E\Lambda(U, V)$,   (57)
and $\Psi : \mathcal{M}(A) \to [0, \infty)$ by
  $\Psi(P) = \Gamma(P, P) = \max_{(U, V) : U \sim P, V \sim P} E\Lambda(U, V)$.   (58)
It will be convenient to state the following (implied by an elementary continuity argument) for future reference.

Fact 2 Let $\{R_n\}$, $\{Q_n\}$ be sequences in $\mathcal{M}(A)$ with $R_n \Rightarrow P$ and $Q_n \Rightarrow P$. Then $\lim_{n\to\infty} \Gamma(R_n, Q_n) = \Psi(P)$.

Example 1 In the binary case, $M = 2$, when $\Lambda$ denotes Hamming distortion, it is readily checked that, for $0 \le p \le 1$,
  $\Psi(\mathrm{Bernoulli}(p)) = 2 \min\{p, 1-p\} = 2\phi(p)$,   (59)
where
  $\phi(p) = \min\{p, 1-p\}$   (60)
is the well-known Bayesian envelope for this setting [Han57, Sam63, MF98]. The achieving distribution in this case (assuming, say, $p \le 1/2$) is
  $\Pr(U=0, V=0) = 1 - 2p$, $\Pr(U=0, V=1) = p$, $\Pr(U=1, V=0) = p$, $\Pr(U=1, V=1) = 0$.   (61)

Theorem 5 Let $\{Y^n\}$ correspond to a good code sequence for the noisy source $\mathbf{Z}$ at distortion level $H(N)$ under the difference distortion measure in (54), and assume that the channel matrix is invertible. Then
  $\limsup_{n\to\infty} E\Lambda_n(X^n, Y^n(Z^n)) \le E\Psi\!\left(P_{X_0 | Z_{-\infty}^{\infty}}\right)$.   (62)
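The quantity $\Gamma(P, Q)$ in (57) is the value of a small linear program over couplings of $P$ and $Q$ (a maximization over the transportation polytope). The following Python sketch (our own illustration using scipy, with hypothetical names; not the paper's code) evaluates it and reproduces the binary Hamming case of Example 1.

```python
import numpy as np
from scipy.optimize import linprog

def gamma(P, Q, Lambda):
    """Gamma(P, Q) = max E Lambda(U, V) over couplings with U ~ P, V ~ Q, as in (57)."""
    P, Q, Lambda = map(np.asarray, (P, Q, Lambda))
    m, k = len(P), len(Q)
    c = -Lambda.reshape(-1)                      # maximize <=> minimize the negative
    A_eq = np.zeros((m + k, m * k))
    for u in range(m):                           # row marginals must equal P
        A_eq[u, u * k:(u + 1) * k] = 1.0
    for v in range(k):                           # column marginals must equal Q
        A_eq[m + v, v::k] = 1.0
    b_eq = np.concatenate([P, Q])
    res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    return -res.fun

def psi(P, Lambda):
    """Psi(P) = Gamma(P, P), as in (58)."""
    return gamma(P, P, Lambda)

if __name__ == "__main__":
    hamming = np.array([[0.0, 1.0], [1.0, 0.0]])
    for p in (0.1, 0.3, 0.5):
        # Should agree with 2 min(p, 1-p), per (59).
        print(p, psi([1 - p, p], hamming), 2 * min(p, 1 - p))
```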

Note, in particular, that for the case of a binary source corrupted by a BSC, the optimum distribution-dependent denoising performance is $E\phi(P_{X_0 | Z_{-\infty}^{\infty}})$ (cf. [WOS+03, Section 5]), so the right side of (62) is, by (59), precisely twice the expected fraction of errors made by the optimal denoiser in the limit of many observations. Note that this performance can be attained universally (i.e., with no a priori knowledge of the distribution of the noise-free source $\mathbf{X}$) by employing a universal rate distortion code on the noisy source, e.g., the fixed-distortion version of the Yang-Kieffer codes [YK96, Section III]. (This may conceptually motivate employing a universal lossy source code for denoising. Implementation of such a code, however, is too complex to be of practical value. It seems even less motivated in light of the universally optimal and practical scheme in [WOS+03].)

Proof of Theorem 5: Similarly as in Corollary 1, for $2k < n$ let $J_{n,k}$ be uniformly distributed on $\{k+1, \ldots, n-k\}$, and independent of $(\mathbf{X}, \mathbf{Z})$. Consider the triple $(X^n, Z^n, Y^n) = (X^n, Z^n, Y^n(Z^n))$ and denote
  $(X^{[n,k]}, Z^{[n,k]}, Y^{[n,k]}) = (X_{J_{n,k}-k}^{J_{n,k}+k}, Z_{J_{n,k}-k}^{J_{n,k}+k}, Y_{J_{n,k}-k}^{J_{n,k}+k})$,   (63)
a triple of $(2k+1)$-tuples whose components we index by $-k, \ldots, 0, \ldots, k$. Note first that
  $E\Lambda(X^{[n,k]}_0, Y^{[n,k]}_0) = \frac{1}{n-2k} \sum_{i=k+1}^{n-k} E\Lambda(X_i, Y_i)$,   (64)
so
  $E\Lambda_n(X^n, Y^n(Z^n)) \le \frac{2k \Lambda_{\max}}{n} + \frac{n-2k}{n} E\Lambda(X^{[n,k]}_0, Y^{[n,k]}_0)$ and $E\Lambda(X^{[n,k]}_0, Y^{[n,k]}_0) \le \frac{n}{n-2k} E\Lambda_n(X^n, Y^n(Z^n))$,   (65)
and, in particular,
  $\lim_{n\to\infty} \left[ E\Lambda_n(X^n, Y^n(Z^n)) - E\Lambda(X^{[n,k]}_0, Y^{[n,k]}_0) \right] = 0$.   (66)
Also, by the definition of $\Gamma$,
  $E\Lambda(X^{[n,k]}_0, Y^{[n,k]}_0) = \sum_{z \in A^{2k+1}} E\left[ \Lambda(X^{[n,k]}_0, Y^{[n,k]}_0) \mid Z^{[n,k]} = z \right] \Pr(Z^{[n,k]} = z) \le \sum_{z \in A^{2k+1}} \Gamma\left( P_{X^{[n,k]}_0 | Z^{[n,k]} = z}, P_{Y^{[n,k]}_0 | Z^{[n,k]} = z} \right) \Pr(Z^{[n,k]} = z)$.   (67)
Now, clearly, $(X^{[n,k]}, Z^{[n,k]}) \stackrel{d}{=} (X_{-k}^{k}, Z_{-k}^{k})$, so, in particular, for all $z \in A^{2k+1}$,
  $P_{X^{[n,k]}_0 | Z^{[n,k]} = z} = P_{X_0 | Z_{-k}^{k} = z}$.   (68)
On the other hand, from the combination of Theorem 3 and Theorem 4 it follows, since $\{Y^n\}$ corresponds to a good code sequence, that
  $(Y^{[n,k]}, Z^{[n,k]}) \stackrel{d}{\to} (X_{-k}^{k}, Z_{-k}^{k})$  as $n \to \infty$,   (69)
and therefore, in particular, for all $z \in A^{2k+1}$,
  $P_{Y^{[n,k]}_0 | Z^{[n,k]} = z} \to P_{X_0 | Z_{-k}^{k} = z}$  as $n \to \infty$.   (70)
The combination of (68), (70), and Fact 2 implies that, for all $z \in A^{2k+1}$,
  $\Gamma\left( P_{X^{[n,k]}_0 | Z^{[n,k]} = z}, P_{Y^{[n,k]}_0 | Z^{[n,k]} = z} \right) \to \Psi\left( P_{X_0 | Z_{-k}^{k} = z} \right)$  as $n \to \infty$.   (71)

Thus we obtain
  $\limsup_{n\to\infty} E\Lambda_n(X^n, Y^n(Z^n)) \stackrel{(a)}{=} \limsup_{n\to\infty} E\Lambda(X^{[n,k]}_0, Y^{[n,k]}_0) \stackrel{(b)}{\le} \sum_{z \in A^{2k+1}} \Psi\left( P_{X_0 | Z_{-k}^{k} = z} \right) \Pr(Z_{-k}^{k} = z) = E\Psi\left( P_{X_0 | Z_{-k}^{k}} \right)$,   (72)
where (a) follows from (66) and (b) from combining (67) with (71) and the fact that $Z^{[n,k]} \stackrel{d}{=} Z_{-k}^{k}$. The fact that $\lim_{k\to\infty} P_{X_0 | Z_{-k}^{k}} = P_{X_0 | Z_{-\infty}^{\infty}}$ a.s. (martingale convergence), the continuity of $\Psi$, and the bounded convergence theorem imply
  $\lim_{k\to\infty} E\Psi\left( P_{X_0 | Z_{-k}^{k}} \right) = E\Psi\left( P_{X_0 | Z_{-\infty}^{\infty}} \right)$,   (73)
which completes the proof when combined with (72).

4 Sample Properties of the Empirical Distribution

Throughout this section codes should be understood in the fixed-rate sense of Definition 1.

A  Pointwise Bounds on the Mutual Information Induced by a Rate-Constrained Code

The first result of this section is the following.

Theorem 6 Let $\{Y^n\}$ be a sequence of block codes with rate $R$. For any stationary ergodic $\mathbf{X}$,
  $\limsup_{n\to\infty} I(\hat{Q}_1[X^n, Y^n(X^n)]) \le R + H(X_1) - H(\mathbf{X})$  a.s.   (74)
In particular, when $\mathbf{X}$ is memoryless,
  $\limsup_{n\to\infty} I(\hat{Q}_1[X^n, Y^n(X^n)]) \le R$  a.s.   (75)

The following lemma is at the heart of the proof of Theorem 6.

Lemma 2 Let $Y^n$ be an $n$-block code of rate $R$. Then, for every $P \in \mathcal{M}_n(\mathcal{X})$ and $\eta > 0$,
  $\left| \{x^n \in T_P : I(\hat{Q}_1[x^n, Y^n(x^n)]) > R + \eta\} \right| \le e^{n[H(P) + \varepsilon_n - \eta]}$,   (76)
where $\varepsilon_n = |\mathcal{X}||\mathcal{Y}| \frac{1}{n} \log(n+1)$.

Lemma 2 quantifies a remarkable limitation that the rate of the code induces: for any $\eta > 0$, only an exponentially negligible fraction of the source sequences in any type can have an empirical joint distribution with their reconstruction whose mutual information exceeds the rate by more than $\eta$.

Proof of Lemma 2: For every $W \in \mathcal{C}_n(P)$ let
  $S(P, W) = \{x^n : (x^n, Y^n(x^n)) \in T_{P \circ W}\}$   (77)

and note that
  $|S(P, W)| \le e^{nR} e^{n H_{P \circ W}(X|Y)}$,   (78)
since there are no more than $e^{nR}$ different values of $Y^n(x^n)$ and, by [CK81, Lemma 2.3], no more than $e^{n H_{P \circ W}(X|Y)}$ different sequences in $S(P, W)$ can be mapped to the same value of $Y^n$. It follows that for every $W \in \mathcal{C}_n(P)$ with $I(P; W) > R + \eta$,
  $|S(P, W)| \le e^{n[H(P) - \eta]}$.   (79)
Now, by definition,
  $\{x^n \in T_P : I(\hat{Q}_1[x^n, Y^n(x^n)]) > R + \eta\} = \bigcup_{W \in \mathcal{C}_n(P) : I(P;W) > R + \eta} \{x^n \in T_P : (x^n, Y^n(x^n)) \in T_{P \circ W}\}$,   (80)
so
  $\left| \{x^n \in T_P : I(\hat{Q}_1[x^n, Y^n(x^n)]) > R + \eta\} \right| \le \sum_{W \in \mathcal{C}_n(P) : I(P;W) > R + \eta} |S(P, W)| \le |\mathcal{C}_n(P)| e^{n[H(P) - \eta]}$   (81)
  $\quad \le (n+1)^{|\mathcal{X}||\mathcal{Y}|} e^{n[H(P) - \eta]} = e^{n[H(P) + \varepsilon_n - \eta]}$,   (82)
where (81) follows from (79) and (82) from the definition of $\varepsilon_n$.

We shall also make use of the following converse to the AEP, the proof of which is deferred to Appendix B.

Lemma 3 Let $\mathbf{X}$ be stationary ergodic and $A_n \subseteq \mathcal{X}^n$ an arbitrary sequence satisfying
  $\limsup_{n\to\infty} \frac{1}{n} \log |A_n| < H(\mathbf{X})$.   (83)
Then, with probability one, $X^n \in A_n$ for only finitely many $n$.

Proof of Theorem 6: Fix an arbitrary $\eta > 0$ and $\xi > 0$ small enough so that
  $\max_{P \in B(P_{X_1}, \xi)} H(P) \le H(X_1) + \eta/2$   (84)
(the continuity of the entropy functional guarantees the existence of such a $\xi > 0$). Lemma 2 with the assignment $\eta \leftarrow H(X_1) - H(\mathbf{X}) + \eta$ guarantees that for every $n$ and $P \in \mathcal{M}_n(\mathcal{X})$,
  $\left| \{x^n \in T_P : I(\hat{Q}_1[x^n, Y^n(x^n)]) > R_n + H(X_1) - H(\mathbf{X}) + \eta\} \right| \le e^{n[H(P) + \varepsilon_n - H(X_1) + H(\mathbf{X}) - \eta]}$,   (85)
where $R_n$ denotes the rate of $Y^n$. Thus
  $\left| \{x^n \in T_{[P_{X_1}]_\xi} : I(\hat{Q}_1[x^n, Y^n(x^n)]) > R_n + H(X_1) - H(\mathbf{X}) + \eta\} \right|$   (86)
  $\quad \le \exp\left\{ n \left[ \max_{P \in B(P_{X_1}, \xi)} H(P) + 2\varepsilon_n - H(X_1) + H(\mathbf{X}) - \eta \right] \right\}$   (87)
  $\quad \le \exp\{n[2\varepsilon_n + H(\mathbf{X}) - \eta/2]\}$,   (88)

where (87) follows from (85) and from the obvious fact that $T_{[P_{X_1}]_\xi}$ is a union of fewer than $e^{n\varepsilon_n}$ strict types, and (88) follows from (84). Denoting now the set in (86) by $A_n$, i.e.,
  $A_n = \{x^n \in T_{[P_{X_1}]_\xi} : I(\hat{Q}_1[x^n, Y^n(x^n)]) > R_n + H(X_1) - H(\mathbf{X}) + \eta\}$,   (89)
we have
  $\{x^n : I(\hat{Q}_1[x^n, Y^n(x^n)]) > R_n + H(X_1) - H(\mathbf{X}) + \eta\} \subseteq T^c_{[P_{X_1}]_\xi} \cup A_n$.   (90)
But, by ergodicity, $X^n \in T^c_{[P_{X_1}]_\xi}$ for only finitely many $n$ with probability one. On the other hand, the fact that $|A_n| \le \exp\{n[2\varepsilon_n + H(\mathbf{X}) - \eta/2]\}$ (recall (88)) implies, by Lemma 3, that $X^n \in A_n$ for only finitely many $n$ with probability one. Thus we get
  $I(\hat{Q}_1[X^n, Y^n(X^n)]) \le R_n + H(X_1) - H(\mathbf{X}) + \eta$  eventually, a.s.   (91)
But, by hypothesis, $\{Y^n\}$ has rate $R$ (namely $\limsup_n R_n \le R$), so (91) implies
  $\limsup_{n\to\infty} I(\hat{Q}_1[X^n, Y^n(X^n)]) \le R + H(X_1) - H(\mathbf{X}) + \eta$  a.s.,   (92)
which completes the proof by the arbitrariness of $\eta > 0$.

Theorem 6 extends to higher-order empirical distributions as follows.

Theorem 7 Let $\{Y^n\}$ be a sequence of block codes with rate $R$. For any stationary ergodic $\mathbf{X}$ and $k \ge 1$,
  $\limsup_{n\to\infty} \frac{1}{k} I_k(\hat{Q}_k[X^n, Y^n(X^n)]) \le R + \frac{1}{k} H(X^k) - H(\mathbf{X})$  a.s.   (93)
In particular, when $\mathbf{X}$ is memoryless,
  $\limsup_{n\to\infty} \frac{1}{k} I_k(\hat{Q}_k[X^n, Y^n(X^n)]) \le R$  a.s.   (94)

Note that defining, for each $0 \le j \le k-1$, $\hat{Q}_{k,j}$ by
  $\hat{Q}_{k,j}[x^n, y^n](u^k, v^k) = \frac{k}{n} \left| \{i \ge 0 : ik+j+k \le n, \ x_{ik+j+1}^{ik+j+k} = u^k, \ y_{ik+j+1}^{ik+j+k} = v^k\} \right|$,   (95)
it follows from Theorem 6, applied to a $k$-th-order super-symbol, that when the source is ergodic in $k$-blocks,
  $\limsup_{n\to\infty} \frac{1}{k} I_k(\hat{Q}_{k,j}[X^n, Y^n(X^n)]) \le R + \frac{1}{k} H(X^k) - H(\mathbf{X})$  a.s.   (96)
(Note that $\hat{Q}_{k,j}[x^n, y^n]$ may not be a bona fide distribution, but is very close to one for fixed $k$ and large $n$; when it is not, $I_k(\hat{Q}_{k,j}[x^n, y^n])$ should be understood as the mutual information under the normalized version of $\hat{Q}_{k,j}[x^n, y^n]$.) Then, since for fixed $k$ and large $n$, $\hat{Q}_k \approx \frac{1}{k} \sum_{j=0}^{k-1} \hat{Q}_{k,j}$ and, by the $k$-block ergodicity, $\hat{Q}_{k,j}[X^n] \to P_{X^k}$ for each $j$, it would follow from the continuity of $I_k$ and its convexity in the conditional distribution that, for large $n$, $I_k(\hat{Q}_k) \le \frac{1}{k} \sum_{j=0}^{k-1} I_k(\hat{Q}_{k,j}) + o(1)$, implying (93) when combined with (96). The proof that follows makes these continuity arguments precise and accommodates the possibility that the source is not ergodic in $k$-blocks.

Proof of Theorem 7: For each $0 \le j \le k-1$ denote the $k$-th-order super-source by $\mathbf{X}^{k,j} = \{X_{ik+j+1}^{ik+j+k}\}_{i \ge 0}$. A lemma of [Gal68] implies the existence of events $A_0, \ldots, A_{k-1}$ with the following properties:

1. $P(A_i \cap A_j) = 1/k$ if $i = j$, and $0$ otherwise.
2. For each $i, j \in \{0, \ldots, k-1\}$, $\mathbf{X}^{k,j}$ conditioned on $A_i$ is stationary and ergodic.
3. $\mathbf{X}^{k,j}$ conditioned on $A_i$ is equal in distribution to $\mathbf{X}^{k,j+1}$ conditioned on $A_{i+1}$, where the indices $i, j$ are to be taken modulo $k$.

Note that
  $\left\| \hat{Q}_k[x^n, y^n] - \frac{1}{k} \sum_{j=0}^{k-1} \hat{Q}_{k,j}[x^n, y^n] \right\| \le \delta_n$,   (97)
where $\{\delta_n\}$ is a deterministic sequence with $\delta_n \to 0$ as $n \to \infty$. Theorem 6 applied to $\mathbf{X}^{k,j}$ conditioned on $A_i$ gives
  $\limsup_{n\to\infty} \frac{1}{k} I_k(\hat{Q}_{k,j}[X^n, Y^n(X^n)]) \le R + \frac{1}{k} H(X_{j+1}^{j+k} | A_i) - \frac{1}{k} \bar{H}(\mathbf{X}^{k,j} | A_i)$  a.s. on $A_i$,   (98)
where $H(X_{j+1}^{j+k} | A_i)$ and $\bar{H}(\mathbf{X}^{k,j} | A_i)$ denote, respectively, the entropy of the distribution of $X_{j+1}^{j+k}$ when conditioned on $A_i$ and the entropy rate (per $k$-block) of $\mathbf{X}^{k,j}$ when conditioned on $A_i$. Now, due to the fact that $\mathbf{X}^{k,j}$ conditioned on $A_i$ is stationary and ergodic, we have, for each $j$,
  $\lim_{n\to\infty} \hat{Q}_{k,j}[X^n] = P_{X^k | A_i}$  a.s. on $A_i$,   (99)
with $P_{X^k | A_i}$ denoting the distribution of the corresponding source $k$-block conditioned on $A_i$. Equation (99) implies, in turn, by a standard continuity argument (letting $\hat{Q}_{k,j}[y^n | x^n]$ be defined analogously to $\hat{Q}_k[y^n | x^n]$),
  $\lim_{n\to\infty} \left\| \hat{Q}_{k,j}[X^n, Y^n(X^n)] - P_{X^k | A_i} \circ \hat{Q}_{k,j}[Y^n(X^n) | X^n] \right\| = 0$  a.s. on $A_i$.   (100)
Combined with (97), (100) implies also
  $\lim_{n\to\infty} \left\| \hat{Q}_k[X^n, Y^n(X^n)] - P_{X^k | A_i} \circ \frac{1}{k} \sum_{j=0}^{k-1} \hat{Q}_{k,j}[Y^n(X^n) | X^n] \right\| = 0$  a.s. on $A_i$,   (101)
implying, in turn, by the continuity of the mutual information as a function of the joint distribution,
  $I_k(\hat{Q}_k[X^n, Y^n(X^n)]) - I_k\!\left( P_{X^k | A_i} \circ \frac{1}{k} \sum_{j=0}^{k-1} \hat{Q}_{k,j}[Y^n(X^n) | X^n] \right) \to 0$  a.s. on $A_i$.   (102)
On the other hand, by the convexity of the mutual information in the conditional distribution [CT91, Theorem 2.7.4], we have for all $k$, $n$ and all sample paths
  $I_k\!\left( P_{X^k | A_i} \circ \frac{1}{k} \sum_{j=0}^{k-1} \hat{Q}_{k,j}[Y^n(X^n) | X^n] \right) \le \frac{1}{k} \sum_{j=0}^{k-1} I_k\!\left( P_{X^k | A_i} \circ \hat{Q}_{k,j}[Y^n(X^n) | X^n] \right)$.   (103)
Using (99) and the continuity of the mutual information yet again gives, for all $j$,
  $I_k\!\left( P_{X^k | A_i} \circ \hat{Q}_{k,j}[Y^n(X^n) | X^n] \right) - I_k(\hat{Q}_{k,j}[X^n, Y^n(X^n)]) \to 0$  a.s. on $A_i$.   (104)

Consequently, a.s. on $A_i$,
  $\limsup_{n\to\infty} \frac{1}{k} I_k(\hat{Q}_k[X^n, Y^n(X^n)]) \le \frac{1}{k} \sum_{j=0}^{k-1} \limsup_{n\to\infty} \frac{1}{k} I_k(\hat{Q}_{k,j}[X^n, Y^n(X^n)])$   (105)
  $\quad \le R + \frac{1}{k} \sum_{j=0}^{k-1} \left[ \frac{1}{k} H(X_{j+1}^{j+k} | A_i) - \frac{1}{k} \bar{H}(\mathbf{X}^{k,j} | A_i) \right]$   (106)
  $\quad = R + \frac{1}{k} \sum_{j=0}^{k-1} \left[ \frac{1}{k} H(X^k | A_{(i-j) \bmod k}) - \frac{1}{k} \bar{H}(\mathbf{X}^{k,0} | A_{(i-j) \bmod k}) \right]$   (107)
  $\quad = R + \frac{1}{k} \sum_{j=0}^{k-1} \left[ \frac{1}{k} H(X^k | A_j) - \frac{1}{k} \bar{H}(\mathbf{X}^{k,0} | A_j) \right]$,   (108)
where (105) follows by combining (102)-(104), (106) follows from (98), and (107) follows from the third property of the events $A_i$ recalled above. Since (108) does not depend on $i$ we obtain
  $\limsup_{n\to\infty} \frac{1}{k} I_k(\hat{Q}_k[X^n, Y^n(X^n)]) \le R + \frac{1}{k} \sum_{j=0}^{k-1} \left[ \frac{1}{k} H(X^k | A_j) - \frac{1}{k} \bar{H}(\mathbf{X}^{k,0} | A_j) \right]$  a.s.   (109)
Defining the $\{0, \ldots, k-1\}$-valued random variable $J$ by
  $J = i$ on $A_i$,   (110)
it follows from the first property of the sets $A_i$ recalled above that
  $\frac{1}{k} \sum_{j=0}^{k-1} H(X^k | A_j) = H(X^k | J) \le H(X^k)$   (111)
and that
  $\frac{1}{k} \sum_{j=0}^{k-1} \bar{H}(\mathbf{X}^{k,0} | A_j) = \bar{H}(\mathbf{X}^{k,0} | J) = \lim_{n\to\infty} \frac{k}{n} H(X^n | J) = \lim_{n\to\infty} \frac{k}{n} [H(X^n, J) - H(J)] = \lim_{n\to\infty} \frac{k}{n} H(X^n, J) \ge \lim_{n\to\infty} \frac{k}{n} H(X^n) = k H(\mathbf{X})$.   (112)
Combining (109) with (111) and (112) gives (93).

A direct consequence of (93) is the following.

Corollary 2 Let $\{Y^n\}$ be a sequence of block codes with rate $R$. For any stationary ergodic $\mathbf{X}$,
  $\limsup_{k\to\infty} \limsup_{n\to\infty} \frac{1}{k} I_k(\hat{Q}_k[X^n, Y^n(X^n)]) \le R$  a.s.   (113)
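As a numerical sanity check on the flavor of Theorem 6 and Corollary 2 (our own toy experiment, not a device used in the paper), the following Python sketch builds a simple fixed-rate block code, a random codebook applied to consecutive sub-blocks, runs it on an i.i.d. binary source, and compares $I(\hat{Q}_1[x^n, y^n])$ with the code rate $R$; by (75), the former should not noticeably exceed the latter for large $n$.

```python
import numpy as np

def mutual_information(Q):
    Q = np.asarray(Q, float)
    Px, Py = Q.sum(1, keepdims=True), Q.sum(0, keepdims=True)
    m = Q > 0
    return float(np.sum(Q[m] * np.log(Q[m] / (Px @ Py)[m])))

def toy_block_code(x, b, codebook):
    """Map each length-b sub-block of x to its nearest codeword in Hamming distance."""
    y = np.empty_like(x)
    for i in range(0, len(x) - b + 1, b):
        blk = x[i:i + b]
        dists = (codebook != blk).sum(axis=1)
        y[i:i + b] = codebook[dists.argmin()]
    return y

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    n, b, C = 200000, 8, 16
    R = np.log(C) / b                       # rate in nats per source symbol
    x = rng.integers(0, 2, size=n)          # i.i.d. Bernoulli(1/2) source
    codebook = rng.integers(0, 2, size=(C, b))
    y = toy_block_code(x, b, codebook)
    Q1 = np.zeros((2, 2))
    for a, c in zip(x, y):                  # first-order empirical joint distribution
        Q1[a, c] += 1
    Q1 /= Q1.sum()
    print("I(Q1_hat) =", mutual_information(Q1), "  rate R =", R)
```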

B  Sample Converse in Lossy Source Coding

One of the main results of [Kie91] is the following.

Theorem 8 (Theorem 1 in [Kie91]) Let $\{Y^n\}$ be a sequence of block codes with rate $R$. For any stationary ergodic $\mathbf{X}$,
  $\liminf_{n\to\infty} \rho_n(X^n, Y^n(X^n)) \ge D(\mathbf{X}, R)$  a.s.   (114)

The proof in [Kie91] was valid for general source and reconstruction alphabets, and was based on the powerful ergodic theorem in [Kie89]. We now show how Corollary 2 can be used to give a simple proof, valid in our finite-alphabet setting.

Proof of Theorem 8: Note first that, by a standard continuity argument and ergodicity, for each $k$,
  $\lim_{n\to\infty} \left\| \hat{Q}_k[X^n, Y^n(X^n)] - P_{X^k} \circ \hat{Q}_k[Y^n(X^n) | X^n] \right\| = 0$  a.s.,   (115)
implying, in turn, by the continuity of the mutual information as a function of the joint distribution,
  $I_k(\hat{Q}_k[X^n, Y^n(X^n)]) - I_k\!\left( P_{X^k} \circ \hat{Q}_k[Y^n(X^n) | X^n] \right) \to 0$  a.s.   (116)
Fix now $\varepsilon > 0$. By Corollary 2, with probability one there exists $k$ large enough that
  $\frac{1}{k} I_k(\hat{Q}_k[X^n, Y^n(X^n)]) \le R + \varepsilon$  for all sufficiently large $n$,   (117)
implying, by (116), with probability one, the existence of $k$ large enough that
  $\frac{1}{k} I_k\!\left( P_{X^k} \circ \hat{Q}_k[Y^n(X^n) | X^n] \right) \le R + 2\varepsilon$  for all sufficiently large $n$.   (118)
By the definition of $D(X^k, \cdot)$ it follows that for all such $k$ and $n$,
  $E_{P_{X^k} \circ \hat{Q}_k[Y^n(X^n) | X^n]} \rho_k(X^k, Y^k) \ge D(X^k, R + 2\varepsilon) \ge D(\mathbf{X}, R + 2\varepsilon)$,   (119)
implying, by the continuity property (115), with probability one, the existence of $k$ large enough that
  $E_{\hat{Q}_k[X^n, Y^n(X^n)]} \rho_k(X^k, Y^k) \ge D(\mathbf{X}, R + 2\varepsilon) - \varepsilon$  for all sufficiently large $n$.   (120)
By the definition of $\hat{Q}_k$, it is readily verified that for any $k$ and all sample paths,
  $\left| E_{\hat{Q}_k[X^n, Y^n(X^n)]} \rho_k(X^k, Y^k) - \rho_n(X^n, Y^n(X^n)) \right| \le \tilde{\varepsilon}_n$,   (121)
where $\{\tilde{\varepsilon}_n\}$ is a deterministic sequence (depending on $k$) with $\tilde{\varepsilon}_n \to 0$ as $n \to \infty$. (In fact, we would have the pointwise equality $E_{\hat{Q}_k[x^n, y^n]} \rho_k(X^k, Y^k) = \rho_n(x^n, y^n)$ had $\hat{Q}_k$ been slightly modified from its definition in (7) to $\hat{Q}_k[x^n, y^n](u^k, v^k) = \frac{1}{n} |\{1 \le i \le n : x_i^{i+k-1} = u^k, y_i^{i+k-1} = v^k\}|$ with the indices understood modulo $n$.) Combined with (120), this implies
  $\liminf_{n\to\infty} \rho_n(X^n, Y^n(X^n)) \ge D(\mathbf{X}, R + 2\varepsilon) - \varepsilon$  a.s.,   (122)
completing the proof by the arbitrariness of $\varepsilon$ and the continuity of $D(\mathbf{X}, \cdot)$.

C  Sample Behavior of the Empirical Distribution of Good Codes

In the context of Theorem 8 we make the following sample analogue of Definition 3.

Definition 4 Let $\{Y^n\}$ be a sequence of block codes with rate $R$ and let $\mathbf{X}$ be a stationary ergodic source. $\{Y^n\}$ will be said to be a pointwise good sequence of codes for the stationary ergodic source $\mathbf{X}$ (in the strong sense) at rate $R$ if
  $\limsup_{n\to\infty} \rho_n(X^n, Y^n(X^n)) \le D(\mathbf{X}, R)$  a.s.   (123)

Note that Theorem 8 implies that the limit supremum in the above definition is actually a limit, the inequality in (123) actually holds with equality, and the rate of the corresponding good code sequence is exactly $R$. The bounded convergence theorem implies that a pointwise good sequence of codes is also good in the sense of Definition 3. The converse, however, is not true [Wei]. The existence of pointwise good code sequences is a known consequence of the existence of good code sequences [Kie91, Kie78]. In fact, there exist pointwise good code sequences that are universally good for all stationary and ergodic sources [Ziv72, Ziv80, YK96]. Henceforth the phrase "good codes" should be understood in the sense of Definition 4, even when omitting "pointwise".

The following is the pointwise version of Theorem 3.

Theorem 9 Let $\{Y^n\}$ correspond to a pointwise good code sequence for the stationary ergodic source $\mathbf{X}$ at rate $R(\mathbf{X}, D)$. Then:
1. For every $k$,
  $\lim_{n\to\infty} E_{\hat{Q}_k[X^n, Y^n(X^n)]} \rho_k(X^k, Y^k) = D$  a.s.,   (124)
and
  $R(X^k, D) \le \liminf_{n\to\infty} \frac{1}{k} I_k(\hat{Q}_k[X^n, Y^n(X^n)]) \le \limsup_{n\to\infty} \frac{1}{k} I_k(\hat{Q}_k[X^n, Y^n(X^n)]) \le R(\mathbf{X}, D) + \frac{1}{k} H(X^k) - H(\mathbf{X})$  a.s.,   (125)
so in particular
  $\lim_{k\to\infty} \liminf_{n\to\infty} \frac{1}{k} I_k(\hat{Q}_k[X^n, Y^n(X^n)]) = \lim_{k\to\infty} \limsup_{n\to\infty} \frac{1}{k} I_k(\hat{Q}_k[X^n, Y^n(X^n)]) = R(\mathbf{X}, D)$  a.s.   (126)
2. If it also holds that $R(X^k, D) = R(\mathbf{X}, D) + \frac{1}{k} H(X^k) - H(\mathbf{X})$, then
  $\lim_{n\to\infty} \frac{1}{k} I_k(\hat{Q}_k[X^n, Y^n(X^n)]) = R(X^k, D)$  a.s.   (127)
3. If, additionally, $R(X^k, D)$ is uniquely achieved by the pair $(X^k, \tilde{Y}^k)$, then
  $\hat{Q}_k[X^n, Y^n(X^n)] \Rightarrow P_{X^k, \tilde{Y}^k}$  a.s.   (128)

Proof of Theorem 9: $\{Y^n\}$ being a good code sequence implies
  $\lim_{n\to\infty} \rho_n(X^n, Y^n(X^n)) = D$  a.s.,   (129)

implying (124) by (121). Reasoning similarly as in the proof of Theorem 8, it follows from (115) that
  $\lim_{n\to\infty} E_{P_{X^k} \circ \hat{Q}_k[Y^n(X^n) | X^n]} \rho_k(X^k, Y^k) = \lim_{n\to\infty} E_{\hat{Q}_k[X^n, Y^n(X^n)]} \rho_k(X^k, Y^k) = D$  a.s.,   (130)
implying
  $\liminf_{n\to\infty} \frac{1}{k} I_k(\hat{Q}_k[X^n, Y^n(X^n)]) \ge \liminf_{n\to\infty} \frac{1}{k} I_k\!\left( P_{X^k} \circ \hat{Q}_k[Y^n(X^n) | X^n] \right) \ge R(X^k, D)$  a.s.,   (131)
where the left inequality follows from (116) and the second one from (130) and the continuity of $R(X^k, \cdot)$. The upper bound in (125) follows from Theorem 7. Displays (126) and (127) are immediate consequences. The convergence in (128) follows directly from combining (127), (124), the fact that, by ergodicity,
  $\hat{Q}_k[X^n] \Rightarrow P_{X^k}$  a.s.,   (132)
and Lemma 1 applied to the $k$-th-order super-symbol.

D  Application to Compression-Based Denoising

Consider again the setting of subsection 3.D, where the stationary ergodic source $\mathbf{X}$ is corrupted by additive noise $\mathbf{N}$ of i.i.d. components. The following is the almost sure version of Theorem 5.

Theorem 10 Let $\{Y^n\}$ correspond to a good code sequence for the noisy source $\mathbf{Z}$ at distortion level $H(N)$ under the difference distortion measure in (54), and assume that the channel matrix is invertible. Then
  $\limsup_{n\to\infty} \Lambda_n(X^n, Y^n(Z^n)) \le E\Psi\!\left( P_{X_0 | Z_{-\infty}^{\infty}} \right)$  a.s.   (133)

Proof: The combination of Theorem 4 with the third item of Theorem 9 gives
  $\hat{Q}_k[Z^n, Y^n(Z^n)] \Rightarrow P_{Z^k, X^k}$  a.s.   (134)
On the other hand, joint stationarity and ergodicity of the pair $(\mathbf{X}, \mathbf{Z})$ implies
  $\hat{Q}_k[X^n, Z^n] \Rightarrow P_{X^k, Z^k}$  a.s.   (135)
Arguing analogously as in the proof of Theorem 5, this time for the sample empirical distribution $\hat{Q}_k[X^n, Z^n, Y^n]$ (we extend the notation by letting $\hat{Q}_k[x^n, z^n, y^n]$ stand for the $k$-th-order empirical distribution of the triple $(X^k, Z^k, Y^k)$ induced by $(x^n, z^n, y^n)$) instead of the distribution of the triple in (63), leads, taking say $k = 2m+1$, to
  $\limsup_{n\to\infty} \Lambda_n(X^n, Y^n(Z^n)) \le E\Psi\!\left( P_{X_0 | Z_{-m}^{m}} \right)$  a.s.,   (136)
implying (133) upon letting $m \to \infty$.

E  Derandomizing for Optimum Denoising

The reconstruction associated with a good code sequence has what seems to be a desirable property in the denoising setting of the previous subsection, namely, that the marginalized distribution, for any finite $k$, of the noisy source and reconstruction is essentially distributed like that of the noisy source and the underlying clean signal, as $n$ becomes large. Indeed, this property was enough to derive an upper bound on the denoising loss (Theorems 5 and 10), which, for example, for the binary case was seen to be within a factor of 2 of the optimum. We now point out, using the said property, that essentially optimum denoising can be attained by a simple post-processing procedure.

The procedure is to fix an $m$ and evaluate, for each $z_{-m}^m \in A^{2m+1}$ and $y \in A$,
  $\hat{Q}_{2m+1}[Z^n, Y^n(Z^n)](z_{-m}^m, y) = \sum_{y_{-m}^{-1}, y_1^m} \hat{Q}_{2m+1}[Z^n, Y^n(Z^n)](z_{-m}^m, y_{-m}^{-1} y y_1^m)$.
In practice, the computation of $\hat{Q}_{2m+1}[Z^n, Y^n(Z^n)](z_{-m}^m, y)$ can be done quite efficiently and sequentially by updating counts for the various $z_{-m}^m \in A^{2m+1}$ as they appear along the noisy sequence (cf. [WOS+03, Section 3]). Define now the $n$-block denoiser $\hat{X}^{[m],n}(Z^n)$ by letting the reconstruction symbol at location $m+1 \le i \le n-m$ be given by
  $\hat{X}_i = \arg\min_{\hat{x}} \sum_y \hat{Q}_{2m+1}[Z^n, Y^n(Z^n)](Z_{i-m}^{i+m}, y) \Lambda(y, \hat{x})$,   (137)
and be arbitrarily defined for $i$'s outside that range.

Note that $\hat{X}_i$ is nothing but the Bayes response to the conditional distribution of $Y_{m+1}$ given $Z^{2m+1} = Z_{i-m}^{i+m}$ induced by the joint distribution $\hat{Q}_{2m+1}[Z^n, Y^n(Z^n)]$ of $(Z^{2m+1}, Y^{2m+1})$. However, from the conclusion in (134) we know that this converges almost surely to the conditional distribution of $X_{m+1}$ given $Z^{2m+1} = Z_{i-m}^{i+m}$. Thus, $\hat{X}_i$ in (137) is a Bayes response to a conditional distribution that converges almost surely, as $n \to \infty$, to the true conditional distribution of $X_i$ given $Z_{i-m}^{i+m}$. It thus follows from continuity of the performance of Bayes responses (cf., e.g., [Han57, Equation 4], [WOS+03, Lemma 1]) and (135) that
  $\lim_{n\to\infty} \Lambda_n(X^n, \hat{X}^{[m],n}(Z^n)) = E\phi\!\left( P_{X_0 | Z_{-m}^{m}} \right)$  a.s.,   (138)
with $\phi$ denoting the Bayes envelope associated with the loss function $\Lambda$, defined, for $P \in \mathcal{M}(A)$, by
  $\phi(P) = \min_{\hat{x} \in A} \sum_{x \in A} \Lambda(x, \hat{x}) P(x)$.   (139)
Letting $m \to \infty$ in (138) gives
  $\lim_{m\to\infty} \lim_{n\to\infty} \Lambda_n(X^n, \hat{X}^{[m],n}(Z^n)) = E\phi\!\left( P_{X_0 | Z_{-\infty}^{\infty}} \right)$  a.s.,   (140)
where the right side is the asymptotic optimum distribution-dependent denoising performance (cf. [WOS+03, Section 5]). Note that $\hat{X}^{[m],n}$ can be chosen independently of a source, e.g., taking the Yang-Kieffer codes of [YK96] followed by the post-processing detailed in (137). Thus, (140) implies that the post-processing step leads to essentially asymptotically optimum denoising performance. Bounded convergence implies also
  $\lim_{m\to\infty} \lim_{n\to\infty} E\Lambda_n(X^n, \hat{X}^{[m],n}(Z^n)) = E\phi\!\left( P_{X_0 | Z_{-\infty}^{\infty}} \right)$,   (141)
implying, in turn, the existence of a deterministic sequence $\{m_n\}$ such that, for $\hat{X}^n = \hat{X}^{[m_n],n}$,
  $\lim_{n\to\infty} E\Lambda_n(X^n, \hat{X}^{[m_n],n}(Z^n)) = E\phi\!\left( P_{X_0 | Z_{-\infty}^{\infty}} \right)$.   (142)
The sequence $\{m_n\}$ for which (142) holds, however, may depend on the particular code sequence $\{Y^n\}$, as well as on the distribution of the active source $\mathbf{X}$.
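A compact Python rendition of the post-processing rule (137) is sketched below. This is our own illustrative implementation under assumed inputs (the reconstruction y would come from whatever good lossy code is applied to the noisy sequence); it is not the paper's code.

```python
import numpy as np
from collections import defaultdict

def derandomized_denoiser(z, y, m, M, Lambda):
    """Post-process a lossy reconstruction y of the noisy sequence z, as in (137):
    at each location i, output the Bayes response (under loss Lambda) to the
    empirical conditional distribution of the code's center symbol given the
    (2m+1)-window of z."""
    n = len(z)
    counts = defaultdict(lambda: np.zeros(M))
    for i in range(m, n - m):                     # counts for Q_{2m+1}[z, y](window, center of y)
        counts[tuple(z[i - m:i + m + 1])][y[i]] += 1
    Lambda = np.asarray(Lambda, float)
    x_hat = np.array(z)                           # arbitrary choice near the edges
    for i in range(m, n - m):
        q = counts[tuple(z[i - m:i + m + 1])]
        x_hat[i] = int(np.argmin(Lambda.T @ q))   # argmin_xhat sum_y q(y) Lambda(y, xhat)
    return x_hat

if __name__ == "__main__":
    # Tiny illustration with Hamming loss on a binary alphabet.
    z = np.array([0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0])
    y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0])
    print(derandomized_denoiser(z, y, m=1, M=2, Lambda=[[0, 1], [1, 0]]))
```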

5 Conclusion and Open Questions

In this work we have seen that a rate constraint on a source code places a similar limitation on the mutual information associated with its $k$-th-order marginalization, for any finite $k$, as $n$ becomes large. This was shown to be the case both in the distributional sense of Section 3 and the pointwise sense of Section 4. This was also shown to imply, in various quantitative senses, that the empirical distribution of codes that are good, in the sense of asymptotically attaining a point on the rate distortion curve, becomes close to that attaining the minimum mutual information problem associated with the rate distortion function of the source. For a source corrupted by additive white noise this was shown to imply that, under the right distortion measure and at the right distortion level, the joint empirical distribution of the noisy source and the reconstruction associated with a good code tends to imitate the joint distribution of the noisy and noise-free source. This property led to performance bounds, both in expectation and pointwise, on the denoising performance of such codes. It was also seen to imply the existence of a simple post-processing procedure that, when applied to these compression-based denoisers, results in essentially optimum denoising performance.

One of the salient questions left open in the context of the third item of Theorem 3 (respectively, Theorem 9) is whether the convergence in distribution, in some appropriately defined sense, continues to hold in cases beyond those required in the second item. Another interesting question, in the context of the post-processing scheme of subsection 4.E, is whether there exists a universally good code sequence for the noisy source sequence (under the distortion measure and level associated with the given noise distribution) and a corresponding growth rate for $m$ under which (142) holds simultaneously for all stationary ergodic underlying noise-free sources.

Appendix

A  Proof of Lemma 1

We shall use the following elementary fact from analysis.

Fact 3 Let $f : \mathcal{D} \to \mathbb{R}$ be continuous and $\mathcal{D}$ be compact. Assume further that $\min_{z \in \mathcal{D}} f(z)$ is uniquely attained by some $z^* \in \mathcal{D}$. Let $z_n \in \mathcal{D}$ satisfy
  $\lim_{n\to\infty} f(z_n) = f(z^*) = \min_{z \in \mathcal{D}} f(z)$.
Then
  $\lim_{n\to\infty} z_n = z^*$.

Equipped with this, we prove Lemma 1 as follows. Let $P^n_{Y|X}$ denote the conditional distribution of $Y$ given $X$ induced by $P^n_{X,Y}$, and define
  $\hat{P}^n_{X,Y} = P_X \circ P^n_{Y|X}$.   (A.1)
It follows from (32) that
  $\| \hat{P}^n_{X,Y} - P^n_{X,Y} \| \to 0$,   (A.2)
and therefore also, by (34),
  $\lim_{n\to\infty} E_{\hat{P}^n_{X,Y}} \rho(X, Y) = D$.   (A.3)
Define further $\bar{P}^n_{X,Y}$ by
  $\bar{P}^n_{X,Y} = \alpha_n \hat{P}^n_{X,Y} + (1 - \alpha_n) P_X \circ \delta_{Y|X}$,   (A.4)
where
  $\alpha_n = \min\left\{ \frac{D}{E_{\hat{P}^n_{X,Y}} \rho(X, Y)}, \ 1 \right\}$.   (A.5)
By construction,
  $\bar{P}^n_X = P_X$ and $E_{\bar{P}^n_{X,Y}} \rho(X, Y) \le D$ for all $n$.   (A.6)
Also, (A.3) implies that $\alpha_n \to 1$, and hence $\| \bar{P}^n_{X,Y} - \hat{P}^n_{X,Y} \| \to 0$, implying, when combined with (A.2), that
  $\| \bar{P}^n_{X,Y} - P^n_{X,Y} \| \to 0$,   (A.7)
implying, in turn, by (33) and the uniform continuity of $I$,
  $\lim_{n\to\infty} I(\bar{P}^n_{X,Y}) = R(D)$.   (A.8)
Now, the set
  $\{ P'_{X,Y} : P'_X = P_X, \ E_{P'_{X,Y}} \rho(X, Y) \le D \}$
is compact, $I$ is continuous, and $R(D)$ is uniquely achieved by $P_{X,Y}$. So (A.6), (A.8), and Fact 3 imply
  $\bar{P}^n_{X,Y} \stackrel{d}{\to} P_{X,Y}$,   (A.9)
implying (35) by (A.7).

B  Proof of Lemma 3

Inequality (83) implies the existence of $\eta > 0$ and $n_0$ such that
  $|A_n| \le e^{n[H(\mathbf{X}) - 2\eta]}$ for all $n \ge n_0$.   (A.10)
Defining
  $G_{n,\eta} = \left\{ x^n : -\frac{1}{n} \log P_{X^n}(x^n) \le H(\mathbf{X}) + \eta \right\}$,   (A.11)


Lattices for Distributed Source Coding: Jointly Gaussian Sources and Reconstruction of a Linear Function Lattices for Distributed Source Coding: Jointly Gaussian Sources and Reconstruction of a Linear Function Dinesh Krithivasan and S. Sandeep Pradhan Department of Electrical Engineering and Computer Science,

More information

18.2 Continuous Alphabet (discrete-time, memoryless) Channel

18.2 Continuous Alphabet (discrete-time, memoryless) Channel 0-704: Information Processing and Learning Spring 0 Lecture 8: Gaussian channel, Parallel channels and Rate-distortion theory Lecturer: Aarti Singh Scribe: Danai Koutra Disclaimer: These notes have not

More information

Discrete Universal Filtering Through Incremental Parsing

Discrete Universal Filtering Through Incremental Parsing Discrete Universal Filtering Through Incremental Parsing Erik Ordentlich, Tsachy Weissman, Marcelo J. Weinberger, Anelia Somekh Baruch, Neri Merhav Abstract In the discrete filtering problem, a data sequence

More information

ECE 4400:693 - Information Theory

ECE 4400:693 - Information Theory ECE 4400:693 - Information Theory Dr. Nghi Tran Lecture 8: Differential Entropy Dr. Nghi Tran (ECE-University of Akron) ECE 4400:693 Lecture 1 / 43 Outline 1 Review: Entropy of discrete RVs 2 Differential

More information

National University of Singapore Department of Electrical & Computer Engineering. Examination for

National University of Singapore Department of Electrical & Computer Engineering. Examination for National University of Singapore Department of Electrical & Computer Engineering Examination for EE5139R Information Theory for Communication Systems (Semester I, 2014/15) November/December 2014 Time Allowed:

More information

Information measures in simple coding problems

Information measures in simple coding problems Part I Information measures in simple coding problems in this web service in this web service Source coding and hypothesis testing; information measures A(discrete)source is a sequence {X i } i= of random

More information

Inequalities Relating Addition and Replacement Type Finite Sample Breakdown Points

Inequalities Relating Addition and Replacement Type Finite Sample Breakdown Points Inequalities Relating Addition and Replacement Type Finite Sample Breadown Points Robert Serfling Department of Mathematical Sciences University of Texas at Dallas Richardson, Texas 75083-0688, USA Email:

More information

EE376A: Homework #3 Due by 11:59pm Saturday, February 10th, 2018

EE376A: Homework #3 Due by 11:59pm Saturday, February 10th, 2018 Please submit the solutions on Gradescope. EE376A: Homework #3 Due by 11:59pm Saturday, February 10th, 2018 1. Optimal codeword lengths. Although the codeword lengths of an optimal variable length code

More information

Pointwise Redundancy in Lossy Data Compression and Universal Lossy Data Compression

Pointwise Redundancy in Lossy Data Compression and Universal Lossy Data Compression Pointwise Redundancy in Lossy Data Compression and Universal Lossy Data Compression I. Kontoyiannis To appear, IEEE Transactions on Information Theory, Jan. 2000 Last revision, November 21, 1999 Abstract

More information

(each row defines a probability distribution). Given n-strings x X n, y Y n we can use the absence of memory in the channel to compute

(each row defines a probability distribution). Given n-strings x X n, y Y n we can use the absence of memory in the channel to compute ENEE 739C: Advanced Topics in Signal Processing: Coding Theory Instructor: Alexander Barg Lecture 6 (draft; 9/6/03. Error exponents for Discrete Memoryless Channels http://www.enee.umd.edu/ abarg/enee739c/course.html

More information

The Information Lost in Erasures Sergio Verdú, Fellow, IEEE, and Tsachy Weissman, Senior Member, IEEE

The Information Lost in Erasures Sergio Verdú, Fellow, IEEE, and Tsachy Weissman, Senior Member, IEEE 5030 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 54, NO. 11, NOVEMBER 2008 The Information Lost in Erasures Sergio Verdú, Fellow, IEEE, Tsachy Weissman, Senior Member, IEEE Abstract We consider sources

More information

Shannon s A Mathematical Theory of Communication

Shannon s A Mathematical Theory of Communication Shannon s A Mathematical Theory of Communication Emre Telatar EPFL Kanpur October 19, 2016 First published in two parts in the July and October 1948 issues of BSTJ. First published in two parts in the

More information

Feedback Capacity of a Class of Symmetric Finite-State Markov Channels

Feedback Capacity of a Class of Symmetric Finite-State Markov Channels Feedback Capacity of a Class of Symmetric Finite-State Markov Channels Nevroz Şen, Fady Alajaji and Serdar Yüksel Department of Mathematics and Statistics Queen s University Kingston, ON K7L 3N6, Canada

More information

lossless, optimal compressor

lossless, optimal compressor 6. Variable-length Lossless Compression The principal engineering goal of compression is to represent a given sequence a, a 2,..., a n produced by a source as a sequence of bits of minimal possible length.

More information

ECE Information theory Final (Fall 2008)

ECE Information theory Final (Fall 2008) ECE 776 - Information theory Final (Fall 2008) Q.1. (1 point) Consider the following bursty transmission scheme for a Gaussian channel with noise power N and average power constraint P (i.e., 1/n X n i=1

More information

An Extended Fano s Inequality for the Finite Blocklength Coding

An Extended Fano s Inequality for the Finite Blocklength Coding An Extended Fano s Inequality for the Finite Bloclength Coding Yunquan Dong, Pingyi Fan {dongyq8@mails,fpy@mail}.tsinghua.edu.cn Department of Electronic Engineering, Tsinghua University, Beijing, P.R.

More information

Superposition Encoding and Partial Decoding Is Optimal for a Class of Z-interference Channels

Superposition Encoding and Partial Decoding Is Optimal for a Class of Z-interference Channels Superposition Encoding and Partial Decoding Is Optimal for a Class of Z-interference Channels Nan Liu and Andrea Goldsmith Department of Electrical Engineering Stanford University, Stanford CA 94305 Email:

More information

Lecture 2. Capacity of the Gaussian channel

Lecture 2. Capacity of the Gaussian channel Spring, 207 5237S, Wireless Communications II 2. Lecture 2 Capacity of the Gaussian channel Review on basic concepts in inf. theory ( Cover&Thomas: Elements of Inf. Theory, Tse&Viswanath: Appendix B) AWGN

More information

3F1: Signals and Systems INFORMATION THEORY Examples Paper Solutions

3F1: Signals and Systems INFORMATION THEORY Examples Paper Solutions Engineering Tripos Part IIA THIRD YEAR 3F: Signals and Systems INFORMATION THEORY Examples Paper Solutions. Let the joint probability mass function of two binary random variables X and Y be given in the

More information

Lecture 8: Shannon s Noise Models

Lecture 8: Shannon s Noise Models Error Correcting Codes: Combinatorics, Algorithms and Applications (Fall 2007) Lecture 8: Shannon s Noise Models September 14, 2007 Lecturer: Atri Rudra Scribe: Sandipan Kundu& Atri Rudra Till now we have

More information

MARKOV CHAINS A finite state Markov chain is a sequence of discrete cv s from a finite alphabet where is a pmf on and for

MARKOV CHAINS A finite state Markov chain is a sequence of discrete cv s from a finite alphabet where is a pmf on and for MARKOV CHAINS A finite state Markov chain is a sequence S 0,S 1,... of discrete cv s from a finite alphabet S where q 0 (s) is a pmf on S 0 and for n 1, Q(s s ) = Pr(S n =s S n 1 =s ) = Pr(S n =s S n 1

More information

THIS paper constitutes a systematic study of the capacity

THIS paper constitutes a systematic study of the capacity 4 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 48, NO. 1, JANUARY 2002 Capacities of TimeVarying MultipleAccess Channels With Side Information Arnab Das Prakash Narayan, Fellow, IEEE Abstract We determine

More information

Optimal Encoding Schemes for Several Classes of Discrete Degraded Broadcast Channels

Optimal Encoding Schemes for Several Classes of Discrete Degraded Broadcast Channels Optimal Encoding Schemes for Several Classes of Discrete Degraded Broadcast Channels Bike Xie, Student Member, IEEE and Richard D. Wesel, Senior Member, IEEE Abstract arxiv:0811.4162v4 [cs.it] 8 May 2009

More information

Homework 10 Solution

Homework 10 Solution CS 174: Combinatorics and Discrete Probability Fall 2012 Homewor 10 Solution Problem 1. (Exercise 10.6 from MU 8 points) The problem of counting the number of solutions to a napsac instance can be defined

More information

Second-Order Asymptotics in Information Theory

Second-Order Asymptotics in Information Theory Second-Order Asymptotics in Information Theory Vincent Y. F. Tan (vtan@nus.edu.sg) Dept. of ECE and Dept. of Mathematics National University of Singapore (NUS) National Taiwan University November 2015

More information

Subset Source Coding

Subset Source Coding Fifty-third Annual Allerton Conference Allerton House, UIUC, Illinois, USA September 29 - October 2, 205 Subset Source Coding Ebrahim MolavianJazi and Aylin Yener Wireless Communications and Networking

More information

MASSACHUSETTS INSTITUTE OF TECHNOLOGY Department of Electrical Engineering and Computer Science Transmission of Information Spring 2006

MASSACHUSETTS INSTITUTE OF TECHNOLOGY Department of Electrical Engineering and Computer Science Transmission of Information Spring 2006 MASSACHUSETTS INSTITUTE OF TECHNOLOGY Department of Electrical Engineering and Computer Science 6.44 Transmission of Information Spring 2006 Homework 2 Solution name username April 4, 2006 Reading: Chapter

More information

Lecture 22: Final Review

Lecture 22: Final Review Lecture 22: Final Review Nuts and bolts Fundamental questions and limits Tools Practical algorithms Future topics Dr Yao Xie, ECE587, Information Theory, Duke University Basics Dr Yao Xie, ECE587, Information

More information

EECS 229A Spring 2007 * * (a) By stationarity and the chain rule for entropy, we have

EECS 229A Spring 2007 * * (a) By stationarity and the chain rule for entropy, we have EECS 229A Spring 2007 * * Solutions to Homework 3 1. Problem 4.11 on pg. 93 of the text. Stationary processes (a) By stationarity and the chain rule for entropy, we have H(X 0 ) + H(X n X 0 ) = H(X 0,

More information

Entropy and Ergodic Theory Lecture 4: Conditional entropy and mutual information

Entropy and Ergodic Theory Lecture 4: Conditional entropy and mutual information Entropy and Ergodic Theory Lecture 4: Conditional entropy and mutual information 1 Conditional entropy Let (Ω, F, P) be a probability space, let X be a RV taking values in some finite set A. In this lecture

More information

Entropy and Ergodic Theory Notes 21: The entropy rate of a stationary process

Entropy and Ergodic Theory Notes 21: The entropy rate of a stationary process Entropy and Ergodic Theory Notes 21: The entropy rate of a stationary process 1 Sources with memory In information theory, a stationary stochastic processes pξ n q npz taking values in some finite alphabet

More information

EE 4TM4: Digital Communications II. Channel Capacity

EE 4TM4: Digital Communications II. Channel Capacity EE 4TM4: Digital Communications II 1 Channel Capacity I. CHANNEL CODING THEOREM Definition 1: A rater is said to be achievable if there exists a sequence of(2 nr,n) codes such thatlim n P (n) e (C) = 0.

More information

EE5139R: Problem Set 4 Assigned: 31/08/16, Due: 07/09/16

EE5139R: Problem Set 4 Assigned: 31/08/16, Due: 07/09/16 EE539R: Problem Set 4 Assigned: 3/08/6, Due: 07/09/6. Cover and Thomas: Problem 3.5 Sets defined by probabilities: Define the set C n (t = {x n : P X n(x n 2 nt } (a We have = P X n(x n P X n(x n 2 nt

More information

Exercises with solutions (Set D)

Exercises with solutions (Set D) Exercises with solutions Set D. A fair die is rolled at the same time as a fair coin is tossed. Let A be the number on the upper surface of the die and let B describe the outcome of the coin toss, where

More information

Chapter I: Fundamental Information Theory

Chapter I: Fundamental Information Theory ECE-S622/T62 Notes Chapter I: Fundamental Information Theory Ruifeng Zhang Dept. of Electrical & Computer Eng. Drexel University. Information Source Information is the outcome of some physical processes.

More information

Communications Theory and Engineering

Communications Theory and Engineering Communications Theory and Engineering Master's Degree in Electronic Engineering Sapienza University of Rome A.A. 2018-2019 AEP Asymptotic Equipartition Property AEP In information theory, the analog of

More information

Lecture 4: Proof of Shannon s theorem and an explicit code

Lecture 4: Proof of Shannon s theorem and an explicit code CSE 533: Error-Correcting Codes (Autumn 006 Lecture 4: Proof of Shannon s theorem and an explicit code October 11, 006 Lecturer: Venkatesan Guruswami Scribe: Atri Rudra 1 Overview Last lecture we stated

More information

On Competitive Prediction and Its Relation to Rate-Distortion Theory

On Competitive Prediction and Its Relation to Rate-Distortion Theory IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 49, NO. 12, DECEMBER 2003 3185 On Competitive Prediction and Its Relation to Rate-Distortion Theory Tsachy Weissman, Member, IEEE, and Neri Merhav, Fellow,

More information

Notes 3: Stochastic channels and noisy coding theorem bound. 1 Model of information communication and noisy channel

Notes 3: Stochastic channels and noisy coding theorem bound. 1 Model of information communication and noisy channel Introduction to Coding Theory CMU: Spring 2010 Notes 3: Stochastic channels and noisy coding theorem bound January 2010 Lecturer: Venkatesan Guruswami Scribe: Venkatesan Guruswami We now turn to the basic

More information

Exercise Solutions to Functional Analysis

Exercise Solutions to Functional Analysis Exercise Solutions to Functional Analysis Note: References refer to M. Schechter, Principles of Functional Analysis Exersize that. Let φ,..., φ n be an orthonormal set in a Hilbert space H. Show n f n

More information

Capacity of the Discrete Memoryless Energy Harvesting Channel with Side Information

Capacity of the Discrete Memoryless Energy Harvesting Channel with Side Information 204 IEEE International Symposium on Information Theory Capacity of the Discrete Memoryless Energy Harvesting Channel with Side Information Omur Ozel, Kaya Tutuncuoglu 2, Sennur Ulukus, and Aylin Yener

More information

Lecture 11: Polar codes construction

Lecture 11: Polar codes construction 15-859: Information Theory and Applications in TCS CMU: Spring 2013 Lecturer: Venkatesan Guruswami Lecture 11: Polar codes construction February 26, 2013 Scribe: Dan Stahlke 1 Polar codes: recap of last

More information

Linear Codes, Target Function Classes, and Network Computing Capacity

Linear Codes, Target Function Classes, and Network Computing Capacity Linear Codes, Target Function Classes, and Network Computing Capacity Rathinakumar Appuswamy, Massimo Franceschetti, Nikhil Karamchandani, and Kenneth Zeger IEEE Transactions on Information Theory Submitted:

More information

Side-information Scalable Source Coding

Side-information Scalable Source Coding Side-information Scalable Source Coding Chao Tian, Member, IEEE, Suhas N. Diggavi, Member, IEEE Abstract The problem of side-information scalable (SI-scalable) source coding is considered in this work,

More information

Polar codes for the m-user MAC and matroids

Polar codes for the m-user MAC and matroids Research Collection Conference Paper Polar codes for the m-user MAC and matroids Author(s): Abbe, Emmanuel; Telatar, Emre Publication Date: 2010 Permanent Link: https://doi.org/10.3929/ethz-a-005997169

More information

x log x, which is strictly convex, and use Jensen s Inequality:

x log x, which is strictly convex, and use Jensen s Inequality: 2. Information measures: mutual information 2.1 Divergence: main inequality Theorem 2.1 (Information Inequality). D(P Q) 0 ; D(P Q) = 0 iff P = Q Proof. Let ϕ(x) x log x, which is strictly convex, and

More information

The Method of Types and Its Application to Information Hiding

The Method of Types and Its Application to Information Hiding The Method of Types and Its Application to Information Hiding Pierre Moulin University of Illinois at Urbana-Champaign www.ifp.uiuc.edu/ moulin/talks/eusipco05-slides.pdf EUSIPCO Antalya, September 7,

More information

EE376A - Information Theory Final, Monday March 14th 2016 Solutions. Please start answering each question on a new page of the answer booklet.

EE376A - Information Theory Final, Monday March 14th 2016 Solutions. Please start answering each question on a new page of the answer booklet. EE376A - Information Theory Final, Monday March 14th 216 Solutions Instructions: You have three hours, 3.3PM - 6.3PM The exam has 4 questions, totaling 12 points. Please start answering each question on

More information

INFORMATION THEORY AND STATISTICS

INFORMATION THEORY AND STATISTICS CHAPTER INFORMATION THEORY AND STATISTICS We now explore the relationship between information theory and statistics. We begin by describing the method of types, which is a powerful technique in large deviation

More information

Lecture 6: Gaussian Channels. Copyright G. Caire (Sample Lectures) 157

Lecture 6: Gaussian Channels. Copyright G. Caire (Sample Lectures) 157 Lecture 6: Gaussian Channels Copyright G. Caire (Sample Lectures) 157 Differential entropy (1) Definition 18. The (joint) differential entropy of a continuous random vector X n p X n(x) over R is: Z h(x

More information

Appendix B Information theory from first principles

Appendix B Information theory from first principles Appendix B Information theory from first principles This appendix discusses the information theory behind the capacity expressions used in the book. Section 8.3.4 is the only part of the book that supposes

More information

EE376A: Homeworks #4 Solutions Due on Thursday, February 22, 2018 Please submit on Gradescope. Start every question on a new page.

EE376A: Homeworks #4 Solutions Due on Thursday, February 22, 2018 Please submit on Gradescope. Start every question on a new page. EE376A: Homeworks #4 Solutions Due on Thursday, February 22, 28 Please submit on Gradescope. Start every question on a new page.. Maximum Differential Entropy (a) Show that among all distributions supported

More information

On Common Information and the Encoding of Sources that are Not Successively Refinable

On Common Information and the Encoding of Sources that are Not Successively Refinable On Common Information and the Encoding of Sources that are Not Successively Refinable Kumar Viswanatha, Emrah Akyol, Tejaswi Nanjundaswamy and Kenneth Rose ECE Department, University of California - Santa

More information

Network coding for multicast relation to compression and generalization of Slepian-Wolf

Network coding for multicast relation to compression and generalization of Slepian-Wolf Network coding for multicast relation to compression and generalization of Slepian-Wolf 1 Overview Review of Slepian-Wolf Distributed network compression Error exponents Source-channel separation issues

More information

(Classical) Information Theory III: Noisy channel coding

(Classical) Information Theory III: Noisy channel coding (Classical) Information Theory III: Noisy channel coding Sibasish Ghosh The Institute of Mathematical Sciences CIT Campus, Taramani, Chennai 600 113, India. p. 1 Abstract What is the best possible way

More information

Dispersion of the Gilbert-Elliott Channel

Dispersion of the Gilbert-Elliott Channel Dispersion of the Gilbert-Elliott Channel Yury Polyanskiy Email: ypolyans@princeton.edu H. Vincent Poor Email: poor@princeton.edu Sergio Verdú Email: verdu@princeton.edu Abstract Channel dispersion plays

More information

(Classical) Information Theory II: Source coding

(Classical) Information Theory II: Source coding (Classical) Information Theory II: Source coding Sibasish Ghosh The Institute of Mathematical Sciences CIT Campus, Taramani, Chennai 600 113, India. p. 1 Abstract The information content of a random variable

More information

Reliable Computation over Multiple-Access Channels

Reliable Computation over Multiple-Access Channels Reliable Computation over Multiple-Access Channels Bobak Nazer and Michael Gastpar Dept. of Electrical Engineering and Computer Sciences University of California, Berkeley Berkeley, CA, 94720-1770 {bobak,

More information

Mismatched Multi-letter Successive Decoding for the Multiple-Access Channel

Mismatched Multi-letter Successive Decoding for the Multiple-Access Channel Mismatched Multi-letter Successive Decoding for the Multiple-Access Channel Jonathan Scarlett University of Cambridge jms265@cam.ac.uk Alfonso Martinez Universitat Pompeu Fabra alfonso.martinez@ieee.org

More information

Entropies & Information Theory

Entropies & Information Theory Entropies & Information Theory LECTURE I Nilanjana Datta University of Cambridge,U.K. See lecture notes on: http://www.qi.damtp.cam.ac.uk/node/223 Quantum Information Theory Born out of Classical Information

More information

Information Theory CHAPTER. 5.1 Introduction. 5.2 Entropy

Information Theory CHAPTER. 5.1 Introduction. 5.2 Entropy Haykin_ch05_pp3.fm Page 207 Monday, November 26, 202 2:44 PM CHAPTER 5 Information Theory 5. Introduction As mentioned in Chapter and reiterated along the way, the purpose of a communication system is

More information

Randomized Quantization and Optimal Design with a Marginal Constraint

Randomized Quantization and Optimal Design with a Marginal Constraint Randomized Quantization and Optimal Design with a Marginal Constraint Naci Saldi, Tamás Linder, Serdar Yüksel Department of Mathematics and Statistics, Queen s University, Kingston, ON, Canada Email: {nsaldi,linder,yuksel}@mast.queensu.ca

More information

Information Dimension

Information Dimension Information Dimension Mina Karzand Massachusetts Institute of Technology November 16, 2011 1 / 26 2 / 26 Let X would be a real-valued random variable. For m N, the m point uniform quantized version of

More information

Lecture 4 Channel Coding

Lecture 4 Channel Coding Capacity and the Weak Converse Lecture 4 Coding I-Hsiang Wang Department of Electrical Engineering National Taiwan University ihwang@ntu.edu.tw October 15, 2014 1 / 16 I-Hsiang Wang NIT Lecture 4 Capacity

More information

Information Theory. Lecture 5 Entropy rate and Markov sources STEFAN HÖST

Information Theory. Lecture 5 Entropy rate and Markov sources STEFAN HÖST Information Theory Lecture 5 Entropy rate and Markov sources STEFAN HÖST Universal Source Coding Huffman coding is optimal, what is the problem? In the previous coding schemes (Huffman and Shannon-Fano)it

More information

An approach from classical information theory to lower bounds for smooth codes

An approach from classical information theory to lower bounds for smooth codes An approach from classical information theory to lower bounds for smooth codes Abstract Let C : {0, 1} n {0, 1} m be a code encoding an n-bit string into an m-bit string. Such a code is called a (q, c,

More information

Training-Based Schemes are Suboptimal for High Rate Asynchronous Communication

Training-Based Schemes are Suboptimal for High Rate Asynchronous Communication Training-Based Schemes are Suboptimal for High Rate Asynchronous Communication The MIT Faculty has made this article openly available. Please share how this access benefits you. Your story matters. Citation

More information

Discrete Memoryless Channels with Memoryless Output Sequences

Discrete Memoryless Channels with Memoryless Output Sequences Discrete Memoryless Channels with Memoryless utput Sequences Marcelo S Pinho Department of Electronic Engineering Instituto Tecnologico de Aeronautica Sao Jose dos Campos, SP 12228-900, Brazil Email: mpinho@ieeeorg

More information

MATH 205C: STATIONARY PHASE LEMMA

MATH 205C: STATIONARY PHASE LEMMA MATH 205C: STATIONARY PHASE LEMMA For ω, consider an integral of the form I(ω) = e iωf(x) u(x) dx, where u Cc (R n ) complex valued, with support in a compact set K, and f C (R n ) real valued. Thus, I(ω)

More information

Lecture 8: Channel Capacity, Continuous Random Variables

Lecture 8: Channel Capacity, Continuous Random Variables EE376A/STATS376A Information Theory Lecture 8-02/0/208 Lecture 8: Channel Capacity, Continuous Random Variables Lecturer: Tsachy Weissman Scribe: Augustine Chemparathy, Adithya Ganesh, Philip Hwang Channel

More information

6196 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 57, NO. 9, SEPTEMBER 2011

6196 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 57, NO. 9, SEPTEMBER 2011 6196 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 57, NO. 9, SEPTEMBER 2011 On the Structure of Real-Time Encoding and Decoding Functions in a Multiterminal Communication System Ashutosh Nayyar, Student

More information

Secret Key and Private Key Constructions for Simple Multiterminal Source Models

Secret Key and Private Key Constructions for Simple Multiterminal Source Models Secret Key and Private Key Constructions for Simple Multiterminal Source Models arxiv:cs/05050v [csit] 3 Nov 005 Chunxuan Ye Department of Electrical and Computer Engineering and Institute for Systems

More information

Coding for Discrete Source

Coding for Discrete Source EGR 544 Communication Theory 3. Coding for Discrete Sources Z. Aliyazicioglu Electrical and Computer Engineering Department Cal Poly Pomona Coding for Discrete Source Coding Represent source data effectively

More information

Source Coding. Master Universitario en Ingeniería de Telecomunicación. I. Santamaría Universidad de Cantabria

Source Coding. Master Universitario en Ingeniería de Telecomunicación. I. Santamaría Universidad de Cantabria Source Coding Master Universitario en Ingeniería de Telecomunicación I. Santamaría Universidad de Cantabria Contents Introduction Asymptotic Equipartition Property Optimal Codes (Huffman Coding) Universal

More information