Achievable Rates for Probabilistic Shaping

Georg Böcherer
Mathematical and Algorithmic Sciences Lab, Huawei Technologies France S.A.S.U.
georg.boecherer@ieee.org

arXiv:1707.01134v5 [cs.IT], May 2018

For a layered probabilistic shaping (PS) scheme with a general decoding metric, an achievable rate is derived using Gallager's error exponent approach and the concept of achievable code rates is introduced. Several instances for specific decoding metrics are discussed, including bit-metric decoding, interleaved coded modulation, and hard-decision decoding. It is shown that important previously known achievable rates can also be achieved by layered PS. A practical instance of layered PS is the recently proposed probabilistic amplitude shaping (PAS).

1 Introduction

Communication channels often have non-uniform capacity-achieving input distributions, which is the main motivation for probabilistic shaping (PS), i.e., the development of practical transmission schemes that use non-uniform input distributions. Many different PS schemes have been proposed in the literature; see, e.g., the literature review in [1, Section II]. In [1], we proposed probabilistic amplitude shaping (PAS), a layered PS architecture that concatenates a distribution matcher (DM) with a systematic encoder of a forward error correcting (FEC) code. In a nutshell, PAS works as follows. The DM serves as a shaping encoder and maps data bits to non-uniformly distributed ("shaped") amplitude sequences, which are then systematically FEC encoded, preserving the amplitude distribution. The additionally generated redundancy bits are mapped to sign sequences that are multiplied entrywise with the amplitude sequences, resulting in a capacity-achieving input distribution for the practically relevant discrete-input additive white Gaussian noise channel.

In this work, we take an information-theoretic perspective and use random coding arguments following Gallager's error exponent approach [2, Chapter 5] to derive achievable rates for a layered PS scheme of which PAS is a practical instance. Because transmission rate and FEC code rate are different for layered PS, we introduce achievable code rates. The proposed achievable rate is amenable to analysis and we instantiate it for several special cases, including bit-metric decoding, interleaved coded modulation, hard-decision decoding, and binary hard-decision decoding.

Section 2 provides preliminaries and notation. We define the layered PS scheme in Section 3. In Section 4, we state and discuss the main results for a generic decoding metric. We discuss metric design and metric assessment in Section 5 and Section 6, respectively. The main results are proven in Section 7.

2 Preliminaries

2.1 Empirical Distributions

Let X be a finite set and consider a length-n sequence x^n = x_1 x_2 ⋯ x_n with entries x_i ∈ X. Let N(a|x^n) be the number of times that letter a ∈ X occurs in x^n, i.e.,

  N(a|x^n) = |{i : x_i = a}|.   (1)

The empirical distribution (type) of x^n is

  P_X(a) = N(a|x^n)/n,  a ∈ X.   (2)

The type P_X can also be interpreted as a probability distribution P_X on X, assigning to each letter a ∈ X the probability Pr(X = a) = P_X(a). The concept of letter-typical sequences as defined in [3, Section 1.3] describes a set of sequences that have approximately the same type. For ε ≥ 0, we say x^n is ε-letter-typical with respect to P_X if for each letter a ∈ X,

  (1 − ε) P_X(a) ≤ N(a|x^n)/n ≤ (1 + ε) P_X(a),  a ∈ X.   (3)

The sequences satisfying (3) are called typical in [4, Section 3.3], [5, Section 2.4] and robust typical in [6, Appendix]. We denote the set of letter-typical sequences by T_ε^n(P_X).

2.2 Expectations

For a real-valued function f on X, the expectation of f(X) is

  E[f(X)] = Σ_{a ∈ supp P_X} P_X(a) f(a)   (4)

where supp P_X is the support of P_X. The conditional expectation is

  E[f(X) | Y = b] = Σ_{a ∈ supp P_X} P_{X|Y}(a|b) f(a)   (5)

where for each b ∈ Y, P_{X|Y}(·|b) is a distribution on X. Accordingly,

  E[f(X) | Y] = Σ_{a ∈ supp P_X} P_{X|Y}(a|Y) f(a)   (6)

is a random variable and

  E[E[f(X) | Y]] = E[f(X)].   (7)
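As an illustration of definitions (1)-(3), the following minimal Python sketch (not part of the paper; the target distribution and the test sequence are made-up examples) computes the type of a sequence and tests ε-letter-typicality.

```python
# Minimal sketch of the empirical distribution (2) and the
# epsilon-letter-typicality test (3); the example data is assumed.
from collections import Counter

def empirical_distribution(x, alphabet):
    """Type P_X(a) = N(a|x^n)/n of the sequence x over the given alphabet."""
    counts = Counter(x)
    n = len(x)
    return {a: counts.get(a, 0) / n for a in alphabet}

def is_letter_typical(x, P_X, eps):
    """Check (1-eps)P_X(a) <= N(a|x^n)/n <= (1+eps)P_X(a) for every letter a."""
    n = len(x)
    counts = Counter(x)
    return all((1 - eps) * p <= counts.get(a, 0) / n <= (1 + eps) * p
               for a, p in P_X.items())

# Example: a sequence whose type matches the target distribution exactly.
x = [0, 0, 0, 1] * 25                      # n = 100
P_X = {0: 0.75, 1: 0.25}
print(empirical_distribution(x, [0, 1]))   # {0: 0.75, 1: 0.25}
print(is_letter_typical(x, P_X, eps=0.1))  # True
```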

2.3 Information Measures

The entropy of a discrete distribution P_X is

  H(P_X) = H(X) = E[−log_2 P_X(X)].   (8)

The conditional entropy is

  H(X|Y) = E[−log_2 P_{X|Y}(X|Y)]   (9)

and the mutual information is

  I(X; Y) = H(X) − H(X|Y).   (10)

The cross-entropy of two distributions P_X, P_Z on X is

  X(P_X ‖ P_Z) = E[−log_2 P_Z(X)].   (11)

Note that the expectation in (11) is taken with respect to P_X. The informational divergence of two distributions P_X, P_Z on X is

  D(P_X ‖ P_Z) = E[log_2 (P_X(X)/P_Z(X))]   (12)
              = X(P_X ‖ P_Z) − H(X).   (13)

We define the uniform distribution on X as

  P_U(a) = 1/|X|,  a ∈ X.   (14)

We have

  D(P_X ‖ P_U) = H(U) − H(X) = log_2 |X| − H(X).   (15)

The information inequality states that

  D(P_X ‖ P_Z) ≥ 0   (16)

and equivalently

  X(P_X ‖ P_Z) ≥ H(P_X)   (17)

with equality if and only if P_X = P_Z.
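The information measures (8), (11)-(15) are easy to evaluate numerically for small alphabets. The short Python sketch below (an illustration with an assumed two-letter distribution, not from the paper) checks (15) and the information inequality (17).

```python
# Numerical companion for (8), (11)-(17), logarithms to base 2; example values assumed.
import math

def entropy(P):
    return -sum(p * math.log2(p) for p in P.values() if p > 0)

def cross_entropy(P, Q):
    # X(P || Q) = E_P[-log2 Q(X)]
    return -sum(p * math.log2(Q[a]) for a, p in P.items() if p > 0)

def divergence(P, Q):
    # D(P || Q) = X(P || Q) - H(P), cf. (13)
    return cross_entropy(P, Q) - entropy(P)

P = {0: 0.75, 1: 0.25}
U = {0: 0.5, 1: 0.5}                      # uniform distribution, cf. (14)
print(entropy(P))                         # ~0.811 bits
print(divergence(P, U))                   # = log2|X| - H(P) ~ 0.189, cf. (15)
print(cross_entropy(P, U) >= entropy(P))  # information inequality (17): True
```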

Figure 1: Layered PS capturing the essence of PAS proposed in [1]. (Block diagram: the shaping encoder maps message u ∈ {1, ..., 2^{nR_tx}} to an index w ∈ {1, ..., 2^{nR_c}}; the FEC encoder maps w to the code word x^n(w), which is transmitted over the channel; from the channel output y^n, the FEC decoder produces the index estimate ŵ and the shaping decoder the message estimate û.)

3 Layered Probabilistic Shaping

We consider the following transceiver setup (see also Figure 1). We consider a discrete-time channel with input alphabet X and output alphabet Y. We derive our results assuming continuous-valued output. Our results also apply for discrete output alphabets.

Random coding: For indices w = 1, 2, ..., |C|, we generate code words C^n(w) with the n|C| entries independent and uniformly distributed on X. The code is

  C = {C^n(1), C^n(2), ..., C^n(|C|)}.   (18)

The code rate is R_c = log_2(|C|)/n and equivalently, we have |C| = 2^{nR_c}.

Encoding: We set R_tx + R' = R_c and double-index the code words by C^n(u, v), u = 1, 2, ..., 2^{nR_tx}, v = 1, 2, ..., 2^{nR'}. We encode message u ∈ {1, ..., 2^{nR_tx}} by looking for a v such that C^n(u, v) ∈ T_ε^n(P_X). If we can find such a v, we transmit the corresponding code word. If not, we choose some arbitrary v and transmit the corresponding code word. The transmission rate is R_tx, since the encoder can encode 2^{nR_tx} different messages.

Decoding: We consider a non-negative metric q on X × Y and we define

  q^n(x^n, y^n) := ∏_{i=1}^{n} q(x_i, y_i),  x^n ∈ X^n, y^n ∈ Y^n.   (19)

For the channel output y^n, we let the receiver decode with the rule

  Ŵ = argmax_{w ∈ {1, ..., 2^{nR_c}}} ∏_{i=1}^{n} q(C_i(w), y_i).   (20)
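To make the random coding experiment, the shaping-set encoding and the metric decoding rule (18)-(20) concrete, here is a toy end-to-end Python sketch. Everything in it (the 4-ary alphabet, the target distribution, the symmetric test channel and the likelihood-style metric) is an assumed illustration and not the paper's construction; the block length and code size are kept tiny so the exhaustive decoder runs instantly.

```python
# Toy layered-PS sketch under assumed parameters: random code, shaping-set encoding,
# and metric decoding (20) in the log domain.
import math, random
random.seed(0)

X = [0, 1, 2, 3]                               # input alphabet
P_X = {0: 0.4, 1: 0.3, 2: 0.2, 3: 0.1}         # target (shaped) distribution
n, R_c, eps = 12, 0.75, 0.5                    # short block, code rate, typicality slack
num_cw = 2 ** round(n * R_c)                   # |C| = 2^{nR_c}

def typical(cw):
    # epsilon-letter-typicality test (3)
    return all((1 - eps) * p <= cw.count(a) / n <= (1 + eps) * p for a, p in P_X.items())

# Random code: entries i.i.d. uniform on X, cf. (18).
code = [tuple(random.choice(X) for _ in range(n)) for _ in range(num_cw)]

# Encoder: transmit some code word in the shaping set (arbitrary one if none exists).
w_tx = next((w for w, cw in enumerate(code) if typical(cw)), 0)
x = code[w_tx]

# Assumed channel: correct symbol with prob 0.9, otherwise a uniformly chosen error.
def channel(a):
    return a if random.random() < 0.9 else random.choice([b for b in X if b != a])
y = [channel(a) for a in x]

# Decoder: maximize sum_i log q(C_i(w), y_i) with q(a, b) = 0.9 if a == b else 0.1/3.
def log_metric(cw):
    return sum(math.log(0.9 if a == b else 0.1 / 3) for a, b in zip(cw, y))
w_hat = max(range(num_cw), key=lambda w: log_metric(code[w]))
print("transmitted index:", w_tx, "decoded index:", w_hat)
```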

Note that the decoder evaluates the metric on all code words in C, which includes code words that will never be transmitted because they are not in the shaping set T_ε^n(P_X).

Decoding error: We consider the error probability

  P_e = Pr(Ŵ ≠ W)   (21)

where W is the index of the transmitted code word and Ŵ is the detected index at the receiver. Note that Ŵ = W implies Û = U, where U is the encoded message and Û is the detected message. In particular, we have Pr(Û ≠ U) ≤ P_e.

Remark 1. The classical transceiver setup analyzed, e.g., in [2, Chapters 5 & 7], [7], [8], is as follows:

Random coding: For the code C̃ = {C̃^n(1), ..., C̃^n(2^{nR̃_c})}, the n·2^{nR̃_c} code word entries are generated independently according to the distribution P_X.

Encoding: Message u is mapped to code word C̃^n(u).

Decoding: The decoder uses the decoding rule

  û = argmax_{u ∈ {1, 2, ..., 2^{nR̃_c}}} ∏_{i=1}^{n} q(C̃_i(u), y_i).   (22)

Note that, in difference to layered PS, the code word index is equal to the message, i.e., w = u, and consequently, the transmission rate is equal to the code rate, i.e., R_tx = R̃_c, while for layered PS, we have R_tx < R_c.

Remark 2. In case the input distribution P_X is uniform, layered PS is equivalent to the classical transceiver.

4 Main Results

4.1 Achievable Encoding Rate

Proposition 1. Layered PS encoding is successful with high probability for large n if

  R_tx < [R_c − D(P_X ‖ P_U)]^+.   (23)

Proof. See Section 7.1.

If the right-hand side of (23) is positive, this condition means the following: out of the 2^{nR_c} code words, approximately 2^{n[R_c − D(P_X‖P_U)]} have approximately the distribution P_X and may be selected by the encoder for transmission. If the code rate is less than the informational divergence, then very likely, the code does not contain any code word with approximately the distribution P_X. In this case, encoding is impossible, which corresponds to the encoding rate zero. The operator [·]^+ = max{0, ·} ensures that this is reflected by the expression on the right-hand side of (23).

4.2 Achievable Decoding Rate

Proposition 2. Suppose code word C^n(w) = x^n is transmitted and let y^n be a channel output sequence. With high probability for large n, the layered PS decoder can recover the index w from the sequence y^n if

  R_c < T̂_c(x^n, y^n, q) = log_2|X| − (1/n) Σ_{i=1}^{n} log_2 ( Σ_{a∈X} q(a, y_i) / q(x_i, y_i) )   (24)

that is, T̂_c(x^n, y^n, q) is an achievable code rate.

Proof. See Section 7.2.

Proposition 3. For a memoryless channel with channel law

  p_{Y^n|X^n}(b^n|a^n) = ∏_{i=1}^{n} p_{Y|X}(b_i|a_i),  b^n ∈ Y^n, a^n ∈ X^n   (25)

the layered PS decoder can recover sequence x^n from the random channel output if the sequence is approximately of type P_X and if

  R_c < T_c = log_2|X| − E[ −log_2 ( q(X,Y) / Σ_{a∈X} q(a,Y) ) ]   (26)

where the expectation is taken according to X, Y ~ P_X p_{Y|X}.

Proof. See Section 7.3.

The term E[−log_2(q(X,Y)/Σ_{a∈X} q(a,Y))] in (26) and its empirical version in (24) play a central role in achievable rate calculations. We call

  E[ −log_2 ( q(X,Y) / Σ_{a∈X} q(a,Y) ) ]   (27)

the uncertainty. Note that for each realization b of Y, Q_{X|Y}(·|b) := q(·,b)/Σ_{a∈X} q(a,b) is a distribution on X, so that

  E[ −log_2 ( q(X,Y)/Σ_{a∈X} q(a,Y) ) | Y = b ] = Σ_{a∈X} P_{X|Y}(a|b) [ −log_2 Q_{X|Y}(a|b) ]   (28)
    = X( P_{X|Y}(·|b) ‖ Q_{X|Y}(·|b) )   (29)

is the cross-entropy of P_{X|Y}(·|b) and Q_{X|Y}(·|b). Thus, the uncertainty in (27) is the conditional cross-entropy of the probabilistic model Q_{X|Y} assumed by the decoder via its decoding metric q, and the actual distribution P_X p_{Y|X}.
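Propositions 1-4 can be evaluated numerically by Monte Carlo estimation of the uncertainty (27). The Python sketch below assumes a 4-ASK AWGN example with the metric q(a, y) = P_X(a) p_{Y|X}(y|a) (proportional to the posterior, cf. Section 5.1); the alphabet, input distribution and noise level are illustrative assumptions, not taken from the paper.

```python
# Monte Carlo estimate of the uncertainty (27), the code rate T_c in (26),
# and R_ps = [T_c - D(P_X||P_U)]^+; the 4-ASK/AWGN parameters are assumed.
import numpy as np
rng = np.random.default_rng(0)

alphabet = np.array([-3.0, -1.0, 1.0, 3.0])       # 4-ASK
P_X = np.array([0.2, 0.3, 0.3, 0.2])              # shaped input distribution
sigma = 1.0                                        # noise standard deviation
N = 200_000                                        # Monte Carlo samples

idx = rng.choice(len(alphabet), size=N, p=P_X)
X = alphabet[idx]
Y = X + sigma * rng.standard_normal(N)

# q(a, y) = P_X(a) * p_{Y|X}(y|a); the Gaussian normalization cancels in q / sum_a q.
q = P_X[None, :] * np.exp(-(Y[:, None] - alphabet[None, :]) ** 2 / (2 * sigma ** 2))
q_tx = q[np.arange(N), idx]

uncertainty = np.mean(-np.log2(q_tx / q.sum(axis=1)))      # estimate of (27)
H_X = -np.sum(P_X * np.log2(P_X))
D = np.log2(len(alphabet)) - H_X                           # D(P_X || P_U), cf. (15)
T_c = np.log2(len(alphabet)) - uncertainty                 # (26)
R_ps = max(T_c - D, 0.0)                                   # = [H(X) - uncertainty]^+
print(f"uncertainty ~ {uncertainty:.3f} bits, T_c ~ {T_c:.3f}, R_ps ~ {R_ps:.3f}")
```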

Example 4 (Achievable Binary Code (ABC) Rate). For bit-metric decoding (BMD), which we discuss in detail in Section 5.2, the input is a binary label B = B_1 B_2 ⋯ B_m and the optimal bit-metric is

  q(a, y) = ∏_{j=1}^{m} P_{B_j}(a_j) p_{Y|B_j}(y|a_j)   (30)

and the binary code rate is R_b = R_c/m. The achievable binary code (ABC) rate is then

  T_abc = T_c/m = 1 − (1/m) Σ_{j=1}^{m} H(B_j|Y)   (31)

where we used log_2|X| = m. We remark that ABC rates were used implicitly in [1, Remark 6] and [9, Eq. (3)] for the design of binary low-density parity-check (LDPC) codes.

4.3 Achievable Transmission Rate

By replacing the code rate R_c in the achievable encoding rate [R_c − D(P_X‖P_U)]^+ by the achievable decoding rate T_c, we arrive at an achievable transmission rate.

Proposition 4. An achievable transmission rate is

  R_ps = [ T_c − D(P_X ‖ P_U) ]^+
       = [ H(X) − E[ −log_2( q(X,Y)/Σ_{a∈X} q(a,Y) ) ] ]^+   (32)
       = [ log_2|X| − E[ −log_2( q(X,Y)/Σ_{a∈X} q(a,Y) ) ] − D(P_X ‖ P_U) ]^+   (33)
       = [ E[ log_2( (q(X,Y)/P_X(X)) / Σ_{a∈X} q(a,Y) ) ] ]^+.   (34)

The right-hand sides provide three different perspectives on the achievable transmission rate.

Uncertainty perspective: In (32), q(·,b)/Σ_{a∈X} q(a,b) defines for each realization b of Y a distribution on X and plays the role of a posterior probability distribution that the receiver assumes about the input, given its output observation. The expectation corresponds to the uncertainty that the receiver has about the input, given the output.

Divergence perspective: The term in (33) emphasizes that the random code was generated according to a uniform distribution and that of the 2^{nT_c} code words, only approximately 2^{nT_c} / 2^{nD(P_X‖P_U)} code words are actually used for transmission, because the other code words very likely do not have distributions that are approximately P_X.

Output perspective: In (34), q(a,·)/P_X(a) has the role of a channel likelihood given input X = a assumed by the receiver, and correspondingly, Σ_{a∈X} q(a,·) plays the role of a channel output statistic assumed by the receiver.

5 Metric Design: Examples

By the information inequality (16), we know that

  E[ −log_2 P_Z(X) ] ≥ E[ −log_2 P_X(X) ] = H(X)   (35)

with equality if and only if P_Z = P_X. We now use this observation to choose optimal metrics.

5.1 Mutual Information

Suppose we have no restriction on the decoding metric q. To maximize the achievable rate, we need to minimize the uncertainty in (32). We have

  E[ −log_2( q(X,Y)/Σ_{a∈X} q(a,Y) ) ] = E[ E[ −log_2( q(X,Y)/Σ_{a∈X} q(a,Y) ) | Y ] ]   (36)
    ≥ E[ E[ −log_2 P_{X|Y}(X|Y) | Y ] ]   (37)
    = H(X|Y)   (38)

where the inequality in (37) follows by (35), with equality if we use the posterior probability distribution as metric, i.e.,

  q(a, b) = P_{X|Y}(a|b),  a ∈ X, b ∈ Y.   (39)

Note that this choice of q is not unique; in particular, q(a, b) = P_{X|Y}(a|b) P_Y(b) is also optimal, since the factor P_Y(b) cancels out. For the optimal metric, the achievable rate is

  R_ps^opt = [ H(X) − H(X|Y) ]^+ = I(X; Y)   (40)

where we dropped the (·)^+ operator because, by the information inequality, mutual information is non-negative.

Discussion

In [2, Chapters 5 & 7], the achievability of mutual information is shown using the classical transceiver of Remark 1 with the likelihood decoding metric q(a, b) = p_{Y|X}(b|a), a ∈ X, b ∈ Y. Comparing the classical transceiver with layered PS for a common rate R_tx, we have

  classical transceiver: ŵ = argmax_{w ∈ {1, 2, ..., 2^{nR_tx}}} ∏_{i=1}^{n} p_{Y|X}(y_i | c̃_i(w))   (41)

  layered PS: ŵ = argmax_{w ∈ {1, 2, ..., 2^{n[R_tx + D(P_X‖P_U)]}}} ∏_{i=1}^{n} P_{X|Y}(c_i(w) | y_i)
             = argmax_{w ∈ {1, 2, ..., 2^{n[R_tx + D(P_X‖P_U)]}}} ∏_{i=1}^{n} p_{Y|X}(y_i | c_i(w)) P_X(c_i(w)).   (42)

Comparing (41) and (42) suggests the following interpretation:

- The classical transceiver uses the prior information by evaluating the likelihood density p_{Y|X} on the code C̃ that contains code words with distribution P_X. The code C̃ has size |C̃| = 2^{nR_tx}.
- Layered PS uses the prior information by evaluating the posterior distribution on all code words in the large code C that contains mainly code words that do not have distribution P_X. The code C has size |C| = 2^{n[R_tx + D(P_X‖P_U)]}.

Remark 3. The code C̃ of the classical transceiver is in general non-linear, since the set of vectors with distribution P_X is non-linear. It can be shown that all the presented results for layered PS also apply when C is a random linear code. In this case, layered PS evaluates a metric on a linear set while the classical transceiver evaluates a metric on a non-linear set.

5.2 Bit-Metric Decoding

Suppose the channel input is a binary vector B = B_1 B_2 ⋯ B_m and the receiver uses a bit-metric, i.e.,

  q(a, y) = ∏_{j=1}^{m} q_j(a_j, y).   (43)

In this case, we have for the uncertainty in (32)

  E[ −log_2 ( q(B,Y) / Σ_{a∈{0,1}^m} q(a,Y) ) ]
    = E[ −log_2 ( ∏_{j=1}^{m} q_j(B_j,Y) / Σ_{a∈{0,1}^m} ∏_{j=1}^{m} q_j(a_j,Y) ) ]   (44)
    = E[ −log_2 ( ∏_{j=1}^{m} q_j(B_j,Y) / ∏_{j=1}^{m} Σ_{a∈{0,1}} q_j(a,Y) ) ]   (45)
    = E[ −log_2 ∏_{j=1}^{m} ( q_j(B_j,Y) / Σ_{a∈{0,1}} q_j(a,Y) ) ]   (46)
    = Σ_{j=1}^{m} E[ −log_2 ( q_j(B_j,Y) / Σ_{a∈{0,1}} q_j(a,Y) ) ].   (47)

For each j = 1, ..., m, we now have

  E[ −log_2 ( q_j(B_j,Y) / Σ_{a∈{0,1}} q_j(a,Y) ) ] = E[ E[ −log_2 ( q_j(B_j,Y) / Σ_{a∈{0,1}} q_j(a,Y) ) | Y ] ]   (48)
    ≥ H(B_j | Y)   (49)

with equality if

  q_j(a, b) = P_{B_j|Y}(a|b),  a ∈ {0, 1}, b ∈ Y.   (50)

The achievable rate becomes the bit-metric decoding (BMD) rate

  R_ps^bmd = [ H(B) − Σ_{j=1}^{m} H(B_j | Y) ]^+   (51)

which we first stated in [10] and discuss in detail in [1, Section VI]. In [11], we prove the achievability of (51) for discrete memoryless channels. For independent bit-levels B_1, B_2, ..., B_m, the BMD rate can also be written in the form

  R_ps^{bmd,ind} = Σ_{j=1}^{m} I(B_j ; Y).   (52)
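A hedged numerical illustration of the BMD rate (51): the Python sketch below assumes 4-ASK over AWGN with a Gray labeling and estimates the conditional bit-level entropies H(B_j|Y) by Monte Carlo. The labeling, input distribution and noise level are assumptions made for the example only.

```python
# Monte Carlo estimate of R_bmd = [H(B) - sum_j H(B_j|Y)]^+ in (51); setup assumed.
import numpy as np
rng = np.random.default_rng(0)

alphabet = np.array([-3.0, -1.0, 1.0, 3.0])
labels = np.array([[0, 0], [0, 1], [1, 1], [1, 0]])   # assumed Gray label B1 B2 per symbol
P_X = np.array([0.2, 0.3, 0.3, 0.2])
sigma, N = 1.0, 200_000

idx = rng.choice(4, size=N, p=P_X)
Y = alphabet[idx] + sigma * rng.standard_normal(N)

# Unnormalized symbol posteriors P_X(a) p_{Y|X}(y|a) for every sample and symbol.
post = P_X[None, :] * np.exp(-(Y[:, None] - alphabet[None, :]) ** 2 / (2 * sigma ** 2))

H_B = -np.sum(P_X * np.log2(P_X))          # = H(X) = H(B) for a bijective labeling
sum_H_Bj_given_Y = 0.0
for j in range(labels.shape[1]):
    # P_{B_j|Y}(1|Y) by summing the symbol posteriors over the label set of bit j.
    p1 = post[:, labels[:, j] == 1].sum(axis=1) / post.sum(axis=1)
    b = labels[idx, j]
    p_true = np.where(b == 1, p1, 1.0 - p1)
    sum_H_Bj_given_Y += np.mean(-np.log2(p_true))   # estimate of H(B_j|Y)

R_bmd = max(H_B - sum_H_Bj_given_Y, 0.0)
print(f"estimated BMD rate: {R_bmd:.3f} bits per channel use")
```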

5.3 Interleaved Coded Modulation

Suppose we have a vector channel with input X = X_1 ⋯ X_m with distribution P_X on the input alphabet X^m and output Y = Y_1 ⋯ Y_m with distributions P_{Y|X}(·|a), a ∈ X^m, on the output alphabet Y^m. We consider the following situation:

- The Y_i are potentially correlated; in particular, we may have Y_1 = Y_2 = ⋯ = Y_m.
- Despite the potential correlation, the receiver uses a memoryless metric q defined on X × Y, i.e., a vector input x and a vector output y are scored by

  q^m(x, y) = ∏_{i=1}^{m} q(x_i, y_i).   (53)

The reason for this decoding strategy may be an interleaver between encoder output and channel input that is reverted at the receiver but not known to the decoder. We therefore call this scenario interleaved coded modulation. Using the same approach as for bit-metric decoding, we have

  (1/m) E[ −log_2 ( ∏_{i=1}^{m} q(X_i,Y_i) / Σ_{a∈X^m} ∏_{i=1}^{m} q(a_i,Y_i) ) ]
    = (1/m) E[ −log_2 ( ∏_{i=1}^{m} q(X_i,Y_i) / ∏_{i=1}^{m} Σ_{a∈X} q(a,Y_i) ) ]   (54)
    = (1/m) Σ_{i=1}^{m} E[ −log_2 ( q(X_i,Y_i) / Σ_{a∈X} q(a,Y_i) ) ].   (55)

This expression is not very insightful. We could optimize q for, say, the ith term, which would be

  q(a, b) = P_{X_i|Y_i}(a|b),  a ∈ X, b ∈ Y   (56)

but this would not be optimal for the other terms. We therefore choose a different approach. Let I be a random variable uniformly distributed on I = {1, 2, ..., m} and define X̄ = X_I, Ȳ = Y_I. Then, we have

  (1/m) Σ_{i=1}^{m} E[ −log_2 ( q(X_i,Y_i) / Σ_{a∈X} q(a,Y_i) ) ] = E[ −log_2 ( q(X_I,Y_I) / Σ_{a∈X} q(a,Y_I) ) ]   (57)
    = E[ −log_2 ( q(X̄,Ȳ) / Σ_{a∈X} q(a,Ȳ) ) ].   (58)

Thus, the optimal metric for interleaving is

  q(a, b) = P_{X̄|Ȳ}(a|b)   (59)

which can be calculated from

  P_{X̄}(a) p_{Ȳ|X̄}(b|a) = (1/m) Σ_{j=1}^{m} P_{X_j}(a) p_{Y_j|X_j}(b|a).   (60)

The achievable rate becomes

  R_ps^icm = [ H(X_1 ⋯ X_m) − m H(X̄|Ȳ) ]^+.   (61)

6 Metric Assessment: Examples

Suppose a decoder is constrained to use a specific metric q. In this case, our task is to assess the metric performance by calculating a rate that can be achieved by using metric q. If q is a non-negative metric, an achievable rate is our transmission rate expression

  R_ps(q) = [ H(X) − E[ −log_2 ( q(X,Y) / Σ_{a∈X} q(a,Y) ) ] ]^+.   (62)

However, higher rates may also be achievable by q. The reason for this is as follows: suppose we have another metric q̃ that scores the code words in the same order as metric q, i.e., we have

  q(a, b) > q(a', b) ⟺ q̃(a, b) > q̃(a', b),  a, a' ∈ X, b ∈ Y.   (63)

Then, R_ps(q̃) is also achievable by q. An example of an order-preserving transformation is q̃(a, b) = e^{q(a,b)}. For a non-negative metric q, another order-preserving transformation is q̃(a, b) = q(a, b)^s for s > 0. We may now find a better achievable rate for metric q by calculating, for instance,

  max_{s>0} R_ps(q^s).   (64)

In the following, we will say that two metrics q and q̃ are equivalent if and only if the order-preserving condition (63) is fulfilled.

6.1 Generalized Mutual Information

Suppose the input distribution is uniform, i.e., P_X(a) = 1/|X|, a ∈ X. In this case, we have

  max_{s>0} R_ps(q^s) = max_{s>0} [ E[ log_2 ( (q(X,Y)^s / P_X(X)) / Σ_{a∈X} q(a,Y)^s ) ] ]^+   (65)
    = max_{s>0} E[ log_2 ( q(X,Y)^s / Σ_{a∈X} P_X(a) q(a,Y)^s ) ]   (66)

where we used the output perspective (34) in (65), where we could move P_X(a) under the sum in (66) because P_X is by assumption uniform, and where we could drop the (·)^+ operator because for s = 0, the expectation is zero. The expression in (66) is called generalized mutual information (GMI) in [7] and was shown to be an achievable rate for the classical transceiver. This is in line with Remark 2, namely that for uniform input, layered PS is equivalent to the classical transceiver. For non-uniform input, the GMI and (65) differ, i.e., we do not have equality in (66).
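The maximization over s in (64)/(66) can be approximated by a simple grid search. The Python sketch below assumes uniform 4-ASK over AWGN and a mismatched decoding metric that uses a wrong noise variance; it estimates the GMI (66) by Monte Carlo. All parameters are illustrative assumptions.

```python
# Monte Carlo estimate of the GMI (66) with a grid search over s > 0; setup assumed.
import numpy as np
rng = np.random.default_rng(0)

alphabet = np.array([-3.0, -1.0, 1.0, 3.0])
P_X = np.full(4, 0.25)                      # uniform input, cf. Section 6.1
sigma_true, sigma_dec = 1.0, 2.0            # actual vs. assumed noise std (mismatch)
N = 200_000

idx = rng.choice(4, size=N, p=P_X)
Y = alphabet[idx] + sigma_true * rng.standard_normal(N)

def gmi(s):
    # E[ log2( q(X,Y)^s / sum_a P_X(a) q(a,Y)^s ) ], q(a,y) = exp(-(y-a)^2/(2 sigma_dec^2))
    logq = -(Y[:, None] - alphabet[None, :]) ** 2 / (2 * sigma_dec ** 2) / np.log(2)
    num = s * logq[np.arange(N), idx]
    den = np.log2(np.sum(P_X[None, :] * 2.0 ** (s * logq), axis=1))
    return np.mean(num - den)

s_grid = np.linspace(0.25, 6.0, 24)
vals = [gmi(s) for s in s_grid]
best = int(np.argmax(vals))
print(f"GMI ~ {vals[best]:.3f} bits at s ~ {s_grid[best]:.2f}")
```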

Discussion

Suppose for a non-uniform input distribution P_X and a metric q, the GMI evaluates to R, implying that a classical transceiver can achieve R. Can layered PS also achieve R, possibly by using a different metric? The answer is yes. Define

  q̃(a, b) = q(a, b) P_X(a)^{1/s},  a ∈ X, b ∈ Y   (67)

where s is the optimal value maximizing the GMI. We calculate a PS achievable rate for q̃ by analyzing the equivalent metric q̃^s. We have

  R_ps = [ E[ log_2 ( (q̃^s(X,Y)/P_X(X)) / Σ_{a∈X} q̃^s(a,Y) ) ] ]^+   (68)
       = [ E[ log_2 ( q^s(X,Y) / Σ_{a∈X} P_X(a) q^s(a,Y) ) ] ]^+   (69)
       = R   (70)

which shows that R can also be achieved by layered PS. It is important to stress that this requires a change of the metric: for example, suppose q is the Hamming metric of a hard-decision decoder (see Section 6.3). In general, this does not imply that q̃ defined by (67) is also a Hamming metric.

6.2 LM-Rate

For the classical transceiver of Remark 1, the work [8] shows that the so-called LM-Rate defined as

  R_LM(s, r) = [ E[ log_2 ( q(X,Y)^s r(X) / Σ_{a ∈ supp P_X} P_X(a) q(a,Y)^s r(a) ) ] ]^+   (71)

is achievable, where s > 0 and where r is a function on X. By choosing s = 1 and r(a) = 1/P_X(a), we have

  R_LM(1, 1/P_X) = [ E[ log_2 ( (q(X,Y)/P_X(X)) / Σ_{a ∈ supp P_X} q(a,Y) ) ] ]^+   (72)
    ≥ [ E[ log_2 ( (q(X,Y)/P_X(X)) / Σ_{a∈X} q(a,Y) ) ] ]^+   (73)
    = R_ps   (74)

with equality in (73) if supp P_X = X. Thus, formally, our achievable transmission rate can be recovered from the LM-Rate. We emphasize that [8] shows the achievability of the LM-Rate for the classical transceiver of Remark 1, and consequently, R_LM and R_ps have different operational meanings, corresponding to achievable rates of two different transceiver setups, with different random coding experiments and different encoding and decoding strategies.

Figure 2: The M-ary symmetric channel. Each red (crossover) transition has probability ε/(M − 1). Note that for M = 2, the channel is the binary symmetric channel. For uniformly distributed input X, we have H(X|Y) = H_2(ε) + ε log_2(M − 1).

6.3 Hard-Decision Decoding

Hard-decision decoding consists of two steps. First, the channel output alphabet is partitioned into disjoint decision regions

  Y = ∪_{a∈X} Y_a,  Y_a ∩ Y_b = ∅ if a ≠ b   (75)

and a quantizer ω maps the channel output to the channel input alphabet according to the decision regions, i.e.,

  ω: Y → X,  ω(b) = a ⟺ b ∈ Y_a.   (76)

Second, the receiver uses the Hamming metric of X for decoding, i.e.,

  q(a, ω(y)) = 1(a, ω(y)) = 1 if a = ω(y), and 0 otherwise.   (77)

We next derive an achievable rate by analyzing the equivalent metric e^{s·1(·,·)}, s > 0. For the uncertainty, we have

  E[ −log_2 ( e^{s·1(X,ω(Y))} / Σ_{a∈X} e^{s·1(a,ω(Y))} ) ] = E[ −log_2 ( e^{s·1(X,ω(Y))} / (|X| − 1 + e^s) ) ]   (78)
    = −Pr[X = ω(Y)] log_2 ( e^s / (|X| − 1 + e^s) ) − Pr[X ≠ ω(Y)] log_2 ( 1 / (|X| − 1 + e^s) )   (79)
    = −(1 − ε) log_2 ( e^s / (|X| − 1 + e^s) ) − ε log_2 ( 1 / (|X| − 1 + e^s) )   (80)
    = −(1 − ε) log_2 ( e^s / (|X| − 1 + e^s) ) − Σ_{ℓ=1}^{|X|−1} (ε/(|X|−1)) log_2 ( 1 / (|X| − 1 + e^s) )   (81)

where we defined ε = Pr(X ≠ ω(Y)). By (35), the last line is minimized by choosing

  1 − ε = e^s / (|X| − 1 + e^s)  and  ε/(|X|−1) = 1 / (|X| − 1 + e^s)   (82)

which is achieved by

  e^s = (|X| − 1)(1 − ε)/ε.   (83)

With this choice for s, we have

  −(1 − ε) log_2(1 − ε) − Σ_{ℓ=1}^{|X|−1} (ε/(|X|−1)) log_2( ε/(|X|−1) )
    = [ −(1 − ε) log_2(1 − ε) − ε log_2 ε ] + ε log_2(|X| − 1)   (84)
    = H_2(ε) + ε log_2(|X| − 1)   (85)

where H_2(·) is the binary entropy function, i.e., H_2(ε) := −(1 − ε) log_2(1 − ε) − ε log_2 ε. The term (85) corresponds to the conditional entropy of a |X|-ary symmetric channel with uniform input; see Figure 2 for an illustration. We conclude that by hard-decision decoding, we can achieve

  R_ps^hd = [ H(X) − ( H_2(ε) + ε log_2(|X| − 1) ) ]^+   (86)

where

  ε = 1 − Pr[X = ω(Y)]   (87)
    = 1 − Σ_{a∈X} P_X(a) ∫_{Y_a} p_{Y|X}(τ|a) dτ.   (88)

6.4 Binary Hard-Decision Decoding

Suppose the channel input is the binary vector B = B_1 B_2 ⋯ B_m and the decoder uses m binary quantizers, i.e., we have

  Y = Y_{j0} ∪ Y_{j1},  Y_{j1} = Y \ Y_{j0}   (89)
  ω_j: Y → {0, 1},  ω_j(b) = a ⟺ b ∈ Y_{ja}.   (90)

The receiver uses a binary Hamming metric, i.e.,

  q(a, b) = 1(a, b),  a, b ∈ {0, 1}   (91)
  q^m(a, b) = ∏_{j=1}^{m} 1(a_j, b_j)   (92)

and we analyze the equivalent metric

  e^{s·q^m(a,b)} = ∏_{j=1}^{m} e^{s·1(a_j, b_j)},  s > 0.   (93)

Since the decoder uses the same metric for each bit-level j = 1, 2, ..., m, binary hard-decision decoding is an instance of interleaved coded modulation, which we discussed in Section 5.3. Thus, defining the auxiliary random variable I uniformly distributed on {1, 2, ..., m} and

  B̄ = B_I,  B̂ = ω_I(Y)   (94)

we can use the interleaved coded modulation result (58). We have for the normalized uncertainty

  (1/m) E[ −log_2 ( ∏_{j=1}^{m} e^{s·1(B_j, ω_j(Y))} / Σ_{a∈{0,1}^m} ∏_{j=1}^{m} e^{s·1(a_j, ω_j(Y))} ) ]
    = E[ −log_2 ( e^{s·1(B̄, B̂)} / Σ_{a∈{0,1}} e^{s·1(a, B̂)} ) ]   (95)   [by (58), (94)]
    = −Pr(B̄ = B̂) log_2 ( e^s/(1 + e^s) ) − Pr(B̄ ≠ B̂) log_2 ( 1/(1 + e^s) ),  with ε := Pr(B̄ ≠ B̂)   (96)
    ≥ H_2(ε)   (97)   [by (35)]

with equality if

  e^s = (1 − ε)/ε.   (98)

Thus, with a hard-decision decoder, we can achieve

  R_ps^{hd,bin} = [ H(B) − m H_2(ε) ]^+   (99)

where

  ε = 1 − (1/m) Σ_{j=1}^{m} Σ_{a∈{0,1}} P_{B_j}(a) ∫_{Y_{ja}} p_{Y|B_j}(τ|a) dτ.   (100)

For uniform input, the rate becomes

  R_uni^{hd,bin} = m − m H_2(ε) = m (1 − H_2(ε)).   (101)
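The hard-decision rates (86) and (99) only require the error probabilities (88) and (100). The Python sketch below assumes 4-ASK over AWGN with nearest-symbol decision regions (and, for the binary case, the bit decisions induced by a Gray labeling) and estimates these probabilities by Monte Carlo; the setup is an illustrative assumption.

```python
# Estimate eps (87)/(100) by Monte Carlo and evaluate R_hd (86) and R_hd_bin (99);
# the 4-ASK/AWGN setup, quantizer and labeling are assumed for illustration.
import numpy as np
rng = np.random.default_rng(0)

alphabet = np.array([-3.0, -1.0, 1.0, 3.0])
P_X = np.array([0.2, 0.3, 0.3, 0.2])
sigma, N = 1.0, 200_000

idx = rng.choice(4, size=N, p=P_X)
Y = alphabet[idx] + sigma * rng.standard_normal(N)
idx_hat = np.abs(Y[:, None] - alphabet[None, :]).argmin(axis=1)   # quantizer omega

def H2(p):
    return 0.0 if p in (0.0, 1.0) else -p * np.log2(p) - (1 - p) * np.log2(1 - p)

eps = np.mean(idx_hat != idx)                                      # Pr(X != omega(Y))
H_X = -np.sum(P_X * np.log2(P_X))
R_hd = max(H_X - (H2(eps) + eps * np.log2(len(alphabet) - 1)), 0.0)    # (86)

# Binary hard decisions: omega_j(y) = j-th label bit of the nearest symbol.
labels = np.array([[0, 0], [0, 1], [1, 1], [1, 0]])                # assumed Gray labels
eps_bin = np.mean(labels[idx] != labels[idx_hat])                  # average bit error prob
R_hd_bin = max(H_X - labels.shape[1] * H2(eps_bin), 0.0)           # (99), H(B) = H(X) here
print(f"eps={eps:.3f}  R_hd={R_hd:.3f}   eps_bin={eps_bin:.3f}  R_hd_bin={R_hd_bin:.3f}")
```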

7 Proofs

7.1 Achievable Encoding Rate

We consider a general shaping set S_n ⊆ X^n, of which T_ε^n(P_X) is an instance. An encoding error happens when for message u ∈ {1, 2, ..., 2^{nR_tx}} with

  R_tx ≥ 0   (102)

there is no v ∈ {1, 2, ..., 2^{nR'}} such that C^n(u, v) ∈ S_n, where R' = R_c − R_tx. In our random coding experiment, each code word is chosen uniformly at random from X^n and it is not in S_n with probability

  (|X^n| − |S_n|) / |X^n| = 1 − |S_n|/|X^n|.   (103)

The probability that none of the 2^{nR'} code words is in S_n is therefore

  (1 − |S_n|/|X^n|)^{2^{nR'}} ≤ exp( −(|S_n|/|X^n|) 2^{nR'} )   (104)

where we used 1 − s ≤ exp(−s). We have

  (|S_n|/|X^n|) 2^{nR'} = 2^{n( R_c − R_tx + (1/n) log_2|S_n| − log_2|X| )}.   (105)

Thus, for

  R_tx < R_c − ( log_2|X| − (1/n) log_2|S_n| )   (106)

the encoding error probability decays doubly exponentially fast with n. This bound can now be instantiated for specific shaping sets. Here, we consider the typical set T_ε^n(P_X) as defined in (3). The exponential growth with n of |T_ε^n(P_X)| is as follows.

Lemma 1 (Typicality, [3, Theorem 1.1], [6, Lemma 9]). Suppose 0 < ε < μ_X, where μ_X denotes the smallest positive probability min_{a ∈ supp P_X} P_X(a). We have

  (1 − δ_ε(n, P_X)) 2^{n(1−ε)H(X)} ≤ |T_ε^n(P_X)|   (107)

where δ_ε(n, P_X) is such that δ_ε(n, P_X) → 0 exponentially fast in n. Thus,

  (1/n) log_2 |T_ε^n(P_X)| → H(X)  as n → ∞.   (108)

For S_n = T_ε^n(P_X) in (106) and by (108), if

  R_tx < R_c − ( log_2|X| − H(X) )   (109)
       = R_c − D(P_X ‖ P_U)   (110)

then the encoding error probability can be made arbitrarily small by choosing n large enough. Equality in (110) follows by (15). By (102) and (110), an achievable encoding rate is

  R_tx = [ R_c − D(P_X ‖ P_U) ]^+   (111)

which is the statement of Proposition 1.
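To get a feeling for the double-exponential decay in (104)-(106), the following small sketch plugs the approximation |S_n| ≈ 2^{nH(X)} into the bound, i.e., it evaluates exp(−2^{n(R_c − R_tx − D(P_X‖P_U))}) for an assumed input distribution and rate pair. The numbers are illustrative only, not from the paper.

```python
# Approximate encoding-failure bound exp(-2^{n(R_c - R_tx - D)}) with |S_n| ~ 2^{nH(X)};
# the distribution and rates are assumed example values.
import math

P_X = {0: 0.4, 1: 0.3, 2: 0.2, 3: 0.1}
H_X = -sum(p * math.log2(p) for p in P_X.values())
D = math.log2(len(P_X)) - H_X          # D(P_X || P_U), cf. (15)
R_c, R_tx = 1.9, 1.7                   # bits per symbol; needs R_tx < R_c - D
assert R_tx < R_c - D

for n in (50, 100, 200, 400):
    exponent = 2.0 ** (n * (R_c - R_tx - D))
    # math.exp underflows to 0.0 for large n, illustrating the doubly exponential decay.
    print(n, math.exp(-exponent))
```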

Figure 3: The considered setup. (Block diagram: index W = 1 is encoded to C^n(1) = x_1 x_2 ⋯ x_n from the random code C and transmitted over the channel; the decoder observes y_1 y_2 ⋯ y_n and outputs Ŵ.)

7.2 Achievable Code Rate

We consider the setup in Figure 3, i.e., we condition on the event that index W was encoded to C^n(W) = x^n and that sequence y^n was output by the channel. For notational convenience, we assume without loss of generality W = 1. We have the implications

  Ŵ ≠ 1 ⟹ Ŵ = w' for some w' ≠ 1   (112)
       ⟹ L(w') := q^n(C^n(w'), y^n) / q^n(x^n, y^n) ≥ 1   (113)
       ⟹ Σ_{w=2}^{|C|} L(w) ≥ 1.   (114)

If event A implies event B, then Pr(A) ≤ Pr(B). Therefore, we have

  Pr(Ŵ ≠ 1 | X^n = x^n, Y^n = y^n)
    ≤ Pr( Σ_{w=2}^{|C|} L(w) ≥ 1 | X^n = x^n, Y^n = y^n )   (115)
    ≤ E[ Σ_{w=2}^{|C|} L(w) | X^n = x^n, Y^n = y^n ]   (116)
    = Σ_{w=2}^{|C|} E[ q^n(C^n(w), y^n) ] / q^n(x^n, y^n)   (117)
    = (|C| − 1) E[ q^n(C^n, y^n) ] / q^n(x^n, y^n)   (118)
    ≤ |C| E[ q^n(C^n, y^n) ] / q^n(x^n, y^n)   (119)
    = |C| ∏_{i=1}^{n} E[ q(C, y_i) ] / ∏_{i=1}^{n} q(x_i, y_i)   (120)
    = |C| ∏_{i=1}^{n} ( Σ_{a∈X} q(a, y_i)/|X| ) / ∏_{i=1}^{n} q(x_i, y_i)   (121)
    = |C| ∏_{i=1}^{n} Σ_{a∈X} q(a, y_i) / ( q(x_i, y_i) |X| ).   (122)

The inequality in (116) follows by Markov's inequality [12, Section 1.6]. Equality in (117) follows because for w ≠ 1, the code word C^n(w) and the transmitted code word C^n(1) were generated independently, so that C^n(w) and (C^n(1), Y^n) are independent. Equality in (118) holds because in our random coding experiment, for each index w, we generated the code word entries C_1(w), C_2(w), ..., C_n(w) iid. In (120), we used (19), i.e., that q^n defines a memoryless metric. We can now write this as

  Pr(Ŵ ≠ 1 | X^n = x^n, Y^n = y^n) ≤ 2^{−n[ T̂_c(x^n, y^n, q) − R_c ]}   (123)

where

  T̂_c(x^n, y^n, q) = (1/n) Σ_{i=1}^{n} log_2 ( |X| q(x_i, y_i) / Σ_{a∈X} q(a, y_i) )   (124)
  R_c = log_2(|C|)/n.   (125)

For large n, the error probability upper bound is vanishingly small if

  R_c < T̂_c(x^n, y^n, q).   (126)

Thus, T̂_c(x^n, y^n, q) is an achievable code rate, i.e., for a random code C, if (126) holds, then, with high probability, sequence x^n can be decoded from y^n.

7.3 Achievable Code Rate for Memoryless Channels

Consider now a memoryless channel

  p_{Y^n|X^n} = ∏_{i=1}^{n} p_{Y_i|X_i}   (127)
  p_{Y_i|X_i} = p_{Y|X},  i = 1, 2, ..., n.   (128)

We continue to assume input sequence x^n was transmitted, but we replace the specific channel output measurement y^n by the random output Y^n, distributed according to p_{Y^n|X^n}(·|x^n). The achievable code rate (124) evaluated in Y^n is

  T̂_c(x^n, Y^n, q) = (1/n) Σ_{i=1}^{n} log_2 ( |X| q(x_i, Y_i) / Σ_{c∈X} q(c, Y_i) ).   (129)

Since Y^n is random, T̂_c(x^n, Y^n, q) is also random. First, we rewrite (129) by sorting the summands by the input symbols, i.e.,

  (1/n) Σ_{i=1}^{n} log_2 ( |X| q(x_i, Y_i) / Σ_{c∈X} q(c, Y_i) )
    = Σ_{a∈X} (N(a|x^n)/n) · (1/N(a|x^n)) Σ_{i: x_i = a} log_2 ( |X| q(a, Y_i) / Σ_{c∈X} q(c, Y_i) ).   (130)

Note that identity (130) holds also when the channel has memory. For memoryless channels, we make the following two observations:

- Consider the inner sums in (130). For memoryless channels, the outputs {Y_i : x_i = a} are iid according to p_{Y|X}(·|a). Therefore, by the Weak Law of Large Numbers [12, Section 1.7],

  (1/N(a|x^n)) Σ_{i: x_i = a} log_2 ( |X| q(a, Y_i) / Σ_{c∈X} q(c, Y_i) ) →p E[ log_2 ( |X| q(a, Y) / Σ_{c∈X} q(c, Y) ) | X = a ]   (131)

  where →p denotes convergence in probability [12, Section 1.7]. That is, by making n and thereby N(a|x^n) large, each inner sum converges in probability to a deterministic value. Note that the expected value on the right-hand side of (131) is no longer a function of the output sequence Y^n and is determined by the channel law p_{Y|X}(·|a) according to which the expectation is calculated.

- Suppose now for some distribution P_X and ε ≥ 0, we have x^n ∈ T_ε^n(P_X), in particular,

  N(a|x^n)/n ≥ (1 − ε) P_X(a),  a ∈ X.   (132)

We now have

  (1/n) Σ_{i=1}^{n} log_2 ( |X| q(x_i, Y_i) / Σ_{c∈X} q(c, Y_i) )
    →p Σ_{a∈X} (N(a|x^n)/n) E[ log_2 ( |X| q(a, Y) / Σ_{c∈X} q(c, Y) ) | X = a ]   (133)
    ≥ (1 − ε) Σ_{a∈X} P_X(a) E[ log_2 ( |X| q(a, Y) / Σ_{c∈X} q(c, Y) ) | X = a ]   (134)

where the expectation in (134) is calculated according to P_X and the channel law p_{Y|X}. In other words, (134) is an achievable code rate for all code words x^n that are in the shaping set T_ε^n(P_X).

Acknowledgment

The author is grateful to Gianluigi Liva and Fabian Steiner for fruitful discussions encouraging this work. The author thanks Fabian Steiner for helpful comments on drafts. A part of this work was done while the author was with the Institute for Communications Engineering, Technical University of Munich, Germany.

References

[1] G. Böcherer, F. Steiner, and P. Schulte, "Bandwidth efficient and rate-matched low-density parity-check coded modulation," IEEE Trans. Commun., vol. 63, no. 12, pp. 4651-4665, Dec. 2015.
[2] R. G. Gallager, Information Theory and Reliable Communication. John Wiley & Sons, Inc., 1968.
[3] G. Kramer, "Topics in multi-user information theory," Foundations and Trends in Communications and Information Theory, vol. 4, no. 4-5, pp. 265-444, 2007.
[4] J. L. Massey, "Applied digital information theory I," lecture notes, ETH Zurich. [Online]. Available: http://www.isiweb.ee.ethz.ch/archive/massey_scr/adit1.pdf
[5] A. El Gamal and Y.-H. Kim, Network Information Theory. Cambridge University Press, 2011.
[6] A. Orlitsky and J. R. Roche, "Coding for computing," IEEE Trans. Inf. Theory, vol. 47, no. 3, pp. 903-917, Mar. 2001.

[7] G. Kaplan and S. Shamai (Shitz), "Information rates and error exponents of compound channels with application to antipodal signaling in a fading environment," AEÜ, vol. 47, no. 4, pp. 228-239, 1993.
[8] A. Ganti, A. Lapidoth, and E. Telatar, "Mismatched decoding revisited: General alphabets, channels with memory, and the wide-band limit," IEEE Trans. Inf. Theory, vol. 46, no. 7, pp. 2315-2328, Nov. 2000.
[9] F. Steiner, G. Böcherer, and G. Liva, "Protograph-based LDPC code design for shaped bit-metric decoding," IEEE J. Sel. Areas Commun., vol. 34, no. 2, pp. 397-407, Feb. 2016.
[10] G. Böcherer, "Probabilistic signal shaping for bit-metric decoding," in Proc. IEEE Int. Symp. Inf. Theory (ISIT), Honolulu, HI, USA, Jun. 2014, pp. 431-435.
[11] G. Böcherer, "Achievable rates for shaped bit-metric decoding," arXiv preprint, 2016. [Online]. Available: http://arxiv.org/abs/1410.8075
[12] R. G. Gallager, Stochastic Processes: Theory for Applications. Cambridge University Press, 2013.