Dispersion of the Gilbert-Elliott Channel

Yury Polyanskiy (ypolyans@princeton.edu), H. Vincent Poor (poor@princeton.edu), Sergio Verdú (verdu@princeton.edu)

Abstract: Channel dispersion plays a fundamental role in assessing the backoff from capacity due to finite blocklength. This paper analyzes the channel dispersion for a simple channel with memory: the Gilbert-Elliott communication model, in which the crossover probability of a binary symmetric channel evolves as a binary symmetric Markov chain, with and without side information at the receiver about the channel state. With side information, although capacity is invariant to the chain dynamics, dispersion is shown to be the sum of two terms, due to the Markov chain dynamics and to the randomness in the error generation, respectively.

I. INTRODUCTION

The fundamental performance limit for a channel in the finite blocklength regime is M*(n, \epsilon), the maximum cardinality of a codebook of blocklength n which can be decoded with block error probability no greater than \epsilon. Denoting the channel capacity by C, the approximation

\log M^*(n, \epsilon) \approx nC    (1)

is asymptotically tight for channels that satisfy the strong converse. However, for many channels, error rates and blocklength ranges of practical interest, (1) is too optimistic. It has been shown in [1] that a much tighter approximation can be obtained by defining a second parameter referred to as the channel dispersion:

Definition 1: The dispersion V (measured in squared information units per channel use) of a channel with capacity C is equal to

V = \lim_{\epsilon \to 0} \limsup_{n \to \infty} \frac{(nC - \log M^*(n, \epsilon))^2}{2 n \ln \frac{1}{\epsilon}}.    (2)

In conjunction with the channel capacity C, channel dispersion emerges as a powerful analysis and design tool [1]. In order to achieve a given fraction \eta of capacity with a given error probability \epsilon, the minimal required blocklength is proportional to V/C^2, namely,

n \gtrsim \left( \frac{Q^{-1}(\epsilon)}{1 - \eta} \right)^2 \frac{V}{C^2}.
    (3)

The research was supported by the National Science Foundation under Grants CCF-06-35154 and CNS-06-25637.

As usual, Q(x) = \int_x^\infty \frac{1}{\sqrt{2\pi}} e^{-t^2/2} \, dt.

More specifically, [1] shows that for simple memoryless channels the two-term expansion

\log M^*(n, \epsilon) = nC - \sqrt{nV}\, Q^{-1}(\epsilon) + O(\log n)    (4)

gives an excellent approximation (unless the blocklength is very small). The expansion (4) was first proved for the discrete memoryless channel by Strassen [2]. The new upper and lower bounds found in [1] allowed us to demonstrate the remarkable tightness of the approximation, prove (4) for the additive white Gaussian noise channel, refine Strassen's proof, and improve the bounds on the O(\log n) term (see also [3], [4]).

In this paper, we initiate the study of the dispersion of channels subject to fading with memory. For coherent channels that behave ergodically, channel capacity is independent of the fading dynamics [5], since a sufficiently long codeword sees a channel realization whose empirical statistics have no randomness. In contrast, channel dispersion does depend on the extent of the fading memory, since it determines the blocklength required to ride out not only the noise but also the channel fluctuations due to fading. One of the simplest models that incorporates fading with memory is the Gilbert-Elliott channel (GEC): a binary symmetric channel where the crossover probability is a binary Markov chain [6], [7]. The results and required tools depend crucially on whether the channel state is known at the decoder.

II. CHANNEL MODEL

Let \{S_j\}_{j=1}^\infty be a homogeneous Markov process with states \{1, 2\}, transition probabilities

P[S_2 = 2 \mid S_1 = 1] = P[S_2 = 1 \mid S_1 = 2] = \tau,    (5)
P[S_2 = 1 \mid S_1 = 1] = P[S_2 = 2 \mid S_1 = 2] = 1 - \tau,    (6)

and initial distribution

P[S_1 = 1] = P[S_1 = 2] = 1/2.    (7)

Now, for \delta_1, \delta_2 \in [0, 1], we define \{Z_j\}_{j=1}^\infty to be conditionally independent given \{S_j\}, with

P[Z_j = 0 \mid S_j = s] = 1 - \delta_s,    (8)
P[Z_j = 1 \mid S_j = s] = \delta_s.
    (9)

The results in this paper can be readily generalized, at the expense of more cumbersome expressions, to Gilbert-Elliott channels with asymmetric Markov chains.
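The state and noise processes (5)-(9) are straightforward to simulate. The following minimal sketch (function name, seed, and parameter choices are ours, not the paper's) samples S_1..S_n and Z_1..Z_n:

```python
import random

def gec_noise(n, tau, delta1, delta2, seed=0):
    """Sample the state sequence S_1..S_n and noise Z_1..Z_n of (5)-(9)."""
    rng = random.Random(seed)
    s = rng.choice([1, 2])                    # uniform initial state, eq. (7)
    S, Z = [], []
    for _ in range(n):
        S.append(s)
        d = delta1 if s == 1 else delta2
        Z.append(1 if rng.random() < d else 0)
        if rng.random() < tau:                # switch state w.p. tau, eqs. (5)-(6)
            s = 3 - s
    return S, Z

# Example: a "bad" state that is pure noise (delta1 = 1/2) and a noiseless
# "good" state (delta2 = 0), with slow dynamics.
S, Z = gec_noise(10000, tau=0.1, delta1=0.5, delta2=0.0)
```

By ergodicity, empirical averages along (S, Z) converge to expectations under the stationary distribution, which is what makes the capacity and dispersion below computable by simulation.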
The Gilbert-Elliott channel acts on an input binary vector X^n by adding (modulo 2) the noise vector Z^n:

Y^n = X^n + Z^n.    (10)

III. CAPACITY

The capacity C_1 of a Gilbert-Elliott channel with the state S^n known perfectly at the receiver depends only on the stationary distribution P_S and is given by

C_1 = \log 2 - E[h(\delta_S)]    (11)
    = \log 2 - P[S = 1]\, h(\delta_1) - P[S = 2]\, h(\delta_2),    (12)

where h(x) = -x \log x - (1 - x) \log(1 - x) is the binary entropy function. In the symmetric-chain special case considered in this paper,

C_1 = \log 2 - \frac{1}{2} h(\delta_1) - \frac{1}{2} h(\delta_2).    (13)

When the state S^n is not known at the receiver, the capacity is given by [8]

C_0 = \log 2 - E\left[ h\left( P[Z_0 = 1 \mid Z_{-\infty}^{-1}] \right) \right]    (14)
    = \log 2 - \lim_{n \to \infty} E\left[ h\left( P[Z_0 = 1 \mid Z_{-n}^{-1}] \right) \right].    (15)

Throughout the paper we use subscripts 1 and 0 to denote the cases in which the state S^n is known and not known, respectively.

IV. MAIN RESULTS

Before showing the expansion for the Gilbert-Elliott channel, we recall the corresponding result for the binary symmetric channel (BSC) [1], [3].

Theorem 1: The dispersion of the BSC with crossover probability \delta is

V(\delta) = \delta(1 - \delta) \log^2 \frac{1 - \delta}{\delta}.    (16)

Furthermore, provided that V(\delta) > 0, and regardless of whether \epsilon \in (0, 1) is a maximal or average probability of error, we have

\log M^*(n, \epsilon) = n(\log 2 - h(\delta)) - \sqrt{n V(\delta)}\, Q^{-1}(\epsilon) + \frac{1}{2} \log n + O(1).    (17)

Theorem 2: The dispersion of the Gilbert-Elliott channel with state S^n known at the receiver and state transition probability \tau \in (0, 1) is

V_1 = \frac{1}{2}\left( V(\delta_1) + V(\delta_2) \right) + \frac{1}{4}\left( h(\delta_1) - h(\delta_2) \right)^2 \left( \frac{1}{\tau} - 1 \right).    (18)

Furthermore, provided that V_1 > 0, and regardless of whether \epsilon \in (0, 1) is a maximal or average probability of error, we have

\log M^*(n, \epsilon) = n C_1 - \sqrt{n V_1}\, Q^{-1}(\epsilon) + O(\log n),    (19)

where C_1 is given in (13). Moreover, (19) holds even if the transmitter knows the full state sequence S^n in advance (i.e., non-causally).

Note that the condition V_1 > 0 for (19) to hold excludes only some degenerate cases, for which we have M^*(n, \epsilon) = 2^n (when both crossover probabilities are 0 or 1) or M^*(n, \epsilon) = \lfloor 1/(1 - \epsilon) \rfloor (when \delta_1 = \delta_2 = 1/2).
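The closed-form quantities (13), (16), and (18) are easy to evaluate numerically. A minimal sketch in bits (helper names are ours), checked against the parameters used later in Fig. 1, where the entire dispersion comes from the chain memory:

```python
from math import log2

def h(x):
    """Binary entropy function, in bits."""
    return 0.0 if x in (0.0, 1.0) else -x * log2(x) - (1 - x) * log2(1 - x)

def V_bsc(d):
    """BSC dispersion V(delta) of (16), in bits^2."""
    return 0.0 if d in (0.0, 0.5, 1.0) else d * (1 - d) * log2((1 - d) / d) ** 2

def C1(d1, d2):
    """Capacity (13) with the state known at the receiver, in bits."""
    return 1.0 - 0.5 * h(d1) - 0.5 * h(d2)

def V1(d1, d2, tau):
    """GEC dispersion (18) with the state known at the receiver, in bits^2."""
    return 0.5 * (V_bsc(d1) + V_bsc(d2)) \
         + 0.25 * (h(d1) - h(d2)) ** 2 * (1.0 / tau - 1.0)

# delta1 = 1/2, delta2 = 0, tau = 0.1: C_1 = 0.5 bit, and since V(1/2) = V(0) = 0,
# V_1 = 0.25 * (1/0.1 - 1) = 2.25 bits^2 is purely the memory term.
print(C1(0.5, 0.0), V1(0.5, 0.0, 0.1))
```

Note how V1 grows like 1/\tau as \tau decreases, the effect discussed in Section V.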
To formulate the result for the case of no state information at the receiver, we define the stationary process

F_j = \log 2 + \log P_{Z_j \mid Z_{-\infty}^{j-1}}(Z_j \mid Z_{-\infty}^{j-1}).    (20)

Theorem 3: The dispersion of the Gilbert-Elliott channel with no state information and state transition probability \tau \in (0, 1) is

V_0 = \mathrm{Var}[F_0] + 2 \sum_{i=1}^{\infty} E\left[ (F_i - E[F_i])(F_0 - E[F_0]) \right].    (21)

Furthermore, provided that V_0 > 0, and regardless of whether \epsilon is a maximal or average probability of error, we have

\log M^*(n, \epsilon) = n C_0 - \sqrt{n V_0}\, Q^{-1}(\epsilon) + O(\log n),    (22)

where C_0 is given by (14).

It can be shown that the process \{F_j\} has a smooth spectral density S_F(f) and that

V_0 = S_F(0),    (23)

which provides a way of computing V_0 by a Monte Carlo simulation paired with a spectral estimator. Another method is to notice that the terms in the infinite sum (21) decay as (1 - 2\tau)^j. Hence, given any prescribed precision, it is sufficient to compute only finitely many terms in (21). Each term can in turn be computed with the required precision by noting that P_{Z_j \mid Z_1^{j-1}}[1 \mid Z_1^{j-1}] is a Markov process with a simple transition kernel.

The proof of Theorem 2 is outlined in the appendix; the proof of Theorem 3, being somewhat more technical, can be found in [10].

V. DISCUSSION

The natural application of the expansion (4) is in approximating the maximal achievable rate. Unlike the BSC case (17), the prelog constant of the logarithmic term for the GEC is unknown, so a natural choice for the approximation would be a zero coefficient for the \log n term. However, in view of the robustness of the 1/2 prelog in (17) to variation in the crossover probability, we chose the following expression for numerical comparison:

\frac{1}{n} \log M^*(n, \epsilon) \approx C_{0,1} - \sqrt{\frac{V_{0,1}}{n}}\, Q^{-1}(\epsilon) + \frac{\log n}{2n}.    (24)

To demonstrate the tightness of (24), we have computed numerically the upper and lower bounds developed in the course of the proofs of Theorems 2 and 3. The comparison is shown in Fig. 1. For the case of state known (left plot), the achievability bound is (45) and the converse bound is (62); see the appendix.
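The second computational route described above can be sketched as a Monte Carlo estimator: drive the belief recursion for P[Z_j = 1 | Z_1^{j-1}], collect samples of F_j, and truncate the series (21). The function name, sample size, and truncation lag below are our choices, not the paper's:

```python
import random
from math import log2

def estimate_C0_V0(tau, d1, d2, n=100000, lag_max=100, seed=0):
    """Monte Carlo estimates (in bits) of C_0 from (14) and of V_0 from a
    truncated version of the series (21)."""
    rng = random.Random(seed)
    s = rng.choice([1, 2])       # hidden state of the chain
    p = 0.5                      # belief P[S_j = 1 | Z_1^{j-1}]
    F = []                       # samples of F_j = log 2 + log P(Z_j | Z_1^{j-1})
    for _ in range(n):
        z = 1 if rng.random() < (d1 if s == 1 else d2) else 0
        q = p * d1 + (1 - p) * d2            # P[Z_j = 1 | Z_1^{j-1}]
        pz = q if z == 1 else 1 - q
        F.append(1.0 + log2(pz))
        # Bayes update of the belief, then one step of the chain kernel
        post = p * (d1 if z == 1 else 1 - d1) / pz
        p = post * (1 - tau) + (1 - post) * tau
        if rng.random() < tau:
            s = 3 - s
    m = sum(F) / n               # estimate of C_0
    def cov(k):                  # empirical autocovariance of F at lag k
        return sum((F[i] - m) * (F[i + k] - m) for i in range(n - k)) / (n - k)
    V0 = cov(0) + 2 * sum(cov(k) for k in range(1, lag_max))
    return m, V0
```

The truncation lag exploits the (1 - 2\tau)^j decay noted above; the estimates converge to (C_0, V_0) as n grows, up to Monte Carlo error.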
For the case of state not known (right plot), we computed the capacity and dispersion:

C_0 \approx 0.280 bit,    (25)
V_0 \approx 1.273 bit^2.    (26)
Fig. 1. Rate-blocklength tradeoff (rate R, bit/ch. use, versus blocklength n up to 4000) at block error rate \epsilon = 0.01 for the Gilbert-Elliott channel with parameters \delta_1 = 1/2, \delta_2 = 0 and state transition probability \tau = 0.1. Each plot shows the capacity, the converse and achievability bounds, and the normal approximation. The left (right) plot is for the case when the state is known (not known) at the receiver.

The plots in Fig. 1 suggest that not only do our bounds tightly describe the value of \frac{1}{n} \log M^*(n, \epsilon), but also that the simple expression (24) is precise enough for addressing many practical questions. Let us discuss two such questions.

First, for the state-known case, the capacity C_1 is independent of the state transition probability \tau; see (13). However, as shown by Theorem 2, the channel dispersion V_1 does indeed depend on \tau. Together with (3), the expression (18) implies that the minimal blocklength needed to achieve a given fraction of capacity behaves as O\left(\frac{1}{\tau}\right) as \tau \to 0. This has an intuitive explanation: to achieve the full capacity of a Gilbert-Elliott channel we need to wait until the influence of the random initial state washes away. Since transitions occur on average every \frac{1}{\tau} channel uses, the blocklength should be O\left(\frac{1}{\tau}\right) as \tau \to 0. Comparing (16) and (18), we can ascribe a meaning to each of the two terms in (18): the first gives the dispersion due to the usual BSC noise, whereas the second is due to the memory in the channel.

Next, consider the case in which the state is not known at the decoder. As shown in [8], when the state transition probability \tau decreases to 0, the capacity C_0(\tau) increases to C_1. This is sometimes interpreted as implying that, if the state is unknown at the receiver, slower dynamics are advantageous. Our refined analysis, however, shows that this is true only up to a point. Indeed, fix a rate R < C_1 and an \epsilon > 0.
In view of the tightness of (24), the minimum blocklength N_0(\tau), as a function of the state transition probability \tau, needed to achieve rate R is approximately given by

N_0(\tau) \approx \left( \frac{Q^{-1}(\epsilon)}{C_0(\tau) - R} \right)^2 V_0(\tau),    (27)

provided that C_0(\tau) > R, of course. Heuristically, there are two effects: when the state transition probability \tau decreases, we can predict the current state better and therefore the capacity grows; on the other hand, as discussed above, when \tau decreases we also have to wait longer until the chain forgets the initial state. The trade-off between these two effects is demonstrated in Fig. 2, where we plot N_0(\tau) for the same parameters as in Fig. 1.

Fig. 2. Minimal blocklength N_0(\tau) needed to achieve R = 0.4 bit and \epsilon = 0.01, as a function of the state transition probability \tau. The channel is the Gilbert-Elliott channel with no state information at the receiver, \delta_1 = 1/2, \delta_2 = 0.

VI. CONCLUSION

In this paper, we have proved an approximation of the form (4) for the Gilbert-Elliott channel. In Fig. 1, we have illustrated its relevance by comparing it numerically with the bounds. As we have found previously in [1] and [3], expansions such as (4) have practical importance by providing tight approximations of the speed of convergence to capacity, and by allowing for estimation of the blocklength needed to achieve
a given fraction of capacity, as given by (3).

REFERENCES

[1] Y. Polyanskiy, H. V. Poor and S. Verdú, "Channel coding rate in the finite blocklength regime," submitted to IEEE Trans. Inform. Theory, Nov. 2008.
[2] V. Strassen, "Asymptotische Abschätzungen in Shannons Informationstheorie," Trans. Third Prague Conf. Information Theory, Czechoslovak Academy of Sciences, Prague, 1962, pp. 689-723.
[3] Y. Polyanskiy, H. V. Poor and S. Verdú, "New channel coding achievability bounds," Proc. IEEE Int. Symp. Information Theory (ISIT), Toronto, Canada, 2008.
[4] Y. Polyanskiy, H. V. Poor and S. Verdú, "Dispersion of Gaussian channels," Proc. IEEE Int. Symp. Information Theory (ISIT), Seoul, Korea, 2009.
[5] E. Biglieri, J. Proakis, and S. Shamai (Shitz), "Fading channels: Information-theoretic and communication aspects," IEEE Trans. Inform. Theory, 50th Anniversary Issue, Vol. 44, No. 6, pp. 2619-2692, October 1998.
[6] E. N. Gilbert, "Capacity of a burst-noise channel," Bell Syst. Tech. J., Vol. 39, pp. 1253-1265, Sept. 1960.
[7] E. O. Elliott, "Estimates of error rates for codes on burst-noise channels," Bell Syst. Tech. J., Vol. 42, pp. 1977-1997, Sept. 1963.
[8] M. Mushkin and I. Bar-David, "Capacity and coding for the Gilbert-Elliott channels," IEEE Trans. Inform. Theory, Vol. 35, No. 6, pp. 1277-1290, 1989.
[9] S. Verdú, EE528 Information Theory, Lecture Notes, Princeton University, Princeton, NJ, 2007.
[10] Y. Polyanskiy, H. V. Poor and S. Verdú, "Dispersion of the Gilbert-Elliott channel," draft, 2009.
[11] A. N. Tikhomirov, "On the convergence rate in the central limit theorem for weakly dependent random variables," Theory of Probability and Its Applications, Vol. XXV, No. 4, 1980.

APPENDIX
PROOF OF THEOREM 2

For our analysis we need to invoke a few relevant results from [1] and [3]. Consider an abstract channel P_{Y|X}; for an arbitrary input distribution P_X, define the (extended) random variable

i(X; Y) = \log \frac{dP_{Y|X}(Y \mid X)}{dP_Y(Y)},    (28)

where P_Y = \int dP_X(x)\, P_{Y|X=x}.
Theorem 4 (DT bound): For an arbitrary P_X, there exists a code with M codewords and average probability of error \epsilon satisfying

\epsilon \le E\left[ \exp\left\{ -\left[ i(X; Y) - \log \frac{M-1}{2} \right]^+ \right\} \right].    (29)

The optimal performance of binary hypothesis testing plays an important role in our development. Consider a random variable W taking values in a set \mathcal{W}, which can take one of two probability measures, P or Q. A randomized test between those two distributions is defined by a random transformation P_{Z|W} : \mathcal{W} \to \{0, 1\}, where 0 indicates that the test chooses Q. The best performance achievable among those randomized tests is given by³

\beta_\alpha(P, Q) = \min \sum_{w \in \mathcal{W}} Q(w) P_{Z|W}(1 \mid w),    (30)

where the minimum is taken over all random transformations P_{Z|W} satisfying

\sum_{w \in \mathcal{W}} P(w) P_{Z|W}(1 \mid w) \ge \alpha.    (31)

³ We write summations over alphabets for simplicity; however, all of our general results hold for arbitrary probability spaces.

The minimum in (30) is guaranteed to be achieved by the Neyman-Pearson lemma. Thus, \beta_\alpha(P, Q) gives the minimum probability of error under hypothesis Q if the probability of error under hypothesis P is not larger than 1 - \alpha. It is easy to show (see, e.g., [9]) that for any \gamma > 0,

\alpha \le P\left[ \frac{dP}{dQ} \ge \gamma \right] + \gamma\, \beta_\alpha(P, Q).    (32)

On the other hand,

\beta_\alpha(P, Q) \le \frac{1}{\gamma_0}    (33)

for any \gamma_0 that satisfies

P\left[ \frac{dP}{dQ} \ge \gamma_0 \right] \ge \alpha.    (34)

Virtually all known converse results for channel coding follow from the next theorem by a judicious choice of Q_{Y|X} and a lower bound on \beta; see [1].

Theorem 5 (meta-converse): Consider two different abstract channels P_{Y|X} and Q_{Y|X} defined on the same input and output spaces. For a given code (possibly with a randomized encoder and decoder pair), let

\epsilon = average error probability with P_{Y|X},
\epsilon' = average error probability with Q_{Y|X},
P_X = Q_X = encoder output distribution with equiprobable codewords.

Then,

\beta_{1-\epsilon}(P_{XY}, Q_{XY}) \le 1 - \epsilon',    (35)

where P_{XY} = P_X P_{Y|X} and Q_{XY} = Q_X Q_{Y|X}.

Proof of Theorem 2: Achievability: We choose P_{X^n} equiprobable.
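For finite alphabets, \beta_\alpha(P, Q) in (30)-(31) can be computed exactly by the Neyman-Pearson construction: include outcomes in decreasing order of the likelihood ratio P(w)/Q(w) and randomize on the boundary outcome. A minimal sketch, assuming P and Q are dicts over a common support (the helper name is ours):

```python
def beta_alpha(P, Q, alpha):
    """Exact beta_alpha(P, Q) of (30)-(31) for finite distributions,
    via the Neyman-Pearson lemma."""
    outcomes = sorted(P, key=lambda w: P[w] / Q[w], reverse=True)
    pmass, beta = 0.0, 0.0
    for w in outcomes:
        if pmass + P[w] <= alpha:
            pmass += P[w]                            # include w deterministically
            beta += Q[w]
        else:
            beta += (alpha - pmass) / P[w] * Q[w]    # randomize on the boundary
            return beta
    return beta

P = {'a': 0.5, 'b': 0.3, 'c': 0.2}
Q = {'a': 0.2, 'b': 0.3, 'c': 0.5}
b = beta_alpha(P, Q, 0.9)   # equals 3/4 up to floating-point rounding
```

One can check the bounds (32)-(34) numerically against this exact value; in the converse below, (32) is applied in the rearranged form \beta_\alpha \ge (\alpha - P[dP/dQ \ge \gamma])/\gamma.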
Since the output of the channel is (Y^n, S^n), we need to write down the expression for i(X^n; Y^n S^n). To that end, we define an operation on \mathbb{R} \times \{0, 1\}:

a^{[b]} = \begin{cases} 0, & b = 0, \\ a, & b = 1. \end{cases}    (36)

Then we obtain

i(X^n; Y^n S^n) = \log \frac{P_{Y^n \mid X^n S^n}}{P_{Y^n \mid S^n}}    (37)
               = n \log 2 + \sum_{j=1}^{n} \log\left( \delta_{S_j}^{[Z_j]} + (1 - \delta_{S_j})^{[1 - Z_j]} \right),    (38)

where (37) is by the independence of X^n and S^n, and (38) is because under equiprobable X^n the distribution P_{Y^n \mid S^n} is also equiprobable. Using (38) we can find

E[i(X^n; Y^n S^n)] = n C_1    (39)

and

\mathrm{Var}[i(X^n; Y^n S^n)] = n V_1 + O(1).    (40)
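Equations (38)-(39) can be sanity-checked by Monte Carlo: sample (S^n, Z^n), evaluate the sum in (38), and compare the empirical mean with n C_1. A sketch in bits (helper names and parameters are ours; \delta_1 = 1/2, \delta_2 = 0, \tau = 0.1 give C_1 = 0.5 bit):

```python
import random
from math import log2

def info_density_bits(S, Z, d1, d2):
    """Realization of i(X^n; Y^n S^n) from (38), in bits:
    n * log 2 + sum_j log P[Z_j | S_j]."""
    i = float(len(Z))
    for s, z in zip(S, Z):
        d = d1 if s == 1 else d2
        i += log2(d if z == 1 else 1.0 - d)
    return i

rng = random.Random(1)

def sample_chain(n, tau, d1, d2):
    """Sample (S^n, Z^n) from the model (5)-(9)."""
    s = rng.choice([1, 2]); S = []; Z = []
    for _ in range(n):
        S.append(s)
        Z.append(1 if rng.random() < (d1 if s == 1 else d2) else 0)
        if rng.random() < tau:
            s = 3 - s
    return S, Z

# By (39), E[i] = n * C_1; here that is 0.5 * n bits.
n, trials = 500, 2000
mean_i = sum(info_density_bits(*sample_chain(n, 0.1, 0.5, 0.0), 0.5, 0.0)
             for _ in range(trials)) / trials
```

For these parameters the realization simplifies to n minus the number of visits to the noisy state, so the empirical variance of i also recovers the memory term of (18).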
By (38) we see that i(X^n; Y^n S^n) is a sum of weakly dependent random variables. For such a process, Tikhomirov's theorem [11] provides an extension of the Berry-Esseen inequality, namely

\left| P\left[ i(X^n; Y^n S^n) > n C_1 + \sqrt{n V_1}\, \lambda \right] - Q(\lambda) \right| \le \frac{B_1 \log n}{\sqrt{n}}.    (41)

In addition, similarly to Lemma 45, Appendix G of [1], we can show that for arbitrary A,

E\left[ \exp\left\{ -i(X^n; Y^n S^n) + A \right\} 1\left\{ i(X^n; Y^n S^n) \ge A \right\} \right] \le \frac{B_2 \log n}{\sqrt{n}}.    (42)

In (41) and (42), B_1 and B_2 are fixed constants which do not depend on A or \lambda. Finally, we set

\log M = n C_1 - \sqrt{n V_1}\, Q^{-1}(\epsilon_n),    (43)

where

\epsilon_n = \epsilon - (B_1 + B_2) \frac{\log n}{\sqrt{n}}.    (44)

Then, by Theorem 4, we know that there exists a code with M codewords and average probability of error p_e bounded by⁴

p_e \le E\left[ \exp\left\{ -\left[ i(X^n; Y^n S^n) - \log \frac{M-1}{2} \right]^+ \right\} \right]    (45)
    \le P\left[ i(X^n; Y^n S^n) \le \log M \right] + \frac{B_2 \log n}{\sqrt{n}}    (46)
    \le \epsilon_n + (B_1 + B_2) \frac{\log n}{\sqrt{n}}    (47)
    = \epsilon,    (48)

where (46) is by (42) with A = \log M, (47) is by (41) and the definition (43) of \log M, and (48) is by (44). Therefore, by invoking Taylor's expansion in (43), we have

\log M^*(n, \epsilon) \ge \log M = n C_1 - \sqrt{n V_1}\, Q^{-1}(\epsilon) + O(\log n).    (49)

Converse: In the converse part we assume that the transmitter has access to the full state sequence S^n and generates X^n based on both the input message and S^n. Take the best such code with M^*(n, \epsilon) codewords and average probability of error no greater than \epsilon. We now propose to treat the pair (X^n, S^n) as a combined input to the channel and the pair (Y^n, S^n) as a combined output, available to the decoder. Note that in this situation the encoder induces a distribution P_{X^n S^n} and is necessarily randomized, because the distribution of S^n is given by the output of the Markov chain and is independent of the transmitted message W. To apply Theorem 5, we choose the auxiliary channel that passes S^n unchanged and generates Y^n equiprobably:

Q_{Y^n \mid X^n S^n}(y^n \mid x^n, s^n) = 2^{-n}.    (50)

⁴ In the statement of the theorem we claimed a stronger result about the maximal probability of error. Its proof is only slightly different and is omitted.
Note that, by the constraint on the encoder, S^n is independent of the message W. Moreover, under the Q-channel, Y^n is also independent of W, and we clearly have

\epsilon' \ge 1 - \frac{1}{M}.    (51)

Therefore, by Theorem 5 we obtain

\beta_{1-\epsilon}(P_{X^n Y^n S^n}, Q_{X^n Y^n S^n}) \le \frac{1}{M}.    (52)

To lower-bound \beta_{1-\epsilon}(P_{X^n Y^n S^n}, Q_{X^n Y^n S^n}) via (32), we notice that

\log \frac{P_{X^n Y^n S^n}}{Q_{X^n Y^n S^n}} = \log \frac{P_{Y^n \mid X^n S^n}\, P_{X^n S^n}}{Q_{Y^n \mid X^n S^n}\, Q_{X^n S^n}}    (53)
  = \log \frac{P_{Y^n \mid X^n S^n}}{Q_{Y^n \mid X^n S^n}}    (54)
  = i(X^n; Y^n S^n),    (55)

where (54) is because P_{X^n S^n} = Q_{X^n S^n}, and (55) follows simply by noting that P_{Y^n \mid S^n} in (37) is also equiprobable. Now set

\log \gamma = n C_1 - \sqrt{n V_1}\, Q^{-1}(\epsilon_n),    (56)

where this time

\epsilon_n = \epsilon + \frac{B_1 \log n + 1}{\sqrt{n}}.    (57)

By (32) we have, for \alpha = 1 - \epsilon,

\beta_{1-\epsilon} \ge \frac{1}{\gamma}\left( 1 - \epsilon - P\left[ \log \frac{P_{X^n Y^n S^n}}{Q_{X^n Y^n S^n}} \ge \log \gamma \right] \right)    (58)
  = \frac{1}{\gamma}\left( 1 - \epsilon - P\left[ i(X^n; Y^n S^n) \ge \log \gamma \right] \right)    (59)
  \ge \frac{1}{\gamma}\left( 1 - \epsilon - (1 - \epsilon_n) - \frac{B_1 \log n}{\sqrt{n}} \right)    (60)
  = \frac{1}{\sqrt{n}\, \gamma},    (61)

where (59) is by (55), (60) is by (41), and (61) is by (57). Finally,

\log M^*(n, \epsilon) \le \log \frac{1}{\beta_{1-\epsilon}}    (62)
  \le \log \gamma + \frac{1}{2} \log n    (63)
  = n C_1 - \sqrt{n V_1}\, Q^{-1}(\epsilon_n) + \frac{1}{2} \log n    (64)
  = n C_1 - \sqrt{n V_1}\, Q^{-1}(\epsilon) + O(\log n),    (65)

where (62) is just (52), (63) is by (61), (64) is by (56), and (65) is by Taylor's formula applied to Q^{-1}, using (57) for \epsilon_n.
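As a worked complement to the appendix, the normal-approximation curve of Fig. 1 (state-known case) follows directly from (24). This is a sketch in bits, with C_1 = 0.5 and V_1 = 2.25 obtained from (13) and (18) for \delta_1 = 1/2, \delta_2 = 0, \tau = 0.1 (helper names are ours):

```python
from math import sqrt, log2
from statistics import NormalDist

def Qinv(eps):
    """Inverse of Q(x) = P[N(0,1) > x]."""
    return NormalDist().inv_cdf(1.0 - eps)

def normal_approx_rate(n, C, V, eps):
    """Right-hand side of (24): C - sqrt(V/n) * Q^{-1}(eps) + log(n)/(2n),
    in bits per channel use."""
    return C - sqrt(V / n) * Qinv(eps) + log2(n) / (2 * n)

# State-known GEC of Fig. 1: C_1 = 0.5 bit, V_1 = 2.25 bits^2, eps = 0.01.
for n in (500, 1000, 2000, 4000):
    print(n, round(normal_approx_rate(n, 0.5, 2.25, 0.01), 3))
```

The curve increases toward capacity from below over the blocklength range of Fig. 1, matching the qualitative shape of the left plot.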