
Exact Random Coding Exponents for Erasure Decoding

Anelia Somekh-Baruch and Neri Merhav

(A. Somekh-Baruch is with the School of Engineering, Bar-Ilan University, Ramat-Gan 52900, Israel. E-mail: aneliasomekhbaruch@gmail.com. N. Merhav is with the Department of Electrical Engineering, Technion - Israel Institute of Technology, Haifa 32000, Israel. E-mail: merhav@ee.technion.ac.il.)

Abstract: Random coding of channel decoding with an erasure option is studied. By analyzing the large deviations behavior of the code ensemble, we obtain exact single-letter formulas for the error exponents in lieu of Forney's lower bounds. The analysis technique we use is based on an enhancement and specialization of tools for assessing the moments of certain distance enumerators. We specialize our results to the setup of the binary symmetric channel with the uniform random coding distribution and derive an explicit expression for the error exponent which, unlike Forney's bounds, does not involve optimization over two parameters. We also establish the fact that for this setup, the difference between the exact error exponent corresponding to the probability of undetected decoding error and the exponent corresponding to the erasure event is equal to the threshold parameter. Numerical calculations indicate that for this setup, as well as for a Z-channel, Forney's bound coincides with the exact random coding exponent.

Index Terms: random coding, erasure, list, error exponent, distance enumerator

I. INTRODUCTION

In [1], Forney derived lower bounds on the random coding exponents associated with decoding rules that allow for erasure and list decoding (see also the later related studies [4]-[9]). The channel model he considered was a single-user discrete memoryless channel (DMC); a codebook of block length n is randomly drawn with independent codewords having i.i.d. symbols. When erasure is concerned, the decoder may either fully decode the message or decide to declare that an erasure has occurred. An optimum tradeoff between the probability of erasure and the probability of undetected decoding error was investigated. This tradeoff is optimally controlled by a threshold parameter T of the function e^{nT}, to which one compares the ratio between the likelihood of each hypothesized message and the sum of likelihoods of all other messages. If this ratio exceeds e^{nT} for some message, a decision is made in favor of that message; otherwise, an erasure is declared.

Forney's main result in [1] is a single-letter lower bound, E_1(R,T), to the exponent of the probability of the event E_1 of not making the correct decision, namely, either erasing or making the wrong decision, and a single-letter lower bound, E_2(R,T), to the exponent of the probability of the event E_2 of undetected error. In [2, Th. 5], Csiszar and Korner derived universally achievable error exponents for a decoder with an erasure option for DMCs. These error exponents were obtained by analyzing a decoder which generalizes the MMI decoder for constant composition (CC) codes. Unlike Forney's decoder, no optimality claims were made for this decoder, but in [3, Sec. 4.4.3] Telatar stated that these bounds are essentially the same as those of [1].

Inspired by a statistical-mechanical point of view on random code ensembles (offered in [10] and further elaborated on in [12]), Merhav [11] applied a different technique to derive a lower bound to the exponents of the probabilities of E_1, E_2 by assessing the moments of certain distance enumerators. This approach, which also proved fruitful in several other applications (see [13], [14], [15]), resulted in a bound that is at least as tight as Forney's bound. It is shown in [11] that under certain symmetry conditions (that often hold) on the random coding distribution and the channel, the resulting bound is also simpler, in the sense that there is only one parameter to optimize rather than two. Moreover, this optimization can be carried out in closed form, at least in some special cases like the binary symmetric channel (BSC). It is not clear, though, whether the bounds of [11] are strictly tighter than those of Forney.

In this paper, we use the approach of distance enumerators to tackle again the problem of random coding for channel decoding with an erasure option. Unlike the approaches of [1] and [11], our starting point is not a Gallager-type bound [16] on the probability of error, but rather the exact expression. This approach results in single-letter expressions for the exact exponential behavior of the probabilities of the events E_1, E_2 when random coding is used. So far, we have not been able to determine analytically whether our results coincide with Forney's bounds, i.e., we cannot say whether Forney's bounds are tight or not, but the tightness of our expressions is guaranteed. While our analysis pertains to the ensemble of codes in which each symbol of each codeword is drawn i.i.d. (mainly in order to enable a fair comparison to [1]), our technique can easily be used for other ensembles, like the ensemble in which each codeword is drawn independently according to the uniform distribution within a given type class. For the case of the BSC with the uniform random coding distribution we have conducted several numerical calculations, which indicate that Forney's bound coincides with the exact random coding exponent.

The outline of this paper is as follows. In Section II, we present notation conventions, and in Section III, we give some necessary background in more detail. Section IV is devoted to a description of the main results. In the following sections, V-IX, we provide detailed derivations of the main results: in Section V, we derive the exact expression for the error exponent corresponding to the probability of E_1, and in Sections VI and VII, we study two special cases of

channels. Section VIII is dedicated to the derivation of the exact expression for the error exponent corresponding to the probability of E_2, and in Section IX, we specialize the proof to the case of the BSC with the uniform random coding distribution.

II. NOTATION

Throughout this paper, scalar random variables (RVs) will be denoted by capital letters, their sample values will be denoted by the respective lower case letters, and their alphabets will be denoted by the respective calligraphic letters, e.g., X, x, and calligraphic X, respectively. A similar convention will apply to random vectors of dimension n and their sample values, which will be denoted with the same symbols in the boldface font. The set of all n-vectors with components taking values in a certain finite alphabet will be denoted as the same alphabet superscripted by n, e.g., X^n. Sources and channels will be denoted generically by the letter P or Q. Information theoretic quantities, such as entropies and conditional entropies, will be denoted following the usual conventions of the information theory literature, e.g., H(X), H(X|Y), and so on. When we wish to emphasize the dependence of the entropy on a certain underlying probability distribution, say Q, we subscript it by Q, i.e., use notations like H_Q(X), H_Q(X|Y), etc. The divergence (or Kullback-Leibler distance) between two probability measures Q and P will be denoted by D(Q||P), and when there is a need to make a distinction between P and Q as joint distributions of (X,Y), as opposed to the corresponding marginal distributions of, say, X, we will use subscripts to avoid ambiguity, that is, we shall use the notations D(Q_{XY}||P_{XY}) and D(Q_X||P_X). For two numbers 0 <= q, p <= 1, D(q||p) will stand for the divergence between the binary measures (q, 1-q) and (p, 1-p). The expectation operator will be denoted by E, and once again, when we wish to make the dependence on the underlying distribution Q clear, we denote it by E_Q. The cardinality of a finite set A will be denoted by |A|. The indicator function of an event E will be denoted by 1{E}. For a given sequence y in Y^n, Y being a finite alphabet, hat-P_y will denote the empirical distribution on Y extracted from y; in other words, hat-P_y is the vector {hat-P_y(y), y in Y}, where hat-P_y(y) is the relative frequency of the letter y in the vector y. For two sequences of positive numbers, a_n and b_n, the notation a_n =. b_n means that a_n and b_n are of the same exponential order, i.e., (1/n) ln(a_n/b_n) -> 0 as n -> infinity. Similarly, a_n <=. b_n means that limsup_n (1/n) ln(a_n/b_n) <= 0, and so on. Another notation that we shall use is that, for a real number x, [x]_+ = max{0, x}.

III. PRELIMINARIES

Consider a DMC with a finite input alphabet X, finite output alphabet Y, and single-letter transition probabilities P(y|x), x in X, y in Y. As the channel is fed by an input vector x in X^n, it generates an output vector y in Y^n according to the sequence of conditional probability distributions

P(y_i | x_1, ..., x_i, y_1, ..., y_{i-1}) = P(y_i | x_i), i = 1, 2, ..., n,  (1)

where for i = 1, (y_1, ..., y_{i-1}) is understood as the null string. A rate-R block code of length n consists of M = e^{nR} n-vectors x_m, m = 1, 2, ..., M, which represent M different messages. We will assume that all possible messages are a priori equiprobable, i.e., P(m) = 1/M for all m = 1, 2, ..., M. A decoder with an erasure option is a partition of Y^n into (M+1) regions, R_0, R_1, ..., R_M. Such a decoder works as follows: if y falls into R_m, m = 1, 2, ..., M, then a decision is made in favor of message number m; if y is in R_0, no decision is made and an erasure is declared. We will refer to R_0 as the erasure event.

Given a code C = {x_1, ..., x_M} and a decoder R = (R_0, R_1, ..., R_M), let us now define two undesired events. The event E_1 is the event of not making the right decision. This event is the disjoint union of the erasure event and the event E_2, which is the undetected error event, namely, the event of making the wrong decision. The probabilities of all three events are defined as follows:

Pr{E_1} = (1/M) sum_{m=1}^{M} sum_{y in R_m^c} P(y|x_m)  (2)

Pr{E_2} = (1/M) sum_{m=1}^{M} sum_{y in R_m} sum_{m' != m} P(y|x_{m'})  (3)

Pr{R_0} = Pr{E_1} - Pr{E_2}.  (4)

Forney [1] shows, using the Neyman-Pearson theorem, that the best tradeoff between Pr{E_1} and Pr{E_2} is attained by the decoder R* = (R_0*, R_1*, ..., R_M*) defined by

R_m* = { y : P(y|x_m) / sum_{m' != m} P(y|x_{m'}) >= e^{nT} }, m = 1, 2, ..., M,  (5)

R_0* = ( union_{m=1}^{M} R_m* )^c,  (6)

where (R_m*)^c is the complement of R_m*, and T >= 0 is a parameter, henceforth referred to as the threshold, which controls the balance between the probabilities of E_1 and E_2.

Define the error exponents e_i(R,T), i = 1, 2, as the exponents associated with the average probabilities of error Pr{E_i}, i = 1, 2, where the average is taken with respect to (w.r.t.) the ensemble of randomly selected codes, drawn independently according to an i.i.d. distribution P(x) = prod_{i=1}^{n} P(x_i), that is,

e_i(R,T) = limsup_{n -> infinity} [ -(1/n) ln Pr{E_i} ], i = 1, 2.  (7)

Forney derives lower bounds, E_1(R,T) and E_2(R,T), to e_1(R,T) and e_2(R,T) (or, equivalently, upper bounds to the average probabilities of error), respectively, given by

E_1(R,T) = max_{0 <= s <= rho <= 1} [ E_0(s,rho) - rho R - s T ],  (8)

where

E_0(s,rho) = -ln sum_y [ ( sum_x P(x) P^{1-s}(y|x) ) ( sum_{x'} P(x') P^{s/rho}(y|x') )^rho ],  (9)

and

E_2(R,T) = E_1(R,T) + T.  (10)

Merhav [11] established tighter (though not necessarily strictly tighter) lower bounds to e_1(R,T) and e_2(R,T), denoted E_1*(R,T) and E_2*(R,T). These bounds take on a very simple form under the following condition.

Condition 1: The random coding distribution {P(x), x in X} and the channel transition matrix {P(y|x), x in X, y in Y} are such that, for every real s,

gamma_y(s) = -ln sum_{x in X} P(x) P^s(y|x)  (11)

is independent of y. When the condition holds, gamma_y(s) will be denoted by gamma(s).

Under Condition 1, the bounds to the exponents are

E_1*(R,T) = sup_{s >= 0} { Lambda(R,s) + gamma(-s) - sT } - ln|Y|,  (12)

where gamma'(s) = d gamma(s)/ds,

Lambda(R,s) = gamma(s) - R for s >= s_R, and Lambda(R,s) = s gamma'(s_R) for s < s_R,  (13)

and s_R is the solution to the equation

gamma(s) - s gamma'(s) = R.  (14)

Also, similarly to (10),

E_2*(R,T) = E_1*(R,T) + T.  (15)

Let delta_GV(R) denote the normalized Gilbert-Varshamov (GV) distance, i.e., the smaller solution, delta, to the equation

h(delta) = ln 2 - R,  (16)

where h(delta) = -delta ln delta - (1-delta) ln(1-delta) is the binary entropy function. For the BSC with the uniform random coding distribution, Condition 1 holds with gamma(s) = ln 2 - ln(p^s + (1-p)^s), and the bound E_1*(R,T) admits an explicit expression, (17)-(19), in which the only optimization is again over the single parameter s; it is noted that the optimal s in (17) has an explicit closed-form expression, given in [11]. This expression involves the constant

beta = ln((1-p)/p).  (20)

By the term uniform random coding distribution, we mean uniform over {0,1}^n.

IV. MAIN RESULTS

The main results in this paper are stated in Theorems 1 and 2, which establish exact expressions for the random coding error exponents e_1(R,T) and e_2(R,T) for the general DMC. For a given probability distribution Q on X x Y, define

K(Q,R) = E_Q ln(1/P(X)) - H_Q(X|Y) - R = D(Q_X||P_X) + I_Q(X;Y) - R,  (21)

and for a given probability distribution Q_Y on Y, define

G_R(Q_Y) = { Q_{X|Y} : K(Q,R) <= 0 },  (22)

where Q is the probability distribution Q_Y x Q_{X|Y}.

Theorem 1: The error exponent e_1(R,T) is given by

e_1(R,T) = min_Q { D(Q_{XY}||P_{XY}) + min_{tilde-Q_{X|Y}} K(tilde-Q, R) },  (23)

where Q is a probability distribution on X x Y, Q and tilde-Q share the same marginal pmf of Y, that is, Q = Q_Y x Q_{X|Y} and tilde-Q = Q_Y x tilde-Q_{X|Y}, and the inner minimization is over tilde-Q_{X|Y} in G_R^c(Q_Y) such that

Omega(Q, tilde-Q, T) = E_Q ln P(Y|X) - E_{tilde-Q} ln P(Y|X) - T <= 0.  (24)

Corollary 1: Under Condition 1 (see (11)), the error exponent e_1(R,T) is given by

e_1(R,T) = min_{Q : xi(Q,T) <= gamma'(s_R)} { D(Q_{XY}||P_{XY}) + psi(s(xi(Q,T))) - R },  (25)

where

xi(Q,T) = T + E_Q ln(1/P(Y|X)),  (26)

s(xi) is the solution of the equation gamma'(s) = xi, s_R is defined as in (14), and

psi(s) = gamma(s) - s gamma'(s).  (27)

Corollary 2: For the BSC with the uniform random coding distribution, if R >= ln 2 - h(p + T/beta), then e_1(R,T) = 0, and otherwise

e_1(R,T) = min_{q in [p, delta_GV(R) - T/beta]} { D(q||p) - h(q + T/beta) } + ln 2 - R,  (28)

where beta is defined in (20). It is easy to verify, by equating the derivative of D(q||p) - h(q + T/beta) to zero, that the minimizing q is either a boundary point of the interval [p, delta_GV(R) - T/beta], or the point q that satisfies a quadratic equation, (29),

the relevant root of which, via the quadratic formula, has an explicit closed-form expression, (30). We note that in the BSC case, the exact exponent e_1(R,T) has a surprisingly simple explicit expression, (28), in the sense that there is an optimization over one parameter only, and its optimum value is found in closed form.

Theorem 2: The error exponent e_2(R,T) is given by

e_2(R,T) = min_{Q_Y} min_{Theta <= Theta_0(Q_Y)} { D(Q_Y||P_Y) + E_A(Q_Y,Theta) + E_B(Q_Y,Theta) - R },  (31)

where, with Q = Q_Y x Q_{X|Y},

Theta_0(Q_Y) = min_{Q_{X|Y} in G_R(Q_Y)} { E_Q ln(1/P(X,Y)) - H_Q(X|Y) } - R,  (32)

E_A(Q_Y,Theta) = min_{Q_{X|Y} : E_Q ln P(Y|X) = T - Theta} { E_Q ln(1/P(X)) - H_Q(X|Y) },  (33)

E_B(Q_Y,Theta) = min_{Q_{X|Y} : E_Q ln P(Y|X) <= -Theta} { E_Q ln(1/P(X|Y)) - H_Q(X|Y) }.  (34)

Corollary 3: For the BSC with the uniform random coding distribution,

e_2(R,T) = e_1(R,T) + T.  (35)

Discussion: The relation e_2(R,T) = e_1(R,T) + T, which is proved to hold in the case of the BSC with the uniform random coding distribution, is not surprising. The intuition behind it can be explained as follows. Recall that the decision rule (5) was chosen to optimize the tradeoff between Pr{E_1} and Pr{E_2}, and consider the equivalent problem of minimizing a Lagrangian which is a linear combination of Pr{E_1} and Pr{E_2}, where the Lagrange multiplier is e^{nT}. Had e_2(R,T) been different from e_1(R,T) + T, the exponents of these two terms would differ, and hence one could improve the overall exponent by changing their balance.

While our results imply that e_i(R,T) >= E_i(R,T) and e_i(R,T) >= E_i*(R,T), i = 1, 2, we have not been able to determine analytically whether or not there are cases in which at least one of these inequalities is strict. If such cases are found, then the conclusion will be that we have strictly improved on the results of [1], or of [11], or of both. If not, then the conclusion will be that the exponents of [1] and [11] are tight, a fact which was not determined unequivocally before. In either case, the tools proposed in this paper provide us with a yardstick to determine the tightness of the results in [1] and [11]. As mentioned earlier, we have conducted a numerical study for the case of the BSC with the uniform random coding distribution, which indicates that in this case, e_i(R,T) appearing in (28) is equal to E_i(R,T) (see (8)) and to E_i*(R,T). We have shown analytically for this case that the lowest rate for which e_1(R,T) = 0 is equal to that of E_1(R,T). A numerical study of the Z-channel (i.e., a binary channel for which P(Y=0|X=1) = 0) indicates that e_i(R,T) = E_i(R,T) also for this case. In light of these two examples, we conjecture that Forney's exponent is tight in general.

V. PROOF OF THEOREM 1

The probability of error given that the message sent is m, averaged over the codebooks, is given by

Pr{E_1 | m} = sum_{x_m} P(x_m) sum_y P(y|x_m) Pr{ sum_{m' != m} P(y|x_{m'}) > P(y|x_m) e^{-nT} | x_m, y }.  (36)

Next, let Q be an empirical probability distribution defined on X x Y, let N(Q|y) denote the number of codewords (excluding x_m) whose joint empirical probability distribution with y is Q, and denote f(y,Q) = N(Q|y) e^{n E_Q ln P(Y|X)}. Then

Pr{ sum_{m' != m} P(y|x_{m'}) > P(y|x_m) e^{-nT} | x_m, y }
= Pr{ sum_Q N(Q|y) e^{n E_Q ln P(Y|X)} > P(y|x_m) e^{-nT} | x_m, y }
= Pr{ sum_Q f(y,Q) > P(y|x_m) e^{-nT} | x_m, y }
(a)=. Pr{ max_Q f(y,Q) > P(y|x_m) e^{-nT} | x_m, y }
= Pr{ union_Q { f(y,Q) > P(y|x_m) e^{-nT} } | x_m, y }
=. max_Q Pr{ f(y,Q) > P(y|x_m) e^{-nT} | x_m, y }
= max_Q Pr{ N(Q|y) > P(y|x_m) e^{-n(E_Q ln P(Y|X) + T)} | x_m, y },  (37)

where (a) follows from monotonicity and continuity of the

exponent in T. (Formally, the difference between Pr{sum_Q f(y,Q) > .} and Pr{max_Q f(y,Q) > .}, which is O(log n / n) on the exponential scale, can be absorbed into the parameter T. Thus, one can derive upper and lower bounds in terms of e_1(R, T + O(log n/n)) and e_1(R, T - O(log n/n)), with e_1(.,.) defined as in (23). These are asymptotically the same whenever e_1(R,T) is continuous in T; and, in fact, since e_1(R,T) is a monotonically non-increasing function of T for a given R, it is continuous in T for almost every T.)

Now, we calculate Pr{ N(Q|y) > P(y|x_m) e^{-n(E_Q ln P(Y|X) + T)} }. Recall the definition of Omega(.,.,T) in (24), and note that

(1/n) ln [ P(y|x_m) e^{-n(E_Q ln P(Y|X) + T)} ] = E_{hat-P_{x_m,y}} ln P(Y|X) - E_Q ln P(Y|X) - T = Omega(hat-P_{x_m,y}, Q, T),  (38)

so we will be interested in evaluating Pr{ N(Q|y) > e^{n Omega(hat-P_{x_m,y}, Q, T)} }. There are two cases to consider, depending on the sign of Omega = Omega(hat-P_{x_m,y}, Q, T).

The case Omega <= 0: here e^{n Omega} <= 1, and since N(Q|y) takes on integer values,

Pr{ N(Q|y) > e^{n Omega} } = Pr{ N(Q|y) >= 1 } =. min{ 1, e^{nR} e^{n(H_Q(X|Y) + E_Q ln P(X))} } = exp{ -n [K(Q,R)]_+ },  (39)

where K(Q,R) is defined in (21). The first exponential equality is because the random variable N(Q|y) is the sum of the e^{nR} i.i.d. binary random variables 1{ (X_i, y) has joint type Q }, i = 1, ..., e^{nR}, each with success probability

p_Q =. e^{n(H_Q(X|Y) + E_Q ln P(X))},  (40)

and the second exponential equality is because, if a is in [0,1], then

(1/2) min{1, aM} <= 1 - (1-a)^M <= min{1, aM}  (41)

(see Lemma 1 in [17]).

The case Omega > 0: there are two sub-cases to consider. If Omega > 0 and Omega > -K(Q,R), we can use the Chernoff bound, similarly to [12, Appendix B]:

Pr{ N(Q|y) > e^{n Omega} } <= min{ 1, exp{ -e^{n Omega} n [ Omega + K(Q,R) ] } }.  (42)

This term decays at least double-exponentially rapidly and hence is negligible on the exponential scale. If 0 < Omega <= -K(Q,R), we prove in the following lemma (see the Appendix) that Pr{ N(Q|y) > e^{n Omega} } is very close to one.

Lemma 1: If Q is in G_R and Omega < R + H_Q(X|Y) + E_Q ln P(X), then Pr{ N(Q|y) <= e^{n Omega} } vanishes super-exponentially.

Therefore, for Omega > 0, we have

Pr{ N(Q|y) > e^{n Omega} } =. 1{ 0 < Omega <= -K(Q,R) }.  (43)

Combining this with the expression (39) for the range Omega <= 0, we get

Pr{ N(Q|y) > e^{n Omega} } =. 1{ Omega <= 0 } exp{ -n [K(Q,R)]_+ } + 1{ 0 < Omega <= -K(Q,R) }.  (44)

This can be rewritten as

Pr{ N(Q|y) > e^{n Omega} } =. 1{ Q_{X|Y} in G_R^c(hat-P_y), Omega <= 0 } exp{ -n K(Q,R) } + 1{ Q_{X|Y} in G_R(hat-P_y), Omega <= -K(Q,R) }.  (45)

Recall that, in fact, Omega = Omega(Q', Q, T) with Q' = hat-P_{x_m,y} (see (38)), so the condition in the second term,

Omega <= -K(Q,R) = H_Q(X|Y) + E_Q ln P(X) + R,  (46)

is equivalent to

E_{Q'} ln P(Y|X) - T <= R + E_Q ln P(X,Y) + H_Q(X|Y).  (47)

Thus, after maximizing over Q and taking the expectation w.r.t. (x_m, y), the resulting exponent is

e_1(R,T) = min{ E_a(R,T), E_b(R,T) },  (48)

where

E_a(R,T) = min_Q { D(Q_{XY}||P_{XY}) + min_{tilde-Q_{X|Y} in G_R^c(Q_Y) : Omega(Q, tilde-Q, T) <= 0} K(tilde-Q, R) },  (49)

E_b(R,T) = min_{Q in L_{R,T}} D(Q_{XY}||P_{XY}),  (50)

with tilde-Q = Q_Y x tilde-Q_{X|Y}, and

L_{R,T} = { Q : E_Q ln P(Y|X) <= R + T + max_{tilde-Q_{X|Y} in G_R(Q_Y)} [ E_{tilde-Q} ln P(X,Y) + H_{tilde-Q}(X|Y) ] },  (51)

where the probability distributions are defined on the set X x Y. Next, we show that E_a(R,T) is the dominant term in the minimization (48):

E_a(R,T) = min{ min_{Q in L_{R,T}} (.), min_{Q not in L_{R,T}} (.) }
<= min_{Q in L_{R,T}} { D(Q_{XY}||P_{XY}) + min_{tilde-Q in G_R^c(Q_Y) : Omega(Q, tilde-Q, T) <= 0} K(tilde-Q, R) }
(a)= min_{Q in L_{R,T}} { D(Q_{XY}||P_{XY}) + min_{tilde-Q in G_R^c(Q_Y)} K(tilde-Q, R) }
(b)<= min_{Q in L_{R,T}} D(Q_{XY}||P_{XY}) = E_b(R,T),  (52)

where (a) follows since, for Q in L_{R,T}, the constraint Omega(Q, tilde-Q, T) <= 0 becomes inactive, and (b) is by the definition of G_R(Q_Y), and by the fact that when the rate R is below capacity, the boundary of G_R^c(Q_Y) is non-empty. (To realize this, note that for rates below capacity, P_{X|Y} is in G_R^c(P_Y), since -H_P(X|Y) + E_P ln(1/P(X)) - R = I_P(X;Y) - R > 0, and obviously G_R(Q_Y) is not empty; e.g., it contains Q_{X|Y} = P_X, independent of Y, which yields -H_Q(X|Y) + E_Q ln(1/P(X)) - R = -R < 0. Hence, by continuity, there exists a convex combination of these two conditional distributions that lies on the boundary of G_R^c(Q_Y).) This results in the exponent e_1(R,T) = E_a(R,T).

VI. PROOF OF COROLLARY 1

To prove the corollary, we use the general expression (23) of Theorem 1. For a measure Q on X x Y, define

Z(Q,R,T) = min_{tilde-Q_{X|Y}} { E_{tilde-Q} ln(1/P(X)) - H_{tilde-Q}(X|Y) },  (53)

where tilde-Q = Q_Y x tilde-Q_{X|Y} and the minimum over tilde-Q_{X|Y} is across G_R^c(Q_Y) intersected with { Omega(Q, tilde-Q, T) <= 0 }. Assuming that Condition 1 holds, it is easy to see (using Lagrange multipliers) that the minimizing tilde-Q_{X|Y} of Z is always of the form

mu_s(x|y) = P(x) P^s(y|x) / sum_{x'} P(x') P^s(y|x'),  (54)

and it is easy to check that for tilde-Q = Q_Y x mu_s,

E_{tilde-Q} ln(1/P(X)) - H_{tilde-Q}(X|Y) = gamma(s) - s gamma'(s),  (55)

and that E_{tilde-Q} ln(1/P(Y|X)) = gamma'(s). Therefore, the computation of Z(Q,R,T) boils down to a problem of minimizing over one parameter only, namely s. Specifically,

Z(Q,R,T) = min { psi(s) : psi(s) >= R, gamma'(s) <= xi(Q,T) },  (56)

where

xi(Q,T) = T + E_Q ln(1/P(Y|X)).  (57)

Now, gamma(s) is a monotonically non-decreasing concave function (gamma'(s) >= 0, gamma''(s) <= 0), and so psi'(s) = -s gamma''(s) >= 0, which means that psi is a non-decreasing function. Therefore, minimizing psi is equivalent to minimizing s subject to the constraints. Now, the constraint psi(s) >= R is equivalent to the constraint s >= s_R (see the definition of s_R following (13)), and the constraint gamma'(s) <= xi(Q,T) is equivalent to s >= s(xi(Q,T)), s(xi) being the solution to the equation gamma'(s) = xi. Thus, to satisfy both constraints, s must be at least max{s_R, s(xi(Q,T))}; so the optimum s is s* = max{s_R, s(xi(Q,T))}. We therefore obtain from (23) that the exponent is equal to

min_Q { D(Q_{XY}||P_{XY}) + psi( max{ s_R, s(xi(Q,T)) } ) - R }.  (58)

We now argue that the achiever Q* must satisfy s(xi(Q*,T)) >= s_R, so that an alternative expression for the above is

min_{Q : s(xi(Q,T)) >= s_R} { D(Q_{XY}||P_{XY}) + psi(s(xi(Q,T))) - R },  (59)

or, equivalently,

min_{Q : xi(Q,T) <= gamma'(s_R)} { D(Q_{XY}||P_{XY}) + psi(s(xi(Q,T))) - R }.  (60)

To see why this is true, assume, conversely, that the achiever Q* satisfies s(xi(Q*,T)) < s_R, or, equivalently, E_{Q*} ln(1/P(Y|X)) > gamma'(s_R) - T. In this case we get

e_1(R,T) = D(Q*_{XY}||P_{XY}) + psi(s_R) - R = D(Q*_{XY}||P_{XY}).  (61)

But we know already that E_{P_{XY}} ln(1/P(Y|X)) = H(Y|X) < gamma'(s_R) - T. Thus, if we look at the convex combination Q(t) = (1-t) Q* + t P_{XY}, and choose t such that E_{Q(t)} ln(1/P(Y|X)) = gamma'(s_R) - T (namely, s(xi(Q(t),T)) = s_R), we have, by convexity of the divergence, D(Q(t)_{XY}||P_{XY}) <= (1-t) D(Q*_{XY}||P_{XY}) < D(Q*_{XY}||P_{XY}), which contradicts the optimality of Q* and hence proves the claim.

VII. PROOF OF COROLLARY 2

Consider the general expression (23) of Theorem 1. For the BSC, we have

E_Q ln P(Y|X) - E_{tilde-Q} ln P(Y|X) = [ tilde-Q(X != Y) - Q(X != Y) ] beta.  (62)

We also note that since P(x) = 1/2, for all Q, E_Q ln(1/P(X)) = ln 2; thus the expression we get from (23) is

min_Q { D(Q_{XY}||P_{XY}) + min_{tilde-Q_{X|Y} in S_1} [ -H_{tilde-Q}(X|Y) + ln 2 - R ] },  (63)

where

S_1 = { tilde-Q_{X|Y} : -H_{tilde-Q}(X|Y) + ln 2 - R >= 0 and tilde-Q(X != Y) <= Q(X != Y) + T/beta }.  (64)

Consequently, (63) is equal to

min_Q { D(Q_{XY}||P_{XY}) + min_{tilde-Q_{X|Y} in S_2} [ -H_{tilde-Q}(X|Y) + ln 2 - R ]_+ },  (65)

where

S_2 = { tilde-Q_{X|Y} : tilde-Q(X != Y) <= Q(X != Y) + T/beta }.

Now, on the one hand, we have

H_{tilde-Q}(X|Y) = H_{tilde-Q}(X xor Y | Y) <= H_{tilde-Q}(X xor Y),  (66)

thus,

min_Q { D(Q_{XY}||P_{XY}) + min_{tilde-Q in S_2} [ -H_{tilde-Q}(X|Y) + ln 2 - R ]_+ }
>= min_q { D(q||p) + min_{tilde-q <= q + T/beta} [ -h(tilde-q) + ln 2 - R ]_+ },  (67)

where the last step follows since the minimizing Q is such that Q_X = P_X (to obtain minimal D(Q_{XY}||P_{XY})), and it is also easy to verify that, given Q(X != Y) = q, the divergence D(Q_{XY}||P_{XY}) is minimized by the symmetric channel choice

Q(y|x) = 1 - q if x = y, and Q(y|x) = q if x != y,  (68)

for which we have D(Q_{XY}||P_{XY}) = D(q||p). On the other hand, one can choose tilde-Q(x|y) of the analogous symmetric form,

tilde-Q(x|y) = 1 - tilde-q if x = y, and tilde-Q(x|y) = tilde-q if x != y,  (69)

which attains the inequality in (67) with equality and is thus the minimizer. Next, we observe that h(tilde-q) is increasing in tilde-q for tilde-q in [0, 1/2] and decreasing for tilde-q in [1/2, 1], so

min_q { D(q||p) + min_{tilde-q <= q + T/beta} [ -h(tilde-q) + ln 2 - R ]_+ }
= min_q { D(q||p) + [ -h( min{ q + T/beta, 1/2 } ) + ln 2 - R ]_+ }
= min_q { D(q||p) - h( min{ q + T/beta, delta_GV(R) } ) + ln 2 - R }.  (70)

Finally, we see that the minimum over q cannot be attained at q > delta_GV(R) - T/beta, because beyond delta_GV(R) - T/beta, D(q||p) grows while h( min{ q + T/beta, delta_GV(R) } ) = h(delta_GV(R)) remains constant. Thus, it is enough to limit the range of the minimization to q in [p, delta_GV(R) - T/beta], in which case the minimization within the argument of h(.) becomes redundant. In summary,

e_1(R,T) = min_{q in [p, delta_GV(R) - T/beta]} { D(q||p) - h(q + T/beta) } + ln 2 - R.  (71)

VIII. PROOF OF THEOREM 2

The exponent e_2(R,T) is associated with the probability that y falls in R_k* for some k while the true message sent was m, m != k. Since the sets R_k*, k = 1, ..., e^{nR}, are disjoint,

Pr{E_2 | m, y} = Pr{ union_{k != m} intersection_{l != k} { e^{-nT} P(y|x_k) >= P(y|x_l) } | x_m, y }.  (72)

Now, fix k and consider the following chain of equalities (on the exponential scale):

Pr{ intersection_{l != k} { e^{-nT} P(y|x_k) >= P(y|x_l) } | x_m, y }
= Pr{ e^{-nT} P(y|x_k) >= max{ P(y|x_m), max_{l != k,m} P(y|x_l) } | x_m, y }
= Pr{ e^{-nT} P(y|x_k) >= P(y|x_m), intersection_{l != k,m} { P(y|x_l) <= e^{-nT} P(y|x_k) } | x_m, y },  (73)

where, as mentioned, x_m is the true transmitted codeword. Consider now the random selection of the codebook, where the random codewords are denoted by capital letters. Then, with Theta ranging over the finitely many values induced by the joint types,

Pr{E_2} =. sum_{y in Y^n} P(y) sum_{k != m} sum_Theta Pr{ e^{-nT} P(y|X_k) = e^{-n Theta} | y } Pr{ P(y|X_m) <= e^{-n Theta} | y } Pr{ intersection_{l != k,m} { P(y|X_l) <= e^{-n Theta} } | y }
= sum_{y in Y^n} P(y) sum_{k != m} sum_Theta Pr{A | y} Pr{B | y} Pr{C | y}.  (74)

Now,

Pr{A | y} = sum_{x : ln P(y|x) = n(T - Theta)} P(x)
=. exp{ n max_{Q_{X|Y} : E_Q ln P(Y|X) = T - Theta} [ H_Q(X|Y) + E_Q ln P(X) ] }
= exp{ -n min_{Q_{X|Y} : E_Q ln P(Y|X) = T - Theta} [ E_Q ln(1/P(X)) - H_Q(X|Y) ] }
= e^{ -n E_A(hat-P_y, Theta) }.  (75)

Similarly, letting $P(x|y) = \prod_{i=1}^n P(x_i|y_i)$ denote the posterior induced by $P(x,y) = \prod_{i=1}^n P(x_i)P(y_i|x_i)$, we have:

    \Pr\{B|y\} = \sum_{x:\,\ln P(y|x) \le n\Theta} P(x|y)
               \doteq \exp\left\{n \max_{X'Y':\, E\ln P(Y'|X') \le \Theta} \left[H(X'|Y') + E\ln P(X'|Y')\right]\right\}
               = \exp\left\{-n \min_{X'Y':\, E\ln P(Y'|X') \le \Theta} \left[E\ln\frac{1}{P(X'|Y')} - H(X'|Y')\right]\right\}
               \triangleq e^{-n E_B(\hat{P}_y,\Theta)}.                                            (76)

As for $\Pr\{C|y\}$, we proceed as follows:

    \Pr\{C|y\} = \Pr\left\{\sum_{X'Y'} N_y(X'Y')\, e^{n E\ln P(Y'|X')} \le e^{-n\Theta} \,\Big|\, y\right\}
               \doteq \Pr\left\{\max_{X'Y'} N_y(X'Y')\, e^{n E\ln P(Y'|X')} \le e^{-n\Theta} \,\Big|\, y\right\}
               = \Pr\left\{\bigcap_{X'Y'} \left\{N_y(X'Y') \le e^{n[E\ln(1/P(Y'|X')) - \Theta]}\right\} \,\Big|\, y\right\}.      (77)

Now, if $\Theta$ is such that there exists $X'Y' \in \mathcal{G}_R(\hat{P}_y)$ for which the measure $\hat{P}_y \times P_{X'|Y'}$ satisfies $E\ln[1/P(Y'|X')] - \Theta < R + H(X'|Y') + E\ln P(X')$, or equivalently, $\Theta > \min_{X'Y' \in \mathcal{G}_R(\hat{P}_y)} [E\ln(1/P(X',Y')) - H(X'|Y')] - R$, then this event alone is responsible for a double-exponential decay of $\Pr\{C|y\}$, let alone the intersection over all $X'Y'$. Thus, in this range of large $\Theta$, the contribution is double-exponentially small. Conversely, in the complementary case, namely, if $\Theta$ is such that for all $X'Y' \in \mathcal{G}_R(\hat{P}_y)$ the measure $\hat{P}_y \times P_{X'|Y'}$ satisfies $E\ln[1/P(Y'|X')] - \Theta > R + H(X'|Y') + E\ln P(X')$, or equivalently, $\Theta < \min_{X'Y' \in \mathcal{G}_R(\hat{P}_y)} [E\ln(1/P(X',Y')) - H(X'|Y')] - R$, then $\Pr\{C|y\}$ is close to unity, because its complement is upper bounded by $\sum_{X'Y'} \Pr\{N_y(X'Y') > e^{n[E\ln(1/P(Y'|X')) - \Theta]}\}$, where the summation is over polynomially many terms that all decay at least exponentially rapidly (at least exponentially for $X'Y' \in \mathcal{G}_R^c(\hat{P}_y)$ and at least double-exponentially for $X'Y' \in \mathcal{G}_R(\hat{P}_y)$). Thus, $\Pr\{C|y\}$ can be approximated exponentially tightly by an indicator of the event

    \Theta < \Theta_0(\hat{P}_y) \triangleq \min_{X'Y' \in \mathcal{G}_R(\hat{P}_y)} \left[E\ln\frac{1}{P(X',Y')} - H(X'|Y')\right] - R.

Putting it all together, we then have:

    \Pr\{E_2\} \doteq \sum_{y} P(y) \sum_{k \ne m} \sum_{\Theta \le \Theta_0(\hat{P}_y)} \exp\{-n[E_A(\hat{P}_y,\Theta) + E_B(\hat{P}_y,\Theta)]\}
               = \exp\left\{-n \min_{Y'} \left[D(Y'\|P_Y) + \min_{\Theta} \left(E_A(Y',\Theta) + E_B(Y',\Theta)\right) - R\right]\right\},        (78)

where the inner minimization is over $\{\Theta:\ \Theta \le \Theta_0(Y')\}$.

IX. PROOF OF COROLLARY 3

First, we compute $E_A(Y',\Theta)$ and $E_B(Y',\Theta)$ for this case. Clearly, we have $E\ln[1/P(X')] = \ln 2$ and $E\ln P(Y'|X') = -\Pr\{X' \ne Y'\}\beta + \ln(1-p)$; therefore,

    E_A(Y',\Theta) = \min_{X'Y':\, E\ln P(Y'|X') = T - \Theta} \left[E\ln\frac{1}{P(X')} - H(X'|Y')\right]
                   = \min_{X'Y':\, -\Pr\{X' \ne Y'\}\beta + \ln(1-p) = T - \Theta} \left[\ln 2 - H(X'|Y')\right]
                   = \ln 2 - h\left(\frac{\ln(1-p) + \Theta - T}{\beta}\right),                   (79)

where the last step follows from (66): the expression is minimized by choosing $X'Y'$ as in (69), with $q = \Pr\{X' \ne Y'\} = [\ln(1-p) + \Theta - T]/\beta$. As for $E_B(Y',\Theta)$, we similarly have

    E_B(Y',\Theta) = \min_{X'Y':\, E\ln P(Y'|X') \le \Theta} \left[E\ln\frac{1}{P(X'|Y')} - H(X'|Y')\right]
                   = \min_{q \ge q_m} \left[q\beta - \ln(1-p) - h(q)\right]
                   = \max\{p, q_m\}\beta - \ln(1-p) - h(\max\{p, q_m\}),                          (80)

where $q_m \triangleq [\ln(1-p) - \Theta]/\beta$ comes from the constraint $-q\beta + \ln(1-p) \le \Theta$. The last step of (80) holds because the unconstrained minimization is achieved at $q = p$ and the function $q\beta - h(q)$ is convex for $q \in (0,1)$. Next, we calculate $\Theta_0(Y')$ similarly:

    \Theta_0(Y') = \min_{X'Y' \in \mathcal{G}_R(Y')} \left[E\ln\frac{1}{P(X',Y')} - H(X'|Y')\right] - R
                 = \min_{q:\, \ln 2 - h(q) \le R} \left[q\beta - \ln(1-p) + \ln 2 - h(q)\right] - R
                 = \max\{p, \delta_{GV}(R)\}\beta - \ln(1-p) + \ln 2 - h(\max\{p, \delta_{GV}(R)\}) - R.        (81)

To conclude, we note that since all of the relevant expressions, $E_A(Y',\Theta)$, $E_B(Y',\Theta)$ and $\Theta_0(Y')$, do not actually depend on $Y'$, the optimal $Y'$ is $P_Y$, which makes the divergence vanish, and we thus obtain (35).
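As a numerical sanity check of the closed-form minimization in (80), one can compare it against a brute-force minimization over $q$. The sketch below is not part of the paper; the crossover probability p = 0.1 and all function names are illustrative assumptions.

```python
import math

def h(q):
    """Binary entropy in nats; h(0) = h(1) = 0."""
    if q <= 0.0 or q >= 1.0:
        return 0.0
    return -q * math.log(q) - (1.0 - q) * math.log(1.0 - q)

p = 0.1                              # assumed BSC crossover probability, p < 1/2
beta = math.log((1.0 - p) / p)       # beta = ln[(1-p)/p]

def E_B_closed(theta):
    """Closed form of (80): max{p, q_m}*beta - ln(1-p) - h(max{p, q_m}),
    with q_m = [ln(1-p) - theta]/beta."""
    q_m = (math.log(1.0 - p) - theta) / beta
    q = max(p, q_m)
    return q * beta - math.log(1.0 - p) - h(q)

def E_B_brute(theta, grid=200000):
    """Direct minimization of q*beta - ln(1-p) - h(q) over q >= q_m."""
    q_m = (math.log(1.0 - p) - theta) / beta
    lo = max(0.0, q_m)
    return min(
        (lo + (1.0 - lo) * i / grid) * beta
        - math.log(1.0 - p)
        - h(lo + (1.0 - lo) * i / grid)
        for i in range(grid + 1)
    )

# The two expressions agree for thetas on both sides of the kink q_m = p.
for theta in (-1.2, -0.8, -0.4, -0.2):
    assert abs(E_B_closed(theta) - E_B_brute(theta)) < 1e-6
```

At $\Theta = \ln(1-p)$ we have $q_m = 0$, the unconstrained minimizer $q = p$ is feasible, and $E_B$ vanishes, consistent with the convexity argument after (80).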

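The Gilbert-Varshamov-type minimization defining $\Theta_0$ in the proof of Corollary 3 above can be checked the same way. The following sketch is not from the paper; p = 0.1, R = 0.3 nats, and the bisection solver are illustrative assumptions.

```python
import math

def h(q):
    """Binary entropy in nats; h(0) = h(1) = 0."""
    if q <= 0.0 or q >= 1.0:
        return 0.0
    return -q * math.log(q) - (1.0 - q) * math.log(1.0 - q)

p = 0.1                                  # assumed BSC crossover probability, p < 1/2
R = 0.3                                  # assumed rate in nats, 0 < R < ln 2
beta = math.log((1.0 - p) / p)

def delta_GV(R):
    """Gilbert-Varshamov distance: root of ln 2 - h(q) = R in [0, 1/2],
    found by bisection (ln 2 - h is decreasing on this interval)."""
    lo, hi = 0.0, 0.5
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if math.log(2.0) - h(mid) > R:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def theta0_closed(R):
    """Closed form with the minimizer q* = max{p, delta_GV(R)}."""
    q = max(p, delta_GV(R))
    return q * beta - math.log(1.0 - p) + math.log(2.0) - h(q) - R

def theta0_brute(R, grid=200000):
    """Direct minimization over the constraint set {q : ln 2 - h(q) <= R},
    which is the interval [delta_GV(R), 1 - delta_GV(R)]."""
    d = delta_GV(R)
    best = min(
        (d + (1.0 - 2.0 * d) * i / grid) * beta
        - math.log(1.0 - p) + math.log(2.0)
        - h(d + (1.0 - 2.0 * d) * i / grid)
        for i in range(grid + 1)
    )
    return best - R

assert abs(theta0_closed(R) - theta0_brute(R)) < 1e-6
```

For these parameters $\delta_{GV}(R) > p$, so the constrained minimizer sits at the boundary of the constraint set, matching the $\max\{p, \delta_{GV}(R)\}$ form.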
APPENDIX

To prove the lemma, we shall use the lower bound on the divergence given in [2, eq. (27)]:

    D(a\|b) \ge a \ln\frac{a}{b} - a + b.                                                        (82)

Using the Chernoff bound, for $\Omega < R + H(X'|Y') + E\ln P(X')$ we get

    \Pr\left\{N_y(X'Y') \le e^{n\Omega}\right\}
        \le \exp\left\{-e^{nR}\, D\left(e^{-n(R-\Omega)} \,\Big\|\, e^{n[H(X'|Y') + E\ln P(X')]}\right)\right\}
        \le \exp\left\{e^{n\Omega}\left[n\left(H(X'|Y') + E\ln P(X') + R - \Omega\right) + 1\right] - e^{n[H(X'|Y') + E\ln P(X') + R]}\right\}.        (83)

Now, it is evident that for $X'Y' \in \mathcal{G}_R$, the negative term $-e^{n[H(X'|Y') + E\ln P(X') + R]}$ has an exponentially growing exponent, whereas the positive term $e^{n\Omega}[n(H(X'|Y') + E\ln P(X') + R - \Omega) + 1]$ has exponent $\Omega$, which is smaller than $H(X'|Y') + E\ln P(X') + R$ and is therefore negligible; hence this probability vanishes super-exponentially.

REFERENCES

[1] G. D. Forney, Jr., "Exponential error bounds for erasure, list, and decision feedback schemes," IEEE Trans. Inform. Theory, vol. IT-14, no. 2, pp. 206-220, Mar. 1968.
[2] I. Csiszár and J. Körner, Information Theory: Coding Theorems for Discrete Memoryless Systems. Academic Press, 1981.
[3] E. Telatar, "Multi-access communications with decision feedback decoding," Ph.D. dissertation, MIT, Cambridge, MA, May 1992.
[4] R. Ahlswede, N. Cai, and Z. Zhang, "Erasure, list, and detection zero-error capacities for low noise and a relation to identification," IEEE Trans. Inform. Theory, vol. 42, no. 1, pp. 55-62, Jan. 1996.
[5] T. Hashimoto, "Composite scheme LR+Th for decoding with erasures and its effective equivalence to Forney's rule," IEEE Trans. Inform. Theory, vol. 45, no. 1, pp. 78-93, Jan. 1999.
[6] T. Hashimoto and M. Taguchi, "Performance and explicit error detection and threshold decision in decoding with erasures," IEEE Trans. Inform. Theory, vol. 43, no. 5, pp. 1650-1655, Sep. 1997.
[7] P. Kumar, Y.-H. Nam, and H. El Gamal, "On the error exponents of ARQ channels with deadlines," IEEE Trans. Inform. Theory, vol. 53, no. 11, pp. 4265-4273, Nov. 2007.
[8] N. Merhav and M. Feder, "Minimax universal decoding with an erasure option," IEEE Trans. Inform. Theory, vol. 53, no. 5, pp. 1664-1675, May 2007.
[9] A. J. Viterbi, "Error bounds for the white Gaussian and other very noisy memoryless channels with generalized decision regions," IEEE Trans. Inform. Theory, vol. IT-15, no. 2, pp. 279-287, Mar. 1969.
[10] M. Mézard and A. Montanari, Information, Physics, and Computation. Oxford University Press, 2009.
[11] N. Merhav, "Error exponents of erasure/list decoding revisited via moments of distance enumerators," IEEE Trans. Inform. Theory, vol. 54, no. 10, pp. 4439-4447, Oct. 2008.
[12] N. Merhav, "Relations between random coding exponents and the statistical physics of random codes," IEEE Trans. Inform. Theory, vol. 55, no. 1, pp. 83-92, Jan. 2009.
[13] N. Merhav, "The generalized random energy model and its application to the statistical physics of ensembles of hierarchical codes," IEEE Trans. Inform. Theory, vol. 55, no. 3, pp. 1250-1268, Mar. 2009.
[14] R. Etkin, N. Merhav, and E. Ordentlich, "Error exponents of optimum decoding for the interference channel," IEEE Trans. Inform. Theory, vol. 56, no. 1, pp. 40-56, Jan. 2010.
[15] Y. Kaspi and N. Merhav, "Error exponents for broadcast channels with degraded message sets," accepted to IEEE Trans. Inform. Theory, June 2010.
[16] R. G. Gallager, Information Theory and Reliable Communication. New York: Wiley, 1968.
[17] A. Somekh-Baruch and N. Merhav, "Achievable error exponents for the private fingerprinting game," IEEE Trans. Inform. Theory, vol. 53, no. 5, pp. 1827-1838, May 2007.
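Finally, the divergence lower bound (82) used in the appendix, $D(a\|b) \ge a\ln(a/b) - a + b$, can be verified numerically on a grid; it follows from $\ln x \le x - 1$ applied to the $(1-a)$ term of the binary divergence. This sketch is not part of the paper.

```python
import math

def D(a, b):
    """Binary KL divergence D(a||b) in nats, for a, b in (0, 1)."""
    return a * math.log(a / b) + (1.0 - a) * math.log((1.0 - a) / (1.0 - b))

def lower_bound(a, b):
    """Right-hand side of (82): a*ln(a/b) - a + b."""
    return a * math.log(a / b) - a + b

# Check the bound on a grid of interior points; a small tolerance
# absorbs floating-point rounding.
violations = 0
N = 200
for i in range(1, N):
    for j in range(1, N):
        a, b = i / N, j / N
        if D(a, b) < lower_bound(a, b) - 1e-12:
            violations += 1
assert violations == 0
```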