STRONG CONVERSE FOR THE GEL'FAND-PINSKER CHANNEL

Pierre Moulin
Beckman Inst., Coord. Sci. Lab and ECE Department
University of Illinois at Urbana-Champaign, USA

ABSTRACT

A strong converse for the Gel'fand-Pinsker channel is established in this paper. The method is then extended to a multiuser scenario: a strong converse is established for the multiple-access Gel'fand-Pinsker channel under the maximum error criterion, and the capacity region is determined.

1. INTRODUCTION

The Gel'fand-Pinsker (GP) channel [1] and its variants have attracted considerable interest in the information theory literature. Applications include coding in the presence of an interference known at the transmitter, and watermarking [2]. This paper derives a strong converse for the GP channel. In addition to strengthening the classical weak converse [1], the derivation provides new insights into the problem, in particular into the role of the auxiliary random variable. These insights are particularly useful for the multiple-access version of the GP channel, for which an outer rate region can be derived using a weak converse based on Fano's inequality, but that region does not coincide with the achievable region identified in [3]. We prove that in fact the maximum error probability tends to 1 for any rate pair outside the region of [3], thereby determining the capacity region of the multiple-access GP channel. The proof does not require the wringing methods of [4] that were used to prove the strong converse for the multiple-access channel without side information (SI) under the average error criterion. Also note that: (i) according to Ahlswede [5, 6], the maximum error criterion is more natural in multiuser communications than the average error criterion, because the latter guarantees a small error probability only if the users choose their messages with uniform probability; and (ii) capacity regions under the maximum error criterion are generally smaller than capacity regions under the average error criterion [6].
These capacity regions coincide if stochastic encoders are allowed [7, pp. 284-285]. This holds a fortiori if common randomness between encoders and receiver is allowed.

2. GEL'FAND-PINSKER CHANNEL

Consider the GP channel p(y|x,s) with input alphabet X, output alphabet Y, and channel state S ∈ S distributed according to a pmf p_S [1]. A message M drawn uniformly from M = {1, ..., 2^{NR}} is to be sent over the channel using a length-N code. The channel state sequence S = (S_1, ..., S_N) ∈ S^N is iid p_S, independent of M, and available to the encoder. The transmitted sequence is denoted by f_N(M, S) ∈ X^N and the decoding rule by g_N(Y) ∈ M. Gel'fand and Pinsker established the capacity formula

  C = max_{p_{XU|S}} [I(U; Y) − I(U; S)]    (1)

where U is an auxiliary random variable taking values in an alphabet U of cardinality |U| ≤ |S| |X| + 1, and U → (S, X) → Y forms a Markov chain. The maximum over p_{XU|S} is achieved by a deterministic pmf: p_{X|US} = 1{X = f(U, S)} for some function f : U × S → X. The direct part of the theorem was proven using a random binning technique, where an arbitrarily small number ɛ > 0 is chosen, codewords u(l, m), 1 ≤ l ≤ 2^{N[I(U;S)+ɛ]}, 1 ≤ m ≤ 2^{NR}, are drawn iid from the marginal p_U associated with the capacity-achieving distribution p_{XU|S}, and a virtual memoryless channel p_{Y|U} is created from U to Y. The transmitted sequence is given by x_t = f(u_t(m, s), s_t) for t = 1, ..., N. Decoding of the codeword u(l, m) selected by the encoder is successful if I(U; S) + R ≤ I(U; Y) − 2ɛ. The (weak) converse part of the theorem was proved using a telescoping formula. In this derivation, U = (M, S_{T+1}^N, Y^{T−1}, T) (where T is a time-sharing random variable) does not admit an obvious coding interpretation, since Y is not available at the encoder. In this paper, we establish a strong converse. The notion of a virtual channel p_{Y|U} appears clearly in this derivation, and the construction of U does not involve feedback from the decoder.
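As a concrete illustration of formula (1) (not part of the original development), the single-letter expression I(U;Y) − I(U;S) can be evaluated numerically. The sketch below uses assumed binary alphabets and a noiseless toy channel p(y|x,s) = 1{y = x}, and brute-forces a coarse grid of conditional pmfs p(u|s) together with all deterministic maps f : U × S → X. For this channel the maximum is exactly 1 bit, attained with U = X independent of S; the alphabet sizes and grid resolution are illustrative choices, not from the paper.

```python
import itertools
import numpy as np

def mutual_info(p_ab):
    """Mutual information in bits of a joint pmf given as a 2-D array."""
    pa = p_ab.sum(axis=1, keepdims=True)
    pb = p_ab.sum(axis=0, keepdims=True)
    nz = p_ab > 0
    return float((p_ab[nz] * np.log2(p_ab[nz] / (pa * pb)[nz])).sum())

def gp_objective(p_S, p_U_S, f, p_Y_XS):
    """I(U;Y) - I(U;S) under p_S(s) p(u|s) 1{x=f(u,s)} p(y|x,s).

    Conventions: p_U_S[u, s] = p(u|s); f[u][s] = x; p_Y_XS[y, x, s] = p(y|x,s)."""
    nU, nS = p_U_S.shape
    nY = p_Y_XS.shape[0]
    p_su = np.zeros((nS, nU))          # joint p(s, u)
    p_uy = np.zeros((nU, nY))          # joint p(u, y)
    for s in range(nS):
        for u in range(nU):
            p_su[s, u] = p_S[s] * p_U_S[u, s]
            p_uy[u] += p_su[s, u] * p_Y_XS[:, f[u][s], s]
    return mutual_info(p_uy) - mutual_info(p_su)

# Toy check: binary state/input/output, noiseless channel y = x, so the
# maximum of (1) is 1 bit (choose U = X independent of S).
p_S = np.array([0.5, 0.5])
p_Y_XS = np.zeros((2, 2, 2))
for x in range(2):
    for s in range(2):
        p_Y_XS[x, x, s] = 1.0

best = 0.0
grid = np.linspace(0.0, 1.0, 5)
for a in grid:
    for b in grid:                     # p(u=0|s=0) = a, p(u=0|s=1) = b
        p_U_S = np.array([[a, b], [1 - a, 1 - b]])
        for f in itertools.product([(0, 0), (0, 1), (1, 0), (1, 1)], repeat=2):
            best = max(best, gp_objective(p_S, p_U_S, f, p_Y_XS))
print(best)  # 1.0 for the noiseless channel
```

For channels that genuinely depend on s (e.g. binary dirty paper), a finer grid over p(u|s) would be needed to approach the maximum in (1); the code above only illustrates how the objective decomposes into the two mutual informations.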
The source and channel coding aspects of the problem arise from the tension between providing the decoder with information about S and about M, respectively, and are also apparent in the derivation. Our main result is stated below.

Theorem 2.1. Assume that H(S) > 0 and min_{y,x,s} p_{Y|XS}(y|x,s) > 0. For any sequence of length-N codes (f_N, g_N) with rate R > C, the average error probability P_e(f_N, g_N) for the GP channel tends to 1 as N → ∞.

Extended sketch of the proof. Assume without loss of generality (wlog) that min_s p_S(s) ≥ ɛ and min_{y,x,s} p_{Y|XS}(y|x,s) ≥ ɛ, for some arbitrarily small ɛ > 0.

Step 1. Assume wlog that the codewords are given by the following two-step procedure:
(i) an alphabet U of arbitrarily large cardinality, a function f : U × S → X, and a codebook with codewords u(m, s) ∈ U^N are defined;
(ii) each channel input symbol is obtained as x_t = f(u_t(m, s), s_t), 1 ≤ t ≤ N.
Since this construction contains the choice U = X, u(m, s) ≡ x(m, s), f(x, s) = x as a special case, there is no loss of generality in making the above assumption. Moreover, it can be shown that capacity is not reduced if, instead of S, a slightly degraded version of S is available to the encoder (the output of a rate-distortion code with vanishing Hamming distortion, for each m ∈ M), and so we may restrict our attention to codes that satisfy property (P1) below. Such codes may be thought of as including an elementary amount of binning since, for each m, many (albeit not necessarily exponentially many) sequences s map to the same codeword u(m, s).

Let d_H(s, s′) denote the Hamming distance between two sequences s and s′, and let Ω be the set of pairs (m, s) such that

  u(m, s) = u(m, s′),  ∀ s′ : d_H(s, s′) ≤ 1.    (2)

In other words, given m, arbitrarily changing any one sample s_t of the sequence s does not change the value of the codeword u(m, s). Denote by Σ(m) ≜ {s : (m, s) ∈ Ω} the sections of Ω along the m direction. Also denote by p_s the type of the sequence s (an empirical pmf over S), by T_ɛ = {s : max_{s∈S} |p_s(s) − p_S(s)| ≤ ɛ} the strong ɛ-typical set, and by Σ_ɛ(m) ≜ Σ(m) ∩ T_ɛ its intersection with Σ(m).

(P1) For each m ∈ M, the set Σ_ɛ(m) has probability

  P_S^N(Σ_ɛ(m)) ≥ 1 − o(1).    (3)

Step 2. Define the random variables U_t = u_t(M, S) ∈ U and Q_t = {S_j, j ≠ t} ∈ S^{N−1} (note that Q_t is independent of S_t) for 1 ≤ t ≤ N. The equivalence s = (s_t, q_t) holds for each t. Let T be a time-sharing random variable uniformly distributed over {1, 2, ..., N} and independent of all other random variables. Let S = S_T, Q = Q_T, X = X_T, U = U_T, and Y = Y_T. For each t, the joint pmf of (S, M, U_t, X_t, Y_t) is given by

  p(s, m, u_t, x_t, y_t) = p_S^N(s) p_M(m) 1{u_t = u_t(m, s)} 1{x_t = f(u_t, s_t)} p_{Y|XS}(y_t|x_t, s_t).

Hence the joint pmf of (T, S, Q, U, X, Y) is p_T p_S p_Q p_{U|SQT} 1{X = f(U, S)} p_{Y|XS}. Note that U → (S, X) → Y forms a Markov chain. The distribution of Y given (m, s) is the product pmf p_{Y|US}^N(y|u(m, s), s). Denote by D_m the decoding region for message m, i.e.,

  y ∈ D_m ⇔ g_N(y) = m,  m ∈ M.

The decoding regions form a partition of Y^N. The probability of correct decoding of message m ∈ M is given by

  P_c(f_N, g_N, m) = Pr[g_N(Y) = m] = Σ_{s∈S^N} p_S^N(s) Σ_{y∈D_m} p_{Y|US}^N(y|u(m, s), s).    (4)

The average probability of correct decoding is given by

  P_c(f_N, g_N) = 2^{−NR} Σ_{m∈M} P_c(f_N, g_N, m).

Step 3. For each m ∈ M and s ∈ S^N, denote by λ = λ(m, s) ∈ P_{U|S}^{[N]} the conditional type of u(m, s) given s (an empirical conditional pmf, implicitly dependent on (m, s)). The quadruple (p_S, λ, f, p_{Y|XS}) induces a joint pmf on S × U × X × Y:

  λ_{SUXY}(s, u, x, y) = p_S(s) λ(u|s) 1{x = f(u, s)} p_{Y|XS}(y|x, s).

Thus U → (X, S) → Y forms a Markov chain for each (λ, f). We denote by λ_Y, λ_{S|U}, etc., the various marginals and conditional marginals associated with λ_{SUXY}, and therefore induced by (λ, f). Consider the mutual informations

  I_λ(U; S) = Σ_{s,u} p_S(s) λ(u|s) log [λ_{S|U}(s|u) / p_S(s)],
  I_{λ,f}(U; Y) = Σ_{s,u,y} p_S(s) λ_{YU|S}(y, u|s) log [λ_{Y|U}(y|u) / λ_Y(y)],

which will be viewed as functions of λ and f. Also define the empirical conditional self-informations

  Î_λ(U; S) ≜ α(m, s) = (1/N) Σ_{t=1}^N log [λ_{S|U}(s_t|u_t(m, s)) / p_S(s_t)],    (5)
  Î_λ(U; Y) ≜ β̂(m, s, y) = (1/N) Σ_{t=1}^N log [λ_{Y|U}(y_t|u_t(m, s)) / λ_Y(y_t)],
  Ĭ_{λ,f}(U; Y) ≜ β(m, s) = (1/N) Σ_{t=1}^N Σ_{y_t∈Y} p_{Y|US}(y_t|u_t(m, s), s_t) log [λ_{Y|U}(y_t|u_t(m, s)) / λ_Y(y_t)] = E_{Y|M,S}[β̂(m, s, Y)].    (6)

These quantities do not coincide with I_λ(U; S) and I_{λ,f}(U; Y) because the type of s does not coincide with p_S in general.
However, for strongly typical s ∈ T_ɛ, we have

  |Î_λ(U; S) − I_λ(U; S)| ≤ ɛ log |S|,  |Ĭ_{λ,f}(U; Y) − I_{λ,f}(U; Y)| ≤ ɛ log |Y|.    (7)

Also, the following inequality holds for all (m, s) ∈ Ω and all 1 ≤ t ≤ N:

  max_{s_t∈S} α(m, s) ≤ min_{s_t∈S} α(m, s) + (2/N) log(2/ɛ),    (8)

i.e., changing the single sample s_t perturbs α(m, s) by at most (2/N) log(2/ɛ).

Step 4. Define the following subsets of Y^N, indexed by (m, s):

  B_ɛ(m, s) ≜ {y : β̂(m, s, y) ≤ β(m, s) + ɛ}.    (9)

For all m ∈ M and s ∈ S^N, the probability

  Pr[Y ∉ B_ɛ(m, s) | M = m, S = s] ≤ (log(2/ɛ))² / (N ɛ²)    (10)

vanishes as N → ∞. This follows from Chebyshev's inequality and the fact that the Y_t, 1 ≤ t ≤ N, are conditionally independent given (m, s).

Step 5. The probability of correct decoding for m may be upper-bounded as

  P_c(f_N, g_N, m) ≤ Pr[Y ∉ B_ɛ(m, S) | M = m] + Pr[S ∉ Σ_ɛ(m)] + P̃_c(f_N, g_N, m),    (11)

where the first two terms on the right side were upper-bounded in (10) and (3), respectively, and

  P̃_c(f_N, g_N, m) ≜ Pr[g_N(Y) = m, S ∈ Σ_ɛ(m), Y ∈ B_ɛ(m, S) | M = m]
    = Σ_{s∈Σ_ɛ(m)} Σ_{y∈D_m∩B_ɛ(m,s)} p_S^N(s) p_{Y|US}^N(y|u(m, s), s).    (12)

Define the disjoint events

  E(m, λ) ≜ {S ∈ Σ_ɛ(m), λ(m, S) = λ, Y ∈ D_m ∩ B_ɛ(m, S)}

for all m ∈ M and λ ∈ P_{U|S}^{[N]}, and write (12) as

  P̃_c(f_N, g_N, m) = Σ_λ Σ_{s∈S^N} Σ_{y∈Y^N} p_S^N(s) p_{Y|US}^N(y|u(m, s), s) 1{E(m, λ)}
  (a)= Σ_λ Σ_s Σ_y 2^{−Nα(m,s)} λ_{S|U}^N(s|u(m, s)) p_{Y|US}^N(y|u(m, s), s) 1{E(m, λ)}
  = Σ_λ Σ_s Σ_y 2^{−Nα(m,s)} ∏_{t=1}^N λ_{S|U}(s_t|u_t(m, s)) p_{Y|US}(y_t|u_t(m, s), s_t) 1{E(m, λ)}
  (b)≤ (16/ɛ⁴) Σ_λ Σ_s Σ_y 2^{−Nα(m,s)} ∏_{t=1}^N w(s_t|y_t) λ_{Y|U}(y_t|u_t(m, s)) 1{E(m, λ)}
  (c)≤ (16/ɛ⁴) Σ_λ Σ_s Σ_y 2^{N[β(m,s)−α(m,s)+ɛ]} w^N(s|y) λ_Y^N(y) 1{E(m, λ)}
  (d)≤ (16/ɛ⁴) Σ_λ Σ_s Σ_y 2^{N[I_{λ,f}(U;Y)−I_λ(U;S)+ɛ′]} w^N(s|y) λ_Y^N(y) 1{E(m, λ)},    (13)

where ɛ′ = ɛ log(2|S||Y|). Equality (a) follows from (5). In (b), the conditional pmf w(s|y) is arbitrary; there we have used property (P1), which implies that, given any m ∈ M, s ∈ Σ(m), and 1 ≤ t ≤ N, u_t(m, s) does not depend on s_t. Similarly, B_ɛ(m, s) does not depend on s_t; and by (8), α(m, s) is almost independent of s_t. Inequality (c) follows from (9) and (6), and inequality (d) from (7).
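The bound (10) is a plain Chebyshev bound on the empirical average of N bounded, conditionally independent log-likelihood-ratio terms. The toy Monte Carlo check below (with an assumed bounded three-point distribution; all numerical values are hypothetical, not from the paper) illustrates that the deviation probability is indeed dominated by B²/(Nɛ²) when each term is bounded in magnitude by B:

```python
import numpy as np

rng = np.random.default_rng(1)

# iid bounded "log-ratio" terms (toy values, |L_t| <= B)
vals = np.array([-1.0, 0.5, 2.0])
prob = np.array([0.2, 0.5, 0.3])
B, eps, N, trials = 2.0, 0.25, 400, 5000
mean = float(vals @ prob)

samples = rng.choice(vals, size=(trials, N), p=prob)
deviation = np.abs(samples.mean(axis=1) - mean)
empirical = float((deviation > eps).mean())
chebyshev = B * B / (N * eps * eps)   # Var(L_t) <= B^2 gives Pr <= B^2/(N eps^2)
print(empirical <= chebyshev)         # True
```

The Chebyshev bound 0.16 here is very loose (the empirical deviation frequency is essentially zero), which is all the proof needs: any bound vanishing in N suffices for (10).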
Averaging (13) over m ∈ M, we obtain

  P̃_c(f_N, g_N) ≜ 2^{−NR} Σ_{m∈M} P̃_c(f_N, g_N, m)
    ≤ (16/ɛ⁴) 2^{−NR} Σ_{m,λ} Σ_{s∈S^N} Σ_{y∈Y^N} 2^{N[I_{λ,f}(U;Y)−I_λ(U;S)+ɛ′]} w^N(s|y) λ_Y^N(y) 1{E(m, λ)}
    ≤ (16/ɛ⁴) sup_w max_{λ,f} 2^{−N[R−I_{λ,f}(U;Y)+I_λ(U;S)−ɛ′]} Σ_s Σ_y w^N(s|y) λ_Y^N(y) Σ_{m,λ} 1{E(m, λ)}
  (a)≤ (16/ɛ⁴) 2^{−N[R−max_{λ,f}(I_{λ,f}(U;Y)−I_λ(U;S))−ɛ′]} = (16/ɛ⁴) 2^{−N[R−C−ɛ′]},    (14)

where (a) holds because the events E(m, λ) are disjoint.

Step 6. Combining (3), (10), (11), and (14) yields

  P_c(f_N, g_N) ≤ o(1) + (log(2/ɛ))²/(N ɛ²) + (16/ɛ⁴) 2^{−N(R−C−ɛ′)}.    (15)

Hence P_c(f_N, g_N) vanishes for all sequences of codes (f_N, g_N) of rate R > C + ɛ′. Since this holds for arbitrarily small ɛ > 0, we conclude that P_c(f_N, g_N) vanishes for all R > C. This concludes the proof.

3. MULTIPLE-ACCESS GEL'FAND-PINSKER CHANNEL

Consider the multiple-access GP channel p(y|x_1, x_2, s) with alphabets S, X_1, X_2, Y and message sets {1, ..., 2^{NR_1}} and
{1, ..., 2^{NR_2}}. The channel state sequence S ∈ S^N is iid p_S and is known to both encoders. The encoders transmit sequences f_1(m_1, s) ∈ X_1^N and f_2(m_2, s) ∈ X_2^N respectively, where M_1 and M_2 are independent of S and are drawn uniformly and independently from their respective message sets. Given the channel output sequence y ∈ Y^N, the decoder outputs (m̂_1, m̂_2) = g_N(y). Somekh-Baruch and Merhav [3] have shown that the following rate region R is achievable. For a pmf P of the form p_S p_T p_{X_1V_1|ST} p_{X_2V_2|ST} p_{Y|X_1X_2S}, let R(L, P) be the region of rate pairs (R_1, R_2) that satisfy

  R_1 < I(V_1; Y | V_2, T) − I(V_1; S | V_2, T)
  R_2 < I(V_2; Y | V_1, T) − I(V_2; S | V_1, T)
  R_1 + R_2 < I(V_1, V_2; Y | T) − I(V_1, V_2; S | T)    (16)

where the alphabets of the auxiliary random variables V_1 and V_2 have cardinality L. Let R denote the closure of ∪_{L,P} R(L, P). Rate pairs in R(L, P) are achieved using a time-shared binning scheme with codeword arrays {u_1(l_1, m_1)} and {u_2(l_2, m_2)}. Each transmitter selects its row index so that the corresponding codeword is jointly typical with s.

Attempts to find an outer rate region for this problem by deriving a weak converse (based on Fano's inequality and the telescoping formula) have met only partial success. We have derived a rate region of the form (16), but the maximization is over a larger set of distributions, with the distribution of (X_1, V_1, X_2, V_2) given (S, T) given by p_{X_1|ST} p_{X_2|ST} p_{V_1V_2|X_1X_2ST} instead of p_{X_1V_1|ST} p_{X_2V_2|ST}. (See [9] for a related problem, and [8] for a similar mismatch in the case of SI causally available to the encoders.) Apparently the resulting outer region is strictly larger than the inner region R of (16). However, using a strong converse, we have established the following result.

Theorem 3.1. Assume that H(S) > 0 and min_{y,x_1,x_2,s} p_{Y|X_1X_2S}(y|x_1, x_2, s) > 0. For any sequence of length-N codes with rate pair (R_1, R_2) ∉ R, the maximum error probability (over all pairs of messages (m_1, m_2)) tends to 1 as N → ∞. Furthermore, in the definition of R, it suffices to consider conditional pmfs p_{X_iV_i|ST} of the form p_{V_i|ST} 1{X_i = f_i(V_i, S)}, where f_i is a mapping from V_i × S to X_i, for each i = 1, 2.

Sketch of the proof. The proof extends the methods of Sec. 2; however, our derivation does not make use of types (presumably wringing methods would have to be used to show that certain correlated types have low probability). Define the decoding regions D(m_1, m_2), which form a partition of Y^N. As in Step 1 of the proof of Theorem 2.1, assume the codewords are given by the following two-step procedure:
(i) for i = 1, 2, an alphabet V_i of arbitrarily large cardinality, a function f_i : V_i × S → X_i, and a codebook with codewords v_i(m_i, s) ∈ V_i^N are defined;
(ii) each channel input symbol is obtained as x_{it} = f_i(v_{it}(m_i, s), s_t), i = 1, 2, 1 ≤ t ≤ N.
Observe that the channel from (v_1, v_2, s) to y is time-invariant and memoryless, where

  p_{Y|V_1V_2S}(y|v_1, v_2, s) = p_{Y|X_1X_2S}(y | f_1(v_1, s), f_2(v_2, s), s).

Define the random variables V_{it} = v_{it}(M_i, S) ∈ V_i for i = 1, 2, and Q_t = {S_j, j ≠ t} ∈ S^{N−1} (again, Q_t is independent of S_t) for 1 ≤ t ≤ N. The equivalence s = (s_t, q_t) holds for each t. Let T be a time-sharing random variable uniformly distributed over {1, 2, ..., N} and independent of all other random variables. Let S = S_T, Q = Q_T, X_1 = X_{1T}, V_1 = V_{1T}, X_2 = X_{2T}, V_2 = V_{2T}, and Y = Y_T. For each t, the joint pmf of (S, M_1, M_2, V_{1t}, X_{1t}, V_{2t}, X_{2t}, Y_t) is given by

  p(s, m_1, m_2, v_{1t}, x_{1t}, v_{2t}, x_{2t}, y_t) = p_S^N(s) p_{M_1}(m_1) p_{M_2}(m_2) (∏_{i=1,2} 1{v_{it} = v_{it}(m_i, s)} 1{x_{it} = f_i(v_{it}, s_t)}) p_{Y|X_1X_2S}(y_t | x_{1t}, x_{2t}, s_t).

Hence the joint pmf of (T, S, Q, V_1, V_2, X_1, X_2, Y) is given by

  p_T p_S p_Q p_{V_1|SQT} p_{V_2|SQT} 1{X_1 = f_1(V_1, S)} 1{X_2 = f_2(V_2, S)} p_{Y|X_1X_2S}.

The probability of correct decoding for (m_1, m_2) is given by

  P_c(f_1, f_2, g, m_1, m_2) = Σ_{s∈S^N} p_S^N(s) Σ_{y∈D(m_1,m_2)} p_{Y|V_1V_2S}^N(y | v_1(m_1, s), v_2(m_2, s), s).

Under the maximal error criterion we have

  P_c(f_1, f_2, g, m_1, m_2) ≥ 1 − δ,  ∀ m_1, m_2,

where δ is the maximum error probability. Denote by Ω the set of triples (m_1, m_2, s) such that

  v_i(m_i, s) = v_i(m_i, s′),  i = 1, 2,  ∀ s′ : d_H(s, s′) ≤ 1.
As in (2), for (m_1, m_2, s) ∈ Ω, arbitrarily changing any one sample s_t of the sequence s changes the value of neither codeword v_1(m_1, s) nor v_2(m_2, s). Denote by

  Σ(m_1, m_2) ≜ {s : (m_1, m_2, s) ∈ Ω},  M(s) ≜ {(m_1, m_2) : (m_1, m_2, s) ∈ Ω}

the sections of Ω along the (m_1, m_2) and s directions, respectively. Choose an arbitrarily small ɛ > 0. Similarly to (P1), without loss of optimality we may restrict our attention to codes that satisfy the following property.

(P2) For each (m_1, m_2), the set Σ(m_1, m_2) has probability

  P_S^N(Σ(m_1, m_2)) ≥ 1 − o(1).    (17)

Hence the sets

  Σ_ɛ ≜ {s : |M(s)| ≥ (1 − ɛ) 2^{N(R_1+R_2)}},  Σ_ɛ(m_1, m_2) ≜ Σ(m_1, m_2) ∩ Σ_ɛ

have probabilities P_S^N(·) ≥ 1 − o(1).
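The membership condition defining Ω can be phrased operationally: every single-letter change of the state sequence must leave both codewords unchanged. A minimal sketch of such a check, using hypothetical toy codeword maps rather than the paper's binning construction:

```python
def in_omega(codewords, messages, s, alphabet):
    """Return True iff every codeword map in `codewords` is unchanged when
    any one coordinate of the state sequence s is altered (Hamming radius 1)."""
    base = [cw(m, s) for cw, m in zip(codewords, messages)]
    for t in range(len(s)):
        for a in alphabet:
            if a == s[t]:
                continue
            s_mod = s[:t] + (a,) + s[t + 1:]
            if [cw(m, s_mod) for cw, m in zip(codewords, messages)] != base:
                return False
    return True

# Hypothetical toy maps over binary states: one ignores s entirely (robust),
# the other depends on the first state sample (fragile).
robust = lambda m, s: ("u", m)
fragile = lambda m, s: ("u", m, s[0])

s = (0, 1, 1, 0)
print(in_omega([robust, robust], [3, 7], s, (0, 1)))    # True
print(in_omega([robust, fragile], [3, 7], s, (0, 1)))   # False
```

A realistic binning code depends on s only through coarse statistics, so it can land in Ω for most state sequences without being constant in s, which is exactly what (P2) asks of the restricted codes.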
Step 2. Define three empirical conditional self-informations: for j = 3,

  α^{(3)}(m_1, m_2, s) ≜ (1/N) Σ_{t=1}^N Î(V_{1t}, V_{2t}; S_t | Q_t = q_t)
    = (1/N) Σ_{t=1}^N Σ_{s_t∈S} p_S(s_t) log [p_{S_t|V_{1t}V_{2t}Q_t}(s_t | v_{1t}(m_1, s), v_{2t}(m_2, s), q_t) / p_{S_t|Q_t}(s_t|q_t)],

and for j = 1, 2, α^{(j)}(m_1, m_2, s) is defined similarly, using the log ratios p_{S_t|V_{1t}V_{2t}Q_t} / p_{S_t|V_{2t}Q_t} (for j = 1) and p_{S_t|V_{1t}V_{2t}Q_t} / p_{S_t|V_{1t}Q_t} (for j = 2). For any sequence s ∈ Σ(m_1, m_2), 1 ≤ t ≤ N, and j = 1, 2, 3, the quantity α^{(j)}(m_1, m_2, s) does not depend on s_t. We also define three conditional self-informations for the channel output: for j = 3,

  β^{(3)}(m_1, m_2, s) ≜ (1/N) Σ_{t=1}^N Î(V_{1t}, V_{2t}; Y_t | Q_t = q_t)
    = (1/N) Σ_{t=1}^N Σ_{y_t∈Y} p_{Y|V_1V_2S}(y_t | v_{1t}(m_1, s), v_{2t}(m_2, s), s_t) log [p_{Y_t|V_{1t}V_{2t}Q_t}(y_t | v_{1t}(m_1, s), v_{2t}(m_2, s), q_t) / p_{Y_t|Q_t}(y_t|q_t)],

and for j = 1, 2, β^{(j)}(m_1, m_2, s) is defined similarly, using the log ratios p_{Y_t|V_{1t}V_{2t}Q_t} / p_{Y_t|V_{2t}Q_t} (for j = 1) and p_{Y_t|V_{1t}V_{2t}Q_t} / p_{Y_t|V_{1t}Q_t} (for j = 2). Define

  R_{1t}(q_t) = I(V_{1t}; Y_t | V_{2t}, Q_t = q_t) − I(V_{1t}; S_t | V_{2t}, Q_t = q_t),
  R_{2t}(q_t) = I(V_{2t}; Y_t | V_{1t}, Q_t = q_t) − I(V_{2t}; S_t | V_{1t}, Q_t = q_t),
  R_{3t}(q_t) = I(V_{1t}, V_{2t}; Y_t | Q_t = q_t) − I(V_{1t}, V_{2t}; S_t | Q_t = q_t).

It may be shown that, for j = 1, 2, 3 and s ∈ Σ_ɛ,

  (1 − ɛ) E[β^{(j)}(M_1, M_2, s) − α^{(j)}(M_1, M_2, s)] ≤ (1/N) Σ_{t=1}^N R_{jt}(q_t) + ɛ.    (18)

Similarly to (9), three high-probability subsets B_ɛ^{(j)}(m_1, m_2, s) of Y^N are defined, as well as three conditional reference pmfs: r^{(1)}(y|v_2, s), r^{(2)}(y|v_1, s), and r^{(3)}(y|s) ≜ ∏_{t=1}^N p_{Y_t|Q_t}(y_t|q_t).

Step 3. Three upper bounds are evaluated for the probability of correct decoding for (m_1, m_2): we have 1 − δ ≤ P_c(f_1, f_2, g, m_1, m_2) and, for N large enough,

  1 − δ − ɛ ≤ P_c^{(j)}(f_1, f_2, g, m_1, m_2),  j = 1, 2, 3,

where

  P_c^{(j)}(f_1, f_2, g, m_1, m_2) ≜ Pr[g(Y) = (m_1, m_2), S ∈ Σ_ɛ(m_1, m_2), Y ∈ B_ɛ^{(j)}(m_1, m_2, S) | M_1 = m_1, M_2 = m_2].    (19)

Step 4. The case j = 3 yields an upper bound on the sum rate R_3 ≜ R_1 + R_2. Define the good event

  E^{(3)}(m_1, m_2) ≜ {S ∈ Σ_ɛ(m_1, m_2), Y ∈ D(m_1, m_2) ∩ B_ɛ^{(3)}(m_1, m_2, S)}.

Analogously to (13), we derive

  1 − δ − ɛ ≤ P_c^{(3)}(f_1, f_2, g, m_1, m_2) ≤ Σ_{s∈S^N} Σ_{y∈Y^N} 2^{N[β^{(3)}(m_1,m_2,s)−α^{(3)}(m_1,m_2,s)+ɛ]} p_S^N(s) r^{(3)}(y|s) 1{E^{(3)}(m_1, m_2)}.    (20)

Now, since the average value of β^{(3)}(m_1, m_2, s) − α^{(3)}(m_1, m_2, s) over (m_1, m_2) satisfies (18) for all s ∈ Σ_ɛ, there must exist a large set Γ_3(s) of pairs (m_1, m_2) for which

  β^{(3)}(m_1, m_2, s) − α^{(3)}(m_1, m_2, s) ≤ E[· | s] + ɛ ≤ (1/N) Σ_{t=1}^N R_{3t}(q_t) + 2ɛ.

It is easily shown that |Γ_3(s)| ≥ ɛ″ 2^{NR_3}, where ɛ″ ≜ ɛ(1 − δ − ɛ)/(ɛ + log |Y|). Averaging (20) over (m_1, m_2) ∈ Γ_3(s), we obtain

  R_3 ≤ max_{\{q_t\}} (1/N) Σ_{t=1}^N R_{3t}(q_t) + 2ɛ + (1/N) log(1/ɛ″).    (21)

In the cases j = 1 and j = 2, the decoder uses a helper who reveals one message and the corresponding codeword v_i(m_i, s) (but not s). This leads to the same inequality as (21), with R_j and R_{jt} in place of R_3 and R_{3t}, respectively.

Step 5. Let W ≜ (T, S, V_1, V_2, X_1, X_2, Y) and define R_j(p_{WQ}) ≜ (1/N) Σ_{t=1}^N Σ_{q_t} p_{Q_t}(q_t) R_{jt}(q_t) for j = 1, 2, 3. We obtain

  R_j ≤ max_{p_Q} R_j(p_{W|Q} p_Q) + 2ɛ + (1/N) log(1/ɛ″),  j = 1, 2, 3.    (22)

Taking ɛ → 0, we conclude that, for (22) to hold for all 0 ≤ δ < 1, there must exist a pmf P = p_{WQJ} of the form p_S p_{QJT} p_{V_1|SQJT} p_{V_2|SQJT} 1{X_1 = f_1(V_1, S)} 1{X_2 = f_2(V_2, S)} p_{Y|X_1X_2S} such that (16) holds with (Q, J, T) in place of T.

4. REFERENCES

[1] S. I. Gel'fand and M. S. Pinsker, "Coding for Channel with Random Parameters," Probl. Contr. Inform. Theory, Vol. 9, No. 1, pp. 19-31, 1980.
[2] G. Keshet, Y. Steinberg, and N. Merhav, "Channel Coding in the Presence of Side Information: Subject Review," Foundations and Trends in Communications and Information Theory, 2007.
[3] A. Somekh-Baruch and N. Merhav, "On the Random Coding Error Exponents of the Single-User and the Multiple-Access Gel'fand-Pinsker Channels," Proc. ISIT, p. 448, Chicago, IL, June-July 2004.
[4] R. Ahlswede, "An Elementary Proof of the Strong Converse Theorem for the Multiple-Access Channel," J. Combinatorics, Information and System Sciences, Vol. 7, No. 3, pp. 216-230, 1982.
[5] R. Ahlswede, "On Two-Way Communication Channels and a Problem by Zarankiewicz," 6th Prague Conf. on Information Theory, Statistical Decision Functions, and Random Processes, Prague, 1971.
[6] G. Dueck, "Maximal Error Capacity Regions Are Smaller than Average Error Capacity Regions for Multi-User Channels," Problems of Control and Information Theory, Vol.
7, No. 1, pp. 11-19, 1978.
[7] I. Csiszár and J. Körner, Information Theory: Coding Theorems for Discrete Memoryless Systems, Academic Press, NY, 1981.
[8] S. Sigurjónsson and Y.-H. Kim, "On Multiple User Channels with State Information at the Transmitters," Proc. ISIT 2005.
[9] Y. Wang and P. Moulin, "Blind Fingerprinting," arXiv:0803.0265v1 [cs.IT], March 2008.