MEXT Grant-in-Aid for Scientific Research on Priority Areas "Statistical Mechanical Approach to Probabilistic Information Processing" Workshop on Mathematics of Statistical Inference (December 2004, Tohoku University, Sendai, Japan)

Statistical Mechanics of Multi-Terminal Data Compression: Theory and Practice

Tatsuto Murayama^1
NTT Communication Science Laboratories, Nippon Telegraph and Telephone Corporation, Kyoto 619-0237, Japan

This paper presents an efficient LDPC-based algorithm for multi-terminal data compression. In our scenario, a codeword sequence is calculated by multiplying a data sequence by a predetermined LDPC matrix, while the decoder generates a proper reproduction sequence from a codeword sequence using a message passing algorithm. The key result is the discovery that the LDPC coding technique can provide a suboptimal solution to the decoding problem, which often suffers from the curse of dimensionality. Our analysis shows that the achievable rate region is described by first-order phase transitions among several phases. The typical performance of our practical decoder is also well evaluated by the replica method.

1 Introduction

Data compression, or source coding, is a scheme to reduce the size of a message (data) in information representation. In his seminal paper [1], Shannon showed that for an information source represented by a distribution P(S) of an N-dimensional Boolean (binary) vector S, one can employ another representation in which the message length N is reduced to M (< N) without any distortion, if the code rate R = M/N satisfies R ≥ H_2(S) in the limit N, M → ∞. Here, H_2(S) = −(1/N) Tr_S P(S) log_2 P(S) represents the binary entropy per bit in the original representation S, indicating the optimal compression rate. Unfortunately, Shannon's theorem itself is non-constructive and does not provide explicit rules for devising the optimal codes.
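To make Shannon's bound concrete, the sketch below (our illustration, not part of the original text) computes the binary entropy per bit of a memoryless biased binary source; R ≥ H_2(S) then bounds the achievable compression rate.

```python
import math

def binary_entropy(p: float) -> float:
    """Entropy per bit (in bits) of a Bernoulli(p) binary source."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)

# A source emitting 1s with probability 0.1 can in principle be
# compressed to about 0.469 bits per input bit, i.e. any R >= H2(0.1).
print(f"optimal rate bound: {binary_entropy(0.1):.3f}")
```

For an unbiased source, H_2(1/2) = 1 and no lossless compression is possible, consistent with the bound above.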
Therefore, it is surprising that a practical code proposed by Lempel and Ziv in 1977 [2] asymptotically saturates Shannon's optimal compression limit in the case of point-to-point communication, when lossless compression is considered. However, it should be emphasized here that generalization of the Lempel-Ziv codes to advanced data compression suitable for a network is difficult, although the importance of networks is rapidly increasing with the recent development of the Internet. This is because all the practical codes that saturate Shannon's limit to date require complete knowledge of all source vectors coming into the communication network, while the compression should be carried out

^1 E-mail: murayama@cslab.kecl.ntt.co.jp
independently on each terminal in usual situations. Therefore, the quest for more efficient compression codes that are suitable for a network still remains one of the most important topics in information theory [3].

Figure 1: Slepian and Wolf system: A simple communication network introduced in the data compression theorem of Slepian and Wolf. Separate coding is assumed in the distributed system.

The purpose of this chapter is to apply recent advances in the research on error-correcting codes to this problem. More specifically, we will investigate the efficacy and the limitations of a linear compression scheme inspired by Gallager's error-correcting codes [4], which have been actively investigated in both the information theory and physics communities [5, 6], when it is applied to the data compression problem introduced by Slepian and Wolf in their research on network-based information theory [7]. Unlike the existing arguments in information theory, our approach based on statistical mechanics makes it possible not only to assess the theoretical bounds of the achievable performance but also to provide practical encoding/decoding methods that can be performed in linear time with respect to the data length.

2 General Scenario

Let us start by setting up the framework of the Slepian-Wolf coding-decoding problem [7]. In a general scenario, two correlated N-dimensional Boolean vectors ξ and η are independently compressed to M_1- and M_2-dimensional vectors u and v, respectively. These compressed data (or codewords) u and v are decoded to retrieve the original data simultaneously by a single decoder. A schematic representation of this system is shown in Figure 1. The codes used in this chapter are composed of randomly selected sparse matrices A and B of dimensionality M_1 × N and M_2 × N, respectively.
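The separate encoding step is just a sparse Boolean matrix-vector product. The sketch below is our own minimal illustration, not the author's code; for simplicity it places a fixed number of ones per row at random rather than enforcing the exact row/column-regular profile described in the text.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_sparse_matrix(M: int, N: int, K: int) -> np.ndarray:
    """M x N Boolean matrix with K ones per row (a simplified stand-in
    for the (K, C)-regular Gallager-style construction in the text)."""
    A = np.zeros((M, N), dtype=np.uint8)
    for mu in range(M):
        cols = rng.choice(N, size=K, replace=False)
        A[mu, cols] = 1
    return A

def encode(A: np.ndarray, data: np.ndarray) -> np.ndarray:
    """Codeword u = A data over GF(2): each codeword bit is a parity check."""
    return (A @ data) % 2

N, M, K = 20, 10, 4                       # rate R = M/N = 1/2
xi = rng.integers(0, 2, size=N).astype(np.uint8)
u = encode(random_sparse_matrix(M, N, K), xi)
print(u)                                  # M-bit compressed sequence
```

Each encoder needs only its own matrix, so the two terminals can run this step without communicating, exactly as the separate-coding assumption requires.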
These are constructed similarly to those of Gallager's error-correcting codes [4], characterized by K_1 and K_2 nonzero unit elements per row and C_1 and C_2 nonzero unit elements per column, respectively. The compression rates can be different between the two
terminals. Corresponding to matrices A and B, the rates are defined as R_1 = M_1/N = K_1/C_1 and R_2 = M_2/N = K_2/C_2, respectively. While both matrices are known to the decoder, each encoder only needs to know its own matrix; that is, encoding is carried out separately in this scheme as u = Aξ and v = Bη, where Boolean (mod-2) arithmetic is employed for the Boolean vectors. After receiving the codewords u and v, the pair of equations

u = AS,    v = Bτ    (1)

should be solved with respect to S and τ, which become the estimates of the original data ξ and η, respectively.

Figure 2: Achievable rate region: Code rates are classified into four categories according to whether the two compressed data are decodable or not. The parameter regime where both data are decodable without any distortion is termed the achievable rate region.

3 Statistical Mechanical Analysis

To facilitate the current investigation, we first map the problem to that of an Ising model with finite connectivity [8]. We employ the binary (+1, −1) representation of the dynamical variables S and τ and of the vectors u and v rather than the Boolean (0, 1) one; the vector u is generated by taking products of the relevant binary data bits, u_{i_1,i_2,...,i_{K_1}} = ξ_{i_1} ξ_{i_2} ··· ξ_{i_{K_1}}, where the indices i_1, i_2, ..., i_{K_1} correspond to the nonzero elements of A, producing a binary version of u, and similarly for v. Assuming the thermodynamic limit N, M_1, M_2 → ∞ while keeping the code rates R_1 = M_1/N and R_2 = M_2/N finite is quite natural, as communication to date generally requires transmitting large data, where finite-size corrections are likely to be negligible. To explore the system's capabilities we examine the partition
function

Z = Tr_{S,τ} P(S, τ) ∏_{⟨i_1,...,i_{K_1}⟩} { 1 + (1/2) A_{i_1,...,i_{K_1}} ( u_{i_1,...,i_{K_1}} S_{i_1} S_{i_2} ··· S_{i_{K_1}} − 1 ) }
    × ∏_{⟨i_1,...,i_{K_2}⟩} { 1 + (1/2) B_{i_1,...,i_{K_2}} ( v_{i_1,...,i_{K_2}} τ_{i_1} τ_{i_2} ··· τ_{i_{K_2}} − 1 ) }.    (2)

The tensor product A_{i_1,...,i_{K_1}} u_{i_1,...,i_{K_1}}, where u_{i_1,...,i_{K_1}} = ξ_{i_1} ξ_{i_2} ··· ξ_{i_{K_1}}, is the binary equivalent of Aξ. Elements of the sparse connectivity tensor A_{i_1,...,i_{K_1}} take the value 1 if the corresponding indices of data are chosen (i.e., if all corresponding indices of the matrix A are 1) and 0 otherwise; it has C_1 unit elements per i index, representing the system's degree of connectivity. Notice that if the product S_{i_1} S_{i_2} ··· S_{i_{K_1}} is in disagreement with the corresponding element u_{i_1,...,i_{K_1}}, which implies an error for the parity check, the corresponding contribution to the partition function Z vanishes. Similar arguments are valid for B_{i_1,...,i_{K_2}} and v_{i_1,...,i_{K_2}}. The probability P(S, τ) represents our prior knowledge of the data, including the correlation between the sources ξ and η. Note that the dynamical variables τ, introduced to estimate η, are irrelevant to the performance measure with respect to the other data ξ. Since the partition function Eq. (2) is invariant under the transformations

S_i → S_i ξ_i,    u_{i_1,...,i_{K_1}} → u_{i_1,...,i_{K_1}} ξ_{i_1} ξ_{i_2} ··· ξ_{i_{K_1}} = 1,
τ_i → τ_i η_i,    v_{i_1,...,i_{K_2}} → v_{i_1,...,i_{K_2}} η_{i_1} η_{i_2} ··· η_{i_{K_2}} = 1,    (3)

it is useful to decouple the correlations between the vectors S, τ and ξ, η. Rewriting Eq. (2) using this gauge, one obtains a similar expression apart from the first factor, which becomes P(Sξ, τη), where Sξ = (S_i ξ_i) and τη = (τ_i η_i) for i = 1, 2, ..., N. The random selection of elements in A and B introduces disorder to the system; we average the logarithm of the partition function Z(A, B, u, v) over the disorder and the statistical properties of both data, using the replica method [9].
In the calculation, a set of order parameters

q_{α,β,...,γ} = (1/N) Σ_{i=1}^{N} Z_i S_i^α S_i^β ··· S_i^γ,    (4)

r_{α,β,...,γ} = (1/N) Σ_{i=1}^{N} Y_i τ_i^α τ_i^β ··· τ_i^γ    (5)

arise, where α, β, ..., γ represent replica indices, and the variables Z_i and Y_i come from enforcing the restriction of C_1 and C_2 connections per index, respectively:

δ( Σ_{⟨i_2,...,i_{K_1}⟩} A_{i,i_2,...,i_{K_1}} − C_1 ) = ∮_0^{2π} (dZ/2π) Z^{ Σ_{⟨i_2,...,i_{K_1}⟩} A_{i,i_2,...,i_{K_1}} − (C_1+1) },    (6)

δ( Σ_{⟨i_2,...,i_{K_2}⟩} B_{i,i_2,...,i_{K_2}} − C_2 ) = ∮_0^{2π} (dY/2π) Y^{ Σ_{⟨i_2,...,i_{K_2}⟩} B_{i,i_2,...,i_{K_2}} − (C_2+1) }.

To proceed further, we have to make an assumption about the symmetry of the order parameters. The assumption made here, and validated later on, is that of replica symmetry in the following representation
of the order parameters and the related conjugate variables [6]:

q_{α,β,...,γ} = a_q ∫ dx π(x) x^l,    q̂_{α,β,...,γ} = a_{q̂} ∫ dx̂ π̂(x̂) x̂^l,
r_{α,β,...,γ} = a_r ∫ dy ρ(y) y^l,    r̂_{α,β,...,γ} = a_{r̂} ∫ dŷ ρ̂(ŷ) ŷ^l,    (7)

where l is the number of replica indices and the a's are normalization factors that make π(x), π̂(x̂), ρ(y) and ρ̂(ŷ) probability distributions. Unspecified integrals are carried out over the range [−1, +1]. Extremizing the averaged expression with respect to the probability distributions, we obtain the following free energy per spin:

F = −(1/N) ⟨ln Z⟩_{A,B,P}
  = −Extr_{π,π̂,ρ,ρ̂} { (C_1/K_1) ⟨ ln( (1 + ∏_{i=1}^{K_1} x_i)/2 ) ⟩_π + (C_2/K_2) ⟨ ln( (1 + ∏_{i=1}^{K_2} y_i)/2 ) ⟩_ρ
    − C_1 ⟨ ln( (1 + x x̂)/2 ) ⟩_{π,π̂} − C_2 ⟨ ln( (1 + y ŷ)/2 ) ⟩_{ρ,ρ̂}
    + (1/N) ⟨ ln Tr_{S,τ} ∏_{i=1}^{N} [ ∏_{μ=1}^{C_1} ( (1 + x̂_{μi} S_i)/2 ) ∏_{μ=1}^{C_2} ( (1 + ŷ_{μi} τ_i)/2 ) ] P(Sξ, τη) ⟩_{π̂,ρ̂,P} },    (8)

where the brackets with subscripts π and π̂ represent averages over the probability distributions π(x) and π̂(x̂) with respect to the variables denoted by x and x̂, with and without subscripts, respectively. Similar notations are also used for ρ and ρ̂. The bracket with subscript P denotes the average with respect to ξ and η following the data distribution P(ξ, η). Taking the functional derivative with respect to the distributions π, π̂, ρ and ρ̂, we obtain the following saddle point equations:

π(x) = (1/N) Σ_{i=1}^{N} ⟨ δ( x − tanh[ F_i(x̂_{μ j∈L_1(μ)/i}, ŷ_{μi}; ξ, η) ξ_i + Σ_{μ=1}^{C_1−1} tanh^{−1}(x̂_{μi}) ] ) ⟩_{π̂,ρ̂,P},
π̂(x̂) = ⟨ δ( x̂ − ∏_{i=1}^{K_1−1} x_i ) ⟩_π,    (9)

where the effective fields denoted by F_i with subscripts are implicitly defined as

e^{ F_i(x̂_{μ j∈L_1(μ)/i}, ŷ_{μi}; ξ, η) ξ_i S_i } / ( 2 cosh F_i(x̂_{μ j∈L_1(μ)/i}, ŷ_{μi}; ξ, η) )
  = Tr_{S/S_i,τ} [ ∏_{j≠i} ∏_{μ=1}^{C_1} ( (1 + x̂_{μj} S_j)/2 ) ∏_{j=1}^{N} ∏_{μ=1}^{C_2} ( (1 + ŷ_{μj} τ_j)/2 ) P(Sξ, τη) ]
  / Tr_{S,τ} [ ∏_{j=1}^{N} ∏_{μ=1}^{C_1} ( (1 + x̂_{μj} S_j)/2 ) ∏_{j=1}^{N} ∏_{μ=1}^{C_2} ( (1 + ŷ_{μj} τ_j)/2 ) P(Sξ, τη) ],    (10)

and similarly for ρ(y) and ρ̂(ŷ). Notice that the notation S/S_i represents the set of all dynamical variables S except S_i. On the other hand, L_1(μ) and L_2(μ) denote the set of all indices of nonzero components in the μth row of A and B, respectively. The notation L_1(μ)/i represents the set of all indices belonging to L_1(μ) except i, and the same is true for the others. After solving these equations, the expectation of the overlap can be evaluated as

m_1 = (1/N) Σ_{i=1}^{N} ⟨ ξ_i sign⟨S_i⟩ ⟩_{A,P} = ∫ dz φ(z) sign(z),    (11)
where ⟨·⟩ denotes thermal averages and

φ(z) = (1/N) Σ_{i=1}^{N} ⟨ δ( z − tanh[ F_i(x̂_{μ j∈L_1(μ)/i}, ŷ_{μi}; ξ, η) ξ_i + Σ_{μ=1}^{C_1} tanh^{−1} x̂_{μi} ] ) ⟩_{π̂,ρ̂,P},    (12)

and similarly for m_2, the overlap between η and its estimator.

4 Structure of Solutions

The performance of the current compression method can be measured by the vector m = (m_1, m_2). Hereafter, we use the term ferromagnetic to specify perfect retrieval, that is, m_1 = 1 (or m_2 = 1), while the term paramagnetic implies distortion, that is, m_1 < 1 (or m_2 < 1). For instance, a term such as ferromagnetic-paramagnetic phase denotes the phase characterized by the performance vector m ∈ {(m_1, m_2) | m_1 = 1, m_2 < 1}, and so on. One can show that the ferromagnetic-ferromagnetic state (FF): π(x) = δ(x − 1), π̂(x̂) = δ(x̂ − 1), ρ(y) = δ(y − 1) and ρ̂(ŷ) = δ(ŷ − 1) always satisfies Eq. (9). In addition, in the limit of C_1, C_2 → ∞, solutions describing the paramagnetic-paramagnetic state (PP): π(x) = δ(x), π̂(x̂) = δ(x̂), ρ(y) = δ(y) and ρ̂(ŷ) = δ(ŷ), the paramagnetic-ferromagnetic state (PF): π(x) = δ(x), π̂(x̂) = δ(x̂), ρ(y) = δ(y − 1) and ρ̂(ŷ) = δ(ŷ − 1), and the ferromagnetic-paramagnetic state (FP): π(x) = δ(x − 1), π̂(x̂) = δ(x̂ − 1), ρ(y) = δ(y) and ρ̂(ŷ) = δ(ŷ) are also analytically obtained for an arbitrary joint distribution P(ξ, η). The free energies corresponding to these solutions are obtained from Eq. (8) as

F_FF = −(1/N) Tr_{ξ,η} P(ξ, η) ln P(ξ, η),
F_PP = (R_1 + R_2) ln 2,
F_FP = R_2 ln 2 − (1/N) Tr_ξ P(ξ) ln P(ξ),
F_PF = R_1 ln 2 − (1/N) Tr_η P(η) ln P(η),    (13)

where the subscripts stand for the corresponding states and P(ξ) = Tr_η P(ξ, η) and P(η) = Tr_ξ P(ξ, η) represent the marginal distributions of the two source vectors ξ and η.

4.1 Case of Dense Matrix

Perfect decoding is theoretically possible if F_FF is the lowest among the above four. The corresponding parameter regime, termed the achievable rate region, is shown in Fig. 2 as the intersection of the inequalities

R_1 + R_2 ≥ H_2(ξ, η),    R_1 ≥ H_2(ξ|η),    R_2 ≥ H_2(η|ξ),    (14)

where

H_2(ξ, η) = −(1/N) Tr_{ξ,η} P(ξ, η) log_2 P(ξ, η),
H_2(ξ|η) = H_2(ξ, η) − H_2(η),    (15)
H_2(η|ξ) = H_2(ξ, η) − H_2(ξ).
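For a concrete memoryless case, the entropies in Eq. (15) reduce to per-bit entropies of a joint table p(ξ_i, η_i), and a rate pair (R_1, R_2) can be tested against the three inequalities of Eq. (14) directly. The sketch below is our own illustration under that memoryless assumption.

```python
import math

def entropy(probs):
    """Shannon entropy (bits) of a distribution given as probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def achievable(R1, R2, joint):
    """Check the Slepian-Wolf conditions of Eq. (14) for a memoryless
    binary source pair with per-bit joint distribution joint[xi][eta]."""
    H_joint = entropy([joint[a][b] for a in range(2) for b in range(2)])
    H_xi = entropy([sum(joint[a]) for a in range(2)])
    H_eta = entropy([joint[0][b] + joint[1][b] for b in range(2)])
    return (R1 + R2 >= H_joint and
            R1 >= H_joint - H_eta and    # R1 >= H(xi | eta)
            R2 >= H_joint - H_xi)        # R2 >= H(eta | xi)

# Two uniform bits that agree 90% of the time:
joint = [[0.45, 0.05], [0.05, 0.45]]
print(achievable(0.8, 0.8, joint))   # inside the region
print(achievable(0.5, 0.5, joint))   # outside: R1 + R2 < H(xi, eta)
```

Here each source alone needs rate 1, yet the correlation lets the pair share a total rate of H_2(ξ, η) ≈ 1.47 < 2, which is exactly the gain the achievable rate region expresses.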
It is worth noticing that this coincides with the achievable rate region saturated by optimal data compression in the current framework, as previously shown by Slepian and Wolf [7]. Namely, in the limit C_1, C_2 → ∞, the current compression codes provide the optimal performance for arbitrary information sources P(ξ, η).

4.2 Case of Sparse Matrix

For finite C_1 and C_2, the saddle point equations (9) can be solved numerically; but the properties of the system depend strongly on the source distribution P(ξ, η), which makes it difficult to go further without any assumption on the distribution. As a simple but non-trivial example, we will focus here on a component-wise correlated joint distribution

P(S, τ) = ∏_{i=1}^{N} (1 + m_1 S_i + m_2 τ_i + q S_i τ_i)/4,    (16)

where the set of parameters m_1, m_2, and q characterizes the data sources. To make Eq. (16) a distribution, these parameters must satisfy four inequalities: 1 + m_1 + m_2 + q ≥ 0, 1 − m_1 + m_2 − q ≥ 0, 1 + m_1 − m_2 − q ≥ 0 and 1 − m_1 − m_2 + q ≥ 0.

5 Decoding

Solving Eq. (1) rigorously for decoding is computationally hard in general cases. However, one can construct a practical decoding algorithm based on belief propagation (BP) [10, 5] or the Thouless-Anderson-Palmer (TAP) approach [11]. It has recently been shown that these two frameworks provide the same algorithm in the case of error-correcting codes [12], as mentioned in the previous chapter. This is also the case in the current context. For the distribution (16), the algorithm derived from the BP-based frameworks becomes

m^1_{μi} = ( a_{μi} + m_1 + m_2 a_{μi} b_i + q b_i ) / ( 1 + m_1 a_{μi} + m_2 b_i + q a_{μi} b_i ),
m^2_{μi} = ( b_{μi} + m_2 + m_1 a_i b_{μi} + q a_i ) / ( 1 + m_1 a_i + m_2 b_{μi} + q a_i b_{μi} ),
m̂^1_{μi} = u_μ ∏_{j∈L_1(μ)/i} m^1_{μj},    m̂^2_{μi} = v_μ ∏_{j∈L_2(μ)/i} m^2_{μj},    (17)

where we denote

a_{μi} ≡ tanh( Σ_{ν∈M_1(i)/μ} tanh^{−1} m̂^1_{νi} ),    a_i ≡ tanh( Σ_{μ∈M_1(i)} tanh^{−1} m̂^1_{μi} ),    (18)

and similarly for the b's. Here, M_1(i) and M_2(i) indicate the set of all indices of nonzero components in the ith column of the sparse matrices A and B, respectively.
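Eqs. (17)-(18), together with the posterior-mean formula that follows, translate directly into an iterative program. The following is our own minimal NumPy sketch (dense 0/1 matrices and a clipped arctanh for numerical safety), not the author's implementation; variable names follow the notation of the text.

```python
import numpy as np

def bp_decode(A, B, u, v, m1, m2, q, iters=50):
    """Sketch of the BP updates of Eqs. (17)-(19) for the component-wise
    correlated source of Eq. (16).
    A, B: 0/1 parity matrices; u, v: +-1 codeword vectors."""
    clip = lambda h: np.clip(h, -0.999999, 0.999999)
    hat1 = np.zeros(A.shape)              # check-to-variable messages of A
    hat2 = np.zeros(B.shape)              # check-to-variable messages of B
    for _ in range(iters):
        # Eq. (18): full fields a_i, b_i and cavity fields a_{mu i}, b_{mu i}.
        t1 = np.arctanh(clip(hat1)) * A
        t2 = np.arctanh(clip(hat2)) * B
        a_i, b_i = np.tanh(t1.sum(0)), np.tanh(t2.sum(0))
        a_mui = np.tanh(t1.sum(0)[None, :] - t1)      # exclude check mu
        b_mui = np.tanh(t2.sum(0)[None, :] - t2)
        # Eq. (17): variable-to-check messages.
        m1_mui = (a_mui + m1 + m2 * a_mui * b_i + q * b_i) / \
                 (1 + m1 * a_mui + m2 * b_i + q * a_mui * b_i)
        m2_mui = (b_mui + m2 + m1 * a_i * b_mui + q * a_i) / \
                 (1 + m1 * a_i + m2 * b_mui + q * a_i * b_mui)
        # Eq. (17): check-to-variable messages (row products excluding i).
        for mu, i in zip(*np.nonzero(A)):
            row = np.flatnonzero(A[mu])
            hat1[mu, i] = u[mu] * np.prod(m1_mui[mu, row[row != i]])
        for mu, i in zip(*np.nonzero(B)):
            row = np.flatnonzero(B[mu])
            hat2[mu, i] = v[mu] * np.prod(m2_mui[mu, row[row != i]])
    # Eq. (19): posterior means, then sign estimates of xi and eta.
    t1 = np.arctanh(clip(hat1)) * A
    t2 = np.arctanh(clip(hat2)) * B
    a_i, b_i = np.tanh(t1.sum(0)), np.tanh(t2.sum(0))
    den = 1 + m1 * a_i + m2 * b_i + q * a_i * b_i
    p1 = (a_i + m1 + m2 * a_i * b_i + q * b_i) / den
    p2 = (b_i + m2 + m1 * a_i * b_i + q * a_i) / den
    return np.sign(p1), np.sign(p2)

# Toy run: both sources all +1, codewords consistent, mildly biased prior.
A = np.array([[1, 1, 0], [0, 1, 1], [1, 0, 1]])
xi_hat, eta_hat = bp_decode(A, A, np.ones(3), np.ones(3),
                            m1=0.5, m2=0.5, q=0.25, iters=20)
print(xi_hat, eta_hat)
```

The dense message arrays keep the sketch short; for large N one would store messages on an edge list instead, so that each update costs time linear in the number of nonzero matrix entries.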
Equation (17) can be solved iteratively from appropriate initial conditions. After obtaining a solution, the approximate posterior means can be calculated for i = 1, 2, ..., N as

m^1_i = ⟨S_i⟩ = ( a_i + m_1 + m_2 a_i b_i + q b_i ) / ( 1 + m_1 a_i + m_2 b_i + q a_i b_i ),
m^2_i = ⟨τ_i⟩ = ( b_i + m_2 + m_1 a_i b_i + q a_i ) / ( 1 + m_1 a_i + m_2 b_i + q a_i b_i ),    (19)
which provide an approximation to the Bayes-optimal estimators as ξ̂_i = sign(m^1_i) and η̂_i = sign(m^2_i), respectively. In order to investigate the efficacy of the current method for finite C_1 and C_2, we have numerically solved Eqs. (9) and (17) for K_1 = K_2 = 6 and C_1 = C_2 = 3 (R_1 = R_2 = 1/2), the results of which are summarized in Fig. 3. Numerical results for the saddle point equation (9) were obtained by an iterative method using 10^4-10^5 bin models for each probability distribution; 10-10^2 updates were sufficient for convergence in most cases. Similarly to the case of C_1, C_2 → ∞, there can be four types of solutions corresponding to combinations of decoding success and failure on the two sources. The obtained phase diagram is quite similar to that for C_1, C_2 → ∞. This implies that the current compression code theoretically has a performance close to the optimal one saturated in the limit C_1, C_2 → ∞, although the choice of C_1 = C_2 = 3 is far from that limit. However, this does not directly mean that the suggested performance can be obtained in practice. Since the variables are updated locally in the BP-based decoding algorithm (17), it may become difficult to find the thermodynamically dominant state when suboptimal states appear that have large basins of attraction. This suggests that the practical performance for perfect decoding is determined by the spinodal points of the suboptimal states, similar to the case of channel coding [6]. To confirm this conjecture, we have numerically compared the practical limit of perfect decoding obtained by the BP-based decoding algorithm (17) with the spinodal points of the non-FF solutions. The two results exhibit an excellent consistency, supporting our conjecture. In the figure, the perfectly decodable region obtained by the BP-based algorithm for the m_1 = 0.7 case is indicated as the area surrounded by the spinodal points and the boundaries of the feasible region, 1 + 0.7 − m_2 − q = 0 and 1 + 0.7 + m_2 + q = 0.
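For completeness, the source model of Eq. (16) is straightforward to sample, which is what a numerical experiment of the kind just described requires; below is our own small sketch, including the four feasibility inequalities.

```python
import random

def feasible(m1, m2, q):
    """The four inequalities that make Eq. (16) a valid distribution."""
    return all(1 + s1 * m1 + s2 * m2 + s1 * s2 * q >= 0
               for s1 in (+1, -1) for s2 in (+1, -1))

def sample_pair(m1, m2, q, rng=random):
    """Draw one (S_i, tau_i) pair from the per-component law of Eq. (16)."""
    probs = {(s1, s2): (1 + m1 * s1 + m2 * s2 + q * s1 * s2) / 4
             for s1 in (+1, -1) for s2 in (+1, -1)}
    r, acc = rng.random(), 0.0
    for pair, p in probs.items():
        acc += p
        if r < acc:
            return pair
    return (-1, -1)

random.seed(1)
assert feasible(0.7, 0.0, 0.2)
pairs = [sample_pair(0.7, 0.0, 0.2) for _ in range(100000)]
print(sum(s for s, _ in pairs) / len(pairs))   # close to m1 = 0.7
```

Feeding such samples through the encoder and the BP decoder while sweeping (m_2, q) is the direct way to reproduce a practically-decodable boundary of the kind reported in the text.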
This looks narrow compared to the theoretical limit, which might give a negative impression of the practical utility of this code. Nevertheless, we still consider that the current method may be practically useful, because the amount of information that can be represented by parameters in the region is not as small as the area appears. Moreover, we cannot achieve retrieval using the time-sharing scheme in the shaded regions.

6 Conclusion

Today, almost all digital communications are based on networks, even if only point-to-point communications are considered. We have therefore investigated the problem of multi-terminal data compression, a typical topic in network-based communication schemes. Furthermore, we have selected the simplest model of data compression, the Slepian-Wolf system, to reveal its theoretical aspects. The system was introduced in the data compression theorem of Slepian and Wolf in 1973, which corresponds to the source coding theorem given by Shannon. We have derived the achievable rate region given in the data compression theorem by making use of linear compression codes when the dense matrix limit is considered. Although the result reveals only theoretical aspects, that is, infinite computational power is
assumed for decoding, the rediscovery of the data compression theorem appeared beautiful. Moreover, we have generalized the message passing decoding to the multi-terminal case and found that it works well in practice. The figure shows that our practical decoder outperforms the simple method using the time-sharing scheme.

Figure 3: Phase diagram for the K_1 = K_2 = 6 and C_1 = C_2 = 3 code in the case of the component-wise correlated information source (16). This figure shows that the feasible region in the m_2-q plane for m_1 = 0.7 is classified into three states. Phase boundaries obtained by numerical methods are indicated by markers with error bars (FF/PP and FF/PF) and markers (PF/PP). These are close to those for K_1 = K_2, C_1, C_2 → ∞ (curves and the vertical line). Practically decodable limits of the BP-based algorithm obtained for N = 10^4 systems are also marked; these are well evaluated by the spinodal points of the non-FF solutions (markers with error bars). Inset: The practical limits are represented by the sizes of transmitted information. The horizontal and vertical axes show the entropy of the second source and the joint entropy. The shaded regions indicate that we cannot achieve retrieval using the time-sharing scheme.

Acknowledgements

The author thanks Y. Kabashima and T. Ohira for valuable discussions. This research was partially supported by the Ministry of Education, Science, Sports and Culture, Grant-in-Aid for Young Scientists (B), 1576088, 2003.

References

[1] C. E. Shannon. A mathematical theory of communication. Bell Sys. Tech. J., 27:379-423 & 623-656, 1948.
[2] J. Ziv and A. Lempel. A universal algorithm for sequential data compression. IEEE Trans. Inf. Theory, IT-23:337-343, 1977.
[3] T. M. Cover and J. A. Thomas. Elements of Information Theory. Wiley, 1991.
[4] R. G. Gallager. Low-density parity-check codes. IRE Trans. Inf. Theory, IT-8:21-28, 1962.
[5] D. J. C. MacKay. Good error-correcting codes based on very sparse matrices. IEEE Trans. Inf. Theory, IT-45:399-431, 1999.
[6] T. Murayama, Y. Kabashima, D. Saad, and R. Vicente. Statistical physics of regular low-density parity-check error-correcting codes. Phys. Rev. E, 62:1577-1591, 2000.
[7] D. Slepian and J. K. Wolf. Noiseless coding of correlated information sources. IEEE Trans. Inf. Theory, IT-19:471-480, 1973.
[8] N. Sourlas. Spin-glass models as error-correcting codes. Nature, 339:693-695, 1989.
[9] K. Y. M. Wong and D. Sherrington. Graph bipartitioning and spin glasses on a random network of fixed finite valence. J. Phys. A, 20:L793-L799, 1987.
[10] J. Pearl. Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann, 1988.
[11] D. J. Thouless, P. W. Anderson, and R. G. Palmer. Solvable model of a spin glass. Philos. Mag., 35:593-601, 1977.
[12] Y. Kabashima and D. Saad. Belief propagation vs. TAP for decoding corrupted messages. Europhys. Lett., 44:668-674, 1998.