Convex Optimization Methods for Computing Channel Capacity

Abhishek Sinha
Laboratory for Information and Decision Systems (LIDS), MIT
sinhaa@mit.edu

May 15, 2014

We consider a classical computational problem from Information Theory, namely, numerically determining the Shannon capacity of a given discrete memoryless channel. We formulate the problem as a convex optimization problem and review a classical algorithm, the Blahut-Arimoto (BA) algorithm [1], that exploits the particular structure of the problem. This algorithm is an example of an alternating minimization algorithm with a guaranteed convergence rate of $\Theta(1/k)$. Moreover, if the optimal solution is unique, the algorithm achieves an exponential rate of convergence. We then review some recent advances made on this problem using methods of convex optimization. First, we review [2], where the authors present two related algorithms, based on natural gradient and proximal point methods respectively, that are potentially faster than the original Blahut-Arimoto algorithm. Next, we review [4], which considers the problem from a dual perspective and presents a dual algorithm that is shown to be a geometric program. We then critically evaluate the relative performance of these methods on specific problems. Finally, we present some directions for further research on this interesting problem.

1 Introduction

Claude Shannon's 1948 paper [5] marked the beginning of the mathematical study of information and of its reliable transmission over a noisy communication channel, a field now known as Information Theory. In that paper, through some ingenious mathematical arguments, he showed that information can be transmitted reliably over a noisy communication channel whenever the rate of transmission is less than the channel capacity, a fundamental quantity determined by the statistical description of the channel.
In particular, the paper shows the startling fact that the presence of noise in a communication channel limits only the rate of communication, and not the probability of error in information transmission. In the simplest case of a discrete memoryless channel (DMC), the channel capacity is expressed as a convex program with the input probability distribution as the optimization variables. Although this program can be solved explicitly in some special cases, no closed-form formula is known for arbitrary DMCs. Hence one needs to resort to convex optimization algorithms to evaluate the channel capacity of an arbitrary DMC. In this expository article, we discuss an elegant iterative algorithm obtained by Suguru Arimoto, presented in IEEE Transactions on Information Theory in 1972. We strive to provide complete proofs of the key results, starting from first principles.

Figure 1: A communication channel

2 Preliminary Definitions and Results

In this section, we define some standard information-theoretic functionals that will be used extensively in the rest of the paper. All random variables discussed in this paper are assumed to take values in a finite set (i.e., they are discrete) with strictly positive probabilities, and all logarithms are taken with respect to base 2, unless specified otherwise.

Definition. The entropy $H(X)$ of a random variable $X$ taking values in a finite alphabet $\mathcal{X}$ with probability mass function $p_X(x)$ is defined as

$$H(X) = \mathbb{E}(-\log p_X(X)) = -\sum_{x \in \mathcal{X}} p_X(x) \log p_X(x) \triangleq H(p_X). \quad (1)$$

Note that $H(X)$ depends only on the probability measure of the random variable $X$ and not on the particular values that $X$ takes.

Definition. The relative entropy $D(p_X \| q_X)$ of two PMFs $p_X(\cdot)$ and $q_X(\cdot)$ (with $q_X(x) > 0$ for all $x \in \mathcal{X}$) supported on the same alphabet $\mathcal{X}$ is defined as

$$D(p_X \| q_X) = \sum_{x \in \mathcal{X}} p_X(x) \log \frac{p_X(x)}{q_X(x)}. \quad (2)$$

Lemma 2.1. For any two distributions $p$ and $q$ with the same support, we have

$$D(p \| q) \ge 0, \quad (3)$$

with equality holding iff $p = q$.
Proof. Although this result can be proved directly using Jensen's inequality, we opt to give an elementary proof here. The fundamental inequality that we use is

$$\exp(x) \ge 1 + x, \quad \forall x \in \mathbb{R}, \quad (4)$$

with equality holding iff $x = 0$; the proof of this inequality follows from simple calculus. Taking the natural logarithm of both sides of the above inequality, we conclude that

$$\ln(x) \le x - 1, \quad \forall x > 0, \quad (5)$$

with equality holding iff $x = 1$. Now we write

$$-D(p \| q) = \sum_i p_i \log \frac{q_i}{p_i} \le \sum_i p_i \left( \frac{q_i}{p_i} - 1 \right) = \sum_i q_i - \sum_i p_i = 1 - 1 = 0, \quad (6)$$

where the inequality follows from Eqn. (5). Hence we have

$$D(p \| q) \ge 0, \quad (7)$$

where equality holds iff equality holds in Eqn. (6), i.e., iff $p = q$.

Definition. The mutual information $I(X; Y)$ between two random variables $X$ and $Y$, taking values in the alphabets $\mathcal{X}$ and $\mathcal{Y}$ with joint distribution $p_{XY}(\cdot,\cdot)$ and marginal distributions $p_X(\cdot)$ and $p_Y(\cdot)$ respectively, is defined as follows:

$$I(X; Y) = D(p_{XY}(\cdot,\cdot) \| p_X(\cdot)\, p_Y(\cdot)) = \sum_{(x,y) \in \mathcal{X} \times \mathcal{Y}} p_{XY}(x, y) \log \frac{p_{XY}(x, y)}{p_X(x)\, p_Y(y)}. \quad (8)$$

Writing $p_{XY}(x, y)$ as $p_X(x)\, p_{Y|X}(y|x)$, the above quantity may be rewritten as

$$I(X; Y) = \sum_{(x,y) \in \mathcal{X} \times \mathcal{Y}} p_X(x)\, p_{Y|X}(y|x) \log \frac{p_{Y|X}(y|x)}{p_Y(y)} = \sum_{(x,y) \in \mathcal{X} \times \mathcal{Y}} p_X(x)\, p_{Y|X}(y|x) \log \frac{p_{Y|X}(y|x)}{\sum_{z} p_X(z)\, p_{Y|X}(y|z)}. \quad (9)$$
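The functionals defined above translate directly into code. The sketch below (illustrative names; logarithms base 2 and strictly positive PMFs, as assumed in the text) computes the entropy (1) and the relative entropy (2), and spot-checks Lemma 2.1 numerically:

```python
import numpy as np

def entropy(p):
    """H(p) = -sum_x p(x) log2 p(x), Eqn. (1)."""
    p = np.asarray(p, dtype=float)
    return float(-np.sum(p * np.log2(p)))

def kl_divergence(p, q):
    """D(p || q) = sum_x p(x) log2(p(x)/q(x)), Eqn. (2); requires q > 0."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(p * np.log2(p / q)))

p = np.array([0.5, 0.5])
q = np.array([0.9, 0.1])
print(entropy(p))               # 1.0 bit for a fair coin
print(kl_divergence(p, p))      # 0.0: the equality case of Lemma 2.1
print(kl_divergence(p, q) > 0)  # True: Lemma 2.1 with p != q
```

Note that $D(p \| q)$ is not symmetric: $D(p \| q)$ and $D(q \| p)$ generally differ, although both are non-negative by Lemma 2.1.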
Definition. A discrete memoryless channel (DMC) [6], denoted by $(\mathcal{X}, p_{Y|X}(y|x), \mathcal{Y})$, consists of two finite sets $\mathcal{X}$ and $\mathcal{Y}$ and a collection of probability mass functions $\{p_{Y|X}(\cdot|x), x \in \mathcal{X}\}$, with the interpretation that $X$ is the input and $Y$ is the output of the channel.

The capacity $C$ of the DMC is defined as the maximum possible rate of information transmission with arbitrarily small probability of error. Shannon established the following fundamental result in his seminal paper [5].

Theorem 2.2 (The Noisy Channel Coding Theorem).

$$C = \max_{p_X} I(X; Y). \quad (10)$$

In the rest of this article, we discuss algorithms that solve the optimization problem (10) for a given DMC.

3 Some Convexity Results

In this section we establish the convexity of the optimization problem (10), starting from first principles. To simplify notation, we relabel the input symbols as $[1 \ldots N]$ and the output symbols as $[1 \ldots M]$, where $N = |\mathcal{X}|$ and $M = |\mathcal{Y}|$. We denote the $1 \times N$ input probability vector by $p$, the $N \times M$ channel matrix by $Q$, and the $1 \times M$ output probability vector by $q$. Then, by the laws of probability, we have

$$q = pQ. \quad (11)$$

Hence the objective function $I(X; Y)$ can be rewritten as

$$I(X; Y) = I(p, Q) = \sum_{i=1}^{N} \sum_{j=1}^{M} p_i Q_{ij} \log \frac{Q_{ij}}{q_j} = \sum_{i=1}^{N} \sum_{j=1}^{M} p_i Q_{ij} \log Q_{ij} - \sum_{j=1}^{M} q_j \log q_j, \quad (12)$$

where we have used Eqn. (11) in the last step.

Lemma 3.1. For a fixed channel matrix $Q$, $I(p, Q)$ is concave in the input probability distribution $p$, and hence problem (10) has an optimal solution.

Proof. We first establish that the function $f(x) = x \log x$, $x \ge 0$, is convex in $x$: just note that $f''(x) > 0$ for all $x > 0$. Hence the function $f(q) = \sum_{j=1}^{M} q_j \log q_j$ is convex in $q$. Now, from Eqn. (11), $q$ is a linear transformation of the input probability vector $p$. Hence the second term on the right of Eqn. (12), $-\sum_j q_j \log q_j$, viewed as a function of $p$, is concave in $p$. Since the first term is linear in $p$, the result follows.
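Eqn. (12) can be evaluated numerically. As a sanity check (a minimal sketch with illustrative names), for a binary symmetric channel with crossover probability $\varepsilon$ and uniform input, $I(p, Q) = 1 - H_2(\varepsilon)$, where $H_2$ is the binary entropy function:

```python
import numpy as np

def mutual_information(p, Q):
    """I(p, Q) = sum_ij p_i Q_ij log2(Q_ij / q_j) with q = pQ, Eqn. (12)."""
    p = np.asarray(p, dtype=float)
    Q = np.asarray(Q, dtype=float)
    q = p @ Q                                        # Eqn. (11)
    # mask zero channel entries, which contribute 0 to the sum
    safe = np.where(Q > 0, Q, 1.0)
    terms = np.where(Q > 0, Q * np.log2(safe / q), 0.0)
    return float(np.sum(p[:, None] * terms))

eps = 0.1
Q = np.array([[1 - eps, eps],
              [eps, 1 - eps]])                       # binary symmetric channel
p = np.array([0.5, 0.5])
H2 = -(eps * np.log2(eps) + (1 - eps) * np.log2(1 - eps))
print(abs(mutual_information(p, Q) - (1 - H2)) < 1e-12)  # True
```

A deterministic input (all mass on one symbol) gives $I(p, Q) = 0$, since the output distribution then coincides with the corresponding row of $Q$.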
Since the constraint set in the optimization problem (10) is the probability simplex $\{p : p\mathbf{1} = 1,\; p_i \ge 0\}$, the above lemma establishes that (10) is the problem of maximizing a concave function over a convex constraint set. We record this fact in the following theorem.

Theorem 3.2. The optimization problem given in (10) is convex.

4 A Variational Characterization of Mutual Information

In this section we express the mutual information $I(X; Y)$ as the value of a variational problem. This will lead us directly to an alternating maximization algorithm for solving the optimization problem (10). Let us denote the set of all conditional distributions on the input alphabet $\mathcal{X}$, indexed by the output alphabet $\mathcal{Y}$, by $\Phi = \{\phi(\cdot|j), j \in \mathcal{Y}\}$. For any $\phi \in \Phi$, define the quantity $\tilde{I}(p, Q; \phi)$ as follows:

$$\tilde{I}(p, Q; \phi) = \sum_{i,j} p_i Q_{ij} \log \frac{\phi(i|j)}{p_i}. \quad (13)$$

The concavity of $\tilde{I}(p, Q; \phi)$ with respect to $p$ and $\phi$ is readily apparent.

Lemma 4.1. For fixed $p$ and $Q$, $\tilde{I}(p, Q; \phi)$ is concave in $\phi$. Similarly, for fixed $\phi$ and $Q$, $\tilde{I}(p, Q; \phi)$ is concave in $p$.

Proof. This follows from the concavity of the functions $\log(x)$ and $x \log \frac{1}{x}$ and the definition of $\tilde{I}(p, Q; \phi)$.

Clearly, from the defining Eqn. (12), it follows that for the particular choice $\phi^*(i|j) = \frac{p_i Q_{ij}}{\sum_{k=1}^{N} p_k Q_{kj}}$, we have

$$\tilde{I}(p, Q; \phi^*) = I(p, Q). \quad (14)$$

The following lemma shows that $\phi^*$ maximizes $\tilde{I}(p, Q; \cdot)$.

Lemma 4.2. For any matrix of conditional probabilities $\phi$, we have

$$\tilde{I}(p, Q; \phi) \le I(p, Q). \quad (15)$$

Proof. We have

$$I(p, Q) - \tilde{I}(p, Q; \phi) = \sum_{i,j} p_i Q_{ij} \log \frac{Q_{ij}}{q_j} - \sum_{i,j} p_i Q_{ij} \log \frac{\phi(i|j)}{p_i} \quad (16)$$
$$= \sum_{j} q_j \sum_{i} \frac{p_i Q_{ij}}{q_j} \log \frac{p_i Q_{ij}/q_j}{\phi(i|j)} \quad (17)$$
$$= \sum_{j} q_j \sum_{i} \phi^*(i|j) \log \frac{\phi^*(i|j)}{\phi(i|j)}. \quad (18)$$
Define $r(i|j) = p_i Q_{ij}/q_j$, which can be interpreted as the a posteriori input probability distribution given that the output variable takes the value $j$. Then we can write the above equation as follows:

$$I(p, Q) - \tilde{I}(p, Q; \phi) = \sum_{j=1}^{M} q_j\, D(\phi^*(\cdot|j) \| \phi(\cdot|j)), \quad (19)$$

which is non-negative, and equality holds iff $\phi(i|j) = \phi^*(i|j)$ for all $i, j$, by virtue of Lemma 2.1.

Combining the above results, we have the following variational characterization of mutual information.

Theorem 4.3. For any input distribution $p$ and any channel matrix $Q$, we have

$$I(p, Q) = \max_{\phi \in \Phi} \tilde{I}(p, Q; \phi), \quad (20)$$

and the conditional probability matrix that achieves the maximum is given by

$$\phi(i|j) = \phi^*(i|j) = \frac{p_i Q_{ij}}{\sum_{k=1}^{N} p_k Q_{kj}}. \quad (21)$$

Based on the above theorem, we can recast the optimization problem (10) as follows:

$$C = \max_{p} \max_{\phi \in \Phi} \tilde{I}(p, Q; \phi). \quad (22)$$

Since the channel matrix $Q$ is fixed, we can view the optimization problem (22) as optimizing over two different sets of variables, $p$ and $\phi$. One natural iterative approach to solve the problem is to fix one set of variables, optimize over the other, and vice versa. This method is especially attractive when closed-form solutions for both maximizations are available. As we will see in the following theorem, this is precisely the case here. This is in essence the Blahut-Arimoto (BA) algorithm for obtaining the capacity of a discrete memoryless channel [1].

5 The Geometric Idea of Alternating Optimization

We consider the following problem: given two convex sets $A$ and $B$ in $\mathbb{R}^n$, as shown in Figure 2, we wish to determine the minimum distance between them. More precisely, we wish to determine

$$d_{\min} = \min_{a \in A,\, b \in B} d(a, b), \quad (23)$$

where $d(a, b)$ is the Euclidean distance between $a$ and $b$. An obvious algorithm to do this would be to take any point $x \in A$ and find the $y \in B$ that is closest to it. Then fix this $y$ and find the point in $A$ closest to it. Repeating this process, it is clear that the distance is non-increasing at each stage. But it is not obvious whether the algorithm
converges to the optimal solution. However, we will see that if the sets are sets of probability distributions and the distance measure is the relative entropy, then the algorithm does converge to the minimum relative entropy between the two sets.

Figure 2: Alternating minimization

To use the above idea of alternating optimization in problem (22), it is advantageous to have closed-form expressions for the solutions of both partial optimizations. The theorem below indicates that this is indeed possible and gives both solutions in closed form.

Theorem 5.1. For fixed $p$ and $Q$, we have

$$\arg\max_{\phi \in \Phi} \tilde{I}(p, Q; \phi) = \phi^*, \quad (24)$$

where

$$\phi^*(i|j) = \frac{p_i Q_{ij}}{\sum_{k=1}^{N} p_k Q_{kj}}. \quad (25)$$

And for fixed $\phi$ and $Q$, we have

$$\arg\max_{p} \tilde{I}(p, Q; \phi) = p^*, \quad (26)$$

where the components of $p^*$ are given by

$$p^*(i) = \frac{r_i}{\sum_k r_k}, \quad (27)$$

and the maximum value is given by

$$\max_{p} \tilde{I}(p, Q; \phi) = \log \Big( \sum_i r_i \Big), \quad (28)$$

where

$$r_i = \exp\Big( \sum_j Q_{ij} \log \phi(i|j) \Big). \quad (29)$$
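Before turning to the proof, the closed-form maximizers (25) and (27)-(29) can be checked numerically on a small example. The sketch below uses illustrative names, and since all logarithms are base 2, the exp/log pair in (29) is read in base 2 as well:

```python
import numpy as np
rng = np.random.default_rng(0)

def I_tilde(p, Q, phi):
    """I~(p, Q; phi) = sum_ij p_i Q_ij log2(phi(i|j) / p_i), Eqn. (13)."""
    return float(np.sum(p[:, None] * Q * np.log2(phi / p[:, None])))

p = np.array([0.3, 0.7])
Q = np.array([[0.80, 0.20],
              [0.25, 0.75]])

# Eqn. (25): phi* attains I(p, Q), the maximum over phi (Theorem 4.3)
q = p @ Q
phi_star = (p[:, None] * Q) / q          # columns of phi* sum to 1
I = float(np.sum(p[:, None] * Q * np.log2(Q / q)))
print(abs(I_tilde(p, Q, phi_star) - I) < 1e-12)                  # True

# Eqns. (27)-(29): for fixed phi, p* = r / sum(r) attains log2(sum_i r_i)
phi = rng.random((2, 2))
phi /= phi.sum(axis=0)                   # arbitrary member of Phi
r = np.exp2(np.sum(Q * np.log2(phi), axis=1))    # Eqn. (29), base 2
p_star = r / r.sum()                     # Eqn. (27)
print(abs(I_tilde(p_star, Q, phi) - np.log2(r.sum())) < 1e-10)   # True
print(I_tilde(p, Q, phi) <= np.log2(r.sum()) + 1e-12)            # True
```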
Proof. The first part of the theorem was already proved as part of Theorem 4.3; we prove here the second part only. As with any constrained optimization problem with equality constraints, a straightforward way to approach the problem is the method of Lagrange multipliers. A more elegant way, however, is to use Lemma 2.1 in a clever way to tightly upper-bound the objective function and then to find an optimal input distribution which achieves the bound. We take this approach here.

Consider the input distribution $p^*$ such that $p^*(i) = D\, r(i)$, $i \in \mathcal{X}$, where

$$\log r(i) = \sum_j Q_{ij} \log \phi(i|j), \quad (30)$$

and $D$ is the normalization constant, i.e., $D = \big(\sum_{i \in \mathcal{X}} r(i)\big)^{-1}$. From Lemma 2.1, we have

$$D(p \| p^*) \ge 0, \quad (31)$$

i.e.,

$$\sum_i p_i \log p_i \ge \sum_i p_i \log p^*(i) = \log D + \sum_{i,j} p_i Q_{ij} \log \phi(i|j). \quad (32)$$

Rearranging the above equation, we have

$$\tilde{I}(p, Q; \phi) = \sum_{i,j} p_i Q_{ij} \log \frac{\phi(i|j)}{p_i} \le -\log D, \quad (33)$$

with equality holding iff $p = p^*$. Clearly the optimal value is given by $-\log D = \log \sum_i r_i$.

Equipped with Theorem 5.1, we are now ready to describe the Blahut-Arimoto (BA) algorithm formally.

Step 1: Initialize $p^{(1)}$ to the uniform distribution over $\mathcal{X}$, i.e., $p_i^{(1)} = \frac{1}{|\mathcal{X}|}$ for all $i \in \mathcal{X}$. Set $t$ to 1.

Step 2: Find $\phi^{(t+1)}$ as follows:

$$\phi^{(t+1)}(i|j) = \frac{p_i^{(t)} Q_{ij}}{\sum_k p_k^{(t)} Q_{kj}}, \quad \forall i, j. \quad (34)$$

Step 3: Update $p^{(t+1)}$ as follows:

$$p_i^{(t+1)} = \frac{r_i^{(t+1)}}{\sum_{k \in \mathcal{X}} r_k^{(t+1)}}, \quad (35)$$
where

$$r_i^{(t+1)} = \exp\Big( \sum_j Q_{ij} \log \phi^{(t+1)}(i|j) \Big). \quad (36)$$

Step 4: Set $t \leftarrow t + 1$ and go to Step 2.

We can combine Step 2 and Step 3 as follows. Denote the output distribution induced by the input distribution $p^{(t)}$ by $q^{(t)}$, i.e., $q^{(t)} = p^{(t)} Q$. Hence from Eqn. (34) we have

$$\phi^{(t+1)}(i|j) = \frac{p_i^{(t)} Q_{ij}}{q^{(t)}(j)}. \quad (37)$$

We now evaluate the term inside the exponent of Eqn. (36) as follows:

$$\sum_j Q_{ij} \log \phi^{(t+1)}(i|j) = \sum_j Q_{ij} \log \frac{p_i^{(t)} Q_{ij}}{q^{(t)}(j)} = D(Q_i \| q^{(t)}) + \log p_i^{(t)}, \quad (38)$$

where $Q_i$ denotes the $i$-th row of the channel matrix $Q$. Hence from Eqn. (36) we have

$$r_i^{(t+1)} = p_i^{(t)} \exp\big( D(Q_i \| q^{(t)}) \big). \quad (39)$$

Thus the above algorithm has the following simplified description.

Simplified Blahut-Arimoto Algorithm

Step 1: Initialize $p^{(1)}$ to the uniform distribution over $\mathcal{X}$, i.e., $p_i^{(1)} = \frac{1}{|\mathcal{X}|}$ for all $i \in \mathcal{X}$. Set $t$ to 1.

Step 2: Repeat until convergence:

$$q^{(t)} = p^{(t)} Q, \quad (40)$$
$$p_i^{(t+1)} = \frac{p_i^{(t)} \exp\big( D(Q_i \| q^{(t)}) \big)}{\sum_k p_k^{(t)} \exp\big( D(Q_k \| q^{(t)}) \big)}, \quad i \in \mathcal{X}. \quad (41)$$

6 Proximal Point Reformulation: Accelerated Blahut-Arimoto Algorithm

In this section we re-examine the alternating maximization procedure of the Blahut-Arimoto algorithm [2]. Plugging in the optimal solution $\phi^t$ from the first optimization
to the second optimization, we have

$$p^{t+1} = \arg\max_p \tilde{I}(p, Q; \phi^t) = \arg\max_p \sum_{i,j} p_i Q_{ij} \log \frac{\phi^t(i|j)}{p_i} = \arg\max_p \sum_{i,j} p_i Q_{ij} \log \frac{p_i^t Q_{ij}}{p_i\, q_j^t} = \arg\max_p \Big( \sum_{i=1}^{N} p_i D(Q_i \| q^t) - D(p \| p^t) \Big). \quad (42)$$

Eqn. (42) can be interpreted as the maximization of $\sum_{i=1}^{N} p_i D(Q_i \| q^t)$ with a penalty term $D(p \| p^t)$, which ensures that the update $p^{t+1}$ remains in the vicinity of $p^t$ [2]. Algorithms of this type are known as proximal point methods, since they force the update to stay in the proximity of the current iterate. This is reasonable in our case because the first term in (42) is an approximation of the mutual information $I(p, Q)$, obtained by replacing the KL divergences $D(Q_i \| q_p)$ with $D(Q_i \| q^t)$. The penalty term $D(p \| p^t)$ ensures that the maximization is restricted to a neighbourhood of $p^t$ for which the approximation $D(Q_i \| q_p) \approx D(Q_i \| q^t)$ is accurate. In fact, we have the following equality:

$$p^{t+1} = \arg\max_p \big( \tilde{I}^t(p) - D(p \| p^t) \big), \quad (43)$$

where $\tilde{I}^t(p) = I(p^t, Q) + \sum_{i=1}^{N} (p_i - p_i^t) D(Q_i \| q^t)$, which can be shown to be a first-order Taylor series approximation of $I(p, Q)$. Thus the original Blahut-Arimoto algorithm can be thought of as a proximal point method maximizing the first-order Taylor series approximation of $I(p, Q)$ with a proximity penalty expressed by $D(p \| p^t)$.

It is now natural to modify (43) by emphasizing or attenuating the penalty term via a weighting factor, i.e., to consider the following iteration:

$$p^{t+1} = \arg\max_p \big( \tilde{I}^t(p) - \gamma_t D(p \| p^t) \big). \quad (44)$$

The idea is that close to the optimal solution the KL distance of $p$ to $p^t$ is small, and hence the proximity penalty can be gradually relaxed by decreasing $\gamma_t$. In the following subsection we derive one such choice of the sequence $\{\gamma_t\}_{t \ge 1}$: a sequence of step-sizes that guarantees non-decreasing mutual information estimates $I(p^{(t)}, Q)$. We record the accelerated BA algorithm as derived above.

Step 1: Initialize $p^{(1)}$ to the uniform distribution over $\mathcal{X}$, i.e., $p_i^{(1)} = \frac{1}{|\mathcal{X}|}$ for all $i \in \mathcal{X}$. Set $t$ to 1.
Step 2: Repeat until convergence:

$$q^{(t)} = p^{(t)} Q, \quad (45)$$
$$p_i^{(t+1)} = \frac{p_i^{(t)} \exp\big( \gamma_t^{-1} D(Q_i \| q^{(t)}) \big)}{\sum_k p_k^{(t)} \exp\big( \gamma_t^{-1} D(Q_k \| q^{(t)}) \big)}, \quad i \in \mathcal{X}. \quad (46)$$

6.1 Suitable Choice of Step-Sizes for the Accelerated BA Algorithm

A fundamental property of the BA algorithm is that the mutual information $I(p^t, Q)$, which represents the current capacity estimate at the $t$-th iteration, is non-decreasing. For the accelerated BA algorithm, we need to choose a sequence $\{\gamma_t\}_{t \ge 1}$ that preserves this property. For this we need the following lemma.

Lemma 6.1. For any iteration $t$, we have

$$D(q^{(t+1)} \| q^{(t)}) \le \sum_i p_i^{(t+1)} D(Q_i \| q^{(t)}). \quad (47)$$

Proof. Recall that

$$q^{(t+1)} = p^{(t+1)} Q = \sum_i p_i^{(t+1)} Q_i. \quad (48)$$

The above equation expresses the output probability vector $q^{(t+1)}$ as a convex combination of the rows of the matrix $Q$. Since the relative entropy $D(\cdot \| \cdot)$ is jointly convex in both its arguments, we have

$$D(q^{(t+1)} \| q^{(t)}) \le \sum_i p_i^{(t+1)} D(Q_i \| q^{(t)}).$$

Equipped with the above lemma, we now establish a lower bound on the increment of the mutual information $I(p, Q)$ at each stage.

Lemma 6.2. For every stage $t$ of the accelerated BA algorithm, we have

$$I(p^{(t+1)}, Q) \ge I(p^{(t)}, Q) + \gamma_t D(p^{(t+1)} \| p^{(t)}) - D(q^{(t+1)} \| q^{(t)}). \quad (49)$$

Proof. From Eqn. (44) of the accelerated BA iteration, we have

$$\tilde{I}^t(p^{(t+1)}) - \gamma_t D(p^{(t+1)} \| p^{(t)}) \ge \tilde{I}^t(p^{(t)}). \quad (50)$$
Plugging in the expression for $\tilde{I}^t(\cdot)$ from above, we have

$$I(p^{(t+1)}, Q) + \sum_i \big(p_i^{(t+1)} - p_i^{(t)}\big) D(Q_i \| q^{(t)}) \ge I(p^{(t)}, Q) + \gamma_t D(p^{(t+1)} \| p^{(t)}). \quad (51)$$

Now, using Lemma 6.1 and the non-negativity of the KL divergence, the result follows.

From the above lemma, it follows that a sufficient condition for $I(p^{(t+1)}, Q)$ to be non-decreasing is

$$\gamma_t \ge \frac{D(p^{(t+1)} Q \| p^{(t)} Q)}{D(p^{(t+1)} \| p^{(t)})}. \quad (52)$$

Now define the maximum KLD-induced eigenvalue of $Q$ as

$$\lambda_{\mathrm{KL}}^2(Q) = \sup_{p \ne p'} \frac{D(pQ \| p'Q)}{D(p \| p')}. \quad (53)$$

Using the above definition, we conclude that a sufficient condition for $I(p^{(t+1)}, Q)$ to be non-decreasing is given by

$$\gamma_t \ge \lambda_{\mathrm{KL}}^2(Q). \quad (54)$$

7 Convergence Statements of the Accelerated BA Algorithm

In the previous section we proved that any step-size sequence with $\gamma_t \ge \lambda_{\mathrm{KL}}^2(Q)$ preserves monotonicity of the capacity estimates, so that the accelerated BA algorithm has the potential for increased convergence speed. For lack of space, we only give the statements of the theorems; complete proofs may be found in [2].

Theorem 7.1. Consider the accelerated BA algorithm with $I^t = \sum_i p_i^t D(Q_i \| q^{(t)})$ and $L^t = \gamma_t \log \big( \sum_i p_i^{(t)} \exp(\gamma_t^{-1} D(Q_i \| q^{(t)})) \big)$. Assume that $\gamma_{\inf} = \inf_t \gamma_t^{-1} > 0$ and that (54) is satisfied for all $t$. Then

$$\lim_{t \to \infty} L^t = \lim_{t \to \infty} I^t = C, \quad (55)$$

and the convergence rate is at least proportional to $1/t$, i.e.,

$$C - L^t < \frac{D(p^* \| p^{(0)})}{\gamma_{\inf}\, t}. \quad (56)$$
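To make Sections 5-7 concrete: the simplified iteration (40)-(41) and the accelerated iteration (45)-(46) differ only in the scaling of the exponent, so both fit in a single routine. The sketch below uses illustrative names and base-2 logarithms throughout, so the exp in (41)/(46) is read as $2^x$. Note also that $\lambda_{\mathrm{KL}}^2(Q) \le 1$ always holds, by the data-processing inequality for the relative entropy, so the choice $\gamma_t = 1$, i.e., the classical BA update, always satisfies (54):

```python
import numpy as np

def _kl_rows(Q, q):
    """Row-wise D(Q_i || q) in bits; zero channel entries contribute 0."""
    safe = np.where(Q > 0, Q, 1.0)
    return np.where(Q > 0, Q * np.log2(safe / q), 0.0).sum(axis=1)

def blahut_arimoto(Q, gamma=1.0, tol=1e-12, max_iter=10_000):
    """Iteration (45)-(46) with constant step-size gamma; gamma = 1.0
    recovers the simplified BA iteration (40)-(41).
    Returns (capacity estimate in bits, input distribution)."""
    N = Q.shape[0]
    p = np.full(N, 1.0 / N)              # Step 1: uniform initialization
    for _ in range(max_iter):
        q = p @ Q                        # Eqn. (40)/(45)
        p_new = p * np.exp2(_kl_rows(Q, q) / gamma)   # Eqn. (41)/(46)
        p_new /= p_new.sum()
        done = np.max(np.abs(p_new - p)) < tol
        p = p_new
        if done:
            break
    C = float(p @ _kl_rows(Q, p @ Q))    # I(p, Q) = sum_i p_i D(Q_i || q)
    return C, p

# Binary symmetric channel: C = 1 - H2(eps), achieved by the uniform input
eps = 0.1
Q = np.array([[1 - eps, eps], [eps, 1 - eps]])
H2 = -(eps * np.log2(eps) + (1 - eps) * np.log2(1 - eps))
C_ba, p_ba = blahut_arimoto(Q)
print(abs(C_ba - (1 - H2)) < 1e-9)   # True
```

For an asymmetric channel, a step-size $\lambda_{\mathrm{KL}}^2(Q) \le \gamma_t < 1$ takes larger steps in the exponent and can reach the same fixed point, which is independent of $\gamma$, in fewer iterations.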
8 Dual Approach: Geometric Programming

In this section, we take a dual approach to solving problem (10) and show that the dual problem reduces to a simple geometric program [4]. We also derive several useful upper bounds on the channel capacity from the dual program. First we rewrite the mutual information functional as follows:

$$I(X; Y) = H(Y) - H(Y|X) = -\sum_j q_j \log q_j + p\, r, \quad (57)$$

where $r$ is the column vector with components

$$r_i = \sum_j Q_{ij} \log Q_{ij}, \quad (58)$$

subject to

$$q = pQ, \quad (59)$$
$$p\mathbf{1} = 1, \quad p \ge 0. \quad (60)$$

Hence the optimization problem (10) may be rewritten as follows:

$$\max_{p,\, q} \;\; p\, r - \sum_j q_j \log q_j \quad (61)$$

subject to

$$pQ = q, \quad p\mathbf{1} = 1, \quad p \ge 0.$$

It is to be noted that keeping two sets of optimization variables $p$ and $q$, and introducing the equality constraint $pQ = q$ in the primal problem, is a key step in deriving an explicit and simple Lagrange dual of (61).

Theorem 8.1. The Lagrange dual of the channel capacity problem (61) is given by the following problem:

$$\min_{\alpha} \;\; \log \sum_{j=1}^{M} \exp(\alpha_j) \quad (62)$$

subject to

$$Q\alpha \ge r. \quad (63)$$
An equivalent version of the above Lagrange dual problem is the following geometric program (in standard form), obtained via the substitution $z_j = \exp(\alpha_j)$:

$$\min_{z} \;\; \sum_{j=1}^{M} z_j \quad (64)$$

subject to

$$\prod_{j} z_j^{Q_{ij}} \ge \exp\big( -H(Q_i) \big), \quad i = 1, 2, \ldots, N, \qquad z \ge 0,$$

where $H(Q_i)$ is the entropy of the $i$-th row of $Q$, so that $r_i = -H(Q_i)$.

From the Lagrange dual problem, we immediately have the following upper bound on the channel capacity:

Weak duality: $\log \big( \sum_{j=1}^{M} \exp(\alpha_j) \big) \ge C$ for all $\alpha$ that satisfy $Q\alpha \ge r$.

Strong duality: $\log \big( \sum_{j=1}^{M} \exp(\alpha_j^*) \big) = C$ for the optimal dual variable $\alpha^*$.

8.1 Bounding From the Dual

Because the inequality constraints in the dual problem (62)-(63) are affine, it is easy to obtain a dual-feasible $\alpha$ by finding any solution to a system of linear inequalities, and the resulting value of the dual objective function provides an easily computable upper bound on the channel capacity. The following is one such non-trivial bound.

Corollary 8.2. The channel capacity is upper-bounded in terms of a maximum-likelihood receiver selecting $\arg\max_i Q_{ij}$ for each output symbol $j$:

$$C \le \log \sum_{j} \max_i Q_{ij}, \quad (65)$$

which is tight iff the optimal output distribution $q^*$ is given by $q_j^* = \frac{\max_i Q_{ij}}{\sum_{k=1}^{M} \max_i Q_{ik}}$.

As is readily apparent, the geometric-program Lagrange dual (64) generates a broader class of upper bounds on capacity. These upper bounds can be effectively used to terminate an iterative optimization procedure for channel capacity.
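The bound (65) corresponds to the dual-feasible choice $\alpha_j = \log \max_i Q_{ij}$ in the weak-duality bound, since $\sum_j Q_{ij} \log \max_k Q_{kj} \ge \sum_j Q_{ij} \log Q_{ij} = r_i$. A short sketch (illustrative names, base-2 logs) evaluates it for a binary symmetric channel:

```python
import numpy as np

def ml_upper_bound(Q):
    """Eqn. (65): C <= log2(sum_j max_i Q_ij), the dual objective at the
    feasible point alpha_j = log2(max_i Q_ij)."""
    return float(np.log2(np.max(Q, axis=0).sum()))

# Binary symmetric channel: exact capacity is 1 - H2(eps)
eps = 0.1
Q = np.array([[1 - eps, eps], [eps, 1 - eps]])
C = 1 + eps * np.log2(eps) + (1 - eps) * np.log2(1 - eps)
bound = ml_upper_bound(Q)          # log2(2(1 - eps)), about 0.848 bits
print(bound >= C)                  # True: weak duality holds
```

Here the bound is loose (about 0.85 bits versus the exact 0.53 bits) but essentially free to compute; tighter bounds follow from better dual-feasible points $\alpha$, as the section notes.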
9 Conclusion

In this report, we have surveyed various convex optimization algorithms for solving the channel capacity problem. In particular, we have derived the classical Blahut-Arimoto algorithm from first principles. We then established a connection between proximal algorithms and the original BA iteration. Using a proper step-size sequence, we have derived an accelerated version of the BA algorithm. Finally, we have considered the dual of the channel capacity problem and have shown that its Lagrange dual is given by a geometric program (GP). The GP has been effectively utilized to derive non-trivial upper bounds on the channel capacity.

References

[1] S. Arimoto, "An algorithm for computing the capacity of arbitrary discrete memoryless channels," IEEE Transactions on Information Theory, vol. 18, no. 1, pp. 14-20, 1972.

[2] G. Matz and P. Duhamel, "Information geometric formulation and interpretation of accelerated Blahut-Arimoto-type algorithms," in Proceedings of the IEEE Information Theory Workshop, 2004, pp. 66-70.

[3] Y. Yu, "Squeezing the Arimoto-Blahut algorithm for faster convergence," arXiv preprint arXiv:0906.3849, 2009.

[4] M. Chiang and S. Boyd, "Geometric programming duals of channel capacity and rate distortion," IEEE Transactions on Information Theory, vol. 50, no. 2, pp. 245-258, 2004.

[5] C. E. Shannon, "A mathematical theory of communication," ACM SIGMOBILE Mobile Computing and Communications Review, vol. 5, no. 1, pp. 3-55, 2001.

[6] T. M. Cover and J. A. Thomas, Elements of Information Theory. John Wiley & Sons, 2012.