Convex Optimization Methods for Computing Channel Capacity


Convex Optimization Methods for Computing Channel Capacity

Abhishek Sinha
Laboratory for Information and Decision Systems (LIDS), MIT
sinhaa@mit.edu

May 15, 2014

We consider a classical computational problem from Information Theory: numerically determining the Shannon capacity of a given discrete memoryless channel. We formulate it as a convex optimization problem and review a classical algorithm, the Blahut-Arimoto (BA) algorithm [1], which exploits the particular structure of the problem. This algorithm is an example of an alternating optimization algorithm with a guaranteed convergence rate of Θ(1/k); moreover, if the optimal solution is unique, it achieves an exponential rate of convergence. We then review some recent advances made on this problem using methods of convex optimization. First, we review [2], where the authors present two related algorithms, based on natural gradient and proximal point methods respectively, that are potentially faster than the original Blahut-Arimoto algorithm. Next, we review [4], which considers the problem from a dual perspective and shows that the Lagrange dual is a geometric program. We critically evaluate the relative performance of these methods on specific problems and, finally, present some directions for further research on this interesting problem.

1 Introduction

Claude Shannon's 1948 paper [5] marked the beginning of the mathematical study of information and of its reliable transmission over noisy communication channels, the field now known as Information Theory. In that paper, through some ingenious mathematical arguments, he showed that information can be transmitted reliably over a noisy channel whenever the rate of transmission is less than the channel capacity, a fundamental quantity determined by the statistical description of the channel. In particular, the paper establishes the startling fact that the presence of noise in a communication channel limits only the rate of communication, not the attainable probability of error in transmission. In the simplest case of a discrete memoryless channel (DMC), the channel capacity is expressed as a convex program with the input probability distribution as the optimization variables.

[Figure 1: A Communication Channel]

Although this program can be solved explicitly in some special cases, no closed-form formula is known for arbitrary DMCs. Hence one needs to resort to convex optimization algorithms to evaluate the channel capacity of an arbitrary DMC. In this expository article, we discuss an elegant iterative algorithm due to Suguru Arimoto, presented in the IEEE Transactions on Information Theory in 1972, and we strive to provide complete proofs of the key results starting from first principles.

2 Preliminary Definitions and Results

In this section we define some standard information-theoretic functionals that will be used extensively in the rest of the paper. All random variables discussed in this paper are assumed to take values in a finite set (i.e., they are discrete) with strictly positive probabilities, and all logarithms are taken with respect to base 2 unless specified otherwise.

Definition. The entropy H(X) of a random variable X taking values in a finite alphabet 𝒳 with probability mass function p_X(x) is defined as

H(X) = \mathbb{E}[-\log p_X(X)] = -\sum_{x \in \mathcal{X}} p_X(x) \log p_X(x) \equiv H(p_X).   (1)

Note that H(X) depends only on the probability measure of the random variable X and not on the particular values that X takes.

Definition. The relative entropy D(p_X \| q_X) of two PMFs p_X(·) and q_X(·) (with q_X(x) > 0 for all x ∈ 𝒳) supported on the same alphabet 𝒳 is defined as

D(p_X \| q_X) = \sum_{x \in \mathcal{X}} p_X(x) \log \frac{p_X(x)}{q_X(x)}.   (2)

Lemma 2.1. For any two distributions p and q with the same support, we have

D(p \| q) \ge 0,   (3)

with equality holding iff p = q.

Proof. Although this result can be proved directly using Jensen's inequality, we opt to give an elementary proof here. The fundamental inequality that we use is

\exp(x) \ge 1 + x, \quad \forall x \in \mathbb{R},   (4)

with equality holding iff x = 0; this follows from simple calculus. Replacing x by x - 1 in (4) and taking the natural logarithm of both sides, we conclude that

\ln(x) \le x - 1, \quad \forall x > 0,   (5)

with equality holding iff x = 1. Now we write

D(p \| q) = \sum_i p_i \log \frac{p_i}{q_i} = -\sum_i p_i \log \frac{q_i}{p_i} \ge -\sum_i p_i \Big( \frac{q_i}{p_i} - 1 \Big) = \sum_i p_i - \sum_i q_i = 1 - 1 = 0,   (6)

where the inequality follows from Eqn. (5). Hence we have

D(p \| q) \ge 0,   (7)

where equality holds iff equality holds in (6), i.e., iff p = q.

Definition. The mutual information I(X; Y) between two random variables X and Y, taking values in the alphabets 𝒳 and 𝒴 with joint distribution p_{XY}(·,·) and marginal distributions p_X(·) and p_Y(·) respectively, is defined as

I(X; Y) = D(p_{XY} \| p_X p_Y) = \sum_{(x,y) \in \mathcal{X} \times \mathcal{Y}} p_{XY}(x, y) \log \frac{p_{XY}(x, y)}{p_X(x) p_Y(y)}.   (8)

Writing p_{XY}(x, y) as p_X(x) p_{Y|X}(y|x), the above quantity may be rewritten as

I(X; Y) = \sum_{(x,y) \in \mathcal{X} \times \mathcal{Y}} p_X(x) p_{Y|X}(y|x) \log \frac{p_{Y|X}(y|x)}{p_Y(y)} = \sum_{(x,y) \in \mathcal{X} \times \mathcal{Y}} p_X(x) p_{Y|X}(y|x) \log \frac{p_{Y|X}(y|x)}{\sum_{z \in \mathcal{X}} p_X(z) p_{Y|X}(y|z)}.   (9)
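The definitions above translate directly into a few lines of code. The sketch below (Python with NumPy, our choice for illustration; the report itself contains no code) implements the three functionals; the helper names are hypothetical, and logarithms are taken base 2 as in the text.

```python
import numpy as np

def entropy(p):
    """Entropy H(p) in bits, Eqn. (1); assumes p is a PMF with strictly positive entries."""
    p = np.asarray(p, dtype=float)
    return -np.sum(p * np.log2(p))

def kl_divergence(p, q):
    """Relative entropy D(p || q) in bits, Eqn. (2); assumes q > 0 wherever p > 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0                               # 0 log 0 = 0 convention
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

def mutual_information(p, Q):
    """I(X; Y) in bits, Eqn. (9): p is the input PMF (length N) and
    Q is the N x M channel matrix with rows Q[i, :] = p_{Y|X}(.|i)."""
    p, Q = np.asarray(p, dtype=float), np.asarray(Q, dtype=float)
    q = p @ Q                                  # output PMF, q = pQ
    # I(p, Q) = sum_i p_i D(Q_i || q)
    return sum(p[i] * kl_divergence(Q[i], q) for i in range(len(p)))
```

For instance, for a binary symmetric channel with crossover probability 0.1 and a uniform input, `mutual_information([0.5, 0.5], [[0.9, 0.1], [0.1, 0.9]])` returns 1 − H(0.1) ≈ 0.531 bits.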

Definition. A Discrete Memoryless Channel (DMC) [6], denoted by (𝒳, p_{Y|X}(y|x), 𝒴), consists of two finite sets 𝒳 and 𝒴 and a collection of probability mass functions {p_{Y|X}(·|x), x ∈ 𝒳}, with the interpretation that X is the input and Y is the output of the channel. The capacity C of the DMC is defined as the maximum possible rate of information transmission with arbitrarily small probability of error. Shannon established the following fundamental result in his seminal paper [5].

Theorem 2.2 (The Noisy Channel Coding Theorem).

C = \max_{p_X} I(X; Y).   (10)

In the rest of this article, we discuss algorithms that solve the optimization problem (10) for a given DMC.

3 Some Convexity Results

In this section we establish the convexity of the optimization problem (10), starting from first principles. To simplify notation, we re-label the input symbols as [1 … N] and the output symbols as [1 … M], where N = |𝒳| and M = |𝒴|. We denote the 1 × N input probability vector by p, the N × M channel matrix by Q, and the 1 × M output probability vector by q. Then, by the laws of probability, we have

q = pQ.   (11)

Hence the objective function I(X; Y) can be rewritten as

I(X; Y) = I(p, Q) = \sum_{i=1}^N \sum_{j=1}^M p_i Q_{ij} \log \frac{Q_{ij}}{q_j} = \sum_{i=1}^N p_i \Big( \sum_{j=1}^M Q_{ij} \log Q_{ij} \Big) - \sum_{j=1}^M q_j \log q_j,   (12)

where we have used Eqn. (11) in the last step.

Lemma 3.1. For a fixed channel matrix Q, I(p, Q) is concave in the input probability distribution p, and hence problem (10) has an optimal solution.

Proof. We first establish that the function f(x) = x log x, x ≥ 0, is convex in x; to see this, just note that f''(x) > 0 for all x > 0. Hence the function f(q) = \sum_{j=1}^M q_j \log q_j is convex in q. From Eqn. (11), q is a linear transformation of the input probability vector p. Hence, viewed as a function of p, the second term on the right of Eqn. (12) is concave in p, being the negative of a convex function composed with a linear map. Since the first term is linear in p, the result follows.
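Theorem 3.2 below records that (10) is a convex problem, which means it can also be handed directly to a generic convex solver before any specialized algorithm is introduced. The following is a minimal sketch assuming the CVXPY package is available (an assumption of ours, not a tool used in the report); it exploits the decomposition (12), and since cp.entr works in natural logarithms, the value is converted from nats to bits.

```python
import cvxpy as cp
import numpy as np

def capacity_cvxpy(Q):
    """Solve (10) directly: maximize I(p, Q) over the probability simplex.
    Via Eqn. (12), I(p, Q) = c^T p + H(pQ) with c_i = sum_j Q_ij log Q_ij."""
    Q = np.asarray(Q, dtype=float)          # assumes Q > 0, as the paper does
    N, M = Q.shape
    p = cp.Variable(N, nonneg=True)
    c = np.sum(Q * np.log(Q), axis=1)       # linear term, in nats
    # cp.entr(x) = -x log x, so cp.sum(cp.entr(q)) is the output entropy H(q)
    objective = cp.Maximize(c @ p + cp.sum(cp.entr(Q.T @ p)))
    prob = cp.Problem(objective, [cp.sum(p) == 1])
    prob.solve()
    return prob.value / np.log(2), p.value  # capacity in bits, optimal input

# Example: binary symmetric channel with crossover 0.1 (capacity ~ 0.531 bits).
C, p_star = capacity_cvxpy([[0.9, 0.1], [0.1, 0.9]])
```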

Since the constraint set in the optimization problem (10) is the probability simplex {p : p 1ᵀ = 1, p_i ≥ 0}, Lemma 3.1 establishes that (10) is the maximization of a concave function over a convex constraint set. We record this fact in the following theorem.

Theorem 3.2. The optimization problem given in (10) is convex.

4 A Variational Characterization of Mutual Information

In this section we express the mutual information I(X; Y) as a variational problem. This will lead us directly to an alternating optimization algorithm for solving problem (10). Let us denote the set of all conditional distributions on the input alphabet 𝒳, indexed by the output alphabet 𝒴, by Φ = {φ(·|j) : j ∈ 𝒴}. For any φ ∈ Φ, define the quantity Ĩ(p, Q; φ) as follows:

\tilde{I}(p, Q; \phi) = \sum_{i=1}^N \sum_{j=1}^M p_i Q_{ij} \log \frac{\phi(i|j)}{p_i}.   (13)

The concavity of Ĩ(p, Q; φ) with respect to p and φ is readily apparent.

Lemma 4.1. For fixed p and Q, Ĩ(p, Q; φ) is concave in φ. Similarly, for fixed φ and Q, Ĩ(p, Q; φ) is concave in p.

Proof. This follows from the concavity of the functions log(x) and x log(1/x) and the definition of Ĩ(p, Q; φ).

Clearly, from the defining Eqn. (12) it follows that for the particular choice

\phi^*(i|j) = \frac{p_i Q_{ij}}{\sum_{k=1}^N p_k Q_{kj}},

we have

\tilde{I}(p, Q; \phi^*) = I(p, Q).   (14)

The following lemma shows that φ* maximizes Ĩ(p, Q; ·).

Lemma 4.2. For any matrix of conditional probabilities φ, we have

\tilde{I}(p, Q; \phi) \le I(p, Q).   (15)

Proof. We have

I(p, Q) - \tilde{I}(p, Q; \phi) = \sum_{i,j} p_i Q_{ij} \log \frac{Q_{ij}}{q_j} - \sum_{i,j} p_i Q_{ij} \log \frac{\phi(i|j)}{p_i}   (16)

= \sum_{j} q_j \sum_{i} \frac{p_i Q_{ij}}{q_j} \log \frac{p_i Q_{ij}/q_j}{\phi(i|j)}   (17)

= \sum_{j} q_j \sum_{i} \phi^*(i|j) \log \frac{\phi^*(i|j)}{\phi(i|j)}.   (18)

Define r(i|j) = p_i Q_{ij}/q_j, which can be interpreted as the a posteriori input probability distribution given that the output variable takes the value j; this is precisely φ*(i|j). Then we can write the above equation as

I(p, Q) - \tilde{I}(p, Q; \phi) = \sum_{j=1}^M q_j D(\phi^*(\cdot|j) \| \phi(\cdot|j)),   (19)

which is non-negative, with equality iff φ(i|j) = φ*(i|j) for all i, j, by virtue of Lemma 2.1.

Combining the above results, we have the following variational characterization of mutual information.

Theorem 4.3. For any input distribution p and any channel matrix Q, we have

I(p, Q) = \max_{\phi \in \Phi} \tilde{I}(p, Q; \phi),   (20)

and the conditional probability matrix achieving the maximum is given by

\phi(i|j) = \phi^*(i|j) = \frac{p_i Q_{ij}}{\sum_{k=1}^N p_k Q_{kj}}.   (21)

Based on the above theorem, we can recast the optimization problem (10) as follows:

C = \max_{p} \max_{\phi \in \Phi} \tilde{I}(p, Q; \phi).   (22)

Since the channel matrix Q is fixed, we can view problem (22) as an optimization over two different sets of variables, p and φ. One natural iterative approach is to fix one set of variables, optimize over the other, and vice versa. This method is especially attractive when closed-form solutions to both maximizations are available; as we will see, that is precisely the case here. This is, in essence, the Blahut-Arimoto (BA) algorithm for obtaining the capacity of a Discrete Memoryless Channel [1].

5 The Geometric Idea of Alternating Optimization

Consider the following problem: given two convex sets A and B in Rⁿ, as shown in Figure 2, we wish to determine the minimum distance between them. More precisely, we wish to determine

d_{\min} = \min_{a \in A,\, b \in B} d(a, b),   (23)

where d(a, b) is the Euclidean distance between a and b. An obvious algorithm would be: take any point x ∈ A and find the point y ∈ B closest to it; then fix this y and find the point in A closest to it, and so on. Repeating this process, it is clear that the distance is non-increasing at each stage, but it is not obvious whether the algorithm converges to the optimal solution.

[Figure 2: Alternating Minimization]

However, we will show that if the sets are sets of probability distributions and the distance measure is the relative entropy, then the algorithm does converge to the minimum relative entropy between the two sets.

To use this idea of alternating optimization in problem (22), it is advantageous to have closed-form expressions for the solutions of both inner optimizations, if possible. The theorem below shows that this is indeed the case and gives both solutions in closed form.

Theorem 5.1. For fixed p and Q, we have

\arg\max_{\phi \in \Phi} \tilde{I}(p, Q; \phi) = \phi^*,   (24)

where

\phi^*(i|j) = \frac{p_i Q_{ij}}{\sum_{k=1}^N p_k Q_{kj}}.   (25)

For fixed φ and Q, we have

\arg\max_{p} \tilde{I}(p, Q; \phi) = p^*,   (26)

where the components of p* are given by

p^*(i) = \frac{r_i}{\sum_k r_k},   (27)

and the maximum value is given by

\max_{p} \tilde{I}(p, Q; \phi) = \log \Big( \sum_i r_i \Big),   (28)

where

r_i = \exp\Big( \sum_j Q_{ij} \log \phi(i|j) \Big).   (29)

Proof. The first part of the theorem was already proved as part of Theorem 4.3; here we prove the second part only. As with any equality-constrained optimization problem, a straightforward approach is the method of Lagrange multipliers. A more elegant way, however, is to use Lemma 2.1 to tightly upper-bound the objective function and then exhibit an input distribution that achieves the bound. We take this approach here.

Consider the input distribution p* with p*(i) = D r(i), i ∈ 𝒳, where

\log r(i) = \sum_j Q_{ij} \log \phi(i|j),   (30)

and D is the normalization constant, i.e., D = (\sum_{i \in \mathcal{X}} r(i))^{-1}. From Lemma 2.1, we have

D(p \| p^*) \ge 0,   (31)

i.e.,

\sum_i p_i \log p_i \ge \sum_i p_i \log p^*(i) = \log D + \sum_i p_i \sum_j Q_{ij} \log \phi(i|j).   (32)

Rearranging the above inequality, we have

\tilde{I}(p, Q; \phi) = \sum_i \sum_j p_i Q_{ij} \log \phi(i|j) - \sum_i p_i \log p_i \le -\log D,   (33)

with equality holding iff p = p*. Clearly, the optimal value is given by -\log D = \log \sum_i r_i.

Equipped with Theorem 5.1, we are now ready to describe the Blahut-Arimoto (BA) algorithm formally.

Step 1: Initialize p^{(1)} to the uniform distribution over 𝒳, i.e., p_i^{(1)} = 1/|𝒳| for all i ∈ 𝒳. Set t to 1.

Step 2: Find φ^{(t+1)} as follows:

\phi^{(t+1)}(i|j) = \frac{p_i^{(t)} Q_{ij}}{\sum_k p_k^{(t)} Q_{kj}}, \quad \forall i, j.   (34)

Step 3: Update p^{(t+1)} as follows:

p_i^{(t+1)} = \frac{r_i^{(t+1)}}{\sum_{k \in \mathcal{X}} r_k^{(t+1)}},   (35)

where

r_i^{(t+1)} = \exp\Big( \sum_j Q_{ij} \log \phi^{(t+1)}(i|j) \Big).   (36)

Step 4: Set t ← t + 1 and go to Step 2.

We can combine Steps 2 and 3 as follows. Denote the output distribution induced by the input distribution p^{(t)} by q^{(t)}, i.e., q^{(t)} = p^{(t)} Q. From Eqn. (34) we then have

\phi^{(t+1)}(i|j) = \frac{p_i^{(t)} Q_{ij}}{q^{(t)}(j)}.   (37)

We now evaluate the term inside the exponent of Eqn. (36) as follows:

\sum_j Q_{ij} \log \phi^{(t+1)}(i|j) = \sum_j Q_{ij} \log \frac{p_i^{(t)} Q_{ij}}{q^{(t)}(j)} = D(Q_i \| q^{(t)}) + \log p_i^{(t)},   (38)

where Q_i denotes the i-th row of the channel matrix Q. Hence, from Eqn. (36), we have

r_i^{(t+1)} = p_i^{(t)} \exp\big( D(Q_i \| q^{(t)}) \big).   (39)

Thus the algorithm admits the following simplified description.

Simplified Blahut-Arimoto Algorithm

Step 1: Initialize p^{(1)} to the uniform distribution over 𝒳, i.e., p_i^{(1)} = 1/|𝒳| for all i ∈ 𝒳. Set t to 1.

Step 2: Repeat until convergence:

q^{(t)} = p^{(t)} Q,   (40)

p_i^{(t+1)} = \frac{p_i^{(t)} \exp\big( D(Q_i \| q^{(t)}) \big)}{\sum_k p_k^{(t)} \exp\big( D(Q_k \| q^{(t)}) \big)}, \quad i \in \mathcal{X}.   (41)
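The simplified iteration (40)-(41) is a direct transcription into code. Below is a minimal sketch (Python/NumPy again, our assumption); natural logarithms are used throughout the iteration for internal consistency, with the final value converted to bits. The stopping rule uses the classical sandwich bound \sum_i p_i D(Q_i \| q) \le C \le \max_i D(Q_i \| q).

```python
import numpy as np

def blahut_arimoto(Q, tol=1e-12, max_iter=100_000):
    """Simplified BA iteration, Eqns. (40)-(41), for an N x M channel matrix Q.
    Returns (capacity in bits, capacity-achieving input distribution)."""
    Q = np.asarray(Q, dtype=float)
    N, _ = Q.shape
    p = np.full(N, 1.0 / N)                            # Step 1: uniform input
    log_Q = np.log(np.where(Q > 0, Q, 1.0))
    for _ in range(max_iter):
        q = p @ Q                                      # Eqn. (40)
        log_q = np.log(np.where(q > 0, q, 1.0))
        # d[i] = D(Q_i || q) in nats, computed for all rows at once
        d = np.sum(np.where(Q > 0, Q * (log_Q - log_q), 0.0), axis=1)
        if np.max(d) - p @ d < tol:                    # sandwich-bound gap
            break
        r = p * np.exp(d)                              # Eqn. (39)
        p = r / r.sum()                                # Eqn. (41)
    q = p @ Q
    d = np.sum(np.where(Q > 0, Q * (log_Q - np.log(np.where(q > 0, q, 1.0))), 0.0), axis=1)
    return (p @ d) / np.log(2), p                      # I(p, Q) in bits
```

On the binary symmetric channel example it converges immediately to C ≈ 0.531 bits with the uniform input, matching the solver-based computation above.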

6 Proximal Point Reformulation: Accelerated Blahut-Arimoto Algorithm

In this section we re-examine the alternating optimization procedure of the Blahut-Arimoto algorithm [2]. Plugging the optimal solution φ^t from the first optimization into the second optimization, we have

p^{t+1} = \arg\max_{p} \tilde{I}(p, Q; \phi^t) = \arg\max_{p} \sum_{i,j} p_i Q_{ij} \log \frac{\phi^t(i|j)}{p_i} = \arg\max_{p} \sum_{i,j} p_i Q_{ij} \log \frac{p_i^t Q_{ij}}{p_i\, q_j^t} = \arg\max_{p} \Big( \sum_{i=1}^N p_i D(Q_i \| q^t) - D(p \| p^t) \Big).   (42)

Eqn. (42) can be interpreted as the maximization of \sum_i p_i D(Q_i \| q^t) with a penalty term D(p \| p^t) that keeps the update p^{t+1} in the vicinity of p^t [2]. Algorithms of this type are known as proximal point methods, since they force the update to stay in the proximity of the current iterate. This is reasonable in our case because the first term in (42) is an approximation of the mutual information I(p, Q), obtained by replacing the divergences D(Q_i \| q) with D(Q_i \| q^t). The penalty term D(p \| p^t) ensures that the maximization is restricted to a neighbourhood of p^t on which the approximation D(Q_i \| q) ≈ D(Q_i \| q^t) is accurate. In fact, we have the equality

p^{t+1} = \arg\max_{p} \big( \tilde{I}^t(p) - D(p \| p^t) \big),   (43)

where \tilde{I}^t(p) = I(p^t, Q) + \sum_{i=1}^N (p_i - p_i^t) D(Q_i \| q^t), which can be shown to be a first-order Taylor series approximation of I(p, Q) around p^t. Thus the original Blahut-Arimoto algorithm can be thought of as a proximal point method that maximizes the first-order Taylor approximation of I(p, Q) with a proximity penalty expressed by D(p \| p^t).

It is now natural to modify (43) by emphasizing or attenuating the penalty term via a weighting factor, i.e., to consider the iteration

p^{t+1} = \arg\max_{p} \big( \tilde{I}^t(p) - \gamma_t D(p \| p^t) \big).   (44)

The idea is that close to the optimal solution the K-L distance from p to p^t is small, and hence the proximity penalty can be gradually relaxed by decreasing γ_t. In the following subsection we derive a condition on the step-size sequence {γ_t}_{t≥1} that guarantees non-decreasing mutual information estimates I(p^{(t)}, Q). We state the accelerated BA algorithm, as derived above, below.

Step 1: Initialize p^{(1)} to the uniform distribution over 𝒳, i.e., p_i^{(1)} = 1/|𝒳| for all i ∈ 𝒳. Set t to 1.

Step 2: Repeat until convergence:

q^{(t)} = p^{(t)} Q,   (45)

p_i^{(t+1)} = \frac{p_i^{(t)} \exp\big( \gamma_t^{-1} D(Q_i \| q^{(t)}) \big)}{\sum_k p_k^{(t)} \exp\big( \gamma_t^{-1} D(Q_k \| q^{(t)}) \big)}, \quad i \in \mathcal{X}.   (46)

6.1 A Suitable Choice of Step Sizes for the Accelerated BA Algorithm

A fundamental property of the BA algorithm is that the mutual information I(p^t, Q), which represents the current capacity estimate at the t-th iteration, is non-decreasing. For the accelerated BA algorithm, we need to choose a sequence {γ_t}_{t≥1} that preserves this property. For this we need the following lemma.

Lemma 6.1. For any iteration t, we have

D(q^{(t+1)} \| q^{(t)}) \le \sum_i p_i^{(t+1)} D(Q_i \| q^{(t)}).   (47)

Proof. Recall that

q^{(t+1)} = p^{(t+1)} Q = \sum_i p_i^{(t+1)} Q_i.   (48)

This equation expresses the output probability vector q^{(t+1)} as a convex combination of the rows of the matrix Q. Since the relative entropy D(·\|·) is jointly convex in both its arguments, we have

D(q^{(t+1)} \| q^{(t)}) = D\Big( \sum_i p_i^{(t+1)} Q_i \,\Big\|\, q^{(t)} \Big) \le \sum_i p_i^{(t+1)} D(Q_i \| q^{(t)}).

Equipped with the above lemma, we now establish a lower bound on the increment of the mutual information I(p, Q) at each stage.

Lemma 6.2. For every stage t of the accelerated BA algorithm, we have

I(p^{(t+1)}, Q) \ge I(p^{(t)}, Q) + \gamma_t D(p^{(t+1)} \| p^{(t)}) - D(q^{(t+1)} \| q^{(t)}).   (49)

Proof. From Eqn. (44) of the accelerated BA iteration, we have

\tilde{I}^t(p^{(t+1)}) - \gamma_t D(p^{(t+1)} \| p^{(t)}) \ge \tilde{I}^t(p^{(t)}) = I(p^{(t)}, Q).   (50)

Plugging in the expression for \tilde{I}^t(\cdot) from above and using \sum_i p_i^{(t)} D(Q_i \| q^{(t)}) = I(p^{(t)}, Q), we obtain

\sum_i p_i^{(t+1)} D(Q_i \| q^{(t)}) \ge I(p^{(t)}, Q) + \gamma_t D(p^{(t+1)} \| p^{(t)}).   (51)

Now observe the identity \sum_i p_i^{(t+1)} D(Q_i \| q^{(t)}) = I(p^{(t+1)}, Q) + D(q^{(t+1)} \| q^{(t)}) (a sharpening of Lemma 6.1, since mutual information is non-negative); substituting it into (51) and rearranging yields (49).

From the above lemma, it follows that a sufficient condition for I(p^{(t)}, Q) to be non-decreasing is

\gamma_t \ge \frac{D(q^{(t+1)} \| q^{(t)})}{D(p^{(t+1)} \| p^{(t)})} = \frac{D(p^{(t+1)} Q \| p^{(t)} Q)}{D(p^{(t+1)} \| p^{(t)})}.   (52)

Now define the maximum KLD-induced eigenvalue of Q as

\lambda_{KL}^2(Q) = \sup_{p \ne p'} \frac{D(pQ \| p'Q)}{D(p \| p')}.   (53)

Using the above definition, we conclude that a sufficient condition for I(p^{(t)}, Q) to be non-decreasing is

\gamma_t \ge \lambda_{KL}^2(Q).   (54)

7 Convergence Statements for the Accelerated BA Algorithm

In the previous section we proved that any step-size sequence with γ_t ≥ λ²_KL(Q) preserves the monotonicity of the capacity estimates. Since λ²_KL(Q) ≤ 1 by the data processing inequality for the relative entropy, such step sizes may be chosen smaller than the value γ_t = 1 implicit in the original BA iteration, giving the accelerated algorithm the potential for increased convergence speed. For lack of space, we only give the statement of the main theorem; complete proofs may be found in [2].

Theorem 7.1. Consider the accelerated BA algorithm with

I_t = \sum_i p_i^{(t)} D(Q_i \| q^{(t)}) \quad \text{and} \quad L_t = \gamma_t \log \Big( \sum_i p_i^{(t)} \exp\big( \gamma_t^{-1} D(Q_i \| q^{(t)}) \big) \Big).

Assume that \mu_{\inf} = \inf_t \gamma_t^{-1} > 0 and that (54) is satisfied for all t. Then

\lim_{t \to \infty} L_t = \lim_{t \to \infty} I_t = C,   (55)

and the convergence rate is at least proportional to 1/t, i.e.,

C - L_t < \frac{D(p^* \| p^{(1)})}{\mu_{\inf}\, t},   (56)

where p^* is a capacity-achieving input distribution.
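For completeness, here is a sketch of the accelerated iteration (45)-(46). It mirrors `blahut_arimoto` above but scales each divergence by γ_t⁻¹; the constant step size used below is purely illustrative (our assumption), and values below λ²_KL(Q) would void the monotonicity guarantee of Section 6.1.

```python
import numpy as np

def accelerated_ba(Q, gamma=0.7, tol=1e-12, max_iter=100_000):
    """Accelerated BA iteration, Eqns. (45)-(46), with a constant step size gamma.
    gamma = 1.0 recovers the classical BA update; the default 0.7 is an arbitrary
    illustration, not a safeguarded choice satisfying condition (54)."""
    Q = np.asarray(Q, dtype=float)
    N, _ = Q.shape
    p = np.full(N, 1.0 / N)
    log_Q = np.log(np.where(Q > 0, Q, 1.0))
    for _ in range(max_iter):
        q = p @ Q                                          # Eqn. (45)
        log_q = np.log(np.where(q > 0, q, 1.0))
        d = np.sum(np.where(Q > 0, Q * (log_Q - log_q), 0.0), axis=1)
        if np.max(d) - p @ d < tol:                        # same sandwich bound
            break
        w = p * np.exp(d / gamma)                          # Eqn. (46)
        p = w / w.sum()
    q = p @ Q
    d = np.sum(np.where(Q > 0, Q * (log_Q - np.log(np.where(q > 0, q, 1.0))), 0.0), axis=1)
    return (p @ d) / np.log(2), p
```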

8 Dual Approach: Geometric Programming

In this section we take a dual approach to problem (10) and show that the dual problem reduces to a simple geometric program [4]. We also derive several useful upper bounds on the channel capacity from the dual program. First, we rewrite the mutual information functional as

I(X; Y) = H(Y) - H(Y|X) = -\sum_j q_j \log q_j + \sum_i p_i r_i,   (57)

where

r_i = \sum_j Q_{ij} \log Q_{ij},   (58)

subject to

q = pQ,   (59)

p \mathbf{1}^T = 1, \quad p \ge 0.   (60)

Hence the optimization problem (10) may be rewritten as

\max_{p, q} \; \sum_i p_i r_i - \sum_j q_j \log q_j   (61)

subject to

pQ = q, \quad p \mathbf{1}^T = 1, \quad p \ge 0.

It is to be noted that keeping two sets of optimization variables, p and q, and introducing the equality constraint pQ = q into the primal problem is a key step in deriving an explicit and simple Lagrange dual of (61).

Theorem 8.1. The Lagrange dual of the channel capacity problem (61) is given by

\min_{\alpha} \; \log \sum_{j=1}^M \exp(\alpha_j)   (62)

subject to

Q \alpha \ge r.   (63)
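The dual (62)-(63) has a smooth convex objective and affine constraints, so it can also be handed to a general-purpose solver; by weak duality (discussed next), even a non-optimal feasible iterate yields a valid upper bound on C. A minimal sketch using SciPy (our choice of tool, not the paper's), working in nats internally:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import logsumexp

def capacity_dual(Q):
    """Solve the Lagrange dual (62)-(63): minimize log-sum-exp(alpha)
    subject to Q alpha >= r. Returns (optimal value in bits, alpha)."""
    Q = np.asarray(Q, dtype=float)
    r = np.sum(Q * np.log(np.where(Q > 0, Q, 1.0)), axis=1)  # r_i, Eqn. (58)
    M = Q.shape[1]
    res = minimize(
        fun=lambda a: logsumexp(a),
        x0=np.zeros(M),
        constraints=[{"type": "ineq", "fun": lambda a: Q @ a - r}],  # Q a >= r
        method="SLSQP",
    )
    return res.fun / np.log(2), res.x
```

By strong duality, on the binary symmetric channel example this again returns ≈ 0.531 bits, matching the primal computations.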

An equivalent version of the Lagrange dual problem (62)-(63) is the following geometric program (in standard form):

\min_{z} \; \sum_{j=1}^M z_j   (64)

subject to

\prod_{j=1}^M z_j^{Q_{ij}} \ge \exp(-H(Q_i)), \quad i = 1, 2, \ldots, N, \qquad z \ge 0,

where H(Q_i) is the entropy of the i-th row of Q; the capacity is the logarithm of the optimal value of (64).

From the Lagrange dual problem, we immediately have the following bounds on the channel capacity:

Weak Duality: \log \big( \sum_{j=1}^M \exp(\alpha_j) \big) \ge C for every α satisfying Qα ≥ r.

Strong Duality: \log \big( \sum_{j=1}^M \exp(\alpha_j^*) \big) = C for the optimal dual variable α*.

8.1 Bounding From the Dual

Because the inequality constraints in the dual problem are affine in α, a dual feasible α can be obtained by finding any solution to a system of linear inequalities, and the resulting value of the dual objective provides an easily computable upper bound on the channel capacity. The following is one such non-trivial bound.

Corollary 8.2. The channel capacity is upper-bounded in terms of a maximum-likelihood receiver, which selects arg max_i Q_{ij} for each output symbol j:

C \le \log \sum_{j=1}^M \max_i Q_{ij},   (65)

which is tight iff the optimal output distribution q* is given by

q_j^* = \frac{\max_i Q_{ij}}{\sum_{k=1}^M \max_i Q_{ik}}.

As is readily apparent, the geometric-program dual (64) generates a broad class of upper bounds on capacity. These upper bounds can be effectively used to terminate an iterative optimization procedure for computing the channel capacity.
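The bound (65) costs a single pass over the channel matrix, which makes it a convenient sanity check alongside the iterative schemes of the earlier sections. A minimal sketch in the same NumPy conventions as before:

```python
import numpy as np

def ml_upper_bound(Q):
    """Closed-form capacity upper bound of Corollary 8.2, in bits:
    C <= log2( sum_j max_i Q_ij )."""
    Q = np.asarray(Q, dtype=float)
    return np.log2(np.sum(np.max(Q, axis=0)))

# For the binary symmetric channel with crossover 0.1:
# log2(0.9 + 0.9) ~ 0.848 bits, a valid (though here loose) upper bound
# on the true capacity of ~ 0.531 bits.
```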

9 Conclusion

In this report, we have surveyed various convex optimization algorithms for solving the channel capacity problem. In particular, we derived the classical Blahut-Arimoto algorithm from first principles and established the connection between proximal point algorithms and the original BA iteration. Using a proper step-size sequence, we derived an accelerated version of the BA algorithm. Finally, we considered the dual of the channel capacity problem and showed that its Lagrange dual is a geometric program (GP). The GP has been effectively utilized to derive non-trivial upper bounds on the channel capacity.

References

[1] S. Arimoto, "An algorithm for computing the capacity of arbitrary discrete memoryless channels," IEEE Transactions on Information Theory, vol. 18, no. 1, pp. 14-20, 1972.

[2] G. Matz and P. Duhamel, "Information geometric formulation and interpretation of accelerated Blahut-Arimoto-type algorithms," in Proceedings of the IEEE Information Theory Workshop, 2004, pp. 66-70.

[3] Y. Yu, "Squeezing the Arimoto-Blahut algorithm for faster convergence," arXiv preprint arXiv:0906.3849, 2009.

[4] M. Chiang and S. Boyd, "Geometric programming duals of channel capacity and rate distortion," IEEE Transactions on Information Theory, vol. 50, no. 2, pp. 245-258, 2004.

[5] C. E. Shannon, "A mathematical theory of communication," ACM SIGMOBILE Mobile Computing and Communications Review, vol. 5, no. 1, pp. 3-55, 2001.

[6] T. M. Cover and J. A. Thomas, Elements of Information Theory. John Wiley & Sons, 2012.