Convexity/Concavity of Rényi Entropy and α-Mutual Information


Siu-Wai Ho
Institute for Telecommunications Research
University of South Australia
Adelaide, SA 5095, Australia

Sergio Verdú
Dept. of Electrical Engineering
Princeton University
Princeton, NJ 08544, U.S.A.

Abstract: Entropy is well known to be Schur concave on finite alphabets. Recently, the authors have strengthened this result by showing that for any pair of probability distributions P and Q with Q majorized by P, the entropy of Q is larger than the entropy of P by at least the relative entropy D(P‖Q). This result applies to P and Q defined on countable alphabets. This paper shows the counterpart of this result for the Rényi entropy and the Tsallis entropy. Lower bounds on the difference in the Rényi (or Tsallis) entropy are given in terms of a new divergence which is related to the Rényi (or Tsallis) divergence. This paper also considers a notion of generalized mutual information, namely α-mutual information, which is defined through the Rényi divergence. Its convexity/concavity for different ranges of α is shown. A sufficient condition for Schur concavity is discussed, and upper bounds on α-mutual information are given in terms of the Rényi entropy.

I. INTRODUCTION

The Rényi entropy H_α(P) of a probability distribution P of order α was introduced in [1]. The concavity of H_α(P) for P defined on a finite support was shown in [2]. If α ∈ (0, 1], H_α(P) is strictly concave in P and it is also Schur concave [3]. When α > 1, the Rényi entropy is neither convex nor concave in general, but it is pseudoconcave. Since all concave functions and pseudoconcave functions are quasiconcave, the Rényi entropy is Schur concave for α > 0 [4].

It is well known that the Shannon entropy of finitely valued random variables is a Schur-concave function [4]. This result was strengthened and generalized to countably infinite random variables in [5], which shows that if a probability distribution Q is majorized by a probability distribution P, then their difference in entropy is lower bounded by the relative entropy D(P‖Q). This motivates the search for strengthened results on the Schur-concavity of the Rényi entropy.

In order to strengthen the Schur-concavity of the Rényi entropy, this paper investigates the Schur-concavity of the Tsallis entropy. The Tsallis entropy was introduced by Tsallis [6] to study multifractals. The same paper showed that the Tsallis entropy of order β is concave for β > 0 and convex for β < 0. The Schur-concavity of the Tsallis entropy was discussed in [7]. The Tsallis divergence of order β was introduced in [8]; it is also known as the Hellinger divergence of order β [9]. Indeed, the Tsallis divergence is an f-divergence with f(x) = (x − x^β)/(1 − β), and the Tsallis entropy is the corresponding entropy functional −Σ_i f(P(i)) for any probability distribution P.

This paper also considers the convexity of a notion of generalized mutual information, namely α-mutual information. This notion of mutual information was introduced by Sibson [10] in the discrete case, as the information radius of order α [11]. Alternative notions of what could be called Rényi mutual information were proposed by Csiszár [12] and Arimoto [13]. The α-mutual information is closely related to Gallager's exponent function and plays an important role in a number of applications in channel capacity problems (e.g., [14]).

In Section II, the Schur concavity of the Rényi entropy and the Tsallis entropy is revisited. A new divergence is defined, and new bounds on entropy in terms of the new divergence are derived.
The α-mutual information is defined and discussed in Section III. Its concavity, convexity, Schur concavity and upper bounds under different settings are illustrated. Due to space limits, some of the proofs are omitted.

For simplicity, only countable alphabets are considered in this paper. All log and exp have the same arbitrary base. We write P ≪ Q to denote that the support of P is a subset of the support of Q; P ≪̸ Q means that P ≪ Q does not hold.

II. SCHUR-CONCAVITY OF RÉNYI ENTROPY AND TSALLIS ENTROPY

Definition 1: For any probability distribution P, the Rényi entropy of order α ∈ (0, 1) ∪ (1, ∞) is defined as

H_α(P) = (1/(1−α)) log Σ_{i∈A} P^α(i),   (1)

where A is the support of P. By convention, H_1(P) is the Shannon entropy of P.

Definition 2: For any probability distributions P and Q, let A be the union of the supports of P and Q. The Rényi divergence [1] of order α ∈ (0, 1) ∪ (1, ∞) is defined as

D_α(P‖Q) = (1/(α−1)) log Σ_{i∈A} P^α(i) Q^{1−α}(i),   (2)

which becomes infinite if α > 1 and P ≪̸ Q.
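As a quick numerical companion to Definitions 1 and 2, the following minimal Python sketch (not part of the paper; the function names are mine and natural logarithms are assumed) evaluates the Rényi entropy and the Rényi divergence of finitely supported distributions, including the convention at α = 1 and the infinite-divergence case for α > 1.

```python
import numpy as np

def renyi_entropy(p, alpha):
    """Rényi entropy of order alpha in nats, per Definition 1."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                       # restrict to the support of P
    if np.isclose(alpha, 1.0):         # alpha = 1: Shannon entropy by convention
        return -float(np.sum(p * np.log(p)))
    return float(np.log(np.sum(p ** alpha)) / (1.0 - alpha))

def renyi_divergence(p, q, alpha):
    """Rényi divergence D_alpha(P||Q) in nats, per Definition 2."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    if alpha > 1 and np.any((p > 0) & (q == 0)):
        return np.inf                  # infinite if alpha > 1 and P is not << Q
    mask = p > 0                       # terms with P(i) = 0 contribute nothing for alpha > 0
    if np.isclose(alpha, 1.0):
        if np.any(q[mask] == 0):
            return np.inf
        return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))
    s = np.sum(p[mask] ** alpha * q[mask] ** (1.0 - alpha))
    return float(np.log(s) / (alpha - 1.0))
```

For instance, renyi_divergence([0.5, 0.5], [1.0, 0.0], 2.0) returns infinity, in line with the remark after (2).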

Definition 3: Given P_X, P_{Y|X} and Q_{Y|X}, the conditional Rényi divergence of order α ∈ (0, 1) ∪ (1, ∞) is

D_α(P_{Y|X} ‖ Q_{Y|X} | P_X)   (3)
 = D_α(P_{Y|X} P_X ‖ Q_{Y|X} P_X)   (4)
 = (1/(α−1)) log Σ_{x∈X, y∈Y} P^α_{Y|X}(y|x) Q^{1−α}_{Y|X}(y|x) P_X(x),   (5)

where X and Y are the supports of P_X and P_Y, respectively.

The Tsallis entropy is related to the Rényi entropy through the following one-to-one function. For α ∈ (0, 1) ∪ (1, ∞) and z > 0, let

ϕ_α(z) = (exp((1−α)z) − 1)/(1−α)   (6)

and

ϕ_α^{−1}(v) = (1/(1−α)) log(1 + (1−α)v).   (7)

Definition 4: For any probability distribution P, the Tsallis entropy of order β ∈ (0, 1) ∪ (1, ∞) is defined as

S_β(P) = ϕ_β(H_β(P)) = (1/(1−β)) ( Σ_{i∈A} P^β(i) − 1 ),   (8)

where A is the support of P.

Definition 5: For any probability distributions P and Q, let A be the support of P. The Tsallis divergence of order β ∈ (0, 1) ∪ (1, ∞) is defined as

T_β(P‖Q) = ϕ_{2−β}(D_β(P‖Q))   (9)
 = (1/(β−1)) ( Σ_{i∈A} P^β(i) Q^{1−β}(i) − 1 ),   (10)

which becomes infinite if β > 1 and P ≪̸ Q.

Definition 6: For any probability distributions P and Q and any strictly concave function φ : [0, 1] → ℝ, let

∆_φ(P(i), Q(i)) = (P(i) − Q(i)) φ′(Q(i)) + φ(Q(i)) − φ(P(i)),   (11)

where φ′(x) is the derivative of φ(x) and i ∈ A, the union of the supports of P and Q. Define the divergence

W_φ(P‖Q) = Σ_{i∈A} ∆_φ(P(i), Q(i)).   (12)

Lemma 1: For any probability distributions P and Q and any strictly concave function φ : [0, 1] → ℝ,

∆_φ(P(i), Q(i)) ≥ 0   (13)

and

W_φ(P‖Q) ≥ 0,   (14)

with equality if and only if P = Q.

In the following, we fix β ∈ (0, 1) ∪ (1, ∞) and consider the strictly concave function

φ_β(x) = (x^β − x)/(1 − β).   (15)

Its derivative is denoted by φ′_β(x) = (βx^{β−1} − 1)/(1 − β). For 0 ≤ x ≤ 1, φ′_β(x) is a decreasing function. Note that the Tsallis divergence can be seen as an f-divergence with f(x) = −φ_β(x), and the Tsallis entropy is equal to Σ_i φ_β(P(i)).

Without loss of generality, we assume that P and Q are defined on the positive integers and labeled in decreasing probabilities, i.e.,

P(i) ≥ P(i + 1),   (16)
Q(i) ≥ Q(i + 1).   (17)

We say that Q is majorized by P if for all k = 1, 2, . . .

Σ_{i=1}^{k} Q(i) ≤ Σ_{i=1}^{k} P(i).   (18)

If Q is majorized by P, then P ≪ Q. If a real-valued functional f is such that f(P) ≤ f(Q) whenever Q is majorized by P, we say that f is Schur-concave. The following result generalizes [5, Theorem 3].

Theorem 1: For β ∈ (0, 1) ∪ (1, ∞), if Q is majorized by P, then

S_β(Q) ≥ S_β(P) + W_{φ_β}(P‖Q).   (19)

Proof: We first consider the case where Q (and therefore P) takes values on a finite integer set {1, 2, . . . , M}. Then

S_β(Q) − S_β(P) − W_{φ_β}(P‖Q)   (20)
 = Σ_i φ_β(Q(i)) − Σ_i φ_β(P(i)) − W_{φ_β}(P‖Q)   (21)
 = Σ_i φ′_β(Q(i)) (Q(i) − P(i))   (22)
 = Σ_{i=1}^{M} (Q(i) − P(i)) Σ_{k=i}^{M−1} [φ′_β(Q(k)) − φ′_β(Q(k+1))] + Σ_{i=1}^{M} (Q(i) − P(i)) φ′_β(Q(M))   (23)
 = Σ_{k=1}^{M−1} [φ′_β(Q(k)) − φ′_β(Q(k+1))] Σ_{i=1}^{k} (Q(i) − P(i))   (24)
 ≥ 0,   (25)

where (24) uses Σ_{i=1}^{M} (Q(i) − P(i)) = 0, and (25) follows from the fact that φ′_β(x) is a decreasing function and Q is majorized by P.

Consider now the case where Q(i) is non-zero for all positive integers i. We assume that S_β(Q) is finite, as otherwise there is nothing to prove. We need to take M → ∞ in the foregoing expressions. Although the last term on the right side of (23) can now be negative, it vanishes as M → ∞ due to the finiteness of S_β(Q):

Σ_{i=1}^{M} (Q(i) − P(i)) φ′_β(Q(M))
 ≥ −Σ_{i=M+1}^{∞} Q(i) φ′_β(Q(M))   (26)
 ≥ −Σ_{i=M+1}^{∞} Q(i) φ′_β(Q(i))   (27)
 = (1/(β−1)) Σ_{i=M+1}^{∞} ( βQ^β(i) − Q(i) ),   (28)

where (27) follows from the fact that φ′_β(x) is a decreasing function and (17). Therefore, the Tsallis entropy is a Schur-concave function.
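To make the reconstructed quantities above concrete, here is a small Python sketch (not from the paper; the function names, the particular P and Q, and the finite-support assumption are mine) that implements φ_β, ∆_φ, W_φ and the Tsallis entropy as in (8), (11), (12) and (15), and checks the inequality (19) of Theorem 1 on one majorized pair.

```python
import numpy as np

def phi_beta(x, beta):
    """phi_beta(x) = (x**beta - x) / (1 - beta), strictly concave on [0, 1] (eq. (15))."""
    return (x ** beta - x) / (1.0 - beta)

def dphi_beta(x, beta):
    """Derivative phi'_beta(x) = (beta * x**(beta - 1) - 1) / (1 - beta); assumes x > 0."""
    return (beta * x ** (beta - 1.0) - 1.0) / (1.0 - beta)

def tsallis_entropy(p, beta):
    """S_beta(P) = (sum_i P(i)**beta - 1) / (1 - beta) (eq. (8))."""
    p = np.asarray(p, dtype=float)
    return float((np.sum(p[p > 0] ** beta) - 1.0) / (1.0 - beta))

def w_phi_beta(p, q, beta):
    """W_phi(P||Q) = sum_i [(P(i)-Q(i)) phi'(Q(i)) + phi(Q(i)) - phi(P(i))] (eqs. (11)-(12)).
    Assumes P and Q are given on a common support with Q(i) > 0 for every i."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum((p - q) * dphi_beta(q, beta) + phi_beta(q, beta) - phi_beta(p, beta)))

def is_majorized_by(q, p):
    """True if Q (after sorting both in decreasing order) is majorized by P (eq. (18))."""
    qs, ps = np.sort(q)[::-1], np.sort(p)[::-1]
    return bool(np.all(np.cumsum(qs) <= np.cumsum(ps) + 1e-12))

# Illustration of Theorem 1: S_beta(Q) >= S_beta(P) + W_{phi_beta}(P||Q) when Q is majorized by P.
P = np.array([0.4, 0.35, 0.15, 0.1])
Q = np.array([0.35, 0.3, 0.2, 0.15])
beta = 0.6
assert is_majorized_by(Q, P)
print(tsallis_entropy(Q, beta), tsallis_entropy(P, beta) + w_phi_beta(P, Q, beta))  # ~1.80 >= ~1.73
```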

Due to the one-to-one correspondence between the Rényi entropy and the Tsallis entropy through (8), Theorem 1 leads to the following bounds. If 0 < α < 1, then

ϕ_α(H_α(Q)) − ϕ_α(H_α(P)) ≥ W_{φ_α}(P‖Q).   (29)

If α > 1, then

ϕ_α(H_α(P)) − ϕ_α(H_α(Q)) ≤ −W_{φ_α}(P‖Q).   (30)

These bounds are sufficient to show that the Rényi entropy is Schur-concave. They can further be used to obtain bounds on H_α(Q) − H_α(P) in terms of W_{φ_α}(P‖Q) as follows.

Theorem 2: For P defined on a countable alphabet, the Rényi entropy H_α(P) is a Schur-concave function for α ∈ (0, 1) ∪ (1, ∞). Furthermore, if Q is majorized by P, then:

a) If 0 < α < 1, then in nats

H_α(Q) − H_α(P) ≥ W_{φ_α}(P‖Q) / |A_Q|^{1−α}   (31)
 ≥ W_{φ_α}(P‖Q) / |A_Q|   (32)
 ≥ 0,   (33)

where A_Q is the support of Q.

b) If α > 1, then in nats

H_α(Q) − H_α(P) ≥ ϕ_α^{−1}(W_{φ_α}(P‖Q))   (34)
 ≥ W_{φ_α}(P‖Q)   (35)
 ≥ 0.   (36)

It would be interesting to obtain an inequality similar to [5, Theorem 3] for the Rényi entropy. However, the following example shows that it is not possible.

Example 1: Consider P = {0.4, 0.35, 0.15, 0.1} and Q = {0.3, 0.3, 0.3, 0.1}, so that Q is majorized by P. If α = 0.6, then H_α(Q) − H_α(P) > 0 but H_α(Q) − H_α(P) < D_α(P‖Q).
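The following short sketch evaluates Example 1 with the Rényi helpers given after Definition 2; the particular values of P, Q and α are my reading of the example (the extracted text is partly garbled), so the printed numbers are only a plausibility check.

```python
import numpy as np

# Reuses renyi_entropy and renyi_divergence from the sketch after Definition 2.
P = np.array([0.4, 0.35, 0.15, 0.1])
Q = np.array([0.3, 0.3, 0.3, 0.1])   # Q is majorized by P
alpha = 0.6

gap = renyi_entropy(Q, alpha) - renyi_entropy(P, alpha)   # H_alpha(Q) - H_alpha(P)
div = renyi_divergence(P, Q, alpha)                       # D_alpha(P||Q)
print(gap, div)   # gap ~ 0.039 nats > 0, yet gap < div ~ 0.042 nats
```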

To end this section, we illustrate the meaning of ∆_φ(P(i), Q(i)) and how W_φ(P‖Q) is related to relative entropy and Tsallis divergence.

[Fig. 1. An illustration of φ(x) and ∆_φ in case (a) P(i) < Q(i) and case (b) Q(i) < P(i).]

Consider the tangent to φ(x) at x = Q(i), as shown in Fig. 1. Lemma 1 shows that the value of this tangent at P(i) is always at least φ(P(i)), and the gap equals ∆_φ(P(i), Q(i)).

Theorem 3: For any probability distributions P and Q with P ≪ Q and φ(x) = −x log x,

D(P‖Q) = W_φ(P‖Q),   (37)

with the convention 0 log 0 = 0. Furthermore, if φ_β(x) = (x^β − x)/(1 − β), then

Σ_{i∈A} Q^{1−β}(i) ∆_{φ_β}(P(i), Q(i)) = T_β(P‖Q),   (38)

where A is the support of Q, and hence for 0 < β < 1,

T_β(P‖Q) ≤ W_{φ_β}(P‖Q).   (39)

III. α-MUTUAL INFORMATION

Definition 7: Let P_X → P_{Y|X} → P_Y, where the input and output alphabets are X and Y, respectively. The α-mutual information, with α ∈ (0, 1) ∪ (1, ∞), is

I_α(X; Y) = min_{Q_Y} D_α(P_{Y|X} ‖ Q_Y | P_X)   (40)
 = (α/(α−1)) log Σ_{y∈Y} ( Σ_{x∈X} P_X(x) P^α_{Y|X=x}(y) )^{1/α}.   (41)

By convention, I_1(X; Y) is the (Shannon) mutual information.

Some properties of I_α(X; Y) related to the parameter α are summarized in the following theorem.

Theorem 4: For any fixed P_X on X and P_{Y|X} : X → Y,

a) I_α(X; Y) is continuous in α on (0, ∞), and

lim_{α→0} I_α(X; Y) = −log sup_{y∈Y} Σ_{x: P_{Y|X}(y|x)>0} P_X(x),   (42)

lim_{α→1} I_α(X; Y) = I(X; Y),   (43)

lim_{α→∞} I_α(X; Y) = log Σ_{y∈Y} sup_{x: P_X(x)>0} P_{Y|X}(y|x).   (44)

b) I_α(X; Y) is increasing in α.

Proof: a) The limit in (42) can be obtained by using the L_∞ norm. For any ε > 0, there exists a sufficiently small α > 0 such that

Σ_{x: P_{Y|X}(y|x)>0} P_X(x) − ε < Σ_{x∈X} P_X(x) P^α_{Y|X=x}(y)   (45)
 < Σ_{x: P_{Y|X}(y|x)>0} P_X(x).   (46)

Since ε > 0 is arbitrary, taking the L_∞ norm over y gives

lim_{α→0} ( Σ_{y∈Y} ( Σ_{x∈X} P_X(x) P^α_{Y|X=x}(y) )^{1/α} )^{α}   (47)
 = sup_{y∈Y} Σ_{x: P_{Y|X}(y|x)>0} P_X(x),   (48)

so the limit in (42) follows. The limits in (43) and (44) can be easily verified by using L'Hôpital's rule. Due to (43), I_α(X; Y) is continuous in α on (0, ∞).

b) For α < β, denote Q_β = argmin_{Q_Y} D_β(P_{Y|X} ‖ Q_Y | P_X), so that

I_α(X; Y) ≤ D_α(P_{Y|X} ‖ Q_β | P_X)   (49)
 ≤ D_β(P_{Y|X} ‖ Q_β | P_X)   (50)
 = I_β(X; Y),   (51)

where (49) and (50) follow from (40) and [15, Theorem 3], respectively.

Provided that C_0f > 0, Theorem 4 intriguingly reveals that the zero-error capacity of a discrete memoryless channel with feedback [16] is equal to

C_0f = sup_{P_X} I_0(X; Y),   (52)

where I_0 is defined as the continuous extension of I_α at α = 0.

Next, some upper bounds on α-mutual information in terms of the Rényi entropy are illustrated. These bounds provide some insight about how to generalize conditional entropy.

Theorem 5: For any given P_X, P_{Y|X} and 0 < α < ∞,

H_α(P_X) = I_α(X; X) ≥ I_α(P_X, P_{Y|X}).   (53)

Proof: It is easy to verify the equality in (53). The inequality can be seen as follows. Let U = X, so that X − U − Y forms a Markov chain. The inequality in (53) then follows from [14, Theorem 5.2] for α ≠ 1 and from [17] for α = 1.

The difference between the two sides of (53) is a good candidate for the conditional Rényi entropy, for which several candidates have been suggested in the past (e.g., [12], [13]). Since H_α(P_X) is decreasing in α [2], the following corollary holds.

Corollary 6: For any given P_X, P_{Y|X} and 0 < α ≤ 1,

H_α(P_X) ≥ I(X; Y).   (54)

The following result shows the convexity or concavity of the conditional Rényi divergence, which proves to be useful for investigating the properties of α-mutual information.

Theorem 7: Given P_X, P_{Y|X} and Q_{Y|X}, the conditional Rényi divergence D_α(P_{Y|X} ‖ Q_{Y|X} | P_X) is

a) concave in P_X for α ≥ 1, and quasiconcave in P_X for 0 < α < 1;

b) convex in (P_{Y|X}, Q_{Y|X}) for 0 < α ≤ 1, and quasiconvex in (P_{Y|X}, Q_{Y|X}) for 0 < α < ∞.

Proof: a) For 0 < α < 1, (1/(α−1)) log a is a decreasing function of a, and since the argument of the logarithm in (5) is linear in P_X, the quasiconcavity in P_X can be seen from (5). For α > 1, (1/(α−1)) log a is concave and increasing in a, so the concavity in P_X also follows from (5). Part b) is a consequence of (4) together with the joint convexity (for α ≤ 1) and joint quasiconvexity (for all α) of the Rényi divergence established in [15].

Theorem 8: Let P_X → P_{Y|X} → P_Y. For fixed P_{Y|X},

a) I_α(P_X, P_{Y|X}) = I_α(X; Y) is concave in P_X for α ≥ 1.

b) I_α(P_X, P_{Y|X}) = I_α(X; Y) is quasiconcave in P_X for 0 < α < 1.

Proof: Consider 0 < λ < 1. We first prove part a). I_1(X; Y) is known to be concave in P_X for fixed P_{Y|X} [17]. Consider any probability distributions P_{X0} and P_{X1}, and let

Q* = argmin_{Q_Y} D_α(P_{Y|X} ‖ Q_Y | λP_{X0} + (1−λ)P_{X1}).   (55)

Theorem 7 can be applied to show part a) as follows:

I_α(λP_{X0} + (1−λ)P_{X1}, P_{Y|X}) = D_α(P_{Y|X} ‖ Q* | λP_{X0} + (1−λ)P_{X1})   (56)
 ≥ λ D_α(P_{Y|X} ‖ Q* | P_{X0}) + (1−λ) D_α(P_{Y|X} ‖ Q* | P_{X1})   (57)
 ≥ λ min_{Q_Y} D_α(P_{Y|X} ‖ Q_Y | P_{X0}) + (1−λ) min_{Q_Y} D_α(P_{Y|X} ‖ Q_Y | P_{X1}).   (58)

Similarly, part b) can also be shown by using Theorem 7.

Example 2: Although I_α(P_X, P_{Y|X}) is concave in P_X for α ≥ 1, it can be shown that in general it is not Schur concave, by considering λ = 0.3, α = 2, P_{X0}(0) = 1 − P_{X0}(1) = 0.4, P_{X1}(0) = 1 − P_{X1}(1) = 0.5, and a suitably chosen binary transition matrix P_{Y|X}. However, Schur concavity is preserved if P_{Y|X} describes a symmetric channel, that is, all the rows of the probability transition matrix P_{Y|X} are permutations of other rows, and so are the columns. Since all symmetric quasiconcave functions are Schur concave [4], Theorem 8 implies the following corollary.

Corollary 9: For any fixed symmetric P_{Y|X}, I_α(P_X, P_{Y|X}) = I_α(X; Y) is Schur concave in P_X for 0 < α < ∞.
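The closed form (41) is easy to evaluate directly. The sketch below (illustrative only; the input distribution, the channel matrix and the helper name alpha_mutual_information are my own choices, and natural logarithms are used) computes Sibson's I_α and illustrates the monotonicity in α stated in Theorem 4b.

```python
import numpy as np

def alpha_mutual_information(p_x, w, alpha):
    """Sibson's I_alpha(X;Y) via the closed form (41), in nats.
    p_x: input distribution of length |X|; w: |X| x |Y| row-stochastic transition matrix."""
    p_x, w = np.asarray(p_x, dtype=float), np.asarray(w, dtype=float)
    if np.isclose(alpha, 1.0):
        p_y = p_x @ w                                   # output distribution
        with np.errstate(divide="ignore", invalid="ignore"):
            t = np.where(w > 0, w * np.log(w / p_y), 0.0)
        return float(np.sum(p_x[:, None] * t))          # Shannon mutual information
    inner = p_x @ (w ** alpha)                          # sum_x P_X(x) P_{Y|X}(y|x)^alpha, per y
    return float((alpha / (alpha - 1.0)) * np.log(np.sum(inner ** (1.0 / alpha))))

# Monotonicity in alpha (Theorem 4b), illustrated on a binary asymmetric channel.
p_x = np.array([0.4, 0.6])
w = np.array([[0.9, 0.1],
              [0.2, 0.8]])
vals = [alpha_mutual_information(p_x, w, a) for a in (0.3, 0.7, 1.0, 2.0, 5.0)]
print(vals)   # non-decreasing in alpha
```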

We now fix P_X and consider the behavior of I_α(P_X, ·).

Theorem 10: Let P_X → P_{Y|X} → P_Y. For fixed P_X,

a) I_α(P_X, P_{Y|X}) = I_α(X; Y) is convex in P_{Y|X} for 0 < α ≤ 1.

b) I_α(P_X, P_{Y|X}) = I_α(X; Y) is quasiconvex in P_{Y|X} for 0 < α < ∞.

Proof: Consider 0 < λ < 1. We first prove part a). I_1(X; Y) is known to be convex in P_{Y|X} [17]. Consider any conditional probability distributions P_{Y0|X} and P_{Y1|X}, and let

Q_i = argmin_{Q_Y} D_α(P_{Yi|X} ‖ Q_Y | P_X)   (60)

for i = 0 or 1. Due to 0 < α ≤ 1 and Theorem 7,

I_α(P_X, λP_{Y1|X} + (1−λ)P_{Y0|X})   (61)
 = min_{Q_Y} D_α(λP_{Y1|X} + (1−λ)P_{Y0|X} ‖ Q_Y | P_X)   (62)
 ≤ D_α(λP_{Y1|X} + (1−λ)P_{Y0|X} ‖ λQ_1 + (1−λ)Q_0 | P_X)   (63)
 ≤ λ D_α(P_{Y1|X} ‖ Q_1 | P_X) + (1−λ) D_α(P_{Y0|X} ‖ Q_0 | P_X)   (64)
 = λ I_α(P_X, P_{Y1|X}) + (1−λ) I_α(P_X, P_{Y0|X}).   (65)

Similarly, part b) can be verified by using Theorem 7.

In Theorems 8 and 10, we have made different assumptions on the range of α. The following examples show that those assumptions are not superfluous.

Example 3: Let λ = 0.3, P_{X0}(0) = 1 − P_{X0}(1) = 1, P_{X1}(0) = 1 − P_{X1}(1) = 0.5, and let P_{Y|X} be a suitably chosen binary transition matrix. Then I_{0.1}(P_X, P_{Y|X}) is not concave in P_X and I_{0.3}(P_X, P_{Y|X}) is not convex in P_X.

Example 4: Let λ = 0.3, P_X(0) = 1 − P_X(1) = 0.5, and let P_{Y0|X} and P_{Y1|X} be suitably chosen binary transition matrices. Then I_2(P_X, P_{Y|X}) is not concave in P_{Y|X} and I_3(P_X, P_{Y|X}) is not convex in P_{Y|X}.

Although Examples 3 and 4 show some unpleasant behaviour of I_α(P_X, P_{Y|X}), the Arimoto-Blahut algorithm [18] can still be applied to maximize I_α(P_X, P_{Y|X}) for α > 0. The following theorem, motivated by [14], illustrates that some optimization problems involving I_α(P_X, P_{Y|X}) can be converted into convex optimization problems.

Theorem 11: Consider α ∈ (0, 1) ∪ (1, ∞). Let

f_α(P_X, P_{Y|X}) = ψ_α(I_α(P_X, P_{Y|X}))   (68)
 = Σ_{y∈Y} ( Σ_{x∈X} P_X(x) P^α_{Y|X=x}(y) )^{1/α},   (69)

where ψ_α in (68) is such that ψ_α(z) is a monotonically increasing function of z.

a) For a fixed P_{Y|X},

sup_{P_X} I_α(P_X, P_{Y|X}) = ψ_α^{−1}( sup_{P_X} f_α(P_X, P_{Y|X}) ),   (70)

where f_α(P_X, P_{Y|X}) is concave in P_X.

b) For a fixed P_X,

inf_{P_{Y|X}} I_α(P_X, P_{Y|X}) = ψ_α^{−1}( inf_{P_{Y|X}} f_α(P_X, P_{Y|X}) ),   (71)

where f_α(P_X, P_{Y|X}) is convex in P_{Y|X}.

Proof: Since ψ_α(z) is a monotonically increasing function of z, (70) and (71) follow from the relation in (68). The function f_α(P_X, P_{Y|X}) in (69) is concave in P_X because z^{1/α} is concave in z. The function f_α(P_X, P_{Y|X}) in (69) is convex in P_{Y|X} due to the Minkowski inequality for α > 1 and the reverse Minkowski inequality for α < 1.

Remark: Note that convexity/concavity were mistakenly reversed in [14, Theorem 5].

REFERENCES

[1] A. Rényi, "On measures of entropy and information," Proc. Fourth Berkeley Symp. on Mathematical Statistics and Probability, 1961.
[2] M. Ben-Bassat and J. Raviv, "Rényi's entropy and the probability of error," IEEE Transactions on Information Theory, vol. 24, no. 3, May 1978.
[3] Y. He, A. B. Hamza, and H. Krim, "A generalized divergence measure for robust image registration," IEEE Transactions on Signal Processing, vol. 51, no. 5, 2003.
[4] A. W. Marshall, I. Olkin, and B. C. Arnold, Inequalities: Theory of Majorization and Its Applications. Springer, 2011.
[5] S.-W. Ho and S. Verdú, "On the interplay between conditional entropy and error probability," IEEE Transactions on Information Theory, vol. 56, no. 12, 2010.
[6] C. Tsallis, "Possible generalization of Boltzmann-Gibbs statistics," Journal of Statistical Physics, vol. 52, no. 1-2, 1988.
[7] S. Furuichi, K. Yanagi, and K. Kuriyama, "Fundamental properties of Tsallis relative entropy," Journal of Mathematical Physics, vol. 45, no. 12, 2004.
[8] C. Tsallis, "Generalized entropy-based criterion for consistent testing," Physical Review E, vol. 58, no. 2, p. 1442, 1998.
[9] F. Liese and I. Vajda, "On divergences and informations in statistics and information theory," IEEE Transactions on Information Theory, vol. 52, no. 10, Oct. 2006.
[10] R. Sibson, "Information radius," Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete, vol. 14, no. 2, 1969.
[11] S. Verdú, "α-mutual information," 2015 Information Theory and Applications Workshop, 2015.
[12] I. Csiszár, "Generalized cutoff rates and Rényi's information measures," IEEE Transactions on Information Theory, vol. 41, no. 1, 1995.
[13] S. Arimoto, "Information measures and capacity of order α for discrete memoryless channels," Proc. Coll. Math. Soc. János Bolyai, 1975.
[14] Y. Polyanskiy and S. Verdú, "Arimoto channel coding converse and Rényi divergence," 48th Annual Allerton Conference on Communication, Control, and Computing, 2010.
[15] T. van Erven and P. Harremoës, "Rényi divergence and Kullback-Leibler divergence," IEEE Transactions on Information Theory, vol. 60, no. 7, July 2014.
[16] C. E. Shannon, "The zero error capacity of a noisy channel," IRE Transactions on Information Theory, vol. 2, no. 3, pp. 8-19, 1956.
[17] T. M. Cover and J. A. Thomas, Elements of Information Theory, 2nd ed. John Wiley & Sons, 2012.
[18] S. Arimoto, "Computation of random coding exponent functions," IEEE Transactions on Information Theory, vol. 22, no. 6, 1976.
