Computation of Csiszár's Mutual Information of Order α
Damianos Karakos, Sanjeev Khudanpur and Carey E. Priebe
Department of Electrical and Computer Engineering and Center for Language and Speech Processing,
Department of Applied Mathematics and Statistics,
Johns Hopkins University, Baltimore, MD 21218, USA
{damianos, khudanpur, cep}@jhu.edu

Abstract — Csiszár introduced the mutual information of order $\alpha$ in [1] as a parameterized version of Shannon's mutual information. It involves a minimization over the probability space, and it cannot be computed in closed form. An alternating minimization algorithm for its computation is presented in this paper, along with a proof of its convergence. Furthermore, it is proved that the algorithm is an instance of the Csiszár–Tusnády iterative minimization procedure [2].

I. INTRODUCTION

Csiszár [1] defined the mutual information of order $\alpha$ between two discrete random variables $X, Y$ (with joint distribution $P_{XY}$) as

$$ I_\alpha(X;Y) = \min_{Q} \sum_y P_Y(y)\, D_\alpha\big(P_{X|Y}(\cdot\,|\,y)\,\big\|\,Q(\cdot)\big), \qquad (1) $$

where $\alpha > 0$, $\alpha \neq 1$, and

$$ D_\alpha(P\|Q) = \frac{1}{\alpha - 1} \log \sum_x P(x)^{\alpha}\, Q(x)^{1-\alpha} \qquad (2) $$

is the Rényi divergence of order $\alpha$ [3] between distributions $P, Q$. The limit of (2), as $\alpha \to 1$, is the well-known Kullback–Leibler information divergence. Furthermore, we adopt the same convention as in [1]: if $\alpha > 1$ and $Q(x) = 0$ for some $x$, then $P(x)^{\alpha} Q(x)^{1-\alpha}$ is equal to $0$ or $+\infty$ when $P(x) = 0$ or $P(x) > 0$, respectively.

It can be proved [1] that $I_\alpha(X;Y)$ retains most of the properties of $I(X;Y)$ (i.e., it is a non-negative, continuous, and concave function of $P_X$, and, when $\alpha < 1$, it is convex with respect to $P_{X|Y}$). Also, as $\alpha \to 1$, $I_\alpha(X;Y)$ converges to $I(X;Y)$.

Observe that (2) is computed by exponentiating the probability measures. If the exponents are strictly less than unity (i.e., $0 < \alpha < 1$), this results in smoothed values, lessening possible extreme variations in $P(x), Q(x)$. Smoothing is important in cases where the distributions are computed from limited amounts of data; sparsity typically leads to underestimates of the least likely events, and to overestimates of the most likely events. Such phenomena are prevalent when the underlying spaces are very large compared to the sample size; see, for instance, [4]–[7] for various treatments of the sparsity problem in natural language processing. This fact partly explains the improvement in performance of the text categorization algorithms of [8], [9], when Shannon's mutual information was replaced with (1) as a criterion for clustering sparse empirical distributions.

There is no analytic form for the minimizer of the right-hand side of (1). The next section describes an alternating minimization algorithm which converges to the desired minimum.

Use of notation: Random variables (r.v.) are denoted by capital letters, while their realizations are denoted by the corresponding lowercase letters. All random variables are assumed to lie in discrete finite spaces, denoted by the corresponding calligraphic letter (e.g., $x \in \mathcal{X}$). The probability mass function (pmf) of a random quantity is denoted using the appropriate subscript, e.g., $P_X$ for r.v. $X$, or $P_{XY}$ for the joint pmf of $X, Y$. The conditional pmf of $X$ given $Y$ is denoted by $P_{X|Y}$; the letter $W$ is also used to denote a conditional pmf. Finally, for the rest of the paper it will be assumed that $0 < \alpha < 1$.
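To make definition (2) concrete, the following minimal Python sketch (not part of the original paper; the function name, the natural-log base, and the test distributions are our own choices) computes the Rényi divergence for $0 < \alpha < 1$ and checks numerically that it approaches the Kullback–Leibler divergence as $\alpha \to 1$:

```python
import numpy as np

def renyi_divergence(p, q, alpha):
    """Rényi divergence D_alpha(p || q) of equation (2), natural-log base.

    Valid for 0 < alpha < 1; terms with p(x) = 0 or q(x) = 0 then
    contribute zero to the sum, so they are simply masked out.
    """
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = (p > 0) & (q > 0)
    s = np.sum(p[mask] ** alpha * q[mask] ** (1.0 - alpha))
    return np.log(s) / (alpha - 1.0)

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.4, 0.4, 0.2])
kl = np.sum(p * np.log(p / q))            # Kullback-Leibler divergence
print(renyi_divergence(p, q, 0.999), kl)  # nearly equal as alpha -> 1
```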
II. THE ALGORITHM

The algorithm for the computation of (1) is described in Figure 1; it takes as input the joint distribution $P_{XY}$ and a threshold $\gamma > 0$ (the latter is typically a small number, and is used in the termination condition). The algorithm is based on the following identity (proved in Section III): for any distributions $P, Q$ and $0 < \alpha < 1$,

$$ D_\alpha(P\|Q) = \frac{\alpha}{1-\alpha}\, D(P^*\|P) + D(P^*\|Q), \qquad (3) $$

where

$$ P^*(x) = \frac{P(x)^{\alpha}\, Q(x)^{1-\alpha}}{\sum_{x'} P(x')^{\alpha}\, Q(x')^{1-\alpha}}. $$

Applying (3) to each term of (1), with $P^*(\cdot|y)$ denoting the normalization of $(P_{X|Y}(\cdot|y))^{\alpha}(Q(\cdot))^{1-\alpha}$ for each $y$, we obtain

$$ I_\alpha(X;Y) = \min_{Q} \sum_y P_Y(y) \Big( D\big(P^*(\cdot|y)\,\big\|\,Q\big) + \frac{\alpha}{1-\alpha}\, D\big(P^*(\cdot|y)\,\big\|\,P_{X|Y}(\cdot|y)\big) \Big), $$

which, by virtue of Lemma 1 of Section III, can be written as

$$ I_\alpha(X;Y) = \min_{Q} \min_{W} \sum_y P_Y(y) \Big( D\big(W(\cdot|y)\,\big\|\,Q\big) + \frac{\alpha}{1-\alpha}\, D\big(W(\cdot|y)\,\big\|\,P_{X|Y}(\cdot|y)\big) \Big). \qquad (4) $$

The double minimization of (4) is achieved through the alternating minimization of the algorithm of Figure 1, as is proved in Theorem 1 of Section III.

Algorithm for computing Csiszár's mutual information of order $\alpha$.
Input: joint distribution $P_{XY}$, $0 < \alpha < 1$, and threshold $\gamma > 0$.
Initialization: $t = 0$; $Q_0(x) = \sum_y P_Y(y)\, P_{X|Y}(x|y)$; $I_0 = \infty$.
Loop: Repeat
    $t = t + 1$
    $P_t^*(x|y) = C_t(y)\, (P_{X|Y}(x|y))^{\alpha}\, (Q_{t-1}(x))^{1-\alpha}$
    $Q_t(x) = \sum_y P_Y(y)\, P_t^*(x|y)$
    $I_t = \sum_y P_Y(y) \big( \tfrac{\alpha}{1-\alpha}\, D(P_t^*(\cdot|y)\,\|\,P_{X|Y}(\cdot|y)) + D(P_t^*(\cdot|y)\,\|\,Q_t) \big)$
Until $I_{t-1} - I_t < \gamma$
Output: $I_t$.

Fig. 1. Alternating minimization algorithm for computing $I_\alpha(X;Y)$. $C_t(y)$ is just a normalizing constant, computed for each conditioning $y$.
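The following NumPy sketch transcribes Figure 1 directly (our own code, not from the paper; it assumes a strictly positive joint pmf so that all logarithms are finite, and stores the joint distribution as an array with rows indexed by $y$ and columns by $x$):

```python
import numpy as np

def csiszar_mutual_information(P_XY, alpha, gamma=1e-12, max_iter=100000):
    """Alternating minimization of Figure 1 for I_alpha(X;Y), 0 < alpha < 1.

    P_XY: joint pmf, rows indexed by y, columns by x; assumed strictly positive.
    """
    P_XY = np.asarray(P_XY, float)
    P_Y = P_XY.sum(axis=1)                  # marginal P_Y(y)
    P_XgY = P_XY / P_Y[:, None]             # conditional P_{X|Y}(x|y)
    Q = P_Y @ P_XgY                         # initialization: Q_0 = P_X
    c = alpha / (1.0 - alpha)
    I_prev = np.inf
    for _ in range(max_iter):
        # P*_t(x|y) = C_t(y) P_{X|Y}(x|y)^alpha Q_{t-1}(x)^(1-alpha)
        Pstar = P_XgY ** alpha * Q ** (1.0 - alpha)
        Pstar /= Pstar.sum(axis=1, keepdims=True)
        Q = P_Y @ Pstar                     # Q_t(x) = sum_y P_Y(y) P*_t(x|y)
        D_P = np.sum(Pstar * np.log(Pstar / P_XgY), axis=1)
        D_Q = np.sum(Pstar * np.log(Pstar / Q), axis=1)
        I = float(P_Y @ (c * D_P + D_Q))    # I_t
        if I_prev - I < gamma:              # termination condition
            return I
        I_prev = I
    return I_prev

# Example: for alpha close to 1, I_alpha should approach Shannon's I(X;Y).
P = np.array([[0.30, 0.10],
              [0.05, 0.55]])
print(csiszar_mutual_information(P, alpha=0.999))
P_X, P_Y = P.sum(axis=0), P.sum(axis=1)
print(np.sum(P * np.log(P / np.outer(P_Y, P_X))))  # Shannon I(X;Y) in nats
```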
III. CONVERGENCE OF THE ALGORITHM

First, we will prove identity (3). Given distributions $P, Q$ and a scalar $0 < \alpha < 1$, we define

$$ P^*(x) = C^{-1}\, P(x)^{\alpha}\, Q(x)^{1-\alpha}, $$

where $C = \sum_x P(x)^{\alpha} Q(x)^{1-\alpha}$ is just a normalizing constant. Then, we have the following chain of equalities:

$$
\begin{aligned}
D(P^*\|Q) &= \sum_x P^*(x) \log\Big( \frac{1}{C} \Big(\frac{P(x)}{Q(x)}\Big)^{\alpha} \Big) \\
&= -\log C + \alpha \sum_x P^*(x) \log \frac{P(x)}{Q(x)} \\
&= -\log C + \alpha \big( D(P^*\|Q) - D(P^*\|P) \big) \\
&= (1-\alpha)\, D_\alpha(P\|Q) + \alpha \big( D(P^*\|Q) - D(P^*\|P) \big). \qquad (5)
\end{aligned}
$$

By re-arranging terms in (5), we obtain (3).

Now, for fixed distributions $P_Y, P_{X|Y}$, let us define the functional

$$ J(W, Q) = \sum_y P_Y(y) \Big( D\big(W(\cdot|y)\,\big\|\,Q\big) + \frac{\alpha}{1-\alpha}\, D\big(W(\cdot|y)\,\big\|\,P_{X|Y}(\cdot|y)\big) \Big), \qquad (6) $$

which is a function of a distribution $Q$ and a conditional distribution $W$. The algorithm of Figure 1 performs an alternating update of $Q, W$. As is proved in the next two lemmas, this alternating update can only reduce $J(W, Q)$ (alternating minimization). Then, convergence to the global minimum is guaranteed by the fact that $Q, W$ lie in convex (probability) spaces, and that $J$ is a convex function of its arguments $Q, W$.

Lemma 1: For a fixed conditional distribution $W$, the functional $J(W, Q)$ is minimized by $Q^*(x) = \sum_y P_Y(y)\, W(x|y)$.

Proof: Only the first term in the right-hand side of (6) needs to be minimized. It can be easily checked that this term is equal to $D(P_Y W \,\|\, P_Y \times Q)$, and hence the proof follows immediately from Lemma 13.8.1 of [10].

Alternative proof: Let $Q$ be an arbitrary pmf on $\mathcal{X}$, and $Q^*$ as specified in the statement of the lemma. Then,

$$
\begin{aligned}
\sum_y P_Y(y)\, D(W(\cdot|y)\,\|\,Q) - \sum_y P_Y(y)\, D(W(\cdot|y)\,\|\,Q^*)
&= \sum_y P_Y(y) \sum_x \Big( W(x|y) \log\frac{W(x|y)}{Q(x)} - W(x|y) \log\frac{W(x|y)}{Q^*(x)} \Big) \\
&= \sum_x Q^*(x) \log\frac{Q^*(x)}{Q(x)} \;\ge\; 0.
\end{aligned}
$$

Hence, the first term in the right-hand side of (6) is minimized by $Q^*$, as required.

Lemma 2: For a fixed distribution $Q$, the functional $J(W, Q)$ is minimized by

$$ W^*(x|y) = C(y)\, (P_{X|Y}(x|y))^{\alpha}\, (Q(x))^{1-\alpha}, $$

where $C(y) = \big( \sum_x (P_{X|Y}(x|y))^{\alpha} (Q(x))^{1-\alpha} \big)^{-1}$ is a normalizing constant.

Proof: It suffices to find the minimizer of the expression

$$ K\big(W(\cdot|y), Q\big) = D\big(W(\cdot|y)\,\|\,Q\big) + \frac{\alpha}{1-\alpha}\, D\big(W(\cdot|y)\,\|\,P_{X|Y}(\cdot|y)\big) $$

for any given $y$. This will imply minimization of $J(W, Q)$ as well. Notice that $K(W(\cdot|y), Q)$ is a convex function of $W(\cdot|y)$, for all $y$, by virtue of the convexity of the KL-divergence [10]. Hence, it has a unique minimizer, which can be found using Lagrange multipliers, by solving the equations

$$ \frac{\partial}{\partial W(x|y)} \Big( K\big(W(\cdot|y), Q\big) + \lambda_y \big( \textstyle\sum_{x'} W(x'|y) - 1 \big) \Big) = 0 \qquad (7) $$

for all $x, y$, under the usual constraint $\sum_x W(x|y) = 1$. Note that there are $|\mathcal{X}|\,|\mathcal{Y}|$ equalities implied in Equation (7). The left-hand side of (7) is equal to

$$
\begin{aligned}
&1 + \log W(x|y) - \log Q(x) + \frac{\alpha}{1-\alpha} \big( 1 + \log W(x|y) - \log P_{X|Y}(x|y) \big) + \lambda_y \\
&= \frac{1}{1-\alpha} \Big( 1 + \log W(x|y) - (1-\alpha) \log Q(x) - \alpha \log P_{X|Y}(x|y) \Big) + \lambda_y \\
&= \frac{1}{1-\alpha} \Big( 1 + \log W(x|y) - \log\big( (P_{X|Y}(x|y))^{\alpha} (Q(x))^{1-\alpha} \big) \Big) + \lambda_y. \qquad (8)
\end{aligned}
$$

By setting (8) equal to zero and solving for $W(x|y)$, we obtain

$$ W(x|y) = \exp\big( -1 - (1-\alpha)\lambda_y \big)\, (P_{X|Y}(x|y))^{\alpha}\, (Q(x))^{1-\alpha}, $$

which, together with the constraint $\sum_x W(x|y) = 1$, gives us the required result.

Alternative proof: Minimizing $K(W, Q)$ is equivalent to minimizing

$$ (1-\alpha)\, K(W, Q) = (1-\alpha)\, D(W\|Q) + \alpha\, D(W\|P) \qquad (9) $$

with respect to $W$, for fixed $Q, P$ (here we write $W, P$ for $W(\cdot|y), P_{X|Y}(\cdot|y)$, suppressing the dependence on $y$). Then,

$$ (1-\alpha)\, D(W\|Q) + \alpha\, D(W\|P) = \sum_x W(x) \log \frac{C\, W(x)}{C\, Q(x)^{1-\alpha} P(x)^{\alpha}} = \sum_x W(x) \log \frac{W(x)}{W^*(x)} - \log C, \qquad (10) $$

where $C = \sum_x Q(x)^{1-\alpha} P(x)^{\alpha}$ and the first expression in (10) is the non-negative Kullback–Leibler distance between the distribution $W$ and the distribution $W^*(x) = C^{-1} Q(x)^{1-\alpha} P(x)^{\alpha}$. Hence (10) is minimized when $W = W^*$, as required.
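Identity (3) and the minimizer of Lemma 2 are easy to check numerically. The sketch below (our own illustration; the names and the random test distributions are arbitrary) draws random pmfs, verifies that $D_\alpha(P\|Q)$ equals $D(P^*\|Q) + \frac{\alpha}{1-\alpha} D(P^*\|P)$, and confirms that $P^*$ achieves a lower value of the objective $K$ of Lemma 2 than a random competitor:

```python
import numpy as np

rng = np.random.default_rng(0)

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

alpha = 0.3
P = rng.dirichlet(np.ones(5))
Q = rng.dirichlet(np.ones(5))

# P*(x) = C^{-1} P(x)^alpha Q(x)^(1-alpha), with C the normalizing sum
unnorm = P ** alpha * Q ** (1.0 - alpha)
C = unnorm.sum()
Pstar = unnorm / C

D_alpha = np.log(C) / (alpha - 1.0)                        # definition (2)
rhs = alpha / (1.0 - alpha) * kl(Pstar, P) + kl(Pstar, Q)  # identity (3)
print(np.isclose(D_alpha, rhs))                            # True

# Lemma 2: P* minimizes K(W) = D(W||Q) + alpha/(1-alpha) D(W||P)
def K(W):
    return kl(W, Q) + alpha / (1.0 - alpha) * kl(W, P)

W_rand = rng.dirichlet(np.ones(5))
print(K(Pstar) <= K(W_rand))                               # True
```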
Theorem 1: The alternating minimization algorithm of Figure 1 converges to $I_\alpha(X;Y)$, i.e., it computes the expression

$$ \min_{Q} \min_{W} J(W, Q), $$

where $J(W, Q)$ was defined in (6).

Proof: It is easy to establish that $J(W, Q)$ is a convex function of the pair $(Q, W)$, as it is a linear combination of KL-divergences having $Q, W$ as their arguments. Also, $Q$ and $W$ belong to convex probability sets. Thus, the sufficient condition of [11] is satisfied, which implies that the alternating minimization algorithm of Figure 1 converges to the (unique) global minimum of $J(W, Q)$.

IV. CONNECTION TO THE CSISZÁR–TUSNÁDY ALTERNATING MINIMIZATION ALGORITHM

In their paper [2], Csiszár and Tusnády show that an iterative algorithm that successively computes projections between two convex sets eventually converges, under certain conditions, to the minimum distance between the two sets. In this section, we prove that these conditions are satisfied for the functional $J$ of (6), thus establishing that the alternating minimization algorithm of Section II is an instance of the algorithm of [2]. This provides an alternative (albeit more complicated) proof of the convergence of the algorithm.

Let $\mathcal{P}, \mathcal{Q}$ represent two abstract convex and closed sets, and let $d : \mathcal{P} \times \mathcal{Q} \to \mathbb{R} \cup \{+\infty\}$ be an extended real-valued function. The projection of a member of one set onto the other set is defined as follows:

$$ P \to Q' \quad \text{iff} \quad d(P, Q') = \min_{Q \in \mathcal{Q}} d(P, Q), $$
$$ Q \to P' \quad \text{iff} \quad d(P', Q) = \min_{P \in \mathcal{P}} d(P, Q), $$

where $P, P' \in \mathcal{P}$ and $Q, Q' \in \mathcal{Q}$. The alternating minimization algorithm of [2] computes a sequence of members of $\mathcal{P}$ and $\mathcal{Q}$ satisfying $Q_0 \to P_1 \to Q_1 \to P_2 \to \cdots$. For a given function $\delta$, we have the following definitions (from [2]):

Definition 1: The three points property holds for $P \in \mathcal{P}$ if, whenever $Q_0 \to P_1$,

$$ d(P, Q_0) \ge \delta(P, P_1) + d(P_1, Q_0). $$

Definition 2: The four points property holds for $P \in \mathcal{P}$, $Q \in \mathcal{Q}$ if, whenever $P_1 \to Q_1$,

$$ d(P, Q) + \delta(P, P_1) \ge d(P, Q_1). $$

The following theorem of [2] establishes convergence of the iterative minimization procedure to the minimum distance between the two sets:

Theorem 2: If the three points property and the four points property hold for all $P \in \mathcal{P}$, $Q \in \mathcal{Q}$, then

$$ \lim_{n \to \infty} d(P_n, Q_n) = \min_{P \in \mathcal{P},\, Q \in \mathcal{Q}} d(P, Q). $$

We will now show that the sequence of updates $Q_0 \to W_1 \to Q_1 \to W_2 \to \cdots$, as expressed in the algorithm of Figure 1, leads to convergence to the minimum of $J(W, Q)$. Note that $W \in \mathcal{P}_{X|Y}$, the set of conditional pmfs of $X$ given $Y$, and $Q \in \mathcal{P}_X$, where $\mathcal{P}_X$ is the set of probability measures on $\mathcal{X}$. Furthermore, $\delta$ is the KL-divergence function.

Theorem 3: $J$ satisfies the four points property for all $W \in \mathcal{P}_{X|Y}$ and all $Q \in \mathcal{P}_X$.

Proof: We have

$$
\begin{aligned}
J(W, Q_1) - J(W, Q) &= \sum_y P_Y(y) \big( D(W(\cdot|y)\,\|\,Q_1) - D(W(\cdot|y)\,\|\,Q) \big) \\
&= \sum_x (P_Y W)(x) \log \frac{Q(x)}{Q_1(x)} \\
&= D(P_Y W\,\|\,Q_1) - D(P_Y W\,\|\,Q) \qquad (11) \\
&= D(P_Y W\,\|\,P_Y W_1) - D(P_Y W\,\|\,Q) \qquad (12) \\
&\le \delta(P_Y W, P_Y W_1), \qquad (13)
\end{aligned}
$$

where, in (11), $P_Y W$ is the $\mathcal{X}$-marginal of the joint distribution $W P_Y$, and, in (12), $Q_1$ was replaced by $P_Y W_1$ by virtue of Lemma 1.
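The inequality of Theorem 3 can be probed numerically: draw random $W$ and $Q$, compute $W_1$ from $Q$ via the Lemma 2 update and $Q_1 = P_Y W_1$ via Lemma 1, and confirm that $J(W, Q_1) \le J(W, Q) + D(P_Y W \,\|\, P_Y W_1)$. A sketch (our own; all names and test distributions are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
alpha = 0.4
c = alpha / (1.0 - alpha)
nx, ny = 4, 3

P_Y = rng.dirichlet(np.ones(ny))
P_XgY = rng.dirichlet(np.ones(nx), size=ny)   # row y holds P_{X|Y}(.|y)

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

def J(W, Q):                                  # functional (6)
    D_Q = np.sum(W * np.log(W / Q), axis=1)
    D_P = np.sum(W * np.log(W / P_XgY), axis=1)
    return float(P_Y @ (D_Q + c * D_P))

for _ in range(1000):
    W = rng.dirichlet(np.ones(nx), size=ny)
    Q = rng.dirichlet(np.ones(nx))
    W1 = P_XgY ** alpha * Q ** (1.0 - alpha)  # Lemma 2 update from Q
    W1 /= W1.sum(axis=1, keepdims=True)
    Q1 = P_Y @ W1                             # Lemma 1 update from W1
    assert J(W, Q1) <= J(W, Q) + kl(P_Y @ W, P_Y @ W1) + 1e-12

print("four points property held on 1000 random samples")
```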
Theorem 4: $J$ satisfies the three points property for all $W \in \mathcal{P}_{X|Y}$.

Proof: Let $\mu \in [0, 1]$ and $W^{(\mu)} = \mu W + (1-\mu) W_1$. It suffices to show that

$$ J(W^{(\mu)}, Q_0) - J(W_1, Q_0) \ge D(P_Y W^{(\mu)}\,\|\,P_Y W_1) \qquad (14) $$

for all $\mu \in [0, 1]$ (and, hence, for $\mu = 1$). First, note that, when $\mu = 0$, the left-hand side and the right-hand side of (14) are both equal to zero. Hence, it suffices to prove that

$$ \frac{\partial J(W^{(\mu)}, Q_0)}{\partial \mu} \ge \frac{\partial D(P_Y W^{(\mu)}\,\|\,P_Y W_1)}{\partial \mu}, \quad \forall \mu. $$

We have

$$
\begin{aligned}
\frac{\partial}{\partial \mu} D(P_Y W^{(\mu)}\,\|\,P_Y W_1)
&= \sum_x \big( (P_Y W)(x) - (P_Y W_1)(x) \big) \log \frac{(P_Y W^{(\mu)})(x)}{(P_Y W_1)(x)} \\
&= \frac{1}{\mu} \Big( D(P_Y W^{(\mu)}\,\|\,P_Y W_1) + D(P_Y W_1\,\|\,P_Y W^{(\mu)}) \Big), \qquad (15)
\end{aligned}
$$

where $\mu$ is assumed non-zero in (15). Note that $\partial D(P_Y W^{(\mu)}\,\|\,P_Y W_1)/\partial \mu = 0$ when $\mu = 0$ (the same is true for $\partial J(W^{(\mu)}, Q_0)/\partial \mu$, as can be seen below). Next, we have the following:

$$
\begin{aligned}
\frac{\partial J(W^{(\mu)}, Q_0)}{\partial \mu}
&= \sum_y P_Y(y) \sum_x \big( W(x|y) - W_1(x|y) \big) \log \frac{\big( W^{(\mu)}(x|y) \big)^{1/(1-\alpha)}}{Q_0(x)\, \big( P_{X|Y}(x|y) \big)^{\alpha/(1-\alpha)}} \\
&= \frac{1}{1-\alpha} \sum_y P_Y(y) \sum_x \big( W(x|y) - W_1(x|y) \big) \log \frac{W^{(\mu)}(x|y)}{W_1(x|y)} \qquad (16) \\
&= \frac{1}{\mu(1-\alpha)} \sum_y P_Y(y) \Big( D\big( W^{(\mu)}(\cdot|y)\,\|\,W_1(\cdot|y) \big) + D\big( W_1(\cdot|y)\,\|\,W^{(\mu)}(\cdot|y) \big) \Big) \\
&\ge \frac{1}{\mu} \Big( D(P_Y W^{(\mu)}\,\|\,P_Y W_1) + D(P_Y W_1\,\|\,P_Y W^{(\mu)}) \Big), \qquad (17)
\end{aligned}
$$

where the update equation of Lemma 2 was used in (16) and $\mu > 0$ was assumed in (17); the inequality in (17) follows from $1/(1-\alpha) \ge 1$ and the convexity of the KL-divergence, which gives $\sum_y P_Y(y)\, D(W(\cdot|y)\,\|\,W'(\cdot|y)) \ge D(P_Y W\,\|\,P_Y W')$. By combining (15) and (17), we obtain the required result.

Finally, Theorems 3 and 4 establish that the iterative algorithm of Figure 1 results in the minimization of $J$, as required.

ACKNOWLEDGMENTS

The authors would like to thank the anonymous reviewers for their comments. The alternative proofs of Lemmas 1 and 2 were supplied by the first reviewer. This work was partially supported by National Science Foundation grant No. CCF.

REFERENCES

[1] I. Csiszár, "Generalized cutoff rates and Rényi's information measures," IEEE Trans. on Information Theory, vol. 41, no. 1, pp. 26–34, January 1995.
[2] I. Csiszár and G. Tusnády, "Information geometry and alternating minimization procedures," Statistics and Decisions, Supplement Issue 1, pp. 205–237, 1984.
[3] A. Rényi, "On measures of entropy and information," in Proc. 4th Berkeley Symposium on Math. Statist. Probability, vol. 1, 1961, pp. 547–561.
[4] F. Jelinek, Statistical Methods for Speech Recognition. MIT Press, 1997.
[5] S. F. Chen and J. Goodman, "An empirical study of smoothing techniques for language modeling," in Proceedings of the 34th Annual Meeting of the ACL, 1996, pp. 310–318.
[6] D. Karakos and S. Khudanpur, "Language modeling with the maximum likelihood set: Complexity issues and the back-off formula," in Proc. 2006 IEEE Intern. Symposium on Information Theory (ISIT-06), Seattle, WA, July 2006.
[7] ——, "Error bounds and improved probability estimation using the maximum likelihood set," in Proc. 2007 IEEE Intern. Symposium on Information Theory (ISIT-07), Nice, France, July 2007.
[8] D. Karakos, J. Eisner, S. Khudanpur, and C. E. Priebe, "Cross-instance tuning of unsupervised document clustering algorithms," in Proc. 2007 Conference of the North American Chapter of the Assoc. for Computational Linguistics (NAACL-HLT 2007), April 2007.
[9] N. Slonim, N. Friedman, and N. Tishby, "Unsupervised document classification using sequential information maximization," in Proc. SIGIR'02, 25th ACM Int. Conf. on Research and Development of Inform. Retrieval, 2002.
[10] T. Cover and J. Thomas, Elements of Information Theory. John Wiley and Sons, 1991.
[11] R. W. Yeung and T. Berger, "Multi-way alternating minimization," in Proc. IEEE Int. Symposium on Information Theory (ISIT-1995), September 1995.