Maximization of the information divergence from the multinomial distributions
Jozef Juríček
Charles University in Prague, Faculty of Mathematics and Physics, Department of Probability and Mathematical Statistics

Supervisor: Ing. František Matúš, CSc.
Academy of Sciences of the Czech Republic, Institute of Information Theory and Automation, Department of Decision-Making Theory

Abstract. The explicit solution of the problem of maximization of the information divergence from the family of multinomial distributions is presented, using a result of N. Ay and A. Knauf for the problem of maximization of multi-information, which is a special case of maximization of the information divergence from hierarchical models. The problem studied in this paper is a generalization of the binomial case, which was solved in [3]. The problem of maximization of information divergence from an exponential family has emerged in probabilistic models for evolution and learning in neural networks that are based on infomax principles. The maximizers admit an interpretation as stochastic systems with high complexity w.r.t. the exponential family.

1 Introduction

Let $\mu, \nu$ be nonzero measures on a finite set $Z$ and let $f : Z \to \mathbb{R}^d$. Let $\mathcal{E} = \mathcal{E}_{\mu,f} = \{Q_{\mu,f,\vartheta} : \vartheta \in \mathbb{R}^d\}$ be the (full) exponential family determined by the reference measure $\mu$ and the directional statistic $f$, where $Q_{\mu,f,\vartheta}$ is the probability measure (pm) given by

$$Q_{\mu,f,\vartheta}(z) = e^{\langle \vartheta, f(z)\rangle - \Lambda_{\mu,f}(\vartheta)}\,\mu(z), \qquad z \in Z,$$

where $\langle\cdot,\cdot\rangle$ denotes the scalar product and

$$\Lambda_{\mu,f}(\vartheta) = \ln \sum_{z \in Z} e^{\langle \vartheta, f(z)\rangle}\,\mu(z).$$

The information divergence (relative entropy; Kullback-Leibler divergence) of a pm $P$ (on $Z$) from $\nu$ is

$$D(P\,\|\,\nu) = \begin{cases} \sum_{z \in s(P)} P(z) \ln \dfrac{P(z)}{\nu(z)}, & s(P) \subseteq s(\nu), \\ +\infty, & \text{otherwise,} \end{cases}$$

where $s(\cdot)$ is the support function, i.e. $s(\nu) = \{z \in Z : \nu(z) > 0\}$. The information divergence of a pm $P$ (on $Z$) from the exponential family $\mathcal{E}$ is defined by

$$D(P\,\|\,\mathcal{E}) = \inf_{Q \in \mathcal{E}} D(P\,\|\,Q).$$

This work studies the maximization of the function $P \mapsto D(P\,\|\,\mathcal{M})$, where $\mathcal{M}$ is a family of multinomial distributions (which is the closure of an exponential family). This problem is a generalization of the binomial case, which was solved in [3].

Problem 1.1 (Maximization of divergence from the multinomial family). Let $N$ be the number of identical and independent trials and $n$ the number of possible outcomes in each trial. Let $p_j$ be the probability of realization of the $j$-th outcome ($\sum_{j=1}^n p_j = 1$, $p_1,\dots,p_n \in [0;1]$) in each trial. The multinomial distribution (with parameters $N, n, p_1,\dots,p_n$) is then the joint distribution of the numbers of realizations of the outcomes over all $N$ trials. Let $Z := \{z = (z_1,\dots,z_n) \in \{0,1,\dots,N\}^n : \sum_{j=1}^n z_j = N\}$ be the state space of (random variables with) multinomial distributions (with $N, n$ fixed). Let $\mathcal{P}$ be the set of all pm's on $Z$ and $\mathcal{M}$ the set of all multinomial distributions (with $N, n$ fixed). Finally, let $\mathcal{P}_+$ be the set of all strictly positive pm's on $Z$ and $\mathcal{M}_+ := \mathcal{M} \cap \mathcal{P}_+$. The problem is to calculate $\sup_{P \in \mathcal{P}} D(P\,\|\,\mathcal{M})$ and to find every $P_{\sup}$ and $P^*_{\sup}$ such that $D(P_{\sup}\,\|\,P^*_{\sup}) = D(P_{\sup}\,\|\,\mathcal{M}) = \sup_{P \in \mathcal{P}} D(P\,\|\,\mathcal{M})$.

AMS 2000 Math. Subject Classification. Primary 94A17. Secondary 62B10, 60A10, 52A20.
Keywords and phrases. Kullback-Leibler divergence, relative entropy, exponential family, hierarchical models, multinomial distribution, information projection, log-Laplace transform, cumulant generating function.
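The following is a small numerical sketch, not part of the paper, illustrating Problem 1.1: it evaluates $D(P\,\|\,\mathcal{M})$ for a given pm $P$ on $Z$ by minimizing $D(P\,\|\,Q_p)$ over the multinomial parameter $p$. All helper names and the use of SciPy's Nelder-Mead minimizer are illustrative choices, not the paper's method.

```python
# Numerical sketch (not from the paper): approximate D(P || M) for a pm P on Z
# by minimizing D(P || Q_p) over the multinomial parameter p.
import itertools
import numpy as np
from math import factorial
from scipy.optimize import minimize

def state_space(N, n):
    """All z = (z_1, ..., z_n) with nonnegative integer entries summing to N."""
    return [z for z in itertools.product(range(N + 1), repeat=n) if sum(z) == N]

def multinomial_pm(N, z_list, p):
    """Q_p(z) = (N choose z) * prod_j p_j^{z_j} for every z in z_list."""
    coef = lambda z: factorial(N) / np.prod([factorial(k) for k in z])
    return np.array([coef(z) * np.prod([pj ** zj for pj, zj in zip(p, z)]) for z in z_list])

def divergence(P, Q):
    """D(P || Q); +infinity if the support of P is not contained in the support of Q."""
    mask = P > 0
    if np.any(Q[mask] == 0):
        return np.inf
    return float(np.sum(P[mask] * np.log(P[mask] / Q[mask])))

def divergence_from_family(P, N, n, z_list, restarts=5):
    """Approximate inf_p D(P || Q_p) by numerical minimization over the simplex."""
    def objective(theta):                        # softmax parametrization p = p(theta)
        w = np.exp(theta - theta.max())
        return divergence(P, multinomial_pm(N, z_list, w / w.sum()))
    starts = np.random.default_rng(0).normal(size=(restarts, n))
    return min(minimize(objective, x0, method="Nelder-Mead").fun for x0 in starts)

if __name__ == "__main__":
    N, n = 3, 2
    Z = state_space(N, n)                        # [(0, 3), (1, 2), (2, 1), (3, 0)]
    P = np.array([0.5 if z in [(3, 0), (0, 3)] else 0.0 for z in Z])
    print(divergence_from_family(P, N, n, Z), 2 * np.log(2))
```

For $N = 3$, $n = 2$ and $P = \tfrac12(\delta_{(3,0)} + \delta_{(0,3)})$ the printed value is close to $2\ln 2$, in agreement with Example 3.5 and Corollary 3.3 below.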
Example 1.2 ($N = 2$, $n = 2$). $Z = \{z_{20} = (2,0),\ z_{11} = (1,1),\ z_{02} = (0,2)\}$,

$$\mathcal{P} = \{(P(z_{20}), P(z_{11}), P(z_{02})) = (p_{20}, p_{11}, p_{02}) : (p_{20}, p_{11}, p_{02}) \in [0;1]^3,\ p_{20} + p_{11} + p_{02} = 1\},$$

$$\mathcal{M} = \{(P(z_{20}), P(z_{11}), P(z_{02})) = (p^2,\ 2p(1-p),\ (1-p)^2) : p \in [0;1]\}.$$

The situation is illustrated on Figure 1.

Figure 1: The simplex $\mathcal{P}$ (with vertices $\delta_{z_{20}}$, $\delta_{z_{11}}$, $\delta_{z_{02}}$) and the exponential family $\mathcal{M}$ for $N = 2$, $n = 2$.

Example 1.3 ($N = 3$, $n = 2$). $Z = \{z_{30} = (3,0),\ z_{21} = (2,1),\ z_{12} = (1,2),\ z_{03} = (0,3)\}$,

$$\mathcal{P} = \{(P(z_{30}), P(z_{21}), P(z_{12}), P(z_{03})) = (p_{30}, p_{21}, p_{12}, p_{03}) : (p_{30}, p_{21}, p_{12}, p_{03}) \in [0;1]^4,\ p_{30} + p_{21} + p_{12} + p_{03} = 1\},$$

$$\mathcal{M} = \{(P(z_{30}), P(z_{21}), P(z_{12}), P(z_{03})) = (p^3,\ 3p^2(1-p),\ 3p(1-p)^2,\ (1-p)^3) : p \in [0;1]\}.$$

The situation is illustrated on Figure 2.

Figure 2: The simplex $\mathcal{P}$ (with vertices $\delta_{z_{30}}$, $\delta_{z_{21}}$, $\delta_{z_{12}}$, $\delta_{z_{03}}$) and the exponential family $\mathcal{M}$ for $N = 3$, $n = 2$.

The general problem of maximization of the information divergence from an exponential family has emerged in probabilistic models for evolution and learning in neural networks based on infomax principles. Maximizers of $D(\cdot\,\|\,\mathcal{E})$ admit an interpretation as stochastic systems with high complexity w.r.t. the exponential family $\mathcal{E}$ [1].
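As a quick check, not carried out in the paper, the divergence of the point mass at $z_{11}$ from $\mathcal{M}$ in Example 1.2 can be computed directly; it already attains the value $(N-1)\ln n = \ln 2$ that Corollary 3.3 later identifies as the supremum:

```latex
D(\delta_{z_{11}} \,\|\, Q_p) = \ln\frac{1}{Q_p(z_{11})} = \ln\frac{1}{2p(1-p)},
\qquad
D(\delta_{z_{11}} \,\|\, \mathcal{M}) = \min_{p \in [0;1]} \ln\frac{1}{2p(1-p)} = \ln 2,
```

the minimum being attained at $p = \tfrac12$.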
2 Preliminaries

This section reviews some facts about exponential families and information projections. Let $\operatorname{Lin}(A)$ denote the linear span of a set $A$ of real functions on $Z$.

Lemma 2.1. Let $\mu, \nu$ be strictly positive measures on a finite set $Z$, $f : Z \to \mathbb{R}^{d_f}$, $g : Z \to \mathbb{R}^{d_g}$, and let $\mathcal{E}_{\mu,f} \supseteq \mathcal{E}_{\nu,g}$ be two exponential families. Then $\mathcal{E}_{\mu,f} = \mathcal{E}_{\nu,f}$.

Proof. Notice that $\nu/\nu(Z) = Q_{\nu,g,0} \in \mathcal{E}_{\nu,g} \subseteq \mathcal{E}_{\mu,f}$. Then there exists $\vartheta_0 \in \mathbb{R}^{d_f}$ such that

$$\frac{\nu(z)}{\nu(Z)} = Q_{\mu,f,\vartheta_0}(z) = e^{\langle\vartheta_0, f(z)\rangle - \Lambda(\vartheta_0)}\,\mu(z).$$

Now $\mu$ can be expressed as $\mu(z) = \frac{\nu(z)}{\nu(Z)}\, e^{\Lambda(\vartheta_0) - \langle\vartheta_0, f(z)\rangle}$. It can be seen that for every $\vartheta \in \mathbb{R}^{d_f}$

$$Q_{\mu,f,\vartheta}(z) = e^{\langle\vartheta, f(z)\rangle - \Lambda(\vartheta)}\,\mu(z) = \frac{e^{\langle\vartheta, f(z)\rangle + \Lambda(\vartheta_0) - \langle\vartheta_0, f(z)\rangle}\,\nu(z)}{\sum_{z' \in Z} e^{\langle\vartheta, f(z')\rangle + \Lambda(\vartheta_0) - \langle\vartheta_0, f(z')\rangle}\,\nu(z')} = e^{\langle\vartheta - \vartheta_0, f(z)\rangle - \Lambda_{\nu,f}(\vartheta - \vartheta_0)}\,\nu(z) = Q_{\nu,f,\vartheta - \vartheta_0}(z).$$

This proves $\mathcal{E}_{\mu,f} \subseteq \mathcal{E}_{\nu,f}$, and the equality follows by symmetry.

Lemma 2.2. Let $\nu$ be a nonzero measure, $f = (f_1,\dots,f_{d_f})$, $f_i : Z \to \mathbb{R}$, $i = 1,\dots,d_f$, and $g = (g_1,\dots,g_{d_g})$, $g_j : Z \to \mathbb{R}$, $j = 1,\dots,d_g$. Then

$$\mathcal{E}_{\nu,g} \subseteq \mathcal{E}_{\nu,f} \iff \operatorname{Lin}\{1, g_1,\dots,g_{d_g}\} \subseteq \operatorname{Lin}\{1, f_1,\dots,f_{d_f}\},$$
$$\mathcal{E}_{\nu,g} = \mathcal{E}_{\nu,f} \iff \operatorname{Lin}\{1, g_1,\dots,g_{d_g}\} = \operatorname{Lin}\{1, f_1,\dots,f_{d_f}\}.$$

Proof. It is easy to see that $\mathcal{E}_{\nu,f} = \mathcal{E}_{\nu,(1,f)}$. The rest follows from the fact that the exponential function is injective.

Corollary 2.3. Using the notation of Lemma 2.2 and $D_f := \dim \operatorname{Lin}\{1, f_1,\dots,f_{d_f}\} - 1$, there exists $h = (h_1,\dots,h_{D_f})$, $h_i : Z \to \mathbb{R}$, $i = 1,\dots,D_f$, such that $\mathcal{E}_{\nu,f} = \mathcal{E}_{\nu,h}$ and $\{h_i,\ i = 1,\dots,D_f\}$ are linearly independent and linearly independent with $1$ (on $Z$). Moreover, if $\mathcal{E}_{\nu,g} \subseteq \mathcal{E}_{\nu,f}$, then $\dim \operatorname{Lin}\{1, g_1,\dots,g_{d_g}\} - 1 =: D_g \le D_f$ and, for $h_g := (h_1,\dots,h_{D_g})$, it holds that $\mathcal{E}_{\nu,g} = \mathcal{E}_{\nu,h_g}$.

Proof. By Lemma 2.2 and Steinitz's exchange theorem.

The nonnegative integer $D_f$ is the dimension of the exponential family $\mathcal{E}_{\nu,f}$.

Theorem 2.4 (Uniqueness of the generalized rI-projection). For every pm $P$ (on $Z$) and every exponential family $\mathcal{E} = \mathcal{E}_{\nu,f}$ with $s(\nu) = Z$, there exists a unique pm $P^{\mathcal{E}}$ in the closure of $\mathcal{E}$ (the generalized reverse information projection; generalized rI-projection) such that $D(P\,\|\,P^{\mathcal{E}}) = D(P\,\|\,\mathcal{E})$.

Proof. For details, see [2].

3 Multinomial family

For $n, N \in \mathbb{N}$ denote $[0:N] := \{0,\dots,N\}$, $[1:n] := \{1,\dots,n\}$,

$$Z := \Big\{z = (z_1,\dots,z_n) \in [0:N]^n : \sum_{j=1}^n z_j = N\Big\},$$

and for $z \in Z$ denote $\binom{N}{z} := \frac{N!}{\prod_{j=1}^n z_j!}$. The set of all pm's on $Z$ will be denoted $\mathcal{P} := \{P = (P(z))_{z \in Z} \in [0;1]^Z : \sum_{z \in Z} P(z) = 1\}$, and the set of strictly positive pm's $\mathcal{P}_+ := \{P = (P(z))_{z \in Z} \in (0;1)^Z : \sum_{z \in Z} P(z) = 1\}$. The family of multinomial distributions (multinomial family) is the set of pm's

$$\mathcal{M} = \Big\{Q : Q(z) = \binom{N}{z} \prod_{j=1}^n p_j^{z_j},\ z \in Z;\ (p_j)_{j=1}^n = (p(j))_{j \in [1:n]} = p \in \mathcal{P}([1:n])\Big\}.$$

Denote

$$\mathcal{M}_+ = \mathcal{M} \cap \mathcal{P}_+ = \Big\{Q : Q(z) = \binom{N}{z} \prod_{j=1}^n p_j^{z_j},\ z \in Z;\ (p_j)_{j=1}^n = (p(j))_{j \in [1:n]} = p \in \mathcal{P}_+([1:n])\Big\}.$$
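The computational core of the proof of Lemma 2.1 is the reparametrization identity $Q_{\mu,f,\vartheta} = Q_{\nu,f,\vartheta-\vartheta_0}$ when $\nu/\nu(Z) = Q_{\mu,f,\vartheta_0}$. A minimal numerical sketch of this identity follows; it is not from the paper and all names are illustrative.

```python
# Numerical check of the reparametrization step used in the proof of Lemma 2.1:
# if nu is proportional to Q_{mu,f,theta0}, then Q_{mu,f,theta} = Q_{nu,f,theta-theta0}
# for every theta, so the family is unchanged when the reference measure is replaced
# by a strictly positive member of the family.
import numpy as np

rng = np.random.default_rng(1)
Zsize, d = 6, 2
f = rng.normal(size=(Zsize, d))          # directional statistic f : Z -> R^d
mu = rng.uniform(0.5, 2.0, size=Zsize)   # strictly positive reference measure

def Q(ref, theta):
    """Q_{ref,f,theta}(z) = exp(<theta, f(z)> - Lambda(theta)) * ref(z)."""
    w = np.exp(f @ theta) * ref
    return w / w.sum()

theta0 = rng.normal(size=d)
nu = Q(mu, theta0)                       # nu / nu(Z) = Q_{mu,f,theta0}

for _ in range(3):
    theta = rng.normal(size=d)
    assert np.allclose(Q(mu, theta), Q(nu, theta - theta0))
print("Q_{mu,f,theta} == Q_{nu,f,theta-theta0} for all sampled theta")
```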
It is easy to see that the multinomial family is the closure of an exponential family, $\mathcal{M} = \overline{\mathcal{E}_{\mu,f}}$ and $\mathcal{M}_+ = \mathcal{E}_{\mu,f}$, with $\mu(z) = \binom{N}{z}$ and $f(z) = z$. Its dimension is equal to $n-1$ and, for $\vartheta \in \mathbb{R}^n$ and $Q_{\mu,f,\vartheta} =: Q_p \in \mathcal{E}_{\mu,f}$, one has

$$p_j = \frac{e^{\vartheta_j}}{\sum_{k=1}^n e^{\vartheta_k}}.$$

Let $(X_1,\dots,X_N)$ be a random vector with identical marginal distributions $X_k \sim p \in \mathcal{P}([1:n])$, $k = 1,\dots,N$. Denote $V_j := |\{i \in [1:N] : X_i = j\}|$, $j = 1,\dots,n$. Then $V = (V_1,\dots,V_n) \sim Q_p \in \mathcal{M}$ if and only if $X_1,\dots,X_N$ are mutually independent.

Now the problem of maximization of $D(P\,\|\,\mathcal{M}) = D(P\,\|\,\mathcal{E}_{\mu,f})$ can be formulated in a different, equivalent way. Denote by $X = [1:n]^N$ the state space of the random vector $(X_1,\dots,X_N)$. For $x = (x_1,\dots,x_N) \in X$ and a permutation $\pi : [1:N] \to [1:N]$ let $x_\pi = (x_{\pi(1)},\dots,x_{\pi(N)})$. The set of all permutations $\pi$ on $[1:N]$ will be denoted $[1:N]!$. Denote:

$E := \{P \in \mathcal{P}(X) : P(x) = P(x_\pi),\ x \in X,\ \pi \in [1:N]!\}$ (the exchangeable pm's),
$F := \{Q \in \mathcal{P}(X) : Q(x) = \prod_{i=1}^N Q_i(x_i),\ x \in X\}$ (the factorizable pm's), where $Q_i(x_i) = \sum_{x' \in X :\, x'_i = x_i} Q(x')$, $i = 1,\dots,N$.

Finally, $E_+ := \mathcal{P}_+(X) \cap E$ and $F_+ := \mathcal{P}_+(X) \cap F$.

Lemma 3.1. With the previous notation and $X_z := \{x \in X : \forall j \in [1:n] : |\{i \in [1:N] : x_i = j\}| = z_j\}$, it holds:

(i) The mapping $h : \mathcal{P} \to E$ such that $h(P) = \tilde P$, $\tilde P(x) = P(z)/\binom{N}{z}$ for $z \in Z$ s.t. $x \in X_z$, is a bijection, $h(\mathcal{M}) = E \cap F$, and, for $h^{-1} : E \to \mathcal{P}$, the inverse of $h$, $h^{-1}(\tilde P) = P$ with $P(z) = \binom{N}{z}\,\tilde P(x)$ for any $x \in X_z$.

(ii) For any $P, Q \in \mathcal{P}$, it holds that $D(P\,\|\,Q) = D(h(P)\,\|\,h(Q))$.

(iii) For any $P \in E$ and $Q \in F \setminus (E \cap F)$, there exists $\pi \in [1:N]!$ such that, for $Q_\pi$ defined by $Q_\pi(x) = Q(x_\pi)$, it holds that $Q_\pi \ne Q$ and $D(P\,\|\,Q) = D(P\,\|\,Q_\pi)$.

(iv) For any $P \in E$: $D(P\,\|\,F) = \inf_{Q \in E \cap F} D(P\,\|\,Q)$ and $\arg\inf_{Q \in E \cap F} D(P\,\|\,Q) = P^F \in E \cap F$.

(v) $\sup_{P \in \mathcal{P}} D(P\,\|\,\mathcal{M}) = \sup_{P \in E} D(P\,\|\,E \cap F) = \sup_{P \in E} D(P\,\|\,F) \le \sup_{P \in \mathcal{P}(X)} D(P\,\|\,F)$ and

$$\arg\sup_{P \in \mathcal{P}} D(P\,\|\,\mathcal{M}) = h^{-1}\big(\arg\sup_{P \in E} D(P\,\|\,E \cap F)\big) = h^{-1}\big(E \cap \arg\sup_{P \in \mathcal{P}(X)} D(P\,\|\,F)\big),$$

the last equality holding whenever $\sup_{P \in E} D(P\,\|\,F) = \sup_{P \in \mathcal{P}(X)} D(P\,\|\,F)$.

Proof. Due to the uniqueness of the rI-projection (Theorem 2.4), (iii) implies (iv). The other propositions are straightforward.

It is well known that, for $P \in \mathcal{P}(X)$, $D(P\,\|\,F) = I(P)$, the multi-information of $P$. The problem of maximizing the multi-information over $\mathcal{P}(X)$ has an explicit solution and was solved in [1].

Theorem 3.2 (Maximizers of $D(\cdot\,\|\,F) = I(\cdot)$). The set of maximizers of $D(\cdot\,\|\,F) = I(\cdot)$ is equal to

$$\arg\sup_{P \in \mathcal{P}(X)} D(P\,\|\,F) = \Big\{P_\Pi = \frac{1}{n} \sum_{j=1}^n \delta_{(j,\,\pi_2(j),\dots,\pi_N(j))} : \Pi = (\pi_2,\dots,\pi_N) \in [1:n]!^{\,(N-1)}\Big\},$$

with $D(P_\Pi\,\|\,F) = (N-1)\ln(n)$ and $P_\Pi^F = U_X = \frac{1}{n^N}\sum_{x \in X} \delta_x$, $\Pi \in [1:n]!^{\,(N-1)}$.

Proof. For details, see [1], Theorem 4.3 and Corollary 4.10.

Denote, for $j, k, l \in [1:n]$, $k < l$, by $e_j = e_{j,j}$ the $j$-th standard unit vector of length $n$ and by $e_{k,l}$ the $0$-$1$ vector of length $n$ with ones exactly at the positions $k$ and $l$; further $\epsilon_{j,j} := \delta_{2e_{j,j}}$ and $\epsilon_{k,l} := 2\,\delta_{e_{k,l}}$.
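A small sketch, not from the paper, checks Theorem 3.2 numerically for $N = 3$, $n = 2$: the Ay-Knauf maximizer $P_\Pi$ has multi-information $D(P_\Pi\,\|\,F) = (N-1)\ln(n)$, where the multi-information is computed via the standard identity $I(P) = \sum_i H(X_i) - H(X_1,\dots,X_N)$, i.e. the divergence from the product of the marginals. Variable names are illustrative, not the paper's notation.

```python
# Check that the maximizer P_Pi of Theorem 3.2 attains multi-information (N-1) ln(n).
import itertools
import numpy as np

N, n = 3, 2
X = list(itertools.product(range(1, n + 1), repeat=N))     # X = [1:n]^N

def entropy(p):
    p = np.asarray([q for q in p if q > 0])
    return float(-(p * np.log(p)).sum())

def multi_information(P):
    """sum_i H(X_i) - H(X_1,...,X_N) for a pm P given as a dict {x: P(x)}."""
    joint = list(P.values())
    marginals = []
    for i in range(N):
        m = {}
        for x, px in P.items():
            m[x[i]] = m.get(x[i], 0.0) + px
        marginals.append(list(m.values()))
    return sum(entropy(m) for m in marginals) - entropy(joint)

# P_Pi with Pi = (Id, Id): mass 1/n on each diagonal point (j, j, ..., j)
P_Pi = {x: 0.0 for x in X}
for j in range(1, n + 1):
    P_Pi[(j,) * N] = 1.0 / n

print(multi_information(P_Pi), (N - 1) * np.log(n))        # both equal 2 ln 2 = 1.386...
```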
Corollary 3.3 (The set of maximizers of $D(\cdot\,\|\,\mathcal{M})$). Using the notation of Lemma 3.1, it holds that

$$\arg\sup_{P \in \mathcal{P}} D(P\,\|\,\mathcal{M}) = h^{-1}\big(E \cap \arg\sup_{P \in \mathcal{P}(X)} D(P\,\|\,F)\big).$$

For $N = 2$,

$$\arg\sup_{P \in \mathcal{P}} D(P\,\|\,\mathcal{M}) = \Big\{P^\pi = \frac{1}{n}\sum_{j \in [1:n]:\, j \le \pi(j)} \epsilon_{j,\pi(j)} : \pi \in [1:n]!,\ [\pi(j) = k] \Leftrightarrow [\pi(k) = j],\ j, k \in [1:n]\Big\}.$$

For $N > 2$, the only maximizer is

$$P^{\mathrm{Id}} = \frac{1}{n}\sum_{j=1}^n \delta_{N e_j}.$$

Moreover, $\sup_{P \in \mathcal{P}} D(P\,\|\,\mathcal{M}) = (N-1)\ln(n)$ and, for every maximizer $P_{\sup}$, it holds that $P^*_{\sup}(z) = \binom{N}{z}/n^N$, $z \in Z$.

Proof. To avoid trivial cases, let $n, N \ge 2$. By Lemma 3.1 and Theorem 3.2: $\sup_{P \in \mathcal{P}} D(P\,\|\,\mathcal{M}) = \sup_{P \in E} D(P\,\|\,F) \le \sup_{P \in \mathcal{P}(X)} D(P\,\|\,F) = (N-1)\ln(n)$. It is easy to see that $P_0 = P_{(\mathrm{Id},\dots,\mathrm{Id})}$ is a maximizer (on $\mathcal{P}(X)$) and even $P_0 \in E$ ($\mathrm{Id}$ is the identity mapping on $[1:n]$). To find the remaining maximizers (on $\mathcal{P}(X)$) which also belong to $E$, take another maximizer $P \in E$, $P_0 \ne P = P_{(\pi_2,\dots,\pi_N)}$, so that $\pi_i \ne \mathrm{Id}$ and $\pi_i(j) \ne j$ for some $i \in [2:N]$ and $j \in [1:n]$. Thus $(j,\dots,\pi_i(j),\dots) \in s(P)$ and (from the fact that $P \in E$) also $(\pi_i(j),\dots,j,\dots) \in s(P)$. If $N > 2$, then $(j,\dots,\pi_i(j),\dots,k,\dots) \in s(P)$ and also $(\pi_i(j),\dots,j,\dots,k,\dots) \in s(P)$ for some $k \in [1:n]$. Hence, for some $l \in [2:N]$, $\pi_l$ is not injective; but $\pi_l$ is a permutation, which is a contradiction. The rest simply follows.

In the binomial case ($n = 2$), the application of the rI-projection theorem (Theorem 2.4), of the result of N. Ay and A. Knauf in [1] (Theorem 3.2) and of Lemma 3.1 (prop. (v)) substantially simplifies the proof of the result given in [3] (see the proof of the proposition there).

Example 3.4 (Ad: Example 1.2, $N = 2$, $n = 2$).
$\arg\sup_{P \in \mathcal{P}(X)} D(P\,\|\,F) = \{\tfrac12(\delta_{11} + \delta_{22}),\ \tfrac12(\delta_{12} + \delta_{21})\}$
$\arg\sup_{P \in E} D(P\,\|\,E \cap F) = E \cap \arg\sup_{P \in \mathcal{P}(X)} D(P\,\|\,F) = \{\tfrac12(\delta_{11} + \delta_{22}),\ \tfrac12(\delta_{12} + \delta_{21})\}$
$\arg\sup_{P \in \mathcal{P}} D(P\,\|\,\mathcal{M}) = h^{-1}\big(E \cap \arg\sup_{P \in \mathcal{P}(X)} D(P\,\|\,F)\big) = \{\tfrac12(\delta_{20} + \delta_{02}),\ \delta_{11}\}$
$\sup_{P \in \mathcal{P}} D(P\,\|\,\mathcal{M}) = \ln 2$

Figure 3 illustrates how the maximization of the information divergence from the multinomial family is related to the maximization of multi-information and to Lemma 3.1, prop. (iii). Correspondingly, the situation in the simplex $\mathcal{P}$ is depicted on Figure 4(a).

Example 3.5 (Ad: Example 1.3, $N = 3$, $n = 2$).
$\arg\sup_{P \in \mathcal{P}(X)} D(P\,\|\,F) = \{\tfrac12(\delta_{111} + \delta_{222}),\ \tfrac12(\delta_{112} + \delta_{221}),\ \tfrac12(\delta_{121} + \delta_{212}),\ \tfrac12(\delta_{122} + \delta_{211})\}$
$\arg\sup_{P \in E} D(P\,\|\,E \cap F) = \{\tfrac12(\delta_{111} + \delta_{222})\}$
$\arg\sup_{P \in \mathcal{P}} D(P\,\|\,\mathcal{M}) = \{\tfrac12(\delta_{30} + \delta_{03})\}$
$\sup_{P \in \mathcal{P}} D(P\,\|\,\mathcal{M}) = 2\ln 2$

The maximization problem in the simplex $\mathcal{P}$ is illustrated on Figure 4(b).

Example 3.6 ($N = 2$, $n = 3$).
$\arg\sup_{P \in \mathcal{P}(X)} D(P\,\|\,F) = \{\tfrac13(\delta_{11} + \delta_{22} + \delta_{33}),\ \tfrac13(\delta_{11} + \delta_{23} + \delta_{32}),\ \tfrac13(\delta_{13} + \delta_{22} + \delta_{31}),\ \tfrac13(\delta_{12} + \delta_{21} + \delta_{33}),\ \tfrac13(\delta_{12} + \delta_{23} + \delta_{31}),\ \tfrac13(\delta_{13} + \delta_{21} + \delta_{32})\}$
$\arg\sup_{P \in E} D(P\,\|\,E \cap F) = \{\tfrac13(\delta_{11} + \delta_{22} + \delta_{33}),\ \tfrac13(\delta_{11} + \delta_{23} + \delta_{32}),\ \tfrac13(\delta_{13} + \delta_{22} + \delta_{31}),\ \tfrac13(\delta_{12} + \delta_{21} + \delta_{33})\}$
$\arg\sup_{P \in \mathcal{P}} D(P\,\|\,\mathcal{M}) = \{\tfrac13(\delta_{200} + \delta_{020} + \delta_{002}),\ \tfrac13\delta_{200} + \tfrac23\delta_{011},\ \tfrac13\delta_{020} + \tfrac23\delta_{101},\ \tfrac13\delta_{002} + \tfrac23\delta_{110}\}$
$\sup_{P \in \mathcal{P}} D(P\,\|\,\mathcal{M}) = \ln 3$
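A brute-force sketch, not in the paper and with illustrative names only: for $N = 2$, $n = 2$ it scans a grid over the simplex of pm's on $Z$ and a grid over the binomial parameter $p$, confirming numerically that the largest divergence found approaches $(N-1)\ln n = \ln 2$ and is attained near the maximizers of Example 3.4.

```python
# Grid search over pm's P on Z = {(2,0), (1,1), (0,2)} for the largest D(P || M), N = n = 2.
import numpy as np

ps = np.linspace(1e-6, 1.0 - 1e-6, 1001)
Qs = np.stack([ps ** 2, 2 * ps * (1 - ps), (1 - ps) ** 2], axis=1)   # Q_p for all grid p's

def divergence_from_binomials(P):
    """min over the p-grid of D(P || Q_p), with the convention 0 * ln 0 = 0."""
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = np.where(P > 0, P * np.log(P / Qs), 0.0)
    return float(terms.sum(axis=1).min())

best_val, best_P = -np.inf, None
for a in np.linspace(0.0, 1.0, 101):
    for b in np.linspace(0.0, 1.0 - a, 101):
        P = np.array([a, b, 1.0 - a - b])        # (P(2,0), P(1,1), P(0,2))
        val = divergence_from_binomials(P)
        if val > best_val:
            best_val, best_P = val, P

print(best_val, np.log(2))   # the grid maximum approaches sup D(P || M) = ln 2
print(best_P)                # close to a maximizer from Example 3.4 / Corollary 3.3
```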
Figure 3: Relation between the maximization of the information divergence and of the multi-information for $N = 2$, $n = 2$. (a) The simplex $\mathcal{P}(X)$; (b) the factorizable pm's $F$.

Figure 4: Ad Figure 1 and Figure 2: maximization in the simplex $\mathcal{P}$. (a) $N = 2$, $n = 2$; (b) $N = 3$, $n = 2$.

References

[1] Ay, N., Knauf, A. (2006). Maximizing multi-information. Kybernetika.
[2] Csiszár, I., Matúš, F. (2003). Information projections revisited. IEEE Transactions on Information Theory.
[3] Matúš, F. (2004). Maximization of information divergences from binary i.i.d. sequences. Proceedings IPMU 2004, Perugia, Italy.