Maximum Entropy and Exponential Families

April 9, 2019

Abstract

The goal of this note is to derive the exponential form of probability distributions from more basic considerations, in particular entropy. It follows a description by E. T. Jaynes in Chapter 11 of his book Probability Theory: The Logic of Science [1].^1

1 Motivating the Exponential Model

This section motivates the exponential model form that we have seen in lecture.

1.1 The Setup

The setup for our problem is that we are given a finite set of instances I and a set of m statistics (f_i, c_i), in which f_i : I -> R and c_i in R. An instance (or possible world) is just an element of a set. We can think of a statistic as a measurement of an instance: it tells us the features of that instance that are important for our model. More precisely, the only information we have about the instances is the values of f_i on these instances. Our goal is to find a probability function p : I -> [0, 1] such that

    sum_{I in I} p(I) = 1.

The main goal of this note is to provide a set of assumptions under which such distributions have a specific functional form, the exponential family, that we saw in generalized linear models:

    p_θ(I) = Z^{-1}(θ) exp{ θ^T f(I) },

in which θ in R^m and f(I) in R^m with f(I)_i = f_i(I). Notice that there is exactly one parameter for each statistic. As we will see, for discrete distributions we are able to derive this exponential form as a consequence of maximizing entropy subject to matching the statistics.^2

The problem: Too many distributions!

We consider the problem of defining a distribution from statistics (measurements). We will see that there are often many probability distributions that satisfy our constraints, and we will be forced to pick among them.^3

^1 This work is available online in many places including http://omega.albany.edu:8008/etj-ps/g.ps.
^2 Unfortunately, for continuous distributions, such a derivation does not work due to some technical issues with entropy; this hasn't stopped folks from using it as justification.
^3 Throughout this section, it will be convenient to view p and the f_j both as functions from I to R and as vectors indexed by I. Their use should be clear from the context.
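To make the vector viewpoint of footnote 3 concrete, here is a small numpy sketch (not part of the original note; the specific worlds and numbers are ours, chosen to match Example 1.1 below). Viewing p and f as vectors indexed by I, the expectation E_p[f] is just a dot product.

```python
import numpy as np

# Hypothetical finite instance set I = {I_1, I_2, I_3} with one statistic.
# Viewing p and f as vectors indexed by I, E_p[f] is the dot product f . p.
f = np.array([1.0, 2.0, 3.0])    # statistic value f(I_i) on each world
p = np.array([1/12, 1/3, 7/12])  # a candidate probability vector

assert abs(p.sum() - 1.0) < 1e-12   # p is a distribution
expectation = f @ p                  # E_p[f]
assert abs(expectation - 2.5) < 1e-12
```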
The Constraints

We interpret a statistic as a constraint on p of the following form:

    E_p[f_i] = c_i, i.e., sum_{I in I} f_i(I) p(I) = c_i.

Let us set up some notation to describe these constraints. Let N = |I|; then the probability vector we are after is p in R^N, subject to constraints. There are m constraints of the form f_j^T p = c_j for j = 1, ..., m. A single constraint of the form sum_{i=1}^N p_i = 1 ensures that p is a probability distribution; we can write this more succinctly as 1_N^T p = 1. We also have that p_i >= 0 for i = 1, ..., N. More compactly, we can write F in R^{m x N} such that row i of F is f_i^T for i = 1, ..., m. Then we can compactly write the equality constraints using a matrix G:

    G = [ 1_N^T ; F ] in R^{(m+1) x N}, so that G p = b, where b = (1, c_1, ..., c_m)^T.

If N(G) = {0}, then p is uniquely defined; indeed, if G is also square, it has an inverse and p = G^{-1} b. However, often m is much smaller than N, so that N(G) != {0} and there are many solutions that satisfy the constraints.

Example 1.1. Suppose we have three possible worlds, i.e., I = {I_1, I_2, I_3}, and one statistic f(I_i) = i with c = 2.5. Then we have:

    G = [ 1 1 1 ; 1 2 3 ]  and  N(G) = span{ (1, -2, 1)^T }.

Let p^(1) = (1/12, 1/3, 7/12); then G p^(1) = (1, 2.5)^T, but so do (infinitely) many others. In particular, q(α) = p^(1) + α (1, -2, 1)^T is valid so long as α in [-1/12, 1/6] (due to positivity).

Picking a probability distribution p

In the case N(G) != {0}, there are many probability distributions we can pick. All of these distributions can be written as follows:

    p = p^(0) + p^(1), in which p^(0) in N(G) and p^(1) satisfies G p^(1) = b.

Example 1.2. Continuing the computation above, we see that p^(0) = α (1, -2, 1)^T is a vector in N(G).

Which p should we pick? We will use a method called the method of maximum entropy. In turn, this will lead to the fact that our function p has a very special form: the form of exponential models!

1.2 Entropy

To pick among the distributions, we need some scoring method.^4 We will cut to the chase here and define the entropy, which is a function on probability distributions p in R^N such that p >= 0 and p^T 1_N = 1:

    H(p) = - sum_{i=1}^N p_i log p_i.
^4 A few natural methods don't work as we might think they should (minimizing variance, etc.). See [1, Ch. 11] for a description of these alternative approaches.
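Example 1.1 can be checked numerically. The following numpy sketch (ours, not the note's) builds G, verifies that p^(1) satisfies the constraints, and confirms that the whole family q(α) stays feasible on the stated interval:

```python
import numpy as np

# Example 1.1: G stacks the all-ones row (normalization) on top of the
# statistic row f = (1, 2, 3); the target vector is b = (1, 2.5).
G = np.array([[1.0, 1.0, 1.0],
              [1.0, 2.0, 3.0]])
b = np.array([1.0, 2.5])

p1 = np.array([1/12, 1/3, 7/12])   # one feasible distribution
v = np.array([1.0, -2.0, 1.0])     # spans N(G): G @ v == 0

assert np.allclose(G @ p1, b)
assert np.allclose(G @ v, 0.0)

# The whole family q(alpha) = p1 + alpha * v stays feasible
# for alpha in [-1/12, 1/6] (the endpoints are where positivity binds).
for alpha in [-1/12, 0.0, 1/6]:
    q = p1 + alpha * v
    assert np.allclose(G @ q, b) and (q >= -1e-12).all()
```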
Effectively, the entropy rewards one for spreading the distribution out more. One can motivate entropy from axioms, and either Jaynes or the Wikipedia page is pretty good on this account.^5 The intuition should be that entropy can be used to select the least informative prior; it is a way of making as few additional assumptions as possible. In other words, we want to encode the prior information given by the constraints on the statistics while being as objective or agnostic as possible. This is called the maximum entropy principle. For example, one can verify that under no constraints, H(p) is maximized with p_i = 1/N, that is, all alternatives have equal probability. This is what we mean by "spread out."

We will pick the distribution that maximizes entropy subject to our constraints. Mathematically, we examine:

    max_{p in R^N} H(p)  s.t.  1^T p = 1, p >= 0, and F p = c.

We will not discuss it, but under appropriate conditions there is a unique solution p.

1.3 The Lagrangian

We will create a function called the Lagrangian that has the property that any critical point of the constrained problem is a critical point of the Lagrangian. We will show that all critical points of the Lagrangian (and so of our original problem) can be written in the exponential format we described above.

To simplify our discussion, let us imagine that p > 0, i.e., there are no possible worlds I such that p(I) = 0. In this case, our problem reduces to:

    max_{p in R^N} H(p)  s.t.  F p = c and 1_N^T p = 1.

We can write the Lagrangian Λ : R^N x (R^m x R) -> R as follows:

    Λ(p; θ, λ) = H(p) + θ^T (F p - c) + λ (1^T p - 1).

The special property of Λ is that any critical point of our original problem, in particular any maximum or minimum, corresponds to a critical point of the Lagrangian. Thus, if we prove something about critical points of the Lagrangian, we prove something about the critical points of the original problem. Later in the course, we will see more sophisticated uses of Lagrangians, but for now we include a simple derivation below to give a hint of what is going on.
For this section, we will assume this special property is true. Due to that special property, we find the critical points of Λ by differentiating with respect to p_i and setting the resulting equations to 0:

    d/dp_i [ H(p) + θ^T (F p - c) + λ (1^T p - 1) ] = -(log p_i + 1) + sum_{j=1}^m θ_j f_j(I_i) + λ = -(log p_i + 1) + θ^T f(I_i) + λ.

Setting this expression equal to 0 and solving for p_i, we learn:

    p_i = e^{λ - 1} exp{ θ^T f(I_i) },

which is of the right form, except that we have one too many parameters, namely λ. Nevertheless, this is remarkable: at a critical point, it is always the case that the exponential family pops out!

^5 https://en.wikipedia.org/wiki/entropy_(information_theory)#rationale
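As a sanity check (ours, not part of the original note), we can maximize H over the one-parameter feasible family from Example 1.1 by brute-force search and confirm that the maximizer already has the exponential shape: since f(I_i) = i, exponential form p_i proportional to exp(θ i) means the entries form a geometric progression, i.e., q_2^2 = q_1 q_3.

```python
import numpy as np

p1 = np.array([1/12, 1/3, 7/12])   # a feasible distribution (Example 1.1)
v = np.array([1.0, -2.0, 1.0])     # spans N(G)

# Entropy H(q) = -sum q_i log q_i over a fine grid of the feasible family
alphas = np.linspace(-1/12 + 1e-6, 1/6 - 1e-6, 20001)
qs = p1 + alphas[:, None] * v
Hs = -(qs * np.log(qs)).sum(axis=1)
best = qs[np.argmax(Hs)]           # approx (0.116, 0.268, 0.616)

# Exponential form p_i proportional to exp(theta * i) means the entries
# are a geometric progression, so at the maximizer q_2^2 = q_1 * q_3:
assert abs(best[1]**2 - best[0] * best[2]) < 1e-3
```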
Eliminating λ

The parameter λ can be eliminated, which is the final step to match our originally claimed exponential form. To do so, we sum over all the p_i, which on one hand is equal to 1, and on the other hand can be computed from the above expression for p_i. This gives us the following equation:

    sum_{i=1}^N p_i = 1 and p_i = e^{λ - 1} exp{ θ^T f(I_i) }, thus e^{1 - λ} = sum_{i=1}^N exp{ θ^T f(I_i) }.

Thus, we have expressed λ as a function of θ, and we can eliminate it. To do so, we write:

    Z(θ) = sum_{i=1}^N exp{ θ^T f(I_i) } and p_i = Z^{-1}(θ) exp{ θ^T f(I_i) }.

This function Z is called the partition function, which we saw in lecture. The above form is the claimed exponential form that has one parameter per constraint.

2 Why the Lagrangian?

We observe that this is a constrained optimization problem with linear constraints.^6 Let r be the rank of G, so that dim(N(G)) = N - r. We create a function φ : R^{N-r} -> R such that there is a map between any point in the domain of φ and a feasible solution to our constrained problem; moreover, φ will take the same value as H. In contrast to our original constrained problem, φ has an unconstrained domain (all of R^{N-r}), and so we can apply standard calculus to find its critical points.

To that end, we define a (linear) map B in R^{N x (N-r)} that has rank N - r. We also insist that B^T B = I_{N-r}. Such a B exists: its columns are simply the first N - r columns of a change of basis matrix from the standard basis to an orthonormal basis for N(G). We define:

    φ(x) = H(B x + p^(1)), where p^(1) is a fixed vector satisfying G p^(1) = b (the vector of constraint values).

Observe that for any x in R^{N-r}, we have B x in N(G), so that G(B x + p^(1)) = G p^(1) = b, and so B x + p^(1) is feasible. Moreover, x -> B x + p^(1) is a bijection from R^{N-r} to the set of feasible solutions.^7 Importantly, φ is now unconstrained, and so any saddle point (and so any maximum or minimum) must satisfy:

    ∇_x φ(x) = 0.

Gradient Decomposition

Any critical point of H yields a critical point of φ; that is, if p = p^(0) + p^(1) is a critical point of H, then x = B^T p^(0) is a critical point of φ.
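For the running example, the remaining parameter θ can also be found numerically by solving E_θ[f] = c: since E_θ[f] is increasing in θ, simple bisection suffices. A sketch in numpy (the helper name p_theta is ours, not from the note):

```python
import numpy as np

f = np.array([1.0, 2.0, 3.0])   # statistic values f(I_i) = i (Example 1.1)
c = 2.5                          # target expectation

def p_theta(theta):
    # p_theta(I_i) = exp(theta * f_i) / Z(theta); subtract max for stability
    w = np.exp(theta * f - np.max(theta * f))
    return w / w.sum()

# E_theta[f] is increasing in theta, so bisection finds the matching theta.
lo, hi = -50.0, 50.0
for _ in range(100):
    mid = (lo + hi) / 2
    if f @ p_theta(mid) < c:
        lo = mid
    else:
        hi = mid

theta, p = lo, p_theta(lo)
print(theta, p)   # theta approx 0.834, p approx (0.116, 0.268, 0.616)
```

The resulting p matches the constrained entropy maximizer from before, with one parameter for the one statistic plus the normalization absorbed into Z(θ).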
Consider any critical point p. We can uniquely decompose the gradient as:

    ∇_p H(p) = g_0 + g_1, in which g_0 in N(G) and g_1 in N(G)^⊥.

We claim g_0 = B ∇_x φ(B^T p^(0)), or equivalently B^T g_0 = ∇_x φ(B^T p^(0)). By direct calculation,

    ∇_x φ(x) = ∇_x H(B x + p^(1)) = B^T ∇_p H(p^(0) + p^(1)) = B^T ∇_p H(p) = B^T g_0,

where the last equality holds because g_1 is orthogonal to N(G), the column space of B. A critical point of H satisfying the constraints must not change along any direction that preserves the constraints, which is to say that we must have g_0 = 0. Very roughly, the intuition is that if p were a maximum (or minimum) and g_0 were non-zero, there would be a way to strictly increase (or decrease) the function in a neighborhood around p, contradicting p being a maximum (minimum).

^6 One can form the Lagrangian for non-linear constraints, but to derive it we need fancier math like the implicit function theorem. We only need linear constraints for our applications.
^7 For contradiction, suppose p, q are distinct feasible solutions with B^T p = B^T q. We can write p = p^(0) + p^(1) and q = q^(0) + p^(1) as above, and B^T p = B^T q implies that B^T p^(0) = B^T q^(0). In turn, since x -> B x is a bijection onto N(G), this implies that p^(0) = q^(0), and hence p = q.
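The matrix B and the gradient decomposition are easy to build numerically. One way (our choice, not the note's) is via the SVD of G: the right singular vectors past the rank span N(G). The sketch below checks the claims on Example 1.1, using the closed-form maximizer p* proportional to (1, rho, rho^2) with rho = (1 + sqrt(13))/2 (our computation for this example):

```python
import numpy as np

G = np.array([[1.0, 1.0, 1.0],
              [1.0, 2.0, 3.0]])

# Columns of B: an orthonormal basis for N(G), read off from the SVD of G
# (rows of Vt beyond the rank correspond to zero singular values).
_, _, Vt = np.linalg.svd(G)
rk = np.linalg.matrix_rank(G)
B = Vt[rk:].T                     # shape (N, N - rk) = (3, 1)

assert np.allclose(B.T @ B, np.eye(B.shape[1]))   # B^T B = I
assert np.allclose(G @ B, 0.0)                    # columns lie in N(G)

# At the max-entropy solution p*, the N(G)-component g0 of grad H vanishes.
rho = (1.0 + np.sqrt(13.0)) / 2.0
p_star = np.array([1.0, rho, rho**2])
p_star /= p_star.sum()

grad = -(np.log(p_star) + 1.0)    # gradient of H(p) = -sum p_i log p_i
g0 = B @ (B.T @ grad)             # projection of grad H onto N(G)
g1 = grad - g0                    # remainder, lies in N(G)^perp
assert np.allclose(g0, 0.0)       # critical point: g0 = 0, as claimed
```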
Lagrangian

Since g_1 in N(G)^⊥ = R(G^T) (see the fundamental theorem of linear algebra), we can find a θ(p) such that g_1 = -G^T θ(p), which motivates the following functional form:

    Λ(p, θ(p)) = H(p) + θ(p)^T (G p - b), where b = (1, c_1, ..., c_m)^T collects the constraint values.

By the definition of θ(p), we have:

    ∇_p Λ(p, θ(p)) = g_0 + g_1 + G^T θ(p) = g_0.

That is, for any critical point p of the original problem (which corresponds to g_0 = 0), we can select θ(p) so that p is a critical point of Λ(p, θ). Informally, the multipliers combine the rows of G to cancel g_1, the component of the gradient in the direction of the constraints. This establishes that any critical point of the original constrained problem is also a critical point of the Lagrangian.

References

[1] Jaynes, Edwin T., Probability Theory: The Logic of Science, Cambridge University Press, 2003.