Robustness and duality of maximum entropy and exponential family distributions


Bennett Conley

Chapter 7

Robustness and duality of maximum entropy and exponential family distributions

In this lecture, we continue our study of exponential families, but now we investigate their properties in somewhat more depth, showing how exponential family models provide a natural robustness against model mis-specification, enjoy natural projection properties, and arise in other settings.

7.1 The existence of maximum entropy distributions

As in the previous chapter of these notes, we again consider exponential family models. For simplicity throughout this chapter, and with essentially no loss of generality, we assume that all of our exponential family distributions have (standard) densities. Moreover, we assume there is some fixed density (or, more generally, an arbitrary function) $p$ satisfying $p(x) \ge 0$ and for which
\[
p_\theta(x) = p(x) \exp\bigl(\langle \theta, \phi(x) \rangle - A(\theta)\bigr), \tag{7.1.1}
\]
where the log-partition function or cumulant generating function is
\[
A(\theta) = \log \int p(x) \exp\bigl(\langle \theta, \phi(x) \rangle\bigr) \, dx
\]
as usual, and $\phi$ is the usual vector of sufficient statistics.

In the previous chapter, we saw that if we restricted consideration to distributions satisfying the mean-value (linear) constraints of the form
\[
\mathcal{P}_\alpha^{\mathrm{lin}} := \Bigl\{ Q : q(x) = p(x) f(x), \text{ where } f \ge 0 \text{ and } \int q(x) \phi(x) \, dx = \alpha, \ \int q(x) \, dx = 1 \Bigr\},
\]
then the distribution with density $p_\theta(x) = p(x) \exp(\langle \theta, \phi(x) \rangle - A(\theta))$ uniquely maximized the (Shannon) entropy over the family $\mathcal{P}_\alpha^{\mathrm{lin}}$ if we could find any $\theta$ satisfying $E_{P_\theta}[\phi(X)] = \alpha$. (Recall Theorem 6.7.) Now, of course, we must ask: does this actually happen? For if it does not, then all of this work is for naught.

Luckily for us, the answer is that we often find ourselves in the case that such results occur. Indeed, it is possible to show that, except for pathological cases, we are essentially always able to find such a solution. To that end, define the mean space
\[
M_\phi := \Bigl\{ \alpha \in \mathbb{R}^d : \exists\, Q \text{ s.t. } q(x) = f(x) p(x), \ f \ge 0, \text{ and } \int q(x) \phi(x) \, dx = \alpha \Bigr\}.
\]
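As a small sanity check of these definitions, the following worked example (an illustration, not taken from the notes) specializes them to a binary variable. It assumes $\mathcal{X} = \{0, 1\}$ with sums in place of integrals, the base function $p(x) = 1$, and the sufficient statistic $\phi(x) = x$; all of these choices are assumptions made only for this sketch.

```latex
\begin{align*}
A(\theta) &= \log \sum_{x \in \{0,1\}} \exp(\theta x) = \log\bigl(1 + e^{\theta}\bigr), \\
p_\theta(x) &= \exp\bigl(\theta x - A(\theta)\bigr), \qquad
E_{P_\theta}[\phi(X)] = \frac{e^{\theta}}{1 + e^{\theta}}, \\
M_\phi &= [0, 1], \qquad \operatorname{int} M_\phi = (0, 1), \\
E_{P_\theta}[\phi(X)] = \alpha
  &\iff \theta = \log \frac{\alpha}{1 - \alpha}
  \quad \text{for } \alpha \in (0, 1).
\end{align*}
```

In this example every interior mean value $\alpha \in (0, 1)$ is matched by exactly one $\theta$, while the boundary values $\alpha \in \{0, 1\}$ belong to $M_\phi$ but are approached only in the limits $\theta \to -\infty$ and $\theta \to +\infty$.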

Given this definition of the mean space, we have the following result, which is well-known in the literature on exponential family modeling; we refer to Wainwright and Jordan [6, Proposition 3.2 and Theorem 3.3] for the proof. In the statement of the theorem, we recall that the domain $\operatorname{dom} A$ of the log partition function is defined as those points $\theta$ for which the integral $\int p(x) \exp(\langle \theta, \phi(x) \rangle) \, dx < \infty$.

Theorem 7.1. Assume that there exists some point $\theta_0 \in \operatorname{int} \operatorname{dom} A$, where $\operatorname{dom} A := \{\theta \in \mathbb{R}^d : A(\theta) < \infty\}$. Then for any $\alpha$ in the interior of $M_\phi$, there exists some $\theta = \theta(\alpha)$ such that $E_{P_\theta}[\phi(X)] = \alpha$.

Using tools from convex analysis, it is possible to extend this result to the case that $\operatorname{dom} A$ has no interior but only a relative interior, and similarly for $M_\phi$ (see Hiriart-Urruty and Lemaréchal [4] or Rockafellar [5] for discussions of interior and relative interior). Moreover, it is also possible to show that for any $\alpha \in M_\phi$ (not necessarily the interior), there exists a sequence $\theta_1, \theta_2, \ldots$ satisfying the limiting guarantee
\[
\lim_{n \to \infty} E_{P_{\theta_n}}[\phi(X)] = \alpha.
\]
Regardless, we have our desired result: if $\mathcal{P}_\alpha^{\mathrm{lin}}$ is not empty, maximum entropy distributions exist and exponential family models attain these maximum entropy solutions.

7.2 I-projections and maximum likelihood

We first show one variant of the robustness of exponential family distributions by showing that they are (roughly) projections onto constrained families of distributions, and that they arise naturally in the context of maximum likelihood estimation. First, suppose that we have a family $\Pi$ of distributions and some fixed distribution $P$ (this last assumption of a fixed distribution $P$ is not completely essential, but it simplifies our derivation). Then the I-projection (for information projection) of the distribution $P$ onto the family $\Pi$ is
\[
P^* := \operatorname*{argmin}_{Q \in \Pi} D_{\mathrm{kl}}(Q \,\|\, P), \tag{7.2.1}
\]
when such a distribution exists. (In nice cases, it does.) Perhaps unsurprisingly, given our derivations with maximum entropy distributions and exponential family models, we have the next proposition.
The proposition shows that I-projection is essentially the same as maximum entropy, and the projection of a distribution $P$ onto a family of linearly constrained distributions yields exponential family distributions.

Proposition 7.2. Suppose that $\Pi = \mathcal{P}_\alpha^{\mathrm{lin}}$. If $p_\theta(x) = p(x) \exp(\langle \theta, \phi(x) \rangle - A(\theta))$ satisfies $E_{P_\theta}[\phi(X)] = \alpha$, then $p_\theta$ solves the I-projection problem (7.2.1). Moreover we have (the Pythagorean identity)
\[
D_{\mathrm{kl}}(Q \,\|\, P) = D_{\mathrm{kl}}(P_\theta \,\|\, P) + D_{\mathrm{kl}}(Q \,\|\, P_\theta)
\]
for $Q \in \mathcal{P}_\alpha^{\mathrm{lin}}$.

Proof. Our proof is to perform an expansion of the KL-divergence that is completely parallel to

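that we performed in the proof of Theorem 6.7. Indeed, we have
\begin{align*}
D_{\mathrm{kl}}(Q \,\|\, P)
&= \int q(x) \log \frac{q(x)}{p(x)} \, dx \\
&= \int q(x) \log \frac{p_\theta(x)}{p(x)} \, dx + \int q(x) \log \frac{q(x)}{p_\theta(x)} \, dx \\
&= \int q(x) \bigl[\langle \theta, \phi(x) \rangle - A(\theta)\bigr] \, dx + D_{\mathrm{kl}}(Q \,\|\, P_\theta) \\
&\overset{(*)}{=} \int p_\theta(x) \bigl[\langle \theta, \phi(x) \rangle - A(\theta)\bigr] \, dx + D_{\mathrm{kl}}(Q \,\|\, P_\theta) \\
&= \int p_\theta(x) \log \frac{p_\theta(x)}{p(x)} \, dx + D_{\mathrm{kl}}(Q \,\|\, P_\theta)
 = D_{\mathrm{kl}}(P_\theta \,\|\, P) + D_{\mathrm{kl}}(Q \,\|\, P_\theta),
\end{align*}
where equality $(*)$ follows by the assumption that $E_{P_\theta}[\phi(X)] = \alpha$, together with $E_Q[\phi(X)] = \alpha$ for $Q \in \mathcal{P}_\alpha^{\mathrm{lin}}$. Because $D_{\mathrm{kl}}(Q \,\|\, P_\theta) \ge 0$, the identity also shows that $P_\theta$ solves the I-projection problem (7.2.1).

Now we consider maximum likelihood estimation, showing that, in a completely hand-wavy fashion, it approximates I-projection. First, suppose that we have an exponential family $\{P_\theta\}_{\theta \in \Theta}$ of distributions, and suppose that the data comes from a true distribution $P$. Then maximizing the likelihood of the data is equivalent to maximizing the log likelihood, which, in the population case, gives us the following sequence of equivalences:
\begin{align*}
\operatorname*{maximize}_\theta \ E_P[\log p_\theta(X)]
&\iff \operatorname*{minimize}_\theta \ E_P\Bigl[\log \frac{1}{p_\theta(X)}\Bigr] \\
&\iff \operatorname*{minimize}_\theta \ E_P\Bigl[\log \frac{p(X)}{p_\theta(X)}\Bigr] + H(P) \\
&\iff \operatorname*{minimize}_\theta \ D_{\mathrm{kl}}(P \,\|\, P_\theta),
\end{align*}
so that maximum likelihood is essentially a different type of projection.

Now, we also consider the empirical variant of maximum likelihood, where we maximize the likelihood of a given sample $X_1, \ldots, X_n$. In particular, we may study the structure of maximum likelihood exponential family estimators, and we see that they correspond to simple moment matching in exponential families. Indeed, consider the sample-based maximum likelihood problem of solving
\[
\operatorname*{maximize}_\theta \ \prod_{i=1}^n p_\theta(X_i)
\iff
\operatorname*{maximize}_\theta \ \frac{1}{n} \sum_{i=1}^n \log p_\theta(X_i), \tag{7.2.2}
\]
where as usual we assume the exponential family model $p_\theta(x) = p(x) \exp(\langle \theta, \phi(x) \rangle - A(\theta))$. We have the following result.

Proposition 7.3. Let $\alpha = \frac{1}{n} \sum_{i=1}^n \phi(X_i)$. Then the maximum likelihood solution is given by any $\theta$ such that $E_{P_\theta}[\phi(X)] = \alpha$.

Proof. The proof follows immediately upon taking derivatives. We define the empirical negative log likelihood (the empirical risk) as
\[
R_n(\theta) := -\frac{1}{n} \sum_{i=1}^n \log p_\theta(X_i)
= -\frac{1}{n} \sum_{i=1}^n \langle \theta, \phi(X_i) \rangle + A(\theta) - \frac{1}{n} \sum_{i=1}^n \log p(X_i),
\]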

which is convex, as $\theta \mapsto A(\theta)$ is convex (recall Proposition 6.4). Taking derivatives, we have
\begin{align*}
\nabla_\theta R_n(\theta)
&= -\frac{1}{n} \sum_{i=1}^n \phi(X_i) + \nabla A(\theta) \\
&= -\frac{1}{n} \sum_{i=1}^n \phi(X_i) + \frac{\int \phi(x) p(x) \exp(\langle \theta, \phi(x) \rangle) \, dx}{\int p(x) \exp(\langle \theta, \phi(x) \rangle) \, dx} \\
&= -\frac{1}{n} \sum_{i=1}^n \phi(X_i) + E_{P_\theta}[\phi(X)].
\end{align*}
In particular, finding any $\theta$ such that $\nabla A(\theta) = E_{P_n}[\phi(X)]$ gives the result, where $P_n$ denotes the empirical distribution of the sample, so that $E_{P_n}[\phi(X)] = \frac{1}{n} \sum_{i=1}^n \phi(X_i)$.

As a consequence of the result, we have the following rough equivalences tying together the preceding material. In short, maximum entropy subject to (linear) empirical moment constraints (Theorem 6.7) is equivalent to maximum likelihood estimation in exponential families (Proposition 7.3), which is equivalent to I-projection of a fixed base distribution onto a linearly constrained family of distributions (Proposition 7.2).

7.3 Basics of minimax game playing with log loss

The final setting we consider in which exponential families make a natural appearance is that of so-called minimax games under the log loss. In particular, we consider the following general formulation of a two-player minimax game. First, we choose a distribution $Q$ on a set $\mathcal{X}$ (with density $q$). Then nature (or our adversary) chooses a distribution $P \in \mathcal{P}$ on the set $\mathcal{X}$, where $\mathcal{P}$ is a collection of distributions on $\mathcal{X}$, so we suffer loss
\[
\sup_{P \in \mathcal{P}} E_P[-\log q(X)] = \sup_{P \in \mathcal{P}} \int p(x) \log \frac{1}{q(x)} \, dx. \tag{7.3.1}
\]
In particular, we would like to solve the minimax problem
\[
\operatorname*{minimize}_Q \ \sup_{P \in \mathcal{P}} E_P[-\log q(X)].
\]
To motivate this abstract setting we give two examples, the first abstract and the second somewhat more concrete.

Example 7.4: Suppose that we receive $n$ random variables $X_i \overset{\mathrm{i.i.d.}}{\sim} P$; in this case, we have the sequential prediction loss
\[
E_P[-\log q(X_1^n)] = \sum_{i=1}^n E_P\Bigl[\log \frac{1}{q(X_i \mid X_1^{i-1})}\Bigr],
\]
which corresponds to predicting $X_i$ given $X_1^{i-1}$ as well as possible, when the $X_i$ follow an (unknown or adversarially chosen) distribution $P$.
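The chain-rule decomposition in Example 7.4 can be checked on a concrete sequential predictor. The sketch below is only an illustration: it assumes binary symbols and uses the add-one (Laplace) rule as the conditional distribution $q(x_i \mid x_1^{i-1})$, a choice that is not part of the notes. It compares the sum of the per-step log losses with $-\log q(x_1^n)$ computed from the closed form $q(x_1^n) = k!\,(n-k)!/(n+1)!$, where $k$ is the number of ones. The function names are hypothetical.

```python
import math


def laplace_conditional(x, history):
    """q(x | history) for binary symbols under the add-one (Laplace) rule.

    This predictor is an assumption of the sketch, not the notes' choice.
    """
    ones = sum(history)
    p_one = (ones + 1) / (len(history) + 2)
    return p_one if x == 1 else 1.0 - p_one


def sequential_log_loss(xs):
    """Sum over i of -log q(x_i | x_1^{i-1}), in nats."""
    return sum(
        -math.log(laplace_conditional(x, xs[:i])) for i, x in enumerate(xs)
    )


def joint_log_loss(xs):
    """-log q(x_1^n) using the closed form k! (n - k)! / (n + 1)!."""
    n, k = len(xs), sum(xs)
    log_q = math.lgamma(k + 1) + math.lgamma(n - k + 1) - math.lgamma(n + 2)
    return -log_q


if __name__ == "__main__":
    sample = [1, 0, 1, 1]
    print(sequential_log_loss(sample), joint_log_loss(sample))
```

The two quantities agree up to floating-point error, which is the sequence-level loss written as a sum of one-step prediction losses.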

Example 7.5 (Coding): Expanding on the preceding example, suppose that the set $\mathcal{X}$ is finite, and we wish to encode $\mathcal{X}$ into $\{0, 1\}$-valued sequences using as few bits as possible. In this case, the Kraft inequality (see any standard information theory text, for example, Cover and Thomas [2, Chapter 5]) tells us that if $C : \mathcal{X} \to \{0, 1\}^*$ is a uniquely decodable code, and $\ell_C(x)$ denotes the length of the encoding for the symbol $x \in \mathcal{X}$, then
\[
\sum_x 2^{-\ell_C(x)} \le 1.
\]
Conversely, given any length function $\ell : \mathcal{X} \to \mathbb{N}$ satisfying $\sum_x 2^{-\ell(x)} \le 1$, there exists a uniquely decodable (in fact, prefix) code $C$ with the given length function. Thus, if we define the p.m.f. $q_C(x) = 2^{-\ell_C(x)} / \sum_x 2^{-\ell_C(x)}$, we have
\[
-\log_2 q_C(x_1^n) = \sum_{i=1}^n \Bigl[\ell_C(x_i) + \log_2 \sum_x 2^{-\ell_C(x)}\Bigr] \le \sum_{i=1}^n \ell_C(x_i).
\]
In particular, we have a coding game where we attempt to choose a distribution $q$ (or sequential coding scheme $C$) that has as small an expected length as possible, uniformly over distributions $P$. (The field of universal coding studies such questions in depth; see Tsachy Weissman's course EE376b.)

TODO: Picture of Huffman coding

We now show how the minimax game (7.3.1) naturally gives rise to exponential family models, so that exponential family distributions are so-called robust Bayes procedures (cf. Grünwald and Dawid [3]). Specifically, we say that $Q$ is a robust Bayes procedure for the class $\mathcal{P}$ of distributions if it minimizes the supremum risk (7.3.1) taken over the family $\mathcal{P}$; that is, it is uniformly good for all distributions $P \in \mathcal{P}$. If we restrict our class $\mathcal{P}$ to be a linearly constrained family of distributions, then we see that the exponential family distributions are natural robust Bayes procedures: they uniquely solve the minimax game. More concretely, assume that $\mathcal{P} = \mathcal{P}_\alpha^{\mathrm{lin}}$ and that $P_\theta$ denotes the exponential family distribution with density $p_\theta(x) = p(x) \exp(\langle \theta, \phi(x) \rangle - A(\theta))$, where $p$ denotes the base density. We have the following.

Proposition 7.6. If $E_{P_\theta}[\phi(X)] = \alpha$, then
\[
\inf_Q \sup_{P \in \mathcal{P}_\alpha^{\mathrm{lin}}} E_P[-\log q(X)]
= \sup_{P \in \mathcal{P}_\alpha^{\mathrm{lin}}} E_P[-\log p_\theta(X)]
= \sup_{P \in \mathcal{P}_\alpha^{\mathrm{lin}}} \inf_Q E_P[-\log q(X)].
\]
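Proof. This is a standard saddle-point argument (cf. [5, 4, 1]). First, note that
\begin{align*}
\sup_{P \in \mathcal{P}_\alpha^{\mathrm{lin}}} E_P[-\log p_\theta(X)]
&= \sup_{P \in \mathcal{P}_\alpha^{\mathrm{lin}}} E_P\bigl[-\langle \phi(X), \theta \rangle + A(\theta)\bigr] \\
&= -\langle \alpha, \theta \rangle + A(\theta)
 = E_{P_\theta}\bigl[-\langle \theta, \phi(X) \rangle + A(\theta)\bigr]
 = H(P_\theta),
\end{align*}
where $H$ denotes the Shannon entropy, for any distribution $P \in \mathcal{P}_\alpha^{\mathrm{lin}}$. Moreover, for any $Q \ne P_\theta$, we have
\[
\sup_{P} E_P[-\log q(X)] \ge E_{P_\theta}[-\log q(X)] > E_{P_\theta}[-\log p_\theta(X)] = H(P_\theta),
\]
where the inequality follows because $D_{\mathrm{kl}}(P_\theta \,\|\, Q) = \int p_\theta(x) \log \frac{p_\theta(x)}{q(x)} \, dx > 0$. This shows the first equality in the proposition.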

For the second equality, note that
\[
\inf_Q E_P[-\log q(X)]
= \underbrace{\inf_Q E_P\Bigl[\log \frac{p(X)}{q(X)}\Bigr]}_{=0} - E_P[\log p(X)]
= H(P).
\]
But we know from our standard maximum entropy results (Theorem 6.7) that $P_\theta$ maximizes the entropy over $\mathcal{P}_\alpha^{\mathrm{lin}}$, that is, $\sup_{P \in \mathcal{P}_\alpha^{\mathrm{lin}}} H(P) = H(P_\theta)$.

In short: maximum entropy is equivalent to robust prediction procedures for linear families of distributions $\mathcal{P}_\alpha^{\mathrm{lin}}$, which is equivalent to maximum likelihood in exponential families, which in turn is equivalent to I-projection.
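The equivalences summarized above can be checked numerically on a small finite example. The sketch below is an illustration under stated assumptions, not part of the notes: it takes $\mathcal{X} = \{0, \ldots, 5\}$ with sums in place of integrals, a uniform base distribution, the scalar statistic $\phi(x) = x$, and the hypothetical target mean $\alpha = 1.5$. It finds $\theta$ by bisection on the moment-matching condition, then evaluates the Pythagorean identity of Proposition 7.2 and the constant log loss of Proposition 7.6 against one other member `q` of the constrained family. All function names are made up for this example.

```python
import numpy as np


def exp_family(theta, phi, base):
    """p_theta(x) proportional to base(x) * exp(theta * phi(x)) on a finite set."""
    logits = np.log(base) + theta * phi
    w = np.exp(logits - logits.max())  # subtract the max for numerical stability
    return w / w.sum()


def match_moment(alpha, phi, base, lo=-20.0, hi=20.0, iters=100):
    """Bisection for theta with E_theta[phi(X)] = alpha.

    Relies on the mean being nondecreasing in theta (A is convex).
    """
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if exp_family(mid, phi, base) @ phi < alpha:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def kl(q, p):
    """D_kl(Q || P) for probability vectors, with the convention 0 log 0 = 0."""
    m = q > 0
    return float(np.sum(q[m] * np.log(q[m] / p[m])))


def entropy(p):
    """Shannon entropy in nats."""
    m = p > 0
    return float(-np.sum(p[m] * np.log(p[m])))


phi = np.arange(6.0)           # phi(x) = x on {0, ..., 5}
base = np.full(6, 1.0 / 6.0)   # uniform base distribution (assumption of this sketch)
alpha = 1.5                    # hypothetical target mean

theta = match_moment(alpha, phi, base)
p_theta = exp_family(theta, phi, base)

# Another distribution with the same mean: 0.5 * 0 + 0.5 * 3 = 1.5.
q = np.array([0.5, 0.0, 0.0, 0.5, 0.0, 0.0])

print("E_theta[phi]      :", p_theta @ phi)
print("Pythagorean gap   :", kl(q, base) - (kl(p_theta, base) + kl(q, p_theta)))
print("H(P_theta), H(Q)  :", entropy(p_theta), entropy(q))
print("E_Q[-log p_theta] :", float(-(q @ np.log(p_theta))))
```

With a uniform base distribution, $D_{\mathrm{kl}}(Q \,\|\, P)$ and $-H(Q)$ differ only by the constant $\log 6$, which is why the I-projection and the maximum entropy distribution coincide in this sketch; the expected log loss of `p_theta` under `q` matches $H(P_\theta)$ because it depends on `q` only through its mean.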

Bibliography

[1] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press.

[2] T. M. Cover and J. A. Thomas. Elements of Information Theory, Second Edition. Wiley.

[3] P. Grünwald and A. P. Dawid. Game theory, maximum entropy, minimum discrepancy, and robust Bayesian decision theory. Annals of Statistics, 32(4).

[4] J. Hiriart-Urruty and C. Lemaréchal. Convex Analysis and Minimization Algorithms I & II. Springer, New York.

[5] R. T. Rockafellar. Convex Analysis. Princeton University Press.

[6] M. J. Wainwright and M. I. Jordan. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1-2):1-305.
