Maximum Entropy and Exponential Families


April 9, 2019

Abstract. The goal of this note is to derive the exponential form of probability distributions from more basic considerations, in particular entropy. It follows a description by E. T. Jaynes in Chapter 11 of his book Probability Theory: The Logic of Science [1].^1

1 Motivating the Exponential Model

This section will motivate the exponential model form that we've seen in lecture.

The Setup

The setup for our problem is that we are given a finite set of instances $\mathcal{I}$ and a set of $m$ statistics $(f_i, c_i)$ in which $f_i : \mathcal{I} \to \mathbb{R}$ and $c_i \in \mathbb{R}$. An instance (or possible world) is just an element in a set. We can think about a statistic as a measurement of an instance: it tells us the features of that instance that are important for our model. More precisely, the only information we have about the instances is the values of $f_i$ on these instances. Our goal is to find a probability function $p : \mathcal{I} \to [0,1]$ such that $\sum_{I \in \mathcal{I}} p(I) = 1$.

The main goal of this note is to provide a set of assumptions under which such distributions have a specific functional form, the exponential family that we saw in generalized linear models:

$$p_\theta(I) = Z^{-1}(\theta) \exp\{\theta^T f(I)\}$$

in which $\theta \in \mathbb{R}^m$ and $f(I) \in \mathbb{R}^m$ with $f(I)_i = f_i(I)$. Notice that there is exactly one parameter for each statistic. As we'll see, for discrete distributions we are able to derive this exponential form as a consequence of maximizing entropy subject to matching the statistics.^2

1.1 The problem: Too many distributions!

We'll see the problem of defining a distribution from statistics (measurements). We'll see that there are often many probability distributions that satisfy our constraints, and we'll be forced to pick among them.^3

^1 This work is available online in many places including http://omega.albany.edu:8008/etj-ps/g.ps.
^2 Unfortunately, for continuous distributions, such a derivation does not work due to some technical issues with entropy; this hasn't stopped folks from using it as justification.
^3 Throughout this section, it will be convenient to view $p$ and the $f_j$ as functions from $\mathcal{I} \to \mathbb{R}$ and also as vectors indexed by $\mathcal{I}$. Their use should be clear from the context.
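To make the form concrete, here is a minimal numerical sketch (our illustration, not from the note) of $p_\theta$ over a small finite instance set; the instance set and the single statistic below are hypothetical choices:

```python
# A minimal sketch (our illustration, not from the note) of the exponential
# family p_theta(I) = Z(theta)^{-1} exp{theta^T f(I)} over a finite instance
# set. The instance set and the single statistic below are hypothetical.
import numpy as np

def p_theta(theta, F):
    """F[:, i] = f(I_i) in R^m, one column per instance I_i."""
    scores = theta @ F                 # theta^T f(I_i) for each instance
    w = np.exp(scores - scores.max())  # shift by max for numerical stability
    return w / w.sum()                 # normalizing by Z (the shift cancels)

F = np.array([[1.0, 2.0, 3.0]])        # m = 1 statistic: f(I_i) = i
print(p_theta(np.array([0.5]), F))     # a distribution; entries sum to 1
```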

The Constraints

We interpret a statistic as a constraint on $p$ of the following form:

$$E_p[f_i] = c_i, \quad \text{i.e.,} \quad \sum_{I \in \mathcal{I}} f_i(I)\, p(I) = c_i$$

Let's set up some notation to describe these constraints. Let $N = |\mathcal{I}|$; then the probability we are after is a vector $p \in \mathbb{R}^N$ subject to the following constraints:

- There are $m$ constraints of the form $f_j^T p = c_j$ for $j = 1, \dots, m$.
- A single constraint of the form $\sum_{i=1}^N p_i = 1$ ensures that $p$ is a probability distribution. We can write this more succinctly as $\mathbf{1}_N^T p = 1$.
- We also have that $p_i \geq 0$ for $i = 1, \dots, N$.

More compactly, we can write $F \in \mathbb{R}^{m \times N}$ such that $F_i = f_i^T$ for $i = 1, \dots, m$. Then we can write all the constraints in a matrix $G$ as

$$G = \begin{pmatrix} \mathbf{1}_N^T \\ F \end{pmatrix} \in \mathbb{R}^{(m+1) \times N} \quad \text{so that} \quad Gp = \begin{pmatrix} 1 \\ c \end{pmatrix}.$$

If $N(G) = \{0\}$, then this means that $p$ is uniquely defined, as $G$ has an inverse. In this case, $p = G^{-1}(1, c)^T$. However, often $m$ is much smaller than $N$, so that $N(G) \neq \{0\}$ and there are many solutions that satisfy the constraints.

Example 1.1. Suppose we have three possible worlds, i.e., $\mathcal{I} = \{I_1, I_2, I_3\}$, and one statistic $f(I_i) = i$ with $c = 2.5$. Then we have:

$$G = \begin{pmatrix} 1 & 1 & 1 \\ 1 & 2 & 3 \end{pmatrix} \quad \text{and} \quad N(G) = \mathrm{span}\{(1, -2, 1)\}$$

Let $p^{(1)} = (1/12, 1/3, 7/12)$; then $Gp^{(1)} = (1, 2.5)^T$, but so do (infinitely) many others. In particular, $q(\alpha) = p^{(1)} + \alpha(1, -2, 1)$ is valid so long as $\alpha \in [-1/12, 1/6]$ (due to positivity).

Picking a probability distribution p

In the case $N(G) \neq \{0\}$, there are many probability distributions we can pick. All of these distributions can be written as follows:

$$p = p^{(0)} + p^{(1)} \quad \text{in which} \quad p^{(0)} \in N(G) \text{ and } p^{(1)} \text{ satisfies } Gp^{(1)} = (1, c)^T$$

Example 1.2. Continuing the computation above, we see $p^{(0)} = \alpha(1, -2, 1)$ is a vector in $N(G)$.

Which $p$ should we pick? Well, we'll use one method called the method of maximum entropy. In turn, this will lead to the fact that our function $p$ has a very special form: the form of exponential models!

1.2 Entropy

To pick among the distributions, we'll need some scoring method.^4 We'll cut to the chase here and define the entropy, which is a function on probability distributions $p \in \mathbb{R}^N$ such that $p \geq 0$ and $p^T \mathbf{1}_N = 1$:

$$H(p) = -\sum_{i=1}^N p_i \log p_i$$

^4 A few natural methods don't work as we might think they should (minimizing variance, etc.). See [1, Ch. 11] for a description of these alternative approaches.
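Example 1.1 is easy to check numerically. The sketch below (our addition, with NumPy as an assumed tool) constructs $G$ and verifies that $p^{(1)}$ and the whole family $q(\alpha)$ satisfy the constraints:

```python
# A numerical check of Example 1.1 (our addition): three worlds, f(I_i) = i,
# c = 2.5. Verifies the particular solution and the feasible family q(alpha).
import numpy as np

G = np.array([[1.0, 1.0, 1.0],    # row enforcing sum(p) = 1
              [1.0, 2.0, 3.0]])   # row enforcing E_p[f] = c
b = np.array([1.0, 2.5])

p1 = np.array([1/12, 1/3, 7/12])  # p^(1): a particular feasible solution
assert np.allclose(G @ p1, b)

null = np.array([1.0, -2.0, 1.0]) # spans N(G)
assert np.allclose(G @ null, 0.0)

for alpha in (-1/12, 0.0, 1/6):   # endpoints of the positivity interval
    q = p1 + alpha * null
    assert np.allclose(G @ q, b) and (q >= -1e-12).all()
print("all feasible")
```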

Effectively, the entropy rewards one for spreading the distribution out more. One can motivate entropy from axioms, and either Jaynes or the Wikipedia page is pretty good on this account.^5 The intuition should be that entropy can be used to select the least informative prior; it's a way of making as few additional assumptions as possible. In other words, we want to encode the prior information given by the constraints on the statistics while being as objective or agnostic as possible. This is called the maximum entropy principle. For example, one can verify that under no constraints, $H(p)$ is maximized with $p_i = 1/N$, that is, all alternatives have equal probability. This is what we mean by "spread out."

We'll pick the distribution that maximizes entropy subject to our constraints. Mathematically, we'll examine:

$$\max_{p \in \mathbb{R}^N} H(p) \quad \text{s.t.} \quad p^T \mathbf{1} = 1, \; p \geq 0, \; \text{and} \; Fp = c$$

We will not discuss it, but under appropriate conditions there is a unique solution $p$.

1.3 The Lagrangian

We'll create a function called the Lagrangian that has the property that any critical point of the constrained problem is a critical point of the Lagrangian. We will show that all critical points of the Lagrangian (and so of our original problem) can be written in the exponential format we described above.

To simplify our discussion, let's imagine that $p > 0$, i.e., there are no possible worlds $I$ such that $p(I) = 0$. In this case, our problem reduces to:

$$\max_{p \in \mathbb{R}^N} H(p) \quad \text{s.t.} \quad Fp = c \; \text{and} \; \mathbf{1}_N^T p = 1$$

We can write the Lagrangian $\Lambda : \mathbb{R}^N \times (\mathbb{R}^m \times \mathbb{R}) \to \mathbb{R}$ as follows:

$$\Lambda(p; \theta, \lambda) = H(p) + \theta^T (Fp - c) + \lambda(\mathbf{1}^T p - 1)$$

The special property of $\Lambda$ is that any critical point of our original problem, in particular any maximum or minimum, corresponds to a critical point of the Lagrangian. Thus, if we prove something about critical points of the Lagrangian, we prove something about the critical points of the original function. Later in the course, we'll see more sophisticated uses of Lagrangians, but for now we include a simple derivation below (Section 2) to give a hint of what's going on. For this section, we'll assume this special property is true.

Due to that special property, we find the critical points of $\Lambda$ by differentiating with respect to $p_i$ and setting the resulting equations to 0:

$$\frac{\partial}{\partial p_i} \left[ H(p) + \theta^T (Fp - c) + \lambda(\mathbf{1}^T p - 1) \right] = -(\log p_i + 1) + \sum_{j=1}^m \theta_j f_j(I_i) + \lambda = -(\log p_i + 1) + \theta^T f(I_i) + \lambda$$

Setting this expression equal to 0 and solving for $p_i$, we learn:

$$p_i = e^{\lambda - 1} \exp\{\theta^T f(I_i)\}$$

which is of the right form except that we have one too many parameters, namely $\lambda$. Nevertheless, this is remarkable: at a critical point, it's always the case that the exponential family pops out!

^5 https://en.wikipedia.org/wiki/Entropy_(information_theory)#Rationale
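As a sanity check on this derivation, the sketch below (our addition) solves the maximum-entropy problem of Example 1.1 with an off-the-shelf solver and confirms that $\log p_i$ comes out affine in $f(I_i)$, i.e., the exponential family pops out. The solver choice (SciPy's SLSQP) is an illustrative assumption:

```python
# A sanity check (our addition) that the exponential form pops out: maximize
# H(p) for Example 1.1 with a generic solver (SciPy's SLSQP, an illustrative
# choice) and verify log p_i is affine in f(I_i) with a single slope theta.
import numpy as np
from scipy.optimize import minimize

f = np.array([1.0, 2.0, 3.0])
c = 2.5

neg_entropy = lambda p: np.sum(p * np.log(p))   # minimize -H(p)
cons = [{"type": "eq", "fun": lambda p: p.sum() - 1.0},
        {"type": "eq", "fun": lambda p: f @ p - c}]
res = minimize(neg_entropy, x0=np.full(3, 1/3), method="SLSQP",
               bounds=[(1e-9, 1.0)] * 3, constraints=cons)
p_star = res.x

slopes = np.diff(np.log(p_star)) / np.diff(f)   # consecutive slopes of log p
print(p_star, slopes)                           # the two slopes agree: theta
```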

Eliminating λ

The parameter $\lambda$ can be eliminated, which is the final step to match our originally claimed exponential form. To do so, we sum over all the $p_i$, which on one hand we know is equal to 1, and on the other hand, we have the above expression for the $p_i$. This gives us the following equation:

$$\sum_{i=1}^N p_i = 1 \;\text{ and }\; p_i = e^{\lambda - 1} \exp\{\theta^T f(I_i)\}, \quad \text{thus} \quad e^{1 - \lambda} = \sum_{i=1}^N \exp\{\theta^T f(I_i)\}$$

Thus, we have expressed $\lambda$ as a function of $\theta$ and we can eliminate it. To do so, we write:

$$Z(\theta) = \sum_{i=1}^N \exp\{\theta^T f(I_i)\} \quad \text{and} \quad p_i = Z^{-1}(\theta) \exp\{\theta^T f(I_i)\}$$

This function $Z$ is the partition function that we saw in lecture. The above form is the claimed exponential form that has one parameter per constraint; a numerical illustration follows the footnotes below.

2 Why the Lagrangian?

We observe that this is a constrained optimization problem with linear constraints.^6 Let $r$ be the rank of $G$, so that $\dim(N(G)) = N - r$. We create a function $\varphi : \mathbb{R}^{N-r} \to \mathbb{R}$ such that there is a map between any point in the domain of $\varphi$ and a feasible solution to our constrained problem, and moreover $\varphi$ will take the same value as $H$. In contrast to our original constrained problem, $\varphi$ has an unconstrained domain (all of $\mathbb{R}^{N-r}$), and so we can apply standard calculus to find its critical points.

To that end, we define a (linear) map $B \in \mathbb{R}^{N \times (N-r)}$ that has rank $N - r$. We also insist that $B^T B = I_{N-r}$. Such a $B$ exists, as it is simply the first $N - r$ columns of a change-of-basis matrix from the standard basis to an orthonormal basis for $N(G)$. We define

$$\varphi(x) = H(Bx + p^{(1)}),$$

where $p^{(1)}$ is a fixed vector satisfying $Gp^{(1)} = (1, c)^T$. Observe that for any $x \in \mathbb{R}^{N-r}$, $Bx \in N(G)$, so that $G(Bx + p^{(1)}) = Gp^{(1)} = (1, c)^T$, and so $Bx + p^{(1)}$ is feasible. Moreover, $x \mapsto Bx + p^{(1)}$ is a bijection from $\mathbb{R}^{N-r}$ to the set of feasible solutions.^7 Importantly, $\varphi$ is now unconstrained, and so any saddle point (and so any maximum or minimum) must satisfy:

$$\nabla_x \varphi(x) = 0$$

Gradient Decomposition

Any critical point of $H$ yields a critical point of $\varphi$; that is, if $p = p^{(0)} + p^{(1)}$ is a critical point of $H$, then $x = B^T p^{(0)}$ is a critical point of $\varphi$. Consider any critical point $p$; then we can uniquely decompose the gradient as:

$$\nabla_p H(p) = g_0 + g_1 \quad \text{in which} \quad g_0 \in N(G) \text{ and } g_1 \in N(G)^\perp$$

We claim $g_0 = B \nabla_x \varphi(B^T p)$, or equivalently, $B^T g_0 = \nabla_x \varphi(B^T p)$. From direct calculation, $\nabla_x \varphi(x) = \nabla_x H(Bx + p^{(1)}) = B^T \nabla_p H(p^{(0)} + p^{(1)}) = B^T \nabla_p H(p) = B^T g_0$, where the last equality is due to $g_1 \in N(G)^\perp$.

A critical point of $H$ satisfying the constraints must not change along any direction that satisfies the constraints, which is to say that we must have $g_0 = 0$. Very roughly, the intuition is that if $p$ were a maximum (or minimum) and $g_0$ were non-zero, then there would be a way to strictly increase (or decrease) the function in a neighborhood around $p$, contradicting $p$ being a maximum (minimum).

^6 One can form the Lagrangian for non-linear constraints, but to derive it we need to use fancier math like the implicit function theorem. We only need linear constraints for our applications.
^7 For contradiction, suppose $p, q$ are distinct feasible solutions with $B^T p = B^T q$. We can write $p = p^{(0)} + p^{(1)}$ and $q = q^{(0)} + p^{(1)}$ from the above. However, $B^T p = B^T q$ implies that $B^T p^{(0)} = B^T q^{(0)}$. In turn, since $B$ is a bijection onto $N(G)$, this implies that $p^{(0)} = q^{(0)}$, i.e., $p = q$, a contradiction.
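Returning to the end of Section 1, the partition function also gives a practical recipe: for Example 1.1 we can pick the single parameter $\theta$ so that the statistic is matched exactly. A sketch (our addition; the root finder and its bracket are illustrative choices):

```python
# A sketch (our addition) of using the partition function in reverse: choose
# the single parameter theta so that E_{p_theta}[f] = c for Example 1.1. The
# root finder and its bracket are illustrative choices.
import numpy as np
from scipy.optimize import brentq

f = np.array([1.0, 2.0, 3.0])
c = 2.5

def moment_gap(theta):
    w = np.exp(theta * f)
    p = w / w.sum()            # p_i = Z(theta)^{-1} exp{theta f(I_i)}
    return f @ p - c           # zero exactly when the statistic matches

theta_star = brentq(moment_gap, -10.0, 10.0)   # the gap is monotone in theta
w = np.exp(theta_star * f)
print(theta_star, w / w.sum())                 # the maximum-entropy p
```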

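One can also check the gradient decomposition numerically: at any distribution of the exponential form, $\nabla_p H(p)$ lies entirely in $R(G^T)$, so its component $g_0$ in $N(G)$ vanishes. A sketch (our addition) under the Example 1.1 setup, where the specific value of $\theta$ is an arbitrary choice:

```python
# A numerical check (our addition) of the decomposition grad H = g_0 + g_1:
# for any p of exponential form, grad H(p) lies in R(G^T), so the component
# g_0 in N(G) vanishes. The value of theta here is an arbitrary choice.
import numpy as np

G = np.array([[1.0, 1.0, 1.0],
              [1.0, 2.0, 3.0]])
f = G[1]
theta = 0.7                               # any theta gives the same conclusion
w = np.exp(theta * f)
p = w / w.sum()

grad_H = -(np.log(p) + 1.0)               # gradient of H(p) = -sum p_i log p_i
coef, *_ = np.linalg.lstsq(G.T, grad_H, rcond=None)
g0 = grad_H - G.T @ coef                  # residual: the N(G) component
print(np.linalg.norm(g0))                 # ~ 0: g_0 vanishes
```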
Lagrangian

Since $g_1 \in N(G)^\perp = R(G^T)$ (see the fundamental theorem of linear algebra), we can find a $\theta(p)$ such that $g_1 = -G^T \theta(p)$, which motivates the following functional form:

$$\Lambda(p, \theta(p)) = H(p) + \theta(p)^T \left( Gp - (1, c)^T \right)$$

By the definition of $\theta(p)$, we have:

$$\nabla_p \Lambda(p, \theta(p)) = g_0 + g_1 + G^T \theta(p) = g_0$$

That is, for any critical point $p$ of the original function (which corresponds to $g_0 = 0$), we can select $\theta(p)$ so that $p$ is a critical point of $\Lambda(p, \theta)$. Informally, the multipliers combine the rows of $G$ to cancel $g_1$, the component of the gradient in the direction of the constraints. This establishes that any critical point of the original constrained function is also a critical point of the Lagrangian.

References

[1] Jaynes, Edwin T., Probability Theory: The Logic of Science, Cambridge University Press, 2003.