Dual Connections. Zhengchao Wan. Peking University, June 21, 2016. Overview: duality of connections; divergences (general contrast functions).

Dual Connections. Peking University, June 21, 2016.

Overview: duality of connections; divergences (general contrast functions).

Riemannian connection. Let $M$ be a manifold on which there is given a Riemannian metric $g = \langle\cdot,\cdot\rangle$. A connection $\nabla$ satisfying
$$Z\langle X, Y\rangle = \langle \nabla_Z X, Y\rangle + \langle X, \nabla_Z Y\rangle \quad (1)$$
for all vector fields $X, Y, Z \in \mathcal{T}(M)$ is called a metric connection; the torsion-free metric connection is the Riemannian (Levi-Civita) connection.

Dual connection. Given two connections $\nabla$ and $\nabla^*$ on $M$, if for all vector fields $X, Y, Z$
$$Z\langle X, Y\rangle = \langle \nabla_Z X, Y\rangle + \langle X, \nabla^*_Z Y\rangle \quad (2)$$
holds, then we say that $\nabla$ and $\nabla^*$ are duals of each other with respect to $g$, and call one either the dual connection or the conjugate connection of the other. In addition, we call such a triple $(g, \nabla, \nabla^*)$ a dualistic structure on $M$. If $\nabla$ is metric, then $\nabla = \nabla^*$; hence duality may be considered a generalization of the notion of metric connection. In a statistical model $S$, $(g, \nabla^{(\alpha)}, \nabla^{(-\alpha)})$ is a dualistic structure.

Dual connection. Given a local coordinate frame $[\partial_i]$, from Equation (2) we have
$$\partial_k g_{ij} = \Gamma_{ki,j} + \Gamma^*_{kj,i}. \quad (3)$$
Thus, given $g$ and $\nabla$, there exists a unique dual connection $\nabla^*$. In addition, $(\nabla^*)^* = \nabla$ holds. We also see that $(\nabla + \nabla^*)/2$ becomes a metric connection. Conversely, if a connection $\nabla'$ has the same torsion as $\nabla^*$ and if $(\nabla + \nabla')/2$ is metric, then $\nabla' = \nabla^*$.
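To make Equation (3) concrete, here is a minimal symbolic sketch (an assumed Bernoulli example, not from the talk) using the standard statistical-manifold formula $\Gamma^{(\alpha)} = E[(\partial\partial\ell + \tfrac{1-\alpha}{2}(\partial\ell)^2)\,\partial\ell]$ for the $\alpha$-connections; it confirms the one-dimensional form of Equation (3), $\partial g = \Gamma^{(\alpha)} + \Gamma^{(-\alpha)}$, i.e. that $\nabla^{(\alpha)}$ and $\nabla^{(-\alpha)}$ are dual.

```python
# A minimal sketch checking Eq. (3) for the Bernoulli model (1-dimensional,
# coordinate p = Pr(x = 1)); the alpha-connection formula used here is the
# standard one for statistical manifolds, an assumption not stated in the talk.
import sympy as sp

p, alpha = sp.symbols('p alpha', positive=True)

def expect(expr_x1, expr_x0):
    """Expectation over x in {0,1} under Bernoulli(p)."""
    return p * expr_x1 + (1 - p) * expr_x0

# score d_l and its derivative dd_l, for x = 1 and x = 0
dl1, dl0 = 1/p, -1/(1 - p)
ddl1, ddl0 = -1/p**2, -1/(1 - p)**2

g = expect(dl1**2, dl0**2)              # Fisher metric: 1/(p(1-p))

def Gamma(a):                           # alpha-connection coefficient
    c = (1 - a)/2
    return expect((ddl1 + c*dl1**2)*dl1, (ddl0 + c*dl0**2)*dl0)

# Eq. (3) in one dimension: dg/dp = Gamma^(alpha) + Gamma^(-alpha)
print(sp.simplify(sp.diff(g, p) - Gamma(alpha) - Gamma(-alpha)))  # -> 0
```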

Submanifold. Letting $N$ be a submanifold of $M$, consider $\nabla^N$ and $\nabla^{*N}$, which are respectively the projections of $\nabla$ and $\nabla^*$ onto $N$ with respect to $g$. These are dual with respect to $g_N$ (the metric on $N$ determined by $g$). We call $(g_N, \nabla^N, \nabla^{*N})$ the dualistic structure on $N$ induced by $(g, \nabla, \nabla^*)$, or the induced dualistic structure on $N$.

Covariant derivative. Let $\gamma : t \mapsto \gamma(t)$ be a curve in $M$ and let $X$ and $Y$ be vector fields along $\gamma$. In addition, let $D_t X$ and $D^*_t Y$ respectively denote the covariant derivatives of $X$ with respect to $\nabla$ and of $Y$ with respect to $\nabla^*$. Then from Equation (2) we see that
$$\frac{d}{dt}\langle X(t), Y(t)\rangle = \langle D_t X(t), Y(t)\rangle + \langle X(t), D^*_t Y(t)\rangle. \quad (4)$$
Now suppose that $X$ is parallel with respect to $\nabla$, and that $Y$ is parallel with respect to $\nabla^*$, i.e., $D_t X = D^*_t Y = 0$. Then $\langle X(t), Y(t)\rangle$ is constant along $\gamma$.

Parallel translation. Theorem. Letting $P_\gamma$ and $P^*_\gamma$ ($: T_p(M) \to T_q(M)$, where $p$ and $q$ are the endpoints of $\gamma$) respectively denote parallel translation along $\gamma$ with respect to $\nabla$ and $\nabla^*$, then for all $X, Y \in T_p(M)$ we have
$$\langle P_\gamma(X), P^*_\gamma(Y)\rangle_q = \langle X, Y\rangle_p. \quad (5)$$
This is a generalization of the invariance of the inner product under parallel translation with respect to a metric connection, discussed in Chapter 1.

Curvature. The relationship between $P_\gamma$ and $P^*_\gamma$ is completely determined by Equation (5). Hence if $P_\gamma$ is independent of the actual curve joining $p$ and $q$, and hence may be written as $P_\gamma = P_{p,q}$, then this is true of $P^*_\gamma$ also. Theorem. Letting the curvature tensors of $\nabla$ and $\nabla^*$ be denoted by $R$ and $R^*$, respectively, we have
$$R = 0 \iff R^* = 0. \quad (6)$$
This is immediate from
$$\langle R(X, Y)Z, W\rangle = -\langle R^*(X, Y)W, Z\rangle, \quad \forall X, Y, Z, W \in \mathcal{T}(M). \quad (7)$$

Measure. Consider a smooth function $D = D(\cdot\|\cdot) : M \times M \to \mathbb{R}$ satisfying, for any $p, q \in M$,
$$D(p\|q) \ge 0, \quad \text{and} \quad D(p\|q) = 0 \iff p = q. \quad (8)$$
$D$ is a distance-like measure of the separation between two points. However, it does not in general satisfy the axioms of distance (symmetry and the triangle inequality).

Derivatives. Given an arbitrary coordinate system $[x^i]$ of $M$, let us represent a pair of points $(p, p') \in M \times M$ by their coordinates and denote the partial derivatives of $D(p\|p')$ with respect to $p$ and $p'$ by
$$D((\partial_i)_p\,\|\,p') \equiv (\partial_i)_x D(x\|p')\big|_{x=p},$$
$$D((\partial_i)_p\,\|\,(\partial_j)_{p'}) \equiv (\partial_i)_x (\partial_j)_y D(x\|y)\big|_{x=p,\ y=p'},$$
$$D((\partial_i \partial_j)_p\,\|\,(\partial_k)_{p'}) \equiv (\partial_i)_x (\partial_j)_x (\partial_k)_y D(x\|y)\big|_{x=p,\ y=p'}, \quad \text{etc.} \quad (9)$$
These definitions are naturally extended to $D((X_1 \cdots X_l)_p\,\|\,p')$, $D(p\,\|\,(Y_1 \cdots Y_m)_{p'})$ and $D((X_1 \cdots X_l)_p\,\|\,(Y_1 \cdots Y_m)_{p'})$ for any vector fields $X_1, \dots, X_l, Y_1, \dots, Y_m \in \mathcal{T}(M)$.

Divergence. Now consider the restrictions onto the diagonal $\{(p, p) \mid p \in M\} \subset M \times M$ and denote the induced functions on $M$ by
$$D[X_1 \cdots X_l\,\|\,] : p \mapsto D((X_1 \cdots X_l)_p\,\|\,p),$$
$$D[\,\|\,Y_1 \cdots Y_m] : p \mapsto D(p\,\|\,(Y_1 \cdots Y_m)_p),$$
$$D[X_1 \cdots X_l\,\|\,Y_1 \cdots Y_m] : p \mapsto D((X_1 \cdots X_l)_p\,\|\,(Y_1 \cdots Y_m)_p). \quad (10)$$
Easily, we have
$$D[\partial_i\,\|\,] = D[\,\|\,\partial_i] = 0, \quad (11)$$
$$D[\partial_i \partial_j\,\|\,] = D[\,\|\,\partial_i \partial_j] = -D[\partial_i\,\|\,\partial_j] \;(\equiv g^{(D)}_{ij}). \quad (12)$$

Divergence and Riemannian metric. The matrix $[g^{(D)}_{ij}]$ is positive semidefinite (it is the Hessian of $D(\cdot\|q)$ at its minimum point $p = q$). When $[g^{(D)}_{ij}]$ is strictly positive definite everywhere on $M$, we say that $D$ is a divergence or a contrast function on $M$. For a divergence $D$, a unique Riemannian metric $g^{(D)} = \langle\cdot,\cdot\rangle^{(D)}$ on $M$ is defined by $\langle\partial_i, \partial_j\rangle^{(D)} = g^{(D)}_{ij}$, or equivalently by
$$\langle X, Y\rangle^{(D)} = -D[X\,\|\,Y]. \quad (13)$$
Using the Taylor expansion, we have
$$D(p\|q) = \tfrac{1}{2}\, g^{(D)}_{ij}(q)\, \Delta x^i \Delta x^j + o(\|\Delta x\|^2), \quad (14)$$
where $\Delta x^i \equiv x^i(p) - x^i(q)$.
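As a sanity check of Equations (12) and (14), the following toy numeric sketch (assuming the Bernoulli family and the KL divergence, neither fixed by the slides) recovers the Fisher metric $g(q) = 1/(q(1-q))$ both as $D[\partial\partial\,\|\,]$ and as $-D[\partial\,\|\,\partial]$ by finite differences.

```python
# A minimal numeric check of Eqs. (12)/(14): for D = KL divergence on the
# Bernoulli family, the Hessian of D(p||q) at p = q recovers the Fisher
# metric g(q) = 1/(q(1-q)). Plain finite differences; a toy sketch.
import numpy as np

def kl(p, q):  # KL divergence between Bernoulli(p) and Bernoulli(q)
    return p*np.log(p/q) + (1-p)*np.log((1-p)/(1-q))

q, eps = 0.3, 1e-4
hess = (kl(q+eps, q) - 2*kl(q, q) + kl(q-eps, q)) / eps**2     # D[dd || ](q)
cross = -(kl(q+eps, q+eps) - kl(q+eps, q-eps)
          - kl(q-eps, q+eps) + kl(q-eps, q-eps)) / (2*eps)**2  # -D[d || d](q)
print(hess, cross, 1/(q*(1-q)))   # all three agree: ~4.7619
```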

Divergence and connection. We define a connection $\nabla^{(D)}$ with coefficients $\Gamma^{(D)}_{ij,k}$ by
$$\Gamma^{(D)}_{ij,k} = -D[\partial_i \partial_j\,\|\,\partial_k], \quad (15)$$
or equivalently by
$$\langle \nabla^{(D)}_X Y, Z\rangle^{(D)} = -D[XY\,\|\,Z]. \quad (16)$$
It is easy to see that $\Gamma^{(D)}_{ij,k} = \Gamma^{(D)}_{ji,k}$, and hence $\Gamma^{h(D)}_{ij} = \Gamma^{h(D)}_{ji}$, i.e., $\nabla^{(D)}$ is torsion-free.

Divergence and connection.
$$D(p\|q) = \tfrac{1}{2}\, g^{(D)}_{ij}(q)\, \Delta x^i \Delta x^j + \tfrac{1}{6}\, h^{(D)}_{ijk}(q)\, \Delta x^i \Delta x^j \Delta x^k + o(\|\Delta x\|^3), \quad (17)$$
where
$$h^{(D)}_{ijk} \equiv D[\partial_i \partial_j \partial_k\,\|\,] = \partial_i g^{(D)}_{jk} + \Gamma^{(D)}_{jk,i}. \quad (18)$$
Conversely, we see that $g^{(D)}$ and $\nabla^{(D)}$ are determined by the expansion (17) through Equation (18).

Divergence and dual connection. Replace $D(p\|q)$ with its dual $D^*(p\|q) \equiv D(q\|p)$. Then we obtain $g^{(D^*)} = g^{(D)}$ and
$$\Gamma^{(D^*)}_{ij,k} = -D[\partial_k\,\|\,\partial_i \partial_j]. \quad (19)$$
Theorem. $\nabla^{(D)}$ and $\nabla^{(D^*)}$ are dual with respect to $g^{(D)}$.

Divergence and dual connection.
$$D(p\|q) = D^*(q\|p) = \tfrac{1}{2}\, g^{(D)}_{ij}(p)\, \Delta x^i \Delta x^j - \tfrac{1}{6}\, h^{(D^*)}_{ijk}(p)\, \Delta x^i \Delta x^j \Delta x^k + o(\|\Delta x\|^3), \quad (20)$$
where
$$h^{(D^*)}_{ijk} \equiv D^*[\partial_i \partial_j \partial_k\,\|\,] = \partial_i g^{(D)}_{jk} + \Gamma^{(D^*)}_{jk,i}. \quad (21)$$

Dual connection and divergence. We see that any divergence induces a torsion-free dualistic structure. Conversely, any torsion-free dualistic structure $(g, \nabla, \nabla^*)$ is induced from a divergence. In fact, if we let
$$D(p\|q) \equiv \tfrac{1}{2}\, g_{ij}(q)\, \Delta x^i \Delta x^j + \tfrac{1}{6}\, h_{ijk}(q)\, \Delta x^i \Delta x^j \Delta x^k, \quad (22)$$
where
$$h_{ijk} \equiv \partial_i g_{jk} + \Gamma_{jk,i} = \Gamma_{ij,k} + \Gamma^*_{ik,j} + \Gamma_{jk,i}, \quad (23)$$
then $(g, \nabla, \nabla^*) = (g^{(D)}, \nabla^{(D)}, \nabla^{(D^*)})$.

Dually flat space. Let $(g, \nabla, \nabla^*)$ be a dualistic structure on a manifold $M$. If $\nabla$ and $\nabla^*$ are both symmetric ($T = T^* = 0$), then from the theorem above ($R = 0 \iff R^* = 0$) we see that $\nabla$-flatness and $\nabla^*$-flatness are equivalent. We call $(M, g, \nabla, \nabla^*)$ a dually flat space if both dual connections are flat.

Autoparallel. Theorem. Let $(M, g, \nabla, \nabla^*)$ be a dually flat space. If a submanifold $N$ of $M$ is autoparallel with respect to either $\nabla$ or $\nabla^*$, then $N$ is a dually flat space with respect to the dualistic structure $(g_N, \nabla^N, \nabla^{*N})$ induced on $N$ by $(g, \nabla, \nabla^*)$.

Dual coordinates. Suppose $(U; \theta^i, \eta_j)$ is a coordinate neighborhood of a dually flat space $(M, g, \nabla, \nabla^*)$, where $[\theta^i]$ and $[\eta_j]$ denote affine coordinate systems for $\nabla$ and $\nabla^*$ respectively. We let $\partial_i \equiv \partial/\partial\theta^i$ and $\partial^j \equiv \partial/\partial\eta_j$. Then $\langle\partial_i, \partial^j\rangle$ is constant on $U$, since $\partial_i$ and $\partial^j$ are respectively parallel with respect to the flat connections $\nabla$ and $\nabla^*$. Thus we can choose particular coordinate systems such that
$$\langle\partial_i, \partial^j\rangle = \delta_i^j. \quad (24)$$
Two such systems are called mutually dual. We see then that a Euclidean coordinate system, defined by $\langle\partial_i, \partial_j\rangle = \delta_{ij}$ (an affine coordinate system), is self-dual.

Dual coordinates. Dual coordinate systems do not generally exist for a Riemannian manifold $(M, g)$. If $(M, g, \nabla, \nabla^*)$ is a dually flat space, then dual coordinate systems exist. Conversely, if for a Riemannian manifold $(M, g)$ such coordinate systems exist, then the connections $\nabla$ and $\nabla^*$ for which they are affine are uniquely determined, and $(M, g, \nabla, \nabla^*)$ is a dually flat space.

Dual coordinates. Let the components of $g$ with respect to $[\theta^i]$ and $[\eta_j]$ be defined by
$$g_{ij} \equiv \langle\partial_i, \partial_j\rangle \quad \text{and} \quad g^{ij} \equiv \langle\partial^i, \partial^j\rangle. \quad (25)$$
Considering $\partial^j = (\partial^j \theta^i)\,\partial_i$ and $\partial_i = (\partial_i \eta_j)\,\partial^j$, Equation (24) is equivalent to
$$\frac{\partial \eta_j}{\partial \theta^i} = g_{ij} \quad \text{and} \quad \frac{\partial \theta^i}{\partial \eta_j} = g^{ij}, \quad (26)$$
and therefore $g_{ij}\, g^{jk} = \delta_i^k$.

Legendre transformations. Suppose we are given mutually dual coordinate systems $[\theta^i]$ and $[\eta_j]$, and consider the following partial differential equation for a function $\psi : M \to \mathbb{R}$:
$$\partial_i \psi = \eta_i. \quad (27)$$
Rewriting this as $d\psi = \eta_i\, d\theta^i$, a solution exists iff $\partial_i \eta_j = \partial_j \eta_i$. Since $\partial_i \eta_j = g_{ij} = \partial_j \eta_i$, a solution $\psi$ always exists, and
$$\partial_i \partial_j \psi = g_{ij}. \quad (28)$$
The Hessian of $\psi$ is positive definite; hence $\psi$ is a strictly convex function of $[\theta^1, \dots, \theta^m]$.

Legendre transformations. Similarly, a solution $\phi$ to
$$\partial^i \phi = \theta^i \quad (29)$$
exists. In fact, $\phi = \theta^i \eta_i - \psi$ is a solution, and
$$\partial^i \partial^j \phi = g^{ij}; \quad (30)$$
hence $\phi$ is a strictly convex function of $[\eta_1, \dots, \eta_m]$.

Legendre transformations. From convexity we have
$$\phi(q) = \max_{p \in M}\{\theta^i(p)\,\eta_i(q) - \psi(p)\}, \quad (31)$$
$$\psi(p) = \max_{q \in M}\{\theta^i(p)\,\eta_i(q) - \phi(q)\}. \quad (32)$$
Sometimes it is more natural to view these relations as
$$\phi(\eta) = \max_{\theta \in \Theta}\{\theta^i \eta_i - \psi(\theta)\}, \quad (33)$$
$$\psi(\theta) = \max_{\eta \in H}\{\theta^i \eta_i - \phi(\eta)\}, \quad (34)$$
where $\psi$ and $\phi$ are simply convex functions defined on convex regions $\Theta$ and $H$ in $\mathbb{R}^m$.

Legendre transformations. The coordinate transformations expressed in Equations (27) through (32) are called Legendre transformations, and $\psi$ and $\phi$ are called their potentials. Note also that
$$\Gamma^*_{ij,k} \equiv \langle \nabla^*_{\partial_i} \partial_j, \partial_k\rangle = \partial_i \partial_j \partial_k\, \psi, \quad (35)$$
$$\Gamma^{ij,k} \equiv \langle \nabla_{\partial^i} \partial^j, \partial^k\rangle = \partial^i \partial^j \partial^k\, \phi, \quad (36)$$
which are derived from Equation (3) combined with $\Gamma_{ij,k} = \Gamma^{*\,ij,k} = 0$.
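A small numeric illustration of Equations (31)-(34), under the assumption that we work in the Bernoulli exponential family with log-partition $\psi(\theta) = \log(1+e^\theta)$; its Legendre dual should be the negative entropy $\phi(\eta) = \eta\log\eta + (1-\eta)\log(1-\eta)$.

```python
# A toy sketch of the Legendre transformation (33): phi(eta) is recovered as
# max_theta { theta*eta - psi(theta) }. The Bernoulli potentials used here are
# an assumed example, not part of the slides.
import numpy as np
from scipy.optimize import minimize_scalar

psi = lambda t: np.log1p(np.exp(t))
phi_closed = lambda e: e*np.log(e) + (1-e)*np.log(1-e)

eta = 0.7
res = minimize_scalar(lambda t: -(t*eta - psi(t)))   # maximize theta*eta - psi
print(-res.fun, phi_closed(eta))        # both ~ -0.6109
print(res.x, np.log(eta/(1-eta)))       # maximizer is theta = logit(eta)
```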

Canonical divergence. Let $(M, g, \nabla, \nabla^*)$ be a dually flat space, on which we are given mutually dual affine coordinate systems $\{[\theta^i], [\eta_i]\}$. The canonical divergence, or $(g, \nabla)$-divergence, is defined as
$$D(p\|q) \equiv \psi(p) + \phi(q) - \theta^i(p)\,\eta_i(q). \quad (37)$$
Then from Equations (31) and (32) we see that $D(p\|q) \ge 0$ and $D(p\|q) = 0 \iff p = q$. It is easy to verify the equations
$$D((\partial_i \partial_j)_p\,\|\,q) = g_{ij}(p) \quad \text{and} \quad D(p\,\|\,(\partial^i \partial^j)_q) = g^{ij}(q), \quad (38)$$
which immediately imply that $D$ is a divergence and induces $g$. Also $\nabla = \nabla^{(D)}$ and $\nabla^* = \nabla^{(D^*)}$, since $\Gamma^{(D)}_{ij,k} = 0$ and $\Gamma^{(D^*)\,ij,k} = 0$ due to the $\nabla$-affinity of $[\theta^i]$ and the $\nabla^*$-affinity of $[\eta_i]$.
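Continuing the same assumed Bernoulli example, Equation (37) can be checked numerically: the canonical divergence built from $\psi$, $\phi$ and the dual coordinates coincides with the KL divergence with its arguments reversed.

```python
# A toy verification of the canonical divergence (37) on the Bernoulli family
# (assumed example): D(p||q) = psi(p) + phi(q) - theta(p)*eta(q) reproduces
# the KL divergence with reversed arguments.
import numpy as np

psi = lambda t: np.log1p(np.exp(t))
phi = lambda e: e*np.log(e) + (1-e)*np.log(1-e)
sigmoid = lambda t: 1/(1 + np.exp(-t))

theta_p, eta_q = 0.8, 0.25
canon = psi(theta_p) + phi(eta_q) - theta_p*eta_q        # Eq. (37)

eta_p = sigmoid(theta_p)
kl_q_p = eta_q*np.log(eta_q/eta_p) + (1-eta_q)*np.log((1-eta_q)/(1-eta_p))
print(canon, kl_q_p)   # equal: here the canonical divergence is KL(q||p)
```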

Note 1. The canonical divergence is defined globally, though it uses locally defined charts; this is guaranteed by the following lemma. Lemma. Suppose $M$ is connected and flat with respect to $\nabla$; then any two points (indeed, any finite set of points) of $M$ can be contained in a single $\nabla$-affine chart.

Note 2. If we are given another pair of dual affine coordinate systems expressed by
$$\tilde\theta^j = A^j_i\, \theta^i + B^j, \qquad \tilde\eta_j = C^i_j\, \eta_i + D_j, \quad (39)$$
$$\tilde\psi = \psi + D_j\, \tilde\theta^j + c, \qquad \tilde\phi = \phi + B^j\, \tilde\eta_j - B^j D_j - c, \quad (40)$$
where $[A^j_i]$ is a regular matrix and $[C^i_j]$ is its inverse, $[B^j]$ and $[D_j]$ are real-valued vectors, and $c$ is a real number, then we have
$$\psi(p) + \phi(q) - \theta^i(p)\,\eta_i(q) = \tilde\psi(p) + \tilde\phi(q) - \tilde\theta^i(p)\,\tilde\eta_i(q), \quad (41)$$
which indicates that the canonical divergence is well defined. On $(M, g, \nabla^*, \nabla)$, we define the $(g, \nabla^*)$-divergence $D^*(p\|q) = D(q\|p)$.

Example. If $\nabla$ is the Riemannian connection, the condition of dual flatness reduces to $\nabla$ being flat, and hence there exists a Euclidean coordinate system $[\theta^i]$, which is self-dual ($\theta^i = \eta_i$), and its potential is given by $\psi = \phi = \tfrac{1}{2}\sum_i (\theta^i)^2$. Hence we obtain
$$D(p\|q) = \tfrac{1}{2}\sum_i \left\{(\theta^i(p))^2 + (\theta^i(q))^2 - 2\,\theta^i(p)\,\theta^i(q)\right\} = \tfrac{1}{2}\{d(p, q)\}^2, \quad (42)$$
where $d$ is the Euclidean distance $d(p, q) \equiv \sqrt{\sum_i \{\theta^i(p) - \theta^i(q)\}^2}$. In general, $D(p\|q)$ on a dually flat space is only approximately equal to $\tfrac{1}{2}\{d(p, q)\}^2$, in the sense of Equation (14).

Triangular relation. Theorem. Let $\{[\theta^i], [\eta_i]\}$ be mutually dual affine coordinate systems of a dually flat space $(M, g, \nabla, \nabla^*)$, and let $D$ be a divergence on $M$. Then a necessary and sufficient condition for $D$ to be the $(g, \nabla)$-divergence is that for all $p, q, r \in M$ the following triangular relation holds:
$$D(p\|q) + D(q\|r) - D(p\|r) = \{\theta^i(p) - \theta^i(q)\}\{\eta_i(r) - \eta_i(q)\}. \quad (43)$$
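A quick numeric spot-check of the triangular relation (43), again on the assumed Bernoulli example; the identity holds exactly for the canonical divergence.

```python
# A numeric spot-check of the triangular relation (43) for the canonical
# divergence of the Bernoulli family (assumed toy example, as above).
import numpy as np

psi = lambda t: np.log1p(np.exp(t))
sigmoid = lambda t: 1/(1 + np.exp(-t))
phi = lambda e: e*np.log(e) + (1-e)*np.log(1-e)

def D(tp, tq):                          # canonical divergence via Eq. (37)
    return psi(tp) + phi(sigmoid(tq)) - tp*sigmoid(tq)

rng = np.random.default_rng(0)
tp, tq, tr = rng.normal(size=3)
lhs = D(tp, tq) + D(tq, tr) - D(tp, tr)
rhs = (tp - tq)*(sigmoid(tr) - sigmoid(tq))
print(lhs, rhs)                         # agree to machine precision
```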

Pythagorean relation. Theorem. Let $p$, $q$, and $r$ be three points in $M$. Let $\gamma_1$ be the $\nabla$-geodesic connecting $p$ and $q$, and let $\gamma_2$ be the $\nabla^*$-geodesic connecting $q$ and $r$. If at the intersection $q$ the curves $\gamma_1$ and $\gamma_2$ are orthogonal (with respect to the inner product $g$), then we have the Pythagorean relation
$$D(p\|r) = D(p\|q) + D(q\|r). \quad (44)$$

Pythagorean relation. Figure: The Pythagorean relation for $(g, \nabla)$-divergences.

Projection. Corollary. Let $p$ be a point in $M$ and let $N$ be a submanifold of $M$ which is $\nabla^*$-autoparallel. Then a necessary and sufficient condition for a point $q \in N$ to satisfy $D(p\|q) = \min_{r \in N} D(p\|r)$ is that the $\nabla$-geodesic connecting $p$ and $q$ be orthogonal to $N$ at $q$. The point $q$ is called the $\nabla$-projection of $p$ onto $N$ when the $\nabla$-geodesic connecting $p$ and $q \in N$ is orthogonal to $N$.

Projection. Figure: The projection theorem for the $(g, \nabla)$-divergence.

Projection. Theorem. Let $p$ be a point in $M$ and let $N$ be a submanifold of $M$. A necessary and sufficient condition for a point $q \in N$ to be a stationary point of the function $D(p\|\cdot) : r \mapsto D(p\|r)$ restricted to $N$ (in other words, the partial derivatives with respect to a coordinate system of $N$ are all $0$) is that the $\nabla$-geodesic connecting $p$ and $q$ be orthogonal to $N$ at $q$. Corollary. Given a point $p \in M$ and a positive number $c$, suppose that the $D$-sphere $N = \{q \in M \mid D(p\|q) = c\}$ forms a hypersurface in $M$. Then every $\nabla$-geodesic passing through the center $p$ orthogonally intersects $N$.

em algorithm. Given two submanifolds $K$ and $S$ of a dually flat space $M$, we define the divergence between $K$ and $S$ by
$$D[K\|S] \equiv \min_{p \in K,\, q \in S} D(p\|q) = D(\hat p\,\|\,\hat q), \quad (45)$$
where $D$ is the $(g, \nabla)$-divergence of $M$ and $\hat p \in K$ and $\hat q \in S$ are the closest pair between $K$ and $S$. In order to obtain the closest pair, the following iterative algorithm is proposed.

em algorithm Figure: Iterated dual geodesic projections (em algorithm)

em algorithm. Begin with an arbitrary $Q_t \in S$, $t = 0, 1, \dots$, and search for the $P \in K$ that minimizes $D(P\|Q_t)$; it is given by the geodesic projection of $Q_t$ onto $K$. Let it be $P_t \in K$. Then search for the point in $S$ that minimizes $D(P_t\|Q)$; it is given by the dual geodesic projection of $P_t$ onto $S$, denoted $Q_{t+1}$. Since we have
$$D(P_{t-1}\|Q_t) \ge D(P_t\|Q_t) \ge D(P_t\|Q_{t+1}), \quad (46)$$
the procedure converges. The closest pair is unique when $S$ is flat and $K$ is flat with respect to the dual connection; otherwise, the converging point is not necessarily unique.
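The following sketch mimics the iteration (45)-(46) in the simplest self-dual case of Equation (42), where $D(p\|q) = \tfrac{1}{2}\|p - q\|^2$ and the two "submanifolds" are a line and a circle in $\mathbb{R}^2$; all names are illustrative, and the printed divergence decreases monotonically as in (46).

```python
# A generic sketch of the em-style alternating minimization (45)-(46), using
# the self-dual divergence of Eq. (42) and two toy submanifolds of R^2.
import numpy as np
from scipy.optimize import minimize_scalar

D = lambda p, q: 0.5*np.sum((p - q)**2)
K = lambda s: np.array([s, 2.0*s])                      # a line through 0
S = lambda t: np.array([3 + np.cos(t), np.sin(t)])      # unit circle at (3,0)

q = S(2.0)                                              # arbitrary Q_0 in S
for it in range(20):
    s = minimize_scalar(lambda s: D(K(s), q)).x         # projection onto K
    p = K(s)
    t = minimize_scalar(lambda t: D(p, S(t)),
                        bounds=(-np.pi, np.pi),
                        method='bounded').x             # dual projection onto S
    q = S(t)
    print(it, D(p, q))                                  # decreases monotonically
```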

f-divergence. Let $f(u)$ be a convex function on $u > 0$. For each pair of probability distributions $p$, $q$, we define
$$D_f(p\|q) \equiv \int p(x)\, f\!\left(\frac{q(x)}{p(x)}\right) dx \quad (47)$$
and call it the $f$-divergence.

Properties of f-divergences. Using Jensen's inequality we have
$$D_f(p\|q) \ge f\!\left(\int p(x)\, \frac{q(x)}{p(x)}\, dx\right) = f(1), \quad (48)$$
where the equality holds if $p = q$; conversely, the equality implies $p = q$ when $f(u)$ is strictly convex at $u = 1$. $D_f$ is invariant when $f(u)$ is replaced with $f(u) + c(u - 1)$ for any $c \in \mathbb{R}$.

Properties of f-divergences. $D^*_f = D_{f^*}$, where $f^*(u) = u f(1/u)$. Monotonicity: let $\kappa = \{\kappa(y|x) \ge 0;\ x \in \mathcal{X}, y \in \mathcal{Y}\}$ be an arbitrary transition probability distribution such that $\int \kappa(y|x)\, dy = 1$ for all $x$, whereby the value of $x$ is randomly transformed to $y$ according to the probability $\kappa(y|x)$. Denoting the distributions of $y$ derived from $p(x)$ and $q(x)$ by $p_\kappa(y)$ and $q_\kappa(y)$ respectively, we have
$$D_f(p\|q) \ge D_f(p_\kappa\|q_\kappa). \quad (49)$$

Properties of f-divergences. Proof of monotonicity:
$$D_f(p\|q) = \iint p(x)\,\kappa(y|x)\, f\!\left(\frac{q(x)}{p(x)}\right) dx\, dy = \iint p_\kappa(y)\, p_\kappa(x|y)\, f\!\left(\frac{q(x)}{p(x)}\right) dx\, dy$$
$$\ge \int p_\kappa(y)\, f\!\left(\int p_\kappa(x|y)\, \frac{q(x)}{p(x)}\, dx\right) dy = D_f(p_\kappa\|q_\kappa). \quad (50)$$
The equality holds if $p_\kappa(x|y) = q_\kappa(x|y)$ for all $x$ and $y$.
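On a finite sample space the monotonicity (49) is easy to test numerically; the sketch below (with the assumed choice $f(u) = u\log u$, i.e. the dual KL divergence, and a random transition kernel) illustrates it.

```python
# A quick numeric check of monotonicity (49) on a finite sample space: a
# random column-stochastic kernel never increases the f-divergence.
import numpy as np

rng = np.random.default_rng(1)
f = lambda u: u*np.log(u)

def Df(p, q):                       # f-divergence, Eq. (47), discrete case
    return np.sum(p * f(q/p))

p = rng.dirichlet(np.ones(5)); q = rng.dirichlet(np.ones(5))
kappa = rng.dirichlet(np.ones(4), size=5).T   # kappa[y, x], columns sum to 1
print(Df(p, q), '>=', Df(kappa @ p, kappa @ q))
```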

Joint convexity. The joint convexity
$$D_f(\lambda p_1 + (1-\lambda) p_2\,\|\,\lambda q_1 + (1-\lambda) q_2) \le \lambda\, D_f(p_1\|q_1) + (1-\lambda)\, D_f(p_2\|q_2), \quad 0 \le \lambda \le 1, \quad (51)$$
follows from the joint convexity of $(p, q) \mapsto p f(q/p)$: writing the argument as a convex combination,
$$(\lambda_1 p_1 + \lambda_2 p_2)\, f\!\left(\frac{\lambda_1 q_1 + \lambda_2 q_2}{\lambda_1 p_1 + \lambda_2 p_2}\right) = (\lambda_1 p_1 + \lambda_2 p_2)\, f\!\left(\frac{\lambda_1 p_1}{\lambda_1 p_1 + \lambda_2 p_2}\frac{q_1}{p_1} + \frac{\lambda_2 p_2}{\lambda_1 p_1 + \lambda_2 p_2}\frac{q_2}{p_2}\right) \le \lambda_1 p_1 f\!\left(\frac{q_1}{p_1}\right) + \lambda_2 p_2 f\!\left(\frac{q_2}{p_2}\right).$$

Assume $f$ is strictly convex and smooth with $f(1) = 0$; then $D_f$ becomes a divergence and induces the metric $g^{(D_f)} = g^{(f)}$ and the connection $\nabla^{(D_f)} = \nabla^{(f)}$.

α-divergence. Important examples of smooth $f$-divergences are given by the $\alpha$-divergences $D^{(\alpha)} = D_{f^{(\alpha)}}$ for a real number $\alpha$, defined by
$$f^{(\alpha)}(u) = \begin{cases} \dfrac{4}{1-\alpha^2}\left\{1 - u^{(1+\alpha)/2}\right\} & (\alpha \ne \pm 1) \\ u \log u & (\alpha = 1) \\ -\log u & (\alpha = -1). \end{cases} \quad (52)$$
We have, for $\alpha \ne \pm 1$,
$$D^{(\alpha)}(p\|q) = \frac{4}{1-\alpha^2}\left\{1 - \int p(x)^{\frac{1-\alpha}{2}}\, q(x)^{\frac{1+\alpha}{2}}\, dx\right\} \quad (53)$$

α-divergence. For $\alpha = \pm 1$,
$$D^{(-1)}(p\|q) = D^{(1)}(q\|p) = \int p(x) \log\frac{p(x)}{q(x)}\, dx. \quad (54)$$
We can immediately see that the $\alpha$-divergence $D^{(\alpha)}$ induces $(g^{(f^{(\alpha)})}, \nabla^{(f^{(\alpha)})}) = (g, \nabla^{(\alpha)})$. Note that $D^{(\alpha)}(p\|q) = D^{(-\alpha)}(q\|p)$ generally holds. In particular, $D^{(0)}(p\|q)$ is symmetric; moreover, $\sqrt{D^{(0)}(p\|q)}$ satisfies the axioms of distance, which follows since
$$D^{(0)}(p\|q) = 2\int \left(\sqrt{p(x)} - \sqrt{q(x)}\right)^2 dx. \quad (55)$$
$\sqrt{D^{(0)}(p\|q)}$ is called the Hellinger distance.
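A discrete-case sketch of Equation (53), checking the duality $D^{(\alpha)}(p\|q) = D^{(-\alpha)}(q\|p)$ and the Hellinger identity (55) on random distributions (an illustrative example, not from the slides).

```python
# A discrete-case sketch of the alpha-divergence (53), checking the duality
# D^(alpha)(p||q) = D^(-alpha)(q||p) and the Hellinger formula (55).
import numpy as np

rng = np.random.default_rng(2)
p = rng.dirichlet(np.ones(6)); q = rng.dirichlet(np.ones(6))

def D_alpha(p, q, a):               # Eq. (53), valid for alpha != +-1
    return 4/(1 - a**2) * (1 - np.sum(p**((1-a)/2) * q**((1+a)/2)))

print(D_alpha(p, q, 0.5), D_alpha(q, p, -0.5))                     # duality
print(D_alpha(p, q, 0.0), 2*np.sum((np.sqrt(p) - np.sqrt(q))**2))  # Eq. (55)
```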

Kullback divergence. The $\pm 1$-divergences are called the Kullback or Kullback-Leibler (KL) divergences. Here we refer to $D^{(-1)}$ as the KL divergence and to $D^{(1)}$ as its dual. The KL divergence satisfies the chain rule:
$$D^{(-1)}(p\|q) = D^{(-1)}(p_\kappa\|q_\kappa) + \int D^{(-1)}\!\left(p_\kappa(\cdot|y)\,\|\,q_\kappa(\cdot|y)\right) p_\kappa(y)\, dy. \quad (56)$$

Expectation parameters. In an exponential family
$$p(x; \theta) = \exp\left[C(x) + \theta^i F_i(x) - \psi(\theta)\right], \quad (57)$$
the natural parameters $[\theta^i]$ form a $1$-affine chart. Now if we define
$$\eta_i = \eta_i(\theta) \equiv E_\theta[F_i] = \int F_i(x)\, p(x; \theta)\, dx, \quad (58)$$
then $\eta_i = \partial_i \psi$ and $\partial_i \partial_j \psi = g_{ij}$. Hence $[\eta_i]$ is a $(-1)$-affine chart dual to $[\theta^i]$, and $\psi$ is the potential of a Legendre transformation. We call this $[\eta_i]$ the expectation parameters or the dual parameters.

Examples. Normal distribution:
$$\eta_1 = \mu = -\frac{\theta_1}{2\theta_2}, \qquad \eta_2 = \mu^2 + \sigma^2 = \frac{(\theta_1)^2}{4(\theta_2)^2} - \frac{1}{2\theta_2}.$$
Poisson distribution: $\eta = \xi = e^\theta$. $\mathcal{P}(\mathcal{X})$ for finite $\mathcal{X}$:
$$\eta_i = p(x_i) = \xi_i = \frac{e^{\theta^i}}{1 + \sum_{j=1}^n e^{\theta^j}}.$$
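The normal-distribution entries above can be verified symbolically, assuming the explicit log-partition $\psi(\theta) = -\theta_1^2/(4\theta_2) + \tfrac{1}{2}\log(-\pi/\theta_2)$ (this closed form is an added assumption, standard for the Gaussian family):

```python
# A symbolic check of eta_i = d(psi)/d(theta_i), Eq. (58), for the normal
# family; the explicit psi below is an assumed standard form.
import sympy as sp

t1 = sp.Symbol('theta1', real=True)
t2 = sp.Symbol('theta2', negative=True)
psi = -t1**2/(4*t2) + sp.log(-sp.pi/t2)/2

eta1 = sp.simplify(sp.diff(psi, t1))   # -> -theta1/(2*theta2)            (= mu)
eta2 = sp.simplify(sp.diff(psi, t2))   # -> theta1**2/(4*theta2**2) - 1/(2*theta2)
print(eta1, eta2)                      #    (= mu**2 + sigma**2)
```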

Entropy. The dual potential $\phi$ is given by
$$\phi(\eta) = \theta^i \eta_i - \psi(\theta) = E_\theta[\log p_\theta - C] = -H(p_\theta) - E_\theta[C], \quad (59)$$
where $H$ is the entropy: $H(p) \equiv -\int p(x) \log p(x)\, dx$. In addition, we have
$$\phi(\eta(\theta)) = \max_{\theta'}\{\theta'^i \eta_i(\theta) - \psi(\theta')\}, \quad (60)$$
where the maximum is attained by $\theta' = \theta$.

The $\pm 1$-divergence is exactly the canonical $(g, \nabla^{(\pm 1)})$-divergence. The triangular relation can be rewritten as
$$D(p\|q) + D(q\|r) - D(p\|r) = \int \{p(x) - q(x)\}\{\log r(x) - \log q(x)\}\, dx, \quad (61)$$
where $D = D^{(-1)}$ is the KL divergence.

Projection. From the theorems on the canonical divergence, the solutions to the minimization problems $\min_{q \in M} D(p\|q)$ and $\min_{q \in M} D(q\|p)$ are respectively given by the $m$-projection ($\nabla^{(m)}$-projection) and the $e$-projection ($\nabla^{(e)}$-projection).

Principle of maximum entropy. Given $(n+1)$ functions $C, F_1, \dots, F_n : \mathcal{X} \to \mathbb{R}$, let $S = \{p_\theta \mid \theta \in \Theta\}$ be the $n$-dimensional exponential family (57). Then for any $\theta \in \Theta$ and any $q \in \mathcal{P}(\mathcal{X})$ we have
$$H(p_\theta) + E_{p_\theta}[C] + \theta^i E_{p_\theta}[F_i] - H(q) - E_q[C] - \theta^i E_q[F_i] = D(q\|p_\theta) \ge 0, \quad (62)$$
which leads to
$$\max_{q \in \mathcal{P}(\mathcal{X})}\left\{H(q) + E_q[C] + \theta^i E_q[F_i]\right\} = H(p_\theta) + E_{p_\theta}[C] + \theta^i E_{p_\theta}[F_i] = \psi(\theta). \quad (63)$$

Principle of maximum entropy. Given a vector $\lambda = (\lambda_1, \dots, \lambda_n) \in \mathbb{R}^n$, let
$$M_\lambda \equiv \{q \in \mathcal{P}(\mathcal{X}) \mid E_q[F_i] = \lambda_i,\ i = 1, \dots, n\}. \quad (64)$$
Now assume $S \cap M_\lambda \ne \emptyset$ and suppose there exists $\theta_\lambda \in \Theta$ such that $\eta_i(\theta_\lambda) = E_{p_{\theta_\lambda}}[F_i] = \lambda_i$ for $i = 1, \dots, n$. Then we have
$$\max_{q \in M_\lambda}\left\{H(q) + E_q[C]\right\} = H(p_{\theta_\lambda}) + E_{p_{\theta_\lambda}}[C] = \psi(\theta_\lambda) - \theta_\lambda^i \lambda_i = \min_{\theta \in \Theta}\left\{\psi(\theta) - \theta^i \lambda_i\right\}. \quad (65)$$
When $C = 0$ it follows that $\max_{q \in M_\lambda} H(q) = H(p_{\theta_\lambda})$, which is called the principle of maximum entropy.
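A toy numeric illustration of (64)-(65) with $C = 0$, $\mathcal{X} = \{0,1,2,3\}$ and a single feature $F(x) = x$ (all assumed choices): the Gibbs-form solution attains the entropy maximum on $M_\lambda$.

```python
# A numeric illustration of the maximum-entropy principle (64)-(65); the
# sample space, feature, and constraint value are illustrative assumptions.
import numpy as np
from scipy.optimize import minimize_scalar

X = np.arange(4.0); lam = 1.2
psi = lambda th: np.log(np.sum(np.exp(th*X)))           # log-partition
th = minimize_scalar(lambda th: psi(th) - th*lam).x     # min psi - theta*lambda
p = np.exp(th*X - psi(th))                              # maxent distribution
H = lambda r: -np.sum(r*np.log(r))
print(np.sum(p*X), H(p))                                # E_p[F] = 1.2, max H

d = np.array([0.3, 0.3, 0.3, 0.1])                      # also has E[F] = 1.2
for a in (0.25, 0.5, 1.0):
    q = a*d + (1-a)*p                                   # stays in M_lambda
    print(H(q), '<=', H(p))                             # maxent wins
```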

Boltzmann-Gibbs distribution. The thermal equilibrium state, which maximizes the thermodynamical entropy $S(p) \equiv k H(p)$, where $k > 0$ is Boltzmann's constant, under the constraint $E_q[\epsilon] = \bar\epsilon$ on the average of the energy function $\epsilon$, is given by the Boltzmann-Gibbs distribution
$$p(x) = \frac{1}{Z}\, e^{-\epsilon(x)/kT}, \quad (66)$$
where $T$ is the temperature and $Z$ is the partition function. This corresponds to the previous situation by letting $C = 0$, $n = 1$, $F_1 = \epsilon$, $\lambda = \bar\epsilon$, $\theta_\lambda = -1/kT$ and $\psi(\theta_\lambda) = \log Z$.

Statistical model with hidden variables. Consider a statistical model $M = \{p(x; \xi)\}$, where $x$ is divided into two parts $x = (y, h)$ so that $p(x; \xi) = p(y, h; \xi)$. When $x$ is not fully observed but $y$ is observed, $h$ is called a hidden variable. In such a case, we estimate $\xi$ from the observed $y$. Actually, we want to compute the MLE under the marginal model
$$p_Y(y; \xi) = \int p(y, h; \xi)\, dh.$$
However, in many cases the form of $p(x; \xi)$ is simple and estimation is tractable in $M$, while $p_Y(y; \xi)$ is complicated and its estimation is computationally intractable.

Empirical distribution. Consider a larger model $S = \{q(y, h)\}$ consisting of all probability densities of $(y, h)$. We do not have the empirical distribution $\hat q(x) = \frac{1}{N}\sum_i \delta(x - x_i)$, but only an empirical distribution $\hat q_Y(y)$ for $y$. We use an arbitrary conditional distribution $q(h|y)$ and put
$$q(y, h) = \hat q_Y(y)\, q(h|y). \quad (67)$$
Taking all such candidates consistent with the observed data, we consider the submanifold
$$D = \{q(y, h) \mid q(y, h) = \hat q_Y(y)\, q(h|y),\ q(h|y)\ \text{arbitrary}\}. \quad (68)$$

Empirical distribution. $D$ is the observed submanifold of $S$ specified by the partially observed data $y_1, \dots, y_N$. Using the empirical distribution, it is written as
$$q(y, h) = \frac{1}{N}\sum_i \delta(y - y_i)\, q(h|y_i). \quad (69)$$
The data submanifold $D$ is $m$-flat, because it is linear with respect to $q(h|y_i)$.

MLE and KL divergence. Consider the minimizer of the KL divergence from the data manifold $D$ to the model manifold $M$:
$$D[D : M] = \min \iint \hat q_Y(y)\, q(h|y) \log\frac{\hat q_Y(y)\, q(h|y)}{p(y, h; \xi)}\, dy\, dh. \quad (70)$$
Theorem. The MLE of $p_Y(y; \xi)$ is the minimizer of the KL divergence from $D$ to $M$. In fact, we minimize the expression above with respect to both $\xi$ and $q(h|y)$ alternately by the em algorithm, that is, by alternating the $e$-projection and the $m$-projection.

Algorithm (em). 1. Choose an initial parameter $\xi_0$. 2. E-step: $e$-project $p(\cdot; \xi_0)$ onto $D$; it can be verified that the $e$-projection is $q(h|y) = p(h|y; \xi_0)$. 3. M-step: maximize the expected log likelihood
$$L(\xi, \xi_0) = \frac{1}{N}\sum_i \int p(h|y_i; \xi_0) \log p(y_i, h; \xi)\, dh \quad (71)$$
to obtain a new candidate $\xi_1$ in $M$; it can be verified that this is the $m$-projection. 4. Repeat steps 2 and 3.
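As a concrete instance of the E-step and M-step above, here is a compact EM sketch for a two-component Gaussian mixture with known unit variances (the model choice is an illustrative assumption, not from the slides).

```python
# A compact EM sketch for a two-component Gaussian mixture with unit
# variances, matching the E-step/M-step above; all choices are illustrative.
import numpy as np

rng = np.random.default_rng(4)
y = np.concatenate([rng.normal(-2, 1, 200), rng.normal(2, 1, 300)])

w, mu = 0.5, np.array([-1.0, 1.0])      # initial xi_0 = (w, mu_1, mu_2)
for _ in range(50):
    # E-step: responsibilities q(h|y) = p(h|y; xi_t)
    dens = np.exp(-0.5*(y[:, None] - mu)**2) / np.sqrt(2*np.pi)
    r = dens * np.array([w, 1-w])
    r /= r.sum(axis=1, keepdims=True)
    # M-step: maximize the expected complete-data log likelihood (71)
    w = r[:, 0].mean()
    mu = (r * y[:, None]).sum(axis=0) / r.sum(axis=0)
print(w, mu)                            # ~0.4 and means near (-2, 2)
```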

Theorem. The KL divergence decreases monotonically by repeating the E-step and the M-step; hence, the algorithm converges to an equilibrium. It should be noted that the $m$-projection is not necessarily unique unless $M$ is $e$-flat, so there might exist local minima. However, the model is often an exponential family, in which case the $m$-projection is unique.