Information Geometric view of Belief Propagation


Information Geometric view of Belief Propagation Yunshu Liu 2013-10-17

References: [1] Shiro Ikeda, Toshiyuki Tanaka and Shun-ichi Amari, "Stochastic Reasoning, Free Energy, and Information Geometry," Neural Computation, 16, 1779-1810, 2004. [2] Shiro Ikeda, Toshiyuki Tanaka and Shun-ichi Amari, "Information Geometry of Turbo and Low-Density Parity-Check Codes," IEEE Transactions on Information Theory, vol. 50, no. 6, June 2004. [3] Kevin P. Murphy, Machine Learning: A Probabilistic Perspective, The MIT Press, 2012.

Outline
Information Geometry: Examples of probabilistic manifolds; Important submanifolds; Information projection
Information Geometry of Belief Propagation: Belief Propagation/Sum-product algorithm; Information Geometry of the graphical structure; Information-Geometrical view of Belief Propagation

Probability distribution viewed as manifold Outline for Information Geometry: Examples of probabilistic manifolds; Important submanifolds; Information projection

Manifold
A manifold S is a set with a coordinate system, a one-to-one mapping from S to R^n, so that S locally looks like an open subset of R^n.
Elements of the set (points): points in R^n, probability distributions, linear systems.
Figure: A coordinate system ξ for a manifold S

Example of manifold Parametrization of color models
Parametrization: a map of colors into R^3. Examples:

Example of manifold Parametrization of Gaussian distributions
S = {p(x; μ, σ)},  p(x; μ, σ) = (1/(√(2π) σ)) exp{ −(x − μ)^2 / (2σ^2) }
Figure: the manifold S with coordinates (μ, σ).

Example of manifold Exponential family
p(x; θ) = exp[ c(x) + Σ_{i=1}^n θ_i F_i(x) − ψ(θ) ]
The [θ_i] are called the natural parameters (coordinates), and ψ is the potential function for [θ_i], which can be calculated as
ψ(θ) = log ∫ exp[ c(x) + Σ_{i=1}^n θ_i F_i(x) ] dx
The exponential families include many of the most common distributions, including the normal, exponential, gamma, beta, Dirichlet, Bernoulli, binomial, multinomial, Poisson, and so on.
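For a concrete instance of this form (an illustrative sketch, not part of the original slides): the Bernoulli distribution has c(x) = 0, F(x) = x, natural parameter θ = log(p/(1 − p)) and ψ(θ) = log(1 + e^θ). The short Python check below verifies numerically that exp[θx − ψ(θ)] reproduces the Bernoulli probabilities.

```python
import numpy as np

# Bernoulli(p) as a one-parameter exponential family:
#   p(x; theta) = exp[theta * x - psi(theta)], x in {0, 1},
# with c(x) = 0, F(x) = x, theta = log(p / (1 - p)), psi(theta) = log(1 + e^theta).
p = 0.3
theta = np.log(p / (1 - p))          # natural parameter
psi = np.log(1 + np.exp(theta))      # potential (log-partition) function

for x in (0, 1):
    print(x, np.exp(theta * x - psi))   # prints 0.7 and 0.3
```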

Example of two binary discrete random variables as Exponential families
Let x = {x_1, x_2}, x_i = −1 or 1, and consider the family of all probability distributions
S = {p(x) | p(x) > 0, Σ_i p(x = C_i) = 1}
Outcomes: C_0 = {−1, −1}, C_1 = {−1, 1}, C_2 = {1, −1}, C_3 = {1, 1}
Parameterization: p(x = C_i) = η_i for i = 1, 2, 3, and p(x = C_0) = 1 − Σ_{j=1}^3 η_j
η = (η_1, η_2, η_3) is called the m-coordinate of S.
Figure: the simplex in the η coordinate, with vertices (1, 0, 0), (0, 1, 0), (0, 0, 1) and origin (0, 0, 0).

Example of two binary discrete random variables as Exponential families
A new coordinate {θ_i} is defined as
θ_i = ln( η_i / (1 − Σ_{j=1}^3 η_j) )  for i = 1, 2, 3.   (1)
θ = (θ_1, θ_2, θ_3) is called the e-coordinate of S.
Figure: the simplex in the η coordinate versus the whole space in the θ coordinate.

Example of two binary discrete random variables as Exponential families
Introduce new random variables
δ_{C_i}(x) = 1 if x = C_i, and 0 otherwise,
and let
ψ = −ln p(x = C_0)   (2)
Then S is an exponential family with natural parameters {θ_i}:
p(x, θ) = exp( Σ_{i=1}^3 θ_i δ_{C_i}(x) − ψ(θ) )   (3)
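This coordinate change can be checked numerically. The sketch below (illustrative, assuming the outcome ordering C_0, ..., C_3 above and ψ(θ) = log(1 + Σ_i e^{θ_i}) = −ln p(x = C_0)) recovers the original probabilities from the e-coordinates.

```python
import numpy as np

# m-coordinates: eta_i = p(x = C_i), i = 1, 2, 3; p(C_0) = 1 - sum(eta).
eta = np.array([0.1, 0.2, 0.3])
p_C0 = 1 - eta.sum()

# e-coordinates (Eq. 1): theta_i = log(eta_i / p(C_0)).
theta = np.log(eta / p_C0)

# Potential (Eq. 2): psi = -log p(C_0) = log(1 + sum_i exp(theta_i)).
psi = np.log(1 + np.exp(theta).sum())

# Rebuild p(x, theta) = exp(sum_i theta_i * delta_{C_i}(x) - psi)  (Eq. 3).
p_rebuilt = np.exp(np.concatenate(([0.0], theta)) - psi)  # order: C_0, C_1, C_2, C_3
print(p_rebuilt)   # [0.4, 0.1, 0.2, 0.3] -- matches (p(C_0), eta)
```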

Probability distribution viewed as manifold Outline for Information Geometry: Examples of probabilistic manifolds; Important submanifolds; Information projection

Submanifolds
Definition: a submanifold M of a manifold S is a subset of S which itself has the structure of a manifold.
An open subset of an n-dimensional manifold forms an n-dimensional submanifold. One way to construct an m (< n)-dimensional submanifold: fix n − m of the coordinates.
Examples:

e-autoparallel submanifold and m-autoparallel submanifold e-autoparallel submanifold
A subfamily E is said to be an e-autoparallel submanifold of S if for some coordinate λ there exist a matrix A and a vector b such that, for the e-coordinate θ of S,
θ(p) = Aλ(p) + b  (∀p ∈ E)   (4)
If λ is one-dimensional, E is called an e-geodesic. An e-autoparallel submanifold of S is a hyperplane in the e-coordinate; an e-geodesic of S is a straight line in the e-coordinate.
Figure: E is a hyperplane in the θ coordinate.

e-autoparallel submanifold and m-autoparallel submanifold m-autoparallel submanifold
A subfamily M is said to be an m-autoparallel submanifold of S if for some coordinate γ there exist a matrix A and a vector b such that, for the m-coordinate η of S,
η(p) = Aγ(p) + b  (∀p ∈ M)   (5)
If γ is one-dimensional, M is called an m-geodesic.

e-autoparallel submanifold and m-autoparallel submanifold Two independent bits
The set of all product distributions, defined as
E = {p(x) | p(x) = p_{X_1}(x_1) p_{X_2}(x_2)}   (6)
Write p_{X_1}(x_1 = 1) = p_1, p_{X_2}(x_2 = 1) = p_2; then p_{X_1}(x_1 = −1) = 1 − p_1, p_{X_2}(x_2 = −1) = 1 − p_2.
Parameterization: define λ_i = (1/2) ln( p_i / (1 − p_i) ) for i = 1, 2, and ψ(λ) = Σ_{i=1,2} ln(e^{λ_i} + e^{−λ_i}); then
p(x, λ) = e^{λ_1 x_1 − ψ(λ_1)} e^{λ_2 x_2 − ψ(λ_2)} = e^{λ·x − ψ(λ)}   (7)
Thus E is an exponential family with natural parameter {λ_i}.

e-autoparallel submanifold and m-autoparallel submanifold Dual parameters: expectations
γ = (γ_1, γ_2)^T = (∂ψ(λ)/∂λ_1, ∂ψ(λ)/∂λ_2)^T = (2p_1 − 1, 2p_2 − 1)^T = E_p[x]
Properties: E is an e-autoparallel submanifold (a plane in the θ coordinate) of S, but not an m-autoparallel submanifold (not a plane in the η coordinate):
θ(p) = (θ_1, θ_2, θ_3)^T = A (λ_1, λ_2)^T = A λ(p),  with A = [[0, 2], [2, 0], [2, 2]] for the outcome ordering C_0, ..., C_3 above

e-autoparallel submanifold and m-autoparallel submanifold
Two independent bits: an e-autoparallel submanifold but not an m-autoparallel submanifold.
Figure: E in the θ coordinate (a plane) and E in the η coordinate (a curved surface).
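A small numerical sketch of this contrast (assuming the outcome ordering C_0 = {−1,−1}, C_1 = {−1,1}, C_2 = {1,−1}, C_3 = {1,1} and the matrix A reconstructed above): the θ-coordinates of a product distribution are an exact linear function of λ, while the η-coordinates are not.

```python
import numpy as np

A = np.array([[0, 2], [2, 0], [2, 2]])   # theta(p) = A @ lambda, under the assumed ordering

def product_dist(lam):
    """Joint p(x1, x2) proportional to exp(lam_1 x_1 + lam_2 x_2), x_i in {-1, +1}."""
    outcomes = [(-1, -1), (-1, 1), (1, -1), (1, 1)]       # C_0, C_1, C_2, C_3
    p = np.array([np.exp(lam[0] * x1 + lam[1] * x2) for x1, x2 in outcomes])
    return p / p.sum()

for lam in (np.array([0.3, -0.5]), np.array([0.6, -1.0])):  # second = 2 * first
    p = product_dist(lam)
    theta = np.log(p[1:] / p[0])    # e-coordinates
    eta = p[1:]                     # m-coordinates
    print(theta, A @ lam)           # theta equals A @ lam exactly -> linear (e-autoparallel)
    print(eta)                      # eta does NOT scale linearly with lam -> curved in eta
```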

Probability distribution viewed as manifold Outline for Information Geometry: Examples of probabilistic manifolds; Important submanifolds; Information projection

Projections and Pythagorean Theorem KL-divergence
On the manifold of probability mass functions for random variables taking values in the set X, we can also define the Kullback-Leibler divergence, or relative entropy, measured in bits, according to
D(p_X ‖ q_X) = Σ_{x ∈ X} p_X(x) log( p_X(x) / q_X(x) )   (8)
Figure: p_X and q_X as points on the manifold S.

Projections and Pythagorean Theorem Information Projection
Let p_X be a point in S and let E be an e-autoparallel submanifold of S. We want to find the point Π_E(p_X) in E that minimizes the KL-divergence:
D(p_X ‖ Π_E(p_X)) = min_{q_X ∈ E} D(p_X ‖ q_X)   (9)
The m-geodesic connecting p_X and Π_E(p_X) is orthogonal to E at Π_E(p_X). In geometry, orthogonal usually means a right angle; in information geometry, orthogonal usually means that certain values are uncorrelated.
Figure: the m-geodesic from p_X to Π_E(p_X) on the e-autoparallel submanifold E, where D(p_X ‖ Π_E(p_X)) = min_{q_X ∈ E} D(p_X ‖ q_X).

Projections and Pythagorean Theorem Pythagorean relation
For q_X ∈ E, we have the following Pythagorean relation:
D(p_X ‖ q_X) = D(p_X ‖ Π_E(p_X)) + D(Π_E(p_X) ‖ q_X),  ∀ q_X ∈ E   (10)
Figure: p_X, its m-projection Π_E(p_X) onto the e-autoparallel submanifold E, and an arbitrary q_X ∈ E.
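Both facts can be verified numerically for the two-bit example, where E is the product-distribution submanifold and the minimizer Π_E(p_X) is the product of the marginals of p_X. The sketch below (illustrative, not from the slides) checks the Pythagorean relation (10) for an arbitrary q_X ∈ E.

```python
import numpy as np

def kl(p, q):
    """KL divergence D(p || q) in nats for discrete distributions."""
    return float(np.sum(p * np.log(p / q)))

# Joint p(x1, x2) on outcomes C_0..C_3 = (-1,-1), (-1,1), (1,-1), (1,1).
p = np.array([0.35, 0.15, 0.10, 0.40])

p2 = p.reshape(2, 2)                   # axis 0: x1, axis 1: x2
m1, m2 = p2.sum(axis=1), p2.sum(axis=0)

# m-projection of p onto E (product distributions) = product of its marginals.
proj = np.outer(m1, m2).ravel()

# Any other member q of E:
q1, q2 = np.array([0.7, 0.3]), np.array([0.45, 0.55])
q = np.outer(q1, q2).ravel()

print(kl(p, q))                        # D(p || q)
print(kl(p, proj) + kl(proj, q))       # D(p || proj) + D(proj || q) -- identical
```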

Projection in θ coordinate

Projection in η coordinate

Information Geometry of Belief Propagation Outline for Information Geometry of Belief Propagation Belief Propagation/Sum-product algorithms Information Geometry of Graphical structure Information-Geometrical view of Belief Propagation

Belief Propagation Belief Propagation/Sum-product algorithm
The Belief Propagation algorithm uses a message-passing scheme to exploit conditional independence and uses these properties to change the order of sums and products, e.g. xy + xz = x(y + z).
Figure: expression trees of xy + xz and x(y + z), where the leaf nodes represent variables and the internal nodes represent arithmetic operators (addition, multiplication, negation, etc.).

Belief Propagation Message passing on a factor graph: Variable x to factor f:
m_{x→f}(x) = Π_{k ∈ nb(x)\f} m_{f_k→x}(x)
nb(x) denotes the set of neighboring factor nodes of x. If nb(x)\f = ∅, then m_{x→f}(x) = 1.
Figure: messages m_{f_1→x}, ..., m_{f_K→x} arriving at x and the outgoing message m_{x→f}.

Belief Propagation Message passing on a factor graph: Factor f to variable x:
m_{f→x}(x) = Σ_{x_1, ..., x_M} f(x, x_1, ..., x_M) Π_{x_m ∈ nb(f)\x} m_{x_m→f}(x_m)
nb(f) denotes the set of neighboring variable nodes of f. If nb(f)\x = ∅, then m_{f→x}(x) = f(x).
Figure: messages m_{x_1→f}, ..., m_{x_M→f} arriving at f and the outgoing message m_{f→x}.

Serial protocol of Belief Propagation Message passing on a tree-structured factor graph
Serial protocol: send messages up to the root and back.
Figure: tree-structured factor graph with variables x_1, x_2, x_3, x_4 and factors f_a(x_1, x_2), f_b(x_2, x_3), f_c(x_2, x_4); x_3 is taken as the root.

Serial protocol of Belief Propagation Message passing on a tree-structured factor graph
Down-up:
m_{x_1→f_a}(x_1) = 1
m_{f_a→x_2}(x_2) = Σ_{x_1} f_a(x_1, x_2)
m_{x_4→f_c}(x_4) = 1
m_{f_c→x_2}(x_2) = Σ_{x_4} f_c(x_2, x_4)
m_{x_2→f_b}(x_2) = m_{f_a→x_2}(x_2) m_{f_c→x_2}(x_2)
m_{f_b→x_3}(x_3) = Σ_{x_2} f_b(x_2, x_3) m_{x_2→f_b}(x_2)

Serial protocol of Belief Propagation Message passing on a tree-structured factor graph
Up-down:
m_{x_3→f_b}(x_3) = 1
m_{f_b→x_2}(x_2) = Σ_{x_3} f_b(x_2, x_3)
m_{x_2→f_a}(x_2) = m_{f_b→x_2}(x_2) m_{f_c→x_2}(x_2)
m_{f_a→x_1}(x_1) = Σ_{x_2} f_a(x_1, x_2) m_{x_2→f_a}(x_2)
m_{x_2→f_c}(x_2) = m_{f_a→x_2}(x_2) m_{f_b→x_2}(x_2)
m_{f_c→x_4}(x_4) = Σ_{x_2} f_c(x_2, x_4) m_{x_2→f_c}(x_2)

Serial protocol of Belief Propagation Message passing on a tree-structured factor graph
Calculating p(x_2):
p(x_2) = (1/Z) m_{f_a→x_2}(x_2) m_{f_b→x_2}(x_2) m_{f_c→x_2}(x_2)
= (1/Z) {Σ_{x_1} f_a(x_1, x_2)} {Σ_{x_3} f_b(x_2, x_3)} {Σ_{x_4} f_c(x_2, x_4)}
= (1/Z) Σ_{x_1} Σ_{x_3} Σ_{x_4} f_a(x_1, x_2) f_b(x_2, x_3) f_c(x_2, x_4)
= Σ_{x_1} Σ_{x_3} Σ_{x_4} p(x)
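The same computation in runnable form: a minimal sketch assuming binary variables and arbitrary positive factor tables f_a, f_b, f_c (the names follow the figure). The belief at x_2 is compared with brute-force marginalization.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 2                                   # each variable takes K states
fa = rng.random((K, K))                 # fa(x1, x2)
fb = rng.random((K, K))                 # fb(x2, x3)
fc = rng.random((K, K))                 # fc(x2, x4)

# Messages into x2, combining the down-up and up-down passes:
m_fa_x2 = fa.sum(axis=0)                # sum_x1 fa(x1, x2)
m_fb_x2 = fb.sum(axis=1)                # sum_x3 fb(x2, x3)
m_fc_x2 = fc.sum(axis=1)                # sum_x4 fc(x2, x4)

# Belief at x2: product of incoming factor-to-variable messages, normalized.
b2 = m_fa_x2 * m_fb_x2 * m_fc_x2
b2 /= b2.sum()

# Brute-force check: p(x2) = (1/Z) sum_{x1,x3,x4} fa(x1,x2) fb(x2,x3) fc(x2,x4).
joint = np.einsum('ab,bc,bd->abcd', fa, fb, fc)   # indices: x1, x2, x3, x4
p2 = joint.sum(axis=(0, 2, 3))
p2 /= p2.sum()

print(b2, p2)                           # identical marginals for x2
```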

Parallel protocol of Belief Propagation Message passing on general graphs with loops
Parallel protocol:
Initialize all messages.
All nodes receive messages from their neighbors in parallel and update their belief states.
Send new messages back out to their neighbors.
Repeat the above two steps until convergence.
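A compact sketch of this flooding schedule (illustrative, assuming binary variables and pairwise factors ψ_e on a 3-cycle). Because the graph has a loop, the resulting beliefs are only approximations of the exact marginals.

```python
import numpy as np

rng = np.random.default_rng(1)
K = 2                                                   # binary variables
edges = [(0, 1), (1, 2), (2, 0)]                        # a 3-cycle (one loop)
psi = {e: rng.random((K, K)) + 0.5 for e in edges}      # pairwise factors psi_e(x_i, x_j)

# Factor-to-variable messages, keyed by (edge, variable); initialized uniform.
msg = {(e, v): np.ones(K) / K for e in edges for v in e}

def var_to_fac(v, f):
    """Variable-to-factor message: product of messages into v from factors other than f."""
    out = np.ones(K)
    for e in edges:
        if v in e and e != f:
            out = out * msg[(e, v)]
    return out

for _ in range(100):                                    # parallel (flooding) sweeps
    new = {}
    for (i, j) in edges:
        t = psi[(i, j)]                                 # table indexed (x_i, x_j)
        new[((i, j), j)] = t.T @ var_to_fac(i, (i, j))  # sum over x_i
        new[((i, j), i)] = t @ var_to_fac(j, (i, j))    # sum over x_j
    msg = {k: v / v.sum() for k, v in new.items()}      # normalize for stability

# Belief (approximate marginal) at variable 0 vs. the exact marginal.
b0 = var_to_fac(0, None)                                # product of ALL messages into 0
b0 /= b0.sum()
joint = np.einsum('ab,bc,ca->abc', psi[(0, 1)], psi[(1, 2)], psi[(2, 0)])
p0 = joint.sum(axis=(1, 2)); p0 /= p0.sum()
print(b0, p0)                                           # close, but generally not exact (loopy graph)
```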

Information Geometry of Belief Propagation Outline for Information Geometry of Belief Propagation Belief Propagation/Sum-product algorithms Information Geometry of Graphical structure Information-Geometrical view of Belief Propagation

Information Geometry of Graphical structure Important Graphs (Undirected) for Belief Propagation
A: belief graph with no edges (all n nodes are independent)
B: graph with a single edge r
C: graph of the true distribution with n nodes and L edges

Information Geometry of Graphical structure Important Graphs (Undirected) for Belief Propagation
A: M_0 = {p_0(x, θ) = exp[ Σ_{i=1}^n h_i x_i + Σ_{i=1}^n θ_i x_i − ψ_0(θ) ], θ ∈ R^n}
B: M_r = {p_r(x, ζ_r) = exp[ Σ_{i=1}^n h_i x_i + s_r c_r(x) + Σ_{i=1}^n ζ_{ri} x_i − ψ_r(ζ_r) ], ζ_r ∈ R^n}
C: S = {p(x, θ, s) = exp[ Σ_{i=1}^n h_i x_i + Σ_{i=1}^n θ_i x_i + Σ_{r=1}^L s_r c_r(x) − ψ(θ, s) ], θ ∈ R^n, s ∈ R^L}

Information Geometry of Graphical structure Properties of M_0, M_r
M_0 is an e-autoparallel submanifold of S, with e-affine coordinate θ.
M_r for r = 1, ..., L are all e-autoparallel submanifolds of S, with e-affine coordinate ζ_r.
A distribution in M_r includes only one nonlinear interaction term c_r(x), corresponding to the single edge in B.

Information Geometry of Belief Propagation Outline for Information Geometry of Belief Propagation Belief Propagation/Sum-product algorithms Information Geometry of Graphical structure Information-Geometrical view of Belief Propagation

Information-Geometrical view of Belief Propagation Problem setup for belief propagation
Suppose we have the true distribution q(x) ∈ S and want to calculate q_0(x) = π_{M_0} ∘ q(x), the m-projection of q(x) onto M_0; the goal of belief propagation is to find a good approximation q̂(x) ∈ M_0 of q(x).
M_0 = {p_0(x, θ) = exp[ Σ_{i=1}^n h_i x_i + Σ_{i=1}^n θ_i x_i − ψ_0(θ) ], θ ∈ R^n}
M_r = {p_r(x, ζ_r) = exp[ Σ_{i=1}^n h_i x_i + s_r c_r(x) + Σ_{i=1}^n ζ_{ri} x_i − ψ_r(ζ_r) ], ζ_r ∈ R^n}
S = {p(x, θ, s) = exp[ Σ_{i=1}^n h_i x_i + Σ_{i=1}^n θ_i x_i + Σ_{r=1}^L s_r c_r(x) − ψ(θ, s) ], θ ∈ R^n, s ∈ R^L}

Geometrical Belief Propagation Algorithm
1) Put t = 0 and start with initial guesses ζ_r^0, for example ζ_r^0 = 0.
2) For t = 0, 1, 2, ..., m-project p_r(x, ζ_r^t) onto M_0 and obtain the linearized version of s_r c_r(x):
ξ_r^{t+1} = π_{M_0} ∘ p_r(x, ζ_r^t) − ζ_r^t
3) Summarize all the effects of the s_r c_r(x) to give
θ^{t+1} = Σ_{r=1}^L ξ_r^{t+1}
4) Update ζ_r by
ζ_r^{t+1} = Σ_{r'≠r} ξ_{r'}^{t+1} = θ^{t+1} − ξ_r^{t+1}
5) Repeat 2)-4) until convergence.
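A runnable sketch of this procedure (illustrative, not code from the referenced papers), assuming a small binary model with x_i ∈ {−1, +1}, c_r(x) = x_i x_j for edge r = (i, j), and a tree-structured graph (a 3-node chain), so the result should be exact. For M_0 the m-projection just matches the mean of x, so π_{M_0} ∘ p_r(x, ζ_r) has θ-coordinate arctanh(E_{p_r}[x]) − h.

```python
import numpy as np
from itertools import product

# True model q(x) = exp( sum_i h_i x_i + sum_r s_r c_r(x) - psi ), x_i in {-1,+1},
# with c_r(x) = x_i * x_j for each edge r. A 3-node chain (a tree), so BP is exact.
n, edges = 3, [(0, 1), (1, 2)]
h = np.array([0.3, -0.2, 0.5])
s = np.array([0.8, -0.6])
L = len(edges)

X = np.array(list(product([-1, 1], repeat=n)))          # all 2^n configurations

def mean_x(log_p):
    """Expectation of x under the distribution with unnormalized log-probabilities log_p."""
    p = np.exp(log_p - log_p.max()); p /= p.sum()
    return p @ X

def proj_M0(log_p):
    """theta-coordinate of the m-projection onto M_0 (match E[x], subtract h)."""
    return np.arctanh(mean_x(log_p)) - h

# Geometrical BP: zeta_r, xi_r in R^n.
zeta = np.zeros((L, n))
for _ in range(200):
    xi = np.zeros((L, n))
    for r, (i, j) in enumerate(edges):
        log_pr = X @ (h + zeta[r]) + s[r] * X[:, i] * X[:, j]   # p_r(x, zeta_r), unnormalized
        xi[r] = proj_M0(log_pr) - zeta[r]                       # linearized version of s_r c_r(x)
    theta = xi.sum(axis=0)                                      # step 3
    zeta = theta - xi                                           # step 4: zeta_r = theta - xi_r

# p_0(x, theta) has independent +-1 components with means tanh(h + theta).
approx_means = np.tanh(h + theta)
exact_means = mean_x(X @ h + sum(s[r] * X[:, i] * X[:, j] for r, (i, j) in enumerate(edges)))
print(approx_means)
print(exact_means)      # equal up to numerical precision, because the graph is a tree

# e-condition check: theta equals zeta.sum(axis=0) / (L - 1) at every iteration.
print(np.allclose(theta, zeta.sum(axis=0) / (L - 1)))   # True
```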

Geometrical Belief Propagation Algorithm Analysis
Given q(x), let ζ_r be the current solution which M_r believes to give a good approximation of q(x). We then project it onto M_0, giving the distribution p_0(x, θ_r) in M_0 having the same expectation: p_0(x, θ_r) = π_{M_0} ∘ p_r(x, ζ_r). Since θ_r includes both the effect of the single nonlinear term s_r c_r(x) specific to M_r and that due to the linear term ζ_r, we can use ξ_r = θ_r − ζ_r to represent the effect of the single nonlinear term s_r c_r(x); we consider it as the linearized version of s_r c_r(x).

Geometrical Belief Propagation Algorithm Analysis (continued)
Once M_r knows that the linearized version of s_r c_r(x) is ξ_r, it broadcasts ξ_r to all the other models M_0 and M_{r'} (r' ≠ r). Receiving these messages from all M_r, M_0 guesses that the equivalent linear term of Σ_{r=1}^L s_r c_r(x) is θ = Σ_{r=1}^L ξ_r. Since M_r in turn receives messages ξ_{r'} from all other M_{r'} (r' ≠ r), it uses them to form a new ζ_r = Σ_{r'≠r} ξ_{r'} = θ − ξ_r. The whole process is repeated until convergence.

Geometrical Belief Propagation Algorithm Analysis (continued)
Assume the algorithm has converged to {ζ_r^*} and θ^*. The estimate of q(x) ∈ S in M_0 is p_0(x, θ^*) (which may not be the exact solution). We write p_0(x, θ^*), p_r(x, ζ_r^*) and q(x) as:
p_0(x, θ^*) = exp[ Σ_{i=1}^n h_i x_i + Σ_{r=1}^L ξ_r^*·x − ψ_0 ]
p_r(x, ζ_r^*) = exp[ Σ_{i=1}^n h_i x_i + s_r c_r(x) + Σ_{r'≠r} ξ_{r'}^*·x − ψ_r ]
q(x) = exp[ Σ_{i=1}^n h_i x_i + Σ_{r=1}^L s_r c_r(x) − ψ ]
The whole idea of the algorithm is to approximate s_r c_r(x) by ξ_r^*·x in M_r; the independent distribution p_0(x, θ^*) ∈ M_0 then integrates all the information.

Information-Geometrical view of Belief Propagation Convergence Analysis
The convergence point satisfies the two conditions:
1. m-condition: θ^* = π_{M_0} ∘ p_r(x, ζ_r^*) for all r
2. e-condition: θ^* = (1/(L − 1)) Σ_{r=1}^L ζ_r^*
The m-condition and e-condition of the algorithm: the e-condition is satisfied at each iteration of the BP algorithm; the m-condition is satisfied at the equilibrium of the BP algorithm.

Information-Geometrical view of Belief Propagation Convergence Analysis
Let us consider the m-autoparallel submanifold M^* connecting p_0(x, θ^*) and the p_r(x, ζ_r^*), r = 1, ..., L:
M^* = {p(x) | p(x) = d_0 p_0(x, θ^*) + Σ_{r=1}^L d_r p_r(x, ζ_r^*); d_0 + Σ_{r=1}^L d_r = 1}
We also consider the e-autoparallel submanifold E^* connecting p_0(x, θ^*) and the p_r(x, ζ_r^*):
E^* = {p(x) | log p(x) = v_0 log p_0(x, θ^*) + Σ_{r=1}^L v_r log p_r(x, ζ_r^*) − ψ; v_0 + Σ_{r=1}^L v_r = 1}

Information-Geometrical view of Belief Propagation Convergence Analysis
M^* holds the expectation of x unchanged; thus an equivalent definition is
M^* = {p(x) | p(x) ∈ S, Σ_x x p(x) = Σ_x x p_0(x, θ^*) = η_0(θ^*)}
Thus the m-condition implies that M^* intersects M_0 and all of the M_r orthogonally. The e-condition implies that E^* includes the true distribution q(x): if we set v_0 = −(L − 1) and v_r = 1 for all r, then using the e-condition we get
q(x) = C p_0(x, θ^*)^{−(L−1)} Π_{r=1}^L p_r(x, ζ_r^*) ∈ E^*
where C is the normalization factor.
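This factorization can be checked numerically: it holds whenever the e-condition (θ = Σ_r ξ_r, ζ_r = θ − ξ_r) is imposed, even away from the BP fixed point, because the s_r c_r(x) terms of the p_r add up to those of q(x) while the linear terms cancel. A self-contained sketch on a small chain model (same assumed setup as the earlier sketch):

```python
import numpy as np
from itertools import product

n, edges = 3, [(0, 1), (1, 2)]
h = np.array([0.3, -0.2, 0.5]); s = np.array([0.8, -0.6]); L = len(edges)
X = np.array(list(product([-1, 1], repeat=n)))

def normalize(log_p):
    """Turn unnormalized log-probabilities into a probability vector."""
    p = np.exp(log_p - log_p.max())
    return p / p.sum()

q = normalize(X @ h + sum(s[r] * X[:, i] * X[:, j] for r, (i, j) in enumerate(edges)))

# Any xi_r with theta = sum_r xi_r and zeta_r = theta - xi_r satisfies the e-condition.
rng = np.random.default_rng(2)
xi = rng.normal(size=(L, n)); theta = xi.sum(axis=0); zeta = theta - xi

log_p0 = X @ (h + theta)
log_pr = [X @ (h + zeta[r]) + s[r] * X[:, i] * X[:, j] for r, (i, j) in enumerate(edges)]
mix = normalize(-(L - 1) * log_p0 + sum(log_pr))   # C * p0^{-(L-1)} * prod_r p_r
print(np.allclose(mix, q))                         # True
```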

Information-Geometrical view of Belief Propagation Convergence Analysis
The proposed BP algorithm searches for θ and ζ_r until both the m-condition and the e-condition are satisfied. We know E^* includes q(x). If M^* also includes q(x), then its m-projection onto M_0 is p_0(x, θ^*), and θ^* gives the true solution; if M^* does not include q(x), the algorithm only gives an approximation of the true distribution. Thus we have the following theorem:
Theorem: When the underlying graph is a tree, both E^* and M^* include q(x), and the algorithm gives the exact solution.

Information-Geometrical view of Belief Propagation Example of L = 2
Here θ = ζ_1 + ζ_2 = ξ_1 + ξ_2, ζ_1 = ξ_2 and ζ_2 = ξ_1.
1) Put t = 0 and start with the initial guess ζ_2^0, for example ζ_2^0 = 0.
2) For t = 0, 1, 2, ..., m-project p_r(x, ζ_r^t) onto M_0 and obtain the linearized version of s_r c_r(x):
ζ_1^{t+1} = π_{M_0} ∘ p_2(x, ζ_2^t) − ζ_2^t,   ζ_2^{t+1} = π_{M_0} ∘ p_1(x, ζ_1^{t+1}) − ζ_1^{t+1}
3) Summarize all the effects of the s_r c_r(x) to give θ^{t+1} = ζ_1^{t+1} + ζ_2^{t+1}
4) Repeat 2)-3) until convergence, and we get θ^* = ζ_1^* + ζ_2^*

Information-Geometrical view of Belief Propagation
Figure: Information-geometrical view of the BP algorithm with L = 2 (cf. Fig. 4, "Information-geometrical view of turbo decoding," in Ikeda et al., Information Geometry of Turbo and Low-Density Parity-Check Codes [2]).

Information-Geometrical view of Belief Propagation Analysis of the example with L = 2
The convergence point satisfies the two conditions:
1. m-condition: θ^* = π_{M_0} ∘ p_1(x, ζ_1^*) = π_{M_0} ∘ p_2(x, ζ_2^*)
2. e-condition: θ^* = ζ_1^* + ζ_2^* = ξ_1^* + ξ_2^*
Figure: Equimarginal submanifold M^*

Information-Geometrical view of Belief Propagation

Information-Geometrical view of CCCP-Bethe Information Geometric Convex Concave Computational Procedure (CCCP-Bethe)
Inner loop: Given θ^t, calculate {ζ_r^{t+1}} by solving the simultaneous equations
π_{M_0} ∘ p_r(x, ζ_r^{t+1}) = p_0(x, θ^{t+1}),  where θ^{t+1} = Lθ^t − Σ_{r=1}^L ζ_r^{t+1},  r = 1, ..., L.
Outer loop: Given the set {ζ_r^{t+1}} obtained from the inner loop, update
θ^{t+1} = Lθ^t − Σ_{r=1}^L ζ_r^{t+1}
Notes: CCCP enforces the m-condition at each iteration; the e-condition is satisfied only at the convergence point, which means we search for a new θ^{t+1} until the e-condition is satisfied.

A New e-constraint Algorithm
Define a cost function F_e under the e-constraint θ = Σ_r ζ_r / (L − 1), and minimize F_e by the gradient descent algorithm. In the gradient, I_0(θ) and I_r(ζ_r) are the Fisher information matrices of p_0(x, θ) and p_r(x, ζ_r), respectively.

A New e-constraint Algorithm
Under the gradient descent algorithm, ζ_r and θ are updated with a small positive learning rate δ.

A New e-constraint Algorithm

Thanks!