Information Geometric View of Belief Propagation
Yunshu Liu
2013-10-17
References:
[1] Shiro Ikeda, Toshiyuki Tanaka and Shun-ichi Amari, Stochastic Reasoning, Free Energy, and Information Geometry, Neural Computation, vol. 16, pp. 1779-1810, 2004.
[2] Shiro Ikeda, Toshiyuki Tanaka and Shun-ichi Amari, Information Geometry of Turbo and Low-Density Parity-Check Codes, IEEE Transactions on Information Theory, vol. 50, no. 6, June 2004.
[3] Kevin P. Murphy, Machine Learning: A Probabilistic Perspective, The MIT Press, 2012.
Outline
- Information Geometry
  - Examples of probabilistic manifolds
  - Important submanifolds
  - Information projection
- Information Geometry of Belief Propagation
  - Belief Propagation/Sum-product algorithms
  - Information Geometry of Graphical structure
  - Information-Geometrical view of Belief Propagation
Probability distribution viewed as manifold
Outline for Information Geometry
- Examples of probabilistic manifolds
- Important submanifolds
- Information projection
Manifold
Manifold: a set S with a coordinate system, a one-to-one mapping from S to R^n, such that S locally looks like an open subset of R^n.
Elements of the set (points): points in R^n, probability distributions, linear systems.
Figure: A coordinate system ξ for a manifold S
Example of manifold
Parametrization of color models: a map from colors into R^3.
Figure: example color-model parametrizations
Example of manifold
Parametrization of Gaussian distributions: S = {p(x; μ, σ)}, where
p(x; μ, σ) = 1/(√(2π) σ) · exp{-(x - μ)² / (2σ²)}
Figure: the family S as a two-dimensional surface with coordinates (μ, σ)
Example of manifold
Exponential family:
p(x; θ) = exp[c(x) + Σ_{i=1}^n θ_i F_i(x) - ψ(θ)]
[θ_i] are called the natural parameters (coordinates), and ψ is the potential function for [θ_i], which can be calculated as
ψ(θ) = log ∫ exp[c(x) + Σ_{i=1}^n θ_i F_i(x)] dx
The exponential families include many of the most common distributions, including the normal, exponential, gamma, beta, Dirichlet, Bernoulli, binomial, multinomial, Poisson, and so on.
Example of two binary discrete random variables as an exponential family
Let x = (x_1, x_2), x_i = -1 or 1, and consider the family of all positive probability distributions
S = {p(x) | p(x) > 0, Σ_i p(x = C_i) = 1}
Outcomes: C_0 = (-1, -1), C_1 = (-1, 1), C_2 = (1, -1), C_3 = (1, 1)
Parameterization: p(x = C_i) = η_i for i = 1, 2, 3, and p(x = C_0) = 1 - Σ_{j=1}^3 η_j
η = (η_1, η_2, η_3) is called the m-coordinate of S.
Figure: S as the open simplex with vertices (1 0 0), (0 1 0), (0 0 1) in the η coordinates
Example of two binary discrete random variables as an exponential family
A new coordinate system {θ_i} is defined as
θ_i = ln( η_i / (1 - Σ_{j=1}^3 η_j) ) for i = 1, 2, 3. (1)
θ = (θ_1, θ_2, θ_3) is called the e-coordinate of S.
Figure: the simplex in the η coordinates maps onto the whole space R^3 in the θ coordinates
Example of two binary discrete random variables as an exponential family
Introduce new random variables
δ_{C_i}(x) = 1 if x = C_i, 0 otherwise,
and let
ψ = -ln p(x = C_0). (2)
Then S is an exponential family with natural parameters {θ_i}:
p(x, θ) = exp( Σ_{i=1}^3 θ_i δ_{C_i}(x) - ψ(θ) ) (3)
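To make equations (1)-(3) concrete, here is a small numeric sketch (our code, not from the slides) that converts the m-coordinates η into the e-coordinates θ and recovers the distribution from θ via the potential ψ; the η values and the outcome ordering C_0, ..., C_3 are the hypothetical ones assumed above.

```python
import numpy as np

# m-coordinates: eta_i = p(x = C_i) for i = 1, 2, 3 (hypothetical values).
eta = np.array([0.2, 0.3, 0.1])
p_C0 = 1.0 - eta.sum()                      # p(x = C_0) = 1 - sum_j eta_j

# e-coordinates, equation (1): theta_i = ln(eta_i / p(C_0)).
theta = np.log(eta / p_C0)

# Recover the distribution from theta, equations (2)-(3):
# psi = -ln p(C_0) = ln(1 + sum_i exp(theta_i)), p(C_i) = exp(theta_i - psi).
psi = np.log(1.0 + np.exp(theta).sum())
p = np.exp(np.concatenate(([0.0], theta)) - psi)
print(p)                                    # [0.4 0.2 0.3 0.1] = (p(C_0), eta)
```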
Probability distribution viewed as manifold
Outline for Information Geometry
- Examples of probabilistic manifolds
- Important submanifolds
- Information projection
Submanifolds
Definition: a submanifold M of a manifold S is a subset of S which itself has the structure of a manifold.
An open subset of an n-dimensional manifold forms an n-dimensional submanifold. One way to construct an m(<n)-dimensional submanifold: fix n - m of the coordinates.
Figure: examples of submanifolds
e-autoparallel submanifold and m-autoparallel submanifold
e-autoparallel submanifold: a subfamily E is said to be an e-autoparallel submanifold of S if, for some coordinate λ, there exist a matrix A and a vector b such that the e-coordinate θ of S satisfies
θ(p) = Aλ(p) + b (∀p ∈ E) (4)
If λ is one-dimensional, E is called an e-geodesic.
An e-autoparallel submanifold of S is a hyperplane in the e-coordinates; an e-geodesic of S is a straight line in the e-coordinates.
Figure: E as a hyperplane in the θ coordinates
e-autoparallel submanifold and m-autoparallel submanifold
m-autoparallel submanifold: a subfamily M is said to be an m-autoparallel submanifold of S if, for some coordinate γ, there exist a matrix A and a vector b such that the m-coordinate η of S satisfies
η(p) = Aγ(p) + b (∀p ∈ M) (5)
If γ is one-dimensional, M is called an m-geodesic.
e-autoparallel submanifold and m-autoparallel submanifold
Two independent bits: the set of all product distributions,
E = {p(x) | p(x) = p_{X_1}(x_1) p_{X_2}(x_2)} (6)
Write p_{X_1}(x_1 = 1) = p_1 and p_{X_2}(x_2 = 1) = p_2; then p_{X_1}(x_1 = -1) = 1 - p_1 and p_{X_2}(x_2 = -1) = 1 - p_2.
Parameterization: define λ_i = (1/2) ln( p_i / (1 - p_i) ) for i = 1, 2, and ψ(λ) = Σ_{i=1,2} ln(e^{λ_i} + e^{-λ_i}); then
p(x, λ) = e^{λ_1 x_1 - ψ(λ_1)} e^{λ_2 x_2 - ψ(λ_2)} = e^{λ·x - ψ(λ)} (7)
Thus E is an exponential family with natural parameters {λ_i}.
e-autoparallel submanifold and m-autoparallel submanifold
Dual parameters (expectations):
γ = (γ_1, γ_2)^T = (∂ψ(λ)/∂λ_1, ∂ψ(λ)/∂λ_2)^T = (2p_1 - 1, 2p_2 - 1)^T = E_p[x]
Properties: E is an e-autoparallel submanifold (a plane in the θ coordinates) of S, but not an m-autoparallel submanifold (not a plane in the η coordinates). With the outcome ordering C_0, ..., C_3 above,
θ(p) = (θ_1, θ_2, θ_3)^T = (2λ_2, 2λ_1, 2λ_1 + 2λ_2)^T, i.e. θ(p) = Aλ(p) with A = [0 2; 2 0; 2 2]
e-autoparallel submanifold and m-autoparallel submanifold
Two independent bits: an e-autoparallel submanifold but not an m-autoparallel submanifold.
Figure: E in the θ coordinates (a plane) and E in the η coordinates (a curved surface)
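This is easy to check numerically. The sketch below (our code, using the outcome ordering and the matrix A from the slide above) confirms that along E the θ coordinates are affine in λ while the η coordinates are not:

```python
import numpy as np

# Check that E (two independent bits) is e-autoparallel: theta(p) = A lambda(p).
A = np.array([[0, 2], [2, 0], [2, 2]])      # from the slide above

def coords(lam):
    """Return (theta, eta) of the product distribution with parameter lam."""
    X = np.array([(-1, -1), (-1, 1), (1, -1), (1, 1)])   # outcomes C_0..C_3
    p = np.exp(X @ lam)
    p /= p.sum()
    return np.log(p[1:] / p[0]), p[1:]       # theta_i = ln(eta_i/p(C_0)), eta_i

lam_a, lam_b = np.array([0.3, -0.5]), np.array([1.0, 0.7])
th_mid, eta_mid = coords(0.5 * (lam_a + lam_b))
print(np.allclose(coords(lam_a)[0], A @ lam_a))                           # True
print(np.allclose(th_mid, 0.5 * (coords(lam_a)[0] + coords(lam_b)[0])))   # True: plane in theta
print(np.allclose(eta_mid, 0.5 * (coords(lam_a)[1] + coords(lam_b)[1])))  # False: curved in eta
```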
Probability distribution viewed as manifold
Outline for Information Geometry
- Examples of probabilistic manifolds
- Important submanifolds
- Information projection
Projections and Pythagorean Theorem
KL-divergence: on the manifold of probability mass functions for random variables taking values in the set X, we can also define the Kullback-Leibler divergence, or relative entropy, measured in bits, according to
D(p_X ∥ q_X) = Σ_{x ∈ X} p_X(x) log( p_X(x) / q_X(x) ) (8)
Figure: two points p_X and q_X in S
Projections and Pythagorean Theorem
Information projection: let p_X be a point in S and let E be an e-autoparallel submanifold of S. We want to find the point Π_E(p_X) in E that minimizes the KL-divergence:
D(p_X ∥ Π_E(p_X)) = min_{q_X ∈ E} D(p_X ∥ q_X) (9)
The m-geodesic connecting p_X and Π_E(p_X) is orthogonal to E at Π_E(p_X). In geometry, orthogonal usually means a right angle; in information geometry, orthogonal usually means that certain values are uncorrelated.
Figure: the m-geodesic from p_X meets the e-autoparallel submanifold E at Π_E(p_X)
Projections and Pythagorean Theorem
Pythagorean relation: for q_X ∈ E, we have the following Pythagorean relation:
D(p_X ∥ q_X) = D(p_X ∥ Π_E(p_X)) + D(Π_E(p_X) ∥ q_X) ∀q_X ∈ E (10)
Figure: the decomposition of D(p_X ∥ q_X) along the m-geodesic through Π_E(p_X)
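As a numeric illustration (our code), take E to be the product distributions over two binary variables; the projection Π_E(p_X) is then the product of the marginals of p_X, and equation (10) holds exactly for any q_X ∈ E:

```python
import numpy as np

def D(p, q):
    """KL divergence in bits, equation (8)."""
    return np.sum(p * np.log2(p / q))

rng = np.random.default_rng(1)
p = rng.random((2, 2)); p /= p.sum()               # an arbitrary p(x1, x2)
proj = np.outer(p.sum(axis=1), p.sum(axis=0))      # Pi_E(p): product of marginals

q = np.outer([0.3, 0.7], [0.6, 0.4])               # an arbitrary member of E
lhs = D(p, q)
rhs = D(p, proj) + D(proj, q)
print(np.isclose(lhs, rhs))                        # True: Pythagorean relation (10)
```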
Projection in θ coordinate
Projection in η coordinate
Information Geometry of Belief Propagation
Outline for Information Geometry of Belief Propagation
- Belief Propagation/Sum-product algorithms
- Information Geometry of Graphical structure
- Information-Geometrical view of Belief Propagation
Belief Propagation
Belief Propagation/Sum-product algorithm: a message passing scheme that exploits conditional independence to change the order of sums and products, in the same way that xy + xz = x(y + z) trades two multiplications and one addition for one of each.
Figure: expression trees of xy + xz and x(y + z), where leaf nodes represent variables and internal nodes represent arithmetic operators (addition, multiplication, negation, etc.)
Belief Propagation
Message passing on a factor graph, variable x to factor f:
m_{x→f}(x) = Π_{f_k ∈ nb(x)\f} m_{f_k→x}(x)
nb(x) denotes the neighboring factor nodes of x. If nb(x)\f = ∅ then m_{x→f}(x) = 1.
Figure: messages m_{f_1→x}, ..., m_{f_K→x} arriving at x, and the outgoing message m_{x→f}
Belief Propagation
Message passing on a factor graph, factor f to variable x:
m_{f→x}(x) = Σ_{x_1, ..., x_M} f(x, x_1, ..., x_M) Π_{x_m ∈ nb(f)\x} m_{x_m→f}(x_m)
nb(f) denotes the neighboring variable nodes of f. If nb(f)\x = ∅ then m_{f→x}(x) = f(x).
Figure: messages m_{x_1→f}, ..., m_{x_M→f} arriving at f, and the outgoing message m_{f→x}
Serial protocol of Belief Propagation
Message passing on a tree-structured factor graph. Serial protocol: send messages up to the root and back.
Figure: a tree-structured factor graph with variables x_1, x_2, x_3, x_4 and factors f_a(x_1, x_2), f_b(x_2, x_3), f_c(x_2, x_4); x_3 is the root
Serial protocol of Belief Propagation
Message passing on a tree-structured factor graph, down-up:
m_{x_1→f_a}(x_1) = 1
m_{f_a→x_2}(x_2) = Σ_{x_1} f_a(x_1, x_2)
m_{x_4→f_c}(x_4) = 1
m_{f_c→x_2}(x_2) = Σ_{x_4} f_c(x_2, x_4)
m_{x_2→f_b}(x_2) = m_{f_a→x_2}(x_2) m_{f_c→x_2}(x_2)
m_{f_b→x_3}(x_3) = Σ_{x_2} f_b(x_2, x_3) m_{x_2→f_b}(x_2)
Serial protocol of Belief Propagation
Message passing on a tree-structured factor graph, up-down:
m_{x_3→f_b}(x_3) = 1
m_{f_b→x_2}(x_2) = Σ_{x_3} f_b(x_2, x_3)
m_{x_2→f_a}(x_2) = m_{f_b→x_2}(x_2) m_{f_c→x_2}(x_2)
m_{f_a→x_1}(x_1) = Σ_{x_2} f_a(x_1, x_2) m_{x_2→f_a}(x_2)
m_{x_2→f_c}(x_2) = m_{f_a→x_2}(x_2) m_{f_b→x_2}(x_2)
m_{f_c→x_4}(x_4) = Σ_{x_2} f_c(x_2, x_4) m_{x_2→f_c}(x_2)
Serial protocol of Belief Propagation
Calculating p(x_2):
p(x_2) = (1/Z) m_{f_a→x_2}(x_2) m_{f_b→x_2}(x_2) m_{f_c→x_2}(x_2)
= (1/Z) {Σ_{x_1} f_a(x_1, x_2)} {Σ_{x_3} f_b(x_2, x_3)} {Σ_{x_4} f_c(x_2, x_4)}
= (1/Z) Σ_{x_1} Σ_{x_3} Σ_{x_4} f_a(x_1, x_2) f_b(x_2, x_3) f_c(x_2, x_4)
= Σ_{x_1} Σ_{x_3} Σ_{x_4} p(x)
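The whole serial schedule above fits in a few lines of code. Here is a minimal sketch (our code; the binary factor tables fa, fb, fc are random, hypothetical examples) that runs the down-up and up-down sweeps and checks the belief at x_2 against the brute-force marginal:

```python
import numpy as np

rng = np.random.default_rng(0)
fa, fb, fc = (rng.random((2, 2)) for _ in range(3))   # fa(x1,x2), fb(x2,x3), fc(x2,x4)

# Down-up sweep toward the root x3 (leaf messages are 1):
m_fa_x2 = fa.sum(axis=0)                # sum_{x1} fa(x1, x2)
m_fc_x2 = fc.sum(axis=1)                # sum_{x4} fc(x2, x4)
m_x2_fb = m_fa_x2 * m_fc_x2
m_fb_x3 = fb.T @ m_x2_fb                # sum_{x2} fb(x2, x3) m_{x2->fb}(x2)

# Up-down sweep back from the root:
m_fb_x2 = fb.sum(axis=1)                # sum_{x3} fb(x2, x3)
m_x2_fa = m_fb_x2 * m_fc_x2
m_fa_x1 = fa @ m_x2_fa                  # sum_{x2} fa(x1, x2) m_{x2->fa}(x2)
m_x2_fc = m_fa_x2 * m_fb_x2
m_fc_x4 = fc.T @ m_x2_fc                # sum_{x2} fc(x2, x4) m_{x2->fc}(x2)

# Belief at x2: product of its incoming messages, normalized.
b2 = m_fa_x2 * m_fb_x2 * m_fc_x2
b2 /= b2.sum()

# Brute-force check: p(x2) = (1/Z) sum_{x1,x3,x4} fa fb fc.
joint = np.einsum('ab,bc,bd->abcd', fa, fb, fc)
p2 = joint.sum(axis=(0, 2, 3))
p2 /= p2.sum()
assert np.allclose(b2, p2)
```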
Parallel protocol of Belief Propagation
Message passing on general graphs with loops. Parallel protocol:
- Initialize all messages
- All nodes receive messages from their neighbors in parallel and update their belief states
- Send new messages back out to their neighbors
- Repeat the above two steps until convergence
A sketch of this flooding schedule is given below.
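Below is a hedged sketch of the parallel (flooding) schedule on a generic discrete factor graph (our code; the three-factor graph is just an example, and a fixed iteration budget stands in for a proper convergence test):

```python
import numpy as np

# A generic discrete factor graph: factor name -> (variable list, table).
factors = {
    'fa': (['x1', 'x2'], np.random.rand(2, 2)),
    'fb': (['x2', 'x3'], np.random.rand(2, 2)),
    'fc': (['x2', 'x4'], np.random.rand(2, 2)),
}
variables = {v: 2 for vs, _ in factors.values() for v in vs}   # cardinalities

# 1) Initialize all messages (both directions) to uniform.
m_vf = {(v, f): np.ones(variables[v]) for f, (vs, _) in factors.items() for v in vs}
m_fv = {(f, v): np.ones(variables[v]) for f, (vs, _) in factors.items() for v in vs}

for _ in range(50):   # 2)-3) all nodes update in parallel (fixed budget)
    new_vf = {}
    for (v, f) in m_vf:
        msg = np.ones(variables[v])
        for g, (vs, _) in factors.items():       # product over the other factors
            if g != f and v in vs:
                msg = msg * m_fv[(g, v)]
        new_vf[(v, f)] = msg / msg.sum()
    new_fv = {}
    for (f, v) in m_fv:
        vs, table = factors[f]
        t = table.copy()
        for i, u in enumerate(vs):               # absorb incoming messages
            if u != v:
                t = np.moveaxis(np.moveaxis(t, i, -1) * m_vf[(u, f)], -1, i)
        axes = tuple(i for i, u in enumerate(vs) if u != v)
        msg = t.sum(axis=axes)                   # marginalize out the others
        new_fv[(f, v)] = msg / msg.sum()
    m_vf, m_fv = new_vf, new_fv                  # synchronous (parallel) update

# Beliefs: product of incoming factor messages at each variable.
for v in variables:
    b = np.ones(variables[v])
    for f, (vs, _) in factors.items():
        if v in vs:
            b = b * m_fv[(f, v)]
    print(v, b / b.sum())
```

On the tree above this reproduces the exact marginals; on loopy graphs the same schedule gives the loopy-BP approximation.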
Information Geometry of Belief Propagation
Outline for Information Geometry of Belief Propagation
- Belief Propagation/Sum-product algorithms
- Information Geometry of Graphical structure
- Information-Geometrical view of Belief Propagation
Information Geometry of Graphical structure
Important graphs (undirected) for Belief Propagation:
A: graph with no edges (all n nodes are independent)
B: graph with a single edge r
C: graph of the true distribution, with n nodes and L edges
Information Geometry of Graphical structure
The corresponding families of distributions:
A: M_0 = {p_0(x, θ) = exp[Σ_{i=1}^n h_i x_i + Σ_{i=1}^n θ_i x_i - ψ_0(θ)] | θ ∈ R^n}
B: M_r = {p_r(x, ζ_r) = exp[Σ_{i=1}^n h_i x_i + s_r c_r(x) + Σ_{i=1}^n ζ_{ri} x_i - ψ_r(ζ_r)] | ζ_r ∈ R^n}
C: S = {p(x, θ, s) = exp[Σ_{i=1}^n h_i x_i + Σ_{i=1}^n θ_i x_i + Σ_{r=1}^L s_r c_r(x) - ψ(θ, s)] | θ ∈ R^n, s ∈ R^L}
Information Geometry of Graphical structure
Properties of M_0 and M_r:
- M_0 is an e-autoparallel submanifold of S, with e-affine coordinate θ
- M_r, r = 1, ..., L, are all e-autoparallel submanifolds of S, with e-affine coordinate ζ_r
- A distribution in M_r includes only one nonlinear interaction term c_r(x), corresponding to the single edge in B
Information Geometry of Belief Propagation
Outline for Information Geometry of Belief Propagation
- Belief Propagation/Sum-product algorithms
- Information Geometry of Graphical structure
- Information-Geometrical view of Belief Propagation
Information-Geometrical view of Belief Propagation
Problem setup for belief propagation: suppose we have the true distribution q(x) ∈ S and want to calculate q_0(x) = Π_{M_0} ∘ q(x), the m-projection of q(x) onto M_0. The goal of belief propagation is to find a good approximation q̂(x) ∈ M_0 of q(x).
M_0 = {p_0(x, θ) = exp[Σ_{i=1}^n h_i x_i + Σ_{i=1}^n θ_i x_i - ψ_0(θ)] | θ ∈ R^n}
M_r = {p_r(x, ζ_r) = exp[Σ_{i=1}^n h_i x_i + s_r c_r(x) + Σ_{i=1}^n ζ_{ri} x_i - ψ_r(ζ_r)] | ζ_r ∈ R^n}
S = {p(x, θ, s) = exp[Σ_{i=1}^n h_i x_i + Σ_{i=1}^n θ_i x_i + Σ_{r=1}^L s_r c_r(x) - ψ(θ, s)] | θ ∈ R^n, s ∈ R^L}
Geometrical Belief Propagation Algorithm
1) Put t = 0 and start with initial guesses ζ_r^0, for example ζ_r^0 = 0.
2) For t = 0, 1, 2, ..., m-project each p_r(x, ζ_r^t) to M_0 and obtain the linearized version of s_r c_r(x):
ξ_r^{t+1} = Π_{M_0} ∘ p_r(x, ζ_r^t) - ζ_r^t
3) Summarize all the effects of the s_r c_r(x), to give
θ^{t+1} = Σ_{r=1}^L ξ_r^{t+1}
4) Update each ζ_r by
ζ_r^{t+1} = Σ_{r'≠r} ξ_{r'}^{t+1} = θ^{t+1} - ξ_r^{t+1}
5) Repeat 2)-4) until convergence.
A numerical sketch of this iteration follows.
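The following is a small numeric sketch of the iteration (our code, assuming binary variables, pairwise interactions s_r c_r(x) = J_r x_i x_j, and exact enumeration for the m-projection; all parameter values are hypothetical). On this tree-structured example the fixed point reproduces the exact marginal means, in line with the theorem stated later.

```python
import itertools
import numpy as np

n = 4
edges = [(0, 1), (1, 2), (1, 3)]           # a tree, like the earlier example graph
h = np.array([0.3, -0.2, 0.5, 0.1])        # linear fields h_i
J = np.array([0.8, -0.6, 0.4])             # couplings J_r, one per edge
X = np.array(list(itertools.product([-1, 1], repeat=n)))   # all 2^n states

def m_project_theta(logp):
    """theta-coordinate of the m-projection onto M_0: match the means,
    eta_i = tanh(h_i + theta_i)  =>  theta_i = atanh(eta_i) - h_i."""
    w = np.exp(logp - logp.max()); w /= w.sum()
    return np.arctanh(w @ X) - h

def c(r):
    """The single nonlinear term c_r(x) = J_r x_i x_j for edge r."""
    i, j = edges[r]
    return J[r] * X[:, i] * X[:, j]

L = len(edges)
zeta = np.zeros((L, n))
for t in range(100):
    # Step 2: linearize s_r c_r(x) by m-projecting p_r(x, zeta_r) onto M_0.
    xi = np.array([m_project_theta(X @ h + c(r) + X @ zeta[r]) - zeta[r]
                   for r in range(L)])
    theta = xi.sum(axis=0)                 # Step 3: theta = sum_r xi_r
    new_zeta = theta - xi                  # Step 4: zeta_r = theta - xi_r
    if np.max(np.abs(new_zeta - zeta)) < 1e-10:
        break
    zeta = new_zeta                        # Step 5: repeat until convergence

# On a tree the fixed point is exact: compare with the true marginal means.
logq = X @ h + sum(c(r) for r in range(L))
w = np.exp(logq - logq.max()); w /= w.sum()
print("BP means  :", np.tanh(h + theta))
print("true means:", w @ X)
```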
Geometrical Belief Propagation Algorithm
Analysis: given q(x), let ζ_r be the current solution which M_r believes gives a good approximation of q(x). We m-project it to M_0, obtaining the solution θ_r in M_0 with the same expectation:
p(x, θ_r) = Π_{M_0} ∘ p_r(x, ζ_r)
Since θ_r includes both the effect of the single nonlinear term s_r c_r(x) specific to M_r and that of the linear term ζ_r, we can use ξ_r = θ_r - ζ_r to represent the effect of the single nonlinear term s_r c_r(x); we consider it the linearized version of s_r c_r(x).
Geometrical Belief Propagation Algorithm
Analysis, continued: once M_r knows that the linearized version of s_r c_r(x) is ξ_r, it broadcasts ξ_r to all the other models M_0 and M_{r'} (r' ≠ r). Receiving these messages from all M_r, M_0 guesses that the equivalent linear term of Σ_{r=1}^L s_r c_r(x) is θ = Σ_{r=1}^L ξ_r. Since M_r in turn receives messages ξ_{r'} from all other M_{r'} (r' ≠ r), it uses them to form a new ζ_r = Σ_{r'≠r} ξ_{r'} = θ - ξ_r. The whole process is repeated until convergence.
Geometrical Belief Propagation Algorithm
Analysis, continued: assume the algorithm has converged to {ζ_r*} and θ*. The estimate of q(x) ∈ S in M_0 is p_0(x, θ*) (which may not be the exact solution). We write p_0(x, θ*), p_r(x, ζ_r*) and q(x) as:
p_0(x, θ*) = exp[Σ_{i=1}^n h_i x_i + Σ_{r=1}^L ξ_r*·x - ψ_0]
p_r(x, ζ_r*) = exp[Σ_{i=1}^n h_i x_i + s_r c_r(x) + Σ_{r'≠r} ξ_{r'}*·x - ψ_r]
q(x) = exp[Σ_{i=1}^n h_i x_i + Σ_{r=1}^L s_r c_r(x) - ψ]
The whole idea of the algorithm is to approximate s_r c_r(x) by ξ_r*·x in M_r; the independent distribution p_0(x, θ*) ∈ M_0 then integrates all the information.
Information-Geometrical view of Belief Propagation
Convergence analysis: the convergence point satisfies the two conditions
1. m-condition: θ* = Π_{M_0} ∘ p_r(x, ζ_r*), r = 1, ..., L
2. e-condition: θ* = (1/(L-1)) Σ_{r=1}^L ζ_r*
In the BP algorithm, the e-condition is satisfied at each iteration, while the m-condition is satisfied only at the equilibrium.
Information-Geometrical view of Belief Propagation
Convergence analysis: consider the m-autoparallel submanifold M* connecting p_0(x, θ*) and p_r(x, ζ_r*), r = 1, ..., L:
M* = {p(x) | p(x) = d_0 p_0(x, θ*) + Σ_{r=1}^L d_r p_r(x, ζ_r*); d_0 + Σ_{r=1}^L d_r = 1}
We also consider the e-autoparallel submanifold E* connecting p_0(x, θ*) and the p_r(x, ζ_r*):
E* = {p(x) | log p(x) = v_0 log p_0(x, θ*) + Σ_{r=1}^L v_r log p_r(x, ζ_r*) - ψ; v_0 + Σ_{r=1}^L v_r = 1}
Information-Geometrical view of Belief Propagation
Convergence analysis: M* holds the expectation of x unchanged, so an equivalent definition is
M* = {p(x) | p(x) ∈ S, Σ_x x p(x) = Σ_x x p_0(x, θ*) = η_0(θ*)}
Thus the m-condition implies that M* intersects all of M_0 and the M_r orthogonally. The e-condition implies that E* includes the true distribution q(x): setting v_0 = -(L-1) and v_r = 1 for all r, the e-condition gives
q(x) = C p_0(x, θ*)^{-(L-1)} Π_{r=1}^L p_r(x, ζ_r*) ∈ E*
where C is the normalization factor.
Information-Geometrical view of Belief Propagation
Convergence analysis: the BP algorithm searches for θ and the ζ_r until both the m-condition and the e-condition are satisfied. We know E* includes q(x). If M* also includes q(x), then its m-projection onto M_0 is p_0(x, θ*), and θ* gives the true solution; if M* does not include q(x), the algorithm gives only an approximation of the true distribution. Thus we have the following theorem:
Theorem: when the underlying graph is a tree, both E* and M* include q(x), and the algorithm gives the exact solution.
Information-Geometrical view of Belief Propagation
Example with L = 2, where θ = ζ_1 + ζ_2 = ξ_1 + ξ_2, ζ_1 = ξ_2 and ζ_2 = ξ_1:
1) Put t = 0 and start with an initial guess ζ_2^0, for example ζ_2^0 = 0.
2) For t = 0, 1, 2, ..., m-project p_r(x, ζ_r^t) to M_0 and obtain the linearized version of s_r c_r(x):
ζ_1^{t+1} = Π_{M_0} ∘ p_2(x, ζ_2^t) - ζ_2^t, ζ_2^{t+1} = Π_{M_0} ∘ p_1(x, ζ_1^{t+1}) - ζ_1^{t+1}
3) Summarize all the effects of s_r c_r(x), to give θ^{t+1} = ζ_1^{t+1} + ζ_2^{t+1}
4) Repeat 2)-3) until convergence, giving θ* = ζ_1* + ζ_2*
Information-Geometrical view of Belief Propagation
Figure: Information-geometrical view of the BP algorithm with L = 2 (turbo decoding), cf. Fig. 4 of [2]
Information-Geometrical view of Belief Propagation
Analysis of the example with L = 2: the convergence point satisfies the two conditions
1. m-condition: θ* = Π_{M_0} ∘ p_1(x, ζ_1*) = Π_{M_0} ∘ p_2(x, ζ_2*)
2. e-condition: θ* = ζ_1* + ζ_2* = ξ_1* + ξ_2*
Figure: the equimarginal submanifold M*
Information-Geometrical view of CCCP-Bethe
Information-geometric Convex-Concave Computational Procedure (CCCP-Bethe):
Inner loop: given θ^t, calculate {ζ_r^{t+1}} by solving
Π_{M_0} ∘ p_r(x, ζ_r^{t+1}) = Lθ^t - Σ_{r'=1}^L ζ_{r'}^{t+1}, r = 1, ..., L
Outer loop: given the set {ζ_r^{t+1}} from the inner loop, update
θ^{t+1} = Lθ^t - Σ_{r=1}^L ζ_r^{t+1}
Notes: CCCP enforces the m-condition at each iteration; the e-condition is satisfied only at the convergence point, i.e., we search for a new θ^{t+1} until the e-condition is satisfied. A sketch of the double loop follows.
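Here is a hedged sketch of the double loop (our code, reusing the small binary pairwise model from the BP sketch above; the damped fixed-point inner solver and all parameter values are our own choices, not prescribed by the slides):

```python
import itertools
import numpy as np

n, edges = 3, [(0, 1), (1, 2), (0, 2)]     # a loopy graph this time
h = np.array([0.2, -0.1, 0.3])
J = np.array([0.5, -0.4, 0.6])
X = np.array(list(itertools.product([-1, 1], repeat=n)))
L = len(edges)

def proj_theta(logp):
    """theta-coordinate of the m-projection onto M_0 (match the means)."""
    w = np.exp(logp - logp.max()); w /= w.sum()
    return np.arctanh(w @ X) - h

def c(r):
    i, j = edges[r]
    return J[r] * X[:, i] * X[:, j]

theta, zeta = np.zeros(n), np.zeros((L, n))
for t in range(500):                        # outer loop
    for _ in range(1000):                   # inner loop: enforce the m-condition
        target = L * theta - zeta.sum(axis=0)
        resid = np.array([target - proj_theta(X @ h + c(r) + X @ zeta[r])
                          for r in range(L)])
        if np.max(np.abs(resid)) < 1e-9:
            break
        zeta += 0.2 * resid                 # damped fixed-point update
    new_theta = L * theta - zeta.sum(axis=0)
    if np.max(np.abs(new_theta - theta)) < 1e-9:
        break                               # e-condition also holds: converged
    theta = new_theta

print("CCCP-Bethe means:", np.tanh(h + theta))
```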
A New e-constraint Algorithm
Define a cost function F_e under the e-constraint θ = Σ_r ζ_r / (L-1), and minimize F_e by the gradient descent algorithm. The gradient involves I_0(θ) and I_r(ζ_r), the Fisher information matrices of p_0(x, θ) and p_r(x, ζ_r), respectively; for these exponential families they are the Hessians of the potential functions, I_0(θ) = ∂²ψ_0(θ)/∂θ∂θ^T and I_r(ζ_r) = ∂²ψ_r(ζ_r)/∂ζ_r∂ζ_r^T.
A New e-constraint Algorithm
Under the gradient descent algorithm, ζ_r and θ are updated by stepping against this gradient, where δ is a small positive learning rate.
Thanks!