The Generalized Distributive Law and Free Energy Minimization

Srinivas M. Aji, Rainfinity, Inc., 87 N. Raymond Ave., Suite 200, Pasadena, CA 91103 (saji@rainfinity.com); Robert J. McEliece, Department of Electrical Engineering, California Institute of Technology, Pasadena, CA 91125 (rjm@systems.caltech.edu).

Abstract. In an important recent paper, Yedidia, Freeman, and Weiss [7] showed that there is a close connection between the belief propagation algorithm for probabilistic inference and the Bethe-Kikuchi approximation to the variational free energy in statistical physics. In this paper, we recast the YFW results in the context of the generalized distributive law [1] formulation of belief propagation. Our main result is that if the GDL is applied to a junction graph, the fixed points of the algorithm are in one-to-one correspondence with the stationary points of a certain Bethe-Kikuchi free energy. If the junction graph has no cycles, the BK free energy is convex and has a unique stationary point, which is a global minimum. On the other hand, if the junction graph has cycles, the main result at least shows that the GDL is trying to do something sensible.

1. Introduction. The goals of this paper are twofold: first, to obtain a better understanding of iterative, belief propagation (BP)-like solutions to the general probabilistic inference (PI) problem when cycles are present in the underlying graph $G$; and second, to design improved iterative solutions to PI problems. The tools we use are also twofold: the generalized distributive law (GDL) formulation of BP [1], and the recent results of Yedidia, Freeman, and Weiss [6, 7], linking the behavior of BP algorithms to certain ideas from statistical physics.

If $G$ has no cycles, it is well known that BP converges to an exact solution of the inference problem in a finite number of steps [1, 3]. But what if $G$ has cycles? Experimentally, BP is often seen to work well in this situation, but there is little theoretical understanding of why this should be so. In this paper, by recasting the YFW results, we shall see (Theorem 2, Section 5) that if the GDL converges to a given set of beliefs, these beliefs correspond to a zero-gradient point (conjecturally a local minimum) of the Bethe-Kikuchi free energy, which is a function whose global minimum is an approximation to the solution of the inference problem. In short, even if $G$ has cycles, the GDL still does something sensible. (If $G$ has no cycles, the BK free energy is convex and the only stationary point is a global minimum, which is an exact solution of the inference problem.) Since we also show that a given PI problem can typically be represented by many different junction graphs (Theorem 1 and its Corollary, Section 2), this suggests that there can be many good iterative solutions to the problem. We plan to explore these possibilities in a follow-up paper.

* This work was supported by NSF grant no. CCR-9804793, and grants from the Sony Corp., Qualcomm, and Caltech's Lee Center for Advanced Networking.

Here is an outline of the paper. In Section 2, we introduce the notion of a junction graph, which is essential for the entire paper. In Section 3, we describe a generic PI problem, and the generalized distributive law, which is an iterative, BP-like algorithm for solving the PI problem by passing messages on a junction graph. In Section 4 we state and prove a fundamental theorem from statistical physics, viz., that the variational free energy is always greater than or equal to the free energy, with equality if and only if the system is in Boltzmann equilibrium. In Section 5, we show that the PI problem introduced in Section 3 can be recast as a free energy computation. We then show how this free energy computation can be simplified by using an approximation, the Bethe-Kikuchi approximation (which is also based on a junction graph), to the variational free energy. Finally, we prove our main result, viz., that the fixed points of the GDL are in one-to-one correspondence with the stationary points of the Bethe-Kikuchi free energy (Theorem 2, Section 5).

2. Junction Graphs. Following Stanley [4], we let $[n] = \{1, 2, \ldots, n\}$. An $[n]$-junction graph is a labelled, undirected graph $G = (V, E, L)$, in which each vertex $v \in V$ and each edge $e \in E$ is labelled with a subset of $[n]$, denoted by $L(v)$ and $L(e)$, respectively. If $e = \{v_1, v_2\}$ is an edge joining the vertices $v_1$ and $v_2$, we require that $L(e) \subseteq L(v_1) \cap L(v_2)$. Furthermore, we require that for each $k \in [n]$, the subgraph of $G$ consisting only of the vertices and edges whose labels contain $k$ is a tree. Figures 1 through 4 give examples of junction graphs.

If $\mathcal{R} = \{R_1, \ldots, R_M\}$ is a collection of subsets of $[n]$, we say that an $[n]$-junction graph $G = (V, E, L)$ is a junction graph for $\mathcal{R}$ if $\{L(v_1), \ldots, L(v_M)\} = \mathcal{R}$. In [1] it was shown that there is not always a junction tree for an arbitrary $\mathcal{R}$. The situation for junction graphs is more favorable, as the following theorem shows.

Figure 1. A junction graph for $\mathcal{R} = \{\{1,2,4\}, \{1,3\}, \{1,4\}, \{2\}\}$. (This is a junction tree.)

Theorem 1. For any collection $\mathcal{R} = \{R_1, \ldots, R_M\}$ of subsets of $[n]$, there is a junction graph for $\mathcal{R}$.

Proof: Begin with a complete graph $G$ with vertex set $V = \{v_1, \ldots, v_M\}$, vertex labels $L(v_i) = R_i$, and edge labels $L(v_i, v_j) = R_i \cap R_j$. For each $k \in [n]$, let $G_k = (V_k, E_k)$ be the subgraph of $G$ consisting of those vertices and edges whose label contains $k$.

Figure 2. A junction graph for $\mathcal{R} = \{\{1,2\}, \{2,3\}, \{3,4\}, \{1,4\}\}$.

Figure 3. A junction graph for $\mathcal{R} = \{\{1,2,3\}, \{1,4\}, \{2,5\}, \{3,6\}, \{4,5,6\}\}$.

Figure 4. A junction graph for $\mathcal{R} = \{\{1,2,3\}, \{1,3,4\}, \{2,3,5\}, \{3,4,5\}\}$.

Clearly $G_k$ is a complete graph, since if $k \in L(v_i) = R_i$ and $k \in L(v_j) = R_j$, then $k \in L(v_i) \cap L(v_j) = R_i \cap R_j$. Now let $T_k$ be any spanning tree of $G_k$, and delete $k$ from the labels of all edges in $E_k$ except those in $T_k$. The resulting labelled graph is a junction graph for $\mathcal{R}$.

Using the fact that a complete graph on $m$ vertices has exactly $m^{m-2}$ spanning trees [4, Prop. 5.3.2], we have the following corollary.

Corollary. If $m_i$ denotes the number of sets $R_j$ such that $i \in R_j$, then the number of junction graphs for $\mathcal{R}$ is $\prod_{i=1}^{n} m_i^{\,m_i - 2}$.
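The construction in the proof of Theorem 1 is easy to mechanize. Below is a small Python sketch of it; the function name, the data layout, the choice of a star as the spanning tree, and the final pruning of empty-labelled edges are our own simplifications for illustration, not code from the paper.

```python
# Sketch of the junction-graph construction from the proof of Theorem 1.
from itertools import combinations

def junction_graph(regions):
    """Build a junction graph for a collection of label sets R_1, ..., R_M."""
    M = len(regions)
    vertex_labels = {i: set(R) for i, R in enumerate(regions)}
    # Start from the complete graph with edge labels R_i ∩ R_j.
    edge_labels = {(i, j): vertex_labels[i] & vertex_labels[j]
                   for i, j in combinations(range(M), 2)}
    universe = set().union(*vertex_labels.values())
    for k in universe:
        # G_k: the complete subgraph on the vertices whose label contains k.
        V_k = [i for i in range(M) if k in vertex_labels[i]]
        # Any spanning tree of G_k will do; a star rooted at V_k[0] is simplest.
        tree_edges = {tuple(sorted((V_k[0], j))) for j in V_k[1:]}
        # Delete k from the labels of all other edges of G_k.
        for i, j in combinations(V_k, 2):
            if (i, j) not in tree_edges:
                edge_labels[(i, j)].discard(k)
    # Edges whose label became empty may be dropped.
    edges = {e: lab for e, lab in edge_labels.items() if lab}
    return vertex_labels, edges

# Example: the collection of Figure 2, R = {{1,2},{2,3},{3,4},{1,4}}.
print(junction_graph([{1, 2}, {2, 3}, {3, 4}, {1, 4}]))
```

Different spanning-tree choices in the loop yield the different junction graphs counted in the Corollary.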

3. Probabilistic Inference and Belief Propagation. Let $A = \{0, 1, \ldots, q-1\}$ be a finite set with $q$ elements. We represent the elements of $A^n$ as vectors of the form $x = (x_1, x_2, \ldots, x_n)$, with $x_i \in A$ for $i \in [n]$. If $R \subseteq [n]$, we denote by $A^R$ the set $A^n$ projected onto the coordinates indexed by $R$. A typical element of $A^R$ will be denoted by $x_R$. If $p(x)$ is a probability distribution on $A^n$, $p_R(x_R)$ denotes $p(x)$ marginalized onto $R$, i.e.,
$$p_R(x_R) = \sum_{x_{R^c} \in A^{R^c}} p(x).$$

With $\mathcal{R} = \{R_1, \ldots, R_M\}$ as in Section 2, let $\{\alpha_R(x_R)\}_{R \in \mathcal{R}}$ be a family of nonnegative local kernels, i.e., $\alpha_R(x_R)$ is a nonnegative real number for each $x_R \in A^R$, and define the global probability density function
$$p(x) = \frac{1}{Z} \prod_{R \in \mathcal{R}} \alpha_R(x_R), \tag{3.1}$$
where $Z$ is the global normalization constant, i.e.,
$$Z = \sum_{x \in A^n} \prod_{R \in \mathcal{R}} \alpha_R(x_R). \tag{3.2}$$
The corresponding probabilistic inference problem is to compute $Z$, and the marginal densities
$$p_R(x_R) = \sum_{x_{R^c}} p(x) = \frac{1}{Z} \sum_{x_{R^c}} \prod_{R' \in \mathcal{R}} \alpha_{R'}(x_{R'}),$$
for one or more values of $R$.

If $G = (V, E, L)$ is a junction graph for $\mathcal{R}$, the generalized distributive law [1] is a message-passing algorithm on $G$ for solving the PI problem, either exactly or approximately. It can be described by its messages and beliefs. If $\{v, u\}$ is an edge of $G$, a message from $v$ to $u$, denoted by $m_{v,u}(x_{L(v,u)})$, is a nonnegative function on $A^{L(v,u)}$. Similarly, if $v$ is a vertex of $G$, the belief at $v$, denoted by $b_v(x_{L(v)})$, is a probability density on $A^{L(v)}$. Initially, $m_{v,u}(x_{L(v,u)}) \equiv 1$ for all $\{v, u\} \in E$. The message update rule is
$$m_{v,u}(x_{L(v,u)}) \leftarrow K \sum_{x_{L(v) \setminus L(v,u)}} \alpha_v(x_{L(v)}) \prod_{u' \in N(v) \setminus u} m_{u',v}(x_{L(u',v)}), \tag{3.3}$$
where $K$ is any convenient constant. (In (3.3), $N(v)$ denotes the set of neighbors of $v$, i.e., $N(v) = \{u : \{u, v\} \in E\}$.) At any stage, the current beliefs at the vertices $v$ and edges $e = \{u, v\}$ are defined as follows:
$$b_v(x_{L(v)}) = \frac{1}{Z_v}\, \alpha_v(x_{L(v)}) \prod_{u \in N(v)} m_{u,v}(x_{L(u,v)}), \tag{3.4}$$
$$b_e(x_{L(e)}) = \frac{1}{Z_e}\, m_{u,v}(x_{L(e)})\, m_{v,u}(x_{L(e)}), \tag{3.5}$$

where $Z_v$ and $Z_e$ are the appropriate local normalizing constants, i.e.,
$$Z_v = \sum_{x_{L(v)}} \alpha_v(x_{L(v)}) \prod_{u \in N(v)} m_{u,v}(x_{L(u,v)}), \tag{3.6}$$
$$Z_e = \sum_{x_{L(e)}} m_{u,v}(x_{L(e)})\, m_{v,u}(x_{L(e)}). \tag{3.7}$$
The hope is that as the algorithm evolves, the beliefs will converge to the desired marginal probabilities:
$$b_v(x_{L(v)}) \stackrel{?}{=} p_{L(v)}(x_{L(v)}), \qquad b_e(x_{L(e)}) \stackrel{?}{=} p_{L(e)}(x_{L(e)}).$$
If $G$ is a tree, the GDL converges as desired in a finite number of steps [1]. Furthermore, a result of Pearl [3] says that if $G$ is a tree, then the global density $p(x)$ defined by (3.1) factors as follows:
$$p(x) = \frac{\prod_{v \in V} p_v(x_{L(v)})}{\prod_{e \in E} p_e(x_{L(e)})}.$$
Comparing this to the definition (3.1), we see that the global normalization constant $Z$ defined in (3.2) can be expressed in terms of the local normalization constants $Z_v$ and $Z_e$ defined in (3.6) and (3.7) as follows:
$$Z = \frac{\prod_{v \in V} Z_v}{\prod_{e \in E} Z_e}. \tag{3.8}$$
But what if $G$ is not a tree? In the remainder of the paper, we shall see that if the GDL converges to a given set of beliefs, these beliefs represent a zero-gradient point (conjecturally a local minimum) of a function whose global minimum represents an approximation to the solution of the inference problem. In short, even if $G$ has cycles, the GDL still does something sensible. To see why this is so, we must use some ideas from statistical physics, which we present in the next section.
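To make the message and belief equations concrete, here is a minimal numerical sketch of the iteration (3.3)-(3.5) in Python. The data structures (dictionaries keyed by assignment tuples), the choice of the constant $K$ as a per-message normalizer, and the toy example at the end are our own; the paper specifies only the update and belief formulas.

```python
# Minimal sketch of the GDL iteration (3.3)-(3.5) on a junction graph.
from itertools import product

def gdl(vertex_labels, edge_labels, kernels, q=2, iters=50):
    """vertex_labels: {v: tuple of variable indices L(v)}
       edge_labels:   {(u, v): tuple of variable indices L(u, v)}, keys with u < v
       kernels:       {v: function alpha_v(assignment dict) -> nonnegative float}"""
    A = range(q)
    def edge_of(v, u):
        return (v, u) if (v, u) in edge_labels else (u, v)
    # Messages m[(v, u)]: tables over A^{L(v,u)}, initialized to all ones.
    msgs = {(v, u): {xs: 1.0 for xs in product(A, repeat=len(edge_labels[edge_of(v, u)]))}
            for (a, b) in edge_labels for (v, u) in ((a, b), (b, a))}
    nbrs = {v: [u for (a, b) in edge_labels for u in (a, b) if v in (a, b) and u != v]
            for v in vertex_labels}
    for _ in range(iters):
        new = {}
        for (v, u) in msgs:                              # message update rule (3.3)
            Le = edge_labels[edge_of(v, u)]
            tab = {xs: 0.0 for xs in product(A, repeat=len(Le))}
            for xv in product(A, repeat=len(vertex_labels[v])):
                assign = dict(zip(vertex_labels[v], xv))
                val = kernels[v](assign)
                for w in nbrs[v]:
                    if w != u:
                        Lw = edge_labels[edge_of(w, v)]
                        val *= msgs[(w, v)][tuple(assign[i] for i in Lw)]
                tab[tuple(assign[i] for i in Le)] += val
            s = sum(tab.values()) or 1.0                 # the constant K: keep messages normalized
            new[(v, u)] = {xs: t / s for xs, t in tab.items()}
        msgs = new
    beliefs = {}                                         # vertex beliefs (3.4), normalized by Z_v
    for v, label in vertex_labels.items():
        b = {}
        for xv in product(A, repeat=len(label)):
            assign = dict(zip(label, xv))
            val = kernels[v](assign)
            for w in nbrs[v]:
                Lw = edge_labels[edge_of(w, v)]
                val *= msgs[(w, v)][tuple(assign[i] for i in Lw)]
            b[xv] = val
        Zv = sum(b.values())
        beliefs[v] = {xs: t / Zv for xs, t in b.items()}
    return msgs, beliefs

# Toy run: the 4-cycle junction graph of Figure 2, binary variables, and kernels that
# mildly favour agreement between the two variables at each vertex (an invented example).
V = {0: (1, 2), 1: (2, 3), 2: (3, 4), 3: (1, 4)}
E = {(0, 1): (2,), (1, 2): (3,), (2, 3): (4,), (0, 3): (1,)}
alpha = {v: (lambda a: 2.0 if len(set(a.values())) == 1 else 1.0) for v in V}
_, beliefs = gdl(V, E, alpha, q=2, iters=100)
print(beliefs[0])   # approximate marginal of (x_1, x_2)
```

On a junction tree such as Figure 1 the beliefs converge to the exact marginals [1]; on cyclic graphs such as Figure 2 they are the approximate beliefs analyzed in Section 5.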

4. Free Energy and the Boltzmann Distribution. Imagine a system of $n$ identical particles, each of which can have one of $q$ different spins taken from the set $A = \{0, 1, \ldots, q-1\}$. If $x_i$ denotes the spin of the $i$th particle, we define the state of the system as the vector $x = (x_1, x_2, \ldots, x_n)$. In this way, the set $A^n$ can be viewed as a discrete state space $S$. Now suppose $E(x) = E(x_1, x_2, \ldots, x_n)$ represents the energy of the system (the Hamiltonian) when it is in state $x$. The corresponding partition function¹ is defined as
$$Z = \sum_{x \in S} e^{-E(x)}, \tag{4.1}$$
and the free energy of the system is $F = -\ln Z$. The free energy is of fundamental importance in statistical physics [8, Chapter 2], and physicists have developed a number of ways of calculating it, either exactly or approximately. We will now briefly describe some of these techniques.

¹ In fact, the partition function is also a function of a parameter $\beta$, the inverse temperature: $Z = Z(\beta) = \sum_{x \in S} e^{-\beta E(x)}$. However, in this paper, we will assume $\beta = 1$ and omit reference to $\beta$.

Suppose $p(x)$ represents the probability of finding the system in state $x$. The corresponding variational free energy is defined as $F(p) = U(p) - H(p)$, where $U(p)$ is the average, or internal, energy:
$$U(p) = \sum_{x \in S} p(x) E(x), \tag{4.2}$$
and $H(p)$ is the entropy:
$$H(p) = -\sum_{x \in S} p(x) \ln p(x).$$
We define the Boltzmann, or equilibrium, distribution as follows:
$$p_B(x) = \frac{1}{Z} e^{-E(x)}. \tag{4.3}$$
A routine calculation shows that $F(p) = F + D(p \,\|\, p_B)$, where $D(p \,\|\, p_B)$ is the Kullback-Leibler distance between $p$ and $p_B$. It then follows from [2, Theorem 2.6.3] that $F(p) \ge F$, with equality if and only if $p(x) = p_B(x)$, which is a classical result from statistical physics [5]. In other words,
$$F = \min_{p} F(p), \tag{4.4}$$
$$p_B(x) = \operatorname*{arg\,min}_{p} F(p). \tag{4.5}$$
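The identity $F(p) = F + D(p \,\|\, p_B)$ and the bound $F(p) \ge F$ are easy to check numerically. The following Python fragment does so on a toy two-spin system; the Hamiltonian and the trial distribution are invented solely for this check and are not from the paper.

```python
# Numerical check of F(p) = F + D(p || p_B) and F(p) >= F from Section 4.
import math, itertools, random

q, n = 2, 2
states = list(itertools.product(range(q), repeat=n))
E = {x: 1.0 if x[0] != x[1] else 0.0 for x in states}    # a made-up Hamiltonian

Z = sum(math.exp(-E[x]) for x in states)                  # partition function (4.1)
F = -math.log(Z)                                          # free energy
p_B = {x: math.exp(-E[x]) / Z for x in states}            # Boltzmann distribution (4.3)

random.seed(0)
w = {x: random.random() for x in states}                  # an arbitrary trial p(x)
p = {x: w[x] / sum(w.values()) for x in states}

U = sum(p[x] * E[x] for x in states)                      # average energy (4.2)
H = -sum(p[x] * math.log(p[x]) for x in states)           # entropy
F_var = U - H                                             # variational free energy
D = sum(p[x] * math.log(p[x] / p_B[x]) for x in states)   # Kullback-Leibler distance

print(F_var, F + D)    # the two quantities agree to numerical precision
print(F_var >= F)      # True, with equality only at p = p_B
```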

5. The Bethe-Kikuchi Approximation to the Variational Free Energy. According to (4.4), one method for computing the free energy $F$ is to use calculus to minimize the variational free energy $F(p)$ over all distributions $p(x)$. However, this involves minimizing a function of the $q^n$ variables $\{p(x) : x \in A^n\}$ subject to the constraint $\sum_x p(x) = 1$, which is not attractive unless $q^n$ is quite small. Another approach is to estimate $F$ from above by minimizing $F(p)$ over a restricted class of probability distributions. This is the basic idea underlying the mean field approach [5; 8, Chapter 4], in which only distributions of the form $p(x) = p_1(x_1) p_2(x_2) \cdots p_n(x_n)$ are considered.

The Bethe-Kikuchi approximations, which we will now describe, can be thought of as an elaboration on the mean field approach. In many cases of interest, the energy is determined by relatively short-range interactions, and the Hamiltonian assumes the special form
$$E(x) = \sum_{R \in \mathcal{R}} E_R(x_R), \tag{5.1}$$
where $\mathcal{R}$ is a collection of subsets of $[n]$, as in Sections 2 and 3. If $E(x)$ decomposes in this way, the Boltzmann distribution (4.3) factors:
$$p_B(x) = \frac{1}{Z} \prod_{R \in \mathcal{R}} e^{-E_R(x_R)}. \tag{5.2}$$
Similarly, the average energy (cf. (4.2)) can be written as
$$U(p) = \sum_{R \in \mathcal{R}} U(p_R), \tag{5.3}$$
where
$$U(p_R) = \sum_{x_R} p_R(x_R) E_R(x_R). \tag{5.4}$$
Thus the average energy $U(p)$ depends on the global density $p(x)$ only through the marginals $\{p_R(x_R)\}$. One might hope for a similar simplification for $H(p)$:
$$H(p) \stackrel{?}{=} \sum_{R \in \mathcal{R}} H(p_R), \tag{5.5}$$
where $H(p_R) = -\sum_{x_R} p_R(x_R) \log p_R(x_R)$. For example, if $\mathcal{R} = \{\{1,2,3\}, \{1,4\}, \{2,5\}, \{3,6\}, \{4,5,6\}\}$, then the hope (5.5) becomes
$$H(X_1, X_2, X_3, X_4, X_5, X_6) \stackrel{?}{=} H(X_1,X_2,X_3) + H(X_1,X_4) + H(X_2,X_5) + H(X_3,X_6) + H(X_4,X_5,X_6), \tag{5.6}$$
but this is clearly false, if only because the random variables $X_i$ occur unequally on the two sides of the equation. This is where the junction graph comes in. If, instead of (5.5), we substitute the Bethe-Kikuchi approximation to $H(p)$ (relative to the junction graph $G = (V, E, L)$):
$$H(p) \approx \sum_{v \in V} H(p_v) - \sum_{e \in E} H(p_e), \tag{5.7}$$

where for simplicity we write $p_v$ instead of $p_{L(v)}$, etc., the junction graph condition guarantees that each $X_i$ is counted just once. (In a tree, the number of vertices is exactly one more than the number of edges.) For example, using the junction graph of Figure 3, (5.7) becomes
$$H(X_1, X_2, X_3, X_4, X_5, X_6) \approx H(X_1,X_2,X_3) + H(X_1,X_4) + H(X_2,X_5) + H(X_3,X_6) + H(X_4,X_5,X_6) - H(X_1) - H(X_2) - H(X_3) - H(X_4) - H(X_5) - H(X_6), \tag{5.8}$$
which is more plausible (though it needs to be investigated by an information theorist). In any case, the Bethe-Kikuchi approximation, with respect to the junction graph $G = (V, E, L)$, to the variational free energy $F(p)$ is defined as (cf. (5.3) and (5.7))
$$F_{\mathrm{BK}}(p) = \sum_{v \in V} U_{L(v)}(p_v) - \Bigg( \sum_{v \in V} H(p_v) - \sum_{e \in E} H(p_e) \Bigg). \tag{5.9}$$
The important thing about the BK approximation $F_{\mathrm{BK}}(p)$ in (5.9) is that it depends only on the marginal probabilities $\{p_v(x_{L(v)})\}$, $\{p_e(x_{L(e)})\}$, and not on the full global distribution $p(x)$. To remind ourselves of this fact, we will write $F_{\mathrm{BK}}(\{p_v, p_e\})$ instead of $F_{\mathrm{BK}}(p)$. One plausible way to estimate the free energy is thus (cf. (4.4)) $F \approx F_{\mathrm{BK}}$, where
$$F_{\mathrm{BK}} = \min_{\{b_v, b_e\}} F_{\mathrm{BK}}(\{b_v, b_e\}), \tag{5.10}$$
where the $b_v$'s and the $b_e$'s are trial marginal probabilities, or beliefs. The BK approximate beliefs are then the optimizing marginals:
$$\{b_v^{\mathrm{BK}}, b_e^{\mathrm{BK}}\} = \operatorname*{arg\,min}_{\{b_v, b_e\}} F_{\mathrm{BK}}(\{b_v, b_e\}). \tag{5.11}$$
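Equation (5.9) is straightforward to evaluate once vertex and edge beliefs are in hand. Here is a small Python sketch of that computation; the data layout and names are our own choices, shown only to emphasize that $F_{\mathrm{BK}}$ depends on nothing but the local marginals.

```python
# Sketch: evaluating the Bethe-Kikuchi free energy F_BK({b_v, b_e}) of (5.9).
import math

def bk_free_energy(vertex_labels, energies, b_v, b_e):
    """energies[v]: function E_{L(v)}(assignment dict) -> float
       b_v[v], b_e[e]: dicts mapping assignment tuples (in label order) to probabilities."""
    U = 0.0            # sum of the local average energies, (5.3)-(5.4)
    H_vertices = 0.0   # sum over vertices of H(b_v)
    for v, label in vertex_labels.items():
        for xs, p in b_v[v].items():
            if p > 0:
                U += p * energies[v](dict(zip(label, xs)))
                H_vertices -= p * math.log(p)
    H_edges = 0.0      # sum over edges of H(b_e)
    for table in b_e.values():
        for p in table.values():
            if p > 0:
                H_edges -= p * math.log(p)
    # (5.9): local energies minus the Bethe-Kikuchi entropy approximation (5.7).
    return U - (H_vertices - H_edges)
```

Minimizing this quantity over beliefs satisfying the consistency and normalization constraints introduced next yields the estimate (5.10); Theorem 2 below identifies the stationary points of that constrained problem with the GDL fixed points.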

Computing the minimum in (5.10) is still not easy, but we can begin by setting up a Lagrangian:
$$\begin{aligned}
\mathcal{L} ={}& \sum_{v \in V} \sum_{x_{L(v)}} b_v(x_{L(v)})\, E_{L(v)}(x_{L(v)}) \\
&+ \sum_{v \in V} \sum_{x_{L(v)}} b_v(x_{L(v)}) \log b_v(x_{L(v)}) - \sum_{e \in E} \sum_{x_{L(e)}} b_e(x_{L(e)}) \log b_e(x_{L(e)}) \\
&+ \sum_{(u,v):\,\{u,v\} \in E} \; \sum_{x_{L(u,v)}} \lambda_{v,u}(x_{L(u,v)}) \Bigg( \sum_{x_{L(v) \setminus L(u,v)}} b_v(x_{L(v)}) - b_e(x_{L(u,v)}) \Bigg) \\
&+ \sum_{v \in V} \mu_v \Bigg( \sum_{x_{L(v)}} b_v(x_{L(v)}) - 1 \Bigg) + \sum_{e \in E} \mu_e \Bigg( \sum_{x_{L(e)}} b_e(x_{L(e)}) - 1 \Bigg).
\end{aligned}$$
Here the Lagrange multiplier $\lambda_{v,u}(x_{L(u,v)})$ enforces the constraint
$$\sum_{x_{L(v) \setminus L(u,v)}} b_v(x_{L(v)}) = b_e(x_{L(u,v)}),$$
the Lagrange multiplier $\mu_v$ enforces the constraint
$$\sum_{x_{L(v)}} b_v(x_{L(v)}) = 1,$$
and the Lagrange multiplier $\mu_e$ enforces the constraint
$$\sum_{x_{L(e)}} b_e(x_{L(e)}) = 1.$$
Setting the partial derivative of $\mathcal{L}$ with respect to the variable $b_v(x_{L(v)})$ equal to zero, we obtain, after a little rearranging,
$$\log b_v(x_{L(v)}) = k_v - E_{L(v)}(x_{L(v)}) - \sum_{u \in N(v)} \lambda_{v,u}(x_{L(u,v)}) \tag{5.12}$$
(where $k_v = -1 - \mu_v$) for all vertices $v$ and all choices of $x_{L(v)}$. Similarly, setting the partial derivative of $\mathcal{L}$ with respect to the variable $b_e(x_{L(e)})$ equal to zero, we obtain
$$\log b_e(x_{L(e)}) = k_e - \lambda_{v,u}(x_{L(e)}) - \lambda_{u,v}(x_{L(e)}) \tag{5.13}$$
(where $k_e = -1 + \mu_e$) for all edges $e = \{u, v\}$ and all choices of $x_{L(e)}$.

The conditions (5.12) and (5.13) are necessary conditions for the attainment of the minimum in (5.10). A set of beliefs and Lagrange multipliers satisfying these conditions may correspond to a local minimum, a local maximum, or neither. In every case, however, we have a stationary point of $F_{\mathrm{BK}}$. Given a stationary point of $F_{\mathrm{BK}}$, if we define local kernels $\alpha_{L(v)}(x_{L(v)})$ and messages $m_{u,v}(x_{L(u,v)})$ as follows:
$$E_{L(v)}(x_{L(v)}) = -\ln \alpha_{L(v)}(x_{L(v)}), \tag{5.14}$$
$$\lambda_{v,u}(x_{L(v,u)}) = -\ln m_{u,v}(x_{L(u,v)}), \tag{5.15}$$
the stationarity conditions (5.12) and (5.13) become
$$b_v(x_{L(v)}) = K_v\, \alpha_{L(v)}(x_{L(v)}) \prod_{u \in N(v)} m_{u,v}(x_{L(u,v)}), \tag{5.16}$$
$$b_e(x_{L(e)}) = K_e\, m_{u,v}(x_{L(u,v)})\, m_{v,u}(x_{L(v,u)}), \tag{5.17}$$
which are the same as the GDL belief rules (3.4) and (3.5). (The constants $K_v = e^{k_v}$ and $K_e = e^{k_e}$ are uniquely determined by the conditions $\sum_{x_{L(v)}} b_v(x_{L(v)}) = 1$ and $\sum_{x_{L(e)}} b_e(x_{L(e)}) = 1$.)

Next, if we choose $u \in N(v)$ and sum (5.16) over all $x_{L(v) \setminus L(v,u)}$, thereby obtaining an expression for $b_e(x_{L(e)})$, where $e = \{u, v\}$, and then set the result equal to the right side of (5.17), we obtain, after a short calculation,
$$m_{v,u}(x_{L(u,v)}) = \frac{K_v}{K_e} \sum_{x_{L(v) \setminus L(v,u)}} \alpha_{L(v)}(x_{L(v)}) \prod_{u' \in N(v) \setminus u} m_{u',v}(x_{L(u',v)}), \tag{5.18}$$
which is the same as the GDL message update rule (3.3). In other words, we have proved

Theorem 2. If a set of beliefs $\{b_v, b_e\}$ and Lagrange multipliers $\{\lambda_{u,v}\}$ is a stationary point for the BK free energy defined on $G$ with energy function given by (5.1), then these same beliefs, together with the set of messages defined by (5.15), are a fixed point of the GDL on $G$ with the local kernels defined by (5.14). Conversely, if a set of beliefs $\{b_v, b_e\}$ and messages $\{m_{u,v}\}$ is a fixed point of the GDL on a junction graph $G$ with local kernels $\{\alpha_R(x_R)\}$, then these same beliefs, together with the Lagrange multipliers $\{\lambda_{u,v}\}$ defined by (5.15), are a stationary point for the BK free energy defined on $G$ with energy function defined by (5.14).

Finally, we state the following two theorems without proof.

Theorem 3. If $|V| \ge |E|$ (in particular, if $G$ is a tree or if $G$ has only one cycle), then $F_{\mathrm{BK}}$ is convex, and hence has only one stationary point, which is a global minimum.

Theorem 4. The value of $F_{\mathrm{BK}}$ corresponding to a stationary point depends only on the Lagrange multipliers $\{k_v, k_e\}$, as follows:
$$F_{\mathrm{BK}} = \sum_{v \in V} k_v - \sum_{e \in E} k_e.$$
(This result should be compared to (3.8).)

Acknowledgement. The authors thank Dr. J. C. MacKay for his advice on the material in Section 4.

References.
1. S. M. Aji and R. J. McEliece, "The generalized distributive law," IEEE Trans. Inform. Theory, vol. 46, no. 2 (March 2000), pp. 325-343.
2. T. M. Cover and J. A. Thomas, Elements of Information Theory. New York: John Wiley and Sons, 1991.
3. J. Pearl, Probabilistic Reasoning in Intelligent Systems. San Francisco: Morgan Kaufmann, 1988.
4. R. P. Stanley, Enumerative Combinatorics, vols. I and II (Cambridge Studies in Advanced Mathematics 49, 62). Cambridge: Cambridge University Press, 1997-98.
5. J. S. Yedidia, "An idiosyncratic journey beyond mean field theory," pp. 21-35 in Advanced Mean Field Methods: Theory and Practice, eds. Manfred Opper and David Saad. Cambridge, MA: MIT Press, 2001.
6. J. S. Yedidia, W. T. Freeman, and Y. Weiss, "Generalized belief propagation," pp. 689-695 in Advances in Neural Information Processing Systems 13, eds. Todd K. Leen, Thomas G. Dietterich, and Volker Tresp.
7. J. S. Yedidia, W. T. Freeman, and Y. Weiss, "Bethe free energy, Kikuchi approximations, and belief propagation algorithms," available at www.merl.com/papers/TR2001-16/
8. J. M. Yeomans, Statistical Mechanics of Phase Transitions. Oxford: Oxford University Press, 1992.