Variable Elimination: Algorithm
Sargur Srihari, srihari@cedar.buffalo.edu

Topics
1. Types of inference algorithms
2. Variable elimination: the basic ideas
3. Variable elimination
   - Sum-Product VE algorithm
   - Sum-Product VE for conditional probabilities
4. Variable ordering for VE

Variable Elimination: Use of Factors
- To formalize VE we need the concept of a factor ϕ
- χ is a set of random variables and X ⊆ χ is a subset; we say Scope[ϕ] = X
- A factor associates a real value with each setting of its arguments: ϕ: Val(X) → R
- A factor in a BN is a product term, e.g. ϕ(a,b,c) = P(A,B|C)

Factors in BNs and MNs
- Factors are useful in both BNs and MNs
- A factor in a BN is a product term, e.g. ϕ(a,b,c) = P(A,B|C)
- A factor in an MN comes from the Gibbs distribution, e.g. ϕ(a,b)
- Definition of a Gibbs distribution:
  P_Φ(X_1,…,X_n) = (1/Z) P̃(X_1,…,X_n)
  where P̃(X_1,…,X_n) = Π_{i=1}^m ϕ_i(D_i) is an unnormalized measure and
  Z = Σ_{X_1,…,X_n} P̃(X_1,…,X_n) is a normalizing constant called the partition function

Role of Factor Operations
- The joint distribution is a product of factors:
  P(C,D,I,G,S,L,J,H) = P(C)P(D|C)P(I)P(G|I,D)P(S|I)P(L|G)P(J|L,S)P(H|G,J)
  = ϕ_C(C) ϕ_D(D,C) ϕ_I(I) ϕ_G(G,I,D) ϕ_S(S,I) ϕ_L(L,G) ϕ_J(J,L,S) ϕ_H(H,G,J)
- Inference is a task of marginalization:
  P(J) = Σ_L Σ_S Σ_G Σ_H Σ_I Σ_D Σ_C P(C,D,I,G,S,L,J,H)
- We wish to systematically eliminate all variables other than J

About Factors
- Inference algorithms manipulate factors
- Factors occur in both directed and undirected PGMs
- We need two operations:
  - Factor product: ϕ_1(X,Y) · ϕ_2(Y,Z)
  - Factor marginalization: ψ(X) = Σ_Y ϕ(X,Y)

Factor Product
- Let X, Y and Z be three disjoint sets of variables, and let ϕ_1(X,Y) and ϕ_2(Y,Z) be two factors
- The factor product ψ(X,Y,Z) = ϕ_1(X,Y) · ϕ_2(Y,Z) is the mapping Val(X,Y,Z) → R
- An example: ϕ_1 with 3 × 2 = 6 entries and ϕ_2 with 2 × 2 = 4 entries yield ψ with 3 × 2 × 2 = 12 entries

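As a concrete sketch (not from the slides), a factor can be stored as a dict keyed by value tuples; the representation, variable cardinalities, and entry values below are illustrative assumptions:

```python
from itertools import product

def factor_product(vars1, phi1, vars2, phi2, card):
    """psi(X,Y,Z) = phi1(X,Y) * phi2(Y,Z): multiply entries that agree
    on the shared variables, over the union of the two scopes."""
    out_vars = tuple(vars1) + tuple(v for v in vars2 if v not in vars1)
    psi = {}
    for vals in product(*(range(card[v]) for v in out_vars)):
        a = dict(zip(out_vars, vals))
        psi[vals] = (phi1[tuple(a[v] for v in vars1)]
                     * phi2[tuple(a[v] for v in vars2)])
    return out_vars, psi

# phi1 over (X,Y) with |X|=3, |Y|=2 (6 entries); phi2 over (Y,Z) with |Z|=2 (4 entries)
card = {"X": 3, "Y": 2, "Z": 2}
phi1 = {(x, y): float(x + 1) for x in range(3) for y in range(2)}  # hypothetical values
phi2 = {(y, z): float(z + 1) for y in range(2) for z in range(2)}  # hypothetical values
out_vars, psi = factor_product(("X", "Y"), phi1, ("Y", "Z"), phi2, card)
print(out_vars, len(psi))  # ('X', 'Y', 'Z') 12
```

The 6-entry and 4-entry inputs yield the 12-entry product, matching the sizes on the slide.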
Factor Marginalization
- X is a set of variables and ϕ(X,Y) is a factor; we wish to eliminate Y
- Factor marginalization of Y is the factor ψ(X) = Σ_Y ϕ(X,Y), where Y ∉ X
- The process is called summing out Y in ϕ
- Example (Fig 9.7): summing out Y = B from ϕ(A,B,C), with X = {A,C}, gives ψ(A,C)
- We add up entries in the table only when the values of X match up
- If we sum out all variables of a normalized distribution we get a factor that is a single value, 1
- If we sum out all variables of an unnormalized distribution P̃_Φ = Π_{i=1}^m ϕ_i(D_i) we get the partition function

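A minimal sketch of summing out, with a factor stored as a dict keyed by value tuples (an illustrative representation; the uniform entry values are hypothetical):

```python
def marginalize(fvars, table, var):
    """Sum out `var`: psi(X) = sum_Y phi(X, Y). Entries are added only
    when the values of the remaining variables X match up."""
    i = fvars.index(var)
    out_vars = fvars[:i] + fvars[i + 1:]
    out = {}
    for vals, p in table.items():
        key = vals[:i] + vals[i + 1:]
        out[key] = out.get(key, 0.0) + p
    return out_vars, out

# Summing out B from a normalized phi(A,B,C) gives psi(A,C)
phi = {(a, b, c): 0.125 for a in range(2) for b in range(2) for c in range(2)}
vs, psi = marginalize(("A", "B", "C"), phi, "B")
print(vs, psi[(0, 0)])  # ('A', 'C') 0.25

# Summing out every variable of a normalized distribution gives the single value 1
vs, z = marginalize(vs, psi, "A")
vs, z = marginalize(vs, z, "C")
print(z[()])  # 1.0
```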
Distributivity of Product over Sum
- Example with numbers: ψ(A,B) = Σ_{A=a_1}^{a_2} Σ_{B=b_1}^{b_2} A·B
- a·b_1 + a·b_2 = a(b_1 + b_2): product is distributive over sum
- (a + b_1)·(a + b_2) ≠ a + (b_1·b_2): sum is not distributive over product
- Product distributivity allows fewer operations:
  - a_1b_1 + a_1b_2 + a_2b_1 + a_2b_2 requires 4 products and 3 sums
  - The alternative formulation requires 2 sums and 2 products:
    τ(B) = Σ_{B=b_1}^{b_2} B = b_1 + b_2, so ψ(A,B) = Σ_A A·τ(B) = a_1τ(B) + a_2τ(B)
- Sum first, product next saves operations over product first, sum next
- Factor product and summation behave exactly like product and summation over numbers:
  if X ∉ Scope[ϕ_1], then Σ_X (ϕ_1 · ϕ_2) = ϕ_1 · Σ_X ϕ_2

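The operation counts above can be checked with concrete numbers (the values are arbitrary):

```python
a1, a2, b1, b2 = 2.0, 3.0, 5.0, 7.0

# Product first, sum next: 4 products, 3 sums
lhs = a1 * b1 + a1 * b2 + a2 * b1 + a2 * b2

# Sum first, product next: tau(B) = b1 + b2, then a1*tau + a2*tau: 2 sums, 2 products
tau = b1 + b2
rhs = a1 * tau + a2 * tau

print(lhs, rhs)  # 60.0 60.0
```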
Sum-Product Variable Elimination Algorithm
- The task is computing the value of an expression of the form Σ_Z Π_{ϕ∈Φ} ϕ
- Called the sum-product inference task: a sum of products
- The key insight is that the scope of each factor is limited, allowing us to push in some of the summations, performing them over the product of only some of the factors
- We sum out variables one at a time

Inference using Variable Elimination
- Example: the extended Student BN; we wish to infer P(J)
- P(J) = Σ_L Σ_S Σ_G Σ_H Σ_I Σ_D Σ_C P(C,D,I,G,S,L,J,H)
- By the chain rule:
  P(C,D,I,G,S,L,J,H) = P(C)P(D|C)P(I)P(G|I,D)P(S|I)P(L|G)P(J|L,S)P(H|G,J)
- which is a sum of products of factors

Sum-Product VE
- The joint:
  P(C,D,I,G,S,L,J,H) = P(C)P(D|C)P(I)P(G|I,D)P(S|I)P(L|G)P(J|L,S)P(H|G,J)
  = ϕ_C(C) ϕ_D(D,C) ϕ_I(I) ϕ_G(G,I,D) ϕ_S(S,I) ϕ_L(L,G) ϕ_J(J,L,S) ϕ_H(H,G,J)
- Goal: P(J) = Σ_L Σ_S Σ_G Σ_H Σ_I Σ_D Σ_C P(C,D,I,G,S,L,J,H)
- Elimination ordering C,D,I,H,G,S,L; each step involves a factor product and a factor marginalization
1. Eliminate C: ψ_1(C,D) = ϕ_C(C)ϕ_D(D,C); τ_1(D) = Σ_C ψ_1(C,D)
2. Eliminate D: ψ_2(G,I,D) = ϕ_G(G,I,D)τ_1(D); τ_2(G,I) = Σ_D ψ_2(G,I,D)
   (note we already eliminated one factor with D, but introduced τ_1 involving D)
3. Eliminate I: ψ_3(G,I,S) = ϕ_I(I)ϕ_S(S,I)τ_2(G,I); τ_3(G,S) = Σ_I ψ_3(G,I,S)
4. Eliminate H: ψ_4(G,J,H) = ϕ_H(H,G,J); τ_4(G,J) = Σ_H ψ_4(G,J,H) (note τ_4(G,J) = 1)
5. Eliminate G: ψ_5(G,J,L,S) = τ_4(G,J)τ_3(G,S)ϕ_L(L,G); τ_5(J,L,S) = Σ_G ψ_5(G,J,L,S)
6. Eliminate S: ψ_6(J,L,S) = τ_5(J,L,S)ϕ_J(J,L,S); τ_6(J,L) = Σ_S ψ_6(J,L,S)
7. Eliminate L: ψ_7(J,L) = τ_6(J,L); τ_7(J) = Σ_L ψ_7(J,L)

Computing τ(A,C) = Σ_B ψ(A,B,C) = Σ_B ϕ(A,B)·ϕ(B,C)
1. Factor product: ψ(A,B,C) = ϕ(A,B)·ϕ(B,C)
2. Factor marginalization: τ(A,C) = Σ_B ψ(A,B,C)

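With factors stored as NumPy arrays (the entry values are hypothetical), the two steps become a broadcasted product followed by a sum over the shared axis:

```python
import numpy as np

phi_ab = np.array([[0.5, 0.8],
                   [0.1, 0.0],
                   [0.3, 0.9]])   # phi(A,B), hypothetical entries
phi_bc = np.array([[0.5, 0.7],
                   [0.1, 0.2]])   # phi(B,C), hypothetical entries

psi = phi_ab[:, :, None] * phi_bc[None, :, :]  # 1. factor product psi(A,B,C)
tau = psi.sum(axis=1)                          # 2. marginalize out B

# Product-then-sum over the shared variable B is just a matrix product
print(np.allclose(tau, phi_ab @ phi_bc))  # True
```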
Sum-Product VE Algorithm
- To compute Σ_Z Π_{ϕ∈Φ} ϕ
- The first procedure specifies an ordering of the k variables Z_i
- The second procedure eliminates a single variable Z (contained in factors Φ) and returns the factor τ

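A compact sketch of Sum-Product-VE: each elimination step multiplies the factors mentioning Z, sums Z out, and puts the result back. The toy chain BN A → B → C, its cardinalities, and the CPD values are illustrative assumptions, not the Student network:

```python
from itertools import product as assignments

CARD = {"A": 2, "B": 2, "C": 2}  # variable cardinalities (assumed example)

def f_product(f1, f2):
    """Factor product; a factor is (scope tuple, dict keyed by value tuples)."""
    (v1, t1), (v2, t2) = f1, f2
    out = v1 + tuple(v for v in v2 if v not in v1)
    table = {}
    for vals in assignments(*(range(CARD[v]) for v in out)):
        a = dict(zip(out, vals))
        table[vals] = t1[tuple(a[v] for v in v1)] * t2[tuple(a[v] for v in v2)]
    return out, table

def sum_out(f, var):
    """Factor marginalization: sum `var` out of the factor."""
    fvars, t = f
    i = fvars.index(var)
    out = fvars[:i] + fvars[i + 1:]
    table = {}
    for vals, p in t.items():
        k = vals[:i] + vals[i + 1:]
        table[k] = table.get(k, 0.0) + p
    return out, table

def sum_product_ve(factors, ordering):
    """Eliminate each Z in turn: multiply all factors whose scope
    contains Z, sum Z out, and return tau over the remaining variables."""
    for z in ordering:
        touched = [f for f in factors if z in f[0]]
        factors = [f for f in factors if z not in f[0]]
        psi = touched[0]
        for f in touched[1:]:
            psi = f_product(psi, f)
        factors.append(sum_out(psi, z))
    result = factors[0]
    for f in factors[1:]:
        result = f_product(result, f)
    return result

# Toy chain BN A -> B -> C; compute P(C) by eliminating A, then B
pA  = (("A",), {(0,): 0.6, (1,): 0.4})
pBA = (("A", "B"), {(0, 0): 0.7, (0, 1): 0.3, (1, 0): 0.2, (1, 1): 0.8})
pCB = (("B", "C"), {(0, 0): 0.9, (0, 1): 0.1, (1, 0): 0.5, (1, 1): 0.5})
vs, pC = sum_product_ve([pA, pBA, pCB], ["A", "B"])
print(vs, pC)
```

Eliminating A produces τ_1(B), and eliminating B combines τ_1 with P(C|B), leaving a factor over C alone that is exactly P(C).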
Two Runs of Variable Elimination
- Elimination ordering C,D,I,H,G,S,L
- Elimination ordering G,I,S,L,H,C,D: produces factors with much larger scope

Dealing with Evidence
- We observe that the student is intelligent (i1) and unhappy (h0)
- What is the probability that the student has a job?
  P(J|i1,h0) = P(J,i1,h0) / P(i1,h0)
- For this we need the unnormalized distribution P(J,i1,h0); then we compute the conditional distribution by renormalizing by P(e) = P(i1,h0)

A BN with Evidence e is a Gibbs Distribution with Z = P(e)
- It is defined by the original factors reduced to the context E = e
- Let B be a BN over χ and E = e an observation; let W = χ − E
- Then P_B(W|e) is a Gibbs distribution with factors Φ = {ϕ_{X_i}}, X_i ∈ χ, where ϕ_{X_i} = P_B(X_i|Pa_{X_i})[E = e]
- The partition function for this Gibbs distribution is P(e). Proof:
  P_B(χ) = Π_{i=1}^N P_B(X_i|Pa_{X_i})
  P_B(W|E=e) = P_B(W, e) / P_B(E=e)
             = Π_{i=1}^N P_B(X_i|Pa_{X_i})[E=e] / Σ_W Π_{i=1}^N P_B(X_i|Pa_{X_i})[E=e]
- Thus any BN conditioned on evidence can be regarded as a Markov network, and we can use the techniques developed for MN analysis

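A sketch of factor reduction followed by renormalization, on a hypothetical two-node BN A → B (factors as dicts keyed by value tuples; all values are illustrative, not the Student network):

```python
def reduce_factor(f, evidence):
    """Restrict a factor to the context E = e, dropping the evidence
    variables from its scope (factor reduction)."""
    fvars, t = f
    out = tuple(v for v in fvars if v not in evidence)
    table = {}
    for vals, p in t.items():
        a = dict(zip(fvars, vals))
        if all(a[v] == e for v, e in evidence.items() if v in a):
            table[tuple(a[v] for v in out)] = p
    return out, table

# Toy BN A -> B; observe B = 1
pA  = (("A",), {(0,): 0.6, (1,): 0.4})
pBA = (("A", "B"), {(0, 0): 0.7, (0, 1): 0.3, (1, 0): 0.2, (1, 1): 0.8})

phi = reduce_factor(pBA, {"B": 1})                    # phi(A) = P(B=1|A)
unnorm = {a: pA[1][a] * phi[1][a] for a in pA[1]}     # unnormalized P(A, b1)
Z = sum(unnorm.values())                              # partition function = P(b1)
posterior = {a: p / Z for a, p in unnorm.items()}     # P(A | b1)
print(Z, posterior)
```

The partition function of the reduced factor set is exactly the probability of the evidence, as the slide's proof states.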
Sum-Product for Conditional Probabilities
- Apply Sum-Product VE to χ − Y − E
- The returned factor ϕ* is P(Y,e)
- Renormalize by P(e), the sum over the entries of the unnormalized distribution

Computing P(J,i1,h0)
- A run of Sum-Product VE
- Compared with the previous elimination ordering, steps 3 and 4 disappear, since I and H need not be eliminated
- By not eliminating I we avoid the step that correlates G and S

Complexity of VE: Simple Analysis
- Suppose there are n random variables and m initial factors; m = n in a BN, while in an MN we may have more factors than variables
- VE picks a variable X_i, then multiplies all factors involving that variable; the result is a single factor ψ_i
- If N_i is the number of entries in ψ_i and N_max = max_i N_i, the overall amount of work required is O(m·N_max)
- The inevitable source of exponential blowup is the exponential size of the factors ψ_i

Complexity: Graph-Theoretic Analysis
- VE can be viewed as operating on an undirected graph with factors Φ
- If P is the distribution defined by multiplying the factors in Φ, with X = Scope[Φ]:
  P(X) = (1/Z) Π_{ϕ∈Φ} ϕ, where Z = Σ_X Π_{ϕ∈Φ} ϕ
- Then the undirected graph used by the VE algorithm is precisely the moralized BN

Factor Reduction: Reduced Gibbs
- The reduced factor of ψ(A,B,C) in the context C = c1 is τ(A,B) = ψ(A,B,c1)
- The value of C determines the factor τ(A,B)
- (Figure: moralized BN over A, B, C; the initial set of factors, then the reduced factors in the context G = g, and in the context G = g, S = s)

VE as Graph Transformation
- When a variable X is eliminated from Φ, fill edges are introduced in Φ_X
- After eliminating C: no fill edges
- After eliminating D: no fill edges
- After eliminating I: fill edge G–S

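A sketch of the graph transformation (adjacency-set representation assumed; the three-node example mirrors the G–S fill edge on the slide rather than the full Student graph):

```python
def eliminate_node(adj, var):
    """Eliminate `var` from an undirected graph: connect every pair of
    its neighbors (fill edges), then remove the node. Returns the fill
    edges that were added."""
    nbrs = sorted(adj[var])
    fill = []
    for i in range(len(nbrs)):
        for j in range(i + 1, len(nbrs)):
            u, w = nbrs[i], nbrs[j]
            if w not in adj[u]:
                fill.append((u, w))
                adj[u].add(w)
                adj[w].add(u)
    for u in nbrs:
        adj[u].discard(var)
    del adj[var]
    return fill

# Eliminating I, whose neighbors G and S are not adjacent,
# introduces the fill edge G-S
adj = {"I": {"G", "S"}, "G": {"I"}, "S": {"I"}}
fill = eliminate_node(adj, "I")
print(fill)  # [('G', 'S')]
```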
Induced Graph
- The union of all graphs generated by VE
- The scope of every factor generated during VE is a clique in the induced graph
- Every maximal clique is the scope of some intermediate factor
- Width of the induced graph = number of nodes in the largest clique minus 1
- The minimal induced width over all orderings is a bound on the best achievable VE performance
- (Figure: induced graph for the elimination order, its maximal cliques, and the clique tree for the graph)

Finding Elimination Orderings
- Max-cardinality search: induced graphs are chordal; every minimal loop is of length 3, e.g. the loop G–L–J–H is cut by the chord G–J
- Greedy search

Max-Cardinality Search
Procedure Max-Cardinality ( H // An undirected graph over χ )
- Select S first
- Next is a neighbor, say J
- The nodes with the largest number of marked neighbors are then H and I

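A sketch of the procedure (graph as adjacency sets; tie-breaking is arbitrary, and the four-node example is an assumption, not the Student graph):

```python
def max_cardinality(adj):
    """Max-Cardinality Search: repeatedly mark the unmarked node with
    the most already-marked neighbors; the elimination ordering is the
    reverse of the marking order."""
    marked = []
    unmarked = set(adj)
    while unmarked:
        best = max(unmarked, key=lambda n: len(adj[n] & set(marked)))
        marked.append(best)
        unmarked.discard(best)
    return list(reversed(marked))  # elimination ordering

# Chordal example: triangle G-J-L plus H adjacent to G and J
adj = {"G": {"J", "L", "H"}, "J": {"G", "L", "H"},
       "L": {"G", "J"}, "H": {"G", "J"}}
order = max_cardinality(adj)
print(order)
```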
Greedy Search
Procedure Greedy-Ordering ( H // An undirected graph over χ, s // An evaluation metric )
Evaluation metrics s(H,X):
- Min-neighbors
- Min-weight
- Min-fill
- Weighted min-fill

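A sketch of Greedy-Ordering using the Min-Fill metric (adjacency sets; deterministic tie-breaking by name is an implementation choice, and the chain example is an assumption):

```python
def greedy_ordering(adj):
    """Greedy-Ordering with Min-Fill: at each step eliminate the node
    whose elimination would introduce the fewest fill edges."""
    g = {v: set(ns) for v, ns in adj.items()}

    def n_fill(v):
        nb = sorted(g[v])
        return sum(1 for i in range(len(nb)) for j in range(i + 1, len(nb))
                   if nb[j] not in g[nb[i]])

    order = []
    while g:
        v = min(sorted(g), key=n_fill)
        nb = sorted(g[v])
        for i in range(len(nb)):            # add the fill edges
            for j in range(i + 1, len(nb)):
                g[nb[i]].add(nb[j])
                g[nb[j]].add(nb[i])
        for u in nb:                        # remove the node
            g[u].discard(v)
        del g[v]
        order.append(v)
    return order

# On a chain a-b-c-d a leaf always has zero fill cost, so no fill
# edges are ever needed
adj = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}
order = greedy_ordering(adj)
print(order)  # ['a', 'b', 'c', 'd']
```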
Comparison of VE Orderings
- Different heuristics for variable orderings
- Testing data: 8 standard BNs ranging from 8 to 1,000 nodes
- Methods: simulated annealing, a BN package, and the four heuristics

Comparison of VE Variable-Ordering Algorithms
- Evaluation metrics s(H,X): min-neighbors, min-weight, min-fill, weighted min-fill
- For large networks it is worthwhile to run several heuristic algorithms to find the best ordering

Two Simple Inference Cases
1. Bayes theorem as inference
2. Inference on a chain

1. Bayes Theorem as Inference
- Joint distribution p(x,y) over two variables x and y
- The factorization p(x,y) = p(x)p(y|x) is represented as the directed graph (a)
- We are given the CPDs p(x) and p(y|x)
- If we observe the value of y, as in (b), we can view the marginal p(x) as a prior over the latent variable x
- Analogy to a 2-class classifier: class x ∈ {0,1} and feature y is continuous; we wish to infer the posterior distribution p(x|y)

Inferring the Posterior using Bayes
- Using the sum and product rules, we can evaluate the marginal p(y) = Σ_x p(y|x)p(x); we need to evaluate a summation
- This is then used in Bayes rule to calculate p(x|y) = p(y|x)p(x)/p(y)
- Observations: the joint is now expressed as p(x,y) = p(y)p(x|y), which is shown in (c)
- Thus, knowing the value of y, we know the distribution of x

2. Inference on a Chain
- Graphs of this form, x_1 — x_2 — ⋯ — x_{N−1} — x_N, are known as Markov chains
- Example: N = 365 days and x is the weather (cloudy, rainy, snow, …)
- Analysis is more complex than the previous case
- Here the directed and undirected graphs are exactly the same, since there is only one parent per node (no additional links needed)
- The joint distribution has the form
  p(x) = (1/Z) ψ_{1,2}(x_1,x_2) ψ_{2,3}(x_2,x_3) ⋯ ψ_{N−1,N}(x_{N−1},x_N)
  a product of potential functions over pairwise cliques
- Specific case of N discrete variables: the potential functions are K × K tables, and the joint distribution has (N−1)K² parameters

Inferring the Marginal of a Node
- We wish to evaluate the marginal distribution p(x_n), e.g. what is the weather on November 11?
- x_n is a specific node part way along the chain; as yet there are no observed nodes
- The required marginal is obtained by summing the joint distribution over all variables except x_n (an application of the sum rule):
  p(x_n) = Σ_{x_1} ⋯ Σ_{x_{n−1}} Σ_{x_{n+1}} ⋯ Σ_{x_N} p(x)

Naïve Evaluation of the Marginal
1. Evaluate the joint distribution
2. Perform the summations explicitly
- The joint can be expressed as a set of numbers, one for each value of x
- With N variables each having K states, there are K^N values for x
- p(x_n) = Σ_{x_1} ⋯ Σ_{x_{n−1}} Σ_{x_{n+1}} ⋯ Σ_{x_N} (1/Z) ψ_{1,2}(x_1,x_2) ψ_{2,3}(x_2,x_3) ⋯ ψ_{N−1,N}(x_{N−1},x_N)
- Evaluation of both the joint and the marginal is exponential in the length N of the chain: impossible with K = 10 and N = 365

Efficient Evaluation
- p(x_n) = Σ_{x_1} ⋯ Σ_{x_{n−1}} Σ_{x_{n+1}} ⋯ Σ_{x_N} (1/Z) ψ_{1,2}(x_1,x_2) ψ_{2,3}(x_2,x_3) ⋯ ψ_{N−1,N}(x_{N−1},x_N)
- We are adding a bunch of products, but multiplication is distributive over addition: ab + ac = a(b + c)
- Perform the summation first and then the product: the LHS involves 3 arithmetic operations, the RHS only 2
- The sum of products is evaluated as sums first

Efficient Evaluation: Exploiting Conditional Independence Properties
- Rearrange the order of the summations and multiplications to allow the marginal to be evaluated more efficiently
- Consider the summation over x_N: the potential ψ_{N−1,N}(x_{N−1},x_N) is the only one that depends on x_N
- So we can perform Σ_{x_N} ψ_{N−1,N}(x_{N−1},x_N) first, to give a function of x_{N−1}
- Use this to perform the summation over x_{N−1}
- Each summation removes a variable from the distribution, i.e. removes a node from the graph

Marginal Expression
- Group the potentials and summations together to give the marginal:
  p(x_n) = (1/Z) [Σ_{x_{n−1}} ψ_{n−1,n}(x_{n−1},x_n) ⋯ [Σ_{x_2} ψ_{2,3}(x_2,x_3) [Σ_{x_1} ψ_{1,2}(x_1,x_2)]] ⋯ ] × [Σ_{x_{n+1}} ψ_{n,n+1}(x_n,x_{n+1}) ⋯ [Σ_{x_N} ψ_{N−1,N}(x_{N−1},x_N)] ⋯ ]
  where the first bracket is µ_α(x_n) and the second is µ_β(x_n)
- Key concept: multiplication is distributive over addition, ab + ac = a(b + c); the LHS involves 3 arithmetic operations, the RHS only 2

Computational Cost
- Evaluation of the marginal using the reordered expression: N−1 summations, each over K states, each a function of 2 variables
- The summation over x_1 involves only ψ_{1,2}(x_1,x_2), a table of K × K numbers: summing the table over x_1 for each x_2 is an O(K²) cost
- The total cost is O(NK²): linear in the chain length, vs. the exponential cost of the naive approach
- We are able to exploit the many conditional independence properties of this simple graph

Interpretation as Message Passing
- The calculation can be viewed as message passing in the graph
- The expression for the marginal decomposes into p(x_n) = (1/Z) µ_α(x_n) µ_β(x_n)
- µ_α(x_n) is the message passed forwards along the chain from node x_{n−1} to x_n
- µ_β(x_n) is the message passed backwards from node x_{n+1} to x_n
- Each message comprises K values, one for each choice of x_n

Recursive Evaluation of Messages
- The message µ_α(x_n) can be evaluated as
  µ_α(x_n) = Σ_{x_{n−1}} ψ_{n−1,n}(x_{n−1},x_n) µ_α(x_{n−1})   (1)
- Therefore first evaluate µ_α(x_2) = Σ_{x_1} ψ_{1,2}(x_1,x_2), then apply (1) repeatedly until we reach the desired node
- Note that the outgoing message µ_α(x_n) in (1) is obtained by multiplying the incoming message µ_α(x_{n−1}) by the local potential involving the node variable and the outgoing variable, and summing over the node variable

Recursive Message Passing
- Similarly, the message µ_β(x_n) can be evaluated recursively, starting at node x_N:
  µ_β(x_n) = Σ_{x_{n+1}} ψ_{n,n+1}(x_n,x_{n+1}) µ_β(x_{n+1})
- These message passing equations are known as the Chapman-Kolmogorov equations for Markov processes
- The normalization constant Z is easily evaluated by summing µ_α(x_n) µ_β(x_n) over all states of x_n: an O(K) computation

Evaluating Marginals for Every Node
- We wish to evaluate p(x_n) for every node n = 1,…,N
- Simply applying the above procedure once per node is O(N²K²): computationally wasteful, with much duplication
  - To find p(x_1) we need to propagate a message µ_β(·) from node x_N back to x_2
  - To evaluate p(x_2) we need to propagate a message µ_β(·) from node x_N back to x_3
- Instead, launch a message µ_β(x_{N−1}) starting from node x_N and propagate it back to x_1, and launch a message µ_α(x_2) starting from node x_1 and propagate it forward to x_N, storing all intermediate messages along the way
- Then any node can evaluate its marginal by p(x_n) = (1/Z) µ_α(x_n) µ_β(x_n)
- The computational cost is only twice that of finding the marginal of a single node, instead of N times

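The forward/backward recursions can be checked numerically. This sketch (not from the text) stores each ψ_{n,n+1} as a K × K array with arbitrary positive entries, computes both message sweeps, and compares every marginal against brute-force summation of the joint:

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)
N, K = 6, 3                                             # chain length, states per node
psi = [rng.random((K, K)) + 0.1 for _ in range(N - 1)]  # psi_{n,n+1}(x_n, x_{n+1})

# Forward messages mu_a[n] and backward messages mu_b[n]: one K-vector per node
mu_a = [np.ones(K)]                        # mu_a for x_1 is trivially 1
for n in range(1, N):
    mu_a.append(psi[n - 1].T @ mu_a[-1])   # sum over x_{n-1}
mu_b = [np.ones(K)]                        # mu_b for x_N is trivially 1
for n in range(N - 2, -1, -1):
    mu_b.insert(0, psi[n] @ mu_b[0])       # sum over x_{n+1}

Z = float(mu_a[0] @ mu_b[0])               # same value at every node

# Brute-force joint for comparison (K^N entries; tiny here)
joint = np.zeros((K,) * N)
for idx in product(range(K), repeat=N):
    p = 1.0
    for n in range(N - 1):
        p *= psi[n][idx[n], idx[n + 1]]
    joint[idx] = p
joint /= joint.sum()

for n in range(N):
    marg = mu_a[n] * mu_b[n] / Z           # p(x_n) = mu_a * mu_b / Z
    brute = joint.sum(axis=tuple(i for i in range(N) if i != n))
    assert np.allclose(marg, brute)
print("all chain marginals match; Z =", Z)
```

One forward sweep plus one backward sweep, with messages stored, gives every p(x_n) at roughly twice the single-marginal cost, as claimed above.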
Joint Distribution of Neighbors
- We wish to calculate the joint distribution p(x_{n−1}, x_n) for neighboring nodes
- Similar to the previous computation; the required joint distribution can be written as
  p(x_{n−1}, x_n) = (1/Z) µ_α(x_{n−1}) ψ_{n−1,n}(x_{n−1},x_n) µ_β(x_n)
- It is obtained directly once the message passing for the marginals is completed
- A useful result if we wish to use parametric forms for the conditional distributions

Tree-Structured Graphs
- Local message passing can be performed efficiently on trees; message passing can be generalized to give the sum-product algorithm
- Undirected tree: a graph with only one path between any pair of nodes; such graphs have no loops
- Directed tree: a directed graph with a single node, called the root, that has no parents; conversion to an undirected graph adds no moralization links, since every node has only one parent
- Polytree: a directed graph in which nodes may have more than one parent, but there is only one path between any pair of nodes (ignoring arrow directions); moralization will add links