Variable Elimination: Algorithm

Sargur Srihari, srihari@cedar.buffalo.edu

Topics
1. Types of Inference Algorithms
2. Variable Elimination: the Basic Ideas
3. Variable Elimination
   - Sum-Product VE Algorithm
   - Sum-Product VE for Conditional Probabilities
4. Variable Ordering for VE

Variable Elimination: Use of Factors
To formalize VE we need the concept of a factor ϕ. Let χ be a set of random variables and X ⊆ χ a subset; we say Scope[ϕ] = X. A factor associates a real value with each setting of its arguments, ϕ: Val(X) → R. A factor in a BN is a product term, e.g., ϕ(A,B,C) = P(A,B|C).

Factors in BNs and MNs
Factors are useful in both BNs and MNs. A factor in a BN is a product term, say ϕ(A,B,C) = P(A,B|C). A factor in an MN comes from the Gibbs distribution, say ϕ(A,B).
Definition of a Gibbs distribution:
P_Φ(X_1,...,X_n) = (1/Z) P̃(X_1,...,X_n), where P̃(X_1,...,X_n) = Π_{i=1}^m ϕ_i(D_i) is an unnormalized measure and Z = Σ_{X_1,...,X_n} P̃(X_1,...,X_n) is a normalizing constant called the partition function.

Role of Factor Operations
The joint distribution is a product of factors:
P(C,D,I,G,S,L,J,H) = P(C)P(D|C)P(I)P(G|I,D)P(S|I)P(L|G)P(J|L,S)P(H|G,J)
= ϕ_C(C) ϕ_D(D,C) ϕ_I(I) ϕ_G(G,I,D) ϕ_S(S,I) ϕ_L(L,G) ϕ_J(J,L,S) ϕ_H(H,G,J)
Inference is a task of marginalization:
P(J) = Σ_{C,D,I,G,S,L,H} P(C,D,I,G,S,L,J,H)
We wish to systematically eliminate all variables other than J.

About Factors
Inference algorithms manipulate factors, which occur in both directed and undirected PGMs. We need two operations:
Factor product: ψ(X,Y,Z) = ϕ_1(X,Y) · ϕ_2(Y,Z)
Factor marginalization: ψ(X) = Σ_Y ϕ(X,Y)

Factor Product
Let X, Y and Z be three disjoint sets of variables and let ϕ_1(X,Y) and ϕ_2(Y,Z) be two factors. The factor product is the mapping Val(X,Y,Z) → R defined by ψ(X,Y,Z) = ϕ_1(X,Y) · ϕ_2(Y,Z).
An example: ϕ_1 with 3 × 2 = 6 entries and ϕ_2 with 2 × 2 = 4 entries yield ψ with 3 × 2 × 2 = 12 entries.

Factor Marginalization
Let X be a set of variables and ϕ(X,Y) a factor; we wish to eliminate Y. The factor marginalization of Y is the factor ψ(X) = Σ_Y ϕ(X,Y). The process is called summing out Y in ϕ.
Example (Fig 9.7): summing out Y = B from ϕ(A,B,C), with X = {A,C}, gives ψ(A,C). We add up table entries only where the values of X match.
If we sum out all variables we get a factor with a single value; for a normalized distribution this value is 1. If we sum out all of the variables in the unnormalized measure P̃_Φ = Π_{i=1}^N ϕ_i(D_i) we get the partition function.
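The two factor operations are easy to sketch in code. Below is a minimal, illustrative Python sketch (not from the slides): a Factor over named discrete variables backed by a numpy array, with product, marginalize, and a reduce operation that will be useful later for evidence. The class name, table layout, and method names are assumptions for illustration.

```python
import numpy as np

class Factor:
    """Discrete factor: a table of non-negative reals over a tuple of named variables."""
    def __init__(self, variables, cards, values):
        self.vars = list(variables)                # variable names, e.g. ['A', 'B']
        self.cards = dict(zip(variables, cards))   # cardinality of each variable
        self.values = np.asarray(values, float).reshape([self.cards[v] for v in self.vars])

    def product(self, other):
        """Factor product: psi(X,Y,Z) = phi1(X,Y) * phi2(Y,Z)."""
        out_vars = self.vars + [v for v in other.vars if v not in self.vars]
        cards = {**self.cards, **other.cards}
        a = self._expand(out_vars, cards)          # broadcast both tables to the union scope
        b = other._expand(out_vars, cards)
        return Factor(out_vars, [cards[v] for v in out_vars], a * b)

    def marginalize(self, var):
        """Sum out one variable: psi(X) = sum_Y phi(X,Y)."""
        axis = self.vars.index(var)
        new_vars = [v for v in self.vars if v != var]
        return Factor(new_vars, [self.cards[v] for v in new_vars],
                      self.values.sum(axis=axis))

    def reduce(self, var, value):
        """Restrict the factor to a context var = value (used for evidence)."""
        axis = self.vars.index(var)
        new_vars = [v for v in self.vars if v != var]
        return Factor(new_vars, [self.cards[v] for v in new_vars],
                      np.take(self.values, value, axis=axis))

    def _expand(self, out_vars, cards):
        # reshape/broadcast this factor's table to the axis order given by out_vars
        shape = [cards[v] if v in self.vars else 1 for v in out_vars]
        perm = [self.vars.index(v) for v in out_vars if v in self.vars]
        return self.values.transpose(perm).reshape(shape)

# usage matching the later example tau(A,C) = sum_B phi(A,B) phi(B,C):
# tau_ac = phi_ab.product(phi_bc).marginalize('B')
```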

Distributivity of Product over Sum
Example with numbers: ψ = Σ_{A∈{a_1,a_2}} Σ_{B∈{b_1,b_2}} A·B
a·b_1 + a·b_2 = a(b_1 + b_2): product is distributive over sum; (a + b_1)·(a + b_2) ≠ a + (b_1·b_2): sum is not distributive over product.
Product distributivity allows fewer operations: a_1b_1 + a_1b_2 + a_2b_1 + a_2b_2 requires 4 products and 3 sums. The alternative formulation ψ = Σ_A A·τ(B), where τ(B) = Σ_B B = b_1 + b_2, gives ψ = a_1τ(B) + a_2τ(B) and requires only 2 sums and 2 products: sum first, product next saves operations over product first, sum next (see the sketch below).
Factor product and summation behave exactly like product and summation over numbers: if X ∉ Scope(ϕ_1), then Σ_X (ϕ_1 · ϕ_2) = ϕ_1 · Σ_X ϕ_2.
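The savings from pushing sums inside products can be checked numerically. A small numpy sketch; the values are chosen arbitrarily:

```python
import numpy as np

a = np.array([2.0, 5.0])   # values a_1, a_2
b = np.array([3.0, 7.0])   # values b_1, b_2

# Product first, sum next: forms all 4 products, then 3 additions.
naive = (a[:, None] * b[None, :]).sum()

# Sum first, product next: tau = b_1 + b_2, then a_1*tau + a_2*tau.
tau = b.sum()
efficient = (a * tau).sum()

assert np.isclose(naive, efficient)   # same value, fewer operations
```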

Sum-Product Variable Elimination Algorithm
The task is computing the value of an expression of the form Σ_Z Π_{ϕ∈Φ} ϕ, called the sum-product inference task (a sum of products). The key insight is that the scope of each factor is limited, allowing us to push in some of the summations, performing them over the product of only some of the factors. We sum out variables one at a time.

Inference using Variable Elimination
Example: Extended Student BN. We wish to infer P(J):
P(J) = Σ_{C,D,I,G,S,L,H} P(C,D,I,G,S,L,J,H)
By the chain rule:
P(C,D,I,G,S,L,J,H) = P(C)P(D|C)P(I)P(G|I,D)P(S|I)P(L|G)P(J|L,S)P(H|G,J)
which makes P(J) a sum of a product of factors.

Sum-product VE
P(C,D,I,G,S,L,J,H) = P(C)P(D|C)P(I)P(G|I,D)P(S|I)P(L|G)P(J|L,S)P(H|G,J)
= ϕ_C(C) ϕ_D(D,C) ϕ_I(I) ϕ_G(G,I,D) ϕ_S(S,I) ϕ_L(L,G) ϕ_J(J,L,S) ϕ_H(H,G,J)
Goal: P(J) = Σ_{C,D,I,G,S,L,H} P(C,D,I,G,S,L,J,H), with elimination ordering C,D,I,H,G,S,L. Each step involves a factor product and a factor marginalization (a sketch of the first two steps follows the listing below).
1. Eliminating C: ψ_1(C,D) = ϕ_C(C) ϕ_D(D,C), τ_1(D) = Σ_C ψ_1(C,D)
2. Eliminating D: ψ_2(G,I,D) = ϕ_G(G,I,D) τ_1(D), τ_2(G,I) = Σ_D ψ_2(G,I,D). Note we already eliminated one factor with D, but introduced τ_1 involving D.
3. Eliminating I: ψ_3(G,I,S) = ϕ_I(I) ϕ_S(S,I) τ_2(G,I), τ_3(G,S) = Σ_I ψ_3(G,I,S)
4. Eliminating H: ψ_4(G,J,H) = ϕ_H(H,G,J), τ_4(G,J) = Σ_H ψ_4(G,J,H). Note τ_4(G,J) = 1.
5. Eliminating G: ψ_5(G,J,L,S) = τ_4(G,J) τ_3(G,S) ϕ_L(L,G), τ_5(J,L,S) = Σ_G ψ_5(G,J,L,S)
6. Eliminating S: ψ_6(J,L,S) = τ_5(J,L,S) ϕ_J(J,L,S), τ_6(J,L) = Σ_S ψ_6(J,L,S)
7. Eliminating L: ψ_7(J,L) = τ_6(J,L), τ_7(J) = Σ_L ψ_7(J,L)
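As referenced above, here is an illustrative sketch of the first two elimination steps using the Factor class sketched earlier. The CPT values and cardinalities (2 or 3 states per variable) are invented for illustration, not from the slides.

```python
import numpy as np

phi_C = Factor(['C'], [2], [0.5, 0.5])                          # P(C)
phi_D = Factor(['D', 'C'], [2, 2], [0.6, 0.1, 0.4, 0.9])        # P(D|C), columns sum to 1
phi_G = Factor(['G', 'I', 'D'], [3, 2, 2], np.random.rand(12))  # placeholder for P(G|I,D)

tau_1 = phi_C.product(phi_D).marginalize('C')   # tau_1(D)   = sum_C phi_C(C) phi_D(D,C)
tau_2 = phi_G.product(tau_1).marginalize('D')   # tau_2(G,I) = sum_D phi_G(G,I,D) tau_1(D)
```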

Computing τ(A,C) = Σ_B ψ(A,B,C) = Σ_B ϕ(A,B) ϕ(B,C)
1. Factor product: ψ(A,B,C) = ϕ(A,B) ϕ(B,C)
2. Factor marginalization: τ(A,C) = Σ_B ψ(A,B,C)

Sum-Product VE Algorithm
To compute Σ_Z Π_{ϕ∈Φ} ϕ:
The first procedure specifies an ordering of the k variables Z_i to be eliminated; the second procedure eliminates a single variable Z (contained in a subset of the factors Φ) and returns a factor τ.
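A minimal sketch of the two procedures in Python, assuming the Factor class sketched earlier; eliminate_var and sum_product_ve are illustrative names, not from the slides.

```python
def eliminate_var(factors, var):
    """Eliminate one variable: multiply all factors mentioning var, then sum it out."""
    involved = [f for f in factors if var in f.vars]
    rest = [f for f in factors if var not in f.vars]
    if not involved:
        return rest
    psi = involved[0]
    for f in involved[1:]:
        psi = psi.product(f)                   # psi_i = product of all factors containing var
    return rest + [psi.marginalize(var)]       # tau_i = sum over var of psi_i

def sum_product_ve(factors, ordering):
    """Sum-product VE: eliminate the variables in `ordering` one at a time,
    then multiply the remaining factors into a single result factor phi*."""
    for var in ordering:
        factors = eliminate_var(factors, var)
    phi_star = factors[0]
    for f in factors[1:]:
        phi_star = phi_star.product(f)
    return phi_star

# e.g. phi_star = sum_product_ve(all_student_factors, ['C','D','I','H','G','S','L'])
```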

Two Runs of Variable Elimination
Elimination ordering C,D,I,H,G,S,L versus elimination ordering G,I,S,L,H,C,D: the second ordering produces factors with much larger scope.

Dealing with Evidence
We observe that the student is intelligent (i^1) and unhappy (h^0). What is the probability that the student has a job?
P(J | i^1, h^0) = P(J, i^1, h^0) / P(i^1, h^0)
For this we need the unnormalized distribution P(J, i^1, h^0); we then compute the conditional distribution by renormalizing by P(e) = P(i^1, h^0).

A BN with evidence e is a Gibbs distribution with Z = P(e)
It is defined by the original factors reduced to the context E = e. Let B be a BN over χ and E = e an observation, and let W = χ − E. Then P_B(W | e) is a Gibbs distribution with factors Φ = {ϕ_{X_i}}, X_i ∈ χ, where ϕ_{X_i} = P_B(X_i | Pa_{X_i})[E = e]. The partition function of this Gibbs distribution is P(e). Proof:
P_B(χ) = Π_{i=1}^N P_B(X_i | Pa_{X_i})
P_B(W | E = e) = P_B(W, E = e) / P_B(E = e) = P_B(χ)[E = e] / Σ_W P_B(χ)[E = e] = Π_{i=1}^N P_B(X_i | Pa_{X_i})[E = e] / Σ_W Π_{i=1}^N P_B(X_i | Pa_{X_i})[E = e]
Thus any BN conditioned on evidence can be regarded as a Markov network, and we can use techniques developed for MN analysis.

Sum-Product for Conditional Probabilities
Reduce the factors to the evidence E = e and apply sum-product VE to eliminate χ − Y − E. The returned factor ϕ* is P(Y, e). Renormalize by P(e), the sum over the entries of this unnormalized distribution.
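A sketch of the full conditional query, assuming the Factor class and sum_product_ve sketched earlier; the helper name and the evidence encoding (a dict of observed value indices) are assumptions.

```python
def condition_query(factors, query_var, evidence, all_vars):
    """Sketch of P(query_var | evidence) via sum-product VE:
    reduce factors to the evidence, eliminate everything else, renormalize.
    evidence: dict {variable: observed value index}."""
    reduced = []
    for f in factors:
        for var, val in evidence.items():
            if var in f.vars:
                f = f.reduce(var, val)             # restrict the factor to E = e
        reduced.append(f)
    ordering = [v for v in all_vars if v != query_var and v not in evidence]
    phi_star = sum_product_ve(reduced, ordering)   # unnormalized P(query_var, e)
    p_e = phi_star.values.sum()                    # partition function = P(e)
    return phi_star.values / p_e                   # P(query_var | e)
```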

Computing P(J, i^1, h^0): Run of Sum-Product VE
Compared with the previous elimination ordering, steps 3 and 4 disappear, since I and H need not be eliminated. By not eliminating I we avoid the step that correlates G and S.

Complexity of VE: Simple Analysis
Suppose there are n random variables and m initial factors; we have m = n in a BN, while in an MN we may have more factors than variables. VE picks a variable X_i and multiplies all factors involving that variable; the result is a single factor ψ_i. If N_i is the number of entries in ψ_i and N_max = max_i N_i, the overall amount of work required is O(m N_max). The unavoidable exponential blowup lies in the size of the factors ψ_i, which is exponential in the number of variables in their scope.

Complexity: Graph-Theoretic Analysis
VE can be viewed as operating on an undirected graph with factors Φ. If P is the distribution defined by multiplying the factors in Φ, with X = Scope[Φ], then
P(X) = (1/Z) Π_{ϕ∈Φ} ϕ, where Z = Σ_X Π_{ϕ∈Φ} ϕ
For a BN, the undirected graph on which VE operates is precisely the moralized BN.

Factor Reduction: Reduced Gibbs
Given a factor ψ(A,B,C) and the context C = c^1, the value of C determines the reduced factor τ(A,B) = ψ(A,B,c^1).
[Figures: the graph over A, B, C before and after reducing to C = c^1; the initial set of Student-network factors reduced to the context G = g, and to the context G = g, S = s.]

VE as Graph Transformation
When a variable X is eliminated from Φ, all remaining neighbors of X become connected: fill edges are introduced into the graph.
[Figures: after eliminating C and after eliminating D there are no fill edges; after eliminating I the fill edge G-S is added.]

Induced Graph
The induced graph is the union of all graphs generated by VE. The scope of every factor generated during VE is a clique in it, and every maximal clique is the scope of some intermediate factor. The width of the induced graph is the number of nodes in its largest clique minus 1. The minimal induced width over all orderings gives a bound on the best achievable VE performance; a sketch for computing the width of a given ordering appears below.
[Figures: the induced graph produced by VE under the elimination ordering used above, its maximal cliques, and the corresponding clique tree.]
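Given an undirected (e.g., moralized) graph and an elimination ordering, the induced width can be computed by simulating the fill-edge process. A sketch; the adjacency-dict representation and function name are assumptions.

```python
def induced_width(adjacency, ordering):
    """Simulate VE on an undirected graph: connect each eliminated node's remaining
    neighbors (fill edges) and track the largest clique formed.
    adjacency: dict var -> set of neighboring vars."""
    adj = {v: set(ns) for v, ns in adjacency.items()}
    width = 0
    for var in ordering:
        neighbors = adj[var]
        width = max(width, len(neighbors))     # clique = var plus its current neighbors
        for u in neighbors:                    # add fill edges among the neighbors
            for w in neighbors:
                if u != w:
                    adj[u].add(w)
        for u in neighbors:                    # remove var from the graph
            adj[u].discard(var)
        del adj[var]
    return width                               # = largest clique size minus 1

# moralized Student network (edges per the factor scopes above) and one ordering:
student = {'C': {'D'}, 'D': {'C', 'G', 'I'}, 'I': {'D', 'G', 'S'},
           'G': {'D', 'I', 'L', 'H', 'J'}, 'S': {'I', 'L', 'J'},
           'L': {'G', 'S', 'J'}, 'J': {'L', 'S', 'G', 'H'}, 'H': {'G', 'J'}}
width = induced_width(student, ['C', 'D', 'I', 'H', 'G', 'S', 'L'])
```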

Finding Elimination Orderings
Two approaches: max-cardinality search and greedy search.
Max-cardinality search exploits the fact that induced graphs are chordal: every minimal loop has length 3. For example, the loop G–L–J–H is cut by the chord G–J.

Max-Cardinality Search
Procedure Max-Cardinality(H // an undirected graph over χ)
Example: select S first; the next node is a neighbor, say J; the nodes with the largest number of marked neighbors are then H and I.
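A sketch of max-cardinality search as typically described (repeatedly mark the unmarked node with the most marked neighbors, then eliminate in reverse marking order); the function name and graph representation are assumptions.

```python
def max_cardinality_ordering(adjacency):
    """Max-cardinality search on an undirected graph.
    adjacency: dict var -> set of neighboring vars."""
    marked = []
    unmarked = set(adjacency)
    while unmarked:
        # pick the unmarked node with the largest number of already-marked neighbors
        best = max(unmarked, key=lambda v: len(adjacency[v] & set(marked)))
        marked.append(best)
        unmarked.remove(best)
    return list(reversed(marked))   # eliminate in reverse of the marking order
```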

Greedy Search
Procedure Greedy-Ordering(H // an undirected graph over χ, s // an evaluation metric)
Evaluation metrics s(H,x): min-neighbors, min-weight, min-fill, weighted min-fill.
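A sketch of greedy ordering with the min-fill metric (the other metrics plug into the same loop); the function names and graph representation are assumptions.

```python
def fill_count(adj, var):
    """Number of fill edges that eliminating var would introduce (min-fill score)."""
    nbrs = list(adj[var])
    return sum(1 for i in range(len(nbrs)) for j in range(i + 1, len(nbrs))
               if nbrs[j] not in adj[nbrs[i]])

def greedy_ordering(adjacency, score=fill_count):
    """Greedy elimination ordering: repeatedly eliminate the node with the lowest
    score, adding fill edges as VE would."""
    adj = {v: set(ns) for v, ns in adjacency.items()}
    ordering = []
    while adj:
        var = min(adj, key=lambda v: score(adj, v))
        for u in adj[var]:                 # fill edges among var's neighbors
            for w in adj[var]:
                if u != w:
                    adj[u].add(w)
        for u in adj[var]:                 # remove var from the graph
            adj[u].discard(var)
        ordering.append(var)
        del adj[var]
    return ordering
```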

Comparison of VE Orderings
Different heuristics for variable orderings were compared. Test data: 8 standard BNs ranging from 8 to 1,000 nodes. Methods: simulated annealing, a BN software package, and the four heuristics.

Comparison of VE Variable Ordering Algorithms
Evaluation metrics s(H,x): min-neighbors, min-weight, min-fill, weighted min-fill. For large networks it is worthwhile to run several heuristic algorithms and pick the best ordering found.

Two Simple Inference Cases
1. Bayes theorem as inference
2. Inference on a chain

1. Bayes Theorem as Inference
Consider a joint distribution p(x,y) over two variables x and y, factored as p(x,y) = p(x)p(y|x) and represented as a directed graph (a). We are given the CPDs p(x) and p(y|x). If we observe the value of y, as in (b), we can view the marginal p(x) as a prior over the latent variable x. Analogy to a two-class classifier: the class x ∈ {0,1} and the feature y is continuous; we wish to infer the posterior distribution p(x|y).

Inferring the Posterior using Bayes
Using the sum and product rules, we can evaluate the marginal p(y) = Σ_{x'} p(y|x') p(x'), which requires evaluating a summation. This is then used in Bayes rule to calculate p(x|y) = p(y|x) p(x) / p(y).
Observations: the joint is now expressed as p(x,y) = p(y)p(x|y), which is shown in (c). Thus, knowing the value of y, we know the distribution of x.
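A two-class numerical sketch of this inference step; the prior and likelihood values below are made up for illustration.

```python
import numpy as np

p_x = np.array([0.7, 0.3])             # prior p(x) over the two classes
p_y_given_x = np.array([0.2, 0.9])     # p(y = observed | x) for each class (invented)

p_y = np.sum(p_y_given_x * p_x)        # marginal p(y) = sum_x p(y|x) p(x)
p_x_given_y = p_y_given_x * p_x / p_y  # Bayes rule: posterior p(x|y)
```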

2. Inference on a Chain
[Figure: a chain of nodes x_1, ..., x_{N-1}, x_N.]
Graphs of this form are known as Markov chains. Example: N = 365 days and x is the weather (cloudy, rainy, snowy, ...). The analysis is more complex than the previous case. Here the directed and undirected graphs are exactly the same, since there is only one parent per node and no additional links are needed. The joint distribution has the form
p(x) = (1/Z) ψ_{1,2}(x_1, x_2) ψ_{2,3}(x_2, x_3) ... ψ_{N-1,N}(x_{N-1}, x_N),
a product of potential functions over pairwise cliques. For the specific case of N discrete variables, the potential functions are K × K tables and the joint distribution has (N−1)K² parameters.

Inferring the Marginal of a Node
We wish to evaluate the marginal distribution p(x_n) for a specific node x_n part way along the chain (e.g., what is the weather on November 11?). As yet there are no observed nodes. By application of the sum rule, the required marginal is obtained by summing the joint distribution over all variables except x_n:
p(x_n) = Σ_{x_1} ... Σ_{x_{n-1}} Σ_{x_{n+1}} ... Σ_{x_N} p(x)

Naïve Evaluation of the Marginal
1. Evaluate the joint distribution
2. Perform the summations explicitly
The joint can be expressed as a set of numbers, one for each value of x. With N variables of K states each, there are K^N values for x:
p(x_n) = Σ_{x_1} ... Σ_{x_{n-1}} Σ_{x_{n+1}} ... Σ_{x_N} p(x) = Σ_{x_1} ... Σ_{x_{n-1}} Σ_{x_{n+1}} ... Σ_{x_N} (1/Z) ψ_{1,2}(x_1,x_2) ψ_{2,3}(x_2,x_3) ... ψ_{N-1,N}(x_{N-1},x_N)
Evaluation of both the joint and the marginal is exponential in the chain length N, and is impossible with K = 10 and N = 365.

Efficient Evaluation
p(x_n) = Σ_{x_1} ... Σ_{x_{n-1}} Σ_{x_{n+1}} ... Σ_{x_N} (1/Z) ψ_{1,2}(x_1,x_2) ψ_{2,3}(x_2,x_3) ... ψ_{N-1,N}(x_{N-1},x_N)
We are adding up a large number of products. But multiplication is distributive over addition: ab + ac = a(b+c), where the LHS involves 3 arithmetic operations and the RHS only 2. So we perform the summations first and then the products: the sum of products is evaluated sums-first.

Efficient Evaluation: Exploiting Conditional Independence Properties
Rearrange the order of the summations and multiplications to allow the marginal to be evaluated more efficiently. Consider the summation over x_N: the potential ψ_{N-1,N}(x_{N-1}, x_N) is the only one that depends on x_N, so we can perform Σ_{x_N} ψ_{N-1,N}(x_{N-1}, x_N) first to give a function of x_{N-1}. We then use this to perform the summation over x_{N-1}. Each summation removes a variable from the distribution, or equivalently a node from the graph.

Marginal Expression
Group potentials and summations together to give the marginal:
p(x_n) = (1/Z) [ Σ_{x_{n-1}} ψ_{n-1,n}(x_{n-1},x_n) ... [ Σ_{x_2} ψ_{2,3}(x_2,x_3) [ Σ_{x_1} ψ_{1,2}(x_1,x_2) ] ] ... ] × [ Σ_{x_{n+1}} ψ_{n,n+1}(x_n,x_{n+1}) ... [ Σ_{x_N} ψ_{N-1,N}(x_{N-1},x_N) ] ... ]
The first bracketed term is µ_α(x_n) and the second is µ_β(x_n).
Key concept: multiplication is distributive over addition, ab + ac = a(b+c); the LHS involves 3 arithmetic operations, the RHS only 2.

Computational Cost
Evaluating the marginal using the reordered expression requires N−1 summations, each over K states and each involving a function of 2 variables. For example, the summation over x_1 involves only ψ_{1,2}(x_1,x_2), a table of K × K numbers; summing this table over x_1 for each x_2 costs O(K²). The total cost is O(NK²): linear in the chain length, versus the exponential cost of the naïve approach. We are able to exploit the conditional independence properties of this simple graph.

Interpretation as Message Passing
The calculation can be viewed as message passing in the graph. The expression for the marginal decomposes into
p(x_n) = (1/Z) µ_α(x_n) µ_β(x_n)
Interpretation: the message passed forwards along the chain from node x_{n-1} to x_n is µ_α(x_n); the message passed backwards from node x_{n+1} to x_n is µ_β(x_n). Each message comprises K values, one for each choice of x_n.

Recursive Evaluation of Messages
The forward message can be evaluated recursively:
µ_α(x_n) = Σ_{x_{n-1}} ψ_{n-1,n}(x_{n-1}, x_n) [ Σ_{x_{n-2}} ψ_{n-2,n-1}(x_{n-2}, x_{n-1}) ... ] = Σ_{x_{n-1}} ψ_{n-1,n}(x_{n-1}, x_n) µ_α(x_{n-1})   (1)
Therefore we first evaluate µ_α(x_2) = Σ_{x_1} ψ_{1,2}(x_1, x_2) and then apply (1) repeatedly until we reach the desired node. Note that the outgoing message µ_α(x_n) in (1) is obtained by multiplying the incoming message µ_α(x_{n-1}) by the local potential involving the node variable and the outgoing variable, and summing over the node variable.

Recursive Message Passing
Similarly the message µ_β(x_n) can be evaluated recursively, starting from node x_N:
µ_β(x_n) = Σ_{x_{n+1}} ψ_{n,n+1}(x_n, x_{n+1}) [ Σ_{x_{n+2}} ψ_{n+1,n+2}(x_{n+1}, x_{n+2}) ... ] = Σ_{x_{n+1}} ψ_{n,n+1}(x_n, x_{n+1}) µ_β(x_{n+1})
These message passing equations are known as the Chapman-Kolmogorov equations for Markov processes. The normalization constant Z is easily evaluated by summing µ_α(x_n) µ_β(x_n) over all states of x_n, an O(K) computation.

Evaluating Marginals for Every Node
Suppose we want to evaluate p(x_n) for every node n = 1,...,N. Simply applying the above procedure for each node is O(N²K²) and computationally wasteful due to duplication: to find p(x_1) we need to propagate a message µ_β(·) from node x_N back to x_2, and to evaluate p(x_2) we need to propagate a message µ_β(·) from node x_N back to x_3. Instead, launch a message µ_β(x_{N-1}) starting from node x_N and propagate it back to x_1, and launch a message µ_α(x_2) starting from node x_1 and propagate it forward to x_N, storing all intermediate messages along the way. Then any node can evaluate its marginal by
p(x_n) = (1/Z) µ_α(x_n) µ_β(x_n)
The computational cost is only twice that of finding the marginal of a single node, rather than N times.
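A minimal numpy sketch of this forward/backward scheme on a chain of K-state variables; the function name and the representation of the pairwise potentials are assumptions.

```python
import numpy as np

def chain_marginals(psi):
    """Compute p(x_n) for every node of a chain via forward/backward messages.
    psi: list of N-1 arrays, psi[i][a, b] = potential between x_{i+1}=a and x_{i+2}=b."""
    N = len(psi) + 1
    K = psi[0].shape[0]
    # forward messages: mu_alpha[n] corresponds to mu_alpha(x_{n+1})
    mu_alpha = [np.ones(K)]
    for i in range(N - 1):
        mu_alpha.append(psi[i].T @ mu_alpha[-1])   # mu_a(x_n) = sum_{x_{n-1}} psi * mu_a(x_{n-1})
    # backward messages: mu_beta[n] corresponds to mu_beta(x_{n+1})
    mu_beta = [np.ones(K) for _ in range(N)]
    for i in range(N - 2, -1, -1):
        mu_beta[i] = psi[i] @ mu_beta[i + 1]       # mu_b(x_n) = sum_{x_{n+1}} psi * mu_b(x_{n+1})
    marginals = []
    for n in range(N):
        unnorm = mu_alpha[n] * mu_beta[n]
        marginals.append(unnorm / unnorm.sum())    # dividing by Z = sum of the product
    return marginals

# e.g. random 3-state potentials on a 5-node chain:
# psi = [np.random.rand(3, 3) for _ in range(4)]
# chain_marginals(psi)
```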

Joint Distribution of Neighbors
We may wish to calculate the joint distribution p(x_{n-1}, x_n) for neighboring nodes. Similar to the previous computation, the required joint distribution can be written as
p(x_{n-1}, x_n) = (1/Z) µ_α(x_{n-1}) ψ_{n-1,n}(x_{n-1}, x_n) µ_β(x_n)
It is obtained directly once the message passing for the marginals is completed. This is a useful result if we wish to use parametric forms for the conditional distributions.

Tree-Structured Graphs
Local message passing can be performed efficiently on trees, and can be generalized to give the sum-product algorithm. An undirected tree is a graph with only one path between any pair of nodes; such graphs have no loops. In directed graphs, a tree has a single node with no parents, called the root; converting a directed tree to an undirected graph adds no moralization links, since every node has only one parent. In a polytree, a directed graph may have nodes with more than one parent, but there is still only one path between any two nodes (ignoring arrow direction); here moralization will add links.
[Figures: an undirected tree, a directed tree, and a directed polytree.]