
Inference: Exploiting Local Structure
Daphne Koller, Stanford University, CS228 Handout #4

We have seen that BN inference exploits the network structure, in particular the conditional independencies and the locality of influence (small number of long-range paths). But when we discussed representation, we also allowed for the representation of finer-grained structure within the CPDs. It turns out that we can exploit that structure as well.

1 Causal independence

The earliest and simplest instance of exploiting local structure was for CPDs that have independence of causal influence, such as noisy-or. Consider the network in Figure 1, representing the node X and its family, and assume that all of the variables are binary. Assume that we want to get a factor over X. The operations required to do that (assuming we do them as efficiently as possible) are:

- 4 multiplications for P(Y1) * P(Y2)
- 8 multiplications for P(Y1, Y2) * P(Y3)
- 16 multiplications for P(Y1, Y2, Y3) * P(Y4)
- 32 multiplications for P(Y1, Y2, Y3, Y4) * P(X | Y1, Y2, Y3, Y4)

The total is 60 multiplications, plus another 30 additions to sum out Y1, ..., Y4 (to reduce the resulting factor of size 32 back to a factor of size 2).

If we exploit causal independence and split up the "family" of X, we can do a lot better. Recall that we could split up each noisy-or node into a deterministic OR of independent noise nodes Z1, ..., Z4. Thus, the network above can be decomposed into the network shown in Figure 1. This, by itself, is not very helpful: the factor P(X | Z1, Z2, Z3, Z4) is still of size 32 if we represent it as a full factor, so we haven't saved anything.

Figure 1: A very simple BN, and the noisy-or decomposition of the family.
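To make the bookkeeping above concrete, here is a minimal sketch (not part of the original handout) that tallies the table sizes produced by the naive left-to-right products; multiplying a factor with a new one costs one multiplication per entry of the resulting table, and all variables are assumed binary.

    # Hypothetical tally of the naive elimination cost for the family of X.
    product_costs = []
    scope_size = 1                             # start with P(Y1): scope {Y1}
    for _ in range(3):                         # fold in P(Y2), P(Y3), P(Y4)
        scope_size += 1
        product_costs.append(2 ** scope_size)  # 4, 8, 16 multiplications
    scope_size += 1                            # fold in P(X | Y1..Y4), which adds X
    product_costs.append(2 ** scope_size)      # 32 multiplications

    multiplications = sum(product_costs)       # 60
    additions = 2 ** scope_size - 2            # 30: reduce 32 entries down to 2
    print(multiplications, additions)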

Figure 2: Two possible cascaded decompositions for a noisy-or node.

The key idea is that the deterministic OR node can be decomposed into various cascades of smaller deterministic OR nodes. There are two main ways to do the split, shown in Figure 2. In the first decomposition, the optimal ordering for variable elimination is Y1, O1, Y2, O2, Y3, O3, Y4. The cost is 4 multiplications and 2 additions to eliminate each Yi (multiplying it with the factor for Oi and summing out), and 8 multiplications and 4 additions to eliminate each Oi. The total cost is 4 + 8 + 4 + 8 + 4 + 8 + 4 = 40 multiplications and 2 + 4 + 2 + 4 + 2 + 4 + 2 = 20 additions. In the second decomposition, an optimal elimination ordering is Y1, Y2, Y3, Y4, O1, O2. We need 8 multiplications and 4 additions to eliminate each of Y1, Y3, and O1; we need 4 multiplications and 2 additions to eliminate each of Y2, Y4, and O2. The total is 36 multiplications and 18 additions.

We can generalize this phenomenon to more complex networks and other types of CPDs that have independence of causal influence. We take a node whose CPD has independence of causal influence, and generate its decomposition into a set of independent noise models and a deterministic function. We then cascade the computation of the deterministic function into a sequence of smaller steps. (This can't always be done, but it can be done in most practical cases.) We do this for every node of this type in the network, and let the variable elimination (or clique tree) algorithm find a good elimination ordering in the decomposed network with the smaller factors. As we saw, the complexity of inference can go down substantially if we have smaller CPDs.
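As an illustration (not from the handout; the priors and noise parameters below are made up), the following sketch eliminates the Y's through one possible chain-style cascade of pairwise ORs using numpy.einsum. No factor of size 32 is ever built; the largest intermediate table has 8 entries, yet the result is exactly P(X).

    import numpy as np

    pY = [np.array([0.9, 0.1]) for _ in range(4)]     # hypothetical P(Y_i)
    q  = [0.2, 0.3, 0.4, 0.5]                         # hypothetical P(Z_i = 0 | Y_i = 1)

    def z_cpd(qi):
        # P(Z_i | Y_i): Z_i = 0 for sure if Y_i = 0; Z_i = 1 w.p. 1 - q_i if Y_i = 1.
        return np.array([[1.0, 0.0],
                         [qi,  1.0 - qi]])            # indexed [y, z]

    OR = np.zeros((2, 2, 2))                          # deterministic OR: [in1, in2, out]
    for a in (0, 1):
        for b in (0, 1):
            OR[a, b, a | b] = 1.0

    # Eliminate each Y_i locally, then fold the Z_i's into the running OR.
    pZ = [np.einsum('y,yz->z', pY[i], z_cpd(q[i])) for i in range(4)]
    pO = pZ[0]
    for i in (1, 2, 3):
        # Largest table touched here has 2^3 = 8 entries, never 2^5 = 32.
        pO = np.einsum('a,b,abo->o', pO, pZ[i], OR)
    print("P(X = 0), P(X = 1):", pO)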

2 Context-specific independence

It turns out that we can use a construction that is very similar in spirit, although different in detail, to decompose CPD-trees into network fragments involving smaller families. Thus, the same benefits for inference also arise in this case.

In order to understand the transformation, first consider a generic node X in a Bayesian network. Let A be one of X's parents, and let B1, ..., Bk be the remaining parents. Assume, for simplicity, that X and A are both binary-valued. Intuitively, we can view the value of the random variable X as the outcome of two conditional variables: the value that X would take if A were true, and the value that X would take if A were false. We can conduct a thought experiment where these two variables are decided separately, and then, when the value of A is revealed, the appropriate value for X is chosen. Formally, we define a random variable X_a, with a conditional distribution that depends only on B1, ..., Bk:

    P(X_a | B1, ..., Bk) = P(X | a, B1, ..., Bk).

We can similarly define a variable X_a'. The variable X is equal to X_a if A = a and is equal to X_a' if A = a'. Note that the variables X_a and X_a' both have the same set of values as X. This perspective allows us to replace the node X in any network with the subnetwork illustrated in Figure 3. The node X becomes a deterministic node, which we call a multiplexer node, since it takes either the value of X_a or of X_a', depending on the value of A.

Figure 3: A simple decomposition of the node X; a more effective decomposition of X, utilizing CSI.

For a generic node, this decomposition is not particularly useful. For one thing, the total size of the two new CPDs is exactly the same as the size of the original CPD for X; for another, the resulting structure (with its many tightly coupled cycles) does not admit a more effective decomposition into cliques. However, if X exhibits a significant amount of CSI, this type of transformation can result in a far more compact representation. For example, let k = 4, and assume that X depends only on B1 and B2 when A is true, and only on B3 and B4 when A is false. Then each of X_a and X_a' will have only two parents, as in Figure 3. If these variables are binary, the new representation requires two CPDs with four entries each, plus a single deterministic multiplexer node with 8 (predetermined) "distributions". By contrast, the original representation of X had a single CPD with 32 entries. Furthermore, the structure of the resulting network may well allow the construction of a join tree with much smaller cliques.
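The parameter counting in this example can be checked with a small sketch (hypothetical numbers, not from the handout): under the CSI assumption, two 4-entry conditional tables plus a deterministic multiplexer reproduce every row of the 32-row CPD P(X | A, B1, ..., B4).

    import itertools
    import numpy as np

    rng = np.random.default_rng(0)
    p_x_given_a  = rng.random((2, 2))   # P(X = 1 | B1, B2), used when A = 1 (4 entries)
    p_x_given_a0 = rng.random((2, 2))   # P(X = 1 | B3, B4), used when A = 0 (4 entries)

    def multiplexer(a, b1, b2, b3, b4):
        # P(X = 1 | A, B1..B4): copy X_a's distribution if A = 1, else X_a''s.
        return p_x_given_a[b1, b2] if a == 1 else p_x_given_a0[b3, b4]

    full_cpd = {parents: multiplexer(*parents)
                for parents in itertools.product([0, 1], repeat=5)}
    print(len(full_cpd), "rows of P(X = 1 | A, B1..B4) from",
          p_x_given_a.size + p_x_given_a0.size, "stored parameters")   # 32 rows from 8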

Figure 4: An example CPD-tree, and a decomposition of the CPD-tree.

We can extend this transformation to arbitrary CPD-trees by applying the decomposition recursively. Essentially, each node is first decomposed according to the parent which is at the root of its CPD-tree. Each of the conditional nodes (X_a and X_a' in the binary case) has, as its CPD, one of the subtrees of the root in the CPD-tree for X (X_a will have the subtree corresponding to the a child of the root). The resulting conditional nodes can be decomposed recursively, in a similar fashion. Consider, for example, the CPD-tree of Figure 4. There, the node corresponding to X_a' can be decomposed into X_{a',b} and X_{a',b'}. The node X_{a',b'} can then be decomposed into X_{a',b',c} and X_{a',b',c'}. The resulting network fragment is shown in Figure 4.

It is clear that this procedure is beneficial only if there is structure in the CPD of a node. Thus, in general, we want to stop the decomposition when the CPD of a node is a full tree. (Note that this includes leaves as a special case.) As in the structural transformation for noisy-or nodes, this decomposition can allow variable elimination or clustering algorithms to form smaller clusters. After the transformation, there are many more nodes in the network, but each generally has far fewer parents, as shown in our example.

Note that the graphical structure of our transformed BN cannot capture all independencies implicit in the CPDs. In particular, none of the CSI relations induced by particular value assignments can be read from the transformed structure. These CSI relations are captured by the deterministic relationships used in the transformation: in a multiplexer node, the value depends only on one parent once the value of the "selecting" parent (the original variable A) is known.

3 Continuous variables

As we discussed, we can represent hybrid BNs with discrete and continuous variables. Can we do inference in such networks? For compact representations of discrete CPDs, this was not an issue: we could always resort to generating explicit CPTs and using standard inference algorithms. For continuous variables, the issue is not so simple.

At some level, the principle of the computation is unchanged. The joint distribution is represented as a product of density functions, each of which depends only on a few variables. We can marginalize out the irrelevant variables by integrating them out, instead of summing them out. Therefore, we simply replace summation with integration, and everything continues to hold. In particular, we can take irrelevant factors out of the integral, in exactly the same way as we did for summations. Indeed, the entire formulation of the clique tree algorithm was based only on qualitative properties of conditional independence, so it continues to hold without change.

The main problem is that the "factors" that we now have to deal with (the CPDs, the intermediate results, the messages) are now continuous or hybrid (continuous/discrete) functions, and it is not clear how these functions can be represented compactly. Unfortunately, it turns out that very few families of continuous functions retain a nice compact closed form when subjected to multiplication, marginalization, and conditioning. Rather, as we carry out the inference process (multiplying these functions, eliminating variables), the functions become more and more complicated, usually complex enough that we cannot represent them in closed form. The main exception is Gaussians. The product of two Gaussians is a Gaussian.
Marginalizing out some of the dimensions of a Gaussian results in a Gaussian over the remaining dimensions. Even conditioning works. In fact, we can do all of these operations in closed form, maintaining only the means and covariance matrices of the multivariate Gaussians. Thus, for networks consisting entirely of linear Gaussian relationships, our clique tree algorithm (or variable elimination) continues to work.
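For concreteness, here is a hedged numpy sketch (arbitrary numbers, not from the handout) of two of the closed-form operations just mentioned: marginalizing and conditioning a multivariate Gaussian using only its mean and covariance blocks.

    import numpy as np

    mu = np.array([1.0, -2.0, 0.5])                      # hypothetical mean
    Sigma = np.array([[2.0, 0.6, 0.3],
                      [0.6, 1.5, 0.4],
                      [0.3, 0.4, 1.0]])                  # hypothetical covariance

    keep, drop = [0, 1], [2]                             # kept block vs. removed/observed block
    mu1, mu2 = mu[keep], mu[drop]
    S11 = Sigma[np.ix_(keep, keep)]
    S12 = Sigma[np.ix_(keep, drop)]
    S22 = Sigma[np.ix_(drop, drop)]

    # Marginalizing out x[drop]: just read off the corresponding blocks.
    marg_mean, marg_cov = mu1, S11

    # Conditioning on x[drop] = v: standard Gaussian conditioning formulas.
    v = np.array([1.2])
    gain = S12 @ np.linalg.inv(S22)
    cond_mean = mu1 + gain @ (v - mu2)
    cond_cov = S11 - gain @ S12.T

    print(marg_mean, marg_cov, cond_mean, cond_cov, sep="\n")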

Sadly, that is not very interesting, since there is a well-known simple algorithm that works for purely Gaussian networks: since a Gaussian BN is simply a multivariate Gaussian, the entire joint distribution has a nice compact representation using a mean vector and a covariance matrix. One easy and fairly efficient way to do inference in a Gaussian network is to generate this representation of the joint and then marginalize and condition it directly.

However, the clique tree algorithm is useful in another class of networks: one where all CPDs are either discrete or conditional linear Gaussian. It turns out that for conditional Gaussian networks, an extension of the clique tree algorithm works, in some sense. To understand the sense in which this algorithm works, consider the network shown in Figure 5. The marginal distribution over X4 is a mixture of univariate Gaussians, with one mixture component for every assignment of values to A1, A2, A3, A4. In general, if we continue this construction further, we can see that the number of mixture components for Xn would grow exponentially with n. Thus, we cannot represent the distribution in finite closed form.

Figure 5: A simple conditional linear Gaussian network. The Ai's are discrete and the Xi's continuous.

However, it turns out that we can use a trick where we collapse the mixture components into a single Gaussian. In other words, as we are doing variable elimination, if we end up with a distribution over some set of continuous variables which is a mixture, we approximate it with a single Gaussian over that set, and continue our inference process with that simpler density. If we are very careful about the elimination ordering (we eliminate all continuous variables before we eliminate any of the discrete ones), we can show that the resulting distribution has the right mean and covariance. Note that it is a very poor approximation overall: it is unimodal instead of multimodal. However, if our only goal is to compute the mean and variance of one of the continuous random variables, this algorithm can do that exactly.
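A hedged sketch of the collapsing trick for a single continuous variable (weights, means, and variances below are hypothetical): matching moments gives a single Gaussian whose mean and variance equal those of the mixture exactly, even though the collapsed density is unimodal.

    import numpy as np

    w = np.array([0.5, 0.3, 0.2])        # one mixture weight per discrete assignment
    m = np.array([-1.0, 2.0, 5.0])       # component means
    v = np.array([0.5, 1.0, 2.0])        # component variances

    mean = np.sum(w * m)
    variance = np.sum(w * (v + m ** 2)) - mean ** 2   # E[X^2] - (E[X])^2

    print("collapsed Gaussian:", mean, variance)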