13 : Variational Inference: Loopy Belief Propagation and Mean Field
10-708: Probabilistic Graphical Models (Spring)
Lecturer: Eric P. Xing        Scribes: Peter Schulam and William Wang

1 Introduction

Inference problems involve answering queries that concern the likelihood of observed data. For example, to answer a query for a marginal p(x_A), we can perform the marginalization operation $\sum_{x_{C \setminus A}} p(x)$ over the remaining variables. For queries about conditionals, such as $p(x_A \mid x_B)$, we can first compute the joint and then divide by the marginal $p(x_B)$. Sometimes answering a query also requires computing the mode of the density, $\hat{x} = \arg\max_{x \in \mathcal{X}^m} p(x)$.

So far in the class we have covered exact inference. Brute-force search is too inefficient for large graphs with complex structure, so a family of message-passing algorithms was introduced: forward-backward, sum-product, max-product, and the junction tree algorithm. Although these message-passing exact inference algorithms work well for tree-structured graphical models, it was also shown in class that they might not yield consistent results on loopy graphs, and their convergence there is not guaranteed. Moreover, for complex graphical models such as the Ising model, we cannot run an exact inference algorithm like the junction tree algorithm, because it is computationally intractable. In this lecture, we look at two variational inference algorithms: loopy belief propagation (yww) and mean field approximation (pschulam).

2 Loopy Belief Propagation

The general idea of loopy belief propagation is that even though the graph contains loops and the messages might circulate indefinitely, we let the algorithm run anyway and hope for the best. In this section, we first review the basic belief propagation algorithm. Then we discuss an empirical study by Murphy et al. (1999) on the behavior of loopy belief propagation. Most importantly, we start from the notion of KL divergence and show how the LBP algorithm can be explained from the perspective of minimizing the Bethe free energy.

2.1 Belief Propagation: a Quick Review

The basic idea of belief propagation is very simple: to update the belief at a node, we combine the messages coming from its neighboring nodes with the target node's singleton potential. As a concrete example, consider Figure 1. In part (a) of the figure, to compute the message $M_{i \to j}(x_j)$ we collect the messages from all neighboring nodes $x_k$ to $x_i$, then multiply by the singleton and doubleton potentials involving $x_i$ and $x_j$:

$$M_{i \to j}(x_j) \propto \sum_{x_i} \Phi_{ij}(x_i, x_j)\, \Phi_i(x_i) \prod_{k \in N(i) \setminus j} M_{k \to i}(x_i) \qquad (1)$$
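As a concrete illustration of Equation 1, here is a minimal sketch in Python. It is not from the lecture notes: the potential tables, the incoming message values, and the function name send_message are all made up for illustration.

```python
import numpy as np

def send_message(Phi_ij, Phi_i, msgs_to_i):
    """Compute M_{i->j}(x_j) as in Eq. (1), up to normalization.

    Phi_ij:    K_i x K_j compatibility table Phi_ij(x_i, x_j).
    Phi_i:     length-K_i singleton potential (external evidence).
    msgs_to_i: list of messages M_{k->i} from the neighbors k != j.
    """
    prod = Phi_i.copy()
    for m in msgs_to_i:            # product over k in N(i) \ j
        prod = prod * m
    out = Phi_ij.T @ prod          # sum over x_i
    return out / out.sum()         # normalize for numerical stability

# Binary example: an attractive coupling and evidence favoring x_i = 0.
Phi_ij = np.array([[2.0, 1.0], [1.0, 2.0]])
Phi_i = np.array([0.7, 0.3])
print(send_message(Phi_ij, Phi_i, [np.array([0.6, 0.4])]))
```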
Figure 1: Belief propagation: an example.

Here the doubleton potential $\Phi_{ij}(x_i, x_j)$ is also called the compatibility, and is used to model the interaction of the two nodes, whereas the singleton potential $\Phi_i(x_i)$ is also called the external evidence. On the right-hand side (part b), we can update the belief of $x_i$ using a similar formulation:

$$b_i(x_i) \propto \Phi_i(x_i) \prod_{k \in N(i)} M_{k \to i}(x_i) \qquad (2)$$

Similarly, for factor graphs we also have the notion of messages, and we can update the belief of node $x_i$ by multiplying its factor with the messages coming from the neighboring factor nodes:

$$b_i(x_i) \propto f_i(x_i) \prod_{a \in N(i)} m_{a \to i}(x_i) \qquad (3)$$

To calculate the message from a factor node $X_a$ to $x_i$, we sum up all the products:

$$m_{a \to i}(x_i) = \sum_{X_a \setminus x_i} f_a(X_a) \prod_{j \in N(a) \setminus i} m_{j \to a}(x_j) \qquad (4)$$

From class, we know that running BP on trees always converges to the exact solution. This is not always the case for loopy graphs: when a message is sent into a loop structure, it might circulate indefinitely, so convergence is not guaranteed, and the algorithm might even converge to a wrong solution.

2.2 Loopy Belief Propagation Algorithm

The idea of the loopy belief propagation algorithm is to use a fixed-point iteration procedure that minimizes the Bethe free energy. Basically, as long as the convergence criterion is not met, we keep updating the messages and the beliefs:

$$b_i(x_i) \propto \prod_{a \in N(i)} m_{a \to i}(x_i) \qquad (5)$$

$$b_a(X_a) \propto f_a(X_a) \prod_{i \in N(a)} m_{i \to a}(x_i) \qquad (6)$$

$$m^{new}_{i \to a}(x_i) = \prod_{c \in N(i) \setminus a} m_{c \to i}(x_i) \qquad (7)$$

$$m^{new}_{a \to i}(x_i) = \sum_{X_a \setminus x_i} f_a(X_a) \prod_{j \in N(a) \setminus i} m_{j \to a}(x_j) \qquad (8)$$
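These updates can be sketched in a few lines for a purely pairwise model, where the factor-to-variable and variable-to-factor updates collapse into the single pairwise message update of Section 2.1. Below is a hedged illustration, not from the lecture, on a 3-node cycle (the smallest loopy graph) with made-up random potentials; synchronous sweeps are run until the messages stop changing.

```python
import numpy as np

np.random.seed(0)
K, edges = 2, [(0, 1), (1, 2), (0, 2)]                # a 3-cycle
Phi = {e: np.random.rand(K, K) + 0.5 for e in edges}  # pairwise potentials
phi = [np.random.rand(K) + 0.5 for _ in range(3)]     # singleton potentials
msgs = {(i, j): np.ones(K) / K                        # uniform initialization
        for e in edges for (i, j) in (e, e[::-1])}

for sweep in range(100):
    new = {}
    for (i, j) in msgs:
        prod = phi[i].copy()
        for (k, l) in msgs:               # incoming messages M_{k->i}, k != j
            if l == i and k != j:
                prod = prod * msgs[(k, l)]
        P = Phi[(i, j)] if (i, j) in Phi else Phi[(j, i)].T
        m = P.T @ prod                    # sum over x_i
        new[(i, j)] = m / m.sum()
    delta = max(np.abs(new[e] - msgs[e]).max() for e in msgs)
    msgs = new
    if delta < 1e-10:                     # messages have stopped changing
        break

b0 = phi[0] * msgs[(1, 0)] * msgs[(2, 0)]
print("LBP belief at node 0:", b0 / b0.sum(), "after", sweep + 1, "sweeps")
```

On a tree these sweeps would terminate with the exact marginals; on the cycle they produce the LBP approximation discussed next.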
Therefore, stationarity is guaranteed when the algorithm converges. The big problem, however, is that convergence itself is not guaranteed, and the reason is intuitive: when the BP algorithm runs on graphs that contain loops, the messages might circulate in the loops forever. Interestingly, Murphy et al. (UAI 1999) studied the empirical behavior of the loopy belief propagation algorithm and found that LBP can still achieve good approximations:

- The program is stopped after a fixed number of iterations.
- Alternatively, stop when there is no significant difference in the belief updates.
- When the solution converges, it is usually a good approximation.

This is probably the reason why LBP is still a very popular inference algorithm, even though its convergence is not guaranteed. It was also mentioned in class that, in order to test the empirical performance of an approximate inference algorithm on large intractable problems, one can always start simple by testing on a small instance of the problem (e.g., a 20 x 20 Ising model).

2.3 Understanding LBP: an F_Bethe Minimization Perspective

To understand the LBP algorithm, let us first define the true distribution P as:

$$P(X) = \frac{1}{Z} \prod_{f_a \in F} f_a(X_a) \qquad (9)$$

where Z is the partition function and F is the set of factors. Since this distribution is often intractable, we approximate P with a simpler distribution Q, using the KL divergence as the measure of fit:

$$KL(Q \| P) = \sum_X Q(X) \log \frac{Q(X)}{P(X)} \qquad (10)$$

Note that the KL divergence is asymmetric. Its value is non-negative, and it attains its minimum value of zero exactly when P = Q. This makes it useful for our problem: we can now fit Q to P by driving the KL divergence down. Expanding KL(Q || P) shows which of its parts actually require inference:

$$KL(Q \| P) = \sum_X Q(X) \log \frac{Q(X)}{P(X)} \qquad (11)$$
$$= \sum_X Q(X) \log Q(X) - \sum_X Q(X) \log P(X) \qquad (12)$$
$$= -H_Q(X) - E_Q[\log P(X)] \qquad (13)$$

If we replace P(X) with our earlier definition of the true distribution, we get:

$$KL(Q \| P) = -H_Q(X) - E_Q\Big[\log \frac{1}{Z} \prod_{f_a \in F} f_a(X_a)\Big] \qquad (14)$$
$$= -H_Q(X) - \log \frac{1}{Z} - \sum_{f_a \in F} E_Q[\log f_a(X_a)] \qquad (15)$$

Rearranging the terms on the right-hand side gives:

$$KL(Q \| P) = -H_Q(X) - \sum_{f_a \in F} E_Q[\log f_a(X_a)] + \log Z \qquad (16)$$

Physicists define the first two terms on the right-hand side as the (Gibbs) free energy F(P, Q), so that KL(Q || P) = F(P, Q) + log Z. Since log Z does not depend on Q, minimizing the KL divergence boils down to minimizing F(P, Q), which no longer requires computing Z.
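A quick numeric sanity check of Equation 16 on a toy two-variable model (all numbers here are made up for illustration):

```python
import numpy as np

f_a = np.array([[1.0, 2.0], [3.0, 1.0]])       # a single pairwise factor
Z = f_a.sum()                                  # partition function
P = f_a / Z                                    # true distribution, Eq. (9)
Q = np.outer([0.6, 0.4], [0.3, 0.7])           # an arbitrary product Q

KL = (Q * np.log(Q / P)).sum()
H_Q = -(Q * np.log(Q)).sum()                   # entropy of Q
F = -H_Q - (Q * np.log(f_a)).sum()             # Gibbs free energy F(P, Q)
print(KL, F + np.log(Z))                       # Eq. (16): identical values
```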
Our goal therefore boils down to computing and minimizing F(P, Q). The term $\sum_{f_a \in F} E_Q[\log f_a(X_a)]$ can be computed by summing over the marginals of Q, whereas computing the entropy $H_Q(X)$ is a much harder task that requires summing over all possible configurations, which is very expensive. However, we can approximate F(P, Q) with a tractable surrogate $\hat{F}(P, Q)$.

Before showing how to approximate the Gibbs free energy, let us first consider the case of tree graphical models, as in Figure 2.

Figure 2: Calculating the tree free energy: an example.

For a tree, the joint distribution can be written in terms of its doubleton and singleton beliefs as

$$b(x) = \prod_a b_a(x_a) \prod_i b_i(x_i)^{1 - d_i}$$

where $d_i$ is the degree of node i, and the tree entropy $H_{tree}$ and free energy $F_{tree}$ can be written as:

$$H_{tree} = -\sum_a \sum_{x_a} b_a(x_a) \log b_a(x_a) + \sum_i (d_i - 1) \sum_{x_i} b_i(x_i) \log b_i(x_i) \qquad (17)$$

$$F_{tree} = \sum_a \sum_{x_a} b_a(x_a) \log \frac{b_a(x_a)}{f_a(x_a)} + \sum_i (1 - d_i) \sum_{x_i} b_i(x_i) \log b_i(x_i) \qquad (18)$$
$$= F_{12} + \cdots + F_{67} + F_{78} - F_1 - F_5 - F_2 - F_6 - F_3 - F_7 \qquad (19)$$

From this derivation we see that we only need to sum over the singletons and doubletons, which is easy to compute. We can use the same idea to approximate the Gibbs free energy on a general graph, such as the one in Figure 3:

Figure 3: Calculating the loopy-graph Bethe energy: an example.

$$H_{Bethe} = -\sum_a \sum_{x_a} b_a(x_a) \log b_a(x_a) + \sum_i (d_i - 1) \sum_{x_i} b_i(x_i) \log b_i(x_i) \qquad (20)$$

$$F_{Bethe} = \sum_a \sum_{x_a} b_a(x_a) \log \frac{b_a(x_a)}{f_a(x_a)} + \sum_i (1 - d_i) \sum_{x_i} b_i(x_i) \log b_i(x_i) = U_{Bethe} - H_{Bethe} \qquad (21)$$
$$= F_{12} + \cdots + F_{67} + F_{78} - F_1 - F_5 - 2F_2 - 2F_6 - F_8 \qquad (22)$$

where $U_{Bethe} = -\sum_a \sum_{x_a} b_a(x_a) \log f_a(x_a)$ is the average energy.
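To emphasize that F_Bethe touches only doubleton and singleton beliefs, here is a hedged sketch of Equation 21 as code. The function name and inputs are illustrative, and in practice the beliefs would come from the LBP updates of Section 2.2.

```python
import numpy as np

def bethe_free_energy(pair_beliefs, pair_factors, node_beliefs, degrees):
    """Assemble Eq. (21) from doubleton and singleton beliefs."""
    F = 0.0
    for b_a, f_a in zip(pair_beliefs, pair_factors):
        F += (b_a * np.log(b_a / f_a)).sum()         # doubleton terms
    for b_i, d_i in zip(node_beliefs, degrees):
        F += (1 - d_i) * (b_i * np.log(b_i)).sum()   # singleton corrections
    return F

# A single-edge toy case: with d_i = 1 the singleton corrections vanish.
b_a = [np.array([[0.4, 0.1], [0.1, 0.4]])]
f_a = [np.array([[2.0, 1.0], [1.0, 2.0]])]
b_i = [np.array([0.5, 0.5]), np.array([0.5, 0.5])]
print(bethe_free_energy(b_a, f_a, b_i, degrees=[1, 1]))
```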
This is called the Bethe approximation of the Gibbs free energy. The idea is simple: we only sum over the singletons and doubletons to obtain the entropy term. We need to notice, however, that on loopy graphs this approximation might not be well connected to the true Gibbs free energy.

Now, to minimize the Bethe free energy subject to the normalization and marginalization constraints on the beliefs, we can write out the Lagrangian:

$$L = F_{Bethe} + \sum_i \gamma_i \Big\{1 - \sum_{x_i} b_i(x_i)\Big\} + \sum_a \sum_{i \in N(a)} \sum_{x_i} \lambda_{ai}(x_i) \Big\{b_i(x_i) - \sum_{X_a \setminus x_i} b_a(X_a)\Big\} \qquad (23)$$

To solve this, we take the partial derivatives and set them to zero ($\partial L / \partial b_i(x_i) = 0$ and $\partial L / \partial b_a(X_a) = 0$). Then we have:

$$b_i(x_i) \propto \exp\Big(\frac{1}{d_i - 1} \sum_{a \in N(i)} \lambda_{ai}(x_i)\Big) \qquad (24)$$

$$b_a(X_a) \propto \exp\Big(-E_a(X_a) + \sum_{i \in N(a)} \lambda_{ai}(x_i)\Big) \qquad (25)$$

where $E_a(X_a) = -\log f_a(X_a)$. Interestingly, if we let $\lambda_{ai}(x_i) = \log m_{i \to a}(x_i) = \log \prod_{b \in N(i) \setminus a} m_{b \to i}(x_i)$, and use the marginalization constraint $b_i(x_i) = \sum_{X_a \setminus x_i} b_a(X_a)$, then we obtain exactly the BP formulations in Equations 3 and 4. This is very attractive, because we have derived the message-passing algorithm from the perspective of minimizing the Bethe free energy. In general, variational methods can be summarized as:

$$q^* = \arg\min_{q \in S} F_{Bethe}(p, q) \qquad (26)$$

where the optimization over q is now a tractable problem. Note that we do not want to optimize q(x) directly. Instead, we focus on a relaxed feasible set and approximate the objective:

$$b^* = \arg\min_{b \in M_o} \hat{F}(b) \qquad (27)$$

where b collects the edge beliefs (doubletons) and the node beliefs (singletons). To solve for b, we typically use a fixed-point iteration algorithm.

3 Mean Field Approximation

Recall that the purpose of approximate inference methods is to allow us to compute the posterior distribution over a model's latent variables even when the posterior involves an intractable integral or summation. As a motivating example, we will look at a Bayesian mixture of Gaussians with known variance $b^2$. To review, a mixture of K Gaussians has the following generative story:

- $\theta \sim Dir(\alpha)$
- $\mu_k \sim N(0, a^2)$ for $k \in \{1, ..., K\}$
- For $i \in \{1, ..., n\}$:
  - $z_i \sim Mult(\theta)$
  - $x_i \sim N(\mu_{z_i}, b^2)$
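This generative story transcribes directly into a short ancestral-sampling snippet; the hyperparameter values alpha, a^2, b^2 and the sizes K, n below are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
K, n, alpha, a2, b2 = 3, 100, 1.0, 10.0, 1.0

theta = rng.dirichlet(alpha * np.ones(K))     # theta ~ Dir(alpha)
mu = rng.normal(0.0, np.sqrt(a2), size=K)     # mu_k ~ N(0, a^2)
z = rng.choice(K, size=n, p=theta)            # z_i ~ Mult(theta)
x = rng.normal(mu[z], np.sqrt(b2))            # x_i ~ N(mu_{z_i}, b^2)
print(theta.round(2), mu.round(2), x[:5].round(2))
```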
Suppose that we wanted to compute a posterior distribution over the cluster assignments $z_i$ and cluster mean vectors $\mu_k$. Writing $\mu = \{\mu_1, ..., \mu_K\}$, $z = \{z_1, ..., z_n\}$, and $x = \{x_1, ..., x_n\}$, this requires the following quantity:

$$p(\mu, z \mid x) = \frac{\prod_{k=1}^K p(\mu_k) \prod_{i=1}^n p(z_i)\, p(x_i \mid z_i, \mu)}{\int_\mu \prod_{k=1}^K p(\mu_k) \sum_z \prod_{i=1}^n p(z_i)\, p(x_i \mid z_i, \mu)} \qquad (28)$$

We can easily compute the numerator, but the denominator is intractable because it involves a summation over all configurations of the latent cluster variables z. If there are K clusters, then the number of configurations we would need to sum over is $K^n$. This is difficult by itself, and, in order to compute the denominator, we also need to compute the integral over all mean vectors $\mu$.

In the above posterior distribution, the denominator is difficult to compute because the latent variables are not easily factored. Note, in particular, that they are coupled in the conditional density of each data point. In general, when the latent variables are coupled, we must sum an exponentially large number of terms in order to compute the normalizing quantity of a posterior distribution. If, however, the latent variables could be easily factored, then we might be able to compute the normalizing term more easily.

Broadly, mean field variational inference is a technique for designing a new family of distributions over the latent variables that do factorize well, which can then be used to approximate the posterior distribution over the Gaussian mixture model parameters and latent variables shown above. In symbols, the mean field approximation assumes that the variational distribution over the latent variables factorizes:

$$q(z_1, ..., z_m) = \prod_{i=1}^m q(z_i; \nu_i) \qquad (29)$$

More generally, we do not need to assume that the joint distribution over the latent variables factorizes into a separate term for each variable. We can include broader families of variational distributions by instead assuming that the joint factorizes into independent distributions over clusters of the latent variables:

$$q(z_1, ..., z_m) = \prod_{C_i \in C} q(C_i; \nu_i) \qquad (30)$$

where C is some set of disjoint sets of the latent variables.
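To see the blow-up in Equation 28 in concrete numbers, and what the factorized family of Equation 29 stores instead, consider this small illustration (the sizes are arbitrary):

```python
import numpy as np

K, n = 3, 50
print(f"exact denominator: {K**n:.3e} assignments of z")  # ~7.2e23 terms
print(f"factorized q: {n * K} variational parameters")

# Eq. (29): one independent categorical q(z_i; nu_i) per latent variable.
nu = np.full((n, K), 1.0 / K)   # nu[i, k] = q(z_i = k), initialized uniform
```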
3.1 Variational Inference Objective Functions

Since we are approximating a distribution with our variational distribution q, a natural way to measure the quality of our approximation is the Kullback-Leibler (KL) divergence between the true density p and our approximation q:

$$KL(p \| q) = \sum_x p(x) \log \frac{p(x)}{q(x)} \qquad (31)$$

This objective, however, is problematic, since it requires pointwise evaluation of p(x), which is the problem we are trying to solve in the first place. An alternative is to reverse the direction of the KL divergence:

$$KL(q \| p) = \sum_x q(x) \log \frac{q(x)}{p(x)} \qquad (32)$$

Assuming that our approximation q(x) is tractable to compute, this objective is a slight improvement, but it still involves evaluating p(x) inside the logarithm. Note, however, that the unnormalized measure $\tilde{p}(x)$ can be written as $\tilde{p}(x) = p(x) Z$, where Z is the normalizing factor of the distribution. When p(x) is a posterior $p(x \mid D)$, the normalizing constant Z is p(D). Using this fact, we can define a new objective function J(q):

$$J(q) = KL(q \| \tilde{p}) \qquad (33)$$
$$= \sum_x q(x) \log \frac{q(x)}{\tilde{p}(x)} \qquad (34)$$
$$= \sum_x q(x) \log \frac{q(x)}{p(x) Z} \qquad (35)$$
$$= \sum_x q(x) \log \frac{q(x)}{p(x)} - \sum_x q(x) \log Z \qquad (36)$$
$$= \sum_x q(x) \log \frac{q(x)}{p(x)} - \log Z \qquad (37)$$
$$= KL(q \| p) - \log Z \qquad (38)$$

Since log Z = log p(D) is a constant, minimizing J(q) is equivalent to minimizing the KL divergence KL(q || p) between our approximation and the true distribution. Moreover, because KL(q || p) is non-negative, $J(q) \geq -\log p(D)$, so minimizing J(q) minimizes an upper bound on the negative log likelihood of the evidence. An alternative is to maximize -J(q), which is known as the energy functional.
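A quick numeric check of the chain in Equations 33-38, on a toy unnormalized distribution with made-up values:

```python
import numpy as np

p_tilde = np.array([1.0, 3.0, 6.0])              # unnormalized target
Z = p_tilde.sum()                                # normalizing constant
p = p_tilde / Z
q = np.array([0.2, 0.3, 0.5])                    # arbitrary variational q

J = (q * np.log(q / p_tilde)).sum()              # J(q) = KL(q || p~)
print(J, (q * np.log(q / p)).sum() - np.log(Z))  # Eq. (38): same value
```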
3.2 Interpretations of the Objective Function

We can rewrite our objective function J(q) as:

$$J(q) = E_q[\log q(x)] + E_q[-\log \tilde{p}(x)] \qquad (39)$$
$$= -H(q) + E_q[E(x)] \qquad (40)$$

where $E(x) = -\log \tilde{p}(x)$ is the energy. Intuitively, since we are minimizing J(q), breaking the objective down this way shows that the minimization is attempting to do two things. First, we want to minimize the negative entropy, i.e., increase the entropy, which, as we know from the maximum entropy principle, encourages distributions that generalize well: we do not want to make unwarranted assumptions about the distribution, and should in general choose the distribution that maximizes entropy. Second, we want to minimize the expected energy $E_q[-\log \tilde{p}(x)]$. Recall that the energy is lower where the probability is higher, so minimizing the expected energy pushes toward maximizing likelihood. At an intuitive level, the second term makes sure that our approximate distribution puts more mass on x with low energy under $\tilde{p}(x)$, and less mass on x with high energy.

Another interpretation of the objective function, writing $\tilde{p}(x) = p(x)\, p(D \mid x)$ for a posterior, is:

$$J(q) = E_q[\log q(x) - \log p(x)\, p(D \mid x)] \qquad (41)$$
$$= E_q[\log q(x) - \log p(x) - \log p(D \mid x)] \qquad (42)$$
$$= E_q[-\log p(D \mid x)] + KL(q \| p) \qquad (43)$$

where the KL term is now taken against the prior p(x). Again, breaking down the objective gives an intuitive feel for what we are minimizing. The first term is the expected negative log likelihood of the data conditioned on x, which prefers distributions that put more probability mass on x that increase the likelihood of our observed data. The second term penalizes distributions q that stray too far from the prior p.
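Both decompositions are easy to verify numerically. Here is a check of the entropy-energy form in Equations 39-40, reusing the same kind of made-up toy distribution as above:

```python
import numpy as np

p_tilde = np.array([1.0, 3.0, 6.0])   # unnormalized target
q = np.array([0.2, 0.3, 0.5])

J = (q * np.log(q / p_tilde)).sum()   # J(q) = KL(q || p~)
H = -(q * np.log(q)).sum()            # entropy of q
E = -np.log(p_tilde)                  # energy E(x) = -log p~(x)
print(J, -H + (q * E).sum())          # Eq. (40): the two agree
```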
3.3 Optimizing the Variational Distribution

Now that we understand the objective functions we want to minimize (or maximize, in the case of the energy functional), we can address the issue of actually finding the variational distribution that optimizes them. In what follows, we use the energy functional as the objective:

$$J(q) = \sum_x q(x) \log \frac{\tilde{p}(x)}{q(x)} \qquad (44)$$

Recall that we now wish to maximize this function. We will also use the simplest approximating distribution, which assumes that the joint density over all hidden variables x factors completely. That is,

$$q(x_1, ..., x_m) = \prod_{i=1}^m q_i(x_i; \nu_i) \qquad (45)$$

Our strategy will be to use coordinate ascent to maximize J(q) with respect to each $q_i$. Once we have derived the result that optimizes the energy functional with respect to a single $q_i$, it is relatively straightforward to extend the coordinate ascent updates to optimize the parameters $\nu_i$ of each local distribution $q_i$.

Let us first view J(q) as a function of one of the local distributions $q_i$, which we will write as $J(q_i)$. We can then rewrite the objective function:

$$J(q_i) = \sum_x q(x) \log \frac{\tilde{p}(x)}{q(x)} \qquad (46)$$
$$= \sum_x \Big(\prod_{j=1}^m q_j(x_j)\Big) \Big[\log \tilde{p}(x) - \sum_{j=1}^m \log q_j(x_j)\Big] \qquad (47)$$
$$= \sum_{x_i} q_i(x_i) \sum_{x_{-i}} \prod_{k \neq i} q_k(x_k) \Big[\log \tilde{p}(x) - \sum_{j \neq i} \log q_j(x_j) - \log q_i(x_i)\Big] \qquad (48)$$
$$= \sum_{x_i} q_i(x_i) \sum_{x_{-i}} \prod_{k \neq i} q_k(x_k) \log \tilde{p}(x) - \sum_{x_i} q_i(x_i) \sum_{x_{-i}} \prod_{k \neq i} q_k(x_k) \Big[\sum_{j \neq i} \log q_j(x_j) + \log q_i(x_i)\Big] \qquad (49)$$
$$= \sum_{x_i} q_i(x_i) \sum_{x_{-i}} \prod_{k \neq i} q_k(x_k) \log \tilde{p}(x) - \sum_{x_i} q_i(x_i) \sum_{x_{-i}} \prod_{k \neq i} q_k(x_k) \log q_i(x_i) + const \qquad (50)$$
$$= \sum_{x_i} q_i(x_i) \sum_{x_{-i}} \prod_{k \neq i} q_k(x_k) \log \tilde{p}(x) - \sum_{x_i} q_i(x_i) \log q_i(x_i) + const \qquad (51)$$
$$= \sum_{x_i} q_i(x_i)\, E_{q_{-i}}[\log \tilde{p}(x)] - \sum_{x_i} q_i(x_i) \log q_i(x_i) + const \qquad (52)$$

where the terms involving $\sum_{j \neq i} \log q_j(x_j)$ are constant with respect to $q_i$ and have been absorbed into const, and $E_{q_{-i}}$ denotes the expectation over all hidden variables except $x_i$. With this modified form, we can now define $E_{q_{-i}}[\log \tilde{p}(x)]$ to be the log of some function of $x_i$:

$$\log f_i(x_i) = E_{q_{-i}}[\log \tilde{p}(x)] \qquad (53)$$

which allows us to rewrite the final expression in our derivation above as:

$$\sum_{x_i} q_i(x_i) \log f_i(x_i) - \sum_{x_i} q_i(x_i) \log q_i(x_i) = -KL(q_i \| f_i) \qquad (54)$$

We then maximize our objective $J(q_i)$ by minimizing $KL(q_i \| f_i)$, which is clearly done by setting $q_i(x_i) = f_i(x_i)$. Thus:

$$\log f_i(x_i) = E_{q_{-i}}[\log \tilde{p}(x)] \qquad (55)$$
$$f_i(x_i) = \exp\big(E_{q_{-i}}[\log \tilde{p}(x)]\big) \qquad (56)$$

Therefore, the distribution for each $q_i$ that maximizes our objective function is:

$$q_i(x_i) = \frac{1}{Z_i} \exp\big(E_{q_{-i}}[\log \tilde{p}(x)]\big) \qquad (57)$$

where $Z_i$ is a normalizing constant that ensures $q_i$ is a proper distribution.
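Equation 57 turns directly into a coordinate ascent loop. Below is a minimal sketch, not from the lecture, for two discrete variables whose joint table is fully known (the table values are made up); here each expectation $E_{q_{-i}}[\log \tilde{p}(x)]$ is a single matrix-vector product.

```python
import numpy as np

log_p = np.log(np.array([[0.30, 0.10], [0.15, 0.45]]))   # joint p(x1, x2)
q1, q2 = np.array([0.5, 0.5]), np.array([0.5, 0.5])      # initial factors

for _ in range(50):
    f1 = np.exp(log_p @ q2)       # exp(E_{q2}[log p(x1, .)]), Eq. (56)
    q1 = f1 / f1.sum()            # Eq. (57): normalize to get q1
    f2 = np.exp(log_p.T @ q1)     # exp(E_{q1}[log p(., x2)])
    q2 = f2 / f2.sum()

print("mean field approximation:", np.outer(q1, q2).round(3))
```

Each update can only increase the energy functional, so alternating the updates converges to a local optimum of J(q).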
From this we can see that the approximate distribution over a particular hidden variable depends on the mean values of the rest of the hidden variables. In the expectation, we can drop all terms that do not involve $x_i$, which removes the means of all variables that are not neighbors of $x_i$. We see then that the distribution for a variable depends on the mean values of its neighbors. This is known as the mean field, which is where the name mean field approximation comes from.
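This closing observation is easiest to see on an Ising model $p(x) \propto \exp(J \sum_{(i,j)} x_i x_j + h \sum_i x_i)$ with $x_i \in \{-1, +1\}$, where the update of Equation 57 reduces to $m_i = \tanh(J \sum_{j \in N(i)} m_j + h)$: each variable's distribution is set from the current means of its neighbors. A small sketch on a toroidal grid, with illustrative values of J, h, and the grid size:

```python
import numpy as np

n, J, h = 10, 0.4, 0.1
mu = np.zeros((n, n))                 # m_i: mean of x_i under q_i

for _ in range(200):                  # synchronous updates for simplicity
    nbr = (np.roll(mu, 1, 0) + np.roll(mu, -1, 0) +
           np.roll(mu, 1, 1) + np.roll(mu, -1, 1))   # neighbor means (torus)
    mu = np.tanh(J * nbr + h)         # mean field update from neighbor means
print("average magnetization:", mu.mean().round(3))
```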