Massachusetts Institute of Technology
Department of Electrical Engineering and Computer Science
6.438 Algorithms For Inference
Fall 2014

Recitation 9: Loopy BP

General Comments

1. In terms of implementation, loopy BP is no different from sum-product for trees. In other words, you can take your implementation of the sum-product algorithm for trees and apply it directly to a loopy graph. This is because sum-product/BP is a distributed, local algorithm: each node receives messages from its neighbours, does some local computation, and sends out messages to its neighbours. A node only interacts with its neighbourhood and has no idea whether the overall graph is loopy or not. So implementing loopy BP is easy; the difficult part is analysing it.

Recall that the sum-product algorithm on a tree is guaranteed to converge after a certain number of iterations, and the resulting estimates of the node marginals are exact. For loopy BP this is no longer the case: the algorithm might not converge at all, and even if it does converge, it does not necessarily produce correct estimates of the marginals. The lecture notes asked three questions.

First of all, is there any fixed point at all? Notice that if the algorithm does converge, it converges to a fixed point, by definition of a fixed point. The answer to this question is yes, assuming some regularity conditions (e.g. continuous update functions, compact sets).

Secondly, what are these fixed points? The short answer is that the fixed points are in one-to-one correspondence with the extrema (i.e. maxima or minima) of the corresponding Bethe approximation problem. We will discuss how to formulate the Bethe approximation problem in more detail later.

Thirdly, will the loopy BP algorithm converge to a fixed point from a given initialization? Or is there some initialization that makes the algorithm converge? Unfortunately, the answer is not known in general. But for some special cases, e.g. graphs with a single loop, analysis is possible. In lecture we introduced computation trees and attractive fixed points, both of which are tools for analysing the convergence behaviour of loopy BP in special cases.

2. Computing marginals and computing the partition function Z are equivalent tasks. Both are NP-hard for general distributions. But if we have an oracle that computes marginals for any given distribution, we proved in the last homework (Problem 5.3) that we can design a polynomial-time algorithm that uses this oracle to compute the partition function Z. Conversely, if we have an oracle that can compute the partition function of any given distribution, we can use it to compute marginals in polynomial time as well, as shown below.
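Before deriving that reduction, here is a minimal sketch (my own illustration, not code from the recitation) that makes point 1 concrete: a parallel "flooding" message update for a pairwise MRF. Nothing in it depends on whether the graph is a tree or not; on a tree it reproduces the exact sum-product marginals, while on a loopy graph it returns loopy BP estimates like those tabulated later in these notes. The data format and function names are my own choices.

```python
import numpy as np

def loopy_bp(node_pot, edge_pot, num_iters=50):
    """Sum-product / loopy BP with a parallel (flooding) schedule.

    node_pot : {i: length-k array}      node potentials
    edge_pot : {(i, j): k-by-k array}   edge potentials, rows indexed by x_i
    Returns approximate marginals {i: length-k array}.
    """
    neighbors = {i: set() for i in node_pot}
    for (i, j) in edge_pot:
        neighbors[i].add(j)
        neighbors[j].add(i)
    # One message per directed edge, initialized uniformly.
    msgs = {(i, j): np.ones(len(node_pot[j])) / len(node_pot[j])
            for i in neighbors for j in neighbors[i]}

    def psi(i, j):
        # Edge potential oriented so rows index x_i and columns index x_j.
        return edge_pot[(i, j)] if (i, j) in edge_pot else edge_pot[(j, i)].T

    for _ in range(num_iters):
        new_msgs = {}
        for (i, j) in msgs:
            # Product of node potential and all incoming messages except the one from j.
            prod = node_pot[i].copy()
            for k in neighbors[i]:
                if k != j:
                    prod = prod * msgs[(k, i)]
            m = psi(i, j).T @ prod          # sum over x_i
            new_msgs[(i, j)] = m / m.sum()  # normalize for numerical stability
        msgs = new_msgs                     # same local update rule as on a tree

    beliefs = {}
    for i in neighbors:
        b = node_pot[i].copy()
        for k in neighbors[i]:
            b = b * msgs[(k, i)]
        beliefs[i] = b / b.sum()
    return beliefs

# Example: a 3-node loop with binary variables
# (potentials matching the first numerical example later in these notes).
same = np.array([[0.9, 0.1], [0.1, 0.9]])
node_pot = {1: np.array([0.7, 0.3]), 2: np.ones(2), 3: np.ones(2)}
edge_pot = {(1, 2): same, (1, 3): same, (2, 3): same}
print(loopy_bp(node_pot, edge_pot))
```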

To see why, consider a distribution p_x(x) and write it as p_x(x_1, ..., x_n) = p̃(x_1, ..., x_n)/Z, where p̃ is the unnormalized product of potentials. Notice that

    p_{x_1}(x_1) = Σ_{x_2,...,x_n} p_x(x_1, ..., x_n) = (1/Z) Σ_{x_2,...,x_n} p̃(x_1, ..., x_n),

and, for each fixed value of x_1, p̃(x_1, ..., x_n) is proportional to a distribution over the variables x_2, ..., x_n whose partition function is Σ_{x_2,...,x_n} p̃(x_1, ..., x_n). Thus the oracle can compute Z and Σ_{x_2,...,x_n} p̃(x_1, ..., x_n) for us, and we can obtain the marginal distribution over x_1 in polynomial time.

In inference, marginal distributions usually make more sense. But in the statistical physics community, where a large body of work on the loopy BP algorithm comes from, the log partition function is very important. In fact, many quantities, e.g. free energy, free entropy, internal energy, are defined in terms of log(Z).

2 Details

2.1 Variational Characterization of log(Z)

Remember that our original goal is to compute log(Z) for a given distribution p_x(x). We will convert the problem into an equivalent optimization problem,

    log(Z) = sup_{μ ∈ M} F(μ),

where M is the set of all distributions over x. We will define F(μ) and derive the conversion in a minute, but there is a fancy term for converting a computation problem into an optimization problem: variational characterization.

Let us first re-write p_x(x) as

    p_x(x) = (1/Z) exp(−E(x)).

Given any distribution, you can always re-write it in this form (known as a Boltzmann distribution) by choosing the right E(x). E(x) has a physical meaning: if we think of each configuration x as a state, E(x) is the energy of that state. The distribution says that the system is less likely to be in a state with higher energy. Taking logs on both sides and rearranging, we get

    E(x) = −log(p_x(x)) − log(Z).

Now we define the free entropy F(μ) as

    F(μ) = −Σ_x μ(x) log(μ(x)) − Σ_x μ(x) E(x).

Notice that the first term is the entropy of the distribution μ(x) and the second term is (minus) the average energy with respect to μ(x). Replacing E(x) by −log(p_x(x)) − log(Z), we get

    F(μ) = −Σ_x μ(x) log(μ(x)) − Σ_x μ(x) E(x)
         = −Σ_x μ(x) log(μ(x)) + Σ_x μ(x) (log(p_x(x)) + log(Z))
         = −Σ_x μ(x) log(μ(x)) + Σ_x μ(x) log(p_x(x)) + log(Z)
         = −D(μ(x) ‖ p_x(x)) + log(Z),

where D(·‖·) is the KL divergence, which is always non-negative. Hence

    log(Z) ≥ F(μ),

with equality if and only if μ(x) = p_x(x), and therefore

    log(Z) = sup_{μ ∈ M} F(μ).

Notice that this optimization problem is still NP-hard for general distributions. It can be made computationally tractable if we are willing to optimize over a different set instead of M. For instance, the basic idea of the mean field approximation is to restrict attention to a subset of M whose elements are distributions under which x_1, ..., x_n are independent. The resulting log(Z_MF) is a lower bound on the actual log(Z). In the Bethe approximation problem, we maximize F(μ) over a set M̂ that is neither a subset nor a superset of M. We will discuss M̂ in more detail in the next section.

2.2 Locally Consistent Marginals

Recall that a distribution p_x(x) can be factorized in the form

    p_x(x) = ( Π_{i ∈ V} p_{x_i}(x_i) ) ( Π_{(i,j) ∈ E} p_{x_i,x_j}(x_i, x_j) / ( p_{x_i}(x_i) p_{x_j}(x_j) ) )

if and only if the distribution factorizes according to a tree graph. The main idea of the Bethe approximation is that, instead of optimizing over the set of all distributions M, we optimize F(μ) over the set M̂, defined as

    M̂ = { μ_x(x) = ( Π_{i ∈ V} μ_{x_i}(x_i) ) ( Π_{(i,j) ∈ E} μ_{x_i,x_j}(x_i, x_j) / ( μ_{x_i}(x_i) μ_{x_j}(x_j) ) ) },

where {μ_{x_i}(x_i)}_{i ∈ V} and {μ_{x_i,x_j}(x_i, x_j)}_{(i,j) ∈ E} are a set of locally consistent marginals; in other words:

    μ_{x_i}(x_i) ≥ 0,  for all x_i
    μ_{x_i,x_j}(x_i, x_j) ≥ 0,  for all x_i, x_j
    Σ_{x_i} μ_{x_i}(x_i) = 1,  for all i
    Σ_{x_i} μ_{x_i,x_j}(x_i, x_j) = μ_{x_j}(x_j)
    Σ_{x_j} μ_{x_i,x_j}(x_i, x_j) = μ_{x_i}(x_i)

In lecture, we discussed that M̂ is neither a superset nor a subset of M, and thus log(Z_Bethe) is neither an upper bound nor a lower bound on log(Z).

1. There are elements of M that are not in M̂. Consider a distribution that factorizes according to a loopy graph.

2. There are elements of M̂ that are not in M. See the example below.
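As a numerical companion to the example that follows, here is a small sketch (my own, not part of the original recitation) that first checks local consistency of a set of pseudomarginals and then tests, as a linear-programming feasibility problem, whether any joint distribution actually has them as its marginals. The full pseudomarginal tables are a hypothetical reconstruction: they are chosen to agree with the three values quoted in the example (μ_12(0,0) = 0.49, μ_13(0,0) = 0.01, μ_23(0,1) = 0.01) under the assumption of uniform singleton marginals, and may differ from the exact figure used in the original notes.

```python
import numpy as np
from itertools import product
from scipy.optimize import linprog

# Hypothetical pseudomarginals on the triangle 1-2-3: edges (1,2) and (2,3)
# strongly correlated, edge (1,3) strongly anti-correlated, uniform singletons.
mu_single = {1: np.array([0.5, 0.5]),
             2: np.array([0.5, 0.5]),
             3: np.array([0.5, 0.5])}
mu_pair = {(1, 2): np.array([[0.49, 0.01], [0.01, 0.49]]),
           (2, 3): np.array([[0.49, 0.01], [0.01, 0.49]]),
           (1, 3): np.array([[0.01, 0.49], [0.49, 0.01]])}

# Local consistency: rows/columns of each pairwise table sum to the singletons.
for (i, j), tab in mu_pair.items():
    assert np.allclose(tab.sum(axis=1), mu_single[i])
    assert np.allclose(tab.sum(axis=0), mu_single[j])
print("locally consistent: yes")

# Global realizability: is there a joint q(x1, x2, x3) >= 0 with these marginals?
configs = list(product([0, 1], repeat=3))      # the 8 joint configurations
A_eq, b_eq = [], []
for (i, j), tab in mu_pair.items():
    for a in (0, 1):
        for b in (0, 1):
            A_eq.append([1.0 if (x[i - 1] == a and x[j - 1] == b) else 0.0
                         for x in configs])
            b_eq.append(tab[a, b])
A_eq.append([1.0] * len(configs))              # q must sum to 1
b_eq.append(1.0)

res = linprog(c=np.zeros(len(configs)), A_eq=A_eq, b_eq=b_eq,
              bounds=[(0, None)] * len(configs))
print("globally realizable:", res.success)     # expected: False
```

The assertions mirror the local-consistency constraints listed above, and the infeasible LP mirrors the contradiction argument in the example.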

[Figure: singleton pseudomarginals μ_1, μ_2, μ_3 and pairwise pseudomarginals μ_12, μ_23, μ_13 for a three-node example; the values used below include μ_12(0,0) = 0.49, μ_13(0,0) = 0.01, and μ_23(0,1) = 0.01.]

It is easy to check that the {μ_{x_i}(x_i)}_{i ∈ V} and {μ_{x_i,x_j}(x_i, x_j)}_{(i,j) ∈ E} above are a set of locally consistent marginals. We claim they do not correspond to any actual distribution. Suppose they are the local marginals computed from some distribution q_{x_1,x_2,x_3}. Then we have

    q(0,0,0) + q(0,1,0) = μ_13(0,0) = 0.01
    q(0,0,1) + q(1,0,1) = μ_23(0,1) = 0.01
    q(0,0,0) + q(0,0,1) = μ_12(0,0) = 0.49

Since q is a distribution, all of its entries are non-negative. The first two equations therefore imply q(0,0,0) ≤ 0.01 and q(0,0,1) ≤ 0.01, so the third equation cannot hold. The assumption is false, and this set of locally consistent marginals does not correspond to any actual distribution.

2.3 Computation Trees

As mentioned before, the computation tree is a tool for analysing the behaviour of loopy BP in special cases such as graphs with a single loop. In this section we look at an example to get an idea of how the method is applied. Consider the graph below, a single loop on nodes 1, 2, 3, 4, and its computation tree rooted at node 1, run for 4 iterations.

[Figure 1: Original Graph — the 4-cycle on nodes 1, 2, 3, 4 with edge potentials ψ_12, ψ_23, ψ_34, ψ_14. Figure 2: Computation Tree Rooted at Node 1, for 4 Iterations.]

Notice that the messages satisfy the following equations:

    [ m^{(t)}_{2→1}(0) ]   [ ψ_23(0,0)  ψ_23(1,0) ] [ m^{(t-1)}_{3→2}(0) ]
    [ m^{(t)}_{2→1}(1) ] = [ ψ_23(0,1)  ψ_23(1,1) ] [ m^{(t-1)}_{3→2}(1) ]  =  A_23 m^{(t-1)}_{3→2}

and similarly

    m^{(t-1)}_{3→2} = A_34 m^{(t-2)}_{4→3},    m^{(t-2)}_{4→3} = A_14^T m^{(t-3)}_{1→4},    m^{(t-3)}_{1→4} = A_12 m^{(t-4)}_{2→1},

where each A_ij is the 2×2 matrix with entries (A_ij)_{ab} = ψ_ij(b, a), as in the first equation. Chaining these together,

    m^{(t)}_{2→1} = A_23 A_34 A_14^T A_12 m^{(t-4)}_{2→1} = M m^{(t-4)}_{2→1}.

In other words, the message from node 2 to node 1 at iteration t can be written as the product of a matrix and the message from 2 to 1 at iteration t−4. Thus we have

    m^{(4t)}_{2→1} = M^t m^{(0)}_{2→1}.

Notice that M is a positive matrix (i.e. each entry is positive). Applying the Perron-Frobenius theorem, as t → ∞, M^t ≈ λ^t v u^T, where λ is the largest eigenvalue of M, and v and u are the right and left eigenvectors of M corresponding to λ. Thus the (normalized) message from 2 to 1 converges as the number of iterations goes to infinity. Due to the symmetry of the graph, similar arguments can be made for the messages along the other edges as well. So we have shown that loopy BP converges on this graph.

2.4 Intuition: When Should We Expect Loopy BP to Work Well?

Since sum-product is a distributed algorithm and it is exact on trees, intuitively we expect loopy BP to work well on graphs that are locally tree-like. In other words, if for any node its neighbours are far apart in the rest of the graph, we can think of the incoming messages as roughly independent (recall that the incoming messages to each node are independent if and only if the graph is a tree). On the other hand, if a graph contains small loops, the messages travelling around a loop are amplified and result in marginal estimates that can be very different from the true marginals. The numerical examples below hopefully provide some intuition.
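Before turning to those examples, here is a small sketch (my own; the potential values are arbitrary positive choices, not taken from the recitation) that illustrates the matrix-power argument above numerically: it builds M = A_23 A_34 A_14^T A_12 for the 4-cycle and checks that the normalized message converges to the Perron eigenvector direction of M.

```python
import numpy as np

# Hypothetical positive 2x2 matrices for the edges of the 4-cycle 1-2-3-4;
# any strictly positive tables would do, these numbers are my own choice.
A12 = np.array([[0.9, 0.1], [0.1, 0.9]])
A23 = np.array([[0.8, 0.3], [0.2, 0.7]])
A34 = np.array([[0.6, 0.4], [0.4, 0.6]])
A14 = np.array([[0.7, 0.2], [0.3, 0.8]])

# The four-step return map for the message from node 2 to node 1.
M = A23 @ A34 @ A14.T @ A12

m = np.array([0.5, 0.5])            # initial message, normalized
for t in range(1, 9):
    m = M @ m
    m = m / m.sum()                 # BP messages are only defined up to scale
    print(f"after {4 * t} BP iterations: {m}")

# Perron eigenvector of M: the direction the normalized message converges to.
eigvals, eigvecs = np.linalg.eig(M)
v = np.real(eigvecs[:, np.argmax(np.real(eigvals))])
print("Perron direction:", v / v.sum())
```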

[Figure: a single loop on three nodes 1-2-3, with node potential ψ_1 at node 1 and pairwise potentials ψ_12, ψ_13, ψ_23.]

    ψ_1(0) = 0.7,  ψ_1(1) = 0.3
    ψ_12(0,0) = ψ_12(1,1) = 0.9,  ψ_12(0,1) = ψ_12(1,0) = 0.1
    ψ_13(0,0) = ψ_13(1,1) = 0.9,  ψ_13(0,1) = ψ_13(1,0) = 0.1
    ψ_23(0,0) = ψ_23(1,1) = 0.9,  ψ_23(0,1) = ψ_23(1,0) = 0.1

                        Exact Marginals     Loopy BP Estimated Marginals
    [Px1(0), Px1(1)]    [0.7, 0.3]          [0.9555, 0.0445]
    [Px2(0), Px2(1)]    [0.6905, 0.3095]    [0.8829, 0.1171]
    [Px3(0), Px3(1)]    [0.6905, 0.3095]    [0.8829, 0.1171]

[Figure: a single loop on six nodes 1-2-3-4-5-6, with node potential ψ_1 at node 1 and the same pairwise potential on every edge of the loop.]

    ψ_1(0) = 0.7,  ψ_1(1) = 0.3
    ψ_ij(0,0) = ψ_ij(1,1) = 0.9,  ψ_ij(0,1) = ψ_ij(1,0) = 0.1  for every edge (i,j)

                        Exact Marginals     Loopy BP Estimated Marginals
    [Px1(0), Px1(1)]    [0.7, 0.3]          [0.9027, 0.0973]
    [Px2(0), Px2(1)]    [0.6787, 0.3213]    [0.7672, 0.2328]
    [Px3(0), Px3(1)]    [0.6663, 0.3337]    [0.7487, 0.2513]
    [Px4(0), Px4(1)]    [0.6623, 0.3377]    [0.7427, 0.2573]
    [Px5(0), Px5(1)]    [0.6663, 0.3337]    [0.7487, 0.2513]
    [Px6(0), Px6(1)]    [0.6787, 0.3213]    [0.7672, 0.2328]

Notice that in both cases the marginals estimated by loopy BP put more mass on the likelier state than the exact marginals do. In a loop, the incoming messages to a node are not independent, but the loopy BP algorithm does not know that and treats them as independent anyway. The same evidence is therefore counted more than once, resulting in an amplification effect. Also notice that the amplification is less severe for the size-6 loop, because it is more locally tree-like.
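The "Exact Marginals" columns above can be checked by brute-force enumeration, since the models are tiny. Below is a minimal sketch (my own) for the three-node loop, assuming the potentials listed above; it should reproduce Px1 = [0.7, 0.3] and Px2 = Px3 ≈ [0.6905, 0.3095].

```python
import numpy as np
from itertools import product

# Potentials of the three-node loop example above.
phi1 = {0: 0.7, 1: 0.3}
def psi(a, b):                        # same pairwise potential on every edge
    return 0.9 if a == b else 0.1

# Unnormalized probability of each joint configuration (x1, x2, x3).
weights = {}
for x1, x2, x3 in product([0, 1], repeat=3):
    weights[(x1, x2, x3)] = phi1[x1] * psi(x1, x2) * psi(x1, x3) * psi(x2, x3)
Z = sum(weights.values())             # partition function

# Exact node marginals by summing out the other variables.
for i in range(3):
    marg = np.zeros(2)
    for config, w in weights.items():
        marg[config[i]] += w
    print(f"P_x{i + 1} =", marg / Z)
```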

What loopy BP dislikes even more than small loops is closely-tied small loops. The following example hopefully makes this point clear.

[Figure: a 3×3 grid of nodes 1-9, with node potential ψ_1 at node 1 and the same pairwise potential on every edge of the grid.]

    ψ_1(0) = 0.51,  ψ_1(1) = 0.49
    ψ_ij(0,0) = ψ_ij(1,1) = 0.9,  ψ_ij(0,1) = ψ_ij(1,0) = 0.1  for every edge (i,j)

                        Exact Marginals     Loopy BP Estimated Marginals
    [Px1(0), Px1(1)]    [0.5100, 0.4900]    [0.9817, 0.0183]
    [Px2(0), Px2(1)]    [0.5096, 0.4904]    [0.9931, 0.0069]
    [Px3(0), Px3(1)]    [0.5093, 0.4907]    [0.9811, 0.0189]
    [Px4(0), Px4(1)]    [0.5096, 0.4904]    [0.9931, 0.0069]
    [Px5(0), Px5(1)]    [0.5096, 0.4904]    [0.9992, 0.0008]
    [Px6(0), Px6(1)]    [0.5095, 0.4905]    [0.9930, 0.0070]
    [Px7(0), Px7(1)]    [0.5093, 0.4907]    [0.9811, 0.0189]
    [Px8(0), Px8(1)]    [0.5095, 0.4905]    [0.9930, 0.0070]
    [Px9(0), Px9(1)]    [0.5093, 0.4907]    [0.9810, 0.0190]

Even though the node potentials barely prefer one state over the other, loopy BP's estimates are extremely confident: the tightly coupled short loops cause the same evidence to be recirculated and counted many times over.

MIT OpenCourseWare
http://ocw.mit.edu

6.438 Algorithms for Inference
Fall 2014

For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.