Bayesian Machine Learning - Lecture 7

Bayesian Machine Learning - Lecture 7
Guido Sanguinetti
Institute for Adaptive and Neural Computation, School of Informatics, University of Edinburgh
gsanguin@inf.ed.ac.uk
March 4, 2015

Today's lecture: 1. The Forward-Backward algorithm; 2. Belief propagation on general graphical models; 3. The variational formulation of BP.

Hidden Markov Model definitions. We have a Markov chain of unobserved states. Traditionally the latent variables are taken to be discrete and the transition probabilities are given by a transition matrix $T$. We have a sequence of observations; each observation depends only on the latent state at that time, through the emission probability.

Graphical representation of HMM. We represent the latent states as a sequence of random variables; each of them depends only on the previous one. The observations depend only on the corresponding state.

States and parameters. We are interested in the posterior distribution of the states $x_{1:T}$ given the observations $y_{1:T}$ (the subscript $1:T$ denotes the collection of variables from 1 to $T$). Notice that we only have one observation per time point; in the independent-observations case, this would not be enough. We also have parameters which we assume known (for now): these are the known probabilities
$$\pi = p(x(1)), \qquad T_{x(t-1),x(t)} = p(x(t) \mid x(t-1)), \qquad O_{x,y} = p(y(t) \mid x(t)).$$
We assume the parameters to be time-independent.

The single time marginals. The joint posterior over the states is, by the rules of probability, proportional to the joint probability of observations and states:
$$p(x_{1:T} \mid y_{1:T}) \propto p(x_{1:T}, y_{1:T}).$$
An object of central importance is the single time marginal for the latent variable at time $t$. This is obtained by marginalising the latent variables at all other time points; by the proportionality above,
$$p(x(t) \mid y_{1:T}) \propto p(x(t), y_{1:T}).$$

Conditional independence and factorisations. By using the product rule of probability, we can rewrite the joint probability of states and observations as
$$p(x_{1:T}, y_{1:T}) = p(y_{t+1:T} \mid x_{1:T}, y_{1:t})\, p(x_{1:T}, y_{1:t}). \tag{1}$$
Recall that graphical models encode conditional independence relations.

Some conditional independencies. By inspection of the network representation of the model (four slides ago), we see that
$$p(y_{t+1:T} \mid x_{1:T}, y_{1:t}) = p(y_{t+1:T} \mid x_{t+1:T}). \tag{2}$$
Also, $x_{t+1:T}$ is conditionally independent of $y_{1:t}$ given $x(t)$, so that
$$p(x_{1:T}, y_{1:t}) = p(x_{t+1:T} \mid x_{1:t}, y_{1:t})\, p(x_{1:t}, y_{1:t}) = p(x_{t+1:T} \mid x(t))\, p(x_{1:t}, y_{1:t}). \tag{3}$$

Factorisations and messages. Putting equations (2) and (3) into (1), we get
$$p(x_{1:T}, y_{1:T}) = p(y_{t+1:T}, x_{t+1:T} \mid x(t))\, p(x_{1:t}, y_{1:t}).$$
Marginalising $x_{1:t-1}$ and $x_{t+1:T}$, we get the following fundamental factorisation of the single time marginal:
$$p(x(t) \mid y_{1:T}) \propto \alpha(x(t))\,\beta(x(t)) \propto p(x(t) \mid y_{1:t})\, p(y_{t+1:T} \mid x(t)). \tag{4}$$
The single time marginal at time $t$ is the product of the posterior estimate given all the data up to that point, times the likelihood of future observations given the state at $t$.

Filtering: computing the forward message. Initialisation:
$$\alpha(x(1)) = p(y(1), x(1)) = \pi_{x(1)}\, O_{x(1),y(1)}.$$
Recursion:
$$\begin{aligned}
\alpha(x(t)) = p(x(t), y_{1:t}) &= \sum_{x(t-1)} p(x(t), x(t-1), y_{1:t}) \\
&= \sum_{x(t-1)} p(y(t) \mid x(t))\, p(x(t) \mid x(t-1))\, p(x(t-1), y_{1:t-1}) \\
&= O_{x(t),y(t)} \sum_{x(t-1)} T_{x(t-1),x(t)}\, \alpha(x(t-1)),
\end{aligned}$$
where I used the conditional independences of the network to go from the first line to the second. If $x(t)$ is continuous, replace the sum with an integral.
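
To make the recursion concrete, here is a minimal NumPy sketch of the forward (filtering) pass for a discrete HMM. The variable names (pi, T, O, y) mirror the parameters defined above; the per-step normalisation and the toy numbers in the usage example are illustrative assumptions, not part of the lecture.

```python
import numpy as np

def forward_messages(pi, T, O, y):
    """Forward (filtering) pass for a discrete HMM.

    pi : (K,) initial distribution, pi[i] = p(x(1)=i)
    T  : (K, K) transition matrix, T[i, j] = p(x(t)=j | x(t-1)=i)
    O  : (K, M) emission matrix, O[i, m] = p(y(t)=m | x(t)=i)
    y  : length-N sequence of observed symbols (integers in 0..M-1)

    Returns alphas of shape (N, K), with alphas[t] proportional to
    p(x(t), y_{1:t}); each step is rescaled to avoid numerical underflow.
    """
    N, K = len(y), len(pi)
    alphas = np.zeros((N, K))
    # Initialisation: alpha(x(1)) = pi_{x(1)} O_{x(1), y(1)}
    alphas[0] = pi * O[:, y[0]]
    alphas[0] /= alphas[0].sum()
    # Recursion: alpha(x(t)) = O_{x(t), y(t)} * sum_{x(t-1)} T_{x(t-1), x(t)} alpha(x(t-1))
    for t in range(1, N):
        alphas[t] = O[:, y[t]] * (T.T @ alphas[t - 1])
        alphas[t] /= alphas[t].sum()
    return alphas

# Toy usage: a 2-state, 2-symbol HMM (all numbers made up for illustration).
pi = np.array([0.6, 0.4])
T = np.array([[0.7, 0.3], [0.2, 0.8]])
O = np.array([[0.9, 0.1], [0.3, 0.7]])
print(forward_messages(pi, T, O, [0, 0, 1, 1]))
```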

Computing the backward message. Initialisation: $\beta(x(T)) = 1$ (why?). Backward recursion:
$$\begin{aligned}
\beta(x(t-1)) = p(y_{t:T} \mid x(t-1)) &= \sum_{x(t)} p(y_{t:T}, x(t) \mid x(t-1)) \\
&= \sum_{x(t)} p(y_{t+1:T} \mid y(t), x(t), x(t-1))\, p(y(t), x(t) \mid x(t-1)) \\
&= \sum_{x(t)} \beta(x(t))\, p(y(t) \mid x(t))\, p(x(t) \mid x(t-1)).
\end{aligned}$$
Once again, if $x$ is continuous, replace the sum with an integral.
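
A matching NumPy sketch for the backward pass, with the same caveat that the rescaling of each beta is an added numerical convenience. The second function shows how forward and backward messages combine into the single time marginal of equation (4); it recomputes a compact forward pass so the snippet stays self-contained, and the toy HMM at the bottom is the same illustrative one used above.

```python
import numpy as np

def backward_messages(T, O, y):
    """Backward pass: betas[t] proportional to p(y_{t+1:T} | x(t))."""
    N, K = len(y), T.shape[0]
    betas = np.ones((N, K))          # initialisation: beta(x(T)) = 1
    for t in range(N - 2, -1, -1):
        # beta(x(t)) = sum_{x(t+1)} beta(x(t+1)) p(y(t+1)|x(t+1)) p(x(t+1)|x(t))
        betas[t] = T @ (betas[t + 1] * O[:, y[t + 1]])
        betas[t] /= betas[t].sum()   # rescale for numerical stability
    return betas

def smoothed_marginals(pi, T, O, y):
    """Single time marginals p(x(t) | y_{1:T}) via equation (4): gamma ∝ alpha * beta."""
    N, K = len(y), len(pi)
    alphas = np.zeros((N, K))
    # Forward pass, as in the previous sketch.
    alphas[0] = pi * O[:, y[0]]
    alphas[0] /= alphas[0].sum()
    for t in range(1, N):
        alphas[t] = O[:, y[t]] * (T.T @ alphas[t - 1])
        alphas[t] /= alphas[t].sum()
    gammas = alphas * backward_messages(T, O, y)
    return gammas / gammas.sum(axis=1, keepdims=True)

# Toy usage with the same illustrative 2-state HMM as in the forward sketch.
pi = np.array([0.6, 0.4])
T = np.array([[0.7, 0.3], [0.2, 0.8]])
O = np.array([[0.9, 0.1], [0.3, 0.7]])
print(smoothed_marginals(pi, T, O, [0, 0, 1, 1]))
```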

Message passing. The factorisation in equation (4) is an example of message passing: $\alpha(x(t))$ is a message propagated forwards from the previous observations (the forward message, or filtered process), and $\beta(x(t))$ is a message propagated backwards from future observations (the backward message). Message passing algorithms allow exact inference in tree-structured graphical models (why?) and approximate inference in more complicated models.

General graphical model scenario. An excellent review by Yedidia et al. is available at http://www.merl.com/publications/docs/tr2001-22.pdf. We consider the case of pairwise Markov random fields, where the joint distribution is given by
$$p(x) = \frac{1}{Z} \prod_{(ij)} \psi(x_i, x_j) \prod_i \phi(x_i),$$
where $(ij)$ ranges over the set of edges in the graph and the terms $\phi(x_i)$ come from (noisy) observations of the nodes $x_i$ (local evidence). The goal is to compute an approximation of the single node marginal, called the belief at node $i$, $b(x_i)$.
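
As a point of reference for what BP approximates, here is a small brute-force sketch that enumerates all configurations of a toy pairwise MRF and computes the exact single node marginals. The chain structure and the particular psi and phi tables are arbitrary choices for illustration.

```python
import itertools
import numpy as np

# Toy pairwise MRF: a 3-node chain 0 - 1 - 2, each node binary.
# psi[e] is the pairwise potential on edge e, phi[i] the local evidence at node i.
edges = [(0, 1), (1, 2)]
psi = {e: np.array([[2.0, 1.0], [1.0, 2.0]]) for e in edges}   # favours agreement
phi = [np.array([1.5, 0.5]), np.array([1.0, 1.0]), np.array([0.5, 1.5])]

n, K = 3, 2
joint = np.zeros((K,) * n)
for x in itertools.product(range(K), repeat=n):
    p = np.prod([phi[i][x[i]] for i in range(n)])
    p *= np.prod([psi[(i, j)][x[i], x[j]] for (i, j) in edges])
    joint[x] = p
joint /= joint.sum()   # the normalising constant is the partition function Z

# Exact single node marginals: sum the joint over all other variables.
for i in range(n):
    axes = tuple(a for a in range(n) if a != i)
    print(f"p(x_{i}) =", joint.sum(axis=axes))
```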

Neighbours and messages. Given a node $i$, denote by $N_i$ the set of neighbouring nodes in the graph (recall we work with MRFs, which are undirected). We associate with each edge $(ij)$ a message from $i$ to $j$, $m_{ij}(x_j)$. For graphical models on discrete random variables, the message $m_{ij}(x_j)$ is a vector whose dimension is the number of states of $x_j$. Intuitively, the message $m_{ij}(x_j)$ encodes what node $i$ thinks about the state of node $j$.

Belief propagation. The Belief Propagation (BP) algorithm computes an approximation to the single node belief in terms of messages and local evidence:
$$b_i(x_i) \propto \phi(x_i) \prod_{j \in N_i} m_{ji}(x_i). \tag{5}$$
The messages are computed self-consistently by
$$m_{ji}(x_i) = \sum_{x_j} \phi(x_j)\, \psi(x_i, x_j) \prod_{l \in N_j \setminus i} m_{lj}(x_j). \tag{6}$$
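
A minimal NumPy sketch of these updates: it iterates equation (6) to a fixed point and then forms the beliefs of equation (5). The parallel update schedule, the convergence tolerance, and the toy model in the usage lines are illustrative assumptions rather than part of the lecture.

```python
import numpy as np

def loopy_bp(edges, psi, phi, n_nodes, n_states, n_iters=100, tol=1e-8):
    """Synchronous BP on a pairwise MRF, returning the beliefs of eq. (5).

    edges : list of (i, j) pairs
    psi   : dict (i, j) -> (n_states, n_states) array, psi[(i,j)][a, b] = psi(x_i=a, x_j=b)
    phi   : list of (n_states,) local evidence arrays
    """
    neighbours = {i: [] for i in range(n_nodes)}
    for (i, j) in edges:
        neighbours[i].append(j)
        neighbours[j].append(i)

    def pairwise(i, j):
        # each psi is stored once per edge; transpose when read in the other direction
        return psi[(i, j)] if (i, j) in psi else psi[(j, i)].T

    # m[(i, j)] is the message from i to j, a function of x_j
    m = {(i, j): np.ones(n_states) for i in range(n_nodes) for j in neighbours[i]}

    for _ in range(n_iters):
        new_m = {}
        for (i, j) in m:
            # eq. (6): m_{ij}(x_j) = sum_{x_i} phi(x_i) psi(x_i, x_j) prod_{l in N_i \ j} m_{li}(x_i)
            prod_in = phi[i].copy()
            for l in neighbours[i]:
                if l != j:
                    prod_in *= m[(l, i)]
            msg = pairwise(i, j).T @ prod_in
            new_m[(i, j)] = msg / msg.sum()
        converged = max(np.abs(new_m[k] - m[k]).max() for k in m) < tol
        m = new_m
        if converged:
            break

    # eq. (5): b_i(x_i) proportional to phi(x_i) prod_{j in N_i} m_{ji}(x_i)
    beliefs = []
    for i in range(n_nodes):
        b = phi[i].copy()
        for j in neighbours[i]:
            b *= m[(j, i)]
        beliefs.append(b / b.sum())
    return beliefs

# Toy usage: the same illustrative 3-node binary chain as in the brute-force sketch.
edges = [(0, 1), (1, 2)]
psi = {e: np.array([[2.0, 1.0], [1.0, 2.0]]) for e in edges}
phi = [np.array([1.5, 0.5]), np.array([1.0, 1.0]), np.array([0.5, 1.5])]
print(loopy_bp(edges, psi, phi, n_nodes=3, n_states=2))
```

Since the chain is a tree, these beliefs coincide with the exact marginals computed by brute force above; on graphs with loops, the same updates give the Bethe approximation discussed later in the lecture.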

Why it might make sense. We proved earlier that BP is exact on trees. What are the various BP quantities in our Forward-Backward derivation? BP essentially posits a local tree-like approximation to a general graphical model; it works well when there are no short loops. Recent work (by Tony Jebara) showed that exact message passing algorithms (not BP) can also be devised for perfect graphs (clique number = colouring number).

Matching distributions through KL divergence. Given two probability distributions $q$ and $p$, we have seen that the Kullback-Leibler divergence
$$\mathrm{KL}[q \| p] = \sum_x q(x) \log \frac{q(x)}{p(x)}$$
is a convex, non-negative functional which vanishes only when the two distributions are the same. We write the distribution $p$ as
$$p(x) = \frac{1}{Z} \exp[-E(x)],$$
where $E$ is traditionally known as the energy. The KL divergence then becomes
$$\mathrm{KL}[q \| p] = \log Z + \langle E(x) \rangle_q - H[q(x)], \tag{7}$$
where the functional $H$ is the entropy of a distribution.
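
A quick numeric sanity check of identity (7) on an arbitrary discrete toy distribution (the numbers below are made up): the directly computed KL divergence matches log Z + ⟨E⟩_q − H[q].

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary target p(x) = exp(-E(x)) / Z over 5 states.
E = rng.normal(size=5)                 # energies
Z = np.exp(-E).sum()                   # partition function
p = np.exp(-E) / Z

# An arbitrary approximating distribution q.
q = rng.random(5)
q /= q.sum()

kl_direct = np.sum(q * np.log(q / p))
kl_via_free_energy = np.log(Z) + np.sum(q * E) - (-np.sum(q * np.log(q)))

print(kl_direct, kl_via_free_energy)   # the two numbers agree, as in eq. (7)
```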

Free energy. For fixed $p$, the KL is a non-negative functional of $q$. If the parametrisation of $q$ is sufficiently rich as to contain the distribution $p$, minimising the KL divergence would return the exact distribution $p$. Operationally, this is achieved by minimising the free energy
$$G(q) = \langle E(x) \rangle_q - H[q(x)].$$
When $p$ is an (intractable) posterior distribution, this is the variational principle of Bayesian inference. Choosing $q$ in a specific distributional family yields specific variational inference algorithms.

Variational formulation of BP. BP works by positing a local tree-like approximation to the graph. In terms of distributions, it assumes that an approximating distribution can be defined only in terms of 1- and 2-node beliefs $b(x_i)$ and $b(x_i, x_j)$ (subject to $\sum_{x_j} b(x_i, x_j) = b(x_i)$). Then the approximation consists of replacing the entropy term with that of a distribution defined as
$$q(x) = \frac{\prod_{(ij)} b(x_i, x_j)}{\prod_i b(x_i)^{(n_i - 1)}}, \tag{8}$$
where $n_i$ is the number of neighbours of node $i$. This is called the Bethe approximation. Exercise: prove that the formula above is the correct joint distribution on a tree (or a singly connected graph in general).
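
A short numeric illustration of the exercise (not a proof): for a toy 3-node chain with made-up potentials, reconstructing q(x) from the exact pairwise and single node marginals via equation (8) recovers the exact joint.

```python
import itertools
import numpy as np

# Toy 3-node binary chain 0 - 1 - 2 with arbitrary potentials (illustrative numbers).
edges = [(0, 1), (1, 2)]
psi = {e: np.array([[2.0, 1.0], [1.0, 3.0]]) for e in edges}
phi = [np.array([1.0, 2.0]), np.array([1.5, 0.5]), np.array([1.0, 1.0])]

n, K = 3, 2
joint = np.zeros((K,) * n)
for x in itertools.product(range(K), repeat=n):
    joint[x] = np.prod([phi[i][x[i]] for i in range(n)]) * \
               np.prod([psi[(i, j)][x[i], x[j]] for (i, j) in edges])
joint /= joint.sum()

# Exact single node and pairwise marginals play the role of the beliefs b.
b1 = [joint.sum(axis=tuple(a for a in range(n) if a != i)) for i in range(n)]
b2 = {(i, j): joint.sum(axis=tuple(a for a in range(n) if a not in (i, j)))
      for (i, j) in edges}
n_nb = {0: 1, 1: 2, 2: 1}   # number of neighbours of each node in the chain

# Equation (8): q(x) = prod_edges b(x_i, x_j) / prod_i b(x_i)^(n_i - 1)
q = np.zeros_like(joint)
for x in itertools.product(range(K), repeat=n):
    num = np.prod([b2[(i, j)][x[i], x[j]] for (i, j) in edges])
    den = np.prod([b1[i][x[i]] ** (n_nb[i] - 1) for i in range(n)])
    q[x] = num / den

print(np.allclose(q, joint))   # True: on a tree, (8) recovers the exact joint
```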

Variational BP again. Substituting the formulation in equation (8) into the KL divergence (7) yields the so-called Bethe free energy. Minimising this w.r.t. the beliefs gives an iterative algorithm which is the same as BP. PROBLEM: for general graphs, there may not be a distribution with marginals satisfying the constraints. In this case, BP will not converge. Theoretical guarantees as to when it will converge are lacking, although the absence of short loops is widely taken to be a good sign.