Overview Directed and undirected graphs Conditional independence Exact inference Latent variables and EM Variational inference
Books statistical perspective Graphical Models, S. Lauritzen (1996) An Introduction to Bayesian Networks, F. Jensen (1996) Expert Systems and Probabilistic Network Models, Castillo et al. (1997) Probabilistic Reasoning in Intelligent Systems, J. Pearl (1998) Probabilistic Expert Systems, Cowell et al., (1999) Bayesian Networks and Decision Graphs, F. Jensen (2001) Learning Bayesian Networks, R. Neapolitan (2004)
Books learning perspective Learning in Graphical Models, M. I. Jordan, Ed.,(1998) Graphical Models for Machine Learning and Digital Communication, B. Frey (1998) Graphical Models, M. I. Jordan (TBD) Information Theory, Inference, and Learning Algorithms, D. J. C. MacKay (2003) Also
Pattern Recognition and Machine Learning Springer (2005) 600 pages, hardback, four colour, low price Graduate-level text book Worked solutions to all 250 exercises Complete lectures on www Matlab software: Netlab, and companion text with Ian Nabney (Springer 2006)
Probabilistic Graphical Models Graphical representations of probability distributions new insights into existing models motivation for new models graph based algorithms for calculation and computation
Probability Theory Sum rule p(x) = Σ_y p(x, y) Product rule p(x, y) = p(y|x) p(x) From these we have Bayes' theorem p(y|x) = p(x|y) p(y) / p(x) with normalization p(x) = Σ_y p(x|y) p(y)
Directed Graphs: Decomposition Consider an arbitrary joint distribution p(x, y, z) By successive application of the product rule p(x, y, z) = p(x) p(y|x) p(z|x, y) [fully connected directed graph over x, y, z]
Directed Acyclic Graphs Joint distribution p(x) = Π_i p(x_i | pa_i) where pa_i denotes the parents of x_i [example DAG over x_1, ..., x_7] No directed cycles
Examples of Directed Graphs Hidden Markov models Kalman filters Factor analysis Probabilistic principal component analysis Independent component analysis Mixtures of Gaussians Transformed component analysis Probabilistic expert systems Sigmoid belief networks Hierarchical mixtures of experts Etc, etc,
Undirected Graphs Provided p(x) > 0, the joint distribution is a product of non-negative functions over the cliques of the graph p(x) = (1/Z) Π_C ψ_C(x_C) where ψ_C(x_C) are the clique potentials, and Z is a normalization constant Z = Σ_x Π_C ψ_C(x_C) [example undirected graph over w, x, y, z]
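As an illustration of this factorization, a minimal Python sketch with made-up clique potentials over three binary variables; it computes Z by brute-force summation.

# Hypothetical illustration: p(x, y, z) = (1/Z) * psi_xy(x, y) * psi_yz(y, z)
# over binary variables; Z is obtained by summing over all configurations.
import itertools
import numpy as np

psi_xy = np.array([[10.0, 1.0],   # rows index x, columns index y
                   [1.0, 10.0]])
psi_yz = np.array([[5.0, 1.0],    # rows index y, columns index z
                   [1.0, 5.0]])

def unnormalized(x, y, z):
    return psi_xy[x, y] * psi_yz[y, z]

Z = sum(unnormalized(x, y, z) for x, y, z in itertools.product([0, 1], repeat=3))

def p(x, y, z):
    return unnormalized(x, y, z) / Z

print(Z, sum(p(*cfg) for cfg in itertools.product([0, 1], repeat=3)))  # second value is 1.0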
Conditioning on Evidence Variables may be hidden (latent) or visible (observed) Latent variables may have a specific interpretation, or may be introduced to permit a richer class of distributions
Importance of Ordering [two directed graphs over the same variables (Battery, Fuel, Fuel gauge, Engine turns over, Start) obtained from different variable orderings, giving different graph structures]
Causality Directed graphs can express causal relationships Often we observe child variables and wish to infer the posterior distribution of parent variables Example: x = cancer, y = blood test (graph x → y) Note: inferring causal structure from data is subtle
Conditional Independence x is independent of y given z if, for all values of z, p(x|y, z) = p(x|z) Phil Dawid's notation: x ⊥⊥ y | z Equivalently p(x, y|z) = p(x|z) p(y|z) Conditional independence is crucial in practical applications since we can rarely work with a general joint distribution
Markov Properties Can we determine the conditional independence properties of a distribution directly from its graph? undirected graphs: easy directed graphs: one subtlety
Undirected Graphs Conditional independence given by graph separation!
Graphs as Filters A graph acts as a filter on the set of all distributions p(x), passing those that factorize according to the graph (the family DF) Factorization and conditional independence give identical families of distributions
Directed Markov Properties: Example 1 Joint distribution over 3 variables specified by the graph a → c → b Note the missing edge from a to b Node c is head-to-tail with respect to the path a–c–b Joint distribution p(a, b, c) = p(a) p(c|a) p(b|c)
Directed Markov Properties: Example 1 Suppose we condition on node c p(a, b|c) = p(a, b, c) / p(c) = p(a|c) p(b|c) Hence a ⊥⊥ b | c Note that if c is not observed, a and b are dependent in general Informally: observation of c blocks the path from a to b
Directed Markov Properties: Example 2 3-node graph a ← c → b Joint distribution p(a, b, c) = p(a|c) p(b|c) p(c) Node c is tail-to-tail with respect to the path a–c–b Again, note the missing edge from a to b
Directed Markov Properties: Example 2 Now condition on node c We have p(a, b|c) = p(a|c) p(b|c) Hence a ⊥⊥ b | c Again, if c is not observed, a and b are dependent in general Informally: observation of c blocks the path from a to b
Directed Markov Properties: Example 3 Node c is head-to-head with respect to the path a–c–b in the graph a → c ← b Joint distribution p(a, b, c) = p(a) p(b) p(c|a, b) Note the missing edge from a to b If c is not observed we have p(a, b) = Σ_c p(a) p(b) p(c|a, b) = p(a) p(b) and hence a ⊥⊥ b
Directed Markov Properties: Example 3 Suppose we condition on node c In general p(a, b|c) ≠ p(a|c) p(b|c) Hence a and b become dependent Informally: an unobserved head-to-head node blocks the path from a to b; once c is observed the path is unblocked Note: observation of any descendant of c also unblocks the path
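A minimal numerical sketch of this effect in Python, with made-up probability tables for binary a, b, c: marginally a and b are independent, but conditioning on c makes them dependent.

# Head-to-head graph a -> c <- b with binary variables.
# p(a, b, c) = p(a) p(b) p(c | a, b); values below are illustrative only.
import numpy as np

p_a = np.array([0.7, 0.3])
p_b = np.array([0.6, 0.4])
p_c_given_ab = np.array([[[0.9, 0.1], [0.5, 0.5]],
                         [[0.5, 0.5], [0.1, 0.9]]])  # indexed [a, b, c]

# Joint p(a, b, c)
joint = p_a[:, None, None] * p_b[None, :, None] * p_c_given_ab

# Marginal p(a, b) equals p(a) p(b), so a and b are independent
p_ab = joint.sum(axis=2)
print(np.allclose(p_ab, np.outer(p_a, p_b)))  # True

# Condition on c = 1: p(a, b | c=1) no longer factorizes in general
p_ab_given_c1 = joint[:, :, 1] / joint[:, :, 1].sum()
p_a_given_c1 = p_ab_given_c1.sum(axis=1)
p_b_given_c1 = p_ab_given_c1.sum(axis=0)
print(np.allclose(p_ab_given_c1, np.outer(p_a_given_c1, p_b_given_c1)))  # False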
Explaining Away Illustration: pixel colour in an image lighting colour and surface colour are both parents of the observed image colour
d-separation Conditional independence A ⊥⊥ B | C if, and only if, all possible paths between A and B are blocked Examples: graphs (i) and (ii) over nodes a, b, c, e, f
Markov Blankets
Directed versus Undirected [two example graphs over w, x, y, z; diagram of the families D (directed) and U (undirected) within the set P of all distributions]
Example: State Space Models Hidden Markov model Kalman filter
Example: Bayesian SSM
Example: Factorial SSM Multiple hidden sequences
Example: Markov Random Field Typical application: image region labelling (observed pixel values y_i, hidden labels x_i)
Example: Conditional Random Field [chain of label nodes y conditioned on observed inputs x_i]
Summary of Factorization Properties Directed graphs conditional independence from d-separation test Undirected graphs conditional independence from graph separation
Inference Simple example: Bayes' theorem on the two-node graph x → y; observing y, infer p(x|y) ∝ p(y|x) p(x)
Message Passing Example Chain of nodes x_1, x_2, ..., x_{L-1}, x_L Find the marginal p(x_i) for a particular node For K-state nodes, direct summation costs O(K^L), exponential in the length of the chain but we can exploit the graphical structure (conditional independences)
Message Passing Joint distribution for a chain with pairwise potentials p(x_1, ..., x_L) ∝ ψ(x_1, x_2) ψ(x_2, x_3) ⋯ ψ(x_{L-1}, x_L) The marginal p(x_i) sums the joint over all other variables Exchange sums and products so that each summation is pushed as far inside as possible
Message Passing Express as a product of messages p(x_i) ∝ m_α(x_i) m_β(x_i), where m_α flows forward along the chain from x_{i-1} and m_β flows backward from x_{i+1} Recursive evaluation of messages m_α(x_i) = Σ_{x_{i-1}} ψ(x_{i-1}, x_i) m_α(x_{i-1}), m_β(x_i) = Σ_{x_{i+1}} ψ(x_i, x_{i+1}) m_β(x_{i+1}) Find p(x_i) by normalizing
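A minimal Python sketch of these recursions, with randomly generated pairwise potentials (an illustrative assumption): computing one marginal costs O(L K²) rather than O(K^L).

# Sketch of message passing on a chain x_1 - x_2 - ... - x_L of K-state nodes.
import numpy as np

rng = np.random.default_rng(0)
L, K = 6, 3
psi = [rng.random((K, K)) for _ in range(L - 1)]   # psi[j] couples node j and node j+1

def marginal(i):
    # Forward message m_alpha accumulated from the left end up to node i
    m_alpha = np.ones(K)
    for j in range(i):
        m_alpha = psi[j].T @ m_alpha
    # Backward message m_beta accumulated from the right end down to node i
    m_beta = np.ones(K)
    for j in range(L - 2, i - 1, -1):
        m_beta = psi[j] @ m_beta
    p = m_alpha * m_beta
    return p / p.sum()

print(marginal(2))   # marginal distribution of the third node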
Belief Propagation Extension to general tree-structured graphs At each node: form product of incoming messages and local evidence marginalize to give outgoing message one message in each direction across every link also called the sum-product algorithm x i Fails if there are loops
Max-product Algorithm Goal: find the most probable configuration x^max = argmax_x p(x) and the value p(x^max) Define messages as before but with each sum replaced by a max; then p(x^max) is obtained from the product of incoming messages, and back-tracking recovers x^max Message passing algorithm with sum replaced by max Generalization of the Viterbi algorithm for HMMs
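The same chain as above with sums replaced by maximizations: a Viterbi-style Python sketch, again with random potentials as an illustrative assumption.

# Max-product (Viterbi-style) sketch on a chain: work in log space and keep
# back-pointers so the argmax configuration can be recovered.
import numpy as np

rng = np.random.default_rng(1)
L, K = 6, 3
psi = [rng.random((K, K)) for _ in range(L - 1)]

log_m = np.zeros(K)            # best accumulated log potential ending in each state
backptr = []
for j in range(L - 1):
    scores = log_m[:, None] + np.log(psi[j])   # indexed [x_j, x_{j+1}]
    backptr.append(scores.argmax(axis=0))
    log_m = scores.max(axis=0)

# Backtrack to recover the most probable configuration
x = [int(log_m.argmax())]
for bp in reversed(backptr):
    x.append(int(bp[x[-1]]))
x.reverse()
print(x, log_m.max())   # argmax configuration and its (unnormalized) log score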
Example: Hidden Markov Model Inference involves one forward and one backward pass Computational cost grows linearly with length of chain Similarly for the Kalman filter
Junction Tree Algorithm An efficient exact algorithm for a general graph applies to both directed and undirected graphs compile original graph into a tree of cliques then perform message passing on this tree Problem: cost is exponential in size of largest clique many interesting models have intractably large cliques
Loopy Belief Propagation Apply belief propagation directly to general graph need to keep iterating might not converge State-of-the-art performance in error-correcting codes
Junction Tree Algorithm Key steps: 1. Moralize 2. Absorb evidence 3. Triangulate 4. Construct junction tree of cliques 5. Pass messages to achieve consistency
Moralization There are algorithms which work with the original directed graph, but these turn out to be special cases of the junction tree algorithm In the JT algorithm we first convert the directed graph into an undirected graph; directed and undirected graphs are then treated using the same approach Suppose we are given a directed graph with conditionals p(x_i | pa_i) and we wish to find a representation of the joint as an undirected graph
Moralization (cont d) The conditionals are obvious candidates as clique potentials, but we need to ensure that each node belongs in the same clique as its parents This is achieved by adding, for each node, links connecting together all of the parents
Moralization (cont'd) Moralization therefore consists of the following steps: 1. For each node in the graph, add edges between all pairs of parents of the node, then convert directed edges to undirected edges 2. Initialize the clique potentials of the moral graph to 1 3. For each local conditional probability p(x_i | pa_i), choose a clique C such that C contains both x_i and pa_i, and multiply ψ_C by p(x_i | pa_i) Note that this undirected graph automatically has normalization factor Z = 1
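A minimal Python sketch of step 1, using a small hypothetical parent list (the node names and structure below are assumptions for illustration).

# Moralization step 1: marry the parents of each node, then drop edge directions.
parents = {            # hypothetical DAG: node -> list of parents
    "a": [], "b": [], "c": ["a", "b"], "d": ["c"],
}

edges = set()
for child, pa in parents.items():
    for p in pa:                       # keep original edges (now undirected)
        edges.add(frozenset((p, child)))
    for i in range(len(pa)):           # add links between every pair of parents
        for j in range(i + 1, len(pa)):
            edges.add(frozenset((pa[i], pa[j])))

print(sorted(tuple(sorted(e)) for e in edges))
# [('a', 'b'), ('a', 'c'), ('b', 'c'), ('c', 'd')]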
Moralization (cont d) By adding links we have discarded some conditional independencies However, any conditional independencies in the moral graph also hold for the original directed graph, so if we solve the inference problem for the moral graph we will solve it also for the directed graph
Absorbing Evidence The nodes can be grouped into visible V, for which we have particular observed values x̂_V, and hidden H We are interested in the conditional (posterior) probability p(x_H | x̂_V) Absorb evidence simply by altering the clique potentials to be zero for any configuration inconsistent with x̂_V
Absorbing Evidence (cont'd) We can view the resulting product of potentials as an un-normalized version of p(x_H, x̂_V) and hence an un-normalized version of p(x_H | x̂_V)
Local Consistency As it stands, the graph correctly represents the (unnormalized) joint distribution but the clique potentials do not have an interpretation as marginal probabilities Our goal is to update the clique potentials so that they acquire a local probabilistic interpretation while preserving the global distribution
Local Consistency (cont'd) Note that we cannot simply set the clique potentials equal to the corresponding marginals, ψ_C = p(x_C), as can be seen by considering the three-node chain A – B – C, where p(A, B, C) = p(A, B) p(B, C) / p(B) ≠ p(A, B) p(B, C)
Local Consistency (cont'd) Instead we consider a more general representation for undirected graphs that includes separator sets p(x) = Π_C ψ_C(x_C) / Π_S φ_S(x_S)
Local Consistency (cont'd) Starting from our un-normalized representation of the joint in terms of products of clique potentials, we can introduce separator potentials initially set to unity Note that nodes can appear in more than one clique, and we require that the cliques be consistent, i.e. agree on the marginals of their shared variables Achieving consistency is central to the junction tree algorithm
Local Consistency (cont'd) Consider the elemental problem of achieving consistency between a pair of cliques V and W, with separator set S = V ∩ W Initially the separator potential is φ_S = 1 and the joint is represented as ψ_V ψ_W / φ_S
Local Consistency (cont'd) First construct a message at clique V and pass it to W: φ_S* = Σ_{V∖S} ψ_V, ψ_W* = (φ_S* / φ_S) ψ_W Since ψ_V is unchanged, ψ_V ψ_W* / φ_S* = ψ_V ψ_W / φ_S, and so the joint distribution is invariant
Local Consistency (cont'd) Next pass a message back from W to V using the same update rule: φ_S** = Σ_{W∖S} ψ_W*, ψ_V* = (φ_S** / φ_S*) ψ_V Here ψ_W* is unchanged and so ψ_V* ψ_W* / φ_S** = ψ_V ψ_W / φ_S, and again the joint distribution is unchanged The marginals are now correct for both of the cliques and also for the separator
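A Python sketch of this two-clique update for a chain A – B – C with cliques {A, B}, {B, C} and separator {B}; the probability tables are made up for illustration.

# Two-clique consistency on A - B - C with binary variables.
import numpy as np

p_A = np.array([0.6, 0.4])
p_B_given_A = np.array([[0.9, 0.1], [0.2, 0.8]])   # rows A, cols B
p_C_given_B = np.array([[0.7, 0.3], [0.5, 0.5]])   # rows B, cols C

psi_AB = p_A[:, None] * p_B_given_A                # initialised to p(A, B)
psi_BC = p_C_given_B.copy()                        # initialised to p(C | B)
phi_B = np.ones(2)                                 # separator potential

# Message from clique {A,B} to clique {B,C}
phi_B_new = psi_AB.sum(axis=0)                     # marginalise out A
psi_BC = (phi_B_new / phi_B)[:, None] * psi_BC
phi_B = phi_B_new

# Message back from {B,C} to {A,B} (vacuous here: the separator is unchanged)
phi_B_back = psi_BC.sum(axis=1)
psi_AB = psi_AB * (phi_B_back / phi_B)[None, :]

# The potentials now equal the marginals p(A,B), p(B), p(B,C)
print(np.allclose(psi_BC, phi_B[:, None] * p_C_given_B))  # psi_BC == p(B, C)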
Local Consistency (cont'd) Example: return to the earlier three-node chain, with cliques {A, B} and {B, C} and separator {B} Initially the clique potentials are ψ_{AB} = p(A) p(B|A) = p(A, B) and ψ_{BC} = p(C|B), and the separator potential is φ_B = 1 The first message pass gives φ_B* = Σ_A ψ_{AB} = p(B) and ψ_{BC}* = p(B) p(C|B) = p(B, C), which are the correct marginals In this case the second message is vacuous
Local Consistency (cont'd) Now suppose that node A is observed, so that A = Â Absorb the evidence by setting ψ_{AB}(A, B) = 0 for A ≠ Â Summing over A gives the new separator potential φ_B* = Σ_A ψ_{AB} = p(Â, B) Updating the potential gives ψ_{BC}* = p(Â, B) p(C|B) = p(Â, B, C)
Local Consistency (cont'd) Hence the potentials after the first message pass are ψ_{AB} = p(Â, B), φ_B* = p(Â, B) and ψ_{BC}* = p(Â, B, C) Again the reverse message is vacuous Note that the resulting clique and separator marginals require normalization (a local operation)
Global Consistency How can we extend our two-clique procedure to ensure consistency across the whole graph? We construct a clique tree by considering a spanning tree linking all of the cliques which is maximal with respect to the cardinality of the intersection sets Next we construct and pass messages using the following protocol: a clique can send a message to a neighbouring clique only when it has received messages from all of its neighbours
Global Consistency (cont d) In practice this can be achieved by designating one clique as root and then (i) collecting evidence by passing messages from the leaves to the root (ii) distributing evidence by propagating outwards from the root to the leaves
One Last Issue The algorithm discussed so far is not quite sufficient to guarantee consistency for an arbitrary graph Consider the four-node graph here, together with a maximal spanning clique tree Node C appears in two places, so there is no guarantee that local consistency for C will result in global consistency
One Last Issue (cont'd) The problem is resolved if the tree of cliques is a junction tree, i.e. if for every pair of cliques V and W, all cliques on the (unique) path from V to W contain V ∩ W (running intersection property) As a by-product we are also guaranteed that the (now consistent) clique potentials are indeed marginals
One Last Issue (cont d) How do we ensure that the maximal spanning tree of cliques will be a junction tree? Result: a graph has a junction tree if, and only if, it is triangulated, i.e. there are no chordless cycles of four or more nodes in the graph Example of a graph and its triangulated counterpart
Summary of Junction Tree Algorithm Key steps: 1. Moralize 2. Absorb evidence 3. Triangulate 4. Construct junction tree 5. Pass messages to achieve consistency
Example of JT Algorithm Original directed graph
Example of JT Algorithm (cont d) Moralization
Example of JT Algorithm (cont d) Undirected graph
Example of JT Algorithm (cont d) Triangulation
Example of JT Algorithm (cont d) Junction tree
Inference and Learning Data set D = {x_1, ..., x_N} Likelihood function (independent observations) p(D|θ) = Π_n p(x_n|θ) Maximize the (log) likelihood ln p(D|θ) Predictive distribution p(x|θ_ML)
Regularized Maximum Likelihood Prior p(θ), posterior p(θ|D) ∝ p(D|θ) p(θ) MAP (maximum posterior) θ_MAP = argmax_θ p(θ|D) Predictive distribution p(x|θ_MAP) Not really Bayesian
Bayesian Learning Key idea is to marginalize over unknown parameters, rather than make point estimates avoids severe over-fitting of ML and MAP allows direct model comparison Parameters are now latent variables Bayesian learning is an inference problem!
Bayesian Learning
Bayesian Learning
The Exponential Family Many distributions can be written in the form p(x|η) = h(x) g(η) exp{ η^T u(x) } Includes: Gaussian Dirichlet Gamma Multinomial Wishart Bernoulli Building blocks in graphs to give rich probabilistic models
Illustration: the Gaussian Use precision (inverse variance) λ = 1/σ²: p(x|µ, λ) = (λ/2π)^{1/2} exp{ −λ(x − µ)²/2 } In standard exponential-family form with η = (λµ, −λ/2)^T and u(x) = (x, x²)^T
Maximum Likelihood Likelihood function (independent observations) p(D|η) = (Π_n h(x_n)) g(η)^N exp{ η^T Σ_n u(x_n) } Depends on the data only via the sufficient statistics Σ_n u(x_n), of fixed dimension
Conjugate Priors Prior has the same functional form as the likelihood p(η|χ, ν) ∝ g(η)^ν exp{ η^T χ } Hence the posterior is of the form p(η|D) ∝ g(η)^{ν+N} exp{ η^T (χ + Σ_n u(x_n)) } Can interpret the prior as ν effective observations of value χ Examples: Gaussian for the mean of a Gaussian Gaussian-Wishart for mean and precision of a Gaussian Dirichlet for the parameters of a discrete distribution
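A minimal sketch of conjugate updating for the first example, a Gaussian prior on the mean of a Gaussian with known precision; all numbers are illustrative assumptions.

# Conjugate update: prior N(mu | mu0, 1/lam0), data precision known.
import numpy as np

rng = np.random.default_rng(2)
true_mu, precision = 1.5, 4.0                      # known data precision
x = rng.normal(true_mu, 1.0 / np.sqrt(precision), size=50)

mu0, lam0 = 0.0, 1.0                               # prior mean and precision
N = len(x)

# Posterior is again Gaussian: precisions add, mean is a precision-weighted average
lamN = lam0 + N * precision
muN = (lam0 * mu0 + precision * x.sum()) / lamN
print(muN, lamN)                                   # posterior mean and precision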
EM and Variational Inference Roadmap: mixtures of Gaussians EM (informal derivation) lower bound viewpoint EM revisited variational inference
The Gaussian Distribution Multivariate Gaussian N(x|µ, Σ) = (2π)^{−D/2} |Σ|^{−1/2} exp{ −(1/2)(x − µ)^T Σ^{−1} (x − µ) } with mean µ and covariance Σ Maximum likelihood: µ_ML = (1/N) Σ_n x_n, Σ_ML = (1/N) Σ_n (x_n − µ_ML)(x_n − µ_ML)^T
Gaussian Mixtures Linear super-position of Gaussians p(x) = Σ_k π_k N(x|µ_k, Σ_k) Normalization and positivity require 0 ≤ π_k ≤ 1 and Σ_k π_k = 1
Example: Mixture of 3 Gaussians [two-panel contour plot, (a) and (b)]
Maximum Likelihood for the GMM Log likelihood function ln p(X|π, µ, Σ) = Σ_n ln { Σ_k π_k N(x_n|µ_k, Σ_k) } The sum over components appears inside the log, so there is no closed-form ML solution
EM Algorithm Informal Derivation
EM Algorithm Informal Derivation M step equations: µ_k = (1/N_k) Σ_n γ_nk x_n, Σ_k = (1/N_k) Σ_n γ_nk (x_n − µ_k)(x_n − µ_k)^T, π_k = N_k / N, where N_k = Σ_n γ_nk
EM Algorithm Informal Derivation E step equation: γ_nk = π_k N(x_n|µ_k, Σ_k) / Σ_j π_j N(x_n|µ_j, Σ_j)
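Putting the E and M steps together, a short Python sketch of EM for a two-component mixture; the synthetic data and initialization are assumptions for illustration.

# EM for a Gaussian mixture model on synthetic 2-D data.
import numpy as np

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(4, 1, (100, 2))])
N, D, K = X.shape[0], X.shape[1], 2

pi = np.full(K, 1.0 / K)
mu = X[rng.choice(N, K, replace=False)]
Sigma = np.stack([np.eye(D)] * K)

def gauss(X, m, S):
    d = X - m
    return np.exp(-0.5 * np.einsum("ni,ij,nj->n", d, np.linalg.inv(S), d)) / \
           np.sqrt((2 * np.pi) ** D * np.linalg.det(S))

for it in range(50):
    # E step: responsibilities gamma_nk
    gamma = np.stack([pi[k] * gauss(X, mu[k], Sigma[k]) for k in range(K)], axis=1)
    gamma /= gamma.sum(axis=1, keepdims=True)
    # M step: re-estimate pi, mu, Sigma
    Nk = gamma.sum(axis=0)
    pi = Nk / N
    mu = (gamma.T @ X) / Nk[:, None]
    for k in range(K):
        d = X - mu[k]
        Sigma[k] = (gamma[:, k, None] * d).T @ d / Nk[k]

print(pi, mu)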
Responsibilities Can interpret the mixing coefficients π_k as prior probabilities of the components Corresponding posterior probabilities (responsibilities) γ_nk = p(k|x_n)
Old Faithful Data Set [scatter plot: duration of eruption (minutes) vs. time between eruptions (minutes)]
Over-fitting in Gaussian Mixture Models Infinities in the likelihood function when a component collapses onto a data point: µ_k = x_n with σ_k → 0 Also, maximum likelihood cannot determine the number of components
Latent Variable View of EM To sample from a Gaussian mixture: first pick one of the components k with probability π_k then draw a sample from that component N(x|µ_k, Σ_k) repeat these two steps for each new data point
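A short Python sketch of this ancestral sampling procedure, with made-up mixture parameters; the component identities z play the role of the latent "colours".

# Ancestral sampling from a Gaussian mixture (spherical components for simplicity).
import numpy as np

rng = np.random.default_rng(4)
pi = np.array([0.5, 0.3, 0.2])
mu = np.array([[0.0, 0.0], [3.0, 3.0], [0.0, 4.0]])
sigma = np.array([0.5, 0.8, 0.3])

def sample(n):
    z = rng.choice(len(pi), size=n, p=pi)                        # pick a component
    x = mu[z] + sigma[z, None] * rng.standard_normal((n, 2))     # sample from it
    return z, x

z, x = sample(500)
print(np.bincount(z) / len(z))           # empirical mixing proportions, roughly pi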
Latent Variable View of EM Goal is to solve the inverse problem: given a data set, find the mixture parameters Suppose we knew the colours (component labels): maximum likelihood would involve fitting each component to the corresponding cluster Problem: the colours are latent (hidden) variables
Incomplete and Complete Data [two scatter plots: (a) complete data with component labels, (b) incomplete data without labels]
Latent Variable Viewpoint Binary latent variables z_nk describing which component generated each data point x_n Example: 3 components and 5 data points [illustration of the latent matrix Z and data matrix X]
Latent Variable Viewpoint Conditional distribution of the observed variable p(x|z) = Π_k N(x|µ_k, Σ_k)^{z_k} Prior distribution of the latent variables p(z) = Π_k π_k^{z_k} Marginalizing over the latent variables we obtain p(x) = Σ_z p(z) p(x|z) = Σ_k π_k N(x|µ_k, Σ_k)
Graphical Representation of GMM [plate notation: latent z_n and observed x_n inside a plate over n = 1, ..., N]
Latent Variable View of EM Suppose we knew the values for the latent variables maximize the complete-data log likelihood trivial closed-form solution: fit each component to the corresponding set of data points
Latent Variable View of EM Problem: we don't know the values of the latent variables Instead maximize the expected value of the complete-data log likelihood Make use of the posterior distribution of the latent variables (the responsibilities) Gives the EM algorithm In summary: maximize the log of the joint distribution of latent and observed variables, averaged w.r.t. the posterior distribution of the latent variables
Posterior Probabilities (colour coded) [two scatter plots, (a) and (b)]
Lower Bound on Model Evidence For arbitrary q(Z): ln p(X|θ) = L(q, θ) + KL(q || p) where L(q, θ) = Σ_Z q(Z) ln { p(X, Z|θ) / q(Z) } and KL(q || p) = − Σ_Z q(Z) ln { p(Z|X, θ) / q(Z) } ≥ 0 Maximizing L over q would give the true posterior q(Z) = p(Z|X, θ)
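A quick numerical check of this decomposition for a toy discrete model with one latent variable; all tables below are made up for illustration.

# Verify ln p(X) = L(q) + KL(q || p) for a single observed X and latent z.
import numpy as np

p_z = np.array([0.3, 0.7])
p_x_given_z = np.array([0.9, 0.2])       # probability of the observed X under each z

p_xz = p_z * p_x_given_z                 # joint p(X, z) for the observed X
log_evidence = np.log(p_xz.sum())
posterior = p_xz / p_xz.sum()

q = np.array([0.5, 0.5])                 # an arbitrary q(z)
L = np.sum(q * np.log(p_xz / q))         # lower bound
KL = np.sum(q * np.log(q / posterior))   # KL(q || p(z|X))

print(log_evidence, L + KL, L <= log_evidence)   # first two agree; the bound holds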
Variational Lower Bound
EM: Variational Viewpoint (cont'd) If we maximize L(q, θ) with respect to a free-form distribution q(Z) we obtain q(Z) = p(Z|X, θ), which is the true posterior distribution The lower bound then becomes L(q, θ) = Σ_Z p(Z|X, θ_old) ln p(X, Z|θ) + const which, as a function of θ, is the expected complete-data log likelihood (up to an additive constant)
Initial Configuration
E-step
M-step
KL Divergence
KL Divergence
Bayesian Learning Introduce prior distributions over parameters Equivalent to graph with additional hidden variables Learning becomes inference on the expanded graph No distinction between variables and parameters
Bayesian Mixture of Gaussians Parameters and latent variables appear on an equal footing Conjugate priors [plate over n = 1, ..., N containing z_n and x_n, with parameter nodes outside the plate]
Explaining Away
Lower Bound on Model Evidence For arbitrary q(Z): ln p(X) = L(q) + KL(q || p) where L(q) = Σ_Z q(Z) ln { p(X, Z) / q(Z) } and KL(q || p) = − Σ_Z q(Z) ln { p(Z|X) / q(Z) } ≥ 0 Maximizing L over q would give the true posterior q(Z) = p(Z|X)
Variational Lower Bound
Variational Inference The KL divergence vanishes when q(Z) equals the true posterior p(Z|X) By assumption the exact posterior is intractable We therefore restrict attention to a family of distributions that are both sufficiently simple to be tractable and sufficiently flexible to give a good approximation to the true posterior One approach is to use a parametric family of distributions
Factorized Approximation Here we consider factorized distributions q(Z) = Π_i q_i(Z_i) No further assumptions are required!
Factorized Approximation
Factorized Approximation Optimal solution for one factor, keeping the remainder fixed: ln q_j*(Z_j) = E_{i≠j}[ ln p(X, Z) ] + const These solutions are coupled, so initialize all factors and then cyclically update them Message passing view (Winn and Bishop, 2004, to appear in JMLR)
Illustration: Univariate Gaussian Likelihood function p(D|µ, τ) = Π_n N(x_n|µ, τ^{-1}) Conjugate prior p(µ|τ) = N(µ|µ_0, (λ_0 τ)^{-1}), p(τ) = Gam(τ|a_0, b_0) Factorized variational distribution q(µ, τ) = q_µ(µ) q_τ(τ)
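A Python sketch of the resulting coordinate-wise updates for q(µ) and q(τ); these are the standard closed-form updates for this conjugate model, while the hyperparameter values and data below are illustrative assumptions.

# Coordinate ascent for the factorized approximation q(mu) q(tau).
import numpy as np

rng = np.random.default_rng(5)
x = rng.normal(2.0, 1.5, size=100)
N, xbar = len(x), x.mean()

mu0, lam0, a0, b0 = 0.0, 1.0, 1.0, 1.0
E_tau = a0 / b0                                    # initial guess for E[tau]

for it in range(30):
    # Update q(mu) = N(mu | muN, 1/lamN)
    lamN = (lam0 + N) * E_tau
    muN = (lam0 * mu0 + N * xbar) / (lam0 + N)
    # Update q(tau) = Gam(tau | aN, bN) using moments of mu under q(mu)
    aN = a0 + (N + 1) / 2.0
    E_sq = np.sum((x - muN) ** 2) + N / lamN + lam0 * ((muN - mu0) ** 2 + 1.0 / lamN)
    bN = b0 + 0.5 * E_sq
    E_tau = aN / bN

print(muN, 1.0 / E_tau)   # approximate posterior mean and (roughly) the data variance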
Initial Configuration [contour plot over (µ, τ), panel (a)]
After Updating [contour plot over (µ, τ), panel (b)]
After Updating [contour plot over (µ, τ), panel (c)]
Converged Solution [contour plot over (µ, τ), panel (d)]
Bayesian Model Complexity Consider multiple models m = 1, ..., M Prior probabilities p(m) Observed data set D Posterior probabilities p(m|D) ∝ p(D|m) p(m) If the prior probabilities are equal, models are ranked by their evidence p(D|m)
Lower Bound The bound L(q) can also be evaluated explicitly Useful for maths/code verification Also useful for model comparison: L(q) ≤ ln p(D|m)
Variational Mixture of Gaussians Assume factorized posterior distribution No other approximations needed!
Variational Equations for GMM
Lower Bound for GMM
Bound vs. K for Old Faithful Data
Bayesian Model Complexity
Sparse Bayes for Gaussian Mixture Corduneanu and Bishop (2001) Start with a large number of components treat the mixing coefficients as parameters maximize the marginal likelihood prunes out excess components
Conventional PCA Minimize the sum-of-squares reconstruction error Solution given by the eigen-spectrum of the data covariance matrix [illustration: data points x_n in (x_1, x_2) space projected onto the principal direction u_1, giving reconstructions x̃_n]
Probabilistic PCA Tipping and Bishop (1998) Generative latent variable model: x = W z + µ + ε with Gaussian latent z and Gaussian noise ε Maximum likelihood solution given by the eigen-spectrum [illustration in (x_1, x_2) space showing the latent direction z and the column of W]
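A Python sketch of EM for probabilistic PCA on synthetic data; the latent dimension, data generation and initialization are assumptions for illustration.

# EM for probabilistic PCA: x = W z + mu + noise, latent dimension q.
import numpy as np

rng = np.random.default_rng(6)
N, D, q = 200, 5, 2
Z = rng.standard_normal((N, q))
X = Z @ rng.standard_normal((q, D)) + 0.1 * rng.standard_normal((N, D))

mu = X.mean(axis=0)
Xc = X - mu
W = rng.standard_normal((D, q))
sigma2 = 1.0

for it in range(100):
    # E step: posterior moments of the latent variables
    M = W.T @ W + sigma2 * np.eye(q)
    Minv = np.linalg.inv(M)
    Ez = Xc @ W @ Minv                                  # (N, q), E[z_n]
    Ezz = N * sigma2 * Minv + Ez.T @ Ez                 # sum_n E[z_n z_n^T]
    # M step: re-estimate W and the noise variance
    W = (Xc.T @ Ez) @ np.linalg.inv(Ezz)
    sigma2 = (np.sum(Xc ** 2) - 2 * np.sum((Xc @ W) * Ez)
              + np.trace(Ezz @ W.T @ W)) / (N * D)

print(sigma2)   # small residual variance; columns of W span the principal subspace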
EM for PCA [sequence of plots (a)–(g) showing successive E and M steps on a two-dimensional data set]
Bayesian PCA Bishop (1998) Gaussian prior over the columns of W Automatic relevance determination (ARD) [plate over n = 1, ..., N with latent z_n and observed x_n, and W outside the plate; comparison of ML PCA and Bayesian PCA]
Bayesian Mixture of BPCA Models Bishop and Winn (2000) [graphical model with W_m inside a plate over m = 1, ..., M, latent s_n and z_nm, and observed x_n inside a plate over n = 1, ..., N]
VIBES Variational Inference for Bayesian Networks Winn and Bishop (1999, 2003, 2004) A general inference engine using variational methods VIBES available from: http://vibes.sourceforge.net/
VIBES (cont'd) A key observation is that in the general solution the update for a particular node (or group of nodes) depends only on other nodes in its Markov blanket This permits a local message-passing implementation that is independent of the particular graph structure
VIBES (cont d)
VIBES (cont d)
VIBES (cont d)
Structured Variational Inference Example: factorial HMM
Variational Approximation
Viewgraphs and papers: http://research.microsoft.com/~cmbishop