2 Overview Directed and undirected graphs Conditional independence Exact inference Latent variables and EM Variational inference
3 Books statistical perspective Graphical Models, S. Lauritzen (1996) An Introduction to Bayesian Networks, F. Jensen (1996) Expert Systems and Probabilistic Network Models, Castillo et al. (1997) Probabilistic Reasoning in Intelligent Systems, J. Pearl (1988) Probabilistic Expert Systems, Cowell et al. (1999) Bayesian Networks and Decision Graphs, F. Jensen (2001) Learning Bayesian Networks, R. Neapolitan (2004)
4 Books learning perspective Learning in Graphical Models, M. I. Jordan, Ed.,(1998) Graphical Models for Machine Learning and Digital Communication, B. Frey (1998) Graphical Models, M. I. Jordan (TBD) Information Theory, Inference, and Learning Algorithms, D. J. C. MacKay (2003) Also
5 Pattern Recognition and Machine Learning Springer (2005) 600 pages, hardback, four colour, low price Graduate-level text book Worked solutions to all 250 exercises Complete lectures on www Matlab software: Netlab, and companion text with Ian Nabney (Springer 2006)
6 Probabilistic Graphical Models Graphical representations of probability distributions new insights into existing models motivation for new models graph based algorithms for calculation and computation
7 Probability Theory Sum rule: p(x) = Σ_y p(x, y). Product rule: p(x, y) = p(y | x) p(x). From these we have Bayes' theorem, p(y | x) = p(x | y) p(y) / p(x), with normalization p(x) = Σ_y p(x | y) p(y)
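The two rules and Bayes' theorem can be checked numerically on a small discrete joint distribution; the table values in this sketch are invented for illustration:

```python
# Small discrete joint p(x, y) with invented values, to check the
# sum rule, product rule, and Bayes' theorem numerically.
p_xy = {("x0", "y0"): 0.1, ("x0", "y1"): 0.3,
        ("x1", "y0"): 0.4, ("x1", "y1"): 0.2}

def p_x(x):
    # sum rule: p(x) = sum_y p(x, y)
    return sum(p for (xi, _), p in p_xy.items() if xi == x)

def p_y(y):
    return sum(p for (_, yi), p in p_xy.items() if yi == y)

def p_y_given_x(y, x):
    # product rule rearranged: p(y | x) = p(x, y) / p(x)
    return p_xy[(x, y)] / p_x(x)

def bayes(x, y):
    # Bayes' theorem: p(x | y) = p(y | x) p(x) / p(y)
    return p_y_given_x(y, x) * p_x(x) / p_y(y)

print(round(p_x("x0"), 10), round(bayes("x0", "y1"), 10))  # 0.4 0.6
```

Note that `bayes` agrees with the direct computation p(x | y) = p(x, y) / p(y), as it must.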
8 Directed Graphs: Decomposition Consider an arbitrary joint distribution p(x, y, z). By successive application of the product rule, p(x, y, z) = p(x) p(y | x) p(z | x, y)
9 Directed Acyclic Graphs Joint distribution p(x) = Π_i p(x_i | pa_i), where pa_i denotes the parents of x_i. Example graph with nodes x_1, ..., x_7. No directed cycles
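The factorization can be sketched on a tiny DAG; the conditional probability tables below are invented, and the check is that multiplying local conditionals yields a properly normalized joint:

```python
# Tiny binary DAG x1 -> x3 <- x2, with invented CPTs, to check that
# the product of local conditionals p(x_i | pa_i) defines a valid joint.
p_x1 = {0: 0.6, 1: 0.4}
p_x2 = {0: 0.7, 1: 0.3}
p_x3 = {(0, 0): {0: 0.9, 1: 0.1}, (0, 1): {0: 0.5, 1: 0.5},
        (1, 0): {0: 0.4, 1: 0.6}, (1, 1): {0: 0.2, 1: 0.8}}

def joint(x1, x2, x3):
    # p(x1, x2, x3) = p(x1) p(x2) p(x3 | x1, x2)
    return p_x1[x1] * p_x2[x2] * p_x3[(x1, x2)][x3]

total = sum(joint(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1))
print(round(total, 10))  # 1.0
```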
10 Examples of Directed Graphs Hidden Markov models Kalman filters Factor analysis Probabilistic principal component analysis Independent component analysis Mixtures of Gaussians Transformed component analysis Probabilistic expert systems Sigmoid belief networks Hierarchical mixtures of experts Etc, etc,
11 Undirected Graphs Provided p(x) > 0, the joint distribution is a product of non-negative functions over the cliques of the graph, p(x) = (1/Z) Π_C ψ_C(x_C), where ψ_C(x_C) are the clique potentials and Z = Σ_x Π_C ψ_C(x_C) is a normalization constant. Example graph over w, x, y, z
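A minimal sketch of this representation on a four-node chain, with an invented "agreement" potential on each clique (edge); Z is computed by summing the potential product over all configurations:

```python
import itertools

# chain w - x - y - z with a made-up agreement potential on each clique (edge)
def psi(a, b):
    return 2.0 if a == b else 1.0

names = ["w", "x", "y", "z"]
cliques = [("w", "x"), ("x", "y"), ("y", "z")]

def unnormalized(cfg):
    p = 1.0
    for u, v in cliques:
        p *= psi(cfg[u], cfg[v])
    return p

# normalization constant Z sums the product of potentials over all configurations
configs = [dict(zip(names, vals)) for vals in itertools.product((0, 1), repeat=4)]
Z = sum(unnormalized(c) for c in configs)
joint = {tuple(c.values()): unnormalized(c) / Z for c in configs}
print(Z)  # 54.0
```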
12 Conditioning on Evidence Variables may be hidden (latent) or visible (observed). Latent variables may have a specific interpretation, or may be introduced to permit a richer class of distributions
13 Importance of Ordering Example: network over Battery, Fuel, Fuel gauge, Engine turns over, Start. Different orderings of the variables in the product-rule decomposition yield different directed graphs for the same joint distribution
14 Causality Directed graphs can express causal relationships. Often we observe child variables and wish to infer the posterior distribution of parent variables. Example: x = cancer, y = blood test. Note: inferring causal structure from data is subtle
15 Conditional Independence x is independent of y given z if, for all values of z, p(x | y, z) = p(x | z). Phil Dawid's notation: x ⊥⊥ y | z. Equivalently p(x, y | z) = p(x | z) p(y | z). Conditional independence is crucial in practical applications, since we can rarely work with a general joint distribution
16 Markov Properties Can we determine the conditional independence properties of a distribution directly from its graph? undirected graphs: easy directed graphs: one subtlety
17 Undirected Graphs Conditional independence given by graph separation!
18 Graphs as Filters A graph can be viewed as a filter that passes a distribution p(x) only if it factorizes according to the graph; DF denotes this family of distributions. Factorization and conditional independence give identical families of distributions
19 Directed Markov Properties: Example 1 Joint distribution over 3 variables specified by the graph a -> c -> b. Note the missing edge from a to b. Node c is head-to-tail with respect to the path from a to b. Joint distribution p(a, b, c) = p(a) p(c | a) p(b | c)
20 Directed Markov Properties: Example 1 Suppose we condition on node c. Then p(a, b | c) = p(a, b, c) / p(c) = p(a | c) p(b | c), hence a ⊥⊥ b | c. Note that if c is not observed, p(a, b) = p(a) Σ_c p(c | a) p(b | c) does not in general factorize into p(a) p(b). Informally: observation of c blocks the path from a to b
21 Directed Markov Properties: Example 2 3-node graph a <- c -> b. Joint distribution p(a, b, c) = p(c) p(a | c) p(b | c). Node c is tail-to-tail with respect to the path from a to b. Again, note the missing edge from a to b
22 Directed Markov Properties: Example 2 Now condition on node c. We have p(a, b | c) = p(a, b, c) / p(c) = p(a | c) p(b | c), hence a ⊥⊥ b | c. Again, if c is not observed, p(a, b) = Σ_c p(c) p(a | c) p(b | c) does not in general factorize. Informally: observation of c blocks the path from a to b
23 Directed Markov Properties: Example 3 Node c is head-to-head with respect to the path from a to b. Joint distribution p(a, b, c) = p(a) p(b) p(c | a, b). Note the missing edge from a to b. If c is not observed we have p(a, b) = Σ_c p(a) p(b) p(c | a, b) = p(a) p(b), and hence a ⊥⊥ b
24 Directed Markov Properties: Example 3 Suppose we condition on node c. Then p(a, b | c) = p(a) p(b) p(c | a, b) / p(c), which does not in general factorize into p(a | c) p(b | c). Informally: an unobserved head-to-head node blocks the path from a to b; once c is observed the path is unblocked. Note: observation of any descendant of c also unblocks the path
25 Explaining Away Illustration: pixel colour in an image depends on both lighting colour and surface colour; image colour is the observed child node
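Explaining away can be checked numerically with a hedged toy model (generic binary causes a, b and effect c; all probabilities invented): the causes are independent a priori, but once the effect is observed, confirming one cause reduces belief in the other.

```python
# a -> c <- b with independent binary causes; probabilities are invented
p_a = {0: 0.9, 1: 0.1}
p_b = {0: 0.9, 1: 0.1}
p_c1 = {(0, 0): 0.01, (0, 1): 0.99, (1, 0): 0.99, (1, 1): 0.99}  # p(c=1 | a, b)

def joint_c1(a, b):
    # p(a, b, c=1) = p(a) p(b) p(c=1 | a, b)
    return p_a[a] * p_b[b] * p_c1[(a, b)]

norm = sum(joint_c1(a, b) for a in (0, 1) for b in (0, 1))         # p(c=1)
p_a1_c1 = sum(joint_c1(1, b) for b in (0, 1)) / norm               # p(a=1 | c=1)
p_a1_c1_b1 = joint_c1(1, 1) / sum(joint_c1(a, 1) for a in (0, 1))  # p(a=1 | c=1, b=1)
print(round(p_a1_c1, 3), round(p_a1_c1_b1, 3))  # 0.505 0.1
```

Observing b = 1 "explains away" the evidence c = 1, so the posterior probability of a = 1 drops sharply.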
26 d-separation Conditional independence a ⊥⊥ b | c holds if, and only if, all possible paths from a to b are blocked. Examples: two graphs (i) and (ii) over nodes a, b, c, e, f
27 Markov Blankets
28 Directed versus Undirected Within the set P of all distributions, some conditional independence structures can be represented only by a directed graph (D), some only by an undirected graph (U), and some by neither
29 Example: State Space Models Hidden Markov model Kalman filter
30 Example: Bayesian SSM
31 Example: Factorial SSM Multiple hidden sequences
32 Example: Markov Random Field Typical application: image region labelling, with observed pixel values x_i and latent labels y_i
33 Example: Conditional Random Field (figure: label nodes y conditioned on observations x_i)
34 Summary of Factorization Properties Directed graphs conditional independence from d-separation test Undirected graphs conditional independence from graph separation
35 Inference Simple example: Bayes' theorem, which reverses the direction of inference in a two-node graph over x and y
36 Message Passing Example Chain of nodes x_1, x_2, ..., x_L. Find the marginal for a particular node. For K-state nodes, the cost of naive summation is exponential in the length of the chain, but we can exploit the graphical structure (conditional independences)
37 Message Passing Joint distribution is a product of factors, one per link of the chain. To compute a marginal, exchange sums and products so that each summation is pushed inside, past the factors it does not involve
38 Message Passing Express the marginal as a product of messages, p(x_i) ∝ m_α(x_i) m_β(x_i), where m_α is passed forward along the chain and m_β backward. Recursive evaluation of messages; find p(x_i) by normalizing
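A minimal sketch of the forward/backward message recursion on a discrete chain, checked against brute-force marginalization; the pairwise potentials are invented:

```python
import itertools

K, L = 3, 5
# invented pairwise potentials on a chain x_1 - ... - x_L
psi = [[1.0 + ((3 * i + j) % 5) * 0.2 for j in range(K)] for i in range(K)]

def chain_weight(cfg):
    w = 1.0
    for t in range(L - 1):
        w *= psi[cfg[t]][cfg[t + 1]]
    return w

def brute_marginal(node):
    # exponential-cost summation over all K**L configurations
    marg = [0.0] * K
    for cfg in itertools.product(range(K), repeat=L):
        marg[cfg[node]] += chain_weight(cfg)
    Z = sum(marg)
    return [m / Z for m in marg]

def message_marginal(node):
    # forward message m_alpha and backward message m_beta, cost linear in L
    alpha = [1.0] * K
    for _ in range(node):
        alpha = [sum(alpha[i] * psi[i][j] for i in range(K)) for j in range(K)]
    beta = [1.0] * K
    for _ in range(L - 1 - node):
        beta = [sum(psi[i][j] * beta[j] for j in range(K)) for i in range(K)]
    unnorm = [a * b for a, b in zip(alpha, beta)]
    Z = sum(unnorm)
    return [u / Z for u in unnorm]
```

The two routes agree on every node, while the message-passing version replaces the K**L sum with L matrix-vector products.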
39 Belief Propagation Extension to general tree-structured graphs At each node: form product of incoming messages and local evidence marginalize to give outgoing message one message in each direction across every link also called the sum-product algorithm x i Fails if there are loops
40 Max-product Algorithm Goal: find the most probable configuration, x* = argmax_x p(x). Define messages as in the sum-product algorithm, with each sum replaced by a max, and keep back-pointers to decode x*. Generalization of the Viterbi algorithm for HMMs
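The max-product recursion can be sketched on the same kind of chain (potentials invented), checking the decoded path against a brute-force search over all configurations:

```python
import itertools

K, L = 3, 5
# invented pairwise potentials on a chain of length L
psi = [[1.0 + ((2 * i + j) % 4) * 0.3 for j in range(K)] for i in range(K)]

def path_weight(cfg):
    w = 1.0
    for t in range(L - 1):
        w *= psi[cfg[t]][cfg[t + 1]]
    return w

def max_product():
    # forward pass: best score ending in each state, with back-pointers
    score = [1.0] * K
    back = []
    for _ in range(L - 1):
        ptr = [max(range(K), key=lambda i: score[i] * psi[i][j]) for j in range(K)]
        score = [score[ptr[j]] * psi[ptr[j]][j] for j in range(K)]
        back.append(ptr)
    # backward pass: decode the maximizing configuration
    path = [max(range(K), key=lambda j: score[j])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))

best = max(itertools.product(range(K), repeat=L), key=path_weight)
print(abs(path_weight(max_product()) - path_weight(best)) < 1e-9)  # True
```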
41 Example: Hidden Markov Model Inference involves one forward and one backward pass Computational cost grows linearly with length of chain Similarly for the Kalman filter
42 Junction Tree Algorithm An efficient exact algorithm for a general graph applies to both directed and undirected graphs compile original graph into a tree of cliques then perform message passing on this tree Problem: cost is exponential in size of largest clique many interesting models have intractably large cliques
43 Loopy Belief Propagation Apply belief propagation directly to general graph need to keep iterating might not converge State-of-the-art performance in error-correcting codes
44 Junction Tree Algorithm Key steps: 1. Moralize 2. Absorb evidence 3. Triangulate 4. Construct junction tree of cliques 5. Pass messages to achieve consistency
45 Moralization There are algorithms which work with the original directed graph, but these turn out to be special cases of the junction tree algorithm. In the JT algorithm we first convert the directed graph into an undirected graph; directed and undirected graphs are then treated using the same approach. Suppose we are given a directed graph with conditionals p(x_i | pa_i), and we wish to find a representation as an undirected graph
46 Moralization (cont d) The conditionals are obvious candidates as clique potentials, but we need to ensure that each node belongs in the same clique as its parents This is achieved by adding, for each node, links connecting together all of the parents
47 Moralization (cont d) Moralization therefore consists of the following steps: 1. For each node in the graph, add an edge between every pair of parents of that node, and then convert directed edges to undirected edges 2. Initialize the clique potentials of the moral graph to 1 3. For each local conditional probability p(x_i | pa_i), choose a clique C such that C contains both x_i and pa_i, and multiply the potential of C by p(x_i | pa_i) Note that this undirected graph automatically has normalization factor Z = 1
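The edge-adding step can be sketched as a small graph transformation (node names hypothetical); for a v-structure the two parents become linked:

```python
import itertools

def moralize(parents):
    # parents: dict mapping each node to the list of its parents (a DAG)
    adj = {v: set() for v in parents}
    for v, pa in parents.items():
        for p in pa:
            adj[v].add(p)          # drop the direction of each edge
            adj[p].add(v)
        for p, q in itertools.combinations(pa, 2):
            adj[p].add(q)          # "marry" every pair of parents
            adj[q].add(p)
    return adj

# v-structure a -> c <- b: moralization links the parents a and b
moral = moralize({"a": [], "b": [], "c": ["a", "b"]})
print(sorted(moral["a"]))  # ['b', 'c']
```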
48 Moralization (cont d) By adding links we have discarded some conditional independencies However, any conditional independencies in the moral graph also hold for the original directed graph, so if we solve the inference problem for the moral graph we will solve it also for the directed graph
49 Absorbing Evidence The nodes can be grouped into visible nodes V, for which we have particular observed values, and hidden nodes H. We are interested in the conditional (posterior) probability of H given the observed values of V. Absorb evidence simply by altering the clique potentials to be zero for any configuration inconsistent with the evidence
50 Absorbing Evidence (cont d) We can view the joint of H with the observed values of V as an un-normalized version of the posterior over H, and hence the product of the modified clique potentials as an un-normalized version of the posterior
51 Local Consistency As it stands, the graph correctly represents the (unnormalized) joint distribution but the clique potentials do not have an interpretation as marginal probabilities Our goal is to update the clique potentials so that they acquire a local probabilistic interpretation while preserving the global distribution
52 Local Consistency (cont d) Note that we cannot simply write the joint as a product of clique marginals, as can be seen by considering the three node graph A - B - C: the product p(A, B) p(B, C) over-counts the shared node, since p(A, B, C) = p(A, B) p(B, C) / p(B)
53 Local Consistency (cont d) Instead we consider a more general representation for undirected graphs including separator sets
54 Local Consistency (cont d) Starting from our un-normalized representation of the joint in terms of products of clique potentials, we can introduce separator potentials, initially set to unity. Note that nodes can appear in more than one clique, and we require that the cliques agree on the marginals of shared nodes. Achieving consistency is central to the junction tree algorithm
55 Local Consistency (cont d) Consider the elemental problem of achieving consistency between a pair of cliques V and W, with separator set S. Initially the joint is represented as ψ_V ψ_W / φ_S, with the separator potential φ_S initialized to unity
56 Local Consistency (cont d) First construct a message at clique V and pass it to W: φ*_S = Σ_{V \ S} ψ_V, ψ*_W = (φ*_S / φ_S) ψ_W. Since ψ_V is unchanged, ψ_V ψ*_W / φ*_S = ψ_V ψ_W / φ_S, and so the joint distribution is invariant
57 Local Consistency (cont d) Next pass a message back from W to V using the same update rule: φ**_S = Σ_{W \ S} ψ*_W, ψ*_V = (φ**_S / φ*_S) ψ_V. Here ψ*_W is unchanged, and so again the joint distribution is unchanged. The marginals are now correct for both of the cliques and also for the separator
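The two message passes can be traced numerically for cliques V = {a, b} and W = {b, c} with separator S = {b}; the conditional probability tables below are invented:

```python
# psi_V = p(a) p(b|a), psi_W = p(c|b), phi_S = 1 (invented CPT numbers)
p_a = {0: 0.3, 1: 0.7}
p_b_a = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.4, 1: 0.6}}
p_c_b = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.1, 1: 0.9}}

psi_V = {(a, b): p_a[a] * p_b_a[a][b] for a in (0, 1) for b in (0, 1)}
psi_W = {(b, c): p_c_b[b][c] for b in (0, 1) for c in (0, 1)}
phi_S = {b: 1.0 for b in (0, 1)}

# message V -> W: new separator is the marginal of V; rescale W
new_phi = {b: sum(psi_V[(a, b)] for a in (0, 1)) for b in (0, 1)}
psi_W = {(b, c): psi_W[(b, c)] * new_phi[b] / phi_S[b]
         for b in (0, 1) for c in (0, 1)}
phi_S = new_phi

# message W -> V: same update rule in the reverse direction
new_phi = {b: sum(psi_W[(b, c)] for c in (0, 1)) for b in (0, 1)}
psi_V = {(a, b): psi_V[(a, b)] * new_phi[b] / phi_S[b]
         for a in (0, 1) for b in (0, 1)}
phi_S = new_phi

# both cliques now agree with the separator marginal p(b)
print(round(phi_S[0], 6), round(psi_W[(0, 0)], 6))  # 0.52 0.26
```

Here the second message is vacuous (the ratio is one), and psi_W has become the true marginal p(b, c).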
58 Local Consistency (cont d) Example: return to the earlier three node graph A - B - C. Initially the clique potentials are ψ_{AB} = p(A) p(B | A) and ψ_{BC} = p(C | B), and the separator potential φ_B = 1. The first message pass gives φ*_B = Σ_A ψ_{AB} = p(B) and ψ*_{BC} = p(B) p(C | B) = p(B, C), which are the correct marginals. In this case the second message is vacuous
59 Local Consistency (cont d) Now suppose that node A is observed, taking a fixed value. Absorb the evidence by setting ψ_{AB}(A, B) = 0 for all other values of A. Summing over A then gives a separator potential proportional to the joint of B with the observed value, and updating ψ_{BC} multiplies in this factor
60 Local Consistency (cont d) Hence the potentials after the first message pass are un-normalized versions of the conditionals given the evidence. Again the reverse message is vacuous. Note that the resulting clique and separator marginals require normalization (a local operation)
61 Global Consistency How can we extend our two-clique procedure to ensure consistency across the whole graph? We construct a clique tree by considering a spanning tree linking all of the cliques which is maximal with respect to the cardinality of the intersection sets Next we construct and pass messages using the following protocol: a clique can send a message to a neighbouring clique only when it has received messages from all of its neighbours
62 Global Consistency (cont d) In practice this can be achieved by designating one clique as root and then (i) collecting evidence by passing messages from the leaves to the root (ii) distributing evidence by propagating outwards from the root to the leaves
63 One Last Issue The algorithm discussed so far is not quite sufficient to guarantee consistency for an arbitrary graph. Consider the four node graph here, together with a maximal spanning clique tree. Node C appears in two non-adjacent cliques, so there is no guarantee that local consistency for C will result in global consistency
64 One Last Issue (cont d) The problem is resolved if the tree of cliques is a junction tree, i.e. if for every pair of cliques V and W, all cliques on the (unique) path from V to W contain V ∩ W (running intersection property). As a by-product we are also guaranteed that the (now consistent) clique potentials are indeed marginals
65 One Last Issue (cont d) How do we ensure that the maximal spanning tree of cliques will be a junction tree? Result: a graph has a junction tree if, and only if, it is triangulated, i.e. there are no chordless cycles of four or more nodes in the graph Example of a graph and its triangulated counterpart
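The triangulation condition can be checked directly on small graphs by brute force; a sketch (adjacency structures invented) testing a four-cycle before and after adding a chord:

```python
import itertools

def has_chordless_cycle_ge4(adj):
    # brute force: look for a cycle of length >= 4 with no chord (tiny graphs only)
    nodes = sorted(adj)
    for k in range(4, len(nodes) + 1):
        for sub in itertools.permutations(nodes, k):
            if sub[0] != min(sub):
                continue  # canonical start, avoids re-checking rotations
            edges_ok = all(sub[i + 1] in adj[sub[i]] for i in range(k - 1))
            if not edges_ok or sub[0] not in adj[sub[-1]]:
                continue
            # non-consecutive pairs on the cycle must be non-adjacent (no chord)
            chordless = all(
                (j - i == 1) or (i == 0 and j == k - 1) or sub[j] not in adj[sub[i]]
                for i, j in itertools.combinations(range(k), 2))
            if chordless:
                return True
    return False

square = {0: {1, 3}, 1: {0, 2}, 2: {1, 3}, 3: {0, 2}}         # chordless 4-cycle
chorded = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1, 3}, 3: {0, 2}}  # chord 0-2 added
print(has_chordless_cycle_ge4(square), has_chordless_cycle_ge4(chorded))  # True False
```

Only the second graph is triangulated, so only it admits a junction tree.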
66 Summary of Junction Tree Algorithm Key steps: 1. Moralize 2. Absorb evidence 3. Triangulate 4. Construct junction tree 5. Pass messages to achieve consistency
67 Example of JT Algorithm Original directed graph
68 Example of JT Algorithm (cont d) Moralization
69 Example of JT Algorithm (cont d) Undirected graph
70 Example of JT Algorithm (cont d) Triangulation
71 Example of JT Algorithm (cont d) Junction tree
72 Inference and Learning Data set D = {x_1, ..., x_N}. Likelihood function (independent observations): p(D | θ) = Π_n p(x_n | θ). Maximize the (log) likelihood to obtain θ_ML. Predictive distribution: p(x | θ_ML)
73 Regularized Maximum Likelihood Prior p(θ), posterior p(θ | D) ∝ p(D | θ) p(θ). MAP (maximum posterior): θ_MAP = argmax_θ p(D | θ) p(θ). Predictive distribution: p(x | θ_MAP). Not really Bayesian
74 Bayesian Learning Key idea is to marginalize over unknown parameters, rather than make point estimates avoids severe over-fitting of ML and MAP allows direct model comparison Parameters are now latent variables Bayesian learning is an inference problem!
75 Bayesian Learning
76 Bayesian Learning
77 The Exponential Family Many distributions can be written in the form p(x | η) = h(x) g(η) exp{ηᵀ u(x)}. Includes: Gaussian, Dirichlet, Gamma, multinomial, Wishart, Bernoulli. Building blocks in graphs to give rich probabilistic models
78 Illustration: the Gaussian Use precision (inverse variance) In standard form
79 Maximum Likelihood Likelihood function (independent observations) depends on the data only via sufficient statistics of fixed dimension
80 Conjugate Priors Prior has the same functional form as the likelihood, hence the posterior is of the same form with updated parameters. Can interpret the prior as contributing effective (pseudo-)observations. Examples: Gaussian for the mean of a Gaussian; Gaussian-Wishart for the mean and precision of a Gaussian; Dirichlet for the parameters of a discrete distribution
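The Beta prior for a Bernoulli likelihood is perhaps the simplest instance; in this sketch the prior counts and the data are invented, and the posterior update just adds observed counts to the prior's effective counts:

```python
# Beta(alpha, beta) prior on a Bernoulli parameter; after observing n1 ones
# and n0 zeros the posterior is Beta(alpha + n1, beta + n0).
alpha0, beta0 = 2.0, 2.0          # prior: 2 effective "heads", 2 effective "tails"
data = [1, 1, 0, 1, 1, 0, 1]      # invented observations

n1 = sum(data)
n0 = len(data) - n1
alpha_n, beta_n = alpha0 + n1, beta0 + n0
post_mean = alpha_n / (alpha_n + beta_n)
print(alpha_n, beta_n, round(post_mean, 4))  # 7.0 4.0 0.6364
```

The posterior mean interpolates between the prior mean (0.5) and the data frequency (5/7), weighted by the effective counts.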
81 EM and Variational Inference Roadmap: mixtures of Gaussians EM (informal derivation) lower bound viewpoint EM revisited variational inference
82 The Gaussian Distribution Multivariate Gaussian N(x | µ, Σ) with mean µ and covariance Σ. Maximum likelihood: µ_ML = (1/N) Σ_n x_n, Σ_ML = (1/N) Σ_n (x_n - µ_ML)(x_n - µ_ML)ᵀ
83 Gaussian Mixtures Linear super-position of Gaussians, p(x) = Σ_k π_k N(x | µ_k, Σ_k). Normalization and positivity require 0 ≤ π_k ≤ 1 and Σ_k π_k = 1
84 Example: Mixture of 3 Gaussians (a) (b)
85 Maximum Likelihood for the GMM Log likelihood function ln p(X) = Σ_n ln Σ_k π_k N(x_n | µ_k, Σ_k). The sum over components appears inside the log, so there is no closed-form ML solution
86 EM Algorithm Informal Derivation
87 EM Algorithm Informal Derivation M step equations: µ_k = (1/N_k) Σ_n γ_nk x_n, Σ_k = (1/N_k) Σ_n γ_nk (x_n - µ_k)(x_n - µ_k)ᵀ, π_k = N_k / N, where N_k = Σ_n γ_nk
88 EM Algorithm Informal Derivation E step equation: γ_nk = π_k N(x_n | µ_k, Σ_k) / Σ_j π_j N(x_n | µ_j, Σ_j)
89 Responsibilities Can interpret the mixing coefficients π_k as prior probabilities for the components; the γ_nk are then the corresponding posterior probabilities (responsibilities)
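Putting the E and M steps together, a self-contained sketch of EM for a two-component univariate mixture on synthetic data (all numbers, including the generating parameters, are invented for illustration):

```python
import math
import random

random.seed(0)
# synthetic 1-D data from two well-separated Gaussians
data = [random.gauss(-2.0, 0.5) for _ in range(100)] + \
       [random.gauss(3.0, 0.8) for _ in range(100)]

pi = [0.5, 0.5]          # mixing coefficients
mu = [-1.0, 1.0]         # initial means (deliberately wrong)
var = [1.0, 1.0]         # initial variances

def normal(x, m, v):
    return math.exp(-(x - m) ** 2 / (2.0 * v)) / math.sqrt(2.0 * math.pi * v)

for _ in range(50):
    # E step: responsibilities gamma[n][k] proportional to pi_k N(x_n | mu_k, var_k)
    gamma = []
    for x in data:
        w = [pi[k] * normal(x, mu[k], var[k]) for k in (0, 1)]
        s = sum(w)
        gamma.append([wk / s for wk in w])
    # M step: re-estimate parameters from responsibility-weighted statistics
    for k in (0, 1):
        Nk = sum(g[k] for g in gamma)
        mu[k] = sum(g[k] * x for g, x in zip(gamma, data)) / Nk
        var[k] = sum(g[k] * (x - mu[k]) ** 2 for g, x in zip(gamma, data)) / Nk
        pi[k] = Nk / len(data)

print(sorted(mu))   # close to the generating means -2 and 3
```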
90 Old Faithful Data Set Time between eruptions (minutes) Duration of eruption (minutes)
97 Over-fitting in Gaussian Mixture Models Infinities in the likelihood function when a component collapses onto a data point: µ_k = x_n with σ_k -> 0 drives the likelihood to infinity. Also, maximum likelihood cannot determine the number of components
98 Latent Variable View of EM To sample from a Gaussian mixture: first pick one of the components with probability π_k, then draw a sample from that component; repeat these two steps for each new data point
99 Latent Variable View of EM Goal is to solve the inverse problem: given a data set, find the mixture parameters. Suppose we knew the colours (component assignments): maximum likelihood would then involve fitting each component to the corresponding cluster. Problem: the colours are latent (hidden) variables
100 Incomplete and Complete Data (a) complete (b) incomplete
101 Latent Variable Viewpoint Binary latent variables z_nk describing which component generated each data point. Example: 3 components and 5 data points
102 Latent Variable Viewpoint Conditional distribution of the observed variable p(x | z); prior distribution of the latent variables p(z). Marginalizing over the latent variables we obtain the mixture distribution p(x) = Σ_z p(z) p(x | z)
103 Graphical Representation of GMM (plate diagram: latent z_n and observed x_n, repeated N times)
104 Latent Variable View of EM Suppose we knew the values for the latent variables maximize the complete-data log likelihood trivial closed-form solution: fit each component to the corresponding set of data points
105 Latent Variable View of EM Problem: we don't know the values of the latent variables. Instead maximize the expected value of the complete-data log likelihood, with the expectation taken under the posterior of the latent variables given the current parameters. This gives the EM algorithm. In summary: maximize the log of the joint distribution of latent and observed variables, averaged w.r.t. the posterior distribution of the latent variables
106 Posterior Probabilities (colour coded) (b) (a)
107 Lower Bound on Model Evidence For an arbitrary distribution q(Z), ln p(X) = L(q) + KL(q || p(Z | X)), where L(q) = Σ_Z q(Z) ln[p(X, Z) / q(Z)]. Maximizing L over a free-form q would give the true posterior
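The decomposition ln p(X) = L(q) + KL(q || p(Z | X)) can be checked numerically for a single binary latent variable; the joint values below are invented:

```python
import math

p_joint = {0: 0.12, 1: 0.28}   # invented p(x, z) at the observed x
px = sum(p_joint.values())
log_px = math.log(px)
post = {z: p / px for z, p in p_joint.items()}   # true posterior p(z | x)

def elbo(q):
    # L(q) = sum_z q(z) ln [ p(x, z) / q(z) ]
    return sum(q[z] * math.log(p_joint[z] / q[z]) for z in q)

def kl_to_post(q):
    # KL(q || p(z | x))
    return sum(q[z] * math.log(q[z] / post[z]) for z in q)

q = {0: 0.5, 1: 0.5}
gap = log_px - (elbo(q) + kl_to_post(q))   # zero for any valid q
print(gap, round(elbo(post) - log_px, 12))  # bound is tight at q = posterior
```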
108 Variational Lower Bound
109 EM: Variational Viewpoint (cont d) If we maximize L(q) with respect to a free-form distribution q(Z) we obtain q(Z) = p(Z | X, θ), which is the true posterior distribution. The lower bound then becomes the expectation under this posterior of ln p(X, Z | θ), which, as a function of θ, is the expected complete-data log likelihood (up to an additive constant)
110 Initial Configuration
111 E-step
112 M-step
113 KL Divergence
114 KL Divergence
117 Bayesian Learning Introduce prior distributions over parameters Equivalent to graph with additional hidden variables Learning becomes inference on the expanded graph No distinction between variables and parameters
118 Bayesian Mixture of Gaussians Parameters and latent variables appear on an equal footing. Conjugate priors. (plate diagram: latent z_n and observed x_n, repeated N times)
120 Explaining Away
121 Lower Bound on Model Evidence For an arbitrary distribution q(Z), ln p(X) = L(q) + KL(q || p(Z | X)), where L(q) = Σ_Z q(Z) ln[p(X, Z) / q(Z)]. Maximizing L over a free-form q would give the true posterior
122 Variational Lower Bound
123 Variational Inference The KL divergence vanishes when q equals the true posterior, but by assumption the exact posterior is intractable. We therefore restrict attention to a family of distributions that are both sufficiently simple to be tractable and sufficiently flexible to give a good approximation to the true posterior. One approach is to use a parametric family
124 Factorized Approximation Here we consider factorized distributions q(Z) = Π_i q_i(Z_i). No further assumptions are required!
125 Factorized Approximation
126 Factorized Approximation Optimal solution for one factor, keeping the remainder fixed: ln q*_j(Z_j) = E_{i ≠ j}[ln p(X, Z)] + const. The solutions are coupled, so initialize and then cyclically update. Message-passing view (Winn and Bishop, 2004, to appear in JMLR)
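As a concrete instance of these coupled updates, a hedged sketch (numbers invented) of mean-field coordinate updates for a 2-D Gaussian target with precision matrix Λ: the factor means have closed-form updates and converge to the true mean, while the converged factor variances 1/Λ_ii underestimate the true marginal variances.

```python
# cyclic mean-field updates for a factorized approximation q(x1) q(x2)
# to a 2-D Gaussian with mean mu and (invented) precision matrix Lam
mu = [1.0, -1.0]
Lam = [[2.0, 0.8], [0.8, 1.5]]   # symmetric, positive definite

m = [0.0, 0.0]                   # variational means, initialized away from mu
for _ in range(100):
    # closed-form update for each factor, holding the other fixed
    m[0] = mu[0] - (Lam[0][1] / Lam[0][0]) * (m[1] - mu[1])
    m[1] = mu[1] - (Lam[1][0] / Lam[1][1]) * (m[0] - mu[0])

v = [1.0 / Lam[0][0], 1.0 / Lam[1][1]]   # converged factor variances

# true marginal variance of x1 (from the inverse of Lam), for comparison
det = Lam[0][0] * Lam[1][1] - Lam[0][1] * Lam[1][0]
true_var1 = Lam[1][1] / det
print(m, v[0] < true_var1)  # means converge to mu; variance is underestimated
```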
127 128 (figure-only slides: contours over x_1, x_2 comparing a correlated distribution with its factorized approximation, panels (a) and (b))
129 Illustration: Univariate Gaussian Likelihood function for a Gaussian with mean µ and precision τ. Conjugate prior: Gaussian-Gamma. Factorized variational distribution q(µ, τ) = q_µ(µ) q_τ(τ)
130 Initial Configuration (figure: contours of the variational posterior over µ and τ, panel (a))
131 After Updating (figure: panel (b))
132 After Updating (figure: panel (c))
133 Converged Solution (figure: panel (d))
134 Bayesian Model Complexity Consider multiple models {M_i} with prior probabilities p(M_i) and an observed data set D. Posterior probabilities p(M_i | D) ∝ p(D | M_i) p(M_i). If the prior probabilities are equal, models are ranked by their evidence p(D | M_i)
135 Lower Bound The bound L can also be evaluated explicitly. Useful for maths/code verification (it should increase monotonically during optimization). Also useful for model comparison, since it approximates the log evidence
136 Variational Mixture of Gaussians Assume a posterior distribution that factorizes between the latent variables and the parameters. No other approximations needed!
137 Variational Equations for GMM
138 Lower Bound for GMM
139 Bound vs. K for Old Faithful Data
140 Bayesian Model Complexity
142 Sparse Bayes for Gaussian Mixture Corduneanu and Bishop (2001) Start with a large number of components; treat the mixing coefficients as parameters; maximize the marginal likelihood; this prunes out excess components
145 Conventional PCA Minimize the sum-of-squares reconstruction error. Solution given by the eigenvectors of the data covariance matrix
146 Probabilistic PCA Tipping and Bishop (1998) Generative latent variable model x = W z + µ + ε. Maximum likelihood solution given by the eigen-spectrum of the sample covariance
147 153 EM for PCA (figure sequence, panels (a) to (g), showing successive EM iterations)
154 Bayesian PCA Bishop (1998) Gaussian prior over the columns of W; automatic relevance determination (ARD) prunes surplus components. (Figures compare ML PCA with Bayesian PCA.)
155 Bayesian Mixture of BPCA Models Bishop and Winn (2000) (plate diagram: component matrices W_m, latent variables, and data x_n, with plates over N data points and M components)
157 VIBES Variational Inference for Bayesian Networks Winn and Bishop (1999, 2003, 2004) A general inference engine using variational methods VIBES available from:
158 VIBES (cont d) A key observation is that, in the general variational solution, the update for a particular node (or group of nodes) depends only on the other nodes in its Markov blanket. This permits a local message-passing implementation which is independent of the particular graph structure
159 VIBES (cont d)
160 VIBES (cont d)
161 VIBES (cont d)
162 Structured Variational Inference Example: factorial HMM
163 Variational Approximation
164 Viewgraphs and papers:
More informationInference in Bayesian Networks
Andrea Passerini passerini@disi.unitn.it Machine Learning Inference in graphical models Description Assume we have evidence e on the state of a subset of variables E in the model (i.e. Bayesian Network)
More informationLecture 6: Graphical Models: Learning
Lecture 6: Graphical Models: Learning 4F13: Machine Learning Zoubin Ghahramani and Carl Edward Rasmussen Department of Engineering, University of Cambridge February 3rd, 2010 Ghahramani & Rasmussen (CUED)
More informationBrief Introduction of Machine Learning Techniques for Content Analysis
1 Brief Introduction of Machine Learning Techniques for Content Analysis Wei-Ta Chu 2008/11/20 Outline 2 Overview Gaussian Mixture Model (GMM) Hidden Markov Model (HMM) Support Vector Machine (SVM) Overview
More informationVariational Scoring of Graphical Model Structures
Variational Scoring of Graphical Model Structures Matthew J. Beal Work with Zoubin Ghahramani & Carl Rasmussen, Toronto. 15th September 2003 Overview Bayesian model selection Approximations using Variational
More informationLecture 15. Probabilistic Models on Graph
Lecture 15. Probabilistic Models on Graph Prof. Alan Yuille Spring 2014 1 Introduction We discuss how to define probabilistic models that use richly structured probability distributions and describe how
More informationCheng Soon Ong & Christian Walder. Canberra February June 2018
Cheng Soon Ong & Christian Walder Research Group and College of Engineering and Computer Science Canberra February June 2018 Outlines Overview Introduction Linear Algebra Probability Linear Regression
More informationProbabilistic Graphical Models. Guest Lecture by Narges Razavian Machine Learning Class April
Probabilistic Graphical Models Guest Lecture by Narges Razavian Machine Learning Class April 14 2017 Today What is probabilistic graphical model and why it is useful? Bayesian Networks Basic Inference
More informationMachine Learning Overview
Machine Learning Overview Sargur N. Srihari University at Buffalo, State University of New York USA 1 Outline 1. What is Machine Learning (ML)? 2. Types of Information Processing Problems Solved 1. Regression
More informationPATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 2: PROBABILITY DISTRIBUTIONS
PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 2: PROBABILITY DISTRIBUTIONS Parametric Distributions Basic building blocks: Need to determine given Representation: or? Recall Curve Fitting Binary Variables
More informationExact Inference I. Mark Peot. In this lecture we will look at issues associated with exact inference. = =
Exact Inference I Mark Peot In this lecture we will look at issues associated with exact inference 10 Queries The objective of probabilistic inference is to compute a joint distribution of a set of query
More informationPROBABILITY DISTRIBUTIONS. J. Elder CSE 6390/PSYC 6225 Computational Modeling of Visual Perception
PROBABILITY DISTRIBUTIONS Credits 2 These slides were sourced and/or modified from: Christopher Bishop, Microsoft UK Parametric Distributions 3 Basic building blocks: Need to determine given Representation:
More informationBasic Sampling Methods
Basic Sampling Methods Sargur Srihari srihari@cedar.buffalo.edu 1 1. Motivation Topics Intractability in ML How sampling can help 2. Ancestral Sampling Using BNs 3. Transforming a Uniform Distribution
More informationIntroduction to Graphical Models
Introduction to Graphical Models The 15 th Winter School of Statistical Physics POSCO International Center & POSTECH, Pohang 2018. 1. 9 (Tue.) Yung-Kyun Noh GENERALIZATION FOR PREDICTION 2 Probabilistic
More informationLecture 13 : Variational Inference: Mean Field Approximation
10-708: Probabilistic Graphical Models 10-708, Spring 2017 Lecture 13 : Variational Inference: Mean Field Approximation Lecturer: Willie Neiswanger Scribes: Xupeng Tong, Minxing Liu 1 Problem Setup 1.1
More informationIntelligent Systems:
Intelligent Systems: Undirected Graphical models (Factor Graphs) (2 lectures) Carsten Rother 15/01/2015 Intelligent Systems: Probabilistic Inference in DGM and UGM Roadmap for next two lectures Definition
More informationLecture 17: May 29, 2002
EE596 Pat. Recog. II: Introduction to Graphical Models University of Washington Spring 2000 Dept. of Electrical Engineering Lecture 17: May 29, 2002 Lecturer: Jeff ilmes Scribe: Kurt Partridge, Salvador
More information27 : Distributed Monte Carlo Markov Chain. 1 Recap of MCMC and Naive Parallel Gibbs Sampling
10-708: Probabilistic Graphical Models 10-708, Spring 2014 27 : Distributed Monte Carlo Markov Chain Lecturer: Eric P. Xing Scribes: Pengtao Xie, Khoa Luu In this scribe, we are going to review the Parallel
More informationApproximate Inference Part 1 of 2
Approximate Inference Part 1 of 2 Tom Minka Microsoft Research, Cambridge, UK Machine Learning Summer School 2009 http://mlg.eng.cam.ac.uk/mlss09/ Bayesian paradigm Consistent use of probability theory
More informationIntroduction to Probabilistic Graphical Models
Introduction to Probabilistic Graphical Models Sargur Srihari srihari@cedar.buffalo.edu 1 Topics 1. What are probabilistic graphical models (PGMs) 2. Use of PGMs Engineering and AI 3. Directionality in
More information6.047 / Computational Biology: Genomes, Networks, Evolution Fall 2008
MIT OpenCourseWare http://ocw.mit.edu 6.047 / 6.878 Computational Biology: Genomes, Networks, Evolution Fall 2008 For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.
More informationIntroduction to Machine Learning Midterm, Tues April 8
Introduction to Machine Learning 10-701 Midterm, Tues April 8 [1 point] Name: Andrew ID: Instructions: You are allowed a (two-sided) sheet of notes. Exam ends at 2:45pm Take a deep breath and don t spend
More informationCPSC 540: Machine Learning
CPSC 540: Machine Learning Undirected Graphical Models Mark Schmidt University of British Columbia Winter 2016 Admin Assignment 3: 2 late days to hand it in today, Thursday is final day. Assignment 4:
More informationLecture 21: Spectral Learning for Graphical Models
10-708: Probabilistic Graphical Models 10-708, Spring 2016 Lecture 21: Spectral Learning for Graphical Models Lecturer: Eric P. Xing Scribes: Maruan Al-Shedivat, Wei-Cheng Chang, Frederick Liu 1 Motivation
More informationNPFL108 Bayesian inference. Introduction. Filip Jurčíček. Institute of Formal and Applied Linguistics Charles University in Prague Czech Republic
NPFL108 Bayesian inference Introduction Filip Jurčíček Institute of Formal and Applied Linguistics Charles University in Prague Czech Republic Home page: http://ufal.mff.cuni.cz/~jurcicek Version: 21/02/2014
More informationIntroduction to Probabilistic Graphical Models
Introduction to Probabilistic Graphical Models Kyu-Baek Hwang and Byoung-Tak Zhang Biointelligence Lab School of Computer Science and Engineering Seoul National University Seoul 151-742 Korea E-mail: kbhwang@bi.snu.ac.kr
More informationLecture 10. Announcement. Mixture Models II. Topics of This Lecture. This Lecture: Advanced Machine Learning. Recap: GMMs as Latent Variable Models
Advanced Machine Learning Lecture 10 Mixture Models II 30.11.2015 Bastian Leibe RWTH Aachen http://www.vision.rwth-aachen.de/ Announcement Exercise sheet 2 online Sampling Rejection Sampling Importance
More informationThe Particle Filter. PD Dr. Rudolph Triebel Computer Vision Group. Machine Learning for Computer Vision
The Particle Filter Non-parametric implementation of Bayes filter Represents the belief (posterior) random state samples. by a set of This representation is approximate. Can represent distributions that
More informationAn Introduction to Bayesian Machine Learning
1 An Introduction to Bayesian Machine Learning José Miguel Hernández-Lobato Department of Engineering, Cambridge University April 8, 2013 2 What is Machine Learning? The design of computational systems
More informationMassachusetts Institute of Technology Department of Electrical Engineering and Computer Science Algorithms For Inference Fall 2014
Massachusetts Institute of Technology Department of Electrical Engineering and Computer Science 6.438 Algorithms For Inference Fall 2014 Problem Set 3 Issued: Thursday, September 25, 2014 Due: Thursday,
More informationbound on the likelihood through the use of a simpler variational approximating distribution. A lower bound is particularly useful since maximization o
Category: Algorithms and Architectures. Address correspondence to rst author. Preferred Presentation: oral. Variational Belief Networks for Approximate Inference Wim Wiegerinck David Barber Stichting Neurale
More informationDirected and Undirected Graphical Models
Directed and Undirected Davide Bacciu Dipartimento di Informatica Università di Pisa bacciu@di.unipi.it Machine Learning: Neural Networks and Advanced Models (AA2) Last Lecture Refresher Lecture Plan Directed
More information13 : Variational Inference: Loopy Belief Propagation and Mean Field
10-708: Probabilistic Graphical Models 10-708, Spring 2012 13 : Variational Inference: Loopy Belief Propagation and Mean Field Lecturer: Eric P. Xing Scribes: Peter Schulam and William Wang 1 Introduction
More informationApproximate Inference Part 1 of 2
Approximate Inference Part 1 of 2 Tom Minka Microsoft Research, Cambridge, UK Machine Learning Summer School 2009 http://mlg.eng.cam.ac.uk/mlss09/ 1 Bayesian paradigm Consistent use of probability theory
More informationBayesian Models in Machine Learning
Bayesian Models in Machine Learning Lukáš Burget Escuela de Ciencias Informáticas 2017 Buenos Aires, July 24-29 2017 Frequentist vs. Bayesian Frequentist point of view: Probability is the frequency of
More informationDirected Graphical Models
CS 2750: Machine Learning Directed Graphical Models Prof. Adriana Kovashka University of Pittsburgh March 28, 2017 Graphical Models If no assumption of independence is made, must estimate an exponential
More informationLecture 16 Deep Neural Generative Models
Lecture 16 Deep Neural Generative Models CMSC 35246: Deep Learning Shubhendu Trivedi & Risi Kondor University of Chicago May 22, 2017 Approach so far: We have considered simple models and then constructed
More informationLecture 8: Bayesian Networks
Lecture 8: Bayesian Networks Bayesian Networks Inference in Bayesian Networks COMP-652 and ECSE 608, Lecture 8 - January 31, 2017 1 Bayes nets P(E) E=1 E=0 0.005 0.995 E B P(B) B=1 B=0 0.01 0.99 E=0 E=1
More informationGraphical Models for Collaborative Filtering
Graphical Models for Collaborative Filtering Le Song Machine Learning II: Advanced Topics CSE 8803ML, Spring 2012 Sequence modeling HMM, Kalman Filter, etc.: Similarity: the same graphical model topology,
More informationVariational Message Passing
Journal of Machine Learning Research 5 (2004)?-? Submitted 2/04; Published?/04 Variational Message Passing John Winn and Christopher M. Bishop Microsoft Research Cambridge Roger Needham Building 7 J. J.
More informationProbabilistic Time Series Classification
Probabilistic Time Series Classification Y. Cem Sübakan Boğaziçi University 25.06.2013 Y. Cem Sübakan (Boğaziçi University) M.Sc. Thesis Defense 25.06.2013 1 / 54 Problem Statement The goal is to assign
More informationA Brief Introduction to Graphical Models. Presenter: Yijuan Lu November 12,2004
A Brief Introduction to Graphical Models Presenter: Yijuan Lu November 12,2004 References Introduction to Graphical Models, Kevin Murphy, Technical Report, May 2001 Learning in Graphical Models, Michael
More informationGenerative and Discriminative Approaches to Graphical Models CMSC Topics in AI
Generative and Discriminative Approaches to Graphical Models CMSC 35900 Topics in AI Lecture 2 Yasemin Altun January 26, 2007 Review of Inference on Graphical Models Elimination algorithm finds single
More informationDirected Probabilistic Graphical Models CMSC 678 UMBC
Directed Probabilistic Graphical Models CMSC 678 UMBC Announcement 1: Assignment 3 Due Wednesday April 11 th, 11:59 AM Any questions? Announcement 2: Progress Report on Project Due Monday April 16 th,
More informationHidden Markov Models. Aarti Singh Slides courtesy: Eric Xing. Machine Learning / Nov 8, 2010
Hidden Markov Models Aarti Singh Slides courtesy: Eric Xing Machine Learning 10-701/15-781 Nov 8, 2010 i.i.d to sequential data So far we assumed independent, identically distributed data Sequential data
More informationA Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models
A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models Jeff A. Bilmes (bilmes@cs.berkeley.edu) International Computer Science Institute
More informationJunction Tree, BP and Variational Methods
Junction Tree, BP and Variational Methods Adrian Weller MLSALT4 Lecture Feb 21, 2018 With thanks to David Sontag (MIT) and Tony Jebara (Columbia) for use of many slides and illustrations For more information,
More informationProbabilistic Graphical Models
2016 Robert Nowak Probabilistic Graphical Models 1 Introduction We have focused mainly on linear models for signals, in particular the subspace model x = Uθ, where U is a n k matrix and θ R k is a vector
More informationProbabilistic Graphical Models
Probabilistic Graphical Models Brown University CSCI 2950-P, Spring 2013 Prof. Erik Sudderth Lecture 13: Learning in Gaussian Graphical Models, Non-Gaussian Inference, Monte Carlo Methods Some figures
More informationBayesian Networks BY: MOHAMAD ALSABBAGH
Bayesian Networks BY: MOHAMAD ALSABBAGH Outlines Introduction Bayes Rule Bayesian Networks (BN) Representation Size of a Bayesian Network Inference via BN BN Learning Dynamic BN Introduction Conditional
More informationIntroduction to Probabilistic Graphical Models: Exercises
Introduction to Probabilistic Graphical Models: Exercises Cédric Archambeau Xerox Research Centre Europe cedric.archambeau@xrce.xerox.com Pascal Bootcamp Marseille, France, July 2010 Exercise 1: basics
More informationNon-Parametric Bayes
Non-Parametric Bayes Mark Schmidt UBC Machine Learning Reading Group January 2016 Current Hot Topics in Machine Learning Bayesian learning includes: Gaussian processes. Approximate inference. Bayesian
More information