Bayesian Learning in Undirected Graphical Models

Bayesian Learning in Undirected Graphical Models
Zoubin Ghahramani
Gatsby Computational Neuroscience Unit, University College London, UK
http://www.gatsby.ucl.ac.uk/
Work with: Iain Murray and Hyun-Chul Kim
Max Planck Institute, Tübingen, Sept 2003

Undirected Graphical Models

An Undirected Graphical Model (UGM, or Markov network) is a graphical representation of the dependence relationships between a set of random variables. In a UGM, the joint probability over M variables x = [x_1, \dots, x_M] can be written in factored form:

p(x) = \frac{1}{Z} \prod_{j=1}^{J} g_j(x_{C_j})

Here the g_j are non-negative potential functions over subsets of variables C_j \subseteq \{1, \dots, M\}, with the notation x_S \equiv [x_m : m \in S]. The normalization constant (a.k.a. partition function) is

Z = \sum_x \prod_{j=1}^{J} g_j(x_{C_j})

We represent this type of probabilistic model graphically. Graph definition: let each variable be a node, and connect nodes i and k if there exists a set C_j such that both i \in C_j and k \in C_j. These sets form the cliques of the graph (fully connected subgraphs).

Undirected Graphical Models: An Example

Five variables A, B, C, D, E with

p(A, B, C, D, E) = \frac{1}{Z}\, g(A, C)\, g(B, C, D)\, g(C, D, E)

Markov property: every node is conditionally independent of its non-neighbours given its neighbours.

Conditional independence: A \perp B \mid C means p(A \mid B, C) = p(A \mid C) for p(B, C) > 0; equivalently, A \perp B \mid C means p(A, B \mid C) = p(A \mid C)\, p(B \mid C).
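
To make the factorization concrete, here is a minimal sketch that normalizes this example model by brute force, assuming binary variables and randomly filled potential tables g1, g2, g3 (the tables are illustrative, not from the talk):

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
g1 = rng.random((2, 2))        # g1(A, C)
g2 = rng.random((2, 2, 2))     # g2(B, C, D)
g3 = rng.random((2, 2, 2))     # g3(C, D, E)

def unnormalized(a, b, c, d, e):
    return g1[a, c] * g2[b, c, d] * g3[c, d, e]

# Partition function: sum the product of potentials over all 2^5 joint configurations.
Z = sum(unnormalized(*x) for x in itertools.product([0, 1], repeat=5))

def p(a, b, c, d, e):
    return unnormalized(a, b, c, d, e) / Z

# Sanity check: the normalized distribution sums to one.
assert abs(sum(p(*x) for x in itertools.product([0, 1], repeat=5)) - 1.0) < 1e-12
```

Enumeration is only feasible for a handful of variables; computing Z in general is exactly the difficulty the rest of the talk is about.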

Applications of Undirected Graphical Models

- Markov random fields in vision and bioinformatics
- Conditional random fields and exponential language models, e.g. p(s) = \frac{1}{Z} p_0(s) \exp\{ \sum_i \lambda_i f_i(s) \}
- Products of experts: p(x) = \frac{1}{Z} \prod_j p_j(x \mid \theta_j)
- Semi-supervised learning
- Boltzmann machines

Boltzmann Machines

Undirected graph over a vector of binary variables s_i \in \{0, 1\}. Variables can be hidden or visible (observed).

p(s \mid W) = \frac{1}{Z} \exp\Big\{ \sum_{j<i} W_{ij} s_i s_j \Big\}

where Z is the partition function (normalizer),

Z = \sum_s \exp\Big\{ \sum_{j<i} W_{ij} s_i s_j \Big\}

Usual learning algorithm: maximum likelihood, using approximate EM when there are hidden variables.
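
For reference, a small sketch that computes log Z(W) and log p(s | W) exactly by enumeration, assuming a fully visible Boltzmann machine with symmetric W and zero diagonal; this is only feasible for small M, but it serves as ground truth for the approximations discussed later:

```python
import itertools
import numpy as np

def log_Z(W):
    """Exact log partition function of a small Boltzmann machine by enumeration."""
    M = W.shape[0]
    L = np.tril(W, -1)                       # count each pair (i, j), j < i, once
    log_terms = [np.sum(L * np.outer(s, s))
                 for s in itertools.product([0, 1], repeat=M)]
    return np.logaddexp.reduce(log_terms)

def log_p(s, W):
    """log p(s | W) for a fully visible configuration s."""
    s = np.asarray(s, dtype=float)
    return np.sum(np.tril(W, -1) * np.outer(s, s)) - log_Z(W)

rng = np.random.default_rng(1)
M = 6
W = rng.normal(size=(M, M))
W = (W + W.T) / 2
np.fill_diagonal(W, 0.0)
print(np.exp(log_p([1, 0, 1, 1, 0, 0], W)))
```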

Bayesian Learning

Prior over parameters: p(W).

Posterior over parameters, given a data set S = \{s^{(1)}, \dots, s^{(N)}\}:

p(W \mid S) = \frac{p(W)\, p(S \mid W)}{p(S)}

Model comparison (for example between two different graph structures m and m') using Bayes factors:

\frac{p(m \mid S)}{p(m' \mid S)} = \frac{p(m)}{p(m')}\, \frac{p(S \mid m)}{p(S \mid m')}

where the marginal likelihood is

p(S \mid m) = \int p(S \mid W, m)\, p(W \mid m)\, dW

Why Bayesian Learning?

- Useful prior knowledge can be included (e.g. sparsity, domain knowledge)
- More robust to overfitting (because nothing needs to be fit)
- Error bars on all parameters and predictions
- Model and feature selection

A Simple Idea

Define the following joint distribution over weights W and a matrix of binary variables S, organized into N rows (data vectors) and M columns (features/variables). Some variables on some data points may be hidden and some may be observed.

p(S, W) = \frac{1}{Z} \exp\Big\{ -\frac{1}{2\sigma^2} \sum_{i,j=1}^{M} W_{ij}^2 + \sum_{n=1}^{N} \sum_{j<i} W_{ij} s_{ni} s_{nj} \Big\}

where Z = \int dW \sum_S \exp\{\cdots\} is a nasty partition function.

Gibbs sampling in this model is very easy:
- Gibbs sample s_{ni} given all other s and W: Bernoulli, easy as usual.
- Gibbs sample W given S: diagonal multivariate Gaussian, easy as well.

What is wrong with this approach?
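
A minimal sketch of this Gibbs sampler, with illustrative choices (random placeholder "data", sigma = 1, strictly lower-triangular weight parametrization); as the next slide explains, this joint is not the intended hierarchical model, so the sketch is purely a demonstration of how easy the two conditionals are:

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, sigma = 50, 8, 1.0

S = rng.integers(0, 2, size=(N, M)).astype(float)  # placeholder binary matrix; in practice
                                                   # observed entries would be clamped
W = np.zeros((M, M))                               # strictly lower-triangular free weights

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for sweep in range(100):
    W_sym = W + W.T                                # symmetric view, zero diagonal
    # Gibbs step for each s_ni given everything else: Bernoulli.
    for n in range(N):
        for i in range(M):
            logit = W_sym[i] @ S[n]
            S[n, i] = float(rng.random() < sigmoid(logit))
    # Gibbs step for each weight W_ij (j < i) given S: univariate Gaussian.
    C = S.T @ S                                    # C[i, j] = sum_n s_ni s_nj
    for i in range(M):
        for j in range(i):
            W[i, j] = rng.normal(loc=sigma**2 * C[i, j], scale=sigma)
```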

...a Strange Prior on W

p(S, W) = \frac{1}{Z} \exp\Big\{ -\frac{1}{2\sigma^2} \sum_{i,j=1}^{M} W_{ij}^2 + \sum_{n=1}^{N} \sum_{j<i} W_{ij} s_{ni} s_{nj} \Big\}

This defines a Boltzmann machine for the data given W, but it defines a somewhat strange and hard-to-compute prior on the weights. What is the prior on W?

p(W) = \sum_S p(S, W) \propto \mathcal{N}(0, \sigma^2 I) \sum_S \exp\Big\{ \sum_{n, j<i} W_{ij} s_{ni} s_{nj} \Big\}

where the second factor is data-size dependent, so it is not a valid hierarchical Bayesian model of the kind W \to S. The second factor can be written as

\sum_S \exp\Big\{ \sum_{n, j<i} W_{ij} s_{ni} s_{nj} \Big\} = \Big[ \sum_s \exp\Big\{ \sum_{j<i} W_{ij} s_i s_j \Big\} \Big]^N = Z(W)^N

This will not work!

Three Families of Approximations

In order to do Bayesian inference in undirected models with nontrivial partition functions, we can develop three classes of methods.

Approximate the partition function:

Z(W) = \sum_s \exp\Big\{ \sum_{j<i} W_{ij} s_i s_j \Big\}

Approximate a ratio of partition functions:

\frac{Z(W')}{Z(W)} = \sum_s p(s \mid W) \exp\Big\{ \sum_{j<i} (W'_{ij} - W_{ij}) s_i s_j \Big\}

Approximate the gradients:

\frac{\partial \ln Z(W)}{\partial W_{ij}} = \sum_s p(s \mid W)\, s_i s_j

The above quantities can be approximated using modern tools developed in the machine learning, statistics and physics communities. Surprisingly, none of the following methods have been explored!

I. Metropolis with Nested Sampling

Simplest sampling approach: Metropolis sampling.
- Start with a random weight matrix W.
- Perturb it with a small-radius Gaussian proposal distribution, W \to W'.
- Accept the change with probability \min[1, a], where

a = \frac{p(S \mid W')\, p(W')}{p(S \mid W)\, p(W)} = \Big( \frac{Z(W)}{Z(W')} \Big)^{N} \exp\Big\{ \sum_{n, j<i} (W'_{ij} - W_{ij})\, s^{(n)}_i s^{(n)}_j \Big\} \frac{p(W')}{p(W)}

The partition function ratio is nasty, but one can estimate it using an MCMC sampling inner loop:

\frac{Z(W')}{Z(W)} = \frac{\sum_s \exp\{ \sum_{j<i} W'_{ij} s_i s_j \}}{\sum_s \exp\{ \sum_{j<i} W_{ij} s_i s_j \}} = \Big\langle \exp\Big\{ \sum_{j<i} (W'_{ij} - W_{ij}) s_i s_j \Big\} \Big\rangle_{p(s \mid W)}

Too slow: the inner loop can take exponential time.
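
A sketch of this scheme, assuming a fully visible Boltzmann machine with symmetric weights, an N(0, sigma_prior^2) prior on each weight, and an inner Gibbs chain to form the Monte Carlo estimate of the ratio; step sizes, chain lengths and the prior are illustrative, not values from the talk:

```python
import numpy as np

rng = np.random.default_rng(0)

def gibbs_samples(W, n_samples, burn=200):
    """Draw samples from p(s | W); W symmetric with zero diagonal."""
    M = W.shape[0]
    s = rng.integers(0, 2, size=M).astype(float)
    out = []
    for t in range(burn + n_samples):
        for i in range(M):
            logit = W[i] @ s
            s[i] = float(rng.random() < 1.0 / (1.0 + np.exp(-logit)))
        if t >= burn:
            out.append(s.copy())
    return np.array(out)

def estimated_log_ratio(W_new, W_old, n_samples=500):
    """Monte Carlo estimate of log[Z(W_new) / Z(W_old)] from samples of p(s | W_old)."""
    samples = gibbs_samples(W_old, n_samples)
    dW = np.tril(W_new - W_old, -1)                 # each pair (i, j), j < i, once
    vals = np.einsum('ij,ni,nj->n', dW, samples, samples)
    return np.log(np.mean(np.exp(vals)))

def metropolis_step(W, S, sigma_prior=1.0, step=0.05):
    """One Metropolis update of the full weight matrix given data S (N x M)."""
    N = S.shape[0]
    eps = np.tril(rng.normal(scale=step, size=W.shape), -1)
    W_new = W + eps + eps.T                         # symmetric proposal, zero diagonal
    dW = np.tril(W_new - W, -1)
    log_lik = np.einsum('ij,ni,nj->', dW, S, S)     # sum_n sum_{j<i} (W'_ij - W_ij) s_ni s_nj
    log_prior = (np.sum(W**2) - np.sum(W_new**2)) / (2 * sigma_prior**2)
    log_a = -N * estimated_log_ratio(W_new, W) + log_lik + log_prior
    return W_new if np.log(rng.random()) < log_a else W
```

The inner gibbs_samples call is exactly the expensive nested loop the slide warns about; the deterministic approximations below replace it.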

II. Naive Mean-Field Metropolis

Same as above, but use naive mean-field to estimate the partition function. Jensen's inequality gives us

\ln Z(W) = \ln \sum_s \exp\Big\{ \sum_{j<i} W_{ij} s_i s_j \Big\} \ge \sum_s q(s) \sum_{j<i} W_{ij} s_i s_j + H(q) = F(W, q)

where q(s) = \prod_i m_i^{s_i} (1 - m_i)^{1 - s_i} and H is the entropy.

Gradient-based variant: use the mean-field expectations to compute approximate gradients.
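
A sketch of the naive mean-field bound, using the standard fixed-point updates m_i <- sigmoid(sum_j W_ij m_j) for symmetric W with zero diagonal; the damping and iteration count are illustrative choices:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mean_field_lower_bound(W, n_iters=200, damping=0.5):
    """Naive mean-field lower bound F(W, q) <= log Z(W) for a Boltzmann machine."""
    M = W.shape[0]
    m = np.full(M, 0.5)
    for _ in range(n_iters):
        m = (1 - damping) * m + damping * sigmoid(W @ m)
    m = np.clip(m, 1e-12, 1 - 1e-12)
    energy = 0.5 * m @ W @ m                       # sum_{j<i} W_ij m_i m_j for symmetric W
    entropy = -np.sum(m * np.log(m) + (1 - m) * np.log(1 - m))
    return energy + entropy
```

In the Metropolis scheme above, F(W, q) would stand in for ln Z(W) when computing the acceptance ratio.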

III. Tree Mean-Field Metropolis

Same as above, but use tree-structured mean-field to estimate the partition function. Jensen's inequality gives us

\ln Z(W) = \ln \sum_s \exp\Big\{ \sum_{j<i} W_{ij} s_i s_j \Big\} \ge \sum_s q(s) \sum_{j<i} W_{ij} s_i s_j + H(q) = F(W, q)

where q(s) \in Q_{tree}, the set of tree-structured distributions, and H is the entropy.

Gradient-based variant: use the tree expectations to compute approximate gradients.

IV. Loopy Metropolis

Belief propagation (BP) is an exact method for inference on trees. Run BP on the (loopy) graph and use the Bethe free energy as an estimate of Z(W). On non-trees, loopy BP provides:
1. approximate marginals b_i \approx p(s_i \mid W)
2. approximate pairwise marginals b_{ij} \approx p(s_i, s_j \mid W)

These marginals are fixed points of the Bethe free energy

F_{Bethe} = U - H_{Bethe} \approx -\log Z(W)

where U is the expected energy and the approximate entropy is

H_{Bethe} = -\sum_{(ij)} \sum_{s_i, s_j} b_{ij}(s_i, s_j) \log b_{ij}(s_i, s_j) - \sum_i (1 - ne(i)) \sum_{s_i} b_i(s_i) \log b_i(s_i)

with ne(i) the number of neighbours of node i.

Gradient-based variant: use the BP expectations to compute approximate gradients.
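
A sketch of how one might run loopy BP on a Boltzmann machine and return the Bethe estimate of log Z(W) (i.e. -F_Bethe), assuming pairwise potentials exp{W_ij s_i s_j}, symmetric W, zero diagonal and no unary terms; damping, iteration count and tolerance are illustrative, and no fix-ups are applied when BP fails to converge:

```python
import numpy as np

def bethe_log_Z(W, n_iters=200, damping=0.5, tol=1e-10):
    """Loopy BP on a Boltzmann machine; returns the Bethe estimate of log Z(W)."""
    M = W.shape[0]
    edges = [(i, j) for i in range(M) for j in range(i) if W[i, j] != 0.0]
    psi = {(i, j): np.exp(W[i, j] * np.outer([0.0, 1.0], [0.0, 1.0])) for (i, j) in edges}
    neighbours = {i: [j for j in range(M) if j != i and (max(i, j), min(i, j)) in psi]
                  for i in range(M)}
    msg = {}                                   # msg[(a, b)]: message a -> b over s_b
    for (i, j) in edges:
        msg[(i, j)] = np.full(2, 0.5)
        msg[(j, i)] = np.full(2, 0.5)

    def pairwise(a, b):                        # potential table indexed [s_a, s_b]
        return psi[(a, b)] if (a, b) in psi else psi[(b, a)].T

    def incoming(a, exclude):                  # product of messages into a, excluding one
        prod = np.ones(2)
        for k in neighbours[a]:
            if k != exclude:
                prod *= msg[(k, a)]
        return prod

    for _ in range(n_iters):
        max_diff = 0.0
        for (a, b) in list(msg):
            new = pairwise(a, b).T @ incoming(a, b)     # sum over s_a
            new /= new.sum()
            max_diff = max(max_diff, np.max(np.abs(new - msg[(a, b)])))
            msg[(a, b)] = (1 - damping) * msg[(a, b)] + damping * new
        if max_diff < tol:
            break

    # Bethe free energy F = U - H_Bethe from the single and pairwise beliefs
    U, H = 0.0, 0.0
    for (i, j) in edges:
        b_ij = psi[(i, j)] * np.outer(incoming(i, j), incoming(j, i))
        b_ij /= b_ij.sum()
        U -= np.sum(b_ij * np.log(psi[(i, j)]))
        H -= np.sum(b_ij * np.log(b_ij + 1e-300))
    for i in range(M):
        b_i = incoming(i, exclude=None)
        b_i /= b_i.sum()
        H -= (1 - len(neighbours[i])) * np.sum(b_i * np.log(b_i + 1e-300))
    return -(U - H)                                     # -F_Bethe, the estimate of log Z(W)
```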

V. The Langevin MCMC Sampling Procedure

So far we have been describing Metropolis procedures, but these suffer from random-walk behaviour. Langevin sampling makes use of gradient information and resembles noisy steepest descent. The uncorrected Langevin update is

W'_{ij} = W_{ij} + \frac{\epsilon^2}{2} \frac{\partial}{\partial W_{ij}} \log p(S, W) + \epsilon\, n_{ij}

where n_{ij} \sim \mathcal{N}(0, 1). There are many ways of estimating gradients; we use a method based on contrastive divergence (Hinton, 2000).
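
A minimal sketch of this update, assuming some estimator grad_log_joint(W, S) of the gradient of log p(S, W) with respect to W (for example the contrastive-divergence estimate in the appendix plus the Gaussian prior term); the step size is illustrative:

```python
import numpy as np

def langevin_step(W, S, grad_log_joint, eps=0.01, rng=None):
    """One uncorrected Langevin update of the symmetric weight matrix W."""
    rng = np.random.default_rng() if rng is None else rng
    noise = np.tril(rng.normal(size=W.shape), -1)
    noise = noise + noise.T                  # symmetric perturbation, zero diagonal
    return W + 0.5 * eps**2 * grad_log_joint(W, S) + eps * noise
```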

VI. Pseudo-Likelihood Based Approximations

p(s \mid W) = \frac{1}{Z(W)} \exp\Big\{ \sum_{j<i} W_{ij} s_i s_j \Big\}

The pseudo-likelihood is defined as

p(s \mid W) \approx \prod_i p(s_i \mid s_{\setminus i}, W) = \prod_i \frac{\exp\{ s_i \sum_{j \ne i} W_{ij} s_j \}}{1 + \exp\{ \sum_{j \ne i} W_{ij} s_j \}}

We can consider pseudo-likelihood based approximations to Z(W), for example restricting the sum to a reference configuration s^* and its neighborhood ne(s^*):

Z(W) = \sum_s \exp\Big\{ \sum_{j<i} W_{ij} s_i s_j \Big\} \approx \sum_{s \in \{s^*\} \cup ne(s^*)} \exp\Big\{ \sum_{j<i} W_{ij} s_i s_j \Big\}

(the pseudo-likelihood itself corresponds to local normalizers of the form 1 + \exp\{ \sum_{j \ne i} W_{ij} s_j \}). One can approximate Z(W) by summing over larger neighborhoods of the MAP configuration s^*, or by summing in the neighborhood of the data points s^{(n)}, which is closely related to contrastive divergence. This has not been tried yet; one can design many other approaches.
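
A sketch of the log pseudo-likelihood of a data matrix S under a fully visible Boltzmann machine with symmetric W and zero diagonal, following the factorization above:

```python
import numpy as np

def log_pseudo_likelihood(S, W):
    """Sum over data points and sites of log p(s_ni | s_{n,\\i}, W)."""
    # logits[n, i] = sum_{j != i} W_ij s_nj   (the diagonal of W is zero)
    logits = S @ W.T
    # log p(s_ni | rest) = s_ni * logit - log(1 + exp(logit))
    return np.sum(S * logits - np.logaddexp(0.0, logits))
```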

Naive Mean Field vs Tree Mean Field Approximation

[Figure: scatter plots of the approximate lower bound F (which satisfies F \le log Z(W)) against the true log Z(W) for naive mean field (mf) and tree mean field (tree), on a sparser setting (n = 10, 0.3 large weights) and a denser setting (n = 10, 0.6 large weights).]

The tree-based approximation found a maximum spanning tree and then used Wiegerinck's (UAI, 2000) variational approximation.

Bethe Free Energy

[Figure: plots of Z_Bethe vs Z_true for some independently drawn Boltzmann machines. Points in red show where belief propagation failed to converge.]

No hacks were applied to fix up the non-converging runs; there are ways to do so in the literature.

Results on Coronary Heart Disease Data

Classic data set of 6 binary variables detailing risk factors for coronary heart disease in 1841 men. Small enough that the exact Z(W) can be computed.

[Figure: posterior marginals over the weights W_AA, W_AB, ..., W_FF. Blue: exact; red: Langevin with contrastive divergence; purple: loopy Metropolis; mean-field and tree approximations also shown. 100,000 samples; local Metropolis proposals with variance 0.01; Langevin step = 0.01.]

Results on Synthetic Data Sets

100-node random networks with 204-edge and 500-edge systems. Weights drawn from N(0, 1); 100 data points.

[Figure: for each system, f, the fraction of samples within ±0.1 of the true parameter value (higher is better), plotted against the parameters. Dashed blue: loopy Metropolis; black: Langevin with contrastive divergence; red: true.]

Part II: Summary and Future Directions

The problem of Bayesian learning in large tree-width undirected models (log-linear models) appears to have been completely overlooked (!?)

Standard MCMC procedures are intractable due to the need to compute partition functions at each step.

This problem offers a natural opportunity for combining modern deterministic approximations with MCMC.

We have used known ideas to develop a variety of novel methods for approximate MCMC sampling of the parameters of undirected models.

Naive mean-field and tree-based mean-field Metropolis do not seem to work: they wander into areas of poor approximation (loose bound).

The loopy Metropolis and contrastive Langevin methods both seem to work well.

Potential applications to text modelling and computer vision.

There is still a lot to do in this area!

End of Talk

Appendix

Contrastive Divergence (Hinton, 2000)

The gradient for maximum likelihood learning,

\frac{\partial \log p(s \mid W)}{\partial W_{kl}} = \langle s_k s_l \rangle_{\text{data}} - \langle s_k s_l \rangle_{p(s \mid W)},

becomes

\frac{\partial \log p(s \mid W)}{\partial W_{kl}} = \langle s_k s_l \rangle_{p_0(W)} - \langle s_k s_l \rangle_{p_\infty(W)} \approx \langle s_k s_l \rangle_{p_0(W)} - \langle s_k s_l \rangle_{p_1(W)}

where p_n(W) is defined to be the distribution obtained at the n-th step of Gibbs sampling starting from the data.
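
A sketch of the CD-1 moment difference <s_k s_l>_{p0} - <s_k s_l>_{p1}: one sequential Gibbs sweep is run from each data vector and the empirical second moments are differenced; symmetric W with zero diagonal is assumed, as elsewhere:

```python
import numpy as np

def cd1_gradient(W, S, rng=None):
    """CD-1 estimate of d log p(S | W) / dW: <s_k s_l>_{p0} - <s_k s_l>_{p1}."""
    rng = np.random.default_rng() if rng is None else rng
    N, M = S.shape
    S1 = S.astype(float).copy()
    for i in range(M):                           # one sequential Gibbs sweep over the sites
        logits = S1 @ W[i]                       # diagonal of W is zero, so no self term
        S1[:, i] = (rng.random(N) < 1.0 / (1.0 + np.exp(-logits))).astype(float)
    grad = (S.T @ S - S1.T @ S1) / N             # difference of empirical second moments
    np.fill_diagonal(grad, 0.0)
    return grad
```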

Contrastive Divergence for Bayesian Learning

A pretty accurate Taylor expansion makes the comparison easier. For a proposal that changes a single weight by \delta, W'_{kl} = W_{kl} + \delta:

\log a + \log \frac{p(W)}{p(W')} = N \Big[ \delta \langle s_k s_l \rangle_{p_0(W)} - \log \big\langle \exp\{\delta s_k s_l\} \big\rangle_{p_\infty(W)} \Big] \approx N \delta \Big[ \langle s_k s_l \rangle_{p_0(W)} - \langle s_k s_l \rangle_{p_\infty(W)} \Big]

It is now tempting to try

\log a + \log \frac{p(W)}{p(W')} \approx N \delta \Big[ \langle s_k s_l \rangle_{p_0(W)} - \langle s_k s_l \rangle_{p_1(W)} \Big].

We will call this contrastive sampling.
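
A small sketch of the contrastive-sampling acceptance term for a proposal that adds delta to a single weight W_kl, assuming the one-Gibbs-sweep reconstructions S1 (for instance from the CD-1 sketch above) are available as an input; the log prior ratio still has to be added before accepting:

```python
import numpy as np

def contrastive_log_accept(S, S1, k, l, delta):
    """Approximate likelihood part of log a for the proposal W_kl -> W_kl + delta."""
    N = S.shape[0]
    m0 = np.mean(S[:, k] * S[:, l])     # <s_k s_l> under the data distribution p_0
    m1 = np.mean(S1[:, k] * S1[:, l])   # <s_k s_l> after one Gibbs sweep (p_1)
    return N * delta * (m0 - m1)        # add log[p(W') / p(W)] before the accept test
```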