Part 1: Expectation Propagation

Chalmers Machine Learning Summer School
Approximate message passing and biomedicine
Part 1: Expectation Propagation
Tom Heskes
Machine Learning Group, Institute for Computing and Information Sciences
Radboud University Nijmegen, The Netherlands
April 15, 2015

Outline

- Bayesian Machine Learning: probabilistic modeling, approximate inference
- Expectation Propagation: bit of history, factor graphs, iterative procedure
- EP for Gaussian process classification: locality property, step by step
- Conclusion

Bayesian Machine Learning

Enumerate all reasonable models θ and assign a prior belief p(θ). Upon observing the data D, compute the likelihood p(D | θ). Compute the posterior probability over models using Bayes' rule:

p(θ | D) ∝ p(D | θ) p(θ).

Problem: the posterior distribution is often intractable.
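As a minimal illustration of the recipe above (a toy coin-flip example assumed for this sketch, not from the lecture), Bayes' rule over a small list of candidate models is a few lines of numpy:

```python
import numpy as np

# Toy example: three candidate models (coin biases θ) with a uniform prior.
thetas = np.array([0.2, 0.5, 0.8])       # candidate models θ
prior = np.array([1/3, 1/3, 1/3])        # p(θ)

# Observed data D: 8 heads out of 10 flips; Bernoulli likelihood p(D | θ).
heads, flips = 8, 10
likelihood = thetas**heads * (1 - thetas)**(flips - heads)

# Bayes' rule: p(θ | D) ∝ p(D | θ) p(θ), normalized over the candidate list.
posterior = likelihood * prior
posterior /= posterior.sum()
```

The intractability the slide refers to arises when θ is continuous and high-dimensional, so this normalizing sum becomes an integral with no closed form.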

Approximate Inference

Stochastic sampling methods:
- Markov chain Monte Carlo sampling
- Gibbs sampling
- Particle filtering

Deterministic methods:
- Variational ("mean-field") approaches
- Loopy belief propagation
- Expectation propagation

Expectation Propagation

Message-passing algorithm, invented by Thomas Minka (PhD thesis, 2001). Its generalization, power EP, contains a large class of deterministic algorithms for approximate inference. Arguably the best approximate inference results, if it converges... Implemented in Microsoft's Infer.NET.

Factors

Many probabilistic models factorize, i.e., can be written in the form

p(D, θ) = ∏_i f_i(θ).

For example, with independently, identically distributed data, there is one factor f_n(θ) = p(x_n | θ) for each data point x_n, along with a factor f_0(θ) = p(θ) for the prior. This also applies to Gaussian process regression and classification: θ is drawn from a Gaussian process prior and each of the factors further simplifies to f_n(θ) = p(x_n | θ_n).

[Figure: factor graphs with factors f_0, f_1, ..., f_n attached to θ (general case) and to individual components θ_1, ..., θ_n (Gaussian processes).]

Approximation

Approximate the posterior by an exponential distribution:

p(θ | D) ∝ ∏_i f_i(θ) ≈ (1/Z) ∏_i f̃_i(θ) = p̃(θ),

i.e., approximate each term f_i(θ) by an exponential term approximation f̃_i(θ). Terms already in exponential form, often the prior, do not have to be approximated.

[Figure: factor graph with the prior f_0 and term approximations f̃_1, ..., f̃_n attached to θ.]

Exponential form: f(θ) = h(θ) g(η) exp[ηᵀ u(θ)], with natural parameters η and sufficient statistics u(θ).
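To make the exponential form concrete, a one-dimensional Gaussian N(θ; m, v) can be rewritten with sufficient statistics u(θ) = (θ, θ²) and natural parameters η = (h, −K/2), where K = 1/v and h = K m; a small numeric check (toy values assumed):

```python
import numpy as np

# A Gaussian N(θ; m, v) in the exponential form g(η) exp[ηᵀ u(θ)],
# with u(θ) = (θ, θ²) and η = (h, −K/2), K = 1/v, h = K m.
m, v = 1.5, 0.4
K, h = 1.0 / v, m / v
eta = np.array([h, -0.5 * K])

def gauss(theta):
    return np.exp(-0.5 * (theta - m)**2 / v) / np.sqrt(2 * np.pi * v)

def expfam(theta):
    # log normalizer g(η) for the Gaussian: completes the square in θ
    log_g = 0.5 * np.log(K / (2 * np.pi)) - 0.5 * h**2 / K
    u = np.array([theta, theta**2])
    return np.exp(log_g + eta @ u)
```

Both parameterizations give the same density; EP works in the (h, K) form because products and quotients of terms become sums and differences of natural parameters.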

Iterative Updating

Take out term approximation i:

p̃^{\i}(θ) ∝ ∏_{j≠i} f̃_j(θ).

Put back in term i:

p^{(i)}(θ) ∝ f_i(θ) ∏_{j≠i} f̃_j(θ).

Match moments, i.e., find the approximate distribution of exponential form such that

∫ dθ u(θ) p̃_new(θ) = ∫ dθ u(θ) p^{(i)}(θ).

Bookkeeping: set the new term approximation such that

p̃_new(θ) ∝ f̃_i^new(θ) ∏_{j≠i} f̃_j(θ).
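The four steps above can be run in a loop. Below is a minimal sketch on a toy one-dimensional model assumed for illustration (not the lecture's GP setting): prior f_0(θ) = N(θ; 0, 1), exact terms f_n(θ) = σ(y_n θ), term approximations stored as natural parameters (h_n, K_n), and moment matching done by Gauss-Hermite quadrature:

```python
import numpy as np

y = np.array([1.0, 1.0, -1.0, 1.0])          # toy labels
h = np.zeros(len(y)); K = np.zeros(len(y))   # term approximations, start flat
prior_h, prior_K = 0.0, 1.0                  # N(0, 1) prior in natural form

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

# Gauss-Hermite nodes/weights for the 1-D moment-matching integrals.
xs, ws = np.polynomial.hermite.hermgauss(40)
ws = ws / np.sqrt(np.pi)

for sweep in range(20):
    for i in range(len(y)):
        # 1. Take out term i: cavity in natural parameters.
        K_cav = prior_K + K.sum() - K[i]
        h_cav = prior_h + h.sum() - h[i]
        m_cav, C_cav = h_cav / K_cav, 1.0 / K_cav
        # 2. Put back the exact term and match moments by quadrature.
        theta = m_cav + np.sqrt(2 * C_cav) * xs
        f = sigmoid(y[i] * theta)
        Z = np.sum(ws * f)
        m_new = np.sum(ws * f * theta) / Z
        C_new = np.sum(ws * f * theta**2) / Z - m_new**2
        # 3. Bookkeeping: new term = new posterior minus cavity (naturals).
        K[i] = 1.0 / C_new - K_cav
        h[i] = m_new / C_new - h_cav
```

After convergence, the posterior approximation has precision prior_K + K.sum() and mean (prior_h + h.sum()) / (prior_K + K.sum()); with three positive labels out of four, the mean ends up positive and the variance below the prior's.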

Going Back and Forth

[Figure: substitute the exact term f_i into the approximation f̃_0 f̃_1 ... f̃_n, then project the result back onto the exponential family.]

Project: minimize the KL divergence

KL(p^{(i)} ‖ p̃) = ∫ dθ p^{(i)}(θ) log[ p^{(i)}(θ) / p̃(θ) ].

Equivalent to moment matching when p̃ is in the exponential family.
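The equivalence of KL projection and moment matching can be checked numerically: for a non-Gaussian target p (here a toy two-component mixture, assumed for illustration), the KL divergence to a candidate Gaussian on a grid is smallest exactly at the matched mean and variance:

```python
import numpy as np

x = np.linspace(-10, 10, 4001)
dx = x[1] - x[0]

def gauss(x, m, v):
    return np.exp(-0.5 * (x - m)**2 / v) / np.sqrt(2 * np.pi * v)

# Target p: a bimodal mixture, not itself in the exponential family.
p = 0.5 * gauss(x, -2.0, 1.0) + 0.5 * gauss(x, 3.0, 0.5)

# Matched moments of p (grid approximation of the integrals).
m_star = np.sum(x * p) * dx
v_star = np.sum((x - m_star)**2 * p) * dx

def kl(m, v):
    # KL(p || N(m, v)) on the grid.
    q = gauss(x, m, v)
    return np.sum(p * np.log(p / q)) * dx

best = kl(m_star, v_star)
```

Perturbing either moment away from (m_star, v_star) strictly increases the divergence, which is the moment-matching property the slide states.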

Gaussian Processes for Classification

f_0(θ) is a Gaussian prior: no need to approximate. f_i(θ) = f_i(θ_i), some nonlinear sigmoidal function of θ_i. Each term approximation is of one-dimensional Gaussian form,

f̃_i(θ_i) = exp[ h_i θ_i − (1/2) K_i θ_i² ],

not necessarily normalizable: K_i may be negative.

[Figure: the sigmoidal likelihood f_i(θ_i) = σ(y_i θ_i) plotted as a function of θ_i.]

Locality Property (1)

Consider updating the term approximation f̃_i(θ_i). After replacing the old term approximation by the exact term we have

p^{(i)}(θ) ∝ p̃^{\i}(θ) f_i(θ_i) = p̃^{\i}(θ_{\i} | θ_i) p̃^{\i}(θ_i) f_i(θ_i).

We have to find the new approximation p̃_new(θ) closest in KL divergence to p^{(i)}(θ):

KL(p^{(i)} ‖ p̃_new) = ∫ dθ p^{(i)}(θ) log[ p^{(i)}(θ) / p̃_new(θ) ]
  = ∫ dθ_i p^{(i)}(θ_i) log[ p^{(i)}(θ_i) / p̃_new(θ_i) ]
  + ∫ dθ_i p^{(i)}(θ_i) ∫ dθ_{\i} p^{(i)}(θ_{\i} | θ_i) log[ p^{(i)}(θ_{\i} | θ_i) / p̃_new(θ_{\i} | θ_i) ].

Locality Property (2)

From the previous slide:

KL(p^{(i)} ‖ p̃_new) = ∫ dθ_i p^{(i)}(θ_i) log[ p^{(i)}(θ_i) / p̃_new(θ_i) ]
  + ∫ dθ_i p^{(i)}(θ_i) ∫ dθ_{\i} p^{(i)}(θ_{\i} | θ_i) log[ p^{(i)}(θ_{\i} | θ_i) / p̃_new(θ_{\i} | θ_i) ].

Consequences:
1. At the optimum, p̃_new(θ_{\i} | θ_i) = p^{(i)}(θ_{\i} | θ_i), which means that only K_ii and h_i can change.
2. We only need to match moments for the marginal p(θ_i).

Take Out

Easy in terms of canonical parameters:

K^{\i} = K − K_i 1_i 1_iᵀ and h^{\i} = h − h_i 1_i,

with 1_i a vector with a 1 at element i and the rest 0. We need the corresponding moment form, with (only) C^{\i}_ii and m^{\i}_i. Efficiently with the Sherman-Morrison formula (see next slide):

C^{\i}_ii = C_ii + (C_ii K_i C_ii) / (1 − C_ii K_i).

The new mean follows from

m^{\i}_i = m_i + C^{\i}_ii [ K_i m_i − h_i ].

Computational complexity is order 1 per term, i.e., order N in total per iteration of EP.
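The moment-form cavity formulas above can be sanity-checked against direct subtraction of natural parameters, which must give the same answer; a small sketch with toy numbers (assumed for illustration):

```python
# "Take out" in moment form, following the slide's formulas: given the
# posterior marginal (m_i, C_ii) and the term approximation (h_i, K_i),
# remove the term and return the cavity mean and variance.
def take_out(m_i, C_ii, h_i, K_i):
    C_cav = C_ii + (C_ii * K_i * C_ii) / (1.0 - C_ii * K_i)
    m_cav = m_i + C_cav * (K_i * m_i - h_i)
    return m_cav, C_cav

# The same cavity via natural parameters: subtract (h_i, K_i) directly.
def take_out_natural(m_i, C_ii, h_i, K_i):
    K_cav = 1.0 / C_ii - K_i
    h_cav = m_i / C_ii - h_i
    return h_cav / K_cav, 1.0 / K_cav
```

The first version is what makes the order-1 cost per term possible: it touches only the i-th marginal instead of re-inverting the full precision matrix.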

Sherman-Morrison Formula

Efficient way to recompute the inverse after adding a lower-dimensional part:

(A + uvᵀ)⁻¹ = A⁻¹ − (A⁻¹ u vᵀ A⁻¹) / (1 + vᵀ A⁻¹ u).

The result on the previous slide follows by setting u = −K_i 1_i and v = 1_i. The Woodbury formula generalizes this to matrices instead of vectors; the matrix determinant lemma does something similar for (log) determinants. See the Matrix Cookbook (or Wikipedia...).
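The identity is easy to verify numerically; here is a quick check on a random symmetric positive-definite matrix with a symmetric rank-one update (the case relevant to EP):

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.normal(size=(5, 5))
A = B @ B.T + 5.0 * np.eye(5)   # well-conditioned SPD matrix
u = rng.normal(size=(5, 1))
v = u                            # symmetric rank-one update uuᵀ

# Sherman-Morrison: update the cached inverse in O(n²) instead of O(n³).
A_inv = np.linalg.inv(A)
denom = 1.0 + float(v.T @ A_inv @ u)          # > 1 here since A_inv is SPD
sm_inv = A_inv - (A_inv @ u) @ (v.T @ A_inv) / denom
```

Taking v = u keeps the denominator safely away from zero; in general the formula breaks down when 1 + vᵀA⁻¹u = 0, i.e., when the updated matrix is singular.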

Match Moments

We have to compute

∫ dθ_i N(θ_i; m^{\i}_i, C^{\i}_ii) f_i(θ_i) {1, θ_i, θ_i²}.

These are one-dimensional integrals that can (sometimes) be computed analytically, and otherwise approximated with Gauss-Hermite quadrature, i.e., from

∑_k w_k f_i(m^{\i}_i + √(C^{\i}_ii) x_k) {1, x_k, x_k²}.

This then yields the new m^new_i and C^new_ii. Computational complexity is order W, the number of quadrature points, per term, i.e., order NW per EP iteration.
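A minimal sketch of the quadrature step (function and variable names are my own): numpy's `hermgauss` provides nodes and weights for ∫ e^{−x²} g(x) dx, so the substitution θ = m + √(2C) x turns the Gaussian-weighted integrals into weighted sums:

```python
import numpy as np

def tilted_moments(m, C, f, n_points=30):
    """Moments {1, θ, θ²} of N(θ; m, C) · f(θ) by Gauss-Hermite quadrature."""
    x, w = np.polynomial.hermite.hermgauss(n_points)
    w = w / np.sqrt(np.pi)            # absorb the 1/√π from the substitution
    theta = m + np.sqrt(2.0 * C) * x  # θ = m + √(2C) x
    fv = f(theta)
    Z = np.sum(w * fv)                       # zeroth moment (normalization)
    mean = np.sum(w * fv * theta) / Z        # first moment
    var = np.sum(w * fv * theta**2) / Z - mean**2
    return Z, mean, var
```

With f ≡ 1 this recovers (1, m, C) exactly; with a sigmoid factor it gives the tilted mean and a reduced variance, which is exactly what the EP update consumes.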

Bookkeeping

Now that we have the new moments, we have to find new term approximations that give exactly those same moments. It can be shown that the updates are simply as if we were in a one-dimensional situation:

K_i^new = K_i + [ 1/C^new_ii − 1/C_ii ],

and similarly

h_i^new = h_i + [ m^new_i/C^new_ii − m_i/C_ii ].

Keep track of C and m using Sherman-Morrison, but now applied to the whole matrix. Computational complexity is order N² per term, i.e., N³ per EP iteration.
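The update rules above have a built-in consistency property: cavity plus new term reproduces the new marginal exactly in natural parameters. A small sketch with toy numbers (assumed for illustration):

```python
# Bookkeeping step from the slide: shift the term's natural parameters
# by the change in the marginal's natural parameters.
def update_term(K_i, h_i, m_old, C_old, m_new, C_new):
    K_i_new = K_i + (1.0 / C_new - 1.0 / C_old)
    h_i_new = h_i + (m_new / C_new - m_old / C_old)
    return K_i_new, h_i_new

# Toy values: old term (K_i, h_i), old marginal (m_old, C_old),
# matched new marginal (m_new, C_new).
K_i, h_i = 0.4, 0.2
m_old, C_old = 0.3, 0.5
m_new, C_new = 0.35, 0.45
K_i_new, h_i_new = update_term(K_i, h_i, m_old, C_old, m_new, C_new)

# Cavity naturals from the old marginal and old term.
K_cav = 1.0 / C_old - K_i
h_cav = m_old / C_old - h_i
```

By construction K_cav + K_i_new = 1/C_new and h_cav + h_i_new = m_new/C_new, i.e., the updated approximation has exactly the matched moments.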

Sequential vs. Parallel

The initial formulation of expectation propagation updates the terms sequentially and keeps track of the approximated posterior along the way. Viewed as a mapping from old to new term approximations, we may as well do this in parallel. Advantages: much, much faster for sparse precision matrices, and numerically more stable. Disadvantage: convergence might be a bit slower.

[Figure: factor graph of a larger model whose term updates can be computed in parallel.]

Other Issues

By keeping track of normalizations, we can also approximate the model evidence and use that for optimizing hyperparameters. Power EP: take out the term approximation and put back the exact term raised to a power α. Standard EP and loopy belief propagation correspond to α = 1, variational message passing to α = 0; α just below 1 happens to be more stable than α = 1. Convergence is a (big) issue: in particular, there is no guarantee of normalizability. Many, many more applications: mixture models, nonlinear Kalman filters, Dirichlet models, Plackett-Luce models, ... Consistently more accurate than Laplace approximations; ongoing efforts to speed it up.