Expectation Propagation for Approximate Bayesian Inference

José Miguel Hernández Lobato
Universidad Autónoma de Madrid, Computer Science Department
February 5, 2007

Bayesian Inference

Inference: given some data we want to assign probabilities to a set of hypotheses. Uncertainty is involved in the whole process, and the tool for reasoning under uncertainty is probability theory.

Bayes' theorem:

    P(θ | D, M) = P(D | θ, M) P(θ | M) / P(D | M),

where M represents our assumptions (model) for the problem and θ is a hypothesis.

Main difficulties in the Bayesian framework

We have to work with complex integrals:

- To normalize distributions:  P(D | M) = ∫ P(D | θ, M) P(θ | M) dθ.   (1)
- To make predictions:  P(y | D, M) = ∫ P(y | θ) P(θ | D, M) dθ.   (2)

Approximate solutions:
- Laplace's method.
- Monte Carlo methods.
- Variational inference.
- Expectation Propagation.

Kullback-Leibler Divergence

    D_KL(p ‖ q) = ∫ p(x) log [ p(x) / q(x) ] dx

- It measures the discrepancy from a true density p to another density q.
- D_KL(p ‖ q) = 0 if and only if p = q; otherwise D_KL(p ‖ q) > 0.
- It is not symmetric: D_KL(p ‖ q) ≠ D_KL(q ‖ p).
- We can approximate p with a simpler density q by minimizing either D_KL(p ‖ q) (direct) or D_KL(q ‖ p) (inverse).
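As a quick numerical illustration of the last two points (not part of the original slides), the sketch below estimates both directions of the KL divergence by Monte Carlo for a correlated bivariate Gaussian p and a factorized Gaussian q; the correlation of 0.8 and the sample size are arbitrary choices.

    import numpy as np
    from scipy.stats import multivariate_normal

    rng = np.random.default_rng(0)

    # "True" correlated density p and a factorized approximation q (illustrative values).
    p = multivariate_normal(mean=[0.0, 0.0], cov=[[1.0, 0.8], [0.8, 1.0]])
    q = multivariate_normal(mean=[0.0, 0.0], cov=[[1.0, 0.0], [0.0, 1.0]])

    def kl_mc(a, b, n=200_000):
        """Monte Carlo estimate of D_KL(a || b) = E_a[log a(x) - log b(x)]."""
        x = a.rvs(size=n, random_state=rng)
        return np.mean(a.logpdf(x) - b.logpdf(x))

    print("D_KL(p || q) =", kl_mc(p, q))  # direct
    print("D_KL(q || p) =", kl_mc(q, p))  # inverse: a different value

The two estimates differ, which is the asymmetry that the next two figures illustrate geometrically.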

Non-symmetry of the KL divergence I

Figure: Inverse solution (left) and direct solution (right) for an approximation of a bivariate Gaussian with two independent Gaussian components.

Non-symmetry of the KL divergence II

Figure: Inverse solution (left) and direct solution (right) for an approximation of a mixture of two Gaussians with a bivariate Gaussian.

Exponential Family

A density q is in the exponential family if

    q(x) = exp{ ∑_i g_i(x) ν_i },   (3)

where the ν_i are the natural parameters and the g_i are the sufficient statistics: (1, x, x²) for a Gaussian.

- Exponential families are closed under multiplication.
- If q is in an exponential family, we can minimize D_KL(p ‖ q) simply by matching the expected sufficient statistics:

    ∫ g_i(x) q(x) dx = ∫ g_i(x) p(x) dx   for all i.   (4)
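For a Gaussian q, condition (4) just says that q must match the mean and second moment of p. A minimal sketch of this (my own illustration, with an arbitrary two-component mixture standing in for p):

    import numpy as np

    rng = np.random.default_rng(1)

    # Samples from an illustrative mixture p: 0.5 N(-2, 1) + 0.5 N(2, 1).
    comp = rng.integers(0, 2, size=100_000)
    x = rng.normal(loc=np.where(comp == 0, -2.0, 2.0), scale=1.0)

    # Minimizing the direct divergence D_KL(p || q) over Gaussians q amounts to
    # matching the expectations of the sufficient statistics (1, x, x^2),
    # i.e. the mean and variance of p.
    mu = x.mean()
    var = (x ** 2).mean() - mu ** 2
    print(f"q = N({mu:.3f}, {var:.3f})")  # roughly N(0, 5): broad enough to cover both modes

This is the same moment-matching step that ADF and EP perform for each term.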

Assumed Density Filtering

We write p(x) as a product of terms, p(x) = ∏_{i=1}^n t_i(x), and approximate p term by term with q.

Algorithm:
1. q_0(x) ← constant.
2. For i = 1 to n:
   2.1. Z_i ← ∫ q_{i−1}(x) t_i(x) dx.
   2.2. q_i ← argmin_q D_KL( (1/Z_i) q_{i−1} t_i ‖ q ).
3. Z ← ∏_{i=1}^n Z_i.
4. Return q_n and Z.

q_n approximates p(x) / ∫ p(x) dx and Z approximates ∫ p(x) dx.
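The loop above can be made concrete for a one-dimensional toy model, p(x) ∝ N(x; 0, v₀) ∏_i Φ(y_i x), where the projection in step 2.2 reduces to the standard Gaussian/probit moment-matching formulas. This is my own sketch (the prior variance and the labels are arbitrary), not an example from the slides; it also exhibits the order dependence discussed on the next slide.

    import numpy as np
    from scipy.stats import norm

    def adf_probit(y, m0=0.0, v0=10.0):
        """ADF for p(x) ∝ N(x; m0, v0) * prod_i Phi(y_i * x): process the factors
        one at a time and project back onto a Gaussian by moment matching."""
        m, v, logZ = m0, v0, 0.0
        for yi in y:
            z = yi * m / np.sqrt(1.0 + v)
            Zi = norm.cdf(z)                          # Z_i of step 2.1
            r = norm.pdf(z) / Zi
            m = m + yi * v * r / np.sqrt(1.0 + v)     # matched mean
            v = v - v**2 * r * (z + r) / (1.0 + v)    # matched variance
            logZ += np.log(Zi)
        return m, v, logZ

    y = np.array([+1, +1, +1, -1, +1])
    print(adf_probit(y))        # one processing order
    print(adf_probit(y[::-1]))  # reversed order: a (slightly) different answer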

Main disadvantage of ADF

Problem: the solution depends on the processing order of the t_i(x).

Figure: p(x) is shown in black and the approximations obtained by ADF using different orderings are shown in red.

Expectation Propagation

- Solves the ordering dependence of ADF.
- Approximates p(x) = ∏_{i=1}^n t_i(x) by q(x) = ∏_{i=1}^n t̂_i(x), where all the t̂_i are in the same exponential family and so is q.
- Each term t̂_j is chosen so that

    q(x) = t̂_j(x) ∏_{i≠j} t̂_i(x)   (5)

  is as close as possible to

    t_j(x) ∏_{i≠j} t̂_i(x).   (6)

- The distance measure used is the direct KL divergence.
- It is easy to work with q, and its integrals can be computed in closed form.

Pseudocode of Expectation Propagation

Algorithm:
1. Initialize all the t̂_i and q to constant densities.
2. Until all the t̂_i converge:
   2.1. Choose a t̂_i to update.
   2.2. q_old ← q / t̂_i.
   2.3. q ← argmin_{q'} D_KL( q_old t_i ‖ q' ).
   2.4. t̂_i ← q / q_old.
3. Return q.

∫ p(x) dx is approximated by ∫ q(x) dx, and p(x) / ∫ p(x) dx is approximated by q(x) / ∫ q(x) dx.
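A minimal sketch of this loop for the same one-dimensional toy model used in the ADF sketch, p(x) ∝ N(x; 0, v₀) ∏_i Φ(y_i x), with Gaussian site terms t̂_i stored through their natural parameters. It is my own illustration rather than code from the talk; damping and the checks needed when a cavity becomes improper are omitted.

    import numpy as np
    from scipy.stats import norm

    def ep_probit(y, m0=0.0, v0=10.0, n_sweeps=20):
        """EP for p(x) ∝ N(x; m0, v0) * prod_i Phi(y_i * x). Each site
        t_hat_i(x) ∝ exp(-0.5*r[i]*x**2 + b[i]*x) is an unnormalized Gaussian."""
        n = len(y)
        r = np.zeros(n)              # site precisions (constant sites initially)
        b = np.zeros(n)              # site linear parameters
        lam, h = 1.0 / v0, m0 / v0   # natural parameters of q (prior included)
        for _ in range(n_sweeps):
            for i in range(n):
                # 2.2: remove site i to get the cavity q_old = q / t_hat_i.
                lam_c, h_c = lam - r[i], h - b[i]
                m_c, v_c = h_c / lam_c, 1.0 / lam_c
                # 2.3: moment-match q to the tilted density q_old * t_i.
                z = y[i] * m_c / np.sqrt(1.0 + v_c)
                rho = norm.pdf(z) / norm.cdf(z)
                m_new = m_c + y[i] * v_c * rho / np.sqrt(1.0 + v_c)
                v_new = v_c - v_c**2 * rho * (z + rho) / (1.0 + v_c)
                lam, h = 1.0 / v_new, m_new / v_new
                # 2.4: update the site so that t_hat_i = q / q_old.
                r[i], b[i] = lam - lam_c, h - h_c
        return h / lam, 1.0 / lam    # mean and variance of q

    y = np.array([+1, +1, +1, -1, +1])
    print(ep_probit(y))              # same result for any ordering of the y_i

Unlike the ADF result, this fixed point does not depend on the order in which the sites are visited.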

Final Notes on Expectation Propagation

The t̂_i terms, and therefore q, could also be factorized densities. In that case we only have to perform some additional marginalizations in the algorithm.

Advantages of Expectation Propagation:
- No local minima when minimizing the KL divergence.
- Applicable to high-dimensional densities.
- Usually faster than other approaches.

Disadvantages of Expectation Propagation:
- It is not guaranteed to converge.
- q_old might not be a proper density.
- The t_i terms have to be simple.

Bayes Machine I

- It is a Bayesian single-layer perceptron: w determines the hyperplane of a perceptron.
- Given a data set D = {(x_1, y_1), ..., (x_n, y_n)} with y_i ∈ {−1, 1}, the likelihood for w is

    P(D | w) = ∏_i P(y_i | w) = ∏_i Θ(y_i wᵀx_i),   (7)

  where Θ is the step function.
- We can take into account a labeling error rate ε:

    P(y_i | w) = ε + (1 − 2ε) Θ(y_i wᵀx_i).   (8)

- The likelihood only depends on the number of errors.

Bayes Machine II

- The prior for w is P(w) = N(0, I), a spherical Gaussian.
- The posterior for w is P(w | D) ∝ ∏_i P(y_i | w) P(w).
- The predictive distribution for a point x is

    P(y | x, D) = ∫ P(y | x, w) P(w | D) dw.   (9)

- The model evidence is

    P(D | M) = ∫ ∏_i P(y_i | w) P(w) dw.   (10)

- We approximate the posterior P(w | D) with a multivariate Gaussian N(µ, Σ) by means of EP. EP also approximates the evidence.
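Once EP has produced the Gaussian approximation N(µ, Σ), the predictive integral (9) has a closed form for the ε = 0 step-function likelihood, because wᵀx is Gaussian under the approximate posterior: P(y | x, D) ≈ Φ( y µᵀx / √(xᵀΣx) ). A small sketch of that computation; the values of µ, Σ and x below are placeholders, just to show the call.

    import numpy as np
    from scipy.stats import norm

    def predictive(x, y, mu, Sigma):
        """P(y | x, D) ≈ ∫ Θ(y wᵀx) N(w; mu, Sigma) dw = Φ(y µᵀx / sqrt(xᵀΣx)),
        valid for the ε = 0 step-function likelihood."""
        return norm.cdf(y * (mu @ x) / np.sqrt(x @ Sigma @ x))

    mu = np.array([1.0, -0.5])                    # placeholder posterior mean
    Sigma = np.array([[0.2, 0.05], [0.05, 0.3]])  # placeholder posterior covariance
    print(predictive(np.array([0.3, 0.8]), +1, mu, Sigma))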

Example of a Bayes Machine

Figure: Contour plot of the decision surface obtained by the Bayes machine, with the maximum margin classifier in black. The Bayes machine approximates a vote among all possible linear separators. In this example ε = 0.

Non-linear Bayes Machine

- It is possible to rewrite the whole EP algorithm for the Bayes machine in terms of inner products.
- The kernel trick then allows us to work with the data projected onto an infinite-dimensional space, where it can be linearly separated.
- We can fix the kernel and its parameters just by maximizing the approximation of the evidence.
- The same procedure can be used to perform attribute selection.

Example: the Spiral Dataset with 100 points

We fix a Gaussian kernel, k(x_i, x_j) = exp( −(x_i − x_j)ᵀ(x_i − x_j) / (2σ²) ), and maximize log P(D | M) with respect to σ:

    σ       log P(D | M)
    1.00    −33
    0.50    −25
    0.25    −22.6
    0.10    −34
    0.35    −22.7
    0.30    −22.4
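A sketch of how this kind of search could be written: build the Gaussian kernel matrix for each candidate width and keep the one with the largest EP approximation of log P(D | M). Here ep_log_evidence is a placeholder for an EP routine for the kernelized Bayes machine that returns the approximate log evidence; it is not defined in the slides or here.

    import numpy as np

    def gaussian_kernel(X, sigma):
        """K[i, j] = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
        sq = np.sum(X**2, axis=1)
        d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
        return np.exp(-d2 / (2.0 * sigma**2))

    def select_sigma(X, y, ep_log_evidence, grid=(1.0, 0.5, 0.35, 0.3, 0.25, 0.1)):
        """Pick the kernel width maximizing the approximate log evidence."""
        scores = {s: ep_log_evidence(gaussian_kernel(X, s), y) for s in grid}
        return max(scores, key=scores.get), scores

On the spiral data this selection would keep σ = 0.3, the best value in the table above.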

The Non-linear Bayes Machine on the Spiral Data Set I

Figure: Contour plot of the decision surface obtained by the non-linear Bayes machine. In this example σ² = 1 and ε = 0.

The Non-linear Bayes Machine on the Spiral Data Set II

Figure: Contour plot of the decision surface obtained by the non-linear Bayes machine. In this example σ² = 0.3 and ε = 0.

Results of an SVM on the Spiral Data Set

Figure: Decision surface obtained by a support vector machine. We used a Gaussian kernel and ε = 0. The width of the kernel was selected by cross-validation.

Results of an SVM on another Data Set

Figure: Decision surface obtained by a support vector machine. We used a Gaussian kernel and ε = 0. The width of the kernel was selected by cross-validation.

Results of the Bayes Machine on another Data Set

Figure: Decision surface obtained by the non-linear Bayes machine. We used a Gaussian kernel and ε = 0. The width of the kernel was selected by maximizing the approximation of the evidence.

Bayes Machines vs. SVMs

SVMs:
- Generally lead to less smooth decision borders.
- To tune their parameters you have to use cross-validation and set data aside.
- The training process is faster.
- Prediction is faster because they only use the support vectors.
- They do not output probabilities directly.

Bayes machines:
- Seem to generalize better (see ref. 1).
- The kernel parameters can be fixed using all of the data.
- Training is slow, O(n³).
- Prediction is slow.
- They output a predictive distribution.

Bibliography

1. T. Minka. A Family of Algorithms for Approximate Bayesian Inference. PhD thesis, MIT Media Lab, 2001.
2. Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
3. David MacKay. Information Theory, Inference, and Learning Algorithms. Cambridge University Press, 2003 (available on the web).