Approximate Inference Part 1 of 2 Tom Minka Microsoft Research, Cambridge, UK Machine Learning Summer School 2009 http://mlg.eng.cam.ac.uk/mlss09/

Bayesian paradigm Consistent use of probability theory for representing unknowns (parameters, latent variables, missing data)

Bayesian paradigm. The Bayesian posterior distribution summarizes what we've learned from training data and prior knowledge. We can use the posterior to: describe the training data, make predictions on test data, and incorporate new data (online learning). Today's question: how to efficiently represent and compute posteriors?

Factor graphs. A factor graph shows how a function of several variables can be factored into a product of simpler functions, e.g. f(x,y,z) = (x+y)(y+z)(x+z). Very useful for representing posteriors.

Example factor graph: p(x_i | m) = N(x_i; m, 1).

Two tasks. Modeling: what graph should I use for this data? Inference: given the graph and data, what is the mean of x (for example)? Algorithms: sampling, variable elimination, message-passing (Expectation Propagation, Variational Bayes, ...).

A (seemingly) intractable problem

Clutter problem: want to estimate x given multiple y's.
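To make this concrete, here is a minimal sketch of one common version of the clutter problem, with assumed parameters that are not given on this slide: prior x ~ N(0, 100), and each observation drawn from a 50/50 mixture of N(x, 1) and a broad clutter component N(0, 10). Because x is one-dimensional, the exact posterior can be tabulated by brute force on a grid; the function names are my own.

```python
import numpy as np

def norm_pdf(y, mean, var):
    """Gaussian density N(y; mean, var)."""
    return np.exp(-0.5 * (y - mean) ** 2 / var) / np.sqrt(2 * np.pi * var)

def clutter_posterior(y, w=0.5, prior_var=100.0, clutter_var=10.0,
                      grid=np.linspace(-5, 5, 2001)):
    """Exact posterior p(x | y_1..y_n) of the clutter model, tabulated on a grid."""
    log_post = -0.5 * grid ** 2 / prior_var            # log prior, up to a constant
    for yi in y:                                       # add each log-likelihood factor
        log_post += np.log((1 - w) * norm_pdf(yi, grid, 1.0)
                           + w * norm_pdf(yi, 0.0, clutter_var))
    post = np.exp(log_post - log_post.max())
    return grid, post / (post.sum() * (grid[1] - grid[0]))   # normalize to a density
```

The "exact posterior" curve on the next slide corresponds to this kind of brute-force evaluation; it is only feasible because x is a single scalar, which is why a general-purpose approximation is needed.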

Exact posterior [figure: exact posterior p(x | D) plotted over x].

Representing posterior distributions. Sampling: good for complex, multi-modal distributions; slow, but predictable accuracy. Deterministic approximation: good for simple, smooth distributions; fast, but unpredictable accuracy.

Deterministic approximation. Laplace's method: Bayesian curve fitting and neural networks (MacKay), Bayesian PCA (Minka). Variational bounds: Bayesian mixture of experts (Waterhouse), mixtures of PCA (Tipping, Bishop), factorial/coupled Markov models (Ghahramani, Jordan, Williams).

Moment matching: another way to perform deterministic approximation, with much higher accuracy on some problems. Assumed-density filtering (1984), loopy belief propagation (1997), Expectation Propagation (2001).

Today: moment matching (Expectation Propagation). Tomorrow: variational bounds (Variational Message Passing).

Best Gaussian by moment matching [figure: exact posterior and its moment-matched Gaussian plotted over x].
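Continuing the grid sketch above (so clutter_posterior, norm_pdf, and the observations y are the hypothetical names from that snippet), the "best Gaussian" in the moment-matching sense is simply the Gaussian with the same mean and variance as the exact posterior:

```python
grid, post = clutter_posterior(y)                  # exact posterior on the grid
dx = grid[1] - grid[0]
mean = np.sum(grid * post) * dx                    # first moment
var = np.sum((grid - mean) ** 2 * post) * dx       # second central moment
best_gaussian = norm_pdf(grid, mean, var)          # moment-matched Gaussian N(x; mean, var)
```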

Strategy Approximate each factor by a Gaussian in x

Approximating a single factor

(naïve) f_i(x) q^\i(x) = p(x)

(informed) f_i(x) q^\i(x) = p(x)

Single factor with Gaussian context

Gaussian multiplication formula
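The formula itself is not reproduced in this transcript; for reference, the product of two Gaussian densities in the same variable is a scaled Gaussian whose precision and precision-weighted mean are the sums of those of the two factors. A small helper (names my own) capturing that fact:

```python
import numpy as np

def multiply_gaussians(m1, v1, m2, v2):
    """N(x; m1, v1) * N(x; m2, v2) = s * N(x; m, v), returned as (s, m, v)."""
    v = 1.0 / (1.0 / v1 + 1.0 / v2)                # precisions add
    m = v * (m1 / v1 + m2 / v2)                    # precision-weighted means add
    s = np.exp(-0.5 * (m1 - m2) ** 2 / (v1 + v2)) / np.sqrt(2 * np.pi * (v1 + v2))
    return s, m, v                                  # s = N(m1; m2, v1 + v2)
```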

Approximation with narrow context

Approximation with medium context

Approximation with wide context

Two factors [figure: two factors attached to x; message passing].

Three factors [figure: three factors attached to x; message passing].

Message passing = distributed optimization. Messages represent a simpler distribution q(x) that approximates p(x): a distributed representation. Message passing = optimizing q to fit p; q stands in for p when answering queries. Choices: what type of distribution to construct (the approximating family), and what cost to minimize (the divergence measure).

Distributed divergence minimization. Write p as a product of factors: p(x) = ∏_a f_a(x). Approximate the factors one by one: f_a(x) → ~f_a(x). Multiply to get the approximation: q(x) = ∏_a ~f_a(x).

Global divergence to local divergence. Global divergence: D(p(x) || q(x)). Local divergence: D(f_a(x) q^\a(x) || ~f_a(x) q^\a(x)).

Message passing. Messages are passed between factors; the messages are the factor approximations ~f_a(x). Factor a receives the other factors' messages, which form the context q^\a(x) = ∏_{b≠a} ~f_b(x). Minimize the local divergence to get a new ~f_a(x). Send ~f_a(x) to the other factors. Repeat until convergence.
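Putting these steps together for the clutter example, here is a minimal EP sketch of my own (assuming the same illustrative model parameters as in the earlier snippet, not the slides' implementation). Each approximate factor is a Gaussian stored in natural parameters, the cavity q^\i is formed by subtraction, and the moment matching is done numerically on a grid so the same loop applies to any one-dimensional factor; it omits the bookkeeping a robust implementation would need (e.g. guarding against non-positive cavity precisions).

```python
import numpy as np

def norm_pdf(y, mean, var):
    return np.exp(-0.5 * (y - mean) ** 2 / var) / np.sqrt(2 * np.pi * var)

def ep_clutter(y, w=0.5, prior_var=100.0, clutter_var=10.0, iters=20):
    """EP for the 1-D clutter model with assumed parameters (prior N(0, 100))."""
    n = len(y)
    f_prec = np.zeros(n)                 # precision of each approximate factor ~f_i
    f_pm = np.zeros(n)                   # precision * mean of each ~f_i
    grid = np.linspace(-10, 10, 4001)
    for _ in range(iters):
        for i in range(n):
            q_prec = 1.0 / prior_var + f_prec.sum()      # current q(x), natural params
            q_pm = f_pm.sum()
            # 1. remove ~f_i to get the cavity (context) q^\i
            cav_prec, cav_pm = q_prec - f_prec[i], q_pm - f_pm[i]
            cav_mean, cav_var = cav_pm / cav_prec, 1.0 / cav_prec
            # 2. moment-match the tilted distribution f_i(x) q^\i(x) on the grid
            tilted = ((1 - w) * norm_pdf(y[i], grid, 1.0)
                      + w * norm_pdf(y[i], 0.0, clutter_var)) \
                     * norm_pdf(grid, cav_mean, cav_var)
            tilted /= tilted.sum()
            new_mean = np.sum(grid * tilted)
            new_var = np.sum((grid - new_mean) ** 2 * tilted)
            # 3. update ~f_i so that ~f_i * q^\i has the matched moments
            f_prec[i] = 1.0 / new_var - cav_prec
            f_pm[i] = new_mean / new_var - cav_pm
    q_prec = 1.0 / prior_var + f_prec.sum()
    return f_pm.sum() / q_prec, 1.0 / q_prec             # posterior mean and variance of q
```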

Gaussian found by EP [figure: EP approximation, exact posterior, and best Gaussian plotted over x].

Other methods [figure: VB and Laplace approximations compared with the exact posterior over x].

Accuracy. Posterior mean: exact = 1.64864, EP = 1.64514, Laplace = 1.61946, VB = 1.61834. Posterior variance: exact = 0.359673, EP = 0.311474, Laplace = 0.234616, VB = 0.171155.

Cost vs. accuracy [figures for 20 points and 200 points]. Deterministic methods improve with more data (the posterior is more Gaussian); sampling methods do not.

Time series problems

Example: Tracking. Guess the position of an object given noisy measurements. [figure: object trajectory with states x_1..x_4 and measurements y_1..y_4]

Factor graph [chain over x_1..x_4 with measurements y_1..y_4], e.g. x_t = x_{t-1} + ν_t (random walk), y_t = x_t + noise; we want the distribution of the x's given the y's.

Approximate factor graph [figure: approximated chain over x_1..x_4].

Splitting a pairwise factor [figure: the factor between x_1 and x_2, before and after splitting].

Splitting in context [figure: the factor between x_2 and x_3, split in the context of the rest of the chain].

Sweeping through the graph [four slides: the factor approximations are updated in a sweep across the chain x_1..x_4].

Example: Poisson tracking. y_t is a Poisson-distributed integer with mean exp(x_t).

Poisson tracking model: p(x_1) ~ N(0, 100), p(x_t | x_{t-1}) ~ N(x_{t-1}, 0.01), p(y_t | x_t) = exp(y_t x_t - e^{x_t}) / y_t!
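To show how the same moment-matching idea copes with this non-Gaussian measurement, here is a single forward sweep (assumed-density filtering, i.e. one pass of the EP schedule without revisiting factors) for the model above; the grid-based moment matching and all names are my own sketch, not the slides' implementation.

```python
import numpy as np

def poisson_adf(y, prior_mean=0.0, prior_var=100.0, trans_var=0.01):
    """One forward (filtering) sweep for: x_1 ~ N(0, 100),
    x_t | x_{t-1} ~ N(x_{t-1}, 0.01), y_t | x_t ~ Poisson(exp(x_t))."""
    m, v = prior_mean, prior_var
    means, variances = [], []
    grid = np.linspace(-20, 20, 4001)
    for t, yt in enumerate(y):
        if t > 0:
            v = v + trans_var                       # random-walk prediction step
        log_tilted = (-0.5 * (grid - m) ** 2 / v    # Gaussian prediction in x_t
                      + yt * grid - np.exp(grid))   # Poisson log-likelihood, log-rate x_t
        tilted = np.exp(log_tilted - log_tilted.max())
        tilted /= tilted.sum()
        m = np.sum(grid * tilted)                   # match the mean ...
        v = np.sum((grid - m) ** 2 * tilted)        # ... and the variance (ADF update)
        means.append(m)
        variances.append(v)
    return np.array(means), np.array(variances)
```

Full EP would keep a Gaussian approximation for every transition and measurement factor and sweep back and forth until convergence, which is what refines the posterior over earlier states.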

Factor graph [figure: chain over x_1..x_4 with measurements y_1..y_4, and its approximation].

Approximating a measurement factor [figure: the factor linking x_1 and y_1, and its Gaussian approximation in x_1].

Posterior for the last state

EP for signal detection (Qi and Minka, 2003). Wireless communication problem. Transmitted signal = a sin(ωt + φ); (a, φ) vary to encode each symbol. In complex numbers: a e^{iφ}. [figure: complex plane showing amplitude a and phase φ]

Binary symbols, Gaussian noise. Symbols are s_0 = 1 and s_1 = -1 (in the complex plane). Received signal y_t = a sin(ωt + φ) + noise. Optimal detection is easy in this case.

Fading channel. The channel systematically changes amplitude and phase: y_t = x_t s_t + noise, where s_t is the transmitted symbol and x_t is the channel multiplier (a complex number); x_t changes over time.

Differential detection. Use the last measurement to estimate the state: x_t ≈ y_{t-1} / s_{t-1}. The state estimate is noisy; can we do better?
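As a baseline for comparison, here is a sketch of what such a differential detector does in code (the symbol set, the assumption that the first symbol is a known pilot, and all names are mine):

```python
import numpy as np

def differential_detect(y, symbols=(1.0, -1.0), s0=1.0):
    """Estimate the channel from the previous measurement, x_t ~ y_{t-1} / s_{t-1},
    then pick the symbol whose predicted measurement is closest to y_t."""
    symbols = np.asarray(symbols, dtype=complex)
    s_hat = [complex(s0)]                            # first symbol assumed known (pilot)
    for t in range(1, len(y)):
        x_hat = y[t - 1] / s_hat[-1]                 # noisy single-sample channel estimate
        s_hat.append(symbols[np.argmin(np.abs(y[t] - x_hat * symbols))])
    return np.array(s_hat)
```

The factor graph below instead maintains a full posterior over the channel state x_t, which is what EP approximates.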

Factor graph [figure: symbols s_1..s_4, measurements y_1..y_4, channel states x_1..x_4]. Symbols can also be correlated (e.g. by an error-correcting code). Channel dynamics are learned from training data (all 1's).

Splitting a transition factor

Splitting a measurement factor

On-line implementation: iterate over the last δ measurements; previous measurements act as a prior. Results are comparable to particle filtering, but much faster.

Classification problems

Spam filtering by linear separation [figure: spam and not-spam points]. Choose a boundary that will generalize to new data.

Linear separation. Minimum training error solution (perceptron): too arbitrary, won't generalize well.

Linear separation. Maximum-margin solution (SVM): ignores information in the vertical direction.

Linear separation. Bayesian solution (via averaging): has a margin, and uses information in all dimensions.

Geometry of linear separation. A separator is any vector w such that w^T x_i > 0 (class 1), w^T x_i < 0 (class 2), and ||w|| = 1 (the unit sphere). This set has an unusual shape. SVM: optimize over it. Bayes: average over it.
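For intuition only, "average over it" can be brute-forced by rejection sampling directions on the unit sphere (this assumes separable data and is not the EP method used in the slides; names are mine):

```python
import numpy as np

def bayes_average_direction(X, labels, n_samples=100_000, seed=0):
    """Average of all sampled unit vectors w that classify every point correctly
    (labels are +1 / -1). EP instead fits a Gaussian posterior over w."""
    rng = np.random.default_rng(seed)
    w = rng.normal(size=(n_samples, X.shape[1]))
    w /= np.linalg.norm(w, axis=1, keepdims=True)    # uniform directions on the sphere
    ok = np.all((X @ w.T) * labels[:, None] > 0, axis=0)
    return w[ok].mean(axis=0)                        # empty if no sampled w separates X
```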

Factor graph

Performance on linear separation [figure: EP Gaussian approximation to the posterior].

Time vs. accuracy: a typical run on the 3-point problem. Error = distance to the true mean of w. Billiard = Monte Carlo sampling (Herbrich et al., 2001). Opper & Winther's algorithms: MF = mean-field theory, TAP = cavity method (equivalent to Gaussian EP for this problem).

Gaussian kernels: map the data into a high-dimensional feature space so that the classes become linearly separable.
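For concreteness, the standard Gaussian (RBF) kernel that implements such a feature map implicitly (the lengthscale value is arbitrary here, names mine):

```python
import numpy as np

def gaussian_kernel(X1, X2, lengthscale=1.0):
    """k(x, x') = exp(-||x - x'||^2 / (2 * lengthscale^2)) for all pairs of rows."""
    sq = (np.sum(X1 ** 2, axis=1)[:, None] + np.sum(X2 ** 2, axis=1)[None, :]
          - 2.0 * X1 @ X2.T)
    return np.exp(-0.5 * np.maximum(sq, 0.0) / lengthscale ** 2)
```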

Bayesian model comparison. Multiple models M_i with prior probabilities p(M_i). Posterior probabilities: p(M_i | D) ∝ p(D | M_i) p(M_i). For equal priors, models are compared using the model evidence p(D | M_i).
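Numerically this is usually done with log evidences; a small helper of my own showing the normalization:

```python
import numpy as np

def model_posteriors(log_evidences, log_priors=None):
    """p(M_i | D) proportional to p(D | M_i) p(M_i), computed from log values."""
    log_w = np.asarray(log_evidences, dtype=float)
    if log_priors is not None:
        log_w = log_w + np.asarray(log_priors, dtype=float)
    log_w -= log_w.max()                    # stabilize before exponentiating
    w = np.exp(log_w)
    return w / w.sum()
```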

Highest-probability kernel

Margin-maximizing kernel

Bayesian feature selection. Synthetic data where 6 features are relevant (out of 20): Bayes picks 6, margin picks 13.

EP versus Monte Carlo. Monte Carlo is general but expensive (a sledgehammer). EP exploits the underlying simplicity of the problem (if it exists). Monte Carlo is still needed for complex problems (e.g. large isolated peaks). The trick is to know what kind of problem you have.

Software for EP Bayes Point Machine toolbox http://research.microsoft.com/~minka/papers/ep/bpm/ Sparse Online Gaussian Process toolbox http://www.kyb.tuebingen.mpg.de/bs/people/csatol/ogp/index.html Infer.NET http://research.microsoft.com/infernet

Further reading EP bibliography http://research.microsoft.com/~minka/papers/ep/roadmap.html EP quick reference http://research.microsoft.com/~minka/papers/ep/minka-epquickref.pdf

Tomorrow: Variational Message Passing, divergence measures, and comparisons to EP.