Approximate Inference Part 1 of 2

Approximate Inference Part 1 of 2 Tom Minka Microsoft Research, Cambridge, UK Machine Learning Summer School 2009 http://mlg.eng.cam.ac.uk/mlss09/ 1

Bayesian paradigm Consistent use of probability theory for representing unknowns (parameters, latent variables, missing data) 2

Bayesian paradigm Bayesian posterior distribution summarizes what we've learned from training data and prior knowledge Can use the posterior to: describe training data, make predictions on test data, incorporate new data (online learning) Today's question: How to efficiently represent and compute posteriors? 3

Factor graphs Show how a function of several variables can be factored into a product of simpler functions, e.g. f(x,y,z) = (x+y)(y+z)(x+z) Very useful for representing posteriors 4

Example factor graph $p(x_i \mid m) = \mathcal{N}(x_i;\, m, 1)$ 5
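
Spelling out how a posterior of this form factorizes (the prior term $p(m)$ is an assumption about the rest of the model, not shown in the line above):

```latex
p(m \mid x_1, \dots, x_n) \;\propto\; p(m) \prod_{i=1}^{n} \mathcal{N}(x_i;\, m, 1)
```

Each term on the right-hand side corresponds to one factor node in the graph, connected to the variables it involves.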

Two tasks Modeling: What graph should I use for this data? Inference: Given the graph and data, what is the mean of x (for example)? Algorithms: Sampling, Variable elimination, Message-passing (Expectation Propagation, Variational Bayes, ...) 6

A (seemingly) intractable problem 7

Clutter problem Want to estimate x given multiple y's 8

Exact posterior Plot of the exact posterior p(x, D) versus x 9
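
For reference, a minimal sketch of computing this exact posterior on a grid; the mixture weight w = 0.5, outlier variance a = 10, prior N(0, 100), and the toy data below are assumptions (the standard clutter-problem setup in Minka's EP papers), not values read off the slide.

```python
import numpy as np
from scipy.stats import norm

# Clutter problem (assumed standard setup): each observation is either the
# signal x plus unit-variance noise, or an outlier from a broad Gaussian.
w, a = 0.5, 10.0                      # outlier weight and variance (assumptions)
prior_mean, prior_var = 0.0, 100.0

def clutter_lik(y, x):
    """Likelihood of one observation y given the latent mean x."""
    return (1 - w) * norm.pdf(y, loc=x, scale=1.0) + w * norm.pdf(y, loc=0.0, scale=np.sqrt(a))

def exact_posterior(ys, xs):
    """Exact posterior p(x | D) evaluated on a grid xs (normalized numerically)."""
    logp = norm.logpdf(xs, prior_mean, np.sqrt(prior_var))
    for y in ys:
        logp += np.log(clutter_lik(y, xs))
    p = np.exp(logp - logp.max())
    return p / np.trapz(p, xs)

ys = np.array([0.1, 1.2, 2.0, 4.5])   # toy data, for illustration only
xs = np.linspace(-1, 4, 1000)
post = exact_posterior(ys, xs)
print("posterior mean ≈", np.trapz(xs * post, xs))
```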

Representing posterior distributions Sampling: good for complex, multi-modal distributions; slow, but predictable accuracy. Deterministic approximation: good for simple, smooth distributions; fast, but unpredictable accuracy 10

Deterministic approximation Laplace's method: Bayesian curve fitting, neural networks (MacKay); Bayesian PCA (Minka). Variational bounds: Bayesian mixture of experts (Waterhouse); mixtures of PCA (Tipping, Bishop); factorial/coupled Markov models (Ghahramani, Jordan, Williams) 11

Moment matching Another way to perform deterministic approximation Much higher accuracy on some problems Assumed-density filtering (1984) Loopy belief propagation (1997) Expectation Propagation (2001) 12
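
For context (a standard result, not stated on the slide): "matching moments" here means choosing the Gaussian that minimizes the inclusive KL divergence, which is achieved exactly when the mean and variance agree:

```latex
q^\star \;=\; \arg\min_{q \in \text{Gaussians}} \mathrm{KL}\!\left(p \,\|\, q\right)
\quad\Longleftrightarrow\quad
\mathbb{E}_{q^\star}[x] = \mathbb{E}_p[x],
\qquad
\operatorname{var}_{q^\star}[x] = \operatorname{var}_p[x]
```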

Today Moment matching (Expectation Propagation) Tomorrow Variational bounds (Variational Message Passing) 13

Best Gaussian by moment matching Plot of the exact posterior p(x, D) and the best-fit Gaussian versus x 14

Strategy Approximate each factor by a Gaussian in x 15

Approximating a single factor 16

(naïve) $f_i(x)\, q^{\setminus i}(x) = p(x)$ (here $q^{\setminus i}(x)$ denotes the product of the other factors, the context in which $f_i$ appears) 17

(informed) $f_i(x)\, q^{\setminus i}(x) = p(x)$ 18

Single factor with Gaussian context 19

Gaussian multiplication formula 20
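
The slide's equation did not survive extraction; the standard Gaussian multiplication identity it refers to is:

```latex
\mathcal{N}(x;\, m_1, v_1)\,\mathcal{N}(x;\, m_2, v_2)
  = \mathcal{N}(m_1;\, m_2,\, v_1 + v_2)\,\mathcal{N}(x;\, m, v),
\qquad
\frac{1}{v} = \frac{1}{v_1} + \frac{1}{v_2},
\quad
m = v\left(\frac{m_1}{v_1} + \frac{m_2}{v_2}\right)
```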

Approximation with narrow context 21

Approximation with medium context 22

Approximation with wide context 23

Two factors Diagram: two factors attached to x, approximated by message passing 24

Three factors Diagram: three factors attached to x, approximated by message passing 25

Message Passing = Distributed Optimization Messages represent a simpler distribution q(x) that approximates p(x) (a distributed representation). Message passing = optimizing q to fit p; q stands in for p when answering queries. Choices: what type of distribution to construct (approximating family), and what cost to minimize (divergence measure) 26

Distributed divergence minimization Write p as a product of factors: $p(x) = \prod_a f_a(x)$ Approximate the factors one by one: $f_a(x) \approx \tilde{f}_a(x)$ Multiply to get the approximation: $q(x) = \prod_a \tilde{f}_a(x)$ 27
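
Putting the pieces together, here is a minimal EP sketch for the clutter problem introduced earlier, under the same assumed setup (w = 0.5, a = 10, prior N(0, 100)); it is written for clarity rather than robustness, e.g. it simply skips an update whenever the cavity precision is not positive.

```python
import numpy as np
from scipy.stats import norm

w, a = 0.5, 10.0                         # clutter weight / variance (assumptions)
prior_tau, prior_nu = 1.0 / 100.0, 0.0   # prior N(0, 100) in natural parameters

def ep_clutter(ys, n_sweeps=20):
    n = len(ys)
    tau = np.zeros(n)                    # precision of each approximate factor
    nu = np.zeros(n)                     # precision*mean of each approximate factor
    for _ in range(n_sweeps):
        for i, y in enumerate(ys):
            # 1. Cavity: remove factor i from the current Gaussian posterior.
            tau_cav = prior_tau + tau.sum() - tau[i]
            nu_cav = prior_nu + nu.sum() - nu[i]
            if tau_cav <= 0:             # skip pathological cavities in this sketch
                continue
            v_cav, m_cav = 1.0 / tau_cav, nu_cav / tau_cav
            # 2. Moments of (exact factor) * (cavity), a two-component mixture.
            z_sig = (1 - w) * norm.pdf(y, m_cav, np.sqrt(v_cav + 1.0))
            z_clu = w * norm.pdf(y, 0.0, np.sqrt(a))
            r = z_sig / (z_sig + z_clu)                  # responsibility of the signal part
            m_sig = m_cav + v_cav * (y - m_cav) / (v_cav + 1.0)
            v_sig = v_cav - v_cav**2 / (v_cav + 1.0)
            m_new = r * m_sig + (1 - r) * m_cav
            ex2 = r * (v_sig + m_sig**2) + (1 - r) * (v_cav + m_cav**2)
            v_new = ex2 - m_new**2
            # 3. Update factor i so the new posterior matches these moments.
            tau[i] = 1.0 / v_new - tau_cav
            nu[i] = m_new / v_new - nu_cav
    tau_post = prior_tau + tau.sum()
    nu_post = prior_nu + nu.sum()
    return nu_post / tau_post, 1.0 / tau_post            # posterior mean and variance

print(ep_clutter(np.array([0.1, 1.2, 2.0, 4.5])))
```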

Gaussian found by EP Plot of the EP approximation, the exact posterior, and the best-fit Gaussian p(x, D) versus x 28

Other methods Plot of the VB and Laplace approximations and the exact posterior p(x, D) versus x 29

Accuracy Posterior mean: exact = 1.64864, EP = 1.64514, Laplace = 1.61946, VB = 1.61834. Posterior variance: exact = 0.359673, EP = 0.311474, Laplace = 0.234616, VB = 0.171155 30

Cost vs. accuracy Plots for 20 points and 200 points: deterministic methods improve with more data (the posterior is more Gaussian); sampling methods do not 31

Censoring example Want to estimate x given multiple y's $p(x) = \mathcal{N}(x;\, 0, 100)$, $p(y_i \mid x) = \mathcal{N}(y_i;\, x, 1)$, $p(|y_i| > t \mid x) = \int_{-\infty}^{-t} \mathcal{N}(y;\, x, 1)\,dy + \int_{t}^{\infty} \mathcal{N}(y;\, x, 1)\,dy$ 32
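
The two-sided integral reduces to Gaussian CDFs; a minimal check in code (the threshold and x values below are placeholders, not taken from the slide):

```python
from scipy.stats import norm

def censored_lik(x, t):
    """P(|y| > t | x) for y ~ N(x, 1): mass below -t plus mass above t."""
    return norm.cdf(-t - x) + norm.cdf(x - t)

print(censored_lik(x=0.5, t=1.0))   # placeholder values, for illustration only
```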

Time series problems 33

Example: Tracking Guess the position of an object given noisy measurements (diagram: true positions x_1, ..., x_4 and noisy measurements y_1, ..., y_4) 34

Factor graph Chain over states x_1, ..., x_4 with measurements y_1, ..., y_4; e.g. $x_t = x_{t-1} + \nu_t$ (random walk), $y_t = x_t + \text{noise}$; want the distribution of the x's given the y's 35
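
On this linear-Gaussian chain, message passing reduces to exact Kalman filtering; a minimal forward-pass sketch (the noise variances q and r and the data below are illustrative choices, not values from the slide):

```python
import numpy as np

def forward_messages(ys, q=0.1, r=1.0, m0=0.0, v0=100.0):
    """Exact Gaussian filtering for x_t = x_{t-1} + noise(q), y_t = x_t + noise(r).

    Returns the mean and variance of p(x_t | y_1..y_t) for each t, which is
    exactly what the forward messages carry on this chain."""
    m, v = m0, v0
    means, variances = [], []
    for y in ys:
        m_pred, v_pred = m, v + q          # prediction message from x_{t-1}
        k = v_pred / (v_pred + r)          # how much to trust the measurement
        m = m_pred + k * (y - m_pred)      # incorporate the measurement factor
        v = (1.0 - k) * v_pred
        means.append(m)
        variances.append(v)
    return np.array(means), np.array(variances)

print(forward_messages(np.array([0.2, 0.3, 0.7, 1.1])))
```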

Approximate factor graph Chain over x_1, ..., x_4 with approximated factors 36

Splitting a pairwise factor Diagram: the factor between x_1 and x_2 is split into separate approximate factors on x_1 and x_2 37

Splitting in context Diagram: splitting the pairwise factor between x_2 and x_3 in the context of the rest of the graph 38

Sweeping through the graph (slides 39–42) The approximation is refined by sweeping through the chain x_1, ..., x_4, updating one factor at a time

Example: Poisson tracking y_t is a Poisson-distributed integer with mean exp(x_t) 43

Poisson tracking model $p(x_1) \sim \mathcal{N}(0, 100)$, $p(x_t \mid x_{t-1}) \sim \mathcal{N}(x_{t-1}, 0.01)$, $p(y_t \mid x_t) = \exp(y_t x_t - e^{x_t}) / y_t!$ 44

Factor graph Diagram: the exact chain over x_1, ..., x_4 with measurements y_1, ..., y_4, and the approximate chain over x_1, ..., x_4 45

Approximating a measurement factor Diagram: the Poisson factor linking x_1 and y_1 is replaced by a Gaussian factor on x_1 46
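
A sketch of what this step amounts to, assuming the Poisson model above: multiply the measurement factor by its Gaussian context, match the moments (numerically here, via a grid), then divide the context back out to obtain the Gaussian approximate factor.

```python
import numpy as np
from scipy.stats import norm, poisson

def approx_measurement_factor(y, m_ctx, v_ctx):
    """Moment-match f(x)*context, where f(x) = Poisson(y | rate=exp(x)) and the
    context is N(x; m_ctx, v_ctx); return the Gaussian approximation of f alone."""
    xs = np.linspace(m_ctx - 10 * np.sqrt(v_ctx), m_ctx + 10 * np.sqrt(v_ctx), 2000)
    tilted = poisson.pmf(y, np.exp(xs)) * norm.pdf(xs, m_ctx, np.sqrt(v_ctx))
    z = np.trapz(tilted, xs)
    mean = np.trapz(xs * tilted, xs) / z
    var = np.trapz((xs - mean) ** 2 * tilted, xs) / z
    # Divide out the context (subtract natural parameters) to get the factor itself.
    tau_f = 1.0 / var - 1.0 / v_ctx
    nu_f = mean / var - m_ctx / v_ctx
    return nu_f / tau_f, 1.0 / tau_f     # mean and variance of the approximate factor

print(approx_measurement_factor(y=3, m_ctx=1.0, v_ctx=0.5))   # illustrative values
```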

Posterior for the last state 48

EP for signal detection (Qi and Minka, 2003) Wireless communication problem: transmitted signal = $a \sin(\omega t + \phi)$; $(a, \phi)$ vary to encode each symbol; in complex numbers, the symbol is $a e^{i\phi}$ (diagram: amplitude $a$ and phase $\phi$ in the complex plane) 51

Binary symbols, Gaussian noise Symbols are $s = 1$ and $s = -1$ (in the complex plane); received signal $y_t = a \sin(\omega t + \phi) + \text{noise}$; optimal detection is easy in this case 52

Fading channel Channel systematically changes amplitude and phase: $y_t = x_t s_t + \text{noise}$, where $s_t$ = transmitted symbol and $x_t$ = channel multiplier (a complex number); $x_t$ changes over time 53

Differential detection Use the last measurement to estimate the state: $x_t \approx y_{t-1} / s_{t-1}$. The state estimate is noisy; can we do better? 54
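
A toy simulation of the fading channel and differential detection described above (all parameters, noise levels, and the detection rule sign(Re(y_t / x̂_t)) are illustrative assumptions, not taken from the slides):

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulate a slowly varying complex channel and binary symbols (toy parameters).
T = 200
x = 1.0 + np.cumsum(0.03 * (rng.standard_normal(T) + 1j * rng.standard_normal(T)))
s = rng.choice([1.0, -1.0], size=T)
y = x * s + 0.1 * (rng.standard_normal(T) + 1j * rng.standard_normal(T))

# Differential detection: use the previous measurement as a noisy channel estimate.
x_hat = y[:-1] / s[:-1]
s_hat = np.sign(np.real(y[1:] / x_hat))       # detect the next symbol
print("symbol error rate:", np.mean(s_hat != s[1:]))
```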

Factor graph Diagram: symbols s_1, ..., s_4, measurements y_1, ..., y_4, and channel states x_1, ..., x_4. Symbols can also be correlated (e.g. an error-correcting code); channel dynamics are learned from training data (all 1's) 55

Splitting a transition factor 58

Splitting a measurement factor 59

On-line implementation Iterate over the last δ measurements; previous measurements act as a prior; results comparable to particle filtering, but much faster 60

Classification problems 62

Spam filtering by linear separation Diagram: spam and not-spam points in feature space; choose a boundary that will generalize to new data 63

Linear separation Minimum training error solution (Perceptron): too arbitrary, won't generalize well 64

Linear separation Maximum-margin solution (SVM) Ignores information in the vertical direction 65

Linear separation Bayesian solution (via averaging) Has a margin, and uses information in all dimensions 66

Geometry of linear separation A separator is any vector w such that: $w^\top x_i > 0$ (class 1), $w^\top x_i < 0$ (class 2), $\|w\| = 1$ (sphere). This set has an unusual shape. SVM: optimize over it. Bayes: average over it 67
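
To make "average over it" concrete, here is a toy sketch of the Bayes-point idea: sample directions uniformly on the unit sphere, keep those that separate the (made-up) training data, and average them. This is a crude Monte Carlo stand-in, not the EP procedure used on the following slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy labelled data (made up for illustration): rows are points, labels are +/-1.
X = np.array([[1.0, 0.2], [0.8, 1.1], [-0.9, -0.3], [-1.2, 0.4]])
y = np.array([1, 1, -1, -1])

def bayes_point(X, y, n_samples=200_000):
    """Average all sampled unit vectors w that classify every training point correctly."""
    w = rng.standard_normal((n_samples, X.shape[1]))
    w /= np.linalg.norm(w, axis=1, keepdims=True)        # uniform on the sphere
    ok = np.all((X @ w.T) * y[:, None] > 0, axis=0)      # w separates all points
    mean_w = w[ok].mean(axis=0)
    return mean_w / np.linalg.norm(mean_w)

print(bayes_point(X, y))
```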

Performance on linear separation EP Gaussian approximation to posterior 68

Factor graph $p(w) = \mathcal{N}(w;\, 0, I)$ 69

Computing moments $q^{\setminus i}(w) = \mathcal{N}(w;\, m^{\setminus i}, V^{\setminus i})$ 70

Computing moments 71

Time vs. accuracy A typical run on the 3-point problem. Error = distance to the true mean of w. Billiard = Monte Carlo sampling (Herbrich et al., 2001). Opper & Winther's algorithms: MF = mean-field theory, TAP = cavity method (equivalent to Gaussian EP for this problem) 72

Gaussian kernels Map the data into a high-dimensional space so that the classes become linearly separable 73
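
For reference, the standard Gaussian (RBF) kernel; the slide does not show its exact parameterization, so the bandwidth $\sigma$ here is generic:

```latex
k(x, x') = \exp\!\left(-\frac{\lVert x - x'\rVert^2}{2\sigma^2}\right)
```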

Bayesian model comparison Multiple models $M_i$ with prior probabilities $p(M_i)$ Posterior probabilities: $p(M_i \mid D) \propto p(D \mid M_i)\, p(M_i)$ For equal priors, models are compared using the model evidence $p(D \mid M_i)$ 74

Highest-probability kernel 75

Margin-maximizing kernel 76

Bayesian feature selection Synthetic data where 6 features are relevant (out of 20) Bayes picks 6 Margin picks 13 77

EP versus Monte Carlo Monte Carlo is general but expensive (a sledgehammer); EP exploits the underlying simplicity of the problem (if it exists); Monte Carlo is still needed for complex problems (e.g. large isolated peaks); the trick is to know what problem you have 78

Software for EP Bayes Point Machine toolbox http://research.microsoft.com/~minka/papers/ep/bpm/ Sparse Online Gaussian Process toolbox http://www.kyb.tuebingen.mpg.de/bs/people/csatol/ogp/index.html Infer.NET http://research.microsoft.com/infernet 79

Further reading EP bibliography http://research.microsoft.com/~minka/papers/ep/roadmap.html EP quick reference http://research.microsoft.com/~minka/papers/ep/minka-epquickref.pdf 80

Tomorrow Variational Message Passing Divergence measures Comparisons to EP 81