Approximate Inference Part 1 of 2 Tom Minka Microsoft Research, Cambridge, UK Machine Learning Summer School 2009 http://mlg.eng.cam.ac.uk/mlss09/ 1
Bayesian paradigm Consistent use of probability theory for representing unknowns (parameters, latent variables, missing data) 2
Bayesian paradigm The Bayesian posterior distribution summarizes what we've learned from training data and prior knowledge. Can use the posterior to: describe training data; make predictions on test data; incorporate new data (online learning). Today's question: how to efficiently represent and compute posteriors? 3
Factor graphs Shows how a function of several variables can be factored into a product of simpler functions f(x,y,z) = (x+y)(y+z)(x+z) Very useful for representing posteriors 4
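As a minimal sketch (not from the slides) of what "product of simpler functions" means computationally, the example above can be stored as a list of local factors and evaluated by multiplying them:

```python
# Represent f(x,y,z) = (x+y)(y+z)(x+z) as a product of local factors.
# Each factor touches only a subset of the variables.
factors = [
    lambda v: v["x"] + v["y"],   # factor on (x, y)
    lambda v: v["y"] + v["z"],   # factor on (y, z)
    lambda v: v["x"] + v["z"],   # factor on (x, z)
]

def f(v):
    """Evaluate the full function as the product of its factors."""
    result = 1.0
    for factor in factors:
        result *= factor(v)
    return result

print(f({"x": 1.0, "y": 2.0, "z": 3.0}))  # (1+2)*(2+3)*(1+3) = 60
```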
Example factor graph p(x_i | m) = N(x_i; m, 1) 5
Two tasks Modeling: what graph should I use for this data? Inference: given the graph and data, what is the mean of x (for example)? Algorithms: sampling; variable elimination; message-passing (Expectation Propagation, Variational Bayes, ...) 6
A (seemingly) intractable problem 7
Clutter problem Want to estimate x given multiple y's 8
Exact posterior: plot of the exact p(x,D) against x (x from -1 to 4). 9
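The exact posterior in a plot like this can be computed by brute force on a grid. A sketch, assuming a clutter-style likelihood (a mixture of an on-target Gaussian and a broad outlier component); the mixture weight, outlier width, prior variance, and data below are assumptions for illustration, not values from the talk:

```python
import numpy as np
from scipy.stats import norm

w = 0.5                                   # assumed probability of a clutter point
y = np.array([0.1, 1.8, 2.3, 2.0, 8.0])   # made-up observations (the last one is clutter)

xs = np.linspace(-1, 4, 2001)             # grid over x
prior = norm.pdf(xs, loc=0, scale=10)     # assumed prior p(x) = N(x; 0, 100)

# Assumed likelihood: p(y_i | x) = (1 - w) N(y_i; x, 1) + w N(y_i; 0, 10)
lik = np.ones_like(xs)
for yi in y:
    lik *= (1 - w) * norm.pdf(yi, loc=xs, scale=1) + w * norm.pdf(yi, loc=0, scale=np.sqrt(10))

post = prior * lik
post /= np.trapz(post, xs)                # normalize numerically
```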
Representing posterior distributions Sampling: good for complex, multi-modal distributions; slow, but predictable accuracy. Deterministic approximation: good for simple, smooth distributions; fast, but unpredictable accuracy. 10
Deterministic approximation Laplace's method: Bayesian curve fitting, neural networks (MacKay); Bayesian PCA (Minka). Variational bounds: Bayesian mixture of experts (Waterhouse); mixtures of PCA (Tipping, Bishop); factorial/coupled Markov models (Ghahramani, Jordan, Williams). 11
Moment matching Another way to perform deterministic approximation Much higher accuracy on some problems Assumed-density filtering (1984) Loopy belief propagation (1997) Expectation Propagation (2001) 12
Today Moment matching (Expectation Propagation) Tomorrow Variational bounds (Variational Message Passing) 13
Best Gaussian by moment matching: plot of p(x,D) showing the exact posterior and the best Gaussian fit (x from -1 to 4). 14
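Continuing the grid snippet above (it reuses xs and post): moment matching picks the Gaussian with the same mean and variance as the exact posterior, which can be read straight off the grid.

```python
# Moment matching: the best Gaussian (in the KL(p||q) sense) has the same
# mean and variance as the exact posterior.
mean = np.trapz(xs * post, xs)
var = np.trapz((xs - mean) ** 2 * post, xs)
best_gaussian = norm.pdf(xs, loc=mean, scale=np.sqrt(var))
print("moment-matched Gaussian: mean %.3f, variance %.3f" % (mean, var))
```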
Strategy Approximate each factor by a Gaussian in x 15
Approximating a single factor 16
(naïve) f_i(x) q^{\i}(x) = p(x) 17
(informed) f_i(x) q^{\i}(x) = p(x) 18
Single factor with Gaussian context 19
Gaussian multiplication formula 20
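The formula this slide refers to (the image is not in this transcript) is the standard product-of-Gaussians identity: precisions add, precision-weighted means add, and a Gaussian scale factor is left over. A sketch with a numerical check:

```python
import numpy as np
from scipy.stats import norm

def multiply_gaussians(m1, v1, m2, v2):
    """Product of two Gaussian densities in x:
    N(x; m1, v1) * N(x; m2, v2) = s * N(x; m, v), where
      1/v = 1/v1 + 1/v2,   m = v * (m1/v1 + m2/v2),
    and the scale factor is s = N(m1; m2, v1 + v2)."""
    v = 1.0 / (1.0 / v1 + 1.0 / v2)
    m = v * (m1 / v1 + m2 / v2)
    s = norm.pdf(m1, loc=m2, scale=np.sqrt(v1 + v2))
    return m, v, s

# Quick numerical check at one point x:
m, v, s = multiply_gaussians(0.0, 1.0, 2.0, 4.0)
x = 0.7
lhs = norm.pdf(x, 0.0, 1.0) * norm.pdf(x, 2.0, 2.0)   # scale = sqrt(variance)
rhs = s * norm.pdf(x, m, np.sqrt(v))
print(lhs, rhs)  # the two values should agree
```

Working with precisions (inverse variances) is also how the messages in the later slides are combined cheaply.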
Approximation with narrow context 21
Approximation with medium context 22
Approximation with wide context 23
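Slides 19 to 23 vary the width of the Gaussian context around a single factor. A sketch of the underlying computation: multiply one non-Gaussian factor f_i(x) by a Gaussian context q^{\i}(x), compute the moments of the product numerically, and use the Gaussian with those moments. The specific two-component factor below is an assumption for illustration.

```python
import numpy as np
from scipy.stats import norm

def project_to_gaussian(factor, context_mean, context_var):
    """Moment-match factor(x) * N(x; context_mean, context_var) with a Gaussian.
    Returns (Z, mean, var) of the unnormalized product, computed on a grid."""
    grid = np.linspace(-20, 20, 20001)
    context = norm.pdf(grid, loc=context_mean, scale=np.sqrt(context_var))
    p = factor(grid) * context
    Z = np.trapz(p, grid)
    mean = np.trapz(grid * p, grid) / Z
    var = np.trapz((grid - mean) ** 2 * p, grid) / Z
    return Z, mean, var

def factor(x):
    # An assumed non-Gaussian factor (mixture of two Gaussians), for illustration only.
    return 0.5 * norm.pdf(x, -2, 0.5) + 0.5 * norm.pdf(x, 2, 0.5)

for context_var in [0.1, 1.0, 100.0]:   # narrow, medium, wide context
    Z, m, v = project_to_gaussian(factor, 0.0, context_var)
    print("context var %6.1f -> approx N(%.3f, %.3f)" % (context_var, m, v))
```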
Two factors on x: message passing. 24
Three factors on x: message passing. 25
Message Passing = Distributed Optimization Messages represent a simpler distribution q(x) that approximates p(x) A distributed representation Message passing = optimizing q to fit p q stands in for p when answering queries Choices: What type of distribution to construct (approximating family) What cost to minimize (divergence measure) 26
Distributed divergence minimization Write p as a product of factors, approximate the factors one by one, and multiply the approximate factors to get the overall approximation (in symbols below). 27
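In symbols (standard EP notation; the factor index a and the tilde for approximate factors are conventions, not shown in this transcript):

```latex
p(x) = \prod_a f_a(x), \qquad
f_a(x) \;\approx\; \tilde{f}_a(x), \qquad
q(x) = \prod_a \tilde{f}_a(x) \;\approx\; p(x)
```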
Gaussian found by EP: plot comparing the EP approximation, the exact posterior, and the best Gaussian fit of p(x,D) (x from -1 to 4). 28
Other methods: plot comparing the VB and Laplace approximations with the exact p(x,D) (x from -1 to 4). 29
Accuracy Posterior mean: exact = 1.64864 ep = 1.64514 laplace = 1.61946 vb = 1.61834 Posterior variance: exact = 0.359673 ep = 0.311474 laplace = 0.234616 vb = 0.171155 30
Cost vs. accuracy Plots for 20 points and 200 points. Deterministic methods improve with more data (the posterior is more Gaussian); sampling methods do not. 31
Censoring example Want to estimate x given multiple y's. p(x) = N(x; 0, 100), p(y_i | x) = N(y_i; x, 1), p(|y_i| > t | x) = ∫_{-∞}^{-t} N(y; x, 1) dy + ∫_{t}^{∞} N(y; x, 1) dy 32
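A sketch of evaluating the censored factor above with the normal CDF, assuming the censoring event is |y_i| > t (both tails), as the two integrals indicate:

```python
from scipy.stats import norm

def censored_factor(x, t):
    """p(|y_i| > t | x) for y_i ~ N(x, 1): probability mass in both tails beyond +/- t."""
    return norm.cdf(-t - x) + 1.0 - norm.cdf(t - x)

print(censored_factor(x=0.5, t=2.0))
```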
Time series problems 33
Example: Tracking Guess the position of an object given noisy measurements (hidden states x_1, ..., x_4; measurements y_1, ..., y_4). 34
Factor graph over x_1, ..., x_4 and y_1, ..., y_4. e.g. x_t = x_{t-1} + ν_t (random walk), y_t = x_t + noise; want the distribution of the x's given the y's. 35
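A sketch of simulating this random-walk tracking model; the noise standard deviations and the prior on the first state are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 4
transition_std = 1.0    # assumed std of the random-walk noise nu_t
measurement_std = 0.5   # assumed std of the measurement noise

x = np.zeros(T)
y = np.zeros(T)
x[0] = rng.normal(0.0, 1.0)                             # assumed prior on the first state
y[0] = x[0] + rng.normal(0.0, measurement_std)
for t in range(1, T):
    x[t] = x[t - 1] + rng.normal(0.0, transition_std)   # x_t = x_{t-1} + nu_t
    y[t] = x[t] + rng.normal(0.0, measurement_std)      # y_t = x_t + noise
print("states:", x)
print("measurements:", y)
```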
Approximate factor graph over x_1, ..., x_4. 36
Splitting a pairwise factor (between x_1 and x_2). 37
Splitting in context (the factor between x_2 and x_3). 38
Sweeping through the graph: x_1, ..., x_4 (the sweep is shown step by step across slides 39-42).
Example: Poisson tracking y_t is a Poisson-distributed integer with mean exp(x_t) 43
Poisson tracking model p(x_1) ~ N(0, 100), p(x_t | x_{t-1}) ~ N(x_{t-1}, 0.01), p(y_t | x_t) = exp(y_t x_t - e^{x_t}) / y_t! 44
Factor graph: x_1, ..., x_4 with measurements y_1, ..., y_4, and the approximate graph over x_1, ..., x_4. 45
Approximating a measurement factor (the factor linking x_1 and y_1). 46
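For the Poisson model, the measurement factor p(y_t | x_t) is not Gaussian in x_t, so the moment-matching step can be done by one-dimensional numerical integration. A sketch (grid quadrature is used here purely for illustration, not as the method from the talk):

```python
import numpy as np
from scipy.special import gammaln

def match_poisson_factor(y, context_mean, context_var, n=4001, width=8.0):
    """Moment-match p(y | x) * N(x; context_mean, context_var) with a Gaussian,
    where p(y | x) = exp(y * x - e^x) / y!  (Poisson with rate e^x)."""
    std = np.sqrt(context_var)
    xs = np.linspace(context_mean - width * std, context_mean + width * std, n)
    log_context = -0.5 * (xs - context_mean) ** 2 / context_var - 0.5 * np.log(2 * np.pi * context_var)
    log_factor = y * xs - np.exp(xs) - gammaln(y + 1)
    p = np.exp(log_factor + log_context)
    Z = np.trapz(p, xs)
    mean = np.trapz(xs * p, xs) / Z
    var = np.trapz((xs - mean) ** 2 * p, xs) / Z
    return Z, mean, var

print(match_poisson_factor(y=3, context_mean=0.0, context_var=1.0))
```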
Posterior for the last state 48
EP for signal detection (Qi and Minka, 2003) Wireless communication problem. Transmitted signal = a sin(ωt + φ); (a, φ) vary to encode each symbol. In complex numbers: a e^{iφ}. 51
Binary symbols, Gaussian noise Symbols are s = 1 and s = -1 (in the complex plane). Received signal y_t = a sin(ωt + φ) + noise. Optimal detection is easy in this case. 52
Fading channel The channel systematically changes amplitude and phase: y_t = x_t s_t + noise, where s_t = transmitted symbol and x_t = channel multiplier (a complex number); x_t changes over time. 53
Differential detection Use the last measurement to estimate the state: x_t ≈ y_{t-1} / s_{t-1}. The state estimate is noisy; can we do better? 54
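A sketch of the differential detector described above, run on simulated data; the slowly drifting channel used to generate the data is an assumption for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
T = 200
symbols = rng.choice([1.0, -1.0], size=T)           # binary symbols s_t
phase = np.cumsum(rng.normal(0.0, 0.05, size=T))    # assumed slowly drifting phase
channel = np.exp(1j * phase)                        # channel multiplier x_t (complex)
noise = 0.1 * (rng.normal(size=T) + 1j * rng.normal(size=T))
y = channel * symbols + noise                       # y_t = x_t s_t + noise

# Differential detection: estimate the channel from the previous measurement,
# x_hat_t = y_{t-1} / s_{t-1}, then pick the symbol closest to y_t / x_hat_t.
detected = np.empty(T)
detected[0] = symbols[0]                            # assume the first symbol is known (training)
for t in range(1, T):
    x_hat = y[t - 1] / detected[t - 1]
    detected[t] = 1.0 if (y[t] / x_hat).real >= 0 else -1.0

print("symbol error rate:", np.mean(detected != symbols))
```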
Factor graph over symbols s_1, ..., s_4, measurements y_1, ..., y_4, and channel states x_1, ..., x_4. Symbols can also be correlated (e.g. by an error-correcting code). Channel dynamics are learned from training data (all 1's). 55
Splitting a transition factor 58
Splitting a measurement factor 59
On-line implementation Iterate over the last δ measurements Previous measurements act as prior Results comparable to particle filtering, but much faster 60
Classification problems 62
Spam filtering by linear separation Classes: spam vs. not spam. Choose a boundary that will generalize to new data. 63
Linear separation Minimum training error solution (Perceptron): too arbitrary, won't generalize well. 64
Linear separation Maximum-margin solution (SVM) Ignores information in the vertical direction 65
Linear separation Bayesian solution (via averaging) Has a margin, and uses information in all dimensions 66
Geometry of linear separation The separator is any vector w such that: w^T x_i > 0 (class 1), w^T x_i < 0 (class 2), ||w|| = 1 (sphere). This set has an unusual shape. SVM: optimize over it. Bayes: average over it. 67
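A brute-force sketch of "average over it": sample unit-norm weight vectors, keep the ones that classify a small made-up data set correctly, and average them. This rejection sampler is only to make the geometry concrete; the point of the talk is that EP approximates this average far more cheaply.

```python
import numpy as np

rng = np.random.default_rng(2)
# Toy 2D data: the class labels +1 / -1 are folded into the points, so the
# separator condition is simply w^T x_i > 0 for every i.
X = np.array([[1.0, 0.2], [0.8, 0.6], [-0.9, 0.4]])
labels = np.array([1.0, 1.0, -1.0])
Xy = X * labels[:, None]

samples = rng.normal(size=(200000, 2))
samples /= np.linalg.norm(samples, axis=1, keepdims=True)   # points on the unit sphere
consistent = np.all(Xy @ samples.T > 0, axis=0)             # w^T x_i > 0 for all i
bayes_point = samples[consistent].mean(axis=0)
bayes_point /= np.linalg.norm(bayes_point)
print("fraction of consistent w's:", consistent.mean())
print("approximate Bayes point:", bayes_point)
```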
Performance on linear separation EP Gaussian approximation to posterior 68
Factor graph p(w) = N(w; 0, I) 69
Computing moments q^{\i}(w) = N(w; m^{\i}, V^{\i}) 70
Computing moments 71
Time vs. accuracy A typical run on the 3-point problem. Error = distance to the true mean of w. Billiard = Monte Carlo sampling (Herbrich et al., 2001). Opper & Winther's algorithms: MF = mean-field theory, TAP = cavity method (equivalent to Gaussian EP for this problem). 72
Gaussian kernels Map the data into a high-dimensional space so that ... 73
Bayesian model comparison Multiple models M_i with prior probabilities p(M_i). Posterior probabilities: p(M_i | D) ∝ p(D | M_i) p(M_i). For equal priors, models are compared using the model evidence p(D | M_i). 74
Highest-probability kernel 75
Margin-maximizing kernel 76
Bayesian feature selection Synthetic data where 6 features are relevant (out of 20) Bayes picks 6 Margin picks 13 77
EP versus Monte Carlo Monte Carlo is general but expensive: a sledgehammer. EP exploits the underlying simplicity of the problem (if it exists). Monte Carlo is still needed for complex problems (e.g. large isolated peaks). The trick is to know which kind of problem you have. 78
Software for EP Bayes Point Machine toolbox http://research.microsoft.com/~minka/papers/ep/bpm/ Sparse Online Gaussian Process toolbox http://www.kyb.tuebingen.mpg.de/bs/people/csatol/ogp/index.html Infer.NET http://research.microsoft.com/infernet 79
Further reading EP bibliography http://research.microsoft.com/~minka/papers/ep/roadmap.html EP quick reference http://research.microsoft.com/~minka/papers/ep/minka-epquickref.pdf 80
Tomorrow Variational Message Passing Divergence measures Comparisons to EP 81