Chalmers Machine Learning Summer School Approximate message passing and biomedicine Part 1: Expectation Propagation Tom Heskes Machine Learning Group, Institute for Computing and Information Sciences Radboud University Nijmegen, The Netherlands April 15, 2015
Outline
Bayesian Machine Learning: probabilistic modeling; approximate inference.
Expectation Propagation: bit of history; factor graphs; iterative procedure.
EP for Gaussian process classification: locality property; step by step.
Conclusion
Bayesian Machine Learning
Enumerate all reasonable models θ and assign a prior belief $p(\theta)$.
Upon observing the data D, compute the likelihood $p(D \mid \theta)$.
Compute the posterior probability over models using Bayes' rule: $p(\theta \mid D) \propto p(D \mid \theta)\, p(\theta)$.
Problem: the posterior distribution is often intractable.
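A minimal numerical illustration (not from the slides; a hypothetical coin-flip model) of Bayes' rule on a discretized parameter grid. The normalization that is trivial here is exactly what becomes intractable for high-dimensional θ:

    import numpy as np

    # Hypothetical example: coin bias theta, uniform prior on a grid,
    # posterior after observing 7 heads and 3 tails.
    theta = np.linspace(0.001, 0.999, 999)    # grid of candidate models
    dtheta = theta[1] - theta[0]
    prior = np.ones_like(theta)               # p(theta): flat prior belief
    likelihood = theta**7 * (1 - theta)**3    # p(D | theta)
    posterior = prior * likelihood            # Bayes' rule, unnormalized
    posterior /= posterior.sum() * dtheta     # normalize to get p(theta | D)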
Approximate Inference
Stochastic sampling methods: Markov chain Monte Carlo (MCMC) sampling, Gibbs sampling, particle filtering.
Deterministic methods: variational ("mean-field") approaches, loopy belief propagation, expectation propagation.
Expectation Propagation
Message-passing algorithm, invented by Thomas Minka (PhD thesis, 2001).
Its generalization, power EP, encompasses a large class of deterministic algorithms for approximate inference.
Arguably gives the best approximate inference results, if it converges...
Implemented in Microsoft's Infer.NET.
Factors
Many probabilistic models factorize, i.e., can be written in the form $p(D, \theta) = \prod_i f_i(\theta)$.
For example, with independently, identically distributed data, there is one factor $f_n(\theta) = p(x_n \mid \theta)$ for each data point $x_n$, along with a factor $f_0(\theta) = p(\theta)$ for the prior.
This also applies to Gaussian process regression and classification: θ is drawn from a Gaussian process prior, and each of the factors further simplifies into $f_n(\theta) = p(x_n \mid \theta_n)$.
[Figures: factor graph for the general case, with factors $f_0, f_1, \ldots, f_n$ all attached to θ, and for Gaussian processes, where each $f_n$ attaches only to its own component $\theta_n$.]
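A minimal sketch (my own; names and numbers hypothetical) of such a factorization, with a Gaussian prior factor and one likelihood factor per i.i.d. data point:

    import numpy as np
    from scipy.stats import norm

    data = np.array([0.8, 1.2, 0.5])
    factors = [lambda th: norm.pdf(th, loc=0.0, scale=2.0)]  # f_0: prior p(theta)
    factors += [lambda th, x=x: norm.pdf(x, loc=th, scale=1.0)
                for x in data]                               # f_n = p(x_n | theta)

    def joint(theta):
        """Unnormalized p(D, theta) as the product of all factors."""
        return np.prod([f(theta) for f in factors])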
Approximation
Approximate the posterior by an exponential distribution: $p(D, \theta) = \prod_i f_i(\theta) \approx \frac{1}{Z} \prod_i \tilde f_i(\theta) = \tilde p(\theta)$, i.e., approximate each term $f_i(\theta)$ by an exponential term approximation $\tilde f_i(\theta)$.
Terms already in exponential form, often the prior, do not have to be approximated.
[Figure: factor graph with the exact prior $f_0$ and term approximations $\tilde f_1, \ldots, \tilde f_n$ attached to θ.]
Exponential form: $f(\theta) = h(\theta)\, g(\eta) \exp\left[\eta^T u(\theta)\right]$, with natural parameters η and sufficient statistics $u(\theta)$.
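For concreteness, a small sketch (my own, anticipating the Gaussian case used later): a one-dimensional Gaussian has sufficient statistics $u(\theta) = (\theta, \theta^2)$, and its natural parameters correspond to the $(h, K)$ parameterization of the later slides:

    def natural_to_moments(h, K):
        """exp(h*theta - 0.5*K*theta^2) as a Gaussian: mean m, variance C.
        Requires K > 0, i.e., a normalizable term."""
        C = 1.0 / K
        return C * h, C

    def moments_to_natural(m, C):
        """Inverse map: Gaussian N(m, C) to natural parameters (h, K)."""
        return m / C, 1.0 / C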
Iterative Updating
Take out term approximation i: $p^{\setminus i}(\theta) \propto \prod_{j \neq i} \tilde f_j(\theta)$.
Put back in term i: $p^{(i)}(\theta) \propto f_i(\theta) \prod_{j \neq i} \tilde f_j(\theta)$.
Match moments, i.e., find the approximate distribution $\tilde p^{new}$ of exponential form such that $\int d\theta\, u(\theta)\, \tilde p^{new}(\theta) = \int d\theta\, u(\theta)\, p^{(i)}(\theta)$.
Bookkeeping: set the new term approximation such that $\tilde p^{new}(\theta) \propto \tilde f_i^{new}(\theta) \prod_{j \neq i} \tilde f_j(\theta)$.
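In natural parameters, the take-out and bookkeeping steps are just subtractions, since multiplying or dividing exponential-family densities adds or subtracts their natural parameters. A minimal sketch (my own) with arrays of natural parameters:

    import numpy as np

    def take_out(eta_posterior, eta_site_i):
        """Cavity: divide term approximation i out of the posterior."""
        return eta_posterior - eta_site_i

    def bookkeeping(eta_new_posterior, eta_cavity):
        """New term approximation: new posterior divided by the cavity."""
        return eta_new_posterior - eta_cavity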
Going Back and Forth
[Figure: substitute the exact term $f_i$ for its approximation $\tilde f_i$, then project back onto the approximating family.]
Project: minimize the KL divergence $\mathrm{KL}(p^{(i)}, \tilde p) = \int d\theta\, p^{(i)}(\theta) \log\left[\frac{p^{(i)}(\theta)}{\tilde p(\theta)}\right]$.
Equivalent to moment matching when $\tilde p$ is in the exponential family.
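A quick numerical check of that equivalence (my own sketch; the tilted distribution below is a hypothetical Gaussian-times-sigmoid): among Gaussians, the moment-matched one attains the smallest KL divergence:

    import numpy as np
    from scipy.special import expit

    theta = np.linspace(-8, 8, 4001)
    dth = theta[1] - theta[0]
    p = np.exp(-0.5 * theta**2) * expit(3 * theta)  # tilted p^(i), unnormalized
    p /= p.sum() * dth

    m = (theta * p).sum() * dth           # matched mean
    C = ((theta - m)**2 * p).sum() * dth  # matched variance

    def kl_to_gaussian(mean, var):
        q = np.exp(-0.5 * (theta - mean)**2 / var) / np.sqrt(2 * np.pi * var)
        return (p * np.log(p / q)).sum() * dth

    # KL is smallest at the moment-matched Gaussian:
    assert kl_to_gaussian(m, C) < kl_to_gaussian(m + 0.1, C)
    assert kl_to_gaussian(m, C) < kl_to_gaussian(m, 1.2 * C)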
Gaussian Processes for Classification
$f_0(\theta)$ is a Gaussian prior: no need to approximate.
$f_i(\theta) = f_i(\theta_i)$, some nonlinear sigmoidal function of $\theta_i$, e.g. $f_i(\theta_i) = \sigma(y_i \theta_i)$.
Each term approximation is a one-dimensional Gaussian form, $\tilde f_i(\theta_i) = \exp\left[h_i \theta_i - \frac{1}{2} K_i \theta_i^2\right]$, not necessarily normalizable: $K_i$ may be negative.
[Figure: the sigmoidal likelihood $f_i(\theta_i) = \sigma(y_i \theta_i)$, rising from 0 to 1 over $\theta_i \in [-5, 5]$.]
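A tiny illustration (my own, arbitrary numbers) of why a negative site precision $K_i$ is harmless: only the product of the site with the rest of the approximation needs to be normalizable:

    K_site, h_site = -0.3, 0.5  # unnormalizable term approximation: K_i < 0
    K_rest, h_rest = 1.0, 0.0   # rest of the approximation: proper Gaussian
    K_total = K_rest + K_site   # total precision 0.7 > 0: still proper
    m_total = (h_rest + h_site) / K_total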
Locality Property (1)
Consider updating the term approximation $\tilde f_i(\theta_i)$. After replacing the old term approximation by the exact term we have
$p^{(i)}(\theta) \propto p^{\setminus i}(\theta)\, f_i(\theta_i) = p^{\setminus i}(\theta_{\setminus i} \mid \theta_i)\, p^{\setminus i}(\theta_i)\, f_i(\theta_i)$.
We have to find the new approximation $\tilde p^{new}(\theta)$ closest in KL divergence to $p^{(i)}(\theta)$:
$\mathrm{KL}(p^{(i)}, \tilde p^{new}) = \int d\theta\, p^{(i)}(\theta) \log\left[\frac{p^{(i)}(\theta)}{\tilde p^{new}(\theta)}\right] = \int d\theta_i\, p^{(i)}(\theta_i) \log\left[\frac{p^{(i)}(\theta_i)}{\tilde p^{new}(\theta_i)}\right] + \int d\theta_i\, p^{(i)}(\theta_i) \int d\theta_{\setminus i}\, p^{(i)}(\theta_{\setminus i} \mid \theta_i) \log\left[\frac{p^{(i)}(\theta_{\setminus i} \mid \theta_i)}{\tilde p^{new}(\theta_{\setminus i} \mid \theta_i)}\right]$.
Locality Property (2)
From the previous slide:
$\mathrm{KL}(p^{(i)}, \tilde p^{new}) = \int d\theta_i\, p^{(i)}(\theta_i) \log\left[\frac{p^{(i)}(\theta_i)}{\tilde p^{new}(\theta_i)}\right] + \int d\theta_i\, p^{(i)}(\theta_i) \int d\theta_{\setminus i}\, p^{(i)}(\theta_{\setminus i} \mid \theta_i) \log\left[\frac{p^{(i)}(\theta_{\setminus i} \mid \theta_i)}{\tilde p^{new}(\theta_{\setminus i} \mid \theta_i)}\right]$.
Consequences:
1. The second term is an expectation of a KL divergence, hence nonnegative, and vanishes at the optimum: $\tilde p^{new}(\theta_{\setminus i} \mid \theta_i) = p^{(i)}(\theta_{\setminus i} \mid \theta_i)$, which means that only $K_{ii}$ and $h_i$ can change.
2. We only need to match moments for the marginal $p^{(i)}(\theta_i)$.
Take Out
Easy in terms of canonical parameters: $K^{\setminus i} = K - K_i 1_i 1_i^T$ and $h^{\setminus i} = h - h_i 1_i$, with $1_i$ a vector with a 1 at element i and the rest 0.
We need the corresponding moment form with (only) $C^{\setminus i}_{ii}$ and $m^{\setminus i}_i$. Efficiently with the Sherman-Morrison formula (see next slide):
$C^{\setminus i}_{ii} = C_{ii} + \frac{C_{ii} K_i C_{ii}}{1 - C_{ii} K_i}$.
The new mean follows from $m^{\setminus i}_i = m_i + C^{\setminus i}_{ii}\left[K_i m_i - h_i\right]$.
Computational complexity is order 1 per term, i.e., order N in total per iteration of EP.
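A numerical sanity check (my own sketch, arbitrary numbers) that the order-1 cavity formulas agree with recomputing the moment form from scratch:

    import numpy as np

    rng = np.random.default_rng(0)
    N, i = 4, 2
    A = rng.standard_normal((N, N))
    K = A @ A.T + N * np.eye(N)        # full precision matrix (SPD)
    h = rng.standard_normal(N)         # full potential vector
    K_i, h_i = 0.7, 0.4                # site parameters to take out

    C, m = np.linalg.inv(K), np.linalg.solve(K, h)

    # order-1 shortcut from the slide
    C_cav = C[i, i] + C[i, i] * K_i * C[i, i] / (1 - C[i, i] * K_i)
    m_cav = m[i] + C_cav * (K_i * m[i] - h_i)

    # brute force: remove the rank-one site and invert
    e = np.zeros(N); e[i] = 1.0
    C_full = np.linalg.inv(K - K_i * np.outer(e, e))
    m_full = C_full @ (h - h_i * e)
    assert np.allclose([C_cav, m_cav], [C_full[i, i], m_full[i]])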
Sherman-Morrison Formula
An efficient way to recompute an inverse after adding a rank-one part:
$(A + uv^T)^{-1} = A^{-1} - \frac{A^{-1} u v^T A^{-1}}{1 + v^T A^{-1} u}$.
The result on the previous slide follows by setting $u = -K_i 1_i$ and $v = 1_i$.
The Woodbury formula generalizes this from rank-one updates in terms of vectors to higher-rank updates in terms of matrices. The matrix determinant lemma does something similar for (log) determinants. See the Matrix Cookbook (or Wikipedia... ).
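A short check of the identity (my own, arbitrary well-conditioned matrix):

    import numpy as np

    rng = np.random.default_rng(1)
    A = rng.standard_normal((5, 5)) + 5 * np.eye(5)
    u, v = rng.standard_normal(5), rng.standard_normal(5)
    Ainv = np.linalg.inv(A)
    lhs = np.linalg.inv(A + np.outer(u, v))
    rhs = Ainv - np.outer(Ainv @ u, v @ Ainv) / (1 + v @ Ainv @ u)
    assert np.allclose(lhs, rhs)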
Match Moments
We have to compute $\int d\theta_i\, \mathcal{N}(\theta_i; m^{\setminus i}_i, C^{\setminus i}_{ii})\, f_i(\theta_i)\, \{1, \theta_i, \theta_i^2\}$: one-dimensional integrals that (sometimes) can be computed analytically, and otherwise approximated with Gauss-Hermite quadrature, i.e., from
$\sum_k w_k\, f_i\!\left(m^{\setminus i}_i + \sqrt{C^{\setminus i}_{ii}}\, x_k\right) \{1, x_k, x_k^2\}$.
This then yields the new $m^{new}_i$ and $C^{new}_{ii}$.
Computational complexity is order W, the number of quadrature points, per term, i.e., order NW per EP iteration.
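A sketch of this computation for a logistic site $f_i(\theta_i) = \sigma(y_i \theta_i)$ (my own helper; the slides leave $f_i$ generic), using numpy's probabilists' Gauss-Hermite rule:

    import numpy as np
    from scipy.special import expit

    def match_moments(m_cav, C_cav, y, n_quad=32):
        """Moments of N(theta; m_cav, C_cav) * sigmoid(y*theta) by quadrature."""
        x, w = np.polynomial.hermite_e.hermegauss(n_quad)
        w = w / np.sqrt(2 * np.pi)            # now a discrete N(0, 1) measure
        theta = m_cav + np.sqrt(C_cav) * x    # nodes mapped into theta-space
        wf = w * expit(y * theta)             # weights times the exact term f_i
        Z = wf.sum()                          # zeroth moment (normalizer)
        m_new = wf @ theta / Z                # first moment
        C_new = wf @ theta**2 / Z - m_new**2  # central second moment
        return Z, m_new, C_new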
Bookkeeping
Now that we have the new moments, we have to find new term approximations that give exactly those same moments. It can be shown that the updates are simply as if we were in a one-dimensional situation:
$K^{new}_{ii} = K_{ii} + \left[1/C^{new}_{ii} - 1/C_{ii}\right]$, and similarly $h^{new}_i = h_i + \left[m^{new}_i/C^{new}_{ii} - m_i/C_{ii}\right]$.
Keep track of C and m using Sherman-Morrison, but now applied to the whole matrix.
Computational complexity is order $N^2$ per term, i.e., $N^3$ per EP iteration.
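Putting the pieces together, a minimal end-to-end sketch (my own; it assumes a zero-mean GP prior and a logistic likelihood, and omits damping and safeguards against improper cavities, so it is not a robust implementation):

    import numpy as np
    from scipy.special import expit

    def ep_gp_classification(K_prior, y, n_sweeps=20, n_quad=32):
        """Sequential EP for GP classification in the slides' (h_i, K_i)
        site parameterization. K_prior: prior covariance of the latent
        theta (zero prior mean assumed); y: labels in {-1, +1}."""
        N = len(y)
        h_site = np.zeros(N)          # site potentials h_i
        k_site = np.zeros(N)          # site precisions K_i
        C = K_prior.copy()            # posterior covariance (all sites zero)
        m = np.zeros(N)               # posterior mean
        x, w = np.polynomial.hermite_e.hermegauss(n_quad)
        w = w / np.sqrt(2.0 * np.pi)  # weights of the N(0, 1) quadrature
        for _ in range(n_sweeps):
            for i in range(N):
                # take out (order 1): cavity marginal from current posterior
                C_cav = C[i, i] / (1.0 - k_site[i] * C[i, i])
                m_cav = m[i] + C_cav * (k_site[i] * m[i] - h_site[i])
                # match moments of N(theta; m_cav, C_cav) * sigmoid(y_i theta)
                theta = m_cav + np.sqrt(C_cav) * x
                wf = w * expit(y[i] * theta)
                Z = wf.sum()
                m_new = wf @ theta / Z
                C_new = wf @ theta**2 / Z - m_new**2
                # bookkeeping (1-D): changes in the site parameters
                dK = 1.0 / C_new - 1.0 / C[i, i]
                dh = m_new / C_new - m[i] / C[i, i]
                k_site[i] += dK
                h_site[i] += dh
                # rank-one Sherman-Morrison update of C (order N^2), then m
                ci = C[:, i].copy()
                C -= np.outer(ci, ci) * (dK / (1.0 + dK * C[i, i]))
                m = C @ h_site
        return m, C

For example, with K a toy N-by-N kernel matrix and y an array of plus/minus-one labels, m, C = ep_gp_classification(K, y) should return latent means whose signs track the labels.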
Sequential vs. Parallel
Initial formulation of expectation propagation: sequentially update terms and keep track of the approximated posterior. Viewed as a mapping from old to new term approximations, we may as well do this in parallel.
Advantages: much, much faster for sparse precision matrices; numerically more stable. Disadvantage: convergence might be a bit slower.
[Figure: a layered factor graph with factors $f$, $g$, $h$, $i$ and variables $z$, θ, $u$, $v$, illustrating parallel message updates over many variables.]
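For contrast with the sequential loop above, a sketch of one parallel sweep (my own, same assumptions as before): all cavities and moment matches are computed from the same current posterior, and the per-site rank-one updates collapse into a single recomputation of (C, m) per sweep:

    import numpy as np
    from scipy.special import expit

    def parallel_ep_sweep(K_prior, y, h_site, k_site, n_quad=32):
        """One parallel EP sweep: returns updated site parameters."""
        C = np.linalg.inv(np.linalg.inv(K_prior) + np.diag(k_site))
        m = C @ h_site
        c = np.diag(C)
        C_cav = c / (1.0 - k_site * c)              # all cavities at once
        m_cav = m + C_cav * (k_site * m - h_site)
        x, w = np.polynomial.hermite_e.hermegauss(n_quad)
        w = w / np.sqrt(2.0 * np.pi)
        theta = m_cav[:, None] + np.sqrt(C_cav)[:, None] * x[None, :]
        wf = w[None, :] * expit(y[:, None] * theta)
        Z = wf.sum(axis=1)
        m_new = (wf * theta).sum(axis=1) / Z
        C_new = (wf * theta**2).sum(axis=1) / Z - m_new**2
        # new site parameters directly from the cavity and the new moments
        return m_new / C_new - m_cav / C_cav, 1.0 / C_new - 1.0 / C_cav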
Other Issues
By keeping track of normalizations, we can also approximate the model evidence and use that for optimizing hyperparameters.
Power EP: take out the term proxy / put back the term raised to a power α. Standard EP/loopy belief propagation: α = 1. Variational message passing: α = 0. α just below 1 happens to be more stable than α = 1.
Convergence is a (big) issue: in particular, there is no guarantee of normalizability.
Many, many more applications: mixture models, nonlinear Kalman filters, Dirichlet models, Plackett-Luce, ...
Consistently more accurate than Laplace approximations; ongoing efforts to speed it up.