Perturbation Theory for Variational Inference


Manfred Opper (TU Berlin), Marco Fraccaro (Technical University of Denmark), Ulrich Paquet (Apple), Alex Susemihl (TU Berlin), Ole Winther (Technical University of Denmark)

Abstract

The variational approximation is known to underestimate the true variance of a probability measure, like a posterior distribution. In latent variable models whose parameters are permutation-invariant with respect to the likelihood, the variational approximation might self-prune and ignore its parameters. We view the variational free energy and its accompanying evidence lower bound as a first-order term from a perturbation of the true log partition function and derive a power series of corrections. We sketch, by means of a variational matrix factorization example, how a further term could correct for predictions from a self-pruned approximation.

1 Introduction

We consider the probability measure p(x) = mu(x) e^{-H(x)} / Z on a random variable x. We are often concerned with the computation of the negative log partition function

-log Z = -log ∫ dmu(x) e^{-H(x)},

or of expectations E[f(x)] = Z^{-1} ∫ dmu(x) f(x) e^{-H(x)} under p. As these are typically not analytically tractable, the variational approximation introduces a tractable approximating measure q(x) = Z_q^{-1} mu(x) e^{-H_q(x)}, and adjusts the parameters in H_q in such a way that the free energy, which is defined via the Kullback-Leibler divergence,

F[q] = D(q || p) - log Z = E_q[ log q(x)/p(x) ] - log Z,

is minimal. This yields the form

F[q] = -log Z_q + E_q[V(x)],   (1)

where V = H - H_q. The minimizer of F usually underestimates the variance of p, and symmetries are sometimes not broken [1, 3], so that the approximation self-prunes. In Sec. 3 we illustrate this effect with a small matrix factorization example, and then show how predictions (expectations E[f(x)] of f(x) under the posterior p) could be improved.

2 Perturbation theory

Perturbation theory aims to find approximate solutions to a problem given exact solutions of a simpler related sub-problem (the VB solution in our case). If we define Ĥ_λ = H_q + λV = (1 - λ)H_q + λH, we have Ĥ_1 = H and Ĥ_0 = H_q.

( λ)h q + λh, we have Ĥ = H and Ĥ = H q. Let us now define the perturbation epansion using the cumulant epansion of log E q e V, where E q denotes the epectation with respect to the variational distribution: log dµ() e Ĥλ() = log Z q log E q e λv () = log Z q + λe q V λ }{{} E q (V E q V ) + λ! E q (V E q V ) +... () F q for λ= At the end, of course, we set λ =. he first order term in () ields the variational free energ F q in (), i.e. the negative of the usual evidence lower bound (ELBO). It is therefore reasonable to correct it using higher orders. Note that this ma not be a convergent series, but lead to an asmptotic epansion onl. he approach is similar to corrections to Epectation Propagation, 6, but the computation of the correction terms here require less effort. A variational linear response correction can also be applied 5. Epectations. In a similar wa, we can compute corrections for epectations of functions. We first define E λ as the epectation with respect to p λ () = µ()e Ĥλ() (we then have E q = E ) and notice that dµ() f() e Ĥλ() dµ() f() e Ĥ() λv () E λ f() = = dµ() e Ĥλ() dµ() e Ĥ() λv () ) dµ() f() e ( Ĥ() λv () + λ V () λ! V () ±... = ( dµ() e Ĥ() λv () + λ V () λ! V () ±... ) ( ) E f() λv + λ V λ! V ±... = ( E λv + λ V λ! V ). () ±... Using z = + z + z +... we can re-epand the part from the denominator in () to get (λe V λ E V + λ! E V ±...) = + λe V λ E V + λ E V ±... () Putting this together with the numerator in () ields ) E λ f() = (E f() λe f()v + λ E f()v ±... ) ( + λe V λ E V + λ E V ±... = E f() λe f()v + λe f()e V + λ E f()v λ E f()e V λ E f()v E V + λ E f()e V ±... = E f() λ f(), V () λ E V () f(), V () + λ f(), V () ±... (5) Variational Matri Factorization As to eample, we factorize a scalar matri r = + ɛ with ɛ N (ɛ;, β ). In the parlance of recommender sstems (as we ll later consider sparsel observed matrices R) a user (modelled b a latent scalar ) rates an item (modelled b ). One rating Note r N (r;, β ) z = + z + z +... onl converges for z <.

3 Variational matrix factorization

As a toy example, we factorize a scalar matrix r = xy + ε with ε ~ N(ε; 0, β^{-1}), i.e. r ~ N(r; xy, β^{-1}). In the parlance of recommender systems (as we will later consider sparsely observed matrices R), a user (modelled by a latent scalar x) rates an item (modelled by y). One rating is observed. Assume priors p(x) = N(x; 0, α_x^{-1}) and p(y) = N(y; 0, α_y^{-1}). Let q(x, y) = q(x)q(y) be a factorized approximation with q(x) = N(x; μ_x, γ_x^{-1}) and q(y) = N(y; μ_y, γ_y^{-1}). We will investigate how the variational prediction of a rating, E[xy], and its further terms in (5) change as the observation noise standard deviation β^{-1/2} increases.

Motivation. This toy example is the building block for computing the corrections for real recommender system models, i.e. matrix factorization models with sparsely observed ratings matrices and K latent dimensions for both user and item vectors. As shown in the Supplementary Material, the corrections for the predicted rating from user i to item j, E[x_i^T y_j], can be shown to be the sum of K(N_i + M_j) corrections like the ones presented in this section, when q fully factorizes. N_i denotes the number of items rated by user i and M_j the number of users that have rated item j. We expect the correction to be largest when N_i and M_j are small.

Figure 1: The expected affinity E[xy] under p(x, y | r, β^{-1}) as a Monte Carlo ground truth (MC) and its approximations (VB, VB-1st, VB-2nd), plotted against the noise standard deviation β^{-1/2}. We plot the results with r = 1 and α_x = α_y = 1. With these parameters the VB estimate is zero for β^{-1} > 1/2 (see Fig. 2 and the analysis in [3]), but it is accurately corrected through the first-order term in (7). Under less noise symmetry is broken (β^{-1} < 1/2), and the first-order correction (6) remains accurate.

Before commencing with the technical details, consider Fig. 1, which is accompanied by Fig. 2. When there is little observation noise, with β^{-1} being small, the variational Bayes (VB) solution E_q[xy] closely matches E[xy], obtained from a Monte Carlo (MC) estimate. As the noise is increased, the VB solution stops breaking symmetry and snaps to zero (see Fig. 2, and Nakajima et al.'s analysis [3]), after which E_q[xy] = 0 will always be predicted. The first-order correction gives a substantial improvement towards the ground truth, although, with no guarantee that (5) is a convergent series, the inclusion of a second-order term introduces a degradation.

Technical details. The terms that constitute V(x, y) = H(x, y) - H_q(x, y) are

H(x, y) = (β/2)(r - xy)^2 + (α_x/2)x^2 + (α_y/2)y^2,
H_q(x, y) = (γ_x/2)(x - μ_x)^2 + (γ_y/2)(y - μ_y)^2.

Let us look at the first order of the expansion in (5),

E[xy] ≈ E_q[xy] - λ E_q[xyV] + λ E_q[xy] E_q[V],

bearing in mind that q is the minimizer of F[q] = -log Z_q + E_q[V]. The terms that we require in the expansion are

E_q[xy] = μ_x μ_y,
E_q[xyV] - E_q[xy] E_q[V] = μ_x μ_y ( α_x/γ_x + α_y/γ_y + 4β/(γ_x γ_y) ) + β (μ_x μ_y - r)( μ_x^2/γ_y + μ_y^2/γ_x ) - βr/(γ_x γ_y),   (6)

with the latter being Cov_q[xy, V].
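As a complementary numerical check (again our own sketch rather than code from the paper), the snippet below mirrors the setup above for a single noise level in the self-pruning regime: it runs the standard coordinate-ascent VB updates for q(x)q(y), estimates Cov_q[xy, V] by Monte Carlo rather than through the closed form (6), applies the first-order correction, and compares against a ground truth computed on a grid. The chosen noise level, grid ranges and sample sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
r, alpha_x, alpha_y = 1.0, 1.0, 1.0
beta = 1.0 / 0.8        # noise variance 1/beta = 0.8 > 1/2: the regime where VB self-prunes

# Coordinate-ascent VB for q(x)q(y) = N(x; mu_x, 1/g_x) N(y; mu_y, 1/g_y).
mu_x, g_x, mu_y, g_y = 0.1, 1.0, 0.1, 1.0      # small non-zero initialisation
for _ in range(500):
    g_x  = alpha_x + beta * (mu_y**2 + 1.0 / g_y)
    mu_x = beta * r * mu_y / g_x
    g_y  = alpha_y + beta * (mu_x**2 + 1.0 / g_x)
    mu_y = beta * r * mu_x / g_y

# First-order correction E[xy] ~ E_q[xy] - Cov_q[xy, V], with the covariance
# estimated by Monte Carlo under q instead of the closed form (6).
n  = 1_000_000
xs = rng.normal(mu_x, 1.0 / np.sqrt(g_x), n)
ys = rng.normal(mu_y, 1.0 / np.sqrt(g_y), n)
H  = 0.5 * beta * (r - xs * ys)**2 + 0.5 * alpha_x * xs**2 + 0.5 * alpha_y * ys**2
Hq = 0.5 * g_x * (xs - mu_x)**2 + 0.5 * g_y * (ys - mu_y)**2
V  = H - Hq
f  = xs * ys
cov_fV = np.mean(f * V) - np.mean(f) * np.mean(V)

# Ground truth E[xy | r] on a dense grid over the (x, y) posterior.
t = np.linspace(-6.0, 6.0, 801)
X, Y = np.meshgrid(t, t, indexing="ij")
logp = -0.5 * beta * (r - X * Y)**2 - 0.5 * alpha_x * X**2 - 0.5 * alpha_y * Y**2
p = np.exp(logp - logp.max())
print("VB prediction        :", mu_x * mu_y)
print("first-order corrected:", mu_x * mu_y - cov_fV)
print("grid ground truth    :", (X * Y * p).sum() / p.sum())
```

With these settings the VB means collapse towards zero, so the first-order correction is essentially (7); changing beta probes the symmetry-broken regime instead.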

Figure 2: The posterior p(x, y | r, β^{-1}) and the accompanying fully factorized VB solution (circular contours), together with the corresponding marginal of the affinity p(xy | r, β^{-1}), for r = 1 and increasing noise levels β^{-1}. The approximation breaks symmetry at β = 2. Fig. 1 presents corrections to the mean of p(xy | r, β^{-1}).

To illustrate the effect of the correction, consider high observation noise (β^{-1} > 1/2) in Fig. 2. The VB approximation locks on to a zero mean, and does not break symmetry. With μ_x = μ_y = 0, for example, the only remaining term to first order is (using λ = 1)

E[xy] ≈ βr / (γ_x γ_y),   (7)

and the correction is verifiably in the direction of r.

4 Summary

We have motivated, by means of a toy example, the benefit of and a framework for a perturbation-theoretical view on variational inference. Outside the scope of the workshop, current and future work encompasses the application to matrix factorization (see the Supplementary Material), variational inference of stochastic differential equations, Gaussian process (GP) classification, and the variational approach to sparse GP regression.

References

[1] D. J. C. MacKay. Local minima, symmetry-breaking, and model pruning in variational free energy minimization. Technical report, University of Cambridge Cavendish Laboratory, 2001.
[2] S. Nakajima and M. Sugiyama. Theoretical analysis of Bayesian matrix factorization. Journal of Machine Learning Research, 12(Nov):2583-2648, 2011.
[3] S. Nakajima, M. Sugiyama, S. D. Babacan, and R. Tomioka. Global analytic solution of fully-observed variational Bayesian matrix factorization. Journal of Machine Learning Research, 14(Jan):1-37, 2013.
[4] M. Opper, U. Paquet, and O. Winther. Perturbative corrections for approximate inference in Gaussian latent variable models. Journal of Machine Learning Research, 14(Sep):2857-2898, 2013.
[5] M. Opper and O. Winther. Variational linear response. In Advances in Neural Information Processing Systems 16, pages 1157-1164. MIT Press, 2004.
[6] U. Paquet, O. Winther, and M. Opper. Perturbation corrections in approximate inference: Mixture modelling applications. Journal of Machine Learning Research, 10(Jun):1263-1304, 2009.

Supplementary Material

Matrix factorization - general case

We now consider a recommender system with N users, M items and K latent dimensions. An N × M sparse rating matrix R can be constructed using each user's list of rated items in the catalogue. An element r_ij in R contains the rating that user i has given to item j. Matrix factorization techniques factorize the rating matrix as R = X^T Y + E, where X is a K × N user matrix, the item matrix Y is K × M and the residual term E is N × M. We denote the set of observed entries in R by 𝓡. We use a Gaussian likelihood for the ratings, i.e. r_ij ~ N(r_ij; x_i^T y_j, β^{-1}), and isotropic Gaussian priors for both user and item vectors:

p(x_i) = Π_{k=1}^K N(x_ik; 0, α_x^{-1}),   p(y_j) = Π_{k=1}^K N(y_jk; 0, α_y^{-1}).   (8)

It is common to take a fully factorized approximation for user i's and item j's vectors over the dimensions k = 1, ..., K:

q(x_i) = Π_k q(x_ik) = Π_k N(x_ik; μ_ik, γ_ik^{-1}),   q(y_j) = Π_k q(y_jk) = Π_k N(y_jk; η_jk, ζ_jk^{-1})

(we now use η and ζ for the item means and precisions, to avoid subscript spaghetti).

The main quantity of interest in a recommender system is the predicted rating E[x_i^T y_j], whose first-order correction requires the computation of ⟨x_i^T y_j, V(X, Y)⟩. With our assumptions, the functions H and H_q in V(X, Y) = H(X, Y) - H_q(X, Y) are

H(X, Y) = Σ_{(i,j) ∈ 𝓡} (β/2)(r_ij - x_i^T y_j)^2 + Σ_i (α_x/2) x_i^T x_i + Σ_j (α_y/2) y_j^T y_j,
H_q(X, Y) = Σ_{i,k} (γ_ik/2)(x_ik - μ_ik)^2 + Σ_{j,k} (ζ_jk/2)(y_jk - η_jk)^2.

Counterintuitively, we see that while we are only interested in correcting E[x_i^T y_j], both H and H_q also depend on terms relative to items not seen by user i and users that have not rated item j. As we will see shortly, however, these terms do not contribute to ⟨x_i^T y_j, V(X, Y)⟩. In its unsimplified form,

⟨x_i^T y_j, V(X, Y)⟩ = ⟨x_i^T y_j, Σ_{(n,m) ∈ 𝓡} (β/2)(r_nm - x_n^T y_m)^2 + Σ_{n=1}^N (α_x/2) x_n^T x_n + Σ_{m=1}^M (α_y/2) y_m^T y_m - Σ_{n=1}^N Σ_{k=1}^K (γ_nk/2)(x_nk - μ_nk)^2 - Σ_{m=1}^M Σ_{k=1}^K (ζ_mk/2)(y_mk - η_mk)^2⟩.

In the computation of ⟨x_i^T y_j, V(X, Y)⟩, all the terms in V(X, Y) that do not depend on x_i or y_j are constants and leave the covariance unchanged: considering the user vectors we have for example

⟨x_i^T y_j, Σ_{n=1}^N (α_x/2) x_n^T x_n⟩ = ⟨x_i^T y_j, (α_x/2) x_i^T x_i⟩.

In a similar way, among the likelihood terms Σ_{(n,m) ∈ 𝓡} (r_nm - x_n^T y_m)^2 we only need to keep the ones corresponding to the items rated by user i and the users that rated item j (i.e. only the i-th row and the j-th column of the ratings matrix). Defining M(i) as the set containing the indices of the items rated by user i, and N(j) as the set containing the indices of the users that rated item j, we have

⟨x_i^T y_j, Σ_{(n,m) ∈ 𝓡} (r_nm - x_n^T y_m)^2⟩ = ⟨x_i^T y_j, Σ_{m ∈ M(i)} (r_im - x_i^T y_m)^2 + Σ_{n ∈ N(j)} (r_nj - x_n^T y_j)^2⟩.

After removing all the constant terms the simplified covariance is given by

⟨x_i^T y_j, V⟩ = ⟨x_i^T y_j, (β/2) Σ_{m ∈ M(i)} (r_im - x_i^T y_m)^2 + (β/2) Σ_{n ∈ N(j)} (r_nj - x_n^T y_j)^2 + (α_x/2) x_i^T x_i + (α_y/2) y_j^T y_j - Σ_{k=1}^K (γ_ik/2)(x_ik - μ_ik)^2 - Σ_{k=1}^K (ζ_jk/2)(y_jk - η_jk)^2⟩
= (β/2) Cov[x_i^T y_j, Σ_{m ∈ M(i)} (r_im - x_i^T y_m)^2] + (β/2) Cov[x_i^T y_j, Σ_{n ∈ N(j)} (r_nj - x_n^T y_j)^2] + ⟨x_i^T y_j, (α_x/2) x_i^T x_i + (α_y/2) y_j^T y_j - Σ_{k=1}^K (γ_ik/2)(x_ik - μ_ik)^2 - Σ_{k=1}^K (ζ_jk/2)(y_jk - η_jk)^2⟩
= (β/2) f_1(x_i, y_j, Y_{M(i)}) + (β/2) f_2(x_i, y_j, X_{N(j)}) + g(x_i, y_j),

where X_{N(j)} is the restriction of X to the columns indexed by N(j), and Y_{M(i)} is the restriction of Y to the columns indexed by M(i).

Computation of g(x_i, y_j). Thanks to our choice of a fully factorized VB distribution and the bi-linearity of the covariance operator we have

g(x_i, y_j) = Σ_{k=1}^K ⟨x_ik y_jk, Σ_{k'=1}^K ( (α_x/2) x_ik'^2 + (α_y/2) y_jk'^2 - (γ_ik'/2)(x_ik' - μ_ik')^2 - (ζ_jk'/2)(y_jk' - η_jk')^2 )⟩
= Σ_{k=1}^K ⟨x_ik y_jk, (α_x/2) x_ik^2 + (α_y/2) y_jk^2 - (γ_ik/2)(x_ik - μ_ik)^2 - (ζ_jk/2)(y_jk - η_jk)^2⟩.

Computation of f_1(x_i, y_j, Y_{M(i)}) and f_2(x_i, y_j, X_{N(j)}). We now focus below only on the computation of f_1(x_i, y_j, Y_{M(i)}), as the results for f_2(x_i, y_j, X_{N(j)}) are analogous. We rewrite the term f_1(x_i, y_j, Y_{M(i)}) as

f_1(x_i, y_j, Y_{M(i)}) = ⟨x_i^T y_j, Σ_{m ∈ M(i)} (r_im - x_i^T y_m)^2⟩
= Σ_{m ∈ M(i)} ⟨x_i^T y_j, (r_im - x_i^T y_m)^2⟩
= Σ_{m ∈ M(i)} ⟨x_i^T y_j, r_im^2 - 2 r_im x_i^T y_m + (x_i^T y_m)^2⟩
= Σ_{m ∈ M(i)} { -2 r_im ⟨x_i^T y_j, x_i^T y_m⟩ + Cov[x_i^T y_j, (x_i^T y_m)^2] },

and we analyse separately each of the covariance terms. Thanks to the full factorization of the VB posterior we have

⟨x_i^T y_j, x_i^T y_m⟩ = Cov[ Σ_{k=1}^K x_ik y_jk, Σ_{k'=1}^K x_ik' y_mk' ] = Σ_{k=1}^K ⟨x_ik y_jk, x_ik y_mk⟩.

For ⟨x_i^T y_j, (x_i^T y_m)^2⟩ we get instead

⟨x_i^T y_j, (x_i^T y_m)^2⟩ = Σ_{k=1}^K ⟨x_ik y_jk, Σ_{k'=1}^K (x_ik' y_mk')^2 + 2 Σ_{k'=1}^K Σ_{p<k'} x_ik' y_mk' x_ip y_mp⟩
= Σ_{k=1}^K ⟨x_ik y_jk, (x_ik y_mk)^2⟩ + 2 Σ_{k=1}^K ⟨x_ik y_jk, x_ik y_mk Σ_{c ≠ k} x_ic y_mc⟩
= Σ_{k=1}^K ⟨x_ik y_jk, (x_ik y_mk)^2⟩ + 2 Σ_{k=1}^K E[ Σ_{c ≠ k} x_ic y_mc ] ⟨x_ik y_jk, x_ik y_mk⟩,

since only the terms that share the component index k with the first argument survive (all remaining factors are independent under q). Combining these results we obtain

f_1(x_i, y_j, Y_{M(i)}) = Σ_{m ∈ M(i)} Σ_{k=1}^K { -2 r_im ⟨x_ik y_jk, x_ik y_mk⟩ + ⟨x_ik y_jk, (x_ik y_mk)^2⟩ + 2 E[ Σ_{c ≠ k} x_ic y_mc ] ⟨x_ik y_jk, x_ik y_mk⟩ }
= Σ_{m ∈ M(i)} Σ_{k=1}^K { -2 ( r_im - E[ Σ_{c ≠ k} x_ic y_mc ] ) ⟨x_ik y_jk, x_ik y_mk⟩ + ⟨x_ik y_jk, (x_ik y_mk)^2⟩ }.

If we then define r^(k)_im as the rating r_im minus the expected contribution of all the other components, i.e.

r^(k)_im = r_im - E[ Σ_{c ≠ k} x_ic y_mc ] = r_im - Σ_{c ≠ k} μ_ic η_mc,

we can rewrite f_1(x_i, y_j, Y_{M(i)}) as

f_1(x_i, y_j, Y_{M(i)}) = ⟨x_i^T y_j, Σ_{m ∈ M(i)} (r_im - x_i^T y_m)^2⟩
= Σ_{m ∈ M(i)} Σ_{k=1}^K { -2 r^(k)_im ⟨x_ik y_jk, x_ik y_mk⟩ + ⟨x_ik y_jk, (x_ik y_mk)^2⟩ }
= Σ_{m ∈ M(i)} Σ_{k=1}^K ⟨x_ik y_jk, (r^(k)_im - x_ik y_mk)^2⟩.

For each of the |M(i)| points we will then only need the correction for K univariate problems (that are far simpler to compute). Notice in particular that this result is only possible thanks to our assumption of fully factorized priors and VB posteriors, as there are no correlations among variables of the same user/item vector to be taken into account when computing the covariances.
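To see this reduction in action, the following SymPy sketch (our own verification, not the authors' code) checks, for K = 2 components and a single rated item m with arbitrary rational parameters, that the K-dimensional covariance for one likelihood term equals the sum of the K univariate covariances built from the residual ratings r^(k).

```python
import sympy as sp
from sympy.stats import Normal, E

def cov(a, b):
    # Covariance under the factorized q, computed from symbolic Gaussian moments.
    return sp.simplify(E(sp.expand(a * b)) - E(sp.expand(a)) * E(sp.expand(b)))

# K = 2 components, one rated item m, target item j; arbitrary rational parameters.
r     = sp.Rational(3, 2)
mu    = [sp.Rational(1, 3), sp.Rational(-1, 2)]   # user means   E[x_ik]
eta_j = [sp.Rational(2, 5), sp.Rational(1, 4)]    # item-j means E[y_jk]
eta_m = [sp.Rational(-2, 3), sp.Rational(3, 4)]   # item-m means E[y_mk]
x  = [Normal(f"x{k}",  mu[k],    sp.Rational(1, 2)) for k in range(2)]
yj = [Normal(f"yj{k}", eta_j[k], sp.Rational(1, 3)) for k in range(2)]
ym = [Normal(f"ym{k}", eta_m[k], sp.Rational(2, 3)) for k in range(2)]

dot_j = x[0] * yj[0] + x[1] * yj[1]               # x_i^T y_j
dot_m = x[0] * ym[0] + x[1] * ym[1]               # x_i^T y_m

# Left-hand side: the K-dimensional covariance for one rated item m.
lhs = cov(dot_j, (r - dot_m)**2)

# Right-hand side: K univariate covariances with the residual ratings r^(k).
rhs = 0
for k in range(2):
    c = 1 - k                                     # the "other" component
    r_k = r - mu[c] * eta_m[c]
    rhs += cov(x[k] * yj[k], (r_k - x[k] * ym[k])**2)

print(sp.simplify(lhs - rhs))                     # expected: 0
```

The printed difference should simplify to 0 under these assumptions.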

Putting it all together: corrections for the predicted ratings

Combining the results above, the first-order correction for the required expectation of this matrix factorization problem is given by

⟨x_i^T y_j, V⟩ = (β/2) Σ_{m ∈ M(i)} Σ_{k=1}^K ⟨x_ik y_jk, (r^(k)_im - x_ik y_mk)^2⟩ + (β/2) Σ_{n ∈ N(j)} Σ_{k=1}^K ⟨x_ik y_jk, (r^(k)_nj - x_nk y_jk)^2⟩ + Σ_{k=1}^K ⟨x_ik y_jk, (α_x/2) x_ik^2 + (α_y/2) y_jk^2 - (γ_ik/2)(x_ik - μ_ik)^2 - (ζ_jk/2)(y_jk - η_jk)^2⟩
= Σ_{k=1}^K { (β/2) Σ_{m ∈ M(i)} ⟨x_ik y_jk, (r^(k)_im - x_ik y_mk)^2⟩ + (β/2) Σ_{n ∈ N(j)} ⟨x_ik y_jk, (r^(k)_nj - x_nk y_jk)^2⟩ + Cov[x_ik y_jk, (α_x/2) x_ik^2 + (α_y/2) y_jk^2 - (γ_ik/2)(x_ik - μ_ik)^2 - (ζ_jk/2)(y_jk - η_jk)^2] }.

Due to the fully factorized VB approximation, the computation of these covariances does not require vector operations and can easily be done manually or using symbolic mathematics software such as SymPy (http://www.sympy.org). The required non-central Gaussian moments can be computed using Wick's theorem. Assuming that r_ij is an unobserved rating that we want to predict (this means that the likelihood term in V(X, Y) does not depend on x_i^T y_j; in the toy example presented in Section 3 we were instead computing a correction for an observed rating), the final expression for the covariance term in the first-order correction of E[x_i^T y_j] is

⟨x_i^T y_j, V⟩ = Σ_{k=1}^K { Σ_{m ∈ M(i)} ( β μ_ik η_jk η_mk^2 / γ_ik - β r^(k)_im η_jk η_mk / γ_ik + β μ_ik η_jk / (γ_ik ζ_mk) ) + Σ_{n ∈ N(j)} ( β μ_ik η_jk μ_nk^2 / ζ_jk - β r^(k)_nj μ_ik μ_nk / ζ_jk + β μ_ik η_jk / (ζ_jk γ_nk) ) + ( α_x/γ_ik + α_y/ζ_jk ) μ_ik η_jk }.

To illustrate the effect of the correction we can compute its value in the case where the symmetries in the user vector x_i are not broken, i.e. μ_i = 0 (this may happen for noisy users with only a few rated items). In this case r^(k)_im = r_im, and the expected rating with a first-order correction is (with λ = 1)

E[x_i^T y_j] ≈ Σ_{k=1}^K Σ_{m ∈ M(i)} β r^(k)_im η_jk η_mk / γ_ik.

We see that the expectation depends on the ratings given by user i to the other items in the catalogue, weighted by how similar their vectors are to y_j: we get a positive contribution from r^(k)_im to the predicted rating only if the k-th components of the means of the j-th and m-th item vectors have the same sign.
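Finally, since the text notes that these covariances can be obtained with symbolic software, here is a small SymPy sketch (our own check, not the authors' code) for the univariate building block: it computes Cov_q[xy, V] for the scalar model of Section 3, which is the content of (6), and then specialises to the symmetric case μ_x = μ_y = 0, where the first-order correction -Cov_q[xy, V] should reduce to βr/(γ_x γ_y) as in (7).

```python
import sympy as sp
from sympy.stats import Normal, E

# Means and the rating are real; precisions are positive.
mu_x, mu_y, r = sp.symbols("mu_x mu_y r", real=True)
g_x, g_y, a_x, a_y, beta = sp.symbols("gamma_x gamma_y alpha_x alpha_y beta", positive=True)

# Independent Gaussians under the fully factorized VB posterior q.
x = Normal("x", mu_x, 1 / sp.sqrt(g_x))
y = Normal("y", mu_y, 1 / sp.sqrt(g_y))

# V = H - H_q for the scalar model of Section 3 (one univariate building block).
H  = beta / 2 * (r - x * y)**2 + a_x / 2 * x**2 + a_y / 2 * y**2
Hq = g_x / 2 * (x - mu_x)**2 + g_y / 2 * (y - mu_y)**2
V  = sp.expand(H - Hq)

# Cov_q[xy, V] = E_q[xy V] - E_q[xy] E_q[V]; this reproduces eq. (6).
cov = sp.simplify(E(sp.expand(x * y * V)) - E(x * y) * E(V))
print(cov)

# Self-pruned (symmetric) case: the first-order correction -Cov_q[xy, V]
# should reduce to beta*r/(gamma_x*gamma_y), i.e. eq. (7).
print(sp.simplify(cov.subs({mu_x: 0, mu_y: 0})))
```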