Perturbation Theory for Variational Inference


Manfred Opper (TU Berlin), Marco Fraccaro (Technical University of Denmark), Ulrich Paquet (Apple), Alex Susemihl (TU Berlin), Ole Winther (Technical University of Denmark)

Abstract

The variational approximation is known to underestimate the true variance of a probability measure, like a posterior distribution. In latent variable models whose parameters are permutation-invariant with respect to the likelihood, the variational approximation might self-prune and ignore its parameters. We view the variational free energy and its accompanying evidence lower bound as a first-order term from a perturbation of the true log partition function and derive a power series of corrections. We sketch, by means of a variational matrix factorization example, how a further term could correct for predictions from a self-pruned approximation.

1 Introduction

We consider the probability measure p(x) = mu(x) e^{-H(x)} / Z on a random variable x. We are often concerned with the computation of the negative log partition function

-log Z = -log ∫ dmu(x) e^{-H(x)},

or of expectations E[f(x)] = Z^{-1} ∫ dmu(x) f(x) e^{-H(x)} under p. As these are typically not analytically tractable, the variational approximation introduces a tractable approximating measure q(x) = Z_q^{-1} mu(x) e^{-H_q(x)}, and adjusts the parameters in H_q in such a way that the free energy, which is defined via the Kullback-Leibler divergence,

F[q] = D(q || p) - log Z = E_q[ log q(x)/p(x) ] - log Z,

is minimal. This yields the form

F[q] = -log Z_q + E_q[V(x)],   (1)

where V = H - H_q. The minimizer of F usually underestimates the variance of p, and symmetries are sometimes not broken [1, 3], so that the approximation self-prunes. In Sec. 3 we illustrate this effect with a small matrix factorization example, and then show how predictions (expectations E[f(x)] of f(x) under the posterior p) could be improved.

2 Perturbation theory

Perturbation theory aims to find approximate solutions to a problem given exact solutions of a simpler related sub-problem (the VB solution in our case). If we define Ĥ_λ = H_q + λV = (1 - λ)H_q + λH, we have Ĥ_1 = H and Ĥ_0 = H_q.

( λ)h q + λh, we have Ĥ = H and Ĥ = H q. Let us now define the perturbation epansion using the cumulant epansion of log E q e V, where E q denotes the epectation with respect to the variational distribution: log dµ() e Ĥλ() = log Z q log E q e λv () = log Z q + λe q V λ }{{} E q (V E q V ) + λ! E q (V E q V ) +... () F q for λ= At the end, of course, we set λ =. he first order term in () ields the variational free energ F q in (), i.e. the negative of the usual evidence lower bound (ELBO). It is therefore reasonable to correct it using higher orders. Note that this ma not be a convergent series, but lead to an asmptotic epansion onl. he approach is similar to corrections to Epectation Propagation, 6, but the computation of the correction terms here require less effort. A variational linear response correction can also be applied 5. Epectations. In a similar wa, we can compute corrections for epectations of functions. We first define E λ as the epectation with respect to p λ () = µ()e Ĥλ() (we then have E q = E ) and notice that dµ() f() e Ĥλ() dµ() f() e Ĥ() λv () E λ f() = = dµ() e Ĥλ() dµ() e Ĥ() λv () ) dµ() f() e ( Ĥ() λv () + λ V () λ! V () ±... = ( dµ() e Ĥ() λv () + λ V () λ! V () ±... ) ( ) E f() λv + λ V λ! V ±... = ( E λv + λ V λ! V ). () ±... Using z = + z + z +... we can re-epand the part from the denominator in () to get (λe V λ E V + λ! E V ±...) = + λe V λ E V + λ E V ±... () Putting this together with the numerator in () ields ) E λ f() = (E f() λe f()v + λ E f()v ±... ) ( + λe V λ E V + λ E V ±... = E f() λe f()v + λe f()e V + λ E f()v λ E f()e V λ E f()v E V + λ E f()e V ±... = E f() λ f(), V () λ E V () f(), V () + λ f(), V () ±... (5) Variational Matri Factorization As to eample, we factorize a scalar matri r = + ɛ with ɛ N (ɛ;, β ). In the parlance of recommender sstems (as we ll later consider sparsel observed matrices R) a user (modelled b a latent scalar ) rates an item (modelled b ). One rating Note r N (r;, β ) z = + z + z +... onl converges for z <.

3 Variational matrix factorization

As a toy example, we factorize a scalar matrix r = xy + ε with ε ~ N(ε; 0, β^{-1}), i.e. r ~ N(r; xy, β^{-1}). In the parlance of recommender systems (as we will later consider sparsely observed matrices R), a user (modelled by a latent scalar x) rates an item (modelled by y). One rating is observed. Assume priors p(x) = N(x; 0, α_x^{-1}) and p(y) = N(y; 0, α_y^{-1}). Let q(x, y) = q(x)q(y) be a factorized approximation with q(x) = N(x; μ_x, γ_x^{-1}) and q(y) = N(y; μ_y, γ_y^{-1}). We will investigate how the variational prediction of a rating, E[xy], and its further terms in (5) change as the observation noise standard deviation β^{-1/2} increases.

Motivation. This toy example is the building block for computing the corrections for real recommender system models, i.e. matrix factorization models with sparsely observed ratings matrices and K latent dimensions for both user and item vectors. As shown in the Supplementary Material, the corrections for the predicted rating from user i to item j, E[x_i^T y_j], can be shown to be the sum of K(N_i + M_j) corrections like the ones presented in this section, when q fully factorizes. N_i denotes the number of items rated by user i and M_j the number of users that have rated item j. We expect the correction to be largest when N_i and M_j are small.

Figure 1: The expected affinity E[xy] under p(x, y | r, β^{-1}) as a Monte Carlo ground truth (MC) and its approximations (VB, VB-1st, VB-2nd), plotted against the noise standard deviation β^{-1/2}. We plot the results with r = 1 and α_x = α_y = 1. With these parameters the VB estimate is zero for β^{-1} > 1/2 (see Fig. 2 and the analysis in [3]), but it is accurately corrected through the first-order term in (7). Under less noise symmetry is broken (β^{-1} < 1/2), and the first-order correction (6) remains accurate.

Before commencing with the technical details, consider Fig. 1, which is accompanied by Fig. 2. When there is little observation noise, with β^{-1} being small, the variational Bayes (VB) solution E_q[xy] closely matches E[xy], obtained from a Monte Carlo (MC) estimate. As the noise is increased, the VB solution stops breaking symmetry and snaps to zero (see Fig. 2, and Nakajima et al.'s analysis [3]), after which E_q[xy] = 0 will always be predicted. The first-order correction gives a substantial improvement towards the ground truth, although, with no guarantee that (5) is a convergent series, the inclusion of a second-order term introduces a degradation.

Technical details. The terms that constitute V(x, y) = H(x, y) - H_q(x, y) are

H(x, y) = (β/2)(r - xy)^2 + (α_x/2)x^2 + (α_y/2)y^2,
H_q(x, y) = (γ_x/2)(x - μ_x)^2 + (γ_y/2)(y - μ_y)^2.

Let us look at the first order of the expansion in (5),

E[xy] ≈ E_q[xy] - λ E_q[xyV] + λ E_q[xy] E_q[V],

bearing in mind that q is the minimizer of F[q] = -log Z_q + E_q[V]. The terms that we require in the expansion are

E_q[xy] = μ_x μ_y,
E_q[xyV] - E_q[xy] E_q[V] = μ_x μ_y ( α_x/γ_x + α_y/γ_y + 4β/(γ_x γ_y) ) + β (μ_x μ_y - r)( μ_x^2/γ_y + μ_y^2/γ_x ) - βr/(γ_x γ_y),   (6)

with the latter being Cov_q[xy, V].
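As a complementary numerical check (again our own sketch rather than code from the paper), the snippet below mirrors the setup above for a single noise level in the self-pruning regime: it runs the standard coordinate-ascent VB updates for q(x)q(y), estimates Cov_q[xy, V] by Monte Carlo rather than through the closed form (6), applies the first-order correction, and compares against a ground truth computed on a grid. The chosen noise level, grid ranges and sample sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
r, alpha_x, alpha_y = 1.0, 1.0, 1.0
beta = 1.0 / 0.8        # noise variance 1/beta = 0.8 > 1/2: the regime where VB self-prunes

# Coordinate-ascent VB for q(x)q(y) = N(x; mu_x, 1/g_x) N(y; mu_y, 1/g_y).
mu_x, g_x, mu_y, g_y = 0.1, 1.0, 0.1, 1.0      # small non-zero initialisation
for _ in range(500):
    g_x  = alpha_x + beta * (mu_y**2 + 1.0 / g_y)
    mu_x = beta * r * mu_y / g_x
    g_y  = alpha_y + beta * (mu_x**2 + 1.0 / g_x)
    mu_y = beta * r * mu_x / g_y

# First-order correction E[xy] ~ E_q[xy] - Cov_q[xy, V], with the covariance
# estimated by Monte Carlo under q instead of the closed form (6).
n  = 1_000_000
xs = rng.normal(mu_x, 1.0 / np.sqrt(g_x), n)
ys = rng.normal(mu_y, 1.0 / np.sqrt(g_y), n)
H  = 0.5 * beta * (r - xs * ys)**2 + 0.5 * alpha_x * xs**2 + 0.5 * alpha_y * ys**2
Hq = 0.5 * g_x * (xs - mu_x)**2 + 0.5 * g_y * (ys - mu_y)**2
V  = H - Hq
f  = xs * ys
cov_fV = np.mean(f * V) - np.mean(f) * np.mean(V)

# Ground truth E[xy | r] on a dense grid over the (x, y) posterior.
t = np.linspace(-6.0, 6.0, 801)
X, Y = np.meshgrid(t, t, indexing="ij")
logp = -0.5 * beta * (r - X * Y)**2 - 0.5 * alpha_x * X**2 - 0.5 * alpha_y * Y**2
p = np.exp(logp - logp.max())
print("VB prediction        :", mu_x * mu_y)
print("first-order corrected:", mu_x * mu_y - cov_fV)
print("grid ground truth    :", (X * Y * p).sum() / p.sum())
```

With these settings the VB means collapse towards zero, so the first-order correction is essentially (7); changing beta probes the symmetry-broken regime instead.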

Figure 2: The posterior p(x, y | r, β^{-1}) and the accompanying fully factorized VB solution (circular contours), together with the corresponding marginal of the affinity p(xy | r, β^{-1}), for r = 1 and increasing noise levels β^{-1}. The approximation breaks symmetry at β = 2. Fig. 1 presents corrections to the mean of p(xy | r, β^{-1}).

To illustrate the effect of the correction, consider high observation noise (β^{-1} > 1/2) in Fig. 2. The VB approximation locks on to a zero mean, and does not break symmetry. With μ_x = μ_y = 0, for example, the only remaining term to first order is (using λ = 1)

E[xy] ≈ βr / (γ_x γ_y),   (7)

and the correction is verifiably in the direction of r.

4 Summary

We have motivated, by means of a toy example, the benefit of and a framework for a perturbation-theoretical view on variational inference. Outside the scope of the workshop, current and future work encompasses the application to matrix factorization (see the Supplementary Material), variational inference of stochastic differential equations, Gaussian process (GP) classification, and the variational approach to sparse GP regression.

References

[1] D. J. C. MacKay. Local minima, symmetry-breaking, and model pruning in variational free energy minimization. Technical report, University of Cambridge Cavendish Laboratory, 2001.
[2] S. Nakajima and M. Sugiyama. Theoretical analysis of Bayesian matrix factorization. Journal of Machine Learning Research, 12(Nov):2583-2648, 2011.
[3] S. Nakajima, M. Sugiyama, S. D. Babacan, and R. Tomioka. Global analytic solution of fully-observed variational Bayesian matrix factorization. Journal of Machine Learning Research, 14(Jan):1-37, 2013.
[4] M. Opper, U. Paquet, and O. Winther. Perturbative corrections for approximate inference in Gaussian latent variable models. Journal of Machine Learning Research, 14(Sep):2857-2898, 2013.
[5] M. Opper and O. Winther. Variational linear response. In Advances in Neural Information Processing Systems 16, pages 1157-1164. MIT Press, 2004.
[6] U. Paquet, O. Winther, and M. Opper. Perturbation corrections in approximate inference: Mixture modelling applications. Journal of Machine Learning Research, 10(Jun):1263-1304, 2009.

Supplementary Material

Matrix factorization - general case

We now consider a recommender system with N users, M items and K latent dimensions. An N × M sparse rating matrix R can be constructed using each user's list of rated items in the catalogue. An element r_ij in R contains the rating that user i has given to item j. Matrix factorization techniques factorize the rating matrix as R = X^T Y + E, where X is a K × N user matrix, the item matrix Y is K × M and the residual term E is N × M. We denote the set of observed entries in R by 𝓡. We use a Gaussian likelihood for the ratings, i.e. r_ij ~ N(r_ij; x_i^T y_j, β^{-1}), and isotropic Gaussian priors for both user and item vectors:

p(x_i) = Π_{k=1}^K N(x_ik; 0, α_x^{-1}),   p(y_j) = Π_{k=1}^K N(y_jk; 0, α_y^{-1}).   (8)

It is common to take a fully factorized approximation for user i's and item j's vectors over the dimensions k = 1, ..., K:

q(x_i) = Π_k q(x_ik) = Π_k N(x_ik; μ_ik, γ_ik^{-1}),   q(y_j) = Π_k q(y_jk) = Π_k N(y_jk; η_jk, ζ_jk^{-1})

(we now use η and ζ for the item means and precisions, to avoid subscript spaghetti).

The main quantity of interest in a recommender system is the predicted rating E[x_i^T y_j], whose first-order correction requires the computation of ⟨x_i^T y_j, V(X, Y)⟩. With our assumptions, the functions H and H_q in V(X, Y) = H(X, Y) - H_q(X, Y) are

H(X, Y) = Σ_{(i,j) ∈ 𝓡} (β/2)(r_ij - x_i^T y_j)^2 + Σ_i (α_x/2) x_i^T x_i + Σ_j (α_y/2) y_j^T y_j,
H_q(X, Y) = Σ_{i,k} (γ_ik/2)(x_ik - μ_ik)^2 + Σ_{j,k} (ζ_jk/2)(y_jk - η_jk)^2.

Counterintuitively, we see that while we are only interested in correcting E[x_i^T y_j], both H and H_q also depend on terms relative to items not seen by user i and users that have not rated item j. As we will see shortly, however, these terms do not contribute to ⟨x_i^T y_j, V(X, Y)⟩. In its unsimplified form,

⟨x_i^T y_j, V(X, Y)⟩ = ⟨x_i^T y_j, Σ_{(n,m) ∈ 𝓡} (β/2)(r_nm - x_n^T y_m)^2 + Σ_{n=1}^N (α_x/2) x_n^T x_n + Σ_{m=1}^M (α_y/2) y_m^T y_m - Σ_{n=1}^N Σ_{k=1}^K (γ_nk/2)(x_nk - μ_nk)^2 - Σ_{m=1}^M Σ_{k=1}^K (ζ_mk/2)(y_mk - η_mk)^2⟩.

In the computation of ⟨x_i^T y_j, V(X, Y)⟩, all the terms in V(X, Y) that do not depend on x_i or y_j are constants and leave the covariance unchanged: considering the user vectors we have for example

⟨x_i^T y_j, Σ_{n=1}^N (α_x/2) x_n^T x_n⟩ = ⟨x_i^T y_j, (α_x/2) x_i^T x_i⟩.

In a similar way, among the likelihood terms Σ_{(n,m) ∈ 𝓡} (r_nm - x_n^T y_m)^2 we only need to keep the ones corresponding to the items rated by user i and the users that rated item j (i.e. only the i-th row and the j-th column of the ratings matrix). Defining M(i) as the set containing the indices of the items rated by user i, and N(j) as the set containing the indices of the users that rated item j, we have

⟨x_i^T y_j, Σ_{(n,m) ∈ 𝓡} (r_nm - x_n^T y_m)^2⟩ = ⟨x_i^T y_j, Σ_{m ∈ M(i)} (r_im - x_i^T y_m)^2 + Σ_{n ∈ N(j)} (r_nj - x_n^T y_j)^2⟩.

After removing all the constant terms the simplified covariance is given by

⟨x_i^T y_j, V⟩ = ⟨x_i^T y_j, (β/2) Σ_{m ∈ M(i)} (r_im - x_i^T y_m)^2 + (β/2) Σ_{n ∈ N(j)} (r_nj - x_n^T y_j)^2 + (α_x/2) x_i^T x_i + (α_y/2) y_j^T y_j - Σ_{k=1}^K (γ_ik/2)(x_ik - μ_ik)^2 - Σ_{k=1}^K (ζ_jk/2)(y_jk - η_jk)^2⟩
= (β/2) Cov[x_i^T y_j, Σ_{m ∈ M(i)} (r_im - x_i^T y_m)^2] + (β/2) Cov[x_i^T y_j, Σ_{n ∈ N(j)} (r_nj - x_n^T y_j)^2] + ⟨x_i^T y_j, (α_x/2) x_i^T x_i + (α_y/2) y_j^T y_j - Σ_{k=1}^K (γ_ik/2)(x_ik - μ_ik)^2 - Σ_{k=1}^K (ζ_jk/2)(y_jk - η_jk)^2⟩
= (β/2) f_1(x_i, y_j, Y_{M(i)}) + (β/2) f_2(x_i, y_j, X_{N(j)}) + g(x_i, y_j),

where X_{N(j)} is the restriction of X to the columns indexed by N(j), and Y_{M(i)} is the restriction of Y to the columns indexed by M(i).

Computation of g(x_i, y_j). Thanks to our choice of a fully factorized VB distribution and the bi-linearity of the covariance operator we have

g(x_i, y_j) = Σ_{k=1}^K ⟨x_ik y_jk, Σ_{k'=1}^K ( (α_x/2) x_ik'^2 + (α_y/2) y_jk'^2 - (γ_ik'/2)(x_ik' - μ_ik')^2 - (ζ_jk'/2)(y_jk' - η_jk')^2 )⟩
= Σ_{k=1}^K ⟨x_ik y_jk, (α_x/2) x_ik^2 + (α_y/2) y_jk^2 - (γ_ik/2)(x_ik - μ_ik)^2 - (ζ_jk/2)(y_jk - η_jk)^2⟩.

Computation of f_1(x_i, y_j, Y_{M(i)}) and f_2(x_i, y_j, X_{N(j)}). We now focus below only on the computation of f_1(x_i, y_j, Y_{M(i)}), as the results for f_2(x_i, y_j, X_{N(j)}) are analogous. We rewrite the term f_1(x_i, y_j, Y_{M(i)}) as

f_1(x_i, y_j, Y_{M(i)}) = ⟨x_i^T y_j, Σ_{m ∈ M(i)} (r_im - x_i^T y_m)^2⟩
= Σ_{m ∈ M(i)} ⟨x_i^T y_j, (r_im - x_i^T y_m)^2⟩
= Σ_{m ∈ M(i)} ⟨x_i^T y_j, r_im^2 - 2 r_im x_i^T y_m + (x_i^T y_m)^2⟩
= Σ_{m ∈ M(i)} { -2 r_im ⟨x_i^T y_j, x_i^T y_m⟩ + Cov[x_i^T y_j, (x_i^T y_m)^2] },

and we analyse separately each of the covariance terms. Thanks to the full factorization of the VB posterior we have

⟨x_i^T y_j, x_i^T y_m⟩ = Cov[ Σ_{k=1}^K x_ik y_jk, Σ_{k'=1}^K x_ik' y_mk' ] = Σ_{k=1}^K ⟨x_ik y_jk, x_ik y_mk⟩.

For ⟨x_i^T y_j, (x_i^T y_m)^2⟩ we get instead

⟨x_i^T y_j, (x_i^T y_m)^2⟩ = Σ_{k=1}^K ⟨x_ik y_jk, Σ_{k'=1}^K (x_ik' y_mk')^2 + 2 Σ_{k'=1}^K Σ_{p<k'} x_ik' y_mk' x_ip y_mp⟩
= Σ_{k=1}^K ⟨x_ik y_jk, (x_ik y_mk)^2⟩ + 2 Σ_{k=1}^K ⟨x_ik y_jk, x_ik y_mk Σ_{c ≠ k} x_ic y_mc⟩
= Σ_{k=1}^K ⟨x_ik y_jk, (x_ik y_mk)^2⟩ + 2 Σ_{k=1}^K E[ Σ_{c ≠ k} x_ic y_mc ] ⟨x_ik y_jk, x_ik y_mk⟩,

since only the terms that share the component index k with the first argument survive (all remaining factors are independent under q). Combining these results we obtain

f_1(x_i, y_j, Y_{M(i)}) = Σ_{m ∈ M(i)} Σ_{k=1}^K { -2 r_im ⟨x_ik y_jk, x_ik y_mk⟩ + ⟨x_ik y_jk, (x_ik y_mk)^2⟩ + 2 E[ Σ_{c ≠ k} x_ic y_mc ] ⟨x_ik y_jk, x_ik y_mk⟩ }
= Σ_{m ∈ M(i)} Σ_{k=1}^K { -2 ( r_im - E[ Σ_{c ≠ k} x_ic y_mc ] ) ⟨x_ik y_jk, x_ik y_mk⟩ + ⟨x_ik y_jk, (x_ik y_mk)^2⟩ }.

If we then define r^(k)_im as the rating r_im minus the expected contribution of all the other components, i.e.

r^(k)_im = r_im - E[ Σ_{c ≠ k} x_ic y_mc ] = r_im - Σ_{c ≠ k} μ_ic η_mc,

we can rewrite f_1(x_i, y_j, Y_{M(i)}) as

f_1(x_i, y_j, Y_{M(i)}) = ⟨x_i^T y_j, Σ_{m ∈ M(i)} (r_im - x_i^T y_m)^2⟩
= Σ_{m ∈ M(i)} Σ_{k=1}^K { -2 r^(k)_im ⟨x_ik y_jk, x_ik y_mk⟩ + ⟨x_ik y_jk, (x_ik y_mk)^2⟩ }
= Σ_{m ∈ M(i)} Σ_{k=1}^K ⟨x_ik y_jk, (r^(k)_im - x_ik y_mk)^2⟩.

For each of the |M(i)| points we will then only need the correction for K univariate problems (that are far simpler to compute). Notice in particular that this result is only possible thanks to our assumption of fully factorized priors and VB posteriors, as there are no correlations among variables of the same user/item vector to be taken into account when computing the covariances.
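To see this reduction in action, the following SymPy sketch (our own verification, not the authors' code) checks, for K = 2 components and a single rated item m with arbitrary rational parameters, that the K-dimensional covariance for one likelihood term equals the sum of the K univariate covariances built from the residual ratings r^(k).

```python
import sympy as sp
from sympy.stats import Normal, E

def cov(a, b):
    # Covariance under the factorized q, computed from symbolic Gaussian moments.
    return sp.simplify(E(sp.expand(a * b)) - E(sp.expand(a)) * E(sp.expand(b)))

# K = 2 components, one rated item m, target item j; arbitrary rational parameters.
r     = sp.Rational(3, 2)
mu    = [sp.Rational(1, 3), sp.Rational(-1, 2)]   # user means   E[x_ik]
eta_j = [sp.Rational(2, 5), sp.Rational(1, 4)]    # item-j means E[y_jk]
eta_m = [sp.Rational(-2, 3), sp.Rational(3, 4)]   # item-m means E[y_mk]
x  = [Normal(f"x{k}",  mu[k],    sp.Rational(1, 2)) for k in range(2)]
yj = [Normal(f"yj{k}", eta_j[k], sp.Rational(1, 3)) for k in range(2)]
ym = [Normal(f"ym{k}", eta_m[k], sp.Rational(2, 3)) for k in range(2)]

dot_j = x[0] * yj[0] + x[1] * yj[1]               # x_i^T y_j
dot_m = x[0] * ym[0] + x[1] * ym[1]               # x_i^T y_m

# Left-hand side: the K-dimensional covariance for one rated item m.
lhs = cov(dot_j, (r - dot_m)**2)

# Right-hand side: K univariate covariances with the residual ratings r^(k).
rhs = 0
for k in range(2):
    c = 1 - k                                     # the "other" component
    r_k = r - mu[c] * eta_m[c]
    rhs += cov(x[k] * yj[k], (r_k - x[k] * ym[k])**2)

print(sp.simplify(lhs - rhs))                     # expected: 0
```

The printed difference should simplify to 0 under these assumptions.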

Putting it all together: corrections for the predicted ratings

Combining the results above, the first-order correction for the required expectation of this matrix factorization problem is given by

⟨x_i^T y_j, V⟩ = (β/2) Σ_{m ∈ M(i)} Σ_{k=1}^K ⟨x_ik y_jk, (r^(k)_im - x_ik y_mk)^2⟩ + (β/2) Σ_{n ∈ N(j)} Σ_{k=1}^K ⟨x_ik y_jk, (r^(k)_nj - x_nk y_jk)^2⟩ + Σ_{k=1}^K ⟨x_ik y_jk, (α_x/2) x_ik^2 + (α_y/2) y_jk^2 - (γ_ik/2)(x_ik - μ_ik)^2 - (ζ_jk/2)(y_jk - η_jk)^2⟩
= Σ_{k=1}^K { (β/2) Σ_{m ∈ M(i)} ⟨x_ik y_jk, (r^(k)_im - x_ik y_mk)^2⟩ + (β/2) Σ_{n ∈ N(j)} ⟨x_ik y_jk, (r^(k)_nj - x_nk y_jk)^2⟩ + Cov[x_ik y_jk, (α_x/2) x_ik^2 + (α_y/2) y_jk^2 - (γ_ik/2)(x_ik - μ_ik)^2 - (ζ_jk/2)(y_jk - η_jk)^2] }.

Due to the fully factorized VB approximation, the computation of these covariances does not require vector operations and can easily be done manually or using symbolic mathematics software such as SymPy (http://www.sympy.org). The required non-central Gaussian moments can be computed using Wick's theorem. Assuming that r_ij is an unobserved rating that we want to predict (this means that the likelihood term in V(X, Y) does not depend on x_i^T y_j; in the toy example presented in Section 3 we were instead computing a correction for an observed rating), the final expression for the covariance term in the first-order correction of E[x_i^T y_j] is

⟨x_i^T y_j, V⟩ = Σ_{k=1}^K { Σ_{m ∈ M(i)} ( β μ_ik η_jk η_mk^2 / γ_ik - β r^(k)_im η_jk η_mk / γ_ik + β μ_ik η_jk / (γ_ik ζ_mk) ) + Σ_{n ∈ N(j)} ( β μ_ik η_jk μ_nk^2 / ζ_jk - β r^(k)_nj μ_ik μ_nk / ζ_jk + β μ_ik η_jk / (ζ_jk γ_nk) ) + ( α_x/γ_ik + α_y/ζ_jk ) μ_ik η_jk }.

To illustrate the effect of the correction we can compute its value in the case where the symmetries in the user vector x_i are not broken, i.e. μ_i = 0 (this may happen for noisy users with only a few rated items). In this case r^(k)_im = r_im, and the expected rating with a first-order correction is (with λ = 1)

E[x_i^T y_j] ≈ Σ_{k=1}^K Σ_{m ∈ M(i)} β r^(k)_im η_jk η_mk / γ_ik.

We see that the expectation depends on the ratings given by user i to the other items in the catalogue, weighted by how similar their vectors are to y_j: we get a positive contribution from r^(k)_im to the predicted rating only if the k-th components of the means of the j-th and m-th item vectors have the same sign.
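Finally, since the text notes that these covariances can be obtained with symbolic software, here is a small SymPy sketch (our own check, not the authors' code) for the univariate building block: it computes Cov_q[xy, V] for the scalar model of Section 3, which is the content of (6), and then specialises to the symmetric case μ_x = μ_y = 0, where the first-order correction -Cov_q[xy, V] should reduce to βr/(γ_x γ_y) as in (7).

```python
import sympy as sp
from sympy.stats import Normal, E

# Means and the rating are real; precisions are positive.
mu_x, mu_y, r = sp.symbols("mu_x mu_y r", real=True)
g_x, g_y, a_x, a_y, beta = sp.symbols("gamma_x gamma_y alpha_x alpha_y beta", positive=True)

# Independent Gaussians under the fully factorized VB posterior q.
x = Normal("x", mu_x, 1 / sp.sqrt(g_x))
y = Normal("y", mu_y, 1 / sp.sqrt(g_y))

# V = H - H_q for the scalar model of Section 3 (one univariate building block).
H  = beta / 2 * (r - x * y)**2 + a_x / 2 * x**2 + a_y / 2 * y**2
Hq = g_x / 2 * (x - mu_x)**2 + g_y / 2 * (y - mu_y)**2
V  = sp.expand(H - Hq)

# Cov_q[xy, V] = E_q[xy V] - E_q[xy] E_q[V]; this reproduces eq. (6).
cov = sp.simplify(E(sp.expand(x * y * V)) - E(x * y) * E(V))
print(cov)

# Self-pruned (symmetric) case: the first-order correction -Cov_q[xy, V]
# should reduce to beta*r/(gamma_x*gamma_y), i.e. eq. (7).
print(sp.simplify(cov.subs({mu_x: 0, mu_y: 0})))
```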