Variational Inference
Sargur Srihari
srihari@cedar.buffalo.edu
Plan of Discussion
- Functionals
- Calculus of Variations
- Maximizing a Functional
- Finding an Approximation to a Posterior
- Minimizing K-L Divergence
- Factorized Distributions
- Example of a Bivariate Gaussian
Function versus Functional
- A function takes the value of a variable as input and returns the function value as output
- A functional takes a function as input and returns the functional value as output
- An example is the entropy
  H[p] = −∫ p(x) ln p(x) dx
Calculus of Variations
- Function: input x is a value, returns a value y(x)
- Functional: input is a function y(x), returns a value F[y(x)]
- Standard calculus concerns derivatives of functions: how the output changes in response to small changes in the input value
- Variational calculus concerns functional derivatives: how the output changes in response to infinitesimal changes in the input function
- Invented by Leonhard Euler (Swiss, 1707-1783)
Functional Derivative
- Defined by considering how the value of a functional F[y] changes when the function y(x) is changed to y(x) + εη(x), where η(x) is an arbitrary function of x
Maximizing a Functional
- A common problem in conventional calculus: find a value x that maximizes a function y(x)
- In the calculus of variations: find a function y(x) that maximizes a functional F[y]
  - E.g., find a distribution that maximizes entropy
- Can be used to show that:
  1. The shortest path between two points is a straight line
  2. The distribution that maximizes the entropy H[p] = −∫ p(x) ln p(x) dx, for fixed mean and variance, is the Gaussian
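The maximum-entropy claim can be sketched with Lagrange multipliers. The derivation below is the standard constrained-variation argument, assuming normalization, mean, and variance are held fixed:

```latex
\text{Maximize } \tilde{H}[p] = -\int p \ln p \, dx
  + \lambda_1 \Big( \int p \, dx - 1 \Big)
  + \lambda_2 \Big( \int x \, p \, dx - \mu \Big)
  + \lambda_3 \Big( \int (x-\mu)^2 p \, dx - \sigma^2 \Big)

\text{Setting the functional derivative with respect to } p(x) \text{ to zero:}
  -1 - \ln p(x) + \lambda_1 + \lambda_2 x + \lambda_3 (x-\mu)^2 = 0
  \;\Rightarrow\; p(x) = \exp\{-1 + \lambda_1 + \lambda_2 x + \lambda_3 (x-\mu)^2\}

\text{Solving for the multipliers from the three constraints gives}
  p(x) = \mathcal{N}(x \mid \mu, \sigma^2)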
Variational Methods
- Nothing intrinsically approximate about them
- But they naturally lend themselves to approximation, by restricting the range of functions, e.g.
  - Quadratic functions
  - Linear combinations of fixed basis functions, in which only the coefficients of the linear combination vary
  - Factorization assumptions
- This is the method used here for probabilistic inference
Application to Probabilistic Inference
- Consider a fully Bayesian model: all parameters are given prior distributions
- The model may have latent variables as well as parameters
  - Denote all parameters and latent variables by Z
  - Denote the set of observed variables by X
- Given N i.i.d. samples with X = {x_1,..,x_N} and Z = {z_1,..,z_N}, the probabilistic model specifies the joint distribution; we seek the posterior p(Z|X) as well as the model evidence p(X)
Lower Bound on p(X) using KL Divergence
- Decomposition of the log marginal probability:
  ln p(X) = L(q) + KL(q‖p)
  where
  L(q) = ∫ q(Z) ln { p(X,Z) / q(Z) } dZ        (the functional we wish to maximize)
  KL(q‖p) = −∫ q(Z) ln { p(Z|X) / q(Z) } dZ   (KL divergence between the proposed q and the desired posterior p, to be minimized)
- Also applicable to discrete distributions by replacing integrations with summations
- Observations on the optimization:
  - L(q) is a lower bound on ln p(X)
  - Maximizing the lower bound L(q) with respect to the distribution q(Z) is equivalent to minimizing the KL divergence
  - The KL divergence vanishes when q(Z) = p(Z|X)
- Plan: we seek the distribution q(Z) for which L(q) is largest
  - Since the true posterior is intractable, we consider a restricted family for q(Z)
  - Seek the member of this family for which the KL divergence is minimized
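The decomposition can be checked numerically on a toy discrete model. The joint values below are hypothetical, chosen only to illustrate the identity for one fixed observation x:

```python
import numpy as np

# Hypothetical joint p(x, z) over a discrete latent z in {0, 1, 2},
# for one fixed observed x (values chosen only for illustration)
p_xz = np.array([0.10, 0.25, 0.15])
p_x = p_xz.sum()                          # marginal evidence p(x)
p_z_given_x = p_xz / p_x                  # true posterior p(z | x)

q = np.array([0.5, 0.3, 0.2])             # an arbitrary proposed q(z)

L = np.sum(q * np.log(p_xz / q))          # lower bound L(q)
KL = np.sum(q * np.log(q / p_z_given_x))  # KL(q || p)

# The decomposition ln p(x) = L(q) + KL(q || p) holds exactly,
# and since KL >= 0, L(q) is a lower bound on ln p(x)
assert np.isclose(np.log(p_x), L + KL)
assert KL >= 0
```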
Approximating Family
- Typically the true posterior p(Z|X) is intractable
- Pick a family of parametric distributions q(Z|ω) governed by parameters ω
- The lower bound then becomes a function of ω
- Can exploit standard nonlinear optimization techniques to determine the optimal parameters
- Use q with the fitted parameters as a proxy for the posterior, e.g., to predict future data, or to obtain the posterior of hidden variables
Variational Approximation Example
- Use a parametric distribution q(Z|ω)
- The lower bound L(q) is then a function of ω, and standard nonlinear optimization can determine the optimal ω
- Figure: the original distribution with its Laplace and variational approximations, and their negative logarithms
- The variational distribution is Gaussian, optimized with respect to its mean and variance
Factorized Distribution Approach
- One way to restrict the family of distributions
- Partition the elements of Z into disjoint groups Z_i, i = 1,..,M, so that
  q(Z) = ∏_{i=1}^{M} q_i(Z_i)
- No further assumptions are made about the distribution: no restrictions on the factors q_i(Z_i)
- The method is called mean field theory in physics; also mean field variational inference
Factorized Distribution Approach (continued)
- Among factorized distributions q(Z) = ∏_{i=1}^{M} q_i(Z_i) we seek the one for which the lower bound L(q) is largest
- Substituting the factorization (and denoting q_j(Z_j) simply by q_j):
  L(q) = ∫ q(Z) ln { p(X,Z) / q(Z) } dZ
       = ∫ ∏_i q_i { ln p(X,Z) − Σ_i ln q_i } dZ
- Keeping only the terms that depend on one factor q_j:
  L(q) = ∫ q_j { ∫ ln p(X,Z) ∏_{i≠j} q_i dZ_i } dZ_j − ∫ q_j ln q_j dZ_j + const
       = ∫ q_j ln p̃(X,Z_j) dZ_j − ∫ q_j ln q_j dZ_j + const
- where we have defined a new distribution p̃(X,Z_j) by
  ln p̃(X,Z_j) = E_{i≠j}[ln p(X,Z)] + const
- and E_{i≠j}[...] is an expectation with respect to the q distributions over all Z_i with i ≠ j:
  E_{i≠j}[ln p(X,Z)] = ∫ ln p(X,Z) ∏_{i≠j} q_i dZ_i
- The result is the negative of a KL divergence between q_j(Z_j) and p̃(X,Z_j)
Maximizing L(q) with the Factorization
- We have seen that
  L(q) = ∫ q_j ln p̃(X,Z_j) dZ_j − ∫ q_j ln q_j dZ_j + const
- Maximizing L(q) with respect to q_j is therefore equivalent to minimizing the Kullback-Leibler divergence between q_j(Z_j) and p̃(X,Z_j)
- The minimum occurs when q_j(Z_j) = p̃(X,Z_j)
- Thus we obtain a general expression for the optimal solution q_j*(Z_j):
  ln q_j*(Z_j) = E_{i≠j}[ln p(X,Z)] + const
Optimal Solution Interpretation
- The optimal solution is
  ln q_j*(Z_j) = E_{i≠j}[ln p(X,Z)] + const
- This provides the basis for the application of variational methods
- The log of the optimal solution for factor q_j is obtained by taking the log of the joint distribution over all hidden and observed variables, and then taking the expectation with respect to all the other factors {q_i} for i ≠ j
Additive Constant in the Solution
- The constant is set by normalizing the distribution q_j*(Z_j)
- Taking the exponential of both sides and normalizing:
  q_j*(Z_j) = exp( E_{i≠j}[ln p(X,Z)] ) / ∫ exp( E_{i≠j}[ln p(X,Z)] ) dZ_j
- In practice it is more convenient to work with the form
  ln q_j*(Z_j) = E_{i≠j}[ln p(X,Z)] + const
Obtaining an Explicit Solution
- The set of equations
  ln q_j*(Z_j) = E_{i≠j}[ln p(X,Z)] + const,   j = 1,..,M
  represents a set of consistency conditions for the maximum of the lower bound
- They do not provide an explicit solution, since each right-hand side depends on the other factors
- A solution is obtained by suitably initializing the factors q_i(Z_i) and then cycling through the factors, replacing each in turn with a revised estimate given by E_{i≠j}[ln p(X,Z)] + const, evaluated using the current estimates for all the other factors
- Convergence is guaranteed because the bound is convex with respect to each of the factors q_i(Z_i)
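The cycling scheme can be sketched on a toy model with two binary latent variables, where every expectation is a small sum. The joint table `p_xz` below is hypothetical, chosen only for illustration:

```python
import numpy as np

# Hypothetical joint p(x, z1, z2) for one fixed observation x,
# with z1, z2 binary; p_xz[z1, z2] holds the joint values
p_xz = np.array([[0.05, 0.20],
                 [0.30, 0.10]])
log_p = np.log(p_xz)

def normalize(unnorm_log):
    """Exponentiate a log-probability vector and normalize it."""
    w = np.exp(unnorm_log - unnorm_log.max())
    return w / w.sum()

def lower_bound(q1, q2):
    """L(q) for the factorized distribution q(z1, z2) = q1(z1) q2(z2)."""
    q = np.outer(q1, q2)
    return np.sum(q * (log_p - np.log(q)))

# Initialize both factors uniformly, then cycle through the updates
q1 = np.array([0.5, 0.5])
q2 = np.array([0.5, 0.5])
bounds = [lower_bound(q1, q2)]
for _ in range(50):
    # ln q1*(z1) = E_{q2}[ln p(x, z1, z2)] + const, and similarly for q2
    q1 = normalize(log_p @ q2)
    q2 = normalize(log_p.T @ q1)
    bounds.append(lower_bound(q1, q2))

# Each update can only increase the bound, which stays below ln p(x)
assert all(b2 >= b1 - 1e-12 for b1, b2 in zip(bounds, bounds[1:]))
assert bounds[-1] <= np.log(p_xz.sum()) + 1e-12
```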
Properties of Factorized Approximations
- Our approach to variational inference is based on a factorized approximation to the true posterior distribution
- We now consider approximating a Gaussian distribution using a factorized Gaussian
- This provides insight into the types of inaccuracy introduced by factorized approximations
Example of Factorized Approximation
- The true posterior distribution p(z), z = (z_1, z_2), is a Gaussian over two correlated variables:
  p(z) = N(z | µ, Λ⁻¹),   µ = (µ_1, µ_2)ᵀ,   Λ = [ Λ_11  Λ_12 ; Λ_21  Λ_22 ]
- We wish to approximate p(z) using a factorized Gaussian of the form
  q(z) = q_1(z_1) q_2(z_2)
Factor Derivation
- Using the optimal solution for factorized distributions
  ln q_j*(Z_j) = E_{i≠j}[ln p(X,Z)] + const
  we find an expression for the optimal factor q_1*(z_1)
- Only terms with a functional dependence on z_1 are needed; other terms are absorbed into the normalization constant:
  ln q_1*(z_1) = E_{z_2}[ln p(z)] + const
              = E_{z_2}[ −(1/2)(z_1 − µ_1)² Λ_11 − (z_1 − µ_1) Λ_12 (z_2 − µ_2) ] + const
              = −(1/2) z_1² Λ_11 + z_1 µ_1 Λ_11 − z_1 Λ_12 ( E[z_2] − µ_2 ) + const
- Since the right-hand side is a quadratic function of z_1, we can identify q_1*(z_1) as a Gaussian
Factorized Approximation of a Gaussian
- We did not assume q_1(z_1) to be Gaussian; it emerged from variational optimization of the KL divergence over all possible distributions
- We can identify the mean and precision of the Gaussians:
  q_1*(z_1) = N(z_1 | m_1, Λ_11⁻¹),   m_1 = µ_1 − Λ_11⁻¹ Λ_12 ( E[z_2] − µ_2 )
  q_2*(z_2) = N(z_2 | m_2, Λ_22⁻¹),   m_2 = µ_2 − Λ_22⁻¹ Λ_21 ( E[z_1] − µ_1 )
- The solutions are coupled: q_1*(z_1) depends on an expectation with respect to q_2*(z_2), and vice versa
- They are treated as re-estimation equations, cycling through them in turn
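The coupled re-estimation equations can be run directly. The values of `mu` and the precision matrix `Lam` below are illustrative assumptions for a correlated bivariate Gaussian:

```python
import numpy as np

# Hypothetical bivariate Gaussian p(z) = N(z | mu, inv(Lam))
mu = np.array([1.0, -1.0])
Lam = np.array([[2.0, 1.2],
                [1.2, 2.0]])   # positive definite, correlated case

# Initialize the factor means E[z1], E[z2] and cycle the updates
m = np.zeros(2)
for _ in range(100):
    # m1 = mu1 - Lam11^{-1} Lam12 (E[z2] - mu2)
    m[0] = mu[0] - Lam[0, 1] / Lam[0, 0] * (m[1] - mu[1])
    # m2 = mu2 - Lam22^{-1} Lam21 (E[z1] - mu1)
    m[1] = mu[1] - Lam[1, 0] / Lam[1, 1] * (m[0] - mu[0])

# The fixed point captures the true mean exactly...
assert np.allclose(m, mu)
# ...but the factor variances 1/Lam_jj under-estimate the true
# marginal variances diag(inv(Lam)) whenever z1 and z2 are correlated
true_var = np.diag(np.linalg.inv(Lam))
assert np.all(1.0 / np.diag(Lam) < true_var)
```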
Result of Variational Approximation
- The mean is correctly captured
- The variance of q(z) is controlled by the direction of smallest variance of p(z); the variance along the orthogonal direction is under-estimated
- Figure: green contours show the correlated Gaussian p(z) at 1, 2 and 3 standard deviations; red contours show q(z) over the same variables, given by the product of two independent univariate Gaussians
- In general, a factorized variational approximation gives approximations that are too compact
Two Alternative Forms of KL Divergence
- Figure: green contours show the correlated Gaussian at 1, 2 and 3 standard deviations; red contours show q(z) given by the product of two independent univariate Gaussians
- Minimization based on the KL divergence KL(q‖p): mean correctly captured, variance under-estimated (too compact)
- Minimization based on the reverse KL divergence KL(p‖q): the form used in expectation propagation
Alternative KL Divergences for a Multimodal Distribution
- Approximating a multimodal distribution by a unimodal one
- Blue contours: the multimodal distribution p(Z); red contours: the single Gaussian q(Z)
- (a): the single Gaussian q(Z) that best approximates p(Z) in the sense of minimizing KL(p‖q)
- (b): the single Gaussian q(Z) that best approximates p(Z) in the sense of minimizing KL(q‖p)
- (c): as in (b), showing a different local minimum of the KL divergence
Alpha Family of Divergences
- The two forms of KL divergence are members of the alpha family of divergences
  D_α(p‖q) = (4 / (1 − α²)) ( 1 − ∫ p(x)^{(1+α)/2} q(x)^{(1−α)/2} dx ),   −∞ < α < ∞
- KL(p‖q) corresponds to the limit α → 1
- KL(q‖p) corresponds to the limit α → −1
- For all α, D_α(p‖q) ≥ 0, with equality iff p(x) = q(x)
- When α = 0 we get a symmetric divergence, linearly related to the Hellinger distance (a valid distance measure)
  D_H(p‖q) = ∫ ( p(x)^{1/2} − q(x)^{1/2} )² dx
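These relationships can be verified numerically for discrete distributions. The helper names `alpha_divergence` and `hellinger_sq`, and the two distributions, are hypothetical choices for illustration (for α = 0 the formula reduces to exactly twice the Hellinger distance as defined above):

```python
import numpy as np

def alpha_divergence(p, q, alpha):
    """D_alpha for discrete distributions (alpha != +/-1)."""
    s = np.sum(p ** ((1 + alpha) / 2) * q ** ((1 - alpha) / 2))
    return 4.0 / (1.0 - alpha ** 2) * (1.0 - s)

def hellinger_sq(p, q):
    """Hellinger distance as defined above: sum (sqrt p - sqrt q)^2."""
    return np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)

p = np.array([0.2, 0.5, 0.3])
q = np.array([0.4, 0.4, 0.2])

# alpha = 0 is symmetric and linearly related to the Hellinger distance
assert np.isclose(alpha_divergence(p, q, 0.0), 2.0 * hellinger_sq(p, q))
assert np.isclose(alpha_divergence(p, q, 0.0), alpha_divergence(q, p, 0.0))

# alpha -> 1 and alpha -> -1 recover the two KL divergences
kl_pq = np.sum(p * np.log(p / q))
kl_qp = np.sum(q * np.log(q / p))
assert np.isclose(alpha_divergence(p, q, 0.999), kl_pq, atol=1e-2)
assert np.isclose(alpha_divergence(p, q, -0.999), kl_qp, atol=1e-2)
```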
Example: Univariate Gaussian
- Infer the parameters of a single Gaussian: mean µ and precision τ, given a data set D = {x_1,..,x_N}
- Likelihood function:
  p(D|µ,τ) = (τ/2π)^{N/2} exp{ −(τ/2) Σ_{n=1}^{N} (x_n − µ)² }
- Assume a conjugate Gaussian-Gamma prior over the parameters:
  p(µ|τ) = N(µ | µ_0, (λ_0 τ)⁻¹)
  p(τ) = Gam(τ | a_0, b_0)
Factorized Variational Approximation
- Approximate the posterior with the factorized distribution
  q(µ,τ) = q_µ(µ) q_τ(τ)
- The optimal solution for the factor q_µ(µ) is a Gaussian
- The optimal solution for the factor q_τ(τ) is a Gamma distribution
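The resulting updates (Gaussian for q_µ, Gamma for q_τ, with each factor's parameters depending on moments of the other) can be sketched as follows. The data and the broad prior hyperparameter values are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.0, size=500)  # synthetic data
N, xbar = len(x), x.mean()

# Hypothetical broad Gaussian-Gamma prior hyperparameters
mu0, lam0, a0, b0 = 0.0, 1e-3, 1e-3, 1e-3

E_tau = 1.0  # initial guess for E[tau]
for _ in range(100):
    # q_mu(mu) = N(mu | mu_N, 1/lam_N)
    mu_N = (lam0 * mu0 + N * xbar) / (lam0 + N)
    lam_N = (lam0 + N) * E_tau
    # q_tau(tau) = Gam(tau | a_N, b_N), using E_mu[mu] and E_mu[mu^2]
    a_N = a0 + (N + 1) / 2
    E_sq = mu_N ** 2 + 1.0 / lam_N            # E_mu[mu^2]
    b_N = b0 + 0.5 * (np.sum(x ** 2) - 2 * mu_N * np.sum(x) + N * E_sq
                      + lam0 * (E_sq - 2 * mu_N * mu0 + mu0 ** 2))
    E_tau = a_N / b_N

# The variational posterior concentrates near the generating
# parameters (mean 2.0, precision 1.0)
assert abs(mu_N - 2.0) < 0.2
assert abs(E_tau - 1.0) < 0.2
```

Each pass through the loop is one cycle of the general re-estimation scheme: q_µ is updated using the current E[τ], then q_τ using the current moments of µ.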
Variational Inference of Univariate Gaussian
- Variational inference of the mean µ and precision τ
- (a) Green: contours of the true posterior p(µ,τ|D); blue: contours of the initial approximation q_µ(µ) q_τ(τ)
- (b) After re-estimating the factor q_µ(µ)
- (c) After re-estimating the factor q_τ(τ)
- (d) The iterative scheme converges to the red contours