Simon Leglaive
Télécom ParisTech, CNRS LTCI, Université Paris Saclay
November 18, 2016, Télécom ParisTech, Paris, France.
Outline

1. Introduction
   - Probabilistic model
   - Problem
   - Log-likelihood decomposition
   - EM algorithm
2. Approximate inference
   - Variational approximation
   - Mean field approximation
   - Optimal variational distribution under the MF approximation
   - Relation with Gibbs sampling
3. Audio source separation example
   - Mixing model
   - Source model
   - Source separation problem
Probabilistic model

Let us consider a probabilistic model where
- x denotes the set of observed random variables;
- z denotes the set of latent/hidden random variables;
- θ denotes the set of deterministic parameters.

For example, in audio source separation: the observed variables are the mixture signals, the latent variables are the source variables, and the parameters are the source and mixing parameters (e.g. NMF parameters and mixing filters).

Note: in a Bayesian framework the latent variables include the parameters.
Problem

1. Definition of the model: how are the data generated from the latent unobserved variables?
2. Inference: what are the values of the latent variables that generated the data?

Inference: we are naturally interested in computing the posterior p(z | x; θ).
- Maximum likelihood estimation: θ* = arg max_θ p(x; θ);
- Minimum mean square error (MMSE) estimation: ẑ = E_{z|x;θ}[z].

If we can compute the posterior distribution → Expectation-Maximization (EM) algorithm.
Log-likelihood decomposition

Let q be a probability density function (pdf) over z. The log-likelihood can then be decomposed as:

\ln p(x; \theta) = \underbrace{\mathcal{L}(q; \theta)}_{\text{variational free energy}} + \underbrace{\mathrm{KL}\big(q \,\|\, p(z \mid x; \theta)\big)}_{\text{Kullback-Leibler divergence}}, \quad (1)

where

\mathcal{L}(q; \theta) = \underbrace{\big\langle \ln p(x, z; \theta) \big\rangle_q}_{E(q;\theta)} \; \underbrace{- \big\langle \ln q(z) \big\rangle_q}_{H(q):\ \text{entropy}}; \quad (2)

\mathrm{KL}\big(q \,\|\, p(z \mid x; \theta)\big) = -\left\langle \ln \frac{p(z \mid x; \theta)}{q(z)} \right\rangle_q; \quad (3)

and \langle f(z) \rangle_q = \int f(z)\, q(z)\, dz.

As \mathrm{KL}(\cdot \,\|\, \cdot) \geq 0, \mathcal{L}(q; \theta) lower-bounds the log-likelihood.

Note: the variational free energy is the evidence lower bound (ELBO) in Bayesian settings.
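As a sanity check of decomposition (1), here is a small numerical illustration (not from the talk) on a toy conjugate model z ~ N(0, 1), x | z ~ N(z, σ²), where the ELBO, the KL term and ln p(x; θ) are all available in closed form; the observed value x and the variational parameters m, v are arbitrary choices.

```python
import numpy as np

# Toy model (illustrative): z ~ N(0, 1), x | z ~ N(z, sigma2).
sigma2 = 1.0
x = 1.3                      # one observed scalar
m, v = 0.4, 0.7              # arbitrary variational Gaussian q(z) = N(m, v)

# ELBO = <ln p(x, z)>_q + H(q)
e_loglik = -0.5 * np.log(2 * np.pi * sigma2) - ((x - m) ** 2 + v) / (2 * sigma2)
e_logprior = -0.5 * np.log(2 * np.pi) - (m ** 2 + v) / 2
entropy = 0.5 * np.log(2 * np.pi * np.e * v)
elbo = e_loglik + e_logprior + entropy

# Exact posterior p(z | x) = N(mp, vp) for this conjugate model
vp = 1.0 / (1.0 + 1.0 / sigma2)
mp = vp * x / sigma2
kl = 0.5 * (np.log(vp / v) + (v + (m - mp) ** 2) / vp - 1.0)

# Marginal likelihood ln p(x), with x ~ N(0, 1 + sigma2)
log_evidence = -0.5 * np.log(2 * np.pi * (1 + sigma2)) - x ** 2 / (2 * (1 + sigma2))

print(elbo + kl, log_evidence)   # both ~ -1.69: ln p(x) = ELBO + KL, as in (1)
```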
EM algorithm

Maximize \mathcal{L}(q; \theta) w.r.t. q at the E-step and w.r.t. \theta at the M-step.

E-step: from (1),

q^\star(z) = \arg\max_{q} \mathcal{L}(q; \theta^{\mathrm{old}}) = p(z \mid x; \theta^{\mathrm{old}}).

Then, from (2),

\mathcal{L}(q^\star; \theta) = \underbrace{\big\langle \ln p(x, z; \theta) \big\rangle_{p(z \mid x; \theta^{\mathrm{old}})}}_{Q(\theta, \theta^{\mathrm{old}}) \text{ in the standard EM formulation}} + \underbrace{H\big(p(z \mid x; \theta^{\mathrm{old}})\big)}_{\text{constant w.r.t. } \theta},

where \ln p(x, z; \theta) is the complete-data log-likelihood.

M-step:

\theta^{\mathrm{new}} = \arg\max_{\theta} Q(\theta, \theta^{\mathrm{old}}).
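For illustration only, here is a minimal EM loop on a toy two-component 1-D Gaussian mixture, a model chosen because both the E-step posterior and the M-step maximizer of Q(θ, θ_old) are in closed form; the data and variable names are my own, not from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data (illustrative): two 1-D Gaussian clusters
x = np.concatenate([rng.normal(-2, 1, 200), rng.normal(3, 1, 300)])

# Parameters theta = (weights, means, variances), roughly initialised
w = np.array([0.5, 0.5])
mu = np.array([-1.0, 1.0])
var = np.array([1.0, 1.0])

for _ in range(50):
    # E-step: posterior responsibilities p(z | x; theta_old)
    dens = np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
    resp = w * dens
    resp /= resp.sum(axis=1, keepdims=True)

    # M-step: maximise Q(theta, theta_old), closed form for a Gaussian mixture
    n_k = resp.sum(axis=0)
    w = n_k / len(x)
    mu = (resp * x[:, None]).sum(axis=0) / n_k
    var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / n_k

print(w, mu, var)  # approaches the generating weights, means and variances
```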
What if we cannot compute the posterior distribution?
Approximate inference

Stochastic methods: based on sampling
- Markov chain Monte Carlo (MCMC) methods, etc.;
- Computationally expensive, but converges to the true posterior.

Deterministic methods: based on optimization
- Variational methods, etc.;
- Computationally cheaper, but not exact.

[Figure: a true posterior approximated by Monte Carlo samples and by a variational distribution.]
Variational approximation

We want to find q ∈ F (a variational family) which approximates p(z | x; θ). We take the KL divergence as a measure of fit, but we cannot directly minimize it. However, from (1) we have:

\mathrm{KL}\big(q \,\|\, p(z \mid x; \theta)\big) = \ln p(x; \theta) - \mathcal{L}(q; \theta).

Variational EM algorithm:

E-step: q^\star = \arg\min_{q \in \mathcal{F}} \mathrm{KL}\big(q \,\|\, p(z \mid x; \theta^{\mathrm{old}})\big) = \arg\max_{q \in \mathcal{F}} \mathcal{L}(q; \theta^{\mathrm{old}});

M-step: \theta^{\mathrm{new}} = \arg\max_{\theta} \mathcal{L}(q^\star; \theta).
Mean field (MF) approximation

\mathcal{F}: the set of pdfs over z that factorize as q(z) = \prod_j q_j(z_j).

- We drop the posterior dependencies between the latent variables;
- Generally the true posterior does not belong to this variational family;
- More general than it seems: latent variables can be grouped, and only the distribution of each group factorizes.

[Figure: a true posterior and its mean-field approximation.]

We want to optimize \mathcal{L}(q; \theta^{\mathrm{old}}) for this factorized distribution.
Optimization under the MF approximation

We can show that¹:

\mathcal{L}(q; \theta) = -\mathrm{KL}\big(q_j \,\|\, \tilde{p}(x, z_j; \theta)\big) + \sum_{i \neq j} H(q_i), \quad (4)

where \ln \tilde{p}(x, z_j; \theta) = \big\langle \ln p(x, z; \theta) \big\rangle_{\prod_{i \neq j} q_i}.

Coordinate ascent inference; optimizing w.r.t. q_j with \{q_i\}_{i \neq j} fixed:

q_j^\star(z_j) = \arg\max_{q_j} \mathcal{L}(q; \theta^{\mathrm{old}}) = \arg\min_{q_j} \mathrm{KL}\big(q_j \,\|\, \tilde{p}(x, z_j; \theta^{\mathrm{old}})\big),

i.e.

\ln q_j^\star(z_j) = \big\langle \ln p(x, z; \theta^{\mathrm{old}}) \big\rangle_{\prod_{i \neq j} q_i} + \text{constant}.

- Hope to recognize a standard distribution, or normalize.
- The solutions are coupled, so initialize them and then cyclically update, as in the sketch below.

¹ See appendix for calculus details.
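A minimal coordinate-ascent sketch (my own toy example, not the talk's model): mean-field approximation of a correlated 2-D Gaussian "posterior", for which each optimal factor q_j*(z_j) is Gaussian and the mean update follows directly from averaging the log joint under the other factor.

```python
import numpy as np

# Target "posterior" (illustrative): a correlated 2-D Gaussian N(mu, Sigma).
mu = np.array([1.0, -1.0])
Sigma = np.array([[1.0, 0.8],
                  [0.8, 1.0]])
Lam = np.linalg.inv(Sigma)           # precision matrix

# Mean-field approximation q(z) = q1(z1) q2(z2), both Gaussian.
m = np.zeros(2)                      # variational means, arbitrary initialisation
v = 1.0 / np.diag(Lam)               # optimal variational variances (fixed here)

for _ in range(20):                  # cyclic coordinate-ascent updates
    # ln q1*(z1) = < ln p(z1, z2) >_{q2} + const  =>  Gaussian with mean:
    m[0] = mu[0] - Lam[0, 1] / Lam[0, 0] * (m[1] - mu[1])
    # ln q2*(z2) = < ln p(z1, z2) >_{q1} + const
    m[1] = mu[1] - Lam[1, 0] / Lam[1, 1] * (m[0] - mu[0])

print(m, v)   # means converge to mu; variances 1/Lam_ii underestimate Sigma_ii
```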
Relation with Gibbs sampling

Let us consider a Bayesian setting without deterministic parameters θ.

Variational Bayesian inference:

q_j^\star(z_j) \propto \exp\Big[ \big\langle \ln p(x, z) \big\rangle_{\prod_{i \neq j} q_i} \Big].

But p(x, z) = p(z_j \mid x, z_{\setminus j})\, p(x, z_{\setminus j}), where z_{\setminus j} denotes z except z_j, so

q_j^\star(z_j) \propto \exp\Big[ \big\langle \ln p(z_j \mid x, z_{\setminus j}) \big\rangle_{\prod_{i \neq j} q_i} \Big].

Gibbs sampling: we want to sample from p(z | x) by successively sampling z_j from p(z_j | x, z_{\setminus j}).

→ Hybrid approaches alternate sampling and optimization.
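For contrast with the mean-field sketch above, here is a Gibbs sampler on the same illustrative 2-D Gaussian target: instead of averaging the log conditional under the other factors, each z_j is drawn directly from its full conditional p(z_j | z_{\setminus j}).

```python
import numpy as np

rng = np.random.default_rng(0)
# Same illustrative 2-D Gaussian target as in the mean-field sketch above.
mu = np.array([1.0, -1.0])
Sigma = np.array([[1.0, 0.8],
                  [0.8, 1.0]])

z = np.zeros(2)
samples = []
for _ in range(5000):
    # Sample z1 from p(z1 | z2), then z2 from p(z2 | z1): the full conditionals
    cond_var = Sigma[0, 0] - Sigma[0, 1] ** 2 / Sigma[1, 1]
    cond_mean = mu[0] + Sigma[0, 1] / Sigma[1, 1] * (z[1] - mu[1])
    z[0] = rng.normal(cond_mean, np.sqrt(cond_var))

    cond_var = Sigma[1, 1] - Sigma[0, 1] ** 2 / Sigma[0, 0]
    cond_mean = mu[1] + Sigma[0, 1] / Sigma[0, 0] * (z[0] - mu[0])
    z[1] = rng.normal(cond_mean, np.sqrt(cond_var))
    samples.append(z.copy())

samples = np.array(samples)
print(samples.mean(axis=0), np.cov(samples.T))  # ~ mu and Sigma, correlations kept
```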
Audio source separation example

Mixing model:
- each mixture channel is a sum of filtered sources plus noise: x_i(t) = \sum_{j=1}^{J} [a_{ij} \star s_j](t) + b_i(t), with s_j(t) = \sum_{f,n} s_{j,fn}\, \psi_{fn}(t);
- \psi_{fn}(t) is a Modified Discrete Cosine Transform (MDCT) atom;
- sensor noise: b_i(t) \sim \mathcal{N}(0, \sigma_i^2).
Source model

Non-negative matrix factorization (NMF) of the short-term power spectral density: the source MDCT coefficients are modeled as s_{j,fn} \sim \mathcal{N}(0, v_{j,fn}) with v_{j,fn} = [W_j H_j]_{fn}.

[Figure: NMF of a source power spectrogram (frequency in kHz vs. time) into spectral templates (amplitude in dB) and temporal activations.]
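A small sketch (illustrative dimensions and random non-negative templates, not the talk's data) of drawing source MDCT coefficients from this NMF Gaussian model:

```python
import numpy as np

rng = np.random.default_rng(0)
F, N, K = 257, 100, 8            # frequency bins, time frames, NMF rank (arbitrary)

W = rng.gamma(1.0, 1.0, (F, K))  # non-negative spectral templates
H = rng.gamma(1.0, 1.0, (K, N))  # non-negative temporal activations
V = W @ H                        # power spectrogram model: v_fn = [W H]_fn

# Source MDCT coefficients: s_fn ~ N(0, v_fn)
S = rng.normal(0.0, np.sqrt(V))
print(S.shape)                   # (F, N) synthetic MDCT coefficients for one source
```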
Source separation problem

- Observed variables: x = \{x_i(t)\}_{i,t}
- Latent variables: s = \{s_{j,fn}\}_{j,f,n}
- Parameters: \theta = \big\{ \{W_j, H_j\}_j, \{a_{ij}(t)\}_{i,j,t}, \{\sigma_i^2\}_i \big\}

Minimum mean square error estimation of the sources: \hat{s} = \mathbb{E}_{s \mid x; \theta}[s].
Maximum likelihood estimation of the parameters: \theta^\star = \arg\max_{\theta} p(x; \theta).

p(s | x; θ) is Gaussian, but it is parametrized by a full covariance matrix whose dimension is too high to be handled in practice → VEM algorithm.
Mean field approximation

q(s) = \prod_{j=1}^{J} \prod_{f=0}^{F-1} \prod_{n=0}^{N-1} q_{jfn}(s_{j,fn}).

Source estimate:

\hat{s}_j(t) = \sum_{f=0}^{F-1} \sum_{n=0}^{N-1} m_{j,fn}\, \psi_{fn}(t), \quad \text{with } m_{j,fn} = \langle s_{j,fn} \rangle_q;

\hat{y}_{ij}(t) = [a_{ij} \star \hat{s}_j](t) = \sum_{f=0}^{F-1} \sum_{n=0}^{N-1} m_{j,fn}\, g_{ij,fn}(t), \quad \text{with } g_{ij,fn}(t) = [a_{ij} \star \psi_{fn}](t).
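A rough sketch of the synthesis equations above for one source j and one channel i; the atoms psi, the means m and the filter a_ij are random placeholders, and only the linear reconstruction ŝ_j(t) = Σ_{f,n} m_{j,fn} ψ_{fn}(t) followed by the filtering ŷ_{ij}(t) = [a_{ij} ★ ŝ_j](t) is the point.

```python
import numpy as np

rng = np.random.default_rng(0)
F, N, T = 32, 24, 2048                     # made-up sizes

# psi[f, n, :] stands for the MDCT atom psi_fn(t); random placeholders here
psi = rng.normal(size=(F, N, T))
m = rng.normal(size=(F, N))                # variational means m_{j,fn} for one source j

# s_hat_j(t) = sum_{f,n} m_{j,fn} psi_fn(t)
s_hat = np.einsum('fn,fnt->t', m, psi)

# y_hat_ij(t) = [a_ij * s_hat_j](t) for one channel i (a_ij: a short mixing filter)
a_ij = rng.normal(size=128)
y_hat = np.convolve(s_hat, a_ij)[:T]       # truncated to T samples for simplicity
print(s_hat.shape, y_hat.shape)
```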
Complete-data log-likelihood

\ln p(x, s; \theta) = \ln p(x \mid s; \theta) + \ln p(s; \theta)

\stackrel{c}{=} -\frac{1}{2} \sum_{i=1}^{I} \sum_{t=0}^{T-1} \left[ \ln(\sigma_i^2) + \frac{1}{\sigma_i^2} \Big( x_i(t) - \sum_{j=1}^{J} y_{ij}(t) \Big)^2 \right] - \frac{1}{2} \sum_{j=1}^{J} \sum_{f=0}^{F-1} \sum_{n=0}^{N-1} \left[ \ln(v_{j,fn}) + \frac{s_{j,fn}^2}{v_{j,fn}} \right].

We recall that y_{ij}(t) = \sum_{f=0}^{F-1} \sum_{n=0}^{N-1} s_{j,fn}\, g_{ij,fn}(t) and v_{j,fn} = [W_j H_j]_{fn}.
E-step

Under the MF approximation, q_{jfn}^\star(s_{j,fn}) = \arg\max_{q_{jfn}} \mathcal{L}(q; \theta^{\mathrm{old}}) satisfies:

\ln q_{jfn}^\star(s_{j,fn}) = \Big\langle \ln p(x, s; \theta^{\mathrm{old}}) \Big\rangle_{\prod_{(j',f',n') \neq (j,f,n)} q_{j'f'n'}}.

We develop this expression, omitting all the terms that do not depend on s_{j,fn}, and we hope to recognize a standard distribution.
E-step

After computation we find that q_{jfn}^\star(s_{j,fn}) = \mathcal{N}(s_{j,fn}; m_{j,fn}, \gamma_{j,fn}), where:

\gamma_{j,fn} = \left( \frac{1}{v_{j,fn}} + \sum_{i=1}^{I} \frac{1}{\sigma_i^2} \sum_{t=0}^{T-1} g_{ij,fn}^2(t) \right)^{-1};

d_{j,fn} = \frac{m_{j,fn}}{v_{j,fn}} - \sum_{i=1}^{I} \frac{1}{\sigma_i^2} \sum_{t=0}^{T-1} g_{ij,fn}(t) \Big( x_i(t) - \sum_{j'=1}^{J} \hat{y}_{ij'}(t) \Big);

m_{j,fn} \leftarrow m_{j,fn} - \gamma_{j,fn}\, d_{j,fn}.

Note that the parameters m_{j,fn} have to be updated in turn.

Note: we can show that d_{j,fn} = \partial\big(-\mathcal{L}(q^\star;\theta)\big) / \partial m_{j,fn}. For the sake of computational efficiency, we can thus use a preconditioned conjugate gradient method instead of this coordinate-wise update.
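Below is one possible way to organise the coordinate-wise E-step updates above; it is a sketch with toy random placeholders for x, g, v and σ_i², not the author's implementation, and it ignores the conjugate-gradient variant mentioned in the note.

```python
import numpy as np

rng = np.random.default_rng(0)
I, J, F, N, T = 2, 2, 8, 5, 256                  # toy sizes (placeholders)

x = rng.normal(size=(I, T))                      # observed mixture channels
g = rng.normal(size=(I, J, F, N, T)) * 0.1       # g_{ij,fn}(t) = [a_ij * psi_fn](t)
v = rng.gamma(1.0, 1.0, (J, F, N))               # NMF variances v_{j,fn} = [W_j H_j]_{fn}
sigma2 = np.full(I, 0.01)                        # sensor noise variances
m = np.zeros((J, F, N))                          # variational means, initialised at 0

# Posterior variances gamma_{j,fn}: fixed once g and the parameters are fixed
gamma = 1.0 / (1.0 / v + np.einsum('i,ijfnt->jfn', 1.0 / sigma2, g ** 2))

for _ in range(5):                               # a few sweeps over all coordinates
    for j in range(J):
        for f in range(F):
            for n in range(N):
                y_hat = np.einsum('jfn,ijfnt->it', m, g)   # sum_j y_hat_ij(t)
                d = m[j, f, n] / v[j, f, n] - np.einsum(
                    'i,it,it->', 1.0 / sigma2, g[:, j, f, n, :], x - y_hat)
                m[j, f, n] -= gamma[j, f, n] * d           # m <- m - gamma * d
```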
M-step (NMF parameters example)

We now want to maximize \mathcal{L}(q^\star; \theta) w.r.t. the NMF parameters under a non-negativity constraint.

\mathcal{L}(q^\star; \theta) \stackrel{c}{=} \big\langle \ln p(x, s; \theta) \big\rangle_{q^\star}

\stackrel{c}{=} -\frac{1}{2} \sum_{j=1}^{J} \sum_{f=0}^{F-1} \sum_{n=0}^{N-1} \left[ \ln([W_j H_j]_{fn}) + \frac{m_{j,fn}^2 + \gamma_{j,fn}}{[W_j H_j]_{fn}} \right]

\stackrel{c}{=} -\frac{1}{2} \sum_{j=1}^{J} \sum_{f=0}^{F-1} \sum_{n=0}^{N-1} d_{IS}\big(m_{j,fn}^2 + \gamma_{j,fn},\, [W_j H_j]_{fn}\big),

where the Itakura-Saito (IS) divergence is given by d_{IS}(x, y) = \frac{x}{y} - \ln\frac{x}{y} - 1.

→ Compute an NMF on \hat{P}_j = \big[ m_{j,fn}^2 + \gamma_{j,fn} \big]_{fn} \in \mathbb{R}_+^{F \times N} with the IS divergence. It can be done with the standard multiplicative update rules.
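As an illustration of this M-step, here is a compact IS-NMF routine using a common form of the multiplicative update rules; the matrix P_j below is a random non-negative placeholder standing for [m_{j,fn}² + γ_{j,fn}]_{fn}.

```python
import numpy as np

def is_nmf(P, K=10, n_iter=200, eps=1e-12, seed=0):
    """NMF of a non-negative matrix P (F x N) under the Itakura-Saito divergence,
    with multiplicative updates."""
    rng = np.random.default_rng(seed)
    F, N = P.shape
    W = rng.random((F, K)) + eps
    H = rng.random((K, N)) + eps
    for _ in range(n_iter):
        V = W @ H + eps
        W *= ((P / V ** 2) @ H.T) / ((1.0 / V) @ H.T)
        V = W @ H + eps
        H *= (W.T @ (P / V ** 2)) / (W.T @ (1.0 / V))
    return W, H

# Example: factorize a placeholder posterior power estimate P_j to update W_j, H_j.
P_j = np.random.default_rng(1).gamma(1.0, 1.0, (64, 40))
W_j, H_j = is_nmf(P_j, K=5)
R = P_j / (W_j @ H_j)
print(np.mean(R - np.log(R) - 1))   # mean IS divergence, decreases with n_iter
```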
Further readings

- D. M. Blei et al., "Variational Inference: A Review for Statisticians," arXiv:1601.00670v4 [stat.CO], 2016.
- D. G. Tzikas et al., "The variational approximation for Bayesian inference," IEEE Signal Processing Magazine, 25(6), 131-146, 2008.
Appendix: calculus details for the E-step under the mean-field approximation

From [Tzikas et al., 2008]:

\mathcal{L}(q; \theta) = \int \prod_i q_i \, \ln\!\left( \frac{p(x, z; \theta)}{\prod_i q_i} \right) \prod_i dz_i

= \int \prod_i q_i \Big[ \ln p(x, z; \theta) - \sum_i \ln q_i \Big] \prod_i dz_i

= \int \prod_i q_i \, \ln p(x, z; \theta) \prod_i dz_i - \sum_i \int \prod_k q_k \, \ln q_i \prod_k dz_k

= \int \prod_i q_i \, \ln p(x, z; \theta) \prod_i dz_i - \sum_i \int q_i \ln q_i \, dz_i \qquad \text{(since } \textstyle\int q_k \, dz_k = 1\text{)}

= \int q_j \left[ \int \ln p(x, z; \theta) \prod_{i \neq j} q_i \, dz_i \right] dz_j - \int q_j \ln q_j \, dz_j - \sum_{i \neq j} \int q_i \ln q_i \, dz_i.
Let us define

\ln \tilde{p}(x, z_j; \theta) = \int \ln p(x, z; \theta) \prod_{i \neq j} q_i \, dz_i = \big\langle \ln p(x, z; \theta) \big\rangle_{\prod_{i \neq j} q_i}.

It follows that

\mathcal{L}(q; \theta) = \int q_j \ln \tilde{p}(x, z_j; \theta) \, dz_j - \int q_j \ln q_j \, dz_j - \sum_{i \neq j} \int q_i \ln q_i \, dz_i

= \int q_j \ln \frac{\tilde{p}(x, z_j; \theta)}{q_j} \, dz_j - \sum_{i \neq j} \int q_i \ln q_i \, dz_i

= -\mathrm{KL}(q_j \,\|\, \tilde{p}) - \sum_{i \neq j} \int q_i \ln q_i \, dz_i

= -\mathrm{KL}(q_j \,\|\, \tilde{p}) + \sum_{i \neq j} H(q_i),

which is decomposition (4).