Variational inference

Simon Leglaive, Télécom ParisTech, CNRS LTCI, Université Paris Saclay. November 18, 2016, Télécom ParisTech, Paris, France.

Outline

Introduction: probabilistic model; problem; log-likelihood decomposition; EM algorithm.
Approximate inference: variational approximation; mean-field approximation; optimal variational distribution under the MF approximation; relation with Gibbs sampling.
Audio source separation example: mixing model; source model; source separation problem.

Probabilistic model

Let us consider a probabilistic model where x denotes the set of observed random variables, z denotes the set of latent/hidden random variables, and θ denotes the set of deterministic parameters.

For example, in audio source separation: the latent variables are the source variables, the observed variables are the mixture signals, and θ gathers the source and mixing parameters (for example, NMF parameters and mixing filters).

Note: in a Bayesian framework, the latent variables include the parameters.

Problem

1. Definition of the model: how are the data generated from the latent unobserved variables?
2. Inference: what are the values of the latent variables that generated the data?

Inference. We are naturally interested in computing the posterior $p(z \mid x; \theta^\star)$:
maximum likelihood estimation: $\theta^\star = \arg\max_\theta p(x; \theta)$;
minimum mean square error (MMSE) estimation: $\hat{z} = \mathrm{E}_{z \mid x; \theta^\star}[z]$.

If we can compute the posterior distribution, maximum likelihood estimation can be carried out with the Expectation-Maximization (EM) algorithm.

Log-likelihood decomposition

Let q be a probability density function (pdf) over z. The log-likelihood can then be decomposed as:

$$\ln p(x; \theta) = \mathcal{L}(q; \theta) + \mathrm{KL}\big(q \,\|\, p(z \mid x; \theta)\big), \qquad (1)$$

where $\mathcal{L}(q;\theta)$ is the variational free energy and $\mathrm{KL}$ denotes the Kullback-Leibler divergence:

$$\mathcal{L}(q; \theta) = \underbrace{\langle \ln p(x, z; \theta) \rangle_q}_{E(q;\theta)} \underbrace{-\, \langle \ln q(z) \rangle_q}_{H(q):\ \text{entropy}}; \qquad (2)$$

$$\mathrm{KL}\big(q \,\|\, p(z \mid x; \theta)\big) = -\left\langle \ln \frac{p(z \mid x; \theta)}{q(z)} \right\rangle_q; \qquad (3)$$

with $\langle f(z) \rangle_q = \int f(z)\, q(z)\, dz$.

As $\mathrm{KL}(\cdot \,\|\, \cdot) \ge 0$, $\mathcal{L}(q; \theta)$ lower-bounds the log-likelihood.

Note: the variational free energy is called the evidence lower bound (ELBO) in Bayesian settings.
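As a quick numerical sanity check of decomposition (1), the short sketch below (a made-up discrete toy example, not taken from the slides) builds a joint p(x, z) over a binary latent variable and verifies that ln p(x; θ) equals the sum of the variational free energy and the KL divergence for an arbitrary q.

```python
import numpy as np

# Toy check of ln p(x) = L(q) + KL(q || p(z|x)) for a binary latent z.
# The joint values below are arbitrary illustrative numbers.
p_xz = np.array([0.3, 0.1])                      # p(x, z=0), p(x, z=1) for the observed x
p_x = p_xz.sum()                                 # evidence p(x)
posterior = p_xz / p_x                           # p(z | x)

q = np.array([0.6, 0.4])                         # any variational distribution over z
elbo = np.sum(q * (np.log(p_xz) - np.log(q)))    # L(q) = <ln p(x, z) - ln q(z)>_q
kl = np.sum(q * (np.log(q) - np.log(posterior))) # KL(q || p(z|x))

print(np.log(p_x), elbo + kl)                    # both print the same value
```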

EM algorithm

Maximize $\mathcal{L}(q; \theta)$ with respect to q at the E-step and with respect to θ at the M-step.

E-step: from (1),
$$q^\star(z) = \arg\max_q \mathcal{L}(q; \theta^{\text{old}}) = p(z \mid x; \theta^{\text{old}}).$$
Then, from (2),
$$\mathcal{L}(q^\star; \theta) = \underbrace{\big\langle \overbrace{\ln p(x, z; \theta)}^{\text{complete-data log-likelihood}} \big\rangle_{p(z \mid x;\, \theta^{\text{old}})}}_{Q(\theta,\, \theta^{\text{old}})\ \text{in the standard EM formulation}} + \underbrace{H\big(p(z \mid x; \theta^{\text{old}})\big)}_{\text{constant w.r.t. } \theta}.$$

M-step:
$$\theta^{\text{new}} = \arg\max_\theta Q(\theta, \theta^{\text{old}}).$$
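To make the E/M alternation concrete, here is a minimal sketch of EM for a two-component 1-D Gaussian mixture, a standard textbook case rather than the audio model discussed later; the synthetic data and the initialization are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic 1-D data from two Gaussian clusters (illustrative only)
x = np.concatenate([rng.normal(-2.0, 1.0, 200), rng.normal(3.0, 0.5, 200)])

# theta = (mixing weights, means, variances), crude initialization
weights, means, variances = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])

def gaussian_pdf(x, m, v):
    return np.exp(-0.5 * (x - m) ** 2 / v) / np.sqrt(2.0 * np.pi * v)

for _ in range(100):
    # E-step: q*(z) = p(z | x; theta_old), i.e. the responsibilities
    resp = weights * gaussian_pdf(x[:, None], means, variances)
    resp /= resp.sum(axis=1, keepdims=True)
    # M-step: theta_new = arg max Q(theta, theta_old)
    Nk = resp.sum(axis=0)
    weights = Nk / len(x)
    means = (resp * x[:, None]).sum(axis=0) / Nk
    variances = (resp * (x[:, None] - means) ** 2).sum(axis=0) / Nk

print(weights, means, variances)   # approximately recovers the generating parameters
```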

What if we cannot compute the posterior distribution?


Approximate inference

Stochastic methods are based on sampling (Markov chain Monte Carlo (MCMC) methods, etc.); they are computationally expensive but converge to the true posterior.

Deterministic methods are based on optimization (variational methods, etc.); they are computationally cheaper but not exact.

[Figure: sketch of a true posterior approximated by Monte Carlo samples and by a variational distribution.]

Variational approximation

We want to find $q \in \mathcal{F}$ (a variational family) which approximates $p(z \mid x; \theta)$. We take the KL divergence as a measure of fit, but we cannot directly minimize it. However, from (1) we have:

$$\mathrm{KL}\big(q \,\|\, p(z \mid x; \theta)\big) = \ln p(x; \theta) - \mathcal{L}(q; \theta).$$

Variational EM algorithm:

E-step: $q^\star = \arg\min_{q \in \mathcal{F}} \mathrm{KL}\big(q \,\|\, p(z \mid x; \theta^{\text{old}})\big) = \arg\max_{q \in \mathcal{F}} \mathcal{L}(q; \theta^{\text{old}})$;

M-step: $\theta^{\text{new}} = \arg\max_\theta \mathcal{L}(q^\star; \theta)$.

Mean Field (MF) approximation

$\mathcal{F}$ is the set of pdfs over z that factorize as $q(z) = \prod_j q_j(z_j)$.

[Figure: true posterior vs. its mean-field approximation.]

We drop the posterior dependencies between the latent variables. Generally, the true posterior does not belong to this variational family. The approximation is more general than it seems: latent variables can be grouped, and only the distribution of each group factorizes.

We want to optimize $\mathcal{L}(q; \theta^{\text{old}})$ for this factorized distribution.

Optimization under the MF approximation

We can show that (see the appendix for calculus details):

$$\mathcal{L}(q; \theta) = -\mathrm{KL}\big(q_j \,\|\, \tilde{p}(x, z_j; \theta)\big) + \sum_{i \neq j} H(q_i), \qquad (4)$$

where $\ln \tilde{p}(x, z_j; \theta) = \langle \ln p(x, z; \theta) \rangle_{\prod_{i \neq j} q_i}$.

Coordinate ascent inference: optimizing with respect to $q_j$ with $\{q_i\}_{i \neq j}$ fixed,

$$q_j^\star(z_j) = \arg\max_{q_j} \mathcal{L}(q; \theta^{\text{old}}) = \arg\min_{q_j} \mathrm{KL}\big(q_j \,\|\, \tilde{p}(x, z_j; \theta^{\text{old}})\big),$$

that is,

$$\ln q_j^\star(z_j) = \langle \ln p(x, z; \theta^{\text{old}}) \rangle_{\prod_{i \neq j} q_i} + \text{constant}.$$

We hope to recognize a standard distribution; otherwise we normalize. The solutions are coupled, so we initialize the factors and then update them cyclically.
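A minimal sketch of these cyclic coordinate updates, on the classic toy problem of approximating a correlated 2-D Gaussian "posterior" by a factorized q(z1) q(z2); the target and its numbers are illustrative assumptions, not part of the slides.

```python
import numpy as np

# Coordinate ascent mean-field updates for a correlated 2-D Gaussian target:
# p(z) = N(z; mu, Lambda^{-1}), variational family q(z) = q1(z1) q2(z2).
mu = np.array([1.0, -1.0])
Lam = np.array([[2.0, 0.9],
                [0.9, 1.0]])                 # precision matrix of the target

m = np.zeros(2)                              # variational means (arbitrary init)
var = 1.0 / np.diag(Lam)                     # optimal factor variances are fixed: 1 / Lambda_jj

for _ in range(50):                          # cyclic updates until convergence
    m[0] = mu[0] - (Lam[0, 1] / Lam[0, 0]) * (m[1] - mu[1])
    m[1] = mu[1] - (Lam[1, 0] / Lam[1, 1]) * (m[0] - mu[0])

print(m, var)   # the means converge to mu
```

The fixed factor variances 1/Λ_jj are smaller than the true marginal variances, a well-known consequence of dropping the posterior dependencies.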

Relation with Gibbs sampling

Let us consider a Bayesian setting without deterministic parameters θ.

Variational Bayesian inference:
$$q_j^\star(z_j) \propto \exp\Big[ \langle \ln p(x, z) \rangle_{\prod_{i \neq j} q_i} \Big].$$
But $p(x, z) = p(z_j \mid x, z_{\setminus z_j})\, p(x, z_{\setminus z_j})$, where $z_{\setminus z_j}$ denotes z except $z_j$, so
$$q_j^\star(z_j) \propto \exp\Big[ \langle \ln p(z_j \mid x, z_{\setminus z_j}) \rangle_{\prod_{i \neq j} q_i} \Big].$$

Gibbs sampling: we want to sample from $p(z \mid x)$ by successively sampling $z_j$ from $p(z_j \mid x, z_{\setminus z_j})$.

Hybrid approaches alternating sampling and optimization are also possible.
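For comparison, the sketch below runs a Gibbs sampler on the same 2-D Gaussian toy target used above: each step samples z_j from its full conditional instead of updating a variational factor (again an illustrative assumption, not an example from the slides).

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, -1.0])
Lam = np.array([[2.0, 0.9],
                [0.9, 1.0]])                 # same target as in the mean-field sketch

z = np.zeros(2)
samples = []
for _ in range(5000):
    # Sample z1 from p(z1 | z2), then z2 from p(z2 | z1)
    z[0] = rng.normal(mu[0] - (Lam[0, 1] / Lam[0, 0]) * (z[1] - mu[1]), 1.0 / np.sqrt(Lam[0, 0]))
    z[1] = rng.normal(mu[1] - (Lam[1, 0] / Lam[1, 1]) * (z[0] - mu[0]), 1.0 / np.sqrt(Lam[1, 1]))
    samples.append(z.copy())

samples = np.array(samples)
print(samples.mean(axis=0))                  # approaches mu
print(np.cov(samples.T))                     # approaches the full covariance Lambda^{-1}
```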


Audio source separation example

Mixing model:

$$x_i(t) = \sum_{j=1}^{J} y_{ij}(t) + b_i(t), \qquad y_{ij}(t) = [a_{ij} \star s_j](t), \qquad s_j(t) = \sum_{f=0}^{F-1} \sum_{n=0}^{N-1} s_{j,fn}\, \psi_{fn}(t),$$

where $\psi_{fn}(t)$ is a Modified Discrete Cosine Transform (MDCT) atom and the sensor noise is $b_i(t) \sim \mathcal{N}(0, \sigma_i^2)$.

Source model

Non-negative matrix factorization (NMF) of the short-term power spectral density: $s_{j,fn} \sim \mathcal{N}(0, v_{j,fn})$ with $v_{j,fn} = [W_j H_j]_{fn}$.

[Figure: example spectrogram decomposition into spectral templates and temporal activations; axes: frequency (kHz), time, amplitude (dB).]

Source separation problem

Observed variables: $x = \{x_i(t)\}_{i,t}$.
Latent variables: $s = \{s_{j,fn}\}_{j,f,n}$.
Parameters: $\theta = \big\{ \{W_j, H_j\}_j, \{a_{ij}(t)\}_{i,j,t}, \{\sigma_i^2\}_i \big\}$.

Minimum mean square error estimation of the sources: $\hat{s} = \mathrm{E}_{s \mid x; \theta^\star}[s]$.
Maximum likelihood estimation of the parameters: $\theta^\star = \arg\max_\theta p(x; \theta)$.

$p(s \mid x; \theta)$ is Gaussian, but it is parametrized by a full covariance matrix whose dimension is too high to be handled in practice, hence the variational EM (VEM) algorithm.

Mean field approximation

$$q(s) = \prod_{j=1}^{J} \prod_{f=0}^{F-1} \prod_{n=0}^{N-1} q_{jfn}(s_{j,fn}).$$

Source estimate:

$$m_{j,fn} = \langle s_{j,fn} \rangle_q; \qquad \hat{s}_j(t) = \sum_{f=0}^{F-1} \sum_{n=0}^{N-1} m_{j,fn}\, \psi_{fn}(t);$$
$$\hat{y}_{ij}(t) = [a_{ij} \star \hat{s}_j](t) = \sum_{f=0}^{F-1} \sum_{n=0}^{N-1} m_{j,fn}\, g_{ij,fn}(t), \qquad g_{ij,fn}(t) = [a_{ij} \star \psi_{fn}](t).$$

Complete-data log-likelihood

$$\ln p(x, s; \theta) = \ln p(x \mid s; \theta) + \ln p(s; \theta)$$
$$\stackrel{c}{=} -\frac{1}{2} \sum_{i=1}^{I} \sum_{t=0}^{T-1} \left[ \ln(\sigma_i^2) + \frac{1}{\sigma_i^2} \Big( x_i(t) - \sum_{j=1}^{J} y_{ij}(t) \Big)^2 \right] - \frac{1}{2} \sum_{j=1}^{J} \sum_{f=0}^{F-1} \sum_{n=0}^{N-1} \left[ \ln(v_{j,fn}) + \frac{s_{j,fn}^2}{v_{j,fn}} \right].$$

We recall that $y_{ij}(t) = \sum_{f=0}^{F-1} \sum_{n=0}^{N-1} s_{j,fn}\, g_{ij,fn}(t)$ and $v_{j,fn} = [W_j H_j]_{fn}$ ($\stackrel{c}{=}$ denotes equality up to an additive constant).

E-step

Under the MF approximation, $q_{jfn}^\star(s_{j,fn}) = \arg\max_{q_{jfn}} \mathcal{L}(q; \theta^{\text{old}})$ satisfies:

$$\ln q_{jfn}^\star(s_{j,fn}) = \big\langle \ln p(x, s; \theta^{\text{old}}) \big\rangle_{\prod_{(j',f',n') \neq (j,f,n)} q_{j'f'n'}}.$$

We develop this expression, omitting all the terms that do not depend on $s_{j,fn}$, and we hope to recognize a standard distribution.

E-step

After computation, we find that $q_{jfn}^\star(s_{j,fn}) = \mathcal{N}(s_{j,fn}; m_{j,fn}, \gamma_{j,fn})$, where:

$$\gamma_{j,fn} = \left( \frac{1}{v_{j,fn}} + \sum_{i=1}^{I} \frac{1}{\sigma_i^2} \sum_{t=0}^{T-1} g_{ij,fn}^2(t) \right)^{-1};$$
$$d_{j,fn} = \frac{m_{j,fn}}{v_{j,fn}} - \sum_{i=1}^{I} \frac{1}{\sigma_i^2} \sum_{t=0}^{T-1} g_{ij,fn}(t) \left( x_i(t) - \sum_{j'=1}^{J} \hat{y}_{ij'}(t) \right);$$
$$m_{j,fn} \leftarrow m_{j,fn} - \gamma_{j,fn}\, d_{j,fn}.$$

Note that the parameters $m_{j,fn}$ have to be updated in turn.

Note: we can show that $d_{j,fn} = -\partial \mathcal{L}(q^\star; \theta) / \partial m_{j,fn}$. For the sake of computational efficiency, we can thus use a preconditioned conjugate gradient method instead of this coordinate-wise update.
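The sketch below illustrates these coordinate-wise mean updates on a heavily simplified version of the model: a single sensor (I = 1), no mixing filters, and a random synthesis dictionary G standing in for the MDCT atoms, so that a column g_k(t) plays the role of g_{ij,fn}(t). All sizes, names, and the synthetic data are hypothetical; the point is only to show the update of m_k by m_k - γ_k d_k in action.

```python
import numpy as np

rng = np.random.default_rng(0)
T, K = 256, 32                                   # signal length, number of latent coefficients
G = rng.normal(size=(T, K)) / np.sqrt(T)         # columns g_k(t): random synthesis atoms (stand-in for MDCT)
v = rng.gamma(2.0, 1.0, size=K)                  # prior variances v_k (played by the NMF model in the slides)
sigma2 = 0.01                                    # sensor-noise variance

s_true = rng.normal(0.0, np.sqrt(v))             # latent coefficients
x = G @ s_true + rng.normal(0.0, np.sqrt(sigma2), size=T)

# Mean-field posterior q(s) = prod_k N(s_k; m_k, gamma_k)
gamma = 1.0 / (1.0 / v + (G ** 2).sum(axis=0) / sigma2)   # factor variances: fixed given theta
m = np.zeros(K)

for _ in range(50):                              # cyclic coordinate updates of the means
    for k in range(K):
        y_hat = G @ m                            # current reconstruction using all factors
        d_k = m[k] / v[k] - (G[:, k] @ (x - y_hat)) / sigma2
        m[k] -= gamma[k] * d_k                   # m_k <- m_k - gamma_k * d_k

print(np.corrcoef(m, s_true)[0, 1])              # the posterior means track the true coefficients
```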

M-step (NMF parameters example)

We now want to maximize $\mathcal{L}(q^\star; \theta)$ with respect to the NMF parameters under a non-negativity constraint.

$$\mathcal{L}(q^\star; \theta) \stackrel{c}{=} \langle \ln p(x, s; \theta) \rangle_{q^\star}$$
$$\stackrel{c}{=} -\frac{1}{2} \sum_{j=1}^{J} \sum_{f=0}^{F-1} \sum_{n=0}^{N-1} \left[ \ln\big([W_j H_j]_{fn}\big) + \frac{m_{j,fn}^2 + \gamma_{j,fn}}{[W_j H_j]_{fn}} \right]$$
$$\stackrel{c}{=} -\frac{1}{2} \sum_{j=1}^{J} \sum_{f=0}^{F-1} \sum_{n=0}^{N-1} d_{IS}\big(m_{j,fn}^2 + \gamma_{j,fn},\, [W_j H_j]_{fn}\big).$$

The Itakura-Saito (IS) divergence is given by $d_{IS}(x, y) = \frac{x}{y} - \ln\frac{x}{y} - 1$.

We thus compute an NMF of $\hat{P}_j = \big[ m_{j,fn}^2 + \gamma_{j,fn} \big]_{fn} \in \mathbb{R}_+^{F \times N}$ with the IS divergence. It can be done with the standard multiplicative update rules.
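As a sketch of the standard multiplicative update rules mentioned above, here is IS-NMF applied to a random positive matrix standing in for P̂_j = [m²_{j,fn} + γ_{j,fn}]; the matrix sizes, the data, and the number of iterations are illustrative assumptions, not values from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
F, N, K = 64, 100, 4
P = rng.gamma(1.0, 1.0, size=(F, N))      # positive matrix standing in for m^2 + gamma
eps = 1e-12                               # avoids divisions by zero in W @ H

W = rng.random((F, K)) + 0.1              # non-negative initialization
H = rng.random((K, N)) + 0.1

def is_div(P, V):
    R = P / V
    return np.sum(R - np.log(R) - 1.0)    # Itakura-Saito divergence d_IS(P, V)

for _ in range(200):
    V = W @ H + eps
    W *= ((P / V ** 2) @ H.T) / ((1.0 / V) @ H.T)     # multiplicative update for W
    V = W @ H + eps
    H *= (W.T @ (P / V ** 2)) / (W.T @ (1.0 / V))     # multiplicative update for H

print(is_div(P, W @ H + eps))             # the divergence typically decreases over the iterations
```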

Further readings

D. M. Blei et al., "Variational Inference: A Review for Statisticians," arXiv:1601.00670v4 [stat.CO], 2016.

D. G. Tzikas et al., "The Variational Approximation for Bayesian Inference," IEEE Signal Processing Magazine, 25(6), 131-146, 2008.

Appendix: calculus details for the E-step under the mean-field approximation

From [Tzikas et al., 2008], writing $q(z) = \prod_i q_i(z_i)$:

$$\mathcal{L}(q; \theta) = \int \prod_i q_i \, \ln\!\left( \frac{p(x, z; \theta)}{\prod_i q_i} \right) \prod_i dz_i$$
$$= \int \prod_i q_i \left[ \ln p(x, z; \theta) - \sum_i \ln q_i \right] \prod_i dz_i$$
$$= \int \prod_i q_i \, \ln p(x, z; \theta) \prod_i dz_i - \sum_i \int \prod_k q_k \, \ln q_i \prod_k dz_k$$
$$= \int \prod_i q_i \, \ln p(x, z; \theta) \prod_i dz_i - \sum_i \int q_i \ln q_i \, dz_i \qquad \text{as } \int q_k \, dz_k = 1$$
$$= \int q_j \left[ \int \ln p(x, z; \theta) \prod_{i \neq j} q_i \, dz_i \right] dz_j - \int q_j \ln q_j \, dz_j - \sum_{i \neq j} \int q_i \ln q_i \, dz_i.$$

Let us define
$$\ln \tilde{p}(x, z_j; \theta) = \int \ln p(x, z; \theta) \prod_{i \neq j} q_i \, dz_i = \langle \ln p(x, z; \theta) \rangle_{\prod_{i \neq j} q_i}.$$
It follows that
$$\mathcal{L}(q; \theta) = \int q_j \ln \tilde{p}(x, z_j; \theta) \, dz_j - \int q_j \ln q_j \, dz_j - \sum_{i \neq j} \int q_i \ln q_i \, dz_i$$
$$= \int q_j \ln \frac{\tilde{p}(x, z_j; \theta)}{q_j} \, dz_j - \sum_{i \neq j} \int q_i \ln q_i \, dz_i$$
$$= -\mathrm{KL}\big(q_j \,\|\, \tilde{p}\big) - \sum_{i \neq j} \int q_i \ln q_i \, dz_i.$$