Variational Inference. Sargur Srihari

Variational Inference
Sargur Srihari
srihari@cedar.buffalo.edu

Plan of Discussion
Functionals
Calculus of Variations
Maximizing a Functional
Finding an Approximation to a Posterior by Minimizing KL Divergence
Factorized Distributions
Example of a Bivariate Gaussian

Function versus Functional
A function takes the value of a variable as input and returns the function value as output.
A functional takes a function as input and returns the functional value as output.
An example is the entropy
H[p] = -∫ p(x) ln p(x) dx

Calculus of Variations
Function: input x is a value, returns a value y(x)
Functional: input is a function y(x), returns a value F[y(x)]
Standard calculus concerns derivatives of functions:
how the output changes in response to small changes in the input value.
Variational calculus concerns functional derivatives:
how the output changes in response to infinitesimal changes in the input function.
Invented by Leonhard Euler (Swiss, 1707-1783)

Functional Derivative
Defined by considering how the value of a functional F[y] changes when the function y(x) is changed to y(x) + εη(x), where η(x) is an arbitrary function of x.

Maximizing a Functional
A common problem in conventional calculus is to find a value x that maximizes a function y(x).
In the calculus of variations we instead find a function y(x) that maximizes a functional F[y], e.g., a distribution that maximizes entropy.
The calculus of variations can be used to show that:
1. The shortest path between two points is a straight line
2. The distribution p(x) that maximizes the entropy H[p] = -∫ p(x) ln p(x) dx, for a fixed mean and variance, is the Gaussian
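
As a quick numeric illustration of the second claim (not part of the original slides), the sketch below compares the differential entropy of three unit-variance densities on a grid; the Laplace and uniform alternatives are arbitrary choices used only for comparison.

```python
# Numeric illustration: among densities with the same variance, the Gaussian
# attains the largest differential entropy H[p] = -∫ p(x) ln p(x) dx.
import numpy as np

x = np.linspace(-20, 20, 20001)
dx = x[1] - x[0]

def entropy(pdf):
    p = pdf(x)
    return -np.sum(p * np.log(p + 1e-300)) * dx   # Riemann-sum approximation

gauss   = lambda x: np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)           # variance 1
laplace = lambda x: np.exp(-np.abs(x) * np.sqrt(2)) / np.sqrt(2)     # variance 1
uniform = lambda x: np.where(np.abs(x) <= np.sqrt(3),
                             1 / (2 * np.sqrt(3)), 1e-300)           # variance 1

print(entropy(gauss))    # ~1.419 = 0.5 * ln(2*pi*e), the maximum
print(entropy(laplace))  # ~1.347
print(entropy(uniform))  # ~1.242
```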

Variational Methods
There is nothing intrinsically approximate about variational methods, but they naturally lend themselves to approximation by restricting the range of functions over which we optimize, e.g., to
quadratic forms,
linear combinations of fixed basis functions, in which only the coefficients of the linear combination vary, or
factorized distributions.
This is the method used here for probabilistic inference.

Application to Probabilistic Inference
Consider a fully Bayesian model in which all parameters are given prior distributions.
The model may have latent variables as well as parameters.
Denote the set of all parameters and latent variables by Z, and the set of observed variables by X.
Given N i.i.d. samples, X = {x_1,...,x_N} and Z = {z_1,...,z_N}, the probabilistic model specifies the posterior distribution p(Z|X) as well as the model evidence p(X).

Lower Bound on p(X) using KL Divergence
Decomposition of the log marginal probability:
ln p(X) = L(q) + KL(q||p)
where
L(q) = ∫ q(Z) ln{ p(X,Z) / q(Z) } dZ       (the functional we wish to maximize)
KL(q||p) = -∫ q(Z) ln{ p(Z|X) / q(Z) } dZ  (KL divergence between the proposed q and p, the desired posterior of interest, to be minimized)
This is also applicable to discrete distributions by replacing integrations with summations.
Observations on the optimization:
L(q) is a lower bound on ln p(X).
Maximizing the lower bound L(q) with respect to the distribution q(Z) is equivalent to minimizing the KL divergence.
The KL divergence vanishes when q(Z) = p(Z|X).
Plan: we seek the distribution q(Z) for which L(q) is largest. Since the true posterior is intractable, we consider a restricted family for q(Z) and seek the member of this family for which the KL divergence is minimized.
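
As a sanity check on this decomposition (not from the slides), the following sketch uses a tiny discrete model with made-up joint probabilities and an arbitrary q(Z), and verifies numerically that L(q) + KL(q||p) = ln p(X) and that L(q) is a lower bound.

```python
# Verify ln p(X) = L(q) + KL(q || p(Z|X)) for a small discrete model.
# The joint table p_xz and the trial distribution q are hypothetical numbers.
import numpy as np

p_xz = np.array([0.10, 0.25, 0.15])         # p(x, z) for the observed x, z = 0, 1, 2
p_x = p_xz.sum()                            # model evidence p(x)
p_z_given_x = p_xz / p_x                    # true posterior p(z | x)

q = np.array([0.5, 0.3, 0.2])               # any normalized trial distribution q(z)

L_q = np.sum(q * np.log(p_xz / q))          # lower bound L(q) = sum_z q(z) ln{p(x,z)/q(z)}
KL_q_p = np.sum(q * np.log(q / p_z_given_x))  # KL(q || p(z|x))

print(np.log(p_x), L_q + KL_q_p)            # identical up to rounding
print(L_q <= np.log(p_x))                   # True: L(q) is a lower bound on ln p(x)
```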

Approximating Family
Typically the true posterior p(Z|X) is intractable.
Pick a family of parametric distributions q(Z|ω) governed by parameters ω.
The lower bound then becomes a function of ω, and we can exploit standard nonlinear optimization techniques to determine the optimal parameters.
Use q with the fitted parameters as a proxy for the posterior, e.g., to make predictions about future data, or to examine the posterior over hidden variables.

Variational Approximation Example
Use a parametric distribution q(Z|ω). The lower bound L(q) is then a function of ω, and we can use standard nonlinear optimization to determine the optimal ω.
[Figure: original distribution together with its Laplace approximation and its variational approximation; the negative logarithms are also shown. The variational distribution is Gaussian, optimized with respect to its mean and variance.]

Factorized Distribution Approach
One way to restrict the family of distributions: partition the elements of Z into disjoint groups Z_i, i = 1,...,M, so that
q(Z) = ∏_{i=1}^{M} q_i(Z_i)
No further assumptions are made about the distribution; in particular, there are no restrictions on the factors q_i(Z_i).
The method is called mean field theory in physics, and also mean field variational inference.

Factorized Distribution Approach
Among factorized distributions q(Z) = ∏_{i=1}^{M} q_i(Z_i), we seek the one for which the lower bound L(q) is largest.
Substituting the factorization, and denoting q_j(Z_j) simply by q_j:
L(q) = ∫ q(Z) ln{ p(X,Z) / q(Z) } dZ
     = ∫ ∏_i q_i { ln p(X,Z) - ∑_i ln q_i } dZ
     = ∫ q_j { ∫ ln p(X,Z) ∏_{i≠j} q_i dZ_i } dZ_j - ∫ q_j ln q_j dZ_j + const
     = ∫ q_j ln p̃(X,Z_j) dZ_j - ∫ q_j ln q_j dZ_j + const
where we have defined a new distribution p̃(X,Z_j) by
ln p̃(X,Z_j) = E_{i≠j}[ln p(X,Z)] + const
and E_{i≠j}[...] denotes an expectation with respect to the q distributions over all variables Z_i with i≠j:
E_{i≠j}[ln p(X,Z)] = ∫ ln p(X,Z) ∏_{i≠j} q_i dZ_i
The last line above is, up to a constant, the negative KL divergence between q_j(Z_j) and p̃(X,Z_j).

Maximizing L(q) with the Factorization
Since we have seen that
L(q) = ∫ q_j ln p̃(X,Z_j) dZ_j - ∫ q_j ln q_j dZ_j + const
maximizing L(q) with respect to q_j is equivalent to minimizing the Kullback-Leibler divergence between q_j(Z_j) and p̃(X,Z_j).
The minimum occurs when q_j(Z_j) = p̃(X,Z_j).
Thus we obtain a general expression for the optimal solution q_j*(Z_j):
ln q_j*(Z_j) = E_{i≠j}[ln p(X,Z)] + const

Optimal Solution Interpretation
The optimal solution is
ln q_j*(Z_j) = E_{i≠j}[ln p(X,Z)] + const
This provides the basis for applications of variational methods:
the log of the optimal solution for factor q_j is obtained by taking the log of the joint distribution over all hidden and visible variables, and then taking the expectation with respect to all of the other factors {q_i} for i≠j.

Normalized Solution
The additive constant in
ln q_j*(Z_j) = E_{i≠j}[ln p(X,Z)] + const
is set by normalizing the distribution q_j*(Z_j).
Taking the exponential of both sides and normalizing:
q_j*(Z_j) = exp( E_{i≠j}[ln p(X,Z)] ) / ∫ exp( E_{i≠j}[ln p(X,Z)] ) dZ_j
In practice it is more convenient to work with the form ln q_j*(Z_j) = E_{i≠j}[ln p(X,Z)] + const.

Obtaining an Explicit Solution
The set of equations
ln q_j*(Z_j) = E_{i≠j}[ln p(X,Z)] + const,  j = 1,...,M
represents a set of consistency conditions for the maximum of the lower bound; they do not provide an explicit solution.
A solution is obtained iteratively: first initialize all of the factors q_i(Z_i) appropriately, then cycle through the factors, replacing each in turn with a revised estimate given by E_{i≠j}[ln p(X,Z)] + const evaluated using the current estimates for all of the other factors.
Convergence is guaranteed because the bound is convex with respect to each of the factors q_i(Z_i).

Properties of Factorized Approximations
Our approach to variational inference is based on a factorized approximation to the true posterior distribution.
We now consider approximating a Gaussian distribution using a factorized Gaussian.
This will provide insight into the types of inaccuracy introduced by factorized approximations.

Example of Factorized Approximation
The true posterior distribution p(z) over two correlated variables z = {z_1, z_2} is Gaussian:
p(z) = N(z | µ, Λ^-1)
with mean and precision
µ = (µ_1, µ_2)ᵀ
Λ = [ Λ_11  Λ_12
      Λ_21  Λ_22 ]
We wish to approximate p(z) using a factorized Gaussian of the form q(z) = q_1(z_1) q_2(z_2).

Factor Derivation
Using the optimal solution for factorized distributions, ln q_j*(Z_j) = E_{i≠j}[ln p(X,Z)] + const, we find an expression for the optimal factor q_1*(z_1).
We only need the terms that have a functional dependence on z_1; the other terms are absorbed into the normalization constant.
With p(z) = N(z | µ, Λ^-1):
ln q_1*(z_1) = E_{z_2}[ln p(z)] + const
             = E_{z_2}[ -(1/2)(z_1 - µ_1)² Λ_11 - (z_1 - µ_1) Λ_12 (z_2 - µ_2) ] + const
             = -(1/2) z_1² Λ_11 + z_1 µ_1 Λ_11 - z_1 Λ_12 (E[z_2] - µ_2) + const
Since the right-hand side is a quadratic function of z_1, we can identify q_1*(z_1) as a Gaussian.

Factorized Approximation of a Gaussian
We did not assume q_1(z_1) to be Gaussian; it was derived by variational optimization of the KL divergence over all possible distributions.
We can identify the mean and precision of each Gaussian factor:
q_1*(z_1) = N(z_1 | m_1, Λ_11^-1)  where  m_1 = µ_1 - Λ_11^-1 Λ_12 (E[z_2] - µ_2)
q_2*(z_2) = N(z_2 | m_2, Λ_22^-1)  where  m_2 = µ_2 - Λ_22^-1 Λ_21 (E[z_1] - µ_1)
The solutions are coupled: q_1*(z_1) depends on an expectation computed with respect to q_2*(z_2), and vice versa.
They are treated as re-estimation equations, cycling through them in turn.
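
The following is a minimal sketch of this cycling for assumed example values of µ and Λ (the numbers are illustrative, not from the slides); the factor means converge to the true mean, while the factor variances stay fixed at Λ_11^-1 and Λ_22^-1.

```python
# Cycle the coupled re-estimation equations
#   m1 = mu1 - Lam12*(E[z2]-mu2)/Lam11,  m2 = mu2 - Lam21*(E[z1]-mu1)/Lam22
# for the factorized Gaussian approximation q(z) = q1(z1) q2(z2).
import numpy as np

mu  = np.array([1.0, -1.0])              # true mean (hypothetical example values)
Lam = np.array([[2.0, 1.2],
                [1.2, 3.0]])             # true precision matrix (hypothetical example values)

m1, m2 = 0.0, 0.0                        # initial estimates of E[z1] and E[z2]
for _ in range(50):                      # cycle the two re-estimation equations in turn
    m1 = mu[0] - Lam[0, 1] * (m2 - mu[1]) / Lam[0, 0]
    m2 = mu[1] - Lam[1, 0] * (m1 - mu[0]) / Lam[1, 1]

print(m1, m2)                            # -> approximately (1.0, -1.0): the true mean is recovered
# The factor variances are fixed at 1/Lam[0,0] and 1/Lam[1,1], independent of the iteration.
```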

Result of Variational Approximation
[Figure: green contours show the correlated Gaussian p(z) at 1, 2 and 3 standard deviations; red contours show the factorized approximation q(z) over the same variables, given by the product of two independent univariate Gaussians.]
The mean is correctly captured.
The variance of q(z) is controlled by the direction of smallest variance of p(z); the variance along the orthogonal direction is under-estimated.
In general, a factorized variational approximation gives results that are too compact.

Two Alternative Forms of KL Divergence
[Figure: green contours show the correlated Gaussian p(z) at 1, 2 and 3 standard deviations; red contours show q(z) over the same variables, given by the product of two independent univariate Gaussians.]
Minimization based on the KL divergence KL(q||p): the mean is correctly captured, but the variance is under-estimated and the approximation is too compact.
Minimization based on the reverse KL divergence KL(p||q): this is the form used in expectation propagation.
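
For this factorized Gaussian case the two directions can be compared in closed form: KL(q||p) gives factor variances Λ_ii^-1 (from the re-estimation equations above), while minimizing KL(p||q) over a factorized Gaussian matches the marginal moments of p, giving variances Σ_ii. The moment-matching property is the standard exponential-family result and is assumed here rather than taken from the slides; the precision matrix is an illustrative example.

```python
# Compare the factor variances obtained under the two KL directions for a
# factorized Gaussian approximation to a correlated Gaussian (assumed example Lam).
import numpy as np

Lam = np.array([[2.0, 1.2],
                [1.2, 3.0]])              # true precision matrix (hypothetical)
Sigma = np.linalg.inv(Lam)                # true covariance

print(1 / np.diag(Lam))                   # KL(q||p) factor variances: [0.500, 0.333] (too compact)
print(np.diag(Sigma))                     # KL(p||q) factor variances: [0.658, 0.439] (true marginals)
```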

Alternative KL Divergences for a Multimodal Distribution
Approximating a multimodal distribution by a unimodal one.
[Figure, panels (a)-(c): blue contours show the multimodal distribution p(z). Red contours in (a): a single Gaussian q(z) that best approximates p(z) in the sense of minimizing KL(p||q). In (b): a single Gaussian q(z) that best approximates p(z) in the sense of minimizing KL(q||p). In (c): as in (b), but showing a different local minimum of the KL divergence.]

Alpha Family of Divergences
The two forms of divergence are members of the alpha family of divergences
D_α(p||q) = (4 / (1 - α²)) ( 1 - ∫ p(x)^((1+α)/2) q(x)^((1-α)/2) dx ),  where -∞ < α < ∞
KL(p||q) corresponds to the limit α → 1, and KL(q||p) corresponds to the limit α → -1.
For all α, D_α(p||q) ≥ 0, with equality if and only if p(x) = q(x).
When α = 0 we get a symmetric divergence, which is linearly related to the Hellinger distance (a valid distance measure)
D_H(p||q) = ∫ ( p(x)^(1/2) - q(x)^(1/2) )² dx
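
A small numeric check of these relations (the two Gaussian densities and the grid are assumptions, not from the slides): at α = 0 the divergence equals 2·D_H, and near α = ±1 it approaches the two KL divergences.

```python
# Numeric sketch of D_alpha(p||q) = 4/(1-alpha^2) * (1 - ∫ p^{(1+a)/2} q^{(1-a)/2} dx)
# for two illustrative Gaussians evaluated on a grid.
import numpy as np

x  = np.linspace(-15, 15, 100001)
dx = x[1] - x[0]
gauss = lambda x, m, s: np.exp(-(x - m)**2 / (2 * s**2)) / (s * np.sqrt(2 * np.pi))
p, q = gauss(x, 0.0, 1.0), gauss(x, 1.5, 2.0)    # assumed example densities

def d_alpha(p, q, alpha):
    integral = np.sum(p**((1 + alpha) / 2) * q**((1 - alpha) / 2)) * dx
    return 4 / (1 - alpha**2) * (1 - integral)

d_hellinger = np.sum((np.sqrt(p) - np.sqrt(q))**2) * dx
kl_pq = np.sum(p * np.log(p / q)) * dx
kl_qp = np.sum(q * np.log(q / p)) * dx

print(d_alpha(p, q, 0.0), 2 * d_hellinger)       # equal: D_0 = 2 * D_H
print(d_alpha(p, q, 0.999), kl_pq)               # approaches KL(p||q) as alpha -> 1
print(d_alpha(p, q, -0.999), kl_qp)              # approaches KL(q||p) as alpha -> -1
```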

Example: Univariate Gaussian
Infer the parameters of a single Gaussian, the mean µ and precision τ, given a data set D = {x_1,...,x_N}.
Assume a Gaussian-Gamma conjugate prior distribution over the parameters.
Likelihood function:
p(D|µ,τ) = (τ / 2π)^(N/2) exp{ -(τ/2) ∑_{n=1}^{N} (x_n - µ)² }
Conjugate priors for µ and τ:
p(µ|τ) = N(µ | µ_0, (λ_0 τ)^-1)
p(τ) = Gam(τ | a_0, b_0)

Factorized Variational Approximation
Factorize the posterior distribution as q(µ,τ) = q_µ(µ) q_τ(τ).
The optimal solution for the factor q_µ(µ) is a Gaussian, and the optimal solution for the factor q_τ(τ) is a Gamma distribution.
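
A minimal sketch of the resulting coordinate updates on synthetic data. The specific update formulas below follow the standard mean-field derivation for this Gaussian-Gamma model and are assumptions of this sketch rather than content from the slide, as are the hyperparameter values.

```python
# Mean-field VI for a univariate Gaussian with unknown mean mu and precision tau:
# q_mu(mu) = N(mu | mu_N, lambda_N^-1), q_tau(tau) = Gam(tau | a_N, b_N).
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=100)      # synthetic data D = {x_1,...,x_N}
N, xbar = len(x), x.mean()

mu0, lam0, a0, b0 = 0.0, 1.0, 1e-3, 1e-3          # hyperparameters (assumed values)

E_tau = 1.0                                       # initialize E[tau]
for _ in range(100):                              # cycle the two factor updates in turn
    # q_mu(mu): Gaussian
    mu_N  = (lam0 * mu0 + N * xbar) / (lam0 + N)
    lam_N = (lam0 + N) * E_tau
    E_mu, E_mu2 = mu_N, mu_N**2 + 1.0 / lam_N     # moments of q_mu needed below
    # q_tau(tau): Gamma
    a_N = a0 + (N + 1) / 2                        # +1 from the ln(tau) term in p(mu|tau)
    b_N = b0 + 0.5 * (np.sum(x**2) - 2 * E_mu * np.sum(x) + N * E_mu2
                      + lam0 * (E_mu2 - 2 * E_mu * mu0 + mu0**2))
    E_tau = a_N / b_N

print(mu_N, 1.0 / np.sqrt(E_tau))   # approx posterior mean of mu and 1/sqrt(E[tau]),
                                    # close to the generating values 2.0 and 1.5
```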

Variational Inference of a Univariate Gaussian
Variational inference of the mean µ and precision τ.
[Figure, panels (a)-(d): green contours show the true posterior p(µ,τ|D). Blue contours show (a) the initial factorized approximation q_µ(µ) q_τ(τ), (b) the approximation after re-estimating the factor q_µ(µ), and (c) after re-estimating the factor q_τ(τ). (d) The iterative scheme converges to the red contours.]