Probabilistic Graphical Models for Image Analysis - Lecture 4

Probabilistic Graphical Models for Image Analysis - Lecture 4
Stefan Bauer
12 October 2018
Max Planck ETH Center for Learning Systems

Overview
1. Repetition
2. α-divergence
3. Variational Inference
4. Exponential Families
5. Stochastic Variational Inference
6. Black Box Variational Inference

Repetition

Framework
x observations, Z hidden variables, α additional parameters.

p(z | x, α) = p(z, x | α) / ∫_z p(z, x | α) dz   (1)

Idea: Pick a family of distributions over the latent variables with its own variational parameter, q(z | ν) = ...?, and find variational parameters ν such that q and p are "close".

Mean-field
Assumes that each variable is independent:

q(z_1, …, z_k) = ∏_{i=1}^k q(z_i)   (2)

The family does not contain the true posterior, since dependencies between the variables can no longer be captured by q.
Offers the possibility to group variables together.
Use coordinate ascent updates for each q(z_k).
Here we only specified the factorization, but not the form of the q(z_k).

Using Jensen's inequality to obtain a lower bound

log p(x) = log ∫_Z p(x, Z) dZ
         = log ∫_Z p(x, Z) q(Z)/q(Z) dZ
         = log E_q[ p(x, Z) / q(Z) ]
         ≥ E_q[log p(x, Z)] − E_q[log q(Z)]

Proposal: Choose/design the variational family Q such that the expectations are easily computable.
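As a quick numerical illustration (not part of the slides), the following sketch estimates the bound by Monte Carlo for a toy conjugate model where log p(x) is known in closed form; the model, the observation x = 1.5, and the variational parameters are illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Toy model (hypothetical example): z ~ N(0,1), x | z ~ N(z,1),
# so the marginal is p(x) = N(x; 0, 2) and the ELBO must lower-bound log p(x).
x = 1.5
m, s = 0.3, 1.2                      # variational parameters of q(z) = N(m, s^2)

z = rng.normal(m, s, size=100_000)   # samples from q
log_joint = norm.logpdf(x, loc=z, scale=1.0) + norm.logpdf(z, 0.0, 1.0)
log_q = norm.logpdf(z, m, s)

elbo = np.mean(log_joint - log_q)    # E_q[log p(x, Z)] - E_q[log q(Z)]
print(elbo, norm.logpdf(x, 0.0, np.sqrt(2.0)))   # ELBO <= log p(x)
```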

Relation with KL

KL[q(z) ‖ p(z | y)] = E_q[ log q(z) / p(z | y) ]
                    = E_q[log q(z)] − E_q[log p(z | y)]
                    = E_q[log q(z)] − E_q[log p(z, y)] + log p(y)
                    = −( E_q[log p(z, y)] − E_q[log q(z)] ) + log p(y)

The sum of the KL and the ELBO is precisely the log normalizer log p(y), which does not depend on q and is bounded below by the ELBO; maximizing the ELBO is therefore equivalent to minimizing the KL.

α-divergence

α-divergence

D_α(p ‖ q) = (1 / (α(1 − α))) ∫ ( α p(x) + (1 − α) q(x) − p(x)^α q(x)^(1−α) ) dx   (3)

with α ∈ (−∞, ∞). Properties:
D_α(p ‖ q) is convex with respect to both q and p.
D_α(p ‖ q) ≥ 0
D_α(p ‖ q) = 0 when q = p
lim_{α → 0} D_α(p ‖ q) = KL(q ‖ p)
lim_{α → 1} D_α(p ‖ q) = KL(p ‖ q)
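A minimal numerical sketch (added here, not from the slides) evaluates D_α for two arbitrary Gaussians by quadrature and checks the two KL limits stated above; the particular distributions are assumptions made only for illustration.

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

p = lambda x: norm.pdf(x, 0.0, 1.0)     # illustrative choice of p
q = lambda x: norm.pdf(x, 1.0, 1.5)     # illustrative choice of q

def d_alpha(alpha):
    # alpha-divergence integrand from equation (3)
    integrand = lambda x: (alpha * p(x) + (1 - alpha) * q(x)
                           - p(x) ** alpha * q(x) ** (1 - alpha))
    val, _ = quad(integrand, -20, 20)
    return val / (alpha * (1 - alpha))

def kl(a, b):
    integrand = lambda x: a(x) * (np.log(a(x)) - np.log(b(x)))
    return quad(integrand, -20, 20)[0]

print(d_alpha(0.001), kl(q, p))   # alpha -> 0 recovers KL(q || p)
print(d_alpha(0.999), kl(p, q))   # alpha -> 1 recovers KL(p || q)
```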

Inclusive vs. Exclusive
Take again the Gaussian q approximating the two-mode Gaussian p from before:
For α ≥ 1: inclusive, since q prefers to stretch across p.
For α ≤ 0: exclusive, q seeks the area of largest total mass (mode seeking).

Expectation Propagation
Instead of minimizing KL(q ‖ p), we minimize KL(p ‖ q). Assume q is a member of the exponential family, i.e.

q(x | η) = h(x) g(η) exp(η⊤ u(x))

Then

KL(p ‖ q)(η) = −log g(η) − η⊤ E_{p(z)}[u(z)] + const.

Minimize the KL by setting the gradient to zero:

−∇ log g(η) = E_{p(z)}[u(z)]

From the previous slide, −∇ log g(η) = E_q[u(x)], so we can conclude (moment matching):

E_{q(z)}[u(z)] = E_{p(z)}[u(z)]
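For a concrete picture of moment matching (an illustrative addition, not from the lecture), the sketch below approximates a two-mode mixture p by a single Gaussian q: within the Gaussian family, the KL(p ‖ q) minimizer simply matches the mean and variance of p, here estimated from samples. The mixture components are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical bimodal p, represented by samples from an equal-weight mixture.
comp = rng.integers(0, 2, size=200_000)
samples = np.where(comp == 0,
                   rng.normal(-2.0, 0.7, comp.size),
                   rng.normal(3.0, 1.0, comp.size))

mu_q = samples.mean()          # matches E_p[z]
var_q = samples.var()          # matches E_p[z^2] as well
print(mu_q, var_q)             # the inclusive q stretches across both modes
```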

Expectation Propagation - Summary
The reversed KL is harder to optimize.
If the true posterior p factorizes, then we can update single factors iteratively by moment matching. Factors are in the exponential family.
There is no guarantee that the iterations will converge.
Expectation Propagation aims to preserve the marginals!
The restriction to the exponential family in EP implies: any product and any division between distributions stays in the parametric family and can be done analytically.
Main applications involve Gaussian Processes and logistic regression (less well suited for GMMs).

Variational Inference

Framework
We use variational inference to approximate the posterior distribution:

log p(x | θ) = ELBO(q, θ) + KL(q(z) ‖ p(z | x, θ))
log p(x | θ) ≥ E_q[log p(z, x)] − E_q[log q(z)]

To optimize the lower bound, we can use coordinate ascent! Problems:
In each iteration we go over all the data!
Computing the gradient of the expectations above.

Coordinate Ascent
Given the independence assumption, we can decompose the ELBO as a function of the individual q(z_j):

ELBO_j = ∫ q(z_j) E_{i≠j}[log p(z_j | z_{−j}, x, θ)] dz_j − ∫ q(z_j) log q(z_j) dz_j

Optimality condition: d ELBO / d q(z_j) = 0
Exercise (Lagrange multipliers): q*(z_j) ∝ exp( E_{−j}[log p(z_j | Z_{−j}, x)] )
Coordinate ascent: iteratively update each q(z_j).

Conditionals
Last slide: q*(z_j) ∝ exp( E_{−j}[log p(z_j | Z_{−j}, x)] )
Assume the conditional is in the exponential family, i.e.

p(z_j | z_{−j}, x) = h(z_j) exp( η(z_{−j}, x)⊤ t(z_j) − a(η(z_{−j}, x)) )

Note: We will see examples in the following lectures!

Mean-field for the exponential family:
Compute the log of the conditional:
log p(z_j | z_{−j}, x) = log h(z_j) + η(z_{−j}, x)⊤ t(z_j) − a(η(z_{−j}, x))
Compute the expectation with respect to q(z_{−j}):
E[log p(z_j | z_{−j}, x)] = log h(z_j) + E[η(z_{−j}, x)]⊤ t(z_j) − E[a(η(z_{−j}, x))]
Thus

q*(z_j) ∝ h(z_j) exp( E[η(z_{−j}, x)]⊤ t(z_j) )

Conditionals
Note: In the case of an exponential family, the optimal q(z_j) is in the same family as the conditional.

Coordinate Ascent
Assuming variational parameters ν:

q(z_1, …, z_k | ν) = ∏_{i=1}^k q(z_i | ν_i)   (4)

Then each natural variational parameter is set equal to the expectation of the natural conditional parameter given all the other variables and the observations:

ν_j = E[η(z_{−j}, x)]   (5)
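As a worked illustration of these coordinate-ascent updates (a minimal sketch, not taken from the lecture), consider a toy Normal model with independent conjugate priors on its mean and precision; the model, priors, and hyperparameter values are assumptions chosen only so that both optimal factors have closed form.

```python
import numpy as np

rng = np.random.default_rng(0)

# CAVI sketch (hypothetical model):
#   x_i ~ N(mu, 1/tau),  mu ~ N(mu0, 1/lam0),  tau ~ Gamma(a0, b0),
# with the mean-field factorization q(mu, tau) = q(mu) q(tau).
# Applying q*(z_j) ∝ exp(E_{-j}[log p(z_j | z_{-j}, x)]) gives
#   q(mu) = N(mu_N, 1/lam_N)  and  q(tau) = Gamma(a_N, b_N).
x = rng.normal(2.0, 0.5, size=500)
n, xbar = x.size, x.mean()

mu0, lam0, a0, b0 = 0.0, 1.0, 1.0, 1.0        # prior hyperparameters (illustrative)
E_tau = a0 / b0                                # initialization of E_q[tau]

for _ in range(50):
    # update q(mu): expected natural parameters of the Gaussian conditional
    lam_N = lam0 + n * E_tau
    mu_N = (lam0 * mu0 + E_tau * n * xbar) / lam_N
    # update q(tau): uses E_q[(x_i - mu)^2] = (x_i - mu_N)^2 + 1/lam_N
    a_N = a0 + n / 2.0
    b_N = b0 + 0.5 * (np.sum((x - mu_N) ** 2) + n / lam_N)
    E_tau = a_N / b_N

print(mu_N, 1.0 / np.sqrt(E_tau))   # close to the true mean 2.0 and std 0.5
```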

Exponential Families

Exponential Families
The exponential family of distributions over x, given parameters η, is defined to be the set of distributions of the form

p(x | η) = h(x) exp( η⊤ t(x) − a(η) )   (6)

where:
η are the so-called natural parameters
t(x) is the sufficient statistic
a(η) is the log normalizer

E[t(x)] = d/dη a(η)

Higher moments are obtained from higher derivatives.
Examples: Bernoulli (last week), Gaussian, Binomial, Multinomial, Poisson, Dirichlet, Beta, Gamma, etc.
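A tiny numerical check (added as an illustration) of the mean identity E[t(x)] = d/dη a(η) for the Bernoulli written in exponential-family form; the value of η is an arbitrary assumption.

```python
import numpy as np

# Bernoulli in exponential-family form: t(x) = x, eta = log(p/(1-p)),
# a(eta) = log(1 + exp(eta)), so E[t(x)] = a'(eta) = sigmoid(eta).
eta = 0.7
a = lambda e: np.log1p(np.exp(e))

p = 1.0 / (1.0 + np.exp(-eta))                         # mean parameter E[x]
eps = 1e-6
a_prime = (a(eta + eps) - a(eta - eps)) / (2 * eps)    # numerical derivative of the log normalizer
print(p, a_prime)                                      # both equal E[t(x)]
```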

Conjugacy
Bayesian modeling allows us to incorporate priors:

η ~ F(· | λ),   x_i ~ G(· | η),   for i ∈ {1, …, n}

The posterior distribution of η given the data x_{1:n} is then given by

p(η | x, λ) ∝ F(η | λ) ∏_{i=1}^n G(x_i | η)

We say F and G are conjugate if the above posterior belongs to the same functional family as F.

Conjugate prior for the Exponential Family
Exponential family members have a conjugate prior:

p(x_i | η) = h_l(x) exp( η⊤ t(x) − a_l(η) )
p(η | λ) = h_c(η) exp( λ_1⊤ η + λ_2 (−a_l(η)) − a_c(λ) )

The natural parameter λ = (λ_1, λ_2) has dimension dim(η) + 1.
The sufficient statistics are (η, −a_l(η)).
Exercise: Show the above result.
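A concrete instance of this conjugacy (an illustrative sketch, not from the slides) is the Beta prior with a Bernoulli likelihood: the posterior is again a Beta with updated parameters. The prior hyperparameters and the true success probability below are assumptions.

```python
import numpy as np
from scipy.stats import beta

rng = np.random.default_rng(0)

a0, b0 = 2.0, 2.0                     # prior Beta(a0, b0) on the success probability
x = rng.binomial(1, 0.7, size=100)    # Bernoulli observations

a_post = a0 + x.sum()                 # a0 + number of ones
b_post = b0 + x.size - x.sum()        # b0 + number of zeros
print(beta.mean(a_post, b_post))      # posterior mean, close to the true 0.7
```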

Conditional conjugacy
Let β be a vector of global latent variables (with corresponding global parameters α) and z be a vector of local latent variables:

p(β, z, x) = p(β) ∏_{i=1}^n p(z_i, x_i | β)

Note: For stochastic variational inference, we make an additional assumption:

p(z_i, x_i | β) = h(z_i, x_i) exp( β⊤ t(z_i, x_i) − a(β) )

and take an exponential-family prior on the global variables as the corresponding conjugate prior:

p_α(β) = h(β) exp( α⊤ [β, −a(β)] − a(α) )

with α = (α_1, α_2).

Mean-field for conjugates
Mean-field: q(z, β) = q_λ(β) ∏_{i=1}^k q_{ϕ_i}(z_i)
λ: global variational parameter
ϕ: local variational parameters
Local update: ϕ_i ← E_λ[η_l(β, x_i)]
Global update: λ ← E_ϕ[η_g(x, z)]
Note: Coordinate ascent iterates between local and global updates.

Algorithm
Algorithm 1 Mean-field with conjugate family assumption
1: Input: model p, variational family q_ϕ(z), q_λ(β)
2: while ELBO is not converged do
3:   for each data point i do
4:     Update ϕ_i ← E_λ[η_l(β, x_i)]
5:   end for
6:   Update λ ← E_ϕ[η_g(x, z)]
7: end while

Stochastic Variational Inference

Gradient Optimization

λ_{t+1} = λ_t + δ ∇_λ f(λ_t)

or equivalently

argmax_{dλ} f(λ + dλ)   s.t.   ‖dλ‖² ≤ ε   (7)

Problem: Here it is the Euclidean distance, which is not suitable for probability distributions.

Natural gradient for the ELBO:

argmax_{dλ} ELBO(λ + dλ)   s.t.   D^sym_KL(q_λ, q_{λ+dλ}) ≤ ε   (8)

where D^sym_KL(q, p) = KL(q ‖ p) + KL(p ‖ q).

Gradient Optimization
Riemannian metric G(λ) such that:

dλ⊤ G(λ) dλ = D^sym_KL(q_λ(β), q_{λ+dλ}(β))

From information geometry, we know how to calculate the natural gradient:

∇̃_λ ELBO = G(λ)⁻¹ ∇_λ ELBO

where

G(λ) = E[ (∇_λ log q_λ(β)) (∇_λ log q_λ(β))⊤ ]   (9)

Gradient Optimization for conjugate models
For our model class, we have:

∇_λ log q_λ(β) = t(β) − E[t(β)]   (10)

Thus

G(λ) = ∇²_λ a(λ)   (11)

Recall

∇_λ ELBO = ∇²_λ a(λ) ( E[η(x, z)] − λ )   (12)

Then

∇̃_λ ELBO = G(λ)⁻¹ ∇_λ ELBO = G(λ)⁻¹ ∇²_λ a(λ) ( E[η(x, z)] − λ ) = E[η(x, z)] − λ

Gradient Step
In each iteration:

λ_t = λ_{t−1} + δ_t ∇̃_{λ_{t−1}} ELBO

where δ_t is the step size. Then, substituting the above:

λ_t = (1 − δ_t) λ_{t−1} + δ_t E[η(x, z)]

Algorithm
Algorithm 2 Mean-field with natural gradient
1: Input: model p, variational family q_ϕ(z), q_λ(β)
2: while ELBO is not converged do
3:   for each data point i do
4:     Update ϕ_i ← E_λ[η_l(β, x_i)]
5:   end for
6:   Update λ ← λ + δ ∇̃_λ ELBO = (1 − δ)λ + δ E_{q(ϕ)}[η_g(x, z)]
7: end while

Stochastic Optimization
Idea: Maximize a function f using noisy gradients H of that function.
Noisy gradient H: E[H] = ∇f
Step size δ_t:

x_{t+1} ← x_t + δ_t H(x_t)

Convergence to a local optimum is guaranteed when:

Σ_{t=1}^∞ δ_t = ∞,   Σ_{t=1}^∞ δ_t² < ∞

Stochastic Gradient Step
Recall

∇̃_λ ELBO = E[η(x, z)] − λ = ( α_1 + Σ_{i=1}^n E_q[t(z_i, x_i)], n + α_2 ) − λ

Idea: Construct a noisy natural gradient by sampling.
Sample an index j ~ Uniform(1, …, n).
Rescale (x_j^(n), z_j^(n) denote a data set formed by n replicates of the sampled point):

∇̂_λ ELBO = E[η(x_j^(n), z_j^(n))] − λ = ( α_1 + n E_q[t(z_j, x_j)], n + α_2 ) − λ =: λ̂ − λ

Summary gradient step:

λ_t = (1 − δ_t) λ_{t−1} + δ_t λ̂
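The following sketch (an added illustration, not from the lecture) runs exactly this noisy update for the simplest conjugate case, a Gaussian mean with known noise variance and a Gaussian prior, where the exact posterior is available for comparison; the model, prior, and step-size schedule are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stochastic natural-gradient sketch (hypothetical model): global latent mu with
# prior mu ~ N(0, 1) and likelihood x_i ~ N(mu, 1).  q(mu) is Gaussian with natural
# parameters lam = (precision * mean, precision).  The noisy estimate lam_hat uses a
# single sampled data point replicated n times.
n = 1000
x = rng.normal(1.5, 1.0, size=n)

prior = np.array([0.0, 1.0])                      # natural params of the N(0, 1) prior
lam = prior.copy()                                # initialize q(mu)

for t in range(1, 10_000):
    j = rng.integers(n)                           # sample a single data point
    lam_hat = prior + n * np.array([x[j], 1.0])   # noisy estimate of E[eta(x, z)]
    delta = (t + 10.0) ** -0.9                    # Robbins-Monro step size
    lam = (1 - delta) * lam + delta * lam_hat

post_prec = prior[1] + n                          # exact conjugate posterior precision
post_mean = (prior[0] + x.sum()) / post_prec
print(lam[0] / lam[1], post_mean)                 # estimated vs exact posterior mean
print(lam[1], post_prec)                          # estimated vs exact posterior precision
```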

Algorithm
Algorithm 3 Mean-field with stochastic gradient
1: Input: model p, variational family q_ϕ(z), q_λ(β)
2: while stopping criterion is not fulfilled do
3:   Sample an index j ~ Uniform(1, …, n)
4:   Update ϕ_j ← E_λ[η_l(β, x_j)]
5:   Compute the global parameter estimate λ̂ = E_ϕ[η_g(x_j, z_j)]
6:   Optimize the global variational parameter λ_{t+1} ← (1 − δ_t) λ_t + δ_t λ̂
7:   Check the step size and update if required!
8: end while

Black Box Variational Inference

Gradient Estimates of the ELBO

ELBO = E_{q_ν}[log p_θ(z, x)] − E_{q_ν}[log q_ν(z)]

where ν are the parameters of the variational distribution and θ the parameters of the model (as before).
Aim: Maximize the ELBO.
Problem: We need unbiased estimates of ∇_{ν,θ} ELBO.

Reparametrization trick *
Simplified notation: ∇_ν E_{q_ν}[f_ν(z)]
Assume that there exists a fixed reparameterization such that

E_{q_ν}[f_ν(z)] = E_q[f_ν(g_ν(ε))]

where the expectation on the right is taken with respect to a base distribution q(ε) that does not depend on ν. Then

∇_ν E_q[f_ν(g_ν(ε))] = E_q[∇_ν f_ν(g_ν(ε))]

Solution: Obtain unbiased estimates by taking a Monte Carlo estimate of the expectation on the right.
* Will be covered later
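A minimal sketch of the idea (added for illustration): for a Gaussian q_ν = N(m, s²) the reparameterization is z = m + sε with ε ~ N(0, 1), so gradients move inside the expectation. The test function f(z) = z², whose exact gradients are (2m, 2s), is an assumption chosen so the estimates can be checked by hand.

```python
import numpy as np

rng = np.random.default_rng(0)

m, s = 0.5, 1.3
eps = rng.normal(size=100_000)     # base distribution, independent of (m, s)
z = m + s * eps                    # z = g_nu(eps)

grad_m = np.mean(2 * z)            # d/dm (m + s*eps)^2 = 2z
grad_s = np.mean(2 * z * eps)      # d/ds (m + s*eps)^2 = 2z * eps
print(grad_m, grad_s)              # Monte Carlo estimates of (2m, 2s) = (1.0, 2.6)
```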

Optimizing the ELBO

ELBO = E_{q(λ)}[log p(z, x) − log q(z)] =: E[g(z)]

Exercise:

∇_λ ELBO = ∇_λ E[g(z)] = E[g(z) ∇_λ log q(z)] + E[∇_λ g(z)]

where ∇_λ log q(z) is called the score function.
Note: The expectation of the score function is zero for any q, i.e. E_q[∇_λ log q(z)] = 0.
Thus, to compute a noisy gradient of the ELBO:
sample from q(z)
evaluate ∇_λ log q(z)
evaluate log p(x, z) and log q(z)
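The sketch below (an added illustration) compares the score-function estimator with the reparameterized one on the same Gaussian example as above; f(z) = z² and the parameter values are assumptions, and the exact gradient with respect to m is 2m.

```python
import numpy as np

rng = np.random.default_rng(0)

# Score-function estimate of d/dm E_q[f(z)] for q(z) = N(m, s^2), f(z) = z^2.
m, s = 0.5, 1.3
z = rng.normal(m, s, size=100_000)

score_m = (z - m) / s ** 2                      # d/dm log N(z; m, s^2)
grad_score = np.mean(z ** 2 * score_m)          # E_q[f(z) * score], score-function estimator
grad_reparam = np.mean(2 * z)                   # reparameterization estimator on the same draws

print(grad_score, grad_reparam, 2 * m)          # all close to 1.0
print(np.var(z ** 2 * score_m), np.var(2 * z))  # the score-function samples have noticeably higher variance
```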

Algorithm
Algorithm 4 Black Box Variational Inference
1: Input: data x, model p(x, z), variational family q_ϕ(z)
2: while stopping criterion is not fulfilled do
3:   Draw L samples z_l ~ q_ϕ(z)
4:   Update the variational parameter using the collected samples:
     ϕ ← ϕ + δ_t (1/L) Σ_{l=1}^L ∇_ϕ log q(z_l) ( log p(x, z_l) − log q(z_l) )
5:   Check the step size and update if required!
6: end while
Note: An active research area (and problem) is the reduction of the variance of the noisy gradient estimator.
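To tie the pieces together, here is a minimal end-to-end sketch of this loop (not from the lecture) on the toy model used earlier, where the exact posterior is N(0.75, 0.5); the variational family, step-size schedule, and sample size L are all illustrative assumptions, and the result is only approximate because of the estimator's variance.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Black-box VI sketch (hypothetical model): z ~ N(0,1), x | z ~ N(z,1), observed x = 1.5.
# Variational family q(z) = N(m, exp(log_s)^2), updated with the score-function gradient.
x = 1.5
m, log_s = 0.0, 0.0                               # variational parameters
L, n_steps = 64, 4000

for t in range(1, n_steps + 1):
    s = np.exp(log_s)
    z = rng.normal(m, s, size=L)                  # draw L samples from q

    log_p = norm.logpdf(z, 0.0, 1.0) + norm.logpdf(x, z, 1.0)   # log p(x, z_l)
    log_q = norm.logpdf(z, m, s)
    weight = log_p - log_q                        # log p(x, z_l) - log q(z_l)

    score_m = (z - m) / s ** 2                    # d/dm log q(z)
    score_log_s = ((z - m) ** 2 / s ** 2) - 1.0   # d/dlog_s log q(z)

    delta = 0.05 / (1.0 + 0.01 * t)               # decaying step size
    m += delta * np.mean(score_m * weight)
    log_s += delta * np.mean(score_log_s * weight)

print(m, np.exp(log_s))                           # compare with 0.75 and sqrt(0.5) ≈ 0.707
```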

Summary
What we have seen:
Expectation Maximization
Introduced the KL divergence to understand the convergence of EM.
α-divergence as a more general concept, including Expectation Propagation
Scalable Stochastic Variational Inference for conjugate models in the exponential family
Black Box Variational Inference to overcome the exponential family assumption
What is not covered:
A problem for black box methods is the variance of the noisy gradient
Extensions: black box α-divergence
Online inference algorithms

Next week
Plan for next week:
Originally planned: Bayesian Non-parametrics
Most likely: State Space Models

Questions?