Another Walkthrough of Variational Bayes. Bevan Jones, Machine Learning Reading Group, Macquarie University

Variational Bayes? Bayes: Bayes' Theorem, but the integral is intractable! Sampling: Gibbs, Metropolis-Hastings, Slice Sampling, Particle Filters. Variational Bayes: change the equations, replacing intractable integrals; this involves searching for a good approximation. Variational: Calculus of Variations, a way of searching through a space of functions for the best one. 2

Useful Concepts: Probability/Information Theory (Bayes' Theorem, Expectations, Jensen's Inequality, KL Divergence); Calculus (Functionals & Functional Derivatives, Lagrange Multipliers); Logarithms 3

Outline: The true likelihood; Approximating the posterior; The lower bound and a definition for "best"; Finding the optimal approximation; Functionals & functional derivatives; Connection to KL divergence; The mean-field approximation; An inference procedure; Dirichlet-multinomial example 4

The (Log) Likelihood We have some observed data: We have a model relating latent variable z to the data: To guess z The problem is one of computing Or just as good 5

Approximating p(z|x) The integral in the expression for p(x) may not be easily computed But we might be able to get by with an approximation for p(x, z) We'll focus on approximating only part of it 6
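The quantities slides 5 and 6 leave implicit (their equations are not in the transcription) are presumably the marginal likelihood and the posterior; a minimal sketch of the standard setup:

p(x) = \int p(x, z)\, dz, \qquad p(z \mid x) = \frac{p(x, z)}{p(x)}

Knowing p(x, z) is "just as good" as knowing p(z|x), since the two differ only by the normalizing constant p(x).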

Choosing q How to choose q? Ideally, we want the q that is closest to p. Define a lower bound on log p(x), make it a function of q, maximize the lower bound to make it as tight as possible, and choose q accordingly 7

Bounding the Log Likelihood w/ Jensen's Inequality Jensen's Inequality where f is concave 8

Bounding the Log Likelihood w/ Jensen's Inequality Jensen's Inequality where f is concave 9

Bounding the Log Likelihood w/ Jensen's Inequality Jensen's Inequality where f is concave 10

Bounding the Log Likelihood w/ Jensen's Inequality Jensen's Inequality where f is concave 11
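The bound built up across slides 8-11 is not reproduced in the transcription; the standard derivation it presumably follows applies Jensen's inequality, f(E[y]) >= E[f(y)] for concave f, with f = log:

\log p(x) = \log \int q(z)\, \frac{p(x, z)}{q(z)}\, dz \;\ge\; \int q(z) \log \frac{p(x, z)}{q(z)}\, dz \;=\; F[q]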

The Lower Bound We can't calculate the log likelihood, but we can compute the lower bound Maximizing F tightens the lower bound on the likelihood What q maximizes F? If q were a variable we could do this by taking derivatives and solving for q 12

Functionals: the Variational in VB Functional: a kind of meta-function that takes a function as input We can view F[q] as a functional of q Calculus of functionals parallels that of functions Then, we can take the derivative of F[q] with respect to q, set it to 0, and solve for q 13

Derivatives 14

Functional Derivatives The change in a functional as we change its function argument 15
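As a concrete (assumed) statement of the definition: for functionals of the form F[q] = \int G(q(z), z)\, dz, which do not depend on derivatives of q and cover everything needed here, the functional derivative is

\frac{\delta F}{\delta q(z)} = \frac{\partial G}{\partial q}\bigg|_{q(z),\, z}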

Useful Derivatives 16

Useful Derivatives 17

Useful Derivatives 18

Useful Derivatives 19

Useful Derivatives 20
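The derivatives themselves are lost in the transcription; the standard results usually collected at this point in a VB derivation (an assumption about the lost slides 16-20) are:

\frac{\delta}{\delta q(z)} \int q(z')\, g(z')\, dz' = g(z), \qquad \frac{\delta}{\delta q(z)} \int q(z') \log q(z')\, dz' = \log q(z) + 1, \qquad \frac{\delta}{\delta q(z)} \left( \int q(z')\, dz' - 1 \right) = 1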

Calculating q Use Lagrange multipliers to enforce the normalization constraint 21

Calculating q 22

Calculating q 23

Calculating q 24
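A sketch of the calculation slides 21-24 presumably carry out: add a Lagrange multiplier for the constraint that q integrates to one, take the functional derivative, and solve.

L[q] = \int q(z) \log \frac{p(x, z)}{q(z)}\, dz + \lambda \left( \int q(z)\, dz - 1 \right)

\frac{\delta L}{\delta q(z)} = \log p(x, z) - \log q(z) - 1 + \lambda = 0 \;\Rightarrow\; q(z) \propto p(x, z) \;\Rightarrow\; q(z) = \frac{p(x, z)}{p(x)} = p(z \mid x)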

KL Divergence: An Alternative View Maximizing F is minimizing the KL divergence 25
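The identity behind this slide, in standard form:

F[q] = \int q(z) \log \frac{p(x, z)}{q(z)}\, dz = \log p(x) - \mathrm{KL}\big(q(z) \,\|\, p(z \mid x)\big)

Since log p(x) does not depend on q, maximizing F is exactly minimizing the KL divergence to the true posterior.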

Optimal q The best q(z) is p(z|x) 26

Where are we? We've bounded the likelihood (Jensen's Ineq.) Made this bound tight (Lagrange Multipliers) But the best approximation is no approximation at all! We need to constrain q so that it's tractable 27

Optimal q in an Imperfect World We can't compute q(z)=p(z|x) directly Instead, constrain the domain of F[q] to some set of more tractable functions This is usually done by making independence assumptions The mean field assumption: cut all dependencies 28

Example 2: Mean Field Assumption We have some observed data: We have a model relating latent variables z and θ to the data: To guess z and θ we need p(z, θ | x), but the integral is hard! Apply the mean field assumption 29
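For this example the mean field assumption presumably amounts to the factorization

q(z, \theta) = q_z(z)\, q_\theta(\theta)

and, in general, q(z) = \prod_i q_i(z_i) cuts all dependencies among the latent variables.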

The New Lower Bound 30

The New Lower Bound 31

The New Lower Bound 32

The New Lower Bound Apply mean field assumption 33
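Substituting the factorized q into the bound, the new lower bound built up on slides 30-33 presumably takes the form

F[q_z, q_\theta] = \int\!\!\int q_z(z)\, q_\theta(\theta) \log \frac{p(x, z, \theta)}{q_z(z)\, q_\theta(\theta)}\, dz\, d\theta = \mathbb{E}_{q_z q_\theta}[\log p(x, z, \theta)] + H[q_z] + H[q_\theta]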

The Benefit of Independence The integrals get simpler In fact, these go away 34

Optimizing the Lower Bound 35

Optimal q_θ(θ) Use Lagrange multipliers to enforce the normalization constraint 36

Optimal q_z(z) Use Lagrange multipliers to enforce the normalization constraint 37

The Approximation q ≈ p 38

Estimating Parameters Now we have our approximation q We need to compute the expectations Use an EM-like procedure, alternating between the two It was hard to do this for p(z,θ|x) It's (hopefully) easy for q(z,θ) if we've defined p to make use of conjugacy and if we've chosen the right constraint for q 39
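Holding one factor fixed and optimizing the other with a Lagrange multiplier gives the standard mean-field updates, which is presumably what slides 36-37 derive; the EM-like procedure alternates them until F stops improving:

q_\theta(\theta) \propto \exp\!\big( \mathbb{E}_{q_z}[\log p(x, z, \theta)] \big), \qquad q_z(z) \propto \exp\!\big( \mathbb{E}_{q_\theta}[\log p(x, z, \theta)] \big)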

Calculating F 40

Calculating F As a side effect of inference, we already have one of the terms: it's the log of the normalization constant for q(z). So, we really only need two more expectations 41

Uses for F We can often use F in cases where we would normally use the log likelihood. Measuring convergence: no guarantee to maximize the likelihood, but we do have F. Others: model selection (choose the model with the highest lower bound); selecting the number of clusters (pick the number that gives us the highest lower bound); parameter optimization (again, optimize the lower bound w.r.t. the parameters) 42

Worked Example Dirichlet-Multinomial Mixture Model 43

Dirichlet-Multinomial Mixture Model (plate diagram: hyperparameters α and β, mixture proportions φ, K component multinomials π, latent assignments z for the N observations x) 44

The Intractable Integral 45
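One way to write the model that is consistent with the plate diagram and the later slides (the exact parameterization is an assumption of this note):

\phi \sim \mathrm{Dirichlet}(\alpha), \qquad \pi_k \sim \mathrm{Dirichlet}(\beta), \;\; k = 1, \dots, K
z_n \mid \phi \sim \mathrm{Categorical}(\phi), \qquad x_n \mid z_n, \pi \sim \mathrm{Multinomial}(\pi_{z_n}), \;\; n = 1, \dots, N

The intractable integral of slide 45 is then the marginal likelihood

p(x \mid \alpha, \beta) = \int\!\!\int p(\phi \mid \alpha)\, p(\pi \mid \beta) \sum_z p(z \mid \phi)\, p(x \mid z, \pi)\, d\phi\, d\pi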

The Mean Field Assumption 46
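For this model the mean field assumption presumably factorizes the approximation as

q(z, \phi, \pi) = q_z(z)\, q_\phi(\phi)\, q_\pi(\pi)

matching the per-factor updates on the following slides.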

Optimizing F Apply Lagrange multipliers just like example 2 In this case, we have simply replaced z, x, and θ with vectors The math is exactly the same But we need to find the expectations we skipped before Plug in the Dirichlet and multinomial distributions 47

Optimal q(z,θ) Borrowed from example 2 See slides 36-38 All we need to do is apply the particulars of the Mixture model 48

Optimal q_θ(θ) 49

Optimal q_φ(φ): The Expectation 50

Dirichlet Distribution 51

Optimal q_φ(φ): The Numerator 52

Optimal q_φ(φ): The Normalization 53

Optimal q_φ(φ): Conjugacy Helps 54
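By conjugacy, the optimal q_φ comes out as another Dirichlet whose pseudo-counts are the prior plus the expected topic counts; a sketch under the model written above:

q_\phi(\phi) = \mathrm{Dirichlet}\big(\phi \,\big|\, \alpha_1 + \textstyle\sum_n q(z_n = 1), \,\dots,\, \alpha_K + \sum_n q(z_n = K)\big)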

Optimal q_π(π) q(π) is essentially the same as q(φ) The only difference is that there are multiple π's So, q(π) should be a product of Dirichlets 55

Optimal q_π(π): The Expectation 56

Optimal q_π(π): The Numerator 57

Optimal q_π(π): The Denominator 58

Optimal q_π(π): Putting Them Together 59
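Putting them together, q_π is a product of Dirichlets, one per component, with expected topic-word counts added to the prior (again a sketch, assuming each x_n is a vector of word counts x_{n,w} over a vocabulary of size V):

q_\pi(\pi) = \prod_{k=1}^{K} \mathrm{Dirichlet}\big(\pi_k \,\big|\, \beta_1 + \textstyle\sum_n q(z_n = k)\, x_{n,1}, \,\dots,\, \beta_V + \sum_n q(z_n = k)\, x_{n,V}\big)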

A Useful Standard Result The digamma function The expectation under a Dirichlet of the log of an individual component of a Dirichlet random variable 60
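In standard form, the result referred to here:

\mathbb{E}_{\mathrm{Dirichlet}(a)}[\log \theta_k] = \psi(a_k) - \psi\Big(\textstyle\sum_j a_j\Big)

where ψ is the digamma function.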

Optimal q_z(z) Again, borrowed from example 2 See slides 36-38 Here, we plug in the model definition 61

Optimal q_z(z) First, let's work with the simpler multinomial distribution Side effect: a kind of estimate for the multinomial parameter vector 62

Optimal q_z(z): The Expectations 63

Optimal q_z(z): The Expectations Now, let's work with the product of multinomials Side effect: a kind of set of multinomial parameter vectors This is essentially the same math required for HMMs and PCFGs 64

Optimal q_z(z): The Expectations 65

Optimal q_z(z): Putting It Together 66
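Putting it together, the optimal q_z factorizes over data points, with responsibilities of the usual form (a sketch under the assumed parameterization):

q(z_n = k) \propto \exp\!\Big( \mathbb{E}_{q_\phi}[\log \phi_k] + \textstyle\sum_w x_{n,w}\, \mathbb{E}_{q_\pi}[\log \pi_{k,w}] \Big)

where the expectations are the digamma expressions from the Dirichlet factors above.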

Implications of Assumption We should get the same result with an even weaker assumption 67

Inference E-Step: Expected Counts (topic counts, topic-word pair counts). M-Step: The Proportions (topic j, topic-word pair j-k). 68
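As a concrete illustration of the alternation described on this slide, here is a minimal sketch in Python/NumPy of the mean-field updates for a Dirichlet-multinomial mixture. The function name, the bag-of-words layout (X as an N x V count matrix), and the prior handling are assumptions of this note, not the author's implementation.

import numpy as np
from scipy.special import digamma

def vb_dirichlet_multinomial(X, alpha, beta, n_iter=50):
    # Sketch only: X is an (N, V) matrix of word counts per document (assumed layout),
    # alpha is a length-K Dirichlet prior on mixture proportions,
    # beta is a scalar or length-V Dirichlet prior on each component's word distribution.
    N, V = X.shape
    K = len(alpha)
    rng = np.random.default_rng(0)
    r = rng.dirichlet(np.ones(K), size=N)          # responsibilities q(z_n = k), shape (N, K)
    for _ in range(n_iter):
        # "M-step"-like: Dirichlet parameters of q(phi) and q(pi) from expected counts
        a_phi = alpha + r.sum(axis=0)              # prior + expected topic counts, shape (K,)
        a_pi = beta + r.T @ X                      # prior + expected topic-word counts, shape (K, V)
        # expectations of log parameters under the Dirichlet factors (digamma identity)
        E_log_phi = digamma(a_phi) - digamma(a_phi.sum())
        E_log_pi = digamma(a_pi) - digamma(a_pi.sum(axis=1, keepdims=True))
        # "E-step"-like: recompute responsibilities from the expected log probabilities
        log_r = E_log_phi[None, :] + X @ E_log_pi.T
        log_r -= log_r.max(axis=1, keepdims=True)  # subtract max for numerical stability
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)
    return r, a_phi, a_pi

Usage would look like r, a_phi, a_pi = vb_dirichlet_multinomial(X, alpha=np.ones(K), beta=0.1); in practice one would also track the lower bound F to monitor convergence, as slide 42 suggests.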

Calculating F Also borrowed from example 2 See slides 40-41 But we adapt it for the mixture model 69

Calculating F 70

Calculating F: The Normalization Constant By-product of computing q_z(z) 71