Another Walkthrough of Variational Bayes
Bevan Jones
Machine Learning Reading Group, Macquarie University
Variational Bayes?
- Bayes: Bayes' theorem. But the integral is intractable!
- Sampling: Gibbs, Metropolis-Hastings, slice sampling, particle filters
- Variational Bayes: change the equations, replacing intractable integrals. This involves searching for a good approximation.
- Variational: calculus of variations, a way of searching through a space of functions for the best one
Useful Concepts
- Probability/information theory: Bayes' theorem, expectations, Jensen's inequality, KL divergence
- Calculus: functionals & functional derivatives, Lagrange multipliers, logarithms
Outline
- The true likelihood
- Approximating the posterior
- The lower bound and a definition for "best"
- Finding the optimal approximation: functionals & functional derivatives
- Connection to KL divergence
- The mean-field approximation
- An inference procedure
- Dirichlet-multinomial example
The (Log) Likelihood
- We have some observed data: x
- We have a model relating latent variable z to the data: p(x, z)
- To guess z, we want the posterior p(z | x) = p(x, z) / p(x)
- The problem is one of computing p(x) = \int p(x, z) dz, or, just as good, log p(x)
Approximating p(z | x)
- The integral in the expression for p(x) may not be easily computed
- But we might be able to get by with an approximation for p(x, z)
- We'll focus on approximating only part of it: a distribution q(z) standing in for p(z | x)
Choosing q
- How to choose q? Ideally, we want the q that is closest to p
- Define a lower bound on log p(x)
- Make this a function of q
- Maximize the lower bound to make it as tight as possible
- Choose q accordingly
Bounding the Log Likelihood with Jensen's Inequality
- Jensen's inequality: f(E[y]) >= E[f(y)] where f is concave
- Applied with f = log, this bounds the log likelihood from below (see the derivation sketch below)
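As a sketch, the bound these build steps construct, using Jensen's inequality with the concave f = log and an arbitrary distribution q(z):

\log p(x) = \log \int p(x,z)\,dz
          = \log \int q(z)\,\frac{p(x,z)}{q(z)}\,dz
          \;\ge\; \int q(z)\,\log \frac{p(x,z)}{q(z)}\,dz \;\equiv\; F[q]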
The Lower Bound
- We can't calculate the log likelihood, but we can compute the lower bound F[q]
- Maximizing F tightens the lower bound on the likelihood
- What q maximizes F? If q were a variable we could do this by taking derivatives and solving for q
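Expanding the logarithm splits F into the two kinds of term whose derivatives are needed below, an "energy" term and an entropy term:

F[q] = \int q(z)\,\log p(x,z)\,dz \;-\; \int q(z)\,\log q(z)\,dz
     = \mathbb{E}_q[\log p(x,z)] + H[q]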
Functionals: the "Variational" in VB
- Functional: a kind of meta-function that takes a function as input
- We can view F[q] as a functional of q
- The calculus of functionals parallels that of functions
- So we can take the derivative of F[q] with respect to q, set it to 0, and solve for q
Derivatives and Functional Derivatives
- An ordinary derivative measures the change in a function as we perturb its (scalar) argument
- A functional derivative measures the change in a functional as we change its function argument
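A sketch of the standard definition: the functional derivative \delta F / \delta q(z) is defined through its action on an arbitrary test function \phi,

\lim_{\epsilon \to 0} \frac{F[q + \epsilon\phi] - F[q]}{\epsilon}
  = \int \frac{\delta F}{\delta q(z)}\,\phi(z)\,dz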
Useful Derivatives
- The functional derivatives of the two kinds of term appearing in F[q] (see the sketch below)
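As a sketch, the standard identities the next step relies on, for the energy and entropy terms of F:

\frac{\delta}{\delta q(z)} \int q(z')\,g(z')\,dz' = g(z),
\qquad
\frac{\delta}{\delta q(z)} \int q(z')\,\log q(z')\,dz' = \log q(z) + 1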
Calculating q
- Use Lagrange multipliers with the normalization constraint \int q(z) dz = 1 (see the derivation sketch below)
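A sketch of the constrained optimization: add the constraint with multiplier \lambda, take the functional derivative using the identities above, and solve.

L[q] = F[q] + \lambda \Big( \int q(z)\,dz - 1 \Big)

\frac{\delta L}{\delta q(z)} = \log p(x,z) - \log q(z) - 1 + \lambda = 0
\quad\Rightarrow\quad q(z) \propto p(x,z)

Normalizing gives q(z) = p(x,z)/p(x) = p(z | x).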
KL Divergence: An Alternative View
- Maximizing F is minimizing the KL divergence between q(z) and p(z | x)
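The decomposition behind this view: the log likelihood splits into the bound plus the divergence, and log p(x) does not depend on q, so raising F must lower the KL term.

\log p(x) = F[q] + \mathrm{KL}\big(q(z)\,\|\,p(z \mid x)\big),
\qquad \mathrm{KL}(q\,\|\,p) \ge 0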
Optimal q
- The best q(z) is p(z | x)
Where are we?
- We've bounded the likelihood (Jensen's inequality)
- Made this bound tight (Lagrange multipliers)
- But the best approximation is no approximation at all!
- We need to constrain q so that it's tractable
Optimal q in an Imperfect World
- We can't compute q(z) = p(z | x) directly
- Instead, constrain the domain of F[q] to some set of more tractable functions
- This is usually done by making independence assumptions
- The mean field assumption: cut all dependencies
Example 2: Mean Field Assumption
- We have some observed data: x
- We have a model relating latent variables z and θ to the data: p(x, z, θ)
- To guess z and θ we need p(z, θ | x), but the integral in p(x) is hard!
- Apply the mean field assumption: q(z, θ) = q_z(z) q_θ(θ)
The New Lower Bound
- Apply the mean field assumption to F (see the sketch below)
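Substituting the factorized q into F gives a new functional of the two factors:

F[q_z, q_\theta] = \iint q_z(z)\,q_\theta(\theta)\,
  \log \frac{p(x, z, \theta)}{q_z(z)\,q_\theta(\theta)}\,dz\,d\theta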
The Benefit of Independence
- The integrals get simpler
- In fact, some integrals reduce to 1 by normalization and go away
Optimizing the Lower Bound
- Optimize F with respect to each factor in turn, holding the other fixed
Optimal q_θ(θ)
- Use Lagrange multipliers with the constraint \int q_θ(θ) dθ = 1
Optimal q_z(z)
- Use Lagrange multipliers with the constraint \int q_z(z) dz = 1 (both updates are sketched below)
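A sketch of the resulting coupled updates, each the standard mean-field fixed point stated for this two-factor case:

q_\theta(\theta) \propto \exp\big( \mathbb{E}_{q_z}[\log p(x, z, \theta)] \big),
\qquad
q_z(z) \propto \exp\big( \mathbb{E}_{q_\theta}[\log p(x, z, \theta)] \big)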
The Approximation
- q(z, θ) = q_z(z) q_θ(θ) ≈ p(z, θ | x)
Estimating Parameters
- Now we have our approximation q; we need to compute the expectations
- Use an EM-like procedure, alternating between the two updates
- It was hard to do this for p(z, θ | x)
- It's (hopefully) easy for q(z, θ), if we've defined p to make use of conjugacy and chosen the right constraint for q
Calculating F
- As a side effect of inference, we already have one of the needed expectations: it's the log of the normalization constant for q(z)
- So we really only need two more expectations
Uses for F
- We can often use F in cases where we would normally use the log likelihood
- Measuring convergence: there is no guarantee we maximize the likelihood, but we do have F
- Model selection: choose the model with the highest lower bound
- Selecting the number of clusters: pick the number that gives us the highest lower bound
- Parameter optimization: again, optimize the lower bound w.r.t. the parameters
Worked Example: Dirichlet-Multinomial Mixture Model
Dirichlet-Multinomial Mixture Model
[Plate diagram: hyperparameter α over mixture proportions φ; hyperparameter β over per-topic word distributions π (plate of size K); latent assignment z and observation x (plate of size N)]
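One standard generative process consistent with this plate diagram and with the later slides, where φ is the vector of topic proportions and each π_j is a topic's word distribution; the exact notation here is an assumption:

\phi \sim \mathrm{Dir}(\alpha), \qquad
\pi_j \sim \mathrm{Dir}(\beta) \ \ (j = 1,\dots,K)

z_n \mid \phi \sim \mathrm{Mult}(\phi), \qquad
x_n \mid z_n, \pi \sim \mathrm{Mult}(\pi_{z_n}) \ \ (n = 1,\dots,N)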
The Intractable Integral
- p(x) = \iint \sum_z p(x, z, \phi, \pi) \, d\phi \, d\pi, which couples all the latent variables
The Mean Field Assumption
- q(z, \phi, \pi) = q_z(z) \, q_\phi(\phi) \, q_\pi(\pi)
Optimizing F
- Apply Lagrange multipliers just like example 2
- In this case, we have simply replaced z, x, and θ with vectors
- The math is exactly the same
- But we need to find the expectations we skipped before
- Plug in the Dirichlet and multinomial distributions
Optimal q(z, θ)
- Borrowed from example 2 (see the "Optimal q_θ(θ)" and "Optimal q_z(z)" slides above)
- All we need to do is apply the particulars of the mixture model
Optimal q_θ(θ)
- Here θ = (φ, π), so we need q_φ(φ) and q_π(π) in turn
Optimal q_φ(φ): The Expectation, the Numerator, the Normalization
- The Dirichlet distribution: Dir(φ; α) = (Γ(Σ_j α_j) / Π_j Γ(α_j)) Π_j φ_j^{α_j - 1}
- Conjugacy helps: a Dirichlet prior combined with multinomial counts yields another Dirichlet (see the sketch below)
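A sketch of the resulting update under the assumed notation above, with expected topic counts taken under q_z:

q_\phi(\phi) = \mathrm{Dir}\big(\phi;\ \hat\alpha\big),
\qquad \hat\alpha_j = \alpha_j + \sum_{n=1}^{N} q(z_n = j)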
Optimal q_π(π)
- q(π) is essentially the same as q(φ)
- The only difference is that there are multiple π's
- So, q(π) should be a product of Dirichlets
Optimal q_π(π): The Expectation, the Numerator, the Denominator, Putting Them Together
- The derivation parallels q_φ(φ), with one Dirichlet per topic (see the sketch below)
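A sketch of the per-topic update, with expected topic-word counts, again under the assumed notation:

q_\pi(\pi) = \prod_{j=1}^{K} \mathrm{Dir}\big(\pi_j;\ \hat\beta_j\big),
\qquad \hat\beta_{jw} = \beta_w + \sum_{n=1}^{N} q(z_n = j)\,\mathbb{1}[x_n = w]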
A Useful Standard Result
- The expectation under a Dirichlet of the log of an individual component: E_{Dir(α)}[log φ_j] = ψ(α_j) - ψ(Σ_k α_k), where ψ is the digamma function
Optimal q_z(z)
- Again, borrowed from example 2 (see the "Optimal q_z(z)" slide above)
- Here, we plug in the model definition
Optimal q_z(z): The Expectations
- First, let's work with the simpler multinomial distribution. Side effect: a kind of estimate for the multinomial parameter vector
- Now, the product of multinomials. Side effect: a kind of set of multinomial parameter vectors
- This is essentially the same math required for HMMs and PCFGs
- Putting it together gives the update for each q(z_n) (see the sketch below)
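A sketch of the combined update, applying the digamma result above to the variational Dirichlet parameters:

q(z_n = j) \propto \exp\Big( \psi(\hat\alpha_j) - \psi\big(\textstyle\sum_k \hat\alpha_k\big)
  + \psi(\hat\beta_{j,x_n}) - \psi\big(\textstyle\sum_w \hat\beta_{jw}\big) \Big)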
Implications of the Assumption
- We should get the same result with an even weaker assumption
Inference
- "E-step": expected counts (topic counts and topic-word pair counts)
- "M-step": the proportions (for topic j and topic-word pair j-k), as in the code sketch below
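A minimal runnable sketch of this alternating procedure for the Dirichlet-multinomial mixture, assuming the notation introduced above (φ: topic proportions with prior α; π_j: per-topic word distributions with prior β). The function name, the single-token-per-observation setup, and all variable names are illustrative, not from the slides.

import numpy as np
from scipy.special import digamma

def vb_mixture(x, K, V, alpha=1.0, beta=1.0, n_iters=50, seed=0):
    """x: int array of shape (N,), each entry a word id in {0, ..., V-1}."""
    N = x.shape[0]
    rng = np.random.default_rng(seed)
    # Responsibilities r[n, j] = q(z_n = j), randomly initialized.
    r = rng.dirichlet(np.ones(K), size=N)
    for _ in range(n_iters):
        # "M-step": variational Dirichlet parameters from expected counts
        # (closed form thanks to Dirichlet-multinomial conjugacy).
        alpha_hat = alpha + r.sum(axis=0)                    # (K,) topic counts
        beta_hat = beta + np.stack(
            [np.bincount(x, weights=r[:, j], minlength=V) for j in range(K)]
        )                                                    # (K, V) topic-word counts
        # "E-step": expected log parameters via the digamma identity
        # E[log theta_j] = psi(theta_hat_j) - psi(sum_k theta_hat_k).
        elog_phi = digamma(alpha_hat) - digamma(alpha_hat.sum())
        elog_pi = digamma(beta_hat) - digamma(beta_hat.sum(axis=1, keepdims=True))
        log_r = elog_phi[None, :] + elog_pi[:, x].T          # (N, K)
        log_r -= log_r.max(axis=1, keepdims=True)            # numerical stability
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)
    return r, alpha_hat, beta_hat

# Example usage: 200 observations over a vocabulary of 5 words, 3 topics.
r, alpha_hat, beta_hat = vb_mixture(np.random.randint(0, 5, size=200), K=3, V=5)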
Calculating F
- Also borrowed from example 2 (see the "Calculating F" slide above)
- But we adapt it for the mixture model
Calculating F: The Normalization Constant
- A by-product of computing q(z): the normalization constant of each q(z_n) is already available from the E-step