Another Walkthrough of Variational Bayes
Bevan Jones
Machine Learning Reading Group, Macquarie University
Variational Bayes?
- Bayes: Bayes' theorem. But the integral is intractable!
- Sampling: Gibbs, Metropolis-Hastings, slice sampling, particle filters
- Variational Bayes: change the equations, replacing intractable integrals. This involves searching for a good approximation.
- Variational: calculus of variations, a way of searching through a space of functions for the best one
Useful Concepts
- Probability/information theory: Bayes' theorem, expectations, Jensen's inequality, KL divergence
- Calculus: functionals & functional derivatives, Lagrange multipliers, logarithms
Outline
- The true likelihood
- Approximating the posterior
- The lower bound and a definition for "best"
- Finding the optimal approximation: functionals & functional derivatives
- Connection to KL divergence
- The mean-field approximation
- An inference procedure
- Dirichlet-multinomial example
The (Log) Likelihood
- We have some observed data: x
- We have a model relating latent variable z to the data: p(x, z)
- To guess z, we want the posterior p(z | x) = p(x, z) / p(x)
- The problem is one of computing p(x) = \int p(x, z) dz, or, just as good, log p(x)
Approximating p(z | x)
- The integral in the expression for p(x) may not be easily computed
- But we might be able to get by with an approximation for p(x, z)
- We'll focus on approximating only part of it: a distribution q(z) standing in for p(z | x)
Choosing q
- How to choose q? Ideally, we want the q that is closest to p
- Define a lower bound on log p(x)
- Make this a function of q
- Maximize the lower bound to make it as tight as possible
- Choose q accordingly
Bounding the Log Likelihood with Jensen's Inequality
- Jensen's inequality: f(E[y]) >= E[f(y)] where f is concave
- Applied with f = log, this bounds the log likelihood from below (see the derivation sketch below)
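As a sketch, the bound these build steps construct, using Jensen's inequality with the concave f = log and an arbitrary distribution q(z):

\log p(x) = \log \int p(x,z)\,dz
          = \log \int q(z)\,\frac{p(x,z)}{q(z)}\,dz
          \;\ge\; \int q(z)\,\log \frac{p(x,z)}{q(z)}\,dz \;\equiv\; F[q]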
The Lower Bound
- We can't calculate the log likelihood, but we can compute the lower bound F[q]
- Maximizing F tightens the lower bound on the likelihood
- What q maximizes F? If q were a variable we could do this by taking derivatives and solving for q
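Expanding the logarithm splits F into the two kinds of term whose derivatives are needed below, an "energy" term and an entropy term:

F[q] = \int q(z)\,\log p(x,z)\,dz \;-\; \int q(z)\,\log q(z)\,dz
     = \mathbb{E}_q[\log p(x,z)] + H[q]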
Functionals: the "Variational" in VB
- Functional: a kind of meta-function that takes a function as input
- We can view F[q] as a functional of q
- The calculus of functionals parallels that of functions
- So we can take the derivative of F[q] with respect to q, set it to 0, and solve for q
Derivatives and Functional Derivatives
- An ordinary derivative measures the change in a function as we perturb its (scalar) argument
- A functional derivative measures the change in a functional as we change its function argument
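A sketch of the standard definition: the functional derivative \delta F / \delta q(z) is defined through its action on an arbitrary test function \phi,

\lim_{\epsilon \to 0} \frac{F[q + \epsilon\phi] - F[q]}{\epsilon}
  = \int \frac{\delta F}{\delta q(z)}\,\phi(z)\,dz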
Useful Derivatives
- The functional derivatives of the two kinds of term appearing in F[q] (see the sketch below)
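As a sketch, the standard identities the next step relies on, for the energy and entropy terms of F:

\frac{\delta}{\delta q(z)} \int q(z')\,g(z')\,dz' = g(z),
\qquad
\frac{\delta}{\delta q(z)} \int q(z')\,\log q(z')\,dz' = \log q(z) + 1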
Calculating q
- Use Lagrange multipliers with the normalization constraint \int q(z) dz = 1 (see the derivation sketch below)
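A sketch of the constrained optimization: add the constraint with multiplier \lambda, take the functional derivative using the identities above, and solve.

L[q] = F[q] + \lambda \Big( \int q(z)\,dz - 1 \Big)

\frac{\delta L}{\delta q(z)} = \log p(x,z) - \log q(z) - 1 + \lambda = 0
\quad\Rightarrow\quad q(z) \propto p(x,z)

Normalizing gives q(z) = p(x,z)/p(x) = p(z | x).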
KL Divergence: An Alternative View
- Maximizing F is minimizing the KL divergence between q(z) and p(z | x)
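The decomposition behind this view: the log likelihood splits into the bound plus the divergence, and log p(x) does not depend on q, so raising F must lower the KL term.

\log p(x) = F[q] + \mathrm{KL}\big(q(z)\,\|\,p(z \mid x)\big),
\qquad \mathrm{KL}(q\,\|\,p) \ge 0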
Optimal q
- The best q(z) is p(z | x)
Where are we?
- We've bounded the likelihood (Jensen's inequality)
- Made this bound tight (Lagrange multipliers)
- But the best approximation is no approximation at all!
- We need to constrain q so that it's tractable
Optimal q in an Imperfect World
- We can't compute q(z) = p(z | x) directly
- Instead, constrain the domain of F[q] to some set of more tractable functions
- This is usually done by making independence assumptions
- The mean field assumption: cut all dependencies
Example 2: Mean Field Assumption
- We have some observed data: x
- We have a model relating latent variables z and θ to the data: p(x, z, θ)
- To guess z and θ we need p(z, θ | x), but the integral in p(x) is hard!
- Apply the mean field assumption: q(z, θ) = q_z(z) q_θ(θ)
The New Lower Bound
- Apply the mean field assumption to F (see the sketch below)
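Substituting the factorized q into F gives a new functional of the two factors:

F[q_z, q_\theta] = \iint q_z(z)\,q_\theta(\theta)\,
  \log \frac{p(x, z, \theta)}{q_z(z)\,q_\theta(\theta)}\,dz\,d\theta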
The Benefit of Independence
- The integrals get simpler
- In fact, some integrals reduce to 1 by normalization and go away
Optimizing the Lower Bound
- Optimize F with respect to each factor in turn, holding the other fixed
Optimal q_θ(θ)
- Use Lagrange multipliers with the constraint \int q_θ(θ) dθ = 1
Optimal q_z(z)
- Use Lagrange multipliers with the constraint \int q_z(z) dz = 1 (both updates are sketched below)
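A sketch of the resulting coupled updates, each the standard mean-field fixed point stated for this two-factor case:

q_\theta(\theta) \propto \exp\big( \mathbb{E}_{q_z}[\log p(x, z, \theta)] \big),
\qquad
q_z(z) \propto \exp\big( \mathbb{E}_{q_\theta}[\log p(x, z, \theta)] \big)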
The Approximation
- q(z, θ) = q_z(z) q_θ(θ) ≈ p(z, θ | x)
Estimating Parameters
- Now we have our approximation q; we need to compute the expectations
- Use an EM-like procedure, alternating between the two updates
- It was hard to do this for p(z, θ | x)
- It's (hopefully) easy for q(z, θ), if we've defined p to make use of conjugacy and chosen the right constraint for q
Calculating F
- As a side effect of inference, we already have one of the needed expectations: it's the log of the normalization constant for q(z)
- So we really only need two more expectations
Uses for F
- We can often use F in cases where we would normally use the log likelihood
- Measuring convergence: there is no guarantee we maximize the likelihood, but we do have F
- Model selection: choose the model with the highest lower bound
- Selecting the number of clusters: pick the number that gives us the highest lower bound
- Parameter optimization: again, optimize the lower bound w.r.t. the parameters
Worked Example: Dirichlet-Multinomial Mixture Model
Dirichlet-Multinomial Mixture Model
[Plate diagram: hyperparameter α over mixture proportions φ; hyperparameter β over per-topic word distributions π (plate of size K); latent assignment z and observation x (plate of size N)]
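One standard generative process consistent with this plate diagram and with the later slides, where φ is the vector of topic proportions and each π_j is a topic's word distribution; the exact notation here is an assumption:

\phi \sim \mathrm{Dir}(\alpha), \qquad
\pi_j \sim \mathrm{Dir}(\beta) \ \ (j = 1,\dots,K)

z_n \mid \phi \sim \mathrm{Mult}(\phi), \qquad
x_n \mid z_n, \pi \sim \mathrm{Mult}(\pi_{z_n}) \ \ (n = 1,\dots,N)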
The Intractable Integral
- p(x) = \iint \sum_z p(x, z, \phi, \pi) \, d\phi \, d\pi, which couples all the latent variables
The Mean Field Assumption
- q(z, \phi, \pi) = q_z(z) \, q_\phi(\phi) \, q_\pi(\pi)
Optimizing F
- Apply Lagrange multipliers just like example 2
- In this case, we have simply replaced z, x, and θ with vectors
- The math is exactly the same
- But we need to find the expectations we skipped before
- Plug in the Dirichlet and multinomial distributions
Optimal q(z, θ)
- Borrowed from example 2 (see the "Optimal q_θ(θ)" and "Optimal q_z(z)" slides above)
- All we need to do is apply the particulars of the mixture model
Optimal q_θ(θ)
- Here θ = (φ, π), so we need q_φ(φ) and q_π(π) in turn
Optimal q_φ(φ): The Expectation, the Numerator, the Normalization
- The Dirichlet distribution: Dir(φ; α) = (Γ(Σ_j α_j) / Π_j Γ(α_j)) Π_j φ_j^{α_j - 1}
- Conjugacy helps: a Dirichlet prior combined with multinomial counts yields another Dirichlet (see the sketch below)
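A sketch of the resulting update under the assumed notation above, with expected topic counts taken under q_z:

q_\phi(\phi) = \mathrm{Dir}\big(\phi;\ \hat\alpha\big),
\qquad \hat\alpha_j = \alpha_j + \sum_{n=1}^{N} q(z_n = j)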
Optimal q_π(π)
- q(π) is essentially the same as q(φ)
- The only difference is that there are multiple π's
- So, q(π) should be a product of Dirichlets
Optimal q_π(π): The Expectation, the Numerator, the Denominator, Putting Them Together
- The derivation parallels q_φ(φ), with one Dirichlet per topic (see the sketch below)
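A sketch of the per-topic update, with expected topic-word counts, again under the assumed notation:

q_\pi(\pi) = \prod_{j=1}^{K} \mathrm{Dir}\big(\pi_j;\ \hat\beta_j\big),
\qquad \hat\beta_{jw} = \beta_w + \sum_{n=1}^{N} q(z_n = j)\,\mathbb{1}[x_n = w]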
A Useful Standard Result
- The expectation under a Dirichlet of the log of an individual component: E_{Dir(α)}[log φ_j] = ψ(α_j) - ψ(Σ_k α_k), where ψ is the digamma function
Optimal q_z(z)
- Again, borrowed from example 2 (see the "Optimal q_z(z)" slide above)
- Here, we plug in the model definition
Optimal q_z(z): The Expectations
- First, let's work with the simpler multinomial distribution. Side effect: a kind of estimate for the multinomial parameter vector
- Now, the product of multinomials. Side effect: a kind of set of multinomial parameter vectors
- This is essentially the same math required for HMMs and PCFGs
- Putting it together gives the update for each q(z_n) (see the sketch below)
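A sketch of the combined update, applying the digamma result above to the variational Dirichlet parameters:

q(z_n = j) \propto \exp\Big( \psi(\hat\alpha_j) - \psi\big(\textstyle\sum_k \hat\alpha_k\big)
  + \psi(\hat\beta_{j,x_n}) - \psi\big(\textstyle\sum_w \hat\beta_{jw}\big) \Big)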
Implications of the Assumption
- We should get the same result with an even weaker assumption
Inference
- "E-step": expected counts (topic counts and topic-word pair counts)
- "M-step": the proportions (for topic j and topic-word pair j-k), as in the code sketch below
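A minimal runnable sketch of this alternating procedure for the Dirichlet-multinomial mixture, assuming the notation introduced above (φ: topic proportions with prior α; π_j: per-topic word distributions with prior β). The function name, the single-token-per-observation setup, and all variable names are illustrative, not from the slides.

import numpy as np
from scipy.special import digamma

def vb_mixture(x, K, V, alpha=1.0, beta=1.0, n_iters=50, seed=0):
    """x: int array of shape (N,), each entry a word id in {0, ..., V-1}."""
    N = x.shape[0]
    rng = np.random.default_rng(seed)
    # Responsibilities r[n, j] = q(z_n = j), randomly initialized.
    r = rng.dirichlet(np.ones(K), size=N)
    for _ in range(n_iters):
        # "M-step": variational Dirichlet parameters from expected counts
        # (closed form thanks to Dirichlet-multinomial conjugacy).
        alpha_hat = alpha + r.sum(axis=0)                    # (K,) topic counts
        beta_hat = beta + np.stack(
            [np.bincount(x, weights=r[:, j], minlength=V) for j in range(K)]
        )                                                    # (K, V) topic-word counts
        # "E-step": expected log parameters via the digamma identity
        # E[log theta_j] = psi(theta_hat_j) - psi(sum_k theta_hat_k).
        elog_phi = digamma(alpha_hat) - digamma(alpha_hat.sum())
        elog_pi = digamma(beta_hat) - digamma(beta_hat.sum(axis=1, keepdims=True))
        log_r = elog_phi[None, :] + elog_pi[:, x].T          # (N, K)
        log_r -= log_r.max(axis=1, keepdims=True)            # numerical stability
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)
    return r, alpha_hat, beta_hat

# Example usage: 200 observations over a vocabulary of 5 words, 3 topics.
r, alpha_hat, beta_hat = vb_mixture(np.random.randint(0, 5, size=200), K=3, V=5)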
Calculating F
- Also borrowed from example 2 (see the "Calculating F" slide above)
- But we adapt it for the mixture model
Calculating F: The Normalization Constant
- A by-product of computing q(z): the normalization constant of each q(z_n) is already available from the E-step