Another Walkthrough of Variational Bayes. Bevan Jones Machine Learning Reading Group Macquarie University


1 Another Walkthrough of Variational Bayes Bevan Jones Machine Learning Reading Group Macquarie University

2 Variational Bayes? Bayes: Bayes' Theorem, but the integral is intractable! Sampling: Gibbs, Metropolis-Hastings, slice sampling, particle filters. Variational Bayes: change the equations, replacing intractable integrals; this involves searching for a good approximation. Variational: the calculus of variations, a way of searching through a space of functions for the best one.

3 Useful Concepts. Probability/Information Theory: Bayes' Theorem, expectations, Jensen's inequality, KL divergence. Calculus: functionals and functional derivatives, Lagrange multipliers. Logarithms.

4 Outline: the true likelihood; approximating the posterior; the lower bound and a definition for "best"; finding the optimal approximation; functionals and functional derivatives; connection to KL divergence; the mean-field approximation; an inference procedure; Dirichlet-multinomial example.

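5 The (Log) Likelihood: We have some observed data: $x$. We have a model relating latent variable $z$ to the data: $p(x, z)$. To guess $z$ we need the posterior $p(z \mid x) = p(x, z) / p(x)$. The problem is one of computing $p(x) = \int p(x, z)\,dz$ or, just as good, $\log p(x)$.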

6 Approximating $p(z \mid x)$: The integral in the expression for $p(x)$ may not be easily computed. But we might be able to get by with an approximation for $p(x, z)$. We'll focus on approximating only part of it: the posterior $p(z \mid x)$, which we replace with $q(z)$.

7 Choosing q: How to choose q? Ideally, we want the q that is closest to p. Define a lower bound on $\log p(x)$, make this a function of q, maximize the lower bound to make it as tight as possible, and choose q accordingly.

8-11 Bounding the Log Likelihood with Jensen's Inequality: Jensen's inequality states that $f(E[x]) \ge E[f(x)]$ where $f$ is concave.
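
The equations on these slides did not survive extraction. As a hedged reconstruction of the standard argument (the symbol $F[q]$ follows the later slides; the exact layout of the original derivation is an assumption), applying Jensen's inequality with the concave function $\log$ gives:

```latex
\begin{align*}
\log p(x) &= \log \int p(x, z)\,dz
           = \log \int q(z)\,\frac{p(x, z)}{q(z)}\,dz
           = \log E_{q}\!\left[\frac{p(x, z)}{q(z)}\right] \\
          &\ge E_{q}\!\left[\log \frac{p(x, z)}{q(z)}\right]
           = \int q(z)\,\log \frac{p(x, z)}{q(z)}\,dz
           \;\equiv\; F[q].
\end{align*}
```

The bound holds for any distribution $q(z)$ that is nonzero wherever $p(x, z)$ is nonzero.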

12 The Lower Bound: We can't calculate the log likelihood, but we can compute the lower bound F. Maximizing F tightens the lower bound on the likelihood. What q maximizes F? If q were a variable we could do this by taking derivatives and solving for q.

13 Functionals: the "Variational" in VB. Functional: a kind of meta-function that takes a function as input. We can view F[q] as a functional of q. The calculus of functionals parallels that of functions. So we can take the derivative of F[q] with respect to q, set it to 0, and solve for q.

14 Derivatives

15 Functional Derivatives: the change in a functional as we change its function argument.
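
The slide's formula is missing from the transcript. One common way to make "the change in a functional as we change its function argument" precise, offered here as a sketch rather than the author's exact definition, perturbs $q$ by a small multiple of an arbitrary test function $\eta$ (a name introduced here for illustration):

```latex
\[
\left.\frac{d}{d\epsilon}\,F[q + \epsilon\,\eta]\right|_{\epsilon = 0}
  \;=\; \int \frac{\delta F}{\delta q(z)}\;\eta(z)\,dz .
\]
```

The function $\delta F / \delta q(z)$ that makes this hold for every $\eta$ is the functional derivative, playing the role that the gradient plays for ordinary functions.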

16-20 Useful Derivatives
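
The specific derivatives shown on these slides are not recoverable from the transcript. The following standard results are the ones typically needed to differentiate a bound of the form $\int q \log (p / q)$ plus a normalization constraint; treat the selection as an assumption about what the slides covered. Here $g$ is any function that does not depend on $q$:

```latex
\begin{align*}
\frac{\delta}{\delta q(z)} \int q(z')\,g(z')\,dz' &= g(z), \\
\frac{\delta}{\delta q(z)} \int q(z')\,\log q(z')\,dz' &= \log q(z) + 1, \\
\frac{\delta}{\delta q(z)} \int q(z')\,dz' &= 1 .
\end{align*}
```

All three are instances of the rule that for $F[q] = \int L(q(z'), z')\,dz'$, with no dependence on derivatives of $q$, the functional derivative is $\partial L / \partial q$ evaluated at $z$.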

21-24 Calculating q: Use Lagrange multipliers to enforce the constraint that q is normalized, $\int q(z)\,dz = 1$.
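
The intermediate steps were lost with the slide images. A sketch of the usual calculation, assuming the bound $F[q] = \int q(z) \log \frac{p(x, z)}{q(z)}\,dz$ and a single multiplier $\lambda$ for the normalization constraint:

```latex
\begin{align*}
\mathcal{L}[q] &= \int q(z)\,\log \frac{p(x, z)}{q(z)}\,dz
                 + \lambda \left( \int q(z)\,dz - 1 \right), \\
\frac{\delta \mathcal{L}}{\delta q(z)} &= \log p(x, z) - \log q(z) - 1 + \lambda = 0
  \;\;\Longrightarrow\;\; q(z) = e^{\lambda - 1}\,p(x, z), \\
\int q(z)\,dz = 1 &\;\;\Longrightarrow\;\; e^{\lambda - 1} = \frac{1}{p(x)}
  \;\;\Longrightarrow\;\; q(z) = \frac{p(x, z)}{p(x)} = p(z \mid x).
\end{align*}
```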

25 KL Divergence: An Alternative View. Maximizing F is equivalent to minimizing the KL divergence between $q(z)$ and $p(z \mid x)$.
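
The identity behind this statement is standard and can be written out as follows, assuming $F[q] = \int q(z) \log \frac{p(x, z)}{q(z)}\,dz$ and substituting $p(x, z) = p(z \mid x)\,p(x)$:

```latex
\begin{align*}
F[q] &= \int q(z)\,\log \frac{p(z \mid x)\,p(x)}{q(z)}\,dz
      = \log p(x) - \int q(z)\,\log \frac{q(z)}{p(z \mid x)}\,dz \\
     &= \log p(x) - \mathrm{KL}\big(q(z)\,\|\,p(z \mid x)\big).
\end{align*}
```

Since $\log p(x)$ does not depend on $q$, raising $F$ lowers the KL divergence by the same amount, and the gap between the bound and the log likelihood is exactly that divergence.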

26 Optimal q: The best $q(z)$ is $p(z \mid x)$.
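
A small numerical check can make this concrete. The sketch below is not from the slides: it assumes a hypothetical toy model in which $z$ takes three values and $x$ is fixed, so the joint $p(x, z)$ is just three made-up numbers. It computes $F$ for a few choices of $q$ and compares it with $\log p(x)$.

```python
import math

# Hypothetical toy model: z takes 3 values and x is fixed, so the joint
# p(x, z) is just three numbers (they need not sum to 1).
joint = [0.10, 0.25, 0.05]

def lower_bound(q, joint):
    """F[q] = sum_z q(z) * log(p(x, z) / q(z)); terms with q(z) = 0 contribute 0."""
    return sum(qz * math.log(pz / qz) for qz, pz in zip(q, joint) if qz > 0)

def kl(q, p):
    """KL(q || p) for two discrete distributions on the same support."""
    return sum(qz * math.log(qz / pz) for qz, pz in zip(q, p) if qz > 0)

evidence = sum(joint)                         # p(x) = sum_z p(x, z)
log_evidence = math.log(evidence)
posterior = [pz / evidence for pz in joint]   # p(z | x)

uniform = [1 / 3] * 3
skewed = [0.7, 0.2, 0.1]

for name, q in [("uniform", uniform), ("skewed", skewed), ("posterior", posterior)]:
    f = lower_bound(q, joint)
    gap = log_evidence - f
    print(f"{name:9s} F = {f:.4f}  log p(x) - F = {gap:.4f}  KL = {kl(q, posterior):.4f}")
```

For every $q$ the printed gap should match the printed KL divergence, and both should vanish only when $q$ is the posterior.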

27 Where are we? We've bounded the likelihood (Jensen's inequality) and made this bound tight (Lagrange multipliers). But the best approximation is no approximation at all! We need to constrain q so that it's tractable.

28 Optimal q in an Imperfect World: We can't compute $q(z) = p(z \mid x)$ directly. Instead, constrain the domain of F[q] to some set of more tractable functions. This is usually done by making independence assumptions. The mean field assumption: cut all dependencies.

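29 Example 2: Mean Field Assumption. We have some observed data: $x$. We have a model relating latent variables $z$ and $\theta$ to the data: $p(x, z, \theta)$. To guess $z$ and $\theta$ we need $p(z, \theta \mid x)$. But the integral is hard! Apply the mean field assumption: $q(z, \theta) = q_z(z)\,q_\theta(\theta)$.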

30-33 The New Lower Bound: apply the mean field assumption.

34 The Benefit of Independence: The integrals get simpler. In fact, these go away.

35 Optimizing the Lower Bound

36 Optimal $q_\theta(\theta)$: Use Lagrange multipliers to enforce the normalization constraint on $q_\theta$.

37 Optimal $q_z(z)$: Use Lagrange multipliers to enforce the normalization constraint on $q_z$.
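
The resulting formulas are missing from the transcript. The standard mean-field result, stated here as a hedged reconstruction under the factorization $q(z, \theta) = q_z(z)\,q_\theta(\theta)$, is that each optimal factor is the exponentiated expectation of the log joint under the other factor:

```latex
\begin{align*}
q_\theta(\theta) &= \frac{\exp\!\big( E_{q_z}[\log p(x, z, \theta)] \big)}
                        {\int \exp\!\big( E_{q_z}[\log p(x, z, \theta')] \big)\,d\theta'}, \\
q_z(z) &= \frac{\exp\!\big( E_{q_\theta}[\log p(x, z, \theta)] \big)}
               {\int \exp\!\big( E_{q_\theta}[\log p(x, z', \theta)] \big)\,dz'} .
\end{align*}
```

Each follows from setting the functional derivative of the bound, plus a normalization multiplier, to zero while holding the other factor fixed. Because each factor depends on the other, the two equations are solved by iterating rather than in closed form.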

38 The Approximation: $q \approx p$

39 Estimating Parameters: Now we have our approximation q. We need to compute the expectations. Use an EM-like procedure, alternating between the two. It was hard to do this for $p(z, \theta \mid x)$. It's (hopefully) easy for $q(z, \theta)$ if we've defined p to make use of conjugacy and if we've chosen the right constraint for q.
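
The alternating procedure can be illustrated on a deliberately tiny problem. The sketch below is an assumption-laden toy, not the slides' model: both $z$ and $\theta$ are discrete (with $\theta$ discretized to three values purely for illustration), the joint table is made up, and the function names are hypothetical. It alternates the two mean-field updates and records the lower bound after each sweep.

```python
import math

# Hypothetical joint p(x, z, theta) for one fixed x, with z in {0, 1} and a
# discretized theta in {0, 1, 2}. joint[z][t] need not sum to 1.
joint = [[0.20, 0.05, 0.01],
         [0.02, 0.10, 0.12]]
NZ, NT = len(joint), len(joint[0])

def normalize_exp(log_weights):
    """Return exp(log_weights) normalized to sum to 1 (numerically stable)."""
    m = max(log_weights)
    w = [math.exp(v - m) for v in log_weights]
    total = sum(w)
    return [v / total for v in w]

def lower_bound(qz, qt):
    """F = E_q[log p(x, z, theta)] - E[log q_z] - E[log q_theta]."""
    f = 0.0
    for z in range(NZ):
        for t in range(NT):
            f += qz[z] * qt[t] * math.log(joint[z][t])
    f -= sum(p * math.log(p) for p in qz if p > 0)
    f -= sum(p * math.log(p) for p in qt if p > 0)
    return f

def update_qz(qt):
    # log q_z(z) = E_{q_theta}[log p(x, z, theta)] + const
    return normalize_exp([sum(qt[t] * math.log(joint[z][t]) for t in range(NT))
                          for z in range(NZ)])

def update_qt(qz):
    # log q_theta(theta) = E_{q_z}[log p(x, z, theta)] + const
    return normalize_exp([sum(qz[z] * math.log(joint[z][t]) for z in range(NZ))
                          for t in range(NT)])

qz = [0.5, 0.5]
qt = [1 / 3] * 3
history = [lower_bound(qz, qt)]
for _ in range(20):
    qz = update_qz(qt)
    qt = update_qt(qz)
    history.append(lower_bound(qz, qt))

log_evidence = math.log(sum(sum(row) for row in joint))
print("F per iteration:", [round(f, 4) for f in history[:5]])
print("final F:", round(history[-1], 4), " log p(x):", round(log_evidence, 4))
```

Each update maximizes the bound over one factor with the other held fixed, so the recorded values should never decrease, and they stay below $\log p(x)$ because the factorized $q$ generally cannot match the true posterior.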

40 Calculating F

41 Calculating F: As a side effect of inference, we already have one of the expectations. It's the log of the normalization constant for $q(z)$. So, we really only need two more expectations.
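
The slide's equations are lost, so the following is only one plausible reading. It assumes the joint factors as $p(x, z, \theta) = p(x, z \mid \theta)\,p(\theta)$ and that $q_z$ has just been set to its optimal form $q_z(z) = \exp\!\big(E_{q_\theta}[\log p(x, z \mid \theta)]\big) / Z_z$, where $Z_z$ (a symbol introduced here) is the normalization constant:

```latex
\begin{align*}
F &= E_{q}[\log p(x, z \mid \theta)] + E_{q_\theta}[\log p(\theta)]
     - E_{q_z}[\log q_z(z)] - E_{q_\theta}[\log q_\theta(\theta)] \\
  &= \log Z_z + E_{q_\theta}[\log p(\theta)] - E_{q_\theta}[\log q_\theta(\theta)],
\end{align*}
```

since $E_{q_z}[\log q_z(z)] = E_{q}[\log p(x, z \mid \theta)] - \log Z_z$. Under these assumptions, $\log Z_z$ comes for free from the $q_z$ update and the two remaining terms are the "two more expectations".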

42 Uses for F: We can often use F in cases where we would normally use the log likelihood. Measuring convergence: there is no guarantee that we maximize the likelihood, but we do have F. Others: model selection (choose the model with the highest lower bound), selecting the number of clusters (pick the number that gives us the highest lower bound), and parameter optimization (again, optimize the lower bound w.r.t. the parameters).

43 Worked Example: Dirichlet-Multinomial Mixture Model

44 Dirichlet-Multinomial Mixture Model [Figure: graphical model with variables α, φ, β, z, π, x and plate sizes K, N]

45 The Intractable Integral

46 The Mean Field Assumption

47 Optimizing F: Apply Lagrange multipliers just like example 2. In this case, we have simply replaced z, x, and θ with vectors. The math is exactly the same, but we need to find the expectations we skipped before. Plug in the Dirichlet and multinomial distributions.

48 Optimal $q(z, \theta)$: Borrowed from example 2 (see slides). All we need to do is apply the particulars of the mixture model.

49 Optimal $q_\theta(\theta)$

50 Optimal $q_\phi(\phi)$: The Expectation

51 Dirichlet Distribution

52 Optimal $q_\phi(\phi)$: The Numerator

53 Optimal $q_\phi(\phi)$: The Normalization

54 Optimal $q_\phi(\phi)$: Conjugacy Helps

55 Optimal $q_\pi(\pi)$: $q(\pi)$ is essentially the same as $q(\phi)$. The only difference is that there are multiple $\pi$'s, so $q(\pi)$ should be a product of Dirichlets.

56 Optimal $q_\pi(\pi)$: The Expectation

57 Optimal $q_\pi(\pi)$: The Numerator

58 Optimal $q_\pi(\pi)$: The Denominator

59 Optimal $q_\pi(\pi)$: Putting Them Together
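
The final expressions did not survive extraction. Under conjugacy the usual outcome is that each factor is a Dirichlet whose parameters are the prior parameters plus expected counts. The sketch below assumes, as the graphical-model slide suggests but the transcript does not confirm, that $\alpha$ is the prior on the topic proportions $\phi$, that $\beta$ is the prior on each topic's word distribution $\pi_j$, and that each observation $x_i$ is a single word; $V$ (vocabulary size) and $n_j$, $n_{jk}$ (counts) are names introduced here:

```latex
\begin{align*}
q_\phi(\phi) &= \mathrm{Dir}\big(\phi \mid \alpha_1 + E[n_1], \ldots, \alpha_K + E[n_K]\big),
  & E[n_j] &= \sum_{i=1}^{N} q(z_i = j), \\
q_\pi(\pi) &= \prod_{j=1}^{K} \mathrm{Dir}\big(\pi_j \mid \beta_1 + E[n_{j1}], \ldots, \beta_V + E[n_{jV}]\big),
  & E[n_{jk}] &= \sum_{i=1}^{N} q(z_i = j)\,[x_i = k].
\end{align*}
```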

60 A Useful Standard Result: the expectation under a Dirichlet of the log of an individual component of a Dirichlet random variable, expressed with the digamma function.
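
The formula itself is missing from the transcript. The standard result, with $\psi$ denoting the digamma function and $\theta \sim \mathrm{Dir}(\alpha_1, \ldots, \alpha_K)$, is:

```latex
\[
E[\log \theta_k] = \psi(\alpha_k) - \psi\Big(\sum_{j=1}^{K} \alpha_j\Big),
\qquad
\psi(a) = \frac{d}{da} \log \Gamma(a).
\]
```

As a quick sanity check, for $\mathrm{Dir}(1, 1)$ each component is uniform on $[0, 1]$, and $\psi(1) - \psi(2) = -1$ matches $\int_0^1 \log \theta\,d\theta = -1$.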

61 Optimal $q_z(z)$: Again, borrowed from example 2 (see slides). Here, we plug in the model definition.

62 Optimal $q_z(z)$: First, let's work with the simpler multinomial distribution. Side effect: a kind of estimate for the multinomial parameter vector.

63 Optimal $q_z(z)$: The Expectations

64 Optimal $q_z(z)$: The Expectations. Now, let's work with the product of multinomials. Side effect: a kind of estimate for the set of multinomial parameter vectors. This is essentially the same math required for HMMs and PCFGs.

65 Optimal $q_z(z)$: The Expectations

66 Optimal $q_z(z)$: Putting It Together

67 Implications of Assumption: We should get the same result with an even weaker assumption.

68 Inference. E-step (expected counts): topic counts and topic-word pair counts. M-step (the proportions): topic j and topic-word pair j-k.
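
The two steps can be sketched in code. This is a minimal illustration under stated assumptions, not the author's implementation: each observation is taken to be a single word id, the priors are symmetric, and the names `vb_mixture`, `expected_log`, the hyperparameter defaults, the initialization, and the toy data are all hypothetical. The `digamma` helper is a stdlib-only approximation (recurrence plus an asymptotic series) included so the block runs without third-party packages.

```python
import math

def digamma(x):
    """Approximate digamma via psi(x) = psi(x + 1) - 1/x and an asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv = 1.0 / x
    inv2 = inv * inv
    return result + math.log(x) - 0.5 * inv - inv2 * (1 / 12 - inv2 * (1 / 120 - inv2 / 252))

def expected_log(params):
    """E[log theta_k] under Dirichlet(params): psi(a_k) - psi(sum of a)."""
    total = digamma(sum(params))
    return [digamma(a) - total for a in params]

def vb_mixture(x, K, V, alpha=1.0, beta=0.5, iters=50):
    """x: list of word ids in range(V). Returns Dirichlet parameters (a, b) and q(z) as r."""
    N = len(x)
    # Slightly asymmetric initial q(z) to break the symmetry between topics.
    r = [[1.0 + (0.5 if (i + j) % K == 0 else 0.0) for j in range(K)] for i in range(N)]
    r = [[v / sum(row) for v in row] for row in r]
    for _ in range(iters):
        # Expected counts under the current q(z).
        topic_counts = [sum(r[i][j] for i in range(N)) for j in range(K)]
        pair_counts = [[0.0] * V for _ in range(K)]
        for i, word in enumerate(x):
            for j in range(K):
                pair_counts[j][word] += r[i][j]
        # Conjugate updates: Dirichlet parameters = prior + expected counts.
        a = [alpha + n for n in topic_counts]
        b = [[beta + n for n in row] for row in pair_counts]
        # q(z_i = j) is proportional to exp(E[log phi_j] + E[log pi_{j, x_i}]).
        elog_phi = expected_log(a)
        elog_pi = [expected_log(row) for row in b]
        for i, word in enumerate(x):
            logw = [elog_phi[j] + elog_pi[j][word] for j in range(K)]
            m = max(logw)
            w = [math.exp(v - m) for v in logw]
            s = sum(w)
            r[i] = [v / s for v in w]
    return a, b, r

words = [0, 0, 1, 0, 2, 3, 3, 2, 3, 0, 1, 2]   # made-up toy data
a, b, r = vb_mixture(words, K=2, V=4)
print("q(phi) parameters:", [round(v, 3) for v in a])
```

The exponentiated expected logs play the role of the "proportions" in the M-step; unlike maximum-likelihood proportions they need not sum to one, which is why $q(z_i)$ is renormalized for each observation.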

69 Calculating F: Also borrowed from example 2 (see slides), but we adapt it for the mixture model.

70 Calculating F

71 Calculating F: The Normalization Constant. A by-product of computing $q_z(z)$.
