STAT 631 / ELEC 639: Graphical Models — Rice University
Variational Bayes Approximation
Instructor: Dr. Volkan Cevher
Scribe: David Kahle
Reviewers: Konstantinos Tsianos and Tahira Saleem

1. Background

These lecture notes are the seventh in a series of lecture notes taken from a course offered in the Fall of 2008 at Rice University entitled Graphical Models. The course was written and instructed by Dr. Volkan Cevher in the Department of Electrical and Computer Engineering. This particular set of notes was taken by David Kahle on September 16, 2008.

2. Introduction

In the last lecture we became particularly interested in making inference on the graph displayed in Figure 1. Here $Z \in \mathbb{R}^m$ and $X \in \mathbb{R}^n$ are random vectors.

[Figure 1. Simple observed directed graph with nodes $Z$ and $X$, where $X$ is observed.]

For the purposes of this lecture, all random vectors will be assumed to exhibit densities with respect to the Lebesgue measure, denoted $p$ and subscripted by the corresponding random vector. For example, the random vector $Z$ exhibits the density $p_Z(z)$. The joint density of the random vectors $X$ and $Z$ is written $p_{X,Z}(x,z)$.

The graphical model presented above is simple in order to emphasize the fact that we wish to make inference regarding the vector $Z$ provided we have information on the vector $X$. To that end, recall the crucial formula derived from the definition of conditional expectation,
\[ p_{X,Z}(x,z) = p_{Z\mid X}(z\mid x)\, p_X(x), \tag{1} \]
which is sometimes referred to as the product rule. As we well know, our primary interest lies in the factor $p_{Z\mid X}(z\mid x)$. The rest of this lecture is concerned with characterizing this density using the method of variational Bayes (VB) approximation. This is the second example of a deterministic scheme for approximating the conditional density for inferential procedures (the first being the Laplace approximation).
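As a small illustration of (1) and of the inferential target $p_{Z\mid X}$, the following Python sketch recovers the conditional by dividing a joint distribution by the marginal computed from it. The notes work with densities, but the manipulation is identical for a discrete toy joint; the table of probabilities below is made up purely for illustration.

    import numpy as np

    # Hypothetical joint p_{X,Z}(x, z) on a 2 x 3 grid (rows: x in {0,1}, cols: z in {0,1,2}).
    # The numbers are invented for illustration; they only need to sum to 1.
    p_xz = np.array([[0.10, 0.20, 0.05],
                     [0.15, 0.30, 0.20]])

    p_x = p_xz.sum(axis=1, keepdims=True)   # marginal p_X(x) = sum_z p_{X,Z}(x, z)
    p_z_given_x = p_xz / p_x                # product rule rearranged: p_{Z|X}(z|x) = p_{X,Z}(x, z) / p_X(x)

    x_obs = 1                               # the observed value of X
    print("p_{Z|X}(. | x=1) =", p_z_given_x[x_obs])   # each row sums to one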
3. Motivation and the Kullback-Leibler Divergence

In this section we derive a pseudo-metric on probability distributions known as the Kullback-Leibler divergence. The idea is that without some measure of the difference between probability measures, we have no benchmark for comparing the accuracy of different approximations. Our result will be the Kullback-Leibler divergence; it will soon become clear why it is called a divergence instead of a distance.

The divergence is derived as follows. Beginning with (1) and taking logs, we obtain
\[ \log p_{X,Z}(x,z) = \log p_{Z\mid X}(z\mid x) + \log p_X(x), \tag{2} \]
which rearranges to
\[ \log p_X(x) = \log p_{X,Z}(x,z) - \log p_{Z\mid X}(z\mid x). \tag{3} \]
Now, recall the property of logs which states $\log a - \log b = \log\frac{a}{c} - \log\frac{b}{c}$ (provided everything is defined appropriately). Appealing to this fact, we introduce another density, $q_Z(z)$, which is an arbitrary probability density (also with respect to the Lebesgue measure). Then (3) grants
\[ \log p_X(x) = \log\frac{p_{X,Z}(x,z)}{q_Z(z)} - \log\frac{p_{Z\mid X}(z\mid x)}{q_Z(z)}, \tag{4} \]
and multiplying both sides by $q_Z(z)$ we obtain
\[ q_Z(z)\log p_X(x) = q_Z(z)\log\frac{p_{X,Z}(x,z)}{q_Z(z)} - q_Z(z)\log\frac{p_{Z\mid X}(z\mid x)}{q_Z(z)}. \tag{5} \]
By integrating with respect to $z$, we have
\[ \int q_Z(z)\log p_X(x)\,dz = \int q_Z(z)\log\frac{p_{X,Z}(x,z)}{q_Z(z)}\,dz - \int q_Z(z)\log\frac{p_{Z\mid X}(z\mid x)}{q_Z(z)}\,dz. \tag{6} \]
This is the key equation for our search. To summarize it we note that the left hand side of (6) is simply $\log p_X(x)$ and use shorthand notation for the two terms on the right hand side (with the second including the negative sign). The shorthand notation is defined as
\begin{align*}
\mathcal{L}(q_Z) &:= \int q_Z(z)\log\frac{p_{X,Z}(x,z)}{q_Z(z)}\,dz = \mathbb{E}_{q_Z}\!\left[\log\frac{p_{X,Z}(X,Z)}{q_Z(Z)}\right],\\
\mathrm{KL}\bigl(q_Z \,\|\, p_{Z\mid X}\bigr) &:= -\int q_Z(z)\log\frac{p_{Z\mid X}(z\mid x)}{q_Z(z)}\,dz = \int q_Z(z)\log\frac{q_Z(z)}{p_{Z\mid X}(z\mid x)}\,dz = \mathbb{E}_{q_Z}\!\left[\log\frac{q_Z(Z)}{p_{Z\mid X}(Z\mid X)}\right],
\end{align*}
where $\mathbb{E}_\Pi$ represents the expected value operator with respect to the probability measure (or equivalent) $\Pi$. Thus, the fundamental relationship is concisely written
\[ \log p_X(x) = \mathcal{L}(q_Z) + \mathrm{KL}\bigl(q_Z \,\|\, p_{Z\mid X}\bigr). \tag{7} \]
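The decomposition (7) can be checked numerically on any model whose marginal is available in closed form. The following Python sketch does so for a toy conjugate Gaussian model, $Z \sim \mathcal{N}(0,1)$ and $X\mid Z \sim \mathcal{N}(Z,1)$, with a Gaussian $q_Z$; the model and the particular $q_Z$ are assumptions chosen only so that $\mathcal{L}(q_Z)$, the KL term, and $\log p_X(x)$ all have closed forms.

    import numpy as np

    # Toy conjugate model (an assumption for illustration): Z ~ N(0, 1), X | Z ~ N(Z, 1).
    # Then p_X = N(0, 2) and p_{Z|X}(. | x) = N(x/2, 1/2), so every term in (7) is available in closed form.
    x = 1.7                      # the observed value of X
    m, s2 = 0.3, 0.8             # an arbitrary Gaussian approximation q_Z = N(m, s2)

    def expected_log_gauss(x_, mean, var):
        # E_q[log N(x_; Z, var)] when Z ~ N(m, s2); only the first two moments of q enter
        return -0.5 * np.log(2 * np.pi * var) - 0.5 * ((x_ - mean) ** 2 + s2) / var

    # L(q_Z) = E_q[log p(x, Z)] - E_q[log q_Z(Z)]
    elbo = (expected_log_gauss(x, m, 1.0)              # E_q[log p(x | Z)]
            + expected_log_gauss(0.0, m, 1.0)          # E_q[log p(Z)], prior N(0, 1)
            - (-0.5 * np.log(2 * np.pi * s2) - 0.5))   # E_q[log q_Z(Z)]

    # KL( N(m, s2) || N(x/2, 1/2) ), the exact posterior
    mu_p, var_p = x / 2.0, 0.5
    kl = 0.5 * np.log(var_p / s2) + (s2 + (m - mu_p) ** 2) / (2 * var_p) - 0.5

    log_evidence = -0.5 * np.log(2 * np.pi * 2.0) - x ** 2 / 4.0   # log N(x; 0, 2)
    print(elbo + kl, log_evidence)    # the two numbers agree, as (7) requires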
The functional $\mathrm{KL}(q_Z \,\|\, p_{Z\mid X})$ is known as the Kullback-Leibler divergence of $p_{Z\mid X}$ from $q_Z$ and is the pseudo-metric which we are seeking. A few properties of KL are that for all valid $p_{Z\mid X}$ and $q_Z$,

1. $\mathrm{KL}(q_Z \,\|\, p_{Z\mid X}) \ge 0$, and
2. $\mathrm{KL}(q_Z \,\|\, p_{Z\mid X}) = 0$ if and only if $q_Z = p_{Z\mid X}$ a.e.

Proofs of both these facts are provided in Lemma 3.1 of [1].^1 However, note that the Kullback-Leibler divergence is not symmetric, and thus not a true distance metric.

To conclude this section it is instructive to see the meaning of the KL divergence with an example. In Figure 2 we plot an attempt to approximate a normal distribution $p = \mathcal{N}(2, 0.5)$ with three different log-normals $q_1, q_2, q_3$. Observe that the approximation that visually seems more accurate also has the smallest KL divergence. Moreover, it is worth noting that for the case where $\mathrm{KL}(p \,\|\, q_2) = 0.1115$ we have $\mathrm{KL}(q_2 \,\|\, p) = 0.1399$.

[Figure 2. Three different approximations. The better the approximation, the smaller the KL divergence.]

^1 Note that what we are labeling the divergence and what Kullback and Leibler label the divergence are different functionals. Our definition of the KL divergence is consistent with the literature, despite Kullback and Leibler defining it differently.
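To make the asymmetry in this example concrete, the following Python sketch computes both directions of the KL divergence between $p = \mathcal{N}(2, 0.5)$ and a log-normal approximation by numerical integration. The log-normal parameters (and the reading of $0.5$ as a variance) are illustrative guesses, not the choices behind Figure 2, so the resulting numbers will differ from those quoted above.

    import numpy as np
    from scipy import stats
    from scipy.integrate import quad

    # p = N(2, 0.5); here 0.5 is taken to be the variance (an assumption about the convention).
    p = stats.norm(loc=2.0, scale=np.sqrt(0.5))
    # An illustrative log-normal approximation q; these parameters are a guess, not those of Figure 2.
    q = stats.lognorm(s=0.35, scale=2.0)

    def kl(f, g, lo=1e-6, hi=20.0):
        # KL(f || g) = integral of f(z) log(f(z)/g(z)) dz, computed numerically on (lo, hi).
        # Integrating over z > 0 ignores the tiny mass p places on z <= 0, where the log-normal vanishes.
        integrand = lambda z: f.pdf(z) * (f.logpdf(z) - g.logpdf(z))
        return quad(integrand, lo, hi, limit=200)[0]

    print("KL(p || q) =", kl(p, q))
    print("KL(q || p) =", kl(q, p))   # generally a different number: KL is not symmetric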
4. Variational Bayes Approximation

Recall that the choice of $q_Z(z)$ is entirely arbitrary so long as it is a probability density with respect to the Lebesgue measure. The idea in variational Bayes (VB) approximation is to select a simple approximation $q_Z$ to the complicated conditional density $p_{Z\mid X}$. The Kullback-Leibler divergence gives us a measure of the discrepancy between $q_Z$ and $p_{Z\mid X}$; so the goal is to find an approximation $q_Z$ which minimizes $\mathrm{KL}(q_Z \,\|\, p_{Z\mid X})$.

Further consideration of (7) in light of our new task will prove beneficial. Note that the left hand side does not vary with $z$; moreover, in our graph we are considering an experiment where we observe $x$, so the quantity is fixed.^2 It is generally referred to as the log marginal likelihood or the log evidence. Since it does not vary with $q_Z$, it is clear that the functionals $\mathcal{L}$ and $\mathrm{KL}$ are inversely related. Therefore, a minimization of $\mathrm{KL}$ amounts to a maximization of $\mathcal{L}$, i.e.,
\[ q^*_Z := \arg\min_{q_Z \in \mathcal{Q}} \mathrm{KL}\bigl(q_Z \,\|\, p_{Z\mid X}\bigr) = \arg\max_{q_Z \in \mathcal{Q}} \mathcal{L}(q_Z), \tag{8} \]
where $q^*_Z$ is our approximation of interest and $\mathcal{Q}$ denotes any set of valid probability densities. We will refer to $q^*_Z$ as the $\mathcal{Q}$-VB approximation. It is also sometimes useful to note that by the first property of KL, $\log p_X(x) \ge \mathcal{L}(q_Z)$, and thus $e^{\mathcal{L}(q_Z)}$ provides a lower bound for the marginal density of $X$.

Finding $q^*_Z$ when $\mathcal{Q}$ is the set of all probability densities is in general a difficult task. To make our analysis more tractable we can impose an independence structure on the random vector $Z$; that is, we will only consider $q_Z$'s which come from the set
\[ \mathcal{Q} := \Bigl\{\, q_Z(z) : q_Z(z) = \prod_{i=1}^m q_{Z_i}(z_i) \,\Bigr\}, \tag{9} \]
where $q_{Z_i}(z_i)$ is the probability density of $Z_i$, the $i$th element of $Z$. Sometimes even this task proves difficult and more restrictions are imposed to make $\mathcal{Q}$ even smaller. Such techniques are referred to as restricted variational Bayes (R-VB) techniques. For example, we could add the additional requirement that each $q_{Z_i}(z_i)$ is in the exponential family.

^2 In this set of notes upper case Roman characters such as $X$ denote random vectors, while lower case Roman characters such as $x$ (outside a density or integral equation) denote a single observation of $X$, as is common in the statistics literature.
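For concreteness, here is a purely illustrative instance of (9) and of a restricted family for a two-dimensional latent vector $Z = (Z_1, Z_2)$; the symbol $\mathcal{Q}_{\mathrm{R}}$ is ad hoc notation introduced only for this example:
\[
\mathcal{Q} = \Bigl\{\, q_Z : q_Z(z_1, z_2) = q_{Z_1}(z_1)\, q_{Z_2}(z_2) \,\Bigr\},
\qquad
\mathcal{Q}_{\mathrm{R}} = \Bigl\{\, q_Z : q_Z(z_1, z_2) = \mathcal{N}\bigl(z_1 \mid m_1, s_1^2\bigr)\, \mathcal{N}\bigl(z_2 \mid m_2, s_2^2\bigr) \,\Bigr\}.
\]
The mean-field family $\mathcal{Q}$ leaves the functional form of each factor free (it will be determined by (10) below), while $\mathcal{Q}_{\mathrm{R}}$ further fixes each factor to be Gaussian, so the optimization is over the four numbers $m_1, s_1^2, m_2, s_2^2$; in either case, a joint density with correlation between $Z_1$ and $Z_2$ lies outside the family.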
For the rest of these notes, we will take $\mathcal{Q}$ as defined in (9) for our $\mathcal{Q}$ in (8). To find (8), we begin by looking at $\mathcal{L}(q_Z)$. In particular, it will be beneficial to look at the $j$th factor $q_{Z_j}$; for this reason, our mathematics is aimed at separating out the factors which depend on $q_{Z_j}$. Presented in the equations which follow is the derivation in terms of expectations. An equivalent form of the first part of the derivation in terms of integrals is provided in the Appendix.

So, from (9) and Fubini's theorem,^3 and writing $\mathbb{E}_{q_{Z_{\setminus j}}}$ for the expectation with respect to $\prod_{i\neq j} q_{Z_i}$, we have
\begin{align*}
\mathcal{L}(q_Z) &= \mathbb{E}_{q_Z}\!\left[\log\frac{p_{X,Z}(X,Z)}{q_Z(Z)}\right]\\
&= \mathbb{E}_{q_Z}\bigl[\log p_{X,Z}(X,Z) - \log q_Z(Z)\bigr]\\
&= \mathbb{E}_{q_Z}\Bigl[\log p_{X,Z}(X,Z) - \sum_{k=1}^m \log q_{Z_k}(Z_k)\Bigr]\\
&= \mathbb{E}_{q_Z}\bigl[\log p_{X,Z}(X,Z)\bigr] - \sum_{k=1}^m \mathbb{E}_{q_{Z_k}}\bigl[\log q_{Z_k}(Z_k)\bigr]\\
&= \mathbb{E}_{q_{Z_j}}\Bigl[\mathbb{E}_{q_{Z_{\setminus j}}}\bigl[\log p_{X,Z}(X,Z)\bigr]\Bigr] - \mathbb{E}_{q_{Z_j}}\bigl[\log q_{Z_j}(Z_j)\bigr] - \sum_{k\neq j}\mathbb{E}_{q_{Z_k}}\bigl[\log q_{Z_k}(Z_k)\bigr]\\
&= \mathbb{E}_{q_{Z_j}}\Bigl[\log \exp\Bigl(\mathbb{E}_{q_{Z_{\setminus j}}}\bigl[\log p_{X,Z}(X,Z)\bigr]\Bigr) - \log q_{Z_j}(Z_j)\Bigr] - \sum_{k\neq j}\mathbb{E}_{q_{Z_k}}\bigl[\log q_{Z_k}(Z_k)\bigr]\\
&= -\,\mathrm{KL}\Bigl(q_{Z_j}\,\Big\|\,\exp\Bigl(\mathbb{E}_{q_{Z_{\setminus j}}}\bigl[\log p_{X,Z}(X,Z)\bigr]\Bigr)\Bigr) - \sum_{k\neq j}\mathbb{E}_{q_{Z_k}}\bigl[\log q_{Z_k}(Z_k)\bigr],
\end{align*}
where the KL term is understood up to an additive constant that does not depend on $q_{Z_j}$, since $\exp(\mathbb{E}_{q_{Z_{\setminus j}}}[\log p_{X,Z}(X,Z)])$ need not integrate to one in $z_j$.

Now, we know from KL's first property that it is never negative. Thus, to maximize $\mathcal{L}(q_Z)$ with respect to $q_{Z_j}$ we need to minimize the KL term in the last equation. From KL's second property, we know that it is minimized precisely when its two arguments agree a.e. This gives the nice formula we know must hold for the probability density $q^*_{Z_j}$ — the $j$th factor of $q^*_Z$ —
\[ q^*_{Z_j}(z_j) \;\propto\; \exp\Bigl(\mathbb{E}_{q_{Z_{\setminus j}}}\bigl[\log p_{X,Z}(X,Z)\bigr]\Bigr), \tag{10} \]
where here $Z = (Z_1,\ldots,Z_{j-1},\,z_j,\,Z_{j+1},\ldots,Z_m)$, i.e., the $j$th coordinate is held fixed at $z_j$ while the expectation is taken over the remaining coordinates.

^3 The theorem states that if $\int_{A\times B} |f(x,y)|\,d(x,y) < \infty$, then $\int_A\bigl(\int_B f(x,y)\,dy\bigr)dx = \int_B\bigl(\int_A f(x,y)\,dx\bigr)dy = \int_{A\times B} f(x,y)\,d(x,y)$.
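Since the right-hand side of (10) is only specified up to the constant absorbed by normalization, it is often written in the following two equivalent forms, the second of which is the one used in the example of the next section:
\[
q^*_{Z_j}(z_j) = \frac{\exp\Bigl(\mathbb{E}_{q_{Z_{\setminus j}}}\bigl[\log p_{X,Z}(x, Z)\bigr]\Bigr)}{\displaystyle\int \exp\Bigl(\mathbb{E}_{q_{Z_{\setminus j}}}\bigl[\log p_{X,Z}(x, Z)\bigr]\Bigr)\, dz_j},
\qquad
\log q^*_{Z_j}(z_j) = \mathbb{E}_{q_{Z_{\setminus j}}}\bigl[\log p_{X,Z}(x, Z)\bigr] + \text{const}.
\]
In practice one works with the log form: the expectation is computed, terms not involving $z_j$ are dropped into the constant, and the resulting functional form is recognized as a known density.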
5. Example: A Univariate Gaussian

To understand the aforementioned approximation procedure, we show the analytic computations in the case of a univariate Gaussian distribution. Assume we have a data set $\mathcal{D} = \{x_1,\ldots,x_N\}$ drawn from a Gaussian distribution with unknown mean $\mu$ and precision $\tau$. Given the data, the likelihood function is
\[ p(\mathcal{D}\mid\mu,\tau) = \Bigl(\frac{\tau}{2\pi}\Bigr)^{N/2}\exp\Bigl(-\frac{\tau}{2}\sum_{n=1}^N (x_n-\mu)^2\Bigr). \]
We may have some idea about the distribution of the unknown parameters $\mu,\tau$. To simplify the analysis here, we introduce the following conjugate priors:
\[ p(\mu\mid\tau) = \mathcal{N}\bigl(\mu \mid \mu_0, (\lambda_0\tau)^{-1}\bigr), \qquad p(\tau) = \mathrm{Gamma}(\tau\mid a_0, b_0). \]
Recall that we are interested in estimating the posterior distribution of the unknown variables, which we approximate by $q(\mu,\tau)$. According to the mean-field approximation we assume it takes a factorized form (which is not how the true posterior factorizes):
\[ q(\mu,\tau) = q_\mu(\mu)\, q_\tau(\tau). \]
To find the optimal factor $q_\mu(\mu)$ we apply the formula (10) that we derived above:
\begin{align*}
\log q^*_\mu(\mu) &= \mathbb{E}_\tau\bigl[\log p(\mathcal{D},\mu,\tau)\bigr] + \text{const}\\
&= \mathbb{E}_\tau\bigl[\log\{p(\mathcal{D}\mid\mu,\tau)\,p(\mu\mid\tau)\,p(\tau)\}\bigr] + \text{const}\\
&= \mathbb{E}_\tau\bigl[\log p(\mathcal{D}\mid\mu,\tau) + \log p(\mu\mid\tau) + \log p(\tau)\bigr] + \text{const}\\
&= \mathbb{E}_\tau\bigl[\log p(\mathcal{D}\mid\mu,\tau) + \log p(\mu\mid\tau)\bigr] + \text{const}\\
&= -\frac{\mathbb{E}[\tau]}{2}\Bigl\{\lambda_0(\mu-\mu_0)^2 + \sum_{n=1}^N (x_n-\mu)^2\Bigr\} + \text{const}.
\end{align*}
The next step is to complete the square over $\mu$ to obtain the form of a Gaussian $\mathcal{N}(\mu\mid\mu_N,\lambda_N^{-1})$ for $q_\mu(\mu)$, where
\[ \mu_N = \frac{\lambda_0\mu_0 + N\bar{x}}{\lambda_0 + N}, \qquad \lambda_N = (\lambda_0 + N)\,\mathbb{E}[\tau], \]
with $\bar{x} = \frac{1}{N}\sum_{n=1}^N x_n$. A similar analysis for $q_\tau(\tau)$ shows that it follows a gamma distribution $\mathrm{Gamma}(\tau\mid a_N, b_N)$, where
\[ a_N = a_0 + \frac{N+1}{2}, \qquad b_N = b_0 + \frac{1}{2}\,\mathbb{E}_\mu\Bigl[\sum_{n=1}^N (x_n-\mu)^2 + \lambda_0(\mu-\mu_0)^2\Bigr]. \]
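The closed-form updates above can be iterated directly, since the two factors are coupled: $\lambda_N$ depends on $\mathbb{E}[\tau] = a_N/b_N$, and $b_N$ depends on the first two moments of $\mu$ under $q_\mu$. The following Python sketch is one way to implement the resulting coordinate updates; the synthetic data and the prior hyperparameters $\mu_0, \lambda_0, a_0, b_0$ are arbitrary choices made purely for illustration.

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(loc=1.0, scale=2.0, size=200)   # synthetic data, purely for illustration
    N, xbar = x.size, x.mean()

    mu0, lam0, a0, b0 = 0.0, 1.0, 1.0, 1.0         # arbitrary prior hyperparameters
    E_tau = 1.0                                    # initial guess for E_q[tau]

    for _ in range(50):                            # a fixed number of sweeps; one could test convergence instead
        # q_mu(mu) = N(mu | mu_N, 1/lam_N)
        mu_N = (lam0 * mu0 + N * xbar) / (lam0 + N)
        lam_N = (lam0 + N) * E_tau

        # moments of mu under q_mu, needed for the q_tau update
        E_mu, E_mu2 = mu_N, mu_N ** 2 + 1.0 / lam_N

        # q_tau(tau) = Gamma(tau | a_N, b_N)
        a_N = a0 + (N + 1) / 2.0
        b_N = b0 + 0.5 * (np.sum(x ** 2 - 2.0 * x * E_mu + E_mu2)
                          + lam0 * (E_mu2 - 2.0 * mu0 * E_mu + mu0 ** 2))
        E_tau = a_N / b_N

    print("q_mu approx N(%.3f, %.4f), E[tau] = %.3f" % (mu_N, 1.0 / lam_N, E_tau))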
The same analysis, together with more sophisticated examples, can be found in Chapter 10 of [2]. It is important to note that we can use the formulas derived above to iteratively compute more refined estimates of the model parameters: observe that the parameters of $q_\mu(\mu)$ depend on the mean of $\tau$ under $q_\tau(\tau)$, and vice versa. To conclude this section, we point the reader to the broader literature on variational methods for further reading.

6. Appendix

\begin{align*}
\mathcal{L}(q_Z) &= \int q_Z(z)\log\frac{p_{X,Z}(x,z)}{q_Z(z)}\,dz\\
&= \int \prod_{i=1}^m q_{Z_i}(z_i)\,\log\frac{p_{X,Z}(x,z)}{\prod_{k=1}^m q_{Z_k}(z_k)}\;dz_1\cdots dz_m\\
&= \int \prod_{i=1}^m q_{Z_i}(z_i)\Bigl(\log p_{X,Z}(x,z) - \sum_{k=1}^m \log q_{Z_k}(z_k)\Bigr)\,dz_1\cdots dz_m\\
&= \int \prod_{i=1}^m q_{Z_i}(z_i)\,\log p_{X,Z}(x,z)\,dz_1\cdots dz_m - \int \prod_{i=1}^m q_{Z_i}(z_i)\sum_{k=1}^m \log q_{Z_k}(z_k)\,dz_1\cdots dz_m\\
&= \int q_{Z_j}(z_j)\Bigl(\int \prod_{i\neq j} q_{Z_i}(z_i)\,\log p_{X,Z}(x,z)\,dz_{\setminus j}\Bigr)dz_j - \sum_{k=1}^m \int \log q_{Z_k}(z_k)\,q_{Z_k}(z_k)\,dz_k\\
&= \int q_{Z_j}(z_j)\Bigl(\int \prod_{i\neq j} q_{Z_i}(z_i)\,\log p_{X,Z}(x,z)\,dz_{\setminus j}\Bigr)dz_j - \int \log q_{Z_j}(z_j)\,q_{Z_j}(z_j)\,dz_j - \sum_{k\neq j}\int \log q_{Z_k}(z_k)\,q_{Z_k}(z_k)\,dz_k.
\end{align*}

References

1. S. Kullback and R. A. Leibler, On information and sufficiency, Annals of Mathematical Statistics 22 (1951), no. 1, 79-86.
2. C. M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006.