Scalable Deep Poisson Factor Analysis for Topic Modeling: Supplementary Material


Zhe Gan, Changyou Chen, Ricardo Henao, David Carlson, Lawrence Carin
Department of Electrical and Computer Engineering, Duke University, Durham, NC 27708, USA

A Conditional Densities used in BCDF

Using dot notation to represent marginal sums, e.g., $x_{\cdot nk} \triangleq \sum_p x_{pnk}$, we can write the conditional densities for DPFA as (Zhou & Carin, 2015)
\[
(x_{pn1}, \dots, x_{pnK}) \sim \mathrm{Mult}\left(x_{pn\cdot};\ \zeta_{pn1}, \dots, \zeta_{pnK}\right), \quad (1)
\]
\[
\phi_k \sim \mathrm{Dir}\left(a_\phi + x_{1\cdot k}, \dots, a_\phi + x_{P\cdot k}\right), \qquad
\theta_{kn} \sim \mathrm{Gamma}\left(r_k h^{(1)}_{kn} + x_{\cdot nk},\ p_n\right),
\]
\[
r_k \sim \mathrm{Gamma}\left(\gamma_0 + \sum_{n=1}^N l_{kn},\ \frac{1}{c_0 - \sum_{n=1}^N h^{(1)}_{kn}\ln(1-p_n)}\right), \quad (2)
\]
\[
\gamma_0 \sim \mathrm{Gamma}\left(e_0 + \sum_{k=1}^K l_k,\ \frac{1}{f_0 - \sum_{k=1}^K \ln(1-p'_k)}\right), \quad (3)
\]
\[
h^{(1)}_{kn} \sim \delta(x_{\cdot nk}=0)\,\mathrm{Ber}\!\left(\frac{\pi_{kn}(1-p_n)^{r_k}}{\pi_{kn}(1-p_n)^{r_k} + 1 - \pi_{kn}}\right) + \delta(x_{\cdot nk}>0),
\]
where
\[
l_{kn} \sim \mathrm{CRT}\left(x_{\cdot nk},\ r_k h^{(1)}_{kn}\right), \qquad
l_k \sim \mathrm{CRT}\left(\sum_{n=1}^N l_{kn},\ \gamma_0\right),
\]
\[
\zeta_{pnk} = \frac{\phi_{pk}\theta_{kn}}{\sum_{k=1}^K \phi_{pk}\theta_{kn}}, \qquad
p'_k = \frac{-\sum_{n=1}^N h^{(1)}_{kn}\ln(1-p_n)}{c_0 - \sum_{n=1}^N h^{(1)}_{kn}\ln(1-p_n)}, \qquad
\pi_{kn} = \sigma\!\left(w^{(1)\top}_k h^{(2)}_n + c^{(1)}_k\right),
\]
and CRT denotes the Chinese Restaurant Table distribution. A CRT random variable $l \sim \mathrm{CRT}(m, r)$ can be generated as a sum of independent Bernoulli random variables (Zhou & Carin, 2015):
\[
l = \sum_{n=1}^{m} b_n, \qquad b_n \sim \mathrm{Ber}\!\left(\frac{r}{n-1+r}\right).
\]
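As an illustration, a minimal Python sketch of this CRT draw (our own illustration, not part of the released code; the function name and the use of NumPy are our choices):

    import numpy as np

    def sample_crt(m, r, rng=np.random.default_rng()):
        # Draw l ~ CRT(m, r) as a sum of independent Bernoulli variables,
        # where the n-th Bernoulli has success probability r / (n - 1 + r).
        n = np.arange(1, m + 1)
        return int(rng.binomial(1, r / (n - 1 + r)).sum())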

B Proof of Theorem 1

Consider a general stochastic differential equation of the form
\[
\mathrm{d}\Gamma = Q(\Gamma)\,\mathrm{d}t + D(\Gamma)\,\mathrm{d}W, \quad (4)
\]
where $\Gamma \in \mathbb{R}^N$, $Q: \mathbb{R}^N \to \mathbb{R}^N$ and $D: \mathbb{R}^N \to \mathbb{R}^{N\times P}$ are measurable functions, with $P$ not necessarily equal to $N$, and $W$ is standard $P$-dimensional Brownian motion. Taking $\Gamma = (\theta, v, \xi)$, $Q(\Gamma) = \left(v,\ \tilde f - \xi v,\ \tfrac{1}{M}v^\top v - 1\right)$, and $D(\Gamma)$ to be constant recovers the setting of the original stochastic gradient Nosé-Hoover thermostat algorithm (Ding et al., 2014) in the notation of this paper. Furthermore, write the joint distribution of $\Gamma$ as
\[
p(\Gamma) = \frac{1}{Z}\exp\{-H(\Gamma)\},
\]
where $H(\Gamma)$ is usually called the Hamiltonian function of the system. In the following we decompose $\Gamma$ as $\Gamma = (\theta, x)$ and $H(\Gamma)$ as $H(\Gamma) = U(\theta) + E(\theta, x)$, where $U(\theta)$ may be the negative log-posterior of a Bayesian model. To prove Theorem 1 in the main text, we make use of Lemma 1 below, which is essentially the main theorem of Ding et al. (2014), itself a consequence of the celebrated Fokker-Planck equation (Risken, 1989).

Lemma 1. The stochastic process of $\theta$ generated by the stochastic differential equation (4) has the target distribution $p_\theta(\theta) = \frac{1}{Z}\exp\{-U(\theta)\}$ as its stationary distribution, if $p(\Gamma)$ satisfies the following marginalization condition:
\[
\exp\{-U(\theta)\} \propto \int \exp\{-U(\theta) - E(\theta, x)\}\,\mathrm{d}x, \quad (5)
\]
and if the following condition is also satisfied:
\[
\nabla \cdot (p\,Q) = \nabla\nabla^\top : \left(\tfrac{1}{2}\,p\,DD^\top\right), \quad (6)
\]
where $\nabla \triangleq (\partial/\partial\theta,\ \partial/\partial x)$, "$\cdot$" represents the vector inner product and "$:$" represents the matrix double dot product, i.e., $X : Y \triangleq \mathrm{tr}(X^\top Y)$.

Proof of Theorem 1. For conciseness, we replace the model parameters $\Psi^g$ of the main text with the notation $\theta$ in the following. We first reformulate our proposed SGNHT as a special case of the general SDE in (4), i.e.,
\[
\Gamma = (\theta, v, \xi_1, \dots, \xi_M), \qquad
Q(\Gamma) = \left(v,\ \tilde f - \Xi v,\ v_1^2 - 1, \dots, v_M^2 - 1\right), \qquad
D(\Gamma) = \mathrm{diag}\big(\underbrace{0, \dots, 0}_{M},\ \underbrace{\sqrt{2D}, \dots, \sqrt{2D}}_{M},\ \underbrace{0, \dots, 0}_{M}\big),
\]
so that $\tfrac{1}{2}D(\Gamma)D(\Gamma)^\top = \mathrm{diag}(0, \dots, 0,\ D, \dots, D,\ 0, \dots, 0)$, where $\Xi = \mathrm{diag}(\xi_1, \dots, \xi_M)$. From Theorem 1 in the main text we know
\[
p(\Gamma) = \frac{1}{Z}\exp\left\{-\tfrac{1}{2}v^\top v - U(\theta) - \tfrac{1}{2}\mathrm{tr}\left[(\Xi - D)^\top(\Xi - D)\right]\right\}, \quad (7)
\]
with $H(\Gamma) = \tfrac{1}{2}v^\top v + U(\theta) + \tfrac{1}{2}\mathrm{tr}\left[(\Xi - D)^\top(\Xi - D)\right]$. The marginalization condition (5) is trivially satisfied, so we are left to verify condition (6). Substituting $p$ and $Q$ into (6), and writing $\tilde f = -\nabla_\theta U(\theta)$, the left-hand side is
\[
\mathrm{LHS} = \sum_i \frac{\partial}{\partial\Gamma_i}(p\,Q_i)
= \sum_i \left(p\,\frac{\partial Q_i}{\partial\Gamma_i} + Q_i\,\frac{\partial p}{\partial\Gamma_i}\right)
= \sum_i \left(\frac{\partial Q_i}{\partial\Gamma_i} - Q_i\,\frac{\partial H}{\partial\Gamma_i}\right) p .
\]
The $\theta$- and $v$-components contribute $-\sum_i \xi_i - v^\top\nabla_\theta U(\theta) - v^\top(\tilde f - \Xi v) = -\sum_i \xi_i + v^\top \Xi v$, and the $\xi$-components contribute $-\sum_i(\xi_i - D)(v_i^2 - 1)$, so that
\[
\mathrm{LHS} = \left(-\sum_i \xi_i + v^\top \Xi v - \sum_i (\xi_i - D)(v_i^2 - 1)\right) p = \sum_i D\,(v_i^2 - 1)\,p .
\]
Since $D(\Gamma)$ is independent of $\Gamma$, the right-hand side is
\[
\mathrm{RHS} = \sum_{i,j}\frac{\partial^2}{\partial\Gamma_i\partial\Gamma_j}\left(\tfrac{1}{2}\,p\,(DD^\top)_{ij}\right)
= \sum_i D\,\frac{\partial^2 p}{\partial v_i^2}
= \sum_i D\,\frac{\partial}{\partial v_i}(-v_i\,p)
= \sum_i D\,(v_i^2 - 1)\,p = \mathrm{LHS}.
\]
Having verified both conditions of Lemma 1, the distribution in (7) is the equilibrium distribution of our proposed SGNHT algorithm.
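For intuition, a minimal Euler-discretized sketch of the SGNHT updates analyzed above (our own illustration with step size h; stochgrad_U is a placeholder for a stochastic gradient of U, not a function from the released code):

    import numpy as np

    def sgnht_step(theta, v, xi, stochgrad_U, h, D, rng=np.random.default_rng()):
        # One Euler step of the SGNHT dynamics with state Gamma = (theta, v, xi).
        noise = np.sqrt(2.0 * D * h) * rng.standard_normal(v.shape)
        v = v - h * stochgrad_U(theta) - h * xi * v + noise   # momentum with thermostat friction
        theta = theta + h * v                                 # parameter update
        xi = xi + h * (v * v - 1.0)                           # per-dimension thermostat update
        return theta, v, xi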

C Sampling Topic-Word Distributions using the Stochastic Gradient Riemannian Langevin Dynamics

The Langevin diffusion is defined via a stochastic differential equation of the following form:
\[
\mathrm{d}\theta_t = \tfrac{1}{2}\nabla_{\theta_t}\log U(\theta_t)\,\mathrm{d}t + \mathrm{d}W_t, \quad (8)
\]
where $t$ is the time index, $\theta_t \in \mathbb{R}^M$ is the model parameter, $U(\theta_t) \propto \prod_{i=1}^N p(x_i \mid \theta_t)\,p(\theta_t)$ is the model posterior, and $W_t$ is standard $M$-dimensional Brownian motion. The law of the Langevin diffusion is described by the Fokker-Planck equation (Risken, 1989), and the equilibrium distribution $p(\theta)$ of $\theta_t$ can be shown to be the model posterior $U(\theta)$ (Øksendal, 2003).

In Bayesian learning with big data, the stochastic gradient Langevin dynamics (SGLD) (Welling & Teh, 2011) generalizes the Langevin dynamics (8) by replacing $U(\theta_t)$ with a stochastic version $\tilde{U}(\theta_t)$ evaluated on a subset of the data, e.g., $\tilde{U}(\theta_t) \propto \prod_{i\in\tilde{\mathcal{D}}} p(x_i \mid \theta_t)\,p(\theta_t)$, where $\tilde{\mathcal{D}}$ is a random subset of $\{1, \dots, N\}$. The corresponding SDE is then solved via the Euler-Maruyama scheme (Tuckerman, 2010) with a decreasing step-size sequence (Welling & Teh, 2011); that is, samples of SGLD are generated as
\[
\theta_{t+1} = \theta_t + \tfrac{\delta_t}{2}\,\nabla_\theta\log\tilde{U}(\theta_t) + \zeta_t, \qquad \zeta_t \sim \mathcal{N}(0, \delta_t I),
\]
where $I$ is the identity matrix and $\{\delta_t\}$ is a decreasing sequence such that $\lim_{t\to\infty}\delta_t = 0$ and $\sum_{t=1}^\infty \delta_t = \infty$. Under certain assumptions, it is shown that SGLD is consistent with the Langevin dynamics of (8), i.e., it generates correct samples from the posterior $U(\theta)$ (Teh et al., 2014; Vollmer et al., 2015).

The stochastic gradient Riemannian Langevin dynamics (SGRLD) (Patterson & Teh, 2013) extends SGLD by defining it on a Riemannian manifold (Girolami & Calderhead, 2011; Byrne & Girolami, 2013). Specifically, given a Riemannian metric $G(\theta)$ (Jürgen, 2008), SGRLD generalizes the Langevin dynamics (8) to the manifold and generates samples using the proposal
\[
\theta_{t+1} = \theta_t + \tfrac{\delta_t}{2}\,\mu(\theta_t) + G(\theta_t)^{-1/2}\zeta_t, \qquad \zeta_t \sim \mathcal{N}(0, \delta_t I),
\]
where the $j$-th component of $\mu(\theta_t)$ is given by
\[
\mu(\theta_t)_j = \left(G^{-1}(\theta_t)\nabla_{\theta_t}\log\tilde{U}(\theta_t)\right)_j
- 2\sum_{k=1}^M \left(G^{-1}(\theta_t)\frac{\partial G(\theta_t)}{\partial\theta_{tk}}G^{-1}(\theta_t)\right)_{jk}
+ \sum_{k=1}^M \left(G^{-1}(\theta_t)\right)_{jk}\mathrm{tr}\left(G^{-1}(\theta_t)\frac{\partial G(\theta_t)}{\partial\theta_{tk}}\right). \quad (9)
\]
SGRLD moves along a Riemannian manifold defined by the metric $G(\theta)$, and thus has the advantage of converging faster towards the optimal solution than the naive SGLD. In the Bayesian setting, the Fisher information matrix (Rao, 1945; Amari, 1990; 1997) is usually chosen as the metric $G(\theta)$, which has convenient forms for some models.

SGRLD for Deep Poisson Factor Analysis. We use the SGRLD algorithm to sample the topic-word distributions $\{\phi_k\}$. From Section A we see that, based on the augmentation of $x_{pnk}$, the posterior of $\phi_k$ is a Dirichlet distribution, i.e., the joint conditional likelihood of $(\{x_{pnk}\}, \phi_k)$ satisfies
\[
p(\{x_{pnk}\}, \phi_k \mid -) \propto \prod_p \phi_{pk}^{x_{p\cdot k}} .
\]
We reparameterize $\phi_k$ using the expanded-mean schema (Patterson & Teh, 2013) as $\phi_{pk} = \tilde\phi_{pk} / \sum_{p'} \tilde\phi_{p'k}$. Together with the prior on $\phi_k$, $\phi_k \sim \mathrm{Dir}(a_\phi)$, this gives a stochastic gradient on a subset of data $\tilde{\mathcal{D}}$ as
\[
\frac{\partial \log p(\tilde\phi_k \mid -)}{\partial \tilde\phi_{pk}}
= \frac{a_\phi - 1}{\tilde\phi_{pk}} - 1
+ \frac{N}{|\tilde{\mathcal{D}}|}\sum_{n\in\tilde{\mathcal{D}}} \mathbb{E}_{\{x_{pnk}\}}\!\left[\frac{x_{pnk}}{\tilde\phi_{pk}} - \frac{x_{\cdot nk}}{\tilde\phi_{\cdot k}}\right],
\]
where the expectation is taken over the posterior of $\{x_{pnk}\}$, which is infeasible to compute exactly. As a result, we use samples from the posterior to approximate the expectation; to this end, we use the conditional posterior of $\{x_{pnk}\}$ in (1) to collect samples for the current mini-batch $\tilde{\mathcal{D}}$ after a few burn-in steps.

To apply the SGRLD algorithm, we need to choose a metric $G(\theta)$ for the Riemannian manifold. Because the posterior of $\phi_k$ is Dirichlet, we use the same Riemannian metric as for LDA (Patterson & Teh, 2013), i.e., $G(\tilde\phi) = \mathrm{diag}(\tilde\phi_{11}, \dots, \tilde\phi_{P1}, \dots, \tilde\phi_{1K}, \dots, \tilde\phi_{PK})^{-1}$, and use the same mirroring idea (Patterson & Teh, 2013) to keep the parameters inside the feasible region. After plugging $G(\theta)$ into (9) and simplifying, we obtain the update for $\tilde\phi_{pk}$:
\[
\tilde\phi_{pk} \leftarrow \tilde\phi_{pk} + \frac{\delta}{2}\left(a_\phi - \tilde\phi_{pk} + \frac{N}{|\tilde{\mathcal{D}}|}\sum_{n\in\tilde{\mathcal{D}}}\mathbb{E}_{\{x_{pnk}\}}\!\left[x_{pnk} - x_{\cdot nk}\,\phi_{pk}\right]\right) + (\tilde\phi_{pk})^{1/2}\,\zeta, \qquad \zeta \sim \mathcal{N}(0, \delta),
\]
where we have omitted the time index $t$ in the parameters for simplicity.

For the other global model parameters, such as $r_k$ and $\gamma_0$, we make use of their posterior distributions in (2) and (3), respectively. We only briefly describe the sampling of $r_k$; sampling $\gamma_0$ is similar. To simplify, rewrite the posterior of $r_k$ as $p(r_k \mid -) = \mathrm{Gamma}\left(a_k + \gamma_0,\ 1/(c_0 + b_k)\right)$, where $a_k$ and $b_k$ are parameters of the Gamma distribution containing local or augmented parameters from the current mini-batch, e.g., $a_k$ contains $\{l_{kn}\}$ for the current mini-batch. Denote these local parameters as $L_x$. We reparameterize $r_k$ as $r_k = e^{\tilde r_k}$, with $\tilde r_k \sim \mathrm{Log\text{-}Gamma}(\gamma_0, 1/c_0)$; this is equivalent to putting a $\mathrm{Gamma}(\gamma_0, 1/c_0)$ prior on $r_k$. The stochastic gradient for $\tilde r_k$ is then
\[
\frac{\partial \log p(\tilde r_k \mid -)}{\partial \tilde r_k}
= \frac{\partial}{\partial \tilde r_k}\,\mathbb{E}_{L_x}\!\left[(a_k + \gamma_0 - 1)\tilde r_k - (c_0 + b_k)e^{\tilde r_k}\right]
= \mathbb{E}_{L_x}\!\left[a_k + \gamma_0 - 1 - (c_0 + b_k)\,r_k\right].
\]
Again, the expectation can be approximated by Monte Carlo integration using samples from the posterior, after which stochastic gradient Langevin dynamics can be applied straightforwardly.
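A minimal sketch of the expanded-mean SGRLD update for the topic-word parameters above, in Python with NumPy (our own illustration; x_batch is an assumed name for Monte Carlo estimates of the mini-batch expectations, and the absolute value implements the mirroring step):

    import numpy as np

    def sgrld_phi_step(phi_tilde, x_batch, N, batch_size, a_phi, delta,
                       rng=np.random.default_rng()):
        # phi_tilde: (P, K) expanded-mean parameters; x_batch: (P, K) posterior
        # estimates of sum_{n in mini-batch} x_{pnk}.
        phi = phi_tilde / phi_tilde.sum(axis=0, keepdims=True)     # normalized topics phi_{pk}
        x_dot = x_batch.sum(axis=0, keepdims=True)                 # x_{.nk} summed over words p
        grad = a_phi - phi_tilde + (N / batch_size) * (x_batch - x_dot * phi)
        noise = np.sqrt(phi_tilde) * rng.normal(0.0, np.sqrt(delta), size=phi_tilde.shape)
        return np.abs(phi_tilde + 0.5 * delta * grad + noise)      # mirroring keeps values positive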
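A corresponding one-step SGLD sketch for the reparameterized $r_k$ (our own illustration, assuming a_k and b_k are the Monte Carlo estimates of the mini-batch statistics described above):

    import numpy as np

    def sgld_logr_step(r_tilde, a_k, b_k, gamma0, c0, delta, rng=np.random.default_rng()):
        # SGLD step for r_tilde = log(r_k), using the stochastic gradient derived above.
        grad = a_k + gamma0 - 1.0 - (c0 + b_k) * np.exp(r_tilde)
        return r_tilde + 0.5 * delta * grad + rng.normal(0.0, np.sqrt(delta))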
D Evaluation Details on Perplexities

For each test document, we randomly partition the words into an 80/20% split. We learn document-specific local parameters using the 80% portion, and then calculate the predictive perplexity on the remaining 20% subset, denoted $Y$. For the PFA-based models, the test perplexity is calculated as (Zhou et al., 2012)
\[
\exp\left(-\frac{1}{y_{\cdot\cdot}}\sum_{p=1}^P\sum_{n=1}^N y_{pn}\log\frac{\sum_{s=1}^S\sum_{k=1}^K \phi^s_{pk}\theta^s_{kn}}{\sum_{s=1}^S\sum_{p=1}^P\sum_{k=1}^K \phi^s_{pk}\theta^s_{kn}}\right),
\]
where $S$ is the total number of collected samples, $y_{\cdot\cdot} = \sum_{p=1}^P\sum_{n=1}^N y_{pn}$, and $y_{pn}$ is an element of matrix $Y$.

The conditional distribution of $y_n$ given $h_n$ in the Replicated Softmax model (RSM) is specified as
\[
y_n \sim \mathrm{Mult}(D_n; \beta_n), \qquad
\beta_{pn} = \frac{\exp(w_p^\top h_n + c_p)}{\sum_{p'=1}^P \exp(w_{p'}^\top h_n + c_{p'})},
\]
where $y_n$ is the $n$-th column of $Y$, $D_n = \sum_{p=1}^P y_{pn}$, $W = [w_1, \dots, w_P] \in \mathbb{R}^{P\times K}$ is the mapping from $h_n$ to $y_n$, and $c = [c_1, \dots, c_P] \in \mathbb{R}^{P\times 1}$ is the bias term. Based on this, the predictive test perplexity for RSM is calculated as
\[
\exp\left(-\frac{1}{y_{\cdot\cdot}}\sum_{p=1}^P\sum_{n=1}^N y_{pn}\log\beta_{pn}\right).
\]
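For concreteness, a small NumPy sketch of the PFA perplexity computation above (our own illustration; the array names are assumptions, with phi_samples and theta_samples holding the S collected posterior samples):

    import numpy as np

    def pfa_test_perplexity(Y, phi_samples, theta_samples):
        # Y: (P, N) held-out word counts; phi_samples: (S, P, K); theta_samples: (S, K, N).
        rates = np.einsum('spk,skn->pn', phi_samples, theta_samples)  # sum over samples s and topics k
        probs = rates / rates.sum(axis=0, keepdims=True)              # per-document word probabilities
        return float(np.exp(-(Y * np.log(probs)).sum() / Y.sum()))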

E Sensitivity Analysis

We examined the sensitivity of the model performance with respect to the mini-batch size used in SGNHT on the three corpora considered. The results are shown in Figure 1. We found that overall performance, both convergence speed and test perplexity, suffers considerably when the batch size is smaller than 10 documents. However, for sufficiently large batch sizes (including for RCV1-v2) we obtain performance comparable to that shown in Tables 1 and 3.

Figure 1. Test perplexities w.r.t. mini-batch size on the three corpora. The number of hidden units in each layer is 128, 64 and 32, respectively. Left: 20 Newsgroups. Middle: RCV1-v2. Right: Wikipedia.

We run the SGNHT algorithm on the RCV1-v2 and Wikipedia datasets long enough that the whole corpora can be traversed. The results are shown in Figure 2. As can be seen, performance improves smoothly as the amount of data processed increases.

Figure 2. Test perplexities as a function of the number of training documents seen. The number of hidden units in each layer is 128, 64 and 32, respectively. Left: RCV1-v2. Right: Wikipedia.

F Additional Results

We compare our DPFA model with stronger shallow baselines, such as the Marked-Gamma-NB and Marked-Beta-NB models described in Zhou & Carin (2015). Direct Gibbs sampling is only suitable for relatively small datasets, so only results on 20 Newsgroups are reported, shown in Table 1. As can be seen, these advanced models can achieve perplexity results comparable to those of a two-layer DPFA model.

Table 1. Additional results on 20 Newsgroups.
MODEL              METHOD   DIM   PERP
MARKED-BETA-NB     GIBBS
MARKED-GAMMA-NB    GIBBS

G Source Code

The source code, along with the topics we inferred from the model, is available at zhegan7/dpfa_icml2015. This package is made publicly available for reproducibility purposes; it is not optimized for speed and is only minimally documented, but it is fully functional.

Figure 3. Full graph induced by the correlation structure learned by DPFA-SBN for the 20 Newsgroups corpus.

Figure 4. Full graph induced by the correlation structure learned by DPFA-SBN for the RCV1-v2 corpus.

References

Amari, S. Differential geometrical methods in statistics. Lecture Notes in Statistics 28, Springer-Verlag, 1990.
Amari, S. Natural gradient works efficiently in learning. Neural Computation, 1997.
Byrne, S. and Girolami, M. Geodesic Monte Carlo on embedded manifolds. Scandinavian J. Statist., 2013.
Ding, N., Fang, Y., Babbush, R., Chen, C., Skeel, R. D., and Neven, H. Bayesian sampling using stochastic gradient thermostats. NIPS, 2014.
Girolami, M. and Calderhead, B. Riemann manifold Langevin and Hamiltonian Monte Carlo methods. J. R. Statist. Soc. B, 2011.
Jürgen, J. Riemannian Geometry and Geometric Analysis. Springer-Verlag, 2008.
Øksendal, B. Stochastic Differential Equations. Springer, 2003.
Patterson, S. and Teh, Y. W. Stochastic gradient Riemannian Langevin dynamics on the probability simplex. NIPS, 2013.
Rao, C. R. Information and accuracy attainable in the estimation of statistical parameters. Bulletin of the Calcutta Mathematical Society, 1945.
Risken, H. The Fokker-Planck Equation. Springer-Verlag, New York, 1989.
Teh, Y. W., Thiery, A. H., and Vollmer, S. J. Consistency and fluctuations for stochastic gradient Langevin dynamics. arXiv preprint, 2014.
Tuckerman, M. E. Statistical Mechanics: Theory and Molecular Simulation. Oxford University Press, 2010.
Vollmer, S. J., Zygalakis, K. C., and Teh, Y. W. (Non-)asymptotic properties of stochastic gradient Langevin dynamics. arXiv preprint, 2015.
Welling, M. and Teh, Y. W. Bayesian learning via stochastic gradient Langevin dynamics. ICML, 2011.
Zhou, M. and Carin, L. Negative binomial process count and mixture modeling. PAMI, 2015.
Zhou, M., Hannah, L., Dunson, D., and Carin, L. Beta-negative binomial process and Poisson factor analysis. AISTATS, 2012.
