Scalable Parallel EM Algorithms for Latent Dirichlet Allocation in Multi-Core Systems

Xiaosheng Liu (1,2), Jia Zeng (1,2,3), Xi Yang (1,2), Jianfeng Yan (1,2), Qiang Yang (3,4)
(1) School of Computer Science and Technology, Soochow University, Suzhou, China
(2) Collaborative Innovation Center of Novel Software Technology and Industrialization
(3) Huawei Noah's Ark Lab, Hong Kong
(4) Department of Computer Science and Engineering, Hong Kong University of Science and Technology
Corresponding Author: zeng.jia@acm.org

ABSTRACT
Latent Dirichlet allocation (LDA) is a widely-used probabilistic topic modeling tool for content analysis such as web mining. To handle web-scale content analysis on just a single PC, we propose multi-core parallel expectation-maximization (PEM) algorithms to infer and estimate LDA parameters in shared memory systems. By avoiding memory access conflicts, reducing the locking time among multiple threads, and using residual-based dynamic scheduling, we show that PEM algorithms are more scalable and accurate than the current state-of-the-art parallel LDA algorithms on a commodity PC. This parallel LDA toolbox is made publicly available as open source software at mloss.org.

Categories and Subject Descriptors
H.4 [Information Systems Applications]: Miscellaneous; D.2.8 [Software Engineering]: Metrics - complexity measures, performance measures

Keywords
Latent Dirichlet allocation, parallel EM algorithms, multi-core systems, shared memory systems, scalability

Copyright is held by the International World Wide Web Conference Committee (IW3C2). IW3C2 reserves the right to provide a hyperlink to the author's site if the Material is used in electronic media. WWW 2015, May 18-22, 2015, Florence, Italy.

1. INTRODUCTION
Latent Dirichlet allocation (LDA) [, 3] is a popular probabilistic topic model for content analysis such as web mining []. It can automatically infer the hidden thematic groups of observed words, called topics, from a collection of documents represented as an input sparse document-word matrix. In the big data era, scalable parallel LDA algorithms have attracted intensive research interest because billions of tweets, images and videos on the web have become increasingly common. The aim of this paper is to develop efficient parallel LDA algorithms for big data.

There are two widely available parallel architectures: 1) multi-processor [9] and 2) multi-core [9] systems, where the main difference lies in the way they use memory. In the multi-processor architecture, all processes allocate independent memory space and communicate to synchronize LDA parameters at the end of each learning iteration [, 7, 9, 5, 5, , 38]. The reduction of communication cost still remains an unsolved problem because this cost is often too large to be masked by computation time in web-scale applications. Experimental results confirm that the communication cost may exceed the computation cost and become the primary cost of large-scale topic modeling [7, 5]. In the multi-core architecture, all threads access the shared memory space, so the race condition is serious. A major scalability issue is the locking time among multiple threads. An example of a shared memory architecture is GPU-LDA [9], which shares the LDA document-topic and topic-word distribution parameters among multiple threads on a multi-core GPU. To avoid access conflicts in the parameter matrices, the input document-word matrix is partitioned into independent data blocks with non-overlapping rows and columns. A preprocessing algorithm is used to balance the number of words in the data blocks so that different threads can finish scanning non-conflicting blocks in almost the same time.
However, it is difficult to balance data blocks exactly, and faster threads need to wait for the slowest one, causing longer locking time. Yahoo!LDA [5, ] presents a blackboard architecture that uses the memcached technique, a distributed shared cache service, to maintain LDA parameter matrices in the shared memory environment. Parallel processing of two or more corpus shards would lead to serious access conflicts; Yahoo!LDA addresses this problem by locking accesses to conflicting shards, but this locking mechanism degrades its scalability when the number of threads increases.

In this paper, we focus on developing more scalable LDA algorithms in shared memory systems, which can handle web-scale content analysis on just a commodity PC with a multi-core architecture. For example, our parallel solution can learn topics from 8 million documents on just a multi-core PC in around 3 hours. In contrast, previous parallel LDA algorithms running on large numbers of CPUs need several hours to do the same task [9]. This result suggests that parallel LDA algorithms in multi-core systems are not only competitive even when compared to parallel processing over large clusters (multi-processor systems), but also very affordable.

Generally, there are two main steps to develop scalable parallel LDA algorithms in shared memory systems. The first step is to choose a batch/online LDA inference algorithm with a fast convergence speed on a single machine. The second step is to parallelize this algorithm in the shared memory system with small locking costs. As far as the first step is concerned, batch LDA inference algorithms include expectation-maximization (EM) [7], variational Bayes (VB) [], collapsed Gibbs sampling (GS) [], collapsed variational Bayes (CVB) [, ], and belief propagation (BP) [35]. Most parallel inference solutions for LDA choose GS algorithms because they are more memory-efficient than the other algorithms [, 7, 9, 5, 5, ].

For example, GS does not need to maintain the large posterior matrix in memory. In addition, GS stores the LDA parameters as sparse matrices of integer type, and often obtains higher topic modeling accuracy than VB []. So, GS is generally agreed to be the more scalable choice in many parallel LDA solutions. However, recent research [] shows that EM [7], CVB [] and BP [35, 3] converge much faster and produce higher topic modeling accuracy than GS.

Online LDA inference algorithms [,, 5, 9] have become popular for two reasons. First, they combine the stochastic optimization framework [] with the corresponding batch LDA algorithms, which theoretically can converge to a local optimum of LDA's objective function. Second, online algorithms are memory-efficient because they load each mini-batch of data into memory for online processing, and remove the processed mini-batch and local parameters from memory after one pass. Although online algorithms converge more slowly than their batch counterparts, they can process big data streams with high velocity [3].

We choose EM [7] for LDA because of its fast convergence speed as well as its high topic modeling accuracy. To justify this choice, we derive the batch EM (BEM), incremental EM (IEM) [8] and online EM (OEM) [] algorithms for learning LDA, and compare them with other widely used LDA algorithms such as VB, GS, CVB and BP. Then, we parallelize the EM algorithms in multi-core systems, calling them parallel EM (PEM). Inspired by the recent development of lock-free parallel stochastic gradient descent for matrix factorization [, 39, 33], we further propose a residual-based scheduling method to reduce the locking time among multiple threads. In practice, this scheduling method can significantly speed up the convergence of PEM algorithms. Experiments confirm that PEM algorithms converge significantly faster and scale up to more data and topics when compared with the current state-of-the-art parallel LDA algorithms [9, 5, 3] in multi-core systems. Note that the proposed PEM can also be deployed in multi-processor systems, similar to previous multi-core solutions that work in multi-processor systems [5, 33].

To summarize, we make the following contributions in this paper:
- We develop scalable PEM algorithms, in both batch and online versions, for LDA in the shared memory environment. These efficient PEM algorithms can converge to a local maximum of the LDA log-likelihood function.
- We propose a residual-based scheduling method, which reduces the locking time among multiple threads and, at the same time, speeds up the convergence of PEM in the multi-core architecture.
- Experiments on three large-scale data sets confirm that the proposed PEM algorithms converge faster and are more scalable than the current state-of-the-art [9, 5, 3].

This paper is organized as follows: Section 2 discusses why we choose EM for LDA (the Appendix shows the derivation of BEM, IEM and OEM and their convergence analysis). Section 3 describes scalable PEM algorithms that avoid memory access conflicts and reduce the locking time among multiple threads through residual-based dynamic scheduling. Section 4 shows extensive experiments on three large-scale data sets. Section 5 draws conclusions and envisions future work.

2. WHY EM INFERENCE FOR LDA?
LDA allocates a set of thematic topic labels, z = {z^k_{w,d}}, to explain the nonzero elements in the document-word co-occurrence matrix x = {x_{w,d}}, where w denotes the word index in the vocabulary, d denotes the document index in the corpus, and k denotes the topic index.
Table 1: Definition of Notation.
    d                      document index
    w                      word index in the vocabulary
    k                      topic index
    m in {1, ..., M x M}   data block index
    n in {1, ..., N}       thread index
    NNZ                    number of nonzero elements
    x = {x_{w,d}}          document-word matrix
    z = {z^k_{w,d}}        topic labels for words
    theta                  document-topic distribution
    phi                    topic-word distribution
    mu                     K x NNZ responsibility matrix
    alpha, beta            Dirichlet hyperparameters

Usually, the number of topics K is provided by users. The nonzero element x_{w,d} denotes the word count at index {w, d}. For each word token x_{w,d,i} in {0, 1}, with x_{w,d} = sum_i x_{w,d,i}, there is a topic label z^k_{w,d,i} in {0, 1}, sum_{k=1}^K z^k_{w,d,i} = 1, 1 <= i <= x_{w,d}. Each nonzero element x_{w,d} is associated with a topic probability vector satisfying sum_k mu_{w,d}(k) = 1, which denotes the posterior probability of a topic label z^k_{w,d} = 1 given the observed word {w, d}. The objective of inference algorithms is to infer posterior probabilities from the full joint probability p(x, z, theta, phi | alpha, beta), where z is the topic labeling configuration, and theta and phi are two non-negative matrices of multinomial parameters for the document-topic and topic-word distributions, satisfying sum_k theta_d(k) = 1 and sum_w phi_w(k) = 1. Both multinomial matrices are generated by two Dirichlet distributions with hyperparameters alpha and beta. For simplicity, we consider the smoothed LDA with fixed symmetric hyperparameters []. Table 1 summarizes the important notation in this paper.

VB [] infers the following posterior from the full joint probability:

    p(\theta, z \mid x, \phi, \alpha, \beta) = \frac{p(x, z, \theta, \phi \mid \alpha, \beta)}{p(x, \phi \mid \alpha, \beta)}.    (1)

This posterior means that if we learn the topic-word distribution phi from training data, we want to infer the best {theta, z} for unseen test data given phi, i.e., for the best generalization performance. However, computing this posterior is intractable because the denominator contains an intractable integration over theta and summation over z of p(x, z, theta, phi | alpha, beta). So, VB infers an approximate variational posterior based on the variational EM algorithm [7]:

Variational E-step:

    \mu_{w,d}(k) \propto \frac{\exp[\Psi(\hat\theta_d(k)+\alpha)]\,\exp[\Psi(\hat\phi_w(k)+\beta)]}{\exp[\Psi(\sum_w[\hat\phi_w(k)+\beta])]},    (2)

    \hat\theta_d(k) = \sum_w x_{w,d}\,\mu_{w,d}(k).    (3)

Variational M-step:

    \hat\phi_w(k) = \sum_d x_{w,d}\,\mu_{w,d}(k).    (4)

In the variational E-step, we update mu_{w,d}(k) and theta-hat_d(k) until convergence, which makes the variational posterior approximate the true posterior p(theta, z | x, phi, alpha, beta) by minimizing the Kullback-Leibler (KL) divergence between them. In the variational M-step, we update phi-hat_w(k) to maximize the variational posterior. Here, we use the notation phi-hat(k) = sum_w phi-hat_w(k) for the denominator in (2). Normalizing {theta-hat, phi-hat} yields the multinomial parameters {theta, phi}.
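For concreteness, the following NumPy/SciPy sketch of the variational E-step, Eqs. (2)-(3), for a single document illustrates the repeated exponential digamma evaluations. It is not part of the released toolbox; the function name vb_e_step_document, its arguments and the inner-iteration count are our own illustrative choices, and dense arrays are assumed.

    import numpy as np
    from scipy.special import digamma

    def vb_e_step_document(x_d, theta_hat_d, phi_hat, alpha, beta, n_inner=5):
        """One variational E-step (Eqs. (2)-(3)) for a single document d.

        x_d        : (W,) word counts of document d
        theta_hat_d: (K,) expected topic counts of document d
        phi_hat    : (W, K) expected topic-word counts
        """
        W, K = phi_hat.shape
        phi_hat_k = phi_hat.sum(axis=0)              # phi-hat(k) = sum_w phi-hat_w(k)
        nz = np.nonzero(x_d)[0]                      # nonzero word indices of document d
        for _ in range(n_inner):                     # iterate Eqs. (2)-(3) until convergence
            # Eq. (2): product of exp[Psi(.)] terms, computed as exp of a sum of digammas
            log_mu = (digamma(theta_hat_d + alpha)[None, :]
                      + digamma(phi_hat[nz] + beta)
                      - digamma(phi_hat_k + W * beta)[None, :])
            mu = np.exp(log_mu - log_mu.max(axis=1, keepdims=True))
            mu /= mu.sum(axis=1, keepdims=True)      # normalize over the K topics
            # Eq. (3): theta-hat_d(k) = sum_w x_{w,d} mu_{w,d}(k)
            theta_hat_d = x_d[nz] @ mu
        return theta_hat_d, mu

Each pass over the nonzero words of a document evaluates the digamma function K times per word, which is the cost the complexity analysis below refers to.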

Table 2: Time and Space Complexities of LDA Inference Algorithms.
    Algorithm                   Posterior                          Time (per iteration)        Space (memory)
    VB []                       p(theta, z | x, phi, alpha, beta)  O(K * NNZ * digamma)        O(K(D + W))
    GS [3]                      p(z | x, alpha, beta)              O(delta1 * K * ntokens)     O(delta2 * K * W + ntokens)
    CVB []                      p(theta, phi, z | x, alpha, beta)  O(delta3 * K * NNZ)         O(K(2(D + W) + NNZ))
    BEM (Section 7.1)           p(theta, phi | x, alpha, beta)     O(K * NNZ)                  O(K(D + W))
    Modified IEM (Section 7.2)  p(theta, phi | x, alpha, beta)     O(K * NNZ)                  O(K(D + W))
    OEM (Section 7.3)           p(theta, phi | x, alpha, beta)     O(K * NNZ)                  O(K(D_s + W + NNZ_s))

However, the variational posterior cannot touch the true posterior, leading to inaccurate solutions []. In addition, the calculation of the exponential digamma function exp[Psi(.)] is computationally complicated. As shown in Table 2, the time complexity of VB for one iteration is O(K * NNZ * digamma), where digamma is the computing time of the exponential digamma function and NNZ is the number of nonzero elements in the document-word sparse matrix. For each nonzero element, we need K operations for the variational E-step and K operations for normalizing mu_{w,d}(k). The space complexity is O(K(D + W)) for the two multinomial parameters and the temporary storage of the variational M-step.

In contrast to VB, the collapsed GS [] algorithm infers the posterior by integrating out {theta, phi}:

    p(z \mid x, \alpha, \beta) = \frac{p(x, z \mid \alpha, \beta)}{p(x \mid \alpha, \beta)} \propto p(x, z \mid \alpha, \beta).    (5)

This posterior means that we want to find the best topic labeling configuration z given the observed words x. Because the multinomial parameters {theta, phi} have been integrated out, the best labeling configuration z is insensitive to the variation of {theta, phi}. Maximizing the joint probability p(x, z | alpha, beta) is intractable (the number of labeling configurations grows exponentially with ntokens), so an approximate inference called Markov chain Monte Carlo (MCMC) EM [7] is used as follows:

MCMC E-step:

    \mu_{w,d,i}(k) \propto \frac{[\hat\theta_d^{\neg z^{k,old}_{w,d,i}}(k)+\alpha]\,[\hat\phi_w^{\neg z^{k,old}_{w,d,i}}(k)+\beta]}{\sum_w[\hat\phi_w^{\neg z^{k,old}_{w,d,i}}(k)+\beta]},    (6)

    randomly sample z^{k,new}_{w,d,i} = 1 from \mu_{w,d,i}(k).    (7)

MCMC M-step:

    \hat\theta_d(k) = \hat\theta_d^{\neg z^{k,old}_{w,d,i}}(k) + z^{k,new}_{w,d,i},    (8)

    \hat\phi_w(k) = \hat\phi_w^{\neg z^{k,old}_{w,d,i}}(k) + z^{k,new}_{w,d,i}.    (9)

In the MCMC E-step, GS infers the topic posterior per word token, mu_{w,d,i}(k) = p(z^{k,new}_{w,d,i} = 1 | z^{old}, x, alpha, beta), and randomly samples a new topic label z^{k,new}_{w,d,i} = 1 from this posterior. The superscript "neg z^{k,old}_{w,d,i}" means excluding the old topic label of the current token from the corresponding matrices {theta-hat, phi-hat}. In the MCMC M-step, GS immediately updates {theta-hat, phi-hat} with the new topic label of each word token. In this sense, GS can be viewed as an incremental algorithm that learns parameters by processing data points sequentially. In Table 2, the time complexity of GS for one iteration is O(delta1 * K * ntokens), where delta1 <= 1. The reason is that the MCMC E-step requires K operations per token and fewer operations for normalizing mu_{w,d,i}(k). Exploiting the sparseness of mu_{w,d,i}(k), efficient sampling techniques [3, 3, 3] can make delta1 even smaller; practically, when K is large, delta1 is around 0.5. Generally, we do not need to store theta-hat in memory because z can recover it. So, the space complexity is O(delta2 * K * W + ntokens) because phi-hat can be compressed due to sparseness [3, 3]; when K is large, delta2 is around 0.8. Note that all parameters in GS are stored in integer type, saving half the memory compared with the double type used by the other algorithms.
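As an illustration of Eqs. (6)-(9) (not taken from any of the compared implementations; gibbs_sweep and its argument names are our own, and dense count matrices are assumed), one collapsed Gibbs sweep over the word tokens looks as follows:

    import numpy as np

    def gibbs_sweep(tokens, z, theta_hat, phi_hat, phi_hat_k, alpha, beta, rng):
        """One collapsed Gibbs sweep (Eqs. (6)-(9)) over all word tokens.

        tokens   : list of (w, d) pairs, one entry per word token
        z        : (ntokens,) current topic assignment of every token
        theta_hat: (D, K) document-topic counts   phi_hat: (W, K) topic-word counts
        phi_hat_k: (K,) topic counts, phi_hat_k = phi_hat.sum(axis=0)
        """
        W = phi_hat.shape[0]
        for i, (w, d) in enumerate(tokens):
            k_old = z[i]
            # exclude the old topic label of this token from the counts (the "neg" notation)
            theta_hat[d, k_old] -= 1; phi_hat[w, k_old] -= 1; phi_hat_k[k_old] -= 1
            # Eq. (6): posterior of the new topic label for this token
            p = (theta_hat[d] + alpha) * (phi_hat[w] + beta) / (phi_hat_k + W * beta)
            p /= p.sum()
            # Eq. (7): sample the new label; Eqs. (8)-(9): add it back immediately
            k_new = rng.choice(len(p), p=p)
            z[i] = k_new
            theta_hat[d, k_new] += 1; phi_hat[w, k_new] += 1; phi_hat_k[k_new] += 1

The loop runs per token rather than per nonzero element, which is why the time complexity in Table 2 depends on ntokens instead of NNZ.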
Unlike VB and GS, CVB [] infers the complete posterior given the observed data x:

    p(\theta, \phi, z \mid x, \alpha, \beta) \propto p(x, z, \theta, \phi \mid \alpha, \beta).    (10)

Maximizing this posterior means that we want to obtain the best combination of multinomial parameters {theta, phi} together with the best topic labeling configuration z. However, inference of this posterior is intractable, so a Gaussian approximation is used []. In this sense, CVB optimizes an approximate LDA model, which cannot achieve the best topic modeling accuracy. The variational E-step and M-step in CVB are similar to those in GS. The main difference is that the variational E-step multiplies an exponential correction factor containing a variance update for each nonzero element rather than for each word token. In Table 2, the time complexity of CVB is O(delta3 * K * NNZ), where delta3 > 1 denotes the additional cost of calculating the exponential correction factor. The space complexity is O(K(2(D + W) + NNZ)) because CVB needs to store one copy of the K x NNZ matrix mu and two copies of the matrices theta-hat and phi-hat in memory (one for the original and the other for the variance). Details can be found in [, ].

We advocate the standard EM [7] algorithm, which infers the posterior by integrating out the topic labeling configuration z:

    p(\theta, \phi \mid x, \alpha, \beta) = \frac{p(x, \theta, \phi \mid \alpha, \beta)}{p(x \mid \alpha, \beta)} \propto p(x, \theta, \phi \mid \alpha, \beta).    (11)

Unlike the posteriors of VB and GS, this posterior means that we want to find the best parameters {theta, phi} given the observations x, no matter what the topic labeling configuration z is. To this end, we integrate out the labeling configuration z in the full joint probability, and use the standard batch EM algorithm [8] to optimize this objective (11):

E-step:

    \mu_{w,d}(k) \propto \frac{[\hat\theta_d(k)+\alpha-1]\,[\hat\phi_w(k)+\beta-1]}{\hat\phi(k)+W(\beta-1)},    (12)

M-step:

    \hat\theta_d(k) = \sum_w x_{w,d}\,\mu_{w,d}(k),    (13)

    \hat\phi_w(k) = \sum_d x_{w,d}\,\mu_{w,d}(k).    (14)

In the E-step, EM infers the responsibility mu_{w,d}(k) conditioned on the parameters {theta-hat, phi-hat}. In the M-step, EM updates the parameters {theta-hat, phi-hat} based on the inferred responsibility mu_{w,d}(k).
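For concreteness, here is a minimal NumPy sketch of one BEM iteration, Eqs. (12)-(14), over a sparse document-word matrix in COO format. It is illustrative only; bem_iteration and its argument names are ours, not the toolbox API.

    import numpy as np

    def bem_iteration(X, theta_hat, phi_hat, alpha, beta):
        """One batch EM iteration (Eqs. (12)-(14)) on a sparse document-word matrix X.

        X : scipy.sparse COO matrix of shape (W, D); theta_hat: (D, K); phi_hat: (W, K).
        Returns the new sufficient statistics without storing the K x NNZ matrix mu.
        """
        W, K = phi_hat.shape
        phi_hat_k = phi_hat.sum(axis=0)                   # phi-hat(k)
        theta_new = np.zeros_like(theta_hat)
        phi_new = np.zeros_like(phi_hat)
        for w, d, x_wd in zip(X.row, X.col, X.data):      # loop over nonzero elements
            # E-step, Eq. (12): responsibility of each topic for word index (w, d)
            mu = (theta_hat[d] + alpha - 1) * (phi_hat[w] + beta - 1) \
                 / (phi_hat_k + W * (beta - 1))
            mu /= mu.sum()
            # M-step accumulation, Eqs. (13)-(14)
            theta_new[d] += x_wd * mu
            phi_new[w] += x_wd * mu
        return theta_new, phi_new

Because mu is recomputed per nonzero element and only the accumulators are kept, the memory footprint matches the O(K(D + W)) entry for BEM in Table 2.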

Unlike VB, EM can touch the true posterior distribution p(theta, phi | x, alpha, beta) in the E-step for maximization. Compared with VB, the time complexity of EM for one iteration is O(K * NNZ), without calculating exponential digamma functions. The space complexity of EM is the same as that of VB, O(K(D + W)), because of storing {theta-hat, phi-hat} as well as the temporary variables of the M-step.

In the past decade, VB and GS have been the two main inference algorithms in the LDA literature, while EM has rarely been discussed and used for learning LDA. We show two main reasons to use EM:

1. EM yields a higher topic modeling accuracy, measured by predictive perplexity, than both VB and GS. Predictive perplexity is a standard performance measure for different LDA inference algorithms [,, 35], calculated as follows: 1) we randomly partition the data set into training and test sets in terms of documents; 2) we estimate phi-hat on the training set for a fixed number of iterations; 3) we randomly partition each test document into 80% and 20% subsets; fixing phi-hat, we estimate theta-hat on the 80% subset for a fixed number of iterations, and then calculate the predictive perplexity on the remaining 20% subset,

    \exp\Big\{-\frac{\sum_{w,d} x^{20\%}_{w,d}\log\big[\sum_k\theta_d(k)\phi_w(k)\big]}{\sum_{w,d} x^{20\%}_{w,d}}\Big\},    (15)

where {theta, phi} are the multinomial parameters obtained by normalizing {theta-hat, phi-hat}, and x^{20%}_{w,d} denotes the word counts in the 20% subset. A lower predictive perplexity represents a better generalization ability. Clearly, Eq. (15) is a function of the multinomial parameters {theta, phi}, and EM infers the best multinomial parameters from p(theta, phi | x, alpha, beta) for a low predictive perplexity. By contrast, VB and GS produce higher predictive perplexity than EM because they infer different posteriors, as discussed above.

2. EM converges significantly faster than both VB and GS. In the Appendix (Section 7), we derive the BEM [7], IEM [8] and OEM [] algorithms for LDA. Convergence analysis shows that all these EM algorithms can converge to a local maximum of LDA's objective function, because in the E-step the lower bound can touch the true posterior. The modified IEM has a low space complexity, and OEM is able to process big data streams. We note that the zero-order approximation of CVB, called CVB0 [], and asynchronous BP [35, 3] are equivalent to IEM, which have been confirmed empirically to converge faster than both VB and GS. Also, online BP (OBP) [37] and stochastic CVB0 (SCVB0) [9] are implementations of OEM, which have also been confirmed to be faster than several state-of-the-art online LDA algorithms.

3. SCALABLE PEM FOR LDA
Fig. 1 shows PEM for learning LDA with N threads in shared memory systems. First, we randomly shuffle and partition the input document-word matrix x into M x M data blocks, where M > N (line 1). Second, we run N threads in parallel, and each thread performs BEM, IEM or OEM as in Figs. 9, 10 and 11 (line 5). Third, we update the residual r_{m,t} (16) after sweeping each block and dynamically schedule free threads to the free data blocks with the largest residuals (lines 6 and 7). Finally, we synchronize the global parameter vector phi-hat(k) after a certain number of data blocks (e.g., N) have been swept (line 8).

    input : x, K, alpha, beta.
    output: phi-hat, theta-hat.
    1  randomly shuffle and partition x into blocks x_m, 1 <= m <= M x M, M > N;
    2  initialize theta-hat_d(k), phi-hat_w(k), phi-hat(k);
    3  repeat
    4      for n <- 1 to N threads in parallel do
    5          free block x_m: do BEM/IEM/OEM;
    6          free block x_m: update residual r_m;
    7          residual-based dynamic scheduling;
    8      synchronize phi-hat(k);
    9  until converged;

Figure 1: Scalable PEM for LDA.
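For concreteness, a minimal Python threading sketch of the driver loop in Fig. 1 is given below. It is illustrative only: the released toolbox is not written in Python; blocks, sweep_block, phi_hat_w and n_sweeps are hypothetical names; and the row/column conflict checks of Fig. 2 are omitted for brevity.

    import threading
    import numpy as np

    def pem_driver(blocks, sweep_block, phi_hat_w, n_threads=4, n_sweeps=100):
        """Thread-pool skeleton of Fig. 1.

        blocks     : list of data blocks from the shuffled M x M partition
        sweep_block: function(block, phi_k) -> residual; one BEM/IEM/OEM sweep of a block
        phi_hat_w  : (W, K) shared topic-word sufficient statistics
        """
        residual = np.full(len(blocks), np.inf)      # unswept blocks get highest priority
        busy = set()                                 # blocks currently owned by some thread
        lock = threading.Lock()

        def worker():
            phi_k = phi_hat_w.sum(axis=0)            # private copy of the K-length vector phi(k)
            for _ in range(n_sweeps):
                with lock:                           # residual-based dynamic scheduling
                    free = [m for m in range(len(blocks)) if m not in busy]
                    if not free:
                        continue
                    m = max(free, key=lambda j: residual[j])
                    busy.add(m)
                residual[m] = sweep_block(blocks[m], phi_k)   # E/M sweep outside the lock
                with lock:
                    busy.discard(m)
                    phi_k = phi_hat_w.sum(axis=0)    # synchronize phi(k) after the sweep

        threads = [threading.Thread(target=worker) for _ in range(n_threads)]
        for t in threads: t.start()
        for t in threads: t.join()

The key design points mirror lines 5-8 of Fig. 1: each thread keeps its own copy of the K-length vector phi(k), and synchronization is just a cheap summation over phi-hat_w(k).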
3.1 Residual-based Dynamic Scheduling
Multiple threads in parallel LDA algorithms have long locking times, which is a main factor that we should try to reduce [9]. This motivates us to develop the residual-based dynamic scheduling method.

The document-word matrix is partitioned into multiple independent data blocks for parallel computation in different threads. PEM in shared-memory systems uses all available N threads to perform the E-step and M-step on different data blocks simultaneously. The challenge is that different threads may read and write the same elements of the parameter matrices theta-hat_d(k), phi-hat_w(k) and phi-hat(k), which leads to a serious access conflict problem. The K-length parameter vector phi-hat(k) will be visited by all threads at the same time, so the best method to avoid access conflicts is to make N independent copies of the vector phi-hat(k) for the N threads [9]. After sweeping a certain number of data blocks (e.g., N), we synchronize the parameter vector by summing the updated parameter matrix phi-hat_w(k), i.e., phi-hat(k) = sum_w phi-hat_w(k). This synchronization cost is not the main bottleneck in PEM because the sum operation over K-length vectors is very simple.

According to [9], we can avoid access conflicts in theta-hat_d(k) and phi-hat_w(k) by partitioning the document-word matrix into M x M blocks. In this case, the number of threads is N = M. The first column of Fig. 2 shows an example of data blocks for four threads. The second and third columns show the parameter matrices phi-hat and theta-hat, respectively. We use four colors (red, yellow, blue and green) to denote the four threads. Fig. 2A shows the four threads simultaneously processing four data blocks on the diagonal. In this way, the four threads visit only independent rows of phi-hat_w(k) and independent columns of theta-hat_d(k) without conflicts. After processing the four diagonal data blocks, the four threads simultaneously move to another four data blocks, as shown in Fig. 2B, which again use only independent rows of phi-hat_w(k) and independent columns of theta-hat_d(k) without conflicts. This continues in Fig. 2C and Fig. 2D until all data blocks have been processed.

To avoid access conflicts, all threads in Fig. 2A need to wait for the slowest thread before moving to the next data blocks in Fig. 2B. Due to data block imbalance (the number of nonzero elements differs across data blocks), this locking time is the main bottleneck in PEM. There are currently two solutions to make the data blocks more balanced. First, an approximate integer programming method can be used to find a better data partition efficiently before parallel computation [9]. Second, the random shuffling method works very well empirically [39]: it randomly permutes the documents (columns) and vocabulary words (rows) of the document-word sparse matrix before processing.
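As an illustration of this second solution (not the toolbox implementation; shuffle_and_partition and its arguments are hypothetical names), the following NumPy/SciPy sketch permutes the rows and columns of the sparse document-word matrix and cuts it into an M x M grid of blocks:

    import numpy as np
    import scipy.sparse as sp

    def shuffle_and_partition(X, M, seed=0):
        """Randomly permute rows (words) and columns (documents) of the W x D sparse
        matrix X, then cut it into an M x M grid of blocks (cf. line 1 of Fig. 1)."""
        rng = np.random.default_rng(seed)
        X = X.tocsr()[rng.permutation(X.shape[0]), :]          # shuffle vocabulary words
        X = X.tocsc()[:, rng.permutation(X.shape[1])]          # shuffle documents
        row_cuts = np.linspace(0, X.shape[0], M + 1, dtype=int)
        col_cuts = np.linspace(0, X.shape[1], M + 1, dtype=int)
        X = X.tocsr()
        return [[X[row_cuts[i]:row_cuts[i + 1], col_cuts[j]:col_cuts[j + 1]]
                 for j in range(M)] for i in range(M)]

Random permutation makes the expected number of nonzero elements roughly equal across blocks, although, as discussed next, it cannot make them exactly equal.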

[Figure 2: The locking problem occurs when four threads (red, yellow, blue and green) process four independent blocks. The first column denotes the document-word matrix; the second and third columns denote the parameter matrices phi-hat_w(k) and theta-hat_d(k).]

[Figure 3: Residual-based dynamic scheduling of two threads (denoted by red and yellow) to process free data blocks without the locking time cost.]

[Figure: The residual reflects the convergence speed; the residual between phi-hat^{m,t} and phi-hat^{m,t-1} (solid line) lower-bounds the distances to the stationary point phi-hat^{m,infinity} (dashed lines).]

However, the locking problem still exists because the NNZ of each data block may not be exactly the same, and there are also slight differences in computing efficiency among different threads. Our approach is residual-based scheduling, which solves the locking problem as illustrated in Fig. 3. First, we set M > N, i.e., the number of data blocks is larger than the number of threads. For example, in Fig. 3 we have N = 2 threads and partition the data matrix into M x M blocks. With N = 2 threads, this creates more "free" data blocks without access conflicts. In the left panel, the red and yellow threads simultaneously process two non-conflicting blocks. If the yellow thread finishes sweeping its block earlier than the red thread, it can directly jump to a free block (block 5 in the figure) without waiting for the red thread. Then, the red thread can jump to another free block while the yellow thread is processing block 5. The right panel shows the scheduling order of the red and yellow threads, where data blocks that overlap on the time axis have no access conflicts. For example, block 7 (red thread) has no access conflict with the yellow thread's blocks (e.g., block 5), and block 5 (yellow thread) has no access conflict with the red thread's blocks (e.g., block 8). In this asynchronous parameter update strategy, there are few locking costs (i.e., few threads need to wait), but there are scheduling costs.

We propose an efficient residual-based scheduling method that can speed up the convergence of PEM. For each data block m, we define the residual as

    r^{m,t} = \sum_{w,k}\big|\hat\phi^{m,t}_w(k) - \hat\phi^{m,t-1}_w(k)\big|,    (16)

where phi-hat^{m,t}_w(k) is the updated parameter sub-matrix at sweep t and phi-hat^{m,t-1}_w(k) is the parameter sub-matrix before the update at sweep t. The residual indicates the convergence speed when sweeping the current data block m. EM theory shows that the parameter sub-matrix phi-hat^{m,t} converges to a stationary point phi-hat^{m,infinity} as t goes to infinity. So, if we minimize the largest distance between phi-hat^{m,t} and phi-hat^{m,infinity} with higher priority, we speed up the convergence of PEM. However, we do not know this distance because the stationary point phi-hat^{m,infinity} is unknown. Alternatively, we turn to minimizing the lower bound (16) on this distance, which can be calculated easily. Using the triangle inequality, we get the lower bound

    r^{m,t} = \|\hat\phi^{m,t} - \hat\phi^{m,t-1}\| \le \|\hat\phi^{m,t} - \hat\phi^{m,\infty}\| + \|\hat\phi^{m,t-1} - \hat\phi^{m,\infty}\|.    (17)

The residual figure above illustrates this definition: the residual (solid line) is a lower bound of the distances to be minimized (dashed lines) according to the triangle inequality. In this way, the scheduling order is to sweep the free block with the largest residual first. As t goes to infinity, the residual r^{m,t} vanishes due to convergence. This property ensures that the residual of each free block becomes smaller over time, so that all blocks have a chance to be swept under residual-based dynamic scheduling.
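A minimal sketch of the residual of Eq. (16) and of the pick-the-largest-residual rule follows; block_residual and next_block are our own illustrative names, not toolbox functions.

    import numpy as np

    def block_residual(phi_before, phi_after):
        """Residual of Eq. (16): L1 distance between the topic-word sub-matrix of a data
        block before and after one sweep; a larger residual means the block is farther
        from convergence."""
        return np.abs(phi_after - phi_before).sum()

    def next_block(residuals, free_blocks):
        """Residual-based scheduling rule: among the free (non-conflicting) blocks,
        sweep the one with the largest residual first."""
        return max(free_blocks, key=lambda m: residuals[m])

In practice the residual is updated once per sweep of a block, so the scheduling overhead is negligible compared with the E-step and M-step themselves.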
3.2 Implementation Issues
We find that using single-precision floating-point computation does not suffer from numerical error accumulation. Empirically, single precision runs noticeably faster than double precision while saving around 50% of the memory. Modern CPUs provide Streaming SIMD Extension (SSE) instructions that can concurrently run floating-point multiplications and additions. To speed up both the E-step and the M-step, we apply SSE instructions to the vector inner products and additions in the inner loops of Figs. 9, 10 and 11, which significantly reduces their time cost.

4. EXPERIMENTS
In our experiments, we refer to parallel BEM as PBEM, parallel IEM as PIEM, and parallel OEM as POEM.

Table 3: Statistics of the data sets. (For PUBMED, WIKI and NYTIMES, the table lists the vocabulary size W, the numbers of training and test documents, and the numbers of nonzero elements NNZ in the training and test sets.)

Table 4: Convergence time cost (seconds) of PBEM-noScheduling, PIEM-noScheduling, PBEM, PIEM, PBGS and PBCVB on the three data sets for a fixed number of topics K.

We compare these PEM algorithms with the following state-of-the-art parallel LDA algorithms in shared memory systems: Yahoo!LDA (parallel batch GS, PGS) [5, ], GPU-LDA (parallel batch CVB, PCVB) [9, ], parallel online VB (POVB) [], and distributed stochastic MCMC (parallel online GS, POGS) [3]. According to [], for a fair comparison, we set matched symmetric Dirichlet hyperparameters {alpha, beta} in PGS, POGS and PCVB, in PEM, and correspondingly offset values in POVB. We carry out experiments on a server with two Intel Xeon processors and a large main memory; the cores of the two processors provide all the threads used below. As in previous studies [,, 35], we use the predictive perplexity (15) to evaluate topic modeling accuracy: the lower the perplexity, the higher the accuracy. If the difference in predictive perplexity between two consecutive iterations falls below a small threshold, the algorithm is considered to have converged [35]. In this way, we can compare the convergence time cost of all algorithms. Table 3 lists the publicly available training and held-out test sets used in our experiments. Infrequent words are removed from the vocabulary, similar to [], so the number of nonzero elements in these data sets becomes smaller than in the original sets. Among these data sets, PUBMED contains around 8 million documents, WIKI several million documents, and NYTIMES around 300,000 documents. These data sets are big enough for evaluating parallel LDA algorithms. For residual-based dynamic scheduling in PEM, we set M larger than N to create free data blocks, similar to [39]. In the benchmark algorithms without residual-based dynamic scheduling, we set M = N, similar to [9].

4.1 PBEM and PIEM
In this subsection, we compare the parallel batch LDA algorithms in multi-core systems (PBEM/PIEM vs. PGS/PCVB). Table 4 shows the convergence time cost of all algorithms for a fixed K. We see that PBEM and PIEM converge consistently faster than PBEM-noScheduling and PIEM-noScheduling on PUBMED, WIKI and NYTIMES. On average, residual-based dynamic scheduling reduces the running time to convergence by a noticeable margin, excluding the speedup effects brought by the implementation techniques of Subsection 3.2. This confirms the effectiveness of residual-based dynamic scheduling in reducing the overall locking time. In addition, we find that PIEM benefits more from dynamic scheduling than PBEM; the major reason is that IEM often propagates the influence of the data blocks with the largest residuals more efficiently than BEM. Both PBEM and PIEM converge to almost the same perplexity level, which indicates almost the same topic modeling accuracy. On the NYTIMES data set, PIEM converges significantly more slowly than PBEM even after adding residual-based dynamic scheduling. Indeed, there is currently no theory that IEM always converges faster than BEM [8, ], though some limited experiments [] indicate that IEM converges faster than BEM. In our experiments, the three data sets have quite different word distributions: PIEM converges slightly faster than PBEM on PUBMED and WIKI, but more slowly on NYTIMES. This gives an example in which BEM sometimes converges faster than IEM.

[Figure: Scalability of PBEM and PIEM: Scale Up and Speed Up versus the number of threads on PUBMED and WIKI.]

In Fig. 5, we compare the predictive perplexity of PBEM/PIEM and PGS/PCVB for several numbers of topics K on the three data sets.
PBEM/PIEM always converges to a much lower predictive perplexity than PGS; on average, there is a clear predictive perplexity improvement. Because PCVB is similar to PIEM, it converges to almost the same predictive perplexity as PIEM. Clearly, both PBEM and PIEM converge significantly faster than PGS and PCVB: their perplexity curves always lie to the left of those of PGS/PCVB. The speedup is largely attributable to the residual-based dynamic scheduling method as well as the fast convergence speed of EM. In practice, PBEM and PIEM can process the roughly 8 million documents of PUBMED within tens of minutes on a single PC, which is comparable with a previous multi-processor solution running on many CPUs [9]. Therefore, parallel LDA algorithms in multi-core systems are not only competitive but also affordable in the big data era.

To test scalability, we perform two types of experiments on both PUBMED and WIKI: Scale Up and Speed Up. In Scale Up, we establish scalability in terms of the number of documents. We fix the amount of data processed by each thread and increase the number of threads from 1 (x-axis); thus, the scale of the processed data grows proportionally with the number of threads. The Scale Up value (y-axis) is the ratio between the convergence time of each algorithm and that of PCVB using one thread on the fixed per-thread amount of data. In Speed Up, we establish scalability in terms of the speedup in convergence time as the number of available threads increases.

[Figure 5: Comparisons of predictive perplexity versus training time (seconds) for PBEM, PIEM, PGS and PCVB with several numbers of topics K on PUBMED, WIKI and NYTIMES.]

We use the entire PUBMED and WIKI data sets and increase the number of threads from 1 (x-axis). The Speed Up value (y-axis) is the ratio between the convergence time of each algorithm and that of PCVB using one thread on the entire data set.

The scalability figure shows that PIEM performs the best in terms of both Scale Up and Speed Up. The top row shows that the Scale Up curve of PIEM remains almost a horizontal line as the amount of processed data increases, which means that PIEM spends a large fraction of its runtime on topic modeling itself as the volume of data grows. In comparison, the Scale Up curves of PGS, PCVB and PBEM increase roughly linearly with the volume of processed data. The bottom row shows that the Speed Up curve of PIEM is almost linear in the number of threads, so more threads lead to faster convergence. In comparison, the Speed Up curves of PGS, PCVB and PBEM bend noticeably as the number of threads increases, although PBEM's scalability is still much better than that of both PGS and PCVB. The major reason why PIEM has the best scalability is that residual-based dynamic scheduling performs very well in PIEM, so the locking time is significantly reduced. For large-scale data sets in shared memory systems, we therefore advocate PIEM because of its good scalability.

4.2 POEM
In this subsection, we compare POEM with two parallel online LDA algorithms: POVB (multi-core OVB with open-source code) and POGS. Similar to previous work [], we set the learning parameters kappa = 0.5 together with the pre-defined tau and mini-batch size. First, we compare the parallel online LDA algorithms with the batch algorithms PBEM and PIEM. Fig. 7 (left panel) shows the convergence time costs (x-axis) and the predictive perplexity (y-axis) achieved on the three data sets WIKI (red), NYTIMES (blue) and PUBMED (green) for a fixed K. An algorithm lying in the bottom-left area achieves the desirable topic modeling result (i.e., fast convergence speed as well as high topic modeling accuracy).

[Figure 7: POEM convergence speed and predictive perplexity: convergence time cost (x-axis) versus predictive perplexity (y-axis) for two values of K on WIKI, NYTIMES and PUBMED.]

We see that the batch algorithms converge faster than the online ones, since they use the global gradient ascent over all data points while the online algorithms use only the local gradient ascent of each mini-batch to update parameters. This is consistent with the previous finding that the convergence rate of stochastic algorithms is often slower than that of batch algorithms []. POEM (star marker) converges to almost the same predictive perplexity as PBEM (circle marker) and PIEM (plus marker), which confirms that POEM can converge to a local maximum of LDA's log-likelihood function, as shown in Section 7.3. In comparison, POGS (cross marker) and POVB (triangle marker) converge around 3 times more slowly than POEM.
The main reason is that POVB uses the computationally complicated digamma functions of Table 2, while POGS uses many more iterations for learning each mini-batch due to its Markov chain Monte Carlo (MCMC) nature. We also see that POVB and POGS converge at a higher level of predictive perplexity than POEM, supporting our analysis in Section 2 that EM yields higher topic modeling accuracy because of its inferred posterior p(theta, phi | x, alpha, beta). Although [] states that the topic modeling accuracy of different inference algorithms can be made almost the same by tuning the hyperparameters {alpha, beta}, we still advocate the standard EM framework because it converges much faster than both VB and GS.

[Figure 8: Scalability of POEM compared with POGS and POVB: Scale Up and Speed Up versus the number of threads on PUBMED and WIKI.]

Although POEM, POVB and POGS converge more slowly than PBEM and PIEM, they are more memory-efficient and can handle larger-scale topic modeling tasks on a single PC because they do not need to store the large matrix theta-hat. For example, PBEM and PIEM cannot efficiently process the PUBMED data set for a very large K, while POEM, POVB and POGS can. Fig. 7 (right panel) compares the convergence time costs and predictive perplexity of POEM, POVB and POGS for this larger K. Similar to the results for the smaller K, POEM is significantly faster and achieves a lower perplexity than POVB and POGS. In practice, POEM processes the PUBMED data set at this larger K in less than 3 hours, while PGS running on many CPUs requires several hours on the same scale of data [9].

Fig. 8 compares the Scale Up and Speed Up curves of POEM with those of POVB and POGS. In Scale Up, we fix each thread to process a fixed amount of data and increase the number of threads (x-axis). The Scale Up value (y-axis) is the ratio between the convergence time of each algorithm and that of POVB using one thread on the per-thread amount of data. The perfect Scale Up curve is a horizontal line in the bottom area. Clearly, the Scale Up curve of POEM lies significantly lower than those of POGS and POVB, indicating much better Scale Up performance. In Speed Up, we take the ratio between the convergence time of each parallel online LDA algorithm and that of POVB using one thread on the entire data set (y-axis), and increase the number of threads (x-axis) to see whether convergence speed increases. The perfect Speed Up curve is a straight line with a high slope and no bending. We see that the Speed Up curve of POEM is significantly higher than those of POVB and POGS, which shows that POEM reaches a higher speedup when more threads are used. Both Fig. 7 and Fig. 8 confirm that the proposed PEM algorithms are more scalable than the current state-of-the-art solutions.

5. CONCLUSIONS
Scalable LDA algorithms in shared memory systems are needed for big data on widely used multi-core systems. Unlike previous parallel solutions using batch/online VB and GS inference, we advocate the EM framework to build more scalable parallel LDA algorithms. Using efficient residual-based dynamic scheduling, we propose scalable PEM algorithms for LDA with faster convergence and shorter locking time than the current state-of-the-art. Experiments show that residual-based dynamic scheduling effectively reduces the locking time and speeds up the convergence of PEM; the technique can also be used in other latent variable models where EM inference works. In future work, we shall study how to extend PEM from multi-core systems to multi-processor systems [3, 8].

6. ACKNOWLEDGEMENTS
This work was supported by the National Grant Fundamental Research (973 Program) of China, the NSFC, the Hong Kong RGC, and the Natural Science Foundation of the Jiangsu Higher Education Institutions of China. This work was also partially supported by the Collaborative Innovation Center of Novel Software Technology and Industrialization.
7. APPENDIX
In this appendix, we derive the BEM, IEM and OEM algorithms for LDA, which infer the posterior p(theta, phi | x, alpha, beta), proportional to p(x, theta, phi | alpha, beta), from the full joint probability of LDA. This objective is quite different from the VB [], GS [] and CVB algorithms, which infer the posteriors p(theta, z | x, phi, alpha, beta), p(z | x, alpha, beta), and p(theta, phi, z | x, alpha, beta), respectively. The time and space complexities of these algorithms were compared in Table 2.

7.1 Batch EM (BEM)
We maximize the likelihood function of LDA in terms of the multinomial parameter set lambda = {theta, phi}:

    p(x,\theta,\phi\mid\alpha,\beta)=\Big[\prod_{w,d,i}\sum_k p(x_{w,d,i}=1, z^k_{w,d,i}=1\mid\theta_d(k),\phi_w(k))\Big]\prod_{d,k}p(\theta_d(k)\mid\alpha)\prod_{w,k}p(\phi_w(k)\mid\beta).    (18)

Employing the Bayes rule and the definition of multinomial distributions, we get the word likelihood

    p(x_{w,d,i}=1, z^k_{w,d,i}=1\mid\theta_d(k),\phi_w(k)) = p(x_{w,d,i}=1\mid z^k_{w,d,i}=1,\phi_w(k))\,p(z^k_{w,d,i}=1\mid\theta_d(k)) = x_{w,d,i}\,\phi_w(k)\,\theta_d(k),    (19)

which depends only on the word index {w, d} instead of the word token index i. Then, according to the definition of the Dirichlet distributions, the log-likelihood of (18) is

    \ell(\lambda)=\sum_{w,d,i}x_{w,d,i}\log\Big[\sum_k\mu_{w,d}(k)\frac{\theta_d(k)\phi_w(k)}{\mu_{w,d}(k)}\Big]+(\alpha-1)\sum_{d,k}\log\theta_d(k)+(\beta-1)\sum_{w,k}\log\phi_w(k),    (20)

where mu_{w,d}(k) is some topic distribution over the word index {w, d} satisfying sum_k mu_{w,d}(k) = 1 and mu_{w,d}(k) >= 0. From (19), we observe that the token-level terms reduce to the counts x_{w,d}, so we can cancel the word token index i in (20). Because the logarithm is concave, by Jensen's inequality we have

    \ell(\lambda)\ge \ell(\mu,\lambda)=\sum_{w,d}x_{w,d}\sum_k\mu_{w,d}(k)\log\frac{\theta_d(k)\phi_w(k)}{\mu_{w,d}(k)}+(\alpha-1)\sum_{d,k}\log\theta_d(k)+(\beta-1)\sum_{w,k}\log\phi_w(k),    (21)

which gives a lower bound of the log-likelihood (20). The equality holds if and only if

    \mu_{w,d}(k)\propto\theta_d(k)\phi_w(k).    (22)

In EM, the K-length posterior probability vector mu_{w,d}(k) is the responsibility that topic k takes for the word index {w, d} [7]. For this choice of mu_{w,d}(k), Eq. (21) gives a tight lower bound on the log-likelihood (20) that we are trying to maximize. This is called the E-step in EM [8]. In the successive M-step, we then maximize (21) with respect to the parameters to obtain a new setting of lambda. Since the hyperparameters {alpha, beta} are fixed, without loss of generality we derive the M-step update for the parameter theta_d(k). There is an additional constraint sum_k theta_d(k) = 1 because theta_d(k) is a parameter of a multinomial distribution. To deal with this constraint, we construct the Lagrangian from (21) by grouping together only the terms that depend on theta_d(k),

    \ell(\theta)=\sum_k\Big[\sum_w x_{w,d}\mu_{w,d}(k)+\alpha-1\Big]\log\theta_d(k)+\delta\Big(\sum_k\theta_d(k)-1\Big),    (23)

where delta is the Lagrange multiplier. Taking derivatives, we find

    \frac{\partial \ell(\theta)}{\partial\theta_d(k)}=\frac{\sum_w x_{w,d}\mu_{w,d}(k)+\alpha-1}{\theta_d(k)}+\delta.    (24)

Setting this to zero, we get

    \theta_d(k)=-\frac{\sum_w x_{w,d}\mu_{w,d}(k)+\alpha-1}{\delta}.    (25)

Using the constraint sum_k theta_d(k) = 1, we easily find -delta = sum_k [sum_w x_{w,d} mu_{w,d}(k) + alpha - 1]. We therefore have our M-step update for the parameter theta_d(k),

    \theta_d(k)=\frac{\hat\theta_d(k)+\alpha-1}{\sum_k\hat\theta_d(k)+K(\alpha-1)},    (26)

where theta-hat_d(k) = sum_w x_{w,d} mu_{w,d}(k) is the expected sufficient statistic. Similarly, the other multinomial parameter can be estimated by

    \phi_w(k)=\frac{\hat\phi_w(k)+\beta-1}{\hat\phi(k)+W(\beta-1)},    (27)

where phi-hat_w(k) = sum_d x_{w,d} mu_{w,d}(k) is the expected sufficient statistic and phi-hat(k) = sum_w phi-hat_w(k). Note that the denominator of (26) is a constant. Substituting (26) and (27) into (22), we obtain the E-step in terms of sufficient statistics,

    \mu_{w,d}(k)\propto\frac{[\hat\theta_d(k)+\alpha-1]\,[\hat\phi_w(k)+\beta-1]}{\hat\phi(k)+W(\beta-1)},    (28)

and EM iterates the E-step and M-step to refine the sufficient statistics theta-hat_d(k) and phi-hat_w(k), which can be normalized into the multinomial parameters according to (26) and (27).

Fig. 9 shows BEM for LDA. We initialize three temporary matrices theta-hat^new_d(k), phi-hat^new_w(k), phi-hat^new(k) (line 3) to accumulate the responsibilities of the E-step over all words (line 4) without storing the large K x NNZ responsibility matrix mu in memory. At the end of each iteration t, 1 <= t <= T, we copy the three temporary matrices back to theta-hat_d(k), phi-hat_w(k), phi-hat(k) in the M-step (line 7). BEM iterates the E-step and M-step repeatedly.

    input : x, K, T, alpha, beta.
    output: phi-hat, theta-hat.
    1  initialize theta-hat_d(k), phi-hat_w(k), phi-hat(k);
    2  for t <- 1 to T do
    3      theta-hat^new_d(k) <- 0; phi-hat^new_w(k) <- 0; phi-hat^new(k) <- 0;
    4      for each nonzero x_{w,d} do
    5          mu_{w,d}(k) <- normalize([theta-hat_d(k) + alpha - 1][phi-hat_w(k) + beta - 1] / [phi-hat(k) + W(beta - 1)]);
    6          theta-hat^new_d(k) <- theta-hat^new_d(k) + x_{w,d} mu_{w,d}(k);
               phi-hat^new_w(k) <- phi-hat^new_w(k) + x_{w,d} mu_{w,d}(k);
               phi-hat^new(k) <- phi-hat^new(k) + x_{w,d} mu_{w,d}(k);
    7      theta-hat_d(k) <- theta-hat^new_d(k); phi-hat_w(k) <- phi-hat^new_w(k); phi-hat(k) <- phi-hat^new(k);

Figure 9: BEM for LDA.

Suppose lambda^{t-1} and lambda^{t} are the parameters from two successive iterations of EM. It is easy to prove that

    \ell(\lambda^{t})\ge \ell(\mu^{t},\lambda^{t})\ge \ell(\mu^{t},\lambda^{t-1})=\ell(\lambda^{t-1}),    (29)

which shows that EM always monotonically improves LDA's log-likelihood (20) until convergence.
EM can also be viewed as a coordinate ascent on the lower bound ell(mu, lambda) of (21), in which the E-step maximizes it with respect to mu and the M-step maximizes it with respect to lambda.

7.2 Incremental EM (IEM)
In batch EM (BEM), the M-step is performed only after the E-step has updated all responsibilities mu_{w,d}(k), which slows down convergence because the updated responsibility of each word does not immediately influence the parameter estimation in the M-step. This problem motivates incremental EM (IEM) [8]. Compared with BEM (28), IEM alternates a single E-step and M-step for each nonzero element x_{w,d} sequentially. Thus, the E-step of IEM becomes

    \mu_{w,d}(k)\propto\frac{[\hat\theta_d^{\neg w}(k)+\alpha-1]\,[\hat\phi_w^{\neg d}(k)+\beta-1]}{\hat\phi^{\neg(w,d)}(k)+W(\beta-1)}.    (30)

The expected sufficient statistics are

    \hat\theta_d^{\neg w}(k)=\sum_{w'\neq w}x_{w',d}\,\mu_{w',d}(k),    (31)

    \hat\phi_w^{\neg d}(k)=\sum_{d'\neq d}x_{w,d'}\,\mu_{w,d'}(k),    (32)

    \hat\phi^{\neg(w,d)}(k)=\sum_{(w',d')\neq(w,d)}x_{w',d'}\,\mu_{w',d'}(k),    (33)

where the superscripts neg w, neg d and neg (w, d) denote all word indices except w, all document indices except d, and all word indices except {w, d}, respectively. After the E-step for each word, the M-step immediately updates the sufficient statistics by adding the updated posterior mu_{w,d}(k) of (30) into (31), (32) and (33). Comparing the E-steps of BEM and IEM, we find that the major difference between (28) and (30) is that IEM excludes the current posterior x_{w,d} mu_{w,d}(k) from the sufficient statistics in (31), (32) and (33). As a result, IEM's space complexity is O(K(D + W + NNZ)) because it must store the large K x NNZ responsibility matrix mu. For a large K, this responsibility matrix alone (in double-precision floating-point format) would occupy far more memory than a commodity PC provides on the PUBMED data set [], which has hundreds of millions of nonzero elements. Note that CVB0 [] and asynchronous BP [35, 3] are equivalent to IEM, and are therefore also memory-consuming for big data on a single PC.

So, we propose a modified IEM, shown in Fig. 10, that does not need to store the large responsibility matrix. After random initialization, we shrink the parameter matrices theta-hat_d(k), phi-hat_w(k), phi-hat(k) by a certain proportion (line 4). This avoids subtracting the current responsibility from the parameter matrices in (31), (32) and (33), so the E-step of the incremental EM becomes (28) rather than (30). In this way, we do not need to store the large K x NNZ responsibility matrix mu in memory. After the E-step (line 5) for each nonzero element, the parameter matrices are compensated by the updated K-tuple responsibility mu_{w,d}(k) in the M-step (line 6). In this way, a change of the parameter matrices immediately influences the update of the responsibility of the next nonzero element (line 5). In anticipation, this incremental update passes the influence of updated responsibilities more efficiently than the batch EM of Fig. 9.

    input : x, K, T, alpha, beta.
    output: phi-hat, theta-hat.
    1  initialize theta-hat_d(k), phi-hat_w(k), phi-hat(k);
    2  for t <- 1 to T do
    3      for each nonzero x_{w,d} in random order do
    4          theta-hat_d(k) <- (1 - x_{w,d} / sum_w x_{w,d}) theta-hat_d(k);
               phi-hat_w(k) <- (1 - x_{w,d} / sum_d x_{w,d}) phi-hat_w(k);
               phi-hat(k) <- (1 - x_{w,d} / sum_{w,d} x_{w,d}) phi-hat(k);
    5          mu_{w,d}(k) <- normalize([theta-hat_d(k) + alpha - 1][phi-hat_w(k) + beta - 1] / [phi-hat(k) + W(beta - 1)]);
    6          theta-hat_d(k) <- theta-hat_d(k) + x_{w,d} mu_{w,d}(k);
               phi-hat_w(k) <- phi-hat_w(k) + x_{w,d} mu_{w,d}(k);
               phi-hat(k) <- phi-hat(k) + x_{w,d} mu_{w,d}(k);

Figure 10: Modified IEM for LDA.

Likewise, it is easy to see that IEM also converges to a local stationary point of LDA's log-likelihood because

    \ell(\lambda^{t-1})=\ell(\mu^{t-1},\lambda^{t-1})\le \ell(\mu^{t}_{w,d},\mu^{t-1}_{\neg(w,d)},\lambda^{t-1})\le \ell(\mu^{t}_{w,d},\mu^{t-1}_{\neg(w,d)},\lambda^{t})\le\cdots\le \ell(\mu^{t},\lambda^{t})=\ell(\lambda^{t}).    (34)
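As an illustration (not the toolbox implementation), the following NumPy sketch mirrors the inner loop of Fig. 10 on a COO sparse matrix; modified_iem_sweep and its argument names are hypothetical.

    import numpy as np

    def modified_iem_sweep(X, theta_hat, phi_hat, phi_hat_k, alpha, beta, rng):
        """One sweep of the modified IEM of Fig. 10 over the nonzero elements of X (COO).

        The current responsibility is removed by rescaling the sufficient statistics
        (line 4 of Fig. 10) instead of storing the K x NNZ matrix mu in memory."""
        W = phi_hat.shape[0]
        order = rng.permutation(len(X.data))                 # visit nonzeros in random order
        col_sum = np.asarray(X.sum(axis=0)).ravel()          # sum_w x_{w,d} per document
        row_sum = np.asarray(X.sum(axis=1)).ravel()          # sum_d x_{w,d} per word
        total = X.sum()
        for idx in order:
            w, d, x_wd = X.row[idx], X.col[idx], X.data[idx]
            # shrink the statistics in proportion to this element (replaces subtracting mu)
            theta_hat[d] *= 1.0 - x_wd / col_sum[d]
            phi_hat[w]   *= 1.0 - x_wd / row_sum[w]
            phi_hat_k    *= 1.0 - x_wd / total
            # E-step, Eq. (28)
            mu = (theta_hat[d] + alpha - 1) * (phi_hat[w] + beta - 1) \
                 / (phi_hat_k + W * (beta - 1))
            mu /= mu.sum()
            # M-step: add the updated responsibility back immediately
            theta_hat[d] += x_wd * mu
            phi_hat[w]   += x_wd * mu
            phi_hat_k    += x_wd * mu

Because each nonzero element is shrunk and then immediately compensated, the sketch keeps only the O(K(D + W)) sufficient statistics in memory, matching the Modified IEM row of Table 2.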
7.3 Online EM (OEM)
The basic idea of online algorithms is to partition a stream of documents into small mini-batches of size D_s, and to use the online gradient produced by each mini-batch to estimate the topic distributions incrementally. OEM [] combines IEM with stochastic approximation, which achieves convergence to stationary points of the likelihood function by interpolating between sufficient statistics based on a learning rate rho_s satisfying the Robbins-Monro conditions [],

    \rho_s=(\tau+s)^{-\kappa},    (35)

where tau is a pre-defined number of mini-batches, s is the mini-batch index, and kappa in (0.5, 1] is provided by users. Similar to (34), it is easy to observe that

    \ell(\hat\phi^{s-1})=\ell(\mu^{s+1:\infty},\mu^{s-1},\hat\phi^{s-1})\le \ell(\mu^{s+1:\infty},\mu^{s},\hat\phi^{s-1})\le \ell(\mu^{s+1:\infty},\mu^{s},\hat\phi^{s})=\ell(\hat\phi^{s}),    (36)

where mu^{s+1:infinity} denotes the responsibilities of unseen mini-batches from s + 1 onwards. Note that the lower bound (36) will not touch the log-likelihood (20) until the responsibilities of all data streams have been updated in (21). The inequality (36) confirms that OEM improves phi-hat^s so as to maximize LDA's log-likelihood (20). In practice, OEM reads each mini-batch x^s_{w,d} into memory and runs IEM until mu^s converges. Then, the sufficient statistics phi-hat^s are updated by a linear combination of the previous phi-hat^{s-1} and the updated sufficient statistics sum_d x^s_{w,d} mu^s_{w,d}(k),

    \hat\phi^{s}=(1-\rho_s)\,\hat\phi^{s-1}+\rho_s\Big[\sum_d x^s_{w,d}\,\mu^s_{w,d}(k)\Big].    (37)

Since OEM stores only the current mini-batch x^s_{w,d}, the local parameters mu^s and theta-hat^s, and the global parameter phi-hat^s in memory, it can process big data streams with a low space complexity O(K(D_s + W + NNZ_s)), where D_s is the number of documents and NNZ_s the number of nonzero elements in the s-th mini-batch.

    input : x^s_{w,d}, D_s, tau, kappa, K, alpha, beta.
    output: phi-hat^S.
     1  for s <- 1 to S do
     2      load x^s_{w,d}, D_s into memory;
     3      rho_s <- (tau + s)^{-kappa}; initialize mu^s;
     4      theta-hat^s_d(k) <- sum_w x^s_{w,d} mu^s_{w,d}(k), for all d in the mini-batch;
     5      phi-hat^s_w(k) <- phi-hat^{s-1}_w(k) + sum_d x^s_{w,d} mu^s_{w,d}(k);
            phi-hat^s(k) <- phi-hat^{s-1}(k) + sum_{w,d} x^s_{w,d} mu^s_{w,d}(k);
     6      repeat
     7          for each nonzero x^s_{w,d} in random order do
     8              theta-hat^{s,neg(w,d)}_d(k) <- theta-hat^s_d(k) - x^s_{w,d} mu^s_{w,d}(k);
                    phi-hat^{s,neg(w,d)}_w(k) <- phi-hat^s_w(k) - x^s_{w,d} mu^s_{w,d}(k);
                    phi-hat^{s,neg(w,d)}(k) <- phi-hat^s(k) - x^s_{w,d} mu^s_{w,d}(k);
     9              mu^s_{w,d}(k) <- normalize([theta-hat^s_d(k) + alpha - 1][phi-hat^s_w(k) + beta - 1] / [phi-hat^s(k) + W(beta - 1)]);
    10              theta-hat^s_d(k) <- theta-hat^s_d(k) + x^s_{w,d} mu^s_{w,d}(k);
                    phi-hat^s_w(k) <- phi-hat^s_w(k) + x^s_{w,d} mu^s_{w,d}(k);
                    phi-hat^s(k) <- phi-hat^s(k) + x^s_{w,d} mu^s_{w,d}(k);
    11      until converged;
    12      phi-hat^s_w(k) <- (1 - rho_s) phi-hat^{s-1}_w(k) + rho_s [sum_d x^s_{w,d} mu^s_{w,d}(k)];
            phi-hat^s(k) <- (1 - rho_s) phi-hat^{s-1}(k) + rho_s [sum_{w,d} x^s_{w,d} mu^s_{w,d}(k)];
    13      free x^s_{w,d}, theta-hat^s, mu^s from memory;

Figure 11: OEM for LDA.

Fig. 11 summarizes the OEM algorithm for LDA; online BP [37] and stochastic CVB0 [9] can be seen as implementations of OEM. Note that OEM can revisit previously processed mini-batches.
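A minimal sketch of the online interpolation of Eqs. (35) and (37) follows; it is illustrative only: oem_stream, run_iem and the default tau are our own assumptions, and run_iem stands for the inner IEM loop of Fig. 11 (lines 6-11).

    import numpy as np

    def oem_stream(minibatches, phi_hat, run_iem, alpha, beta, tau=1.0, kappa=0.5):
        """Online EM (Fig. 11): for each mini-batch s, fit local statistics with IEM and
        then blend them into the global topic-word statistics by Eq. (37)."""
        for s, X_s in enumerate(minibatches, start=1):
            rho_s = (tau + s) ** (-kappa)                 # learning rate, Eq. (35)
            # run IEM on this mini-batch until the local responsibilities converge;
            # it should return sum_d x^s_{w,d} mu^s_{w,d}(k), shaped like phi_hat
            phi_s = run_iem(X_s, phi_hat, alpha, beta)
            # Eq. (37): stochastic interpolation of the sufficient statistics
            # (a corpus/mini-batch rescaling of phi_s may be applied in a full system)
            phi_hat = (1.0 - rho_s) * phi_hat + rho_s * phi_s
            # the mini-batch and its local parameters can now be freed from memory
        return phi_hat

Because only the current mini-batch and the global phi-hat are held at any time, the sketch reflects the O(K(D_s + W + NNZ_s)) space complexity quoted above.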

8. REFERENCES
[1] A. Ahmed, M. Aly, J. Gonzalez, S. Narayanamurthy, and A. Smola. Scalable inference in latent variable models. In WSDM, 2012.
[2] A. Asuncion, M. Welling, P. Smyth, and Y. W. Teh. On smoothing and inference for topic models. In UAI, 2009.
[3] D. M. Blei. Introduction to probabilistic topic models. Communications of the ACM, 55(4):77-84, 2012.
[4] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. J. Mach. Learn. Res., 3:993-1022, 2003.
[5] T. Broderick, N. Boyd, A. Wibisono, A. C. Wilson, and M. I. Jordan. Streaming variational Bayes. In NIPS, 2013.
[6] O. Cappé and E. Moulines. Online expectation-maximization algorithm for latent data models. Journal of the Royal Statistical Society: Series B, 71(3):593-613, 2009.
[7] N. de Freitas and K. Barnard. Bayesian latent semantic analysis of multimedia databases. Technical report, University of British Columbia, 2001.
[8] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39:1-38, 1977.
[9] J. R. Foulds, L. Boyles, C. DuBois, P. Smyth, and M. Welling. Stochastic collapsed variational Bayesian inference for latent Dirichlet allocation. In KDD, 2013.
[10] T. L. Griffiths and M. Steyvers. Finding scientific topics. Proc. Natl. Acad. Sci., 101:5228-5235, 2004.
[11] M. Hoffman, D. Blei, and F. Bach. Online learning for latent Dirichlet allocation. In NIPS, 2010.
[12] D. Jiang, K. W.-T. Leung, and W. Ng. Fast topic discovery from web search streams. In WWW, 2014.
[13] A. Q. Li, A. Ahmed, S. Ravi, and A. J. Smola. Reducing the sampling complexity of topic models. In KDD, 2014.
[14] P. Liang and D. Klein. Online EM for unsupervised models. In NAACL-HLT, 2009.
[15] Z. Liu, Y. Zhang, E. Y. Chang, and M. Sun. PLDA+: Parallel latent Dirichlet allocation with data placement and pipeline processing. ACM Trans. Intell. Syst. Technol., 2(3), 2011.
[16] D. Mimno, M. D. Hoffman, and D. M. Blei. Sparse stochastic inference for latent Dirichlet allocation. In ICML, 2012.
[17] K. P. Murphy. Machine Learning: A Probabilistic Perspective. The MIT Press, 2012.
[18] R. M. Neal and G. E. Hinton. A view of the EM algorithm that justifies incremental, sparse, and other variants. Learning in Graphical Models, 89:355-368, 1998.
[19] D. Newman, A. Asuncion, P. Smyth, and M. Welling. Distributed algorithms for topic models. J. Mach. Learn. Res., 10:1801-1828, 2009.
[20] F. Niu, B. Recht, C. Ré, and S. J. Wright. HOGWILD!: A lock-free approach to parallelizing stochastic gradient descent. In NIPS, 2011.
[21] I. Porteous, D. Newman, A. Ihler, A. Asuncion, P. Smyth, and M. Welling. Fast collapsed Gibbs sampling for latent Dirichlet allocation. In KDD, 2008.
[22] H. Robbins and S. Monro. A stochastic approximation method. The Annals of Mathematical Statistics, 22(3):400-407, 1951.
[23] S. Ahn, B. Shahbaba, and M. Welling. Distributed stochastic gradient MCMC. In ICML, 2014.
[24] M. W. Schmidt, N. Le Roux, and F. Bach. Minimizing finite sums with the stochastic average gradient. arXiv:1309.2388, 2013.
[25] A. Smola and S. Narayanamurthy. An architecture for parallel topic models. In PVLDB, 2010.
[26] Y. W. Teh, D. Newman, and M. Welling. A collapsed variational Bayesian inference algorithm for latent Dirichlet allocation. In NIPS, 2007.
[27] Y. Wang, H. Bai, M. Stanton, W.-Y. Chen, and E. Chang. PLDA: Parallel latent Dirichlet allocation for large-scale applications. In Algorithmic Aspects in Information and Management, 2009.
[28] Y. Wang, X. Zhao, Z. Sun, H. Yan, L. Wang, Z. Jin, L. Wang, Y. Gao, C. Law, and J. Zeng. Peacock: Learning long-tail topic features for industrial applications. ACM Transactions on Intelligent Systems and Technology, 2015.
[29] F. Yan, N. Xu, and Y. Qi. Parallel inference for latent Dirichlet allocation on graphics processing units. In NIPS, 2009.
[30] J.-F. Yan, J. Zeng, Z.-Q. Liu, and Y. Gao. Towards big topic modeling. arXiv preprint, 2013.
[31] L. Yao, D. Mimno, and A. McCallum.
Efficient methods for topic model inference on streaming document collections. In KDD, 2009.
[32] J. Yuan, F. Gao, Q. Ho, W. Dai, J. Wei, X. Zheng, E. P. Xing, T.-Y. Liu, and W.-Y. Ma. LightLDA: Big topic models on modest compute clusters. arXiv preprint, 2014.
[33] H. Yun, H.-F. Yu, C.-J. Hsieh, S. V. N. Vishwanathan, and I. Dhillon. NOMAD: Non-locking, stochastic multi-machine algorithm for asynchronous and decentralized matrix completion. In PVLDB, 2014.
[34] J. Zeng. A topic modeling toolbox using belief propagation. J. Mach. Learn. Res., 13, 2012.
[35] J. Zeng, W. K. Cheung, and J. Liu. Learning topic models by belief propagation. IEEE Trans. Pattern Anal. Mach. Intell., 35(5), 2013.
[36] J. Zeng, Z.-Q. Liu, and X.-Q. Cao. A new approach to speeding up topic modeling. arXiv preprint, 2012.
[37] J. Zeng, Z.-Q. Liu, and X.-Q. Cao. Online belief propagation for topic modeling. arXiv preprint, 2012.
[38] K. Zhai, J. Boyd-Graber, and N. Asadi. Mr. LDA: A flexible large scale topic modeling package using variational inference in MapReduce. In WWW, 2012.
[39] Y. Zhuang, W.-S. Chin, Y.-C. Juan, and C.-J. Lin. A fast parallel SGD for matrix factorization in shared memory systems. In ACM Recommender Systems, 2013.


More information

Sparse Stochastic Inference for Latent Dirichlet Allocation

Sparse Stochastic Inference for Latent Dirichlet Allocation Sparse Stochastic Inference for Latent Dirichlet Allocation David Mimno 1, Matthew D. Hoffman 2, David M. Blei 1 1 Dept. of Computer Science, Princeton U. 2 Dept. of Statistics, Columbia U. Presentation

More information

SYNCHRONOUS SEQUENTIAL CIRCUITS

SYNCHRONOUS SEQUENTIAL CIRCUITS CHAPTER SYNCHRONOUS SEUENTIAL CIRCUITS Registers an counters, two very common synchronous sequential circuits, are introuce in this chapter. Register is a igital circuit for storing information. Contents

More information

Linear First-Order Equations

Linear First-Order Equations 5 Linear First-Orer Equations Linear first-orer ifferential equations make up another important class of ifferential equations that commonly arise in applications an are relatively easy to solve (in theory)

More information

Table of Common Derivatives By David Abraham

Table of Common Derivatives By David Abraham Prouct an Quotient Rules: Table of Common Derivatives By Davi Abraham [ f ( g( ] = [ f ( ] g( + f ( [ g( ] f ( = g( [ f ( ] g( g( f ( [ g( ] Trigonometric Functions: sin( = cos( cos( = sin( tan( = sec

More information

u!i = a T u = 0. Then S satisfies

u!i = a T u = 0. Then S satisfies Deterministic Conitions for Subspace Ientifiability from Incomplete Sampling Daniel L Pimentel-Alarcón, Nigel Boston, Robert D Nowak University of Wisconsin-Maison Abstract Consier an r-imensional subspace

More information

Capacity Analysis of MIMO Systems with Unknown Channel State Information

Capacity Analysis of MIMO Systems with Unknown Channel State Information Capacity Analysis of MIMO Systems with Unknown Channel State Information Jun Zheng an Bhaskar D. Rao Dept. of Electrical an Computer Engineering University of California at San Diego e-mail: juzheng@ucs.eu,

More information

Part I: Web Structure Mining Chapter 1: Information Retrieval and Web Search

Part I: Web Structure Mining Chapter 1: Information Retrieval and Web Search Part I: Web Structure Mining Chapter : Information Retrieval an Web Search The Web Challenges Crawling the Web Inexing an Keywor Search Evaluating Search Quality Similarity Search The Web Challenges Tim

More information

Online but Accurate Inference for Latent Variable Models with Local Gibbs Sampling

Online but Accurate Inference for Latent Variable Models with Local Gibbs Sampling Online but Accurate Inference for Latent Variable Models with Local Gibbs Sampling Christophe Dupuy INRIA - Technicolor christophe.dupuy@inria.fr Francis Bach INRIA - ENS francis.bach@inria.fr Abstract

More information

An Optimal Algorithm for Bandit and Zero-Order Convex Optimization with Two-Point Feedback

An Optimal Algorithm for Bandit and Zero-Order Convex Optimization with Two-Point Feedback Journal of Machine Learning Research 8 07) - Submitte /6; Publishe 5/7 An Optimal Algorithm for Banit an Zero-Orer Convex Optimization with wo-point Feeback Oha Shamir Department of Computer Science an

More information

Cascaded redundancy reduction

Cascaded redundancy reduction Network: Comput. Neural Syst. 9 (1998) 73 84. Printe in the UK PII: S0954-898X(98)88342-5 Cascae reunancy reuction Virginia R e Sa an Geoffrey E Hinton Department of Computer Science, University of Toronto,

More information

Homework 2 Solutions EM, Mixture Models, PCA, Dualitys

Homework 2 Solutions EM, Mixture Models, PCA, Dualitys Homewor Solutions EM, Mixture Moels, PCA, Dualitys CMU 0-75: Machine Learning Fall 05 http://www.cs.cmu.eu/~bapoczos/classes/ml075_05fall/ OUT: Oct 5, 05 DUE: Oct 9, 05, 0:0 AM An EM algorithm for a Mixture

More information

Robust Forward Algorithms via PAC-Bayes and Laplace Distributions. ω Q. Pr (y(ω x) < 0) = Pr A k

Robust Forward Algorithms via PAC-Bayes and Laplace Distributions. ω Q. Pr (y(ω x) < 0) = Pr A k A Proof of Lemma 2 B Proof of Lemma 3 Proof: Since the support of LL istributions is R, two such istributions are equivalent absolutely continuous with respect to each other an the ivergence is well-efine

More information

Acute sets in Euclidean spaces

Acute sets in Euclidean spaces Acute sets in Eucliean spaces Viktor Harangi April, 011 Abstract A finite set H in R is calle an acute set if any angle etermine by three points of H is acute. We examine the maximal carinality α() of

More information

Time-of-Arrival Estimation in Non-Line-Of-Sight Environments

Time-of-Arrival Estimation in Non-Line-Of-Sight Environments 2 Conference on Information Sciences an Systems, The Johns Hopkins University, March 2, 2 Time-of-Arrival Estimation in Non-Line-Of-Sight Environments Sinan Gezici, Hisashi Kobayashi an H. Vincent Poor

More information

The total derivative. Chapter Lagrangian and Eulerian approaches

The total derivative. Chapter Lagrangian and Eulerian approaches Chapter 5 The total erivative 51 Lagrangian an Eulerian approaches The representation of a flui through scalar or vector fiels means that each physical quantity uner consieration is escribe as a function

More information

Qubit channels that achieve capacity with two states

Qubit channels that achieve capacity with two states Qubit channels that achieve capacity with two states Dominic W. Berry Department of Physics, The University of Queenslan, Brisbane, Queenslan 4072, Australia Receive 22 December 2004; publishe 22 March

More information

Topic 7: Convergence of Random Variables

Topic 7: Convergence of Random Variables Topic 7: Convergence of Ranom Variables Course 003, 2016 Page 0 The Inference Problem So far, our starting point has been a given probability space (S, F, P). We now look at how to generate information

More information

All s Well That Ends Well: Supplementary Proofs

All s Well That Ends Well: Supplementary Proofs All s Well That Ens Well: Guarantee Resolution of Simultaneous Rigi Boy Impact 1:1 All s Well That Ens Well: Supplementary Proofs This ocument complements the paper All s Well That Ens Well: Guarantee

More information

LATTICE-BASED D-OPTIMUM DESIGN FOR FOURIER REGRESSION

LATTICE-BASED D-OPTIMUM DESIGN FOR FOURIER REGRESSION The Annals of Statistics 1997, Vol. 25, No. 6, 2313 2327 LATTICE-BASED D-OPTIMUM DESIGN FOR FOURIER REGRESSION By Eva Riccomagno, 1 Rainer Schwabe 2 an Henry P. Wynn 1 University of Warwick, Technische

More information

Calculus and optimization

Calculus and optimization Calculus an optimization These notes essentially correspon to mathematical appenix 2 in the text. 1 Functions of a single variable Now that we have e ne functions we turn our attention to calculus. A function

More information

Survey Sampling. 1 Design-based Inference. Kosuke Imai Department of Politics, Princeton University. February 19, 2013

Survey Sampling. 1 Design-based Inference. Kosuke Imai Department of Politics, Princeton University. February 19, 2013 Survey Sampling Kosuke Imai Department of Politics, Princeton University February 19, 2013 Survey sampling is one of the most commonly use ata collection methos for social scientists. We begin by escribing

More information

Non-Linear Bayesian CBRN Source Term Estimation

Non-Linear Bayesian CBRN Source Term Estimation Non-Linear Bayesian CBRN Source Term Estimation Peter Robins Hazar Assessment, Simulation an Preiction Group Dstl Porton Down, UK. probins@stl.gov.uk Paul Thomas Hazar Assessment, Simulation an Preiction

More information

A PAC-Bayesian Approach to Spectrally-Normalized Margin Bounds for Neural Networks

A PAC-Bayesian Approach to Spectrally-Normalized Margin Bounds for Neural Networks A PAC-Bayesian Approach to Spectrally-Normalize Margin Bouns for Neural Networks Behnam Neyshabur, Srinah Bhojanapalli, Davi McAllester, Nathan Srebro Toyota Technological Institute at Chicago {bneyshabur,

More information

Analyzing Tensor Power Method Dynamics in Overcomplete Regime

Analyzing Tensor Power Method Dynamics in Overcomplete Regime Journal of Machine Learning Research 18 (2017) 1-40 Submitte 9/15; Revise 11/16; Publishe 4/17 Analyzing Tensor Power Metho Dynamics in Overcomplete Regime Animashree Ananumar Department of Electrical

More information

'HVLJQ &RQVLGHUDWLRQ LQ 0DWHULDO 6HOHFWLRQ 'HVLJQ 6HQVLWLYLW\,1752'8&7,21

'HVLJQ &RQVLGHUDWLRQ LQ 0DWHULDO 6HOHFWLRQ 'HVLJQ 6HQVLWLYLW\,1752'8&7,21 Large amping in a structural material may be either esirable or unesirable, epening on the engineering application at han. For example, amping is a esirable property to the esigner concerne with limiting

More information

Multi-View Clustering via Canonical Correlation Analysis

Multi-View Clustering via Canonical Correlation Analysis Keywors: multi-view learning, clustering, canonical correlation analysis Abstract Clustering ata in high-imensions is believe to be a har problem in general. A number of efficient clustering algorithms

More information

arxiv: v1 [cs.ds] 31 May 2017

arxiv: v1 [cs.ds] 31 May 2017 Succinct Partial Sums an Fenwick Trees Philip Bille, Aners Roy Christiansen, Nicola Prezza, an Freerik Rye Skjoljensen arxiv:1705.10987v1 [cs.ds] 31 May 2017 Technical University of Denmark, DTU Compute,

More information

A note on asymptotic formulae for one-dimensional network flow problems Carlos F. Daganzo and Karen R. Smilowitz

A note on asymptotic formulae for one-dimensional network flow problems Carlos F. Daganzo and Karen R. Smilowitz A note on asymptotic formulae for one-imensional network flow problems Carlos F. Daganzo an Karen R. Smilowitz (to appear in Annals of Operations Research) Abstract This note evelops asymptotic formulae

More information

Level Construction of Decision Trees in a Partition-based Framework for Classification

Level Construction of Decision Trees in a Partition-based Framework for Classification Level Construction of Decision Trees in a Partition-base Framework for Classification Y.Y. Yao, Y. Zhao an J.T. Yao Department of Computer Science, University of Regina Regina, Saskatchewan, Canaa S4S

More information

On the Aloha throughput-fairness tradeoff

On the Aloha throughput-fairness tradeoff On the Aloha throughput-fairness traeoff 1 Nan Xie, Member, IEEE, an Steven Weber, Senior Member, IEEE Abstract arxiv:1605.01557v1 [cs.it] 5 May 2016 A well-known inner boun of the stability region of

More information

Power Generation and Distribution via Distributed Coordination Control

Power Generation and Distribution via Distributed Coordination Control Power Generation an Distribution via Distribute Coorination Control Byeong-Yeon Kim, Kwang-Kyo Oh, an Hyo-Sung Ahn arxiv:407.4870v [math.oc] 8 Jul 204 Abstract This paper presents power coorination, power

More information

The derivative of a function f(x) is another function, defined in terms of a limiting expression: f(x + δx) f(x)

The derivative of a function f(x) is another function, defined in terms of a limiting expression: f(x + δx) f(x) Y. D. Chong (2016) MH2801: Complex Methos for the Sciences 1. Derivatives The erivative of a function f(x) is another function, efine in terms of a limiting expression: f (x) f (x) lim x δx 0 f(x + δx)

More information

Improving Estimation Accuracy in Nonrandomized Response Questioning Methods by Multiple Answers

Improving Estimation Accuracy in Nonrandomized Response Questioning Methods by Multiple Answers International Journal of Statistics an Probability; Vol 6, No 5; September 207 ISSN 927-7032 E-ISSN 927-7040 Publishe by Canaian Center of Science an Eucation Improving Estimation Accuracy in Nonranomize

More information

One-dimensional I test and direction vector I test with array references by induction variable

One-dimensional I test and direction vector I test with array references by induction variable Int. J. High Performance Computing an Networking, Vol. 3, No. 4, 2005 219 One-imensional I test an irection vector I test with array references by inuction variable Minyi Guo School of Computer Science

More information

Math Notes on differentials, the Chain Rule, gradients, directional derivative, and normal vectors

Math Notes on differentials, the Chain Rule, gradients, directional derivative, and normal vectors Math 18.02 Notes on ifferentials, the Chain Rule, graients, irectional erivative, an normal vectors Tangent plane an linear approximation We efine the partial erivatives of f( xy, ) as follows: f f( x+

More information

Lecture 2 Lagrangian formulation of classical mechanics Mechanics

Lecture 2 Lagrangian formulation of classical mechanics Mechanics Lecture Lagrangian formulation of classical mechanics 70.00 Mechanics Principle of stationary action MATH-GA To specify a motion uniquely in classical mechanics, it suffices to give, at some time t 0,

More information

KNN Particle Filters for Dynamic Hybrid Bayesian Networks

KNN Particle Filters for Dynamic Hybrid Bayesian Networks KNN Particle Filters for Dynamic Hybri Bayesian Networs H. D. Chen an K. C. Chang Dept. of Systems Engineering an Operations Research George Mason University MS 4A6, 4400 University Dr. Fairfax, VA 22030

More information

Monte Carlo Methods with Reduced Error

Monte Carlo Methods with Reduced Error Monte Carlo Methos with Reuce Error As has been shown, the probable error in Monte Carlo algorithms when no information about the smoothness of the function is use is Dξ r N = c N. It is important for

More information

A Sketch of Menshikov s Theorem

A Sketch of Menshikov s Theorem A Sketch of Menshikov s Theorem Thomas Bao March 14, 2010 Abstract Let Λ be an infinite, locally finite oriente multi-graph with C Λ finite an strongly connecte, an let p

More information

19 Eigenvalues, Eigenvectors, Ordinary Differential Equations, and Control

19 Eigenvalues, Eigenvectors, Ordinary Differential Equations, and Control 19 Eigenvalues, Eigenvectors, Orinary Differential Equations, an Control This section introuces eigenvalues an eigenvectors of a matrix, an iscusses the role of the eigenvalues in etermining the behavior

More information

Fast image compression using matrix K-L transform

Fast image compression using matrix K-L transform Fast image compression using matrix K-L transform Daoqiang Zhang, Songcan Chen * Department of Computer Science an Engineering, Naning University of Aeronautics & Astronautics, Naning 2006, P.R. China.

More information

Maximal Causes for Non-linear Component Extraction

Maximal Causes for Non-linear Component Extraction Journal of Machine Learning Research 9 (2008) 1227-1267 Submitte 5/07; Revise 11/07; Publishe 6/08 Maximal Causes for Non-linear Component Extraction Jörg Lücke Maneesh Sahani Gatsby Computational Neuroscience

More information

Switching Time Optimization in Discretized Hybrid Dynamical Systems

Switching Time Optimization in Discretized Hybrid Dynamical Systems Switching Time Optimization in Discretize Hybri Dynamical Systems Kathrin Flaßkamp, To Murphey, an Sina Ober-Blöbaum Abstract Switching time optimization (STO) arises in systems that have a finite set

More information

A. Exclusive KL View of the MLE

A. Exclusive KL View of the MLE A. Exclusive KL View of the MLE Lets assume a change-of-variable moel p Z z on the ranom variable Z R m, such as the one use in Dinh et al. 2017: z 0 p 0 z 0 an z = ψz 0, where ψ is an invertible function

More information

7.1 Support Vector Machine

7.1 Support Vector Machine 67577 Intro. to Machine Learning Fall semester, 006/7 Lecture 7: Support Vector Machines an Kernel Functions II Lecturer: Amnon Shashua Scribe: Amnon Shashua 7. Support Vector Machine We return now to

More information

d dx But have you ever seen a derivation of these results? We ll prove the first result below. cos h 1

d dx But have you ever seen a derivation of these results? We ll prove the first result below. cos h 1 Lecture 5 Some ifferentiation rules Trigonometric functions (Relevant section from Stewart, Seventh Eition: Section 3.3) You all know that sin = cos cos = sin. () But have you ever seen a erivation of

More information

Angles-Only Orbit Determination Copyright 2006 Michel Santos Page 1

Angles-Only Orbit Determination Copyright 2006 Michel Santos Page 1 Angles-Only Orbit Determination Copyright 6 Michel Santos Page 1 Abstract This ocument presents a re-erivation of the Gauss an Laplace Angles-Only Methos for Initial Orbit Determination. It keeps close

More information

A simplified macroscopic urban traffic network model for model-based predictive control

A simplified macroscopic urban traffic network model for model-based predictive control Delft University of Technology Delft Center for Systems an Control Technical report 9-28 A simplifie macroscopic urban traffic network moel for moel-base preictive control S. Lin, B. De Schutter, Y. Xi,

More information

Robust Low Rank Kernel Embeddings of Multivariate Distributions

Robust Low Rank Kernel Embeddings of Multivariate Distributions Robust Low Rank Kernel Embeings of Multivariate Distributions Le Song, Bo Dai College of Computing, Georgia Institute of Technology lsong@cc.gatech.eu, boai@gatech.eu Abstract Kernel embeing of istributions

More information

Expected Value of Partial Perfect Information

Expected Value of Partial Perfect Information Expecte Value of Partial Perfect Information Mike Giles 1, Takashi Goa 2, Howar Thom 3 Wei Fang 1, Zhenru Wang 1 1 Mathematical Institute, University of Oxfor 2 School of Engineering, University of Tokyo

More information

WEIGHTING A RESAMPLED PARTICLE IN SEQUENTIAL MONTE CARLO. L. Martino, V. Elvira, F. Louzada

WEIGHTING A RESAMPLED PARTICLE IN SEQUENTIAL MONTE CARLO. L. Martino, V. Elvira, F. Louzada WEIGHTIG A RESAMPLED PARTICLE I SEQUETIAL MOTE CARLO L. Martino, V. Elvira, F. Louzaa Dep. of Signal Theory an Communic., Universia Carlos III e Mari, Leganés (Spain). Institute of Mathematical Sciences

More information

Thermal conductivity of graded composites: Numerical simulations and an effective medium approximation

Thermal conductivity of graded composites: Numerical simulations and an effective medium approximation JOURNAL OF MATERIALS SCIENCE 34 (999)5497 5503 Thermal conuctivity of grae composites: Numerical simulations an an effective meium approximation P. M. HUI Department of Physics, The Chinese University

More information

State-Space Model for a Multi-Machine System

State-Space Model for a Multi-Machine System State-Space Moel for a Multi-Machine System These notes parallel section.4 in the text. We are ealing with classically moele machines (IEEE Type.), constant impeance loas, an a network reuce to its internal

More information

Computing the Longest Common Subsequence of Multiple RLE Strings

Computing the Longest Common Subsequence of Multiple RLE Strings The 29th Workshop on Combinatorial Mathematics an Computation Theory Computing the Longest Common Subsequence of Multiple RLE Strings Ling-Chih Yao an Kuan-Yu Chen Grauate Institute of Networking an Multimeia

More information

Multi-View Clustering via Canonical Correlation Analysis

Multi-View Clustering via Canonical Correlation Analysis Technical Report TTI-TR-2008-5 Multi-View Clustering via Canonical Correlation Analysis Kamalika Chauhuri UC San Diego Sham M. Kakae Toyota Technological Institute at Chicago ABSTRACT Clustering ata in

More information

Schrödinger s equation.

Schrödinger s equation. Physics 342 Lecture 5 Schröinger s Equation Lecture 5 Physics 342 Quantum Mechanics I Wenesay, February 3r, 2010 Toay we iscuss Schröinger s equation an show that it supports the basic interpretation of

More information

BEYOND THE CONSTRUCTION OF OPTIMAL SWITCHING SURFACES FOR AUTONOMOUS HYBRID SYSTEMS. Mauro Boccadoro Magnus Egerstedt Paolo Valigi Yorai Wardi

BEYOND THE CONSTRUCTION OF OPTIMAL SWITCHING SURFACES FOR AUTONOMOUS HYBRID SYSTEMS. Mauro Boccadoro Magnus Egerstedt Paolo Valigi Yorai Wardi BEYOND THE CONSTRUCTION OF OPTIMAL SWITCHING SURFACES FOR AUTONOMOUS HYBRID SYSTEMS Mauro Boccaoro Magnus Egerstet Paolo Valigi Yorai Wari {boccaoro,valigi}@iei.unipg.it Dipartimento i Ingegneria Elettronica

More information

Lower Bounds for the Smoothed Number of Pareto optimal Solutions

Lower Bounds for the Smoothed Number of Pareto optimal Solutions Lower Bouns for the Smoothe Number of Pareto optimal Solutions Tobias Brunsch an Heiko Röglin Department of Computer Science, University of Bonn, Germany brunsch@cs.uni-bonn.e, heiko@roeglin.org Abstract.

More information

2Algebraic ONLINE PAGE PROOFS. foundations

2Algebraic ONLINE PAGE PROOFS. foundations Algebraic founations. Kick off with CAS. Algebraic skills.3 Pascal s triangle an binomial expansions.4 The binomial theorem.5 Sets of real numbers.6 Surs.7 Review . Kick off with CAS Playing lotto Using

More information

On combinatorial approaches to compressed sensing

On combinatorial approaches to compressed sensing On combinatorial approaches to compresse sensing Abolreza Abolhosseini Moghaam an Hayer Raha Department of Electrical an Computer Engineering, Michigan State University, East Lansing, MI, U.S. Emails:{abolhos,raha}@msu.eu

More information

Generalizing Kronecker Graphs in order to Model Searchable Networks

Generalizing Kronecker Graphs in order to Model Searchable Networks Generalizing Kronecker Graphs in orer to Moel Searchable Networks Elizabeth Boine, Babak Hassibi, Aam Wierman California Institute of Technology Pasaena, CA 925 Email: {eaboine, hassibi, aamw}@caltecheu

More information

arxiv: v3 [cs.ir] 6 Dec 2014

arxiv: v3 [cs.ir] 6 Dec 2014 39 Peacock: Learning Long-Tail Topic Features for Industrial Applications arxiv:1405.4402v3 [cs.ir] 6 Dec 2014 YI WANG, Tencent XUEMIN ZHAO, Tencent ZHENLONG SUN, Tencent HAO YAN, Tencent LIFENG WANG,

More information

Make graph of g by adding c to the y-values. on the graph of f by c. multiplying the y-values. even-degree polynomial. graph goes up on both sides

Make graph of g by adding c to the y-values. on the graph of f by c. multiplying the y-values. even-degree polynomial. graph goes up on both sides Reference 1: Transformations of Graphs an En Behavior of Polynomial Graphs Transformations of graphs aitive constant constant on the outsie g(x) = + c Make graph of g by aing c to the y-values on the graph

More information

Math 1B, lecture 8: Integration by parts

Math 1B, lecture 8: Integration by parts Math B, lecture 8: Integration by parts Nathan Pflueger 23 September 2 Introuction Integration by parts, similarly to integration by substitution, reverses a well-known technique of ifferentiation an explores

More information

Math 1271 Solutions for Fall 2005 Final Exam

Math 1271 Solutions for Fall 2005 Final Exam Math 7 Solutions for Fall 5 Final Eam ) Since the equation + y = e y cannot be rearrange algebraically in orer to write y as an eplicit function of, we must instea ifferentiate this relation implicitly

More information

Separation of Variables

Separation of Variables Physics 342 Lecture 1 Separation of Variables Lecture 1 Physics 342 Quantum Mechanics I Monay, January 25th, 2010 There are three basic mathematical tools we nee, an then we can begin working on the physical

More information

Topic Modeling Ensembles

Topic Modeling Ensembles Topic Moeling Ensembles Zhiyong Shen, Ping Luo, Shengen Yang, Xukun Shen HP Laboratories HPL-2-58 Keyor(s): Topic moel, Ensemble Abstract: In this paper e propose a frameork of topic moeling ensembles,

More information

Fast Resampling Weighted v-statistics

Fast Resampling Weighted v-statistics Fast Resampling Weighte v-statistics Chunxiao Zhou Mar O. Hatfiel Clinical Research Center National Institutes of Health Bethesa, MD 20892 chunxiao.zhou@nih.gov Jiseong Par Dept of Math George Mason Univ

More information

EVALUATING HIGHER DERIVATIVE TENSORS BY FORWARD PROPAGATION OF UNIVARIATE TAYLOR SERIES

EVALUATING HIGHER DERIVATIVE TENSORS BY FORWARD PROPAGATION OF UNIVARIATE TAYLOR SERIES MATHEMATICS OF COMPUTATION Volume 69, Number 231, Pages 1117 1130 S 0025-5718(00)01120-0 Article electronically publishe on February 17, 2000 EVALUATING HIGHER DERIVATIVE TENSORS BY FORWARD PROPAGATION

More information

Pure Further Mathematics 1. Revision Notes

Pure Further Mathematics 1. Revision Notes Pure Further Mathematics Revision Notes June 20 2 FP JUNE 20 SDB Further Pure Complex Numbers... 3 Definitions an arithmetical operations... 3 Complex conjugate... 3 Properties... 3 Complex number plane,

More information

arxiv: v4 [math.pr] 27 Jul 2016

arxiv: v4 [math.pr] 27 Jul 2016 The Asymptotic Distribution of the Determinant of a Ranom Correlation Matrix arxiv:309768v4 mathpr] 7 Jul 06 AM Hanea a, & GF Nane b a Centre of xcellence for Biosecurity Risk Analysis, University of Melbourne,

More information

Experiment 2, Physics 2BL

Experiment 2, Physics 2BL Experiment 2, Physics 2BL Deuction of Mass Distributions. Last Upate: 2009-05-03 Preparation Before this experiment, we recommen you review or familiarize yourself with the following: Chapters 4-6 in Taylor

More information

Chapter 4. Electrostatics of Macroscopic Media

Chapter 4. Electrostatics of Macroscopic Media Chapter 4. Electrostatics of Macroscopic Meia 4.1 Multipole Expansion Approximate potentials at large istances 3 x' x' (x') x x' x x Fig 4.1 We consier the potential in the far-fiel region (see Fig. 4.1

More information

The Principle of Least Action

The Principle of Least Action Chapter 7. The Principle of Least Action 7.1 Force Methos vs. Energy Methos We have so far stuie two istinct ways of analyzing physics problems: force methos, basically consisting of the application of

More information

arxiv: v6 [stat.ml] 11 Apr 2017

arxiv: v6 [stat.ml] 11 Apr 2017 Improved Gibbs Sampling Parameter Estimators for LDA Dense Distributions from Sparse Samples: Improved Gibbs Sampling Parameter Estimators for LDA arxiv:1505.02065v6 [stat.ml] 11 Apr 2017 Yannis Papanikolaou

More information

Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen. Neural Networks. Tobias Scheffer

Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen. Neural Networks. Tobias Scheffer Universität Potsam Institut für Informatik Lehrstuhl Maschinelles Lernen Neural Networks Tobias Scheffer Overview Neural information processing. Fee-forwar networks. Training fee-forwar networks, back

More information

Gaussian processes with monotonicity information

Gaussian processes with monotonicity information Gaussian processes with monotonicity information Anonymous Author Anonymous Author Unknown Institution Unknown Institution Abstract A metho for using monotonicity information in multivariate Gaussian process

More information

16 : Approximate Inference: Markov Chain Monte Carlo

16 : Approximate Inference: Markov Chain Monte Carlo 10-708: Probabilistic Graphical Models 10-708, Spring 2017 16 : Approximate Inference: Markov Chain Monte Carlo Lecturer: Eric P. Xing Scribes: Yuan Yang, Chao-Ming Yen 1 Introduction As the target distribution

More information

Influence of weight initialization on multilayer perceptron performance

Influence of weight initialization on multilayer perceptron performance Influence of weight initialization on multilayer perceptron performance M. Karouia (1,2) T. Denœux (1) R. Lengellé (1) (1) Université e Compiègne U.R.A. CNRS 817 Heuiasyc BP 649 - F-66 Compiègne ceex -

More information

The Press-Schechter mass function

The Press-Schechter mass function The Press-Schechter mass function To state the obvious: It is important to relate our theories to what we can observe. We have looke at linear perturbation theory, an we have consiere a simple moel for

More information