Scalable Parallel EM Algorithms for Latent Dirichlet Allocation in Multi-Core Systems

Xiaosheng Liu (1,2), Jia Zeng (1,2,3), Xi Yang (1,2), Jianfeng Yan (1,2), Qiang Yang (3,4)
(1) School of Computer Science and Technology, Soochow University, Suzhou, China
(2) Collaborative Innovation Center of Novel Software Technology and Industrialization
(3) Huawei Noah's Ark Lab, Hong Kong
(4) Department of Computer Science and Engineering, Hong Kong University of Science and Technology
Corresponding Author: zeng.jia@acm.org

ABSTRACT
Latent Dirichlet allocation (LDA) is a widely-used probabilistic topic modeling tool for content analysis such as web mining. To handle web-scale content analysis on just a single PC, we propose multi-core parallel expectation-maximization (PEM) algorithms to infer and estimate LDA parameters in shared memory systems. By avoiding memory access conflicts, reducing the locking time among multiple threads, and using residual-based dynamic scheduling, we show that PEM algorithms are more scalable and accurate than the current state-of-the-art parallel LDA algorithms on a commodity PC. This parallel LDA toolbox is made publicly available as open source software at mloss.org.

Categories and Subject Descriptors
H.4 [Information Systems Applications]: Miscellaneous; D.2.8 [Software Engineering]: Metrics - complexity measures, performance measures

Keywords
Latent Dirichlet allocation, parallel EM algorithms, multi-core systems, shared memory systems, scalability

Copyright is held by the International World Wide Web Conference Committee (IW3C2). IW3C2 reserves the right to provide a hyperlink to the author's site if the Material is used in electronic media. WWW 2015, May 18-22, 2015, Florence, Italy.

1. INTRODUCTION
Latent Dirichlet allocation (LDA) [, 3] is a popular probabilistic topic model for content analysis such as web mining []. It can automatically infer the hidden thematic groups of observed words, called topics, from a collection of documents represented as an input sparse document-word matrix. In the big data era, scalable parallel LDA algorithms have attracted intensive research interest because billions of tweets, images and videos on the web have become increasingly common. The aim of this paper is to develop efficient parallel LDA algorithms for big data.

There are two widely available parallel architectures: 1) multi-processor [9] and 2) multi-core [9] systems, where the main difference lies in the way they use memory. In the multi-processor architecture, all processes allocate independent memory space and communicate to synchronize LDA parameters at the end of each learning iteration [, 7, 9, 5, 5, , 38]. The reduction of communication cost still remains an unsolved problem because this cost is often too large to be masked by computation time in web-scale applications. Experimental results confirm that the communication cost may exceed the computation cost and become the primary cost of large-scale topic modeling [7, 5]. In the multi-core architecture, all threads access the shared memory space, so the race condition is serious. A major scalability issue is the locking time among multiple threads. An example of a shared memory architecture is GPU-LDA [9], which shares the LDA document-topic and topic-word distribution parameters among multiple threads on a multi-core GPU. To avoid access conflicts in the parameter matrices, the input document-word matrix is partitioned into independent data blocks with non-overlapping rows and columns. A preprocessing algorithm is used to balance the number of words in the data blocks so that different threads can finish scanning non-conflicting blocks in almost the same time.
However, it is difficult to balance data blocks exactly, and faster threads need to wait for the slowest one, causing longer locking time. Yahoo!LDA [5, ] presents a blackboard architecture that uses the memcached technique, a distributed shared cache service, to maintain LDA parameter matrices in the shared memory environment. Parallel processing of two or more corpus shards would lead to serious access conflicts; Yahoo!LDA addresses this problem by locking accesses to conflicting shards, but this locking mechanism degrades its scalability when the number of threads increases.

In this paper, we focus on developing more scalable LDA algorithms in shared memory systems, which can handle web-scale content analysis on just a commodity PC with a multi-core architecture. For example, our parallel solution can learn topics from 8 million documents on just a multi-core PC in around 3 hours. In contrast, previous parallel LDA algorithms running on large numbers of CPUs need several hours to do the same task [9]. This result suggests that parallel LDA algorithms in multi-core systems are not only competitive even when compared to parallel processing over large clusters (multi-processor systems), but also very affordable.

Generally, there are two main steps to develop scalable parallel LDA algorithms in shared memory systems. The first step is to choose a batch/online LDA inference algorithm with a fast convergence speed on a single machine. The second step is to parallelize this algorithm in the shared memory system with small locking costs. As far as the first step is concerned, batch LDA inference algorithms include expectation-maximization (EM) [7], variational Bayes (VB) [], collapsed Gibbs sampling (GS) [], collapsed variational Bayes (CVB) [, ], and belief propagation (BP) [35]. Most parallel inference solutions for LDA choose GS algorithms because they are more memory-efficient than the other algorithms [, 7, 9, 5, 5, ].

For example, GS does not need to maintain the large posterior matrix in memory. In addition, GS stores the LDA parameters as sparse matrices of integer type, and often obtains higher topic modeling accuracy than VB []. So, GS is generally agreed to be the more scalable choice in many parallel LDA solutions. However, recent research [] shows that EM [7], CVB [] and BP [35, 3] converge much faster and produce higher topic modeling accuracy than GS.

Online LDA inference algorithms [,, 5, 9] have become popular for two reasons. First, they combine the stochastic optimization framework [] with the corresponding batch LDA algorithms, which theoretically can converge to a local optimum of LDA's objective function. Second, online algorithms are memory-efficient because they load each mini-batch of data into memory for online processing, and remove the processed mini-batch and local parameters from memory after one pass. Although online algorithms converge more slowly than their batch counterparts, they can process big data streams with high velocity [3].

We choose EM [7] for LDA because of its fast convergence speed as well as its high topic modeling accuracy. To justify this choice, we derive the batch EM (BEM), incremental EM (IEM) [8] and online EM (OEM) [] algorithms for learning LDA, and compare them with other widely used LDA algorithms such as VB, GS, CVB and BP. Then, we parallelize the EM algorithms in multi-core systems, calling them parallel EM (PEM). Inspired by the recent development of lock-free parallel stochastic gradient descent for matrix factorization [, 39, 33], we further propose a residual-based scheduling method to reduce the locking time among multiple threads. In practice, this scheduling method can significantly speed up the convergence of PEM algorithms. Experiments confirm that PEM algorithms converge significantly faster and scale up to more data and topics when compared with the current state-of-the-art parallel LDA algorithms [9, 5, 3] in multi-core systems. Note that the proposed PEM can also be deployed in multi-processor systems, similar to previous multi-core solutions that work in multi-processor systems [5, 33].

To summarize, we make the following contributions in this paper:
- We develop scalable PEM algorithms, in both batch and online versions, for LDA in the shared memory environment. These efficient PEM algorithms can converge to a local maximum of the LDA log-likelihood function.
- We propose a residual-based scheduling method, which reduces the locking time among multiple threads and, at the same time, speeds up the convergence of PEM in the multi-core architecture.
- Experiments on three large-scale data sets confirm that the proposed PEM algorithms converge faster and are more scalable than the current state-of-the-art [9, 5, 3].

This paper is organized as follows: Section 2 discusses why we choose EM for LDA (the Appendix shows the derivation of BEM, IEM and OEM and their convergence analysis). Section 3 describes scalable PEM algorithms that avoid memory access conflicts and reduce the locking time among multiple threads through residual-based dynamic scheduling. Section 4 shows extensive experiments on three large-scale data sets. Section 5 draws conclusions and envisions future work.

2. WHY EM INFERENCE FOR LDA?
LDA allocates a set of thematic topic labels, z = {z^k_{w,d}}, to explain the nonzero elements in the document-word co-occurrence matrix x = {x_{w,d}}, where w denotes the word index in the vocabulary, d denotes the document index in the corpus, and k denotes the topic index.
Table 1: Definition of Notation.
    d                      document index
    w                      word index in the vocabulary
    k                      topic index
    m in {1, ..., M x M}   data block index
    n in {1, ..., N}       thread index
    NNZ                    number of nonzero elements
    x = {x_{w,d}}          document-word matrix
    z = {z^k_{w,d}}        topic labels for words
    theta                  document-topic distribution
    phi                    topic-word distribution
    mu                     K x NNZ responsibility matrix
    alpha, beta            Dirichlet hyperparameters

Usually, the number of topics K is provided by users. The nonzero element x_{w,d} denotes the word count at index {w, d}. For each word token x_{w,d,i} in {0, 1}, with x_{w,d} = sum_i x_{w,d,i}, there is a topic label z^k_{w,d,i} in {0, 1}, sum_{k=1}^K z^k_{w,d,i} = 1, 1 <= i <= x_{w,d}. Each nonzero element x_{w,d} is associated with a topic probability vector satisfying sum_k mu_{w,d}(k) = 1, which denotes the posterior probability of a topic label z^k_{w,d} = 1 given the observed word {w, d}. The objective of inference algorithms is to infer posterior probabilities from the full joint probability p(x, z, theta, phi | alpha, beta), where z is the topic labeling configuration, and theta and phi are two non-negative matrices of multinomial parameters for the document-topic and topic-word distributions, satisfying sum_k theta_d(k) = 1 and sum_w phi_w(k) = 1. Both multinomial matrices are generated by two Dirichlet distributions with hyperparameters alpha and beta. For simplicity, we consider the smoothed LDA with fixed symmetric hyperparameters []. Table 1 summarizes the important notation in this paper.

VB [] infers the following posterior from the full joint probability:

    p(\theta, z \mid x, \phi, \alpha, \beta) = \frac{p(x, z, \theta, \phi \mid \alpha, \beta)}{p(x, \phi \mid \alpha, \beta)}.    (1)

This posterior means that if we learn the topic-word distribution phi from training data, we want to infer the best {theta, z} for unseen test data given phi, i.e., for the best generalization performance. However, computing this posterior is intractable because the denominator contains an intractable integration over theta and summation over z of p(x, z, theta, phi | alpha, beta). So, VB infers an approximate variational posterior based on the variational EM algorithm [7]:

Variational E-step:

    \mu_{w,d}(k) \propto \frac{\exp[\Psi(\hat\theta_d(k)+\alpha)]\,\exp[\Psi(\hat\phi_w(k)+\beta)]}{\exp[\Psi(\sum_w[\hat\phi_w(k)+\beta])]},    (2)

    \hat\theta_d(k) = \sum_w x_{w,d}\,\mu_{w,d}(k).    (3)

Variational M-step:

    \hat\phi_w(k) = \sum_d x_{w,d}\,\mu_{w,d}(k).    (4)

In the variational E-step, we update mu_{w,d}(k) and theta-hat_d(k) until convergence, which makes the variational posterior approximate the true posterior p(theta, z | x, phi, alpha, beta) by minimizing the Kullback-Leibler (KL) divergence between them. In the variational M-step, we update phi-hat_w(k) to maximize the variational posterior. Here, we use the notation phi-hat(k) = sum_w phi-hat_w(k) for the denominator in (2). Normalizing {theta-hat, phi-hat} yields the multinomial parameters {theta, phi}.
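For concreteness, the following NumPy/SciPy sketch of the variational E-step, Eqs. (2)-(3), for a single document illustrates the repeated exponential digamma evaluations. It is not part of the released toolbox; the function name vb_e_step_document, its arguments and the inner-iteration count are our own illustrative choices, and dense arrays are assumed.

    import numpy as np
    from scipy.special import digamma

    def vb_e_step_document(x_d, theta_hat_d, phi_hat, alpha, beta, n_inner=5):
        """One variational E-step (Eqs. (2)-(3)) for a single document d.

        x_d        : (W,) word counts of document d
        theta_hat_d: (K,) expected topic counts of document d
        phi_hat    : (W, K) expected topic-word counts
        """
        W, K = phi_hat.shape
        phi_hat_k = phi_hat.sum(axis=0)              # phi-hat(k) = sum_w phi-hat_w(k)
        nz = np.nonzero(x_d)[0]                      # nonzero word indices of document d
        for _ in range(n_inner):                     # iterate Eqs. (2)-(3) until convergence
            # Eq. (2): product of exp[Psi(.)] terms, computed as exp of a sum of digammas
            log_mu = (digamma(theta_hat_d + alpha)[None, :]
                      + digamma(phi_hat[nz] + beta)
                      - digamma(phi_hat_k + W * beta)[None, :])
            mu = np.exp(log_mu - log_mu.max(axis=1, keepdims=True))
            mu /= mu.sum(axis=1, keepdims=True)      # normalize over the K topics
            # Eq. (3): theta-hat_d(k) = sum_w x_{w,d} mu_{w,d}(k)
            theta_hat_d = x_d[nz] @ mu
        return theta_hat_d, mu

Each pass over the nonzero words of a document evaluates the digamma function K times per word, which is the cost the complexity analysis below refers to.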

Table 2: Time and Space Complexities of LDA Inference Algorithms.
    Algorithm                   Posterior                          Time (per iteration)        Space (memory)
    VB []                       p(theta, z | x, phi, alpha, beta)  O(K * NNZ * digamma)        O(K(D + W))
    GS [3]                      p(z | x, alpha, beta)              O(delta1 * K * ntokens)     O(delta2 * K * W + ntokens)
    CVB []                      p(theta, phi, z | x, alpha, beta)  O(delta3 * K * NNZ)         O(K(2(D + W) + NNZ))
    BEM (Section 7.1)           p(theta, phi | x, alpha, beta)     O(K * NNZ)                  O(K(D + W))
    Modified IEM (Section 7.2)  p(theta, phi | x, alpha, beta)     O(K * NNZ)                  O(K(D + W))
    OEM (Section 7.3)           p(theta, phi | x, alpha, beta)     O(K * NNZ)                  O(K(D_s + W + NNZ_s))

However, the variational posterior cannot touch the true posterior, leading to inaccurate solutions []. In addition, the calculation of the exponential digamma function exp[Psi(.)] is computationally complicated. As shown in Table 2, the time complexity of VB for one iteration is O(K * NNZ * digamma), where digamma is the computing time of the exponential digamma function and NNZ is the number of nonzero elements in the document-word sparse matrix. For each nonzero element, we need K operations for the variational E-step and K operations for normalizing mu_{w,d}(k). The space complexity is O(K(D + W)) for the two multinomial parameters and the temporary storage of the variational M-step.

In contrast to VB, the collapsed GS [] algorithm infers the posterior by integrating out {theta, phi}:

    p(z \mid x, \alpha, \beta) = \frac{p(x, z \mid \alpha, \beta)}{p(x \mid \alpha, \beta)} \propto p(x, z \mid \alpha, \beta).    (5)

This posterior means that we want to find the best topic labeling configuration z given the observed words x. Because the multinomial parameters {theta, phi} have been integrated out, the best labeling configuration z is insensitive to the variation of {theta, phi}. Maximizing the joint probability p(x, z | alpha, beta) is intractable (the number of labeling configurations grows exponentially with ntokens), so an approximate inference called Markov chain Monte Carlo (MCMC) EM [7] is used as follows:

MCMC E-step:

    \mu_{w,d,i}(k) \propto \frac{[\hat\theta_d^{\neg z^{k,old}_{w,d,i}}(k)+\alpha]\,[\hat\phi_w^{\neg z^{k,old}_{w,d,i}}(k)+\beta]}{\sum_w[\hat\phi_w^{\neg z^{k,old}_{w,d,i}}(k)+\beta]},    (6)

    randomly sample z^{k,new}_{w,d,i} = 1 from \mu_{w,d,i}(k).    (7)

MCMC M-step:

    \hat\theta_d(k) = \hat\theta_d^{\neg z^{k,old}_{w,d,i}}(k) + z^{k,new}_{w,d,i},    (8)

    \hat\phi_w(k) = \hat\phi_w^{\neg z^{k,old}_{w,d,i}}(k) + z^{k,new}_{w,d,i}.    (9)

In the MCMC E-step, GS infers the topic posterior per word token, mu_{w,d,i}(k) = p(z^{k,new}_{w,d,i} = 1 | z^{old}, x, alpha, beta), and randomly samples a new topic label z^{k,new}_{w,d,i} = 1 from this posterior. The superscript "neg z^{k,old}_{w,d,i}" means excluding the old topic label of the current token from the corresponding matrices {theta-hat, phi-hat}. In the MCMC M-step, GS immediately updates {theta-hat, phi-hat} with the new topic label of each word token. In this sense, GS can be viewed as an incremental algorithm that learns parameters by processing data points sequentially. In Table 2, the time complexity of GS for one iteration is O(delta1 * K * ntokens), where delta1 <= 1. The reason is that the MCMC E-step requires K operations per token and fewer operations for normalizing mu_{w,d,i}(k). Exploiting the sparseness of mu_{w,d,i}(k), efficient sampling techniques [3, 3, 3] can make delta1 even smaller; practically, when K is large, delta1 is around 0.5. Generally, we do not need to store theta-hat in memory because z can recover it. So, the space complexity is O(delta2 * K * W + ntokens) because phi-hat can be compressed due to sparseness [3, 3]; when K is large, delta2 is around 0.8. Note that all parameters in GS are stored in integer type, saving half the memory compared with the double type used by the other algorithms.
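As an illustration of Eqs. (6)-(9) (not taken from any of the compared implementations; gibbs_sweep and its argument names are our own, and dense count matrices are assumed), one collapsed Gibbs sweep over the word tokens looks as follows:

    import numpy as np

    def gibbs_sweep(tokens, z, theta_hat, phi_hat, phi_hat_k, alpha, beta, rng):
        """One collapsed Gibbs sweep (Eqs. (6)-(9)) over all word tokens.

        tokens   : list of (w, d) pairs, one entry per word token
        z        : (ntokens,) current topic assignment of every token
        theta_hat: (D, K) document-topic counts   phi_hat: (W, K) topic-word counts
        phi_hat_k: (K,) topic counts, phi_hat_k = phi_hat.sum(axis=0)
        """
        W = phi_hat.shape[0]
        for i, (w, d) in enumerate(tokens):
            k_old = z[i]
            # exclude the old topic label of this token from the counts (the "neg" notation)
            theta_hat[d, k_old] -= 1; phi_hat[w, k_old] -= 1; phi_hat_k[k_old] -= 1
            # Eq. (6): posterior of the new topic label for this token
            p = (theta_hat[d] + alpha) * (phi_hat[w] + beta) / (phi_hat_k + W * beta)
            p /= p.sum()
            # Eq. (7): sample the new label; Eqs. (8)-(9): add it back immediately
            k_new = rng.choice(len(p), p=p)
            z[i] = k_new
            theta_hat[d, k_new] += 1; phi_hat[w, k_new] += 1; phi_hat_k[k_new] += 1

The loop runs per token rather than per nonzero element, which is why the time complexity in Table 2 depends on ntokens instead of NNZ.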
Unlike VB and GS, CVB [] infers the complete posterior given the observed data x:

    p(\theta, \phi, z \mid x, \alpha, \beta) \propto p(x, z, \theta, \phi \mid \alpha, \beta).    (10)

Maximizing this posterior means that we want to obtain the best combination of multinomial parameters {theta, phi} together with the best topic labeling configuration z. However, inference of this posterior is intractable, so a Gaussian approximation is used []. In this sense, CVB optimizes an approximate LDA model, which cannot achieve the best topic modeling accuracy. The variational E-step and M-step in CVB are similar to those in GS. The main difference is that the variational E-step multiplies an exponential correction factor containing a variance update for each nonzero element rather than for each word token. In Table 2, the time complexity of CVB is O(delta3 * K * NNZ), where delta3 > 1 denotes the additional cost of calculating the exponential correction factor. The space complexity is O(K(2(D + W) + NNZ)) because CVB needs to store one copy of the K x NNZ matrix mu and two copies of the matrices theta-hat and phi-hat in memory (one for the original and the other for the variance). Details can be found in [, ].

We advocate the standard EM [7] algorithm, which infers the posterior by integrating out the topic labeling configuration z:

    p(\theta, \phi \mid x, \alpha, \beta) = \frac{p(x, \theta, \phi \mid \alpha, \beta)}{p(x \mid \alpha, \beta)} \propto p(x, \theta, \phi \mid \alpha, \beta).    (11)

Unlike the posteriors of VB and GS, this posterior means that we want to find the best parameters {theta, phi} given the observations x, no matter what the topic labeling configuration z is. To this end, we integrate out the labeling configuration z in the full joint probability, and use the standard batch EM algorithm [8] to optimize this objective (11):

E-step:

    \mu_{w,d}(k) \propto \frac{[\hat\theta_d(k)+\alpha-1]\,[\hat\phi_w(k)+\beta-1]}{\hat\phi(k)+W(\beta-1)},    (12)

M-step:

    \hat\theta_d(k) = \sum_w x_{w,d}\,\mu_{w,d}(k),    (13)

    \hat\phi_w(k) = \sum_d x_{w,d}\,\mu_{w,d}(k).    (14)

In the E-step, EM infers the responsibility mu_{w,d}(k) conditioned on the parameters {theta-hat, phi-hat}. In the M-step, EM updates the parameters {theta-hat, phi-hat} based on the inferred responsibility mu_{w,d}(k).
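For concreteness, here is a minimal NumPy sketch of one BEM iteration, Eqs. (12)-(14), over a sparse document-word matrix in COO format. It is illustrative only; bem_iteration and its argument names are ours, not the toolbox API.

    import numpy as np

    def bem_iteration(X, theta_hat, phi_hat, alpha, beta):
        """One batch EM iteration (Eqs. (12)-(14)) on a sparse document-word matrix X.

        X : scipy.sparse COO matrix of shape (W, D); theta_hat: (D, K); phi_hat: (W, K).
        Returns the new sufficient statistics without storing the K x NNZ matrix mu.
        """
        W, K = phi_hat.shape
        phi_hat_k = phi_hat.sum(axis=0)                   # phi-hat(k)
        theta_new = np.zeros_like(theta_hat)
        phi_new = np.zeros_like(phi_hat)
        for w, d, x_wd in zip(X.row, X.col, X.data):      # loop over nonzero elements
            # E-step, Eq. (12): responsibility of each topic for word index (w, d)
            mu = (theta_hat[d] + alpha - 1) * (phi_hat[w] + beta - 1) \
                 / (phi_hat_k + W * (beta - 1))
            mu /= mu.sum()
            # M-step accumulation, Eqs. (13)-(14)
            theta_new[d] += x_wd * mu
            phi_new[w] += x_wd * mu
        return theta_new, phi_new

Because mu is recomputed per nonzero element and only the accumulators are kept, the memory footprint matches the O(K(D + W)) entry for BEM in Table 2.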

Unlike VB, EM can touch the true posterior distribution p(theta, phi | x, alpha, beta) in the E-step for maximization. Compared with VB, the time complexity of EM for one iteration is O(K * NNZ), without calculating exponential digamma functions. The space complexity of EM is the same as that of VB, O(K(D + W)), because of storing {theta-hat, phi-hat} as well as the temporary variables of the M-step.

In the past decade, VB and GS have been the two main inference algorithms in the LDA literature, while EM has rarely been discussed and used for learning LDA. We show two main reasons to use EM:

1. EM yields a higher topic modeling accuracy, measured by predictive perplexity, than both VB and GS. Predictive perplexity is a standard performance measure for different LDA inference algorithms [,, 35], calculated as follows: 1) we randomly partition the data set into training and test sets in terms of documents; 2) we estimate phi-hat on the training set for a fixed number of iterations; 3) we randomly partition each test document into 80% and 20% subsets; fixing phi-hat, we estimate theta-hat on the 80% subset for a fixed number of iterations, and then calculate the predictive perplexity on the remaining 20% subset,

    \exp\Big\{-\frac{\sum_{w,d} x^{20\%}_{w,d}\log\big[\sum_k\theta_d(k)\phi_w(k)\big]}{\sum_{w,d} x^{20\%}_{w,d}}\Big\},    (15)

where {theta, phi} are the multinomial parameters obtained by normalizing {theta-hat, phi-hat}, and x^{20%}_{w,d} denotes the word counts in the 20% subset. A lower predictive perplexity represents a better generalization ability. Clearly, Eq. (15) is a function of the multinomial parameters {theta, phi}, and EM infers the best multinomial parameters from p(theta, phi | x, alpha, beta) for a low predictive perplexity. By contrast, VB and GS produce higher predictive perplexity than EM because they infer different posteriors, as discussed above.

2. EM converges significantly faster than both VB and GS. In the Appendix (Section 7), we derive the BEM [7], IEM [8] and OEM [] algorithms for LDA. Convergence analysis shows that all these EM algorithms can converge to a local maximum of LDA's objective function, because in the E-step the lower bound can touch the true posterior. The modified IEM has a low space complexity, and OEM is able to process big data streams. We note that the zero-order approximation of CVB, called CVB0 [], and asynchronous BP [35, 3] are equivalent to IEM, which have been confirmed empirically to converge faster than both VB and GS. Also, online BP (OBP) [37] and stochastic CVB0 (SCVB0) [9] are implementations of OEM, which have also been confirmed to be faster than several state-of-the-art online LDA algorithms.

3. SCALABLE PEM FOR LDA
Fig. 1 shows PEM for learning LDA with N threads in shared memory systems. First, we randomly shuffle and partition the input document-word matrix x into M x M data blocks, where M > N (line 1). Second, we run N threads in parallel, and each thread performs BEM, IEM or OEM as in Figs. 9, 10 and 11 (line 5). Third, we update the residual r_{m,t} (16) after sweeping each block and dynamically schedule free threads to the free data blocks with the largest residuals (lines 6 and 7). Finally, we synchronize the global parameter vector phi-hat(k) after a certain number of data blocks (e.g., N) have been swept (line 8).

    input : x, K, alpha, beta.
    output: phi-hat, theta-hat.
    1  randomly shuffle and partition x into blocks x_m, 1 <= m <= M x M, M > N;
    2  initialize theta-hat_d(k), phi-hat_w(k), phi-hat(k);
    3  repeat
    4      for n <- 1 to N threads in parallel do
    5          free block x_m: do BEM/IEM/OEM;
    6          free block x_m: update residual r_m;
    7          residual-based dynamic scheduling;
    8      synchronize phi-hat(k);
    9  until converged;

Figure 1: Scalable PEM for LDA.
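For concreteness, a minimal Python threading sketch of the driver loop in Fig. 1 is given below. It is illustrative only: the released toolbox is not written in Python; blocks, sweep_block, phi_hat_w and n_sweeps are hypothetical names; and the row/column conflict checks of Fig. 2 are omitted for brevity.

    import threading
    import numpy as np

    def pem_driver(blocks, sweep_block, phi_hat_w, n_threads=4, n_sweeps=100):
        """Thread-pool skeleton of Fig. 1.

        blocks     : list of data blocks from the shuffled M x M partition
        sweep_block: function(block, phi_k) -> residual; one BEM/IEM/OEM sweep of a block
        phi_hat_w  : (W, K) shared topic-word sufficient statistics
        """
        residual = np.full(len(blocks), np.inf)      # unswept blocks get highest priority
        busy = set()                                 # blocks currently owned by some thread
        lock = threading.Lock()

        def worker():
            phi_k = phi_hat_w.sum(axis=0)            # private copy of the K-length vector phi(k)
            for _ in range(n_sweeps):
                with lock:                           # residual-based dynamic scheduling
                    free = [m for m in range(len(blocks)) if m not in busy]
                    if not free:
                        continue
                    m = max(free, key=lambda j: residual[j])
                    busy.add(m)
                residual[m] = sweep_block(blocks[m], phi_k)   # E/M sweep outside the lock
                with lock:
                    busy.discard(m)
                    phi_k = phi_hat_w.sum(axis=0)    # synchronize phi(k) after the sweep

        threads = [threading.Thread(target=worker) for _ in range(n_threads)]
        for t in threads: t.start()
        for t in threads: t.join()

The key design points mirror lines 5-8 of Fig. 1: each thread keeps its own copy of the K-length vector phi(k), and synchronization is just a cheap summation over phi-hat_w(k).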
3.1 Residual-based Dynamic Scheduling
Multiple threads in parallel LDA algorithms have long locking times, which is a main factor that we should try to reduce [9]. This motivates us to develop the residual-based dynamic scheduling method.

The document-word matrix is partitioned into multiple independent data blocks for parallel computation in different threads. PEM in shared-memory systems uses all available N threads to perform the E-step and M-step on different data blocks simultaneously. The challenge is that different threads may read and write the same elements of the parameter matrices theta-hat_d(k), phi-hat_w(k) and phi-hat(k), which leads to a serious access conflict problem. The K-length parameter vector phi-hat(k) will be visited by all threads at the same time, so the best method to avoid access conflicts is to make N independent copies of the vector phi-hat(k) for the N threads [9]. After sweeping a certain number of data blocks (e.g., N), we synchronize the parameter vector by summing the updated parameter matrix phi-hat_w(k), i.e., phi-hat(k) = sum_w phi-hat_w(k). This synchronization cost is not the main bottleneck in PEM because the sum operation over K-length vectors is very simple.

According to [9], we can avoid access conflicts in theta-hat_d(k) and phi-hat_w(k) by partitioning the document-word matrix into M x M blocks. In this case, the number of threads is N = M. The first column of Fig. 2 shows an example of data blocks for four threads. The second and third columns show the parameter matrices phi-hat and theta-hat, respectively. We use four colors (red, yellow, blue and green) to denote the four threads. Fig. 2A shows the four threads simultaneously processing four data blocks on the diagonal. In this way, the four threads visit only independent rows of phi-hat_w(k) and independent columns of theta-hat_d(k) without conflicts. After processing the four diagonal data blocks, the four threads simultaneously move to another four data blocks, as shown in Fig. 2B, which again use only independent rows of phi-hat_w(k) and independent columns of theta-hat_d(k) without conflicts. This continues in Fig. 2C and Fig. 2D until all data blocks have been processed.

To avoid access conflicts, all threads in Fig. 2A need to wait for the slowest thread before moving to the next data blocks in Fig. 2B. Due to data block imbalance (the number of nonzero elements differs across data blocks), this locking time is the main bottleneck in PEM. There are currently two solutions to make the data blocks more balanced. First, an approximate integer programming method can be used to find a better data partition efficiently before parallel computation [9]. Second, the random shuffling method works very well empirically [39]: it randomly permutes the documents (columns) and vocabulary words (rows) of the document-word sparse matrix before processing.
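As an illustration of this second solution (not the toolbox implementation; shuffle_and_partition and its arguments are hypothetical names), the following NumPy/SciPy sketch permutes the rows and columns of the sparse document-word matrix and cuts it into an M x M grid of blocks:

    import numpy as np
    import scipy.sparse as sp

    def shuffle_and_partition(X, M, seed=0):
        """Randomly permute rows (words) and columns (documents) of the W x D sparse
        matrix X, then cut it into an M x M grid of blocks (cf. line 1 of Fig. 1)."""
        rng = np.random.default_rng(seed)
        X = X.tocsr()[rng.permutation(X.shape[0]), :]          # shuffle vocabulary words
        X = X.tocsc()[:, rng.permutation(X.shape[1])]          # shuffle documents
        row_cuts = np.linspace(0, X.shape[0], M + 1, dtype=int)
        col_cuts = np.linspace(0, X.shape[1], M + 1, dtype=int)
        X = X.tocsr()
        return [[X[row_cuts[i]:row_cuts[i + 1], col_cuts[j]:col_cuts[j + 1]]
                 for j in range(M)] for i in range(M)]

Random permutation makes the expected number of nonzero elements roughly equal across blocks, although, as discussed next, it cannot make them exactly equal.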

[Figure 2: The locking problem occurs when four threads (red, yellow, blue and green) process four independent blocks. The first column denotes the document-word matrix; the second and third columns denote the parameter matrices phi-hat_w(k) and theta-hat_d(k).]

[Figure 3: Residual-based dynamic scheduling of two threads (denoted by red and yellow) to process free data blocks without the locking time cost.]

[Figure: The residual reflects the convergence speed; the residual between phi-hat^{m,t} and phi-hat^{m,t-1} (solid line) lower-bounds the distances to the stationary point phi-hat^{m,infinity} (dashed lines).]

However, the locking problem still exists because the NNZ of each data block may not be exactly the same, and there are also slight differences in computing efficiency among different threads. Our approach is residual-based scheduling, which solves the locking problem as illustrated in Fig. 3. First, we set M > N, i.e., the number of data blocks is larger than the number of threads. For example, in Fig. 3 we have N = 2 threads and partition the data matrix into M x M blocks. With N = 2 threads, this creates more "free" data blocks without access conflicts. In the left panel, the red and yellow threads simultaneously process two non-conflicting blocks. If the yellow thread finishes sweeping its block earlier than the red thread, it can directly jump to a free block (block 5 in the figure) without waiting for the red thread. Then, the red thread can jump to another free block while the yellow thread is processing block 5. The right panel shows the scheduling order of the red and yellow threads, where data blocks that overlap on the time axis have no access conflicts. For example, block 7 (red thread) has no access conflict with the yellow thread's blocks (e.g., block 5), and block 5 (yellow thread) has no access conflict with the red thread's blocks (e.g., block 8). In this asynchronous parameter update strategy, there are few locking costs (i.e., few threads need to wait), but there are scheduling costs.

We propose an efficient residual-based scheduling method that can speed up the convergence of PEM. For each data block m, we define the residual as

    r^{m,t} = \sum_{w,k}\big|\hat\phi^{m,t}_w(k) - \hat\phi^{m,t-1}_w(k)\big|,    (16)

where phi-hat^{m,t}_w(k) is the updated parameter sub-matrix at sweep t and phi-hat^{m,t-1}_w(k) is the parameter sub-matrix before the update at sweep t. The residual indicates the convergence speed when sweeping the current data block m. EM theory shows that the parameter sub-matrix phi-hat^{m,t} converges to a stationary point phi-hat^{m,infinity} as t goes to infinity. So, if we minimize the largest distance between phi-hat^{m,t} and phi-hat^{m,infinity} with higher priority, we speed up the convergence of PEM. However, we do not know this distance because the stationary point phi-hat^{m,infinity} is unknown. Alternatively, we turn to minimizing the lower bound (16) on this distance, which can be calculated easily. Using the triangle inequality, we get the lower bound

    r^{m,t} = \|\hat\phi^{m,t} - \hat\phi^{m,t-1}\| \le \|\hat\phi^{m,t} - \hat\phi^{m,\infty}\| + \|\hat\phi^{m,t-1} - \hat\phi^{m,\infty}\|.    (17)

The residual figure above illustrates this definition: the residual (solid line) is a lower bound of the distances to be minimized (dashed lines) according to the triangle inequality. In this way, the scheduling order is to sweep the free block with the largest residual first. As t goes to infinity, the residual r^{m,t} vanishes due to convergence. This property ensures that the residual of each free block becomes smaller over time, so that all blocks have a chance to be swept under residual-based dynamic scheduling.
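A minimal sketch of the residual of Eq. (16) and of the pick-the-largest-residual rule follows; block_residual and next_block are our own illustrative names, not toolbox functions.

    import numpy as np

    def block_residual(phi_before, phi_after):
        """Residual of Eq. (16): L1 distance between the topic-word sub-matrix of a data
        block before and after one sweep; a larger residual means the block is farther
        from convergence."""
        return np.abs(phi_after - phi_before).sum()

    def next_block(residuals, free_blocks):
        """Residual-based scheduling rule: among the free (non-conflicting) blocks,
        sweep the one with the largest residual first."""
        return max(free_blocks, key=lambda m: residuals[m])

In practice the residual is updated once per sweep of a block, so the scheduling overhead is negligible compared with the E-step and M-step themselves.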
3.2 Implementation Issues
We find that using single-precision floating-point computation does not suffer from numerical error accumulation. Empirically, single precision runs noticeably faster than double precision while saving around 50% of the memory. Modern CPUs provide Streaming SIMD Extension (SSE) instructions that can concurrently run floating-point multiplications and additions. To speed up both the E-step and the M-step, we apply SSE instructions to the vector inner products and additions in the inner loops of Figs. 9, 10 and 11, which significantly reduces their time cost.

4. EXPERIMENTS
In our experiments, we refer to parallel BEM as PBEM, parallel IEM as PIEM, and parallel OEM as POEM.

Table 3: Statistics of the data sets. (For PUBMED, WIKI and NYTIMES, the table lists the vocabulary size W, the numbers of training and test documents, and the numbers of nonzero elements NNZ in the training and test sets.)

Table 4: Convergence time cost (seconds) of PBEM-noScheduling, PIEM-noScheduling, PBEM, PIEM, PBGS and PBCVB on the three data sets for a fixed number of topics K.

We compare these PEM algorithms with the following state-of-the-art parallel LDA algorithms in shared memory systems: Yahoo!LDA (parallel batch GS, PGS) [5, ], GPU-LDA (parallel batch CVB, PCVB) [9, ], parallel online VB (POVB) [], and distributed stochastic MCMC (parallel online GS, POGS) [3]. According to [], for a fair comparison, we set matched symmetric Dirichlet hyperparameters {alpha, beta} in PGS, POGS and PCVB, in PEM, and correspondingly offset values in POVB. We carry out experiments on a server with two Intel Xeon processors and a large main memory; the cores of the two processors provide all the threads used below. As in previous studies [,, 35], we use the predictive perplexity (15) to evaluate topic modeling accuracy: the lower the perplexity, the higher the accuracy. If the difference in predictive perplexity between two consecutive iterations falls below a small threshold, the algorithm is considered to have converged [35]. In this way, we can compare the convergence time cost of all algorithms. Table 3 lists the publicly available training and held-out test sets used in our experiments. Infrequent words are removed from the vocabulary, similar to [], so the number of nonzero elements in these data sets becomes smaller than in the original sets. Among these data sets, PUBMED contains around 8 million documents, WIKI several million documents, and NYTIMES around 300,000 documents. These data sets are big enough for evaluating parallel LDA algorithms. For residual-based dynamic scheduling in PEM, we set M larger than N to create free data blocks, similar to [39]. In the benchmark algorithms without residual-based dynamic scheduling, we set M = N, similar to [9].

4.1 PBEM and PIEM
In this subsection, we compare the parallel batch LDA algorithms in multi-core systems (PBEM/PIEM vs. PGS/PCVB). Table 4 shows the convergence time cost of all algorithms for a fixed K. We see that PBEM and PIEM converge consistently faster than PBEM-noScheduling and PIEM-noScheduling on PUBMED, WIKI and NYTIMES. On average, residual-based dynamic scheduling reduces the running time to convergence by a noticeable margin, excluding the speedup effects brought by the implementation techniques of Subsection 3.2. This confirms the effectiveness of residual-based dynamic scheduling in reducing the overall locking time. In addition, we find that PIEM benefits more from dynamic scheduling than PBEM; the major reason is that IEM often propagates the influence of the data blocks with the largest residuals more efficiently than BEM. Both PBEM and PIEM converge to almost the same perplexity level, which indicates almost the same topic modeling accuracy. On the NYTIMES data set, PIEM converges significantly more slowly than PBEM even after adding residual-based dynamic scheduling. Indeed, there is currently no theory that IEM always converges faster than BEM [8, ], though some limited experiments [] indicate that IEM converges faster than BEM. In our experiments, the three data sets have quite different word distributions: PIEM converges slightly faster than PBEM on PUBMED and WIKI, but more slowly on NYTIMES. This gives an example in which BEM sometimes converges faster than IEM.

[Figure: Scalability of PBEM and PIEM: Scale Up and Speed Up versus the number of threads on PUBMED and WIKI.]

In Fig. 5, we compare the predictive perplexity of PBEM/PIEM and PGS/PCVB for several numbers of topics K on the three data sets.
PBEM/PIEM always converges to a much lower predictive perplexity than PGS; on average, there is a clear predictive perplexity improvement. Because PCVB is similar to PIEM, it converges to almost the same predictive perplexity as PIEM. Clearly, both PBEM and PIEM converge significantly faster than PGS and PCVB: their perplexity curves always lie to the left of those of PGS/PCVB. The speedup is largely attributable to the residual-based dynamic scheduling method as well as the fast convergence speed of EM. In practice, PBEM and PIEM can process the roughly 8 million documents of PUBMED within tens of minutes on a single PC, which is comparable with a previous multi-processor solution running on many CPUs [9]. Therefore, parallel LDA algorithms in multi-core systems are not only competitive but also affordable in the big data era.

To test scalability, we perform two types of experiments on both PUBMED and WIKI: Scale Up and Speed Up. In Scale Up, we establish scalability in terms of the number of documents. We fix the amount of data processed by each thread and increase the number of threads from 1 (x-axis); thus, the scale of the processed data grows proportionally with the number of threads. The Scale Up value (y-axis) is the ratio between the convergence time of each algorithm and that of PCVB using one thread on the fixed per-thread amount of data. In Speed Up, we establish scalability in terms of the speedup in convergence time as the number of available threads increases.

[Figure 5: Comparisons of predictive perplexity versus training time (seconds) for PBEM, PIEM, PGS and PCVB with several numbers of topics K on PUBMED, WIKI and NYTIMES.]

We use the entire PUBMED and WIKI data sets and increase the number of threads from 1 (x-axis). The Speed Up value (y-axis) is the ratio between the convergence time of each algorithm and that of PCVB using one thread on the entire data set.

The scalability figure shows that PIEM performs the best in terms of both Scale Up and Speed Up. The top row shows that the Scale Up curve of PIEM remains almost a horizontal line as the amount of processed data increases, which means that PIEM spends a large fraction of its runtime on topic modeling itself as the volume of data grows. In comparison, the Scale Up curves of PGS, PCVB and PBEM increase roughly linearly with the volume of processed data. The bottom row shows that the Speed Up curve of PIEM is almost linear in the number of threads, so more threads lead to faster convergence. In comparison, the Speed Up curves of PGS, PCVB and PBEM bend noticeably as the number of threads increases, although PBEM's scalability is still much better than that of both PGS and PCVB. The major reason why PIEM has the best scalability is that residual-based dynamic scheduling performs very well in PIEM, so the locking time is significantly reduced. For large-scale data sets in shared memory systems, we therefore advocate PIEM because of its good scalability.

4.2 POEM
In this subsection, we compare POEM with two parallel online LDA algorithms: POVB (multi-core OVB with open-source code) and POGS. Similar to previous work [], we set the learning parameters kappa = 0.5 together with the pre-defined tau and mini-batch size. First, we compare the parallel online LDA algorithms with the batch algorithms PBEM and PIEM. Fig. 7 (left panel) shows the convergence time costs (x-axis) and the predictive perplexity (y-axis) achieved on the three data sets WIKI (red), NYTIMES (blue) and PUBMED (green) for a fixed K. An algorithm lying in the bottom-left area achieves the desirable topic modeling result (i.e., fast convergence speed as well as high topic modeling accuracy).

[Figure 7: POEM convergence speed and predictive perplexity: convergence time cost (x-axis) versus predictive perplexity (y-axis) for two values of K on WIKI, NYTIMES and PUBMED.]

We see that the batch algorithms converge faster than the online ones, since they use the global gradient ascent over all data points while the online algorithms use only the local gradient ascent of each mini-batch to update parameters. This is consistent with the previous finding that the convergence rate of stochastic algorithms is often slower than that of batch algorithms []. POEM (star marker) converges to almost the same predictive perplexity as PBEM (circle marker) and PIEM (plus marker), which confirms that POEM can converge to a local maximum of LDA's log-likelihood function, as shown in Section 7.3. In comparison, POGS (cross marker) and POVB (triangle marker) converge around 3 times more slowly than POEM.
The main reason is that POVB uses the computationally complicated digamma functions of Table 2, while POGS uses many more iterations for learning each mini-batch due to its Markov chain Monte Carlo (MCMC) nature. We also see that POVB and POGS converge at a higher level of predictive perplexity than POEM, supporting our analysis in Section 2 that EM yields higher topic modeling accuracy because of its inferred posterior p(theta, phi | x, alpha, beta). Although [] states that the topic modeling accuracy of different inference algorithms can be made almost the same by tuning the hyperparameters {alpha, beta}, we still advocate the standard EM framework because it converges much faster than both VB and GS.

[Figure 8: Scalability of POEM compared with POGS and POVB: Scale Up and Speed Up versus the number of threads on PUBMED and WIKI.]

Although POEM, POVB and POGS converge more slowly than PBEM and PIEM, they are more memory-efficient and can handle larger-scale topic modeling tasks on a single PC because they do not need to store the large matrix theta-hat. For example, PBEM and PIEM cannot efficiently process the PUBMED data set for a very large K, while POEM, POVB and POGS can. Fig. 7 (right panel) compares the convergence time costs and predictive perplexity of POEM, POVB and POGS for this larger K. Similar to the results for the smaller K, POEM is significantly faster and achieves a lower perplexity than POVB and POGS. In practice, POEM processes the PUBMED data set at this larger K in less than 3 hours, while PGS running on many CPUs requires several hours on the same scale of data [9].

Fig. 8 compares the Scale Up and Speed Up curves of POEM with those of POVB and POGS. In Scale Up, we fix each thread to process a fixed amount of data and increase the number of threads (x-axis). The Scale Up value (y-axis) is the ratio between the convergence time of each algorithm and that of POVB using one thread on the per-thread amount of data. The perfect Scale Up curve is a horizontal line in the bottom area. Clearly, the Scale Up curve of POEM lies significantly lower than those of POGS and POVB, indicating much better Scale Up performance. In Speed Up, we take the ratio between the convergence time of each parallel online LDA algorithm and that of POVB using one thread on the entire data set (y-axis), and increase the number of threads (x-axis) to see whether convergence speed increases. The perfect Speed Up curve is a straight line with a high slope and no bending. We see that the Speed Up curve of POEM is significantly higher than those of POVB and POGS, which shows that POEM reaches a higher speedup when more threads are used. Both Fig. 7 and Fig. 8 confirm that the proposed PEM algorithms are more scalable than the current state-of-the-art solutions.

5. CONCLUSIONS
Scalable LDA algorithms in shared memory systems are needed for big data on widely used multi-core systems. Unlike previous parallel solutions using batch/online VB and GS inference, we advocate the EM framework to build more scalable parallel LDA algorithms. Using efficient residual-based dynamic scheduling, we propose scalable PEM algorithms for LDA with faster convergence and shorter locking time than the current state-of-the-art. Experiments show that residual-based dynamic scheduling effectively reduces the locking time and speeds up the convergence of PEM; the technique can also be used in other latent variable models where EM inference works. In future work, we shall study how to extend PEM from multi-core systems to multi-processor systems [3, 8].

6. ACKNOWLEDGEMENTS
This work was supported by the National Grant Fundamental Research (973 Program) of China, the NSFC, the Hong Kong RGC, and the Natural Science Foundation of the Jiangsu Higher Education Institutions of China. This work was also partially supported by the Collaborative Innovation Center of Novel Software Technology and Industrialization.
7. APPENDIX
In this appendix, we derive the BEM, IEM and OEM algorithms for LDA, which infer the posterior p(theta, phi | x, alpha, beta), proportional to p(x, theta, phi | alpha, beta), from the full joint probability of LDA. This objective is quite different from the VB [], GS [] and CVB algorithms, which infer the posteriors p(theta, z | x, phi, alpha, beta), p(z | x, alpha, beta), and p(theta, phi, z | x, alpha, beta), respectively. The time and space complexities of these algorithms were compared in Table 2.

7.1 Batch EM (BEM)
We maximize the likelihood function of LDA in terms of the multinomial parameter set lambda = {theta, phi}:

    p(x,\theta,\phi\mid\alpha,\beta)=\Big[\prod_{w,d,i}\sum_k p(x_{w,d,i}=1, z^k_{w,d,i}=1\mid\theta_d(k),\phi_w(k))\Big]\prod_{d,k}p(\theta_d(k)\mid\alpha)\prod_{w,k}p(\phi_w(k)\mid\beta).    (18)

Employing the Bayes rule and the definition of multinomial distributions, we get the word likelihood

    p(x_{w,d,i}=1, z^k_{w,d,i}=1\mid\theta_d(k),\phi_w(k)) = p(x_{w,d,i}=1\mid z^k_{w,d,i}=1,\phi_w(k))\,p(z^k_{w,d,i}=1\mid\theta_d(k)) = x_{w,d,i}\,\phi_w(k)\,\theta_d(k),    (19)

which depends only on the word index {w, d} instead of the word token index i. Then, according to the definition of the Dirichlet distributions, the log-likelihood of (18) is

    \ell(\lambda)=\sum_{w,d,i}x_{w,d,i}\log\Big[\sum_k\mu_{w,d}(k)\frac{\theta_d(k)\phi_w(k)}{\mu_{w,d}(k)}\Big]+(\alpha-1)\sum_{d,k}\log\theta_d(k)+(\beta-1)\sum_{w,k}\log\phi_w(k),    (20)

where mu_{w,d}(k) is some topic distribution over the word index {w, d} satisfying sum_k mu_{w,d}(k) = 1 and mu_{w,d}(k) >= 0. From (19), we observe that the token-level terms reduce to the counts x_{w,d}, so we can cancel the word token index i in (20). Because the logarithm is concave, by Jensen's inequality we have

    \ell(\lambda)\ge \ell(\mu,\lambda)=\sum_{w,d}x_{w,d}\sum_k\mu_{w,d}(k)\log\frac{\theta_d(k)\phi_w(k)}{\mu_{w,d}(k)}+(\alpha-1)\sum_{d,k}\log\theta_d(k)+(\beta-1)\sum_{w,k}\log\phi_w(k),    (21)

which gives a lower bound of the log-likelihood (20). The equality holds if and only if

    \mu_{w,d}(k)\propto\theta_d(k)\phi_w(k).    (22)

In EM, the K-length posterior probability vector mu_{w,d}(k) is the responsibility that topic k takes for the word index {w, d} [7]. For this choice of mu_{w,d}(k), Eq. (21) gives a tight lower bound on the log-likelihood (20) that we are trying to maximize. This is called the E-step in EM [8]. In the successive M-step, we then maximize (21) with respect to the parameters to obtain a new setting of lambda. Since the hyperparameters {alpha, beta} are fixed, without loss of generality we derive the M-step update for the parameter theta_d(k). There is an additional constraint sum_k theta_d(k) = 1 because theta_d(k) is a parameter of a multinomial distribution. To deal with this constraint, we construct the Lagrangian from (21) by grouping together only the terms that depend on theta_d(k),

    \ell(\theta)=\sum_k\Big[\sum_w x_{w,d}\mu_{w,d}(k)+\alpha-1\Big]\log\theta_d(k)+\delta\Big(\sum_k\theta_d(k)-1\Big),    (23)

where delta is the Lagrange multiplier. Taking derivatives, we find

    \frac{\partial \ell(\theta)}{\partial\theta_d(k)}=\frac{\sum_w x_{w,d}\mu_{w,d}(k)+\alpha-1}{\theta_d(k)}+\delta.    (24)

Setting this to zero, we get

    \theta_d(k)=-\frac{\sum_w x_{w,d}\mu_{w,d}(k)+\alpha-1}{\delta}.    (25)

Using the constraint sum_k theta_d(k) = 1, we easily find -delta = sum_k [sum_w x_{w,d} mu_{w,d}(k) + alpha - 1]. We therefore have our M-step update for the parameter theta_d(k),

    \theta_d(k)=\frac{\hat\theta_d(k)+\alpha-1}{\sum_k\hat\theta_d(k)+K(\alpha-1)},    (26)

where theta-hat_d(k) = sum_w x_{w,d} mu_{w,d}(k) is the expected sufficient statistic. Similarly, the other multinomial parameter can be estimated by

    \phi_w(k)=\frac{\hat\phi_w(k)+\beta-1}{\hat\phi(k)+W(\beta-1)},    (27)

where phi-hat_w(k) = sum_d x_{w,d} mu_{w,d}(k) is the expected sufficient statistic and phi-hat(k) = sum_w phi-hat_w(k). Note that the denominator of (26) is a constant. Substituting (26) and (27) into (22), we obtain the E-step in terms of sufficient statistics,

    \mu_{w,d}(k)\propto\frac{[\hat\theta_d(k)+\alpha-1]\,[\hat\phi_w(k)+\beta-1]}{\hat\phi(k)+W(\beta-1)},    (28)

and EM iterates the E-step and M-step to refine the sufficient statistics theta-hat_d(k) and phi-hat_w(k), which can be normalized into the multinomial parameters according to (26) and (27).

Fig. 9 shows BEM for LDA. We initialize three temporary matrices theta-hat^new_d(k), phi-hat^new_w(k), phi-hat^new(k) (line 3) to accumulate the responsibilities of the E-step over all words (line 4) without storing the large K x NNZ responsibility matrix mu in memory. At the end of each iteration t, 1 <= t <= T, we copy the three temporary matrices back to theta-hat_d(k), phi-hat_w(k), phi-hat(k) in the M-step (line 7). BEM iterates the E-step and M-step repeatedly.

    input : x, K, T, alpha, beta.
    output: phi-hat, theta-hat.
    1  initialize theta-hat_d(k), phi-hat_w(k), phi-hat(k);
    2  for t <- 1 to T do
    3      theta-hat^new_d(k) <- 0; phi-hat^new_w(k) <- 0; phi-hat^new(k) <- 0;
    4      for each nonzero x_{w,d} do
    5          mu_{w,d}(k) <- normalize([theta-hat_d(k) + alpha - 1][phi-hat_w(k) + beta - 1] / [phi-hat(k) + W(beta - 1)]);
    6          theta-hat^new_d(k) <- theta-hat^new_d(k) + x_{w,d} mu_{w,d}(k);
               phi-hat^new_w(k) <- phi-hat^new_w(k) + x_{w,d} mu_{w,d}(k);
               phi-hat^new(k) <- phi-hat^new(k) + x_{w,d} mu_{w,d}(k);
    7      theta-hat_d(k) <- theta-hat^new_d(k); phi-hat_w(k) <- phi-hat^new_w(k); phi-hat(k) <- phi-hat^new(k);

Figure 9: BEM for LDA.

Suppose lambda^{t-1} and lambda^{t} are the parameters from two successive iterations of EM. It is easy to prove that

    \ell(\lambda^{t})\ge \ell(\mu^{t},\lambda^{t})\ge \ell(\mu^{t},\lambda^{t-1})=\ell(\lambda^{t-1}),    (29)

which shows that EM always monotonically improves LDA's log-likelihood (20) until convergence.
EM can also be viewed as a coordinate ascent on the lower bound ell(mu, lambda) of (21), in which the E-step maximizes it with respect to mu and the M-step maximizes it with respect to lambda.

7.2 Incremental EM (IEM)
In batch EM (BEM), the M-step is performed only after the E-step has updated all responsibilities mu_{w,d}(k), which slows down convergence because the updated responsibility of each word does not immediately influence the parameter estimation in the M-step. This problem motivates incremental EM (IEM) [8]. Compared with BEM (28), IEM alternates a single E-step and M-step for each nonzero element x_{w,d} sequentially. Thus, the E-step of IEM becomes

    \mu_{w,d}(k)\propto\frac{[\hat\theta_d^{\neg w}(k)+\alpha-1]\,[\hat\phi_w^{\neg d}(k)+\beta-1]}{\hat\phi^{\neg(w,d)}(k)+W(\beta-1)}.    (30)

The expected sufficient statistics are

    \hat\theta_d^{\neg w}(k)=\sum_{w'\neq w}x_{w',d}\,\mu_{w',d}(k),    (31)

    \hat\phi_w^{\neg d}(k)=\sum_{d'\neq d}x_{w,d'}\,\mu_{w,d'}(k),    (32)

    \hat\phi^{\neg(w,d)}(k)=\sum_{(w',d')\neq(w,d)}x_{w',d'}\,\mu_{w',d'}(k),    (33)

where the superscripts neg w, neg d and neg (w, d) denote all word indices except w, all document indices except d, and all word indices except {w, d}, respectively. After the E-step for each word, the M-step immediately updates the sufficient statistics by adding the updated posterior mu_{w,d}(k) of (30) into (31), (32) and (33). Comparing the E-steps of BEM and IEM, we find that the major difference between (28) and (30) is that IEM excludes the current posterior x_{w,d} mu_{w,d}(k) from the sufficient statistics in (31), (32) and (33). As a result, IEM's space complexity is O(K(D + W + NNZ)) because it must store the large K x NNZ responsibility matrix mu. For a large K, this responsibility matrix alone (in double-precision floating-point format) would occupy far more memory than a commodity PC provides on the PUBMED data set [], which has hundreds of millions of nonzero elements. Note that CVB0 [] and asynchronous BP [35, 3] are equivalent to IEM, and are therefore also memory-consuming for big data on a single PC.

So, we propose a modified IEM, shown in Fig. 10, that does not need to store the large responsibility matrix. After random initialization, we shrink the parameter matrices theta-hat_d(k), phi-hat_w(k), phi-hat(k) by a certain proportion (line 4). This avoids subtracting the current responsibility from the parameter matrices in (31), (32) and (33), so the E-step of the incremental EM becomes (28) rather than (30). In this way, we do not need to store the large K x NNZ responsibility matrix mu in memory. After the E-step (line 5) for each nonzero element, the parameter matrices are compensated by the updated K-tuple responsibility mu_{w,d}(k) in the M-step (line 6). In this way, a change of the parameter matrices immediately influences the update of the responsibility of the next nonzero element (line 5). In anticipation, this incremental update passes the influence of updated responsibilities more efficiently than the batch EM of Fig. 9.

    input : x, K, T, alpha, beta.
    output: phi-hat, theta-hat.
    1  initialize theta-hat_d(k), phi-hat_w(k), phi-hat(k);
    2  for t <- 1 to T do
    3      for each nonzero x_{w,d} in random order do
    4          theta-hat_d(k) <- (1 - x_{w,d} / sum_w x_{w,d}) theta-hat_d(k);
               phi-hat_w(k) <- (1 - x_{w,d} / sum_d x_{w,d}) phi-hat_w(k);
               phi-hat(k) <- (1 - x_{w,d} / sum_{w,d} x_{w,d}) phi-hat(k);
    5          mu_{w,d}(k) <- normalize([theta-hat_d(k) + alpha - 1][phi-hat_w(k) + beta - 1] / [phi-hat(k) + W(beta - 1)]);
    6          theta-hat_d(k) <- theta-hat_d(k) + x_{w,d} mu_{w,d}(k);
               phi-hat_w(k) <- phi-hat_w(k) + x_{w,d} mu_{w,d}(k);
               phi-hat(k) <- phi-hat(k) + x_{w,d} mu_{w,d}(k);

Figure 10: Modified IEM for LDA.

Likewise, it is easy to see that IEM also converges to a local stationary point of LDA's log-likelihood because

    \ell(\lambda^{t-1})=\ell(\mu^{t-1},\lambda^{t-1})\le \ell(\mu^{t}_{w,d},\mu^{t-1}_{\neg(w,d)},\lambda^{t-1})\le \ell(\mu^{t}_{w,d},\mu^{t-1}_{\neg(w,d)},\lambda^{t})\le\cdots\le \ell(\mu^{t},\lambda^{t})=\ell(\lambda^{t}).    (34)
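As an illustration (not the toolbox implementation), the following NumPy sketch mirrors the inner loop of Fig. 10 on a COO sparse matrix; modified_iem_sweep and its argument names are hypothetical.

    import numpy as np

    def modified_iem_sweep(X, theta_hat, phi_hat, phi_hat_k, alpha, beta, rng):
        """One sweep of the modified IEM of Fig. 10 over the nonzero elements of X (COO).

        The current responsibility is removed by rescaling the sufficient statistics
        (line 4 of Fig. 10) instead of storing the K x NNZ matrix mu in memory."""
        W = phi_hat.shape[0]
        order = rng.permutation(len(X.data))                 # visit nonzeros in random order
        col_sum = np.asarray(X.sum(axis=0)).ravel()          # sum_w x_{w,d} per document
        row_sum = np.asarray(X.sum(axis=1)).ravel()          # sum_d x_{w,d} per word
        total = X.sum()
        for idx in order:
            w, d, x_wd = X.row[idx], X.col[idx], X.data[idx]
            # shrink the statistics in proportion to this element (replaces subtracting mu)
            theta_hat[d] *= 1.0 - x_wd / col_sum[d]
            phi_hat[w]   *= 1.0 - x_wd / row_sum[w]
            phi_hat_k    *= 1.0 - x_wd / total
            # E-step, Eq. (28)
            mu = (theta_hat[d] + alpha - 1) * (phi_hat[w] + beta - 1) \
                 / (phi_hat_k + W * (beta - 1))
            mu /= mu.sum()
            # M-step: add the updated responsibility back immediately
            theta_hat[d] += x_wd * mu
            phi_hat[w]   += x_wd * mu
            phi_hat_k    += x_wd * mu

Because each nonzero element is shrunk and then immediately compensated, the sketch keeps only the O(K(D + W)) sufficient statistics in memory, matching the Modified IEM row of Table 2.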
7.3 Online EM (OEM)
The basic idea of online algorithms is to partition a stream of documents into small mini-batches of size D_s, and to use the online gradient produced by each mini-batch to estimate the topic distributions incrementally. OEM [] combines IEM with stochastic approximation, which achieves convergence to stationary points of the likelihood function by interpolating between sufficient statistics based on a learning rate rho_s satisfying the Robbins-Monro conditions [],

    \rho_s=(\tau+s)^{-\kappa},    (35)

where tau is a pre-defined number of mini-batches, s is the mini-batch index, and kappa in (0.5, 1] is provided by users. Similar to (34), it is easy to observe that

    \ell(\hat\phi^{s-1})=\ell(\mu^{s+1:\infty},\mu^{s-1},\hat\phi^{s-1})\le \ell(\mu^{s+1:\infty},\mu^{s},\hat\phi^{s-1})\le \ell(\mu^{s+1:\infty},\mu^{s},\hat\phi^{s})=\ell(\hat\phi^{s}),    (36)

where mu^{s+1:infinity} denotes the responsibilities of unseen mini-batches from s + 1 onwards. Note that the lower bound (36) will not touch the log-likelihood (20) until the responsibilities of all data streams have been updated in (21). The inequality (36) confirms that OEM improves phi-hat^s so as to maximize LDA's log-likelihood (20). In practice, OEM reads each mini-batch x^s_{w,d} into memory and runs IEM until mu^s converges. Then, the sufficient statistics phi-hat^s are updated by a linear combination of the previous phi-hat^{s-1} and the updated sufficient statistics sum_d x^s_{w,d} mu^s_{w,d}(k),

    \hat\phi^{s}=(1-\rho_s)\,\hat\phi^{s-1}+\rho_s\Big[\sum_d x^s_{w,d}\,\mu^s_{w,d}(k)\Big].    (37)

Since OEM stores only the current mini-batch x^s_{w,d}, the local parameters mu^s and theta-hat^s, and the global parameter phi-hat^s in memory, it can process big data streams with a low space complexity O(K(D_s + W + NNZ_s)), where D_s is the number of documents and NNZ_s the number of nonzero elements in the s-th mini-batch.

    input : x^s_{w,d}, D_s, tau, kappa, K, alpha, beta.
    output: phi-hat^S.
     1  for s <- 1 to S do
     2      load x^s_{w,d}, D_s into memory;
     3      rho_s <- (tau + s)^{-kappa}; initialize mu^s;
     4      theta-hat^s_d(k) <- sum_w x^s_{w,d} mu^s_{w,d}(k), for all d in the mini-batch;
     5      phi-hat^s_w(k) <- phi-hat^{s-1}_w(k) + sum_d x^s_{w,d} mu^s_{w,d}(k);
            phi-hat^s(k) <- phi-hat^{s-1}(k) + sum_{w,d} x^s_{w,d} mu^s_{w,d}(k);
     6      repeat
     7          for each nonzero x^s_{w,d} in random order do
     8              theta-hat^{s,neg(w,d)}_d(k) <- theta-hat^s_d(k) - x^s_{w,d} mu^s_{w,d}(k);
                    phi-hat^{s,neg(w,d)}_w(k) <- phi-hat^s_w(k) - x^s_{w,d} mu^s_{w,d}(k);
                    phi-hat^{s,neg(w,d)}(k) <- phi-hat^s(k) - x^s_{w,d} mu^s_{w,d}(k);
     9              mu^s_{w,d}(k) <- normalize([theta-hat^s_d(k) + alpha - 1][phi-hat^s_w(k) + beta - 1] / [phi-hat^s(k) + W(beta - 1)]);
    10              theta-hat^s_d(k) <- theta-hat^s_d(k) + x^s_{w,d} mu^s_{w,d}(k);
                    phi-hat^s_w(k) <- phi-hat^s_w(k) + x^s_{w,d} mu^s_{w,d}(k);
                    phi-hat^s(k) <- phi-hat^s(k) + x^s_{w,d} mu^s_{w,d}(k);
    11      until converged;
    12      phi-hat^s_w(k) <- (1 - rho_s) phi-hat^{s-1}_w(k) + rho_s [sum_d x^s_{w,d} mu^s_{w,d}(k)];
            phi-hat^s(k) <- (1 - rho_s) phi-hat^{s-1}(k) + rho_s [sum_{w,d} x^s_{w,d} mu^s_{w,d}(k)];
    13      free x^s_{w,d}, theta-hat^s, mu^s from memory;

Figure 11: OEM for LDA.

Fig. 11 summarizes the OEM algorithm for LDA; online BP [37] and stochastic CVB0 [9] can be seen as implementations of OEM. Note that OEM can revisit previously processed mini-batches.
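A minimal sketch of the online interpolation of Eqs. (35) and (37) follows; it is illustrative only: oem_stream, run_iem and the default tau are our own assumptions, and run_iem stands for the inner IEM loop of Fig. 11 (lines 6-11).

    import numpy as np

    def oem_stream(minibatches, phi_hat, run_iem, alpha, beta, tau=1.0, kappa=0.5):
        """Online EM (Fig. 11): for each mini-batch s, fit local statistics with IEM and
        then blend them into the global topic-word statistics by Eq. (37)."""
        for s, X_s in enumerate(minibatches, start=1):
            rho_s = (tau + s) ** (-kappa)                 # learning rate, Eq. (35)
            # run IEM on this mini-batch until the local responsibilities converge;
            # it should return sum_d x^s_{w,d} mu^s_{w,d}(k), shaped like phi_hat
            phi_s = run_iem(X_s, phi_hat, alpha, beta)
            # Eq. (37): stochastic interpolation of the sufficient statistics
            # (a corpus/mini-batch rescaling of phi_s may be applied in a full system)
            phi_hat = (1.0 - rho_s) * phi_hat + rho_s * phi_s
            # the mini-batch and its local parameters can now be freed from memory
        return phi_hat

Because only the current mini-batch and the global phi-hat are held at any time, the sketch reflects the O(K(D_s + W + NNZ_s)) space complexity quoted above.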

8. REFERENCES
[1] A. Ahmed, M. Aly, J. Gonzalez, S. Narayanamurthy, and A. Smola. Scalable inference in latent variable models. In WSDM, 2012.
[2] A. Asuncion, M. Welling, P. Smyth, and Y. W. Teh. On smoothing and inference for topic models. In UAI, 2009.
[3] D. M. Blei. Introduction to probabilistic topic models. Communications of the ACM, 55(4):77-84, 2012.
[4] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. J. Mach. Learn. Res., 3:993-1022, 2003.
[5] T. Broderick, N. Boyd, A. Wibisono, A. C. Wilson, and M. I. Jordan. Streaming variational Bayes. In NIPS, 2013.
[6] O. Cappé and E. Moulines. Online expectation-maximization algorithm for latent data models. Journal of the Royal Statistical Society: Series B, 71(3):593-613, 2009.
[7] N. de Freitas and K. Barnard. Bayesian latent semantic analysis of multimedia databases. Technical report, University of British Columbia, 2001.
[8] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39:1-38, 1977.
[9] J. R. Foulds, L. Boyles, C. DuBois, P. Smyth, and M. Welling. Stochastic collapsed variational Bayesian inference for latent Dirichlet allocation. In KDD, 2013.
[10] T. L. Griffiths and M. Steyvers. Finding scientific topics. Proc. Natl. Acad. Sci., 101:5228-5235, 2004.
[11] M. Hoffman, D. Blei, and F. Bach. Online learning for latent Dirichlet allocation. In NIPS, 2010.
[12] D. Jiang, K. W.-T. Leung, and W. Ng. Fast topic discovery from web search streams. In WWW, 2014.
[13] A. Q. Li, A. Ahmed, S. Ravi, and A. J. Smola. Reducing the sampling complexity of topic models. In KDD, 2014.
[14] P. Liang and D. Klein. Online EM for unsupervised models. In NAACL-HLT, 2009.
[15] Z. Liu, Y. Zhang, E. Y. Chang, and M. Sun. PLDA+: Parallel latent Dirichlet allocation with data placement and pipeline processing. ACM Trans. Intell. Syst. Technol., 2(3), 2011.
[16] D. Mimno, M. D. Hoffman, and D. M. Blei. Sparse stochastic inference for latent Dirichlet allocation. In ICML, 2012.
[17] K. P. Murphy. Machine Learning: A Probabilistic Perspective. The MIT Press, 2012.
[18] R. M. Neal and G. E. Hinton. A view of the EM algorithm that justifies incremental, sparse, and other variants. Learning in Graphical Models, 89:355-368, 1998.
[19] D. Newman, A. Asuncion, P. Smyth, and M. Welling. Distributed algorithms for topic models. J. Mach. Learn. Res., 10:1801-1828, 2009.
[20] F. Niu, B. Recht, C. Ré, and S. J. Wright. HOGWILD!: A lock-free approach to parallelizing stochastic gradient descent. In NIPS, 2011.
[21] I. Porteous, D. Newman, A. Ihler, A. Asuncion, P. Smyth, and M. Welling. Fast collapsed Gibbs sampling for latent Dirichlet allocation. In KDD, 2008.
[22] H. Robbins and S. Monro. A stochastic approximation method. The Annals of Mathematical Statistics, 22(3):400-407, 1951.
[23] S. Ahn, B. Shahbaba, and M. Welling. Distributed stochastic gradient MCMC. In ICML, 2014.
[24] M. W. Schmidt, N. Le Roux, and F. Bach. Minimizing finite sums with the stochastic average gradient. arXiv:1309.2388, 2013.
[25] A. Smola and S. Narayanamurthy. An architecture for parallel topic models. In PVLDB, 2010.
[26] Y. W. Teh, D. Newman, and M. Welling. A collapsed variational Bayesian inference algorithm for latent Dirichlet allocation. In NIPS, 2007.
[27] Y. Wang, H. Bai, M. Stanton, W.-Y. Chen, and E. Chang. PLDA: Parallel latent Dirichlet allocation for large-scale applications. In Algorithmic Aspects in Information and Management, 2009.
[28] Y. Wang, X. Zhao, Z. Sun, H. Yan, L. Wang, Z. Jin, L. Wang, Y. Gao, C. Law, and J. Zeng. Peacock: Learning long-tail topic features for industrial applications. ACM Transactions on Intelligent Systems and Technology, 2015.
[29] F. Yan, N. Xu, and Y. Qi. Parallel inference for latent Dirichlet allocation on graphics processing units. In NIPS, 2009.
[30] J.-F. Yan, J. Zeng, Z.-Q. Liu, and Y. Gao. Towards big topic modeling. arXiv preprint, 2013.
[31] L. Yao, D. Mimno, and A. McCallum.
Efficient methods for topic model inference on streaming document collections. In KDD, 2009.
[32] J. Yuan, F. Gao, Q. Ho, W. Dai, J. Wei, X. Zheng, E. P. Xing, T.-Y. Liu, and W.-Y. Ma. LightLDA: Big topic models on modest compute clusters. arXiv preprint, 2014.
[33] H. Yun, H.-F. Yu, C.-J. Hsieh, S. V. N. Vishwanathan, and I. Dhillon. NOMAD: Non-locking, stochastic multi-machine algorithm for asynchronous and decentralized matrix completion. In PVLDB, 2014.
[34] J. Zeng. A topic modeling toolbox using belief propagation. J. Mach. Learn. Res., 13, 2012.
[35] J. Zeng, W. K. Cheung, and J. Liu. Learning topic models by belief propagation. IEEE Trans. Pattern Anal. Mach. Intell., 35(5), 2013.
[36] J. Zeng, Z.-Q. Liu, and X.-Q. Cao. A new approach to speeding up topic modeling. arXiv preprint, 2012.
[37] J. Zeng, Z.-Q. Liu, and X.-Q. Cao. Online belief propagation for topic modeling. arXiv preprint, 2012.
[38] K. Zhai, J. Boyd-Graber, and N. Asadi. Mr. LDA: A flexible large scale topic modeling package using variational inference in MapReduce. In WWW, 2012.
[39] Y. Zhuang, W.-S. Chin, Y.-C. Juan, and C.-J. Lin. A fast parallel SGD for matrix factorization in shared memory systems. In ACM Recommender Systems, 2013.


More information

Sparse Stochastic Inference for Latent Dirichlet Allocation

Sparse Stochastic Inference for Latent Dirichlet Allocation Sparse Stochastic Inference for Latent Dirichlet Allocation David Mimno 1, Matthew D. Hoffman 2, David M. Blei 1 1 Dept. of Computer Science, Princeton U. 2 Dept. of Statistics, Columbia U. Presentation

More information

SYNCHRONOUS SEQUENTIAL CIRCUITS

SYNCHRONOUS SEQUENTIAL CIRCUITS CHAPTER SYNCHRONOUS SEUENTIAL CIRCUITS Registers an counters, two very common synchronous sequential circuits, are introuce in this chapter. Register is a igital circuit for storing information. Contents

More information

Linear First-Order Equations

Linear First-Order Equations 5 Linear First-Orer Equations Linear first-orer ifferential equations make up another important class of ifferential equations that commonly arise in applications an are relatively easy to solve (in theory)

More information

Table of Common Derivatives By David Abraham

Table of Common Derivatives By David Abraham Prouct an Quotient Rules: Table of Common Derivatives By Davi Abraham [ f ( g( ] = [ f ( ] g( + f ( [ g( ] f ( = g( [ f ( ] g( g( f ( [ g( ] Trigonometric Functions: sin( = cos( cos( = sin( tan( = sec

More information

u!i = a T u = 0. Then S satisfies

u!i = a T u = 0. Then S satisfies Deterministic Conitions for Subspace Ientifiability from Incomplete Sampling Daniel L Pimentel-Alarcón, Nigel Boston, Robert D Nowak University of Wisconsin-Maison Abstract Consier an r-imensional subspace

More information

Capacity Analysis of MIMO Systems with Unknown Channel State Information

Capacity Analysis of MIMO Systems with Unknown Channel State Information Capacity Analysis of MIMO Systems with Unknown Channel State Information Jun Zheng an Bhaskar D. Rao Dept. of Electrical an Computer Engineering University of California at San Diego e-mail: juzheng@ucs.eu,

More information

Part I: Web Structure Mining Chapter 1: Information Retrieval and Web Search

Part I: Web Structure Mining Chapter 1: Information Retrieval and Web Search Part I: Web Structure Mining Chapter : Information Retrieval an Web Search The Web Challenges Crawling the Web Inexing an Keywor Search Evaluating Search Quality Similarity Search The Web Challenges Tim

More information

Online but Accurate Inference for Latent Variable Models with Local Gibbs Sampling

Online but Accurate Inference for Latent Variable Models with Local Gibbs Sampling Online but Accurate Inference for Latent Variable Models with Local Gibbs Sampling Christophe Dupuy INRIA - Technicolor christophe.dupuy@inria.fr Francis Bach INRIA - ENS francis.bach@inria.fr Abstract

More information

An Optimal Algorithm for Bandit and Zero-Order Convex Optimization with Two-Point Feedback

An Optimal Algorithm for Bandit and Zero-Order Convex Optimization with Two-Point Feedback Journal of Machine Learning Research 8 07) - Submitte /6; Publishe 5/7 An Optimal Algorithm for Banit an Zero-Orer Convex Optimization with wo-point Feeback Oha Shamir Department of Computer Science an

More information

Cascaded redundancy reduction

Cascaded redundancy reduction Network: Comput. Neural Syst. 9 (1998) 73 84. Printe in the UK PII: S0954-898X(98)88342-5 Cascae reunancy reuction Virginia R e Sa an Geoffrey E Hinton Department of Computer Science, University of Toronto,

More information

Homework 2 Solutions EM, Mixture Models, PCA, Dualitys

Homework 2 Solutions EM, Mixture Models, PCA, Dualitys Homewor Solutions EM, Mixture Moels, PCA, Dualitys CMU 0-75: Machine Learning Fall 05 http://www.cs.cmu.eu/~bapoczos/classes/ml075_05fall/ OUT: Oct 5, 05 DUE: Oct 9, 05, 0:0 AM An EM algorithm for a Mixture

More information

Robust Forward Algorithms via PAC-Bayes and Laplace Distributions. ω Q. Pr (y(ω x) < 0) = Pr A k

Robust Forward Algorithms via PAC-Bayes and Laplace Distributions. ω Q. Pr (y(ω x) < 0) = Pr A k A Proof of Lemma 2 B Proof of Lemma 3 Proof: Since the support of LL istributions is R, two such istributions are equivalent absolutely continuous with respect to each other an the ivergence is well-efine

More information

Acute sets in Euclidean spaces

Acute sets in Euclidean spaces Acute sets in Eucliean spaces Viktor Harangi April, 011 Abstract A finite set H in R is calle an acute set if any angle etermine by three points of H is acute. We examine the maximal carinality α() of

More information

Time-of-Arrival Estimation in Non-Line-Of-Sight Environments

Time-of-Arrival Estimation in Non-Line-Of-Sight Environments 2 Conference on Information Sciences an Systems, The Johns Hopkins University, March 2, 2 Time-of-Arrival Estimation in Non-Line-Of-Sight Environments Sinan Gezici, Hisashi Kobayashi an H. Vincent Poor

More information

The total derivative. Chapter Lagrangian and Eulerian approaches

The total derivative. Chapter Lagrangian and Eulerian approaches Chapter 5 The total erivative 51 Lagrangian an Eulerian approaches The representation of a flui through scalar or vector fiels means that each physical quantity uner consieration is escribe as a function

More information

Qubit channels that achieve capacity with two states

Qubit channels that achieve capacity with two states Qubit channels that achieve capacity with two states Dominic W. Berry Department of Physics, The University of Queenslan, Brisbane, Queenslan 4072, Australia Receive 22 December 2004; publishe 22 March

More information

Topic 7: Convergence of Random Variables

Topic 7: Convergence of Random Variables Topic 7: Convergence of Ranom Variables Course 003, 2016 Page 0 The Inference Problem So far, our starting point has been a given probability space (S, F, P). We now look at how to generate information

More information

All s Well That Ends Well: Supplementary Proofs

All s Well That Ends Well: Supplementary Proofs All s Well That Ens Well: Guarantee Resolution of Simultaneous Rigi Boy Impact 1:1 All s Well That Ens Well: Supplementary Proofs This ocument complements the paper All s Well That Ens Well: Guarantee

More information

LATTICE-BASED D-OPTIMUM DESIGN FOR FOURIER REGRESSION

LATTICE-BASED D-OPTIMUM DESIGN FOR FOURIER REGRESSION The Annals of Statistics 1997, Vol. 25, No. 6, 2313 2327 LATTICE-BASED D-OPTIMUM DESIGN FOR FOURIER REGRESSION By Eva Riccomagno, 1 Rainer Schwabe 2 an Henry P. Wynn 1 University of Warwick, Technische

More information

Calculus and optimization

Calculus and optimization Calculus an optimization These notes essentially correspon to mathematical appenix 2 in the text. 1 Functions of a single variable Now that we have e ne functions we turn our attention to calculus. A function

More information

Survey Sampling. 1 Design-based Inference. Kosuke Imai Department of Politics, Princeton University. February 19, 2013

Survey Sampling. 1 Design-based Inference. Kosuke Imai Department of Politics, Princeton University. February 19, 2013 Survey Sampling Kosuke Imai Department of Politics, Princeton University February 19, 2013 Survey sampling is one of the most commonly use ata collection methos for social scientists. We begin by escribing

More information

Non-Linear Bayesian CBRN Source Term Estimation

Non-Linear Bayesian CBRN Source Term Estimation Non-Linear Bayesian CBRN Source Term Estimation Peter Robins Hazar Assessment, Simulation an Preiction Group Dstl Porton Down, UK. probins@stl.gov.uk Paul Thomas Hazar Assessment, Simulation an Preiction

More information

A PAC-Bayesian Approach to Spectrally-Normalized Margin Bounds for Neural Networks

A PAC-Bayesian Approach to Spectrally-Normalized Margin Bounds for Neural Networks A PAC-Bayesian Approach to Spectrally-Normalize Margin Bouns for Neural Networks Behnam Neyshabur, Srinah Bhojanapalli, Davi McAllester, Nathan Srebro Toyota Technological Institute at Chicago {bneyshabur,

More information

Analyzing Tensor Power Method Dynamics in Overcomplete Regime

Analyzing Tensor Power Method Dynamics in Overcomplete Regime Journal of Machine Learning Research 18 (2017) 1-40 Submitte 9/15; Revise 11/16; Publishe 4/17 Analyzing Tensor Power Metho Dynamics in Overcomplete Regime Animashree Ananumar Department of Electrical

More information

'HVLJQ &RQVLGHUDWLRQ LQ 0DWHULDO 6HOHFWLRQ 'HVLJQ 6HQVLWLYLW\,1752'8&7,21

'HVLJQ &RQVLGHUDWLRQ LQ 0DWHULDO 6HOHFWLRQ 'HVLJQ 6HQVLWLYLW\,1752'8&7,21 Large amping in a structural material may be either esirable or unesirable, epening on the engineering application at han. For example, amping is a esirable property to the esigner concerne with limiting

More information

Multi-View Clustering via Canonical Correlation Analysis

Multi-View Clustering via Canonical Correlation Analysis Keywors: multi-view learning, clustering, canonical correlation analysis Abstract Clustering ata in high-imensions is believe to be a har problem in general. A number of efficient clustering algorithms

More information

arxiv: v1 [cs.ds] 31 May 2017

arxiv: v1 [cs.ds] 31 May 2017 Succinct Partial Sums an Fenwick Trees Philip Bille, Aners Roy Christiansen, Nicola Prezza, an Freerik Rye Skjoljensen arxiv:1705.10987v1 [cs.ds] 31 May 2017 Technical University of Denmark, DTU Compute,

More information

A note on asymptotic formulae for one-dimensional network flow problems Carlos F. Daganzo and Karen R. Smilowitz

A note on asymptotic formulae for one-dimensional network flow problems Carlos F. Daganzo and Karen R. Smilowitz A note on asymptotic formulae for one-imensional network flow problems Carlos F. Daganzo an Karen R. Smilowitz (to appear in Annals of Operations Research) Abstract This note evelops asymptotic formulae

More information

Level Construction of Decision Trees in a Partition-based Framework for Classification

Level Construction of Decision Trees in a Partition-based Framework for Classification Level Construction of Decision Trees in a Partition-base Framework for Classification Y.Y. Yao, Y. Zhao an J.T. Yao Department of Computer Science, University of Regina Regina, Saskatchewan, Canaa S4S

More information

On the Aloha throughput-fairness tradeoff

On the Aloha throughput-fairness tradeoff On the Aloha throughput-fairness traeoff 1 Nan Xie, Member, IEEE, an Steven Weber, Senior Member, IEEE Abstract arxiv:1605.01557v1 [cs.it] 5 May 2016 A well-known inner boun of the stability region of

More information

Power Generation and Distribution via Distributed Coordination Control

Power Generation and Distribution via Distributed Coordination Control Power Generation an Distribution via Distribute Coorination Control Byeong-Yeon Kim, Kwang-Kyo Oh, an Hyo-Sung Ahn arxiv:407.4870v [math.oc] 8 Jul 204 Abstract This paper presents power coorination, power

More information

The derivative of a function f(x) is another function, defined in terms of a limiting expression: f(x + δx) f(x)

The derivative of a function f(x) is another function, defined in terms of a limiting expression: f(x + δx) f(x) Y. D. Chong (2016) MH2801: Complex Methos for the Sciences 1. Derivatives The erivative of a function f(x) is another function, efine in terms of a limiting expression: f (x) f (x) lim x δx 0 f(x + δx)

More information

Improving Estimation Accuracy in Nonrandomized Response Questioning Methods by Multiple Answers

Improving Estimation Accuracy in Nonrandomized Response Questioning Methods by Multiple Answers International Journal of Statistics an Probability; Vol 6, No 5; September 207 ISSN 927-7032 E-ISSN 927-7040 Publishe by Canaian Center of Science an Eucation Improving Estimation Accuracy in Nonranomize

More information

One-dimensional I test and direction vector I test with array references by induction variable

One-dimensional I test and direction vector I test with array references by induction variable Int. J. High Performance Computing an Networking, Vol. 3, No. 4, 2005 219 One-imensional I test an irection vector I test with array references by inuction variable Minyi Guo School of Computer Science

More information

Math Notes on differentials, the Chain Rule, gradients, directional derivative, and normal vectors

Math Notes on differentials, the Chain Rule, gradients, directional derivative, and normal vectors Math 18.02 Notes on ifferentials, the Chain Rule, graients, irectional erivative, an normal vectors Tangent plane an linear approximation We efine the partial erivatives of f( xy, ) as follows: f f( x+

More information

Lecture 2 Lagrangian formulation of classical mechanics Mechanics

Lecture 2 Lagrangian formulation of classical mechanics Mechanics Lecture Lagrangian formulation of classical mechanics 70.00 Mechanics Principle of stationary action MATH-GA To specify a motion uniquely in classical mechanics, it suffices to give, at some time t 0,

More information

KNN Particle Filters for Dynamic Hybrid Bayesian Networks

KNN Particle Filters for Dynamic Hybrid Bayesian Networks KNN Particle Filters for Dynamic Hybri Bayesian Networs H. D. Chen an K. C. Chang Dept. of Systems Engineering an Operations Research George Mason University MS 4A6, 4400 University Dr. Fairfax, VA 22030

More information

Monte Carlo Methods with Reduced Error

Monte Carlo Methods with Reduced Error Monte Carlo Methos with Reuce Error As has been shown, the probable error in Monte Carlo algorithms when no information about the smoothness of the function is use is Dξ r N = c N. It is important for

More information

A Sketch of Menshikov s Theorem

A Sketch of Menshikov s Theorem A Sketch of Menshikov s Theorem Thomas Bao March 14, 2010 Abstract Let Λ be an infinite, locally finite oriente multi-graph with C Λ finite an strongly connecte, an let p

More information

19 Eigenvalues, Eigenvectors, Ordinary Differential Equations, and Control

19 Eigenvalues, Eigenvectors, Ordinary Differential Equations, and Control 19 Eigenvalues, Eigenvectors, Orinary Differential Equations, an Control This section introuces eigenvalues an eigenvectors of a matrix, an iscusses the role of the eigenvalues in etermining the behavior

More information

Fast image compression using matrix K-L transform

Fast image compression using matrix K-L transform Fast image compression using matrix K-L transform Daoqiang Zhang, Songcan Chen * Department of Computer Science an Engineering, Naning University of Aeronautics & Astronautics, Naning 2006, P.R. China.

More information

Maximal Causes for Non-linear Component Extraction

Maximal Causes for Non-linear Component Extraction Journal of Machine Learning Research 9 (2008) 1227-1267 Submitte 5/07; Revise 11/07; Publishe 6/08 Maximal Causes for Non-linear Component Extraction Jörg Lücke Maneesh Sahani Gatsby Computational Neuroscience

More information

Switching Time Optimization in Discretized Hybrid Dynamical Systems

Switching Time Optimization in Discretized Hybrid Dynamical Systems Switching Time Optimization in Discretize Hybri Dynamical Systems Kathrin Flaßkamp, To Murphey, an Sina Ober-Blöbaum Abstract Switching time optimization (STO) arises in systems that have a finite set

More information

A. Exclusive KL View of the MLE

A. Exclusive KL View of the MLE A. Exclusive KL View of the MLE Lets assume a change-of-variable moel p Z z on the ranom variable Z R m, such as the one use in Dinh et al. 2017: z 0 p 0 z 0 an z = ψz 0, where ψ is an invertible function

More information

7.1 Support Vector Machine

7.1 Support Vector Machine 67577 Intro. to Machine Learning Fall semester, 006/7 Lecture 7: Support Vector Machines an Kernel Functions II Lecturer: Amnon Shashua Scribe: Amnon Shashua 7. Support Vector Machine We return now to

More information

d dx But have you ever seen a derivation of these results? We ll prove the first result below. cos h 1

d dx But have you ever seen a derivation of these results? We ll prove the first result below. cos h 1 Lecture 5 Some ifferentiation rules Trigonometric functions (Relevant section from Stewart, Seventh Eition: Section 3.3) You all know that sin = cos cos = sin. () But have you ever seen a erivation of

More information

Angles-Only Orbit Determination Copyright 2006 Michel Santos Page 1

Angles-Only Orbit Determination Copyright 2006 Michel Santos Page 1 Angles-Only Orbit Determination Copyright 6 Michel Santos Page 1 Abstract This ocument presents a re-erivation of the Gauss an Laplace Angles-Only Methos for Initial Orbit Determination. It keeps close

More information

A simplified macroscopic urban traffic network model for model-based predictive control

A simplified macroscopic urban traffic network model for model-based predictive control Delft University of Technology Delft Center for Systems an Control Technical report 9-28 A simplifie macroscopic urban traffic network moel for moel-base preictive control S. Lin, B. De Schutter, Y. Xi,

More information

Robust Low Rank Kernel Embeddings of Multivariate Distributions

Robust Low Rank Kernel Embeddings of Multivariate Distributions Robust Low Rank Kernel Embeings of Multivariate Distributions Le Song, Bo Dai College of Computing, Georgia Institute of Technology lsong@cc.gatech.eu, boai@gatech.eu Abstract Kernel embeing of istributions

More information

Expected Value of Partial Perfect Information

Expected Value of Partial Perfect Information Expecte Value of Partial Perfect Information Mike Giles 1, Takashi Goa 2, Howar Thom 3 Wei Fang 1, Zhenru Wang 1 1 Mathematical Institute, University of Oxfor 2 School of Engineering, University of Tokyo

More information

WEIGHTING A RESAMPLED PARTICLE IN SEQUENTIAL MONTE CARLO. L. Martino, V. Elvira, F. Louzada

WEIGHTING A RESAMPLED PARTICLE IN SEQUENTIAL MONTE CARLO. L. Martino, V. Elvira, F. Louzada WEIGHTIG A RESAMPLED PARTICLE I SEQUETIAL MOTE CARLO L. Martino, V. Elvira, F. Louzaa Dep. of Signal Theory an Communic., Universia Carlos III e Mari, Leganés (Spain). Institute of Mathematical Sciences

More information

Thermal conductivity of graded composites: Numerical simulations and an effective medium approximation

Thermal conductivity of graded composites: Numerical simulations and an effective medium approximation JOURNAL OF MATERIALS SCIENCE 34 (999)5497 5503 Thermal conuctivity of grae composites: Numerical simulations an an effective meium approximation P. M. HUI Department of Physics, The Chinese University

More information

State-Space Model for a Multi-Machine System

State-Space Model for a Multi-Machine System State-Space Moel for a Multi-Machine System These notes parallel section.4 in the text. We are ealing with classically moele machines (IEEE Type.), constant impeance loas, an a network reuce to its internal

More information

Computing the Longest Common Subsequence of Multiple RLE Strings

Computing the Longest Common Subsequence of Multiple RLE Strings The 29th Workshop on Combinatorial Mathematics an Computation Theory Computing the Longest Common Subsequence of Multiple RLE Strings Ling-Chih Yao an Kuan-Yu Chen Grauate Institute of Networking an Multimeia

More information

Multi-View Clustering via Canonical Correlation Analysis

Multi-View Clustering via Canonical Correlation Analysis Technical Report TTI-TR-2008-5 Multi-View Clustering via Canonical Correlation Analysis Kamalika Chauhuri UC San Diego Sham M. Kakae Toyota Technological Institute at Chicago ABSTRACT Clustering ata in

More information

Schrödinger s equation.

Schrödinger s equation. Physics 342 Lecture 5 Schröinger s Equation Lecture 5 Physics 342 Quantum Mechanics I Wenesay, February 3r, 2010 Toay we iscuss Schröinger s equation an show that it supports the basic interpretation of

More information

BEYOND THE CONSTRUCTION OF OPTIMAL SWITCHING SURFACES FOR AUTONOMOUS HYBRID SYSTEMS. Mauro Boccadoro Magnus Egerstedt Paolo Valigi Yorai Wardi

BEYOND THE CONSTRUCTION OF OPTIMAL SWITCHING SURFACES FOR AUTONOMOUS HYBRID SYSTEMS. Mauro Boccadoro Magnus Egerstedt Paolo Valigi Yorai Wardi BEYOND THE CONSTRUCTION OF OPTIMAL SWITCHING SURFACES FOR AUTONOMOUS HYBRID SYSTEMS Mauro Boccaoro Magnus Egerstet Paolo Valigi Yorai Wari {boccaoro,valigi}@iei.unipg.it Dipartimento i Ingegneria Elettronica

More information

Lower Bounds for the Smoothed Number of Pareto optimal Solutions

Lower Bounds for the Smoothed Number of Pareto optimal Solutions Lower Bouns for the Smoothe Number of Pareto optimal Solutions Tobias Brunsch an Heiko Röglin Department of Computer Science, University of Bonn, Germany brunsch@cs.uni-bonn.e, heiko@roeglin.org Abstract.

More information

2Algebraic ONLINE PAGE PROOFS. foundations

2Algebraic ONLINE PAGE PROOFS. foundations Algebraic founations. Kick off with CAS. Algebraic skills.3 Pascal s triangle an binomial expansions.4 The binomial theorem.5 Sets of real numbers.6 Surs.7 Review . Kick off with CAS Playing lotto Using

More information

On combinatorial approaches to compressed sensing

On combinatorial approaches to compressed sensing On combinatorial approaches to compresse sensing Abolreza Abolhosseini Moghaam an Hayer Raha Department of Electrical an Computer Engineering, Michigan State University, East Lansing, MI, U.S. Emails:{abolhos,raha}@msu.eu

More information

Generalizing Kronecker Graphs in order to Model Searchable Networks

Generalizing Kronecker Graphs in order to Model Searchable Networks Generalizing Kronecker Graphs in orer to Moel Searchable Networks Elizabeth Boine, Babak Hassibi, Aam Wierman California Institute of Technology Pasaena, CA 925 Email: {eaboine, hassibi, aamw}@caltecheu

More information

arxiv: v3 [cs.ir] 6 Dec 2014

arxiv: v3 [cs.ir] 6 Dec 2014 39 Peacock: Learning Long-Tail Topic Features for Industrial Applications arxiv:1405.4402v3 [cs.ir] 6 Dec 2014 YI WANG, Tencent XUEMIN ZHAO, Tencent ZHENLONG SUN, Tencent HAO YAN, Tencent LIFENG WANG,

More information

Make graph of g by adding c to the y-values. on the graph of f by c. multiplying the y-values. even-degree polynomial. graph goes up on both sides

Make graph of g by adding c to the y-values. on the graph of f by c. multiplying the y-values. even-degree polynomial. graph goes up on both sides Reference 1: Transformations of Graphs an En Behavior of Polynomial Graphs Transformations of graphs aitive constant constant on the outsie g(x) = + c Make graph of g by aing c to the y-values on the graph

More information

Math 1B, lecture 8: Integration by parts

Math 1B, lecture 8: Integration by parts Math B, lecture 8: Integration by parts Nathan Pflueger 23 September 2 Introuction Integration by parts, similarly to integration by substitution, reverses a well-known technique of ifferentiation an explores

More information

Math 1271 Solutions for Fall 2005 Final Exam

Math 1271 Solutions for Fall 2005 Final Exam Math 7 Solutions for Fall 5 Final Eam ) Since the equation + y = e y cannot be rearrange algebraically in orer to write y as an eplicit function of, we must instea ifferentiate this relation implicitly

More information

Separation of Variables

Separation of Variables Physics 342 Lecture 1 Separation of Variables Lecture 1 Physics 342 Quantum Mechanics I Monay, January 25th, 2010 There are three basic mathematical tools we nee, an then we can begin working on the physical

More information

Topic Modeling Ensembles

Topic Modeling Ensembles Topic Moeling Ensembles Zhiyong Shen, Ping Luo, Shengen Yang, Xukun Shen HP Laboratories HPL-2-58 Keyor(s): Topic moel, Ensemble Abstract: In this paper e propose a frameork of topic moeling ensembles,

More information

Fast Resampling Weighted v-statistics

Fast Resampling Weighted v-statistics Fast Resampling Weighte v-statistics Chunxiao Zhou Mar O. Hatfiel Clinical Research Center National Institutes of Health Bethesa, MD 20892 chunxiao.zhou@nih.gov Jiseong Par Dept of Math George Mason Univ

More information

EVALUATING HIGHER DERIVATIVE TENSORS BY FORWARD PROPAGATION OF UNIVARIATE TAYLOR SERIES

EVALUATING HIGHER DERIVATIVE TENSORS BY FORWARD PROPAGATION OF UNIVARIATE TAYLOR SERIES MATHEMATICS OF COMPUTATION Volume 69, Number 231, Pages 1117 1130 S 0025-5718(00)01120-0 Article electronically publishe on February 17, 2000 EVALUATING HIGHER DERIVATIVE TENSORS BY FORWARD PROPAGATION

More information

Pure Further Mathematics 1. Revision Notes

Pure Further Mathematics 1. Revision Notes Pure Further Mathematics Revision Notes June 20 2 FP JUNE 20 SDB Further Pure Complex Numbers... 3 Definitions an arithmetical operations... 3 Complex conjugate... 3 Properties... 3 Complex number plane,

More information

arxiv: v4 [math.pr] 27 Jul 2016

arxiv: v4 [math.pr] 27 Jul 2016 The Asymptotic Distribution of the Determinant of a Ranom Correlation Matrix arxiv:309768v4 mathpr] 7 Jul 06 AM Hanea a, & GF Nane b a Centre of xcellence for Biosecurity Risk Analysis, University of Melbourne,

More information

Experiment 2, Physics 2BL

Experiment 2, Physics 2BL Experiment 2, Physics 2BL Deuction of Mass Distributions. Last Upate: 2009-05-03 Preparation Before this experiment, we recommen you review or familiarize yourself with the following: Chapters 4-6 in Taylor

More information

Chapter 4. Electrostatics of Macroscopic Media

Chapter 4. Electrostatics of Macroscopic Media Chapter 4. Electrostatics of Macroscopic Meia 4.1 Multipole Expansion Approximate potentials at large istances 3 x' x' (x') x x' x x Fig 4.1 We consier the potential in the far-fiel region (see Fig. 4.1

More information

The Principle of Least Action

The Principle of Least Action Chapter 7. The Principle of Least Action 7.1 Force Methos vs. Energy Methos We have so far stuie two istinct ways of analyzing physics problems: force methos, basically consisting of the application of

More information

arxiv: v6 [stat.ml] 11 Apr 2017

arxiv: v6 [stat.ml] 11 Apr 2017 Improved Gibbs Sampling Parameter Estimators for LDA Dense Distributions from Sparse Samples: Improved Gibbs Sampling Parameter Estimators for LDA arxiv:1505.02065v6 [stat.ml] 11 Apr 2017 Yannis Papanikolaou

More information

Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen. Neural Networks. Tobias Scheffer

Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen. Neural Networks. Tobias Scheffer Universität Potsam Institut für Informatik Lehrstuhl Maschinelles Lernen Neural Networks Tobias Scheffer Overview Neural information processing. Fee-forwar networks. Training fee-forwar networks, back

More information

Gaussian processes with monotonicity information

Gaussian processes with monotonicity information Gaussian processes with monotonicity information Anonymous Author Anonymous Author Unknown Institution Unknown Institution Abstract A metho for using monotonicity information in multivariate Gaussian process

More information

16 : Approximate Inference: Markov Chain Monte Carlo

16 : Approximate Inference: Markov Chain Monte Carlo 10-708: Probabilistic Graphical Models 10-708, Spring 2017 16 : Approximate Inference: Markov Chain Monte Carlo Lecturer: Eric P. Xing Scribes: Yuan Yang, Chao-Ming Yen 1 Introduction As the target distribution

More information

Influence of weight initialization on multilayer perceptron performance

Influence of weight initialization on multilayer perceptron performance Influence of weight initialization on multilayer perceptron performance M. Karouia (1,2) T. Denœux (1) R. Lengellé (1) (1) Université e Compiègne U.R.A. CNRS 817 Heuiasyc BP 649 - F-66 Compiègne ceex -

More information

The Press-Schechter mass function

The Press-Schechter mass function The Press-Schechter mass function To state the obvious: It is important to relate our theories to what we can observe. We have looke at linear perturbation theory, an we have consiere a simple moel for

More information