LSI, plsi, LDA and inference methods


Guillaume Obozinski
INRIA - Ecole Normale Supérieure - Paris
RussIR summer school, Yaroslavl, August 6-10th, 2012

Latent Semantic Indexing

Latent Semantic Indexing (LSI) (Deerwester et al., 1990)

Idea: words that co-occur frequently in documents should be similar.

Let x^{(i)}_1 and x^{(i)}_2 count respectively the number of occurrences of the words physician and doctor in the i-th document.

[Figure: documents plotted in the plane spanned by the word axes e_1 and e_2, with data points x^{(1)}, x^{(2)}, ... and the first principal direction u^{(1)}.]

The directions of covariance, or principal directions, are obtained from the singular value decomposition of X ∈ R^{d × N},

X = U S V^\top, with U^\top U = I_d, V^\top V = I_N,

and S ∈ R^{d × N} a matrix with non-zero elements only on the diagonal: the singular values of X, positive and sorted in decreasing order.
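
As a concrete illustration, here is a minimal NumPy sketch of the SVD step of LSI; the toy term-document matrix X and the choice K = 2 are assumptions made only for this example, not part of the original slides.

import numpy as np

# Hypothetical toy term-document matrix X (d words x N documents);
# in practice X holds counts or tf-idf scores.
X = np.array([[2., 0., 1., 0.],
              [1., 0., 2., 0.],
              [0., 3., 0., 1.],
              [0., 1., 0., 2.]])

# SVD: X = U S V^T with orthonormal columns in U, V and singular
# values sorted in decreasing order.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

K = 2                      # number of latent dimensions kept
U_K = U[:, :K]             # principal directions ("topics")
S_K = np.diag(s[:K])
V_K = Vt[:K, :].T

# Rank-K reconstruction of X (the LSI approximation).
X_K = U_K @ S_K @ V_K.T
print(np.round(X_K, 2))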

LSI: computation of the document representation

U = [u^{(1)}, ..., u^{(d)}]: the principal directions. Let U_K ∈ R^{d × K} and V_K ∈ R^{N × K} be the matrices retaining the first K columns of U and V, and S_K ∈ R^{K × K} the top-left K × K corner of S.

The projection of x^{(i)} on the subspace spanned by U_K yields the latent representation:

\tilde{x}^{(i)} = U_K^\top x^{(i)}.

Remarks

U_K^\top X = U_K^\top U_K S_K V_K^\top = S_K V_K^\top

u^{(k)} is somehow like a topic, and \tilde{x}^{(i)} is the vector of coefficients of the decomposition of the document on the K topics.

The similarity between two documents can now be measured by

cos(∠(\tilde{x}^{(i)}, \tilde{x}^{(j)})) = \frac{\tilde{x}^{(i)\top} \tilde{x}^{(j)}}{\|\tilde{x}^{(i)}\| \, \|\tilde{x}^{(j)}\|}.
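
A small sketch, under the same toy assumptions as the previous example, of the latent representation and of the cosine similarity between documents:

import numpy as np

def latent_representation(X, K):
    # Project the documents (columns of X) onto the top-K principal directions.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    U_K = U[:, :K]
    return U_K.T @ X          # columns are the latent document vectors

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

X = np.array([[2., 0., 1., 0.],
              [1., 0., 2., 0.],
              [0., 3., 0., 1.],
              [0., 1., 0., 2.]])
Z = latent_representation(X, K=2)
print(cosine(Z[:, 0], Z[:, 2]))   # documents 1 and 3 share vocabulary
print(cosine(Z[:, 0], Z[:, 1]))   # documents 1 and 2 do not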

LSI vs PCA

LSI is almost identical to Principal Component Analysis (PCA), proposed by Karl Pearson in 1901.

Like PCA, LSI aims at finding the directions of high correlation between words, called the principal directions. Like PCA, it retains the projection of the data on a number K of these principal directions; these projections are called the principal components.

Differences between LSI and PCA:
LSI does not center the data (for no specific reason).
LSI is typically combined with TF-IDF.

Limitations and shortcomings of LSI

The generative model of the data underlying PCA is a Gaussian cloud, which does not match the structure of the data. In particular, LSI ignores:
that the data are counts, frequencies or tf-idf scores;
that the data are positive (u^{(k)} typically has negative coefficients).

Moreover, the singular value decomposition is expensive to compute.

Topic models and matrix factorization

X ∈ R^{d × M}, with columns x_i corresponding to documents.
B: the matrix whose columns correspond to the different topics.
Θ: the matrix of decomposition coefficients, with columns θ_i, each associated with one document and encoding its topic content.

X ≈ B Θ
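
A minimal sketch of the shapes involved in this factorization; the sizes d, K, M and the random matrices are assumptions made only for illustration.

import numpy as np

d, K, M = 5000, 10, 200       # vocabulary size, topics, documents (assumed)
B = np.random.rand(d, K)      # topic matrix: one column per topic
Theta = np.random.rand(K, M)  # decomposition coefficients: one column per document
X_approx = B @ Theta          # d x M reconstruction of the word-count matrix
print(X_approx.shape)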

Probabilistic LSI

Probabilistic Latent Semantic Indexing (Hofmann, 2001)

[Figure: (a) Topics: three topics from a fifty-topic LDA model trained on New York Times articles, each a distribution over words:
TOPIC 1: computer, technology, system, service, site, phone, internet, machine
TOPIC 2: sell, sale, store, product, business, advertising, market, consumer
TOPIC 3: play, film, movie, theater, production, star, director, stage
(b) Document assignments to topics: a simplex depicting the distribution over these topics for seven documents; the line from each document's title shows the document's position in the topic space.]

Probabilistic Latent Semantic Indexing (Hofmann, 2001)

Obtain a more expressive model by allowing several topics per document, in various proportions, so that each word w_{in} gets its own topic z_{in}, drawn from the multinomial distribution d_i unique to the i-th document.

[Graphical model: d_i → z_{in} → w_{in} ← B, with plates over n = 1..N_i and i = 1..M.]

d_i : topic proportions in document i
z_{in} ∼ M(1, d_i)
(w_{in} | z_{ink} = 1) ∼ M(1, (b_{1k}, ..., b_{dk}))
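
A minimal sketch of this generative process in NumPy; the vocabulary size, number of topics, topic matrix B, proportions d_i and document length are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)

d, K = 6, 2                              # vocabulary size and number of topics (assumed)
B = rng.dirichlet(np.ones(d), size=K).T  # B[:, k] = word distribution of topic k
d_i = np.array([0.7, 0.3])               # topic proportions of document i (assumed)
N_i = 10                                 # document length (assumed)

words = []
for n in range(N_i):
    z = rng.choice(K, p=d_i)             # z_in ~ M(1, d_i)
    w = rng.choice(d, p=B[:, z])         # w_in | z_ink = 1 ~ M(1, b_.k)
    words.append(w)
print(words)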

EM algorithm for plsi

Denote by j_{in} the index in the dictionary of the word appearing as the n-th word of document i.

Expectation step:

q^{(t)}_{ink} = p(z_{ink} = 1 | w_{in}; d^{(t-1)}_i, B^{(t-1)}) = \frac{d^{(t-1)}_{ik} \, b^{(t-1)}_{j_{in} k}}{\sum_{k'=1}^{K} d^{(t-1)}_{ik'} \, b^{(t-1)}_{j_{in} k'}}

Maximization step:

d^{(t)}_{ik} = \frac{1}{N^{(i)}} \sum_{n=1}^{N^{(i)}} q^{(t)}_{ink} = \frac{\tilde{N}^{(i)}_k}{N^{(i)}}
and
b^{(t)}_{jk} = \frac{\sum_{i=1}^{M} \sum_{n=1}^{N^{(i)}} q^{(t)}_{ink} \, w_{inj}}{\sum_{i=1}^{M} \sum_{n=1}^{N^{(i)}} q^{(t)}_{ink}}
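
A sketch of one EM iteration implementing these two updates; the data layout, variable names and toy corpus are assumptions made for the example, not part of the original slides.

import numpy as np

def plsi_em_step(W, D, B):
    """One EM iteration for pLSI.

    W : list of documents, each an integer array of word indices j_in
    D : M x K matrix of per-document topic proportions d_ik
    B : d x K matrix of per-topic word distributions b_jk
    """
    D_new = np.zeros_like(D)
    B_new = np.zeros_like(B)
    for i, doc in enumerate(W):
        # E-step: responsibilities q_ink for every word occurrence
        q = D[i] * B[doc, :]                 # N_i x K, unnormalized
        q /= q.sum(axis=1, keepdims=True)
        # M-step accumulators
        D_new[i] = q.sum(axis=0) / len(doc)
        np.add.at(B_new, doc, q)             # B_new[j, k] += sum_n q_nk 1{j_in = j}
    B_new /= B_new.sum(axis=0, keepdims=True)
    return D_new, B_new

# Toy usage with an assumed random initialization
rng = np.random.default_rng(0)
W = [np.array([0, 1, 1, 3]), np.array([2, 2, 3])]
D = rng.dirichlet(np.ones(2), size=len(W))   # M x K
B = rng.dirichlet(np.ones(4), size=2).T      # d x K
for _ in range(50):
    D, B = plsi_em_step(W, D, B)
print(np.round(D, 2))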

Issues with plsi?

Too many parameters → overfitting!
Solutions? Not clear.

Solutions or alternative approaches

Frequentist approach: regularize + optimize (Dictionary Learning)

\min_{θ_i} \; -\log p(x_i | θ_i) + λ Ω(θ_i)

Bayesian approach: prior + integrate (Latent Dirichlet Allocation)

p(θ_i | x_i, α) ∝ p(x_i | θ_i) \, p(θ_i | α)

Frequentist + Bayesian: integrate + optimize

\max_{α} \prod_{i=1}^{M} \int p(x_i | θ_i) \, p(θ_i | α) \, dθ_i

... called the Empirical Bayes approach or Type II Maximum Likelihood.

Latent Dirichlet Allocation

Latent Dirichlet Allocation (Blei et al., 2003)

[Graphical model: α → θ_i → z_{in} → w_{in} ← B, with plates over n = 1..N_i and i = 1..M.]

K topics
α = (α_1, ..., α_K): parameter vector
θ_i = (θ_{1i}, ..., θ_{Ki}) ∼ Dir(α): topic proportions
z_{in}: topic indicator vector for the n-th word of the i-th document, z_{in} = (z_{in1}, ..., z_{inK}) ∈ {0, 1}^K, with

z_{in} ∼ M(1, (θ_{1i}, ..., θ_{Ki})), i.e. p(z_{in} | θ_i) = \prod_{k=1}^{K} θ_{ki}^{z_{ink}}

w_{in} | {z_{ink} = 1} ∼ M(1, (b_{1k}, ..., b_{dk})), i.e. p(w_{inj} = 1 | z_{ink} = 1) = b_{jk}
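
A minimal sketch of the LDA generative process with NumPy; the sizes, hyperparameters and the Poisson document lengths are assumptions made for the example.

import numpy as np

rng = np.random.default_rng(0)

K, d, M = 3, 8, 4                        # topics, vocabulary size, documents (assumed)
alpha = np.full(K, 0.5)                  # Dirichlet parameter alpha (assumed)
B = rng.dirichlet(np.ones(d), size=K).T  # b_.k = word distribution of topic k

docs = []
for i in range(M):
    theta_i = rng.dirichlet(alpha)               # theta_i ~ Dir(alpha)
    N_i = rng.poisson(20) + 1                    # document length (assumed Poisson)
    z = rng.choice(K, size=N_i, p=theta_i)       # z_in ~ M(1, theta_i)
    w = np.array([rng.choice(d, p=B[:, k]) for k in z])   # w_in | z_in
    docs.append(w)
print([len(doc) for doc in docs])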

LDA likelihood

p((w_{in}, z_{in})_{1 ≤ n ≤ N_i} | θ_i) = \prod_{n=1}^{N_i} p(w_{in}, z_{in} | θ_i) = \prod_{n=1}^{N_i} \prod_{j=1}^{d} \prod_{k=1}^{K} (b_{jk} θ_{ki})^{w_{inj} z_{ink}}

p((w_{in})_{1 ≤ n ≤ N_i} | θ_i) = \prod_{n=1}^{N_i} \sum_{z_{in}} p(w_{in}, z_{in} | θ_i) = \prod_{n=1}^{N_i} \prod_{j=1}^{d} \Big[ \sum_{k=1}^{K} b_{jk} θ_{ki} \Big]^{w_{inj}},

so that w_{in} | θ_i \overset{i.i.d.}{∼} M(1, B θ_i), or equivalently x_i | θ_i ∼ M(N_i, B θ_i).
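
A small numerical check of the last statement: summing the topic indicator out gives the word distribution B θ_i; the matrix B and the vector θ below are arbitrary illustrative values.

import numpy as np

B = np.array([[0.5, 0.1],
              [0.3, 0.2],
              [0.2, 0.7]])
theta = np.array([0.4, 0.6])
p_word = B @ theta               # p(w_inj = 1 | theta) = (B theta)_j
print(p_word, p_word.sum())      # a valid distribution over the 3 words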

LDA as Multinomial Factorial Analysis

Eliminating the z's from the model yields a conceptually simpler model in which the θ_i can be interpreted as latent factors, as in factor analysis.

[Graphical model: α → θ_i → x_i ← B, with a plate over i = 1..M.]

Topic proportions for document i: θ_i ∈ R^K, θ_i ∼ Dir(α)
Empirical word counts for document i: x_i ∈ R^d, x_i ∼ M(N_i, B θ_i)

LDA with smoothing of the dictionary

Issue with new words: they will have probability 0 if B is optimized over the training data. Need to smooth B, e.g. via Laplace-type smoothing, obtained by placing a Dirichlet prior with parameter η on the columns of B.

[Graphical models: the LDA model augmented with η → B, both in the word-level formulation (α → θ_i → z_{in} → w_{in} ← B) and in the collapsed formulation (α → θ_i → x_i ← B).]

Learning with LDA

How do we learn with LDA?

How do we learn, for each topic, its word distribution b_k?
How do we learn, for each document, its topic composition θ_i?
How do we assign to each word of a document its topic z_{in}?

These quantities are treated in a Bayesian fashion, so the natural Bayesian answers are the posteriors

p(B | W), p(θ_i | W), p(z_{in} | W),

or E(B | W), E(θ_i | W), E(z_{in} | W) if point estimates are needed.

Monte Carlo

Principle of Monte Carlo integration
Let Z be a random variable. To compute E[f(Z)], we can sample Z^{(1)}, ..., Z^{(B)} i.i.d. with the same distribution as Z and use the approximation

E[f(Z)] ≈ \frac{1}{B} \sum_{b=1}^{B} f(Z^{(b)}).

Problem: in most situations, sampling exactly from the distribution of Z is too hard, so this direct approach is impossible.
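
A minimal sketch of Monte Carlo integration, estimating E[Z^2] = 1 for Z ~ N(0, 1); the choice of f and of the distribution is an assumption made so that the exact answer is known.

import numpy as np

rng = np.random.default_rng(0)

B = 100_000
samples = rng.standard_normal(B)     # Z^(1), ..., Z^(B) i.i.d. ~ N(0, 1)
estimate = np.mean(samples ** 2)     # (1/B) sum_b f(Z^(b)) with f(z) = z**2
print(estimate)                      # close to the exact value 1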

Markov Chain Monte Carlo (MCMC)

If we cannot sample exactly from the distribution of Z, i.e. from some q(z) = P(Z = z) (or q(z) the density of the r.v. Z), we can still create a sequence of random variables that approaches the correct distribution.

Principle of MCMC
Construct a chain of random variables Z^{(b,1)}, ..., Z^{(b,T)} with

Z^{(b,t)} ∼ p_t(z^{(b,t)} | Z^{(b,t-1)} = z^{(b,t-1)})

such that Z^{(b,T)} converges in distribution to Z as T → ∞. We can then approximate:

E[f(Z)] ≈ \frac{1}{B} \sum_{b=1}^{B} f(Z^{(b,T)}).

MCMC in practice

Run a single chain:

E[f(Z)] ≈ \frac{1}{T} \sum_{t=1}^{T} f(Z^{(T_0 + k t)})

T_0 is the burn-in time.
k is the thinning factor.

It is useful to take k > 1 only if almost i.i.d. samples are required. To compute an expectation in which the correlation between Z^{(t)} and Z^{(t-1)} would not interfere, take k = 1.

Main difficulties:
the mixing time of the chain can be very large;
assessing whether the chain has mixed or not is a hard problem;
a proper approximation is obtained only with T very large.

MCMC can be quite slow or just never converge, and you will not necessarily know it.
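
A sketch of the single-chain recipe with burn-in and thinning; the chain below is a simple random-walk Metropolis sampler targeting N(0, 1), chosen only so the example is self-contained, not one of the LDA samplers discussed later.

import numpy as np

def mcmc_average(chain, f, burn_in, thin=1):
    """Average f over a chain after discarding the first burn_in samples,
    keeping every `thin`-th sample (thin=1 keeps them all)."""
    kept = chain[burn_in::thin]
    return np.mean([f(z) for z in kept])

rng = np.random.default_rng(0)
z, chain = 0.0, []
for _ in range(20_000):
    prop = z + rng.normal(scale=1.0)
    # Accept with probability min(1, p(prop)/p(z)) for the target exp(-z**2 / 2)
    if np.log(rng.uniform()) < 0.5 * (z**2 - prop**2):
        z = prop
    chain.append(z)

# Estimate E[Z**2] = 1 from the chain
print(mcmc_average(np.array(chain), lambda z: z**2, burn_in=2_000, thin=5))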

Gibbs sampling

A nice special case of MCMC:

Principle of Gibbs sampling
For each node i in turn, sample the node conditionally on the other nodes, i.e.

sample Z_i^{(t)} ∼ p(z_i | Z_{-i} = z_{-i}^{(t-1)}).

[Graphical model of smoothed LDA: η → B, α → θ_i → z_{in} → w_{in} ← B, with plates over n = 1..N_i and i = 1..M.]

Markov Blanket
Definition: Let V be the set of nodes of the graph. The Markov blanket of node i is the minimal set of nodes S (not containing i) such that

p(Z_i | Z_S) = p(Z_i | Z_{-i}), or equivalently Z_i ⊥ Z_{V \ (S ∪ {i})} | Z_S.

d-separation

Theorem: Let A, B and C be three disjoint sets of nodes. The property X_A ⊥ X_B | X_C holds if and only if all paths connecting A to B are blocked, which means that they contain at least one blocking node. Node j is a blocking node if there is no v-structure at j and j is in C, or if there is a v-structure at j and neither j nor any of its descendants in the graph is in C.

[Figure: example directed graph on nodes a, b, c, e, f illustrating blocked and unblocked paths.]

Markov Blanket in a Directed Graphical model

[Figure: the Markov blanket of a node x_i in a directed graphical model, i.e. its parents, its children and the other parents of its children.]

Markov Blankets in LDA?

[Graphical model of smoothed LDA: η → B, α → θ_i → z_{in} → w_{in} ← B, with plates over n = 1..N_i and i = 1..M.]

Markov blankets:
for θ_i : (z_{in})_{n=1..N_i}
for z_{in} : w_{in}, θ_i and B
for B : (w_{in}, z_{in})_{n=1..N_i, i=1..M}

Gibbs sampling for LDA with a single document

p(w, z, θ, B; α, η) = \Big[ \prod_{n=1}^{N} p(w_n | z_n, B) \, p(z_n | θ) \Big] \, p(θ | α) \prod_k p(b_k | η)
∝ \Big[ \prod_{n=1}^{N} \prod_{j,k} (b_{jk} θ_k)^{w_{nj} z_{nk}} \Big] \prod_k θ_k^{α_k - 1} \prod_{j,k} b_{jk}^{η_j - 1}

We thus have:

(z_n | w_n, θ, B) ∼ M(1, p_n) with p_{nk} = \frac{b_{j(n),k} θ_k}{\sum_{k'} b_{j(n),k'} θ_{k'}}.

(θ | (z_n, w_n)_n, α) ∼ Dir(\tilde{α}) with \tilde{α}_k = α_k + N_k, N_k = \sum_{n=1}^{N} z_{nk}.

(b_k | (z_n, w_n)_n, η) ∼ Dir(\tilde{η}) with \tilde{η}_j = η_j + \sum_{n=1}^{N} w_{nj} z_{nk}.
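
A minimal NumPy sketch of a Gibbs sweep over these three conditionals for one document; the toy document, K, d, α and η are assumptions, and no burn-in or convergence diagnostics are included.

import numpy as np

rng = np.random.default_rng(0)

def gibbs_lda_single_doc(w, K, d, alpha, eta, n_iter=1000):
    """Gibbs sampler for LDA with a single document.

    w : integer array of word indices j(n), one entry per word occurrence
    """
    N = len(w)
    theta = rng.dirichlet(alpha)
    B = rng.dirichlet(eta, size=K).T                 # d x K
    z = rng.integers(K, size=N)
    for _ in range(n_iter):
        # z_n | w_n, theta, B ~ M(1, p_n), p_nk proportional to b_{j(n),k} theta_k
        p = B[w, :] * theta
        p /= p.sum(axis=1, keepdims=True)
        z = np.array([rng.choice(K, p=p[n]) for n in range(N)])
        # theta | z ~ Dir(alpha_k + N_k)
        N_k = np.bincount(z, minlength=K)
        theta = rng.dirichlet(alpha + N_k)
        # b_k | z, w ~ Dir(eta_j + count of word j among words with topic k)
        for k in range(K):
            counts = np.bincount(w[z == k], minlength=d)
            B[:, k] = rng.dirichlet(eta + counts)
    return theta, B, z

theta, B, z = gibbs_lda_single_doc(np.array([0, 0, 1, 2, 2, 3]),
                                   K=2, d=4,
                                   alpha=np.ones(2), eta=np.ones(4))
print(theta)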

LDA Results (Blei et al., 2003)

Four example topics, each shown by its most probable words:

Arts: NEW FILM SHOW MUSIC MOVIE PLAY MUSICAL BEST ACTOR FIRST YORK OPERA THEATER ACTRESS LOVE
Budgets: MILLION TAX PROGRAM BUDGET BILLION FEDERAL YEAR SPENDING NEW STATE PLAN MONEY PROGRAMS GOVERNMENT CONGRESS
Children: CHILDREN WOMEN PEOPLE CHILD YEARS FAMILIES WORK PARENTS SAYS FAMILY WELFARE MEN PERCENT CARE LIFE
Education: SCHOOL STUDENTS SCHOOLS EDUCATION TEACHERS HIGH PUBLIC TEACHER BENNETT MANIGAT NAMPHY STATE PRESIDENT ELEMENTARY HAITI

Excerpt of an article, whose words the model assigns to these topics:
"The William Randolph Hearst Foundation will give $1.25 million to Lincoln Center, Metropolitan Opera Co., New York Philharmonic and Juilliard School. Our board felt that we had a real opportunity to make a mark on the future of the performing arts with these grants an act..."

Reading Tea Leaves: word precision (Boyd-Graber et al., 2009)

[Figure 3 of the paper: model precision on the word-intrusion task for CTM, LDA and pLSI, on the New York Times and Wikipedia corpora, with 50, 100 and 150 topics. Higher is better. Surprisingly, although CTM generally achieves a better predictive likelihood than the other models, the topics it infers fare worst when evaluated against human judgments.]

Precision of the identification of word outliers, by humans and for different models.

Reading Tea Leaves: topic precision (Boyd-Graber et al., 2009)

[Figure 6 of the paper: topic log odds on the topic-intrusion task for CTM, LDA and pLSI, on the New York Times and Wikipedia corpora, with 50, 100 and 150 topics. Higher is better. The topic log odds is the log ratio of the probability mass assigned to the true intruder to the probability mass assigned to the intruder selected by the subject. Again, although CTM generally achieves a better predictive likelihood than the other models, the topics it infers fare worst when evaluated against human judgments.]

Precision of the identification of topic outliers, by humans and for different models.

Reading Tea Leaves: log-likelihood on held-out data (Boyd-Graber et al., 2009)

[Table 1 of the paper: two predictive metrics, predictive log-likelihood / predictive rank, for LDA, CTM and pLSI on the New York Times and Wikipedia corpora with 50, 100 and 150 topics. Consistent with values reported in the literature, CTM generally performs the best, followed by LDA, then pLSI.]

Log-likelihoods of several models including LDA, pLSI and CTM (CTM = correlated topic model).

Variational inference for LDA

Principle of Variational Inference

Problem: it is hard to compute p(B, θ_i, z_{in} | W), E(B | W), E(θ_i | W), E(z_{in} | W).

Idea of Variational Inference: find a distribution q which is as close as possible to p(· | W) and for which it is not too hard to compute E_q(B), E_q(θ_i), E_q(z_{in}).

Usual approach:
1. Choose a simple parametric family Q for q.
2. Solve the variational formulation min_{q ∈ Q} KL(q ‖ p(· | W)).
3. Compute the desired expectations E_q(B), E_q(θ_i), E_q(z_{in}).

Variational Inference for LDA (Blei et al., 2003)

Assume B is a parameter, assume there is a single document, and focus on the inference of θ and (z_n)_n. Choose q in a factorized form (mean-field approximation):

q(θ, (z_n)_n) = q_θ(θ) \prod_{n=1}^{N} q_{z_n}(z_n)

with

q_θ(θ) = \frac{Γ(\sum_k γ_k)}{\prod_k Γ(γ_k)} \prod_k θ_k^{γ_k - 1} and q_{z_n}(z_n) = \prod_k φ_{nk}^{z_{nk}}.

Then

KL(q ‖ p(· | W)) = E_q\Big[ \log \frac{q(θ, (z_n)_n)}{p(θ, (z_n)_n | W)} \Big]
= E_q\Big[ \log q_θ(θ) + \sum_n \log q_{z_n}(z_n) - \log p(θ | α) - \sum_n \big( \log p(z_n | θ) + \log p(w_n | z_n, B) \big) \Big] + \log p((w_n)_n).

Variational Inference for LDA II

The quantity to minimize is

E_q\Big[ \log q_θ(θ) - \log p(θ | α) + \sum_n \big( \log q_{z_n}(z_n) - \log p(z_n | θ) - \log p(w_n | z_n, B) \big) \Big],

whose terms can be computed as follows:

E_q[\log q_θ(θ)] = E_q\Big[ \log Γ(\sum_k γ_k) - \sum_k \log Γ(γ_k) + \sum_k (γ_k - 1) \log θ_k \Big]
= \log Γ(\sum_k γ_k) - \sum_k \log Γ(γ_k) + \sum_k (γ_k - 1) E_q[\log θ_k]

E_q[\log p(θ | α)] = \sum_k (α_k - 1) E_q[\log θ_k] + cst

E_q[\log q_{z_n}(z_n) - \log p(z_n | θ)] = E_q\Big[ \sum_k \big( z_{nk} \log φ_{nk} - z_{nk} \log θ_k \big) \Big]
= \sum_k E_q[z_{nk}] \big( \log φ_{nk} - E_q[\log θ_k] \big)

E_q[\log p(w_n | z_n, B)] = E_q\Big[ \sum_{j,k} z_{nk} w_{nj} \log b_{jk} \Big] = \sum_{j,k} E_q[z_{nk}] \, w_{nj} \log b_{jk}

VI for LDA: Computing the expectations

The expectation of the logarithm of a Dirichlet random variable can be computed exactly with the digamma function Ψ:

E_q[\log θ_k] = Ψ(γ_k) - Ψ(\sum_{k'} γ_{k'}), with Ψ(x) := \frac{∂}{∂x} \log Γ(x).

We obviously have E_q[z_{nk}] = φ_{nk}.

The problem min_{q ∈ Q} KL(q ‖ p(· | W)) is therefore equivalent to min_{γ, (φ_n)_n} D(γ, (φ_n)_n) with

D(γ, (φ_n)_n) = \log Γ(\sum_k γ_k) - \sum_k \log Γ(γ_k) + \sum_{n,k} φ_{nk} \log φ_{nk} - \sum_{n,k} φ_{nk} \sum_j w_{nj} \log b_{jk} - \sum_k \big( α_k + \sum_n φ_{nk} - γ_k \big) \big( Ψ(γ_k) - Ψ(\sum_{k'} γ_{k'}) \big).
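
A short numerical sketch of the digamma identity E_q[log θ_k] = Ψ(γ_k) − Ψ(Σ_{k'} γ_{k'}), checked against a Monte Carlo estimate; the value of γ is an arbitrary example.

import numpy as np
from scipy.special import digamma

rng = np.random.default_rng(0)
gamma = np.array([2.0, 5.0, 1.0])

exact = digamma(gamma) - digamma(gamma.sum())
mc = np.log(rng.dirichlet(gamma, size=200_000)).mean(axis=0)
print(np.round(exact, 3), np.round(mc, 3))   # the two should agree closely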

VI for LDA: Solving for the variational updates

Introducing a Lagrangian to account for the constraints \sum_{k=1}^{K} φ_{nk} = 1:

L(γ, (φ_n)_n) = D(γ, (φ_n)_n) + \sum_{n=1}^{N} λ_n \Big( 1 - \sum_k φ_{nk} \Big)

Computing the gradient of the Lagrangian:

\frac{∂L}{∂γ_k} = -\Big( α_k + \sum_n φ_{nk} - γ_k \Big) \Big( Ψ'(γ_k) - Ψ'(\sum_{k'} γ_{k'}) \Big)

\frac{∂L}{∂φ_{nk}} = \log φ_{nk} + 1 - \sum_j w_{nj} \log b_{jk} - \Big( Ψ(γ_k) - Ψ(\sum_{k'} γ_{k'}) \Big) - λ_n

Partial minimizations in γ and φ_{nk} are therefore respectively solved by

γ_k = α_k + \sum_n φ_{nk} and φ_{nk} ∝ b_{j(n),k} \exp\big( Ψ(γ_k) - Ψ(\sum_{k'} γ_{k'}) \big),

where j(n) is the one and only j such that w_{nj} = 1.

Variational Algorithm

Algorithm 1: Variational inference for LDA
Require: W, α, γ_init, (φ_{n,init})_n
1: while not converged do
2:   γ_k ← α_k + \sum_n φ_{nk}
3:   for n = 1..N do
4:     for k = 1..K do
5:       φ_{nk} ← b_{j(n),k} \exp( Ψ(γ_k) - Ψ(\sum_{k'} γ_{k'}) )
6:     end for
7:     φ_n ← φ_n / \sum_k φ_{nk}
8:   end for
9: end while
10: return γ, (φ_n)_n

With the quantities computed we can approximate:

E[θ_k | W] ≈ \frac{γ_k}{\sum_{k'} γ_{k'}} and E[z_n | W] ≈ φ_n.
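
A minimal Python sketch of Algorithm 1 for a single document, with B treated as a fixed parameter; the toy B, document and α are assumptions, and a fixed number of iterations stands in for the convergence test.

import numpy as np
from scipy.special import digamma

def lda_variational_inference(w, B, alpha, n_iter=100):
    """Mean-field variational inference for one document.

    w     : array of word indices j(n), one entry per word occurrence
    B     : d x K topic matrix, treated here as a fixed parameter
    alpha : Dirichlet parameter vector of length K
    """
    N, K = len(w), B.shape[1]
    phi = np.full((N, K), 1.0 / K)           # phi_{n,init} uniform
    gamma = alpha + float(N) / K             # gamma_init
    for _ in range(n_iter):
        gamma = alpha + phi.sum(axis=0)                          # gamma_k = alpha_k + sum_n phi_nk
        phi = B[w, :] * np.exp(digamma(gamma) - digamma(gamma.sum()))
        phi /= phi.sum(axis=1, keepdims=True)                    # normalize each phi_n
    return gamma, phi

# Toy usage with an assumed 4-word vocabulary and 2 topics
B = np.array([[0.4, 0.1],
              [0.3, 0.1],
              [0.2, 0.3],
              [0.1, 0.5]])
gamma, phi = lda_variational_inference(np.array([0, 0, 1, 3, 3]), B, alpha=np.ones(2))
print(gamma / gamma.sum())    # approximates E[theta | W]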

Polylingual Topic Model (Mimno et al., 2009)

Generalization of LDA to documents available simultaneously in several languages, such as Wikipedia articles, which are not literal translations of one another but share the same topics.

[Graphical model: α → θ_i; for each language l = 1..L, θ_i → z^{(l)}_{in} → w^{(l)}_{in} ← B^{(l)}, with plates over n = 1..N^{(l)}_i and i = 1..M.]

References

Blei, D., Ng, A., and Jordan, M. (2003). Latent Dirichlet allocation. The Journal of Machine Learning Research, 3:993-1022.

Boyd-Graber, J., Chang, J., Gerrish, S., Wang, C., and Blei, D. (2009). Reading tea leaves: How humans interpret topic models. In Proceedings of the 23rd Annual Conference on Neural Information Processing Systems.

Deerwester, S., Dumais, S., Furnas, G., Landauer, T., and Harshman, R. (1990). Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391-407.

Hofmann, T. (2001). Unsupervised learning by probabilistic latent semantic analysis. Machine Learning, 42(1):177-196.

Mimno, D., Wallach, H., Naradowsky, J., Smith, D., and McCallum, A. (2009). Polylingual topic models. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pages 880-889. Association for Computational Linguistics.


More information

Graphical Models for Collaborative Filtering

Graphical Models for Collaborative Filtering Graphical Models for Collaborative Filtering Le Song Machine Learning II: Advanced Topics CSE 8803ML, Spring 2012 Sequence modeling HMM, Kalman Filter, etc.: Similarity: the same graphical model topology,

More information

Gaussian Mixture Model

Gaussian Mixture Model Case Study : Document Retrieval MAP EM, Latent Dirichlet Allocation, Gibbs Sampling Machine Learning/Statistics for Big Data CSE599C/STAT59, University of Washington Emily Fox 0 Emily Fox February 5 th,

More information

Machine Learning Summer School

Machine Learning Summer School Machine Learning Summer School Lecture 3: Learning parameters and structure Zoubin Ghahramani zoubin@eng.cam.ac.uk http://learning.eng.cam.ac.uk/zoubin/ Department of Engineering University of Cambridge,

More information

Evaluation Methods for Topic Models

Evaluation Methods for Topic Models University of Massachusetts Amherst wallach@cs.umass.edu April 13, 2009 Joint work with Iain Murray, Ruslan Salakhutdinov and David Mimno Statistical Topic Models Useful for analyzing large, unstructured

More information

Density Estimation. Seungjin Choi

Density Estimation. Seungjin Choi Density Estimation Seungjin Choi Department of Computer Science and Engineering Pohang University of Science and Technology 77 Cheongam-ro, Nam-gu, Pohang 37673, Korea seungjin@postech.ac.kr http://mlg.postech.ac.kr/

More information

Matrix Factorization & Latent Semantic Analysis Review. Yize Li, Lanbo Zhang

Matrix Factorization & Latent Semantic Analysis Review. Yize Li, Lanbo Zhang Matrix Factorization & Latent Semantic Analysis Review Yize Li, Lanbo Zhang Overview SVD in Latent Semantic Indexing Non-negative Matrix Factorization Probabilistic Latent Semantic Indexing Vector Space

More information

Probabilistic Graphical Networks: Definitions and Basic Results

Probabilistic Graphical Networks: Definitions and Basic Results This document gives a cursory overview of Probabilistic Graphical Networks. The material has been gleaned from different sources. I make no claim to original authorship of this material. Bayesian Graphical

More information

A Unified Posterior Regularized Topic Model with Maximum Margin for Learning-to-Rank

A Unified Posterior Regularized Topic Model with Maximum Margin for Learning-to-Rank A Unified Posterior Regularized Topic Model with Maximum Margin for Learning-to-Rank Shoaib Jameel Shoaib Jameel 1, Wai Lam 2, Steven Schockaert 1, and Lidong Bing 3 1 School of Computer Science and Informatics,

More information

Bayesian Methods for Machine Learning

Bayesian Methods for Machine Learning Bayesian Methods for Machine Learning CS 584: Big Data Analytics Material adapted from Radford Neal s tutorial (http://ftp.cs.utoronto.ca/pub/radford/bayes-tut.pdf), Zoubin Ghahramni (http://hunch.net/~coms-4771/zoubin_ghahramani_bayesian_learning.pdf),

More information

Latent variable models for discrete data

Latent variable models for discrete data Latent variable models for discrete data Jianfei Chen Department of Computer Science and Technology Tsinghua University, Beijing 100084 chris.jianfei.chen@gmail.com Janurary 13, 2014 Murphy, Kevin P. Machine

More information

RETRIEVAL MODELS. Dr. Gjergji Kasneci Introduction to Information Retrieval WS

RETRIEVAL MODELS. Dr. Gjergji Kasneci Introduction to Information Retrieval WS RETRIEVAL MODELS Dr. Gjergji Kasneci Introduction to Information Retrieval WS 2012-13 1 Outline Intro Basics of probability and information theory Retrieval models Boolean model Vector space model Probabilistic

More information

Latent Dirichlet Bayesian Co-Clustering

Latent Dirichlet Bayesian Co-Clustering Latent Dirichlet Bayesian Co-Clustering Pu Wang 1, Carlotta Domeniconi 1, and athryn Blackmond Laskey 1 Department of Computer Science Department of Systems Engineering and Operations Research George Mason

More information

Design of Text Mining Experiments. Matt Taddy, University of Chicago Booth School of Business faculty.chicagobooth.edu/matt.

Design of Text Mining Experiments. Matt Taddy, University of Chicago Booth School of Business faculty.chicagobooth.edu/matt. Design of Text Mining Experiments Matt Taddy, University of Chicago Booth School of Business faculty.chicagobooth.edu/matt.taddy/research Active Learning: a flavor of design of experiments Optimal : consider

More information

Latent Dirichlet Allocation

Latent Dirichlet Allocation Latent Dirichlet Allocation 1 Directed Graphical Models William W. Cohen Machine Learning 10-601 2 DGMs: The Burglar Alarm example Node ~ random variable Burglar Earthquake Arcs define form of probability

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 11 Project

More information

Notes on Latent Semantic Analysis

Notes on Latent Semantic Analysis Notes on Latent Semantic Analysis Costas Boulis 1 Introduction One of the most fundamental problems of information retrieval (IR) is to find all documents (and nothing but those) that are semantically

More information

Parametric Models. Dr. Shuang LIANG. School of Software Engineering TongJi University Fall, 2012

Parametric Models. Dr. Shuang LIANG. School of Software Engineering TongJi University Fall, 2012 Parametric Models Dr. Shuang LIANG School of Software Engineering TongJi University Fall, 2012 Today s Topics Maximum Likelihood Estimation Bayesian Density Estimation Today s Topics Maximum Likelihood

More information

Bayesian Inference for Dirichlet-Multinomials

Bayesian Inference for Dirichlet-Multinomials Bayesian Inference for Dirichlet-Multinomials Mark Johnson Macquarie University Sydney, Australia MLSS Summer School 1 / 50 Random variables and distributed according to notation A probability distribution

More information

Chapter 4 Dynamic Bayesian Networks Fall Jin Gu, Michael Zhang

Chapter 4 Dynamic Bayesian Networks Fall Jin Gu, Michael Zhang Chapter 4 Dynamic Bayesian Networks 2016 Fall Jin Gu, Michael Zhang Reviews: BN Representation Basic steps for BN representations Define variables Define the preliminary relations between variables Check

More information

Bayesian Learning. HT2015: SC4 Statistical Data Mining and Machine Learning. Maximum Likelihood Principle. The Bayesian Learning Framework

Bayesian Learning. HT2015: SC4 Statistical Data Mining and Machine Learning. Maximum Likelihood Principle. The Bayesian Learning Framework HT5: SC4 Statistical Data Mining and Machine Learning Dino Sejdinovic Department of Statistics Oxford http://www.stats.ox.ac.uk/~sejdinov/sdmml.html Maximum Likelihood Principle A generative model for

More information

Probabilistic Graphical Models (I)

Probabilistic Graphical Models (I) Probabilistic Graphical Models (I) Hongxin Zhang zhx@cad.zju.edu.cn State Key Lab of CAD&CG, ZJU 2015-03-31 Probabilistic Graphical Models Modeling many real-world problems => a large number of random

More information

Probablistic Graphical Models, Spring 2007 Homework 4 Due at the beginning of class on 11/26/07

Probablistic Graphical Models, Spring 2007 Homework 4 Due at the beginning of class on 11/26/07 Probablistic Graphical odels, Spring 2007 Homework 4 Due at the beginning of class on 11/26/07 Instructions There are four questions in this homework. The last question involves some programming which

More information

Variational Bayesian Dirichlet-Multinomial Allocation for Exponential Family Mixtures

Variational Bayesian Dirichlet-Multinomial Allocation for Exponential Family Mixtures 17th Europ. Conf. on Machine Learning, Berlin, Germany, 2006. Variational Bayesian Dirichlet-Multinomial Allocation for Exponential Family Mixtures Shipeng Yu 1,2, Kai Yu 2, Volker Tresp 2, and Hans-Peter

More information

STA 414/2104: Machine Learning

STA 414/2104: Machine Learning STA 414/2104: Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistics! rsalakhu@cs.toronto.edu! http://www.cs.toronto.edu/~rsalakhu/ Lecture 9 Sequential Data So far

More information

Probabilistic Machine Learning

Probabilistic Machine Learning Probabilistic Machine Learning Bayesian Nets, MCMC, and more Marek Petrik 4/18/2017 Based on: P. Murphy, K. (2012). Machine Learning: A Probabilistic Perspective. Chapter 10. Conditional Independence Independent

More information

Sampling Methods (11/30/04)

Sampling Methods (11/30/04) CS281A/Stat241A: Statistical Learning Theory Sampling Methods (11/30/04) Lecturer: Michael I. Jordan Scribe: Jaspal S. Sandhu 1 Gibbs Sampling Figure 1: Undirected and directed graphs, respectively, with

More information

Deep Variational Inference. FLARE Reading Group Presentation Wesley Tansey 9/28/2016

Deep Variational Inference. FLARE Reading Group Presentation Wesley Tansey 9/28/2016 Deep Variational Inference FLARE Reading Group Presentation Wesley Tansey 9/28/2016 What is Variational Inference? What is Variational Inference? Want to estimate some distribution, p*(x) p*(x) What is

More information

Graphical Models and Kernel Methods

Graphical Models and Kernel Methods Graphical Models and Kernel Methods Jerry Zhu Department of Computer Sciences University of Wisconsin Madison, USA MLSS June 17, 2014 1 / 123 Outline Graphical Models Probabilistic Inference Directed vs.

More information

Gaussian Models

Gaussian Models Gaussian Models ddebarr@uw.edu 2016-04-28 Agenda Introduction Gaussian Discriminant Analysis Inference Linear Gaussian Systems The Wishart Distribution Inferring Parameters Introduction Gaussian Density

More information

CS145: INTRODUCTION TO DATA MINING

CS145: INTRODUCTION TO DATA MINING CS145: INTRODUCTION TO DATA MINING Text Data: Topic Model Instructor: Yizhou Sun yzsun@cs.ucla.edu December 4, 2017 Methods to be Learnt Vector Data Set Data Sequence Data Text Data Classification Clustering

More information

13: Variational inference II

13: Variational inference II 10-708: Probabilistic Graphical Models, Spring 2015 13: Variational inference II Lecturer: Eric P. Xing Scribes: Ronghuo Zheng, Zhiting Hu, Yuntian Deng 1 Introduction We started to talk about variational

More information

Variational Scoring of Graphical Model Structures

Variational Scoring of Graphical Model Structures Variational Scoring of Graphical Model Structures Matthew J. Beal Work with Zoubin Ghahramani & Carl Rasmussen, Toronto. 15th September 2003 Overview Bayesian model selection Approximations using Variational

More information

Principal Component Analysis (PCA) for Sparse High-Dimensional Data

Principal Component Analysis (PCA) for Sparse High-Dimensional Data AB Principal Component Analysis (PCA) for Sparse High-Dimensional Data Tapani Raiko Helsinki University of Technology, Finland Adaptive Informatics Research Center The Data Explosion We are facing an enormous

More information

AUTOMATIC DETECTION OF WORDS NOT SIGNIFICANT TO TOPIC CLASSIFICATION IN LATENT DIRICHLET ALLOCATION

AUTOMATIC DETECTION OF WORDS NOT SIGNIFICANT TO TOPIC CLASSIFICATION IN LATENT DIRICHLET ALLOCATION AUTOMATIC DETECTION OF WORDS NOT SIGNIFICANT TO TOPIC CLASSIFICATION IN LATENT DIRICHLET ALLOCATION By DEBARSHI ROY A THESIS PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT

More information

Lecture 6: Graphical Models: Learning

Lecture 6: Graphical Models: Learning Lecture 6: Graphical Models: Learning 4F13: Machine Learning Zoubin Ghahramani and Carl Edward Rasmussen Department of Engineering, University of Cambridge February 3rd, 2010 Ghahramani & Rasmussen (CUED)

More information

cross-language retrieval (by concatenate features of different language in X and find co-shared U). TOEFL/GRE synonym in the same way.

cross-language retrieval (by concatenate features of different language in X and find co-shared U). TOEFL/GRE synonym in the same way. 10-708: Probabilistic Graphical Models, Spring 2015 22 : Optimization and GMs (aka. LDA, Sparse Coding, Matrix Factorization, and All That ) Lecturer: Yaoliang Yu Scribes: Yu-Xiang Wang, Su Zhou 1 Introduction

More information

Latent Dirichlet Allocation

Latent Dirichlet Allocation Journal of Machine Learning Research 3 (2003) 993-1022 Submitted 2/02; Published 1/03 Latent Dirichlet Allocation David M. Blei Computer Science Division University of California Bereley, CA 94720, USA

More information

Note on Algorithm Differences Between Nonnegative Matrix Factorization And Probabilistic Latent Semantic Indexing

Note on Algorithm Differences Between Nonnegative Matrix Factorization And Probabilistic Latent Semantic Indexing Note on Algorithm Differences Between Nonnegative Matrix Factorization And Probabilistic Latent Semantic Indexing 1 Zhong-Yuan Zhang, 2 Chris Ding, 3 Jie Tang *1, Corresponding Author School of Statistics,

More information

Applying hlda to Practical Topic Modeling

Applying hlda to Practical Topic Modeling Joseph Heng lengerfulluse@gmail.com CIST Lab of BUPT March 17, 2013 Outline 1 HLDA Discussion 2 the nested CRP GEM Distribution Dirichlet Distribution Posterior Inference Outline 1 HLDA Discussion 2 the

More information

A Probabilistic Clustering-Projection Model for Discrete Data

A Probabilistic Clustering-Projection Model for Discrete Data Proc. PKDD, Porto, Portugal, 2005 A Probabilistic Clustering-Projection Model for Discrete Data Shipeng Yu 12, Kai Yu 2, Volker Tresp 2, and Hans-Peter Kriegel 1 1 Institute for Computer Science, University

More information

Gibbs Sampling. Héctor Corrada Bravo. University of Maryland, College Park, USA CMSC 644:

Gibbs Sampling. Héctor Corrada Bravo. University of Maryland, College Park, USA CMSC 644: Gibbs Sampling Héctor Corrada Bravo University of Maryland, College Park, USA CMSC 644: 2019 03 27 Latent semantic analysis Documents as mixtures of topics (Hoffman 1999) 1 / 60 Latent semantic analysis

More information

An Introduction to Bayesian Machine Learning

An Introduction to Bayesian Machine Learning 1 An Introduction to Bayesian Machine Learning José Miguel Hernández-Lobato Department of Engineering, Cambridge University April 8, 2013 2 What is Machine Learning? The design of computational systems

More information

Mathematical Formulation of Our Example

Mathematical Formulation of Our Example Mathematical Formulation of Our Example We define two binary random variables: open and, where is light on or light off. Our question is: What is? Computer Vision 1 Combining Evidence Suppose our robot

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistical Sciences! rsalakhu@cs.toronto.edu! h0p://www.cs.utoronto.ca/~rsalakhu/ Lecture 7 Approximate

More information

Variational Inference. Sargur Srihari

Variational Inference. Sargur Srihari Variational Inference Sargur srihari@cedar.buffalo.edu 1 Plan of discussion We first describe inference with PGMs and the intractability of exact inference Then give a taxonomy of inference algorithms

More information

Lecture 22 Exploratory Text Analysis & Topic Models

Lecture 22 Exploratory Text Analysis & Topic Models Lecture 22 Exploratory Text Analysis & Topic Models Intro to NLP, CS585, Fall 2014 http://people.cs.umass.edu/~brenocon/inlp2014/ Brendan O Connor [Some slides borrowed from Michael Paul] 1 Text Corpus

More information

Mr. LDA: A Flexible Large Scale Topic Modeling Package using Variational Inference in MapReduce

Mr. LDA: A Flexible Large Scale Topic Modeling Package using Variational Inference in MapReduce Mr. LDA: A Flexible Large Scale Topic Modeling Package using Variational Inference in MapReduce Ke Zhai, Jordan Boyd-Graber, Nima Asadi, and Mohamad Alkhouja Mr. LDA: A Flexible Large Scale Topic Modeling

More information

Deep Poisson Factorization Machines: a factor analysis model for mapping behaviors in journalist ecosystem

Deep Poisson Factorization Machines: a factor analysis model for mapping behaviors in journalist ecosystem 000 001 002 003 004 005 006 007 008 009 010 011 012 013 014 015 016 017 018 019 020 021 022 023 024 025 026 027 028 029 030 031 032 033 034 035 036 037 038 039 040 041 042 043 044 045 046 047 048 049 050

More information

Based on slides by Richard Zemel

Based on slides by Richard Zemel CSC 412/2506 Winter 2018 Probabilistic Learning and Reasoning Lecture 3: Directed Graphical Models and Latent Variables Based on slides by Richard Zemel Learning outcomes What aspects of a model can we

More information

Distributed ML for DOSNs: giving power back to users

Distributed ML for DOSNs: giving power back to users Distributed ML for DOSNs: giving power back to users Amira Soliman KTH isocial Marie Curie Initial Training Networks Part1 Agenda DOSNs and Machine Learning DIVa: Decentralized Identity Validation for

More information

Probabilistic Matrix Factorization

Probabilistic Matrix Factorization Probabilistic Matrix Factorization David M. Blei Columbia University November 25, 2015 1 Dyadic data One important type of modern data is dyadic data. Dyadic data are measurements on pairs. The idea is

More information