LSI, plsi, LDA and inference methods


Guillaume Obozinski
INRIA - Ecole Normale Supérieure - Paris
RussIR summer school, Yaroslavl, August 6-10th, 2012

Latent Semantic Indexing

Latent Semantic Indexing (LSI) (Deerwester et al., 1990)

Idea: words that co-occur frequently in documents should be similar.

Let x^{(i)}_1 and x^{(i)}_2 count respectively the number of occurrences of the words physician and doctor in the i-th document.

[Figure: documents plotted in the plane spanned by the word axes e_1 and e_2, with data points x^{(1)}, x^{(2)}, ... and the first principal direction u^{(1)}.]

The directions of covariance, or principal directions, are obtained from the singular value decomposition of X ∈ R^{d × N},

X = U S V^\top, with U^\top U = I_d, V^\top V = I_N,

and S ∈ R^{d × N} a matrix with non-zero elements only on the diagonal: the singular values of X, positive and sorted in decreasing order.
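
As a concrete illustration, here is a minimal NumPy sketch of the SVD step of LSI; the toy term-document matrix X and the choice K = 2 are assumptions made only for this example, not part of the original slides.

import numpy as np

# Hypothetical toy term-document matrix X (d words x N documents);
# in practice X holds counts or tf-idf scores.
X = np.array([[2., 0., 1., 0.],
              [1., 0., 2., 0.],
              [0., 3., 0., 1.],
              [0., 1., 0., 2.]])

# SVD: X = U S V^T with orthonormal columns in U, V and singular
# values sorted in decreasing order.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

K = 2                      # number of latent dimensions kept
U_K = U[:, :K]             # principal directions ("topics")
S_K = np.diag(s[:K])
V_K = Vt[:K, :].T

# Rank-K reconstruction of X (the LSI approximation).
X_K = U_K @ S_K @ V_K.T
print(np.round(X_K, 2))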

LSI: computation of the document representation

U = [u^{(1)}, ..., u^{(d)}]: the principal directions. Let U_K ∈ R^{d × K} and V_K ∈ R^{N × K} be the matrices retaining the first K columns of U and V, and S_K ∈ R^{K × K} the top-left K × K corner of S.

The projection of x^{(i)} on the subspace spanned by U_K yields the latent representation:

\tilde{x}^{(i)} = U_K^\top x^{(i)}.

Remarks

U_K^\top X = U_K^\top U_K S_K V_K^\top = S_K V_K^\top

u^{(k)} is somehow like a topic, and \tilde{x}^{(i)} is the vector of coefficients of the decomposition of the document on the K topics.

The similarity between two documents can now be measured by

cos(∠(\tilde{x}^{(i)}, \tilde{x}^{(j)})) = \frac{\tilde{x}^{(i)\top} \tilde{x}^{(j)}}{\|\tilde{x}^{(i)}\| \, \|\tilde{x}^{(j)}\|}.
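
A small sketch, under the same toy assumptions as the previous example, of the latent representation and of the cosine similarity between documents:

import numpy as np

def latent_representation(X, K):
    # Project the documents (columns of X) onto the top-K principal directions.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    U_K = U[:, :K]
    return U_K.T @ X          # columns are the latent document vectors

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

X = np.array([[2., 0., 1., 0.],
              [1., 0., 2., 0.],
              [0., 3., 0., 1.],
              [0., 1., 0., 2.]])
Z = latent_representation(X, K=2)
print(cosine(Z[:, 0], Z[:, 2]))   # documents 1 and 3 share vocabulary
print(cosine(Z[:, 0], Z[:, 1]))   # documents 1 and 2 do not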

LSI vs PCA

LSI is almost identical to Principal Component Analysis (PCA), proposed by Karl Pearson in 1901.

Like PCA, LSI aims at finding the directions of high correlation between words, called the principal directions. Like PCA, it retains the projection of the data on a number K of these principal directions; these projections are called the principal components.

Differences between LSI and PCA:
LSI does not center the data (for no specific reason).
LSI is typically combined with TF-IDF.

Limitations and shortcomings of LSI

The generative model of the data underlying PCA is a Gaussian cloud, which does not match the structure of the data. In particular, LSI ignores:
that the data are counts, frequencies or tf-idf scores;
that the data are positive (u^{(k)} typically has negative coefficients).

Moreover, the singular value decomposition is expensive to compute.

Topic models and matrix factorization

X ∈ R^{d × M}, with columns x_i corresponding to documents.
B: the matrix whose columns correspond to the different topics.
Θ: the matrix of decomposition coefficients, with columns θ_i, each associated with one document and encoding its topic content.

X ≈ B Θ
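
A minimal sketch of the shapes involved in this factorization; the sizes d, K, M and the random matrices are assumptions made only for illustration.

import numpy as np

d, K, M = 5000, 10, 200       # vocabulary size, topics, documents (assumed)
B = np.random.rand(d, K)      # topic matrix: one column per topic
Theta = np.random.rand(K, M)  # decomposition coefficients: one column per document
X_approx = B @ Theta          # d x M reconstruction of the word-count matrix
print(X_approx.shape)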

Probabilistic LSI

Probabilistic Latent Semantic Indexing (Hofmann, 2001)

[Figure: (a) Topics: three topics from a fifty-topic LDA model trained on New York Times articles, each a distribution over words:
TOPIC 1: computer, technology, system, service, site, phone, internet, machine
TOPIC 2: sell, sale, store, product, business, advertising, market, consumer
TOPIC 3: play, film, movie, theater, production, star, director, stage
(b) Document assignments to topics: a simplex depicting the distribution over these topics for seven documents; the line from each document's title shows the document's position in the topic space.]

Probabilistic Latent Semantic Indexing (Hofmann, 2001)

Obtain a more expressive model by allowing several topics per document, in various proportions, so that each word w_{in} gets its own topic z_{in}, drawn from the multinomial distribution d_i unique to the i-th document.

[Graphical model: d_i → z_{in} → w_{in} ← B, with plates over n = 1..N_i and i = 1..M.]

d_i : topic proportions in document i
z_{in} ∼ M(1, d_i)
(w_{in} | z_{ink} = 1) ∼ M(1, (b_{1k}, ..., b_{dk}))
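
A minimal sketch of this generative process in NumPy; the vocabulary size, number of topics, topic matrix B, proportions d_i and document length are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)

d, K = 6, 2                              # vocabulary size and number of topics (assumed)
B = rng.dirichlet(np.ones(d), size=K).T  # B[:, k] = word distribution of topic k
d_i = np.array([0.7, 0.3])               # topic proportions of document i (assumed)
N_i = 10                                 # document length (assumed)

words = []
for n in range(N_i):
    z = rng.choice(K, p=d_i)             # z_in ~ M(1, d_i)
    w = rng.choice(d, p=B[:, z])         # w_in | z_ink = 1 ~ M(1, b_.k)
    words.append(w)
print(words)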

EM algorithm for plsi

Denote by j_{in} the index in the dictionary of the word appearing as the n-th word of document i.

Expectation step:

q^{(t)}_{ink} = p(z_{ink} = 1 | w_{in}; d^{(t-1)}_i, B^{(t-1)}) = \frac{d^{(t-1)}_{ik} \, b^{(t-1)}_{j_{in} k}}{\sum_{k'=1}^{K} d^{(t-1)}_{ik'} \, b^{(t-1)}_{j_{in} k'}}

Maximization step:

d^{(t)}_{ik} = \frac{1}{N^{(i)}} \sum_{n=1}^{N^{(i)}} q^{(t)}_{ink} = \frac{\tilde{N}^{(i)}_k}{N^{(i)}}
and
b^{(t)}_{jk} = \frac{\sum_{i=1}^{M} \sum_{n=1}^{N^{(i)}} q^{(t)}_{ink} \, w_{inj}}{\sum_{i=1}^{M} \sum_{n=1}^{N^{(i)}} q^{(t)}_{ink}}
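
A sketch of one EM iteration implementing these two updates; the data layout, variable names and toy corpus are assumptions made for the example, not part of the original slides.

import numpy as np

def plsi_em_step(W, D, B):
    """One EM iteration for pLSI.

    W : list of documents, each an integer array of word indices j_in
    D : M x K matrix of per-document topic proportions d_ik
    B : d x K matrix of per-topic word distributions b_jk
    """
    D_new = np.zeros_like(D)
    B_new = np.zeros_like(B)
    for i, doc in enumerate(W):
        # E-step: responsibilities q_ink for every word occurrence
        q = D[i] * B[doc, :]                 # N_i x K, unnormalized
        q /= q.sum(axis=1, keepdims=True)
        # M-step accumulators
        D_new[i] = q.sum(axis=0) / len(doc)
        np.add.at(B_new, doc, q)             # B_new[j, k] += sum_n q_nk 1{j_in = j}
    B_new /= B_new.sum(axis=0, keepdims=True)
    return D_new, B_new

# Toy usage with an assumed random initialization
rng = np.random.default_rng(0)
W = [np.array([0, 1, 1, 3]), np.array([2, 2, 3])]
D = rng.dirichlet(np.ones(2), size=len(W))   # M x K
B = rng.dirichlet(np.ones(4), size=2).T      # d x K
for _ in range(50):
    D, B = plsi_em_step(W, D, B)
print(np.round(D, 2))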

Issues with plsi?

Too many parameters → overfitting!
Solutions? Not clear.

Solutions or alternative approaches

Frequentist approach: regularize + optimize (Dictionary Learning)

\min_{θ_i} \; -\log p(x_i | θ_i) + λ Ω(θ_i)

Bayesian approach: prior + integrate (Latent Dirichlet Allocation)

p(θ_i | x_i, α) ∝ p(x_i | θ_i) \, p(θ_i | α)

Frequentist + Bayesian: integrate + optimize

\max_{α} \prod_{i=1}^{M} \int p(x_i | θ_i) \, p(θ_i | α) \, dθ_i

... called the Empirical Bayes approach or Type II Maximum Likelihood.

Latent Dirichlet Allocation

Latent Dirichlet Allocation (Blei et al., 2003)

[Graphical model: α → θ_i → z_{in} → w_{in} ← B, with plates over n = 1..N_i and i = 1..M.]

K topics
α = (α_1, ..., α_K): parameter vector
θ_i = (θ_{1i}, ..., θ_{Ki}) ∼ Dir(α): topic proportions
z_{in}: topic indicator vector for the n-th word of the i-th document, z_{in} = (z_{in1}, ..., z_{inK}) ∈ {0, 1}^K, with

z_{in} ∼ M(1, (θ_{1i}, ..., θ_{Ki})), i.e. p(z_{in} | θ_i) = \prod_{k=1}^{K} θ_{ki}^{z_{ink}}

w_{in} | {z_{ink} = 1} ∼ M(1, (b_{1k}, ..., b_{dk})), i.e. p(w_{inj} = 1 | z_{ink} = 1) = b_{jk}
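
A minimal sketch of the LDA generative process with NumPy; the sizes, hyperparameters and the Poisson document lengths are assumptions made for the example.

import numpy as np

rng = np.random.default_rng(0)

K, d, M = 3, 8, 4                        # topics, vocabulary size, documents (assumed)
alpha = np.full(K, 0.5)                  # Dirichlet parameter alpha (assumed)
B = rng.dirichlet(np.ones(d), size=K).T  # b_.k = word distribution of topic k

docs = []
for i in range(M):
    theta_i = rng.dirichlet(alpha)               # theta_i ~ Dir(alpha)
    N_i = rng.poisson(20) + 1                    # document length (assumed Poisson)
    z = rng.choice(K, size=N_i, p=theta_i)       # z_in ~ M(1, theta_i)
    w = np.array([rng.choice(d, p=B[:, k]) for k in z])   # w_in | z_in
    docs.append(w)
print([len(doc) for doc in docs])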

LDA likelihood

p((w_{in}, z_{in})_{1 ≤ n ≤ N_i} | θ_i) = \prod_{n=1}^{N_i} p(w_{in}, z_{in} | θ_i) = \prod_{n=1}^{N_i} \prod_{j=1}^{d} \prod_{k=1}^{K} (b_{jk} θ_{ki})^{w_{inj} z_{ink}}

p((w_{in})_{1 ≤ n ≤ N_i} | θ_i) = \prod_{n=1}^{N_i} \sum_{z_{in}} p(w_{in}, z_{in} | θ_i) = \prod_{n=1}^{N_i} \prod_{j=1}^{d} \Big[ \sum_{k=1}^{K} b_{jk} θ_{ki} \Big]^{w_{inj}},

so that w_{in} | θ_i \overset{i.i.d.}{∼} M(1, B θ_i), or equivalently x_i | θ_i ∼ M(N_i, B θ_i).
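
A small numerical check of the last statement: summing the topic indicator out gives the word distribution B θ_i; the matrix B and the vector θ below are arbitrary illustrative values.

import numpy as np

B = np.array([[0.5, 0.1],
              [0.3, 0.2],
              [0.2, 0.7]])
theta = np.array([0.4, 0.6])
p_word = B @ theta               # p(w_inj = 1 | theta) = (B theta)_j
print(p_word, p_word.sum())      # a valid distribution over the 3 words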

LDA as Multinomial Factorial Analysis

Eliminating the z's from the model yields a conceptually simpler model in which the θ_i can be interpreted as latent factors, as in factor analysis.

[Graphical model: α → θ_i → x_i ← B, with a plate over i = 1..M.]

Topic proportions for document i: θ_i ∈ R^K, θ_i ∼ Dir(α)
Empirical word counts for document i: x_i ∈ R^d, x_i ∼ M(N_i, B θ_i)

LDA with smoothing of the dictionary

Issue with new words: they will have probability 0 if B is optimized over the training data. Need to smooth B, e.g. via Laplace-type smoothing, obtained by placing a Dirichlet prior with parameter η on the columns of B.

[Graphical models: the LDA model augmented with η → B, both in the word-level formulation (α → θ_i → z_{in} → w_{in} ← B) and in the collapsed formulation (α → θ_i → x_i ← B).]

Learning with LDA

How do we learn with LDA?

How do we learn, for each topic, its word distribution b_k?
How do we learn, for each document, its topic composition θ_i?
How do we assign to each word of a document its topic z_{in}?

These quantities are treated in a Bayesian fashion, so the natural Bayesian answers are the posteriors

p(B | W), p(θ_i | W), p(z_{in} | W),

or E(B | W), E(θ_i | W), E(z_{in} | W) if point estimates are needed.

Monte Carlo

Principle of Monte Carlo integration
Let Z be a random variable. To compute E[f(Z)], we can sample Z^{(1)}, ..., Z^{(B)} i.i.d. with the same distribution as Z and use the approximation

E[f(Z)] ≈ \frac{1}{B} \sum_{b=1}^{B} f(Z^{(b)}).

Problem: in most situations, sampling exactly from the distribution of Z is too hard, so this direct approach is impossible.
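
A minimal sketch of Monte Carlo integration, estimating E[Z^2] = 1 for Z ~ N(0, 1); the choice of f and of the distribution is an assumption made so that the exact answer is known.

import numpy as np

rng = np.random.default_rng(0)

B = 100_000
samples = rng.standard_normal(B)     # Z^(1), ..., Z^(B) i.i.d. ~ N(0, 1)
estimate = np.mean(samples ** 2)     # (1/B) sum_b f(Z^(b)) with f(z) = z**2
print(estimate)                      # close to the exact value 1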

Markov Chain Monte Carlo (MCMC)

If we cannot sample exactly from the distribution of Z, i.e. from some q(z) = P(Z = z) (or q(z) the density of the r.v. Z), we can still create a sequence of random variables that approaches the correct distribution.

Principle of MCMC
Construct a chain of random variables Z^{(b,1)}, ..., Z^{(b,T)} with

Z^{(b,t)} ∼ p_t(z^{(b,t)} | Z^{(b,t-1)} = z^{(b,t-1)})

such that Z^{(b,T)} converges in distribution to Z as T → ∞. We can then approximate:

E[f(Z)] ≈ \frac{1}{B} \sum_{b=1}^{B} f(Z^{(b,T)}).

MCMC in practice

Run a single chain:

E[f(Z)] ≈ \frac{1}{T} \sum_{t=1}^{T} f(Z^{(T_0 + k t)})

T_0 is the burn-in time.
k is the thinning factor.

It is useful to take k > 1 only if almost i.i.d. samples are required. To compute an expectation in which the correlation between Z^{(t)} and Z^{(t-1)} would not interfere, take k = 1.

Main difficulties:
the mixing time of the chain can be very large;
assessing whether the chain has mixed or not is a hard problem;
a proper approximation is obtained only with T very large.

MCMC can be quite slow or just never converge, and you will not necessarily know it.
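
A sketch of the single-chain recipe with burn-in and thinning; the chain below is a simple random-walk Metropolis sampler targeting N(0, 1), chosen only so the example is self-contained, not one of the LDA samplers discussed later.

import numpy as np

def mcmc_average(chain, f, burn_in, thin=1):
    """Average f over a chain after discarding the first burn_in samples,
    keeping every `thin`-th sample (thin=1 keeps them all)."""
    kept = chain[burn_in::thin]
    return np.mean([f(z) for z in kept])

rng = np.random.default_rng(0)
z, chain = 0.0, []
for _ in range(20_000):
    prop = z + rng.normal(scale=1.0)
    # Accept with probability min(1, p(prop)/p(z)) for the target exp(-z**2 / 2)
    if np.log(rng.uniform()) < 0.5 * (z**2 - prop**2):
        z = prop
    chain.append(z)

# Estimate E[Z**2] = 1 from the chain
print(mcmc_average(np.array(chain), lambda z: z**2, burn_in=2_000, thin=5))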

Gibbs sampling

A nice special case of MCMC:

Principle of Gibbs sampling
For each node i in turn, sample the node conditionally on the other nodes, i.e.

sample Z_i^{(t)} ∼ p(z_i | Z_{-i} = z_{-i}^{(t-1)}).

[Graphical model of smoothed LDA: η → B, α → θ_i → z_{in} → w_{in} ← B, with plates over n = 1..N_i and i = 1..M.]

Markov Blanket
Definition: Let V be the set of nodes of the graph. The Markov blanket of node i is the minimal set of nodes S (not containing i) such that

p(Z_i | Z_S) = p(Z_i | Z_{-i}), or equivalently Z_i ⊥ Z_{V \ (S ∪ {i})} | Z_S.

d-separation

Theorem: Let A, B and C be three disjoint sets of nodes. The property X_A ⊥ X_B | X_C holds if and only if all paths connecting A to B are blocked, which means that they contain at least one blocking node. Node j is a blocking node if there is no v-structure at j and j is in C, or if there is a v-structure at j and neither j nor any of its descendants in the graph is in C.

[Figure: example directed graph on nodes a, b, c, e, f illustrating blocked and unblocked paths.]

Markov Blanket in a Directed Graphical model

[Figure: the Markov blanket of a node x_i in a directed graphical model, i.e. its parents, its children and the other parents of its children.]

Markov Blankets in LDA?

[Graphical model of smoothed LDA: η → B, α → θ_i → z_{in} → w_{in} ← B, with plates over n = 1..N_i and i = 1..M.]

Markov blankets:
for θ_i : (z_{in})_{n=1..N_i}
for z_{in} : w_{in}, θ_i and B
for B : (w_{in}, z_{in})_{n=1..N_i, i=1..M}

Gibbs sampling for LDA with a single document

p(w, z, θ, B; α, η) = \Big[ \prod_{n=1}^{N} p(w_n | z_n, B) \, p(z_n | θ) \Big] \, p(θ | α) \prod_k p(b_k | η)
∝ \Big[ \prod_{n=1}^{N} \prod_{j,k} (b_{jk} θ_k)^{w_{nj} z_{nk}} \Big] \prod_k θ_k^{α_k - 1} \prod_{j,k} b_{jk}^{η_j - 1}

We thus have:

(z_n | w_n, θ, B) ∼ M(1, p_n) with p_{nk} = \frac{b_{j(n),k} θ_k}{\sum_{k'} b_{j(n),k'} θ_{k'}}.

(θ | (z_n, w_n)_n, α) ∼ Dir(\tilde{α}) with \tilde{α}_k = α_k + N_k, N_k = \sum_{n=1}^{N} z_{nk}.

(b_k | (z_n, w_n)_n, η) ∼ Dir(\tilde{η}) with \tilde{η}_j = η_j + \sum_{n=1}^{N} w_{nj} z_{nk}.
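
A minimal NumPy sketch of a Gibbs sweep over these three conditionals for one document; the toy document, K, d, α and η are assumptions, and no burn-in or convergence diagnostics are included.

import numpy as np

rng = np.random.default_rng(0)

def gibbs_lda_single_doc(w, K, d, alpha, eta, n_iter=1000):
    """Gibbs sampler for LDA with a single document.

    w : integer array of word indices j(n), one entry per word occurrence
    """
    N = len(w)
    theta = rng.dirichlet(alpha)
    B = rng.dirichlet(eta, size=K).T                 # d x K
    z = rng.integers(K, size=N)
    for _ in range(n_iter):
        # z_n | w_n, theta, B ~ M(1, p_n), p_nk proportional to b_{j(n),k} theta_k
        p = B[w, :] * theta
        p /= p.sum(axis=1, keepdims=True)
        z = np.array([rng.choice(K, p=p[n]) for n in range(N)])
        # theta | z ~ Dir(alpha_k + N_k)
        N_k = np.bincount(z, minlength=K)
        theta = rng.dirichlet(alpha + N_k)
        # b_k | z, w ~ Dir(eta_j + count of word j among words with topic k)
        for k in range(K):
            counts = np.bincount(w[z == k], minlength=d)
            B[:, k] = rng.dirichlet(eta + counts)
    return theta, B, z

theta, B, z = gibbs_lda_single_doc(np.array([0, 0, 1, 2, 2, 3]),
                                   K=2, d=4,
                                   alpha=np.ones(2), eta=np.ones(4))
print(theta)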

LDA Results (Blei et al., 2003)

Four example topics, each shown by its most probable words:

Arts: NEW FILM SHOW MUSIC MOVIE PLAY MUSICAL BEST ACTOR FIRST YORK OPERA THEATER ACTRESS LOVE
Budgets: MILLION TAX PROGRAM BUDGET BILLION FEDERAL YEAR SPENDING NEW STATE PLAN MONEY PROGRAMS GOVERNMENT CONGRESS
Children: CHILDREN WOMEN PEOPLE CHILD YEARS FAMILIES WORK PARENTS SAYS FAMILY WELFARE MEN PERCENT CARE LIFE
Education: SCHOOL STUDENTS SCHOOLS EDUCATION TEACHERS HIGH PUBLIC TEACHER BENNETT MANIGAT NAMPHY STATE PRESIDENT ELEMENTARY HAITI

Excerpt of an article, whose words the model assigns to these topics:
"The William Randolph Hearst Foundation will give $1.25 million to Lincoln Center, Metropolitan Opera Co., New York Philharmonic and Juilliard School. Our board felt that we had a real opportunity to make a mark on the future of the performing arts with these grants an act..."

Reading Tea Leaves: word precision (Boyd-Graber et al., 2009)

[Figure 3 of the paper: model precision on the word-intrusion task for CTM, LDA and pLSI, on the New York Times and Wikipedia corpora, with 50, 100 and 150 topics. Higher is better. Surprisingly, although CTM generally achieves a better predictive likelihood than the other models, the topics it infers fare worst when evaluated against human judgments.]

Precision of the identification of word outliers, by humans and for different models.

Reading Tea Leaves: topic precision (Boyd-Graber et al., 2009)

[Figure 6 of the paper: topic log odds on the topic-intrusion task for CTM, LDA and pLSI, on the New York Times and Wikipedia corpora, with 50, 100 and 150 topics. Higher is better. The topic log odds is the log ratio of the probability mass assigned to the true intruder to the probability mass assigned to the intruder selected by the subject. Again, although CTM generally achieves a better predictive likelihood than the other models, the topics it infers fare worst when evaluated against human judgments.]

Precision of the identification of topic outliers, by humans and for different models.

Reading Tea Leaves: log-likelihood on held-out data (Boyd-Graber et al., 2009)

[Table 1 of the paper: two predictive metrics, predictive log-likelihood / predictive rank, for LDA, CTM and pLSI on the New York Times and Wikipedia corpora with 50, 100 and 150 topics. Consistent with values reported in the literature, CTM generally performs the best, followed by LDA, then pLSI.]

Log-likelihoods of several models including LDA, pLSI and CTM (CTM = correlated topic model).

Variational inference for LDA

Principle of Variational Inference

Problem: it is hard to compute p(B, θ_i, z_{in} | W), E(B | W), E(θ_i | W), E(z_{in} | W).

Idea of Variational Inference: find a distribution q which is as close as possible to p(· | W) and for which it is not too hard to compute E_q(B), E_q(θ_i), E_q(z_{in}).

Usual approach:
1. Choose a simple parametric family Q for q.
2. Solve the variational formulation min_{q ∈ Q} KL(q ‖ p(· | W)).
3. Compute the desired expectations E_q(B), E_q(θ_i), E_q(z_{in}).

Variational Inference for LDA (Blei et al., 2003)

Assume B is a parameter, assume there is a single document, and focus on the inference of θ and (z_n)_n. Choose q in a factorized form (mean-field approximation):

q(θ, (z_n)_n) = q_θ(θ) \prod_{n=1}^{N} q_{z_n}(z_n)

with

q_θ(θ) = \frac{Γ(\sum_k γ_k)}{\prod_k Γ(γ_k)} \prod_k θ_k^{γ_k - 1} and q_{z_n}(z_n) = \prod_k φ_{nk}^{z_{nk}}.

Then

KL(q ‖ p(· | W)) = E_q\Big[ \log \frac{q(θ, (z_n)_n)}{p(θ, (z_n)_n | W)} \Big]
= E_q\Big[ \log q_θ(θ) + \sum_n \log q_{z_n}(z_n) - \log p(θ | α) - \sum_n \big( \log p(z_n | θ) + \log p(w_n | z_n, B) \big) \Big] + \log p((w_n)_n).

Variational Inference for LDA II

The quantity to minimize is

E_q\Big[ \log q_θ(θ) - \log p(θ | α) + \sum_n \big( \log q_{z_n}(z_n) - \log p(z_n | θ) - \log p(w_n | z_n, B) \big) \Big],

whose terms can be computed as follows:

E_q[\log q_θ(θ)] = E_q\Big[ \log Γ(\sum_k γ_k) - \sum_k \log Γ(γ_k) + \sum_k (γ_k - 1) \log θ_k \Big]
= \log Γ(\sum_k γ_k) - \sum_k \log Γ(γ_k) + \sum_k (γ_k - 1) E_q[\log θ_k]

E_q[\log p(θ | α)] = \sum_k (α_k - 1) E_q[\log θ_k] + cst

E_q[\log q_{z_n}(z_n) - \log p(z_n | θ)] = E_q\Big[ \sum_k \big( z_{nk} \log φ_{nk} - z_{nk} \log θ_k \big) \Big]
= \sum_k E_q[z_{nk}] \big( \log φ_{nk} - E_q[\log θ_k] \big)

E_q[\log p(w_n | z_n, B)] = E_q\Big[ \sum_{j,k} z_{nk} w_{nj} \log b_{jk} \Big] = \sum_{j,k} E_q[z_{nk}] \, w_{nj} \log b_{jk}

VI for LDA: Computing the expectations

The expectation of the logarithm of a Dirichlet random variable can be computed exactly with the digamma function Ψ:

E_q[\log θ_k] = Ψ(γ_k) - Ψ(\sum_{k'} γ_{k'}), with Ψ(x) := \frac{∂}{∂x} \log Γ(x).

We obviously have E_q[z_{nk}] = φ_{nk}.

The problem min_{q ∈ Q} KL(q ‖ p(· | W)) is therefore equivalent to min_{γ, (φ_n)_n} D(γ, (φ_n)_n) with

D(γ, (φ_n)_n) = \log Γ(\sum_k γ_k) - \sum_k \log Γ(γ_k) + \sum_{n,k} φ_{nk} \log φ_{nk} - \sum_{n,k} φ_{nk} \sum_j w_{nj} \log b_{jk} - \sum_k \big( α_k + \sum_n φ_{nk} - γ_k \big) \big( Ψ(γ_k) - Ψ(\sum_{k'} γ_{k'}) \big).
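
A short numerical sketch of the digamma identity E_q[log θ_k] = Ψ(γ_k) − Ψ(Σ_{k'} γ_{k'}), checked against a Monte Carlo estimate; the value of γ is an arbitrary example.

import numpy as np
from scipy.special import digamma

rng = np.random.default_rng(0)
gamma = np.array([2.0, 5.0, 1.0])

exact = digamma(gamma) - digamma(gamma.sum())
mc = np.log(rng.dirichlet(gamma, size=200_000)).mean(axis=0)
print(np.round(exact, 3), np.round(mc, 3))   # the two should agree closely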

VI for LDA: Solving for the variational updates

Introducing a Lagrangian to account for the constraints \sum_{k=1}^{K} φ_{nk} = 1:

L(γ, (φ_n)_n) = D(γ, (φ_n)_n) + \sum_{n=1}^{N} λ_n \Big( 1 - \sum_k φ_{nk} \Big)

Computing the gradient of the Lagrangian:

\frac{∂L}{∂γ_k} = -\Big( α_k + \sum_n φ_{nk} - γ_k \Big) \Big( Ψ'(γ_k) - Ψ'(\sum_{k'} γ_{k'}) \Big)

\frac{∂L}{∂φ_{nk}} = \log φ_{nk} + 1 - \sum_j w_{nj} \log b_{jk} - \Big( Ψ(γ_k) - Ψ(\sum_{k'} γ_{k'}) \Big) - λ_n

Partial minimizations in γ and φ_{nk} are therefore respectively solved by

γ_k = α_k + \sum_n φ_{nk} and φ_{nk} ∝ b_{j(n),k} \exp\big( Ψ(γ_k) - Ψ(\sum_{k'} γ_{k'}) \big),

where j(n) is the one and only j such that w_{nj} = 1.

Variational Algorithm

Algorithm 1: Variational inference for LDA
Require: W, α, γ_init, (φ_{n,init})_n
1: while not converged do
2:   γ_k ← α_k + \sum_n φ_{nk}
3:   for n = 1..N do
4:     for k = 1..K do
5:       φ_{nk} ← b_{j(n),k} \exp( Ψ(γ_k) - Ψ(\sum_{k'} γ_{k'}) )
6:     end for
7:     φ_n ← φ_n / \sum_k φ_{nk}
8:   end for
9: end while
10: return γ, (φ_n)_n

With the quantities computed we can approximate:

E[θ_k | W] ≈ \frac{γ_k}{\sum_{k'} γ_{k'}} and E[z_n | W] ≈ φ_n.
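
A minimal Python sketch of Algorithm 1 for a single document, with B treated as a fixed parameter; the toy B, document and α are assumptions, and a fixed number of iterations stands in for the convergence test.

import numpy as np
from scipy.special import digamma

def lda_variational_inference(w, B, alpha, n_iter=100):
    """Mean-field variational inference for one document.

    w     : array of word indices j(n), one entry per word occurrence
    B     : d x K topic matrix, treated here as a fixed parameter
    alpha : Dirichlet parameter vector of length K
    """
    N, K = len(w), B.shape[1]
    phi = np.full((N, K), 1.0 / K)           # phi_{n,init} uniform
    gamma = alpha + float(N) / K             # gamma_init
    for _ in range(n_iter):
        gamma = alpha + phi.sum(axis=0)                          # gamma_k = alpha_k + sum_n phi_nk
        phi = B[w, :] * np.exp(digamma(gamma) - digamma(gamma.sum()))
        phi /= phi.sum(axis=1, keepdims=True)                    # normalize each phi_n
    return gamma, phi

# Toy usage with an assumed 4-word vocabulary and 2 topics
B = np.array([[0.4, 0.1],
              [0.3, 0.1],
              [0.2, 0.3],
              [0.1, 0.5]])
gamma, phi = lda_variational_inference(np.array([0, 0, 1, 3, 3]), B, alpha=np.ones(2))
print(gamma / gamma.sum())    # approximates E[theta | W]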

Polylingual Topic Model (Mimno et al., 2009)

Generalization of LDA to documents available simultaneously in several languages, such as Wikipedia articles, which are not literal translations of one another but share the same topics.

[Graphical model: α → θ_i; for each language l = 1..L, θ_i → z^{(l)}_{in} → w^{(l)}_{in} ← B^{(l)}, with plates over n = 1..N^{(l)}_i and i = 1..M.]

References

Blei, D., Ng, A., and Jordan, M. (2003). Latent Dirichlet allocation. The Journal of Machine Learning Research, 3:993-1022.

Boyd-Graber, J., Chang, J., Gerrish, S., Wang, C., and Blei, D. (2009). Reading tea leaves: How humans interpret topic models. In Proceedings of the 23rd Annual Conference on Neural Information Processing Systems.

Deerwester, S., Dumais, S., Furnas, G., Landauer, T., and Harshman, R. (1990). Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391-407.

Hofmann, T. (2001). Unsupervised learning by probabilistic latent semantic analysis. Machine Learning, 42(1):177-196.

Mimno, D., Wallach, H., Naradowsky, J., Smith, D., and McCallum, A. (2009). Polylingual topic models. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pages 880-889. Association for Computational Linguistics.


More information

Graphical Models for Collaborative Filtering

Graphical Models for Collaborative Filtering Graphical Models for Collaborative Filtering Le Song Machine Learning II: Advanced Topics CSE 8803ML, Spring 2012 Sequence modeling HMM, Kalman Filter, etc.: Similarity: the same graphical model topology,

More information

Gaussian Mixture Model

Gaussian Mixture Model Case Study : Document Retrieval MAP EM, Latent Dirichlet Allocation, Gibbs Sampling Machine Learning/Statistics for Big Data CSE599C/STAT59, University of Washington Emily Fox 0 Emily Fox February 5 th,

More information

Machine Learning Summer School

Machine Learning Summer School Machine Learning Summer School Lecture 3: Learning parameters and structure Zoubin Ghahramani zoubin@eng.cam.ac.uk http://learning.eng.cam.ac.uk/zoubin/ Department of Engineering University of Cambridge,

More information

Evaluation Methods for Topic Models

Evaluation Methods for Topic Models University of Massachusetts Amherst wallach@cs.umass.edu April 13, 2009 Joint work with Iain Murray, Ruslan Salakhutdinov and David Mimno Statistical Topic Models Useful for analyzing large, unstructured

More information

Density Estimation. Seungjin Choi

Density Estimation. Seungjin Choi Density Estimation Seungjin Choi Department of Computer Science and Engineering Pohang University of Science and Technology 77 Cheongam-ro, Nam-gu, Pohang 37673, Korea seungjin@postech.ac.kr http://mlg.postech.ac.kr/

More information

Matrix Factorization & Latent Semantic Analysis Review. Yize Li, Lanbo Zhang

Matrix Factorization & Latent Semantic Analysis Review. Yize Li, Lanbo Zhang Matrix Factorization & Latent Semantic Analysis Review Yize Li, Lanbo Zhang Overview SVD in Latent Semantic Indexing Non-negative Matrix Factorization Probabilistic Latent Semantic Indexing Vector Space

More information

Probabilistic Graphical Networks: Definitions and Basic Results

Probabilistic Graphical Networks: Definitions and Basic Results This document gives a cursory overview of Probabilistic Graphical Networks. The material has been gleaned from different sources. I make no claim to original authorship of this material. Bayesian Graphical

More information

A Unified Posterior Regularized Topic Model with Maximum Margin for Learning-to-Rank

A Unified Posterior Regularized Topic Model with Maximum Margin for Learning-to-Rank A Unified Posterior Regularized Topic Model with Maximum Margin for Learning-to-Rank Shoaib Jameel Shoaib Jameel 1, Wai Lam 2, Steven Schockaert 1, and Lidong Bing 3 1 School of Computer Science and Informatics,

More information

Bayesian Methods for Machine Learning

Bayesian Methods for Machine Learning Bayesian Methods for Machine Learning CS 584: Big Data Analytics Material adapted from Radford Neal s tutorial (http://ftp.cs.utoronto.ca/pub/radford/bayes-tut.pdf), Zoubin Ghahramni (http://hunch.net/~coms-4771/zoubin_ghahramani_bayesian_learning.pdf),

More information

Latent variable models for discrete data

Latent variable models for discrete data Latent variable models for discrete data Jianfei Chen Department of Computer Science and Technology Tsinghua University, Beijing 100084 chris.jianfei.chen@gmail.com Janurary 13, 2014 Murphy, Kevin P. Machine

More information

RETRIEVAL MODELS. Dr. Gjergji Kasneci Introduction to Information Retrieval WS

RETRIEVAL MODELS. Dr. Gjergji Kasneci Introduction to Information Retrieval WS RETRIEVAL MODELS Dr. Gjergji Kasneci Introduction to Information Retrieval WS 2012-13 1 Outline Intro Basics of probability and information theory Retrieval models Boolean model Vector space model Probabilistic

More information

Latent Dirichlet Bayesian Co-Clustering

Latent Dirichlet Bayesian Co-Clustering Latent Dirichlet Bayesian Co-Clustering Pu Wang 1, Carlotta Domeniconi 1, and athryn Blackmond Laskey 1 Department of Computer Science Department of Systems Engineering and Operations Research George Mason

More information

Design of Text Mining Experiments. Matt Taddy, University of Chicago Booth School of Business faculty.chicagobooth.edu/matt.

Design of Text Mining Experiments. Matt Taddy, University of Chicago Booth School of Business faculty.chicagobooth.edu/matt. Design of Text Mining Experiments Matt Taddy, University of Chicago Booth School of Business faculty.chicagobooth.edu/matt.taddy/research Active Learning: a flavor of design of experiments Optimal : consider

More information

Latent Dirichlet Allocation

Latent Dirichlet Allocation Latent Dirichlet Allocation 1 Directed Graphical Models William W. Cohen Machine Learning 10-601 2 DGMs: The Burglar Alarm example Node ~ random variable Burglar Earthquake Arcs define form of probability

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 11 Project

More information

Notes on Latent Semantic Analysis

Notes on Latent Semantic Analysis Notes on Latent Semantic Analysis Costas Boulis 1 Introduction One of the most fundamental problems of information retrieval (IR) is to find all documents (and nothing but those) that are semantically

More information

Parametric Models. Dr. Shuang LIANG. School of Software Engineering TongJi University Fall, 2012

Parametric Models. Dr. Shuang LIANG. School of Software Engineering TongJi University Fall, 2012 Parametric Models Dr. Shuang LIANG School of Software Engineering TongJi University Fall, 2012 Today s Topics Maximum Likelihood Estimation Bayesian Density Estimation Today s Topics Maximum Likelihood

More information

Bayesian Inference for Dirichlet-Multinomials

Bayesian Inference for Dirichlet-Multinomials Bayesian Inference for Dirichlet-Multinomials Mark Johnson Macquarie University Sydney, Australia MLSS Summer School 1 / 50 Random variables and distributed according to notation A probability distribution

More information

Chapter 4 Dynamic Bayesian Networks Fall Jin Gu, Michael Zhang

Chapter 4 Dynamic Bayesian Networks Fall Jin Gu, Michael Zhang Chapter 4 Dynamic Bayesian Networks 2016 Fall Jin Gu, Michael Zhang Reviews: BN Representation Basic steps for BN representations Define variables Define the preliminary relations between variables Check

More information

Bayesian Learning. HT2015: SC4 Statistical Data Mining and Machine Learning. Maximum Likelihood Principle. The Bayesian Learning Framework

Bayesian Learning. HT2015: SC4 Statistical Data Mining and Machine Learning. Maximum Likelihood Principle. The Bayesian Learning Framework HT5: SC4 Statistical Data Mining and Machine Learning Dino Sejdinovic Department of Statistics Oxford http://www.stats.ox.ac.uk/~sejdinov/sdmml.html Maximum Likelihood Principle A generative model for

More information

Probabilistic Graphical Models (I)

Probabilistic Graphical Models (I) Probabilistic Graphical Models (I) Hongxin Zhang zhx@cad.zju.edu.cn State Key Lab of CAD&CG, ZJU 2015-03-31 Probabilistic Graphical Models Modeling many real-world problems => a large number of random

More information

Probablistic Graphical Models, Spring 2007 Homework 4 Due at the beginning of class on 11/26/07

Probablistic Graphical Models, Spring 2007 Homework 4 Due at the beginning of class on 11/26/07 Probablistic Graphical odels, Spring 2007 Homework 4 Due at the beginning of class on 11/26/07 Instructions There are four questions in this homework. The last question involves some programming which

More information

Variational Bayesian Dirichlet-Multinomial Allocation for Exponential Family Mixtures

Variational Bayesian Dirichlet-Multinomial Allocation for Exponential Family Mixtures 17th Europ. Conf. on Machine Learning, Berlin, Germany, 2006. Variational Bayesian Dirichlet-Multinomial Allocation for Exponential Family Mixtures Shipeng Yu 1,2, Kai Yu 2, Volker Tresp 2, and Hans-Peter

More information

STA 414/2104: Machine Learning

STA 414/2104: Machine Learning STA 414/2104: Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistics! rsalakhu@cs.toronto.edu! http://www.cs.toronto.edu/~rsalakhu/ Lecture 9 Sequential Data So far

More information

Probabilistic Machine Learning

Probabilistic Machine Learning Probabilistic Machine Learning Bayesian Nets, MCMC, and more Marek Petrik 4/18/2017 Based on: P. Murphy, K. (2012). Machine Learning: A Probabilistic Perspective. Chapter 10. Conditional Independence Independent

More information

Sampling Methods (11/30/04)

Sampling Methods (11/30/04) CS281A/Stat241A: Statistical Learning Theory Sampling Methods (11/30/04) Lecturer: Michael I. Jordan Scribe: Jaspal S. Sandhu 1 Gibbs Sampling Figure 1: Undirected and directed graphs, respectively, with

More information

Deep Variational Inference. FLARE Reading Group Presentation Wesley Tansey 9/28/2016

Deep Variational Inference. FLARE Reading Group Presentation Wesley Tansey 9/28/2016 Deep Variational Inference FLARE Reading Group Presentation Wesley Tansey 9/28/2016 What is Variational Inference? What is Variational Inference? Want to estimate some distribution, p*(x) p*(x) What is

More information

Graphical Models and Kernel Methods

Graphical Models and Kernel Methods Graphical Models and Kernel Methods Jerry Zhu Department of Computer Sciences University of Wisconsin Madison, USA MLSS June 17, 2014 1 / 123 Outline Graphical Models Probabilistic Inference Directed vs.

More information

Gaussian Models

Gaussian Models Gaussian Models ddebarr@uw.edu 2016-04-28 Agenda Introduction Gaussian Discriminant Analysis Inference Linear Gaussian Systems The Wishart Distribution Inferring Parameters Introduction Gaussian Density

More information

CS145: INTRODUCTION TO DATA MINING

CS145: INTRODUCTION TO DATA MINING CS145: INTRODUCTION TO DATA MINING Text Data: Topic Model Instructor: Yizhou Sun yzsun@cs.ucla.edu December 4, 2017 Methods to be Learnt Vector Data Set Data Sequence Data Text Data Classification Clustering

More information

13: Variational inference II

13: Variational inference II 10-708: Probabilistic Graphical Models, Spring 2015 13: Variational inference II Lecturer: Eric P. Xing Scribes: Ronghuo Zheng, Zhiting Hu, Yuntian Deng 1 Introduction We started to talk about variational

More information

Variational Scoring of Graphical Model Structures

Variational Scoring of Graphical Model Structures Variational Scoring of Graphical Model Structures Matthew J. Beal Work with Zoubin Ghahramani & Carl Rasmussen, Toronto. 15th September 2003 Overview Bayesian model selection Approximations using Variational

More information

Principal Component Analysis (PCA) for Sparse High-Dimensional Data

Principal Component Analysis (PCA) for Sparse High-Dimensional Data AB Principal Component Analysis (PCA) for Sparse High-Dimensional Data Tapani Raiko Helsinki University of Technology, Finland Adaptive Informatics Research Center The Data Explosion We are facing an enormous

More information

AUTOMATIC DETECTION OF WORDS NOT SIGNIFICANT TO TOPIC CLASSIFICATION IN LATENT DIRICHLET ALLOCATION

AUTOMATIC DETECTION OF WORDS NOT SIGNIFICANT TO TOPIC CLASSIFICATION IN LATENT DIRICHLET ALLOCATION AUTOMATIC DETECTION OF WORDS NOT SIGNIFICANT TO TOPIC CLASSIFICATION IN LATENT DIRICHLET ALLOCATION By DEBARSHI ROY A THESIS PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT

More information

Lecture 6: Graphical Models: Learning

Lecture 6: Graphical Models: Learning Lecture 6: Graphical Models: Learning 4F13: Machine Learning Zoubin Ghahramani and Carl Edward Rasmussen Department of Engineering, University of Cambridge February 3rd, 2010 Ghahramani & Rasmussen (CUED)

More information

cross-language retrieval (by concatenate features of different language in X and find co-shared U). TOEFL/GRE synonym in the same way.

cross-language retrieval (by concatenate features of different language in X and find co-shared U). TOEFL/GRE synonym in the same way. 10-708: Probabilistic Graphical Models, Spring 2015 22 : Optimization and GMs (aka. LDA, Sparse Coding, Matrix Factorization, and All That ) Lecturer: Yaoliang Yu Scribes: Yu-Xiang Wang, Su Zhou 1 Introduction

More information

Latent Dirichlet Allocation

Latent Dirichlet Allocation Journal of Machine Learning Research 3 (2003) 993-1022 Submitted 2/02; Published 1/03 Latent Dirichlet Allocation David M. Blei Computer Science Division University of California Bereley, CA 94720, USA

More information

Note on Algorithm Differences Between Nonnegative Matrix Factorization And Probabilistic Latent Semantic Indexing

Note on Algorithm Differences Between Nonnegative Matrix Factorization And Probabilistic Latent Semantic Indexing Note on Algorithm Differences Between Nonnegative Matrix Factorization And Probabilistic Latent Semantic Indexing 1 Zhong-Yuan Zhang, 2 Chris Ding, 3 Jie Tang *1, Corresponding Author School of Statistics,

More information

Applying hlda to Practical Topic Modeling

Applying hlda to Practical Topic Modeling Joseph Heng lengerfulluse@gmail.com CIST Lab of BUPT March 17, 2013 Outline 1 HLDA Discussion 2 the nested CRP GEM Distribution Dirichlet Distribution Posterior Inference Outline 1 HLDA Discussion 2 the

More information

A Probabilistic Clustering-Projection Model for Discrete Data

A Probabilistic Clustering-Projection Model for Discrete Data Proc. PKDD, Porto, Portugal, 2005 A Probabilistic Clustering-Projection Model for Discrete Data Shipeng Yu 12, Kai Yu 2, Volker Tresp 2, and Hans-Peter Kriegel 1 1 Institute for Computer Science, University

More information

Gibbs Sampling. Héctor Corrada Bravo. University of Maryland, College Park, USA CMSC 644:

Gibbs Sampling. Héctor Corrada Bravo. University of Maryland, College Park, USA CMSC 644: Gibbs Sampling Héctor Corrada Bravo University of Maryland, College Park, USA CMSC 644: 2019 03 27 Latent semantic analysis Documents as mixtures of topics (Hoffman 1999) 1 / 60 Latent semantic analysis

More information

An Introduction to Bayesian Machine Learning

An Introduction to Bayesian Machine Learning 1 An Introduction to Bayesian Machine Learning José Miguel Hernández-Lobato Department of Engineering, Cambridge University April 8, 2013 2 What is Machine Learning? The design of computational systems

More information

Mathematical Formulation of Our Example

Mathematical Formulation of Our Example Mathematical Formulation of Our Example We define two binary random variables: open and, where is light on or light off. Our question is: What is? Computer Vision 1 Combining Evidence Suppose our robot

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistical Sciences! rsalakhu@cs.toronto.edu! h0p://www.cs.utoronto.ca/~rsalakhu/ Lecture 7 Approximate

More information

Variational Inference. Sargur Srihari

Variational Inference. Sargur Srihari Variational Inference Sargur srihari@cedar.buffalo.edu 1 Plan of discussion We first describe inference with PGMs and the intractability of exact inference Then give a taxonomy of inference algorithms

More information

Lecture 22 Exploratory Text Analysis & Topic Models

Lecture 22 Exploratory Text Analysis & Topic Models Lecture 22 Exploratory Text Analysis & Topic Models Intro to NLP, CS585, Fall 2014 http://people.cs.umass.edu/~brenocon/inlp2014/ Brendan O Connor [Some slides borrowed from Michael Paul] 1 Text Corpus

More information

Mr. LDA: A Flexible Large Scale Topic Modeling Package using Variational Inference in MapReduce

Mr. LDA: A Flexible Large Scale Topic Modeling Package using Variational Inference in MapReduce Mr. LDA: A Flexible Large Scale Topic Modeling Package using Variational Inference in MapReduce Ke Zhai, Jordan Boyd-Graber, Nima Asadi, and Mohamad Alkhouja Mr. LDA: A Flexible Large Scale Topic Modeling

More information

Deep Poisson Factorization Machines: a factor analysis model for mapping behaviors in journalist ecosystem

Deep Poisson Factorization Machines: a factor analysis model for mapping behaviors in journalist ecosystem 000 001 002 003 004 005 006 007 008 009 010 011 012 013 014 015 016 017 018 019 020 021 022 023 024 025 026 027 028 029 030 031 032 033 034 035 036 037 038 039 040 041 042 043 044 045 046 047 048 049 050

More information

Based on slides by Richard Zemel

Based on slides by Richard Zemel CSC 412/2506 Winter 2018 Probabilistic Learning and Reasoning Lecture 3: Directed Graphical Models and Latent Variables Based on slides by Richard Zemel Learning outcomes What aspects of a model can we

More information

Distributed ML for DOSNs: giving power back to users

Distributed ML for DOSNs: giving power back to users Distributed ML for DOSNs: giving power back to users Amira Soliman KTH isocial Marie Curie Initial Training Networks Part1 Agenda DOSNs and Machine Learning DIVa: Decentralized Identity Validation for

More information

Probabilistic Matrix Factorization

Probabilistic Matrix Factorization Probabilistic Matrix Factorization David M. Blei Columbia University November 25, 2015 1 Dyadic data One important type of modern data is dyadic data. Dyadic data are measurements on pairs. The idea is

More information