Latent variable models for discrete data
Jianfei Chen
Department of Computer Science and Technology, Tsinghua University, Beijing 100084
chris.jianfei.chen@gmail.com
January 13, 2014

Based on: Murphy, Kevin P. Machine Learning: A Probabilistic Perspective. The MIT Press, 2012. Chapter 27.
Introduction
We want to model three types of discrete data:
- Sequence of tokens: $p(y_{i,1:L_i})$
- Bag of words: $p(\mathbf{n}_i)$
- Discrete features: $p(y_{i,1:R})$
Outline
- Mixture models
- LSA / PLSI / LDA / GaP / NMF
- LDA
  - Evaluation
  - Inference
  - Variants: CTM, DTM, LDA-HMM, SLDA, MedLDA, etc.
- RBM
Mixture models
$p(\mathbf{y}_i) = \sum_k p(\mathbf{y}_i \mid q_i = k)\, p(q_i = k)$
- Sequence of tokens: $p(y_{i,1:L_i} \mid q_i = k) = \prod_{l=1}^{L_i} \mathrm{Cat}(y_{il} \mid \mathbf{b}_k)$
- Discrete features: $p(y_{i,1:R} \mid q_i = k) = \prod_{r=1}^{R} \mathrm{Cat}(y_{ir} \mid \mathbf{b}_k^{(r)})$
- Bag of words (known $L_i$): $p(\mathbf{n}_i \mid L_i, q_i = k) = \mathrm{Mu}(\mathbf{n}_i \mid L_i, \mathbf{b}_k)$
- Bag of words (unknown $L_i$): $p(\mathbf{n}_i \mid q_i = k) = \prod_{v=1}^{V} \mathrm{Poi}(n_{iv} \mid \lambda_{vk})$
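As a concrete illustration (a minimal sketch, not from the chapter), the bag-of-words mixture likelihood $p(\mathbf{n}_i) = \sum_k p(q_i = k)\,\mathrm{Mu}(\mathbf{n}_i \mid L_i, \mathbf{b}_k)$ can be evaluated stably in log space; the helper name and all shapes below are illustrative assumptions.

```python
# Sketch: log p(n_i) under a multinomial mixture, computed in log space.
import numpy as np
from scipy.special import gammaln, logsumexp

def log_mixture_likelihood(n_i, log_mix, log_B):
    """n_i: (V,) count vector; log_mix: (K,) log p(q=k); log_B: (K, V) rows log b_k."""
    L_i = n_i.sum()
    log_coef = gammaln(L_i + 1) - gammaln(n_i + 1).sum()  # log multinomial coefficient
    log_comp = log_coef + log_B @ n_i                     # (K,) log Mu(n_i | L_i, b_k)
    return logsumexp(log_mix + log_comp)                  # log sum_k p(q=k) Mu(...)

K, V = 3, 5
rng = np.random.default_rng(1)
B = rng.dirichlet(np.ones(V), size=K)                     # each row b_k is a distribution
mix = np.ones(K) / K
print(log_mixture_likelihood(np.array([2, 0, 1, 0, 3]), np.log(mix), np.log(B)))
```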
Mixture models
Theorem. If $X_i \sim \mathrm{Poi}(\lambda_i)$ independently for all $i$, and $n = \sum_i X_i$, then
$p(X_1, \dots, X_K \mid n) = \mathrm{Mu}(\mathbf{X} \mid n, \boldsymbol{\pi})$, where $\pi_i = \lambda_i / \sum_k \lambda_k$.
This connects the Poisson model for unknown $L_i$ to the multinomial model for known $L_i$.
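A quick simulation check of the theorem (an added sketch, not on the slide): sample independent Poissons, condition on the total, and compare the conditional mean with the multinomial mean $n\boldsymbol{\pi}$.

```python
# Empirical check: (X_1,...,X_K) | sum = n should be Mu(n, pi).
import numpy as np

rng = np.random.default_rng(0)
lam = np.array([1.0, 2.0, 3.0])
pi = lam / lam.sum()

X = rng.poisson(lam, size=(200_000, 3))   # X_k ~ Poi(lambda_k), independent
n = 6
cond = X[X.sum(axis=1) == n]              # condition on the total count
print(cond.mean(axis=0))                  # empirical E[X | n]
print(n * pi)                             # multinomial mean n * pi
```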
Exponential family PCA
Latent semantic analysis (LSA) / latent semantic indexing (LSI):
- Sequence of tokens: $p(y_{i,1:L_i} \mid \mathbf{z}_i) = \prod_{l=1}^{L_i} \mathrm{Cat}(y_{il} \mid \mathcal{S}(\mathbf{W}\mathbf{z}_i))$
- Discrete features: $p(y_{i,1:R} \mid \mathbf{z}_i) = \prod_{r=1}^{R} \mathrm{Cat}(y_{ir} \mid \mathcal{S}(\mathbf{W}_r \mathbf{z}_i))$
- Bag of words (known $L_i$): $p(\mathbf{n}_i \mid L_i, \mathbf{z}_i) = \mathrm{Mu}(\mathbf{n}_i \mid L_i, \mathcal{S}(\mathbf{W}\mathbf{z}_i))$
- Bag of words (unknown $L_i$): $p(\mathbf{n}_i \mid \mathbf{z}_i) = \prod_{v=1}^{V} \mathrm{Poi}(n_{iv} \mid \exp(\mathbf{w}_{v,:} \mathbf{z}_i))$
where $\mathcal{S}(\cdot)$ is the softmax transformation, $\mathbf{z}_i \in \mathbb{R}^K$, and $\mathbf{W}, \mathbf{W}_r \in \mathbb{R}^{V \times K}$.
Inference:
- coordinate ascent / degenerate EM (problem: overfitting?)
- variational EM / MCMC
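The softmax link is the key ingredient here; a short sketch (shapes assumed as above) of mapping a low-dimensional code $\mathbf{z}_i$ to token probabilities $\mathcal{S}(\mathbf{W}\mathbf{z}_i)$:

```python
# Sketch: softmax link from code z_i to a distribution over the vocabulary.
import numpy as np

def softmax(a):
    a = a - a.max()          # subtract the max for numerical stability
    e = np.exp(a)
    return e / e.sum()

rng = np.random.default_rng(0)
V, K = 8, 3
W = rng.standard_normal((V, K))
z = rng.standard_normal(K)
p = softmax(W @ z)           # Cat parameters shared by every token in doc i
print(p, p.sum())            # a valid distribution over the vocabulary
```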
LSA / PLSI / LDA
- Unigram: $p(y_{i,1:L_i} \mid q_i = k) = \prod_{l=1}^{L_i} \mathrm{Cat}(y_{il} \mid \mathbf{b}_k)$
- LSI: $p(y_{i,1:L_i} \mid \mathbf{z}_i) = \prod_{l=1}^{L_i} \mathrm{Cat}(y_{il} \mid \mathcal{S}(\mathbf{W}\mathbf{z}_i))$
- PLSI: $p(y_{i,1:L_i} \mid \boldsymbol{\pi}_i) = \prod_{l=1}^{L_i} \mathrm{Cat}(y_{il} \mid \mathbf{B}\boldsymbol{\pi}_i)$
- LDA: $p(y_{i,1:L_i} \mid \boldsymbol{\pi}_i) = \prod_{l=1}^{L_i} \mathrm{Cat}(y_{il} \mid \mathbf{B}\boldsymbol{\pi}_i)$, with prior $\boldsymbol{\pi}_i \sim \mathrm{Dir}(\boldsymbol{\alpha})$
LDA for other data types:
- Bag of words: $p(\mathbf{n}_i \mid L_i, \boldsymbol{\pi}_i) = \mathrm{Mu}(\mathbf{n}_i \mid L_i, \mathbf{B}\boldsymbol{\pi}_i)$
- Discrete features: $p(y_{i,1:R} \mid \boldsymbol{\pi}_i) = \prod_{r=1}^{R} \mathrm{Cat}(y_{ir} \mid \mathbf{B}^{(r)}\boldsymbol{\pi}_i)$
Question: what is the dual parameter? Why is it convenient?
Marlin, Benjamin M. Modeling user rating profiles for collaborative filtering. Advances in Neural Information Processing Systems, 2003.
Gamma-Poisson model
LDA vs. GaP:
- Model. LDA: $p(\mathbf{n}_i \mid L_i, \boldsymbol{\pi}_i) = \mathrm{Mu}(\mathbf{n}_i \mid L_i, \mathbf{B}\boldsymbol{\pi}_i)$. GaP: $p(\mathbf{n}_i \mid \mathbf{z}_i^+) = \prod_{v=1}^{V} \mathrm{Poi}(n_{iv} \mid \mathbf{b}_{v,:}\mathbf{z}_i^+)$.
- Prior. LDA: $\boldsymbol{\pi}_i \sim \mathrm{Dir}(\boldsymbol{\alpha})$. GaP: $p(\mathbf{z}_i^+) = \prod_k \mathrm{Ga}(z_{ik}^+ \mid \alpha_k, \beta_k)$.
- Constraints. LDA: $0 \le \pi_{ik}$, $\sum_k \pi_{ik} = 1$, $0 \le B_{vk}$, $\sum_v B_{vk} = 1$. GaP: $0 \le z_{ik}^+$, $0 \le B_{vk}$.
- Can use a sparsity-inducing prior (27.17)
- GaP has only non-negativity constraints
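To make the contrast concrete, here is a sketch of ancestral sampling from GaP under assumed hyperparameters (note $\mathbf{B}$ is non-negative but unnormalized, unlike in LDA):

```python
# Sketch: ancestral sampling from GaP.
# z_ik^+ ~ Ga(alpha_k, beta_k), then n_iv ~ Poi(b_{v,:} z_i^+).
import numpy as np

rng = np.random.default_rng(2)
V, K = 10, 4
alpha, beta = np.full(K, 1.1), np.full(K, 1.0)
B = rng.random((V, K))                        # non-negative, columns need not sum to 1

z = rng.gamma(shape=alpha, scale=1.0 / beta)  # (K,) gamma draws (rate parameterization)
n = rng.poisson(B @ z)                        # (V,) word counts
print(n, n.sum())
```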
Non-negative matrix factorization
Given a non-negative matrix $V$, find non-negative matrix factors $W$, $H$ such that
$V \approx WH$, i.e. $V_{ij} \approx \sum_k W_{ik} H_{kj}$.
Can be viewed as GaP with prior $\alpha_k = \beta_k = 0$.
Lee, D. D., and H. S. Seung. Algorithms for non-negative matrix factorization. Advances in Neural Information Processing Systems, 2001.
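A compact sketch of the squared-error multiplicative updates from the cited Lee & Seung paper; the random initialization and the `eps` division guard are implementation choices:

```python
# Sketch: NMF via multiplicative updates for || V - WH ||_F^2.
import numpy as np

def nmf(V, K, iters=200, eps=1e-9, seed=0):
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, K))
    H = rng.random((K, m))
    for _ in range(iters):
        H *= (W.T @ V) / (W.T @ W @ H + eps)  # multiplicative update for H
        W *= (V @ H.T) / (W @ H @ H.T + eps)  # multiplicative update for W
    return W, H

V = np.random.default_rng(1).random((20, 30))  # non-negative data matrix
W, H = nmf(V, K=5)
print(np.linalg.norm(V - W @ H))               # reconstruction error
```

The updates never change the sign of an entry, so non-negativity is preserved automatically.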
Latent Dirichlet allocation (LDA)
Notation:
$\boldsymbol{\pi}_i \mid \boldsymbol{\alpha} \sim \mathrm{Dir}(\boldsymbol{\alpha})$  (1)
$q_{il} \mid \boldsymbol{\pi}_i \sim \mathrm{Cat}(\boldsymbol{\pi}_i)$  (2)
$\mathbf{b}_k \mid \gamma \sim \mathrm{Dir}(\gamma)$  (3)
$y_{il} \mid q_{il} = k, \mathbf{B} \sim \mathrm{Cat}(\mathbf{b}_k)$  (4)
Geometric interpretation:
- Simplex: handles ambiguity (?)
- Unidentifiable: Labeled LDA
D. Blei et al. Latent Dirichlet allocation. JMLR, 2003.
G. Heinrich. Parameter estimation for text analysis.
D. Ramage et al. Labeled LDA: A supervised topic model for credit attribution in multi-labeled corpora. EMNLP 2009.
http://www.cs.princeton.edu/courses/archive/fall11/cos597c/lectures/variational-inference-i.pdf
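A sketch of the generative process (1)-(4) as ancestral sampling; all sizes and hyperparameter values below are illustrative:

```python
# Sketch: sampling a toy corpus from the LDA generative process.
import numpy as np

rng = np.random.default_rng(0)
V, K, D, L = 50, 5, 3, 20                            # vocab, topics, docs, doc length
alpha, gamma = 0.1, 0.01

B = rng.dirichlet(np.full(V, gamma), size=K)         # b_k | gamma ~ Dir(gamma)     (3)
docs = []
for _ in range(D):
    pi = rng.dirichlet(np.full(K, alpha))            # pi_i | alpha ~ Dir(alpha)    (1)
    q = rng.choice(K, size=L, p=pi)                  # q_il | pi_i ~ Cat(pi_i)      (2)
    y = np.array([rng.choice(V, p=B[k]) for k in q]) # y_il | q_il = k ~ Cat(b_k)   (4)
    docs.append(y)
print(docs[0])                                       # token ids of the first document
```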
Evaluation: perplexity
The perplexity of language model $q$ given language $p$ (both $p$ and $q$ are stochastic processes) is defined as
$\mathrm{perplexity}(p, q) = 2^{H(p,q)}$
where $H(p, q)$ is the cross-entropy
$H(p, q) = \lim_{N \to \infty} -\frac{1}{N} \sum_{y_{1:N}} p(y_{1:N}) \log q(y_{1:N})$
Approximations:
- $N$ is finite
- $p(y_{1:N}) = \delta_{y_{1:N}^*}(y_{1:N})$, i.e. plug in the observed test sequence $y_{1:N}^*$
Evaluation: perplexity
With these approximations,
$H(p, q) = -\frac{1}{N} \log q(y_{1:N})$
Intuition: weighted average branching factor.
For a unigram model:
$H = -\frac{1}{N} \sum_{i=1}^{N} \frac{1}{L_i} \sum_{l=1}^{L_i} \log q(y_{il})$
For LDA:
$H = -\frac{1}{N} \sum_{i=1}^{N} \log p(y_{i,1:L_i})$
where $p(y_{i,1:L_i})$ is intractable; options:
- Use the variational evidence lower bound (ELBO)
- Use annealed importance sampling
- Use a validation set and a plug-in approximation
H. Wallach et al. Evaluation methods for topic models. ICML 2009.
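A minimal sketch of plug-in perplexity from per-token log-likelihoods (however the model obtains them); the uniform-unigram example gives the expected sanity check:

```python
# Sketch: perplexity = 2^H, with H the average negative log2-probability per token.
import numpy as np

def perplexity(log_probs):
    """log_probs: natural-log token probabilities log q(y_il) on held-out data."""
    H = -np.mean(log_probs) / np.log(2.0)  # cross-entropy in bits per token
    return 2.0 ** H

# A uniform unigram model over a 1000-word vocabulary has perplexity 1000.
print(perplexity(np.full(10_000, np.log(1.0 / 1000))))
```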
Evaluation: coherence
TODO
D. Newman et al. Automatic evaluation of topic coherence. NAACL HLT 2010.
Inference
There is an exponential number of inference algorithms:
- Variational inference vs. sampling vs. both
- Collapsed vs. non-collapsed
- Online vs. stochastic vs. offline
- Empirical Bayes vs. fully Bayes
- Other algorithms: expectation propagation, etc.
Inference: towards large-scale algorithms
- Online / stochastic
- Sparsity
- Spectral methods
- Systems
  - Distributed: Yahoo-LDA, Petuum, Parameter Server, etc.
  - GPU: BIDMach, etc.
Model selection
- Compute the evidence with AIS / ELBO
- Cross validation
- Bayesian nonparametrics
Teh et al. Hierarchical Dirichlet processes. Journal of the American Statistical Association, 2006.
Extensions of LDA
- Correlation: correlated topic model (CTM)
- Time series: dynamic topic model (DTM)
- Syntax: LDA-HMM
- Supervision: many variants
  - 1D categorical label: SLDA (generative), DLDA (discriminative), MedLDA (regularized)
  - n-dimensional label: MR-LDA, random-effects mixture of experts, conditional topic random field, Dirichlet-multinomial regression LDA
  - K labels per document: Labeled LDA
  - Labels per word: TagLDA
- Structural: RTM
Restricted Boltzmann machines
$p(\mathbf{h}, \mathbf{v} \mid \boldsymbol{\theta}) = \frac{1}{Z(\boldsymbol{\theta})} \prod_{r=1}^{R} \prod_{k=1}^{K} \psi_{rk}(v_r, h_k)$
where $\mathbf{h}$, $\mathbf{v}$ are binary vectors.
- Factorized posterior: $p(\mathbf{h} \mid \mathbf{v}, \boldsymbol{\theta}) = \prod_k p(h_k \mid \mathbf{v}, \boldsymbol{\theta})$
- Advantage: symmetric; both posterior inference (backward) and generation (forward) are easy.
- Exponential family harmonium (a harmonium is a 2-layer UGM)
Restricted Boltzmann machines
Binary latent and binary visible variables (other choices exist; see Table 27.2):
$p(\mathbf{v}, \mathbf{h} \mid \boldsymbol{\theta}) = \frac{1}{Z(\boldsymbol{\theta})} \exp(-E(\mathbf{v}, \mathbf{h}; \boldsymbol{\theta}))$  (5)
$E(\mathbf{v}, \mathbf{h}; \boldsymbol{\theta}) = -\mathbf{v}^\top \mathbf{W} \mathbf{h}$  (6)
$p(\mathbf{h} \mid \mathbf{v}, \boldsymbol{\theta}) = \prod_k \mathrm{Ber}(h_k \mid \mathrm{sigm}(\mathbf{w}_{:,k}^\top \mathbf{v}))$  (7)
$p(\mathbf{v} \mid \mathbf{h}, \boldsymbol{\theta}) = \prod_r \mathrm{Ber}(v_r \mid \mathrm{sigm}(\mathbf{w}_{r,:} \mathbf{h}))$  (8)
Restricted Boltzmann machines
Goal: maximize $p(\mathbf{v} \mid \boldsymbol{\theta})$.
$\frac{\partial \ell}{\partial \mathbf{W}} = \mathbb{E}_{p_{\mathrm{emp}}(\cdot \mid \boldsymbol{\theta})}[\mathbf{v}\mathbf{h}^\top] - \mathbb{E}_{p(\cdot \mid \boldsymbol{\theta})}[\mathbf{v}\mathbf{h}^\top]$
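The second expectation is intractable; a standard approximation (not named on the slide) is contrastive divergence, which replaces it with a sample obtained after one Gibbs step using the conditionals (7)-(8). A sketch of a CD-1 update, with learning rate and sizes as assumptions:

```python
# Sketch: one CD-1 update for an RBM without bias terms.
import numpy as np

rng = np.random.default_rng(0)
sigm = lambda x: 1.0 / (1.0 + np.exp(-x))

def cd1_update(W, v0, lr=0.1):
    h0 = (rng.random(W.shape[1]) < sigm(W.T @ v0)).astype(float)  # h ~ p(h | v0), eq. (7)
    v1 = (rng.random(W.shape[0]) < sigm(W @ h0)).astype(float)    # v ~ p(v | h0), eq. (8)
    # Positive phase uses p(h | v0); negative phase uses the one-step reconstruction.
    return W + lr * (np.outer(v0, sigm(W.T @ v0)) - np.outer(v1, sigm(W.T @ v1)))

R, K = 6, 3
W = 0.01 * rng.standard_normal((R, K))
v = (rng.random(R) < 0.5).astype(float)                           # a toy visible vector
W = cd1_update(W, v)
print(W.shape)
```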
Conclusions
Why are there so many things to do?
- Exponential number of inference algorithms
- Exponential number of models
- Exponentially exponential number of solutions
- Applications, evaluation, theory (e.g. spectral), etc.
We need a way for information retrievers and data miners to find correct & fast solutions for their problems...