Latent variable models for discrete data


Latent variable models for discrete data
Jianfei Chen
Department of Computer Science and Technology, Tsinghua University, Beijing 100084
chris.jianfei.chen@gmail.com
January 13, 2014
Reference: Murphy, Kevin P. Machine Learning: A Probabilistic Perspective. The MIT Press, 2012. Chapter 27.

Introduction
We want to model three types of discrete data:
- Sequence of tokens: $p(y_{i,1:L_i})$
- Bag of words: $p(n_i)$
- Discrete features: $p(y_{i,1:R})$

Outline
- Mixture models
- LSA / PLSI / LDA / GaP / NMF
- LDA
- Evaluation
- Inference
- Variants: CTM, DTM, LDA-HMM, SLDA, MedLDA, etc.
- RBM

Mixture models
$p(y_i) = \sum_k p(y_i \mid q_i = k)\, p(q_i = k)$
- Sequence of tokens: $p(y_{i,1:L_i} \mid q_i = k) = \prod_{l=1}^{L_i} \mathrm{Cat}(y_{il} \mid b_k)$
- Discrete features: $p(y_{i,1:R} \mid q_i = k) = \prod_{r=1}^{R} \mathrm{Cat}(y_{ir} \mid b_k^{(r)})$
- Bag of words (known $L_i$): $p(n_i \mid L_i, q_i = k) = \mathrm{Mu}(n_i \mid L_i, b_k)$
- Bag of words (unknown $L_i$): $p(n_i \mid q_i = k) = \prod_{v=1}^{V} \mathrm{Poi}(n_{iv} \mid \lambda_{vk})$
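As a concrete illustration of the bag-of-words case, here is a minimal numpy sketch (my own, not from the slides) that evaluates the mixture likelihood $p(n_i) = \sum_k \pi_k\,\mathrm{Mu}(n_i \mid L_i, b_k)$ in log space; the variable names and toy data are assumptions.

```python
import numpy as np
from scipy.special import gammaln, logsumexp

def mixture_bow_loglik(n, log_pi, B):
    """n: (V,) word counts; log_pi: (K,) log mixing weights; B: (V, K), columns sum to 1."""
    L = n.sum()
    log_coef = gammaln(L + 1) - gammaln(n + 1).sum()   # log multinomial coefficient
    log_comp = log_coef + n @ np.log(B)                # log Mu(n | L, b_k) for each k
    return logsumexp(log_pi + log_comp)                # log p(n), marginalizing over q_i

rng = np.random.default_rng(0)
V, K = 10, 3
B = rng.dirichlet(np.ones(V), size=K).T                # topic-word distributions, (V, K)
pi = rng.dirichlet(np.ones(K))
n = rng.multinomial(50, B[:, 0])                       # a document drawn from component 0
print(mixture_bow_loglik(n, np.log(pi), B))
```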

Mixture models
Theorem. If $X_i \sim \mathrm{Poi}(\lambda_i)$ independently for all $i$ and $n = \sum_i X_i$, then $p(X_1, \ldots, X_K \mid n) = \mathrm{Mu}(X \mid n, \pi)$, where $\pi_i = \lambda_i / \sum_k \lambda_k$.
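A quick Monte Carlo check of this theorem (my own illustration; the rates and sample size are arbitrary): draw independent Poissons, condition on their sum, and compare the empirical conditional mean with the multinomial mean $n\pi$.

```python
import numpy as np

rng = np.random.default_rng(1)
lam = np.array([2.0, 5.0, 3.0])
pi = lam / lam.sum()
X = rng.poisson(lam, size=(200000, 3))     # independent Poisson draws
n = 10
cond = X[X.sum(axis=1) == n]               # keep samples whose total count is exactly n
print("empirical E[X | n]:", cond.mean(axis=0))
print("multinomial mean  :", n * pi)
```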

Exponential family PCA
Latent semantic analysis (LSA) / latent semantic indexing (LSI):
- Sequence of tokens: $p(y_{i,1:L_i} \mid z_i) = \prod_{l=1}^{L_i} \mathrm{Cat}(y_{il} \mid \mathcal{S}(W z_i))$
- Discrete features: $p(y_{i,1:R} \mid z_i) = \prod_{r=1}^{R} \mathrm{Cat}(y_{ir} \mid \mathcal{S}(W_r z_i))$
- Bag of words (known $L_i$): $p(n_i \mid L_i, z_i) = \mathrm{Mu}(n_i \mid L_i, \mathcal{S}(W z_i))$
- Bag of words (unknown $L_i$): $p(n_i \mid z_i) = \prod_{v=1}^{V} \mathrm{Poi}(n_{iv} \mid \exp(w_{v,:} z_i))$
where $\mathcal{S}(\cdot)$ is the softmax transformation, $z_i \in \mathbb{R}^K$, and $W, W_r \in \mathbb{R}^{V \times K}$.
Inference:
- coordinate ascent / degenerate EM (problem: overfitting?)
- variational EM / MCMC

LSA / PLSI / LDA
- Unigram: $p(y_{i,1:L_i} \mid q_i = k) = \prod_{l=1}^{L_i} \mathrm{Cat}(y_{il} \mid b_k)$
- LSI: $p(y_{i,1:L_i} \mid z_i) = \prod_{l=1}^{L_i} \mathrm{Cat}(y_{il} \mid \mathcal{S}(W z_i))$
- PLSI: $p(y_{i,1:L_i} \mid \pi_i) = \prod_{l=1}^{L_i} \mathrm{Cat}(y_{il} \mid B \pi_i)$
- LDA: $p(y_{i,1:L_i} \mid \pi_i) = \prod_{l=1}^{L_i} \mathrm{Cat}(y_{il} \mid B \pi_i)$, with $\pi_i \sim \mathrm{Dir}(\pi_i \mid \alpha)$
LDA for other data types:
- Bag of words: $p(n_i \mid L_i, \pi_i) = \mathrm{Mu}(n_i \mid L_i, B \pi_i)$
- Discrete features: $p(y_{i,1:R} \mid \pi_i) = \prod_{r=1}^{R} \mathrm{Cat}(y_{ir} \mid B^{(r)} \pi_i)$
Question: what is the dual parameter, and why is it convenient?
Marlin, Benjamin M. Modeling user rating profiles for collaborative filtering. Advances in Neural Information Processing Systems, 2003.

Gamma-Poisson model
LDA:
- Model: $p(n_i \mid L_i, \pi_i) = \mathrm{Mu}(n_i \mid L_i, B \pi_i)$
- Prior: $\pi_i \sim \mathrm{Dir}(\alpha)$
- Constraints: $0 \le \pi_{ik}$, $\sum_k \pi_{ik} = 1$; $0 \le B_{vk}$, $\sum_v B_{vk} = 1$
GaP:
- Model: $p(n_i \mid z_i^+) = \prod_{v=1}^{V} \mathrm{Poi}(n_{iv} \mid b_{v,:} z_i^+)$
- Prior: $p(z_i^+) = \prod_k \mathrm{Ga}(z_{ik}^+ \mid \alpha_k, \beta_k)$
- Constraints: $0 \le z_{ik}^+$, $0 \le B_{vk}$ (only non-negativity)
- Can use a sparsity-inducing prior (Eq. 27.17)
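A tiny forward-sampling sketch of the GaP generative process (illustrative only; the dimensions, $\alpha_k$, $\beta_k$, and $B$ below are my own choices): gamma-distributed topic activations, Poisson word counts, and only non-negativity constraints.

```python
import numpy as np

rng = np.random.default_rng(0)
V, K = 8, 3
alpha = np.array([1.0, 2.0, 0.5])            # Gamma shape parameters
beta = np.array([1.0, 1.0, 1.0])             # Gamma rate parameters
B = rng.random((V, K))                       # non-negative, not normalized

z = rng.gamma(shape=alpha, scale=1.0 / beta) # z_i+ ~ Ga(alpha_k, beta_k)
rates = B @ z                                # Poisson rates b_{v,:} z_i+
n = rng.poisson(rates)                       # word counts n_i
print("z:", z.round(2), "counts:", n)
```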

Non-negative matrix factorization
Given a non-negative matrix $V$, find non-negative factors $W, H$ such that $V \approx WH$, i.e. $V_{i\mu} \approx \sum_k W_{ik} H_{k\mu}$.
Can be viewed as GaP with prior $\alpha_k = \beta_k = 0$.
Lee, D. D., and Seung, H. S. Algorithms for non-negative matrix factorization. Advances in Neural Information Processing Systems.
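For reference, here is a compact sketch of NMF with the squared-error multiplicative updates of Lee & Seung (a minimal illustration; the toy matrix, rank, and iteration count are assumptions).

```python
import numpy as np

def nmf(V, K, iters=200, eps=1e-9, seed=0):
    """Factor a non-negative V (M x N) as W (M x K) @ H (K x N)."""
    rng = np.random.default_rng(seed)
    M, N = V.shape
    W = rng.random((M, K)) + eps
    H = rng.random((K, N)) + eps
    for _ in range(iters):
        H *= (W.T @ V) / (W.T @ W @ H + eps)   # multiplicative update for H
        W *= (V @ H.T) / (W @ H @ H.T + eps)   # multiplicative update for W
    return W, H

V = np.abs(np.random.default_rng(1).normal(size=(20, 30)))
W, H = nmf(V, K=5)
print("reconstruction error:", np.linalg.norm(V - W @ H))
```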

Latent Dirichlet Allocation (LDA)
Notation:
- $\pi_i \mid \alpha \sim \mathrm{Dir}(\alpha)$ (1)
- $q_{il} \mid \pi_i \sim \mathrm{Cat}(\pi_i)$ (2)
- $b_k \mid \gamma \sim \mathrm{Dir}(\gamma)$ (3)
- $y_{il} \mid q_{il} = k, B \sim \mathrm{Cat}(b_k)$ (4)
Geometric interpretation:
- Simplex: handles ambiguity (?)
- Unidentifiable: Labeled LDA
D. Blei et al. Latent Dirichlet allocation. JMLR 2003.
G. Heinrich. Parameter estimation for text analysis.
D. Ramage et al. Labeled LDA: A supervised topic model for credit attribution in multi-labeled corpora. EMNLP 2009.
http://www.cs.princeton.edu/courses/archive/fall11/cos597c/lectures/variational-inference-i.pdf
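A minimal sketch of the generative process in Eqs. (1)-(4), sampling a toy corpus; the hyperparameters and corpus sizes are my own illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
K, V, D, doc_len = 3, 20, 5, 30
alpha, gamma = 0.5, 0.1

B = rng.dirichlet(np.full(V, gamma), size=K)          # b_k | gamma ~ Dir(gamma), rows are topics
docs = []
for i in range(D):
    pi = rng.dirichlet(np.full(K, alpha))             # pi_i | alpha ~ Dir(alpha)
    q = rng.choice(K, size=doc_len, p=pi)             # q_il | pi_i ~ Cat(pi_i)
    y = np.array([rng.choice(V, p=B[k]) for k in q])  # y_il | q_il = k ~ Cat(b_k)
    docs.append(y)
print(docs[0])
```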

Evaluation: Perplexity
The perplexity of a language model $q$ given a language $p$ (both stochastic processes) is defined as
$\mathrm{perplexity}(p, q) = 2^{H(p, q)}$,
where $H(p, q)$ is the cross-entropy
$H(p, q) = \lim_{N \to \infty} -\frac{1}{N} \sum_{y_{1:N}} p(y_{1:N}) \log q(y_{1:N})$.
Approximations:
- $N$ is finite
- $p(y_{1:N}) = \delta_{y^*_{1:N}}(y_{1:N})$ (plug in the observed test data)

Evaluation: Perplexity
Under these approximations, $H(p, q) = -\frac{1}{N} \log q(y_{1:N})$.
Intuition: the weighted average branching factor.
- For a unigram model: $H = -\frac{1}{N} \sum_{i=1}^{N} \frac{1}{L_i} \sum_{l=1}^{L_i} \log q(y_{il})$
- For LDA: $H = -\frac{1}{N} \sum_{i=1}^{N} \log p(y_{i,1:L_i})$, which is intractable; options:
  - use the variational evidence lower bound (ELBO)
  - use annealed importance sampling
  - use a validation set and a plug-in approximation
H. Wallach, et al. Evaluation methods for topic models. ICML 2009.
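A small helper (my own illustration, not from the slides) that turns per-document held-out log-likelihoods, e.g. ELBO or annealed-importance-sampling estimates of $\log p(y_{i,1:L_i})$, into per-word perplexity.

```python
import numpy as np

def per_word_perplexity(doc_loglik, doc_len):
    """doc_loglik: estimated log p(y_i) in nats per held-out document; doc_len: token counts."""
    cross_entropy = -np.sum(doc_loglik) / np.sum(doc_len)   # nats per token
    return np.exp(cross_entropy)                            # equals 2**(bits per token)

print(per_word_perplexity(np.array([-350.0, -410.0]), np.array([100, 120])))
```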

Evaluation: Coherence
TODO
D. Newman et al. Automatic evaluation of topic coherence. NAACL HLT 2010.

Inference
There is an exponential number of inference algorithms:
- Variational inference vs. sampling vs. both
- Collapsed vs. non-collapsed (as one concrete instance, see the collapsed Gibbs sketch below)
- Online vs. stochastic vs. offline
- Empirical Bayes vs. fully Bayesian
- Other algorithms: expectation propagation, etc.
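A bare-bones collapsed Gibbs sampler for LDA, as one concrete point in this taxonomy (collapsed + sampling + offline); the hyperparameters and the two-document toy corpus are my own assumptions.

```python
import numpy as np

def collapsed_gibbs_lda(docs, K, V, alpha=0.5, gamma=0.1, iters=200, seed=0):
    rng = np.random.default_rng(seed)
    ndk = np.zeros((len(docs), K))          # doc-topic counts
    nkv = np.zeros((K, V))                  # topic-word counts
    nk = np.zeros(K)                        # topic totals
    z = [rng.integers(K, size=len(d)) for d in docs]
    for i, d in enumerate(docs):            # initialize counts from the random assignment
        for l, v in enumerate(d):
            k = z[i][l]
            ndk[i, k] += 1; nkv[k, v] += 1; nk[k] += 1
    for _ in range(iters):
        for i, d in enumerate(docs):
            for l, v in enumerate(d):
                k = z[i][l]                 # remove this token's current assignment
                ndk[i, k] -= 1; nkv[k, v] -= 1; nk[k] -= 1
                # full conditional p(z_il = k | rest) up to normalization
                p = (ndk[i] + alpha) * (nkv[:, v] + gamma) / (nk + V * gamma)
                k = rng.choice(K, p=p / p.sum())
                z[i][l] = k                 # add it back under the new topic
                ndk[i, k] += 1; nkv[k, v] += 1; nk[k] += 1
    return z, ndk, nkv

docs = [np.array([0, 1, 2, 1]), np.array([3, 4, 3, 0])]
z, ndk, nkv = collapsed_gibbs_lda(docs, K=2, V=5)
print(ndk)
```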

Inference: towards large-scale algorithms
- Online / stochastic
- Sparsity
- Spectral methods
- Systems:
  - Distributed: Yahoo-LDA, Petuum, Parameter Server, etc.
  - GPU: BIDMach, etc.

Model selection
- Compute the evidence with AIS / ELBO
- Cross-validation
- Bayesian nonparametrics
Teh et al. Hierarchical Dirichlet processes. Journal of the American Statistical Association (2006).

Extensions of LDA
- Correlation: correlated topic model (CTM)
- Time series: dynamic topic model (DTM)
- Syntax: LDA-HMM
- Supervision (many variants):
  - 1D categorical label: SLDA (generative), DLDA (discriminative), MedLDA (regularized)
  - nD label: MR-LDA, random-effects mixture of experts, conditional topic random field, Dirichlet-multinomial regression LDA
  - K labels per document: labeled LDA
  - labels per word: TagLDA
- Structural: RTM

Restricted Boltzmann machines
$p(h, v \mid \theta) = \frac{1}{Z(\theta)} \prod_{r=1}^{R} \prod_{k=1}^{K} \psi_{rk}(v_r, h_k)$, where $h, v$ are binary vectors.
- Factorized posterior: $p(h \mid v, \theta) = \prod_k p(h_k \mid v, \theta)$
- Advantage: symmetric, so both posterior inference (backward) and generation (forward) are easy.
- Exponential family harmonium (a harmonium is a two-layer UGM)

Restricted Boltzmann machines
Binary latent and binary visible units (other models exist, see Table 27.2):
$p(v, h \mid \theta) = \frac{1}{Z(\theta)} \exp(-E(v, h; \theta))$ (5)
$E(v, h; \theta) = -v^{\top} W h$ (6)
$p(h \mid v, \theta) = \prod_k \mathrm{Ber}(h_k \mid \mathrm{sigm}(w_{:,k}^{\top} v))$ (7)
$p(v \mid h, \theta) = \prod_r \mathrm{Ber}(v_r \mid \mathrm{sigm}(w_{r,:} h))$ (8)

Restricted Boltzmann machines
Goal: maximize $p(v \mid \theta)$.
$\frac{\partial \ell}{\partial w_{rk}} = \mathbb{E}_{p_{\mathrm{emp}}(\cdot \mid \theta)}[v_r h_k] - \mathbb{E}_{p(\cdot \mid \theta)}[v_r h_k]$
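A minimal CD-1 training sketch for the binary-binary RBM of Eqs. (5)-(8), approximating the intractable model expectation in the gradient above with a single Gibbs step; the toy data and hyperparameters are my own choices.

```python
import numpy as np

rng = np.random.default_rng(0)
sigm = lambda x: 1.0 / (1.0 + np.exp(-x))

R, K = 6, 3                                            # visible and hidden dimensions
W = 0.01 * rng.normal(size=(R, K))
V = rng.integers(0, 2, size=(100, R)).astype(float)    # toy binary "documents"

lr = 0.05
for epoch in range(200):
    ph = sigm(V @ W)                                   # p(h_k = 1 | v), Eq. (7)
    h = (rng.random(ph.shape) < ph).astype(float)
    pv = sigm(h @ W.T)                                 # p(v_r = 1 | h), Eq. (8)
    v_neg = (rng.random(pv.shape) < pv).astype(float)  # one-step reconstruction
    ph_neg = sigm(v_neg @ W)
    # CD-1 gradient: data expectation minus one-step reconstruction expectation
    grad = (V.T @ ph - v_neg.T @ ph_neg) / len(V)
    W += lr * grad
print("trained W:\n", W.round(2))
```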

Conclusions
Why there are many things to do:
- An exponential number of inference algorithms
- An exponential number of models
- Hence an exponential times exponential number of solutions
- Applications, evaluation, theory (e.g., spectral), etc.
We need a way for information retrievers and data miners to find correct & fast solutions for their problems...