Present and Future of Text Modeling

[Slide 1/25]
Present and Future of Text Modeling
Daichi Mochihashi
NTT Communication Science Laboratories
daichi@cslab.kecl.ntt.co.jp
T-FaNT2, University of Tokyo, 2008-2-13 (Wed)

[Slide 2/25] What's Text Modeling?
- (Usually) bag-of-words style collections of words
- Idiosyncrasy of context
  - Context = document, sentence, utterances so far, ...
- Word occurrences are not homogeneous.

[Slide 3/25]
- Example documents as bags of words (sketch below):
  - d1: sea:2, habitat:1, ...
  - d2: economy:5, relation:1, international:2, ...
- Occurrences of words (features)
- Words are exchangeable (order doesn't matter)
- Counts are explicitly discrete (cf. log-linear models)
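A minimal Python sketch of this representation (the word lists are made up to mirror d1 and d2 above):

```python
from collections import Counter

# Toy documents as word lists; counts mirror the d1/d2 example above.
d1 = ["sea", "sea", "habitat"]
d2 = ["economy"] * 5 + ["relation"] + ["international"] * 2

# Bag-of-words: word order is discarded, only discrete counts remain.
bow1, bow2 = Counter(d1), Counter(d2)
print(bow1)  # Counter({'sea': 2, 'habitat': 1})
print(bow2)  # Counter({'economy': 5, 'international': 2, 'relation': 1})
```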

[Slide 4/25]
[Figure: the word simplex with vertices w_1 = (1,0,0), w_2 = (0,1,0), w_V = (0,0,1) and an example point p = (0.4, 0.5, 0.1)]
- Each point p of the (V-1)-dimensional simplex is a multinomial parameter
    p = (p(w_1), p(w_2), ..., p(w_V))                                    (1)
- Words are generated i.i.d. from p:
    w_i ~ p    (i = 1..N)                                                (2)

[Slide 5/25]
[Figure: a Dirichlet density plotted over the word simplex (w_1, w_2, ..., w_V)]
- It is convenient (simplest) to put a Dirichlet prior on the distribution of the p's:
    p(p | \alpha) = \mathrm{Dir}(p | \alpha)                             (3)
                  \propto \prod_{i=1}^V p_i^{\alpha_i - 1}               (4)
  (Normalization constant: \Gamma(\sum_{i=1}^V \alpha_i) / \prod_{i=1}^V \Gamma(\alpha_i))
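A small numpy sketch of (1)-(4): draw a multinomial p from a Dirichlet prior, then draw words i.i.d. from it. The vocabulary and the hyperparameters α are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["sea", "habitat", "economy", "relation", "international"]
alpha = np.full(len(vocab), 0.5)          # illustrative Dirichlet hyperparameters

p = rng.dirichlet(alpha)                  # p ~ Dir(p | alpha): one point on the word simplex
words = rng.choice(vocab, size=10, p=p)   # w_i ~ p, i.i.d. (eq. 2)
print(p.round(3), list(words))
```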

[Slide 6/25]
Generate w = w_1 w_2 ... w_N in two steps:
1. Draw p ~ Dir(p | \alpha).
2. For i = 1..N, draw w_i ~ p.
Integrate out p to get the DCM (Dirichlet Compound Multinomial):
    p(w) = \int p(w | p) \, p(p | \alpha) \, dp                          (5)
         = \frac{\Gamma(\alpha)}{\Gamma(\alpha + N)} \prod_{v \in w} \frac{\Gamma(\alpha_v + n_v)}{\Gamma(\alpha_v)},
           where \alpha = \sum_{i=1}^V \alpha_i                          (6)
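A hedged sketch of the integrated-out DCM probability (6), computed in log space with scipy's gammaln; the counts and hyperparameters are illustrative.

```python
import numpy as np
from scipy.special import gammaln

def dcm_log_prob(counts, alpha):
    """log p(w) of the Dirichlet Compound Multinomial, eq. (6)
    (for a word sequence, i.e. without a multinomial coefficient)."""
    counts, alpha = np.asarray(counts, float), np.asarray(alpha, float)
    a, n = alpha.sum(), counts.sum()      # a = sum_v alpha_v, n = number of words
    return (gammaln(a) - gammaln(a + n)
            + np.sum(gammaln(alpha + counts) - gammaln(alpha)))

# Example: a 3-word document over a 5-word vocabulary, symmetric prior.
print(dcm_log_prob([2, 1, 0, 0, 0], [0.5] * 5))
```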

[Slide 7/25]
[Graphical model of the DCM: α → p → w, with plates over the N words and D documents]
Characteristics of the DCM:
1. Feature occurrences are correlated (through p)
2. Damping of counts
   (a) A count n is effectively damped to about log n
   (b) The chance of "Two Noriegas" is closer to p/2 than p^2 (Church 2000)
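A quick numeric illustration of 2(b) using the DCM's posterior predictive (alpha_v + n_v) / (alpha + n); the hyperparameters are made up. Once a rare word has appeared, the DCM's chance of seeing it again is orders of magnitude larger than its marginal probability, which is the burstiness Church describes.

```python
import numpy as np

# One rare word ("Noriega") with a tiny pseudo-count among 99 ordinary words.
alpha = np.array([0.01] + [1.0] * 99)
a, av = alpha.sum(), alpha[0]

p = av / a                        # marginal probability of the rare word (~1e-4)
p_repeat = (av + 1) / (a + 1)     # DCM: p(rare word again | already seen once) (~1e-2)

print(f"i.i.d. multinomial repeat prob: {p:.2e}")
print(f"DCM repeat prob:                {p_repeat:.2e}")
```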

[Slide 8/25] Dirichlet Mixtures (Sjölander+ 1996, Yamamoto+ 2003)
[Figure: a mixture of Dirichlet densities over the word simplex; graphical model with α, λ, z, p, w and plates over N and D]
- Parameter estimation: EM-Newton
- Tool: http://chasen.org/~daiti-m/dist/dm/
- An unsupervised, completely Bayesian version of Naive Bayes
- Performs better than NB

[Slide 9/25]
- DM (Yamamoto+ 2005): better than LDA (!)
- The cache property (damping of multiple counts) is important in natural language.
- What about LDA?

[Slide 10/25]
[Figure: the topic subsimplex inside the word simplex, spanned by the topic-word distributions p(·|1), p(·|2), ..., p(·|K), with topic mixture θ]
- Each word can have a different topic k
- 3-stage generative process:
  1. Draw θ ~ Dir(α).  (topic mixture)
  2. For n = 1..N,
     (a) Draw k ~ θ.
     (b) Draw w_n ~ p(w | k).
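A hedged numpy sketch of this 3-stage process for one document; all sizes and hyperparameters are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
V, K, N = 1000, 10, 50                      # vocabulary size, topics, document length
alpha = np.full(K, 0.1)                     # Dirichlet prior on the topic mixture
phi = rng.dirichlet(np.full(V, 0.1), K)     # K topic-word distributions p(w | k)

theta = rng.dirichlet(alpha)                # 1. draw the document's topic mixture
doc = []
for n in range(N):
    k = rng.choice(K, p=theta)              # 2(a). draw a topic for this word
    doc.append(rng.choice(V, p=phi[k]))     # 2(b). draw the word from that topic
print(theta.round(2), doc[:10])
```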

[Slide 11/25]
[Graphical model of LDA: α, η, θ, z, β, w with plates over K topics, N words, D documents]
Mixture Model:
- LDA is actually a collection of mixture models
- Mixture components are shared (HDP does exactly this)
- Many applications:
  - Contextual SMT (Zhao and Xing 2007)
  - Semi-supervised POS tagging (Toutanova and Johnson 2007)
  - Information retrieval, computer vision, ...

[Slide 12/25]
[Figure: the topic subsimplex spanned by p(·|1), p(·|2), ..., p(·|K) inside the word simplex, with topic mixture θ]
- LDA models only points within the topic subsimplex
- Each document is generated from a mixture of fixed coordinates
- The whole simplex, like DM? → DDA (this talk)
- Correlation between topics? → hLDA, CTM, PAM

[Slide 13/25]
[Figure: the word simplex with Topic 1, Topic 2, ..., Topic K and topic mixture θ]
- LDA cannot model outside the topic subsimplex
- Plugging the DCM into LDA?
  1. Draw θ ~ Dir(θ | α).
  2. For k = 1..K,
     (a) Draw p(· | k) ~ Dir(β_k).
  3. For n = 1..N,
     (a) Draw k ~ θ.
     (b) Draw w_n ~ p(w | k).

[Slide 14/25]
Generally OK, but:
    p(w) = \frac{\Gamma(\alpha)}{\prod_k \Gamma(\alpha_k)} \int \prod_k \theta_k^{\alpha_k - 1} \prod_k \theta_k^{n_k} \frac{\Gamma(\beta_k)}{\Gamma(\beta_k + n_k)} \prod_{v \in w} \frac{\Gamma(\beta_{kv} + n_{kv})}{\Gamma(\beta_{kv})} \, d\theta    (7)
- Too complex to optimize!
- The DCM is not in the exponential family.

[Slide 15/25]
The DCM:
    p(w) = \frac{n!}{\prod_v n_v!} \frac{\Gamma(\alpha)}{\Gamma(\alpha + n)} \prod_v \frac{\Gamma(\alpha_v + n_v)}{\Gamma(\alpha_v)}    (8)
For \alpha_v \ll 1, \Gamma(\alpha_v + n_v)/\Gamma(\alpha_v) \approx \alpha_v \Gamma(n_v). Plugging this into (8) and using \Gamma(n) = (n-1)!, we get the Exponential-family DCM (EDCM):
    q(w) = \frac{n!}{\prod_v n_v!} \frac{\Gamma(\alpha)}{\Gamma(\alpha + n)} \prod_{v: n_v > 0} \alpha_v (n_v - 1)!    (9)
         = n! \, \frac{\Gamma(\alpha)}{\Gamma(\alpha + n)} \prod_{v: n_v > 0} \frac{\alpha_v}{n_v}.    (10)
An exponential-family member, and a simpler form!
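A hedged numeric check that the EDCM (10) tracks the exact DCM (8) when the alpha_v are small; the counts and hyperparameters are illustrative.

```python
import numpy as np
from scipy.special import gammaln

def log_dcm(counts, alpha):
    """Exact eq. (8), including the multinomial coefficient n! / prod_v n_v!."""
    n, a = counts.sum(), alpha.sum()
    return (gammaln(n + 1) - gammaln(counts + 1).sum()
            + gammaln(a) - gammaln(a + n)
            + (gammaln(alpha + counts) - gammaln(alpha)).sum())

def log_edcm(counts, alpha):
    """EDCM eq. (10): n! * Gamma(a)/Gamma(a+n) * prod_{v: n_v>0} alpha_v / n_v."""
    n, a = counts.sum(), alpha.sum()
    nz = counts > 0
    return (gammaln(n + 1) + gammaln(a) - gammaln(a + n)
            + np.sum(np.log(alpha[nz]) - np.log(counts[nz])))

counts = np.array([3.0, 1.0, 1.0, 0.0, 0.0])
alpha = np.full(5, 0.05)                 # small alpha_v, where the approximation holds
print(log_dcm(counts, alpha), log_edcm(counts, alpha))   # the two values nearly agree
```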

[Slide 16/25]
    p(w, z, θ) = \frac{\Gamma(\alpha)}{\prod_k \Gamma(\alpha_k)} \prod_k \theta_k^{\alpha_k + n_k - 1} \prod_k n_k! \, \frac{\Gamma(\beta_k)}{\Gamma(\beta_k + n_k)} \prod_{v: n_{kv} > 0} \frac{\beta_{kv}}{n_{kv}}    (11)
where
    n_k    = \sum_n I(z_n = k)                    (12)
    n_{kv} = \sum_n I(z_n = k) I(w_n = v)         (13)
- Can model the whole word simplex
- Burst phenomenon of natural language
  - Example: multiple appearances of "Noriega" in a document
  - "Toyota" and "Nissan" in a car topic

[Slide 17/25]
- A small piece of work, but it combines the benefits of both LDA and DM
- Derived the variational lower bound
- Experiments comparing with DM and vanilla LDA
- Lesson: the EDCM is a useful exponential family

[Slide 18/25]
[Figure: a suffix tree of n-gram contexts (ε; "will", "and"; "he will", "of and", "Japan and", "bread and") across the 1-gram, 2-gram, and 3-gram levels]
- Bayesian n-gram model
- Hierarchical Pitman-Yor Language Model (ACL 2006)
- Hierarchical draw of the n-gram distribution from the (n-1)-gram distribution
- An extension of the HDP

[Slide 19/25]
[Figure: the same suffix tree, with customers from "... ate bread and butter ..." seated at the leaves]
- Estimate the n-gram distributions by finite draws ("customers") from them
- Real customers reside only in the leaves
- Lower-order (< n)-gram customers are virtual, sent by the n-gram customers for smoothing purposes
- Gibbs sampler: remove a customer and add him again, to stochastically optimize the lower-order customers for smoothing (toy sketch below)
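The seating arithmetic is easiest to see in code. Below is a toy, single-restaurant Pitman-Yor sketch with a uniform base distribution and made-up discount/strength values; it is not the full hierarchical HPYLM (which sends a customer to the parent context whenever a new table opens), but the remove-then-re-add pattern is exactly the Gibbs move described above.

```python
import random
from collections import defaultdict

class PYRestaurant:
    """One Pitman-Yor Chinese restaurant over a finite vocabulary."""

    def __init__(self, d=0.5, theta=1.0, vocab_size=10000):
        self.d, self.theta = d, theta        # discount and strength (illustrative values)
        self.base = 1.0 / vocab_size         # uniform base distribution G0
        self.tables = defaultdict(list)      # word -> sizes of the tables serving it
        self.n = 0                           # total customers
        self.t = 0                           # total tables

    def prob(self, w):
        """Predictive probability of w under the current seating arrangement."""
        p = (self.theta + self.d * self.t) / (self.theta + self.n) * self.base
        p += sum(c - self.d for c in self.tables[w]) / (self.theta + self.n)
        return p

    def add_customer(self, w):
        """Seat a customer at an existing table of w or open a new one."""
        weights = [c - self.d for c in self.tables[w]]
        weights.append((self.theta + self.d * self.t) * self.base)   # new table
        k = random.choices(range(len(weights)), weights)[0]
        if k == len(self.tables[w]):
            self.tables[w].append(1)
            self.t += 1
        else:
            self.tables[w][k] += 1
        self.n += 1

    def remove_customer(self, w):
        """Remove one customer of w, chosen proportionally to table size."""
        k = random.choices(range(len(self.tables[w])), self.tables[w])[0]
        self.tables[w][k] -= 1
        self.n -= 1
        if self.tables[w][k] == 0:
            del self.tables[w][k]
            self.t -= 1

random.seed(0)
rest = PYRestaurant()
data = ["ate", "bread", "and", "butter", "and", "bread"]
for w in data:
    rest.add_customer(w)
for w in data:                 # one Gibbs sweep: remove each customer, add him again
    rest.remove_customer(w)
    rest.add_customer(w)
print(rest.prob("and"), rest.prob("unseen"))
```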

[Slide 20/25] Mixtures
- Straightforward approach: an LDA of HPYLMs (or VPYLMs)
[Figure: a text w_1 w_2 w_3 w_4 with each word assigned to one of Topic 1, Topic 2, Topic 3, ..., Topic k]
- Gibbs sampling of the topic of w_{t,d}:
    p(z_{t,d} = k | \cdot) \propto p(w_t | \mathrm{HPYLM}_k) \, (n_d^{-t}(k) + \alpha_k)    (14)
  where n_d^{-t}(k) is the number of customers in document d assigned to topic k (excluding w_t).
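A hedged sketch of the Gibbs step (14), with p(w_t | HPYLM_k) replaced by a stub function (a real implementation would query topic k's HPYLM); K, the counts, and alpha are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 4
alpha = np.full(K, 0.5)                    # per-topic smoothing alpha_k

def topic_lm_prob(w, k):
    """Stand-in for p(w_t | HPYLM_k); here just a uniform placeholder."""
    return 1.0 / 1000

def resample_topic(w, doc_topic_counts, current_k):
    """Remove the token from its topic, then resample proportionally to
    p(w | HPYLM_k) * (n_d(k) + alpha_k), as in (14)."""
    doc_topic_counts[current_k] -= 1
    scores = np.array([topic_lm_prob(w, k) * (doc_topic_counts[k] + alpha[k])
                       for k in range(K)])
    new_k = rng.choice(K, p=scores / scores.sum())
    doc_topic_counts[new_k] += 1
    return new_k

counts = np.array([5, 2, 0, 1])            # topic counts of one document
print(resample_topic("bread", counts, current_k=0))
```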

[Slide 21/25]
NIPS papers dataset (1,500 documents / 3,261,224 words) with 5 mixtures:

  (a) Topic 0 (generic)
      p(n,s)   Phrase
      0.9904   in section #
      0.9900   the number of
      0.9856   in order to
      0.9832   in table #
      0.9752   dealing with
      0.9693   with respect to

  (b) Topic 1
      p(n,s)   Phrase
      0.9853   et al
      0.9840   receptive field
      0.9630   excitatory and inhibitory
      0.9266   in order to
      0.8939   primary visual cortex
      0.8756   corresponds to

  (c) Topic 4
      p(n,s)   Phrase
      0.9823   monte carlo
      0.9524   associative memory
      0.9081   as can be seen
      0.8206   parzen windows
      0.8044   in the previous section
      0.7790   american institute of phy

- Generally OK, but perplexity is practically identical (117.28 → 116.62)
- Perplexity increases on other datasets (such as AP)
- Data sparseness problem!

[Slide 22/25] Data sparseness in n-gram mixtures
[Figure: a text w_1 w_2 w_3 w_4 whose words are routed to separate trees for Topic 1, Topic 2, Topic 3]
- LDA only mixes different trees → severe data sparseness problem
- Many infrequent n-grams concentrate on specific trees
- If the topic weights for these trees are near zero, the estimates are severely backed off
- Solution?

[Slide 23/25] Solution
Possible solution: the Nested (Poisson-)Dirichlet Process (nDP) (Rodriguez+ 2006)
[Figure: a single suffix tree (ε; "will", "and"; "he will", "of and", "Japan and", "bread and") whose branches carry mixture weights such as 0.4, 0.22, 0.1, 0.7, 0.2, 0.9, 0.05]
- A single tree, but the measures on the branches are infinite mixtures
- The mixtures are governed by a DP (so many counts, e.g. unigrams, induce large mixtures)
- Customers are grouped accordingly

[Slide 24/25] Current problem
[Figure: the same tree with branch mixture weights 0.4, 0.22, 0.1, 0.7, 0.2, 0.9, 0.05]
- The nDP (or nPD) is OK, but how can we introduce the document (context) here?
- Context-dependent n-gram language models are important in NLP applications.

[Slide 25/25] Final Remark
- Text modeling is research on introducing context dependency into many NLP applications.
- LDA is useful, and it can incorporate the DM through the EDCM
- Latent topic-aware n-gram language models are important, and the nDP may open the door
Thank you very much.