Present and Future of Text Modeling

1 Present and Future of Text Modeling
Daichi Mochihashi
NTT Communication Science Laboratories
T-FaNT2, University of Tokyo (Wed)
Present and Future of Text Modeling (T-FaNT2) 1/25

2 What's Text Modeling
- (Usually) bag-of-words style collections of words
- Idiosyncrasy of context
  - Context = document, sentence, utterances so far, ...
  - Word occurrences are not homogeneous.

3
d_1: sea:2, habitat:1, ...
d_2: economy:5, relation:1, international:2, ...
- Occurrences of words (features)
- Words are exchangeable (order doesn't matter)
- Counts are explicitly discrete (cf. log-linear models)

4
[Figure: the word simplex with vertices (1, 0, 0), (0, 1, 0), (0, 0, 1) and an example point p = (0.4, 0.5, 0.1)]
- Each point $\mathbf{p}$ on the $(V-1)$-simplex is a multinomial parameter:
$$\mathbf{p} = (p(w_1), p(w_2), \ldots, p(w_V)) \tag{1}$$
- Words are generated i.i.d. from $\mathbf{p}$:
$$w_i \sim \mathbf{p} \quad (i = 1 \ldots N) \tag{2}$$
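As a rough illustration of equations (1) and (2), the sketch below draws words i.i.d. from the example point p = (0.4, 0.5, 0.1). The three-word vocabulary, the function name `sample_words`, and the seed are assumptions made for this example only.

```python
import random

def sample_words(p, vocab, n, seed=0):
    """Draw n words i.i.d. from the multinomial parameter p over vocab."""
    rng = random.Random(seed)
    return rng.choices(vocab, weights=p, k=n)

vocab = ["w1", "w2", "w3"]   # hypothetical 3-word vocabulary (V = 3)
p = [0.4, 0.5, 0.1]          # the example point on the simplex
words = sample_words(p, vocab, n=10)
print(words)
```

Because the draws are i.i.d., shuffling `words` leaves their probability unchanged, which is the exchangeability assumed by the bag-of-words view.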

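5
- It is convenient (simplest) to put a Dirichlet prior on the distribution of $\mathbf{p}$'s:
$$p(\mathbf{p} \mid \boldsymbol{\alpha}) = \mathrm{Dir}(\mathbf{p} \mid \boldsymbol{\alpha}) \tag{3}$$
$$\propto \prod_{i=1}^{V} p_i^{\alpha_i - 1} \tag{4}$$
(Normalization constant: $\Gamma\bigl(\sum_{i=1}^{V} \alpha_i\bigr) / \prod_{i=1}^{V} \Gamma(\alpha_i)$)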
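A draw from the Dirichlet prior can be sketched with the standard library alone, using the well-known construction of normalizing independent Gamma variates. The hyperparameter values and the helper name `sample_dirichlet` are assumptions for illustration, not part of the talk.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw p ~ Dir(alpha) by normalizing independent Gamma(alpha_i, 1) draws."""
    g = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(g)
    return [x / total for x in g]

rng = random.Random(0)
alpha = [0.5, 0.5, 0.5]      # hypothetical hyperparameters, V = 3
p = sample_dirichlet(alpha, rng)
print(p, sum(p))
```

Small values of alpha_i tend to push the draw toward the corners of the simplex; large values concentrate it near the center.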

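6
Generate $\mathbf{w} = w_1 w_2 \cdots w_N$ in two steps:
1. Draw $\mathbf{p} \sim \mathrm{Dir}(\mathbf{p} \mid \boldsymbol{\alpha})$.
2. For $i = 1 \ldots N$, draw $w_i \sim \mathbf{p}$.
Integrate out $\mathbf{p}$ to get the DCM (Dirichlet Compound Multinomial):
$$p(\mathbf{w}) = \int p(\mathbf{w} \mid \mathbf{p})\, p(\mathbf{p} \mid \boldsymbol{\alpha})\, d\mathbf{p} \tag{5}$$
$$= \frac{\Gamma(\alpha)}{\Gamma(\alpha + N)} \prod_{v \in \mathbf{w}} \frac{\Gamma(\alpha_v + n_v)}{\Gamma(\alpha_v)} \qquad \Bigl(\alpha = \sum_{i=1}^{V} \alpha_i\Bigr) \tag{6}$$
where $n_v$ is the number of occurrences of word $v$ in $\mathbf{w}$.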
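The closed form in equation (6) is easy to evaluate in log space. The sketch below assumes the denominator is Γ(α + N) with N the number of words in the sequence, and uses an invented two-word vocabulary; `dcm_log_prob` is a name chosen for this example.

```python
from collections import Counter
from math import lgamma, exp

def dcm_log_prob(words, alpha):
    """log p(w) of a word sequence under the DCM, equation (6).

    alpha maps each vocabulary word v to alpha_v > 0.
    """
    counts = Counter(words)              # n_v
    a_sum = sum(alpha.values())          # alpha = sum_i alpha_i
    n = len(words)                       # N
    logp = lgamma(a_sum) - lgamma(a_sum + n)
    for v, n_v in counts.items():
        logp += lgamma(alpha[v] + n_v) - lgamma(alpha[v])
    return logp

alpha = {"a": 1.0, "b": 1.0}             # hypothetical symmetric prior
print(exp(dcm_log_prob(["a", "a"], alpha)))
```

Summing the probabilities of all sequences of a fixed length over the vocabulary gives 1, which is a quick sanity check on the formula.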

7
[Figure: graphical model with nodes α, p, w and plates N, D]
Characteristics of DCM:
1. Feature occurrences are correlated (through $\mathbf{p}$)
2. Damping of counts
   (a) A count $n$ is effectively damped to $\log n$
   (b) The chance of two Noriegas is closer to $p/2$ than $p^2$ (Church 2000)
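The "two Noriegas" effect can be checked numerically with the DCM predictive probability (α_v + n_v) / (α + n), which follows from taking ratios of DCM probabilities. The setup below (100 word types with α_v = 0.01 each) is an assumption chosen so that the total α is about 1; other settings give a weaker or stronger effect.

```python
def dcm_next_word_prob(v, counts, alpha):
    """Predictive probability of word v under the DCM after observing counts.

    Uses the Polya-urn form (alpha_v + n_v) / (alpha + n).
    """
    n = sum(counts.values())
    a_sum = sum(alpha.values())
    return (alpha[v] + counts.get(v, 0)) / (a_sum + n)

# Hypothetical setup: 100 word types, alpha_v = 0.01 each, so alpha is about 1.
alpha = {f"w{i}": 0.01 for i in range(100)}
p_first = dcm_next_word_prob("w0", {}, alpha)           # first occurrence
p_second = dcm_next_word_prob("w0", {"w0": 1}, alpha)   # second, given the first
p_two = p_first * p_second
print(p_first, p_two, p_first ** 2)
```

With these values the probability of seeing the word twice is far closer to p/2 than to the p^2 an i.i.d. multinomial would give.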

8
[Figure: word simplex; graphical model with nodes α, λ, z, p, w and plates N, D]
- Dirichlet mixtures (Sjölander+ 1996, Yamamoto+ 2003)
- Parameter estimation: EM-Newton
- Tool: daiti-m/dist/dm/
- Unsupervised, complete Bayesian version of Naive Bayes
- Performs better than NB

9
- (Yamamoto+ 2005) Better than LDA (!)
- Cache property (damping multiple counts) is important in natural language.
- LDA?

10
[Figure: word simplex containing the topic space spanned by p(·|1), p(·|2), ..., p(·|K) (Topic 1, Topic 2, ..., Topic K), with a mixture point θ]
- Each word will have a different topic $k$
- 3-stage generative process:
  1. Draw $\theta \sim \mathrm{Dir}(\alpha)$. (topic mixture)
  2. For $n = 1 \ldots N$,
     (a) Draw $k \sim \theta$.
     (b) Draw $w_n \sim p(w \mid k)$.
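The generative steps above can be sketched as follows. This is a minimal illustration, not an inference procedure: the vocabulary, the two topic distributions, and the function names are invented for the example.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw from Dir(alpha) by normalizing Gamma(alpha_i, 1) variates."""
    g = [rng.gammavariate(a, 1.0) for a in alpha]
    s = sum(g)
    return [x / s for x in g]

def generate_document(alpha, topics, vocab, n_words, rng):
    """Generate one document; topics[k] is p(w | k) over vocab."""
    theta = sample_dirichlet(alpha, rng)                         # 1. topic mixture
    assignments, doc = [], []
    for _ in range(n_words):                                     # 2. for n = 1..N
        k = rng.choices(range(len(topics)), weights=theta)[0]    # (a) k ~ theta
        w = rng.choices(vocab, weights=topics[k])[0]             # (b) w_n ~ p(w | k)
        assignments.append(k)
        doc.append(w)
    return theta, assignments, doc

# Hypothetical vocabulary and two hand-written topics.
vocab = ["sea", "habitat", "economy", "international"]
topics = [[0.6, 0.4, 0.0, 0.0], [0.0, 0.0, 0.7, 0.3]]
theta, assignments, doc = generate_document([0.5, 0.5], topics, vocab, 8, random.Random(1))
print(theta, assignments, doc)
```

Note that the topic distributions are fixed across documents here; only θ and the per-word topic choices vary.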

11
[Figure: graphical model with nodes α, θ, z, w, β, η and plates K, N, D]
- Mixture model: LDA is actually a collection of mixture models
  - Mixture components are shared (HDP does exactly this)
- Many applications
  - Contextual SMT (Zhao and Xing 2007)
  - Semi-supervised POS tagging (Toutanova and Johnson 2007)
  - Information retrieval, computer vision, ...

12
[Figure: word simplex containing the topic space spanned by p(·|1), p(·|2), ..., p(·|K) (Topic 1, Topic 2, ..., Topic K), with a mixture point θ]
- LDA models only within the topic subsimplex
  - Each document is generated from a mixture of fixed coordinates
- Whole simplex like DM? → DDA (this talk)
- Correlation between topics? → hLDA, CTM, PAM

13
- LDA cannot model outside the topic subsimplex
- Plugging DCM into LDA?
  1. Draw $\theta \sim \mathrm{Dir}(\theta \mid \alpha)$.
  2. For $k = 1 \ldots K$,
     (a) Draw $p(\cdot \mid k) \sim \mathrm{Dir}(\beta_k)$.
  3. For $n = 1 \ldots N$,
     (a) Draw $k \sim \theta$.
     (b) Draw $w_n \sim p(w \mid k)$.
[Figure: word simplex with Topic 1, Topic 2, ..., Topic K and the mixture point θ]

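14
Generally OK, but:
$$p(\mathbf{w}) = \int \frac{\Gamma(\alpha)}{\prod_k \Gamma(\alpha_k)} \prod_k \theta_k^{\alpha_k - 1} \prod_k \theta_k^{n_k} \frac{\Gamma(\beta_k)}{\Gamma(\beta_k + n_k)} \prod_{v \in \mathbf{w}} \frac{\Gamma(\beta_{kv} + n_{kv})}{\Gamma(\beta_{kv})} \, d\theta \tag{7}$$
- Too complex to optimize!
- DCM: not in the exponential family.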

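15
DCM:
$$p(\mathbf{w}) = \frac{n!}{\prod_v n_v!} \frac{\Gamma(\alpha)}{\Gamma(\alpha + n)} \prod_v \frac{\Gamma(\alpha_v + n_v)}{\Gamma(\alpha_v)} \tag{8}$$
For $\alpha_v \ll 1$ and $n_v \ge 1$, $\Gamma(\alpha_v + n_v)/\Gamma(\alpha_v) \approx \alpha_v \Gamma(n_v)$. Plugging this into (8) and using $\Gamma(n) = (n-1)!$, we get:
Exponential family DCM:
$$q(\mathbf{w}) = \frac{n!}{\prod_v n_v!} \frac{\Gamma(\alpha)}{\Gamma(\alpha + n)} \prod_{v:\, n_v > 0} \alpha_v (n_v - 1)! \tag{9}$$
$$= n! \, \frac{\Gamma(\alpha)}{\Gamma(\alpha + n)} \prod_{v:\, n_v > 0} \frac{\alpha_v}{n_v}. \tag{10}$$
- Exponential family and simpler form!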
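To see how close the EDCM approximation is when the hyperparameters are small, the sketch below evaluates equations (8) and (10) in log space on an invented count vector. The counts, the value 0.001 for every α_v, and the function names are assumptions for this example.

```python
from math import lgamma, log

def dcm_log_prob(counts, alpha):
    """log p(w) of a count vector under the DCM, equation (8)."""
    n, a = sum(counts), sum(alpha)
    logp = lgamma(n + 1) - sum(lgamma(c + 1) for c in counts)
    logp += lgamma(a) - lgamma(a + n)
    logp += sum(lgamma(av + c) - lgamma(av) for av, c in zip(alpha, counts))
    return logp

def edcm_log_prob(counts, alpha):
    """log q(w) under the EDCM approximation, equation (10)."""
    n, a = sum(counts), sum(alpha)
    logq = lgamma(n + 1) + lgamma(a) - lgamma(a + n)
    # Only words that actually occur (n_v > 0) contribute a factor alpha_v / n_v.
    logq += sum(log(av) - log(c) for av, c in zip(alpha, counts) if c > 0)
    return logq

counts = [3, 0, 1, 0, 2]     # hypothetical word counts n_v
alpha = [0.001] * 5          # small alpha_v, where the approximation applies
exact = dcm_log_prob(counts, alpha)
approx = edcm_log_prob(counts, alpha)
print(exact, approx, exact - approx)
```

The gap between the two log-probabilities shrinks as the α_v get smaller and grows if they approach 1, so the approximation should be read as a small-α regime result.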

16
$$p(\mathbf{w}, \mathbf{z}, \theta) = \frac{\Gamma(\alpha)}{\prod_k \Gamma(\alpha_k)} \prod_k \theta_k^{\alpha_k + n_k - 1} \frac{\Gamma(\beta_k)}{\Gamma(\beta_k + n_k)} \, n_k! \prod_{v \in \mathbf{w}:\, n_{kv} > 0} \frac{\beta_{kv}}{n_{kv}} \tag{11}$$
where
$$n_k = \sum_n I(z_n = k) \tag{12}$$
$$n_{kv} = \sum_n I(z_n = k)\, I(w_n = v) \tag{13}$$
- Can model the whole word simplex
- Burst phenomenon of natural language. Examples:
  - Multiple appearances of "Noriega" in a document
  - "Toyota" and "Nissan" in a car topic

17
- A small piece of work, but it can combine the benefits of both LDA and DM
- Derived the variational lower bound
- Experiments to compare with DM and vanilla LDA
- Lesson: EDCM is a useful exponential family

18
[Figure: context tree with root ε (1-gram), children "will", "of", "and" (2-gram), and leaves "he will", "Japan and", "bread and" (3-gram)]
- Bayesian n-gram model
- Hierarchical Pitman-Yor Language Model (ACL 2006)
  - Hierarchical draw of n-gram from (n−1)-gram
  - An extension of HDP

19
- Estimate n-gram distributions by finite draws (= customers) from them
[Figure: context tree with root ε (1-gram), children "will", "of", "and" (2-gram), and leaves "he will", "Japan and", "bread and" (3-gram); example text "ate bread and butter"]
- Real customers only reside in the leaves
  - Lower-order (< n)-gram customers are virtual, sent by n-gram customers for smoothing purposes
- Gibbs sampler: remove a customer and add it again to stochastically optimize the (< n)-gram customers for smoothing
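How the customers turn into a smoothed estimate can be sketched with the usual Pitman-Yor predictive rule, applied recursively from a bigram context down to the unigram level. The discount `d` and strength `theta` are standard Pitman-Yor parameters that the slide does not name, and all counts and table numbers below are invented; a real HPYLM would maintain them through the Gibbs sampler described above.

```python
def py_predictive(word, counts, tables, d, theta, base_prob):
    """Predictive probability of `word` in one Pitman-Yor restaurant.

    counts[w]: customers eating dish w; tables[w]: tables serving w.
    base_prob: probability of `word` under the parent (shorter) context.
    """
    c = sum(counts.values())
    t = sum(tables.values())
    if c == 0:
        return base_prob                       # empty restaurant: pure back-off
    seated = max(counts.get(word, 0) - d * tables.get(word, 0), 0.0)
    return (seated + (theta + d * t) * base_prob) / (theta + c)

vocab = ["and", "butter", "of"]                # hypothetical vocabulary
d, theta = 0.5, 1.0                            # assumed discount and strength
uni_counts = {"and": 2, "butter": 1}           # unigram restaurant (context: epsilon)
uni_tables = {"and": 1, "butter": 1}
bi_counts = {"and": 3}                         # bigram restaurant for context "bread"
bi_tables = {"and": 1}

def bigram_prob(w):
    """p(w | "bread"), backing off to the unigram restaurant, then to uniform."""
    p_uni = py_predictive(w, uni_counts, uni_tables, d, theta, 1.0 / len(vocab))
    return py_predictive(w, bi_counts, bi_tables, d, theta, p_uni)

probs = {w: bigram_prob(w) for w in vocab}
print(probs)
```

Words never seen after "bread" still receive probability mass through the parent restaurant, which is the smoothing role the virtual customers play.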

20 Mixtures
- Straightforward approach: LDA of HPYLM (or VPYLM)
[Figure: text w_1 w_2 w_3 w_4, with each word assigned to one of Topic 1, Topic 2, Topic 3]
- Gibbs:
$$p(k \mid w_t, d) \propto p(w_t \mid \mathrm{HPYLM}_k) \cdot \bigl(n_d^{-t}(k) + \alpha_k\bigr) \tag{14}$$
  - $n_d^{-t}(k)$: number of customers in document $d$ assigned to topic $k$ (excluding $w_t$)
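One Gibbs step for a word's topic, following equation (14), might look like the sketch below. The per-topic likelihoods p(w_t | HPYLM_k) are passed in as plain numbers because the language models themselves are out of scope here; all values and function names are assumptions for illustration.

```python
import random

def topic_posterior(word_likelihoods, topic_counts, alpha):
    """Normalized topic probabilities for one word, from equation (14).

    word_likelihoods[k]: p(w_t | HPYLM_k), supplied by the caller.
    topic_counts[k]: customers in document d assigned to topic k, excluding w_t.
    """
    weights = [lik * (n + a)
               for lik, n, a in zip(word_likelihoods, topic_counts, alpha)]
    z = sum(weights)
    return [w / z for w in weights]

def sample_topic(word_likelihoods, topic_counts, alpha, rng):
    """Sample a new topic index for the word."""
    probs = topic_posterior(word_likelihoods, topic_counts, alpha)
    return rng.choices(range(len(probs)), weights=probs)[0]

# Hypothetical numbers: two topics, the second explains the word better,
# but the document currently uses the first topic more often.
post = topic_posterior([0.1, 0.2], [3, 1], [1.0, 1.0])
k = sample_topic([0.1, 0.2], [3, 1], [1.0, 1.0], random.Random(0))
print(post, k)
```

In a full sampler this step would be wrapped in a loop that first removes the word's customer from its current topic's model and re-adds it under the sampled topic.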

21 Mixtures
NIPS papers dataset (1500 documents / 3,261,224 words) with 5 mixtures.
Phrase tables per topic (columns: p(n, s), Phrase):
(a) Topic 0 (generic): in section #, the number of, in order to, in table #, dealing with, with respect to
(b) Topic 1: et al, receptive field, excitatory and inhibitory, in order to, primary visual cortex, corresponds to
(c) Topic 4: monte carlo, associative memory, as can be seen, parzen windows, in the previous section, american institute of phy
- Generally OK, but PPL is identical ( ).
- PPL increases on other datasets (such as AP)
- Data sparseness problem!

22 Data sparseness in n-gram mixtures
[Figure: text w_1 w_2 w_3 w_4, with words assigned to Topic 1, Topic 2, Topic 3]
- LDA only mixes different trees
- Severe data sparseness problem
  - Many infrequent n-grams concentrate on specific trees
  - If topic weights for these trees are near zero, estimates are severely backed off
- Solution?

23 Solution
- Possible solution: nested (Poisson-)Dirichlet process (nDP) (Rodriguez+ 2006)
- Single tree, but measures on branches are infinite mixtures
[Figure: context tree with root ε, children "will", "of", "and", and leaves "he will", "Japan and", "bread and"]
- Mixtures are governed by a DP (thus many counts, e.g. unigrams, induce large mixtures)
- Customers are grouped accordingly

24 Current problem
[Figure: context tree with root ε, children "will", "of", "and", and leaves "he will", "Japan and", "bread and"]
- nDP (or nPD) is OK, but how can we introduce document (context) here?
- Context-dependent n-gram language models are important in NLP applications.

25 Final Remark
- Text modeling is research on introducing context dependency into many NLP applications.
- LDA is useful, and can incorporate DM through EDCM
- Latent topic-aware n-gram language models are important, and nDP may open the door
Thank you very much.
