The Infinite Markov Model

The Infinite Markov Model
Daichi Mochihashi
NTT Communication Science Laboratories, Japan
daichi@cslab.kecl.ntt.co.jp
NIPS 2007

Overview
(Figure: a fixed-order suffix tree vs. an infinitely deep variable-order suffix tree, with word contexts such as "is", "of", "will", "he", "she", "language", "states", "united", "the")
- Fixed n-th order Markov model: fixed-order Markov dependency
- Infinitely variable-order Markov model: the Markov order can vary without bound
- A simple prior for stochastic trees (other than Coalescents)
- How to draw an inference based only on the output sequences?

Markov Models
2nd order (3-gram) example:
    p(mama I want to sing) = p(mama) p(I | mama) p(want | mama, I) p(to | I, want) p(sing | want, to)
- The n-gram ((n-1)-th order Markov) model is prevalent in speech recognition and natural language processing
- Also used in music processing, bioinformatics, compression, ...
- Notice: an HMM is a first-order Markov model over hidden states; the emission is a unigram on the hidden state
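
To make the factorization above concrete, here is a minimal Python sketch (mine, not the talk's) that scores a sentence under a 2nd-order Markov model; `trigram_prob` is a hypothetical callable returning an estimate of p(w | u, v), and "^" is an assumed beginning-of-sentence pad symbol.

```python
# Minimal sketch: scoring a sentence under a 2nd-order (trigram) factorization.
def sentence_prob(words, trigram_prob, bos="^"):
    p = 1.0
    u, v = bos, bos
    for w in words:
        p *= trigram_prob(u, v, w)   # p(w_t | w_{t-2}, w_{t-1})
        u, v = v, w                  # slide the two-word context window
    return p

# Example: sentence_prob("mama I want to sing".split(), trigram_prob)
```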

Estimating a Markov Model
(Figure: a suffix tree with root ɛ and nodes such as "will", "of", "and", "she will", "he will", "states of", "bread and", each holding a predictive distribution)
- Each Markov state is a node in a suffix tree (Ron+ 1994, Pereira+ 1995, Bühlmann 1999); depth = Markov order
- Each node has a predictive distribution over the next word
- Problem: the number of states explodes as the order n gets larger
  - Restrict to a small Markov order (n = 3 to 5 in speech and NLP)?
  - Distributions get sparser and sparser: use hierarchical Bayes?
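
A minimal sketch of the suffix-tree bookkeeping described above, with hypothetical names of my own (`ContextNode`, `add_ngram`): each node is one Markov state, keyed by the context read backwards from the current position, and collects counts for its predictive distribution over the next word.

```python
from collections import defaultdict

class ContextNode:
    """One Markov state: a suffix-tree node (depth = Markov order)."""
    def __init__(self):
        self.children = {}               # previous word -> deeper context node
        self.counts = defaultdict(int)   # next-word counts observed at this node

def add_ngram(root, context, word):
    """Walk from the root (empty context) down the reversed context,
    creating nodes on demand, and count the following word at every depth."""
    node = root
    node.counts[word] += 1
    for w in reversed(context):
        node = node.children.setdefault(w, ContextNode())
        node.counts[word] += 1
```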

Hierarchical (Poisson-) Dirichlet Process
(Figure: the suffix tree over contexts such as "will", "of", "and", "she will", "he will", "states of", "bread and", counting a text like "a bread and butter is usually ..."; each node is a Chinese restaurant, and proxy (imaginary) customers are sent to its parent node)
- Teh (2006) and Goldwater+ (2006) mapped a hierarchical Dirichlet process to Markov models
- The n-th order predictive distribution is a Dirichlet process draw from the (n-1)-th order distribution
- Chinese restaurant process representation: a customer = a count (in the training data)
- This gives the Hierarchical Pitman-Yor Language Model (HPYLM)

Problem with HPYLM
(Figure: the same suffix tree, with all customers sitting at depth n - 1)
- All the real customers reside at depth n - 1 (say, 2) of the suffix tree
- This corresponds to a fixed Markov order less than ∞; but consider "the united states of america", or a character model for "supercalifragilisticexpialidocious"!
- How can we deploy customers at suitably different depths?

Infinite-depth Hierarchical CRP
(Figure: a branch of the suffix tree descended node by node, passing node i with probability 1 - q_i, node j with 1 - q_j, node k with 1 - q_k)
- Add a customer by stochastically descending the suffix tree from its root
- Each node i has a probability q_i of stopping at that node (1 - q_i is the penetration probability):
    q_i ~ Be(α, β)  i.i.d.    (1)
- Therefore, a customer stops at depth n with probability
    p(n | h) = q_n ∏_{i=0}^{n-1} (1 - q_i).    (2)
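
A minimal sketch of this generative descent, under the assumption that the q_i along one branch are drawn from Be(α, β) on first visit and cached in `qs`; the truncation at `max_depth` is only for the sketch, not part of the model.

```python
import random

def sample_stop_depth(qs, alpha, beta, max_depth=64):
    """Eqs. (1)-(2): at the node at depth i along this branch, stop with
    probability q_i ~ Be(alpha, beta); otherwise penetrate one level deeper."""
    for depth in range(max_depth):
        q = qs.setdefault(depth, random.betavariate(alpha, beta))
        if random.random() < q:
            return depth                 # Markov order for this customer
    return max_depth                     # truncation, for the sketch only

# qs = {}; depths = [sample_stop_depth(qs, 1.0, 1.0) for _ in range(5)]
```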

Variable-order Pitman-Yor Language Model (VPYLM)
- For the training data w = w_1 w_2 ... w_T, latent Markov orders n = n_1 n_2 ... n_T exist:
    p(w) = Σ_n Σ_s p(w, n, s),    (3)
  where s = s_1 s_2 ... s_T are the seatings of proxy customers in parent nodes
- Gibbs sample n for inference:
    p(n_t | w, n_{-t}, s_{-t}) ∝ p(w_t | n_t, w_{-t}, n_{-t}, s_{-t}) · p(n_t | w_{-t}, n_{-t}, s_{-t}),    (4)
  i.e. the n_t-gram prediction of w_t times the probability of reaching depth n_t
- There is a trade-off between the two terms (a penalty for deep n_t)
- How do we compute the second term, p(n_t | w_{-t}, n_{-t}, s_{-t})?
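
A minimal sketch of one such Gibbs step, assuming the word's customer has already been removed from the tree; `ngram_prob` and `reach_prob` are hypothetical stand-ins for the two factors in (4), not the paper's API, and `max_depth` is a truncation for the sketch.

```python
import random

def resample_order(w_t, history, ngram_prob, reach_prob, max_depth):
    """One Gibbs step of eq. (4) for the latent order n_t of word w_t:
    weight each candidate depth by the n_t-gram prediction of w_t times
    the probability of reaching depth n_t, then sample a new order."""
    weights = [ngram_prob(w_t, history, n) * reach_prob(n, history)
               for n in range(max_depth + 1)]
    return random.choices(range(max_depth + 1), weights=weights, k=1)[0]
```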

Inference of VPYLM (2)
(Figure: a branch ɛ → w_{t-1} → w_{t-2} → w_{t-3} with stop/pass counts such as (a, b) = (100, 900) at the root and (30, 20), (10, 70), (5, 0) at deeper nodes, giving pass probabilities like (900 + β)/(1000 + α + β); a small table shows sampled orders n = 2, 3, 2, 4 for w_{t-2}, w_{t-1}, w_t, w_{t+1})
- We can estimate the q_i of node i through the depths of the other customers
- Let a_i = # of times node i was stopped at, and b_i = # of times node i was passed by. Then
    p(n_t = n | w_{-t}, n_{-t}, s_{-t}) = q_n ∏_{i=0}^{n-1} (1 - q_i)    (5)
      = (a_n + α)/(a_n + b_n + α + β) · ∏_{i=0}^{n-1} (b_i + β)/(a_i + b_i + α + β).    (6)
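
Equation (6) in code, as a minimal sketch over the stop counts a_i and pass counts b_i collected along one branch; the function and variable names are mine, not from the paper.

```python
def depth_probs(a, b, alpha, beta):
    """Eq. (6): probability of stopping at each depth n along one branch, given
    a[i] = times node i was stopped at and b[i] = times node i was passed by."""
    probs, passed = [], 1.0
    for n in range(len(a)):
        stop = (a[n] + alpha) / (a[n] + b[n] + alpha + beta)
        probs.append(passed * stop)
        passed *= (b[n] + beta) / (a[n] + b[n] + alpha + beta)
    return probs

# With the root counts shown on the slide, a_0 = 100 and b_0 = 900, the root is
# passed with probability (900 + beta) / (1000 + alpha + beta).
```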

Estimated Markov Orders
(Figure: Hinton diagram of p(n_t | w) used in Gibbs sampling, for the training sentence "while key european consuming nations appear unfazed about the prospects of a producer cartel that will attempt to fix prices the pact is likely to meet strong opposition from u.s. delegates this week EOS", with orders n = 0..9 on the vertical axis)
- Estimated Markov orders from which each word has been generated
- NAB Wall Street Journal corpus of 10,007,108 words

Prediction
- We don't know the Markov order n beforehand, so sum it out:
    p(w | h) = Σ_{n=0}^{∞} p(w, n | h) = Σ_{n=0}^{∞} p(w | n, h) p(n | h).    (7)
- We can rewrite the above expression recursively:
    p(w | h) = p(0 | h) p(w | h, 0) + p(1 | h) p(w | h, 1) + p(2 | h) p(w | h, 2) + ...
             = q_0 p(w | h, 0) + (1 - q_0) q_1 p(w | h, 1) + (1 - q_0)(1 - q_1) q_2 p(w | h, 2) + ...
             = q_0 p(w | h, 0) + (1 - q_0) [ q_1 p(w | h, 1) + (1 - q_1) q_2 p(w | h, 2) + ... ]    (8)
- Therefore,
    p(w | h, n+) ≡ q_n p(w | h, n) + (1 - q_n) p(w | h, (n+1)+),    (9)
    p(w | h) = p(w | h, 0+).    (10)

Prediction (2)
    p(w | h, n+) ≡ q_n p(w | h, n) + (1 - q_n) p(w | h, (n+1)+)
                   [prediction at depth n]    [prediction at depths > n]
    p(w | h) = p(w | h, 0+),    q_n ~ Be(α, β).
- A stick-breaking process on an infinite tree, where the breaking proportions differ from branch to branch
- A Bayesian sophistication of the CTW (context tree weighting) algorithm (Willems+ 1995) in information theory (→ Poster)
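
A minimal sketch of this recursion, truncated at the deepest context available for h; `ngram_prob(w, h, n)` stands in for the depth-n Pitman-Yor prediction and `mean_q(n, h)` for the posterior mean of q_n along the branch of h, both hypothetical helpers rather than the paper's implementation.

```python
def predict(w, h, ngram_prob, mean_q, max_depth):
    """Eqs. (9)-(10): p(w | h) = p(w | h, 0+) with
    p(w | h, n+) = q_n p(w | h, n) + (1 - q_n) p(w | h, (n+1)+)."""
    def rec(n):
        if n >= max_depth:
            return ngram_prob(w, h, n)      # truncate: act as if q_n = 1 here
        q = mean_q(n, h)
        return q * ngram_prob(w, h, n) + (1.0 - q) * rec(n + 1)
    return rec(0)
```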

Perplexity and Number of Nodes in the Tree

n    HPYLM    VPYLM    Nodes(H)    Nodes(V)
3    113.60   113.74    1,417K      1,344K
5    101.08   101.69   12,699K      7,466K
7    N/A      100.68   27,193K*    10,182K
8    N/A      100.58   34,459K*    10,434K
∞    N/A      100.36   -           10,629K

- Perplexity = 1 / (average predictive probability); lower is better
- VPYLM causes no memory overflow even for large n
- * = expected number of nodes
- Essentially identical performance to HPYLM, but with far fewer nodes
- The ∞-gram performed the best (ε = 1e-8)
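
For reference, the perplexity reported in the table is the reciprocal of the geometric mean of the per-word predictive probabilities; a minimal sketch (my own, not from the talk):

```python
import math

def perplexity(word_probs):
    """Perplexity of a test set: exp(-mean log p(w_t | h_t)), i.e. the
    reciprocal of the geometric mean of the predictive probabilities."""
    return math.exp(-sum(math.log(p) for p in word_probs) / len(word_probs))

# perplexity([0.01] * 100) is 100.0 (up to floating point)
```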

Stochastic phrases from VPYLM (1/2)
    p(w, n | h) = p(w | h, n) p(n | h)
- The probability of generating w using the last n words of h as the context
- For example, generating "Gaussians" after "mixture of" yields the phrase "mixture of Gaussians"
- p(w, n | h) = the cohesion strength of the stochastic phrase
- Does not necessarily decay with length (as an empirical probability would)
- Phrases are enumerated by traversing the suffix tree in depth-first order

Stochastic phrases from VPYLM (2/2)

p        Stochastic phrase in the suffix tree
0.9784   primary new issues
0.9726   ^ at the same time
0.9556   american telephone &
0.9512   is a unit of
0.9394   to # % from # %
0.8896   in a number of
0.8831   in new york stock exchange composite trading
0.8696   a merrill lynch & co.
0.7566   mechanism of the european monetary
0.7134   increase as a result of
0.6617   tiffany & co.
...

(^ = beginning-of-sentence, # = numbers)

Random Walk Generation from the Language Model
"it was a singular man, fierce and quick-tempered, very foul-mouthed when he was angry, and of her muff and began to sob in a high treble key. it seems to have made you, said he. what have i to his invariable success that the very possibility of something happening on the very morning of the wedding...."
- Random walk generation from the 5-gram VPYLM trained on The Adventures of Sherlock Holmes
- We begin with an infinite number of beginning-of-sentence special symbols as the context
- With vanilla 5-grams, overfitting would lead to a mere reproduction of the training data
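
A minimal sketch of such a random walk, assuming a hypothetical `predictive_dist(context)` that returns the model's next-word distribution as a dict; the "^" and "EOS" symbols are assumptions of this sketch.

```python
import random

def random_walk(predictive_dist, bos="^", eos="EOS", max_words=200):
    """Sample text word by word: start from a beginning-of-sentence context,
    draw the next word from p(w | context), and append it to the context."""
    context, out = [bos], []
    for _ in range(max_words):
        dist = predictive_dist(tuple(context))
        words = list(dist)
        w = random.choices(words, weights=[dist[x] for x in words], k=1)[0]
        if w == eos:
            break
        out.append(w)
        context.append(w)
    return " ".join(out)
```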

Infinite Character Markov Model
(a) Random walk generation from a character model:
"how queershaped little children drawling-desks, which would get through that dormouse! said alice; let us all for anything the secondly, but it to have and another question, but i shalled out, you are old, said the you re trying to far out to sea."
(b) Markov orders used to generate each character of "said alice; let us all for anything":
5 6 5 4 7 10 6 5 4 3 7 1 4 8 2 4 4 6 5 5 4 4 5 5 6 4 5 6 7 7 7 5 3 3 4 5 9
- Character-based Markov model trained on Alice in Wonderland (lowercased alphabet + space)
- Applications: OCR, compression, morphology, ...

Final Remarks
- Hyperparameter sensitivity and empirical Bayes optimization → Paper
- LDA extension → Paper (only partially successful)
- Comparison with Entropy Pruning (Stolcke 1998) → Poster
- Poster: W24 (near the escalator)

Summary
(Figure: a density on [0, 1])
- We introduced the Infinite Markov model, where the orders are unspecified and unbounded but can be learned from data
- We defined a simple prior for stochastic infinite trees
- We expect to use it for latent trees:
  - Variable-resolution hierarchical clustering (cf. hLDA): deep semantic categories just when needed
  - Also for variable-order HMMs (pruning approach: Wang+, ICDM 2006)

Future Work
- Fast variational inference: obviates Gibbs sampling for inference and prediction (CVB for the HDP: Teh et al., this NIPS)
- A more elaborate tree prior than a single Beta: relationship to tail-free processes (Fabius 1964; Ferguson 1974)