Hierarchical Bayesian Nonparametric Models of Language and Text

Hierarchical Bayesian Nonparametric Models of Language and Text. Yee Whye Teh, Gatsby Computational Neuroscience Unit, UCL. Joint work with Frank Wood, Jan Gasthaus, Cedric Archambeau, Lancelot James. SIGIR Workshop on Feature Generation and Selection, July 2010.

Overview Probabilistic Models for Language and Text Sequences The Sequence Memoizer Hierarchical Bayesian Modelling on Context Trees Modelling Power Laws with Pitman-Yor Processes Non-Markov Models Efficient Inference and Computation Conclusions

Sequence Models for Language and Text Probabilistic models for sequences of words and characters, e.g. (toad, in, a, hole) or (t, o, a, d, _, i, n, _, a, _, h, o, l, e). Uses: natural language processing (speech recognition, OCR, machine translation), compression, information retrieval, cognitive models of language acquisition. Sequence data arises in many other domains.

Markov Models for Language and Text Probabilistic models for sequences of words and characters. P(toad in a hole) = P(toad) * P(in | toad) * P(a | toad in) * P(hole | toad in a). Usually makes a Markov assumption: P(toad in a hole) ≈ P(toad) * P(in | toad) * P(a | in) * P(hole | a). [Portraits: Andrey Markov (1856-1922); George E. P. Box.] Order of Markov model typically ranges from ~2 to > 10.

Sparsity in Markov Models Consider a high order Markov model: P(sentence) = ∏_i P(word_i | word_{i-N+1} ... word_{i-1}). Large vocabulary size means naïvely estimating the parameters of this model from data counts is impossible for N > 2: P_ML(word_i | word_{i-N+1} ... word_{i-1}) = C(word_{i-N+1} ... word_i) / C(word_{i-N+1} ... word_{i-1}). Naïve priors assuming independent parameters fail as well: most parameters will have no associated data. Smoothing. Hierarchical Bayesian models.
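
To make the count-ratio estimator concrete, here is a minimal sketch (not from the talk; the toy corpus and helper names are made up for illustration) of maximum-likelihood n-gram estimation, and of the sparsity problem it runs into:

```python
# Minimal sketch of maximum-likelihood n-gram estimation from counts,
# illustrating why naive counting breaks down for higher orders.
from collections import Counter

def ml_ngram_model(words, n):
    """Return P_ML(w | context) as a dict mapping (context, w) -> probability."""
    ngram_counts = Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    context_counts = Counter(tuple(words[i:i + n - 1]) for i in range(len(words) - n + 1))
    return {
        (ng[:-1], ng[-1]): c / context_counts[ng[:-1]]
        for ng, c in ngram_counts.items()
    }

corpus = "toad in a hole and a toad in a well".split()
trigram = ml_ngram_model(corpus, 3)
# Any context not seen in training, e.g. ("stuck", "in"), has no estimate at all.
print(trigram.get((("toad", "in"), "a")))   # 1.0
print(trigram.get((("stuck", "in"), "a")))  # None: data sparsity
```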

Smoothing in Language Models Smoothing is a way of dealing with data sparsity by combining large and small models together: P_smooth(word_i | word_{i-N+1} ... word_{i-1}) = Σ_{n=1}^{N} λ(n) q_n(word_i | word_{i-n+1} ... word_{i-1}). Combines the expressive power of large models with the better estimation of small models (cf. bias-variance trade-off). P_smooth(hole | toad in a) = λ(4) q_4(hole | toad in a) + λ(3) q_3(hole | in a) + λ(2) q_2(hole | a) + λ(1) q_1(hole).
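
A minimal sketch of this interpolation, with hand-picked weights λ(n) and toy component models standing in for real estimators:

```python
# Sketch of interpolated smoothing: mix n-gram estimates of different orders.
# The component models q_n and the weights lam are placeholders chosen by hand.

def smoothed_prob(word, context, models, lam):
    """P_smooth(word | context) = sum_n lam[n] * q_n(word | last n-1 context words).

    models[n] maps (context_tuple, word) -> q_n(word | context_tuple); missing entries are 0.
    lam[n] are interpolation weights summing to 1.
    """
    total = 0.0
    for n, weight in lam.items():
        ctx = tuple(context[-(n - 1):]) if n > 1 else ()
        total += weight * models[n].get((ctx, word), 0.0)
    return total

# Toy component models (unigram, bigram, trigram) with made-up probabilities.
models = {
    1: {((), "hole"): 0.01},
    2: {(("a",), "hole"): 0.10},
    3: {(("in", "a"), "hole"): 0.30},
}
lam = {1: 0.2, 2: 0.3, 3: 0.5}
print(smoothed_prob("hole", ["toad", "in", "a"], models, lam))  # 0.2*0.01 + 0.3*0.10 + 0.5*0.30
```

Real smoothers such as Kneser-Ney make the weights depend on the context counts rather than fixing them globally.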

Smoothing in Language Models [Chen and Goodman 1998] found that interpolated and modified Kneser-Ney smoothing are the best methods under virtually all circumstances.

Overview Probabilistic Models for Language and Text Sequences The Sequence Memoizer Hierarchical Bayesian Modelling on Context Trees Modelling Power Laws with Pitman-Yor Processes Non-Markov Models Efficient Inference and Computation Conclusions

Hierarchical Bayesian Models Hierarchical modelling is an important overarching theme in modern statistics [Gelman et al, 1995]. In machine learning, hierarchical models have been used for multitask learning, learning-to-learn and domain adaptation. [Graphical model: a shared top-level parameter φ0 with group-specific parameters φ1, φ2, φ3, each generating observations x_1i, x_2i, x_3i for i = 1...n1, n2, n3.]

Hierarchical Bayesian Models in IR Explains word frequencies in one of two ways: some words are common throughout the corpus; some words are frequent only in certain documents. Leads to TF-IDF-like behaviour [Cowans 2004, Momtazi & Klakow 2010].

Context Tree The contexts of the conditional probabilities P(word_i | word_{i-N+1} ... word_{i-1}) are naturally organized using a suffix tree. Smoothing makes conditional probabilities of neighbouring contexts more similar. Later words in the context are more important in predicting the next word. [Context tree fragment: contexts "a", "in a", "kiss a", "about a", "toad in a", "stuck in a".]

Hierarchical Bayesian Models on Context Tree Parametrize the conditional probabilities of the Markov model: P(word_i = w | word_{i-N+1} ... word_{i-1} = u) = G_u(w), where G_u = [G_u(w)]_{w ∈ vocabulary} is a probability vector associated with context u [MacKay and Peto 1994]. [Context tree with a probability vector at each node: G_[], G_[a], G_[in a], G_[kiss a], G_[about a], G_[toad in a], G_[stuck in a].]

Hierarchical Dirichlet Language Models What is P(G_u | G_pa(u))? A standard Dirichlet distribution over probability vectors is a bad idea... We will use Pitman-Yor processes instead [Perman, Pitman and Yor 1992], [Pitman and Yor 1997], [Ishwaran and James 2001]. Perplexity results (T = number of training words, N-1 = context length; IKN = interpolated Kneser-Ney, MKN = modified Kneser-Ney, HDLM = hierarchical Dirichlet language model):

T          N-1   IKN     MKN     HDLM
2×10^6     2     148.8   144.1   191.2
4×10^6     2     137.1   132.7   172.7
6×10^6     2     130.6   126.7   162.3
8×10^6     2     125.9   122.3   154.7
10×10^6    2     122.0   118.6   148.7
12×10^6    2     119.0   115.8   144.0
14×10^6    2     116.7   113.6   140.5
14×10^6    1     169.9   169.2   180.6
14×10^6    3     106.1   102.4   136.6

Overview Probabilistic Models for Language and Text Sequences The Sequence Memoizer Hierarchical Bayesian Modelling on Context Trees Modelling Power Laws with Pitman-Yor Processes Non-Markov Models Efficient Inference and Computation Conclusions

Pitman-Yor Processes Produce power-law distributions more suitable for languages.

Chinese Restaurant Processes Easiest to understand them using the Chinese restaurant process. [Diagram: customers x_1, ..., x_9 seated at tables, each table serving a dish y_1, ..., y_4.]
p(sit at occupied table k) ∝ c_k - d
p(sit at a new table) ∝ θ + d K
p(new table serves dish y) = H(y)
where c_k is the number of customers at table k and K is the current number of tables. Defines an exchangeable stochastic process over sequences x_1, x_2, .... The de Finetti measure is the Pitman-Yor process: G ~ PY(θ, d, H), x_i | G ~ G for i = 1, 2, ....

Power Law Properties of Pitman-Yor Processes Chinese restaurant process: p(sit at occupied table k) ∝ c_k - d, p(sit at a new table) ∝ θ + d K. Pitman-Yor processes produce distributions over words given by a power-law distribution with index 1+d. Customers = word instances, tables = dictionary look-ups; small number of common word types; large number of rare word types. This is more suitable for languages than Dirichlet distributions. [Goldwater, Griffiths and Johnson 2005] investigated the Pitman-Yor process from this perspective.
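
A small simulation of this seating process (an illustrative sketch with arbitrary parameter values) shows the characteristic power-law growth in the number of tables, i.e. word types:

```python
# Minimal simulation of the Pitman-Yor Chinese restaurant process.
# Parameter values here are arbitrary, chosen only for illustration.
import random

def pitman_yor_crp(num_customers, theta=1.0, d=0.8, seed=0):
    """Seat customers sequentially; return the list of table occupancy counts."""
    rng = random.Random(seed)
    counts = []                                   # counts[k] = customers at table k
    for _ in range(num_customers):
        K = len(counts)
        # p(old table k) proportional to c_k - d; p(new table) proportional to theta + d*K.
        weights = [c - d for c in counts] + [theta + d * K]
        choice = rng.choices(range(K + 1), weights=weights, k=1)[0]
        if choice == K:
            counts.append(1)                      # open a new table (new word type)
        else:
            counts[choice] += 1
    return counts

if __name__ == "__main__":
    for n in (1_000, 10_000, 100_000):
        num_tables = len(pitman_yor_crp(n, theta=1.0, d=0.8))
        print(n, num_tables)   # number of tables grows roughly like n**d (power law)
```

With d = 0 this reduces to the Chinese restaurant process of the Dirichlet process, whose number of tables grows only logarithmically in the number of customers.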

Hierarchical Pitman-Yor Language Models Parametrize the conditional probabilities of the Markov model: P(word_i = w | word_{i-N+1} ... word_{i-1} = u) = G_u(w), where G_u = [G_u(w)]_{w ∈ vocabulary} is a probability vector associated with context u [MacKay and Peto 1994]. Place a Pitman-Yor process prior on each G_u, with the parent context's G_pa(u) as its base distribution. [Context tree: G_[], G_[a], G_[in a], G_[kiss a], G_[about a], G_[toad in a], G_[stuck in a].]

Hierarchical Pitman-Yor Language Models Significantly improved on the hierarchical Dirichlet language model. Results better than interpolated Kneser-Ney smoothing and comparable to modified Kneser-Ney, the state-of-the-art language models. The similarity of perplexities is not a surprise: Kneser-Ney can be derived from the hierarchical Pitman-Yor model as a particular approximate inference method.

T          N-1   IKN     MKN     HDLM    HPYLM
2×10^6     2     148.8   144.1   191.2   144.3
4×10^6     2     137.1   132.7   172.7   132.7
6×10^6     2     130.6   126.7   162.3   126.4
8×10^6     2     125.9   122.3   154.7   121.9
10×10^6    2     122.0   118.6   148.7   118.2
12×10^6    2     119.0   115.8   144.0   115.4
14×10^6    2     116.7   113.6   140.5   113.2
14×10^6    1     169.9   169.2   180.6   169.3
14×10^6    3     106.1   102.4   136.6   101.9
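
The Kneser-Ney connection is easiest to see in the model's predictive rule. Below is a rough sketch of that recursion (simplified: a single discount d and concentration θ shared across all contexts, and the restaurant counts assumed given); roughly, setting θ = 0 and allowing at most one table per observed word type recovers interpolated Kneser-Ney:

```python
# Sketch of the hierarchical Pitman-Yor predictive probability (cf. Teh 2006).
# c[u][w]: customer counts for word w in the restaurant of context u (a tuple).
# t[u][w]: number of tables serving w in that restaurant.
# These counts would come from the Chinese restaurant representation.

def hpy_predict(w, u, c, t, theta, d, vocab_size):
    """P(w | context u), backing off recursively to shorter suffixes of u."""
    if u is None:                                  # above the root: uniform base distribution
        return 1.0 / vocab_size
    cu = sum(c.get(u, {}).values())                # total customers in restaurant u
    tu = sum(t.get(u, {}).values())                # total tables in restaurant u
    cuw = c.get(u, {}).get(w, 0)
    tuw = t.get(u, {}).get(w, 0)
    parent = u[1:] if len(u) > 0 else None         # drop the most distant context word
    p_parent = hpy_predict(w, parent, c, t, theta, d, vocab_size)
    if cu == 0:                                    # unseen context: pure back-off
        return p_parent
    return (cuw - d * tuw) / (theta + cu) + (theta + d * tu) / (theta + cu) * p_parent
```

With d = 0 the discount terms vanish and the rule reduces to add-θ smoothing toward the parent distribution, essentially the hierarchical Dirichlet language model above.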

Overview Probabilistic Models for Language and Text Sequences The Sequence Memoizer Hierarchical Bayesian Modelling on Context Trees Modelling Power Laws with Pitman-Yor Processes Non-Markov Models Efficient Inference and Computation Conclusions

Markov Models for Language and Text Usually makes a Markov assumption to simplify the model: P(toad in a hole) ≈ P(toad) * P(in | toad) * P(a | in) * P(hole | a). Language models: usually Markov models of order 2-4 (3-5-grams). How do we determine the order of our Markov models? Is the Markov assumption a reasonable assumption? Be nonparametric about Markov order as well...

Non-Markov Models for Language and Text Model the conditional probabilities of each possible word occurring after each possible context. Use a hierarchical Pitman-Yor process prior to share information across all contexts. The hierarchy is infinitely deep. The sequence memoizer. [Infinitely deep context tree: G_[], G_[a], G_[in a], G_[kiss a], G_[about a], G_[toad in a], G_[stuck in a], ..., G_[i'm stuck in a].]

Overview Probabilistic Models for Language and Text Sequences The Sequence Memoizer Hierarchical Bayesian Modelling on Context Trees Modelling Power Laws with Pitman-Yor Processes Non-Markov Models Efficient Inference and Computation Conclusions

Model Size: Infinite -> O(n^2) The sequence memoizer model is very large (actually, infinite). Given a training sequence (e.g. o,a,c,a,c), most of the model can be integrated out, leaving a finite number of nodes in the context tree. But there are still O(n^2) nodes G_u in the context tree... [Context tree for o,a,c,a,c: nodes G_[], G_[c], G_[a], G_[o], G_[ac], G_[ca], G_[oa], G_[cac], G_[oac], G_[aca], G_[acac], G_[oaca], G_[oacac], with base distribution H at the root.]

Model Size: Infinite -> O(n^2) -> 2n Idea: integrate out the non-branching, non-leaf nodes of the context tree. The resulting tree has at most 2n nodes. There are linear-time construction algorithms [Ukkonen 1995]. [Collapsed context tree: only the root, branch points and leaves remain, e.g. G_[], G_[a], G_[o], G_[oa], G_[ac], G_[oac], G_[oaca], G_[oacac], with multi-symbol edge labels.]
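
A toy sketch of the collapsing idea (not Ukkonen's linear-time algorithm, which builds the compact tree directly): insert the observed contexts into a plain trie, then splice out internal nodes that have exactly one child. The example contexts here are hypothetical.

```python
# Toy context trie with collapsing of non-branching, non-leaf nodes.
# After collapsing, only the root, branch points and leaves remain,
# and edges carry multi-symbol labels.

class Node:
    def __init__(self):
        self.children = {}            # first symbol of edge -> (edge_label, child_node)

def insert(root, context):
    node = root
    for sym in context:
        if sym not in node.children:
            node.children[sym] = ([sym], Node())
        node = node.children[sym][1]

def collapse(node):
    """Recursively splice out children that have exactly one child of their own."""
    for sym, (label, child) in list(node.children.items()):
        collapse(child)
        if len(child.children) == 1:
            (sub_label, grandchild), = child.children.values()
            node.children[sym] = (label + sub_label, grandchild)

def count_nodes(node):
    return 1 + sum(count_nodes(child) for _, child in node.children.values())

# Hypothetical reversed contexts (most recent symbol first), as in a suffix tree.
root = Node()
for ctx in ["acao", "ao", "cao", "o"]:
    insert(root, ctx)
print(count_nodes(root))   # 10: one trie node per distinct context prefix
collapse(root)
print(count_nodes(root))   # 6: root, one branch point, and the leaves
```

In the sequence memoizer the same splicing is done in the model itself, using the marginalization property on the next slide.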

Closure under Marginalization In marginalizing out non-branching interior nodes, we need to ensure that the resulting conditional distributions are still tractable. E.g. with G_[ca] | G_[a] ~ PY(θ_2, d_2, G_[a]) and G_[aca] | G_[ca] ~ PY(θ_3, d_3, G_[ca]), what is the distribution of G_[aca] | G_[a] once G_[ca] is integrated out? If each conditional were Dirichlet, the resulting conditional would not be of known analytic form.

Closure under Marginalization In marginalizing out non-branching interior nodes, we need to ensure that the resulting conditional distributions are still tractable. For certain parameter settings, Pitman-Yor processes are closed under marginalization! With G_[ca] | G_[a] ~ PY(θ_2, d_2, G_[a]) and G_[aca] | G_[ca] ~ PY(θ_2 d_3, d_3, G_[ca]), integrating out G_[ca] gives G_[aca] | G_[a] ~ PY(θ_2 d_3, d_2 d_3, G_[a]).

Comparison to Finite Order HPYLM The infinite-order sequence memoizer performs better than any finite-order HPYLM.

Overview Probabilistic Models for Language and Text Sequences The Sequence Memoizer Hierarchical Bayesian Modelling on Context Trees Modelling Power Laws with Pitman-Yor Processes Non-Markov Models Efficient Inference and Computation Conclusions

Inference using Gibbs Sampling Gibbs sampling in the Chinese restaurant process representation of Pitman-Yor processes. [Plot: total number of tables (×10^7) versus number of observations (2×10^6 up to 10^7).]

Inference Very efficient: posterior is unimodal and variables are weakly coupled.

Online Algorithms for Compression It is possible to construct the context tree in an online fashion as the training symbols are observed one at a time. Use particle filtering or an interpolated Kneser-Ney approximation for inference. Linear-sized storage and average computation time, but quadratic computation time in the worst case. Online construction, inference and prediction of the next symbol are necessary for use in text compression with entropy coding.
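
The link to entropy coding is just the log loss: an arithmetic coder driven by the model's online predictions uses about -log2 P(next symbol | history) bits per symbol. A minimal sketch (the uniform stand-in model is hypothetical, not the sequence memoizer):

```python
# Sketch: with entropy (arithmetic) coding, the compressed size achieved by an
# online predictive model is, up to small overhead, its log loss on the data.
import math

def bits_per_symbol(predict, sequence):
    """predict(history, symbol) -> model probability of that symbol given history."""
    total_bits = 0.0
    for i, sym in enumerate(sequence):
        p = predict(sequence[:i], sym)
        total_bits += -math.log2(p)
    return total_bits / len(sequence)

# Hypothetical stand-in model: uniform over the 256 byte values gives 8.0 bits/byte;
# the sequence memoizer's predictions bring this down to about 1.9 bits/byte on Calgary.
print(bits_per_symbol(lambda history, sym: 1 / 256, b"toad in a hole"))
```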

Compression Results on the Calgary corpus (average bits/byte):

Model               Average bits/byte
gzip                2.61
bzip2               2.11
CTW                 1.99
PPM                 1.93
Sequence Memoizer   1.89

SM inference: particle filter. PPM: Prediction by Partial Matching. CTW: Context Tree Weighting. Online inference, entropy coding.

Conclusions Probabilistic models of sequences without Markov assumptions, with efficient construction and inference algorithms. State-of-the-art text compression and language modelling results. Hierarchical Bayesian modelling leads to improved performance. Pitman-Yor processes allow us to encode prior knowledge about power-law properties, leading to improved performance. Papers available on my webpage http://www.gatsby.ucl.ac.uk/~ywteh

Related Work Text compression: Prediction by Partial Matching [Cleary & Witten 1984], Context Tree Weighting [Willems et al 1995]... Language model smoothing algorithms [Chen & Goodman 1998, Kneser & Ney 1995]. Variable length/order/memory Markov models [Ron et al 1996, Buhlmann & Wyner 1999, Begleiter et al 2004...]. Hierarchical Bayesian nonparametric models [Teh & Jordan 2010].

Publications A Hierarchical Bayesian Language Model based on Pitman-Yor Processes. Y.W. Teh. Coling/ACL 2006. A Bayesian Interpretation of Interpolated Kneser-Ney. Y.W. Teh. Technical Report TRA2/06, School of Computing, NUS, revised 2006. A Hierarchical Nonparametric Bayesian Approach to Statistical Language Model Domain Adaptation. F. Wood and Y.W. Teh. AISTATS 2009. A Stochastic Memoizer for Sequence Data. F. Wood, C. Archambeau, J. Gasthaus, L. F. James and Y.W. Teh. ICML 2009. Text Compression Using a Hierarchical Pitman-Yor Process Prior. J. Gasthaus, F. Wood and Y.W. Teh. DCC 2010. Hierarchical Bayesian Nonparametric Models with Applications. Y.W. Teh and M.I. Jordan, in Bayesian Nonparametrics. Cambridge University Press, 2010.

Thank You!