Bayesian (Nonparametric) Approaches to Language Modeling

Bayesian (Nonparametric) Approaches to Language Modeling Frank Wood unemployed (and homeless) January, 2013 Wood (University of Oxford) January, 2013 1 / 34

Message
Whole Distributions As Random Variables
Hierarchical Modeling
Embracing Uncertainty
Wood (University of Oxford) January, 2013 2 / 34

Challenge Problem Predict What Comes Next 01001001, 01101110, 00100000, 01110100, 01101000, 01110010, 01100101, 01100101, 00100000, 01100100, 01100001, 01111001, 01110011, 00100000, 01111001, 01101111, 01110101, 01110010, 00100000, 01101000, 01100001, 01110010, 01100100, 00100000, 01100100, 01110010, 01101001, 01110110, 01100101, 00100000, 01110111, 01101001, 01101100, 01101100, 00100000, 01100011, 01110010, 01100001, 01110011, 01101000... Wood (University of Oxford) January, 2013 3 / 34

0.6 < Entropy¹ of English < 1.3 bits/byte [Shannon, 1951]
[Figure: bits/byte vs. stream length (log10), one curve per L = 10^3, ..., 10^7; sequence memoizer with life-long learning computational characteristics [Bartlett, Pfau, and Wood, 2010].]
Wikipedia: 26.8GB compressed to 4GB (≈1.2 bits/byte); gzip: 7.9GB; PAQ: 3.8GB
L is the number of states in the SM; data is the head of the 2010 Wikipedia dump, depth 16.
¹ Entropy = $-\frac{1}{T}\sum_{t=1}^{T} \log_2 P(x_t \mid x_{1:t-1})$ (lower is better)
Wood (University of Oxford) January, 2013 4 / 34

How Good Is 1.5 bits/byte?² the fastest way to make money is... being player. bucketti wish i made it end in the reducers and assemblance smart practices to allow city is something to walk in most of the work of agriculture. i d be able to compete with an earlier goals : words the danger of conduct in southern california, has set of juice of the fights lack of audience that the eclasm beverly hills. she companies or running back down that the book, grass. and the coynes, exaggerate between 1972. the pad, a debate on emissions on air political capital crashing that the new obviously program irock price, coach began refugees, much and
² constant-space byte-SM, NY Times corpus (50MB), log-loss 1.49 bits/byte, 2 million restaurants
Wood (University of Oxford) January, 2013 5 / 34

Notation and iid Conditional Distribution Estimation
Let Σ be a set of symbols, let $\mathbf{x} = x_1, x_2, \ldots, x_T$ be an observed sequence with $x_t, s, s' \in \Sigma$, and let $u \in W = \{w_1, w_2, \ldots, w_K\}$ with $w_k \in \Sigma^+$.
$G_u(s) = \frac{N(us)}{\sum_{s' \in \Sigma} N(us')}$
is the maximum likelihood estimator for the conditional distribution $P(X = s \mid u) \equiv G_u(s)$.
Estimating a finite set of conditional distributions corresponding to a finite set of contexts u generally yields a poor model. Why?
Long contexts occur infrequently.
Maximum likelihood estimates given few observations are often poor.
A finite set of contexts might be sparse.
Wood (University of Oxford) January, 2013 6 / 34
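To make the estimator on this slide concrete, here is a minimal Python sketch (the function name `mle_conditional` is mine, not from the talk) that tabulates the counts N(us) for all contexts of a fixed length and normalizes them into $G_u$:

```python
from collections import Counter, defaultdict

def mle_conditional(seq, context_len):
    """Maximum-likelihood estimate G_u(s) = N(us) / sum_{s'} N(us')
    for every context u of length context_len observed in seq."""
    counts = defaultdict(Counter)
    for t in range(context_len, len(seq)):
        u = tuple(seq[t - context_len:t])
        counts[u][seq[t]] += 1
    return {u: {s: n / sum(c.values()) for s, n in c.items()}
            for u, c in counts.items()}

x = list("abracadabra")
print(mle_conditional(x, context_len=1)[('a',)])   # {'b': 0.5, 'c': 0.25, 'd': 0.25}
```

On real text most long contexts occur zero or one times, so these raw estimates degrade quickly as the context length grows, which is exactly the sparsity problem listed above.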

Bayesian Smoothing
Use independent Dirichlet priors, i.e.
$G_u \sim \mathrm{Dir}(\alpha/|\Sigma|, \ldots, \alpha/|\Sigma|)$
$x_t \mid x_{t-k}, \ldots, x_{t-1} = u \sim G_u$
and embrace uncertainty about the $G_u$'s:
$P(x_{T+1} = s \mid \mathbf{x}, u) = \int P(x_{T+1} = s \mid G_u)\, P(G_u \mid \mathbf{x})\, dG_u = E[G_u(s)]$,
yielding, for instance,
$E[G_u(s)] = \frac{N(us) + \alpha/|\Sigma|}{\sum_{s' \in \Sigma} N(us') + \alpha}$
Wood (University of Oxford) January, 2013 7 / 34
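A hedged sketch of the posterior mean above, assuming the symmetric Dirichlet prior with total mass α spread uniformly over Σ (the function name is illustrative, not from the talk):

```python
def dirichlet_predictive(counts, sigma, alpha=1.0):
    """Posterior mean E[G_u(s)] = (N(us) + alpha/|Sigma|) / (N(u.) + alpha)
    under a symmetric Dir(alpha/|Sigma|, ..., alpha/|Sigma|) prior."""
    total = sum(counts.get(s, 0) for s in sigma)
    return {s: (counts.get(s, 0) + alpha / len(sigma)) / (total + alpha)
            for s in sigma}

# Counts for one context u; unseen symbols still receive non-zero probability.
print(dirichlet_predictive({'b': 2, 'c': 1, 'd': 1}, sigma="abcd", alpha=2.0))
```

Unlike the MLE, probability mass is reserved for unseen symbols, but every context is smoothed toward the same flat distribution, which motivates the hierarchical scheme on the next slide.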

Hierarchical Smoothing
Smooth $G_u$ using a similar distribution $G_{\sigma(u)}$, i.e.
$G_u \sim \mathrm{Dir}(c\,G_{\sigma(u)})$³ or $G_u \sim \mathrm{DP}(c\,G_{\sigma(u)})$⁴
with, for instance, $\sigma(x_1 x_2 x_3 \ldots x_t) = x_2 x_3 \ldots x_t$ (the suffix operator).
³ [MacKay and Peto, 1995]   ⁴ [Teh et al., 2006]
Wood (University of Oxford) January, 2013 8 / 34
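The sketch below (my own approximation, not the authors' inference) turns this idea into a predictive rule by recursively interpolating the counts at context u with the estimate at its suffix σ(u), bottoming out in a uniform distribution over Σ; it is the posterior-mean recursion for the Dirichlet version of the model with all Chinese-restaurant bookkeeping ignored:

```python
def backoff_predictive(counts_by_ctx, u, s, sigma, c=1.0):
    """Approximate E[G_u(s)] under G_u ~ Dir(c * G_sigma(u)): interpolate the counts
    observed in context u with the parent context's estimate. The empty context
    backs off to a uniform distribution over sigma."""
    parent = (1.0 / len(sigma) if len(u) == 0
              else backoff_predictive(counts_by_ctx, u[1:], s, sigma, c))  # u[1:] = sigma(u)
    counts = counts_by_ctx.get(u, {})
    total = sum(counts.values())
    return (counts.get(s, 0) + c * parent) / (total + c)

counts = {('a',): {'b': 2, 'c': 1}, (): {'a': 4, 'b': 2, 'c': 1}}
print(backoff_predictive(counts, ('a',), 'b', sigma="abc"))   # ~0.57
```

Each context "backs off" to its shorter suffix, so rare long contexts borrow strength from the more frequently observed short ones.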

Natural Question Is it possible to simultaneously estimate all distributions in such a model, even in the infinite-depth case? Wood (University of Oxford) January, 2013 9 / 34

Unexpected Answer
Is it possible to simultaneously estimate all distributions in such a model, even in the infinite-depth case? YES!
Sequence memoizer (SM), Wood et al. [2011]. The SM is a compact O(T), unbounded-depth, hierarchical, Bayesian-nonparametric distribution over discrete sequence data.
Graphical Pitman-Yor process (GPYP), Wood and Teh [2008]. The GPYP is a hierarchical sharing prior with a directed acyclic dependency structure.
Wood (University of Oxford) January, 2013 9 / 34

SM = hierarchical PY process [Teh, 2006] of unbounded depth.
[Figure: Binary Sequence Memoizer graphical model]
Wood (University of Oxford) January, 2013 10 / 34

Binary Sequence Memoizer Notation
$G_\varepsilon \mid U_\Sigma, d_0 \sim \mathrm{PY}(d_0, 0, U_\Sigma)$
$G_u \mid G_{\sigma(u)}, d_u \sim \mathrm{PY}(d_u, 0, G_{\sigma(u)}) \quad \forall u \in \Sigma^+$
$x_t \mid x_1, \ldots, x_{t-1} = u \sim G_u$
Here $\sigma(x_1 x_2 x_3 \ldots x_t) = x_2 x_3 \ldots x_t$ is the suffix operator, $\Sigma = \{0, 1\}$, and $U_\Sigma = [0.5, 0.5]$.
Encodes natural assumptions:
Recency assumption encoded by the form of coupling between related conditional distributions ("back-off").
Power-law effects encoded by the choice of hierarchical modeling glue: the Pitman-Yor process (PY).
Wood (University of Oxford) January, 2013 11 / 34

Pitman-Yor process
A Pitman-Yor process $\mathrm{PY}(d, c, H)$ is a distribution over distributions with three parameters:
A discount $0 \le d < 1$ that controls power-law behavior; $d = 0$ recovers the Dirichlet process (DP).
A concentration $c > -d$, like that of the DP.
A base distribution $H$, also like that of the DP.
Key properties: if $G \sim \mathrm{PY}(d, c, H)$ then a priori
$E[G(s)] = H(s)$
$\mathrm{Var}[G(s)] = \frac{1-d}{1+c}\,H(s)(1 - H(s))$
Wood (University of Oxford) January, 2013 12 / 34
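A small simulation of this power-law behavior via the Pitman-Yor Chinese restaurant representation (my own sketch, not code from the talk): with discount d > 0 the number of occupied tables grows roughly like n^d and table sizes are power-law distributed, while d = 0 gives ordinary DP behavior. This is the property the next slide illustrates with English word frequencies.

```python
import random

def pitman_yor_crp(n, d=0.8, c=1.0, seed=0):
    """Seat n customers by the Pitman-Yor CRP: the (i+1)-th customer joins existing
    table k with probability (n_k - d)/(i + c) and starts a new table with
    probability (c + d*K)/(i + c), where K is the current number of tables."""
    rng = random.Random(seed)
    tables = []                           # number of customers at each table
    for i in range(n):
        r = rng.uniform(0, i + c)         # total unnormalized mass is i + c
        acc = 0.0
        for k, nk in enumerate(tables):
            acc += nk - d
            if r < acc:
                tables[k] += 1
                break
        else:                             # remaining mass c + d*K: open a new table
            tables.append(1)
    return tables

sizes = pitman_yor_crp(100_000, d=0.8)
print(len(sizes), "tables; largest:", sorted(sizes)[-5:])
```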

Text and Pitman-Yor Process Power-Law Properties
[Figure: log-log plot of word frequency vs. rank (according to frequency), comparing English text with samples from Pitman-Yor and Dirichlet processes.]
Power-law scaling of word frequencies in English text. Relative word frequency is plotted against each word's rank when ordered according to frequency. There are a few very common words and a large number of relatively rare words. Figure from Wood et al. [2011].
Wood (University of Oxford) January, 2013 13 / 34

Big Problem Space. And time. Wood (University of Oxford) January, 2013 14 / 34

Coagulation and Fragmentation⁵
PY properties used to achieve the linear-space sequence memoizer.
Coagulation: if $G_2 \mid G_1 \sim \mathrm{PY}(d_1, 0, G_1)$ and $G_3 \mid G_2 \sim \mathrm{PY}(d_2, 0, G_2)$, then marginally $G_3 \mid G_1 \sim \mathrm{PY}(d_1 d_2, 0, G_1)$.
Fragmentation: suppose $G_3 \mid G_1 \sim \mathrm{PY}(d_1 d_2, 0, G_1)$. Operations exist to draw $G_2 \mid G_1 \sim \mathrm{PY}(d_1, 0, G_1)$ and $G_3 \mid G_2 \sim \mathrm{PY}(d_2, 0, G_2)$, i.e. to reinstantiate a $G_2$ that had been marginalized out.
⁵ [Pitman, 1999, Ho et al., 2006, Wood et al., 2009]
Wood (University of Oxford) January, 2013 15 / 34

O(T) Sequence Memoizer⁶
[Figure: graphical model for x = 110100]
⁶ [Wood et al., 2009]
Wood (University of Oxford) January, 2013 16 / 34
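The reason the graphical model above needs only O(T) nodes is the coagulation property from the previous slide: chains of non-branching contexts at which nothing is observed can be marginalized away, exactly as in a path-compressed suffix tree. The rough sketch below is my own illustration (not the authors' data structure): it builds the naive trie of all contexts of a sequence and counts how many nodes would survive that compression.

```python
import random

def context_trie_sizes(x):
    """Build the naive trie of every context x_{t-k:t-1} (read most recent symbol
    first) and compare its size with the nodes kept after path compression:
    branching nodes plus nodes in which an observation actually occurs."""
    root, observed = {}, set()
    for t in range(len(x)):
        node, path = root, ()
        for s in reversed(x[:t]):
            path += (s,)
            node = node.setdefault(s, {})
        observed.add(path)                       # x_t is observed in context x_{1:t-1}

    def count(node):                             # all nodes in the naive trie
        return 1 + sum(count(child) for child in node.values())

    kept = 0
    def walk(node, path):
        nonlocal kept
        if path != () and (len(node) >= 2 or path in observed):
            kept += 1
        for s, child in node.items():
            walk(child, path + (s,))
    walk(root, ())
    return count(root) - 1, kept                 # (naive nodes, compressed nodes)

print(context_trie_sizes(list("110100")))        # the slide's example sequence
rng = random.Random(0)
x = [rng.choice("01") for _ in range(300)]
print(context_trie_sizes(x))                     # naive grows ~O(T^2); compressed stays <= 2T
```

For the slide's sequence this prints (11, 6); for a random binary sequence the first count grows roughly quadratically in T while the second stays below 2T.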

O(T) Sequence Memoizer vs. Smoothed nth-order Markov Model
[Figure: perplexity (left axis) and number of nodes (right axis) vs. context length n, comparing Markov and SM perplexity and node counts.]
SM versus nth-order Markov models with hierarchical PYP priors as n varies. In red is computational complexity.
Wood (University of Oxford) January, 2013 17 / 34

Problem
Nonparametric O(T) space = doesn't work for life-long learning
Wood (University of Oxford) January, 2013 18 / 34

Towards Lifelong Learning: Incremental Estimation
Sequential Monte Carlo inference [Gasthaus, Wood, and Teh, 2010]
Wikipedia 100MB compressed to 20.8MB (1.66 bits/byte⁷)
cf. PAQ: special-purpose, non-incremental, Hutter-prize leader
Still O(T) storage complexity.
⁷ The average log-loss $\frac{1}{T}\sum_{t=1}^{T} -\log_2 P(x_t \mid x_{1:t-1})$ is the average number of bits needed to transmit one byte under optimal entropy encoding (lower is better).
Wood (University of Oxford) January, 2013 19 / 34
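For concreteness, the footnote's quantity is just the negative log predictive probability averaged over the sequence; a tiny helper (hypothetical name) computes it from a model's per-symbol predictive probabilities:

```python
import math

def average_log_loss(pred_probs):
    """(1/T) * sum_t -log2 P(x_t | x_{1:t-1}) in bits per symbol: the rate an
    optimal entropy coder driven by the model would achieve (lower is better)."""
    return -sum(math.log2(p) for p in pred_probs) / len(pred_probs)

# pred_probs[t] is the model's probability of the symbol that actually occurred at t.
print(average_log_loss([0.5, 0.25, 0.9, 0.7]))   # ~0.92 bits/symbol
```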

Towards Lifelong Learning: Constant Space Approximation
Forgetting Counts [Bartlett, Pfau, and Wood, 2010]
[Figure: average log loss (bits) vs. number of states, comparing Naive Finite Depth, Random Deletion, Greedy Deletion, the full Sequence Memoizer, gzip, and bzip2.]
Constant-state-count SM approximation. Data: Calgary corpus; "states" = number of states used in 3-10 grams.
Improvements to the SM [Gasthaus and Teh, 2011]: constant-size representation of predictive distributions.
Wood (University of Oxford) January, 2013 20 / 34

Life-Long Learning SM [Bartlett, Pfau, and Wood, 2010]
[Figure: bits/byte vs. stream length (log10), one curve per L = 10^3, ..., 10^7.]
SM with life-long learning computational characteristics.
Wikipedia: 26.8GB compressed to 4GB (≈1.2 bits/byte)
L is the number of states in the SM; data is the head of the 2010 Wikipedia dump, depth 16.
Wood (University of Oxford) January, 2013 21 / 34

Summary So Far
SM = powerful Bayesian nonparametric sequence model
Intuition: the limit of a smoothing n-gram model as $n \to \infty$
Byte predictive performance in human range
Bayesian-nonparametric = incompatible with lifelong learning
Non-trivial approximations preserve performance
Resulting model compatible with lifelong learning
Widely useful:
Language modeling
Lossless compression
Plug-in replacement for the Markov model component of any complex graphical model
Direct plug-in replacement for an n-gram language model
Software: http://www.sequencememoizer.com/
Wood (University of Oxford) January, 2013 22 / 34

Can We Do Better?
Problems: no long-range coherence; domain specificity.
Solutions:
Develop models that more efficiently encode long-range dependencies.
Develop models that use domain-specific context.
Graphical Pitman-Yor process (GPYP)
Wood (University of Oxford) January, 2013 23 / 34

GPYP Application : Language Model Domain Adaptation Wood (University of Oxford) January, 2013 24 / 34

Doubly Hierarchical Pitman-Yor Process Language Model⁸
$G^D_{\{\}} \sim \mathrm{PY}(d^D_0, \alpha^D_0, \lambda^D_0 U + (1-\lambda^D_0) G^H_{\{\}})$
$G^D_{\{w_{t-1}\}} \sim \mathrm{PY}(d^D_1, \alpha^D_1, \lambda^D_1 G^D_{\{\}} + (1-\lambda^D_1) G^H_{\{w_{t-1}\}})$
$\vdots$
$G^D_{\{w_{t-j}:w_{t-1}\}} \sim \mathrm{PY}(d^D_j, \alpha^D_j, \lambda^D_j G^D_{\{w_{t-j+1}:w_{t-1}\}} + (1-\lambda^D_j) G^H_{\{w_{t-j}:w_{t-1}\}})$
$\vdots$
$x_t \mid w_{t-n+1}:w_{t-1} \sim G^D_{\{w_{t-n+1}:w_{t-1}\}}$
⁸ [Wood and Teh, 2009]
Wood (University of Oxford) January, 2013 25 / 34
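To show how the two hierarchies interact, here is a rough posterior-mean sketch in the same style as the earlier back-off snippet (it reuses `backoff_predictive` from the hierarchical smoothing sketch, ignores all Pitman-Yor/CRP bookkeeping, and uses a single shared c and λ rather than the per-depth d_j, α_j, λ_j of the model): each in-domain context u mixes the in-domain estimate at the shorter context σ(u) with the out-of-domain estimate at the full context u.

```python
def dhpyp_predictive(dom_counts, gen_counts, u, s, sigma, c=1.0, lam=0.5):
    """Approximate P(s | u) for the domain of interest: the base measure at context u
    mixes the in-domain estimate at sigma(u) (weight lam) with the out-of-domain
    estimate at the full context u (weight 1 - lam), then interpolates with the
    in-domain counts at u. Bottoms out in a uniform distribution over sigma."""
    if len(u) == 0:
        base = 1.0 / len(sigma)
    else:
        in_domain = dhpyp_predictive(dom_counts, gen_counts, u[1:], s, sigma, c, lam)
        out_domain = backoff_predictive(gen_counts, u, s, sigma, c)  # sketch from the hierarchical smoothing slide
        base = lam * in_domain + (1 - lam) * out_domain
    counts = dom_counts.get(u, {})
    total = sum(counts.values())
    return (counts.get(s, 0) + c * base) / (total + c)
```

The switch weight λ plays the role of the "switch priors" in the schematic that follows: it decides how much each domain-specific context trusts the shared, out-of-domain hierarchy.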

Domain Adaptation Schematic
[Figure: two domain-specific hierarchies (Domain 1 and Domain 2) coupled to a shared latent hierarchy through switch priors.]
Wood (University of Oxford) January, 2013 26 / 34

Domain Adaptation Effect (Copacetic)
[Figure: test perplexity vs. Brown corpus subset size (×10^5) for Union, DHPYPLM, Mixture, and Baseline.]
Mixture: [Kneser and Steinbiss, 1993]; Union: [Bellegarda, 2004]
Training corpora: Small = State of the Union (SOU) 1945-1963, 1969-2006, 370k tokens, 13k types; Big = Brown 1967, 1M tokens, 50k types.
Test corpus: Johnson's SOU speeches 1963-1969, 37k tokens.
Wood (University of Oxford) January, 2013 27 / 34

Domain Adaptation Effect (Not)
[Figure: test perplexity vs. Brown corpus subset size (×10^5) for Union, DHPYPLM, Mixture, and Baseline.]
Mixture: [Kneser and Steinbiss, 1993]; Union: [Bellegarda, 2004]
Training corpora: Small = Augmented Multi-Party Interaction (AMI) 2007, 800k tokens, 8k types; Big = Brown 1967, 1M tokens, 50k types.
Test corpus: AMI excerpt 2007, 60k tokens.
Wood (University of Oxford) January, 2013 28 / 34

Domain Adaptation Cost-Benefit Analysis
[Figure: test perplexity contours over Brown corpus size (×10^5) and SOU corpus size (×10^4).]
Realistic scenario:
Increasing Brown corpus size imposes computational cost ($).
Increasing SOU corpus size imposes data acquisition cost ($$$).
Wood (University of Oxford) January, 2013 29 / 34

Summary
Whole Distributions As Random Variables
Hierarchical Modeling
Embracing Uncertainty
Wood (University of Oxford) January, 2013 30 / 34

The End Thank you. Wood (University of Oxford) January, 2013 31 / 34

References I
Nicholas Bartlett, David Pfau, and Frank Wood. Forgetting counts: Constant memory inference for a dependent hierarchical Pitman-Yor process. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 63–70, June 2010. URL http://www.icml2010.org/papers/549.pdf.
J. R. Bellegarda. Statistical language model adaptation: review and perspectives. Speech Communication, 42:93–108, 2004.
J. Gasthaus and Y. W. Teh. Improvements to the sequence memoizer. In Proceedings of Neural Information Processing Systems, pages 685–693, 2011.
J. Gasthaus, F. Wood, and Y. W. Teh. Lossless compression based on the Sequence Memoizer. In Data Compression Conference 2010, pages 337–345, 2010.
Wood (University of Oxford) January, 2013 32 / 34

References II
M. W. Ho, L. F. James, and J. W. Lau. Coagulation fragmentation laws induced by general coagulations of two-parameter Poisson-Dirichlet processes. http://arxiv.org/abs/math.pr/0601608, 2006.
R. Kneser and V. Steinbiss. On the dynamic adaptation of stochastic language models. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, pages 586–589, 1993.
D. J. C. MacKay and L. C. Bauman Peto. A hierarchical Dirichlet language model. Natural Language Engineering, 1(2):289–307, 1995.
J. Pitman. Coalescents with multiple collisions. Annals of Probability, 27:1870–1902, 1999.
C. E. Shannon. Prediction and entropy of printed English. The Bell System Technical Journal, pages 50–64, 1951.
Y. W. Teh. A hierarchical Bayesian language model based on Pitman-Yor processes. In Proceedings of the Association for Computational Linguistics, pages 985–992, 2006.
Wood (University of Oxford) January, 2013 33 / 34

References III
Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei. Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101(476):1566–1581, 2006.
F. Wood and Y. W. Teh. A hierarchical, hierarchical Pitman-Yor process language model. In ICML/UAI Nonparametric Bayes Workshop, 2008.
F. Wood and Y. W. Teh. A hierarchical nonparametric Bayesian approach to statistical language model domain adaptation. In Artificial Intelligence and Statistics, 2009.
F. Wood, C. Archambeau, J. Gasthaus, L. James, and Y. W. Teh. A stochastic memoizer for sequence data. In Proceedings of the 26th International Conference on Machine Learning, pages 1129–1136, Montreal, Canada, 2009.
F. Wood, J. Gasthaus, C. Archambeau, L. James, and Y. W. Teh. The sequence memoizer. Communications of the ACM, 2011.
Wood (University of Oxford) January, 2013 34 / 34