Bayesian (Nonparametric) Approaches to Language Modeling

Bayesian (Nonparametric) Approaches to Language Modeling Frank Wood unemployed (and homeless) January, 2013 Wood (University of Oxford) January, 2013 1 / 34

Message
Whole Distributions As Random Variables
Hierarchical Modeling
Embracing Uncertainty
Wood (University of Oxford) January, 2013 2 / 34

Challenge Problem Predict What Comes Next 01001001, 01101110, 00100000, 01110100, 01101000, 01110010, 01100101, 01100101, 00100000, 01100100, 01100001, 01111001, 01110011, 00100000, 01111001, 01101111, 01110101, 01110010, 00100000, 01101000, 01100001, 01110010, 01100100, 00100000, 01100100, 01110010, 01101001, 01110110, 01100101, 00100000, 01110111, 01101001, 01101100, 01101100, 00100000, 01100011, 01110010, 01100001, 01110011, 01101000... Wood (University of Oxford) January, 2013 3 / 34

0.6 < Entropy¹ of English < 1.3 bits/byte [Shannon, 1951]
[Figure: bits/byte vs. stream length (log10), one curve per L = 10^3, ..., 10^7; sequence memoizer with life-long learning computational characteristics [Bartlett, Pfau, and Wood, 2010].]
Wikipedia: 26.8GB compressed to 4GB (≈1.2 bits/byte); gzip: 7.9GB; PAQ: 3.8GB
L is the number of states in the SM; data is the head of the 2010 Wikipedia dump, depth 16.
¹ Entropy = $-\frac{1}{T}\sum_{t=1}^{T} \log_2 P(x_t \mid x_{1:t-1})$ (lower is better)
Wood (University of Oxford) January, 2013 4 / 34

How Good Is 1.5 bits/byte?² the fastest way to make money is... being player. bucketti wish i made it end in the reducers and assemblance smart practices to allow city is something to walk in most of the work of agriculture. i d be able to compete with an earlier goals : words the danger of conduct in southern california, has set of juice of the fights lack of audience that the eclasm beverly hills. she companies or running back down that the book, grass. and the coynes, exaggerate between 1972. the pad, a debate on emissions on air political capital crashing that the new obviously program irock price, coach began refugees, much and
² constant-space byte-SM, NY Times corpus (50MB), log-loss 1.49 bits/byte, 2 million restaurants
Wood (University of Oxford) January, 2013 5 / 34

Notation and iid Conditional Distribution Estimation
Let Σ be a set of symbols, let $\mathbf{x} = x_1, x_2, \ldots, x_T$ be an observed sequence with $x_t, s, s' \in \Sigma$, and let $u \in W = \{w_1, w_2, \ldots, w_K\}$ with $w_k \in \Sigma^+$.
$G_u(s) = \frac{N(us)}{\sum_{s' \in \Sigma} N(us')}$
is the maximum likelihood estimator for the conditional distribution $P(X = s \mid u) \equiv G_u(s)$.
Estimating a finite set of conditional distributions corresponding to a finite set of contexts u generally yields a poor model. Why?
Long contexts occur infrequently.
Maximum likelihood estimates given few observations are often poor.
A finite set of contexts might be sparse.
Wood (University of Oxford) January, 2013 6 / 34
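To make the estimator on this slide concrete, here is a minimal Python sketch (the function name `mle_conditional` is mine, not from the talk) that tabulates the counts N(us) for all contexts of a fixed length and normalizes them into $G_u$:

```python
from collections import Counter, defaultdict

def mle_conditional(seq, context_len):
    """Maximum-likelihood estimate G_u(s) = N(us) / sum_{s'} N(us')
    for every context u of length context_len observed in seq."""
    counts = defaultdict(Counter)
    for t in range(context_len, len(seq)):
        u = tuple(seq[t - context_len:t])
        counts[u][seq[t]] += 1
    return {u: {s: n / sum(c.values()) for s, n in c.items()}
            for u, c in counts.items()}

x = list("abracadabra")
print(mle_conditional(x, context_len=1)[('a',)])   # {'b': 0.5, 'c': 0.25, 'd': 0.25}
```

On real text most long contexts occur zero or one times, so these raw estimates degrade quickly as the context length grows, which is exactly the sparsity problem listed above.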

Bayesian Smoothing
Use independent Dirichlet priors, i.e.
$G_u \sim \mathrm{Dir}(\alpha/|\Sigma|, \ldots, \alpha/|\Sigma|)$
$x_t \mid x_{t-k}, \ldots, x_{t-1} = u \sim G_u$
and embrace uncertainty about the $G_u$'s:
$P(x_{T+1} = s \mid \mathbf{x}, u) = \int P(x_{T+1} = s \mid G_u)\, P(G_u \mid \mathbf{x})\, dG_u = E[G_u(s)]$,
yielding, for instance,
$E[G_u(s)] = \frac{N(us) + \alpha/|\Sigma|}{\sum_{s' \in \Sigma} N(us') + \alpha}$
Wood (University of Oxford) January, 2013 7 / 34
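A hedged sketch of the posterior mean above, assuming the symmetric Dirichlet prior with total mass α spread uniformly over Σ (the function name is illustrative, not from the talk):

```python
def dirichlet_predictive(counts, sigma, alpha=1.0):
    """Posterior mean E[G_u(s)] = (N(us) + alpha/|Sigma|) / (N(u.) + alpha)
    under a symmetric Dir(alpha/|Sigma|, ..., alpha/|Sigma|) prior."""
    total = sum(counts.get(s, 0) for s in sigma)
    return {s: (counts.get(s, 0) + alpha / len(sigma)) / (total + alpha)
            for s in sigma}

# Counts for one context u; unseen symbols still receive non-zero probability.
print(dirichlet_predictive({'b': 2, 'c': 1, 'd': 1}, sigma="abcd", alpha=2.0))
```

Unlike the MLE, probability mass is reserved for unseen symbols, but every context is smoothed toward the same flat distribution, which motivates the hierarchical scheme on the next slide.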

Hierarchical Smoothing
Smooth $G_u$ using a similar distribution $G_{\sigma(u)}$, i.e.
$G_u \sim \mathrm{Dir}(c\,G_{\sigma(u)})$³ or $G_u \sim \mathrm{DP}(c\,G_{\sigma(u)})$⁴
with, for instance, $\sigma(x_1 x_2 x_3 \ldots x_t) = x_2 x_3 \ldots x_t$ (the suffix operator).
³ [MacKay and Peto, 1995]   ⁴ [Teh et al., 2006]
Wood (University of Oxford) January, 2013 8 / 34
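The sketch below (my own approximation, not the authors' inference) turns this idea into a predictive rule by recursively interpolating the counts at context u with the estimate at its suffix σ(u), bottoming out in a uniform distribution over Σ; it is the posterior-mean recursion for the Dirichlet version of the model with all Chinese-restaurant bookkeeping ignored:

```python
def backoff_predictive(counts_by_ctx, u, s, sigma, c=1.0):
    """Approximate E[G_u(s)] under G_u ~ Dir(c * G_sigma(u)): interpolate the counts
    observed in context u with the parent context's estimate. The empty context
    backs off to a uniform distribution over sigma."""
    parent = (1.0 / len(sigma) if len(u) == 0
              else backoff_predictive(counts_by_ctx, u[1:], s, sigma, c))  # u[1:] = sigma(u)
    counts = counts_by_ctx.get(u, {})
    total = sum(counts.values())
    return (counts.get(s, 0) + c * parent) / (total + c)

counts = {('a',): {'b': 2, 'c': 1}, (): {'a': 4, 'b': 2, 'c': 1}}
print(backoff_predictive(counts, ('a',), 'b', sigma="abc"))   # ~0.57
```

Each context "backs off" to its shorter suffix, so rare long contexts borrow strength from the more frequently observed short ones.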

Natural Question Is it possible to simultaneously estimate all distributions in such a model, even in the infinite-depth case? Wood (University of Oxford) January, 2013 9 / 34

Unexpected Answer
Is it possible to simultaneously estimate all distributions in such a model, even in the infinite-depth case? YES!
Sequence memoizer (SM), Wood et al. [2011]. The SM is a compact O(T), unbounded-depth, hierarchical, Bayesian-nonparametric distribution over discrete sequence data.
Graphical Pitman-Yor process (GPYP), Wood and Teh [2008]. The GPYP is a hierarchical sharing prior with a directed acyclic dependency structure.
Wood (University of Oxford) January, 2013 9 / 34

SM = hierarchical PY process [Teh, 2006] of unbounded depth.
[Figure: Binary Sequence Memoizer graphical model]
Wood (University of Oxford) January, 2013 10 / 34

Binary Sequence Memoizer Notation
$G_\varepsilon \mid U_\Sigma, d_0 \sim \mathrm{PY}(d_0, 0, U_\Sigma)$
$G_u \mid G_{\sigma(u)}, d_u \sim \mathrm{PY}(d_u, 0, G_{\sigma(u)}) \quad \forall u \in \Sigma^+$
$x_t \mid x_1, \ldots, x_{t-1} = u \sim G_u$
Here $\sigma(x_1 x_2 x_3 \ldots x_t) = x_2 x_3 \ldots x_t$ is the suffix operator, $\Sigma = \{0, 1\}$, and $U_\Sigma = [0.5, 0.5]$.
Encodes natural assumptions:
Recency assumption encoded by the form of coupling between related conditional distributions ("back-off").
Power-law effects encoded by the choice of hierarchical modeling glue: the Pitman-Yor process (PY).
Wood (University of Oxford) January, 2013 11 / 34

Pitman-Yor process
A Pitman-Yor process $\mathrm{PY}(d, c, H)$ is a distribution over distributions with three parameters:
A discount $0 \le d < 1$ that controls power-law behavior; $d = 0$ recovers the Dirichlet process (DP).
A concentration $c > -d$, like that of the DP.
A base distribution $H$, also like that of the DP.
Key properties: if $G \sim \mathrm{PY}(d, c, H)$ then a priori
$E[G(s)] = H(s)$
$\mathrm{Var}[G(s)] = \frac{1-d}{1+c}\,H(s)(1 - H(s))$
Wood (University of Oxford) January, 2013 12 / 34
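A small simulation of this power-law behavior via the Pitman-Yor Chinese restaurant representation (my own sketch, not code from the talk): with discount d > 0 the number of occupied tables grows roughly like n^d and table sizes are power-law distributed, while d = 0 gives ordinary DP behavior. This is the property the next slide illustrates with English word frequencies.

```python
import random

def pitman_yor_crp(n, d=0.8, c=1.0, seed=0):
    """Seat n customers by the Pitman-Yor CRP: the (i+1)-th customer joins existing
    table k with probability (n_k - d)/(i + c) and starts a new table with
    probability (c + d*K)/(i + c), where K is the current number of tables."""
    rng = random.Random(seed)
    tables = []                           # number of customers at each table
    for i in range(n):
        r = rng.uniform(0, i + c)         # total unnormalized mass is i + c
        acc = 0.0
        for k, nk in enumerate(tables):
            acc += nk - d
            if r < acc:
                tables[k] += 1
                break
        else:                             # remaining mass c + d*K: open a new table
            tables.append(1)
    return tables

sizes = pitman_yor_crp(100_000, d=0.8)
print(len(sizes), "tables; largest:", sorted(sizes)[-5:])
```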

Text and Pitman-Yor Process Power-Law Properties
[Figure: log-log plot of word frequency vs. rank (according to frequency), comparing English text with samples from Pitman-Yor and Dirichlet processes.]
Power-law scaling of word frequencies in English text. Relative word frequency is plotted against each word's rank when ordered according to frequency. There are a few very common words and a large number of relatively rare words. Figure from Wood et al. [2011].
Wood (University of Oxford) January, 2013 13 / 34

Big Problem Space. And time. Wood (University of Oxford) January, 2013 14 / 34

Coagulation and Fragmentation⁵
PY properties used to achieve the linear-space sequence memoizer.
Coagulation: if $G_2 \mid G_1 \sim \mathrm{PY}(d_1, 0, G_1)$ and $G_3 \mid G_2 \sim \mathrm{PY}(d_2, 0, G_2)$, then marginally $G_3 \mid G_1 \sim \mathrm{PY}(d_1 d_2, 0, G_1)$.
Fragmentation: suppose $G_3 \mid G_1 \sim \mathrm{PY}(d_1 d_2, 0, G_1)$. Operations exist to draw $G_2 \mid G_1 \sim \mathrm{PY}(d_1, 0, G_1)$ and $G_3 \mid G_2 \sim \mathrm{PY}(d_2, 0, G_2)$, i.e. to reinstantiate a $G_2$ that had been marginalized out.
⁵ [Pitman, 1999, Ho et al., 2006, Wood et al., 2009]
Wood (University of Oxford) January, 2013 15 / 34

O(T) Sequence Memoizer⁶
[Figure: graphical model for x = 110100]
⁶ [Wood et al., 2009]
Wood (University of Oxford) January, 2013 16 / 34
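The reason the graphical model above needs only O(T) nodes is the coagulation property from the previous slide: chains of non-branching contexts at which nothing is observed can be marginalized away, exactly as in a path-compressed suffix tree. The rough sketch below is my own illustration (not the authors' data structure): it builds the naive trie of all contexts of a sequence and counts how many nodes would survive that compression.

```python
import random

def context_trie_sizes(x):
    """Build the naive trie of every context x_{t-k:t-1} (read most recent symbol
    first) and compare its size with the nodes kept after path compression:
    branching nodes plus nodes in which an observation actually occurs."""
    root, observed = {}, set()
    for t in range(len(x)):
        node, path = root, ()
        for s in reversed(x[:t]):
            path += (s,)
            node = node.setdefault(s, {})
        observed.add(path)                       # x_t is observed in context x_{1:t-1}

    def count(node):                             # all nodes in the naive trie
        return 1 + sum(count(child) for child in node.values())

    kept = 0
    def walk(node, path):
        nonlocal kept
        if path != () and (len(node) >= 2 or path in observed):
            kept += 1
        for s, child in node.items():
            walk(child, path + (s,))
    walk(root, ())
    return count(root) - 1, kept                 # (naive nodes, compressed nodes)

print(context_trie_sizes(list("110100")))        # the slide's example sequence
rng = random.Random(0)
x = [rng.choice("01") for _ in range(300)]
print(context_trie_sizes(x))                     # naive grows ~O(T^2); compressed stays <= 2T
```

For the slide's sequence this prints (11, 6); for a random binary sequence the first count grows roughly quadratically in T while the second stays below 2T.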

O(T) Sequence Memoizer vs. Smoothed nth-order Markov Model
[Figure: perplexity (left axis) and number of nodes (right axis) vs. context length n, comparing Markov and SM perplexity and node counts.]
SM versus nth-order Markov models with hierarchical PYP priors as n varies. In red is computational complexity.
Wood (University of Oxford) January, 2013 17 / 34

Problem
Nonparametric O(T) space = doesn't work for life-long learning
Wood (University of Oxford) January, 2013 18 / 34

Towards Lifelong Learning: Incremental Estimation
Sequential Monte Carlo inference [Gasthaus, Wood, and Teh, 2010]
Wikipedia 100MB compressed to 20.8MB (1.66 bits/byte⁷)
cf. PAQ: special-purpose, non-incremental, Hutter-prize leader
Still O(T) storage complexity.
⁷ The average log-loss $\frac{1}{T}\sum_{t=1}^{T} -\log_2 P(x_t \mid x_{1:t-1})$ is the average number of bits needed to transmit one byte under optimal entropy encoding (lower is better).
Wood (University of Oxford) January, 2013 19 / 34
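For concreteness, the footnote's quantity is just the negative log predictive probability averaged over the sequence; a tiny helper (hypothetical name) computes it from a model's per-symbol predictive probabilities:

```python
import math

def average_log_loss(pred_probs):
    """(1/T) * sum_t -log2 P(x_t | x_{1:t-1}) in bits per symbol: the rate an
    optimal entropy coder driven by the model would achieve (lower is better)."""
    return -sum(math.log2(p) for p in pred_probs) / len(pred_probs)

# pred_probs[t] is the model's probability of the symbol that actually occurred at t.
print(average_log_loss([0.5, 0.25, 0.9, 0.7]))   # ~0.92 bits/symbol
```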

Towards Lifelong Learning: Constant Space Approximation
Forgetting Counts [Bartlett, Pfau, and Wood, 2010]
[Figure: average log loss (bits) vs. number of states, comparing Naive Finite Depth, Random Deletion, Greedy Deletion, the full Sequence Memoizer, gzip, and bzip2.]
Constant-state-count SM approximation. Data: Calgary corpus; "states" = number of states used in 3-10 grams.
Improvements to the SM [Gasthaus and Teh, 2011]: constant-size representation of predictive distributions.
Wood (University of Oxford) January, 2013 20 / 34

Life-Long Learning SM [Bartlett, Pfau, and Wood, 2010]
[Figure: bits/byte vs. stream length (log10), one curve per L = 10^3, ..., 10^7.]
SM with life-long learning computational characteristics.
Wikipedia: 26.8GB compressed to 4GB (≈1.2 bits/byte)
L is the number of states in the SM; data is the head of the 2010 Wikipedia dump, depth 16.
Wood (University of Oxford) January, 2013 21 / 34

Summary So Far
SM = powerful Bayesian nonparametric sequence model
Intuition: the limit of a smoothing n-gram model as $n \to \infty$
Byte predictive performance in human range
Bayesian-nonparametric = incompatible with lifelong learning
Non-trivial approximations preserve performance
Resulting model compatible with lifelong learning
Widely useful:
Language modeling
Lossless compression
Plug-in replacement for the Markov model component of any complex graphical model
Direct plug-in replacement for an n-gram language model
Software: http://www.sequencememoizer.com/
Wood (University of Oxford) January, 2013 22 / 34

Can We Do Better?
Problems: no long-range coherence; domain specificity.
Solutions:
Develop models that more efficiently encode long-range dependencies.
Develop models that use domain-specific context.
Graphical Pitman-Yor process (GPYP)
Wood (University of Oxford) January, 2013 23 / 34

GPYP Application : Language Model Domain Adaptation Wood (University of Oxford) January, 2013 24 / 34

Doubly Hierarchical Pitman-Yor Process Language Model⁸
$G^D_{\{\}} \sim \mathrm{PY}(d^D_0, \alpha^D_0, \lambda^D_0 U + (1-\lambda^D_0) G^H_{\{\}})$
$G^D_{\{w_{t-1}\}} \sim \mathrm{PY}(d^D_1, \alpha^D_1, \lambda^D_1 G^D_{\{\}} + (1-\lambda^D_1) G^H_{\{w_{t-1}\}})$
$\vdots$
$G^D_{\{w_{t-j}:w_{t-1}\}} \sim \mathrm{PY}(d^D_j, \alpha^D_j, \lambda^D_j G^D_{\{w_{t-j+1}:w_{t-1}\}} + (1-\lambda^D_j) G^H_{\{w_{t-j}:w_{t-1}\}})$
$\vdots$
$x_t \mid w_{t-n+1}:w_{t-1} \sim G^D_{\{w_{t-n+1}:w_{t-1}\}}$
⁸ [Wood and Teh, 2009]
Wood (University of Oxford) January, 2013 25 / 34
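To show how the two hierarchies interact, here is a rough posterior-mean sketch in the same style as the earlier back-off snippet (it reuses `backoff_predictive` from the hierarchical smoothing sketch, ignores all Pitman-Yor/CRP bookkeeping, and uses a single shared c and λ rather than the per-depth d_j, α_j, λ_j of the model): each in-domain context u mixes the in-domain estimate at the shorter context σ(u) with the out-of-domain estimate at the full context u.

```python
def dhpyp_predictive(dom_counts, gen_counts, u, s, sigma, c=1.0, lam=0.5):
    """Approximate P(s | u) for the domain of interest: the base measure at context u
    mixes the in-domain estimate at sigma(u) (weight lam) with the out-of-domain
    estimate at the full context u (weight 1 - lam), then interpolates with the
    in-domain counts at u. Bottoms out in a uniform distribution over sigma."""
    if len(u) == 0:
        base = 1.0 / len(sigma)
    else:
        in_domain = dhpyp_predictive(dom_counts, gen_counts, u[1:], s, sigma, c, lam)
        out_domain = backoff_predictive(gen_counts, u, s, sigma, c)  # sketch from the hierarchical smoothing slide
        base = lam * in_domain + (1 - lam) * out_domain
    counts = dom_counts.get(u, {})
    total = sum(counts.values())
    return (counts.get(s, 0) + c * base) / (total + c)
```

The switch weight λ plays the role of the "switch priors" in the schematic that follows: it decides how much each domain-specific context trusts the shared, out-of-domain hierarchy.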

Domain Adaptation Schematic
[Figure: two domain-specific hierarchies (Domain 1 and Domain 2) coupled to a shared latent hierarchy through switch priors.]
Wood (University of Oxford) January, 2013 26 / 34

Domain Adaptation Effect (Copacetic)
[Figure: test perplexity vs. Brown corpus subset size (×10^5) for Union, DHPYPLM, Mixture, and Baseline.]
Mixture: [Kneser and Steinbiss, 1993]; Union: [Bellegarda, 2004]
Training corpora: Small = State of the Union (SOU) 1945-1963, 1969-2006, 370k tokens, 13k types; Big = Brown 1967, 1M tokens, 50k types.
Test corpus: Johnson's SOU speeches 1963-1969, 37k tokens.
Wood (University of Oxford) January, 2013 27 / 34

Domain Adaptation Effect (Not)
[Figure: test perplexity vs. Brown corpus subset size (×10^5) for Union, DHPYPLM, Mixture, and Baseline.]
Mixture: [Kneser and Steinbiss, 1993]; Union: [Bellegarda, 2004]
Training corpora: Small = Augmented Multi-Party Interaction (AMI) 2007, 800k tokens, 8k types; Big = Brown 1967, 1M tokens, 50k types.
Test corpus: AMI excerpt 2007, 60k tokens.
Wood (University of Oxford) January, 2013 28 / 34

Domain Adaptation Cost-Benefit Analysis
[Figure: test perplexity contours over Brown corpus size (×10^5) and SOU corpus size (×10^4).]
Realistic scenario:
Increasing Brown corpus size imposes computational cost ($).
Increasing SOU corpus size imposes data acquisition cost ($$$).
Wood (University of Oxford) January, 2013 29 / 34

Summary
Whole Distributions As Random Variables
Hierarchical Modeling
Embracing Uncertainty
Wood (University of Oxford) January, 2013 30 / 34

The End Thank you. Wood (University of Oxford) January, 2013 31 / 34

References I
Nicholas Bartlett, David Pfau, and Frank Wood. Forgetting counts: Constant memory inference for a dependent hierarchical Pitman-Yor process. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 63–70, June 2010. URL http://www.icml2010.org/papers/549.pdf.
J. R. Bellegarda. Statistical language model adaptation: review and perspectives. Speech Communication, 42:93–108, 2004.
J. Gasthaus and Y. W. Teh. Improvements to the sequence memoizer. In Proceedings of Neural Information Processing Systems, pages 685–693, 2011.
J. Gasthaus, F. Wood, and Y. W. Teh. Lossless compression based on the Sequence Memoizer. In Data Compression Conference 2010, pages 337–345, 2010.
Wood (University of Oxford) January, 2013 32 / 34

References II
M. W. Ho, L. F. James, and J. W. Lau. Coagulation fragmentation laws induced by general coagulations of two-parameter Poisson-Dirichlet processes. http://arxiv.org/abs/math.pr/0601608, 2006.
R. Kneser and V. Steinbiss. On the dynamic adaptation of stochastic language models. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, pages 586–589, 1993.
D. J. C. MacKay and L. C. Bauman Peto. A hierarchical Dirichlet language model. Natural Language Engineering, 1(2):289–307, 1995.
J. Pitman. Coalescents with multiple collisions. Annals of Probability, 27:1870–1902, 1999.
C. E. Shannon. Prediction and entropy of printed English. The Bell System Technical Journal, pages 50–64, 1951.
Y. W. Teh. A hierarchical Bayesian language model based on Pitman-Yor processes. In Proceedings of the Association for Computational Linguistics, pages 985–992, 2006.
Wood (University of Oxford) January, 2013 33 / 34

References III
Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei. Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101(476):1566–1581, 2006.
F. Wood and Y. W. Teh. A hierarchical, hierarchical Pitman-Yor process language model. In ICML/UAI Nonparametric Bayes Workshop, 2008.
F. Wood and Y. W. Teh. A hierarchical nonparametric Bayesian approach to statistical language model domain adaptation. In Artificial Intelligence and Statistics, 2009.
F. Wood, C. Archambeau, J. Gasthaus, L. James, and Y. W. Teh. A stochastic memoizer for sequence data. In Proceedings of the 26th International Conference on Machine Learning, pages 1129–1136, Montreal, Canada, 2009.
F. Wood, J. Gasthaus, C. Archambeau, L. James, and Y. W. Teh. The sequence memoizer. Communications of the ACM, 2011.
Wood (University of Oxford) January, 2013 34 / 34