Bayesian (Nonparametric) Approaches to Language Modeling

1 Bayesian (Nonparametric) Approaches to Language Modeling. Frank Wood (University of Oxford), "unemployed (and homeless)". January 2013.

2 Message: Whole Distributions as Random Variables; Hierarchical Modeling; Embracing Uncertainty.

3 Challenge Problem: Predict What Comes Next. (The slide shows a long numeric sequence whose digits were lost in transcription.)

4 0.6 < Entropy of English < 1.3 bits/byte [Shannon, 1951]. Figure: bits/byte vs. stream length (log_10) for the sequence memoizer with life-long learning computational characteristics, for several state budgets L = 10^3, 10^4, 10^5, 10^6, ... [Bartlett, Pfau, and Wood, 2010]. Wikipedia: 26.8GB compressed to 4GB (≈ 1.2 bits/byte); gzip: 7.9GB; PAQ: 3.8GB. L is the number of states in the SM; data is the head of the 2010 Wikipedia dump, depth 16. Footnote: Entropy = -(1/T) Σ_{t=1}^{T} log_2 P(x_t | x_{1:t-1}) (lower is better).
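The footnote's average log-loss is what every bits/byte number in this talk reports. Below is a minimal sketch of how it is computed for any sequential predictive model, assuming a hypothetical `model.prob` / `model.update` interface (not the actual SM API):

```python
import math

def average_log_loss(model, data):
    """Average log-loss in bits per symbol: -(1/T) * sum_t log2 P(x_t | x_{1:t-1}).

    `model.prob(context, symbol)` returns the predictive probability of `symbol`
    given the preceding `context`; `model.update(context, symbol)` then
    incorporates the observation. Both names are assumptions for this sketch.
    """
    bits = 0.0
    for t, symbol in enumerate(data):
        context = data[:t]
        bits -= math.log2(model.prob(context, symbol))  # surprise of this symbol
        model.update(context, symbol)                   # learn from it afterwards
    return bits / len(data)
```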

5 How Good Is 1.5 bits/byte? Sample text generated from the model: "the fastest way to make money is... being player. bucketti wish i made it end in the reducers and assemblance smart practices to allow city is something to walk in most of the work of agriculture. i d be able to compete with an earlier goals : words the danger of conduct in southern california, has set of juice of the fights lack of audience that the eclasm beverly hills. she companies or running back down that the book, grass. and the coynes, exaggerate between the pad, a debate on emissions on air political capital crashing that the new obviously program irock price, coach began refugees, much and" Footnote: constant-space, byte-SM, NY Times corpus (50MB), log-loss 1.49 bits/byte, 2 million restaurants.

6 Notation and iid Conditional Distribution Estimation. Let Σ be a set of symbols, let x = x_1, x_2, ..., x_T be an observed sequence with x_t, s, s' ∈ Σ, and let u ∈ W = {w_1, w_2, ..., w_K} with w_k ∈ Σ^+. Then

G_u(s) = N(us) / Σ_{s' ∈ Σ} N(us')

is the maximum likelihood estimator for the conditional distribution P(X = s | u) ≈ G_u(s).

7 Notation and iid Conditional Distribution Estimation. Let Σ be a set of symbols, let x = x_1, x_2, ..., x_T be an observed sequence with x_t, s, s' ∈ Σ, and let u ∈ W = {w_1, w_2, ..., w_K} with w_k ∈ Σ^+. Then

G_u(s) = N(us) / Σ_{s' ∈ Σ} N(us')

is the maximum likelihood estimator for the conditional distribution P(X = s | u) ≈ G_u(s). Estimating a finite set of conditional distributions corresponding to a finite set of contexts u generally yields a poor model. Why?
- Long contexts occur infrequently.
- Maximum likelihood estimates given few observations are often poor.
- A finite set of contexts might be sparse.
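For concreteness, a small count-based sketch (not from the slides) of the quantities above: `context_counts` tabulates N(us) and `ml_conditional` returns the maximum-likelihood estimate. With long contexts most totals are tiny or zero, which is exactly the problem listed above.

```python
from collections import Counter, defaultdict

def context_counts(x, k):
    """N(us): how often symbol s follows each length-k context u in the sequence x."""
    counts = defaultdict(Counter)
    for t in range(k, len(x)):
        u = tuple(x[t - k:t])
        counts[u][x[t]] += 1
    return counts

def ml_conditional(counts, u, s):
    """Maximum-likelihood estimate G_u(s) = N(us) / sum_{s'} N(us')."""
    total = sum(counts[u].values())
    return counts[u][s] / total if total else 0.0
```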

8 Bayesian Smoothing. Use independent symmetric Dirichlet priors, i.e.

G_u ~ Dir(α/|Σ|, ..., α/|Σ|)
x_t | x_{t-k}, ..., x_{t-1} = u ~ G_u

9 Bayesian Smoothing. Use independent symmetric Dirichlet priors, i.e.

G_u ~ Dir(α/|Σ|, ..., α/|Σ|)
x_t | x_{t-k}, ..., x_{t-1} = u ~ G_u

and embrace uncertainty about the G_u's:

P(x_{T+1} = s | x, u) = ∫ P(x_{T+1} = s | G_u) P(G_u | x) dG_u = E[G_u(s) | x],

yielding, for instance,

E[G_u(s) | x] = (N(us) + α/|Σ|) / (Σ_{s' ∈ Σ} N(us') + α)
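A one-line sketch of this posterior-mean (additive-smoothing) estimator, reusing the count structure from the earlier sketch; `alpha` is the total Dirichlet mass and `alphabet` stands in for Σ:

```python
def dirichlet_mean(counts, u, s, alphabet, alpha=1.0):
    """Posterior mean under a symmetric Dirichlet prior with mass alpha/|Sigma| per symbol:
    E[G_u(s) | x] = (N(us) + alpha/|Sigma|) / (sum_{s'} N(us') + alpha)."""
    total = sum(counts[u].values())
    return (counts[u][s] + alpha / len(alphabet)) / (total + alpha)
```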

10 Hierarchical Smoothing. Smooth G_u using a similar distribution G_σ(u), i.e.

G_u ~ Dir(c G_σ(u)) [MacKay and Peto, 1995]   or   G_u ~ DP(c, G_σ(u)) [Teh et al., 2006]

with, for instance, σ(x_1 x_2 x_3 ... x_t) = x_2 x_3 ... x_t (the suffix operator).
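The hierarchy ties each G_u to G_σ(u). Below is a deliberately simplified back-off recursion that uses the posterior mean at each level; it is only meant to convey the structure, not the full hierarchical Dirichlet/DP inference of MacKay and Peto [1995] or Teh et al. [2006]. It assumes a count table covering contexts of every length.

```python
def backoff_mean(counts, u, s, alphabet, c=1.0):
    """Recursive posterior-mean approximation suggested by G_u ~ Dir(c * G_sigma(u)):
    E[G_u(s)] ~= (N(us) + c * E[G_sigma(u)(s)]) / (sum_{s'} N(us') + c).
    Assumes `counts` holds N(us) for contexts of every length up to the maximum,
    e.g. merged from context_counts(x, k) for k = 0, 1, ..., n-1.
    """
    if len(u) == 0:
        parent = 1.0 / len(alphabet)                          # uniform base at the root
    else:
        parent = backoff_mean(counts, u[1:], s, alphabet, c)  # back off to sigma(u)
    ctx = counts.get(u, {})
    total = sum(ctx.values())
    return (ctx.get(s, 0) + c * parent) / (total + c)
```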

11 Natural Question. Is it possible to simultaneously estimate all distributions in such a model, even in the infinite-depth case?

12 Unexpected Answer. Is it possible to simultaneously estimate all distributions in such a model, even in the infinite-depth case? YES!
- Sequence memoizer (SM), Wood et al. [2011]: a compact, O(T), unbounded-depth, hierarchical, Bayesian-nonparametric distribution over discrete sequence data.
- Graphical Pitman-Yor process (GPYP), Wood and Teh [2008]: a hierarchical sharing prior with directed acyclic dependency structure.

13 SM = hierarchical PY process [Teh, 2006] of unbounded depth. Figure: binary sequence memoizer graphical model.

14 Binary Sequence Memoizer Notation.

G_ε | U_Σ, d_0 ~ PY(d_0, 0, U_Σ)
G_u | G_σ(u), d_u ~ PY(d_u, 0, G_σ(u))   for u ∈ Σ^+
x_t | x_1, ..., x_{t-1} = u ~ G_u

Here σ(x_1 x_2 x_3 ... x_t) = x_2 x_3 ... x_t is the suffix operator, Σ = {0, 1}, and U_Σ = [0.5, 0.5]. This encodes natural assumptions:
- A recency assumption, encoded by the way related conditional distributions are coupled ("back-off").
- Power-law effects, encoded by the choice of hierarchical modeling glue: the Pitman-Yor process (PY).
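A tiny illustration (not from the slides) of the suffix operator's role: each context backs off along the chain below until it reaches the empty context ε, whose prior is the uniform base U_Σ.

```python
def context_chain(u):
    """Chain of contexts linking G_u to the root: u, sigma(u), sigma(sigma(u)), ..., eps.
    Each distribution in the binary SM is tied to its parent along this chain."""
    chain = [u]
    while u:
        u = u[1:]          # sigma drops the earliest symbol of the context
        chain.append(u)
    return chain

# context_chain("0110") == ["0110", "110", "10", "0", ""]
```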

15 Pitman-Yor Process. A Pitman-Yor process PY(c, d, H) is a distribution over distributions with three parameters:
- A discount 0 ≤ d < 1 that controls power-law behavior; d = 0 gives the Dirichlet process (DP).
- A concentration c > -d, like that of the DP.
- A base distribution H, also like that of the DP.
Key properties: if G ~ PY(c, d, H) then, a priori,

E[G(s)] = H(s)
Var[G(s)] = ((1 - d) / (1 + c)) H(s)(1 - H(s))
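One way to build intuition for the discount is to simulate draws from a Pitman-Yor process with the two-parameter Chinese restaurant process. The sketch below is the standard textbook construction, not the inference scheme used in the SM; with 0 < d < 1 the table sizes follow a power law, and d = 0 recovers Dirichlet-process behavior.

```python
import random

def pitman_yor_draws(n, d, c, base_draw):
    """Sample n values from G ~ PY(c, d, H) via the Chinese restaurant process.
    `base_draw()` samples from the base distribution H (e.g. a uniform byte)."""
    tables, dishes, draws = [], [], []   # customers per table, dish per table, output
    for _ in range(n):
        if tables:
            # existing table k: weight (n_k - d); new table: weight (c + d * #tables)
            weights = [nk - d for nk in tables] + [c + d * len(tables)]
            k = random.choices(range(len(weights)), weights=weights)[0]
        else:
            k = 0                         # the first customer always opens a table
        if k == len(tables):              # open a new table with a fresh draw from H
            tables.append(1)
            dishes.append(base_draw())
        else:
            tables[k] += 1
        draws.append(dishes[k])
    return draws
```

For example, `pitman_yor_draws(100000, 0.8, 0.0, lambda: random.randrange(256))` produces a heavy-tailed frequency-vs-rank profile of the kind shown on the next slide, whereas setting d = 0 does not.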

16 Text and Pitman-Yor Process Power-Law Properties. Figure: power-law scaling of word frequencies in English text, with Pitman-Yor and Dirichlet fits. Relative word frequency is plotted against each word's rank when ordered according to frequency. There are a few very common words and a large number of relatively rare words. Figure from Wood et al. [2011].

17 Big Problem. Space. And time.

18 Coagulation and Fragmentation. PY properties used to achieve a linear-space sequence memoizer [Pitman, 1999, Ho et al., 2006, Wood et al., 2009].
Coagulation: if G_2 | G_1 ~ PY(d_1, 0, G_1) and G_3 | G_2 ~ PY(d_2, 0, G_2), then G_3 | G_1 ~ PY(d_1 d_2, 0, G_1).
Fragmentation: suppose G_3 | G_1 ~ PY(d_1 d_2, 0, G_1). Operations exist to draw G_2 | G_1 ~ PY(d_1, 0, G_1) and G_3 | G_2 ~ PY(d_2, 0, G_2) with G_2 marginalized out.

19 O(T) Sequence Memoizer. Figure: graphical model for an example sequence x [Wood et al., 2009].

20 O(T) Sequence Memoizer vs. Smoothed n-th-Order Markov Model. Figure: perplexity and number of nodes as functions of context length n, comparing the SM to n-th-order Markov models with hierarchical PYP priors as n varies. In red is computational complexity (number of nodes).

21 Problem. Nonparametric, O(T) space ⇒ doesn't work for life-long learning.

22 Towards Lifelong Learning: Incremental Estimation. Sequential Monte Carlo inference [Gasthaus, Wood, and Teh, 2010]: Wikipedia 100MB compressed to 20.8MB (1.66 bits/byte); compare PAQ, a special-purpose, non-incremental, Hutter-prize leader. Still O(T) storage complexity. Footnote: the average log-loss -(1/T) Σ_{t=1}^{T} log_2 P(x_t | x_{1:t-1}) is the average number of bits needed to transmit one byte under optimal entropy encoding (lower is better).

23 Towards Lifelong Learning: Constant Space Approximation. Forgetting counts [Bartlett, Pfau, and Wood, 2010]. Figure: average log-loss (bits) vs. number of states for naive finite depth, random deletion, greedy deletion, the full sequence memoizer, gzip, and bzip2; the constant-state-count SM approximation. Data: Calgary corpus; "states" = number of states used in 3-10 grams. Improvements to the SM [Gasthaus and Teh, 2011]: constant-size representation of predictive distributions.

24 Life-Long Learning SM [Bartlett, Pfau, and Wood, 2010]. Figure: bits/byte vs. stream length (log_10) for several state budgets L = 10^3, 10^4, 10^5, 10^6, ...; the SM with life-long learning computational characteristics. Wikipedia: 26.8GB compressed to 4GB (≈ 1.2 bits/byte). L is the number of states in the SM; data is the head of the 2010 Wikipedia dump, depth 16.

25 Summary So Far.
- SM = powerful Bayesian nonparametric sequence model. Intuition: the limit of a smoothing n-gram model as n → ∞.
- Byte predictive performance in the human range.
- Bayesian nonparametrics ⇒ incompatible with lifelong learning as-is; non-trivial approximations preserve performance, and the resulting model is compatible with lifelong learning.
- Widely useful: language modeling, lossless compression, a plug-in replacement for the Markov-model component of any complex graphical model, and a direct plug-in replacement for an n-gram language model.
- Software:

26 Can We Do Better? Problems: no long-range coherence; no domain specificity. Solutions: develop models that more efficiently encode long-range dependencies; develop models that use domain-specific context. Graphical Pitman-Yor process (GPYP).

27 GPYP Application: Language Model Domain Adaptation.

28 Doubly Hierarchical Pitman-Yor Process Language Model [Wood and Teh, 2009].

G^D_{} ~ PY(d^D_0, α^D_0, λ^D_0 U + (1 - λ^D_0) G^H_{})
G^D_{w_{t-1}} ~ PY(d^D_1, α^D_1, λ^D_1 G^D_{} + (1 - λ^D_1) G^H_{w_{t-1}})
...
G^D_{w_{t-j}:w_{t-1}} ~ PY(d^D_j, α^D_j, λ^D_j G^D_{w_{t-j+1}:w_{t-1}} + (1 - λ^D_j) G^H_{w_{t-j}:w_{t-1}})
x | w_{t-n+1}:w_{t-1} ~ G^D_{w_{t-n+1}:w_{t-1}}

Here D and H index two corpora (e.g., the target domain and a general domain), and each λ^D_j mixes backing off within D against borrowing strength from H.
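To make the structure concrete, here is a simplified plug-in sketch of the mixing idea: the base measure for each in-domain context combines the in-domain estimate for the shorter context (weight `lam`) with an already-trained general-domain estimator `general_est(u, s)` for the same context (weight `1 - lam`). Both names are assumptions for this sketch, and the actual model in Wood and Teh [2009] places Pitman-Yor priors at every node rather than using this recursion.

```python
def adapted_mean(domain_counts, general_est, u, s, alphabet, c=1.0, lam=0.5):
    """Back-off estimate whose base measure mixes the in-domain estimate for the
    shorter context (weight lam) with the out-of-domain estimate for the same
    context (weight 1 - lam), mirroring lambda_j * G^D + (1 - lambda_j) * G^H."""
    if len(u) == 0:
        base = lam * (1.0 / len(alphabet)) + (1 - lam) * general_est(u, s)
    else:
        base = (lam * adapted_mean(domain_counts, general_est, u[1:], s, alphabet, c, lam)
                + (1 - lam) * general_est(u, s))
    ctx = domain_counts.get(u, {})
    total = sum(ctx.values())
    return (ctx.get(s, 0) + c * base) / (total + c)
```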

29 Domain Adaptation Schematic. Figure: schematic with latent switch priors linking Domain 1 and Domain 2.

30 Domain Adaptation Effect (Copacetic). Figure: test perplexity vs. Brown corpus subset size (×10^5) for Union, DHPYPLM, Mixture, and Baseline. Mixture: [Kneser and Steinbiss, 1993]; Union: [Bellegarda, 2004]. Training corpora: small = State of the Union (SOU) addresses (13k types); big = Brown corpus (50k types). Test corpus: Johnson's SOU speeches.

31 Domain Adaptation Effect (Not). Figure: test perplexity vs. Brown corpus subset size (×10^5) for Union, DHPYPLM, Mixture, and Baseline. Mixture: [Kneser and Steinbiss, 1993]; Union: [Bellegarda, 2004]. Training corpora: small = Augmented Multi-Party Interaction (AMI) corpus (8k types); big = Brown corpus (50k types). Test corpus: AMI excerpt.

32 Domain Adaptation Cost-Benefit Analysis. Figure: test perplexity as a function of Brown corpus size and SOU corpus size (×10^4). Realistic scenario: increasing the Brown corpus size imposes computational cost ($), while increasing the SOU corpus size imposes data acquisition cost ($$$).

33 Summary: Whole Distributions as Random Variables; Hierarchical Modeling; Embracing Uncertainty.

34 The End. Thank you.

35 References I
Nicholas Bartlett, David Pfau, and Frank Wood. Forgetting counts: Constant memory inference for a dependent hierarchical Pitman-Yor process. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 63-70, 2010.
J. R. Bellegarda. Statistical language model adaptation: review and perspectives. Speech Communication, 42:93-108, 2004.
J. Gasthaus and Y. W. Teh. Improvements to the sequence memoizer. In Proceedings of Neural Information Processing Systems, 2011.
J. Gasthaus, F. Wood, and Y. W. Teh. Lossless compression based on the Sequence Memoizer. In Data Compression Conference 2010, 2010.

36 References II
M. W. Ho, L. F. James, and J. W. Lau. Coagulation fragmentation laws induced by general coagulations of two-parameter Poisson-Dirichlet processes. 2006.
R. Kneser and V. Steinbiss. On the dynamic adaptation of stochastic language models. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1993.
D. J. C. MacKay and L. C. Bauman Peto. A hierarchical Dirichlet language model. Natural Language Engineering, 1(2), 1995.
J. Pitman. Coalescents with multiple collisions. Annals of Probability, 27, 1999.
C. E. Shannon. Prediction and entropy of printed English. The Bell System Technical Journal, pages 50-64, 1951.
Y. W. Teh. A hierarchical Bayesian language model based on Pitman-Yor processes. In Proceedings of the Association for Computational Linguistics, 2006.

37 References III
Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei. Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101(476), 2006.
F. Wood and Y. W. Teh. A hierarchical, hierarchical Pitman-Yor process language model. In ICML/UAI Nonparametric Bayes Workshop, 2008.
F. Wood and Y. W. Teh. A hierarchical nonparametric Bayesian approach to statistical language model domain adaptation. In Artificial Intelligence and Statistics, 2009.
F. Wood, C. Archambeau, J. Gasthaus, L. James, and Y. W. Teh. A stochastic memoizer for sequence data. In Proceedings of the 26th International Conference on Machine Learning, Montreal, Canada, 2009.
F. Wood, J. Gasthaus, C. Archambeau, L. James, and Y. W. Teh. The sequence memoizer. Communications of the ACM, 2011.
