Hierarchical Bayesian Models of Language and Text

1 Hierarchical Bayesian Models of Language and Text Yee Whye Teh Gatsby Computational Neuroscience Unit, UCL Joint work with Frank Wood *, Jan Gasthaus *, Cedric Archambeau, Lancelot James

2 Overview Probabilistic Models for Language and Text Sequences The Sequence Memoizer Hierarchical Bayesian Modelling on Context Trees Modelling Power Laws with Pitman-Yor Processes Non-Markov Models Efficient Computation Conclusions

3 Overview Probabilistic Models for Language and Text Sequences The Sequence Memoizer Hierarchical Bayesian Modelling on Context Trees Modelling Power Laws with Pitman-Yor Processes Non-Markov Models Efficient Computation Conclusions

4 Sequence Models for Language and Text Probabilistic models for sequences of words and characters, e.g. the word sequence statistical, machine, learning or the character sequence s, t, a, t, i, s, t, i, c, a, l, _, m, a, c, h, i, n, e, _, l, e, a, r, n, i, n, g. Uses: natural language processing (speech recognition, OCR, machine translation), compression, cognitive models of language acquisition. Sequence data also arises in many other domains.

5 Probabilistic Modelling Set of potential outcomes/observations X. Set of unobserved latent variables Y. Joint distribution over X and Y, with model parameters $\theta$: $P(x \in X, y \in Y \mid \theta)$. Inference: $P(y \in Y \mid x \in X, \theta) = \frac{P(y, x \mid \theta)}{P(x \mid \theta)}$. Learning: $P(\text{training data} \mid \theta)$. Bayesian learning: $P(\theta \mid \text{training data}) = \frac{P(\text{training data} \mid \theta)\, P(\theta)}{Z}$. (Rev. Thomas Bayes)
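As a toy illustration of the inference rule above, the following Python sketch computes a posterior over a binary latent variable; the prior and likelihood numbers are invented for the example and are not from the slides.

```python
# Toy posterior computation: a binary latent variable y and one observation x.
# Prior and likelihood values are invented for illustration.

prior = {"y=0": 0.7, "y=1": 0.3}          # P(y | theta)
likelihood = {"y=0": 0.2, "y=1": 0.9}     # P(x | y, theta) for the observed x

joint = {y: prior[y] * likelihood[y] for y in prior}    # P(y, x | theta)
evidence = sum(joint.values())                          # P(x | theta)
posterior = {y: joint[y] / evidence for y in joint}     # P(y | x, theta)

print(posterior)   # {'y=0': 0.341..., 'y=1': 0.658...}
```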

6 Communication via Noisy Channel Sentence (e.g. Mary has a little lamb) -> Utterance -> Reconstructed sentence (e.g. Mary likes little Sam). $P(s \mid u) = \frac{P(s)\, P(u \mid s)}{P(u)}$

7 Communication via Noisy Channel Sentence (e.g. Mary has a little lamb) -> Foreign sentence (e.g. María tiene un pequeño cordero) -> Reconstructed sentence (e.g. Mary has a little lamb). $P(s \mid u) = \frac{P(s)\, P(u \mid s)}{P(u)}$

8 Markov Models for Language and Text Probabilistic models for sequences of words and characters. $P(\text{statistical machine learning}) = P(\text{statistical}) \times P(\text{machine} \mid \text{statistical}) \times P(\text{learning} \mid \text{statistical machine})$. (Andrey Markov) Usually makes a Markov assumption: $P(\text{statistical machine learning}) = P(\text{statistical}) \times P(\text{machine} \mid \text{statistical}) \times P(\text{learning} \mid \text{machine})$. (George E. P. Box) Order of Markov model typically ranges from ~3 to > 10.
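A minimal sketch of the first-order Markov factorization on this slide; the probability values are placeholders, not estimates from any corpus.

```python
# First-order Markov (bigram) factorization of a sentence probability.
# The probability values are placeholders, not estimates from data.

p_unigram = {"statistical": 0.001}
p_bigram = {("statistical", "machine"): 0.05, ("machine", "learning"): 0.2}

def sentence_prob(words):
    """P(words) = P(w1) * prod_i P(w_i | w_{i-1}) under a bigram model."""
    prob = p_unigram[words[0]]
    for prev, cur in zip(words, words[1:]):
        prob *= p_bigram[(prev, cur)]
    return prob

print(sentence_prob(["statistical", "machine", "learning"]))   # 1e-05
```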

9 Sparsity in Markov Models Consider a high order Markov model: $P(\text{sentence}) = \prod_i P(\text{word}_i \mid \text{word}_{i-N+1} \ldots \text{word}_{i-1})$. Large vocabulary size means naïvely estimating the parameters of this model from data counts is problematic for N > 2: $P_{ML}(\text{word}_i \mid \text{word}_{i-N+1} \ldots \text{word}_{i-1}) = \frac{C(\text{word}_{i-N+1} \ldots \text{word}_i)}{C(\text{word}_{i-N+1} \ldots \text{word}_{i-1})}$. Naïve priors/regularization fail as well: most parameters have no associated data. Remedies: smoothing; hierarchical Bayesian models.
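The maximum-likelihood estimate above and its sparsity problem can be illustrated directly with counts; the tiny corpus below is invented for the example.

```python
# Maximum-likelihood trigram estimate P_ML(w | context) = C(context, w) / C(context),
# and the sparsity problem: unseen contexts/words get probability zero.
from collections import Counter

corpus = "statistical machine learning makes statistical inference easy".split()
N = 3   # trigram model

context_counts = Counter(tuple(corpus[i:i + N - 1]) for i in range(len(corpus) - N + 2))
ngram_counts = Counter(tuple(corpus[i:i + N]) for i in range(len(corpus) - N + 1))

def p_ml(word, context):
    if context_counts[context] == 0:
        return 0.0   # context never observed: no data at all for this parameter
    return ngram_counts[context + (word,)] / context_counts[context]

print(p_ml("learning", ("statistical", "machine")))    # 1.0 (one count out of one)
print(p_ml("learning", ("machine", "translation")))    # 0.0 (context unseen)
```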

10 Smoothing in Language Models Smoothing is a way of dealing with data sparsity by combining large and small models together: $P_{\text{smooth}}(\text{word}_i \mid \text{word}_{i-N+1}^{i-1}) = \sum_{n=1}^{N} \lambda(n)\, q_n(\text{word}_i \mid \text{word}_{i-n+1}^{i-1})$. Combines the expressive power of large models with the better estimation of small models (cf. the bias-variance trade-off). Example: $P_{\text{smooth}}(\text{learning} \mid \text{statistical machine}) = \lambda(3)\, q_3(\text{learning} \mid \text{statistical machine}) + \lambda(2)\, q_2(\text{learning} \mid \text{machine}) + \lambda(1)\, q_1(\text{learning})$.
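A small sketch of the interpolated smoothing formula above; the component models $q_n$ and the weights $\lambda(n)$ are hand-picked placeholders rather than quantities estimated from data as in Kneser-Ney.

```python
# Interpolated smoothing: P_smooth(w | context) = sum_n lambda(n) * q_n(w | shorter context).
# The weights and the component models q_n are hand-picked placeholders here.

lam = {1: 0.1, 2: 0.3, 3: 0.6}   # interpolation weights lambda(n), summing to 1

def q(n, word, context):
    """Placeholder order-n model: probability of `word` given the last n-1 context words."""
    table = {
        (3, ("statistical", "machine"), "learning"): 0.5,
        (2, ("machine",), "learning"): 0.3,
        (1, (), "learning"): 0.01,
    }
    key_context = tuple(context[-(n - 1):]) if n > 1 else ()
    return table.get((n, key_context, word), 0.0)

def p_smooth(word, context):
    return sum(lam[n] * q(n, word, context) for n in lam)

print(p_smooth("learning", ["statistical", "machine"]))   # 0.6*0.5 + 0.3*0.3 + 0.1*0.01 = 0.391
```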

11 Smoothing in Language Models [Chen and Goodman 1998] found that interpolated and modified Kneser-Ney smoothing are best under virtually all circumstances.

12 Overview Probabilistic Models for Language and Text Sequences The Sequence Memoizer Hierarchical Bayesian Modelling on Context Trees Modelling Power Laws with Pitman-Yor Processes Non-Markov Models Efficient Computation Conclusions

13 Hierarchical Bayesian Models Hierarchical modelling is an important overarching theme in modern statistics [Gelman et al 1995, James & Stein 1961]. In machine learning, hierarchical models have been used for multitask learning, transfer learning, learning-to-learn and domain adaptation. (Graphical model: a shared parameter $\phi_0$ with group-specific parameters $\phi_1, \phi_2, \phi_3$ and observations $x_{1i}, x_{2i}, x_{3i}$, $i = 1 \ldots n_1, n_2, n_3$.)

14 Context Tree The contexts of the conditional probabilities are naturally organized using a tree. Smoothing makes the conditional probabilities of neighbouring contexts more similar: $P_{\text{smooth}}(\text{learning} \mid \text{statistical machine}) = \lambda(3)\, q_3(\text{learning} \mid \text{statistical machine}) + \lambda(2)\, q_2(\text{learning} \mid \text{machine}) + \lambda(1)\, q_1(\text{learning})$. Later words in the context are more important in predicting the next word. (Tree of contexts: machine; statistical machine, a machine, Bayesian machine; in statistical machine, is statistical machine.)

15 Hierarchical Bayesian Models on Context Tree Parametrize the conditional probabilities of the Markov model: $P(\text{word}_i = w \mid \text{word}_{i-N+1}^{i-1} = u) = G_u(w)$, where $G_u = [G_u(w)]_{w \in \text{vocabulary}}$ is a probability vector associated with context u [MacKay and Peto 1994]. (Context tree of probability vectors: $G_{[]}$; $G_{\text{machine}}$; $G_{\text{statistical machine}}$, $G_{\text{a machine}}$, $G_{\text{Bayesian machine}}$; $G_{\text{in statistical machine}}$, $G_{\text{is statistical machine}}$.)

16 Hierarchical Dirichlet Language Models What is $P(G_u \mid G_{pa(u)})$? [MacKay and Peto 1994] proposed using the standard Dirichlet distribution over probability vectors. (Perplexity table comparing IKN, MKN and HDLM over training size T and context length N-1.) We will use Pitman-Yor processes instead [Perman, Pitman and Yor 1992], [Pitman and Yor 1997], [Ishwaran and James 2001].
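As a rough sketch of the kind of smoothing such a Dirichlet prior induces: with $G_u \mid G_{pa(u)} \sim \text{Dirichlet}(\alpha G_{pa(u)})$ and the parent distribution treated as known (a simplification of the full hierarchical model), the posterior predictive interpolates counts with the parent's prediction. The counts, alpha and vocabulary size below are placeholders.

```python
# Dirichlet-style smoothing sketch: with G_u | G_pa(u) ~ Dirichlet(alpha * G_pa(u)) and the
# parent distribution treated as known, the posterior predictive interpolates counts with
# the parent's prediction. Counts, alpha and vocab_size are placeholders.

def dirichlet_pred(word, context, counts, alpha, vocab_size):
    if context is None:                          # past the root: uniform base distribution
        return 1.0 / vocab_size
    cu = sum(counts.get(context, {}).values())   # total count in context u
    cw = counts.get(context, {}).get(word, 0)    # count of `word` in context u
    parent = context[1:] if context else None    # drop the earliest word of the context
    return (cw + alpha * dirichlet_pred(word, parent, counts, alpha, vocab_size)) / (cu + alpha)

counts = {("machine",): {"learning": 3, "translation": 1}, (): {"learning": 2, "the": 5}}
print(dirichlet_pred("learning", ("machine",), counts, alpha=1.0, vocab_size=1000))
```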

17 Overview Probabilistic Models for Language and Text Sequences The Sequence Memoizer Hierarchical Bayesian Modelling on Context Trees Modelling Power Laws with Pitman-Yor Processes Non-Markov Models Efficient Computation Conclusions

18 Pitman-Yor Processes (Figure: word frequency vs. rank (according to frequency), comparing Pitman-Yor, English text and Dirichlet.)

19 Pitman-Yor Processes (PY)

20 Chinese Restaurant Processes Easiest to understand them using Chinese restaurant processes: customers $x_1, x_2, \ldots$ enter a restaurant and sit at tables, where table $j$ has $c_j$ customers and serves dish $y_j$. $p(\text{sit at table } k) = \frac{c_k - d}{\theta + \sum_{j=1}^K c_j}$, $p(\text{sit at new table}) = \frac{\theta + dK}{\theta + \sum_{j=1}^K c_j}$, $p(\text{table serves dish } y) = H(y)$. This defines an exchangeable stochastic process over sequences $x_1, x_2, \ldots$; the de Finetti measure is the Pitman-Yor process: $G \sim \mathrm{PY}(\theta, d, H)$, $x_i \mid G \sim G$ for $i = 1, 2, \ldots$ [Perman, Pitman & Yor 1992, Pitman & Yor 1997].
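A short simulation of the Chinese restaurant process seating rule above; the toy vocabulary and the parameter values are arbitrary choices for illustration.

```python
# Simulate the Pitman-Yor Chinese restaurant process: customer i sits at table k with
# probability proportional to (c_k - d), or at a new table with probability proportional
# to (theta + d*K); a new table serves a dish drawn from the base distribution H
# (here: uniform over a toy vocabulary). Vocabulary and parameters are arbitrary.
import random

def sample_pycrp(n, theta, d, vocabulary, seed=0):
    rng = random.Random(seed)
    table_counts, table_dishes, draws = [], [], []
    for _ in range(n):
        K = len(table_counts)
        weights = [c - d for c in table_counts] + [theta + d * K]
        k = rng.choices(range(K + 1), weights=weights)[0]
        if k == K:                                # open a new table, draw its dish from H
            table_counts.append(1)
            table_dishes.append(rng.choice(vocabulary))
        else:                                     # join an existing table
            table_counts[k] += 1
        draws.append(table_dishes[k])
    return draws, table_counts

draws, counts = sample_pycrp(20, theta=1.0, d=0.5, vocabulary=["a", "b", "c", "d", "e"])
print(draws)
print(counts)
```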

21 Power Law Properties of Pitman-Yor Processes Chinese restaurant process: $p(\text{sit at table } k) \propto c_k - d$, $p(\text{sit at new table}) \propto \theta + dK$. Pitman-Yor processes produce distributions over words that follow a power law with index $1 + d$. Customers = word instances, tables = dictionary look-ups; a small number of common word types and a large number of rare word types. This is more suitable for language than Dirichlet distributions. [Goldwater, Griffiths and Johnson 2005] investigated the Pitman-Yor process from this perspective.
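One way to see this behaviour empirically is to simulate the seating rule and watch how the number of occupied tables (distinct dictionary look-ups) grows with the number of customers; the parameter values below are arbitrary.

```python
# Empirical look at the power law: the number of occupied tables under a Pitman-Yor CRP
# grows roughly like n^d, whereas for the Dirichlet process (d = 0) it grows only
# logarithmically. Parameter values are arbitrary.
import random

def num_tables(n, theta, d, seed=0):
    rng = random.Random(seed)
    counts = []
    for _ in range(n):
        K = len(counts)
        weights = [c - d for c in counts] + [theta + d * K]
        k = rng.choices(range(K + 1), weights=weights)[0]
        if k == K:
            counts.append(1)
        else:
            counts[k] += 1
    return len(counts)

for n in [1000, 3000, 10000]:
    print(n, num_tables(n, theta=1.0, d=0.8), num_tables(n, theta=1.0, d=0.0))
```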

22 Power Law Properties of Pitman-Yor Processes (Figure: word frequency vs. rank (according to frequency), comparing Pitman-Yor, English text and Dirichlet.)

23 Power Law Properties of Pitman-Yor Processes

24 Hierarchical Pitman-Yor Language Models Parametrize the conditional probabilities of the Markov model: $P(\text{word}_i = w \mid \text{word}_{i-N+1}^{i-1} = u) = G_u(w)$, where $G_u = [G_u(w)]_{w \in \text{vocabulary}}$ is a probability vector associated with context u. Place a Pitman-Yor process prior on each $G_u$. (Context tree of probability vectors: $G_{[]}$; $G_{\text{machine}}$; $G_{\text{statistical machine}}$, $G_{\text{a machine}}$, $G_{\text{Bayesian machine}}$; $G_{\text{in statistical machine}}$, $G_{\text{is statistical machine}}$.)
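For intuition, a sketch of the hierarchical Pitman-Yor predictive probability given a fixed seating arrangement, which has the same interpolated, discounted shape as Kneser-Ney. The counts, table counts and parameters below are invented for illustration, and for simplicity a single theta and d are shared across levels (in the model they depend on context length).

```python
# Hierarchical Pitman-Yor predictive probability for a fixed seating arrangement:
#   P(w | u) = (max(c_uw - d*t_uw, 0) + (theta + d*t_u) * P(w | parent(u))) / (theta + c_u),
# recursing to shorter contexts and finally to a uniform base distribution.

def hpy_pred(word, context, c, t, theta, d, vocab_size):
    if context is None:                              # past the root: uniform base H
        return 1.0 / vocab_size
    cu = sum(c.get(context, {}).values())            # customers in restaurant `context`
    tu = sum(t.get(context, {}).values())            # tables in restaurant `context`
    cw = c.get(context, {}).get(word, 0)             # customers eating dish `word`
    tw = t.get(context, {}).get(word, 0)             # tables serving dish `word`
    parent = context[1:] if context else None        # drop the earliest word of the context
    back_off = hpy_pred(word, parent, c, t, theta, d, vocab_size)
    return (max(cw - d * tw, 0) + (theta + d * tu) * back_off) / (theta + cu)

c = {("machine",): {"learning": 3, "translation": 1}, (): {"learning": 2, "translation": 1, "the": 5}}
t = {("machine",): {"learning": 1, "translation": 1}, (): {"learning": 1, "translation": 1, "the": 2}}
print(hpy_pred("learning", ("machine",), c, t, theta=1.0, d=0.5, vocab_size=1000))
```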

25 Hierarchical Pitman-Yor Language Models Significantly improved on the hierarchical Dirichlet language model. Results are competitive with Kneser-Ney smoothing, the state-of-the-art in language modelling. (Perplexity table comparing IKN, MKN, HDLM and HPYLM over training size T and context length N-1.) The similarity of perplexities is not a surprise: Kneser-Ney can be derived as a particular approximate inference method in the HPYLM.

26 Overview Probabilistic Models for Language and Text Sequences The Sequence Memoizer Hierarchical Bayesian Modelling on Context Trees Modelling Power Laws with Pitman-Yor Processes Non-Markov Models Efficient Computation Conclusions

27 Markov Models for Language and Text Usually we make a Markov assumption to simplify the model: $P(\text{south parks road}) \approx P(\text{south}) \times P(\text{parks} \mid \text{south}) \times P(\text{road} \mid \text{south parks})$. Language models are usually Markov models of order 2-4 (3-5-grams). How do we determine the order of our Markov models? Is the Markov assumption a reasonable assumption? Be nonparametric about Markov order...

28 Non-Markov Models for Language and Text Model the conditional probability of each possible word occurring after each possible context (of unbounded length). Use a hierarchical Pitman-Yor process prior to share information across all contexts. The hierarchy is infinitely deep. This is the sequence memoizer. (Context tree of probability vectors: $G_{[]}$; $G_{\text{machine}}$; $G_{\text{statistical machine}}$, $G_{\text{a machine}}$, $G_{\text{Bayesian machine}}$; $G_{\text{in statistical machine}}$, $G_{\text{is statistical machine}}$; $G_{\text{this is statistical machine}}$; ...)

29 Overview Probabilistic Models for Language and Text Sequences The Sequence Memoizer Hierarchical Bayesian Modelling on Context Trees Modelling Power Laws with Pitman-Yor Processes Non-Markov Models Efficient Computation Conclusions

30 Model Size: Infinite -> $O(T^2)$ The sequence memoizer model is very large (actually, infinite). Given a training sequence (e.g. o,a,c,a,c), most of the model can be ignored (integrated out), leaving a finite number of nodes in the context tree. But there are still $O(T^2)$ nodes in the context tree... (Figure: context tree for o,a,c,a,c rooted at H and $G_{[]}$, with nodes $G_{[o]}, G_{[a]}, G_{[c]}, G_{[oa]}, G_{[ac]}, G_{[ca]}, G_{[oac]}, G_{[cac]}, G_{[aca]}, G_{[acac]}, G_{[oaca]}, G_{[oacac]}$.)

31 Model Size: Infinite -> $O(T^2)$ -> 2T Idea: integrate out the non-branching, non-leaf nodes of the context tree. The resulting tree is related to a suffix tree data structure, and has at most 2T nodes. There are linear time construction algorithms [Ukkonen 1995]. (Figure: compacted context tree for o,a,c,a,c rooted at H and $G_{[]}$, with remaining nodes $G_{[o]}, G_{[a]}, G_{[ac]}, G_{[oa]}, G_{[oac]}, G_{[oaca]}, G_{[oacac]}$ and edges labelled by context strings.)

32 Closure under Marginalization In marginalizing out non-branching interior nodes, we need to ensure that the resulting conditional distributions are still tractable. E.g. in the chain $G_{[a]} \to G_{[ca]} \to G_{[aca]}$ with $G_{[ca]} \mid G_{[a]} \sim \mathrm{PY}(\theta_2, d_2, G_{[a]})$ and $G_{[aca]} \mid G_{[ca]} \sim \mathrm{PY}(\theta_3, d_3, G_{[ca]})$, what is $G_{[aca]} \mid G_{[a]}$ once $G_{[ca]}$ is marginalized out? If each conditional were Dirichlet, the resulting conditional would not be of known analytic form.

33 Closure under Marginalization In marginalizing out non-branching interior nodes, we need to ensure that the resulting conditional distributions are still tractable. For certain parameter settings, Pitman-Yor processes are closed under marginalization [Pitman 1999, Ho, James & Lau 2006]: if $G_{[ca]} \mid G_{[a]} \sim \mathrm{PY}(\theta_2, d_2, G_{[a]})$ and $G_{[aca]} \mid G_{[ca]} \sim \mathrm{PY}(\theta_2 d_3, d_3, G_{[ca]})$, then marginally $G_{[aca]} \mid G_{[a]} \sim \mathrm{PY}(\theta_2 d_3, d_2 d_3, G_{[a]})$.
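Restating the example with generic symbols (the naming $G_0, G_1, G_2$ is introduced here; the parameter pattern follows the example above, with $G_0 = G_{[a]}$, $G_1 = G_{[ca]}$, $G_2 = G_{[aca]}$):

```latex
% Closure under marginalization, restated with generic symbols.
\[
  G_1 \mid G_0 \sim \mathrm{PY}(\alpha,\, d_1,\, G_0), \qquad
  G_2 \mid G_1 \sim \mathrm{PY}(\alpha d_2,\, d_2,\, G_1)
\]
\[
  \Longrightarrow \qquad
  G_2 \mid G_0 \sim \mathrm{PY}(\alpha d_2,\, d_1 d_2,\, G_0)
  \quad \text{(with $G_1$ integrated out).}
\]
```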

34 Coagulation and Fragmentation Operators (Figure: two random partitions, π1 and π2.)

35 Coagulation and Fragmentation Operators (Figure: coagulation merges the blocks of one partition according to the other, producing blocks labelled A, B, C, D.)

36 Coagulation and Fragmentation Operators (Figure: the coagulated partition obtained after merging.)

37 Coagulation and Fragmentation Operators (Figure: fragmentation splits the merged blocks back apart, reversing the coagulation.)
38 Coagulation and Fragmentation Operators The following statements are equivalent: (I) $\pi_2 \sim \mathrm{CRP}_{[n]}(\alpha d_2, d_2)$ and $\pi_1 \mid \pi_2 \sim \mathrm{CRP}_{\pi_2}(\alpha, d_1)$; (II) $C \sim \mathrm{CRP}_{[n]}(\alpha d_2, d_1 d_2)$ and, for each block $a \in C$, $F_a \mid C \sim \mathrm{CRP}_a(-d_1 d_2, d_2)$. Here $C$ is the partition obtained by coagulating $\pi_1$ and $\pi_2$, and the $F_a$ are the fragmentations of its blocks.

39 Final Model Specification Probability of a sequence: $P(x_{1:T}) = \prod_{i=1}^{T} P(x_i \mid x_{1:i-1}) = \prod_{i=1}^{T} G_{x_{1:i-1}}(x_i)$. Prior over the conditional probabilities: $G_{[]} \sim \mathrm{PY}(\theta_{[]}, d_{[]}, H)$ and $G_u \mid G_{\sigma(u)} \sim \mathrm{PY}(\theta_u, d_u, G_{\sigma(u)})$ for $u \in \Sigma^* \setminus \{[]\}$, where $\sigma(u)$ is the longest proper suffix of $u$. Constraint on parameters: $\theta_u = \theta_{[]} \prod_{v \neq [],\ v\ \text{suffix of}\ u} d_v$.

40 Comparison to Finite Order HPYLM

41 Inference using Gibbs Sampling (Figure: number of tables as a function of the number of observations, for up to 1e7 observations.)

42 Inference using Gibbs Sampling

43 Entropic Coding for Compression Encoder: symbol $x_i$ is coded using the model's predictive probability $P(x_i \mid x_1 \ldots x_{i-1}, \theta_i)$, at a cost of $-\log_2 P(x_i \mid x_1 \ldots x_{i-1}, \theta_i)$ bits. Decoder: maintains the same model and the same predictive probabilities, so it can decode $x_i$. Here $\theta_i$ is the parameter value estimated from $x_1 \ldots x_{i-1}$. A good probabilistic model = a good compressor. (Claude Shannon)
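A small sketch of the model-to-compressor link: summing the ideal code lengths $-\log_2 P(x_i \mid x_1 \ldots x_{i-1})$ gives the compressed size in bits. The stand-in model below is a fixed distribution over three characters, not the sequence memoizer's adaptive predictor.

```python
# Ideal code length under a probabilistic model: symbol x_i costs -log2 P(x_i | history)
# bits, and the compressed size is the sum of these costs.
import math

def total_code_length(sequence, predict):
    """Sum of -log2 P(x_i | x_1..x_{i-1}) over the sequence."""
    bits = 0.0
    for i, x in enumerate(sequence):
        bits += -math.log2(predict(x, sequence[:i]))
    return bits

probs = {"a": 0.5, "b": 0.25, "c": 0.25}            # placeholder model, ignores the history
predict = lambda x, history: probs[x]

seq = list("abacab")
print(total_code_length(seq, predict), "bits for", len(seq), "symbols")   # 9.0 bits
```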

44 Compression Results (Calgary corpus)
Model               Average bits/byte
gzip                2.61
bzip
CTW                 1.99
PPM                 1.93
Sequence Memoizer   1.89
SM inference: particle filter. PPM: Prediction by Partial Matching. CTW: Context Tree Weighting. Online inference, entropic coding.

45 Related Work Infinite Markov models [Mochihashi & Sumita 2008]. Bayesian nonparametric grammars (Goldwater, Johnson, Blunsom, Cohn etc.). Text compression: Prediction by Partial Matching [Cleary & Witten 1984], Context Tree Weighting [Willems et al 1995]. Language model smoothing algorithms [Chen & Goodman 1998, Kneser & Ney 1995]. Variable length/order/memory Markov models [Ron et al 1996, Buhlmann & Wyner 1999, Begleiter et al.]. Hierarchical Bayesian nonparametric models [Teh & Jordan 2010].

46 Conclusions Probabilistic models of sequences without making Markov assumptions, with efficient construction and inference algorithms. State-of-the-art text compression and language modelling results. Hierarchical Bayesian modelling leads to improved performance. Pitman-Yor processes allow us to encode prior knowledge about power-law properties, leading to improved performance. Hierarchical Pitman-Yor processes have been used successfully for various more linguistically motivated models. (Java implementation) (text compression demo) Jan Gasthaus's webpage (C++ implementation)

47 Publications
A Hierarchical Bayesian Language Model based on Pitman-Yor Processes. Y.W. Teh. Coling/ACL.
A Bayesian Interpretation of Interpolated Kneser-Ney. Y.W. Teh. Technical Report TRA2/06, School of Computing, NUS, revised.
A Stochastic Memoizer for Sequence Data. F. Wood, C. Archambeau, J. Gasthaus, L. F. James and Y.W. Teh. ICML.
Text Compression Using a Hierarchical Pitman-Yor Process Prior. J. Gasthaus, F. Wood and Y.W. Teh. DCC.
Forgetting Counts: Constant Memory Inference for a Dependent Hierarchical Pitman-Yor Process. N. Bartlett, D. Pfau and F. Wood. ICML.
Some Improvements to the Sequence Memoizer. J. Gasthaus and Y.W. Teh. NIPS.
The Sequence Memoizer. F. Wood, J. Gasthaus, C. Archambeau, L. F. James and Y.W. Teh. CACM.
Hierarchical Bayesian Nonparametric Models with Applications. Y.W. Teh and M.I. Jordan. In Bayesian Nonparametrics, Cambridge University Press, 2010.

48 Thank You! Acknowledgements: Frank Wood, Jan Gasthaus, Cedric Archambeau, Lancelot James Lee Kuan Yew Foundation Gatsby Charitable Foundation
