Bayesian Tools for Natural Language Learning. Yee Whye Teh Gatsby Computational Neuroscience Unit UCL


1 Bayesian Tools for Natural Language Learning Yee Whye Teh Gatsby Computational Neuroscience Unit UCL

2 Bayesian Learning of Probabilistic Models
Potential outcomes/observations $X$. Unobserved latent variables $Y$. Parameters of the model $\theta$.
Joint distribution over $X$ and $Y$: $P(x \in X, y \in Y \mid \theta)$.
Inference: $P(y \in Y \mid x \in X, \theta) = \frac{P(y, x \mid \theta)}{P(x \mid \theta)}$.
Learning: $P(\text{training data} \mid \theta)$.
Bayesian learning: $P(\theta \mid \text{training data}) = \frac{P(\text{training data} \mid \theta)\, P(\theta)}{Z}$.
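The Bayesian learning rule on this slide can be illustrated with a small sketch. It assumes a finite grid of candidate parameter values and a Bernoulli likelihood; both are illustrative choices and do not come from the slides. The names `posterior` and `bernoulli` are hypothetical.

```python
from fractions import Fraction

def posterior(prior, likelihood, data):
    """Return P(theta | data) over a finite set of candidate theta values."""
    unnormalized = {}
    for theta, p_theta in prior.items():
        p_data = Fraction(1)
        for x in data:
            p_data *= likelihood(x, theta)       # P(training data | theta)
        unnormalized[theta] = p_data * p_theta   # numerator of Bayes' rule
    z = sum(unnormalized.values())               # normalizing constant Z
    return {theta: u / z for theta, u in unnormalized.items()}

def bernoulli(x, theta):
    """Likelihood of one binary observation (assumed model, for illustration)."""
    return theta if x == 1 else 1 - theta

# Uniform prior over three hypothetical parameter values.
thetas = [Fraction(1, 4), Fraction(1, 2), Fraction(3, 4)]
prior = {t: Fraction(1, 3) for t in thetas}
post = posterior(prior, bernoulli, [1, 1, 0])
```

Exact fractions are used so that the normalization by Z can be checked without rounding error.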

3 Why Bayesian Learning? Less worry about overfitting. Nonparametric Bayes mitigates issues of model selection. Separation of modeling assumptions and algorithmic concerns. Explicit statement of assumptions made. Allows inclusion of domain knowledge into model via the structure and form of Bayesian priors. Power law properties via Pitman-Yor processes. Information sharing via hierarchical Bayes.

4 Hierarchical Bayes, Pitman-Yor Processes, and N-gram Language Models

5 N-gram Language Models
Probabilistic models for sequences of words and characters, e.g. "south, parks, road" or "s, o, u, t, h, _, p, a, r, k, s, _, r, o, a, d".
(N-1)th order Markov model: $P(\text{sentence}) = \prod_i P(\text{word}_i \mid \text{word}_{i-N+1} \ldots \text{word}_{i-1})$

6 Sparsity and Smoothing
$P(\text{sentence}) = \prod_i P(\text{word}_i \mid \text{word}_{i-N+1} \ldots \text{word}_{i-1})$
Large vocabulary size means naïvely estimating parameters of this model from data counts is problematic for N>2.
$P_{ML}(\text{word}_i \mid \text{word}_{i-N+1} \ldots \text{word}_{i-1}) = \frac{C(\text{word}_{i-N+1} \ldots \text{word}_i)}{C(\text{word}_{i-N+1} \ldots \text{word}_{i-1})}$
Naïve priors/regularization fail as well: most parameters have no associated data. Hence smoothing.
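The maximum-likelihood estimate above is a ratio of counts, which a few lines of Python make concrete. This is a minimal sketch on a made-up toy corpus; the function names `ngram_counts` and `p_ml` are hypothetical. It shows the sparsity problem directly: unseen continuations get probability zero and unseen contexts have no estimate at all.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count every n-gram, and every (n-1)-word context that is followed by a word."""
    grams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    contexts = Counter()
    for gram, c in grams.items():
        contexts[gram[:-1]] += c
    return grams, contexts

def p_ml(word, context, grams, contexts):
    """C(context, word) / C(context); None when the context was never observed."""
    context = tuple(context)
    if contexts[context] == 0:
        return None
    return grams[context + (word,)] / contexts[context]

# Toy corpus, invented for illustration.
tokens = "south parks road meets north parks road near south parks lane".split()
grams, contexts = ngram_counts(tokens, 3)

seen = p_ml("road", ["south", "parks"], grams, contexts)            # seen trigram
unseen_word = p_ml("lane", ["north", "parks"], grams, contexts)     # zero count
unseen_context = p_ml("road", ["east", "parks"], grams, contexts)   # no data
```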

7 Smoothing on Context Tree
Context of conditional probabilities naturally organized using a tree. Smoothing makes conditional probabilities of neighbouring contexts more similar. Later words in context more important in predicting next word.
$P_{smooth}(\text{road} \mid \text{south parks}) = \lambda(3)\, q_3(\text{road} \mid \text{south parks}) + \lambda(2)\, q_2(\text{road} \mid \text{parks}) + \lambda(1)\, q_1(\text{road} \mid \emptyset)$
[Figure: context tree with nodes "parks", "south parks", "to parks", "university parks", "along south parks", "at south parks".]
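The interpolation formula can be sketched as follows. The sketch assumes that each q is a plain maximum-likelihood estimate and that the weights λ are fixed constants summing to one; real smoothing methods choose both differently. The toy corpus and the names `ml_estimate` and `p_smooth` are hypothetical.

```python
def ml_estimate(tokens, word, context):
    """ML estimate of P(word | context); 0.0 if the context never occurs."""
    context = tuple(context)
    k = len(context)
    ctx_count = hit_count = 0
    for i in range(len(tokens) - k):
        if tuple(tokens[i:i + k]) == context:
            ctx_count += 1
            hit_count += tokens[i + k] == word
    return hit_count / ctx_count if ctx_count else 0.0

def p_smooth(tokens, word, context, weights):
    """Interpolate estimates of several n-gram orders.

    weights maps an n-gram order to its lambda; order n uses the last n-1
    context words, so each order must satisfy n - 1 <= len(context).
    """
    context = tuple(context)
    total = 0.0
    for order, lam in weights.items():
        suffix = context[len(context) - (order - 1):]
        total += lam * ml_estimate(tokens, word, suffix)
    return total

tokens = "south parks road meets north parks road near south parks lane".split()
weights = {3: 0.5, 2: 0.3, 1: 0.2}   # assumed fixed lambdas
p = p_smooth(tokens, "road", ("south", "parks"), weights)
total = sum(p_smooth(tokens, w, ("south", "parks"), weights) for w in set(tokens))
```

Because every context used here occurs in the corpus and the weights sum to one, the smoothed probabilities over the vocabulary sum to one as well.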

8 Smoothing in Language Models Interpolated and modified Kneser-Ney are best under virtually all circumstances. [Chen and Goodman 1998]

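9 Hierarchical Bayesian Models
Hierarchical modelling is an important overarching theme in modern statistics. In machine learning, hierarchical models have been used for multitask learning, transfer learning, learning-to-learn and domain adaptation.
[Figure: graphical model with a shared parameter $\phi_0$, group parameters $\phi_1, \phi_2, \phi_3$, and observations $x_{1i}$ ($i = 1 \ldots n_1$), $x_{2i}$ ($i = 1 \ldots n_2$), $x_{3i}$ ($i = 1 \ldots n_3$).]
[Gelman et al, 1995, James & Stein 1961]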

10 Hierarchical Bayesian Models on Context Tree
Parametrize the conditional probabilities of the Markov model:
$P(\text{word}_i = w \mid \text{word}_{i-N+1}^{i-1} = u) = G_u(w)$
$G_u = [G_u(w)]_{w \in \text{vocabulary}}$
$G_u$ is a probability vector associated with context $u$.
[Figure: context tree with nodes $G_{\emptyset}$, $G_{\text{parks}}$, $G_{\text{south parks}}$, $G_{\text{to parks}}$, $G_{\text{university parks}}$, $G_{\text{along south parks}}$, $G_{\text{at south parks}}$.]
[MacKay and Peto 1994]

11 Hierarchical Dirichlet Language Models
What is $P(G_u \mid G_{pa(u)})$? [MacKay and Peto 1994] proposed using the standard Dirichlet distribution over probability vectors.
[Table: columns T, N-1, IKN, MKN, HDLM.]
We will use Pitman-Yor processes instead.

12 Power Laws in English
[Figure: word frequency against rank (according to frequency), with curves for Pitman-Yor, English text, and Dirichlet.]

13 Chinese Restaurant Processes
Generative process:
[Figure: customers $x_1, \ldots, x_9$ seated at tables serving dishes $y_1, \ldots, y_4$.]
$p(\text{sit at table } k) = \frac{c_k - d}{\theta + \sum_{j=1}^{K} c_j}$
$p(\text{sit at new table}) = \frac{\theta + dK}{\theta + \sum_{j=1}^{K} c_j}$
$p(\text{table serves dish } y) = H(y)$
Defines an exchangeable stochastic process over sequences $x_1, x_2, \ldots$ The de Finetti measure is the Pitman-Yor process:
$G \sim PY(\theta, d, H)$, $x_i \mid G \sim G$ for $i = 1, 2, \ldots$
[Perman, Pitman & Yor 1992, Pitman & Yor 1997, Ishwaran & James 2001]
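The seating rule can be written as a short function. This is a minimal sketch: given the current customers per table c_k, the concentration θ and the discount d, it returns the probability of joining each existing table and of opening a new one. The table counts below are hypothetical, and `seating_probabilities` is an illustrative name.

```python
def seating_probabilities(table_counts, theta, d):
    """Return ([p(sit at table k) for each k], p(sit at new table))."""
    n = sum(table_counts)      # sum over j of c_j
    K = len(table_counts)      # number of occupied tables
    existing = [(c - d) / (theta + n) for c in table_counts]
    new = (theta + d * K) / (theta + n)
    return existing, new

# Hypothetical state: 4 tables holding 9 customers.
existing, new = seating_probabilities([2, 4, 1, 2], theta=1.0, d=0.5)
```

The discount d removes mass from every occupied table and hands it to the new-table option, which is why the probabilities still sum to one.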

14 Chinese Restaurant Processes
customers = word tokens. H = dictionary. tables = dictionary lookup.
[Figure: customers $x_1, \ldots, x_9$ seated at four tables serving cat, dog, cat, mouse.]
Dictionary look-up sequence: cat, dog, cat, mouse
Word token sequence: cat, dog, dog, dog, cat, dog, cat, mouse, mouse

15 Stochastic Programming Perspective
A stochastic program producing a random sequence of words:
cat, dog, cat, mouse, ... ~ H iid
G ~ PY(θ,d,H)
cat, dog, dog, dog, cat, dog, cat, mouse, mouse ~ G iid
[Figure: customers $x_1, \ldots, x_9$ seated at tables serving cat, dog, cat, mouse; H feeds G ~ PY(θ,d,H), which emits the word sequence.]
[Goodman et al 2008]

16 Power Law Properties of Pitman-Yor Processes
Chinese restaurant process:
$p(\text{sit at table } k) \propto c_k - d$
$p(\text{sit at new table}) \propto \theta + dK$
Pitman-Yor processes produce distributions over words given by a power law distribution with index 1+d. Small number of common word types; large number of rare word types. This is more suitable for languages than Dirichlet distributions. [Goldwater et al 2006] investigated the Pitman-Yor process from this perspective.
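One way to see the effect of the discount d is to track the expected number of occupied tables as customers arrive. The sketch below relies on the fact that the new-table probability is linear in K, so the expectation obeys an exact recursion. It assumes θ > 0, and the parameter values and the name `expected_tables` are illustrative.

```python
def expected_tables(n, theta, d):
    """Expected number of occupied tables after seating n customers (theta > 0)."""
    expected_k = 0.0
    for i in range(n):  # i customers are already seated
        # E[new table] = (theta + d * E[K]) / (theta + i), by linearity in K
        expected_k += (theta + d * expected_k) / (theta + i)
    return expected_k

dirichlet_like = expected_tables(10_000, theta=1.0, d=0.0)  # d = 0 case
pitman_yor = expected_tables(10_000, theta=1.0, d=0.5)      # d > 0 case
```

With d = 0 the count grows only logarithmically in n, while with d > 0 it grows polynomially, which produces the long tail of rare types described on the slide.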

17 Power Law Properties of Pitman-Yor Processes
[Figure: word frequency against rank (according to frequency), with curves for Pitman-Yor, English text, and Dirichlet.]

18 Hierarchical Pitman-Yor Language Models
Parametrize the conditional probabilities of the Markov model:
$P(\text{word}_i = w \mid \text{word}_{i-N+1}^{i-1} = u) = G_u(w)$
$G_u = [G_u(w)]_{w \in \text{vocabulary}}$
$G_u$ is a probability vector associated with context $u$. Place a Pitman-Yor process prior on each $G_u$.
[Figure: context tree with nodes $G_{\emptyset}$, $G_{\text{parks}}$, $G_{\text{south parks}}$, $G_{\text{to parks}}$, $G_{\text{university parks}}$, $G_{\text{along south parks}}$, $G_{\text{at south parks}}$.]

19 Stochastic Programming Perspective
$G_1 \mid G_0 \sim PY(\theta_1, d_1, G_0)$
$G_2 \mid G_1 \sim PY(\theta_2, d_2, G_1)$
$G_3 \mid G_1 \sim PY(\theta_3, d_3, G_1)$
[Figure: $G_0$ emits $w_{01}, w_{02}, w_{03}, \ldots$; $G_1$ emits $w_{11}, w_{12}, w_{13}, \ldots$; $G_2$ emits $w_{21}, w_{22}, w_{23}, \ldots$; $G_3$ emits $w_{31}, w_{32}, w_{33}, \ldots$]
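The hierarchy can be read as nested samplers: each restaurant draws the dish for a new table from its parent. The sketch below uses the Chinese restaurant representation, and it is only an illustration. The class name `PYRestaurant`, the three-word vocabulary, the uniform G_0 and the parameter values are all assumptions.

```python
import random

class PYRestaurant:
    """Chinese restaurant representation of G ~ PY(theta, d, base)."""

    def __init__(self, theta, d, base_draw, rng):
        self.theta, self.d = theta, d
        self.base_draw = base_draw   # zero-argument callable sampling the base
        self.rng = rng
        self.counts = []             # customers per table
        self.dishes = []             # word served at each table

    def draw(self):
        """Seat one new customer and return the word that customer receives."""
        K = len(self.counts)
        if K == 0:
            k = 0                    # first customer always opens a table
        else:
            weights = [c - self.d for c in self.counts]
            weights.append(self.theta + self.d * K)
            k = self.rng.choices(range(K + 1), weights=weights)[0]
        if k == K:
            self.counts.append(1)
            self.dishes.append(self.base_draw())   # new table: ask the parent
        else:
            self.counts[k] += 1
        return self.dishes[k]

rng = random.Random(1)
vocab = ["cat", "dog", "mouse"]
g1 = PYRestaurant(1.0, 0.5, lambda: rng.choice(vocab), rng)  # base G_0: uniform
g2 = PYRestaurant(1.0, 0.5, g1.draw, rng)                    # base G_1
g3 = PYRestaurant(1.0, 0.5, g1.draw, rng)                    # base G_1
words2 = [g2.draw() for _ in range(20)]
words3 = [g3.draw() for _ in range(20)]
```

Every table opened in `g2` or `g3` sends exactly one customer to `g1`, which is how information is shared between sibling contexts.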

20 Hierarchical Pitman-Yor Language Models
Significantly improved on the hierarchical Dirichlet language model. Results are better than Kneser-Ney smoothing, the state of the art in language models.
[Table: columns T, N-1, IKN, MKN, HDLM, HPYLM.]

21 Pitman-Yor and Kneser-Ney
Interpolated Kneser-Ney can be derived as a particular approximate inference method in a hierarchical Pitman-Yor language model.
[Figure: customers $x_1, \ldots, x_9$ (word tokens cat, dog, dog, dog, cat, dog, cat, mouse, mouse) seated at tables serving cat, dog, cat, mouse.]
$P(x_{10} = \text{dog}) = \frac{4 - d}{\theta + 9} + \frac{\theta + 4d}{\theta + 9} H(\text{dog})$
The first term is absolute discounting; the second is interpolation with $H(\text{dog})$.
Pitman-Yor processes can be used in place of Kneser-Ney.
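The predictive rule on this slide can be checked numerically. This is a sketch under several assumptions: the two "cat" tables are taken to hold 2 and 1 customers (the slide does not give the split, and it does not affect the "dog" probability), H is uniform over three words, and θ = 1, d = 0.5. The name `py_predictive` is hypothetical.

```python
def py_predictive(word, table_dishes, table_counts, theta, d, base_prob):
    """P(next customer receives `word`) given a fixed seating arrangement."""
    n = sum(table_counts)
    K = len(table_counts)
    # Absolute discounting: every table serving `word` contributes c_k - d.
    discounted = sum(c - d for dish, c in zip(table_dishes, table_counts)
                     if dish == word)
    # Interpolation: mass reserved for a new table, spread according to H.
    backoff = (theta + d * K) * base_prob(word)
    return (discounted + backoff) / (theta + n)

dishes = ["cat", "dog", "cat", "mouse"]
counts = [2, 4, 1, 2]                 # assumed split of the three cat tokens
uniform = lambda w: 1 / 3             # assumed base distribution H
p_dog = py_predictive("dog", dishes, counts, 1.0, 0.5, uniform)
p_all = sum(py_predictive(w, dishes, counts, 1.0, 0.5, uniform)
            for w in ["cat", "dog", "mouse"])
```

With these values the "dog" probability is (4 - 0.5)/10 + (1 + 2)/10 × 1/3 = 0.45, matching the slide's formula term by term.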

22 ∞-gram Language Models and Computational Advantages

23 Markov Language Models
Language models usually make a Markov assumption to simplify the model:
$P(\text{south parks road}) \approx P(\text{south})\, P(\text{parks} \mid \text{south})\, P(\text{road} \mid \text{parks})$
Language models are usually Markov models of order 2-4 (3- to 5-grams). How do we determine the order of our Markov models? Is the Markov assumption a reasonable assumption? Be nonparametric about Markov order...

24 Non-Markov Language Models
Model the conditional probabilities of each possible word occurring after each possible context (of unbounded length). Use hierarchical Pitman-Yor process prior to share information across all contexts. Hierarchy is infinitely deep. Sequence memoizer.
[Figure: context tree with nodes $G_{\emptyset}$, $G_{\text{parks}}$, $G_{\text{south parks}}$, $G_{\text{to parks}}$, $G_{\text{university parks}}$, $G_{\text{along south parks}}$, $G_{\text{at south parks}}$, $G_{\text{meet at south parks}}$.]

25 Model Size: Infinite -> $O(T^2)$
The sequence memoizer model is very large (actually, infinite). Given a training sequence (e.g.: o,a,c,a,c), most of the model can be ignored (integrated out), leaving a finite number of nodes in the context tree. But there are still $O(T^2)$ nodes in the context tree...
[Figure: context tree rooted at H with nodes $G_{[]}$, $G_{[o]}$, $G_{[a]}$, $G_{[c]}$, $G_{[oa]}$, $G_{[ca]}$, $G_{[ac]}$, $G_{[oac]}$, $G_{[cac]}$, $G_{[aca]}$, $G_{[oaca]}$, $G_{[acac]}$, $G_{[oacac]}$.]

26 Model Size: Infinite -> $O(T^2)$ -> 2T
Idea: integrate out non-branching, non-leaf nodes of the context tree. Resulting tree is related to a suffix tree data structure, and has at most 2T nodes. There are linear time construction algorithms [Ukkonen 1995].
[Figure: compacted context tree rooted at H with nodes $G_{[]}$, $G_{[o]}$, $G_{[a]}$, $G_{[oa]}$, $G_{[ac]}$, $G_{[oac]}$, $G_{[oaca]}$, $G_{[oacac]}$.]
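The node counts for the running example can be reproduced with a brute-force sketch. It enumerates every context explicitly, so it is quadratic and only meant for tiny inputs; it is not the linear-time suffix tree construction cited on the slide. It assumes the parent of a context is that context with its earliest symbol dropped, and it always keeps the root. The function name `context_tree_sizes` is hypothetical.

```python
def context_tree_sizes(sequence):
    """Return (full, compact) node counts for the context tree of `sequence`."""
    # Observed contexts are the prefixes of the sequence.
    prefixes = [sequence[:i] for i in range(len(sequence) + 1)]
    # Every suffix of an observed context is a node of the full tree.
    nodes = {p[j:] for p in prefixes for j in range(len(p) + 1)}
    # The parent of context u drops u's earliest symbol.
    children = {u: 0 for u in nodes}
    for u in nodes:
        if u:
            children[u[1:]] += 1
    # Keep the root, the leaves, and the branching nodes; nodes with exactly
    # one child are the non-branching interior nodes that get integrated out.
    kept = [u for u in nodes if u == "" or children[u] != 1]
    return len(nodes), len(kept)

full, compact = context_tree_sizes("oacac")
```

For the sequence o,a,c,a,c this gives 13 nodes in the full tree and 8 after compaction, which agrees with the nodes shown on slides 25 and 26 and with the 2T bound for T = 5.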

27 Closure under Marginalization
In marginalizing out non-branching interior nodes, need to ensure that resulting conditional distributions are still tractable.
$G_{[ca]} \mid G_{[a]} \sim PY(\theta_2, d_2, G_{[a]})$
$G_{[aca]} \mid G_{[ca]} \sim PY(\theta_3, d_3, G_{[ca]})$
$G_{[aca]} \mid G_{[a]} \sim\ ?$
E.g.: If each conditional is Dirichlet, resulting conditional is not of known analytic form.

28 Closure under Marginalization
In marginalizing out non-branching interior nodes, need to ensure that resulting conditional distributions are still tractable.
$G_{[ca]} \mid G_{[a]} \sim PY(\theta_2, d_2, G_{[a]})$
$G_{[aca]} \mid G_{[ca]} \sim PY(\theta_2 d_3, d_3, G_{[ca]})$
$G_{[aca]} \mid G_{[a]} \sim PY(\theta_2 d_3, d_2 d_3, G_{[a]})$
For certain parameter settings, Pitman-Yor processes are closed under marginalization! [Pitman 1999]
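Applying the slide's two-level rule repeatedly collapses a whole non-branching chain into one Pitman-Yor conditional. The sketch below only does the parameter bookkeeping. It assumes each concentration in the chain equals the one above it times the next discount, which is the setting in which the slide's rule applies; setting every concentration to zero satisfies it. The name `collapse_chain` is hypothetical.

```python
def collapse_chain(theta_top, discounts):
    """Collapse a non-branching chain of Pitman-Yor conditionals.

    discounts = [d_2, d_3, ...] from the top of the chain downwards, and the
    top conditional has concentration theta_top. Lower concentrations are
    assumed to be theta_top * d_3, theta_top * d_3 * d_4, and so on.
    Returns (theta, d) of the bottom node given the top node's parent.
    """
    theta, d = theta_top, discounts[0]
    for d_next in discounts[1:]:
        theta *= d_next   # theta_2 d_3 on the slide
        d *= d_next       # d_2 d_3 on the slide
    return theta, d

theta, d = collapse_chain(2.0, [0.5, 0.8])   # hypothetical theta_2, d_2, d_3
```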

29 Comparison to Finite Order HPYLM

30 Compression Results
Calgary corpus.

Model               Average bits/byte
gzip                2.61
bzip
CTW                 1.99
PPM                 1.93
Sequence Memoizer   1.89

SM inference: particle filter. PPM: Prediction by Partial Matching. CTW: Context Tree Weighting. Online inference, entropic coding.
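The bits/byte figures link prediction to compression: an entropy coder spends about -log2 p bits on a symbol to which the model assigned probability p. The sketch below computes the resulting average. The probabilities are invented for illustration, and `bits_per_byte` is a hypothetical name.

```python
import math

def bits_per_byte(probabilities):
    """Average code length in bits, given the model's probability for each byte."""
    return -sum(math.log2(p) for p in probabilities) / len(probabilities)

# Hypothetical predictive probabilities assigned to four successive bytes.
average = bits_per_byte([0.5, 0.25, 0.5, 0.25])
```

A model that predicts the next byte better assigns higher probabilities, which lowers this average directly.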

31 Hierarchical Bayes and Domain Adaptation

32 Domain Adaptation
[Figure: domains labelled General English, statistics, computational linguistics, and Children.]

33 Graphical Pitman-Yor Process
Each conditional probability vector given context $u = (w_1, w_2)$ in a domain has prior:
$G^{\text{domain}}_{w_1,w_2} \mid G^{\text{domain}}_{w_2}, G^{\text{general}}_{w_1,w_2} \sim PY(\theta, d, \pi G^{\text{domain}}_{w_2} + (1 - \pi) G^{\text{general}}_{w_1,w_2})$
Back-off in two different ways. More flexible than a straight mixture of the two base distributions. An example of a graphical Pitman-Yor process.
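The base distribution of this prior is a mixture of two back-off distributions, which can be sketched directly. The two toy distributions and the weight π below are invented for illustration, and `mixed_base` is a hypothetical name. The sketch covers only the base measure, not inference in the graphical Pitman-Yor process.

```python
def mixed_base(pi, p_domain_backoff, p_general):
    """pi * G^domain_{w2} + (1 - pi) * G^general_{w1,w2}, as word -> probability."""
    vocab = set(p_domain_backoff) | set(p_general)
    return {w: pi * p_domain_backoff.get(w, 0.0)
               + (1 - pi) * p_general.get(w, 0.0)
            for w in vocab}

# Hypothetical distributions for the next word after the context "hidden Markov".
domain_shorter_context = {"model": 0.6, "chain": 0.4}   # same domain, context w2
general_same_context = {"model": 0.5, "process": 0.5}   # general English, w1 w2
base = mixed_base(0.25, domain_shorter_context, general_same_context)
```

A word supported by only one of the two back-off routes still receives mass, which is the sense in which the prior backs off in two different ways.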

34-36 Graphical Pitman-Yor Process
[Figure, repeated across three slides: three copies of a context tree with nodes $G_{.}$, $G_{\text{has}}$, $G_{\text{Markov}}$, $G_{\text{Mary has}}$, $G_{\text{Charlie has}}$, $G_{\text{hidden Markov}}$, $G_{\text{order Markov}}$, one copy each for General English, Domain=statistics, and Domain=children.]

37 Domain Adaptation Results
Compared a graphical Pitman-Yor domain-adapted language model to: no additional domain; naively including additional domain; mixture model.
[Figure: two panels of test perplexity against Brown corpus subset size (one axis scaled by $10^5$); legend as extracted: HPYLM Union HHPYLM Mixture HPYLM.]

38 Related Works

39 Adaptor Grammars Reuse fully expanded subtrees of PCFG using Chinese restaurant processes. Flexible framework and software to make use of hierarchical Pitman-Yor process technology. Applied to unsupervised word segmentation, morphological analysis etc. [Johnson, Griffiths, Goldwater *]

40 Adaptor Grammars
Stem → Phon+ : talk, look, talk, eat, ...
Suffix → Phon+ : ing, s, ed, ing, ...
Word → Stem Suffix : talking, looks, talking, talked, eated, ...

41 Tree Substitution Grammars
Multiple-level hierarchy, adapted by Pitman-Yor processes: tree fragments; PCFG productions; lexicalization; heads of CFG rules.
[Goldwater, Blunsom & Cohn *] also [Post & Gildea 2009, O'Donnell et al 2009]

42 Other Related Works
Infinite PCFGs [Finkel et al 2007, Liang et al 2007]
Infinite Markov model [Mochihashi & Sumita 2008]
Nested Pitman-Yor language models [Mochihashi et al 2009]
POS induction [Blunsom & Cohn 2011]

43 Concluding Remarks

44 Conclusions Bayesian methods are powerful approaches to computational linguistics. Hierarchical Bayesian models for encoding generalization capabilities. Pitman-Yor processes for encoding power law properties. Nonparametric Bayesian models for sidestepping model selection. Hurdles to future progress: Scaling up Bayesian inference to large datasets and large models. Better exploration of combinatorial spaces. Fruitful cross-pollination of ideas across machine learning, statistics and computational linguistics.

45 Thank You!
Sharon Goldwater, Chris Manning & CoNLL committee.
Acknowledgements: Frank Wood, Jan Gasthaus, Cedric Archambeau, Lancelot James; Lee Kuan Yew Foundation; Gatsby Charitable Foundation.
