The Infinite PCFG using Hierarchical Dirichlet Processes


1 The Infinite PCFG using Hierarchical Dirichlet Processes Liang, Petrov, Jordan & Klein Presented by: Will Allen November 8, 2011

2 Overview
1. Overview
2. (Very) Brief History of Context Free Grammars
3. Probabilistic Context Free Grammars (PCFG)
4. HDP-PCFG Model
5. HDP-PCFG for Grammar Refinement (HDP-PCFG-GR)
6. HDP-PCFG Variational Inference
7. Experimental Results

3 Overview Goal: To understand the latent rules generating the recursive structure of phrases and sentences in natural language. Not just for NLP: PCFGs are also used in bioinformatics (RNA structure prediction), vision (geometric grammars), and probably other places.

4 (Very) Brief History of Context Free Grammars 4th century BC: first description by Pāṇini of a grammar, a set of rules dictating the order in which clauses and words appear. Grammars are tree-structured to model the recursive structure of natural language. 1950s: Noam Chomsky invents the context-free grammar, formally describing how to generate these tree structures.

5 (Very) Brief History of Context Free Grammars
A parse tree: [Figure 1: a parse tree for the sentence "They solved the problem with Bayesian statistics", with constituents Sentence, Noun-Phrase, Verb-Phrase, Prepositional-Phrase and preterminals Pronoun, Verb, Determiner, Noun, Preposition, Proper-Noun, Plural-Noun.]
From: Liang, P., Jordan, M. I., Klein, D. Probabilistic Grammars and Hierarchical Dirichlet Processes (2009). Book chapter in The Handbook of Applied Bayesian Analysis.
From the chapter's introduction: the chapter discusses a Bayesian approach to syntactic parsing and the underlying problems of grammar induction and grammar refinement. The central object of study is the parse tree. A substantial amount of the syntactic structure and relational semantics of natural language sentences can be described using parse trees, which play a central role in modern NLP, including machine translation (Galley et al., 2004), semantic role extraction (Gildea and Jurafsky, 2002), and question answering (Hermjakob, 2001). It seems reasonable to model parse trees using context-free grammars (CFGs); indeed, this was the original motivation behind the development of the CFG formalism (Chomsky, 1956), and it remains a major focus of parsing research. Early work on NLP parsing concentrated on efficient algorithms for computing the set of all parses of a sentence under a given CFG, but natural language is highly ambiguous: the number of parses grows exponentially with sentence length, so systems that enumerated all possibilities were not useful in practice. Modern work has therefore turned to probabilistic models that place distributions over parse trees, and to probabilistic inference methods that focus on likely trees (Lari and Young, 1990). The workhorse model family for probabilistic parsing is the family of probabilistic context-free grammars.

6 Probabilistic Context Free Grammars
A PCFG is a set of rules for generating parse trees. It consists of:
- A set of terminal symbols Σ (e.g. actual words)
- A set of nonterminal symbols S (e.g. word types)
- A root nonterminal symbol Root ∈ S
- Rule probabilities φ = (φ_s(γ) : s ∈ S, γ ∈ Σ ∪ (S × S)), where φ_s(γ) ≥ 0 and Σ_γ φ_s(γ) = 1 (each rule produces either a terminal symbol or a pair of nonterminal symbols)

7 Context Free Grammars
Chomsky Normal Form is used in this paper:
- A → B C, with A, B, C ∈ S (binary production)
- A → α, with α ∈ Σ (emission)
- Root → ɛ (empty string)
where A → B C occurs with probability φ_A((B, C)).
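To make the CNF definition concrete, here is a minimal sketch (not from the paper) of a toy PCFG in Python together with top-down sampling of a parse tree; the grammar, symbol names, and probabilities are all made up for illustration.

```python
import random

# A toy PCFG in (roughly) Chomsky Normal Form: each nonterminal maps to a list of
# (rhs, probability) pairs, where rhs is either a (B, C) pair of nonterminals
# (binary production) or a terminal string (emission). Probabilities for each
# left-hand side sum to 1.
toy_pcfg = {
    "S":   [(("NP", "VP"), 1.0)],
    "NP":  [(("Det", "N"), 0.7), ("they", 0.3)],
    "VP":  [(("V", "NP"), 1.0)],
    "Det": [("the", 1.0)],
    "N":   [("problem", 0.6), ("noise", 0.4)],
    "V":   [("solved", 0.5), ("heard", 0.5)],
}

def sample_tree(grammar, symbol="S"):
    """Sample a parse tree top-down: choose a rule for `symbol`, then recurse on its children."""
    r = random.random()
    for rhs, p in grammar[symbol]:
        r -= p
        if r <= 0:
            break
    if isinstance(rhs, tuple):                      # binary production A -> B C
        return (symbol, sample_tree(grammar, rhs[0]), sample_tree(grammar, rhs[1]))
    return (symbol, rhs)                            # emission A -> terminal

if __name__ == "__main__":
    print(sample_tree(toy_pcfg))
```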

8 HDP-PCFG Model Previous work for learning PCFGs: models have a fixed number of symbols; maximum-likelihood symbol transition and emission probabilities are inferred with the Expectation-Maximization (EM) algorithm; pseudocounts are used for smoothing, since there may be only a few training examples of each transition.

9 HDP-PCFG Model Goal: learn how many grammar symbols to allocate given the data, and use these symbols to learn transition and emission probabilities. Method: use an HDP to model syntactic tree structures, where nonterminal nodes are symbols. Bonus: develop a model for grammar refinement: given a coarse supervised annotation of tree structures, infer a richer model by learning how many subsymbols to split each existing symbol into.

10 HDP-PCFG Model Uses a Chomsky Normal Form grammar, so each symbol only has emissions or binary productions. Each grammar symbol is a mixture component. A DP prior lets the number of grammar symbols be unbounded.

11 HDP-PCFG Model
HDP-PCFG generative process:
β ∼ GEM(α) [draw top-level symbol weights]
For each grammar symbol z ∈ {1, 2, ...}:
  φ_z^T ∼ Dirichlet(α^T) [draw rule type parameters]
  φ_z^E ∼ Dirichlet(α^E) [draw emission parameters]
  φ_z^B ∼ DP(α^B, ββ^T) [draw binary production parameters]
For each node i in the parse tree:
  t_i ∼ Multinomial(φ_{z_i}^T) [choose rule type]
  If t_i = EMISSION: x_i ∼ Multinomial(φ_{z_i}^E) [emit terminal symbol]
  If t_i = BINARY-PRODUCTION: (z_{L(i)}, z_{R(i)}) ∼ Multinomial(φ_{z_i}^B) [generate children symbols]
[Figure 2 of the paper gives the definition and graphical model of the HDP-PCFG; since parse trees have no convenient representation in the visual language of graphical models, the paper illustrates it with a simple fixed example tree.]
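As a rough numerical illustration (my own sketch, not the authors' code), the prior draws can be approximated with truncated stick-breaking: GEM(α) is cut off at K symbols, and the DP over child-symbol pairs becomes a finite Dirichlet whose base counts are α^B times the outer product ββ^T. All values (K, hyperparameters, vocabulary size) are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 10                                    # truncation level (illustrative)
alpha, a_T, a_E, a_B = 1.0, 1.0, 1.0, 1.0
vocab_size = 50

def gem(alpha, K, rng):
    """Truncated stick-breaking draw: beta_k = v_k * prod_{j<k} (1 - v_j); the last stick takes the rest."""
    v = rng.beta(1.0, alpha, size=K)
    v[-1] = 1.0                           # make the truncated weights sum to exactly 1
    remaining = np.cumprod(np.concatenate(([1.0], 1.0 - v[:-1])))
    return v * remaining

beta = gem(alpha, K, rng)                                 # top-level symbol weights
phi_T = rng.dirichlet([a_T, a_T], size=K)                 # rule-type distributions (emission vs. binary)
phi_E = rng.dirichlet([a_E] * vocab_size, size=K)         # emission distributions over terminals
# Binary production distribution per symbol: a finite-Dirichlet stand-in for DP(a_B, beta beta^T),
# i.e. a distribution over (left child, right child) pairs centred on the outer product of beta.
base = np.outer(beta, beta).reshape(-1)
phi_B = rng.dirichlet(a_B * base + 1e-6, size=K).reshape(K, K, K)   # small floor for numerical safety
```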

12 HDP-PCFG Model Key points:
- Symbols are drawn from a global stick-breaking prior β ∼ GEM(α).
- φ_z^B ∼ DP(α^B, ββ^T) gives a distribution over pairs of symbols for each symbol.
- Unlike in the HDP-HMM, either a binary production or an emission is chosen at each node; φ_z^T is the distribution over which type of rule to apply (2 types for CNF).
- Although a Dirichlet/Multinomial emission distribution is used for NLP, a more general base measure could be used to get a different emission distribution.

13 HDP-PCFG Model Graphical model of a fixed tree (hyperparameters α, α^T, α^E, α^B not shown): [Figure: the parameters β, φ_z^T, φ_z^E, φ_z^B connected to a fixed example tree with symbols z_1, z_2, z_3 and observed terminals x_2, x_3; one plate for parameters, one for trees. Since parse trees have unknown structure, they cannot be drawn directly in the visual language of traditional graphical models.]

14 HDP-PCFG Model Distribution over pairs of child symbols: the top-level weights β ∼ GEM(α) are drawn once for the whole grammar; for each symbol, the binary production distribution φ_z^B ∼ DP(α^B, ββ^T) is a distribution over (left child symbol, right child symbol) pairs, i.e. a doubly-infinite matrix whose mean is the outer product ββ^T. This adapts the HDP-HMM transition distribution, except that the draw is over pairs of children rather than a single next state. L(i) and R(i) denote the left and right children of node i. A simpler alternative would draw independent left-child and right-child distributions l_z and r_z and set φ_z^B = l_z r_z^T, but that assumes the left and right children are independent given the parent, which is not appropriate for natural language. [Figure: the binary production parameters visualized as a matrix indexed by left child state and right child state.]

15 HDP-PCFG for Grammar Refinement We want to refine an existing, human-created grammar: we are given a set of symbols and want to allocate some number of subsymbols for each symbol. The idea is to better capture subtleties in types of grammatical objects (e.g. different kinds of noun phrases).

16 HDP-PCFG for Grammar Refinement For grammar refinement, the distribution over child symbols is a finite Dirichlet, since all symbols are known and observed, but the distribution over subsymbols must be handled with the Dirichlet process machinery, since the number of subsymbols is unknown.
HDP-PCFG for grammar refinement (HDP-PCFG-GR):
For each symbol s ∈ S:
  β_s ∼ GEM(α) [draw subsymbol weights]
  For each subsymbol z ∈ {1, 2, ...}:
    φ_{sz}^T ∼ Dirichlet(α^T) [draw rule type parameters]
    φ_{sz}^E ∼ Dirichlet(α^E(s)) [draw emission parameters]
    φ_{sz}^u ∼ Dirichlet(α^u) [unary symbol productions]
    φ_{sz}^b ∼ Dirichlet(α^b) [binary symbol productions]
    For each child symbol s' ∈ S:
      φ_{szs'}^U ∼ DP(α^U, β_{s'}) [unary subsymbol productions]
    For each pair of children symbols (s', s'') ∈ S × S:
      φ_{szs's''}^B ∼ DP(α^B, β_{s'} β_{s''}^T) [binary subsymbol productions]
For each node i in the parse tree:
  t_i ∼ Multinomial(φ_{s_i z_i}^T) [choose rule type]
  If t_i = EMISSION: x_i ∼ Multinomial(φ_{s_i z_i}^E) [emit terminal symbol]
  If t_i = UNARY-PRODUCTION:
    s_{L(i)} ∼ Multinomial(φ_{s_i z_i}^u) [generate child symbol]
    z_{L(i)} ∼ Multinomial(φ_{s_i z_i s_{L(i)}}^U) [child subsymbol]
  If t_i = BINARY-PRODUCTION:
    (s_{L(i)}, s_{R(i)}) ∼ Multinomial(φ_{s_i z_i}^b) [children symbols]
    (z_{L(i)}, z_{R(i)}) ∼ Multinomial(φ_{s_i z_i s_{L(i)} s_{R(i)}}^B) [children subsymbols]

17 HDP-PCFG for Grammar Refinement Key points: 1. Similar to the previous model, but with a separate stick-breaking prior for each symbol s ∈ S; this creates a distribution over symbol/subsymbol pairs (s_i, z_i). 2. Unary productions are included (the equivalent of a state transition in an HMM). 3. Since the annotated symbols already determine their child symbols, the model needs a distribution over child symbols as well as over child subsymbols.

18 HDP-PCFG Variational Inference The authors chose variational inference to avoid having to deal with convergence diagnostics and sample aggregation (as in sampling-based inference). It adapts the existing efficient EM algorithm for PCFG refinement and induction; the EM algorithm uses the Markov structure of the parse tree to do dynamic programming in the E-step.

19 HDP-PCFG Variational Inference Recall: variational methods approximate the posterior p(θ, z | x) with
q* = argmin_{q ∈ Q} KL(q(θ, z) ‖ p(θ, z | x))
In this case:
θ = (β, φ), where β = top-level symbol probabilities and φ = rule probabilities
z = training parse trees
x = observed sentences

20 HDP-PCFG Variational Inference They use a structured mean-field approximation, i.e. they only consider distributions of the form
Q = { q : q(z) q(β) ∏_{z=1}^K q(φ_z^T) q(φ_z^E) q(φ_z^B) }
where q(φ_z^T), q(φ_z^E), q(φ_z^B) are Dirichlet, q(z) is multinomial, and q(β) is a degenerate distribution truncated at K (β_z = 0 for z > K).
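Concretely (a hypothetical layout, not the paper's code), the truncated variational state can be stored as one pseudo-count array per Dirichlet factor plus a point estimate for β:

```python
import numpy as np

K, vocab_size = 20, 5000        # truncation level and vocabulary size (illustrative)

# Structured mean-field state:
#   q(phi_z^T), q(phi_z^E), q(phi_z^B): Dirichlet factors, stored via their parameters;
#   q(beta): a degenerate (point-mass) distribution truncated at K;
#   q(z): a multinomial over parse trees, represented implicitly through the
#         inside-outside dynamic program rather than stored explicitly.
q_T = np.ones((K, 2))           # rule type: emission vs. binary production
q_E = np.ones((K, vocab_size))  # emissions over the terminal vocabulary
q_B = np.ones((K, K, K))        # binary productions over (left, right) child symbols
beta = np.full(K, 1.0 / K)      # point estimate of the top-level symbol weights
```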

21 HDP-PCFG Variational Inference Factorized model q = q(β) q(φ) q(z): [Figure 4: the true posterior p over the parameters (β, φ_z^T, φ_z^E, φ_z^B) and the latent parse trees z is approximated by a structured mean-field distribution q; q(β) is degenerate and places no mass on symbols z > K. Because the stick-breaking weights of the DP decay quickly, little probability mass lies beyond the truncation, and the approximation improves as K grows.]

22 HDP-PCFG Variational Inference Optimizing q jointly is intractable, but a coordinate-ascent algorithm similar to EM can be used: optimize one factor at a time while keeping the other factors fixed.

23 HDP-PCFG Variational Inference Parse trees q(z): uses the inside-outside algorithm with unnormalized rule weights W(r), a dynamic programming algorithm analogous to forward-backward for HMMs. It then computes the expected sufficient statistics, the rule counts C(r): counts of binary productions C(z → z_l z_r) and emissions C(z → x).
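For intuition, here is a compact sketch of the inside pass for a CNF grammar, written against made-up weight containers; in the variational E-step the ordinary rule probabilities are replaced by the unnormalized weights W(r), and an analogous outside pass yields the expected counts C(r). This is my own illustration, not the authors' implementation.

```python
import numpy as np

def inside_pass(sentence, K, emit, binary):
    """Inside weights for a CNF grammar.

    sentence          : list of words
    emit[z]           : dict mapping word -> weight of emission z -> word
    binary[z, zl, zr] : weight of binary production z -> zl zr (numpy array, K x K x K)
    Returns inside[i, j, z] = total weight of symbol z spanning words i..j (inclusive).
    """
    n = len(sentence)
    inside = np.zeros((n, n, K))
    for i, word in enumerate(sentence):                 # width-1 spans: emissions
        for z in range(K):
            inside[i, i, z] = emit[z].get(word, 0.0)
    for width in range(2, n + 1):                       # wider spans: binary productions
        for i in range(n - width + 1):
            j = i + width - 1
            for m in range(i, j):                       # split point between left and right child
                pair = np.outer(inside[i, m], inside[m + 1, j])          # (zl, zr) -> weight
                inside[i, j] += np.tensordot(binary, pair, axes=([1, 2], [0, 1]))
    return inside
```

The sentence weight is then inside[0, n-1, Root], and combining the inside and outside quantities gives the expected rule counts C(r).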

24 HDP-PCFG Variational Inference Rule probabilities q(φ):
Update the Dirichlet posteriors: prior pseudocounts plus expected counts C(r).
Compute the multinomial rule weights, e.g. for binary productions:
W_z^B(z_l, z_r) = exp E_q[log φ_z^B(z_l, z_r)] = exp(Ψ(C(z → z_l z_r) + α^B β_{z_l} β_{z_r})) / exp(Ψ(C(z → ·,·) + α^B))
and in general
W(r) = exp(Ψ(prior(r) + C(r))) / exp(Ψ(Σ_{r'} [prior(r') + C(r')]))
where exp(Ψ(·)) increases the weight of large counts and decreases the weight of small counts (the rich-get-richer effect of the DP). The emission distributions are handled similarly.
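A small sketch of this weight computation under stated assumptions (the counts and base measure below are illustrative; the digamma function Ψ comes from scipy):

```python
import numpy as np
from scipy.special import digamma

def multinomial_weights(counts, prior):
    """W(r) = exp( digamma(prior + counts) - digamma(sum(prior + counts)) ).

    Since exp(digamma(x)) is roughly x - 0.5 for large x, well-supported rules are
    normalized almost as usual, while rules with tiny counts are pushed down --
    the rich-get-richer behaviour induced by the DP prior.
    """
    post = np.asarray(counts, dtype=float) + np.asarray(prior, dtype=float)
    return np.exp(digamma(post) - digamma(post.sum()))

# Example: binary-production weights for a single parent symbol z (values made up).
K = 4
beta = np.array([0.4, 0.3, 0.2, 0.1])                  # truncated top-level weights
alpha_B = 1.0
counts = np.zeros((K, K))                              # expected counts C(z -> z_l z_r)
counts[0, 0] = 5.0
counts[1, 2] = 1.0
W_B = multinomial_weights(counts.ravel(), alpha_B * np.outer(beta, beta).ravel()).reshape(K, K)
```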

25 HDP-PCFG Variational Inference Top-level symbol probabilities q(β): truncate at level K. Since q(β) = δ_{β*}(β), we are looking for a single best β*. A gradient projection method is used to find
argmax_{β*} L(β*) = log GEM(β*; α) + Σ_{z=1}^K E_q[log Dirichlet(φ_z^B; α^B β* β*^T)]

26 Results Recovering a synthetic grammar:
S → X_1 X_1 | X_2 X_2 | X_3 X_3 | X_4 X_4
X_i → a_i | b_i | c_i | d_i   for i = 1, ..., 4
[Figure 6: (a) a synthetic grammar with a uniform distribution over rules; (b) the grammar generates trees of the form S → X_i X_i, with each X_i emitting one of {a_i, b_i, c_i, d_i}.]
From this grammar, 2000 trees were generated; the two terminal symbols in a tree always share the same subscript, but each X_i was then replaced with a single collapsed symbol X, so the model must rediscover the four hidden subsymbols.
[Figure 7: the posterior over the subsymbols of the standard PCFG is roughly uniform, whereas the posterior of the HDP-PCFG is concentrated on four subsymbols, which is the true number. The standard PCFG fails to do so because it has no built-in control over grammar complexity.]

27 Results Empirical results are measured by F_1 = 2 · precision · recall / (precision + recall).
Labeled brackets represent the tree: LB(s) = {(s_{[i,j]}, [i, j]) : s_{[i,j]} is a nonterminal node, 1 ≤ i ≤ j ≤ n}
precision(s, s') = (# correct) / (# returned) = |LB(s) ∩ LB(s')| / |LB(s')|
recall(s, s') = (# correct) / (# that should have been returned) = |LB(s) ∩ LB(s')| / |LB(s)|
where s is the true parse tree and s' is the predicted one.
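A minimal sketch of this bracket-based evaluation, assuming trees are nested (label, children...) tuples with leaves of the form (preterminal, word), the same toy representation as in the earlier sampling sketch:

```python
def labeled_brackets(tree, start=0):
    """Return (set of (label, i, j) brackets, span length) for a nested-tuple tree."""
    label = tree[0]
    if isinstance(tree[1], str):                    # preterminal emitting a single word
        return {(label, start, start)}, 1
    brackets, length = set(), 0
    for child in tree[1:]:
        child_brackets, child_len = labeled_brackets(child, start + length)
        brackets |= child_brackets
        length += child_len
    brackets.add((label, start, start + length - 1))
    return brackets, length

def bracket_scores(gold_tree, predicted_tree):
    """Labeled-bracket precision, recall and F1 between a gold and a predicted tree."""
    gold, _ = labeled_brackets(gold_tree)
    pred, _ = labeled_brackets(predicted_tree)
    correct = len(gold & pred)
    if correct == 0:
        return 0.0, 0.0, 0.0
    precision, recall = correct / len(pred), correct / len(gold)
    return precision, recall, 2 * precision * recall / (precision + recall)
```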

28 Results Applied to one section of the Penn Treebank (WSJ), a corpus of parsed sentences, preprocessed to fit CNF.
[Table 1: for each truncation level, the α^B that yielded the highest F_1 score on the development set.]
[Table 2: development F_1 and grammar sizes (the number of effective rules) for the PCFG, the smoothed PCFG, and the HDP-PCFG as the truncation K increases (numbers omitted).]

29 Recap Main contributions: Used an HDP prior to let a Chomsky Normal Form PCFG learn the number of symbols in the grammar while also learning the rule transition and emission probabilities. Developed an efficient variational method for inference, similar to the existing EM algorithms for PCFGs. Can be extended to model other kinds of context free grammars. Possible problems: Variational methods only find local optima? Anything else?
