The Infinite PCFG using Hierarchical Dirichlet Processes
Percy Liang, Slav Petrov, Michael I. Jordan, Dan Klein
EMNLP 2007, Prague, Czech Republic, June 29, 2007
[Title slide background: repeated parse trees of the running example "She heard the noise" (S -> NP VP; NP -> PRP "she"; VP -> VBD "heard" NP; NP -> DT "the" NN "noise")]
How do we choose the grammar complexity?
- Grammar induction: how many grammar symbols (NP, VP, etc.)?
- Grammar refinement: how many grammar subsymbols (NP-loc, NP-subj, etc.)?
[Figure: parse of "She heard the noise" in which every tag carries an unknown refinement: S-?, NP-?, VP-?, PRP-?, VBD-?, DT-?, NN-?]
Our solution: the HDP-PCFG allows the number of (sub)symbols to adapt to the data.
A motivating example
True grammar:
  S -> A A | B B | C C | D D
  A -> a1 | a2 | a3
  B -> b1 | b2 | b3
  C -> c1 | c2 | c3
  D -> d1 | d2 | d3
Generate examples from it (e.g., trees whose yields are "a2 a3", "b1 b3", "a1 a1", "c1 c1"), then collapse A, B, C, D into a single symbol X before learning.
Results:
[Figure: posterior over the number of subsymbols of X (1-20), shown for the standard PCFG and for the HDP-PCFG; y-axis posterior up to 0.25]
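A minimal sketch of this toy setup, assuming only what the slide states (the true grammar and uniform rule choices): sample trees from the true grammar and relabel the four preterminals as a single collapsed symbol X, producing the kind of data the two models are asked to explain. The function names and number of samples are illustrative.

```python
# Sketch of the motivating example: the true grammar is
#   S -> A A | B B | C C | D D,  A -> a1|a2|a3, ..., D -> d1|d2|d3.
# Trees are sampled and the preterminals A-D are relabeled as one symbol X.
import random

PRETERMINALS = ["A", "B", "C", "D"]
TERMINALS = {p: [p.lower() + str(i) for i in (1, 2, 3)] for p in PRETERMINALS}

def sample_collapsed_tree():
    p = random.choice(PRETERMINALS)             # pick the rule S -> p p uniformly
    left = random.choice(TERMINALS[p])          # each child emits one of p's terminals
    right = random.choice(TERMINALS[p])
    return ("S", ("X", left), ("X", right))     # children relabeled to the single symbol X

if __name__ == "__main__":
    random.seed(0)
    for _ in range(4):
        print(sample_collapsed_tree())          # e.g. ('S', ('X', 'a2'), ('X', 'a3'))
```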
The meeting of two fields
Grammar learning:
- Lexicalized: [Charniak, 1996] [Collins, 1999]
- Manual refinement: [Johnson, 1998] [Klein, Manning, 2003]
- Automatic refinement: [Matsuzaki, et al., 2005] [Petrov, et al., 2006]
Bayesian nonparametrics:
- Basic theory: [Ferguson, 1973] [Antoniak, 1974] [Sethuraman, 1994] [Escobar, West, 1995] [Neal, 2000]
- More complex models: [Teh, et al., 2006] [Beal, et al., 2002] [Goldwater, et al., 2006] [Sohn, Xing, 2007]
Nonparametric grammars: [Johnson, et al., 2006] [Finkel, et al., 2007] [Liang, et al., 2007]
Our contribution:
- Definition of the HDP-PCFG
- A simple and efficient variational inference algorithm
- An empirical comparison with finite models on a full-scale parsing task
Bayesian paradigm
Generative model: hyperparameters α (HDP) → grammar θ (PCFG) → parse tree z → sentence x.
Bayesian posterior inference: observe x; what are θ and z?
HDP probabilistic context-free grammars
[Figure: graphical model with the top-level β, per-symbol parameters φ^B_z and φ^E_z, and a parse in which z1 rewrites to (z2, z3), emitting x2 and x3]
HDP-PCFG generative process:
  β ~ GEM(α)                                        [generate distribution over symbols]
  For each symbol z ∈ {1, 2, ...}:
    φ^E_z ~ Dirichlet(α^E)                          [generate emission probs]
    φ^B_z ~ DP(α^B, β β^T)                          [binary production probs]
  For each nonterminal node i:
    (z_L(i), z_R(i)) ~ Multinomial(φ^B_{z_i})       [child symbols]
  For each preterminal node i:
    x_i ~ Multinomial(φ^E_{z_i})                    [terminal symbol]
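Below is a minimal sketch (not the authors' code) of this generative process under a finite truncation level K: β by truncated stick-breaking, per-symbol emission distributions from a Dirichlet, and per-symbol binary production distributions centered on the base measure β β^T. The DP draw is approximated by a finite Dirichlet, and the names sample_gem, K, and V are illustrative assumptions.

```python
# Sketch of the HDP-PCFG prior under a truncation level K (an illustration
# only; the model itself places no bound on the number of symbols).
import numpy as np

def sample_gem(alpha, K, rng):
    """Truncated stick-breaking draw of beta ~ GEM(alpha)."""
    v = rng.beta(1.0, alpha, size=K)
    v[-1] = 1.0                                   # close the stick at level K
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))
    return v * remaining

def sample_hdp_pcfg_prior(alpha=1.0, alpha_E=1.0, alpha_B=1.0, K=10, V=20, seed=0):
    rng = np.random.default_rng(seed)
    beta = sample_gem(alpha, K, rng)              # top-level distribution over symbols
    base = np.outer(beta, beta)                   # mean distribution over child pairs
    phi_E = rng.dirichlet(alpha_E * np.ones(V), size=K)   # one emission row per symbol
    # Finite-Dirichlet stand-in for phi_B_z ~ DP(alpha_B, beta beta^T):
    phi_B = np.stack([
        rng.dirichlet(alpha_B * base.ravel() + 1e-9).reshape(K, K)
        for _ in range(K)
    ])
    return beta, phi_E, phi_B

if __name__ == "__main__":
    beta, phi_E, phi_B = sample_hdp_pcfg_prior()
    print(beta.round(3))                          # most mass on the first few symbols
```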
HDP-PCFG: prior over symbols
β ~ GEM(α)
[Figure: stick-breaking draws of β for α = 0.2, 0.5, 1, 5, 10 — smaller α concentrates the mass on a few symbols, larger α spreads it over many]
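Reusing the sample_gem helper sketched above (still under a truncation assumption), the snippet below illustrates how α controls the effective number of symbols; the 95% cutoff is just an illustrative way to summarize each draw.

```python
# How many symbols does a draw beta ~ GEM(alpha) effectively use?
import numpy as np

rng = np.random.default_rng(0)
for alpha in (0.2, 0.5, 1.0, 5.0, 10.0):
    beta = sample_gem(alpha, K=20, rng=rng)
    # Number of symbols needed to cover 95% of the probability mass.
    used = int(np.sum(np.cumsum(np.sort(beta)[::-1]) < 0.95)) + 1
    print(f"alpha = {alpha:4.1f}: {used} symbols cover 95% of the mass")
```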
HDP-PCFG: binary productions
- Distribution over symbols (top-level): β ~ GEM(α)
- Mean distribution over pairs of child symbols: the outer product β β^T (left child × right child)
- Distribution over child symbols (per-state): φ^B_z ~ DP(α^B, β β^T)
[Figure: bar chart of β over symbols 1-6, the grid β β^T over (left child, right child) pairs, and several per-state DP draws φ^B_z over the same grid]
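A small companion sketch, again reusing the truncated helpers above; the concrete α^B values are illustrative. It shows the per-state child distribution φ^B_z being drawn around the shared mean β β^T, with α^B governing how tightly it concentrates on that mean.

```python
# phi_B_z ~ DP(alpha_B, beta beta^T), approximated by a finite Dirichlet:
# a large alpha_B keeps each symbol's child distribution close to the shared
# mean, a small alpha_B lets individual symbols specialize.
import numpy as np

rng = np.random.default_rng(1)
beta = sample_gem(alpha=1.0, K=6, rng=rng)
base = np.outer(beta, beta)                       # mean over (left child, right child)
for alpha_B in (0.5, 5.0, 500.0):
    phi_B_z = rng.dirichlet(alpha_B * base.ravel() + 1e-9).reshape(6, 6)
    dev = np.abs(phi_B_z - base).sum()            # L1 deviation from the mean
    print(f"alpha_B = {alpha_B:6.1f}: deviation from mean = {dev:.3f}")
```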
Variational Bayesian inference
Goal: compute the posterior p(θ, z | x).
Variational inference: approximate the posterior with the best member of a family of tractable distributions Q: q* = argmin_{q ∈ Q} KL(q || p).
Mean-field approximation: Q = { q : q = q(z) q(β) q(φ) }.
[Figure: the family Q with q* as the member closest in KL to the true posterior p; factored graphical model over β, φ^B_z, φ^E_z and the parse nodes z1, z2, z3]
Coordinate-wise descent algorithm
Goal: argmin_{q ∈ Q} KL(q || p), with q = q(z) q(φ) q(β), where z is the parse tree, φ the rule probabilities, and β the inventory of symbols.
Iterate:
- Optimize q(z) (E-step): run inside-outside with rule weights W(r); gather expected rule counts C(r).
- Optimize q(φ) (M-step): update the Dirichlet posteriors (expected rule counts + pseudocounts); compute the rule weights W(r).
- Optimize q(β) (no equivalent in EM): truncate at level K (the maximum number of symbols); use projected gradient to adapt the number of symbols.
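The skeleton below sketches the shape of this loop using the mean-field weight formula from the next slide. inside_outside_stub stands in for a real inside-outside pass, and all names, shapes, and the fixed truncation are illustrative assumptions rather than the authors' implementation.

```python
# Structural sketch of the coordinate-wise descent (E-step / M-step / q(beta)).
import numpy as np
from scipy.special import digamma

def rule_weights(prior, counts):
    """Mean-field rule weights W(r) for one parent symbol's block of rules."""
    post = prior + counts
    return np.exp(digamma(post)) / np.exp(digamma(post.sum()))

def inside_outside_stub(sentences, weights, rng):
    """Placeholder for inside-outside: returns fake expected rule counts C(r)."""
    return {z: rng.random(w.shape) * len(sentences) for z, w in weights.items()}

def mean_field(sentences, prior, n_iters=20, seed=0):
    rng = np.random.default_rng(seed)
    counts = {z: np.zeros_like(p) for z, p in prior.items()}
    for _ in range(n_iters):
        # M-step: Dirichlet posteriors = prior pseudocounts + expected counts,
        # turned into (unnormalized) rule weights W(r).
        weights = {z: rule_weights(prior[z], counts[z]) for z in prior}
        # E-step: inside-outside with weights W(r) yields expected counts C(r).
        counts = inside_outside_stub(sentences, weights, rng)
        # q(beta) step (no EM analogue): with a fixed truncation K one would
        # take a projected gradient step on the stick-breaking weights; omitted.
    return weights

if __name__ == "__main__":
    toy_prior = {"S": np.full(4, 0.25), "X": np.full(12, 0.1)}   # made-up rule blocks
    result = mean_field(["she heard the noise"], toy_prior)
    print({z: w.round(2) for z, w in result.items()})
```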
Rule weights
The weight W(r) of rule r plays the role of a probability p(r), but W(r) is unnormalized — an extra degree of freedom.
- EM (maximum likelihood): W(r) = C(r) / ∑_{r'} C(r')
- EM (maximum a posteriori): W(r) = (prior(r) − 1 + C(r)) / ∑_{r'} (prior(r') − 1 + C(r'))
- Mean-field (with DP prior): W(r) = exp Ψ(prior(r) + C(r)) / exp Ψ(∑_{r'} (prior(r') + C(r')))
[Figure: plot of exp(Ψ(x)) against x, which tracks the line x − 0.5 once x is not too small]
Applying exp(Ψ(x)) ≈ x − 0.5 (and ignoring the small prior counts) gives W(r) ≈ (C(r) − 0.5) / (∑_{r'} C(r') − 0.5). Subtracting 0.5 hurts small counts more than large counts — a rich-get-richer effect that controls the number of symbols.
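A quick numerical illustration of these formulas (my own sketch; the prior and count values are made up): the mean-field weights are sub-normalized, match the x − 0.5 approximation closely, and, relative to the normalized posterior mean, shrink the smallest counts the most.

```python
# Numerical check of W(r) = exp(Psi(prior+C)) / exp(Psi(sum(prior+C))) and of
# the approximation exp(Psi(x)) ~ x - 0.5.  All numbers are illustrative.
import numpy as np
from scipy.special import digamma

prior = np.full(3, 0.5)                    # small DP-style pseudocounts (made up)
counts = np.array([20.0, 5.0, 1.0])        # expected rule counts C(r)
post = prior + counts

W = np.exp(digamma(post)) / np.exp(digamma(post.sum()))
W_approx = (post - 0.5) / (post.sum() - 0.5)       # via exp(Psi(x)) ~ x - 0.5
posterior_mean = post / post.sum()                  # a normalized update for comparison

print("W          ", W.round(3))                    # sub-normalized: sums to < 1
print("approx     ", W_approx.round(3))
print("post. mean ", posterior_mean.round(3))
print("W / mean   ", (W / posterior_mean).round(3)) # smallest count shrunk the most
```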
Parsing the WSJ Penn Treebank
Setup: grammar refinement (split each treebank symbol into subsymbols).
Training on one section:
[Figure: F1 (68-100) against the maximum number of subsymbols (truncation K = 5-20) for the standard PCFG and the HDP-PCFG (K = 20)]
Training on 20 sections (K = 16): standard PCFG 86.23 F1, HDP-PCFG 87.08 F1.
Results: the HDP-PCFG overfits less than the standard PCFG; with large amounts of data, HDP-PCFG ≈ standard PCFG.
Conclusions
What? The HDP-PCFG allows the number of grammar symbols to adapt to the data.
How? A mean-field (variational inference) algorithm: simple, efficient, and similar to EM.
When? With small amounts of data it overfits less than a standard PCFG.
Why? A declarative framework: grammar complexity is specified declaratively in the model rather than in the learning procedure.