The Infinite PCFG using Hierarchical Dirichlet Processes

Transcription:

[Title slide, decorated with repeated copies of the example parse tree (S (NP (PRP she)) (VP (VBD heard) (NP (DT the) (NN noise)))).]

The Infinite PCFG using Hierarchical Dirichlet Processes
EMNLP 2007, Prague, Czech Republic, June 29, 2007
Percy Liang, Slav Petrov, Michael I. Jordan, Dan Klein

How do we choose the grammar complexity?
Grammar induction: how many grammar symbols (NP, VP, etc.)?
Grammar refinement: how many grammar subsymbols (NP-loc, NP-subj, etc.)?
[Example tree with unknown subsymbols: (S-? (NP-? (PRP-? She)) (VP-? (VBD-? heard) (NP-? (DT-? the) (NN-? noise))))]
Our solution: the HDP-PCFG allows the number of (sub)symbols to adapt to the data.

A motivating example
True grammar:
  S → A A | B B | C C | D D
  A → a1 | a2 | a3    B → b1 | b2 | b3    C → c1 | c2 | c3    D → d1 | d2 | d3
Generated examples:
  (S (A a2) (A a3))   (S (B b1) (B b3))   (S (A a1) (A a1))   (S (C c1) (C c1))
Now collapse A, B, C, D into a single symbol X, so the training trees become:
  (S (X a2) (X a3))   (S (X b1) (X b3))   (S (X a1) (X a1))   (S (X c1) (X c1))
A good learner should allocate roughly four subsymbols of X, one for each of A, B, C, D.
Results: [bar plots of the posterior over subsymbols 1–20 of X (y-axis: posterior, up to 0.25), one for the standard PCFG and one for the HDP-PCFG].
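A few lines of code make the setup concrete. The sampler below is only an illustration, not the authors' experimental setup: it assumes uniform rule probabilities and draws sentences from the true grammar, showing the regularity that the collapsed symbol X hides.

```python
# Illustrative sampler for the motivating example (not the authors' code):
# sentences come from the true grammar S -> A A | B B | C C | D D, but after
# collapsing A, B, C, D to a single symbol X the learner only sees trees like
# (S (X a2) (X a3)) and must decide how many subsymbols of X to posit.
import random

TERMINALS = {
    "A": ["a1", "a2", "a3"],
    "B": ["b1", "b2", "b3"],
    "C": ["c1", "c2", "c3"],
    "D": ["d1", "d2", "d3"],
}

def sample_sentence(rng):
    """Pick S -> Y Y for a hidden Y in {A, B, C, D}, then emit one terminal per Y."""
    hidden = rng.choice(sorted(TERMINALS))
    return [rng.choice(TERMINALS[hidden]) for _ in range(2)]

rng = random.Random(0)
print([sample_sentence(rng) for _ in range(6)])
# The two words of a sentence always share a letter class -- exactly the
# regularity that the latent subsymbols of X should capture.
```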

The meeting of two fields
Grammar learning:
  Lexicalized [Charniak, 1996] [Collins, 1999]
  Manual refinement [Johnson, 1998] [Klein, Manning, 2003]
  Automatic refinement [Matsuzaki et al., 2005] [Petrov et al., 2006]
Bayesian nonparametrics:
  Basic theory [Ferguson, 1973] [Antoniak, 1974] [Sethuraman, 1994] [Escobar, West, 1995] [Neal, 2000]
  More complex models [Teh et al., 2006] [Beal et al., 2002] [Goldwater et al., 2006] [Sohn, Xing, 2007]
Nonparametric grammars:
  [Johnson et al., 2006] [Finkel et al., 2007] [Liang et al., 2007]
Our contribution:
  Definition of the HDP-PCFG
  A simple and efficient variational inference algorithm
  Empirical comparison with finite models on a full-scale parsing task

Bayesian paradigm
Generative model: hyperparameters α →(HDP)→ grammar θ →(PCFG)→ parse tree z → sentence x
Bayesian posterior inference: observe x; what are θ and z?

HDP probabilistic context-free grammars

[Graphical model: the top-level distribution β generates, for every symbol z, emission parameters φ^E_z and binary production parameters φ^B_z; the tree z1 → (z2, z3) generates the observed words x2, x3.]

HDP-PCFG generative process:
  β ~ GEM(α)  [generate distribution over symbols]
  For each symbol z ∈ {1, 2, ...}:
    φ^E_z ~ Dirichlet(α^E)  [generate emission probs]
    φ^B_z ~ DP(α^B, β β^T)  [binary production probs]
  For each nonterminal node i:
    (z_L(i), z_R(i)) ~ Multinomial(φ^B_{z_i})  [child symbols]
  For each preterminal node i:
    x_i ~ Multinomial(φ^E_{z_i})  [terminal symbol]
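For readers who find code easier to follow, here is a minimal generative sketch of this process under a finite truncation. The constants (K symbols, V words, a fixed tree depth) and the standard approximation of DP(α^B, ββ^T) by a Dirichlet with concentrations α^B · vec(ββ^T) are my assumptions, not the paper's implementation.

```python
# Minimal generative sketch of the HDP-PCFG under a finite truncation.
# Assumptions (not from the paper): K symbols, V terminal words, a fixed tree
# depth, and DP(aB, beta beta^T) approximated by Dirichlet(aB * vec(beta beta^T)).
import numpy as np

rng = np.random.default_rng(0)
K, V = 5, 10                      # truncation level and vocabulary size (assumed)
alpha, aE, aB = 1.0, 1.0, 1.0     # concentration hyperparameters (assumed)

# beta ~ GEM(alpha), truncated at K via stick-breaking.
v = rng.beta(1.0, alpha, size=K)
v[-1] = 1.0                                        # give the leftover stick to component K
beta = v * np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))

# Per-symbol parameters.
phi_E = rng.dirichlet(aE * np.ones(V), size=K)          # emission probs, one row per symbol
pair_mean = np.outer(beta, beta).ravel()                # mean over (left, right) child pairs
phi_B = rng.dirichlet(aB * pair_mean + 1e-6, size=K)    # binary production probs

def generate(z, depth=0, max_depth=3):
    """Expand symbol z: emit a word at max depth, otherwise pick a child-symbol pair."""
    if depth == max_depth:
        return [int(rng.choice(V, p=phi_E[z]))]
    pair = int(rng.choice(K * K, p=phi_B[z]))
    left, right = divmod(pair, K)
    return generate(left, depth + 1) + generate(right, depth + 1)

print(generate(z=0))               # one sampled terminal string
```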

HDP-PCFG: prior over symbols
β ~ GEM(α)
[Plots of stick-breaking draws β ~ GEM(α) for α = 0.2, 0.5, 1, 5, 10: smaller α concentrates the mass on a few symbols, larger α spreads it over many.]
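As a quick illustration of how α shapes this prior, the small stick-breaking simulation below counts how many symbols receive appreciable mass. The truncation at 20 and the 1% threshold are arbitrary display choices, not values from the talk.

```python
# A small illustration (assumed, not from the talk) of how the GEM concentration
# parameter alpha controls the effective number of grammar symbols.
import numpy as np

def sample_gem(alpha, trunc, rng):
    """Truncated stick-breaking draw beta ~ GEM(alpha)."""
    v = rng.beta(1.0, alpha, size=trunc)
    v[-1] = 1.0                      # give the leftover stick to the last component
    return v * np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))

rng = np.random.default_rng(0)
for alpha in [0.2, 0.5, 1.0, 5.0, 10.0]:
    beta = sample_gem(alpha, trunc=20, rng=rng)
    effective = int((beta > 0.01).sum())     # symbols holding at least 1% of the mass
    print(f"alpha={alpha:5.1f}  symbols with >1% mass: {effective}")
```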


HDP-PCFG: binary productions
Distribution over symbols (top-level): β ~ GEM(α)
Mean distribution over child-symbol pairs: the outer product β β^T (left child × right child)
Distribution over child-symbol pairs, per symbol: φ^B_z ~ DP(α^B, β β^T)
[Figure: bar chart of β over symbols 1–6; the grid β β^T indexed by (left child, right child); and several per-symbol draws φ^B_z ~ DP(α^B, β β^T) over the same grid.]
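The sketch below shows how the shared β couples all of the per-symbol production distributions: the mean over child pairs is the outer product β β^T, and each φ^B_z is centered on it. The specific β vector, the concentration α^B = 10, and the finite-Dirichlet stand-in for the DP are illustrative assumptions.

```python
# Sketch (assumptions: 6-symbol truncation, DP approximated by a finite Dirichlet)
# of how the top-level beta ties the per-symbol binary production distributions together.
import numpy as np

rng = np.random.default_rng(0)
K, alpha_B = 6, 10.0
beta = np.array([0.4, 0.25, 0.15, 0.1, 0.07, 0.03])   # an assumed GEM-like draw

pair_mean = np.outer(beta, beta)                       # K x K mean over (left, right) child pairs
# Finite-truncation stand-in for phi_B_z ~ DP(alpha_B, beta beta^T):
phi_B = rng.dirichlet(alpha_B * pair_mean.ravel() + 1e-6, size=K).reshape(K, K, K)

# Symbols with little top-level mass under beta are rarely used as children by
# any parent symbol, which is what lets the model effectively switch symbols off.
print("top-level mass:", np.round(beta, 2))
print("average child-pair mass across parents:\n", np.round(phi_B.mean(axis=0), 3))
```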


Variational Bayesian inference
Goal: compute the posterior p(θ, z | x).
Variational inference: approximate the posterior with the best member of a set of tractable distributions Q:
  q* = argmin_{q ∈ Q} KL(q ‖ p)
[Diagram: the family Q, with q the closest member to the true posterior p.]
Mean-field approximation:
  Q = { q : q = q(z) q(β) q(φ) }, where φ collects the rule parameters {φ^B_z, φ^E_z}.

Coordinate-wise descent algorithm
Goal: argmin_{q ∈ Q} KL(q ‖ p), with q = q(z) q(φ) q(β)
  z = parse trees, φ = rule probabilities, β = inventory of symbols
Iterate (a toy analogue of this loop is sketched below):
  Optimize q(z) (E-step): run inside-outside with rule weights W(r); gather expected rule counts C(r).
  Optimize q(φ) (M-step): update the Dirichlet posteriors (expected rule counts + pseudocounts); compute the rule weights W(r).
  Optimize q(β) (no equivalent in EM): truncate at level K (the maximum number of symbols); use projected gradient to adapt the number of symbols.
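To make the shape of this loop concrete, here is the same pattern of coordinate updates on a deliberately simple latent-variable model: a K-component mixture of multinomials over single words rather than a grammar. Exact responsibilities stand in for inside-outside, and the Dirichlet-posterior / exp-digamma updates play the role of the M-step. This is an illustrative analogue only, not the paper's parser, and all constants are made up.

```python
# Coordinate-wise mean-field updates on a toy mixture of multinomials
# (an analogue of the HDP-PCFG loop, not the paper's algorithm).
import numpy as np
from scipy.special import digamma

rng = np.random.default_rng(0)
K, V, N = 4, 8, 200
X = rng.integers(0, V, size=N)                 # toy "corpus": one word per data item

prior_mix, prior_emit = 0.5, 0.5               # sparse pseudocounts (assumed)
W_mix = np.ones(K) / K                         # "rule" weights for choosing a component
W_emit = rng.dirichlet(np.ones(V), size=K)     # emission weights, randomly initialized

for _ in range(50):
    # E-step: exact responsibilities (the grammar case would run inside-outside here).
    R = W_mix[:, None] * W_emit[:, X]          # K x N, unnormalized
    R /= R.sum(axis=0, keepdims=True)
    C_mix = R.sum(axis=1)                      # expected counts C(r)
    C_emit = np.zeros((K, V))
    for k in range(K):
        np.add.at(C_emit[k], X, R[k])          # accumulate expected word counts per component

    # M-step: Dirichlet posterior = pseudocounts + expected counts,
    # then weights W(r) = exp(digamma(posterior)) / exp(digamma(posterior total)).
    post_mix = prior_mix + C_mix
    post_emit = prior_emit + C_emit
    W_mix = np.exp(digamma(post_mix) - digamma(post_mix.sum()))
    W_emit = np.exp(digamma(post_emit) - digamma(post_emit.sum(axis=1, keepdims=True)))

print("component weights:", np.round(np.sort(W_mix)[::-1], 3))
```

The q(β) step has no analogue in this toy; in the HDP-PCFG it is handled by the truncation at level K and the projected-gradient update described above.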

Rule weights
The weight W(r) of rule r plays a role similar to a probability p(r), but W(r) is unnormalized: an extra degree of freedom.
EM (maximum likelihood):
  W(r) = C(r) / Σ_{r'} C(r')
EM (maximum a posteriori):
  W(r) = (prior(r) − 1 + C(r)) / Σ_{r'} (prior(r') − 1 + C(r'))
Mean-field (with DP prior):
  W(r) = exp Ψ(prior(r) + C(r)) / exp Ψ(Σ_{r'} (prior(r') + C(r')))
[Plot: exp(Ψ(x)) against x on (0, 2]; for large x it approaches x − 0.5.]
So for large counts W(r) ≈ (C(r) − 0.5) / (Σ_{r'} C(r') − 0.5): subtracting 0.5 hurts small counts more than large counts, a rich-get-richer effect that controls the number of symbols.
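The three estimators are easy to compare side by side. The counts and pseudocounts below are made up purely to show the qualitative difference; scipy's digamma is Ψ.

```python
# Comparing the three rule-weight estimators on the same expected counts C(r);
# the counts and the pseudocount of 1.5 are illustrative assumptions.
import numpy as np
from scipy.special import digamma

C = np.array([50.0, 5.0, 1.0, 0.1])     # expected counts for four rules of one symbol
prior = np.full_like(C, 1.5)            # Dirichlet pseudocounts (assumed)

W_ml = C / C.sum()                                              # EM, maximum likelihood
W_map = (prior - 1 + C) / (prior - 1 + C).sum()                 # EM, maximum a posteriori
W_mf = np.exp(digamma(prior + C)) / np.exp(digamma((prior + C).sum()))  # mean-field

for name, W in [("ML", W_ml), ("MAP", W_map), ("mean-field", W_mf)]:
    print(f"{name:>10}: {np.round(W, 3)}  (sum = {W.sum():.3f})")
# The mean-field weights need not sum to 1; since exp(digamma(x)) ~ x - 0.5 for
# large x, rules with small expected counts are penalized more than frequent ones.
```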

Parsing the WSJ Penn Treebank
Setup: grammar refinement (split Treebank symbols into subsymbols).
Training on one section:
[Plot: F1 (68–100) against the maximum number of subsymbols (truncation K = 5, 9, 12, 16, 20), comparing the standard PCFG with the HDP-PCFG (K = 20).]
Training on 20 sections (K = 16):
  Standard PCFG: 86.23 F1
  HDP-PCFG: 87.08 F1
Results: the HDP-PCFG overfits less than the standard PCFG; with large amounts of data, the HDP-PCFG performs on par with the standard PCFG.

Conclusions
What? The HDP-PCFG model allows the number of grammar symbols to adapt to the data.
How? A mean-field variational inference algorithm that is simple, efficient, and similar to EM.
When? With small amounts of data: it overfits less than a standard PCFG.
Why? A declarative framework: grammar complexity is specified declaratively in the model rather than in the learning procedure.
[Closing slide decorated with repeated copies of the example parse tree (S (NP (PRP she)) (VP (VBD heard) (NP (DT the) (NN noise)))).]