Multiword Expression Identification with Tree Substitution Grammars
Spence Green, Marie-Catherine de Marneffe, John Bauer, and Christopher D. Manning
Stanford University. EMNLP 2011.
Main Idea
- Use syntactic context to find multiword expressions
- Syntactic context = constituency parses
- Multiword expressions = idiomatic constructions
Which languages?
- Results and analysis for French
- Lexicographic tradition of compiling MWE lists, hence annotated data!
- English examples in the talk
Motivating Example: Humans get this
1. He kicked the pail.
2. He kicked the bucket. (= He died.)
(Katz and Postal 1963)
Stanford parser can't tell the difference
(S (NP He) (VP kicked (NP the pail)))
(S (NP He) (VP kicked (NP the bucket)))
What does the lexicon contain?
- Single-word entries? kick: <agent, theme>; die: <theme>
- Multi-word entries? kick the bucket: <theme>
(S (NP He) (VP kicked (NP the bucket)))
Lexicon-Grammar: He kicked the bucket
(S (NP He) (VP died))
(S (NP He) (VP (MWV kicked the bucket)))
(Gross 1986)
MWEs in Lexicon-Grammar
- Classified by global POS: MWV
- Described by internal POS sequence: VBD DT NN
- Flat structures! (MWV (VBD kicked) (DT the) (NN bucket))
- Of theoretical interest, but...
Why do we care (in NLP)?
MWE knowledge improves:
- Dependency parsing (Nivre and Nilsson 2004)
- Constituency parsing (Arun and Keller 2005)
- Sentence generation (Hogan et al. 2007)
- Machine translation (Carpuat and Diab 2010)
- Shallow parsing (Korkontzelos and Manandhar 2010)
Most experiments assume high-accuracy identification!
French and the French Treebank
- MWEs common in French: 5,000 multiword adverbs
- Paris 7 French Treebank: 16,000 trees; 13% of tokens are MWEs
- Example: (MWC (P sous) (N prétexte) (C que)) 'on the grounds that'
French Treebank: MWE types
[Bar chart: % of total MWEs by global POS (N, ADV, P, C, V, D, PRO, CL, ET, I), scale 0-50%]
Lots of nominal compounds, e.g. (MWN (N numéro) (N deux))
MWE Identification Evaluation
Identification is a by-product of parsing
- Corpus: Paris 7 French Treebank (FTB)
- Split: same as Crabbé and Candito (2008)
- Metrics: precision and recall
- Lengths ≤ 40 words
MWE Identification: Parent-Annotated PCFG
PA-PCFG: 32.6 F1
MWE Identification: n-gram methods
PA-PCFG: 32.6 F1; mwetoolkit: 34.7 F1
Standard approach in the 2008 MWE Shared Task, MWE Workshops, etc.
n-gram methods: mwetoolkit
Based on surface statistics
- Step 1: Lemmatize and POS tag the corpus
- Step 2: Compute n-gram statistics: maximum likelihood estimator, Dice's coefficient, pointwise mutual information, Student's t-score
(Ramisch, Villavicencio, and Boitet 2010)
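The association measures in Step 2 can be sketched as follows. This is a minimal illustration of the standard bigram definitions, not mwetoolkit's own implementation, which may differ in smoothing and candidate extraction:

```python
import math

def association_scores(c_xy, c_x, c_y, n):
    """Association measures for a bigram (x, y):
    c_xy = count(x y), c_x = count(x), c_y = count(y),
    n = total number of bigrams in the corpus."""
    p_xy = c_xy / n                    # maximum likelihood estimate of P(x y)
    expected = c_x * c_y / n           # expected count if x and y were independent
    dice = 2 * c_xy / (c_x + c_y)      # Dice's coefficient
    pmi = math.log2(p_xy / ((c_x / n) * (c_y / n)))  # pointwise mutual information
    t = (c_xy - expected) / math.sqrt(c_xy)          # Student's t-score
    return {"mle": p_xy, "dice": dice, "pmi": pmi, "t": t}
```

An idiomatic bigram like "kicked the" co-occurs far more often than chance predicts, so all four scores come out high relative to a compositional bigram with the same unigram counts.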
n-gram methods: mwetoolkit (cont.)
- Step 3: Create n-gram feature vectors
- Step 4: Train a binary classifier
Exploits the statistical idiomaticity of MWEs
Is statistical idiomaticity sufficient?
French multiword verbs: the tree maintains the relationship between the MWV parts even when an adverb intervenes:
(VN (MWV va) (MWADV d'ailleurs) (MWV bon train)) 'is also well underway'
Recap: French MWE Identification Baselines
PA-PCFG: 32.6 F1; mwetoolkit: 34.7 F1
Let's build a better grammar
Better PCFGs: Manual grammar splits
- Symbol refinement à la Klein and Manning (2003)
- Example: mark a COORD that has a verbal nucleus (VN):
(COORD (C Ou) (ADV bien) (VN doit -il)) 'Otherwise he must...'
becomes (COORD-hasVN (C Ou) (ADV bien) (VN doit -il))
French MWE Identification: Manual Splits
PA-PCFG: 32.6; mwetoolkit: 34.7; Splits: 63.1 F1
MWE features: high-frequency POS sequences
Capture more syntactic context?
- PCFGs work well!
- Larger rules: Tree Substitution Grammars (TSG)
- Relationship with Data-Oriented Parsing (DOP): same grammar formalism (TSG); we include unlexicalized fragments; different parameter estimation
Which tree fragments do we select?
(S (NP (N He)) (VP (MWV (V kicked) (D the) (N bucket))))
[Build slides highlight candidate fragments of this tree, e.g. (NP N), (MWV (V kicked) D N), (S NP VP)]
TSG Grammar Extraction as Tree Selection
Selected fragment: (MWV V (D the) (N bucket))
- Describes MWE context
- Allows for inflection: kick, kicked, kicking
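To make the "allows for inflection" point concrete, here is a toy sketch (not the paper's implementation) of how a fragment with an unlexicalized frontier V node matches any inflected form, while the lexicalized D and N nodes pin down the idiom:

```python
# A TSG fragment as a nested tuple: (label, children...); a bare string
# child that names a non-terminal is a frontier node awaiting substitution.
fragment = ("MWV", "V", ("D", "the"), ("N", "bucket"))

def matches(frag, tree):
    """True if `tree` (fully lexicalized, same encoding) instantiates
    `frag`. A bare non-terminal label in the fragment matches any
    subtree with that root label; everything else must match exactly."""
    if frag == tree:                                  # identical terminal/subtree
        return True
    if isinstance(frag, str):                         # frontier non-terminal
        return isinstance(tree, tuple) and tree[0] == frag
    if isinstance(tree, str) or len(frag) != len(tree) or frag[0] != tree[0]:
        return False
    return all(matches(f, t) for f, t in zip(frag[1:], tree[1:]))
```

The same fragment accepts "kicked the bucket" and "kicking the bucket", but rejects "kicked the pail".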
Dirichlet process TSG (DP-TSG)
- Tree selection as non-parametric clustering¹
- Labeled Chinese Restaurant process: Dirichlet process (DP) prior for each non-terminal type c
- Supervised case: segment the treebank
¹Cohn, Goldwater, and Blunsom 2009; Post and Gildea 2009; O'Donnell, Tenenbaum, and Goodman 2009
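The clustering intuition can be sketched with the collapsed CRP predictive probability that underlies such models. This is a simplified illustration, not the paper's sampler: one non-terminal type, fragments as opaque keys, and the base distribution `p0` given as a plain dict:

```python
from collections import Counter

def crp_predictive(e, counts, alpha, p0):
    """CRP predictive probability of elementary tree `e` under one
    non-terminal type: popular fragments are reused ("rich get richer"),
    while the concentration `alpha` reserves mass for fragments the
    base distribution `p0` proposes."""
    n = sum(counts.values())                    # fragments already seated
    return (counts[e] + alpha * p0[e]) / (n + alpha)
```

Frequently re-extracted fragments (e.g. a recurring MWV subtree) therefore accumulate probability, which is exactly the tree-selection pressure the model exploits.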
DP-TSG: Learning and Inference
- DP base distribution from the manually-split CFG
- Type-based Gibbs sampler (Liang, Jordan, and Klein 2010); fast convergence: 400 iterations
- Derivations of a TSG are a CFG forest
- SCFG decoder: cdec (Dyer et al. 2010)
French MWE Identification: DP-TSG
PA-PCFG: 32.6; mwetoolkit: 34.7; Splits: 63.1; DP-TSG: 71.1 F1
The DP-TSG result is a lower bound
Human-interpretable DP-TSG rules
MWN → coup de N:
- coup de pied 'kick'
- coup de coeur 'favorite'
- coup de foudre 'love at first sight'
- coup de main 'help'
- coup de grâce 'death blow'
n-gram methods: separate feature vectors
DP-TSG errors: Overgeneration
'Le marché national' (the national market)
Reference: (NP (D Le) (N marché) (AP (A national)))
DP-TSG: (NP (D Le) (MWN (N marché) (A national)))
MWEs are subtle; the reference is sometimes inconsistent
Standard Parsing Evaluation
Same setup as MWE identification!
- Corpus: Paris 7 French Treebank (FTB)
- Split: same as Crabbé and Candito (2008)
- Metrics: Evalb and Leaf Ancestor
- Lengths ≤ 40 words
French Parsing Evaluation: All bracketings
Evalb F1: PA-PCFG 67.6; Splits 75.2; DP-TSG 75.8
Paper: more results (Stanford, Berkeley, etc.)
Future Directions
- Syntactic context for n-gram methods: parse the corpus! Adapt lexical context measures to syntactic context
- DP-TSG: better base distribution
Conclusion
- Parsers work well for MWE identification
- Other languages: combine treebanks with MWE lists
- Non-'gold' mode parsing results for French
- Code: Google 'Stanford parser'
un grand merci ('thanks a lot')
Questions?
MWE Identification Results (all systems)
F1: PA-PCFG 32.6; mwetoolkit 34.7; Splits 63.1; Berkeley 69.6; Stanford 70.1; DP-TSG 71.1
Dirichlet process TSG
DP prior for each non-terminal type c ∈ V:
θ_c | α_c, P_0(· | c) ~ DP(α_c, P_0), with a parameter θ_{c,e} ∈ θ_c for each elementary tree e
Binary variable b_s for each non-terminal node in the corpus
Supervised case: segment the treebank
²Cohn, Goldwater, and Blunsom 2009; Post and Gildea 2009; O'Donnell, Tenenbaum, and Goodman 2009
DP-TSG: Base distribution P_0
Phrasal rules:
P_0(A⁺ → B C⁺) = p_MLE(A → B C) · s_B · (1 − s_C)
- p_MLE is the manually-split grammar!
- s_B is the stop probability
DP-TSG: Base distribution P_0 (cont.)
Lexical insertion rules:
P_0(C⁺ → t) = p_MLE(C → t) · p(t)
- p(t) is the unigram probability of word t
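A tiny numeric sketch of one expansion step of the phrasal-rule formula above. The function name and interface are illustrative, not from the paper; the real base distribution multiplies one such factor per expansion inside the fragment:

```python
def p0_expansion(p_mle, children):
    """One expansion step of the base distribution P_0.
    p_mle: probability of the CFG rule under the manually-split grammar.
    children: one (s_X, stopped) pair per child non-terminal X.
    A stopped child (factor s_X) becomes a frontier node of the fragment;
    a continuing child (factor 1 - s_X) is expanded further."""
    prob = p_mle
    for s, stopped in children:
        prob *= s if stopped else (1.0 - s)
    return prob

# P_0(A+ -> B C+) = p_MLE(A -> B C) * s_B * (1 - s_C):
# B stops (frontier node), C continues growing.
```

With p_MLE(A → B C) = 0.2 and s_B = s_C = 0.7, the fragment step costs 0.2 · 0.7 · 0.3 = 0.042.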
Tree substitution grammars
A probabilistic TSG is a 5-tuple ⟨V, Σ, R, S, θ⟩:
- V: non-terminals
- S ∈ V: unique start symbol
- Σ: terminals (t ∈ Σ)
- R: elementary trees (e ∈ R)
- θ: parameters θ_{c,e} for each tree fragment
elementary tree == tree fragment
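As a toy illustration of the formalism, a TSG derivation repeatedly substitutes an elementary tree at a frontier non-terminal. This sketch (not the paper's code) encodes trees as nested tuples and assumes terminals never share names with non-terminal labels:

```python
def substitute(tree, fragment):
    """Substitute `fragment` at the leftmost frontier non-terminal of
    `tree` whose label equals the fragment's root. Trees are nested
    tuples (label, children...); a bare string is a frontier node.
    Returns (new_tree, True) on success, (tree, False) otherwise."""
    if isinstance(tree, str):                         # frontier non-terminal
        return (fragment, True) if tree == fragment[0] else (tree, False)
    children = list(tree[1:])
    for i, child in enumerate(children):
        new, ok = substitute(child, fragment)
        if ok:                                        # leftmost site found
            children[i] = new
            return (tree[0],) + tuple(children), True
    return tree, False
```

Starting from the start symbol and substituting a handful of fragments, including a single multiword fragment for the idiom, yields the full parse of "He kicked the bucket" in one MWV step.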