The SUBTLE NL Parsing Pipeline: A Complete Parser for English Mitch Marcus University of Pennsylvania

Size: px

Start display at page:

Download "The SUBTLE NL Parsing Pipeline: A Complete Parser for English Mitch Marcus University of Pennsylvania"

Branden Asher Stanley
5 years ago
Views:

1 The SUBTLE NL Parsing Pipeline: A Complete Parser for English Mitch Marcus University of Pennsylvania 1

2 PICTURE OF ANALYSIS PIPELINE Tokenize Maximum Entropy POS tagger MXPOST Ratnaparkhi Core Parser Collins Generative Prob. CFG Functional Tag Replacement Kulick Null Element Placement Ryan Gabbard PhD Dec 2010 (just submitted) 2

3 NLP Analysis Pipeline Typed text in Tokenize Maximum Entropy POS Tagger (Ratnaparkhi 97) Parse (Collins Bikel 04) Junior, defuse any bombs immediately when going into a room. Junior_NNP,_, defuse_vbp any_dt bombs_nns immediately_rb when_wrb going_vbg into_in a_dt room_nn._. ( (S (NP-VOC (NNP Junior)) (,,) (NP-SBJ-0 (-NONE- *)) (VP (VB defuse) (NP (DT any) (NNS bombs)) (ADVP-TMP (RB immediately)) (SBAR-TMP (WHADVP-1 (WRB when)) (..))) (S (NP-SBJ-0 (-NONE- *)) (VP (VBG going) (PP-DIR (IN into) (NP (DT a) (NN room))) (ADVP-1 (-NONE- *T*)))))) 11/4/2010 SUBTLE Annual Review 3

4 NLP Analysis Pipeline Typed text in Tokenize Maximum Entropy POS Tagger (Ratnaparkhi 97) Parse (Collins Bikel 04) Restore Function Tags (Kulick 06) Junior, defuse any bombs immediately when going into a room. Junior_NNP,_, defuse_vbp any_dt bombs_nns immediately_rb when_wrb going_vbg into_in a_dt room_nn._. ( (S (NP-VOC (NNP Junior)) (,,) (NP-SBJ-0 (-NONE- *)) (VP (VB defuse) (NP (DT any) (NNS bombs)) (ADVP-TMP (RB immediately)) (SBAR-TMP (WHADVP-1 (WRB when)) (..))) (S (NP-SBJ-0 (-NONE- *)) (VP (VBG going) (PP-DIR (IN into) (NP (DT a) (NN room))) (ADVP-1 (-NONE- *T*)))))) 11/4/2010 SUBTLE Annual Review 4

5 NLP Analysis Pipeline Typed text in Tokenize Maximum Entropy POS Tagger (Ratnaparkhi 97) Parse (Collins Bikel 04) Restore Function Tags (Kulick 06) Restore Null Elements (Gabbard 10) Junior, defuse any bombs immediately when going into a room. Junior_NNP,_, defuse_vbp any_dt bombs_nns immediately_rb when_wrb going_vbg into_in a_dt room_nn._. ( (S (NP-VOC (NNP Junior)) (,,) (NP-SBJ-0 (-NONE- *)) (VP (VB defuse) (NP (DT any) (NNS bombs)) (ADVP-TMP (RB immediately)) (SBAR-TMP (WHADVP-1 (WRB when)) (..))) (S (NP-SBJ-0 (-NONE- *)) (VP (VBG going) (PP-DIR (IN into) (NP (DT a) (NN room))) (ADVP-1 (-NONE- *T*)))))) 11/4/2010 SUBTLE Annual Review 5

6 NLP Analysis Pipeline 11/4/2010 Typed text in Tokenize Maximum Entropy POS Tagger (Ratnaparkhi 97) Parse (Collins Bikel 04) Restore Function Tags (Kulick 06) Restore Null Elements (Gabbard 10) Semantic Analysis (Perera) Junior, defuse any bombs immediately when going into a room. Junior_NNP,_, defuse_vbp any_dt bombs_nns immediately_rb when_wrb going_vbg into_in a_dt room_nn._. ( (S (NP-VOC (NNP Junior)) (,,) (NP-SBJ-0 (-NONE- *)) (VP (VB defuse) (NP (DT any) (NNS bombs)) (ADVP-TMP (RB immediately)) (SBAR-TMP (WHADVP-1 (WRB when)) (..))) (S (NP-SBJ-0 (-NONE- *)) (VP (VBG going) SUBTLE Annual Review (PP-DIR (IN into) (NP (DT a) (NN room))) (ADVP-1 (-NONE- *T*)))))) 6

7 1995: A breakthrough in parsing 10 6 words of Penn Treebank Annotation + Machine Learning = Robust Parsers training sentences (Magerman 95) Training Program answers The founder of Pakistan's nuclear program, Abdul Qadeer Khan, has admitted he transferred nuclear technology to Iran, Libya and North Korea 1990 Best hand-built parsers: Statistical parsers: (both on short sentences) Models Parser ~40% accuracy >90% accuracy 7 S VP VP SBA R NP S VP NP PP NP PP NP NP NP NP NP NP NP NP The founder of Pakistan s nuclear department Abdul Qadeer Khan has admitted he transferred nuclear technology to Iran, Libya, and North Korea

8 Generative Parsing Results (<40 w) Parser F 1 Magerman Collins Charniak Collins 99 (Bikel 04) 88.6 Liang 08 (extending Charniak 05)

9 Meaningless Progress.. 9

10 Treebank Tree with Function Tags NP-SBJ S VP Junior (S (NP-SBJ (NNP Junior)) (VP (MD should) (VP (VB defuse) (NP (NP (NNS bombs)) (PP-LOC (IN in) (NP (DT a) should (NN room)))) defuse (ADVP-TMP (RB immediately)))) VP NP NP any bombs PP-LOC in NP the room NP-TMP immediately 10

11 Treebank Tree with Null Elements SBARQ WHNP-1 Who was SQ VP NP-SBJ-2 *T*-1 S believed NP-SBJ-3 to have been *T*-2 VP VP shot NP believe(somebody, shoot (SOMEONE, Who)) *-3 11

12 A Perfect Parseval Analysis.. SBARQ WHNP SQ Who was believed VP S Without null element, can t tell what the semantic role of Who was, or even the predication it s part of. VP to have been VP shot believe(somebody, shoot (SOMEONE, Who)) 12

13 A Complete Parsing System Ryan Gabbard, Seth Kulick, and Mitch Marcus. Fully Parsing the Penn Treebank. HLT/NAACL 2006 Ryan Gabbard, Null Element Restoration, PhD Dissertation, Nov

14 Step 1: Recovering Function Tags 14

15 Treebank II Function Tag Set Syntactic Semantic Miscellaneous SBJ Subject TMP Temporal CLF It-cleft LGS Logical Subject LOC Location HLN Headline PRD Predicate DIR Direction TTL Title DTV Dative EXT Extent Topicalization PUT Locative of put VOC Vocative MNR Manner BNF Benefactive TPC Topic PRP Purpose NOM Nominal CLR ADV Non-specific adverbial CLR Closely- Related 15

16 A Sample PCFG 16

17 An Example Derivation 17

18 The Key Trick in 95: Lexicalizing the CFG New problem: Estimating the probability of S(bought) NP(yesterday) NP(Marks) VP(bought) 18

19 Solving the Sparse Data Problem with Independence Assumptions Instead of S(bought) NP(yesterday) NP(Marks) VP(bought) 1. Generate Head, given lexicalized parent: S(bought) VP(bought). 2. Generate other children given parent & head: S(bought) NP( ) NP ( ) VP(bought) 3. Lexicalize other children, given parent, head & node type: S(bought) NP(yesterday)NP(Marks) VP(bought) 19

20 To augment the parser output with function tags 1. Find the code in Collins Model II that removes function tags 2. Remove it! {NP-OBJ( )} NP-OBJ( ) NP-TMP( ) NP-OBJ(her) NP-TMP(today) (From Collins & Koo 05) 20

21 Results for Function Tag Groups & Overall Overall w/ CLR Overall w/o CLR Collins 2 +FTags Blaheta, Jijkoun & de Rijke, With no drop in basic parser F score! 21

22 Results for Function Tag Groups & Overall Overall w/ CLR Overall w/o CLR Syn 56% Sem 36% Collins 2 +FTags Blaheta, Jijkoun & de Rijke, Musillo & Merlo, Oct With no drop in basic parser F score! 22

23 Step 2: Recovering Null Elements 23

24 Null Elements for Wh-movement SBARQ WHNP-1 SQ Predicate argument structure: see(your,what) What NP-SBJ VP did you see NP *T*-1 *T* - marks WH-questions Null elements are co-indexed using integer tags. To recover Pred-arg structure: Replace null element with co-indexed lexical material Similar approaches for other grammatical phenomenon 24

25 Previous Work Forward Integrated with Parsing (Collins Model 3, Dienes&Dubey 03) Postprocessing Parser Output Hand Built Rule-based (Campbell 04) Statistical Machine Learning Approaches (Johnson 02, Levy&Manning 04, Jijkoun&de Rijke 04) Our approach combines: Campbell s Problem Decomposition Machine Learning Into: A pipeline of linear classifiers 25

26 The Null Element Replacement Algorithm Pipeline & rich linguistic features á la Campbell Pipeline of linear classifiers (McCallum s Mallet) parser output with function tags NPTrace (NP*)insert NullComp ProAntecedent WHXPInsert-> WHXPDiscern Antecedentless WHTrace 26

27 Aggregate results over all recovered categories Automatically parsed trees from section 23 27

28 Present vs. Campbell s Rule Based System Test on gold--standard trees from section 23 using Campbell's metric 28

29 Wh placement with factor graphs 29

30 Wh placement with factor graphs 30

31 Results: two systems for Wh-placement 31

32 Summary: Full Parsing! Before S NP VP After S NP-SBJ-1 VP patients were VBD VP VBN examined patients VBD were 20 Function Tags Mostly syntactic or semantic Syntactic Tags: 95% recovery accuracy Semantic Tags: 84% recovery accuracy Traces Passivization, control, raising: ~70% recovered Wh-traces: ~80% recovered VP VBN examined NP-1 * 32

A Context-Free Grammar

Statistical Parsing A Context-Free Grammar S VP VP Vi VP Vt VP VP PP DT NN PP PP P Vi sleeps Vt saw NN man NN dog NN telescope DT the IN with IN in Ambiguity A sentence of reasonable length can easily