Natural Language Processing 1, Lecture 7: Constituent Parsing. Ivan Titov. Institute for Logic, Language and Computation
2 Outline Syntax: intro, CFGs, PCFGs PCFGs: Estimation CFGs: Parsing PCFGs: Parsing Parsing evaluation Unsupervised estimation of PCFGs (?) Grammar refinement (producing a state-of-the-art parser) (?)
3 Syntactic parsing. Syntax: the study of the patterns of formation of sentences and phrases from words. The borders with semantics and morphology are sometimes blurred: e.g., Afyonkarahisarlılaştırabildiklerimizdenmişsinizcesine in Turkish means "as if you are one of the people that we thought to be originating from Afyonkarahisar" [wikipedia]. Parsing: the process of predicting syntactic representations. Different types of syntactic representations are possible, for example a constituent (a.k.a. phrase-structure) tree, (S (NP (PRP My) (N dog)) (VP (V ate) (NP (D a) (N sausage)))), where each internal node corresponds to a phrase, or a dependency tree over the same sentence "My dog ate a sausage", with labelled arcs (poss, nsubj, dobj, det) from a designated root.
4 Constituent trees. Internal nodes correspond to phrases: S - a sentence; NP (Noun Phrase): My dog, a sandwich, lakes, ..; VP (Verb Phrase): ate a sausage, barked, ..; PP (Prepositional Phrase): with a friend, in a car, ... Nodes immediately above words are PoS tags (aka preterminals): PRP - pronoun, D - determiner, V - verb, N - noun, P - preposition. Example: (S (NP (PRP My) (N dog)) (VP (V ate) (NP (D a) (N sausage)))).
5 Bracketing notation. It is often convenient to represent a tree as a bracketed sequence: (S (NP (PRP My) (N dog)) (VP (V ate) (NP (D a) (N sausage)))). We will use this format in the assignment.
6 Dependency trees. Nodes are words (along with their PoS tags); directed arcs encode syntactic dependencies between them; labels are the types of relations between the words. Example, for "My dog ate a sausage": the designated root symbol points to "ate", with arcs poss(dog, My), nsubj(ate, dog), dobj(ate, sausage), det(sausage, a). poss - possessive, dobj - direct object, nsubj - subject, det - determiner.
7 Constituent and dependency representations. Constituent trees can (potentially) be converted to dependency trees: one potential rule extracts an nsubj dependency from the configuration (S (NP ...) (VP ...)) (in practice, the rules are specified a bit differently). Dependency trees can (potentially) be converted to constituent trees, though recovering labels is harder. Roughly: every word along with all its dependents corresponds to a phrase, i.e. to an inner node of the constituent tree, e.g., (S (NP (PRP My) (N dog)) (VP (V barked))) for "My dog barked".
9 Recovering shallow semantics. Some semantic information can be (approximately) derived from syntactic information: subjects (nsubj) are (often) agents ("initiators / doers of an action"); direct objects (dobj) are (often) patients ("affected entities"). But even for agents and patients, consider "Mary is baking a cake in the oven" vs "A cake is baking in the oven": in the first, the syntactic subject corresponds to an agent (who is baking?); in the second, to a patient (what is being baked?). In general, recovering even the most shallow forms of semantics is not trivial. E.g., consider prepositions: "in" can encode direction, position, temporal information, ...
10 Brief history of parsing. Before the mid-90s: syntactic rule-based parsers producing rich linguistic information; they provide low coverage (though this is hard to evaluate) and predict much more information than the formalisms we have just discussed. Then came the realization that basic PCFGs do not result in accurate predictive models (we will discuss this in detail). Mid-90s: the first accurate statistical parsers (e.g., [Magerman 1994, Collins 1996]): no handcrafted grammars, estimated from large datasets (Penn Treebank WSJ). Now: better models, more efficient algorithms, more languages, ...
11 Ambiguity. Ambiguous sentence: "I saw a girl with a telescope". What kind of ambiguity does it have? What implications does it have for parsing?
12 Why is parsing hard? Ambiguity. Prepositional phrase attachment ambiguity: in "I saw a girl with a telescope", the PP "with a telescope" can attach to the VP (the seeing was done with a telescope) or to the NP "a girl" (the girl has a telescope). What kind of clues would be useful? PP-attachment ambiguity is just one example type of ambiguity. How serious is this problem in practice?
13 Why is parsing hard? Ambiguity. Example with 3 prepositional phrases, 5 interpretations: Put the block ((in the box on the table) in the kitchen). Put the block (in the box (on the table in the kitchen)). Put ((the block in the box) on the table) in the kitchen. Put (the block (in the box on the table)) in the kitchen. Put (the block in the box) (on the table in the kitchen).
14 The general case is given by the Catalan numbers: Cat_n = (1/(n+1)) * C(2n, n) ≈ 4^n / (n^(3/2) * sqrt(pi)). Catalan numbers: 1, 2, 5, 14, 42, 132, 429, 1430, 4862, 16796, 58786, ...
15 PP-attachment is only one type of ambiguity, but almost all sentences exhibit many types.
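The Catalan growth above is easy to check numerically; a small sketch (the function name `catalan` is illustrative, not from the slides):

```python
from math import comb

def catalan(n):
    # Cat_n = C(2n, n) / (n + 1): the number of binary bracketings
    # of a sequence of n + 1 items (exact integer division).
    return comb(2 * n, n) // (n + 1)

print([catalan(n) for n in range(1, 8)])  # → [1, 2, 5, 14, 42, 132, 429]
```

The first values match the sequence on the slide, so a sentence with a dozen attachment points already has thousands of bracketings.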
16 Why is parsing hard? Ambiguity. A typical tree from a standard dataset (Penn Treebank WSJ) [from Michael Collins' slides]: Canadian Utilities had 1988 revenue of $ 1.16 billion, mainly from its natural gas and electric utility businesses in Alberta, where the company serves about 800,000 customers.
17 Key problems. Recognition problem: is the sentence grammatical? Parsing problem: what is a (most plausible) derivation (tree) corresponding to the sentence? The parsing problem encompasses the recognition problem and is the more interesting one from a practical point of view.
18 Context-Free Grammar. A context-free grammar is a tuple of 4 elements, G = (V, Σ, R, S): V - the set of non-terminals, in our case phrase categories (VP, NP, ..) and PoS tags (N, V, .. aka preterminals); Σ - the set of terminals (words); R - the set of rules of the form X → Y_1 Y_2 ... Y_n, where n ≥ 0, X ∈ V, Y_i ∈ V ∪ Σ; S ∈ V is a dedicated start symbol. Example rules: S → NP VP, NP → D N, N → girl, N → telescope, V → saw, V → eat, PP → P NP.
19 An example grammar. V = {S, VP, NP, PP, N, V, PRP, P, D}, Σ = {girl, telescope, sandwich, I, saw, ate, with, in, a, the}, S = {S}. R consists of inner rules: S → NP VP (e.g., (NP A girl) (VP ate a sandwich)); VP → V NP ((V ate) (NP a sandwich)); VP → VP PP ((VP saw a girl) (PP with a telescope)); NP → NP PP ((NP a girl) (PP with a sandwich)); NP → D N ((D a) (N sandwich)); NP → PRP; PP → P NP ((P with) (NP a sandwich)); and preterminal rules (corresponding to the emission model of an HMM): N → girl, N → telescope, N → sandwich, PRP → I, V → saw, V → ate, P → with, P → in, D → a, D → the.
20 CFGs. The grammar derives "I saw a girl with a telescope" rule by rule, yielding, e.g., the tree (S (NP (PRP I)) (VP (V saw) (NP (NP (D a) (N girl)) (PP (P with) (NP (D a) (N telescope)))))).
28 A CFG defines both: a set of strings (a language) and the structures used to represent sentences (constituent trees).
29 Why context-free? What can be a subtree is only affected by what the phrase type is (e.g., VP), not by the context: (S (NP The dog) (VP (V ate) (NP a sandwich))).
30 But context-freeness also licenses ungrammatical combinations: substituting another VP, e.g., (VP (V bark) (ADVP (A often))), for the VP above yields "The dog bark often", which is not grammatical.
31 This matters if we want to generate language (e.g., language modeling), but is it relevant to parsing?
32 Introduced coordination ambiguity. Consider "Koalas eat leaves and barks". The coarse VP and NP categories cannot enforce subject-verb agreement in number, resulting in a coordination ambiguity: (VP (V eat) (NP leaves (CC and) barks)) vs (VP (VP (V eat) leaves) (CC and) (VP (V barks))). "Barks" can be both a noun and a verb. The second tree would be ruled out if the context (subject-verb agreement) were somehow captured.
33 Introduced coordination ambiguity. Even more detailed PoS tags are not going to help here: with tags NNS koalas, VBP eat, NNS leaves, CC and, NNS/VBZ barks, both trees remain licensed; the incorrect one would only be ruled out if subject-verb agreement were captured.
34 Key problems. Recognition problem: does the sentence belong to the language defined by the CFG? That is, is there a derivation which yields the sentence? Parsing problem: what is a (most plausible) derivation (tree) corresponding to the sentence? The parsing problem encompasses the recognition problem.
35 How to deal with ambiguity? There are (exponentially) many derivations for a typical sentence: e.g., "Put the block in the box on the table in the kitchen" has many parses, differing in how the PPs attach. We want to score all the derivations to encode how plausible they are.
36 An example probabilistic CFG. Associate a probability p(X → α) with each rule: for every rule X → α ∈ R, 0 ≤ p(X → α) ≤ 1, and for every X ∈ V the probabilities of its rules sum to one, Σ_{α : X → α ∈ R} p(X → α) = 1. Now we can score a tree as the product of the probabilities of the rules used in its derivation. The rules are the same as in the example grammar above (S → NP VP, VP → V NP, VP → VP PP, NP → NP PP, NP → D N, NP → PRP, PP → P NP, plus the preterminal rules), each annotated with a probability.
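The product-of-rules scoring can be sketched as follows. The grammar fragment and its probability values here are illustrative assumptions, not the numbers from the slide:

```python
# A tree is (label, child, ...); a preterminal node is (tag, word).
# Probabilities below are illustrative assumptions, not the slide's.
rule_prob = {
    ("S", ("NP", "VP")): 1.0,
    ("NP", ("PRP",)): 0.5,
    ("NP", ("D", "N")): 0.5,
    ("VP", ("V", "NP")): 0.6,
    ("PRP", ("I",)): 1.0,
    ("V", ("saw",)): 0.5,
    ("D", ("a",)): 0.5,
    ("N", ("girl",)): 0.4,
}

def tree_prob(tree):
    # p(T) = product of the probabilities of all rules used in T
    label, *children = tree
    if len(children) == 1 and isinstance(children[0], str):
        return rule_prob[(label, (children[0],))]      # preterminal rule
    p = rule_prob[(label, tuple(c[0] for c in children))]  # inner rule
    for c in children:
        p *= tree_prob(c)
    return p

t = ("S", ("NP", ("PRP", "I")),
          ("VP", ("V", "saw"), ("NP", ("D", "a"), ("N", "girl"))))
print(tree_prob(t))  # probability of this parse of "I saw a girl"
```

Each derivation of an ambiguous sentence gets its own score, which is exactly what lets us rank the competing parses.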
37 PCFGs. Scoring a tree for "I saw a girl with a telescope": p(T) is the product of the probabilities of all rules used in the derivation, e.g., p(T) = p(S → NP VP) · p(NP → PRP) · p(PRP → I) · p(VP → V NP) · p(V → saw) · ..., multiplied out with the probabilities attached to the rules.
49 Distribution over trees. We defined a distribution over production rules for each non-terminal; our goal was to define a distribution over parse trees. Unfortunately, not all PCFGs give rise to a proper distribution over trees, i.e. the sum of the probabilities of all trees the grammar can generate may be less than 1: Σ_T P(T) < 1. Good news: any PCFG estimated with the maximum likelihood procedure is always proper [Chi and Geman, 98].
50 Distribution over trees. Let us denote by G(x) the set of derivations for the sentence x. The probability distribution P(T) defines the scoring over the trees T ∈ G(x). Finding the best parse for the sentence according to the PCFG: argmax_{T ∈ G(x)} P(T). We will look at this in the next class.
51 Outline Syntax: intro, CFGs, PCFGs PCFGs: Estimation CFGs: Parsing PCFGs: Parsing Parsing evaluation Unsupervised estimation of PCFGs (?) Grammar refinement (producing a state-of-the-art parser) (?) 51
52 ML estimation. A treebank: a collection of sentences annotated with constituent trees, e.g., trees for "My dog ate a sausage", "Put the block in the box on the table in the kitchen", "I saw a girl with a telescope", "Koalas eat leaves and barks", ...
53 ML estimation. The estimated probability of a rule (maximum likelihood estimate): p(X → α) = C(X → α) / C(X), where C(X → α) is the number of times the rule is used in the corpus and C(X) is the number of times the non-terminal X appears in the treebank.
55 ML estimation. Smoothing is helpful, especially for the preterminal rules, i.e. the generation of words (as in the emission model in PoS tagging). See smoothing techniques in Section 4.5 of the J&M book (+ the section on techniques specifically dealing with unknown words).
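The count-and-divide estimate can be sketched directly. The tree encoding (label, child, ...) and all names here are illustrative assumptions:

```python
from collections import Counter

# Maximum likelihood estimation of PCFG rule probabilities:
# p(X -> alpha) = C(X -> alpha) / C(X).
def rules(tree):
    # enumerate every rule application in a tree
    label, *children = tree
    if len(children) == 1 and isinstance(children[0], str):
        yield (label, (children[0],))                 # preterminal rule
        return
    yield (label, tuple(c[0] for c in children))      # inner rule
    for c in children:
        yield from rules(c)

def mle(treebank):
    rule_counts = Counter(r for t in treebank for r in rules(t))
    lhs_counts = Counter()
    for (lhs, rhs), c in rule_counts.items():
        lhs_counts[lhs] += c
    return {r: c / lhs_counts[r[0]] for r, c in rule_counts.items()}

# toy treebank of two trees
treebank = [("S", ("A", "a"), ("B", "b")),
            ("S", ("A", "a"), ("A", "a"))]
probs = mle(treebank)  # e.g., p(S -> A B) = 1/2, p(A -> a) = 1
```

Since every non-terminal's rule counts are normalized by the same denominator, the estimated probabilities sum to one per parent, as the PCFG definition requires (before smoothing).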
56 ML estimation: an example [ex. from Collins IWPT 01]. A toy treebank with three tree types: n_1 trees (S (B a a) (C a a)), n_2 trees (S (C a a a)), and n_3 trees (S (B a)). Without smoothing:
Rule / Count / Prob. estimate:
S → B C / n_1 / n_1/(n_1+n_2+n_3)
S → C / n_2 / n_2/(n_1+n_2+n_3)
S → B / n_3 / n_3/(n_1+n_2+n_3)
B → a a / n_1 / n_1/(n_1+n_3)
B → a / n_3 / n_3/(n_1+n_3)
C → a a / n_1 / n_1/(n_1+n_2)
C → a a a / n_2 / n_2/(n_1+n_2)
59 Penn Treebank: peculiarities. Wall Street Journal: around 40,000 annotated sentences, 1,000,000 words. Fine-grained part-of-speech tags (45), e.g., for verbs: VBD - verb, past tense; VBG - verb, gerund or present participle; VBP - verb, present (non-3rd person singular); VBZ - verb, present (3rd person singular); MD - modal. Flat NPs (no attempt to disambiguate attachment), e.g., (NP (DT a) (JJ hot) (NN dog) (NN food) (NN cart)).
60 Outline. Syntax: intro, CFGs, PCFGs. PCFGs: Estimation. CFGs: Parsing. PCFGs: Parsing. Parsing evaluation. Unsupervised estimation of PCFGs (?). Grammar refinement (producing a state-of-the-art parser) (?)
61 Parsing. Parsing is search through the space of all possible parses: we may want any parse, all parses, or (for a PCFG) the highest-scoring parse, argmax_{T ∈ G(x)} P(T), where P(T) is the probability under the PCFG model and G(x) is the set of all trees given by the grammar for the sentence x. Bottom-up: start from the words and attempt to construct the full tree. Top-down: start from the start symbol and attempt to expand so as to get the sentence.
62 CKY algorithm (aka CYK). The Cocke-Kasami-Younger algorithm, independently discovered in the late 60s / early 70s. An efficient bottom-up parsing algorithm for (P)CFGs; it can be used both for the recognition and the parsing problems. Very important in NLP (and beyond). We will start with the non-probabilistic version.
64 Constraints on the grammar. The basic CKY algorithm supports only rules in Chomsky Normal Form (CNF): unary preterminal rules C → x (generation of words given PoS tags: N → telescope, D → the, ...) and binary inner rules C → C_1 C_2 (e.g., S → NP VP, NP → D N). Any CFG can be converted to an equivalent CNF grammar. Equivalent means that they define the same language; however, the (syntactic) trees will look different, which makes linguists unhappy. It is possible to address this by defining transformations that allow for an easy reverse transformation.
65 Transformation to CNF. What one needs to do to convert to CNF: Get rid of empty (aka epsilon) productions, C → ε: generally not a problem, as there are no empty productions in the standard (post-processed) treebanks. Get rid of unary rules, C → C_1: not a problem, as our CKY algorithm will support unary rules. Binarize n-ary rules, C → C_1 C_2 ... C_n (n > 2): crucial to process these, as required for efficient parsing.
68 Transformation to CNF: binarization. Consider NP → DT NNP VBG NN, as in "the Dutch publishing group". How do we get a set of binary rules which is equivalent? NP → DT X, X → NNP Y, Y → VBG NN. A more systematic way to refer to the new non-terminals is to derive their names from the rule being binarized (e.g., intermediate symbols like @NP marking an incomplete NP).
69 Transformation to CNF: binarization. Instead of binarizing rules, we can binarize trees on preprocessing: (NP (DT the) (NNP Dutch) (VBG publishing) (NN group)) becomes (NP (DT the) (@NP (NNP Dutch) (@NP (VBG publishing) (NN group)))). Also known as lossless Markovization in the context of PCFGs; it can be easily reversed on postprocessing.
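The tree-side binarization can be sketched as a small recursive transform; the tuple encoding and "@" label convention are illustrative assumptions:

```python
# Right-branching binarization of an n-ary tree on preprocessing.
# "Lossless" in the sense that the transform is reversible: the "@"
# labels mark exactly where to re-merge children on postprocessing.
def binarize(tree):
    label, *children = tree
    if len(children) == 1 and isinstance(children[0], str):
        return tree                           # preterminal: keep as is
    children = [binarize(c) for c in children]
    while len(children) > 2:
        # fold the last two children under a fresh intermediate label
        left, right = children[-2], children[-1]
        children = children[:-2] + [("@" + label, left, right)]
    return (label, *children)

t = ("NP", ("DT", "the"), ("NNP", "Dutch"),
     ("VBG", "publishing"), ("NN", "group"))
binarize(t)  # nests VBG and NN under @NP, then NNP above them
```

Binary trees pass through unchanged, so the transform can be applied uniformly to a whole treebank before training.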
72 CKY: Parsing task. We are given a grammar G = (V, Σ, R, S) (with start symbol S) and a sequence of words w = (w_1, w_2, ..., w_n); our goal is to produce a parse tree for w. [Some illustrations and slides in this lecture are from Marco Kuhlmann.] We need an easy way to refer to substrings of w: indices refer to fenceposts, and span (i, j) refers to the words between fenceposts i and j.
73 Key problems. Recognition problem: does the sentence belong to the language defined by the CFG, i.e. is there a derivation which yields the sentence? Parsing problem: what is a derivation (tree) corresponding to the sentence? Probabilistic parsing: what is the most probable tree for the sentence?
74 Parsing one word. A preterminal rule C → w_i puts the label C over the single-word span covering w_i.
77 Parsing longer spans. A binary rule C → C_1 C_2 builds a tree over a longer span: check through all C_1, C_2, and split points mid.
80 Signatures. The application of a rule is independent of the inner structure of a parse tree: we only need to know the corresponding span and the root label of the tree, i.e. its signature [min, max, C], also known as an edge.
81 CKY idea Compute for every span a set of admissible labels (may be empty for some spans) Start from small trees (single words) and proceed to larger ones 81
83 CKY idea. When done, check whether S is among the admissible labels for the whole sentence; if yes, the sentence belongs to the language, i.e. a tree with signature [0, n, S] exists.
85 CKY in action. Sentence: "lead can poison". Inner rules: S → NP VP, NP → N N, NP → N, VP → M V, VP → V. Preterminal rules: N → can, N → lead, N → poison, M → can, M → must, V → poison, V → lead.
86 CKY in action. The chart (aka parsing triangle) has a cell for every span (min, max) with 0 ≤ min < max ≤ 3; the question is whether S ends up in the cell for the whole sentence.
87 CKY in action. Filling the chart for "lead can poison": the preterminal rules fill the single-word cells ("lead" {N, V}, "can" {N, M}, "poison" {N, V}), and after each cell is filled, the unary rules NP → N and VP → V are checked and add NP and VP where they apply. The binary rules then fill the two-word spans, trying every split point mid: e.g., "can poison" gets NP via NP → N N and VP via VP → M V. For the full span, S is derived for both mid = 1 and mid = 2.
107 CKY in action. The full-span cell contains S, derived via two different split points: apparently the sentence is ambiguous for the grammar (the grammar overgenerates).
108 Ambiguity. The two parses: (S (NP (N lead)) (VP (M can) (V poison))) and (S (NP (N lead) (N can)) (VP (V poison))). The second has no subject-verb agreement and uses "poison" as an intransitive verb. Apparently the sentence is ambiguous for the grammar (as the grammar overgenerates).
109 CKY more formally. The chart can be represented by a Boolean array chart[min][max][label]: chart[min][max][C] = true if the signature (min, max, C) has already been added to the chart, and false otherwise. Relevant entries have 0 ≤ min < max ≤ n. Here we assume that the labels C are integer indices.
111 Implementation: preterminal rules 111
112 Implementation: binary rules 112
113 Unary rules. How do we integrate unary rules C → C_1?
114 Implementation: unary rules 114
115 Implementation: unary rules But we forgot something! 115
117 Unary closure. What if the grammar contained the 2 rules A → B and B → C? Then C can be derived from A by a chain of rules: A → B → C. One could support chains in the algorithm, but it is easier to extend the grammar to its reflexive transitive closure: from {A → B, B → C} we get {A → B, B → C, A → C, A → A, B → B, C → C}. The identity rules are convenient for programming reasons in the PCFG case.
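The closure construction can be sketched as a fixed-point computation; here each closure entry also stores the full label chain, which is what lets a composite rule like A → C be expanded back when recovering the true tree (names are illustrative):

```python
# Reflexive transitive closure over unary rules.
def unary_closure(unary_rules, labels):
    closure = {(a, a): [a] for a in labels}          # reflexive part
    for a, b in unary_rules:
        closure[(a, b)] = [a, b]
    changed = True
    while changed:                                   # transitive part
        changed = False
        for (a, b), chain1 in list(closure.items()):
            for (c, d), chain2 in list(closure.items()):
                if b == c and (a, d) not in closure:
                    closure[(a, d)] = chain1 + chain2[1:]
                    changed = True
    return closure

closure = unary_closure({("A", "B"), ("B", "C")}, {"A", "B", "C"})
# closure now contains the composite rule ("A", "C") with its chain
```

Computing the closure once, offline, keeps the inner CKY loops free of chain-following logic.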
118 Implementation: skeleton 118
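Putting the preterminal, binary, and unary steps together, a minimal sketch of the recognizer with set-valued chart cells; the rule encodings and the toy grammar are a hedged reconstruction of the "lead can poison" example, not the slides' exact code:

```python
# CKY recognizer: chart[lo][hi] is the set of admissible labels
# for the span (lo, hi).
def cky_recognize(words, preterm, binary, unary, start="S"):
    n = len(words)
    chart = [[set() for _ in range(n + 1)] for _ in range(n + 1)]

    def close_unary(cell):
        # repeatedly apply unary rules (parent -> child) within a cell
        agenda = list(cell)
        while agenda:
            c = agenda.pop()
            for parent, child in unary:
                if child == c and parent not in cell:
                    cell.add(parent)
                    agenda.append(parent)

    for i, w in enumerate(words):                     # preterminal rules
        chart[i][i + 1] = {c for c, x in preterm if x == w}
        close_unary(chart[i][i + 1])
    for span in range(2, n + 1):                      # binary rules
        for lo in range(n - span + 1):
            hi = lo + span
            for mid in range(lo + 1, hi):
                for parent, l, r in binary:
                    if l in chart[lo][mid] and r in chart[mid][hi]:
                        chart[lo][hi].add(parent)
            close_unary(chart[lo][hi])
    return start in chart[0][n]

preterm = {("N", "lead"), ("N", "can"), ("N", "poison"),
           ("M", "can"), ("M", "must"), ("V", "poison"), ("V", "lead")}
binary = {("S", "NP", "VP"), ("NP", "N", "N"), ("VP", "M", "V")}
unary = {("NP", "N"), ("VP", "V")}
print(cky_recognize(["lead", "can", "poison"], preterm, binary, unary))  # → True
```

The unary closure is applied after each cell is completed, mirroring the "check unary rules" step in the worked example above.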
119 Algorithm analysis. Time complexity: O(n^3 · |R|), where |R| is the number of rules in the grammar. There exist algorithms with better asymptotic time complexity, but the 'constant' makes them slower in practice (in general).
121 Algorithm analysis. In practice: a few seconds for sentences under 20 words for a non-optimized parser.
123 Practical time complexity In practice (for the PCFG version), the runtime grows roughly as n^3.6 [Plot by Dan Klein]
124 Outline Syntax: intro, CFGs, PCFGs PCFGs: Estimation CFGs: Parsing PCFGs: Parsing Parsing evaluation Unsupervised estimation of PCFGs (?) Grammar refinement (producing a state-of-the-art parser) (?)
125 Probabilistic parsing We discussed the recognition problem: check if a sentence is parsable with a CFG. Now we consider parsing with PCFGs. Recognition with PCFGs: what is the probability of the most probable parse tree? Parsing with PCFGs: what is the most probable parse tree?
126 PCFGs An example PCFG and a parse of "I saw a girl with a telescope", with a probability attached to each rule: S → NP VP, VP → V NP, VP → VP PP, NP → NP PP, NP → P, NP → D N, PP → P NP, N → girl, N → telescope, N → sandwich, P → I, V → saw, V → ate, P → with, P → in, D → a, D → the. The probability of the tree is the product of the probabilities of the rules used in its derivation: p(T) = p(S → NP VP) · p(NP → P) · p(P → I) · …
127 Distribution over trees Let us denote by G(x) the set of derivations for the sentence x. The probability distribution defines the scoring P(T) over the trees T ∈ G(x). Finding the best parse for the sentence according to the PCFG: arg max_{T ∈ G(x)} P(T). This will be another Viterbi / Max-Product algorithm!
128 CKY with PCFGs The chart is represented by a double array chart[min][max][C] It stores the probability of the most probable subtree with a given signature; chart[0][n][S] will store the probability of the most probable full parse tree
129 Intuition For a rule C → C1 C2: for every C choose C1, C2 and mid such that P(T1) · P(T2) · P(C → C1 C2) is maximal, where T1 and T2 are the left and right subtrees.
130 Implementation: preterminal rules
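A minimal sketch of the PCFG version of the preterminal step, assuming rule probabilities are given as a dict from (label, word) to p(label → word) (the names are assumptions):

```python
def init_prob_chart(words, preterminal_probs):
    """PCFG preterminal step: chart[i][i + 1][C] = p(C -> word_i), the score
    of the (only) width-1 subtree with signature (i, i + 1, C)."""
    n = len(words)
    # each cell maps a label C to the probability of its best subtree
    chart = [[{} for _ in range(n + 1)] for _ in range(n + 1)]
    for i, word in enumerate(words):
        for (label, terminal), p in preterminal_probs.items():
            if terminal == word:
                chart[i][i + 1][label] = p
    return chart
```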
131 Implementation: binary rules
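The Viterbi binary step can be sketched as below, keeping for each signature the score of the most probable subtree; binary rule probabilities are assumed to be a dict from (parent, left, right) to p(parent → left right):

```python
def viterbi_binary(chart, n, binary_probs):
    """Viterbi binary step: for each span and each parent C, keep the maximal
    P(T1) * P(T2) * p(C -> C1 C2) over all rules and midpoints, where T1 and
    T2 are the best left and right subtrees."""
    for width in range(2, n + 1):
        for mn in range(0, n - width + 1):
            mx = mn + width
            cell = chart[mn][mx]
            for mid in range(mn + 1, mx):
                for (parent, left, right), p in binary_probs.items():
                    if left in chart[mn][mid] and right in chart[mid][mx]:
                        score = chart[mn][mid][left] * chart[mid][mx][right] * p
                        if score > cell.get(parent, 0.0):
                            cell[parent] = score
```

Replacing the max with a sum would compute the total probability of the sentence instead of the best parse score.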
132 Unary rules Similarly to CFGs: after producing scores for signatures (C, i, j), try to improve the scores by applying unary rules (and rule chains). If improved, update the scores
133 Unary (reflexive transitive) closure As for CFGs, we extend the grammar: A → B, B → C ⇒ A → B, B → C, A → C, A → A, B → B, C → C. Note that this is not a PCFG anymore, as the rules do not sum to 1 for each parent. A closure rule such as A → C is composite (it stands for the chain A → B → C), and this fact needs to be stored to recover the true tree; its probability is the product of the probabilities along the chain. What about loops, like A → B → A → C?
137 Recovery of the tree For each signature we store backpointers to the elements from which it was built (e.g., the rule and, for binary rules, the midpoint). Start recovering from [0, n, S]. Be careful with unary rules: basically you can assume that you always used a unary rule from the closure (but it could be the trivial one C → C)
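A sketch of the recovery step in Python; the backpointer encoding (a dict from signatures to tagged tuples) is an assumption made for illustration, not the slide's data structure:

```python
def recover_tree(back, mn, mx, label):
    """Recover the best tree from backpointers. back maps a signature
    (mn, mx, C) to one of:
      ("term", word)               -- preterminal rule C -> word
      ("unary", child)             -- (closure) unary rule C -> child
      ("binary", mid, left, right) -- rule C -> left right, split at mid
    Trees are returned as nested (label, children...) tuples."""
    entry = back[(mn, mx, label)]
    if entry[0] == "term":
        return (label, entry[1])
    if entry[0] == "unary":
        return (label, recover_tree(back, mn, mx, entry[1]))
    _, mid, left, right = entry
    return (label,
            recover_tree(back, mn, mid, left),
            recover_tree(back, mid, mx, right))
```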
138 Speeding up the algorithm (approximate search) Any ideas?
139 Speeding up the algorithm (approximate search) Basic pruning (roughly): for every span (i, j), store only labels whose probability is at most some fixed factor smaller than the probability of the most probable label for this span. Check not all rules but only rules whose child labels have non-zero probability. Coarse-to-fine pruning: parse with a smaller (simpler) grammar, precompute (posterior) probabilities for each span, and use only the labels with non-negligible probability under the simpler grammar
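The basic pruning idea can be sketched in a few lines; the threshold ratio is whatever the parser chooses (the name `ratio` and the per-cell dict layout are assumptions):

```python
def prune_cell(scores, ratio):
    """Keep only labels whose probability is at least `ratio` times the
    probability of the most probable label in this cell."""
    if not scores:
        return scores
    best = max(scores.values())
    return {label: p for label, p in scores.items() if p >= best * ratio}
```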
140 Outline Syntax: intro, CFGs, PCFGs PCFGs: Estimation CFGs: Parsing PCFGs: Parsing Parsing evaluation Unsupervised estimation of PCFGs (?) Grammar refinement (producing a state-of-the-art parser) (?)
141 Parser evaluation Intrinsic evaluation: Automatic: evaluate against annotation provided by human experts (the gold standard) according to some predefined measure; though it has many drawbacks, it is easier and allows tracking the state of the art across many years. Manual: according to human judgment. Extrinsic evaluation: score a syntactic representation by comparing how well a system using this representation performs on some task. E.g., use the syntactic representation as input to a semantic analyzer and compare results of the analyzer using syntax predicted by different parsers.
142 Standard evaluation setting in parsing Automatic intrinsic evaluation is used: parsers are evaluated against a gold standard provided by linguists. There is a standard split into three parts: training set: used for estimation of model parameters; development set: used for tuning the model (initial experiments); test set: final experiments to compare against previous work
143 Automatic evaluation of constituent parsers Exact match: percentage of trees predicted correctly. Bracket score: scores how well individual phrases (and their boundaries) are identified; the most standard measure, and the one we will focus on. Crossing brackets: percentage of phrase boundaries that cross. Dependency metrics: score the dependency structure corresponding to the constituent tree (percentage of correctly identified heads)
144 Bracket scores The most standard score is the bracket score. It regards a tree as a collection of brackets (compare the subtree signatures [min, max, C] in CKY): the set of brackets predicted by a parser is compared against the set of brackets in the tree annotated by a linguist. Precision, recall and F1 are used as scores
145 Bracketing notation The same tree ("My dog ate a sausage") as a bracketed sequence: (S (NP (P My) (N dog)) (VP (V ate) (NP (D a) (N sausage))))
146 Bracket scores Pr = (number of brackets the parser and annotation agree on) / (number of brackets predicted by the parser); Re = (number of brackets the parser and annotation agree on) / (number of brackets in the annotation); F1 = 2 · Pr · Re / (Pr + Re), the harmonic mean of precision and recall
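The three scores above can be computed directly once both trees are flattened into sets of labelled brackets; a small sketch (the (label, min, max) bracket encoding is an assumption):

```python
def bracket_scores(predicted, gold):
    """Bracket precision, recall and F1. `predicted` and `gold` are sets of
    labelled brackets (label, min, max); the intersection is the set of
    brackets the parser and the annotation agree on."""
    agree = len(predicted & gold)
    pr = agree / len(predicted) if predicted else 0.0
    re = agree / len(gold) if gold else 0.0
    f1 = 2 * pr * re / (pr + re) if pr + re > 0 else 0.0
    return pr, re, f1
```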
147 Preview: F1 bracket score A comparison (shown as a table on the slide) of the Treebank PCFG, the Unlexicalized PCFG (Klein and Manning, 2003; covered in the BSc KI at UvA), the Lexicalized PCFG (Collins, 1999), the Automatically Induced PCFG (Petrov et al., 2006; the one we will look into), and the best results reported (as of 2012)
More informationRecap: HMM. ANLP Lecture 9: Algorithms for HMMs. More general notation. Recap: HMM. Elements of HMM: Sharon Goldwater 4 Oct 2018.
Recap: HMM ANLP Lecture 9: Algorithms for HMMs Sharon Goldwater 4 Oct 2018 Elements of HMM: Set of states (tags) Output alphabet (word types) Start state (beginning of sentence) State transition probabilities
More informationSharpening the empirical claims of generative syntax through formalization
Sharpening the empirical claims of generative syntax through formalization Tim Hunter University of Minnesota, Twin Cities ESSLLI, August 2015 Part 1: Grammars and cognitive hypotheses What is a grammar?
More information