/6/7

CS 6120: Natural Language Processing
Instructor: Prof. Lu Wang
College of Computer and Information Science
Northeastern University
Webpage: www.ccs.neu.edu/home/luwang

The grammar: binary, no epsilons

[CKY worked example over the sentence "fish people fish tanks", using a small binarized PCFG with rules including S -> NP VP, VP -> V NP, VP -> V @VP_V, VP -> V PP, @VP_V -> NP PP, NP -> NP NP, NP -> NP PP, NP -> N (0.7), PP -> P NP, and lexical rules for "fish", "people", "tanks", "rods", and "with"; the chart snapshots and most rule probabilities are not recoverable from this copy.]

Filling the diagonal of the score chart with the lexical rules, then closing each cell under unary rules:

for i = 0; i < #(words); i++
  for A in nonterms
    if A -> words[i] in grammar
      score[i][i+1][A] = P(A -> words[i])

// handle unaries
boolean added = true
while added
  added = false
  for A, B in nonterms
    if score[i][i+1][B] > 0 && A -> B in grammar
      prob = P(A -> B) * score[i][i+1][B]
      if prob > score[i][i+1][A]
        score[i][i+1][A] = prob
        back[i][i+1][A] = B
        added = true
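The diagonal-filling step above can be sketched in Python. This is a minimal illustration with a made-up grammar fragment; the rule probabilities below are placeholders, not the lecture's values:

```python
from collections import defaultdict

# Toy PCFG fragment (illustrative probabilities, not the slide's grammar)
lexicon = {("N", "fish"): 0.2, ("V", "fish"): 0.6,
           ("N", "people"): 0.5, ("V", "people"): 0.1}
unary_rules = {("NP", "N"): 0.7, ("S", "VP"): 0.1, ("VP", "V"): 0.1}

def init_diagonal(words):
    """Fill score[i][i+1] with lexical rules, then close under unaries."""
    score = defaultdict(float)           # (begin, end, symbol) -> best prob
    back = {}                            # backpointers for tree recovery
    for i, w in enumerate(words):
        for (A, word), p in lexicon.items():
            if word == w:
                score[i, i + 1, A] = p
        # handle unaries: keep applying A -> B while some cell improves
        added = True
        while added:
            added = False
            for (A, B), p in unary_rules.items():
                prob = p * score[i, i + 1, B]
                if prob > score[i, i + 1, A]:
                    score[i, i + 1, A] = prob
                    back[i, i + 1, A] = B
                    added = True
    return score, back
```

The while-loop matters because unary chains (e.g. S -> VP -> V) may need more than one pass before the cell stops improving.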
Filling in the rest of the chart, span by span. For each cell [begin, end], try every binary rule at every split point, then handle unaries as before:

for split = begin+1 to end-1
  for A, B, C in nonterms
    prob = score[begin][split][B] * score[split][end][C] * P(A -> B C)
    if prob > score[begin][end][A]
      score[begin][end][A] = prob
      back[begin][end][A] = (split, B, C)

// handle unaries
boolean added = true
while added
  added = false
  for A, B in nonterms
    prob = P(A -> B) * score[begin][end][B]
    if prob > score[begin][end][A]
      score[begin][end][A] = prob
      back[begin][end][A] = B
      added = true

[Successive CKY chart snapshots for "fish people fish tanks" omitted; the cell probabilities are not recoverable from this copy.]

Call buildTree(score, back) to get the best parse.
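Putting the pieces together, the whole probabilistic CKY loop with backpointer recovery might look like the following sketch. The rule-table encoding and the `build` helper are my own illustrative choices, not the lecture's code:

```python
from collections import defaultdict

def cky(words, lexicon, unary, binary):
    """Probabilistic CKY over a binarized PCFG, following the slide's
    pseudocode. Rule tables map rule tuples to probabilities."""
    n = len(words)
    score = defaultdict(float)           # (begin, end, symbol) -> best prob
    back = {}

    def apply_unaries(begin, end):
        added = True
        while added:
            added = False
            for (A, B), p in unary.items():
                prob = p * score[begin, end, B]
                if prob > score[begin, end, A]:
                    score[begin, end, A] = prob
                    back[begin, end, A] = B
                    added = True

    # Diagonal: lexical rules, then unary closure
    for i, w in enumerate(words):
        for (A, word), p in lexicon.items():
            if word == w:
                score[i, i + 1, A] = p
        apply_unaries(i, i + 1)

    # Larger spans: binary rules over all splits, then unary closure
    for span in range(2, n + 1):
        for begin in range(0, n - span + 1):
            end = begin + span
            for split in range(begin + 1, end):
                for (A, B, C), p in binary.items():
                    prob = score[begin, split, B] * score[split, end, C] * p
                    if prob > score[begin, end, A]:
                        score[begin, end, A] = prob
                        back[begin, end, A] = (split, B, C)
            apply_unaries(begin, end)

    def build(begin, end, A):
        """Follow backpointers to recover the highest-probability tree."""
        bp = back.get((begin, end, A))
        if bp is None:
            return (A, words[begin])             # lexical rule
        if isinstance(bp, str):                  # unary A -> B
            return (A, build(begin, end, bp))
        split, B, C = bp                         # binary A -> B C
        return (A, build(begin, split, B), build(split, end, C))

    best = score[0, n, "S"]
    return best, (build(0, n, "S") if best > 0 else None)
```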
Evaluating constituency parsing

Gold standard brackets (8):
S-(0:11), NP-(0:2), VP-(2:9), VP-(3:9), NP-(4:6), PP-(6:9), NP-(7:9), NP-(9:10)

Candidate brackets (7):
S-(0:11), NP-(0:2), VP-(2:10), VP-(3:10), NP-(4:6), PP-(6:10), NP-(7:10)

Labeled Precision: 3/7 = 42.9%
Labeled Recall: 3/8 = 37.5%
LP/LR F1: 40.0%
POS Tagging Accuracy: 11/11 = 100.0%

How good are PCFGs?
- Penn WSJ parsing accuracy: about 73% LP/LR F1
- Robust: they usually admit everything, but with low probability
- Partial solution for grammar ambiguity: a PCFG gives some idea of the plausibility of a parse, but not a very good one, because the independence assumptions are too strong
- They give a probabilistic language model, but in the simple case it performs worse than a trigram model
- The problem seems to be that PCFGs lack the lexicalization of a trigram model

[Magerman 1995; Collins 1997; Charniak 1997]

Word-to-word affinities are useful for certain ambiguities. PP attachment is now (partly) captured in a local PCFG rule. Think about: what useful information isn't captured?

[Two attachment trees contrasting "announce RATES FOR January" (PP attached to the NP) with "ANNOUNCE rates IN January" (PP attached to the VP)]

Also useful for: coordination scope, verb complement patterns
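Labeled precision, recall, and F1 over bracket sets can be computed in a few lines. This sketch treats each bracket as a (label, begin, end) triple:

```python
def labeled_prf(gold, guess):
    """Labeled precision / recall / F1 over sets of (label, begin, end)
    brackets, in the PARSEVAL style described above."""
    gold, guess = set(gold), set(guess)
    correct = len(gold & guess)
    p = correct / len(guess)
    r = correct / len(gold)
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```

Note that this set-based sketch counts duplicate brackets only once; standard scorers such as EVALB handle duplicates and punctuation with more care.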
Lexicalized parsing was seen as the parsing breakthrough of the late 1990s.

Eugene Charniak, JHU workshop: "To do better, it is necessary to condition probabilities on the actual words of the sentence. This makes the probabilities much tighter:
  p(VP -> V NP NP)        = 0.00151
  p(VP -> V NP NP | said) = 0.00001
  p(VP -> V NP NP | gave) = 0.01980"

Michael Collins, COLT tutorial: "Lexicalized Probabilistic Context-Free Grammars perform vastly better than PCFGs (88% vs. 73% accuracy)"

Lexicalization of PCFGs: Charniak (1997)
- A very straightforward model of a lexicalized PCFG
- Probabilistic conditioning is top-down like a regular PCFG, but actual parsing is bottom-up, somewhat like the CKY algorithm we saw

Charniak (1997) example

Lexicalization models argument selection by sharpening rule expansion probabilities. The probability of different verbal complement frames (i.e., "subcategorizations") depends on the verb (monolexical probabilities):

  Local Tree       come    take    think   want
  VP -> V          9.5%    2.6%    4.6%    5.7%
  VP -> V NP       1.1%    32.1%   0.2%    13.9%
  VP -> V PP       34.5%   3.1%    7.1%    0.3%
  VP -> V SBAR     6.6%    0.3%    73.0%   0.2%
  VP -> V S        2.2%    1.3%    4.8%    70.8%
  VP -> V NP S     0.1%    5.7%    0.0%    0.3%
  VP -> V PRT NP   0.3%    5.8%    0.0%    0.0%
  VP -> V PRT PP   6.1%    1.5%    0.2%    0.0%

Lexicalization sharpens probabilities: predicting heads (bilexical probabilities)

Charniak (1997): linear interpolation / shrinkage
  P(prices | n-plural)                      = .013
  P(prices | n-plural, NP)                  = .013
  P(prices | n-plural, NP, S)               = .025
  P(prices | n-plural, NP, S, v-past)       = .025
  P(prices | n-plural, NP, S, v-past, fell) = .146
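The shrinkage idea can be sketched as a weighted mixture of relative-frequency estimates drawn from progressively more specific conditioning contexts. The counts and weights below are invented for illustration; Charniak's actual estimator and interpolation weights differ:

```python
def interpolated_prob(counts_list, lambdas):
    """Linear interpolation ('shrinkage') across a back-off chain of
    conditioning contexts, from most general to most specific.
    counts_list holds (count_of_event, count_of_context) pairs;
    lambdas are mixture weights summing to 1."""
    assert abs(sum(lambdas) - 1.0) < 1e-9
    p = 0.0
    for (c_event, c_context), lam in zip(counts_list, lambdas):
        if c_context > 0:                 # skip unseen contexts
            p += lam * c_event / c_context
    return p
```

The general contexts keep the estimate robust when the specific ones are sparse, while the specific contexts (down to the bilexical level) sharpen it when their counts are reliable.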
Charniak (1997) shrinkage example

[Figure omitted; not recoverable from this copy.]

Dependency Grammar and Dependency Structure

Dependency syntax postulates that syntactic structure consists of lexical items linked by binary asymmetric relations ("arrows") called dependencies.
- The arrows are commonly typed with the name of grammatical relations (subject, prepositional object, apposition, etc.)
- The arrow connects a head (governor, superior, regent) with a dependent (modifier, inferior, subordinate)
- Usually, dependencies form a tree (connected, acyclic, single-head)

[Dependency tree for "Bills on ports and immigration were submitted by Senator Brownback, Republican of Kansas", with typed arcs such as nsubjpass, auxpass, nn, appos, cc, conj]

Relation between phrase structure and dependency structure
- A dependency grammar has a notion of a head. Officially, CFGs don't.
- But modern linguistic theory and all modern statistical parsers (Charniak, Collins, Stanford, ...) do, via hand-written phrasal "head rules":
  - The head of a Noun Phrase is a noun/number/adj/...
  - The head of a Verb Phrase is a verb/modal/...
- The head rules can be used to extract a dependency parse from a CFG parse.

Methods of Dependency Parsing

1. Dynamic programming (like in the CKY algorithm)
   You can do it similarly to lexicalized PCFG parsing: an O(n^5) algorithm.
   Eisner (1996) gives a clever algorithm that reduces the complexity to O(n^3), by producing parse items with heads at the ends rather than in the middle.
2. Graph algorithms
   You create a Maximum Spanning Tree for a sentence.
   McDonald et al.'s (2005) MSTParser scores dependencies independently using an ML classifier (he uses MIRA, for online learning, but it could be MaxEnt).
3. Constraint satisfaction
   Edges are eliminated that don't satisfy hard constraints.
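Extracting a dependency parse from a constituency tree via head rules might be sketched as follows. The head-rule table here is a tiny hypothetical fragment, not a real rule set such as Collins's:

```python
# Hypothetical, simplified head rules: for each phrase label, the child
# categories to try, in priority order.
HEAD_RULES = {"S": ["VP", "NP"], "NP": ["N", "NP"], "VP": ["V", "VP"], "PP": ["P"]}

def find_head(label, children):
    """Pick the head child: first matching category in the rule list,
    defaulting to the leftmost child."""
    for cand in HEAD_RULES.get(label, []):
        for child in children:
            if child[0] == cand:
                return child
    return children[0]

def extract_deps(tree, deps=None):
    """Return (head_word, deps). Trees are (POS, word) leaves or
    (label, child1, child2, ...) internal nodes; every non-head
    child's head word becomes a dependent of the head child's word."""
    if deps is None:
        deps = []
    label, rest = tree[0], tree[1:]
    if isinstance(rest[0], str):          # leaf: (POS, word)
        return rest[0], deps
    children = list(rest)
    head_child = find_head(label, children)
    head_word, _ = extract_deps(head_child, deps)
    for child in children:
        if child is not head_child:
            dep_word, _ = extract_deps(child, deps)
            deps.append((head_word, dep_word))
    return head_word, deps
```

Percolating head words up the tree this way yields one (head, dependent) arc per non-head child, which is exactly the unlabeled dependency structure.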
   Karlsson (1990), etc.
4. "Deterministic parsing"
   Greedy choice of attachments, guided by machine learning classifiers.
   E.g., MaltParser (Nivre et al. 2008), discussed in the next segment.

Dependency Conditioning Preferences

What are the sources of information for dependency parsing?
1. Bilexical affinities: [issues -> the] is plausible
2. Dependency distance: mostly with nearby words
3. Intervening material: dependencies rarely span intervening verbs or punctuation
4. Valency of heads: how many dependents on which side are usual for a head?

Example: ROOT Discussion of the outstanding issues was completed.
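A deterministic parser in the MaltParser style makes a greedy sequence of shift/attach decisions. Here is a minimal arc-standard skeleton in which the learned classifier is replaced by a caller-supplied policy function; the policy below is a stub, not a trained model:

```python
def parse_greedy(words, choose_action):
    """Arc-standard shift-reduce skeleton. choose_action(stack, buffer)
    returns 'SHIFT', 'LEFT-ARC', or 'RIGHT-ARC'; a real parser would use
    a trained classifier here. Assumes the policy always makes progress
    and never pops from a too-small stack."""
    stack, buffer, arcs = [], list(range(len(words))), []
    while buffer or len(stack) > 1:
        action = choose_action(stack, buffer)
        if action == "SHIFT":
            stack.append(buffer.pop(0))
        elif action == "LEFT-ARC":        # second-from-top depends on top
            dep = stack.pop(-2)
            arcs.append((stack[-1], dep))
        elif action == "RIGHT-ARC":       # top depends on second-from-top
            dep = stack.pop()
            arcs.append((stack[-1], dep))
    return arcs                           # list of (head_index, dep_index)
```

In a real system the policy would be a classifier over features of the stack and buffer, which is where the conditioning preferences above (bilexical affinity, distance, intervening material, valency) enter as features.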