Natural Language Processing


SFU NatLangLab Natural Language Processing Anoop Sarkar anoopsarkar.github.io/nlp-class Simon Fraser University September 27, 2018

Natural Language Processing Anoop Sarkar anoopsarkar.github.io/nlp-class Simon Fraser University Part 1: Ambiguity

Context Free Grammars and Ambiguity

S → NP VP
VP → V NP
VP → VP PP
PP → P NP
NP → NP PP
NP → Calvin
NP → monsters
NP → school
V → imagined
P → in

What is the analysis using the above grammar for: Calvin imagined monsters in school

Context Free Grammars and Ambiguity

Calvin imagined monsters in school

(S (NP Calvin) (VP (V imagined) (NP (NP monsters) (PP (P in) (NP school)))))

(S (NP Calvin) (VP (VP (V imagined) (NP monsters)) (PP (P in) (NP school))))

Which one is more plausible?

Context Free Grammars and Ambiguity (tree diagrams for the two parses of "Calvin imagined monsters in school")

Ambiguity Kills (your parser)

natural language learning course (run demos/parsing-ambiguity.py)

((natural language) (learning course))
(((natural language) learning) course)
((natural (language learning)) course)
(natural (language (learning course)))
(natural ((language learning) course))

Some difficult issues: Which one is more plausible? How many analyses for a given input? Computational complexity of parsing language

Number of derivations

CFG rules: { N → N N, N → a }

n : aⁿ    number of parses
1         1
2         1
3         2
4         5
5         14
6         42
7         132
8         429
9         1430
10        4862
11        16796

CFG Ambiguity

The number of parses in the previous table is an integer series, known as the Catalan numbers.

Catalan numbers have a closed form:

    Cat(n) = (1 / (n + 1)) × C(2n, n)

where C(a, b) is the binomial coefficient:

    C(a, b) = a! / (b! (a − b)!)

Catalan numbers

Why Catalan numbers? Cat(n) is the number of ways to parenthesize an expression of length n with two conditions:

1. there must be equal numbers of open and close parens
2. they must be properly nested so that an open precedes a close

((ab)c)d  (a(bc))d  (ab)(cd)  a((bc)d)  a(b(cd))

For an expression with n ways to form constituents there are a total of 2n choose n parenthesis pairs. Then divide by n + 1 to remove invalid parenthesis pairs. For more details see (Church and Patil, CL Journal, 1982)
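
The closed form can be checked against a brute-force parse count for the grammar { N → N N, N → a } from the earlier table (note that row n of the table corresponds to Cat(n − 1)). A minimal sketch, not from the slides; the function names are mine:

    from math import comb
    from functools import lru_cache

    def catalan(n):
        # Cat(n) = C(2n, n) / (n + 1)
        return comb(2 * n, n) // (n + 1)

    @lru_cache(maxsize=None)
    def num_parses(n):
        # Number of parse trees for a^n under { N -> N N, N -> a }:
        # a single 'a' has one parse; otherwise split into two nonempty parts.
        if n == 1:
            return 1
        return sum(num_parses(k) * num_parses(n - k) for k in range(1, n))

    for n in range(1, 12):
        assert num_parses(n) == catalan(n - 1)
        print(n, num_parses(n))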

Natural Language Processing Anoop Sarkar anoopsarkar.github.io/nlp-class Simon Fraser University Part 2: Context Free Grammars

Context-Free Grammars

A CFG is a 4-tuple: (N, T, R, S), where

- N is a set of non-terminal symbols,
- T is a set of terminal symbols, which can include the empty string ɛ. T is analogous to Σ, the alphabet in FSAs.
- R is a set of rules of the form A → α, where A ∈ N and α ∈ {N ∪ T}*
- S is a set of start symbols, S ⊆ N

Context-Free Grammars

Here's an example of a CFG, let's call this one G:

1. S → a S b
2. S → ɛ

What is the language of this grammar, which we will call L(G), the set of strings generated by this grammar? How?

Notice that there cannot be any FSA that corresponds exactly to this set of strings L(G). Why?

What is the tree set, or set of derivations, produced by this grammar?
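
As a quick check on L(G), here is a minimal recognizer sketch, assuming L(G) = { aⁿbⁿ : n ≥ 0 }; the function name is my own:

    def in_L_of_G(s):
        # G: S -> a S b | epsilon, so L(G) = { a^n b^n : n >= 0 }.
        # Strip matching 'a' ... 'b' pairs from the outside in.
        while s.startswith("a") and s.endswith("b"):
            s = s[1:-1]
        return s == ""

    assert in_L_of_G("")          # n = 0
    assert in_L_of_G("aabb")      # n = 2
    assert not in_L_of_G("aab")   # unbalanced
    assert not in_L_of_G("ba")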

Context-Free Grammars

This notion of generating both the strings and the trees is an important one for Computational Linguistics.

Consider the trees for the grammar G′: P = {S → A A, A → a A, A → A b, A → ɛ}, Σ = {a, b}, N = {S, A}, T = {a, b, ɛ}, S = {S}

Why is it called a context-free grammar?

Context-Free Grammars Can the grammar G′ produce only trees with equal height subtrees on the left and right?

Parse Trees

Consider the grammar with rules:

S → NP VP
NP → PRP
NP → DT NPB
VP → VBP NP
NPB → NN NN
PRP → I
VBP → prefer
DT → a
NN → morning
NN → flight

Parse Trees (tree diagram for "I prefer a morning flight" under the grammar above)

Parse Trees: Equivalent Representations

(S (NP (PRP I) ) (VP (VBP prefer) (NP (DT a) (NPB (NN morning) (NN flight)))))

[ S [ NP [ PRP I ] ] [ VP [ VBP prefer ] [ NP [ DT a ] [ NPB [ NN morning ] [ NN flight ] ] ] ] ]
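
A minimal sketch, not from the slides, of reading the first bracketed representation into a nested Python structure; the tokenizer simply assumes whitespace can be inserted around parentheses:

    def parse_brackets(s):
        # Read a bracketed tree such as "(S (NP (PRP I)) (VP ...))"
        # into nested lists of the form [label, child1, child2, ...].
        tokens = s.replace("(", " ( ").replace(")", " ) ").split()
        pos = 0

        def read():
            nonlocal pos
            if tokens[pos] == "(":
                pos += 1                       # skip "("
                label = tokens[pos]; pos += 1
                children = []
                while tokens[pos] != ")":
                    children.append(read())
                pos += 1                       # skip ")"
                return [label] + children
            word = tokens[pos]; pos += 1       # a terminal word
            return word

        return read()

    tree = parse_brackets("(S (NP (PRP I)) (VP (VBP prefer) (NP (DT a) (NPB (NN morning) (NN flight)))))")
    # tree == ['S', ['NP', ['PRP', 'I']], ['VP', ['VBP', 'prefer'], ...]]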

Ambiguous Grammars

S → S S
S → a

Given the above rules, consider the input aaa. What are the valid parse trees? Now consider the input aaaa.

Natural Language Processing Anoop Sarkar anoopsarkar.github.io/nlp-class Simon Fraser University Part 3: Structural Ambiguity

Ambiguity

Part of Speech ambiguity: saw → noun; saw → verb

Structural ambiguity: Prepositional Phrases
I saw (the man) with the telescope
I saw (the man with the telescope)

Structural ambiguity: Coordination
a program to promote safety in ((trucks) and (minivans))
a program to promote ((safety in trucks) and (minivans))
((a program to promote safety in trucks) and (minivans))

Ambiguity (tree diagrams showing the attachment choice in alternative parses of "a program to promote safety in trucks and minivans": the PP "in (trucks and minivans)" attached to "safety", versus coordination of "safety in trucks" with "minivans")

Ambiguity in Prepositional Phrases

noun attach: I bought the shirt with pockets
verb attach: I washed the shirt with soap

As in the case of other attachment decisions in parsing: it depends on the meaning of the entire sentence, needs world knowledge, etc. Maybe there is a simpler solution: we can attempt to solve it using heuristics or associations between words.

Structure Based Ambiguity Resolution

Right association: a constituent (NP or PP) tends to attach to another constituent immediately to its right (Kimball 1973)

Minimal attachment: a constituent tends to attach to an existing non-terminal using the fewest additional syntactic nodes (Frazier 1978)

These two principles make opposite predictions for prepositional phrase attachment.

Consider the grammar:

VP → V NP PP (1)
NP → NP PP (2)

For the input: I [ VP saw [ NP the man... [ PP with the telescope ], RA predicts that the PP attaches to the NP, i.e. use rule (2), and MA predicts V attachment, i.e. use rule (1)

Structure Based Ambiguity Resolution

Garden-paths look structural: The emergency crews hate most is domestic violence

Neither MA nor RA accounts for more than 55% of the cases in real text.

Psycholinguistic experiments using eyetracking show that humans resolve ambiguities as soon as possible in the left to right sequence, using the words to disambiguate.

Garden-paths are caused by a combination of lexical and structural effects: The flowers delivered for the patient arrived

Ambiguity Resolution: Prepositional Phrases in English

Learning Prepositional Phrase Attachment: Annotated Data

v          n1           p     n2        Attachment
join       board        as    director  V
is         chairman     of    N.V.      N
using      crocidolite  in    filters   V
bring      attention    to    problem   V
is         asbestos     in    products  N
making     paper        for   filters   N
including  three        with  cancer    N
...        ...          ...   ...       ...

Prepositional Phrase Attachment

Method                              Accuracy
Always noun attachment              59.0
Most likely for each preposition    72.2
Average Human (4 head words only)   88.2
Average Human (whole sentence)      93.2
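
A minimal sketch of the first two baselines in the table, trained on (v, n1, p, n2, attachment) tuples like those in the annotated data above; the data layout and function names are my own assumptions:

    from collections import Counter, defaultdict

    def train_prep_baseline(examples):
        # examples: list of (v, n1, p, n2, attachment), attachment in {"V", "N"}
        by_prep = defaultdict(Counter)
        for v, n1, p, n2, att in examples:
            by_prep[p][att] += 1
        # most likely attachment for each preposition
        return {p: c.most_common(1)[0][0] for p, c in by_prep.items()}

    def predict(model, p):
        # unseen prepositions fall back to "always noun attachment"
        return model.get(p, "N")

    train = [("join", "board", "as", "director", "V"),
             ("is", "chairman", "of", "N.V.", "N"),
             ("using", "crocidolite", "in", "filters", "V"),
             ("is", "asbestos", "in", "products", "N")]
    model = train_prep_baseline(train)
    print(predict(model, "of"), predict(model, "with"))
    # "of" was seen (-> N); "with" is unseen, so fall back to noun attachment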

Some other studies

Toutanova, Manning, and Ng, 2004: 87.54% using some external knowledge (word classes)

Merlo, Crocker and Berthouzoz, 1997: test on multiple PPs; generalize disambiguation of 1 PP to 2-3 PPs; 14 structures possible for 3 PPs assuming a single verb, and all 14 are attested in the Penn WSJ Treebank; 1PP: 84.3%, 2PP: 69.6%, 3PP: 43.6%

Belinkov et al., TACL 2014: Neural networks for PP attachment (multiple candidate heads); NN model (no extra data): 86.6%; NN model (lots of raw data for word vectors): 88.7%; NN model with parser and lots of raw data: 90.1%

This experiment is still only part of the real problem faced in parsing English. Plus there are other sources of ambiguity in other languages.

Natural Language Processing Anoop Sarkar anoopsarkar.github.io/nlp-class Simon Fraser University Part 4: Weighted Context Free Grammars

Treebanks

What is the CFG that can be extracted from this single tree:

(S (NP (Det the) (NP man)) (VP (VP (V played) (NP (Det a) (NP game))) (PP (P with) (NP (Det the) (NP dog)))))

PCFG

S → NP VP     c = 1
NP → Det NP   c = 3
NP → man      c = 1
NP → game     c = 1
NP → dog      c = 1
VP → VP PP    c = 1
VP → V NP     c = 1
PP → P NP     c = 1
Det → the     c = 2
Det → a       c = 1
V → played    c = 1
P → with      c = 1

We can do this with multiple trees. Simply count occurrences of CFG rules over all the trees. A repository of such trees labelled by a human is called a TreeBank.
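
A minimal sketch of this rule counting, reusing the parse_brackets helper sketched in Part 2; the helper and function names are my own:

    from collections import Counter

    def count_rules(tree, counts):
        # tree is a nested list [label, child1, child2, ...];
        # a child that is a plain string is a terminal word.
        if isinstance(tree, str):
            return
        label, children = tree[0], tree[1:]
        rhs = " ".join(c if isinstance(c, str) else c[0] for c in children)
        counts[(label, rhs)] += 1
        for c in children:
            count_rules(c, counts)

    counts = Counter()
    t = parse_brackets("(S (NP (Det the) (NP man)) (VP (VP (V played) "
                       "(NP (Det a) (NP game))) (PP (P with) (NP (Det the) (NP dog)))))")
    count_rules(t, counts)
    # counts[("NP", "Det NP")] == 3, counts[("Det", "the")] == 2, ...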

Probabilistic CFG (PCFG)

S → NP VP      1
VP → V NP      0.9
VP → VP PP     0.1
PP → P NP      1
NP → NP PP     0.25
NP → Calvin    0.25
NP → monsters  0.25
NP → school    0.25
V → imagined   1
P → in         1

P(input) = Σ_tree P(tree, input)

P(Calvin imagined monsters in school) = ?

Notice that P(VP → V NP) + P(VP → VP PP) = 1.0

Probabilistic CFG (PCFG)

P(Calvin imagined monsters in school) = ?

(S (NP Calvin) (VP (V imagined) (NP (NP monsters) (PP (P in) (NP school)))))

(S (NP Calvin) (VP (VP (V imagined) (NP monsters)) (PP (P in) (NP school))))

Probabilistic CFG (PCFG)

(S (NP Calvin) (VP (V imagined) (NP (NP monsters) (PP (P in) (NP school)))))

P(tree_1) = P(S → NP VP) × P(NP → Calvin) × P(VP → V NP) × P(V → imagined) × P(NP → NP PP) × P(NP → monsters) × P(PP → P NP) × P(P → in) × P(NP → school)
          = 1 × 0.25 × 0.9 × 1 × 0.25 × 0.25 × 1 × 1 × 0.25
          = 0.003515625

Probabilistic CFG (PCFG)

(S (NP Calvin) (VP (VP (V imagined) (NP monsters)) (PP (P in) (NP school))))

P(tree_2) = P(S → NP VP) × P(NP → Calvin) × P(VP → VP PP) × P(VP → V NP) × P(V → imagined) × P(NP → monsters) × P(PP → P NP) × P(P → in) × P(NP → school)
          = 1 × 0.25 × 0.1 × 0.9 × 1 × 0.25 × 1 × 1 × 0.25
          = 0.00140625

Probabilistic CFG (PCFG)

P(Calvin imagined monsters in school) = P(tree_1) + P(tree_2) = 0.003515625 + 0.00140625 = 0.004921875

Most likely tree is tree_1 = arg max_tree P(tree | input)

(S (NP Calvin) (VP (V imagined) (NP (NP monsters) (PP (P in) (NP school)))))

(S (NP Calvin) (VP (VP (V imagined) (NP monsters)) (PP (P in) (NP school))))
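
A minimal sketch of this calculation, reusing the parse_brackets helper from Part 2; the rule probabilities are copied from the PCFG above:

    rule_prob = {
        ("S", "NP VP"): 1.0, ("VP", "V NP"): 0.9, ("VP", "VP PP"): 0.1,
        ("PP", "P NP"): 1.0, ("NP", "NP PP"): 0.25, ("NP", "Calvin"): 0.25,
        ("NP", "monsters"): 0.25, ("NP", "school"): 0.25,
        ("V", "imagined"): 1.0, ("P", "in"): 1.0,
    }

    def tree_prob(tree):
        # Product of P(rule) over every rule used in the nested-list tree.
        if isinstance(tree, str):
            return 1.0
        label, children = tree[0], tree[1:]
        rhs = " ".join(c if isinstance(c, str) else c[0] for c in children)
        p = rule_prob[(label, rhs)]
        for c in children:
            p *= tree_prob(c)
        return p

    t1 = parse_brackets("(S (NP Calvin) (VP (V imagined) (NP (NP monsters) (PP (P in) (NP school)))))")
    t2 = parse_brackets("(S (NP Calvin) (VP (VP (V imagined) (NP monsters)) (PP (P in) (NP school))))")
    print(tree_prob(t1), tree_prob(t2), tree_prob(t1) + tree_prob(t2))
    # approx. 0.003515625  0.00140625  0.004921875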

Probabilistic Context-Free Grammars (PCFG)

A PCFG is a 4-tuple: (N, T, R, S), where

- N is a set of non-terminal symbols,
- T is a set of terminal symbols, which can include the empty string ɛ. T is analogous to Σ, the alphabet in FSAs.
- R is a set of rules of the form A → α, where A ∈ N and α ∈ {N ∪ T}*. P(R) is the probability of rule R : A → α, such that Σ_α P(A → α) = 1.0
- S is a set of start symbols, S ⊆ N

PCFG

Central condition: Σ_α P(A → α) = 1

Called a proper PCFG if this condition holds.

Note that this means P(A → α) = P(α | A) = f(A, α) / f(A)

P(T | S) = P(T, S) / P(S)

P(T, S) = Π_i P(RHS_i | LHS_i)

PCFG

What is the PCFG that can be extracted from this single tree:

(S (NP (Det the) (NP man)) (VP (VP (V played) (NP (Det a) (NP game))) (PP (P with) (NP (Det the) (NP dog)))))

How many different rhs α exist for A → α, where A can be S, NP, VP, PP, Det, N, V, P?

PCFG

S → NP VP     c = 1   p = 1/1 = 1.0
NP → Det NP   c = 3   p = 3/6 = 0.5
NP → man      c = 1   p = 1/6 = 0.1667
NP → game     c = 1   p = 1/6 = 0.1667
NP → dog      c = 1   p = 1/6 = 0.1667
VP → VP PP    c = 1   p = 1/2 = 0.5
VP → V NP     c = 1   p = 1/2 = 0.5
PP → P NP     c = 1   p = 1/1 = 1.0
Det → the     c = 2   p = 2/3 = 0.67
Det → a       c = 1   p = 1/3 = 0.33
V → played    c = 1   p = 1/1 = 1.0
P → with      c = 1   p = 1/1 = 1.0

We can do this with multiple trees. Simply count occurrences of CFG rules over all the trees. A repository of such trees labelled by a human is called a TreeBank.
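
A minimal sketch of this relative-frequency estimation, building on the count_rules sketch from the previous part (it implements P(A → α) = f(A, α) / f(A), the proper-PCFG condition):

    from collections import Counter, defaultdict

    def estimate_pcfg(rule_counts):
        # rule_counts: Counter over (lhs, rhs) pairs, e.g. ("NP", "Det NP") -> 3
        lhs_totals = defaultdict(int)
        for (lhs, rhs), c in rule_counts.items():
            lhs_totals[lhs] += c
        return {(lhs, rhs): c / lhs_totals[lhs]
                for (lhs, rhs), c in rule_counts.items()}

    probs = estimate_pcfg(counts)   # counts from the extraction sketch above
    assert abs(probs[("NP", "Det NP")] - 0.5) < 1e-9
    assert abs(probs[("Det", "the")] - 2 / 3) < 1e-9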

Natural Language Processing Anoop Sarkar anoopsarkar.github.io/nlp-class Simon Fraser University Part 5: Lexicalized Context Free Grammars

Ambiguity

Part of Speech ambiguity: saw → noun; saw → verb

Structural ambiguity: Prepositional Phrases
I saw (the man) with the telescope
I saw (the man with the telescope)

Structural ambiguity: Coordination
a program to promote safety in ((trucks) and (minivans))
a program to promote ((safety in trucks) and (minivans))
((a program to promote safety in trucks) and (minivans))

Ambiguity (tree diagrams showing the attachment choice in alternative parses of "a program to promote safety in trucks and minivans": the PP "in (trucks and minivans)" attached to "safety", versus coordination of "safety in trucks" with "minivans")

Parsing as a machine learning problem

S = a sentence, T = a parse tree. A statistical parsing model defines P(T | S).

Find best parse: arg max_T P(T | S) = arg max_T P(T, S) / P(S)

P(S) = Σ_T P(T, S), so we can ignore P(S) when searching for the best tree as it is just scaling each tree score by a constant value.

So the best parse is: arg max_T P(T, S)

For PCFGs: P(T, S) = Π_{i=1...n} P(RHS_i | LHS_i)

Adding Lexical Information to PCFG (tree diagram: each non-terminal is annotated with its head word, e.g. VP{indicated} → VB{indicated} NP{difference} PP{in}, where the VB dominates "indicated", the NP's head is "difference", and the PP's head is "in")

Adding Lexical Information to PCFG (Collins 99, Charniak 00)

(tree diagrams: the rule VP{indicated} → VB{+H:indicated} NP{difference} PP{in} is generated step by step: first the head child, then the left and right modifiers, ending with STOP on each side)

P_h(VB | VP, indicated)
P_l(STOP | VP, VB, indicated)
P_r(NP(difference) | VP, VB, indicated)
P_r(PP(in) | VP, VB, indicated)
P_r(STOP | VP, VB, indicated)
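
A minimal sketch of how these head, left-modifier, and right-modifier factors multiply into a probability for the lexicalized rule. This is not the actual Collins or Charniak model (which also conditions on distance, subcategorization, etc.); the probability tables below are toy placeholders I made up:

    # Toy conditional probability tables (placeholders, not treebank estimates).
    P_h = {("VB", ("VP", "indicated")): 0.7}                       # head child
    P_l = {("STOP", ("VP", "VB", "indicated")): 0.8}               # left modifiers
    P_r = {("NP(difference)", ("VP", "VB", "indicated")): 0.1,     # right modifiers
           ("PP(in)", ("VP", "VB", "indicated")): 0.05,
           ("STOP", ("VP", "VB", "indicated")): 0.6}

    def lexicalized_rule_prob(parent, head_tag, head_word, left, right):
        # P(rule) = P_h(head) * product over left modifiers (ending in STOP)
        #                     * product over right modifiers (ending in STOP)
        p = P_h[(head_tag, (parent, head_word))]
        for mod in left + ["STOP"]:
            p *= P_l[(mod, (parent, head_tag, head_word))]
        for mod in right + ["STOP"]:
            p *= P_r[(mod, (parent, head_tag, head_word))]
        return p

    print(lexicalized_rule_prob("VP", "VB", "indicated",
                                left=[], right=["NP(difference)", "PP(in)"]))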

Evaluation of Parsing

Consider a candidate parse to be evaluated against the truth (or gold-standard parse):

candidate: (S (A (P this) (Q is)) (A (R a) (T test)))
gold: (S (A (P this)) (B (Q is) (A (R a) (T test))))

In order to evaluate this, we list all the constituents:

Candidate: (0,4,S) (0,2,A) (2,4,A)
Gold: (0,4,S) (0,1,A) (1,4,B) (2,4,A)

Skip spans of length 1, which would be equivalent to part of speech tagging accuracy.

Precision is defined as #correct / #proposed = 2/3, and recall as #correct / #in gold = 2/4.

Another measure: crossing brackets

candidate: [ an [incredibly expensive] coat ] (1 CB)
gold: [ an [incredibly [expensive coat]] ]
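
A minimal sketch of computing these numbers from constituent span sets, using the (start, end, label) triples from the slide; the function name is mine:

    def prf(candidate_spans, gold_spans):
        # candidate_spans, gold_spans: sets of (start, end, label) constituents
        correct = candidate_spans & gold_spans
        precision = len(correct) / len(candidate_spans)
        recall = len(correct) / len(gold_spans)
        f1 = 2 * precision * recall / (precision + recall)
        return precision, recall, f1

    candidate = {(0, 4, "S"), (0, 2, "A"), (2, 4, "A")}
    gold = {(0, 4, "S"), (0, 1, "A"), (1, 4, "B"), (2, 4, "A")}
    print(prf(candidate, gold))   # approx. (0.666..., 0.5, 0.571...)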

Evaluation of Parsing

Bracketing recall R = (num of correct constituents) / (num of constituents in the goldfile)

Bracketing precision P = (num of correct constituents) / (num of constituents in the parsed file)

Complete match = % of sents where recall & precision are both 100%

Average crossing = (num of constituents crossing a goldfile constituent) / (num of sents)

No crossing = % of sents which have 0 crossing brackets

2 or less crossing = % of sents which have ≤ 2 crossing brackets

Statistical Parsing Results

F1-score = (2 × precision × recall) / (precision + recall)

Results on the ≤ 100 words (100wds) test set:

System                                                        F1-score
Shift-Reduce (Magerman, 1995)                                 84.14
PCFG with Lexical Features (Charniak, 1999)                   89.54
Unlexicalized Berkeley parser (Petrov et al, 2007)            90.10
n-best Re-ranking (Charniak and Johnson, 2005)                91.02
Tree-insertion grammars (Carreras, Collins, Koo, 2008)        91.10
Ensemble n-best Re-ranking (Johnson and Ural, 2010)           91.49
Forest Re-ranking (Huang, 2010)                               91.70
Unlabeled Data with Self-Training (McCloskey et al, 2006)     92.10
Self-Attention (Kitaev and Klein, 2018)                       93.55
Self-Attention with unlabeled data (Kitaev and Klein, 2018)   95.13

Natural Language Processing Anoop Sarkar anoopsarkar.github.io/nlp-class Simon Fraser University Part 6: Practical Issues in Parsing

Practical Issues: Beam Thresholding and Priors

Read the parsing slides before this section.

Probability of nonterminal X spanning j ... k: N[X, j, k]

Beam thresholding compares N[X, j, k] with N[Y, j, k] for every other Y. But what should be compared?

Just the inside probability: P(X ⇒* t_j ... t_k), written as β(X, j, k)?

Perhaps β(FRAG, 0, 3) > β(NP, 0, 3), but NPs are much more likely than FRAGs in general

Practical Issues: Beam Thresholding and Priors

The correct estimate is the outside probability, written as α(X, j, k):

P(S ⇒* t_1 ... t_{j-1} X t_{k+1} ... t_n)

Unfortunately, you can only compute α(X, j, k) efficiently after you finish parsing and reach (S, 0, n)

Practical Issues: Beam Thresholding and Priors

To make things easier we multiply the prior probability P(X) with the inside probability.

In beam thresholding we compare every new insertion of X for span j, k as follows: compare P(X) × β(X, j, k) with the most probable Y, i.e. P(Y) × β(Y, j, k).

Assume Y is the most probable entry in span j, k. Then we compare:

beam × P(Y) × β(Y, j, k)   (3)
P(X) × β(X, j, k)          (4)

If (4) < (3) then we prune X for this span j, k.

beam is set to a small value, say 0.001 or even 0.01. As the beam value increases, the parser speed increases (since more entries are pruned).

A simpler (but not as effective) alternative to using the beam is to keep only the top K entries for each span j, k
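
A minimal sketch of this pruning step for one span's chart cell; the dictionaries standing in for the chart and the prior are my own simplification:

    def prune_cell(cell, prior, beam=1e-3):
        # cell:  dict mapping nonterminal X -> inside probability beta(X, j, k)
        # prior: dict mapping nonterminal X -> P(X)
        # Keep X only if prior[X] * beta(X) >= beam * max_Y prior[Y] * beta(Y).
        best = max(prior[Y] * beta for Y, beta in cell.items())
        return {X: beta for X, beta in cell.items()
                if prior[X] * beta >= beam * best}

    cell = {"NP": 1e-6, "FRAG": 5e-6, "VP": 1e-12}
    prior = {"NP": 0.3, "FRAG": 0.001, "VP": 0.2}
    print(prune_cell(cell, prior, beam=1e-3))
    # Without the prior, FRAG has the best inside probability; with the prior,
    # NP becomes the best entry, and VP falls below beam * best and is pruned.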

Experiments with Beam Thresholding (plot: parsing time in seconds vs. sentence length, comparing no pruning, beam = 10^-5, and beam = 10^-4, all without the prior)

Experiments with Beam Thresholding (plot: parsing time in seconds vs. sentence length, comparing no pruning, beam = 10^-5, and beam = 10^-4, all with the prior)