Advanced Natural Language Processing Syntactic Parsing


Advanced Natural Language Processing Syntactic Parsing Alicia Ageno ageno@cs.upc.edu Universitat Politècnica de Catalunya NLP statistical parsing 1

Parsing Review Statistical Parsing SCFG Inside Algorithm Outside Algorithm Viterbi Algorithm Learning models Grammar acquisition: Grammatical induction NLP statistical parsing 2

Parsing Parsing: recognising higher-level units of structure that allow us to compress our description of a sentence. Goal of syntactic analysis (parsing): detect whether a sentence is correct and provide a syntactic structure for it. Parsing is the task of uncovering the syntactic structure of language and is often viewed as an important prerequisite for building systems capable of understanding language. Syntactic structure is necessary as a first step towards semantic interpretation, for detecting phrasal chunks for indexing in an IR system, etc. NLP statistical parsing 3

Parsing A syntactic tree NLP statistical parsing 4

Parsing Another syntactic tree NLP statistical parsing 5

Parsing A dependency tree NLP statistical parsing 6

Parsing A real sentence NLP statistical parsing 7

Parsing Theories of Syntactic Structure Constituent trees Dependency trees NLP statistical parsing 8

Parsing Factors in parsing Grammar expressivity Coverage Involved Knowledge Sources Parsing strategy Parsing direction Production application order Ambiguity management NLP statistical parsing 9

Parsing Parsers today CFG (extended or not) Tabular Charts LR Unification-based Statistical Dependency parsing Robust parsing (shallow, fragmental, chunkers, spotters) NLP statistical parsing 10

Parsing Context Free Grammars (CFGs) NLP statistical parsing 11

Parsing Context Free Grammars, example NLP statistical parsing 12

Parsing Properties of CFGs NLP statistical parsing 13

Parsing Structural ambiguity: I saw the man on the hill with the telescope. Possible readings: I was on the hill that has a telescope when I saw a man. I saw a man who was on a hill and who had a telescope. I saw a man who was on the hill that has a telescope on it. Using a telescope, I saw a man who was on a hill. I was on the hill when I used the telescope to see a man. ... (Figure: the entities involved: me, a man, the telescope, the hill.) NLP statistical parsing 14

Parsing Chomsky Normal Form (CNF) NLP statistical parsing 15

Parsing Tabular Methods Dynamic programming for CFGs: CKY (Cocke, Kasami, Younger, 1967), grammar in CNF; Earley (1970). Extensible to unification, probabilistic parsing, etc. NLP statistical parsing 16

Parsing Parsing as searching in a search space: characterize the states and (if possible) enumerate them; define the initial state(s); define (if possible) the final states or the condition for reaching one of them. NLP statistical parsing 17

Tabular methods: CKY General parsing schema (Sikkel, 1997): a triple <X, H, D>, where X is the domain (the set of items), H ⊆ X is the set of hypotheses, D is the set of deductive steps, and V(D) ⊆ X is the set of valid entities. NLP statistical parsing 18

Tabular methods: CKY Instantiation for CKY: G = <N, Σ, P, S>, G in CNF, w = a_1 ... a_n.
X = { [A, i, j] | 1 ≤ i ≤ j, A ∈ N_G }
H = { [A, j, j] | (A → a_j) ∈ P_G, 1 ≤ j ≤ n }
D = { [B, i, j], [C, j+1, k] ⊢ [A, i, k] | (A → B C) ∈ P_G, 1 ≤ i ≤ j < k }
V(D) = { [A, i, j] | A ⇒* a_i ... a_j }
NLP statistical parsing 19

Tabular methods: CKY CKY: space cost O(n²), time cost O(n³); grammar in CNF; bottom-up strategy: dynamically build the parsing table t_{j,i}, where rows j give the width of each constituent (1 ≤ j ≤ |w| − i + 1) and columns i give the initial position of each constituent (1 ≤ i ≤ |w|), with w = a_1 ... a_n the input string, |w| = n. NLP statistical parsing 20

Tabular methods: CKY (Figure: the CKY table over a_1 a_2 ... a_i ... a_n; cell t_{j,i} contains A when A → B C is a binary production of the grammar and B, C label two smaller cells covering adjacent sub-spans.) NLP statistical parsing 21

Tabular methods: CKY That A is in cell t_{j,i} means that the text fragment a_i ... a_{i+j-1} (the string of length j starting at the i-th position) can be derived from A. The grammaticality condition is that the initial symbol S of the grammar satisfies S ∈ t_{|w|,1}. NLP statistical parsing 22

Tabular methods: CKY The table is built bottom-up. Base case: row 1 is built using only the unary (lexical) rules of the grammar: for j = 1, t_{1,i} = { A | (A → a_i) ∈ P }. Recursive case: rows j = 2, ... are built in turn; the key of the algorithm is that when row j is built, all the previous rows (1 to j−1) are already built: for j > 1, t_{j,i} = { A | ∃k, 1 ≤ k < j, (A → B C) ∈ P, B ∈ t_{k,i}, C ∈ t_{j−k,i+k} }. NLP statistical parsing 23

Tabular methods: CKY
1. Add the lexical edges: t[1,i] = { A | (A → a_i) ∈ P }
2. for j = 2 to n:
     for i = 1 to n-j+1:
       for k = 1 to j-1:
         if (A → B C) ∈ P and B ∈ t[k,i] and C ∈ t[j-k,i+k]:
           add A to t[j,i]
3. If S ∈ t[n,1], return the corresponding parse
NLP statistical parsing 24

Tabular methods: CKY Example grammar: sentence → NP VP; NP → A B; VP → C NP; A → det; B → n; NP → n; VP → vi; C → vt. Parse the sentence "the cat eats fish": the (det), cat (n), eats (vt, vi), fish (n). NLP statistical parsing 25

Tabular methods: CKY The resulting CKY table:
j=4: t[4,1] (the cat eats fish) = {sentence}
j=3: t[3,1] (the cat eats) = {sentence}; t[3,2] (cat eats fish) = {sentence}
j=2: t[2,1] (the cat) = {NP}; t[2,2] (cat eats) = {sentence}; t[2,3] (eats fish) = {VP}
j=1: t[1,1] the (det) = {A}; t[1,2] cat (n) = {B, NP}; t[1,3] eats (vt, vi) = {C, VP}; t[1,4] fish (n) = {B, NP}
NLP statistical parsing 26
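To make the algorithm concrete, here is a minimal Python sketch of the CKY recogniser from the pseudocode above, run on the toy grammar and the sentence "the cat eats fish" of these slides. The data structures and names (binary, lexical, cky_recognise) are choices of this sketch, not part of the original material.

```python
from collections import defaultdict

# Toy grammar from the slides, in CNF; terminals are the POS tags.
binary = {                       # (B, C) -> heads A such that A -> B C
    ("NP", "VP"): {"sentence"},
    ("A", "B"): {"NP"},
    ("C", "NP"): {"VP"},
}
lexical = {                      # tag -> heads A such that A -> tag
    "det": {"A"},
    "n": {"B", "NP"},
    "vt": {"C"},
    "vi": {"VP"},
}

def cky_recognise(tags):
    """tags: one set of POS tags per word. Returns the chart t[j, i] (1-based)."""
    n = len(tags)
    t = defaultdict(set)         # t[j, i]: non-terminals deriving the span of length j at position i
    for i, word_tags in enumerate(tags, start=1):             # row j = 1: lexical edges
        for tag in word_tags:
            t[1, i] |= lexical.get(tag, set())
    for j in range(2, n + 1):                                 # span length
        for i in range(1, n - j + 2):                         # start position
            for k in range(1, j):                             # split: left part has length k
                for (B, C), heads in binary.items():
                    if B in t[k, i] and C in t[j - k, i + k]:
                        t[j, i] |= heads
    return t

# "the cat eats fish"  ->  the (det), cat (n), eats (vt, vi), fish (n)
chart = cky_recognise([{"det"}, {"n"}, {"vt", "vi"}, {"n"}])
print("sentence" in chart[4, 1])   # True: the sentence is grammatical
```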

Statistical parsing Introduction SCFG Inside Algorithm Outside Algorithm Viterbi Algorithm Learning models Grammar acquisition: Grammatical induction NLP statistical parsing 27

Statistical parsing Using statistical models for: determining the sentence (e.g. in speech recognizers), where the job of the parser is to be a language model; guiding parsing, to order or prune the search space; getting the most likely parse; ambiguity resolution, e.g. PP-attachment. NLP statistical parsing 28

Statistical parsing Lexical approaches: context-free (unigram); context-dependent (N-gram, HMM). Syntactic approaches: SCFG (or PCFG). Hybrid approaches: Stochastic Lexicalized TAGs. Computing the most likely (most probable) parse: Viterbi. Parameter learning: supervised (tagged/parsed corpora); unsupervised (Baum-Welch (Fw-Bw) for HMMs, Inside-Outside for SCFGs). NLP statistical parsing 29

SCFG Stochastic Context-Free Grammars (or PCFGs): associate a probability to each rule and a probability to each lexical entry. Frequent restriction, CNF: binary rules A_p → A_q A_r, stored in a matrix B_{p,q,r}; unary rules A_p → b_m, stored in a matrix U_{p,m}. NLP statistical parsing 30
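As a concrete illustration of these two "matrices", a CNF PCFG can be stored as two dictionaries keyed by rules. The grammar below is a small invented example, not taken from the slides; the check verifies the constraint, stated a few slides later, that the probabilities of all rules sharing a left-hand side sum to 1.

```python
# Hypothetical CNF PCFG: B[p][(q, r)] = P(A_p -> A_q A_r), U[p][b] = P(A_p -> b).
B = {
    "S":  {("NP", "VP"): 1.0},
    "VP": {("V", "NP"): 0.7, ("VP", "PP"): 0.3},
    "NP": {("NP", "PP"): 0.2},
    "PP": {("P", "NP"): 1.0},
}
U = {
    "NP": {"she": 0.3, "fish": 0.3, "fork": 0.2},
    "V":  {"eats": 1.0},
    "P":  {"with": 1.0},
}

def check_normalisation(B, U, tol=1e-9):
    """For every non-terminal A_p: sum_{q,r} B_{p,q,r} + sum_m U_{p,m} must be 1."""
    for p in set(B) | set(U):
        total = sum(B.get(p, {}).values()) + sum(U.get(p, {}).values())
        assert abs(total - 1.0) < tol, f"rules for {p} sum to {total}, not 1"

check_normalisation(B, U)   # passes silently: each left-hand side sums to 1
```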

SCFG NLP statistical parsing 31

SCFG NLP statistical parsing 32

SCFG NLP statistical parsing 33

Parsing SCFG Starting from a CFG G, an SCFG is obtained by defining, for each rule (A → α) ∈ P_G, a probability P(A → α) such that for every non-terminal A: Σ_{(A → α) ∈ P_G} P(A → α) = 1. Probability of a tree τ: P(τ) = Π_{(A → α) ∈ P_G} P(A → α)^{f(A → α; τ)}, where f(A → α; τ) is the number of times the rule A → α is applied in τ. NLP statistical parsing 34

Parsing SCFG P(t): probability of a tree t (the product of the probabilities of the rules generating it). P(w_{1n}): probability of a sentence, the sum of the probabilities of all its valid parse trees: P(w_{1n}) = Σ_t P(w_{1n}, t) = Σ_t P(t), where t ranges over the parses of w_{1n}. NLP statistical parsing 35
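A small sketch of these two definitions on an invented toy PCFG: trees are nested tuples, P(t) multiplies the probabilities of the rules used in t, and P(w_{1n}) sums P(t) over the parses of the sentence, here the two readings of "she eats fish with fork". The rule probabilities and the tree encoding are assumptions of this sketch.

```python
# Hypothetical PCFG rule probabilities, written as "LHS -> RHS" strings.
P_rule = {
    "S -> NP VP": 1.0,
    "VP -> V NP": 0.7, "VP -> VP PP": 0.3,
    "NP -> NP PP": 0.2, "NP -> she": 0.3, "NP -> fish": 0.3, "NP -> fork": 0.2,
    "PP -> P NP": 1.0, "V -> eats": 1.0, "P -> with": 1.0,
}

def tree_prob(tree):
    """P(t): product over the rules used in t; a tree is (label, child, ...)."""
    label, *children = tree
    rhs = " ".join(c if isinstance(c, str) else c[0] for c in children)
    p = P_rule[f"{label} -> {rhs}"]
    for child in children:
        if not isinstance(child, str):            # recurse into non-terminal children
            p *= tree_prob(child)
    return p

# The two parses of "she eats fish with fork" (VP- vs NP-attachment of the PP).
t1 = ("S", ("NP", "she"),
      ("VP", ("VP", ("V", "eats"), ("NP", "fish")),
             ("PP", ("P", "with"), ("NP", "fork"))))
t2 = ("S", ("NP", "she"),
      ("VP", ("V", "eats"),
             ("NP", ("NP", "fish"), ("PP", ("P", "with"), ("NP", "fork")))))

print(tree_prob(t1), tree_prob(t2))    # ~0.00378 and ~0.00252
print(tree_prob(t1) + tree_prob(t2))   # P(w_1n) ~ 0.0063, summing over all parses
```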

Parsing SCFG Independence assumptions: Positional invariance: the probability of a subtree is independent of its position in the derivation tree. Context-free: the probability of a subtree does not depend on words not dominated by the subtree. Ancestor-free: the probability of a subtree does not depend on nodes in the derivation outside the subtree. NLP statistical parsing 36

Parsing SCFG Parameter estimation: Supervised learning: from a treebank {τ_1, ..., τ_N} (MLE). Unsupervised learning: Inside/Outside (EM), similar to Baum-Welch in HMMs. NLP statistical parsing 37

Parsing SCFG Supervised learning: Maximum Likelihood Estimation (MLE): P(A → α) = #(A → α) / #(A), where #(A → α) = Σ_{i=1}^{N} f(A → α; τ_i). NLP statistical parsing 38
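A minimal sketch of this count-and-normalise estimation over a tiny invented "treebank" of two hand-built trees; the tuple encoding and the helper names (rules, mle) are assumptions of this sketch.

```python
from collections import Counter, defaultdict

def rules(tree):
    """Yield every rule 'LHS -> RHS' used in a tree given as nested tuples."""
    label, *children = tree
    yield f"{label} -> " + " ".join(c if isinstance(c, str) else c[0] for c in children)
    for child in children:
        if not isinstance(child, str):
            yield from rules(child)

def mle(treebank):
    """P(A -> alpha) = #(A -> alpha) / #(A), counted over the whole treebank."""
    rule_count = Counter(r for tree in treebank for r in rules(tree))
    lhs_count = defaultdict(int)
    for rule, c in rule_count.items():
        lhs_count[rule.split(" -> ")[0]] += c
    return {rule: c / lhs_count[rule.split(" -> ")[0]] for rule, c in rule_count.items()}

# Toy treebank: two trees invented for illustration.
treebank = [
    ("S", ("NP", "she"), ("VP", ("V", "eats"), ("NP", "fish"))),
    ("S", ("NP", "fish"), ("VP", ("V", "swims"))),
]
for rule, p in sorted(mle(treebank).items()):
    print(f"{rule:<12} {p:.2f}")
# e.g. NP -> fish 0.67, NP -> she 0.33, VP -> V NP 0.50, ... (each left-hand side sums to 1)
```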

SCFG in CNF Learning using CNF, the most frequent approach: binary rules A_p → A_q A_r (matrix B_{p,q,r}) and unary rules A_p → b_m (matrix U_{p,m}), which must satisfy, for each p: Σ_{q,r} B_{p,q,r} + Σ_m U_{p,m} = 1. A_1 is the axiom of the grammar. d = derivation = sequence of rule applications from A_1 to w: A_1 = α_0 ⇒ α_1 ⇒ ... ⇒ α_d = w, with p(d | G) = Π_{k=1}^{d} p(α_{k-1} ⇒ α_k | G) and p(w | G) = Σ_{d: A_1 ⇒* w} p(d | G). NLP statistical parsing 39

SCFG in CNF (Figure: a derivation tree rooted in A_1 over w_1 ... w_n, with a node A_p expanding to A_q A_r over the span w_{i+1} ... w_k, and a preterminal A_s rewriting to b_m = w_j.) NLP statistical parsing 40

SCFG in CNF Learning using CNF. Problems to solve (analogous to HMMs): probability of a string (LM): p(w_{1n} | G); most probable parse of a string: argmax_t p(t | w_{1n}, G); parameter learning: find G that maximizes p(w_{1n} | G). NLP statistical parsing 41

SCFG in CNF HMM: probability distribution over strings of a certain length; for all n: Σ_{w_{1n}} P(w_{1n}) = 1. PCFG: probability distribution over the set of strings that are in the language L: Σ_{ω ∈ L} P(ω) = 1. Example: P(John decided to bake a). NLP statistical parsing 42

SCFG in CNF HMM: probability distribution over strings of a certain length, for all n: Σ_{w_{1n}} P(w_{1n}) = 1; Forward/Backward: Forward α_i(t) = P(w_{1(t−1)}, X_t = i), Backward β_i(t) = P(w_{tT} | X_t = i). PCFG: probability distribution over the set of strings that are in the language L: Σ_{ω ∈ L} P(ω) = 1; Inside/Outside: Outside O_i(p,q) = P(w_{1(p−1)}, N^i_{pq}, w_{(q+1)m} | G), Inside I_i(p,q) = P(w_{pq} | N^i_{pq}, G). NLP statistical parsing 43

SCFG in CNF (Figure: the outside probability covers the part of the derivation rooted in A_1 that lies outside the constituent A_p, while the inside probability covers the subtree below A_p → A_q A_r.) NLP statistical parsing 44

SCFG in CNF Inside probability: I_p(i,j) = P(A_p ⇒* w_i ... w_j). This probability can be computed bottom-up, starting with the shorter constituents. Base case: I_p(i,i) = P(A_p ⇒ w_i) = U_{p,m} (where b_m = w_i). Recurrence: I_p(i,k) = Σ_{q,r} Σ_{j=i}^{k−1} I_q(i,j) · I_r(j+1,k) · B_{p,q,r}. NLP statistical parsing 45
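The recurrence translates into a short bottom-up dynamic program. The sketch below uses a small invented CNF PCFG, with B and U as dictionaries and 1-based spans; for the ambiguous sentence "she eats fish with fork" the inside probability of the start symbol over the whole string is the sentence probability, summing over both parses. Names and grammar are assumptions of this sketch.

```python
from collections import defaultdict

# Hypothetical CNF PCFG: B[(p, q, r)] = P(A_p -> A_q A_r), U[(p, word)] = P(A_p -> word).
B = {("S", "NP", "VP"): 1.0, ("VP", "V", "NP"): 0.7, ("VP", "VP", "PP"): 0.3,
     ("NP", "NP", "PP"): 0.2, ("PP", "P", "NP"): 1.0}
U = {("NP", "she"): 0.3, ("NP", "fish"): 0.3, ("NP", "fork"): 0.2,
     ("V", "eats"): 1.0, ("P", "with"): 1.0}

def inside(words):
    """I[p, i, j] = P(A_p =>* w_i ... w_j), 1-based spans, computed bottom-up."""
    n = len(words)
    I = defaultdict(float)
    for i, w in enumerate(words, start=1):             # base case: spans of length 1
        for (p, word), prob in U.items():
            if word == w:
                I[p, i, i] = prob
    for length in range(2, n + 1):                     # longer spans, shortest first
        for i in range(1, n - length + 2):
            k = i + length - 1
            for (p, q, r), prob in B.items():
                I[p, i, k] += sum(I[q, i, j] * I[r, j + 1, k] * prob
                                  for j in range(i, k))
    return I

words = "she eats fish with fork".split()
I = inside(words)
print(I["S", 1, len(words)])   # ~0.0063 = P(w | G), the sum over both parses
```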

SCFG in CNF Outside probability: O_q(i,j) = P(A_1 ⇒* w_1 ... w_{i−1} A_q w_{j+1} ... w_n). This probability can be computed top-down, starting with the widest constituents. Base case: O_1(1,n) = P(A_1 ⇒* A_1) = 1, and O_j(1,n) = 0 for j ≠ 1. Recurrence, with two cases summed over all the possible partitions: O_q(i,j) = Σ_{p=1}^{N} Σ_{r=1}^{N} Σ_{k=j+1}^{n} O_p(i,k) · I_r(j+1,k) · B_{p,q,r} + Σ_{p=1}^{N} Σ_{r=1}^{N} Σ_{k=1}^{i−1} O_p(k,j) · I_r(k,i−1) · B_{p,r,q}. NLP statistical parsing 46

SCFG in CNF Two splitting forms. First: A_q is the left child of A_p, contributing O_p(i,k) · I_r(j+1,k) · B_{p,q,r}. (Figure: under A_1, the node A_p spans w_i ... w_k and expands to A_q over w_i ... w_j and A_r over w_{j+1} ... w_k.) NLP statistical parsing 47

SCFG in CNF Second: A_q is the right child of A_p, contributing O_p(k,j) · I_r(k,i−1) · B_{p,r,q}. (Figure: under A_1, the node A_p spans w_k ... w_j and expands to A_r over w_k ... w_{i−1} and A_q over w_i ... w_j.) NLP statistical parsing 48
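Combining the two splitting forms, the outside table can be filled top-down once the inside table is available. The sketch below reuses the same invented grammar and recomputes the inside table so that it runs on its own; the final line is a sanity check of my own (not from the slides): at any single word position k, Σ_p O_p(k,k) · I_p(k,k) equals P(w | G).

```python
from collections import defaultdict

# Same hypothetical CNF PCFG as in the inside sketch.
B = {("S", "NP", "VP"): 1.0, ("VP", "V", "NP"): 0.7, ("VP", "VP", "PP"): 0.3,
     ("NP", "NP", "PP"): 0.2, ("PP", "P", "NP"): 1.0}
U = {("NP", "she"): 0.3, ("NP", "fish"): 0.3, ("NP", "fork"): 0.2,
     ("V", "eats"): 1.0, ("P", "with"): 1.0}

def inside(words):
    n, I = len(words), defaultdict(float)
    for i, w in enumerate(words, start=1):
        for (p, word), prob in U.items():
            if word == w:
                I[p, i, i] = prob
    for length in range(2, n + 1):
        for i in range(1, n - length + 2):
            k = i + length - 1
            for (p, q, r), prob in B.items():
                I[p, i, k] += sum(I[q, i, j] * I[r, j + 1, k] * prob for j in range(i, k))
    return I

def outside(words, I, start="S"):
    """O[q, i, j] = P(A_1 =>* w_1..w_{i-1} A_q w_{j+1}..w_n), computed top-down."""
    n, O = len(words), defaultdict(float)
    O[start, 1, n] = 1.0                                    # base case
    for length in range(n - 1, 0, -1):                      # widest spans first
        for i in range(1, n - length + 2):
            j = i + length - 1
            for (p, q, r), prob in B.items():
                # A_q is the left child of A_p: the sibling A_r spans (j+1, k).
                O[q, i, j] += sum(O[p, i, k] * I[r, j + 1, k] * prob
                                  for k in range(j + 1, n + 1))
                # A_r is the right child of A_p: the sibling A_q spans (k, i-1).
                O[r, i, j] += sum(O[p, k, j] * I[q, k, i - 1] * prob
                                  for k in range(1, i))
    return O

words = "she eats fish with fork".split()
I = inside(words)
O = outside(words, I)
nonterms = {p for (p, _, _) in B} | {p for (p, _) in U}
print(sum(O[p, 3, 3] * I[p, 3, 3] for p in nonterms))   # ~0.0063, same as I["S", 1, 5]
```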

SCFG in CNF Viterbi, O(|G| n³). Given a sentence w_1 ... w_n, M_p(i,j) contains the maximum probability of a derivation A_p ⇒* w_i ... w_j. M can be computed incrementally for increasing substring lengths, by induction over the length j − i + 1. Base case: M_p(i,i) = P(A_p ⇒ w_i) = U_{p,m} (where b_m = w_i). NLP statistical parsing 49

SCFG in CNF Recurrence: consider all the ways of decomposing A_p into two components, updating the maximum probability: M_p(i,j) = max_{q,r} max_{k=i}^{j−1} M_q(i,k) · M_r(k+1,j) · B_{p,q,r}. Recall that using sum instead of max we get the inside algorithm, i.e. p(w_{1n} | G). (Figure: A_p → A_q A_r, with A_q over w_i ... w_k (length k − i + 1) and A_r over w_{k+1} ... w_j (length j − k); the whole span has length j − i + 1.) NLP statistical parsing 50

SCFG in CNF To get the probability of the best (most probable) derivation: M_1(1,n). To get the best derivation tree we need to maintain not only the probability M_p(i,j) but also the cut point and the two categories of the right-hand side of the rule: (RHS1(p,i,j), RHS2(p,i,j), SPLIT(p,i,j)) = argmax_{q,r,k} M_q(i,k) · M_r(k+1,j) · B_{p,q,r}. (Figure: A_p expands to A_{RHS1(p,i,j)} over w_i ... w_{SPLIT(p,i,j)} and A_{RHS2(p,i,j)} over w_{SPLIT(p,i,j)+1} ... w_j.) NLP statistical parsing 51
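A compact sketch of the Viterbi parser with back-pointers: M stores the maximum probabilities and back stores the (SPLIT, RHS1, RHS2) information needed to rebuild the best tree. Grammar and sentence are the same invented toy example used in the inside/outside sketches.

```python
from collections import defaultdict

# Hypothetical CNF PCFG, as before: B[(p, q, r)] and U[(p, word)].
B = {("S", "NP", "VP"): 1.0, ("VP", "V", "NP"): 0.7, ("VP", "VP", "PP"): 0.3,
     ("NP", "NP", "PP"): 0.2, ("PP", "P", "NP"): 1.0}
U = {("NP", "she"): 0.3, ("NP", "fish"): 0.3, ("NP", "fork"): 0.2,
     ("V", "eats"): 1.0, ("P", "with"): 1.0}

def viterbi_parse(words, start="S"):
    """M[p, i, j] = max prob of A_p =>* w_i..w_j; back[p, i, j] = (SPLIT, RHS1, RHS2)."""
    n = len(words)
    M, back = defaultdict(float), {}
    for i, w in enumerate(words, start=1):                  # base case: single words
        for (p, word), prob in U.items():
            if word == w:
                M[p, i, i] = prob
    for length in range(2, n + 1):
        for i in range(1, n - length + 2):
            j = i + length - 1
            for (p, q, r), prob in B.items():
                for k in range(i, j):                       # cut point: left part is (i, k)
                    cand = M[q, i, k] * M[r, k + 1, j] * prob
                    if cand > M[p, i, j]:
                        M[p, i, j] = cand
                        back[p, i, j] = (k, q, r)
    def build(p, i, j):                                     # follow the back-pointers
        if i == j:
            return (p, words[i - 1])
        k, q, r = back[p, i, j]
        return (p, build(q, i, k), build(r, k + 1, j))
    return M[start, 1, n], build(start, 1, n)

prob, tree = viterbi_parse("she eats fish with fork".split())
print(prob)   # ~0.00378: the VP-attachment reading wins over the NP-attachment one
print(tree)   # ('S', ('NP', 'she'), ('VP', ('VP', ('V', 'eats'), ('NP', 'fish')), ('PP', ...)))
```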

SCFG in CNF Learning the models, supervised approach: the parameters (probabilities, i.e. the matrices B and U) are estimated from a corpus by MLE (Maximum Likelihood Estimation), with the corpus fully parsed (i.e. a set of pairs <sentence, correct parse tree>): B̂_{p,q,r} = p̂(A_p → A_q A_r) = E(# A_p → A_q A_r | G) / E(# A_p | G). NLP statistical parsing 52

SCFG in CNF Learning the models, unsupervised approach. Inside/Outside algorithm: similar to Forward-Backward (Baum-Welch) for HMMs, a particular application of the Expectation Maximization (EM) algorithm: 1. Start with an initial model µ0 (uniform, random, MLE...). 2. Compute the observation probability using the current model. 3. Use the obtained probabilities as data to re-estimate the model, computing µ'. 4. Let µ = µ' and repeat until no significant improvement (convergence). Iterative hill-climbing: local maxima. EM property: P_{µ'}(O) ≥ P_µ(O). NLP statistical parsing 53

SCFG in CNF Learning the models, unsupervised approach. Inside/Outside algorithm: Input: a set of training examples (unparsed sentences) and a CFG G. Initialization: choose initial parameters P(A → α) > 0 for each rule in the grammar (randomly or from a small labelled corpus using MLE). Expectation: compute the posterior probability of each annotated rule and position in each tree T of each training sentence. Maximization: use these probabilities as weighted observations to update the rule probabilities, keeping Σ_{(A → α) ∈ P_G} P(A → α) = 1. NLP statistical parsing 54

SCFG in CNF Inside/Outside algorithm: for each training sentence w we compute the inside and outside probabilities. We can multiply the probabilities inside and outside: O_i(j,k) · I_i(j,k) = P(A_1 ⇒* w_1 ... w_n, A_i ⇒* w_j ... w_k | G) = P(w_{1n}, A^i_{jk} | G), so that the estimate of A_i being used in the derivation is: E(A_i is used in the derivation) = Σ_{p=1}^{n} Σ_{q=p}^{n} O_i(p,q) I_i(p,q) / I_1(1,n). NLP statistical parsing 55

SCFG in CNF Inside/Outside algorithm: The estimate of A_i → A_r A_s being used in the derivation: E(A_i → A_r A_s) = Σ_{p=1}^{n−1} Σ_{q=p+1}^{n} Σ_{d=p}^{q−1} O_i(p,q) B_{i,r,s} I_r(p,d) I_s(d+1,q) / I_1(1,n). For unary rules, the estimate of A_i → w_m being used: E(A_i → w_m) = Σ_{h=1}^{n} O_i(h,h) P(w_h = w_m) I_i(h,h) / I_1(1,n). And we can re-estimate P(A_i → A_r A_s) and P(A_i → w_m): P(A_i → A_r A_s) = E(A_i → A_r A_s) / E(A_i used); P(A_i → w_m) = E(A_i → w_m) / E(A_i used). NLP statistical parsing 56

SCFG in CNF Inside/Outside algorithm: Assuming independence of the sentences in the training corpus, we sum the contributions from multiple sentences in the re-estimation process. We can re-estimate the values of P(A_p → A_q A_r) and P(A_p → w_m) and, from them, the new values of B_{p,q,r} and U_{p,m}. The I-O algorithm iterates this process of parameter re-estimation until the change in the estimated probability is small, i.e. until P(W | G_{i+1}) − P(W | G_i) falls below a threshold. NLP statistical parsing 57
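To tie the pieces together, here is a self-contained sketch of one Inside-Outside re-estimation step on a one-sentence toy corpus with the same invented grammar as above: the E-step accumulates the expected rule counts with the formulas of the previous slides, and the M-step renormalises them per left-hand side. A real implementation would iterate this over a large corpus until convergence; everything named here is an assumption of the sketch.

```python
from collections import defaultdict

# Hypothetical CNF PCFG (initial parameters) and a one-sentence toy corpus.
B = {("S", "NP", "VP"): 1.0, ("VP", "V", "NP"): 0.7, ("VP", "VP", "PP"): 0.3,
     ("NP", "NP", "PP"): 0.2, ("PP", "P", "NP"): 1.0}
U = {("NP", "she"): 0.3, ("NP", "fish"): 0.3, ("NP", "fork"): 0.2,
     ("V", "eats"): 1.0, ("P", "with"): 1.0}
corpus = ["she eats fish with fork".split()]

def inside_outside(words):
    """Inside table I[p, i, j] and outside table O[p, i, j] for one sentence."""
    n, I, O = len(words), defaultdict(float), defaultdict(float)
    for i, w in enumerate(words, start=1):                      # inside, bottom-up
        for (p, word), prob in U.items():
            if word == w:
                I[p, i, i] = prob
    for length in range(2, n + 1):
        for i in range(1, n - length + 2):
            j = i + length - 1
            for (p, q, r), prob in B.items():
                I[p, i, j] += sum(I[q, i, d] * I[r, d + 1, j] * prob for d in range(i, j))
    O["S", 1, n] = 1.0                                          # outside, top-down
    for length in range(n - 1, 0, -1):
        for i in range(1, n - length + 2):
            j = i + length - 1
            for (p, q, r), prob in B.items():
                O[q, i, j] += sum(O[p, i, k] * I[r, j + 1, k] * prob for k in range(j + 1, n + 1))
                O[r, i, j] += sum(O[p, k, j] * I[q, k, i - 1] * prob for k in range(1, i))
    return I, O

def em_step(corpus):
    """One E/M iteration: expected rule counts from inside-outside, then renormalise."""
    num, den = defaultdict(float), defaultdict(float)
    for words in corpus:
        n = len(words)
        I, O = inside_outside(words)
        Z = I["S", 1, n]                                        # P(w | G)
        for (p, q, r), prob in B.items():                       # E[#(A_p -> A_q A_r)]
            c = sum(O[p, i, j] * I[q, i, d] * I[r, d + 1, j] * prob
                    for i in range(1, n) for j in range(i + 1, n + 1)
                    for d in range(i, j)) / Z
            num[p, q, r] += c
            den[p] += c
        for (p, word), prob in U.items():                       # E[#(A_p -> w_m)]
            c = sum(O[p, h, h] * prob for h in range(1, n + 1) if words[h - 1] == word) / Z
            num[p, word] += c
            den[p] += c
    newB = {(p, q, r): num[p, q, r] / den[p] for (p, q, r) in B}
    newU = {(p, w): num[p, w] / den[p] for (p, w) in U}
    return newB, newU

newB, newU = em_step(corpus)
print(newB["VP", "V", "NP"])    # 0.7 -> ~0.625 after one step on this tiny corpus
print(newB["VP", "VP", "PP"])   # 0.3 -> ~0.375
```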

SCFG Pros and cons of SCFGs: They give some idea of the probability of a parse, but not a very good one. A CFG cannot be learned without negative examples; an SCFG can. SCFGs provide a LM for a language, but in practice an SCFG provides a worse LM than an n-gram (n > 1). For example, two bracketings built from the same rules get the same probability, P([N [N toy] [N [N coffee] [N grinder]]]) = P([N [N [N cat] [N food]] [N tin]]), and P(NP → Pro) is greater in subject position than in object position, which a plain SCFG cannot capture. NLP statistical parsing 58

SCFG Pros and cons of SCFGs: Robust. Possibility of combining SCFGs with 3-grams. SCFGs assign a lot of probability mass to short sentences (a small tree is more probable than a big one). Parameter estimation (probabilities): problems of sparseness and data volume. NLP statistical parsing 59

Statistical parsing Grammatical induction from corpora. Goal: parsing of unrestricted texts with a reasonable level of accuracy (>90%) and efficiency. Requirements: POS-tagged corpora: Brown, LOB, Clic-Talp; parsed corpora (treebanks): Penn Treebank, Susanne, Ancora. NLP statistical parsing 60

Treebank grammars Penn Treebank = 50,000 sentences with associated trees Usual set-up: 40,000 training sentences, 2400 test sentences NLP statistical parsing 61

Treebank grammars Grammars directly derived from a treebank (Charniak, 1996): using 47,000 PTB sentences, navigating the PTB so that each local subtree provides the left-hand side and right-hand side of a rule. Precision and recall around 80%. Around 17,500 rules. NLP statistical parsing 62

Treebank grammars Learning treebank grammars: for each non-terminal N^i, the rule probabilities satisfy Σ_j P(N^i → ζ^j | N^i) = 1. NLP statistical parsing 63

Treebank grammars Supervised learning MLE NLP statistical parsing 64

Treebank grammars Proposals for transformation of the obtained PTB grammar (Sekine, 1997; Sekine & Grishman, 1995). Treebank grammar compaction (Krotov et al., 1999; Krotov, 1998; Gaizauskas, 1995), motivated by: lacking generalization ability, continuous growth of the grammar size, most induced rules having low frequency. NLP statistical parsing 65

Treebank grammars Treebank grammar compaction: Partial bracketing, e.g. NP → DT NN CC DT NN can be generated from NP → NP CC NP and NP → DT NN. Redundancy removal (some rules can be generated from others). NLP statistical parsing 66

Treebank grammars Removing non-linguistically valid rules: assign probabilities (MLE) to the initial rules; remove a rule unless the probability of the structure built from its application is greater than the probability of building the structure by applying simpler rules. Thresholding: removing rules occurring < n times.
                 Full     Simply thresholded   Fully compacted   Linguistically compacted 1   Linguistically compacted 2
Recall           70.55    70.78                30.93             71.55                        70.76
Precision        77.89    77.66                19.18             72.19                        77.21
Grammar size     15,421   7,278                1,122             4,820                        6,417
NLP statistical parsing 67

Treebank grammars Applying compaction: from 17,529 to 1,667 rules. (Figure: number of rules as a function of the percentage of the corpus used, from 10% to 100%.) NLP statistical parsing 68