Advanced Natural Language Processing Syntactic Parsing Alicia Ageno ageno@cs.upc.edu Universitat Politècnica de Catalunya NLP statistical parsing 1
Parsing Review Statistical Parsing SCFG Inside Algorithm Outside Algorithm Viterbi Algorithm Learning models Grammar acquisition: Grammatical induction NLP statistical parsing 2
Parsing
Parsing: recognising higher-level units of structure that allow us to compress our description of a sentence.
Goal of syntactic analysis (parsing):
Detect whether a sentence is correct
Provide a syntactic structure for the sentence
Parsing is the task of uncovering the syntactic structure of language and is often viewed as an important prerequisite for building systems capable of understanding language.
Syntactic structure is needed as a first step towards semantic interpretation, for detecting phrasal chunks for indexing in an IR system, etc.
NLP statistical parsing 3
Parsing A syntactic tree NLP statistical parsing 4
Parsing Another syntactic tree NLP statistical parsing 5
Parsing A dependency tree NLP statistical parsing 6
Parsing A real sentence NLP statistical parsing 7
Parsing Theories of Syntactic Structure Constituent trees Dependency trees NLP statistical parsing 8
Parsing Factors in parsing Grammar expressivity Coverage Involved Knowledge Sources Parsing strategy Parsing direction Production application order Ambiguity management NLP statistical parsing 9
Parsing Parsers today CFG (extended or not) Tabular Charts LR Unification-based Statistical Dependency parsing Robust parsing (shallow, fragmental, chunkers, spotters) NLP statistical parsing 10
Parsing Context Free Grammars (CFGs) NLP statistical parsing 11
Parsing Context Free Grammars, example NLP statistical parsing 12
Parsing Properties of CFGs NLP statistical parsing 13
Parsing I was on the hill that has a telescope when I saw a man. I saw a man who was on a hill and who had a telescope. I saw a man who was on the hill that has a telescope on it. Using a telescope, I saw a man who was on a hill. I was on the hill when I used the telescope to see a man.... I saw the man on the hill with the telescope Me See A man The telescope The hill NLP statistical parsing 14
Parsing Chomsky Normal Form (CNF) NLP statistical parsing 15
Parsing Tabular Methods Dynamic programming CFG CKY (Cocke, Kasami, Younger,1967) Grammar in CNF Earley 1969 Extensible to unification, probabilistic, etc... NLP statistical parsing 16
Parsing
Parsing as searching in a search space:
Characterizing the states and, if possible, enumerating them
Defining the initial state(s)
Defining, if possible, the final states or the condition for reaching one of them
NLP statistical parsing 17
Tabular methods: CKY
General parsing schema (Sikkel 97): ⟨X, H, D⟩ where
X: domain, the set of items
H ⊆ X: the set of hypotheses
D: the set of deductive steps
V(D) ⊆ X: the set of valid items
NLP statistical parsing 18
Tabular methods: CKY
G = ⟨N, Σ, P, S⟩, G in CNF, w = a_1 ... a_n
CKY as a parsing schema ⟨X, H, D⟩:
X = {[A, i, j] | 1 ≤ i ≤ j, A ∈ N_G}  (domain, set of items)
H = {[A, j, j] | A → a_j ∈ P_G, 1 ≤ j ≤ n}  (set of hypotheses)
D = {[B, i, j], [C, j+1, k] ⊢ [A, i, k] | A → BC ∈ P_G, 1 ≤ i ≤ j < k}  (set of deductive steps)
V(D) = {[A, i, j] | A ⇒* a_i ... a_j}  (set of valid items)
NLP statistical parsing 19
Tabular methods: CKY
CKY: spatial cost O(n²), temporal cost O(n³), grammar in CNF.
Bottom-up strategy: dynamically build the parsing table t_{j,i}
rows j: width of each constituent, 1 ≤ j ≤ |w| − i + 1
columns i: initial position of each constituent, 1 ≤ i ≤ |w|
where w = a_1 ... a_n is the input string, |w| = n
NLP statistical parsing 20
Tabular methods: CKY
(Figure: cell t_{j,i} contains A, with B and C covering two adjacent segments of a_i ... a_{i+j−1}, where A → BC is a binary production of the grammar.)
NLP statistical parsing 21
Tabular methods: CKY
That A is in cell t_{j,i} means that from A the text fragment a_i ... a_{i+j−1} (the string of length j starting at the i-th position) can be derived.
The grammaticality condition is that the initial symbol of the grammar, S, satisfies S ∈ t_{|w|,1}.
NLP statistical parsing 22
Tabular methods: CKY
The table is built bottom-up.
Base case: row 1 is built using only the unary (lexical) rules of the grammar:
j = 1: t_{1,i} = {A | [A → a_i] ∈ P}
Recursive case: rows j = 2, ... are built. The key of the algorithm is that when row j is built, all the previous rows (from 1 to j−1) are already available:
j > 1: t_{j,i} = {A | ∃k, 1 ≤ k < j, [A → BC] ∈ P, B ∈ t_{k,i}, C ∈ t_{j−k,i+k}}
NLP statistical parsing 23
Tabular methods: CKY
1. Add the lexical edges: t[1,i] for each word a_i
2. for j = 2 to n:
     for i = 1 to n−j+1:
       for k = 1 to j−1:
         if A → BC ∈ P and B ∈ t[k,i] and C ∈ t[j−k,i+k]:
           add A to t[j,i]
3. If S ∈ t[n,1], return the corresponding parse
NLP statistical parsing 24
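As an illustration, here is a minimal Python sketch of this recognizer. It assumes the CNF grammar is given as two plain dictionaries (lexical rules mapping a word to the categories that derive it, and binary rules mapping a pair (B, C) to the categories A with A → BC); the names cky_table and recognize are illustrative, not from the slides.

```python
from collections import defaultdict

def cky_table(words, lexical, binary):
    """Build the CKY table t[j, i]: categories deriving the span of
    length j that starts at position i (1-based, as in the slides)."""
    n = len(words)
    t = defaultdict(set)
    # Base case (row 1): lexical edges
    for i, w in enumerate(words, start=1):
        t[1, i] = set(lexical.get(w, ()))
    # Recursive case: rows of increasing width j
    for j in range(2, n + 1):
        for i in range(1, n - j + 2):
            for k in range(1, j):
                for B in t[k, i]:
                    for C in t[j - k, i + k]:
                        t[j, i] |= binary.get((B, C), set())
    return t

def recognize(words, lexical, binary, start="S"):
    """Grammaticality condition: the axiom derives the whole string."""
    return start in cky_table(words, lexical, binary)[len(words), 1]
```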
Tabular methods: CKY
Example grammar:
sentence → NP VP
NP → A B
VP → C NP
A → det
B → n
NP → n
VP → vi
C → vt
Parse the sentence: the cat eats fish
the (det)  cat (n)  eats (vt, vi)  fish (n)
NLP statistical parsing 25
Tabular methods: CKY
j=4:  t[4,1] the cat eats fish: sentence
j=3:  t[3,1] the cat eats: sentence    t[3,2] cat eats fish: sentence
j=2:  t[2,1] the cat: NP    t[2,2] cat eats: sentence    t[2,3] eats fish: VP
j=1:  t[1,1] the (det): A    t[1,2] cat (n): B, NP    t[1,3] eats (vt, vi): C, VP    t[1,4] fish (n): B, NP
NLP statistical parsing 26
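With the toy grammar above encoded in that dictionary form (a hypothetical encoding, with the lexicon folded directly into the word-to-category map), the sketch reproduces this table and accepts the sentence:

```python
lexical = {"the": {"A"}, "cat": {"B", "NP"}, "eats": {"C", "VP"}, "fish": {"B", "NP"}}
binary = {("NP", "VP"): {"sentence"}, ("A", "B"): {"NP"}, ("C", "NP"): {"VP"}}
print(recognize("the cat eats fish".split(), lexical, binary, start="sentence"))  # True
```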
Statistical parsing Introduction SCFG Inside Algorithm Outside Algorithm Viterbi Algorithm Learning models Grammar acquisition: Grammatical induction NLP statistical parsing 27
Statistical parsing
Using statistical models for:
Determining the sentence (e.g. speech recognizers): the job of the parser is to be a language model
Guiding parsing: ordering or pruning the search space; getting the most likely parse
Ambiguity resolution, e.g. PP-attachment
NLP statistical parsing 28
Statistical parsing
Lexical approaches: context-free (unigram); context-dependent (N-gram, HMM)
Syntactic approaches: SCFG (or PCFG)
Hybrid approaches: Stochastic Lexicalized TAGs
Computing the most likely (most probable) parse: Viterbi
Parameter learning:
Supervised: tagged/parsed corpora
Unsupervised: Baum-Welch (Forward-Backward) for HMMs, Inside-Outside for SCFGs
NLP statistical parsing 29
SCFG
Stochastic Context-Free Grammars (or PCFGs):
Associate a probability to each rule
Associate a probability to each lexical entry
Frequent restriction, CNF:
Binary rules A_p → A_q A_r: matrix B_{p,q,r}
Unary rules A_p → b_m: matrix U_{p,m}
NLP statistical parsing 30
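For concreteness, here is a toy CNF SCFG in this matrix-as-dictionary encoding; the grammar itself is invented for illustration, and later sketches reuse this B/U representation.

```python
# Nonterminal indices: 1 = S (the axiom), 2 = NP, 3 = VP, 4 = V
B = {            # B[p, q, r] = P(A_p -> A_q A_r)
    (1, 2, 3): 1.0,    # S  -> NP VP
    (3, 4, 2): 0.7,    # VP -> V NP
}
U = {            # U[p, m] = P(A_p -> b_m), indexed here directly by the word
    (2, "fish"): 0.6, (2, "cats"): 0.4,   # NP -> fish | cats
    (3, "sleep"): 0.3,                    # VP -> sleep
    (4, "eat"): 1.0,                      # V  -> eat
}
# For each p, the binary and unary probabilities sum to 1 (e.g. for VP: 0.7 + 0.3)
```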
SCFG NLP statistical parsing 31
SCFG NLP statistical parsing 32
SCFG NLP statistical parsing 33
Parsing SCFG
Starting from a CFG G, an SCFG: for each rule (A → α) ∈ P_G we should be able to define a probability P(A → α), such that for every nonterminal A:
Σ_{α: (A→α) ∈ P_G} P(A → α) = 1
Probability of a tree τ:
P(τ) = ∏_{(A→α) ∈ P_G} P(A → α)^{f(A→α; τ)}
where f(A→α; τ) is the number of times the rule A → α is applied in τ.
NLP statistical parsing 34
Parsing SCFG
P(t): probability of a tree t (the product of the probabilities of the rules used to generate it).
P(w_{1n}): probability of a sentence, the sum of the probabilities of all the valid parse trees of the sentence:
P(w_{1n}) = Σ_t P(w_{1n}, t) = Σ_t P(t), where t ranges over the parses of w_{1n}
NLP statistical parsing 35
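A minimal sketch of the tree probability, assuming rule probabilities are stored in a dictionary keyed by (left-hand side, right-hand side); P(w_{1n}) would then be the sum of tree_prob over all parses of the sentence found by a chart parser.

```python
from math import prod  # Python 3.8+

def tree_prob(tree, rule_prob):
    """P(t): product of the probabilities of the rules used to generate t.
    A tree is (label, [children]); a leaf is a plain string (a word)."""
    if isinstance(tree, str):                 # leaf: no rule applied here
        return 1.0
    label, children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    return rule_prob[label, rhs] * prod(tree_prob(c, rule_prob) for c in children)

# Toy check: a sentence with a single parse tree of probability 1.0
rules = {("S", ("NP", "VP")): 1.0, ("NP", ("John",)): 1.0, ("VP", ("sleeps",)): 1.0}
print(tree_prob(("S", [("NP", ["John"]), ("VP", ["sleeps"])]), rules))  # 1.0
```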
Parsing SCFG
Positional invariance: the probability of a subtree is independent of its position in the derivation tree
Context-free: the probability of a subtree does not depend on words not dominated by the subtree
Ancestor-free: the probability of a subtree does not depend on nodes in the derivation outside the subtree
NLP statistical parsing 36
Parsing SCFG
Parameter estimation:
Supervised learning: from a treebank {τ_1, ..., τ_N} (MLE)
Unsupervised learning: Inside/Outside (EM), similar to Baum-Welch in HMMs
NLP statistical parsing 37
Parsing SCFG
Supervised learning: Maximum Likelihood Estimation (MLE)
P(A → α) = #(A → α) / #(A)
where #(A → α) = Σ_{i=1}^{N} f(A → α; τ_i)
NLP statistical parsing 38
SCFG in CNF
Learning using CNF. CNF is the most frequent approach:
Binary rules: A_p → A_q A_r, matrix B_{p,q,r}
Unary rules: A_p → b_m, matrix U_{p,m}
which, for every p, should satisfy: Σ_{q,r} B_{p,q,r} + Σ_m U_{p,m} = 1
A_1 is the axiom of the grammar.
d = derivation = sequence of rule applications from A_1 to w: A_1 = α_0 ⇒ α_1 ⇒ ... ⇒ α_{|d|} = w
p(d | G) = ∏_{k=1}^{|d|} p(α_{k−1} ⇒ α_k | G)
p(w | G) = Σ_{d: A_1 ⇒* w} p(d | G)
NLP statistical parsing 39
SCFG in CNF
(Figure: derivation tree with axiom A_1 spanning w_1 ... w_n; an internal node A_p rewritten by the binary rule A_p → A_q A_r over w_{i+1} ... w_k; a preterminal A_s rewritten by the unary rule A_s → b_m, with b_m = w_j.)
NLP statistical parsing 40
SCFG in CNF
Learning using CNF. Problems to solve (analogous to HMMs):
Probability of a string (LM): p(w_{1n} | G)
Most probable parse of a string: argmax_t p(t | w_{1n}, G)
Parameter learning: find G such that it maximizes p(w_{1n} | G)
NLP statistical parsing 41
SCFG in CNF
HMM: probability distribution over strings of a given length; for all n: Σ_{w_{1n}} P(w_{1n}) = 1
PCFG: probability distribution over the set of strings that are in the language L: Σ_{ω ∈ L} P(ω) = 1
Example: P(John decided to bake a)
NLP statistical parsing 42
SCFG in CNF
HMM: probability distribution over strings of a given length; for all n: Σ_{w_{1n}} P(w_{1n}) = 1
Forward/Backward:
Forward: α_i(t) = P(w_{1(t−1)}, X_t = i)
Backward: β_i(t) = P(w_{tT} | X_t = i)
PCFG: probability distribution over the strings of the language L: Σ_{ω ∈ L} P(ω) = 1
Inside/Outside:
Outside: O_i(p,q) = P(w_{1(p−1)}, N^i_{pq}, w_{(q+1)m} | G)
Inside: I_i(p,q) = P(w_{pq} | N^i_{pq}, G)
NLP statistical parsing 43
SCFG in CNF
(Figure: derivation tree rooted at A_1; the outside probability covers everything outside the subtree rooted at A_p, the inside probability covers the subtree A_p → A_q A_r and the span it dominates.)
NLP statistical parsing 44
SCFG in CNF
Inside probability: I_p(i,j) = P(A_p ⇒* w_i ... w_j)
This probability can be computed bottom-up, starting with the shorter constituents.
Base case: I_p(i,i) = P(A_p ⇒ w_i) = U_{p,m}  (with b_m = w_i)
Recurrence: I_p(i,k) = Σ_{q,r} Σ_{j=i}^{k−1} I_q(i,j) I_r(j+1,k) B_{p,q,r}
NLP statistical parsing 45
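A sketch of the inside computation over the B/U dictionary encoding introduced with the toy SCFG above; inside_probs is an illustrative name. The probability of the whole string is then I[1, 1, n], with A_1 the axiom.

```python
from collections import defaultdict

def inside_probs(words, nonterminals, B, U):
    """Inside algorithm for a CNF SCFG.
    B[p, q, r] = P(A_p -> A_q A_r), U[p, w] = P(A_p -> w).
    Returns I with I[p, i, j] = P(A_p =>* w_i ... w_j), 1-based spans."""
    n = len(words)
    I = defaultdict(float)
    # Base case: spans of length 1
    for i, w in enumerate(words, start=1):
        for p in nonterminals:
            I[p, i, i] = U.get((p, w), 0.0)
    # Recurrence: spans of increasing length, combining two smaller spans
    for length in range(2, n + 1):
        for i in range(1, n - length + 2):
            k = i + length - 1
            for (p, q, r), b in B.items():
                I[p, i, k] += b * sum(I[q, i, j] * I[r, j + 1, k]
                                      for j in range(i, k))
    return I

# p(w | G) for the toy grammar: 1.0 * 0.4 * (0.7 * 1.0 * 0.6)
I = inside_probs("cats eat fish".split(), {1, 2, 3, 4}, B, U)
print(I[1, 1, 3])  # 0.168
```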
SCFG in CNF
Outside probability: O_q(i,j) = P(A_1 ⇒* w_1 ... w_{i−1} A_q w_{j+1} ... w_n)
This probability can be computed top-down, starting with the widest constituents.
Base case: O_1(1,n) = 1; O_j(1,n) = 0 for j ≠ 1
Recurrence, two cases, over all the possible partitions:
O_q(i,j) = Σ_{p,r} Σ_{k=j+1}^{n} O_p(i,k) I_r(j+1,k) B_{p,q,r} + Σ_{p,r} Σ_{k=1}^{i−1} O_p(k,j) I_r(k,i−1) B_{p,r,q}
NLP statistical parsing 46
SCFG in CNF
Two splitting forms. First: A_q is the left child of A_p → A_q A_r, contributing O_p(i,k) I_r(j+1,k) B_{p,q,r}.
(Figure: A_1 dominates w_1 ... w_{i−1} A_q w_{j+1} ... w_n; A_p spans positions i..k, with A_q covering w_i ... w_j and A_r covering w_{j+1} ... w_k.)
NLP statistical parsing 47
SCFG in CNF
Second: A_q is the right child of A_p → A_r A_q, contributing O_p(k,j) I_r(k,i−1) B_{p,r,q}.
(Figure: A_p spans positions k..j, with A_r covering w_k ... w_{i−1} and A_q covering w_i ... w_j.)
NLP statistical parsing 48
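A corresponding sketch of the outside computation, given the inside table from the earlier sketch; the two inner loops implement exactly these two splitting forms of the recurrence.

```python
from collections import defaultdict

def outside_probs(words, nonterminals, B, I, axiom=1):
    """Outside algorithm for a CNF SCFG, given the inside table I.
    O[q, i, j] = P(A_1 =>* w_1 .. w_{i-1} A_q w_{j+1} .. w_n)."""
    n = len(words)
    O = defaultdict(float)
    O[axiom, 1, n] = 1.0                      # base case: only the axiom spans 1..n
    for length in range(n - 1, 0, -1):        # widest constituents first
        for i in range(1, n - length + 2):
            j = i + length - 1
            for q in nonterminals:
                total = 0.0
                for (p, left, r), b in B.items():   # A_q as left child of A_p -> A_q A_r
                    if left == q:
                        total += b * sum(O[p, i, k] * I[r, j + 1, k]
                                         for k in range(j + 1, n + 1))
                for (p, r, right), b in B.items():  # A_q as right child of A_p -> A_r A_q
                    if right == q:
                        total += b * sum(O[p, k, j] * I[r, k, i - 1]
                                         for k in range(1, i))
                O[q, i, j] = total
    return O
```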
SCFG in CNF
Viterbi: O(|G| n³)
Given a sentence w_1 ... w_n, M_p(i,j) contains the maximum probability of a derivation A_p ⇒* w_i ... w_j.
M can be computed incrementally for increasing substring lengths, by induction over the length j − i + 1.
Base case: M_p(i,i) = P(A_p ⇒ w_i) = U_{p,m}  (with b_m = w_i)
NLP statistical parsing 49
SCFG in CNF
Recurrence: consider all the ways of decomposing A_p into two constituents, updating the maximum probability:
M_p(i,j) = max_{q,r} max_{k=i}^{j−1} M_q(i,k) M_r(k+1,j) B_{p,q,r}
Recall that using sum instead of max we get the inside algorithm: p(w_{1n} | G).
(Figure: A_p → A_q A_r, with A_q spanning w_i ... w_k, of length k − i + 1, and A_r spanning w_{k+1} ... w_j, of length j − k; the whole span has length j − i + 1.)
NLP statistical parsing 50
SCFG in CNF
To get the probability of the best (most probable) derivation: M_1(1,n)
To get the best derivation tree we need to maintain not only the probability M_p(i,j) but also the split point and the two categories of the right-hand side of the rule:
Ψ_p(i,j) = argmax_{q,r,k} M_q(i,k) M_r(k+1,j) B_{p,q,r}
(Figure: A_p → A_{RHS1(p,i,j)} A_{RHS2(p,i,j)}, with the span w_i ... w_j split after w_{SPLIT(p,i,j)}.)
NLP statistical parsing 51
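A sketch of this Viterbi variant with backpointers, again over the B/U encoding; it returns M_1(1,n) together with the corresponding tree as nested tuples, or None if the sentence is not derivable.

```python
from collections import defaultdict

def viterbi_parse(words, nonterminals, B, U, axiom=1):
    """Max (Viterbi) variant of the inside computation, with backpointers."""
    n = len(words)
    M, back = defaultdict(float), {}
    for i, w in enumerate(words, start=1):
        for p in nonterminals:
            M[p, i, i] = U.get((p, w), 0.0)
    for length in range(2, n + 1):
        for i in range(1, n - length + 2):
            j = i + length - 1
            for (p, q, r), b in B.items():
                for k in range(i, j):               # all split points
                    score = b * M[q, i, k] * M[r, k + 1, j]
                    if score > M[p, i, j]:
                        M[p, i, j] = score
                        back[p, i, j] = (q, r, k)   # RHS categories and split point
    def build(p, i, j):
        if i == j:
            return (p, words[i - 1])
        q, r, k = back[p, i, j]
        return (p, build(q, i, k), build(r, k + 1, j))
    best = M[axiom, 1, n]
    return best, (build(axiom, 1, n) if best > 0 else None)
```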
SCFG in CNF
Learning the models. Supervised approach.
The parameters (probabilities, i.e. matrices B and U) are estimated from a corpus by MLE (Maximum Likelihood Estimation): the corpus is fully parsed (i.e. a set of pairs ⟨sentence, correct parse tree⟩).
B̂_{p,q,r} = p̂(A_p → A_q A_r) = E(#(A_p → A_q A_r) | G) / E(#(A_p) | G)
NLP statistical parsing 52
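A minimal sketch of this supervised estimation, counting rule occurrences in a small treebank of trees in the (label, [children]) format used earlier; the counts play the role of the expectations in the formula above.

```python
from collections import Counter

def mle_scfg(treebank):
    """MLE of rule probabilities from a set of parse trees:
    p(A -> alpha) = #(A -> alpha) / #(A)."""
    rule_counts, lhs_counts = Counter(), Counter()
    def visit(node):
        if isinstance(node, str):                 # leaf: a word, no rule
            return
        label, children = node
        rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
        rule_counts[label, rhs] += 1
        lhs_counts[label] += 1
        for c in children:
            visit(c)
    for tree in treebank:
        visit(tree)
    return {rule: count / lhs_counts[rule[0]]
            for rule, count in rule_counts.items()}
```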
SCFG in CNF
Learning the models. Unsupervised approach.
Inside/Outside algorithm: similar to Forward-Backward (Baum-Welch) for HMMs.
A particular application of the Expectation Maximization (EM) algorithm:
1. Start with an initial model µ0 (uniform, random, MLE, ...)
2. Compute the observation probability using the current model
3. Use the obtained probabilities as data to reestimate the model, computing µ'
4. Let µ = µ' and repeat until no significant improvement (convergence)
Iterative hill-climbing: local maxima. EM property: P_µ'(O) ≥ P_µ(O)
NLP statistical parsing 53
SCFG in CNF
Learning the models. Unsupervised approach.
Inside/Outside algorithm:
Input: a set of training examples (unparsed sentences) and a CFG G.
Initialization: choose initial parameters P0(A → α) for each rule in the grammar (randomly, or from a small labelled corpus using MLE), such that for every nonterminal A: Σ_{α: (A→α) ∈ P_G} P(A → α) = 1
Expectation: compute the posterior probability of each annotated rule and position in each training set tree T.
Maximization: use these probabilities as weighted observations to update the rule probabilities.
NLP statistical parsing 54
SCFG in CNF
Inside/Outside algorithm:
For each training sentence w, we compute the inside and outside probabilities. Multiplying them:
O_i(p,q) I_i(p,q) = P(A_1 ⇒* w_1 ... w_n, A_i ⇒* w_p ... w_q | G) = P(w_{1n}, A^i_{pq} | G)
So the estimate of A_i being used in the derivation is:
E(A_i used in the derivation) = Σ_{p=1}^{n} Σ_{q=p}^{n} O_i(p,q) I_i(p,q) / I_1(1,n)
NLP statistical parsing 55
SCFG in CNF
Inside/Outside algorithm:
The estimate of A_i → A_r A_s being used in the derivation:
E(A_i → A_r A_s) = Σ_{p=1}^{n−1} Σ_{q=p+1}^{n} Σ_{d=p}^{q−1} O_i(p,q) B_{i,r,s} I_r(p,d) I_s(d+1,q) / I_1(1,n)
For unary rules, the estimate of A_i → w_m being used:
E(A_i → w_m) = Σ_{h=1, w_h = w_m}^{n} O_i(h,h) I_i(h,h) / I_1(1,n)
And we can reestimate P(A_i → A_r A_s) and P(A_i → w_m):
P(A_i → A_r A_s) = E(A_i → A_r A_s) / E(A_i used)
P(A_i → w_m) = E(A_i → w_m) / E(A_i used)
NLP statistical parsing 56
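A sketch of these expected counts for a single sentence, computed from the inside and outside tables of the earlier sketches; the reestimated probabilities are then binary[i, r, s] / used[i] and unary[i, w] / used[i] (with the counts summed over all training sentences before dividing, as the next slide notes).

```python
from collections import Counter

def expected_counts(words, nonterminals, B, U, I, O, axiom=1):
    """Expected rule counts for one sentence, from inside (I) and outside (O) tables."""
    n = len(words)
    Z = I[axiom, 1, n]                            # P(w_1..n | G) = I_1(1, n)
    used = {i: sum(O[i, p, q] * I[i, p, q]
                   for p in range(1, n + 1)
                   for q in range(p, n + 1)) / Z
            for i in nonterminals}
    binary, unary = Counter(), Counter()
    for (i, r, s), b in B.items():                # E(A_i -> A_r A_s)
        binary[i, r, s] = sum(O[i, p, q] * b * I[r, p, d] * I[s, d + 1, q]
                              for p in range(1, n)
                              for q in range(p + 1, n + 1)
                              for d in range(p, q)) / Z
    for (i, w) in U:                              # E(A_i -> w)
        unary[i, w] = sum(O[i, h, h] * I[i, h, h]
                          for h in range(1, n + 1) if words[h - 1] == w) / Z
    return used, binary, unary
```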
SCFG in CNF
Inside/Outside algorithm:
Assuming independence of the sentences in the training corpus, we sum the contributions from multiple sentences in the reestimation process.
We reestimate the values of P(A_p → A_q A_r) and P(A_p → w_m), and from them the new values of B_{p,q,r} and U_{p,m}.
The I-O algorithm iterates this process of parameter reestimation until the change in the estimated probability is small:
P(W | G_{i+1}) ≥ P(W | G_i)
NLP statistical parsing 57
SCFG
Pros and cons of SCFGs
Give some idea of the probability of a parse, but not a very good one.
CFGs cannot be learned without negative examples; SCFGs can.
SCFGs provide a LM for a language.
In practice SCFGs provide a worse LM than an n-gram model (n > 1).
The same rule set yields the same probability for different bracketings:
P([N [N toy] [N [N coffee] [N grinder]]]) = P([N [N [N cat] [N food]] [N tin]])
Contextual preferences are missed: P(NP → Pro) is higher in subject position than in object position.
NLP statistical parsing 58
SCFG
Pros and cons of SCFGs
Robust.
Possibility of combining SCFGs with 3-grams.
SCFGs assign a lot of probability mass to short sentences (a small tree is more probable than a big one).
Parameter estimation (probabilities): sparseness problem; volume of data required.
NLP statistical parsing 59
Statistical parsing
Grammatical induction from corpora.
Goal: parsing of unrestricted texts with a reasonable level of accuracy (>90%) and efficiency.
Requirements:
POS-tagged corpora: Brown, LOB, Clic-Talp
Parsed corpora: Penn Treebank, Susanne, AnCora
NLP statistical parsing 60
Treebank grammars Penn Treebank = 50,000 sentences with associated trees Usual set-up: 40,000 training sentences, 2400 test sentences NLP statistical parsing 61
Treebank grammars
Grammars directly derived from a treebank (Charniak, 1996)
Using the PTB (47,000 sentences): navigating the PTB, each local subtree provides the left-hand and right-hand side of a rule.
Around 17,500 rules.
Precision and recall around 80%.
NLP statistical parsing 62
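A sketch of this rule extraction, assuming trees are given as Penn-Treebank-style bracketed strings; read_tree converts them into the (label, [children]) format, and the mle_scfg sketch above then yields the treebank grammar with MLE probabilities.

```python
import re

def read_tree(s):
    """Parse a Penn-Treebank-style bracketed string into (label, [children])."""
    tokens = re.findall(r"\(|\)|[^\s()]+", s)
    pos = 0
    def parse():
        nonlocal pos
        if tokens[pos] == "(":
            pos += 1
            label = tokens[pos]; pos += 1
            children = []
            while tokens[pos] != ")":
                children.append(parse())
            pos += 1                              # consume ")"
            return (label, children)
        tok = tokens[pos]; pos += 1               # a leaf word
        return tok
    return parse()

# Each local subtree of each tree contributes one rule to the grammar
grammar = mle_scfg([read_tree("(S (NP (DT the) (NN cat)) (VP (VBZ eats)))")])
print(grammar[("S", ("NP", "VP"))])  # 1.0
```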
Treebank grammars
Learning Treebank Grammars: Σ_j P(N^i → ζ_j | N^i) = 1
NLP statistical parsing 63
Treebank grammars Supervised learning MLE NLP statistical parsing 64
Treebank grammars
Proposals for transformation of the obtained PTB grammar (Sekine, 1997; Sekine & Grishman, 1995):
Treebank grammar compaction.
The raw grammar lacks generalization ability, its size grows continuously, and most induced rules have low frequency.
(Krotov et al., 1999; Krotov, 1998; Gaizauskas, 1995)
NLP statistical parsing 65
Treebank grammars
Treebank grammar compaction
Partial bracketing: e.g. NP → DT NN CC DT NN can be generated from NP → NP CC NP and NP → DT NN.
Redundancy removal: some rules can be generated from others.
NLP statistical parsing 66
Treebank grammars
Removing non-linguistically-valid rules:
Assign probabilities (MLE) to the initial rules.
Remove a rule unless the probability of the structure built from its application is greater than the probability of building the structure by applying simpler rules.
Thresholding: removing rules occurring fewer than n times.

              Full      Simply       Fully      Linguistically        Linguistically
              grammar   thresholded  compacted  compacted Grammar 1   compacted Grammar 2
Recall        70.55     70.78        30.93      71.55                 70.76
Precision     77.89     77.66        19.18      72.19                 77.21
Grammar size  15,421    7,278        1,122      4,820                 6,417
NLP statistical parsing 67
Treebank grammars
Applying compaction: from 17,529 rules down to 1,667 rules.
(Figure: number of rules of the compacted grammar, #rules on the y-axis (up to ~2,000), against the percentage of the corpus used, 10% to 100% on the x-axis.)
NLP statistical parsing 68