Advanced Natural Language Processing Syntactic Parsing Alicia Ageno ageno@cs.upc.edu Universitat Politècnica de Catalunya NLP statistical parsing 1
Parsing Review Statistical Parsing SCFG Inside Algorithm Outside Algorithm Viterbi Algorithm Learning models Grammar acquisition: Grammatical induction NLP statistical parsing 2
Parsing
Parsing: recognising higher-level units of structure that allow us to compress our description of a sentence.
Goal of syntactic analysis (parsing):
Detect whether a sentence is correct
Provide a syntactic structure for the sentence
Parsing is the task of uncovering the syntactic structure of language and is often viewed as an important prerequisite for building systems capable of understanding language.
Syntactic structure is needed as a first step towards semantic interpretation, for detecting phrasal chunks for indexing in an IR system, etc.
NLP statistical parsing 3
Parsing A syntactic tree NLP statistical parsing 4
Parsing Another syntactic tree NLP statistical parsing 5
Parsing A dependency tree NLP statistical parsing 6
Parsing A real sentence NLP statistical parsing 7
Parsing Theories of Syntactic Structure Constituent trees Dependency trees NLP statistical parsing 8
Parsing Factors in parsing Grammar expressivity Coverage Involved Knowledge Sources Parsing strategy Parsing direction Production application order Ambiguity management NLP statistical parsing 9
Parsing Parsers today CFG (extended or not) Tabular Charts LR Unification-based Statistical Dependency parsing Robust parsing (shallow, fragmental, chunkers, spotters) NLP statistical parsing 10
Parsing Context Free Grammars (CFGs) NLP statistical parsing 11
Parsing Context Free Grammars, example NLP statistical parsing 12
Parsing Properties of CFGs NLP statistical parsing 13
Parsing I was on the hill that has a telescope when I saw a man. I saw a man who was on a hill and who had a telescope. I saw a man who was on the hill that has a telescope on it. Using a telescope, I saw a man who was on a hill. I was on the hill when I used the telescope to see a man.... I saw the man on the hill with the telescope Me See A man The telescope The hill NLP statistical parsing 14
Parsing Chomsky Normal Form (CNF) NLP statistical parsing 15
Parsing Tabular Methods Dynamic programming CFG CKY (Cocke, Kasami, Younger,1967) Grammar in CNF Earley 1969 Extensible to unification, probabilistic, etc... NLP statistical parsing 16
Parsing
Parsing as searching in a search space:
Characterizing the states and, if possible, enumerating them
Defining the initial state(s)
Defining, if possible, the final states or the condition for reaching one of them
NLP statistical parsing 17
Tabular methods: CKY
General parsing schema (Sikkel 97): ⟨X, H, D⟩ where
X: domain, the set of items
H ⊆ X: the set of hypotheses
D: the set of deductive steps
V(D) ⊆ X: the set of valid items
NLP statistical parsing 18
Tabular methods: CKY
G = ⟨N, Σ, P, S⟩, G in CNF, w = a_1 ... a_n
CKY as a parsing schema ⟨X, H, D⟩:
X = {[A, i, j] | 1 ≤ i ≤ j, A ∈ N_G}  (domain, set of items)
H = {[A, j, j] | A → a_j ∈ P_G, 1 ≤ j ≤ n}  (set of hypotheses)
D = {[B, i, j], [C, j+1, k] ⊢ [A, i, k] | A → BC ∈ P_G, 1 ≤ i ≤ j < k}  (set of deductive steps)
V(D) = {[A, i, j] | A ⇒* a_i ... a_j}  (set of valid items)
NLP statistical parsing 19
Tabular methods: CKY
CKY: spatial cost O(n²), temporal cost O(n³), grammar in CNF.
Bottom-up strategy: dynamically build the parsing table t_{j,i}
rows j: width of each constituent, 1 ≤ j ≤ |w| − i + 1
columns i: initial position of each constituent, 1 ≤ i ≤ |w|
where w = a_1 ... a_n is the input string, |w| = n
NLP statistical parsing 20
Tabular methods: CKY
(Figure: cell t_{j,i} contains A, with B and C covering two adjacent segments of a_i ... a_{i+j−1}, where A → BC is a binary production of the grammar.)
NLP statistical parsing 21
Tabular methods: CKY
That A is in cell t_{j,i} means that from A the text fragment a_i ... a_{i+j−1} (the string of length j starting at the i-th position) can be derived.
The grammaticality condition is that the initial symbol of the grammar, S, satisfies S ∈ t_{|w|,1}.
NLP statistical parsing 22
Tabular methods: CKY
The table is built bottom-up.
Base case: row 1 is built using only the unary (lexical) rules of the grammar:
j = 1: t_{1,i} = {A | [A → a_i] ∈ P}
Recursive case: rows j = 2, ... are built. The key of the algorithm is that when row j is built, all the previous rows (from 1 to j−1) are already available:
j > 1: t_{j,i} = {A | ∃k, 1 ≤ k < j, [A → BC] ∈ P, B ∈ t_{k,i}, C ∈ t_{j−k,i+k}}
NLP statistical parsing 23
Tabular methods: CKY
1. Add the lexical edges: t[1,i] for each word a_i
2. for j = 2 to n:
     for i = 1 to n−j+1:
       for k = 1 to j−1:
         if A → BC ∈ P and B ∈ t[k,i] and C ∈ t[j−k,i+k]:
           add A to t[j,i]
3. If S ∈ t[n,1], return the corresponding parse
NLP statistical parsing 24
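As an illustration, here is a minimal Python sketch of this recognizer. It assumes the CNF grammar is given as two plain dictionaries (lexical rules mapping a word to the categories that derive it, and binary rules mapping a pair (B, C) to the categories A with A → BC); the names cky_table and recognize are illustrative, not from the slides.

```python
from collections import defaultdict

def cky_table(words, lexical, binary):
    """Build the CKY table t[j, i]: categories deriving the span of
    length j that starts at position i (1-based, as in the slides)."""
    n = len(words)
    t = defaultdict(set)
    # Base case (row 1): lexical edges
    for i, w in enumerate(words, start=1):
        t[1, i] = set(lexical.get(w, ()))
    # Recursive case: rows of increasing width j
    for j in range(2, n + 1):
        for i in range(1, n - j + 2):
            for k in range(1, j):
                for B in t[k, i]:
                    for C in t[j - k, i + k]:
                        t[j, i] |= binary.get((B, C), set())
    return t

def recognize(words, lexical, binary, start="S"):
    """Grammaticality condition: the axiom derives the whole string."""
    return start in cky_table(words, lexical, binary)[len(words), 1]
```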
Tabular methods: CKY
Example grammar:
sentence → NP VP
NP → A B
VP → C NP
A → det
B → n
NP → n
VP → vi
C → vt
Parse the sentence: the cat eats fish
the (det)  cat (n)  eats (vt, vi)  fish (n)
NLP statistical parsing 25
Tabular methods: CKY
j=4:  t[4,1] the cat eats fish: sentence
j=3:  t[3,1] the cat eats: sentence    t[3,2] cat eats fish: sentence
j=2:  t[2,1] the cat: NP    t[2,2] cat eats: sentence    t[2,3] eats fish: VP
j=1:  t[1,1] the (det): A    t[1,2] cat (n): B, NP    t[1,3] eats (vt, vi): C, VP    t[1,4] fish (n): B, NP
NLP statistical parsing 26
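With the toy grammar above encoded in that dictionary form (a hypothetical encoding, with the lexicon folded directly into the word-to-category map), the sketch reproduces this table and accepts the sentence:

```python
lexical = {"the": {"A"}, "cat": {"B", "NP"}, "eats": {"C", "VP"}, "fish": {"B", "NP"}}
binary = {("NP", "VP"): {"sentence"}, ("A", "B"): {"NP"}, ("C", "NP"): {"VP"}}
print(recognize("the cat eats fish".split(), lexical, binary, start="sentence"))  # True
```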
Statistical parsing Introduction SCFG Inside Algorithm Outside Algorithm Viterbi Algorithm Learning models Grammar acquisition: Grammatical induction NLP statistical parsing 27
Statistical parsing
Using statistical models for:
Determining the sentence (e.g. speech recognizers): the job of the parser is to be a language model
Guiding parsing: ordering or pruning the search space; getting the most likely parse
Ambiguity resolution, e.g. PP-attachment
NLP statistical parsing 28
Statistical parsing
Lexical approaches: context-free (unigram); context-dependent (N-gram, HMM)
Syntactic approaches: SCFG (or PCFG)
Hybrid approaches: Stochastic Lexicalized TAGs
Computing the most likely (most probable) parse: Viterbi
Parameter learning:
Supervised: tagged/parsed corpora
Unsupervised: Baum-Welch (Forward-Backward) for HMMs, Inside-Outside for SCFGs
NLP statistical parsing 29
SCFG
Stochastic Context-Free Grammars (or PCFGs):
Associate a probability to each rule
Associate a probability to each lexical entry
Frequent restriction, CNF:
Binary rules A_p → A_q A_r: matrix B_{p,q,r}
Unary rules A_p → b_m: matrix U_{p,m}
NLP statistical parsing 30
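For concreteness, here is a toy CNF SCFG in this matrix-as-dictionary encoding; the grammar itself is invented for illustration, and later sketches reuse this B/U representation.

```python
# Nonterminal indices: 1 = S (the axiom), 2 = NP, 3 = VP, 4 = V
B = {            # B[p, q, r] = P(A_p -> A_q A_r)
    (1, 2, 3): 1.0,    # S  -> NP VP
    (3, 4, 2): 0.7,    # VP -> V NP
}
U = {            # U[p, m] = P(A_p -> b_m), indexed here directly by the word
    (2, "fish"): 0.6, (2, "cats"): 0.4,   # NP -> fish | cats
    (3, "sleep"): 0.3,                    # VP -> sleep
    (4, "eat"): 1.0,                      # V  -> eat
}
# For each p, the binary and unary probabilities sum to 1 (e.g. for VP: 0.7 + 0.3)
```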
SCFG NLP statistical parsing 31
SCFG NLP statistical parsing 32
SCFG NLP statistical parsing 33
Parsing SCFG
Starting from a CFG G, an SCFG: for each rule (A → α) ∈ P_G we should be able to define a probability P(A → α), such that for every nonterminal A:
Σ_{α: (A→α) ∈ P_G} P(A → α) = 1
Probability of a tree τ:
P(τ) = ∏_{(A→α) ∈ P_G} P(A → α)^{f(A→α; τ)}
where f(A→α; τ) is the number of times the rule A → α is applied in τ.
NLP statistical parsing 34
Parsing SCFG
P(t): probability of a tree t (the product of the probabilities of the rules used to generate it).
P(w_{1n}): probability of a sentence, the sum of the probabilities of all the valid parse trees of the sentence:
P(w_{1n}) = Σ_t P(w_{1n}, t) = Σ_t P(t), where t ranges over the parses of w_{1n}
NLP statistical parsing 35
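A minimal sketch of the tree probability, assuming rule probabilities are stored in a dictionary keyed by (left-hand side, right-hand side); P(w_{1n}) would then be the sum of tree_prob over all parses of the sentence found by a chart parser.

```python
from math import prod  # Python 3.8+

def tree_prob(tree, rule_prob):
    """P(t): product of the probabilities of the rules used to generate t.
    A tree is (label, [children]); a leaf is a plain string (a word)."""
    if isinstance(tree, str):                 # leaf: no rule applied here
        return 1.0
    label, children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    return rule_prob[label, rhs] * prod(tree_prob(c, rule_prob) for c in children)

# Toy check: a sentence with a single parse tree of probability 1.0
rules = {("S", ("NP", "VP")): 1.0, ("NP", ("John",)): 1.0, ("VP", ("sleeps",)): 1.0}
print(tree_prob(("S", [("NP", ["John"]), ("VP", ["sleeps"])]), rules))  # 1.0
```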
Parsing SCFG
Positional invariance: the probability of a subtree is independent of its position in the derivation tree
Context-free: the probability of a subtree does not depend on words not dominated by the subtree
Ancestor-free: the probability of a subtree does not depend on nodes in the derivation outside the subtree
NLP statistical parsing 36
Parsing SCFG
Parameter estimation:
Supervised learning: from a treebank {τ_1, ..., τ_N} (MLE)
Unsupervised learning: Inside/Outside (EM), similar to Baum-Welch in HMMs
NLP statistical parsing 37
Parsing SCFG
Supervised learning: Maximum Likelihood Estimation (MLE)
P(A → α) = #(A → α) / #(A)
where #(A → α) = Σ_{i=1}^{N} f(A → α; τ_i)
NLP statistical parsing 38
SCFG in CNF
Learning using CNF. CNF is the most frequent approach:
Binary rules: A_p → A_q A_r, matrix B_{p,q,r}
Unary rules: A_p → b_m, matrix U_{p,m}
which, for every p, should satisfy: Σ_{q,r} B_{p,q,r} + Σ_m U_{p,m} = 1
A_1 is the axiom of the grammar.
d = derivation = sequence of rule applications from A_1 to w: A_1 = α_0 ⇒ α_1 ⇒ ... ⇒ α_{|d|} = w
p(d | G) = ∏_{k=1}^{|d|} p(α_{k−1} ⇒ α_k | G)
p(w | G) = Σ_{d: A_1 ⇒* w} p(d | G)
NLP statistical parsing 39
SCFG in CNF
(Figure: derivation tree with axiom A_1 spanning w_1 ... w_n; an internal node A_p rewritten by the binary rule A_p → A_q A_r over w_{i+1} ... w_k; a preterminal A_s rewritten by the unary rule A_s → b_m, with b_m = w_j.)
NLP statistical parsing 40
SCFG in CNF
Learning using CNF. Problems to solve (analogous to HMMs):
Probability of a string (LM): p(w_{1n} | G)
Most probable parse of a string: argmax_t p(t | w_{1n}, G)
Parameter learning: find G such that it maximizes p(w_{1n} | G)
NLP statistical parsing 41
SCFG in CNF
HMM: probability distribution over strings of a given length; for all n: Σ_{w_{1n}} P(w_{1n}) = 1
PCFG: probability distribution over the set of strings that are in the language L: Σ_{ω ∈ L} P(ω) = 1
Example: P(John decided to bake a)
NLP statistical parsing 42
SCFG in CNF
HMM: probability distribution over strings of a given length; for all n: Σ_{w_{1n}} P(w_{1n}) = 1
Forward/Backward:
Forward: α_i(t) = P(w_{1(t−1)}, X_t = i)
Backward: β_i(t) = P(w_{tT} | X_t = i)
PCFG: probability distribution over the strings of the language L: Σ_{ω ∈ L} P(ω) = 1
Inside/Outside:
Outside: O_i(p,q) = P(w_{1(p−1)}, N^i_{pq}, w_{(q+1)m} | G)
Inside: I_i(p,q) = P(w_{pq} | N^i_{pq}, G)
NLP statistical parsing 43
SCFG in CNF
(Figure: derivation tree rooted at A_1; the outside probability covers everything outside the subtree rooted at A_p, the inside probability covers the subtree A_p → A_q A_r and the span it dominates.)
NLP statistical parsing 44
SCFG in CNF
Inside probability: I_p(i,j) = P(A_p ⇒* w_i ... w_j)
This probability can be computed bottom-up, starting with the shorter constituents.
Base case: I_p(i,i) = P(A_p ⇒ w_i) = U_{p,m}  (with b_m = w_i)
Recurrence: I_p(i,k) = Σ_{q,r} Σ_{j=i}^{k−1} I_q(i,j) I_r(j+1,k) B_{p,q,r}
NLP statistical parsing 45
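A sketch of the inside computation over the B/U dictionary encoding introduced with the toy SCFG above; inside_probs is an illustrative name. The probability of the whole string is then I[1, 1, n], with A_1 the axiom.

```python
from collections import defaultdict

def inside_probs(words, nonterminals, B, U):
    """Inside algorithm for a CNF SCFG.
    B[p, q, r] = P(A_p -> A_q A_r), U[p, w] = P(A_p -> w).
    Returns I with I[p, i, j] = P(A_p =>* w_i ... w_j), 1-based spans."""
    n = len(words)
    I = defaultdict(float)
    # Base case: spans of length 1
    for i, w in enumerate(words, start=1):
        for p in nonterminals:
            I[p, i, i] = U.get((p, w), 0.0)
    # Recurrence: spans of increasing length, combining two smaller spans
    for length in range(2, n + 1):
        for i in range(1, n - length + 2):
            k = i + length - 1
            for (p, q, r), b in B.items():
                I[p, i, k] += b * sum(I[q, i, j] * I[r, j + 1, k]
                                      for j in range(i, k))
    return I

# p(w | G) for the toy grammar: 1.0 * 0.4 * (0.7 * 1.0 * 0.6)
I = inside_probs("cats eat fish".split(), {1, 2, 3, 4}, B, U)
print(I[1, 1, 3])  # 0.168
```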
SCFG in CNF
Outside probability: O_q(i,j) = P(A_1 ⇒* w_1 ... w_{i−1} A_q w_{j+1} ... w_n)
This probability can be computed top-down, starting with the widest constituents.
Base case: O_1(1,n) = 1; O_j(1,n) = 0 for j ≠ 1
Recurrence, two cases, over all the possible partitions:
O_q(i,j) = Σ_{p,r} Σ_{k=j+1}^{n} O_p(i,k) I_r(j+1,k) B_{p,q,r} + Σ_{p,r} Σ_{k=1}^{i−1} O_p(k,j) I_r(k,i−1) B_{p,r,q}
NLP statistical parsing 46
SCFG in CNF
Two splitting forms. First: A_q is the left child of A_p → A_q A_r, contributing O_p(i,k) I_r(j+1,k) B_{p,q,r}.
(Figure: A_1 dominates w_1 ... w_{i−1} A_q w_{j+1} ... w_n; A_p spans positions i..k, with A_q covering w_i ... w_j and A_r covering w_{j+1} ... w_k.)
NLP statistical parsing 47
SCFG in CNF
Second: A_q is the right child of A_p → A_r A_q, contributing O_p(k,j) I_r(k,i−1) B_{p,r,q}.
(Figure: A_p spans positions k..j, with A_r covering w_k ... w_{i−1} and A_q covering w_i ... w_j.)
NLP statistical parsing 48
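A corresponding sketch of the outside computation, given the inside table from the earlier sketch; the two inner loops implement exactly these two splitting forms of the recurrence.

```python
from collections import defaultdict

def outside_probs(words, nonterminals, B, I, axiom=1):
    """Outside algorithm for a CNF SCFG, given the inside table I.
    O[q, i, j] = P(A_1 =>* w_1 .. w_{i-1} A_q w_{j+1} .. w_n)."""
    n = len(words)
    O = defaultdict(float)
    O[axiom, 1, n] = 1.0                      # base case: only the axiom spans 1..n
    for length in range(n - 1, 0, -1):        # widest constituents first
        for i in range(1, n - length + 2):
            j = i + length - 1
            for q in nonterminals:
                total = 0.0
                for (p, left, r), b in B.items():   # A_q as left child of A_p -> A_q A_r
                    if left == q:
                        total += b * sum(O[p, i, k] * I[r, j + 1, k]
                                         for k in range(j + 1, n + 1))
                for (p, r, right), b in B.items():  # A_q as right child of A_p -> A_r A_q
                    if right == q:
                        total += b * sum(O[p, k, j] * I[r, k, i - 1]
                                         for k in range(1, i))
                O[q, i, j] = total
    return O
```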
SCFG in CNF
Viterbi: O(|G| n³)
Given a sentence w_1 ... w_n, M_p(i,j) contains the maximum probability of a derivation A_p ⇒* w_i ... w_j.
M can be computed incrementally for increasing substring lengths, by induction over the length j − i + 1.
Base case: M_p(i,i) = P(A_p ⇒ w_i) = U_{p,m}  (with b_m = w_i)
NLP statistical parsing 49
SCFG in CNF
Recurrence: consider all the ways of decomposing A_p into two constituents, updating the maximum probability:
M_p(i,j) = max_{q,r} max_{k=i}^{j−1} M_q(i,k) M_r(k+1,j) B_{p,q,r}
Recall that using sum instead of max we get the inside algorithm: p(w_{1n} | G).
(Figure: A_p → A_q A_r, with A_q spanning w_i ... w_k, of length k − i + 1, and A_r spanning w_{k+1} ... w_j, of length j − k; the whole span has length j − i + 1.)
NLP statistical parsing 50
SCFG in CNF
To get the probability of the best (most probable) derivation: M_1(1,n)
To get the best derivation tree we need to maintain not only the probability M_p(i,j) but also the split point and the two categories of the right-hand side of the rule:
Ψ_p(i,j) = argmax_{q,r,k} M_q(i,k) M_r(k+1,j) B_{p,q,r}
(Figure: A_p → A_{RHS1(p,i,j)} A_{RHS2(p,i,j)}, with the span w_i ... w_j split after w_{SPLIT(p,i,j)}.)
NLP statistical parsing 51
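A sketch of this Viterbi variant with backpointers, again over the B/U encoding; it returns M_1(1,n) together with the corresponding tree as nested tuples, or None if the sentence is not derivable.

```python
from collections import defaultdict

def viterbi_parse(words, nonterminals, B, U, axiom=1):
    """Max (Viterbi) variant of the inside computation, with backpointers."""
    n = len(words)
    M, back = defaultdict(float), {}
    for i, w in enumerate(words, start=1):
        for p in nonterminals:
            M[p, i, i] = U.get((p, w), 0.0)
    for length in range(2, n + 1):
        for i in range(1, n - length + 2):
            j = i + length - 1
            for (p, q, r), b in B.items():
                for k in range(i, j):               # all split points
                    score = b * M[q, i, k] * M[r, k + 1, j]
                    if score > M[p, i, j]:
                        M[p, i, j] = score
                        back[p, i, j] = (q, r, k)   # RHS categories and split point
    def build(p, i, j):
        if i == j:
            return (p, words[i - 1])
        q, r, k = back[p, i, j]
        return (p, build(q, i, k), build(r, k + 1, j))
    best = M[axiom, 1, n]
    return best, (build(axiom, 1, n) if best > 0 else None)
```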
SCFG in CNF
Learning the models. Supervised approach.
The parameters (probabilities, i.e. matrices B and U) are estimated from a corpus by MLE (Maximum Likelihood Estimation): the corpus is fully parsed (i.e. a set of pairs ⟨sentence, correct parse tree⟩).
B̂_{p,q,r} = p̂(A_p → A_q A_r) = E(#(A_p → A_q A_r) | G) / E(#(A_p) | G)
NLP statistical parsing 52
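A minimal sketch of this supervised estimation, counting rule occurrences in a small treebank of trees in the (label, [children]) format used earlier; the counts play the role of the expectations in the formula above.

```python
from collections import Counter

def mle_scfg(treebank):
    """MLE of rule probabilities from a set of parse trees:
    p(A -> alpha) = #(A -> alpha) / #(A)."""
    rule_counts, lhs_counts = Counter(), Counter()
    def visit(node):
        if isinstance(node, str):                 # leaf: a word, no rule
            return
        label, children = node
        rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
        rule_counts[label, rhs] += 1
        lhs_counts[label] += 1
        for c in children:
            visit(c)
    for tree in treebank:
        visit(tree)
    return {rule: count / lhs_counts[rule[0]]
            for rule, count in rule_counts.items()}
```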
SCFG in CNF
Learning the models. Unsupervised approach.
Inside/Outside algorithm: similar to Forward-Backward (Baum-Welch) for HMMs.
A particular application of the Expectation Maximization (EM) algorithm:
1. Start with an initial model µ0 (uniform, random, MLE, ...)
2. Compute the observation probability using the current model
3. Use the obtained probabilities as data to reestimate the model, computing µ'
4. Let µ = µ' and repeat until no significant improvement (convergence)
Iterative hill-climbing: local maxima. EM property: P_µ'(O) ≥ P_µ(O)
NLP statistical parsing 53
SCFG in CNF
Learning the models. Unsupervised approach.
Inside/Outside algorithm:
Input: a set of training examples (unparsed sentences) and a CFG G.
Initialization: choose initial parameters P0(A → α) for each rule in the grammar (randomly, or from a small labelled corpus using MLE), such that for every nonterminal A: Σ_{α: (A→α) ∈ P_G} P(A → α) = 1
Expectation: compute the posterior probability of each annotated rule and position in each training set tree T.
Maximization: use these probabilities as weighted observations to update the rule probabilities.
NLP statistical parsing 54
SCFG in CNF
Inside/Outside algorithm:
For each training sentence w, we compute the inside and outside probabilities. Multiplying them:
O_i(p,q) I_i(p,q) = P(A_1 ⇒* w_1 ... w_n, A_i ⇒* w_p ... w_q | G) = P(w_{1n}, A^i_{pq} | G)
So the estimate of A_i being used in the derivation is:
E(A_i used in the derivation) = Σ_{p=1}^{n} Σ_{q=p}^{n} O_i(p,q) I_i(p,q) / I_1(1,n)
NLP statistical parsing 55
SCFG in CNF
Inside/Outside algorithm:
The estimate of A_i → A_r A_s being used in the derivation:
E(A_i → A_r A_s) = Σ_{p=1}^{n−1} Σ_{q=p+1}^{n} Σ_{d=p}^{q−1} O_i(p,q) B_{i,r,s} I_r(p,d) I_s(d+1,q) / I_1(1,n)
For unary rules, the estimate of A_i → w_m being used:
E(A_i → w_m) = Σ_{h=1, w_h = w_m}^{n} O_i(h,h) I_i(h,h) / I_1(1,n)
And we can reestimate P(A_i → A_r A_s) and P(A_i → w_m):
P(A_i → A_r A_s) = E(A_i → A_r A_s) / E(A_i used)
P(A_i → w_m) = E(A_i → w_m) / E(A_i used)
NLP statistical parsing 56
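A sketch of these expected counts for a single sentence, computed from the inside and outside tables of the earlier sketches; the reestimated probabilities are then binary[i, r, s] / used[i] and unary[i, w] / used[i] (with the counts summed over all training sentences before dividing, as the next slide notes).

```python
from collections import Counter

def expected_counts(words, nonterminals, B, U, I, O, axiom=1):
    """Expected rule counts for one sentence, from inside (I) and outside (O) tables."""
    n = len(words)
    Z = I[axiom, 1, n]                            # P(w_1..n | G) = I_1(1, n)
    used = {i: sum(O[i, p, q] * I[i, p, q]
                   for p in range(1, n + 1)
                   for q in range(p, n + 1)) / Z
            for i in nonterminals}
    binary, unary = Counter(), Counter()
    for (i, r, s), b in B.items():                # E(A_i -> A_r A_s)
        binary[i, r, s] = sum(O[i, p, q] * b * I[r, p, d] * I[s, d + 1, q]
                              for p in range(1, n)
                              for q in range(p + 1, n + 1)
                              for d in range(p, q)) / Z
    for (i, w) in U:                              # E(A_i -> w)
        unary[i, w] = sum(O[i, h, h] * I[i, h, h]
                          for h in range(1, n + 1) if words[h - 1] == w) / Z
    return used, binary, unary
```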
SCFG in CNF
Inside/Outside algorithm:
Assuming independence of the sentences in the training corpus, we sum the contributions from multiple sentences in the reestimation process.
We reestimate the values of P(A_p → A_q A_r) and P(A_p → w_m), and from them the new values of B_{p,q,r} and U_{p,m}.
The I-O algorithm iterates this process of parameter reestimation until the change in the estimated probability is small:
P(W | G_{i+1}) ≥ P(W | G_i)
NLP statistical parsing 57
SCFG
Pros and cons of SCFGs
Give some idea of the probability of a parse, but not a very good one.
CFGs cannot be learned without negative examples; SCFGs can.
SCFGs provide a LM for a language.
In practice SCFGs provide a worse LM than an n-gram model (n > 1).
The same rule set yields the same probability for different bracketings:
P([N [N toy] [N [N coffee] [N grinder]]]) = P([N [N [N cat] [N food]] [N tin]])
Contextual preferences are missed: P(NP → Pro) is higher in subject position than in object position.
NLP statistical parsing 58
SCFG
Pros and cons of SCFGs
Robust.
Possibility of combining SCFGs with 3-grams.
SCFGs assign a lot of probability mass to short sentences (a small tree is more probable than a big one).
Parameter estimation (probabilities): sparseness problem; volume of data required.
NLP statistical parsing 59
Statistical parsing
Grammatical induction from corpora.
Goal: parsing of unrestricted texts with a reasonable level of accuracy (>90%) and efficiency.
Requirements:
POS-tagged corpora: Brown, LOB, Clic-Talp
Parsed corpora: Penn Treebank, Susanne, AnCora
NLP statistical parsing 60
Treebank grammars Penn Treebank = 50,000 sentences with associated trees Usual set-up: 40,000 training sentences, 2400 test sentences NLP statistical parsing 61
Treebank grammars
Grammars directly derived from a treebank (Charniak, 1996)
Using the PTB (47,000 sentences): navigating the PTB, each local subtree provides the left-hand and right-hand side of a rule.
Around 17,500 rules.
Precision and recall around 80%.
NLP statistical parsing 62
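A sketch of this rule extraction, assuming trees are given as Penn-Treebank-style bracketed strings; read_tree converts them into the (label, [children]) format, and the mle_scfg sketch above then yields the treebank grammar with MLE probabilities.

```python
import re

def read_tree(s):
    """Parse a Penn-Treebank-style bracketed string into (label, [children])."""
    tokens = re.findall(r"\(|\)|[^\s()]+", s)
    pos = 0
    def parse():
        nonlocal pos
        if tokens[pos] == "(":
            pos += 1
            label = tokens[pos]; pos += 1
            children = []
            while tokens[pos] != ")":
                children.append(parse())
            pos += 1                              # consume ")"
            return (label, children)
        tok = tokens[pos]; pos += 1               # a leaf word
        return tok
    return parse()

# Each local subtree of each tree contributes one rule to the grammar
grammar = mle_scfg([read_tree("(S (NP (DT the) (NN cat)) (VP (VBZ eats)))")])
print(grammar[("S", ("NP", "VP"))])  # 1.0
```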
Treebank grammars
Learning Treebank Grammars: Σ_j P(N^i → ζ_j | N^i) = 1
NLP statistical parsing 63
Treebank grammars Supervised learning MLE NLP statistical parsing 64
Treebank grammars
Proposals for transformation of the obtained PTB grammar (Sekine, 1997; Sekine & Grishman, 1995):
Treebank grammar compaction.
The raw grammar lacks generalization ability, its size grows continuously, and most induced rules have low frequency.
(Krotov et al., 1999; Krotov, 1998; Gaizauskas, 1995)
NLP statistical parsing 65
Treebank grammars
Treebank grammar compaction
Partial bracketing: e.g. NP → DT NN CC DT NN can be generated from NP → NP CC NP and NP → DT NN.
Redundancy removal: some rules can be generated from others.
NLP statistical parsing 66
Treebank grammars
Removing non-linguistically-valid rules:
Assign probabilities (MLE) to the initial rules.
Remove a rule unless the probability of the structure built from its application is greater than the probability of building the structure by applying simpler rules.
Thresholding: removing rules occurring fewer than n times.

              Full      Simply       Fully      Linguistically        Linguistically
              grammar   thresholded  compacted  compacted Grammar 1   compacted Grammar 2
Recall        70.55     70.78        30.93      71.55                 70.76
Precision     77.89     77.66        19.18      72.19                 77.21
Grammar size  15,421    7,278        1,122      4,820                 6,417
NLP statistical parsing 67
Treebank grammars
Applying compaction: from 17,529 rules down to 1,667 rules.
(Figure: number of rules of the compacted grammar, #rules on the y-axis (up to ~2,000), against the percentage of the corpus used, 10% to 100% on the x-axis.)
NLP statistical parsing 68