Statistical Methods for NLP
Stochastic Grammars

Joakim Nivre
Uppsala University, Department of Linguistics and Philology
joakim.nivre@lingfil.uu.se
Structured Classification

In many NLP tasks, the output (and input) is structured:
- Part-of-speech tagging
  - Input: sequence X_1, ..., X_n of words
  - Output: sequence Y_1, ..., Y_n of tags
- Syntactic parsing
  - Input: sequence X_1, ..., X_n of words
  - Output: parse tree Y consisting of nodes, edges and labels

Models for structured classification:
- Sequence models (lecture 5)
- Stochastic grammars (today)
Syntactic Parsing

Given a word sequence w_1, ..., w_n, determine the corresponding syntactic analysis Y.

Probabilistic view of the problem:

  f(w_1, ..., w_n) = argmax_Y P(Y | w_1, ..., w_n)
                   = argmax_Y P(Y, w_1, ..., w_n) / P(w_1, ..., w_n)
                   = argmax_Y P(Y, w_1, ..., w_n)
                   = argmax_Y P(Y) P(w_1, ..., w_n | Y)

We will assume that Y is a context-free parse tree, but the same reasoning applies to any choice of syntactic analysis.
Context-Free Parse Trees

Since a context-free parse tree T for the string w_1, ..., w_n includes the string itself, it follows that:

  P(w_1, ..., w_n | T) = 1 if yield(T) = w_1, ..., w_n
                         0 otherwise

Hence, if we restrict attention to trees with the right yield, we can simply search for the most probable tree T:

  argmax_T P(T)
Probabilistic Context-Free Grammar

A PCFG G is a 5-tuple G = (Σ, N, S, R, D):
- Σ is a finite (terminal) alphabet.
- N is a finite (non-terminal) alphabet.
- S ∈ N is the start symbol.
- R is a finite set of rules A → α (A ∈ N, α ∈ (Σ ∪ N)*).
- D is a function from R to the real numbers in [0, 1] such that:

    for all A ∈ N:  Σ_{α : A → α ∈ R} D(A → α) = 1
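As a concrete illustration, here is a minimal sketch of how a PCFG's rule distribution D can be represented and checked in Python. The representation and names are my own, not from the slides; the grammar fragment is taken from the example on the next slide.

```python
pcfg = {
    "S":  {("NP", "VP", "PU"): 1.00},
    "VP": {("VP", "PP"): 0.33, ("VBD", "NP"): 0.67},
    "NP": {("NP", "PP"): 0.14, ("JJ", "NN"): 0.57, ("JJ", "NNS"): 0.29},
}

def is_proper(rules, tol=1e-9):
    """Check that D defines a probability distribution over the expansions
    of every nonterminal A, i.e. the sum over alpha of D(A -> alpha) is 1."""
    return all(abs(sum(dist.values()) - 1.0) < tol for dist in rules.values())

print(is_proper(pcfg))  # True: each left-hand side's probabilities sum to 1
```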
Example: Grammar

  S   → NP VP PU  1.00        JJ  → Economic   0.33
  VP  → VP PP     0.33        JJ  → little     0.33
  VP  → VBD NP    0.67        JJ  → financial  0.33
  NP  → NP PP     0.14        NN  → news       0.50
  NP  → JJ NN     0.57        NN  → effect     0.50
  NP  → JJ NNS    0.29        NNS → markets    1.00
  PP  → IN NP     1.00        VBD → had        1.00
  PU  → .         1.00        IN  → on         1.00

[Figure: two parse trees for "Economic news had little effect on financial markets .",
one attaching the PP "on financial markets" inside the object NP (NP → NP PP),
one attaching it to the verb phrase (VP → VP PP).]
Independence Assumptions

The probability of a rule A → α represents the probability of using the rule to expand a node labeled A:

  D(A → α) = P(α | A)

The probability of using the rule in a derivation S ⇒* w_1, ..., w_n is independent of anything before or after:

  P(A → α | S ⇒* βAγ, βαγ ⇒* w_1, ..., w_n) = D(A → α)
Probabilities for Parse Trees and Strings

The probability of a parse tree is the product of the probabilities of all its independent subtrees:

  P(T) = Π_{t(A,α) ∈ T} D(A → α)

where t(A, α) ∈ T signifies that T contains a local tree with root labeled A and children labeled α.

The probability of a string is the sum of the probabilities of all its parse trees:

  P(w_1, ..., w_n) = Σ_{T : yield(T) = w_1, ..., w_n} P(T)
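The product over local trees is easy to compute recursively. A minimal sketch of my own, assuming a nested-tuple tree representation (label, child_1, ..., child_k) with words as plain strings:

```python
def tree_prob(tree, D):
    """Product of D(A -> alpha) over all local trees t(A, alpha) in `tree`."""
    if isinstance(tree, str):               # a word contributes no rule
        return 1.0
    label, children = tree[0], tree[1:]
    alpha = tuple(c if isinstance(c, str) else c[0] for c in children)
    p = D[(label, alpha)]                   # D(A -> alpha) for this local tree
    for child in children:
        p *= tree_prob(child, D)
    return p

# Usage on a small fragment of the example grammar:
D = {("NP", ("JJ", "NN")): 0.57, ("JJ", ("little",)): 0.33, ("NN", ("effect",)): 0.50}
t = ("NP", ("JJ", "little"), ("NN", "effect"))
print(tree_prob(t, D))  # 0.57 * 0.33 * 0.50 = 0.09405
```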
Example

With the grammar from the previous slide, the sentence "Economic news had little effect on financial markets ." has two parse trees:

- PP attached inside the object NP (using NP → NP PP): probability 0.0000794
- PP attached to the verb phrase (using VP → VP PP): probability 0.0001871

The string probability is the sum over both trees:

  P(Economic news had little effect on financial markets .) = 0.0002665

[Figure: the two parse trees, annotated with their probabilities.]
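These numbers can be checked by multiplying out the rule probabilities (a quick arithmetic sketch of my own; rules with probability 1.00 are omitted from the products):

```python
# Both trees share the same lexical rules (three JJ and two NN expansions)
# and differ only in where the PP attaches.
lexical = 0.33**3 * 0.50**2                                # JJ x3, NN x2
p_np_attach = 0.67 * 0.57 * 0.14 * 0.57 * 0.29 * lexical   # uses NP -> NP PP
p_vp_attach = 0.33 * 0.67 * 0.57 * 0.57 * 0.29 * lexical   # uses VP -> VP PP
print(p_np_attach, p_vp_attach, p_np_attach + p_vp_attach)
# ~0.0000794  ~0.0001872  ~0.0002665 (matching the slide, up to rounding)
```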
Inference (Parsing)

Inference problem: finding the most probable tree T for a string w_1, ..., w_n.

Difficulty: maximizing over all possible trees, and the number of trees grows exponentially.

Key observation: a solution of size n contains solutions of smaller sizes. For a binarized grammar:

  max_T P(T, w_{1,n}) = max_{i, T', T''} P(T', w_{1,i}) · P(T'', w_{i+1,n}) · D(r(T) → r(T') r(T''))

where r(·) denotes the root label of a tree and T is the tree combining T' and T''.

Sound familiar? Dynamic programming algorithms are applicable: probabilistic versions of algorithms like CKY and Earley.
Probabilistic CKY

  PARSE(G, w_1, ..., w_n)
    for j from 1 to n do
      for all A : A → a ∈ R_G and a = w_j do
        C[j-1, j, A] := D_G(A → a)
      for i from j-2 downto 0 do
        for k from i+1 to j-1 do
          for all A : A → B C ∈ R_G and C[i, k, B] > 0 and C[k, j, C] > 0 do
            if C[i, j, A] < D_G(A → B C) · C[i, k, B] · C[k, j, C] then
              C[i, j, A] := D_G(A → B C) · C[i, k, B] · C[k, j, C]
              B[i, j, A] := {k, B, C}
    return BUILD-TREE(B[0, n, S]), C[0, n, S]
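For concreteness, here is a runnable Python rendering of the algorithm. This is my own sketch: the data structures and names are illustrative, the grammar is assumed to be in Chomsky normal form, and the toy grammar is adapted from the example (S → NP VP is a simplification, since CNF disallows the ternary rule S → NP VP PU).

```python
def cky_parse(words, lexical, binary, start="S"):
    """Return (best_prob, best_tree) for `words`, or (0.0, None) if no parse."""
    n = len(words)
    C = {}  # C[(i, j, A)] = best probability of A spanning words[i:j]
    B = {}  # backpointers: (k, X, Y) for binary rules, the word for lexical ones
    for j in range(1, n + 1):
        for A, p in lexical.get(words[j - 1], []):
            C[(j - 1, j, A)], B[(j - 1, j, A)] = p, words[j - 1]
        for i in range(j - 2, -1, -1):
            for k in range(i + 1, j):
                for (A, (X, Y)), p in binary.items():
                    cand = p * C.get((i, k, X), 0.0) * C.get((k, j, Y), 0.0)
                    if cand > C.get((i, j, A), 0.0):
                        C[(i, j, A)], B[(i, j, A)] = cand, (k, X, Y)

    def build(i, j, A):            # BUILD-TREE: follow backpointers recursively
        bp = B[(i, j, A)]
        if isinstance(bp, str):
            return (A, bp)
        k, X, Y = bp
        return (A, build(i, k, X), build(k, j, Y))

    if (0, n, start) not in C:
        return 0.0, None
    return C[(0, n, start)], build(0, n, start)

# A CNF fragment adapted from the example grammar:
lexical = {"Economic": [("JJ", 0.33)], "news": [("NN", 0.50)], "had": [("VBD", 1.0)],
           "little": [("JJ", 0.33)], "effect": [("NN", 0.50)]}
binary = {("S", ("NP", "VP")): 1.0, ("NP", ("JJ", "NN")): 0.57,
          ("VP", ("VBD", "NP")): 0.67}
print(cky_parse("Economic news had little effect".split(), lexical, binary))
# (~0.0059, ('S', ('NP', ('JJ', 'Economic'), ('NN', 'news')), ('VP', ...)))
```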
Learning

Two parts:
- Learn CFG G = (Σ, N, S, R)
- Learn rule probability distribution D

Supervised learning of G and D:
- Treebank grammars (more in a minute)

Unsupervised learning of D given G:
- Expectation-Maximization (EM)
  - Guess an initial distribution D_0
  - Iteratively improve the distribution D_i until convergence:
    - E-step: compute expected rule frequencies E_i given D_i
    - M-step: compute D_{i+1} to maximize probability given E_i
- The Inside-Outside algorithm for computing the expectations is similar to the Forward-Backward algorithm for HMMs (sketched below)
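The core building block of the E-step is the inside probability, the grammar analogue of the HMM forward probability. A minimal sketch of my own, using the same CNF grammar format as the CKY example; only the inside pass is shown, not the full Inside-Outside computation:

```python
def inside_probs(words, lexical, binary):
    """inside[(i, j, A)] = P(A =>* words[i:j]), summed over all derivations."""
    n = len(words)
    inside = {}
    for j in range(1, n + 1):
        for A, p in lexical.get(words[j - 1], []):
            inside[(j - 1, j, A)] = p
        for i in range(j - 2, -1, -1):
            for k in range(i + 1, j):
                for (A, (X, Y)), p in binary.items():
                    contrib = p * inside.get((i, k, X), 0.0) * inside.get((k, j, Y), 0.0)
                    # Sum (where CKY takes a max) over all ways of building A over [i, j]
                    inside[(i, j, A)] = inside.get((i, j, A), 0.0) + contrib
    return inside   # inside[(0, n, S)] is the string probability P(w_1, ..., w_n)
```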
Treebank Grammar

Training set: treebank T = {T_1, ..., T_m}

Extract grammar G = (Σ, N, S, R):
- Σ = the set of all terminals occurring in some T_i ∈ T
- N = the set of all nonterminals occurring in some T_i ∈ T
- S = the nonterminal at the root of every T_i ∈ T
- R = the set of all rules needed to derive some T_i ∈ T

Estimate D using relative frequencies (MLE):

  D(A → α) = Σ_{i=1}^m C(A → α, T_i) / Σ_{i=1}^m Σ_{β : A → β ∈ R} C(A → β, T_i)
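Relative-frequency estimation is a few lines of code. A sketch assuming the nested-tuple tree representation used earlier; the function names and the toy treebank are my own:

```python
from collections import Counter, defaultdict

def count_rules(tree, counts):
    """Add every local tree (A -> alpha) in `tree` to `counts`."""
    if isinstance(tree, str):
        return
    label, children = tree[0], tree[1:]
    counts[(label, tuple(c if isinstance(c, str) else c[0] for c in children))] += 1
    for child in children:
        count_rules(child, counts)

def estimate(treebank):
    """MLE: D(A -> alpha) = C(A -> alpha) / sum over beta of C(A -> beta)."""
    counts = Counter()
    for tree in treebank:
        count_rules(tree, counts)
    totals = defaultdict(float)
    for (lhs, _), c in counts.items():
        totals[lhs] += c
    return {rule: c / totals[rule[0]] for rule, c in counts.items()}

treebank = [("NP", ("JJ", "little"), ("NN", "effect")),
            ("NP", ("JJ", "financial"), ("NNS", "markets"))]
print(estimate(treebank))  # e.g. D(NP -> JJ NN) = 0.5, D(NN -> effect) = 1.0
```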
Example: Treebank Grammar

Treating the two parse trees from the earlier grammar example as a two-tree treebank and applying relative-frequency estimation yields exactly the grammar shown there: for example, VP → VP PP occurs in 1 of the 3 VP expansions (D = 0.33) and VP → VBD NP in 2 of them (D = 0.67).

[Figure: the grammar and the two parse trees, repeated from the earlier example.]
Pros and Cons of Treebank Grammars

Pros:
- Guaranteed to produce a consistent probability model
- Learning is simple, efficient and well understood (MLE)
- Inference is simple and (relatively) efficient

Cons:
- Not guaranteed to be robust: parsing new sentences may require rules not seen in the treebank
- Not optimal for disambiguation: the treebank annotation may not fit the independence assumptions enforced by the PCFG model
Example: NP Expansions in the Penn Treebank

  Tree context    NP PP   DT NN   PRP
  Anywhere          11%      9%    6%
  NP under S         9%      9%   21%
  NP under VP       23%      7%    4%

- Pronouns (PRP) are more frequent under S (subjects)
- Prepositional modifiers are more frequent under VP (objects)
PCFG Transformations

- Early research on statistical parsing abandoned PCFGs in favor of richer history-based models
- More recent research has shown that the same effect can be achieved by transforming PCFGs (or treebanks)

Three common techniques:
- Markovization
- Parent annotation
- Lexicalization
Markovization

Idea:
- Replace an n-ary rule by a set of unary and binary rules
- Encode a Markov process in new nonterminals

Example: VP → VB NP PP becomes

  VP → VP:NP_PP
  VP:NP_PP → VP:VB_NP PP
  VP:VB_NP → VP:VB NP
  VP:VB → VB

Benefits:
- Reduces the number of unique rules
- Improves robustness
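The transformation is mechanical. A sketch of my own that reproduces the rules above, where h is the horizontal markovization order (how many sibling labels the new nonterminals remember; h = 2 matches the example):

```python
def markovize(lhs, rhs, h=2):
    """Binarize lhs -> rhs, keeping at most h sibling labels in new names."""
    seen = [rhs[0]]
    name = lambda: f"{lhs}:{'_'.join(seen[-h:])}"
    rules = [(name(), [rhs[0]])]                    # e.g. VP:VB -> VB
    for child in rhs[1:]:
        prev = name()
        seen.append(child)
        rules.append((name(), [prev, child]))       # e.g. VP:VB_NP -> VP:VB NP
    rules.append((lhs, [name()]))                   # e.g. VP -> VP:NP_PP
    return rules

for rule in markovize("VP", ["VB", "NP", "PP"]):
    print(rule)
# ('VP:VB', ['VB'])
# ('VP:VB_NP', ['VP:VB', 'NP'])
# ('VP:NP_PP', ['VP:VB_NP', 'PP'])
# ('VP', ['VP:NP_PP'])
```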
Parent Annotation

Idea: Replace nonterminal A with AˆB when A is a child of B.

Example:

  [Figure: the parse tree for "Economic news had little effect on financial markets ."
   with parent-annotated labels SˆROOT, NPˆS, VPˆS, NPˆVP, NPˆNP, PPˆNP, NPˆPP;
   preterminals like JJ, NN, VBD are left unannotated.]

Benefit: Differentiates structural contexts (e.g. NPˆS vs. NPˆVP).
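A sketch of my own on nested-tuple trees; the choice to leave preterminals unannotated follows the example above:

```python
def parent_annotate(tree, parent="ROOT"):
    """Replace each phrasal label A with A^B, where B is the parent's label."""
    if isinstance(tree, str):
        return tree
    label, children = tree[0], tree[1:]
    if len(children) == 1 and isinstance(children[0], str):
        return tree                      # preterminal: leave unannotated
    return (f"{label}^{parent}",) + tuple(parent_annotate(c, label) for c in children)

t = ("S", ("NP", ("JJ", "Economic"), ("NN", "news")), ("VP", ("VBD", "had")))
print(parent_annotate(t))
# ('S^ROOT', ('NP^S', ('JJ', 'Economic'), ('NN', 'news')), ('VP^S', ('VBD', 'had')))
```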
Horizontal and Vertical Markovization

- Markovization is often called horizontal markovization
  - Conditioning history over siblings
  - Standard PCFG: infinite-order (any number of siblings)
  - Example above: second-order (at most two siblings)
- Parent annotation can be seen as vertical markovization
  - Conditioning history over descendants
  - Standard PCFG: first-order (only the parent)
  - Example above: second-order (the grandparent as well)
- Many different combinations are possible
Lexicalization

Idea: Index nonterminals by lexical heads (terminals)

Example:

  VP → VBD NP   becomes   VP(had) → VBD(had) NP(effect)

Consequences:
- Increases sensitivity to lexical properties (good)
- Increases the size of the grammar drastically (bad)
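A sketch of how heads can be percolated up a tree. This is entirely illustrative: the head-child table below is a hypothetical stand-in for a real head-finding table, and the tree format is the nested-tuple one used earlier:

```python
HEAD_CHILD = {"S": "VP", "VP": "VBD", "NP": "NN", "PP": "IN"}  # assumed head table

def lexicalize(tree):
    """Return (head-annotated tree, head word)."""
    if isinstance(tree[1], str):                     # preterminal: head is its word
        return (f"{tree[0]}({tree[1]})", tree[1]), tree[1]
    results = [lexicalize(c) for c in tree[1:]]
    labels = [c[0] for c in tree[1:]]                # original child labels
    wanted = HEAD_CHILD.get(tree[0])
    idx = labels.index(wanted) if wanted in labels else 0
    head = results[idx][1]
    return (f"{tree[0]}({head})",) + tuple(r[0] for r in results), head

t = ("VP", ("VBD", "had"), ("NP", ("JJ", "little"), ("NN", "effect")))
print(lexicalize(t)[0])
# ('VP(had)', ('VBD(had)', 'had'),
#  ('NP(effect)', ('JJ(little)', 'little'), ('NN(effect)', 'effect')))
```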
The State of the Art

PCFG models:
- Markovization:
  - Limited horizontal markovization
  - Extended vertical markovization
- Fine-grained nonterminals:
  - Completely or partially lexicalized models
  - Latent variable models that learn splits using EM

Alternative models:
- Discriminative (log-linear) models for (re)ranking
- Dependency parsing (graph-based, transition-based)