To make a grammar probabilistic, we need to assign a probability to each context-free rewrite

Similar documents
Chomsky and Greibach Normal Forms

In this chapter, we explore the parsing problem, which encompasses several questions, including:

Natural Language Processing : Probabilistic Context Free Grammars. Updated 5/09

Probabilistic Context-Free Grammar

The Inside-Outside Algorithm

Natural Language Processing CS Lecture 06. Razvan C. Bunescu School of Electrical Engineering and Computer Science

LECTURER: BURCU CAN Spring

Expectation Maximization (EM)

Advanced Natural Language Processing Syntactic Parsing

Quiz 1, COMS Name: Good luck! 4705 Quiz 1 page 1 of 7

Cross-Entropy and Estimation of Probabilistic Context-Free Grammars

Parsing. Probabilistic CFG (PCFG) Laura Kallmeyer. Winter 2017/18. Heinrich-Heine-Universität Düsseldorf 1 / 22

Computation of Substring Probabilities in Stochastic Grammars Ana L. N. Fred Instituto de Telecomunicac~oes Instituto Superior Tecnico IST-Torre Norte

A shrinking lemma for random forbidding context languages

Lecture 12: Algorithms for HMMs

STA 414/2104: Machine Learning

CS 136a Lecture 7 Speech Recognition Architecture: Training models with the Forward backward algorithm

LING 473: Day 10. START THE RECORDING Coding for Probability Hidden Markov Models Formal Grammars

RNA Secondary Structure Prediction

STA 4273H: Statistical Machine Learning

Parsing. Based on presentations from Chris Manning s course on Statistical Parsing (Stanford)

Probabilistic Context Free Grammars

PAC Generalization Bounds for Co-training

Lecture 12: Algorithms for HMMs

A* Search. 1 Dijkstra Shortest Path

Hidden Markov Models, I. Examples. Steven R. Dunbar. Toy Models. Standard Mathematical Models. Realistic Hidden Markov Models.

Parsing with Context-Free Grammars

Foundations of Informatics: a Bridging Course

DT2118 Speech and Speaker Recognition

cmp-lg/ Nov 1994

Probabilistic Context-free Grammars

Part of Speech Tagging: Viterbi, Forward, Backward, Forward- Backward, Baum-Welch. COMP-599 Oct 1, 2015

Parsing with CFGs L445 / L545 / B659. Dept. of Linguistics, Indiana University Spring Parsing with CFGs. Direction of processing

Parsing with CFGs. Direction of processing. Top-down. Bottom-up. Left-corner parsing. Chart parsing CYK. Earley 1 / 46.

Context Free Grammars

Lecture 3: ASR: HMMs, Forward, Viterbi

A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models

Maschinelle Sprachverarbeitung

Hidden Markov Models in Language Processing

CS20a: summary (Oct 24, 2002)

Foundations of Natural Language Processing Lecture 6 Spelling correction, edit distance, and EM

Maschinelle Sprachverarbeitung

Improved TBL algorithm for learning context-free grammar

Multiscale Systems Engineering Research Group

The Language Modeling Problem (Fall 2007) Smoothed Estimation, and Language Modeling. The Language Modeling Problem (Continued) Overview

straight segment and the symbol b representing a corner, the strings ababaab, babaaba and abaabab represent the same shape. In order to learn a model,

Finite Automata Theory and Formal Languages TMV026/TMV027/DIT321 Responsible: Ana Bove

Recap: HMM. ANLP Lecture 9: Algorithms for HMMs. More general notation. Recap: HMM. Elements of HMM: Sharon Goldwater 4 Oct 2018.

Introduction to Computational Linguistics

Probabilistic Context-Free Grammars and beyond

Machine Learning for natural language processing

CMSC 723: Computational Linguistics I Session #5 Hidden Markov Models. The ischool University of Maryland. Wednesday, September 30, 2009

order is number of previous outputs

A Syntax-based Statistical Machine Translation Model. Alexander Friedl, Georg Teichtmeister

Einführung in die Computerlinguistik

Language Modeling. Michael Collins, Columbia University

Chapter 14 (Partially) Unsupervised Parsing

Sequences and Information

Log-Linear Models, MEMMs, and CRFs

Midterm sample questions

Experiments with a Gaussian Merging-Splitting Algorithm for HMM Training for Speech Recognition

An Efficient Context-Free Parsing Algorithm. Speakers: Morad Ankri Yaniv Elia

Grammar formalisms Tree Adjoining Grammar: Formal Properties, Parsing. Part I. Formal Properties of TAG. Outline: Formal Properties of TAG

Probabilistic Context-Free Grammars. Michael Collins, Columbia University

Parsing with Context-Free Grammars

CS460/626 : Natural Language

Penn Treebank Parsing. Advanced Topics in Language Processing Stephen Clark

CS 188 Introduction to AI Fall 2005 Stuart Russell Final

Hidden Markov Models

CSCI 5832 Natural Language Processing. Today 2/19. Statistical Sequence Classification. Lecture 9

An Introduction to Bioinformatics Algorithms Hidden Markov Models

Probabilistic Linguistics

Natural Language Processing 1. lecture 7: constituent parsing. Ivan Titov. Institute for Logic, Language and Computation

P(t w) = arg maxp(t, w) (5.1) P(t,w) = P(t)P(w t). (5.2) The first term, P(t), can be described using a language model, for example, a bigram model:

Hidden Markov Models. x 1 x 2 x 3 x N

IBM Model 1 for Machine Translation

Aspects of Tree-Based Statistical Machine Translation

Natural Language Processing Prof. Pushpak Bhattacharyya Department of Computer Science & Engineering, Indian Institute of Technology, Bombay

Linear Regression and Its Applications

A note on stochastic context-free grammars, termination and the EM-algorithm

The Noisy Channel Model and Markov Models

Languages. Languages. An Example Grammar. Grammars. Suppose we have an alphabet V. Then we can write:

Probabilistic Graphical Models: Lagrangian Relaxation Algorithms for Natural Language Processing

Suppose h maps number and variables to ɛ, and opening parenthesis to 0 and closing parenthesis

Samy Bengioy, Yoshua Bengioz. y INRS-Telecommunications, 16, Place du Commerce, Ile-des-Soeurs, Qc, H3E 1H6, CANADA

Probabilistic Language Modeling

Statistical Methods for NLP

Today s Agenda. Need to cover lots of background material. Now on to the Map Reduce stuff. Rough conceptual sketch of unsupervised training using EM

Even More on Dynamic Programming

Computational Models - Lecture 3

Lecture 12 Simplification of Context-Free Grammars and Normal Forms

Statistical Methods for NLP

Learning Conditional Probabilities from Incomplete Data: An Experimental Comparison Marco Ramoni Knowledge Media Institute Paola Sebastiani Statistics

Hidden Markov Models

Statistical Pattern Recognition

(Refer Slide Time: 0:21)

Statistical Methods for NLP

An Efficient Probabilistic Context-Free Parsing Algorithm that Computes Prefix Probabilities

Empirical Methods in Natural Language Processing Lecture 11 Part-of-speech tagging and HMMs

Theory of Computation (II) Yijia Chen Fudan University

Transcription:

Notes on the Inside-Outside Algorithm To make a grammar probabilistic, we need to assign a probability to each context-free rewrite rule. But how should these probabilities be chosen? It is natural to expect that these probabilities should depend in some way on the domain that is being parsed. Just as a simple approach such as the trigram model used in speech recognition is \tuned" to a training corpus, we would like to \tune" our rule probabilities to our data. The statistical principle of maximum likelihood guides us in deciding how to proceed. According to this principle, we should choose our rule probabilities so as to maximize the likelihood of our data, whenever this is possible. In these notes we describe how this principle is realized in the Inside-Outside algorithm for parameter estimation of context-free grammars. Let's begin by being concrete, considering an example of a simple sentence syntactic parse. She eats pizza anchovies Figure 6.1 Implicit in this parse are four context-free rules of the form A! B C. These are the rules S! N V V! V N N! N P P! PP N : In addition, there are ve \lexical productions" of the form A! w: N! She V! eats N! pizza PP! N! anchovies: 1

Of course, from a purely syntactic point-of-view, this sentence should also have a very dierent parse. To see this, just change the word \anchovies" to \hesitation": She eats pizza hesitation Figure 6.2 Here two new context-free rules have been used: V! V N P N! hesitation : In the absence of other constraints, both parses are valid for each sentence. That is, one could, at least syntactically, speak of a type of \pizza hesitation," though this would certainly be semantic gibberish, if not completely taste. One of the goals of probabilistic training parsing is to enable the statistics to correctly distinguish between such structures for a given sentence. This is indeed a formidable problem. Throughout this chapter we will refer to the above sentences grammar to demonstrate how the training algorithm is actually carried out. While certainly a \toy" example, it will serve to illustrate the general algorithm, as well as to demonstrate the strengths weaknesses of the approach. Let's now assume that we have probabilities assigned to each of the above context-free rewrite rules. These probabilities will be written as, for example, or as (S! N V) (N! pizza) we call these numbers the parameters of our probabilistic model. The only requirements of these numbers are that they be non-negative, that for each nonterminal A they sum to one; that is, X (A! ) = 1 for each A, where the sum is over all strings of words nonterminals that the grammar allows the symbol A to rewrite as. For our example grammar, this requirement is spelled out as follows: (N! N P) + (N! pizza) + (N! anchovies) + + (N! hesitation) + (N! She) = 1 (V! V N) + (V! V N P) + (V! eats) = 1 2

(S! N V) = 1 (P! PP N) = 1 (PP! ) = 1 So, the only parameters that are going to be trained, are those associated with rewriting a noun N or a verb V; all others are constrained to be equal to one. The Inside-Outside algorithm starts from some initial setting of the parameters, iteratively adjusts them so that the likelihood of the training corpus (in this case the two sentences \She eats pizza anchovies" \She eats pizza hesitation") increases. To write down the computation of this likelihood for our example, we'll have to introduce a bit of notation. First, we'll write W 1 = \She eats pizza anchovies" W 2 = \She eats pizza hesitation": Also, T 1 will refer to the parse in Figure 6.1, T 2 will refer to the parse in Figure 6.2. Then the statistical model assigns probabilities to these parses as P (W 1 ; T 1 ) = (S! N V) (V! V N) (N! N P) (P! PP N) (N! She) (V! eats) (N! pizza) (PP! ) (N! anchovies) P (W 2 ; T 1 ) = (S! N V) (V! V N P) (P! P PP) (N! She) (V! eats) (N! pizza) (PP! ) (N! hesitation) In the absence of other restrictions, each sentence can have both parses. Thus we also have P (W 1 ; T 2 ) = (S! N V) (V! V N P) (P! P PP) (N! She) (V! eats) (N! pizza) (PP! ) (N! anchovies) P (W 2 ; T 1 ) = (S! N V) (V! V N) (N! N P) (P! PP N) (N! She) (V! eats) (N! pizza) (PP! ) (N! hesitation) The likelihood of our corpus with respect to the parameters is thus L() = (P (W 1 ; T 1 ) + P (W 1 ; T 2 ))(P (W 2 ; T 1 ) + P (W 2 ; T 2 ): In general, the probability of a sentence W is P (W ) = X T P (W; T ) 3

where the sum is over all valid parse trees that the grammar assigns to W, if our training corpus comprises sentences W 1 ; W 2 ; : : : ; W N, then the likelihood L() of the corpus is given by L() = P (W 1 )P (W 2 ) P (W N ): Starting at some initial parameters, the inside-algorithm reestimates the parameters to obtain new parameters 0 for which L( 0 ) L(). This process is repeated until the likelihood has converged. Since it is quite possible that starting with a dierent initialization of the parameters could lead to a signicantly larger or smaller likelihood in the limit, we say that the Inside-Outside algorithm \locally maximizes" the likelihood of the training data. The Inside-Outside algorithm is a special case of the EM algorithm [1] for maximum likelihood estimation of \hidden" models. However, it is beyond the scope of these notes to describe in detail how the Inside-Outside algorithm derives from the EM algorithm. Instead, we will simply provide formulas for updating the parameters, as well as describe how the CYK algorithm is used to actually compute those updates. This is all that is needed to implement the training algorithm for your favorite grammar. Those readers who are interested in the actual mathematics of the Inside-Outside algorithm are referred to [3]. 0.1 The Parameter Updates In this section we'll dene some more notation, then write down the updates for the parameters. This will be rather general, the formulas may seem complex, but in the next section we will give a detailed description of how these updates are actually computed using our sample grammar. To write down how the Inside-Outside algorithm updates the parameters, it is most convenient to assume that the grammar is in Chomsky normal form. It should be emphasized, however, that this is only a convenience. In the following section we will work through the Inside-Outside algorithm for our toy grammar, which is not, in fact, in Chomsky normal form. But we'll assume here that we have a general context-free grammar G all of whose rules are either of the form A! B C for nonterminals A; B; C, or of the form (A! w) for some word w. Suppose that we have chosen values for our rule probabilities. Given these probabilities a training corpus of sentences W 1 ; W 2 ; : : : ; W N, the parameters are reestimated to obtain new parameters 0 as follows: 0 count(a! B C) (A! B C) = count(a! ) where 0 (A! w) = count(a! B C) = count(a! w) = P count(a! w) count(a! ) P NX NX i=1 i=1 c (A! B C; W i ) c (A! w; W i ) The number c (A! ; W i ) is the expected number of times that the rewrite rule A! is used in generating the sentence W i when the rule probabilities are given by. To give a formula for these expected counts, we need two more pieces of notation. The rst piece of notation is stard in the 4

automata literature (see, for example, [2]). If beginning with a nonterminal A we can derive a string of words nonterminals by applying a sequence of rewrite rules from our grammar, then we write A ) ; say that A derives. So, if a sentence W = w 1 w 2 w n can be parsed by the grammar we can write S ) w 1 w 2 w n : In the notation used above, the probability of the sentence given our probabilistic grammar is then P (W ) = X T P (W; T ) = P (S ) w 1 w 2 w n ) : The other piece of notation is just a shorth for certain probabilities. The probability that the nonterminal A derives the string of words w i w j in the sentence W = w 1 w n is denoted by ij (A). That is, ij (A) = P (A ) w i w j ) : Also, we set the probability that beginning with the start symbol S we can derive the string equal to ij (A). That is, w 1 w i?1 A w j+1 w n ij (A) = P (S ) w 1 w i?1 A w j+1 w n ) : The alphas betas are referred to, respectively, as inside outside probabilities. We are nally ready to give the formula for computing the expected counts. For a rule A! B C, the expected number of times that the rule is used in deriving the sentence W is c (A! B C; W ) = (A! B C) P (W ) X 1ijkn ik (A) ij (B) j+1;k (C) : Similarly, the expected number of times that a lexical rule A! w is used in deriving W is given by c (A! w; W ) = (A! w) P (W ) X 1n ii (A) : To actually carry out the computation, we need an ecient method for computing the 's 's. Fortunately, there is an ecient way of computing these, based upon the following recurence relations. If we still assume that our grammar is in Chomsky normal form, then it is easy to see that the 's must satisfy X ij X (A) = (A! B C) ik (B) k+1;j (C) B;C ikj for i < j if we take ii (A) = (A! w i ) : Intuitively, this formula says that the inside probability ij (A) is computed as a sum over all possible ways of drawing the following picture: 5

i k k+1 j Figure 6.3 In the same way, if the outside probabilities are initialized as 1n (S) = 1 1n (A) = 0 for A 6= S, then the 's are given by the following recursive expression: ij (A) = X B;C + X B;C X X 1k<i nk>j (B! C A) k;i?1 (C) kj (B) + (B! A C) j+1;k (C) ik (B): Again, the rst pair of sums can be viewed as considering all ways of drawing the following picture: k i-1 i j Figure 6.4 Together with the update formulas, the above recurence formulas form the core of the Inside-Outside algorithm. 6

To summarize, the Inside-Outside algorithm consists of the following steps. First, chose some initial parameters set all of the counts count(a! ) to zero. Then, for each sentence W i in the training corpus, compute the inside probabilities the outside probabilities. Then compute the expected number of times that each rule A! is used in generating the sentence W i. These are the numbers c (A! ; W i ). For each rule A! add the number c (A! ; W i ) to the total count count(a! ) proceed to the next sentence. After processing each sentence in this way, reestimate the parameters to obtain 0 count(a! ) (A! ) = : count(a! ) P Then, repeat the process all over again, setting = 0, computing the expected counts with respect to the new parameters. How do we know when to stop? During each iteration, we compute the probability P (W ) = P (S ) w 1 w n ) = 1n (S) of each sentence. This enables us to compute the likelihood, or, better yet, the log likelihood L() = P (W 1 ) P (W 2 ) P (W N ); LL() = NX i=1 log P (W i ) : The Inside-Outside algorithm is guaranteed not to decrease the log likelihood; that is, LL( 0 )?LL() 0. One may decide to stop whenever the change in log likelihood is suciently small. In our experience, this is typically after only a few iterations for a large natural language grammar. In the next section, we will return to our toy example, detail how these calculations are actually carried out. Whether the grammar is small or large, feature-based or in stard context-free form, the basic calculations are the same, an understing of them for the following example will enable you to implement the algorithm for your own grammar. 0.2 Calculation of the inside outside probabilities The actual implementation of these computations is usually carried out with the help of the CYK algorithm [2]. This is a cubic recognition algorithm, it proceeds as follows for our example probabilistic grammar. First, we need to put the grammar into Chomsky normal form. In fact, there is only one agrant rule, V! V N P, which we break up into two rules N-P! N P V! V N-P, introducing a new nonterminal N-P. Notice that since there is only one N-P rule, the parameter (N-P! N P) will be constrained to be one, so that (V! V N-P) (N-P! N P) = (V! V N-P) is our estimate for (V! V N P). We now want to ll up the CYK chart for our rst sentence, which is shown below. 7

She eats pizza anchovies Figure 6.5 The algorithm proceeds by lling up the boxes in the order shown in the following picture. 1 2 3 4 5 Figure 6.6 The rst step is to ll the outermost diagonal. Into each box is entered the nonterminals which can generate the word associated with that box. Then the 's for the nonterminals which were entered are initialized. Thus, we obtain the chart 8

She eats pizza anchovies Figure 6.7 we will have computed the inside probabilities 11 (N) = (N! She) 33 (N) = (N! pizza) 55 (N) = (N! anchovies) 22 (V) = (V! eats) 44 (PP) = (PP! ) All other 's are zero. Now it happened in this case that each box contains only one nonterminal. If, however, a box contained two or more nonterminals, then the for each nonterminal would be updated. Suppose, for example, that the word \anchovies" were replaced by \mushrooms." Then since \mushrooms" can be either a noun or a verb, the bottom part of the chart would appear as mushrooms Figure 6.8 we would have computed the inside probabilities 55 (N) = (N! mushrooms) 55 (V) = (V! mushrooms): In general, the rule for lling the boxes is that each nonterminal in box (i; j) must generate words w i w j, where the boxes are indexed as shown in the following picture. 9

1 2 3 4 5 1 2 3 4 5 Figure 6.9 In this gure, the shaded box, which is box (2; 4), spans words w 2 w 3 w 4. To return now to our example, we proceed by lling the next diagonal updating the 's, to get: She eats pizza anchovies Figure 6.10 with 12 (S) = (S! N V) 11 (N) 22 (V) 23 (V) = (V! V N) 22 (V) 33 (N) 45 (P) = (P! PP N) 44 (PP) 55 (N) : When we nish lling the chart, we will have 10

She eats pizza anchovies Figure 6.11 with, for example, inside probabilities 25 (V) = (V! V N) 22 (V) 35 (N) + + (V! V N-P) 22 (V) 35 (N-P) 15 (S) = (S! N V) 11 (N) 25 (V) : This last alpha is the total probability of the sentence: 15 (S) = P (S ) She eats anchovies) = X T P (She eats anchovies; T ): This completes the inside pass of the Inside-Outside algorithm. Now for the outside pass. outside pass proceeds in the reverse order of Figure 6.6. We initialize the 's by setting 15 (S) = 1 all other 's equal to zero. Then 25 (V) = (S! N V) 11 (N) 15 (S) 35 (N) = (V! V N) 22 (V) 25 (V) The. 55 (N) = (P! PP N) 44 (PP) 45 (P) : The 's are computed in a top-down manner, retracing the steps that the inside pass took. For a given rule used in building the table, the outside probability for each child is updated using the outside probability of its parent, together with the inside probability of its sibling nonterminal. Now we have all the necessary ingredients necessary to compute the counts. As an example of how these are computed, we have c (V! V N; W 1 ) = (V! V N) ( 22 (V) 33 (N) 23 (V)+ 15 (S) + 22 (V) 35 (N) 25 (V)) = (V! V N) ( 22 (V) 33 (N) 23 (V)) 15 (S) 11

since 23 (V) = 0. Also, c (V! V N-P; W 1 ) = (V! V N-P) 15 (S) ( 22 (V) 35 (N-P) 25 (V)) : We now proceed to the next sentence, \She eats pizza hesitation." In this case, the computation proceeds exactly as before, except, of course, that all inside probabilities involving the probability (N! anchovies) are replaced by the probability (N! hesitation). When we are nished updating the counts for this sentence, the probabilities are recomputed as, for example, 0 (V! V N) = c (V! V N; W 1 ) + c (V! V N; W 2 ) Pi=1;2 c (V! V N P; W i ) + c (V! eats; W i ) + c (V! V N; W i ) 0 (V! eats) = c (V! eats; W 1 ) + c (V! eats; W 2 ) Pi=1;2 c (V! V N P; W i ) + c (V! eats; W i ) + c (V! V N; W i ) Though we have described an algorithm for training the grammar probabilities, it is a simple matter to modify the algorithm to extract the most probable parse for a given sentence. For historical reasons, this is called the Viterbi algorithm, the most probable parse is called the Viterbi parse. Briey, the algorithm proceeds as follows. For each nonterminal A added to a box (i; j), in the chart, we keep a record of which is the most probable way of rewriting A. That is, we determine which nonterminals B; C index k maximize the probability (A! B C) ik (B) k+1;j (C) : When we reach the topmost nonterminal S, we can then \traceback" to construct the Viterbi parse. We shall leave the details of this algorithm as an exercise for the reader. The computation outlined here has most of the essential features of the Inside-Outside algorithm applied to a \serious" natural language grammar. References [1] A.P. Dempster, N.M. Laird, D.B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, 39(B):1{38, 1977. [2] J.E. Hopcroft J.D. Ullman. Introduction to Automata Theory, Languages, Computation. Addison-Wesley, Reading, Massachusetts, 1979. [3] J. D. Laerty. A derivation of the inside-outside algorithm from the EM algorithm. Technical report, IBM Research, 1992. 12