The Rise of Statistical Parsing


Karl Stratos

Abstract

The effectiveness of statistical parsing has almost completely overshadowed the previous dependence on rule-based parsing. Statistically learning how to parse sentences from sample data appeals strongly as the right approach for both practical (performance) and theoretical (human-likeness) reasons. Benefiting from, and contributing to, the advancement of machine learning, state-of-the-art statistical parsers are widely used in many natural language processing studies and applications today. While the relative efficacy of statistical parsing is undeniable, we are still far from truly solving the problem of natural language parsing, and the inaccuracy of parse trees remains a main bottleneck in reaching the grand goal of language understanding machines. In order to know where to go next, we may benefit from studying how the statistical approach to parsing has been engineered to its current state. In this survey, we trace its development chronologically in the hope of gaining a deeper understanding of the methods and techniques involved in it.

1 Introduction

The discovery that machines can handle symbols as well as numbers was a major breakthrough in the early phase of the development of artificial intelligence (AI). In particular, this implied machines' capability to process human language, which is composed of symbols and rules, and thus the field of natural language processing (NLP) was born. Parsing, the process of revealing the structure of a set of symbols, is indispensable in any attempt to computationally understand language, because syntax and semantics are closely tied together. The grammar formalism of Noam Chomsky, and the resulting Chomsky hierarchy, enabled formalizing languages of different complexities, yet the task of capturing the nuances of human language proved to be formidably difficult. Unlike a programming language, which is designed with an unambiguous grammar so that it can be parsed using that grammar with absolute precision, a human language does not come with a prescription of grammar, and the latent grammar is highly ambiguous.

Numerous algorithms were conceived for the task, starting from naive ones like backtracking bottom-up parsing to CKY (Cocke and Schwartz, 1970; Kasami, 1965; Younger, 1967), Earley's (Earley, 1970), and GLR (Tomita, 1984) parsing. In fact, (Goodman, 1999) showed that a general system of semiring properties synthesizes a wide variety of such parsers. In the early days, people had to come up with the rules, often by relying on language experts, and then recognized the parse trees of sentences using the rules with one of the algorithms. This method suffers from several serious defects. First, the task of coming up with a grammar that is complete and expressive enough to generate real-world sentences is almost impossibly hard in itself, even with the knowledge of the best linguists available. Second, even if it is done rather well for one language, we have to repeat the procedure whenever we move to a different language with a different grammar. Third, it is extremely intolerant of grammatical errors, because it either finds a right rule or does not, and it cannot do much in the latter case other than simply failing. Finally, this is not how humans take in sentences. As Charlton Laird says, "Grammar... is something inherent in language... It can be discovered, but not invented." Therefore, for the ultimate end of human-like machines, handcrafting rules for parsing is unlikely to be a complete solution.

The idea of the statistical solution for parsing is to observe a large collection of sample parses, and parse a sentence using the rules that were observed in the collection, with preference for the rules that match the sentence better. This has several merits that overcome the defects of rule-based parsing. One does not have to come up with a definitive set of grammar rules, but rather simply has to provide good parsing samples, handling the first defect. Since producing a training corpus is substantially easier than formalizing a complete grammar for a language, the second defect is ameliorated. The third defect illustrates the mixed blessing of this approach; it is very tolerant of errors, but on the other hand the guarantee of the success of parsing does not promise the quality of the parse tree. Finally, learning to understand the structure of language by familiarizing oneself with sentences from the wild is in the spirit of Laird's view of discovering grammar.

The questions then are (1) how to get parsing data of high quality and great quantity, and (2) how to learn well from the data set? (1) was answered with the advent of hand-annotated corpora such as the Penn Treebank (Marcus et al., 1993), which contained over 4.5 million words of American English annotated for part-of-speech (POS) information and skeletal syntactic structure. It is question (2) that is the focus of this survey. In other words, how have people come up with an increasingly sophisticated algorithmic machinery to improve the performance of their statistical parsers?

There are a handful of authoritative figures who are responsible for the huge advancement in the field in its early phase. (Collins, 1997) proposed a generative approach to statistical parsing. (Collins, 2000) also proposed reranking the resultant parse trees for a sentence for higher accuracy. (Taskar et al., 2004) introduced max-margin parsing, (Charniak and Johnson, 2005) suggested a better way to rerank with a dynamic programming n-best parsing algorithm, and (Huang, 2008) showed that reranking not just n-best lists but entire forests improves performance still further. We will look at each of them in order. We will conclude by contemplating the achievement of statistical parsing, why it fails to solve the problem of parsing, and what we might need to do.

2 Models can be Generative

(Collins, 1997) introduces three parsing models that incorporate the generative aspect of syntax. That is, each sentence-tree pair (S, T) has an associated top-down derivation consisting of a sequence of rule applications of a grammar. Collins applies the generative aspect to his previous model that uses a lexicalized PCFG (a PCFG in which each non-terminal in a parse tree is associated with a headword), and improves the performance on Wall Street Journal text by 2.3%.
Collins' models in this paper summarize many important principles of statistical parsing, so it is worthwhile to invest a substantial portion of this survey in understanding their basics.

2.1 Model 1

The task of a statistical parser is as follows: given a sentence $S$, search for the tree $T$ that maximizes $P(T \mid S)$. The model has to define how one can compute the conditional probability $P(T \mid S)$. The previous non-generative models estimated $P(T \mid S)$ directly. Instead, a generative model estimates it by attaching probabilities to a top-down derivation of the tree. For a tree derived by $n$ applications of context-free rules $LHS_i \rightarrow RHS_i$, $1 \leq i \leq n$,

$$P(T, S) = P(RHS_1 \mid LHS_1)\, P(RHS_2 \mid LHS_2) \cdots P(RHS_n \mid LHS_n) = \prod_{i=1}^{n} P(RHS_i \mid LHS_i).$$

Why $P(T, S)$ instead of $P(T \mid S)$? We use the observation that maximizing the former is equivalent to maximizing the latter, by Bayes' rule:

$$T_{best} = \arg\max_T P(T \mid S) = \arg\max_T \frac{P(S, T)}{P(S)} = \arg\max_T P(S, T).$$

Now, the rules $LHS_i \rightarrow RHS_i$ are either internal to the tree (non-leaves), where the LHS is a non-terminal and the RHS is a sequence of one or more non-terminals, e.g., S(ran) → NP(Karl) VP(ran), or lexical (leaves), where the LHS is a POS tag and the RHS is a word, e.g., NNP(Karl) → Karl. Collins lexicalizes a PCFG by associating a word $w$ and its POS tag $t$ with each non-terminal $X$ in the tree, and writes a non-terminal as $X(x)$, where $x = \langle w, t \rangle$. The addition of lexical heads leads to an enormous number of potential rules; we will see how Collins addresses this problem shortly. Each rule can be written in the following form, which we will use throughout the section:

$$P(h) \rightarrow L_n(l_n) \ldots L_1(l_1)\; H(h)\; R_1(r_1) \ldots R_m(r_m). \quad (1)$$

$H$ is the head of the children constituents of the phrase. Note that it inherits the head-word $h$ from its parent $P$. $L_1 \ldots L_n$ and $R_1 \ldots R_m$ are left and right modifiers of $H$; put simply, they are the surrounding constituents around the head constituent. When the rule is unary, $n = m = 0$. At this point, it is informative to study the parse tree of the sentence "Last week Marks bought Brooks" (shown as a figure in Collins' paper), which he uses as an example throughout.

As mentioned, a direct estimation of $P(RHS \mid LHS)$ is infeasible due to the large number of possible rules created by lexical heads. Collins proposes decomposing the generation of the RHS given the LHS of a rule such as (1) into three steps (sketched in code at the end of this subsection). Informally speaking, first take care of the head, then deal with the right of the head, and finally generate the left of the head. This method exploits the head-driven structure of a rule. The generation of the left and right modifiers is simplified by the independence assumption that they are generated by separate 0th-order Markov processes (that is, the probabilities for the next choice do not depend on the past choices at all, equivalent to weighted random selection). The only exception is the first rule, $TOP \rightarrow H(h)$, which has probability $P(H, h \mid TOP)$. More specifically, the three steps required for generating the RHS from the LHS are:

1. Generate the head constituent label of the phrase, with probability $P_H(H \mid P, h)$, given the parent $P$ and the head-word $h$.

2. Generate the right modifiers with probability $\prod_{i=1}^{m+1} P_R(R_i(r_i) \mid P, h, H)$, given the parent $P$, the head-word $h$, and the head constituent $H$. The right-most constituent $R_{m+1}(r_{m+1})$ is defined as a non-terminal $STOP$ to signify the end of generation of right modifiers.

3. Similarly, generate the left modifiers with probability $\prod_{i=1}^{n+1} P_L(L_i(l_i) \mid P, h, H)$, with $L_{n+1}(l_{n+1}) = STOP$.

Since Collins gives the probability of the expansion of S(bought) as an example, let us instead estimate the probability of the rule VP(bought) → VB(bought) NP(Brooks) from the same tree. The first step gives the probability of the head, $P_H(VB \mid VP, bought)$. The second gives the probability of the right modifier, $P_R(NP(Brooks) \mid VP, VB, bought)$, and the end signal $P_R(STOP \mid VP, VB, bought)$. The third gives the probability of the left modifiers, of which there are none in this case, so we only have the end signal $P_L(STOP \mid VP, VB, bought)$. Therefore, the probability of the rule is estimated as:

$$P_H(VB \mid VP, bought) \cdot P_R(NP(Brooks) \mid VP, VB, bought) \cdot P_R(STOP \mid VP, VB, bought) \cdot P_L(STOP \mid VP, VB, bought).$$

Despite the 0th-order Markov assumptions, the probabilities could in general be conditioned on any of the preceding modifiers. Naturally, if the derivation is left-to-right depth-first, so that we obtain the preceding modifiers and all their sub-trees before the current modifier, the model can also condition on any structure below the preceding modifiers. Collins exploits this by making the approximations

$$P_L(L_i(l_i) \mid H, P, h, L_1(l_1) \ldots L_{i-1}(l_{i-1})) = P_L(L_i(l_i) \mid H, P, h, distance_L(i-1))$$
$$P_R(R_i(r_i) \mid H, P, h, R_1(r_1) \ldots R_{i-1}(r_{i-1})) = P_R(R_i(r_i) \mid H, P, h, distance_R(i-1))$$

where $distance_L$ and $distance_R$ are functions of the surface string from the head-word to the edge of the constituent.
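
To make the decomposition concrete, here is a minimal Python sketch of the three-step rule probability. It uses tiny hand-set conditional probability tables in place of counts estimated from the Treebank; the table values and the names (`P_H`, `P_R`, `P_L`, `rule_probability`) are illustrative assumptions, not Collins' implementation.

```python
# A minimal sketch of Collins' head-driven rule probability (Model 1),
# with toy conditional probability tables standing in for Treebank estimates.

# P_H(H | parent, headword), P_R(mod | parent, headword, H), P_L(...)
P_H = {("VP", "bought"): {"VB": 0.7, "VBD": 0.3}}
P_R = {("VP", "bought", "VB"): {("NP", "Brooks"): 0.2, "STOP": 0.5}}
P_L = {("VP", "bought", "VB"): {"STOP": 0.9}}

def rule_probability(parent, headword, head_label, right_mods, left_mods):
    """P(RHS | LHS) via the head / right-modifier / left-modifier decomposition.

    right_mods and left_mods are lists of (label, headword) pairs, ordered
    outward from the head; each sequence is terminated by an implicit STOP.
    """
    p = P_H[(parent, headword)][head_label]
    ctx = (parent, headword, head_label)
    # 0th-order Markov generation of right modifiers, then the STOP symbol.
    for mod in right_mods:
        p *= P_R[ctx][mod]
    p *= P_R[ctx]["STOP"]
    # Same for left modifiers.
    for mod in left_mods:
        p *= P_L[ctx][mod]
    p *= P_L[ctx]["STOP"]
    return p

# Probability of VP(bought) -> VB(bought) NP(Brooks):
print(rule_probability("VP", "bought", "VB",
                       right_mods=[("NP", "Brooks")], left_mods=[]))
# 0.7 * 0.2 * 0.5 * 0.9 = 0.063
```

Running it on the VP(bought) → VB(bought) NP(Brooks) example multiplies exactly the four factors listed above.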

2.2 Model 2

Model 1 does not incorporate the complement-adjunct distinction when evaluating the probabilities of parses, so Model 2 seeks to improve on Model 1 by making the distinction. Briefly put, complements and adjuncts both serve to give additional information to the object they modify, in the form of an adjectival, adverbial, or sentential phrase, but they differ in that the former cannot be taken away without fundamentally impairing the semantic content of the original phrase, while the latter can. In Model 2, Collins makes this distinction by attaching a -C suffix to non-terminals that are complements.

There are several reasons that we want to make this distinction not in a post-processing stage but in parsing. First, recognizing complements requires probabilistic treatment because it involves lexical information (e.g., "week" is likely to be a temporal modifier) and subcategorization preferences (e.g., a that-phrase tends to be a complement as in "The spokeswoman said that the asbestos was dangerous", whereas a because-phrase tends to be an adjunct as in "Bonds beat short-term investment because the market is down"). Second, the very process of identifying complements may help parsing accuracy, because it allows us to avoid the bad mistake of generating complements independently of each other. The figure in Collins' paper shows a parse that incorrectly treats two different noun phrases as both subjects/objects because of the independence assumption; if we know the subcategorization information of the head, we will not make such a mistake.

The conditions for a non-terminal to be tagged with the -C suffix specify what it must be and must not be in order to be a complement. It must be (1) an NP, SBAR, or S whose parent is an S; (2) an NP, SBAR, S, or VP whose parent is a VP; or (3) an S whose parent is an SBAR. It must not have one of the following tags: ADV, VOC, BNF, DIR, EXT, LOC, MNR, TMP, CLR or PRP. Each of these is a strong sign that the non-terminal is an adjunct. For instance, a temporal-tagged (TMP) phrase such as "Last week" is an adjunct, and a because-phrase like "because the market is down" is tagged with ADV. This supplements the non-terminal set with the complement distinction.

To address the bad independence assumption, Collins incorporates a probabilistic choice of left and right subcategorization (subcat) frames into the generative process of Model 1 (sketched in code at the end of this subsection):

1. Generate the head with probability $P_H(H \mid P, h)$ as before.

2. Choose left and right subcat frames, $LC$ and $RC$, with probabilities $P_{lc}(LC \mid P, H, h)$ and $P_{rc}(RC \mid P, H, h)$. Each subcat frame is a multiset specifying the complements which the head requires in its left or right modifiers. So, for example, an $LC$ of a verb phrase might look like $\{NP\text{-}C\}$.

3. Generate the left and right modifiers with probabilities $P_L(L_i(l_i) \mid H, P, h, distance_L(i-1), LC)$ and $P_R(R_i(r_i) \mid H, P, h, distance_R(i-1), RC)$, exactly the same as before except for the inclusion of $LC$ and $RC$. As complements are generated, they are removed from the appropriate subcat multiset.

Here is how Collins ensures that all and only the required complements will be generated: the probability of generating $STOP$ will be 0 when the subcat frame is non-empty (so a subcat-incomplete rule will never be cut short), and the probability of generating a complement will be 0 when it is not in the subcat frame (so only the required complements have a chance of being generated). The probability of the augmented rule VP(bought) → VBD(bought) NP-C(Brooks) is now:

$$P_H(VBD \mid VP, bought) \cdot P_{lc}(\{\} \mid VP, VBD, bought) \cdot P_{rc}(\{NP\text{-}C\} \mid VP, VBD, bought) \cdot P_R(NP\text{-}C(Brooks) \mid VP, VBD, bought, \{NP\text{-}C\}) \cdot P_R(STOP \mid VP, VBD, bought, \{\}) \cdot P_L(STOP \mid VP, VBD, bought, \{\}).$$

Note that the subcat frame helps prevent the incorrect parses mentioned above, because a probability of the form $P_{lc}(\{NP\text{-}C, NP\text{-}C\} \mid \ldots)$ must be very small.
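
The hard constraints that make subcat frames work can be seen in a few lines of code. The following is a minimal sketch with a toy probability table; the representation (labels ending in -C for complements, a Counter for the multiset) and all names are assumptions made for illustration, not Collins' code.

```python
from collections import Counter

# Toy table: P_R(mod | parent, headword, head) before the subcat constraints.
P_R = {("VP", "bought", "VBD"): {("NP-C", "Brooks"): 0.3,
                                 ("NP", "week"): 0.1,
                                 "STOP": 0.6}}

def p_right_modifier(mod, parent, headword, head, subcat):
    """P_R(mod | parent, headword, head, subcat) with the two hard constraints:
    STOP is impossible while required complements remain, and a complement
    that is not in the subcat frame is impossible."""
    base = P_R[(parent, headword, head)].get(mod, 0.0)
    if mod == "STOP":
        return 0.0 if any(c > 0 for c in subcat.values()) else base
    label = mod[0]
    if label.endswith("-C") and subcat[label] <= 0:
        return 0.0
    return base

def generate_right(mods, parent, headword, head, subcat):
    """Probability of a right-modifier sequence followed by STOP, removing
    complements from the subcat multiset as they are generated."""
    subcat = Counter(subcat)
    p = 1.0
    for mod in mods:
        p *= p_right_modifier(mod, parent, headword, head, subcat)
        if mod[0].endswith("-C"):
            subcat[mod[0]] -= 1
    return p * p_right_modifier("STOP", parent, headword, head, subcat)

# VP(bought) with right subcat frame {NP-C}: generate NP-C(Brooks), then STOP.
print(generate_right([("NP-C", "Brooks")], "VP", "bought", "VBD", {"NP-C": 1}))
# 0.3 * 0.6 = 0.18; stopping before generating the NP-C would have probability 0.
```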

2.3 Model 3

Model 2 still does not handle wh-movement from relative clauses. The examples of three different extractions that Collins provides are illuminating:

1. From subject: The store (SBAR which TRACE bought Brooks Brothers)

2. From object: The store (SBAR which Marks bought TRACE)

3. From within a PP: The store (SBAR which Marks bought Brooks Brothers from TRACE)

For the same reasons (the desirability of probabilistic treatment and the chance of improving parsing accuracy), we want to perform this task during the parsing phase rather than after it. The traditional approach to wh-movement is to add a gap feature to each non-terminal in the tree, and to propagate gaps through the tree until they are finally filled as a trace complement. Since the Penn Treebank contains this information with the TRACE symbol, it is straightforward to add it to the trees in the training data.

Given that the LHS has a gap, there are three cases in passing down the gap to the RHS: (1) to the head, (2) to the left, or (3) to the right. In the latter two cases, the gap may be discharged as a trace argument (as in rule (4) below). We specify a parameter $P_G(G \mid P, h, H)$ where $G$ is either Head, Left, or Right. The generative process is extended to choose between these cases after generating the head of the phrase. When $G = Head$, the left and right modifiers are generated as normal. Otherwise, we add a gap requirement to the subcat variable and propagate it until this requirement is either passed down to a non-terminal or fulfilled by a TRACE.

The former (the gap passed down to a non-terminal) is illustrated by rule (2), SBAR(that)(+gap) → WHNP(that) S-C(bought)(+gap), which has probability

$$P_H(WHNP \mid SBAR, that) \cdot P_G(Right \mid SBAR, WHNP, that) \cdot P_{lc}(\{\} \mid SBAR, WHNP, that) \cdot P_{rc}(\{S\text{-}C\} \mid SBAR, WHNP, that) \cdot P_R(S\text{-}C(bought)(+gap) \mid SBAR, WHNP, that, \{S\text{-}C, +gap\}) \cdot P_R(STOP \mid SBAR, WHNP, that, \{\}) \cdot P_L(STOP \mid SBAR, WHNP, that, \{\}).$$

The latter (the gap filled by a TRACE) is illustrated by rule (4), VP(bought)(+gap) → VB(bought) TRACE NP(week), which has probability

$$P_H(VB \mid VP, bought) \cdot P_G(Right \mid VP, VB, bought) \cdot P_{lc}(\{\} \mid VP, VB, bought) \cdot P_{rc}(\{NP\text{-}C\} \mid VP, VB, bought) \cdot P_R(TRACE \mid VP, VB, bought, \{NP\text{-}C, +gap\}) \cdot P_R(NP(week) \mid VP, VB, bought, \{\}) \cdot P_L(STOP \mid VP, VB, bought, \{\}) \cdot P_R(STOP \mid VP, VB, bought, \{\}).$$

In the actual implementation of the models, a lot of smoothing is done, separately for different levels of back-off. Also, Collins replaces words in test data that have never been seen, or that occurred fewer than 5 times in training, with the UNKNOWN token, thereby allowing the model to robustly handle the statistics of rare or new words. The parser was trained on the Wall Street Journal portion of the Penn Treebank (around 40,000 sentences) and tested on a different portion (2,416 sentences). The performance measures used are:

$$\text{Precision} = \frac{\#\text{ of correct constituents in parse}}{\#\text{ of constituents in parse}}, \qquad \text{Recall} = \frac{\#\text{ of correct constituents in parse}}{\#\text{ of constituents in Treebank parse}},$$

and Crossing Brackets, the number of constituents violating constituent boundaries of the Treebank parse.

The precision/recall performance on constituent recovery is 88.1%/87.5%. The performance on trace recovery is also interesting to note. In order for a trace to be considered correct, three criteria must be met: it must be an argument to the correct head-word, it must be in the correct position in relation to that head-word, and it must be under the correct non-terminal label. Given this specification, the performance of Model 3 on 436 sentences is 93.3%/90.1%. Collins reports that 342 were short-distance cases with 97.1%/98.2% precision/recall, and 94 were long-distance cases with 76%/60.6% precision/recall. This suggests that wh-movement becomes much harder to recover as the extraction distance increases.
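
Since these precision/recall figures recur for every parser in this survey, a small sketch of how labeled constituent precision and recall are computed may be useful. It assumes each parse is reduced to a multiset of labeled spans (label, start, end); the representation and the toy spans are illustrative.

```python
from collections import Counter

def precision_recall(predicted, gold):
    """Labeled constituent precision and recall between two span multisets."""
    pred_counts, gold_counts = Counter(predicted), Counter(gold)
    correct = sum((pred_counts & gold_counts).values())  # multiset intersection
    precision = correct / max(len(predicted), 1)
    recall = correct / max(len(gold), 1)
    return precision, recall

gold = [("S", 0, 5), ("NP", 0, 2), ("VP", 2, 5), ("NP", 3, 5)]
pred = [("S", 0, 5), ("NP", 0, 2), ("VP", 2, 5), ("PP", 3, 5)]
p, r = precision_recall(pred, gold)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.75 recall=0.75
```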

3 Think Twice with Discriminative Re-ranking

Recall that our task is to find the tree $T$ with the highest probability $P(T \mid S)$ of being the parse tree of the sentence $S$. Also, recall that we refrained from performing the complement/adjunct distinction and the wh-movement handling as post-parsing processes, because we wanted to use their information for the parsing task. However, there is a reason we might want to perform post-processing. When we want to encode some constraints in the framework, we would like to be able to conveniently impose features discriminating between candidate trees, rather than having to alter the whole derivation to take these features into account. (Collins, 2000) proposes a two-pass process in which the base parser produces a set of candidate parses that are initially ranked with derivation probabilities, and then a second model re-ranks the parses using additional features as evidence. This allows a tree to be represented as an arbitrary set of features, without concerns about how these features interact during a derivation.

A concrete example that Collins gives is the task of POS tagging using a Hidden Markov Model (HMM). In order to exploit our intuition that most sentences have at least one verb, so that sequences including at least one verb should have an increased score under the model, we may try to encode it directly into an HMM. But the obvious approach of adding to each state the information about whether or not a verb has been generated doubles the number of states and parameters in the model. In contrast, the task would be easy if we could just implement a binary feature that indicates whether or not the tagging contains a verb. The techniques involved in Collins' method illustrate many of the important ideas in machine learning, such as the use of a dual problem, feature selection, and smoothing, and we will investigate them at a high level.

3.1 Problem Definition

Collins uses a generative model similar to (Collins, 1997) as the base parser. Here he uses the term history-based for generative, but we will avoid using it to be consistent with the previous section. His definition of the probability of a parse tree is now

$$P(Y, X) = \prod_{i=1}^{n} P(d_i \mid \Phi(d_1 \ldots d_{i-1}))$$

where $d_i$ is the $i$-th decision (of which re-write rule to apply), so that $(d_1 \ldots d_{i-1})$ is the history for the $i$-th decision, and $\Phi$ is a function which groups histories into equivalence classes, which corresponds to the 0th-order Markov independence assumption of the last section. All models in the paper work with log probabilities. The log probability of a tree can be written as a linear sum of parameters $\alpha_s$ multiplied by features $h_s$, where $h_s(X, Y)$ is the count of the rule $\gamma_s \rightarrow \beta_s$ in the tree $Y$ of the sentence $X$, and $\alpha_s = \log P(\beta_s \mid \gamma_s)$ is the parameter associated with that rule. That is, if we have a PCFG with such rules for $1 \leq s \leq m$,

$$\log P(Y, X) = \sum_{s=1}^{m} \alpha_s h_s(X, Y).$$

Thus the features $h_s$ define an $m$-dimensional vector of counts, $(h_1 \ldots h_m)$, and the parameters $\alpha_s$ represent the influence of each feature on the score of a tree, i.e., weights. We need to define some more terms in order to be able to express the ideas in equations:

- $X$ and $Y$ are the input space and tree space.
- $x_{i,j} \in X \times Y$ is the $j$-th parse of the $i$-th sentence, where $1 \leq i \leq n$ and $1 \leq j \leq n_i$.
- $Score(x_{i,j})$ is the measure of the similarity of $x_{i,j}$ to the gold-standard parse, and $x_{i,1}$ is defined to be the highest scoring parse for the $i$-th sentence.
- $Q(x_{i,j}) = P(X, Y)$ is the probability under the base model, and $L(x_{i,j}) = \log Q(x_{i,j})$.
- We have a separate test set of parses $y_{i,j}$.

The task is to learn a ranking function $F(x_{i,j})$. Note that for the base model, $F(x_{i,j}) = L(x_{i,j})$. The performance of a ranking function is evaluated on the entire test data: the score of $F$ is $\sum_i Score(y_{i,z_i})$, where $z_i = \arg\max_{j=1 \ldots n_i} F(y_{i,j})$ is the index of the top-ranked parse under $F$ for the $i$-th sentence. Under this definition, the maximum possible score is $\sum_i Score(y_{i,1})$. $F$ can be written in the following form, using the weights $\bar{\alpha} = \{\alpha_0, \alpha_1 \ldots \alpha_m\}$ and the indicator function $h_s(x)$, which is 1 if $x$ contains the $s$-th rule and 0 otherwise ($h_s$ is restricted to binary values for the simplicity of the algorithm, but we can simulate features that count rules by having multiple features that take value 1 if a rule is seen $n$ times):

$$F(x_{i,j}, \bar{\alpha}) = \alpha_0 L(x_{i,j}) + \sum_{s=1}^{m} \alpha_s h_s(x_{i,j}). \quad (2)$$

The task now is to find the parameter settings for $\bar{\alpha}$ that lead to good scores on test data.

3.2 Loss Functions

In order to drive the training process, we define a measure of the ranking errors $F$ makes on the training set. The ranking error rate is the number of times a lower-scoring parse is ranked better than the best parse:

$$Error(\bar{\alpha}) = \sum_i \sum_{j=2}^{n_i} t[F(x_{i,1}, \bar{\alpha}) < F(x_{i,j}, \bar{\alpha})] = \sum_i \sum_{j=2}^{n_i} t[F(x_{i,1}, \bar{\alpha}) - F(x_{i,j}, \bar{\alpha}) < 0].$$

The indicator function $t[x]$ is 1 if $x$ is true and 0 otherwise. Define the margin on example $x_{i,j}$ as $M_{i,j}(\bar{\alpha}) = F(x_{i,1}, \bar{\alpha}) - F(x_{i,j}, \bar{\alpha})$. All the loss functions that Collins defines can be written in terms of the margins.

Log-Likelihood Loss. The first loss function defines the conditional probability of $x_{i,1}$ being the correct parse for the $i$-th sentence as

$$P(x_{i,1}) = \frac{e^{F(x_{i,1})}}{\sum_{j=1}^{n_i} e^{F(x_{i,j})}}.$$

Now, maximizing the likelihood is equivalent to minimizing the negative log-likelihood. This negative log-loss is a function of the margins on the training data:

$$LogLoss(\bar{\alpha}) = -\sum_i \log \frac{e^{F(x_{i,1})}}{\sum_{j=1}^{n_i} e^{F(x_{i,j})}} = \sum_i \log \sum_{j=1}^{n_i} e^{-(F(x_{i,1}) - F(x_{i,j}))} = \sum_i \log \Big(1 + \sum_{j=2}^{n_i} e^{-M_{i,j}(\bar{\alpha})}\Big).$$

Boosting Loss. The second loss function is defined as

$$BoostLoss(\bar{\alpha}) = \sum_i \sum_{j=2}^{n_i} e^{-M_{i,j}(\bar{\alpha})},$$

which has $Error(\bar{\alpha})$ as a lower bound, so that minimizing $BoostLoss(\bar{\alpha})$ is closely related to minimizing the ranking error rate.

3.3 Optimization Methods

The naive approach would be to find parameter settings $\bar{\alpha}$ that minimize $Loss(\bar{\alpha})$, which can be either of the loss functions. But this risks overtraining due to the large number of features. Instead, Collins attempts to find a small subset of the features that contribute most to reducing the loss function by using a greedy algorithm: at each iteration, pick the feature $h_k$ and weight increment $\delta$ that have the most impact on the loss function. It helps to define:

$$Upd(\bar{\alpha}, k, \delta) = [\alpha_0, \ldots, \alpha_k + \delta, \ldots, \alpha_m]$$
$$BestWt(k, \bar{\alpha}) = \arg\min_\delta Loss(Upd(\bar{\alpha}, k, \delta))$$
$$BestLoss(k, \bar{\alpha}) = Loss(Upd(\bar{\alpha}, k, BestWt(k, \bar{\alpha}))).$$

$Upd(\bar{\alpha}, k, \delta)$ is an updated parameter vector, the same as $\bar{\alpha}$ except that $\alpha_k$ is incremented by $\delta$. $BestWt(k, \bar{\alpha})$ is the optimal increment $\delta$ to $\alpha_k$, the one that reduces the loss most. $BestLoss(k, \bar{\alpha})$ is the value of the loss function on the updated parameters. Then the algorithm for feature selection is:

1. Initialize $\bar{\alpha}^0$ to some value, e.g., $[1, 0, 0, \ldots]$.

2. For $t = 1 \ldots N$: find $(k^*, \delta^*) = \arg\min_{k,\delta} Loss(Upd(\bar{\alpha}^{t-1}, k, \delta))$ and set $\bar{\alpha}^t = Upd(\bar{\alpha}^{t-1}, k^*, \delta^*)$.

After computing $BestWt$ and $BestLoss$ for each feature, we can compute the optimal feature/weight pair as $k^* = \arg\min_k BestLoss(k, \bar{\alpha})$ and $\delta^* = BestWt(k^*, \bar{\alpha})$.
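
To ground the notation, here is a minimal sketch of the ranking function F, the margins, and the two losses. It assumes each candidate parse is stored as a pair of its base-model log probability and a set of binary features, with the best parse listed first for each sentence; the layout and numbers are illustrative, not Collins' data structures.

```python
import math

def F(parse, alpha0, alpha):
    """F(x, alpha) = alpha_0 * L(x) + sum_s alpha_s * h_s(x), binary features."""
    logprob, feats = parse
    return alpha0 * logprob + sum(alpha.get(s, 0.0) for s in feats)

def margins(candidates, alpha0, alpha):
    """M_{i,j} = F(x_{i,1}) - F(x_{i,j}) for j >= 2, for every sentence i."""
    out = []
    for cands in candidates:
        best = F(cands[0], alpha0, alpha)
        out.append([best - F(x, alpha0, alpha) for x in cands[1:]])
    return out

def log_loss(candidates, alpha0, alpha):
    return sum(math.log(1.0 + sum(math.exp(-m) for m in ms))
               for ms in margins(candidates, alpha0, alpha))

def boost_loss(candidates, alpha0, alpha):
    return sum(math.exp(-m)
               for ms in margins(candidates, alpha0, alpha) for m in ms)

# Two sentences, each with the best parse followed by two competitors.
candidates = [
    [(-10.2, {"f1", "f3"}), (-9.8, {"f2"}), (-11.0, {"f3"})],
    [(-20.1, {"f1"}), (-20.5, {"f2", "f3"}), (-22.0, set())],
]
alpha0, alpha = 1.0, {"f1": 0.5, "f2": -0.2, "f3": 0.1}
print(log_loss(candidates, alpha0, alpha), boost_loss(candidates, alpha0, alpha))
```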

We will investigate how $BestWt$ and $BestLoss$ can be computed for $BoostLoss$. At each iteration, $\alpha_0$ is first set to optimize $BoostLoss$: from equation (2), we see that we can perform a linear search to find

$$\alpha_0 = \arg\min_\alpha \sum_i \sum_{j=2}^{n_i} e^{-\alpha(L(x_{i,1}) - L(x_{i,j}))}.$$

For $BoostLoss$, note again from equation (2) that $F(x_{i,j}, Upd(\bar{\alpha}, k, \delta))$ is equal to $F(x_{i,j}, \bar{\alpha}) + \delta h_k(x_{i,j})$, so the margin on example $(i,j)$ has a simple update,

$$M_{i,j}(Upd(\bar{\alpha}, k, \delta)) = F(x_{i,1}, Upd(\bar{\alpha}, k, \delta)) - F(x_{i,j}, Upd(\bar{\alpha}, k, \delta)) = M_{i,j}(\bar{\alpha}) + \delta(h_k(x_{i,1}) - h_k(x_{i,j})),$$

which leads to a simple update for $BoostLoss$:

$$BoostLoss(Upd(\bar{\alpha}, k, \delta)) = \sum_i \sum_{j=2}^{n_i} e^{-M_{i,j}(\bar{\alpha}) - \delta(h_k(x_{i,1}) - h_k(x_{i,j}))}.$$

Recall that $h_s(x)$ is an indicator function on $x$, so that $h_k(x_{i,1}) - h_k(x_{i,j})$ is either $+1$, $-1$, or $0$. Collins partitions the training sample into three sets depending on this value,

$$A_k^+ = \{(i,j) : h_k(x_{i,1}) - h_k(x_{i,j}) = 1\}, \quad A_k^- = \{(i,j) : h_k(x_{i,1}) - h_k(x_{i,j}) = -1\}, \quad A_k^0 = \{(i,j) : h_k(x_{i,1}) - h_k(x_{i,j}) = 0\}.$$

Next, define $W_k^+ = \sum_{(i,j) \in A_k^+} e^{-M_{i,j}(\bar{\alpha})}$, with its two counterparts $W_k^-$ and $W_k^0$ defined analogously. The reason for all these definitions is solely to formulate the update rule for $BoostLoss$ as

$$BoostLoss(Upd(\bar{\alpha}, k, \delta)) = \sum_{(i,j) \in A_k^+} e^{-M_{i,j}(\bar{\alpha}) - \delta} + \sum_{(i,j) \in A_k^-} e^{-M_{i,j}(\bar{\alpha}) + \delta} + \sum_{(i,j) \in A_k^0} e^{-M_{i,j}(\bar{\alpha})} = e^{-\delta} W_k^+ + e^{\delta} W_k^- + W_k^0.$$

The purpose is to make it easy to differentiate $BoostLoss$ with respect to $\delta$ when minimizing the loss. This gives

$$BestWt(k, \bar{\alpha}) = \frac{1}{2} \log \frac{W_k^+}{W_k^-},$$

and when we plug this optimal $\delta$ into the update rule for $BoostLoss$, we obtain

$$BestLoss(k, \bar{\alpha}) = Z - \Big(\sqrt{W_k^+} - \sqrt{W_k^-}\Big)^2,$$

where $Z = \sum_i \sum_{j=2}^{n_i} e^{-M_{i,j}(\bar{\alpha})}$ is a constant across features. Collins proposes finding a smoothing parameter $\epsilon$ using cross-validation to prevent $BestWt$ from being undefined:

$$BestWt(k, \bar{\alpha}) = \frac{1}{2} \log \frac{W_k^+ + \epsilon Z}{W_k^- + \epsilon Z}.$$

We have achieved our goal for $BoostLoss$. Unfortunately, such a brisk closed form does not exist for $BestWt$ in the case of $LogLoss$, so Collins resorts to an iterative solution to find the value of $BestWt$ and uses it to calculate $BestLoss$.
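
The closed-form update is simple enough to sketch. The code below computes BestWt and BestLoss for one feature from precomputed margins and binary feature indicators, and then picks the best feature; the data layout mirrors the previous sketch and is an illustrative assumption, not Collins' implementation (which also relies on the sparse-update trick discussed in the next subsection).

```python
import math

def best_wt_and_loss(margins, h_best, h_others, eps=1e-3):
    """Closed-form BestWt and BestLoss under BoostLoss for one feature k.

    margins[i][j]  : M_{i,j}(alpha) for competitor j of sentence i
    h_best[i]      : h_k(x_{i,1}) in {0, 1}
    h_others[i][j] : h_k of the (j+2)-th parse of sentence i, in {0, 1}
    """
    W_plus = W_minus = Z = 0.0
    for i, ms in enumerate(margins):
        for j, m in enumerate(ms):
            w = math.exp(-m)
            Z += w
            diff = h_best[i] - h_others[i][j]
            if diff == 1:
                W_plus += w
            elif diff == -1:
                W_minus += w
    best_wt = 0.5 * math.log((W_plus + eps * Z) / (W_minus + eps * Z))
    best_loss = Z - (math.sqrt(W_plus) - math.sqrt(W_minus)) ** 2
    return best_wt, best_loss

def select_feature(margins, features):
    """Pick the feature with the smallest BestLoss; features maps k -> (h_best, h_others)."""
    scored = {k: best_wt_and_loss(margins, hb, ho) for k, (hb, ho) in features.items()}
    k_star = min(scored, key=lambda k: scored[k][1])
    return k_star, scored[k_star][0]

margins = [[0.4, 1.3], [-0.4, 1.9]]
features = {
    "f1": ([1, 1], [[0, 0], [0, 1]]),   # mostly fires on the best parses
    "f2": ([0, 0], [[1, 0], [1, 0]]),   # mostly fires on competitors
}
print(select_feature(margins, features))  # f1 is selected, with a positive weight
```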

3.4 Efficiency

While we cannot cover the efficiency issues as thoroughly as Collins does, it is enlightening to note them because they are important in the parsing problem in general; one usually has to train the parser on a huge data set (36,000 sentences with about 1,000,000 parse trees and 500,000 features in this case). In a nutshell, Collins exploits the fact that in the update from $\bar{\alpha}$ to $Upd(\bar{\alpha}, k^*, \delta^*)$ the values $W_k^+$ and $W_k^-$ remain unchanged for most features, so that they do not have to be re-calculated.

The use of a second model to re-rank the parse trees from the first model, using the selected features that contribute most to minimizing the error (estimated by noting how many times a non-best tree is incorrectly ranked higher than the best tree), gives a significant improvement: a 1.5% absolute increase in accuracy over the base model.

We have explored the three generative models and the re-ranking model of Collins (1997; 2000) in great detail. For the rest of the survey, we will be less rigorous, but will try to capture the overall ideas in a more sweeping manner. This is because the detailed account of those models was sufficient to illustrate the nature of the task, and because we do not want to be bogged down by every nut and bolt lest we fail to see the idea itself within our limited scope.

4 Consider the Entire Tree Space when Re-ranking

We have seen in the previous section that efficiency is a critical factor in parsing, since the training data tends to be enormous. This pressure pushes re-ranking systems toward small candidate lists, which poses a problem, because the set of n-best parses (rather than all of them) often does not contain the true parse. In Collins' re-ranking model (2000), for instance, 41% of the correct parses were not in the candidate pool of 30-best parses. Thus (Taskar et al., 2004) motivates a novel discriminative approach that allows one to efficiently learn a model which discriminates among the entire space of parse trees, as opposed to re-ranking the top few candidates. It uses the idea of finding the largest margin, which lies at the core of support vector machines (SVMs). Furthermore, it can condition on arbitrary features of input sentences, thereby leveraging additional lexical information without the cost of algorithmic complexity.

Taskar categorizes discriminative approaches to parsing into two kinds: re-ranking (the 2-pass system of the kind proposed by (Collins, 2000)) and dynamic programming (the kind of system in which candidate parse trees are recorded in a chart and subsequently used in decoding and parameter estimation with dynamic programming algorithms). It is the latter type of discriminative approach that is the subject of (Taskar et al., 2004). Taskar extends his previous max-margin approach to context-free grammars, presenting a dynamic programming approach to discriminative parsing that is an alternative to maximum entropy estimation. Unlike re-ranking methods, it is an end-to-end discriminative model over the full space of parses.

4.1 Max-Margin Estimation

The traditional method of estimating the parameters of PCFGs assumes a generative grammar that defines $P(x, y)$ and maximizes the joint log-likelihood $\sum_i \log P(x_i, y_i)$, where $x$ is a sentence and $y$ is a proposed tree. An alternative is to estimate the parameters discriminatively by maximizing the conditional log-likelihood, with

$$P_w(y \mid x) = \frac{e^{\langle w, \Phi(x,y) \rangle}}{\sum_{y' \in G(x)} e^{\langle w, \Phi(x,y') \rangle}},$$

where $G(x) \subseteq Y$ maps an input $x \in X$ to a set of candidate parses, $\langle \cdot, \cdot \rangle$ denotes the vector inner product, $w \in \mathbb{R}^d$, and $\Phi : X \times Y \rightarrow \mathbb{R}^d$ maps a sentence-tree pair to a feature vector.

Taskar advocates a different estimation criterion that uses the max-margin principle of SVMs. The key idea is to directly ensure that

$$y_i = \arg\max_{y \in G(x_i)} \langle w, \Phi(x_i, y) \rangle$$

for all sentences $x_i$ in the training data. The margin of the parameters $w$ on the example $x_i$ and proposed parse $y$ is defined as how much $y$ deviates from the true parse $y_i$:

$$\langle w, \Phi(x_i, y_i) \rangle - \langle w, \Phi(x_i, y) \rangle = \langle w, \Phi(x_i, y_i) - \Phi(x_i, y) \rangle.$$

We would like this margin to be large when the mistake $y$ is more idiotic, that is, when the loss function $L(x_i, y_i, y)$ (measuring the penalty for proposing the parse $y$ for $x_i$ when $y_i$ is the true parse) gives a large value. The optimization task is to maximize $\gamma$ such that

$$\langle w, \Phi(x_i, y_i) - \Phi(x_i, y) \rangle \geq \gamma\, L(x_i, y_i, y) \quad \forall y \in G(x_i), \qquad \|w\|^2 \leq 1.$$

After a standard transformation, an equivalent task is to minimize

$$\frac{1}{2} \|w\|^2 + C \sum_i \xi_i \quad \text{such that} \quad \langle w, \Phi(x_i, y_i) - \Phi(x_i, y) \rangle \geq L(x_i, y_i, y) - \xi_i \quad \forall y \in G(x_i), \quad (3)$$

where the slack variables $\xi_i \geq 0$ allow one to increase the global margin by paying a local penalty on some outliers, and the constant $C$ dictates the desired trade-off between margin size and outliers. Taskar derives the dual of problem (3) in order to use the kernel trick and to avoid the exponential number of constraints (one for each possible parse $y$ of each sentence $x_i$). Writing $\Phi_{i,y}$ for $\Phi(x_i, y)$, the dual is to maximize

$$C \sum_{i,y} \alpha_{i,y} L(x_i, y_i, y) - \frac{1}{2} \Big\| C \sum_{i,y} (I_{i,y} - \alpha_{i,y}) \Phi_{i,y} \Big\|^2 \quad \text{such that} \quad \sum_y \alpha_{i,y} = 1 \;\; \forall i, \quad \alpha_{i,y} \geq 0 \;\; \forall i, y,$$

where the $\alpha_{i,y}$ are the dual variables, $C$ is renormalized, and $I_{i,y}$ indicates whether $y$ is the true parse $y_i$. Given the dual solution $\alpha^*$, the solution $w^*$ to the primal problem (3) is simply a weighted linear combination of the feature vectors of the correct and incorrect parses:

$$w^* = C \sum_{i,y} (I_{i,y} - \alpha^*_{i,y}) \Phi_{i,y}.$$

So $\alpha^*$ corresponds to the original task in that parses with large $\alpha^*_{i,y}$ contribute more strongly to the model.

4.2 Efficiency

The number of variables and constraints is proportional to $|G(x)|$, which is generally exponential in the length of $x$, for both the primal and dual formulations. Taskar exploits the structure of grammars to derive an efficient dynamic programming decomposition, factoring the model by assigning scores to local parts of the parse. The key idea is to make the simplifying assumptions that the feature vector $\Phi$ and the loss function $L$ decompose over parts,

$$\Phi(x, y) = \sum_{r \in R(x,y)} \phi(x, r), \qquad L(x, y, \hat{y}) = \sum_{r \in R(x,\hat{y})} l(x, y, r),$$

where $\phi(x, r)$ maps a rule production and its position in the sentence $x$ to some feature vector representation, $l(x, y, r)$ is a local loss function, and $R(x, y)$ maps a derivation $y$ of $x$ to a finite set of parts. A part is defined as either $\langle A, s, e, i \rangle$ ($A$ a non-terminal, $s$ and $e$ start and end points, and $i$ the sentence index) or $\langle A \rightarrow B\,C, s, m, e, i \rangle$ (where $A \rightarrow B\,C$ is a particular rule and $m$ a split point); these are the parts labeled $r$ and $q$, respectively, in Taskar's figure.
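
The constrained problem (3) can equivalently be written as minimizing a structured hinge loss, which is easy to sketch if we pretend the candidate set G(x_i) can be enumerated explicitly. The following brute-force sketch is only meant to make the objective concrete; Taskar et al. instead solve the factored dual with dynamic programming, and all data, weights, and names here are toy assumptions.

```python
import numpy as np

def hinge_objective(w, examples, C=1.0):
    """0.5*||w||^2 + C * sum_i max_y [ L_i(y) - <w, Phi_i(y_i) - Phi_i(y)> ]_+

    examples: list of (phi_gold, candidates), where candidates is a list of
    (phi_y, loss_y) pairs that includes the gold parse with loss 0.
    """
    total = 0.5 * float(w @ w)
    for phi_gold, candidates in examples:
        slack = max(loss + float(w @ (phi - phi_gold)) for phi, loss in candidates)
        total += C * max(0.0, slack)
    return total

def subgradient_step(w, examples, C=1.0, lr=0.1):
    """One subgradient step; the inner argmax is loss-augmented inference."""
    grad = w.copy()
    for phi_gold, candidates in examples:
        phi_worst, _ = max(candidates,
                           key=lambda c: c[1] + float(w @ (c[0] - phi_gold)))
        grad += C * (phi_worst - phi_gold)
    return w - lr * grad

# Toy example: 3-dimensional features, one sentence, gold parse + two candidates.
phi_gold = np.array([1.0, 0.0, 1.0])
candidates = [(phi_gold, 0.0),
              (np.array([0.0, 1.0, 1.0]), 2.0),   # loss 2: two wrong parts
              (np.array([1.0, 1.0, 0.0]), 1.0)]   # loss 1: one wrong part
examples = [(phi_gold, candidates)]
w = np.zeros(3)
for _ in range(50):
    w = subgradient_step(w, examples, C=1.0, lr=0.05)
print(w, hinge_objective(w, examples))
```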

Taskar uses the decomposition assumptions to re-frame the original optimization problem in terms of a polynomial number of variables (cubic in the length of the sentence) and a polynomial number of constraints (quadratic). The result is a slight improvement (an absolute 0.43% increase) over Collins' re-ranking parser in F1 measure.

5 More Discriminative Re-ranking

The impact of (Collins, 2000) and the early figures who initially developed the schema for re-ranking candidate parse trees can be seen in the field's subsequent focus on discriminative re-ranking approaches. The last section described how (Taskar et al., 2004) attempted to efficiently discriminate the entire set of parse trees using dynamic programming. The final two papers we briefly look into (Charniak and Johnson, 2005; Huang, 2008) also address the re-ranking task.

5.1 Rank n-best Parses with Coarse-to-Fine Dynamic Programming, and Re-rank with MaxEnt

(Charniak and Johnson, 2005) proposes using a MaxEnt re-ranker to select the best parse from the 50-best parses returned by a generative parsing model, and presents a simple method for constructing sets of 50-best parses based on a coarse-to-fine generative parser. Notice the contrasting stance between (Charniak and Johnson, 2005) and (Taskar et al., 2004). The latter, as we have seen, suggests considering all candidate trees in re-ranking. The former, on the other hand, tries to come up with a relatively short candidate list of very high quality.

The main difficulty in n-best parsing, compared to 1-best parsing, is that dynamic programming is harder to apply, and replacing it with the more natural approaches using best-first search or beam search (e.g., we keep looking for the next best candidate as long as we want) results in a loss of efficiency. One way to retain the use of dynamic programming in n-best parsing is to exploit the local characteristic of the Viterbi algorithm: in the optimal parse, the parsing decisions at each of the choice points that determine a parse must be optimal. For instance, in the second-best parse, all but one of the parsing decisions must be optimal. Thus we can first find the best parse, then find the second-best parse, then the third-best, and so on.

Charniak presents a novel 2-pass method: for every edge, he stores the n best parses rather than a single best parse. The space problem ($O(nm^3)$, where $m$ is the length of the sentence) is mitigated by the 2-pass system. The first pass creates a crude version of the parse based on a much less complex version of the complete grammar (using coarse-grained dynamic programming states). The edges are then pruned according to

$$p(n^i_{j,k} \mid s) = \frac{\alpha(n^i_{j,k})\,\beta(n^i_{j,k})}{p(s)},$$

where $n^i_{j,k}$ is a constituent of type $i$ spanning the words from $j$ to $k$, $\alpha(n^i_{j,k})$ is the outside probability of this constituent, and $\beta(n^i_{j,k})$ is its inside probability. The parser removes all constituents $n^i_{j,k}$ whose probability falls below some threshold (here on the order of $10^{-4}$). The remaining edges are then exhaustively evaluated according to the fine-grained probabilistic model, which conditions on much richer information, such as the lexical head of a constituent's parent, the POS of this head, etc. In a sense, this 2-pass coarse-to-fine method divides the burden of spatial complexity.

The result of this dynamic programming n-best parsing algorithm with coarse-to-fine refinement is that the n-best parser's most probable parses are already of high quality, and applying a MaxEnt discriminative re-ranker further improves performance. That this selective choice of parses outperforms a less selective choice is illustrated by the comparison with the Collins parser's n-best trees: the new model's F-score is 91.02%, a statistically significant improvement over Collins' 90.37%.
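
The pruning criterion itself is a one-liner once the inside and outside scores under the coarse grammar are available. The sketch below assumes those scores have already been computed (e.g., with the inside-outside algorithm); the chart representation, threshold, and toy numbers are illustrative.

```python
# Coarse-to-fine pruning: keep a constituent only if its posterior
# alpha * beta / p(s) under the coarse grammar clears a threshold.
# Constituents are keyed by (label, start, end).

def prune_chart(inside, outside, sentence_prob, threshold=1e-4):
    """Return the constituents that survive pruning, with their posteriors."""
    kept = {}
    for item, beta in inside.items():
        alpha = outside.get(item, 0.0)
        posterior = alpha * beta / sentence_prob
        if posterior >= threshold:
            kept[item] = posterior
    return kept

inside = {("NP", 0, 2): 1e-3, ("VP", 2, 5): 2e-4, ("NP", 2, 4): 5e-9}
outside = {("NP", 0, 2): 4e-4, ("VP", 2, 5): 1e-3, ("NP", 2, 4): 2e-5}
p_s = 4e-7
for item, post in prune_chart(inside, outside, p_s).items():
    print(item, round(post, 4))
# ("NP",0,2) and ("VP",2,5) survive; ("NP",2,4) is pruned.
```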

5.2 Re-rank the Forests

The final model we consider re-ranks entire forests of parses (Huang, 2008). Huang shows that re-ranking a packed forest of exponentially many parses, enabled by an efficient approximation algorithm, results in the highest F-score we have seen so far, 91.7, outperforming both 50-best and 100-best re-ranking baselines. The reason we may consider going beyond discriminating only the top n candidate trees is that the limited scope of the n-best list rules out many potentially good alternatives. (Taskar et al., 2004) used dynamic programming to search the whole tree space, but all features were restricted to be local (recall the decomposition assumptions that looked at local windows within the factored search space). Huang attempts to expand this approach to include non-local features by forest re-ranking: non-local features are computed incrementally bottom-up, so that we can re-rank the n-best subtrees at all internal nodes, instead of only at the root node.

A packed parse forest is a compact representation of all the derivations for a given sentence under a CFG. For example, the verb phrase "saw him with a mirror" has a corresponding forest illustrating the ambiguity as to where to attach the prepositional phrase "with a mirror". Formally, a forest is a pair $\langle V, E \rangle$ where $V$ is the set of nodes (each a non-terminal spanning a portion of $s$) and $E$ is the set of hyperedges (each $e \in E$ is a pair $\langle tails(e), head(e) \rangle$, where $head(e) \in V$ is the consequent node, i.e., the left side of a grammar rule, and $tails(e) \subseteq V$ is the set of antecedent nodes, i.e., the right side of a grammar rule). So the VP forest in Huang's figure would be expressed by $\langle V, E \rangle$ where

$$V = \{VP_{1,6},\, VBD_{1,2},\, NP_{2,6},\, NP_{2,3},\, PP_{3,6}\}$$
$$E = \{\langle tails(e_1), head(e_1) \rangle, \langle tails(e_2), head(e_2) \rangle\} = \{\langle (VBD_{1,2}, NP_{2,3}, PP_{3,6}),\, VP_{1,6} \rangle,\; \langle (VBD_{1,2}, NP_{2,6}),\, VP_{1,6} \rangle\}.$$

Note the difference this makes with regard to the job of a re-ranker, choosing the best-scoring parse tree among the candidates:

$$\hat{y} = \arg\max_{y \in cand(s)} score(y).$$

In n-best re-ranking, $cand(s) = \{y_1 \ldots y_n\}$, whereas in forest re-ranking, $cand(s)$ is one forest implicitly containing exponentially many possible parses. The score function is defined as a weighted sum of features of the tree $y$: $score(y) = \langle w, f(y) \rangle$, where $f = (f_1 \ldots f_d)$ is a feature extractor for $d$ features. Each feature basically counts the occurrences of a structural instance in a tree, such as a VP of length 5 being surrounded by the word "has" and a period. To learn the weights, Huang uses the averaged perceptron algorithm, using for each sentence an oracle parse $y_i^+$ (we cannot use the gold standard $y_i^*$ in the training data because the grammar may fail to cover the gold parse), estimated by maximizing the F-score against the gold standard among the candidates:

$$y_i^+ = \arg\max_{y \in cand(s_i)} F(y, y_i^*).$$

(The pseudocode for the perceptron is given as a figure in Huang's paper.)

Another challenge Huang tries to overcome is to incorporate non-local features (intuitively speaking, the ones for which we do not yet have complete information) as well as local ones. He first splits the feature extractor $f = (f_1 \ldots f_d)$ into local and non-local parts, $f = (f_L; f_N)$. Computing the local features is easy because we have the information of all the edges, so we can pre-compute $f_L(e)$ for every edge $e$ in a forest and sum them up, $f_L(y) = \sum_{e \in y} f_L(e)$. The non-local features are not so straightforward, because we cannot observe the required edges immediately. However, we want to compute them as early as possible, so we compute each feature at the smallest common ancestor that contains all its antecedents. More precisely, we factor non-local features across subtrees: for each subtree $y'$ of a parse $y$, we obtain the part $\hat{f}_N(y')$ (called a unit feature) of the feature $f(y)$ that is computable within $y'$. In this way, we can build unit features bottom-up to compute the non-local features, $f_N(y) = \sum_{y' \subseteq y} \hat{f}_N(y')$.
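
A packed forest and the local part of the scoring are easy to write down. The sketch below represents the "saw him with a mirror" ambiguity as nodes and hyperedges and runs a bottom-up Viterbi search over hyperedge-local scores; Huang's re-ranker additionally folds non-local unit features into an approximate n-best computation at each node, which is omitted here. All names and weights are illustrative.

```python
from collections import namedtuple

Hyperedge = namedtuple("Hyperedge", ["head", "tails", "score"])  # score = w . f_L(e)

# Two hyperedges derive VP[1,6], sharing the sub-forests below them.
hyperedges = [
    Hyperedge("VP[1,6]", ("VBD[1,2]", "NP[2,3]", "PP[3,6]"), score=1.2),  # VP -> V NP PP
    Hyperedge("VP[1,6]", ("VBD[1,2]", "NP[2,6]"), score=0.9),             # VP -> V NP
    Hyperedge("NP[2,6]", ("NP[2,3]", "PP[3,6]"), score=0.7),              # NP -> NP PP
]
leaves = {"VBD[1,2]": 0.0, "NP[2,3]": 0.0, "PP[3,6]": 0.0}

def viterbi(root, hyperedges, leaves):
    """Best derivation score and backpointers, computed bottom-up over nodes."""
    best, back = dict(leaves), {}
    incoming = {}
    for e in hyperedges:
        incoming.setdefault(e.head, []).append(e)

    def solve(node):
        if node in best:
            return best[node]
        score, edge = max(((e.score + sum(solve(t) for t in e.tails), e)
                           for e in incoming[node]), key=lambda p: p[0])
        best[node], back[node] = score, edge
        return score

    return solve(root), back

score, back = viterbi("VP[1,6]", hyperedges, leaves)
print(score)                  # 1.6: attaching the PP to the NP wins here
print(back["VP[1,6]"].tails)  # ('VBD[1,2]', 'NP[2,6]')
```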

Huang presents an efficient dynamic programming algorithm to compute the forest oracle and pruned forests, but we do not have space here to elaborate on them. Huang achieves a 0.26% absolute improvement over 50-best re-ranking.

6 Conclusion

In this whirlwind tour of the recent developments in statistical parsing, we have investigated increasingly sophisticated algorithmic approaches to learning better from the training data provided by hand-annotated corpora such as the Penn Treebank. The generative models and the re-ranking schema proposed by (Collins, 1997; Collins, 2000) illustrated many of the most crucial components of statistical parsing, such as the formulation of the task as an optimization problem and the benefit of reconsidering the plausibility of the candidate trees. Others joined to suggest various methods, pushing parsing accuracy further. (Taskar et al., 2004) used the max-margin technique from SVMs to efficiently discriminate the entire space of parse trees with dynamic programming. (Charniak and Johnson, 2005) presented a simple way to produce a high-quality list of n-best parses to accommodate re-ranking. (Huang, 2008) showed that re-ranking whole forests (packed parses that include the ambiguity information of sentences) overcomes both the limited scope of n-best parsing and the locality constraint of the conventional dynamic programming approaches.

The highest accuracy seen in this survey is 91.69% (F-score) on the test data. This is an impressive feat. In light of the strengths of the statistical solution to parsing that we considered in the introduction (error-tolerant, easier to build, etc.), such high performance confirms that we are on the right track to solving the problem of parsing. However, the truth is that the current solution is far from enough. According to the accuracy score (which in real applications will likely be much lower, since they will involve domains outside the test data), we get a wrong parse roughly once in every ten inputs. Much worse, as we have glimpsed, a sentence becomes exponentially harder to parse (corresponding to the exponentially bigger search space) as it increases in length. But a machine must be able to parse sentences of non-trivial length if it is to exhibit a level of non-trivial intelligence. In other words, it is the long sentences rather than the short ones that we really want to be able to parse accurately. Without this ability, it is doubtful machines will ever be capable of performing deep interactions in language.

Note that the rate of increase in performance drops continuously as we move from the early phase of statistical parsing to the later phase. For instance, (Collins, 1997) achieved an absolute improvement of 2.3% over the previous best-performing parser, whereas (Huang, 2008) gained only 0.26% despite the use of incredibly complicated methods. At this rate, we will never reach the point of 99.9% accuracy, which is what we really need.

Why are we not successful? Perhaps my personal experience of acquiring the English language may shed some light. As I started learning to read English texts at the late age of 15, the most difficult challenge was (not surprisingly) understanding the structure of a sentence, i.e., parsing.
It was extremely hard to tell where the noun phrase ended, whether the gerund was a verb or an adjective, which was the head verb phrase of the sentence, and so on. Without knowing them, it was impossible to make any sense of the text.

How did I overcome the problem? I can say I learned to parse statistically, by reading hundreds of books and thereby accumulating a large data set to consult. Although the set was not annotated like the Penn Treebank, by carefully decoding the structure of sentences in the early phase, I had access to information about which parses were good and which were not. There was a certain take-off point in my ability to understand English. Prior to that point, my interaction with the language went almost entirely into syntactic learning, and it meant little semantically. Put differently, I was reading, but I was not understanding, disoriented too much by the challenge of parsing. Then, at some point, I became good enough at parsing that I could finally use the task in making semantic interpretations of the text, rather than being enslaved by it. My parsing skills were incomplete, yet the ability to do semantic analysis in turn greatly helped me parse subsequent sentences better (e.g., using common sense to attach a prepositional phrase to the right target). The bootstrapping process suddenly became blindingly fast, and before I knew it I was fluent in the language.

In this regard, perhaps the state-of-the-art performance of the statistical parsers of today corresponds to the stage immediately before the take-off point. This suggests that, as many non-statistical NLP researchers assert, statistics alone will not take us there. But such a vague assertion is not very helpful. We must find a method to simulate the bootstrapping process between the syntactic and semantic information of the text. Without it, no matter how cleverly the statistical parsers are trained, they will likely stay at the current level of 90%. This needs to be a central focus in AI research, since it lies at the very foundation of language intelligence, without which AI will never truly connect with humans. On the bright side, this suggests that we are not too far away from the take-off point that I experienced. Once we reach it, the rate of advancement in machine intelligence will be phenomenal.

Acknowledgements

The greatest thanks go to Professor Dan Gildea for suggesting this topic to me. May his work continue to pioneer and inspire the field of statistical NLP.

References

Joshua Goodman. 1999. Semiring Parsing. ACL.

John Cocke and Jacob T. Schwartz. 1970. Programming languages and their compilers: Preliminary notes. Technical report, Courant Institute of Mathematical Sciences, New York University.

T. Kasami. 1965. An efficient recognition and syntax-analysis algorithm for context-free languages. Scientific report AFCRL, Air Force Cambridge Research Lab, Bedford, MA.

Daniel H. Younger. 1967. Recognition and parsing of context-free languages in time n^3. Information and Control 10(2).

J. Earley. 1970. An efficient context-free parsing algorithm. Communications of the Association for Computing Machinery 13(2).

Masaru Tomita. 1984. LR parsers for natural languages. COLING: 10th International Conference on Computational Linguistics.

M. Marcus, B. Santorini, and M. Marcinkiewicz. 1993. Building a Large Annotated Corpus of English: the Penn Treebank. Computational Linguistics.

Michael Collins. 1997. Three Generative, Lexicalised Models for Statistical Parsing. EACL '97: Proceedings of the Eighth Conference of the European Chapter of the Association for Computational Linguistics.

Michael Collins. 2000. Discriminative Reranking for Natural Language Parsing. Proceedings of the International Conference on Machine Learning (ICML).
Ben Taskar, Dan Klein, Michael Collins, Daphne Koller, and Christopher Manning. 2004. Max-Margin Parsing. Empirical Methods in Natural Language Processing (EMNLP 2004).

Eugene Charniak and Mark Johnson. 2005. Coarse-to-fine n-best parsing and MaxEnt discriminative reranking. ACL.

Liang Huang. 2008. Forest Reranking: Discriminative Parsing with Non-Local Features. ACL.


More information

Probabilistic Context Free Grammars. Many slides from Michael Collins

Probabilistic Context Free Grammars. Many slides from Michael Collins Probabilistic Context Free Grammars Many slides from Michael Collins Overview I Probabilistic Context-Free Grammars (PCFGs) I The CKY Algorithm for parsing with PCFGs A Probabilistic Context-Free Grammar

More information

Parsing. Based on presentations from Chris Manning s course on Statistical Parsing (Stanford)

Parsing. Based on presentations from Chris Manning s course on Statistical Parsing (Stanford) Parsing Based on presentations from Chris Manning s course on Statistical Parsing (Stanford) S N VP V NP D N John hit the ball Levels of analysis Level Morphology/Lexical POS (morpho-synactic), WSD Elements

More information

Probabilistic Context-Free Grammars. Michael Collins, Columbia University

Probabilistic Context-Free Grammars. Michael Collins, Columbia University Probabilistic Context-Free Grammars Michael Collins, Columbia University Overview Probabilistic Context-Free Grammars (PCFGs) The CKY Algorithm for parsing with PCFGs A Probabilistic Context-Free Grammar

More information

Machine Learning for NLP

Machine Learning for NLP Machine Learning for NLP Uppsala University Department of Linguistics and Philology Slides borrowed from Ryan McDonald, Google Research Machine Learning for NLP 1(50) Introduction Linear Classifiers Classifiers

More information

Multilevel Coarse-to-Fine PCFG Parsing

Multilevel Coarse-to-Fine PCFG Parsing Multilevel Coarse-to-Fine PCFG Parsing Eugene Charniak, Mark Johnson, Micha Elsner, Joseph Austerweil, David Ellis, Isaac Haxton, Catherine Hill, Shrivaths Iyengar, Jeremy Moore, Michael Pozar, and Theresa

More information

Chapter 14 (Partially) Unsupervised Parsing

Chapter 14 (Partially) Unsupervised Parsing Chapter 14 (Partially) Unsupervised Parsing The linguistically-motivated tree transformations we discussed previously are very effective, but when we move to a new language, we may have to come up with

More information

Intelligent Systems (AI-2)

Intelligent Systems (AI-2) Intelligent Systems (AI-2) Computer Science cpsc422, Lecture 19 Oct, 24, 2016 Slide Sources Raymond J. Mooney University of Texas at Austin D. Koller, Stanford CS - Probabilistic Graphical Models D. Page,

More information

c(a) = X c(a! Ø) (13.1) c(a! Ø) ˆP(A! Ø A) = c(a)

c(a) = X c(a! Ø) (13.1) c(a! Ø) ˆP(A! Ø A) = c(a) Chapter 13 Statistical Parsg Given a corpus of trees, it is easy to extract a CFG and estimate its parameters. Every tree can be thought of as a CFG derivation, and we just perform relative frequency estimation

More information

Lab 12: Structured Prediction

Lab 12: Structured Prediction December 4, 2014 Lecture plan structured perceptron application: confused messages application: dependency parsing structured SVM Class review: from modelization to classification What does learning mean?

More information

Machine Learning for Structured Prediction

Machine Learning for Structured Prediction Machine Learning for Structured Prediction Grzegorz Chrupa la National Centre for Language Technology School of Computing Dublin City University NCLT Seminar Grzegorz Chrupa la (DCU) Machine Learning for

More information

10/17/04. Today s Main Points

10/17/04. Today s Main Points Part-of-speech Tagging & Hidden Markov Model Intro Lecture #10 Introduction to Natural Language Processing CMPSCI 585, Fall 2004 University of Massachusetts Amherst Andrew McCallum Today s Main Points

More information

Probabilistic Context-Free Grammar

Probabilistic Context-Free Grammar Probabilistic Context-Free Grammar Petr Horáček, Eva Zámečníková and Ivana Burgetová Department of Information Systems Faculty of Information Technology Brno University of Technology Božetěchova 2, 612

More information

Probabilistic Context Free Grammars. Many slides from Michael Collins and Chris Manning

Probabilistic Context Free Grammars. Many slides from Michael Collins and Chris Manning Probabilistic Context Free Grammars Many slides from Michael Collins and Chris Manning Overview I Probabilistic Context-Free Grammars (PCFGs) I The CKY Algorithm for parsing with PCFGs A Probabilistic

More information

Natural Language Processing 1. lecture 7: constituent parsing. Ivan Titov. Institute for Logic, Language and Computation

Natural Language Processing 1. lecture 7: constituent parsing. Ivan Titov. Institute for Logic, Language and Computation atural Language Processing 1 lecture 7: constituent parsing Ivan Titov Institute for Logic, Language and Computation Outline Syntax: intro, CFGs, PCFGs PCFGs: Estimation CFGs: Parsing PCFGs: Parsing Parsing

More information

2.2 Structured Prediction

2.2 Structured Prediction The hinge loss (also called the margin loss), which is optimized by the SVM, is a ramp function that has slope 1 when yf(x) < 1 and is zero otherwise. Two other loss functions squared loss and exponential

More information

10 : HMM and CRF. 1 Case Study: Supervised Part-of-Speech Tagging

10 : HMM and CRF. 1 Case Study: Supervised Part-of-Speech Tagging 10-708: Probabilistic Graphical Models 10-708, Spring 2018 10 : HMM and CRF Lecturer: Kayhan Batmanghelich Scribes: Ben Lengerich, Michael Kleyman 1 Case Study: Supervised Part-of-Speech Tagging We will

More information

Maxent Models and Discriminative Estimation

Maxent Models and Discriminative Estimation Maxent Models and Discriminative Estimation Generative vs. Discriminative models (Reading: J+M Ch6) Introduction So far we ve looked at generative models Language models, Naive Bayes But there is now much

More information

A Syntax-based Statistical Machine Translation Model. Alexander Friedl, Georg Teichtmeister

A Syntax-based Statistical Machine Translation Model. Alexander Friedl, Georg Teichtmeister A Syntax-based Statistical Machine Translation Model Alexander Friedl, Georg Teichtmeister 4.12.2006 Introduction The model Experiment Conclusion Statistical Translation Model (STM): - mathematical model

More information

CMPT-825 Natural Language Processing. Why are parsing algorithms important?

CMPT-825 Natural Language Processing. Why are parsing algorithms important? CMPT-825 Natural Language Processing Anoop Sarkar http://www.cs.sfu.ca/ anoop October 26, 2010 1/34 Why are parsing algorithms important? A linguistic theory is implemented in a formal system to generate

More information

Lecture 15. Probabilistic Models on Graph

Lecture 15. Probabilistic Models on Graph Lecture 15. Probabilistic Models on Graph Prof. Alan Yuille Spring 2014 1 Introduction We discuss how to define probabilistic models that use richly structured probability distributions and describe how

More information

Lecture Notes on Inductive Definitions

Lecture Notes on Inductive Definitions Lecture Notes on Inductive Definitions 15-312: Foundations of Programming Languages Frank Pfenning Lecture 2 August 28, 2003 These supplementary notes review the notion of an inductive definition and give

More information

CMSC 723: Computational Linguistics I Session #5 Hidden Markov Models. The ischool University of Maryland. Wednesday, September 30, 2009

CMSC 723: Computational Linguistics I Session #5 Hidden Markov Models. The ischool University of Maryland. Wednesday, September 30, 2009 CMSC 723: Computational Linguistics I Session #5 Hidden Markov Models Jimmy Lin The ischool University of Maryland Wednesday, September 30, 2009 Today s Agenda The great leap forward in NLP Hidden Markov

More information

Lecture Notes on Inductive Definitions

Lecture Notes on Inductive Definitions Lecture Notes on Inductive Definitions 15-312: Foundations of Programming Languages Frank Pfenning Lecture 2 September 2, 2004 These supplementary notes review the notion of an inductive definition and

More information

This kind of reordering is beyond the power of finite transducers, but a synchronous CFG can do this.

This kind of reordering is beyond the power of finite transducers, but a synchronous CFG can do this. Chapter 12 Synchronous CFGs Synchronous context-free grammars are a generalization of CFGs that generate pairs of related strings instead of single strings. They are useful in many situations where one

More information

Algorithms for NLP. Classifica(on III. Taylor Berg- Kirkpatrick CMU Slides: Dan Klein UC Berkeley

Algorithms for NLP. Classifica(on III. Taylor Berg- Kirkpatrick CMU Slides: Dan Klein UC Berkeley Algorithms for NLP Classifica(on III Taylor Berg- Kirkpatrick CMU Slides: Dan Klein UC Berkeley The Perceptron, Again Start with zero weights Visit training instances one by one Try to classify If correct,

More information

Driving Semantic Parsing from the World s Response

Driving Semantic Parsing from the World s Response Driving Semantic Parsing from the World s Response James Clarke, Dan Goldwasser, Ming-Wei Chang, Dan Roth Cognitive Computation Group University of Illinois at Urbana-Champaign CoNLL 2010 Clarke, Goldwasser,

More information

Part of Speech Tagging: Viterbi, Forward, Backward, Forward- Backward, Baum-Welch. COMP-599 Oct 1, 2015

Part of Speech Tagging: Viterbi, Forward, Backward, Forward- Backward, Baum-Welch. COMP-599 Oct 1, 2015 Part of Speech Tagging: Viterbi, Forward, Backward, Forward- Backward, Baum-Welch COMP-599 Oct 1, 2015 Announcements Research skills workshop today 3pm-4:30pm Schulich Library room 313 Start thinking about

More information

Algorithms for Syntax-Aware Statistical Machine Translation

Algorithms for Syntax-Aware Statistical Machine Translation Algorithms for Syntax-Aware Statistical Machine Translation I. Dan Melamed, Wei Wang and Ben Wellington ew York University Syntax-Aware Statistical MT Statistical involves machine learning (ML) seems crucial

More information

CS460/626 : Natural Language

CS460/626 : Natural Language CS460/626 : Natural Language Processing/Speech, NLP and the Web (Lecture 23, 24 Parsing Algorithms; Parsing in case of Ambiguity; Probabilistic Parsing) Pushpak Bhattacharyya CSE Dept., IIT Bombay 8 th,

More information

Lecture 12: Algorithms for HMMs

Lecture 12: Algorithms for HMMs Lecture 12: Algorithms for HMMs Nathan Schneider (some slides from Sharon Goldwater; thanks to Jonathan May for bug fixes) ENLP 26 February 2018 Recap: tagging POS tagging is a sequence labelling task.

More information

DT2118 Speech and Speaker Recognition

DT2118 Speech and Speaker Recognition DT2118 Speech and Speaker Recognition Language Modelling Giampiero Salvi KTH/CSC/TMH giampi@kth.se VT 2015 1 / 56 Outline Introduction Formal Language Theory Stochastic Language Models (SLM) N-gram Language

More information

Intelligent Systems (AI-2)

Intelligent Systems (AI-2) Intelligent Systems (AI-2) Computer Science cpsc422, Lecture 19 Oct, 23, 2015 Slide Sources Raymond J. Mooney University of Texas at Austin D. Koller, Stanford CS - Probabilistic Graphical Models D. Page,

More information

Sequential Supervised Learning

Sequential Supervised Learning Sequential Supervised Learning Many Application Problems Require Sequential Learning Part-of of-speech Tagging Information Extraction from the Web Text-to to-speech Mapping Part-of of-speech Tagging Given

More information

CMU at SemEval-2016 Task 8: Graph-based AMR Parsing with Infinite Ramp Loss

CMU at SemEval-2016 Task 8: Graph-based AMR Parsing with Infinite Ramp Loss CMU at SemEval-2016 Task 8: Graph-based AMR Parsing with Infinite Ramp Loss Jeffrey Flanigan Chris Dyer Noah A. Smith Jaime Carbonell School of Computer Science, Carnegie Mellon University, Pittsburgh,

More information

Multiword Expression Identification with Tree Substitution Grammars

Multiword Expression Identification with Tree Substitution Grammars Multiword Expression Identification with Tree Substitution Grammars Spence Green, Marie-Catherine de Marneffe, John Bauer, and Christopher D. Manning Stanford University EMNLP 2011 Main Idea Use syntactic

More information

A Support Vector Method for Multivariate Performance Measures

A Support Vector Method for Multivariate Performance Measures A Support Vector Method for Multivariate Performance Measures Thorsten Joachims Cornell University Department of Computer Science Thanks to Rich Caruana, Alexandru Niculescu-Mizil, Pierre Dupont, Jérôme

More information

Linear Classifiers IV

Linear Classifiers IV Universität Potsdam Institut für Informatik Lehrstuhl Linear Classifiers IV Blaine Nelson, Tobias Scheffer Contents Classification Problem Bayesian Classifier Decision Linear Classifiers, MAP Models Logistic

More information

Santa Claus Schedules Jobs on Unrelated Machines

Santa Claus Schedules Jobs on Unrelated Machines Santa Claus Schedules Jobs on Unrelated Machines Ola Svensson (osven@kth.se) Royal Institute of Technology - KTH Stockholm, Sweden March 22, 2011 arxiv:1011.1168v2 [cs.ds] 21 Mar 2011 Abstract One of the

More information

Introduction to Computational Linguistics

Introduction to Computational Linguistics Introduction to Computational Linguistics Olga Zamaraeva (2018) Based on Bender (prev. years) University of Washington May 3, 2018 1 / 101 Midterm Project Milestone 2: due Friday Assgnments 4& 5 due dates

More information

CSE 417T: Introduction to Machine Learning. Final Review. Henry Chai 12/4/18

CSE 417T: Introduction to Machine Learning. Final Review. Henry Chai 12/4/18 CSE 417T: Introduction to Machine Learning Final Review Henry Chai 12/4/18 Overfitting Overfitting is fitting the training data more than is warranted Fitting noise rather than signal 2 Estimating! "#$

More information

Empirical Methods in Natural Language Processing Lecture 11 Part-of-speech tagging and HMMs

Empirical Methods in Natural Language Processing Lecture 11 Part-of-speech tagging and HMMs Empirical Methods in Natural Language Processing Lecture 11 Part-of-speech tagging and HMMs (based on slides by Sharon Goldwater and Philipp Koehn) 21 February 2018 Nathan Schneider ENLP Lecture 11 21

More information

Low-Dimensional Discriminative Reranking. Jagadeesh Jagarlamudi and Hal Daume III University of Maryland, College Park

Low-Dimensional Discriminative Reranking. Jagadeesh Jagarlamudi and Hal Daume III University of Maryland, College Park Low-Dimensional Discriminative Reranking Jagadeesh Jagarlamudi and Hal Daume III University of Maryland, College Park Discriminative Reranking Useful for many NLP tasks Enables us to use arbitrary features

More information

Lecture 12: Algorithms for HMMs

Lecture 12: Algorithms for HMMs Lecture 12: Algorithms for HMMs Nathan Schneider (some slides from Sharon Goldwater; thanks to Jonathan May for bug fixes) ENLP 17 October 2016 updated 9 September 2017 Recap: tagging POS tagging is a

More information

CS626: NLP, Speech and the Web. Pushpak Bhattacharyya CSE Dept., IIT Bombay Lecture 14: Parsing Algorithms 30 th August, 2012

CS626: NLP, Speech and the Web. Pushpak Bhattacharyya CSE Dept., IIT Bombay Lecture 14: Parsing Algorithms 30 th August, 2012 CS626: NLP, Speech and the Web Pushpak Bhattacharyya CSE Dept., IIT Bombay Lecture 14: Parsing Algorithms 30 th August, 2012 Parsing Problem Semantics Part of Speech Tagging NLP Trinity Morph Analysis

More information

Natural Language Processing

Natural Language Processing SFU NatLangLab Natural Language Processing Anoop Sarkar anoopsarkar.github.io/nlp-class Simon Fraser University October 9, 2018 0 Natural Language Processing Anoop Sarkar anoopsarkar.github.io/nlp-class

More information

Probabilistic Graphical Models: MRFs and CRFs. CSE628: Natural Language Processing Guest Lecturer: Veselin Stoyanov

Probabilistic Graphical Models: MRFs and CRFs. CSE628: Natural Language Processing Guest Lecturer: Veselin Stoyanov Probabilistic Graphical Models: MRFs and CRFs CSE628: Natural Language Processing Guest Lecturer: Veselin Stoyanov Why PGMs? PGMs can model joint probabilities of many events. many techniques commonly

More information

The Noisy Channel Model and Markov Models

The Noisy Channel Model and Markov Models 1/24 The Noisy Channel Model and Markov Models Mark Johnson September 3, 2014 2/24 The big ideas The story so far: machine learning classifiers learn a function that maps a data item X to a label Y handle

More information

On the Sizes of Decision Diagrams Representing the Set of All Parse Trees of a Context-free Grammar

On the Sizes of Decision Diagrams Representing the Set of All Parse Trees of a Context-free Grammar Proceedings of Machine Learning Research vol 73:153-164, 2017 AMBN 2017 On the Sizes of Decision Diagrams Representing the Set of All Parse Trees of a Context-free Grammar Kei Amii Kyoto University Kyoto

More information

Lecture 13: Structured Prediction

Lecture 13: Structured Prediction Lecture 13: Structured Prediction Kai-Wei Chang CS @ University of Virginia kw@kwchang.net Couse webpage: http://kwchang.net/teaching/nlp16 CS6501: NLP 1 Quiz 2 v Lectures 9-13 v Lecture 12: before page

More information

Statistical NLP Spring A Discriminative Approach

Statistical NLP Spring A Discriminative Approach Statistical NLP Spring 2008 Lecture 6: Classification Dan Klein UC Berkeley A Discriminative Approach View WSD as a discrimination task (regression, really) P(sense context:jail, context:county, context:feeding,

More information

CS838-1 Advanced NLP: Hidden Markov Models

CS838-1 Advanced NLP: Hidden Markov Models CS838-1 Advanced NLP: Hidden Markov Models Xiaojin Zhu 2007 Send comments to jerryzhu@cs.wisc.edu 1 Part of Speech Tagging Tag each word in a sentence with its part-of-speech, e.g., The/AT representative/nn

More information

Tensor Decomposition for Fast Parsing with Latent-Variable PCFGs

Tensor Decomposition for Fast Parsing with Latent-Variable PCFGs Tensor Decomposition for Fast Parsing with Latent-Variable PCFGs Shay B. Cohen and Michael Collins Department of Computer Science Columbia University New York, NY 10027 scohen,mcollins@cs.columbia.edu

More information

Natural Language Processing

Natural Language Processing Natural Language Processing Global linear models Based on slides from Michael Collins Globally-normalized models Why do we decompose to a sequence of decisions? Can we directly estimate the probability

More information

Machine Learning for NLP

Machine Learning for NLP Machine Learning for NLP Linear Models Joakim Nivre Uppsala University Department of Linguistics and Philology Slides adapted from Ryan McDonald, Google Research Machine Learning for NLP 1(26) Outline

More information

Statistical Machine Translation

Statistical Machine Translation Statistical Machine Translation -tree-based models (cont.)- Artem Sokolov Computerlinguistik Universität Heidelberg Sommersemester 2015 material from P. Koehn, S. Riezler, D. Altshuler Bottom-Up Decoding

More information

Statistical Methods for NLP

Statistical Methods for NLP Statistical Methods for NLP Information Extraction, Hidden Markov Models Sameer Maskey Week 5, Oct 3, 2012 *many slides provided by Bhuvana Ramabhadran, Stanley Chen, Michael Picheny Speech Recognition

More information

Multiclass Classification-1

Multiclass Classification-1 CS 446 Machine Learning Fall 2016 Oct 27, 2016 Multiclass Classification Professor: Dan Roth Scribe: C. Cheng Overview Binary to multiclass Multiclass SVM Constraint classification 1 Introduction Multiclass

More information

Hidden Markov Models

Hidden Markov Models CS769 Spring 2010 Advanced Natural Language Processing Hidden Markov Models Lecturer: Xiaojin Zhu jerryzhu@cs.wisc.edu 1 Part-of-Speech Tagging The goal of Part-of-Speech (POS) tagging is to label each

More information

Soft Inference and Posterior Marginals. September 19, 2013

Soft Inference and Posterior Marginals. September 19, 2013 Soft Inference and Posterior Marginals September 19, 2013 Soft vs. Hard Inference Hard inference Give me a single solution Viterbi algorithm Maximum spanning tree (Chu-Liu-Edmonds alg.) Soft inference

More information

A* Search. 1 Dijkstra Shortest Path

A* Search. 1 Dijkstra Shortest Path A* Search Consider the eight puzzle. There are eight tiles numbered 1 through 8 on a 3 by three grid with nine locations so that one location is left empty. We can move by sliding a tile adjacent to the

More information

Global Machine Learning for Spatial Ontology Population

Global Machine Learning for Spatial Ontology Population Global Machine Learning for Spatial Ontology Population Parisa Kordjamshidi, Marie-Francine Moens KU Leuven, Belgium Abstract Understanding spatial language is important in many applications such as geographical

More information

SVAN 2016 Mini Course: Stochastic Convex Optimization Methods in Machine Learning

SVAN 2016 Mini Course: Stochastic Convex Optimization Methods in Machine Learning SVAN 2016 Mini Course: Stochastic Convex Optimization Methods in Machine Learning Mark Schmidt University of British Columbia, May 2016 www.cs.ubc.ca/~schmidtm/svan16 Some images from this lecture are

More information

ACS Introduction to NLP Lecture 2: Part of Speech (POS) Tagging

ACS Introduction to NLP Lecture 2: Part of Speech (POS) Tagging ACS Introduction to NLP Lecture 2: Part of Speech (POS) Tagging Stephen Clark Natural Language and Information Processing (NLIP) Group sc609@cam.ac.uk The POS Tagging Problem 2 England NNP s POS fencers

More information

CS 446 Machine Learning Fall 2016 Nov 01, Bayesian Learning

CS 446 Machine Learning Fall 2016 Nov 01, Bayesian Learning CS 446 Machine Learning Fall 206 Nov 0, 206 Bayesian Learning Professor: Dan Roth Scribe: Ben Zhou, C. Cervantes Overview Bayesian Learning Naive Bayes Logistic Regression Bayesian Learning So far, we

More information

Latent Variable Models in NLP

Latent Variable Models in NLP Latent Variable Models in NLP Aria Haghighi with Slav Petrov, John DeNero, and Dan Klein UC Berkeley, CS Division Latent Variable Models Latent Variable Models Latent Variable Models Observed Latent Variable

More information

Probabilistic Graphical Models

Probabilistic Graphical Models School of Computer Science Probabilistic Graphical Models Max-margin learning of GM Eric Xing Lecture 28, Apr 28, 2014 b r a c e Reading: 1 Classical Predictive Models Input and output space: Predictive

More information

Natural Language Processing. Slides from Andreas Vlachos, Chris Manning, Mihai Surdeanu

Natural Language Processing. Slides from Andreas Vlachos, Chris Manning, Mihai Surdeanu Natural Language Processing Slides from Andreas Vlachos, Chris Manning, Mihai Surdeanu Projects Project descriptions due today! Last class Sequence to sequence models Attention Pointer networks Today Weak

More information

NLP Programming Tutorial 11 - The Structured Perceptron

NLP Programming Tutorial 11 - The Structured Perceptron NLP Programming Tutorial 11 - The Structured Perceptron Graham Neubig Nara Institute of Science and Technology (NAIST) 1 Prediction Problems Given x, A book review Oh, man I love this book! This book is

More information

Grammars and introduction to machine learning. Computers Playing Jeopardy! Course Stony Brook University

Grammars and introduction to machine learning. Computers Playing Jeopardy! Course Stony Brook University Grammars and introduction to machine learning Computers Playing Jeopardy! Course Stony Brook University Last class: grammars and parsing in Prolog Noun -> roller Verb thrills VP Verb NP S NP VP NP S VP

More information

Predicting New Search-Query Cluster Volume

Predicting New Search-Query Cluster Volume Predicting New Search-Query Cluster Volume Jacob Sisk, Cory Barr December 14, 2007 1 Problem Statement Search engines allow people to find information important to them, and search engine companies derive

More information