The Rise of Statistical Parsing


Karl Stratos

Abstract

The effectiveness of statistical parsing has almost completely overshadowed the previous dependence on rule-based parsing. Statistically learning how to parse sentences from sample data appeals strongly as the right approach for both practical (performance) and theoretical (human-likeness) reasons. Benefiting from, and contributing to, the advancement of machine learning, state-of-the-art statistical parsers are widely used in many natural language processing studies and applications today. While the relative efficacy of statistical parsing is undeniable, we are still far from truly solving the problem of natural language parsing, and the inaccuracy of parse trees remains a main bottleneck in reaching the grand goal of language understanding machines. In order to know where to go next, we may benefit from studying how the statistical approach to parsing has been engineered to its current state. In this survey, we trace its development chronologically in the hope of gaining a deeper understanding of the methods and techniques involved in it.

1 Introduction

The discovery that machines can handle symbols as well as numbers was a major breakthrough in the early phase of the development of artificial intelligence (AI). In particular, this implied machines' capability to process human language, which is composed of symbols and rules, and thus the field of natural language processing (NLP) was born. Parsing, the process of revealing the structure of a set of symbols, is indispensable in any attempt to computationally understand language, because syntax and semantics are closely tied together. The grammar formalism of Noam Chomsky, and the resulting Chomsky hierarchy, enabled formalizing languages of different complexities, yet the task of capturing the nuances of human language proved to be formidably difficult. Unlike a programming language, which is designed with an unambiguous grammar so that it can be parsed using that grammar with absolute precision, a human language does not come with a prescription of grammar, and the latent grammar is highly ambiguous.

Numerous algorithms were conceived for the task, starting from naive ones like backtracking bottom-up parsing to CKY (Cocke and Schwartz, 1970; Kasami, 1965; Younger, 1967), Earley's (Earley, 1970), and GLR (Tomita, 1984) parsing. In fact, (Goodman, 1999) showed that a general system of semiring properties synthesizes a wide variety of such parsers. In the early days, people had to come up with the rules, often by relying on language experts, and then recognized the parse trees of sentences using the rules with one of the algorithms. This method suffers from several serious defects. First, the task of coming up with a grammar that is complete and expressive enough to generate real-world sentences is almost impossibly hard in itself, even with the knowledge of the best linguists available. Second, even if it is done rather well for one language, we have to repeat the procedure whenever we move to a different language with a different grammar. Third, it is extremely intolerant of grammatical errors, because it either finds a right rule or does not, and it cannot do much in the latter case other than simply failing. Finally, this is not how humans take in sentences. As Charlton Laird says, "Grammar... is something inherent in language... It can be discovered, but not invented." Therefore, for the ultimate end of human-like machines, handcrafting rules for parsing is unlikely to be a complete solution.

The idea of the statistical solution for parsing is to observe a large collection of sample parses, and parse a sentence using the rules that were observed in the collection, with preference for the rules that match the sentence better. This has several merits that overcome the defects of rule-based parsing. One does not have to come up with a definitive set of grammar rules, but rather simply has to provide good parsing samples, handling the first defect. Since producing a training corpus is substantially easier than formalizing a complete grammar for a language, the second defect is ameliorated. The third defect illustrates the mixed blessing of this approach; it is very tolerant of errors, but on the other hand the guarantee of the success of parsing does not promise the quality of the parse tree. Finally, learning to understand the structure of language by familiarizing oneself with sentences from the wild is in the spirit of Laird's view of discovering grammar.

The questions then are (1) how to get parsing data of high quality and great quantity, and (2) how to learn well from the data set? (1) was answered with the advent of hand-annotated corpora such as the Penn Treebank (Marcus et al., 1993), which contained over 4.5 million words of American English annotated for part-of-speech (POS) information and skeletal syntactic structure. It is question (2) that is the focus of this survey. In other words, how have people come up with an increasingly sophisticated algorithmic machinery to improve the performance of their statistical parsers?

There are a handful of authoritative figures who are responsible for the huge advancement in the field in its early phase. (Collins, 1997) proposed a generative approach to statistical parsing. (Collins, 2000) also proposed reranking the resultant parse trees for a sentence for higher accuracy. (Taskar et al., 2004) introduced max-margin parsing, (Charniak and Johnson, 2005) suggested a better way to rerank with a dynamic programming n-best parsing algorithm, and (Huang, 2008) showed that reranking not just n-best lists but entire forests improves performance still further. We will look at each of them in order. We will conclude by contemplating the achievement of statistical parsing, why it fails to solve the problem of parsing, and what we might need to do.

2 Models can be Generative

(Collins, 1997) introduces three parsing models that incorporate the generative aspect of syntax. That is, each sentence-tree pair (S, T) has an associated top-down derivation consisting of a sequence of rule applications of a grammar. Collins applies the generative aspect to his previous model that uses a lexicalized PCFG (a PCFG in which each non-terminal in a parse tree is associated with a headword), and improves the performance on Wall Street Journal text by 2.3%.
Collins' models in this paper summarize many important principles of statistical parsing, so it is worthwhile to invest a substantial portion of this survey in understanding their basics.

2.1 Model 1

The task of a statistical parser is as follows: given a sentence $S$, search for the tree $T$ that maximizes $P(T \mid S)$. The model has to define how one can compute the conditional probability $P(T \mid S)$. The previous non-generative models estimated $P(T \mid S)$ directly. Instead, a generative model estimates it by attaching probabilities to a top-down derivation of the tree. For a tree derived by $n$ applications of context-free rules $LHS_i \rightarrow RHS_i$, $1 \leq i \leq n$,

$$P(T, S) = P(RHS_1 \mid LHS_1)\, P(RHS_2 \mid LHS_2) \cdots P(RHS_n \mid LHS_n) = \prod_{i=1}^{n} P(RHS_i \mid LHS_i).$$

Why $P(T, S)$ instead of $P(T \mid S)$? We use the observation that maximizing the former is equivalent to maximizing the latter, by Bayes' rule:

$$T_{best} = \arg\max_T P(T \mid S) = \arg\max_T \frac{P(S, T)}{P(S)} = \arg\max_T P(S, T).$$

Now, the rules $LHS_i \rightarrow RHS_i$ are either internal to the tree (non-leaves), where the LHS is a non-terminal and the RHS is a sequence of one or more non-terminals, e.g., S(ran) → NP(Karl) VP(ran), or lexical (leaves), where the LHS is a POS tag and the RHS is a word, e.g., NNP(Karl) → Karl. Collins lexicalizes a PCFG by associating a word $w$ and its POS tag $t$ with each non-terminal $X$ in the tree, and writes a non-terminal as $X(x)$, where $x = \langle w, t \rangle$. The addition of lexical heads leads to an enormous number of potential rules; we will see how Collins addresses this problem shortly. Each rule can be written in the following form, which we will use throughout the section:

$$P(h) \rightarrow L_n(l_n) \ldots L_1(l_1)\; H(h)\; R_1(r_1) \ldots R_m(r_m). \quad (1)$$

$H$ is the head of the children constituents of the phrase. Note that it inherits the head-word $h$ from its parent $P$. $L_1 \ldots L_n$ and $R_1 \ldots R_m$ are left and right modifiers of $H$; put simply, they are the surrounding constituents around the head constituent. When the rule is unary, $n = m = 0$. At this point, it is informative to study the parse tree of the sentence "Last week Marks bought Brooks" (shown as a figure in Collins' paper), which he uses as an example throughout.

As mentioned, a direct estimation of $P(RHS \mid LHS)$ is infeasible due to the large number of possible rules created by lexical heads. Collins proposes decomposing the generation of the RHS given the LHS of a rule such as (1) into three steps (sketched in code at the end of this subsection). Informally speaking, first take care of the head, then deal with the right of the head, and finally generate the left of the head. This method exploits the head-driven structure of a rule. The generation of the left and right modifiers is simplified by the independence assumption that they are generated by separate 0th-order Markov processes (that is, the probabilities for the next choice do not depend on the past choices at all, equivalent to weighted random selection). The only exception is the first rule, $TOP \rightarrow H(h)$, which has probability $P(H, h \mid TOP)$. More specifically, the three steps required for generating the RHS from the LHS are:

1. Generate the head constituent label of the phrase, with probability $P_H(H \mid P, h)$, given the parent $P$ and the head-word $h$.

2. Generate the right modifiers with probability $\prod_{i=1}^{m+1} P_R(R_i(r_i) \mid P, h, H)$, given the parent $P$, the head-word $h$, and the head constituent $H$. The right-most constituent $R_{m+1}(r_{m+1})$ is defined as a non-terminal $STOP$ to signify the end of generation of right modifiers.

3. Similarly, generate the left modifiers with probability $\prod_{i=1}^{n+1} P_L(L_i(l_i) \mid P, h, H)$, with $L_{n+1}(l_{n+1}) = STOP$.

Since Collins gives the probability of the expansion of S(bought) as an example, let us instead estimate the probability of the rule VP(bought) → VB(bought) NP(Brooks) from the same tree. The first step gives the probability of the head, $P_H(VB \mid VP, bought)$. The second gives the probability of the right modifier, $P_R(NP(Brooks) \mid VP, VB, bought)$, and the end signal $P_R(STOP \mid VP, VB, bought)$. The third gives the probability of the left modifiers, of which there are none in this case, so we only have the end signal $P_L(STOP \mid VP, VB, bought)$. Therefore, the probability of the rule is estimated as:

$$P_H(VB \mid VP, bought) \cdot P_R(NP(Brooks) \mid VP, VB, bought) \cdot P_R(STOP \mid VP, VB, bought) \cdot P_L(STOP \mid VP, VB, bought).$$

Despite the 0th-order Markov assumptions, the probabilities could in general be conditioned on any of the preceding modifiers. Naturally, if the derivation is left-to-right depth-first, so that we obtain the preceding modifiers and all their sub-trees before the current modifier, the model can also condition on any structure below the preceding modifiers. Collins exploits this by making the approximations

$$P_L(L_i(l_i) \mid H, P, h, L_1(l_1) \ldots L_{i-1}(l_{i-1})) = P_L(L_i(l_i) \mid H, P, h, distance_L(i-1))$$
$$P_R(R_i(r_i) \mid H, P, h, R_1(r_1) \ldots R_{i-1}(r_{i-1})) = P_R(R_i(r_i) \mid H, P, h, distance_R(i-1))$$

where $distance_L$ and $distance_R$ are functions of the surface string from the head-word to the edge of the constituent.
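
To make the decomposition concrete, here is a minimal Python sketch of the three-step rule probability. It uses tiny hand-set conditional probability tables in place of counts estimated from the Treebank; the table values and the names (`P_H`, `P_R`, `P_L`, `rule_probability`) are illustrative assumptions, not Collins' implementation.

```python
# A minimal sketch of Collins' head-driven rule probability (Model 1),
# with toy conditional probability tables standing in for Treebank estimates.

# P_H(H | parent, headword), P_R(mod | parent, headword, H), P_L(...)
P_H = {("VP", "bought"): {"VB": 0.7, "VBD": 0.3}}
P_R = {("VP", "bought", "VB"): {("NP", "Brooks"): 0.2, "STOP": 0.5}}
P_L = {("VP", "bought", "VB"): {"STOP": 0.9}}

def rule_probability(parent, headword, head_label, right_mods, left_mods):
    """P(RHS | LHS) via the head / right-modifier / left-modifier decomposition.

    right_mods and left_mods are lists of (label, headword) pairs, ordered
    outward from the head; each sequence is terminated by an implicit STOP.
    """
    p = P_H[(parent, headword)][head_label]
    ctx = (parent, headword, head_label)
    # 0th-order Markov generation of right modifiers, then the STOP symbol.
    for mod in right_mods:
        p *= P_R[ctx][mod]
    p *= P_R[ctx]["STOP"]
    # Same for left modifiers.
    for mod in left_mods:
        p *= P_L[ctx][mod]
    p *= P_L[ctx]["STOP"]
    return p

# Probability of VP(bought) -> VB(bought) NP(Brooks):
print(rule_probability("VP", "bought", "VB",
                       right_mods=[("NP", "Brooks")], left_mods=[]))
# 0.7 * 0.2 * 0.5 * 0.9 = 0.063
```

Running it on the VP(bought) → VB(bought) NP(Brooks) example multiplies exactly the four factors listed above.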

2.2 Model 2

Model 1 does not incorporate the complement-adjunct distinction when evaluating the probabilities of parses, so Model 2 seeks to improve on Model 1 by making the distinction. Briefly put, complements and adjuncts both serve to give additional information to the object they modify, in the form of an adjectival, adverbial, or sentential phrase, but they differ in that the former cannot be taken away without fundamentally impairing the semantic content of the original phrase, while the latter can. In Model 2, Collins makes this distinction by attaching a -C suffix to non-terminals that are complements.

There are several reasons that we want to make this distinction not in a post-processing stage but in parsing. First, recognizing complements requires probabilistic treatment because it involves lexical information (e.g., "week" is likely to be a temporal modifier) and subcategorization preferences (e.g., a that-phrase tends to be a complement as in "The spokeswoman said that the asbestos was dangerous", whereas a because-phrase tends to be an adjunct as in "Bonds beat short-term investment because the market is down"). Second, the very process of identifying complements may help parsing accuracy, because it allows us to avoid the bad mistake of generating complements independently of each other. The figure in Collins' paper shows a parse that incorrectly treats two different noun phrases as both subjects/objects because of the independence assumption; if we know the subcategorization information of the head, we will not make such a mistake.

The conditions for a non-terminal to be tagged with the -C suffix specify what it must be and must not be in order to be a complement. It must be (1) an NP, SBAR, or S whose parent is an S; (2) an NP, SBAR, S, or VP whose parent is a VP; or (3) an S whose parent is an SBAR. It must not have one of the following tags: ADV, VOC, BNF, DIR, EXT, LOC, MNR, TMP, CLR or PRP. Each of these is a strong sign that the non-terminal is an adjunct. For instance, a temporal-tagged (TMP) phrase such as "Last week" is an adjunct, and a because-phrase like "because the market is down" is tagged with ADV. This supplements the non-terminal set with the complement distinction.

To address the bad independence assumption, Collins incorporates a probabilistic choice of left and right subcategorization (subcat) frames into the generative process of Model 1 (sketched in code at the end of this subsection):

1. Generate the head with probability $P_H(H \mid P, h)$ as before.

2. Choose left and right subcat frames, $LC$ and $RC$, with probabilities $P_{lc}(LC \mid P, H, h)$ and $P_{rc}(RC \mid P, H, h)$. Each subcat frame is a multiset specifying the complements which the head requires in its left or right modifiers. So, for example, an $LC$ of a verb phrase might look like $\{NP\text{-}C\}$.

3. Generate the left and right modifiers with probabilities $P_L(L_i(l_i) \mid H, P, h, distance_L(i-1), LC)$ and $P_R(R_i(r_i) \mid H, P, h, distance_R(i-1), RC)$, exactly the same as before except for the inclusion of $LC$ and $RC$. As complements are generated, they are removed from the appropriate subcat multiset.

Here is how Collins ensures that all and only the required complements will be generated: the probability of generating $STOP$ will be 0 when the subcat frame is non-empty (so a subcat-incomplete rule will never be cut short), and the probability of generating a complement will be 0 when it is not in the subcat frame (so only the required complements have a chance of being generated). The probability of the augmented rule VP(bought) → VBD(bought) NP-C(Brooks) is now:

$$P_H(VBD \mid VP, bought) \cdot P_{lc}(\{\} \mid VP, VBD, bought) \cdot P_{rc}(\{NP\text{-}C\} \mid VP, VBD, bought) \cdot P_R(NP\text{-}C(Brooks) \mid VP, VBD, bought, \{NP\text{-}C\}) \cdot P_R(STOP \mid VP, VBD, bought, \{\}) \cdot P_L(STOP \mid VP, VBD, bought, \{\}).$$

Note that the subcat frame helps prevent the incorrect parses mentioned above, because a probability of the form $P_{lc}(\{NP\text{-}C, NP\text{-}C\} \mid \ldots)$ must be very small.
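
The hard constraints that make subcat frames work can be seen in a few lines of code. The following is a minimal sketch with a toy probability table; the representation (labels ending in -C for complements, a Counter for the multiset) and all names are assumptions made for illustration, not Collins' code.

```python
from collections import Counter

# Toy table: P_R(mod | parent, headword, head) before the subcat constraints.
P_R = {("VP", "bought", "VBD"): {("NP-C", "Brooks"): 0.3,
                                 ("NP", "week"): 0.1,
                                 "STOP": 0.6}}

def p_right_modifier(mod, parent, headword, head, subcat):
    """P_R(mod | parent, headword, head, subcat) with the two hard constraints:
    STOP is impossible while required complements remain, and a complement
    that is not in the subcat frame is impossible."""
    base = P_R[(parent, headword, head)].get(mod, 0.0)
    if mod == "STOP":
        return 0.0 if any(c > 0 for c in subcat.values()) else base
    label = mod[0]
    if label.endswith("-C") and subcat[label] <= 0:
        return 0.0
    return base

def generate_right(mods, parent, headword, head, subcat):
    """Probability of a right-modifier sequence followed by STOP, removing
    complements from the subcat multiset as they are generated."""
    subcat = Counter(subcat)
    p = 1.0
    for mod in mods:
        p *= p_right_modifier(mod, parent, headword, head, subcat)
        if mod[0].endswith("-C"):
            subcat[mod[0]] -= 1
    return p * p_right_modifier("STOP", parent, headword, head, subcat)

# VP(bought) with right subcat frame {NP-C}: generate NP-C(Brooks), then STOP.
print(generate_right([("NP-C", "Brooks")], "VP", "bought", "VBD", {"NP-C": 1}))
# 0.3 * 0.6 = 0.18; stopping before generating the NP-C would have probability 0.
```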

2.3 Model 3

Model 2 still does not handle wh-movement from relative clauses. The examples of three different extractions that Collins provides are illuminating:

1. From subject: The store (SBAR which TRACE bought Brooks Brothers)

2. From object: The store (SBAR which Marks bought TRACE)

3. From within a PP: The store (SBAR which Marks bought Brooks Brothers from TRACE)

For the same reasons (the desirability of probabilistic treatment and the chance of improving parsing accuracy), we want to perform this task during the parsing phase rather than after it. The traditional approach to wh-movement is to add a gap feature to each non-terminal in the tree, and to propagate gaps through the tree until they are finally filled as a trace complement. Since the Penn Treebank contains this information with the TRACE symbol, it is straightforward to add it to the trees in the training data.

Given that the LHS has a gap, there are three cases in passing down the gap to the RHS: (1) to the head, (2) to the left, or (3) to the right. In the latter two cases, the gap may be discharged as a trace argument (as in rule (4) below). We specify a parameter $P_G(G \mid P, h, H)$ where $G$ is either Head, Left, or Right. The generative process is extended to choose between these cases after generating the head of the phrase. When $G = Head$, the left and right modifiers are generated as normal. Otherwise, we add a gap requirement to the subcat variable and propagate it until this requirement is either passed down to a non-terminal or fulfilled by a TRACE.

The former (the gap passed down to a non-terminal) is illustrated by rule (2), SBAR(that)(+gap) → WHNP(that) S-C(bought)(+gap), which has probability

$$P_H(WHNP \mid SBAR, that) \cdot P_G(Right \mid SBAR, WHNP, that) \cdot P_{lc}(\{\} \mid SBAR, WHNP, that) \cdot P_{rc}(\{S\text{-}C\} \mid SBAR, WHNP, that) \cdot P_R(S\text{-}C(bought)(+gap) \mid SBAR, WHNP, that, \{S\text{-}C, +gap\}) \cdot P_R(STOP \mid SBAR, WHNP, that, \{\}) \cdot P_L(STOP \mid SBAR, WHNP, that, \{\}).$$

The latter (the gap filled by a TRACE) is illustrated by rule (4), VP(bought)(+gap) → VB(bought) TRACE NP(week), which has probability

$$P_H(VB \mid VP, bought) \cdot P_G(Right \mid VP, VB, bought) \cdot P_{lc}(\{\} \mid VP, VB, bought) \cdot P_{rc}(\{NP\text{-}C\} \mid VP, VB, bought) \cdot P_R(TRACE \mid VP, VB, bought, \{NP\text{-}C, +gap\}) \cdot P_R(NP(week) \mid VP, VB, bought, \{\}) \cdot P_L(STOP \mid VP, VB, bought, \{\}) \cdot P_R(STOP \mid VP, VB, bought, \{\}).$$

In the actual implementation of the models, a lot of smoothing is done, separately for different levels of back-off. Also, Collins replaces words in test data that have never been seen, or that occurred fewer than 5 times in training, with the UNKNOWN token, thereby allowing the model to robustly handle the statistics of rare or new words. The parser was trained on the Wall Street Journal portion of the Penn Treebank (around 40,000 sentences) and tested on a different portion (2,416 sentences). The performance measures used are:

$$\text{Precision} = \frac{\#\text{ of correct constituents in parse}}{\#\text{ of constituents in parse}}, \qquad \text{Recall} = \frac{\#\text{ of correct constituents in parse}}{\#\text{ of constituents in Treebank parse}},$$

and Crossing Brackets, the number of constituents violating constituent boundaries of the Treebank parse.

The precision/recall performance on constituent recovery is 88.1%/87.5%. The performance on trace recovery is also interesting to note. In order for a trace to be considered correct, three criteria must be met: it must be an argument to the correct head-word, it must be in the correct position in relation to that head-word, and it must be under the correct non-terminal label. Given this specification, the performance of Model 3 on 436 sentences is 93.3%/90.1%. Collins reports that 342 were short-distance cases with 97.1%/98.2% precision/recall, and 94 were long-distance cases with 76%/60.6% precision/recall. This suggests that wh-movement becomes much harder to recover as the extraction distance increases.
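
Since these precision/recall figures recur for every parser in this survey, a small sketch of how labeled constituent precision and recall are computed may be useful. It assumes each parse is reduced to a multiset of labeled spans (label, start, end); the representation and the toy spans are illustrative.

```python
from collections import Counter

def precision_recall(predicted, gold):
    """Labeled constituent precision and recall between two span multisets."""
    pred_counts, gold_counts = Counter(predicted), Counter(gold)
    correct = sum((pred_counts & gold_counts).values())  # multiset intersection
    precision = correct / max(len(predicted), 1)
    recall = correct / max(len(gold), 1)
    return precision, recall

gold = [("S", 0, 5), ("NP", 0, 2), ("VP", 2, 5), ("NP", 3, 5)]
pred = [("S", 0, 5), ("NP", 0, 2), ("VP", 2, 5), ("PP", 3, 5)]
p, r = precision_recall(pred, gold)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.75 recall=0.75
```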

3 Think Twice with Discriminative Re-ranking

Recall that our task is to find the tree $T$ with the highest probability $P(T \mid S)$ of being the parse tree of the sentence $S$. Also, recall that we refrained from performing the complement/adjunct distinction and the wh-movement handling as post-parsing processes, because we wanted to use their information for the parsing task. However, there is a reason we might want to perform post-processing. When we want to encode some constraints in the framework, we would like to be able to conveniently impose features discriminating between candidate trees, rather than having to alter the whole derivation to take these features into account. (Collins, 2000) proposes a two-pass process in which the base parser produces a set of candidate parses that are initially ranked with derivation probabilities, and then a second model re-ranks the parses using additional features as evidence. This allows a tree to be represented as an arbitrary set of features, without concerns about how these features interact during a derivation.

A concrete example that Collins gives is the task of POS tagging using a Hidden Markov Model (HMM). In order to exploit our intuition that most sentences have at least one verb, so that sequences including at least one verb should have an increased score under the model, we may try to encode it directly into an HMM. But the obvious approach of adding to each state the information about whether or not a verb has been generated doubles the number of states and parameters in the model. In contrast, the task would be easy if we could just implement a binary feature that indicates whether or not the tagging contains a verb. The techniques involved in Collins' method illustrate many of the important ideas in machine learning, such as the use of a dual problem, feature selection, and smoothing, and we will investigate them at a high level.

3.1 Problem Definition

Collins uses a generative model similar to (Collins, 1997) as the base parser. Here he uses the term history-based for generative, but we will avoid using it to be consistent with the previous section. His definition of the probability of a parse tree is now

$$P(Y, X) = \prod_{i=1}^{n} P(d_i \mid \Phi(d_1 \ldots d_{i-1}))$$

where $d_i$ is the $i$-th decision (of which re-write rule to apply), so that $(d_1 \ldots d_{i-1})$ is the history for the $i$-th decision, and $\Phi$ is a function which groups histories into equivalence classes, which corresponds to the 0th-order Markov independence assumption of the last section. All models in the paper work with log probabilities. The log probability of a tree can be written as a linear sum of parameters $\alpha_s$ multiplied by features $h_s$, where $h_s(X, Y)$ is the count of the rule $\gamma_s \rightarrow \beta_s$ in the tree $Y$ of the sentence $X$, and $\alpha_s = \log P(\beta_s \mid \gamma_s)$ is the parameter associated with that rule. That is, if we have a PCFG with such rules for $1 \leq s \leq m$,

$$\log P(Y, X) = \sum_{s=1}^{m} \alpha_s h_s(X, Y).$$

Thus the features $h_s$ define an $m$-dimensional vector of counts, $(h_1 \ldots h_m)$, and the parameters $\alpha_s$ represent the influence of each feature on the score of a tree, i.e., weights. We need to define some more terms in order to be able to express the ideas in equations:

- $X$ and $Y$ are the input space and tree space.
- $x_{i,j} \in X \times Y$ is the $j$-th parse of the $i$-th sentence, where $1 \leq i \leq n$ and $1 \leq j \leq n_i$.
- $Score(x_{i,j})$ is the measure of the similarity of $x_{i,j}$ to the gold-standard parse, and $x_{i,1}$ is defined to be the highest scoring parse for the $i$-th sentence.
- $Q(x_{i,j}) = P(X, Y)$ is the probability under the base model, and $L(x_{i,j}) = \log Q(x_{i,j})$.
- We have a separate test set of parses $y_{i,j}$.

The task is to learn a ranking function $F(x_{i,j})$. Note that for the base model, $F(x_{i,j}) = L(x_{i,j})$. The performance of a ranking function is evaluated on the entire test data: the score of $F$ is $\sum_i Score(y_{i,z_i})$, where $z_i = \arg\max_{j=1 \ldots n_i} F(y_{i,j})$ is the index of the top-ranked parse under $F$ for the $i$-th sentence. Under this definition, the maximum possible score is $\sum_i Score(y_{i,1})$. $F$ can be written in the following form, using the weights $\bar{\alpha} = \{\alpha_0, \alpha_1 \ldots \alpha_m\}$ and the indicator function $h_s(x)$, which is 1 if $x$ contains the $s$-th rule and 0 otherwise ($h_s$ is restricted to binary values for the simplicity of the algorithm, but we can simulate features that count rules by having multiple features that take value 1 if a rule is seen $n$ times):

$$F(x_{i,j}, \bar{\alpha}) = \alpha_0 L(x_{i,j}) + \sum_{s=1}^{m} \alpha_s h_s(x_{i,j}). \quad (2)$$

The task now is to find the parameter settings for $\bar{\alpha}$ that lead to good scores on test data.

3.2 Loss Functions

In order to drive the training process, we define a measure of the ranking errors $F$ makes on the training set. The ranking error rate is the number of times a lower-scoring parse is ranked better than the best parse:

$$Error(\bar{\alpha}) = \sum_i \sum_{j=2}^{n_i} t[F(x_{i,1}, \bar{\alpha}) < F(x_{i,j}, \bar{\alpha})] = \sum_i \sum_{j=2}^{n_i} t[F(x_{i,1}, \bar{\alpha}) - F(x_{i,j}, \bar{\alpha}) < 0].$$

The indicator function $t[x]$ is 1 if $x$ is true and 0 otherwise. Define the margin on example $x_{i,j}$ as $M_{i,j}(\bar{\alpha}) = F(x_{i,1}, \bar{\alpha}) - F(x_{i,j}, \bar{\alpha})$. All the loss functions that Collins defines can be written in terms of the margins.

Log-Likelihood Loss. The first loss function defines the conditional probability of $x_{i,1}$ being the correct parse for the $i$-th sentence as

$$P(x_{i,1}) = \frac{e^{F(x_{i,1})}}{\sum_{j=1}^{n_i} e^{F(x_{i,j})}}.$$

Now, maximizing the likelihood is equivalent to minimizing the negative log-likelihood. This negative log-loss is a function of the margins on the training data:

$$LogLoss(\bar{\alpha}) = -\sum_i \log \frac{e^{F(x_{i,1})}}{\sum_{j=1}^{n_i} e^{F(x_{i,j})}} = \sum_i \log \sum_{j=1}^{n_i} e^{-(F(x_{i,1}) - F(x_{i,j}))} = \sum_i \log \Big(1 + \sum_{j=2}^{n_i} e^{-M_{i,j}(\bar{\alpha})}\Big).$$

Boosting Loss. The second loss function is defined as

$$BoostLoss(\bar{\alpha}) = \sum_i \sum_{j=2}^{n_i} e^{-M_{i,j}(\bar{\alpha})},$$

which has $Error(\bar{\alpha})$ as a lower bound, so that minimizing $BoostLoss(\bar{\alpha})$ is closely related to minimizing the ranking error rate.

3.3 Optimization Methods

The naive approach would be to find parameter settings $\bar{\alpha}$ that minimize $Loss(\bar{\alpha})$, which can be either of the loss functions. But this risks overtraining due to the large number of features. Instead, Collins attempts to find a small subset of the features that contribute most to reducing the loss function by using a greedy algorithm: at each iteration, pick the feature $h_k$ and weight increment $\delta$ that have the most impact on the loss function. It helps to define:

$$Upd(\bar{\alpha}, k, \delta) = [\alpha_0, \ldots, \alpha_k + \delta, \ldots, \alpha_m]$$
$$BestWt(k, \bar{\alpha}) = \arg\min_\delta Loss(Upd(\bar{\alpha}, k, \delta))$$
$$BestLoss(k, \bar{\alpha}) = Loss(Upd(\bar{\alpha}, k, BestWt(k, \bar{\alpha}))).$$

$Upd(\bar{\alpha}, k, \delta)$ is an updated parameter vector, the same as $\bar{\alpha}$ except that $\alpha_k$ is incremented by $\delta$. $BestWt(k, \bar{\alpha})$ is the optimal increment $\delta$ to $\alpha_k$, the one that reduces the loss most. $BestLoss(k, \bar{\alpha})$ is the value of the loss function on the updated parameters. Then the algorithm for feature selection is:

1. Initialize $\bar{\alpha}^0$ to some value, e.g., $[1, 0, 0, \ldots]$.

2. For $t = 1 \ldots N$: find $(k^*, \delta^*) = \arg\min_{k,\delta} Loss(Upd(\bar{\alpha}^{t-1}, k, \delta))$ and set $\bar{\alpha}^t = Upd(\bar{\alpha}^{t-1}, k^*, \delta^*)$.

After computing $BestWt$ and $BestLoss$ for each feature, we can compute the optimal feature/weight pair as $k^* = \arg\min_k BestLoss(k, \bar{\alpha})$ and $\delta^* = BestWt(k^*, \bar{\alpha})$.
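
To ground the notation, here is a minimal sketch of the ranking function F, the margins, and the two losses. It assumes each candidate parse is stored as a pair of its base-model log probability and a set of binary features, with the best parse listed first for each sentence; the layout and numbers are illustrative, not Collins' data structures.

```python
import math

def F(parse, alpha0, alpha):
    """F(x, alpha) = alpha_0 * L(x) + sum_s alpha_s * h_s(x), binary features."""
    logprob, feats = parse
    return alpha0 * logprob + sum(alpha.get(s, 0.0) for s in feats)

def margins(candidates, alpha0, alpha):
    """M_{i,j} = F(x_{i,1}) - F(x_{i,j}) for j >= 2, for every sentence i."""
    out = []
    for cands in candidates:
        best = F(cands[0], alpha0, alpha)
        out.append([best - F(x, alpha0, alpha) for x in cands[1:]])
    return out

def log_loss(candidates, alpha0, alpha):
    return sum(math.log(1.0 + sum(math.exp(-m) for m in ms))
               for ms in margins(candidates, alpha0, alpha))

def boost_loss(candidates, alpha0, alpha):
    return sum(math.exp(-m)
               for ms in margins(candidates, alpha0, alpha) for m in ms)

# Two sentences, each with the best parse followed by two competitors.
candidates = [
    [(-10.2, {"f1", "f3"}), (-9.8, {"f2"}), (-11.0, {"f3"})],
    [(-20.1, {"f1"}), (-20.5, {"f2", "f3"}), (-22.0, set())],
]
alpha0, alpha = 1.0, {"f1": 0.5, "f2": -0.2, "f3": 0.1}
print(log_loss(candidates, alpha0, alpha), boost_loss(candidates, alpha0, alpha))
```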

We will investigate how $BestWt$ and $BestLoss$ can be computed for $BoostLoss$. At each iteration, $\alpha_0$ is first set to optimize $BoostLoss$: from equation (2), we see that we can perform a linear search to find

$$\alpha_0 = \arg\min_\alpha \sum_i \sum_{j=2}^{n_i} e^{-\alpha(L(x_{i,1}) - L(x_{i,j}))}.$$

For $BoostLoss$, note again from equation (2) that $F(x_{i,j}, Upd(\bar{\alpha}, k, \delta))$ is equal to $F(x_{i,j}, \bar{\alpha}) + \delta h_k(x_{i,j})$, so the margin on example $(i,j)$ has a simple update,

$$M_{i,j}(Upd(\bar{\alpha}, k, \delta)) = F(x_{i,1}, Upd(\bar{\alpha}, k, \delta)) - F(x_{i,j}, Upd(\bar{\alpha}, k, \delta)) = M_{i,j}(\bar{\alpha}) + \delta(h_k(x_{i,1}) - h_k(x_{i,j})),$$

which leads to a simple update for $BoostLoss$:

$$BoostLoss(Upd(\bar{\alpha}, k, \delta)) = \sum_i \sum_{j=2}^{n_i} e^{-M_{i,j}(\bar{\alpha}) - \delta(h_k(x_{i,1}) - h_k(x_{i,j}))}.$$

Recall that $h_s(x)$ is an indicator function on $x$, so that $h_k(x_{i,1}) - h_k(x_{i,j})$ is either $+1$, $-1$, or $0$. Collins partitions the training sample into three sets depending on this value,

$$A_k^+ = \{(i,j) : h_k(x_{i,1}) - h_k(x_{i,j}) = 1\}, \quad A_k^- = \{(i,j) : h_k(x_{i,1}) - h_k(x_{i,j}) = -1\}, \quad A_k^0 = \{(i,j) : h_k(x_{i,1}) - h_k(x_{i,j}) = 0\}.$$

Next, define $W_k^+ = \sum_{(i,j) \in A_k^+} e^{-M_{i,j}(\bar{\alpha})}$, with its two counterparts $W_k^-$ and $W_k^0$ defined analogously. The reason for all these definitions is solely to formulate the update rule for $BoostLoss$ as

$$BoostLoss(Upd(\bar{\alpha}, k, \delta)) = \sum_{(i,j) \in A_k^+} e^{-M_{i,j}(\bar{\alpha}) - \delta} + \sum_{(i,j) \in A_k^-} e^{-M_{i,j}(\bar{\alpha}) + \delta} + \sum_{(i,j) \in A_k^0} e^{-M_{i,j}(\bar{\alpha})} = e^{-\delta} W_k^+ + e^{\delta} W_k^- + W_k^0.$$

The purpose is to make it easy to differentiate $BoostLoss$ with respect to $\delta$ when minimizing the loss. This gives

$$BestWt(k, \bar{\alpha}) = \frac{1}{2} \log \frac{W_k^+}{W_k^-},$$

and when we plug this optimal $\delta$ into the update rule for $BoostLoss$, we obtain

$$BestLoss(k, \bar{\alpha}) = Z - \Big(\sqrt{W_k^+} - \sqrt{W_k^-}\Big)^2,$$

where $Z = \sum_i \sum_{j=2}^{n_i} e^{-M_{i,j}(\bar{\alpha})}$ is a constant across features. Collins proposes finding a smoothing parameter $\epsilon$ using cross-validation to prevent $BestWt$ from being undefined:

$$BestWt(k, \bar{\alpha}) = \frac{1}{2} \log \frac{W_k^+ + \epsilon Z}{W_k^- + \epsilon Z}.$$

We have achieved our goal for $BoostLoss$. Unfortunately, such a brisk closed form does not exist for $BestWt$ in the case of $LogLoss$, so Collins resorts to an iterative solution to find the value of $BestWt$ and uses it to calculate $BestLoss$.
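
The closed-form update is simple enough to sketch. The code below computes BestWt and BestLoss for one feature from precomputed margins and binary feature indicators, and then picks the best feature; the data layout mirrors the previous sketch and is an illustrative assumption, not Collins' implementation (which also relies on the sparse-update trick discussed in the next subsection).

```python
import math

def best_wt_and_loss(margins, h_best, h_others, eps=1e-3):
    """Closed-form BestWt and BestLoss under BoostLoss for one feature k.

    margins[i][j]  : M_{i,j}(alpha) for competitor j of sentence i
    h_best[i]      : h_k(x_{i,1}) in {0, 1}
    h_others[i][j] : h_k of the (j+2)-th parse of sentence i, in {0, 1}
    """
    W_plus = W_minus = Z = 0.0
    for i, ms in enumerate(margins):
        for j, m in enumerate(ms):
            w = math.exp(-m)
            Z += w
            diff = h_best[i] - h_others[i][j]
            if diff == 1:
                W_plus += w
            elif diff == -1:
                W_minus += w
    best_wt = 0.5 * math.log((W_plus + eps * Z) / (W_minus + eps * Z))
    best_loss = Z - (math.sqrt(W_plus) - math.sqrt(W_minus)) ** 2
    return best_wt, best_loss

def select_feature(margins, features):
    """Pick the feature with the smallest BestLoss; features maps k -> (h_best, h_others)."""
    scored = {k: best_wt_and_loss(margins, hb, ho) for k, (hb, ho) in features.items()}
    k_star = min(scored, key=lambda k: scored[k][1])
    return k_star, scored[k_star][0]

margins = [[0.4, 1.3], [-0.4, 1.9]]
features = {
    "f1": ([1, 1], [[0, 0], [0, 1]]),   # mostly fires on the best parses
    "f2": ([0, 0], [[1, 0], [1, 0]]),   # mostly fires on competitors
}
print(select_feature(margins, features))  # f1 is selected, with a positive weight
```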

3.4 Efficiency

While we cannot cover the efficiency issues as thoroughly as Collins does, it is enlightening to note them because they are important in the parsing problem in general; one usually has to train the parser on a huge data set (36,000 sentences with about 1,000,000 parse trees and 500,000 features in this case). In a nutshell, Collins exploits the fact that in the update from $\bar{\alpha}$ to $Upd(\bar{\alpha}, k^*, \delta^*)$ the values $W_k^+$ and $W_k^-$ remain unchanged for most features, so that they do not have to be re-calculated.

The use of a second model to re-rank the parse trees from the first model, using the selected features that contribute most to minimizing the error (estimated by noting how many times a non-best tree is incorrectly ranked higher than the best tree), gives a significant improvement: a 1.5% absolute increase in accuracy over the base model.

We have explored the three generative models and the re-ranking model of Collins (1997; 2000) in great detail. For the rest of the survey, we will be less rigorous, but will try to capture the overall ideas in a more sweeping manner. This is because the detailed account of those models was sufficient to illustrate the nature of the task, and because we do not want to be bogged down by every nut and bolt lest we fail to see the idea itself within our limited scope.

4 Consider the Entire Tree Space when Re-ranking

We have seen in the previous section that efficiency is a critical factor in parsing, since the training data tends to be enormous. This pressure pushes re-ranking systems toward small candidate lists, which poses a problem, because the set of n-best parses (rather than all of them) often does not contain the true parse. In Collins' re-ranking model (2000), for instance, 41% of the correct parses were not in the candidate pool of 30-best parses. Thus (Taskar et al., 2004) motivates a novel discriminative approach that allows one to efficiently learn a model which discriminates among the entire space of parse trees, as opposed to re-ranking the top few candidates. It uses the idea of finding the largest margin, which lies at the core of support vector machines (SVMs). Furthermore, it can condition on arbitrary features of input sentences, thereby leveraging additional lexical information without the cost of algorithmic complexity.

Taskar categorizes discriminative approaches to parsing into two kinds: re-ranking (the 2-pass system of the kind proposed by (Collins, 2000)) and dynamic programming (the kind of system in which candidate parse trees are recorded in a chart and subsequently used in decoding and parameter estimation with dynamic programming algorithms). It is the latter type of discriminative approach that is the subject of (Taskar et al., 2004). Taskar extends his previous max-margin approach to context-free grammars, presenting a dynamic programming approach to discriminative parsing that is an alternative to maximum entropy estimation. Unlike re-ranking methods, it is an end-to-end discriminative model over the full space of parses.

4.1 Max-Margin Estimation

The traditional method of estimating the parameters of PCFGs assumes a generative grammar that defines $P(x, y)$ and maximizes the joint log-likelihood $\sum_i \log P(x_i, y_i)$, where $x$ is a sentence and $y$ is a proposed tree. An alternative is to estimate the parameters discriminatively by maximizing the conditional log-likelihood, with

$$P_w(y \mid x) = \frac{e^{\langle w, \Phi(x,y) \rangle}}{\sum_{y' \in G(x)} e^{\langle w, \Phi(x,y') \rangle}},$$

where $G(x) \subseteq Y$ maps an input $x \in X$ to a set of candidate parses, $\langle \cdot, \cdot \rangle$ denotes the vector inner product, $w \in \mathbb{R}^d$, and $\Phi : X \times Y \rightarrow \mathbb{R}^d$ maps a sentence-tree pair to a feature vector.

Taskar advocates a different estimation criterion that uses the max-margin principle of SVMs. The key idea is to directly ensure that

$$y_i = \arg\max_{y \in G(x_i)} \langle w, \Phi(x_i, y) \rangle$$

for all sentences $x_i$ in the training data. The margin of the parameters $w$ on the example $x_i$ and proposed parse $y$ is defined as how much $y$ deviates from the true parse $y_i$:

$$\langle w, \Phi(x_i, y_i) \rangle - \langle w, \Phi(x_i, y) \rangle = \langle w, \Phi(x_i, y_i) - \Phi(x_i, y) \rangle.$$

We would like this margin to be large when the mistake $y$ is more idiotic, that is, when the loss function $L(x_i, y_i, y)$ (measuring the penalty for proposing the parse $y$ for $x_i$ when $y_i$ is the true parse) gives a large value. The optimization task is to maximize $\gamma$ such that

$$\langle w, \Phi(x_i, y_i) - \Phi(x_i, y) \rangle \geq \gamma\, L(x_i, y_i, y) \quad \forall y \in G(x_i), \qquad \|w\|^2 \leq 1.$$

After a standard transformation, an equivalent task is to minimize

$$\frac{1}{2} \|w\|^2 + C \sum_i \xi_i \quad \text{such that} \quad \langle w, \Phi(x_i, y_i) - \Phi(x_i, y) \rangle \geq L(x_i, y_i, y) - \xi_i \quad \forall y \in G(x_i), \quad (3)$$

where the slack variables $\xi_i \geq 0$ allow one to increase the global margin by paying a local penalty on some outliers, and the constant $C$ dictates the desired trade-off between margin size and outliers. Taskar derives the dual of problem (3) in order to use the kernel trick and to avoid the exponential number of constraints (one for each possible parse $y$ of each sentence $x_i$). Writing $\Phi_{i,y}$ for $\Phi(x_i, y)$, the dual is to maximize

$$C \sum_{i,y} \alpha_{i,y} L(x_i, y_i, y) - \frac{1}{2} \Big\| C \sum_{i,y} (I_{i,y} - \alpha_{i,y}) \Phi_{i,y} \Big\|^2 \quad \text{such that} \quad \sum_y \alpha_{i,y} = 1 \;\; \forall i, \quad \alpha_{i,y} \geq 0 \;\; \forall i, y,$$

where the $\alpha_{i,y}$ are the dual variables, $C$ is renormalized, and $I_{i,y}$ indicates whether $y$ is the true parse $y_i$. Given the dual solution $\alpha^*$, the solution $w^*$ to the primal problem (3) is simply a weighted linear combination of the feature vectors of the correct and incorrect parses:

$$w^* = C \sum_{i,y} (I_{i,y} - \alpha^*_{i,y}) \Phi_{i,y}.$$

So $\alpha^*$ corresponds to the original task in that parses with large $\alpha^*_{i,y}$ contribute more strongly to the model.

4.2 Efficiency

The number of variables and constraints is proportional to $|G(x)|$, which is generally exponential in the length of $x$, for both the primal and dual formulations. Taskar exploits the structure of grammars to derive an efficient dynamic programming decomposition, factoring the model by assigning scores to local parts of the parse. The key idea is to make the simplifying assumptions that the feature vector $\Phi$ and the loss function $L$ decompose over parts,

$$\Phi(x, y) = \sum_{r \in R(x,y)} \phi(x, r), \qquad L(x, y, \hat{y}) = \sum_{r \in R(x,\hat{y})} l(x, y, r),$$

where $\phi(x, r)$ maps a rule production and its position in the sentence $x$ to some feature vector representation, $l(x, y, r)$ is a local loss function, and $R(x, y)$ maps a derivation $y$ of $x$ to a finite set of parts. A part is defined as either $\langle A, s, e, i \rangle$ ($A$ a non-terminal, $s$ and $e$ start and end points, and $i$ the sentence index) or $\langle A \rightarrow B\,C, s, m, e, i \rangle$ (where $A \rightarrow B\,C$ is a particular rule and $m$ a split point); these are the parts labeled $r$ and $q$, respectively, in Taskar's figure.
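
The constrained problem (3) can equivalently be written as minimizing a structured hinge loss, which is easy to sketch if we pretend the candidate set G(x_i) can be enumerated explicitly. The following brute-force sketch is only meant to make the objective concrete; Taskar et al. instead solve the factored dual with dynamic programming, and all data, weights, and names here are toy assumptions.

```python
import numpy as np

def hinge_objective(w, examples, C=1.0):
    """0.5*||w||^2 + C * sum_i max_y [ L_i(y) - <w, Phi_i(y_i) - Phi_i(y)> ]_+

    examples: list of (phi_gold, candidates), where candidates is a list of
    (phi_y, loss_y) pairs that includes the gold parse with loss 0.
    """
    total = 0.5 * float(w @ w)
    for phi_gold, candidates in examples:
        slack = max(loss + float(w @ (phi - phi_gold)) for phi, loss in candidates)
        total += C * max(0.0, slack)
    return total

def subgradient_step(w, examples, C=1.0, lr=0.1):
    """One subgradient step; the inner argmax is loss-augmented inference."""
    grad = w.copy()
    for phi_gold, candidates in examples:
        phi_worst, _ = max(candidates,
                           key=lambda c: c[1] + float(w @ (c[0] - phi_gold)))
        grad += C * (phi_worst - phi_gold)
    return w - lr * grad

# Toy example: 3-dimensional features, one sentence, gold parse + two candidates.
phi_gold = np.array([1.0, 0.0, 1.0])
candidates = [(phi_gold, 0.0),
              (np.array([0.0, 1.0, 1.0]), 2.0),   # loss 2: two wrong parts
              (np.array([1.0, 1.0, 0.0]), 1.0)]   # loss 1: one wrong part
examples = [(phi_gold, candidates)]
w = np.zeros(3)
for _ in range(50):
    w = subgradient_step(w, examples, C=1.0, lr=0.05)
print(w, hinge_objective(w, examples))
```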

Taskar uses the decomposition assumptions to re-frame the original optimization problem in terms of a polynomial number of variables (cubic in the length of the sentence) and a polynomial number of constraints (quadratic). The result is a slight improvement (an absolute 0.43% increase) over Collins' re-ranking parser in F1 measure.

5 More Discriminative Re-ranking

The impact of (Collins, 2000) and the early figures who initially developed the schema for re-ranking candidate parse trees can be seen in the field's subsequent focus on discriminative re-ranking approaches. The last section described how (Taskar et al., 2004) attempted to efficiently discriminate the entire set of parse trees using dynamic programming. The final two papers we briefly look into (Charniak and Johnson, 2005; Huang, 2008) also address the re-ranking task.

5.1 Rank n-best Parses with Coarse-to-Fine Dynamic Programming, and Re-rank with MaxEnt

(Charniak and Johnson, 2005) proposes using a MaxEnt re-ranker to select the best parse from the 50-best parses returned by a generative parsing model, and presents a simple method for constructing sets of 50-best parses based on a coarse-to-fine generative parser. Notice the contrasting stance between (Charniak and Johnson, 2005) and (Taskar et al., 2004). The latter, as we have seen, suggests considering all candidate trees in re-ranking. The former, on the other hand, tries to come up with a relatively short candidate list of very high quality.

The main difficulty in n-best parsing, compared to 1-best parsing, is that dynamic programming is harder to apply, and replacing it with the more natural approaches using best-first search or beam search (e.g., we keep looking for the next best candidate as long as we want) results in a loss of efficiency. One way to retain the use of dynamic programming in n-best parsing is to exploit the local characteristic of the Viterbi algorithm: in the optimal parse, the parsing decisions at each of the choice points that determine a parse must be optimal. For instance, in the second-best parse, all but one of the parsing decisions must be optimal. Thus we can first find the best parse, then find the second-best parse, then the third-best, and so on.

Charniak presents a novel 2-pass method: for every edge, he stores the n best parses rather than a single best parse. The space problem ($O(nm^3)$, where $m$ is the length of the sentence) is mitigated by the 2-pass system. The first pass creates a crude version of the parse based on a much less complex version of the complete grammar (using coarse-grained dynamic programming states). The edges are then pruned according to

$$p(n^i_{j,k} \mid s) = \frac{\alpha(n^i_{j,k})\,\beta(n^i_{j,k})}{p(s)},$$

where $n^i_{j,k}$ is a constituent of type $i$ spanning the words from $j$ to $k$, $\alpha(n^i_{j,k})$ is the outside probability of this constituent, and $\beta(n^i_{j,k})$ is its inside probability. The parser removes all constituents $n^i_{j,k}$ whose probability falls below some threshold (here on the order of $10^{-4}$). The remaining edges are then exhaustively evaluated according to the fine-grained probabilistic model, which conditions on much richer information, such as the lexical head of a constituent's parent, the POS of this head, etc. In a sense, this 2-pass coarse-to-fine method divides the burden of spatial complexity.

The result of this dynamic programming n-best parsing algorithm with coarse-to-fine refinement is that the n-best parser's most probable parses are already of high quality, and applying a MaxEnt discriminative re-ranker further improves performance. That this selective choice of parses outperforms a less selective choice is illustrated by the comparison with the Collins parser's n-best trees: the new model's F-score is 91.02%, a statistically significant improvement over Collins' 90.37%.
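
The pruning criterion itself is a one-liner once the inside and outside scores under the coarse grammar are available. The sketch below assumes those scores have already been computed (e.g., with the inside-outside algorithm); the chart representation, threshold, and toy numbers are illustrative.

```python
# Coarse-to-fine pruning: keep a constituent only if its posterior
# alpha * beta / p(s) under the coarse grammar clears a threshold.
# Constituents are keyed by (label, start, end).

def prune_chart(inside, outside, sentence_prob, threshold=1e-4):
    """Return the constituents that survive pruning, with their posteriors."""
    kept = {}
    for item, beta in inside.items():
        alpha = outside.get(item, 0.0)
        posterior = alpha * beta / sentence_prob
        if posterior >= threshold:
            kept[item] = posterior
    return kept

inside = {("NP", 0, 2): 1e-3, ("VP", 2, 5): 2e-4, ("NP", 2, 4): 5e-9}
outside = {("NP", 0, 2): 4e-4, ("VP", 2, 5): 1e-3, ("NP", 2, 4): 2e-5}
p_s = 4e-7
for item, post in prune_chart(inside, outside, p_s).items():
    print(item, round(post, 4))
# ("NP",0,2) and ("VP",2,5) survive; ("NP",2,4) is pruned.
```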

5.2 Re-rank the Forests

The final model we consider re-ranks entire forests of parses (Huang, 2008). Huang shows that re-ranking a packed forest of exponentially many parses, enabled by an efficient approximation algorithm, results in the highest F-score we have seen so far, 91.7, outperforming both 50-best and 100-best re-ranking baselines. The reason we may consider going beyond discriminating only the top n candidate trees is that the limited scope of the n-best list rules out many potentially good alternatives. (Taskar et al., 2004) used dynamic programming to search the whole tree space, but all features were restricted to be local (recall the decomposition assumptions that looked at local windows within the factored search space). Huang attempts to expand this approach to include non-local features by forest re-ranking: non-local features are computed incrementally bottom-up, so that we can re-rank the n-best subtrees at all internal nodes, instead of only at the root node.

A packed parse forest is a compact representation of all the derivations for a given sentence under a CFG. For example, the verb phrase "saw him with a mirror" has a corresponding forest illustrating the ambiguity as to where to attach the prepositional phrase "with a mirror". Formally, a forest is a pair $\langle V, E \rangle$ where $V$ is the set of nodes (each a non-terminal spanning a portion of $s$) and $E$ is the set of hyperedges (each $e \in E$ is a pair $\langle tails(e), head(e) \rangle$, where $head(e) \in V$ is the consequent node, i.e., the left side of a grammar rule, and $tails(e) \subseteq V$ is the set of antecedent nodes, i.e., the right side of a grammar rule). So the VP forest in Huang's figure would be expressed by $\langle V, E \rangle$ where

$$V = \{VP_{1,6},\, VBD_{1,2},\, NP_{2,6},\, NP_{2,3},\, PP_{3,6}\}$$
$$E = \{\langle tails(e_1), head(e_1) \rangle, \langle tails(e_2), head(e_2) \rangle\} = \{\langle (VBD_{1,2}, NP_{2,3}, PP_{3,6}),\, VP_{1,6} \rangle,\; \langle (VBD_{1,2}, NP_{2,6}),\, VP_{1,6} \rangle\}.$$

Note the difference this makes with regard to the job of a re-ranker, choosing the best-scoring parse tree among the candidates:

$$\hat{y} = \arg\max_{y \in cand(s)} score(y).$$

In n-best re-ranking, $cand(s) = \{y_1 \ldots y_n\}$, whereas in forest re-ranking, $cand(s)$ is one forest implicitly containing exponentially many possible parses. The score function is defined as a weighted sum of features of the tree $y$: $score(y) = \langle w, f(y) \rangle$, where $f = (f_1 \ldots f_d)$ is a feature extractor for $d$ features. Each feature basically counts the occurrences of a structural instance in a tree, such as a VP of length 5 being surrounded by the word "has" and a period. To learn the weights, Huang uses the averaged perceptron algorithm, using for each sentence an oracle parse $y_i^+$ (we cannot use the gold standard $y_i^*$ in the training data because the grammar may fail to cover the gold parse), estimated by maximizing the F-score against the gold standard among the candidates:

$$y_i^+ = \arg\max_{y \in cand(s_i)} F(y, y_i^*).$$

(The pseudocode for the perceptron is given as a figure in Huang's paper.)

Another challenge Huang tries to overcome is to incorporate non-local features (intuitively speaking, the ones for which we do not yet have complete information) as well as local ones. He first splits the feature extractor $f = (f_1 \ldots f_d)$ into local and non-local parts, $f = (f_L; f_N)$. Computing the local features is easy because we have the information of all the edges, so we can pre-compute $f_L(e)$ for every edge $e$ in a forest and sum them up, $f_L(y) = \sum_{e \in y} f_L(e)$. The non-local features are not so straightforward, because we cannot observe the required edges immediately. However, we want to compute them as early as possible, so we compute each feature at the smallest common ancestor that contains all its antecedents. More precisely, we factor non-local features across subtrees: for each subtree $y'$ of a parse $y$, we obtain the part $\hat{f}_N(y')$ (called a unit feature) of the feature $f(y)$ that is computable within $y'$. In this way, we can build unit features bottom-up to compute the non-local features, $f_N(y) = \sum_{y' \subseteq y} \hat{f}_N(y')$.
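
A packed forest and the local part of the scoring are easy to write down. The sketch below represents the "saw him with a mirror" ambiguity as nodes and hyperedges and runs a bottom-up Viterbi search over hyperedge-local scores; Huang's re-ranker additionally folds non-local unit features into an approximate n-best computation at each node, which is omitted here. All names and weights are illustrative.

```python
from collections import namedtuple

Hyperedge = namedtuple("Hyperedge", ["head", "tails", "score"])  # score = w . f_L(e)

# Two hyperedges derive VP[1,6], sharing the sub-forests below them.
hyperedges = [
    Hyperedge("VP[1,6]", ("VBD[1,2]", "NP[2,3]", "PP[3,6]"), score=1.2),  # VP -> V NP PP
    Hyperedge("VP[1,6]", ("VBD[1,2]", "NP[2,6]"), score=0.9),             # VP -> V NP
    Hyperedge("NP[2,6]", ("NP[2,3]", "PP[3,6]"), score=0.7),              # NP -> NP PP
]
leaves = {"VBD[1,2]": 0.0, "NP[2,3]": 0.0, "PP[3,6]": 0.0}

def viterbi(root, hyperedges, leaves):
    """Best derivation score and backpointers, computed bottom-up over nodes."""
    best, back = dict(leaves), {}
    incoming = {}
    for e in hyperedges:
        incoming.setdefault(e.head, []).append(e)

    def solve(node):
        if node in best:
            return best[node]
        score, edge = max(((e.score + sum(solve(t) for t in e.tails), e)
                           for e in incoming[node]), key=lambda p: p[0])
        best[node], back[node] = score, edge
        return score

    return solve(root), back

score, back = viterbi("VP[1,6]", hyperedges, leaves)
print(score)                  # 1.6: attaching the PP to the NP wins here
print(back["VP[1,6]"].tails)  # ('VBD[1,2]', 'NP[2,6]')
```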

Huang presents an efficient dynamic programming algorithm to compute the forest oracle and pruned forests, but we do not have space here to elaborate on them. Huang achieves a 0.26% absolute improvement over 50-best re-ranking.

6 Conclusion

In this whirlwind tour of the recent developments in statistical parsing, we have investigated increasingly sophisticated algorithmic approaches to learning better from the training data provided by hand-annotated corpora such as the Penn Treebank. The generative models and the re-ranking schema proposed by (Collins, 1997; Collins, 2000) illustrated many of the most crucial components of statistical parsing, such as the formulation of the task as an optimization problem and the benefit of reconsidering the plausibility of the candidate trees. Others joined to suggest various methods, pushing parsing accuracy further. (Taskar et al., 2004) used the max-margin technique from SVMs to efficiently discriminate the entire space of parse trees with dynamic programming. (Charniak and Johnson, 2005) presented a simple way to produce a high-quality list of n-best parses to accommodate re-ranking. (Huang, 2008) showed that re-ranking whole forests (packed parses that include the ambiguity information of sentences) overcomes both the limited scope of n-best parsing and the locality constraint of the conventional dynamic programming approaches.

The highest accuracy seen in this survey is 91.69% (F-score) on the test data. This is an impressive feat. In light of the strengths of the statistical solution to parsing that we considered in the introduction (error-tolerant, easier to build, etc.), such high performance confirms that we are on the right track to solving the problem of parsing. However, the truth is that the current solution is far from enough. According to the accuracy score (which in real applications will likely be much lower, since they will involve domains outside the test data), we get a wrong parse roughly once in every ten inputs. Much worse, as we have glimpsed, a sentence becomes exponentially harder to parse (corresponding to the exponentially bigger search space) as it increases in length. But a machine must be able to parse sentences of non-trivial length if it is to exhibit a level of non-trivial intelligence. In other words, it is the long sentences rather than the short ones that we really want to be able to parse accurately. Without this ability, it is doubtful machines will ever be capable of performing deep interactions in language.

Note that the rate of increase in performance drops continuously as we move from the early phase of statistical parsing to the later phase. For instance, (Collins, 1997) achieved an absolute improvement of 2.3% over the previous best-performing parser, whereas (Huang, 2008) gained only 0.26% despite the use of incredibly complicated methods. At this rate, we will never reach the point of 99.9% accuracy, which is what we really need.

Why are we not successful? Perhaps my personal experience of acquiring the English language may shed some light. As I started learning to read English texts at the late age of 15, the most difficult challenge was (not surprisingly) understanding the structure of a sentence, i.e., parsing.
It was extremely hard to tell where the noun phrase ended, whether the gerund was a verb or an adjective, which was the head verb phrase of the sentence, and so on. Without knowing them, it was impossible to make any sense of the text.

How did I overcome the problem? I can say I learned to parse statistically, by reading hundreds of books and thereby accumulating a large data set to consult. Although the set was not annotated like the Penn Treebank, by carefully decoding the structure of sentences in the early phase, I had access to information about which parses were good and which were not. There was a certain take-off point in my ability to understand English. Prior to that point, my interaction with the language went almost entirely into syntactic learning, and it meant little semantically. Put differently, I was reading, but I was not understanding, disoriented too much by the challenge of parsing. Then, at some point, I became good enough at parsing that I could finally use the task in making semantic interpretations of the text, rather than being enslaved by it. My parsing skills were incomplete, yet the ability to do semantic analysis in turn greatly helped me parse subsequent sentences better (e.g., using common sense to attach a prepositional phrase to the right target). The bootstrapping process suddenly became blindingly fast, and before I knew it I was fluent in the language.

In this regard, perhaps the state-of-the-art performance of the statistical parsers of today corresponds to the stage immediately before the take-off point. This suggests that, as many non-statistical NLP researchers assert, statistics alone will not take us there. But such a vague assertion is not very helpful. We must find a method to simulate the bootstrapping process between the syntactic and semantic information of the text. Without it, no matter how cleverly the statistical parsers are trained, they will likely stay at the current level of 90%. This needs to be a central focus in AI research, since it lies at the very foundation of language intelligence, without which AI will never truly connect with humans. On the bright side, this suggests that we are not too far away from the take-off point that I experienced. Once we reach it, the rate of advancement in machine intelligence will be phenomenal.

Acknowledgements

The greatest thanks go to Professor Dan Gildea for suggesting this topic to me. May his work continue to pioneer and inspire the field of statistical NLP.

References

Joshua Goodman. 1999. Semiring Parsing. ACL.

John Cocke and Jacob T. Schwartz. 1970. Programming languages and their compilers: Preliminary notes. Technical report, Courant Institute of Mathematical Sciences, New York University.

T. Kasami. 1965. An efficient recognition and syntax-analysis algorithm for context-free languages. Scientific report AFCRL, Air Force Cambridge Research Lab, Bedford, MA.

Daniel H. Younger. 1967. Recognition and parsing of context-free languages in time n^3. Information and Control 10(2).

J. Earley. 1970. An efficient context-free parsing algorithm. Communications of the Association for Computing Machinery 13(2).

Masaru Tomita. 1984. LR parsers for natural languages. COLING: 10th International Conference on Computational Linguistics.

M. Marcus, B. Santorini, and M. Marcinkiewicz. 1993. Building a Large Annotated Corpus of English: the Penn Treebank. Computational Linguistics.

Michael Collins. 1997. Three Generative, Lexicalised Models for Statistical Parsing. EACL '97: Proceedings of the Eighth Conference of the European Chapter of the Association for Computational Linguistics.

Michael Collins. 2000. Discriminative Reranking for Natural Language Parsing. Proceedings of the International Conference on Machine Learning (ICML).
Ben Taskar, Dan Klein, Michael Collins, Daphne Koller, and Christopher Manning. 2004. Max-Margin Parsing. Empirical Methods in Natural Language Processing (EMNLP 2004).

Eugene Charniak and Mark Johnson. 2005. Coarse-to-fine n-best parsing and MaxEnt discriminative reranking. ACL.

Liang Huang. 2008. Forest Reranking: Discriminative Parsing with Non-Local Features. ACL.


More information

Probabilistic Context Free Grammars. Many slides from Michael Collins

Probabilistic Context Free Grammars. Many slides from Michael Collins Probabilistic Context Free Grammars Many slides from Michael Collins Overview I Probabilistic Context-Free Grammars (PCFGs) I The CKY Algorithm for parsing with PCFGs A Probabilistic Context-Free Grammar

More information

Parsing. Based on presentations from Chris Manning s course on Statistical Parsing (Stanford)

Parsing. Based on presentations from Chris Manning s course on Statistical Parsing (Stanford) Parsing Based on presentations from Chris Manning s course on Statistical Parsing (Stanford) S N VP V NP D N John hit the ball Levels of analysis Level Morphology/Lexical POS (morpho-synactic), WSD Elements

More information

Probabilistic Context-Free Grammars. Michael Collins, Columbia University

Probabilistic Context-Free Grammars. Michael Collins, Columbia University Probabilistic Context-Free Grammars Michael Collins, Columbia University Overview Probabilistic Context-Free Grammars (PCFGs) The CKY Algorithm for parsing with PCFGs A Probabilistic Context-Free Grammar

More information

Machine Learning for NLP

Machine Learning for NLP Machine Learning for NLP Uppsala University Department of Linguistics and Philology Slides borrowed from Ryan McDonald, Google Research Machine Learning for NLP 1(50) Introduction Linear Classifiers Classifiers

More information

Multilevel Coarse-to-Fine PCFG Parsing

Multilevel Coarse-to-Fine PCFG Parsing Multilevel Coarse-to-Fine PCFG Parsing Eugene Charniak, Mark Johnson, Micha Elsner, Joseph Austerweil, David Ellis, Isaac Haxton, Catherine Hill, Shrivaths Iyengar, Jeremy Moore, Michael Pozar, and Theresa

More information

Chapter 14 (Partially) Unsupervised Parsing

Chapter 14 (Partially) Unsupervised Parsing Chapter 14 (Partially) Unsupervised Parsing The linguistically-motivated tree transformations we discussed previously are very effective, but when we move to a new language, we may have to come up with

More information

Intelligent Systems (AI-2)

Intelligent Systems (AI-2) Intelligent Systems (AI-2) Computer Science cpsc422, Lecture 19 Oct, 24, 2016 Slide Sources Raymond J. Mooney University of Texas at Austin D. Koller, Stanford CS - Probabilistic Graphical Models D. Page,

More information

c(a) = X c(a! Ø) (13.1) c(a! Ø) ˆP(A! Ø A) = c(a)

c(a) = X c(a! Ø) (13.1) c(a! Ø) ˆP(A! Ø A) = c(a) Chapter 13 Statistical Parsg Given a corpus of trees, it is easy to extract a CFG and estimate its parameters. Every tree can be thought of as a CFG derivation, and we just perform relative frequency estimation

More information

Lab 12: Structured Prediction

Lab 12: Structured Prediction December 4, 2014 Lecture plan structured perceptron application: confused messages application: dependency parsing structured SVM Class review: from modelization to classification What does learning mean?

More information

Machine Learning for Structured Prediction

Machine Learning for Structured Prediction Machine Learning for Structured Prediction Grzegorz Chrupa la National Centre for Language Technology School of Computing Dublin City University NCLT Seminar Grzegorz Chrupa la (DCU) Machine Learning for

More information

10/17/04. Today s Main Points

10/17/04. Today s Main Points Part-of-speech Tagging & Hidden Markov Model Intro Lecture #10 Introduction to Natural Language Processing CMPSCI 585, Fall 2004 University of Massachusetts Amherst Andrew McCallum Today s Main Points

More information

Probabilistic Context-Free Grammar

Probabilistic Context-Free Grammar Probabilistic Context-Free Grammar Petr Horáček, Eva Zámečníková and Ivana Burgetová Department of Information Systems Faculty of Information Technology Brno University of Technology Božetěchova 2, 612

More information

Probabilistic Context Free Grammars. Many slides from Michael Collins and Chris Manning

Probabilistic Context Free Grammars. Many slides from Michael Collins and Chris Manning Probabilistic Context Free Grammars Many slides from Michael Collins and Chris Manning Overview I Probabilistic Context-Free Grammars (PCFGs) I The CKY Algorithm for parsing with PCFGs A Probabilistic

More information

Natural Language Processing 1. lecture 7: constituent parsing. Ivan Titov. Institute for Logic, Language and Computation

Natural Language Processing 1. lecture 7: constituent parsing. Ivan Titov. Institute for Logic, Language and Computation atural Language Processing 1 lecture 7: constituent parsing Ivan Titov Institute for Logic, Language and Computation Outline Syntax: intro, CFGs, PCFGs PCFGs: Estimation CFGs: Parsing PCFGs: Parsing Parsing

More information

2.2 Structured Prediction

2.2 Structured Prediction The hinge loss (also called the margin loss), which is optimized by the SVM, is a ramp function that has slope 1 when yf(x) < 1 and is zero otherwise. Two other loss functions squared loss and exponential

More information

10 : HMM and CRF. 1 Case Study: Supervised Part-of-Speech Tagging

10 : HMM and CRF. 1 Case Study: Supervised Part-of-Speech Tagging 10-708: Probabilistic Graphical Models 10-708, Spring 2018 10 : HMM and CRF Lecturer: Kayhan Batmanghelich Scribes: Ben Lengerich, Michael Kleyman 1 Case Study: Supervised Part-of-Speech Tagging We will

More information

Maxent Models and Discriminative Estimation

Maxent Models and Discriminative Estimation Maxent Models and Discriminative Estimation Generative vs. Discriminative models (Reading: J+M Ch6) Introduction So far we ve looked at generative models Language models, Naive Bayes But there is now much

More information

A Syntax-based Statistical Machine Translation Model. Alexander Friedl, Georg Teichtmeister

A Syntax-based Statistical Machine Translation Model. Alexander Friedl, Georg Teichtmeister A Syntax-based Statistical Machine Translation Model Alexander Friedl, Georg Teichtmeister 4.12.2006 Introduction The model Experiment Conclusion Statistical Translation Model (STM): - mathematical model

More information

CMPT-825 Natural Language Processing. Why are parsing algorithms important?

CMPT-825 Natural Language Processing. Why are parsing algorithms important? CMPT-825 Natural Language Processing Anoop Sarkar http://www.cs.sfu.ca/ anoop October 26, 2010 1/34 Why are parsing algorithms important? A linguistic theory is implemented in a formal system to generate

More information

Lecture 15. Probabilistic Models on Graph

Lecture 15. Probabilistic Models on Graph Lecture 15. Probabilistic Models on Graph Prof. Alan Yuille Spring 2014 1 Introduction We discuss how to define probabilistic models that use richly structured probability distributions and describe how

More information

Lecture Notes on Inductive Definitions

Lecture Notes on Inductive Definitions Lecture Notes on Inductive Definitions 15-312: Foundations of Programming Languages Frank Pfenning Lecture 2 August 28, 2003 These supplementary notes review the notion of an inductive definition and give

More information

CMSC 723: Computational Linguistics I Session #5 Hidden Markov Models. The ischool University of Maryland. Wednesday, September 30, 2009

CMSC 723: Computational Linguistics I Session #5 Hidden Markov Models. The ischool University of Maryland. Wednesday, September 30, 2009 CMSC 723: Computational Linguistics I Session #5 Hidden Markov Models Jimmy Lin The ischool University of Maryland Wednesday, September 30, 2009 Today s Agenda The great leap forward in NLP Hidden Markov

More information

Lecture Notes on Inductive Definitions

Lecture Notes on Inductive Definitions Lecture Notes on Inductive Definitions 15-312: Foundations of Programming Languages Frank Pfenning Lecture 2 September 2, 2004 These supplementary notes review the notion of an inductive definition and

More information

This kind of reordering is beyond the power of finite transducers, but a synchronous CFG can do this.

This kind of reordering is beyond the power of finite transducers, but a synchronous CFG can do this. Chapter 12 Synchronous CFGs Synchronous context-free grammars are a generalization of CFGs that generate pairs of related strings instead of single strings. They are useful in many situations where one

More information

Algorithms for NLP. Classifica(on III. Taylor Berg- Kirkpatrick CMU Slides: Dan Klein UC Berkeley

Algorithms for NLP. Classifica(on III. Taylor Berg- Kirkpatrick CMU Slides: Dan Klein UC Berkeley Algorithms for NLP Classifica(on III Taylor Berg- Kirkpatrick CMU Slides: Dan Klein UC Berkeley The Perceptron, Again Start with zero weights Visit training instances one by one Try to classify If correct,

More information

Driving Semantic Parsing from the World s Response

Driving Semantic Parsing from the World s Response Driving Semantic Parsing from the World s Response James Clarke, Dan Goldwasser, Ming-Wei Chang, Dan Roth Cognitive Computation Group University of Illinois at Urbana-Champaign CoNLL 2010 Clarke, Goldwasser,

More information

Part of Speech Tagging: Viterbi, Forward, Backward, Forward- Backward, Baum-Welch. COMP-599 Oct 1, 2015

Part of Speech Tagging: Viterbi, Forward, Backward, Forward- Backward, Baum-Welch. COMP-599 Oct 1, 2015 Part of Speech Tagging: Viterbi, Forward, Backward, Forward- Backward, Baum-Welch COMP-599 Oct 1, 2015 Announcements Research skills workshop today 3pm-4:30pm Schulich Library room 313 Start thinking about

More information

Algorithms for Syntax-Aware Statistical Machine Translation

Algorithms for Syntax-Aware Statistical Machine Translation Algorithms for Syntax-Aware Statistical Machine Translation I. Dan Melamed, Wei Wang and Ben Wellington ew York University Syntax-Aware Statistical MT Statistical involves machine learning (ML) seems crucial

More information

CS460/626 : Natural Language

CS460/626 : Natural Language CS460/626 : Natural Language Processing/Speech, NLP and the Web (Lecture 23, 24 Parsing Algorithms; Parsing in case of Ambiguity; Probabilistic Parsing) Pushpak Bhattacharyya CSE Dept., IIT Bombay 8 th,

More information

Lecture 12: Algorithms for HMMs

Lecture 12: Algorithms for HMMs Lecture 12: Algorithms for HMMs Nathan Schneider (some slides from Sharon Goldwater; thanks to Jonathan May for bug fixes) ENLP 26 February 2018 Recap: tagging POS tagging is a sequence labelling task.

More information

DT2118 Speech and Speaker Recognition

DT2118 Speech and Speaker Recognition DT2118 Speech and Speaker Recognition Language Modelling Giampiero Salvi KTH/CSC/TMH giampi@kth.se VT 2015 1 / 56 Outline Introduction Formal Language Theory Stochastic Language Models (SLM) N-gram Language

More information

Intelligent Systems (AI-2)

Intelligent Systems (AI-2) Intelligent Systems (AI-2) Computer Science cpsc422, Lecture 19 Oct, 23, 2015 Slide Sources Raymond J. Mooney University of Texas at Austin D. Koller, Stanford CS - Probabilistic Graphical Models D. Page,

More information

Sequential Supervised Learning

Sequential Supervised Learning Sequential Supervised Learning Many Application Problems Require Sequential Learning Part-of of-speech Tagging Information Extraction from the Web Text-to to-speech Mapping Part-of of-speech Tagging Given

More information

CMU at SemEval-2016 Task 8: Graph-based AMR Parsing with Infinite Ramp Loss

CMU at SemEval-2016 Task 8: Graph-based AMR Parsing with Infinite Ramp Loss CMU at SemEval-2016 Task 8: Graph-based AMR Parsing with Infinite Ramp Loss Jeffrey Flanigan Chris Dyer Noah A. Smith Jaime Carbonell School of Computer Science, Carnegie Mellon University, Pittsburgh,

More information

Multiword Expression Identification with Tree Substitution Grammars

Multiword Expression Identification with Tree Substitution Grammars Multiword Expression Identification with Tree Substitution Grammars Spence Green, Marie-Catherine de Marneffe, John Bauer, and Christopher D. Manning Stanford University EMNLP 2011 Main Idea Use syntactic

More information

A Support Vector Method for Multivariate Performance Measures

A Support Vector Method for Multivariate Performance Measures A Support Vector Method for Multivariate Performance Measures Thorsten Joachims Cornell University Department of Computer Science Thanks to Rich Caruana, Alexandru Niculescu-Mizil, Pierre Dupont, Jérôme

More information

Linear Classifiers IV

Linear Classifiers IV Universität Potsdam Institut für Informatik Lehrstuhl Linear Classifiers IV Blaine Nelson, Tobias Scheffer Contents Classification Problem Bayesian Classifier Decision Linear Classifiers, MAP Models Logistic

More information

Santa Claus Schedules Jobs on Unrelated Machines

Santa Claus Schedules Jobs on Unrelated Machines Santa Claus Schedules Jobs on Unrelated Machines Ola Svensson (osven@kth.se) Royal Institute of Technology - KTH Stockholm, Sweden March 22, 2011 arxiv:1011.1168v2 [cs.ds] 21 Mar 2011 Abstract One of the

More information

Introduction to Computational Linguistics

Introduction to Computational Linguistics Introduction to Computational Linguistics Olga Zamaraeva (2018) Based on Bender (prev. years) University of Washington May 3, 2018 1 / 101 Midterm Project Milestone 2: due Friday Assgnments 4& 5 due dates

More information

CSE 417T: Introduction to Machine Learning. Final Review. Henry Chai 12/4/18

CSE 417T: Introduction to Machine Learning. Final Review. Henry Chai 12/4/18 CSE 417T: Introduction to Machine Learning Final Review Henry Chai 12/4/18 Overfitting Overfitting is fitting the training data more than is warranted Fitting noise rather than signal 2 Estimating! "#$

More information

Empirical Methods in Natural Language Processing Lecture 11 Part-of-speech tagging and HMMs

Empirical Methods in Natural Language Processing Lecture 11 Part-of-speech tagging and HMMs Empirical Methods in Natural Language Processing Lecture 11 Part-of-speech tagging and HMMs (based on slides by Sharon Goldwater and Philipp Koehn) 21 February 2018 Nathan Schneider ENLP Lecture 11 21

More information

Low-Dimensional Discriminative Reranking. Jagadeesh Jagarlamudi and Hal Daume III University of Maryland, College Park

Low-Dimensional Discriminative Reranking. Jagadeesh Jagarlamudi and Hal Daume III University of Maryland, College Park Low-Dimensional Discriminative Reranking Jagadeesh Jagarlamudi and Hal Daume III University of Maryland, College Park Discriminative Reranking Useful for many NLP tasks Enables us to use arbitrary features

More information

Lecture 12: Algorithms for HMMs

Lecture 12: Algorithms for HMMs Lecture 12: Algorithms for HMMs Nathan Schneider (some slides from Sharon Goldwater; thanks to Jonathan May for bug fixes) ENLP 17 October 2016 updated 9 September 2017 Recap: tagging POS tagging is a

More information

CS626: NLP, Speech and the Web. Pushpak Bhattacharyya CSE Dept., IIT Bombay Lecture 14: Parsing Algorithms 30 th August, 2012

CS626: NLP, Speech and the Web. Pushpak Bhattacharyya CSE Dept., IIT Bombay Lecture 14: Parsing Algorithms 30 th August, 2012 CS626: NLP, Speech and the Web Pushpak Bhattacharyya CSE Dept., IIT Bombay Lecture 14: Parsing Algorithms 30 th August, 2012 Parsing Problem Semantics Part of Speech Tagging NLP Trinity Morph Analysis

More information

Natural Language Processing

Natural Language Processing SFU NatLangLab Natural Language Processing Anoop Sarkar anoopsarkar.github.io/nlp-class Simon Fraser University October 9, 2018 0 Natural Language Processing Anoop Sarkar anoopsarkar.github.io/nlp-class

More information

Probabilistic Graphical Models: MRFs and CRFs. CSE628: Natural Language Processing Guest Lecturer: Veselin Stoyanov

Probabilistic Graphical Models: MRFs and CRFs. CSE628: Natural Language Processing Guest Lecturer: Veselin Stoyanov Probabilistic Graphical Models: MRFs and CRFs CSE628: Natural Language Processing Guest Lecturer: Veselin Stoyanov Why PGMs? PGMs can model joint probabilities of many events. many techniques commonly

More information

The Noisy Channel Model and Markov Models

The Noisy Channel Model and Markov Models 1/24 The Noisy Channel Model and Markov Models Mark Johnson September 3, 2014 2/24 The big ideas The story so far: machine learning classifiers learn a function that maps a data item X to a label Y handle

More information

On the Sizes of Decision Diagrams Representing the Set of All Parse Trees of a Context-free Grammar

On the Sizes of Decision Diagrams Representing the Set of All Parse Trees of a Context-free Grammar Proceedings of Machine Learning Research vol 73:153-164, 2017 AMBN 2017 On the Sizes of Decision Diagrams Representing the Set of All Parse Trees of a Context-free Grammar Kei Amii Kyoto University Kyoto

More information

Lecture 13: Structured Prediction

Lecture 13: Structured Prediction Lecture 13: Structured Prediction Kai-Wei Chang CS @ University of Virginia kw@kwchang.net Couse webpage: http://kwchang.net/teaching/nlp16 CS6501: NLP 1 Quiz 2 v Lectures 9-13 v Lecture 12: before page

More information

Statistical NLP Spring A Discriminative Approach

Statistical NLP Spring A Discriminative Approach Statistical NLP Spring 2008 Lecture 6: Classification Dan Klein UC Berkeley A Discriminative Approach View WSD as a discrimination task (regression, really) P(sense context:jail, context:county, context:feeding,

More information

CS838-1 Advanced NLP: Hidden Markov Models

CS838-1 Advanced NLP: Hidden Markov Models CS838-1 Advanced NLP: Hidden Markov Models Xiaojin Zhu 2007 Send comments to jerryzhu@cs.wisc.edu 1 Part of Speech Tagging Tag each word in a sentence with its part-of-speech, e.g., The/AT representative/nn

More information

Tensor Decomposition for Fast Parsing with Latent-Variable PCFGs

Tensor Decomposition for Fast Parsing with Latent-Variable PCFGs Tensor Decomposition for Fast Parsing with Latent-Variable PCFGs Shay B. Cohen and Michael Collins Department of Computer Science Columbia University New York, NY 10027 scohen,mcollins@cs.columbia.edu

More information

Natural Language Processing

Natural Language Processing Natural Language Processing Global linear models Based on slides from Michael Collins Globally-normalized models Why do we decompose to a sequence of decisions? Can we directly estimate the probability

More information

Machine Learning for NLP

Machine Learning for NLP Machine Learning for NLP Linear Models Joakim Nivre Uppsala University Department of Linguistics and Philology Slides adapted from Ryan McDonald, Google Research Machine Learning for NLP 1(26) Outline

More information

Statistical Machine Translation

Statistical Machine Translation Statistical Machine Translation -tree-based models (cont.)- Artem Sokolov Computerlinguistik Universität Heidelberg Sommersemester 2015 material from P. Koehn, S. Riezler, D. Altshuler Bottom-Up Decoding

More information

Statistical Methods for NLP

Statistical Methods for NLP Statistical Methods for NLP Information Extraction, Hidden Markov Models Sameer Maskey Week 5, Oct 3, 2012 *many slides provided by Bhuvana Ramabhadran, Stanley Chen, Michael Picheny Speech Recognition

More information

Multiclass Classification-1

Multiclass Classification-1 CS 446 Machine Learning Fall 2016 Oct 27, 2016 Multiclass Classification Professor: Dan Roth Scribe: C. Cheng Overview Binary to multiclass Multiclass SVM Constraint classification 1 Introduction Multiclass

More information

Hidden Markov Models

Hidden Markov Models CS769 Spring 2010 Advanced Natural Language Processing Hidden Markov Models Lecturer: Xiaojin Zhu jerryzhu@cs.wisc.edu 1 Part-of-Speech Tagging The goal of Part-of-Speech (POS) tagging is to label each

More information

Soft Inference and Posterior Marginals. September 19, 2013

Soft Inference and Posterior Marginals. September 19, 2013 Soft Inference and Posterior Marginals September 19, 2013 Soft vs. Hard Inference Hard inference Give me a single solution Viterbi algorithm Maximum spanning tree (Chu-Liu-Edmonds alg.) Soft inference

More information

A* Search. 1 Dijkstra Shortest Path

A* Search. 1 Dijkstra Shortest Path A* Search Consider the eight puzzle. There are eight tiles numbered 1 through 8 on a 3 by three grid with nine locations so that one location is left empty. We can move by sliding a tile adjacent to the

More information

Global Machine Learning for Spatial Ontology Population

Global Machine Learning for Spatial Ontology Population Global Machine Learning for Spatial Ontology Population Parisa Kordjamshidi, Marie-Francine Moens KU Leuven, Belgium Abstract Understanding spatial language is important in many applications such as geographical

More information

SVAN 2016 Mini Course: Stochastic Convex Optimization Methods in Machine Learning

SVAN 2016 Mini Course: Stochastic Convex Optimization Methods in Machine Learning SVAN 2016 Mini Course: Stochastic Convex Optimization Methods in Machine Learning Mark Schmidt University of British Columbia, May 2016 www.cs.ubc.ca/~schmidtm/svan16 Some images from this lecture are

More information

ACS Introduction to NLP Lecture 2: Part of Speech (POS) Tagging

ACS Introduction to NLP Lecture 2: Part of Speech (POS) Tagging ACS Introduction to NLP Lecture 2: Part of Speech (POS) Tagging Stephen Clark Natural Language and Information Processing (NLIP) Group sc609@cam.ac.uk The POS Tagging Problem 2 England NNP s POS fencers

More information

CS 446 Machine Learning Fall 2016 Nov 01, Bayesian Learning

CS 446 Machine Learning Fall 2016 Nov 01, Bayesian Learning CS 446 Machine Learning Fall 206 Nov 0, 206 Bayesian Learning Professor: Dan Roth Scribe: Ben Zhou, C. Cervantes Overview Bayesian Learning Naive Bayes Logistic Regression Bayesian Learning So far, we

More information

Latent Variable Models in NLP

Latent Variable Models in NLP Latent Variable Models in NLP Aria Haghighi with Slav Petrov, John DeNero, and Dan Klein UC Berkeley, CS Division Latent Variable Models Latent Variable Models Latent Variable Models Observed Latent Variable

More information

Probabilistic Graphical Models

Probabilistic Graphical Models School of Computer Science Probabilistic Graphical Models Max-margin learning of GM Eric Xing Lecture 28, Apr 28, 2014 b r a c e Reading: 1 Classical Predictive Models Input and output space: Predictive

More information

Natural Language Processing. Slides from Andreas Vlachos, Chris Manning, Mihai Surdeanu

Natural Language Processing. Slides from Andreas Vlachos, Chris Manning, Mihai Surdeanu Natural Language Processing Slides from Andreas Vlachos, Chris Manning, Mihai Surdeanu Projects Project descriptions due today! Last class Sequence to sequence models Attention Pointer networks Today Weak

More information

NLP Programming Tutorial 11 - The Structured Perceptron

NLP Programming Tutorial 11 - The Structured Perceptron NLP Programming Tutorial 11 - The Structured Perceptron Graham Neubig Nara Institute of Science and Technology (NAIST) 1 Prediction Problems Given x, A book review Oh, man I love this book! This book is

More information

Grammars and introduction to machine learning. Computers Playing Jeopardy! Course Stony Brook University

Grammars and introduction to machine learning. Computers Playing Jeopardy! Course Stony Brook University Grammars and introduction to machine learning Computers Playing Jeopardy! Course Stony Brook University Last class: grammars and parsing in Prolog Noun -> roller Verb thrills VP Verb NP S NP VP NP S VP

More information

Predicting New Search-Query Cluster Volume

Predicting New Search-Query Cluster Volume Predicting New Search-Query Cluster Volume Jacob Sisk, Cory Barr December 14, 2007 1 Problem Statement Search engines allow people to find information important to them, and search engine companies derive

More information