arxiv: v1 [cs.cl] 24 May 2012

FASTSUBS: An Efficient Admissible Algorithm for Finding the Most Likely Lexical Substitutes Using a Statistical Language Model

Deniz Yuret
Koç University, İstanbul, Turkey
dyuret@ku.edu.tr

1 Introduction

Lexical substitutes have found use in the context of word sense disambiguation (Yuret and Yatbaz, 2010), unsupervised part-of-speech induction (Yatbaz and Yuret, 2010), paraphrasing (McCarthy and Navigli, 2007), machine translation (Mihalcea et al., 2010), and text simplification (Specia et al., 2012). Using a statistical language model to find the most likely substitutes in a given context is a successful approach (Hawker, 2007; Yuret, 2007), but the cost of a naive algorithm is proportional to the vocabulary size. This paper presents the FASTSUBS algorithm, which can efficiently and correctly identify the most likely lexical substitutes for a given context based on a statistical language model without going through most of the vocabulary.

The efficiency of FASTSUBS makes large scale experiments based on lexical substitutes feasible. For example, it is possible to compute the top 10 substitutes for each one of the 1,173,766 tokens in Penn Treebank (Marcus et al., 1999) in about 6 hours on a typical workstation. The same task would take about 6 days with the naive algorithm. An implementation of the algorithm and a dataset with the top 100 substitutes of each token in the WSJ section of the Penn Treebank are available from the author's website.

2 Substitute Probabilities

This section presents the derivation of lexical substitute probabilities based on an n-gram language model. Details of this derivation are important in finding an admissible algorithm that identifies the most likely substitutes efficiently, without trying out most of the vocabulary.

N-gram language models assign probabilities to arbitrary sequences of words (or other tokens such as punctuation) based on their occurrence statistics in large training corpora.
They approximate the probability of a sequence of words by assuming each word is conditionally independent of the rest given the previous $(n-1)$ words. For example, a trigram model would approximate the probability of a sequence $abcde$ as:

$$p(abcde) = p(a)\,p(b \mid a)\,p(c \mid ab)\,p(d \mid bc)\,p(e \mid cd) \tag{1}$$

where lowercase letters like $a$, $b$, $c$ represent words and strings of letters like $abcde$ represent word sequences.² The individual conditional probability terms are typically expressed in back-off form³ using log probabilities $l(c \mid ab) \equiv \log p(c \mid ab)$:

$$l(c \mid ab) = \begin{cases} \alpha(abc) & \text{if } f(abc) > 0 \\ \beta(ab) + l(c \mid b) & \text{otherwise} \end{cases} \tag{2}$$

where $\alpha(abc)$ is the discounted log probability estimate for $l(c \mid ab)$ (typically slightly less than the log frequency in the training corpus), $f(abc)$ is the number of times $abc$ has been observed in the training corpus, and $\beta(ab)$ is the back-off weight that makes the probabilities add up to 1. The formula can be generalized to arbitrary n-gram orders if we let $b$ stand for zero or more words. The recursion bottoms out at unigrams (single words) where $l(c) = \alpha(c)$. If there are any out-of-vocabulary words we assume they are mapped to a special UNK token, so $\alpha(c)$ is never undefined.

² I prefer this notation to the more general $w_{i-n+1}^{i+n-1}$ which I find difficult to read.
³ Even interpolated models can be represented in the back-off form and in fact that is the way SRILM stores them in ARPA (Doug Paul) format model files.

It is best to use both left and right context when estimating the probabilities for potential lexical substitutes. For example, in "He lived in San Francisco suburbs.", the token San would be difficult to guess from the left context but it is almost certain looking at the right context. The log probability of a substitute word given both left and right contexts can be estimated as:

$$\begin{aligned} l(c \mid ab\,\_\,de) &\propto l(abcde) \\ &\propto l(c \mid ab) + l(d \mid bc) + l(e \mid cd) \end{aligned} \tag{3}$$

Here the symbol $\_$ represents the position the candidate substitute $c$ is going to occupy, and $\propto$ relates log values that differ only by a term that does not depend on $c$. The first line follows from the definition of conditional probability and the second line comes from Equation 1, except that the terms that do not include the candidate $c$ have been dropped.

The expression for the unnormalized log probability of a lexical substitute according to Equation 3 and the decomposition of its terms according to Equation 2 can be combined to give us Equation 4. For arbitrary order n-gram models we would end up with a sum of $n$ terms and each term would come from one of $n$ alternatives.

$$\begin{aligned} l(c \mid ab\,\_\,de) \propto{} & \begin{cases} \alpha(abc) & \text{if } f(abc) > 0 \\ \beta(ab) + \alpha(bc) & \text{if } f(bc) > 0 \\ \beta(ab) + \beta(b) + \alpha(c) & \text{otherwise} \end{cases} \\ {}+{} & \begin{cases} \alpha(bcd) & \text{if } f(bcd) > 0 \\ \beta(bc) + \alpha(cd) & \text{if } f(cd) > 0 \\ \beta(bc) + \beta(c) + \alpha(d) & \text{otherwise} \end{cases} \\ {}+{} & \begin{cases} \alpha(cde) & \text{if } f(cde) > 0 \\ \beta(cd) + \alpha(de) & \text{if } f(de) > 0 \\ \beta(cd) + \beta(d) + \alpha(e) & \text{otherwise} \end{cases} \end{aligned} \tag{4}$$

3 Algorithm

A naive algorithm to find the most likely substitutes in a given context could try each word in the vocabulary as a potential substitute $c$ and compute the value of the expression given in Equation 4. The computation of Equation 4 requires $O(N^2)$ operations for an order $N$ language model, and if we have $V$ words in our vocabulary the cost of the naive algorithm to find a single most likely substitute would be $O(VN^2)$. We can do better, however, if we keep track of the candidate words $c$ that maximize the individual $\alpha$ and $\beta$ terms.
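The back-off recursion of Equation 2 and the naive search over Equation 4 can be sketched in a few lines of Python. This is a minimal illustration, not the author's implementation: the dictionaries `ALPHA` and `BETA` (keyed by word tuples), the toy values in them, and the function names are all assumptions made for the example, and a context missing from `BETA` is treated as having back-off weight 0.

```python
import heapq

# Toy bigram model, invented for illustration (log values, not from the paper).
ALPHA = {("in",): -1.5, ("new",): -2.5, ("san",): -3.0, ("francisco",): -3.5,
         ("UNK",): -5.0,
         ("in", "san"): -1.0, ("in", "new"): -0.8, ("san", "francisco"): -0.1}
BETA = {("in",): -0.3, ("san",): -0.2, ("new",): -0.4}
VOCAB = ["in", "new", "san", "francisco"]

def backoff_logprob(alpha, beta, context, word):
    """l(word | context) following Equation 2; context is a tuple of words."""
    ngram = context + (word,)
    if ngram in alpha:                      # f(ngram) > 0: discounted estimate
        return alpha[ngram]
    if not context:                         # unseen unigram: map to the UNK token
        return alpha[("UNK",)]
    # Back off to a shorter context; an unlisted context gets weight 0 (assumption).
    return beta.get(context, 0.0) + backoff_logprob(alpha, beta, context[1:], word)

def substitute_score(alpha, beta, left, c, right, n=3):
    """Unnormalized log probability of c between left and right (Equation 4), n >= 2."""
    left, right = tuple(left)[-(n - 1):], tuple(right)[:n - 1]
    seq = left + (c,) + right
    total = 0.0
    for i in range(len(left), len(seq)):    # only the terms that contain c
        total += backoff_logprob(alpha, beta, seq[max(0, i - n + 1):i], seq[i])
    return total

def naive_top_k(alpha, beta, vocab, left, right, k, n=3):
    """Score every vocabulary word: cost grows linearly with the vocabulary size."""
    scored = ((substitute_score(alpha, beta, left, c, right, n), c) for c in vocab)
    return [c for _, c in heapq.nlargest(k, scored)]
```

Calling `naive_top_k(ALPHA, BETA, VOCAB, ("in",), ("francisco",), 2, n=2)` ranks the candidates for the context "in _ francisco" under the toy bigram model; every word in `VOCAB` is scored, which is exactly the cost FASTSUBS aims to avoid.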
The basic idea of the FASTSUBS algorithm is to maintain priority queues for candidate words for each possible $\alpha$ and $\beta$ term and to derive priority queues for compound terms like sums and alternates from the queues for their constituents.

Queues: Maintaining exact values in these queues turns out to be impractical. Instead, each priority queue in FASTSUBS maintains an upper bound on the actual values of its elements. Such an upper bound priority queue can still be used to retrieve top elements as long as we check that their actual values are not below the upper bound for the elements remaining in the queue. We will define three functions for such upper bound priority queues:

SUP(q): will return an upper bound on the value of the elements in the queue.
TOP(q): will return the top word in the queue. Note that this is the word with the highest upper bound, not necessarily the highest actual value.
POP(q): will extract and return the top element in the queue and update the upper bound accordingly.

FASTSUBS(X, K)
1. Initialize priority queue q for context X.
2. Initialize set of candidate words S = {}.
3. WHILE $|\{c : c \in S,\ l(c \mid X) \ge \mathrm{SUP}(q)\}| < K$ AND $|S| < V$ DO S := S ∪ {POP(q)}
4. Return top K words in S based on $l(c \mid X)$.

Outline: The FASTSUBS algorithm takes a context X and a desired number of top substitutes K as inputs. It initializes an upper bound priority queue q for words that can go in the context X and their log probabilities. It keeps popping candidate words from q until K of them are guaranteed to have values at or above SUP(q), the upper bound for the remaining words in the queue.

Analysis: As long as POP(q) cycles through all the words in the vocabulary and SUP(q) gives an upper bound for the remaining words in the queue, the algorithm will return a correct result. The Appendix describes the construction of the priority queue q recursively in terms of queues for the constituent terms and outlines a correctness proof.

The efficiency of the algorithm depends on the number of iterations of the while loop, which in turn depends on the quality of words returned by POP(q), the tightness of the upper bound given by SUP(q), and the ratio $K/V$. The worst case is no better than the naive algorithm's $O(VN^2)$. However, the experiments presented in the next section indicate that the average performance on real data is closer to the best case $O(KN^2)$, which is a large improvement when $K \ll V$.

4 Experiments

This section presents experimental results that aim to quantify the efficiency of FASTSUBS on a real world dataset. I used a corpus of 126 million words of WSJ data as the training set and the WSJ section of the Penn Treebank (Marcus et al., 1999) as the test set. A 4-gram language model was built from the training set using Kneser-Ney smoothing in SRILM (Stolcke, 2002) with a fixed vocabulary of 78,894 words.

Figure 1 shows the average number of while loop iterations in FASTSUBS as a function of the K parameter. The function shows a regular sub-linear growth and approaches the vocabulary size $V$ as $K \to V$. It is well approximated by the formula $y = y_0 x^a$ where $a = \log(V/y_0)/\log(V)$, so that $y = V$ at $x = V$.

As a practical example, it is possible to compute the top 10 substitutes for each one of the 1,173,766 tokens in Penn Treebank in about 6 hours on a typical workstation.
The same task would take about 6 days for the naive algorithm.

Figure 1: Number of iterations as a function of K (x-axis: number of top substitutes; y-axis: number of iterations).

Appendix. Priority Queues

This section will describe the recursive construction of the priority queue for a given context based on Equation 4. Each term in Equation 4 has an associated priority queue. The queues for compound terms like sums and alternates are defined in terms of the queues of their constituents. Each queue satisfies the upper bound contract: SUP(q) is an upper bound on the values in the queue.

Primitive terms: Looking at Equation 4 we see several types of primitive terms. Here we will define how their associated priority queues behave.

$\alpha$ terms with candidate words such as $\alpha(abc)$ need an actual priority queue $q_\alpha$ for candidates to implement the following efficiently:

$$\mathrm{SUP}(q_\alpha(xcy)) = \max_c \alpha(xcy) \tag{5}$$
$$\mathrm{TOP}(q_\alpha(xcy)) = \arg\max_c \alpha(xcy) \tag{6}$$

Here $x$ and $y$ stand for zero or more words and $c$ is a candidate lexical substitute word. $\mathrm{SUP}(q_\alpha)$ gives the real maximum, and thus provides a tight upper bound. The $q_\alpha$ queues are constructed once in the beginning of the program as sorted arrays and re-used in queries for different contexts. The construction can be performed in one pass through the language model and the memory requirement is of the same order as the language model if we ignore patterns that have not been observed. Candidates that have not been observed in the argument context will be at the bottom of this queue because $\alpha(xcy) \equiv -\infty$ if $f(xcy) = 0$. To save memory such $c$ are not placed in the queue. Thus after we run out of elements in $q_\alpha$ we need to return:

$$\mathrm{SUP}(q_\alpha(xcy)) = -\infty \tag{7}$$
$$\mathrm{TOP}(q_\alpha(xcy)) = \mathrm{NIL} \tag{8}$$

$\beta$ terms with candidate words pose an interesting problem because candidates that have not been observed in the argument context give the maximum value 0. For example, if $f(cd) = 0$ then $\beta(cd) = 0$, which is the maximum value $\beta$ can take. Rather than maintaining $q_\beta$ priority queues containing the whole vocabulary we will just use 0 as an upper bound for $\beta$ terms with $c$:

$$\mathrm{SUP}(q_\beta(xcy)) = 0 \tag{9}$$
$$\mathrm{TOP}(q_\beta(xcy)) = \mathrm{NIL} \tag{10}$$

$\alpha$ and $\beta$ terms without candidate words, like $\alpha(de)$ or $\beta(ab)$, act as constants. For consistency we will define priority queues $q_\alpha$ and $q_\beta$ for constant terms. If $x$ is a word sequence without $c$:

$$\mathrm{SUP}(q_\alpha(x)) = \alpha(x) \tag{11}$$
$$\mathrm{SUP}(q_\beta(x)) = \beta(x) \tag{12}$$
$$\mathrm{TOP}(q_\alpha(x)) = \mathrm{TOP}(q_\beta(x)) = \mathrm{NIL} \tag{13}$$

Compound terms: We can identify several types of compound terms in Equation 4. We will define their priority queue functions in terms of the queues of their children.

Low level sums, like $\beta(ab) + \beta(b) + \alpha(c)$, add up primitive terms. Let $q_\sigma$ be the queue for a low level sum and let $C(\sigma)$ indicate the primitive terms in it. We have:

$$\mathrm{SUP}(q_\sigma) = \sum_{\alpha,\beta \in C(\sigma)} \mathrm{SUP}(q_{\alpha,\beta}) \tag{14}$$
$$\mathrm{TOP}(q_\sigma) = \mathrm{TOP}(q_\alpha(xcy)) \tag{15}$$

if $\sigma$ contains an $\alpha(xcy)$ term that is not constant, otherwise:

$$\mathrm{TOP}(q_\sigma) = \mathrm{NIL} \tag{16}$$

Alternates, indicated by the braces in Equation 4, pick their topmost child whose $\alpha$ argument has been observed in the training corpus. Let $q_\rho$ be the queue for such an alternate expression and $C(\rho)$ be its children terms.

$$\mathrm{SUP}(q_\rho) = \max_{\sigma \in C(\rho)} \mathrm{SUP}(q_\sigma) \tag{17}$$
$$\mathrm{TOP}(q_\rho) = \mathrm{TOP}(q_{\sigma_{\max}}) \tag{18}$$

where $\sigma_{\max} = \arg\max_{\sigma \in C(\rho)} \mathrm{SUP}(q_\sigma)$.

The top level sum in Equation 4 is a sum of $N$ alternate terms for an order $N$ language model. Let $q_\lambda$ represent the queue for the top level sum and let $C(\lambda)$ represent its children.

$$\mathrm{SUP}(q_\lambda) = \sum_{\rho \in C(\lambda)} \mathrm{SUP}(q_\rho) \tag{19}$$
$$\mathrm{TOP}(q_\lambda) = \mathrm{TOP}(q_{\rho_r}) \tag{20}$$

where $\rho_r$ is a randomly chosen child of $\lambda$ for which $\mathrm{TOP}(q_{\rho_r}) \ne \mathrm{NIL}$.
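To make the queue definitions concrete, here is a minimal Python sketch of the upper bound queues and the FASTSUBS loop, restricted to a bigram model (context "a _ d") to keep it short. It is an illustration under stated assumptions rather than the author's implementation: the class and function names, the toy tables `ALPHA` and `BETA`, and the seeded random choice are invented for the example; every vocabulary word is assumed to have a unigram entry; all back-off weights are assumed to be at most 0; and the queues are rebuilt per query instead of being constructed once and reused.

```python
import math
import random

# Toy bigram model, invented for illustration (log values, not from the paper).
ALPHA = {("in",): -1.5, ("new",): -2.5, ("san",): -3.0, ("francisco",): -3.5,
         ("UNK",): -5.0,
         ("in", "san"): -1.0, ("in", "new"): -0.8, ("san", "francisco"): -0.1}
BETA = {("in",): -0.3, ("san",): -0.2, ("new",): -0.4}

class SortedQueue:
    """q_alpha over candidates, plus a constant offset for a low level sum (Eq. 5-8, 14-15)."""
    def __init__(self, items, offset=0.0):
        self.items, self.pos, self.offset = sorted(items, reverse=True), 0, offset
    def sup(self):
        live = self.pos < len(self.items)
        return self.offset + self.items[self.pos][0] if live else -math.inf
    def top(self):
        return self.items[self.pos][1] if self.pos < len(self.items) else None
    def pop(self):
        word = self.top()
        self.pos += 1
        return word

class Constant:
    """Term with nothing to pop: a fixed upper bound and TOP = NIL (Eq. 9-13, 16)."""
    def __init__(self, value):
        self.value = value
    def sup(self):
        return self.value
    def top(self):
        return None

class Alternate:
    """Bound is the max over children; TOP comes from that child (Eq. 17-18)."""
    def __init__(self, children):
        self.children = children
    def _best(self):
        return max(self.children, key=lambda q: q.sup())
    def sup(self):
        return self._best().sup()
    def top(self):
        return self._best().top()
    def pop(self):
        return self._best().pop()

class TopSum:
    """Bound is the sum over children; pop from a random child with a TOP (Eq. 19-20)."""
    def __init__(self, children, rng):
        self.children, self.rng = children, rng
    def sup(self):
        return sum(q.sup() for q in self.children)
    def pop(self):
        return self.rng.choice([q for q in self.children if q.top() is not None]).pop()

def score(alpha, beta, a, c, d):
    """Exact l(c | a) + l(d | c) for a bigram model (Equation 4 with N = 2)."""
    def l(w, ctx):
        return alpha[(ctx, w)] if (ctx, w) in alpha else beta.get((ctx,), 0.0) + alpha[(w,)]
    return l(c, a) + l(d, c)

def fastsubs(alpha, beta, a, d, k, seed=0):
    unigrams = [(v, key[0]) for key, v in alpha.items() if len(key) == 1]
    after_a = [(v, key[1]) for key, v in alpha.items() if len(key) == 2 and key[0] == a]
    before_d = [(v, key[0]) for key, v in alpha.items() if len(key) == 2 and key[1] == d]
    q = TopSum([
        Alternate([SortedQueue(after_a), SortedQueue(unigrams, beta.get((a,), 0.0))]),
        # beta(c) + alpha(d) is bounded by 0 + alpha(d), assuming beta <= 0.
        Alternate([SortedQueue(before_d), Constant(alpha[(d,)])]),
    ], random.Random(seed))
    found, pops = {}, 0
    # Stop once K popped words are at or above the bound on everything still unpopped.
    while sum(s >= q.sup() for s in found.values()) < k and len(found) < len(unigrams):
        c = q.pop()
        pops += 1
        found[c] = score(alpha, beta, a, c, d)
    return sorted(found, key=found.get, reverse=True)[:k], pops
```

For example, `fastsubs(ALPHA, BETA, "in", "francisco", 1)` returns the best substitute for "in _ francisco" together with the number of pops it took. A word may be popped from more than one child queue; the set of found words absorbs such duplicates, and the bound stays valid because any word not yet found is still present in every queue that contains it.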
Correctness: As mentioned in Section 3, the correctness of the algorithm depends on two factors: (i) the SUP(q) function should return an upper bound on the remaining values in q, and (ii) the POP(q) function should cycle through the whole vocabulary for the top level queue.

The correctness of the SUP(q) function can be proved recursively. For primitive terms SUP(q) is equal to the actual maximum (e.g. for $q_\alpha$), or is an obvious upper bound (e.g. $\mathrm{SUP}(q_\beta(xcy)) = 0$). For sums, SUP(q) is equal to the sum of the upper bounds for the children, and for alternates, SUP(q) is equal to the maximum of the upper bounds for the children.

To prove that $\mathrm{POP}(q_\lambda)$ will cycle through the entire vocabulary it suffices to show that the queue for at least one child of $\lambda$ will cycle through the entire vocabulary. This is in fact the case because one of the children will always include the term $\alpha(c)$ whose queue contains the entire vocabulary.

Acknowledgments

References

[Hawker2007] Tobias Hawker. 2007. USYD: WSD and lexical substitution using the Web1T corpus. In SemEval-2007: 4th International Workshop on Semantic Evaluations.

[Marcus et al.1999] Mitchell P. Marcus, Beatrice Santorini, Mary Ann Marcinkiewicz, and Ann Taylor. 1999. Treebank-3. Linguistic Data Consortium, Philadelphia.

[McCarthy and Navigli2007] D. McCarthy and R. Navigli. 2007. SemEval-2007 task 10: English lexical substitution task. In Proceedings of the 4th International Workshop on Semantic Evaluations (SemEval-2007).

[Mihalcea et al.2010] R. Mihalcea, R. Sinha, and D. McCarthy. 2010. SemEval-2010 task 2: Cross-lingual lexical substitution. In Proceedings of the 5th International Workshop on Semantic Evaluation. Association for Computational Linguistics.

[Specia et al.2012] Lucia Specia, Sujay Jauhar, and Rada Mihalcea. 2012. SemEval-2012 task 1: English lexical simplification. In Proceedings of the International Workshop on Semantic Evaluation. Forthcoming.

[Stolcke2002] A. Stolcke. 2002. SRILM - an extensible language modeling toolkit. In Seventh International Conference on Spoken Language Processing.

[Yatbaz and Yuret2010] M.A. Yatbaz and D. Yuret. 2010. Unsupervised part of speech tagging using unambiguous substitutes from a statistical language model. In Proceedings of the 23rd International Conference on Computational Linguistics: Posters. Association for Computational Linguistics, August.

[Yuret and Yatbaz2010] Deniz Yuret and Mehmet Ali Yatbaz. 2010. The noisy channel model for unsupervised word sense disambiguation. Computational Linguistics, 36(1), March.

[Yuret2007] D. Yuret. 2007. KU: Word sense disambiguation by substitution. In Proceedings of the 4th International Workshop on Semantic Evaluations. Association for Computational Linguistics.
