arxiv: v1 [cs.cl] 24 May 2012
FASTSUBS: An Efficient Admissible Algorithm for Finding the Most Likely Lexical Substitutes Using a Statistical Language Model

Deniz Yuret
Koç University, İstanbul, Turkey
dyuret@ku.edu.tr

1 Introduction

Lexical substitutes have found use in the context of word sense disambiguation (Yuret and Yatbaz, 2010), unsupervised part-of-speech induction (Yatbaz and Yuret, 2010), paraphrasing (McCarthy and Navigli, 2007), machine translation (Mihalcea et al., 2010), and text simplification (Specia et al., 2012). Using a statistical language model to find the most likely substitutes in a given context is a successful approach (Hawker, 2007; Yuret, 2007), but the cost of a naive algorithm is proportional to the vocabulary size.

This paper presents the FASTSUBS algorithm, which can efficiently and correctly identify the most likely lexical substitutes for a given context based on a statistical language model without going through most of the vocabulary.

The efficiency of FASTSUBS makes large scale experiments based on lexical substitutes feasible. For example, it is possible to compute the top 10 substitutes for each one of the 1,173,766 tokens in Penn Treebank (Marcus et al., 1999) in about 6 hours on a typical workstation. The same task would take about 6 days with the naive algorithm. An implementation of the algorithm and a dataset with the top 100 substitutes of each token in the WSJ section of the Penn Treebank are available from the author's website.

2 Substitute Probabilities

This section presents the derivation of lexical substitute probabilities based on an n-gram language model. Details of this derivation are important in finding an admissible algorithm that identifies the most likely substitutes efficiently, without trying out most of the vocabulary.

N-gram language models assign probabilities to arbitrary sequences of words (or other tokens such as punctuation) based on their occurrence statistics in large training corpora.
They approximate the probability of a sequence of words by assuming each word is conditionally independent of the rest given the previous $(n-1)$ words. For example, a trigram model would approximate the probability of a sequence $abcde$ as:

$$p(abcde) = p(a)\,p(b \mid a)\,p(c \mid ab)\,p(d \mid bc)\,p(e \mid cd) \tag{1}$$

where lowercase letters like $a$, $b$, $c$ represent words and strings of letters like $abcde$ represent word sequences [2]. The individual conditional probability terms are typically expressed in back-off form [3] using log probabilities $l(c \mid ab) \equiv \log p(c \mid ab)$:

$$l(c \mid ab) = \begin{cases} \alpha(abc) & \text{if } f(abc) > 0 \\ \beta(ab) + l(c \mid b) & \text{otherwise} \end{cases} \tag{2}$$

where $\alpha(abc)$ is the discounted log probability estimate for $l(c \mid ab)$ (typically slightly less than the log frequency in the training corpus), $f(abc)$ is the number of times $abc$ has been observed in the training corpus, and $\beta(ab)$ is the back-off weight that makes the probabilities add up to 1. The formula can be generalized to arbitrary n-gram orders if we let $b$ stand for zero or more words.

[2] I prefer this notation to the more general $w_{i-n+1}^{i+n-1}$, which I find difficult to read.
[3] Even interpolated models can be represented in the back-off form, and in fact that is the way SRILM stores them in ARPA (Doug Paul) format model files.

The recursion bottoms out at
unigrams (single words) where $l(c) = \alpha(c)$. If there are any out-of-vocabulary words we assume they are mapped to a special UNK token, so $\alpha(c)$ is never undefined.

It is best to use both left and right context when estimating the probabilities for potential lexical substitutes. For example, in "He lived in San Francisco suburbs.", the token "San" would be difficult to guess from the left context, but it is almost certain given the right context. The log probability of a substitute word given both left and right contexts can be estimated as:

$$\begin{aligned} l(c \mid ab\_de) &\propto l(abcde) \\ &\propto l(c \mid ab) + l(d \mid bc) + l(e \mid cd) \end{aligned} \tag{3}$$

Here the symbol $\_$ represents the position the candidate substitute $c$ is going to occupy. The first line follows from the definition of conditional probability, and the second line comes from Equation 1 except that the terms that do not include the candidate $c$ have been dropped.

The expression for the unnormalized log probability of a lexical substitute according to Equation 3 and the decomposition of its terms according to Equation 2 can be combined to give us Equation 4. For arbitrary order n-gram models we would end up with a sum of $n$ terms, and each term would come from one of $n$ alternatives.

$$\begin{aligned} l(c \mid ab\_de) \propto{} & \begin{cases} \alpha(abc) & \text{if } f(abc) > 0 \\ \beta(ab) + \alpha(bc) & \text{if } f(bc) > 0 \\ \beta(ab) + \beta(b) + \alpha(c) & \text{otherwise} \end{cases} \\ {}+{} & \begin{cases} \alpha(bcd) & \text{if } f(bcd) > 0 \\ \beta(bc) + \alpha(cd) & \text{if } f(cd) > 0 \\ \beta(bc) + \beta(c) + \alpha(d) & \text{otherwise} \end{cases} \\ {}+{} & \begin{cases} \alpha(cde) & \text{if } f(cde) > 0 \\ \beta(cd) + \alpha(de) & \text{if } f(de) > 0 \\ \beta(cd) + \beta(d) + \alpha(e) & \text{otherwise} \end{cases} \end{aligned} \tag{4}$$

3 Algorithm

A naive algorithm to find the most likely substitutes in a given context could try each word in the vocabulary as a potential substitute $c$ and compute the value of the expression given in Equation 4. The computation of Equation 4 requires $O(N^2)$ operations for an order $N$ language model, and if we have $V$ words in our vocabulary the cost of the naive algorithm to find a single most likely substitute would be $O(VN^2)$. We can do better, however, if we keep track of the candidate words $c$ that maximize the individual $\alpha$ and $\beta$ terms.
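As a point of reference for the naive search just described, here is a minimal Python sketch of Equations 2 and 3 on a toy trigram model. The tables `ALPHA` and `BETA`, every number in them, and the function names `logp`, `score`, and `naive_substitutes` are assumptions made for illustration; they are not taken from the paper's implementation or from SRILM.

```python
# Toy back-off trigram model; every number below is invented for illustration.
# ALPHA[ngram] is the discounted log probability alpha(ngram) of an observed n-gram.
# BETA[context] is the back-off weight beta(context); unseen contexts get 0.
ALPHA = {
    ("lived",): -3.0, ("in",): -1.5, ("san",): -3.5, ("the",): -1.0,
    ("new",): -2.5, ("francisco",): -4.0, ("suburbs",): -4.5, ("<unk>",): -6.0,
    ("in", "the"): -0.7, ("in", "san"): -2.5, ("in", "new"): -2.0,
    ("san", "francisco"): -0.1, ("francisco", "suburbs"): -2.0,
    ("lived", "in", "the"): -0.6, ("in", "san", "francisco"): -0.05,
    ("san", "francisco", "suburbs"): -1.8,
}
BETA = {
    ("lived", "in"): -0.3, ("in", "the"): -0.2, ("in", "new"): -0.2,
    ("in",): -0.4, ("the",): -0.5, ("new",): -0.3,
}
VOCAB = sorted(g[0] for g in ALPHA if len(g) == 1)


def logp(word, context):
    """Equation 2: l(word | context), backing off to shorter contexts."""
    ngram = context + (word,)
    if ngram in ALPHA:                      # f(ngram) > 0
        return ALPHA[ngram]
    if not context:                         # unseen unigram: treat as UNK
        return ALPHA[("<unk>",)]
    return BETA.get(context, 0.0) + logp(word, context[1:])


def score(c, left, right, n=3):
    """Equation 3: sum of the terms of the sequence that contain candidate c."""
    left = left[-(n - 1):]
    seq = left + (c,) + right[:n - 1]
    total = 0.0
    for i in range(len(left), len(seq)):    # positions of c and the words after it
        total += logp(seq[i], seq[max(0, i - (n - 1)):i])
    return total


def naive_substitutes(left, right, k=1):
    """Try every vocabulary word and keep the k best (cost grows with V)."""
    ranked = sorted(VOCAB, key=lambda c: score(c, left, right), reverse=True)
    return ranked[:k]


print(naive_substitutes(("lived", "in"), ("francisco", "suburbs"), k=1))
```

With these made-up numbers the right context dominates: the candidate "san" loses on the left-context term but wins overall because its two right-context n-grams have been observed.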
The basic idea of the FASTSUBS algorithm is to maintain priority queues for candidate words for each possible $\alpha$ and $\beta$ term and to derive priority queues for compound terms like sums and alternates from the queues for their constituents.

Queues: Maintaining exact values in these queues turns out to be impractical. Instead, each priority queue in FASTSUBS maintains an upper bound on the actual values of its elements. Such an upper bound priority queue can still be used to retrieve top elements as long as we check to make sure that their actual values are above the upper bound for the elements remaining in the queue. We will define three functions for such upper bound priority queues:

SUP(q): returns an upper bound on the value of the elements in the queue.
TOP(q): returns the top word in the queue. Note that this is the word with the highest upper bound, not necessarily the highest actual value.
POP(q): extracts and returns the top element in the queue and updates the upper bound accordingly.

FASTSUBS(X, K)
1. Initialize priority queue q for context X.
2. Initialize set of candidate words S = {}.
3. WHILE $|\{c : c \in S,\ l(c \mid X) \ge \mathrm{SUP}(q)\}| < K$ AND $|S| < V$ DO $S := S \cup \{\mathrm{POP}(q)\}$
4. Return top K words in S based on $l(c \mid X)$.

Outline: The FASTSUBS algorithm takes a context X and a desired number of top substitutes K as inputs. It initializes an upper bound priority queue
q for words that can go in the context X and their log probabilities. It keeps popping candidate words from q until K of them are guaranteed to have values above SUP(q), the upper bound for the remaining words in the queue.

Analysis: As long as POP(q) cycles through all the words in the vocabulary and SUP(q) gives an upper bound for the remaining words in the queue, the algorithm returns a correct result: when the loop stops, no word still in the queue can score higher than SUP(q), and hence none can beat the K words already found. The Appendix describes the construction of the priority queue q recursively in terms of queues for the constituent terms and outlines a correctness proof.

The efficiency of the algorithm depends on the number of iterations of the while loop, which in turn depends on the quality of words returned by POP(q), the tightness of the upper bound given by SUP(q), and the ratio $K/V$. The worst case is no better than the naive algorithm's $O(VN^2)$. However, the experiments presented in the next section indicate that the average performance on real data is closer to the best case $O(KN^2)$, which is a large improvement when $K \ll V$.

4 Experiments

This section presents experimental results that aim to quantify the efficiency of FASTSUBS on a real world dataset. I used a corpus of 126 million words of WSJ data as the training set and the WSJ section of the Penn Treebank (Marcus et al., 1999) as the test set. A 4-gram language model was built from the training set using Kneser-Ney smoothing in SRILM (Stolcke, 2002) with a fixed vocabulary of 78,894 words.

Figure 1 shows the average number of while loop iterations in FASTSUBS as a function of the K parameter. The function shows a regular sub-linear growth and approaches the vocabulary size $V$ as $K \to V$. It is well approximated by the formula $y = y_0 x^a$ where $a = \log(V/y_0)/\log(V)$.

As a practical example, it is possible to compute the top 10 substitutes for each one of the 1,173,766 tokens in Penn Treebank in about 6 hours on a typical workstation.
The same task would take about 6 days for the naive algorithm.

Figure 1: Number of iterations as a function of K (x-axis: number of top substitutes; y-axis: number of iterations).

Appendix. Priority Queues

This section will describe the recursive construction of the priority queue for a given context based on Equation 4. Each term in Equation 4 has an associated priority queue. The queues for compound terms like sums and alternates are defined in terms of the queues of their constituents. Each queue satisfies the upper bound contract: SUP(q) is an upper bound on the values in the queue.

Primitive terms: Looking at Equation 4 we see several types of primitive terms. Here we will define how their associated priority queues behave:

$\alpha$ terms with candidate words such as $\alpha(abc)$ need an actual priority queue $q_\alpha$ for candidates to implement the following efficiently:

$$\mathrm{SUP}(q_\alpha(xcy)) = \max_c \alpha(xcy) \tag{5}$$
$$\mathrm{TOP}(q_\alpha(xcy)) = \arg\max_c \alpha(xcy) \tag{6}$$

Here $x$ and $y$ stand for zero or more words and $c$ is a candidate lexical substitute word. $\mathrm{SUP}(q_\alpha)$ gives the real maximum, and thus provides a tight upper bound. The $q_\alpha$ queues are constructed once at the beginning of the program as sorted arrays and re-used in queries for different contexts. The construction can be performed in one pass through the language model, and the memory requirement is of the same order as the language model if we ignore patterns that have not been observed. Candidates that have not been observed in the argument context will be at the bottom of this queue because $\alpha(xcy) = -\infty$ if $f(xcy) = 0$. To save memory such $c$ are not placed in the queue. Thus after we run out of elements in
$q_\alpha$ we need to return:

$$\mathrm{SUP}(q_\alpha(xcy)) = -\infty \tag{7}$$
$$\mathrm{TOP}(q_\alpha(xcy)) = \mathrm{NIL} \tag{8}$$

$\beta$ terms with candidate words pose an interesting problem because candidates that have not been observed in the argument context give the maximum value 0. For example, if $f(cd) = 0$ then $\beta(cd) = 0$, which is the maximum value $\beta$ can take. Rather than maintaining $q_\beta$ priority queues containing the whole vocabulary we will just use 0 as an upper bound for $\beta$ terms with $c$.

$$\mathrm{SUP}(q_\beta(xcy)) = 0 \tag{9}$$
$$\mathrm{TOP}(q_\beta(xcy)) = \mathrm{NIL} \tag{10}$$

$\alpha$ and $\beta$ terms without candidate words, like $\alpha(de)$ or $\beta(ab)$, act as constants. For consistency we will define priority queues $q_\alpha$ and $q_\beta$ for constant terms. If $x$ is a word sequence without $c$:

$$\mathrm{SUP}(q_\alpha(x)) = \alpha(x) \tag{11}$$
$$\mathrm{SUP}(q_\beta(x)) = \beta(x) \tag{12}$$
$$\mathrm{TOP}(q_\alpha(x)) = \mathrm{TOP}(q_\beta(x)) = \mathrm{NIL} \tag{13}$$

Compound terms: We can identify several types of compound terms in Equation 4. We will define their priority queue functions in terms of the queues of their children.

Low level sums like $\beta(ab) + \beta(b) + \alpha(c)$ add up primitive terms. Let $q_\sigma$ be the queue for a low level sum and let $C(\sigma)$ indicate the primitive terms in it. We have:

$$\mathrm{SUP}(q_\sigma) = \sum_{\alpha,\beta \in C(\sigma)} \mathrm{SUP}(q_{\alpha,\beta}) \tag{14}$$
$$\mathrm{TOP}(q_\sigma) = \mathrm{TOP}(q_\alpha(xcy)) \tag{15}$$

if $\sigma$ contains an $\alpha(xcy)$ term that is not constant, otherwise:

$$\mathrm{TOP}(q_\sigma) = \mathrm{NIL} \tag{16}$$

Alternates, indicated by { in Equation 4, pick their topmost child whose $\alpha$ argument has been observed in the training corpus. Let $q_\rho$ be the queue for such an alternate expression and $C(\rho)$ be its children terms.

$$\mathrm{SUP}(q_\rho) = \max_{\sigma \in C(\rho)} \mathrm{SUP}(q_\sigma) \tag{17}$$
$$\mathrm{TOP}(q_\rho) = \mathrm{TOP}(q_{\sigma_{\max}}) \tag{18}$$

where $\sigma_{\max} = \arg\max_{\sigma \in C(\rho)} \mathrm{SUP}(q_\sigma)$.

The top level sum in Equation 4 is a sum of $N$ alternate terms for an order $N$ language model. Let $q_\lambda$ represent the queue for the top level sum and let $C(\lambda)$ represent its children.

$$\mathrm{SUP}(q_\lambda) = \sum_{\rho \in C(\lambda)} \mathrm{SUP}(q_\rho) \tag{19}$$
$$\mathrm{TOP}(q_\lambda) = \mathrm{TOP}(q_{\rho_r}) \tag{20}$$

where $\rho_r$ is a randomly chosen child of $\lambda$ for which $\mathrm{TOP}(q_{\rho_r}) \neq \mathrm{NIL}$.
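The interplay between the top level sum queue (Equations 19 and 20) and the FASTSUBS loop can be sketched in Python as follows. This is a deliberately simplified stand-in, not the construction above: each child is assumed to be a fully sorted list covering the whole vocabulary, so the back-off alternates and the $\beta$ bounds of Equations 7 to 18 are not modelled. The class `SortedTermQueue`, the function `fastsubs`, the `TERMS` table, and all of its numbers are hypothetical.

```python
import heapq
import random


class SortedTermQueue:
    """One child of the top level sum: words sorted by decreasing term value."""

    def __init__(self, values):
        self.items = sorted(values.items(), key=lambda kv: kv[1], reverse=True)
        self.pos = 0

    def sup(self):
        # Upper bound on this term for every word not yet popped from this child.
        if self.pos < len(self.items):
            return self.items[self.pos][1]
        return float("-inf")

    def top(self):
        return self.items[self.pos][0] if self.pos < len(self.items) else None

    def pop(self):
        word = self.top()
        self.pos += 1
        return word


def fastsubs(terms, k, seed=0):
    """Return the k words with the largest sum of term values."""
    rng = random.Random(seed)
    children = [SortedTermQueue(t) for t in terms]
    vocab_size = len(terms[0])

    def exact(word):                      # actual value l(c | X) in this toy setting
        return sum(t[word] for t in terms)

    found = {}                            # the set S, with exact scores attached
    while len(found) < vocab_size:
        bound = sum(ch.sup() for ch in children)          # SUP of the sum queue
        if sum(1 for s in found.values() if s >= bound) >= k:
            break                         # k candidates provably beat the rest
        live = [ch for ch in children if ch.top() is not None]
        word = rng.choice(live).pop()     # POP from a randomly chosen child
        if word not in found:
            found[word] = exact(word)
    return heapq.nlargest(k, found, key=found.get)


# Made-up term values for five candidate words and three terms.
TERMS = [
    {"san": -2.8, "the": -0.6, "new": -2.3, "los": -3.0, "a": -1.0},
    {"san": -0.05, "the": -4.7, "new": -4.5, "los": -5.0, "a": -4.9},
    {"san": -1.8, "the": -2.0, "new": -2.0, "los": -2.1, "a": -2.2},
]
print(fastsubs(TERMS, k=2))
```

The random choice of child affects only how many words are popped before the loop stops; the returned words are the same for every seed because the stopping test compares exact scores against an upper bound on everything not yet popped.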
Correctness: As mentioned in Section 3, the correctness of the algorithm depends on two factors: (i) the SUP(q) function should return an upper bound on the remaining values in q, and (ii) the POP(q) function should cycle through the whole vocabulary for the top level queue.

The correctness of the SUP(q) function can be proved recursively. For primitive terms SUP(q) is equal to the actual maximum (e.g. for $q_\alpha$), or is an obvious upper bound (e.g. $\mathrm{SUP}(q_\beta(xcy)) = 0$). For sums, SUP(q) is equal to the sum of the upper bounds for the children, and for alternates, SUP(q) is equal to the maximum of the upper bounds for the children.

To prove that $\mathrm{POP}(q_\lambda)$ will cycle through the entire vocabulary it suffices to show that the queue for at least one child of $\lambda$ will cycle through the entire vocabulary. This is in fact the case because one of the children will always include the term $\alpha(c)$, whose queue contains the entire vocabulary.

References

[Hawker2007] Tobias Hawker. 2007. USYD: WSD and lexical substitution using the Web1T corpus. In SemEval-2007: 4th International Workshop on Semantic Evaluations.
[Marcus et al.1999] Mitchell P. Marcus, Beatrice Santorini, Mary Ann Marcinkiewicz, and Ann Taylor. 1999. Treebank-3. Linguistic Data Consortium, Philadelphia.
[McCarthy and Navigli2007] D. McCarthy and R. Navigli. 2007. SemEval-2007 task 10: English lexical substitution task. In Proceedings of the 4th International Workshop on Semantic Evaluations (SemEval-2007).
[Mihalcea et al.2010] R. Mihalcea, R. Sinha, and D. McCarthy. 2010. SemEval-2010 task 2: Cross-lingual lexical substitution. In Proceedings of the 5th International Workshop on Semantic Evaluation. Association for Computational Linguistics.
[Specia et al.2012] Lucia Specia, Sujay Jauhar, and Rada Mihalcea. 2012. SemEval-2012 task 1: English lexical simplification. In Proceedings of the International Workshop on Semantic Evaluation. Forthcoming.
[Stolcke2002] A. Stolcke. 2002. SRILM - an extensible language modeling toolkit. In Seventh International Conference on Spoken Language Processing.
[Yatbaz and Yuret2010] M.A. Yatbaz and D. Yuret. 2010. Unsupervised part of speech tagging using unambiguous substitutes from a statistical language model. In Proceedings of the 23rd International Conference on Computational Linguistics: Posters. Association for Computational Linguistics, August.
[Yuret and Yatbaz2010] Deniz Yuret and Mehmet Ali Yatbaz. 2010. The noisy channel model for unsupervised word sense disambiguation. Computational Linguistics, 36(1), March.
[Yuret2007] D. Yuret. 2007. KU: Word sense disambiguation by substitution. In Proceedings of the 4th International Workshop on Semantic Evaluations. Association for Computational Linguistics.
CMSC 723: Computational Linguistics I Session #5 Hidden Markov Models Jimmy Lin The ischool University of Maryland Wednesday, September 30, 2009 Today s Agenda The great leap forward in NLP Hidden Markov
More informationCS 224N HW:#3. (V N0 )δ N r p r + N 0. N r (r δ) + (V N 0)δ. N r r δ. + (V N 0)δ N = 1. 1 we must have the restriction: δ NN 0.
CS 224 HW:#3 ARIA HAGHIGHI SUID :# 05041774 1. Smoothing Probability Models (a). Let r be the number of words with r counts and p r be the probability for a word with r counts in the Absolute discounting
More informationPart-of-Speech Tagging + Neural Networks 3: Word Embeddings CS 287
Part-of-Speech Tagging + Neural Networks 3: Word Embeddings CS 287 Review: Neural Networks One-layer multi-layer perceptron architecture, NN MLP1 (x) = g(xw 1 + b 1 )W 2 + b 2 xw + b; perceptron x is the
More informationStatistical Machine Translation. Part III: Search Problem. Complexity issues. DP beam-search: with single and multi-stacks
Statistical Machine Translation Marcello Federico FBK-irst Trento, Italy Galileo Galilei PhD School - University of Pisa Pisa, 7-19 May 008 Part III: Search Problem 1 Complexity issues A search: with single
More informationCS 6120/CS4120: Natural Language Processing
CS 6120/CS4120: Natural Language Processing Instructor: Prof. Lu Wang College of Computer and Information Science Northeastern University Webpage: www.ccs.neu.edu/home/luwang Today s Outline Probabilistic
More informationLanguage Modeling. Introduc*on to N- grams. Many Slides are adapted from slides by Dan Jurafsky
Language Modeling Introduc*on to N- grams Many Slides are adapted from slides by Dan Jurafsky Probabilis1c Language Models Today s goal: assign a probability to a sentence Machine Transla*on: Why? P(high
More informationPROBABILISTIC PASSWORD MODELING: PREDICTING PASSWORDS WITH MACHINE LEARNING
PROBABILISTIC PASSWORD MODELING: PREDICTING PASSWORDS WITH MACHINE LEARNING JAY DESTORIES MENTOR: ELIF YAMANGIL Abstract Many systems use passwords as the primary means of authentication. As the length
More informationInternet Engineering Jacek Mazurkiewicz, PhD
Internet Engineering Jacek Mazurkiewicz, PhD Softcomputing Part 11: SoftComputing Used for Big Data Problems Agenda Climate Changes Prediction System Based on Weather Big Data Visualisation Natural Language
More informationFast and Scalable Decoding with Language Model Look-Ahead for Phrase-based Statistical Machine Translation
Fast and Scalable Decoding with Language Model Look-Ahead for Phrase-based Statistical Machine Translation Joern Wuebker, Hermann Ney Human Language Technology and Pattern Recognition Group Computer Science
More informationLanguage Modeling. Introduction to N- grams
Language Modeling Introduction to N- grams Probabilistic Language Models Today s goal: assign a probability to a sentence Machine Translation: P(high winds tonite) > P(large winds tonite) Why? Spell Correction
More informationLanguage Modeling. Introduction to N- grams
Language Modeling Introduction to N- grams Probabilistic Language Models Today s goal: assign a probability to a sentence Machine Translation: P(high winds tonite) > P(large winds tonite) Why? Spell Correction
More informationAn implementation of deterministic tree automata minimization
An implementation of deterministic tree automata minimization Rafael C. Carrasco 1, Jan Daciuk 2, and Mikel L. Forcada 3 1 Dep. de Lenguajes y Sistemas Informáticos, Universidad de Alicante, E-03071 Alicante,
More informationIntroduction to Techniques for Counting
Introduction to Techniques for Counting A generating function is a device somewhat similar to a bag. Instead of carrying many little objects detachedly, which could be embarrassing, we put them all in
More informationLanguage Modelling: Smoothing and Model Complexity. COMP-599 Sept 14, 2016
Language Modelling: Smoothing and Model Complexity COMP-599 Sept 14, 2016 Announcements A1 has been released Due on Wednesday, September 28th Start code for Question 4: Includes some of the package import
More informationLanguage Model Rest Costs and Space-Efficient Storage
Language Model Rest Costs and Space-Efficient Storage Kenneth Heafield Philipp Koehn Alon Lavie Carnegie Mellon, University of Edinburgh July 14, 2012 Complaint About Language Models Make Search Expensive
More informationCS246 Final Exam. March 16, :30AM - 11:30AM
CS246 Final Exam March 16, 2016 8:30AM - 11:30AM Name : SUID : I acknowledge and accept the Stanford Honor Code. I have neither given nor received unpermitted help on this examination. (signed) Directions
More information/633 Introduction to Algorithms Lecturer: Michael Dinitz Topic: Dynamic Programming II Date: 10/12/17
601.433/633 Introduction to Algorithms Lecturer: Michael Dinitz Topic: Dynamic Programming II Date: 10/12/17 12.1 Introduction Today we re going to do a couple more examples of dynamic programming. While
More informationLECTURER: BURCU CAN Spring
LECTURER: BURCU CAN 2017-2018 Spring Regular Language Hidden Markov Model (HMM) Context Free Language Context Sensitive Language Probabilistic Context Free Grammar (PCFG) Unrestricted Language PCFGs can
More informationGaussian Models
Gaussian Models ddebarr@uw.edu 2016-04-28 Agenda Introduction Gaussian Discriminant Analysis Inference Linear Gaussian Systems The Wishart Distribution Inferring Parameters Introduction Gaussian Density
More informationA Geometric Method to Obtain the Generation Probability of a Sentence
A Geometric Method to Obtain the Generation Probability of a Sentence Chen Lijiang Nanjing Normal University ljchen97@6.com Abstract "How to generate a sentence" is the most critical and difficult problem
More informationPart of Speech Tagging: Viterbi, Forward, Backward, Forward- Backward, Baum-Welch. COMP-599 Oct 1, 2015
Part of Speech Tagging: Viterbi, Forward, Backward, Forward- Backward, Baum-Welch COMP-599 Oct 1, 2015 Announcements Research skills workshop today 3pm-4:30pm Schulich Library room 313 Start thinking about
More informationGenerative Clustering, Topic Modeling, & Bayesian Inference
Generative Clustering, Topic Modeling, & Bayesian Inference INFO-4604, Applied Machine Learning University of Colorado Boulder December 12-14, 2017 Prof. Michael Paul Unsupervised Naïve Bayes Last week
More informationLanguage Modelling. Steve Renals. Automatic Speech Recognition ASR Lecture 11 6 March ASR Lecture 11 Language Modelling 1
Language Modelling Steve Renals Automatic Speech Recognition ASR Lecture 11 6 March 2017 ASR Lecture 11 Language Modelling 1 HMM Speech Recognition Recorded Speech Decoded Text (Transcription) Acoustic
More informationLanguage Models. CS6200: Information Retrieval. Slides by: Jesse Anderton
Language Models CS6200: Information Retrieval Slides by: Jesse Anderton What s wrong with VSMs? Vector Space Models work reasonably well, but have a few problems: They are based on bag-of-words, so they
More informationA Tabular Method for Dynamic Oracles in Transition-Based Parsing
A Tabular Method for Dynamic Oracles in Transition-Based Parsing Yoav Goldberg Department of Computer Science Bar Ilan University, Israel yoav.goldberg@gmail.com Francesco Sartorio Department of Information
More informationLecture 2: N-gram. Kai-Wei Chang University of Virginia Couse webpage:
Lecture 2: N-gram Kai-Wei Chang CS @ University of Virginia kw@kwchang.net Couse webpage: http://kwchang.net/teaching/nlp16 CS 6501: Natural Language Processing 1 This lecture Language Models What are
More informationCOMS F18 Homework 3 (due October 29, 2018)
COMS 477-2 F8 Homework 3 (due October 29, 208) Instructions Submit your write-up on Gradescope as a neatly typeset (not scanned nor handwritten) PDF document by :59 PM of the due date. On Gradescope, be
More informationIn this chapter, we explore the parsing problem, which encompasses several questions, including:
Chapter 12 Parsing Algorithms 12.1 Introduction In this chapter, we explore the parsing problem, which encompasses several questions, including: Does L(G) contain w? What is the highest-weight derivation
More informationDoctoral Course in Speech Recognition. May 2007 Kjell Elenius
Doctoral Course in Speech Recognition May 2007 Kjell Elenius CHAPTER 12 BASIC SEARCH ALGORITHMS State-based search paradigm Triplet S, O, G S, set of initial states O, set of operators applied on a state
More informationStatistical NLP: Hidden Markov Models. Updated 12/15
Statistical NLP: Hidden Markov Models Updated 12/15 Markov Models Markov models are statistical tools that are useful for NLP because they can be used for part-of-speech-tagging applications Their first
More informationarxiv: v1 [cs.cl] 5 Mar Introduction
Neural Machine Translation and Sequence-to-sequence Models: A Tutorial Graham Neubig Language Technologies Institute, Carnegie Mellon University arxiv:1703.01619v1 [cs.cl] 5 Mar 2017 1 Introduction This
More informationNeural Networks Language Models
Neural Networks Language Models Philipp Koehn 10 October 2017 N-Gram Backoff Language Model 1 Previously, we approximated... by applying the chain rule p(w ) = p(w 1, w 2,..., w n ) p(w ) = i p(w i w 1,...,
More informationFrom perceptrons to word embeddings. Simon Šuster University of Groningen
From perceptrons to word embeddings Simon Šuster University of Groningen Outline A basic computational unit Weighting some input to produce an output: classification Perceptron Classify tweets Written
More information