An Algorithm for Fast Calculation of Back-off N-gram Probabilities with Unigram Rescaling


Masaharu Kato, Tetsuo Kosaka, Akinori Ito and Shozo Makino

Masaharu Kato: Graduate School of Science and Technology, Yamagata University, Jonan 4-3-16, Yonezawa, Yamagata, 992-8510 Japan. Tel/Fax: +81 238 22 4814. Email: katoh@yz.yamagata-u.ac.jp
Tetsuo Kosaka: Graduate School of Science and Technology, Yamagata University, Jonan 4-3-16, Yonezawa, Yamagata, 992-8510 Japan. Tel/Fax: +81 238 22 4814. Email: tkosaka@yz.yamagata-u.ac.jp
Akinori Ito: Graduate School of Engineering, Tohoku University, Aramaki aza Aoba 6-6-5, Sendai, Miyagi, 980-8579 Japan. Tel/Fax: +81 22 795 7084. Email: aito@makino.ecei.tohoku.ac.jp
Shozo Makino: Graduate School of Engineering, Tohoku University, Aramaki aza Aoba 6-6-5, Sendai, Miyagi, 980-8579 Japan. Tel/Fax: +81 22 795 7084. Email: makino@makino.ecei.tohoku.ac.jp

Abstract

Topic-based stochastic models such as probabilistic latent semantic analysis (PLSA) are good tools for adapting a language model to a specific domain using a constraint of global context. A probability given by a topic model is combined with an n-gram probability using the unigram rescaling scheme. One practical problem in applying PLSA to speech recognition is that calculating probabilities with PLSA is computationally expensive, which prevents the topic-based language model from being incorporated into the decoding process. In this paper, we propose an algorithm for quickly calculating back-off n-gram probabilities with unigram rescaling, without any approximation. The algorithm drastically reduces the cost of computing the normalizing factor, since it only requires the probabilities of the words that actually appear in the current context. The experimental result showed that the proposed algorithm was more than 6000 times faster than the naive calculation method.

Keywords: probabilistic latent semantic analysis, unigram rescaling, n-gram, back-off smoothing

1 Introduction

A language model is an important component of a speech recognition system [1]. In particular, n-gram models are widely used for speech recognition. Although an n-gram is a simple and powerful model, it has the drawback of being strongly affected by the domain of the training data, so its performance deteriorates greatly when it is applied to speech from another domain. To overcome this problem, language model adaptation techniques have been developed [2]. Among the various types of adaptation methods, those based on topic models are promising because they exploit a global constraint (the topic) of the speech when calculating word occurrence probabilities. There are many topic models, including latent semantic analysis (LSA) [3], probabilistic latent semantic analysis (PLSA) [4], and latent Dirichlet allocation (LDA) [5]. These models use a vector representation of a word or a document: a word or a document is regarded as a point in a vector space or in a continuous probability space, and the probability of a word is calculated by estimating a point in that space under a constraint given by the adaptation data. Such topic models estimate only unigram probabilities, in which the occurrence probability of a word is independent of the preceding or following words. However, a bigram or trigram model is usually used as the language model for speech recognition, where a word occurrence probability is determined by one or two preceding words.
Therefore, we need to combine the unigram probability obtained from the topic model with the bigram or trigram probability. Gildea and Hofmann proposed unigram rescaling for combining these probabilities [6]. Unigram rescaling adjusts an n-gram probability according to the ratio between the unigram probability given by the topic model and the unigram probability without the topic model. Unigram rescaling can be applied to any topic model [7, 6], and has proved to be effective compared with other combination methods such as linear interpolation [8]. This kind of combination of a topic model and an n-gram model is widely used in speech recognition research [9, 10, 11].

A problem with unigram rescaling is its computational cost. Unigram rescaling requires a normalization of the probability, and computing the normalizing factor requires a number of probability calculations proportional to the vocabulary size. Furthermore, the normalizing factor must be computed independently for every word history; the number of histories is proportional to the vocabulary size for a bigram model and to the square of the vocabulary size for a trigram model. When the adaptation data are fixed, computing the normalizing factors is not a serious problem because they need to be calculated only once, when the adaptation data are given. However, it becomes an issue when the language model is adapted dynamically (for example, word by word) [6]. In this case, the calculation of word occurrence probabilities becomes several thousand times slower.

In this paper, we propose an algorithm that expedites the calculation of unigram-rescaled n-gram probabilities. The basic idea is to combine the unigram rescaling calculation with the calculation of the back-off n-gram. Using the proposed algorithm, we can calculate the rescaled probability by considering only those words that actually appear in the linguistic context.

2 Unigram rescaling

PLSA and other topic-based models give the occurrence probability of a word w in a document d as

    P(w | d) = \sum_{t} P(w | t) P(t | d),    (1)

where t is a latent topic. To combine this document-dependent word occurrence probability with an n-gram probability, unigram rescaling is used:

    P(w | h, d) = \frac{1}{Z(h, d)} \frac{P(w | d)}{P(w)} P(w | h).    (2)

Here, h is the word history in which the word w occurs, P(w) is the unigram probability, and Z(h, d) is a normalizing factor calculated as

    Z(h, d) = \sum_{w \in V} \frac{P(w | d)}{P(w)} P(w | h),    (3)

where V is the vocabulary. When Eq. (3) is evaluated naively, the calculation time of the normalizing factor Z(h, d) is proportional to the vocabulary size, which means that calculating a unigram-rescaled probability is several thousand times slower than calculating a simple n-gram probability. This is a problem when unigram-rescaled probabilities are used in the decoding process.
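As an illustration of Eqs. (1)-(3), the following Python sketch computes a unigram-rescaled probability with the naive normalizer. It is a minimal reconstruction under assumed data structures (dictionaries p_w_given_t, p_t_given_d, p_unigram and a generic ngram_prob function), not the implementation used in this work.

```python
# Naive unigram rescaling (Eqs. 1-3): illustrative sketch, not the authors' code.
# Assumed inputs:
#   p_w_given_t[t][w] : topic-conditional word probabilities P(w|t) from PLSA
#   p_t_given_d[t]    : topic posteriors P(t|d) for the current document d
#   p_unigram[w]      : background unigram probabilities P(w)
#   ngram_prob(w, h)  : back-off n-gram probability P(w|h) for history h

def plsa_word_prob(w, p_w_given_t, p_t_given_d):
    """Eq. (1): P(w|d) = sum_t P(w|t) P(t|d)."""
    return sum(p_w_given_t[t].get(w, 0.0) * p_t_given_d[t] for t in p_t_given_d)

def naive_rescaled_prob(w, h, vocab, p_w_given_t, p_t_given_d, p_unigram, ngram_prob):
    """Eq. (2) with the normalizer of Eq. (3), summed over the whole vocabulary."""
    def ratio(v):
        return plsa_word_prob(v, p_w_given_t, p_t_given_d) / p_unigram[v]
    z = sum(ratio(v) * ngram_prob(v, h) for v in vocab)   # Eq. (3): O(|V|) n-gram lookups
    return ratio(w) * ngram_prob(w, h) / z                # Eq. (2)
```

The loop over `vocab` in the normalizer is exactly the per-history cost that the proposed algorithm removes.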
3 Back-off n-gram

To calculate an n-gram probability, back-off smoothing [12] is often used. Consider calculating a trigram probability P(w_i | w_{i-2}^{i-1}), where w_j^k denotes the word sequence w_j ... w_k. The trigram probability is calculated from the trigram count N(w_{i-2}^{i}), which is the number of occurrences of the word sequence w_{i-2} w_{i-1} w_i in the training corpus. Based on maximum likelihood estimation, the trigram probability is estimated as

    P_M(w_i | w_{i-2}^{i-1}) = \frac{N(w_{i-2}^{i})}{N(w_{i-2}^{i-1})}.    (4)

However, P_M becomes zero when the word sequence w_{i-2}^{i} does not occur in the training corpus. To avoid this zero-frequency problem, the probabilities of unseen trigrams are estimated from bigram frequencies:

    P(w_i | w_{i-2}^{i-1}) =
        \beta(w_{i-2}^{i}) P_M(w_i | w_{i-2}^{i-1})    if N(w_{i-2}^{i}) > 0,
        \alpha(w_{i-2}^{i-1}) P(w_i | w_{i-1})         if N(w_{i-2}^{i-1}) > 0,
        P(w_i | w_{i-1})                               otherwise.    (5)

In this equation, 0 < \beta(w_{i-2}^{i}) < 1 denotes a discount coefficient, which reserves a certain amount of probability mass for unseen n-grams. Various methods for determining the discount coefficient \beta have been proposed, including Good-Turing discounting [12], Witten-Bell discounting [13] and Kneser-Ney discounting [14]. \alpha(w_{i-2}^{i-1}) is a normalizing factor of the back-off calculation, computed as

    \alpha(w_{i-2}^{i-1}) = \frac{1 - \sum_{w \in V_1(w_{i-2}^{i-1})} \beta(w_{i-2}^{i-1} w) P_M(w | w_{i-2}^{i-1})}{\sum_{w \in V_0(w_{i-2}^{i-1})} P(w | w_{i-1})}.    (6)

Here, V_0(w_{i-2}^{i-1}) denotes the set

    V_0(w_{i-2}^{i-1}) = \{ w : w \in V, N(w_{i-2} w_{i-1} w) = 0 \},    (7)

and V_1(w_{i-2}^{i-1}) is its complementary set, i.e.

    V_1(w_{i-2}^{i-1}) = \{ w : w \in V, N(w_{i-2} w_{i-1} w) > 0 \}.    (8)

The bigram probability P(w_i | w_{i-1}) in Eq. (5) is itself calculated using back-off smoothing recursively. Note that the coefficients \alpha and \beta are calculated when the language model is built.
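A minimal sketch of the back-off lookup of Eq. (5) is given below, assuming an ARPA-style representation in which the discounted probabilities \beta P_M and the back-off weights \alpha are precomputed when the model is built. The table names (pm3, pm2, p1, alpha2, alpha1) are illustrative assumptions, not the authors' code.

```python
# Back-off trigram probability (Eq. 5): illustrative sketch under assumed tables.
# Assumed model tables (as produced when the language model is built):
#   pm3[(u, v, w)] : discounted trigram probability beta * P_M(w | u v) for seen trigrams
#   pm2[(v, w)]    : discounted bigram probability beta * P_M(w | v) for seen bigrams
#   p1[w]          : unigram probability P(w)
#   alpha2[(u, v)] : back-off weight alpha(u v) for the seen bigram history u v
#   alpha1[v]      : back-off weight alpha(v) for the seen unigram history v

def bigram_prob(v, w, pm2, p1, alpha1):
    """P(w | v): discounted estimate if the bigram was seen, otherwise back off to the unigram."""
    if (v, w) in pm2:
        return pm2[(v, w)]
    return alpha1.get(v, 1.0) * p1[w]

def trigram_prob(u, v, w, pm3, pm2, p1, alpha2, alpha1):
    """P(w | u v) according to Eq. (5)."""
    if (u, v, w) in pm3:                 # N(u v w) > 0
        return pm3[(u, v, w)]
    backoff = bigram_prob(v, w, pm2, p1, alpha1)
    if (u, v) in alpha2:                 # N(u v) > 0
        return alpha2[(u, v)] * backoff
    return backoff                       # history unseen: back off with weight 1
```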

4 Fast calculation of unigram rescaling

4.1 The bigram case

Let us first consider calculating a bigram probability with unigram rescaling. By rewriting Eq. (2), we obtain

    P(w_i | w_{i-1}, d) = \frac{1}{Z(w_{i-1}, d)} \frac{P(w_i | d)}{P(w_i)} P(w_i | w_{i-1}).    (9)

When w_i \in V_1(w_{i-1}), we calculate Eq. (9) as

    P(w_i | w_{i-1}, d) = \frac{\beta(w_{i-1}^{i})}{Z(w_{i-1}, d)} \frac{P(w_i | d)}{P(w_i)} P_M(w_i | w_{i-1}),    (10)

and when w_i \in V_0(w_{i-1}), assuming P(w_i) = P_M(w_i), we obtain

    P(w_i | w_{i-1}, d) = \frac{\alpha(w_{i-1})}{Z(w_{i-1}, d)} \frac{P(w_i | d)}{P(w_i)} P_M(w_i) = \frac{\alpha(w_{i-1})}{Z(w_{i-1}, d)} P(w_i | d).    (11)

Therefore the normalizing factor Z(w_{i-1}, d) is calculated as

    Z(w_{i-1}, d) = \sum_{w \in V} \frac{P(w | d)}{P(w)} P(w | w_{i-1})    (12)

                  = \sum_{w \in V_1(w_{i-1})} \frac{P(w | d)}{P(w)} \beta(w_{i-1} w) P_M(w | w_{i-1}) + \sum_{w \in V_0(w_{i-1})} \alpha(w_{i-1}) P(w | d)    (13)

                  = \sum_{w \in V_1(w_{i-1})} \frac{P(w | d)}{P(w)} \beta(w_{i-1} w) P_M(w | w_{i-1}) + \alpha(w_{i-1}) \left( 1 - \sum_{w \in V_1(w_{i-1})} P(w | d) \right),    (14)

where the last step uses \sum_{w \in V_0(w_{i-1})} P(w | d) = 1 - \sum_{w \in V_1(w_{i-1})} P(w | d). Equation (14) indicates that the normalizing factor Z(w_{i-1}, d) can be calculated by summing only over the words in V_1(w_{i-1}), i.e. the set of words that appear just after w_{i-1} in the training corpus. Since we can assume |V_1(w_{i-1})| \ll |V_0(w_{i-1})|, we can expect a large reduction in calculation time.
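The computation of Eq. (14) can be sketched as follows. This is an illustrative reconstruction under the same assumed tables as above, with successors[v] holding the discounted probabilities \beta P_M(w | v) for the words w \in V_1(v); it is not the authors' implementation.

```python
# Fast computation of the bigram normalizing factor Z(w_{i-1}, d) via Eq. (14).
# Assumed inputs:
#   successors[v] : dict mapping each w in V_1(v) to beta * P_M(w | v)
#                   (i.e. the pm2 table grouped by history)
#   alpha1[v]     : bigram back-off weight alpha(v)
#   p_doc[w]      : topic-model probability P(w | d) from Eq. (1)
#   p_unigram[w]  : background unigram probability P(w)

def fast_bigram_normalizer(v, successors, alpha1, p_doc, p_unigram):
    """Z(v, d) computed by summing only over V_1(v), the observed successors of v."""
    seen_term = 0.0      # first sum in Eq. (14)
    seen_doc_mass = 0.0  # sum of P(w|d) over V_1(v), used for the complement in Eq. (14)
    for w, discounted in successors.get(v, {}).items():
        ratio = p_doc[w] / p_unigram[w]
        seen_term += ratio * discounted
        seen_doc_mass += p_doc[w]
    return seen_term + alpha1.get(v, 1.0) * (1.0 - seen_doc_mass)
```

Compared with the naive normalizer of Eq. (3), this loop visits only the observed successors of w_{i-1} rather than the whole vocabulary.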

4.2 The trigram case

The reduction of calculation in the previous section relies on the fact that the back-off bigram probability reduces to a unigram probability when the word is contained in V_0(w_{i-1}), and that this unigram probability cancels the denominator of the unigram rescaling ratio. This does not hold in the trigram case. However, we can still reduce the calculation time of a trigram probability with unigram rescaling by using the bigram normalizing factor Z(w_{i-1}, d). The normalizing factor for a trigram probability is calculated as

    Z(w_{i-2}^{i-1}, d) = \sum_{w \in V} \frac{P(w | d)}{P(w)} P(w | w_{i-2}^{i-1})    (15)

                        = \sum_{w \in V_1(w_{i-2}^{i-1})} \frac{P(w | d)}{P(w)} P(w | w_{i-2}^{i-1}) + \sum_{w \in V_0(w_{i-2}^{i-1})} \frac{P(w | d)}{P(w)} P(w | w_{i-2}^{i-1}).    (16)

Let the second term of Eq. (16) be

    Z_0(w_{i-2}^{i-1}, d) = \sum_{w \in V_0(w_{i-2}^{i-1})} \frac{P(w | d)}{P(w)} P(w | w_{i-2}^{i-1}).    (17)

Then Z_0(w_{i-2}^{i-1}, d) can be rewritten as

    Z_0(w_{i-2}^{i-1}, d) = \sum_{w \in V_0(w_{i-2}^{i-1})} \frac{P(w | d)}{P(w)} \alpha(w_{i-2}^{i-1}) P(w | w_{i-1})    (18)

                          = \alpha(w_{i-2}^{i-1}) \sum_{w \in V_0(w_{i-2}^{i-1})} \frac{P(w | d)}{P(w)} P(w | w_{i-1}).    (19)

Here, from Eq. (9) we obtain

    Z(w_{i-1}, d) P(w | w_{i-1}, d) = \frac{P(w | d)}{P(w)} P(w | w_{i-1}).    (20)

Then Eq. (19) can be rewritten as

    Z_0(w_{i-2}^{i-1}, d) = \alpha(w_{i-2}^{i-1}) \sum_{w \in V_0(w_{i-2}^{i-1})} Z(w_{i-1}, d) P(w | w_{i-1}, d)    (21)

                          = \alpha(w_{i-2}^{i-1}) Z(w_{i-1}, d) \left( 1 - \sum_{w \in V_1(w_{i-2}^{i-1})} P(w | w_{i-1}, d) \right)    (22)

                          = \alpha(w_{i-2}^{i-1}) \left( Z(w_{i-1}, d) - \sum_{w \in V_1(w_{i-2}^{i-1})} \frac{P(w | d)}{P(w)} P(w | w_{i-1}) \right).    (23)

Finally, we obtain

    Z(w_{i-2}^{i-1}, d) = \sum_{w \in V_1(w_{i-2}^{i-1})} \frac{P(w | d)}{P(w)} P(w | w_{i-2}^{i-1}) + \alpha(w_{i-2}^{i-1}) \left( Z(w_{i-1}, d) - \sum_{w \in V_1(w_{i-2}^{i-1})} \frac{P(w | d)}{P(w)} P(w | w_{i-1}) \right).    (24)

This equation contains only summations over V_1(w_{i-2}^{i-1}), the set of words that appear just after w_{i-2} w_{i-1}. Note that calculating Z(w_{i-1}, d) requires a summation over V_1(w_{i-1}), which is generally larger than V_1(w_{i-2}^{i-1}). However, since there are only |V| different factors Z(w_{i-1}, d), we can cache the calculated values of Z(w_{i-1}, d) for a specific d. Therefore, when calculating Z(w_{i-2}^{i-1}, d), if Z(w_{i-1}, d) has not been calculated yet, we first calculate Z(w_{i-1}, d) and then calculate Z(w_{i-2}^{i-1}, d); otherwise, we simply reuse the cached value of Z(w_{i-1}, d).
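A sketch of Eq. (24) with the caching of Z(w_{i-1}, d) described above is shown below. The table names (trigram_successors, alpha2, z_cache) and the helpers from the previous sketches are assumptions for illustration, not the authors' implementation.

```python
# Trigram normalizing factor Z(w_{i-2}^{i-1}, d) via Eq. (24), with caching of the
# bigram factors Z(w_{i-1}, d): illustrative sketch under assumed tables.
# Assumed additional inputs:
#   trigram_successors[(u, v)] : dict mapping each w in V_1(u v) to beta * P_M(w | u v)
#   alpha2[(u, v)]             : trigram back-off weight alpha(u v)
#   bigram_prob / fast_bigram_normalizer : helpers from the earlier sketches
#   z_cache                    : dict caching Z(v, d) per document d

def fast_trigram_normalizer(u, v, trigram_successors, alpha2, successors, alpha1,
                            p_doc, p_unigram, pm2, p1, z_cache):
    """Z(u v, d) computed by summing only over V_1(u v); Z(v, d) is cached per document."""
    if v not in z_cache:  # compute and cache the bigram factor Z(v, d) on first use
        z_cache[v] = fast_bigram_normalizer(v, successors, alpha1, p_doc, p_unigram)
    z_bigram = z_cache[v]

    seen_term = 0.0      # first sum in Eq. (24)
    seen_backoff = 0.0   # sum of (P(w|d)/P(w)) P(w|v) over V_1(u v)
    for w, discounted in trigram_successors.get((u, v), {}).items():
        ratio = p_doc[w] / p_unigram[w]
        seen_term += ratio * discounted
        seen_backoff += ratio * bigram_prob(v, w, pm2, p1, alpha1)
    return seen_term + alpha2.get((u, v), 1.0) * (z_bigram - seen_backoff)
```

Because the cache is keyed on the bigram history and built per document d, each Z(w_{i-1}, d) is computed at most once per document, as described above.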
4.3 Estimation of calculation time

Let us estimate how fast the proposed algorithm is, based on a simple assumption. When calculating the probabilities of word occurrences in an evaluation corpus based on Eq. (2), we have to calculate normalizing factors over all words w in the vocabulary for every history h in the evaluation corpus. Let K_n be the number of distinct n-grams that appear in the training corpus, and H_n be that in the evaluation corpus. In the bigram case, the number of probability calculations needed for all distinct bigrams in the evaluation corpus is estimated as O(H_1 K_1), because we have to calculate the probabilities of all words in the vocabulary for each word history that appears in the evaluation corpus. On the other hand, the calculation based on Eq. (14) requires probability calculations only for those words that appear just after the current context. Because

    E_w[ |V_1(w)| ] = \frac{1}{|V|} \sum_{w} |V_1(w)|    (25)
                    = \frac{K_2}{|V|} = \frac{K_2}{K_1},    (26)

the computational complexity of the proposed method is proportional to O(H_1 K_2 / K_1). As for the trigram case, the naive calculation needs O(H_2 K_1). According to Eq. (24), we also need to calculate the bigram factors Z(w_{i-1}, d) to obtain the normalizing factors for trigram probabilities, so the calculation time is proportional to O(H_1 K_2 / K_1 + H_2 K_3 / K_2).

5 Evaluation by an experiment

5.1 Experimental conditions

We carried out an experiment in which the probabilities of the trigrams in a corpus were actually calculated using both the naive calculation method and the proposed algorithm. In this experiment, we calculated the probabilities of all distinct trigrams in the corpus, which means that the calculation for a specific trigram was carried out only once. We used the Corpus of Spontaneous Japanese (CSJ) [15] for both training and testing. The text was extracted from the XML representation of the corpus, and we employed the short-unit word (SUW) as the definition of a word. All inter-pausal units were enclosed by beginning-of-sentence and end-of-sentence symbols. We used 2668 lectures drawn from academic presentations and public speaking, which contain 833140 sentences (word sequences between pauses) and 8486301 words. The vocabulary size was 47309, including the words that appeared more than once in the training corpus. We trained a trigram model from the training corpus. The cut-off frequency was set to 1 for bigrams and 3 for trigrams, meaning that bigrams that occurred only once and trigrams that occurred fewer than four times were excluded from the n-gram calculation. We also trained a PLSA model from the training corpus using a parallel training algorithm for PLSA [16]. We used 10 lectures (test-set 1) as an evaluation corpus, which contained 1217 sentences and 30837 words.

The numbers of distinct n-grams in the training and evaluation corpora are shown in Table 1. In this table, K_n is the number of n-grams in the generated n-gram model; the number of distinct trigrams is therefore smaller than that of bigrams, because low-frequency trigram counts were cut off while constructing the language model.

Table 1: Number of distinct n-grams
  n-gram   | K_n    | H_n
  unigram  | 47100  | 665
  bigram   | 327604 | 34296
  trigram  | 208234 | 72618
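As a rough, back-of-the-envelope illustration of the estimates in Section 4.3 (our own reading of the Table 1 counts, not a figure reported in the paper):

    \frac{K_2}{K_1} = \frac{327604}{47100} \approx 7, \qquad \frac{K_3}{K_2} = \frac{208234}{327604} \approx 0.64,

so a bigram normalizing factor is obtained by summing roughly seven terms on average, and a trigram normalizing factor by summing fewer than one trigram term on average (plus, at most once per bigram history, the cached factor Z(w_{i-1}, d)), instead of the roughly 47000 terms of the naive sum over the vocabulary.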

The PLSA calculations were carried out on a computer equipped with an Intel Celeron 2.80 GHz processor and 1 GB of memory.

Table 2: CPU time for evaluation
  Calculation | CPU time [sec]
  Naive       | 5683.00
  Proposed    | 0.85

Table 2 shows the evaluation result. Note that this result does not include the calculation time for the PLSA adaptation itself. The result shows that the proposed algorithm was about 6700 times faster than the naive method, which calculates the normalizing factors over all words in the vocabulary.

6 Conclusion

In this paper, we proposed an algorithm for calculating probabilities obtained by combining a back-off n-gram with unigram rescaling. The algorithm drastically reduces the cost of calculating the normalizing factor of the unigram-rescaled probability. In the experiment, the proposed algorithm calculated the unigram-rescaled probabilities about 6700 times faster than the naive method. Although we used the proposed algorithm for combining n-gram and PLSA probabilities, it can be used for combining an n-gram with any language model by unigram rescaling. Using this calculation algorithm, it becomes realistic to incorporate a unigram-rescaled language model into the decoder.

References

[1] Rosenfeld, R., Two Decades of Statistical Language Modeling: Where Do We Go from Here?, Proceedings of the IEEE, Vol. 88, No. 8, pp. 1270-1278, 2000.
[2] Bellegarda, J.R., Statistical Language Model Adaptation: Review and Perspectives, Speech Communication, Vol. 42, No. 1, pp. 93-109, 2004.
[3] Deerwester, S., Dumais, S.T., Furnas, G.W., Landauer, T.K., Harshman, R., Indexing by Latent Semantic Analysis, J. American Society for Information Science, Vol. 41, pp. 391-407, 1990.
[4] Hofmann, T., Unsupervised Learning by Probabilistic Latent Semantic Analysis, Machine Learning, Vol. 42, No. 1-2, pp. 177-196, 2001.
[5] Blei, D.M., Ng, A.Y., Jordan, M.I., Latent Dirichlet Allocation, J. Machine Learning Research, Vol. 3, pp. 993-1022, 2003.
[6] Gildea, D., Hofmann, T., Topic-Based Language Models Using EM, Proc. Eurospeech, Budapest, Hungary, pp. 2167-2170, 9/99.
[7] Bellegarda, J.R., Exploiting Latent Semantic Information in Statistical Language Modeling, Proceedings of the IEEE, Vol. 88, pp. 1279-1296, 2000.
[8] Broman, S., Kurimo, M., Methods for Combining Language Models in Speech Recognition, Proc. Interspeech, Lisbon, Portugal, pp. 1317-1320, 9/05.
[9] Akita, Y., Kawahara, T., Language Model Adaptation Based on PLSA of Topics and Speakers, Proc. Interspeech, Jeju, South Korea, pp. 1045-1048, 10/04.
[10] Chiu, H.S., Chen, B., Word Topical Mixture Models for Dynamic Language Model Adaptation, Proc. Int. Conf. on Acoustics, Speech and Signal Processing, Honolulu, U.S.A., pp. IV-169-IV-172, 4/07.
[11] Hsieh, Y.C., Huang, Y.T., Wang, C.C., Lee, L.S., Improved Spoken Document Retrieval with Dynamic Key Term Lexicon and Probabilistic Latent Semantic Analysis (PLSA), Proc. Int. Conf. on Acoustics, Speech and Signal Processing, Toulouse, France, pp. I-961-I-964, 5/06.
[12] Katz, S., Estimation of Probabilities from Sparse Data for the Language Model Component of a Speech Recognizer, IEEE Trans. Acoustics, Speech and Signal Processing, Vol. 35, No. 3, pp. 400-401, 1987.
[13] Witten, I.H., Bell, T.C., The Zero-Frequency Problem: Estimating the Probabilities of Novel Events in Adaptive Text Compression, IEEE Trans. Information Theory, Vol. 37, No. 4, pp. 1085-1094, 1991.
[14] Ney, H., Essen, U., Kneser, R., On Structuring Probabilistic Dependences in Stochastic Language Modelling, Computer Speech and Language, Vol. 8, No. 1, pp. 1-38, 1994.
[15] Maekawa, K., Koiso, H., Furui, S., Isahara, H., Spontaneous Speech Corpus of Japanese, Proc. Second International Conference on Language Resources and Evaluation (LREC), Athens, Greece, pp. 947-952, 6/00.
[16] Kato, M., Kosaka, T., Ito, A., Makino, S., Fast and Robust Training of a Probabilistic Latent Semantic Analysis Model by the Parallel Learning and Data Segmentation, J. Communication and Computer, Vol. 6, No. 5, pp. 28-35, 2009.