November 5, 2010

N-gram Language Models for Large-Vocabulary Continuous Speech Recognition

48-106413

Abstract

Large-Vocabulary Continuous Speech Recognition (LVCSR) systems have been developing rapidly. Their purpose is to automate the transcription of speech such as lectures and broadcast news. A speech recognition system has two core components, the acoustic model and the language model; this paper focuses on the language model. Statistical language models are widely used in LVCSR, and among them the N-gram language model is the de facto standard for speech recognition. Although the N-gram model outperforms other language models in practice, it still has several well-known problems. This paper introduces three major issues of the N-gram model and surveys recent research trends in language modeling that address them.

1 Introduction

Large-Vocabulary Continuous Speech Recognition (LVCSR) [1] aims to automatically transcribe continuous speech such as lectures and broadcast news. As Fig. 1 shows, a recognition system combines two models: an acoustic model [2], which scores how well a hypothesis matches the observed speech signal, and a language model [3], which scores how plausible the hypothesis is as a word sequence. This paper focuses on the language model.

[Fig. 1: Overview of Speech Recognition]

Statistical language models, which predict each word from the (N-1) words that precede it, are widely used in LVCSR; the N-gram language model, whose roots go back to Shannon's information theory [4], is the de facto standard for speech recognition [5, 6]. The N-gram model is the most effective language model available, but several problems remain.

The rest of this paper is organized as follows. Section 2 reviews the N-gram language model. Sections 3 to 5 each take up one of its three major problems and survey the approaches proposed to solve it. Section 6 concludes the paper.
2 The N-gram Language Model

2.1 Definition

The N-gram language model [4] predicts each word from the (N-1) words that precede it. Consider a word sequence w_1, w_2, ..., w_k. Under an (N-1)-th order Markov assumption, the probability of word w_i given its whole history is approximated using only the most recent (N-1) words:

    P(w_i \mid w_1^{i-1}) \approx P(w_i \mid w_{i-N+1}^{i-1})    (1)

The models for N = 1, 2, 3 are called unigram, bigram, and trigram, respectively. The conditional probability is estimated from a training corpus by maximum likelihood:

    P(w_i \mid w_{i-N+1}^{i-1}) = \frac{C(w_{i-N+1}^{i})}{C(w_{i-N+1}^{i-1})}    (2)

where C(\cdot) denotes the number of times the word sequence occurs in the corpus. A larger N captures a longer context but makes the counts sparser, so in practice N = 2 or 3, i.e., a bigram or trigram model, is used [7]. Fig. 2 shows an example of computing a trigram probability.

[Fig. 2: An Example of Calculating Trigram Probability]

2.2 Back-off Smoothing

No matter how large the training corpus is, many perfectly valid N-grams never appear in it, and Eq. (2) assigns every unseen N-gram probability zero. Smoothing redistributes probability mass to avoid such zeros. In back-off smoothing, the probability of an unseen N-gram is estimated from the lower-order (N-1)-gram model, with the probabilities of seen N-grams discounted so that the distribution still sums to one [7]. Fig. 3 illustrates backing off from a trigram to a bigram.

[Fig. 3: Back-off Smoothing]
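To make Eqs. (1)-(2) and the back-off idea concrete, the following is a minimal Python sketch of maximum-likelihood trigram estimation with a crude fixed-weight back-off. The toy corpus, the absence of sentence-boundary padding, and the constant back-off weight 0.4 are simplifications assumed here for illustration; real recognizers use principled discounting schemes such as Katz or Kneser-Ney smoothing, not this fixed weight.

```python
from collections import defaultdict

def count_ngrams(corpus, n):
    """Count n-grams within each sentence (no boundary padding, for brevity)."""
    counts = defaultdict(int)
    for sentence in corpus:
        for i in range(len(sentence) - n + 1):
            counts[tuple(sentence[i:i + n])] += 1
    return counts

# Toy training corpus: a list of tokenized sentences.
corpus = [["i", "like", "speech"], ["i", "like", "language", "models"]]
c3, c2, c1 = (count_ngrams(corpus, n) for n in (3, 2, 1))
total_words = sum(c1.values())

def trigram_prob(w2, w1, w):
    """P(w | w2 w1) by Eq. (2), backing off when the trigram is unseen."""
    if c3[(w2, w1, w)] > 0:
        return c3[(w2, w1, w)] / c2[(w2, w1)]    # Eq. (2) with N = 3
    if c2[(w1, w)] > 0:
        return 0.4 * c2[(w1, w)] / c1[(w1,)]     # back off to the bigram
    return 0.4 * 0.4 * c1[(w,)] / total_words    # back off to the unigram

print(trigram_prob("i", "like", "speech"))   # 0.5: "i like" is followed by
                                             # "speech" in one of two cases
```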
2.3 Problems of the N-gram Model

The N-gram model has three major problems. First, it conditions only on the preceding (N-1) words, so it cannot capture dependencies on words farther back. Second, it is estimated from a fixed training corpus, so its performance degrades when the topic or style of the recognition task differs from that corpus. Third, words outside its vocabulary cannot be recognized at all. Sections 3 to 5 survey approaches to these three problems in turn.

3 Exploiting Long-Distance Dependencies

The N-gram model looks back only (N-1) words; a trigram, for example, sees just the two preceding words. Words that appeared earlier in the utterance or document, although clearly informative, have no influence on the prediction. Approaches that exploit such long-distance information include the cache model [8], the trigger model [9], the variable-length N-gram [10], and the Structured Language Model [11]. This section describes the cache model and the trigger model.

3.1 Cache Model

The cache model [8] is based on the observation that a word that has occurred recently is likely to occur again soon (Fig. 4). Let the history H = {w_{n-|H|}, ..., w_{n-1}} be the most recent |H| words. The cache probability of the next word w_n is its relative frequency in H:

    P_c(w_n \mid H) = \frac{1}{|H|} \sum_{w_h \in H} \delta(w_n, w_h)    (3)

where |H| is the size of the cache and \delta is the Kronecker delta:

    \delta(i, j) = \begin{cases} 1 & (i = j) \\ 0 & (i \neq j) \end{cases}    (4)

[Fig. 4: Probability of Occurrence of the Same Word Again]

Because Eq. (3) by itself is a poor predictor, the cache probability is combined with the static N-gram probability, typically by linear interpolation with weights summing to one, and the cache size |H| is tuned empirically [12, 13].
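Below is a minimal sketch of Eq. (3) interpolated with a base N-gram model. The cache size of 200 words and the weight lam = 0.9 are illustrative assumptions made here, not values from [8].

```python
from collections import deque

class CacheLM:
    """Static N-gram interpolated with the cache probability of Eq. (3)."""

    def __init__(self, ngram_prob, cache_size=200, lam=0.9):
        self.ngram_prob = ngram_prob          # base model: P(w | context)
        self.cache = deque(maxlen=cache_size) # history H, oldest word drops out
        self.lam = lam                        # interpolation weight

    def cache_prob(self, w):
        """P_c(w | H): relative frequency of w in the cache H (Eq. (3))."""
        if not self.cache:
            return 0.0
        return sum(1 for w_h in self.cache if w_h == w) / len(self.cache)

    def prob(self, w, context):
        """Linear interpolation of the static N-gram and the cache."""
        return (self.lam * self.ngram_prob(w, context)
                + (1 - self.lam) * self.cache_prob(w))

    def observe(self, w):
        """Push a recognized word into the cache after each decoding step."""
        self.cache.append(w)
```

A recognizer would call observe() for every word it outputs, so that recently uttered words receive a boosted probability on the next prediction.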
3.2 Trigger Model

The cache model captures only the re-occurrence of the same word. The trigger model [9] generalizes this idea to pairs of related words [14, 15]: once a word w_a has been observed, a semantically related word w_b becomes more likely to follow (Fig. 5). Such a pair (w_a, w_b) is called a trigger pair; a pair in which a word triggers itself is a self-trigger and corresponds to the cache model.

[Fig. 5: Probability of Co-occurrence of Two Words]

The trigger information is incorporated through a maximum entropy (ME) model [3]:

    P_T(w \mid d) \propto \exp\Bigl( \sum_i \lambda_i f_i(d) \Bigr)    (5)

    f_i(d) = \begin{cases} 1 & \text{if the trigger word of the } i\text{-th pair for } w \text{ occurs in } d \\ 0 & \text{otherwise} \end{cases}    (6)

where d is the history, each feature function f_i corresponds to one trigger pair, and the weights \lambda_i are estimated by maximum entropy training [9].

Since the number of possible word pairs is enormous, the trigger pairs must be selected. Rosenfeld [9] measures the average mutual information between the event that w_a occurs in the history and the event that w_b occurs next:

    MI(w_a; w_b) = P(w_a, w_b) \log \frac{P(w_b \mid w_a)}{P(w_b)}
                 + P(w_a, \bar{w}_b) \log \frac{P(\bar{w}_b \mid w_a)}{P(\bar{w}_b)}
                 + P(\bar{w}_a, w_b) \log \frac{P(w_b \mid \bar{w}_a)}{P(w_b)}
                 + P(\bar{w}_a, \bar{w}_b) \log \frac{P(\bar{w}_b \mid \bar{w}_a)}{P(\bar{w}_b)}    (7)

where \bar{w} denotes the event that w does not occur. The k pairs with the highest mutual information are selected as trigger pairs.
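As a concrete illustration of Eq. (7), the sketch below estimates the four joint probabilities from document-level co-occurrence counts and sums the four terms. Representing the corpus as a list of token sets is an assumption made here for simplicity; it is not Rosenfeld's exact training setup.

```python
import math

def mutual_information(w_a, w_b, docs):
    """Average mutual information of Eq. (7), with docs a list of token sets."""
    n = len(docs)
    n_a = sum(1 for d in docs if w_a in d)
    n_b = sum(1 for d in docs if w_b in d)
    n_ab = sum(1 for d in docs if w_a in d and w_b in d)

    mi = 0.0
    # The four joint events: (a present/absent) x (b present/absent).
    for ca, cb, n_joint in [
        (n_a, n_b, n_ab),
        (n_a, n - n_b, n_a - n_ab),
        (n - n_a, n_b, n_b - n_ab),
        (n - n_a, n - n_b, n - n_a - n_b + n_ab),
    ]:
        if n_joint > 0 and ca > 0:
            p_joint = n_joint / n
            # p_joint * log( P(b-event | a-event) / P(b-event) )
            mi += p_joint * math.log((n_joint / ca) / (cb / n))
        # terms with zero joint count contribute 0 by convention
    return mi

docs = [{"stocks", "fell"}, {"stocks", "rose"}, {"rain", "fell"}]
print(mutual_information("stocks", "rose", docs))  # > 0: the words attract
```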
4 Language Model Adaptation

The second problem is the mismatch between training and test conditions. An N-gram model is estimated from a particular training corpus, so recognition accuracy drops when the topic or the speaking style of the target task differs from that corpus. Techniques that adjust the language model to the target task are collectively called language model adaptation [16, 17].

4.1 Two Kinds of Adaptation

Adaptation is usually divided into two kinds:

1. Topic adaptation, which adjusts the model to the subject matter of the target speech.
2. Speech-style adaptation, which adjusts the model to the style of the target speech, e.g., written versus spontaneous spoken language.

4.2 Topic Adaptation

Two representative approaches to topic adaptation are:

1. Adaptation based on Probabilistic Latent Semantic Analysis (PLSA) [18].
2. Adaptation using text collected from the World Wide Web (WWW).

4.2.1 Adaptation Using PLSA

Probabilistic Latent Semantic Analysis (PLSA) [18] relates a document d and a word w through a latent topic variable z:

    P(w \mid d) = \sum_z P(w \mid z) P(z \mid d)    (8)

The parameters P(w | z) and P(z | d) are estimated with the EM algorithm. PLSA yields only unigram probabilities, while recognition requires bigram or trigram probabilities, so the topic-adapted unigram is combined with the static trigram by unigram rescaling [19]:

    P_{adapt}(w_i \mid w_{i-1} w_{i-2}) \propto \frac{P(w_i \mid d)}{P(w_i)} \, P(w_i \mid w_{i-1} w_{i-2})    (9)

That is, the static trigram probability is scaled up for words that are more probable under the document's topic mixture than under the corpus-wide unigram, and scaled down otherwise. Extensions and applications of PLSA-based adaptation to speech recognition have also been reported [20, 21].
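The following sketch applies Eqs. (8) and (9). The topic parameters p_w_given_z and p_z_given_d are assumed to come from an already EM-trained PLSA model (training is not shown); the renormalization over the vocabulary is needed because Eq. (9) defines the adapted probability only up to a constant.

```python
def plsa_word_prob(w, p_w_given_z, p_z_given_d):
    """Eq. (8): P(w | d) = sum_z P(w | z) P(z | d)."""
    return sum(p_w_given_z[z][w] * p_z_given_d[z] for z in p_z_given_d)

def adapted_trigram_prob(w, context, trigram_prob, p_w_d, p_w, vocab):
    """Eq. (9): rescale the static trigram by P(w | d) / P(w), renormalized."""
    def score(v):
        return (p_w_d[v] / p_w[v]) * trigram_prob(v, context)
    z = sum(score(v) for v in vocab)   # normalization over the vocabulary
    return score(w) / z

# Toy example with two topics and a three-word vocabulary.
p_w_given_z = {0: {"stocks": 0.6, "goal": 0.1, "the": 0.3},
               1: {"stocks": 0.1, "goal": 0.6, "the": 0.3}}
p_z_given_d = {0: 0.9, 1: 0.1}         # current document is mostly topic 0
p_w_d = {w: plsa_word_prob(w, p_w_given_z, p_z_given_d)
         for w in ["stocks", "goal", "the"]}
```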
4.2.2 Adaptation Using the Web

Another approach collects adaptation text from the World Wide Web [13, 22]. Because a first-pass recognition result is used to find task-relevant text, no manual transcript is required; the adaptation is unsupervised (Fig. 6). The procedure is roughly as follows:

1. Recognize the input speech with the baseline language model to obtain an initial transcript.
2. Extract characteristic keywords from the transcript.
3. Submit the keywords as queries to a Web search engine.
4. Collect the retrieved Web pages as an adaptation corpus.
5. Re-estimate the language model with the Web text and recognize the speech again.

[Fig. 6: Unsupervised Language Model Adaptation Using the Web]

4.2.3 Keyword Selection with tf-idf

The quality of Web adaptation hinges on the search queries, so characteristic words must be selected from the first-pass transcript. A standard measure is tf-idf [23], the product of two factors, term frequency (tf) and inverse document frequency (idf):

    tfidf_i = tf_i \times idf_i    (10)

    tf_i = \frac{n_i}{\sum_k n_k}    (11)

    idf_i = \log \frac{|D|}{|\{d : d \ni t_i\}|}    (12)

where n_i is the number of occurrences of term t_i in the document, |D| is the total number of documents, and |{d : d \ni t_i}| is the number of documents that contain t_i. A word scores high when it is frequent in the current document but rare across documents, which is exactly the behavior wanted for topic keywords. tf-idf-based keyword extraction has been applied both to transcripts [12] and to building Web search queries for adaptation [24].
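A direct Python transcription of Eqs. (10)-(12), assuming documents are given as token lists; the zero-df guard is a practical addition not present in the equations.

```python
import math
from collections import Counter

def tfidf(term, doc, docs):
    """tf-idf of a term in doc, relative to the collection docs (Eqs. (10)-(12))."""
    counts = Counter(doc)
    tf = counts[term] / sum(counts.values())    # Eq. (11)
    df = sum(1 for d in docs if term in d)      # |{d : d contains t_i}|
    if df == 0:
        return 0.0                              # guard, not part of Eq. (12)
    return tf * math.log(len(docs) / df)        # Eqs. (12) and (10)

# Keyword selection for a Web query: top-scoring words of the first-pass
# transcript, scored against a background collection.
docs = [["the", "nikkei", "average", "rose"], ["the", "game", "ended"], ["the", "rain"]]
transcript = docs[0]
keywords = sorted(set(transcript),
                  key=lambda w: tfidf(w, transcript, docs), reverse=True)[:2]
print(keywords)   # e.g. ['nikkei', 'average']; "the" scores 0 (appears everywhere)
```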
4.3 Speech-Style Adaptation

The style of the available training text often differs from that of the target speech; text harvested from the Web, for example, is mostly written-style, while lecture speech is spontaneous. Speech-style adaptation bridges this gap, for instance by transforming training text into spoken style or by exploiting transcripts annotated with style tags [25]. Fig. 7 shows an example of a tagged transcript; such tags allow style-dependent N-gram models to be trained [26].

[Fig. 7: An Example of a Tagged Transcript]

5 Unknown Words

The third problem is the out-of-vocabulary (OOV) problem. An N-gram model is defined over a fixed vocabulary; every word outside it is mapped to a single unknown-word symbol UNK and receives the probability P(UNK | w_{i-2} w_{i-1}), so OOV words can never be output correctly. New words appear constantly, especially on the Web, so the problem cannot be solved simply by enlarging the vocabulary.

One remedy is the hierarchical language model [27] shown in Fig. 8: the word-level N-gram predicts UNK as a class token, and a class-dependent sub-model then spells out the actual OOV word, making unknown words recognizable. Another line of work keeps the vocabulary and the N-gram model up to date by mining new words from Web resources such as Wikipedia [28, 29].

[Fig. 8: A Hierarchical Language Model]
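A minimal sketch of the two-level scoring idea behind Fig. 8. The single generic character-level model char_model_prob is a placeholder assumed here for brevity; [27] uses class-dependent word models rather than one such model.

```python
# Two-level OOV scoring in the spirit of the hierarchical model (Fig. 8):
# in-vocabulary words are scored by the word N-gram; an OOV word is scored
# as the class token UNK times the probability that a character-level model
# generates its spelling.

def word_prob(w, context, vocab, trigram_prob, char_model_prob):
    if w in vocab:
        return trigram_prob(w, context)
    # P(UNK | w_{i-2} w_{i-1}) * P(spelling of w | UNK)
    return trigram_prob("<UNK>", context) * char_model_prob(w)
```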
6 Conclusion

This paper reviewed the N-gram language model, the de facto standard language model for LVCSR, described its three major problems, and surveyed recent research on language models that addresses each of them.

References

[1] (In Japanese.) 2001.
[2] (In Japanese.) Vol. 72, No. 8, pp. 1284-1290, 1989.
[3] (In Japanese.) 1999.
[4] C. E. Shannon. A mathematical theory of communication. Bell System Technical Journal, pp. 623-656, 1948.
[5] (In Japanese.) Vol. 2008, No. 68, pp. 43-46, 2008.
[6] R. Rosenfeld. Two decades of statistical language modeling: Where do we go from here? Proceedings of the IEEE, Vol. 88, No. 8, pp. 1270-1278, 2000.
[7] (In Japanese.) Vol. 5, pp. 1-21, 2010.
[8] R. Kuhn and R. De Mori. A cache-based natural language model for speech recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 12, No. 6, pp. 570-583, 1990.
[9] R. Rosenfeld. A maximum entropy approach to adaptive statistical language modelling. Computer Speech and Language, Vol. 10, No. 3, pp. 187-228, 1996.
[10] R. Kneser. Statistical language modeling using a variable context length. In Proc. ICSLP '96, Vol. 1, pp. 494-497. IEEE, 1996.
[11] C. Chelba and F. Jelinek. Recognition performance of a structured language model. arXiv preprint cs/0001022, 2000.
[12] (In Japanese.) pp. 89-94, 2007.
[13] (In Japanese.) Vol. 50, No. 2, pp. 469-476, 2009.
[14] (In Japanese.) 2006.
[15] (In Japanese.) IPSJ SIG-SLP technical report, Vol. 2005, No. 69, pp. 13-18, 2005.
[16] A. Berger and R. Miller. Just-in-time language modelling. In Proc. ICASSP '98, Vol. 2, pp. 705-708. IEEE, 1998.
[17] D. Vaufreydaz, M. Akbar, and J. Rouillard. Internet documents: A rich source for spoken language modeling. In Proc. ASRU '99 Workshop, pp. 277-280, 1999.
[18] T. Hofmann. Probabilistic latent semantic analysis. In Proc. UAI '99, pp. 289-296, 1999.
[19] D. Gildea and T. Hofmann. Topic-based language modeling using EM. In Proc. Eurospeech '99, pp. 2167-2170, 1999.
[20] (In Japanese.) IPSJ SIG-SLP technical report, Vol. 2003, No. 124, pp. 67-72, 2003.
[21] (In Japanese.) pp. 233-238, 2006.
[22] (In Japanese.) pp. 57-58, 2010.
[23] G. Salton. Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer. Addison-Wesley, Boston, MA, 1989.
[24] (In Japanese.) pp. 57-58, 2010.
[25] (In Japanese.) IEICE NLC technical report, Vol. 105, No. 494, pp. 19-24, 2005.
[26] (In Japanese.) IPSJ SIG-SLP technical report, Vol. 2007, No. 75, pp. 1-6, 2007.
[27] K. Tanigaki, H. Yamamoto, and Y. Sagisaka. A hierarchical language model incorporating class-dependent word models for OOV words recognition. In Proc. ICSLP 2000, Vol. 3, 2000.
[28] (In Japanese.) Vol. 2008, No. 3, pp. 10-16, 2008.
[29] (In Japanese.) Vol. 11, 2009.