Cross-Lingual Language Modeling for Automatic Speech Recognition


GBO Presentation
Cross-Lingual Language Modeling for Automatic Speech Recognition
November 14, 2003
Woosung Kim (woosung@cs.jhu.edu)
Center for Language and Speech Processing, Dept. of Computer Science
The Johns Hopkins University, Baltimore, MD 21218, USA

Introduction
Motivation: the success of statistical modeling techniques
  Development of modeling and automatic learning techniques
  A large amount of data is available for training
  Most resources, however, exist for English, French and German
How can we construct stochastic models in resource-deficient languages? Bootstrap from other languages, e.g.:
  Universal phone set for ASR (Schultz & Waibel, 1998; Byrne et al., 2000)
  Exploiting parallel texts to project morphological analyzers, POS taggers, etc. (Yarowsky, Ngai & Wicentowski, 2001)
  Language modeling (this talk)

Introduction
We present:
  An approach to sharpen an LM in a resource-deficient language using comparable text from resource-rich languages
  Story-specific language models built from contemporaneous text
  An integration of machine translation (MT), cross-language information retrieval (CLIR), and language modeling (LM)

Language Models
Ex1: Optical Character Recognition ("All I am a student")
Ex2: Speech Recognition
  It's [tu:] cold  ->  too
  He is [tu:]-years-old  ->  two
Many speech & NLP applications deal with ungrammatical sentences/phrases, so ungrammatical outputs need to be suppressed.
A language model assigns a probability to a given sequence of words, e.g.
  P(He is two-years-old) > P(He is too-years-old)

Language Models: the ASR Problem
ASR problem: find the word string \hat{W} = w_1, w_2, \ldots, w_n given the acoustic evidence A.
By Bayes' rule,
  \hat{W} = \arg\max_W P(W \mid A)                       (1)
          = \arg\max_W \frac{P(W) P(A \mid W)}{P(A)}     (2)
          = \arg\max_W P(W) P(A \mid W)                  (3)
where P(W) is the language model and P(A \mid W) is the acoustic model.
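To make the decomposition concrete, here is a minimal N-best rescoring sketch in Python: it picks the hypothesis maximizing log P(A|W) + log P(W), dropping the constant P(A). The hypotheses and scores are made-up illustrative numbers, not outputs of the system described in these slides.

```python
# Toy N-best rescoring: choose arg max_W P(A|W) P(W), working in log space;
# P(A) is constant across hypotheses and can be dropped.
# The log-probabilities below are invented for illustration only.
nbest = [
    # (hypothesis, log P(A|W) from the acoustic model, log P(W) from the LM)
    ("he is two years old", -120.3, -14.1),
    ("he is too years old", -119.8, -22.7),
]

def rescore(hypotheses, lm_weight=1.0):
    # Score(W) = log P(A|W) + lm_weight * log P(W)
    return max(hypotheses, key=lambda h: h[1] + lm_weight * h[2])

print(rescore(nbest)[0])  # "he is two years old": the LM outweighs the small acoustic gap
```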

Language Models: Formulation
  P(w_1, w_2, \ldots, w_n) = P(w_1) P(w_2 \mid w_1) P(w_3 \mid w_1, w_2) \cdots P(w_n \mid w_1, \ldots, w_{n-1})
                           \approx P(w_1) P(w_2 \mid w_1) \prod_{i=3}^{n} P(w_i \mid w_{i-2}, w_{i-1})   (4)
Maximum-likelihood estimates from counts:
  P(w_i \mid w_{i-2}, w_{i-1}) = \frac{N(w_{i-2}, w_{i-1}, w_i)}{N(w_{i-2}, w_{i-1})}    (trigram)   (5)
  P(w_i \mid w_{i-1}) = \frac{N(w_{i-1}, w_i)}{N(w_{i-1})}                               (bigram)    (6)
  P(w_i) = \frac{N(w_i)}{\sum_{w \in V} N(w)}                                            (unigram)   (7)
where N(w_{i-2}, w_{i-1}, w_i) denotes the number of times the trigram (w_{i-2}, w_{i-1}, w_i) appears in the training data and V is the vocabulary.
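A small sketch of the maximum-likelihood estimates in Eqs. (5)-(7) on a toy corpus; a deployed LM would additionally need smoothing and backoff for unseen n-grams, which the slides do not cover here.

```python
from collections import Counter

# Maximum-likelihood n-gram estimates from counts, as in Eqs. (5)-(7).
# The toy corpus stands in for real training text; no smoothing is applied.
corpus = "the cat sat on the mat the cat ate".split()

unigrams = Counter(corpus)
bigrams  = Counter(zip(corpus, corpus[1:]))
trigrams = Counter(zip(corpus, corpus[1:], corpus[2:]))
total = sum(unigrams.values())

def p_unigram(w):
    return unigrams[w] / total                       # N(w) / sum_w' N(w')

def p_bigram(w, w_prev):
    return bigrams[(w_prev, w)] / unigrams[w_prev]   # N(w_prev, w) / N(w_prev)

def p_trigram(w, w2, w1):
    # P(w | w2, w1) = N(w2, w1, w) / N(w2, w1)
    return trigrams[(w2, w1, w)] / bigrams[(w2, w1)]

print(p_unigram("the"), p_bigram("cat", "the"), p_trigram("sat", "the", "cat"))
```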

Language Models: Evaluation
Word Error Rate (WER): performance of ASR given the LM
  REFERENCE:  UP UPSTATE NEW YORK SOMEWHERE UH OVER OVER     HUGE AREAS
  HYPOTHESIS:    UPSTATE NEW YORK SOMEWHERE UH ALL  ALL  THE HUGE AREAS
  COR(0)/ERR(1): 1 0 0 0 0 0 1 1 1 0 0
  4 errors per 10 reference words; WER = 40%
  The ultimate measure, but expensive to compute.
Perplexity (PPL): based on the cross entropy of test data D w.r.t. the LM M
  H(P_D; P_M) = -\sum_{w \in V} P_D(w) \log_2 P_M(w)                              (8)
              = -\frac{1}{N} \sum_{i=1}^{N} \log_2 P_M(w_i \mid w_{i-2}, w_{i-1})   (9)
  PPL_M(D) = 2^{H(P_D; P_M)} = [P_M(w_1, \ldots, w_N)]^{-1/N}                     (10)
For both WER and PPL, lower is better; the two are closely correlated.
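The perplexity definition in Eqs. (9)-(10) translates directly into code. The sketch below uses a stand-in uniform model over a 51K-word vocabulary (the Chinese vocabulary size quoted later) so that it runs on its own; plugging in real trigram probabilities is the intended use.

```python
import math

# Perplexity of test data under a model M, following Eqs. (9)-(10):
# H = -(1/N) * sum_i log2 P_M(w_i | history),   PPL = 2**H.
# prob_fn is any conditional model; here a uniform stub keeps the example
# self-contained (a real LM would look up smoothed trigram probabilities).
def perplexity(words, prob_fn):
    n = len(words)
    log2_sum = 0.0
    for i, w in enumerate(words):
        p = prob_fn(w, tuple(words[max(0, i - 2):i]))  # trigram-style history
        log2_sum += math.log2(p)
    cross_entropy = -log2_sum / n
    return 2.0 ** cross_entropy

test = "he is two years old".split()
uniform = lambda w, hist: 1.0 / 51000   # uniform over a 51K vocabulary
print(perplexity(test, uniform))        # -> 51000.0, the worst-case perplexity
```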

Information Retrieval
Given a query document, find relevant documents in the collection.
Vector-based IR:
  Represent each document as a bag of words (BOW)
  Convert the BOW into a vector of term weights
  All terms are treated as independent (orthogonal)
  Measure the similarity between the query and the documents
            air  automobile  bank  car  ...
  query:  (  0       1        2     0   ... )
  doc:    (  2       0        5     3   ... )
Similarity measure: cosine similarity (normalized inner product)
  sim(d_j, q) = \frac{\sum_{i=1}^{t} w_{ij} w_{iq}}{\sqrt{\sum_{i=1}^{t} w_{ij}^2}\,\sqrt{\sum_{i=1}^{t} w_{iq}^2}}   (11)
where d_j = (w_{1j}, \ldots, w_{tj}) and q = (w_{1q}, \ldots, w_{tq}).
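A minimal bag-of-words cosine similarity following Eq. (11). Raw term counts serve as the weights w_ij here; practical systems usually apply TF-IDF weighting instead.

```python
import math
from collections import Counter

# Bag-of-words cosine similarity between a query and a document, Eq. (11).
def bow(text):
    return Counter(text.lower().split())

def cosine(q, d):
    num = sum(q[t] * d.get(t, 0) for t in q)
    den = math.sqrt(sum(v * v for v in q.values())) * \
          math.sqrt(sum(v * v for v in d.values()))
    return num / den if den else 0.0

query = bow("car bank loan")
doc   = bow("the bank approved a car loan and another loan")
print(round(cosine(query, doc), 3))
```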

Information Retrieval
Retrieval performance evaluation:
  Precision = |{Retrieved docs} ∩ {Relevant docs}| / |{Retrieved docs}|                     (12)
  Recall    = |{Retrieved docs} ∩ {Relevant docs}| / |{Relevant docs in the collection}|    (13)
Problems:
  Terms are not orthogonal; there are mismatches between queries and docs
  Polysemy (e.g., "bank": river, money, ...) and synonymy (e.g., car, automobile, vehicle, ...) hurt precision and recall
  Latent Semantic Analysis (LSA) has been proposed to address this
Cross-Lingual IR: the query and the documents are in different languages
  Query-translation approach
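Eqs. (12)-(13) as a few lines of Python, using hypothetical document IDs purely for illustration:

```python
# Precision and recall for a single retrieval run, Eqs. (12)-(13).
retrieved = {"d1", "d2", "d3", "d4"}
relevant  = {"d2", "d4", "d7"}

hits = retrieved & relevant
precision = len(hits) / len(retrieved)   # 2/4 = 0.5
recall    = len(hits) / len(relevant)    # 2/3 ~ 0.667
print(precision, recall)
```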

Cross-Lingual LM for ASR
[System diagram: A Mandarin story is decoded by the baseline ASR system (Chinese acoustic model, Chinese dictionary/vocabulary, baseline Chinese language model) to produce an automatic transcription d_i^C. Using translation lexicons, cross-language information retrieval finds the contemporaneous English article d_i^E aligned with the Mandarin story, giving \hat{P}(e \mid d_i^E). Statistical machine translation probabilities P_T(c \mid e) then yield the cross-language unigram model P_{CL-unigram}(c \mid d_i^E).]

Model Estimation
Assume the document correspondence d_i^E ↔ d_i^C is known. For Chinese test document d_i^C,
  P_{CL-unigram}(c \mid d_i^E) = \sum_{e \in E} P_T(c \mid e)\, \hat{P}(e \mid d_i^E), \quad c \in C   (14)
Cross-language LM construction:
  Build story-specific cross-language LMs P(c \mid d_i^E)
  Linearly interpolate with the baseline trigram LM:
  P_{CL-interpolated}(c_k \mid c_{k-1}, c_{k-2}, d_i^E) = \lambda P_{CL-unigram}(c_k \mid d_i^E) + (1 - \lambda) P(c_k \mid c_{k-1}, c_{k-2})   (15)
  λ is optimized to minimize the perplexity of held-out data via the EM algorithm.
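A sketch of Eqs. (14)-(15) with toy distributions. The translation table, document unigram, and held-out words below are invented for illustration, and λ is tuned here by a simple grid search on held-out perplexity rather than by the EM procedure the slides refer to.

```python
import math

# Build the story-specific cross-language unigram (Eq. 14) and interpolate it
# with the baseline trigram (Eq. 15), picking lambda on held-out perplexity.
def cl_unigram(p_t, p_e_given_doc):
    # P_CL-unigram(c | d^E) = sum_e P_T(c | e) * P(e | d^E)
    out = {}
    for e, p_e in p_e_given_doc.items():
        for c, p_c in p_t.get(e, {}).items():
            out[c] = out.get(c, 0.0) + p_c * p_e
    return out

def interpolate(lam, p_cl, p_trigram):
    return lambda c, hist: lam * p_cl.get(c, 0.0) + (1 - lam) * p_trigram(c, hist)

# Toy translation table P_T(c|e) and English-document unigram P(e|d^E).
p_t = {"economy": {"经济": 0.7, "金融": 0.3}, "growth": {"增长": 0.9, "经济": 0.1}}
p_e_doc = {"economy": 0.6, "growth": 0.4}
p_cl = cl_unigram(p_t, p_e_doc)

p_tri = lambda c, hist: 0.001               # stand-in baseline trigram probability
heldout = ["经济", "增长", "金融"]

def ppl(model):
    return 2 ** (-sum(math.log2(model(c, ())) for c in heldout) / len(heldout))

best = min((ppl(interpolate(l, p_cl, p_tri)), l) for l in [i / 10 for i in range(1, 10)])
print("best lambda:", best[1], "held-out PPL:", round(best[0], 1))
```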

Model Estimation
Document correspondence is obtained by CLIR: for each Chinese test document d_i^C, create an English BOW and find the English document with the highest cosine similarity:
  d_i^E = \arg\max_{d_j^E \in D^E} \; sim_{CL}\big(P(e \mid d_i^C), \hat{P}(e \mid d_j^E)\big)   (16)
Estimation of P_T(c \mid e) and P_T(e \mid c):
  GIZA++: statistical MT tool based on IBM Model 4
  Input: Hong Kong News Chinese-English sentence-aligned parallel corpus (18K docs, 200K sents, 4M wds per side)
  We only need the translation tables P_T(e \mid c) and P_T(c \mid e)
Alternatives to the translation tables:
  Mutual Information-based CL-triggers
  CL-LSA (Latent Semantic Analysis)
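A sketch of the alignment step in Eq. (16), assuming a toy translation table P_T(e|c) is available to turn the Chinese transcription into an English bag of words before the cosine comparison; the articles and probabilities are invented.

```python
import math

# Map the Chinese transcription into an English bag of words via P_T(e|c),
# then pick the contemporaneous English article with the highest cosine
# similarity (Eq. 16). All data below are toy examples.
p_t_e_given_c = {"经济": {"economy": 0.8, "economic": 0.2},
                 "增长": {"growth": 1.0}}

def english_bow_from_chinese(chinese_words):
    bow = {}
    for c in chinese_words:
        for e, p in p_t_e_given_c.get(c, {}).items():
            bow[e] = bow.get(e, 0.0) + p
    return bow

def cosine(a, b):
    num = sum(a[t] * b.get(t, 0.0) for t in a)
    den = math.sqrt(sum(v * v for v in a.values())) * \
          math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

chinese_doc = ["经济", "增长", "经济"]
english_articles = {
    "art1": {"economy": 3.0, "growth": 2.0, "china": 1.0},
    "art2": {"football": 4.0, "match": 2.0},
}
query = english_bow_from_chinese(chinese_doc)
best = max(english_articles, key=lambda k: cosine(query, english_articles[k]))
print(best)   # -> art1
```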

Cross-Lingual Lexical Triggers: Identification
Monolingual triggers: e.g., "either ... or".
Cross-lingual setting: trigger pairs (e, c) play the role of translation lexicons; they are identified by average mutual information I(e; c). Let
  P(e, c) = \frac{\#d(e, c)}{N}, \qquad P(e, \bar{c}) = \frac{\#d(e, \bar{c})}{N}   (17)
where \#d(e) denotes the number of English articles in which e occurs, N is the number of aligned article pairs, and let
  P(e) = \frac{\#d(e)}{N}, \qquad P(c \mid e) = \frac{P(e, c)}{P(e)}   (18)
  I(e; c) = P(e, c) \log\frac{P(c \mid e)}{P(c)} + P(e, \bar{c}) \log\frac{P(\bar{c} \mid e)}{P(\bar{c})} + P(\bar{e}, c) \log\frac{P(c \mid \bar{e})}{P(c)} + P(\bar{e}, \bar{c}) \log\frac{P(\bar{c} \mid \bar{e})}{P(\bar{c})}   (19)
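Eq. (19) is the standard average mutual information over the four occurrence/non-occurrence cells. A sketch from document-level co-occurrence counts (toy numbers, not corpus statistics; natural log is used, which only changes the scale and cancels in the normalization of Eq. 20):

```python
import math

# Average mutual information I(e; c) over aligned document pairs, Eq. (19),
# computed from document co-occurrence counts as in Eqs. (17)-(18).
def avg_mutual_information(n_e, n_c, n_ec, n_docs, eps=1e-12):
    p = {
        (1, 1): n_ec / n_docs,
        (1, 0): (n_e - n_ec) / n_docs,
        (0, 1): (n_c - n_ec) / n_docs,
        (0, 0): (n_docs - n_e - n_c + n_ec) / n_docs,
    }
    p_e = {1: n_e / n_docs, 0: 1 - n_e / n_docs}
    p_c = {1: n_c / n_docs, 0: 1 - n_c / n_docs}
    mi = 0.0
    for (ei, ci), p_joint in p.items():
        if p_joint > eps:
            # P(e,c) * log( P(c|e) / P(c) ) == P(e,c) * log( P(e,c) / (P(e)P(c)) )
            mi += p_joint * math.log(p_joint / (p_e[ei] * p_c[ci]))
    return mi

# Toy counts: e occurs in 40 English articles, c in 35 Chinese articles,
# and they co-occur in 30 of the 1000 aligned pairs: a strong trigger pair.
print(round(avg_mutual_information(40, 35, 30, 1000), 4))
```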

Cross-Lingual Lexical Triggers: Estimation
We estimate the trigger-based cross-language unigram probabilities as
  P_{Trig}(c \mid e) = \frac{I(e; c)}{\sum_{c' \in C} I(e; c')}   (20)
Analogous to (14),
  P_{Trig-unigram}(c \mid d_i^E) = \sum_{e \in E} P_{Trig}(c \mid e)\, \hat{P}(e \mid d_i^E)   (21)
Again, we build the interpolated model
  P_{Trig-interpolated}(c_k \mid c_{k-1}, c_{k-2}, d_i^E) = \lambda P_{Trig-unigram}(c_k \mid d_i^E) + (1 - \lambda) P(c_k \mid c_{k-1}, c_{k-2})   (22)

Latent Semantic Analysis for CLIR
Singular Value Decomposition (SVD) of the parallel corpus:
  W \approx U S V^T
where W is the M × N word-document frequency matrix whose j-th column stacks the English article d_j^E over its aligned Chinese article d_j^C, U is M × R, S is R × R, and V^T is R × N.
  Input: word-document frequency matrix W
  Reduce the dimension to a smaller but adequate subspace
  SVD yields U, S, and V
  S: diagonal matrix with entries \sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_k (k \ge R)
  Remove noisy entries by setting \sigma_i = 0 for i > R

Latent Semantic Analysis for CLIR
Folding-in a monolingual corpus:
  Given a monolingual corpus W (on either the English or the Chinese side, with zeros in the rows of the other language)
  Use the same matrices U and S
  Project into the low-dimensional space: V = S^{-1} U^T W
  Compare a query and a document in the reduced-dimensional space
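A sketch of the truncated SVD plus folding-in step (v = S^{-1} U^T w) using NumPy, with a random toy matrix in place of the real bilingual word-document counts:

```python
import numpy as np

# LSA for CLIR: truncated SVD of the stacked bilingual word-document matrix,
# then folding-in of a new (possibly monolingual, zero-padded) document.
rng = np.random.default_rng(0)
M, N, R = 50, 20, 5                       # vocab size (both languages), docs, rank
W = rng.poisson(1.0, size=(M, N)).astype(float)   # toy term counts

U, s, Vt = np.linalg.svd(W, full_matrices=False)
U_r, S_r = U[:, :R], np.diag(s[:R])       # keep the R largest singular values

def fold_in(w_col):
    # Project a document vector into the R-dimensional latent space: v = S^{-1} U^T w.
    return np.linalg.inv(S_r) @ U_r.T @ w_col

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

query_doc = rng.poisson(1.0, size=M).astype(float)
v_query = fold_in(query_doc)
v_docs = [fold_in(W[:, j]) for j in range(N)]     # training documents in latent space
best = max(range(N), key=lambda j: cos(v_query, v_docs[j]))
print("closest training document:", best)
```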

Training and Test Corpora
Acoustic model training: HUB4-NE Mandarin training data (96K wds), 10 hours
Chinese monolingual language model training:
  XINHUA: 13M wds
  HUB4-NE: 96K wds
ASR test set: NIST HUB4-NE test data (F0 portion only), 1263 sents, 9.8K wds (1997-1998)
English CLIR corpus: NAB-TDT
  NAB (1997 LA, WP) + TDT-2 (1998 APW, NYT)
  45K docs, 30M wds

ASR Experimental Results
Vocabulary: 51K Chinese words; 300-best list rescoring.
Oracle best/worst WER: 33.4%/94.4% for Xinhua and 39.7%/95.5% for HUB4-NE.

  Corpus     Language Model      Perp    WER     CER     p-value
  Xinhua     Trigram             426     49.9%   28.8%   -
             Trig-interpolated   367     49.1%   28.6%   0.004
             LSA-interpolated    364     49.3%   28.9%   0.043
             CL-interpolated     346     48.8%   28.4%   < 0.001
  HUB4-NE    Trigram             1195    60.1%   44.1%   -
             Trig-interpolated   727     58.8%   43.3%   < 0.001
             LSA-interpolated    695     58.6%   43.1%   < 0.001
             CL-interpolated     630     58.8%   43.1%   < 0.001

Conclusions
The approach exploits side information from contemporaneous articles in a resource-rich language, which is useful for resource-deficient languages.
  Statistically significant improvements in ASR WER
  With CL-triggers & CL-LSA, a document-aligned comparable corpus suffices rather than a sentence-aligned parallel corpus
Future work:
  Extensions to higher-order n-grams (e.g., bigrams)
  Discriminative LMs; word sense disambiguation for story-specific translation models
  Applications to other languages (e.g., Arabic) and other tasks (e.g., MT)