Cross-Lingual Language Modeling for Automatic Speech Recogntion

Size: px

Start display at page:

Download "Cross-Lingual Language Modeling for Automatic Speech Recogntion"

Morgan Warner
5 years ago
Views:

1 GBO Presentation Cross-Lingual Language Modeling for Automatic Speech Recogntion November 14, 2003 Woosung Kim Center for Language and Speech Processing Dept. of Computer Science The Johns Hopkins University, Baltimore, MD 21218, USA GBO : Nov. 14, 2003 Cross-Lingual Language Modeling for ASR p. 1/19

2 Introduction Motivation : Success of statistical modeling techniques Development of modeling and automatic learning techniques A large amount of data for training is available Most resources on English, French and German How to construct stochastic models in resource-deficient languages? GBO : Nov. 14, 2003 Cross-Lingual Language Modeling for ASR p. 2/19

3 Introduction Motivation : Success of statistical modeling techniques Development of modeling and automatic learning techniques A large amount of data for training is available Most resources on English, French and German How to construct stochastic models in resource-deficient languages? Bootstrap from other languages, e.g. Universal phone-set for ASR (Schultz & Waibel, 98, Byrne et al, 00) Exploit parallel texts to project morphological analyzers, POS taggers, etc. (Yarowsky, Ngai & Wicentowski, 01) Language modeling (this talk) GBO : Nov. 14, 2003 Cross-Lingual Language Modeling for ASR p. 2/19

4 Introduction We present: An approach to sharpen an LM in a resource-deficient language using comparable text from resource-rich languages GBO : Nov. 14, 2003 Cross-Lingual Language Modeling for ASR p. 3/19

5 Introduction We present: An approach to sharpen an LM in a resource-deficient language using comparable text from resource-rich languages Story-specific language models from contemporaneous text GBO : Nov. 14, 2003 Cross-Lingual Language Modeling for ASR p. 3/19

6 Introduction We present: An approach to sharpen an LM in a resource-deficient language using comparable text from resource-rich languages Story-specific language models from contemporaneous text Integration of machine translation (MT), cross-language information retrieval (CLIR), and language modeling (LM) GBO : Nov. 14, 2003 Cross-Lingual Language Modeling for ASR p. 3/19

7 Langauge Models Ex1: Optical Character Recognition All Iam a student GBO : Nov. 14, 2003 Cross-Lingual Language Modeling for ASR p. 4/19

8 Langauge Models Ex1: Optical Character Recognition Ex2: Speech Recognition All I am a student It s [tu:] cold He is [tu:]-years-old GBO : Nov. 14, 2003 Cross-Lingual Language Modeling for ASR p. 4/19

9 Langauge Models Ex1: Optical Character Recognition Ex2: Speech Recognition All I am a student It s [tu:] cold too He is [tu:]-years-old two Many speech & NLP applications deal with ungrammatical sentences/phrases Need to suppress ungrammatical outputs Assign a probability to given a sequence of word strings P( He is two-years-old ) P( He is too-years-old ) GBO : Nov. 14, 2003 Cross-Lingual Language Modeling for ASR p. 4/19

10 Langauge Models: ASR problem ASR problem : Finding word string Ŵ = w 1, w 2,, w n given acoustic evidence A Bayes Rule Ŵ = arg max W P(W A) = P(W)P(A W) P(A) Ŵ = arg max W P(W A) (1) (2) P(W)P(A W) (3) GBO : Nov. 14, 2003 Cross-Lingual Language Modeling for ASR p. 5/19

11 Langauge Models: ASR problem ASR problem : Finding word string Ŵ = w 1, w 2,, w n given acoustic evidence A Bayes Rule Ŵ = arg max W P(W A) = P(W)P(A W) P(A) Ŵ = arg max W P(W A) (1) (2) P(W)P(A W) (3) Language Model Acoustic Model GBO : Nov. 14, 2003 Cross-Lingual Language Modeling for ASR p. 5/19

12 Langauge Models: Formulation P(w 1, w 2,, w n ) = P(w 1 )P(w 2 w 1 )P(w 3 w 1, w 2 ) P(w n w 1,, w n 1 ) P(w 1 )P(w 2 w 1 )P(w 3 w 1, w 2 ) P(w n w n 2, w n 1 ) n = P(w 1 )P(w 2 w 1 ) P(w i w i 2, w i 1 ) (4) i=3 P(w i w i 2, w i 1 ) = N(w i 2, w i 1, w i ) N(w i 2, w i 1 ) P(w i w i 1 ) = N(w i 1, w i ) N(w i 1 ) P(w i ) = = Trigram (5) = Bigram (6) N(w i ) w V N(w) = Unigram (7) where N(w i 2, w i 1, w i ) denotes the number of times entry (w i 2, w i 1, w i ) appears in the training data and V is the vocabulary. GBO : Nov. 14, 2003 Cross-Lingual Language Modeling for ASR p. 6/19

13 Langauge Models: Evaluation Word Error Rate (WER): Performace of ASR given LM REFERENCE: UP UPSTATE NEW YORK SOMEWHERE UH OVER OVER HUGE AREAS HYPOTHESIS: UPSTATE NEW YORK SOMEWHERE UH ALL ALL THE HUGE AREAS COR(0)/ERR(1): :4 errors per 10 words in reference; WER = 40% Ultimate measure, but expensive to compute Perplexity (PPL): based on cross entropy of test data D w.r.t. LM M H(P D ; P M ) = w V P D (w) log 2 P M (w) (8) = 1 N N log 2 P M (w i w i 2, w i 1 ) (9) i=1 PPL M (D) = 2 H(P D;P M ) = [P M (w 1,, w N )] 1 N (10) For both of WER and PPL, the lower, the better! Close correlation between WER and PPL GBO : Nov. 14, 2003 Cross-Lingual Language Modeling for ASR p. 7/19

14 Information Retrieval Given a query doc, find relevant docs from the collection Vector-based IR: Represent each doc as a Bag-Of-Words (BOW) Convert the BOW into the vector of terms All terms are considered as independent (orthogonal) Measure similarity between a query and docs air automobile bank car... query : ( ) doc : ( ) Similarity measure: cosine similarity inner product sim( d j, q) = t t i=1 w ij w iq i=1 w2 ij t i=1 w2 iq (11) where d j = (w 1j,, w tj ) and q = (w 1q,, w tq ) GBO : Nov. 14, 2003 Cross-Lingual Language Modeling for ASR p. 8/19

15 Information Retrieval Retrieval performance evaluation: Precision = Recall = No. {Retrieved docs Relevant docs} No. Retrieved docs No. {Retrieved docs Relevant docs} No. Relevant docs (in the collection) (12) (13) Problems Terms are not orthgonal Mismatches between queries and docs Polysemy: e.g., bank (❶ river, ❷ money, ) recall Synonymy: e.g., car, automobile, vehicle, precision Latent Semantic Analysis (LSA) has been proposed Cross-Lingual IR: query & docs are in different languages Query translation approach GBO : Nov. 14, 2003 Cross-Lingual Language Modeling for ASR p. 9/19

16 Cross-Lingual LM for ASR Mandarin Story Automatic Speech Recognition Baseline Chinese Acoustic Model Chinese Dictionary (Vocabulary) Contemporaneous English Articles C di Automatic Transcription Baseline Chinese Language Model Translation Lexicons Cross-Language Information Retrieval d E i English Article Aligned with Mandarin Story P ˆ(e d Statistical Machine Translation P T (c e) E P CLunigram( c di ) Cross-Language Unigram Model E i ) GBO : Nov. 14, 2003 Cross-Lingual Language Modeling for ASR p. 10/19

17 Model Estimation Assume document correspondence, d E i d C i, is known for Chinese test doc d C i, P CL-unigram (c d E i ) = P T (c e) ˆP(e d E i ), c C (14) e E Cross-Language LM construction Build story-specific cross-language LMs, P(c d E i ) Linear interpolation with the baseline trigram LM P CL-interpolated (c k c k 1, c k 2, d E i ) (15) = λp CL-unigram (c k d E i ) + (1 λ)p(c k c k 1, c k 2 ) λ is optimized to minimize the PPL of heldout data via EM algorithm GBO : Nov. 14, 2003 Cross-Lingual Language Modeling for ASR p. 11/19

18 Model Estimation Document correspondence obtained by CLIR For each Chinese test doc d C i, create an English BOW Find the English doc with the highest cosine similarity d E i = arg max d E j DE sim CL (P(e d C i ), ˆP(e d E j )) (16) Estimation of P T (c e) and P T (e c) GIZA++ : statistical MT tool based on IBM model-4 Input : Hong Kong news Chinese-English sentence-aligned parallel corpus 18K docs, 200K sents, 4M wds each We only need translation tables : P T (e c) and P T (c e) Mutual Information-based CL-triggers CL-LSA (Latent Semantic Analysis) GBO : Nov. 14, 2003 Cross-Lingual Language Modeling for ASR p. 12/19

19 Cross-Lingual Lexical Triggers: Identification Monolingual Triggers : e.g. either... or Let Cross-Lingual Setting : Translation lexicons Based on Average Mutual Information (I(e; c)) P(e, c) = #d(e, c) N and P(e, c) = #d(e, c) N (17) where #d(e) denote the number of English articles in which e occurs, and let P(e) = #d(e) N and P(c e) = P(e, c) P(e) (18) I(e; c) = P(e, c) log P(c e) P(c) + P(e, c) log P( c e) P( c) + P(ē, c) log P(c ē) P(c) + P(ē, c) log P( c ē) P( c) (19) GBO : Nov. 14, 2003 Cross-Lingual Language Modeling for ASR p. 13/19

20 Cross-Lingual Lexical Triggers: Estimation We estimate the trigger-based CL unigram probs with P Trig (c e) = I(e; c) c C I(e; c ), (20) Analogous to (14), P Trig-unigram (c d E i ) = e E P Trig (c e) ˆP(e d E i ) (21) Again, we build the interpolated model P Trig-interpolated (c k c k 1, c k 2, d E i ) (22) = λp Trig-unigram (c k d E i ) + (1 λ)p(c k c k 1, c k 2 ) GBO : Nov. 14, 2003 Cross-Lingual Language Modeling for ASR p. 14/19

21 Latent Semantic Analysis for CLIR Singular Value Decomposition (SVD) of the parallel corpus d E 1 W U S V T... d E N = d C 1... d C N M N M R R R R N x x x x Input : word-document frequency matrix, W Reduce the dimension into the smaller but adequate subpace Singular Value Decomposition : U, V, and S S : diagonal matrix w/ diagonal entries σ 1,, σ k where σ 1 σ 2 σ k (k R) Remove noisy entries by setting σ i = 0 for i > R GBO : Nov. 14, 2003 Cross-Lingual Language Modeling for ASR p. 15/19

22 Latent Semantic Analysis for CLIR Singular Value Decomposition (SVD) of the parallel corpus W U S V T d E J = d C J M N M R R R R N x x x x Input : word-document frequency matrix, W Reduce the dimension into the smaller but adequate subpace Singular Value Decomposition : U, V, and S S : diagonal matrix w/ diagonal entries σ 1,, σ k where σ 1 σ 2 σ k (k R) Remove noisy entries by setting σ i = 0 for i > R GBO : Nov. 14, 2003 Cross-Lingual Language Modeling for ASR p. 15/19

23 Latent Semantic Analysis for CLIR Singular Value Decomposition (SVD) of the parallel corpus W U S V T d E J = d C J M N M R R R R N x x x x Input : word-document frequency matrix, W Reduce the dimension into the smaller but adequate subpace Singular Value Decomposition : U, V, and S S : diagonal matrix w/ diagonal entries σ 1,, σ k where σ 1 σ 2 σ k (k R) Remove noisy entries by setting σ i = 0 for i > R GBO : Nov. 14, 2003 Cross-Lingual Language Modeling for ASR p. 15/19

24 Latent Semantic Analysis for CLIR d E 1... Folding-in a monolingual corpus W U S V T d E P = M P M R R R R P x x x x Given a monolingual corpus, W, in either side Use the same matrices U, S Project into low-dimensional space, V = S 1 U 1 W Compare a query and a document in the reduced dimensional space GBO : Nov. 14, 2003 Cross-Lingual Language Modeling for ASR p. 16/19

25 Training and Test Corpora Acoustic model training HUB4-NE Mandarin training data (96K wds) 10 hours Chinese monolingual language model training XINHUA : 13M wds HUB4-NE : 96K wds ASR test set : NIST HUB4-NE test data (only F0 portion) 1263 sents, 9.8K wds ( ) English CLIR corpus : NAB-TDT NAB (1997 LA, WP) + TDT-2 (1998 APW, NYT) 45K docs, 30M wds GBO : Nov. 14, 2003 Cross-Lingual Language Modeling for ASR p. 17/19

26 ASR Experimental Results Vocab : 51K for Chinese 300-best list rescoring Oracle best/worst WER : 33.4/94.4% for Xinhua and 39.7/95.5% for HUB4-NE Language Model Perp WER CER p-value Xinhua Trigram % 28.8% Trig-interpolated % 28.6% LSA-interpolated % 28.9% CL-interpolated % 28.4% < HUB4-NE Trigram % 44.1% Trig-interpolated % 43.3% < LSA-interpolated % 43.1% <0.001 CL-interpolated % 43.1% < GBO : Nov. 14, 2003 Cross-Lingual Language Modeling for ASR p. 18/19

27 Conclusions Exploits side-information from contemporaneous articles useful for resource-deficient languages Statistically significant improvements in ASR WER Use of CL triggers & CL-LSA A document-aligned corpus suffices rather than a sentence-aligned corpus Future work Extensions to higher order N-grams (e.g., bigrams) Discriminate LMs for Word Sense Disambiguation story-specific translation models Applications to other languages (e.g., Arabic) and other tasks (e.g., MT) GBO : Nov. 14, 2003 Cross-Lingual Language Modeling for ASR p. 19/19

Chapter 3: Basics of Language Modelling

Chapter 3: Basics of Language Modelling Motivation Language Models are used in Speech Recognition Machine Translation Natural Language Generation Query completion For research and development: need a simple