Cross-Lingual Language Modeling for Automatic Speech Recognition


GBO Presentation
Cross-Lingual Language Modeling for Automatic Speech Recognition
November 14, 2003
Woosung Kim (woosung@cs.jhu.edu)
Center for Language and Speech Processing, Dept. of Computer Science
The Johns Hopkins University, Baltimore, MD 21218, USA

Introduction
Motivation: the success of statistical modeling techniques
  Development of modeling and automatic learning techniques
  A large amount of data is available for training
  Most resources, however, exist for English, French and German
How can we construct stochastic models in resource-deficient languages? Bootstrap from other languages, e.g.:
  Universal phone set for ASR (Schultz & Waibel, 1998; Byrne et al., 2000)
  Exploiting parallel texts to project morphological analyzers, POS taggers, etc. (Yarowsky, Ngai & Wicentowski, 2001)
  Language modeling (this talk)

Introduction
We present:
  An approach to sharpen an LM in a resource-deficient language using comparable text from resource-rich languages
  Story-specific language models built from contemporaneous text
  An integration of machine translation (MT), cross-language information retrieval (CLIR), and language modeling (LM)

Language Models
Ex1: Optical Character Recognition ("All I am a student")
Ex2: Speech Recognition
  It's [tu:] cold  ->  too
  He is [tu:]-years-old  ->  two
Many speech & NLP applications deal with ungrammatical sentences/phrases, so ungrammatical outputs need to be suppressed.
A language model assigns a probability to a given sequence of words, e.g.
  P(He is two-years-old) > P(He is too-years-old)

Language Models: the ASR Problem
ASR problem: find the word string \hat{W} = w_1, w_2, \ldots, w_n given the acoustic evidence A.
By Bayes' rule,
  \hat{W} = \arg\max_W P(W \mid A)                       (1)
          = \arg\max_W \frac{P(W) P(A \mid W)}{P(A)}     (2)
          = \arg\max_W P(W) P(A \mid W)                  (3)
where P(W) is the language model and P(A \mid W) is the acoustic model.
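To make the decomposition concrete, here is a minimal N-best rescoring sketch in Python: it picks the hypothesis maximizing log P(A|W) + log P(W), dropping the constant P(A). The hypotheses and scores are made-up illustrative numbers, not outputs of the system described in these slides.

```python
# Toy N-best rescoring: choose arg max_W P(A|W) P(W), working in log space;
# P(A) is constant across hypotheses and can be dropped.
# The log-probabilities below are invented for illustration only.
nbest = [
    # (hypothesis, log P(A|W) from the acoustic model, log P(W) from the LM)
    ("he is two years old", -120.3, -14.1),
    ("he is too years old", -119.8, -22.7),
]

def rescore(hypotheses, lm_weight=1.0):
    # Score(W) = log P(A|W) + lm_weight * log P(W)
    return max(hypotheses, key=lambda h: h[1] + lm_weight * h[2])

print(rescore(nbest)[0])  # "he is two years old": the LM outweighs the small acoustic gap
```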

Language Models: Formulation
  P(w_1, w_2, \ldots, w_n) = P(w_1) P(w_2 \mid w_1) P(w_3 \mid w_1, w_2) \cdots P(w_n \mid w_1, \ldots, w_{n-1})
                           \approx P(w_1) P(w_2 \mid w_1) \prod_{i=3}^{n} P(w_i \mid w_{i-2}, w_{i-1})   (4)
Maximum-likelihood estimates from counts:
  P(w_i \mid w_{i-2}, w_{i-1}) = \frac{N(w_{i-2}, w_{i-1}, w_i)}{N(w_{i-2}, w_{i-1})}    (trigram)   (5)
  P(w_i \mid w_{i-1}) = \frac{N(w_{i-1}, w_i)}{N(w_{i-1})}                               (bigram)    (6)
  P(w_i) = \frac{N(w_i)}{\sum_{w \in V} N(w)}                                            (unigram)   (7)
where N(w_{i-2}, w_{i-1}, w_i) denotes the number of times the trigram (w_{i-2}, w_{i-1}, w_i) appears in the training data and V is the vocabulary.
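A small sketch of the maximum-likelihood estimates in Eqs. (5)-(7) on a toy corpus; a deployed LM would additionally need smoothing and backoff for unseen n-grams, which the slides do not cover here.

```python
from collections import Counter

# Maximum-likelihood n-gram estimates from counts, as in Eqs. (5)-(7).
# The toy corpus stands in for real training text; no smoothing is applied.
corpus = "the cat sat on the mat the cat ate".split()

unigrams = Counter(corpus)
bigrams  = Counter(zip(corpus, corpus[1:]))
trigrams = Counter(zip(corpus, corpus[1:], corpus[2:]))
total = sum(unigrams.values())

def p_unigram(w):
    return unigrams[w] / total                       # N(w) / sum_w' N(w')

def p_bigram(w, w_prev):
    return bigrams[(w_prev, w)] / unigrams[w_prev]   # N(w_prev, w) / N(w_prev)

def p_trigram(w, w2, w1):
    # P(w | w2, w1) = N(w2, w1, w) / N(w2, w1)
    return trigrams[(w2, w1, w)] / bigrams[(w2, w1)]

print(p_unigram("the"), p_bigram("cat", "the"), p_trigram("sat", "the", "cat"))
```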

Language Models: Evaluation
Word Error Rate (WER): performance of ASR given the LM
  REFERENCE:  UP UPSTATE NEW YORK SOMEWHERE UH OVER OVER     HUGE AREAS
  HYPOTHESIS:    UPSTATE NEW YORK SOMEWHERE UH ALL  ALL  THE HUGE AREAS
  COR(0)/ERR(1): 1 0 0 0 0 0 1 1 1 0 0
  4 errors per 10 reference words; WER = 40%
  The ultimate measure, but expensive to compute.
Perplexity (PPL): based on the cross entropy of test data D w.r.t. the LM M
  H(P_D; P_M) = -\sum_{w \in V} P_D(w) \log_2 P_M(w)                              (8)
              = -\frac{1}{N} \sum_{i=1}^{N} \log_2 P_M(w_i \mid w_{i-2}, w_{i-1})   (9)
  PPL_M(D) = 2^{H(P_D; P_M)} = [P_M(w_1, \ldots, w_N)]^{-1/N}                     (10)
For both WER and PPL, lower is better; the two are closely correlated.
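The perplexity definition in Eqs. (9)-(10) translates directly into code. The sketch below uses a stand-in uniform model over a 51K-word vocabulary (the Chinese vocabulary size quoted later) so that it runs on its own; plugging in real trigram probabilities is the intended use.

```python
import math

# Perplexity of test data under a model M, following Eqs. (9)-(10):
# H = -(1/N) * sum_i log2 P_M(w_i | history),   PPL = 2**H.
# prob_fn is any conditional model; here a uniform stub keeps the example
# self-contained (a real LM would look up smoothed trigram probabilities).
def perplexity(words, prob_fn):
    n = len(words)
    log2_sum = 0.0
    for i, w in enumerate(words):
        p = prob_fn(w, tuple(words[max(0, i - 2):i]))  # trigram-style history
        log2_sum += math.log2(p)
    cross_entropy = -log2_sum / n
    return 2.0 ** cross_entropy

test = "he is two years old".split()
uniform = lambda w, hist: 1.0 / 51000   # uniform over a 51K vocabulary
print(perplexity(test, uniform))        # -> 51000.0, the worst-case perplexity
```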

Information Retrieval
Given a query document, find relevant documents in the collection.
Vector-based IR:
  Represent each document as a bag of words (BOW)
  Convert the BOW into a vector of term weights
  All terms are treated as independent (orthogonal)
  Measure the similarity between the query and the documents
            air  automobile  bank  car  ...
  query:  (  0       1        2     0   ... )
  doc:    (  2       0        5     3   ... )
Similarity measure: cosine similarity (normalized inner product)
  sim(d_j, q) = \frac{\sum_{i=1}^{t} w_{ij} w_{iq}}{\sqrt{\sum_{i=1}^{t} w_{ij}^2}\,\sqrt{\sum_{i=1}^{t} w_{iq}^2}}   (11)
where d_j = (w_{1j}, \ldots, w_{tj}) and q = (w_{1q}, \ldots, w_{tq}).
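A minimal bag-of-words cosine similarity following Eq. (11). Raw term counts serve as the weights w_ij here; practical systems usually apply TF-IDF weighting instead.

```python
import math
from collections import Counter

# Bag-of-words cosine similarity between a query and a document, Eq. (11).
def bow(text):
    return Counter(text.lower().split())

def cosine(q, d):
    num = sum(q[t] * d.get(t, 0) for t in q)
    den = math.sqrt(sum(v * v for v in q.values())) * \
          math.sqrt(sum(v * v for v in d.values()))
    return num / den if den else 0.0

query = bow("car bank loan")
doc   = bow("the bank approved a car loan and another loan")
print(round(cosine(query, doc), 3))
```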

Information Retrieval
Retrieval performance evaluation:
  Precision = |{Retrieved docs} ∩ {Relevant docs}| / |{Retrieved docs}|                     (12)
  Recall    = |{Retrieved docs} ∩ {Relevant docs}| / |{Relevant docs in the collection}|    (13)
Problems:
  Terms are not orthogonal; there are mismatches between queries and docs
  Polysemy (e.g., "bank": river, money, ...) and synonymy (e.g., car, automobile, vehicle, ...) hurt precision and recall
  Latent Semantic Analysis (LSA) has been proposed to address this
Cross-Lingual IR: the query and the documents are in different languages
  Query-translation approach
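Eqs. (12)-(13) as a few lines of Python, using hypothetical document IDs purely for illustration:

```python
# Precision and recall for a single retrieval run, Eqs. (12)-(13).
retrieved = {"d1", "d2", "d3", "d4"}
relevant  = {"d2", "d4", "d7"}

hits = retrieved & relevant
precision = len(hits) / len(retrieved)   # 2/4 = 0.5
recall    = len(hits) / len(relevant)    # 2/3 ~ 0.667
print(precision, recall)
```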

Cross-Lingual LM for ASR
[System diagram: A Mandarin story is decoded by the baseline ASR system (Chinese acoustic model, Chinese dictionary/vocabulary, baseline Chinese language model) to produce an automatic transcription d_i^C. Using translation lexicons, cross-language information retrieval finds the contemporaneous English article d_i^E aligned with the Mandarin story, giving \hat{P}(e \mid d_i^E). Statistical machine translation probabilities P_T(c \mid e) then yield the cross-language unigram model P_{CL-unigram}(c \mid d_i^E).]

Model Estimation
Assume the document correspondence d_i^E ↔ d_i^C is known. For Chinese test document d_i^C,
  P_{CL-unigram}(c \mid d_i^E) = \sum_{e \in E} P_T(c \mid e)\, \hat{P}(e \mid d_i^E), \quad c \in C   (14)
Cross-language LM construction:
  Build story-specific cross-language LMs P(c \mid d_i^E)
  Linearly interpolate with the baseline trigram LM:
  P_{CL-interpolated}(c_k \mid c_{k-1}, c_{k-2}, d_i^E) = \lambda P_{CL-unigram}(c_k \mid d_i^E) + (1 - \lambda) P(c_k \mid c_{k-1}, c_{k-2})   (15)
  λ is optimized to minimize the perplexity of held-out data via the EM algorithm.
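A sketch of Eqs. (14)-(15) with toy distributions. The translation table, document unigram, and held-out words below are invented for illustration, and λ is tuned here by a simple grid search on held-out perplexity rather than by the EM procedure the slides refer to.

```python
import math

# Build the story-specific cross-language unigram (Eq. 14) and interpolate it
# with the baseline trigram (Eq. 15), picking lambda on held-out perplexity.
def cl_unigram(p_t, p_e_given_doc):
    # P_CL-unigram(c | d^E) = sum_e P_T(c | e) * P(e | d^E)
    out = {}
    for e, p_e in p_e_given_doc.items():
        for c, p_c in p_t.get(e, {}).items():
            out[c] = out.get(c, 0.0) + p_c * p_e
    return out

def interpolate(lam, p_cl, p_trigram):
    return lambda c, hist: lam * p_cl.get(c, 0.0) + (1 - lam) * p_trigram(c, hist)

# Toy translation table P_T(c|e) and English-document unigram P(e|d^E).
p_t = {"economy": {"经济": 0.7, "金融": 0.3}, "growth": {"增长": 0.9, "经济": 0.1}}
p_e_doc = {"economy": 0.6, "growth": 0.4}
p_cl = cl_unigram(p_t, p_e_doc)

p_tri = lambda c, hist: 0.001               # stand-in baseline trigram probability
heldout = ["经济", "增长", "金融"]

def ppl(model):
    return 2 ** (-sum(math.log2(model(c, ())) for c in heldout) / len(heldout))

best = min((ppl(interpolate(l, p_cl, p_tri)), l) for l in [i / 10 for i in range(1, 10)])
print("best lambda:", best[1], "held-out PPL:", round(best[0], 1))
```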

Model Estimation
Document correspondence is obtained by CLIR: for each Chinese test document d_i^C, create an English BOW and find the English document with the highest cosine similarity:
  d_i^E = \arg\max_{d_j^E \in D^E} \; sim_{CL}\big(P(e \mid d_i^C), \hat{P}(e \mid d_j^E)\big)   (16)
Estimation of P_T(c \mid e) and P_T(e \mid c):
  GIZA++: statistical MT tool based on IBM Model 4
  Input: Hong Kong News Chinese-English sentence-aligned parallel corpus (18K docs, 200K sents, 4M wds per side)
  We only need the translation tables P_T(e \mid c) and P_T(c \mid e)
Alternatives to the translation tables:
  Mutual Information-based CL-triggers
  CL-LSA (Latent Semantic Analysis)
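A sketch of the alignment step in Eq. (16), assuming a toy translation table P_T(e|c) is available to turn the Chinese transcription into an English bag of words before the cosine comparison; the articles and probabilities are invented.

```python
import math

# Map the Chinese transcription into an English bag of words via P_T(e|c),
# then pick the contemporaneous English article with the highest cosine
# similarity (Eq. 16). All data below are toy examples.
p_t_e_given_c = {"经济": {"economy": 0.8, "economic": 0.2},
                 "增长": {"growth": 1.0}}

def english_bow_from_chinese(chinese_words):
    bow = {}
    for c in chinese_words:
        for e, p in p_t_e_given_c.get(c, {}).items():
            bow[e] = bow.get(e, 0.0) + p
    return bow

def cosine(a, b):
    num = sum(a[t] * b.get(t, 0.0) for t in a)
    den = math.sqrt(sum(v * v for v in a.values())) * \
          math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

chinese_doc = ["经济", "增长", "经济"]
english_articles = {
    "art1": {"economy": 3.0, "growth": 2.0, "china": 1.0},
    "art2": {"football": 4.0, "match": 2.0},
}
query = english_bow_from_chinese(chinese_doc)
best = max(english_articles, key=lambda k: cosine(query, english_articles[k]))
print(best)   # -> art1
```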

Cross-Lingual Lexical Triggers: Identification
Monolingual triggers: e.g., "either ... or".
Cross-lingual setting: trigger pairs (e, c) play the role of translation lexicons; they are identified by average mutual information I(e; c). Let
  P(e, c) = \frac{\#d(e, c)}{N}, \qquad P(e, \bar{c}) = \frac{\#d(e, \bar{c})}{N}   (17)
where \#d(e) denotes the number of English articles in which e occurs, N is the number of aligned article pairs, and let
  P(e) = \frac{\#d(e)}{N}, \qquad P(c \mid e) = \frac{P(e, c)}{P(e)}   (18)
  I(e; c) = P(e, c) \log\frac{P(c \mid e)}{P(c)} + P(e, \bar{c}) \log\frac{P(\bar{c} \mid e)}{P(\bar{c})} + P(\bar{e}, c) \log\frac{P(c \mid \bar{e})}{P(c)} + P(\bar{e}, \bar{c}) \log\frac{P(\bar{c} \mid \bar{e})}{P(\bar{c})}   (19)
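Eq. (19) is the standard average mutual information over the four occurrence/non-occurrence cells. A sketch from document-level co-occurrence counts (toy numbers, not corpus statistics; natural log is used, which only changes the scale and cancels in the normalization of Eq. 20):

```python
import math

# Average mutual information I(e; c) over aligned document pairs, Eq. (19),
# computed from document co-occurrence counts as in Eqs. (17)-(18).
def avg_mutual_information(n_e, n_c, n_ec, n_docs, eps=1e-12):
    p = {
        (1, 1): n_ec / n_docs,
        (1, 0): (n_e - n_ec) / n_docs,
        (0, 1): (n_c - n_ec) / n_docs,
        (0, 0): (n_docs - n_e - n_c + n_ec) / n_docs,
    }
    p_e = {1: n_e / n_docs, 0: 1 - n_e / n_docs}
    p_c = {1: n_c / n_docs, 0: 1 - n_c / n_docs}
    mi = 0.0
    for (ei, ci), p_joint in p.items():
        if p_joint > eps:
            # P(e,c) * log( P(c|e) / P(c) ) == P(e,c) * log( P(e,c) / (P(e)P(c)) )
            mi += p_joint * math.log(p_joint / (p_e[ei] * p_c[ci]))
    return mi

# Toy counts: e occurs in 40 English articles, c in 35 Chinese articles,
# and they co-occur in 30 of the 1000 aligned pairs: a strong trigger pair.
print(round(avg_mutual_information(40, 35, 30, 1000), 4))
```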

Cross-Lingual Lexical Triggers: Estimation
We estimate the trigger-based cross-language unigram probabilities as
  P_{Trig}(c \mid e) = \frac{I(e; c)}{\sum_{c' \in C} I(e; c')}   (20)
Analogous to (14),
  P_{Trig-unigram}(c \mid d_i^E) = \sum_{e \in E} P_{Trig}(c \mid e)\, \hat{P}(e \mid d_i^E)   (21)
Again, we build the interpolated model
  P_{Trig-interpolated}(c_k \mid c_{k-1}, c_{k-2}, d_i^E) = \lambda P_{Trig-unigram}(c_k \mid d_i^E) + (1 - \lambda) P(c_k \mid c_{k-1}, c_{k-2})   (22)

Latent Semantic Analysis for CLIR
Singular Value Decomposition (SVD) of the parallel corpus:
  W \approx U S V^T
where W is the M × N word-document frequency matrix whose j-th column stacks the English article d_j^E over its aligned Chinese article d_j^C, U is M × R, S is R × R, and V^T is R × N.
  Input: word-document frequency matrix W
  Reduce the dimension to a smaller but adequate subspace
  SVD yields U, S, and V
  S: diagonal matrix with entries \sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_k (k \ge R)
  Remove noisy entries by setting \sigma_i = 0 for i > R

Latent Semantic Analysis for CLIR
Folding-in a monolingual corpus:
  Given a monolingual corpus W (on either the English or the Chinese side, with zeros in the rows of the other language)
  Use the same matrices U and S
  Project into the low-dimensional space: V = S^{-1} U^T W
  Compare a query and a document in the reduced-dimensional space
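A sketch of the truncated SVD plus folding-in step (v = S^{-1} U^T w) using NumPy, with a random toy matrix in place of the real bilingual word-document counts:

```python
import numpy as np

# LSA for CLIR: truncated SVD of the stacked bilingual word-document matrix,
# then folding-in of a new (possibly monolingual, zero-padded) document.
rng = np.random.default_rng(0)
M, N, R = 50, 20, 5                       # vocab size (both languages), docs, rank
W = rng.poisson(1.0, size=(M, N)).astype(float)   # toy term counts

U, s, Vt = np.linalg.svd(W, full_matrices=False)
U_r, S_r = U[:, :R], np.diag(s[:R])       # keep the R largest singular values

def fold_in(w_col):
    # Project a document vector into the R-dimensional latent space: v = S^{-1} U^T w.
    return np.linalg.inv(S_r) @ U_r.T @ w_col

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

query_doc = rng.poisson(1.0, size=M).astype(float)
v_query = fold_in(query_doc)
v_docs = [fold_in(W[:, j]) for j in range(N)]     # training documents in latent space
best = max(range(N), key=lambda j: cos(v_query, v_docs[j]))
print("closest training document:", best)
```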

Training and Test Corpora
Acoustic model training: HUB4-NE Mandarin training data (96K wds), 10 hours
Chinese monolingual language model training:
  XINHUA: 13M wds
  HUB4-NE: 96K wds
ASR test set: NIST HUB4-NE test data (F0 portion only), 1263 sents, 9.8K wds (1997-1998)
English CLIR corpus: NAB-TDT
  NAB (1997 LA, WP) + TDT-2 (1998 APW, NYT)
  45K docs, 30M wds

ASR Experimental Results
Vocabulary: 51K Chinese words; 300-best list rescoring.
Oracle best/worst WER: 33.4%/94.4% for Xinhua and 39.7%/95.5% for HUB4-NE.

  Corpus     Language Model      Perp    WER     CER     p-value
  Xinhua     Trigram             426     49.9%   28.8%   -
             Trig-interpolated   367     49.1%   28.6%   0.004
             LSA-interpolated    364     49.3%   28.9%   0.043
             CL-interpolated     346     48.8%   28.4%   < 0.001
  HUB4-NE    Trigram             1195    60.1%   44.1%   -
             Trig-interpolated   727     58.8%   43.3%   < 0.001
             LSA-interpolated    695     58.6%   43.1%   < 0.001
             CL-interpolated     630     58.8%   43.1%   < 0.001

Conclusions
The approach exploits side information from contemporaneous articles in a resource-rich language, which is useful for resource-deficient languages.
  Statistically significant improvements in ASR WER
  With CL-triggers & CL-LSA, a document-aligned comparable corpus suffices rather than a sentence-aligned parallel corpus
Future work:
  Extensions to higher-order n-grams (e.g., bigrams)
  Discriminative LMs; word sense disambiguation for story-specific translation models
  Applications to other languages (e.g., Arabic) and other tasks (e.g., MT)