Statistical Phrase-Based Speech Translation

Statistical Phrase-Based Speech Translation
Lambert Mathias (Center for Language and Speech Processing, Department of Electrical and Computer Engineering, Johns Hopkins University) and William Byrne (Machine Intelligence Laboratory, Department of Engineering, Cambridge University)
May 5, 2006 / CLSP Student Seminar

Outline

A model-based approach to translation is easy to formulate:

Variables: Target Speech $A$, Target Sentence $t_1^J$, Source Sentence $s_1^I$

Models: $P(A \mid t_1^J)$, $P(t_1^J \mid s_1^I)$, $P(s_1^I)$
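
As a reading aid (this derivation is not spelled out on the slide), the three distributions combine into a single decision rule via Bayes' rule, assuming the acoustics depend on the source sentence only through the transcription; the max approximation over transcriptions gives the integrated rule on the next slide:

```latex
\begin{align*}
\hat{s}_1^I &= \operatorname*{argmax}_{s_1^I} P(s_1^I \mid A)
             = \operatorname*{argmax}_{s_1^I} \sum_{t_1^J} P(A \mid t_1^J)\, P(t_1^J \mid s_1^I)\, P(s_1^I) \\
            &\approx \operatorname*{argmax}_{s_1^I} P(s_1^I) \, \max_{t_1^J} P(A \mid t_1^J)\, P(t_1^J \mid s_1^I)
\end{align*}
```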

Serial architecture

Recognition: $\hat{t}_1^J = \operatorname*{argmax}_{t_1^J} P(A \mid t_1^J)\, P(t_1^J)$

Translation: $\hat{s}_1^I = \operatorname*{argmax}_{s_1^I} P(\hat{t}_1^J \mid s_1^I)\, P(s_1^I)$

Integrated architecture

$\hat{s}_1^I = \operatorname*{argmax}_{s_1^I} P(s_1^I) \left\{ \max_{t_1^J} P(t_1^J \mid s_1^I)\, P(A \mid t_1^J) \right\}$

Given the ASR models and the translation models, speech translation is easy to do!
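
A minimal runnable sketch (mine, not from the talk) contrasting the two architectures on a tiny enumerated hypothesis space; every vocabulary item and probability below is invented for illustration:

```python
# Toy contrast of serial vs. integrated speech translation decoding.
# All tables are made up; real systems search lattices, not dicts.

# P(A | t): acoustic likelihoods for two Spanish transcription hypotheses.
acoustic = {
    ("el", "gato"): 0.6,
    ("el", "pato"): 0.4,   # acoustically confusable alternative
}

# P(t): target (Spanish) language model.
lm_t = {
    ("el", "gato"): 0.5,
    ("el", "pato"): 0.5,
}

# P(t | s): translation model, and P(s): source (English) language model.
trans = {
    (("el", "gato"), ("the", "cat")): 0.9,
    (("el", "pato"), ("the", "duck")): 0.9,
    (("el", "gato"), ("the", "duck")): 0.1,
    (("el", "pato"), ("the", "cat")): 0.1,
}
lm_s = {("the", "cat"): 0.7, ("the", "duck"): 0.3}

# Serial: commit to the 1-best transcription, then translate it.
t_hat = max(acoustic, key=lambda t: acoustic[t] * lm_t[t])
s_serial = max(lm_s, key=lambda s: trans.get((t_hat, s), 0.0) * lm_s[s])

# Integrated: the max over transcriptions sits *inside* the search for
# each source sentence, so acoustic ambiguity reaches the MT component.
def integrated_score(s):
    return lm_s[s] * max(trans.get((t, s), 0.0) * acoustic[t] for t in acoustic)

s_integrated = max(lm_s, key=integrated_score)

print("serial:    ", " ".join(s_serial))
print("integrated:", " ".join(s_integrated))
```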

Recovering from ASR errors:
- Translating alternative hypotheses
- Processing the monolingual information available on the target side
- Correcting disfluencies on the target side

Coupling ASR to MT:
- 1-best transcription
- N-best lists (R. Zhang et al. 2004, V.H. Quan et al. 2005)
- Word graphs (S. Saleem et al. 2004, E. Matusov et al. 2005)

Key Idea: Maximum ASR signal transfer to the translation component

Outline

Variables: Target Speech $A$, Target Sentence $t_1^J$, Target Phrases $v_1^R$, Source Phrases $u_1^K$, Source Sentence $s_1^I$

Models and corresponding FSMs:

  $P(A \mid t_1^J)$       ->  $L$       (ASR word lattice)
  $P(t_1^J \mid v_1^R)$   ->  $\Omega$  (target phrase segmentation transducer)
  $P(v_1^R \mid u_1^K)$   ->  $\Phi$    (phrase translation, reordering transducer)
  $P(u_1^K \mid s_1^I)$   ->  $W$       (source phrase segmentation transducer)
  $P(s_1^I)$              ->  $G$       (source language model)

The final translation is given by

$\hat{s}_1^I = \operatorname*{argmax}_{s_1^I} \max_{v_1^R, u_1^K, K} \Big\{ \max_{t_1^J} \underbrace{P(A \mid t_1^J)}_{L\text{: ASR word lattice}} \; \underbrace{P(t_1^J, v_1^R, u_1^K \mid s_1^I)\, P(s_1^I)}_{\text{translation model}} \Big\}$

The translation is from a lattice of phrase sequences

Target phrase segmentation: $P(A \mid v_1^R) = \max_{t_1^J} P(v_1^R \mid t_1^J)\, P(A \mid t_1^J)$

Corresponding FSM: $Q = \Omega \circ L$

Acoustic scores retained during target segmentation

Implemented as a best-path search through the translation FSM $T = G \circ W \circ \Phi \circ Q$ (see the sketch below)

Simple formulation with minimal changes to existing models. We are now ready to translate speech!
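
A schematic of the composition pipeline in pynini (the OpenFst Python wrapper, 2.1-style API). This is my sketch, not the original implementation (the talk used the GRM/FSM libraries); the file names are hypothetical, and the transducers are assumed prebuilt with compatible symbol tables, carrying negative-log probabilities as tropical-semiring weights:

```python
# Sketch: realize the best-path search through T = G o W o Phi o Q.
import pynini

L     = pynini.Fst.read("lattice.fst")  # ASR word lattice, P(A | t)
Omega = pynini.Fst.read("omega.fst")    # target phrase segmentation
Phi   = pynini.Fst.read("phi.fst")      # phrase translation / reordering
W     = pynini.Fst.read("w.fst")        # source phrase segmentation
G     = pynini.Fst.read("g.fst")        # source language model, P(s)

# Q = Omega o L: a lattice of target phrase sequences whose path weights
# retain the acoustic scores from L.
Q = pynini.compose(Omega, L)

# T = G o W o Phi o Q: the full translation FST. In the tropical
# semiring the shortest path is the highest-probability translation.
T = pynini.compose(G, pynini.compose(W, pynini.compose(Phi, Q)))

best = pynini.shortestpath(T)
# The source-language hypothesis sits on the input side (G's words).
print(best.project("input").string())
```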

NOTE

Original Problem: How to translate ASR word lattices?
New Problem: How to efficiently extract phrases from ASR lattices?

Phrases are extracted using the GRM Library

Controlling ambiguity in phrase extraction: pruning the ASR word lattice

Extract phrases under the posterior distribution $P_Q$ (a toy example follows):

$P_Q = P(v_1^R \mid A) = \sum_{t_1^J} \frac{P(v_1^R \mid t_1^J)\, P(A \mid t_1^J)\, P(t_1^J)}{P(A)}$

The target LM $P(t_1^J)$ does not show up in the original formulation!
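
As a toy illustration of extraction under the posterior, the sketch below enumerates the paths of a tiny explicit lattice and sums path posteriors per phrase. Real systems compute this with forward-backward over the lattice; all words and scores here are invented:

```python
# Toy phrase-posterior extraction from an explicitly enumerated lattice.
from collections import defaultdict

# Each lattice path: (words, P(A|t) * P(t)) up to a shared constant.
paths = [
    (("por", "favor"), 0.5),
    (("por", "fabor"), 0.2),   # misrecognition hypothesis
    (("con", "favor"), 0.3),
]

evidence = sum(score for _, score in paths)  # P(A) up to the same constant

# Posterior of every phrase (contiguous word subsequence) up to length 2.
posterior = defaultdict(float)
for words, score in paths:
    seen = set()
    for i in range(len(words)):
        for j in range(i + 1, min(i + 3, len(words) + 1)):
            seen.add(words[i:j])
    for phrase in seen:        # count each phrase once per path
        posterior[phrase] += score / evidence

for phrase, p in sorted(posterior.items(), key=lambda kv: -kv[1]):
    print(" ".join(phrase), round(p, 3))
```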

Well-formedness of the target sentence in text-based MT:
- Weak translation models
- Need to choose the right $t_1^J$ from a set of hypotheses
- Experimentally shown to improve translation quality [1]
- Need to correctly incorporate the target LM (in progress...)

[1] D. Déchelotte, H. Schwenk, J.-L. Gauvain, O. Galibert, and L. Lamel. Investigating Translation of Parliament Speeches. ASRU, 2005.

A neat trick to include the target LM $P(t_1^J)$ [2]:

$\hat{s}_1^I = \operatorname*{argmax}_{s_1^I} \left\{ \max_{t_1^J} P(t_1^J, s_1^I)\, P(A \mid t_1^J) \right\}$, with $P(t_1^J, s_1^I) = \prod_{j=1}^{J} P(t_j, s_j \mid t_{j-m}^{j-1}, s_{j-m}^{j-1})$

- Involves a complicated procedure to estimate the m-gram tuple-based model
- Considers only a single segmentation of the parallel text

So what's new in our framework:
- A unified modeling framework for the underlying ASR and SMT systems
- A different model parameterization
- A direct extension of the text-based MT system, with a straightforward implementation

[2] E. Matusov, S. Kanthak, and H. Ney. On the integration of speech recognition and statistical machine translation. Interspeech, 2005.
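
A minimal sketch of the bilingual tuple m-gram idea from [2]: the joint sentence-pair probability is a product of tuple probabilities conditioned on the previous m tuples. The hand-filled table below stands in for a real estimated model; all entries and names are mine:

```python
# Toy tuple m-gram model: P(t_1^J, s_1^I) = prod_j P((t_j, s_j) | history).
import math

# Keys: (history of previous tuples, current tuple); m = 1 here.
tuple_lm = {
    ((), ("el", "the")): 0.4,
    ((("el", "the"),), ("gato", "cat")): 0.5,
}

def joint_log_prob(tuples, m=1):
    """log P(t_1^J, s_1^I) under the tuple m-gram model."""
    logp = 0.0
    for j, tup in enumerate(tuples):
        history = tuple(tuples[max(0, j - m):j])
        logp += math.log(tuple_lm.get((history, tup), 1e-9))  # crude floor
    return logp

print(joint_log_prob([("el", "the"), ("gato", "cat")]))
```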

Outline

OpenLab 2006 EPPS Spanish-to-English task, 2005 TC-STAR evaluation data

Monotone phrase order:

  Spanish Source           DEV     EVAL
  Verbatim Transcription    -      44.16
  ASR 1-Best               44.48   39.71
  ASR lattice              44.74   39.93

How many new foreign phrases does speech translation introduce?

                      Verbatim        ASR      ASR
                      transcription   1-best   pruned
  #Spanish phrases    58438           96991    163065

How many new foreign phrases were found in the bitext?

  #Spanish phrases    24511           24481    33138

Outline

Presented a generative model of speech-to-text translation [3]
- Tight coupling of the ASR and SMT models via word lattices
- Initial results on speech translation a little disappointing

Modeling problems:
- Proper integration of the target LM
- Phrases extracted from ASR lattices are not in the bitext
- ASR errors
- Disfluencies, silences, and other spoken language phenomena

ASR output error correction: disfluency removal, inserting phrase boundaries, SU detection, etc.

[3] L. Mathias and W. Byrne. Statistical phrase-based speech translation. ICASSP, 2006.