Speech Translation: from Single-best to N-Best to Lattice Translation. Ruiqiang ZHANG, Genichiro KIKUI. Spoken Language Communication Laboratories

2 Speech Translation Structure
Single-best only: X → ASR → single-best J → MT → E
N-best hypothesis translation: X → ASR → N-best J → MT → E
Word lattice translation: X → ASR → word lattice → WLT → E

3 References
Ney (ICASSP 1999). Speech translation: coupling of recognition and translation.
Casacuberta (2002). Architectures for speech-to-speech translation using finite-state transducers.
Zhang (COLING 2004). A unified approach in speech-to-speech translation.
Saleem (ICSLP 2004). Using word lattice information for a tight coupling in speech translation systems.
Matusov (Eurospeech 2005). On the integration of speech recognition and SMT.
Bozarov, Zhang (Eurospeech 2005). Speech translation by confidence measure.

4 Outline
N-best translation
Word lattice translation
IWSLT 2005 evaluation
Conclusions

5 N-best Hypothesis Translation
X → ASR → J_1, J_2, …, J_N → SMT → E_{1,1}, …, E_{1,K}; E_{2,1}, …, E_{2,K}; …; E_{N,1}, …, E_{N,K} → Rescore → E
J_1, J_2, …, J_N: N-best speech recognition hypotheses
E_{n,1}, E_{n,2}, …, E_{n,K}: K-best translation hypotheses produced from J_n
Rescore: rescore all N×K translations

6 Rescore: Integration of ASR and SMT
Statistical theory, with approximations:
$\hat{E}, \hat{J} = \arg\max_{E,J} \{\, P(E)\, P(X \mid J)\, P(J \mid E) \,\}$

7 Rescore: Log-linear Models
$\hat{E} = \arg\max_{E} \sum_{m=1}^{M} \lambda_m \log P_m(X, E)$
$\log P_m(X, E)$: m-th feature (in log value); $\lambda_m$: weight of each feature
$P(E \mid X) = \dfrac{\exp\big(\sum_{m=1}^{M} \lambda_m f_m(X, E)\big)}{\sum_{E'} \exp\big(\sum_{m=1}^{M} \lambda_m f_m(X, E')\big)}$
where the denominator sums over all possible translation hypotheses $E'$.
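To make the rescoring step concrete, here is a minimal Python sketch of log-linear rescoring over the N×K candidate pool. It is illustrative only: the feature names, weights, and scores below are invented rather than the feature set used in the system, and the function names (loglinear_score, rescore) are ours.

```python
def loglinear_score(features, weights):
    """Weighted sum of log-domain feature values: sum_m lambda_m * log P_m(X, E)."""
    return sum(weights[name] * value for name, value in features.items())

def rescore(candidates, weights):
    """Rescore all N x K translation candidates and return the highest-scoring one.

    `candidates` is a list of (translation, features) pairs, where `features`
    maps feature names to log-probability values (hypothetical names below).
    """
    return max(candidates, key=lambda c: loglinear_score(c[1], weights))

# Hypothetical weights and feature values, for illustration only.
weights = {"asr_am": 0.3, "asr_lm": 0.2, "tm": 1.0, "target_lm": 0.8}
candidates = [
    ("could i get a job",
     {"asr_am": -42.1, "asr_lm": -8.3, "tm": -12.5, "target_lm": -9.7}),
    ("could you call an ambulance",
     {"asr_am": -43.0, "asr_lm": -7.9, "tm": -10.2, "target_lm": -8.8}),
]
print(rescore(candidates, weights)[0])
```

With tuned weights (slide 8), the same scoring function is simply applied to every one of the N×K hypotheses and the argmax is returned.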

8 Parameter Optimization
Objective function: $\hat{\lambda}_1^M = \operatorname{optimize}_{\lambda}\, D(R_s, E_s)$ over the development data
$E_s$: translation output after log-linear model rescoring
$R_s$: English reference sentences (16 reference sentences per test sentence)
$D(R_s, E_s)$: automatic translation quality metrics: BLEU, NIST, mWER and mPER

9 Translation Assessment $D(R_s, E_s)$
N-gram methods:
BLEU: a weighted geometric mean of the n-gram matches between test and reference sentences, plus a short-sentence penalty
NIST: an arithmetic mean of the n-gram matches between test and reference sentences
Word error rates:
mWER: multiple-reference word error rate
mPER: multiple-reference position-independent word error rate
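As a worked illustration of the error-rate metrics, the sketch below computes a multiple-reference WER. It assumes one common formulation (minimum word-level edit distance over the references, normalized by the length of the closest reference); the exact normalization used in the evaluation may differ, and the helper names are ours.

```python
def edit_distance(hyp, ref):
    """Word-level Levenshtein distance between two token lists."""
    d = [[0] * (len(ref) + 1) for _ in range(len(hyp) + 1)]
    for i in range(len(hyp) + 1):
        d[i][0] = i
    for j in range(len(ref) + 1):
        d[0][j] = j
    for i in range(1, len(hyp) + 1):
        for j in range(1, len(ref) + 1):
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution / match
    return d[len(hyp)][len(ref)]

def mwer(hypothesis, references):
    """Multiple-reference WER: edit distance to the closest reference,
    normalized by that reference's length (one common formulation)."""
    hyp = hypothesis.split()
    best = min(references, key=lambda r: edit_distance(hyp, r.split()))
    return edit_distance(hyp, best.split()) / len(best.split())

print(mwer("could you call an ambulance",
           ["could you call an ambulance please", "please call an ambulance"]))
```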

10 Optimization: Direction Set Methods
[Flowchart: start from an initial λ, run a local optimization of $D(R_s, E_s)$ along the current direction, change direction, and restart from new initial λ values; keep each local λ and finally the best λ.]
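The following sketch shows the shape of such a weight search. It is a simplified, coordinate-wise stand-in for the direction-set (Powell-style) method on the slide, not the actual optimizer: the candidate format matches the rescoring sketch above, and `metric`, the step grid, and the restart counts are illustrative assumptions.

```python
import random

def optimize_lambdas(candidates_per_sentence, references, metric,
                     n_restarts=5, n_passes=3):
    """Coordinate-wise search over feature weights, a simplified stand-in for
    the direction-set optimization of slide 10.

    candidates_per_sentence: one candidate list per dev sentence, each entry a
        list of (translation, feature_dict) pairs as in the rescoring sketch.
    metric(hyps, references) -> float quality score to maximize (e.g. corpus BLEU).
    """
    feature_names = list(next(iter(candidates_per_sentence))[0][1].keys())

    def decode(weights):
        # Pick the highest-scoring candidate per sentence under the current weights.
        return [max(cands, key=lambda c: sum(weights[f] * v for f, v in c[1].items()))[0]
                for cands in candidates_per_sentence]

    best_weights, best_score = None, float("-inf")
    for _ in range(n_restarts):                      # change initial lambda
        weights = {f: random.uniform(-1, 1) for f in feature_names}
        for _ in range(n_passes):                    # change direction (here: one axis at a time)
            for f in feature_names:
                grid = [weights[f] + step for step in (-0.5, -0.1, 0.0, 0.1, 0.5)]
                weights[f] = max(grid, key=lambda x: metric(decode({**weights, f: x}), references))
        score = metric(decode(weights), references)  # local optimum for this restart
        if score > best_score:
            best_weights, best_score = weights, score
    return best_weights
```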

11 Features from ASR
Acoustic model (AM) scores: Gaussian mixture output probability density function (pdf)
Language model (LM) scores: n-gram language model

12 Features from Phrase-based SMT
Target language model (trigram)
Target class language model: SRILM cluster (5-gram)
Target phrase language model
Phrase translation model
Distortion model
Length model
NULL word translation model
Jump model
Long-distance target LM (9-gram) for rescoring
Long-distance class LM (11-gram)

13 Experimental Results of N-best Translation
[Chart: BLEU (y-axis 0.55 to 0.63) vs. number of hypotheses (1, 3, 5, 10, 50, 100).]

14 Word Lattice Translation

15 Recognition Word Lattice
Lattice words: 経験者 /s 救急車 を 呼ぶ でもらえ ます か /s 消え 検査 で もらえ
ASR first-best: 経験者を呼ぶでもらえますか (misrecognition: 経験者 "experienced person" instead of 救急車 "ambulance")
First-best translation: Could I get a job
Correct ASR recognition: 救急車を呼ぶでもらえますか
Word lattice translation: Could you call an ambulance

16 Machine Translation for Text
English: could you call an ambulance
NULL model: NULL could you call an ambulance
Fertility model: NULL NULL could could call ambulance
Lexical model: か を でもらえ ます 呼ぶ 救急車
Distortion model: 救急車 を 呼ぶ でもらえ ます か

17 Machine Translation for Lattice
English lattice: could I you get call a an job ambulance
NULL model: NULL could I you get call a an job ambulance
Fertility model: NULL NULL could could get call job ambulance

18 Machine Translation for Lattice (continued)
Lexical model: 経験者 / 救急車 か を でもらえ ます 呼ぶ
Distortion model: 経験者 / 救急車 を 呼ぶ でもらえ ます か

19 How We Translate a Word Lattice
Two-step decoding: beam search + A* search.
Beam search: construct a translation word graph (TWG).
- An edge in the word lattice is mapped to an edge in the TWG.
- A path in the TWG corresponds to a path in the word lattice.
- Lower-scored edges are pruned; simple translation models are used.
A* search: search the TWG with higher-grade translation models (IBM Model 4); see the sketch after slide 22.

20 Illustration of Constructing the TWG (Translation Word Graph)
Beam search with threshold pruning.
[Figure: the source lattice words 経験者 / 救急車 / 呼ぶ / でもらえ / ます are expanded into columns of candidate target words (job, ambulance, call, get, can, could, you, I); low-scoring edges are pruned.]

21 Translation Word Graph (example) [Figure]

22 A* Search
Forward score: accumulated from the start node to the current node, using IBM Model 4.
Heuristic score: estimated from the current node to the end node.
Approximations are made for the models that depend on the length of the source sentence: the distortion model and the NULL word model.
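A minimal sketch of this second decoding step: A* search over a translation word graph, combining a forward score with a heuristic estimate of the remaining score. The graph representation, the scores, and the function name are illustrative assumptions; the real decoder scores edges with IBM Model 4 rather than the made-up log-probabilities used here.

```python
import heapq

def astar_decode(start, goal, edges, heuristic):
    """A* search over a translation word graph (TWG).

    edges:     dict node -> list of (next_node, target_word, edge_log_prob);
               edge_log_prob plays the role of the detailed-model score.
    heuristic: dict node -> optimistic estimate of the remaining log-probability
               from that node to the goal.
    Returns the best-scoring target word sequence and its forward score.
    """
    # Priority queue ordered by -(forward + heuristic); heapq is a min-heap.
    frontier = [(-heuristic[start], 0.0, start, [])]
    best_forward = {}
    while frontier:
        _, forward, node, words = heapq.heappop(frontier)
        if node == goal:
            return words, forward
        if node in best_forward and best_forward[node] >= forward:
            continue
        best_forward[node] = forward
        for nxt, word, logp in edges.get(node, []):
            f = forward + logp
            heapq.heappush(frontier, (-(f + heuristic[nxt]), f, nxt, words + [word]))
    return None, float("-inf")

# Tiny illustrative graph (all scores are made up).
edges = {
    0: [(1, "could", -0.5), (1, "can", -0.9)],
    1: [(2, "you", -0.3), (2, "I", -0.8)],
    2: [(3, "call an ambulance", -1.0), (3, "get a job", -2.0)],
}
heuristic = {0: -1.8, 1: -1.3, 2: -1.0, 3: 0.0}
print(astar_decode(0, 3, edges, heuristic))
```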

23 Features in Speech Translation Models
$\hat{E} = \arg\max_{E} \{\, \lambda_0 \log P_{pp} + \lambda_1 \log P_{lm}(E) + \lambda_2 \log P_{lm}(\mathrm{POS}(E)) + \lambda_3 \log P(\phi_0) + \lambda_4 \log N(\Phi \mid E) + \lambda_5 \log T(J \mid E) + \lambda_6 \log D(E, J) \,\}$

24 Effect of Word Lattice Translation
[Chart: BLEU (y-axis 0.542 to 0.552) vs. number of N-best hypotheses in the lattice (1 to 80), comparing 1st-best translation and lattice translation.]

25 Beam-size Effect in WLT
1: a translation of the 1st-best ASR hypothesis
2: a translation of the 2nd-best ASR hypothesis
3: a translation of the 3rd-best ASR hypothesis
[Figure contrasting single-best translation with word lattice translation.]

26 Beam-size Effects in WLT (N-best = 20)
Promising hypotheses are pruned in WLT but kept in single-best translation under the same beam size.
[Chart: BLEU (y-axis 0.50 to 0.56) vs. beam size (100 to 1000), for 1-best and 20-best translation.]

27 Why Word Lattice Minimization
The raw lattice is too large: it contains many duplicated word IDs, while only the top N-best hypotheses matter.
Minimization in the light of machine translation:
- makes decoding faster
- reduces translation errors by reducing pruning errors in decoding

28 Word Lattice Minimization [Figure]

29 Word Lattice or N-best?
After lattice minimization, the output is no longer a lattice: it is an N-best list with newly assigned edge IDs.
After lattice minimization, the ASR score of a single edge is lost; instead, we use the ASR path score as the score of each edge.

30 Effect of Lattice Minimization
[Chart: BLEU (y-axis 0.540 to 0.554) vs. number of N-best hypotheses in the lattice (1 to 80), comparing 1st-best translation, lattice without minimization, and lattice with minimization.]

31 Posterior Probability
Integrates the acoustic model and language model probabilities and indicates the relative accuracy of the N-best hypotheses:
$p(J_j \mid X) = \dfrac{e^{\lambda \log score_j}}{\sum_{i=1}^{N} e^{\lambda \log score_i}}$
where $\log score_i$ is the log-scale ASR score (AM + LM).

32 Confidence Measure Filtering
ASR hypotheses with very low posterior probability degrade translations.
A predefined confidence threshold T is applied to remove the most unlikely ASR hypotheses: each hypothesis's posterior probability is compared with the single-best hypothesis's posterior probability multiplied by T (P_first-best × T), and hypotheses below this value are removed.

33 Confidence Measure Filtering (example, T = 0.5)

ASR output   ASR score   PP (posterior probability)   cmf = PP / PP(1st)   Decision (cmf > 0.5?)
1st cand.    0.55        0.215                        1                    PASS
2nd cand.    0.50        0.196                        0.912                PASS
3rd cand.    0.45        0.176                        0.818                PASS
4th cand.    0.40        0.157                        0.730                PASS
5th cand.    0.30        0.118                        0.549                PASS
6th cand.    0.20        0.078                        0.363                FAIL
7th cand.    0.10        0.039                        0.181                FAIL
8th cand.    0.05        0.020                        0.09                 FAIL
             SUM = 2.55
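The sketch below reproduces the posterior computation and confidence filtering of slides 31-33, assuming λ = 1 and treating the "ASR score" column directly as e^{λ log score}; the variable and function names are ours.

```python
def posterior_filter(asr_scores, lam=1.0, threshold=0.5):
    """Posterior-probability confidence filtering (slides 31-33).

    asr_scores: per-hypothesis ASR scores, treated here as exp(lambda * log score),
                i.e. score ** lam.
    Keeps hypothesis j if PP_j / PP_1st > threshold.
    """
    weighted = [s ** lam for s in asr_scores]
    total = sum(weighted)                      # 2.55 for the slide-33 example
    posteriors = [w / total for w in weighted]
    kept = [j for j, pp in enumerate(posteriors)
            if pp / posteriors[0] > threshold]
    return posteriors, kept

# Scores from the slide-33 example; candidates 6-8 should be filtered out.
scores = [0.55, 0.50, 0.45, 0.40, 0.30, 0.20, 0.10, 0.05]
posteriors, kept = posterior_filter(scores)
print([round(p, 3) for p in posteriors])  # [0.216, 0.196, 0.176, 0.157, 0.118, 0.078, 0.039, 0.02]
print(kept)                               # [0, 1, 2, 3, 4]
```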

34 Effect of CM Filtering
[Chart: BLEU (y-axis 0.544 to 0.558) vs. number of N-best hypotheses in the lattice (1 to 80), comparing 1st-best translation, N-best translation, and minimized lattice translation with confidence measure filtering.]

35 IWSLT 2005 Evaluation

36 IWSLT 2005 Evaluation (training data)

Language pair   Data track          Data size   Perplexity (test set)   Perplexity (dev. data)
C/E             Supplied + tagger   20K         65.4                    53.8
C/E             C-star              172K        69.3                    52.2
J/E             Supplied + tagger   20K         54.9                    53.7
J/E             C-star              463K        22.5                    31.6

37 Test Data Analysis: ASR Recognition Accuracy
[Bar chart: Japanese 0.82 (N=1) and 0.87 (N=20); Chinese 0.54 (N=1) and 0.69 (N=20).]

38 Test Data Results (J/E BLEU, C-star track)
[Chart: BLEU (y-axis 0.60 to 0.74) for Text, N-best, Lattice, and 1-best input.]

39 Test Data Results (J/E NIST, C-star track)
[Chart: NIST score (y-axis 9 to 11.2) for Text, N-best, Lattice, and 1-best input.]

40 Test Data Results (J/E WER, C-star track)
[Chart: WER (y-axis 0 to 0.4) for Text, N-best, Lattice, and 1-best input.]

41 Evaluation Results (C/E)

Data track       Input   BLEU    NIST   WER     PER     METEOR   GTM
Supplied+tools   Text    0.305   7.20   0.607   0.494   0.574    0.471
Supplied+tools   Nbest   0.267   6.19   0.645   0.546   0.506    0.421
Supplied+tools   Sbest   0.251   5.93   0.683   0.581   0.479    0.395
C-star           Text    0.421   8.17   0.518   0.422   0.642    0.547
C-star           Nbest   0.375   6.80   0.561   0.486   0.560    0.493
C-star           Sbest   0.340   6.76   0.619   0.525   0.531    0.461

42 Evaluation Results (J/E)

Data track        Input     BLEU    NIST    WER     PER     METEOR   GTM
Supplied+tagger   Text      0.388   4.39    0.563   0.519   0.520    0.431
Supplied+tagger   Nbest     0.383   4.27    0.574   0.530   0.513    0.422
Supplied+tagger   Lattice   0.378   4.18    0.578   0.534   0.511    0.420
Supplied+tagger   Sbest     0.366   4.50    0.576   0.527   0.508    0.412
C-star            Text      0.727   10.94   0.289   0.243   0.80     0.716
C-star            Nbest     0.679   10.04   0.324   0.281   0.760    0.670
C-star            Lattice   0.670   9.86    0.329   0.289   0.763    0.665
C-star            Sbest     0.646   9.68    0.352   0.304   0.741    0.645

43 Remarks
Text translation (0.727) > N-best translation (0.679)
N-best translation (0.679) > lattice translation (0.670)
Lattice translation (0.670) > single-best translation (0.646)
Training data size influences speech translation.

44 Analysis: Why Lattice Translation Is Worse than N-best Translation
We used the same number of ASR hypotheses in N-best translation and lattice translation.
In beam search, N-best translation and lattice translation used the same beam size and pruning threshold.
Model approximations and inaccuracy: distortion model, NULL model, acoustic model, language model.

45 Comparison of the Structures
Single-best translation: simple and direct; ASR and SMT optimized in isolation; MT is flexible, easy to upgrade, and can use multiple translation engines; not robust to ASR word error rate.
N-best hypothesis translation: robust, resistant to ASR word error rate; MT is flexible and can use multiple translation engines; slow, with duplicated computation.
Word lattice translation: reduces computing cost, efficient; the speech translation system (ASR and SMT) is optimized as a whole; MT is inflexible.

46 Conclusions
We applied two approaches to improve on ASR single-best translation.
By applying a log-linear model, the N-best translation approach improves single-best translation effectively.
We observed further improved speech translation performance with word lattice translation, using:
- confidence measure filtering
- word lattice reduction