Speech Translation: from Singlebest to N-Best to Lattice Translation. Spoken Language Communication Laboratories

1 Speech Translation: from Singlebest to N-Best to Lattice Translation. Ruiqiang ZHANG, Genichiro KIKUI. Spoken Language Communication Laboratories

2 Speech Translation Structure
Single-best only: ASR, single-best, MT
N-best hypothesis translation: ASR, N-best, MT
Word lattice: ASR, word lattice translation (WLT)
[Figure labels: X, J, E]

3 References
Ney (ICASSP 1999). Speech translation: Coupling of recognition and translation.
Casacuberta (2002). Architectures for speech-to-speech translation using finite-state transducer.
Zhang (Coling 2004). A unified approach in speech-to-speech translation.
Saleem (ICSLP 2004). Using word lattice information for a tight coupling in speech translation systems.
Matusov (Eurospeech 2005). On the Integration of Speech Recognition and SMT.
(Bozarov, Zhang) (Eurospeech 2005). Speech Translation by Confidence Measure.

4 Outline
N-best translation
Word lattice translation
IWSLT 2005 evaluation
Conclusions

5 N-best Hypothesis Translation
X → ASR → J1, J2, ..., JN → SMT → {E1,1, E1,2, ..., E1,K}, {E2,1, E2,2, ..., E2,K}, ..., {EN,1, EN,2, ..., EN,K} → Rescore → E
J1, J2, ..., JN: N-best speech recognition hypotheses
E2,1, E2,2, ..., E2,K: K-best translation hypotheses produced from J2
Rescore: rescore all N×K translations

6 Rescore: Integration of ASR and SMT
Statistical theory, with approximations:
$(\hat{E}, \hat{J}) = \arg\max_{E, J} \{ P(E)\, P(X \mid J)\, P(J \mid E) \}$

7 Rescore: Log-linear models
$\hat{E} = \arg\max_{E} \sum_{m=1}^{M} \lambda_m \log P_m(X, E)$
$\log P_m(X, E)$: m-th feature in log value
$\lambda_m$: weight of each feature
$P(E \mid X) = \dfrac{\exp\left(\sum_{m=1}^{M} \lambda_m f_m(X, E)\right)}{\sum_{E'} \exp\left(\sum_{m=1}^{M} \lambda_m f_m(X, E')\right)}$
$E'$: all possible translation hypotheses
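The log-linear rescoring step can be illustrated with a minimal Python sketch. The feature names (`asr`, `lm`, `tm`), their values, and the weights are hypothetical, chosen only to show the mechanics; the normalization runs over the hypotheses in the list rather than over all possible translations.

```python
import math
from typing import Dict, List, Tuple

Hypothesis = Tuple[str, Dict[str, float]]  # (translation, log-domain features)


def loglinear_score(features: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of log-domain feature values: sum_m lambda_m * f_m."""
    return sum(weights[name] * value for name, value in features.items())


def rescore(hypotheses: List[Hypothesis],
            weights: Dict[str, float]) -> Tuple[str, List[float]]:
    """Return the best translation and the posterior of every hypothesis."""
    scores = [loglinear_score(feats, weights) for _, feats in hypotheses]
    top = max(scores)
    # Subtract the maximum before exponentiating for numerical stability.
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    posteriors = [e / total for e in exps]
    best_index = max(range(len(scores)), key=scores.__getitem__)
    return hypotheses[best_index][0], posteriors


# Hypothetical N x K list flattened into one list of candidates.
candidates: List[Hypothesis] = [
    ("could i get a job", {"asr": -10.0, "lm": -8.0, "tm": -6.0}),
    ("could you call an ambulance", {"asr": -11.0, "lm": -5.0, "tm": -5.5}),
]
weights = {"asr": 0.5, "lm": 1.0, "tm": 1.0}
best, posteriors = rescore(candidates, weights)
```

Here the second candidate wins even though its ASR score is lower, because the translation-side features outweigh the difference.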

8 Parameter optimization
Objective function: $\lambda_1^M = \operatorname{optimize}\; D(R_s, E_s)$
$E_s$: translation output after log-linear model rescoring
$R_s$: references of English sentences; 16 reference sentences for each English sentence
$D(R_s, E_s)$: automatic translation quality metrics: BLEU, NIST, mWER and mPER

9 Translation Assessment, $D(R_s, E_s)$
N-gram methods
BLEU: a weighted geometric mean of the n-gram matches between test and reference sentences, plus a short-sentence penalty
NIST: an arithmetic mean of the n-gram matches between test and reference sentences
Word error rate
mWER: multiple-reference word error rate
mPER: multiple-reference position-independent word error rate
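The two error-rate metrics can be sketched in Python as follows. This assumes one common formulation: the score against the closest reference, normalized by that reference's length, with PER computed from bag-of-words overlap. The exact normalization used in the evaluation is not stated on the slide, so treat these definitions as assumptions.

```python
from collections import Counter
from typing import List


def edit_distance(hyp: List[str], ref: List[str]) -> int:
    """Word-level Levenshtein distance between hypothesis and reference."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, 1):
        cur = [i]
        for j, r in enumerate(ref, 1):
            cur.append(min(prev[j] + 1,              # drop a hypothesis word
                           cur[j - 1] + 1,           # add a reference word
                           prev[j - 1] + (h != r)))  # substitute or match
        prev = cur
    return prev[-1]


def mwer(hyp: str, refs: List[str]) -> float:
    """Word error rate against the closest of several references."""
    h = hyp.split()
    return min(edit_distance(h, r.split()) / len(r.split()) for r in refs)


def per(hyp: List[str], ref: List[str]) -> float:
    """Position-independent error rate: word order is ignored."""
    matches = sum((Counter(hyp) & Counter(ref)).values())
    return (max(len(hyp), len(ref)) - matches) / len(ref)


def mper(hyp: str, refs: List[str]) -> float:
    """Position-independent error rate against the closest reference."""
    h = hyp.split()
    return min(per(h, r.split()) for r in refs)


refs = ["could you call an ambulance", "can you call an ambulance"]
hyp = "could you call ambulance an"
```

For this hypothesis the word order error costs two edits under mWER but nothing under mPER, which is the difference between the two metrics.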

10 Optimization: Direction Set Methods
[Flow diagram: change initial λ; change direction; local optimization of $D(R_s, E_s)$ yields a local λ; output the best λ]
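The loop in the diagram (restart from a new λ, optimize along one direction at a time, keep the best result) can be sketched in Python. This is a simplified coordinate-wise variant with a fixed step grid, not full Powell's method, and `toy_objective` is a hypothetical stand-in for the translation metric $D(R_s, E_s)$ evaluated on a development set.

```python
import random
from typing import Callable, List, Sequence, Tuple

Objective = Callable[[Sequence[float]], float]


def line_search(objective: Objective, point: Sequence[float],
                direction: Sequence[float],
                steps: Sequence[float]) -> Tuple[List[float], float]:
    """Try each step size along one direction and keep the best point."""
    best_point, best_value = list(point), objective(point)
    for step in steps:
        candidate = [p + step * d for p, d in zip(point, direction)]
        value = objective(candidate)
        if value > best_value:
            best_point, best_value = candidate, value
    return best_point, best_value


def direction_set_search(objective: Objective, dim: int, restarts: int = 5,
                         sweeps: int = 20, seed: int = 0) -> Tuple[List[float], float]:
    """Maximize the objective from several random initial weight vectors."""
    rng = random.Random(seed)
    steps = [size * sign for size in (1.0, 0.5, 0.1, 0.01) for sign in (1, -1)]
    directions = [[1.0 if i == j else 0.0 for j in range(dim)] for i in range(dim)]
    best_point: List[float] = []
    best_value = float("-inf")
    for _ in range(restarts):                      # change initial lambda
        point = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
        value = objective(point)
        for _ in range(sweeps):
            previous = value
            for direction in directions:           # change direction
                point, value = line_search(objective, point, direction, steps)
            if value <= previous:                  # local optimum reached
                break
        if value > best_value:                     # keep the best lambda
            best_point, best_value = point, value
    return best_point, best_value


def toy_objective(lam: Sequence[float]) -> float:
    """Hypothetical smooth metric with its maximum at (0.3, -0.2)."""
    return -((lam[0] - 0.3) ** 2 + (lam[1] + 0.2) ** 2)


best_lambda, best_metric = direction_set_search(toy_objective, dim=2)
```

A real metric such as BLEU is piecewise constant in λ, so restarts matter more there than in this smooth toy case.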

11 Features from ASR
Acoustic model (AM) scores: Gaussian mixture output probability density function (pdf)
Language model (LM) scores: N-gram language model

12 Features from Phrase-based SMT
Target language model (trigram)
Target class language model: SRILM cluster (5-gram)
Target phrase language model
Phrase translation model
Distortion model
Length model
NULL word translation model
Jump model
Long distance target LM (9-gram), for rescore
Long distance class LM (11-gram)

13 Experimental Results of N-best Translation
[Figure: BLEU versus number of hypotheses]

14 Word Lattice Translation

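15 Recognition Word Lattice
[Figure: word lattice between sentence-boundary markers /s, with competing arcs 経験者 and 救急車 leading into を呼ぶでもらえますか, plus the alternative arc sequence 消え検査でもらえ]
ASR first-best: 経験者を呼ぶでもらえますか
First-best translation: Could I get a job
ASR correct recognition: 救急車を呼ぶでもらえますか
Word lattice translation: Could you call an ambulance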

16 Machine Translation for Text
Input: could you call an ambulance
NULL model: NULL could you call an ambulance
Fertility model: NULL NULL could could call ambulance
Lexical model: かをでもらえます呼ぶ救急車
Distortion model: 救急車を呼ぶでもらえますか

17 Machine Translation for Lattice
Input: could {I | you} {get | call} {a | an} {job | ambulance}
NULL model: NULL could {I | you} {get | call} {a | an} {job | ambulance}
Fertility model: NULL NULL could could {get | call} {job | ambulance}

18 Machine Translation for Lattice (continued)
Lexical model: かをでもらえます呼ぶ {経験者 | 救急車}
Distortion model: {経験者 | 救急車} を呼ぶでもらえますか

19 How We Translate a Word Lattice
Two-step decoding: beam search + A* search
Beam search: constructs the translation word graph (TWG). An edge in the word lattice is mapped to an edge in the TWG, and a path in the TWG corresponds to a path in the word lattice. Lower-scored edges are pruned. Simple translation models are used.
A* search: searches the TWG with a higher-grade translation model (IBM Model 4).

20 Illustration of Constructing the TWG (Translation Word Graph)
Beam search with threshold pruning
[Figure: for each source word (経験者, 救急車, 呼ぶ, でもらえます), a column of candidate target words: job, ambulance, call, get, can, could, you, I]

21 Translation Word Graph (example)

22 A* Search
Forward score: accumulated from the start node to the current node, using IBM Model 4
Heuristic score: accumulated from the current node to the end node
Approximations are made for the models that depend on the length of the source sentence: distribution model, NULL word
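The interplay of forward and heuristic scores can be sketched with a generic A* search over a small acyclic graph in Python. The graph, its node names, and its log scores are hypothetical. In this sketch the heuristic is the exact best remaining score computed by a backward pass, whereas the slides describe an approximate heuristic combined with IBM Model 4 forward scores.

```python
import heapq
from typing import Dict, List, Optional, Tuple

Edge = Tuple[str, str, float]  # (next node, target word, log score)
Graph = Dict[str, List[Edge]]


def backward_heuristic(graph: Graph, order: List[str], end: str) -> Dict[str, float]:
    """Best achievable score from each node to the end node.

    `order` must list the nodes in topological order.
    """
    best: Dict[str, float] = {end: 0.0}
    for node in reversed(order):
        if node == end:
            continue
        options = [score + best[nxt] for nxt, _, score in graph.get(node, [])
                   if nxt in best]
        if options:
            best[node] = max(options)
    return best


def a_star(graph: Graph, start: str, end: str,
           heuristic: Dict[str, float]) -> Optional[Tuple[List[str], float]]:
    """Return the highest-scoring word sequence from start to end."""
    counter = 0  # tie-breaker so the heap never compares word lists
    heap = [(-heuristic[start], counter, 0.0, start, [])]
    while heap:
        _, _, forward, node, words = heapq.heappop(heap)
        if node == end:
            return words, forward
        for nxt, word, score in graph.get(node, []):
            if nxt not in heuristic:
                continue  # no path from this node to the end
            counter += 1
            new_forward = forward + score
            priority = -(new_forward + heuristic[nxt])
            heapq.heappush(heap, (priority, counter, new_forward, nxt, words + [word]))
    return None


# Hypothetical translation word graph with log scores on the edges.
twg: Graph = {
    "s": [("a", "could", -0.1)],
    "a": [("b", "I", -1.2), ("b", "you", -0.4)],
    "b": [("c", "get", -1.0), ("c", "call", -0.5)],
    "c": [("d", "a job", -1.5), ("d", "an ambulance", -0.3)],
    "d": [],
}
h = backward_heuristic(twg, ["s", "a", "b", "c", "d"], "d")
result = a_star(twg, "s", "d", h)
```

Because the heuristic here never underestimates the best remaining score, the first complete path popped from the heap is the optimal one.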

23 Features in Speech Translation Models
$\hat{E} = \arg\max_{E} \{ \lambda_0 \log P_{pp} + \lambda_1 \log P_{lm}(E) + \lambda_2 \log P_{lm}(POS(E)) + \lambda_3 \log P(\phi_0) + \lambda_4 \log N(\Phi \mid E) + \lambda_5 \log T(J \mid E) + \lambda_6 \log D(E, J) \}$

24 Effect of Word Lattice Translation
[Figure: BLEU versus #Nbest in lattice; series: 1st best, Lattice]

25 Beam-size effect in WLT
[Figure: single-best translation versus word lattice translation. Legend: 1 = a translation of the 1st-best ASR hypothesis; 2 = a translation of the 2nd-best ASR hypothesis; 3 = a translation of the 3rd-best ASR hypothesis]

26 Beam-size Effects in WLT (Nbest=20)
Promising hypotheses are pruned in WLT but kept in single-best translation under the same beam size.
[Figure: BLEU versus beam size; series labeled "best" and "20-best"]

27 Why Word Lattice Minimization
The raw lattice is too large.
The lattice contains many duplicated word IDs.
Only the top N-best hypotheses are significant.
Minimization is done with machine translation in mind.
Minimization makes decoding faster.
Minimization can reduce translation errors by reducing pruning errors in decoding.

28 Word Lattice Minimization

29 Word Lattice or N-best?
After lattice minimization, the output is no longer a lattice, only an N-best list with newly assigned edge IDs.
After lattice minimization, the ASR score of a single edge is lost. Instead, we use the ASR path score to represent a single edge's score.

30 Effect of Lattice Minimization
[Figure: results versus #Nbest in lattice; series: 1st best, Lattice w/o min, Lattice w/ min]

31 Posterior Probability
Integrates acoustic model and language model probabilities
Indicates the relative accuracy of the N-best hypotheses
$p(J_j \mid X) = \dfrac{e^{\lambda \log \mathrm{score}_j}}{\sum_{i=1}^{N} e^{\lambda \log \mathrm{score}_i}}$
$\log \mathrm{score}_i$: log-scale ASR score (AM+LM)
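The posterior formula is a scaled softmax over the N-best ASR scores, which can be sketched in Python as below. The example scores and the scaling factor are hypothetical.

```python
import math
from typing import List


def nbest_posteriors(log_scores: List[float], scale: float) -> List[float]:
    """Posterior of each N-best hypothesis from its log-scale ASR score.

    Implements exp(scale * s_j) / sum_i exp(scale * s_i).
    """
    scaled = [scale * s for s in log_scores]
    top = max(scaled)
    # Subtracting the maximum avoids underflow for very negative scores.
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical AM+LM log scores for a 4-best list, best hypothesis first.
asr_log_scores = [-1200.0, -1203.0, -1210.0, -1250.0]
pp = nbest_posteriors(asr_log_scores, scale=0.1)
```

A smaller scaling factor flattens the distribution, so lower-ranked hypotheses keep more probability mass.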

32 Confidence Measure Filtering
ASR hypotheses with very low posterior probability degrade translations.
A predefined confidence threshold, T, is applied to remove the most unlikely ASR hypotheses.
Each hypothesis's posterior probability is compared to the single-best hypothesis's posterior probability multiplied by T, $P_{\text{first-best}} \times T$; hypotheses with the smaller value are removed.

33 Confidence Measure Filtering (example)
Columns: ASR output; ASR score; PP = posterior probability; cmf = PP / PP1-st; decision (cmf > 0.5?)
1st cand: PASS
2nd cand: PASS
3rd cand: PASS
4th cand: PASS
5th cand: PASS
6th cand: FAIL
7th cand: FAIL
8th cand: FAIL
SUM=
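The filtering rule in the table can be sketched in Python as follows. The hypothesis labels and posterior values are hypothetical, and the list is assumed to be sorted so that the first entry is the single-best hypothesis.

```python
from typing import List, Sequence


def confidence_filter(hypotheses: Sequence[str], posteriors: Sequence[float],
                      threshold: float = 0.5) -> List[str]:
    """Keep hypotheses whose cmf = PP / PP_first_best exceeds the threshold."""
    first_best = posteriors[0]  # assumes the list is sorted best-first
    return [hyp for hyp, pp in zip(hypotheses, posteriors)
            if pp / first_best > threshold]


# Hypothetical 5-best list with posterior probabilities.
nbest = ["h1", "h2", "h3", "h4", "h5"]
pp_values = [0.40, 0.30, 0.25, 0.04, 0.01]
kept = confidence_filter(nbest, pp_values, threshold=0.5)
```

With these values the first three hypotheses pass (cmf of 1.0, 0.75, and 0.625) and the last two are removed before translation.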

34 Effect of CM filtering
[Figure: series: 1st best, N-best, Lattice, Min., w/ CMF]

35 IWSLT 2005 Evaluation

36 IWSLT 2005 Evaluation (training data)
Columns: language pair; data track; data size; perplexity (test set, dev. data)
C/E, Supplied+tagger: 20K
C/E, C-star: 172K
J/E, Supplied+tagger: 20K
J/E, C-star: 463K

37 Test Data Analysis
[Table: ASR recognition accuracy for Japanese (N=1, N=) and Chinese (N=1, N=20)]

38 Test Data Results (J/E, BLEU, C-star track)
[Figure: BLEU for Text, N-best, Lattice, and 1-best input]

39 Test Data Results (J/E, NIST, C-star track)
[Figure: NIST for Text, N-best, Lattice, and 1-best input]

40 Test Data Results (J/E, WER, C-star track)
[Figure: WER for Text, N-best, Lattice, and 1-best input]

41 Evaluation Results (CE)
Columns: data track; input; BLEU; NIST; WER; PER; METEOR; GTM
Supplied+tools track: Text, Nbest, Sbest
Cstar track: Text, Nbest, Sbest

42 Evaluation Results (JE)
Columns: data track; input; BLEU; NIST; WER; PER; METEOR; GTM
Supplied+tagger track: Text, Nbest, Lattice, Sbest
Cstar track: Text, Nbest, Lattice, Sbest

43 Remarks
Text translation (0.727) > N-best translation (0.679)
N-best translation (0.679) > lattice translation (0.670)
Lattice translation (0.670) > single-best translation (0.646)
Training data size influences speech translation

44 Analysis: Lattice Translation Worse than N-best Translation
We used the same number of ASR hypotheses in N-best translation and lattice translation.
In beam search, N-best translation and lattice translation used the same beam size and pruning threshold.
Model approximations and inaccuracy: distortion, NULL, acoustic model, language model.

45 Comparisons of the structures
Single-best translation: simple and direct; ASR and SMT are optimized in isolation; MT is flexible, easy to upgrade, and supports multiple translation engines; not robust to ASR WER.
N-best hypothesis translation: robust, resistant to ASR WER; MT is flexible and supports multiple translation engines; slow, with duplicate calculation.
Word lattice translation: reduces computing cost, efficient; the speech translation system (ASR and SMT) is optimized as a whole; MT is inflexible.

46 Conclusions
We applied two approaches to improve ASR single-best translation.
By applying a log-linear model, the N-best translation approach can improve on single-best translation effectively.
We observed improved speech translation performance in word lattice translation with two techniques: confidence measure filtering and word lattice reduction.
