Automatic Speech Recognition and Statistical Machine Translation under Uncertainty


Automatic Speech Recognition and Statistical Machine Translation under Uncertainty

Lambert Mathias
Advisor: Prof. William Byrne
Thesis Committee: Prof. Gerard Meyer, Prof. Trac Tran, and Prof. Frederick Jelinek
Center for Language and Speech Processing
Department of Electrical and Computer Engineering
Johns Hopkins University
December 7, 2007

Lambert Mathias 1 / 43

Uncertainty Reduction in Language Processing

Two applications of statistical methods in language processing:
- Automatic Speech Recognition (ASR)
- Statistical Machine Translation (SMT)

Statistical learning approaches have to deal with uncertainty: data sparsity, noise, incorrect modeling assumptions, etc.

Uncertainty in SMT:
- Ambiguity in translation from speech
- Using a cascaded approach to translate speech

Uncertainty in ASR:
- Training acoustic models in the absence of reliable transcripts
- Lightly supervised discriminative training in the medical domain [1]

[1] Juergen Fritsch, Girija Yegnarayanan, Multimodal Technologies Inc.


Outline of Part I: Speech Translation

1 Speech Translation Architectures
2 Noisy Channel Model
  - Phrase-Based Generative Model
  - Translation under the ASR Posterior
  - Target Phrase Segmentation: Translation of ASR Word Lattices
  - Phrase Extraction
3

Outline of Part II: Discriminative Training for MT

4 Motivation
5 Problem Definition
6 Discriminative Objective Function
7 Growth Transformations for MT
  - Growth Transformations in ASR
  - Enumerating the joint distribution
  - Implementation Details
8 Discriminative Training Experiments
  - Parameter Tuning for Growth Transforms
  - MMI choosing the true class: oracle vs. reference
  - Speech Translation Experiments

Part I: Speech Translation

Applications of Speech Translation
- Facilitate international business communication
- Computer-aided learning
- Language acquisition aid

Statistical Speech Translation components
- Automatic Speech Recognition (ASR)
- Statistical Machine Translation (SMT)

Integrating the ASR and SMT components
- Varying levels of interaction: N-best lists, lattices, confusion networks
- Coupling of ASR with the SMT system for maximum information transfer

Outline

1 Speech Translation Architectures
2 Noisy Channel Model
  - Phrase-Based Generative Model
  - Translation under the ASR Posterior
  - Target Phrase Segmentation: Translation of ASR Word Lattices
  - Phrase Extraction
3

Loosely-Coupled Speech Translation

ASR → SMT: translate the ASR 1-best hypothesis [2]
- Sub-optimal: no interaction between ASR and SMT
- Translate multiple hypotheses instead [3]
- Does not exploit sub-sentential language information

[2] Liu et al., Noise Robustness in Speech to Speech Translation, IBM Technical Report
[3] Black et al., Rapid Development of Speech to Speech Translation Systems, ICSLP 2002

Integrated Speech Translation Architecture

Speech Translation Objective: tight integration of ASR with the SMT system
- Unified modeling framework
- No a priori decisions on candidates for translation
- Integrated search over a larger space of translation candidates
- SMT system robust to ASR errors

Problem: how to translate ASR lattices?


Outline

1 Speech Translation Architectures
2 Noisy Channel Model
  - Phrase-Based Generative Model
  - Translation under the ASR Posterior
  - Target Phrase Segmentation: Translation of ASR Word Lattices
  - Phrase Extraction
3

Noisy Channel Model: Noisy Channel Translation MAP Decoder

\hat{e}_1^{\hat{I}} = \arg\max_{I, e_1^I} \max_{f_1^J} \underbrace{P(e_1^I)}_{\text{Language Model}} \; \underbrace{P(f_1^J \mid e_1^I)}_{\text{Translation Model}} \; \underbrace{P(A \mid f_1^J)}_{\text{Acoustic Model}}

The Generative Process

grains exports are expected to fall by 25 %             (source sentence)
grains exports are_expected_to fall by_25 %             (source phrase segmentation)
grains exportations doivent fléchir de_25_%             (phrase translation)
exportations grains doivent fléchir de_25_%             (phrase reordering)
les exportations de grains doivent fléchir de_25_%      (phrase insertion)
les exportations de grains doivent fléchir de 25 %      (target sentence)

Phrase-Based Generative Model

Generative chain: Source Sentence → Source Phrase → Target Phrase → Reordered Target Phrase → Target Phrase with Insertion → Target Sentence → Target Speech, i.e. e_1^I → u_1^K → x_1^K → y_1^K → v_1^R → f_1^J → A.

Model              FSM   Transducer
P(A | f_1^J)       L     Target word acoustic lattice
P(f_1^J | v_1^R)   Ω     Target phrase segmentation transducer
P(v_1^R | y_1^K)   Φ     Target phrase insertion transducer
P(y_1^K | x_1^K)   R     Target phrase reordering transducer
P(x_1^K | u_1^K)   Y     Source-target phrase translation transducer
P(u_1^K | e_1^I)   W     Source phrase segmentation transducer
P(e_1^I)           G     Source language model

- Transformations via stochastic models implemented as WFSTs
- Built with standard WFST operations: composition and best path search
- Translation graph: G ∘ W ∘ Y ∘ R ∘ Φ ∘ Ω ∘ L
- Straightforward extension of the text-based SMT system

\hat{e}_1^{\hat{I}} = \arg\max_{I, e_1^I} \Big\{ \max_{f_1^J \in L} \max_{K, u_1^K, x_1^K, y_1^K, v_1^R} \underbrace{P(A \mid f_1^J)}_{\text{Target Lattice}} \; \underbrace{P(f_1^J, v_1^R, y_1^K, x_1^K, u_1^K, e_1^I)}_{\text{Text Translation}} \Big\}

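The composition-and-best-path pipeline can be sketched without a WFST toolkit by chaining two of the component models over a toy example. This is a minimal sketch: the phrase table, the segmentation cost, and all weights below are invented for illustration, and paths are enumerated rather than composed lazily as a real WFST library would.

```python
from itertools import product

# Hypothetical phrase table standing in for the translation transducer Y.
# Weights behave like negative log probabilities, so lower is better.
PHRASE_TABLE = {
    "grains": [("grains", 0.5)],
    "exports": [("exportations", 0.4)],
    "grains_exports": [("exportations_de_grains", 0.6)],
    "fall": [("flechir", 0.3)],
}

def segment(words):
    """Enumerate all segmentations of a word sequence into phrases
    (standing in for the phrase segmentation transducer W)."""
    if not words:
        yield [], 0.0
        return
    for i in range(1, len(words) + 1):
        head = "_".join(words[:i])
        for tail, w in segment(words[i:]):
            yield [head] + tail, w + 1.0 / i  # toy cost favoring longer phrases

def translate(phrases):
    """Compose a segmentation with the phrase table (like Y applied to W's output)."""
    options = []
    for p in phrases:
        if p not in PHRASE_TABLE:
            return  # untranslatable segmentation: no path through the composition
        options.append(PHRASE_TABLE[p])
    for combo in product(*options):
        yield [t for t, _ in combo], sum(c for _, c in combo)

def best_translation(words):
    """Best-path search over the composed segmentation/translation space."""
    best = None
    for phrases, w_seg in segment(words):
        for target, w_tr in translate(phrases):
            if best is None or w_seg + w_tr < best[0]:
                best = (w_seg + w_tr, target)
    return best
```

With the table above, `best_translation(["grains", "exports", "fall"])` prefers the multiword phrase `grains_exports`, mirroring how the composed graph lets longer phrase matches win the best-path search.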

Translation under the ASR Posterior

- Translation under the proper ASR posterior distribution
- Uncertainty reduction: a strong target LM can help guide the translation process

\hat{e}_1^{\hat{I}} = \arg\max_{I, e_1^I} \Big\{ \max_{f_1^J \in L} \max_{K, u_1^K, x_1^K, y_1^K, v_1^R} \underbrace{P(A \mid f_1^J)\, P(f_1^J)}_{\text{Target Lattice}} \; \underbrace{P(f_1^J, v_1^R, y_1^K, x_1^K, u_1^K, e_1^I)}_{\text{Text Translation}} \Big\}

Translation is from a Lattice of Phrase Sequences

[Figure: a Spanish ASR word lattice (L), the target phrase segmentation transducer (Ω), and the resulting target phrase lattice Q = Ω ∘ L, containing phrases such as un_crédito_de and determinado_año]

- The phrase sequence lattice contains the foreign phrase sequences present in the ASR lattice
- Each phrase sequence corresponds to translatable word sequences in the lattice
- The lattice retains the ASR weights
- Translation of the phrase lattice is MAP translation of the ASR word lattice

Direct Translation of the ASR Word Lattice

Original problem: how to translate an ASR word lattice?
New problem: how to extract translatable phrases from the ASR lattice?

Speech translation recast as an ASR analysis problem:
1 Perform foreign language ASR to obtain the speech lattice L
2 Analyze the foreign language word lattice and extract translatable phrases
3 Build the translation component models and convert the word lattice to a phrase lattice Ω ∘ L
4 Translate the foreign language phrase lattice

Extracting translatable phrases [4]:

C(w_1^k) = \sum_{\pi \in L} \#_{w_1^k}(\pi) \, [\![L]\!](\pi)

Filtering low-confidence phrases:

p(w_1^k) = \frac{C(w_1^k)}{\sum_{w_1^k \in L} C(w_1^k)}

[4] Count automaton, GRM Library

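The expected count C(w_1^k) can be illustrated by brute-force path enumeration over a small lattice. This is only a sketch: the lattice, its words, and its arc probabilities are invented, and a count automaton (as in the GRM library) computes the same quantity without enumerating paths.

```python
from collections import defaultdict

# Toy lattice (hypothetical): state -> [(word, arc probability, next state)].
# Arc probabilities are normalized per state, so each full path's
# probability plays the role of the lattice posterior [[L]](pi).
LATTICE = {
    0: [("un", 1.0, 1)],
    1: [("credito", 0.7, 2), ("creditos", 0.3, 2)],
    2: [("de", 0.6, 3), ("del", 0.4, 3)],
    3: [],  # final state
}

def paths(state=0, prefix=(), prob=1.0):
    """Enumerate (word sequence, posterior) pairs for all lattice paths."""
    if not LATTICE[state]:
        yield prefix, prob
        return
    for word, p, nxt in LATTICE[state]:
        yield from paths(nxt, prefix + (word,), prob * p)

def expected_phrase_counts(max_len=2):
    """C(w_1^k): sum over paths pi of #_{w_1^k}(pi) times [[L]](pi)."""
    counts = defaultdict(float)
    for words, post in paths():
        for n in range(1, max_len + 1):
            for i in range(len(words) - n + 1):
                counts[words[i:i + n]] += post
    return counts

counts = expected_phrase_counts()
# Relative-frequency confidence used to filter low-confidence phrases
total = sum(counts.values())
conf = {phrase: c / total for phrase, c in counts.items()}
```

A phrase that occurs on every path (here the unigram "un") gets expected count 1.0, while phrases on low-posterior paths get proportionally smaller counts and can be filtered by thresholding `conf`.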

Outline

1 Speech Translation Architectures
2 Noisy Channel Model
  - Phrase-Based Generative Model
  - Translation under the ASR Posterior
  - Target Phrase Segmentation: Translation of ASR Word Lattices
  - Phrase Extraction
3

Experiment Setup

- Parallel text: documents in the two languages aligned at the sentence level (1.4M)
- Word alignments using MTTK [5]
- Phrase pair inventory of phrasal translations [6]
- Component distributions under the current phrase pair inventory
- Automated evaluation measure BLEU [7]: geometric mean of n-gram overlap, with a penalty for short sentences

[5] Deng, Y. and Byrne, W.
[6] Och
[7] Papineni et al., 2001
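The BLEU measure described above can be sketched at the sentence level. This is a simplified, add-one-smoothed variant for illustration, not the exact corpus-level formula used in the experiments:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def sentence_bleu(hyp, ref, max_n=4):
    """Geometric mean of clipped n-gram precisions times a brevity
    penalty that penalizes hypotheses shorter than the reference."""
    log_prec = 0.0
    for n in range(1, max_n + 1):
        h, r = ngrams(hyp, n), ngrams(ref, n)
        overlap = sum(min(c, r[g]) for g, c in h.items())
        total = max(sum(h.values()), 1)
        # add-one smoothing so one empty n-gram order does not zero the score
        log_prec += math.log((overlap + 1) / (total + 1))
    bp = min(1.0, math.exp(1 - len(ref) / max(len(hyp), 1)))
    return bp * math.exp(log_prec / max_n)
```

An identical hypothesis scores 1.0, while a short partial match is dominated by the brevity penalty, which is exactly the behavior the "penalty for short sentences" clause refers to.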

ASR Lattice Oracle WER Translation Experiments

[Figure: best BLEU vs. oracle WER (%) for the DEV and TST sets]

- Oracle WER path: the path in the lattice closest to the reference transcript under edit distance
- A decrease in oracle WER correlates with an increase in BLEU
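The oracle-path idea can be sketched over an N-best list instead of a full lattice; finding the oracle path in a lattice uses the same edit-distance recursion composed with the lattice, but the list version below shows the criterion itself:

```python
def edit_distance(a, b):
    """Word-level Levenshtein distance via the standard dynamic program,
    keeping only the previous row of the DP table."""
    prev = list(range(len(b) + 1))
    for i, wa in enumerate(a, 1):
        curr = [i]
        for j, wb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (wa != wb)))   # substitution
        prev = curr
    return prev[-1]

def oracle_hypothesis(nbest, reference):
    """Pick the hypothesis closest to the reference transcript under
    edit distance (a stand-in for the oracle WER path)."""
    return min(nbest, key=lambda hyp: edit_distance(hyp, reference))
```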

TC-STAR 2005 EPPS Lattice Translation Performance

[Table: DEV and EVAL BLEU for the Spanish source — reference transcription, ASR 1-best, and ASR lattice]

- Lattice translation is better than translating the ASR 1-best hypothesis

Speech Translation N-best List Translation Quality

[Table: oracle-best BLEU for the EPPS development and test sets, measured over a 1000-best translation list, for reference transcription, ASR 1-best, and ASR lattice input]

- Oracle-best BLEU: the N-best list candidate with the maximum sentence-level BLEU score
- Quantifies the best possible performance under BLEU given the current inventory of phrases

Example Translations

Reference:
1. in accordance with the committee on budgets the period for tabling amendments for the second reading of the european union budget will end on wednesday the first of december at twelve noon.
2. in agreement with the committee on budgets the deadline for the presentation of projects of amendment for the second reading of the european union budget will finish on wednesday first of december at twelve noon.

ASR 1-best:
according to the committee on budgets of the deadline for the submission of projects amendment concerning the second reading of the budget union will end on wednesday, one of the twelve noon

ASR Lattice:
in accordance with the committee on budgets of the deadline for the submission of projects amendment concerning the second reading of the budget union will end on wednesday, one of the twelve noon

Part II: Discriminative Training for MT

Outline

4 Motivation
5 Problem Definition
6 Discriminative Objective Function
7 Growth Transformations for MT
  - Growth Transformations in ASR
  - Enumerating the joint distribution
  - Implementation Details
8 Discriminative Training Experiments
  - Parameter Tuning for Growth Transforms
  - MMI choosing the true class: oracle vs. reference
  - Speech Translation Experiments

Motivation

- Automatic evaluation measures for improving translation performance
- Need to optimize MT parameters for a particular task or evaluation metric automatically
- Efficient procedures for parameter tuning are needed:
  - capable of handling many features
  - robust
  - state-of-the-art translation performance

Outline

4 Motivation
5 Problem Definition
6 Discriminative Objective Function
7 Growth Transformations for MT
  - Growth Transformations in ASR
  - Enumerating the joint distribution
  - Implementation Details
8 Discriminative Training Experiments
  - Parameter Tuning for Growth Transforms
  - MMI choosing the true class: oracle vs. reference
  - Speech Translation Experiments

Learning Problem Definition

Objective: discriminative training for improved translation
- Introduce growth transformations for MT parameter optimization
- Compare translation performance with MET line search optimization

Training problem. Given:
- a parallel training corpus: foreign sentences and their translations
- a set of parameters θ = {θ_1, ..., θ_Q}
- a joint distribution

p_\theta(e_s, f_s) = \prod_q \Phi_q(e_s, f_s)^{\theta_q}

- an objective function F(θ) = f(p_θ(e_s, f_s))

Optimization problem:

\hat{\theta} = \arg\max_\theta F(\theta)

Translation is evaluated on a blind test set using the optimized parameters.
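The joint distribution above is a log-linear combination of feature functions raised to the feature weights. A minimal sketch, with feature values invented for illustration:

```python
import math

def joint_score(features, theta):
    """Unnormalized log-linear model prod_q Phi_q^theta_q, computed as
    exp(sum_q theta_q * log Phi_q) for numerical stability."""
    return math.exp(sum(t * math.log(phi) for t, phi in zip(theta, features)))

# Hypothetical feature values Phi_q for two competing hypotheses,
# e.g. a translation model score and a language model score.
hyp_a = [0.04, 0.10]
hyp_b = [0.08, 0.02]
theta = [0.5, 0.5]
```

Changing θ reweights the features and can flip which hypothesis scores higher, which is exactly the degree of freedom the discriminative training criteria below optimize.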

[Figure: two-pass architecture — the generative model decodes foreign sentences into N-best translations, which are reranked with additional features against the reference translations]

- 2-pass approach: decoding followed by optimization
- Incorporates additional features

Outline

4 Motivation
5 Problem Definition
6 Discriminative Objective Function
7 Growth Transformations for MT
  - Growth Transformations in ASR
  - Enumerating the joint distribution
  - Implementation Details
8 Discriminative Training Experiments
  - Parameter Tuning for Growth Transforms
  - MMI choosing the true class: oracle vs. reference
  - Speech Translation Experiments

Minimum Error Training

F_{MET}(\theta) = \sum_{s=1}^{S} \mathrm{BLEU}\big(\arg\max_{e_s} p_\theta(e_s \mid f_s),\; e_s^+\big)

where e_s is an English translation hypothesis, e_s^+ the English reference translation, and f_s the foreign sentence.

- Not smooth and not differentiable; piecewise constant
- Multidimensional gradient-free search with a line search subroutine [8]
- Current state of the art

[8] F. Och, Minimum Error Rate Training in Statistical Machine Translation, 2003

Expected BLEU Maximization

F_{MBR}(\theta) = \sum_{s=1}^{S} \sum_{k=1}^{N_s} \mathrm{BLEU}(e_{sk}, e_s^+) \, \frac{p_\theta(e_{sk}, f_s)}{\sum_{k'=1}^{N_s} p_\theta(e_{sk'}, f_s)}

where e_{sk} is the k-th English hypothesis for the foreign sentence f_s, and e_s^+ the English reference translation.

- Partial credit to all hypotheses in the N-best list
- Non-convex but differentiable
- Related to Minimum Bayes Risk decoding
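For a single sentence, the objective above turns N-best joint scores into posteriors and weights each hypothesis's BLEU by its posterior. A sketch assuming log-domain scores, with the posterior scale α from the implementation details included as a knob:

```python
import math

def posteriors(log_scores, alpha=1.0):
    """Normalize N-best log scores log p_theta(e_sk, f_s) into posteriors,
    with a posterior scaling factor alpha (alpha > 1 sharpens, < 1 flattens)."""
    m = max(log_scores)  # shift for numerical stability
    exps = [math.exp(alpha * (s - m)) for s in log_scores]
    z = sum(exps)
    return [e / z for e in exps]

def expected_bleu(log_scores, bleu_scores):
    """Expected-BLEU objective for one sentence: the posterior-weighted
    sum of sentence-level BLEU over the N-best list."""
    return sum(p * b for p, b in zip(posteriors(log_scores), bleu_scores))
```

Unlike the MET objective, this quantity moves smoothly as the scores (and hence θ) change, which is what makes gradient-based updates like the growth transforms applicable.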

Maximizing the Posterior

F_{MMI}(\theta) = \sum_{s=1}^{S} \log \frac{p_\theta(e_s^+, f_s)}{\sum_{k=1}^{N_s} p_\theta(e_{sk}, f_s)}

- Maximum Mutual Information (MMI): separating the truth from competing classes
- Smooth and differentiable
- Labeling the true class: reference vs. oracle-BLEU hypothesis

Outline

4 Motivation
5 Problem Definition
6 Discriminative Objective Function
7 Growth Transformations for MT
  - Growth Transformations in ASR
  - Enumerating the joint distribution
  - Implementation Details
8 Discriminative Training Experiments
  - Parameter Tuning for Growth Transforms
  - MMI choosing the true class: oracle vs. reference
  - Speech Translation Experiments

Growth Transformations in ASR

Locally maximize a rational function [9]

F(\theta) = \frac{N(\theta)}{D(\theta)}

- Assumes N(θ) and D(θ) are polynomials in θ, with D(θ) > 0
- Parameter set θ = {θ_q : θ_q ≥ 0, Σ_q θ_q = 1}

Define the transformation

T(\theta)_q = \frac{\theta_q \left( \frac{\partial P_\theta(\Theta)}{\partial \theta_q} + C \right)}{\sum_{q'=1}^{Q} \theta_{q'} \left( \frac{\partial P_\theta(\Theta)}{\partial \theta_{q'}} + C \right)}

where

P_\theta(\Theta) = N(\Theta) - F(\theta)\, D(\Theta) + C

For sufficiently large C > 0, T(θ) is a growth transform: F(T(θ)) ≥ F(θ).

[9] Gopalakrishnan 1991


Enumerating the Joint Distribution

Note that the objective F(θ) = f(p_θ(e_s, f_s)) is not a polynomial in θ:

p_\theta(e_s, f_s) = \prod_q \Phi_q(e_s, f_s)^{\theta_q} \approx \prod_{q=1}^{Q} \sum_{k=0}^{n} \frac{\big(\theta_q \log \Phi_q(e_s, f_s)\big)^k}{k!} = p^{(n)}_\theta(e_s, f_s)

- Growth transform for polynomials: F^{(n)}(T^{(n)}(θ)) ≥ F^{(n)}(θ)
- If lim_{n→∞} T^{(n)}(θ) → T(θ) and lim_{n→∞} F^{(n)}(θ) → F(θ), then

\lim_{n \to \infty} F^{(n)}(T^{(n)}(\theta)) \geq \lim_{n \to \infty} F^{(n)}(\theta) \;\Longrightarrow\; F(T(\theta)) \geq F(\theta)

Result: growth transformations can be extended to any function f(θ) that is differentiable and analytic [Kanevsky 1995].

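The truncation above replaces each factor Φ^θ = exp(θ log Φ) with a partial sum of its exponential series. A quick numerical check that the polynomial p^(n) converges to the true power as n grows:

```python
import math

def phi_pow_taylor(phi, theta, n):
    """Truncated series for phi**theta = exp(theta * log(phi)):
    sum over k = 0..n of (theta * log(phi))**k / k!"""
    x = theta * math.log(phi)
    return sum(x ** k / math.factorial(k) for k in range(n + 1))
```

A low-order truncation visibly disagrees with phi**theta, while a modest n already matches it to machine precision, which is why the limit argument on the slide goes through.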

Growth Transform Iterative Training Procedure

1 Initialize the parameter vector θ^{(0)} = {θ_1, ..., θ_Q} such that Σ_{q=1}^{Q} θ_q = 1, and set i = 0.
2 For each parameter θ_q^{(i)}, calculate the gradient ∂F(θ)/∂θ_q evaluated at θ = θ^{(i)}.
3 For each parameter θ_q^{(i)}, calculate the parameter update

\theta_q^{(i+1)} = \frac{\theta_q^{(i)} \left( \frac{\partial F(\theta)}{\partial \theta_q}\Big|_{\theta=\theta^{(i)}} + C \right)}{\sum_{q'=1}^{Q} \theta_{q'}^{(i)} \left( \frac{\partial F(\theta)}{\partial \theta_{q'}}\Big|_{\theta=\theta^{(i)}} + C \right)}

4 If F(θ^{(i+1)}) ≤ F(θ^{(i)}) or i == MAXITER, terminate. Otherwise set i ← i + 1 and go to step 2.
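The procedure above can be sketched numerically. The objective F below is an arbitrary smooth function over the simplex chosen only to show the monotone improvement and the simplex-preserving update; the real objective would be one of the discriminative criteria, with analytic gradients:

```python
def grad(F, theta, h=1e-6):
    """Central-difference gradient of F at theta (stand-in for the
    analytic gradient of the training objective)."""
    g = []
    for q in range(len(theta)):
        tp, tm = theta[:], theta[:]
        tp[q] += h
        tm[q] -= h
        g.append((F(tp) - F(tm)) / (2 * h))
    return g

def growth_transform_step(F, theta, C):
    """theta_q <- theta_q (dF/dtheta_q + C) / sum_q' theta_q' (dF/dtheta_q' + C)."""
    g = grad(F, theta)
    num = [t * (gq + C) for t, gq in zip(theta, g)]
    z = sum(num)
    return [n / z for n in num]

def optimize(F, theta, C=10.0, max_iter=200):
    """Iterate the update, stopping when F no longer improves."""
    for _ in range(max_iter):
        new = growth_transform_step(F, theta, C)
        if F(new) <= F(theta):
            break
        theta = new
    return theta

# Illustrative concave objective with its maximum at theta = (0.5, 0.5)
F = lambda t: -(t[0] - 0.5) ** 2 - (t[1] - 0.5) ** 2
theta = optimize(F, [0.9, 0.1])
```

Each step renormalizes, so θ stays on the simplex, and F increases monotonically until the fixed point, which is the defining property of a growth transform.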

60 Implementation Details

Posterior scaling:

   p_{θ,α}(e_sk | f_s) = p_θ(e_sk, f_s)^α / ∑_{k'=1}^{N_s} p_θ(e_sk', f_s)^α

Convergence rate:

   C = N_c (max{ max_q { −∂F(θ)/∂θ_q }, 0 } + ε)

Entropy regularization:

   G(θ) = F(θ) + T H(p_{θ,α})

Lambert Mathias 37 / 43
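These three heuristics can be sketched in a few lines of Python. The joint scores below are made-up stand-ins for p_θ(e_sk, f_s) over an N-best list, and the gradient vector and objective value fed to the other two helpers are likewise hypothetical:

```python
import math

def scaled_posteriors(joint_scores, alpha):
    # Posterior scaling: p_{theta,alpha}(e_k | f) ∝ p_theta(e_k, f) ** alpha.
    # alpha < 1 flattens the distribution, alpha > 1 sharpens it.
    powered = [s ** alpha for s in joint_scores]
    z = sum(powered)
    return [p / z for p in powered]

def convergence_constant(grad, n_c, eps=1e-3):
    # C = N_c * (max{ max_q{ -dF/dtheta_q }, 0 } + eps): chosen large enough
    # that every numerator of the growth-transform update stays positive.
    return n_c * (max(max(-g for g in grad), 0.0) + eps)

def entropy(post):
    return -sum(p * math.log(p) for p in post if p > 0)

joint = [0.20, 0.05, 0.01]   # hypothetical joint scores over a 3-best list
flat = scaled_posteriors(joint, 0.25)
sharp = scaled_posteriors(joint, 4.0)
print(entropy(flat), entropy(sharp))

F_val = 0.30                      # made-up value of the objective F(theta)
G = F_val + 0.25 * entropy(flat)  # entropy-regularized objective, T = 0.25
```

A larger temperature T rewards flatter posteriors, which counteracts the tendency of discriminative training to concentrate all mass on a single hypothesis.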

61 Outline

4. Motivation
5. Problem Definition
6. Discriminative Objective function
7. Growth Transformations for MT
   - Growth Transformations in ASR
   - Enumerating the joint distribution
   - Implementation Details
8. Discriminative Training Experiments
   - Parameter Tuning for Growth Transforms
   - MMI: choosing the true class - oracle vs. reference
   - Speech Translation Experiments

Lambert Mathias 38 / 43

62 Hyper-Parameter Tuning

[Figure: three panels plotting DEV BLEU against training epoch, varying the posterior scaling α (0.25, 1, ...), the entropy temperature T (0, 0.25, ...), and the convergence constant N_c (20, 50, ...).]

Figure: Chinese-English Text Translation Task: Expected BLEU over a 1000-best N-best list

Lambert Mathias 39 / 43

63 MMI: choosing the true class - oracle vs. reference

[Figure: two panels plotting DEV BLEU and TST BLEU against training epoch, for MMI with the true reference, MMI with the oracle reference, and MET.]

Figure: Arabic-English Text Translation Task: MMI over a 1000-best N-best list

Lambert Mathias 40 / 43

64 Speech Translation Experiments

Translation Input | Training Criterion | Optimization Method | DEV BLEU | TST BLEU
ASR 1-best        | MET                | Line Search         |          |
ASR 1-best        | MMI                | Growth Transform    |          |
ASR 1-best        | Expected BLEU      | Growth Transform    |          |
ASR Lattice       | MET                | Line Search         |          |
ASR Lattice       | MMI                | Growth Transform    |          |
ASR Lattice       | Expected BLEU      | Growth Transform    |          |

Table: Comparing the MMI, Expected BLEU and MET training criteria for the Spanish-English ASR translation task

- MET overfits the training data
- Expected BLEU and MMI outperform MET for lattice translation

Lambert Mathias 41 / 43

65 Conclusions: Machine Translation

Novel weighted finite state approach to translation of speech
- Noisy channel formulation: a direct extension of text translation systems
- Efficient phrase extraction from lattices; ASR phrase pruning to control ambiguity
- Improved performance over the ASR 1-best
- Confusion network decoding: word and phrase confusion networks

Improved discriminative training for SMT
- Iterative growth-transformation-based updates for the MT parameters, comparable to MET line search
- Principled approach to the optimization of MT objective functions
- Extensions to lattice-based training using WFSTs
- Anticipate further gains from increasing the number of features

Lambert Mathias 42 / 43

67 Thank You!

Lambert Mathias 43 / 43


More information

CISC 889 Bioinformatics (Spring 2004) Hidden Markov Models (II)

CISC 889 Bioinformatics (Spring 2004) Hidden Markov Models (II) CISC 889 Bioinformatics (Spring 24) Hidden Markov Models (II) a. Likelihood: forward algorithm b. Decoding: Viterbi algorithm c. Model building: Baum-Welch algorithm Viterbi training Hidden Markov models

More information

Applied Natural Language Processing

Applied Natural Language Processing Applied Natural Language Processing Info 256 Lecture 7: Testing (Feb 12, 2019) David Bamman, UC Berkeley Significance in NLP You develop a new method for text classification; is it better than what comes

More information

Seq2Seq Losses (CTC)

Seq2Seq Losses (CTC) Seq2Seq Losses (CTC) Jerry Ding & Ryan Brigden 11-785 Recitation 6 February 23, 2018 Outline Tasks suited for recurrent networks Losses when the output is a sequence Kinds of errors Losses to use CTC Loss

More information

Statistical Methods for NLP

Statistical Methods for NLP Statistical Methods for NLP Sequence Models Joakim Nivre Uppsala University Department of Linguistics and Philology joakim.nivre@lingfil.uu.se Statistical Methods for NLP 1(21) Introduction Structured

More information

p(d θ ) l(θ ) 1.2 x x x

p(d θ ) l(θ ) 1.2 x x x p(d θ ).2 x 0-7 0.8 x 0-7 0.4 x 0-7 l(θ ) -20-40 -60-80 -00 2 3 4 5 6 7 θ ˆ 2 3 4 5 6 7 θ ˆ 2 3 4 5 6 7 θ θ x FIGURE 3.. The top graph shows several training points in one dimension, known or assumed to

More information

Natural Language Processing Prof. Pawan Goyal Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur

Natural Language Processing Prof. Pawan Goyal Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur Natural Language Processing Prof. Pawan Goyal Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur Lecture - 18 Maximum Entropy Models I Welcome back for the 3rd module

More information

Ecient Higher-Order CRFs for Morphological Tagging

Ecient Higher-Order CRFs for Morphological Tagging Ecient Higher-Order CRFs for Morphological Tagging Thomas Müller, Helmut Schmid and Hinrich Schütze Center for Information and Language Processing University of Munich Outline 1 Contributions 2 Motivation

More information

Bitext Alignment for Statistical Machine Translation

Bitext Alignment for Statistical Machine Translation Bitext Alignment for Statistical Machine Translation Yonggang Deng Advisor: Prof. William Byrne Thesis Committee: Prof. William Byrne, Prof. Trac Tran Prof. Jerry Prince and Prof. Gerard Meyer Center for

More information

Lecture 3: ASR: HMMs, Forward, Viterbi

Lecture 3: ASR: HMMs, Forward, Viterbi Original slides by Dan Jurafsky CS 224S / LINGUIST 285 Spoken Language Processing Andrew Maas Stanford University Spring 2017 Lecture 3: ASR: HMMs, Forward, Viterbi Fun informative read on phonetics The

More information

An Introduction to Bioinformatics Algorithms Hidden Markov Models

An Introduction to Bioinformatics Algorithms   Hidden Markov Models Hidden Markov Models Outline 1. CG-Islands 2. The Fair Bet Casino 3. Hidden Markov Model 4. Decoding Algorithm 5. Forward-Backward Algorithm 6. Profile HMMs 7. HMM Parameter Estimation 8. Viterbi Training

More information

Deep Neural Networks

Deep Neural Networks Deep Neural Networks DT2118 Speech and Speaker Recognition Giampiero Salvi KTH/CSC/TMH giampi@kth.se VT 2015 1 / 45 Outline State-to-Output Probability Model Artificial Neural Networks Perceptron Multi

More information

Statistical NLP for the Web Log Linear Models, MEMM, Conditional Random Fields

Statistical NLP for the Web Log Linear Models, MEMM, Conditional Random Fields Statistical NLP for the Web Log Linear Models, MEMM, Conditional Random Fields Sameer Maskey Week 13, Nov 28, 2012 1 Announcements Next lecture is the last lecture Wrap up of the semester 2 Final Project

More information

Empirical Risk Minimization, Model Selection, and Model Assessment

Empirical Risk Minimization, Model Selection, and Model Assessment Empirical Risk Minimization, Model Selection, and Model Assessment CS6780 Advanced Machine Learning Spring 2015 Thorsten Joachims Cornell University Reading: Murphy 5.7-5.7.2.4, 6.5-6.5.3.1 Dietterich,

More information

Foundations of Natural Language Processing Lecture 5 More smoothing and the Noisy Channel Model

Foundations of Natural Language Processing Lecture 5 More smoothing and the Noisy Channel Model Foundations of Natural Language Processing Lecture 5 More smoothing and the Noisy Channel Model Alex Lascarides (Slides based on those from Alex Lascarides, Sharon Goldwater and Philipop Koehn) 30 January

More information

Statistical Data Mining and Machine Learning Hilary Term 2016

Statistical Data Mining and Machine Learning Hilary Term 2016 Statistical Data Mining and Machine Learning Hilary Term 2016 Dino Sejdinovic Department of Statistics Oxford Slides and other materials available at: http://www.stats.ox.ac.uk/~sejdinov/sdmml Naïve Bayes

More information

Lecture : Probabilistic Machine Learning

Lecture : Probabilistic Machine Learning Lecture : Probabilistic Machine Learning Riashat Islam Reasoning and Learning Lab McGill University September 11, 2018 ML : Many Methods with Many Links Modelling Views of Machine Learning Machine Learning

More information