Automatic Speech Recognition and Statistical Machine Translation under Uncertainty


Automatic Speech Recognition and Statistical Machine Translation under Uncertainty

Lambert Mathias
Advisor: Prof. William Byrne
Thesis Committee: Prof. Gerard Meyer, Prof. Trac Tran, and Prof. Frederick Jelinek
Center for Language and Speech Processing
Department of Electrical and Computer Engineering
Johns Hopkins University
December 7, 2007

Lambert Mathias 1 / 43

Uncertainty Reduction in Language Processing

Two applications of statistical methods in language processing:
- Automatic Speech Recognition (ASR)
- Statistical Machine Translation (SMT)

Statistical learning approaches have to deal with uncertainty: data sparsity, noise, incorrect modeling assumptions, etc.

Uncertainty in SMT:
- Ambiguity in translation from speech
- Using a cascaded approach to translate speech

Uncertainty in ASR:
- Training acoustic models in the absence of reliable transcripts
- Lightly supervised discriminative training in the medical domain [1]

[1] Juergen Fritsch, Girija Yegnarayanan, Multimodal Technologies Inc.


Outline of Part I: Speech Translation

1 Speech Translation Architectures
2 Noisy Channel Model
  - Phrase-Based Generative Model
  - Translation under the ASR Posterior
  - Target Phrase Segmentation: Translation of ASR Word Lattices
  - Phrase Extraction
3

Outline of Part II: Discriminative Training for MT

4 Motivation
5 Problem Definition
6 Discriminative Objective Function
7 Growth Transformations for MT
  - Growth Transformations in ASR
  - Enumerating the joint distribution
  - Implementation Details
8 Discriminative Training Experiments
  - Parameter Tuning for Growth Transforms
  - MMI choosing the true class: oracle vs. reference
  - Speech Translation Experiments

Part I: Speech Translation

Applications of Speech Translation
- Facilitate international business communication
- Computer-aided learning
- Language acquisition aid

Statistical Speech Translation components
- Automatic Speech Recognition (ASR)
- Statistical Machine Translation (SMT)

Integrating the ASR and SMT components
- Varying levels of interaction: N-best lists, lattices, confusion networks
- Coupling of ASR with the SMT system for maximum information transfer

Outline

1 Speech Translation Architectures
2 Noisy Channel Model
  - Phrase-Based Generative Model
  - Translation under the ASR Posterior
  - Target Phrase Segmentation: Translation of ASR Word Lattices
  - Phrase Extraction
3

Loosely-Coupled Speech Translation

ASR → SMT: translate the ASR 1-best hypothesis [2]
- Sub-optimal: no interaction between ASR and SMT
- Translate multiple hypotheses instead [3]
- Does not exploit sub-sentential language information

[2] Liu et al., Noise Robustness in Speech to Speech Translation, IBM Technical Report
[3] Black et al., Rapid Development of Speech to Speech Translation Systems, ICSLP 2002

Integrated Speech Translation Architecture

Speech Translation Objective: tight integration of ASR with the SMT system
- Unified modeling framework
- No a priori decisions on candidates for translation
- Integrated search over a larger space of translation candidates
- SMT system robust to ASR errors

Problem: how to translate ASR lattices?


Outline

1 Speech Translation Architectures
2 Noisy Channel Model
  - Phrase-Based Generative Model
  - Translation under the ASR Posterior
  - Target Phrase Segmentation: Translation of ASR Word Lattices
  - Phrase Extraction
3

Noisy Channel Model: Noisy Channel Translation MAP Decoder

\hat{e}_1^{\hat{I}} = \arg\max_{I, e_1^I} \max_{f_1^J} \underbrace{P(e_1^I)}_{\text{Language Model}} \; \underbrace{P(f_1^J \mid e_1^I)}_{\text{Translation Model}} \; \underbrace{P(A \mid f_1^J)}_{\text{Acoustic Model}}

The Generative Process

grains exports are expected to fall by 25 %             (source sentence)
grains exports are_expected_to fall by_25 %             (source phrase segmentation)
grains exportations doivent fléchir de_25_%             (phrase translation)
exportations grains doivent fléchir de_25_%             (phrase reordering)
les exportations de grains doivent fléchir de_25_%      (phrase insertion)
les exportations de grains doivent fléchir de 25 %      (target sentence)

Phrase-Based Generative Model

Generative chain: Source Sentence → Source Phrase → Target Phrase → Reordered Target Phrase → Target Phrase with Insertion → Target Sentence → Target Speech, i.e. e_1^I → u_1^K → x_1^K → y_1^K → v_1^R → f_1^J → A.

Model              FSM   Transducer
P(A | f_1^J)       L     Target word acoustic lattice
P(f_1^J | v_1^R)   Ω     Target phrase segmentation transducer
P(v_1^R | y_1^K)   Φ     Target phrase insertion transducer
P(y_1^K | x_1^K)   R     Target phrase reordering transducer
P(x_1^K | u_1^K)   Y     Source-target phrase translation transducer
P(u_1^K | e_1^I)   W     Source phrase segmentation transducer
P(e_1^I)           G     Source language model

- Transformations via stochastic models implemented as WFSTs
- Built with standard WFST operations: composition and best path search
- Translation graph: G ∘ W ∘ Y ∘ R ∘ Φ ∘ Ω ∘ L
- Straightforward extension of the text-based SMT system

\hat{e}_1^{\hat{I}} = \arg\max_{I, e_1^I} \Big\{ \max_{f_1^J \in L} \max_{K, u_1^K, x_1^K, y_1^K, v_1^R} \underbrace{P(A \mid f_1^J)}_{\text{Target Lattice}} \; \underbrace{P(f_1^J, v_1^R, y_1^K, x_1^K, u_1^K, e_1^I)}_{\text{Text Translation}} \Big\}

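The composition-and-best-path pipeline can be sketched without a WFST toolkit by chaining two of the component models over a toy example. This is a minimal sketch: the phrase table, the segmentation cost, and all weights below are invented for illustration, and paths are enumerated rather than composed lazily as a real WFST library would.

```python
from itertools import product

# Hypothetical phrase table standing in for the translation transducer Y.
# Weights behave like negative log probabilities, so lower is better.
PHRASE_TABLE = {
    "grains": [("grains", 0.5)],
    "exports": [("exportations", 0.4)],
    "grains_exports": [("exportations_de_grains", 0.6)],
    "fall": [("flechir", 0.3)],
}

def segment(words):
    """Enumerate all segmentations of a word sequence into phrases
    (standing in for the phrase segmentation transducer W)."""
    if not words:
        yield [], 0.0
        return
    for i in range(1, len(words) + 1):
        head = "_".join(words[:i])
        for tail, w in segment(words[i:]):
            yield [head] + tail, w + 1.0 / i  # toy cost favoring longer phrases

def translate(phrases):
    """Compose a segmentation with the phrase table (like Y applied to W's output)."""
    options = []
    for p in phrases:
        if p not in PHRASE_TABLE:
            return  # untranslatable segmentation: no path through the composition
        options.append(PHRASE_TABLE[p])
    for combo in product(*options):
        yield [t for t, _ in combo], sum(c for _, c in combo)

def best_translation(words):
    """Best-path search over the composed segmentation/translation space."""
    best = None
    for phrases, w_seg in segment(words):
        for target, w_tr in translate(phrases):
            if best is None or w_seg + w_tr < best[0]:
                best = (w_seg + w_tr, target)
    return best
```

With the table above, `best_translation(["grains", "exports", "fall"])` prefers the multiword phrase `grains_exports`, mirroring how the composed graph lets longer phrase matches win the best-path search.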

Translation under the ASR Posterior

- Translation under the proper ASR posterior distribution
- Uncertainty reduction: a strong target LM can help guide the translation process

\hat{e}_1^{\hat{I}} = \arg\max_{I, e_1^I} \Big\{ \max_{f_1^J \in L} \max_{K, u_1^K, x_1^K, y_1^K, v_1^R} \underbrace{P(A \mid f_1^J)\, P(f_1^J)}_{\text{Target Lattice}} \; \underbrace{P(f_1^J, v_1^R, y_1^K, x_1^K, u_1^K, e_1^I)}_{\text{Text Translation}} \Big\}

Translation is from a Lattice of Phrase Sequences

[Figure: a Spanish ASR word lattice (L), the target phrase segmentation transducer (Ω), and the resulting target phrase lattice Q = Ω ∘ L, containing phrases such as un_crédito_de and determinado_año]

- The phrase sequence lattice contains the foreign phrase sequences present in the ASR lattice
- Each phrase sequence corresponds to translatable word sequences in the lattice
- The lattice retains the ASR weights
- Translation of the phrase lattice is MAP translation of the ASR word lattice

Direct Translation of the ASR Word Lattice

Original problem: how to translate an ASR word lattice?
New problem: how to extract translatable phrases from the ASR lattice?

Speech translation recast as an ASR analysis problem:
1 Perform foreign language ASR to obtain the speech lattice L
2 Analyze the foreign language word lattice and extract translatable phrases
3 Build the translation component models and convert the word lattice to a phrase lattice Ω ∘ L
4 Translate the foreign language phrase lattice

Extracting translatable phrases [4]:

C(w_1^k) = \sum_{\pi \in L} \#_{w_1^k}(\pi) \, [\![L]\!](\pi)

Filtering low-confidence phrases:

p(w_1^k) = \frac{C(w_1^k)}{\sum_{w_1^k \in L} C(w_1^k)}

[4] Count automaton, GRM Library

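The expected count C(w_1^k) can be illustrated by brute-force path enumeration over a small lattice. This is only a sketch: the lattice, its words, and its arc probabilities are invented, and a count automaton (as in the GRM library) computes the same quantity without enumerating paths.

```python
from collections import defaultdict

# Toy lattice (hypothetical): state -> [(word, arc probability, next state)].
# Arc probabilities are normalized per state, so each full path's
# probability plays the role of the lattice posterior [[L]](pi).
LATTICE = {
    0: [("un", 1.0, 1)],
    1: [("credito", 0.7, 2), ("creditos", 0.3, 2)],
    2: [("de", 0.6, 3), ("del", 0.4, 3)],
    3: [],  # final state
}

def paths(state=0, prefix=(), prob=1.0):
    """Enumerate (word sequence, posterior) pairs for all lattice paths."""
    if not LATTICE[state]:
        yield prefix, prob
        return
    for word, p, nxt in LATTICE[state]:
        yield from paths(nxt, prefix + (word,), prob * p)

def expected_phrase_counts(max_len=2):
    """C(w_1^k): sum over paths pi of #_{w_1^k}(pi) times [[L]](pi)."""
    counts = defaultdict(float)
    for words, post in paths():
        for n in range(1, max_len + 1):
            for i in range(len(words) - n + 1):
                counts[words[i:i + n]] += post
    return counts

counts = expected_phrase_counts()
# Relative-frequency confidence used to filter low-confidence phrases
total = sum(counts.values())
conf = {phrase: c / total for phrase, c in counts.items()}
```

A phrase that occurs on every path (here the unigram "un") gets expected count 1.0, while phrases on low-posterior paths get proportionally smaller counts and can be filtered by thresholding `conf`.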

Outline

1 Speech Translation Architectures
2 Noisy Channel Model
  - Phrase-Based Generative Model
  - Translation under the ASR Posterior
  - Target Phrase Segmentation: Translation of ASR Word Lattices
  - Phrase Extraction
3

Experiment Setup

- Parallel text: documents in the two languages aligned at the sentence level (1.4M)
- Word alignments using MTTK [5]
- Phrase pair inventory of phrasal translations [6]
- Component distributions under the current phrase pair inventory
- Automated evaluation measure BLEU [7]: geometric mean of n-gram overlap, with a penalty for short sentences

[5] Deng, Y. and Byrne, W.
[6] Och
[7] Papineni et al., 2001
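The BLEU measure described above can be sketched at the sentence level. This is a simplified, add-one-smoothed variant for illustration, not the exact corpus-level formula used in the experiments:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def sentence_bleu(hyp, ref, max_n=4):
    """Geometric mean of clipped n-gram precisions times a brevity
    penalty that penalizes hypotheses shorter than the reference."""
    log_prec = 0.0
    for n in range(1, max_n + 1):
        h, r = ngrams(hyp, n), ngrams(ref, n)
        overlap = sum(min(c, r[g]) for g, c in h.items())
        total = max(sum(h.values()), 1)
        # add-one smoothing so one empty n-gram order does not zero the score
        log_prec += math.log((overlap + 1) / (total + 1))
    bp = min(1.0, math.exp(1 - len(ref) / max(len(hyp), 1)))
    return bp * math.exp(log_prec / max_n)
```

An identical hypothesis scores 1.0, while a short partial match is dominated by the brevity penalty, which is exactly the behavior the "penalty for short sentences" clause refers to.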

ASR Lattice Oracle WER Translation Experiments

[Figure: best BLEU vs. oracle WER (%) for the DEV and TST sets]

- Oracle WER path: the path in the lattice closest to the reference transcript under edit distance
- A decrease in oracle WER correlates with an increase in BLEU
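The oracle-path idea can be sketched over an N-best list instead of a full lattice; finding the oracle path in a lattice uses the same edit-distance recursion composed with the lattice, but the list version below shows the criterion itself:

```python
def edit_distance(a, b):
    """Word-level Levenshtein distance via the standard dynamic program,
    keeping only the previous row of the DP table."""
    prev = list(range(len(b) + 1))
    for i, wa in enumerate(a, 1):
        curr = [i]
        for j, wb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (wa != wb)))   # substitution
        prev = curr
    return prev[-1]

def oracle_hypothesis(nbest, reference):
    """Pick the hypothesis closest to the reference transcript under
    edit distance (a stand-in for the oracle WER path)."""
    return min(nbest, key=lambda hyp: edit_distance(hyp, reference))
```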

TC-STAR 2005 EPPS Lattice Translation Performance

[Table: DEV and EVAL BLEU for the Spanish source — reference transcription, ASR 1-best, and ASR lattice]

- Lattice translation is better than translating the ASR 1-best hypothesis

Speech Translation N-best List Translation Quality

[Table: oracle-best BLEU for the EPPS development and test sets, measured over a 1000-best translation list, for reference transcription, ASR 1-best, and ASR lattice input]

- Oracle-best BLEU: the N-best list candidate with the maximum sentence-level BLEU score
- Quantifies the best possible performance under BLEU given the current inventory of phrases

Example Translations

Reference:
1. in accordance with the committee on budgets the period for tabling amendments for the second reading of the european union budget will end on wednesday the first of december at twelve noon.
2. in agreement with the committee on budgets the deadline for the presentation of projects of amendment for the second reading of the european union budget will finish on wednesday first of december at twelve noon.

ASR 1-best:
according to the committee on budgets of the deadline for the submission of projects amendment concerning the second reading of the budget union will end on wednesday, one of the twelve noon

ASR Lattice:
in accordance with the committee on budgets of the deadline for the submission of projects amendment concerning the second reading of the budget union will end on wednesday, one of the twelve noon

Part II: Discriminative Training for MT

Outline

4 Motivation
5 Problem Definition
6 Discriminative Objective Function
7 Growth Transformations for MT
  - Growth Transformations in ASR
  - Enumerating the joint distribution
  - Implementation Details
8 Discriminative Training Experiments
  - Parameter Tuning for Growth Transforms
  - MMI choosing the true class: oracle vs. reference
  - Speech Translation Experiments

Motivation

- Automatic evaluation measures for improving translation performance
- Need to optimize MT parameters for a particular task or evaluation metric automatically
- Efficient procedures for parameter tuning are needed:
  - capable of handling many features
  - robust
  - state-of-the-art translation performance

Outline

4 Motivation
5 Problem Definition
6 Discriminative Objective Function
7 Growth Transformations for MT
  - Growth Transformations in ASR
  - Enumerating the joint distribution
  - Implementation Details
8 Discriminative Training Experiments
  - Parameter Tuning for Growth Transforms
  - MMI choosing the true class: oracle vs. reference
  - Speech Translation Experiments

Learning Problem Definition

Objective: discriminative training for improved translation
- Introduce growth transformations for MT parameter optimization
- Compare translation performance with MET line search optimization

Training problem. Given:
- a parallel training corpus: foreign sentences and their translations
- a set of parameters θ = {θ_1, ..., θ_Q}
- a joint distribution

p_\theta(e_s, f_s) = \prod_q \Phi_q(e_s, f_s)^{\theta_q}

- an objective function F(θ) = f(p_θ(e_s, f_s))

Optimization problem:

\hat{\theta} = \arg\max_\theta F(\theta)

Translation is evaluated on a blind test set using the optimized parameters.
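The joint distribution above is a log-linear combination of feature functions raised to the feature weights. A minimal sketch, with feature values invented for illustration:

```python
import math

def joint_score(features, theta):
    """Unnormalized log-linear model prod_q Phi_q^theta_q, computed as
    exp(sum_q theta_q * log Phi_q) for numerical stability."""
    return math.exp(sum(t * math.log(phi) for t, phi in zip(theta, features)))

# Hypothetical feature values Phi_q for two competing hypotheses,
# e.g. a translation model score and a language model score.
hyp_a = [0.04, 0.10]
hyp_b = [0.08, 0.02]
theta = [0.5, 0.5]
```

Changing θ reweights the features and can flip which hypothesis scores higher, which is exactly the degree of freedom the discriminative training criteria below optimize.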

[Figure: two-pass architecture — the generative model decodes foreign sentences into N-best translations, which are reranked with additional features against the reference translations]

- 2-pass approach: decoding followed by optimization
- Incorporates additional features

Outline

4 Motivation
5 Problem Definition
6 Discriminative Objective Function
7 Growth Transformations for MT
  - Growth Transformations in ASR
  - Enumerating the joint distribution
  - Implementation Details
8 Discriminative Training Experiments
  - Parameter Tuning for Growth Transforms
  - MMI choosing the true class: oracle vs. reference
  - Speech Translation Experiments

Minimum Error Training

F_{MET}(\theta) = \sum_{s=1}^{S} \mathrm{BLEU}\big(\arg\max_{e_s} p_\theta(e_s \mid f_s),\; e_s^+\big)

where e_s is an English translation hypothesis, e_s^+ the English reference translation, and f_s the foreign sentence.

- Not smooth and not differentiable; piecewise constant
- Multidimensional gradient-free search with a line search subroutine [8]
- Current state of the art

[8] F. Och, Minimum Error Rate Training in Statistical Machine Translation, 2003

Expected BLEU Maximization

F_{MBR}(\theta) = \sum_{s=1}^{S} \sum_{k=1}^{N_s} \mathrm{BLEU}(e_{sk}, e_s^+) \, \frac{p_\theta(e_{sk}, f_s)}{\sum_{k'=1}^{N_s} p_\theta(e_{sk'}, f_s)}

where e_{sk} is the k-th English hypothesis for the foreign sentence f_s, and e_s^+ the English reference translation.

- Partial credit to all hypotheses in the N-best list
- Non-convex but differentiable
- Related to Minimum Bayes Risk decoding
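For a single sentence, the objective above turns N-best joint scores into posteriors and weights each hypothesis's BLEU by its posterior. A sketch assuming log-domain scores, with the posterior scale α from the implementation details included as a knob:

```python
import math

def posteriors(log_scores, alpha=1.0):
    """Normalize N-best log scores log p_theta(e_sk, f_s) into posteriors,
    with a posterior scaling factor alpha (alpha > 1 sharpens, < 1 flattens)."""
    m = max(log_scores)  # shift for numerical stability
    exps = [math.exp(alpha * (s - m)) for s in log_scores]
    z = sum(exps)
    return [e / z for e in exps]

def expected_bleu(log_scores, bleu_scores):
    """Expected-BLEU objective for one sentence: the posterior-weighted
    sum of sentence-level BLEU over the N-best list."""
    return sum(p * b for p, b in zip(posteriors(log_scores), bleu_scores))
```

Unlike the MET objective, this quantity moves smoothly as the scores (and hence θ) change, which is what makes gradient-based updates like the growth transforms applicable.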

Maximizing the Posterior

F_{MMI}(\theta) = \sum_{s=1}^{S} \log \frac{p_\theta(e_s^+, f_s)}{\sum_{k=1}^{N_s} p_\theta(e_{sk}, f_s)}

- Maximum Mutual Information (MMI): separating the truth from competing classes
- Smooth and differentiable
- Labeling the true class: reference vs. oracle-BLEU hypothesis

Outline

4 Motivation
5 Problem Definition
6 Discriminative Objective Function
7 Growth Transformations for MT
  - Growth Transformations in ASR
  - Enumerating the joint distribution
  - Implementation Details
8 Discriminative Training Experiments
  - Parameter Tuning for Growth Transforms
  - MMI choosing the true class: oracle vs. reference
  - Speech Translation Experiments

Growth Transformations in ASR

Locally maximize a rational function [9]

F(\theta) = \frac{N(\theta)}{D(\theta)}

- Assumes N(θ) and D(θ) are polynomials in θ, with D(θ) > 0
- Parameter set θ = {θ_q : θ_q ≥ 0, Σ_q θ_q = 1}

Define the transformation

T(\theta)_q = \frac{\theta_q \left( \frac{\partial P_\theta(\Theta)}{\partial \theta_q} + C \right)}{\sum_{q'=1}^{Q} \theta_{q'} \left( \frac{\partial P_\theta(\Theta)}{\partial \theta_{q'}} + C \right)}

where

P_\theta(\Theta) = N(\Theta) - F(\theta)\, D(\Theta) + C

For sufficiently large C > 0, T(θ) is a growth transform: F(T(θ)) ≥ F(θ).

[9] Gopalakrishnan 1991


Enumerating the Joint Distribution

Note that the objective F(θ) = f(p_θ(e_s, f_s)) is not a polynomial in θ:

p_\theta(e_s, f_s) = \prod_q \Phi_q(e_s, f_s)^{\theta_q} \approx \prod_{q=1}^{Q} \sum_{k=0}^{n} \frac{\big(\theta_q \log \Phi_q(e_s, f_s)\big)^k}{k!} = p^{(n)}_\theta(e_s, f_s)

- Growth transform for polynomials: F^{(n)}(T^{(n)}(θ)) ≥ F^{(n)}(θ)
- If lim_{n→∞} T^{(n)}(θ) → T(θ) and lim_{n→∞} F^{(n)}(θ) → F(θ), then

\lim_{n \to \infty} F^{(n)}(T^{(n)}(\theta)) \geq \lim_{n \to \infty} F^{(n)}(\theta) \;\Longrightarrow\; F(T(\theta)) \geq F(\theta)

Result: growth transformations can be extended to any function f(θ) that is differentiable and analytic [Kanevsky 1995].

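The truncation above replaces each factor Φ^θ = exp(θ log Φ) with a partial sum of its exponential series. A quick numerical check that the polynomial p^(n) converges to the true power as n grows:

```python
import math

def phi_pow_taylor(phi, theta, n):
    """Truncated series for phi**theta = exp(theta * log(phi)):
    sum over k = 0..n of (theta * log(phi))**k / k!"""
    x = theta * math.log(phi)
    return sum(x ** k / math.factorial(k) for k in range(n + 1))
```

A low-order truncation visibly disagrees with phi**theta, while a modest n already matches it to machine precision, which is why the limit argument on the slide goes through.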

Growth Transform Iterative Training Procedure

1 Initialize the parameter vector θ^{(0)} = {θ_1, ..., θ_Q} such that Σ_{q=1}^{Q} θ_q = 1, and set i = 0.
2 For each parameter θ_q^{(i)}, calculate the gradient ∂F(θ)/∂θ_q evaluated at θ = θ^{(i)}.
3 For each parameter θ_q^{(i)}, calculate the parameter update

\theta_q^{(i+1)} = \frac{\theta_q^{(i)} \left( \frac{\partial F(\theta)}{\partial \theta_q}\Big|_{\theta=\theta^{(i)}} + C \right)}{\sum_{q'=1}^{Q} \theta_{q'}^{(i)} \left( \frac{\partial F(\theta)}{\partial \theta_{q'}}\Big|_{\theta=\theta^{(i)}} + C \right)}

4 If F(θ^{(i+1)}) ≤ F(θ^{(i)}) or i == MAXITER, terminate. Otherwise set i ← i + 1 and go to step 2.
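The procedure above can be sketched numerically. The objective F below is an arbitrary smooth function over the simplex chosen only to show the monotone improvement and the simplex-preserving update; the real objective would be one of the discriminative criteria, with analytic gradients:

```python
def grad(F, theta, h=1e-6):
    """Central-difference gradient of F at theta (stand-in for the
    analytic gradient of the training objective)."""
    g = []
    for q in range(len(theta)):
        tp, tm = theta[:], theta[:]
        tp[q] += h
        tm[q] -= h
        g.append((F(tp) - F(tm)) / (2 * h))
    return g

def growth_transform_step(F, theta, C):
    """theta_q <- theta_q (dF/dtheta_q + C) / sum_q' theta_q' (dF/dtheta_q' + C)."""
    g = grad(F, theta)
    num = [t * (gq + C) for t, gq in zip(theta, g)]
    z = sum(num)
    return [n / z for n in num]

def optimize(F, theta, C=10.0, max_iter=200):
    """Iterate the update, stopping when F no longer improves."""
    for _ in range(max_iter):
        new = growth_transform_step(F, theta, C)
        if F(new) <= F(theta):
            break
        theta = new
    return theta

# Illustrative concave objective with its maximum at theta = (0.5, 0.5)
F = lambda t: -(t[0] - 0.5) ** 2 - (t[1] - 0.5) ** 2
theta = optimize(F, [0.9, 0.1])
```

Each step renormalizes, so θ stays on the simplex, and F increases monotonically until the fixed point, which is the defining property of a growth transform.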

60 Implementation Details

Posterior scaling:

   p_{θ,α}(e_sk | f_s) = p_θ(e_sk, f_s)^α / ∑_{k'=1}^{N_s} p_θ(e_sk', f_s)^α

Convergence rate:

   C = N_c (max{ max_q { −∂F(θ)/∂θ_q }, 0 } + ε)

Entropy regularization:

   G(θ) = F(θ) + T H(p_{θ,α})

Lambert Mathias 37 / 43
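These three heuristics can be sketched in a few lines of Python. The joint scores below are made-up stand-ins for p_θ(e_sk, f_s) over an N-best list, and the gradient vector and objective value fed to the other two helpers are likewise hypothetical:

```python
import math

def scaled_posteriors(joint_scores, alpha):
    # Posterior scaling: p_{theta,alpha}(e_k | f) ∝ p_theta(e_k, f) ** alpha.
    # alpha < 1 flattens the distribution, alpha > 1 sharpens it.
    powered = [s ** alpha for s in joint_scores]
    z = sum(powered)
    return [p / z for p in powered]

def convergence_constant(grad, n_c, eps=1e-3):
    # C = N_c * (max{ max_q{ -dF/dtheta_q }, 0 } + eps): chosen large enough
    # that every numerator of the growth-transform update stays positive.
    return n_c * (max(max(-g for g in grad), 0.0) + eps)

def entropy(post):
    return -sum(p * math.log(p) for p in post if p > 0)

joint = [0.20, 0.05, 0.01]   # hypothetical joint scores over a 3-best list
flat = scaled_posteriors(joint, 0.25)
sharp = scaled_posteriors(joint, 4.0)
print(entropy(flat), entropy(sharp))

F_val = 0.30                      # made-up value of the objective F(theta)
G = F_val + 0.25 * entropy(flat)  # entropy-regularized objective, T = 0.25
```

A larger temperature T rewards flatter posteriors, which counteracts the tendency of discriminative training to concentrate all mass on a single hypothesis.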

61 Outline

4. Motivation
5. Problem Definition
6. Discriminative Objective function
7. Growth Transformations for MT
   - Growth Transformations in ASR
   - Enumerating the joint distribution
   - Implementation Details
8. Discriminative Training Experiments
   - Parameter Tuning for Growth Transforms
   - MMI: choosing the true class - oracle vs. reference
   - Speech Translation Experiments

Lambert Mathias 38 / 43

62 Hyper-Parameter Tuning

[Figure: three panels plotting DEV BLEU against training epoch, varying the posterior scaling α (0.25, 1, ...), the entropy temperature T (0, 0.25, ...), and the convergence constant N_c (20, 50, ...).]

Figure: Chinese-English Text Translation Task: Expected BLEU over a 1000-best N-best list

Lambert Mathias 39 / 43

63 MMI: choosing the true class - oracle vs. reference

[Figure: two panels plotting DEV BLEU and TST BLEU against training epoch, for MMI with the true reference, MMI with the oracle reference, and MET.]

Figure: Arabic-English Text Translation Task: MMI over a 1000-best N-best list

Lambert Mathias 40 / 43

64 Speech Translation Experiments

Translation Input | Training Criterion | Optimization Method | DEV BLEU | TST BLEU
ASR 1-best        | MET                | Line Search         |          |
ASR 1-best        | MMI                | Growth Transform    |          |
ASR 1-best        | Expected BLEU      | Growth Transform    |          |
ASR Lattice       | MET                | Line Search         |          |
ASR Lattice       | MMI                | Growth Transform    |          |
ASR Lattice       | Expected BLEU      | Growth Transform    |          |

Table: Comparing the MMI, Expected BLEU and MET training criteria for the Spanish-English ASR translation task

- MET overfits the training data
- Expected BLEU and MMI outperform MET for lattice translation

Lambert Mathias 41 / 43

65 Conclusions: Machine Translation

Novel weighted finite state approach to translation of speech
- Noisy channel formulation: a direct extension of text translation systems
- Efficient phrase extraction from lattices; ASR phrase pruning to control ambiguity
- Improved performance over the ASR 1-best
- Confusion network decoding: word and phrase confusion networks

Improved discriminative training for SMT
- Iterative growth-transformation-based updates for the MT parameters, comparable to MET line search
- Principled approach to the optimization of MT objective functions
- Extensions to lattice-based training using WFSTs
- Anticipate further gains from increasing the number of features

Lambert Mathias 42 / 43

67 Thank You!

Lambert Mathias 43 / 43


More information

CISC 889 Bioinformatics (Spring 2004) Hidden Markov Models (II)

CISC 889 Bioinformatics (Spring 2004) Hidden Markov Models (II) CISC 889 Bioinformatics (Spring 24) Hidden Markov Models (II) a. Likelihood: forward algorithm b. Decoding: Viterbi algorithm c. Model building: Baum-Welch algorithm Viterbi training Hidden Markov models

More information

Applied Natural Language Processing

Applied Natural Language Processing Applied Natural Language Processing Info 256 Lecture 7: Testing (Feb 12, 2019) David Bamman, UC Berkeley Significance in NLP You develop a new method for text classification; is it better than what comes

More information

Seq2Seq Losses (CTC)

Seq2Seq Losses (CTC) Seq2Seq Losses (CTC) Jerry Ding & Ryan Brigden 11-785 Recitation 6 February 23, 2018 Outline Tasks suited for recurrent networks Losses when the output is a sequence Kinds of errors Losses to use CTC Loss

More information

Statistical Methods for NLP

Statistical Methods for NLP Statistical Methods for NLP Sequence Models Joakim Nivre Uppsala University Department of Linguistics and Philology joakim.nivre@lingfil.uu.se Statistical Methods for NLP 1(21) Introduction Structured

More information

p(d θ ) l(θ ) 1.2 x x x

p(d θ ) l(θ ) 1.2 x x x p(d θ ).2 x 0-7 0.8 x 0-7 0.4 x 0-7 l(θ ) -20-40 -60-80 -00 2 3 4 5 6 7 θ ˆ 2 3 4 5 6 7 θ ˆ 2 3 4 5 6 7 θ θ x FIGURE 3.. The top graph shows several training points in one dimension, known or assumed to

More information

Natural Language Processing Prof. Pawan Goyal Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur

Natural Language Processing Prof. Pawan Goyal Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur Natural Language Processing Prof. Pawan Goyal Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur Lecture - 18 Maximum Entropy Models I Welcome back for the 3rd module

More information

Ecient Higher-Order CRFs for Morphological Tagging

Ecient Higher-Order CRFs for Morphological Tagging Ecient Higher-Order CRFs for Morphological Tagging Thomas Müller, Helmut Schmid and Hinrich Schütze Center for Information and Language Processing University of Munich Outline 1 Contributions 2 Motivation

More information

Bitext Alignment for Statistical Machine Translation

Bitext Alignment for Statistical Machine Translation Bitext Alignment for Statistical Machine Translation Yonggang Deng Advisor: Prof. William Byrne Thesis Committee: Prof. William Byrne, Prof. Trac Tran Prof. Jerry Prince and Prof. Gerard Meyer Center for

More information

Lecture 3: ASR: HMMs, Forward, Viterbi

Lecture 3: ASR: HMMs, Forward, Viterbi Original slides by Dan Jurafsky CS 224S / LINGUIST 285 Spoken Language Processing Andrew Maas Stanford University Spring 2017 Lecture 3: ASR: HMMs, Forward, Viterbi Fun informative read on phonetics The

More information

An Introduction to Bioinformatics Algorithms Hidden Markov Models

An Introduction to Bioinformatics Algorithms   Hidden Markov Models Hidden Markov Models Outline 1. CG-Islands 2. The Fair Bet Casino 3. Hidden Markov Model 4. Decoding Algorithm 5. Forward-Backward Algorithm 6. Profile HMMs 7. HMM Parameter Estimation 8. Viterbi Training

More information

Deep Neural Networks

Deep Neural Networks Deep Neural Networks DT2118 Speech and Speaker Recognition Giampiero Salvi KTH/CSC/TMH giampi@kth.se VT 2015 1 / 45 Outline State-to-Output Probability Model Artificial Neural Networks Perceptron Multi

More information

Statistical NLP for the Web Log Linear Models, MEMM, Conditional Random Fields

Statistical NLP for the Web Log Linear Models, MEMM, Conditional Random Fields Statistical NLP for the Web Log Linear Models, MEMM, Conditional Random Fields Sameer Maskey Week 13, Nov 28, 2012 1 Announcements Next lecture is the last lecture Wrap up of the semester 2 Final Project

More information

Empirical Risk Minimization, Model Selection, and Model Assessment

Empirical Risk Minimization, Model Selection, and Model Assessment Empirical Risk Minimization, Model Selection, and Model Assessment CS6780 Advanced Machine Learning Spring 2015 Thorsten Joachims Cornell University Reading: Murphy 5.7-5.7.2.4, 6.5-6.5.3.1 Dietterich,

More information

Foundations of Natural Language Processing Lecture 5 More smoothing and the Noisy Channel Model

Foundations of Natural Language Processing Lecture 5 More smoothing and the Noisy Channel Model Foundations of Natural Language Processing Lecture 5 More smoothing and the Noisy Channel Model Alex Lascarides (Slides based on those from Alex Lascarides, Sharon Goldwater and Philipop Koehn) 30 January

More information

Statistical Data Mining and Machine Learning Hilary Term 2016

Statistical Data Mining and Machine Learning Hilary Term 2016 Statistical Data Mining and Machine Learning Hilary Term 2016 Dino Sejdinovic Department of Statistics Oxford Slides and other materials available at: http://www.stats.ox.ac.uk/~sejdinov/sdmml Naïve Bayes

More information

Lecture : Probabilistic Machine Learning

Lecture : Probabilistic Machine Learning Lecture : Probabilistic Machine Learning Riashat Islam Reasoning and Learning Lab McGill University September 11, 2018 ML : Many Methods with Many Links Modelling Views of Machine Learning Machine Learning

More information