Automatic Speech Recognition and Statistical Machine Translation under Uncertainty
Lambert Mathias
Advisor: Prof. William Byrne
Thesis Committee: Prof. Gerard Meyer, Prof. Trac Tran, and Prof. Frederick Jelinek
Center for Language and Speech Processing, Department of Electrical and Computer Engineering, Johns Hopkins University
December 7, 2007
Uncertainty Reduction in Language Processing
- Two applications of statistical methods in language processing: Automatic Speech Recognition (ASR) and Statistical Machine Translation (SMT)
- Statistical learning approaches have to deal with uncertainty: data sparsity, noise, incorrect modeling assumptions, etc.
- Uncertainty in SMT: ambiguity in translation from speech; using a cascaded approach to translate speech
- Uncertainty in ASR: training acoustic models in the absence of reliable transcripts; lightly supervised discriminative training in the medical domain [1]
[1] Juergen Fritsch, Girija Yegnarayanan, Multimodal Technologies Inc.
Outline of Part I: Speech Translation
1. Speech Translation Architectures
2. Noisy Channel Model: Phrase-Based Generative Model; Translation under ASR Posterior; Target Phrase Segmentation - Translation of ASR Word Lattices; Phrase Extraction
Outline of Part II: Discriminative Training for MT
4. Motivation
5. Problem Definition
6. Discriminative Objective Function
7. Growth Transformations for MT: Growth Transformations in ASR; Enumerating the Joint Distribution; Implementation Details
8. Discriminative Training Experiments: Parameter Tuning for Growth Transforms; MMI Choosing the True Class - Oracle vs. Reference; Speech Translation Experiments
Part I: Speech Translation
Applications of Speech Translation
- Facilitate international business communication
- Computer-aided learning
- Language acquisition aid
Statistical speech translation components
- Automatic Speech Recognition (ASR)
- Statistical Machine Translation (SMT)
Integrating the ASR and SMT components
- Varying levels of interaction: N-best lists, lattices, confusion networks
- Coupling of ASR with the SMT for maximum information transfer
Loosely-Coupled Speech Translation
- ASR passes its 1-best hypothesis to the SMT system for translation [2]: sub-optimal, since there is no interaction between ASR and SMT
- Translate multiple hypotheses instead [3]
- Does not exploit sub-sentential language information
[2] Liu et al., Noise Robustness in Speech to Speech Translation, IBM Technical Report
[3] Black et al., Rapid Development of Speech to Speech Translation Systems, ICSLP 2002
Integrated Speech Translation Architecture
Speech translation objective: tight integration of ASR with the SMT system
- Unified modeling framework
- No a priori decisions on candidates for translation
- Integrated search over a larger space of translation candidates
- SMT system robust to ASR errors
Problem: how to translate ASR lattices?
Noisy Channel Model
Noisy channel translation MAP decoder:

\hat{e}_1^{\hat{I}} = \operatorname*{argmax}_{I, e_1^I} \; \max_{f_1^J} \; \underbrace{P(e_1^I)}_{\text{Language Model}} \; \underbrace{P(f_1^J \mid e_1^I)}_{\text{Translation Model}} \; \underbrace{P(A \mid f_1^J)}_{\text{Acoustic Model}}
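The MAP decoding rule can be traced with explicit lookup tables. A minimal Python sketch, assuming toy model tables (every string and probability below is hypothetical, not from the thesis; a real decoder searches the ASR lattice rather than enumerating hypotheses):

```python
# Minimal sketch of the noisy-channel MAP decoder over enumerated
# hypotheses.  All tables and probabilities are hypothetical toy values.

def map_decode(lm, tm, am):
    """Return argmax_e max_f P(e) * P(f | e) * P(A | f)."""
    best_e, best_score = None, 0.0
    for e in lm:                      # candidate English sentences
        for f in am:                  # foreign hypotheses in the lattice
            score = lm[e] * tm.get((f, e), 0.0) * am[f]
            if score > best_score:
                best_e, best_score = e, score
    return best_e, best_score

lm = {"exports fell": 0.6, "export fall": 0.4}        # P(e)
tm = {("exportations flechir", "exports fell"): 0.5,  # P(f | e)
      ("exportation flechir", "export fall"): 0.7}
am = {"exportations flechir": 0.8,                    # P(A | f)
      "exportation flechir": 0.2}

e_hat, score = map_decode(lm, tm, am)
```

The winning candidate maximizes the product of the three component models, exactly as in the decoder equation above.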
The Generative Process
1. Source sentence: grains exports are expected to fall by 25 %
2. Source phrase segmentation: grains exports are_expected_to fall by_25 %
3. Target phrase translation: grains exportations doivent fléchir de_25_%
4. Target phrase reordering: 1 exportations 1 grains doivent fléchir de_25_%
5. Target phrase insertion: les exportations de grains doivent fléchir de_25_%
6. Target sentence: les exportations de grains doivent fléchir de 25 %
Phrase-Based Generative Model
The generative process maps the source sentence to target speech through a sequence of intermediate representations, each governed by a stochastic model implemented as a WFST:
- Source Sentence e_1^I: model P(e_1^I), FSM G (Source Language Model)
- Source Phrase u_1^K: model P(u_1^K | e_1^I), FSM W (Source Phrase Segmentation Transducer)
- Target Phrase x_1^K: model P(x_1^K | u_1^K), FSM Y (Source-Target Phrase Translation Transducer)
- Reordered Target Phrase y_1^K: model P(y_1^K | x_1^K), FSM R (Target Phrase Reordering Transducer)
- Target Phrase with Insertion v_1^R: model P(v_1^R | y_1^K), FSM Φ (Target Phrase Insertion Transducer)
- Target Sentence f_1^J: model P(f_1^J | v_1^R), FSM Ω (Target Phrase Segmentation Transducer)
- Target Speech A: model P(A | f_1^J), FSM L (Target Word Acoustic Lattice)
Built with standard WFST operations: composition and best-path search. Translation graph: G ∘ W ∘ Y ∘ R ∘ Φ ∘ Ω ∘ L. A straightforward extension of the text-based SMT system:

\hat{e}_1^{\hat{I}} = \operatorname*{argmax}_{I, e_1^I} \Big\{ \max_{f_1^J \in L} \; \max_{v_1^R, y_1^K, x_1^K, u_1^K, K} \underbrace{P(A \mid f_1^J)}_{\text{Target Lattice}} \; \underbrace{P(f_1^J, v_1^R, y_1^K, x_1^K, u_1^K, e_1^I)}_{\text{Text Translation}} \Big\}
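Under this factorization, the probability of one derivation is simply the product of the component model scores. A toy Python sketch, with dictionaries standing in for the WFSTs G, W, Y, R, Φ, Ω, L (all symbols and probabilities are invented for illustration):

```python
# The joint probability of one derivation e -> u -> x -> y -> v -> f -> A
# is the product of the component model scores.  All values are toy data.

G   = {"e1": 0.5}              # P(e): source language model
W   = {("u1", "e1"): 0.9}      # P(u | e): source phrase segmentation
Y   = {("x1", "u1"): 0.6}      # P(x | u): source-target phrase translation
R   = {("y1", "x1"): 0.8}      # P(y | x): target phrase reordering
Phi = {("v1", "y1"): 0.7}      # P(v | y): target phrase insertion
Om  = {("f1", "v1"): 1.0}      # P(f | v): target phrase segmentation
L   = {"f1": 0.4}              # P(A | f): acoustic lattice score

def joint(e, u, x, y, v, f):
    """P(A, f, v, y, x, u, e) for a single derivation path."""
    return (G[e] * W[(u, e)] * Y[(x, u)] * R[(y, x)]
            * Phi[(v, y)] * Om[(f, v)] * L[f])
```

In the actual system each table is a transducer and the maximization over derivations is performed by WFST composition and best-path search.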
Translation under ASR Posterior
- Translation under the proper ASR posterior distribution
- Uncertainty reduction: a strong target LM can help guide the translation process

\hat{e}_1^{\hat{I}} = \operatorname*{argmax}_{I, e_1^I} \Big\{ \max_{f_1^J \in L} \; \max_{v_1^R, y_1^K, x_1^K, u_1^K, K} \underbrace{P(A \mid f_1^J) \, P(f_1^J)}_{\text{Target Lattice}} \; \underbrace{P(f_1^J, v_1^R, y_1^K, x_1^K, u_1^K, e_1^I)}_{\text{Text Translation}} \Big\}
Translation is from a Lattice of Phrase Sequences
[Figure: a Spanish ASR word lattice L ("un crédito de / del compromiso ..."), the target phrase segmentation transducer Ω, and the resulting target phrase lattice Q = Ω ∘ L]
- The phrase-sequence lattice contains the foreign phrase sequences present in the ASR lattice
- Each phrase sequence corresponds to a translatable word sequence in the lattice
- The lattice retains the ASR weights
- Translation of the phrase lattice is MAP translation of the ASR word lattice
Direct Translation of the ASR Word Lattice
- Original problem: how to translate an ASR word lattice?
- New problem: how to extract translatable phrases from an ASR lattice?
Speech translation recast as an ASR analysis problem:
1. Perform foreign-language ASR to obtain the speech lattice L
2. Analyze the foreign-language word lattice and extract translatable phrases
3. Build the translation component models and convert the word lattice to a phrase lattice Ω ∘ L
4. Translate the foreign-language phrase lattice
Extracting translatable phrases [4]:

C(w_1^k) = \sum_{\pi \in L} \#_{w_1^k}(\pi) \, [[L]](\pi)

Filtering low-confidence phrases [4]:

p(w_1^k) = \frac{C(w_1^k)}{\sum_{w_1'^k \in L} C(w_1'^k)}

[4] Count automaton, GRM Library
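The expected-count computation can be illustrated by approximating the lattice with its posterior-weighted paths. A hedged Python sketch (the paths and weights are toy values; the thesis computes these counts with the GRM count automaton over the lattice directly):

```python
from collections import defaultdict

# Sketch of phrase extraction from an ASR lattice, approximating the
# lattice by an explicit list of paths with posterior weights (toy data).

paths = [  # (word sequence, posterior weight) -- hypothetical values
    (("un", "credito", "de", "compromiso"), 0.6),
    (("un", "credito", "del", "compromiso"), 0.3),
    (("a", "credito", "de", "compromiso"), 0.1),
]

def expected_counts(paths, max_len=2):
    """C(w) = sum over paths of (occurrences of w in the path) * path weight."""
    counts = defaultdict(float)
    for words, weight in paths:
        for n in range(1, max_len + 1):
            for i in range(len(words) - n + 1):
                counts[words[i:i + n]] += weight
    return counts

C = expected_counts(paths)
total = sum(C.values())
# Keep phrases whose normalized expected count clears a threshold.
phrases = {w for w, c in C.items() if c / total >= 0.01}
```

The phrase "un credito" accumulates mass from both paths that contain it, so its expected count is the sum of their posteriors.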
Experiment Setup
- Parallel text: documents in the two languages aligned at the sentence level (1.4M)
- Word alignments using MTTK [5]
- Phrase pair inventory of phrasal translations [6]
- Component distributions estimated under the current phrase pair inventory
- Automated evaluation measure BLEU [7]: geometric mean of n-gram overlap, with a penalty for short sentences
[5] Deng, Y. and Byrne, W.
[6] Och
[7] Papineni et al., 2001
ASR Lattice Oracle WER Translation Experiments
[Figure: best BLEU vs. oracle WER (%) on the DEV and TST sets]
- Oracle WER path: the path in the lattice closest to the reference transcript under edit distance
- A decrease in WER correlates with an increase in BLEU
TC-STAR 2005 EPPS Lattice Translation Performance
Table (Spanish source; DEV BLEU and EVAL BLEU): Reference Transcription, ASR 1-best, ASR Lattice
- Lattice translation is better than translating the ASR 1-best hypothesis
Speech Translation N-best List Translation Quality
Table: Oracle-best BLEU for the EPPS development and test sets, measured over a 1000-best translation list, for three inputs: Reference Transcription, ASR 1-best, ASR Lattice
- Oracle-best BLEU: the N-best list candidate with the maximum sentence-level BLEU score
- Quantifies the best possible performance under BLEU given the current inventory of phrases
Translations by input type:
Input: Reference Transcription
1. in accordance with the committee on budgets the period for tabling amendments for the second reading of the european union budget will end on wednesday the first of december at twelve noon.
2. in agreement with the committee on budgets the deadline for the presentation of projects of amendment for the second reading of the european union budget will finish on wednesday first of december at twelve noon.
Input: ASR 1-best
according to the committee on budgets of the deadline for the submission of projects amendment concerning the second reading of the budget union will end on wednesday, one of the twelve noon
Input: ASR Lattice
in accordance with the committee on budgets of the deadline for the submission of projects amendment concerning the second reading of the budget union will end on wednesday, one of the twelve noon
Part II: Discriminative Training for MT
Motivation
- Automatic evaluation measures for improving translation performance
- Need to optimize MT parameters for a particular task or evaluation metric automatically
- Efficient procedures for parameter tuning are needed: capable of handling many features, robust, and delivering state-of-the-art translation performance
Learning Problem Definition
Objective: discriminative training for improved translation
- Introduce growth transformations for MT parameter optimization
- Compare translation performance with MET line-search optimization
Training problem, given a parallel training corpus of foreign sentences and their translations:
- A set of parameters θ = {θ_1, ..., θ_Q}
- A joint distribution p_θ(e_s, f_s) = \prod_{q} Φ_q(e_s, f_s)^{θ_q}
- An objective function F(θ) = f(p_θ(e_s, f_s))
Optimization problem: \hat{θ} = argmax_θ F(θ)
Translation is evaluated on a blind test set using the optimized parameters
Pipeline: foreign sentences → generative model → N-best translations → rerank translations (using the reference translations and additional features)
- Two-pass approach: decoding followed by optimization
- Incorporate additional features
Minimum Error Training

F_{MET}(θ) = \sum_{s=1}^{S} \mathrm{BLEU}\Big( \operatorname*{argmax}_{e_s} p_θ(e_s, f_s), \; e_s^+ \Big)

where e_s is an English translation hypothesis, e_s^+ is the English reference translation, and f_s is the foreign sentence.
- Not smooth and not differentiable
- Piecewise constant
- Multidimensional gradient-free search with a line-search subroutine [8]
- Current state of the art
[8] F. Och, Minimum Error Rate Training in Statistical Machine Translation, 2003
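Because the objective is piecewise constant in any single feature weight, a plain grid scan illustrates the line-search step. A toy Python sketch (the two N-best lists, their scores, and the sentence-level metric standing in for BLEU are all made up):

```python
# Toy sketch of a MET line search along one feature weight: the corpus
# objective is piecewise constant in the weight, so scanning a grid over
# an interval recovers the best step.  All numbers are illustrative.

nbest = [  # per sentence: list of (base_score, feature_value, metric)
    [(0.2, 1.0, 0.9), (0.5, -1.0, 0.4)],
    [(0.1, 2.0, 0.7), (0.4, 0.5, 0.6)],
]

def corpus_metric(lam):
    """Metric of the 1-best hypothesis under weight lam, summed over sentences."""
    total = 0.0
    for hyps in nbest:
        best = max(hyps, key=lambda h: h[0] + lam * h[1])
        total += best[2]
    return total

grid = [i / 100.0 for i in range(-300, 301)]
lam_hat = max(grid, key=corpus_metric)
```

The real line search exploits the piecewise-constant structure to find the exact breakpoints rather than scanning a grid.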
Expected BLEU Maximization

F_{MBR}(θ) = \sum_{s=1}^{S} \sum_{k=1}^{N_s} \mathrm{BLEU}(e_{sk}, e_s^+) \, \frac{p_θ(e_{sk}, f_s)}{\sum_{k'=1}^{N_s} p_θ(e_{sk'}, f_s)}

where e_{sk} is an English translation hypothesis, e_s^+ is the English reference translation, and f_s is the foreign sentence.
- Gives partial credit to all hypotheses in the N-best list
- Non-convex but differentiable
- Related to Minimum Bayes-Risk decoding
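The expected-BLEU computation for one sentence can be sketched directly from the formula (the joint scores and BLEU values below are illustrative only):

```python
# Expected BLEU for a single sentence: the posterior-weighted average of
# sentence-level BLEU over the N-best list.  Toy values throughout.

def expected_bleu(joint_scores, bleu_scores):
    """sum_k BLEU(e_k) * p(e_k, f) / sum_k' p(e_k', f)."""
    z = sum(joint_scores)
    return sum(b * p / z for p, b in zip(joint_scores, bleu_scores))

scores = [0.5, 0.3, 0.2]   # p_theta(e_sk, f_s) for a 3-best list
bleus  = [0.8, 0.6, 0.1]   # BLEU(e_sk, e_s+)
obj = expected_bleu(scores, bleus)
```

Every hypothesis contributes in proportion to its posterior, which is what makes the objective differentiable in θ.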
Maximizing the Posterior

F_{MMI}(θ) = \sum_{s=1}^{S} \log \frac{p_θ(e_s^+, f_s)}{\sum_{k=1}^{N_s} p_θ(e_{sk}, f_s)}

where e_{sk} is an English translation hypothesis, e_s^+ is the English reference translation, and f_s is the foreign sentence.
- Maximum Mutual Information (MMI): separating the truth from competing classes
- Smooth and differentiable
- Labeling the true class: reference vs. oracle-BLEU hypothesis
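The MMI criterion for one sentence reduces to a log-ratio over the N-best list. A minimal sketch with toy scores (which entry plays the "true" class is exactly the reference-vs-oracle labeling question above):

```python
import math

# MMI for one sentence: the log posterior mass of the hypothesis labeled
# "true" (the reference, or the oracle-BLEU candidate) relative to all
# competing N-best entries.  Scores are illustrative.

def mmi(true_score, nbest_scores):
    """log p(e+, f) - log sum_k p(e_k, f)."""
    return math.log(true_score) - math.log(sum(nbest_scores))

scores = [0.5, 0.3, 0.2]      # joint scores; index 0 plays the true class
objective = mmi(scores[0], scores)
```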
Growth Transformations in ASR
Locally maximize a rational function [9]

F(θ) = \frac{N(θ)}{D(θ)}

- Assumes N(θ) and D(θ) are polynomials in θ, with D(θ) > 0
- Parameter set θ = {θ_q : θ_q ≥ 0, \sum_q θ_q = 1}
Define the transformation

T(θ)_q = \frac{θ_q \left( \frac{\partial P_θ(Θ)}{\partial θ_q} + C \right)}{\sum_{q=1}^{Q} θ_q \left( \frac{\partial P_θ(Θ)}{\partial θ_q} + C \right)}, \qquad P_θ(Θ) = N(Θ) - F(θ)\,D(Θ)

For sufficiently large C > 0, T(θ) is a growth transform: F(T(θ)) ≥ F(θ).
[9] Gopalakrishnan et al., 1991
Enumerating the Joint Distribution
The objective F(θ) = f(p_θ(e_s, f_s)) is not a polynomial in θ. Expand each factor as a truncated exponential series:

p_θ(e_s, f_s) = \prod_q Φ_q(e_s, f_s)^{θ_q} \approx \prod_{q=1}^{Q} \sum_{k=0}^{n} \frac{\big( θ_q \log Φ_q(e_s, f_s) \big)^k}{k!} \; = \; p_θ^{(n)}(e_s, f_s)

- Growth transform for polynomials: F^{(n)}(T^{(n)}(θ)) ≥ F^{(n)}(θ)
- If \lim_{n \to \infty} T^{(n)}(θ) \to T(θ) and \lim_{n \to \infty} F^{(n)}(θ) \to F(θ), then

\lim_{n \to \infty} F^{(n)}(T^{(n)}(θ)) ≥ \lim_{n \to \infty} F^{(n)}(θ) \;\Rightarrow\; F(T(θ)) ≥ F(θ)

Result: growth transformations can be extended to any function f(θ) that is differentiable and analytic [a].
[a] Kanevsky, 1995
Growth Transform Iterative Training Procedure
1. Initialize the parameter vector θ^{(0)} = {θ_1, ..., θ_Q} such that \sum_{q=1}^{Q} θ_q = 1, and set i = 0.
2. For each parameter θ_q^{(i)}, calculate the gradient \frac{\partial F(θ)}{\partial θ_q} \big|_{θ=θ^{(i)}}.
3. For each parameter θ_q^{(i)}, calculate the parameter update

θ_q^{(i+1)} = \frac{θ_q^{(i)} \left( \frac{\partial F(θ)}{\partial θ_q} \big|_{θ=θ^{(i)}} + C \right)}{\sum_{q=1}^{Q} θ_q^{(i)} \left( \frac{\partial F(θ)}{\partial θ_q} \big|_{θ=θ^{(i)}} + C \right)}

4. If F(θ^{(i+1)}) ≤ F(θ^{(i)}) or i = MAXITER, terminate; otherwise set i ← i + 1 and go to step 2.
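The procedure above can be exercised on a toy objective. A Python sketch, assuming a simple bilinear F and a numerical gradient in place of the MT objective and its analytic gradient (both are illustrative stand-ins):

```python
# Toy sketch of the iterative growth-transform update on the probability
# simplex.  F and the numerical gradient are illustrative stand-ins.

def F(theta):
    # Toy objective, maximized on the simplex at theta = (0.5, 0.5).
    return theta[0] * theta[1]

def grad(f, theta, h=1e-6):
    """Forward-difference gradient of f at theta."""
    g = []
    for q in range(len(theta)):
        t = list(theta)
        t[q] += h
        g.append((f(t) - f(theta)) / h)
    return g

def growth_transform(theta, C=1.0, max_iter=50):
    for _ in range(max_iter):
        g = grad(F, theta)
        num = [t * (gq + C) for t, gq in zip(theta, g)]
        z = sum(num)
        new_theta = [n / z for n in num]
        if F(new_theta) <= F(theta):   # no growth: terminate
            break
        theta = new_theta
    return theta

theta = growth_transform([0.9, 0.1])
```

Each update renormalizes over the simplex, so the parameters remain a valid distribution while the objective grows monotonically until convergence.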
Implementation Details
Posterior scaling:

p_{θ,α}(e_{sk} \mid f_s) = \frac{p_θ(e_{sk}, f_s)^α}{\sum_{k'=1}^{N_s} p_θ(e_{sk'}, f_s)^α}

Convergence rate:

C = N_c \left( \max\Big\{ \max_q \frac{\partial F(θ)}{\partial θ_q}, \; 0 \Big\} + ε \right)

Entropy regularization:

G(θ) = F(θ) + T \, H(p_{θ,α})
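Posterior scaling can be illustrated numerically: raising the joint scores to a power α and renormalizing trades off the entropy of the posterior (the scores below are toy values):

```python
import math

# Posterior scaling: alpha < 1 flattens the posterior (higher entropy),
# alpha > 1 sharpens it toward the 1-best.  Scores are illustrative.

def scaled_posterior(scores, alpha):
    powered = [s ** alpha for s in scores]
    z = sum(powered)
    return [p / z for p in powered]

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

scores = [0.5, 0.3, 0.2]
flat  = scaled_posterior(scores, 0.25)   # flattened distribution
sharp = scaled_posterior(scores, 4.0)    # sharpened distribution
```

This is the knob swept as α in the hyper-parameter tuning experiments that follow.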
61 Outline

4. Motivation
5. Problem Definition
6. Discriminative Objective function
7. Growth Transformations for MT
   - Growth Transformations in ASR
   - Enumerating the joint distribution
   - Implementation Details
8. Discriminative Training Experiments
   - Parameter Tuning for Growth Transforms
   - MMI: choosing the true class - oracle vs. reference
   - Speech Translation Experiments

Lambert Mathias 38 / 43
62 Hyper-Parameter Tuning

Figure: Chinese-English Text Translation Task: Expected BLEU over a 1000-best list. The three panels plot DEV BLEU against training epoch, sweeping the posterior scale α (0.25, 1, ...), the entropy-regularization temperature T (0, 0.25, ...), and the convergence-rate constant N_c (20, 50, ...).

Lambert Mathias 39 / 43
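The quantity tuned in the figure, expected BLEU over an N-best list, is the posterior-weighted average of per-hypothesis scores. A schematic version, with made-up sentence-level quality scores standing in for BLEU:

```python
def expected_score(scores, posteriors):
    """Posterior-weighted expectation of per-hypothesis quality scores."""
    assert abs(sum(posteriors) - 1.0) < 1e-9
    return sum(s * p for s, p in zip(scores, posteriors))

# Hypothetical sentence-level BLEU-like scores and posteriors
# over a 4-hypothesis N-best list.
scores = [0.42, 0.38, 0.51, 0.20]
posteriors = [0.5, 0.3, 0.15, 0.05]
exp_bleu = expected_score(scores, posteriors)
```

Because the posteriors depend smoothly on θ, this expectation is differentiable even though corpus BLEU of the single best hypothesis is not, which is what makes it usable as a training objective.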
63 MMI: choosing the true class - oracle vs. reference

Figure: Arabic-English Text Translation Task: MMI over a 1000-best list. DEV BLEU and TST BLEU are plotted against training epoch for MMI with the true reference, MMI with the oracle reference, and MET.

Lambert Mathias 40 / 43
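The MMI criterion compared above is the log-posterior of a designated "true" hypothesis, which may be the actual reference or the oracle (highest-BLEU) hypothesis in the N-best list. A sketch with hypothetical scores:

```python
import math

def mmi_objective(joint_logprobs, true_index):
    """log p(e_true | f) = log p(e_true, f) - log sum_k p(e_k, f)."""
    m = max(joint_logprobs)
    # Log-sum-exp over all N-best hypotheses (the denominator).
    log_z = m + math.log(sum(math.exp(lp - m) for lp in joint_logprobs))
    return joint_logprobs[true_index] - log_z

# Hypothetical joint log-probabilities and BLEU scores for a 3-best list.
logps = [-10.0, -10.5, -12.0]
bleus = [0.35, 0.48, 0.30]

# Oracle reference: the hypothesis in the list with the highest BLEU,
# which need not be the one closest to the true reference.
oracle_index = max(range(len(bleus)), key=lambda k: bleus[k])
f_oracle = mmi_objective(logps, oracle_index)
```

Maximizing this pushes posterior mass toward the chosen hypothesis and away from its competitors; the oracle variant guarantees the "true class" is actually reachable within the N-best list.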
64 Speech Translation Experiments

Table: Comparing MMI, Expected BLEU, and MET training criteria for the Spanish-English ASR translation task:

  Translation Input   Training Criterion   Optimization Method   DEV BLEU   TST BLEU
  ASR 1-best          MET                  Line Search           -          -
  ASR 1-best          MMI                  Growth Transform      -          -
  ASR 1-best          Expected BLEU        Growth Transform      -          -
  ASR Lattice         MET                  Line Search           -          -
  ASR Lattice         MMI                  Growth Transform      -          -
  ASR Lattice         Expected BLEU        Growth Transform      -          -

- MET overfits the training data
- Expected BLEU and MMI outperform MET for lattice translation

Lambert Mathias 41 / 43
65 Conclusions: Machine Translation

Novel weighted finite state approach to translation of speech:
- Noisy channel formulation: a direct extension of text translation systems
- Efficient phrase extraction from lattices; ASR phrase pruning to control ambiguity
- Improved performance over ASR 1-best
- Confusion network decoding: word and phrase confusion networks

Improved discriminative training for SMT:
- Iterative growth-transformation-based updates for the MT parameters, comparable to MET line search
- Principled approach to the optimization of MT objective functions
- Extensions to lattice-based training using WFSTs
- Anticipate further gains from increasing the number of features

Lambert Mathias 42 / 43
67 Thank You!

Lambert Mathias 43 / 43