Weighted Finite State Transducers in Automatic Speech Recognition

Weighted Finite State Transducers in Automatic Speech Recognition
ZRE lecture, 10.04.2013
Mirko Hannemann
Slides provided with permission of Daniel Povey; some slides from T. Schultz, M. Mohri and M. Riley.

Automatic speech recognition (overview figures)

Hidden Markov Model, token passing (figure)

Viterbi algorithm, trellis (figure)

Decoding graph / recognition network
(figure: parallel paths for the words one (w ah n), two (t uw) and three (th r iy), each entered with its prior P(one), P(two), P(three) and followed by sil)

Why Finite State Transducers?
Motivation: most components (LM, lexicon, lattice) are finite-state:
- a unified framework for describing the models
- integrate different models into a single model via composition operations
- improve search efficiency via optimization algorithms
- flexibility to extend (add new models)
Speed: pre-compiled search space, near real-time performance on embedded systems.
Flexibility: the same decoder is used for hand-held devices and for LVCSR.

Finite State Acceptor (FSA)
(figure: 0 --a--> 1 --b--> 2, state 2 final)
An FSA accepts a set of strings (a string is a sequence of symbols). View an FSA as a representation of a possibly infinite set of strings. This FSA accepts just the string ab, i.e. the set {ab}.
Numbers in circles are state labels (not really important). Labels on the arcs are the symbols. Start state(s) are bold; final/accepting states have an extra circle. Note: it is sometimes assumed there is just one start state.

A less trivial FSA
(figure: as above, but with an additional a self-loop, so any number of a's can precede the b)
The previous example doesn't show the power of FSAs, because we could represent that set of strings finitely. This example represents the infinite set {ab, aab, aaab, ...}.
Note: a string is accepted (included in the set) if:
- there is a path with that sequence of symbols on it, and
- that path is successful (it starts at an initial state and ends at a final state).

The epsilon symbol
(figure: 0 --a--> 1, with an <eps> arc from 1 back to 0; state 1 final)
The symbol ϵ has a special meaning in FSAs (and FSTs): it means no symbol is there. This example represents the set of strings {a, aa, aaa, ...}. If ϵ were treated as a normal symbol, this would be {a, aϵa, aϵaϵa, ...}.
In text form, ϵ is sometimes written as <eps>. Toolkits implementing FSAs/FSTs generally assume <eps> is the symbol numbered zero.
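
To make the acceptance condition concrete, here is a minimal, self-contained sketch (not from the slides) that tracks the set of reachable states while consuming symbols and follows <eps> arcs via epsilon closure; the automaton encoding and the example are invented for illustration.

```python
# Minimal sketch of (unweighted) FSA acceptance with epsilon arcs.
# The automaton encoding and the example are illustrative, not a toolkit format.

EPS = 0  # by convention, symbol id 0 is <eps>

def eps_closure(states, arcs):
    """All states reachable from `states` via <eps> arcs only."""
    stack, closure = list(states), set(states)
    while stack:
        s = stack.pop()
        for (src, sym, dst) in arcs:
            if src == s and sym == EPS and dst not in closure:
                closure.add(dst)
                stack.append(dst)
    return closure

def accepts(string, start, finals, arcs):
    """True if some successful path is labeled with `string`."""
    current = eps_closure({start}, arcs)
    for sym in string:
        step = {dst for (src, s, dst) in arcs if src in current and s == sym}
        current = eps_closure(step, arcs)
    return bool(current & finals)

# The {a, aa, aaa, ...} example: 0 --a--> 1, 1 --<eps>--> 0, final state 1.
A = 1  # symbol id for 'a'
arcs = [(0, A, 1), (1, EPS, 0)]
print(accepts([A], 0, {1}, arcs))        # True  ("a")
print(accepts([A, A, A], 0, {1}, arcs))  # True  ("aaa")
print(accepts([], 0, {1}, arcs))         # False (the empty string is not accepted)
```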

Weighted finite state acceptors (WFSA)
(figure: 0 --a/1--> 1 --b/1--> 2/1)
Like a normal FSA, but with costs on the arcs and on the final states. Note: the cost comes after the slash; for a final state, 2/1 means final cost 1 on state 2.
View a WFSA as a function from a string to a cost. In this view, an unweighted FSA is f: string -> {0, ∞}. If multiple paths have the same string, take the one with the lowest cost. This example maps ab to 3 (= 1 + 1 + 1) and everything else to ∞.

Semirings
The semiring concept makes WFSAs more general. A semiring is:
- a set of elements (e.g. R)
- two special elements 1̄ and 0̄ (the identity element and the zero element)
- two operations, ⊕ (plus) and ⊗ (times), satisfying certain axioms.
Semiring examples; ⊕_log is defined by x ⊕_log y = -log(e^{-x} + e^{-y}):

SEMIRING      SET              ⊕       ⊗   0̄    1̄
Boolean       {0, 1}           ∨       ∧   0    1
Probability   R+               +       ×   0    1
Log           R ∪ {-∞, +∞}     ⊕_log   +   +∞   0
Tropical      R ∪ {-∞, +∞}     min     +   +∞   0

In WFSAs, weights are ⊗-multiplied along paths and ⊕-summed over paths with identical symbol sequences.
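
To make ⊕ and ⊗ concrete for the two cost semirings used most in ASR, here is a small self-contained sketch (not from the slides; the function names are made up). Weights are costs, i.e. negative log probabilities, so ⊗ is ordinary addition.

```python
import math

# Illustrative sketch of the tropical and log semirings over costs.

INF = float("inf")   # the semiring zero (0-bar) for both tropical and log

def tropical_plus(x, y):
    """Combine alternative paths: keep the better (lower) cost."""
    return min(x, y)

def log_plus(x, y):
    """Combine alternative paths: -log(e^-x + e^-y), i.e. sum the probabilities."""
    if x == INF:
        return y
    if y == INF:
        return x
    m = min(x, y)
    return m - math.log(math.exp(m - x) + math.exp(m - y))

def times(x, y):
    """Extend a path: costs add in both semirings."""
    return x + y

# Two paths for the same string, with costs 1+2 = 3 and 2+3 = 5:
print(tropical_plus(times(1, 2), times(2, 3)))  # 3      (best-path / Viterbi score)
print(log_plus(3.0, 5.0))                       # ~2.873 (total probability of both paths)
```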

Weights vs. costs
(figure: 0 --a/0--> 1/0; 0 --b/2--> 2 --c/1--> 3/1)
Personally, I use cost to refer to the numeric value, and weight when speaking abstractly. E.g.:
- The acceptor above accepts a with unit weight; it accepts a with zero cost.
- It accepts bc with cost 4 (= 2 + 1 + 1).
- State 1 is final with unit weight.
- The acceptor assigns zero weight to xyz; it assigns infinite cost to xyz.

Weights and costs
(figure: 0 --a/0--> 1/2 with a b/1 self-loop on state 1 and a d/1 arc from 1 to 3/1; also 0 --a/1--> 2 --b/2--> 4/2)
Consider the WFSA above, with the tropical ("Viterbi-like") semiring, and take the string ab. We multiply (⊗) the weights along paths; this means adding the costs. There are two paths for ab:
- one goes through the states (0, 1, 1); its cost is 0 + 1 + 2 = 3
- one goes through the states (0, 2, 4); its cost is 1 + 2 + 2 = 5
We add (⊕) weights across different paths; in the tropical semiring this means taking the minimum cost, so this WFSA maps ab to 3.
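
The min-over-paths behaviour can be written directly as a small dynamic program over states. The sketch below is illustrative (the arc/final-state encoding is made up, not a toolkit format) and reproduces the ab example above.

```python
# Sketch: the cost a weighted acceptor assigns to a string, tropical semiring.

INF = float("inf")

def string_cost(string, start, finals, arcs):
    """finals: {state: final_cost}; arcs: list of (src, symbol, dst, cost)."""
    best = {start: 0.0}                        # best cost to reach each state so far
    for sym in string:
        step = {}
        for (src, s, dst, w) in arcs:
            if s == sym and src in best:
                cand = best[src] + w           # "times": add costs along the path
                step[dst] = min(step.get(dst, INF), cand)   # "plus": keep the best
        best = step
    return min((c + finals[q] for q, c in best.items() if q in finals), default=INF)

# The two-path example: ab is reachable with cost 3 or cost 5; the result is 3.
arcs = [(0, "a", 1, 0.0), (1, "b", 1, 1.0), (1, "d", 3, 1.0),
        (0, "a", 2, 1.0), (2, "b", 4, 2.0)]
finals = {1: 2.0, 3: 1.0, 4: 2.0}
print(string_cost(["a", "b"], 0, finals, arcs))   # 3.0
```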

Weighted finite state transducers (WFST)
(figure: 0 --a:x/1--> 1/0)
Like a WFSA, except with two labels on each arc. View it as a function from a pair of strings to a weight. This one maps (a, x) to 1 and everything else to ∞. Note: view 1 and ∞ as costs; ∞ is the 0̄ of the semiring. The symbols on the left and right are termed input and output symbols.

Composition of WFSTs
(figure: A has arcs 0 --a:x/1--> 1/0 and 0 --b:x/1--> 1/0; B is 0 --x:y/1--> 1/0; C = A ∘ B has arcs 0 --a:y/2--> 1/0 and 0 --b:y/2--> 1/0)
Notation: C = A ∘ B means C is A composed with B. In special cases, composition is similar to function composition. The composition algorithm matches up the inner symbols, i.e. those on the output (right) side of A and the input (left) side of B.

Composition algorithm
Ignoring ϵ symbols, the algorithm is quite simple:
- States in C correspond to tuples of (state in A, state in B), but some of these may be inaccessible and pruned away.
- Maintain a queue of pairs, initially containing the single pair (0, 0) (the start states).
- When processing a pair (s, t), consider each pair of (arc a out of s), (arc b out of t). If they have matching symbols (output of a, input of b), create a transition to the state in C corresponding to (next-state of a, next-state of b); if that pair has not been seen before, add it to the queue.
With ϵ involved, one needs to be careful to avoid redundant paths. A minimal sketch of the ϵ-free case follows.
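
Below is a minimal, ϵ-free composition sketch in the tropical semiring. The FST encoding (a tuple of start state, final costs and an arc list) is invented for illustration; it is not the OpenFst API.

```python
from collections import deque

# Sketch of WFST composition (epsilon-free case, tropical semiring).
# An FST is (start, finals, arcs) with finals = {state: final_cost}
# and arcs = list of (src, in_label, out_label, dst, cost).

def compose(fst_a, fst_b):
    start_a, finals_a, arcs_a = fst_a
    start_b, finals_b, arcs_b = fst_b
    start = (start_a, start_b)
    queue, seen = deque([start]), {start}
    arcs, finals = [], {}
    while queue:
        s, t = queue.popleft()
        if s in finals_a and t in finals_b:
            finals[(s, t)] = finals_a[s] + finals_b[t]
        for (sa, ilab, mid, da, wa) in arcs_a:
            if sa != s:
                continue
            for (sb, mid2, olab, db, wb) in arcs_b:
                # match the output label of A against the input label of B
                if sb == t and mid2 == mid:
                    dst = (da, db)
                    arcs.append(((s, t), ilab, olab, dst, wa + wb))
                    if dst not in seen:
                        seen.add(dst)
                        queue.append(dst)
    return start, finals, arcs

# The slide's example: A maps a->x and b->x, B maps x->y; C maps a->y and b->y with cost 2.
A = (0, {1: 0.0}, [(0, "a", "x", 1, 1.0), (0, "b", "x", 1, 1.0)])
B = (0, {1: 0.0}, [(0, "x", "y", 1, 1.0)])
print(compose(A, B)[2])   # two arcs, a:y/2.0 and b:y/2.0
```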

Construction of decoding network
WFST approach [Mohri et al.]: exploit several knowledge sources (lexicon, grammar, phonetics) to find the most likely spoken word sequence:
HCLG = H ∘ C ∘ L ∘ G    (1)
- G: probabilistic grammar or language model acceptor (words)
- L: lexicon (phones to words)
- C: context-dependent relabeling (context-dependent phones to phones)
- H: HMM structure (PDF labels to context-dependent phones)
Create H, C, L and G separately and compose them together.

Language model
Estimate the probability of a word sequence W:
P(W) = P(w_1, w_2, ..., w_N)    (2)
P(W) = P(w_1) P(w_2 | w_1) P(w_3 | w_1, w_2) ... P(w_N | w_1, ..., w_{N-1})    (3)
Approximate by sharing histories h (e.g. bigram history):
P(W) ≈ P(w_1) P(w_2 | w_1) P(w_3 | w_2) ... P(w_N | w_{N-1})    (4)
What to do if a certain history was not observed in the training texts?
- statistical smoothing of the distribution
- interpolation of different orders of history
- backing off to a shorter history

Language model acceptor G (figure)

Language models (ARPA back-off)
\1-grams:
-5.2347 a -3.3
-3.4568 b 0.0000
<s> -2.5
-4.3333 </s>
\2-grams:
-1.4568 a b
-1.3049 <s> a
-1.78 b a
-2.30 b </s>
(figure: the corresponding acceptor G, with history states for <s> (SB), a, b, a backoff state and a sentence-end state SE; bigram arcs carry the cost -ln P, e.g. <s> --a/3.0046--> a from log10 P(a|<s>) = -1.3049; unigram arcs leave the backoff state, e.g. b/7.9595 and a/12.053; <eps> arcs into the backoff state carry the backoff weights, e.g. a --<eps>/7.5985--> backoff; </s> arcs such as b --</s>/5.2959--> SE end the sentence)
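
The arc costs in G are simply the ARPA log10 values converted into negative natural-log costs. A small sketch of that conversion (the helper name is made up), using the entries from the excerpt above:

```python
import math

# Sketch: converting ARPA log10 probabilities / backoff weights into
# negative natural-log arc costs for the G acceptor.

def arpa_to_cost(log10_value):
    """ARPA stores log10 values; FST arcs carry -ln(p) = -log10(p) * ln(10)."""
    return -log10_value * math.log(10.0)

print(round(arpa_to_cost(-1.3049), 4))  # 3.0046  -> the <s> --a--> a bigram arc
print(round(arpa_to_cost(-1.4568), 4))  # 3.3544  -> the a --b--> b bigram arc
print(round(arpa_to_cost(-5.2347), 4))  # 12.0533 -> the backoff --a--> a unigram arc
print(round(arpa_to_cost(-3.3), 4))     # 7.5985  -> the a --<eps>--> backoff arc
```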

Pronunciation lexicon L
A          ax #1
ABERDEEN   ae b er d iy n
ABOARD     ax b r ao dd
ADD        ae dd #1
ABOVE      ax b ah v
Added disambiguation symbols: if a phone sequence can output different words ("I scream for ice cream."), the lexicon becomes non-deterministic; introduce disambiguation symbols and remove them at the last stage.
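
A minimal sketch of the idea, assuming a toy lexicon: homophones get distinct disambiguation symbols appended so that L can be determinized. (Simplified; real recipes also disambiguate pronunciations that are prefixes of other pronunciations.)

```python
from collections import defaultdict

# Sketch: append #1, #2, ... to pronunciations that are shared by several words.

def add_disambig(lexicon):
    by_pron = defaultdict(list)
    for word, phones in lexicon:
        by_pron[tuple(phones)].append(word)
    out = []
    for word, phones in lexicon:
        words = by_pron[tuple(phones)]
        if len(words) > 1:
            out.append((word, phones + ["#%d" % (words.index(word) + 1)]))
        else:
            out.append((word, phones))
    return out

lexicon = [("red", ["r", "eh", "d"]), ("read", ["r", "eh", "d"]),
           ("above", ["ax", "b", "ah", "v"])]
for word, phones in add_disambig(lexicon):
    print(word, " ".join(phones))
# red r eh d #1
# read r eh d #2
# above ax b ah v
```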

Deterministic WFSTs
(figure: 0 --a:x/1--> 1/0 and 0 --b:x/1--> 1/0)
Deterministic is taken to mean deterministic on the input symbol, i.e. no state can have more than one arc out of it with the same input symbol. Some interpretations (e.g. Mohri/AT&T/OpenFst) allow ϵ input symbols (i.e. being ϵ-free is a separate issue). I prefer a definition that disallows epsilons, except as necessary to encode a string of output symbols on an arc. Regardless of the definition, not all WFSTs can be determinized.
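
Checking this property is straightforward; the sketch below (illustrative encoding, stricter ϵ-free variant) flags a transducer as non-deterministic if any state has two outgoing arcs with the same input label, or any input ϵ.

```python
# Sketch: is a transducer deterministic on its input labels?
# Arcs are (src, in_label, out_label, dst, cost), as in the earlier sketches.

EPS = "<eps>"

def is_input_deterministic(arcs):
    seen = set()
    for (src, ilabel, olabel, dst, cost) in arcs:
        if ilabel == EPS:
            return False         # stricter variant: no input epsilons allowed
        if (src, ilabel) in seen:
            return False         # two arcs from src share the same input label
        seen.add((src, ilabel))
    return True

# The figure's example: arcs a:x and b:x out of state 0 -> deterministic.
arcs = [(0, "a", "x", 1, 1.0), (0, "b", "x", 1, 1.0)]
print(is_input_deterministic(arcs))                           # True
print(is_input_deterministic(arcs + [(0, "a", "y", 2, 1.0)])) # False
```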

Determinization (like making a tree-structured lexicon) (figure)

Minimal deterministic WFSTs
(figure: left FSA: 0 --a--> 1, 1 --b--> 2, 1 --c--> 3, 2 --d--> 4, 3 --d--> 5; right FSA: the same, except that both d arcs lead to the single final state 4)
Here, the left FSA is not minimal but the right one is. Minimal is normally only applied to deterministic FSAs. Think of it as suffix sharing, or combining redundant states. It's useful to save space (but not as crucial as determinization, for ASR).

Minimization (like suffix sharing) (figure)

Pronunciation lexicon L (figure)

Context-dependency of phonemes
So far we have used phonemes independent of context, but co-articulation means that the pronunciation of a phone changes with the surrounding phones.

Context-dependency transducer C
Introduce context-dependent phones:
- tri-phones: each model depends on the predecessor and the successor phoneme: a-b-c (b with left context a and right context c)
- implemented as a context-dependency transducer C
- input: context-dependent phone (triphone); output: context-independent phone (phone)
Shown is one path of it:
(eps,eps) --#-1/a--> (eps,a) --eps-a-b/b--> (a,b) --a-b-c/c--> (b,c) --b-c-d/d--> (c,d) --c-d-eps/$--> (d,eps)
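
The relabeling that C encodes can be illustrated in the inverse direction: expanding a phone sequence into triphone labels of the form left-center-right. A minimal sketch (the label format follows the path shown above):

```python
# Sketch: expand a phone sequence into triphone labels left-center-right.

def to_triphones(phones):
    padded = ["eps"] + phones + ["eps"]
    return ["%s-%s-%s" % (padded[i - 1], padded[i], padded[i + 1])
            for i in range(1, len(padded) - 1)]

print(to_triphones(["a", "b", "c", "d"]))
# ['eps-a-b', 'a-b-c', 'b-c-d', 'c-d-eps']
```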

Context-dependency transducer C
(Figure 6: a context-dependent triphone transducer transition, (a) non-deterministic, (b) deterministic; plus the full context-dependency transducer over a two-phone alphabet {x, y})

HMM as transducer (figure)

HMM as transducer, monophone case (figure)

Phonetic decision tree
Too many context-dependent models (N^3)! Clustering: determine the model-id (Gaussian) based on the phoneme context and the state within the HMM, using questions about the context (sets of phones).
(figure: decision tree for the phone m, with questions such as Left=vowel? and Right=fricative?, each answered yes/no)
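
The clustering amounts to walking a tree of questions about the context to arrive at a model id. The sketch below is purely illustrative; the phone sets, questions and PDF ids are invented.

```python
# Sketch: pick a PDF (model) id by answering questions about the phone context.

VOWELS = {"a", "e", "i", "o", "u", "ax", "iy"}
FRICATIVES = {"f", "v", "s", "z", "th", "sh"}

def pdf_id_for(center, left, right, hmm_state):
    # Toy tree for the phone "m": first ask about the left context,
    # then about the right context (ids are made up).
    if center == "m" and hmm_state == 1:
        if left in VOWELS:
            return 101 if right in FRICATIVES else 102
        return 103
    return 0   # fallback id for everything this toy tree does not cover

print(pdf_id_for("m", "a", "s", 1))   # 101 (left is a vowel, right is a fricative)
print(pdf_id_for("m", "t", "a", 1))   # 103 (left is not a vowel)
```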

HMM transducer H
(figure: the H transducer for the monophone case, without self-loops; input labels are transition/PDF ids, output labels are phones such as aa, ae, sil and disambiguation symbols such as #0)

Construction of decoding network
WFST approach by [Mohri et al.]:
HCLG = rds(min(det(H ∘ det(C ∘ det(L ∘ G)))))    (5)
- rds: remove disambiguation symbols
- min: minimization, includes weight pushing
- det: determinization
Kaldi toolkit [Povey et al.]:
HCLG = asl(min(rds(det(Ha ∘ min(det(C ∘ min(det(L ∘ G))))))))    (6)
- asl: add self-loops
- rds: remove disambiguation symbols
- Ha: the H transducer without self-loops

Decoding graph construction (complexities)
- We have to do things in a careful order, or the algorithms "blow up".
- Determinization for WFSTs can fail: we need to insert "disambiguation symbols" into the lexicon and "propagate these through" H and C.
- We need to guarantee that the final HCLG is stochastic, i.e. that it sums to one, like a properly normalized HMM. This is needed for optimal pruning (discarding unlikely paths). It is usually done by weight pushing, but the standard algorithm can fail, because the FST representation of back-off LMs is non-stochastic.
- We want to recover the phone sequence from the recognized path (words), and sometimes also the model indices (PDF-ids) and the HCLG arcs that were used in the best path.

Finding the best path (figure)

(figure: the acceptor U for a 3-frame utterance: states 0 to 3, with one arc per acoustic model between consecutive states, labeled PDF-id/acoustic cost, e.g. 1/4.86, 2/4.94, 3/5.16, ...)
First, a WFST definition of the decoding problem. Let U be an FST that encodes the acoustic scores of an utterance (as above). Let S = U ∘ HCLG be called the search graph for the utterance. Note: if U has N frames (3, above), then the number of states in S is (N + 1) times the number of states in HCLG, i.e. like N + 1 copies of HCLG.
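
Building U is mechanical: one arc per (frame, PDF id) pair, carrying the acoustic cost. The sketch below is illustrative; the numbers loosely follow the figure, but the exact frame grouping is invented, and the arc encoding matches the earlier sketches.

```python
# Sketch: build the utterance acceptor U from per-frame acoustic costs.
# Arcs are (src, in_label, out_label, dst, cost); labels are PDF ids.

def make_utterance_fst(acoustic_costs):
    arcs = []
    for t, frame in enumerate(acoustic_costs):
        for pdf_id, cost in frame.items():
            arcs.append((t, pdf_id, pdf_id, t + 1, cost))
    return 0, {len(acoustic_costs): 0.0}, arcs   # start, finals, arcs

# Three frames, a few PDF ids per frame (costs are -log acoustic likelihoods).
costs = [{1: 4.86, 2: 4.94, 3: 5.16},
         {1: 4.16, 3: 5.31, 4: 5.91},
         {2: 5.44, 3: 6.31, 4: 5.02}]
U = make_utterance_fst(costs)
print(len(U[2]))   # 9 arcs: one per (frame, PDF id) pair
# S = compose(U, HCLG) would then give the per-utterance search graph.
```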

Viterbi algorithm, trellis (figure)

With beam pruning, we search a subgraph of S. The set of active states on all frames, with arcs linking them as appropriate, is a subgraph of S; let this be called the beam-pruned subgraph of S, and call it B. A standard speech recognition decoder finds the best path through B. In our case, the output of the decoder is a linear WFST that consists of this best path. It contains the following useful information:
- the word sequence, as output symbols
- the state alignment, as input symbols
- the cost of the path, as the total weight
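
Extracting that best path is a shortest-path problem over B. The sketch below (illustrative encoding, non-negative costs assumed, not the actual decoder) returns the total cost together with the input/output labels along the best path.

```python
import heapq

# Sketch: lowest-cost path through a weighted transducer such as the
# beam-pruned subgraph B; returns the total cost plus the (input, output)
# label pairs along the path. Arc encoding as in the earlier sketches.

def best_path(start, finals, arcs):
    out = {}
    for a in arcs:
        out.setdefault(a[0], []).append(a)
    # Virtual arcs into a single super-final state, carrying the final costs.
    for q, fc in finals.items():
        out.setdefault(q, []).append((q, None, None, "SUPER_FINAL", fc))
    heap, visited, n = [(0.0, 0, start, [])], set(), 1
    while heap:
        cost, _, state, path = heapq.heappop(heap)
        if state == "SUPER_FINAL":
            return cost, [p for p in path if p != (None, None)]
        if state in visited:
            continue
        visited.add(state)
        for (_, ilab, olab, dst, w) in out.get(state, []):
            if dst not in visited:
                heapq.heappush(heap, (cost + w, n, dst, path + [(ilab, olab)]))
                n += 1
    return float("inf"), []

# Toy graph with two competing hypotheses; the cheaper one wins.
arcs = [(0, "s1", "ONE", 1, 2.0), (0, "s2", "TWO", 2, 1.5),
        (1, "s3", "<eps>", 3, 0.5), (2, "s4", "<eps>", 3, 0.5)]
cost, path = best_path(0, {3: 0.0}, arcs)
print(cost)                                    # 2.0
print([o for _, o in path if o != "<eps>"])    # ['TWO']
```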

Decoding output
utt1 [ 2 6 6 6 6 10 ] [ 614 613 613 613 711 ] [ 122 123
utt1 SIL th ax
utt1 <s> THE
(example decoder output for utterance utt1: the state-level alignment, the phone sequence and the word sequence)

Word Lattice / Word Graph
Word lattice: a compact representation of the search space.
(figure: a lattice of competing word hypotheses with scores, e.g. COMPUTERS 0.7 vs. COMPUTES 0.3, then ARE / A / EVERY / VERY / VARIED / AVERAGE, then OBJECTIVE / ATTRACTIVE / EFFECTIVE / ACTIVE)

Lattices as WFSTs
The word "lattice" is used in the ASR literature for:
- some kind of compact representation of the alternative word hypotheses for an utterance, like an N-best list but with less redundancy
- usually with time information, sometimes with state or word alignment information
- generally a directed acyclic graph with only one start node and only one end node

Lattice generation with WFSTs [Povey12]
Basic process (not memory efficient):
- Generate the beam-pruned subgraph B of the search graph S; the states in B correspond to the active states on particular frames.
- Prune B with a beam to get a pruned version P.
- Convert P to an acceptor and lattice-determinize it to get a deterministic acceptor A: no two paths have the same word sequence (keep the best one).
- Prune A with a beam to get the final lattice L (in acceptor form).

Finite State Transducers for ASR
Pros:
- Fast: compact/minimal search space due to the combined minimization of lexicon, phonemes and HMMs.
- Simple: easy construction of the recognizer by composition from states, HMMs, phonemes, lexicon and grammar.
- Flexible: whatever new knowledge sources are added, the compose/optimize/search procedure remains the same.
Cons:
- Composition of complex models generates a huge WFST; the search space increases and huge memory is required.
- Especially: how to deal with huge language models?

Resources
OpenFst, http://www.openfst.org/ : library developed at Google Research (M. Riley, J. Schalkwyk, W. Skut) and NYU's Courant Institute (C. Allauzen, M. Mohri).
[Mohri08] M. Mohri et al., Speech Recognition with Weighted Finite-State Transducers.
[Povey11] D. Povey et al., The Kaldi Speech Recognition Toolkit.
[Povey12] D. Povey et al., Generating Exact Lattices in the WFST Framework.