Automatic Speech Recognition (CS753) Lecture 5: Hidden Markov Models (Part I) Instructor: Preethi Jyothi August 7, 2017
Recap: WFSTs applied to ASR
WFST-based ASR System: Acoustic Indices → HMMs (H) → Triphones → Context Transducer (C) → Monophones → Pronunciation (L) → Words → Language (G) → Word Sequence
WFST-based ASR System — H (HMM transducer): one 3-state HMM FST for each triphone (input labels are acoustic state indices f0, f1, ...; the triphone, e.g. a/a_b, appears as the output label on the first arc, ε on the rest). Taking the union of these per-triphone FSTs and then its closure gives the resulting H FST.
WFST-based ASR System — C (Context Transducer): maps triphones to monophones. Arc labels read monophone : phone/left-context_right-context (e.g. x:x/y_z). Figure reproduced from Weighted Finite State Transducers in Speech Recognition, Mohri et al., 2002.
WFST-based ASR System — L (Pronunciation): maps monophone sequences to words; e.g. paths for "data" (d [ey|ae] [t|dx] ax) and "dew" (d uw), with pronunciation-variant probabilities on the arcs (ey:ε/0.5, ae:ε/0.5, t:ε/0.3, dx:ε/0.7, ...). Figure reproduced from Weighted Finite State Transducers in Speech Recognition, Mohri et al., 2002.
WFST-based ASR System — G (Language): an acceptor that assigns weights to word sequences; the example grammar covers words such as the, birds, animals, boy, are, were, is, walking, with negative-log weights on the arcs (e.g. are/0.693, birds/0.404, animals/1.789, were/0.693, boy/1.789).
Constructing the Decoding Graph: construct the decoding search graph D using H ∘ C ∘ L ∘ G, which maps acoustic states to word sequences. Carefully construct D using optimization algorithms: D = min(det(H ∘ det(C ∘ det(L ∘ G)))). How do we decode a test utterance O using D? D is typically traversed dynamically; search algorithms will be covered later in the semester.
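As a rough illustration of this construction (not from the lecture), the sketch below strings the compositions together. Here compose, determinize, and minimize are hypothetical stand-ins for the corresponding operations of an FST toolkit such as OpenFst, and H, C, L, G are assumed to be already-built transducers.

```python
# Sketch of the decoding-graph construction D = min(det(H o det(C o det(L o G)))).
# `compose`, `determinize` and `minimize` are hypothetical stand-ins for the
# corresponding FST-toolkit operations; H, C, L, G are assumed to be prebuilt.

def build_decoding_graph(H, C, L, G, compose, determinize, minimize):
    """Compose the four transducers inside-out, determinizing at each step,
    and minimize the final result."""
    LG   = determinize(compose(L, G))     # monophones -> word sequences, determinized
    CLG  = determinize(compose(C, LG))    # triphones  -> word sequences
    HCLG = determinize(compose(H, CLG))   # acoustic state indices -> word sequences
    return minimize(HCLG)                 # D: the optimized decoding graph
```

The inside-out order matters: determinizing L ∘ G first keeps the intermediate machines small before the larger H and C transducers are composed in.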
Before constructing D, let's understand H in more detail: one 3-state HMM FST for each triphone; the union of these FSTs, followed by closure, gives the resulting H transducer.
Hidden Markov Models (HMMs). Following slides contain figures/material from "Hidden Markov Models", Chapter 9, Speech and Language Processing, D. Jurafsky and J. H. Martin, 2016. (https://web.stanford.edu/~jurafsky/slp3/9.pdf)
Markov Chains

Figure 9.1 (Jurafsky & Martin): (a) a Markov chain for weather, with states HOT, COLD and WARM, and (b) a Markov chain for words (is, snow, white). A Markov chain is specified by the structure, the transition probabilities between states, and the start and end states.

The probabilities on all arcs leaving a node must sum to 1, and the input sequence uniquely determines which states the automaton will go through. Because it cannot represent inherently ambiguous problems, a Markov chain is only useful for assigning probabilities to unambiguous sequences. Figure 9.1a assigns a probability to a sequence of weather events over the vocabulary HOT, COLD, WARM; Figure 9.1b assigns a probability to a sequence of words w_1 ... w_n and in fact represents a bigram language model. Given the two models in Fig. 9.1, we can assign a probability to any sequence from our vocabulary.

More formally, a Markov chain is a kind of probabilistic graphical model, specified by the following components:
- Q = q_1 q_2 ... q_N: a set of N states
- A = a_01 a_02 ... a_n1 ... a_nn: a transition probability matrix, each a_ij representing the probability of moving from state i to state j, such that Σ_{j=1}^n a_ij = 1 for all i
- q_0, q_F: a special start state and end (final) state that are not associated with observations

Markov Assumption: P(q_i | q_1 ... q_{i-1}) = P(q_i | q_{i-1})    (9.1)

Since each a_ij expresses the probability P(q_j | q_i), the laws of probability require that the outgoing arcs from a given state sum to 1: Σ_{j=1}^n a_ij = 1 for all i.    (9.2)

An alternative representation that is sometimes used for Markov chains doesn't rely on a start or end state, instead representing the distribution over initial states and the accepting states explicitly:
- π = π_1, π_2, ..., π_N: an initial probability distribution over states; π_i is the probability that the Markov chain will start in state i (some states j may have π_j = 0, meaning they cannot be initial states), with Σ_{i=1}^N π_i = 1
- QA = {q_x, q_y, ...}: a set QA ⊆ Q of legal accepting states
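To make the probability computation concrete, here is a small illustrative sketch (not from the chapter): under the Markov assumption, the probability of a state sequence is simply the product of the transition probabilities along its path. The numbers are only an example; they follow the ice-cream HMM transitions used later in the lecture, and the end-state probabilities of .1 are an assumption made so each row sums to 1.

```python
# Scoring a state sequence with a Markov chain: multiply the transition
# probabilities along the path, from a dedicated start state to an end state.
transitions = {                                   # illustrative numbers only
    ("<s>", "HOT"): 0.8, ("<s>", "COLD"): 0.2,
    ("HOT", "HOT"): 0.6, ("HOT", "COLD"): 0.3, ("HOT", "</s>"): 0.1,   # end prob assumed
    ("COLD", "COLD"): 0.5, ("COLD", "HOT"): 0.4, ("COLD", "</s>"): 0.1,
}

def sequence_probability(states, transitions):
    """P(q_1 ... q_n) = a_{0,q_1} * prod_i a_{q_{i-1}, q_i} * a_{q_n, F}."""
    path = ["<s>"] + list(states) + ["</s>"]
    prob = 1.0
    for prev, cur in zip(path, path[1:]):
        prob *= transitions.get((prev, cur), 0.0)
    return prob

print(sequence_probability(["HOT", "HOT", "COLD"], transitions))
# 0.8 * 0.6 * 0.3 * 0.1 = 0.0144
```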
Given a sequence of observations O, each observation an integer corresponding to the number of ice creams eaten on a given day, figure out the correct hidden sequence Q of weather states (HOT or COLD) which caused Jason to eat the ice cream.

Hidden Markov Models

Let's begin with a formal definition of a hidden Markov model, focusing on how it differs from a Markov chain. An HMM is specified by the following components:
- Q = q_1 q_2 ... q_N: a set of N states
- A = a_11 a_12 ... a_n1 ... a_nn: a transition probability matrix, each a_ij representing the probability of moving from state i to state j, such that Σ_{j=1}^n a_ij = 1 for all i
- O = o_1 o_2 ... o_T: a sequence of T observations, each one drawn from a vocabulary V = v_1, v_2, ..., v_V
- B = b_i(o_t): a sequence of observation likelihoods, also called emission probabilities, each expressing the probability of an observation o_t being generated from a state i
- q_0, q_F: a special start state and end (final) state that are not associated with observations, together with transition probabilities a_01 a_02 ... a_0n out of the start state and a_1F a_2F ... a_nF into the end state

As we noted for Markov chains, an alternative representation that is sometimes used replaces the start and end states with an initial distribution π over states.
HMM Assumptions

Figure: the ice-cream HMM with hidden states HOT and COLD. Transitions: P(HOT|start) = .8, P(COLD|start) = .2, P(HOT|HOT) = .6, P(COLD|HOT) = .3, P(HOT|COLD) = .4, P(COLD|COLD) = .5. Emissions: B_HOT: P(1|HOT) = .2, P(2|HOT) = .4, P(3|HOT) = .4; B_COLD: P(1|COLD) = .5, P(2|COLD) = .4, P(3|COLD) = .1.

Markov Assumption: P(q_i | q_1 ... q_{i-1}) = P(q_i | q_{i-1})

Output Independence: the probability of an output observation o_i depends only on the state that produced it: P(o_i | q_1 ... q_i, ..., q_T, o_1, ..., o_i, ..., o_T) = P(o_i | q_i)
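As a concrete instantiation of these components (a sketch, not from the slides), the ice-cream HMM above can be written out as NumPy arrays. The transitions into the end state (a_iF = .1 for both states) are an assumption made so that each state's outgoing probabilities sum to 1.

```python
import numpy as np

# Ice-cream HMM: hidden states HOT and COLD, observations 1-3 ice creams.
states = ["HOT", "COLD"]

pi = np.array([0.8, 0.2])                 # a_0j: P(HOT|start), P(COLD|start)

A = np.array([[0.6, 0.3],                 # a_ij between emitting states:
              [0.4, 0.5]])                # rows = from-state, cols = to-state
a_F = np.array([0.1, 0.1])                # a_iF into the end state (assumed)

B = np.array([[0.2, 0.4, 0.4],            # P(1|HOT),  P(2|HOT),  P(3|HOT)
              [0.5, 0.4, 0.1]])           # P(1|COLD), P(2|COLD), P(3|COLD)

# Sanity check: outgoing probabilities from each state sum to 1.
assert np.allclose(pi.sum(), 1.0)
assert np.allclose(A.sum(axis=1) + a_F, 1.0)
assert np.allclose(B.sum(axis=1), 1.0)
```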
Three problems for HMMs

Problem 1 (Likelihood): Given an HMM λ = (A, B) and an observation sequence O, determine the likelihood P(O | λ).
Problem 2 (Decoding): Given an observation sequence O and an HMM λ = (A, B), discover the best hidden state sequence Q.
Problem 3 (Learning): Given an observation sequence O and the set of states in the HMM, learn the HMM parameters A and B.

Computing Likelihood: Given an HMM λ = (A, B) and an observation sequence O, determine the likelihood P(O | λ).

A tutorial on hidden Markov models and selected applications in speech recognition, Rabiner, 1989
Forward Trellis a t ( j)=p(o 1,o 2...o t,q t = j l) the tth state in the sequence of states q F end end end a t ( j)= NX a t 1 (i)a ij b j (o t ) i=1 end α 1 (2)=.32 α 2 (2)=.32*.12 +.02*.08 =.040 q 2 P( ) * P(1 ).6 *.2 q 1 q 0 start P( start)*p(3 ).8 *.4 P( start) * P(3 ).2 *.1 α 1 (1) =.02 P( ) * P(1 ).3 *.5 P( ) * P(1 ).4 *.2 P( ) * P(1 ).5 *.5 α 2 (1) =.32*.15 +.02*.25 =.053 start start start 3 1 3 o 1 o 2 o 3 t
Forward Algorithm

1. Initialization: α_1(j) = a_0j b_j(o_1), 1 ≤ j ≤ N
2. Recursion (since states 0 and F are non-emitting): α_t(j) = Σ_{i=1}^N α_{t-1}(i) a_ij b_j(o_t); 1 ≤ j ≤ N, 1 < t ≤ T
3. Termination: P(O | λ) = α_T(q_F) = Σ_{i=1}^N α_T(i) a_iF
Visualizing the forward recursion

Figure: each node α_t(j) in the trellis sums contributions from all states at the previous time step, α_t(j) = Σ_i α_{t-1}(i) a_ij b_j(o_t), i.e. the previous forward values α_{t-1}(1), ..., α_{t-1}(N) weighted by the incoming transition probabilities a_1j, ..., a_Nj and scaled by the emission probability b_j(o_t).
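The recursion above translates directly into code. Below is a minimal NumPy sketch of the forward algorithm for the ice-cream HMM, again assuming end-state probabilities a_iF = .1 (state order [HOT, COLD]); on the observation sequence 3 1 3 it reproduces the trellis values α_1 = (.32, .02) and α_2 = (.040, .053) shown above.

```python
import numpy as np

# Ice-cream HMM parameters (a_iF = .1 is an assumption), state order [HOT, COLD].
pi  = np.array([0.8, 0.2])                          # a_0j
A   = np.array([[0.6, 0.3], [0.4, 0.5]])            # a_ij
a_F = np.array([0.1, 0.1])                          # a_iF
B   = np.array([[0.2, 0.4, 0.4], [0.5, 0.4, 0.1]])  # b_j(o), o in {1, 2, 3}

def forward(O):
    """Return P(O | lambda) and the full trellis of alpha_t(j) values."""
    T, N = len(O), len(pi)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, O[0] - 1]                  # initialization
    for t in range(1, T):                           # recursion
        alpha[t] = (alpha[t - 1] @ A) * B[:, O[t] - 1]
    return alpha[-1] @ a_F, alpha                   # termination

likelihood, alpha = forward([3, 1, 3])
assert np.allclose(alpha[0], [0.32, 0.02])          # alpha_1 from the trellis
assert np.allclose(alpha[1], [0.040, 0.053])        # alpha_2 from the trellis
print(likelihood)
```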
Three problems for HMMs

Problem 1 (Likelihood): Given an HMM λ = (A, B) and an observation sequence O, determine the likelihood P(O | λ).
Problem 2 (Decoding): Given an observation sequence O and an HMM λ = (A, B), discover the best hidden state sequence Q.
Problem 3 (Learning): Given an observation sequence O and the set of states in the HMM, learn the HMM parameters A and B.

Decoding: Given as input an HMM λ = (A, B) and a sequence of observations O = o_1, o_2, ..., o_T, find the most probable sequence of states Q = q_1 q_2 q_3 ... q_T.
Viterbi Trellis

v_t(j) = max_{q_0, q_1, ..., q_{t-1}} P(q_0, q_1 ... q_{t-1}, o_1, o_2 ... o_t, q_t = j | λ)

v_t(j) = max_{i=1}^N v_{t-1}(i) a_ij b_j(o_t)

Figure: Viterbi trellis for the observation sequence 3 1 3. v_1(2) = P(H|start) * P(3|H) = .8 * .4 = .32; v_1(1) = P(C|start) * P(3|C) = .2 * .1 = .02; v_2(2) = max(.32*.12, .02*.08) = .038; v_2(1) = max(.32*.15, .02*.25) = .048.
Viterbi recursion

1. Initialization: v_1(j) = a_0j b_j(o_1), 1 ≤ j ≤ N; bt_1(j) = 0
2. Recursion (recall that states 0 and q_F are non-emitting):
   v_t(j) = max_{i=1}^N v_{t-1}(i) a_ij b_j(o_t); 1 ≤ j ≤ N, 1 < t ≤ T
   bt_t(j) = argmax_{i=1}^N v_{t-1}(i) a_ij b_j(o_t); 1 ≤ j ≤ N, 1 < t ≤ T
3. Termination:
   The best score: P* = v_T(q_F) = max_{i=1}^N v_T(i) a_iF
   The start of backtrace: q_T* = bt_T(q_F) = argmax_{i=1}^N v_T(i) a_iF
Viterbi backtrace

Figure: the same trellis (observations 3 1 3), with the best predecessor of each node marked; starting from the best final state, the backtrace pointers bt_t(j) are followed backwards to recover the most probable state sequence. v_1(2) = .32, v_1(1) = .02, v_2(2) = max(.32*.12, .02*.08) = .038, v_2(1) = max(.32*.15, .02*.25) = .048.
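A matching NumPy sketch of the Viterbi recursion and backtrace for the same ice-cream HMM (again assuming a_iF = .1, state order [HOT, COLD]); on the observation sequence 3 1 3 the time-2 scores come out as .038 for HOT and .048 for COLD, as in the trellis above.

```python
import numpy as np

# Ice-cream HMM parameters (a_iF = .1 is an assumption), state order [HOT, COLD].
states = ["HOT", "COLD"]
pi  = np.array([0.8, 0.2])                          # a_0j
A   = np.array([[0.6, 0.3], [0.4, 0.5]])            # a_ij
a_F = np.array([0.1, 0.1])                          # a_iF
B   = np.array([[0.2, 0.4, 0.4], [0.5, 0.4, 0.1]])  # b_j(o), o in {1, 2, 3}

def viterbi(O):
    """Return the best state sequence and its score P* for observations O."""
    T, N = len(O), len(pi)
    v  = np.zeros((T, N))
    bt = np.zeros((T, N), dtype=int)
    v[0] = pi * B[:, O[0] - 1]                      # initialization
    for t in range(1, T):                           # recursion
        scores = v[t - 1][:, None] * A              # scores[i, j] = v_{t-1}(i) a_ij
        bt[t] = scores.argmax(axis=0)               # best predecessor for each j
        v[t]  = scores.max(axis=0) * B[:, O[t] - 1]
    best_last  = int((v[-1] * a_F).argmax())        # termination
    best_score = (v[-1] * a_F).max()
    path = [best_last]                              # backtrace
    for t in range(T - 1, 0, -1):
        path.append(bt[t][path[-1]])
    return [states[i] for i in reversed(path)], best_score

print(viterbi([3, 1, 3]))
```

Compared with the forward algorithm, the only changes are replacing the sum over predecessors with a max and storing the argmax so the best path can be read off backwards.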