Automatic Speech Recognition (CS753)



1 Automatic Speech Recognition (CS753) Lecture 5: Hidden Markov Models (Part I) Instructor: Preethi Jyothi

2 OpenFst Cheat Sheet


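3 Quick Intro to OpenFst

A.txt (one arc per line: source state, destination state, input label, output label; a line holding a single state marks it as final):
  0 1 an a
  1 2 <eps> n
  0 2 a a
  2

Input alphabet (in.txt):
  <eps> 0
  an 1
  a 2

Output alphabet (out.txt):
  <eps> 0
  a 1
  n 2

The label 0 is reserved for epsilon.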

4 Quick Intro to OpenFst
[Figure: the FST A drawn as a graph, with arcs an:a, <eps>:n and a:a.]

5 Compiling & Printing FSTs
The text FSTs need to be compiled into binary objects before further use with OpenFst utilities.
Command used to compile:
  fstcompile --isymbols=in.txt --osymbols=out.txt A.txt A.fst
Get back the text FST using a print command with the binary file:
  fstprint --isymbols=in.txt --osymbols=out.txt A.fst A.txt

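6 Drawing FSTs
Small FSTs can be visualized easily using the draw tool, piping its output to Graphviz:
  fstdraw --isymbols=in.txt --osymbols=out.txt A.fst | dot -Tpdf > A.pdf
[Figure: the drawn FST, with arcs an:a from state 0 to state 1, <eps>:n from state 1 to state 2, and a:a from state 0 to state 2.]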

7 Fairly large FST!

8 Hidden Markov Models (HMMs)
The following slides contain figures/material from "Hidden Markov Models", Chapter 9, Speech and Language Processing, D. Jurafsky and J. H. Martin.

9 Markov Chains

A Markov chain is a weighted automaton in which the weights are probabilities (the probabilities on all arcs leaving a node must sum to 1) and in which the input sequence uniquely determines which states the automaton will go through. Because it can't represent inherently ambiguous problems, a Markov chain is only useful for assigning probabilities to unambiguous sequences.

[Figure 9.1: A Markov chain for weather (a) and one for words (b). A Markov chain is specified by the structure, the transition between states, and the start and end states.]

Figure 9.1a shows a Markov chain for assigning a probability to a sequence of weather events, for which the vocabulary consists of HOT, COLD, and WARM. Figure 9.1b shows another simple example of a Markov chain for assigning a probability to a sequence of words $w_1 \ldots w_n$. This Markov chain should be familiar; in fact, it represents a bigram language model. Given the two models in Fig. 9.1, we can assign a probability to any sequence from our vocabulary. We go over how to do this shortly.

First, let's be more formal and view a Markov chain as a kind of probabilistic graphical model: a way of representing probabilistic assumptions in a graph. A Markov chain is specified by the following components:

  $Q = q_1 q_2 \ldots q_N$: a set of N states
  $A = a_{01} a_{02} \ldots a_{n1} \ldots a_{nn}$: a transition probability matrix A, each $a_{ij}$ representing the probability of moving from state i to state j, s.t. $\sum_{j=1}^{n} a_{ij} = 1 \;\; \forall i$
  $q_0, q_F$: a special start state and end (final) state that are not associated with observations

Figure 9.1 shows that we represent the states (including start and end states) as nodes in the graph, and the transitions as edges between nodes.

A Markov chain embodies an important assumption about these probabilities. In a first-order Markov chain, the probability of a particular state depends only on the previous state:

  Markov Assumption: $P(q_i \mid q_1 \ldots q_{i-1}) = P(q_i \mid q_{i-1})$   (9.1)

Note that because each $a_{ij}$ expresses the probability $p(q_j \mid q_i)$, the laws of probability require that the values of the outgoing arcs from a given state must sum to 1:

  $\sum_{j=1}^{n} a_{ij} = 1 \;\; \forall i$   (9.2)

An alternative representation that is sometimes used for Markov chains doesn't rely on a start or end state, instead representing the distribution over initial states and accepting states explicitly:

  $\pi = \pi_1, \pi_2, \ldots, \pi_N$: an initial probability distribution over states. $\pi_i$ is the probability that the Markov chain will start in state i. Some states j may have $\pi_j = 0$, meaning that they cannot be initial states. Also, $\sum_{i=1}^{n} \pi_i = 1$
  $QA = \{q_x, q_y \ldots\}$: a set $QA \subset Q$ of legal accepting states

[Figure 9.2: Another representation of the same Markov chain for weather shown in Fig. 9.1. Instead of using a special start state with $a_{01}$ transition probabilities, we use the $\pi$ vector, which represents the distribution over starting state probabilities.]
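As a minimal sketch of how a Markov chain assigns a probability to a sequence, the snippet below multiplies the initial probability of the first state by the transition probabilities along the sequence, using the $\pi$ representation. The numeric values in `PI` and `A`, and the function name `sequence_probability`, are hypothetical and chosen only for illustration; they are not taken from the slides or the textbook figure.

```python
# Hypothetical initial distribution (pi) over the weather states; sums to 1.
PI = {"HOT": 0.5, "COLD": 0.3, "WARM": 0.2}

# Hypothetical transition matrix: A[i][j] = P(next state j | current state i).
# Each row sums to 1, as required by equation (9.2).
A = {
    "HOT":  {"HOT": 0.6, "COLD": 0.1, "WARM": 0.3},
    "COLD": {"HOT": 0.1, "COLD": 0.7, "WARM": 0.2},
    "WARM": {"HOT": 0.3, "COLD": 0.2, "WARM": 0.5},
}


def sequence_probability(seq, initial, trans):
    """Probability of a state sequence under a first-order Markov chain."""
    prob = initial[seq[0]]
    # By the Markov assumption, each step depends only on the previous state.
    for prev, cur in zip(seq, seq[1:]):
        prob *= trans[prev][cur]
    return prob


if __name__ == "__main__":
    print(sequence_probability(["HOT", "HOT", "COLD"], PI, A))
```

With these made-up numbers, the sequence HOT HOT COLD gets probability 0.5 * 0.6 * 0.1.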

10 Hidden Markov Model

Given a sequence of observations O, each observation an integer corresponding to the number of ice creams eaten on a given day, figure out the correct hidden sequence Q of weather states (H or C) which caused Jason to eat the ice cream.

Let's begin with a formal definition of a hidden Markov model, focusing on how it differs from a Markov chain. An HMM is specified by the following components:

  $Q = q_1 q_2 \ldots q_N$: a set of N states
  $A = a_{11} a_{12} \ldots a_{n1} \ldots a_{nn}$: a transition probability matrix A, each $a_{ij}$ representing the probability of moving from state i to state j, s.t. $\sum_{j=1}^{n} a_{ij} = 1 \;\; \forall i$
  $O = o_1 o_2 \ldots o_T$: a sequence of T observations, each one drawn from a vocabulary $V = v_1, v_2, \ldots, v_V$
  $B = b_i(o_t)$: a sequence of observation likelihoods, also called emission probabilities, each expressing the probability of an observation $o_t$ being generated from a state i
  $q_0, q_F$: a special start state and end (final) state that are not associated with observations, together with transition probabilities $a_{01} a_{02} \ldots a_{0n}$ out of the start state and $a_{1F} a_{2F} \ldots a_{nF}$ into the end state

As we noted for Markov chains, an alternative representation that is sometimes

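11 HMM Assumptions

[Figure: the ice-cream HMM, with a start state, an end state, and the hidden states HOT and COLD with emission distributions B1 and B2.]

  B1: P(1 | HOT) = .2, P(2 | HOT) = .4, P(3 | HOT) = .4
  B2: P(1 | COLD) = .5, P(2 | COLD) = .4, P(3 | COLD) = .1

Markov Assumption: $P(q_i \mid q_1 \ldots q_{i-1}) = P(q_i \mid q_{i-1})$

Output Independence: the probability of an output observation $o_i$ depends only on the state $q_i$ that produced it: $P(o_i \mid q_1 \ldots q_i, \ldots, q_T, o_1, \ldots, o_i, \ldots, o_T) = P(o_i \mid q_i)$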
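Taken together, the two assumptions let the joint probability of an observation sequence and one particular hidden state sequence factorize into per-step terms. The worked equation below is a sketch of that step. It ignores the transition into the end state, and the transition values P(H | start) = .8, P(H | H) = .6 and P(C | H) = .3 are read off the trellis slides that follow, so treat the example numbers as an illustration only.

```latex
P(O, Q \mid \lambda)
  = \prod_{i=1}^{T} P(q_i \mid q_{i-1}) \, P(o_i \mid q_i),
  \qquad q_0 = \text{start}

% Example: O = 3\ 1\ 3 and Q = H\ H\ C
P(3\,1\,3,\; H\,H\,C)
  = P(H \mid \text{start})\, P(3 \mid H)
    \cdot P(H \mid H)\, P(1 \mid H)
    \cdot P(C \mid H)\, P(3 \mid C)
  = (.8 \times .4)(.6 \times .2)(.3 \times .1)
  = .001152
```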

12 Three problems for HMMs
Problem 1 (Likelihood): Given an HMM $\lambda = (A, B)$ and an observation sequence O, determine the likelihood $P(O \mid \lambda)$.
Problem 2 (Decoding): Given an observation sequence O and an HMM $\lambda = (A, B)$, discover the best hidden state sequence Q.
Problem 3 (Learning): Given an observation sequence O and the set of states in the HMM, learn the HMM parameters A and B.

Computing Likelihood: Given an HMM $\lambda = (A, B)$ and an observation sequence O, determine the likelihood $P(O \mid \lambda)$.
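One way to make the likelihood problem concrete is to compute $P(O \mid \lambda)$ by brute force: enumerate every hidden state sequence, compute its joint probability with O, and add them up. The sketch below does this for the ice-cream HMM. The start, transition and emission values are the ones shown on the trellis and assumptions slides; the end-state probability of 0.1 per state (the mass left over in each row) and the observation sequence 3 1 3 are assumptions, and the function names are hypothetical.

```python
from itertools import product

STATES = ["H", "C"]
START = {"H": 0.8, "C": 0.2}                   # a_0j
TRANS = {"H": {"H": 0.6, "C": 0.3},            # a_ij
         "C": {"H": 0.4, "C": 0.5}}
EMIT = {"H": {1: 0.2, 2: 0.4, 3: 0.4},         # b_j(o)
        "C": {1: 0.5, 2: 0.4, 3: 0.1}}
# Assumption: each state moves to the end state with the leftover mass 0.1.
END = {"H": 0.1, "C": 0.1}                     # a_iF


def joint_probability(obs, path):
    """P(O, Q | lambda) for one hidden path Q, including the final transition."""
    prob = START[path[0]] * EMIT[path[0]][obs[0]]
    for prev, cur, o in zip(path, path[1:], obs[1:]):
        prob *= TRANS[prev][cur] * EMIT[cur][o]
    return prob * END[path[-1]]


def brute_force_likelihood(obs):
    """Sum the joint probability over all N**T hidden paths."""
    return sum(joint_probability(obs, path)
               for path in product(STATES, repeat=len(obs)))


if __name__ == "__main__":
    print(brute_force_likelihood([3, 1, 3]))
```

The sum runs over $N^T$ paths, so this only works for toy inputs; the forward algorithm computes the same quantity in $O(N^2 T)$ time.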

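13 Forward Trellis
$\alpha_t(j) = P(o_1, o_2 \ldots o_t, q_t = j \mid \lambda)$, where $q_t = j$ means that the $t$-th state in the sequence of states is state $j$.
$\alpha_t(j) = \sum_{i=1}^{N} \alpha_{t-1}(i)\, a_{ij}\, b_j(o_t)$

[Figure: forward trellis for the ice-cream example over observations $o_1, o_2, o_3$, with states $q_0$ (start), $q_1$ (C), $q_2$ (H) and $q_F$ (end).]

Arc weights into t = 1:
  P(H | start) * P(3 | H) = .8 * .4
  P(C | start) * P(3 | C) = .2 * .1
Arc weights into t = 2:
  P(H | H) * P(1 | H) = .6 * .2
  P(C | H) * P(1 | C) = .3 * .5
  P(H | C) * P(1 | H) = .4 * .2
  P(C | C) * P(1 | C) = .5 * .5

$\alpha_1(2) = .32$, $\alpha_1(1) = .02$
$\alpha_2(2) = .32 \times .12 + .02 \times .08 = .040$
$\alpha_2(1) = .32 \times .15 + .02 \times .25 = .053$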

14 Forward Algorithm
1. Initialization: $\alpha_1(j) = a_{0j}\, b_j(o_1)$, $1 \le j \le N$
2. Recursion (since states 0 and F are non-emitting): $\alpha_t(j) = \sum_{i=1}^{N} \alpha_{t-1}(i)\, a_{ij}\, b_j(o_t)$; $1 \le j \le N$, $1 < t \le T$
3. Termination: $P(O \mid \lambda) = \alpha_T(q_F) = \sum_{i=1}^{N} \alpha_T(i)\, a_{iF}$
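The three steps can be sketched in a few lines of Python. This is an illustrative implementation, not code from the course: the function name `forward` and the dictionary layout are choices made here, the parameter values are the ones shown on the trellis slides, and the end-state probability of 0.1 per state and the third observation (3) are assumptions.

```python
STATES = ["H", "C"]
START = {"H": 0.8, "C": 0.2}                   # a_0j
TRANS = {"H": {"H": 0.6, "C": 0.3},            # a_ij
         "C": {"H": 0.4, "C": 0.5}}
EMIT = {"H": {1: 0.2, 2: 0.4, 3: 0.4},         # b_j(o)
        "C": {1: 0.5, 2: 0.4, 3: 0.1}}
# Assumption: each state moves to the end state with the leftover mass 0.1.
END = {"H": 0.1, "C": 0.1}                     # a_iF


def forward(obs, states, start, trans, emit, end):
    """Return (alpha, likelihood); alpha[t][j] holds alpha_{t+1}(j)."""
    # 1. Initialization: alpha_1(j) = a_0j * b_j(o_1)
    alpha = [{j: start[j] * emit[j][obs[0]] for j in states}]
    # 2. Recursion: alpha_t(j) = sum_i alpha_{t-1}(i) * a_ij * b_j(o_t)
    for o in obs[1:]:
        prev = alpha[-1]
        alpha.append({
            j: sum(prev[i] * trans[i][j] for i in states) * emit[j][o]
            for j in states
        })
    # 3. Termination: P(O | lambda) = sum_i alpha_T(i) * a_iF
    likelihood = sum(alpha[-1][i] * end[i] for i in states)
    return alpha, likelihood


if __name__ == "__main__":
    alpha, likelihood = forward([3, 1, 3], STATES, START, TRANS, EMIT, END)
    for t, column in enumerate(alpha, start=1):
        print(t, column)
    print("P(O | lambda) =", likelihood)
```

The first two trellis columns reproduce the values on the Forward Trellis slide (.32 and .02, then .040 and .053, up to floating-point rounding).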

15 Visualizing the forward recursion
[Figure: one step of the forward trellis. The values $\alpha_{t-1}(1), \ldots, \alpha_{t-1}(N)$ at states $q_1, \ldots, q_N$ are weighted by the transitions $a_{1j}, \ldots, a_{Nj}$ into state $q_j$, summed, and multiplied by $b_j(o_t)$; the time axis shows $o_{t-2}, o_{t-1}, o_t, o_{t+1}$.]
$\alpha_t(j) = \sum_i \alpha_{t-1}(i)\, a_{ij}\, b_j(o_t)$

16 Three problems for HMMs
Problem 1 (Likelihood): Given an HMM $\lambda = (A, B)$ and an observation sequence O, determine the likelihood $P(O \mid \lambda)$.
Problem 2 (Decoding): Given an observation sequence O and an HMM $\lambda = (A, B)$, discover the best hidden state sequence Q.
Problem 3 (Learning): Given an observation sequence O and the set of states in the HMM, learn the HMM parameters A and B.

Decoding: Given as input an HMM $\lambda = (A, B)$ and a sequence of observations $O = o_1, o_2, \ldots, o_T$, find the most probable sequence of states $Q = q_1 q_2 q_3 \ldots q_T$.

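17 Viterbi Trellis
$v_t(j) = \max_{q_0, q_1, \ldots, q_{t-1}} P(q_0, q_1 \ldots q_{t-1}, o_1, o_2 \ldots o_t, q_t = j \mid \lambda)$
$v_t(j) = \max_{i=1}^{N} v_{t-1}(i)\, a_{ij}\, b_j(o_t)$

[Figure: Viterbi trellis for the ice-cream example over observations $o_1, o_2, o_3$, with states $q_0$ (start), $q_1$ (C), $q_2$ (H) and $q_F$ (end).]

Arc weights into t = 1:
  P(H | start) * P(3 | H) = .8 * .4
  P(C | start) * P(3 | C) = .2 * .1
Arc weights into t = 2:
  P(H | H) * P(1 | H) = .6 * .2
  P(C | H) * P(1 | C) = .3 * .5
  P(H | C) * P(1 | H) = .4 * .2
  P(C | C) * P(1 | C) = .5 * .5

$v_1(2) = .32$, $v_1(1) = .02$
$v_2(2) = \max(.32 \times .12,\; .02 \times .08) = .038$
$v_2(1) = \max(.32 \times .15,\; .02 \times .25) = .048$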

18 Viterbi recursion
1. Initialization:
  $v_1(j) = a_{0j}\, b_j(o_1)$, $1 \le j \le N$
  $bt_1(j) = 0$
2. Recursion (recall that states 0 and $q_F$ are non-emitting):
  $v_t(j) = \max_{i=1}^{N} v_{t-1}(i)\, a_{ij}\, b_j(o_t)$; $1 \le j \le N$, $1 < t \le T$
  $bt_t(j) = \operatorname{argmax}_{i=1}^{N} v_{t-1}(i)\, a_{ij}\, b_j(o_t)$; $1 \le j \le N$, $1 < t \le T$
3. Termination:
  The best score: $P^* = v_T(q_F) = \max_{i=1}^{N} v_T(i)\, a_{iF}$
  The start of backtrace: $q_T^* = bt_T(q_F) = \operatorname{argmax}_{i=1}^{N} v_T(i)\, a_{iF}$
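The recursion and backtrace can be sketched as follows. This is an illustrative implementation rather than course code: the function name `viterbi` and the data layout are choices made here, the parameter values are the ones shown on the trellis slides, and the end-state probability of 0.1 per state and the third observation (3) are assumptions.

```python
STATES = ["H", "C"]
START = {"H": 0.8, "C": 0.2}                   # a_0j
TRANS = {"H": {"H": 0.6, "C": 0.3},            # a_ij
         "C": {"H": 0.4, "C": 0.5}}
EMIT = {"H": {1: 0.2, 2: 0.4, 3: 0.4},         # b_j(o)
        "C": {1: 0.5, 2: 0.4, 3: 0.1}}
# Assumption: each state moves to the end state with the leftover mass 0.1.
END = {"H": 0.1, "C": 0.1}                     # a_iF


def viterbi(obs, states, start, trans, emit, end):
    """Return (best_path, best_score, v); v[t][j] holds v_{t+1}(j)."""
    # 1. Initialization: v_1(j) = a_0j * b_j(o_1); no backpointer at t = 1.
    v = [{j: start[j] * emit[j][obs[0]] for j in states}]
    backptr = [{j: None for j in states}]
    # 2. Recursion: keep the best predecessor instead of summing over them.
    for o in obs[1:]:
        prev = v[-1]
        scores, ptrs = {}, {}
        for j in states:
            # b_j(o_t) does not depend on i, so it can be left out of the argmax.
            best_i = max(states, key=lambda i: prev[i] * trans[i][j])
            scores[j] = prev[best_i] * trans[best_i][j] * emit[j][o]
            ptrs[j] = best_i
        v.append(scores)
        backptr.append(ptrs)
    # 3. Termination: pick the best transition into the end state.
    last = max(states, key=lambda i: v[-1][i] * end[i])
    best_score = v[-1][last] * end[last]
    # Backtrace: follow the stored pointers from the last time step to the first.
    path = [last]
    for ptrs in reversed(backptr[1:]):
        path.append(ptrs[path[-1]])
    path.reverse()
    return path, best_score, v


if __name__ == "__main__":
    path, score, v = viterbi([3, 1, 3], STATES, START, TRANS, EMIT, END)
    print(path, score)
```

Compared with the forward algorithm, the only changes are replacing the sum with a max and storing a backpointer per cell, which is what makes the backtrace possible.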

19 Viterbi backtrace
[Figure: the Viterbi trellis for the ice-cream example over observations $o_1, o_2, o_3$, with backpointers from each cell to its best predecessor.]
$v_1(2) = .32$, $v_1(1) = .02$
$v_2(2) = \max(.32 \times .12,\; .02 \times .08) = .038$
$v_2(1) = \max(.32 \times .15,\; .02 \times .25) = .048$


Automatic Speech Recognition (CS753) Lecture 5: Hidden Markov Models (Part I) Instructor: Preethi Jyothi August 7, 2017 Recap: WFSTs applied to ASR