Automatic Speech Recognition (CS753)

1 Automatic Speech Recognition (CS753). Lecture 18: Search & Decoding (Part I). Instructor: Preethi Jyothi. Mar 23, 2017

2 Recall ASR Decoding

$$W^* = \arg\max_{W} \Pr(O_A \mid W)\,\Pr(W)$$

$$W^* = \arg\max_{w_1^N,\,N} \left\{ \left[ \prod_{n=1}^{N} \Pr(w_n \mid w_{n-m+1}^{n-1}) \right] \left[ \sum_{q_1^T,\,w_1^N} \prod_{t=1}^{T} \Pr(O_t \mid q_t, w_1^N)\,\Pr(q_t \mid q_{t-1}, w_1^N) \right] \right\}$$

$$\overset{\text{Viterbi}}{\approx} \arg\max_{w_1^N,\,N} \left\{ \left[ \prod_{n=1}^{N} \Pr(w_n \mid w_{n-m+1}^{n-1}) \right] \left[ \max_{q_1^T,\,w_1^N} \prod_{t=1}^{T} \Pr(O_t \mid q_t, w_1^N)\,\Pr(q_t \mid q_{t-1}, w_1^N) \right] \right\}$$

The Viterbi approximation divides the above optimisation problem into sub-problems, which allows the efficient application of dynamic programming. An exact search using Viterbi is infeasible for large vocabulary tasks!
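
The difference between the exact sum over state sequences and the Viterbi max can be seen on a tiny example. The sketch below enumerates every state sequence of a two-state HMM by brute force and returns both quantities. The HMM, its state names (`a`, `b`), its symbols (`x`, `y`) and all probabilities are invented for illustration and are not from the lecture.

```python
from itertools import product

# Hypothetical two-state HMM; every number here is made up for illustration.
START = {"a": 0.6, "b": 0.4}
TRANS = {"a": {"a": 0.7, "b": 0.3}, "b": {"a": 0.4, "b": 0.6}}
EMIT = {"a": {"x": 0.9, "y": 0.1}, "b": {"x": 0.2, "y": 0.8}}


def path_score(states, observations):
    """Joint probability of one state sequence and the observations."""
    score = START[states[0]] * EMIT[states[0]][observations[0]]
    for prev, cur, obs in zip(states, states[1:], observations[1:]):
        score *= TRANS[prev][cur] * EMIT[cur][obs]
    return score


def exact_and_viterbi(observations):
    """Return (sum over all state sequences, max over all state sequences)."""
    scores = [
        path_score(states, observations)
        for states in product(START, repeat=len(observations))
    ]
    return sum(scores), max(scores)


total, best = exact_and_viterbi(["x", "y", "y"])
print(f"sum over paths = {total:.4f}, best single path = {best:.4f}")
```

The enumeration grows exponentially with the number of frames, which is why dynamic programming is used in practice; the brute-force version is only meant to make the sum-versus-max distinction concrete.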

3 Recall Viterbi search. Viterbi search finds the most probable path through a trellis with time on the X-axis and states on the Y-axis.
[Figure: Viterbi trellis for an HMM with hidden states H (q2) and C (q1), plus start (q0) and end (qF) states, over observations o1, o2, o3. Arc scores: P(H|start) * P(3|H) = .8 * .4; P(C|start) * P(3|C) = .2 * .1; P(H|H) * P(1|H) = .6 * .2; P(C|H) * P(1|C) = .3 * .5; P(H|C) * P(1|H) = .4 * .2; P(C|C) * P(1|C) = .5 * .5. Trellis values: v1(2) = .32; v1(1) = .02; v2(2) = max(.32*.12, .02*.08) = .038; v2(1) = max(.32*.15, .02*.25) = .048.]
Viterbi algorithm: only needs to maintain information about the most probable path at each state. Image from [JM]: Jurafsky & Martin, 3rd edition, Chapter 9
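
The trellis values on this slide can be reproduced with a few lines of code. The following is a minimal sketch, not the lecture's implementation: it uses the transition and emission probabilities shown in the figure, processes only the first two observations (3, then 1), and ignores the transition into the end state. The function and variable names are my own.

```python
# Probabilities read off the trellis figure (states H and C).
START = {"H": 0.8, "C": 0.2}
TRANS = {"H": {"H": 0.6, "C": 0.3}, "C": {"H": 0.4, "C": 0.5}}
EMIT = {"H": {3: 0.4, 1: 0.2}, "C": {3: 0.1, 1: 0.5}}


def viterbi(observations):
    """Return (trellis, best_path).

    trellis[t][s] is the score of the best path that ends in state s at time t.
    Assumption: no end-state transition is applied after the last observation.
    """
    trellis = [{s: START[s] * EMIT[s][observations[0]] for s in START}]
    backptr = [{}]
    for obs in observations[1:]:
        prev = trellis[-1]
        column, pointers = {}, {}
        for s in TRANS:
            # Only the best predecessor of each state needs to be remembered.
            best_prev = max(prev, key=lambda p: prev[p] * TRANS[p][s])
            column[s] = prev[best_prev] * TRANS[best_prev][s] * EMIT[s][obs]
            pointers[s] = best_prev
        trellis.append(column)
        backptr.append(pointers)

    # Trace back from the best final state.
    state = max(trellis[-1], key=trellis[-1].get)
    path = [state]
    for pointers in reversed(backptr[1:]):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return trellis, path


trellis, path = viterbi([3, 1])
print(trellis, path)
```

Storing one score and one back-pointer per state and time-step is what keeps the cost linear in the number of frames.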

4 ASR Search Network [Figure: the search network at three levels: a network of words (the, birds, boy, are, is, walking), a network of phones, and a network of HMM states.]

5 Time-state trellis [Figure: the states of word1, word2 and word3 on the vertical axis, time t on the horizontal axis.]

6 Viterbi search over the large trellis. Exact search is infeasible for large vocabulary tasks: word boundaries are unknown, and N-gram language models greatly increase the search space. Solutions: (1) compactly represent the search space using WFST-based optimisations; (2) beam search: prune away parts of the search space that aren't promising.

8 Two main WFST Optimizations. Use determinization to reduce/eliminate redundancy. Recall that not all weighted transducers are determinizable. To ensure determinizability of $L \circ G$, introduce disambiguation symbols in L to deal with homophones in the lexicon, e.g. read : r eh d #0 and red : r eh d #1. Propagate the disambiguation symbols as self-loops back to C and H. The resulting machines are $\tilde{H}$, $\tilde{C}$, $\tilde{L}$.
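
Adding disambiguation symbols to homophones can be sketched in a few lines. This is a simplified illustration under stated assumptions: the lexicon is a plain list of (word, phones) pairs, only exact homophones are handled, and the helper name `add_disambig_symbols` is hypothetical. Practical recipes typically also mark pronunciations that are prefixes of other pronunciations, which this sketch does not do.

```python
from collections import defaultdict


def add_disambig_symbols(lexicon):
    """Append #0, #1, ... to pronunciations shared by more than one word.

    lexicon: list of (word, list_of_phones) pairs.
    Returns a new list; pronunciations that are unique are left unchanged.
    """
    words_per_pron = defaultdict(list)
    for word, phones in lexicon:
        words_per_pron[tuple(phones)].append(word)

    next_index = defaultdict(int)  # per-pronunciation counter
    result = []
    for word, phones in lexicon:
        key = tuple(phones)
        if len(words_per_pron[key]) > 1:
            symbol = f"#{next_index[key]}"
            next_index[key] += 1
            result.append((word, list(phones) + [symbol]))
        else:
            result.append((word, list(phones)))
    return result


lexicon = [
    ("read", ["r", "eh", "d"]),
    ("red", ["r", "eh", "d"]),
    ("ate", ["ey", "t"]),
]
for word, phones in add_disambig_symbols(lexicon):
    print(word, ":", " ".join(phones))
```

With the extra symbols, the two phone strings for read and red are no longer identical, so the word output can be determined from the input string alone.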

9 Two main WFST Optimizations. Use determinization to reduce/eliminate redundancy. Use minimization to reduce space requirements: minimization ensures that the final composed machine has the minimum number of states. Final optimization cascade: $N = \pi_\varepsilon(\min(\det(\tilde{H} \circ \det(\tilde{C} \circ \det(\tilde{L} \circ G)))))$, where $\pi_\varepsilon$ replaces disambiguation symbols in the input alphabet of $\tilde{H}$ with ε.
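
The last step of the cascade, replacing disambiguation symbols with ε, is simple enough to show on a bare arc list. The sketch below assumes arcs are stored as (source, input label, output label, destination) tuples, that disambiguation symbols are exactly the input labels starting with `#`, and that the string `<eps>` stands for ε. All three are conventions chosen for this illustration, not part of the lecture.

```python
EPS = "<eps>"  # stand-in for the empty label (an assumption of this sketch)


def replace_disambig_with_eps(arcs):
    """Map every input label of the form '#k' to epsilon; keep the rest."""
    return [
        (src, EPS if ilabel.startswith("#") else ilabel, olabel, dst)
        for src, ilabel, olabel, dst in arcs
    ]


arcs = [
    (0, "r", "read", 1),
    (1, "eh", EPS, 2),
    (2, "d", EPS, 3),
    (3, "#0", EPS, 4),
]
print(replace_disambig_with_eps(arcs))
```

The symbols are only needed while determinizing; once the graph is built they carry no acoustic meaning, so they are turned into ε.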

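10 Example G [Figure: acceptor with states 0, 1, 2. Arcs from state 0 to state 1: bob:bob, bond:bond, rob:rob. Arcs from state 1 to state 2: slept:slept, read:read, ate:ate.]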

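11 Compact language models (G). Use backoff N-gram language models for G.
[Figure: fragment of a backoff trigram model with one state per history: (a,b), (b,c), b, c, and the empty history ε. Arcs: (a,b) → (b,c) on c / Pr(c|a,b); (a,b) → b on ε / α(a,b); b → (b,c) on c / Pr(c|b); (b,c) → c on ε / α(b,c); b → ε on ε / α(b); ε → c on c / Pr(c).]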
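
Walking the backoff arcs of G corresponds to the usual backoff lookup: use the stored probability for the longest matching history, otherwise multiply in the backoff weight α and shorten the history. A minimal sketch, assuming the model is held in two plain dictionaries (`probs` and `backoffs`, names and layout chosen for this example) and using invented toy numbers:

```python
def backoff_prob(word, history, probs, backoffs):
    """Pr(word | history) under a backoff N-gram model.

    probs:    {(history_tuple, word): probability} for stored N-grams
    backoffs: {history_tuple: alpha} backoff weights
    Assumption: an unseen unigram gets probability 0.0; a real model
    would normally reserve mass for unknown words instead.
    """
    history = tuple(history)
    weight = 1.0
    while True:
        if (history, word) in probs:
            return weight * probs[(history, word)]
        if not history:
            return 0.0
        # Take the epsilon (backoff) arc: pay alpha, drop the oldest word.
        weight *= backoffs.get(history, 1.0)
        history = history[1:]


# Toy numbers, invented for illustration only.
probs = {(("a", "b"), "c"): 0.5, (("b",), "d"): 0.2, ((), "e"): 0.1}
backoffs = {("a", "b"): 0.4, ("b",): 0.3}

print(backoff_prob("c", ("a", "b"), probs, backoffs))  # stored trigram
print(backoff_prob("d", ("a", "b"), probs, backoffs))  # one backoff step
print(backoff_prob("e", ("a", "b"), probs, backoffs))  # two backoff steps
```

Because unseen N-grams are reached through ε arcs rather than stored explicitly, G only needs arcs for the N-grams that were actually observed.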

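13 Example $\tilde{L}$: Lexicon with disambig symbols
[Figure: lexicon transducer with one phone chain per word leaving state 0. The first arc of each chain outputs the word; the remaining arcs output nothing (-). Chains: bob = b aa b #0; bond = b aa n d; rob = r aa b; slept = s l eh p t; read = r eh d; ate = ey t.]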

14 $\tilde{L} \circ G$ and $\det(\tilde{L} \circ G)$
[Figure: the composed transducer and its determinized version. In $\tilde{L} \circ G$, bob and bond start with separate arcs b:bob and b:bond. In $\det(\tilde{L} \circ G)$ they share a single b:- arc followed by aa:-, and the word outputs are delayed to the first distinguishing arcs, b:bob and n:bond.]

15 $\min(\det(\tilde{L} \circ G))$ compared with $\det(\tilde{L} \circ G)$
[Figure: the minimized transducer shown next to the determinized one. Minimization merges equivalent states, so the minimized machine is smaller than $\det(\tilde{L} \circ G)$.]

16 Viterbi search over the large trellis. Exact search is infeasible for large vocabulary tasks: word boundaries are unknown, and N-gram language models greatly increase the search space. Solutions: (1) compactly represent the search space using WFST-based optimisations; (2) beam search: prune away parts of the search space that aren't promising.

17 Beam pruning. At each time-step t, only retain those nodes in the time-state trellis that are within a fixed threshold δ (beam width) of the best path. Given active nodes from the last time-step: (1) examine nodes in the current time-step that are reachable from active nodes in the previous time-step; (2) get active nodes for the current time-step by only retaining nodes with hypotheses that score close to the score of the best hypothesis.

18 Beam search. Beam search at each node keeps only hypotheses with scores that fall within a threshold of the current best hypothesis. Hypotheses with $Q(t, s) < \delta \cdot \max_{s'} Q(t, s')$ are pruned; here, δ controls the beam width. Search errors could occur if the most probable hypothesis gets pruned. There is a trade-off between avoiding search errors and speeding up decoding.
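
The pruning rule can be combined with a time-synchronous Viterbi pass as follows. This is a sketch under assumptions, not a production decoder: scores are kept as plain probabilities (real decoders typically work with log scores), the toy HMM reuses the H/C numbers from the Viterbi trellis figure, and `prune` and `beam_viterbi_scores` are names chosen for this example.

```python
# H/C probabilities from the Viterbi trellis figure.
START = {"H": 0.8, "C": 0.2}
TRANS = {"H": {"H": 0.6, "C": 0.3}, "C": {"H": 0.4, "C": 0.5}}
EMIT = {"H": {3: 0.4, 1: 0.2}, "C": {3: 0.1, 1: 0.5}}


def prune(scores, delta):
    """Drop every hypothesis with Q(t, s) < delta * max_s' Q(t, s')."""
    best = max(scores.values())
    return {s: q for s, q in scores.items() if q >= delta * best}


def beam_viterbi_scores(observations, delta):
    """Return the active (unpruned) state scores for every time-step."""
    active = prune(
        {s: START[s] * EMIT[s][observations[0]] for s in START}, delta
    )
    history = [active]
    for obs in observations[1:]:
        candidates = {}
        # Only states reachable from the previous active set are examined.
        for prev, q in active.items():
            for nxt, p in TRANS[prev].items():
                score = q * p * EMIT[nxt][obs]
                if score > candidates.get(nxt, 0.0):
                    candidates[nxt] = score
        active = prune(candidates, delta)
        history.append(active)
    return history


for t, column in enumerate(beam_viterbi_scores([3, 1], delta=0.1), start=1):
    print(t, column)
```

With δ = 0.1, state C is dropped after the first observation because its score (.02) is below a tenth of the best score (.32). A smaller δ keeps more states alive, which lowers the risk of search errors at the cost of more computation.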

19 Static and dynamic networks. What we've seen so far: a static decoding graph $H \circ C \circ L \circ G$, determinized/minimized to make the graph more compact. Another approach: dynamic graph expansion. Dynamically build the graph with active states on the fly, doing on-the-fly composition with the language model G: $(H \circ C \circ L) \circ G$.

20 Multi-pass search. Some models are too expensive to use in first-pass decoding (e.g. RNN-based LMs). First-pass decoding: use a simpler model (e.g. N-gram LMs) to find the most probable word sequences and represent them as a word lattice or an N-best list. Then rescore the first-pass hypotheses using the more complex model to find the best word sequence.

21 Multi-pass decoding with N-best lists. Simple algorithm: modify the Viterbi algorithm to return the N-best word sequences for a given speech input.
[Figure: speech input ("If music be the food of love...") → N-best decoder using a simple knowledge source → N-best list ("Alice was beginning to get...", "every happy family...", "in a hole in the ground...", "if music be the food of love...", "if music be the foot of dove...") → rescoring with a smarter knowledge source → 1-best utterance ("If music be the food of love...").]
Image from [JM]: Jurafsky & Martin, SLP 2nd edition, Chapter 10
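
The rescoring stage of this pipeline can be sketched as a re-ranking of the N-best list. The code below is an illustration under assumptions: each hypothesis carries a first-pass acoustic log-probability, the second-pass model is any callable returning a log-probability for a word sequence, and the two are combined with a weight `lm_weight`. The scores and the toy second-pass model `toy_lm_logprob` are invented; they are not from the lecture or from [JM].

```python
def rescore_nbest(nbest, lm_logprob, lm_weight=1.0):
    """Re-rank an N-best list with a second-pass language model.

    nbest: list of (words, am_logprob) pairs from the first pass.
    lm_logprob: callable mapping a list of words to a log-probability.
    Returns (words, combined_score) pairs, best first.
    """
    rescored = [
        (words, am_logprob + lm_weight * lm_logprob(words))
        for words, am_logprob in nbest
    ]
    return sorted(rescored, key=lambda item: item[1], reverse=True)


def toy_lm_logprob(words):
    """Hypothetical stand-in for an expensive LM: -1 per word, with an
    extra penalty of -5 for each word it considers implausible here."""
    implausible = {"foot", "dove"}
    return sum(-1.0 - (5.0 if w in implausible else 0.0) for w in words)


# Invented first-pass scores: the acoustically preferred hypothesis is wrong.
nbest = [
    ("if music be the foot of dove".split(), -10.0),
    ("if music be the food of love".split(), -10.5),
]
best_words, best_score = rescore_nbest(nbest, toy_lm_logprob)[0]
print(" ".join(best_words), best_score)
```

The expensive model is only evaluated on N complete hypotheses, which is what makes it affordable as a second pass.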

22 Multi-pass decoding with N-best lists. Simple algorithm: modify the Viterbi algorithm to return the N-best word sequences for a given speech input.
[Table: 10-best list giving rank and path; the AM logprob and LM logprob columns are not reproduced.]
1. it's an area that's naturally sort of mysterious
2. that's an area that's naturally sort of mysterious
3. it's an area that's not really sort of mysterious
4. that scenario that's naturally sort of mysterious
5. there's an area that's naturally sort of mysterious
6. that's an area that's not really sort of mysterious
7. the scenario that's naturally sort of mysterious
8. so it's an area that's naturally sort of mysterious
9. that scenario that's not really sort of mysterious
10. there's an area that's not really sort of mysterious
N-best lists aren't as diverse as we'd like, and there is not enough information in N-best lists to effectively use other knowledge sources. Image from [JM]: Jurafsky & Martin, SLP 2nd edition, Chapter 10

23 Multi-pass decoding with lattices. ASR lattice: weighted automaton/directed graph representing alternate word hypotheses from an ASR system.
[Figure: word lattice with alternative arcs such as so / it's / there's / that's / that / the, an area / scenario, and naturally / not really, all leading into "sort of mysterious".]
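
Since a lattice is a directed acyclic graph, its best-scoring path can be found with one dynamic-programming sweep. The sketch below assumes arcs are (source, destination, word, log-probability) tuples and that state ids are topologically ordered, i.e. every arc goes from a smaller id to a larger one. The arc scores are invented for illustration.

```python
def best_lattice_path(arcs, start, final):
    """Return (score, words) of the highest-scoring path from start to final.

    arcs: list of (src, dst, word, logprob) with src < dst for every arc,
    so processing arcs in sorted order visits states topologically.
    """
    best = {start: (0.0, [])}
    for src, dst, word, logprob in sorted(arcs):
        if src not in best:
            continue  # state not reachable from the start state
        score = best[src][0] + logprob
        if dst not in best or score > best[dst][0]:
            best[dst] = (score, best[src][1] + [word])
    return best[final]


# Toy lattice; all scores are made up.
arcs = [
    (0, 1, "it's", -1.0),
    (0, 1, "that's", -1.5),
    (1, 2, "an area", -0.5),
    (0, 2, "that scenario", -2.5),
    (2, 3, "that's", -0.2),
]
score, words = best_lattice_path(arcs, start=0, final=3)
print(score, words)
```

Rescoring a lattice amounts to replacing the arc scores with those of the stronger model and running the same search again, while keeping far more alternatives than an N-best list of comparable size.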

24 Multi-pass decoding with lattices. Confusion networks/sausages: lattices that show competing/confusable words and can be used to compute posterior probabilities at the word level.
[Figure: confusion network for the same utterance, with competing words grouped into slots, e.g. it's / there's / that's / the, an area / that scenario, and naturally / not.]
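
Word-level posteriors in a confusion network come from normalising the probability mass of the competing words in each slot. A minimal sketch, assuming the network has already been built and is given as a list of slots, each mapping a word to the summed probability of the lattice paths that pass through it; the numbers below are invented, and building the network from a lattice is not shown.

```python
def slot_posteriors(slot):
    """Normalise the path mass in one slot into word posteriors."""
    total = sum(slot.values())
    return {word: mass / total for word, mass in slot.items()}


def best_words(network):
    """Pick the highest-posterior word in every slot."""
    return [max(slot, key=slot.get) for slot in network]


# Toy network with invented path masses.
network = [
    {"it's": 3.0, "that's": 1.0},
    {"an area": 5.0, "scenario": 3.0},
    {"naturally": 6.0, "not": 2.0},
]
for slot in network:
    print(slot_posteriors(slot))
print(best_words(network))
```

Choosing the best word slot by slot targets word-level errors directly, in contrast to picking the single most probable whole sentence.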
