
Automatic Speech Recognition (CS753), Lecture 5: Hidden Markov Models (Part I). Instructor: Preethi Jyothi. August 7, 2017.

Recap: WFSTs applied to ASR

WFST-based ASR System: Indices → H (HMMs) → Triphones → C (context transducer) → Monophones → L (pronunciation lexicon) → Words → G (language model) → Word Sequence

WFST-based ASR System: H, one 3-state HMM FST per triphone
[Figure: each triphone a/a_b, b/a_b, ..., x/y_z gets a 3-state HMM represented as an FST, with arcs labeled f0:a/a_b, f1:ε, f2:ε, f3:ε, f4:ε, f5:ε, f6:ε (HMM-state indices on input, the triphone or ε on output); taking the union of these FSTs followed by closure yields the resulting H FST.]

WFST-based ASR System: C (context-dependency transducer), mapping triphones to monophones
[Figure: context-dependency transducer over a two-phone alphabet {x, y}; states such as (ε,*), (x,x), (x,y), (y,x), (y,y), (x,ε), (y,ε) track the preceding and current phone. Arc labels: monophone : phone/left-context_right-context. Figure reproduced from Weighted Finite State Transducers in Speech Recognition, Mohri et al., 2002.]

WFST-based ASR System: L (pronunciation lexicon), mapping monophones to words
[Figure: weighted pronunciation transducer for the words "data" and "dew". "data": d:data/1, then ey:ε/0.5 or ae:ε/0.5, then t:ε/0.3 or dx:ε/0.7, then ax:ε/1. "dew": d:dew/1, then uw:ε/1. Figure reproduced from Weighted Finite State Transducers in Speech Recognition, Mohri et al., 2002.]

WFST-based ASR System: G (language model), a weighted acceptor over word sequences
[Figure: toy grammar G with weighted word arcs such as the, birds/0.404, animals/1.789, boy/1.789, are/0.693, were/0.693, is, walking.]

Constructing the Decoding Graph

Construct the decoding search graph D using H ∘ C ∘ L ∘ G, which maps acoustic state indices to word sequences.

Carefully construct D using optimization algorithms:
D = min(det(H ∘ det(C ∘ det(L ∘ G))))

How do we decode a test utterance O using D? D is typically traversed dynamically; search algorithms will be covered later in the semester.
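To make the construction step concrete, here is a minimal sketch using OpenFst's Python wrapper (pywrapfst). The file names H.fst, C.fst, L.fst and G.fst are hypothetical, and a production recipe would also need disambiguation symbols, epsilon removal and weight pushing, which the formula above omits; treat this as an illustration of the compose/determinize/minimize chain, not the course's actual toolchain.

    import pywrapfst as fst

    # Load the component transducers (hypothetical file names); we assume L (or G)
    # is already arc-sorted for composition, as a typical build pipeline leaves it.
    H = fst.Fst.read("H.fst")   # HMM state indices -> triphones
    C = fst.Fst.read("C.fst")   # triphones -> monophones
    L = fst.Fst.read("L.fst")   # monophones -> words
    G = fst.Fst.read("G.fst")   # word-level grammar / language model

    # D = min(det(H o det(C o det(L o G))))
    LG = fst.determinize(fst.compose(L, G))
    LG.arcsort("ilabel")                       # sort so the next composition can match arcs
    CLG = fst.determinize(fst.compose(C, LG))
    CLG.arcsort("ilabel")
    D = fst.determinize(fst.compose(H, CLG))
    D.minimize()                               # in-place minimization of the decoding graph
    D.write("D.fst")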

Before constructing D, let's understand H in more detail
[Figure (repeated from the earlier slide): one 3-state HMM FST per triphone a/a_b, b/a_b, ..., x/y_z, with arcs f0:a/a_b, f1:ε, ..., f6:ε; the union of these FSTs followed by closure gives the resulting H FST.]

Hidden Markov Models (HMMs)

Following slides contain figures/material from "Hidden Markov Models", Chapter 9, Speech and Language Processing, D. Jurafsky and J. H. Martin, 2016. (https://web.stanford.edu/~jurafsky/slp3/9.pdf)

Markov Chains

[Figure 9.1 from Jurafsky & Martin: (a) a Markov chain for weather, with states HOT, COLD and WARM, and (b) a Markov chain for a sequence of words (is, snow, white). A Markov chain is specified by the structure, the transitions between states, and the start and end states.]

A Markov chain has transition probabilities on all arcs leaving a node that must sum to 1, and the input sequence uniquely determines which states the automaton will go through. Because it cannot represent inherently ambiguous problems, a Markov chain is only useful for assigning probabilities to unambiguous sequences.

Figure 9.1a shows a Markov chain for assigning a probability to a sequence of weather events, for which the vocabulary consists of HOT, COLD, and WARM. Figure 9.1b shows another simple example of a Markov chain for assigning a probability to a sequence of words w_1 ... w_n. This Markov chain should be familiar; in fact, it represents a bigram language model. Given the two models in Fig. 9.1, we can assign a probability to any sequence from our vocabulary. We go over how to do this shortly.

First, let's be more formal and view a Markov chain as a kind of probabilistic graphical model: a way of representing probabilistic assumptions in a graph. A Markov chain is specified by the following components:

Q = q_1 q_2 ... q_N : a set of N states
A = a_01 a_02 ... a_n1 ... a_nn : a transition probability matrix A, each a_ij representing the probability of moving from state i to state j, s.t. Σ_{j=1}^n a_ij = 1 ∀i
q_0, q_F : a special start state and end (final) state that are not associated with observations

Figure 9.1 represents the states (including start and end states) as nodes in the graph, and the transitions as edges between nodes.

A Markov chain embodies an important assumption about these probabilities: the probability of a particular state depends only on the previous state.

Markov Assumption: P(q_i | q_1 ... q_{i-1}) = P(q_i | q_{i-1})   (9.1)

Note that because each a_ij expresses the probability p(q_j | q_i), the laws of probability require that the values of the outgoing arcs from a given state must sum to 1:

Σ_{j=1}^N a_ij = 1  ∀i   (9.2)

An alternative representation that is sometimes used for Markov chains doesn't rely on a start or end state, instead representing the distribution over initial states and accepting states explicitly:

π = π_1, π_2, ..., π_N : an initial probability distribution over states. π_i is the probability that the Markov chain will start in state i. Some states j may have π_j = 0, meaning that they cannot be initial states. Σ_{i=1}^N π_i = 1
QA = {q_x, q_y, ...} : a set QA ⊆ Q of legal accepting states
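To make "assign a probability to any sequence" concrete, the Markov assumption chains into a simple product of transition probabilities. This worked factorization is not an equation from the slide; with the alternative representation the first factor a_{0 q_1} becomes π_{q_1}:

    % probability of a state sequence q_1 ... q_n under the start/end-state formulation
    P(q_1 q_2 \ldots q_n) \;=\; a_{0 q_1}\, a_{q_1 q_2}\, a_{q_2 q_3} \cdots a_{q_{n-1} q_n}\, a_{q_n F}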

Hidden Markov Models

Given a sequence of observations O, each observation an integer corresponding to the number of ice creams eaten on a given day, figure out the correct hidden sequence Q of weather states (H or C) which caused Jason to eat the ice cream.

Let's begin with a formal definition of a hidden Markov model, focusing on how it differs from a Markov chain. An HMM is specified by the following components:

Q = q_1 q_2 ... q_N : a set of N states
A = a_11 a_12 ... a_n1 ... a_nn : a transition probability matrix A, each a_ij representing the probability of moving from state i to state j, s.t. Σ_{j=1}^n a_ij = 1 ∀i
O = o_1 o_2 ... o_T : a sequence of T observations, each one drawn from a vocabulary V = v_1, v_2, ..., v_V
B = b_i(o_t) : a sequence of observation likelihoods, also called emission probabilities, each expressing the probability of an observation o_t being generated from a state i
q_0, q_F : a special start state and end (final) state that are not associated with observations, together with transition probabilities a_01 a_02 ... a_0n out of the start state and a_1F a_2F ... a_nF into the end state

As we noted for Markov chains, an alternative representation that is sometimes used does not rely on a start or end state, instead representing the initial and accepting distributions explicitly.

HMM Assumptions

[Figure: ice-cream HMM with hidden states HOT and COLD, start and end states, and emission distributions B1: P(1|HOT) = .2, P(2|HOT) = .4, P(3|HOT) = .4 and B2: P(1|COLD) = .5, P(2|COLD) = .4, P(3|COLD) = .1]

Markov Assumption: P(q_i | q_1 ... q_{i-1}) = P(q_i | q_{i-1})

Output Independence: the probability of an output observation o_i depends only on the state q_i that produced the observation, and not on any other states or observations:
P(o_i | q_1 ... q_i, ..., q_T, o_1, ..., o_i, ..., o_T) = P(o_i | q_i)
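To keep the following algorithms concrete, here is a minimal NumPy encoding of the ice-cream HMM sketched above. The emission rows come straight from B1 and B2; the transition values are read off the arcs of the trellis slides that follow (P(H|H) = .6, P(C|H) = .3, P(H|C) = .4, P(C|C) = .5, P(H|start) = .8, P(C|start) = .2), and the end-state probabilities of .1 are an assumption inferred from the figure (they make each row sum to 1).

    import numpy as np

    states = ["HOT", "COLD"]          # hidden states; observations are 1, 2 or 3 ice creams

    # Start, transition and end probabilities (a_0j, a_ij, a_iF)
    pi  = np.array([0.8, 0.2])        # P(HOT|start), P(COLD|start)
    A   = np.array([[0.6, 0.3],       # P(HOT|HOT),  P(COLD|HOT)
                    [0.4, 0.5]])      # P(HOT|COLD), P(COLD|COLD)
    a_F = np.array([0.1, 0.1])        # P(end|HOT), P(end|COLD), assumed from the figure

    # Emission probabilities b_j(o) for o = 1, 2, 3 ice creams
    B = np.array([[0.2, 0.4, 0.4],    # P(1|HOT),  P(2|HOT),  P(3|HOT)
                  [0.5, 0.4, 0.1]])   # P(1|COLD), P(2|COLD), P(3|COLD)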

Three problems for HMMs

Problem 1 (Likelihood): Given an HMM λ = (A, B) and an observation sequence O, determine the likelihood P(O | λ).
Problem 2 (Decoding): Given an observation sequence O and an HMM λ = (A, B), discover the best hidden state sequence Q.
Problem 3 (Learning): Given an observation sequence O and the set of states in the HMM, learn the HMM parameters A and B.

Computing Likelihood: Given an HMM λ = (A, B) and an observation sequence O, determine the likelihood P(O | λ).

A tutorial on hidden Markov models and selected applications in speech recognition, Rabiner, 1989

Forward Trellis

α_t(j) = P(o_1, o_2 ... o_t, q_t = j | λ), where q_t = j means that the t-th state in the sequence of states is state j

α_t(j) = Σ_{i=1}^N α_{t-1}(i) · a_ij · b_j(o_t)

[Figure: forward trellis for the ice-cream observation sequence 3 1 3. α_1(2) = P(H|start)·P(3|H) = .8·.4 = .32; α_1(1) = P(C|start)·P(3|C) = .2·.1 = .02; α_2(2) = .32·.12 + .02·.08 = .040, using P(H|H)·P(1|H) = .6·.2 and P(H|C)·P(1|H) = .4·.2; α_2(1) = .32·.15 + .02·.25 = .053, using P(C|H)·P(1|C) = .3·.5 and P(C|C)·P(1|C) = .5·.5.]

Forward Algorithm

1. Initialization:
   α_1(j) = a_0j · b_j(o_1),  1 ≤ j ≤ N

2. Recursion (since states 0 and F are non-emitting):
   α_t(j) = Σ_{i=1}^N α_{t-1}(i) · a_ij · b_j(o_t);  1 ≤ j ≤ N, 1 < t ≤ T

3. Termination:
   P(O | λ) = α_T(q_F) = Σ_{i=1}^N α_T(i) · a_iF
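A short NumPy sketch of this algorithm, reusing the ice-cream parameters (pi, A, B, a_F, states) defined after the HMM assumptions slide; observation symbols are passed as 0-based column indices into B. It is an illustration under those assumptions, not the course's reference code.

    def forward(pi, A, B, a_F, obs):
        """Forward algorithm: returns P(O | lambda) and the alpha trellis."""
        N, T = len(pi), len(obs)
        alpha = np.zeros((T, N))
        alpha[0] = pi * B[:, obs[0]]                  # 1. initialization: a_0j * b_j(o_1)
        for t in range(1, T):                         # 2. recursion over time steps
            alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
        return float(alpha[-1] @ a_F), alpha          # 3. termination: sum_i alpha_T(i) * a_iF

    # Observation sequence 3, 1, 3 (as 0-based indices 2, 0, 2), matching the trellis slide:
    likelihood, alpha = forward(pi, A, B, a_F, [2, 0, 2])
    print(alpha[1])   # ~[0.040, 0.053], the alpha_2 values shown in the forward trellis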

Visualizing the forward recursion

[Figure: each cell α_t(j) is computed from all cells of the previous time step, α_t(j) = Σ_i α_{t-1}(i) · a_ij · b_j(o_t), by summing over the incoming arcs a_1j, a_2j, a_3j, ..., a_Nj from states q_1 ... q_N at time t-1 and weighting by the emission probability b_j(o_t).]

Three problems for HMMs

Problem 1 (Likelihood): Given an HMM λ = (A, B) and an observation sequence O, determine the likelihood P(O | λ).
Problem 2 (Decoding): Given an observation sequence O and an HMM λ = (A, B), discover the best hidden state sequence Q.
Problem 3 (Learning): Given an observation sequence O and the set of states in the HMM, learn the HMM parameters A and B.

Decoding: Given as input an HMM λ = (A, B) and a sequence of observations O = o_1, o_2, ..., o_T, find the most probable sequence of states Q = q_1 q_2 q_3 ... q_T.

Viterbi Trellis

v_t(j) = max over q_0, q_1, ..., q_{t-1} of P(q_0, q_1 ... q_{t-1}, o_1, o_2 ... o_t, q_t = j | λ)

v_t(j) = max_{i=1}^N v_{t-1}(i) · a_ij · b_j(o_t)

[Figure: Viterbi trellis for the observation sequence 3 1 3. v_1(2) = P(H|start)·P(3|H) = .8·.4 = .32; v_1(1) = P(C|start)·P(3|C) = .2·.1 = .02; v_2(2) = max(.32·.12, .02·.08) = .038; v_2(1) = max(.32·.15, .02·.25) = .048.]

Viterbi recursion

1. Initialization:
   v_1(j) = a_0j · b_j(o_1),  1 ≤ j ≤ N
   bt_1(j) = 0

2. Recursion (recall that states 0 and q_F are non-emitting):
   v_t(j) = max_{i=1}^N v_{t-1}(i) · a_ij · b_j(o_t);  1 ≤ j ≤ N, 1 < t ≤ T
   bt_t(j) = argmax_{i=1}^N v_{t-1}(i) · a_ij · b_j(o_t);  1 ≤ j ≤ N, 1 < t ≤ T

3. Termination:
   The best score: P* = v_T(q_F) = max_{i=1}^N v_T(i) · a_iF
   The start of backtrace: q_T* = bt_T(q_F) = argmax_{i=1}^N v_T(i) · a_iF
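The same sketch as the forward code, with the sum replaced by a max and a backpointer table, again reusing the ice-cream parameters defined earlier (a hedged illustration, not the course's reference implementation):

    def viterbi(pi, A, B, a_F, obs):
        """Viterbi decoding: returns the best path score and state sequence."""
        N, T = len(pi), len(obs)
        v = np.zeros((T, N))
        bt = np.zeros((T, N), dtype=int)
        v[0] = pi * B[:, obs[0]]                       # 1. initialization
        for t in range(1, T):                          # 2. recursion: best predecessor, not a sum
            scores = v[t - 1][:, None] * A             #    scores[i, j] = v_{t-1}(i) * a_ij
            bt[t] = scores.argmax(axis=0)
            v[t] = scores.max(axis=0) * B[:, obs[t]]
        final = v[-1] * a_F                            # 3. termination and backtrace
        best_score, state = float(final.max()), int(final.argmax())
        path = [state]
        for t in range(T - 1, 0, -1):
            path.append(int(bt[t, path[-1]]))
        return best_score, path[::-1]

    score, path = viterbi(pi, A, B, a_F, [2, 0, 2])
    print([states[i] for i in path])   # most probable weather sequence for observations 3, 1, 3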

Viterbi backtrace

[Figure: the same trellis as before (v_1(2) = .32, v_1(1) = .02, v_2(2) = max(.32·.12, .02·.08) = .038, v_2(1) = max(.32·.15, .02·.25) = .048), with backpointers followed from the best final cell back to the start to recover the most probable state sequence for the observations 3 1 3.]