Hidden Markov Modelling


Hidden Markov Modelling

- Introduction
- Problem formulation
- Forward-Backward algorithm
- Viterbi search
- Baum-Welch parameter estimation
- Other considerations:
  - Multiple observation sequences
  - Phone-based models for continuous speech recognition
  - Continuous density HMMs
  - Implementation issues

Information Theoretic Approach to ASR

[Block diagram: text generation and speech production (speech generation) feed an acoustic processor and a linguistic decoder (speech recognition); the intervening stages act as a noisy communication channel that converts the word string W into acoustic evidence A.]

Recognition is achieved by maximizing the probability of the linguistic string, W, given the acoustic evidence, A, i.e., choose the linguistic sequence Ŵ such that

P(Ŵ | A) = max_W P(W | A)

Information Theoretic Approach to ASR (cont'd)

From Bayes' rule:

P(W | A) = P(A | W) P(W) / P(A)

Hidden Markov modelling (HMM) deals with the quantity P(A | W).

Change in notation: A → O, W → λ, P(A | W) → P(O | λ)

HMM: An Example

Consider 3 mugs (numbered 0, 1, 2), each containing a mixture of state stones labelled 1 and 2. The fractions for the i-th mug are a_i1 and a_i2, and a_i1 + a_i2 = 1.

Consider 2 urns, each containing a mixture of black and white balls. The fractions for the i-th urn are b_i(B) and b_i(W); b_i(B) + b_i(W) = 1.

The parameter vector for this model is:

λ = {a_01, a_02, a_11, a_12, a_21, a_22, b_1(B), b_1(W), b_2(B), b_2(W)}

HMM: An Example (cont'd)

Observation sequence: O = {B, W, B, W, W, B}
State sequence: Q = {q_1, q_2, ..., q_6} (hidden from the observer)

Goal: Given the model λ and the observation sequence O, how can the underlying state sequence Q be determined?

Elements of a Discrete Hidden Markov Model

N: number of states in the model
  states s = {s_1, s_2, ..., s_N}; state at time t, q_t ∈ s

M: number of observation symbols (i.e., discrete observations)
  observation symbols v = {v_1, v_2, ..., v_M}; observation at time t, o_t ∈ v

A = {a_ij}: state transition probability distribution
  a_ij = P(q_{t+1} = s_j | q_t = s_i),  1 ≤ i, j ≤ N

B = {b_j(k)}: observation symbol probability distribution in state j
  b_j(k) = P(v_k at t | q_t = s_j),  1 ≤ j ≤ N,  1 ≤ k ≤ M

π = {π_i}: initial state distribution
  π_i = P(q_1 = s_i),  1 ≤ i ≤ N

Notationally, an HMM is typically written as λ = {A, B, π}.
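
As a concrete sketch (not part of the original lecture), the parameters λ = {A, B, π} of a discrete HMM can be held in three NumPy arrays. The probability values below are illustrative assumptions for a two-state, two-symbol (black/white ball) model in the spirit of the example above, with 0-based indexing.

```python
import numpy as np

# Toy discrete HMM lambda = {A, B, pi}: 2 states, 2 symbols.
# Symbols: 0 = "B" (black), 1 = "W" (white). Values are illustrative only.
pi = np.array([0.6, 0.4])             # pi_i = P(q_1 = s_i)

A = np.array([[0.7, 0.3],             # a_ij = P(q_{t+1} = s_j | q_t = s_i)
              [0.4, 0.6]])

B = np.array([[0.9, 0.1],             # b_j(k) = P(o_t = v_k | q_t = s_j)
              [0.2, 0.8]])

# Each row of A and B is a probability distribution and must sum to 1.
assert np.allclose(A.sum(axis=1), 1.0) and np.allclose(B.sum(axis=1), 1.0)
assert np.isclose(pi.sum(), 1.0)
```

The same three arrays are reused (and redefined for self-containedness) in the sketches that follow.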

HMM: An Example (cont'd)

For our simple example:

π = {a_01, a_02},   A = [ a_11  a_12 ; a_21  a_22 ],   B = [ b_1(B)  b_1(W) ; b_2(B)  b_2(W) ]

[State diagrams for a 2-state and a 3-state model, with transition probabilities a_ij labelling the arcs and output distributions {b_i(B), b_i(W)} attached to the states.]

Generation of HMM Observations

1. Choose an initial state, q_1 = s_i, based on the initial state distribution, π.
2. For t = 1 to T:
   - Choose o_t = v_k according to the symbol probability distribution in state s_i, b_i(k).
   - Transition to a new state q_{t+1} = s_j according to the state transition probability distribution for state s_i, a_ij.
3. Increment t by 1; return to step 2 if t ≤ T; else, terminate.

[Diagram: state sequence q_1, q_2, ..., q_T generated via transition probabilities a_0i, a_ij, with each state q_t emitting an observation o_t according to b_i(k).]
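
The generation procedure maps directly onto a short sampling loop. This is a sketch, not the lecture's code; it assumes the illustrative toy model from the earlier snippet, redefined here so the block runs on its own.

```python
import numpy as np

rng = np.random.default_rng(0)

# Same illustrative 2-state, 2-symbol model as in the previous sketch.
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])

def generate(pi, A, B, T):
    """Generate (states, observations) of length T from a discrete HMM."""
    N, M = B.shape
    states, obs = [], []
    q = rng.choice(N, p=pi)                 # step 1: initial state from pi
    for _ in range(T):                      # step 2: repeat for t = 1..T
        obs.append(rng.choice(M, p=B[q]))   #   emit o_t ~ b_q(.)
        states.append(q)
        q = rng.choice(N, p=A[q])           #   move to the next state ~ a_q.
    return states, obs

states, obs = generate(pi, A, B, T=6)
print(states, obs)
```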

Representing the State Diagram by a Trellis

[Trellis diagram: states s_1, s_2, s_3 plotted against time t = 0, 1, 2, 3, 4, with arcs showing the allowed transitions at each step.]

The dashed line represents a null transition, where no observation symbol is generated.

Three Basic HMM Problems

1. Scoring: Given an observation sequence O = {o_1, o_2, ..., o_T} and a model λ = {A, B, π}, how do we compute P(O | λ), the probability of the observation sequence? ==> The Forward-Backward Algorithm
2. Matching: Given an observation sequence O = {o_1, o_2, ..., o_T}, how do we choose a state sequence Q = {q_1, q_2, ..., q_T} which is optimum in some sense? ==> The Viterbi Algorithm
3. Training: How do we adjust the model parameters λ = {A, B, π} to maximize P(O | λ)? ==> The Baum-Welch Re-estimation Procedures

Computation of P(O | λ)

P(O | λ) = Σ_{all Q} P(O, Q | λ),  where  P(O, Q | λ) = P(O | Q, λ) P(Q | λ)

Consider the fixed state sequence Q = q_1 q_2 ... q_T:

P(O | Q, λ) = b_{q_1}(o_1) b_{q_2}(o_2) ... b_{q_T}(o_T)
P(Q | λ) = π_{q_1} a_{q_1 q_2} a_{q_2 q_3} ... a_{q_{T-1} q_T}

Therefore:

P(O | λ) = Σ_{q_1, q_2, ..., q_T} π_{q_1} b_{q_1}(o_1) a_{q_1 q_2} b_{q_2}(o_2) ... a_{q_{T-1} q_T} b_{q_T}(o_T)

The calculation requires on the order of 2T·N^T operations (there are N^T such sequences). For N = 5 and T = 100, that is roughly 2·100·5^100 ≈ 10^72 computations!
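
To make the N^T blow-up tangible, here is a brute-force evaluation of P(O | λ) that literally enumerates every state sequence. It is only feasible for tiny N and T, which is exactly the slide's point; the model values and observation sequence are the same illustrative assumptions as before.

```python
import itertools
import numpy as np

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
obs = [0, 1, 0, 1, 1, 0]                  # e.g. O = {B, W, B, W, W, B}

def brute_force_likelihood(pi, A, B, obs):
    """Sum P(O, Q | lambda) over all N**T state sequences Q."""
    N, T = len(pi), len(obs)
    total = 0.0
    for Q in itertools.product(range(N), repeat=T):     # N**T sequences
        p = pi[Q[0]] * B[Q[0], obs[0]]                  # pi_{q1} b_{q1}(o_1)
        for t in range(1, T):
            p *= A[Q[t - 1], Q[t]] * B[Q[t], obs[t]]    # a_{q_{t-1} q_t} b_{q_t}(o_t)
        total += p
    return total

print(brute_force_likelihood(pi, A, B, obs))
```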

The Forward Algorithm

Let us define the forward variable, α_t(i), as the probability of the partial observation sequence up to time t and state s_i at time t, given the model, i.e.,

α_t(i) = P(o_1 o_2 ... o_t, q_t = s_i | λ)

It can easily be shown that:

α_1(i) = π_i b_i(o_1),  1 ≤ i ≤ N
P(O | λ) = Σ_{i=1}^{N} α_T(i)

By induction:

α_{t+1}(j) = [ Σ_{i=1}^{N} α_t(i) a_ij ] b_j(o_{t+1}),  1 ≤ t ≤ T-1,  1 ≤ j ≤ N

Calculation is on the order of N²T. For N = 5 and T = 100, that is about 2500 computations, instead of 10^72.
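
A direct transcription of the recursion into NumPy (unscaled, so suitable only for short sequences; scaling is discussed under implementation issues later). Again a sketch using the same illustrative toy model.

```python
import numpy as np

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
obs = [0, 1, 0, 1, 1, 0]

def forward(pi, A, B, obs):
    """alpha[t, i] = P(o_1 .. o_{t+1}, q_{t+1} = s_i | lambda), with 0-based t."""
    N, T = len(pi), len(obs)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]                        # alpha_1(i) = pi_i b_i(o_1)
    for t in range(T - 1):
        # alpha_{t+1}(j) = [ sum_i alpha_t(i) a_ij ] b_j(o_{t+1})
        alpha[t + 1] = (alpha[t] @ A) * B[:, obs[t + 1]]
    return alpha

alpha = forward(pi, A, B, obs)
print("P(O | lambda) =", alpha[-1].sum())               # sum_i alpha_T(i)
```

For the toy sequence above this agrees with the brute-force sum up to rounding error.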

Forward Algorithm Illustration

[Trellis diagram: α_{t+1}(j) is obtained by summing α_t(i) a_ij over all predecessor states s_1, ..., s_N and multiplying by b_j(o_{t+1}).]

The Backward Algorithm

Similarly, let us define the backward variable, β_t(i), as the probability of the partial observation sequence from time t+1 to the end, given state s_i at time t and the model, i.e.,

β_t(i) = P(o_{t+1} o_{t+2} ... o_T | q_t = s_i, λ)

It can easily be shown that:

β_T(i) = 1,  1 ≤ i ≤ N
P(O | λ) = Σ_{i=1}^{N} π_i b_i(o_1) β_1(i)

By induction:

β_t(i) = Σ_{j=1}^{N} a_ij b_j(o_{t+1}) β_{t+1}(j),  t = T-1, T-2, ..., 1,  1 ≤ i ≤ N
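
The backward recursion in the same style; β gives an independent route to P(O | λ), which is a convenient sanity check against the forward pass. Same illustrative model, a sketch only.

```python
import numpy as np

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
obs = [0, 1, 0, 1, 1, 0]

def backward(pi, A, B, obs):
    """beta[t, i] = P(o_{t+2} .. o_T | q_{t+1} = s_i, lambda), with 0-based t."""
    N, T = len(pi), len(obs)
    beta = np.zeros((T, N))
    beta[T - 1] = 1.0                                   # beta_T(i) = 1
    for t in range(T - 2, -1, -1):
        # beta_t(i) = sum_j a_ij b_j(o_{t+1}) beta_{t+1}(j)
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    return beta

beta = backward(pi, A, B, obs)
print("P(O | lambda) =", (pi * B[:, obs[0]] * beta[0]).sum())
```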

Backward Procedure Illustration

[Trellis diagram: β_t(i) is obtained by summing a_ij b_j(o_{t+1}) β_{t+1}(j) over all successor states s_1, ..., s_N.]

Finding Optimal State Sequences

One criterion chooses states, q_t, which are individually most likely. This maximizes the expected number of correct states.

Let us define γ_t(i) as the probability of being in state s_i at time t, given the observation sequence and the model, i.e.,

γ_t(i) = P(q_t = s_i | O, λ),  with  Σ_{i=1}^{N} γ_t(i) = 1 for all t

Then the individually most likely state, q_t*, at time t is:

q_t* = argmax_{1 ≤ i ≤ N} γ_t(i),  1 ≤ t ≤ T

Note that it can be shown that:

γ_t(i) = α_t(i) β_t(i) / P(O | λ)
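
Combining the two passes gives the state posteriors γ_t(i) and the individually most likely states. The sketch below repeats the forward and backward recursions in compact form (same illustrative model) so that it runs on its own.

```python
import numpy as np

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
obs = [0, 1, 0, 1, 1, 0]
N, T = len(pi), len(obs)

# Forward and backward passes (same recursions as the earlier sketches).
alpha = np.zeros((T, N)); beta = np.zeros((T, N))
alpha[0] = pi * B[:, obs[0]]
for t in range(T - 1):
    alpha[t + 1] = (alpha[t] @ A) * B[:, obs[t + 1]]
beta[T - 1] = 1.0
for t in range(T - 2, -1, -1):
    beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])

likelihood = alpha[-1].sum()                 # P(O | lambda)

# gamma_t(i) = alpha_t(i) beta_t(i) / P(O | lambda); each row sums to 1.
gamma = alpha * beta / likelihood
best_individual_states = gamma.argmax(axis=1)
print(best_individual_states)
```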

Finding Optimal State Sequences (cont'd)

The individual optimality criterion has the problem that the optimum state sequence may not obey state transition constraints. Another optimality criterion is to choose the state sequence which maximizes P(Q, O | λ); this can be found by the Viterbi algorithm.

Let us define δ_t(i) as the highest probability along a single path, at time t, which accounts for the first t observations, i.e.,

δ_t(i) = max_{q_1, q_2, ..., q_{t-1}} P(q_1 q_2 ... q_{t-1}, q_t = s_i, o_1 o_2 ... o_t | λ)

By induction:

δ_{t+1}(j) = [ max_i δ_t(i) a_ij ] b_j(o_{t+1})

To retrieve the state sequence, we must keep track of the state which gave the best path, at time t, to state s_i. We do this in a separate array ψ_t(i).

The Viterbi Algorithm

1. Initialization:
   δ_1(i) = π_i b_i(o_1),  ψ_1(i) = 0,  1 ≤ i ≤ N

2. Recursion:
   δ_t(j) = max_{1 ≤ i ≤ N} [ δ_{t-1}(i) a_ij ] b_j(o_t),  2 ≤ t ≤ T,  1 ≤ j ≤ N
   ψ_t(j) = argmax_{1 ≤ i ≤ N} [ δ_{t-1}(i) a_ij ],  2 ≤ t ≤ T,  1 ≤ j ≤ N

3. Termination:
   P* = max_{1 ≤ i ≤ N} δ_T(i)
   q_T* = argmax_{1 ≤ i ≤ N} δ_T(i)

4. Path (state-sequence) backtracking:
   q_t* = ψ_{t+1}(q_{t+1}*),  t = T-1, T-2, ..., 1

Computation is on the order of N²T.
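
The four steps above translate into the following sketch (probability domain; in practice one works with log probabilities, as noted under implementation issues). The model and observation sequence remain the illustrative ones used earlier.

```python
import numpy as np

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
obs = [0, 1, 0, 1, 1, 0]

def viterbi(pi, A, B, obs):
    """Return (best state sequence, P*) for a discrete HMM."""
    N, T = len(pi), len(obs)
    delta = np.zeros((T, N))
    psi = np.zeros((T, N), dtype=int)
    delta[0] = pi * B[:, obs[0]]                       # 1. initialization
    for t in range(1, T):                              # 2. recursion
        scores = delta[t - 1][:, None] * A             #    delta_{t-1}(i) a_ij
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) * B[:, obs[t]]
    path = np.zeros(T, dtype=int)                      # 3. termination
    path[T - 1] = delta[T - 1].argmax()
    p_star = delta[T - 1].max()
    for t in range(T - 2, -1, -1):                     # 4. backtracking
        path[t] = psi[t + 1][path[t + 1]]
    return path, p_star

path, p_star = viterbi(pi, A, B, obs)
print(path, p_star)
```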

The Viterbi Algorithm: An Example

[Example: a 3-state model with output symbols a and b, and observation sequence O = {a, a, b, b}. The slide shows the state diagram with its transition and output probabilities, and the trellis of δ_t(i) values for states s_1, s_2, s_3 at t = 0, 1, 2, 3, 4.]

The Viterbi Algorithm: An Example (cont'd)

[Trellis table: for each prefix of the observation sequence (a, aa, aab, aabb), the slide lists the best-path scores δ_t(i) for states s_1, s_2, s_3 together with the predecessor state chosen at each step; the best final score identifies the optimal state sequence by backtracking.]

Matching Using the Forward-Backward Algorithm

[Same example computed with the forward algorithm: for each prefix (a, aa, aab, aabb), the table lists the forward scores α_t(i) for states s_1, s_2, s_3, obtained by summing over predecessor states rather than maximizing as in the Viterbi trellis.]

Baum-Welch Re-estimation

Baum-Welch re-estimation uses EM to determine ML parameters.

Define ξ_t(i,j) as the probability of being in state s_i at time t and state s_j at time t+1, given the model and observation sequence. Then:

ξ_t(i,j) = P(q_t = s_i, q_{t+1} = s_j | O, λ) = α_t(i) a_ij b_j(o_{t+1}) β_{t+1}(j) / P(O | λ)

γ_t(i) = Σ_{j=1}^{N} ξ_t(i,j)

Summing γ_t(i) and ξ_t(i,j) over time, we get:

Σ_{t=1}^{T-1} γ_t(i) = expected number of transitions from s_i
Σ_{t=1}^{T-1} ξ_t(i,j) = expected number of transitions from s_i to s_j

Baum-Welch Re-estimation Procedures

[Trellis diagram: the quantity ξ_t(i,j) combines the forward score α_t(i), the transition probability a_ij, the emission probability b_j(o_{t+1}), and the backward score β_{t+1}(j).]

Baum-Welch Re-estimation Formulas

π̄_i = expected number of times in state s_i at t = 1 = γ_1(i)

ā_ij = (expected number of transitions from state s_i to s_j) / (expected number of transitions from state s_i)
     = Σ_{t=1}^{T-1} ξ_t(i,j) / Σ_{t=1}^{T-1} γ_t(i)

b̄_j(k) = (expected number of times in state s_j with symbol v_k) / (expected number of times in state s_j)
       = Σ_{t=1: o_t = v_k}^{T} γ_t(j) / Σ_{t=1}^{T} γ_t(j)
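
One full re-estimation iteration for a single observation sequence, written directly from the formulas above. This is a sketch for the illustrative toy model (unscaled); production systems add scaling and multiple training sequences, as discussed later.

```python
import numpy as np

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
obs = np.array([0, 1, 0, 1, 1, 0])
N, M, T = len(pi), B.shape[1], len(obs)

# Forward and backward passes (same recursions as the earlier sketches).
alpha = np.zeros((T, N)); beta = np.zeros((T, N))
alpha[0] = pi * B[:, obs[0]]
for t in range(T - 1):
    alpha[t + 1] = (alpha[t] @ A) * B[:, obs[t + 1]]
beta[T - 1] = 1.0
for t in range(T - 2, -1, -1):
    beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
P_O = alpha[-1].sum()

# xi_t(i,j) = alpha_t(i) a_ij b_j(o_{t+1}) beta_{t+1}(j) / P(O | lambda)
xi = np.zeros((T - 1, N, N))
for t in range(T - 1):
    xi[t] = alpha[t][:, None] * A * B[:, obs[t + 1]] * beta[t + 1] / P_O
gamma = alpha * beta / P_O                              # gamma_t(i)

# Re-estimation formulas.
pi_new = gamma[0]
A_new = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
B_new = np.zeros((N, M))
for k in range(M):
    B_new[:, k] = gamma[obs == k].sum(axis=0) / gamma.sum(axis=0)

print(pi_new, A_new, B_new, sep="\n")
```

Iterating this update (replacing pi, A, B by the re-estimated values) increases P(O | λ) at each step, as stated on the next slide.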

Baum-Welch Re-estimation Formulas (cont'd)

If λ = (A, B, π) is the initial model and λ̄ = (Ā, B̄, π̄) is the re-estimated model, then it can be proved that either:

1. the initial model, λ, defines a critical point of the likelihood function, in which case λ̄ = λ, or
2. model λ̄ is more likely than λ in the sense that P(O | λ̄) > P(O | λ), i.e., we have found a new model λ̄ from which the observation sequence is more likely to have been produced.

Thus we can improve the probability of O being observed from the model if we iteratively use λ̄ in place of λ and repeat the re-estimation until some limiting point is reached. The resulting model is called the maximum likelihood HMM.

Multiple Observation Sequences

Speech recognition typically uses left-to-right HMMs. These HMMs cannot be trained using a single observation sequence, because only a small number of observations are available to train each state. To obtain reliable estimates of the model parameters, one must use multiple observation sequences. In this case, the re-estimation procedure needs to be modified.

Let us denote the set of K observation sequences as O = {O^(1), O^(2), ..., O^(K)}, where O^(k) = {o_1^(k), o_2^(k), ..., o_{T_k}^(k)} is the k-th observation sequence. Assuming that the observation sequences are mutually independent, we want to estimate the parameters so as to maximize

P(O | λ) = Π_{k=1}^{K} P(O^(k) | λ) = Π_{k=1}^{K} P_k

Multiple Observation Sequences (cont'd)

Since the re-estimation formulas are based on frequencies of occurrence of various events, we can modify them by adding up the individual frequencies of occurrence for each sequence:

ā_ij = [ Σ_{k=1}^{K} Σ_{t=1}^{T_k-1} ξ_t^k(i,j) ] / [ Σ_{k=1}^{K} Σ_{t=1}^{T_k-1} γ_t^k(i) ]
     = [ Σ_{k=1}^{K} (1/P_k) Σ_{t=1}^{T_k-1} α_t^k(i) a_ij b_j(o_{t+1}^(k)) β_{t+1}^k(j) ] / [ Σ_{k=1}^{K} (1/P_k) Σ_{t=1}^{T_k-1} α_t^k(i) β_t^k(i) ]

b̄_j(l) = [ Σ_{k=1}^{K} Σ_{t=1: o_t^(k) = v_l}^{T_k} γ_t^k(j) ] / [ Σ_{k=1}^{K} Σ_{t=1}^{T_k} γ_t^k(j) ]
       = [ Σ_{k=1}^{K} (1/P_k) Σ_{t=1: o_t^(k) = v_l}^{T_k} α_t^k(j) β_t^k(j) ] / [ Σ_{k=1}^{K} (1/P_k) Σ_{t=1}^{T_k} α_t^k(j) β_t^k(j) ]

Phone-based HMMs

Word-based HMMs are appropriate for small-vocabulary speech recognition. For large-vocabulary ASR, sub-word-based (e.g., phone-based) models are more appropriate.

Phone-based HMMs (cont'd)

The phone models can have many states, and words are made up from a concatenation of phone models.

Continuous Density Hidden Markov Models

A continuous density HMM replaces the discrete observation probabilities, b_j(k), by a continuous PDF, b_j(x). A common practice is to represent b_j(x) as a mixture of Gaussians:

b_j(x) = Σ_{k=1}^{M} c_jk N[x, µ_jk, Σ_jk],  1 ≤ j ≤ N

where c_jk is the mixture weight (c_jk ≥ 0 for 1 ≤ j ≤ N, 1 ≤ k ≤ M, and Σ_{k=1}^{M} c_jk = 1 for 1 ≤ j ≤ N), N[·] is the normal density, and µ_jk and Σ_jk are the mean vector and covariance matrix associated with state j and mixture k.
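
For a continuous density HMM, the only change to the earlier sketches is how b_j(x) is evaluated. The snippet below computes a Gaussian-mixture emission density for a single state with diagonal covariances; the weights, means, and variances are illustrative placeholders, not values from the lecture.

```python
import numpy as np

def gmm_density(x, weights, means, variances):
    """b_j(x) = sum_k c_jk N(x; mu_jk, Sigma_jk), with diagonal covariances.

    x: (D,) observation vector; weights: (M,); means, variances: (M, D).
    """
    diff = x - means                                    # (M, D)
    exponent = -0.5 * np.sum(diff ** 2 / variances, axis=1)
    norm = np.prod(2.0 * np.pi * variances, axis=1) ** -0.5
    return np.sum(weights * norm * np.exp(exponent))

# Illustrative 2-component mixture over 3-dimensional observations.
weights = np.array([0.6, 0.4])                          # c_jk, sum to 1
means = np.array([[0.0, 0.0, 0.0], [1.0, -1.0, 0.5]])   # mu_jk
variances = np.array([[1.0, 1.0, 1.0], [0.5, 2.0, 1.0]])

x = np.array([0.2, -0.3, 0.1])
print(gmm_density(x, weights, means, variances))
```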

Acoustic Modelling Variations

Semi-continuous HMMs first compute a VQ codebook of size M. The VQ codebook is then modelled as a family of Gaussian PDFs: each codeword is represented by a Gaussian PDF, and may be used together with others to model the acoustic vectors. From the CD-HMM viewpoint, this is equivalent to using the same set of M mixtures to model all the states; it is therefore often referred to as a tied-mixture HMM.

All three methods (discrete, semi-continuous, and continuous density HMMs) have been used in many speech recognition tasks, with varying outcomes. For large-vocabulary, continuous speech recognition with a sufficient amount (i.e., tens of hours) of training data, CD-HMM systems currently yield the best performance, but with a considerable increase in computation.

Implementation Issues

- Scaling: to prevent underflow
- Segmental K-means training: to train observation probabilities by first performing Viterbi alignment
- Initial estimates of λ: to provide robust models
- Pruning: to reduce search computation
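
The scaling item is the one most implementations hit first: unscaled α and δ values underflow after a few hundred frames. One common remedy, sketched below as an illustration rather than the lecture's method, is to run the forward recursion in the log domain with a log-sum-exp at each step.

```python
import numpy as np

def log_forward(pi, A, B, obs):
    """Forward algorithm in the log domain; returns log P(O | lambda)."""
    log_pi, log_B = np.log(pi), np.log(B)
    log_alpha = log_pi + log_B[:, obs[0]]
    for o in obs[1:]:
        # alpha_{t+1}(j) = [ sum_i alpha_t(i) a_ij ] b_j(o_{t+1}), in log space.
        m = log_alpha.max()
        log_alpha = m + np.log(np.exp(log_alpha - m) @ A) + log_B[:, o]
    m = log_alpha.max()
    return m + np.log(np.sum(np.exp(log_alpha - m)))

# Same illustrative model as earlier; a sequence this long would underflow
# the unscaled forward pass but is handled comfortably in the log domain.
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
obs = [0, 1, 0, 1, 1, 0] * 200          # T = 1200 frames
print(log_forward(pi, A, B, obs))
```

An alternative with the same effect is per-frame scaling of α (normalizing each column and accumulating the log of the scale factors), which is the form usually described alongside Baum-Welch.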

References

X. Huang, A. Acero, and H. Hon, Spoken Language Processing, Prentice-Hall, 2001.
F. Jelinek, Statistical Methods for Speech Recognition, MIT Press, 1997.
L. Rabiner and B. Juang, Fundamentals of Speech Recognition, Prentice-Hall, 1993.