HIDDEN MARKOV MODELS IN SPEECH RECOGNITION

Wayne Ward, Carnegie Mellon University, Pittsburgh, PA

Acknowledgements
Much of this talk is derived from the paper "An Introduction to Hidden Markov Models" by Rabiner and Juang, and from the talk "Hidden Markov Models: Continuous Speech Recognition" by Kai-Fu Lee.

Topics
- Markov Models and Hidden Markov Models
- HMMs applied to speech recognition
- Training
- Decoding

Speech Recognition
(Figure: analog speech -> Front End -> discrete observations O1 O2 ... OT -> Match/Search -> word sequence W1 W2 ... WT.)

ML Continuous Speech Recognition
Goal: given acoustic data A = a1, a2, ..., ak, find the word sequence W = w1, w2, ..., wn such that P(W | A) is maximized.
Bayes rule:
    P(W | A) = P(A | W) P(W) / P(A)
where P(A | W) is the acoustic model (HMMs) and P(W) is the language model. P(A) is a constant for a complete sentence, so the recognizer searches for the W that maximizes P(A | W) P(W).

Markov Models
Elements:
- States: S = S0, S1, ..., SN
- Transition probabilities: P(q_t = S_i | q_{t-1} = S_j)
(Figure: two-state chain with states A and B and transitions P(A | A), P(B | A), P(A | B), P(B | B).)
Markov assumption: the transition probability depends only on the current state:
    P(q_t = S_i | q_{t-1} = S_j, q_{t-2} = S_k, ...) = P(q_t = S_i | q_{t-1} = S_j) = a_ji
with a_ji >= 0 for all j, i and sum_i a_ji = 1 for each j.
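As an aside, the Markov assumption is easy to see in code: the next state is drawn using only the current state's row of the transition matrix. A minimal sketch, with made-up transition values for a two-state chain over A and B (not from the slides):

```python
import numpy as np

# Hypothetical two-state chain over {A, B}; a[j, i] = P(next = i | current = j).
states = ["A", "B"]
a = np.array([[0.7, 0.3],     # from A: P(A|A)=0.7, P(B|A)=0.3
              [0.4, 0.6]])    # from B: P(A|B)=0.4, P(B|B)=0.6
assert np.allclose(a.sum(axis=1), 1.0)   # each row is a probability distribution

rng = np.random.default_rng(0)
q = 0                          # start in state A
sequence = [states[q]]
for _ in range(10):
    # The next state depends only on the current state (Markov assumption).
    q = rng.choice(len(states), p=a[q])
    sequence.append(states[q])
print(" ".join(sequence))
```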

Single Fair Coin
(Figure: two states, with 0.5 self-loops and 0.5 transitions between them. State 1: P(H) = 1.0, P(T) = 0.0. State 2: P(H) = 0.0, P(T) = 1.0.)
Outcome head corresponds to state 1, tail to state 2. The observation sequence uniquely defines the state sequence.

Hidden Markov Models
Elements:
- States: S = S0, S1, ..., SN
- Transition probabilities: P(q_t = S_i | q_{t-1} = S_j) = a_ji
- Output probability distributions (at state j for symbol k): P(y_t = O_k | q_t = S_j) = b_j(k)
(Figure: two-state model A, B with transitions P(A | A), P(B | A), P(A | B), P(B | B); each state also has an output probability distribution P(O_1), P(O_2), ..., P(O_M) over the observation symbols.)
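In code, these elements reduce to three arrays. A minimal sketch of one possible layout; the numeric values are those of the two-state example used later in the trellis slides:

```python
import numpy as np

# A discrete-output HMM is fully specified by:
#   pi[j]   - probability of starting in state j
#   a[i, j] - transition probability from state i to state j
#   b[j, k] - probability of emitting symbol k while in state j
pi = np.array([1.0, 0.0])                 # always start in state 0
a  = np.array([[0.6, 0.4],
               [0.0, 1.0]])
b  = np.array([[0.8, 0.2],                # state 0: P(A), P(B)
               [0.3, 0.7]])               # state 1: P(A), P(B)
# Rows of a and b are probability distributions:
assert np.allclose(a.sum(axis=1), 1.0) and np.allclose(b.sum(axis=1), 1.0)
```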

Discrete Observation HMM
(Figure: three states with output distributions over the symbols R, B, Y. State 1: P(R) = 0.31, P(B) = 0.50, P(Y) = 0.19. State 2: P(R) = 0.50, P(B) = 0.25, P(Y) = 0.25. State 3: P(R) = 0.38, P(B) = 0.12, P(Y) = 0.50.)
Observation sequence: R B Y Y R. The observation sequence does not uniquely determine the state sequence.

HMMs in Speech Recognition
- Represent speech as a sequence of observations
- Use an HMM to model some unit of speech (phone, word)
- Concatenate units into larger units
(Figure: a phone model for ih, and a word model built by concatenating the phone models d ih d.)

HMM Problems and Solutions
Evaluation:
- Problem: compute the probability of an observation sequence given a model
- Solution: Forward Algorithm and Viterbi Algorithm
Decoding:
- Problem: find the state sequence which maximizes the probability of the observation sequence
- Solution: Viterbi Algorithm
Training:
- Problem: adjust the model parameters to maximize the probability of observed sequences
- Solution: Forward-Backward Algorithm

Evaluation
The probability of the observation sequence O = O1 O2 ... OT given the HMM model λ is
    P(O | λ) = sum over all state sequences Q of P(O, Q | λ),   where Q = q0 q1 ... qT
             = sum over Q of a_{q0 q1} b_{q1}(O1) a_{q1 q2} b_{q2}(O2) ... a_{q_{T-1} q_T} b_{q_T}(OT)
Direct evaluation is not practical, since the number of paths is O(N^T), where N is the number of states in the model and T is the number of observations in the sequence.

The Forward Algorithm
    α_t(j) = P(O1 O2 ... Ot, q_t = S_j | λ)
Compute α recursively:
    α_0(j) = 1 if j is the start state, 0 otherwise
    α_t(j) = [ sum_{i=0..N} α_{t-1}(i) a_ij ] b_j(O_t),   t > 0
    P(O | λ) = α_T(S_N)
Computation is O(N^2 T).

Forward Trellis
Example: a two-state model. State 1 (initial) outputs A with 0.8 and B with 0.2, has a 0.6 self-loop and a 0.4 transition to state 2. State 2 (final) outputs A with 0.3 and B with 0.7 and has a 1.0 self-loop. Observation sequence: A A B.
α values:
    state 1: t=0: 1.0, t=1: 0.48, t=2: 0.23, t=3: 0.03
    state 2: t=0: 0.0, t=1: 0.12, t=2: 0.09, t=3: 0.13
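These α values can be reproduced with a short NumPy sketch of the forward recursion. This is an illustrative implementation, coding state 1 as index 0, state 2 as index 1, and the symbols A and B as 0 and 1:

```python
import numpy as np

def forward(pi, a, b, obs):
    """Forward algorithm: alpha[t, j] = P(O_1..O_t, q_t = S_j | lambda)."""
    T, N = len(obs), len(pi)
    alpha = np.zeros((T + 1, N))
    alpha[0] = pi                                   # alpha_0: initial state distribution
    for t in range(1, T + 1):
        # alpha_t(j) = [sum_i alpha_{t-1}(i) * a_ij] * b_j(O_t)
        alpha[t] = (alpha[t - 1] @ a) * b[:, obs[t - 1]]
    return alpha

pi = np.array([1.0, 0.0])
a  = np.array([[0.6, 0.4], [0.0, 1.0]])
b  = np.array([[0.8, 0.2], [0.3, 0.7]])
obs = [0, 0, 1]                                     # A A B

alpha = forward(pi, a, b, obs)
print(np.round(alpha, 2))   # rows t=0..3: [1.0 0.0], [0.48 0.12], [0.23 0.09], [0.03 0.13]
print(alpha[-1, 1])         # P(O | lambda) = alpha_T(final state) ~ 0.13
```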

The Backward Algorithm
    β_t(i) = P(O_{t+1} O_{t+2} ... OT | q_t = S_i, λ)
Compute β recursively:
    β_T(i) = 1 if i is the end state, 0 otherwise
    β_t(i) = sum_{j=0..N} a_ij b_j(O_{t+1}) β_{t+1}(j),   t < T
    P(O | λ) = β_0(S_0) = α_T(S_N)
Computation is O(N^2 T).

Backward Trellis
Same model and observation sequence (A A B) as the forward trellis.
β values:
    state 1: t=0: 0.13, t=1: 0.22, t=2: 0.28, t=3: 0.0
    state 2: t=0: 0.06, t=1: 0.21, t=2: 0.7, t=3: 1.0
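A matching sketch of the backward recursion reproduces these β values (same illustrative encoding as the forward sketch; state 2, index 1, is the end state):

```python
import numpy as np

def backward(a, b, obs, end_state):
    """Backward algorithm: beta[t, i] = P(O_{t+1}..O_T | q_t = S_i, lambda)."""
    T, N = len(obs), a.shape[0]
    beta = np.zeros((T + 1, N))
    beta[T, end_state] = 1.0                        # beta_T: 1 for the end state, 0 otherwise
    for t in range(T - 1, -1, -1):
        # beta_t(i) = sum_j a_ij * b_j(O_{t+1}) * beta_{t+1}(j)
        beta[t] = a @ (b[:, obs[t]] * beta[t + 1])
    return beta

a  = np.array([[0.6, 0.4], [0.0, 1.0]])
b  = np.array([[0.8, 0.2], [0.3, 0.7]])
obs = [0, 0, 1]                                     # A A B

beta = backward(a, b, obs, end_state=1)
print(np.round(beta, 2))   # rows t=0..3: [0.13 0.06], [0.22 0.21], [0.28 0.7], [0.0 1.0]
print(beta[0, 0])          # P(O | lambda) = beta_0(start state) ~ 0.13, matching alpha_T
```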

The Viterbi Algorithm
For decoding: find the state sequence Q which maximizes P(O, Q | λ).
Similar to the Forward Algorithm, except MAX instead of SUM:
    VP_t(i) = max over q0, ..., q_{t-1} of P(O1 O2 ... Ot, q_t = i | λ)
Recursive computation:
    VP_t(j) = max_{i=0..N} VP_{t-1}(i) a_ij b_j(O_t),   t > 0
    P(O, Q* | λ) = VP_T(S_N), where Q* is the maximizing state sequence
Save each maximum for backtrace at the end.

Viterbi Trellis
Same model and observation sequence (A A B) as the forward trellis.
VP values:
    state 1: t=0: 1.0, t=1: 0.48, t=2: 0.23, t=3: 0.03
    state 2: t=0: 0.0, t=1: 0.12, t=2: 0.06, t=3: 0.06
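Replacing the sum in the forward recursion with a max, and keeping backpointers for the backtrace, reproduces this trellis. An illustrative sketch:

```python
import numpy as np

def viterbi(pi, a, b, obs, end_state):
    """Viterbi algorithm: trellis of VP values plus the best state sequence."""
    T, N = len(obs), len(pi)
    vp = np.zeros((T + 1, N))
    back = np.zeros((T + 1, N), dtype=int)
    vp[0] = pi
    for t in range(1, T + 1):
        # vp_t(j) = max_i vp_{t-1}(i) * a_ij * b_j(O_t); remember the best predecessor i.
        scores = vp[t - 1][:, None] * a * b[:, obs[t - 1]]
        back[t] = scores.argmax(axis=0)
        vp[t] = scores.max(axis=0)
    # Backtrace from the end state to recover the best state sequence.
    path = [end_state]
    for t in range(T, 0, -1):
        path.append(back[t, path[-1]])
    return vp, path[::-1]

pi = np.array([1.0, 0.0])
a  = np.array([[0.6, 0.4], [0.0, 1.0]])
b  = np.array([[0.8, 0.2], [0.3, 0.7]])
obs = [0, 0, 1]                                     # A A B

vp, path = viterbi(pi, a, b, obs, end_state=1)
print(np.round(vp, 2))   # rows t=0..3: [1.0 0.0], [0.48 0.12], [0.23 0.06], [0.03 0.06]
print(path)              # best state sequence q_0..q_3: [0, 0, 0, 1], i.e. states 1 1 1 2
```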

Training HMM Parameters
- Train the parameters of the HMM: tune λ to maximize P(O | λ)
- No efficient algorithm exists for the global optimum
- An efficient iterative algorithm finds a local optimum: Baum-Welch (Forward-Backward) re-estimation
  - Compute probabilities using the current model λ
  - Refine λ -> λ' based on the computed values
  - Use α and β from the Forward and Backward algorithms

Forward-Backward Algorithm
ξ_t(i, j) = probability of transiting from S_i to S_j at time t, given O
          = P(q_t = S_i, q_{t+1} = S_j | O, λ)
          = α_t(i) a_ij b_j(O_{t+1}) β_{t+1}(j) / P(O | λ)

Baum-Welch Re-estimation
a_ij = (expected number of transitions from S_i to S_j) / (expected number of transitions from S_i)
     = [ sum_{t=1..T} ξ_t(i, j) ] / [ sum_{t=1..T} sum_{j=0..N} ξ_t(i, j) ]
b_j(k) = (expected number of times in state j with symbol k) / (expected number of times in state j)
       = [ sum_{t: O_t = k} sum_{i=0..N} ξ_t(i, j) ] / [ sum_{t=1..T} sum_{i=0..N} ξ_t(i, j) ]

Convergence of the Forward-Backward Algorithm
1. Initialize λ = (A, B)
2. Compute α, β, and ξ
3. Estimate λ' = (A', B') from ξ
4. Replace λ with λ'
5. If not converged, go to 2
It can be shown that P(O | λ') > P(O | λ) unless λ' = λ.
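A single re-estimation step follows directly from the ξ definition. The sketch below is an illustrative single-sequence version that reuses the conventions of the forward and backward sketches above; it is not the talk's implementation:

```python
import numpy as np

def baum_welch_step(pi, a, b, obs, end_state):
    """One Baum-Welch re-estimation step for a single observation sequence."""
    T, N = len(obs), len(pi)

    # Forward and backward passes (same conventions as the earlier sketches).
    alpha = np.zeros((T + 1, N)); alpha[0] = pi
    for t in range(1, T + 1):
        alpha[t] = (alpha[t - 1] @ a) * b[:, obs[t - 1]]
    beta = np.zeros((T + 1, N)); beta[T, end_state] = 1.0
    for t in range(T - 1, -1, -1):
        beta[t] = a @ (b[:, obs[t]] * beta[t + 1])
    p_obs = alpha[T, end_state]

    # xi[t, i, j] = P(q_t = S_i, q_{t+1} = S_j | O, lambda)
    xi = np.zeros((T, N, N))
    for t in range(T):
        xi[t] = alpha[t][:, None] * a * b[:, obs[t]] * beta[t + 1] / p_obs

    # Re-estimate transition and output probabilities from expected counts.
    a_new = xi.sum(axis=0) / xi.sum(axis=(0, 2))[:, None]
    gamma = xi.sum(axis=1)          # gamma[t, j]: prob. of being in j when O_{t+1} is emitted
    b_new = np.zeros_like(b)
    for k in range(b.shape[1]):
        b_new[:, k] = gamma[np.array(obs) == k].sum(axis=0) / gamma.sum(axis=0)
    return a_new, b_new, p_obs

# Iterate: replace lambda with lambda' until P(O | lambda) stops improving.
pi = np.array([1.0, 0.0])
a  = np.array([[0.6, 0.4], [0.0, 1.0]])
b  = np.array([[0.8, 0.2], [0.3, 0.7]])
obs = [0, 0, 1]                     # A A B
for _ in range(5):
    a, b, p = baum_welch_step(pi, a, b, obs, end_state=1)
    print(p)                        # P(O | lambda) is non-decreasing across iterations
```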

HMMs in Speech Recognition
- Represent speech as a sequence of symbols
- Use an HMM to model some unit of speech (phone, word)
- Output probabilities: probability of observing a symbol in a state
- Transition probabilities: probability of staying in or skipping a state
(Figure: a phone model.)

Training HMMs for Continuous Speech
- Use only the orthographic transcription of the sentence; no need for segmented/labelled data
- Concatenate phone models to give a word model
- Concatenate word models to give a sentence model
- Train the entire sentence model on the entire spoken sentence

Forward-Backward Training for Continuous Speech
(Figure: the sentence "SHOW ALL ALERTS" expanded into the phone sequence SH OW AA L AX L ER TS and concatenated into a single sentence model.)

Recognition Search
(Figure: a recognition network in which phone models are concatenated into word models, e.g. /w/ -> /ah/ -> /ts/ and /th/ -> /ax/, and word models are connected according to the grammar; example word strings include "what's the willamette's location", "kirk's longitude", and "display sterett's latitude".)

Viterbi Search
- Uses Viterbi decoding: takes MAX, not SUM
- Finds the optimal state sequence P(O, Q | λ), not the optimal word sequence P(O | λ)
- Time synchronous: extends all paths by one time step, so all paths have the same length (no need to normalize to compare scores)

Viterbi Search Algorithm
0. Create a state list with one cell for each state in the system
1. Initialize the state list with the initial states for time t = 0
2. Clear the state list for time t+1
3. Compute within-word transitions from time t to t+1
   - If a new state is reached, update the score and BackPtr
   - If a better score is found for a state, update the score and BackPtr
4. Compute between-word transitions at time t+1
   - If a new state is reached, update the score and BackPtr
   - If a better score is found for a state, update the score and BackPtr
5. If end of utterance, print the backtrace and quit
6. Else increment t and go to step 2

Viterbi Search Algorithm (continued)
(Figure: trellis cells for Word 1 and Word 2, each with states S1, S2, S3, at times t and t+1. Within-word updates combine OldProb(S1), OutProb, and TransProb; between-word transitions add the language model score P(W2 | W1) starting from OldProb(S3). Each cell stores Score, BackPtr, and ParmPtr.)
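A sketch of the within-word update in step 3 above, using a hypothetical cell layout (each active state keeps a Score and a BackPtr) and log scores to avoid underflow:

```python
import math

def within_word_transitions(from_cells, to_cells, log_a, log_b, obs):
    """Step 3 sketch: extend every active state at time t into time t+1.

    from_cells / to_cells map state -> (Score, BackPtr); log_a[s][s'] and
    log_b[s'][obs] are log transition and output probabilities.
    """
    for state, (score, _) in from_cells.items():
        for nxt, a in log_a.get(state, {}).items():
            new_score = score + a + log_b[nxt][obs]
            # "If new state reached" or "if better score for state": update Score and BackPtr.
            if nxt not in to_cells or new_score > to_cells[nxt][0]:
                to_cells[nxt] = (new_score, state)
    return to_cells

# Tiny illustration with the two-state model from the trellis slides (A=0, B=1):
log_a = {0: {0: math.log(0.6), 1: math.log(0.4)}, 1: {1: math.log(1.0)}}
log_b = {0: {0: math.log(0.8), 1: math.log(0.2)}, 1: {0: math.log(0.3), 1: math.log(0.7)}}
cells = {0: (0.0, None)}                   # t=0: only the initial state is active
for obs in [0, 0, 1]:                      # A A B
    cells = within_word_transitions(cells, {}, log_a, log_b, obs)
print({s: round(math.exp(v), 2) for s, (v, _) in cells.items()})   # {0: 0.03, 1: 0.06}
```

Step 4 (between-word transitions) would apply the same kind of update, adding the language model score P(W2 | W1) when a path crosses a word boundary.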

Viterbi Beam Search
Viterbi search:
- All states are enumerated; not practical for large grammars
- Most states are inactive at any given time
Viterbi beam search: prune less likely paths
- States whose score is worse than a threshold range below the best are pruned
- FROM and TO structures are created dynamically as lists of active states

Viterbi Beam Search (continued)
(Figure: the FROM beam at time t holds the states within a threshold of the best state. Each extension into time t+1 is tested: Is it within the threshold? Does the state already exist in the TO beam? Is its score better than the existing score in the TO beam? The TO beam at time t+1 is constructed dynamically.)
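The pruning test itself is small; a sketch with a hypothetical beam width, working in the log domain:

```python
import math

def prune_to_beam(to_cells, log_beam_width):
    """Keep only states whose log score is within log_beam_width of the best state."""
    best = max(score for score, _ in to_cells.values())
    return {s: (score, bp) for s, (score, bp) in to_cells.items()
            if score >= best - log_beam_width}

# Hypothetical TO beam after one frame: state -> (log score, back pointer).
to_cells = {0: (-2.0, None), 1: (-4.5, 0), 2: (-9.0, 0)}
active = prune_to_beam(to_cells, log_beam_width=math.log(100.0))  # prune paths ~100x worse than best
print(sorted(active))   # state 2 falls outside the beam -> [0, 1]
```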

Continuous Density HMMs
- The model so far has assumed discrete observations: each observation in a sequence was one of a set of M discrete symbols
- Speech input must be vector quantized in order to provide discrete input, and VQ leads to quantization error
- The discrete probability distribution b_j(k) can be replaced with the continuous probability density b_j(x), where x is the observation vector
- Typically Gaussian densities are used; a single Gaussian is not adequate, so a weighted sum of Gaussians is used to approximate the actual PDF

Mixture Density Functions
The probability density function for state j is
    b_j(x) = sum_{m=1..M} c_jm N(x, µ_jm, U_jm)
where
- x = observation vector (x1, x2, ..., xD)
- M = number of mixture components (Gaussians)
- c_jm = weight of mixture m in state j, with sum_{m=1..M} c_jm = 1
- N = Gaussian density function
- µ_jm = mean vector for mixture m, state j
- U_jm = covariance matrix for mixture m, state j
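The mixture density can be coded directly from this definition. A sketch in plain NumPy with made-up parameters for a single state (two mixture components over two-dimensional observation vectors):

```python
import numpy as np

def gaussian_pdf(x, mean, cov):
    """Multivariate Gaussian density N(x, mu, U)."""
    d = len(mean)
    diff = x - mean
    norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(cov))
    return np.exp(-0.5 * diff @ np.linalg.inv(cov) @ diff) / norm

def mixture_density(x, weights, means, covs):
    """b_j(x) = sum_m c_jm * N(x, mu_jm, U_jm), with sum_m c_jm = 1."""
    return sum(c * gaussian_pdf(x, mu, U) for c, mu, U in zip(weights, means, covs))

# Hypothetical 2-component mixture for one state, 2-dimensional observation vectors.
weights = [0.6, 0.4]
means   = [np.array([0.0, 0.0]), np.array([3.0, 1.0])]
covs    = [np.eye(2), np.diag([2.0, 0.5])]

x = np.array([1.0, 0.5])
print(mixture_density(x, weights, means, covs))   # emission density b_j(x) for this state
```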

Discrete HMM vs. Continuous HMM
Problems with discrete:
- Quantization errors
- The codebook and the HMMs are modelled separately
Problems with continuous mixtures:
- A small number of mixtures performs poorly
- A large number of mixtures increases the computation and the parameters to be estimated (c_jm, µ_jm, U_jm for j = 1, ..., N and m = 1, ..., M)
- Continuous densities make more assumptions than discrete ones, especially with diagonal-covariance PDFs
- A discrete probability is a table lookup; continuous mixtures require many multiplications

Model Topologies
- Ergodic: fully connected; each state has a transition to every other state
- Left-to-Right: transitions only to the same state or to states with a higher index, never backward, inherently imposing a temporal order; these are most often used for speech
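In terms of the transition matrix, the two topologies differ only in which entries are allowed to be non-zero. A small illustrative sketch of a left-to-right matrix (the values are made up): an ergodic model has a dense matrix, while a left-to-right model is zero below the diagonal, leaving self-loops plus forward and skip transitions.

```python
import numpy as np

# Illustrative 4-state left-to-right transition matrix: each state may stay,
# move to the next state, or skip one state; nothing moves to a lower index.
a_left_to_right = np.array([
    [0.5, 0.4, 0.1, 0.0],
    [0.0, 0.5, 0.4, 0.1],
    [0.0, 0.0, 0.6, 0.4],
    [0.0, 0.0, 0.0, 1.0],
])
assert np.allclose(a_left_to_right.sum(axis=1), 1.0)        # rows are distributions
assert np.allclose(np.tril(a_left_to_right, k=-1), 0.0)     # no backward transitions

# An ergodic model simply has no such zero constraint: every a_ij may be > 0.
```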