
Automatic Speech Recognition (CS753)
Lecture 6: Hidden Markov Models (Part II)
Instructor: Preethi Jyothi
Aug 10, 2017

Recall: Computing Likelihood

Problem 1 (Likelihood): Given an HMM λ = (A, B) and an observation sequence O, determine the likelihood P(O | λ).
Problem 2 (Decoding): Given an observation sequence O and an HMM λ = (A, B), discover the best hidden state sequence Q.
Problem 3 (Learning): Given an observation sequence O and the set of states in the HMM, learn the HMM parameters A and B.

Computing likelihood: Given an HMM λ = (A, B) and an observation sequence O, determine the likelihood P(O | λ). Use the Forward algorithm.
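
As a concrete companion to this slide (not part of the original lecture), here is a minimal NumPy sketch of the forward algorithm for a discrete-output HMM; the function and variable names (forward_likelihood, A, B, pi, obs) are my own choices for illustration.

import numpy as np

def forward_likelihood(A, B, pi, obs):
    """Compute P(O | lambda) for a discrete-output HMM.

    A:   (S, S) transition matrix, A[j, k] = P(q_t = k | q_{t-1} = j)
    B:   (S, V) emission matrix,   B[j, v] = P(o_t = v | q_t = j)
    pi:  (S,)   initial state distribution
    obs: list of observation symbol indices o_1 ... o_T
    """
    alpha = pi * B[:, obs[0]]            # alpha_1(j) = pi_j * b_j(o_1)
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]    # alpha_t(k) = sum_j alpha_{t-1}(j) A[j, k] b_k(o_t)
    return alpha.sum()                   # P(O | lambda) = sum_j alpha_T(j)

# Toy usage with made-up numbers: 2 states, 3 output symbols
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
pi = np.array([0.6, 0.4])
print(forward_likelihood(A, B, pi, [0, 2, 1]))

In practice the recursion is carried out with scaling or in the log domain to avoid underflow on long utterances.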

Recall: Decoding the best state sequence

Problem 1 (Likelihood): Given an HMM λ = (A, B) and an observation sequence O, determine the likelihood P(O | λ).
Problem 2 (Decoding): Given an observation sequence O and an HMM λ = (A, B), discover the best hidden state sequence Q.
Problem 3 (Learning): Given an observation sequence O and the set of states in the HMM, learn the HMM parameters A and B.

Decoding: Given as input an HMM λ = (A, B) and a sequence of observations O = o_1, o_2, ..., o_T, find the most probable sequence of states Q = q_1 q_2 q_3 ... q_T. Use the Viterbi algorithm.
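
Again as an illustration rather than the lecture's own code, a log-domain Viterbi sketch in the same style as the forward sketch above (names are mine):

import numpy as np

def viterbi_decode(A, B, pi, obs):
    """Most probable state sequence for a discrete-output HMM, in the log domain."""
    logA, logB = np.log(A), np.log(B)
    delta = np.log(pi) + logB[:, obs[0]]      # best log score of any path ending in each state at t = 1
    backpointers = []
    for o in obs[1:]:
        scores = delta[:, None] + logA        # scores[j, k]: best path into j, then transition j -> k
        backpointers.append(scores.argmax(axis=0))
        delta = scores.max(axis=0) + logB[:, o]
    # Trace the best path backwards from the best final state
    path = [int(delta.argmax())]
    for bp in reversed(backpointers):
        path.append(int(bp[path[-1]]))
    return list(reversed(path))

Zero-probability transitions or emissions become -inf in the log domain, which the max/argmax steps handle correctly.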

Learning HMM Parameters

Problem 1 (Likelihood): Given an HMM λ = (A, B) and an observation sequence O, determine the likelihood P(O | λ).
Problem 2 (Decoding): Given an observation sequence O and an HMM λ = (A, B), discover the best hidden state sequence Q.
Problem 3 (Learning): Given an observation sequence O and the set of states in the HMM, learn the HMM parameters A and B.

Learning: Given an observation sequence O and the set of possible states in the HMM, learn the HMM parameters A and B. The standard algorithm for HMM training is the forward-backward, or Baum-Welch, algorithm. Before moving on to Baum-Welch: what is the Expectation Maximization algorithm?

EM Algorithm: Fitting Parameters to Data

A parameter θ determines Pr(x, z; θ), where x is observed and z is hidden.
Observed data: i.i.d. samples x_i, i = 1, ..., N.
Goal: find arg max_θ L(θ), where
L(\theta) = \sum_{i=1}^{N} \log \Pr(x_i; \theta)
Starting from initial parameters θ^0, iteratively compute θ^ℓ as follows:
Q(\theta, \theta^{\ell-1}) = \sum_{i=1}^{N} \sum_{z} \Pr(z \mid x_i; \theta^{\ell-1}) \log \Pr(x_i, z; \theta)
\theta^{\ell} = \arg\max_{\theta} Q(\theta, \theta^{\ell-1})
The estimate θ^ℓ cannot get worse over iterations because, for all θ,
L(\theta) - L(\theta^{\ell-1}) \ge Q(\theta, \theta^{\ell-1}) - Q(\theta^{\ell-1}, \theta^{\ell-1})
EM is guaranteed to converge to a local optimum [Wu83].
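
The slide asserts this inequality without proof; a standard short justification (added here, not from the original deck) follows from Jensen's inequality, using Pr(z | x_i; θ^{ℓ-1}) Pr(x_i; θ^{ℓ-1}) = Pr(x_i, z; θ^{ℓ-1}):

\begin{aligned}
L(\theta) - L(\theta^{\ell-1})
&= \sum_{i=1}^{N} \log \sum_{z} \Pr(z \mid x_i; \theta^{\ell-1})\,
   \frac{\Pr(x_i, z; \theta)}{\Pr(x_i, z; \theta^{\ell-1})} \\
&\ge \sum_{i=1}^{N} \sum_{z} \Pr(z \mid x_i; \theta^{\ell-1})\,
   \log \frac{\Pr(x_i, z; \theta)}{\Pr(x_i, z; \theta^{\ell-1})}
 = Q(\theta, \theta^{\ell-1}) - Q(\theta^{\ell-1}, \theta^{\ell-1}).
\end{aligned}

Since θ^ℓ maximises Q(·, θ^{ℓ-1}), the right-hand side is non-negative at θ = θ^ℓ, so the likelihood never decreases.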

Coin example to illustrate EM

Three biased coins, with ρ1 = Pr(Coin 1 = H), ρ2 = Pr(Coin 2 = H), ρ3 = Pr(Coin 3 = H).
Repeat:
Toss Coin 1 privately.
If it shows H: toss Coin 2 twice; else: toss Coin 3 twice.
The following sequence is observed: HH, TT, HH, TT, HH. How do you estimate ρ1, ρ2 and ρ3?

Coin example to illustrate EM

Recall that, for partially observed data, the likelihood is given by
L(\theta) = \sum_{i=1}^{N} \log \Pr(x_i; \theta) = \sum_{i=1}^{N} \log \sum_{z} \Pr(x_i, z; \theta)
where, for the coin example, each observation x_i ∈ X = {HH, HT, TH, TT} and the hidden variable z ∈ Z = {H, T}.

Coin example to illustrate EM

As before, L(\theta) = \sum_{i=1}^{N} \log \sum_{z} \Pr(x_i, z; \theta), and the complete-data likelihood factorises as Pr(x, z; θ) = Pr(x | z; θ) Pr(z; θ), where
\Pr(z; \theta) = \begin{cases} \rho_1 & \text{if } z = H \\ 1 - \rho_1 & \text{if } z = T \end{cases}
\Pr(x \mid z; \theta) = \begin{cases} \rho_2^{h} (1 - \rho_2)^{t} & \text{if } z = H \\ \rho_3^{h} (1 - \rho_3)^{t} & \text{if } z = T \end{cases}
with h the number of heads and t the number of tails in x.

Coin example to illustrate EM

Our observed data is {HH, TT, HH, TT, HH}. Let's use EM to estimate θ = (ρ1, ρ2, ρ3).

[EM Iteration, E-step] Compute the quantities involved in
Q(\theta, \theta^{\ell-1}) = \sum_{i=1}^{N} \sum_{z} \gamma(z, x_i) \log \Pr(x_i, z; \theta)
where γ(z, x) = Pr(z | x; θ^{ℓ-1}), i.e., compute γ(z, x_i) for all z and all i.

Suppose θ^{ℓ-1} is ρ1 = 0.3, ρ2 = 0.4, ρ3 = 0.6:
What is γ(H, HH)? = 0.16
What is γ(H, TT)? ≈ 0.49
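
A quick numerical check of these two posteriors, as a sketch (the helper name gamma_H is mine; the parameter values are those assumed on the slide):

def gamma_H(x, rho1, rho2, rho3):
    """Posterior Pr(z = H | x; theta) for the three-coin model; x is a string such as 'HH'."""
    h, t = x.count('H'), x.count('T')
    p_x_given_H = rho2 ** h * (1 - rho2) ** t      # Coin 2 generated x
    p_x_given_T = rho3 ** h * (1 - rho3) ** t      # Coin 3 generated x
    num = rho1 * p_x_given_H
    return num / (num + (1 - rho1) * p_x_given_T)

print(gamma_H('HH', 0.3, 0.4, 0.6))   # 0.16
print(gamma_H('TT', 0.3, 0.4, 0.6))   # 0.4909..., i.e. about 0.49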

Coin example to illustrate EM

Our observed data is {HH, TT, HH, TT, HH}. Let's use EM to estimate θ = (ρ1, ρ2, ρ3).

[EM Iteration, M-step] Find θ which maximises
Q(\theta, \theta^{\ell-1}) = \sum_{i=1}^{N} \sum_{z} \gamma(z, x_i) \log \Pr(x_i, z; \theta)
The maximising parameters are
\rho_1 = \frac{\sum_{i=1}^{N} \gamma(H, x_i)}{N}
\rho_2 = \frac{\sum_{i=1}^{N} \gamma(H, x_i)\, h_i}{\sum_{i=1}^{N} \gamma(H, x_i)\,(h_i + t_i)}
\rho_3 = \frac{\sum_{i=1}^{N} \gamma(T, x_i)\, h_i}{\sum_{i=1}^{N} \gamma(T, x_i)\,(h_i + t_i)}
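
Putting the E-step and M-step together, a self-contained sketch of the full EM loop for this example (the function name and iteration count are my own; the updates are exactly the formulas above):

def em_coins(data, rho1, rho2, rho3, iters=20):
    """EM for the three-coin model; data is a list of two-toss outcomes such as 'HH'."""
    for _ in range(iters):
        # E-step: gamma_i = Pr(z = H | x_i) under the current parameters
        gammas, hs, ts = [], [], []
        for x in data:
            h, t = x.count('H'), x.count('T')
            pH = rho1 * rho2 ** h * (1 - rho2) ** t
            pT = (1 - rho1) * rho3 ** h * (1 - rho3) ** t
            gammas.append(pH / (pH + pT)); hs.append(h); ts.append(t)
        # M-step: closed-form updates from the slide
        rho1 = sum(gammas) / len(data)
        rho2 = sum(g * h for g, h in zip(gammas, hs)) / sum(g * (h + t) for g, h, t in zip(gammas, hs, ts))
        rho3 = sum((1 - g) * h for g, h in zip(gammas, hs)) / sum((1 - g) * (h + t) for g, h, t in zip(gammas, hs, ts))
    return rho1, rho2, rho3

print(em_coins(['HH', 'TT', 'HH', 'TT', 'HH'], 0.3, 0.4, 0.6))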

Coin example to illustrate EM

[Figure: the coin example drawn as a two-state machine. An initial ε-arc enters state H with probability ρ1 or state T with probability 1-ρ1; state H emits H/ρ2 and T/(1-ρ2), state T emits H/ρ3 and T/(1-ρ3), and the state stays the same thereafter.]

This was a very simple HMM (with observations from 2 states): the state remains the same after the first transition, and γ estimated the distribution of this single state. More generally, we will need the distribution of the state at each time step. EM for general HMMs is the Baum-Welch algorithm (1972), which predates the general formulation of EM (1977).

Baum-Welch Algorithm as EM

Observed data: N sequences x_i = (x_{i1}, ..., x_{iT_i}), i = 1, ..., N, where x_{it} ∈ R^d.
Parameters θ: transition matrix A, observation probabilities B.

[EM Iteration, E-step] Compute the quantities involved in Q(θ, θ^{ℓ-1}):
\gamma_{i,t}(j) = \Pr(z_t = j \mid x_i; \theta^{\ell-1})
\xi_{i,t}(j, k) = \Pr(z_{t-1} = j, z_t = k \mid x_i; \theta^{\ell-1})

Baum-Welch Algorithm as EM

Observed data: N sequences x_i = (x_{i1}, ..., x_{iT_i}), i = 1, ..., N, where x_{it} ∈ R^d.
Parameters θ: transition matrix A, observation probabilities B.

[EM Iteration, M-step] Find θ which maximises Q(θ, θ^{ℓ-1}):
A_{j,k} = \frac{\sum_{i=1}^{N} \sum_{t=2}^{T_i} \xi_{i,t}(j, k)}{\sum_{i=1}^{N} \sum_{t=2}^{T_i} \sum_{k'} \xi_{i,t}(j, k')}
B_{j,v} = \frac{\sum_{i=1}^{N} \sum_{t\,:\,x_{it} = v} \gamma_{i,t}(j)}{\sum_{i=1}^{N} \sum_{t=1}^{T_i} \gamma_{i,t}(j)}
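
For concreteness, a sketch of these M-step updates in NumPy, assuming the posteriors γ and ξ have already been computed for each training sequence (for example with the forward-backward recursions covered next lecture); the array layout and function name are my own:

import numpy as np

def baum_welch_m_step(gammas, xis, obs_seqs, num_symbols):
    """Discrete-output Baum-Welch M-step.

    gammas[i]:   (T_i, S) array of state posteriors gamma_{i,t}(j)
    xis[i]:      (T_i - 1, S, S) array of transition posteriors xi_{i,t}(j, k) for t = 2 .. T_i
    obs_seqs[i]: length-T_i list of observation symbol indices
    """
    S = gammas[0].shape[1]
    A_num = sum(xi.sum(axis=0) for xi in xis)          # sum over i and t >= 2 of xi_{i,t}(j, k)
    A = A_num / A_num.sum(axis=1, keepdims=True)       # normalise over destination state k
    B = np.zeros((S, num_symbols))
    for gamma, obs in zip(gammas, obs_seqs):
        for t, v in enumerate(obs):
            B[:, v] += gamma[t]                        # accumulate gamma where x_{it} = v
    B /= B.sum(axis=1, keepdims=True)                  # row sums equal sum_{i,t} gamma_{i,t}(j)
    return A, B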

Gaussian Observation Model

So far we considered HMMs with discrete outputs. In acoustic models, HMMs output real-valued vectors, so the observation probabilities are defined using probability density functions. A widely used model is the Gaussian distribution:
\mathcal{N}(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x - \mu)^2}{2\sigma^2}}
The HMM emission/observation probabilities are b_j(x) = N(x | μ_j, σ_j²), where μ_j is the mean associated with state j and σ_j² is its variance. For multivariate Gaussians, b_j(x) = N(x | μ_j, Σ_j), where Σ_j is the covariance associated with state j.
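
As an illustration (not from the slides), evaluating such a multivariate Gaussian emission density in NumPy; in real systems this is done in the log domain, and often with diagonal covariances, for speed and numerical stability:

import numpy as np

def gaussian_emission(x, mu, Sigma):
    """b_j(x) = N(x; mu_j, Sigma_j) for a d-dimensional feature vector x."""
    d = len(mu)
    diff = x - mu
    norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))
    return float(np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff) / norm)

# Toy usage with made-up values
x = np.array([0.5, -1.0])
mu = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.2], [0.2, 2.0]])
print(gaussian_emission(x, mu, Sigma))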

BW for Gaussian Observation Model

Observed data: N sequences x_i = (x_{i1}, ..., x_{iT_i}), i = 1, ..., N, where x_{it} ∈ R^d.
Parameters θ: transition matrix A, observation probabilities B = {(μ_j, Σ_j)} for all j.

[EM Iteration, M-step] Find θ which maximises Q(θ, θ^{ℓ-1}). The update for A is the same as with discrete outputs;
\mu_j = \frac{\sum_{i=1}^{N} \sum_{t=1}^{T_i} \gamma_{i,t}(j)\, x_{it}}{\sum_{i=1}^{N} \sum_{t=1}^{T_i} \gamma_{i,t}(j)}
\Sigma_j = \frac{\sum_{i=1}^{N} \sum_{t=1}^{T_i} \gamma_{i,t}(j)\,(x_{it} - \mu_j)(x_{it} - \mu_j)^T}{\sum_{i=1}^{N} \sum_{t=1}^{T_i} \gamma_{i,t}(j)}

Gaussian Mixture Model

A single Gaussian observation model assumes that the observed acoustic feature vectors are unimodal.

Unimodal

[Figure: univariate Gaussian densities φ_{μ,σ²}(x) for (μ, σ²) = (0, 0.2), (0, 1.0), (0, 5.0) and (2, 0.5), together with a bivariate Gaussian density surface.]

Gaussian Mixture Model

A single Gaussian observation model assumes that the observed acoustic feature vectors are unimodal. More generally, we use a mixture of Gaussians to model multiple modes in the data.

Mixture Models

Gaussian Mixture Model

A single Gaussian observation model assumes that the observed acoustic feature vectors are unimodal. More generally, we use a mixture of Gaussians to model multiple modes in the data. Instead of b_j(x) = N(x | μ_j, Σ_j) as in the single-Gaussian case, b_j(x) now becomes
b_j(x) = \sum_{m=1}^{M} c_{jm}\, \mathcal{N}(x \mid \mu_{jm}, \Sigma_{jm})
where c_{jm} is the mixing probability for Gaussian component m of state j, with
\sum_{m=1}^{M} c_{jm} = 1, \quad c_{jm} \ge 0.

BW for Gaussian Mixture Model

Observed data: N sequences x_i = (x_{i1}, ..., x_{iT_i}), i = 1, ..., N, where x_{it} ∈ R^d.
Parameters θ: transition matrix A, observation probabilities B = {(μ_{jm}, Σ_{jm}, c_{jm})} for all j, m.

[EM Iteration, M-step] Find θ which maximises Q(θ, θ^{ℓ-1}):
\mu_{jm} = \frac{\sum_{i=1}^{N} \sum_{t=1}^{T_i} \gamma_{i,t}(j, m)\, x_{it}}{\sum_{i=1}^{N} \sum_{t=1}^{T_i} \gamma_{i,t}(j, m)}
\Sigma_{jm} = \frac{\sum_{i=1}^{N} \sum_{t=1}^{T_i} \gamma_{i,t}(j, m)\,(x_{it} - \mu_{jm})(x_{it} - \mu_{jm})^T}{\sum_{i=1}^{N} \sum_{t=1}^{T_i} \gamma_{i,t}(j, m)}
c_{jm} = \frac{\sum_{i=1}^{N} \sum_{t=1}^{T_i} \gamma_{i,t}(j, m)}{\sum_{i=1}^{N} \sum_{t=1}^{T_i} \gamma_{i,t}(j)} \quad \text{(mixing probabilities)}
where γ_{i,t}(j) = Pr(q_t = j | x_i), and γ_{i,t}(j, m) is the corresponding posterior of being in state j with mixture component m at time t.
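
A sketch of these per-state GMM updates in NumPy, assuming the component posteriors γ_{i,t}(j, m) for one fixed state j are already available as arrays (the function name and array layout are my own):

import numpy as np

def gmm_m_step(gammas_jm, obs_seqs):
    """GMM M-step for one HMM state j.

    gammas_jm[i]: (T_i, M) array with gammas_jm[i][t, m] = gamma_{i,t}(j, m)
    obs_seqs[i]:  (T_i, d) array of acoustic feature vectors
    Returns mixing weights c (M,), means mu (M, d), covariances Sigma (M, d, d).
    """
    G = np.concatenate(gammas_jm)           # (sum_i T_i, M)
    X = np.concatenate(obs_seqs)            # (sum_i T_i, d)
    denom = G.sum(axis=0)                   # sum_{i,t} gamma_{i,t}(j, m)
    c = denom / G.sum()                     # note: sum_m gamma_{i,t}(j, m) = gamma_{i,t}(j)
    mu = (G.T @ X) / denom[:, None]         # weighted means
    M, d = mu.shape
    Sigma = np.zeros((M, d, d))
    for m in range(M):
        diff = X - mu[m]
        Sigma[m] = (G[:, m, None] * diff).T @ diff / denom[m]   # weighted covariances
    return c, mu, Sigma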

ASR Framework: Acoustic Models

[Figure: ASR pipeline — Acoustic Features → Acoustic Models (H) → Triphones → Context Transducer → Monophones → Pronunciation Model → Words → Language Model → Word Sequence.]

Acoustic models are estimated using training data {x_i, y_i}, i = 1, ..., N, where x_i corresponds to a sequence of acoustic feature vectors and y_i corresponds to a sequence of words. For each (x_i, y_i), a composite HMM is constructed using the HMMs that correspond to the triphone sequence in y_i. For example, "Hello world" expands to the phone sequence sil hh ah l ow w er l d sil, and then to the triphone sequence sil sil/hh/ah hh/ah/l ah/l/ow l/ow/w ow/w/er w/er/l er/l/d l/d/sil sil.

ASR Framework: Acoustic Models

[Figure: the same ASR pipeline as on the previous slide.]

Acoustic models are estimated using training data {x_i, y_i}, i = 1, ..., N, where x_i corresponds to a sequence of acoustic feature vectors and y_i corresponds to a sequence of words. For each (x_i, y_i), a composite HMM is constructed using the HMMs that correspond to the triphone sequence in y_i. The parameters of these composite HMMs are the parameters of the constituent triphone HMMs; these parameters are fit to the acoustic data {x_i}, i = 1, ..., N, using the Baum-Welch algorithm (EM).

Baum-Welch: In summary

[Every EM iteration] Compute θ = { A_{j,k}, (μ_{jm}, Σ_{jm}, c_{jm}) } for all j, k, m:
A_{j,k} = \frac{\sum_{i=1}^{N} \sum_{t=2}^{T_i} \xi_{i,t}(j, k)}{\sum_{i=1}^{N} \sum_{t=2}^{T_i} \sum_{k'} \xi_{i,t}(j, k')}
\mu_{jm} = \frac{\sum_{i=1}^{N} \sum_{t=1}^{T_i} \gamma_{i,t}(j, m)\, x_{it}}{\sum_{i=1}^{N} \sum_{t=1}^{T_i} \gamma_{i,t}(j, m)}
\Sigma_{jm} = \frac{\sum_{i=1}^{N} \sum_{t=1}^{T_i} \gamma_{i,t}(j, m)\,(x_{it} - \mu_{jm})(x_{it} - \mu_{jm})^T}{\sum_{i=1}^{N} \sum_{t=1}^{T_i} \gamma_{i,t}(j, m)}
c_{jm} = \frac{\sum_{i=1}^{N} \sum_{t=1}^{T_i} \gamma_{i,t}(j, m)}{\sum_{i=1}^{N} \sum_{t=1}^{T_i} \gamma_{i,t}(j)}

How do we efficiently compute these quantities? Next class!