Digital Speech Processing Lecture 20. The Hidden Markov Model (HMM)


Lecture Outline
- Theory of Markov models: discrete Markov processes; hidden Markov processes
- Solutions to the three basic problems of HMMs: computation of the observation probability; determination of the optimal state sequence; optimal training of the model
- Variations of elements of the HMM: model types; densities
- Implementation issues: scaling; multiple observation sequences; initial parameter estimates; insufficient training data
- Implementation of an isolated word recognizer using HMMs

Stochastic Signal Modeling
Reasons for interest:
- basis for a theoretical description of signal processing algorithms
- can learn about signal source properties
- models work well in practice in real-world applications
Types of signal models:
- deterministic, parametric models
- stochastic models

Discrete Markov Processes
System of N distinct states, S_1, S_2, ..., S_N.

Time:  t = 1, 2, 3, 4, 5, ...
State: q_1, q_2, q_3, q_4, q_5, ...

Markov property:
P[q_t = S_j | q_{t-1} = S_i, q_{t-2} = S_k, ...] = P[q_t = S_j | q_{t-1} = S_i]

Properties of State Transition Coefficients
Consider processes where the state transitions are time independent, i.e.,
a_ij = P[q_t = S_j | q_{t-1} = S_i], 1 ≤ i, j ≤ N
with
a_ij ≥ 0 for all i, j, and Σ_{j=1}^N a_ij = 1 for all i

Example of Discrete Markov Process
Once each day (e.g., at noon), the weather is observed and classified as being one of the following:
- State 1: rain (or snow; e.g., precipitation)
- State 2: cloudy
- State 3: sunny
with state transition probabilities:
A = {a_ij} =
    | 0.4  0.3  0.3 |
    | 0.2  0.6  0.2 |
    | 0.1  0.1  0.8 |

Discrete Markov Process
Problem: Given that the weather on day 1 is sunny, what is the probability (according to the model) that the weather for the next 7 days will be sunny-sunny-rain-rain-sunny-cloudy-sunny?
Solution: We define the observation sequence, O, as:
O = {S_3, S_3, S_3, S_1, S_1, S_3, S_2, S_3}
and we want to calculate P(O | Model). That is:
P(O | Model) = P[S_3, S_3, S_3, S_1, S_1, S_3, S_2, S_3 | Model]

Discrete Markov Process
P(O | Model) = P[S_3, S_3, S_3, S_1, S_1, S_3, S_2, S_3 | Model]
= P[S_3] P[S_3|S_3] P[S_3|S_3] P[S_1|S_3] P[S_1|S_1] P[S_3|S_1] P[S_2|S_3] P[S_3|S_2]
= π_3 (a_33)^2 a_31 a_11 a_13 a_32 a_23
= (1)(0.8)^2(0.1)(0.4)(0.3)(0.1)(0.2)
= 1.536 × 10^-4
where π_i = P[q_1 = S_i], 1 ≤ i ≤ N, is the initial state probability.
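
This calculation is easy to verify directly; the following is a minimal Python sketch (the array names A, pi, and path are illustrative, not from the slides):

```python
import numpy as np

# Weather model from the slide: states 0=rain, 1=cloudy, 2=sunny.
A = np.array([[0.4, 0.3, 0.3],
              [0.2, 0.6, 0.2],
              [0.1, 0.1, 0.8]])
pi = np.array([0.0, 0.0, 1.0])  # day 1 is known to be sunny

# O = {S3, S3, S3, S1, S1, S3, S2, S3} using 0-based state indices.
path = [2, 2, 2, 0, 0, 2, 1, 2]

p = pi[path[0]]
for prev, cur in zip(path[:-1], path[1:]):
    p *= A[prev, cur]  # multiply in each transition probability
print(p)  # 1.536e-04
```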

Discrete Markov Process
Problem: Given that the model is in a known state, what is the probability that it stays in that state for exactly d days?
Solution:
O = {S_i, S_i, S_i, ..., S_i, S_j ≠ S_i}, i.e., d days in state S_i
P(O | Model, q_1 = S_i) = (a_ii)^{d-1}(1 - a_ii) = p_i(d)
so p_i(d) is the (geometric) state duration density, and the expected duration is:
d̄_i = Σ_{d=1}^∞ d p_i(d) = 1/(1 - a_ii)
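
A quick numerical check of the geometric duration density and its mean, truncating the infinite sum at d = 1000 (an assumption for illustration):

```python
a_ii = 0.8  # self-transition probability, e.g., the sunny state
p = [(a_ii ** (d - 1)) * (1 - a_ii) for d in range(1, 1000)]
expected = sum(d * pd for d, pd in zip(range(1, 1000), p))
print(expected, 1 / (1 - a_ii))  # both ~5.0 days
```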

Exercise
Given a single fair coin, i.e., P(H = heads) = P(T = tails) = 0.5, which you toss once and observe tails:
a) What is the probability that the next 10 tosses will provide the sequence {H H T H T T H T T H}?
SOLUTION: For a fair coin, with independent coin tosses, the probability of any specific observation sequence of length 10 (10 tosses) is (1/2)^10, since there are 2^10 such sequences and all are equally probable. Thus:
P(H H T H T T H T T H) = (1/2)^10

Exercise
b) What is the probability that the next 10 tosses will produce the sequence {H H H H H H H H H H}?
SOLUTION: Similarly:
P(H H H H H H H H H H) = (1/2)^10
Thus a specified run of length 10 is equally as likely as a specified run of interlaced H's and T's.

Exercise
c) What is the probability that 5 of the next 10 tosses will be tails? What is the expected number of tails over the next 10 tosses?
SOLUTION: The probability of 5 tails in the next 10 tosses is just the number of observation sequences with 5 tails and 5 heads (in any order) times the probability of each sequence:
P(5H, 5T) = C(10,5) (1/2)^10 = 252/1024 ≈ 0.25
since there are C(10,5) combinations (ways of getting 5H and 5T) in 10 coin tosses, and each sequence has probability (1/2)^10. The expected number of tails in 10 tosses is:
E(number of T in 10 coin tosses) = Σ_{d=0}^{10} d C(10,d) (1/2)^10 = 5
Thus, on average, there will be 5H and 5T in 10 tosses, but the probability of exactly 5H and 5T is only about 0.25.
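
Both numbers can be verified in a couple of lines:

```python
from math import comb

p_5t = comb(10, 5) * (1 / 2) ** 10                                  # C(10,5)/2^10
expected_tails = sum(d * comb(10, d) * (1 / 2) ** 10 for d in range(11))
print(p_5t, expected_tails)  # 0.24609375  5.0
```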

Coin Toss Models
A series of coin tossing experiments is performed. The number of coins is unknown; only the results of each coin toss are revealed. Thus a typical observation sequence is:
O = O_1 O_2 O_3 ... O_T = H H T T T H T T H ... H
Problem: Build an HMM to explain the observation sequence.
Issues:
1. What are the states in the model?
2. How many states should be used?
3. What are the state transition probabilities?

Coin Toss Models (figures: candidate 1-coin, 2-coin, and 3-coin models for the experiment)

Coin Toss Models
Problem: Consider an HMM representation (model λ) of a coin tossing experiment. Assume a 3-state model (corresponding to 3 different coins) with probabilities:

        State 1   State 2   State 3
P(H)    0.5       0.75      0.25
P(T)    0.5       0.25      0.75

and with all state transition probabilities equal to 1/3. (Assume initial state probabilities of 1/3.)
a) You observe the sequence O = H H H H T H T T T T. What state sequence is most likely? What is the probability of the observation sequence and this most likely state sequence?

Coin Toss Problem Solution
SOLUTION: Given O = H H H H T H T T T T, the most likely state sequence is the one for which the probability of each individual observation is maximum. Thus for each H the most likely state is S_2, and for each T the most likely state is S_3. The most likely state sequence is therefore:
S = S_2 S_2 S_2 S_2 S_3 S_2 S_3 S_3 S_3 S_3
The probability of O and S (given the model) is:
P(O, S | λ) = (0.75)^10 (1/3)^10

Coin Toss Models
b) What is the probability that the observation sequence came entirely from state 1?
SOLUTION: The probability of O given the state sequence
Ŝ = S_1 S_1 S_1 S_1 S_1 S_1 S_1 S_1 S_1 S_1
is:
P(O, Ŝ | λ) = (0.50)^10 (1/3)^10
The ratio of P(O, S | λ) to P(O, Ŝ | λ) is:
R = P(O, S | λ) / P(O, Ŝ | λ) = (3/2)^10 ≈ 57.67

Coin Toss Models
c) Consider the observation sequence O' = H T T H T H H T T H. How would your answers to parts a and b change?
SOLUTION: Since O' has the same number of H's and T's as O, the answers to parts a and b remain the same: the most likely states occur the same number of times in both cases.

Coin Toss Models
d) If the state transition probabilities were of the form:
a_11 = 0.9,  a_21 = 0.45, a_31 = 0.45
a_12 = 0.05, a_22 = 0.1,  a_32 = 0.45
a_13 = 0.05, a_23 = 0.45, a_33 = 0.1
i.e., a new model λ', how would your answers to parts a-c change? What does this suggest about the type of sequences generated by the models?

Coin Toss Problem Solution
SOLUTION: The new probability of O and S becomes:
P(O, S | λ') = (1/3) (0.75)^10 (0.1)^6 (0.45)^3
The new probability of O and Ŝ becomes:
P(O, Ŝ | λ') = (1/3) (0.50)^10 (0.9)^9
The ratio is:
R = P(O, S | λ') / P(O, Ŝ | λ') ≈ 1.36 × 10^-5
i.e., under λ' the all-state-1 path Ŝ is now far more likely than S.

Coin Toss Problem Solution
Now the probability of O' and its most likely state sequence S' is not the same as the probability of O and S. We now have:
P(O', S' | λ') = (1/3) (0.75)^10 (0.45)^6 (0.1)^3
P(O', Ŝ | λ') = (1/3) (0.50)^10 (0.9)^9
with the ratio:
R = P(O', S' | λ') / P(O', Ŝ | λ') ≈ 1.24 × 10^-3
Model λ, the initial model, clearly favors long runs of H's or T's, whereas model λ', the new model, clearly favors random sequences of H's and T's. Thus even a run of H's or T's is more likely to occur in state 1 for model λ', and a random sequence of H's and T's is more likely to occur in states 2 and 3 for model λ.
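
The joint probabilities and ratios in parts a-d are straightforward to reproduce; a minimal sketch, with the emission table from the earlier slide and illustrative names:

```python
import numpy as np

b = {"H": np.array([0.5, 0.75, 0.25]),   # P(H) in states 1, 2, 3
     "T": np.array([0.5, 0.25, 0.75])}   # P(T) in states 1, 2, 3

def joint(obs, states, A, pi=np.full(3, 1/3)):
    """P(O, S | lambda) for an observation string and 0-based state path."""
    p = pi[states[0]] * b[obs[0]][states[0]]
    for t in range(1, len(obs)):
        p *= A[states[t-1], states[t]] * b[obs[t]][states[t]]
    return p

A_new = np.array([[0.9,  0.05, 0.05],     # the lambda' transition matrix
                  [0.45, 0.1,  0.45],
                  [0.45, 0.45, 0.1]])
O = "HHHHTHTTTT"
S     = [1, 1, 1, 1, 2, 1, 2, 2, 2, 2]    # S2 for each H, S3 for each T
S_hat = [0] * 10                          # all state 1
print(joint(O, S, A_new) / joint(O, S_hat, A_new))  # ~1.36e-05
```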

Balls in Urns Model (figure: N urns, each containing colored balls; an urn is selected according to the state transition probabilities and a ball is drawn at random as the observation)

Elements of an HMM
1. N, the number of states in the model:
   states S = {S_1, S_2, ..., S_N}; state at time t: q_t ∈ S.
2. M, the number of distinct observation symbols per state:
   observation symbols V = {v_1, v_2, ..., v_M}; observation at time t: O_t ∈ V.
3. State transition probability distribution, A = {a_ij}:
   a_ij = P(q_{t+1} = S_j | q_t = S_i), 1 ≤ i, j ≤ N.
4. Observation symbol probability distribution in state j, B = {b_j(k)}:
   b_j(k) = P(O_t = v_k | q_t = S_j), 1 ≤ j ≤ N, 1 ≤ k ≤ M.
5. Initial state distribution, Π = {π_i}:
   π_i = P(q_1 = S_i), 1 ≤ i ≤ N.
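
In code, λ = (A, B, Π) is just three arrays. A minimal sketch of a container for a discrete-symbol HMM (the class layout is illustrative):

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class HMM:
    A: np.ndarray    # (N, N) state transition matrix, rows sum to 1
    B: np.ndarray    # (N, M) symbol probabilities per state, rows sum to 1
    pi: np.ndarray   # (N,)  initial state distribution

    @property
    def N(self) -> int:
        return self.A.shape[0]

# Example: the 3-coin model from the exercise above (columns: H, T).
coins = HMM(A=np.full((3, 3), 1/3),
            B=np.array([[0.5, 0.5], [0.75, 0.25], [0.25, 0.75]]),
            pi=np.full(3, 1/3))
```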

HMM as a Generator of Observations
1. Choose an initial state, q_1 = S_i, according to the initial state distribution, Π.
2. Set t = 1.
3. Choose O_t = v_k according to the symbol probability distribution in state S_i, namely b_i(k).
4. Transit to a new state, q_{t+1} = S_j, according to the state transition probability distribution for state S_i, namely a_ij.
5. Set t = t + 1; return to step 3 if t ≤ T; otherwise terminate the procedure.

state:       q_1  q_2  q_3  q_4  q_5  q_6  ...  q_T
observation: O_1  O_2  O_3  O_4  O_5  O_6  ...  O_T

Notation: λ = (A, B, Π) denotes the HMM.
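
The five generator steps translate directly into a sampler; a minimal sketch assuming (A, B, pi) arrays as in the container above:

```python
import numpy as np

def generate(A, B, pi, T, rng=np.random.default_rng(0)):
    """Sample a state path and observation sequence of length T from (A, B, pi)."""
    states, obs = [], []
    q = rng.choice(len(pi), p=pi)                       # step 1: initial state
    for _ in range(T):                                  # steps 2-5
        states.append(q)
        obs.append(rng.choice(B.shape[1], p=B[q]))      # emit a symbol from b_q(.)
        q = rng.choice(A.shape[0], p=A[q])              # transit according to a_q(.)
    return states, obs
```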

Three Basic HMM Problems
Problem 1: Given the observation sequence O = O_1 O_2 ... O_T and a model λ = (A, B, Π), how do we (efficiently) compute P(O | λ), the probability of the observation sequence?
Problem 2: Given the observation sequence O = O_1 O_2 ... O_T, how do we choose a state sequence Q = q_1 q_2 ... q_T which is optimal in some meaningful sense?
Problem 3: How do we adjust the model parameters λ = (A, B, Π) to maximize P(O | λ)?
Interpretation:
Problem 1: the evaluation or scoring problem.
Problem 2: the learn-structure problem.
Problem 3: the training problem.

Solution to Problem 1: P(O | λ)
Consider a fixed state sequence (there are N^T such sequences):
Q = q_1 q_2 ... q_T
Then
P(O | Q, λ) = b_{q_1}(O_1) b_{q_2}(O_2) ... b_{q_T}(O_T)
and
P(Q | λ) = π_{q_1} a_{q_1 q_2} a_{q_2 q_3} ... a_{q_{T-1} q_T}
with
P(O, Q | λ) = P(O | Q, λ) P(Q | λ)
Finally:
P(O | λ) = Σ_{all Q} P(O, Q | λ) = Σ_{q_1, q_2, ..., q_T} π_{q_1} b_{q_1}(O_1) a_{q_1 q_2} b_{q_2}(O_2) ... a_{q_{T-1} q_T} b_{q_T}(O_T)
Calculations required: about 2T N^T; for N = 5, T = 100: 2 · 100 · 5^100 ≈ 10^72 computations!

The Forward Procedure
Consider the forward variable, α_t(i), defined as the probability of the partial observation sequence (until time t) and state S_i at time t, given the model, i.e.,
α_t(i) = P(O_1 O_2 ... O_t, q_t = S_i | λ)
Inductively solve for α_t(i) as follows:
1. Initialization: α_1(i) = π_i b_i(O_1), 1 ≤ i ≤ N
2. Induction: α_{t+1}(j) = [Σ_{i=1}^N α_t(i) a_ij] b_j(O_{t+1}), 1 ≤ t ≤ T-1, 1 ≤ j ≤ N
3. Termination: P(O | λ) = Σ_{i=1}^N P(O_1 O_2 ... O_T, q_T = S_i | λ) = Σ_{i=1}^N α_T(i)
Computation: about N²T operations, versus 2T N^T; for N = 5, T = 100: 2500 versus ≈ 10^72.
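
A minimal sketch of the forward recursion (unscaled, so it will underflow for long sequences; see the scaling slides later for the numerically safe version):

```python
import numpy as np

def forward(A, B, pi, obs):
    """Return alpha, a (T, N) array, and P(O | lambda) = alpha[-1].sum()."""
    T, N = len(obs), len(pi)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]                      # initialization
    for t in range(1, T):
        alpha[t] = (alpha[t-1] @ A) * B[:, obs[t]]    # induction step
    return alpha, alpha[-1].sum()                     # termination
```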

The Forward Procedure (figure: lattice/trellis implementation of the forward computation)

The Backward Algorithm
Consider the backward variable, β_t(i), defined as the probability of the partial observation sequence from t+1 to the end, given state S_i at time t and the model, i.e.,
β_t(i) = P(O_{t+1} O_{t+2} ... O_T | q_t = S_i, λ)
Inductive solution:
1. Initialization: β_T(i) = 1, 1 ≤ i ≤ N
2. Induction: β_t(i) = Σ_{j=1}^N a_ij b_j(O_{t+1}) β_{t+1}(j), t = T-1, T-2, ..., 1, 1 ≤ i ≤ N
Computation: about N²T calculations, same as in the forward case.
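
The companion backward recursion, in the same style as the forward sketch:

```python
import numpy as np

def backward(A, B, obs):
    """Return beta, a (T, N) array of backward probabilities."""
    T, N = len(obs), A.shape[0]
    beta = np.zeros((T, N))
    beta[-1] = 1.0                                    # initialization
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t+1]] * beta[t+1])    # induction step
    return beta
```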

Solution to Problem 2: Optimal State Sequence
1. Choose the states, q_t, which are individually most likely: maximizes the expected number of correct individual states.
2. Choose the states which are pair-wise most likely: maximizes the expected number of correct state pairs.
3. Choose the states which are triple-wise most likely: maximizes the expected number of correct state triples.
4. Choose the states which are T-wise most likely: find the single best state sequence which maximizes P(Q, O | λ).
This solution is often called the Viterbi state sequence because it is found using the Viterbi algorithm.

Maximize Individual States
We define γ_t(i) as the probability of being in state S_i at time t, given the observation sequence and the model, i.e.,
γ_t(i) = P(q_t = S_i | O, λ) = P(q_t = S_i, O | λ) / P(O | λ) = P(q_t = S_i, O | λ) / Σ_{i=1}^N P(q_t = S_i, O | λ)
which, in terms of the forward and backward variables, is:
γ_t(i) = α_t(i) β_t(i) / P(O | λ) = α_t(i) β_t(i) / Σ_{i=1}^N α_t(i) β_t(i), with Σ_{i=1}^N γ_t(i) = 1
Then the individually most likely state at time t is:
q_t* = argmax_{1≤i≤N} [γ_t(i)], 1 ≤ t ≤ T
Problem: q_t* need not obey the state transition constraints.

Best State Sequence: The Viterbi Algorithm
Define δ_t(i) as the highest probability along a single path, at time t, which accounts for the first t observations and ends in state S_i, i.e.,
δ_t(i) = max_{q_1, q_2, ..., q_{t-1}} P[q_1 q_2 ... q_{t-1}, q_t = S_i, O_1 O_2 ... O_t | λ]
We must keep track of the state sequence which gave the best path, at time t, to state i. We do this in the array ψ_t(i).

The Viterbi Algorithm
Step 1: Initialization
δ_1(i) = π_i b_i(O_1), 1 ≤ i ≤ N
ψ_1(i) = 0, 1 ≤ i ≤ N
Step 2: Recursion
δ_t(j) = max_{1≤i≤N} [δ_{t-1}(i) a_ij] b_j(O_t), 2 ≤ t ≤ T, 1 ≤ j ≤ N
ψ_t(j) = argmax_{1≤i≤N} [δ_{t-1}(i) a_ij], 2 ≤ t ≤ T, 1 ≤ j ≤ N
Step 3: Termination
P* = max_{1≤i≤N} [δ_T(i)]
q_T* = argmax_{1≤i≤N} [δ_T(i)]
Step 4: Path (state sequence) backtracking
q_t* = ψ_{t+1}(q_{t+1}*), t = T-1, T-2, ..., 1
Calculation: about N²T operations.
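
A minimal sketch of the four steps in product form (the log-domain variant on the next slide is what one would use in practice):

```python
import numpy as np

def viterbi(A, B, pi, obs):
    """Return the best state path and its probability P*."""
    T, N = len(obs), len(pi)
    delta = np.zeros((T, N))
    psi = np.zeros((T, N), dtype=int)
    delta[0] = pi * B[:, obs[0]]                       # step 1: initialization
    for t in range(1, T):                              # step 2: recursion
        scores = delta[t-1][:, None] * A               # delta_{t-1}(i) * a_ij
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) * B[:, obs[t]]
    path = [int(delta[-1].argmax())]                   # step 3: termination
    for t in range(T - 1, 0, -1):                      # step 4: backtracking
        path.append(int(psi[t][path[-1]]))
    return path[::-1], delta[-1].max()
```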

Alternative Viterbi Implementation (log domain)
Preprocessing:
π̃_i = log(π_i), 1 ≤ i ≤ N
b̃_i(O_t) = log(b_i(O_t)), 1 ≤ i ≤ N, 1 ≤ t ≤ T
ã_ij = log(a_ij), 1 ≤ i, j ≤ N
Step 1: Initialization
δ̃_1(i) = log(δ_1(i)) = π̃_i + b̃_i(O_1), 1 ≤ i ≤ N
ψ_1(i) = 0, 1 ≤ i ≤ N
Step 2: Recursion
δ̃_t(j) = log(δ_t(j)) = max_{1≤i≤N} [δ̃_{t-1}(i) + ã_ij] + b̃_j(O_t), 2 ≤ t ≤ T, 1 ≤ j ≤ N
ψ_t(j) = argmax_{1≤i≤N} [δ̃_{t-1}(i) + ã_ij], 2 ≤ t ≤ T, 1 ≤ j ≤ N
Step 3: Termination
P̃* = max_{1≤i≤N} [δ̃_T(i)]
q_T* = argmax_{1≤i≤N} [δ̃_T(i)]
Step 4: Backtracking
q_t* = ψ_{t+1}(q_{t+1}*), t = T-1, T-2, ..., 1
Calculation: about N²T additions.
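
The same algorithm with logs; only the recursion changes from products to sums (log(0) = -inf conveniently encodes forbidden transitions):

```python
import numpy as np

def viterbi_log(A, B, pi, obs):
    """Log-domain Viterbi: additions instead of multiplications, no underflow."""
    with np.errstate(divide="ignore"):                 # log(0) -> -inf is fine here
        logA, logB, logpi = np.log(A), np.log(B), np.log(pi)
    T, N = len(obs), len(pi)
    delta = logpi + logB[:, obs[0]]
    psi = np.zeros((T, N), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + logA                 # delta_{t-1}(i) + log a_ij
        psi[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + logB[:, obs[t]]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t][path[-1]]))
    return path[::-1], delta.max()                     # best path and log P*
```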

Problem
Given the model of the coin toss experiment used earlier (i.e., 3 different coins) with probabilities:

        State 1   State 2   State 3
P(H)    0.5       0.75      0.25
P(T)    0.5       0.25      0.75

with all state transition probabilities equal to 1/3, and with initial state probabilities equal to 1/3: for the observation sequence O = H H H H T H T T T T, find the Viterbi path of maximum likelihood.

Problem Solution
Since all a_ij terms are equal to 1/3, we can omit these terms (as well as the initial state probability term), giving:
δ_1(1) = 0.5, δ_1(2) = 0.75, δ_1(3) = 0.25
The recursion for δ_t(j) gives:
δ_2(1) = (0.75)(0.5),    δ_2(2) = (0.75)^2,        δ_2(3) = (0.75)(0.25)
δ_3(1) = (0.75)^2(0.5),  δ_3(2) = (0.75)^3,        δ_3(3) = (0.75)^2(0.25)
δ_4(1) = (0.75)^3(0.5),  δ_4(2) = (0.75)^4,        δ_4(3) = (0.75)^3(0.25)
δ_5(1) = (0.75)^4(0.5),  δ_5(2) = (0.75)^4(0.25),  δ_5(3) = (0.75)^5
δ_6(1) = (0.75)^5(0.5),  δ_6(2) = (0.75)^6,        δ_6(3) = (0.75)^5(0.25)
δ_7(1) = (0.75)^6(0.5),  δ_7(2) = (0.75)^6(0.25),  δ_7(3) = (0.75)^7
δ_8(1) = (0.75)^7(0.5),  δ_8(2) = (0.75)^7(0.25),  δ_8(3) = (0.75)^8
δ_9(1) = (0.75)^8(0.5),  δ_9(2) = (0.75)^8(0.25),  δ_9(3) = (0.75)^9
δ_10(1) = (0.75)^9(0.5), δ_10(2) = (0.75)^9(0.25), δ_10(3) = (0.75)^10
This leads to a diagram (trellis) of the form: (figure)

Solution to Problem 3: The Training Problem
- No globally optimal solution is known; all solutions yield local optima.
- Can get a solution via gradient techniques.
- Can use a re-estimation procedure such as the Baum-Welch or EM method.
Consider re-estimation procedures. Basic idea: given a current model estimate, λ, compute expected values of model events, then refine the model based on the computed values:
λ^(0) → E[model events] → λ^(1) → E[model events] → λ^(2) → ...
Define ξ_t(i,j), the probability of being in state S_i at time t and state S_j at time t+1, given the model and the observation sequence, i.e.,
ξ_t(i,j) = P(q_t = S_i, q_{t+1} = S_j | O, λ)

The Training Problem
ξ_t(i,j) = P(q_t = S_i, q_{t+1} = S_j | O, λ)
(figure: trellis segment showing α_t(i), the transition term a_ij b_j(O_{t+1}), and β_{t+1}(j))

The Training Problem
ξ_t(i,j) = P(q_t = S_i, q_{t+1} = S_j | O, λ)
         = P(q_t = S_i, q_{t+1} = S_j, O | λ) / P(O | λ)
         = α_t(i) a_ij b_j(O_{t+1}) β_{t+1}(j) / P(O | λ)
         = α_t(i) a_ij b_j(O_{t+1}) β_{t+1}(j) / [Σ_{i=1}^N Σ_{j=1}^N α_t(i) a_ij b_j(O_{t+1}) β_{t+1}(j)]
with
γ_t(i) = Σ_{j=1}^N ξ_t(i,j)
Then:
Σ_{t=1}^{T-1} γ_t(i) = expected number of transitions from S_i
Σ_{t=1}^{T-1} ξ_t(i,j) = expected number of transitions from S_i to S_j

Re-estimation Formulas
π̄_i = expected number of times in state S_i at t = 1 = γ_1(i)
ā_ij = (expected number of transitions from state S_i to state S_j) / (expected number of transitions from state S_i)
     = Σ_{t=1}^{T-1} ξ_t(i,j) / Σ_{t=1}^{T-1} γ_t(i)
b̄_j(k) = (expected number of times in state j with symbol v_k) / (expected number of times in state j)
       = Σ_{t: O_t = v_k} γ_t(j) / Σ_{t=1}^T γ_t(j)
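
Combining the forward/backward sketches with these formulas gives one Baum-Welch iteration; a minimal sketch for a discrete-symbol HMM and a single (short) observation sequence, unscaled for clarity:

```python
import numpy as np

def baum_welch_step(A, B, pi, obs):
    """One EM re-estimation step; returns the updated (A, B, pi)."""
    T, N = len(obs), len(pi)
    # Forward and backward passes (as sketched earlier).
    alpha = np.zeros((T, N)); beta = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t-1] @ A) * B[:, obs[t]]
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t+1]] * beta[t+1])
    P = alpha[-1].sum()
    # xi[t, i, j] = alpha_t(i) a_ij b_j(O_{t+1}) beta_{t+1}(j) / P
    xi = np.array([np.outer(alpha[t], B[:, obs[t+1]] * beta[t+1]) * A / P
                   for t in range(T - 1)])
    gamma = alpha * beta / P            # gamma[t, i] = P(q_t = S_i | O, lambda)
    # Re-estimation formulas from the slide above.
    pi_new = gamma[0]
    A_new = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    B_new = np.zeros_like(B)
    for k in range(B.shape[1]):
        B_new[:, k] = gamma[np.array(obs) == k].sum(axis=0) / gamma.sum(axis=0)
    return A_new, B_new, pi_new
```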

Re-estimation Formulas
If λ = (A, B, Π) is the initial model and λ̄ = (Ā, B̄, Π̄) is the re-estimated model, then it can be proven that either:
1. the initial model, λ, defines a critical point of the likelihood function, in which case λ̄ = λ, or
2. model λ̄ is more likely than model λ in the sense that P(O | λ̄) > P(O | λ), i.e., we have found a new model λ̄ from which the observation sequence is more likely to have been produced.
Conclusion: Iteratively use λ̄ in place of λ and repeat the re-estimation until some limiting point is reached. The resulting model is called the maximum likelihood (ML) HMM.

Re-esmaon esmaon Formulas. The re-esmaon formulas can be derved by maxmzng he auxlary funcon Q( λλ, ) over λ,.e., Q( λλ, ) = P( O, q λ)log P( O, q λ max Q( λλ, ) P( O λ) P( O λ) λ I can be proved ha: q Evenually he lkelhood funcon converges o a crcal pon 2. Relaon o EM algorhm: E (Expecaon) sep s he calculaon of he auxlary funcon, Q( λλ, ) M (Modfcaon) sep s he maxmzaon over λ 43

Notes on Re-estimation
1. The stochastic constraints on π̄_i, ā_ij, and b̄_j(k) are automatically met, i.e.,
Σ_{i=1}^N π̄_i = 1,  Σ_{j=1}^N ā_ij = 1,  Σ_{k=1}^M b̄_j(k) = 1
2. The re-estimates can be written in terms of the derivatives of P = P(O | λ):
π̄_i = π_i (∂P/∂π_i) / Σ_{k=1}^N π_k (∂P/∂π_k)
ā_ij = a_ij (∂P/∂a_ij) / Σ_{k=1}^N a_ik (∂P/∂a_ik)
b̄_j(k) = b_j(k) (∂P/∂b_j(k)) / Σ_{l=1}^M b_j(l) (∂P/∂b_j(l))
so at the critical points of P, the re-estimation formulas are exactly correct.

Variations on HMMs
1. Types of HMM model structures
2. Continuous observation density models; mixtures
3. Autoregressive HMMs (LPC links)
4. Null transitions and tied states
5. Inclusion of explicit state duration density in HMMs
6. Optimization criterion: ML, MMI, MDI

Types of HMM. Ergodc models--no ransen saes 2. Lef-rgh models--all ransen saes (excep he las sae) wh he consrans:, = π = 0, a = 0 j > j Conrolled ransons mples: a = 0, j > +Δ ( Δ =,2 ypcally) j 3. Mxed forms of ergodc and lef-rgh models (e.g., parallel branches) Noe: Consrans of lef-rgh models don' affec re-esmaon formulas (.e., a parameer nally se o 0 remans a 0 durng re-esmaon). 46

Types of HMM (figure: ergodic model, left-right model, mixed model)

Continuous Observation Density HMMs
The most general form of pdf with a valid re-estimation procedure is:
b_j(x) = Σ_{m=1}^M c_jm 𝒩(x, μ_jm, U_jm), 1 ≤ j ≤ N
where:
x = observation vector = (x_1, x_2, ..., x_D)
M = number of mixture densities
c_jm = gain of the m-th mixture in state j
𝒩 = any log-concave or elliptically symmetric density (e.g., a Gaussian)
μ_jm = mean vector for mixture m, state j
U_jm = covariance matrix for mixture m, state j
with the constraints:
c_jm ≥ 0, 1 ≤ j ≤ N, 1 ≤ m ≤ M
Σ_{m=1}^M c_jm = 1, 1 ≤ j ≤ N
∫ b_j(x) dx = 1, 1 ≤ j ≤ N
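
For the Gaussian case, b_j(x) is just a weighted sum of multivariate normal densities; a minimal sketch (the example state parameters are illustrative):

```python
import numpy as np
from scipy.stats import multivariate_normal

def b_j(x, c_j, mu_j, U_j):
    """Mixture emission density b_j(x) = sum_m c_jm N(x; mu_jm, U_jm)."""
    return sum(c * multivariate_normal.pdf(x, mean=mu, cov=U)
               for c, mu, U in zip(c_j, mu_j, U_j))

# Example: one state with M = 2 mixtures in D = 2 dimensions.
c_j = np.array([0.6, 0.4])                       # gains, sum to 1
mu_j = np.array([[0.0, 0.0], [2.0, 2.0]])        # mean vectors
U_j = np.array([np.eye(2), np.eye(2)])           # covariance matrices
print(b_j(np.array([1.0, 1.0]), c_j, mu_j, U_j))
```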

State Equivalence Chart (figure: equivalence of a state with an M-mixture density to a multi-state, single-mixture model)

Re-estimation for Mixture Densities
c̄_jk = Σ_{t=1}^T γ_t(j,k) / Σ_{t=1}^T Σ_{k=1}^M γ_t(j,k)
μ̄_jk = Σ_{t=1}^T γ_t(j,k) O_t / Σ_{t=1}^T γ_t(j,k)
Ū_jk = Σ_{t=1}^T γ_t(j,k) (O_t - μ_jk)(O_t - μ_jk)' / Σ_{t=1}^T γ_t(j,k)
where γ_t(j,k) is the probability of being in state j at time t with the k-th mixture component accounting for O_t:
γ_t(j,k) = [α_t(j) β_t(j) / Σ_{j=1}^N α_t(j) β_t(j)] · [c_jk 𝒩(O_t, μ_jk, U_jk) / Σ_{m=1}^M c_jm 𝒩(O_t, μ_jm, U_jm)]

Autoregressive HMM
Consider an observation vector O = (x_0, x_1, ..., x_{K-1}), where each x_k is a waveform sample and O represents a frame of the signal (e.g., K = 256 samples). We assume x_k is related to the previous samples of O by a Gaussian autoregressive process of order p, i.e.,
x_k = -Σ_{i=1}^p a_i x_{k-i} + e_k, 0 ≤ k ≤ K-1
where the e_k are Gaussian, independent, identically distributed random variables with zero mean and variance σ², and a_i, 1 ≤ i ≤ p, are the autoregressive or predictor coefficients. As K → ∞:
f(O) = (2πσ²)^{-K/2} exp[-δ(O, a) / (2σ²)]
where
δ(O, a) = r_a(0) r(0) + 2 Σ_{i=1}^p r_a(i) r(i)

Autoregressive HMM
r_a(i) = Σ_{n=0}^{p-i} a_n a_{n+i} (with a_0 = 1), 1 ≤ i ≤ p
r(i) = Σ_{n=0}^{K-i-1} x_n x_{n+i}, 0 ≤ i ≤ p
a = (a_0 = 1, a_1, a_2, ..., a_p)
The prediction residual is:
α = E[Σ_{k=0}^{K-1} (e_k)²] = K σ²
Consider the normalized observation vector:
Ô = O / √α = O / √(Kσ²)
Then:
f(Ô) = (2π)^{-K/2} exp[-(K/2) δ(Ô, a)]
In practice, K is replaced by K̂, the effective frame length, e.g., K̂ = K/3 for a frame overlap of 3 to 1.

Application of Autoregressive HMM
b_j(O) = Σ_{m=1}^M c_jm b_jm(O)
b_jm(O) = (2π)^{-K/2} exp[-(K/2) δ(O, a_jm)]
Each mixture is characterized by a predictor vector, or by the autocorrelation vector from which the predictor vector can be derived. The re-estimation formula for r_jk is:
r̄_jk = Σ_{t=1}^T γ_t(j,k) r_t / Σ_{t=1}^T γ_t(j,k)
where
γ_t(j,k) = [α_t(j) β_t(j) / Σ_{j=1}^N α_t(j) β_t(j)] · [c_jk b_jk(O_t) / Σ_{k=1}^M c_jk b_jk(O_t)]

Null Transitions and Tied States
Null transitions: transitions which produce no output and take no time, denoted by φ.
Tied states: sets up an equivalence relation between HMM parameters in different states:
- the number of independent parameters of the model is reduced
- parameter estimation becomes simpler
- useful in cases where there is insufficient training data for reliable estimation of all model parameters

Null Transitions (figure: example networks with null transitions)

Inclusion of Explicit State Duration Density
For standard HMMs, the duration density is:
p_i(d) = probability of exactly d observations in state S_i = (a_ii)^{d-1} (1 - a_ii)
With an arbitrary state duration density p_i(d), observations are generated as follows:
1. An initial state, q_1 = S_i, is chosen according to the initial state distribution, π.
2. A duration d_1 is chosen according to the state duration density p_{q_1}(d_1).
3. Observations O_1 O_2 ... O_{d_1} are chosen according to the joint density b_{q_1}(O_1 O_2 ... O_{d_1}). Generally we assume independence, so b_{q_1}(O_1 O_2 ... O_{d_1}) = Π_{s=1}^{d_1} b_{q_1}(O_s).
4. The next state, q_2 = S_j, is chosen according to the state transition probabilities, a_{q_1 q_2}, with the constraint that a_{q_1 q_1} = 0, i.e., no transition back to the same state can occur.

Explicit State Duration Density (figure: standard HMM versus HMM with explicit state duration density)

Explicit State Duration Density
(diagram: states q_1, q_2, q_3 with durations d_1, d_2, d_3 covering observations O_1 ... O_{d_1}, O_{d_1+1} ... O_{d_1+d_2}, O_{d_1+d_2+1} ... O_{d_1+d_2+d_3})
Assume:
1. The first state, q_1, begins at t = 1.
2. The last state, q_r, ends at t = T.
i.e., entire duration intervals are included within the observation sequence O_1 O_2 ... O_T.
Modified α:
α_t(i) = P(O_1 O_2 ... O_t, S_i ending at t | λ)
Assume r states in the first t observations, i.e.,
Q = {q_1 q_2 ... q_r} with q_r = S_i
D = {d_1 d_2 ... d_r} with Σ_{s=1}^r d_s = t

Explicit State Duration Density
Then we have:
α_t(i) = Σ_Q Σ_D π_{q_1} p_{q_1}(d_1) P(O_1 ... O_{d_1} | q_1) · a_{q_1 q_2} p_{q_2}(d_2) P(O_{d_1+1} ... O_{d_1+d_2} | q_2) · ... · a_{q_{r-1} q_r} p_{q_r}(d_r) P(O_{d_1+...+d_{r-1}+1} ... O_t | q_r)
By induction:
α_t(j) = Σ_{i=1}^N Σ_{d=1}^D α_{t-d}(i) a_ij p_j(d) Π_{s=t-d+1}^t b_j(O_s)
Initialization of α_t(i):
α_1(i) = π_i p_i(1) b_i(O_1)
α_2(i) = π_i p_i(2) Π_{s=1}^2 b_i(O_s) + Σ_{j=1, j≠i}^N α_1(j) a_ji p_i(1) b_i(O_2)
α_3(i) = π_i p_i(3) Π_{s=1}^3 b_i(O_s) + Σ_{d=1}^2 Σ_{j=1, j≠i}^N α_{3-d}(j) a_ji p_i(d) Π_{s=4-d}^3 b_i(O_s)
...
P(O | λ) = Σ_{i=1}^N α_T(i)

Explicit State Duration Density
- Re-estimation formulas for a_ij, b_j(k), and p_j(d) can be formulated and appropriately interpreted.
- Modifications to Viterbi scoring are required, i.e.,
δ_t(i) = P(O_1 O_2 ... O_t, q_1 q_2 ... q_r = S_i ending at t | λ)
Basic recursion:
δ_t(i) = max_{j, j≠i} max_{1≤d≤D} [δ_{t-d}(j) a_ji p_i(d) Π_{s=t-d+1}^t b_i(O_s)]
- Storage is required for the D previous values δ_{t-D}(i), ..., δ_{t-1}(i): N·D locations.
- The maximization involves all D·N terms, not just the old δ's and a_ij's as in the previous case, giving a significantly larger computational load: about (D²/2) N² T computations involving b_j(O_t).
Example: N = 5, D = 20:
                 implicit duration   explicit duration
storage          5                   100
computation      2,500               500,000

Issues with Explicit State Duration Density
1. The quality of the signal modeling is often improved significantly.
2. There is a significant increase in the number of parameters per state (D duration estimates).
3. There is a significant increase in the computation associated with the probability calculation (a factor of about D²/2).
4. There may be insufficient data to give good p_i(d) estimates.
Alternatives:
1. Use a parametric state duration density:
p_i(d) = 𝒩(d, μ_i, σ_i²)  (Gaussian)
p_i(d) = η_i^{ν_i} d^{ν_i - 1} e^{-η_i d} / Γ(ν_i)  (Gamma)
2. Incorporate state duration information after the probability calculation, e.g., in a post-processor.

Alternatives to ML Estimation
Assume we wish to design V different HMMs, λ_1, λ_2, ..., λ_V. Normally we design each HMM, λ_v, based on a training set of observations, O^(v), using a maximum likelihood (ML) criterion, i.e.,
P_v* = max_{λ_v} P(O^(v) | λ_v)
Consider the mutual information, I_v, between the observation sequence O^(v) and the complete set of models λ = (λ_1, λ_2, ..., λ_V):
I_v = log P(O^(v) | λ_v) - log Σ_{w=1}^V P(O^(v) | λ_w)
Consider maximizing I_v over λ, giving:
I_v* = max_λ [log P(O^(v) | λ_v) - log Σ_{w=1}^V P(O^(v) | λ_w)]
i.e., choose λ so as to separate the correct model, λ_v, from all other models, as much as possible, for the training set O^(v).

Alternatives to ML Estimation
Sum over all such training sets to design the models according to an MMI criterion, i.e.,
I* = max_λ Σ_{v=1}^V [log P(O^(v) | λ_v) - log Σ_{w=1}^V P(O^(v) | λ_w)]
with solution via steepest descent methods.

Comparison of HMMs
Problem: Given two HMMs, λ_1 and λ_2, is it possible to give a measure of how similar the two models are?
Example: consider two 2-state models, λ_1 = (A_1, B_1) with parameters p and q, and λ_2 = (A_2, B_2) with parameters r and s. For λ_2 to be equivalent to λ_1, we require P(O_t = v_k) to be the same for both models and for all symbols v_k. Thus we require:
pq + (1 - p)(1 - q) = rs + (1 - r)(1 - s)
giving
s = (p + q - 2pq - r) / (1 - 2r)
Let p = 0.6, q = 0.7, r = 0.2; then s = 13/30 ≈ 0.433.

Comparison of HMMs
Thus the two models have very different A and B matrices, but are equivalent in the sense that all symbol probabilities (averaged over time) are the same. We generalize the concept of model distance (dis-similarity) by defining a distance measure, D(λ_1, λ_2), between two Markov sources, λ_1 and λ_2, as:
D(λ_1, λ_2) = (1/T) [log P(O_T^(2) | λ_1) - log P(O_T^(2) | λ_2)]
where O_T^(2) is a sequence of observations generated by model λ_2 and scored by both models. We symmetrize D by using the relation:
D_s(λ_1, λ_2) = [D(λ_1, λ_2) + D(λ_2, λ_1)] / 2
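
A Monte Carlo sketch of D and D_s under these definitions (the sequence length T and the scaled forward scorer are illustrative choices, not from the slides):

```python
import numpy as np

def log_lik(A, B, pi, obs):
    """log P(O | lambda) via a scaled forward pass."""
    alpha = pi * B[:, obs[0]]
    ll = np.log(alpha.sum()); alpha /= alpha.sum()
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
        ll += np.log(alpha.sum()); alpha /= alpha.sum()
    return ll

def distance(lam1, lam2, T=1000, rng=np.random.default_rng(0)):
    """D(lambda_1, lambda_2): generate from lambda_2, score with both models."""
    A2, B2, pi2 = lam2
    q, obs = rng.choice(len(pi2), p=pi2), []
    for _ in range(T):
        obs.append(rng.choice(B2.shape[1], p=B2[q]))
        q = rng.choice(len(pi2), p=A2[q])
    return (log_lik(*lam1, obs) - log_lik(*lam2, obs)) / T

def d_sym(lam1, lam2):
    """Symmetrized distance D_s(lambda_1, lambda_2)."""
    return 0.5 * (distance(lam1, lam2) + distance(lam2, lam1))
```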

Implementation Issues for HMMs
1. Scaling: to prevent underflow and/or overflow.
2. Multiple observation sequences: to train left-right models.
3. Initial estimates of HMM parameters: to provide robust models.
4. Effects of insufficient training data.

Scaling
α_t(i) is a sum of a large number of terms, each of the form:
Π_{s=1}^{t-1} a_{q_s q_{s+1}} Π_{s=1}^t b_{q_s}(O_s)
Since each a and b term is less than 1, as t gets larger, α_t(i) exponentially heads to 0. Thus scaling is required to prevent underflow.
Consider scaling α_t(i) by the factor:
c_t = 1 / Σ_{i=1}^N α_t(i), independent of i
We denote the scaled α's as:
α̂_t(i) = c_t α_t(i), with Σ_{i=1}^N α̂_t(i) = 1

Scaling
For fixed t, we compute:
α_t(i) = Σ_{j=1}^N α̂_{t-1}(j) a_ji b_i(O_t)
Scaling gives:
α̂_t(i) = Σ_{j=1}^N α̂_{t-1}(j) a_ji b_i(O_t) / [Σ_{i=1}^N Σ_{j=1}^N α̂_{t-1}(j) a_ji b_i(O_t)]
By induction we get:
α̂_{t-1}(j) = [Π_{τ=1}^{t-1} c_τ] α_{t-1}(j)
giving:
α̂_t(i) = [Σ_{j=1}^N α_{t-1}(j) (Π_{τ=1}^{t-1} c_τ) a_ji b_i(O_t)] / [Σ_{i=1}^N Σ_{j=1}^N α_{t-1}(j) (Π_{τ=1}^{t-1} c_τ) a_ji b_i(O_t)] = α_t(i) / Σ_{i=1}^N α_t(i)

Scaling
For scaling the β_t(i) terms we use the same scale factors as for the α_t(i) terms, i.e.,
β̂_t(i) = c_t β_t(i)
since the magnitudes of the α and β terms are comparable. The re-estimation formula for a_ij in terms of the scaled α's and β's is:
ā_ij = Σ_{t=1}^{T-1} α̂_t(i) a_ij b_j(O_{t+1}) β̂_{t+1}(j) / Σ_{j=1}^N Σ_{t=1}^{T-1} α̂_t(i) a_ij b_j(O_{t+1}) β̂_{t+1}(j)
We have:
α̂_t(i) = [Π_{τ=1}^t c_τ] α_t(i) = C_t α_t(i)
β̂_{t+1}(j) = [Π_{τ=t+1}^T c_τ] β_{t+1}(j) = D_{t+1} β_{t+1}(j)

Scaling
giving:
ā_ij = Σ_{t=1}^{T-1} C_t α_t(i) a_ij b_j(O_{t+1}) D_{t+1} β_{t+1}(j) / Σ_{j=1}^N Σ_{t=1}^{T-1} C_t α_t(i) a_ij b_j(O_{t+1}) D_{t+1} β_{t+1}(j)
where the scale factor is independent of t, since
C_t D_{t+1} = Π_{τ=1}^t c_τ Π_{τ=t+1}^T c_τ = Π_{τ=1}^T c_τ = C_T
and hence cancels.
Notes on scaling:
1. The scaling procedure works equally well on the π or B coefficients.
2. Scaling need not be performed at each iteration; set c_t = 1 whenever scaling is skipped.
3. We can solve for P(O | λ) from the scaled coefficients: since
Π_{t=1}^T c_t Σ_{i=1}^N α_T(i) = C_T Σ_{i=1}^N α_T(i) = 1
we have
P(O | λ) = Σ_{i=1}^N α_T(i) = 1 / Π_{t=1}^T c_t
log P(O | λ) = -Σ_{t=1}^T log c_t
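
A sketch of the scaled forward pass, accumulating log P(O | λ) = -Σ_t log c_t as it goes:

```python
import numpy as np

def forward_scaled(A, B, pi, obs):
    """Scaled forward pass: returns alpha_hat, the scale factors c_t,
    and log P(O | lambda) = -sum_t log(c_t)."""
    T, N = len(obs), len(pi)
    alpha_hat = np.zeros((T, N)); c = np.zeros(T)
    a = pi * B[:, obs[0]]
    c[0] = 1.0 / a.sum(); alpha_hat[0] = c[0] * a
    for t in range(1, T):
        a = (alpha_hat[t-1] @ A) * B[:, obs[t]]
        c[t] = 1.0 / a.sum()             # c_t = 1 / sum_i alpha_t(i)
        alpha_hat[t] = c[t] * a          # each row now sums to 1
    return alpha_hat, c, -np.log(c).sum()
```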

Multiple Observation Sequences
For left-right models, we need to use multiple sequences of observations for training. Assume a set of K observation sequences (i.e., training utterances):
O = [O^(1), O^(2), ..., O^(K)]
where
O^(k) = O_1^(k) O_2^(k) ... O_{T_k}^(k)
We wish to maximize the probability:
P(O | λ) = Π_{k=1}^K P(O^(k) | λ) = Π_{k=1}^K P_k
The re-estimation formula for a_ij becomes:
ā_ij = Σ_{k=1}^K (1/P_k) Σ_{t=1}^{T_k - 1} α_t^k(i) a_ij b_j(O_{t+1}^(k)) β_{t+1}^k(j) / Σ_{k=1}^K (1/P_k) Σ_{t=1}^{T_k - 1} α_t^k(i) β_t^k(i)
Using the scaled coefficients α̂ and β̂ in these sums, all the 1/P_k scaling factors cancel out.

Initial Estimates of HMM Parameters
N: choose based on physical considerations.
M: choose based on model fits.
π: random or uniform (π_i ≠ 0).
a_ij: random or uniform (a_ij ≠ 0).
b_j(k): random or uniform (b_j(k) ≥ ε).
b_j(O) (continuous densities): need good initial estimates of the mean vectors; need reasonable estimates of the covariance matrices.

Effects of Insufficient Training Data
Insufficient training data leads to poor estimates of the model parameters.
Possible solutions:
1. Use more training data: often this is impractical.
2. Reduce the size of the model: often there are physical reasons for keeping a chosen model size.
3. Add extra constraints to the model parameters:
b_j(k) ≥ ε
U_jk(r,r) ≥ δ
Often the model performance is relatively insensitive to the exact choice of ε and δ.
4. Use the method of deleted interpolation:
λ̄ = ε λ + (1 - ε) λ'

Methods for Insufficient Data (figure: performance insensitivity to the value of ε)

Deleted Interpolation (figure)

Isolated Word Recognition Using HMMs
Assume a vocabulary of V words, with K occurrences of each spoken word in a training set. Observation vectors are spectral characterizations of the word. For isolated word recognition, we do the following:
1. For each word, v, in the vocabulary, we must build an HMM, λ^v, i.e., we must re-estimate the model parameters (A, B, Π) that optimize the likelihood of the training set observation vectors for the v-th word. (TRAINING)
2. For each unknown word which is to be recognized, we do the following:
a. measure the observation sequence O = [O_1 O_2 ... O_T]
b. calculate the model likelihoods, P(O | λ^v), 1 ≤ v ≤ V
c. select the word whose model likelihood score is highest:
v* = argmax_{1≤v≤V} P(O | λ^v)
Computation on the order of V·N²·T is required; for V = 100, N = 5, T = 40: about 10^5 computations.
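
Given trained per-word models, recognition (step 2) is a few lines; a sketch assuming a models dict and the scaled forward scorer from the scaling slides (both names are illustrative):

```python
def recognize(obs, models, log_lik):
    """models: dict word -> (A, B, pi); log_lik: scaled-forward scorer.
    Returns the word whose model gives the highest log-likelihood, plus scores."""
    scores = {word: log_lik(A, B, pi, obs)
              for word, (A, B, pi) in models.items()}
    return max(scores, key=scores.get), scores
```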

Isolated Word HMM Recognizer (figure: block diagram of the recognizer)

Choice of Model Parameters
1. A left-right model is preferable to an ergodic model (speech is a left-right process).
2. Number of states in the range 2-40 (from sounds to frames):
- order of the number of distinct sounds in the word
- order of the average number of observations in the word
3. Observation vectors:
- cepstral coefficients (and their second- and third-order derivatives) derived from LPC (1-9 mixtures), diagonal covariance matrices
- vector-quantized discrete symbols (16-256 codebook sizes)
4. Constraints on the b_j(O) densities:
- b_j(k) ≥ ε for discrete densities
- c_jm ≥ δ, U_jm(r,r) ≥ δ for continuous densities

Performance vs. Number of States in Model (figure)

HMM Feature Vector Densities (figure)

Motivation: Segmental K-Means Segmentation into States
Goal: derive good estimates of the b_j(O) densities, as required for rapid convergence of the re-estimation procedure.
Initially: a training set of multiple sequences of observations, and an initial model estimate.
Procedure: segment each observation sequence into states using a Viterbi procedure. For discrete observation densities, code all observations in state j using the M-codeword codebook, giving:
b̂_j(k) = (number of vectors with codebook index k in state j) / (number of vectors in state j)
For continuous observation densities, cluster the observations in state j into a set of M clusters, giving:

Segmental K-Means Segmentation into States
ĉ_jm = (number of vectors assigned to cluster m of state j) / (number of vectors in state j)
μ̂_jm = sample mean of the vectors assigned to cluster m of state j
Û_jm = sample covariance of the vectors assigned to cluster m of state j
As the estimate of the state transition probabilities, use:
â_ii = (number of vectors in state i minus the number of observation sequences for the training word) / (number of vectors in state i)
â_{i,i+1} = 1 - â_ii
The segmenting HMM is updated and the procedure is iterated until a converged model is obtained.
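
A minimal sketch of one segmental k-means pass for the discrete-symbol case, reusing the viterbi sketch from earlier; the left-right structure and the eps smoothing constant are assumptions for illustration:

```python
import numpy as np

def segmental_kmeans_step(model, sequences, viterbi, eps=1e-3):
    """One pass: Viterbi-segment each training sequence with the current
    model, then re-estimate (A, B) from the segmentation counts."""
    A, B, pi = model
    N, M = B.shape
    state_counts = np.zeros(N)          # vectors assigned to each state
    sym_counts = np.zeros((N, M))       # symbol histogram per state
    for obs in sequences:
        path, _ = viterbi(A, B, pi, obs)
        for q, o in zip(path, obs):
            state_counts[q] += 1
            sym_counts[q, o] += 1
    # eps smoothing (an assumption) keeps every b_j(k) nonzero.
    B_new = (sym_counts + eps) / (state_counts[:, None] + eps * M)
    A_new = np.zeros_like(A)
    for i in range(N - 1):              # left-right: self-loop or advance
        a_ii = (state_counts[i] - len(sequences)) / state_counts[i]
        A_new[i, i], A_new[i, i + 1] = a_ii, 1 - a_ii
    A_new[N - 1, N - 1] = 1.0           # absorbing final state
    return A_new, B_new, pi
```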

Segmental K-Means Training (figure: flow diagram of the training loop)

HMM Segmentation for /SIX/ (figure)

Digit Recognition Using HMMs (figure: for the unknown digit "one": log energy, frame likelihood scores, frame cumulative scores, and state segmentation, scored against the models for "one" and "nine")

Digit Recognition Using HMMs (figure: for the unknown digit "seven": log energy, frame likelihood scores, frame cumulative scores, and state segmentation, scored against the models for "seven" and "six")
