Hidden Markov Models

Namrata Vaswani, Iowa State University, April 24, 2014

1 Hidden Markov Model Definitions and Examples

Definitions:

1. A hidden Markov model (HMM) refers to a set of hidden states X_0, X_1, ..., X_t, ..., X_T and a set of observations Y_1, ..., Y_t, ..., Y_T with the following joint PMF or PDF:

   p(x_{0:T}, y_{1:T}) = p(x_0) [ Π_{τ=1}^T p(x_τ | x_{τ-1}) ] [ Π_{τ=1}^T p(y_τ | x_τ) ]    (1)

2. The sequence is an HMM if and only if

   (a) given X_t, X_{t+1} is independent of X_{0:t-1} (past-X), and
   (b) given X_t, Y_t is independent of X_{0:t-1}, X_{t+1:T} (all other X) and Y_{1:t-1} (past-Y).

   This follows by writing out the expression for p(x_{0:T}, y_{1:T}) using the chain rule, then using (1) and comparing factors. By the chain rule,

   p(x_{0:T}, y_{1:T}) = p(x_0) Π_{τ=1}^T p(x_τ | x_{τ-1}, x_{0:τ-2}) Π_{τ=1}^T p(y_τ | x_{0:T}, y_{1:τ-1})    (2)

   Compare this with (1). In both equations, integrate over y_{1:T} and cancel out p(x_1 | x_0) to get Π_{τ=2}^T p(x_τ | x_{τ-1}, x_{0:τ-2}) = Π_{τ=2}^T p(x_τ | x_{τ-1}). Now integrate also over x_{3:T} on both sides to get p(x_2 | x_1, x_0) = p(x_2 | x_1). Next, integrate over only x_{4:T} and use this to conclude that p(x_2 | x_1, x_0) p(x_3 | x_2, x_1, x_0) = p(x_2 | x_1) p(x_3 | x_2), and so p(x_3 | x_2, x_1, x_0) = p(x_3 | x_2). Proceed in a similar fashion to conclude that p(x_t | x_{0:t-1}) = p(x_t | x_{t-1}) for each t, i.e. item (a) holds. At the end of the above, we conclude that Π_{τ=1}^T p(y_τ | x_{0:T}, y_{1:τ-1}) = Π_{τ=1}^T p(y_τ | x_τ).

   Integrate over y_{2:T} to conclude that p(y_1 | x_{0:T}) = p(y_1 | x_1). Use this and integrate over only y_{3:T} to conclude that p(y_2 | x_{0:T}, y_1) = p(y_2 | x_2). Proceed in a similar fashion to conclude that p(y_t | x_{0:T}, y_{1:t-1}) = p(y_t | x_t) for each t, i.e. item (b) holds.

3. The sequence is an HMM if and only if

   (a) given X_t, X_{t+1} is independent of X_{0:t-1} (past-X) and Y_{0:t} (past-Y), and
   (b) given X_t, Y_t is independent of X_{0:t-1} (past-X) and Y_{0:t-1} (past-Y).

   This also follows by writing out the expression for p(x_{0:T}, y_{0:T}) using the chain rule, then using (1) and comparing factors.

Either of the above can also be concluded by using results from the Graphical Models handout. The following can be shown either using Theorem 2 of the Graphical Models handout or directly.

1. The joint PMF or PDF of the hidden states is given by

   p(x_{0:T}) = p(x_0) Π_{τ=1}^T p(x_τ | x_{τ-1})    (3)

   This follows using (1) and integrating over y_{1:T}.

2. Given X_t, X_{t+1:T} are conditionally independent (c.i.) of past-X (X_{0:t-1}) and of past-Y (Y_{0:t}).

3. Given X_t, Y_{t:T} are c.i. of past-X (X_{0:t-1}) and of past-Y (Y_{0:t-1}).

4. Given X_{t-k}, Y_{t:T} is c.i. of Y_{0:t-k} and of X_{0:t-k} for k > 0.

5. Given X_t, X_{t+1} is c.i. of past-X (X_{0:t-1}) and of past-Y (Y_{0:t}).

6. Given X_t, Y_t is c.i. of all-X (X_{0:t-1}, X_{t+1:T}) and all-Y (Y_{0:t-1}, Y_{t+1:T}).

7. By reversing the Markov chain {X_t}, we can also claim that given X_t, X_{t-1} is c.i. of all future-X (X_{t+1:T}) and all future-Y (Y_{t:T}).

8. Given X_{t-k}, Y_{t:T} is c.i. of X_{0:t-k} and Y_{0:t-k} for k > 0. If k = 0, replace Y_{0:t-k} by Y_{0:t-1}.

9. By reversing the Markov chain {X_t}, the analog of item 3 can also be shown for the future.

10. Many more such statements hold.

Let us try to prove item 2. We get

   p(x_{t+1:T} | x_t, x_{0:t-1}, y_{0:t})
   = p(x_{t+1} | x_t, x_{0:t-1}, y_{0:t}) p(x_{t+2:T} | x_{t+1}, x_{0:t}, y_{0:t})
   = p(x_{t+1} | x_t) p(x_{t+2:T} | x_{t+1}, x_{0:t}, y_{0:t})
   = p(x_{t+1} | x_t) p(x_{t+2} | x_{t+1}, x_{0:t}, y_{0:t}) p(x_{t+3:T} | x_{t+2}, x_{0:t+1}, y_{0:t})
   = p(x_{t+1} | x_t) p(x_{t+2} | x_{t+1}) p(x_{t+3:T} | x_{t+2}, x_{0:t+1}, y_{0:t})    (4)

The first equality uses the chain rule, the second uses (a) of Definition 3, and the third uses the chain rule again. The fourth uses (a) of Definition 3 and the following fact, applied with X = X_{t+2}, W = X_{t+1}, Z = {X_{0:t}, Y_{0:t}} and Y = Y_{t+1}.

Fact 1. X independent of {Z, Y} implies that X is independent of Z. Similarly, given W, X c.i. of {Z, Y} implies that, given W, X is c.i. of Z. The proof follows by writing p(x, z, y | w) = p(x | w) p(z, y | w) and integrating over y.

Proceeding in a similar fashion, we finally get

   p(x_{t+1:T} | x_t, x_{0:t-1}, y_{0:t}) = Π_{τ=t+1}^T p(x_τ | x_{τ-1})

Using (a) of Definition 2 and Fact 1, p(x_{t+1:T} | x_t) = Π_{τ=t+1}^T p(x_τ | x_{τ-1}), and thus we get

   p(x_{t+1:T} | x_t, x_{0:t-1}, y_{0:t}) = p(x_{t+1:T} | x_t)

i.e. the result follows. The other conclusions given above can be proved similarly.

HMM Examples.

1. The state space model used for defining the Kalman filter is an example of an HMM with continuous states X_t and continuous observations Y_t.

2. X_t refers to today's weather, which can take one of three possible values, {rainy, cloudy, sunny}. Y_t is a binary random variable which can take two possible values, {class occurs, no class occurs}. It is natural to claim that today's weather depends only on yesterday's weather, i.e. given yesterday's weather, today's weather is c.i. of past weather and of whether class occurred yesterday or earlier. Also, the chance that class occurs today is governed only by today's weather (if it is sunny, it is more likely that class will not occur!), and given today's weather, this chance is independent of all past or future weather and of whether classes occurred in the past or will occur in the future. This, of course, models an irresponsible professor who does not care whether the material gets covered or not!

3. Speech recognition: X_t are different phonemes, Y_t are the linear prediction coefficients (LPCs) of the AR model describing the observed speech.

4. Gesture recognition: X_t are different gestures out of a set, Y_t is the outer contour of the observed hand shape (for hand gestures).

5. In the last two examples, X_t is discrete and Y_t is continuous; that is allowed too.

Causal Posterior Computation.

1. We refer to p(x_t | y_{0:t}) as the causal posterior. In real-time applications, there is a need to compute it recursively, for example to be able to compute the causal MMSE or causal MAP estimate.

2. Recursive computation means using the causal posterior at t-1 and the current observation to compute the causal posterior at t.

3. Using Bayes' rule and the HMM properties, the causal posterior satisfies

   p(x_t | y_{0:t}) ∝ p(x_t, y_t | y_{0:t-1})
   = p(x_t | y_{0:t-1}) p(y_t | x_t, y_{0:t-1})
   = p(x_t | y_{0:t-1}) p(y_t | x_t)    (using HMM Definition 3)
   = p(y_t | x_t) ∫ p(x_t, x_{t-1} | y_{0:t-1}) dx_{t-1}
   = p(y_t | x_t) ∫ p(x_{t-1} | y_{0:t-1}) p(x_t | x_{t-1}, y_{0:t-1}) dx_{t-1}
   = p(y_t | x_t) ∫ p(x_{t-1} | y_{0:t-1}) p(x_t | x_{t-1}) dx_{t-1}    (using HMM Definition 3)    (5)

4. The above recursion is another way to derive the Kalman filter recursion: the causal MMSE estimate, E[X_t | Y_{0:t}], is the expectation of X_t under the causal posterior. Since everything there is jointly Gaussian, the posteriors are also Gaussian and hence completely specified by their mean and covariance. Kay's book does it this way.

5. The same recursion applies for discrete states: just replace ∫ by Σ, as in the sketch below.

2 Discrete-state HMM

We study the set of techniques developed for discrete-state HMMs. The material is based on Rabiner's tutorial (Proc. IEEE, February 1989). Thus any X_t is a discrete random variable which takes one of N possible values, 1, 2, ..., N. Y_t is either discrete or continuous.
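For the discrete-state case, a minimal sketch of one step of recursion (5) is given below. The function name causal_posterior_step, the transition-matrix convention A[i, j] = P(x_t = j | x_{t-1} = i), and the emission callback are illustrative choices, not part of the notes.

```python
import numpy as np

def causal_posterior_step(prev_post, y_t, A, emission):
    """One step of recursion (5) for a discrete-state HMM.

    prev_post : length-N array, p(x_{t-1} = i | y_{0:t-1})
    y_t       : current observation
    A         : N x N transition matrix, A[i, j] = P(x_t = j | x_{t-1} = i)
    emission  : function (y, j) -> p(y_t = y | x_t = j)
    Returns p(x_t = j | y_{0:t}) as a length-N array.
    """
    N = len(prev_post)
    # predict: sum_i p(x_{t-1} = i | y_{0:t-1}) a_{i,j}
    pred = prev_post @ A
    # update: multiply by the likelihood p(y_t | x_t = j) and normalize
    lik = np.array([emission(y_t, j) for j in range(N)])
    unnorm = lik * pred
    return unnorm / unnorm.sum()
```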

2.1 Notation

A time-homogeneous discrete-state HMM is completely specified by

   π_i = P(X_0 = i)
   a_{i,j} = P(X_t = j | X_{t-1} = i)
   b_j(y) = P(Y_t = y | X_t = j)    (if Y_t is continuous, this is replaced by the conditional PDF)    (6)

The following notation is used in the efficient computation of various quantities:

   α_t(i) = p(y_{0:t}, x_t = i)
   β_t(i) = p(y_{t+1:T} | x_t = i)
   γ_t(i) = p(x_t = i | y_{0:T})    (note this conditions on all observations)
   ξ_t(i, j) = p(x_t = i, x_{t+1} = j | y_{0:T})    (note this conditions on all observations)    (7)

2.2 Recursions for α_t, β_t, γ_t, ξ_t

Consider α_t:

   α_t(i) = p(y_{0:t}, x_t = i)
   = Σ_j p(y_{0:t}, x_t = i, x_{t-1} = j)
   = Σ_j p(y_{0:t-1}, x_{t-1} = j) p(x_t = i | x_{t-1} = j, y_{0:t-1}) p(y_t | x_t = i, x_{t-1} = j, y_{0:t-1})
   = Σ_j p(y_{0:t-1}, x_{t-1} = j) p(x_t = i | x_{t-1} = j) p(y_t | x_t = i)    (using HMM Definition 3)
   = Σ_j p(y_{0:t-1}, x_{t-1} = j) a_{j,i} b_i(y_t)
   = b_i(y_t) Σ_j α_{t-1}(j) a_{j,i}    (8)
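A minimal numpy sketch of the α recursion (8) follows (not from the notes); the initialization α_0(i) = π_i b_i(y_0) follows directly from the definition in (7), and the names forward_pass, pi, A, B are illustrative.

```python
import numpy as np

def forward_pass(pi, A, B, y):
    """Compute alpha_t(i) = p(y_{0:t}, x_t = i) via recursion (8).

    pi : length-N initial distribution, pi[i] = P(x_0 = i)
    A  : N x N transition matrix, A[i, j] = a_{i,j}
    B  : N x M emission matrix for discrete observations, B[i, m] = b_i(m)
    y  : length-(T+1) integer observation sequence y_0, ..., y_T
    Returns alpha as a (T+1) x N array.
    """
    T1, N = len(y), len(pi)
    alpha = np.zeros((T1, N))
    alpha[0] = pi * B[:, y[0]]                    # alpha_0(i) = pi_i b_i(y_0)
    for t in range(1, T1):
        alpha[t] = B[:, y[t]] * (alpha[t - 1] @ A)   # recursion (8)
    return alpha
```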

Consider β_t:

   β_t(i) = p(y_{t+1:T} | x_t = i)
   = Σ_j p(y_{t+1:T}, x_{t+1} = j | x_t = i)
   = Σ_j p(x_{t+1} = j | x_t = i) p(y_{t+1} | x_{t+1} = j, x_t = i) p(y_{t+2:T} | x_{t+1} = j, x_t = i, y_{t+1})
   = Σ_j p(x_{t+1} = j | x_t = i) p(y_{t+1} | x_{t+1} = j) p(y_{t+2:T} | x_{t+1} = j)    (using HMM Definition 3)
   = Σ_j a_{i,j} b_j(y_{t+1}) β_{t+1}(j)    (9)

Consider γ_t. Using the definitions of α_t(i) and β_t(i), it is clear that γ_t(i) ∝ α_t(i) β_t(i). Thus,

   γ_t(i) = α_t(i) β_t(i) / Σ_{j=1}^N α_t(j) β_t(j)    (10)

Consider ξ_t:

   ξ_t(i, j) = p(x_t = i, x_{t+1} = j | y_{0:T})
   = p(y_{0:T}, x_t = i, x_{t+1} = j) / p(y_{0:T})
   = p(x_t = i, y_{0:t}) p(x_{t+1} = j | x_t = i, y_{0:t}) p(y_{t+1:T} | x_{t+1} = j, x_t = i, y_{0:t}) / p(y_{0:T})
   = p(x_t = i, y_{0:t}) p(x_{t+1} = j | x_t = i) p(y_{t+1:T} | x_{t+1} = j) / p(y_{0:T})    (using HMM Definition 3)
   = α_t(i) a_{i,j} b_j(y_{t+1}) β_{t+1}(j) / Σ_{i'} Σ_{j'} α_t(i') a_{i',j'} b_{j'}(y_{t+1}) β_{t+1}(j')    (11)
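A companion sketch for the β recursion (9) and for γ, ξ from (10)-(11), under the same illustrative conventions as the forward-pass sketch above; β_T(i) = 1 is the usual terminal convention, consistent with β_t(i) = p(y_{t+1:T} | x_t = i).

```python
import numpy as np

def backward_pass(A, B, y):
    """Compute beta_t(i) = p(y_{t+1:T} | x_t = i) via recursion (9)."""
    T1, N = len(y), A.shape[0]
    beta = np.zeros((T1, N))
    beta[T1 - 1] = 1.0                                       # beta_T(i) = 1
    for t in range(T1 - 2, -1, -1):
        beta[t] = A @ (B[:, y[t + 1]] * beta[t + 1])         # recursion (9)
    return beta

def gamma_xi(alpha, beta, A, B, y):
    """Compute gamma_t(i) from (10) and xi_t(i, j) from (11)."""
    gamma = alpha * beta
    gamma /= gamma.sum(axis=1, keepdims=True)                # normalize as in (10)
    T1, N = alpha.shape
    xi = np.zeros((T1 - 1, N, N))
    for t in range(T1 - 1):
        # unnormalized xi_t(i, j) = alpha_t(i) a_{i,j} b_j(y_{t+1}) beta_{t+1}(j)
        u = alpha[t][:, None] * A * (B[:, y[t + 1]] * beta[t + 1])[None, :]
        xi[t] = u / u.sum()                                  # normalize as in (11)
    return gamma, xi
```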

2.3 Computing p(y_{0:T}): Forward algorithm, Backward algorithm

Brute force computation of p(y_{0:T}) would require evaluating

   p(y_{0:T}) = Σ_{x_{0:T}} p(x_0) [ Π_{t=1}^T p(x_t | x_{t-1}) ] p(y_0 | x_0) [ Π_{t=1}^T p(y_t | x_t) ]    (12)

which requires O(N^T) computations.

2.3.1 Forward algorithm

A fast and causal way to compute p(y_{0:T}) is to go forward in time:

   p(y_{0:T}) = Σ_i α_T(i)    (13)

The recursion for α_t(i) is given in (8). This takes only O(N^2 T) computation.

2.3.2 Backward algorithm

Another O(N^2 T) way to compute p(y_{0:T}) is to go backwards in time:

   p(y_{0:T}) = Σ_i π_i b_i(y_0) β_0(i)    (14)

The recursion for β_t(i) is given in (9). Typically one would use the forward algorithm to compute this, since it is also causal. There may be situations, e.g. if this is done offline and the observations are stored last-in-first-out, where one may need to use the backward algorithm.

2.4 EM algorithm for discrete-state HMM parameter estimation: the Baum-Welch algorithm

Let θ denote the set of parameters. In this case, θ includes all elements {a_{i,j}}, {π_i} and the parameters of b_i(y).

Assumption 1. Assume for the discussion below that the Y_t's are also discrete and take M possible values, 1, 2, ..., M. Thus, in b_i(y), y can be 1, 2, ..., M. Then θ = { {π_i}_{i=1,...,N}, {a_{i,j}}_{i=1,...,N, j=1,...,N}, {b_i(y)}_{i=1,...,N, y=1,...,M} }.

Let θ^k denote the parameter estimate at the k-th iteration. Recall that the EM algorithm computes

   θ^{k+1} = arg max_θ Q(θ, θ^k)  subject to the constraints on θ,  where  Q(θ, θ^k) = E[ log p(y_{0:T}, X_{0:T}; θ) | y_{0:T}; θ^k ]    (15)

i.e. at each iteration EM maximizes the posterior expectation of the logarithm of the complete-data likelihood (the posterior expectation is computed using the parameter estimates from the previous iteration). As discussed earlier (when talking about the EM algorithm), under certain assumptions this leads to maximization of the observed-data likelihood, i.e. its solution converges to arg max_θ p(y_{0:T}; θ).
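As a quick (hypothetical) usage check, the forward and backward computations (13) and (14) should produce the same likelihood; this assumes the forward_pass and backward_pass sketches above and toy parameter values.

```python
import numpy as np

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3],
              [0.2, 0.8]])
B = np.array([[0.5, 0.5],
              [0.1, 0.9]])
y = np.array([0, 1, 1, 0, 1])

alpha = forward_pass(pi, A, B, y)
beta = backward_pass(A, B, y)
lik_fwd = alpha[-1].sum()                     # (13): sum_i alpha_T(i)
lik_bwd = (pi * B[:, y[0]] * beta[0]).sum()   # (14): sum_i pi_i b_i(y_0) beta_0(i)
assert np.isclose(lik_fwd, lik_bwd)
```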

Now, for our HMM,

   log p(y_{0:T}, X_{0:T}; θ) = log π_{X_0} + Σ_{t=1}^T log a_{X_{t-1},X_t} + Σ_{t=0}^T log b_{X_t}(y_t)    (16)

Thus the first term is only a function of the random variable X_0, the t-th entry of the second term is only a function of X_{t-1}, X_t, and the t-th entry of the third term is only a function of X_t. Therefore,

   E[ log p(y_{0:T}, X_{0:T}; θ) | y_{0:T}; θ^k ]
   = E[ log π_{X_0} | y_{0:T}; θ^k ] + Σ_{t=1}^T E[ log a_{X_{t-1},X_t} | y_{0:T}; θ^k ] + Σ_{t=0}^T E[ log b_{X_t}(y_t) | y_{0:T}; θ^k ]
   = Σ_i p(x_0 = i | y_{0:T}) log π_i + Σ_{t=1}^T Σ_{i,j} p(x_{t-1} = i, x_t = j | y_{0:T}) log a_{i,j} + Σ_{t=0}^T Σ_i p(x_t = i | y_{0:T}) log b_i(y_t)
   = Σ_i γ_0^k(i) log π_i + Σ_{t=1}^T Σ_{i,j} ξ_{t-1}^k(i, j) log a_{i,j} + Σ_{t=0}^T Σ_i γ_t^k(i) log b_i(y_t)    (17)

where γ_t^k, ξ_t^k are computed using θ^k in the recursions given in (10) and (11). We need to maximize the above subject to the constraints

   Σ_i π_i = 1,    Σ_{j=1}^N a_{i,j} = 1 for i = 1, ..., N,    Σ_{y=1}^M b_i(y) = 1 for i = 1, ..., N    (18)

Using Lagrange multipliers, differentiating and solving, the final solutions are

   π_i^{k+1} = γ_0^k(i)
   a_{i,j}^{k+1} = Σ_{t=1}^T ξ_{t-1}^k(i, j) / Σ_{t=1}^T γ_{t-1}^k(i)    (note Σ_j ξ_{t-1}^k(i, j) = γ_{t-1}^k(i))
   b_i(m)^{k+1} = Σ_{t=0}^T I(y_t = m) γ_t^k(i) / Σ_{t=0}^T γ_t^k(i)    (19)

where I(A) is 1 if A occurs and 0 otherwise. Here γ_t^k, ξ_t^k are computed using the parameter estimates at iteration k in the recursions given earlier.
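A minimal sketch of the M-step updates (19), assuming the γ and ξ arrays produced by the forward-backward sketches above and observations coded as integers 0, ..., M-1; the function name baum_welch_m_step is an illustrative choice.

```python
import numpy as np

def baum_welch_m_step(gamma, xi, y, M):
    """M-step updates (19), given the E-step quantities gamma_t(i) and xi_t(i, j).

    gamma : (T+1) x N array, gamma[t, i] = p(x_t = i | y_{0:T}; theta^k)
    xi    : T x N x N array, xi[t, i, j] = p(x_t = i, x_{t+1} = j | y_{0:T}; theta^k)
    y     : length-(T+1) integer observations in {0, ..., M-1}
    M     : number of observation symbols
    Returns the updated (pi, A, B).
    """
    pi_new = gamma[0]
    # sum_j xi_t(i, j) = gamma_t(i), so the denominator is sum_t gamma_t(i)
    A_new = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    N = gamma.shape[1]
    B_new = np.zeros((N, M))
    for m in range(M):
        B_new[:, m] = gamma[y == m].sum(axis=0) / gamma.sum(axis=0)
    return pi_new, A_new, B_new
```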

Thus, the stepwise EM (Baum-Welch) algorithm is as follows. At iteration k+1,

1. Compute γ_t^k(i) for all i and all t using (10) and the parameter estimates from iteration k, θ^k.

2. Compute ξ_t^k(i, j) for all i, j and all t using (11) and the parameter estimates from iteration k, θ^k.

3. Compute the parameter estimates at iteration k+1, θ^{k+1}, using (19).

Now suppose the Y_t's are not discrete, but are continuous r.v.'s with the parameters of their PDF governed by the current state; e.g., Y_t can be a scalar Gaussian with mean µ_i and variance σ_i^2 if the state X_t = i. In this case, the emission-parameter estimates can be computed as follows. We need to maximize the following w.r.t. µ_i, σ_i^2:

   Σ_{t=0}^T γ_t^k(i) log b_i(y_t) = Σ_{t=0}^T γ_t^k(i) [ -log(√(2π) σ_i) - (y_t - µ_i)^2 / (2σ_i^2) ]    (20)

The solutions are

   µ_i^{k+1} = Σ_{t=0}^T γ_t^k(i) y_t / Σ_{t=0}^T γ_t^k(i)
   (σ_i^2)^{k+1} = Σ_{t=0}^T γ_t^k(i) (y_t - µ_i^{k+1})^2 / Σ_{t=0}^T γ_t^k(i)    (21)

2.5 General idea of the Viterbi algorithm / dynamic programming

In dynamic programming / the Viterbi algorithm, the goal is to find

   arg max_{q_{0:T}} f_T(q_{0:T})    (22)

where f_t(q_{0:t}) at any t satisfies

   f_t(q_{0:t}) = f_{t-1}(q_{0:t-1}) + h_t(q_{t-1}, q_t) + g_t(q_t)    (23)

Notice that f_t(.) is a function only of the first t+1 variables. Typically, path optimization problems are of this type.

Efficient solution strategy: Let

   δ_t(i) = max_{q_{0:t-1}} f_t(q_{0:t-1}, i)    (24)

Then, using (23),

   δ_t(i) = max_{q_{0:t-1}} [ f_{t-1}(q_{0:t-1}) + h_t(q_{t-1}, i) + g_t(i) ]
   = max_{q_{t-1}} max_{q_{0:t-2}} [ f_{t-1}(q_{0:t-1}) + h_t(q_{t-1}, i) + g_t(i) ]
   = g_t(i) + max_{q_{t-1}} [ h_t(q_{t-1}, i) + max_{q_{0:t-2}} f_{t-1}(q_{0:t-1}) ]
   = g_t(i) + max_{q_{t-1}} [ h_t(q_{t-1}, i) + δ_{t-1}(q_{t-1}) ]    (25)

Also store the optimal path to get to q_t, for each value of q_t. So if q_t ∈ {1, 2, ..., N}, then store the optimal path to get to q_t = i for each i in the set. For the above problem, this can be done efficiently by only storing the optimal value of q_{t-1} that gets you to q_t = i, and doing this for each i at each t. Thus, at each t, for each i, we store

   ψ_t(i) = arg max_{q_{t-1}} [ h_t(q_{t-1}, i) + δ_{t-1}(q_{t-1}) ]    (26)

To summarize the above idea, we do the following.

1. Initialize at t = 0 to δ_0(i) = f_0(i) for all i = 1, 2, ..., N.

2. Starting at t = 1, at each t, compute the following for all i = 1, 2, ..., N:

   δ_t(i) = g_t(i) + max_{q_{t-1}} [ h_t(q_{t-1}, i) + δ_{t-1}(q_{t-1}) ]    (27)

3. Simultaneously, at each t, for each i = 1, 2, ..., N, also store the maximizer of the above, i.e. store

   ψ_t(i) = arg max_{q_{t-1}} [ h_t(q_{t-1}, i) + δ_{t-1}(q_{t-1}) ]    (28)

4. At t = T, find the optimal cost and the optimal value of q_T as

   max_i δ_T(i),    q_T* = arg max_i δ_T(i)    (29)

5. Backtrack using ψ_t to find the optimal state sequence, i.e. starting with t = T-1, go backwards:

   q_t* = ψ_{t+1}(q_{t+1}*)    (30)

2.6 Posterior MAP (non-causal) computation: Viterbi algorithm

We would like to find the non-causal posterior MAP estimate

   x_{0:T}* = arg max_{x_{0:T}} p(x_{0:T} | y_{0:T}) = arg max_{x_{0:T}} p(x_{0:T}, y_{0:T}) = arg max_{x_{0:T}} log p(x_{0:T}, y_{0:T})    (31)

Using the notation from above, define

   f_t(x_{0:t}) := log p(x_{0:t}, y_{0:t})    (32)

Using the HMM definition, it is easy to see that

   f_t(x_{0:t}) = f_{t-1}(x_{0:t-1}) + log p(x_t | x_{t-1}) + log p(y_t | x_t)    (33)

Thus,

   h_t(x_{t-1}, x_t) := log p(x_t | x_{t-1}),    g_t(x_t) := log p(y_t | x_t)    (34)

Thus, the final Viterbi algorithm is:

1. Initialize at t = 0 to δ_0(i) = f_0(i) = log(π_i b_i(y_0)) for all i = 1, 2, ..., N.

2. Starting at t = 1, at each t, compute the following for all i = 1, 2, ..., N:

   δ_t(i) = log b_i(y_t) + max_{x_{t-1} ∈ {1,2,...,N}} [ log a_{x_{t-1},i} + δ_{t-1}(x_{t-1}) ]    (35)

3. Simultaneously, at each t, for each i = 1, 2, ..., N, also store the maximizer of the above, i.e. store

   ψ_t(i) = arg max_{x_{t-1} ∈ {1,2,...,N}} [ log a_{x_{t-1},i} + δ_{t-1}(x_{t-1}) ]    (36)

4. At t = T, find the optimal cost and the optimal value of x_T as

   max_i δ_T(i),    x_T* = arg max_i δ_T(i)    (37)

5. Backtrack using ψ_t to find the optimal state sequence, i.e. starting with t = T-1, go backwards:

   x_t* = ψ_{t+1}(x_{t+1}*)    (38)

2.7 Direct derivation of the Viterbi algorithm for HMMs

Let

   δ_t(i) = max_{x_{0:t-1}} p(x_{0:t-1}, x_t = i, y_{0:t}),    ψ_t(i) = arg max_{x_{t-1} ∈ {1,...,N}} δ_{t-1}(x_{t-1}) a_{x_{t-1},i}    (39)

Recursion for δ_t:

   δ_t(i) = max_{x_{0:t-1}} p(x_{0:t-1}, x_t = i, y_{0:t})
   = max_{x_{0:t-1}} p(x_{0:t-1}, y_{0:t-1}) p(x_t = i | x_{0:t-1}, y_{0:t-1}) p(y_t | x_t = i, x_{0:t-1}, y_{0:t-1})
   = max_{x_{0:t-1}} p(x_{0:t-1}, y_{0:t-1}) p(x_t = i | x_{t-1}) p(y_t | x_t = i)    (using HMM Definition 3)
   = max_{x_{0:t-1}} p(x_{0:t-1}, y_{0:t-1}) a_{x_{t-1},i} b_i(y_t)
   = b_i(y_t) max_{x_{t-1}} [ ( max_{x_{0:t-2}} p(x_{0:t-1}, y_{0:t-1}) ) a_{x_{t-1},i} ]
   = b_i(y_t) max_j δ_{t-1}(j) a_{j,i}    (40)

Also,

   ψ_t(i) = arg max_j δ_{t-1}(j) a_{j,i}    (41)

Thus,

   x_T* = arg max_i δ_T(i),    x_t* = ψ_{t+1}(x_{t+1}*),    t = T-1, T-2, ..., 0    (42)
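A minimal sketch of the log-domain Viterbi recursion (35)-(38) (equivalently (39)-(42)); the name viterbi and the matrix conventions match the earlier sketches and are illustrative, not from the notes.

```python
import numpy as np

def viterbi(pi, A, B, y):
    """MAP state sequence via the Viterbi recursion (35)-(38), in the log domain.

    pi : length-N initial distribution
    A  : N x N transition matrix, A[i, j] = a_{i,j}
    B  : N x M emission matrix (discrete observations)
    y  : length-(T+1) integer observation sequence
    Returns the most likely state sequence x*_{0:T}.
    """
    T1, N = len(y), len(pi)
    log_A, log_B = np.log(A), np.log(B)
    delta = np.zeros((T1, N))
    psi = np.zeros((T1, N), dtype=int)
    delta[0] = np.log(pi) + log_B[:, y[0]]                # delta_0(i) = log(pi_i b_i(y_0))
    for t in range(1, T1):
        scores = delta[t - 1][:, None] + log_A            # entry (x_{t-1}, i)
        psi[t] = scores.argmax(axis=0)                    # (36)
        delta[t] = log_B[:, y[t]] + scores.max(axis=0)    # (35)
    # backtrack, (37)-(38)
    x = np.zeros(T1, dtype=int)
    x[-1] = delta[-1].argmax()
    for t in range(T1 - 2, -1, -1):
        x[t] = psi[t + 1][x[t + 1]]
    return x
```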