Introduction to Hidden Markov Models

Alperen Degirmenci

This document contains derivations and algorithms for implementing Hidden Markov Models. The content presented here is a collection of my notes and personal insights from two seminal papers on HMMs by Rabiner in 1989 [2] and Ghahramani in 2001 [1], and also from Kevin Murphy's book [3]. This is an excerpt from my project report for the MIT 6.867 Machine Learning class taught in Fall 2014.

(The author is with the School of Engineering and Applied Sciences at Harvard University, Cambridge, MA 02138 USA; adegirmenci@seas.harvard.edu.)

I. HIDDEN MARKOV MODELS (HMMS)

HMMs have been widely used in many applications, such as speech recognition, activity recognition from video, gene finding, and gesture tracking. In this section, we explain what HMMs are, how they are used for machine learning, their advantages and disadvantages, and how we implemented our own HMM algorithm.

A. Definition

A hidden Markov model is a tool for representing probability distributions over sequences of observations [1]. In this model, an observation X_t at time t is produced by a stochastic process, but the state Z_t of this process cannot be directly observed, i.e. it is hidden [2]. This hidden process is assumed to satisfy the Markov property: the state Z_t at time t depends only on the previous state, Z_{t-1} at time t-1. This is, in fact, called the first-order Markov model. The n-th-order Markov model depends on the n previous states. Fig. 1 shows a Bayesian network representing the first-order HMM, where the hidden states are shaded in gray. We should note that even though we talk about "time" to indicate that observations occur at discrete time steps, "time" could also refer to locations within a sequence [3].

The joint distribution of a sequence of states and observations for the first-order HMM can be written as

  P(Z_{1:N}, X_{1:N}) = P(Z_1) P(X_1 | Z_1) prod_{t=2}^{N} P(Z_t | Z_{t-1}) P(X_t | Z_t)   (1)

where the notation Z_{1:N} is shorthand for Z_1, ..., Z_N. Notice that Eq. 1 can also be written as

  P(X_{1:N}, Z_{1:N}) = P(Z_1) prod_{t=2}^{N} P(Z_t | Z_{t-1}) prod_{t=1}^{N} P(X_t | Z_t)   (2)

which is the same as the expression given in the lecture notes.

Fig. 1. A Bayesian network representing a first-order HMM. The hidden states are shaded in gray.

There are five elements that characterize a hidden Markov model:

1) Number of states in the model, K: This is the number of states that the underlying hidden Markov process has. The states often have some relation to the phenomena being modeled. For example, if an HMM is being used for gesture recognition, each state may be a different gesture, or a part of the gesture. States are represented as integers 1, ..., K. We will encode the state Z_t at time t as a K x 1 vector of binary numbers, where the only non-zero element is the k-th element (i.e. Z_{t,k} = 1), corresponding to state k ∈ {1, ..., K} at time t. While this may seem contrived, it will later on help us in our computations. (Note that [2] uses N instead of K.)

2) Number of distinct observations, Ω: Observations are represented as integers 1, ..., Ω. We will encode the observation X_t at time t as an Ω x 1 vector of binary numbers, where the only non-zero element is the l-th element (i.e. X_{t,l} = 1), corresponding to observation l ∈ {1, ..., Ω} at time t. While this may seem contrived, it will later on help us in our computations. (Note that [2] uses M instead of Ω, and [1] uses D. We decided to use Ω since this agrees with the lecture notes.)
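As a concrete illustration of the integer/one-hot encoding above and of the factorization in Eq. (2), the short Python/NumPy sketch below evaluates the log of the joint probability for a toy sequence. This is a minimal sketch under invented parameters: the specific values of K, Ω, A, B, and π are hypothetical and chosen only for illustration; A, B, and π are defined formally in items 3)-5) below.

import numpy as np

# Hypothetical toy model (K = 2 hidden states, Omega = 3 observation symbols).
A  = np.array([[0.7, 0.3],
               [0.4, 0.6]])          # A[i, j] = P(Z_t = j | Z_{t-1} = i); rows sum to 1
B  = np.array([[0.5, 0.1],
               [0.4, 0.3],
               [0.1, 0.6]])          # B[k, j] = P(X_t = k | Z_t = j); columns sum to 1
pi = np.array([0.6, 0.4])            # pi[i] = P(Z_1 = i)

def one_hot(index, size):
    """Encode an integer label as a binary indicator vector (items 1-2)."""
    v = np.zeros(size)
    v[index] = 1.0
    return v

def log_joint(states, observations):
    """log P(Z_{1:N}, X_{1:N}) following the factorization of Eq. (2)."""
    logp = np.log(pi[states[0]]) + np.log(B[observations[0], states[0]])
    for t in range(1, len(states)):
        logp += np.log(A[states[t - 1], states[t]])    # P(Z_t | Z_{t-1})
        logp += np.log(B[observations[t], states[t]])  # P(X_t | Z_t)
    return logp

Z = [0, 1, 1]      # hidden state sequence (0-indexed here; the text uses 1, ..., K)
X = [2, 0, 1]      # observation sequence (0-indexed symbols)
print(one_hot(Z[0], 2))        # [1. 0.]
print(log_joint(Z, X))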
3) State transition model, A: Also called the state transition probability distribution [2] or the transition matrix [3], this is a K x K matrix whose elements A_ij describe the probability of transitioning from state Z_{t-1,i} to Z_{t,j} in one time step, where i, j ∈ {1, ..., K}. This can be written as

  A_ij = P(Z_{t,j} = 1 | Z_{t-1,i} = 1).   (3)

Each row of A sums to 1, sum_j A_ij = 1, and therefore A is called a stochastic matrix. If any state can reach any other state in a single step (fully connected), then A_ij > 0 for all i, j; otherwise A will have some zero-valued elements. Fig. 2 shows two state transition diagrams for a 2-state and a 3-state first-order Markov chain. For these diagrams, the state transition models are

  A_(a) = | 1-α   α  |        A_(b) = | A_11  A_12   0   |
          |  β   1-β |,               | A_21  A_22  A_23 |
                                      |  0    A_32  A_33 |.   (4)

Fig. 2. State transition diagrams for (a) a 2-state and (b) a 3-state ergodic Markov chain. For a chain to be ergodic, any state should be reachable from any other state in a finite amount of time.

The conditional probability can be written as

  P(Z_t | Z_{t-1}) = prod_{i=1}^{K} prod_{j=1}^{K} A_ij^{Z_{t-1,i} Z_{t,j}}.   (5)

Taking the logarithm, we can write this as

  log P(Z_t | Z_{t-1}) = sum_{i=1}^{K} sum_{j=1}^{K} Z_{t-1,i} Z_{t,j} log A_ij   (6)
                       = Z_{t-1}^T log(A) Z_t.   (7)

4) Observation model, B: Also called the emission probabilities, B is an Ω x K matrix whose elements B_kj describe the probability of making observation X_{t,k} given state Z_{t,j}. This can be written as

  B_kj = P(X_t = k | Z_t = j).   (8)

The conditional probability can be written as

  P(X_t | Z_t) = prod_{j=1}^{K} prod_{k=1}^{Ω} B_kj^{Z_{t,j} X_{t,k}}.   (9)

Taking the logarithm, we can write this as

  log P(X_t | Z_t) = sum_{j=1}^{K} sum_{k=1}^{Ω} Z_{t,j} X_{t,k} log B_kj   (10)
                   = X_t^T log(B) Z_t.   (11)

5) Initial state distribution, π: This is a K x 1 vector of probabilities π_i = P(Z_{1,i} = 1). The conditional probability can be written as

  P(Z_1 | π) = prod_{i=1}^{K} π_i^{Z_{1,i}}.   (12)

Given the five parameters presented above, an HMM can be completely specified. In the literature, this often gets abbreviated as

  λ = (A, B, π).   (13)

B. Three Problems of Interest

In [2] Rabiner states that for the HMM to be useful in real-world applications, the following three problems must be solved:

Problem 1: Given observations X_1, ..., X_N and a model λ = (A, B, π), how do we efficiently compute P(X_{1:N} | λ), the probability of the observations given the model? This is part of the exact inference problem presented in the lecture notes, and can be solved using forward filtering.

Fig. 3. Factor graph for a slice of the HMM at time t.

Problem 2: Given observations X_1, ..., X_N and the model λ, how do we find the hidden state sequence Z_1, ..., Z_N that best explains the observations? This corresponds to finding the most probable sequence of hidden states from the lecture notes, and can be solved using the Viterbi algorithm. A related problem is calculating the probability of being in state Z_{t,k} given the observations, P(Z_t = k | X_{1:N}), which can be calculated using the forward-backward algorithm.

Problem 3: How do we adjust the model parameters λ = (A, B, π) to maximize P(X_{1:N} | λ)? This corresponds to the learning problem presented in the lecture notes, and can be solved using the Expectation-Maximization (EM) algorithm (in the case of HMMs, this is called the Baum-Welch algorithm).

C. The Forward-Backward Algorithm

The forward-backward algorithm is a dynamic programming algorithm that makes use of message passing (belief propagation). It allows us to compute the filtered and smoothed marginals, which can then be used to perform inference, MAP estimation, sequence classification, anomaly detection, and model-based clustering. We will follow the derivation presented in Murphy [3].

1) The Forward Algorithm: In this part, we compute the filtered marginals, P(Z_t | X_{1:t}), using the predict-update cycle. The prediction step calculates the one-step-ahead predictive density

  P(Z_t = j | X_{1:t-1}) = sum_{i=1}^{K} P(Z_t = j | Z_{t-1} = i) P(Z_{t-1} = i | X_{1:t-1})   (14)

which acts as the new prior for time t.
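To make the one-hot algebra of Eqs. (5)-(12) concrete, the following NumPy sketch checks that Z_{t-1}^T log(A) Z_t picks out log A_ij and that X_t^T log(B) Z_t picks out log B_kj. The numerical values are hypothetical, reusing the toy λ = (A, B, π) from the earlier sketch.

import numpy as np

# Hypothetical toy parameters lambda = (A, B, pi); same shapes as in items 3)-5).
A  = np.array([[0.7, 0.3],
               [0.4, 0.6]])      # K x K, rows sum to 1 (stochastic matrix)
B  = np.array([[0.5, 0.1],
               [0.4, 0.3],
               [0.1, 0.6]])      # Omega x K, columns sum to 1
pi = np.array([0.6, 0.4])        # K x 1 initial state distribution
K, Omega = 2, 3

assert np.allclose(A.sum(axis=1), 1.0)   # each row of A sums to 1
assert np.allclose(B.sum(axis=0), 1.0)   # each column of B sums to 1

def one_hot(idx, size):
    v = np.zeros(size)
    v[idx] = 1.0
    return v

# One-hot encodings for Z_{t-1} = state 1, Z_t = state 2, X_t = symbol 3 (1-indexed).
Z_prev, Z_t, X_t = one_hot(0, K), one_hot(1, K), one_hot(2, Omega)

# Eq. (7): Z_{t-1}^T log(A) Z_t selects log A_ij.
assert np.isclose(Z_prev @ np.log(A) @ Z_t, np.log(A[0, 1]))

# Eq. (11): X_t^T log(B) Z_t selects log B_kj.
assert np.isclose(X_t @ np.log(B) @ Z_t, np.log(B[2, 1]))

# Eq. (12): P(Z_1 | pi) = prod_i pi_i^{Z_{1,i}}.
assert np.isclose(np.prod(pi ** Z_prev), pi[0])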
In the update step, the observed data from time t is absorbed using Bayes' rule:

  α_t(j) ≜ P(Z_t = j | X_{1:t})
         = P(Z_t = j | X_t, X_{1:t-1})
         = P(X_t | Z_t = j, X_{1:t-1}) P(Z_t = j | X_{1:t-1}) / sum_{j'} P(X_t | Z_t = j', X_{1:t-1}) P(Z_t = j' | X_{1:t-1})
         = (1 / C_t) P(X_t | Z_t = j) P(Z_t = j | X_{1:t-1})   (15)

Here the observations X_{1:t-1} cancel out of the likelihood term because they are d-separated from X_t. C_t is the normalization constant (to avoid confusion, we use C_t as opposed to Z_t from [3]) given by

  C_t ≜ P(X_t | X_{1:t-1}) = sum_{j=1}^{K} P(X_t | Z_t = j) P(Z_t = j | X_{1:t-1}).   (16)

The K x 1 vector α_t = P(Z_t | X_{1:t}) is called the (filtered) belief state at time t. In matrix notation, we can write the recursive update as

  α_t ∝ ψ_t ⊙ (A^T α_{t-1})   (17)

where ψ_t = [ψ_t1, ψ_t2, ..., ψ_tK]^T = {P(X_t | Z_t = i)}_{i=1}^{K} is the local evidence at time t, which can be calculated using Eq. 9, A is the transition matrix, and ⊙ is the Hadamard product, representing elementwise vector multiplication. The pseudo-code in Algorithm 1 outlines the steps of the computation.

Algorithm 1 Forward algorithm
1: Input: A, ψ_{1:N}, π
2: [α_1, C_1] = normalize(ψ_1 ⊙ π);
3: for t = 2 : N do
4:   [α_t, C_t] = normalize(ψ_t ⊙ (A^T α_{t-1}));
5: Return α_{1:N} and log P(X_{1:N}) = sum_t log C_t
6: Sub: [α, C] = normalize(u): C = sum_j u_j; α_j = u_j / C;

The log probability of the evidence can be computed as

  log P(X_{1:N} | λ) = sum_{t=1}^{N} log P(X_t | X_{1:t-1}) = sum_{t=1}^{N} log C_t.   (18)

This, in fact, is the solution to Problem 1 stated by Rabiner [2]. Working in the log domain allows us to avoid numerical underflow during computations.

2) The Forward-Backward Algorithm: Now that we have the filtered belief states α from the forward messages, we can compute the backward messages to get the smoothed marginals:

  P(Z_t = j | X_{1:N}) ∝ P(Z_t = j, X_{t+1:N} | X_{1:t})
                       ∝ P(Z_t = j | X_{1:t}) P(X_{t+1:N} | Z_t = j, X_{1:t}),   (19)

which is the probability of being in state Z_{t,j}. Given that the hidden state at time t is j, define the conditional likelihood of future evidence as

  β_t(j) ≜ P(X_{t+1:N} | Z_t = j).   (20)

Also define the desired smoothed posterior marginal as

  γ_t(j) ≜ P(Z_t = j | X_{1:N}).   (21)

Then we can rewrite Eq. 19 as

  γ_t(j) ∝ α_t(j) β_t(j).   (22)

We can now compute the β's recursively from right to left:

  β_{t-1}(i) = P(X_{t:N} | Z_{t-1} = i)
             = sum_j P(Z_t = j, X_t, X_{t+1:N} | Z_{t-1} = i)
             = sum_j P(X_{t+1:N} | Z_t = j, X_t, Z_{t-1} = i) P(Z_t = j, X_t | Z_{t-1} = i)
             = sum_j P(X_{t+1:N} | Z_t = j) P(X_t | Z_t = j, Z_{t-1} = i) P(Z_t = j | Z_{t-1} = i)
             = sum_j β_t(j) ψ_t(j) A(i, j).   (23)

This can be written as

  β_{t-1} = A (ψ_t ⊙ β_t).   (24)

The base case for β_N is

  β_N(i) = P(X_{N+1:N} | Z_N = i) = P(∅ | Z_N = i) = 1.   (25)

Finally, the smoothed posterior is

  γ = (α ⊙ β) / sum_j (α(j) β(j))   (26)

where the denominator ensures that each column of γ sums to 1, so that γ is a stochastic matrix. The pseudo-code in Algorithm 2 outlines the steps of the computation.

Algorithm 2 Backward algorithm
1: Input: A, ψ_{1:N}, α
2: β_N = 1;
3: for t = N-1 : 1 do
4:   β_t = normalize(A (ψ_{t+1} ⊙ β_{t+1}));
5: γ = normalize(α ⊙ β, 1);
6: Return γ_{1:N}
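The NumPy sketch below implements Algorithms 1 and 2 for the hypothetical 2-state, 3-symbol toy model used in the earlier sketches (all numbers invented for illustration). It is a minimal illustration of the normalized predict-update recursion of Eq. (17) and the backward recursion of Eq. (24), not a production implementation.

import numpy as np

# Hypothetical toy model, same conventions as items 3)-5) above.
A  = np.array([[0.7, 0.3],
               [0.4, 0.6]])      # A[i, j] = P(Z_t = j | Z_{t-1} = i)
B  = np.array([[0.5, 0.1],
               [0.4, 0.3],
               [0.1, 0.6]])      # B[k, j] = P(X_t = k | Z_t = j)
pi = np.array([0.6, 0.4])
X  = [2, 0, 1, 0]                # observation sequence (0-indexed symbols)

def normalize(u):
    """Sub-routine of Algorithm 1: C = sum_j u_j, alpha_j = u_j / C."""
    C = u.sum()
    return u / C, C

def forward(X, A, B, pi):
    """Algorithm 1: filtered marginals alpha_t(j) = P(Z_t = j | X_{1:t}) and log P(X_{1:N})."""
    N, K = len(X), len(pi)
    alpha = np.zeros((N, K))
    psi = B[X[0]]                        # local evidence psi_1(j) = P(X_1 | Z_1 = j)
    alpha[0], C = normalize(psi * pi)
    loglik = np.log(C)
    for t in range(1, N):
        psi = B[X[t]]
        alpha[t], C = normalize(psi * (A.T @ alpha[t - 1]))   # Eq. (17)
        loglik += np.log(C)
    return alpha, loglik                 # loglik = log P(X_{1:N}), Eq. (18)

def backward(X, A, B, alpha):
    """Algorithm 2: smoothed marginals gamma_t(j) = P(Z_t = j | X_{1:N})."""
    N, K = alpha.shape
    beta = np.ones((N, K))               # base case, Eq. (25)
    for t in range(N - 2, -1, -1):
        psi_next = B[X[t + 1]]
        beta[t], _ = normalize(A @ (psi_next * beta[t + 1]))  # Eq. (24)
    gamma = alpha * beta
    gamma /= gamma.sum(axis=1, keepdims=True)                 # Eq. (26)
    return gamma

alpha, loglik = forward(X, A, B, pi)
gamma = backward(X, A, B, alpha)
print("log P(X_1:N) =", loglik)
print("filtered alpha_t:", alpha)
print("smoothed gamma_t:", gamma)

For these hypothetical numbers, the first update gives ψ_1 ⊙ π = [0.1*0.6, 0.6*0.4] = [0.06, 0.24], so C_1 = 0.30 and α_1 = [0.2, 0.8]; each subsequent step repeats the predict-update cycle of Eqs. (14)-(15), and sum_t log C_t yields log P(X_{1:N}) as in Eq. (18).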

D. The Viterbi Algorithm

In order to compute the most probable sequence of hidden states (Problem 2), we use the Viterbi algorithm. This algorithm computes the shortest path through the trellis diagram of the HMM. The trellis diagram shows how each state in the model at one time step connects to the states in the next time step. In this section, we again follow the derivation presented in Murphy [3].

The Viterbi algorithm also has a forward and a backward pass. In the forward pass, instead of the sum-product algorithm, we utilize the max-product algorithm. The backward pass recovers the most probable path through the trellis diagram using a traceback procedure, propagating the most likely state at time t back in time to recursively find the most likely sequence between times 1 : t. This can be expressed as

  δ_t(j) ≜ max_{Z_1, ..., Z_{t-1}} P(Z_{1:t-1}, Z_t = j | X_{1:t}).   (27)

This probability can be expressed as a combination of the transition from the previous state at time t-1 and the most probable path leading to it,

  δ_t(j) = max_{1 ≤ i ≤ K} δ_{t-1}(i) A_ij B_{X_t}(j).   (28)

Here B_{X_t}(j) is the emission probability of observation X_t given state j. We also need to keep track of the most likely previous state,

  a_t(j) = argmax_i δ_{t-1}(i) A_ij B_{X_t}(j).   (29)

The initial probability is

  δ_1(j) = π_j B_{X_1}(j).   (30)

The most probable final state is

  Z*_N = argmax_i δ_N(i).   (31)

The most probable sequence can be computed using traceback,

  Z*_t = a_{t+1}(Z*_{t+1}).   (32)

In order to avoid underflow, we can work in the log domain. This is one of the advantages of the Viterbi algorithm, since log max = max log; this is not possible with the forward-backward algorithm, since the log of a sum is not the sum of the logs. Therefore

  log δ_t(j) ≜ max_i [log δ_{t-1}(i) + log A_ij + log B_{X_t}(j)].   (33)

The pseudo-code in Algorithm 3 outlines the steps of the computation.

Algorithm 3 Viterbi algorithm
1: Input: X_{1:N}, K, A, B, π
2: Initialize: δ_1 = π ⊙ B_{X_1}, a_1 = 0;
3: for t = 2 : N do
4:   for j = 1 : K do
5:     [a_t(j), δ_t(j)] = max(log δ_{t-1}(:) + log A_{:,j} + log B_{X_t}(j));
6: Z*_N = arg max(δ_N);
7: for t = N-1 : 1 do
8:   Z*_t = a_{t+1}(Z*_{t+1});
9: Return Z*_{1:N}

E. The Baum-Welch Algorithm

The Baum-Welch algorithm is in essence the Expectation-Maximization (EM) algorithm for HMMs. Given a sequence of observations X_{1:N}, we would like to find

  argmax_λ P(X; λ) = argmax_λ sum_Z P(X, Z; λ)   (34)

by doing maximum-likelihood estimation. Since summing over all possible Z is not possible in terms of computation time, we use EM to estimate the model parameters. The algorithm requires the forward and backward probabilities α, β calculated using the forward-backward algorithm. In this section we follow the derivation presented in Murphy [3] and the lecture notes.

1) E Step:

  γ_{tk} ≜ P(Z_{tk} = 1 | X, λ^old) = α_k(t) β_k(t) / sum_{j=1}^{K} α_j(t) β_j(t)   (35)

  ξ_{tjk} ≜ P(Z_{t-1,j} = 1, Z_{tk} = 1 | X, λ^old) = α_j(t-1) A_jk B_k(X_t) β_k(t) / sum_{i=1}^{K} α_i(t) β_i(t)   (36)

2) M Step: The parameter estimation problem can be turned into a constrained optimization problem where P(X_{1:N} | λ) is maximized subject to the stochastic constraints of the HMM parameters [2]. The technique of Lagrange multipliers can then be used to find the model parameters, yielding the following expressions:

  π̂_k = E[N^1_k] / N = γ_{1k} / sum_{j=1}^{K} γ_{1j}   (37)

  Â_jk = E[N_jk] / sum_k E[N_jk] = sum_{t=2}^{N} ξ_{tjk} / sum_{l=1}^{K} sum_{t=2}^{N} ξ_{tjl}   (38)

  B̂_jl = E[M_jl] / E[N_l] = sum_{t=1}^{N} γ_{tl} X_{tj} / sum_{t=1}^{N} γ_{tl}   (39)

  λ^new = (Â, B̂, π̂)   (40)

The pseudo-code in Algorithm 4 outlines the steps of the computation.

Algorithm 4 Baum-Welch algorithm
1: Input: X_{1:N}, A, B, α, β
2: for t = 1 : N do
3:   γ(:, t) = (α(:, t) ⊙ β(:, t)) / sum(α(:, t) ⊙ β(:, t));
4:   if t > 1: ξ(t, :, :) = (A ⊙ (α(:, t-1) (B(X_t, :) ⊙ β(:, t))^T)) / sum(α(:, t) ⊙ β(:, t));
5: π̂ = γ(:, 1) / sum(γ(:, 1));
6: for j = 1 : K do
7:   Â(j, :) = sum(ξ(2:N, j, :), 1) / sum(sum(ξ(2:N, j, :), 1), 2);
8:   B̂(j, :) = (X(:, j)^T γ) / sum(γ, 1);
9: Return π̂, Â, B̂

F. Limitations

A fully-connected transition diagram can lead to severe overfitting. [1] explains this by giving an example from computer vision, where objects are tracked in a sequence of images. In problems with large parameter spaces like this, the transition matrix ends up being very large. Unless there are many examples in the data set, or some a priori knowledge about the problem is used, this leads to severe overfitting. A solution is to use other types of HMMs, such as factorial or hierarchical HMMs.

REFERENCES

[1] Z. Ghahramani, "An Introduction to Hidden Markov Models and Bayesian Networks," International Journal of Pattern Recognition and Artificial Intelligence, vol. 15, no. 1, pp. 9-42, 2001.
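As a concrete illustration of Algorithm 3, here is a minimal log-domain Viterbi sketch in NumPy, reusing the hypothetical 2-state, 3-symbol toy parameters from the earlier sketches (all values invented for illustration). A Baum-Welch iteration would additionally reuse the α and β recursions sketched above to form γ and ξ via Eqs. (35)-(36).

import numpy as np

# Hypothetical toy model, same conventions as items 3)-5).
A  = np.array([[0.7, 0.3],
               [0.4, 0.6]])      # A[i, j] = P(Z_t = j | Z_{t-1} = i)
B  = np.array([[0.5, 0.1],
               [0.4, 0.3],
               [0.1, 0.6]])      # B[k, j] = P(X_t = k | Z_t = j)
pi = np.array([0.6, 0.4])
X  = [2, 0, 1, 0]                # observation sequence (0-indexed symbols)

def viterbi(X, A, B, pi):
    """Algorithm 3: most probable state sequence, computed in the log domain (Eq. 33)."""
    N, K = len(X), len(pi)
    log_delta = np.zeros((N, K))
    a = np.zeros((N, K), dtype=int)                       # traceback pointers, Eq. (29)
    log_delta[0] = np.log(pi) + np.log(B[X[0]])           # Eq. (30)
    for t in range(1, N):
        for j in range(K):
            scores = log_delta[t - 1] + np.log(A[:, j]) + np.log(B[X[t], j])
            a[t, j] = np.argmax(scores)
            log_delta[t, j] = np.max(scores)              # Eq. (33)
    Z = np.zeros(N, dtype=int)
    Z[-1] = np.argmax(log_delta[-1])                      # Eq. (31)
    for t in range(N - 2, -1, -1):
        Z[t] = a[t + 1, Z[t + 1]]                         # traceback, Eq. (32)
    return Z, np.max(log_delta[-1])

Z_star, log_score = viterbi(X, A, B, pi)
print("Most probable state sequence:", Z_star)   # [1 0 0 0] for these hypothetical numbers
print("Log of the best (un-normalized) path score:", log_score)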

[2] L. Rabiner, "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition," Proceedings of the IEEE, vol. 77, no. 2, pp. 257-286, 1989.
[3] K. P. Murphy, Machine Learning: A Probabilistic Perspective. Cambridge, MA: MIT Press, 2012.

© 2014 Alperen Degirmenci