Markov Chains and Hidden Markov Models


Chapter 1: Markov Chains and Hidden Markov Models

In this chapter, we will introduce the concept of Markov chains and show how Markov chains can be used to model signals using structures such as hidden Markov models (HMMs). Markov chains are based on the simple idea that each new sample depends only on the previous sample. This simple assumption makes them easy to analyze, but still allows them to be very powerful tools for modeling physical processes.

1.1 Markov Chains

A Markov chain is a discrete-time and discrete-valued random process in which each new sample depends only on the previous sample. To make this precise, let $\{X_n\}_{n=0}^N$ be a sequence of random variables taking values in the countable set $\Omega$.

Definition: We say that $X_n$ is a Markov chain if, for all values of $x_k$ and all $n$,
$$P\{X_n = x_n \mid X_k = x_k \text{ for all } k < n\} = P\{X_n = x_n \mid X_{n-1} = x_{n-1}\} .$$

Notice that a Markov chain is discrete in both time and value. Alternatively, a Markov process that is continuously valued but discrete in time is known as a discrete-time Markov process. We will primarily focus on Markov chains because they are easy to analyze and of great practical value; however, most of the results we derive are also true for discrete-time Markov processes.
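As a quick illustration of the definition, the following sketch (my own example, not from the notes) simulates such a process: each new sample is drawn from a distribution that depends only on the previous sample. The names `rho` and `P` anticipate the initial-distribution and transition-probability notation introduced below.

```python
import numpy as np

def simulate_markov_chain(rho, P, N, seed=0):
    """Draw x_0 ~ rho, then x_n | x_{n-1} = i ~ P[i, :] for n = 1, ..., N."""
    rng = np.random.default_rng(seed)
    x = np.empty(N + 1, dtype=int)
    x[0] = rng.choice(len(rho), p=rho)
    for n in range(1, N + 1):
        # The next state depends only on the current state x[n-1].
        x[n] = rng.choice(P.shape[1], p=P[x[n - 1]])
    return x

# Example: a 3-state chain that tends to stay in its current state.
rho = np.array([0.5, 0.25, 0.25])
P = np.array([[0.8, 0.1, 0.1],
              [0.1, 0.8, 0.1],
              [0.1, 0.1, 0.8]])
x = simulate_markov_chain(rho, P, N=100)
```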

[Figure 1.1: Three plots of a binary valued Markov chain versus discrete time n, illustrating three distinct behaviors of the Markov chain example with ρ = 0.1, 0.5, and 0.9.]

In order to analyze a Markov chain, we first define notation for the marginal distribution of $X_n$ and for the probability of transitioning from state $X_{n-1}$ to state $X_n$:
$$\pi_j^{(n)} = P\{X_n = j\}$$
$$P_{i,j}^{(n)} = P\{X_n = j \mid X_{n-1} = i\}$$
for $i, j \in \Omega$. If the transition probabilities $P_{i,j}^{(n)}$ do not depend on the time $n$, then we say that the Markov chain is homogeneous. Note that a homogeneous Markov chain may still have a time-varying distribution because of the transients associated with an initial condition, but we might hope that a homogeneous Markov chain will reach a stationary distribution given a long enough time. We will discuss this later in more detail. For the remainder of this chapter, we will assume that Markov chains are homogeneous unless otherwise stated.

From the Markov property, we can derive an expression for the probability of the sequence $\{X_n\}_{n=0}^N$. The probability of the sequence is the product of the probability of the initial state, denoted by $\rho_{x_0}$, with the probabilities of each of the $N$ transitions from state $x_{n-1}$ to state $x_n$. This product is given by
$$p(x) = \rho_{x_0} \prod_{n=1}^{N} P_{x_{n-1},\,x_n} .$$
From this expression, it is clear that the Markov chain is parameterized by its initial distribution, $\rho_i$, and its transition probabilities, $P_{i,j}$.
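Given the parameterization $(\rho, P)$, the sequence probability above is a simple product, which is best evaluated in the log domain to avoid numerical underflow for long sequences. A minimal sketch (my own, not from the notes), assuming the states are coded as integers that index $\rho$ and the rows and columns of $P$:

```python
import numpy as np

def markov_sequence_log_prob(x, rho, P):
    """Compute log p(x) = log rho_{x_0} + sum_{n=1}^N log P_{x_{n-1}, x_n}."""
    logp = np.log(rho[x[0]])
    for n in range(1, len(x)):
        logp += np.log(P[x[n - 1], x[n]])
    return logp
```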

Example 1.1.1: Let $\{X_n\}_{n=0}^N$ be a Markov chain with $\Omega = \{0, 1\}$ and parameters
$$\rho_j = 1/2$$
$$P_{i,j} = \begin{cases} 1 - \rho & \text{if } j = i \\ \rho & \text{if } j \neq i \end{cases}$$
This Markov chain starts with an equal chance of being 0 or 1. Then with each new state, it has probability $\rho$ of changing states and probability $1 - \rho$ of remaining in the same state. If $\rho$ is small, then the Markov chain is likely to stay in the same state for a long time. When $\rho = 1/2$, each new state is independent of the previous state; and when $\rho$ is approximately 1, the state is likely to change at almost every value of $n$. These three cases are illustrated in Figure 1.1.

If we define the statistic
$$K = N - \sum_{n=1}^{N} \delta(X_n - X_{n-1}) ,$$
then $K$ is the number of times that the Markov chain changes state. Using this definition, we can express the probability of the sequence as
$$p(x) = (1/2)(1 - \rho)^{N-K} \rho^{K} .$$

1.2 Parameter Estimation for Markov Chains

Markov chains are very useful for modeling physical phenomena, but even homogeneous Markov chains can have many parameters. In general, an $M = |\Omega|$ state Markov chain has a total of $M^2$ parameters. (The quantity $M^2$ results from the sum of $M$ parameters for the initial state plus $M(M-1)$ parameters for the transition probabilities. The value $M(M-1)$ results from the fact that there are $M$ rows in the transition matrix and each row has $M-1$ degrees of freedom since it must sum to 1.) When $M$ is large, it is important to have effective methods to estimate these parameters. Fortunately, we will see that Markov chains have exponential family distributions, so the parameters are easily estimated from natural sufficient statistics.

Let $\{X_n\}_{n=0}^N$ be a Markov chain parameterized by $\theta = [\rho_i, P_{i,j} : i, j \in \Omega]$, and define the statistics
$$\tau_i = \delta(x_0 - i)$$
$$K_{i,j} = \sum_{n=1}^{N} \delta(x_n - j)\,\delta(x_{n-1} - i) .$$
The statistic $K_{i,j}$ essentially counts the number of times that the Markov chain transitions from state $i$ to state $j$, and the statistic $\tau_i$ counts the number of times the initial state has value $i$.
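Since $\tau_i$ and $K_{i,j}$ are just counts, they are easy to compute from an observed state path. A minimal sketch (the function name is mine), assuming the states are coded as integers $0, \ldots, M-1$:

```python
import numpy as np

def markov_sufficient_stats(x, M):
    """tau[i] = delta(x_0 - i); K[i, j] = number of transitions from state i to state j."""
    tau = np.zeros(M)
    tau[x[0]] = 1.0
    K = np.zeros((M, M))
    for n in range(1, len(x)):
        K[x[n - 1], x[n]] += 1.0
    return tau, K
```

Row-normalizing `K` gives exactly the ML transition estimates derived next.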

Using these statistics, we can express the probability of the Markov chain sequence as
$$p(x) = \prod_{i \in \Omega} [\rho_i]^{\tau_i} \prod_{i \in \Omega} \prod_{j \in \Omega} [P_{i,j}]^{K_{i,j}} ,$$
which means that the log likelihood has the form
$$\log p(x) = \sum_{i \in \Omega} \left\{ \tau_i \log[\rho_i] + \sum_{j \in \Omega} K_{i,j} \log[P_{i,j}] \right\} . \tag{1.1}$$
Based on this, it is easy to see that $X$ has an exponential family distribution and that $\tau_i$ and $K_{i,j}$ are its natural sufficient statistics. From (1.1), we can derive the maximum likelihood estimates of the parameter $\theta$ in much the same manner as is done for the ML estimates of the parameters of a Bernoulli sequence. This results in the ML estimates
$$\hat{\rho}_i = \tau_i$$
$$\hat{P}_{i,j} = \frac{K_{i,j}}{\sum_{j' \in \Omega} K_{i,j'}} .$$
So the ML estimate of the transition parameters is quite reasonable: it is simply the empirical rate at which transitions from state $i$ to state $j$ occur.

1.3 Hidden Markov Models

One important application of Markov chains is in hidden Markov models (HMMs). Figure 1.2 shows the structure of an HMM.

[Figure 1.2: This diagram illustrates the dependency of quantities in a hidden Markov model. Notice that the values $X_n$ form a Markov chain in time, while the observed values, $Y_n$, are dependent on the corresponding label $X_n$.]

The discrete values $\{X_n\}_{n=0}^N$ form a Markov chain in time, and their values determine the distribution of the observations $Y_n$. Much as in the case of the Gaussian mixture distribution, the labels $X_n$ are typically not observed in real applications, but we can imagine that their existence explains the changes in the behavior of $Y_n$ over long time scales.
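To make the structure in Figure 1.2 concrete, the following sketch (my own example, not part of the notes) draws a state path from a Markov chain and then draws each observation from a state-dependent Gaussian density, the observation model used later in Section 1.3.2; the function name and argument layout are assumptions.

```python
import numpy as np

def simulate_gaussian_hmm(rho, P, mu, R, N, seed=0):
    """Sample x_0, ..., x_N from the Markov chain and y_n ~ N(mu[x_n], R[x_n]) for n = 1, ..., N."""
    rng = np.random.default_rng(seed)
    x = np.empty(N + 1, dtype=int)
    x[0] = rng.choice(len(rho), p=rho)
    for n in range(1, N + 1):
        x[n] = rng.choice(P.shape[1], p=P[x[n - 1]])
    # One Gaussian observation per hidden label; no y_0 term, matching the joint density written out below.
    y = np.array([rng.multivariate_normal(mu[k], R[k]) for k in x[1:]])
    return x, y
```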

An HMM is sometimes referred to as doubly stochastic because the unobserved stochastic process $X_n$ controls the observed stochastic process $Y_n$.

***Put some more discussion of applications here***

Let the density function for $Y_n$ given $X_n$ be given by
$$P\{Y_n \in dy \mid X_n = k\} = f(y|k) ,$$
and let the Markov chain $X_n$ be parameterized by $\theta = [\rho_i, P_{i,j} : i, j \in \Omega]$. Then the density function for the sequences $Y$ and $X$ is given by
$$p(y, x|\theta) = \rho_{x_0} \prod_{n=1}^{N} \left\{ f(y_n|x_n)\, P_{x_{n-1},\,x_n} \right\} .$$
(This is actually a mixed probability density and probability mass function, which is fine as long as one remembers to integrate over $y$ and sum over $x$.) Assuming that no quantities are zero, the log likelihood is given by
$$\log p(y, x|\theta) = \log \rho_{x_0} + \sum_{n=1}^{N} \left\{ \log f(y_n|x_n) + \log P_{x_{n-1},\,x_n} \right\} .$$

There are two basic tasks that one typically needs to solve with HMMs. The first task is to estimate the unknown states $X_n$ from the observed data $Y_n$, and the second task is to estimate the parameters $\theta$ from the observations $Y_n$. The following two sections explain how these problems can be solved.

1.3.1 MAP State Sequence Estimation for HMMs

One common estimate of the states $X_n$ is the MAP estimate given by
$$\hat{x} = \arg\max_{x \in \Omega^{N+1}} p(y, x|\theta) . \tag{1.2}$$
Interestingly, the optimization of (1.2) can be efficiently computed using dynamic programming. To do this, we first define the quantity $L(k, n)$ to be the log probability of the most probable completion of the state sequence starting from the value $X_n = k$; more precisely, $L(k, n)$ is the maximum over $x_{n+1}, \ldots, x_N$ of $\sum_{m=n+1}^{N} \{\log f(y_m|x_m) + \log P_{x_{m-1},\,x_m}\}$ subject to $x_n = k$. The values of $L(k, n)$ can be computed with the recursion
$$L(k, n-1) = \max_{j \in \Omega} \left\{ \log f(y_n|j) + \log P_{k,j} + L(j, n) \right\}$$
with the initial condition $L(k, N) = 0$. Once these values are computed, the MAP sequence can be computed by
$$\hat{x}_0 = \arg\max_{k \in \Omega} \left\{ L(k, 0) + \log \rho_k \right\}$$
$$\hat{x}_n = \arg\max_{k \in \Omega} \left\{ \log P_{\hat{x}_{n-1},\,k} + \log f(y_n|k) + L(k, n) \right\} \quad \text{for } n = 1, \ldots, N .$$
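The recursion above is a backward dynamic program (a Viterbi-style algorithm run in reverse). The following sketch is one possible implementation, not code from the notes; it assumes the state-conditional log densities $\log f(y_n|k)$ have already been evaluated into an array `log_f` whose row 0 is unused (mirroring the fact that the joint density contains no $y_0$ term), and the names `map_state_sequence`, `log_rho`, and `log_P` are mine.

```python
import numpy as np

def map_state_sequence(log_f, log_rho, log_P):
    """MAP state estimate via the backward dynamic program of Section 1.3.1.

    log_f   : (N+1, M) array with log_f[n, k] = log f(y_n | k) for n = 1..N (row 0 unused).
    log_rho : (M,) log initial probabilities.
    log_P   : (M, M) log transition probabilities.
    """
    N = log_f.shape[0] - 1
    M = log_rho.shape[0]
    L = np.zeros((M, N + 1))                 # initial condition: L[k, N] = 0
    for n in range(N, 0, -1):                # fill L[:, n-1] from L[:, n]
        for k in range(M):
            L[k, n - 1] = np.max(log_f[n] + log_P[k] + L[:, n])
    xhat = np.empty(N + 1, dtype=int)
    xhat[0] = np.argmax(log_rho + L[:, 0])
    for n in range(1, N + 1):                # back-substitution for n = 1, ..., N
        xhat[n] = np.argmax(log_P[xhat[n - 1]] + log_f[n] + L[:, n])
    return xhat
```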

1.3.2 Training HMMs with the EM Algorithm

In order to train the HMM, it is necessary to estimate the parameters of the HMM model. This will be done by employing the EM algorithm, so we will need to derive both the E-step and the M-step. In a typical application, the observed quantity $Y_n$ is a multivariate Gaussian random vector with distribution $N(\mu_{x_n}, R_{x_n})$, so it is also necessary to estimate the parameters $\mu_k$ and $R_k$ for each of the states $k \in \Omega$. In this case, the joint distribution $p(x, y|\theta)$ is an exponential family distribution with parameter vector $\theta = [\mu_i, R_i, \rho_i, P_{i,j} : i, j \in \Omega]$ and natural sufficient statistics given by
$$t_{1,k} = \sum_{n} Y_n \,\delta(X_n - k)$$
$$t_{2,k} = \sum_{n} Y_n Y_n^t \,\delta(X_n - k)$$
$$N_k = \sum_{n} \delta(X_n - k)$$
$$\tau_k = \delta(X_0 - k)$$
$$K_{i,j} = \sum_{n=1}^{N} \delta(X_n - j)\,\delta(X_{n-1} - i) .$$
The ML estimate of $\theta$ can then be calculated from the sufficient statistics as
$$\hat{\mu}_k = \frac{t_{1,k}}{N_k}$$
$$\hat{R}_k = \frac{t_{2,k}}{N_k} - \frac{t_{1,k}\, t_{1,k}^t}{N_k^2}$$
$$\hat{\rho}_i = \tau_i$$
$$\hat{P}_{i,j} = \frac{K_{i,j}}{\sum_{j' \in \Omega} K_{i,j'}} .$$
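As a sanity check on the fully observed case, here is a small sketch (my own, with hypothetical array shapes) that computes $\hat{\mu}_k$ and $\hat{R}_k$ from labeled data exactly as above; the chain parameters $\hat{\rho}_i$ and $\hat{P}_{i,j}$ follow from $\tau$ and $K$ as in Section 1.2. The EM algorithm below replaces these hard label indicators with posterior probabilities.

```python
import numpy as np

def gaussian_hmm_ml_estimates(y, x, M):
    """ML estimates of mu_k and R_k from fully observed data, using the
    sufficient statistics t_{1,k}, t_{2,k}, N_k defined above.

    y : (T, d) array of observation vectors; x : (T,) array of labels in {0, ..., M-1}.
    Assumes every state appears at least once in x.
    """
    d = y.shape[1]
    mu_hat = np.zeros((M, d))
    R_hat = np.zeros((M, d, d))
    for k in range(M):
        y_k = y[x == k]
        N_k = y_k.shape[0]              # N_k
        t1 = y_k.sum(axis=0)            # t_{1,k}
        t2 = y_k.T @ y_k                # t_{2,k}
        mu_hat[k] = t1 / N_k
        R_hat[k] = t2 / N_k - np.outer(t1, t1) / N_k ** 2
    return mu_hat, R_hat
```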

From this and the results of the previous chapter, we know that we can compute the EM updates by simply substituting the natural sufficient statistics with their conditional expectations given $Y$. So to compute the EM update of the parameter $\theta$, we first compute the conditional expectations of the sufficient statistics in the E-step as
$$t_{1,k} = \sum_{n} Y_n \, P\{X_n = k \mid Y = y, \theta\}$$
$$t_{2,k} = \sum_{n} Y_n Y_n^t \, P\{X_n = k \mid Y = y, \theta\}$$
$$N_k = \sum_{n} P\{X_n = k \mid Y = y, \theta\}$$
$$\tau_k = P\{X_0 = k \mid Y = y, \theta\}$$
$$K_{i,j} = \sum_{n=1}^{N} P\{X_n = j, X_{n-1} = i \mid Y = y, \theta\} ,$$
and the HMM model parameters are then updated in the M-step as
$$\hat{\mu}_k \leftarrow \frac{t_{1,k}}{N_k}$$
$$\hat{R}_k \leftarrow \frac{t_{2,k}}{N_k} - \frac{t_{1,k}\, t_{1,k}^t}{N_k^2}$$
$$\hat{\rho}_i \leftarrow \tau_i$$
$$\hat{P}_{i,j} \leftarrow \frac{K_{i,j}}{\sum_{j' \in \Omega} K_{i,j'}} .$$

While the M-step for the HMM is easily computed, the E-step requires the computation of the posterior probabilities of $X_n$ given $Y$. These are not easily computed directly with Bayes' rule due to the time dependencies of the Markov chain. Fortunately, there is a computationally efficient method for computing these posterior probabilities that exploits the 1D structure of the HMM. The algorithms for doing this are known as the forward-backward algorithms due to their forward and backward recursion structure in time. The forward recursion is given by
$$\alpha_0(i) = \rho_i$$
$$\alpha_{n+1}(j) = \sum_{i \in \Omega} \alpha_n(i)\, P_{i,j}\, f(y_{n+1}|j) .$$

The backward recursion is given by
$$\beta_N(i) = 1$$
$$\beta_n(i) = \sum_{j \in \Omega} P_{i,j}\, f(y_{n+1}|j)\, \beta_{n+1}(j) .$$
From these quantities, the required posterior probabilities can be calculated as
$$P\{X_n = k \mid Y = y, \theta\} = \frac{\alpha_n(k)\, \beta_n(k)}{p(y)}$$
$$P\{X_n = j, X_{n-1} = i \mid Y = y, \theta\} = \frac{\alpha_{n-1}(i)\, P_{i,j}\, f(y_n|j)\, \beta_n(j)}{p(y)} ,$$
where
$$p(y) = \sum_{i \in \Omega} \alpha_n(i)\, \beta_n(i)$$
for any value of $n$.
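For completeness, here is a compact sketch of the forward-backward recursions exactly as written above (my own code, not from the notes). It assumes the state-conditional densities $f(y_n|k)$ have been evaluated into an array `f` whose row 0 is unused, matching the model's lack of a $y_0$ term. The recursions are left unscaled for clarity; for long sequences one would rescale $\alpha_n$ and $\beta_n$ at each step (or work in the log domain) to avoid underflow.

```python
import numpy as np

def forward_backward(f, rho, P):
    """Posterior state probabilities for the HMM of Section 1.3.

    f   : (N+1, M) array with f[n, k] = f(y_n | k) for n = 1..N (row 0 unused).
    rho : (M,) initial distribution.  P : (M, M) transition matrix.
    Returns gamma[n, k] = P{X_n = k | Y = y} and, for n = 1..N,
    xi[n, i, j] = P{X_{n-1} = i, X_n = j | Y = y}.
    """
    N = f.shape[0] - 1
    M = rho.shape[0]
    alpha = np.zeros((N + 1, M))
    beta = np.ones((N + 1, M))                       # beta_N(i) = 1
    alpha[0] = rho                                   # alpha_0(i) = rho_i
    for n in range(N):                               # forward recursion
        alpha[n + 1] = (alpha[n] @ P) * f[n + 1]
    for n in range(N - 1, -1, -1):                   # backward recursion
        beta[n] = P @ (f[n + 1] * beta[n + 1])
    p_y = np.sum(alpha[N] * beta[N])                 # p(y), the same for every n
    gamma = alpha * beta / p_y
    xi = np.zeros((N + 1, M, M))
    for n in range(1, N + 1):
        xi[n] = alpha[n - 1][:, None] * P * (f[n] * beta[n])[None, :] / p_y
    return gamma, xi
```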