Hidden Markov Models (HMMs)


Reading Assignments
- R. Duda, P. Hart, and D. Stork, Pattern Classification, 2nd edition, John Wiley, 2001 (Section 3.10, hard copy).
- L. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition", Proceedings of the IEEE, vol. 77, pp. 257-286, 1989 (hard copy).

Case Studies
- F. Samaria, "Face segmentation for identification using HMMs", British Machine Vision Conference, pp. 399-408, 1993 (on-line).
- A. Nefian and M. Hayes III, "Face recognition using an embedded HMM", Intel, 1999 (on-line).

Time dependencies
- HMMs are appropriate for problems that have an inherent temporality, e.g.:
  * speech recognition
  * gesture recognition
  * human activity recognition
- A pattern is the result of a time process that passes through a number of states.
- The state at time t is influenced directly by the states at previous time steps.

Definition of first-order Markov models
- They are represented by a graph where every node corresponds to a state $\omega_i$.
- The graph can be fully connected, with self-loops.
- The link between nodes $\omega_i$ and $\omega_j$ is associated with a transition probability
  $P(\omega(t+1) = \omega_j \mid \omega(t) = \omega_i) = a_{ij}$,
  which is the probability of having state $\omega_j$ at time $t+1$ given that the state at time $t$ was $\omega_i$ (first-order model).
- The following constraint must be satisfied: $\sum_j a_{ij} = 1$ for all $i$.
- Markov models are fully described by their transition probabilities $a_{ij}$.

How to compute the probability P(ω^T) of a sequence of states ω^T?
- Given a sequence of states $\omega^T = (\omega(1), \omega(2), \ldots, \omega(T))$, the probability that the model generated $\omega^T$ is equal to the product of the corresponding transition probabilities:
  $P(\omega^T) = \prod_{t=1}^{T} P(\omega(t) \mid \omega(t-1))$
  where $P(\omega(1) \mid \omega(0)) \equiv P(\omega(1))$ is the prior probability on the first state.
- Example: if $\omega^6 = (\omega_1, \omega_4, \omega_2, \omega_2, \omega_1, \omega_4)$, then
  $P(\omega^6) = P(\omega_1)\, P(\omega_4 \mid \omega_1)\, P(\omega_2 \mid \omega_4)\, P(\omega_2 \mid \omega_2)\, P(\omega_1 \mid \omega_2)\, P(\omega_4 \mid \omega_1) = P(\omega_1)\, a_{14} a_{42} a_{22} a_{21} a_{14}$
- The last state $\omega(T)$ is called the absorbing state and is denoted $\omega_0$ (i.e., a state which, once entered, is never left: $a_{00} = 1$).

Definition of first-order hidden Markov models
- We augment the model such that when it is in state $\omega(t)$ it also emits some symbol $v(t)$ (visible state) from a set of possible symbols.
- For every sequence of hidden states there is an associated sequence of visible states:
  $\omega^T = (\omega(1), \omega(2), \ldots, \omega(T)) \rightarrow V^T = (v(1), v(2), \ldots, v(T))$
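As a quick illustration of this product rule, here is a minimal Python sketch; the 3-state transition matrix and prior are made up for the example (only the requirement that each row sums to 1 matters).

```python
import numpy as np

# Hypothetical 3-state Markov chain: rows of `a` sum to 1, `prior` is P(omega(1)).
a = np.array([[0.7, 0.2, 0.1],
              [0.3, 0.4, 0.3],
              [0.2, 0.3, 0.5]])
prior = np.array([0.5, 0.3, 0.2])

def sequence_probability(states, a, prior):
    """P(omega^T) = P(omega(1)) * prod_t P(omega(t) | omega(t-1))."""
    p = prior[states[0]]
    for prev, cur in zip(states, states[1:]):
        p *= a[prev, cur]          # transition probability a_{prev, cur}
    return p

print(sequence_probability([0, 2, 1, 1], a, prior))   # 0.5 * 0.1 * 0.3 * 0.4 = 0.006
```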

- When the model is in state $\omega_j$ at time $t$, the probability of emitting visible state $v_k$ at that time is denoted
  $P(v(t) = v_k \mid \omega(t) = \omega_j) = b_{jk}$
- The following constraint must be satisfied: $\sum_k b_{jk} = 1$ for all $j$.

Coin toss example
- You are in a room with a barrier (e.g., a curtain) through which you cannot see what is happening.
- On the other side of the barrier is another person who is performing a coin (or multiple coin) toss experiment.
- The other person will tell you only the result of each toss, not how he obtained that result!
  $V^T = HHTHTTHH \ldots T = v(1), v(2), \ldots, v(T)$
- Problem: build an HMM to explain the observed sequence of heads and tails.

1-fair-coin model
- There are 2 states, each associated with either heads (state 1) or tails (state 2).
- The observation sequence uniquely defines the states (the model is not hidden).

2-fair-coins model
- There are 2 states, but neither state is uniquely associated with heads or tails (each state is associated with a different fair coin).
- A third coin is used to decide which of the two coins to flip.

2-biased-coins model
- There are 2 states, each associated with a biased coin.
- A third coin is used to decide which of the biased coins to flip.

3-biased-coins model
- There are 3 states, each associated with a biased coin.
- We decide which coin to flip in some way (e.g., using other coins).
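To make the 2-biased-coins model concrete, here is a small generative sketch in Python. The transition and emission probabilities are invented for illustration (a fair selector coin, two coins with made-up biases); only the H/T string would be visible to the observer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2-biased-coins HMM: hidden state = which coin is flipped next.
a = np.array([[0.5, 0.5],          # a fair selector coin picks the next coin
              [0.5, 0.5]])
b = np.array([[0.8, 0.2],          # coin 1: P(H), P(T)  (made-up bias)
              [0.3, 0.7]])         # coin 2: P(H), P(T)  (made-up bias)

def sample(a, b, T, start=0):
    """Generate a visible sequence V^T; the state sequence stays hidden."""
    state, symbols = start, []
    for _ in range(T):
        symbols.append(rng.choice(2, p=b[state]))    # emit H (0) or T (1)
        state = rng.choice(2, p=a[state])            # move to the next hidden state
    return "".join("HT"[s] for s in symbols)

print(sample(a, b, T=10))   # prints a string of 10 H/T symbols
```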

Hidden Markov models and finite-state machines
- The two models are basically equivalent.
- When the transitions from state to state are probabilistic, we call them HMMs.

Some definitions
- Causal HMM: the probabilities depend only upon previous states.
- Ergodic HMM: every one of the states has a non-zero probability of occurring given some starting state.

Central issues in HMMs
- Evaluation problem: determine the probability that a particular sequence of visible states $V^T$ was generated by a given model.
- Decoding problem: given a sequence of visible states $V^T$, determine the most likely sequence of hidden states $\omega^T$ that led to those observations.
- Learning problem: given a set of visible observations, determine the parameters $a_{ij}$ and $b_{jk}$.

Evaluation
- In practice, we have several HMMs, one for each class, and we classify a test pattern by choosing the model with the highest probability.

  [Figure: the test sequence V^T is fed to HMM 1, HMM 2, ..., HMM N; each model outputs P(V^T) and a MAX unit selects the best-scoring model.]

- The probability that a model produces $V^T$ can be computed using the theorem of total probability:
  $P(V^T) = \sum_{r=1}^{r_{max}} P(V^T \mid \omega_r^T)\, P(\omega_r^T)$
  where $\omega_r^T = (\omega(1), \omega(2), \ldots, \omega(T))$ is one of the possible state sequences and $r_{max} = c^T$ for a model with $c$ states $\omega_1, \omega_2, \ldots, \omega_c$.
- The second term $P(\omega_r^T)$ can be written as:
  $P(\omega_r^T) = P(\omega(1)) \prod_{t=2}^{T} P(\omega(t) \mid \omega(t-1))$
- The first term $P(V^T \mid \omega_r^T)$ can be written as:
  $P(V^T \mid \omega_r^T) = \prod_{t=1}^{T} P(v(t) \mid \omega(t))$

- Combining the two terms together:
  $P(V^T) = \sum_{r=1}^{r_{max}} \prod_{t=1}^{T} P(v(t) \mid \omega(t))\, P(\omega(t) \mid \omega(t-1))$

Computational complexity
- Given $a_{ij}$ and $b_{jk}$, it is straightforward to compute $P(V^T)$ this way.
- This computation, however, has $O(T c^T)$ cost!
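A direct implementation of this sum over all $c^T$ state sequences makes the exponential cost obvious. The brute-force sketch below is usable only for tiny $c$ and $T$; it assumes a prior vector over the first state rather than a single known initial state.

```python
import itertools
import numpy as np

def evaluate_brute_force(a, b, obs, prior):
    """P(V^T) by summing over all c^T hidden-state sequences -- O(T c^T) work."""
    c, T = a.shape[0], len(obs)
    total = 0.0
    for states in itertools.product(range(c), repeat=T):        # every omega_r^T
        p = prior[states[0]] * b[states[0], obs[0]]              # P(omega(1)) b_{omega(1), v(1)}
        for t in range(1, T):
            p *= a[states[t - 1], states[t]] * b[states[t], obs[t]]
        total += p
    return total
```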

Recursive computation of P(V^T) (HMM forward)

Input: $V^T = (v(1), v(2), \ldots, v(T))$
- Let $\alpha_i(t)$ represent the probability that the HMM is in hidden state $\omega_i$ at step $t$, having generated the first $t$ elements of $V^T$:
  $\alpha_i(t) = P(v(1), v(2), \ldots, v(t), \omega(t) = \omega_i)$
- We can compute $\alpha_j(t+1)$, $j = 1, 2, \ldots, c$, as follows:
  $\alpha_j(t+1) = P(v(1), \ldots, v(t), v(t+1), \omega(t+1) = \omega_j)$
  $= \sum_{i=1}^{c} P(v(1), \ldots, v(t), \omega(t) = \omega_i)\, P(v(t+1) \mid \omega(t+1) = \omega_j)\, P(\omega(t+1) = \omega_j \mid \omega(t) = \omega_i)$
  or
  $\alpha_j(t+1) = \Big[\sum_{i=1}^{c} \alpha_i(t)\, a_{ij}\Big]\, b_{j\,v(t+1)}, \quad j = 1, 2, \ldots, c$

Initialization: $\omega(1)$ is the known initial state; set $\alpha_i(0) = 1$ if $i = \omega(1)$ and $\alpha_i(0) = 0$ if $i \ne \omega(1)$ (prior state probability).
for (t = 1; t <= T; t++)
    for j = 1 to c do
        $\alpha_j(t) = \Big[\sum_{i=1}^{c} \alpha_i(t-1)\, a_{ij}\Big]\, b_{j\,v(t)}$
$P(V^T) = \alpha_0(T)$

- The complexity of this algorithm is only $O(c^2 T)$!

An example

$$a_{ij} = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0.2 & 0.3 & 0.1 & 0.4 \\ 0.2 & 0.5 & 0.2 & 0.1 \\ 0.8 & 0.1 & 0.0 & 0.1 \end{pmatrix}, \qquad b_{jk} = \begin{pmatrix} 1 & 0 & 0 & 0 & 0 \\ 0 & 0.3 & 0.4 & 0.1 & 0.2 \\ 0 & 0.1 & 0.1 & 0.7 & 0.1 \\ 0 & 0.5 & 0.2 & 0.1 & 0.2 \end{pmatrix}$$

t = 1 (the known initial state is $\omega_1$, so $\alpha_1(0) = 1$; the first observation is $v(1) = v_1$):
$\alpha_0(1) = [\alpha_0(0) P(\omega_0 \mid \omega_0) + \alpha_1(0) P(\omega_0 \mid \omega_1) + \alpha_2(0) P(\omega_0 \mid \omega_2) + \alpha_3(0) P(\omega_0 \mid \omega_3)]\, P(v(1) \mid \omega_0) = 0$
$\alpha_1(1) = [\alpha_0(0) P(\omega_1 \mid \omega_0) + \alpha_1(0) P(\omega_1 \mid \omega_1) + \alpha_2(0) P(\omega_1 \mid \omega_2) + \alpha_3(0) P(\omega_1 \mid \omega_3)]\, P(v(1) \mid \omega_1) = 0.09$
$\alpha_2(1) = [\alpha_0(0) P(\omega_2 \mid \omega_0) + \alpha_1(0) P(\omega_2 \mid \omega_1) + \alpha_2(0) P(\omega_2 \mid \omega_2) + \alpha_3(0) P(\omega_2 \mid \omega_3)]\, P(v(1) \mid \omega_2) = 0.01$

$\alpha_3(1) = [\alpha_0(0) P(\omega_3 \mid \omega_0) + \alpha_1(0) P(\omega_3 \mid \omega_1) + \alpha_2(0) P(\omega_3 \mid \omega_2) + \alpha_3(0) P(\omega_3 \mid \omega_3)]\, P(v(1) \mid \omega_3) = 0.2$
- Proceeding similarly for t = 2, 3, 4, the final answer is $P(V^T) = \alpha_0(T) = 0.0011$ (the sequence ends in the absorbing state $\omega_0$).
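A minimal Python sketch of the forward pass using the transition and emission matrices of the example above. The observation sequence is not written out on the slide; V^4 = (v1, v3, v2, v0) is assumed here, taken from the corresponding Duda-Hart-Stork example that the slide follows, and it reproduces the α values above and the final 0.0011.

```python
import numpy as np

# Transition and emission matrices from the example above (omega_0 is the absorbing state).
a = np.array([[1.0, 0.0, 0.0, 0.0],
              [0.2, 0.3, 0.1, 0.4],
              [0.2, 0.5, 0.2, 0.1],
              [0.8, 0.1, 0.0, 0.1]])
b = np.array([[1.0, 0.0, 0.0, 0.0, 0.0],
              [0.0, 0.3, 0.4, 0.1, 0.2],
              [0.0, 0.1, 0.1, 0.7, 0.1],
              [0.0, 0.5, 0.2, 0.1, 0.2]])

def forward(a, b, obs, init_state):
    """Forward pass: alpha_j(t) = [sum_i alpha_i(t-1) a_ij] * b_{j, v(t)}."""
    alpha = np.zeros(a.shape[0])
    alpha[init_state] = 1.0                   # alpha_i(0): known initial state
    for v in obs:                             # O(c^2) work per time step
        alpha = (alpha @ a) * b[:, v]
    return alpha[0]                           # the sequence ends in absorbing state omega_0

# Assumed observation sequence V^4 = (v1, v3, v2, v0), known initial state omega_1.
print(forward(a, b, obs=[1, 3, 2, 0], init_state=1))   # ~0.0011
```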

The backward algorithm (HMM backward)

Input: $V^T = (v(1), v(2), \ldots, v(T))$
- Let $\beta_i(t)$ represent the probability that, being in hidden state $\omega_i$ at step $t$, the HMM will generate the remainder of the target sequence, i.e., $v(t+1), \ldots, v(T)$:
  $\beta_i(t) = P(v(t+1), v(t+2), \ldots, v(T) \mid \omega(t) = \omega_i)$
- We can compute $\beta_j(t)$, $j = 1, 2, \ldots, c$, as follows:
  $\beta_j(t) = \sum_{i=1}^{c} P(v(t+2), \ldots, v(T) \mid \omega(t+1) = \omega_i)\, P(v(t+1) \mid \omega(t+1) = \omega_i)\, P(\omega(t+1) = \omega_i \mid \omega(t) = \omega_j)$
  or
  $\beta_j(t) = \sum_{i=1}^{c} \beta_i(t+1)\, b_{i\,v(t+1)}\, a_{ji}, \quad j = 1, 2, \ldots, c$

Initialization: $\omega(T)$ is the known final state; set $\beta_i(T) = 1$ if $i = \omega(T)$ and $\beta_i(T) = 0$ if $i \ne \omega(T)$.
for (t = T-1; t >= 0; t--)
    for j = 1 to c do
        $\beta_j(t) = \sum_{i=1}^{c} \beta_i(t+1)\, a_{ji}\, b_{i\,v(t+1)}$
$P(V^T) = \beta_i(0)$, where $\omega_i$ is the known initial state.
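A matching sketch of the backward pass under the same conventions (known initial state, absorbing final state ω_0). Applied to the example matrices and the assumed sequence above, it returns the same ≈0.0011 as the forward pass.

```python
import numpy as np

def backward(a, b, obs, init_state, final_state=0):
    """Backward pass: beta_j(t) = sum_i beta_i(t+1) a_ji b_{i, v(t+1)}."""
    beta = np.zeros(a.shape[0])
    beta[final_state] = 1.0                   # beta_i(T): known (absorbing) final state
    for v in reversed(obs):                   # v runs over v(T), v(T-1), ..., v(1)
        beta = a @ (b[:, v] * beta)
    return beta[init_state]                   # P(V^T) = beta at t = 0 in the initial state
```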

Decoding
- We need to adopt an optimality criterion to solve this problem (there are several ways of solving it, since there are various optimality criteria we could use).

Algorithm 1: choose the states $\omega(t)$ that are individually most likely (i.e., maximize the expected number of correct individual states).
* If we define $\gamma_i(t) = P(\omega(t) = \omega_i \mid V^T)$, then:
  $\gamma_i(t) = \dfrac{P(\omega(t) = \omega_i,\, V^T)}{P(V^T)} = \dfrac{\alpha_i(t)\,\beta_i(t)}{P(V^T)}$
* Using $\gamma_i(t)$, the individually most likely state $\omega(t)$ at time $t$ is:
  $\omega(t) = \arg\max_i [\gamma_i(t)], \quad 1 \le t \le T$
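A compact sketch of Algorithm 1 (posterior decoding), again under the known-initial-state / absorbing-final-state conventions used in the example; it recomputes α and β internally so it can stand alone.

```python
import numpy as np

def posterior_decode(a, b, obs, init_state, final_state=0):
    """Pick the individually most likely state at each step via gamma_i(t)."""
    c, T = a.shape[0], len(obs)
    alpha = np.zeros((T + 1, c)); alpha[0, init_state] = 1.0
    for t in range(1, T + 1):                              # forward lattice
        alpha[t] = (alpha[t - 1] @ a) * b[:, obs[t - 1]]
    beta = np.zeros((T + 1, c)); beta[T, final_state] = 1.0
    for t in range(T - 1, -1, -1):                         # backward lattice
        beta[t] = a @ (b[:, obs[t]] * beta[t + 1])
    gamma = alpha * beta                                   # proportional to gamma_i(t)
    gamma /= gamma.sum(axis=1, keepdims=True)              # divide by P(V^T)
    return gamma[1:].argmax(axis=1)                        # most likely state for t = 1..T

# e.g. posterior_decode(a, b, [1, 3, 2, 0], init_state=1) with the example matrices above.
```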

Algorithm 2 (easy): at each time step t, find the state that has the highest probability of having come from the previous step and generated the observed visible state v(t) -- this uses the forward algorithm with minor changes (a sketch is given after the caveats below).

Initialization: $\omega(1)$ is the known initial state; set $\alpha_i(0) = 1$ if $i = \omega(1)$ and $\alpha_i(0) = 0$ if $i \ne \omega(1)$.
Path = empty
for (t = 1; t <= T; t++) {
    for j = 1 to c do
        $\alpha_j(t) = \Big[\sum_{i=1}^{c} \alpha_i(t-1)\, a_{ij}\Big]\, b_{j\,v(t)}$
    $j' = \arg\max_j \alpha_j(t)$
    Append $\omega_{j'}$ to Path
}
return Path

- There is no guarantee that the returned path is a valid one (it is a local optimization).
- The path might imply a transition that is not allowed by the model.
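A sketch of Algorithm 2 for completeness: it propagates the forward variable but records only the locally best state at each step, so the caveats above apply (the returned path can contain transitions with $a_{ij} = 0$).

```python
import numpy as np

def greedy_decode(a, b, obs, init_state):
    """Algorithm 2: record the locally most probable state at each time step."""
    alpha = np.zeros(a.shape[0])
    alpha[init_state] = 1.0
    path = []
    for v in obs:
        alpha = (alpha @ a) * b[:, v]          # same update as the forward algorithm
        path.append(int(alpha.argmax()))       # j' = argmax_j alpha_j(t)
    return path
```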

Algorithm 3 (Viterbi algorithm -- the most widely used): find the single best state sequence, i.e., maximize $P(\omega^T \mid V^T)$.
* This is equivalent to maximizing $P(\omega^T, V^T)$, since:
  $P(\omega^T \mid V^T) = \dfrac{P(\omega^T, V^T)}{P(V^T)}$
* We will compute the probability $P(\omega^T, V^T)$ recursively.
* Let us define $\delta_i(t)$ as the highest probability along a single path ending at $\omega_i$ at time $t$:
  $\delta_i(t) = \max_{\omega(1), \omega(2), \ldots, \omega(t-1)} P(\omega(1), \omega(2), \ldots, \omega(t-1), \omega(t) = \omega_i, v(1), v(2), \ldots, v(t))$
* Using induction we have:
  $\delta_j(t) = \big[\max_i \delta_i(t-1)\, a_{ij}\big]\, b_{j\,v(t)}, \quad 1 \le j \le c$
* To retrieve the best state sequence we need to keep track of the argument $i$ that maximizes the above equation:
  $\psi_j(t) = \arg\max_i [\delta_i(t-1)\, a_{ij}], \quad 1 \le j \le c$

Step 1: Initialization
  $\delta_i(1) = P(\omega(1) = \omega_i)\, b_{i\,v(1)}, \quad \psi_i(1) = 0, \quad 1 \le i \le c$
Step 2: Recursion
  $\delta_j(t) = \max_i [\delta_i(t-1)\, a_{ij}]\, b_{j\,v(t)}, \quad \psi_j(t) = \arg\max_i [\delta_i(t-1)\, a_{ij}], \quad 2 \le t \le T, \; 1 \le j \le c$
Step 3: Termination
  $P^* = \max_i [\delta_i(T)], \quad \omega^*(T) = \arg\max_i [\delta_i(T)]$
Step 4: Path backtracking
  $\omega^*(t) = \psi_{\omega^*(t+1)}(t+1), \quad t = T-1, T-2, \ldots, 1$
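A self-contained Viterbi sketch in Python following the four steps above; `prior` stands for P(ω(1)) (a one-hot vector when the initial state is known), and states and symbols are 0-indexed.

```python
import numpy as np

def viterbi(a, b, obs, prior):
    """Viterbi decoding: the single best hidden-state sequence and its probability."""
    c, T = a.shape[0], len(obs)
    delta = prior * b[:, obs[0]]                    # Step 1: delta_i(1) = P(omega_i) b_{i,v(1)}
    psi = np.zeros((T, c), dtype=int)
    for t in range(1, T):                           # Step 2: recursion
        scores = delta[:, None] * a                 # scores[i, j] = delta_i(t-1) a_ij
        psi[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) * b[:, obs[t]]
    best_prob = delta.max()                         # Step 3: termination
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):                   # Step 4: path backtracking
        path.append(int(psi[t, path[-1]]))
    return best_prob, path[::-1]
```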

Rate invariance
- Let's consider the problem of gesture recognition:
  * The duration of the same gesture can vary from person to person.
  * The duration of the same gesture can vary for the same person.
- HMMs address this issue:
  * The transition probabilities incorporate the probabilistic structure of the durations.
  * Post-processing can be used to delete repeated states, e.g., $(\omega_1, \omega_1, \omega_3, \omega_2, \omega_2, \omega_2)$ can be converted to $(\omega_1, \omega_3, \omega_2)$.

Learning
- Determine the parameters $a_{ij}$ and $b_{jk}$ from a set of training examples (i.e., maximize the probability of the observation sequences).
- There is no known way to solve for a maximum-likelihood model analytically.

The forward-backward (Baum-Welch) algorithm
* Let us define $\xi_{ij}(t) = P(\omega(t) = \omega_i, \omega(t+1) = \omega_j \mid V^T)$; then:
  $\xi_{ij}(t) = \dfrac{\alpha_i(t)\, a_{ij}\, b_{j\,v(t+1)}\, \beta_j(t+1)}{P(V^T)}$
* We can write $\gamma_i(t)$ as follows:
  $\gamma_i(t) = \sum_{j=1}^{c} \xi_{ij}(t)$
* The expected number of times that $\omega_i$ is visited:
  $\sum_{t=1}^{T} \gamma_i(t)$
* The expected number of transitions made from $\omega_i$:
  $\sum_{t=1}^{T-1} \gamma_i(t)$

* The expected number of transitions from $\omega_i$ to $\omega_j$:
  $\sum_{t=1}^{T-1} \xi_{ij}(t)$
* We can re-estimate $a_{ij}$ as the ratio of the expected number of transitions from $\omega_i$ to $\omega_j$ to the expected number of transitions out of state $\omega_i$:
  $\hat{a}_{ij} = \sum_{t=1}^{T-1} \xi_{ij}(t) \Big/ \sum_{t=1}^{T-1} \gamma_i(t)$
* We can re-estimate $b_{jk}$ as the ratio of the expected number of times of being in state $\omega_j$ and observing $v_k$ to the expected number of times of being in state $\omega_j$:
  $\hat{b}_{jk} = \sum_{t=1,\, v(t)=v_k}^{T} \gamma_j(t) \Big/ \sum_{t=1}^{T} \gamma_j(t)$

Difficulties with using HMMs
- How do we decide on the number of states of the model?
- What about the size of the observation sequence?
  * It should be sufficiently long to guarantee that all state transitions appear a sufficient number of times.
  * A large amount of training data is necessary to learn the HMM parameters.
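A sketch of one Baum-Welch re-estimation pass over a single observation sequence. Unlike the worked forward example, it assumes a prior distribution over the first state rather than a known initial/absorbing state, and it omits the scaling and smoothing a practical implementation would need for long sequences or unvisited states.

```python
import numpy as np

def baum_welch_step(a, b, obs, prior):
    """One re-estimation of a_ij and b_jk from a single sequence (no scaling/smoothing)."""
    c, T = a.shape[0], len(obs)
    alpha = np.zeros((T, c)); beta = np.ones((T, c))
    alpha[0] = prior * b[:, obs[0]]
    for t in range(1, T):                                      # forward lattice
        alpha[t] = (alpha[t - 1] @ a) * b[:, obs[t]]
    for t in range(T - 2, -1, -1):                             # backward lattice
        beta[t] = a @ (b[:, obs[t + 1]] * beta[t + 1])
    p_obs = alpha[-1].sum()                                    # P(V^T)
    # xi_ij(t) = alpha_i(t) a_ij b_{j, v(t+1)} beta_j(t+1) / P(V^T)
    xi = np.array([alpha[t][:, None] * a * b[:, obs[t + 1]] * beta[t + 1] / p_obs
                   for t in range(T - 1)])
    gamma = alpha * beta / p_obs                               # gamma_i(t)
    a_new = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]   # expected transitions / exits
    b_new = np.zeros_like(b)
    obs = np.asarray(obs)
    for k in range(b.shape[1]):                                # expected emissions of symbol v_k
        b_new[:, k] = gamma[obs == k].sum(axis=0) / gamma.sum(axis=0)
    return a_new, b_new
```

In practice this step is iterated until the likelihood P(V^T) stops improving, and the counts are accumulated over many training sequences rather than a single one.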