Supervised Learning Hidden Markov Models. Some of these slides were inspired by the tutorials of Andrew Moore


A Markov System

Has N states, called s_1, s_2, ..., s_N.
There are discrete timesteps, t=0, t=1, ...
On the t-th timestep the system is in exactly one of the available states. Call it q_t. (Note: q_t ∈ {s_1, s_2, ..., s_N}.)
Between each timestep, the next state is chosen randomly.
The current state determines the probability distribution for the next state.
Often notated with arcs between states.

Example (N=3, current state at t=0):
  P(q_{t+1}=s_1 | q_t=s_1) = 0    P(q_{t+1}=s_1 | q_t=s_2) = 1/2    P(q_{t+1}=s_1 | q_t=s_3) = 1/3
  P(q_{t+1}=s_2 | q_t=s_1) = 0    P(q_{t+1}=s_2 | q_t=s_2) = 1/2    P(q_{t+1}=s_2 | q_t=s_3) = 2/3
  P(q_{t+1}=s_3 | q_t=s_1) = 1    P(q_{t+1}=s_3 | q_t=s_2) = 0      P(q_{t+1}=s_3 | q_t=s_3) = 0
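The dynamics above are easy to simulate. Here is a minimal sketch in Python, using the example transition matrix from this slide (0-based indices 0, 1, 2 stand for s_1, s_2, s_3):

```python
import random

# A[i][j] = P(q_{t+1} = s_{j+1} | q_t = s_{i+1}), from the 3-state example.
A = [
    [0.0, 0.0, 1.0],   # from s_1: always move to s_3
    [0.5, 0.5, 0.0],   # from s_2: s_1 or s_2, each with probability 1/2
    [1/3, 2/3, 0.0],   # from s_3: s_1 with 1/3, s_2 with 2/3
]

def simulate(start, steps, rng=random.Random(0)):
    """Return the state sequence q_0, q_1, ..., q_steps as 0-based indices."""
    path = [start]
    for _ in range(steps):
        # The next state depends only on the current one (Markov property).
        path.append(rng.choices(range(3), weights=A[path[-1]])[0])
    return path

print(simulate(start=0, steps=10))
```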

Markov Property

q_{t+1} is conditionally independent of {q_{t-1}, q_{t-2}, ..., q_1, q_0} given q_t. In other words:
  P(q_{t+1} = s_j | q_t = s_i) = P(q_{t+1} = s_j | q_t = s_i, any earlier history)
The transition probabilities can be collected into an N×N matrix, with a_ij = P(q_{t+1} = s_j | q_t = s_i):

  state i   P(q_{t+1}=s_1 | q_t=s_i)  ...  P(q_{t+1}=s_N | q_t=s_i)
  1         a_11                      ...  a_1N
  2         a_21                      ...  a_2N
  3         a_31                      ...  a_3N
  ...
  N         a_N1                      ...  a_NN

Question: what would be the best Bayes Net structure to represent the joint distribution of (q_0, q_1, ..., q_3, q_4)?

A Blind Robot

A human and a robot wander around randomly on a grid.
STATE q = {Location of Robot, Location of Human}. Number of states = 18*18 = 324.

Dynamics of System

Each timestep the human moves randomly to an adjacent cell, and the robot also moves randomly to an adjacent cell.
Typical questions:
  What's the expected time until the human is crushed like a bug?
  What's the probability that the robot will hit the left wall before it hits the human?
  What's the probability the robot crushes the human on the next timestep?

What is P(q_t = s)? Slow Answer

Step 1: Work out how to compute P(Q) for any path Q = q_1 q_2 q_3 ... q_t.
Given we know the start state q_1 (i.e. P(q_1) = 1):
  P(q_1, q_2, ..., q_t) = P(q_1, q_2, ..., q_{t-1}) P(q_t | q_1, q_2, ..., q_{t-1})
                        = P(q_1, q_2, ..., q_{t-1}) P(q_t | q_{t-1})
                        = P(q_2 | q_1) P(q_3 | q_2) ... P(q_t | q_{t-1})
Step 2: Use this knowledge to get P(q_t = s):
  P(q_t = s) = Σ_{Q ∈ paths of length t that end in s} P(Q)
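The Step 1/Step 2 recipe above can be written down directly. A sketch of this slow answer in Python, reusing the earlier 3-state transition matrix; note the cost is O(N^t) because every path is enumerated:

```python
from itertools import product

# A[i][j] = P(next = s_{j+1} | current = s_{i+1}); the 3-state example matrix.
A = [[0.0, 0.0, 1.0],
     [0.5, 0.5, 0.0],
     [1/3, 2/3, 0.0]]

def p_state_slow(s, t, start=0):
    """P(q_t = s) given q_0 = start, by summing P(Q) over every length-t path."""
    total = 0.0
    for tail in product(range(3), repeat=t):
        path = (start,) + tail
        if path[-1] != s:
            continue                      # only paths that end in s count
        p = 1.0
        for a, b in zip(path, path[1:]):
            p *= A[a][b]                  # P(Q) = product of transition probs
        total += p
    return total

print([p_state_slow(s, t=4) for s in range(3)])
```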

What is P(q_t = s)? Clever Answer

For each state s_i, define
  P_t(i) = probability that the state is s_i at time t = P(q_t = s_i).
Easy to do inductive definition:
  ∀i: P_0(i) = 1 if s_i is the start state, 0 otherwise
  ∀j: P_{t+1}(j) = P(q_{t+1} = s_j)
                 = Σ_{i=1}^{N} P(q_{t+1} = s_j ∧ q_t = s_i)
                 = Σ_{i=1}^{N} P(q_{t+1} = s_j | q_t = s_i) P(q_t = s_i)
                 = Σ_{i=1}^{N} a_ij P_t(i)
Computation is simple: just fill in this table, one row at a time, for t = 0, 1, ..., t_final:

  t         P_t(1)  P_t(2)  ...  P_t(N)
  0         1       0       ...  0       (if s_1 is the start state)
  1
  ...
  t_final
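The inductive definition translates into a short table-filling loop; each update costs O(N^2) instead of enumerating paths. A sketch, again with the example matrix:

```python
# A[i][j] = a_ij = P(q_{t+1} = s_{j+1} | q_t = s_{i+1}); 3-state example.
A = [[0.0, 0.0, 1.0],
     [0.5, 0.5, 0.0],
     [1/3, 2/3, 0.0]]

def p_state_fast(t, start=0):
    """Return the whole row [P_t(0), ..., P_t(N-1)] of the table."""
    N = len(A)
    P = [1.0 if i == start else 0.0 for i in range(N)]          # P_0
    for _ in range(t):
        # P_{t+1}(j) = sum_i a_ij P_t(i)
        P = [sum(A[i][j] * P[i] for i in range(N)) for j in range(N)]
    return P

print(p_state_fast(t=4))
```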

Hidden State

The previous example tried to estimate P(q_t = s_i) unconditionally (using no observed evidence). Suppose we can observe something that's affected by the true state.
Example: proximity sensors, which tell us the contents of the 8 adjacent squares (W denotes walls). The true state is q_t; what the robot sees is the observation O_t.

Noisy Hidden State

Example: noisy proximity sensors, which unreliably tell us the contents of the 8 adjacent squares (W denotes walls). The true state is q_t; the sensors should report the uncorrupted observation; what the robot actually sees is the observation O_t.
O_t is noisily determined depending on the current state. Assume that O_t is conditionally independent of {q_{t-1}, q_{t-2}, ..., q_1, q_0, O_{t-1}, O_{t-2}, ..., O_1, O_0} given q_t:
  P(O_t = X | q_t = s_i) = P(O_t = X | q_t = s_i, any earlier history)
Question: what would be the best Bayes Net structure to represent the joint distribution of (q_0, q_1, q_2, q_3, q_4, O_0, O_1, O_2, O_3, O_4)?

Noisy Hidden State

The observation probabilities can likewise be collected into an N×M table, with b_i(k) = P(O_t = k | q_t = s_i):

  state i   P(O_t = 1 | q_t = s_i)  ...  P(O_t = M | q_t = s_i)
  1         b_1(1)                  ...  b_1(M)
  2         b_2(1)                  ...  b_2(M)
  3         b_3(1)                  ...  b_3(M)
  ...
  N         b_N(1)                  ...  b_N(M)

HMM Formal Definition

An HMM, λ, is a 5-tuple consisting of:
  N: the number of states
  M: the number of possible observations
  {π_1, π_2, ..., π_N}: the starting state probabilities, P(q_0 = s_i) = π_i
  {a_ij}: the state transition probabilities, a_ij = P(q_{t+1} = s_j | q_t = s_i)
  {b_i(k)}: the observation probabilities, b_i(k) = P(O_t = k | q_t = s_i)

Probability of a Series of Observations

(The example model here is a 3-state HMM, with states S1, S2, S3 and observation symbols X, Y, Z.)

What is P(O) = P(O_1 O_2 O_3) = P(O_1 = X ∧ O_2 = X ∧ O_3 = Z)?

Slow, stupid way:
  P(O) = Σ_{Q ∈ paths of length 3} P(O ∧ Q)
       = Σ_{Q ∈ paths of length 3} P(O | Q) P(Q)

How do we compute P(Q) for an arbitrary path Q?
  P(Q) = P(q_1, q_2, q_3)
       = P(q_1) P(q_2, q_3 | q_1)               (chain rule)
       = P(q_1) P(q_2 | q_1) P(q_3 | q_2, q_1)  (chain rule)
       = P(q_1) P(q_2 | q_1) P(q_3 | q_2)       (Markov independence)
  Example in the case Q = S1 S3 S3: P(Q) = 1/2 * 2/3 * 1/3 = 1/9.

How do we compute P(O | Q) for an arbitrary path Q?
  P(O | Q) = P(O_1, O_2, O_3 | q_1, q_2, q_3) = P(O_1 | q_1) P(O_2 | q_2) P(O_3 | q_3)
  Example in the case Q = S1 S3 S3: P(O | Q) = P(X | S1) P(X | S3) P(Z | S3) = 1/2 * 1/2 * 1/2 = 1/8.

For this 3-state example, P(O) would need 27 P(Q) computations and 27 P(O | Q) computations. A sequence of 20 observations would need 3^20 = 3.5 billion P(Q) computations and 3.5 billion P(O | Q) computations.
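The slow, stupid way can be sketched as code too. The model below (π, A, b over three symbols X, Y, Z) is a hypothetical stand-in, not the exact model from the slide's figure; the point is only the shape of the computation, which enumerates all 27 length-3 paths:

```python
from itertools import product

# Hypothetical 3-state HMM: pi[i] start probs, A[i][j] transitions,
# b[i][k] = P(observe symbol k | state i), with symbols X=0, Y=1, Z=2.
pi = [0.5, 0.3, 0.2]
A  = [[0.2, 0.3, 0.5], [0.4, 0.4, 0.2], [0.3, 0.3, 0.4]]
b  = [[0.6, 0.2, 0.2], [0.1, 0.7, 0.2], [0.25, 0.25, 0.5]]

def p_obs_slow(obs):
    """P(O) = sum over all state paths Q of P(O | Q) P(Q).  Cost: O(N^T)."""
    N, T = len(A), len(obs)
    total = 0.0
    for Q in product(range(N), repeat=T):
        p = pi[Q[0]] * b[Q[0]][obs[0]]             # P(q_1) P(O_1 | q_1)
        for t in range(1, T):
            p *= A[Q[t-1]][Q[t]] * b[Q[t]][obs[t]]
        total += p
    return total

X, Y, Z = 0, 1, 2
print(p_obs_slow([X, X, Z]))    # sums over all 27 paths
```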

Non-Exponential Cost Style

Given observations O_1 O_2 ... O_T, define
  α_t(i) = P(O_1 O_2 ... O_t ∧ q_t = s_i)   where 1 ≤ t ≤ T
α_t(i) = probability that, in a random trial:
  we'd have seen the first t observations, and
  we'd have ended up in s_i as the t-th state visited.
In our example, what is α_2(3)?

α_t(i): easy to define recursively

  α_1(j) = P(O_1 ∧ q_1 = s_j) = P(q_1 = s_j) P(O_1 | q_1 = s_j) = π_j b_j(O_1)

  α_{t+1}(j) = P(O_1 O_2 ... O_t O_{t+1} ∧ q_{t+1} = s_j)
             = Σ_{i=1}^{N} P(O_1 O_2 ... O_t ∧ q_t = s_i ∧ O_{t+1} ∧ q_{t+1} = s_j)
             = Σ_{i=1}^{N} P(O_{t+1}, q_{t+1} = s_j | O_1 O_2 ... O_t ∧ q_t = s_i) P(O_1 O_2 ... O_t ∧ q_t = s_i)
             = Σ_{i=1}^{N} P(O_{t+1}, q_{t+1} = s_j | q_t = s_i) α_t(i)
             = Σ_{i=1}^{N} P(q_{t+1} = s_j | q_t = s_i) P(O_{t+1} | q_{t+1} = s_j) α_t(i)
             = Σ_{i=1}^{N} a_ij b_j(O_{t+1}) α_t(i)
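This recursion is the forward algorithm; filling the α table costs O(T N^2) instead of O(N^T). A sketch, using the same kind of hypothetical (π, A, b) model as before:

```python
# Hypothetical 3-state HMM over 3 observation symbols (not the slide's model).
pi = [0.5, 0.3, 0.2]
A  = [[0.2, 0.3, 0.5], [0.4, 0.4, 0.2], [0.3, 0.3, 0.4]]
b  = [[0.6, 0.2, 0.2], [0.1, 0.7, 0.2], [0.25, 0.25, 0.5]]

def forward(obs):
    """alpha[t-1][j] holds alpha_t(j) = P(O_1..O_t and q_t = s_j)."""
    N = len(A)
    alpha = [[pi[j] * b[j][obs[0]] for j in range(N)]]   # alpha_1(j) = pi_j b_j(O_1)
    for t in range(1, len(obs)):
        # alpha_{t+1}(j) = b_j(O_{t+1}) * sum_i a_ij alpha_t(i)
        alpha.append([b[j][obs[t]] * sum(A[i][j] * alpha[-1][i] for i in range(N))
                      for j in range(N)])
    return alpha

obs = [0, 1, 2]
alpha = forward(obs)
print(sum(alpha[-1]))    # P(O_1 O_2 O_3) = sum_i alpha_T(i)
```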

Easy Questions

We can cheaply compute α_t(i) = P(O_1 O_2 ... O_t ∧ q_t = s_i).
How can we cheaply compute P(O_1 O_2 ... O_t)?
  P(O_1 O_2 ... O_t) = Σ_{i=1}^{N} α_t(i)
How can we cheaply compute P(q_t = s_i | O_1 O_2 ... O_t)?
  P(q_t = s_i | O_1 O_2 ... O_t) = α_t(i) / Σ_{j=1}^{N} α_t(j)

Most Probable Path Given Observations

What is the most probable path given O_1 O_2 ... O_T? That is, what is
  argmax_Q P(Q | O_1 O_2 ... O_T)?
Slow answer:
  argmax_Q P(Q | O_1 O_2 ... O_T)
  = argmax_Q P(O_1 O_2 ... O_T | Q) P(Q) / P(O_1 O_2 ... O_T)
  = argmax_Q P(O_1 O_2 ... O_T | Q) P(Q)

Efficient MPP Computation

We are going to compute the following variables:
  δ_t(i) = max_{q_1 q_2 ... q_{t-1}} P(q_1 q_2 ... q_{t-1} ∧ q_t = s_i ∧ O_1 ... O_t)
= the probability of the path of length t-1 with the maximum chance of doing all these things:
  occurring, and
  ending up in state s_i, and
  producing output O_1 ... O_t.
Define: mpp_t(i) = that path, so δ_t(i) = Prob(mpp_t(i)).

The Viterbi Algorithm

  δ_t(i)   = max_{q_1 q_2 ... q_{t-1}} P(q_1 q_2 ... q_{t-1} ∧ q_t = s_i ∧ O_1 ... O_t)
  mpp_t(i) = argmax_{q_1 q_2 ... q_{t-1}} P(q_1 q_2 ... q_{t-1} ∧ q_t = s_i ∧ O_1 ... O_t)
  δ_1(i)   = P(q_1 = s_i ∧ O_1) = P(q_1 = s_i) P(O_1 | q_1 = s_i) = π_i b_i(O_1)   (one choice)
Now, suppose we have all the δ_t(i)'s and mpp_t(i)'s for all i. How do we get δ_{t+1}(j) and mpp_{t+1}(j)?

The Viterbi Algorithm: Time t → Time t+1

The most probable path whose last two states are s_i s_j is the most probable path to s_i, followed by the transition s_i → s_j.
What is the probability of that path?
  δ_t(i) × P(s_i → s_j ∧ O_{t+1}) = δ_t(i) a_ij b_j(O_{t+1})
So the most probable path to s_j has s_{i*} as its penultimate state, where
  i* = argmax_i δ_t(i) a_ij b_j(O_{t+1})
Summary:
  δ_{t+1}(j) = δ_t(i*) a_{i*j} b_j(O_{t+1})
  mpp_{t+1}(j) = mpp_t(i*) s_j
with i* defined above.
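The δ/mpp recursion becomes a short dynamic program with backpointers. A sketch, again over a hypothetical (π, A, b) model rather than the slides' example:

```python
# Hypothetical 3-state HMM over 3 observation symbols.
pi = [0.5, 0.3, 0.2]
A  = [[0.2, 0.3, 0.5], [0.4, 0.4, 0.2], [0.3, 0.3, 0.4]]
b  = [[0.6, 0.2, 0.2], [0.1, 0.7, 0.2], [0.25, 0.25, 0.5]]

def viterbi(obs):
    """Return (most probable state path, its joint probability with obs)."""
    N = len(A)
    delta = [pi[i] * b[i][obs[0]] for i in range(N)]     # delta_1(i)
    back = []                                            # backpointers i* per step
    for t in range(1, len(obs)):
        ptr, new = [], []
        for j in range(N):
            i_star = max(range(N), key=lambda i: delta[i] * A[i][j])
            ptr.append(i_star)
            new.append(delta[i_star] * A[i_star][j] * b[j][obs[t]])
        delta, back = new, back + [ptr]
    # Pick the best final state, then trace backwards through the pointers.
    j = max(range(N), key=lambda i: delta[i])
    path = [j]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    path.reverse()
    return path, delta[j]

print(viterbi([0, 1, 2]))
```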

What is the Viterbi Algorithm Used For?

Speech recognition: signal → words. The HMM observable is the signal; the hidden state is part of word formation. What is the most probable word given this signal?

HMMs are Used and Useful

But how do you design an HMM? Occasionally (e.g. in our robot example) it is reasonable to deduce the HMM from first principles. But usually, especially in speech or genetics, it is better to infer it from large amounts of data: observations O_1 O_2 ... O_T, with a big T.

Inferring an HMM

We were computing probabilities based on some parameters: P(O_1 O_2 ... O_T | λ). Here λ is the notation for our HMM parameters: λ = ({a_ij}, {b_i(k)}, {π_i}).
We could use:
  MAX LIKELIHOOD: λ = argmax_λ P(O_1 ... O_T | λ)
  BAYES: work out P(λ | O_1 ... O_T), and then take E[λ] or max_λ P(λ | O_1 ... O_T)

Max Likelihood HMM Estimation

Define
  γ_t(i)   = P(q_t = s_i | O_1 O_2 ... O_T, λ)
  ε_t(i,j) = P(q_t = s_i ∧ q_{t+1} = s_j | O_1 O_2 ... O_T, λ)
γ_t(i) and ε_t(i,j) can be computed efficiently for all i, j, t.
  Σ_{t=1}^{T-1} γ_t(i)   = expected number of transitions out of state i during the path
  Σ_{t=1}^{T-1} ε_t(i,j) = expected number of transitions from state i to state j during the path
So we can re-estimate
  a_ij ← Σ_{t=1}^{T-1} ε_t(i,j) / Σ_{t=1}^{T-1} γ_t(i)
       = (expected frequency i → j) / (expected frequency of i)
       = estimate of P(next state s_j | this state s_i)

Expectation Maximization for HMMs

If we knew λ, we could estimate EXPECTATIONS of quantities such as:
  expected number of times in state i
  expected number of transitions i → j
If we knew quantities such as the expected number of times in state i and the expected number of transitions i → j, we could compute the MAX LIKELIHOOD estimate of λ = ({a_ij}, {b_i(k)}, {π_i}).

Expectation Maximization for HMMs

1. Get your observations O_1 ... O_T.
2. Guess your first λ estimate λ(0); set k = 0.
3. Given O_1 ... O_T and λ(k), compute γ_t(i) and ε_t(i,j) for all 1 ≤ t ≤ T, 1 ≤ i, j ≤ N.
4. Compute the expected frequency of state i and the expected frequency i → j.
5. Compute new estimates of a_ij, b_j(k), π_i accordingly. Call them λ(k+1).
6. Set k = k+1 and go to 3, unless converged.
Also known (for the HMM case) as the BAUM-WELCH algorithm.
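The loop above can be sketched compactly. This is a minimal single-sequence Baum-Welch iteration under an assumed toy (π, A, b) model; a real implementation would work in log space or rescale the α/β tables to avoid underflow on long sequences:

```python
def forward(obs, pi, A, b):
    """al[t][j] = alpha_{t+1}(j) = P(O_1..O_{t+1} and q = s_j) (0-based t)."""
    N = len(A)
    al = [[pi[j] * b[j][obs[0]] for j in range(N)]]
    for t in range(1, len(obs)):
        al.append([b[j][obs[t]] * sum(A[i][j] * al[-1][i] for i in range(N))
                   for j in range(N)])
    return al

def backward(obs, A, b):
    """be[t][i] = P(remaining observations after step t | q = s_i)."""
    N, T = len(A), len(obs)
    be = [[1.0] * N for _ in range(T)]
    for t in range(T - 2, -1, -1):
        for i in range(N):
            be[t][i] = sum(A[i][j] * b[j][obs[t + 1]] * be[t + 1][j] for j in range(N))
    return be

def baum_welch_step(obs, pi, A, b):
    """One EM iteration: E-step computes gamma/epsilon, M-step re-estimates (pi, A, b)."""
    N, M, T = len(A), len(b[0]), len(obs)
    al, be = forward(obs, pi, A, b), backward(obs, A, b)
    pO = sum(al[-1])                                     # P(O | lambda)
    gam = [[al[t][i] * be[t][i] / pO for i in range(N)] for t in range(T)]
    eps = [[[al[t][i] * A[i][j] * b[j][obs[t + 1]] * be[t + 1][j] / pO
             for j in range(N)] for i in range(N)] for t in range(T - 1)]
    pi2 = gam[0][:]
    A2 = [[sum(eps[t][i][j] for t in range(T - 1)) /
           sum(gam[t][i] for t in range(T - 1)) for j in range(N)] for i in range(N)]
    b2 = [[sum(gam[t][i] for t in range(T) if obs[t] == k) /
           sum(gam[t][i] for t in range(T)) for k in range(M)] for i in range(N)]
    return pi2, A2, b2

# Hypothetical toy model; every entry positive so nothing divides by zero.
pi = [0.5, 0.3, 0.2]
A  = [[0.2, 0.3, 0.5], [0.4, 0.4, 0.2], [0.3, 0.3, 0.4]]
b  = [[0.6, 0.2, 0.2], [0.1, 0.7, 0.2], [0.25, 0.25, 0.5]]
obs = [0, 1, 2, 0, 0, 2]
for _ in range(5):
    pi, A, b = baum_welch_step(obs, pi, A, b)
    print(sum(forward(obs, pi, A, b)[-1]))   # likelihood; EM never decreases it
```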

Summary

HMM is a dynamic Bayesian network with hidden states.
Efficient computation of the probability of a series of observations.
The Viterbi algorithm.
Outline of the EM algorithm.