Hidden Markov Models. Terminology and Basic Algorithms


The next two weeks. Hidden Markov models (HMMs): Wed 9/11: Terminology and basic algorithms. Mon 14/11: Implementing the basic algorithms. Wed 16/11: Selecting model parameters and training; introduction to the hand-in project. Mon 21/11: Extensions and applications. We use Chapter 13 from Bishop's book Pattern Recognition and Machine Learning. Rabiner's paper A Tutorial on Hidden Markov Models [...] might also be useful to read. Blackboard and http://cs.au.dk/~cstorm/courses/ml_e16

What is machine learning? Machine learning means different things to different people, and there is no generally agreed-upon core set of algorithms that must be learned. For me, the core of machine learning is: Building a mathematical model that captures some desired structure of the data that you are working on. Training the model (i.e. setting the parameters of the model) based on existing data to optimize it as well as we can. Making predictions by using the model on new data.

Motivation. We make predictions based on models of observed data (machine learning). A simple model is to assume that observations are independent and identically distributed... but this assumption is not always the best, e.g. for (1) measurements of weather patterns, (2) daily values of stocks, or (3) the composition of DNA or proteins.

Markov Models. If the n'th observation in a chain of observations is influenced only by the (n-1)'th observation, i.e.

p(xn | x1,...,xn-1) = p(xn | xn-1),

then the chain of observations is a 1st-order Markov chain, and the joint probability of a sequence of N observations is

p(x1,...,xN) = p(x1) p(x2 | x1) ... p(xN | xN-1).

If the distributions p(xn | xn-1) are the same for all n, then the chain of observations is a homogeneous 1st-order Markov chain. (The slide shows the model, i.e. p(xn | xn-1), together with a sequence of observations.)
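As a small illustration (not from the slides), the joint probability under a homogeneous 1st-order Markov chain can be computed directly from an initial distribution and a transition table; the parameter values below are made up.

# Sketch: joint probability of a sequence under a homogeneous 1st-order
# Markov chain (hypothetical two-state example; all numbers are made up).
p_init = [0.5, 0.5]                 # p(x1 = state 0), p(x1 = state 1)
p_trans = [[0.9, 0.1],              # p_trans[i][j] = p(x_n = j | x_{n-1} = i)
           [0.2, 0.8]]

def markov_chain_probability(xs):
    # p(x1,...,xN) = p(x1) * prod_{n=2..N} p(x_n | x_{n-1})
    prob = p_init[xs[0]]
    for prev, cur in zip(xs, xs[1:]):
        prob *= p_trans[prev][cur]
    return prob

print(markov_chain_probability([0, 0, 1, 1]))   # 0.5 * 0.9 * 0.1 * 0.8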

Hidden Markov Models. What if the n'th observation in a chain of observations is influenced by a corresponding hidden variable? (The slide shows a Markov model over latent values H H L L H, with an observation emitted from each latent state, and marks the transition probabilities, the emission probabilities, and the joint distribution.) If the hidden variables are discrete and form a Markov chain, then the model is a hidden Markov model (HMM).

Transition probabilities. Notation: In Bishop, the hidden variables zn are 1-of-K ("positional") binary vectors, e.g. if zn = (0,0,1) then the model in step n is in state k=3. Transition probabilities: If the hidden variables are discrete with K states, the conditional distribution p(zn | zn-1) is a K x K table A, and the marginal distribution p(z1) describing the initial state is a K-vector π. The probability of going from state j to state k is Ajk = p(znk = 1 | zn-1,j = 1), and the probability of state k being the initial state is πk = p(z1k = 1).

Emission probabilities. The emission probabilities are the conditional distributions p(xn | zn) of the observed variables given a specific state. If the observed values xn are discrete (e.g. D symbols), the emission probabilities Φ form a K x D table of probabilities which, for each of the K states, specifies the probability of emitting each observable.
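To make the shapes concrete, here is one possible representation of π, A and Φ for a hypothetical two-state model over a two-symbol alphabet; the numbers are made up and only illustrate the K-vector, K x K and K x D structure.

# Hypothetical 2-state HMM (K = 2 states) over D = 2 symbols; numbers are made up.
pi  = [0.5, 0.5]                 # pi[k]     = p(z1 = k)
A   = [[0.9, 0.1],               # A[j][k]   = p(z_n = k | z_{n-1} = j)
       [0.2, 0.8]]
phi = [[0.7, 0.3],               # phi[k][d] = p(x_n = d | z_n = k)
       [0.4, 0.6]]

# Each row is a probability distribution and should sum to 1.
assert all(abs(sum(row) - 1.0) < 1e-9 for row in A + phi)
assert abs(sum(pi) - 1.0) < 1e-9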

HMM joint probability distribution. Observables: X = x1,...,xN. Latent states: Z = z1,...,zN. Model parameters: Θ = (π, A, Φ). The joint distribution is

p(X, Z | Θ) = p(z1 | π) [ p(z2 | z1, A) ... p(zN | zN-1, A) ] p(x1 | z1, Φ) ... p(xN | zN, Φ)

If A and Φ are the same for all n, then the HMM is homogeneous.
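A direct translation of the joint distribution into code could look as follows; this is a sketch using the hypothetical parameters introduced above (pi, A, phi are assumptions, not values from the slides).

# Sketch: p(X, Z | Theta) = p(z1) * prod_{n>=2} p(z_n | z_{n-1}) * prod_n p(x_n | z_n)
pi  = [0.5, 0.5]
A   = [[0.9, 0.1], [0.2, 0.8]]
phi = [[0.7, 0.3], [0.4, 0.6]]

def joint_probability(xs, zs):
    prob = pi[zs[0]] * phi[zs[0]][xs[0]]
    for n in range(1, len(xs)):
        prob *= A[zs[n - 1]][zs[n]] * phi[zs[n]][xs[n]]
    return prob

print(joint_probability(xs=[0, 1, 1], zs=[0, 0, 1]))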

HMMs as a generative model. An HMM generates a sequence of observables by moving from latent state to latent state according to the transition probabilities and emitting an observable (from a discrete set of observables, i.e. a finite alphabet) from each latent state visited, according to the emission probabilities of that state. Model M: a run follows a sequence of states, e.g. H H L L H, and emits a corresponding sequence of symbols (shown on the slide).
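The generative view can be sketched as follows: repeatedly draw the next state from the transition distribution and a symbol from that state's emission distribution. The parameters are the made-up ones from above; random.choices is Python's standard-library weighted sampler.

import random

# Sketch: sampling a run from a homogeneous HMM (made-up 2-state parameters).
pi  = [0.5, 0.5]
A   = [[0.9, 0.1], [0.2, 0.8]]
phi = [[0.7, 0.3], [0.4, 0.6]]

def sample_run(N):
    states, symbols = [], []
    z = random.choices(range(len(pi)), weights=pi)[0]            # initial state
    for _ in range(N):
        states.append(z)
        symbols.append(random.choices(range(len(phi[z])), weights=phi[z])[0])
        z = random.choices(range(len(A[z])), weights=A[z])[0]    # next state
    return states, symbols

print(sample_run(5))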

Using HMMs. (1) Determine the likelihood of a sequence of observations, p(X | Θ) = ΣZ p(X, Z | Θ). (2) Find a plausible underlying explanation (or decoding) of a sequence of observations. The sum over Z has K^N terms, but it turns out that it can be computed in O(K^2 N) time; first, however, we will consider decoding.
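To see where the K^N terms come from, a naive computation of the likelihood sums the joint probability over every possible state sequence; this is only feasible for tiny N, which is the motivation for the forward algorithm later. A sketch with the same made-up parameters:

from itertools import product

# Sketch: brute-force likelihood p(X | Theta) as a sum over all K^N state sequences.
pi  = [0.5, 0.5]
A   = [[0.9, 0.1], [0.2, 0.8]]
phi = [[0.7, 0.3], [0.4, 0.6]]

def brute_force_likelihood(xs):
    K, N = len(pi), len(xs)
    total = 0.0
    for zs in product(range(K), repeat=N):      # K^N state sequences
        prob = pi[zs[0]] * phi[zs[0]][xs[0]]
        for n in range(1, N):
            prob *= A[zs[n - 1]][zs[n]] * phi[zs[n]][xs[n]]
        total += prob
    return total

print(brute_force_likelihood([0, 1, 1, 0]))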

Decoding using HMMs. Given an HMM Θ and a sequence of observations X = x1,...,xN, find a plausible explanation, i.e. a sequence Z* = z*1,...,z*N of values of the hidden variables. Viterbi decoding: Z* is the overall most likely explanation of X, i.e. Z* = arg maxZ p(X, Z | Θ). Posterior decoding: z*n is the most likely state to be in at the n'th step, i.e. z*n = arg maxzn p(zn | X, Θ).

Viterbi decoding. Given X, find Z* = arg maxZ p(X, Z | Θ), where ω(zn) denotes the probability of the most likely sequence of states z1,...,zn ending in zn and generating the observations x1,...,xn. The values are stored in a K x N table with ω[k][n] = ω(zn) if zn is state k.

The ω-recursion. ω(zn) is the probability of the most likely sequence of states z1,...,zn ending in zn and generating the observations x1,...,xn.

Recursion: ω(zn) = p(xn | zn) * maxzn-1 { ω(zn-1) p(zn | zn-1) }
Basis: ω(z1) = p(x1 | z1) p(z1)

In table form, ω[k][n] = ω(zn) if zn is state k (a K x N table).

// Pseudo code for computing ω[k][n] for some n > 1
ω[k][n] = 0
if p(x[n] | k) != 0:
    for j = 1 to K:
        if p(k | j) != 0:
            ω[k][n] = max( ω[k][n], p(x[n] | k) * ω[j][n-1] * p(k | j) )

Computing ω takes time O(K^2 N) and space O(KN) using memoization.

Viterbi decoding: retrieving Z*. ω(zn) is the probability of the most likely sequence of states z1,...,zn ending in zn and generating the observations x1,...,xn. We find Z* by backtracking through the ω-table:

// Pseudo code for backtracking
z[1..N] = undef
z[N] = arg maxk ω[k][N]
for n = N-1 to 1:
    z[n] = arg maxk ( p(x[n+1] | z[n+1]) * ω[k][n] * p(z[n+1] | k) )
print z[1..N]

Backtracking takes time O(KN) and space O(KN) using the ω-table.
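Putting the ω-recursion and the backtracking together, a direct (not numerically robust) implementation could look like the sketch below; the parameters are the same made-up ones as before.

# Sketch: Viterbi decoding (omega table + backtracking), without log-space rescaling.
pi  = [0.5, 0.5]
A   = [[0.9, 0.1], [0.2, 0.8]]
phi = [[0.7, 0.3], [0.4, 0.6]]

def viterbi(xs):
    K, N = len(pi), len(xs)
    omega = [[0.0] * N for _ in range(K)]                # omega[k][n]
    for k in range(K):                                   # basis: n = 0
        omega[k][0] = pi[k] * phi[k][xs[0]]
    for n in range(1, N):                                # recursion
        for k in range(K):
            omega[k][n] = phi[k][xs[n]] * max(omega[j][n - 1] * A[j][k] for j in range(K))
    z = [0] * N                                          # backtracking
    z[N - 1] = max(range(K), key=lambda k: omega[k][N - 1])
    for n in range(N - 2, -1, -1):
        z[n] = max(range(K),
                   key=lambda k: omega[k][n] * A[k][z[n + 1]] * phi[z[n + 1]][xs[n + 1]])
    return z

print(viterbi([0, 1, 1, 0]))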

Recall: Decoding using HMMs. Viterbi decoding finds the overall most likely explanation Z* of X; Posterior decoding finds, for each n, the most likely state z*n to be in at the n'th step.

Posterior decoding. Given X, find Z* = z*1,...,z*N, where z*n = arg maxzn p(zn | x1,...,xN) is the most likely state to be in at the n'th step.

Posterior decoding. α(zn) is the joint probability of observing x1,...,xn and being in state zn, i.e. α(zn) = p(x1,...,xn, zn). β(zn) is the conditional probability of the future observations xn+1,...,xN given being in state zn, i.e. β(zn) = p(xn+1,...,xN | zn). Both are stored in K x N tables with α[k][n] = α(zn) and β[k][n] = β(zn) if zn is state k. Using α(zn) and β(zn) we get the posterior p(zn | X) = α(zn) β(zn) / p(X) and the likelihood of the observations as p(X) = Σzn α(zn) β(zn) (for any n).

The forward algorithm (the α-recursion). α(zn) is the joint probability of observing x1,...,xn and being in state zn.

Recursion: α(zn) = p(xn | zn) * Σzn-1 α(zn-1) p(zn | zn-1)
Basis: α(z1) = p(x1 | z1) p(z1)

In table form, α[k][n] = α(zn) if zn is state k (a K x N table).

// Pseudo code for computing α[k][n] for some n > 1
α[k][n] = 0
if p(x[n] | k) != 0:
    for j = 1 to K:
        if p(k | j) != 0:
            α[k][n] = α[k][n] + p(x[n] | k) * α[j][n-1] * p(k | j)

Computing α takes time O(K^2 N) and space O(KN) using memoization.
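A minimal sketch of the forward algorithm; the last column of the α-table then sums to the likelihood p(X | Θ). The parameters are the made-up ones from above.

# Sketch: the forward algorithm (alpha table), without rescaling.
pi  = [0.5, 0.5]
A   = [[0.9, 0.1], [0.2, 0.8]]
phi = [[0.7, 0.3], [0.4, 0.6]]

def forward(xs):
    K, N = len(pi), len(xs)
    alpha = [[0.0] * N for _ in range(K)]                # alpha[k][n]
    for k in range(K):                                   # basis
        alpha[k][0] = pi[k] * phi[k][xs[0]]
    for n in range(1, N):                                # recursion
        for k in range(K):
            alpha[k][n] = phi[k][xs[n]] * sum(alpha[j][n - 1] * A[j][k] for j in range(K))
    return alpha

alpha = forward([0, 1, 1, 0])
print(sum(alpha[k][-1] for k in range(len(pi))))         # likelihood p(X | Theta)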

The backward algorithm (the β-recursion). β(zn) is the conditional probability of the future observations xn+1,...,xN given being in state zn.

Recursion: β(zn) = Σzn+1 β(zn+1) p(xn+1 | zn+1) p(zn+1 | zn)
Basis: β(zN) = 1

In table form, β[k][n] = β(zn) if zn is state k (a K x N table).

// Pseudo code for computing β[k][n] for some n < N
β[k][n] = 0
for j = 1 to K:
    if p(j | k) != 0:
        β[k][n] = β[k][n] + p(j | k) * p(x[n+1] | j) * β[j][n+1]

Computing β takes time O(K^2 N) and space O(KN) using memoization.
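A matching sketch of the backward algorithm; β is filled from right to left with β(zN) = 1 as the basis, again with the made-up parameters.

# Sketch: the backward algorithm (beta table), without rescaling.
pi  = [0.5, 0.5]
A   = [[0.9, 0.1], [0.2, 0.8]]
phi = [[0.7, 0.3], [0.4, 0.6]]

def backward(xs):
    K, N = len(pi), len(xs)
    beta = [[0.0] * N for _ in range(K)]                 # beta[k][n]
    for k in range(K):                                   # basis: beta[k][N-1] = 1
        beta[k][N - 1] = 1.0
    for n in range(N - 2, -1, -1):                       # recursion, right to left
        for k in range(K):
            beta[k][n] = sum(A[k][j] * phi[j][xs[n + 1]] * beta[j][n + 1] for j in range(K))
    return beta

print(backward([0, 1, 1, 0]))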

Posterior decoding. Recall that α(zn) is the joint probability of observing x1,...,xn and being in state zn, and β(zn) is the conditional probability of the future observations xn+1,...,xN given being in state zn. Using α(zn) and β(zn) we get the posterior p(zn | X) = α(zn) β(zn) / p(X), where the likelihood of the observations is p(X) = α[1][N] + α[2][N] + ... + α[K][N].

// Pseudo code for posterior decoding
Compute α[1..K][1..N] and β[1..K][1..N]
px = α[1][N] + α[2][N] + ... + α[K][N]
z[1..N] = undef
for n = 1 to N:
    z[n] = arg maxk ( α[k][n] * β[k][n] / px )
print z[1..N]
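Combining the two tables gives posterior decoding. The sketch below recomputes α and β with the same made-up parameters so that it can run on its own, then picks the most likely state in each column.

# Sketch: posterior decoding via the forward and backward tables.
pi  = [0.5, 0.5]
A   = [[0.9, 0.1], [0.2, 0.8]]
phi = [[0.7, 0.3], [0.4, 0.6]]

def posterior_decode(xs):
    K, N = len(pi), len(xs)
    alpha = [[0.0] * N for _ in range(K)]
    beta  = [[0.0] * N for _ in range(K)]
    for k in range(K):
        alpha[k][0] = pi[k] * phi[k][xs[0]]
        beta[k][N - 1] = 1.0
    for n in range(1, N):                                # forward pass
        for k in range(K):
            alpha[k][n] = phi[k][xs[n]] * sum(alpha[j][n - 1] * A[j][k] for j in range(K))
    for n in range(N - 2, -1, -1):                       # backward pass
        for k in range(K):
            beta[k][n] = sum(A[k][j] * phi[j][xs[n + 1]] * beta[j][n + 1] for j in range(K))
    px = sum(alpha[k][N - 1] for k in range(K))          # likelihood p(X | Theta)
    return [max(range(K), key=lambda k: alpha[k][n] * beta[k][n] / px) for n in range(N)]

print(posterior_decode([0, 1, 1, 0]))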

Viterbi vs. Posterior decoding. A sequence of states z1,...,zN where p(x1,...,xN, z1,...,zN) > 0 is a legal (or syntactically correct) decoding of X. Viterbi finds the most likely syntactically correct decoding of X. What does Posterior decoding find? Does it always find a syntactically correct decoding of X?

No. Consider a model with states 1, 2 and 3, where every state emits A and B with probability .5 each, state 1 goes to state 2 or state 3 with probability 1/2 each, and states 2 and 3 only go to themselves. The model emits a sequence of As and Bs following either the path 12...2 or 13...3 with equal probability. Viterbi finds either 12...2 or 13...3, while Posterior decoding finds that states 2 and 3 are equally likely for every n > 1, so it may return a mix of 2's and 3's, which is not a syntactically correct decoding.

Recall: Using HMMs. (1) Determine the likelihood of a sequence of observations. (2) Find a plausible underlying explanation (or decoding) of a sequence of observations. The sum defining the likelihood has K^N terms, but it can be computed in O(K^2 N) time by computing the α-table using the forward algorithm and summing the last column: p(X) = α[1][N] + α[2][N] + ... + α[K][N].

Summary. Terminology of hidden Markov models (HMMs). Viterbi and Posterior decoding for finding a plausible underlying explanation (sequence of hidden states) of a sequence of observations. The forward and backward algorithms for computing the likelihood of being in a given state in the n'th step, and for determining the likelihood of a sequence of observations.

Viterbi. Recursion: ω(zn) = p(xn | zn) * maxzn-1 ω(zn-1) p(zn | zn-1). Basis: ω(z1) = p(x1 | z1) p(z1).
Forward. Recursion: α(zn) = p(xn | zn) * Σzn-1 α(zn-1) p(zn | zn-1). Basis: α(z1) = p(x1 | z1) p(z1).
Backward. Recursion: β(zn) = Σzn+1 β(zn+1) p(xn+1 | zn+1) p(zn+1 | zn). Basis: β(zN) = 1.

Problem: The values in the ω-, α-, and β-tables can come very close to zero; by multiplying them we can exceed the precision of double-precision floating point numbers and get underflow.

Next: How to implement the basic algorithms (forward, backward, and Viterbi) in a numerically sound manner.
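As a preview, one common remedy (an assumption here; the slides do not spell it out) is to work with log-probabilities for Viterbi, replacing products by sums so that very small values no longer underflow. A log-space version of the ω-recursion could look like this sketch, with the same made-up parameters.

from math import log, inf

# Sketch: log-space omega recursion to avoid underflow (made-up parameters).
pi  = [0.5, 0.5]
A   = [[0.9, 0.1], [0.2, 0.8]]
phi = [[0.7, 0.3], [0.4, 0.6]]

def safe_log(p):
    return log(p) if p > 0 else -inf                     # represent log(0) as -inf

def viterbi_log(xs):
    K, N = len(pi), len(xs)
    logw = [[-inf] * N for _ in range(K)]                # logw[k][n] = log omega[k][n]
    for k in range(K):                                   # basis
        logw[k][0] = safe_log(pi[k]) + safe_log(phi[k][xs[0]])
    for n in range(1, N):                                # recursion: products become sums
        for k in range(K):
            logw[k][n] = safe_log(phi[k][xs[n]]) + max(
                logw[j][n - 1] + safe_log(A[j][k]) for j in range(K))
    return logw

print(viterbi_log([0, 1, 1, 0]))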