Hidden Markov Models Terminology and Basic Algorithms
The next two weeks: Hidden Markov models (HMMs). Wed 9/11: Terminology and basic algorithms. Mon 14/11: Implementing the basic algorithms. Wed 16/11: Selecting model parameters and training; introduction to the hand-in project. Mon 21/11: Extensions and applications. We use Chapter 13 from Bishop's book Pattern Recognition and Machine Learning. Rabiner's paper A Tutorial on Hidden Markov Models [...] might also be useful to read. Blackboard and http://cs.au.dk/~cstorm/courses/ml_e16
What is machine learning? Machine learning means different things to different people, and there is no generally agreed-upon core set of algorithms that must be learned. For me, the core of machine learning is: Building a mathematical model that captures some desired structure of the data that you are working on. Training the model (i.e. setting the parameters of the model) based on existing data to optimize it as well as we can. Making predictions by using the model on new data.
Motivation We make predictions based on models of observed data (machine learning). A simple model assumes that observations are independent and identically distributed... but this assumption is not always the best, e.g. for (1) measurements of weather patterns, (2) daily values of stocks, (3) the composition of DNA or proteins, or...
Markov Models If the n'th observation in a chain of observations is influenced only by the (n-1)'th observation, i.e.

p(xn | x1,...,xn-1) = p(xn | xn-1),

then the chain of observations is a 1st-order Markov chain, and the joint probability of a sequence of N observations is

p(x1,...,xN) = p(x1) p(x2 | x1) p(x3 | x2) ... p(xN | xN-1).

If the distributions p(xn | xn-1) are the same for all n, then the chain of observations is a homogeneous 1st-order Markov chain.
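As a small illustration (not from the slides), the joint probability of a homogeneous 1st-order Markov chain over two symbols can be computed directly from the factorization above; the initial distribution p1 and transition table T below are made-up example values:

import numpy as np

p1 = np.array([0.5, 0.5])            # p1[i]   = p(x_1 = i)
T = np.array([[0.9, 0.1],            # T[i, j] = p(x_n = j | x_{n-1} = i)
              [0.2, 0.8]])

def chain_probability(xs):
    # p(x_1,...,x_N) = p(x_1) * prod_{n=2}^N p(x_n | x_{n-1})
    p = p1[xs[0]]
    for prev, cur in zip(xs, xs[1:]):
        p *= T[prev, cur]
    return p

print(chain_probability([0, 0, 1, 1]))   # 0.5 * 0.9 * 0.1 * 0.8 = 0.036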
Hidden Markov Models What if the n'th observation in a chain of observations is influenced by a corresponding hidden variable? If the hidden variables are discrete and form a Markov chain, then it is a hidden Markov model (HMM).

[Figure: a Markov chain of latent values, e.g. H, H, L, L, H, each emitting the corresponding observation.]

The joint distribution factorizes into transition probabilities p(zn | zn-1) and emission probabilities p(xn | zn):

p(x1,...,xN, z1,...,zN) = p(z1) [ prod n=2..N p(zn | zn-1) ] prod n=1..N p(xn | zn)
Transition Probabilities Notation: In Bishop, the hidden variables zn are positional vectors (1-of-K coding), e.g. if zn = (0,0,1) then the model in step n is in state k=3... Transition probabilities: If the hidden variables are discrete with K states, the conditional distribution p(zn | zn-1) is a K x K table A, and the marginal distribution p(z1) describing the initial state is a K-vector π... The probability of going from state j to state k is:

Ajk = p(znk = 1 | zn-1,j = 1)

The probability of state k being the initial state is:

πk = p(z1k = 1)
Emission Probabilities Emission probabilities: The conditional distributions p(xn | zn) of the observed variables from a specific state. If the observed values xn are discrete (e.g. D symbols), the emission probabilities φ form a K x D table which for each of the K states specifies the probability of emitting each observable:

φki = p(xn = symbol i | zn = state k)
HMM joint probability distribution Observables: X = x1,...,xN. Latent states: Z = z1,...,zN. Model parameters: Θ = (π, A, φ).

p(X, Z | Θ) = p(z1 | π) [ prod n=2..N p(zn | zn-1, A) ] prod n=1..N p(xn | zn, φ)

If A and φ are the same for all n, then the HMM is homogeneous.
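To make the later algorithm sketches concrete, here is a made-up parameterization in Python of a 2-state model with states H (k=0) and L (k=1) and alphabet {A, B}; the numbers are illustrative only, and integer state indices are used instead of Bishop's 1-of-K vectors:

import numpy as np

pi = np.array([0.6, 0.4])        # pi[k]     = p(z_1 = k)       (made-up values)
A = np.array([[0.7, 0.3],        # A[j, k]   = p(z_n = k | z_{n-1} = j)
              [0.4, 0.6]])
phi = np.array([[0.9, 0.1],      # phi[k, i] = p(x_n = i | z_n = k)
                [0.2, 0.8]])     # columns: symbol A (i=0), symbol B (i=1)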
HMMs as a generative model An HMM generates a sequence of observables by moving from latent state to latent state according to the transition probabilities, and emitting an observable (from a discrete set of observables, i.e. a finite alphabet) from each latent state visited, according to the emission probabilities of the state... [Example: a run of model M follows a sequence of states, e.g. H H L L H, and emits a corresponding sequence of symbols.]
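A minimal sketch of this generative process, assuming the illustrative pi, A, and phi arrays above:

import numpy as np

def sample(pi, A, phi, N, rng=np.random.default_rng()):
    K, D = phi.shape
    z = np.empty(N, dtype=int)                # latent states
    x = np.empty(N, dtype=int)                # emitted symbols
    z[0] = rng.choice(K, p=pi)                # initial state drawn from pi
    x[0] = rng.choice(D, p=phi[z[0]])         # emit from the initial state
    for n in range(1, N):
        z[n] = rng.choice(K, p=A[z[n-1]])     # transition to the next state
        x[n] = rng.choice(D, p=phi[z[n]])     # emit from the state visited
    return z, x

# Example: z, x = sample(pi, A, phi, 5) might give z = [0,0,1,1,0],
# i.e. the state sequence H H L L H from the slide.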
Using HMMs Determine the likelihood of a sequence of observations:

p(X) = sum over Z of p(X, Z)

Find a plausible underlying explanation (or decoding) of a sequence of observations. The sum has K^N terms, but it turns out that it can be computed in O(K²N) time; first, however, we will consider decoding.
Decoding using HMMs Given an HMM Θ and a sequence of observations X = x1,...,xN, find a plausible explanation, i.e. a sequence Z* = z*1,...,z*N of values of the hidden variables.

Viterbi decoding: Z* is the overall most likely explanation of X:

Z* = arg max over Z of p(X, Z)

Posterior decoding: z*n is the most likely state to be in at the n'th step:

z*n = arg max over zn of p(zn | x1,...,xN)
Viterbi decoding Given X, find Z* such that Z* = arg max over Z of p(X, Z), where

ω(zn) = max over z1,...,zn-1 of p(x1,...,xn, z1,...,zn)

is the probability of the most likely sequence of states z1,...,zn ending in zn generating the observations x1,...,xn. The values are stored in a K x N table with ω[k][n] = ω(zn) if zn is state k.
The ω-recursion ω(zn) is the probability of the most likely sequence of states z1,...,zn ending in zn generating the observations x1,...,xn.

Basis: ω(z1) = p(z1) p(x1 | z1)
Recursion (n > 1): ω(zn) = p(xn | zn) · max over zn-1 of ω(zn-1) p(zn | zn-1)

In table form: ω[k][n] = ω(zn) if zn is state k (a K x N table).
The ω-recursion in pseudocode:

// Pseudocode for computing ω[k][n] for some n > 1
ω[k][n] = 0
if p(x[n] | k) != 0:
    for j = 1 to K:
        if p(k | j) != 0:
            ω[k][n] = max( ω[k][n], p(x[n] | k) * ω[j][n-1] * p(k | j) )

Computing ω takes time O(K²N) and space O(KN) using memoization.
Viterbi decoding: Retrieving Z* ω(zn) is the probability of the most likely sequence of states z1,...,zn ending in zn generating the observations x1,...,xn, with ω[k][n] = ω(zn) if zn is state k. We find Z* by backtracking through the ω-table:

// Pseudocode for backtracking
z[1..N] = undef
z[N] = arg max_k ω[k][N]
for n = N-1 to 1:
    z[n] = arg max_k ( p(x[n+1] | z[n+1]) * ω[k][n] * p(z[n+1] | k) )
print z[1..N]

Backtracking takes time O(KN) and space O(KN) using ω.
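Putting the ω-recursion and the backtracking together, a sketch of Viterbi decoding in Python, assuming the illustrative pi, A, phi arrays from earlier and x given as a sequence of symbol indices:

import numpy as np

def viterbi(x, pi, A, phi):
    K, N = len(pi), len(x)
    omega = np.zeros((K, N))
    omega[:, 0] = pi * phi[:, x[0]]                      # basis
    for n in range(1, N):
        for k in range(K):                               # recursion
            omega[k, n] = phi[k, x[n]] * np.max(omega[:, n-1] * A[:, k])
    z = np.empty(N, dtype=int)
    z[N-1] = np.argmax(omega[:, N-1])                    # most likely end state
    for n in range(N-2, -1, -1):                         # backtracking
        z[n] = np.argmax(omega[:, n] * A[:, z[n+1]] * phi[z[n+1], x[n+1]])
    return z, omega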
Recall: Decoding using HMMs Viterbi decoding finds the overall most likely explanation Z* of X. Posterior decoding picks, for each step n, the state z*n that is most likely to be in at the n'th step. We now turn to posterior decoding.
Posterior decoding Given X, find Z* where

z*n = arg max over zn of p(zn | x1,...,xN)

is the most likely state to be in at the n'th step.
Posterior decoding α(zn) is the joint probability of observing x1,...,xn and being in state zn:

α(zn) = p(x1,...,xn, zn)

β(zn) is the conditional probability of the future observations xn+1,...,xN given being in state zn:

β(zn) = p(xn+1,...,xN | zn)

Both are stored in K x N tables: α[k][n] = α(zn) and β[k][n] = β(zn) if zn is state k. Using α(zn) and β(zn) we get the posterior probabilities and the likelihood of the observations as:

p(zn | x1,...,xN) = α(zn) β(zn) / p(X)    where    p(X) = sum over zn of α(zn) β(zn)
The forward algorithm (the α-recursion) α(zn) is the joint probability of observing x1,...,xn and being in state zn: α(zn) = p(x1,...,xn, zn).

Basis: α(z1) = p(z1) p(x1 | z1)
Recursion (n > 1): α(zn) = p(xn | zn) · sum over zn-1 of α(zn-1) p(zn | zn-1)

In table form: α[k][n] = α(zn) if zn is state k (a K x N table).
The forward algorithm in pseudocode:

// Pseudocode for computing α[k][n] for some n > 1
α[k][n] = 0
if p(x[n] | k) != 0:
    for j = 1 to K:
        if p(k | j) != 0:
            α[k][n] = α[k][n] + p(x[n] | k) * α[j][n-1] * p(k | j)

Computing α takes time O(K²N) and space O(KN) using memoization.
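A Python sketch of the forward algorithm, under the same assumptions as the Viterbi sketch above; note that it is the Viterbi recursion with max replaced by sum:

import numpy as np

def forward(x, pi, A, phi):
    K, N = len(pi), len(x)
    alpha = np.zeros((K, N))
    alpha[:, 0] = pi * phi[:, x[0]]                      # basis
    for n in range(1, N):
        for k in range(K):                               # recursion
            alpha[k, n] = phi[k, x[n]] * np.sum(alpha[:, n-1] * A[:, k])
    return alpha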
The backward algorithm (the β-recursion) β(zn) is the conditional probability of the future observations xn+1,...,xN given being in state zn: β(zn) = p(xn+1,...,xN | zn).

Basis: β(zN) = 1
Recursion (n < N): β(zn) = sum over zn+1 of β(zn+1) p(xn+1 | zn+1) p(zn+1 | zn)

In table form: β[k][n] = β(zn) if zn is state k (a K x N table).
The backward algorithm in pseudocode:

// Pseudocode for computing β[k][n] for some n < N
β[k][n] = 0
for j = 1 to K:
    if p(j | k) != 0:
        β[k][n] = β[k][n] + p(j | k) * p(x[n+1] | j) * β[j][n+1]

Computing β takes time O(K²N) and space O(KN) using memoization.
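A Python sketch of the backward algorithm under the same assumptions, filling the β-table from right to left:

import numpy as np

def backward(x, pi, A, phi):
    K, N = len(pi), len(x)
    beta = np.zeros((K, N))
    beta[:, N-1] = 1.0                                   # basis
    for n in range(N-2, -1, -1):
        for k in range(K):                               # recursion
            beta[k, n] = np.sum(A[k, :] * phi[:, x[n+1]] * beta[:, n+1])
    return beta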
Posterior decoding Recall that α(zn) is the joint probability of observing x1,...,xn and being in state zn, and β(zn) is the conditional probability of the future observations xn+1,...,xN given being in state zn. Using α(zn) and β(zn) we get the likelihood of the observations as p(X) = sum over zn of α(zn) β(zn), e.g. from the last column of the α-table since β(zN) = 1:

// Pseudocode for posterior decoding
Compute α[1..K][1..N] and β[1..K][1..N]
px = α[1][N] + α[2][N] + ... + α[K][N]
z[1..N] = undef
for n = 1 to N:
    z[n] = arg max_k ( α[k][n] * β[k][n] / px )
print z[1..N]
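A sketch of posterior decoding built on the forward and backward sketches above; p(zn = k | X) = α[k][n] β[k][n] / p(X):

import numpy as np

def posterior_decode(x, pi, A, phi):
    alpha = forward(x, pi, A, phi)
    beta = backward(x, pi, A, phi)
    px = alpha[:, -1].sum()                  # p(X): sum of the last α-column
    return np.argmax(alpha * beta / px, axis=0)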
Viterbi vs. Posterior decoding A sequence of states z1,...,zN where p(x1,...,xN, z1,...,zN) > 0 is a legal (or syntactically correct) decoding of X. Viterbi finds the most likely syntactically correct decoding of X. What does posterior decoding find? Does it always find a syntactically correct decoding of X?

Example: [Figure: a model where state 1 branches with probability 1/2 to state 2 or to state 3, each of which loops with probability 1; all three states emit A and B with probability .5 each.] The model emits a sequence of As and Bs following either the path 12...2 or 13...3 with equal probability. Viterbi finds either 12...2 or 13...3, while posterior decoding finds that states 2 and 3 are equally likely for every n > 1. A posterior decoding may therefore mix 2s and 3s, which is not a legal decoding, since there are no transitions between states 2 and 3.
Recall: Using HMMs Determine the likelihood of a sequence of observations. Find a plausible underlying explanation (or decoding) of a sequence of observations. The sum p(X) = sum over Z of p(X, Z) has K^N terms, but it can be computed in O(K²N) time by computing the α-table using the forward algorithm and summing the last column:

p(X) = α[1][N] + α[2][N] + ... + α[K][N]
Summary Terminology of hidden Markov models (HMMs). Viterbi and posterior decoding for finding a plausible underlying explanation (sequence of hidden states) of a sequence of observations. The forward and backward algorithms for computing the likelihood of being in a given state in the n'th step, and for determining the likelihood of a sequence of observations.
Summary of the recursions:

Viterbi.  Basis: ω(z1) = p(z1) p(x1 | z1).  Recursion: ω(zn) = p(xn | zn) · max over zn-1 of ω(zn-1) p(zn | zn-1)
Forward.  Basis: α(z1) = p(z1) p(x1 | z1).  Recursion: α(zn) = p(xn | zn) · sum over zn-1 of α(zn-1) p(zn | zn-1)
Backward. Basis: β(zN) = 1.  Recursion: β(zn) = sum over zn+1 of β(zn+1) p(xn+1 | zn+1) p(zn+1 | zn)

Problem: The values in the ω-, α-, and β-tables can come very close to zero; by multiplying them we potentially exceed the precision of double-precision floating point numbers and get underflow.

Next: How to implement the basic algorithms (forward, backward, and Viterbi) in a numerically sound manner.
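As a hedged preview of next lecture's topic, one standard fix is to run the ω-recursion in log-space, turning products into sums so the values stay in a representable range; a sketch under the same assumptions as the earlier code:

import numpy as np

def log_viterbi_table(x, pi, A, phi):
    K, N = len(pi), len(x)
    with np.errstate(divide="ignore"):       # log(0) = -inf is intended here
        lpi, lA, lphi = np.log(pi), np.log(A), np.log(phi)
    lomega = np.full((K, N), -np.inf)
    lomega[:, 0] = lpi + lphi[:, x[0]]       # basis, in log-space
    for n in range(1, N):
        for k in range(K):                   # products become sums of logs
            lomega[k, n] = lphi[k, x[n]] + np.max(lomega[:, n-1] + lA[:, k])
    return lomega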