Hidden Markov Models: Terminology and Basic Algorithms
What is machine learning? From http://en.wikipedia.org/wiki/machine_learning Machine learning, a branch of artificial intelligence, is about the construction and study of systems that can learn from data. For example, a machine learning system could be trained on email messages to learn to distinguish between spam and non-spam messages. After learning, it can then be used to classify new email messages into spam and nonspam folders. [...] Machine learning focuses on prediction, based on known properties learned from the training data.
What is machine learning? Machine learning means different things to different people, and there is no generally agreed-upon core set of algorithms that must be learned. To us, the core of machine learning (in bioinformatics) boils down to three things: building a mathematical model that captures some desired structure of the data you are working on; training the model (i.e. setting its parameters) on existing data to optimize it as well as we can; and making predictions with the model on new data.
Motivation We make predictions based on models of observed data (machine learning). A simple model assumes the observations are independent and identically distributed (iid)... but this assumption is not always the best, e.g. for (1) measurements of weather patterns, (2) daily values of stocks, (3) acoustic features in successive time frames used for speech recognition, (4) the composition of texts, (5) the composition of DNA, or...
Markov Models If the n'th observation in a chain of observations is influenced only by the (n-1)'th observation, i.e.

$$p(x_n \mid x_1, \ldots, x_{n-1}) = p(x_n \mid x_{n-1}),$$

then the chain of observations is a 1st-order Markov chain, and the joint probability of a sequence of N observations is

$$p(x_1, \ldots, x_N) = p(x_1) \prod_{n=2}^{N} p(x_n \mid x_{n-1}).$$

If the distributions $p(x_n \mid x_{n-1})$ are the same for all n, then the chain of observations is a homogeneous 1st-order Markov chain. Extension: in a higher-order Markov chain, each observation is influenced by several preceding observations rather than just the previous one.
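To make the joint-probability formula concrete, here is a minimal Python sketch of a homogeneous 1st-order Markov chain. The two-state weather model and all of its numbers are made up for illustration; they are not part of the slides' examples.

```python
# A made-up two-state weather model: p(x1) and p(xn | xn-1).
init = {"Sunny": 0.6, "Rainy": 0.4}
trans = {"Sunny": {"Sunny": 0.8, "Rainy": 0.2},
         "Rainy": {"Sunny": 0.4, "Rainy": 0.6}}

def markov_chain_probability(xs):
    """p(x1,...,xN) = p(x1) * prod_{n=2..N} p(xn | xn-1)."""
    p = init[xs[0]]
    for prev, cur in zip(xs, xs[1:]):
        p *= trans[prev][cur]
    return p

print(markov_chain_probability(["Sunny", "Sunny", "Rainy", "Rainy"]))
# 0.6 * 0.8 * 0.2 * 0.6 = 0.0576
```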
Hidden Markov Models What if the n'th observation in a chain of observations is influenced by a corresponding latent (i.e. hidden) variable? [Figure: a Markov chain of latent values, e.g. H H L L H, each emitting one observation.] If the latent variables are discrete and form a Markov chain, then it is a hidden Markov model (HMM).

Computational problems: (1) determine the likelihood of the sequence of observations; (2) predict the next observation in the sequence of observations; (3) find the most likely underlying explanation of the sequence of observations (i.e. the most likely sequence of hidden values).

The predictive distribution $p(x_{n+1} \mid x_1, \ldots, x_n)$ for observation $x_{n+1}$ can be shown to depend on all previous observations, i.e. the sequence of observations is not a Markov chain of any order...

The joint distribution is

$$p(x_1, \ldots, x_N, z_1, \ldots, z_N) = p(z_1) \left[ \prod_{n=2}^{N} p(z_n \mid z_{n-1}) \right] \prod_{n=1}^{N} p(x_n \mid z_n),$$

where the factors $p(z_n \mid z_{n-1})$ are the transition probabilities and the factors $p(x_n \mid z_n)$ are the emission probabilities.
Transition probabilities Notation: In Bishop, the latent variables $z_n$ are discrete multinomial variables, e.g. if $z_n = (0,0,1)$ then the model in step n is in state k=3... Transition probabilities: If the latent variables are discrete with K states, the conditional distribution $p(z_n \mid z_{n-1})$ is a K x K table A, and the marginal distribution $p(z_1)$ describing the initial state is a K-vector π... The probability of going from state j to state k is $A_{jk} = p(z_{nk} = 1 \mid z_{n-1,j} = 1)$. The probability of state k being the initial state is $\pi_k = p(z_{1k} = 1)$. The transition probabilities can be drawn as a state transition diagram.
Emission probabilities Emission probabilities: The conditional distributions $p(x_n \mid z_n)$ of the observed variables from a specific state. If the observed values $x_n$ are discrete (e.g. D symbols), the emission probabilities Φ form a K x D table of probabilities which for each of the K states specifies the probability of emitting each observable:

$$p(x_n \mid z_n, \Phi) = \prod_{k=1}^{K} p(x_n \mid \phi_k)^{z_{nk}}$$

Here $z_{nk} = 1$ iff the n'th latent variable in the sequence is in state k, and 0 otherwise, i.e. the product just picks the emission probabilities corresponding to state k...
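The following sketch shows the parameter tables as Python dictionaries, assuming a two-state (K=2) model over the DNA alphabet (D=4); the states H and L and all probabilities are made up for illustration. Later sketches reuse these tables.

```python
STATES = ["H", "L"]                                    # K = 2 latent states
SYMBOLS = ["A", "C", "G", "T"]                         # D = 4 observables

pi = {"H": 0.5, "L": 0.5}                              # initial distribution (K-vector)
A = {"H": {"H": 0.5, "L": 0.5},                        # K x K transition table
     "L": {"H": 0.4, "L": 0.6}}
phi = {"H": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},  # K x D emission table
       "L": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}}

# pi and each row of A and phi is a probability distribution, so it sums to one.
assert abs(sum(pi.values()) - 1.0) < 1e-9
for k in STATES:
    assert abs(sum(A[k].values()) - 1.0) < 1e-9
    assert abs(sum(phi[k].values()) - 1.0) < 1e-9
```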
HMM joint probability distribution

$$p(X, Z \mid \Theta) = p(z_1 \mid \pi) \left[ \prod_{n=2}^{N} p(z_n \mid z_{n-1}, A) \right] \prod_{n=1}^{N} p(x_n \mid z_n, \Phi)$$

where $X = (x_1, \ldots, x_N)$ are the observables, $Z = (z_1, \ldots, z_N)$ are the latent states, and $\Theta = \{\pi, A, \Phi\}$ are the model parameters. If A and Φ are the same for all n, then the HMM is homogeneous.
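A minimal sketch of evaluating the joint probability formula above for a given state path, reusing the illustrative pi, A, and phi tables from the previous sketch:

```python
def joint_probability(xs, zs):
    """p(X, Z) = p(z1) p(x1|z1) * prod_{n=2..N} p(zn|zn-1) p(xn|zn)."""
    p = pi[zs[0]] * phi[zs[0]][xs[0]]
    for n in range(1, len(xs)):
        p *= A[zs[n - 1]][zs[n]] * phi[zs[n]][xs[n]]
    return p

print(joint_probability(list("GGCA"), list("HHLL")))
```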
HMMs as a generative model An HMM generates a sequence of observables by moving from latent state to latent state according to the transition probabilities, and emitting an observable (from a discrete set of observables, i.e. a finite alphabet) from each latent state visited, according to the emission probabilities of that state... A run follows a sequence of states, e.g. H H L L H, and emits a corresponding sequence of symbols. A special End state can be added to the model to generate finite output.
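A sketch of a generative run, reusing the illustrative tables from the earlier sketches: draw the initial state from π, then alternate between emitting a symbol and moving to the next state.

```python
import random

def sample_run(n):
    """Sample a state path z1,...,zn and the emitted sequence x1,...,xn."""
    states, symbols = [], []
    z = random.choices(STATES, weights=[pi[k] for k in STATES])[0]  # z1 ~ pi
    for _ in range(n):
        states.append(z)
        x = random.choices(SYMBOLS, weights=[phi[z][s] for s in SYMBOLS])[0]
        symbols.append(x)                                           # xn ~ phi[zn]
        z = random.choices(STATES, weights=[A[z][k] for k in STATES])[0]
    return states, symbols

zs, xs = sample_run(5)
print("states: ", " ".join(zs))
print("symbols:", " ".join(xs))
```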
Using HMMs (1) Determine the likelihood of the sequence of observations, $p(X) = \sum_Z p(X, Z)$. (2) Predict the next observation in the sequence of observations. (3) Find the most likely underlying explanation of the sequence of observations. The sum over Z in (1) has $K^N$ terms, but it can be computed in $O(K^2 N)$ time...
The forward-backward algorithm $\alpha(z_n)$ is the joint probability of observing $x_1, \ldots, x_n$ and being in state $z_n$:

$$\alpha(z_n) = p(x_1, \ldots, x_n, z_n)$$

$\beta(z_n)$ is the conditional probability of the future observations $x_{n+1}, \ldots, x_N$ given being in state $z_n$:

$$\beta(z_n) = p(x_{n+1}, \ldots, x_N \mid z_n)$$

[Figure: the values $\alpha(z_{nk})$ form a K x N trellis, one column per position n.] Using $\alpha(z_n)$ and $\beta(z_n)$ we get the likelihood of the observations as

$$p(X) = \sum_{z_n} \alpha(z_n)\, \beta(z_n) \quad \text{for any } n.$$
The forward algorithm $\alpha(z_n)$ is the joint probability of observing $x_1, \ldots, x_n$ and being in state $z_n$: $\alpha(z_n) = p(x_1, \ldots, x_n, z_n)$.

The α-recursion follows from the sum and product rules and the conditional independencies of the HMM:

$$\alpha(z_n) = p(x_1, \ldots, x_n, z_n) = \sum_{z_{n-1}} p(x_1, \ldots, x_n, z_{n-1}, z_n) \quad \text{(sum rule)}$$
$$= \sum_{z_{n-1}} p(x_n \mid z_n)\, p(z_n \mid z_{n-1})\, p(x_1, \ldots, x_{n-1}, z_{n-1}) \quad \text{(prod. rule, using HMM)}$$

Recursion: $\alpha(z_n) = p(x_n \mid z_n) \sum_{z_{n-1}} \alpha(z_{n-1})\, p(z_n \mid z_{n-1})$

Basis: $\alpha(z_1) = p(x_1, z_1) = p(z_1)\, p(x_1 \mid z_1)$

[Figure: the $\alpha(z_{nk})$ trellis is filled column by column, left to right.] Takes time $O(K^2 N)$ and space $O(KN)$ using memoization.
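A sketch of the forward algorithm, reusing the illustrative tables from the earlier sketches. The α table is kept as one dict per position to keep the indexing readable; note that this direct implementation underflows on long sequences, which is exactly the numerical issue taken up next time.

```python
def forward(xs):
    """Return alpha, where alpha[n][k] = p(x1,...,x_{n+1}, z_{n+1} = k) (0-based n)."""
    alpha = [{k: pi[k] * phi[k][xs[0]] for k in STATES}]       # basis: alpha(z1)
    for x in xs[1:]:
        prev = alpha[-1]
        alpha.append({k: phi[k][x] * sum(prev[j] * A[j][k] for j in STATES)
                      for k in STATES})                        # the alpha-recursion
    return alpha

alpha = forward(list("GGCACTGAA"))
print(sum(alpha[-1].values()))                                 # p(X) = sum_zN alpha(zN)
```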
The backward algorithm $\beta(z_n)$ is the conditional probability of the future observations $x_{n+1}, \ldots, x_N$ given being in state $z_n$: $\beta(z_n) = p(x_{n+1}, \ldots, x_N \mid z_n)$.

The β-recursion is derived in the same way:

$$\beta(z_n) = p(x_{n+1}, \ldots, x_N \mid z_n) = \sum_{z_{n+1}} p(x_{n+1}, \ldots, x_N, z_{n+1} \mid z_n) \quad \text{(sum rule)}$$
$$= \sum_{z_{n+1}} p(x_{n+2}, \ldots, x_N \mid z_{n+1})\, p(x_{n+1} \mid z_{n+1})\, p(z_{n+1} \mid z_n) \quad \text{(prod. rule, using HMM)}$$

Recursion: $\beta(z_n) = \sum_{z_{n+1}} \beta(z_{n+1})\, p(x_{n+1} \mid z_{n+1})\, p(z_{n+1} \mid z_n)$

Basis: $\beta(z_N) = 1$

[Figure: the $\beta(z_{nk})$ trellis is filled column by column, right to left.] Takes time $O(K^2 N)$ and space $O(KN)$ using memoization.
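A sketch of the backward algorithm in the same style; combined with forward() above it yields p(X) at every position, which is a useful sanity check of an implementation.

```python
def backward(xs):
    """Return beta, where beta[n][k] = p(x_{n+2},...,xN | z_{n+1} = k) (0-based n)."""
    N = len(xs)
    beta = [{k: 1.0 for k in STATES}]                          # basis: beta(zN) = 1
    for n in range(N - 2, -1, -1):
        nxt = beta[0]                                          # beta at position n+1
        beta.insert(0, {k: sum(nxt[j] * phi[j][xs[n + 1]] * A[k][j] for j in STATES)
                        for k in STATES})                      # the beta-recursion
    return beta

xs = list("GGCACTGAA")
alpha, beta = forward(xs), backward(xs)
for n in range(len(xs)):                                       # same p(X) for every n
    print(sum(alpha[n][k] * beta[n][k] for k in STATES))
```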
Using HMMs (1) Determine the likelihood of the sequence of observations. (2) Predict the next observation in the sequence of observations. (3) Find the most likely underlying explanation of the sequence of observations.
Predicting the next observation The predictive distribution follows from the sum and product rules and the conditional independencies of the HMM:

$$p(x_{N+1} \mid X) = \sum_{z_{N+1}} p(x_{N+1} \mid z_{N+1})\, p(z_{N+1} \mid X) \quad \text{(sum rule, prod. rule and HMM)}$$
$$= \frac{1}{p(X)} \sum_{z_{N+1}} p(x_{N+1} \mid z_{N+1}) \sum_{z_N} p(z_{N+1} \mid z_N)\, \alpha(z_N)$$

where $p(X) = \sum_{z_N} \alpha(z_N)$.
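A sketch of the prediction formula, reusing forward() from the earlier sketch:

```python
def predict_next(xs):
    """Return p(x_{N+1} | x1,...,xN) as a dict over the alphabet."""
    alpha_N = forward(xs)[-1]
    p_X = sum(alpha_N.values())                                # p(X)
    # p(z_{N+1} = k | X) = sum_zN alpha(zN) p(z_{N+1} = k | zN) / p(X)
    p_next_state = {k: sum(alpha_N[j] * A[j][k] for j in STATES) / p_X
                    for k in STATES}
    return {s: sum(p_next_state[k] * phi[k][s] for k in STATES)
            for s in SYMBOLS}

print(predict_next(list("GGCACTGAA")))
```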
Using HMMs (1) Determine the likelihood of the sequence of observations. (2) Predict the next observation in the sequence of observations. (3) Find the most likely underlying explanation of the sequence of observations. The Viterbi algorithm finds the most probable sequence of states generating the observations...
The Viterbi algorithm $\omega(z_n)$ is the probability of the most likely sequence of states $z_1, \ldots, z_n$ generating the observations $x_1, \ldots, x_n$. Intuition: find the "longest" path from column 1 to column n in the K x N trellis of states, where the length of a path is its total probability, i.e. the product of the probabilities of the transitions and emissions along the path...
Recursion: $\omega(z_n) = p(x_n \mid z_n) \max_{z_{n-1}} \omega(z_{n-1})\, p(z_n \mid z_{n-1})$

Basis: $\omega(z_1) = p(z_1)\, p(x_1 \mid z_1)$

Takes time $O(K^2 N)$ and space $O(KN)$ using memoization. The path itself can be retrieved in time $O(KN)$ by backtracking.
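A sketch of the Viterbi algorithm with backtracking, in the same style as the earlier sketches. A real implementation would work in log space to avoid underflow, which is the topic of the next lecture.

```python
def viterbi(xs):
    """Return the most likely state path z1,...,zN for the observations xs."""
    omega = [{k: pi[k] * phi[k][xs[0]] for k in STATES}]       # basis: omega(z1)
    back = []                                                  # argmax pointers
    for x in xs[1:]:
        prev = omega[-1]
        col, ptr = {}, {}
        for k in STATES:
            j = max(STATES, key=lambda s: prev[s] * A[s][k])   # best predecessor
            col[k] = phi[k][x] * prev[j] * A[j][k]             # the omega-recursion
            ptr[k] = j
        omega.append(col)
        back.append(ptr)
    z = max(STATES, key=lambda k: omega[-1][k])                # best final state
    path = [z]
    for ptr in reversed(back):                                 # backtrack
        z = ptr[z]
        path.append(z)
    return path[::-1]

print("".join(viterbi(list("GGCACTGAA"))))
```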
Another kind of decoding Recall that $\alpha(z_n)$ is the joint probability of observing $x_1, \ldots, x_n$ and being in state $z_n$, and $\beta(z_n)$ is the conditional probability of the future observations $x_{n+1}, \ldots, x_N$ given being in state $z_n$. Using $\alpha(z_n)$ and $\beta(z_n)$ we can also find the most likely state to be in at the n'th step:

$$z_n^* = \operatorname*{argmax}_{z_n} p(z_n \mid X) = \operatorname*{argmax}_{z_n} \frac{\alpha(z_n)\, \beta(z_n)}{p(X)}$$

Finding $z_1^*, \ldots, z_N^*$ is called posterior decoding.
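A sketch of posterior decoding, reusing forward() and backward() from the earlier sketches: for each position, pick the state with the largest posterior probability.

```python
def posterior_decode(xs):
    """Return z*_1,...,z*_N, choosing each state independently per position."""
    alpha, beta = forward(xs), backward(xs)
    p_X = sum(alpha[-1].values())
    return [max(STATES, key=lambda k: alpha[n][k] * beta[n][k] / p_X)
            for n in range(len(xs))]

print("".join(posterior_decode(list("GGCACTGAA"))))
```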
Viterbi vs. posterior decoding A sequence of states $z_1, \ldots, z_N$ where $p(x_1, \ldots, x_N, z_1, \ldots, z_N) > 0$ is a legal (or syntactically correct) decoding of X. Viterbi finds the most likely syntactically correct decoding of X. What does posterior decoding find? Does it always find a syntactically correct decoding of X?
Summary Introduced hidden Markov models (HMMs). The forward and backward algorithms for determining the likelihood of a sequence of observations, and for predicting the next observation in a sequence of observations. The Viterbi algorithm for finding the most likely underlying explanation (sequence of latent states) of a sequence of observations. Next: how to implement the basic algorithms (forward, backward, and Viterbi) in a numerically sound manner.