1 Hidden Markov Models

Katrina Harper

Hidden Markov models are used to model phenomena in areas as diverse as speech recognition, financial risk management, the gating of ion channels and gene-finding. They are in fact Markov chain models for a given phenomenon, where the states of the chain are only partially observed or observed with error. The hidden nature of the Markov chain arises because many systems can be thought of as evolving as a Markov process (in time or space), provided that the state space is chosen to contain enough information to ensure that the jumps of the process are indeed determined by the current state only. This may necessitate including unobservable quantities in the states.

The popularity of hidden Markov models is also partly explained by the existence of famous algorithms to compute likelihood-based quantities. In fact, the EM-algorithm was first invented in the context of hidden Markov models for speech recognition.

    Y_1 --> Y_2 --> Y_3 --> ... --> Y_{n-1} --> Y_n
     |       |       |                 |          |
     v       v       v                 v          v
    X_1     X_2     X_3     ...     X_{n-1}      X_n

Figure 1.1. Graphical representation of a hidden Markov model. The state variables $Y_1, Y_2, \ldots$ form a Markov chain, but are unobserved. The variables $X_1, X_2, \ldots$ are observable outputs of the chain. Arrows indicate conditional dependence relations. Given the state $Y_i$ the variable $X_i$ is independent of all other variables.

Figure 1.1 gives a graphical representation of a hidden Markov model. The sequence $Y_1, Y_2, \ldots$ forms a Markov chain, and is referred to as the sequence of state variables. This sequence of variables is not observed ("hidden"). The variables $X_1, X_2, \ldots$ are observable, with $X_i$ viewed as the output of the system at time $i$. Besides the Markov property of the sequence $Y_1, Y_2, \ldots$ (relative to its own history), it is assumed that given $Y_i$ the variable $X_i$ is conditionally independent of all other variables $(Y_1, X_1, \ldots, Y_{i-1}, X_{i-1}, Y_{i+1}, X_{i+1}, \ldots, Y_n, X_n)$. Thus the output at time $i$ depends on the value of the state variable $Y_i$ only.

We can describe the distribution of $Y_1, X_1, \ldots, Y_n, X_n$ completely by:
- the density $\pi$ of $Y_1$ (giving the initial distribution of the chain).
- the set of transition densities $(y_{i-1}, y_i) \mapsto p_i(y_i \mid y_{i-1})$ of the Markov chain.
- the set of output densities $(x_i, y_i) \mapsto q_i(x_i \mid y_i)$.

The transition densities and output densities may be time-independent ($p_i = p$ and $q_i = q$ for fixed transition densities $p$ and $q$), but this is not assumed.

It is easy to write down the likelihood of the complete set of variables $Y_1, X_1, \ldots, Y_n, X_n$:

$$\pi(y_1)\, p_2(y_2 \mid y_1) \cdots p_n(y_n \mid y_{n-1})\; q_1(x_1 \mid y_1) \cdots q_n(x_n \mid y_n).$$

However, for likelihood inference the marginal density of the outputs $X_1, \ldots, X_n$ is the relevant density. This is obtained by integrating or summing out the hidden states $Y_1, \ldots, Y_n$. This is conceptually easy, but the $n$-dimensional integral may be hard to handle numerically.

The full likelihood has three components, corresponding to the initial density $\pi$, the transition densities of the chain and the output densities. In typical applications these three components are parametrized with three different parameters, which range independently. For the case of discrete state and output spaces, and under the assumption of stationary transitions and outputs, the three components are often not modelled at all: $\pi$ is an arbitrary density, and $p_i = p$ and $q_i = q$ are arbitrary transition densities.
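To make the complete likelihood and the cost of summing out the hidden states concrete, here is a minimal Python sketch for finite state and output spaces with time-independent $p$ and $q$. The function names, the table layout (`p[z][y]` for $p(y \mid z)$, `q[y][x]` for $q(x \mid y)$) and the numerical values are assumptions made for illustration only; they do not come from the text.

```python
from itertools import product

def complete_likelihood(pi, p, q, ys, xs):
    """Joint probability of the hidden path ys and the outputs xs."""
    prob = pi[ys[0]] * q[ys[0]][xs[0]]
    for i in range(1, len(xs)):
        prob *= p[ys[i - 1]][ys[i]] * q[ys[i]][xs[i]]
    return prob

def marginal_likelihood_bruteforce(pi, p, q, xs):
    """Sum the complete likelihood over all K**n hidden paths."""
    states = range(len(pi))
    return sum(complete_likelihood(pi, p, q, ys, xs)
               for ys in product(states, repeat=len(xs)))

# Hypothetical two-state, two-output model (illustrative values only).
pi = [0.6, 0.4]
p = [[0.7, 0.3], [0.4, 0.6]]   # p[z][y] = P(Y_i = y | Y_{i-1} = z)
q = [[0.9, 0.1], [0.2, 0.8]]   # q[y][x] = P(X_i = x | Y_i = y)
xs = [0, 1, 1]
print(marginal_likelihood_bruteforce(pi, p, q, xs))
```

The brute-force sum visits every one of the $K^n$ paths, which is only feasible for very short sequences; this is the numerical difficulty referred to above.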
If the Markov chain is stationary in time, then the initial density $\pi$ is typically a function of the transition density $p$.

Baum-Welch Algorithm

The Baum-Welch algorithm is the special case of the EM-algorithm for hidden Markov models. Historically it was the first example of an EM-algorithm.

Suppose that initial estimates $\pi$, $p_i$ and $q_i$ are given. The E-step of the EM-algorithm requires that we compute, with the conditional expectation taken under these initial estimates,

$$E\Bigl(\log\Bigl[\pi(Y_1) \prod_{i=2}^n p_i(Y_i \mid Y_{i-1}) \prod_{i=1}^n q_i(X_i \mid Y_i)\Bigr] \Bigm| X_1, \ldots, X_n\Bigr)$$
$$= E\bigl(\log \pi(Y_1) \mid X_1, \ldots, X_n\bigr) + \sum_{i=2}^n E\bigl(\log p_i(Y_i \mid Y_{i-1}) \mid X_1, \ldots, X_n\bigr) + \sum_{i=1}^n E\bigl(\log q_i(X_i \mid Y_i) \mid X_1, \ldots, X_n\bigr). \tag{1.1}$$

To compute the right side we need the conditional distributions of $Y_i$ and of the pairs $(Y_{i-1}, Y_i)$ given $X_1, \ldots, X_n$ only, expressed in the initial guesses $\pi$, $p_i$ and $q_i$. It is shown below that these conditional distributions can be computed by simple recursive formulas. The computation of the distribution of $Y_i$ given the observations is known as smoothing, and in the special case that $i = n$ also as filtering.

The M-step of the EM-algorithm requires that the right side of the preceding display be maximized over the parameters, which we may take to be $\pi$, $p_i$ and $q_i$ themselves, provided that we remember that the parameters must be restricted to their respective parameter spaces. As this step depends on the specific models used, there is no general recipe. However, if the three types of parameters $\pi$, $p_i$ and $q_i$ range independently over their respective parameter spaces, then the maximization can be performed separately, using the appropriate term of the right side of (1.1) (provided the maxima are finite).

1.2 Example (Stationary transitions, nonparametric model). Assume that $p_i = p$ and $q_i = q$ for fixed but arbitrary transition densities $p$ and $q$, and no other restrictions are placed on the model. The three terms on the right side of (1.1) can be written

$$\int \log \pi(y)\; p^{Y_1 \mid X_1, \ldots, X_n}(y)\, d\mu(y),$$
$$\int \Bigl[\int \log p(v \mid u) \Bigl(\sum_{i=2}^n p^{Y_{i-1}, Y_i \mid X_1, \ldots, X_n}(u, v)\Bigr) d\mu(v)\Bigr] d\mu(u),$$
$$\int \Bigl[\sum_{x \in \mathcal{X}} \log q(x \mid y) \Bigl(\sum_{i: X_i = x} p^{Y_i \mid X_1, \ldots, X_{i-1}, X_i = x, X_{i+1}, \ldots, X_n}(y)\Bigr)\Bigr] d\mu(y).$$

The first expression is, up to sign and an additive constant that does not depend on $\pi$, the Kullback-Leibler divergence between the distribution of $Y_1$ given $X_1, \ldots, X_n$ and the density $\pi$. Without restrictions on $\pi$ (other than that $\pi$ is a probability density) it is maximized by taking $\pi$ equal to the density of the second distribution, $\pi(y) = p^{Y_1 \mid X_1, \ldots, X_n}(y)$. The inner integral (within square brackets) in the second expression is, for fixed $u$, a divergence of the same type between the density $v \mapsto p(v \mid u)$ and the function given by the sum (between round brackets), viewed as a function of $v$.
Thus by the same argument this expression is maximized over arbitrary transition densities $p$ by

$$p(v \mid u) = \frac{\sum_{i=2}^n p^{Y_{i-1}, Y_i \mid X_1, \ldots, X_n}(u, v)}{\sum_{i=2}^n p^{Y_{i-1} \mid X_1, \ldots, X_n}(u)}.$$

The sum (in square brackets) in the third expression of the display can be viewed, for fixed $y$, also as a divergence, and hence by the same argument this term is maximized by

$$q(x \mid y) = \frac{\sum_{i: X_i = x} p^{Y_i \mid X_1, \ldots, X_{i-1}, X_i = x, X_{i+1}, \ldots, X_n}(y)}{\sum_{x \in \mathcal{X}} \sum_{i: X_i = x} p^{Y_i \mid X_1, \ldots, X_{i-1}, X_i = x, X_{i+1}, \ldots, X_n}(y)}.$$
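The three update formulas of this example can be sketched in Python for finite state and output spaces. This is an illustrative sketch, not a reference implementation: the names `forward_backward` and `baum_welch_step`, the table layout (`p[u][v]` for $p(v \mid u)$, `q[y][x]` for $q(x \mid y)$) and the starting values are assumptions. The conditional distributions are obtained from the forward and backward recursions of the smoothing section, without any rescaling against numerical underflow.

```python
def forward_backward(pi, p, q, xs):
    """Return alpha, beta and the likelihood of the observed outputs xs."""
    K, n = len(pi), len(xs)
    alpha = [[pi[y] * q[y][xs[0]] for y in range(K)]]
    for i in range(1, n):
        alpha.append([q[y][xs[i]] * sum(p[z][y] * alpha[i - 1][z] for z in range(K))
                      for y in range(K)])
    beta = [[1.0] * K for _ in range(n)]
    for i in range(n - 2, -1, -1):
        beta[i] = [sum(p[y][z] * q[z][xs[i + 1]] * beta[i + 1][z] for z in range(K))
                   for y in range(K)]
    return alpha, beta, sum(alpha[-1])

def baum_welch_step(pi, p, q, xs):
    """One EM update of (pi, p, q); also returns the likelihood at the old values."""
    K, M, n = len(pi), len(q[0]), len(xs)
    alpha, beta, lik = forward_backward(pi, p, q, xs)
    # gamma[i][y]: conditional probability that the state at time i is y.
    gamma = [[alpha[i][y] * beta[i][y] / lik for y in range(K)] for i in range(n)]
    # pair_sum[u][v]: sum over i of the conditional probability of (u, v) at (i-1, i).
    pair_sum = [[sum(alpha[i - 1][u] * p[u][v] * q[v][xs[i]] * beta[i][v]
                     for i in range(1, n)) / lik
                 for v in range(K)] for u in range(K)]
    new_pi = gamma[0][:]
    new_p = [[pair_sum[u][v] / sum(pair_sum[u]) for v in range(K)] for u in range(K)]
    new_q = [[sum(gamma[i][y] for i in range(n) if xs[i] == x)
              / sum(gamma[i][y] for i in range(n))
              for x in range(M)] for y in range(K)]
    return new_pi, new_p, new_q, lik

# Hypothetical starting values and data (illustrative only).
pi = [0.5, 0.5]
p = [[0.6, 0.4], [0.3, 0.7]]
q = [[0.8, 0.2], [0.3, 0.7]]
xs = [0, 1, 1, 0, 0, 1, 0]
for _ in range(3):
    pi, p, q, lik = baum_welch_step(pi, p, q, xs)
    print(lik)  # likelihood at the parameters before each update
```

Because each update is an EM step, the printed likelihoods should not decrease from one iteration to the next.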

These expressions can be evaluated using the formulas for the conditional distributions of $Y_{i-1}$ and $(Y_{i-1}, Y_i)$ given $X_1, \ldots, X_n$, given below.

Smoothing

The algorithm for obtaining formulas for the conditional distributions of the variables $Y_{i-1}, Y_i$ given the observations $X_1, \ldots, X_n$ is expressed in the functions

$$\alpha_i(y) := P(X_1 = x_1, \ldots, X_i = x_i, Y_i = y),$$
$$\beta_i(y) := P(X_{i+1} = x_{i+1}, \ldots, X_n = x_n \mid Y_i = y).$$

Here the values of $x_1, \ldots, x_n$ have been omitted from the notation on the left, as these can be considered fixed at the observed values throughout. These functions can be computed recursively in $i$ by a forward algorithm (for the $\alpha$'s) and a backward algorithm (for the $\beta$'s), starting from the initial expressions

$$\alpha_1(y) = \pi(y)\, q_1(x_1 \mid y), \qquad \beta_n(y) = 1.$$

The forward algorithm is to write

$$\alpha_{i+1}(y) = \sum_z P(x_1, \ldots, x_{i+1}, Y_{i+1} = y, Y_i = z) = \sum_z q_{i+1}(x_{i+1} \mid y)\, p_{i+1}(y \mid z)\, \alpha_i(z).$$

Here the argument $x_i$ within $P(\cdot)$ is shorthand for the event $X_i = x_i$. The backward algorithm is given by

$$\beta_i(y) = \sum_z P(x_{i+1}, \ldots, x_n \mid Y_{i+1} = z, Y_i = y)\, P(Y_{i+1} = z \mid Y_i = y) = \sum_z q_{i+1}(x_{i+1} \mid z)\, \beta_{i+1}(z)\, p_{i+1}(z \mid y).$$

Given the set of all $\alpha$'s and $\beta$'s we may now obtain the likelihood of the observed data as

$$P(X_1 = x_1, \ldots, X_n = x_n) = \sum_y \alpha_n(y).$$

The conditional distributions of $Y_i$ and $(Y_{i-1}, Y_i)$ given $X_1, \ldots, X_n$ are the joint distributions divided by this likelihood. The second joint distribution can be written

$$P(Y_{i-1} = y, Y_i = z, x_1, \ldots, x_n) = P(Y_i = z, x_i, \ldots, x_n \mid Y_{i-1} = y, x_1, \ldots, x_{i-1})\, P(Y_{i-1} = y, x_1, \ldots, x_{i-1})$$
$$= P(x_{i+1}, \ldots, x_n \mid Y_i = z, x_i)\, P(Y_i = z, x_i \mid Y_{i-1} = y, x_1, \ldots, x_{i-1})\, \alpha_{i-1}(y)$$
$$= \beta_i(z)\, q_i(x_i \mid z)\, p_i(z \mid y)\, \alpha_{i-1}(y).$$

By a similar argument we see that

$$P(Y_i = y, x_1, \ldots, x_n) = P(x_{i+1}, \ldots, x_n \mid Y_i = y, x_1, \ldots, x_i)\, P(Y_i = y, x_1, \ldots, x_i) = \beta_i(y)\, \alpha_i(y).$$
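The forward and backward recursions translate almost line by line into code. The following Python sketch assumes finite state and output spaces and time-independent $p$ and $q$; the function names, the table layout (`p[z][y]` for $p(y \mid z)$, `q[y][x]` for $q(x \mid y)$) and the numbers are illustrative assumptions. No rescaling is applied, so for long sequences the values may underflow.

```python
def forward(pi, p, q, xs):
    """alpha[i][y] = P(x_1..x_{i+1}, state y at that time); i is 0-based."""
    K = len(pi)
    alpha = [[pi[y] * q[y][xs[0]] for y in range(K)]]
    for i in range(1, len(xs)):
        prev = alpha[-1]
        alpha.append([q[y][xs[i]] * sum(p[z][y] * prev[z] for z in range(K))
                      for y in range(K)])
    return alpha

def backward(p, q, xs):
    """beta[i][y] = P(outputs after time i | state y at time i); i is 0-based."""
    K, n = len(p), len(xs)
    beta = [[1.0] * K for _ in range(n)]
    for i in range(n - 2, -1, -1):
        beta[i] = [sum(q[z][xs[i + 1]] * beta[i + 1][z] * p[y][z] for z in range(K))
                   for y in range(K)]
    return beta

def smooth(pi, p, q, xs):
    """Likelihood of xs and the smoothing probabilities of each state at each time."""
    alpha, beta = forward(pi, p, q, xs), backward(p, q, xs)
    lik = sum(alpha[-1])
    gamma = [[a * b / lik for a, b in zip(al, be)] for al, be in zip(alpha, beta)]
    return lik, gamma

# Hypothetical two-state, two-output model (illustrative values only).
pi = [0.6, 0.4]
p = [[0.7, 0.3], [0.4, 0.6]]
q = [[0.9, 0.1], [0.2, 0.8]]
lik, gamma = smooth(pi, p, q, [0, 1, 1])
print(lik, gamma)
```

Each recursion costs on the order of $nK^2$ operations for $K$ states, compared with $K^n$ terms when the hidden path is summed out directly.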

EXERCISE. Show that $P(X_1 = x_1, \ldots, X_n = x_n) = \sum_y \alpha_i(y)\beta_i(y)$ for every $i = 1, \ldots, n$.

Viterbi Algorithm

The Viterbi algorithm computes the most likely sample path of the hidden Markov chain $Y_1, \ldots, Y_n$ given the observed outputs. It is a backward programming algorithm, also known as dynamic programming.

The most likely path is the vector $(y_1, \ldots, y_n)$ that maximizes the conditional probability

$$(y_1, \ldots, y_n) \mapsto P(Y_1 = y_1, \ldots, Y_n = y_n \mid X_1, \ldots, X_n).$$

This is of interest in some applications. It should be noted that there may be many possible paths consistent with the observations, and the most likely path may be nonunique, or not very likely and only slightly more likely than many other paths. Thus for many applications use of the full conditional distribution of the hidden states given the observations (obtained in the smoothing algorithm) is preferable.
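As an illustration, the most likely path can be found by dynamic programming as in the Python sketch below. The text does not spell out the recursion, so this uses one common formulation (a forward maximization pass followed by backtracking), which may differ in direction from the version the notes have in mind. The function name, the table layout (`p[z][y]` for $p(y \mid z)$, `q[y][x]` for $q(x \mid y)$) and the numbers are assumptions; ties are broken in favour of the lowest state index.

```python
def viterbi(pi, p, q, xs):
    """Return a most likely hidden path and its joint probability with xs."""
    K = len(pi)
    # delta[y]: largest joint probability of any path ending in state y so far.
    delta = [pi[y] * q[y][xs[0]] for y in range(K)]
    back = []  # back[i][y]: best predecessor of state y at step i + 1
    for i in range(1, len(xs)):
        ptr, new = [], []
        for y in range(K):
            z_best = max(range(K), key=lambda z: delta[z] * p[z][y])
            ptr.append(z_best)
            new.append(delta[z_best] * p[z_best][y] * q[y][xs[i]])
        back.append(ptr)
        delta = new
    y_last = max(range(K), key=lambda y: delta[y])
    path = [y_last]
    for ptr in reversed(back):  # trace the stored predecessors backwards
        path.append(ptr[path[-1]])
    path.reverse()
    return path, delta[y_last]

# Hypothetical two-state, two-output model (illustrative values only).
pi = [0.6, 0.4]
p = [[0.7, 0.3], [0.4, 0.6]]
q = [[0.9, 0.1], [0.2, 0.8]]
path, prob = viterbi(pi, p, q, [0, 1, 1])
print(path, prob)
```

Maximizing the joint probability of path and outputs is equivalent to maximizing the conditional probability given the outputs, since the two differ only by the likelihood of the observed data, which does not depend on the path.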
