Hidden Markov Models in Language Processing
Dustin Hillard
Lecture notes courtesy of Prof. Mari Ostendorf

Outline
- Review of Markov models
- What is an HMM? Examples
- General idea of hidden variables: implications for inference and estimation
- Back to HMM details: the key questions
- Hidden-event language models

Goals: To understand the assumptions behind an HMM, so that you can decide when the model makes sense and when extensions make sense, as well as the cost of extensions.
Review: Markov Models

The next state $s_{i+1}$ is conditionally independent of the past given the current state $s_i$:

$$p(s_1, \ldots, s_T) = p(s_1) \prod_{i=2}^{T} p(s_i \mid s_1, \ldots, s_{i-1}) = p(s_1) \prod_{i=2}^{T} p(s_i \mid s_{i-1})$$

Illustration of Topology

State topology pictures depict constraints on transitions, i.e. the allowable transitions in the state space, with circles corresponding to state values. The two examples below are: a fully connected state space (as in a letter bigram with non-zero probabilities), and a strictly ordered model (as in the left-to-right model used in speech acoustic models).

Graphical Model Illustration of Time-Sequence

The picture below could apply to any topology. Each circle is a random variable $S_i$ at a particular time $i$.
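The factorization above can be sketched in a few lines of Python. This is a minimal illustration, not part of the notes: the two-state model and all of its probabilities are invented toy values.

```python
# Toy two-state Markov chain; all probabilities are invented
# illustration values, not from the lecture notes.
initial = {"A": 0.6, "B": 0.4}                # p(s_1)
transition = {"A": {"A": 0.7, "B": 0.3},      # p(s_i | s_{i-1})
              "B": {"A": 0.4, "B": 0.6}}

def markov_sequence_prob(states, initial, transition):
    """p(s_1, ..., s_T) = p(s_1) * prod_{i=2}^T p(s_i | s_{i-1})."""
    prob = initial[states[0]]
    for prev, cur in zip(states, states[1:]):
        prob *= transition[prev][cur]
    return prob

p = markov_sequence_prob(["A", "A", "B"], initial, transition)
# p = 0.6 * 0.7 * 0.3
```

Note that only the immediately preceding state is ever consulted in the loop, which is exactly the Markov assumption.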
Hidden Markov Models (HMMs)

In an HMM, we observe a sequence $o = o_1, \ldots, o_T$ that is associated with a state sequence we cannot observe, $s = s_1, \ldots, s_T$. The model describes the joint state and observation sequence:

$$p(s_1, \ldots, s_T, o_1, \ldots, o_T) = p(s_1) p(o_1 \mid s_1) \prod_{i=2}^{T} p(s_i \mid s_{i-1}) p(o_i \mid s_i)$$

and we get the probability of the observation sequence by marginalizing over the state sequences:

$$p(o_1, \ldots, o_T) = \sum_{s} p(o_1, \ldots, o_T, s_1, \ldots, s_T)$$

The key assumptions in an HMM are:
- The state sequence is Markov: $p(s_i \mid s_1, \ldots, s_{i-1}) = p(s_i \mid s_{i-1})$
- The observations are conditionally independent of past and future states and observations given the current state: $p(o_i \mid s_1, \ldots, s_T, o_1, \ldots, o_{i-1}, o_{i+1}, \ldots, o_T) = p(o_i \mid s_i)$

Aside: there are two types of HMMs, labeled states ($p(o_i \mid s_i)$) and labeled transitions ($p(o_i \mid s_{i-1}, s_i)$). You can convert from one to the other, so we will stick with the simpler and more popular labeled-state HMM framework.
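The joint probability and the marginalization can be made concrete with a small sketch. The two-state model below (states H and C emitting symbols 1–3) is a hypothetical toy, and the brute-force sum over all state sequences is only feasible because the example is tiny; the forward algorithm later in the notes does this efficiently.

```python
import itertools

# Toy labeled-state HMM; states, observations, and probabilities
# are invented illustration values.
states = ["H", "C"]
pi = {"H": 0.5, "C": 0.5}                       # p(s_1)
a = {"H": {"H": 0.7, "C": 0.3},                 # p(s_i | s_{i-1})
     "C": {"H": 0.4, "C": 0.6}}
b = {"H": {1: 0.1, 2: 0.4, 3: 0.5},             # p(o_i | s_i)
     "C": {1: 0.5, 2: 0.4, 3: 0.1}}

def joint_prob(s, o):
    """p(s, o) = p(s_1)p(o_1|s_1) * prod_i p(s_i|s_{i-1})p(o_i|s_i)."""
    prob = pi[s[0]] * b[s[0]][o[0]]
    for i in range(1, len(s)):
        prob *= a[s[i-1]][s[i]] * b[s[i]][o[i]]
    return prob

def marginal_prob(o):
    """p(o) = sum over all state sequences of p(s, o) (exponential cost)."""
    return sum(joint_prob(list(s), o)
               for s in itertools.product(states, repeat=len(o)))

p_joint = joint_prob(["H", "H"], [3, 1])    # 0.5 * 0.5 * 0.7 * 0.1
p_marg = marginal_prob([3, 1])
```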
We can extend the illustrations of Markov models to HMMs below.

Illustration of Topology

Using only the strictly ordered model for simplicity, each circle is a possible value in the state space as before, but we add dashed arrows to indicate generation of an observation.

Graphical Model Illustration of Time-Sequence

Again, the picture below could apply to any topology. Each empty circle is a random variable $S_i$ at a particular time $i$, and each filled circle is a random variable $O_i$ at time $i$.
HMM Examples in Language Processing

Some examples include part-of-speech tagging, identification of named entities in text, punctuation prediction, recognition of speech acts, ... More details for the first examples below.

Part-of-speech (POS) Tagging
- States are POS tags; $p(s_i \mid s_{i-1})$ describes POS sequence tendencies
- Observations are words; $p(o_i \mid s_i)$ describes word-tag relations

  $o_1, o_2, o_3, o_4$ = I saw the boy
  $s_1, s_2, s_3, s_4$ = pronoun verb det noun

It might be useful to have more complex representations of the words as observations to better model unknown words, i.e. $o_i$ not in a known vocabulary, since the observation space needs to be finite in order to specify $p(o_i \mid s_i)$. Consider possible observation sequences:

  I saw the zlrxl
  Twas brillig and the slithy toves did gyre and ...

We would probably say that zlrxl is a noun, slithy is an adjective, and toves is a plural noun, based on the endings of the words and/or the neighboring POS types. So we might want to expand $o_i$ to a vector with the word, ending, capitalization, etc. as elements of the observation vector.

Aside: There are some limitations of an HMM for POS tagging, so extensions or other models are more often used. For example, the POS tag might be better predicted by using the previous word and not just the previous tag, e.g. some verbs do not take a direct object.
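Tagging "I saw the boy" can be sketched as a search for the state sequence maximizing the joint probability. All tables below are hypothetical toy values invented for this illustration (with emission probability 0 for unlisted word-tag pairs), and the brute-force enumeration stands in for the Viterbi algorithm covered later in the notes.

```python
import itertools

# Hypothetical toy POS model for "I saw the boy"; all probabilities
# are invented for illustration. Missing entries are treated as 0.
tags = ["pronoun", "verb", "det", "noun"]
p_s1 = {"pronoun": 0.5, "verb": 0.1, "det": 0.3, "noun": 0.1}
p_trans = {"pronoun": {"verb": 0.7, "noun": 0.1},
           "verb": {"det": 0.5},
           "det": {"noun": 0.8},
           "noun": {"det": 0.2}}
p_emit = {"pronoun": {"I": 0.6},
          "verb": {"saw": 0.5},
          "det": {"the": 0.9},
          "noun": {"boy": 0.4, "saw": 0.1}}   # "saw" is tag-ambiguous

def tag_joint(s, words):
    """Joint probability p(s, o) for a tag sequence and word sequence."""
    prob = p_s1.get(s[0], 0.0) * p_emit.get(s[0], {}).get(words[0], 0.0)
    for i in range(1, len(s)):
        prob *= (p_trans.get(s[i - 1], {}).get(s[i], 0.0)
                 * p_emit.get(s[i], {}).get(words[i], 0.0))
    return prob

words = ["I", "saw", "the", "boy"]
best = max(itertools.product(tags, repeat=len(words)),
           key=lambda s: tag_joint(s, words))
```

Under these toy numbers the winner is the expected pronoun-verb-det-noun sequence, even though "saw" could also be emitted as a noun.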
Name Recognition
- States include {begin person (BP), cont person (CP), begin location (BL), cont location (CL), not a name (φ)}; $p(s_i \mid s_{i-1})$ describes name sequence tendencies
- Observations are words; $p(o_i \mid s_i)$ describes word-name relations

  $o_1, \ldots, o_7$ = We saw Senator Pat Johnson in Boston.
  $s_1, \ldots, s_7$ = φ φ φ BP CP φ BL

Again, it might be useful to have features for characterizing new words, and it might be useful to condition the predicted state and/or observation on both the previous word and state: $p(s_i \mid s_{i-1}, o_{i-1})$; $p(o_i \mid s_i, o_{i-1})$

Other sequence labeling problems...
- Sentence segmentation: a sequence of boundary vs. no-boundary decisions after every word
- Topic segmentation: a sequence of boundary vs. no-boundary decisions after every sentence
- Speech act tagging on the sequence of utterances in a conversation
- ...
Hidden Variables

Hidden variables are useful for:
- modeling different sources of variability, e.g. temporal variability in an HMM, or multimodal behavior in a mixture model (some random event that I can't observe determines the distribution), OR the hidden variables may be what you want to detect (as in tagging),
- learning distributions where some values are missing, as in missing labels for semi-supervised learning.

Key points in working with hidden variables:
- Computing P(observations) requires marginalizing (summing out) the hidden states.
- Parameter estimation is iterative, e.g. the EM algorithm for maximum likelihood estimation.
The EM Algorithm

Given training data $X = \{x_1, \ldots, x_T\}$ and model $p(x, s \mid \theta)$, the maximum likelihood parameter estimate requires

$$\arg\max_\theta \log p(X \mid \theta) = \arg\max_\theta \sum_{i=1}^{T} \log \sum_{s} p(x_i, s \mid \theta)$$

where the sum over the hidden variable complicates the solution.

Basic idea: directly maximizing the log likelihood $\log p(X \mid \theta)$ is too hard, but we can maximize $E[\log p(X, S \mid \theta) \mid X, \theta^{(l)}]$ and thereby indirectly maximize the log likelihood. There is still no closed-form solution, though, so we need to iterate:
- E-step: Find $Q(\theta \mid \theta^{(l)}) = E[\log p(X, S \mid \theta) \mid X, \theta^{(l)}]$
- M-step: Find $\theta^{(l+1)} = \arg\max_\theta Q(\theta \mid \theta^{(l)})$

For problems involving $p(x, s)$ in the exponential family (most everything we work with), the EM algorithm reduces to:
- E-step: Find the expected sufficient statistics $E[t^{(l)}]$ of the unobserved process.
- M-step: Use these in place of $t$ in the ML update formulas for $\theta^{(l+1)}$.

For many discrete hidden variable problems, the E-step involves estimating the probabilities of the different possible state values:

$$\gamma_t^{(l)}(j) = p(s_t = j \mid x, \theta^{(l)})$$
EM for Mixtures

In a mixture distribution, the index of the underlying mode is the hidden variable:

$$p(x) = \sum_{i=1}^{m} \lambda_i p_i(x) = \sum_{i=1}^{m} p(z = i) p(x \mid z = i)$$

Initialize: Provide $\{\lambda_i^{(0)}, p_i^{(0)}(x)\}$

Iterate:
- E-step: Compute
  $$\gamma_t^{(l)}(j) = p(z_t = j \mid x_t, \theta^{(l)}) = \frac{p^{(l)}(z_t = j, x_t)}{\sum_k p^{(l)}(z_t = k, x_t)}$$
- M-step: Update the component model parameters using $\gamma_t$-weighted observation statistics, and update the mixture weights using:
  $$\lambda_j^{(l+1)} = \frac{1}{T} \sum_t \gamma_t^{(l)}(j)$$
  (Essentially a relative frequency estimate using weighted counts.)

The mixture estimation algorithm works for all types of mixtures, with the difference being in the details of how the distributions $p_i(x)$ are estimated. Important examples include:
- Mixtures of language models (for topic and genre modeling) use weighted n-gram counts, e.g.
  $$c(w_a) = \sum_{t: w_t = w_a} \gamma_t(j)$$
- Gaussian mixtures use
  $$n_j = \sum_t \gamma_t(j); \quad \mu_j = \frac{1}{n_j} \sum_t \gamma_t(j) x_t; \quad \sigma_j^2 = \frac{1}{n_j} \sum_t \gamma_t(j) (x_t - \mu_j)^2$$

where I've dropped the $(l)$ superscript to simplify the notation.

Note: You can also keep the mixture components fixed and just estimate the mixture weights if the components come from other training data.
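One iteration of the Gaussian-mixture updates above can be sketched directly from the formulas. The 1-D data and the initial parameters are invented toy values; a real run would repeat `em_step` until the likelihood converges.

```python
import math

def normal_pdf(x, mu, sigma2):
    """1-D Gaussian density with mean mu and variance sigma2."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

def em_step(data, weights, mus, sigma2s):
    # E-step: gamma_t(j) = lambda_j p_j(x_t) / sum_k lambda_k p_k(x_t)
    gammas = []
    for x in data:
        joint = [w * normal_pdf(x, m, s2)
                 for w, m, s2 in zip(weights, mus, sigma2s)]
        total = sum(joint)
        gammas.append([g / total for g in joint])
    # M-step: weighted-count updates n_j, mu_j, sigma_j^2, lambda_j = n_j / T
    new_w, new_mu, new_s2 = [], [], []
    for j in range(len(weights)):
        n_j = sum(g[j] for g in gammas)
        mu_j = sum(g[j] * x for g, x in zip(gammas, data)) / n_j
        s2_j = sum(g[j] * (x - mu_j) ** 2 for g, x in zip(gammas, data)) / n_j
        new_w.append(n_j / len(data))
        new_mu.append(mu_j)
        new_s2.append(s2_j)
    return new_w, new_mu, new_s2

data = [0.0, 0.1, 3.9, 4.0]      # invented 1-D observations, two clear clusters
weights, mus, sigma2s = em_step(data, [0.5, 0.5], [0.0, 4.0], [1.0, 1.0])
```

With well-separated initial means, one step already assigns nearly all responsibility for each point to the nearer component.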
EM for HMMs

Define the HMM parameters:

$$\pi_j = p(s_1 = j); \quad a_{jk} = p(s_i = k \mid s_{i-1} = j); \quad b_j(o) = p(o \mid s_t = j)$$

E-step: Compute
$$\gamma_t^{(l)}(j) = p(s_t = j \mid o_1, \ldots, o_T, \theta^{(l)})$$
$$\xi_t^{(l)}(j, k) = p(s_{t-1} = j, s_t = k \mid o_1, \ldots, o_T, \theta^{(l)})$$

M-step: Update the component model parameters using $\gamma_t$-weighted observation statistics, and update the transitions using the $\xi_t$ terms. Using the same weighted-counts idea as for mixtures:

$$a_{jk}^{(l+1)} = \frac{\sum_t \xi_t^{(l)}(j, k)}{\sum_{k'} \sum_t \xi_t^{(l)}(j, k')}$$

The update for $\pi_j$ is similar, and the update for the observation distribution depends on the form of $b_j(o)$, but it also uses weighted observations (continuous $o$) or weighted counts (discrete $o$).

The model-related details are primarily in the E-step, which is more complicated than for the mixture model because of the sequential dependencies. So now, let's go back to the details of the HMM.
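The transition update is just row-normalization of expected counts. A tiny sketch, where the accumulated values $\sum_t \xi_t(j,k)$ are invented numbers standing in for what the E-step would produce:

```python
# Hypothetical expected transition counts xi_sum[j][k] = sum_t xi_t(j, k);
# the numbers are invented stand-ins for real E-step output.
xi_sum = {"H": {"H": 3.2, "C": 0.8},
          "C": {"H": 1.0, "C": 2.0}}

# M-step: a_jk = sum_t xi_t(j, k) / sum_{k'} sum_t xi_t(j, k')
a_new = {j: {k: xi_sum[j][k] / sum(xi_sum[j].values()) for k in xi_sum[j]}
         for j in xi_sum}
```

Each row of the updated matrix sums to one, as a conditional distribution must.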
Details for HMMs

Questions people ask about HMMs:
1. How do you compute $p(o)$?
2. What is the most likely state sequence, $\arg\max_s p(s \mid o)$?
3. What is the most likely state at a particular time $t$, $\arg\max_j p(s_t = j \mid o)$?
4. How do you estimate the parameters of an HMM?

Three key algorithms are needed to answer the questions.

Forward algorithm:
$$\alpha_t(k) = p(o_1, \ldots, o_t, s_t = k) = \left[ \sum_j \alpha_{t-1}(j) a_{jk} \right] b_k(o_t)$$

Viterbi algorithm:
$$\delta_t(k) = \max_{s_1, \ldots, s_{t-1}} p(o_1, \ldots, o_t, s_1, \ldots, s_{t-1}, s_t = k) = \left[ \max_j \delta_{t-1}(j) a_{jk} \right] b_k(o_t)$$

Backward algorithm:
$$\beta_t(j) = p(o_{t+1}, \ldots, o_T \mid s_t = j) = \sum_k a_{jk} b_k(o_{t+1}) \beta_{t+1}(k)$$
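The three recursions can be sketched for a small discrete HMM. The two-state model below is an invented toy; as a sanity check, $p(o)$ computed from the forward pass should match $\sum_j \pi_j b_j(o_1) \beta_1(j)$ from the backward pass.

```python
# Toy discrete HMM; states and probabilities are invented illustration values.
states = ["H", "C"]
pi = {"H": 0.5, "C": 0.5}
a = {"H": {"H": 0.7, "C": 0.3}, "C": {"H": 0.4, "C": 0.6}}
b = {"H": {1: 0.1, 2: 0.4, 3: 0.5}, "C": {1: 0.5, 2: 0.4, 3: 0.1}}

def forward(obs):
    """alpha_t(k) = [sum_j alpha_{t-1}(j) a_{jk}] b_k(o_t)."""
    alpha = [{k: pi[k] * b[k][obs[0]] for k in states}]
    for o in obs[1:]:
        prev = alpha[-1]
        alpha.append({k: sum(prev[j] * a[j][k] for j in states) * b[k][o]
                      for k in states})
    return alpha

def backward(obs):
    """beta_t(j) = sum_k a_{jk} b_k(o_{t+1}) beta_{t+1}(k)."""
    beta = [{j: 1.0 for j in states}]
    for o in reversed(obs[1:]):
        nxt = beta[0]
        beta.insert(0, {j: sum(a[j][k] * b[k][o] * nxt[k] for k in states)
                        for j in states})
    return beta

def viterbi(obs):
    """delta_t(k) = [max_j delta_{t-1}(j) a_{jk}] b_k(o_t), with traceback."""
    delta = [{k: pi[k] * b[k][obs[0]] for k in states}]
    back = []
    for o in obs[1:]:
        prev = delta[-1]
        bp = {k: max(states, key=lambda j: prev[j] * a[j][k]) for k in states}
        back.append(bp)
        delta.append({k: prev[bp[k]] * a[bp[k]][k] * b[k][o] for k in states})
    path = [max(states, key=lambda k: delta[-1][k])]  # best final state
    for bp in reversed(back):                         # trace back
        path.insert(0, bp[path[0]])
    return path

obs = [3, 1, 3]
alpha, beta = forward(obs), backward(obs)
p_fwd = sum(alpha[-1][j] for j in states)
p_bwd = sum(pi[j] * b[j][obs[0]] * beta[0][j] for j in states)
path = viterbi(obs)
```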
These algorithms answer the questions as follows:
1. Use the forward algorithm:
$$p(o) = p(o_1, \ldots, o_T) = \sum_j p(o_1, \ldots, o_T, s_T = j) = \sum_j \alpha_T(j)$$
2. Use the Viterbi algorithm, keeping track of the maximizing $j$ at each time step, then trace back from the final best state.
3. Use the forward and backward algorithms to find
$$\gamma_t(j) = p(s_t = j \mid o_1, \ldots, o_T) = \frac{p(s_t = j, o_1, \ldots, o_T)}{p(o)} = \frac{\alpha_t(j) \beta_t(j)}{p(o)}$$
and use $\gamma_t(j) = p(s_t = j \mid o)$ in finding the argmax.
4. Use the forward and backward algorithms in the E-step to find $\gamma_t(j)$ as above and
$$\xi_t(j, k) = p(s_{t-1} = j, s_t = k \mid o_1, \ldots, o_T) = \frac{\alpha_{t-1}(j) a_{jk} b_k(o_t) \beta_t(k)}{p(o)}$$
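For a state space small enough to enumerate, the posterior $\gamma_t(j)$ in item 3 can be checked by brute force: sum the joint probability over all state sequences whose $t$-th state is $j$, then divide by $p(o)$. The toy model here uses invented values, and the enumeration is only practical because the example is tiny.

```python
import itertools

# Toy discrete HMM; all values invented for illustration.
states = ["H", "C"]
pi = {"H": 0.5, "C": 0.5}
a = {"H": {"H": 0.7, "C": 0.3}, "C": {"H": 0.4, "C": 0.6}}
b = {"H": {1: 0.1, 2: 0.4, 3: 0.5}, "C": {1: 0.5, 2: 0.4, 3: 0.1}}

def joint(s, o):
    """p(s, o) for one full state sequence."""
    p = pi[s[0]] * b[s[0]][o[0]]
    for i in range(1, len(s)):
        p *= a[s[i - 1]][s[i]] * b[s[i]][o[i]]
    return p

obs = [3, 1, 3]
seqs = list(itertools.product(states, repeat=len(obs)))
p_o = sum(joint(s, obs) for s in seqs)

# gamma_t(j) = p(s_t = j | o): sum the joint over sequences with s_t = j
gamma = [{j: sum(joint(s, obs) for s in seqs if s[t] == j) / p_o
          for j in states}
         for t in range(len(obs))]
```

At each time $t$ the posteriors sum to one, matching what $\alpha_t(j)\beta_t(j)/p(o)$ would give.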
Hidden-Event Language Models

Hidden-event language models represent a hidden event $e_i$ (such as a sentence or topic boundary) after every word $w_i$. The model is a bit like an HMM because of the hidden events, but the observation probabilities are based on n-gram language models. The bigram case would be:

$$p(w, e) = p(w_1, e_1, w_2, e_2, \ldots, w_T, e_T) = p(w_1) p(e_1 \mid w_1) \prod_{t=2}^{T} p(w_t \mid w_{t-1}, e_{t-1}) p(e_t \mid w_t, w_{t-1}, e_{t-1})$$

The $p(e_t \mid \cdot)$ terms are the hidden state sequence model, which differs from an HMM in that the state transitions depend on the words; the $p(w_t \mid \cdot)$ terms are the observation model, which is an event-dependent language model.

For many problems in speech processing, it is useful to combine this model with another observation model (e.g. $p(f_t \mid e_t, w_t)$) that characterizes acoustic information $f_t$.

The hidden-event language model can be designed with the SRILM toolkit using the hidden-ngram command.
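The bigram factorization above can be evaluated on a single word/event sequence as a short sketch. The event inventory (B = boundary, N = no boundary), the example words, and every probability below are invented toy values, and this is only an illustration of the factorization, not the SRILM implementation.

```python
# Invented toy tables for the bigram hidden-event factorization;
# B = boundary event, N = no-boundary event.
p_w1 = {"yeah": 0.1}                                    # p(w_1)
p_e1 = {"yeah": {"B": 0.7, "N": 0.3}}                   # p(e_1 | w_1)
p_w = {("yeah", "B"): {"right": 0.2}}                   # p(w_t | w_{t-1}, e_{t-1})
p_e = {("right", "yeah", "B"): {"B": 0.4, "N": 0.6}}    # p(e_t | w_t, w_{t-1}, e_{t-1})

def hidden_event_joint(words, events):
    """p(w, e) = p(w_1) p(e_1|w_1) * prod_t p(w_t|...) p(e_t|...)."""
    prob = p_w1[words[0]] * p_e1[words[0]][events[0]]
    for t in range(1, len(words)):
        prob *= (p_w[(words[t - 1], events[t - 1])][words[t]]
                 * p_e[(words[t], words[t - 1], events[t - 1])][events[t]])
    return prob

p = hidden_event_joint(["yeah", "right"], ["B", "N"])
# p = 0.1 * 0.7 * 0.2 * 0.6
```

Decoding the most likely event sequence would then proceed with a Viterbi-style search over the events, since the words are observed.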