Statistical Methods for NLP: Sequence Models
Joakim Nivre
Uppsala University, Department of Linguistics and Philology
joakim.nivre@lingfil.uu.se

Introduction: Structured Classification

In many NLP tasks, the output (and the input) is structured:
- Part-of-speech tagging
  - Input: sequence X_1, ..., X_n of words
  - Output: sequence Y_1, ..., Y_n of tags
- Syntactic parsing
  - Input: sequence X_1, ..., X_n of words
  - Output: parse tree Y consisting of nodes, edges and labels

Models for structured classification:
- Sequence models (today)
- Stochastic grammars (lecture 6)

Introduction: Sequence Classification

Why not just take one element at a time?

  for i = 1 to n do
    y_i := argmax_y P(Y_i = y | X_i)

Because the desired solution may be:

  argmax_{y_1, ..., y_n} P(Y_1 = y_1, ..., Y_n = y_n | X_1, ..., X_n)

Two approaches:
- Local optimization: greedy method, may be suboptimal (see the sketch below)
- Global optimization: requires independence assumptions

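To make the difference concrete, here is a minimal sketch (not from the slides; the toy joint distribution is invented for illustration) where per-position greedy decoding misses the jointly most probable tag sequence:

```python
# Invented toy joint distribution P(Y1 = y1, Y2 = y2 | X) over tag pairs.
joint = {
    ("nn", "vb"): 0.40,  # the jointly most probable sequence
    ("vb", "nn"): 0.35,
    ("vb", "vb"): 0.25,
}

def greedy_tag(position):
    """Pick the tag with the highest marginal probability at this position."""
    marginal = {}
    for seq, p in joint.items():
        marginal[seq[position]] = marginal.get(seq[position], 0.0) + p
    return max(marginal, key=marginal.get)

greedy = (greedy_tag(0), greedy_tag(1))  # ("vb", "vb"), joint prob. 0.25
best = max(joint, key=joint.get)         # ("nn", "vb"), joint prob. 0.40
print("greedy:", greedy, "global:", best)
```
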
Hidden Markov Models

Markov models are probabilistic sequence models used for problems such as:
1. Speech recognition
2. Spell checking
3. Part-of-speech tagging
4. Named entity recognition

A (discrete) Markov model runs through a sequence of states, emitting signals. If the state sequence cannot be determined from the signal sequence, the model is said to be hidden.

Hidden Markov Models: Definition

A Markov model consists of five elements:
1. A finite set of states Ω = {s_1, ..., s_k}.
2. A finite signal alphabet Σ = {σ_1, ..., σ_m}.
3. Initial probabilities P(s) (for every s ∈ Ω), defining the probability of starting in state s.
4. Transition probabilities P(s_i | s_j) (for every (s_i, s_j) ∈ Ω²), defining the probability of going from state s_j to state s_i.
5. Emission probabilities P(σ | s) (for every (σ, s) ∈ Σ × Ω), defining the probability of emitting signal σ in state s.

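As a concrete illustration (not part of the slides), the five elements can be represented with plain Python dictionaries. The states and signals anticipate the dt/nn/vb tagging example later in the lecture, but the probability values are invented placeholders:

```python
# A toy HMM in dictionary form; probability values are invented placeholders.
states = ["dt", "nn", "vb"]                      # Omega: set of states
signals = ["the", "can", "smells"]               # Sigma: signal alphabet
initial = {"dt": 0.8, "nn": 0.1, "vb": 0.1}      # P(s): initial probabilities
transition = {                                   # transition[s][s'] = P(s' | s)
    "dt": {"dt": 0.0, "nn": 0.9, "vb": 0.1},
    "nn": {"dt": 0.0, "nn": 0.3, "vb": 0.7},
    "vb": {"dt": 0.2, "nn": 0.6, "vb": 0.2},
}
emission = {                                     # emission[s][sigma] = P(sigma | s)
    "dt": {"the": 1.0, "can": 0.0, "smells": 0.0},
    "nn": {"the": 0.0, "can": 0.6, "smells": 0.4},
    "vb": {"the": 0.0, "can": 0.5, "smells": 0.5},
}
```

The later sketches in this deck assume this dictionary representation.
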
Hidden Markov Models: A Simple HMM

[Figure: state diagram of a simple HMM; the image itself is not recoverable from the extraction.]

Hidden Markov Models: Markov Assumptions

State transitions are assumed to be independent of everything except the current state:

  P(s_1, ..., s_n) = P(s_1) ∏_{i=1}^{n-1} P(s_{i+1} | s_i)

Signal emissions are assumed to be independent of everything except the current state:

  P(s_1, ..., s_n, σ_1, ..., σ_n) = P(s_1, ..., s_n) ∏_{i=1}^{n} P(σ_i | s_i)

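A minimal sketch of the two factorizations as code, assuming the dictionary representation from the sketch above:

```python
def joint_probability(state_seq, signal_seq, initial, transition, emission):
    """P(s_1, ..., s_n, sigma_1, ..., sigma_n) under the Markov assumptions."""
    # P(s_1, ..., s_n) = P(s_1) * prod_i P(s_{i+1} | s_i)
    p = initial[state_seq[0]]
    for prev, cur in zip(state_seq, state_seq[1:]):
        p *= transition[prev][cur]
    # ... * prod_i P(sigma_i | s_i)
    for s, sigma in zip(state_seq, signal_seq):
        p *= emission[s][sigma]
    return p

# Example with the toy model above:
# joint_probability(["dt", "nn", "vb"], ["the", "can", "smells"], ...)
```
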
Hidden Markov Models: Observation Sequences

The probability of a signal sequence is obtained by summing over state sequences:

  P(σ_1, ..., σ_n) = Σ_{(s_1, ..., s_n) ∈ Ω^n} P(s_1, ..., s_n, σ_1, ..., σ_n)

Looks familiar? The HMM states are hidden variables, so we can use EM to train an HMM on unlabeled data.

Hidden Markov Models: Problems for HMMs

Optimal state sequence:

  argmax_{s_1, ..., s_n} P(s_1, ..., s_n, σ_1, ..., σ_n)

Probability of signal sequence:

  P(σ_1, ..., σ_n) = Σ_{(s_1, ..., s_n) ∈ Ω^n} P(s_1, ..., s_n, σ_1, ..., σ_n)

Expected counts for hidden variables (for EM):

  E[C(s)] = P(σ_1, ..., σ_n, s_1 = s)
  E[C(s, s')] = Σ_{i=1}^{n-1} P(σ_1, ..., σ_n, s_i = s, s_{i+1} = s')
  E[C(s, σ)] = Σ_{i: σ_i = σ} P(σ_1, ..., σ_n, s_i = s)

Hidden Markov Models: Problems for HMMs (continued)

Difficulty: summing (or maximizing) over all possible state sequences, whose number |Ω|^n grows exponentially with n.

Key observation: a solution of size n contains a solution of size n-1. For the probability of a signal sequence:

  P(σ_1, ..., σ_n) = Σ_{s_n ∈ Ω} P(σ_1, ..., σ_n, s_n)
  P(σ_1, ..., σ_i, s_i) = Σ_{s_{i-1} ∈ Ω} P(σ_i | s_i) P(s_i | s_{i-1}) P(σ_1, ..., σ_{i-1}, s_{i-1})

Hence dynamic programming algorithms are applicable.

Hidden Markov Models: Algorithms for HMMs

Optimal state sequence (Viterbi):

  viterbi(i, s) = max_{s_1, ..., s_{i-1}} P(σ_1, ..., σ_i, s_1, ..., s_{i-1}, s_i = s)
  viterbi(1, s) = P(s) P(σ_1 | s)
  viterbi(i, s) = max_{s'} viterbi(i-1, s') P(s | s') P(σ_i | s)
  max_s viterbi(n, s) = max_{s_1, ..., s_n} P(σ_1, ..., σ_n, s_1, ..., s_n)

Note: the equations above compute the max; to get the argmax, we store back-pointers to the best states (see the sketch below).

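A minimal Python sketch of the Viterbi recurrence, assuming the dictionary model representation from the earlier sketch; back-pointers recover the argmax:

```python
def viterbi_decode(signal_seq, states, initial, transition, emission):
    """Most probable state sequence for signal_seq, and its joint probability."""
    # viterbi(1, s) = P(s) P(sigma_1 | s)
    v = [{s: initial[s] * emission[s][signal_seq[0]] for s in states}]
    back = [{}]
    # viterbi(i, s) = max_{s'} viterbi(i-1, s') P(s | s') P(sigma_i | s)
    for sigma in signal_seq[1:]:
        scores, pointers = {}, {}
        for s in states:
            best = max(states, key=lambda sp: v[-1][sp] * transition[sp][s])
            scores[s] = v[-1][best] * transition[best][s] * emission[s][sigma]
            pointers[s] = best
        v.append(scores)
        back.append(pointers)
    # Follow back-pointers from the best final state to get the argmax.
    last = max(states, key=lambda s: v[-1][s])
    path = [last]
    for pointers in reversed(back[1:]):
        path.append(pointers[path[-1]])
    return list(reversed(path)), v[-1][last]
```
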
Hidden Markov Models: Algorithms for HMMs (continued)

Probability of signal sequence (Forward):

  α(i, s) = P(σ_1, ..., σ_i, s_i = s)
  α(1, s) = P(s) P(σ_1 | s)
  α(i, s) = Σ_{s'} α(i-1, s') P(s | s') P(σ_i | s)
  Σ_s α(n, s) = P(σ_1, ..., σ_n)

Probability of signal sequence (Backward):

  β(i, s) = P(σ_{i+1}, ..., σ_n | s_i = s)
  β(n, s) = 1
  β(i, s) = Σ_{s'} β(i+1, s') P(s' | s) P(σ_{i+1} | s')
  Σ_s β(1, s) P(s) P(σ_1 | s) = P(σ_1, ..., σ_n)

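A sketch of the forward and backward recursions under the same dictionary representation; both return the full trellis, which the expected-count computation on the next slide can reuse:

```python
def forward(signal_seq, states, initial, transition, emission):
    """alpha[i][s] = P(signals up to position i, state at position i = s), 0-based i."""
    alpha = [{s: initial[s] * emission[s][signal_seq[0]] for s in states}]
    for sigma in signal_seq[1:]:
        prev = alpha[-1]
        alpha.append({
            s: emission[s][sigma]
               * sum(prev[sp] * transition[sp][s] for sp in states)
            for s in states
        })
    return alpha  # sum(alpha[-1].values()) == P(sigma_1, ..., sigma_n)

def backward(signal_seq, states, transition, emission):
    """beta[i][s] = P(signals after position i | state at position i = s), 0-based i."""
    beta = [{s: 1.0 for s in states}]
    for sigma in reversed(signal_seq[1:]):
        nxt = beta[0]
        beta.insert(0, {
            s: sum(nxt[sp] * transition[s][sp] * emission[sp][sigma]
                   for sp in states)
            for s in states
        })
    return beta
```
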
Hidden Markov Models: Algorithms for HMMs (continued)

Expected counts of hidden states (Forward-Backward):

  E[C(s)] = P(s) P(σ_1 | s) β(1, s)
  E[C(s, s')] = Σ_{i=1}^{n-1} α(i, s) P(s' | s) P(σ_{i+1} | s') β(i+1, s')
  E[C(s, σ)] = Σ_{i: σ_i = σ} α(i, s) β(i, s)

(Note that α(i, s) already includes the emission of σ_i, so no extra emission factor is needed in the last equation.)

Algorithm analysis:
- All algorithms can be implemented to fill an |Ω| × n matrix
- All algorithms run in O(|Ω|² n) time
- Compare to O(|Ω|^n) for the naive implementation

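A sketch of the expected-count computations, building on the forward and backward functions above. As on the slide, the counts are unnormalized joint probabilities; dividing each by P(σ_1, ..., σ_n) gives the expectations used in Baum-Welch:

```python
def expected_counts(signal_seq, states, initial, transition, emission):
    """Unnormalized expected counts E[C(s)], E[C(s, s')], E[C(s, sigma)]."""
    alpha = forward(signal_seq, states, initial, transition, emission)
    beta = backward(signal_seq, states, transition, emission)
    n = len(signal_seq)
    # E[C(s)] = P(s) P(sigma_1 | s) beta(1, s)
    start = {s: initial[s] * emission[s][signal_seq[0]] * beta[0][s]
             for s in states}
    # E[C(s, s')] = sum_i alpha(i, s) P(s' | s) P(sigma_{i+1} | s') beta(i+1, s')
    trans = {(s, sp): sum(alpha[i][s] * transition[s][sp]
                          * emission[sp][signal_seq[i + 1]] * beta[i + 1][sp]
                          for i in range(n - 1))
             for s in states for sp in states}
    # E[C(s, sigma)] = sum over positions emitting sigma of alpha(i, s) beta(i, s)
    emit = {(s, sigma): sum(alpha[i][s] * beta[i][s]
                            for i in range(n) if signal_seq[i] == sigma)
            for s in states for sigma in set(signal_seq)}
    return start, trans, emit
```
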
Part-of-Speech Tagging

A typical sequence classification problem: given a word sequence w_1, ..., w_n, determine the corresponding part-of-speech (tag) sequence t_1, ..., t_n.

Probabilistic view of the problem:

  argmax_{t_1, ..., t_n} P(t_1, ..., t_n | w_1, ..., w_n)

First-order HMM (tag bigram model):
- Signals represent words
- States represent part-of-speech tags
- Tagging amounts to finding the optimal state sequence

Part-of-Speech Tagging: Modeling

Generative model:

  argmax_{t_1, ..., t_n} P(t_1, ..., t_n | w_1, ..., w_n)
  = argmax_{t_1, ..., t_n} P(t_1, ..., t_n, w_1, ..., w_n)
  = argmax_{t_1, ..., t_n} P(t_1, ..., t_n) P(w_1, ..., w_n | t_1, ..., t_n)

(The first step holds because P(w_1, ..., w_n) is constant for a fixed input.)

Markov assumptions:

  P(t_1, ..., t_n) = ∏_{i=1}^{n} P(t_i | t_{i-1})
  P(w_1, ..., w_n | t_1, ..., t_n) = ∏_{i=1}^{n} P(w_i | t_i)

Model parameters:
- P(t): start probabilities for all tags t
- P(t | t'): transition probabilities for all pairs of tags t, t'
- P(w | t): emission probabilities for words w and tags t

Part-of-Speech Tagging: Modeling (continued)

[Figure: HMM state diagram over the tags dt, nn, vb, with start probability P(dt), transition probabilities P(nn | dt), P(vb | dt), P(nn | nn), P(vb | nn), P(nn | vb), P(vb | vb), and emission probabilities P(the | dt), P(can | nn), P(can | vb), P(smells | nn), P(smells | vb).]

Part-of-Speech Tagging: Learning

Supervised learning: given a tagged training corpus, we can estimate the parameters using (smoothed) relative frequencies, as sketched below.

Weakly supervised learning: given a lexicon and an untagged training corpus, we can use EM to estimate the parameters:
- EM for HMMs = the Baum-Welch algorithm
- E-step = Forward-Backward algorithm (see above)
- M-step = MLE given the expected counts

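A minimal sketch of supervised estimation by relative frequencies (unsmoothed, for brevity), assuming the training corpus is given as a list of sentences, each a list of (word, tag) pairs:

```python
from collections import Counter

def estimate(corpus):
    """Relative-frequency estimates of P(t), P(t' | t) and P(w | t)."""
    start = Counter()    # C(t at sentence start)
    bigram = Counter()   # C(t, t')
    context = Counter()  # C(t) as the left element of a tag bigram
    emit = Counter()     # C(t, w)
    tag = Counter()      # C(t) over all positions
    for sentence in corpus:
        tags = [t for _, t in sentence]
        start[tags[0]] += 1
        for t, tp in zip(tags, tags[1:]):
            bigram[(t, tp)] += 1
            context[t] += 1
        for w, t in sentence:
            emit[(t, w)] += 1
            tag[t] += 1
    initial = {t: c / len(corpus) for t, c in start.items()}
    transition = {(t, tp): c / context[t] for (t, tp), c in bigram.items()}
    emission = {(t, w): c / tag[t] for (t, w), c in emit.items()}
    return initial, transition, emission
```
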
Part-of-Speech Tagging: Smoothing

Transition probabilities:
- Structurally similar to n-gram probabilities
- Standard methods like additive smoothing apply (see the sketch below)

Emission probabilities:
- Structurally similar to the Naive Bayes likelihood
- Standard methods for known words
- Special treatment of unknown words (affixes, capitalization)

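A sketch of additive (add-λ) smoothing for the transition probabilities, reusing the bigram and context Counters from the estimation sketch above; the value of λ is a hypothetical choice:

```python
def smoothed_transitions(bigram, context, tagset, lam=0.1):
    """Add-lambda smoothing: P(t' | t) = (C(t, t') + lam) / (C(t) + lam * |T|)."""
    size = len(tagset)
    return {(t, tp): (bigram[(t, tp)] + lam) / (context[t] + lam * size)
            for t in tagset for tp in tagset}
```
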
Part-of-Speech Tagging: Inference and Evaluation

Inference:
- Finding the optimal state (tag) sequence
- The Viterbi algorithm runs in O(|Ω|² n) time
- A sparse matrix may save space (and time)

Evaluation: compare the output t'_1, ..., t'_n to a gold standard t_1, ..., t_n:

  Accuracy = (1/n) Σ_{i=1}^{n} [[t'_i = t_i]]

- Confidence intervals and tests for proportions
- Precision and recall for specific classes (tags)

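A sketch of tagging accuracy with a normal-approximation (Wald) confidence interval for a proportion; the multiplier 1.96 assumes a 95% interval:

```python
import math

def accuracy(predicted, gold):
    """Fraction of positions where the predicted tag equals the gold tag."""
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)

def wald_interval(acc, n, z=1.96):
    """Normal-approximation confidence interval for a proportion."""
    half = z * math.sqrt(acc * (1 - acc) / n)
    return acc - half, acc + half
```
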
Conclusion: HMM Applications

HMMs are widely used in NLP (and elsewhere):
- Speech recognition
- Optical character recognition
- Spell checking
- Morphological segmentation
- Part-of-speech tagging
- Named entity recognition
- Chunking
- Disfluency detection
- Topic segmentation
- DNA sequence analysis
- Protein classification
- ...

Conclusion: Other Sequence Models

The HMM is a generative model: it models the joint distribution of inputs and outputs.

Discriminative models:
- Maximum Entropy Markov Models (MEMM)
- Conditional Random Fields (CRF)

Weighted Finite-State Transducers (WFST):
- Finite-state automata for relations
- Weighted transitions combining transition and emission probabilities
- Subsume HMMs and other probabilistic models