Recap: HMM. ANLP Lecture 9: Algorithms for HMMs. More general notation. Recap: HMM. Elements of HMM: Sharon Goldwater 4 Oct 2018.

Similar documents
Lecture 12: Algorithms for HMMs

Lecture 12: Algorithms for HMMs

Part of Speech Tagging: Viterbi, Forward, Backward, Forward- Backward, Baum-Welch. COMP-599 Oct 1, 2015

Basic Text Analysis. Hidden Markov Models. Joakim Nivre. Uppsala University Department of Linguistics and Philology

Sequence Labeling: HMMs & Structured Perceptron

CMSC 723: Computational Linguistics I Session #5 Hidden Markov Models. The ischool University of Maryland. Wednesday, September 30, 2009

Lecture 12: EM Algorithm

Statistical Methods for NLP

Hidden Markov Models The three basic HMM problems (note: change in notation) Mitch Marcus CSE 391

Hidden Markov Models

CISC 889 Bioinformatics (Spring 2004) Hidden Markov Models (II)

Empirical Methods in Natural Language Processing Lecture 11 Part-of-speech tagging and HMMs

More on HMMs and other sequence models. Intro to NLP - ETHZ - 18/03/2013

Machine Learning for natural language processing

Statistical Methods for NLP

Today s Agenda. Need to cover lots of background material. Now on to the Map Reduce stuff. Rough conceptual sketch of unsupervised training using EM

10. Hidden Markov Models (HMM) for Speech Processing. (some slides taken from Glass and Zue course)

Lecture 9: Hidden Markov Model

Machine Learning & Data Mining Caltech CS/CNS/EE 155 Hidden Markov Models Last Updated: Feb 7th, 2017

Foundations of Natural Language Processing Lecture 6 Spelling correction, edit distance, and EM

Hidden Markov Models (HMMs)

Sequence Modelling with Features: Linear-Chain Conditional Random Fields. COMP-599 Oct 6, 2015

CS 136a Lecture 7 Speech Recognition Architecture: Training models with the Forward backward algorithm

Hidden Markov Models. By Parisa Abedi. Slides courtesy: Eric Xing

Lecture 3: ASR: HMMs, Forward, Viterbi

CSCI 5832 Natural Language Processing. Today 2/19. Statistical Sequence Classification. Lecture 9

Computational Genomics and Molecular Biology, Fall

Hidden Markov Models. Aarti Singh Slides courtesy: Eric Xing. Machine Learning / Nov 8, 2010

Hidden Markov Models

Introduction to Machine Learning CMU-10701

CSC401/2511 Spring CSC401/2511 Natural Language Computing Spring 2019 Lecture 5 Frank Rudzicz and Chloé Pou-Prom University of Toronto

Hidden Markov Models and Gaussian Mixture Models

Probabilistic Context-free Grammars

Lecture 4: Hidden Markov Models: An Introduction to Dynamic Decision Making. November 11, 2010

6.864: Lecture 5 (September 22nd, 2005) The EM Algorithm

Sequence labeling. Taking collective a set of interrelated instances x 1,, x T and jointly labeling them

Log-Linear Models, MEMMs, and CRFs

Data-Intensive Computing with MapReduce

Data Mining in Bioinformatics HMM

LECTURER: BURCU CAN Spring

Statistical Pattern Recognition

Natural Language Processing CS Lecture 06. Razvan C. Bunescu School of Electrical Engineering and Computer Science

Hidden Markov Models in Language Processing

Probabilistic Graphical Models

Lecture 13: Structured Prediction

Statistical Machine Learning from Data

Hidden Markov Models

Language and Statistics II

A.I. in health informatics lecture 8 structured learning. kevin small & byron wallace

ACS Introduction to NLP Lecture 2: Part of Speech (POS) Tagging

Pair Hidden Markov Models

CS838-1 Advanced NLP: Hidden Markov Models

HIDDEN MARKOV MODELS IN SPEECH RECOGNITION

Natural Language Processing

Hidden Markov Models and Gaussian Mixture Models

Hidden Markov Models. x 1 x 2 x 3 x K

ASR using Hidden Markov Model : A tutorial

HMM: Parameter Estimation

Lecture 11: Hidden Markov Models

Hidden Markov Modelling

Speech Recognition HMM

Dept. of Linguistics, Indiana University Fall 2009

Lecture 11: Viterbi and Forward Algorithms

We Live in Exciting Times. CSCI-567: Machine Learning (Spring 2019) Outline. Outline. ACM (an international computing research society) has named

Hidden Markov Models Part 2: Algorithms

Natural Language Processing Prof. Pushpak Bhattacharyya Department of Computer Science & Engineering, Indian Institute of Technology, Bombay

Supervised Learning Hidden Markov Models. Some of these slides were inspired by the tutorials of Andrew Moore

Statistical Methods for NLP

CS 7180: Behavioral Modeling and Decision- making in AI

COMP90051 Statistical Machine Learning

Recall: Modeling Time Series. CSE 586, Spring 2015 Computer Vision II. Hidden Markov Model and Kalman Filter. Modeling Time Series

Lecture 9. Intro to Hidden Markov Models (finish up)

Intelligent Systems (AI-2)

Hidden Markov Models. x 1 x 2 x 3 x N

order is number of previous outputs

STA 414/2104: Machine Learning

Statistical Sequence Recognition and Training: An Introduction to HMMs

1. Markov models. 1.1 Markov-chain

A gentle introduction to Hidden Markov Models

Statistical methods in NLP, lecture 7 Tagging and parsing

Intelligent Systems (AI-2)

Hidden Markov Models Hamid R. Rabiee

Hidden Markov Models: Maxing and Summing

Hidden Markov Models

10/17/04. Today s Main Points

6.047/6.878/HST.507 Computational Biology: Genomes, Networks, Evolution. Lecture 05. Hidden Markov Models Part II

Robert Collins CSE586 CSE 586, Spring 2015 Computer Vision II

Statistical NLP: Hidden Markov Models. Updated 12/15

What s an HMM? Extraction with Finite State Machines e.g. Hidden Markov Models (HMMs) Hidden Markov Models (HMMs) for Information Extraction

Hidden Markov Models. x 1 x 2 x 3 x K

Hidden Markov Models

Automatic Speech Recognition (CS753)

Collapsed Variational Bayesian Inference for Hidden Markov Models

Automatic Speech Recognition (CS753)

Midterm sample questions

Lecture 7 Sequence analysis. Hidden Markov Models

Natural Language Processing Prof. Pawan Goyal Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur

Expectation Maximization (EM)

CSCE 471/871 Lecture 3: Markov Chains and

Hidden Markov Models,99,100! Markov, here I come!

Transcription:

Recap: HMM ANLP Lecture 9: Algorithms for HMMs Sharon Goldwater 4 Oct 2018 Elements of HMM: Set of states (tags) Output alphabet (word types) Start state (beginning of sentence) State transition probabilities Output probabilities from each state Algorithms for HMMs (Goldwater, ANLP) 2 More general notation Previous lecture: Sequence of tags T = t 1 t n Sequence of words S = w 1 w n This lecture: Sequence of states Q = q 1... q T Sequence of outputs O = o 1... o T So t is now a time step, not a tag! And T is the sequence length. Recap: HMM Given a sentence O = o 1... o T with tags Q = q 1... q T, compute P(O,Q) as: P(O, Q) = P o t q t T t=1 P q t q t 1 But we want to find argmax Q P(Q O) without enumerating all possible Q Use Viterbi algorithm to store partial computations. Algorithms for HMMs (Goldwater, ANLP) 3 Algorithms for HMMs (Goldwater, ANLP) 4

Today s lecture What algorithms can we use to Efficiently compute the most probable tag sequence for a given word sequence? Efficiently compute the likelihood for an HMM (probability it outputs a given sequence s)? Learn the parameters of an HMM given unlabelled training data? What are the properties of these algorithms (complexity, convergence, etc)? Words: Possible tags: (ordered by frequency for each word) Tagging example <s> one dog bit </s> <s> CD NN NN </s> NN VB VBD PRP Algorithms for HMMs (Goldwater, ANLP) 5 Algorithms for HMMs (Goldwater, ANLP) 6 Tagging example Viterbi: intuition Words: <s> one dog bit </s> Words: <s> one dog bit </s> Possible tags: (ordered by frequency for each word) <s> CD NN NN </s> NN VB VBD PRP Possible tags: (ordered by frequency for each word) <s> CD NN NN </s> NN VB VBD PRP Choosing the best tag for each word independently gives the wrong answer (<s> CD NN NN </s>). P(VBD bit) < P(NN bit), but may yield a better sequence (<s> CD NN VB </s>) because P(VBD NN) and P(</s> VBD) are high. Algorithms for HMMs (Goldwater, ANLP) 7 Suppose we have already computed a) The best tag sequence for <s> bit that ends in NN. b) The best tag sequence for <s> bit that ends in VBD. Then, the best full sequence would be either sequence (a) extended to include </s>, or sequence (b) extended to include </s>. Algorithms for HMMs (Goldwater, ANLP) 8

Words: Possible tags: (ordered by frequency for each word) Viterbi: intuition <s> one dog bit </s> <s> CD NN NN </s> NN VB VBD PRP Viterbi: high-level picture Intuition: the best path of length t ending in state q must include the best path of length t-1 to the previous state. (t now a time step, not a tag). But similarly, to get a) The best tag sequence for <s> bit that ends in NN. We could extend one of: The best tag sequence for <s> dog that ends in NN. The best tag sequence for <s> dog that ends in VB. And so on Algorithms for HMMs (Goldwater, ANLP) 9 Algorithms for HMMs (Goldwater, ANLP) 10 Viterbi: high-level picture Intuition: the best path of length t ending in state q must include the best path of length t-1 to the previous state. (t now a time step, not a tag). So, Find the best path of length t-1 to each state. Consider extending each of those by 1 step, to state q. Take the best of those options as the best path to state q. Notation Sequence of observations over time o 1, o 2,, o T here, words in sentence Vocabulary size V of possible observations Set of possible states q 1, q 2,, q N (see note next slide) here, tags A, an NxN matrix of transition probabilities a ij : the prob of transitioning from state i to j. (JM3 Fig 8.7) B, an NxV matrix of output probabilities b i (o t ): the prob of emitting o t from state i. (JM3 Fig 8.8) Algorithms for HMMs (Goldwater, ANLP) 11 Algorithms for HMMs (Goldwater, ANLP) 12

Note on notation HMM example w/ new notation J&M use q 1, q 2,, q N for set of states, but also use q 1, q 2,, q T for state sequence over time. So, just seeing q 1 is ambiguous (though usually disambiguated from context). I ll instead use q i for state names, and q t for state at time t. So we could have q t = q i, meaning: the state we re in at time t is q i..7 Start.3 q 1 q 2.5.6.1.3.1.7.2.5 x y z x y z States {q 1, q 2 } (or {<s>, q 1, q 2 }) Output alphabet {x, y, z} Algorithms for HMMs (Goldwater, ANLP) 13 Algorithms for HMMs (Goldwater, Adapted ANLP) from Manning & Schuetze, Fig 14 9.2 Transition and Output Probabilities Joint probability of (states, outputs) Transition matrix A: a ij = P(q j q i ) Output matrix B: b i (o) = P(o q i ) for output o q 1 q 2 <s> 1 0 q 1.7.3 q 2.5.5 x y z q 1.6.1.3 q 2.1.7.2 Let λ = (A, B) be the parameters of our HMM. Using our new notation, given state sequence Q = (q 1... q T ) and output sequence O = (o 1... o T ), we have: T P O, Q λ = P o t q t P q t q t 1 t=1 Algorithms for HMMs (Goldwater, ANLP) 15 Algorithms for HMMs (Goldwater, ANLP) 16

Joint probability of (states, outputs) Let λ = (A, B) be the parameters of our HMM. Using our new notation, given state sequence Q = (q 1... q T ) and output sequence O = (o 1... o T ), we have: P O, Q λ Or: P O, Q λ T = P o t q t P q t q t 1 t=1 T = b qt (o t ) a qt 1 q t t=1 Joint probability of (states, outputs) Let λ = (A, B) be the parameters of our HMM. Using our new notation, given state sequence Q = (q 1... q T ) and output sequence O = (o 1... o T ), we have: P O, Q λ Or: P O, Q λ T = P o t q t P q t q t 1 t=1 T = b qt (o t ) a qt 1 q t t=1 Example: Algorithms for HMMs (Goldwater, ANLP) 17 P O = y, z, Q = (q 1, q 1 ) λ = b 1 y b 1 z a <s>,1 a 11 = (.1)(.3)(1)(.7) Algorithms for HMMs (Goldwater, ANLP) 18 Viterbi: high-level picture Want to find argmax Q P(Q O) Intuition: the best path of length t ending in state q must include the best path of length t-1 to the previous state. So, Find the best path of length t-1 to each state. Consider extending each of those by 1 step, to state q. Take the best of those options as the best path to state q. Viterbi algorithm Use a chart to store partial results as we go NxT table, where v(j,t) is the probability* of the best state sequence for o 1 o t that ends in state j. Algorithms for HMMs (Goldwater, ANLP) 19 Algorithms for HMMs (Goldwater, ANLP) 20 *Specifically, v(j,t) stores the max of the joint probability P(o 1 o t,q 1 q t-1,q t =j λ)

Viterbi algorithm Use a chart to store partial results as we go NxT table, where v(j,t) is the probability* of the best state sequence for o 1 o t that ends in state j. Fill in columns from left to right, with Viterbi algorithm Use a chart to store partial results as we go NxT table, where v(j,t) is the probability* of the best state sequence for o 1 o t that ends in state j. Fill in columns from left to right, with v j, t = max N v i, t 1 a ij b j o t v j, t = max N v i, t 1 a ij b j o t Store a backtrace to show, for each cell, which state at t-1 we came from. Algorithms for HMMs (Goldwater, ANLP) 21 *Specifically, v(j,t) stores the max of the joint probability P(o 1 o t,q 1 q t-1,q t =j λ) Algorithms for HMMs (Goldwater, ANLP) 22 *Specifically, v(j,t) stores the max of the joint probability P(o 1 o t,q 1 q t-1,q t =j λ) Example Filling the first column Suppose O=xzy. Our initially empty table: q 1 q 2 q 1.6 q 2 0 v 1,1 = a <s>1 b 1 x) = 1 (.6 v 2,1 = a <s>2 b 2 x) = 0 (.1 Algorithms for HMMs (Goldwater, ANLP) 23 Algorithms for HMMs (Goldwater, ANLP) 24

Starting the second column Starting the second column q 1.6 q 2 0 q 1.6.126 q 2 0 v 1,2 = max N v i, 1 a i1 b 1 z v 1,2 = max N v i, 1 a i1 b 1 z = max ቊ v 1,1 a 11 b 1 z = (.6)(.7)(.3) v 2,1 a 21 b 1 z = (0)(.5)(.3) = max ቊ v 1,1 a 11 b 1 z = (.6)(.7)(.3) v 2,1 a 21 b 1 z = (0)(.5)(.3) Algorithms for HMMs (Goldwater, ANLP) 25 Algorithms for HMMs (Goldwater, ANLP) 26 Finishing the second column Finishing the second column q 1.6.126 q 2 0 q 1.6.126 q 2 0.036 v 2,2 = max N v i, 1 a i2 b 2 z v 2,2 = max N v i, 1 a i2 b 2 z = max ቊ v 1,1 a 12 b 2 z = (.6)(.3)(.2) v 2,1 a 22 b 2 z = (0)(.5)(.2) = max ቊ v 1,1 a 12 b 2 z = (.6)(.3)(.2) v 2,1 a 22 b 2 z = (0)(.5)(.2) Algorithms for HMMs (Goldwater, ANLP) 27 Algorithms for HMMs (Goldwater, ANLP) 28

Third column Best Path q 1.6.126.00882 q 2 0.036.02646 q 1.6.126.00882 q 2 0.036.02646 Exercise: make sure you get the same results! Choose best final state: max N v i, T Follow backtraces to find best full sequence: q 1 q 1 q 2, so: Algorithms for HMMs (Goldwater, ANLP) 29 Algorithms for HMMs (Goldwater, ANLP) 30 HMMs: what else? Using Viterbi, we can find the best tags for a sentence (decoding), and get P O, Q λ). We might also want to Compute the likelihood P O λ), i.e., the probability of a sentence regardless of tags (a language model!) learn the best set of parameters λ = (A, B) given only an unannotated corpus of sentences. Computing the likelihood From probability theory, we know that P O λ) = P O, Q λ There are an exponential number of Qs. Again, by computing and storing partial results, we can solve efficiently. Q (Next slides show the algorithm but I ll likely skip them) Algorithms for HMMs (Goldwater, ANLP) 31 Algorithms for HMMs (Goldwater, ANLP) 32

Forward algorithm Use a table with cells α(j,t): the probability of being in state j after seeing o 1 o t (forward probability). Example Suppose O=xzy. Our initially empty table: α(j, t) = P(o 1, o 2, ot, qt = j λ) Fill in columns from left to right, with N α j, t = α i, t 1 a ij b j o t q 1 q 2 Same as Viterbi, but sum instead of max (and no backtrace). Algorithms for HMMs (Goldwater, ANLP) 33 Algorithms for HMMs (Goldwater, ANLP) 34 Filling the first column Starting the second column q 1.6 q 2 0 q 1.6.126 q 2 0 α 1,1 = a <s>1 b 1 x) = 1 (.6 α 2,1 = a <s>2 b 2 x) = 0 (.1 Algorithms for HMMs (Goldwater, ANLP) 35 α 1,2 N = α i, 1 a i1 b 1 z = α 1,1 a 11 b 1 z + α 2,1 a 21 b 1 (z) =.6.7.3 + 0.5.3 =.126 Algorithms for HMMs (Goldwater, ANLP) 36

Finishing the second column Third column and finish q 1.6.126 q 2 0.036 q 1.6.126.01062 q 2 0.036.03906 α 2,2 N = α i, 1 a i2 b 2 z = α 1,1 a 12 b 2 z + α 2,1 a 22 b 2 (z) =.6.3.2 + 0.5.2 =.036 Algorithms for HMMs (Goldwater, ANLP) 37 Add up all probabilities in last column to get the probability of the entire sequence: P O λ N = α i, T Algorithms for HMMs (Goldwater, ANLP) 38 Learning Unsupervised learning Given only the output sequence, learn the best set of parameters λ = (A, B). Assume best = maximum-likelihood. Other definitions are possible, won t discuss here. Training an HMM from an annotated corpus is simple. Supervised learning: we have examples labelled with the right answers (here, tags): no hidden variables in training. Training from unannotated corpus is trickier. Unsupervised learning: we have no examples labelled with the right answers : all we see are outputs, state sequence is hidden. Algorithms for HMMs (Goldwater, ANLP) 39 Algorithms for HMMs (Goldwater, ANLP) 40

Circularity Expectation-maximization (EM) If we know the state sequence, we can find the best λ. E.g., use MLE: If we know λ, we can find the best state sequence. use Viterbi P q j qi But we don't know either! = C(qi qj) C(qi) Essentially, a bootstrapping algorithm. Initialize parameters λ (0) At each iteration k, E-step: Compute expected counts using λ (k-1) M-step: Set λ (k) using MLE on the expected counts Repeat until λ doesn't change (or other stopping criterion). Algorithms for HMMs (Goldwater, ANLP) 41 Algorithms for HMMs (Goldwater, ANLP) 42 Expected counts?? Counting transitions from q i q j : Real counts: count 1 each time we see q i q j in true tag sequence. Expected counts: With current λ, compute probs of all possible tag sequences. If sequence Q has probability p, count p for each q i q j in Q. Add up these fractional counts across all possible sequences. Example Notionally, we compute expected counts as follows: Possible sequence Probability of sequence Q 1 = q 1 q 1 q 1 p 1 Q 2 = q 1 q 2 q 1 p 2 Q 3 = q 1 q 1 q 2 p 3 Q 4 = q 1 q 2 q 2 p 4 Observs: x z y Algorithms for HMMs (Goldwater, ANLP) 43 Algorithms for HMMs (Goldwater, ANLP) 44

Example Forward-Backward algorithm Notionally, we compute expected counts as follows: Possible sequence መC q 1 q 1 = 2p 1 + p 3 Probability of sequence Q 1 = q 1 q 1 q 1 p 1 Q 2 = q 1 q 2 q 1 p 2 Q 3 = q 1 q 1 q 2 p 3 Q 4 = q 1 q 2 q 2 p 4 Observs: x z y As usual, avoid enumerating all possible sequences. Forward-Backward (Baum-Welch) algorithm computes expected counts using forward probabilities and backward probabilities: β(j, t) = P(qt = j, o t+1, o t+2, ot λ) Details, see J&M 6.5 EM idea is much more general: can use for many latent variable models. Algorithms for HMMs (Goldwater, ANLP) 45 Algorithms for HMMs (Goldwater, ANLP) 46 Guarantees EM is guaranteed to find a local maximum of the likelihood. Guarantees EM is guaranteed to find a local maximum of the likelihood. P(O λ) P(O λ) values of λ values of λ Not guaranteed to find global maximum. Practical issues: initialization, random restarts, early stopping. Algorithms for HMMs (Goldwater, ANLP) 47 Algorithms for HMMs (Goldwater, ANLP) 48

Summary HMM: a generative model of sentences using hidden state sequence Dynamic programming algorithms to compute Best tag sequence given words (Viterbi algorithm) Likelihood (forward algorithm) Best parameters from unannotated corpus (forward-backward algorithm, an instance of EM) Algorithms for HMMs (Goldwater, ANLP) 49