Machine Learning & Data Mining
Caltech CS/CNS/EE 155
Hidden Markov Models
Last Updated: Feb 7th, 2017


1 Introduction

Let $x = (x^1, \ldots, x^M)$ denote a sequence (e.g. a sequence of words), and let $y = (y^1, \ldots, y^M)$ denote a corresponding hidden sequence that we believe explains or influences $x$ somehow (e.g. a sequence of part-of-speech tags). For simplicity, we assume that $x$ is encoded using $D$ tokens, so each $x^j \in \{1, \ldots, D\}$, and that $y$ is encoded using $L$ tokens, so each $y^j \in \{1, \ldots, L\}$. Thus there are $D^M$ possible length-$M$ sequences $x$ and $L^M$ possible length-$M$ sequences $y$. Typically we refer to the elements of $x$ as observations and the elements of $y$ as states or hidden states.

Our goal is to use a first-order hidden Markov model (HMM) to model the joint distribution $P(x, y)$. An HMM models the joint distribution as follows:
\[
P(x, y) = P(\text{End} \mid y^M) \prod_{j=1}^{M} P(x^j \mid y^j)\, P(y^j \mid y^{j-1}), \qquad (1)
\]
where End is a special state denoting the end of the hidden sequence and $y^0 := \text{Start}$ is always the special state denoting the start of the hidden sequence. Note that the End state is optional in HMMs, and can be ignored if you don't wish to model the probability of a sequence spontaneously ending; if we wanted to ignore End, we would just replace $P(\text{End} \mid y^M)$ with 1 everywhere it appears below.

A closer look at (1) reveals that we have made two big independence assumptions about the sequences $x$ and $y$. The first is that the observations in $x$ are conditionally independent of each other given the hidden sequence $y$:
\[
P(x \mid y) = \prod_{j=1}^{M} P(x^j \mid y^j).
\]
In other words, observation $x^j$ depends only on state $y^j$, not on any other state in $y$. The second is that the hidden states form a first-order Markov chain: each $y^j$ depends only on the immediately preceding state $y^{j-1}$, and not on any earlier states or on $x$.

Because our HMM only needs to know the values of $P(x^j \mid y^j)$ and $P(y^j \mid y^{j-1})$ for all possible combinations of tokens in order to compute the entire joint distribution, the total size of our HMM model is $(D \times L) + (L \times L) + 2L$ parameters:

    Component            No. Parameters
    P(x^j | y^j)         D x L
    P(y^j | y^{j-1})     L x L
    P(y^1 | y^0)         L
    P(End | y^M)         L

Note that the probability of an entire sequence $P(x, y)$ can often be very small (exponentially small in $M$), so it is usually more convenient to work in log-probability space:
\[
\log P(x, y) = \log P(\text{End} \mid y^M) + \sum_{j=1}^{M} \big( \log P(x^j \mid y^j) + \log P(y^j \mid y^{j-1}) \big). \qquad (2)
\]
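As a concrete illustration of (2), the sketch below scores a fixed $(x, y)$ pair under small randomly generated parameter tables. The array names (`log_A`, `log_O`, `log_start`, `log_end`) and the toy sizes $L = 2$, $D = 3$ are illustrative assumptions, not part of the notes.

```python
import numpy as np

# Hypothetical HMM with L = 2 hidden states and D = 3 observation tokens.
# log_A[a, b]  = log P(y^j = a | y^{j-1} = b)   (L x L)
# log_O[w, z]  = log P(x^j = w | y^j = z)       (D x L)
# log_start[a] = log P(y^1 = a | Start)         (L,)
# log_end[a]   = log P(End | y^M = a)           (L,)
rng = np.random.default_rng(0)
A = rng.dirichlet(np.ones(2), size=2).T      # columns sum to 1
O = rng.dirichlet(np.ones(3), size=2).T      # columns sum to 1
start = rng.dirichlet(np.ones(2))
end = rng.dirichlet(np.ones(2))
log_A, log_O = np.log(A), np.log(O)
log_start, log_end = np.log(start), np.log(end)

def joint_log_prob(x, y):
    """Compute log P(x, y) as in equation (2), with 0-indexed tokens and states."""
    lp = log_start[y[0]] + log_O[x[0], y[0]]          # j = 1 term (y^0 = Start)
    for j in range(1, len(x)):
        lp += log_A[y[j], y[j - 1]] + log_O[x[j], y[j]]
    return lp + log_end[y[-1]]                        # optional End-state term

print(joint_log_prob(x=[0, 2, 1], y=[1, 0, 0]))
```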

2 How to Make Predictions: The Viterbi Algorithm

Let's say we have a trained HMM and a sequence $x$, and we want to predict the hidden sequence $y$ that most likely produced $x$. E.g., we have a sentence of words and we want to tag each word with a part-of-speech tag. We make this prediction by selecting the $y$ that maximizes $P(y \mid x)$:
\[
\begin{aligned}
\operatorname*{argmax}_y P(y \mid x) &= \operatorname*{argmax}_y \log P(y \mid x) = \operatorname*{argmax}_y \log \frac{P(x, y)}{P(x)} = \operatorname*{argmax}_y \log P(x, y) \\
&= \operatorname*{argmax}_y \Big[ \log P(\text{End} \mid y^M) + \sum_{j=1}^{M} \big( \log P(x^j \mid y^j) + \log P(y^j \mid y^{j-1}) \big) \Big]. \qquad (3)
\end{aligned}
\]
Naively iterating over all possible $y$'s would take exponential time (w.r.t. the length $|x| = M$). Luckily, we can use a dynamic programming approach known as the Viterbi algorithm to efficiently compute (3).

First, some notation. Let $\tilde{P}$ denote log probability, e.g. $\tilde{P}(x, y) = \log P(x, y)$. Define the transition matrix $\tilde{A}$ whose entries are $\tilde{A}_{a,b} = \tilde{P}(y^j = a \mid y^{j-1} = b)$ and the observation matrix $\tilde{O}$ whose entries are $\tilde{O}_{w,z} = \tilde{P}(x^j = w \mid y^j = z)$. Finally, for the key definition, let $\hat{y}^j_a$ denote the length-$j$ sequence of hidden states ending with $y^j = a$ that is most likely to have produced $x^{1:j}$:
\[
\hat{y}^j_a = \Big( \operatorname*{argmax}_{y^{1:j-1}} \tilde{P}\big(y^{1:j-1} \oplus a,\; x^{1:j}\big) \Big) \oplus a, \qquad (4)
\]
where $\oplus$ denotes sequence concatenation or appending a token (it should be clear from context which one it is).

Now, the Viterbi algorithm. If we already knew $\hat{y}^M_a$ for all different choices of hidden state $a$, then we could find the $y$ that maximizes $P(y \mid x)$ (which, recall, is the prediction we're looking for) by simply selecting the $\hat{y}^M_a$ with the highest value of $\tilde{P}(\hat{y}^M_a, x)$. At the other end of things, computing each $\hat{y}^1_a$ is trivial, since by definition $\hat{y}^1_a = a$. To compute $\hat{y}^j_a$ for $j > 1$, the Viterbi algorithm tells us to loop through all the $\hat{y}^{j-1}_b$ and choose the best one:
\[
\begin{aligned}
\hat{y}^j_a &= \Big( \operatorname*{argmax}_{\hat{y}^{j-1}_b} \tilde{P}\big(\hat{y}^{j-1}_b \oplus a,\; x^{1:j}\big) \Big) \oplus a \qquad (5) \\
&= \Big( \operatorname*{argmax}_{\hat{y}^{j-1}_b} \big[ \tilde{P}(\hat{y}^{j-1}_b, x^{1:j-1}) + \tilde{P}(y^j = a \mid y^{j-1} = b) + \tilde{P}(x^j \mid y^j = a) \big] \Big) \oplus a \\
&= \Big( \operatorname*{argmax}_{\hat{y}^{j-1}_b} \big[ \tilde{P}(\hat{y}^{j-1}_b, x^{1:j-1}) + \tilde{A}_{a,b} + \tilde{O}_{x^j,a} \big] \Big) \oplus a.
\end{aligned}
\]
Computing each $\hat{y}^j_a$ takes $O(L)$ running time (enumerating over all $\hat{y}^{j-1}_b$), so the total running time of this approach is $O(L^2 M)$. Note that we need to save the values $\tilde{P}(\hat{y}^j_a, x^{1:j})$ that we compute along the way in (5), since we use them to compute each $\hat{y}^{j+1}_a$, and we need to save the sequences $\hat{y}^j_a$ themselves in order to have something to output at the end. Again, note that we are operating with log-probabilities, because the actual probabilities would underflow for any reasonably sized $M$.
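Here is a minimal sketch of the Viterbi recursion (5) in log space, assuming the same hypothetical `log_A`, `log_O`, `log_start`, and `log_end` tables as in the earlier example; it is meant to illustrate the $O(L^2 M)$ dynamic program, not to be a definitive implementation.

```python
import numpy as np

def viterbi(x, log_A, log_O, log_start, log_end):
    """Return the most likely state sequence for observations x (0-indexed tokens)."""
    M, L = len(x), log_A.shape[0]
    score = np.full((M, L), -np.inf)    # score[j, a] = log P(y_hat^j_a, x^{1:j})
    back = np.zeros((M, L), dtype=int)  # back[j, a]  = best predecessor state b

    score[0] = log_start + log_O[x[0]]  # base case: y_hat^1_a = a
    for j in range(1, M):
        # cand[a, b] = score[j-1, b] + A~_{a,b} + O~_{x^j, a}
        cand = score[j - 1][None, :] + log_A + log_O[x[j]][:, None]
        back[j] = np.argmax(cand, axis=1)
        score[j] = np.max(cand, axis=1)

    # Pick the best final state (including the optional End term), then backtrack.
    last = int(np.argmax(score[-1] + log_end))
    y = [last]
    for j in range(M - 1, 0, -1):
        y.append(int(back[j, y[-1]]))
    return y[::-1]
```

For instance, `viterbi([0, 2, 1], log_A, log_O, log_start, log_end)` on the toy tables above returns a length-3 list of 0-indexed state indices.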

3 How to Compute Marginal Probabilities: The Forward-Backward Algorithm

Let's say we once again have a trained HMM, and suppose we want to compute some marginal probability from our joint model, like $P(y^5 = 1, x = 1423)$ or $P(y^3 = 2, y^4 = 2, x = 334113)$. Why might we want to do this? Potentially because we are genuinely curious to know some marginal probability, but also because knowing these marginals will let us do unsupervised training of HMMs later.

To compute such quantities, we first define for each $j$ a vector $\alpha^j \in \mathbb{R}^L$ whose $a$-th entry corresponds to the unnormalized probability of observing the prefix $x^{1:j}$ while also having $y^j = a$. That is:
\[
\alpha^j(a) \propto P(x^{1:j}, y^j = a) = \sum_{y^{1:j-1}} P\big(x^{1:j},\; y^{1:j-1} \oplus a\big).
\]
We similarly define a sequence of vectors $\beta^j \in \mathbb{R}^L$ whose $b$-th entry corresponds to the unnormalized probability of observing the suffix $x^{j+1:M}$ given that the $j$-th state is $y^j = b$. That is:
\[
\beta^j(b) \propto P(x^{j+1:M} \mid y^j = b) = \sum_{y^{j+1:M}} P(y^{j+1} \mid y^j = b)\, P\big(x^{j+1:M},\; y^{j+2:M} \mid y^{j+1}\big).
\]
(Note that sometimes people instead define $\alpha$ and $\beta$ with superscripts ranging over the states $1, \ldots, L$ and entries indexed by $j$; these are the same quantities, just with different indexing.)

If we had values for the $\alpha$ and $\beta$ vectors, note that we could easily compute some useful marginal probabilities:
\[
P(y^j = a, x) = \frac{\alpha^j(a)\, \beta^j(a)}{\sum_{a'} \alpha^j(a')\, \beta^j(a')}, \qquad (6)
\]
\[
P(y^j = a, y^{j+1} = b, x) = \frac{\alpha^j(a)\, P(x^{j+1} \mid y^{j+1} = b)\, P(y^{j+1} = b \mid y^j = a)\, \beta^{j+1}(b)}{\sum_{a', b'} \alpha^j(a')\, P(x^{j+1} \mid y^{j+1} = b')\, P(y^{j+1} = b' \mid y^j = a')\, \beta^{j+1}(b')}. \qquad (7)
\]
But how can we compute $\alpha^j(a)$ and $\beta^j(b)$ efficiently, given that the definitions above require summing over potentially exponentially many hidden subsequences? A pair of dynamic programming algorithms called the Forward and Backward algorithms, both similar to the Viterbi algorithm, will do the job. We initialize $\alpha^0$ and $\beta^M$ as:
\[
\alpha^0(a) = \begin{cases} 1 & \text{if } a = \text{Start} \\ 0 & \text{otherwise}, \end{cases}
\qquad
\beta^M(b) = P(\text{End} \mid y^M = b).
\]
Define $P_y(b \mid a) = P(y^{j+1} = b \mid y^j = a)$ and $P_x(w \mid z) = P(x^j = w \mid y^j = z)$. Similar to the Viterbi algorithm, we recursively define each $\alpha^j$ as:
\[
\alpha^j(a) = P_x(x^j \mid a) \sum_{a'} \alpha^{j-1}(a')\, P_y(a \mid a'). \qquad (8)
\]
We can also recursively define each $\beta^j$ as:
\[
\beta^j(b) = \sum_{b'} \beta^{j+1}(b')\, P_y(b' \mid b)\, P_x(x^{j+1} \mid b'). \qquad (9)
\]
(8) is known as the Forward algorithm, and (9) is known as the Backward algorithm.
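The following is a minimal sketch of the forward and backward recursions (8) and (9), reusing the hypothetical probability-space tables `A`, `O`, `start`, `end` from the earlier examples. For clarity it implements the recursions exactly as written, without the renormalization discussed next, so it is only suitable for short sequences.

```python
import numpy as np

def forward_backward(x, A, O, start, end):
    """Compute alpha[j, a] ~ P(x^{1:j}, y^j = a) and beta[j, b] ~ P(x^{j+1:M} | y^j = b)."""
    M, L = len(x), A.shape[0]
    alpha = np.zeros((M, L))
    beta = np.zeros((M, L))

    # Forward pass, equation (8); the j = 1 step folds in the Start distribution (alpha^0).
    alpha[0] = O[x[0]] * start
    for j in range(1, M):
        alpha[j] = O[x[j]] * (A @ alpha[j - 1])

    # Backward pass, equation (9); beta^M is initialized with the End probabilities.
    beta[-1] = end
    for j in range(M - 2, -1, -1):
        beta[j] = A.T @ (beta[j + 1] * O[x[j + 1]])

    return alpha, beta
```

With these vectors in hand, the single-state marginal in (6) at position `j` is simply `alpha[j] * beta[j] / np.dot(alpha[j], beta[j])`.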

Dealing With Numerical Instability. In practice, directly implementing (8) and (9) leads to numerical instability. But remembering that the entries of $\alpha$ and $\beta$ need only be proportional to the probabilities they represent, we can avoid overflow and underflow by renormalizing each $\alpha^j$ and $\beta^j$ after each step of (8) and (9), i.e.:
\[
\tilde{\alpha}^j(a) = \frac{1}{C^j_\alpha}\, P_x(x^j \mid a) \sum_{a'} \tilde{\alpha}^{j-1}(a')\, P_y(a \mid a') \qquad (10)
\]
and
\[
\tilde{\beta}^j(b) = \frac{1}{C^j_\beta} \sum_{b'} \tilde{\beta}^{j+1}(b')\, P_y(b' \mid b)\, P_x(x^{j+1} \mid b'), \qquad (11)
\]
for some choice of $C^j_\alpha$ and $C^j_\beta$ such that overflow and underflow do not happen. (The sum of the vector's entries is a good choice.) Afterwards, we can compute (6) and (7) as:
\[
P(y^j = a, x) = \frac{\tilde{\alpha}^j(a)\, \tilde{\beta}^j(a)}{\sum_{a'} \tilde{\alpha}^j(a')\, \tilde{\beta}^j(a')}, \qquad (12)
\]
\[
P(y^j = a, y^{j+1} = b, x) = \frac{\tilde{\alpha}^j(a)\, P(x^{j+1} \mid y^{j+1} = b)\, P(y^{j+1} = b \mid y^j = a)\, \tilde{\beta}^{j+1}(b)}{\sum_{a', b'} \tilde{\alpha}^j(a')\, P(x^{j+1} \mid y^{j+1} = b')\, P(y^{j+1} = b' \mid y^j = a')\, \tilde{\beta}^{j+1}(b')}. \qquad (13)
\]

Relationship to Viterbi. Viterbi keeps track of the single best length-$j$ sequence ending in each $y^j = a$, along with the probability of that sequence. The Forward algorithm instead keeps track of the marginal probability of all length-$j$ sequences ending in $y^j = a$. Thus Viterbi takes a max where the Forward algorithm takes a sum. Because Viterbi doesn't sum, the probability of the single best sequence shrinks exponentially as the sequence grows, hence the need to work in log space for numerical stability. With the Forward-Backward algorithm, we keep track of unnormalized probabilities, which can either overflow or underflow; but because the values are products of sums, we can renormalize at each iteration and still maintain correctness.

4 How to Train: Counting and Baum-Welch

Supervised Training. In the supervised setting, we are given a training set of $N$ training examples, $S = \{(x_i, y_i)\}_{i=1}^{N}$, and our goal is to learn the parameters of the HMM that maximize the likelihood on $S$:
\[
\prod_{i=1}^{N} P(x_i, y_i).
\]
In this case, training is very straightforward counting:
\[
P(y^j = b \mid y^{j-1} = a) = \frac{\sum_{i=1}^{N} \sum_{j=1}^{M_i} \mathbf{1}\big[y^j_i = b \wedge y^{j-1}_i = a\big]}{\sum_{i=1}^{N} \sum_{j=1}^{M_i} \mathbf{1}\big[y^{j-1}_i = a\big]},
\]
\[
P(x^j = w \mid y^j = a) = \frac{\sum_{i=1}^{N} \sum_{j=1}^{M_i} \mathbf{1}\big[x^j_i = w \wedge y^j_i = a\big]}{\sum_{i=1}^{N} \sum_{j=1}^{M_i} \mathbf{1}\big[y^j_i = a\big]}.
\]
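Below is a small sketch of supervised training by counting. It assumes states and tokens are 0-indexed integers, estimates the Start distribution from the first position of each sequence, and ignores the optional End state (i.e., treats $P(\text{End} \mid y^M)$ as 1); the smoothing-free divisions mirror the formulas above.

```python
import numpy as np

def train_supervised(pairs, L, D):
    """Estimate HMM tables from labeled sequences: pairs = [(x_i, y_i), ...]."""
    A = np.zeros((L, L))   # A[b, a] counts transitions y^{j-1} = a -> y^j = b
    O = np.zeros((D, L))   # O[w, a] counts emissions of token w from state a
    start = np.zeros(L)    # counts of y^1 = a

    for x, y in pairs:
        start[y[0]] += 1
        for j in range(len(x)):
            O[x[j], y[j]] += 1
            if j > 0:
                A[y[j], y[j - 1]] += 1

    # Normalize each column, i.e., condition on the previous state / current state.
    A /= np.maximum(A.sum(axis=0, keepdims=True), 1)   # guard against unseen states
    O /= np.maximum(O.sum(axis=0, keepdims=True), 1)
    start /= start.sum()
    return A, O, start
```

Columns of `A` and `O` are indexed by the conditioning state, matching the $\tilde{A}_{a,b}$ and $\tilde{O}_{w,z}$ conventions used in Section 2.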

Unsupervised Training. In the unsupervised setting, we are given a training set of $N$ training examples containing only the $x$'s, $S = \{x_i\}_{i=1}^{N}$, and the maximum likelihood problem is thus:
\[
\prod_{i=1}^{N} P(x_i) = \prod_{i=1}^{N} \sum_{y} P(x_i, y).
\]
If we knew the marginal distributions $P(y^j_i = a, x_i)$ and $P(y^j_i = b, y^{j-1}_i = a, x_i)$, which we learned how to compute with Forward-Backward, then we could use our training data to estimate the parameters of our HMM model as:
\[
P(y^j = b \mid y^{j-1} = a) = \frac{\sum_{i} \sum_{j} P(y^j_i = b, y^{j-1}_i = a, x_i)}{\sum_{i} \sum_{j} P(y^{j-1}_i = a, x_i)}, \qquad (14)
\]
\[
P(x^j = w \mid y^j = a) = \frac{\sum_{i} \sum_{j} \mathbf{1}\big[x^j_i = w\big]\, P(y^j_i = a, x_i)}{\sum_{i} \sum_{j} P(y^j_i = a, x_i)}. \qquad (15)
\]
And of course, if we knew the parameters of our HMM model, then we could run Forward-Backward to get the marginal distributions. The EM (or Baum-Welch) algorithm solves this chicken-and-egg problem by iterating between the two computations until convergence. That is, the EM algorithm for unsupervised training of HMMs is (sketched in code below):

1. INIT: randomly initialize the HMM model parameters.
2. E-STEP: run Forward-Backward to compute the marginal probabilities (6), (7) based on the current model parameters.
3. M-STEP: update the model parameters via the maximum likelihood estimates (14), (15) given those marginals.
4. If not converged, repeat from Step 2.
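The following is a minimal sketch of the Baum-Welch loop, reusing the hypothetical `forward_backward` routine sketched earlier (so, like that routine, it omits the per-step renormalization of (10) and (11) and ignores the optional End state). A practical implementation would add the rescaling and a convergence check on the log-likelihood.

```python
import numpy as np

def baum_welch(sequences, L, D, n_iters=50, seed=0):
    """Unsupervised HMM training (EM) on a list of 0-indexed observation sequences."""
    rng = np.random.default_rng(seed)
    A = rng.dirichlet(np.ones(L), size=L).T   # A[b, a] = P(y^j = b | y^{j-1} = a)
    O = rng.dirichlet(np.ones(D), size=L).T   # O[w, a] = P(x^j = w | y^j = a)
    start = rng.dirichlet(np.ones(L))
    end = np.ones(L)                          # P(End | y^M) treated as constant 1

    for _ in range(n_iters):
        A_new, O_new, start_new = np.zeros_like(A), np.zeros_like(O), np.zeros(L)
        for x in sequences:
            alpha, beta = forward_backward(x, A, O, start, end)
            # E-step: posteriors over single states (6) and adjacent state pairs (7).
            gamma = alpha * beta
            gamma /= gamma.sum(axis=1, keepdims=True)
            for j in range(len(x) - 1):
                xi = alpha[j][None, :] * A * O[x[j + 1]][:, None] * beta[j + 1][:, None]
                A_new += xi / xi.sum()
            for j in range(len(x)):
                O_new[x[j]] += gamma[j]
            start_new += gamma[0]
        # M-step: renormalize the expected counts, as in (14) and (15).
        A = A_new / A_new.sum(axis=0, keepdims=True)
        O = O_new / O_new.sum(axis=0, keepdims=True)
        start = start_new / start_new.sum()
    return A, O, start
```

In practice one would monitor the per-iteration log-likelihood (available from the forward pass) and stop once it plateaus, rather than running a fixed number of iterations.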