A gentle introduction to Hidden Markov Models

A gentle introduction to Hidden Markov Models Mark Johnson Brown University November 2009 1 / 27

Outline What is sequence labeling? Markov models Hidden Markov models Finding the most likely state sequence Estimating HMMs Conclusion 2 / 27

Sequence labeling Input: a sequence x = (x_1, ..., x_n) (usually the length n varies from sequence to sequence) Output: a sequence y = (y_1, ..., y_n), where y_i is a label for x_i. Part of speech (POS) tagging: y: DT JJ NN VBD NNP . / x: the big cat bit Sam . Noun-phrase chunking: [NP the big cat ] bit [NP Sam ] . (brackets mark noun-phrase chunks) Named entity detection: [CO XYZ Corp. ] of [LOC Boston ] announced [PER Spade ] 's resignation. Speech recognition: the x are 100 msec. time slices of acoustic input, and the y are the corresponding phonemes (i.e., y_i is the phoneme being uttered in time slice x_i) 3 / 27

Sequence labeling with probabilistic models General idea: develop a probabilistic model P(y, x) and use it to compute P(y | x) and ŷ(x) = argmax_y P(y | x). A Hidden Markov Model (HMM) factors: P(y, x) = P(y) P(x | y). P(y) is the probability of the label sequence y; in an HMM, P(y) is a Markov chain, and in applications y is usually hidden. P(x | y) is the emission model. 4 / 27

Outline What is sequence labeling? Markov models Hidden Markov models Finding the most likely state sequence Estimating HMMs Conclusion 5 / 27

Markov models of sequences Let y = (y_1, ..., y_n), where y_i ∈ Y and n is fixed.
P(y) = P(y_1) P(y_2 | y_1) P(y_3 | y_1, y_2) ... P(y_n | y_1, ..., y_{n-1}) = P(y_1) ∏_{i=2}^{n} P(y_i | y_{1:i-1})
where y_{i:j} is the subsequence (y_i, ..., y_j) of y.
In a (first-order) Markov chain: P(y_i | y_{1:i-1}) = P(y_i | y_{i-1}) for i > 1.
In a homogeneous Markov chain, P(y_i | y_{i-1}) does not depend on i, so the probability of a sequence is completely determined by:
P(y_1) = ι(y_1) (start probabilities)
P(y_i | y_{i-1}) = σ(y_{i-1}, y_i) (transition probabilities), so:
P(y) = ι(y_1) ∏_{i=2}^{n} σ(y_{i-1}, y_i) 6 / 27
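
A small Python illustration of this (the ι and σ numbers below are invented for the example, not taken from the slides): a homogeneous first-order chain is scored directly from the start and transition tables.

iota = {"D": 0.6, "N": 0.3, "V": 0.1}   # start probabilities iota(y_1)
sigma = {                               # transition probabilities sigma(y_{i-1}, y_i)
    "D": {"D": 0.1, "N": 0.8, "V": 0.1},
    "N": {"D": 0.2, "N": 0.3, "V": 0.5},
    "V": {"D": 0.6, "N": 0.3, "V": 0.1},
}

def chain_prob(y):
    """P(y) = iota(y_1) * prod_{i=2}^{n} sigma(y_{i-1}, y_i)."""
    p = iota[y[0]]
    for prev, cur in zip(y, y[1:]):
        p *= sigma[prev][cur]
    return p

print(chain_prob(["D", "N", "V"]))   # 0.6 * 0.8 * 0.5 = 0.24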

Markov models of varying-length sequences To define a model of sequences y = (y_1, ..., y_n) of varying lengths n, we need to define P(n). An easy way to do this: add a new end-marker symbol ◁ to Y and pad all sequences with it, y = (y_1, y_2, ..., y_n, ◁, ◁, ...), and set σ(◁, ◁) = 1, i.e., ◁ is a sink state, so σ(y, ◁) is the probability of the sequence ending after state y. We can also incorporate the start probabilities ι into σ by introducing a new begin marker y_0 = ▷, so y = (▷, y_1, y_2, ..., y_n, ◁, ◁, ...) and:
P(y) = ∏_{i=1}^{n+1} σ(y_{i-1}, y_i) (all factors beyond i = n + 1 equal 1) 7 / 27

Outline What is sequence labeling? Markov models Hidden Markov models Finding the most likely state sequence Estimating HMMs Conclusion 8 / 27

Hidden Markov models An HMM is specified by: a state-to-state transition matrix σ, where σ(y, y') is the probability of moving from state y to state y', and a state-to-observation emission matrix τ, where τ(y, x) is the probability of emitting x in state y.
P(y, x) = P(y) P(x | y)
P(y) = ∏_{i=1}^{n+1} P(y_i | y_{i-1}) = ∏_{i=1}^{n+1} σ(y_{i-1}, y_i)
P(x | y) = ∏_{i=1}^{n} P(x_i | y_i) = ∏_{i=1}^{n} τ(y_i, x_i)
I.e., each x_i is generated conditioned on y_i with probability τ(y_i, x_i). 9 / 27
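
As a concrete sketch in Python: the emission probabilities below mirror the toy V/D/N automaton on the next slide, while the transition probabilities and the "<s>"/"</s>" begin/end markers are invented for the example; the function scores a (sentence, tag sequence) pair exactly as in the factorisation above.

states = ["V", "D", "N"]
sigma = {   # sigma(y, y'): transitions, including begin/end markers (made-up numbers)
    "<s>": {"V": 0.4, "D": 0.5, "N": 0.1},
    "V":   {"V": 0.1, "D": 0.5, "N": 0.3, "</s>": 0.1},
    "D":   {"V": 0.1, "D": 0.1, "N": 0.7, "</s>": 0.1},
    "N":   {"V": 0.3, "D": 0.2, "N": 0.2, "</s>": 0.3},
}
tau = {     # tau(y, x): emissions, as in the automaton figure on the next slide
    "V": {"buy": 0.4, "eat": 0.3, "flour": 0.2, "pan": 0.1},
    "D": {"the": 0.7, "a": 0.3},
    "N": {"buy": 0.2, "flour": 0.4, "pan": 0.4},
}

def joint_prob(x, y):
    """P(x, y) = prod_i sigma(y_{i-1}, y_i) tau(y_i, x_i), times sigma(y_n, </s>)."""
    p, prev = 1.0, "<s>"
    for word, state in zip(x, y):
        p *= sigma[prev].get(state, 0.0) * tau[state].get(word, 0.0)
        prev = state
    return p * sigma[prev].get("</s>", 0.0)

print(joint_prob(["flour", "the", "pan"], ["V", "D", "N"]))   # ≈ 0.0024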

HMMs as stochastic automata [Figure: a stochastic automaton over the states V, D, and N, with transition probabilities on the arcs and an emission distribution attached to each state: V emits buy 0.4, eat 0.3, flour 0.2, pan 0.1; D emits the 0.7, a 0.3; N emits buy 0.2, flour 0.4, pan 0.4.] 10 / 27

Bayes net representation of an HMM [Figure: the Bayes net for the state sequence Y_0 = ▷, Y_1 = V, Y_2 = D, Y_3 = N, Y_4 = ◁ and the observations X_1 = flour, X_2 = the, X_3 = pan, with an arc from each Y_{i-1} to Y_i and from each Y_i to X_i.] 11 / 27

Trellis representation of HMM [Figure: the trellis for the observation sequence X_1 = flour, X_2 = pan, with Y_0 = ▷, the candidate states {V, N} at positions 1 and 2, and Y_3 = ◁.] The trellis representation represents the possible state sequences y for a single observation sequence x. A trellis is a DAG with nodes (i, y), where i ∈ {0, ..., n + 1} and y ∈ Y, and edges from (i−1, y') to (i, y). It plays a crucial role in dynamic programming algorithms for HMMs. 12 / 27

Outline What is sequence labeling? Markov models Hidden Markov models Finding the most likely state sequence Estimating HMMs Conclusion 13 / 27

Two important inference problems Given an HMM (σ, τ) and an observation sequence x: what is the most likely state sequence ŷ = argmax_y P(x, y)? what is the probability of the observations, P(x) = Σ_y P(x, y)? If the sequences are short enough, both can be solved by exhaustively enumerating y. Given a sequence of length n and an HMM with m states, both can be solved using dynamic programming in O(n m²) time. 14 / 27

Key observation: P(x, y) factorizes!
P(x, y) = (∏_{i=1}^{n} σ(y_{i-1}, y_i) τ(y_i, x_i)) σ(y_n, ◁), so:
−log P(x, y) = −Σ_{i=1}^{n} log(σ(y_{i-1}, y_i) τ(y_i, x_i)) − log σ(y_n, ◁)
Given an observation sequence x, define a trellis with edge weights
w((i−1, y'), (i, y)) = −log(σ(y', y) τ(y, x_i))
w((n, y), (n+1, ◁)) = −log σ(y, ◁)
Then the negative log probability of any state sequence y is the sum of the weights of the corresponding path in the trellis. 15 / 27

Finding the most likely state sequence ŷ Goal: find ŷ = argmax_y P(x, y). Algorithm: construct a trellis with edge weights
w((i−1, y'), (i, y)) = −log(σ(y', y) τ(y, x_i))
w((n, y), (n+1, ◁)) = −log σ(y, ◁)
and run a shortest-path algorithm on this trellis: the shortest path has the lowest total weight, i.e., the lowest negative log probability, so it corresponds to the most likely state sequence. 16 / 27

The Viterbi algorithm for finding ŷ Idea: given x = (x_1, ..., x_n), let µ(k, u) be the maximum probability of any state sequence y_1, ..., y_k ending in y_k = u ∈ Y:
µ(k, u) = max_{y_{1:k} : y_k = u} P(x_{1:k}, y_{1:k})
Base case: µ(0, ▷) = 1
Recursive case: µ(k, u) = max_{u'} µ(k−1, u') σ(u', u) τ(u, x_k)
Final case: µ(n+1, ◁) = max_{u'} µ(n, u') σ(u', ◁)
To find ŷ, for each trellis node (k, u) keep a back-pointer to the most likely predecessor:
ρ(k, u) = argmax_{u' ∈ Y} µ(k−1, u') σ(u', u) τ(u, x_k) 17 / 27
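
A minimal Viterbi sketch in Python following this recursion; it assumes the hypothetical toy sigma, tau, and states dictionaries from the HMM sketch after slide 9 (none of those numbers come from the slides).

def viterbi(x, states, sigma, tau, BOS="<s>", EOS="</s>"):
    """Return (best state sequence, its joint probability P(x, y))."""
    mu = [{BOS: 1.0}]                       # base case: mu(0, BOS) = 1
    back = [{}]                             # back-pointers rho(k, u)
    for k, word in enumerate(x, start=1):
        mu_k, back_k = {}, {}
        for u in states:
            # pick the predecessor u' maximising mu(k-1, u') * sigma(u', u)
            best = max(mu[k - 1], key=lambda up: mu[k - 1][up] * sigma[up].get(u, 0.0))
            mu_k[u] = mu[k - 1][best] * sigma[best].get(u, 0.0) * tau[u].get(word, 0.0)
            back_k[u] = best
        mu.append(mu_k)
        back.append(back_k)
    # final case: mu(n+1, EOS) = max_u mu(n, u) * sigma(u, EOS)
    last = max(mu[-1], key=lambda u: mu[-1][u] * sigma[u].get(EOS, 0.0))
    best_prob = mu[-1][last] * sigma[last].get(EOS, 0.0)
    y = [last]                              # follow the back-pointers
    for k in range(len(x), 1, -1):
        y.append(back[k][y[-1]])
    return list(reversed(y)), best_prob

# e.g. viterbi(["flour", "the", "pan"], states, sigma, tau)
# -> (['V', 'D', 'N'], ≈ 0.0024) under the toy parameters above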

Why does the Viterbi algorithm work?
µ(k, u) = max_{y_{1:k} : y_k = u} P(x_{1:k}, y_{1:k})
= max_{y_{1:k} : y_k = u} ∏_{i=1}^{k} σ(y_{i-1}, y_i) τ(y_i, x_i)
= max_{y_{1:k-1}} (∏_{i=1}^{k-1} σ(y_{i-1}, y_i) τ(y_i, x_i)) σ(y_{k-1}, u) τ(u, x_k)
= max_{u'} (max_{y_{1:k-1} : y_{k-1} = u'} ∏_{i=1}^{k-1} σ(y_{i-1}, y_i) τ(y_i, x_i)) σ(u', u) τ(u, x_k)
= max_{u'} µ(k−1, u') σ(u', u) τ(u, x_k) 18 / 27

The sum-product algorithm for computing P(x) Goal: compute P(x) = Σ_{y ∈ Y^n} P(x, y). Idea: given x = (x_1, ..., x_n), let α(k, u) be the sum of the probabilities of x_1, ..., x_k over all state sequences y_1, ..., y_k ending in y_k = u:
α(k, u) = Σ_{y_{1:k} : y_k = u} P(x_{1:k}, y_{1:k})
Base case: α(0, ▷) = 1
Recursive case: for k = 1, ..., n: α(k, u) = Σ_{u'} α(k−1, u') σ(u', u) τ(u, x_k)
Final case: α(n+1, ◁) = Σ_{u'} α(n, u') σ(u', ◁)
The α are called forward probabilities. 19 / 27
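
A minimal Python sketch of the forward recursion (again assuming the hypothetical toy sigma, tau, and states from the HMM sketch after slide 9); it returns the whole table of forward probabilities together with P(x).

def forward(x, states, sigma, tau, BOS="<s>", EOS="</s>"):
    """Return (alphas, P(x)), where alphas[k][u] = alpha(k, u)."""
    alphas = [{BOS: 1.0}]                             # base case: alpha(0, BOS) = 1
    for word in x:                                    # k = 1, ..., n
        prev = alphas[-1]
        alphas.append({u: sum(prev[v] * sigma[v].get(u, 0.0) for v in prev)
                          * tau[u].get(word, 0.0)
                       for u in states})
    # final case: alpha(n+1, EOS) = sum_u alpha(n, u) * sigma(u, EOS)
    px = sum(alphas[-1][u] * sigma[u].get(EOS, 0.0) for u in alphas[-1])
    return alphas, px

# e.g. forward(["flour", "the", "pan"], states, sigma, tau)[1] gives
# P(x) ≈ 0.0029 under the toy parameters, slightly more than the probability
# of the single best sequence found by Viterbi.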

Outline What is sequence labeling? Markov models Hidden Markov models Finding the most likely state sequence Estimating HMMs Conclusion 20 / 27

Estimating HMMs from visible data Training data: D = ((x_1, y_1), ..., (x_n, y_n)), where the x_i and y_i are sequences. For conceptual simplicity, assume that the data comes as two long sequences, D = (x, y). Then the MLE is:
σ̂(y, y') = c(y, y') / Σ_{y'' ∈ Y} c(y, y'')
τ̂(y, x) = c(y, x) / Σ_{x' ∈ X} c(y, x')
where:
c(y, y') = the number of times state y precedes state y' in y
c(y, x) = the number of times state y emits x in (x, y) 21 / 27
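
A short Python sketch of these relative-frequency estimates (the tiny labelled corpus in the example call is invented; "<s>"/"</s>" again stand in for the begin/end markers).

from collections import Counter

def mle(pairs, BOS="<s>", EOS="</s>"):
    """Estimate sigma-hat and tau-hat by relative frequency from labelled
    (word sequence, tag sequence) pairs, padding with begin/end markers."""
    trans, emit = Counter(), Counter()
    for x, y in pairs:
        tags = [BOS] + list(y) + [EOS]
        trans.update(zip(tags, tags[1:]))      # c(y, y'): y followed by y'
        emit.update(zip(y, x))                 # c(y, x): y emits x
    trans_tot, emit_tot = Counter(), Counter()
    for (prev, _), n in trans.items():
        trans_tot[prev] += n
    for (state, _), n in emit.items():
        emit_tot[state] += n
    sigma_hat = {(prev, cur): n / trans_tot[prev] for (prev, cur), n in trans.items()}
    tau_hat = {(state, word): n / emit_tot[state] for (state, word), n in emit.items()}
    return sigma_hat, tau_hat

# e.g. mle([(["the", "pan"], ["D", "N"]), (["buy", "flour"], ["V", "N"])])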

Estimating HMMs from hidden data using EM The EM iteration at time t:
σ^(t)(y, y') = E[c(y, y')] / Σ_{y'' ∈ Y} E[c(y, y'')]
τ^(t)(y, x) = E[c(y, x)] / Σ_{x' ∈ X} E[c(y, x')]
where the expected counts are taken over the hidden state sequences:
E[c(y, y')] = Σ_{y ∈ Y^n} c(y, y') P(y | x, σ^(t−1), τ^(t−1))
E[c(y, x)] = Σ_{y ∈ Y^n} c(y, x) P(y | x, σ^(t−1), τ^(t−1))
(here c(y, y') and c(y, x) are the counts in the hypothesised hidden sequence y)
Key computational problem: computing these expectations efficiently. Dynamic programming to the rescue! 22 / 27

Expectations require marginal probabilities Suppose we could efficiently compute the marginal probabilities:
P(Y_i = y | x) = Σ_{y ∈ Y^n : y_i = y} P(y, x) / P(x)
Then we could efficiently compute:
E[c(y, x')] = Σ_{i : x_i = x'} P(Y_i = y | x)
Efficiently computing E[c(y, y')] requires computation of a different marginal probability (the pairwise marginal P(Y_{i−1} = y, Y_i = y' | x)). 23 / 27

Backward probabilities Idea: given x = (x_1, ..., x_n), let β(k, v) be the sum of the probabilities of x_{k+1}, ..., x_n over all state sequences y_{k+1}, ..., y_n conditioned on y_k = v:
β(k, v) = Σ_{y_{k:n} : y_k = v} P(x_{k+1:n}, y_{k+1:n} | Y_k = v)
Base case: β(n, v) = σ(v, ◁) for each v ∈ Y
Recursive case: for k = n−1, ..., 0: β(k, v) = Σ_{v'} σ(v, v') τ(v', x_{k+1}) β(k+1, v')
The β are called backward probabilities. 24 / 27
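
A minimal Python sketch of the backward recursion (same hypothetical toy parameters as the HMM sketch after slide 9); for brevity it only computes β(0, ·) for the begin marker, where it equals P(x).

def backward(x, states, sigma, tau, BOS="<s>", EOS="</s>"):
    """Return betas, where betas[k][v] = beta(k, v) for k = 1, ..., n;
    betas[0] holds beta(0, BOS), which equals P(x)."""
    n = len(x)
    betas = [None] * (n + 1)
    betas[n] = {v: sigma[v].get(EOS, 0.0) for v in states}   # base: beta(n, v) = sigma(v, EOS)
    for k in range(n - 1, 0, -1):                            # k = n-1, ..., 1
        betas[k] = {v: sum(sigma[v].get(w, 0.0) * tau[w].get(x[k], 0.0) * betas[k + 1][w]
                           for w in states)
                    for v in states}
    betas[0] = {BOS: sum(sigma[BOS].get(w, 0.0) * tau[w].get(x[0], 0.0) * betas[1][w]
                         for w in states)}
    return betas

# e.g. backward(["flour", "the", "pan"], states, sigma, tau)[0]["<s>"]
# reproduces P(x) ≈ 0.0029, matching the forward pass.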

Computing marginals from forward and backward probabilities
P(Y_i = u | x) = (1 / P(x)) Σ_{y ∈ Y^n : y_i = u} P(y, x)
= (1 / P(x)) Σ_{y : y_i = u} P(x_{1:i}, y_{1:i}) P(x_{i+1:n}, y_{i+1:n} | Y_i = u)
= (1 / P(x)) (Σ_{y_{1:i} : y_i = u} P(x_{1:i}, y_{1:i})) (Σ_{y_{i:n} : y_i = u} P(x_{i+1:n}, y_{i+1:n} | Y_i = u))
= (1 / P(x)) α(i, u) β(i, u)
So the marginal P(Y_i | x) can be efficiently computed from α and β. 25 / 27
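
Putting the two passes together in Python (this assumes the forward() and backward() sketches above and the same hypothetical toy parameters are in scope); the returned list has one distribution over states per position i = 1, ..., n.

def state_marginals(x, states, sigma, tau):
    """Return, for each position i = 1, ..., n, the marginal P(Y_i = u | x),
    computed as alpha(i, u) * beta(i, u) / P(x)."""
    alphas, px = forward(x, states, sigma, tau)    # forward sketch above
    betas = backward(x, states, sigma, tau)        # backward sketch above
    return [{u: alphas[i][u] * betas[i][u] / px for u in states}
            for i in range(1, len(x) + 1)]

# For x = ["flour", "the", "pan"] the position-2 marginal puts all its mass
# on D, since under the toy parameters only D can emit "the".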

Outline What is sequence labeling? Markov models Hidden Markov models Finding the most likely state sequence Estimating HMMs Conclusion 26 / 27

Conclusion Hidden Markov Models (HMMs) can be used to label a sequence of observations There are efficient dynamic programming algorithms to find the most likely label sequence and the marginal probability of a label The expectation-maximization algorithm can be used to learn an HMM from unlabeled data The expected values required for EM can be efficiently computed 27 / 27