Hidden Markov Models

Hidden Markov Models. Slides mostly from Mitch Marcus and Eric Fosler (with lots of modifications). Have you seen HMMs? Have you seen Kalman filters? Have you seen dynamic programming?

HMMs are dynamic latent variable models. Given a sequence of sounds, find the sequence of words most likely to have produced them. Given a sequence of images, find the sequence of locations most likely to have produced them. Given a sequence of words, find the sequence of meanings most likely to have generated them, or parts of speech (noun, verb, adverb, …), or entity types (person, place, company, date, movie), e.g. river bank vs. money bank. Hidden states s1 → s3 → s3 → s4 → s5 emit observations o1 o2 o3 o4 o5: "Bob went to the bank"

Conditional Independence. If we want the joint probability of an entire sequence, the Markov assumption lets us treat it as a product of bigram conditional probabilities: p(w1, w2, w3, w4) = p(w1) p(w2 | w1) p(w3 | w2, w1) p(w4 | w3, w2, w1) ≈ p(w1) p(w2 | w1) p(w3 | w2) p(w4 | w3)
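This factorization can be sketched concretely; the probability values below are made-up illustrative numbers, not estimates from any corpus:

```python
# Markov (bigram) factorization of a sequence probability.
# All probabilities are illustrative assumptions.
unigram = {"I": 0.1}
bigram = {("I", "like"): 0.2, ("like", "cabbages"): 0.01}

def sequence_prob(words):
    """p(w1..wn) ≈ p(w1) * product of p(w_i | w_{i-1}), per the Markov assumption."""
    p = unigram[words[0]]
    for prev, cur in zip(words, words[1:]):
        p *= bigram[(prev, cur)]
    return p
```

Under the Markov assumption, only adjacent-word tables are needed, no matter how long the sequence is.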

A Markovian weather model: tomorrow is like today, except when it isn't. The Markov matrix gives p(tomorrow's weather | today's weather). If you start from any prior distribution over states (weathers) and run long enough, you converge to a stationary distribution.

The same model expressed as a graph

Imperfect Knowledge. But sometimes we can only observe a process as through a glass, darkly. We don't get to see what is really happening, but only some clues that are more or less strongly indicative of what might be happening. So you're bucking for partner in a windowless law office and you don't see the weather for days at a time. But you do see whether your office mate (who has an actual life) brings in an umbrella or not:

How to make predictions? Now you're bored with researching briefs, and you want to guess the weather from a sequence of umbrella (non)sightings: P(w1, w2, …, wn | u1, …, un). How to do it? You observe u, but not w. w is the hidden part of the Hidden Markov Model. In speech recognition, we will observe the sounds, but not the intended words.

Bayes' rule rules! P(w | u) = P(u | w) P(w) / P(u)

A Rainy-Day Example. You go into the office Sunday morning and it's sunny: w1 = Sunny. You work through the night on Sunday, and on Monday morning, your officemate comes in with an umbrella: u2 = T. What's the probability that Monday is rainy? P(w2 = Rainy | w1 = Sunny, u2 = T) = P(u2 = T | w2 = Rainy) / P(u2 = T | w1 = Sunny) × P(w2 = Rainy | w1 = Sunny), i.e. (likelihood of umbrella) / normalization × prior.
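The arithmetic of this update can be sketched numerically; every probability value below is an illustrative assumption, not a number from the slides:

```python
# Numeric Bayes-rule update for the rainy-day example (illustrative values).
p_rain_given_sunny = 0.3  # prior: P(w2=Rainy | w1=Sunny)
p_umb_given_rain = 0.9    # likelihood: P(u2=T | w2=Rainy)
p_umb_given_dry = 0.2     # P(u2=T | w2=Sunny)

# Normalization: P(u2=T | w1=Sunny), marginalizing over Monday's weather.
p_umb = (p_umb_given_rain * p_rain_given_sunny
         + p_umb_given_dry * (1 - p_rain_given_sunny))

# Posterior: P(w2=Rainy | w1=Sunny, u2=T) = likelihood * prior / normalization
posterior = p_umb_given_rain * p_rain_given_sunny / p_umb
```

With these assumed numbers the umbrella sighting raises the rain probability from the 0.3 prior to roughly 0.66.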

Bayes rule for speech. To find the most likely word: start with a prior of how likely each word is, and the likelihood of each set of sounds given the word. The most likely word is the one most likely to have generated the sounds heard. The "fundamental equation of speech recognition": argmax_w P(w | u) = argmax_w P(u | w) P(w) / P(u), where w = word, u = sound ("utterance").

Speech Recognition. Markov model for words in a sentence: P(I like cabbages) = P(I | START) P(like | I) P(cabbages | like). Markov model for sounds in a word: model the relation of words to sounds by breaking words down into pieces.

HMMs can also extract meaning. Natural language is ambiguous: "Banks banks in banks on banks." The sequence of hidden states are the meanings (what each word refers to) and the words are the percepts. The (Markov) language model says how likely each meaning is to follow other meanings. All meanings of "banks" may produce the same percept.

HMMs: Midpoint Summary. Language can be modeled by HMMs: predict words from sounds; capture priors on words; hierarchical structure (phonemes to morphemes to words to phrases to sentences). HMMs were used in all commercial speech recognition software until recently, and are rapidly being replaced by deep networks. Covered so far: the Markov assumption, the HMM definition, and the fundamental equation of speech recognition.

Hidden Markov Models (HMMs). Part-of-speech tagging is often done using HMMs. Viewed graphically: states Det, Adj, Noun, Verb linked by transition probabilities, each with an emission table, e.g. P(w | Det): a .4, the .4; P(w | Adj): good .02, low .04; P(w | Noun): price .001, deal .0001.

HMM: The Model. Starts in some initial state s_i. Moves to a new state s_j with probability p(s_j | s_i) = a_ij; the matrix A with elements a_ij is called the Markov transition matrix. Emits an observation o_v (e.g. a word, w) with probability p(o_v | s_i) = b_iv.

An HMM is a dynamic Bayesian network. There is a node for the hidden state and for the emission (observed state) at each time step, but the probability tables are the same at all times: the Markov transition matrix A links consecutive hidden states, and the emission probabilities B link each hidden state to its observation.

Recognition using an HMM. Note that the tags t are the hidden states (s) and the words w are the emissions (o). Transitions: P(t_i | t_{i−1}). Emissions: P(w_i | t_i). So we need to find argmax_t P(t | w) = argmax_t P(w | t) P(t).

Parameters of an HMM. States: a set of states S = {s_1, …, s_k}. Markov transition probabilities: A = {a_1,1, a_1,2, …, a_k,k}, where each a_i,j = p(s_j | s_i) represents the probability of transitioning from state s_i to s_j. Emission probabilities: a set B of functions of the form b_i(o_t) = p(o_t | s_i), giving the probability of observation o_t being emitted by s_i. Initial state distribution: π_i is the probability that s_i is a start state.
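These parameters can be written down concretely. The sketch below encodes a tiny made-up HMM (two hidden states, two observation symbols) as numpy arrays; the specific values are illustrative assumptions:

```python
import numpy as np

# A tiny illustrative HMM: 2 hidden states, 2 observation symbols.
pi = np.array([0.6, 0.4])       # initial state distribution: pi[i] = P(q_1 = s_i)
A = np.array([[0.7, 0.3],       # transition matrix: A[i, j] = p(s_j | s_i)
              [0.4, 0.6]])
B = np.array([[0.9, 0.1],       # emission table: B[i, v] = p(o_v | s_i)
              [0.2, 0.8]])
```

Since each row of A and of B is a probability distribution, every row must sum to 1, as must pi.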

The Three Basic HMM Problems. Problem 1 (Evaluation): given the observation sequence O = o_1, …, o_T and an HMM model λ = (A, B, π), compute the probability of O given the model. Problem 2 (Decoding): given the observation sequence O = o_1, …, o_T and an HMM model λ = (A, B, π), find the state sequence that best explains the observations. (This and following slides follow the classic formulation by Rabiner and Juang, as adapted by Manning and Schütze. Slides adapted from Dorr.)

The Three Basic HMM Problems. Problem 3 (Learning): pick the model parameters λ = (A, B, π) to maximize P(O | λ).

Problem 1: Probability of an Observation Sequence. What is P(O | λ)? The probability of an observation sequence is the sum of the probabilities of all possible state sequences in the HMM. Naïve computation is very expensive: given T observations and k states, there are k^T possible state sequences. Even small HMMs, e.g. T = 10 and k = 10, contain 10 billion different paths. The solution: use dynamic programming. Once you are in a state, it doesn't matter how you got there.

The Trellis (for a 4-state model)

Forward Probabilities. What is the probability that, given an HMM λ, at time t the state is s_i and the partial observation o_1 … o_t has been generated? α_t(i) = P(o_1 … o_t, q_t = s_i | λ)

Forward Probabilities. α_t(i) = P(o_1 … o_t, q_t = s_i | λ). Recursion: α_t(j) = [Σ_{i=1..k} α_{t−1}(i) a_ij] b_j(o_t). Note: the summation is over the k hidden states.

Forward Algorithm. Initialization: α_1(i) = π_i b_i(o_1), 1 ≤ i ≤ k. Induction: α_t(j) = [Σ_{i=1..k} α_{t−1}(i) a_ij] b_j(o_t), 2 ≤ t ≤ T, 1 ≤ j ≤ k. Termination: P(O | λ) = Σ_{i=1..k} α_T(i).
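A minimal sketch of the forward algorithm, assuming the HMM is given as numpy arrays (pi[i] initial, A[i, j] transition, B[i, v] emission) and the observations as symbol indices; the example HMM values are made up:

```python
import numpy as np

def forward(obs, pi, A, B):
    """alpha[t, i] = P(o_1..o_t, q_t = s_i | lambda); returns (alpha, P(O | lambda))."""
    T, k = len(obs), len(pi)
    alpha = np.zeros((T, k))
    alpha[0] = pi * B[:, obs[0]]                      # initialization
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]  # induction over k states
    return alpha, alpha[-1].sum()                     # termination: sum over alpha_T(i)

# Tiny made-up example HMM and observation sequence.
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
alpha, p_obs = forward([0, 1, 0], pi, A, B)
```

Each induction step is a single matrix-vector product, which is where the O(k²T) cost comes from.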

Forward Algorithm Complexity. The naïve approach takes O(2T·k^T) computations. The forward algorithm using dynamic programming takes O(k²T) computations.

Backward Probabilities. What is the probability that, given an HMM λ = (A, B, π) and given that the state at time t is s_i, the partial observation o_{t+1} … o_T is generated? Analogous to the forward probability, just in the other direction: β_t(i) = P(o_{t+1} … o_T | q_t = s_i, λ)

Backward Probabilities. β_t(i) = P(o_{t+1} … o_T | q_t = s_i, λ). Recursion: β_t(i) = Σ_{j=1..k} a_ij b_j(o_{t+1}) β_{t+1}(j).

Backward Algorithm. Initialization: β_T(i) = 1, 1 ≤ i ≤ k. Induction: β_t(i) = Σ_{j=1..k} a_ij b_j(o_{t+1}) β_{t+1}(j), t = T−1, …, 1, 1 ≤ i ≤ k. Termination: P(O | λ) = Σ_{i=1..k} π_i b_i(o_1) β_1(i).
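A matching sketch of the backward algorithm, under the same array conventions (made-up example values); the last line applies the termination formula P(O | λ) = Σ_i π_i b_i(o_1) β_1(i):

```python
import numpy as np

def backward(obs, A, B):
    """beta[t, i] = P(o_{t+1}..o_T | q_t = s_i, lambda)."""
    T, k = len(obs), A.shape[0]
    beta = np.ones((T, k))                              # initialization: beta_T(i) = 1
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])  # induction, t = T-1 .. 1
    return beta

# Tiny made-up example HMM and observation sequence.
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
obs = [0, 1, 0]
beta = backward(obs, A, B)
p_obs = (pi * B[:, obs[0]] * beta[0]).sum()  # termination: P(O | lambda)
```

As a sanity check, this termination value must equal the one the forward algorithm produces for the same sequence.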

Problem 2: Decoding. The forward algorithm efficiently sums the probabilities of all paths through an HMM. Here, we instead want to find the single highest-probability path: the state sequence Q = q_1 … q_T such that Q = argmax_{Q'} P(Q' | O, λ).

Viterbi Algorithm. Similar to computing the forward probabilities, but instead of summing over transitions from incoming states, compute the maximum. Forward: α_t(j) = [Σ_{i=1..k} α_{t−1}(i) a_ij] b_j(o_t). Viterbi recursion: δ_t(j) = [max_{1≤i≤k} δ_{t−1}(i) a_ij] b_j(o_t).

Core Idea of the Viterbi Algorithm

Viterbi Algorithm. Initialization: δ_1(i) = π_i b_i(o_1), 1 ≤ i ≤ k. Induction: δ_t(j) = [max_{1≤i≤k} δ_{t−1}(i) a_ij] b_j(o_t); ψ_t(j) = argmax_{1≤i≤k} δ_{t−1}(i) a_ij; 2 ≤ t ≤ T, 1 ≤ j ≤ k. Termination: p* = max_{1≤i≤k} δ_T(i); q*_T = argmax_{1≤i≤k} δ_T(i). Read out the path: q*_t = ψ_{t+1}(q*_{t+1}), t = T−1, …, 1.
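The recursion, backpointers, and path readout above can be sketched as follows (same array conventions; the example HMM values are made up):

```python
import numpy as np

def viterbi(obs, pi, A, B):
    """Return (best state path, probability of that path)."""
    T, k = len(obs), len(pi)
    delta = np.zeros((T, k))
    psi = np.zeros((T, k), dtype=int)
    delta[0] = pi * B[:, obs[0]]                      # initialization
    for t in range(1, T):
        scores = delta[t - 1][:, None] * A            # scores[i, j] = delta_{t-1}(i) a_ij
        psi[t] = scores.argmax(axis=0)                # backpointers psi_t(j)
        delta[t] = scores.max(axis=0) * B[:, obs[t]]  # induction (max, not sum)
    path = [int(delta[-1].argmax())]                  # termination: q*_T
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t][path[-1]]))            # readout: q*_t = psi_{t+1}(q*_{t+1})
    return path[::-1], delta[-1].max()

# Tiny made-up example HMM and observation sequence.
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
path, p_star = viterbi([0, 1, 0], pi, A, B)
```

The only change from the forward sketch is replacing the sum over incoming states with a max, plus remembering which state achieved it.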

Problem 3: Learning. Up to now we've assumed that we know the underlying model λ = (A, B, π). Often these parameters are estimated on annotated training data (i.e. with known hidden states), but of course such labels are often lacking. We want to maximize the parameters with respect to the current data, i.e., find a model λ' such that λ' = argmax_λ P(O | λ).

Problem 3: Learning. Unfortunately, there is no known way to analytically find a global maximum, i.e., a model λ' such that λ' = argmax_λ P(O | λ). But it is possible to find a local maximum: given an initial model λ, we can always find a model λ' such that P(O | λ') ≥ P(O | λ).

Forward-Backward (Baum-Welch) algorithm. An EM algorithm: find the forward and backward probabilities for the current parameters; re-estimate the model parameters given the estimated probabilities; repeat.

Parameter Re-estimation. Three parameters need to be re-estimated: the initial state distribution π_i, the transition probabilities a_i,j, and the emission probabilities b_i(o_t).

Re-estimating Transition Probabilities. What's the probability of being in state s_i at time t and going to state s_j, given the current model and parameters? ξ_t(i, j) = P(q_t = s_i, q_{t+1} = s_j | O, λ)

Re-estimating Transition Probabilities. ξ_t(i, j) = P(q_t = s_i, q_{t+1} = s_j | O, λ) = α_t(i) a_i,j b_j(o_{t+1}) β_{t+1}(j) / Σ_{i=1..k} Σ_{j=1..k} α_t(i) a_i,j b_j(o_{t+1}) β_{t+1}(j)

Re-estimating Transition Probabilities. The intuition behind the re-estimation equation for transition probabilities: â_i,j = expected number of transitions from state s_i to state s_j / expected number of transitions from state s_i. Formally: â_i,j = Σ_{t=1..T−1} ξ_t(i, j) / Σ_{t=1..T−1} Σ_{j'=1..k} ξ_t(i, j')

Re-estimating Transition Probabilities. Defining γ_t(i) = Σ_{j=1..k} ξ_t(i, j) as the probability of being in state s_i given the complete observation O, we can say: â_i,j = Σ_{t=1..T−1} ξ_t(i, j) / Σ_{t=1..T−1} γ_t(i)

Re-estimating Initial State Probabilities. Initial state distribution: π_i is the probability that s_i is a start state. Re-estimation is easy: π̂_i = expected number of times in state s_i at time 1. Formally: π̂_i = γ_1(i).

Re-estimation of Emission Probabilities. Emission probabilities are re-estimated as: b̂_i(k) = expected number of times in state s_i observing symbol v_k / expected number of times in state s_i. Formally: b̂_i(k) = Σ_{t=1..T} δ(o_t, v_k) γ_t(i) / Σ_{t=1..T} γ_t(i), where δ(o_t, v_k) = 1 if o_t = v_k, and 0 otherwise. Note that δ here is the Kronecker delta function and is not related to the δ in the discussion of the Viterbi algorithm!

The Updated Model. Starting from λ = (A, B, π), we get to λ' = (Â, B̂, π̂) by the following update rules: â_i,j = Σ_{t=1..T−1} ξ_t(i, j) / Σ_{t=1..T−1} γ_t(i); b̂_i(k) = Σ_{t=1..T} δ(o_t, v_k) γ_t(i) / Σ_{t=1..T} γ_t(i); π̂_i = γ_1(i).
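A sketch of one full Baum-Welch re-estimation step implementing the ξ, γ, and update formulas above (same array conventions as earlier; the example HMM values are made up):

```python
import numpy as np

def baum_welch_step(obs, pi, A, B):
    """One EM step: E-step computes alpha, beta, xi, gamma; M-step applies the updates."""
    T, k = len(obs), len(pi)
    # E-step: forward and backward probabilities for the current parameters.
    alpha = np.zeros((T, k))
    beta = np.ones((T, k))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    p_obs = alpha[-1].sum()  # P(O | lambda), the normalization in the xi formula
    # xi[t, i, j] = P(q_t = s_i, q_{t+1} = s_j | O, lambda)
    xi = np.zeros((T - 1, k, k))
    for t in range(T - 1):
        xi[t] = alpha[t][:, None] * A * B[:, obs[t + 1]] * beta[t + 1] / p_obs
    # gamma[t, i] = P(q_t = s_i | O, lambda)
    gamma = alpha * beta / p_obs
    # M-step: the update rules above.
    new_pi = gamma[0]
    new_A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    new_B = np.zeros_like(B)
    for v in range(B.shape[1]):
        sel = np.array([o == v for o in obs])  # Kronecker delta(o_t, v_k)
        new_B[:, v] = gamma[sel].sum(axis=0) / gamma.sum(axis=0)
    return new_pi, new_A, new_B

# Tiny made-up example HMM and observation sequence.
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
obs = [0, 1, 0]
new_pi, new_A, new_B = baum_welch_step(obs, pi, A, B)
```

By the EM guarantee on the previous slide, iterating this step never decreases P(O | λ).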

HMM vs. CRF. HMMs are generative models: they model the full joint probability distribution. Conditional Random Fields are discriminative: similar to HMMs, but they model only the probability of the labels given the data. CRFs generally give more accurate predictions; they are a bit harder to code, but good open-source versions are available. They were very popular in the research world (until deep nets arrived).

What you should know. HMMs: the Markov assumption; the Markov transition matrix and emission probabilities; Viterbi decoding (just well enough to do the homework).