Hidden Markov Models: The Three Basic HMM Problems (note: change in notation)
Mitch Marcus, CSE 391


Parameters of an HMM

- States: a set of states $S = \{s_1, \dots, s_n\}$
- Transition probabilities: $A = \{a_{1,1}, a_{1,2}, \dots, a_{n,n}\}$, where each $a_{i,j}$ is the probability of transitioning from state $s_i$ to state $s_j$
- Emission probabilities: a set $B$ of functions of the form $b_i(o_t)$, the probability of observation $o_t$ being emitted by state $s_i$
- Initial state distribution: $\pi_i$ is the probability that $s_i$ is a start state

(This and later slides follow the classic formulation by Rabiner and Juang, following Ferguson, as adapted by Manning and Schütze. Slides adapted from Dorr. Note the change in notation!!)
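As a concrete illustration (not from the slides), here is a minimal sketch of one way to hold these parameters as NumPy arrays. The state names, vocabulary, and probability values are invented purely for demonstration.

```python
# A minimal sketch of the HMM parameters (A, B, pi) as NumPy arrays.
# States, vocabulary, and all probability values below are hypothetical.
import numpy as np

states = ["Rainy", "Sunny"]            # s_1 ... s_N (hypothetical)
vocab  = ["walk", "shop", "clean"]     # observation symbols v_1 ... v_M (hypothetical)

pi = np.array([0.6, 0.4])              # pi_i: probability that s_i is the start state
A  = np.array([[0.7, 0.3],             # a_ij: P(state s_j at t+1 | state s_i at t)
               [0.4, 0.6]])
B  = np.array([[0.1, 0.4, 0.5],        # b_i(o_t) = B[i, o_t]: P(symbol o_t | state s_i)
               [0.6, 0.3, 0.1]])

obs = [0, 2, 1]                        # an observation sequence as symbol indices
```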

The Three Basic HMM Problems

- Problem 1 (Evaluation): Given the observation sequence $O = o_1, \dots, o_T$ and an HMM model $\lambda = (A, B, \pi)$, how do we compute $P(O \mid \lambda)$, the probability of $O$ given the model?
- Problem 2 (Decoding): Given the observation sequence $O = o_1, \dots, o_T$ and an HMM model $\lambda = (A, B, \pi)$, how do we find the state sequence that best explains the observations?
- Problem 3 (Learning): How do we adjust the model parameters $\lambda = (A, B, \pi)$ to maximize $P(O \mid \lambda)$?

Problem 1: Probability of an Observation Sequence

- Q: What is $P(O \mid \lambda)$?
- A: The sum of the probabilities of all possible state sequences in the HMM, where the probability of each state sequence is itself the product of its state-transition and emission probabilities.
- Naïve computation is very expensive. Given $T$ observations and $N$ states, there are $N^T$ possible state sequences. (For $T = 10$ and $N = 10$, that is 10 billion different paths!)
- Solution: dynamic programming, linear in $T$. A brute-force sketch is shown below for contrast.
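To make the $N^T$ blow-up concrete, here is a minimal brute-force sketch (not part of the slides) that enumerates every state sequence, assuming the `pi`, `A`, `B`, and `obs` arrays from the sketch above. It is only practical for tiny $T$ and $N$, which is exactly the point.

```python
# Brute-force P(O | lambda) by enumerating all N**T state sequences.
# Exponential in T; only useful to sanity-check the forward algorithm on tiny examples.
import itertools
import numpy as np

def brute_force_likelihood(obs, pi, A, B):
    N = len(pi)
    total = 0.0
    for path in itertools.product(range(N), repeat=len(obs)):
        # Probability of this state sequence jointly with the observations
        p = pi[path[0]] * B[path[0], obs[0]]
        for t in range(1, len(obs)):
            p *= A[path[t - 1], path[t]] * B[path[t], obs[t]]
        total += p
    return total
```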

The Crucial Data Structure: The Trellis

Forward Probabilities

- For a given HMM, given that the state is $s_i$ at time $t$ (with the change of notation, $t$ is now some arbitrary time), what is the probability that the partial observation $o_1 \dots o_t$ has been generated?
$$\alpha_t(i) = P(o_1 \dots o_t,\ q_t = s_i \mid \lambda)$$
- The forward algorithm computes $\alpha_t(i)$ for all $1 \le i \le N$, $1 \le t \le T$ in time $O(N^2 T)$ using the trellis.

Forward Algorithm: Induction Step

$$\alpha_t(i) = P(o_1 \dots o_t,\ q_t = s_i \mid \lambda)$$
$$\alpha_t(j) = \left[ \sum_{i=1}^{N} \alpha_{t-1}(i)\, a_{ij} \right] b_j(o_t)$$

Forward Algorithm

- Initialization: $\alpha_1(i) = \pi_i\, b_i(o_1), \quad 1 \le i \le N$
- Induction: $\alpha_t(j) = \left[ \sum_{i=1}^{N} \alpha_{t-1}(i)\, a_{ij} \right] b_j(o_t), \quad 2 \le t \le T,\ 1 \le j \le N$
- Termination: $P(O \mid \lambda) = \sum_{i=1}^{N} \alpha_T(i)$
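A minimal sketch of the forward algorithm in NumPy, assuming the `pi`, `A`, `B`, `obs` representation from the earlier sketch; this is one straightforward way to fill the trellis, not the slides' own code.

```python
# Forward algorithm: fills the alpha trellis and returns P(O | lambda).
import numpy as np

def forward(obs, pi, A, B):
    N, T = len(pi), len(obs)
    alpha = np.zeros((T, N))
    # Initialization: alpha_1(i) = pi_i * b_i(o_1)
    alpha[0] = pi * B[:, obs[0]]
    # Induction: alpha_t(j) = (sum_i alpha_{t-1}(i) * a_ij) * b_j(o_t)
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    # Termination: P(O | lambda) = sum_i alpha_T(i)
    return alpha, alpha[-1].sum()
```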

Forward Algorithm Complexity

- The naïve approach requires exponential time to evaluate all $N^T$ state sequences.
- The forward algorithm, using dynamic programming, takes $O(N^2 T)$ computations.

Backward Probabilities

- For a given HMM, given that the state is $s_i$ at time $t$, what is the probability that the partial observation $o_{t+1} \dots o_T$ will be generated?
$$\beta_t(i) = P(o_{t+1} \dots o_T \mid q_t = s_i, \lambda)$$
- Analogous to the forward probability, just in the other direction.
- The backward algorithm computes $\beta_t(i)$ for all $1 \le i \le N$, $1 \le t \le T$ in time $O(N^2 T)$ using the trellis.

Backward Probabilities

$$\beta_t(i) = P(o_{t+1} \dots o_T \mid q_t = s_i, \lambda)$$
$$\beta_t(i) = \sum_{j=1}^{N} a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)$$

Backward Algorithm

- Initialization: $\beta_T(i) = 1, \quad 1 \le i \le N$
- Induction: $\beta_t(i) = \sum_{j=1}^{N} a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j), \quad T-1 \ge t \ge 1,\ 1 \le i \le N$
- Termination: $P(O \mid \lambda) = \sum_{i=1}^{N} \pi_i\, b_i(o_1)\, \beta_1(i)$
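Correspondingly, a minimal sketch of the backward algorithm under the same assumed representation; its termination value should match the forward algorithm's $P(O \mid \lambda)$ up to floating-point error.

```python
# Backward algorithm: fills the beta trellis and returns P(O | lambda).
import numpy as np

def backward(obs, pi, A, B):
    N, T = len(pi), len(obs)
    beta = np.zeros((T, N))
    # Initialization: beta_T(i) = 1
    beta[-1] = 1.0
    # Induction: beta_t(i) = sum_j a_ij * b_j(o_{t+1}) * beta_{t+1}(j)
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    # Termination: P(O | lambda) = sum_i pi_i * b_i(o_1) * beta_1(i)
    return beta, (pi * B[:, obs[0]] * beta[0]).sum()
```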

Problem 2: Decoding

- The forward algorithm efficiently gives the sum over all paths through an HMM.
- Here, we instead want to find the highest-probability path.
- We want to find the state sequence $Q = q_1 \dots q_T$ such that
$$Q = \operatorname*{argmax}_{Q'} P(Q' \mid O, \lambda)$$

Viterbi Algorithm

- Just like the forward algorithm, but instead of summing over transitions from incoming states, compute the maximum.
- Forward: $\alpha_t(j) = \left[ \sum_{i=1}^{N} \alpha_{t-1}(i)\, a_{ij} \right] b_j(o_t)$
- Viterbi recursion: $\delta_t(j) = \left[ \max_{1 \le i \le N} \delta_{t-1}(i)\, a_{ij} \right] b_j(o_t)$

Core Idea of Viterbi Algorithm

Not Quite What We Want

- The Viterbi recursion computes the maximum probability of a path to state $s_j$ at time $t$ given that the partial observation $o_1 \dots o_t$ has been generated.
- But we want the path itself that gives the maximum probability.
- Solution:
  1. Keep backpointers while computing $\delta_t(j) = \left[ \max_{1 \le i \le N} \delta_{t-1}(i)\, a_{ij} \right] b_j(o_t)$
  2. Find $\operatorname*{argmax}_{j} \delta_T(j)$
  3. Chase backpointers from state $j$ at time $T$ to find the state sequence (backwards).

Viterbi Algorithm

- Initialization: $\delta_1(i) = \pi_i\, b_i(o_1), \quad 1 \le i \le N$
- Induction: $\delta_t(j) = \left[ \max_{1 \le i \le N} \delta_{t-1}(i)\, a_{ij} \right] b_j(o_t), \quad 2 \le t \le T,\ 1 \le j \le N$
- Backpointers: $\psi_t(j) = \operatorname*{argmax}_{1 \le i \le N} \delta_{t-1}(i)\, a_{ij}$
- Termination: $q_T^* = \operatorname*{argmax}_{1 \le i \le N} \delta_T(i)$ (final state!)
- Backpointer path: $q_t^* = \psi_{t+1}(q_{t+1}^*), \quad t = T-1, \dots, 1$
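A minimal sketch of Viterbi decoding with backpointers, again assuming the `pi`, `A`, `B`, `obs` representation from the earlier sketches. In practice one would usually work in log space to avoid underflow; that is omitted here for clarity.

```python
# Viterbi decoding: best state sequence and its probability.
import numpy as np

def viterbi(obs, pi, A, B):
    N, T = len(pi), len(obs)
    delta = np.zeros((T, N))
    psi = np.zeros((T, N), dtype=int)        # backpointers
    # Initialization: delta_1(i) = pi_i * b_i(o_1)
    delta[0] = pi * B[:, obs[0]]
    # Induction: delta_t(j) = max_i delta_{t-1}(i) * a_ij * b_j(o_t)
    for t in range(1, T):
        scores = delta[t - 1][:, None] * A   # scores[i, j] = delta_{t-1}(i) * a_ij
        psi[t] = scores.argmax(axis=0)       # best predecessor i for each j
        delta[t] = scores.max(axis=0) * B[:, obs[t]]
    # Termination: pick the best final state, then chase backpointers
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t][path[-1]]))
    return path[::-1], delta[-1].max()
```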

Problem 3: Learning

- Up to now we've assumed that we know the underlying model $\lambda = (A, B, \pi)$.
- Often these parameters are estimated on annotated training data, but:
  - Annotation is often difficult and/or expensive.
  - Training data may differ from the current data.
- We want to maximize the parameters with respect to the current data, i.e., we're looking for a model $\lambda'$ such that
$$\lambda' = \operatorname*{argmax}_{\lambda} P(O \mid \lambda)$$

Problem 3: Learning (If Time Allows…)

- Unfortunately, there is no known way to analytically find a global maximum, i.e., a model $\lambda'$ such that $\lambda' = \operatorname*{argmax}_{\lambda} P(O \mid \lambda)$.
- But it is possible to find a local maximum.
- Given an initial model $\lambda$, we can always find a model $\lambda'$ such that $P(O \mid \lambda') \ge P(O \mid \lambda)$.

Forward-Backward (Baum-Welch) Algorithm

- Key idea: parameter re-estimation by hill-climbing.
- The forward-backward algorithm iteratively re-estimates the parameters, yielding a new $\lambda'$ at each iteration:
  1. Initialize $\lambda$ to a random set of values.
  2. Estimate $P(O \mid \lambda)$, filling out the trellis for both the forward and the backward algorithms.
  3. Re-estimate $\lambda$ using both trellises, yielding a new estimate $\lambda'$.
- Theorem: $P(O \mid \lambda') \ge P(O \mid \lambda)$

Parameter Re-estimation

Three parameters need to be re-estimated:
- Initial state distribution: $\pi_i$
- Transition probabilities: $a_{i,j}$
- Emission probabilities: $b_i(o_t)$

Re-estimating Transition Probabilities: Step 1

- What's the probability of being in state $s_i$ at time $t$ and going to state $s_j$, given the current model and parameters?
$$\xi_t(i, j) = P(q_t = s_i,\ q_{t+1} = s_j \mid O, \lambda)$$

Re-estimating Transition Probabilities: Step 1

$$\xi_t(i, j) = P(q_t = s_i,\ q_{t+1} = s_j \mid O, \lambda)$$
$$\xi_t(i, j) = \frac{\alpha_t(i)\, a_{i,j}\, b_j(o_{t+1})\, \beta_{t+1}(j)}{\sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_t(i)\, a_{i,j}\, b_j(o_{t+1})\, \beta_{t+1}(j)}$$

Re-estimating Transition Probabilities: Step 2

- The intuition behind the re-estimation equation for transition probabilities is
$$\hat{a}_{i,j} = \frac{\text{expected number of transitions from state } s_i \text{ to state } s_j}{\text{expected number of transitions from state } s_i}$$
- Formally:
$$\hat{a}_{i,j} = \frac{\sum_{t=1}^{T-1} \xi_t(i, j)}{\sum_{t=1}^{T-1} \sum_{j'=1}^{N} \xi_t(i, j')}$$

Re-estimating Transition Probabilities

- Defining
$$\gamma_t(i) = \sum_{j=1}^{N} \xi_t(i, j)$$
as the probability of being in state $s_i$ at time $t$, given the complete observation $O$, we can say:
$$\hat{a}_{i,j} = \frac{\sum_{t=1}^{T-1} \xi_t(i, j)}{\sum_{t=1}^{T-1} \gamma_t(i)}$$

Re-estimating Initial State Probabilities

- Initial state distribution: $\pi_i$ is the probability that $s_i$ is a start state.
- Re-estimation is easy:
$$\hat{\pi}_i = \text{expected number of times in state } s_i \text{ at time } 1$$
- Formally: $\hat{\pi}_i = \gamma_1(i)$

Re-estimation of Emission Probabilities

- Emission probabilities are re-estimated as
$$\hat{b}_i(k) = \frac{\text{expected number of times in state } s_i \text{ observing symbol } v_k}{\text{expected number of times in state } s_i}$$
- Formally:
$$\hat{b}_i(k) = \frac{\sum_{t=1}^{T} \delta(o_t, v_k)\, \gamma_t(i)}{\sum_{t=1}^{T} \gamma_t(i)}$$
where $\delta(o_t, v_k) = 1$ if $o_t = v_k$, and 0 otherwise.
- Note that $\delta$ here is the Kronecker delta function and is not related to the $\delta$ in the discussion of the Viterbi algorithm!!

The Updated Model

Coming from $\lambda = (A, B, \pi)$, we get to $\lambda' = (\hat{A}, \hat{B}, \hat{\pi})$ by the following update rules:

$$\hat{a}_{i,j} = \frac{\sum_{t=1}^{T-1} \xi_t(i, j)}{\sum_{t=1}^{T-1} \gamma_t(i)} \qquad \hat{b}_i(k) = \frac{\sum_{t=1}^{T} \delta(o_t, v_k)\, \gamma_t(i)}{\sum_{t=1}^{T} \gamma_t(i)} \qquad \hat{\pi}_i = \gamma_1(i)$$
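Putting the pieces together, here is a minimal sketch of one Baum-Welch re-estimation step that reuses the `forward()` and `backward()` sketches above. It implements the $\xi$, $\gamma$, and update formulas from the last few slides for a single observation sequence, without the scaling or log-space tricks a real implementation would need.

```python
# One Baum-Welch (forward-backward) re-estimation step for a single sequence.
import numpy as np

def baum_welch_step(obs, pi, A, B):
    N, T = len(pi), len(obs)
    alpha, likelihood = forward(obs, pi, A, B)
    beta, _ = backward(obs, pi, A, B)

    # xi_t(i, j): probability of being in s_i at t and s_j at t+1, given O
    xi = np.zeros((T - 1, N, N))
    for t in range(T - 1):
        num = alpha[t][:, None] * A * B[:, obs[t + 1]] * beta[t + 1]
        xi[t] = num / num.sum()

    # gamma_t(i): probability of being in s_i at time t, given O
    gamma = alpha * beta / likelihood

    # Update rules from the slide above
    pi_new = gamma[0]
    A_new = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    B_new = np.zeros_like(B)
    for k in range(B.shape[1]):
        B_new[:, k] = gamma[np.array(obs) == k].sum(axis=0)
    B_new /= gamma.sum(axis=0)[:, None]
    return pi_new, A_new, B_new
```

Iterating this step is guaranteed not to decrease $P(O \mid \lambda)$, per the theorem on the forward-backward slide.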

Expectation Maximization

- The forward-backward algorithm is an instance of the more general EM algorithm.
- The E step: compute the forward and backward probabilities for a given model.
- The M step: re-estimate the model parameters.