Lecture 11: Hidden Markov Models


Cognitive Systems - Machine Learning, Applied Computer Science, Bamberg University. Slides by Dr. Philip Jackson, Centre for Vision, Speech & Signal Processing, University of Surrey, UK. Last change: December 22, 2014. Michael Siebers (CogSys), ML Hidden Markov Models, December 22, 2014, 1 / 30.

Outline

- Introduction
- Markov Models
- Hidden Markov Models
  - Parameters of a Discrete HMM
- Computing P(O | λ)
  - Forward Procedure
  - Backward Procedure
- Finding the Best Path
  - Viterbi Algorithm
- Training the Models
  - Viterbi Training
  - Baum-Welch

Introduction to Markov models

We can model stochastic sequences using a Markov chain, e.g., the state topology of an ergodic Markov model. For 1st-order Markov chains, the probability of state occupation depends only on the previous step (Rabiner, 1989):

P(x_t = j | x_{t-1} = i, x_{t-2} = h, ...) = P(x_t = j | x_{t-1} = i)

If we assume this conditional probability is independent of time, we can write the state-transition probabilities as

a_ij = P(x_t = j | x_{t-1} = i),  1 ≤ i, j ≤ N

with the properties a_ij ≥ 0 and Σ_{j=1}^{N} a_ij = 1 for 1 ≤ i ≤ N.

Weather prediction example

Let us represent the state of the weather by a 1st-order, ergodic Markov model M with state 1: raining, state 2: cloudy, state 3: sunny, and state-transition probabilities expressed in matrix form:

A = {a_ij} = | 0.4  0.3  0.3 |
             | 0.2  0.6  0.2 |
             | 0.1  0.1  0.8 |

Weather predictor probability calculation

Given that today's weather is cloudy, what is the probability of directly observing the sequence of weather states rain-sun-sun with model M?

            rain  cloud  sun
A =  rain  | 0.4   0.3   0.3 |
     cloud | 0.2   0.6   0.2 |
     sun   | 0.1   0.1   0.8 |

P(X | M) = P(X = 1, 3, 3 | M)
         = P(x_1 = rain | today = cloud) · P(x_2 = sun | x_1 = rain) · P(x_3 = sun | x_2 = sun)
         = a_21 · a_13 · a_33
         = 0.2 × 0.3 × 0.8 = 0.048
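The chain-rule calculation above can be sketched in a few lines of Python. This is a minimal illustration, not from the slides; the 0-based state encoding (rain = 0, cloud = 1, sun = 2) and the function name are assumptions.

```python
# Weather Markov chain: rows/columns index rain (0), cloud (1), sun (2).
A = [[0.4, 0.3, 0.3],
     [0.2, 0.6, 0.2],
     [0.1, 0.1, 0.8]]

def sequence_probability(today, states, A):
    """P(x_1, ..., x_T | today) = a[today][x_1] * prod_t a[x_{t-1}][x_t]."""
    p = A[today][states[0]]
    for prev, nxt in zip(states, states[1:]):
        p *= A[prev][nxt]
    return p

# Today is cloudy; observe rain, sun, sun: a_21 * a_13 * a_33.
p = sequence_probability(1, [0, 2, 2], A)
print(round(p, 3))  # 0.048
```

Each transition simply multiplies in one entry of A, which is all the 1st-order Markov assumption permits.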

Start and end of a stochastic state sequence

Null states deal with the start and end of sequences, as in the state topology of a left-right Markov model. Entry probabilities at t = 1 are defined for each state i as

π_i = P(x_1 = i),  1 ≤ i ≤ N

with the properties π_i ≥ 0 for 1 ≤ i ≤ N and Σ_{i=1}^{N} π_i = 1. Exit probabilities at t = T are similarly defined as

η_i = P(x_T = i),  1 ≤ i ≤ N

with the properties η_i ≥ 0 and η_i + Σ_{j=1}^{N} a_ij = 1 for 1 ≤ i ≤ N.

Example: probability of MM state sequence

Consider the state topology with combined state-transition probabilities (entry and exit null states included):

A = | 0  1    0    0   |
    | 0  0.8  0.2  0   |
    | 0  0    0.6  0.4 |
    | 0  0    0    0   |

The probability of state sequence X = 1, 2, 2 is

P(X | M) = π_1 · a_12 · a_22 · η_2 = 1 × 0.2 × 0.6 × 0.4 = 0.048

Summary of Markov Models

State topology diagram: entry probabilities π = {π_i} = [1 0 0] and exit probabilities η = {η_i} = [0 0 0.2]^T are combined with the state-transition probabilities in the complete A matrix:

A = | 0  1    0    0    0   |
    | 0  0.6  0.3  0.1  0   |
    | 0  0    0.9  0.1  0   |
    | 0  0    0    0.8  0.2 |
    | 0  0    0    0    0   |

Probability of a given state sequence X:

P(X | M) = π_{x_1} (∏_{t=2}^{T} a_{x_{t-1} x_t}) η_{x_T}
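The summary formula can be checked directly in code. The sketch below uses the π, A, η values of this left-right model (re-indexed from 0) with a hypothetical state sequence X = 1, 2, 3, 3 chosen purely for illustration:

```python
# Left-right model: entry, transition, and exit probabilities
# for emitting states indexed 0..2 (slide numbering 1..3).
pi  = [1.0, 0.0, 0.0]
eta = [0.0, 0.0, 0.2]
A = [[0.6, 0.3, 0.1],
     [0.0, 0.9, 0.1],
     [0.0, 0.0, 0.8]]

def state_sequence_probability(states, pi, A, eta):
    """P(X | M) = pi[x_1] * prod_{t=2}^{T} a[x_{t-1}][x_t] * eta[x_T]."""
    p = pi[states[0]]
    for prev, nxt in zip(states, states[1:]):
        p *= A[prev][nxt]
    return p * eta[states[-1]]

# pi_1 * a_12 * a_23 * a_33 * eta_3 = 1 * 0.3 * 0.1 * 0.8 * 0.2
p = state_sequence_probability([0, 1, 2, 2], pi, A, eta)
```

Note that sequences not ending in state 3 get probability 0 here, because only that state has a non-zero exit probability.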

Introduction to Hidden Markov Models

Hidden Markov Models (HMMs) use a Markov chain to model stochastic state sequences which emit stochastic observations, e.g., the state topology of an ergodic HMM. The probability of state i generating a discrete observation o_t, which takes one of a finite set of values k ∈ {1, ..., K}, is

b_i(k) = P(o_t = k | x_t = i)

Parameters of a discrete HMM, λ

State-transition probabilities, where N is the number of states:

A = {π_j, a_ij, η_i} = {P(x_t = j | x_{t-1} = i)},  1 ≤ i, j ≤ N

Discrete output probabilities, where K is the number of distinct observations:

B = {b_i(k)} = {P(o_t = k | x_t = i)},  1 ≤ i ≤ N, 1 ≤ k ≤ K

HMM Probability Calculation

The joint likelihood of state and observation sequences is

P(O, X | λ) = P(X | λ) P(O | X, λ)

For example, the state sequence X = 1, 1, 2, 3, 3, 4 produces the observations O = o_1, o_2, ..., o_6 with

P(X | λ) = π_1 a_11 a_12 a_23 a_33 a_34 η_4
P(O | X, λ) = b_1(o_1) b_1(o_2) b_2(o_3) b_3(o_4) b_3(o_5) b_4(o_6)

In general,

P(O, X | λ) = π_{x_1} b_{x_1}(o_1) (∏_{t=2}^{T} a_{x_{t-1} x_t} b_{x_t}(o_t)) η_{x_T}

Example: probability of HMM state sequence

Consider the state topology with state-transition matrix

A = | 0  1    0    0   |
    | 0  0.8  0.2  0   |
    | 0  0    0.6  0.4 |
    | 0  0    0    0   |

and output probabilities

          R    G    B
B = b_1 | 0.5  0.2  0.3 |
    b_2 | 0    0.9  0.1 |

What is the probability of observations O = R, G, G with state sequence X = 1, 2, 2?

P(O, X | λ) = P(X | λ) P(O | X, λ)
            = π_1 b_1(o_1) · a_12 b_2(o_2) · a_22 b_2(o_3) · η_2
            = 1 × 0.5 × 0.2 × 0.9 × 0.6 × 0.9 × 0.4
            = 0.01944 ≈ 0.02
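A minimal sketch of this joint-likelihood calculation, with the two emitting states mapped to indices 0 and 1 (the data layout and function name are my own):

```python
# Two-state discrete HMM: entry/exit probabilities, transitions,
# and per-state output probabilities b_i(k) for symbols R, G, B.
pi  = [1.0, 0.0]
eta = [0.0, 0.4]
A   = [[0.8, 0.2],
       [0.0, 0.6]]
B   = {'R': [0.5, 0.0], 'G': [0.2, 0.9], 'B': [0.3, 0.1]}

def joint_likelihood(obs, states, pi, A, B, eta):
    """P(O, X | lambda) = pi b(o_1) * prod_t a b(o_t) * eta."""
    p = pi[states[0]] * B[obs[0]][states[0]]
    for t in range(1, len(obs)):
        p *= A[states[t-1]][states[t]] * B[obs[t]][states[t]]
    return p * eta[states[-1]]

p = joint_likelihood(['R', 'G', 'G'], [0, 1, 1], pi, A, B, eta)
print(round(p, 5))  # 0.01944
```

The same model is reused below for the forward, backward, and Viterbi worked examples, so the numbers can be cross-checked.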

HMM Recognition & Training

Three tasks within the HMM framework:
1. Compute the likelihood of a set of observations with a given model, P(O | λ).
2. Decode a test sequence by calculating the most likely path, X*.
3. Optimise pattern templates by training the parameters of the models.

Computing P(O | λ)

So far, we calculated the joint probability of the observations and a state sequence, for a given model λ:

P(O, X | λ) = P(X | λ) P(O | X, λ)

For the total probability of the observations, we marginalise out the state sequence by summing over all possible X:

P(O | λ) = Σ_{all X} P(O, X | λ)

Alternate notation: writing O = o_1^T = o_1^t o_{t+1}^T for 1 ≤ t < T,

P(O | λ) = Σ_{all x_1^T} P(o_1^T, x_1^T | λ)

Now we define the forward likelihood for state j as

α_t(j) = P(o_1^t, x_t = j | λ) = Σ_{x_1^{t-1}, x_t = j} P(o_1^t, x_1^t | λ)

Considering the (unknown) previous state x_{t-1}, this can be reformulated:

α_t(j) = Σ_{x_1^{t-2}, x_{t-1}, x_t = j} P(o_1^t, x_1^t | λ)
       = Σ_{x_1^{t-2}, x_{t-1}} P(o_1^{t-1}, x_1^{t-1} | λ) a_{x_{t-1} j} b_j(o_t)
       = Σ_{i=1}^{N} [Σ_{x_1^{t-2}, x_{t-1} = i} P(o_1^{t-1}, x_1^{t-1} | λ)] a_ij b_j(o_t)
       = Σ_{i=1}^{N} α_{t-1}(i) a_ij b_j(o_t)

Forward procedure

To calculate the forward likelihood α_t(i) = P(o_1^t, x_t = i | λ):
1. Initialise at t = 1: α_1(i) = π_i b_i(o_1) for 1 ≤ i ≤ N.
2. Recur for t = 2, 3, ..., T: α_t(j) = [Σ_{i=1}^{N} α_{t-1}(i) a_ij] b_j(o_t) for 1 ≤ j ≤ N.
3. Finalise: P(O | λ) = Σ_{i=1}^{N} α_T(i) η_i.

Thus, we can solve Task 1 efficiently by recursion.

Worked example of the forward procedure (O = R, G, G):

α_1(1) = 1 × 0.5 = 0.5
α_1(2) = 0 × 0 = 0
α_2(1) = 0.5 × 0.8 × 0.2 = 0.08
α_2(2) = 0.5 × 0.2 × 0.9 = 0.09
α_3(1) = 0.08 × 0.8 × 0.2 = 0.0128
α_3(2) = (0.08 × 0.2 + 0.09 × 0.6) × 0.9 = 0.063
P(O | λ) = 0.063 × 0.4 = 0.0252
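The three steps of the forward procedure translate directly into code. This sketch reproduces the worked example on the two-state R/G/B model (0-based indices; variable names are my own):

```python
# Two-state R/G/B model from the joint-likelihood example.
pi  = [1.0, 0.0]
eta = [0.0, 0.4]
A   = [[0.8, 0.2],
       [0.0, 0.6]]
B   = {'R': [0.5, 0.0], 'G': [0.2, 0.9], 'B': [0.3, 0.1]}

def forward(obs, pi, A, B, eta):
    N = len(pi)
    # Initialise: alpha_1(i) = pi_i * b_i(o_1)
    alpha = [pi[i] * B[obs[0]][i] for i in range(N)]
    # Recur: alpha_t(j) = [sum_i alpha_{t-1}(i) a_ij] * b_j(o_t)
    for o in obs[1:]:
        alpha = [sum(alpha[i] * A[i][j] for i in range(N)) * B[o][j]
                 for j in range(N)]
    # Finalise: P(O | lambda) = sum_i alpha_T(i) * eta_i
    return sum(a * e for a, e in zip(alpha, eta))

p = forward(['R', 'G', 'G'], pi, A, B, eta)
print(round(p, 4))  # 0.0252
```

Only the current column of α is kept, since each step depends solely on the previous one; the cost is O(N²T) rather than the O(N^T) of naive enumeration.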

Backward procedure

We define the backward likelihood β_t(i) = P(o_{t+1}^T | x_t = i, λ) and calculate:
1. Initialise at t = T: β_T(i) = η_i for 1 ≤ i ≤ N.
2. Recur for t = T-1, T-2, ..., 1: β_t(i) = Σ_{j=1}^{N} a_ij b_j(o_{t+1}) β_{t+1}(j) for 1 ≤ i ≤ N.
3. Finalise: P(O | λ) = Σ_{i=1}^{N} π_i b_i(o_1) β_1(i).

This is an equivalent way of computing P(O | λ) recursively.
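For comparison, the backward recursion on the same assumed two-state model arrives at the same total likelihood, 0.0252, from the opposite end of the sequence:

```python
# Same two-state R/G/B model as before.
pi  = [1.0, 0.0]
eta = [0.0, 0.4]
A   = [[0.8, 0.2],
       [0.0, 0.6]]
B   = {'R': [0.5, 0.0], 'G': [0.2, 0.9], 'B': [0.3, 0.1]}

def backward(obs, pi, A, B, eta):
    N = len(pi)
    beta = list(eta)  # Initialise: beta_T(i) = eta_i
    # Recur backwards: beta_t(i) = sum_j a_ij * b_j(o_{t+1}) * beta_{t+1}(j)
    for o in reversed(obs[1:]):
        beta = [sum(A[i][j] * B[o][j] * beta[j] for j in range(N))
                for i in range(N)]
    # Finalise: P(O | lambda) = sum_i pi_i * b_i(o_1) * beta_1(i)
    return sum(pi[i] * B[obs[0]][i] * beta[i] for i in range(N))

p = backward(['R', 'G', 'G'], pi, A, B, eta)
print(round(p, 4))  # 0.0252
```

Agreement between the forward and backward totals is a convenient sanity check on an implementation, and both lattices are needed for Baum-Welch training later.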

Finding the Best Path

Given observations O = o_1, ..., o_T, find the HMM state sequence X* = x_1*, ..., x_T* that has the greatest likelihood:

X* = arg max_X P(O, X | λ)

where

P(O, X | λ) = P(O | X, λ) P(X | λ) = π_{x_1} b_{x_1}(o_1) (∏_{t=2}^{T} a_{x_{t-1} x_t} b_{x_t}(o_t)) η_{x_T}

The Viterbi algorithm is an inductive method to find the optimal state sequence X* efficiently, similar to the forward procedure. It computes the maximum cumulative likelihood δ_t(i) up to the current time t for each state i:

δ_t(i) = max_{x_1^{t-1}, x_t = i} P(o_1^t, x_1^t | λ)

Viterbi algorithm

To compute the maximum cumulative likelihood δ_t(i):
1. Initialise at t = 1: δ_1(i) = π_i b_i(o_1), ψ_1(i) = 0, for 1 ≤ i ≤ N.
2. Recur for t = 2, 3, ..., T: δ_t(j) = max_i [δ_{t-1}(i) a_ij] b_j(o_t), ψ_t(j) = arg max_i [δ_{t-1}(i) a_ij], for 1 ≤ j ≤ N.
3. Finalise: P(O, X* | λ) = max_i [δ_T(i) η_i], x_T* = arg max_i [δ_T(i) η_i].
4. Trace back, for t = T, T-1, ..., 2: x_{t-1}* = ψ_t(x_t*), giving X* = {x_1*, x_2*, ..., x_T*}.

Illustration of the Viterbi algorithm (T = 3):
1. Initialise: δ_1(i) = π_i b_i(o_1), ψ_1(i) = 0.
2. Recur for t = 2: δ_2(j) = max_i [δ_1(i) a_ij] b_j(o_2), ψ_2(j) = arg max_i [δ_1(i) a_ij].
   Recur for t = 3: δ_3(j) = max_i [δ_2(i) a_ij] b_j(o_3), ψ_3(j) = arg max_i [δ_2(i) a_ij].
3. Finalise: P(O, X* | λ) = max_i [δ_3(i) η_i], x_3* = arg max_i [δ_3(i) η_i].
4. Trace back for t = 3, 2: x_2* = ψ_3(x_3*), x_1* = ψ_2(x_2*), X* = {x_1*, x_2*, x_3*}.

Worked example of the Viterbi algorithm (O = R, G, G):

δ_1(1) = 1 × 0.5 = 0.5                             ψ_1(1) = 0
δ_1(2) = 0 × 0 = 0                                 ψ_1(2) = 0
δ_2(1) = 0.5 × 0.8 × 0.2 = 0.08                    ψ_2(1) = 1
δ_2(2) = 0.5 × 0.2 × 0.9 = 0.09                    ψ_2(2) = 1
δ_3(1) = 0.08 × 0.8 × 0.2 = 0.0128                 ψ_3(1) = 1
δ_3(2) = max(0.08 × 0.2, 0.09 × 0.6) × 0.9 = 0.0486    ψ_3(2) = 2
P(O, X* | λ) = 0.0486 × 0.4 = 0.01944
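The recursion and trace-back can be sketched as follows. On the R, G, G observations it recovers the path 1, 2, 2 (shown 0-indexed) and the likelihood 0.01944 of the worked example; the structure mirrors the forward procedure with max in place of sum:

```python
# Same two-state R/G/B model as in the forward example.
pi  = [1.0, 0.0]
eta = [0.0, 0.4]
A   = [[0.8, 0.2],
       [0.0, 0.6]]
B   = {'R': [0.5, 0.0], 'G': [0.2, 0.9], 'B': [0.3, 0.1]}

def viterbi(obs, pi, A, B, eta):
    N = len(pi)
    delta = [pi[i] * B[obs[0]][i] for i in range(N)]  # delta_1(i)
    psi = []                                          # backpointers
    for o in obs[1:]:
        # For each j, keep max_i delta_{t-1}(i) a_ij and the arg max.
        best = [max((delta[i] * A[i][j], i) for i in range(N))
                for j in range(N)]
        psi.append([i for _, i in best])
        delta = [d * B[o][j] for j, (d, _) in enumerate(best)]
    p_best, last = max((delta[i] * eta[i], i) for i in range(N))
    path = [last]
    for back in reversed(psi):  # trace back through psi
        path.insert(0, back[path[0]])
    return p_best, path

p, path = viterbi(['R', 'G', 'G'], pi, A, B, eta)
print(round(p, 5), path)  # 0.01944 [0, 1, 1]
```

Note that the Viterbi score 0.01944 is below the forward total 0.0252: the best single path accounts for only part of the summed likelihood.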

Maximum likelihood training

In general, we want to find the value of some model parameter c that is most likely to have generated our training data O_train. This maximum likelihood (ML) estimate ĉ is obtained by setting the derivative of P(O_train | c) with respect to c equal to zero, which is equivalent to:

d ln P(O_train | c) / dc = 0  at  c = ĉ

Solving this likelihood equation tells us how to optimise the model parameters in training.

Re-estimating the parameters of the model λ

As a preliminary approach, let us use the optimal path X* computed by the Viterbi algorithm with some initial model parameters λ = {A, B}. In so doing, we approximate the total likelihood of the observations:

P(O | λ) = Σ_{all X} P(O, X | λ) ≈ P(O, X* | λ)

Using X*, we make a hard binary decision about state occupation, q_t(i) ∈ {0, 1}, and train the parameters of our model accordingly.

Viterbi training (hard state assignment)

Model parameters can be re-estimated using the Viterbi alignment to assign observations to states (a.k.a. segmental k-means training).

State-transition probabilities:

â_ij = Σ_{t=2}^{T} q_{t-1}(i) q_t(j) / Σ_{t=1}^{T} q_t(i),  1 ≤ i, j ≤ N

where the state indicator q_t(i) = 1 if i = x_t*, and 0 otherwise.

Discrete output probabilities:

b̂_j(k) = Σ_{t=1}^{T} q_t(j) ω_t(k) / Σ_{t=1}^{T} q_t(j),  1 ≤ j ≤ N, 1 ≤ k ≤ K

where the event indicator ω_t(k) = 1 if k = o_t, and 0 otherwise.

Maximum likelihood training by EM: Baum-Welch re-estimation (occupation)

Yet the hidden state occupation is not known with absolute certainty. So the expectation-maximisation (EM) method optimises the model parameters based on a soft assignment of observations to states via the occupation likelihood

γ_t(i) = P(x_t = i | O, λ)

relaxing the hard-assignment approximation made above. Using Bayes' theorem, we can rearrange to obtain

γ_t(i) = P(O, x_t = i | λ) / P(O | λ) = α_t(i) β_t(i) / P(O | λ)

where α_t, β_t and P(O | λ) are computed by the forward and backward procedures.

Baum-Welch re-estimation (transition)

Similarly, we also define a transition likelihood:

ξ_t(i, j) = P(x_{t-1} = i, x_t = j | O, λ)
          = P(O, x_{t-1} = i, x_t = j | λ) / P(O | λ)
          = α_{t-1}(i) a_ij b_j(o_t) β_t(j) / P(O | λ)

Trellis depiction of (left) occupation and (right) transition likelihoods.

Baum-Welch training (soft state assignment)

State-transition probabilities:

â_ij = Σ_{t=2}^{T} ξ_t(i, j) / Σ_{t=1}^{T} γ_t(i),  1 ≤ i, j ≤ N

Discrete output probabilities:

b̂_j(k) = Σ_{t=1}^{T} γ_t(j) ω_t(k) / Σ_{t=1}^{T} γ_t(j),  1 ≤ j ≤ N, 1 ≤ k ≤ K

Re-estimation increases the likelihood of the training data under the new model λ̂:

P(O_train | λ̂) ≥ P(O_train | λ)

although it does not guarantee a global maximum.
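One re-estimation pass can be sketched by combining the forward and backward lattices. This is my own assembly of the slide's formulas on the two-state R, G, G example, with the exit probabilities η folded in as on the slides; all variable names are assumptions:

```python
# Two-state R/G/B model and a single training sequence.
pi  = [1.0, 0.0]
eta = [0.0, 0.4]
A   = [[0.8, 0.2], [0.0, 0.6]]
B   = {'R': [0.5, 0.0], 'G': [0.2, 0.9], 'B': [0.3, 0.1]}
obs = ['R', 'G', 'G']
N, T = len(pi), len(obs)

# Forward lattice alpha[t][i] and backward lattice beta[t][i].
alpha = [[pi[i] * B[obs[0]][i] for i in range(N)]]
for t in range(1, T):
    alpha.append([sum(alpha[t-1][i] * A[i][j] for i in range(N)) * B[obs[t]][j]
                  for j in range(N)])
beta = [[0.0] * N for _ in range(T)]
beta[T-1] = list(eta)
for t in range(T - 2, -1, -1):
    beta[t] = [sum(A[i][j] * B[obs[t+1]][j] * beta[t+1][j] for j in range(N))
               for i in range(N)]
P = sum(alpha[T-1][i] * eta[i] for i in range(N))

# Occupation gamma_t(i) and transition xi_t(i, j) likelihoods.
gamma = [[alpha[t][i] * beta[t][i] / P for i in range(N)] for t in range(T)]
xi = [[[alpha[t-1][i] * A[i][j] * B[obs[t]][j] * beta[t][j] / P
        for j in range(N)] for i in range(N)] for t in range(1, T)]

# Soft re-estimation of A and B per the slide's formulas.
A_new = [[sum(x[i][j] for x in xi) / sum(g[i] for g in gamma)
          for j in range(N)] for i in range(N)]
B_new = {k: [sum(g[j] for g, o in zip(gamma, obs) if o == k) /
             sum(g[j] for g in gamma) for j in range(N)]
         for k in 'RGB'}
```

Each γ_t sums to 1 over states, so the soft counts behave like fractional occupancies; in practice the pass is iterated to convergence, and guard code is needed for states with zero total occupancy.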

Use of HMMs with training and test data: (a) training, (b) recognition. Isolated-word training and recognition (Young et al., 2009).

Learning Terminology

Hidden Markov Models in the taxonomy of machine learning:
- Learning strategy: supervised learning from examples (vs. unsupervised learning, policy learning)
- Approach: statistical / neural network (vs. symbolic inductive, analytical)
- Data: sequences of discrete observations
- Target values: probabilities for classes