HMM part 1. Dr Philip Jackson

Centre for Vision Speech & Signal Processing, University of Surrey, Guildford GU2 7XH.
HMM part 1, Dr Philip Jackson:
- Probability fundamentals
- Markov models
- State topology diagrams
- Hidden Markov models: likelihood calculation; recognition & training
www.ee.surrey.ac.uk/teaching/courses/eem.ssr F.1

Summary of Dynamic Time Warping
The DTW approach allows efficient computation with limited flexibility in the alignment. It treats templates as deterministic with residual noise. Problems:
1. How much flexibility should we allow?
2. How should we penalise any warping?
3. How do we determine a fair distance metric?
4. How many templates should we register?
5. How do we select the best ones?
Solution: develop an inference framework to build templates based on the statistics of our data. F.2

Characteristics of the desired model
1. evolution of the sequence should not be deterministic
2. observations are coloured depending on class
3. we cannot directly observe the class
4. stochastic sequence + stochastic observations
Applications: automatic speech recognition, optical character recognition, protein and DNA sequencing, speech synthesis, noise-robust data transmission, cryptanalysis, machine translation, image classification, etc. F.3

Probability fundamentals
Normalisation; independent events; dependent events; Bayes' theorem; marginalisation. F.4

Normalisation
Discrete: the probabilities of all possibilities sum to one:
$\sum_{\text{all } X} P(X) = 1$ (1)
Continuous: the integral over the entire probability density function (pdf) comes to one:
$\int p(x)\, dx = 1$ (2)
Joint probability
The joint probability that two independent events occur is the product of their individual probabilities:
$P(A, B) = P(A)\, P(B)$ (3) F.5
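
As a quick numerical check of eqs. 1 and 3, here is a minimal Python sketch (not from the original slides; the distributions and event probabilities are made up for illustration):

```python
# Minimal check of normalisation (eq. 1) and independence (eq. 3)
P_X = {"rain": 0.5, "cloud": 0.3, "sun": 0.2}   # hypothetical discrete distribution
assert abs(sum(P_X.values()) - 1.0) < 1e-12      # eq. 1: probabilities sum to one

# Two independent events A and B with hypothetical marginal probabilities
P_A, P_B = 0.4, 0.25
P_AB = P_A * P_B                                  # eq. 3: P(A, B) = P(A) P(B)
print(P_AB)                                       # 0.1
```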

Conditional probability
If two events are dependent, we need to determine their conditional probabilities. The joint probability is now
$P(A, B) = P(A)\, P(B|A)$ (4)
where $P(B|A)$ is the probability of event B given that A occurred; conversely, taking the events the other way,
$P(A, B) = P(A|B)\, P(B)$ (5)
Example joint probabilities $P(A, B)$ with their marginals:
            A     Ā
    B      0.1   0.3   | 0.4
    B̄      0.4   0.2   | 0.6
           0.5   0.5   | 1.0
These expressions can be rearranged to yield the conditional probabilities. Also, we can combine them to obtain the theorem proposed by Rev. Thomas Bayes (C.18th). F.6
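
Eqs. 4 and 5 can be checked directly against the table above; a minimal sketch (not part of the original slides), where the marginals 0.5 and 0.4 are read off the row and column sums:

```python
# Joint probabilities P(A, B) from the slide's table
P_joint = {("A", "B"): 0.1, ("~A", "B"): 0.3,
           ("A", "~B"): 0.4, ("~A", "~B"): 0.2}

# Marginals by summing over the other variable (see eq. 9)
P_A = P_joint[("A", "B")] + P_joint[("A", "~B")]     # 0.5
P_B = P_joint[("A", "B")] + P_joint[("~A", "B")]     # 0.4

# Conditional probabilities from eqs. 4 and 5: P(A,B) = P(A) P(B|A) = P(A|B) P(B)
P_B_given_A = P_joint[("A", "B")] / P_A              # 0.1 / 0.5 = 0.2
P_A_given_B = P_joint[("A", "B")] / P_B              # 0.1 / 0.4 = 0.25
print(P_B_given_A, P_A_given_B)
```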

Bayes' theorem
Equating the RHS of eqs. 4 and 5 gives
$P(B|A) = \frac{P(A|B)\, P(B)}{P(A)}$ (6)
For example, in a word recognition application we have
$P(w|O) = \frac{p(O|w)\, P(w)}{p(O)}$ (7)
which can be interpreted as
posterior = likelihood × prior / evidence (8)
- the posterior probability gives the basis for Bayesian inference,
- the likelihood describes how likely the data are for a given class,
- the prior incorporates other knowledge (e.g., a language model),
- the evidence normalises and is often discarded (as it is the same for all classes). F.7

Marginalisation
Discrete: the probability of event B, which depends on A, is the sum over A of all joint probabilities:
$P(B) = \sum_{\text{all } A} P(A, B) = \sum_{\text{all } A} P(B|A)\, P(A)$ (9)
Continuous: similarly, the nuisance factor x can be eliminated from its joint pdf with y:
$p(y) = \int p(x, y)\, dx = \int p(y|x)\, p(x)\, dx$ (10) F.8
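
The evidence in eq. 7 is itself obtained by marginalisation (eq. 9) over the classes. A minimal sketch of the word-recognition interpretation, with two hypothetical classes and made-up priors and likelihoods purely for illustration:

```python
# Bayes' theorem (eq. 6) with the evidence obtained by marginalisation (eq. 9)
priors      = {"yes": 0.6, "no": 0.4}       # P(w): prior (e.g. from a language model)
likelihoods = {"yes": 0.02, "no": 0.005}    # p(O|w): likelihood of the observed data

# Evidence p(O) = sum_w p(O|w) P(w), marginalising out the class
evidence = sum(likelihoods[w] * priors[w] for w in priors)

# Posterior P(w|O) = p(O|w) P(w) / p(O)
posteriors = {w: likelihoods[w] * priors[w] / evidence for w in priors}
print(posteriors)   # the posteriors sum to one
```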

Introduction to Markov models
We can model stochastic sequences using a Markov chain, e.g., the state topology of an ergodic Markov model (diagram: three fully-connected states 1, 2, 3). For 1st-order Markov chains, the probability of state occupation depends only on the previous step (Rabiner, 1989):
$P(x_t = j \mid x_{t-1} = i, x_{t-2} = h, \ldots) = P(x_t = j \mid x_{t-1} = i)$ (11)
So, if we assume the RHS of eq. 11 is independent of time, we can write the state-transition probabilities as
$a_{ij} = P(x_t = j \mid x_{t-1} = i), \quad 1 \le i, j \le N$ (12)
with the properties $a_{ij} \ge 0$ for $i, j \in 1..N$ and $\sum_{j=1}^{N} a_{ij} = 1$ for each $i$. F.9

State duration characteristics
As a consequence of the first-order Markov model, the probability of occupying a state for a given duration, τ, decays exponentially:
$p(\tau \mid x_1 = i, M) = (a_{ii})^{\tau - 1}\,(1 - a_{ii})$ (13)
(figure: duration probability against duration for self-transition probability $a_{33}$, showing the exponential decay) F.10
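
Eq. 13 is a geometric distribution. The short sketch below tabulates it for an assumed self-transition probability of 0.8 (the slide's figure uses $a_{33}$; the value here is illustrative only):

```python
# Geometric state-duration distribution of eq. 13: p(tau) = a_ii^(tau-1) * (1 - a_ii)
a_ii = 0.8                       # assumed self-transition probability (cf. a_33 in the figure)
for tau in range(1, 21):
    p = (a_ii ** (tau - 1)) * (1 - a_ii)
    print(f"tau = {tau:2d}  p = {p:.4f}")
# The mean duration of such a distribution is 1 / (1 - a_ii), i.e. 5 frames here.
```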

Weather prediction example
Let us represent the state of the weather by a 1st-order, ergodic Markov model, M, with state 1: raining, state 2: cloudy, state 3: sunny (diagram: three fully-connected states with the transition probabilities of eq. 14), and with the state-transition probabilities expressed in matrix form:
$A = \{a_{ij}\} = \begin{bmatrix} 0.4 & 0.3 & 0.3 \\ 0.2 & 0.6 & 0.2 \\ 0.1 & 0.1 & 0.8 \end{bmatrix}$ (14) F.11

Weather predictor probability calculation
Given today's weather, what is the probability of directly observing the sequence of weather states rain-sun-sun with model M? With the rows and columns of A ordered (rain, cloud, sun):
            rain  cloud  sun
    rain    0.4   0.3    0.3
    cloud   0.2   0.6    0.2
    sun     0.1   0.1    0.8
$P(X \mid M) = P(X = \{1, 3, 3\} \mid M)$
$= P(x_1 = \text{rain} \mid \text{today})\; P(x_2 = \text{sun} \mid x_1 = \text{rain})\; P(x_3 = \text{sun} \mid x_2 = \text{sun})$
$= P(x_1 = \text{rain} \mid \text{today}) \times a_{13} \times a_{33} = P(x_1 = \text{rain} \mid \text{today}) \times 0.3 \times 0.8$ F.12
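
A sketch of this calculation using the matrix of eq. 14. The slide leaves the first factor to the reader; the code below assumes today's weather is known to be rain, so that $P(x_1 = \text{rain} \mid \text{today}) = 1$ (an assumption, not a value stated on the slide):

```python
import numpy as np

# State-transition matrix of eq. 14 (0-based states: 0 = rain, 1 = cloud, 2 = sun)
A = np.array([[0.4, 0.3, 0.3],
              [0.2, 0.6, 0.2],
              [0.1, 0.1, 0.8]])
assert np.allclose(A.sum(axis=1), 1.0)   # each row sums to one (eq. 12)

# Sequence rain -> sun -> sun, i.e. X = {1, 3, 3} in the slide's 1-based numbering
X = [0, 2, 2]

p = 1.0                                  # assumed: P(x1 = rain | today) = 1
for prev, nxt in zip(X[:-1], X[1:]):
    p *= A[prev, nxt]                    # a_13 * a_33 = 0.3 * 0.8
print(p)                                 # 0.24 under this assumption
```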

Start and end of a state sequence
Null states deal with the start and end of sequences, as in the state topology of this left-right Markov model (diagram: states 1, 2, 3 with self-loops $a_{11}, a_{22}, a_{33}$, entry $\pi_1$, transitions $a_{12}, a_{23}$, and exit $\eta_3$).
Entry probabilities at t = 1 for each state i are defined
$\pi_i = P(x_1 = i), \quad 1 \le i \le N$ (15)
with the properties $\pi_i \ge 0$ and $\sum_{i=1}^{N} \pi_i = 1$.
Exit probabilities at t = T are similarly defined
$\eta_i = P(x_T = i), \quad 1 \le i \le N$ (16)
with the properties $\eta_i \ge 0$ and $\eta_i + \sum_{j=1}^{N} a_{ij} = 1$ for each $i \in 1..N$. F.13

Parameters of the Markov model, M
State transition probabilities, $A = \{\pi_j, a_{ij}, \eta_i\} = \{P(x_t = j \mid x_{t-1} = i)\}$, for $1 \le i, j \le N$, where N is the number of states.
(diagram: a 4-state left-right model with entry $\pi_1$, self-loops $a_{11}, \ldots, a_{44}$, transitions $a_{12}, a_{23}, a_{34}$ and exit $\eta_4$, producing the state sequence $X = \{1, 1, 2, 3, 3, 4\}$ over times $x_1, \ldots, x_6$) F.14

Example: probability of MM state sequence
Consider the state topology (diagram: two emitting states with self-loops 0.8 and 0.6, entry probability 1.0, transition 0.2 and exit 0.4) with state-transition probabilities, including the entry and exit null states, of
$A = \begin{bmatrix} 0 & 1 & 0 & 0 \\ 0 & 0.8 & 0.2 & 0 \\ 0 & 0 & 0.6 & 0.4 \\ 0 & 0 & 0 & 0 \end{bmatrix}$
The probability of the state sequence $X = \{1, 2, 2\}$ is
$P(X \mid M) = \pi_1\, a_{12}\, a_{22}\, \eta_2 = 1 \times 0.2 \times 0.6 \times 0.4 = 0.048$ F.15

Summary of Markov models
State topology diagram: a left-right model with three emitting states 1, 2, 3.
Entry probabilities $\pi = \{\pi_i\} = [\,1\;\; 0\;\; 0\,]$ and exit probabilities $\eta = \{\eta_i\} = [\,0\;\; 0\;\; 0.2\,]^T$ are combined with the state-transition probabilities in the complete A matrix:
$A = \begin{bmatrix} 0 & 1 & 0 & 0 & 0 \\ 0 & 0.6 & 0.3 & 0.1 & 0 \\ 0 & 0 & 0.9 & 0.1 & 0 \\ 0 & 0 & 0 & 0.8 & 0.2 \\ 0 & 0 & 0 & 0 & 0 \end{bmatrix}$
Probability of a given state sequence X:
$P(X \mid M) = \pi_{x_1} \left[ \prod_{t=2}^{T} a_{x_{t-1} x_t} \right] \eta_{x_T}$ (17) F.16
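
Eq. 17 can be implemented directly. The sketch below (not from the slides) checks it against the example of slide F.15, where $P(X \mid M) = \pi_1 a_{12} a_{22} \eta_2 = 0.048$:

```python
import numpy as np

def markov_sequence_prob(pi, A, eta, X):
    """P(X|M) = pi_{x1} * prod_t a_{x_{t-1} x_t} * eta_{x_T}  (eq. 17), 0-based states."""
    p = pi[X[0]]
    for prev, nxt in zip(X[:-1], X[1:]):
        p *= A[prev, nxt]
    return p * eta[X[-1]]

# Two-state model of slide F.15: pi = [1, 0], a11 = 0.8, a12 = 0.2, a22 = 0.6, eta = [0, 0.4]
pi  = np.array([1.0, 0.0])
A   = np.array([[0.8, 0.2],
                [0.0, 0.6]])
eta = np.array([0.0, 0.4])
print(markov_sequence_prob(pi, A, eta, [0, 1, 1]))   # 1 * 0.2 * 0.6 * 0.4 = 0.048
```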

Introduction to hidden Markov models
Hidden Markov models (HMMs) use a Markov chain to model stochastic state sequences which emit stochastic observations, e.g., the state topology of an ergodic HMM (diagram: three fully-connected states 1, 2, 3 with output distributions $b_1$, $b_2$, $b_3$).
The probability of state i generating a discrete observation $o_t$, which has a value from a finite set $k \in 1..K$, is
$b_i(o_t) = P(o_t = k \mid x_t = i)$ (18)
The probability distribution of a continuous observation $o_t$, which takes a value from an infinite set, is
$b_i(o_t) = p(o_t \mid x_t = i)$ (19)
We begin by considering only discrete observations. F.17
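
To make eqs. 18 and 19 concrete, the sketch below evaluates an emission probability both ways: as a row lookup in a discrete output table $b_i(k)$, and as a continuous density, here a univariate Gaussian chosen purely for illustration (the slides do not specify a density):

```python
import math

# Discrete emission (eq. 18): b_i(k) = P(o_t = k | x_t = i), one row per state, K symbols
B = [[0.5, 0.2, 0.3],     # state 1
     [0.0, 0.9, 0.1]]     # state 2
state, symbol = 0, 2
print(B[state][symbol])    # b_1(k=3) = 0.3

# Continuous emission (eq. 19): b_i(o_t) = p(o_t | x_t = i); a Gaussian pdf is one common choice
def gaussian_pdf(o, mean, var):
    return math.exp(-0.5 * (o - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)

print(gaussian_pdf(1.2, mean=1.0, var=0.25))   # a density value, not a probability
```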

Observations in discretised feature space
(figure: a feature space with axes $c_1$ and $c_2$ partitioned into regions k = 1, 2, 3, alongside a discrete output probability histogram P(o) over the symbols k = 1, 2, 3, ..., K) F.18
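
The figure shows continuous feature vectors $(c_1, c_2)$ being mapped to one of K discrete symbols. A minimal vector-quantisation sketch of that idea (the codebook values are made up for illustration):

```python
import numpy as np

# Hypothetical codebook of K = 3 centroids in the (c1, c2) feature space
codebook = np.array([[0.0, 0.0],
                     [1.0, 1.0],
                     [2.0, 0.0]])

def quantise(o):
    """Map a continuous feature vector to the index k of its nearest codebook entry."""
    return int(np.argmin(np.linalg.norm(codebook - o, axis=1)))

o = np.array([0.9, 1.2])        # a continuous observation
k = quantise(o)                  # discrete symbol usable with b_i(k)
print(k)                         # 1
```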

Parameters of a discrete HMM, λ
State transition probabilities, $A = \{\pi_j, a_{ij}, \eta_i\} = \{P(x_t = j \mid x_{t-1} = i)\}$, for $1 \le i, j \le N$, where N is the number of states.
Discrete output probabilities, $B = \{b_i(k)\} = \{P(o_t = k \mid x_t = i)\}$, for $1 \le i \le N$ and $1 \le k \le K$, where K is the number of observation types.
(diagram: a 4-state left-right HMM with entry $\pi_1$, self-loops $a_{11}, \ldots, a_{44}$, transitions $a_{12}, a_{23}, a_{34}$ and exit $\eta_4$, generating the state sequence $X = \{1, 1, 2, 3, 3, 4\}$ and observations $O = \{o_1, o_2, \ldots, o_6\}$ with emission probabilities $b_1(o_1), b_1(o_2), b_2(o_3), b_3(o_4), b_3(o_5), b_4(o_6)$) F.19

Procedure for generating an observation sequence
1. For t = 1, choose the state $x_t = i$ using the entry probability $\pi_i$.
2. Select $o_t = k$ according to $b_{x_t}(k)$.
3. Transit according to $a_{ij}$ and $\eta_i$, then respectively:
(a) increment t, set $x_t = j$ and repeat from step 2, or
(b) terminate the sequence, t = T.
(diagram: the 4-state left-right HMM of slide F.19 emitting $o_1, \ldots, o_6$) F.20
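
A minimal sketch of this generation procedure for a discrete HMM. The parameter values are taken from the small two-state example of slide F.22; the symbol names R, G, B are assumed labels for its three observation types:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two-state discrete HMM (cf. slide F.22): entry, transition, exit and output probabilities
pi  = np.array([1.0, 0.0])                    # entry probabilities pi_i
A   = np.array([[0.8, 0.2],                   # a_ij (any row deficit from 1
                [0.0, 0.6]])                  #  is the exit probability eta_i)
eta = np.array([0.0, 0.4])                    # exit probabilities eta_i
B   = np.array([[0.5, 0.2, 0.3],              # b_i(k) over the symbols R, G, B
                [0.0, 0.9, 0.1]])
symbols = ["R", "G", "B"]

# Step 1: choose the first state from pi; then alternate emission and transition
x = rng.choice(2, p=pi)
X, O = [x], [symbols[rng.choice(3, p=B[x])]]
while True:
    # Step 3: transit according to a_ij and eta_i (choosing exit ends the sequence)
    probs = np.append(A[x], eta[x])
    nxt = rng.choice(3, p=probs)
    if nxt == 2:                               # exit chosen: terminate, t = T
        break
    x = nxt
    X.append(x)
    O.append(symbols[rng.choice(3, p=B[x])])   # step 2: emit o_t from b_x(k)

print(X, O)
```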

HMM probability calculation
The joint likelihood of the state and observation sequences is
$P(O, X \mid \lambda) = P(X \mid \lambda)\, P(O \mid X, \lambda)$ (20)
(diagram: the 4-state left-right HMM of slide F.19 emitting $o_1, \ldots, o_6$)
The state sequence $X = \{1, 1, 2, 3, 3, 4\}$ produces the set of observations $O = \{o_1, o_2, \ldots, o_6\}$:
$P(X \mid \lambda) = \pi_1\, a_{11}\, a_{12}\, a_{23}\, a_{33}\, a_{34}\, \eta_4$
$P(O \mid X, \lambda) = b_1(o_1)\, b_1(o_2)\, b_2(o_3)\, b_3(o_4)\, b_3(o_5)\, b_4(o_6)$
In general,
$P(O, X \mid \lambda) = \pi_{x_1}\, b_{x_1}(o_1) \left[ \prod_{t=2}^{T} a_{x_{t-1} x_t}\, b_{x_t}(o_t) \right] \eta_{x_T}$ (21) F.21

Example: probability of HMM state sequence
Consider the state topology of slide F.15 (two emitting states with self-loops 0.8 and 0.6, entry 1.0, transition 0.2 and exit 0.4) and state transition matrix, including the null states,
$A = \begin{bmatrix} 0 & 1 & 0 & 0 \\ 0 & 0.8 & 0.2 & 0 \\ 0 & 0 & 0.6 & 0.4 \\ 0 & 0 & 0 & 0 \end{bmatrix}$
with output probabilities over the symbols (R, G, B):
$B = \begin{bmatrix} b_1 \\ b_2 \end{bmatrix} = \begin{bmatrix} 0.5 & 0.2 & 0.3 \\ 0 & 0.9 & 0.1 \end{bmatrix}$
(figure: an observation sequence $o_1, o_2, o_3$ shown as coloured symbols)
The probability of the observations with state sequence $X = \{1, 2, 2\}$ is
$P(O, X \mid \lambda) = P(X \mid \lambda)\, P(O \mid X, \lambda) = \pi_1\, b_1(o_1)\, a_{12}\, b_2(o_2)\, a_{22}\, b_2(o_3)\, \eta_2$ F.22
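
The joint likelihood of eq. 21 for this example can be computed as below. The coloured observation symbols of the slide's figure are not recoverable from the transcription, so an observation sequence O = (R, G, B) is assumed purely for illustration:

```python
import numpy as np

# Model of slide F.22: two emitting states, symbols R, G, B
pi  = np.array([1.0, 0.0])
A   = np.array([[0.8, 0.2],
                [0.0, 0.6]])
eta = np.array([0.0, 0.4])
B   = np.array([[0.5, 0.2, 0.3],    # b_1(R), b_1(G), b_1(B)
                [0.0, 0.9, 0.1]])   # b_2(R), b_2(G), b_2(B)

def joint_likelihood(X, O):
    """P(O, X | lambda) = pi_{x1} b_{x1}(o1) * prod_t a_{x_{t-1} x_t} b_{x_t}(o_t) * eta_{x_T}  (eq. 21)."""
    p = pi[X[0]] * B[X[0], O[0]]
    for t in range(1, len(X)):
        p *= A[X[t - 1], X[t]] * B[X[t], O[t]]
    return p * eta[X[-1]]

X = [0, 1, 1]           # state sequence {1, 2, 2} in the slide's numbering
O = [0, 1, 2]           # assumed observations R, G, B (the slide's colours are unknown)
print(joint_likelihood(X, O))   # 1 * 0.5 * 0.2 * 0.9 * 0.6 * 0.1 * 0.4 = 0.00216
```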

HMM recognition & training
Three tasks within the HMM framework:
1. Compute the likelihood of a set of observations with a given model, $P(O \mid \lambda)$.
2. Decode a test sequence by calculating the most likely path, X.
3. Optimise the pattern templates by training the parameters in the models, $\Lambda = \{\lambda\}$.
(figure: a trellis of states against observations) F.23

HMM part 1 summary
Probability fundamentals:
- normalisation and marginalisation
- joint and conditional probabilities
- Bayes' theorem
Markov models: a sequence of directly observable states.
Hidden Markov models (HMMs): a hidden state sequence and the generation of observations. F.24