
Hidden Markov Models
Lecture Notes, Speech Communication 2, SS 2004
Erhard Rank / Franz Pernkopf
Signal Processing and Speech Communication Laboratory, Graz University of Technology
Inffeldgasse 16c, A-8010 Graz, Austria
Tel.: +43 316 873 4436, E-Mail: rank@inw.tugraz.at

Word Recognition
Given: a word dictionary $W = \{W_1, \ldots, W_L\}$ and a time series of features from an unknown word, $X = \{x_1, \ldots, x_N\}$.
Wanted: the most probable spoken word $W_l \in W$.
Target: minimization of the word error rate (WER).

Word Recognition/Dynamic Time Warping
[Figure: alignment of an observed and a reference pattern]

Word Recognition/Dynamic Time Warping
Pattern matching by dynamic time warping (DTW): dynamic programming, complexity $O(S N)$.

Word Recognition/Markov Models
Word recognition by maximizing the probability of the Markov model $\Theta_l$ of word $W_l$ for the observed time series of feature vectors $X$:
$$l^* = \arg\max_l P(\Theta_l \mid X) = \arg\max_l \frac{P(X \mid \Theta_l)\, P(\Theta_l)}{P(X)}$$

Markov Model/Graz Weather
Transition probabilities (today's weather -> tomorrow's weather):

  Today \ Tomorrow | Sunny | Rainy | Foggy
  Sunny            |  0.8  |  0.05 |  0.15
  Rainy            |  0.2  |  0.6  |  0.2
  Foggy            |  0.2  |  0.3  |  0.5

[Figure: state transition diagram with the three states Sunny, Rainy, and Foggy and the above transition probabilities as arc labels]

Markov Model
A Markov model is specified by the set of states $S = \{s_1, s_2, \ldots, s_{N_s}\}$ and characterized by:
The prior probabilities $\pi_i = P(q_1 = s_i)$: the probabilities of $s_i$ being the first state of a state sequence. Collected in the vector $\pi$. (The prior probabilities are often assumed equi-probable, $\pi_i = 1/N_s$.)
The transition probabilities $a_{ij} = P(q_{n+1} = s_j \mid q_n = s_i)$: the probability to go from state $i$ to state $j$. Collected in the matrix $A$.
The Markov model produces a state sequence $Q = \{q_1, \ldots, q_N\}$ over time $1 \le n \le N$, with $q_n \in S$.
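As an illustration (my own sketch, not part of the original notes), the following Python snippet samples a state sequence from such a Markov model, using the Graz weather transition matrix above; the state names, array layout, and function name are my own choices.

```python
import numpy as np

states = ["Sunny", "Rainy", "Foggy"]
pi = np.array([1/3, 1/3, 1/3])              # equi-probable prior probabilities
A = np.array([[0.8, 0.05, 0.15],            # rows: today's weather, columns: tomorrow's
              [0.2, 0.6,  0.2 ],
              [0.2, 0.3,  0.5 ]])

def sample_state_sequence(pi, A, N, rng=np.random.default_rng(0)):
    """Draw a state sequence q_1..q_N from the Markov model (pi, A)."""
    q = [rng.choice(len(pi), p=pi)]
    for _ in range(N - 1):
        q.append(rng.choice(A.shape[0], p=A[q[-1]]))
    return q

print([states[i] for i in sample_state_sequence(pi, A, 7)])
```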

Hidden Markov Model
Additionally, for a hidden Markov model we have emission probabilities:
For continuous observations, e.g., $x \in \mathbb{R}^D$: $b_i(x) = p(x_n \mid q_n = s_i)$, the pdf of the observation $x_n$ at time $n$ if the system is in state $s_i$. Collected as a vector of functions $B(x)$; often parametrized, e.g., by mixtures of Gaussians.
For discrete observations, $x \in \{v_1, \ldots, v_K\}$: $b_{i,k} = P(x_n = v_k \mid q_n = s_i)$, the probability of observing $x_n = v_k$ if the system is in state $s_i$. Collected in the matrix $B$.
Observation sequence: $X = \{x_1, x_2, \ldots, x_N\}$.
The HMM parameters (for a fixed number of states $N_s$) thus are $\Theta = (A, B, \pi)$.

HMM/Graz Weather
The above weather model turns into a hidden Markov model if we cannot observe the weather directly. Suppose you were locked in a room for several days, and you can only observe whether a person is carrying an umbrella ($v_1$ = umbrella) or not ($v_2$ = no umbrella). Example emission probabilities could be:

  Weather | Probability of umbrella
  Sunny   | b_{1,1} = 0.1
  Rainy   | b_{2,1} = 0.8
  Foggy   | b_{3,1} = 0.3

Since there are only two possible values for the discrete observations, the probabilities for "no umbrella" are $b_{i,2} = 1 - b_{i,1}$.
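For the later examples it is convenient to have the complete weather HMM $\Theta = (A, B, \pi)$ in machine-readable form. The following snippet (my own illustration, not from the notes) extends the arrays of the previous sketch with the emission matrix; observation index 0 stands for "umbrella" and 1 for "no umbrella".

```python
obs_symbols = ["umbrella", "no umbrella"]

B = np.array([[0.1, 0.9],                   # emission probabilities b_{i,k}
              [0.8, 0.2],                   # rows: Sunny, Rainy, Foggy
              [0.3, 0.7]])                  # columns: umbrella, no umbrella
# pi and A are reused from the previous snippet.
```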

HMM/Discrete vs. Continuous Features
[Figure: an HMM with discrete features and emission probability tables vs. continuous features and emission probability densities]

Markov Model/Left-to-Right Models
[Figure: left-to-right model topologies]

HMM/Trellis
Trellis: model description over time. Each column corresponds to a time step $n = 1, \ldots, N$ with observation $x_n$, each row to a state; the arcs between columns carry the transition probabilities $a_{i,j}$ and the nodes the emission probabilities $b_{j,k}$.
[Figure: trellis diagram for a three-state model over the observation sequence $x_1, x_2, \ldots, x_N$]

HMM/Trellis Example
[Figure: trellis for the weather HMM with the path $Q = \{\text{Sunny}, \text{Foggy}, \text{Sunny}\}$ and the observation sequence $X = \{\text{no umbrella}, \text{no umbrella}, \text{no umbrella}\}$ highlighted]
Joint likelihood of the observed sequence $X$ and a state sequence (path) $Q$:
$$P(X, Q \mid \Theta) = \pi_{q_1}\, b_{q_1, x_1}\, a_{q_1, q_2}\, b_{q_2, x_2}\, a_{q_2, q_3}\, b_{q_3, x_3} = \tfrac{1}{3} \cdot 0.9 \cdot 0.15 \cdot 0.7 \cdot 0.2 \cdot 0.9 \approx 0.0057$$
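As a quick numerical check (my own sketch, not part of the notes), the same product can be evaluated with the arrays defined in the earlier snippets; the indices Sunny = 0, Foggy = 2 and observation index 1 = "no umbrella" follow the conventions chosen there.

```python
# P(X, Q | Theta) for Q = {Sunny, Foggy, Sunny}, X = {no umbrella, no umbrella, no umbrella}
p_joint = pi[0] * B[0, 1] * A[0, 2] * B[2, 1] * A[2, 0] * B[0, 1]
print(p_joint)   # ~0.0057
```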

HMM/Parameters
The parameters $\{\pi, A, B\}$ are probabilities:
Positivity: $\pi_i \ge 0$, $a_{i,j} \ge 0$, and $b_{i,k} \ge 0$ or $b_i(x) \ge 0$.
Normalization: $\sum_{i=1}^{N_s} \pi_i = 1$, $\sum_{j=1}^{N_s} a_{i,j} = 1$, and $\sum_{k=1}^{K} b_{i,k} = 1$ or $\int_X b_i(x)\, dx = 1$.

Hidden Markov Models: 3 Problems
The three basic problems for HMMs:
1. Given an HMM with parameters $\Theta = (A, B, \pi)$, efficiently compute the production probability of an observation sequence $X$: $P(X \mid \Theta) = ?$
2. Given the model $\Theta$, what is the hidden state sequence $Q^*$ that best explains an observation sequence $X$: $Q^* = \arg\max_Q P(X, Q \mid \Theta) = ?$
3. How do we adjust the model parameters to maximize $P(X \mid \Theta)$: $\hat\Theta = (\hat A, \hat B, \hat\pi) = ?$, with $P(X \mid \hat\Theta) = \max_\Theta P(X \mid \Theta)$

HMM/Prod. Probability
Problem 1: Production probability
Given: HMM parameters $\Theta$ and an observed sequence $X$ of length $N$. Wanted: the probability $P(X \mid \Theta)$ of $X$ being produced by $\Theta$.
Probability of a particular state sequence:
$$P(Q \mid \Theta) = P(q_1, \ldots, q_N \mid \Theta) = \pi_{q_1} \prod_{n=2}^{N} a_{q_{n-1}, q_n}$$
Emission probabilities along this state sequence:
$$P(X \mid Q, \Theta) = P(x_1, \ldots, x_N \mid q_1, \ldots, q_N, \Theta) = \prod_{n=1}^{N} b_{q_n, x_n}$$
Joint probability of hidden state sequence and observation sequence:
$$P(X, Q \mid \Theta) = P(X \mid Q, \Theta)\, P(Q \mid \Theta) = \pi_{q_1}\, b_{q_1, x_1} \prod_{n=2}^{N} a_{q_{n-1}, q_n}\, b_{q_n, x_n}$$

HMM/Prod. Probability
Production probability: sum over all possible state sequences,
$$P(X \mid \Theta) = \sum_{Q \in \mathcal{Q}_N} P(X, Q \mid \Theta) = \sum_{Q \in \mathcal{Q}_N} \left( \pi_{q_1}\, b_{q_1, x_1} \prod_{n=2}^{N} a_{q_{n-1}, q_n}\, b_{q_n, x_n} \right)$$
Direct evaluation has exponential complexity, on the order of $2N \cdot N_s^N$ operations. Instead, use a recursive algorithm with complexity $O(N_s^2\, N)$, i.e., linear in $N$:

HMM/Prod. Probability
Forward algorithm: computation of the forward probabilities $\alpha_n(j) = P(x_1, \ldots, x_n, q_n = s_j \mid \Theta)$.
Initialization: for all $j = 1, \ldots, N_s$: $\alpha_1(j) = \pi_j\, b_{j, x_1}$
Recursion: for all $n > 1$ and all $j = 1, \ldots, N_s$: $\alpha_n(j) = \left( \sum_{i=1}^{N_s} \alpha_{n-1}(i)\, a_{i,j} \right) b_{j, x_n}$
Termination: $P(X \mid \Theta) = \sum_{j=1}^{N_s} \alpha_N(j)$
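A direct Python transcription of the forward recursion, as a sketch only (my own code, using the weather HMM arrays defined earlier; `obs` is assumed to be a list of observation indices):

```python
def forward(pi, A, B, obs):
    """Forward algorithm: returns alpha (N x N_s) and P(X | Theta)."""
    N, Ns = len(obs), len(pi)
    alpha = np.zeros((N, Ns))
    alpha[0] = pi * B[:, obs[0]]                      # initialization
    for n in range(1, N):
        alpha[n] = (alpha[n - 1] @ A) * B[:, obs[n]]  # recursion
    return alpha, alpha[-1].sum()                     # termination

# Example: X = {no umbrella, umbrella, umbrella}
alpha, p_x = forward(pi, A, B, [1, 0, 0])
print(p_x)
```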

HMM/Prod. Probability
Backward algorithm: computation of the backward probabilities $\beta_n(i) = P(x_{n+1}, \ldots, x_N \mid q_n = s_i, \Theta)$.
Initialization: for all $i = 1, \ldots, N_s$: $\beta_N(i) = 1$
Recursion: for all $n < N$ and all $i = 1, \ldots, N_s$: $\beta_n(i) = \sum_{j=1}^{N_s} a_{i,j}\, b_{j, x_{n+1}}\, \beta_{n+1}(j)$
Termination: $P(X \mid \Theta) = \sum_{j=1}^{N_s} \pi_j\, b_{j, x_1}\, \beta_1(j)$
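The corresponding backward pass, again only as an illustrative sketch under the same array conventions as the forward function above:

```python
def backward(pi, A, B, obs):
    """Backward algorithm: returns beta (N x N_s) and P(X | Theta)."""
    N, Ns = len(obs), len(pi)
    beta = np.zeros((N, Ns))
    beta[-1] = 1.0                                        # initialization
    for n in range(N - 2, -1, -1):
        beta[n] = A @ (B[:, obs[n + 1]] * beta[n + 1])    # recursion
    return beta, (pi * B[:, obs[0]] * beta[0]).sum()      # termination

beta, p_x_b = backward(pi, A, B, [1, 0, 0])
print(p_x_b)   # should equal the forward result
```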

HMM/Prod. Probability
[Figure: lattice illustrations of the forward recursion (predecessors of state $s_j$ at time $n$, weighted by $a_{i,j}$ and $b_{j,x_n}$) and of the backward recursion (successors of state $s_i$ at time $n$, weighted by $a_{i,j}$ and $b_{j,x_{n+1}}$)]
At each time $n$,
$$\alpha_n(j)\, \beta_n(j) = P(X, q_n = s_j \mid \Theta)$$
is the joint probability of the observation sequence $X$ and all state sequences (paths) passing through state $s_j$ at time $n$, and
$$P(X \mid \Theta) = \sum_{j=1}^{N_s} \alpha_n(j)\, \beta_n(j)$$
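This identity can be checked numerically with the forward and backward sketches from above for the example sequence (my own illustration):

```python
alpha, p_x = forward(pi, A, B, [1, 0, 0])
beta, _ = backward(pi, A, B, [1, 0, 0])
print([(alpha[n] * beta[n]).sum() for n in range(3)])   # all approximately equal to p_x
```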

HMM/State Sequence
Problem 2: Hidden state sequence
Given: HMM parameters $\Theta$ and an observed sequence $X$ of length $N$. Wanted: the a posteriori most probable state sequence $Q^*$.
The a posteriori probability of a state sequence is
$$P(Q \mid X, \Theta) = \frac{P(X, Q \mid \Theta)}{P(X \mid \Theta)}$$
$Q^*$ is the optimal state sequence if
$$P(X, Q^* \mid \Theta) = \max_{Q \in \mathcal{Q}_N} P(X, Q \mid \Theta) =: P^*(X \mid \Theta)$$
It is found by the Viterbi algorithm, which computes
$$\delta_n(j) = \max_{Q \in \mathcal{Q}_n} P(x_1, \ldots, x_n, q_1, \ldots, q_n \mid \Theta) \quad \text{for } q_n = s_j$$

HMM/Viterbi Algorithm
Viterbi algorithm: computation of the optimal state sequence.
Initialization: for all $j = 1, \ldots, N_s$: $\delta_1(j) = \pi_j\, b_{j, x_1}$, $\psi_1(j) = 0$
Recursion: for $n > 1$ and all $j = 1, \ldots, N_s$:
$$\delta_n(j) = \max_i \big( \delta_{n-1}(i)\, a_{i,j} \big)\, b_{j, x_n}, \qquad \psi_n(j) = \arg\max_i \big( \delta_{n-1}(i)\, a_{i,j} \big)$$
Termination: $P^*(X \mid \Theta) = \max_j \big( \delta_N(j) \big)$, $q_N^* = \arg\max_j \big( \delta_N(j) \big)$
Backtracking of the optimal state sequence: $q_n^* = \psi_{n+1}(q_{n+1}^*)$, for $n = N-1, N-2, \ldots, 1$
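A compact Python sketch of the Viterbi recursion with backtracking (my own illustration, with the same array conventions as the snippets above):

```python
def viterbi(pi, A, B, obs):
    """Viterbi algorithm: returns the optimal state sequence and P*(X | Theta)."""
    N, Ns = len(obs), len(pi)
    delta = np.zeros((N, Ns))
    psi = np.zeros((N, Ns), dtype=int)
    delta[0] = pi * B[:, obs[0]]                          # initialization
    for n in range(1, N):
        scores = delta[n - 1][:, None] * A                # delta_{n-1}(i) * a_{i,j}
        psi[n] = scores.argmax(axis=0)                    # best predecessor per state
        delta[n] = scores.max(axis=0) * B[:, obs[n]]      # recursion
    q = [int(delta[-1].argmax())]                         # termination
    for n in range(N - 1, 0, -1):                         # backtracking
        q.insert(0, int(psi[n][q[0]]))
    return q, delta[-1].max()
```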

HMM/Viterbi Algorithm
The values $\delta_n(j)$ and the backpointers $\psi_n(j)$ are stored in $N_s \times N$ tables $\delta$ and $\psi$.
Example: For our weather HMM $\Theta$, find the most probable hidden weather sequence for the observation sequence $X = \{x_1 = \text{no umbrella}, x_2 = \text{umbrella}, x_3 = \text{umbrella}\}$.
1. Initialization ($n = 1$):
$\delta_1(\text{Sunny}) = \pi_{\text{S}}\, b_{\text{S}, \text{no umbrella}} = 1/3 \cdot 0.9 = 0.3$, $\psi_1(\text{Sunny}) = 0$
$\delta_1(\text{Rainy}) = \pi_{\text{R}}\, b_{\text{R}, \text{no umbrella}} = 1/3 \cdot 0.2 = 0.0667$, $\psi_1(\text{Rainy}) = 0$
$\delta_1(\text{Foggy}) = \pi_{\text{F}}\, b_{\text{F}, \text{no umbrella}} = 1/3 \cdot 0.7 = 0.233$, $\psi_1(\text{Foggy}) = 0$

HMM/Viterbi Algorithm
2. Recursion ($n = 2$): We calculate the likelihood of reaching state Sunny from each of the 3 possible predecessor states, and choose the most likely one to go on with:
$\delta_2(\text{Sunny}) = \max\big( \delta_1(\text{Sunny})\, a_{\text{S},\text{S}},\ \delta_1(\text{Rainy})\, a_{\text{R},\text{S}},\ \delta_1(\text{Foggy})\, a_{\text{F},\text{S}} \big)\, b_{\text{S}, \text{umbrella}} = \max(0.3 \cdot 0.8,\ 0.0667 \cdot 0.2,\ 0.233 \cdot 0.2) \cdot 0.1 = 0.024$, $\psi_2(\text{Sunny}) = \text{Sunny}$
The likelihood is stored in $\delta_2$, the most likely predecessor in $\psi_2$. The same procedure is carried out for the states Rainy and Foggy:
$\delta_2(\text{Rainy}) = \max(0.3 \cdot 0.05,\ 0.0667 \cdot 0.6,\ 0.233 \cdot 0.3) \cdot 0.8 = 0.056$, $\psi_2(\text{Rainy}) = \text{Foggy}$
$\delta_2(\text{Foggy}) = \max(0.3 \cdot 0.15,\ 0.0667 \cdot 0.2,\ 0.233 \cdot 0.5) \cdot 0.3 = 0.035$, $\psi_2(\text{Foggy}) = \text{Foggy}$

HMM/Viterbi Algorithm
[Figure: trellis illustration of the recursion step $\delta_2(\text{Sunny}) = \max_i\big( \delta_1(i)\, a_{i,\text{S}} \big)\, b_{\text{S}, \text{umbrella}}$, starting from $\delta_1(\text{Sunny}) = 0.3$, for the observation sequence $x_1 = \text{no umbrella}$, $x_2 = x_3 = \text{umbrella}$]

HMM/Viterbi Algorithm
Recursion ($n = 3$):
$\delta_3(\text{Sunny}) = \max\big( \delta_2(\text{Sunny})\, a_{\text{S},\text{S}},\ \delta_2(\text{Rainy})\, a_{\text{R},\text{S}},\ \delta_2(\text{Foggy})\, a_{\text{F},\text{S}} \big)\, b_{\text{S}, \text{umbrella}} = \max(0.024 \cdot 0.8,\ 0.056 \cdot 0.2,\ 0.035 \cdot 0.2) \cdot 0.1 = 0.0019$, $\psi_3(\text{Sunny}) = \text{Sunny}$
$\delta_3(\text{Rainy}) = \max(0.024 \cdot 0.05,\ 0.056 \cdot 0.6,\ 0.035 \cdot 0.3) \cdot 0.8 = 0.0269$, $\psi_3(\text{Rainy}) = \text{Rainy}$
$\delta_3(\text{Foggy}) = \max(0.024 \cdot 0.15,\ 0.056 \cdot 0.2,\ 0.035 \cdot 0.5) \cdot 0.3 = 0.0052$, $\psi_3(\text{Foggy}) = \text{Foggy}$

HMM/Viterbi Algorithm
[Figure: trellis after the recursion for $n = 3$, with $\delta_3(\text{Sunny}) = 0.0019$, $\delta_3(\text{Rainy}) = 0.0269$, $\delta_3(\text{Foggy}) = 0.0052$]

HMM/Viterbi Algorithm
3. Termination: The globally most likely path is determined, starting by looking for the last state of the most likely sequence:
$P^*(X \mid \Theta) = \max_i\big( \delta_3(i) \big) = \delta_3(\text{Rainy}) = 0.0269$, $q_3^* = \arg\max_i\big( \delta_3(i) \big) = \text{Rainy}$
4. Backtracking: The best sequence of states is read from the $\psi$ tables:
$n = N-1 = 2$: $q_2^* = \psi_3(q_3^*) = \psi_3(\text{Rainy}) = \text{Rainy}$
$n = N-2 = 1$: $q_1^* = \psi_2(q_2^*) = \psi_2(\text{Rainy}) = \text{Foggy}$
The most likely weather sequence is $Q^* = \{q_1^*, q_2^*, q_3^*\} = \{\text{Foggy}, \text{Rainy}, \text{Rainy}\}$.
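Running the viterbi sketch from above on the weather HMM reproduces this result (assuming the arrays and functions defined in the earlier snippets):

```python
q_star, p_star = viterbi(pi, A, B, [1, 0, 0])   # no umbrella, umbrella, umbrella
print([states[i] for i in q_star], p_star)      # ['Foggy', 'Rainy', 'Rainy'], ~0.0269
```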

HMM/Viterbi Algorithm
[Figure: backtracking through the trellis along the path Foggy -> Rainy -> Rainy, with $\delta_3(\text{Sunny}) = 0.0019$, $\delta_3(\text{Rainy}) = 0.0269$, $\delta_3(\text{Foggy}) = 0.0052$]

HMM/Parameter Estimation
Problem 3: Parameter estimation for HMMs
Given: the HMM structure ($N_s$ states, $K$ observation symbols) and a training sequence $X = \{x_1, \ldots, x_N\}$. Wanted: the optimal parameter values $\hat\Theta = \{\hat\pi, \hat A, \hat B\}$ with
$$P(X \mid \hat\Theta) = \max_\Theta P(X \mid \Theta) = \max_\Theta \sum_{Q \in \mathcal{Q}_N} P(X, Q \mid \Theta)$$
Baum-Welch algorithm, or EM (expectation-maximization) algorithm: iterative optimization of the parameters $\Theta \to \hat\Theta$.
In the terminology of the EM algorithm we have:
observable variables: the observation sequence $X$
hidden (latent) variables: the state sequence $Q$

HMM/Parameter Estimation
Probability of a transition $s_i \to s_j$ at time $n$ (for given $\Theta$):
$$\xi_n(i, j) := P(q_n = s_i, q_{n+1} = s_j \mid X, \Theta) = \frac{\alpha_n(i)\, a_{i,j}\, b_{j, x_{n+1}}\, \beta_{n+1}(j)}{P(X \mid \Theta)}$$
[Figure: lattice segment illustrating $\xi_n(i,j)$: forward probability $\alpha_n(i)$, transition $a_{i,j}$, emission $b_{j, x_{n+1}}$, backward probability $\beta_{n+1}(j)$]
State probability for $s_i$ at time $n$ (for given $\Theta$):
$$\gamma_n(i) := P(q_n = s_i \mid X, \Theta) = \frac{\alpha_n(i)\, \beta_n(i)}{P(X \mid \Theta)}$$
with
$$P(X \mid \Theta) = \sum_{i=1}^{N_s} \alpha_n(i)\, \beta_n(i), \qquad \gamma_n(i) = \sum_{j=1}^{N_s} \xi_n(i, j)$$
(cf. forward/backward algorithm).

HMM/Parameter Estimation
Summing over time $n$ gives expected counts (frequencies):
$\sum_{n=1}^{N-1} \gamma_n(i)$ ... expected number of transitions from state $s_i$
$\sum_{n=1}^{N-1} \xi_n(i, j)$ ... expected number of transitions from state $s_i$ to state $s_j$
Baum-Welch update of the HMM parameters:
$$\bar\pi_i = \gamma_1(i)$$ ... probability of state $s_i$ at time $n = 1$
$$\bar a_{i,j} = \frac{\sum_{n=1}^{N-1} \xi_n(i, j)}{\sum_{n=1}^{N-1} \gamma_n(i)}$$ ... expected number of transitions from $s_i$ to $s_j$, divided by the expected number of transitions from $s_i$
$$\bar b_{j,k} = \frac{\sum_{n=1}^{N} \gamma_n(j)\, [x_n = v_k]}{\sum_{n=1}^{N} \gamma_n(j)}$$ ... expected number of times in state $s_j$ with $v_k$ emitted, divided by the expected number of times in state $s_j$
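One re-estimation iteration of these update formulas can be sketched in Python as follows (my own illustration, reusing the forward and backward functions above; a real implementation would add scaling to avoid numerical underflow and would iterate until convergence):

```python
def baum_welch_step(pi, A, B, obs):
    """One Baum-Welch re-estimation step for a discrete-observation HMM."""
    N, Ns, K = len(obs), len(pi), B.shape[1]
    alpha, p_x = forward(pi, A, B, obs)
    beta, _ = backward(pi, A, B, obs)
    gamma = alpha * beta / p_x                                  # gamma_n(i)
    xi = (alpha[:-1, :, None] * A[None, :, :]                   # xi_n(i, j)
          * (B[:, obs[1:]].T * beta[1:])[:, None, :]) / p_x
    pi_new = gamma[0]                                           # pi_i update
    A_new = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]    # a_{i,j} update
    B_new = np.zeros((Ns, K))
    for k in range(K):                                          # b_{j,k} update
        B_new[:, k] = gamma[np.array(obs) == k].sum(axis=0) / gamma.sum(axis=0)
    return pi_new, A_new, B_new
```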

HMM/Parameter Estimation
For continuous observations, the emission probabilities are typically modeled as:
Gaussian (normally distributed) emission probabilities: $b_j(x) = \mathcal{N}(x \mid \mu_j, \Sigma_j)$
Mixtures of Gaussians: $b_j(x) = \sum_{k=1}^{K} c_{jk}\, \mathcal{N}(x \mid \mu_{jk}, \Sigma_{jk})$, with $\sum_{k=1}^{K} c_{jk} = 1$
Semi-continuous emission probabilities: $b_j(x) = \sum_{k=1}^{K} c_{jk}\, \mathcal{N}(x \mid \mu_k, \Sigma_k)$, with $\sum_{k=1}^{K} c_{jk} = 1$
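As a small illustration (not from the notes), a mixture-of-Gaussians emission density for a single state can be evaluated like this; the weights, means, and covariances are made-up example values, and scipy's multivariate normal is used for $\mathcal{N}(x \mid \mu, \Sigma)$:

```python
from scipy.stats import multivariate_normal

def gmm_emission(x, weights, means, covs):
    """b_j(x) = sum_k c_jk * N(x | mu_jk, Sigma_jk)."""
    return sum(c * multivariate_normal.pdf(x, mean=m, cov=S)
               for c, m, S in zip(weights, means, covs))

# Example: a two-component mixture in a 2-D feature space (made-up parameters)
weights = [0.6, 0.4]
means = [np.zeros(2), np.array([1.0, 1.0])]
covs = [np.eye(2), 0.5 * np.eye(2)]
print(gmm_emission(np.array([0.5, 0.5]), weights, means, covs))
```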

HMM/Parameter Estimation
Problems encountered in HMM parameter estimation: many word models, HMM states, and parameters, but never enough training data.
Consequences:
- large variance of the estimated parameters
- large variance of the objective function $P(X \mid \Theta)$
- vanishing statistics: zero-valued parameters $\hat a_{i,j}$, $\hat b_{j,k}$, $\hat\Sigma_k$, $\hat\Sigma_{jk}$, ...
Possible remedies (besides using more training data):
- fixing some parameter values
- tying parameter values of similar models
- interpolation of sensitive parameters with robust parameters
- smoothing of probability density functions
- defining limits for sensitive density parameters

HMM/Parameter Estimation
Parameter tying: simultaneous identification of parameters for similar models. It forces identical parameter values and reduces the dimension of the parameter space.
Example (state tying): automatic determination of states that can be tied, e.g., by mutual information.

HMM/Parameter Estimation
Parameter interpolation: instead of fixed tying of states, interpolate the parameters of similar models:
$$P(X \mid \Theta_R, \Theta_S, r_R, r_S) = r_R\, P(X \mid \Theta_R) + r_S\, P(X \mid \Theta_S), \qquad r_R + r_S = 1$$
This is especially suited for semi-continuous emission pdfs. The weights $r_R, r_S$ can be chosen heuristically or included in the Baum-Welch algorithm.

References
R.O. Duda and P.E. Hart, Pattern Classification and Scene Analysis, Wiley & Sons, 1973.
S. Bengio, An Introduction to Statistical Machine Learning - EM for GMMs, Dalle Molle Institute for Perceptual Artificial Intelligence.
E.G. Schukat-Talamazzini, Automatische Spracherkennung, Vieweg-Verlag, 1995.
L.R. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition," Proceedings of the IEEE, vol. 77, no. 2, pp. 257-286, 1989.