Note Set 5: Hidden Markov Models


Probabilistic Learning: Theory and Algorithms, CS 274A, Winter 2016

1 Hidden Markov Models (HMMs)

1.1 Introduction

Consider observed data vectors x_t that are d-dimensional, where the subscript t indicates discrete time or position in a sequence, t = 1, ..., T. We can assume for notational simplicity that the x_t vectors are real-valued, but in general x_t could be discrete or could be a mixture of discrete and real-valued components. Our data consists of D = {x_1, ..., x_T}.

Unlike the IID assumption we have made in the past, we would now like to model the sequential dependence among the x_t's. One approach is to use a hidden Markov model, where we assume that the x_t's are noisy stochastic functions of an unobserved (hidden) Markov chain denoted by z_t, where z_t is a discrete random variable taking one of K possible values, z_t ∈ {1, ..., K}; z_t is often referred to as the state variable. The generative model for a hidden Markov model is simple: at each time step t, a data vector x_t is generated conditioned on the state z_t, the Markov chain then transitions to a new state z_{t+1} to generate x_{t+1}, and so on. As with standard Markov chains, there is an initial distribution π over the K states to initialize the chain.

There are two key assumptions in a hidden Markov model:

1. Observations x_t are conditionally independent of all other variables given z_t, so the observation at time t depends only on the current state z_t.

2. The z_t's form a (first-order) Markov chain, i.e., p(z_t | z_{t-1}, ..., z_1) = p(z_t | z_{t-1}), t = 2, ..., T. The chain is also typically assumed to be homogeneous, in that the transition probabilities do not depend on t.

The earliest applications of hidden Markov models were in speech recognition in the 1970s, but they have since been used for a variety of other problems in bioinformatics, language modeling, economics, and climate modeling.

We will use the shorthand notation x_{[1,T]} for x_1, ..., x_T, and z_{[1,T]} for z_1, ..., z_T. Our observed data is D = {x_1, ..., x_T}. From our graphical model we have

$$ p(x_{[1,T]}, z_{[1,T]}) = \prod_{t=1}^{T} p(x_t \mid z_t) \, p(z_t \mid z_{t-1}), $$

where p(z_1 | z_0) is defined to be π, the initial distribution on states.
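To make the generative process concrete, here is a minimal sketch of sampling from an HMM. The specific choices (K = 3 states, a uniform initial distribution π, a particular transition matrix A, and Gaussian emissions with per-state means and identity covariance) are purely illustrative assumptions, not values from these notes.

```python
import numpy as np

rng = np.random.default_rng(0)

K, T, d = 3, 100, 2                      # number of states, sequence length, data dimension
pi = np.full(K, 1.0 / K)                 # initial state distribution (assumed uniform)
A = np.array([[0.8, 0.1, 0.1],           # A[i, j] = p(z_t = j | z_{t-1} = i)
              [0.1, 0.8, 0.1],
              [0.1, 0.1, 0.8]])
means = np.array([[0.0, 0.0],            # per-state emission means (one row per state)
                  [3.0, 3.0],
                  [-3.0, 3.0]])

z = np.empty(T, dtype=int)
x = np.empty((T, d))
z[0] = rng.choice(K, p=pi)               # z_1 ~ pi
x[0] = rng.normal(means[z[0]], 1.0)      # x_1 | z_1 ~ N(mu_{z_1}, I)
for t in range(1, T):
    z[t] = rng.choice(K, p=A[z[t - 1]])  # z_t | z_{t-1} ~ row z_{t-1} of A
    x[t] = rng.normal(means[z[t]], 1.0)  # x_t | z_t ~ N(mu_{z_t}, I)
```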

We have two sets of parameters:

- Transition matrix: a K × K matrix A, with a_ij = p(z_t = j | z_{t-1} = i), 1 ≤ i, j ≤ K.

- K emission distributions/densities p(x_t | z_t = j), j = 1, ..., K, e.g., a multivariate Gaussian for real-valued x_t, usually assumed to be homogeneous (i.e., not depending on t). If x_t is very high-dimensional it is common to assume that the components of x_t are conditionally independent given z_t.

For simplicity we will assume that π, the initial distribution on states, is known, e.g., set to the uniform distribution; if we had multiple sequences it could also be learned from the data. We will let θ denote the full set of parameters, i.e., the transition matrix A and the parameters of the K state-dependent emission distributions p(x_t | z_t = j), j = 1, ..., K.

Note the similarity of the HMM to a finite mixture model with K components. In particular, the HMM can be viewed as adding Markov dependence to the unobserved component indicator variable z in a mixture model.

1.2 Efficient Computation of HMM Likelihood

Below we show how to compute the likelihood L(θ), where both the K emission density parameters and the transition matrix A are unknown:

$$ L(\theta) = p(D \mid \theta) = p(x_{[1,T]} \mid \theta) = \sum_{z_{[1,T]}} p(x_{[1,T]}, z_{[1,T]} \mid \theta). $$

This sum is intractable to compute directly, since it has complexity O(K^T). However, we can use the conditional independence assumptions (or, equivalently, the graphical model structure of the HMM) to carry out this computation efficiently.
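To see where the O(K^T) cost comes from, the following sketch evaluates the sum over all state sequences by brute force. Here B is a hypothetical T × K array holding precomputed emission likelihoods, B[t, j] = p(x_t | z_t = j), so the sketch is agnostic to the emission family; the function name and array are my own illustration.

```python
import itertools
import numpy as np

def brute_force_likelihood(pi, A, B):
    """Sum p(x_[1,T], z_[1,T]) over every state sequence z; O(K^T) terms."""
    T, K = B.shape
    total = 0.0
    for z in itertools.product(range(K), repeat=T):   # all K^T state sequences
        p = pi[z[0]] * B[0, z[0]]                      # p(z_1) p(x_1 | z_1)
        for t in range(1, T):
            p *= A[z[t - 1], z[t]] * B[t, z[t]]        # p(z_t | z_{t-1}) p(x_t | z_t)
        total += p
    return total
```

Even with K = 3 and T = 15 this already enumerates roughly 14 million sequences, which is why the forward recursion below is needed.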

Let α_t(j) = p(z_t = j, x_{[1,t]}), j = 1, ..., K (implicitly conditioning on θ). This is the joint probability of (a) the unobserved state at time t being state j and (b) all of the observed x_t's up to and including time t. We can compute the α_t(j)'s recursively:

$$
\begin{aligned}
\alpha_t(j) = p(z_t = j, x_{[1,t]}) &= \sum_{i=1}^{K} p(z_t = j, z_{t-1} = i, x_{[1,t]}) \\
&= \sum_{i=1}^{K} p(x_t \mid z_t = j, z_{t-1} = i, x_{[1,t-1]}) \, p(z_t = j \mid z_{t-1} = i, x_{[1,t-1]}) \, p(z_{t-1} = i, x_{[1,t-1]}) \\
&= \sum_{i=1}^{K} p(x_t \mid z_t = j) \, p(z_t = j \mid z_{t-1} = i) \, p(z_{t-1} = i, x_{[1,t-1]}) \\
&= p(x_t \mid z_t = j) \sum_{i=1}^{K} a_{ij} \, \alpha_{t-1}(i).
\end{aligned}
$$

In each summand the first term is the evidence from the observation x_t at time t, the second term is the transition probability a_ij, and the final term is just α_{t-1}(i). This yields a simple recurrence relation for the α's, which we can evaluate in a single forward pass from t = 1 up to t = T, initializing the recursion with α_1(j) = π(j) p(x_1 | z_1 = j). This is the forward part of the well-known forward-backward algorithm for HMMs. Given α_T(j), j = 1, ..., K, the likelihood can then be computed as L(θ) = Σ_j α_T(j), by the law of total probability.

If we know α_{t-1}(i), i = 1, ..., K, we can compute α_t(j) for all j in time O(K^2 + K f(d)). The K^2 term arises because we have to sum over all (i, j) pairs, and the function f(d) reflects the cost of evaluating the likelihood of the data vector x_t under each possible state, e.g., f(d) = O(d^2) for a Gaussian emission density. The overall complexity of computing all of the α's is therefore O(T K^2 + T K f(d)).

1.3 Efficient Computation of State Probabilities

In developing an EM algorithm for HMMs we will want to compute the probability of each possible state at each time t given all of the observed data, i.e., p(z_t = j | x_{[1,T]}) (using all of the data, both before and after time t). We can factor this as follows:

$$
\begin{aligned}
p(z_t = j \mid x_{[1,T]}) &\propto p(z_t = j, x_{[1,t]}, x_{[t+1,T]}) \\
&= p(x_{[t+1,T]} \mid z_t = j, x_{[1,t]}) \, p(z_t = j, x_{[1,t]}) \\
&= p(x_{[t+1,T]} \mid z_t = j) \, p(z_t = j, x_{[1,t]}) \\
&= p(x_{[t+1,T]} \mid z_t = j) \, \alpha_t(j).
\end{aligned}
$$

Note that given z_t = j, the values x_{[1,t]} give us no additional information about x_{[t+1,T]}, which is how we get from the second to the third line above. Define β_t(j) = p(x_{[t+1,T]} | z_t = j), t = 1, ..., T, j = 1, ..., K. Then, from above, we have

$$ p(z_t = j \mid x_{[1,T]}) \propto \beta_t(j) \, \alpha_t(j). $$
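Here is a minimal sketch of the forward pass just described, again using a hypothetical T × K array B with B[t, j] = p(x_t | z_t = j) so that it works for any emission family. In practice the α's are usually rescaled at each step (or the computation is done in log space) to avoid numerical underflow on long sequences; that detail is omitted here.

```python
import numpy as np

def forward_pass(pi, A, B):
    """Forward recursion: row t of the returned T x K array holds alpha_{t+1}(j) (0-based rows)."""
    T, K = B.shape
    alpha = np.zeros((T, K))
    alpha[0] = pi * B[0]                       # alpha_1(j) = pi(j) p(x_1 | z_1 = j)
    for t in range(1, T):
        alpha[t] = B[t] * (alpha[t - 1] @ A)   # p(x_t | z_t = j) * sum_i a_ij alpha_{t-1}(i)
    return alpha

def hmm_likelihood(alpha):
    """L(theta) = sum_j alpha_T(j)."""
    return alpha[-1].sum()
```

Each step costs O(K^2) for the matrix-vector product plus the cost of filling in the row B[t], matching the O(T K^2 + T K f(d)) total above.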

Using the same type of recursive decomposition as we used for the α_t(j)'s, the β_t(j)'s can be computed in time O(T K^2 + T K f(d)), working backwards from t = T to t = 1. Thus, to compute p(z_t = j | x_{[1,T]}) for t = 1, ..., T and j = 1, ..., K, we first recursively compute the α_t(j)'s (forward step), and next we recursively compute the β_t(j)'s (backward step). Finally, we can compute

$$ w_t(j) = p(z_t = j \mid x_{[1,T]}) = \frac{\alpha_t(j) \, \beta_t(j)}{\sum_k \alpha_t(k) \, \beta_t(k)}, $$

where the denominator is the normalization term. This yields a set of T × K probabilities w_t(j), which play the same role in EM as the membership weights for mixture models.

2 EM for Learning HMM Parameters

The EM algorithm for HMMs follows the same general idea as the EM algorithm for finite mixtures. In the E-step we compute the probabilities w_t(j) (or membership weights) of the unobserved states, for each state j and each time t, conditioned on all of the data x_{[1,T]} and on the current parameters θ. In the M-step we compute point estimates of the parameters given the membership weights from the E-step. There are two different sets of parameters: (1) the emission density parameters of p(x_t | z_t = j), and (2) the transition parameters a_ij, 1 ≤ i, j ≤ K.

The estimation of the emission density parameters proceeds in exactly the same manner as in the finite mixture case. For example, if the emission densities are Gaussian, then the membership weights are used to generate fractional counts for estimating the mean and covariance of each of the K emission densities.

For the transition probabilities we proceed as follows. We first compute E[N_i], the expected number of times the chain is in state i over the time steps t = 1, ..., T-1, i.e., E[N_i] = Σ_{t=1}^{T-1} w_t(i); this is the expected number of transitions out of state i. Next, we need to compute E[N_ij], the expected number of times we transition from state i to state j:

$$ E[N_{ij}] = \sum_{t=1}^{T-1} p(z_t = i, z_{t+1} = j \mid x_{[1,T]}), $$

where each term in the sum is proportional to the joint probability p(z_t = i, z_{t+1} = j, x_{[1,T]}). Letting γ_t(i, j) = p(z_t = i, z_{t+1} = j, x_{[1,T]}), we have

$$
\begin{aligned}
\gamma_t(i, j) &= p(z_t = i, z_{t+1} = j, x_{[1,T]}) \\
&= p(x_{[t+2,T]} \mid z_{t+1} = j) \, p(x_{t+1} \mid z_{t+1} = j) \, p(z_{t+1} = j \mid z_t = i) \, p(z_t = i, x_{[1,t]}) \\
&= \beta_{t+1}(j) \, p(x_{t+1} \mid z_{t+1} = j) \, a_{ij} \, \alpha_t(i).
\end{aligned}
$$

In going from the first to the second line we have used various conditional independence properties that exist in the model. The final line consists of quantities that can all be computed easily: p(x_{t+1} | z_{t+1} = j) can be computed directly from the emission model, a_ij is a known (current) parameter, and the β's and α's have already been computed during the forward-backward computations of the E-step.
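A sketch of the backward recursion, the weights w_t(j), and the pairwise quantities γ_t(i, j) in the same style as the forward pass above; B[t, j] again stands for the precomputed emission likelihood p(x_t | z_t = j), and alpha is the output of the forward-pass sketch. As with the α's, a practical implementation would rescale the β's to avoid underflow.

```python
import numpy as np

def backward_pass(A, B):
    """Backward recursion: row t of the returned T x K array holds beta_{t+1}(j) (0-based rows)."""
    T, K = B.shape
    beta = np.zeros((T, K))
    beta[-1] = 1.0                                # beta_T(j) = 1: no observations remain after time T
    for t in range(T - 2, -1, -1):
        # beta_t(i) = sum_j a_ij p(x_{t+1} | z_{t+1} = j) beta_{t+1}(j)
        beta[t] = A @ (B[t + 1] * beta[t + 1])
    return beta

def state_weights(alpha, beta):
    """w_t(j) = p(z_t = j | x_[1,T]) = alpha_t(j) beta_t(j) / sum_k alpha_t(k) beta_t(k)."""
    w = alpha * beta
    return w / w.sum(axis=1, keepdims=True)

def pairwise_gammas(alpha, beta, A, B):
    """gamma_t(i, j) = alpha_t(i) a_ij p(x_{t+1} | z_{t+1} = j) beta_{t+1}(j), for t = 1, ..., T-1."""
    T, K = B.shape
    gamma = np.zeros((T - 1, K, K))
    for t in range(T - 1):
        gamma[t] = alpha[t][:, None] * A * (B[t + 1] * beta[t + 1])[None, :]
    return gamma
```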

We then normalize the γ's to get the conditional probabilities we need, i.e.,

$$ p(z_t = i, z_{t+1} = j \mid x_{[1,T]}) = \frac{\gamma_t(i, j)}{\sum_{k_1} \sum_{k_2} \gamma_t(k_1, k_2)}, $$

from which we can compute E[N_ij] above. The M-step for the transition probabilities is now very simple:

$$ \hat{a}_{ij} = \frac{E[N_{ij}]}{E[N_i]}, \qquad 1 \le i, j \le K. $$

Since Σ_j p(z_t = i, z_{t+1} = j | x_{[1,T]}) = w_t(i), we have Σ_j E[N_ij] = E[N_i], so each row of the estimated transition matrix sums to one, as required.
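A sketch of this M-step update using the γ's from the E-step sketch above; the emission M-step (e.g., the weighted Gaussian mean and covariance updates mentioned earlier) is omitted since it mirrors the finite-mixture case, and the function name is my own.

```python
import numpy as np

def m_step_transitions(gamma):
    """M-step for A: gamma has shape (T-1, K, K); returns the updated K x K transition matrix."""
    # Normalize each time slice to get p(z_t = i, z_{t+1} = j | x_[1,T]).
    xi = gamma / gamma.sum(axis=(1, 2), keepdims=True)
    expected_counts = xi.sum(axis=0)                   # E[N_ij]
    # a_hat_ij = E[N_ij] / E[N_i], so each row sums to one.
    return expected_counts / expected_counts.sum(axis=1, keepdims=True)
```

One full EM iteration then consists of evaluating B[t, j] = p(x_t | z_t = j) under the current emission parameters, running the forward and backward passes, forming the weights w_t(j) and the γ's (E-step), and updating the transition matrix as above together with the weighted emission-parameter updates (M-step), while monitoring the likelihood L(θ) = Σ_j α_T(j) for convergence.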