CS6220 Data Mining Techniques
Hidden Markov Models, Exponential Families, and the Forward-backward Algorithm

Jan-Willem van de Meent, 19 November 2016

1 Hidden Markov Models

A hidden Markov model (HMM) defines a joint probability distribution over a series of observations x_t and hidden states z_t for t = 1, \dots, T. We use x_{1:T} and z_{1:T} to refer to the full sequences of observations and states respectively. In an HMM the prior on the state sequence is assumed to satisfy the Markov property, which is to say that the probability of each state z_t depends only on the previous state z_{t-1}. In the broadest definition of an HMM, the states z_t can be either discrete or continuous valued. Here we will discuss the more commonly considered case where z_t takes on discrete values z_t \in [K], where [K] := \{1, \dots, K\}. We will use a matrix A \in R^{K \times K} and a vector \pi \in R^K to define the transition probabilities and the probability of the first state z_1 respectively,

    p(z_t = l \mid z_{t-1} = k, A) := A_{kl},    (1)
    p(z_1 = k \mid \pi) := \pi_k.    (2)

The rows A_k of the transition matrix and the entries of \pi sum to one,

    \sum_l A_{kl} = 1 for all k \in [K],    \sum_k \pi_k = 1.    (3)

The prior probability of the state sequence z_{1:T} can be expressed as a product of conditional probabilities,

    p(z_{1:T} \mid \pi, A) = p(z_1 \mid \pi) \prod_{t=2}^T p(z_t \mid z_{t-1}, A).    (4)

The observations in an HMM are assumed to be independent conditioned on the sequence of states, which is to say

    p(x_{1:T} \mid z_{1:T}, \eta) = \prod_{t=1}^T p(x_t \mid z_t, \eta).    (5)

Here \eta = \eta_{1:K} is a set of parameters for the density f(x_t; \eta_k) associated with each state k \in [K],

    p(x_t \mid z_t = k, \eta) := f(x_t; \eta_k).    (6)

The observations x_t can be either discrete or continuous valued. We will here assume that f(x_t; \eta_k) is an exponential family distribution, which we will describe in more detail below.
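
To make the generative process concrete, here is a short Python sketch (not part of the original derivation) that samples a state sequence and observations from an HMM with univariate normal emission densities. The parameter values at the bottom are illustrative assumptions, not taken from any particular dataset.

import numpy as np

def sample_hmm(pi, A, mu, sigma, T, rng=None):
    """Sample z_{1:T} and x_{1:T} from an HMM with K discrete states
    and univariate normal emission densities f(x; eta_k) = Normal(mu_k, sigma_k)."""
    if rng is None:
        rng = np.random.default_rng(0)
    K = len(pi)
    z = np.zeros(T, dtype=int)
    x = np.zeros(T)
    z[0] = rng.choice(K, p=pi)                    # z_1 ~ Discrete(pi), eq. (2)
    x[0] = rng.normal(mu[z[0]], sigma[z[0]])      # x_1 ~ f(x; eta_{z_1}), eq. (6)
    for t in range(1, T):
        z[t] = rng.choice(K, p=A[z[t-1]])         # z_t ~ Discrete(A_{z_{t-1}}), eq. (1)
        x[t] = rng.normal(mu[z[t]], sigma[z[t]])
    return z, x

# Illustrative two-state example (parameter values are made up)
pi = np.array([0.5, 0.5])
A = np.array([[0.9, 0.1],
              [0.2, 0.8]])
mu = np.array([-1.0, 2.0])
sigma = np.array([0.5, 1.0])
z, x = sample_hmm(pi, A, mu, sigma, T=100)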

1.1 Expectation Maximization

We will use \theta = \{\eta, A, \pi\} to refer to the parameters of the HMM. We would like to find the set of parameters that maximizes the marginal likelihood of the data,

    \theta^* = argmax_\theta p(x_{1:T} \mid \theta) = argmax_\theta \sum_{z_{1:T} \in [K]^T} p(x_{1:T}, z_{1:T} \mid \theta).    (7)

Expectation maximization (EM) methods define a lower bound on the log likelihood in terms of a variational distribution q(z_{1:T}),

    L(q(z_{1:T}), \theta) := E_{q(z_{1:T})}\Big[ \log \frac{p(x_{1:T}, z_{1:T} \mid \theta)}{q(z_{1:T})} \Big]    (8)
        = \sum_{z_{1:T} \in [K]^T} q(z_{1:T}) \log \frac{p(x_{1:T}, z_{1:T} \mid \theta)}{q(z_{1:T})} \le \log p(x_{1:T} \mid \theta).    (9)

When the variational distribution on the state sequence is equal to the posterior, that is

    q^*(z_{1:T}) := p(z_{1:T} \mid x_{1:T}, \theta),    (10)

the lower bound is tight (i.e. the lower bound is equal to the log likelihood),

    L(q^*(z_{1:T}), \theta) = \sum_{z_{1:T} \in [K]^T} p(z_{1:T} \mid x_{1:T}, \theta) \log \frac{p(x_{1:T}, z_{1:T} \mid \theta)}{p(z_{1:T} \mid x_{1:T}, \theta)}    (11)
        = \sum_{z_{1:T} \in [K]^T} p(z_{1:T} \mid x_{1:T}, \theta) \log \frac{p(z_{1:T} \mid x_{1:T}, \theta)\, p(x_{1:T} \mid \theta)}{p(z_{1:T} \mid x_{1:T}, \theta)}    (12)
        = \log p(x_{1:T} \mid \theta) \sum_{z_{1:T} \in [K]^T} p(z_{1:T} \mid x_{1:T}, \theta) = \log p(x_{1:T} \mid \theta).    (13)

The EM algorithm iterates between two steps:

1. The expectation step: optimize the lower bound with respect to q(z_{1:T}),

    q^i(z_{1:T}) = argmax_q L(q(z_{1:T}), \theta^{i-1}) = p(z_{1:T} \mid x_{1:T}, \theta^{i-1}).    (14)

2. The maximization step: optimize the lower bound with respect to \theta,

    \theta^i = argmax_\theta L(q^i(z_{1:T}), \theta).    (15)

At first glance it is not obvious whether EM for hidden Markov models is computationally tractable. Calculating the lower bound via direct summation over z_{1:T} requires O(K^T) computation. Fortunately, it turns out that we can exploit the Markov property of the state sequence to perform this summation in O(K^2 T) time. To see how we can reduce the complexity of the estimation problem, we will introduce some new notation.
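
For intuition about the O(K^T) cost, a naive implementation of the summation in equation (7) would enumerate every state sequence explicitly. The sketch below (assuming normal emissions, as an illustration) does exactly this; it is only feasible for very small T, since for K = 2 and T = 20 there are already over a million sequences.

import numpy as np
from itertools import product
from scipy.stats import norm

def marginal_likelihood_bruteforce(x, pi, A, mu, sigma):
    """Direct summation over all K^T state sequences, as in eq. (7).
    Only feasible for tiny T; the forward-backward algorithm of
    Section 3 computes the same quantity in O(K^2 T) time."""
    K, T = len(pi), len(x)
    total = 0.0
    for z in product(range(K), repeat=T):             # K^T terms
        p = pi[z[0]] * norm.pdf(x[0], mu[z[0]], sigma[z[0]])
        for t in range(1, T):
            p *= A[z[t-1], z[t]] * norm.pdf(x[t], mu[z[t]], sigma[z[t]])
        total += p
    return total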

We define z_{t,k} := I[z_t = k] to be the one-hot vector representation of the state at time t, which is sometimes also known as an indicator vector. Given this representation we can re-express the probabilities for states and observations as

    p(x_t \mid z_t, \eta) = \prod_k f(x_t; \eta_k)^{z_{t,k}},    (16)
    p(z_t \mid z_{t-1}, A) = \prod_{k,l} A_{kl}^{z_{t-1,k} z_{t,l}},    (17)
    p(z_1 \mid \pi) = \prod_k \pi_k^{z_{1,k}}.    (18)

Using this representation, we can now express the lower bound as

    L(q(z_{1:T}), \theta) = E_{q(z_{1:T})}[\log p(x_{1:T} \mid z_{1:T}, \theta) + \log p(z_{1:T} \mid \theta)] - E_{q(z_{1:T})}[\log q(z_{1:T})]    (19)
        = E_{q(z_{1:T})}\Big[ \sum_{t=1}^T \sum_{k=1}^K z_{t,k} \log f(x_t; \eta_k) + \sum_{k=1}^K z_{1,k} \log \pi_k    (20)
            + \sum_{t=2}^T \sum_{k,l=1}^K z_{t-1,k} z_{t,l} \log A_{kl} \Big]    (21)
        - E_{q(z_{1:T})}[\log q(z_{1:T})].    (22)

In this expression we see a number of terms that are each a product of a factor that depends on z and a factor that does not. If we pull all factors that are independent of z out of the expectation, then we obtain

    L(q(z_{1:T}), \theta) = \sum_{t=1}^T \sum_{k=1}^K E_{q(z_{1:T})}[z_{t,k}] \log f(x_t; \eta_k) + \sum_{k=1}^K E_{q(z_{1:T})}[z_{1,k}] \log \pi_k    (23)
        + \sum_{t=2}^T \sum_{k,l=1}^K E_{q(z_{1:T})}[z_{t-1,k} z_{t,l}] \log A_{kl} - E_{q(z_{1:T})}[\log q(z_{1:T})].    (24)

We can now derive the maximization step of the EM algorithm by differentiating the lower bound and solving for zero. For example, for the parameters \eta_k we need to solve the identity

    \nabla_{\eta_k} L(q(z_{1:T}), \theta) = \sum_{t=1}^T E_{q(z_{1:T})}[z_{t,k}] \frac{\nabla_{\eta_k} f(x_t; \eta_k)}{f(x_t; \eta_k)} = 0.    (25)

The first thing to observe here is that the only dependence on z in this identity is through the terms E_{q(z_{1:T})}[z_{t,k}]. Similarly, the gradients of L with respect to \pi and A will depend on E_{q(z_{1:T})}[z_{1,k}] and E_{q(z_{1:T})}[z_{t-1,k} z_{t,l}] respectively. These terms do not depend directly on the parameters \theta, so we can treat them as constants during the maximization step. From now on we will use the substitutions

    \gamma_{tk} := E_{q(z_{1:T})}[z_{t,k}],    (26)
    \xi_{tkl} := E_{q(z_{1:T})}[z_{t,k} z_{t+1,l}].    (27)
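
To preview how these quantities fit together, here is a sketch of the full EM loop. The function forward_backward is a placeholder for the procedure derived in Section 3, the closed-form M-step updates are the ones derived in Section 2, and the Gaussian-emission parameterization is an illustrative choice.

import numpy as np

def em_hmm(x, pi, A, mu, sigma, num_iters=50):
    """Skeleton of EM for an HMM with univariate normal emissions."""
    log_likes = []
    for _ in range(num_iters):
        # E-step: summarize p(z_{1:T} | x_{1:T}, theta) by gamma and xi
        gamma, xi, log_like = forward_backward(x, pi, A, mu, sigma)
        log_likes.append(log_like)                      # non-decreasing under EM
        # M-step: closed-form updates derived in Section 2
        pi = gamma[0]                                   # eq. (47)
        counts = xi.sum(axis=0)                         # expected transition counts
        A = counts / counts.sum(axis=1, keepdims=True)  # eq. (48)
        Nk = gamma.sum(axis=0)                          # effective counts per state
        mu = gamma.T @ x / Nk                           # eq. (44)
        sigma = np.sqrt(gamma.T @ (x**2) / Nk - mu**2)  # eq. (45)
    return pi, A, mu, sigma, log_likes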

If we can come up with a way of calculating \gamma_{tk} and \xi_{tkl} in a manner that avoids direct summation over all sequences z_{1:T}, then we can perform EM on hidden Markov models. It turns out that such a procedure does in fact exist; we describe it in the section on the forward-backward algorithm below.

2 Maximization of Exponential Family Distributions

Before we turn to the computation of \gamma and \xi, let us look at how to perform the maximization step once \gamma and \xi are known. We will start with maximization with respect to the parameters \eta_k. In general, the objective in equation (25) could be optimized using any number of gradient-based methods (e.g. L-BFGS). However, we can do better: we can obtain a closed-form solution when the observation density f(x_t; \eta_k) belongs to a so-called exponential family. Exponential family distributions can be either discrete or continuous valued. In both cases, the mass or density function must be expressible in the following form

    f(x; \eta) = h(x) \exp[\eta^\top T(x) - A(\eta)].    (28)

In this expression, h(x) is referred to as the base measure, \eta is a vector of parameters, which are often referred to as the natural parameters, T(x) is a vector of sufficient statistics, and A(\eta) is known as the log normalizer. It turns out that many of the distributions we use most commonly can be cast into this form, including the univariate and multivariate normal, the discrete (categorical) and multinomial, the Poisson, the gamma, and the Dirichlet. As an example, the univariate normal distribution Normal(x; \mu, \sigma) can be cast in exponential family form by defining

    h(x) = \frac{1}{\sqrt{2\pi}},    (29)
    \eta = \Big( \frac{\mu}{\sigma^2}, -\frac{1}{2\sigma^2} \Big),    (30)
    T(x) = (x, x^2),    (31)
    A(\eta) = -\frac{\eta_1^2}{4\eta_2} - \frac{1}{2} \log(-2\eta_2) = \frac{\mu^2}{2\sigma^2} + \log \sigma.    (32)

Similarly, for a discrete distribution Discrete(x; \mu) with support \{1, \dots, D\} we can define

    h(x) = 1,    (33)
    \eta_d = \log \mu_d,    (34)
    T_d(x) = I[x = d],    (35)
    A(\eta) = \log \sum_{d=1}^D \exp \eta_d = \log \sum_{d=1}^D \mu_d.    (36)

Exponential family distributions have a number of properties that turn out to be very beneficial when performing expectation maximization. Because of its exponential form, the gradient of an exponential family density takes on a convenient form,

    \nabla_\eta f(x; \eta) = [T(x) - \nabla_\eta A(\eta)]\, f(x; \eta).    (37)
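
As a sanity check on equations (29)-(32), the following sketch maps (\mu, \sigma) to the natural parameters and confirms numerically that the exponential family form reproduces the usual normal log density at an arbitrary test point (the numeric values are arbitrary).

import numpy as np
from scipy.stats import norm

def normal_to_natural(mu, sigma):
    """Natural parameters of Normal(mu, sigma), eq. (30)."""
    return np.array([mu / sigma**2, -1.0 / (2 * sigma**2)])

def log_normalizer(eta):
    """A(eta) = -eta_1^2 / (4 eta_2) - (1/2) log(-2 eta_2), eq. (32)."""
    return -eta[0]**2 / (4 * eta[1]) - 0.5 * np.log(-2 * eta[1])

mu, sigma, x = 1.5, 0.7, 0.3                  # arbitrary test values
eta = normal_to_natural(mu, sigma)
T_x = np.array([x, x**2])                     # sufficient statistics, eq. (31)
log_h = -0.5 * np.log(2 * np.pi)              # log base measure, eq. (29)
log_f = log_h + eta @ T_x - log_normalizer(eta)
assert np.isclose(log_f, norm.logpdf(x, mu, sigma))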

If we substitute this form back into equation (25), then we obtain

    \nabla_{\eta_k} A(\eta_k) = \frac{\sum_{t=1}^T \gamma_{tk} T(x_t)}{\sum_{t=1}^T \gamma_{tk}}.    (38)

The term on the right has a clear interpretation: it is the empirical expectation of the sufficient statistics T(x) over the data points associated with state k. The term on the left has a clear interpretation as well. To see this, we need to derive an identity that holds for all exponential family distributions. We start by observing that, since f(x; \eta) is normalized, the following must hold for all \eta,

    1 = \int dx\, f(x; \eta).    (39)

If we now take the gradient with respect to \eta on both sides of the equation we obtain

    0 = \int dx\, \nabla_\eta f(x; \eta)    (40)
      = \int dx\, f(x; \eta) [T(x) - \nabla_\eta A(\eta)]    (41)
      = E_{f(x;\eta)}[T(x)] - \nabla_\eta A(\eta).    (42)

In other words, the gradient \nabla_\eta A(\eta) is equal to the expected value of the sufficient statistics E_{f(x;\eta)}[T(x)]. If we substitute this relationship back into equation (38), then we obtain the condition

    E_{f(x;\eta_k)}[T(x)] = \frac{\sum_{t=1}^T \gamma_{tk} T(x_t)}{\sum_{t=1}^T \gamma_{tk}}.    (43)

The interpretation of this condition is that we can perform the maximization step in the EM algorithm by finding the value \eta_k for which the expected value of the sufficient statistics T(x) is equal to the empirical expectation over the observations associated with state k. This type of relationship is an example of a so-called moment-matching condition. As an example, if the observation distribution f(x; \eta_k) is a univariate normal, then the sufficient statistics are T(x) = (x, x^2). The maximization-step updates then become

    E_{f(x;\eta_k)}[x] = \mu_k = \sum_t \gamma_{tk} x_t \big/ \sum_t \gamma_{tk},    (44)
    E_{f(x;\eta_k)}[x^2] = \sigma_k^2 + \mu_k^2 = \sum_t \gamma_{tk} x_t^2 \big/ \sum_t \gamma_{tk}.    (45)

Suppose the observation distribution is discrete, with D possible values. If we use x_d = I[x = d] to refer to the entries of the one-hot representation, then T_d(x) = I[x = d] = x_d and the maximization-step update becomes

    E_{f(x;\eta_k)}[x_d] = \mu_{k,d} = \sum_t \gamma_{tk} x_{t,d} \big/ \sum_t \gamma_{tk}.    (46)
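
In code, these moment-matching updates amount to a few weighted averages. A sketch, assuming \gamma is stored as a T x K array produced by the E-step:

import numpy as np

def m_step_gaussian(x, gamma):
    """Moment-matching updates for univariate normal emissions, eqs. (44)-(45)."""
    Nk = gamma.sum(axis=0)                 # effective number of points per state
    mu = gamma.T @ x / Nk                  # weighted mean matches E[x]
    var = gamma.T @ (x**2) / Nk - mu**2    # weighted E[x^2] minus mu^2 gives sigma^2
    return mu, np.sqrt(var)

def m_step_discrete(x_onehot, gamma):
    """Moment-matching update for discrete emissions, eq. (46);
    x_onehot is a (T, D) one-hot encoding of the observations."""
    counts = gamma.T @ x_onehot            # (K, D) weighted counts
    return counts / counts.sum(axis=1, keepdims=True)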

Of course, now that we know how to do the maximization step for any exponential family, we can also make use of these relationships when deriving the updates for \pi and A. For these variables we obtain the particularly simple relations

    \pi_k = \gamma_{1k},    (47)
    A_{kl} = \sum_{t=1}^{T-1} \xi_{tkl} \Big/ \sum_{l'=1}^{K} \sum_{t=1}^{T-1} \xi_{tkl'}.    (48)

3 The Forward-backward Algorithm

During the expectation step, we update q(z_{1:T}) to the posterior

    q(z_{1:T}) = p(z_{1:T} \mid x_{1:T}, \theta).    (49)

For small values of T we could in principle calculate this posterior via direct summation,

    q(z_{1:T}) = \frac{p(x_{1:T}, z_{1:T} \mid \theta)}{\sum_{z'_{1:T} \in [K]^T} p(x_{1:T}, z'_{1:T} \mid \theta)}.    (50)

However, this quickly becomes infeasible, since there are K^T distinct sequences for an HMM with K states. Luckily, it turns out that the maximization step can be performed by calculating two sets of expected values,

    \gamma_{tk} := E_{p(z_{1:T} \mid x_{1:T}, \theta)}[z_{t,k}],    (51)
    \xi_{tkl} := E_{p(z_{1:T} \mid x_{1:T}, \theta)}[z_{t,k} z_{t+1,l}].    (52)

As we will see below, these two quantities can be calculated in O(K^2 T) time using a dynamic programming method known as the forward-backward algorithm.

3.1 Forward and Backward Recursion

We will begin by expressing \gamma_{tk} as a product of two terms. The first is the joint probability \alpha_{t,k} := p(x_{1:t}, z_t = k \mid \theta) of all observations up to time t and the current state z_t = k. The second is the probability \beta_{t,k} := p(x_{t+1:T} \mid z_t = k, \theta) of all future observations conditioned on z_t = k. We can express \gamma in terms of \alpha and \beta as

    \gamma_{t,k} = \frac{p(x_{1:T}, z_t = k \mid \theta)}{p(x_{1:T} \mid \theta)} = \frac{p(x_{t+1:T} \mid z_t = k, \theta)\, p(x_{1:t}, z_t = k \mid \theta)}{p(x_{1:T} \mid \theta)} = \frac{\alpha_{t,k} \beta_{t,k}}{p(x_{1:T} \mid \theta)}.    (53)

We can similarly express \xi_{t,kl} in terms of \alpha and \beta as

    \xi_{t,kl} = p(z_t = k, z_{t+1} = l \mid x_{1:T}, \theta) = \frac{p(x_{1:T}, z_{t+1} = l, z_t = k \mid \theta)}{p(x_{1:T} \mid \theta)}    (54)
        = \frac{p(x_{t+2:T} \mid z_{t+1} = l, \theta)\, p(x_{t+1} \mid z_{t+1} = l, \eta)\, p(z_{t+1} = l \mid z_t = k, A)\, p(x_{1:t}, z_t = k \mid \theta)}{p(x_{1:T} \mid \theta)}    (55)
        = \frac{\beta_{t+1,l}\, f(x_{t+1}; \eta_l)\, A_{kl}\, \alpha_{tk}}{p(x_{1:T} \mid \theta)}.    (56)
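
Given \alpha and \beta, equations (53) and (56) translate directly into array operations. In the sketch below, \alpha and \beta are the unnormalized quantities (the recursions for computing them are derived next, and Section 3.2 gives a numerically stable variant), and f[t, k] stores the emission density f(x_t; \eta_k).

import numpy as np

def posteriors_from_alpha_beta(alpha, beta, f, A):
    """Compute gamma (eq. 53) and xi (eq. 56) from unnormalized alpha and
    beta, both stored as (T, K) arrays, with f[t, k] = f(x_t; eta_k)."""
    like = (alpha[-1] * beta[-1]).sum()    # p(x_{1:T} | theta), eq. (66) at t = T
    gamma = alpha * beta / like            # (T, K)
    # xi[t, k, l] = beta[t+1, l] f[t+1, l] A[k, l] alpha[t, k] / like
    xi = (alpha[:-1, :, None] * A[None, :, :]
          * (f[1:] * beta[1:])[:, None, :]) / like
    return gamma, xi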

We will now derive recursion relations for both \alpha and \beta. For the first time point, we can calculate \alpha_{1,k} directly,

    \alpha_{1,k} = p(x_1, z_1 = k) = f(x_1; \eta_k)\, \pi_k.    (57)

For all t > 1 we can define a recursion relation that expresses \alpha_{t,l} in terms of \alpha_{t-1,k},

    \alpha_{t,l} := p(x_{1:t}, z_t = l \mid \theta)    (58)
        = \sum_{k=1}^K p(x_t \mid z_t = l, \eta)\, p(z_t = l \mid z_{t-1} = k, A)\, p(x_{1:t-1}, z_{t-1} = k \mid \theta)    (59)
        = \sum_{k=1}^K f(x_t; \eta_l)\, A_{kl}\, \alpha_{t-1,k}.    (60)

For \beta_{t,k} we begin by defining the value at the final time point t = T. At this point there are no future observations, so we can set the probability of future observations to 1,

    \beta_{T,k} = 1.    (61)

For all preceding t < T we can express \beta_{t,k} in terms of \beta_{t+1,l},

    \beta_{t,k} := p(x_{t+1:T} \mid z_t = k, \theta)    (62)
        = \sum_{l=1}^K p(x_{t+2:T} \mid z_{t+1} = l, \theta)\, p(x_{t+1} \mid z_{t+1} = l, \eta)\, p(z_{t+1} = l \mid z_t = k, A)    (63)
        = \sum_{l=1}^K \beta_{t+1,l}\, f(x_{t+1}; \eta_l)\, A_{kl}.    (64)

In short, we can calculate \gamma by combining a forward recursion for \alpha with a backward recursion for \beta. Note that each step in both recursions requires O(K^2) computation, since the calculation of each of the K entries requires a sum over K terms. This means the total computation required by the forward-backward algorithm is O(K^2 T).

3.2 Normalized Computation

In practice we use a slightly different recursion relation when implementing the forward-backward algorithm. If we were to calculate \alpha and \beta through the naive recursion relations defined above, then we could calculate \gamma by simple normalization,

    \gamma_{t,k} = \frac{\alpha_{t,k} \beta_{t,k}}{p(x_{1:T} \mid \theta)} = \frac{\alpha_{t,k} \beta_{t,k}}{\sum_{k'} \alpha_{t,k'} \beta_{t,k'}}.    (65)

At each step, the normalization term should then in principle be equal to the marginal likelihood of the data,

    p(x_{1:T} \mid \theta) = \sum_k \alpha_{t,k} \beta_{t,k}.    (66)

In practice p(x_{1:T} \mid \theta) will be either a very small (or sometimes a very large) number for large values of T, which means naive computation of \alpha and \beta can lead to numerical underflow (or overflow). A simple solution to this problem is to normalize during each step of the forward recursion: compute the unnormalized update \tilde{\alpha}_{t,l}, its normalizing constant c_t, and the normalized value \hat{\alpha}_{t,l},

    \tilde{\alpha}_{t,l} = \sum_{k=1}^K f(x_t; \eta_l)\, A_{kl}\, \hat{\alpha}_{t-1,k},    (67)
    c_t = \sum_k \tilde{\alpha}_{t,k},    (68)
    \hat{\alpha}_{t,l} = \tilde{\alpha}_{t,l} / c_t.    (69)

The normalized values \hat{\alpha}_{t,k} now represent the partial posterior p(z_t = k \mid x_{1:t}, \theta) instead of the partial joint p(x_{1:t}, z_t = k \mid \theta), which means that they differ from the original values by a factor

    \frac{\alpha_{tk}}{\hat{\alpha}_{tk}} = \prod_{i=1}^t c_i = \frac{p(x_{1:t}, z_t \mid \theta)}{p(z_t \mid x_{1:t}, \theta)} = p(x_{1:t} \mid \theta).    (70)

Since this relationship must hold for any t, it in particular implies that we can recover the marginal likelihood from the normalizing constants,

    p(x_{1:T} \mid \theta) = \prod_{t=1}^T c_t.    (71)

We can also normalize the values \beta_{t,k} on the backward pass. A particularly good choice is

    \hat{\beta}_{t,k} = \frac{1}{c_{t+1}} \sum_{l=1}^K \hat{\beta}_{t+1,l}\, f(x_{t+1}; \eta_l)\, A_{kl}.    (72)

This choice of normalization implies that

    \frac{\beta_{tk}}{\hat{\beta}_{tk}} = \frac{p(x_{1:T} \mid \theta)}{p(x_{1:t} \mid \theta)} = \prod_{i=t+1}^T c_i,    \frac{\alpha_{tk} \beta_{tk}}{\hat{\alpha}_{tk} \hat{\beta}_{tk}} = \prod_{i=1}^T c_i = p(x_{1:T} \mid \theta),    (73)

which in turn ensures that we no longer have to perform any normalization upon completion of the forward and backward recursions,

    \gamma_{t,k} = \frac{\alpha_{tk} \beta_{tk}}{p(x_{1:T} \mid \theta)} = \hat{\alpha}_{tk} \hat{\beta}_{tk}.    (74)

Similarly, the expression for \xi_{t,kl} in terms of \hat{\alpha} and \hat{\beta} becomes

    \xi_{t,kl} = \frac{1}{c_{t+1}}\, \hat{\beta}_{t+1,l}\, f(x_{t+1}; \eta_l)\, A_{kl}\, \hat{\alpha}_{tk}.    (75)
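
Putting everything together, here is a sketch of the normalized forward-backward algorithm, following equations (67)-(75) and assuming univariate normal emissions. It returns \gamma, \xi, and the log marginal likelihood, accumulated as \sum_t \log c_t to avoid the overflow and underflow issues discussed above.

import numpy as np
from scipy.stats import norm

def forward_backward(x, pi, A, mu, sigma):
    """Normalized forward-backward pass, eqs. (67)-(75), for univariate
    normal emissions. Returns gamma (T, K), xi (T-1, K, K), and
    log p(x_{1:T} | theta) computed as sum_t log c_t, eq. (71)."""
    T, K = len(x), len(pi)
    f = norm.pdf(x[:, None], mu[None, :], sigma[None, :])   # f[t, k] = f(x_t; eta_k)

    # Forward pass: alpha[t, k] holds the partial posterior p(z_t = k | x_{1:t})
    alpha = np.zeros((T, K))
    c = np.zeros(T)
    a = f[0] * pi                                # eq. (57), before normalization
    c[0] = a.sum()
    alpha[0] = a / c[0]
    for t in range(1, T):
        a = f[t] * (alpha[t-1] @ A)              # eq. (67)
        c[t] = a.sum()                           # eq. (68)
        alpha[t] = a / c[t]                      # eq. (69)

    # Backward pass: normalized beta, eq. (72), with beta[T] = 1, eq. (61)
    beta = np.zeros((T, K))
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = (A @ (f[t+1] * beta[t+1])) / c[t+1]

    gamma = alpha * beta                         # eq. (74), no renormalization needed
    # xi[t, k, l], eq. (75)
    xi = (alpha[:-1, :, None] * A[None, :, :]
          * (f[1:] * beta[1:])[:, None, :]) / c[1:, None, None]
    return gamma, xi, np.log(c).sum()

With this function in place, the EM sketch from Section 1.1 runs end to end, and the returned log likelihood can be used to monitor convergence.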

In practice p(x 1:T θ) will be either a small (or sometimes a very large) number for large values of T, which means naive computation of α and β can lead to numerical underflow (or overflow). A simple solution to this problem is to normalize α tk during each step, by defining α t,l = f(x t ; η l )A kl α k,t 1. (67) k=1 c t = k α tk, (68) α tl = α tl /c t. (69) The normalized values of α t,k now represent the partial posterior p(z t = k x 1:t, θ) instead of the partial joint p(x 1:t, z t = k θ), which means that the normalized values α t,k differ from the original values by a factor α tk α tk = t i=1 c i = p(x 1:t, z t θ) p(z t x 1:t, θ) = p(x 1:t θ). (70) Since this relationship must hold for any t, this in particular implies that we can recover the marginal likelihood from the normalizing constants p(x 1:T θ) = T c t. (71) We can also normalize the values β t,k on the backward pass. A particularly good choice is β tk = 1 c t+1 β t+1,lf(x t+1 ; η l )A kl. (72) l=1 This choice of normalization implies that β tk β tk = p(x 1:T θ) p(x 1:t θ) = T i=t+1 c i, α tk β tk α tk β tk = T c i = p(x 1:T θ), (73) i=1 which in turn ensures that we will no longer have to perform any normalization upon completion of the forward and backward recursion γ t,k = α tkβ tk p(x 1:T θ) = α tkβ tk. (74) Similarly, the expression for ξ t,kl in terms of α and β becomes ξ t,kl = 1 c t+1 β t+1,lf(x t+1 ; η l )A kl α tk. (75) 8