Dynamic Approaches: The Hidden Markov Model

Davide Bacciu, Dipartimento di Informatica, Università di Pisa (bacciu@di.unipi.it)
Machine Learning: Neural Networks and Advanced Models (AA2)

Last Lecture Refresher: Inference as Message Passing
How to infer the distribution P(X_unk | X_obs) of a number of unobserved random variables X_unk in the graphical model, given the observed values of the other variables X_obs
Directed and undirected models of fixed structure
Exact inference: passing messages (vectors of information) over the structure of the graphical model, following a propagation direction; works for chains and trees, and can be used in (some) graphs
Approximate inference: can use approximations of the distribution (variational methods) or can estimate its expectations using examples (sampling)

Today's Lecture
Exact inference on a chain with observed and unobserved variables
A probabilistic model for sequences: Hidden Markov Models (HMMs)
Using inference to learn: the Expectation-Maximization algorithm for HMMs
Graphical models with varying structure: Dynamic Bayesian Networks
Application examples

Sequences
A sequence y is a collection of observations y_t, where t represents the position of the element according to a (complete) order (e.g. time)
The reference population is a set of i.i.d. sequences y^1, ..., y^N
Different sequences y^1, ..., y^N generally have different lengths T_1, ..., T_N

Sequences in Speech Processing

Sequences in Biology

Markov Chain
First-Order Markov Chain: a directed graphical model for sequences such that element X_t only depends on its predecessor in the sequence
The joint probability factorizes as
P(X) = P(X_1, ..., X_T) = P(X_1) ∏_{t=2}^{T} P(X_t | X_{t-1})
P(X_t | X_{t-1}) is the transition distribution; P(X_1) is the prior distribution
General form: an L-th order Markov chain is such that X_t depends on its L predecessors
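As a small illustration, here is a minimal Python sketch of this factorization. The names prior and A are assumptions of the sketch, with A[i, j] = P(X_t = i | X_{t-1} = j), i.e. columns sum to one, matching the column convention used later for the HMM transition matrix.

```python
# Minimal sketch (not from the lecture): log-probability of an integer
# sequence under the first-order Markov chain factorization.
# Assumed conventions: prior[i] = P(X_1 = i), A[i, j] = P(X_t = i | X_{t-1} = j).
import numpy as np

def markov_log_prob(x, prior, A):
    logp = np.log(prior[x[0]])                 # log P(X_1)
    for t in range(1, len(x)):
        logp += np.log(A[x[t], x[t - 1]])      # + log P(X_t | X_{t-1})
    return logp

prior = np.array([0.6, 0.4])
A = np.array([[0.7, 0.2],
              [0.3, 0.8]])                     # columns sum to one
print(markov_log_prob([0, 0, 1, 1], prior, A))
```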

Observed Markov Chains
Can we use a Markov chain to model the relationship between observed elements in a sequence? Of course yes, but... does it make sense to represent P(is | cat)?

Hidden Markov Model (HMM) (I)
A stochastic process where the transition dynamics is disentangled from the observations generated by the process
The state transition is an unobserved (hidden/latent) process characterized by the hidden state variables S_t
The S_t are often discrete with values in {1, ..., C}
Multinomial state transition and prior probabilities (stationarity assumption): A_ij = P(S_t = i | S_{t-1} = j) and π_i = P(S_1 = i)

Hidden Markov Model (HMM) (II)
A stochastic process where the transition dynamics is disentangled from the observations generated by the process
Observations are generated by the emission distribution b_i(y_t) = P(Y_t = y_t | S_t = i)

HMM Joint Probability Factorization
Discrete-state HMMs are parameterized by θ = (π, A, B) and the finite number of hidden states C
State transition and prior distributions A and π; emission distribution B (or its parameters)
P(Y = y) = ∑_s P(Y = y, S = s)
         = ∑_{s_1,...,s_T} P(S_1 = s_1) P(Y_1 = y_1 | S_1 = s_1) ∏_{t=2}^{T} P(S_t = s_t | S_{t-1} = s_{t-1}) P(Y_t = y_t | S_t = s_t)
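To make the factorization concrete, here is a minimal sketch (with assumed toy parameters, not from the lecture) of the joint probability of one complete assignment (y, s) and, for a very short sequence, the brute-force sum over all hidden paths that defines P(Y = y).

```python
# Minimal sketch: joint probability of (y, s) and brute-force marginal P(Y = y).
# Assumed conventions: pi[i] = P(S_1 = i), A[i, j] = P(S_t = i | S_{t-1} = j),
# B[i, k] = P(Y_t = k | S_t = i).
import itertools
import numpy as np

pi = np.array([0.5, 0.5])
A = np.array([[0.9, 0.2],
              [0.1, 0.8]])                 # columns sum to one
B = np.array([[0.8, 0.2],
              [0.3, 0.7]])

def joint_prob(y, s):
    p = pi[s[0]] * B[s[0], y[0]]
    for t in range(1, len(y)):
        p *= A[s[t], s[t - 1]] * B[s[t], y[t]]
    return p

y = [0, 1, 1]
# Sum over the C^T hidden paths (feasible only for very short sequences)
print(sum(joint_prob(y, s) for s in itertools.product(range(2), repeat=len(y))))
```

The exponential cost of this explicit sum is exactly what the forward-backward recursion below avoids.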

HMMs as a Recursive Model
A graphical framework describing how contextual information is recursively encoded by both probabilistic and neural models
It indicates that the hidden state S_t at time t depends on context information from the previous time step, from two time steps earlier, and so on
When the recursive model is applied to a sequence (unfolding), it generates the corresponding directed graphical model

3 Notable Inference Problems
Definition (Smoothing): given a model θ and an observed sequence y, determine the distribution P(S_t | Y = y, θ) of the t-th hidden state
Definition (Learning): given a dataset of N observed sequences D = {y^1, ..., y^N} and the number of hidden states C, find the parameters π, A and B that maximize the probability of the model θ = (π, A, B) having generated the sequences in D
Definition (Optimal State Assignment): given a model θ and an observed sequence y, find an optimal state assignment s* = s*_1, ..., s*_T for the hidden Markov chain

Forward-Backward Algorithm
Smoothing: how do we determine the posterior P(S_t = i | y)? Exploit the factorization
P(S_t = i | y) ∝ P(S_t = i, y) = P(S_t = i, Y_{1:t}, Y_{t+1:T}) = P(S_t = i, Y_{1:t}) P(Y_{t+1:T} | S_t = i) = α_t(i) β_t(i)
The α-term is computed as part of the forward recursion (with α_1(i) = b_i(y_1) π_i):
α_t(i) = P(S_t = i, Y_{1:t}) = b_i(y_t) ∑_{j=1}^{C} A_ij α_{t-1}(j)
The β-term is computed as part of the backward recursion (with β_T(i) = 1, ∀i):
β_t(j) = P(Y_{t+1:T} | S_t = j) = ∑_{i=1}^{C} b_i(y_{t+1}) β_{t+1}(i) A_ij
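A minimal sketch of these two recursions is given below, under the same assumed conventions as the earlier sketches (A[i, j] = P(S_t = i | S_{t-1} = j), B[i, k] = P(Y_t = k | S_t = i)). No rescaling is applied, so it is only suitable for short sequences.

```python
# Minimal sketch of the forward-backward recursions (no rescaling).
import numpy as np

def forward_backward(y, pi, A, B):
    T, C = len(y), len(pi)
    alpha = np.zeros((T, C))
    beta = np.ones((T, C))                              # beta_T(i) = 1 for all i
    alpha[0] = B[:, y[0]] * pi                          # alpha_1(i) = b_i(y_1) pi_i
    for t in range(1, T):
        alpha[t] = B[:, y[t]] * (A @ alpha[t - 1])      # b_i(y_t) sum_j A_ij alpha_{t-1}(j)
    for t in range(T - 2, -1, -1):
        beta[t] = A.T @ (B[:, y[t + 1]] * beta[t + 1])  # sum_i b_i(y_{t+1}) beta_{t+1}(i) A_ij
    return alpha, beta

pi = np.array([0.5, 0.5])
A = np.array([[0.9, 0.2], [0.1, 0.8]])
B = np.array([[0.8, 0.2], [0.3, 0.7]])
alpha, beta = forward_backward([0, 1, 1], pi, A, B)
posterior = alpha * beta / (alpha * beta).sum(axis=1, keepdims=True)  # P(S_t = i | y)
print(posterior)
```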

Deja vu
Doesn't the Forward-Backward algorithm look strangely familiar?
α_t is the forward message: μ_α(X_n) = ∑_{X_{n-1}} ψ(X_{n-1}, X_n) μ_α(X_{n-1}), which matches α_t(i) = b_i(y_t) ∑_{j=1}^{C} A_ij α_{t-1}(j)
β_t is the backward message: μ_β(X_n) = ∑_{X_{n+1}} ψ(X_n, X_{n+1}) μ_β(X_{n+1}), which matches β_t(j) = ∑_{i=1}^{C} b_i(y_{t+1}) A_ij β_{t+1}(i)

Learning in HMM
Learn the HMM parameters θ = (π, A, B) by maximum likelihood
L(θ) = log ∏_{n=1}^{N} P(Y^n | θ) = log ∏_{n=1}^{N} ∑_{s^n_1,...,s^n_{T_n}} P(S^n_1) P(Y^n_1 | S^n_1) ∏_{t=2}^{T_n} P(S^n_t | S^n_{t-1}) P(Y^n_t | S^n_t)
How can we deal with the unobserved random variables S_t and the nasty summation inside the log? Expectation-Maximization: maximize the complete likelihood L_c(θ), completed with the indicator variables
z^n_{ti} = 1 if the n-th chain is in state i at time t, and 0 otherwise

Complete HMM Likelihood
Introduce the indicator variables in L(θ) together with the model parameters θ = (π, A, B):
L_c(θ) = log P(X, Z | θ)
       = log ∏_{n=1}^{N} { ∏_{i=1}^{C} [P(S^n_1 = i) P(Y^n_1 | S^n_1 = i)]^{z^n_{1i}} ∏_{t=2}^{T_n} ∏_{i,j=1}^{C} P(S^n_t = i | S^n_{t-1} = j)^{z^n_{ti} z^n_{(t-1)j}} P(Y^n_t | S^n_t = i)^{z^n_{ti}} }
       = ∑_{n=1}^{N} [ ∑_{i=1}^{C} z^n_{1i} log π_i + ∑_{t=2}^{T_n} ∑_{i,j=1}^{C} z^n_{ti} z^n_{(t-1)j} log A_ij + ∑_{t=1}^{T_n} ∑_{i=1}^{C} z^n_{ti} log b_i(y^n_t) ]

Expectation-Maximization
A 2-step iterative algorithm for the maximization of the complete likelihood L_c(θ) w.r.t. the model parameters θ
E-Step: given the current estimate of the model parameters θ^(t), compute
Q^(t+1)(θ | θ^(t)) = E_{Z|X,θ^(t)}[log P(X, Z | θ)]
M-Step: find the new estimate of the model parameters
θ^(t+1) = argmax_θ Q^(t+1)(θ | θ^(t))
Iterate the two steps until |L_c(θ)^(it) − L_c(θ)^(it−1)| < ε (or stop if the maximum number of iterations is reached)

E-Step (I)
Compute the expectation of the complete log-likelihood w.r.t. the indicator variables z^n_{ti}, assuming the (estimated) parameters θ^(t) = (π^(t), A^(t), B^(t)) fixed (i.e. treated as constants):
Q^(t+1)(θ | θ^(t)) = E_{Z|X,θ^(t)}[log P(X, Z | θ)]
The expectation w.r.t. a (discrete) random variable Z is E_Z[Z] = ∑_z z P(Z = z)
To compute the conditional expectation Q^(t+1)(θ | θ^(t)) for the complete HMM log-likelihood we need to estimate
E_{Z|Y,θ^(t)}[z_{ti}] = P(S_t = i | y)
E_{Z|Y,θ^(t)}[z_{ti} z_{(t-1)j}] = P(S_t = i, S_{t-1} = j | Y)

E-Step (II)
We know how to compute these posteriors with the forward-backward algorithm!
γ_t(i) = P(S_t = i | Y) = α_t(i) β_t(i) / ∑_{j=1}^{C} α_t(j) β_t(j)
γ_{t,t-1}(i, j) = P(S_t = i, S_{t-1} = j | Y) = α_{t-1}(j) A_ij b_i(y_t) β_t(i) / ∑_{m,l=1}^{C} α_{t-1}(m) A_lm b_l(y_t) β_t(l)
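The following sketch (a hypothetical helper, not the lecture's code) computes these two posteriors from the alpha/beta terms returned by the forward-backward sketch above, under the same assumed conventions.

```python
# Minimal sketch of the E-step posteriors gamma_t(i) and gamma_{t,t-1}(i, j).
import numpy as np

def e_step(y, alpha, beta, A, B):
    T, C = alpha.shape
    gamma = alpha * beta
    gamma /= gamma.sum(axis=1, keepdims=True)                    # gamma_t(i) = P(S_t = i | y)
    xi = np.zeros((T, C, C))                                     # xi[t, i, j] = gamma_{t,t-1}(i, j)
    for t in range(1, T):
        num = np.outer(B[:, y[t]] * beta[t], alpha[t - 1]) * A   # alpha_{t-1}(j) A_ij b_i(y_t) beta_t(i)
        xi[t] = num / num.sum()
    return gamma, xi
```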

M-Step (I)
Solve the optimization problem θ^(t+1) = argmax_θ Q^(t+1)(θ | θ^(t)) using the information computed at the E-Step (the posteriors). How? As usual, set
∂Q^(t+1)(θ | θ^(t)) / ∂θ = 0
where θ = (π, A, B) are now variables.
Attention: the parameters are distributions, so we need to preserve the sum-to-one constraints (Lagrange multipliers)

M-Step (II)
State distributions:
A_ij = ∑_{n=1}^{N} ∑_{t=2}^{T_n} γ^n_{t,t-1}(i, j) / ∑_{n=1}^{N} ∑_{t=2}^{T_n} γ^n_{t-1}(j)   and   π_i = ∑_{n=1}^{N} γ^n_1(i) / N
Emission distribution (multinomial):
B_ki = ∑_{n=1}^{N} ∑_{t=1}^{T_n} γ^n_t(i) δ(y^n_t = k) / ∑_{n=1}^{N} ∑_{t=1}^{T_n} γ^n_t(i)
where δ(·) is the indicator function for the emission symbols k
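A minimal sketch of these re-estimation formulas for a single sequence (N = 1) of discrete symbols follows; gamma and xi are the E-step posteriors from the previous sketch, and all names are assumptions of the sketch rather than the lecture's code.

```python
# Minimal sketch of the M-step updates for one observed sequence (N = 1).
import numpy as np

def m_step(y, gamma, xi, num_symbols):
    T, C = gamma.shape
    pi_new = gamma[0]                                    # pi_i = gamma_1(i)
    A_new = xi[1:].sum(axis=0)                           # sum_{t>=2} gamma_{t,t-1}(i, j)
    A_new /= gamma[:-1].sum(axis=0)[np.newaxis, :]       # divided by sum_{t>=2} gamma_{t-1}(j)
    y = np.asarray(y)
    B_new = np.zeros((C, num_symbols))
    for k in range(num_symbols):
        B_new[:, k] = gamma[y == k].sum(axis=0)          # sum_t gamma_t(i) delta(y_t = k)
    B_new /= gamma.sum(axis=0)[:, np.newaxis]            # divided by sum_t gamma_t(i)
    return pi_new, A_new, B_new
```

Iterating e_step and m_step (with forward_backward in between) until the likelihood stops improving gives the usual Baum-Welch style EM loop for a single sequence.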

Decoding Problem
Find the optimal hidden state assignment s* = s*_1, ..., s*_T for an observed sequence y, given a trained HMM
There is no unique interpretation of the problem:
Identify the single hidden states that maximize the posterior: s*_t = argmax_{i=1,...,C} P(S_t = i | Y)
Find the most likely joint hidden state assignment: s* = argmax_s P(Y, S = s)
The latter problem is addressed by the Viterbi algorithm

Viterbi Algorithm
An efficient dynamic programming algorithm based on a backward-forward recursion; an example of a max-product message passing algorithm
Recursive backward term:
ε_{t-1}(s_{t-1}) = max_{s_t} P(Y_t | S_t = s_t) P(S_t = s_t | S_{t-1} = s_{t-1}) ε_t(s_t)
Root optimal state:
s*_1 = argmax_s P(Y_1 | S_1 = s) P(S_1 = s) ε_1(s)
Recursive forward optimal state:
s*_t = argmax_s P(Y_t | S_t = s) P(S_t = s | S_{t-1} = s*_{t-1}) ε_t(s)
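Below is a minimal sketch of Viterbi decoding in log space. It uses the more common forward max-plus-backtracking formulation rather than the slide's backward-forward recursion (both return the same optimal path), with the same assumed parameter conventions as the earlier sketches.

```python
# Minimal sketch of Viterbi decoding (forward max + backtracking, log space).
# Assumed conventions: A[i, j] = P(S_t = i | S_{t-1} = j), B[i, k] = P(Y_t = k | S_t = i).
import numpy as np

def viterbi(y, pi, A, B):
    T, C = len(y), len(pi)
    delta = np.zeros((T, C))                   # best log joint score of a path ending in state i at time t
    back = np.zeros((T, C), dtype=int)         # argmax predecessor state
    delta[0] = np.log(pi) + np.log(B[:, y[0]])
    for t in range(1, T):
        scores = delta[t - 1][np.newaxis, :] + np.log(A)   # scores[i, j]: come from j, land in i
        back[t] = scores.argmax(axis=1)
        delta[t] = scores.max(axis=1) + np.log(B[:, y[t]])
    s = np.zeros(T, dtype=int)                 # backtrack the most likely joint assignment s*
    s[-1] = delta[-1].argmax()
    for t in range(T - 2, -1, -1):
        s[t] = back[t + 1, s[t + 1]]
    return s

pi = np.array([0.5, 0.5])
A = np.array([[0.9, 0.2], [0.1, 0.8]])
B = np.array([[0.8, 0.2], [0.3, 0.7]])
print(viterbi([0, 0, 1, 1, 1], pi, A, B))
```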

Input-Output HMM
Translates an input sequence into an output sequence (transduction)
State transitions and emissions depend on the input observations (input-driven)
The recursive model highlights the analogy with recurrent neural networks

Bidirectional Input-driven Models
Remove the causality assumption that the current observation does not depend on the future, and the homogeneity assumption that a state transition does not depend on the position in the sequence
The structure and function of a region of DNA and protein sequences may depend on upstream and downstream information
The hidden state transition distribution changes with the amino-acid sequence being fed as input

Coupled HMM
Describes interacting processes whose observations follow different dynamics while the underlying generative processes are interlaced

Dynamic Bayesian Networks
HMMs are a specific case of a class of directed models that represent dynamic processes and data with a changing connectivity template:
Hierarchical HMM
Dynamic Bayesian Networks (DBN)
Structure-changing information: graphical models whose structure changes to represent evolution across time and/or between different samples

Take Home Messages
Hidden states are used to realize an unobserved generative process for sequential data:
A mixture model where the selection of the next component is regulated by the transition distribution
Hidden states summarize (cluster) information on subsequences in the data
Inference in HMMs:
Forward-backward: hidden state posterior estimation
Expectation-Maximization: HMM parameter learning
Viterbi: most likely hidden state sequence
Dynamic Bayesian Networks:
A graphical model whose structure changes to reflect information with variable size and connectivity patterns
Suitable for modeling structured data (sequences, trees, ...)