Learning from Sequential and Time-Series Data


Learning from Sequential and Time-Series Data
Sridhar Mahadevan (mahadeva@cs.umass.edu), University of Massachusetts
CMPSCI 689

Sequential and Time-Series Data

Many real-world applications involve sequential or time-series data:
- Gene and protein sequences in bioinformatics
- Natural language text
- Modeling human and robot behavior
- GPS tracking of cars, missiles, satellites, etc.
- Sensor networks, weather prediction

Example models: Markov chains, Markov decision processes, hidden Markov models, partially observable MDPs, conditional random fields, observable operator models, predictive state representations, etc.

Sequence Models

A sequence model $M = \langle O, S, A, P_o, P_t, R \rangle$ consists of:
- $O$, a set of continuous or discrete observations
- $S$, a set of states
- $A$, a set of actions or decisions
- $P_o$, an observation model, where $P_o(y \mid s)$ is the probability of observing $y$ in state $s$
- $P_t$, a transition model, where $P^a_{ss'}$ is the probability of moving from state $s$ to $s'$ under action $a$
- $R$, a reward or cost function

Markov Chains

The most widely studied sequential model, developed by Markov to study text (Pushkin). A Markov chain $M = \langle S, P_t, \pi_0 \rangle$ is specified by:
- A set of discrete or continuous states $S$
- A transition probability $P_t(s' \mid s)$ of moving from $s$ to $s'$
- An initial distribution $\pi_0(s)$ of starting in state $s$

Maximum likelihood estimation of a Markov chain is theoretically trivial, but can be practically challenging. Example: a bigram model of English text (where states are words).
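The bigram example can be sketched in a few lines: the maximum likelihood estimate of a Markov chain just counts observed transitions and normalizes per source state. This is a minimal illustration (the toy sentence and function name are hypothetical, not from the slides):

```python
from collections import Counter

def bigram_mle(words):
    """ML estimate of a word-level Markov chain: P_t(s'|s) = count(s, s') / count(s)."""
    pair_counts = Counter(zip(words, words[1:]))   # transition counts
    state_counts = Counter(words[:-1])             # occurrences as a source state
    return {(s, s2): c / state_counts[s] for (s, s2), c in pair_counts.items()}

text = "the cat sat on the mat the cat ran".split()
P = bigram_mle(text)
print(P[("the", "cat")])   # 2/3: "the" is followed by "cat" in 2 of its 3 occurrences
```

The "practically challenging" part arises at scale: with a realistic vocabulary most bigrams are never observed, so the raw ML estimate assigns them probability zero and some form of smoothing is needed.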

Hidden Markov Models

- A finite set of states $Q$ with $|Q| = M$. The state $q_t$ is a multinomial random variable: $q_t^i = 1$ for some particular value of $i$, and $0$ otherwise.
- An initial distribution $\pi$ on $Q$, where $\pi_i = P(q_0^i = 1)$.
- An observation model $\Omega = P(y_t \mid q_t)$, where the space of observations is discrete or continuous (real-valued). Denote by $\eta_{ij}$ the probability that state $i$ produces observation $j$.
- A transition matrix $a_{ij} = P(q_{t+1}^j \mid q_t^i)$.

HMM Applications
- Perception: face recognition, gesture recognition, handwriting, speech recognition
- Robot navigation
- Biology: DNA sequence prediction
- Language analysis: part-of-speech tagging
- Smart rooms, wearable devices

Hidden Markov Models: References

The model was developed by Baum in the 1960s. HMMs are instances of dynamic Bayesian networks (DBNs) [Dean et al., 1989; Murphy, 2002].
- L. R. Rabiner, "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition," Proceedings of the IEEE, vol. 77, no. 2, pp. 257-285, February 1989.
- Frederick Jelinek, Statistical Methods for Speech Recognition, MIT Press, 1997.

HMM Graphical Model

[Figure: graphical model with prior $\pi$ on $q_0$, hidden state chain $q_0 \to q_1 \to q_2 \to \cdots \to q_T$, and an observation $y_0, y_1, y_2, \ldots, y_T$ emitted from each state.]

Conditioning on a state $q_t$ d-separates the observation sequence into three categories: the current observation $y_t$, the past $y_0, \ldots, y_{t-1}$, and the future $y_{t+1}, \ldots, y_T$.

Markov Properties of HMMs

The future is independent of the past given the present:
$$P(q_{t+1} \mid q_t, q_{t-1}, \ldots, q_0) = P(q_{t+1} \mid q_t)$$
$$P(y_0, \ldots, y_T \mid q_t) = P(y_0, \ldots, y_t \mid q_t)\, P(y_{t+1}, \ldots, y_T \mid q_t)$$

HMMs are more powerful than any finite-memory device: in general,
$$P(y_{t+1} \mid y_t, y_{t-1}, \ldots, y_{t-k}) \neq P(y_{t+1} \mid y_t, y_{t-1}, \ldots, y_{t-k+1})$$

Basic Problems in HMMs
- Likelihood: Given an observation sequence $Y = y_0 y_1 \ldots y_T$ and a model $\theta = (A, \Omega, \pi)$, determine the likelihood $P(Y \mid \theta)$.
- Filtering: Given an observation sequence $Y_t = y_0 y_1 \ldots y_t$ and a model $\theta = (A, \Omega, \pi)$, determine the belief state $P(q_t \mid Y_t, \theta)$.
- Prediction: Given an observation sequence $Y_t = y_0 y_1 \ldots y_t$ and a model $\theta = (A, \Omega, \pi)$, determine the posterior over a future state, $P(q_s \mid Y_t, \theta)$ for $s > t$.

Basic Problems in HMMs
- Smoothing: Given an observation sequence $Y_t = y_0 y_1 \ldots y_t$ and a model $\theta = (A, \Omega, \pi)$, determine the posterior over a previous state, $P(q_s \mid Y_t, \theta)$ for $s < t$.
- Most likely explanation: Given an observation sequence $Y = y_0 y_1 \ldots y_T$ and a model $\theta = (A, \Omega, \pi)$, find the most likely sequence of states $q_0 q_1 \ldots q_T$ given $y$, that is, compute $\operatorname{argmax}_{q_0, \ldots, q_T} P(q_0, \ldots, q_T \mid y)$.
- Learning: Find the model parameters $\theta$ that maximize $\prod_i P(Y^i \mid \theta)$ over multiple independent (IID) sequences $Y^i$.

Example: Robot Navigation

[Figure: a four-state corridor environment with observation sequence O = (Wall, Wall, Opening, Wall); states 1-4 are annotated with observation probabilities P(Wall | 1), P(Wall | 2), P(Opening | 3), P(Wall | 4) and transition probabilities $a_{12}$, $a_{23}$, $a_{14}$.]

Inference in HMMs

The complete data for an HMM is the output sequence $y$ produced, along with the (hidden) state sequence traversed:
$$
P(q, y) = P(q_0) \prod_{t=0}^{T-1} P(q_{t+1} \mid q_t) \prod_{t=0}^{T} P(y_t \mid q_t)
= \pi_{q_0} \prod_{t=0}^{T-1} a_{q_t, q_{t+1}} \prod_{t=0}^{T} P(y_t \mid q_t)
$$
$$
= \prod_{i=1}^{M} (\pi_i)^{q_0^i} \prod_{t=0}^{T-1} \prod_{i,j=1}^{M} (a_{ij})^{q_t^i q_{t+1}^j} \prod_{t=0}^{T} \prod_{i=1}^{M} \prod_{j=1}^{N} (\eta_{ij})^{q_t^i y_t^j}
$$

Maximizing Observed Likelihood

The inference problem is computing the probability of a state sequence:
$$P(q \mid y) = \frac{P(q, y)}{P(y)}$$
To get the probability of the output $y$, we have to sum over all possible state sequences (which is intractable!):
$$p(y \mid \theta) = \sum_{q_0} \sum_{q_1} \cdots \sum_{q_T} \pi_{q_0} \prod_{t=0}^{T-1} a_{q_t, q_{t+1}} \prod_{t=0}^{T} P(y_t \mid q_t)$$
We cannot easily maximize the observed data's likelihood by analytically solving $\operatorname{argmax}_\theta\, l(\theta; y) = \operatorname{argmax}_\theta \log p(y \mid \theta)$.

Structuring Inference in HMMs

We can condition on a particular state $q_t$ to decompose the inference:
$$
P(q_t \mid y) = \frac{P(y \mid q_t)\, P(q_t)}{P(y)}
= \frac{P(y_0, \ldots, y_t \mid q_t)\, P(y_{t+1}, \ldots, y_T \mid q_t)\, P(q_t)}{P(y)}
= \frac{P(y_0, \ldots, y_t, q_t)\, P(y_{t+1}, \ldots, y_T \mid q_t)}{P(y)}
= \frac{\alpha(q_t)\, \beta(q_t)}{P(y)}
$$
Note that $P(y) = \sum_{q_t} \alpha(q_t)\, \beta(q_t)$.

Forward Step
$$
\begin{aligned}
\alpha(q_{t+1}) &= P(y_0, \ldots, y_{t+1}, q_{t+1}) \\
&= P(y_0, \ldots, y_t, q_{t+1})\, P(y_{t+1} \mid q_{t+1}) \\
&= \sum_{q_t} P(y_0, \ldots, y_t, q_t, q_{t+1})\, P(y_{t+1} \mid q_{t+1}) \\
&= \sum_{q_t} P(y_0, \ldots, y_t, q_t)\, P(q_{t+1} \mid q_t)\, P(y_{t+1} \mid q_{t+1}) \\
&= \sum_{q_t} \alpha(q_t)\, a_{q_t, q_{t+1}}\, P(y_{t+1} \mid q_{t+1})
\end{aligned}
$$

The Forward Algorithm
- Initialization: $\alpha(q_0) = P(y_0, q_0) = \pi_{q_0}\, P(y_0 \mid q_0)$, where $1 \le q_0 \le M$.
- Induction: $\alpha(q_{t+1}) = \sum_{q_t} \alpha(q_t)\, a_{q_t, q_{t+1}}\, P(y_{t+1} \mid q_{t+1})$, where $0 \le t \le T-1$ and $1 \le q_t, q_{t+1} \le M$.
- Time complexity: Each $\alpha(q_{t+1})$ computation takes $O(M^2)$, since we have to iterate over all $M$ values of $q_t$ and $q_{t+1}$. In total, the forward algorithm takes $O(M^2 T)$, since we have to iterate over each time step $t = 1, \ldots, T$.
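The two steps above translate directly into code. A minimal sketch on a hypothetical two-state HMM (the parameter values are illustrative, not from the slides):

```python
import numpy as np

# Hypothetical two-state HMM with two discrete observations (illustrative numbers)
pi = np.array([0.6, 0.4])            # initial distribution pi_i
A = np.array([[0.7, 0.3],            # a_ij = P(q_{t+1} = j | q_t = i)
              [0.4, 0.6]])
eta = np.array([[0.9, 0.1],          # eta_ij = P(y_t = j | q_t = i)
                [0.2, 0.8]])

def forward(y, pi, A, eta):
    """alpha[t, i] = P(y_0, ..., y_t, q_t = i); runs in O(M^2 T)."""
    T, M = len(y), len(pi)
    alpha = np.zeros((T, M))
    alpha[0] = pi * eta[:, y[0]]                          # initialization
    for t in range(T - 1):
        alpha[t + 1] = (alpha[t] @ A) * eta[:, y[t + 1]]  # induction
    return alpha

y = [0, 0, 1]                       # an observation sequence
alpha = forward(y, pi, A, eta)
print(alpha[-1].sum())              # likelihood P(y | theta)
```

Summing the final column of $\alpha$ gives $P(y \mid \theta)$, solving the likelihood problem from the earlier slide without the exponential sum over state sequences.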

Backward Phase
$$
\begin{aligned}
\beta(q_t) &= P(y_{t+1}, \ldots, y_T \mid q_t) \\
&= \sum_{q_{t+1}} P(y_{t+1}, \ldots, y_T, q_{t+1} \mid q_t) \\
&= \sum_{q_{t+1}} P(y_{t+1}, \ldots, y_T \mid q_{t+1}, q_t)\, P(q_{t+1} \mid q_t) \\
&= \sum_{q_{t+1}} P(y_{t+2}, \ldots, y_T \mid q_{t+1})\, P(y_{t+1} \mid q_{t+1})\, P(q_{t+1} \mid q_t) \\
&= \sum_{q_{t+1}} \beta(q_{t+1})\, P(y_{t+1} \mid q_{t+1})\, a_{q_t, q_{t+1}}
\end{aligned}
$$

The Backward Algorithm

The $\beta$ variables can also be calculated recursively, except we proceed backwards:
- Initialization: define $\beta(q_T) = 1$.
- Induction: $\beta(q_t) = \sum_{q_{t+1}} \beta(q_{t+1})\, P(y_{t+1} \mid q_{t+1})\, a_{q_t, q_{t+1}}$, where $1 \le t \le T-1$ and $1 \le q_t, q_{t+1} \le M$.
- Time complexity: Each $\beta(q_t)$ computation takes $O(M^2)$, since we have to iterate over all $M$ values of $q_t$ and $q_{t+1}$. In total, the backward algorithm takes $O(M^2 T)$, since we have to iterate over each time step.
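The backward recursion admits the same kind of sketch, on the same hypothetical two-state HMM (illustrative numbers, not from the slides). As a sanity check, $P(y) = \sum_i \pi_i\, P(y_0 \mid q_0^i)\, \beta_0(i)$ must match the forward algorithm's likelihood:

```python
import numpy as np

# Hypothetical two-state HMM (illustrative numbers only)
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3],
              [0.4, 0.6]])
eta = np.array([[0.9, 0.1],
                [0.2, 0.8]])

def backward(y, A, eta):
    """beta[t, i] = P(y_{t+1}, ..., y_T | q_t = i); runs in O(M^2 T)."""
    T, M = len(y), A.shape[0]
    beta = np.ones((T, M))                       # initialization: beta(q_T) = 1
    for t in range(T - 2, -1, -1):               # induction, backwards in time
        beta[t] = A @ (eta[:, y[t + 1]] * beta[t + 1])
    return beta

y = [0, 0, 1]
beta = backward(y, A, eta)
# Sanity check: P(y) recovered from the initial-time beta values
print((pi * eta[:, y[0]] * beta[0]).sum())
```

On this toy model the printed value agrees with the forward pass, as the identity $P(y) = \sum_{q_t} \alpha(q_t)\beta(q_t)$ (at $t = 0$) requires.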

Most Likely State

The most likely state at any time $t$ can be readily calculated from the $\alpha$ and $\beta$ variables. We define the most likely state as
$$q_t^{\mathrm{MLS}} = \operatorname{argmax}_{1 \le i \le M} \gamma(q_t^i), \quad \text{where} \quad \gamma(q_t^i) = \frac{\alpha(q_t^i)\, \beta(q_t^i)}{\sum_j \alpha(q_t^j)\, \beta(q_t^j)}$$
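Combining the two passes gives the smoothed posterior $\gamma$ and the per-step most likely state. A self-contained sketch on the same style of hypothetical two-state HMM (illustrative numbers only):

```python
import numpy as np

# Hypothetical two-state HMM (illustrative numbers only)
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
eta = np.array([[0.9, 0.1], [0.2, 0.8]])
y = [0, 0, 1]
T, M = len(y), len(pi)

# Forward pass: alpha[t, i] = P(y_0..y_t, q_t = i)
alpha = np.zeros((T, M))
alpha[0] = pi * eta[:, y[0]]
for t in range(T - 1):
    alpha[t + 1] = (alpha[t] @ A) * eta[:, y[t + 1]]

# Backward pass: beta[t, i] = P(y_{t+1}..y_T | q_t = i)
beta = np.ones((T, M))
for t in range(T - 2, -1, -1):
    beta[t] = A @ (eta[:, y[t + 1]] * beta[t + 1])

# gamma[t, i] = P(q_t = i | y): normalize alpha * beta at each t
gamma = alpha * beta
gamma /= gamma.sum(axis=1, keepdims=True)
mls = gamma.argmax(axis=1)      # most likely state at each time step
print(mls)
```

Note that the sequence of per-step most likely states need not itself be the most likely state *sequence*; that is what the Viterbi algorithm on the next slides computes.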

Most Likely State Sequence

Let us define the probability of the most likely sequence that accounts for all observations up to time $t$ and ends in state $q_t^i$ as
$$\delta_t(i) = \max_{q_0, q_1, \ldots, q_{t-1}} P(q_0, \ldots, q_t = i, y_0, \ldots, y_t \mid \theta)$$
We can define this probability by induction as
$$\delta_{t+1}(i) = \left( \max_j \delta_t(j)\, a_{ji} \right) P(y_{t+1} \mid q_{t+1}^i)$$

The Viterbi Algorithm
- Initialization: $\delta_0(i) = \pi_i\, P(y_0 \mid q_0^i)$ and $\psi_0(i) = 0$, for $1 \le i \le M$.
- Recursion:
$$\delta_t(i) = \left( \max_{1 \le j \le M} \delta_{t-1}(j)\, a_{ji} \right) P(y_t \mid q_t^i), \qquad \psi_t(i) = \operatorname{argmax}_{1 \le j \le M}\, \delta_{t-1}(j)\, a_{ji}$$

The Viterbi Algorithm
- Termination: $P^* = \max_{1 \le i \le M} \delta_T(i)$.
- Path computation: $q_T^* = \operatorname{argmax}_{1 \le i \le M} \delta_T(i)$, and $q_t^* = \psi_{t+1}(q_{t+1}^*)$ for $t = T-1, T-2, \ldots, 0$.
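The recursion, termination, and backtracking steps can be sketched together on a hypothetical two-state HMM (illustrative numbers, not from the slides):

```python
import numpy as np

# Hypothetical two-state HMM (illustrative numbers only)
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3],
              [0.4, 0.6]])
eta = np.array([[0.9, 0.1],
                [0.2, 0.8]])

def viterbi(y, pi, A, eta):
    """Most likely state sequence and its probability P*, in O(M^2 T)."""
    T, M = len(y), len(pi)
    delta = np.zeros((T, M))
    psi = np.zeros((T, M), dtype=int)
    delta[0] = pi * eta[:, y[0]]                  # initialization
    for t in range(1, T):
        trans = delta[t - 1][:, None] * A         # trans[j, i] = delta_{t-1}(j) a_ji
        psi[t] = trans.argmax(axis=0)             # best predecessor j for each i
        delta[t] = trans.max(axis=0) * eta[:, y[t]]
    path = [int(delta[-1].argmax())]              # termination
    for t in range(T - 1, 0, -1):                 # backtrack through psi
        path.append(int(psi[t, path[-1]]))
    return path[::-1], delta[-1].max()

path, p_star = viterbi([0, 0, 1], pi, A, eta)
print(path)    # most likely joint state sequence
```

The max replaces the forward algorithm's sum, so the cost is the same $O(M^2 T)$; $\psi$ stores the argmax so the optimal sequence can be recovered without re-searching.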

Learning the HMM Parameters

The $\alpha$ and $\beta$ variables enable estimation of the observation model. To estimate the transition matrix, we introduce a new variable:
$$
\xi(q_t, q_{t+1}) = P(q_t, q_{t+1} \mid y)
= \frac{P(y \mid q_t, q_{t+1})\, P(q_{t+1} \mid q_t)\, P(q_t)}{P(y)}
= \frac{\alpha(q_t)\, P(y_{t+1} \mid q_{t+1})\, \beta(q_{t+1})\, a_{q_t, q_{t+1}}}{P(y)}
$$

Maximum Likelihood for HMMs

Generally, we want to find the parameters $\theta$ that maximize the log-likelihood $\log P(y \mid \theta)$, where
$$\log p(y \mid \theta) = \log \sum_{q_0} \sum_{q_1} \cdots \sum_{q_T} \pi_{q_0} \prod_{t=0}^{T-1} a_{q_t, q_{t+1}} \prod_{t=0}^{T} P(y_t \mid q_t)$$
As before, we note that this involves taking the log of a sum over exponentially many terms, which is difficult to optimize directly.

Complete Log-likelihood

To simplify the maximum log-likelihood computation, we postulate a complete dataset $(y, q)$, where
$$
\begin{aligned}
l_c(\theta; q, y) &= \log \prod_{i=1}^{M} (\pi_i)^{q_0^i} \prod_{t=0}^{T-1} \prod_{i,j=1}^{M} (a_{ij})^{q_t^i q_{t+1}^j} \prod_{t=0}^{T} \prod_{i=1}^{M} \prod_{j=1}^{N} (\eta_{ij})^{q_t^i y_t^j} \\
&= \sum_{i=1}^{M} q_0^i \log \pi_i + \sum_{t=0}^{T-1} \sum_{i,j=1}^{M} q_t^i q_{t+1}^j \log a_{ij} + \sum_{t=0}^{T} \sum_{i=1}^{M} \sum_{j=1}^{N} q_t^i y_t^j \log \eta_{ij}
\end{aligned}
$$

M-Step of EM for HMMs

Maximize $l_c$ over $a_{ij}$ with a Lagrange multiplier $\lambda_i$ enforcing $\sum_{j=1}^{M} a_{ij} = 1$:
$$
\frac{\partial}{\partial a_{ij}} \left[ \sum_{t=0}^{T-1} \sum_{i,j=1}^{M} q_t^i q_{t+1}^j \log a_{ij} + \sum_{i=1}^{M} \lambda_i \Big( \sum_{j=1}^{M} a_{ij} - 1 \Big) \right]
= \frac{\sum_{t=0}^{T-1} q_t^i q_{t+1}^j}{a_{ij}} + \lambda_i = 0
$$
Solving for $a_{ij}$ and exploiting the constraint $\sum_{j=1}^{M} a_{ij} = 1$:
$$
\hat{a}_{ij} = -\frac{\sum_{t=0}^{T-1} q_t^i q_{t+1}^j}{\lambda_i}
= \frac{\sum_{t=0}^{T-1} q_t^i q_{t+1}^j}{\sum_{j'=1}^{M} \sum_{t=0}^{T-1} q_t^i q_{t+1}^{j'}}
$$

M-Step of EM for HMMs
$$
\hat{\eta}_{ij} = \frac{\sum_{t=0}^{T} q_t^i y_t^j}{\sum_{k=1}^{N} \sum_{t=0}^{T} q_t^i y_t^k}, \quad \text{exploiting } \sum_{j=1}^{N} \eta_{ij} = 1, \qquad \hat{\pi}_i = q_0^i
$$

E-Step for HMMs

The E-step replaces the hidden sufficient statistics by their expectations given the data and the current parameters $\theta^{(p)}$:
$$
\sum_{t=0}^{T} E(q_t^i \mid y, \theta^{(p)})\, y_t^j = \sum_{t=0}^{T} P(q_t^i = 1 \mid y, \theta^{(p)})\, y_t^j = \sum_{t=0}^{T} \gamma_t^i\, y_t^j
$$
$$
\sum_{t=0}^{T-1} E(q_t^i q_{t+1}^j \mid y, \theta^{(p)}) = \sum_{t=0}^{T-1} P(q_t^i = 1, q_{t+1}^j = 1 \mid y, \theta^{(p)}) = \sum_{t=0}^{T-1} \xi_{t,t+1}^{i,j}
$$

M-Step in HMMs
$$
\hat{\eta}_{ij}^{(p+1)} = \frac{\sum_{t=0}^{T} \gamma_t^i y_t^j}{\sum_{k=1}^{N} \sum_{t=0}^{T} \gamma_t^i y_t^k} = \frac{\sum_{t=0}^{T} \gamma_t^i y_t^j}{\sum_{t=0}^{T} \gamma_t^i}, \qquad
\hat{a}_{ij}^{(p+1)} = \frac{\sum_{t=0}^{T-1} \xi_{t,t+1}^{i,j}}{\sum_{j'=1}^{M} \sum_{t=0}^{T-1} \xi_{t,t+1}^{i,j'}} = \frac{\sum_{t=0}^{T-1} \xi_{t,t+1}^{i,j}}{\sum_{t=0}^{T-1} \gamma_t^i}, \qquad
\hat{\pi}_i^{(p+1)} = \gamma_0^i
$$
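Putting the E-step and M-step together gives one iteration of Baum-Welch. A minimal sketch, again on a hypothetical two-state HMM (the parameter values, observation sequence, and function names are illustrative, not from the slides):

```python
import numpy as np

def forward(y, pi, A, eta):
    alpha = np.zeros((len(y), len(pi)))
    alpha[0] = pi * eta[:, y[0]]
    for t in range(len(y) - 1):
        alpha[t + 1] = (alpha[t] @ A) * eta[:, y[t + 1]]
    return alpha

def backward(y, A, eta):
    T, M = len(y), A.shape[0]
    beta = np.ones((T, M))
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (eta[:, y[t + 1]] * beta[t + 1])
    return beta

def baum_welch_step(y, pi, A, eta):
    """One EM iteration: E-step computes gamma and xi, M-step re-estimates."""
    y = np.asarray(y)
    alpha, beta = forward(y, pi, A, eta), backward(y, A, eta)
    gamma = alpha * beta
    gamma /= gamma.sum(axis=1, keepdims=True)          # gamma[t, i] = P(q_t = i | y)
    # xi[t, i, j] proportional to alpha_t(i) a_ij P(y_{t+1} | j) beta_{t+1}(j)
    xi = alpha[:-1, :, None] * A[None] * (eta[:, y[1:]].T * beta[1:])[:, None, :]
    xi /= xi.sum(axis=(1, 2), keepdims=True)           # xi[t, i, j] = P(q_t=i, q_{t+1}=j | y)
    new_pi = gamma[0]
    new_A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    new_eta = np.stack([gamma[y == j].sum(axis=0) for j in range(eta.shape[1])],
                       axis=1) / gamma.sum(axis=0)[:, None]
    return new_pi, new_A, new_eta

# Hypothetical two-state HMM (illustrative numbers only)
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
eta = np.array([[0.9, 0.1], [0.2, 0.8]])
y = [0, 0, 1, 1, 0]
new_pi, new_A, new_eta = baum_welch_step(y, pi, A, eta)
```

By the standard EM argument, each such step cannot decrease the observed likelihood $P(y \mid \theta)$; iterating to convergence yields a local maximum.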

Extensions of HMMs
- Observable operator models: observable representations of hidden states.
- Semi-Markov HMMs: state durations are not geometrically distributed (the HMM default), but arbitrary.
- Hierarchical HMMs: multi-level tree-structured models, which are a special case of probabilistic context-free grammars.
- Abstract hidden Markov models: AHMMs with state-mediated transitions.

Hierarchical HMMs

Fine, Singer, and Tishby, "The Hierarchical Hidden Markov Model," Machine Learning, 1998.

[Figure: a hierarchical HMM with internal states s1-s8, end states e1, e3, e4, and transition probabilities on the edges. The observation model for state s6 is FRONT(W: 0.1, O: 0.9), LEFT(W: 0.9, O: 0.1), BACK(W: 0.1, O: 0.9), RIGHT(W: 0.9, O: 0.1).]

Using Hierarchical HMMs in Robot Navigation

The actual model used extends the hierarchical HMM to include temporally extended actions such as "exit corridor" (a hierarchical POMDP). Observation vectors: Front, Left, Right, Back (Wall or Opening); Door on Right; Stripe on Right Wall. See [Theocharous, Rohanimanesh, and Mahadevan, ICRA 2001] and [Theocharous, Murphy, and Kaelbling, IJCAI 2003] for details.

[Figure: a corridor environment with abstract states S1-S4 and landmarks A, B.]

Foveal Face Recognition using HMMs

Minut, Mahadevan, Henderson, and Dyer, "Face Recognition using Foveal Vision," in Lecture Notes in Computer Science: Biologically Motivated Computer Vision, Seong-Whan Lee, Heinrich H. Bülthoff, and Tomaso Poggio (eds.), vol. 1811, pp. 424-433, Springer-Verlag, 2000.

Comparing Sliding Windows with Foveation

Competing approach: slide a fixed-size window down the image [Samaria, PhD thesis, Cambridge].

[Figure: average recognition rate of subsampled vs. foveal HMM classifiers (WOMEN-FOVEAL vs. WOMEN-SUBSAMPLED), plotted against the number of states (3 to 11).]

Abstract Hidden Markov Model

[Bui, Venkatesh, and West: JAIR; IJCAI 2003]

[Figure: a DBN for the abstract HMM with policy variables Π2 (level 2) and Π1 (level 1), their termination variables E2 and E1, and actions A, states S, and observations O at level 0.]