Learning from Sequential and Time-Series Data
Sridhar Mahadevan (mahadeva@cs.umass.edu)
University of Massachusetts
Sridhar Mahadevan: CMPSCI 689
Sequential and Time-Series Data
Many real-world applications involve sequential or time-series data:
- Gene and protein sequences in bioinformatics
- Natural language text
- Modeling human and robot behavior
- GPS tracking of cars, missiles, satellites, etc.
- Sensor networks, weather prediction
Examples of models: Markov chains, Markov decision processes, hidden Markov models, partially observable MDPs, conditional random fields, observable operator models, predictive state representations, etc.
Sequence Models
A sequence model M = ⟨O, S, A, P_o, P_t, R⟩ consists of:
- O, a set of continuous or discrete observations
- S, a set of states
- A, a set of actions or decisions
- P_o, an observation model, where P(y | s) is the probability of observing y in state s
- P_t, a transition model, where P^a_{ss'} is the probability of moving from state s to s' under action a
- R, a reward or cost function
Markov Chains
The most widely studied sequential model, developed by Markov to study text (Pushkin).
A Markov chain M = ⟨S, P_t, π_0⟩ is specified by:
- A set of discrete or continuous states S
- A transition probability P_t(s' | s) of moving from s to s'
- An initial distribution π_0(s) of starting in state s
Maximum likelihood estimation of a Markov chain is theoretically trivial, but can be practically challenging.
Example: a bigram model of English text (where the states are words).
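The maximum likelihood estimate simply normalizes transition counts. A minimal sketch, assuming a tiny illustrative corpus (the function name `estimate_markov_chain` and the toy data are mine, not from the slides):

```python
from collections import defaultdict

def estimate_markov_chain(sequences):
    """MLE of a Markov chain from observed state sequences:
    P_t(s' | s) = count(s -> s') / count(s -> anything), and
    pi_0(s) = fraction of sequences that start in s."""
    trans = defaultdict(lambda: defaultdict(int))
    init = defaultdict(int)
    for seq in sequences:
        init[seq[0]] += 1
        for s, s_next in zip(seq, seq[1:]):
            trans[s][s_next] += 1
    pi_0 = {s: c / len(sequences) for s, c in init.items()}
    P_t = {s: {s2: c / sum(nxt.values()) for s2, c in nxt.items()}
           for s, nxt in trans.items()}
    return pi_0, P_t

# Toy bigram model of text: states are words.
corpus = [["the", "cat", "sat"], ["the", "dog", "sat"], ["a", "cat", "sat"]]
pi_0, P_t = estimate_markov_chain(corpus)
```

The practical challenge the slide alludes to is sparsity: with a large vocabulary, most bigrams never occur, so real systems smooth these counts rather than use the raw MLE.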
Hidden Markov Models
- A finite set of states Q, where |Q| = M. The state q_t is a multinomial random variable: q_t^i = 1 for some particular value of i, and 0 otherwise.
- An initial distribution π on Q, where π_i = P(q_0^i = 1).
- An observation model Ω = P(y_t | q_t), where the space of observations is discrete or continuous (real-valued). Denote by η_ij the probability that state i produces observation j.
- A transition matrix a_ij = P(q_{t+1}^j = 1 | q_t^i = 1).
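To make the generative reading of these parameters concrete, here is a hedged sketch of sampling a sequence from an HMM (the helper `draw`, the function `sample_hmm`, and the toy numbers are illustrative assumptions, not the course's code):

```python
import random

def sample_hmm(pi, A, eta, T, rng=None):
    """Generate T observations from an HMM: draw q_0 ~ pi, then
    repeatedly emit y_t ~ eta[q_t] and transition q_{t+1} ~ A[q_t]."""
    rng = rng or random.Random(0)
    def draw(probs):
        r, acc = rng.random(), 0.0
        for k, p in enumerate(probs):
            acc += p
            if r < acc:
                return k
        return len(probs) - 1
    states, obs = [], []
    q = draw(pi)
    for _ in range(T):
        states.append(q)
        obs.append(draw(eta[q]))   # eta[i][j] = P(observation j | state i)
        q = draw(A[q])             # A[i][j]  = a_ij
    return states, obs

pi = [1.0, 0.0]
A = [[0.9, 0.1], [0.0, 1.0]]      # state 1 is absorbing
eta = [[1.0, 0.0], [0.0, 1.0]]    # state i deterministically emits symbol i
states, obs = sample_hmm(pi, A, eta, 5)
```

With deterministic emissions as here, the observation sequence reveals the state sequence exactly; with noisy η, the states become genuinely hidden, which is what the inference problems below address.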
HMM Applications
- Perception: face recognition, gesture recognition, handwriting, speech recognition
- Robot navigation
- Biology: DNA sequence prediction
- Language analysis: part-of-speech tagging
- Smart rooms, wearable devices
Hidden Markov Models: References
The model was developed by Baum in the 1960s. HMMs are instances of dynamic Bayesian networks (DBNs) [Dean et al., 1989; Murphy, 2002].
L. R. Rabiner, A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition, Proceedings of the IEEE, vol. 77, no. 2, pp. 257-285, February 1989.
Frederick Jelinek, Statistical Methods in Speech Recognition, MIT Press, 1997.
HMM Graphical Model
[Figure: the HMM as a chain-structured graphical model, with hidden states q_0 → q_1 → q_2 → ... → q_t (with prior π on q_0) and each state q_t emitting an observation y_t.]
Conditioning on a state q_t d-separates the observation sequence into three categories: the current observation y_t, the past y_0, ..., y_{t-1}, and the future y_{t+1}, ..., y_T.
Markov Properties of HMMs
The future is independent of the past given the present:
P(q_{t+1} | q_t, q_{t-1}, ..., q_0) = P(q_{t+1} | q_t)
P(y_0, ..., y_T | q_t) = P(y_0, ..., y_t | q_t) P(y_{t+1}, ..., y_T | q_t)
HMMs are more powerful than any finite-memory device: for any finite k,
P(y_{t+1} | y_t, y_{t-1}, ..., y_{t-k}) ≠ P(y_{t+1} | y_t, y_{t-1}, ..., y_{t-k+1})
Basic Problems in HMMs
- Likelihood: Given an observation sequence Y = y_0 y_1 ... y_T and a model θ = (A, Ω, π), determine the likelihood P(Y | θ).
- Filtering: Given an observation sequence Y_t = y_0 y_1 ... y_t and a model θ = (A, Ω, π), determine the belief state P(q_t | Y_t, θ).
- Prediction: Given an observation sequence Y_t = y_0 y_1 ... y_t and a model θ = (A, Ω, π), determine the posterior over a future state, P(q_s | Y_t, θ), s > t.
Basic Problems in HMMs
- Smoothing: Given an observation sequence Y_t = y_0 y_1 ... y_t and a model θ = (A, Ω, π), determine the posterior over a previous state, P(q_s | Y_t, θ), s < t.
- Most likely explanation: Given an observation sequence Y = y_0 y_1 ... y_T and a model θ = (A, Ω, π), find the most likely sequence of states q_0 q_1 ... q_T given y, that is, compute argmax_{q_0,...,q_T} P(q_0, ..., q_T | y).
- Learning: Find the model parameters θ that maximize ∏_i P(Y^i | θ) over multiple independent (IID) sequences Y^i.
Example: Robot Navigation
[Figure: a four-state corridor environment generating the observation sequence O = Wall, Wall, Opening, Wall; states 1-4 with transitions such as a_12, a_23, a_14 and observation probabilities such as P(Wall | 1), P(Wall | 2), P(Opening | 3), P(Wall | 4).]
Inference in HMMs
The complete data for an HMM is the output sequence y produced, along with the (hidden) state sequence traversed:
P(q, y) = P(q_0) ∏_{t=0}^{T-1} P(q_{t+1} | q_t) ∏_{t=0}^{T} P(y_t | q_t)
        = π_{q_0} ∏_{t=0}^{T-1} a_{q_t q_{t+1}} ∏_{t=0}^{T} P(y_t | q_t)
        = ∏_{i=1}^{M} (π_i)^{q_0^i} ∏_{t=0}^{T-1} ∏_{i,j=1}^{M} (a_ij)^{q_t^i q_{t+1}^j} ∏_{t=0}^{T} ∏_{i=1}^{M} ∏_{j=1}^{N} (η_ij)^{q_t^i y_t^j}
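The factored joint P(q, y) above can be evaluated directly once the state sequence is known. A small sketch (the function name and the toy parameter values are invented for illustration):

```python
def complete_data_prob(q, y, pi, A, eta):
    """P(q, y) = pi_{q_0} * prod_t a_{q_t, q_{t+1}} * prod_t eta_{q_t, y_t},
    with states and observations given as integer indices."""
    p = pi[q[0]]
    for t in range(len(q) - 1):
        p *= A[q[t]][q[t + 1]]   # transition term a_{q_t, q_{t+1}}
    for t in range(len(q)):
        p *= eta[q[t]][y[t]]     # emission term P(y_t | q_t)
    return p

pi = [0.5, 0.5]
A = [[0.7, 0.3], [0.4, 0.6]]
eta = [[0.9, 0.1], [0.2, 0.8]]
p = complete_data_prob([0, 1], [0, 1], pi, A, eta)  # 0.5 * 0.3 * 0.9 * 0.8
```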
Maximizing Observed Likelihood
The inference problem is computing the probability of a state sequence:
P(q | y) = P(q, y) / P(y)
To get the probability of the output y, we have to sum over all possible state sequences (which is intractable to do naively!):
p(y | θ) = Σ_{q_0} Σ_{q_1} ... Σ_{q_T} π_{q_0} ∏_{t=0}^{T-1} a_{q_t q_{t+1}} ∏_{t=0}^{T} P(y_t | q_t)
We cannot easily maximize the observed data's likelihood by analytically solving argmax_θ l(θ | y) = argmax_θ log p(y | θ).
Structuring Inference in HMMs
We can condition on a particular state q_t to decompose the inference:
P(q_t | y) = P(y | q_t) P(q_t) / P(y)
           = P(y_0, ..., y_t | q_t) P(y_{t+1}, ..., y_T | q_t) P(q_t) / P(y)
           = P(y_0, ..., y_t, q_t) P(y_{t+1}, ..., y_T | q_t) / P(y)
           = α(q_t) β(q_t) / P(y)
Note that P(y) = Σ_{q_t} α(q_t) β(q_t).
Forward Step
α(q_{t+1}) = P(y_0, ..., y_{t+1}, q_{t+1})
           = P(y_0, ..., y_t, q_{t+1}) P(y_{t+1} | q_{t+1})
           = Σ_{q_t} P(y_0, ..., y_t, q_t, q_{t+1}) P(y_{t+1} | q_{t+1})
           = Σ_{q_t} P(y_0, ..., y_t, q_t) P(q_{t+1} | q_t) P(y_{t+1} | q_{t+1})
           = Σ_{q_t} α(q_t) a_{q_t q_{t+1}} P(y_{t+1} | q_{t+1})
The Forward Algorithm
Initialization: α(q_0) = P(y_0, q_0) = π_{q_0} P(y_0 | q_0), where 1 ≤ q_0 ≤ M.
Induction: α(q_{t+1}) = Σ_{q_t} α(q_t) a_{q_t q_{t+1}} P(y_{t+1} | q_{t+1}), where 0 ≤ t ≤ T-1 and 1 ≤ q_t, q_{t+1} ≤ M.
Time complexity: Each time step takes O(M^2), since we have to iterate over all M values of both q_t and q_{t+1}. In total, the forward algorithm takes O(M^2 T), since we also iterate over each time step t = 1, ..., T.
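The recursion translates almost line for line into code. A minimal sketch, assuming the same tabular parameterization as before (the two-state numbers are invented for illustration):

```python
def forward(y, pi, A, eta):
    """Compute alpha[t][i] = P(y_0, ..., y_t, q_t = i) in O(M^2 T)."""
    M = len(pi)
    alpha = [[pi[i] * eta[i][y[0]] for i in range(M)]]       # initialization
    for t in range(1, len(y)):                               # induction
        alpha.append([eta[j][y[t]] *
                      sum(alpha[t - 1][i] * A[i][j] for i in range(M))
                      for j in range(M)])
    return alpha

pi = [0.5, 0.5]
A = [[0.7, 0.3], [0.4, 0.6]]
eta = [[0.9, 0.1], [0.2, 0.8]]
y = [0, 1]
alpha = forward(y, pi, A, eta)
likelihood = sum(alpha[-1])   # P(y) = sum over final alpha values
```

In practice the α values underflow for long sequences, so implementations rescale each column or work in log space; this sketch omits that for clarity.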
Backward Phase
β(q_t) = P(y_{t+1}, ..., y_T | q_t)
       = Σ_{q_{t+1}} P(y_{t+1}, ..., y_T, q_{t+1} | q_t)
       = Σ_{q_{t+1}} P(y_{t+1}, ..., y_T | q_{t+1}, q_t) P(q_{t+1} | q_t)
       = Σ_{q_{t+1}} P(y_{t+2}, ..., y_T | q_{t+1}) P(y_{t+1} | q_{t+1}) P(q_{t+1} | q_t)
       = Σ_{q_{t+1}} β(q_{t+1}) P(y_{t+1} | q_{t+1}) a_{q_t q_{t+1}}
The Backward Algorithm
The β variables can also be calculated recursively, except that we proceed backwards in time:
Initialization: Define β(q_T) = 1.
Induction: β(q_t) = Σ_{q_{t+1}} β(q_{t+1}) P(y_{t+1} | q_{t+1}) a_{q_t q_{t+1}}, where 1 ≤ t ≤ T-1 and 1 ≤ q_t, q_{t+1} ≤ M.
Time complexity: Each time step takes O(M^2), since we have to iterate over all M values of both q_t and q_{t+1}. In total, the backward algorithm takes O(M^2 T).
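A matching sketch of the backward pass, with a cross-check that it yields the same P(y) as the forward decomposition (toy numbers invented, as before):

```python
def backward(y, A, eta):
    """Compute beta[t][i] = P(y_{t+1}, ..., y_T | q_t = i), with beta[T] = 1."""
    M, T = len(A), len(y)
    beta = [[1.0] * M for _ in range(T)]   # initialization: beta(q_T) = 1
    for t in range(T - 2, -1, -1):         # induction, backwards in time
        beta[t] = [sum(A[i][j] * eta[j][y[t + 1]] * beta[t + 1][j]
                       for j in range(M))
                   for i in range(M)]
    return beta

pi = [0.5, 0.5]
A = [[0.7, 0.3], [0.4, 0.6]]
eta = [[0.9, 0.1], [0.2, 0.8]]
y = [0, 1]
beta = backward(y, A, eta)
# Cross-check: P(y) = sum_i pi_i P(y_0 | i) beta[0][i]
likelihood = sum(pi[i] * eta[i][y[0]] * beta[0][i] for i in range(len(pi)))
```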
Most Likely State
The most likely state at any time t can be readily calculated from the α and β variables. Define
γ(q_t^i) = α(q_t^i) β(q_t^i) / Σ_j α(q_t^j) β(q_t^j)
Then the most likely state is
q_t^MLS = argmax_{1 ≤ i ≤ M} γ(q_t^i)
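Given precomputed α and β tables, the γ normalization is one line per time step. A sketch (the literal alpha/beta numbers below are assumed to come from a prior forward/backward pass over a toy two-state model; they are not part of the slides):

```python
def gammas(alpha, beta):
    """gamma[t][i] = alpha[t][i] * beta[t][i] / sum_j alpha[t][j] * beta[t][j]."""
    out = []
    for a_t, b_t in zip(alpha, beta):
        w = [a * b for a, b in zip(a_t, b_t)]
        z = sum(w)                       # equals P(y) at every t
        out.append([v / z for v in w])
    return out

def most_likely_states(alpha, beta):
    """q_t^MLS = argmax_i gamma(q_t^i), computed independently at each t."""
    return [max(range(len(g)), key=g.__getitem__) for g in gammas(alpha, beta)]

# Example alpha/beta tables (assumed precomputed by forward/backward).
alpha = [[0.45, 0.10], [0.0355, 0.156]]
beta = [[0.31, 0.52], [1.0, 1.0]]
mls = most_likely_states(alpha, beta)
```

Note that the per-step most likely states need not form a coherent sequence; finding the best joint sequence is the Viterbi problem that follows.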
Most Likely State Sequence
Let us define the probability of the most likely state sequence that accounts for all observations up to time t and ends in state q_t^i as
δ_t(i) = max_{q_0, q_1, ..., q_{t-1}} P(q_0, ..., q_{t-1}, q_t = i, y_0, ..., y_t | θ)
This probability can be computed by induction:
δ_{t+1}(i) = ( max_j δ_t(j) a_{ji} ) P(y_{t+1} | q_{t+1}^i)
The Viterbi Algorithm
Initialization:
δ_0(i) = π_i P(y_0 | q_0^i), 1 ≤ i ≤ M
ψ_0(i) = 0
Recursion:
δ_t(i) = ( max_{1 ≤ j ≤ M} δ_{t-1}(j) a_{ji} ) P(y_t | q_t^i)
ψ_t(i) = argmax_{1 ≤ j ≤ M} δ_{t-1}(j) a_{ji}
The Viterbi Algorithm
Termination:
P* = max_{1 ≤ i ≤ M} δ_T(i)
Path computation (backtracking):
q_T* = argmax_{1 ≤ i ≤ M} δ_T(i)
q_t* = ψ_{t+1}(q_{t+1}*), t = T-1, T-2, ..., 0
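Putting the initialization, recursion, and backtracking together, a hedged sketch of Viterbi (toy parameters invented for illustration):

```python
def viterbi(y, pi, A, eta):
    """Most likely state sequence argmax_q P(q_0, ..., q_T | y)."""
    M, T = len(pi), len(y)
    delta = [[pi[i] * eta[i][y[0]] for i in range(M)]]   # initialization
    psi = [[0] * M]
    for t in range(1, T):                                # recursion
        d_t, psi_t = [], []
        for i in range(M):
            j_best = max(range(M), key=lambda j: delta[t - 1][j] * A[j][i])
            d_t.append(delta[t - 1][j_best] * A[j_best][i] * eta[i][y[t]])
            psi_t.append(j_best)
        delta.append(d_t)
        psi.append(psi_t)
    # Termination and path backtracking.
    path = [max(range(M), key=lambda i: delta[T - 1][i])]
    for t in range(T - 1, 0, -1):
        path.insert(0, psi[t][path[0]])
    return path

pi = [0.5, 0.5]
A = [[0.7, 0.3], [0.4, 0.6]]
eta = [[0.9, 0.1], [0.2, 0.8]]
path = viterbi([0, 1], pi, A, eta)
```

Structurally this is the forward algorithm with the sum over q_t replaced by a max, plus the ψ backpointers; it therefore shares the O(M^2 T) cost.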
Learning the HMM Parameters
The α and β variables enable estimation of the observation model. To estimate the transition matrix, we introduce a new variable:
ξ(q_t, q_{t+1}) = P(q_t, q_{t+1} | y)
               = P(y | q_t, q_{t+1}) P(q_{t+1} | q_t) P(q_t) / P(y)
               = α(q_t) a_{q_t q_{t+1}} P(y_{t+1} | q_{t+1}) β(q_{t+1}) / P(y)
Maximum Likelihood for HMMs
Generally, we want to find the parameters θ that maximize the (log) likelihood log P(y | θ), where
log p(y | θ) = log Σ_{q_0} Σ_{q_1} ... Σ_{q_T} π_{q_0} ∏_{t=0}^{T-1} a_{q_t q_{t+1}} ∏_{t=0}^{T} P(y_t | q_t)
As before, this involves taking the log of a sum over all state sequences, which is difficult to optimize directly.
Complete Log-likelihood
To simplify the maximum-likelihood computation, we postulate a complete dataset (y, q), for which
l_c(θ; q, y) = log [ ∏_{i=1}^{M} (π_i)^{q_0^i} ∏_{t=0}^{T-1} ∏_{i,j=1}^{M} (a_ij)^{q_t^i q_{t+1}^j} ∏_{t=0}^{T} ∏_{i=1}^{M} ∏_{j=1}^{N} (η_ij)^{q_t^i y_t^j} ]
             = Σ_{i=1}^{M} q_0^i log π_i + Σ_{t=0}^{T-1} Σ_{i,j=1}^{M} q_t^i q_{t+1}^j log a_ij + Σ_{t=0}^{T} Σ_{i=1}^{M} Σ_{j=1}^{N} q_t^i y_t^j log η_ij
M-Step of EM for HMMs
To maximize over a_ij subject to Σ_{j=1}^{M} a_ij = 1, add a Lagrange multiplier λ_i and set the derivative to zero:
∂/∂a_ij [ Σ_{t=0}^{T-1} Σ_{i,j=1}^{M} q_t^i q_{t+1}^j log a_ij + Σ_{i=1}^{M} λ_i (Σ_{j=1}^{M} a_ij − 1) ] = Σ_{t=0}^{T-1} q_t^i q_{t+1}^j / a_ij + λ_i = 0
â_ij = Σ_{t=0}^{T-1} q_t^i q_{t+1}^j / (−λ_i)
Exploiting the constraint Σ_{j=1}^{M} a_ij = 1 to solve for λ_i gives
â_ij = Σ_{t=0}^{T-1} q_t^i q_{t+1}^j / Σ_{j=1}^{M} Σ_{t=0}^{T-1} q_t^i q_{t+1}^j
M-Step of EM for HMMs
Similarly, exploiting the constraint Σ_{j=1}^{N} η_ij = 1:
η̂_ij = Σ_{t=0}^{T} q_t^i y_t^j / Σ_{k=1}^{N} Σ_{t=0}^{T} q_t^i y_t^k
π̂_i = q_0^i
E-step for HMMs
In the E-step, the hidden indicator variables are replaced by their expected values under the current model θ^p:
E(q_t^i | y, θ^p) = P(q_t^i = 1 | y, θ^p) = γ_t^i
so the expected observation counts are Σ_{t=0}^{T} γ_t^i y_t^j, and
E(q_t^i q_{t+1}^j | y, θ^p) = P(q_t^i = 1, q_{t+1}^j = 1 | y, θ^p) = ξ_t^{ij}
so the expected transition counts are Σ_{t=0}^{T-1} ξ_t^{ij}.
M-Step in HMMs
η̂_ij^{(p+1)} = Σ_{t=0}^{T} γ_t^i y_t^j / Σ_{k=1}^{N} Σ_{t=0}^{T} γ_t^i y_t^k = Σ_{t=0}^{T} γ_t^i y_t^j / Σ_{t=0}^{T} γ_t^i
â_ij^{(p+1)} = Σ_{t=0}^{T-1} ξ_t^{ij} / Σ_{j=1}^{M} Σ_{t=0}^{T-1} ξ_t^{ij} = Σ_{t=0}^{T-1} ξ_t^{ij} / Σ_{t=0}^{T-1} γ_t^i
π̂_i^{(p+1)} = γ_0^i
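One full Baum-Welch iteration strings the pieces together: forward and backward passes for the E-step (γ and ξ), then the ratio-of-expected-counts updates above. A sketch for discrete observations on a single sequence (the function name and toy data are assumptions for illustration):

```python
def em_step(y, pi, A, eta):
    """One EM (Baum-Welch) iteration on a single discrete observation sequence."""
    M, T, N = len(pi), len(y), len(eta[0])
    # E-step: forward pass alpha[t][i] = P(y_0..y_t, q_t = i).
    alpha = [[pi[i] * eta[i][y[0]] for i in range(M)]]
    for t in range(1, T):
        alpha.append([eta[j][y[t]] * sum(alpha[t - 1][i] * A[i][j]
                                         for i in range(M)) for j in range(M)])
    # Backward pass beta[t][i] = P(y_{t+1}..y_T | q_t = i).
    beta = [[1.0] * M for _ in range(T)]
    for t in range(T - 2, -1, -1):
        beta[t] = [sum(A[i][j] * eta[j][y[t + 1]] * beta[t + 1][j]
                       for j in range(M)) for i in range(M)]
    Py = sum(alpha[T - 1])
    gamma = [[alpha[t][i] * beta[t][i] / Py for i in range(M)] for t in range(T)]
    xi = [[[alpha[t][i] * A[i][j] * eta[j][y[t + 1]] * beta[t + 1][j] / Py
            for j in range(M)] for i in range(M)] for t in range(T - 1)]
    # M-step: ratios of expected counts.
    new_pi = gamma[0][:]
    new_A = [[sum(xi[t][i][j] for t in range(T - 1)) /
              sum(gamma[t][i] for t in range(T - 1))
              for j in range(M)] for i in range(M)]
    new_eta = [[sum(gamma[t][i] for t in range(T) if y[t] == k) /
                sum(gamma[t][i] for t in range(T))
                for k in range(N)] for i in range(M)]
    return new_pi, new_A, new_eta

pi = [0.5, 0.5]
A = [[0.7, 0.3], [0.4, 0.6]]
eta = [[0.9, 0.1], [0.2, 0.8]]
new_pi, new_A, new_eta = em_step([0, 1, 0], pi, A, eta)
```

Each iteration is guaranteed not to decrease the observed likelihood; in practice the updates are accumulated over all the IID training sequences before renormalizing.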
Extensions of HMMs
- Observable Operator Models: observable representations of hidden states.
- Semi-Markov HMMs: state durations need not be geometrically distributed, but can be arbitrary.
- Hierarchical HMMs: multi-level, tree-structured models, which are a special case of probabilistic context-free grammars.
- Abstract Hidden Markov Models (AHMMs): HMMs with state-mediated transitions across levels of abstraction.
Hierarchical HMMs
Fine, Singer, and Tishby, The Hierarchical Hidden Markov Model, Machine Learning, 1998.
[Figure: a hierarchical HMM with internal states s1-s5, production states s6-s8, end states e1, e3, e4, and transition probabilities on the arcs. The observation model for s6 is FRONT(W: 0.1, O: 0.9), LEFT(W: 0.9, O: 0.1), BACK(W: 0.1, O: 0.9), RIGHT(W: 0.9, O: 0.1).]
Using Hierarchical HMMs in Robot Navigation
The actual model used extends the hierarchical HMM to include temporally extended actions, such as "exit corridor" (a hierarchical POMDP).
Observation vectors: Front, Left, Right, Back (Wall or Opening); Door on Right; Stripe on Right Wall.
See [Theocharous, Rohanimanesh, and Mahadevan, ICRA 2001] and [Theocharous, Murphy, and Kaelbling, IJCAI 2003] for details.
[Figure: a corridor environment with states S1-S4 and locations A and B.]
Foveal Face Recognition using HMMs
Minut, Mahadevan, Henderson, and Dyer, "Face Recognition using Foveal Vision", in Lecture Notes in Computer Science: Biologically Motivated Computer Vision, Seong-Whan Lee, Heinrich H. Bülthoff, and Tomaso Poggio (editors), vol. 1811, pp. 424-433, Springer-Verlag, 2000.
Comparing Sliding Windows with Foveation
A competing approach: slide a window of fixed size down the image [Samaria, PhD thesis, Cambridge].
[Figure: average recognition rate of subsampled vs. foveal HMM classifiers (women subjects), plotted against the number of states (3-11).]
Abstract Hidden Markov Model
[Bui, Venkatesh, and West, JAIR; IJCAI 2003]
[Figure: the AHMM as a dynamic Bayesian network, with policy variables Π2, Π1 and termination variables E2, E1 at levels 2 and 1, and actions A, states S, and observations O at level 0.]