Learning from Sequential and Time-Series Data


1 Learning from Sequential and Time-Series Data

Sridhar Mahadevan, University of Massachusetts

2 Sequential and Time-Series Data

Many real-world applications involve sequential or time-series data:
- Gene and protein sequences in bioinformatics
- Natural language text
- Modeling human and robot behavior
- GPS tracking of cars, missiles, satellites, etc.
- Sensor networks, weather prediction

Example models: Markov chains, Markov decision processes, hidden Markov models, partially observable MDPs, conditional random fields, observable operator models, predictive state representations, etc.

3 Sequence Models

A sequence model $M = \langle O, S, A, P_o, P_t, R \rangle$ consists of:
- $O$, a set of continuous or discrete observations
- $S$, a set of states
- $A$, a set of actions or decisions
- $P_o$, an observation model, where $P_o(y \mid s)$ is the probability of observing $y$ in state $s$
- $P_t$, a transition model, where $P^a_{ss'}$ is the probability of moving from state $s$ to $s'$ under action $a$
- $R$, a reward or cost function

4 Markov Chains

The most widely studied sequential model, developed by Markov to study text (Pushkin). A Markov chain $M = \langle S, P_t, \pi_0 \rangle$ is specified by:
- A set of discrete or continuous states $S$
- A transition probability $P_t(s' \mid s)$ of moving from $s$ to $s'$
- An initial distribution $\pi_0(s)$ of starting in state $s$

Maximum likelihood estimation of a Markov chain is theoretically trivial, but can be practically challenging. Example: a bigram model of English text (where states are words); a sketch of the counting estimator follows below.
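To make the estimator concrete, here is a minimal sketch (mine, not the slides'): maximum likelihood for a discrete Markov chain reduces to normalized counts of initial states and of observed transitions. The function name and toy corpus are illustrative.

```python
from collections import Counter, defaultdict

def fit_markov_chain(sequences):
    """Maximum-likelihood estimates for a discrete Markov chain:
    pi_0(s) from counts of initial states, P_t(s' | s) from
    normalized bigram counts."""
    init = Counter()
    trans = defaultdict(Counter)
    for seq in sequences:
        init[seq[0]] += 1
        for s, s_next in zip(seq, seq[1:]):
            trans[s][s_next] += 1
    n = sum(init.values())
    pi0 = {s: c / n for s, c in init.items()}
    P = {s: {s2: c / sum(row.values()) for s2, c in row.items()}
         for s, row in trans.items()}
    return pi0, P

# Bigram model of text: states are words.
corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]]
pi0, P = fit_markov_chain(corpus)
print(P["the"])   # {'cat': 0.5, 'dog': 0.5}
```

The practical challenge the slide alludes to is sparsity: with word states, most bigrams never occur, so real systems smooth these counts rather than use raw maximum likelihood.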

5 Hidden Markov Models

- A finite set of states $Q$, where $|Q| = M$. The state $q_t$ is a multinomial random variable: $q_t^i = 1$ for exactly one value of $i$, and $0$ otherwise.
- An initial distribution $\pi$ on $Q$, where $\pi_i = P(q_0^i = 1)$.
- An observation model $\Omega = P(y_t \mid q_t)$, where the space of observations is discrete or continuous (real-valued). Denote by $\eta_{ij}$ the probability that state $i$ produces observation $j$.
- A transition matrix $a_{ij} = P(q_{t+1}^j \mid q_t^i)$.
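The sketches accompanying later slides assume these quantities are stored as arrays. A hypothetical setup (the names pi, A, and eta are mine, mirroring the slide's notation; the random initialization is just for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

M, N = 3, 4                               # M states, N observation symbols
pi  = np.full(M, 1.0 / M)                 # pi[i]  = P(q_0 = i)
A   = rng.dirichlet(np.ones(M), size=M)   # A[i, j]   = a_ij = P(q_{t+1} = j | q_t = i)
eta = rng.dirichlet(np.ones(N), size=M)   # eta[i, k] = P(y_t = k | q_t = i)
```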

6 HMM Applications

- Perception: face recognition, gesture recognition, handwriting, speech recognition
- Robot navigation
- Biology: DNA sequence prediction
- Language analysis: part-of-speech tagging
- Smart rooms, wearable devices

7 Hidden Markov Models: References

The model was developed by Baum in the 1960s. HMMs are instances of dynamic Bayesian networks (DBNs) [Dean et al., 1989; Murphy, 2002].

L. R. Rabiner, A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition, Proceedings of the IEEE, vol. 77, no. 2, pp. 257-286, February 1989.

Frederick Jelinek, Statistical Methods in Speech Recognition, MIT Press, 1997.

8 HMM Graphical Model

[Figure: the HMM as a graphical model; a chain of states $q_0 \rightarrow q_1 \rightarrow \cdots \rightarrow q_T$ with prior $\pi$ on $q_0$, each state $q_t$ emitting an observation $y_t$.]

Conditioning on a state $q_t$ d-separates the observation sequence into three categories: the current observation $y_t$, the past $y_0, \ldots, y_{t-1}$, and the future $y_{t+1}, \ldots, y_T$.

9 Markov Properties of HMMs

The future is independent of the past given the present:

$P(q_{t+1} \mid q_t, q_{t-1}, \ldots, q_0) = P(q_{t+1} \mid q_t)$

$P(y_0, \ldots, y_T \mid q_t) = P(y_0, \ldots, y_t \mid q_t)\, P(y_{t+1}, \ldots, y_T \mid q_t)$

HMMs are more powerful than any finite-memory device:

$P(y_{t+1} \mid y_t, y_{t-1}, \ldots, y_{t-k}) \neq P(y_{t+1} \mid y_t, y_{t-1}, \ldots, y_{t-k+1})$

10 Basic Problems in HMMs

- Likelihood: Given an observation sequence $Y = y_0 y_1 \ldots y_T$ and a model $\theta = (A, \Omega, \pi)$, determine the likelihood $P(Y \mid \theta)$.
- Filtering: Given an observation sequence $Y_t = y_0 y_1 \ldots y_t$ and a model $\theta = (A, \Omega, \pi)$, determine the belief state $P(q_t \mid Y_t, \theta)$.
- Prediction: Given an observation sequence $Y_t = y_0 y_1 \ldots y_t$ and a model $\theta = (A, \Omega, \pi)$, determine the posterior over a future state $P(q_s \mid Y_t, \theta)$, $s > t$.

11 Basic Problems in HMMs

- Smoothing: Given an observation sequence $Y_t = y_0 y_1 \ldots y_t$ and a model $\theta = (A, \Omega, \pi)$, determine the posterior over a previous state $P(q_s \mid Y_t, \theta)$, $s < t$.
- Most likely explanation: Given an observation sequence $Y = y_0 y_1 \ldots y_T$ and a model $\theta = (A, \Omega, \pi)$, find the most likely sequence of states $q_0 q_1 \ldots q_T$ given $y$, that is, compute $\arg\max_{q_0, \ldots, q_T} P(q_0, \ldots, q_T \mid y)$.
- Learning: Find the model parameters $\theta$ that maximize $\prod_i P(Y_i \mid \theta)$ over multiple independent (IID) sequences $Y_i$.

12 Example: Robot Navigation

[Figure: a four-state corridor HMM with observation sequence $O$ = (Wall, Wall, Opening, Wall); states 1-4 are linked by transition probabilities $a_{12}, a_{23}, a_{34}$, with observation probabilities P(Wall | 1), P(Wall | 2), P(Opening | 3), P(Wall | 4).]

13 Inference in HMMs

The complete data for an HMM is the output sequence $y$ produced, along with the (hidden) state sequence traversed:

$P(q, y) = P(q_0) \prod_{t=0}^{T-1} P(q_{t+1} \mid q_t) \prod_{t=0}^{T} P(y_t \mid q_t)$

$= \pi_{q_0} \prod_{t=0}^{T-1} a_{q_t, q_{t+1}} \prod_{t=0}^{T} P(y_t \mid q_t)$

$= \prod_{i=1}^{M} (\pi_i)^{q_0^i} \prod_{t=0}^{T-1} \prod_{i,j=1}^{M} (a_{ij})^{q_t^i q_{t+1}^j} \prod_{t=0}^{T} \prod_{i=1}^{M} \prod_{j=1}^{N} (\eta_{ij})^{q_t^i y_t^j}$

14 Maximizing Observed Likelihood

The inference problem is computing the probability of a state sequence:

$P(q \mid y) = \frac{P(q, y)}{P(y)}$

To get the probability of the output $y$, we have to sum over all possible state sequences (which is intractable!):

$p(y \mid \theta) = \sum_{q_0} \sum_{q_1} \cdots \sum_{q_T} \pi_{q_0} \prod_{t=0}^{T-1} a_{q_t, q_{t+1}} \prod_{t=0}^{T} P(y_t \mid q_t)$

We cannot easily maximize the observed data's likelihood by analytically solving $\arg\max_\theta \ell(\theta \mid y) = \arg\max_\theta \log p(y \mid \theta)$.

15 Structuring Inference in HMMs

We can condition on a particular state $q_t$ to decompose the inference:

$P(q_t \mid y) = \frac{P(y \mid q_t)\, P(q_t)}{P(y)} = \frac{P(y_0, \ldots, y_t \mid q_t)\, P(y_{t+1}, \ldots, y_T \mid q_t)\, P(q_t)}{P(y)} = \frac{P(y_0, \ldots, y_t, q_t)\, P(y_{t+1}, \ldots, y_T \mid q_t)}{P(y)} = \frac{\alpha(q_t)\, \beta(q_t)}{P(y)}$

Note that $P(y) = \sum_{q_t} \alpha(q_t)\, \beta(q_t)$.

16 Forward Step

$\alpha(q_{t+1}) = P(y_0, \ldots, y_{t+1}, q_{t+1})$
$= P(y_0, \ldots, y_{t+1} \mid q_{t+1})\, P(q_{t+1})$
$= P(y_0, \ldots, y_t \mid q_{t+1})\, P(y_{t+1} \mid q_{t+1})\, P(q_{t+1})$
$= P(y_0, \ldots, y_t, q_{t+1})\, P(y_{t+1} \mid q_{t+1})$
$= \sum_{q_t} P(y_0, \ldots, y_t, q_t, q_{t+1})\, P(y_{t+1} \mid q_{t+1})$
$= \sum_{q_t} P(y_0, \ldots, y_t \mid q_t)\, P(q_{t+1} \mid q_t)\, P(q_t)\, P(y_{t+1} \mid q_{t+1})$
$= \sum_{q_t} \alpha(q_t)\, a_{q_t, q_{t+1}}\, P(y_{t+1} \mid q_{t+1})$

17 The Forward Algorithm

Initialization: $\alpha(q_0) = P(y_0, q_0) = \pi_{q_0}\, P(y_0 \mid q_0)$, where $1 \le q_0 \le M$.

Induction: $\alpha(q_{t+1}) = \sum_{q_t} \alpha(q_t)\, a_{q_t, q_{t+1}}\, P(y_{t+1} \mid q_{t+1})$, where $0 \le t \le T-1$ and $1 \le q_t, q_{t+1} \le M$.

Time complexity: computing the $\alpha$ values at one time step takes $O(M^2)$, since we have to iterate over all $M$ values of $q_t$ and $q_{t+1}$. In total, the forward algorithm takes $O(M^2 T)$, since we iterate over each time step ($t = 1, \ldots, T$).
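A minimal sketch of the recursion, assuming the array layout from the setup above (pi, A, eta) and an integer-coded observation sequence y. In practice each step is usually rescaled to avoid numerical underflow; this unscaled version mirrors the slide directly.

```python
import numpy as np

def forward(pi, A, eta, y):
    """Forward pass: alpha[t, i] = P(y_0, ..., y_t, q_t = i)."""
    T, M = len(y), len(pi)
    alpha = np.zeros((T, M))
    alpha[0] = pi * eta[:, y[0]]                  # initialization
    for t in range(T - 1):                        # induction, O(M^2) per step
        alpha[t + 1] = (alpha[t] @ A) * eta[:, y[t + 1]]
    return alpha                                  # P(y) = alpha[-1].sum()
```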

18 Backward Phase

$\beta(q_t) = P(y_{t+1}, \ldots, y_T \mid q_t)$
$= \sum_{q_{t+1}} P(y_{t+1}, \ldots, y_T, q_{t+1} \mid q_t)$
$= \sum_{q_{t+1}} P(y_{t+1}, \ldots, y_T \mid q_{t+1}, q_t)\, P(q_{t+1} \mid q_t)$
$= \sum_{q_{t+1}} P(y_{t+2}, \ldots, y_T \mid q_{t+1})\, P(y_{t+1} \mid q_{t+1})\, P(q_{t+1} \mid q_t)$
$= \sum_{q_{t+1}} \beta(q_{t+1})\, P(y_{t+1} \mid q_{t+1})\, a_{q_t, q_{t+1}}$

19 The Backward Algorithm

The $\beta$ variables can also be calculated recursively, except we proceed backwards:

Initialization: define $\beta(q_T) = 1$.

Induction: $\beta(q_t) = \sum_{q_{t+1}} \beta(q_{t+1})\, P(y_{t+1} \mid q_{t+1})\, a_{q_t, q_{t+1}}$, where $1 \le t \le T-1$ and $1 \le q_t, q_{t+1} \le M$.

Time complexity: computing the $\beta$ values at one time step takes $O(M^2)$, since we have to iterate over all $M$ values of $q_t$ and $q_{t+1}$. In total, the backward algorithm takes $O(M^2 T)$, since we iterate over each time step ($t = 1, \ldots, T$).
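The corresponding sketch, under the same assumptions (and with the same underflow caveat) as the forward() sketch above:

```python
import numpy as np

def backward(A, eta, y):
    """Backward pass: beta[t, i] = P(y_{t+1}, ..., y_T | q_t = i)."""
    T, M = len(y), A.shape[0]
    beta = np.ones((T, M))                        # initialization: beta at the final step is 1
    for t in range(T - 2, -1, -1):                # proceed backwards
        beta[t] = A @ (eta[:, y[t + 1]] * beta[t + 1])
    return beta
```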

20 Most Likely State

The most likely state at any time $t$ can be readily calculated from the $\alpha$ and $\beta$ variables. We can define the most likely state as

$q_t^{MLS} = \arg\max_{1 \le i \le M} \gamma(q_t^i)$

where the variable $\gamma(q_t^i)$ is easily calculated as

$\gamma(q_t^i) = \frac{\alpha(q_t^i)\, \beta(q_t^i)}{\sum_j \alpha(q_t^j)\, \beta(q_t^j)}$
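Combining the two passes (a sketch; the function name is mine):

```python
import numpy as np

def posterior_states(alpha, beta):
    """gamma[t, i] = P(q_t = i | y), from the forward/backward tables."""
    gamma = alpha * beta
    gamma /= gamma.sum(axis=1, keepdims=True)     # each row's sum of alpha*beta equals P(y)
    return gamma

# Most likely state at each t. Note this is NOT the Viterbi path,
# which optimizes over the whole sequence jointly:
# mls = posterior_states(alpha, beta).argmax(axis=1)
```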

21 Most Likely State Sequence

Let us define the probability of the most likely sequence that accounts for all observations up to time $t$ and ends in state $q_t^i$ as

$\delta_t(i) = \max_{q_0, q_1, \ldots, q_{t-1}} P(q_0, \ldots, q_{t-1}, q_t = i, y_0, \ldots, y_t \mid \theta)$

We can compute this probability by induction:

$\delta_{t+1}(i) = \left( \max_j \delta_t(j)\, a_{q_t^j, q_{t+1}^i} \right) P(y_{t+1} \mid q_{t+1}^i)$

22 The Viterbi Algorithm

Initialize:
$\delta_0(i) = \pi_i\, P(y_0 \mid q_0^i)$, $1 \le i \le M$
$\psi_0(i) = 0$

Recursion:
$\delta_t(i) = \left( \max_{1 \le j \le M} \delta_{t-1}(j)\, a_{q_{t-1}^j, q_t^i} \right) P(y_t \mid q_t^i)$
$\psi_t(i) = \arg\max_{1 \le j \le M} \delta_{t-1}(j)\, a_{q_{t-1}^j, q_t^i}$

23 The Viterbi Algorithm

Termination: $P^* = \max_{1 \le i \le M} \delta_T(i)$

Path computation:
$q_T^* = \arg\max_{1 \le i \le M} \delta_T(i)$
$q_t^* = \psi_{t+1}(q_{t+1}^*)$, for $t = T-1, T-2, \ldots, 0$
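The slides' recursion multiplies probabilities; the sketch below works in log space instead, a standard variant that avoids underflow on long sequences. It assumes the same pi, A, eta layout as the earlier sketches and strictly positive parameters (so the logs are finite).

```python
import numpy as np

def viterbi(pi, A, eta, y):
    """Most likely state sequence argmax_q P(q_0, ..., q_T | y)."""
    T, M = len(y), len(pi)
    log_A = np.log(A)
    delta = np.zeros((T, M))                       # delta[t, i], in log space
    psi = np.zeros((T, M), dtype=int)              # backpointers
    delta[0] = np.log(pi) + np.log(eta[:, y[0]])
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_A     # scores[j, i] = delta_{t-1}(j) + log a_ji
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + np.log(eta[:, y[t]])
    path = np.empty(T, dtype=int)
    path[-1] = delta[-1].argmax()                  # termination
    for t in range(T - 2, -1, -1):                 # path computation (backtracking)
        path[t] = psi[t + 1][path[t + 1]]
    return path
```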

24 Learning the HMM Parameters

The $\alpha$ and $\beta$ variables enable estimation of the observation model. To estimate the transition matrix, we introduce a new variable:

$\xi(q_t, q_{t+1}) = P(q_t, q_{t+1} \mid y) = \frac{P(y \mid q_t, q_{t+1})\, P(q_{t+1} \mid q_t)\, P(q_t)}{P(y)} = \frac{\alpha(q_t)\, a_{q_t, q_{t+1}}\, P(y_{t+1} \mid q_{t+1})\, \beta(q_{t+1})}{P(y)}$

25 Maximum Likelihood for HMMs

Generally, we want to find the parameters $\theta$ that maximize the (log) likelihood $\log P(y \mid \theta)$, where

$\log p(y \mid \theta) = \log \sum_{q_0} \sum_{q_1} \cdots \sum_{q_T} \pi_{q_0} \prod_{t=0}^{T-1} a_{q_t, q_{t+1}} \prod_{t=0}^{T} P(y_t \mid q_t)$

As before, we note that this involves taking the log of a sum over all state sequences, which is difficult to optimize directly.

26 Complete Log-likelihood

To simplify the maximum-likelihood computation, we postulate a complete dataset $(y, q)$, for which

$\ell_c(\theta; q, y) = \log \left[ \prod_{i=1}^{M} (\pi_i)^{q_0^i} \prod_{t=0}^{T-1} \prod_{i,j=1}^{M} (a_{ij})^{q_t^i q_{t+1}^j} \prod_{t=0}^{T} \prod_{i=1}^{M} \prod_{j=1}^{N} (\eta_{ij})^{q_t^i y_t^j} \right]$

$= \sum_{i=1}^{M} q_0^i \log \pi_i + \sum_{t=0}^{T-1} \sum_{i,j=1}^{M} q_t^i q_{t+1}^j \log a_{ij} + \sum_{t=0}^{T} \sum_{i=1}^{M} \sum_{j=1}^{N} q_t^i y_t^j \log \eta_{ij}$

27 M-Step of EM for HMMs

Differentiating the complete log-likelihood with respect to $a_{ij}$, with a Lagrange multiplier $\lambda_i$ enforcing $\sum_{j=1}^{M} a_{ij} = 1$:

$\frac{\partial \ell_c(\theta; y, q)}{\partial a_{ij}} = \frac{\partial}{\partial a_{ij}} \left( \sum_{t=0}^{T-1} \sum_{i,j=1}^{M} q_t^i q_{t+1}^j \log a_{ij} + \sum_{i=1}^{M} \lambda_i \left( \sum_{j=1}^{M} a_{ij} - 1 \right) \right) = \frac{\sum_{t=0}^{T-1} q_t^i q_{t+1}^j}{a_{ij}} + \lambda_i = 0$

$\hat{a}_{ij} = -\frac{\sum_{t=0}^{T-1} q_t^i q_{t+1}^j}{\lambda_i}$

Exploiting $\sum_{j=1}^{M} a_{ij} = 1$ to eliminate $\lambda_i$:

$\hat{a}_{ij} = \frac{\sum_{t=0}^{T-1} q_t^i q_{t+1}^j}{\sum_{j'=1}^{M} \sum_{t=0}^{T-1} q_t^i q_{t+1}^{j'}}$

28 M-Step of EM for HMMs

$\hat{\eta}_{ij} = \frac{\sum_{t=0}^{T} q_t^i y_t^j}{\sum_{k=1}^{N} \sum_{t=0}^{T} q_t^i y_t^k}$, exploiting $\sum_{j=1}^{N} \eta_{ij} = 1$

$\hat{\pi}_i = q_0^i$

29 E-step for HMMs

The E-step replaces the hidden quantities in the complete log-likelihood with their expectations given the data and the current parameters $\theta^k$:

$E\left( \sum_{t=0}^{T} q_t^i y_t^j \,\middle|\, y, \theta^k \right) = \sum_{t=0}^{T} E(q_t^i \mid y, \theta^k)\, y_t^j = \sum_{t=0}^{T} P(q_t^i = 1 \mid y, \theta^k)\, y_t^j = \sum_{t=0}^{T} \gamma_t^i y_t^j$

$E\left( \sum_{t=0}^{T-1} q_t^i q_{t+1}^j \,\middle|\, y, \theta^k \right) = \sum_{t=0}^{T-1} P(q_t^i q_{t+1}^j = 1 \mid y, \theta^k) = \sum_{t=0}^{T-1} \xi_{t,t+1}^{i,j}$

30 M-Step in HMMs

$\hat{\eta}_{ij}^{(p+1)} = \frac{\sum_{t=0}^{T} \gamma_t^i y_t^j}{\sum_{k=1}^{N} \sum_{t=0}^{T} \gamma_t^i y_t^k} = \frac{\sum_{t=0}^{T} \gamma_t^i y_t^j}{\sum_{t=0}^{T} \gamma_t^i}$

$\hat{a}_{ij}^{(p+1)} = \frac{\sum_{t=0}^{T-1} \xi_{t,t+1}^{i,j}}{\sum_{j'=1}^{M} \sum_{t=0}^{T-1} \xi_{t,t+1}^{i,j'}} = \frac{\sum_{t=0}^{T-1} \xi_{t,t+1}^{i,j}}{\sum_{t=0}^{T-1} \gamma_t^i}$

$\hat{\pi}_i^{(p+1)} = \gamma_0^i$
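Putting the E-step and M-step together: a minimal single-sequence Baum-Welch iteration, assuming the forward() and backward() sketches from the earlier slides and strictly positive parameters. This is an unscaled illustration of the update equations, not a production implementation.

```python
import numpy as np

def baum_welch_step(pi, A, eta, y):
    """One EM iteration on a single observation sequence y."""
    alpha, beta = forward(pi, A, eta, y), backward(A, eta, y)
    T = len(y)
    # E-step: gamma[t, i] = P(q_t = i | y); xi[t, i, j] = P(q_t = i, q_{t+1} = j | y)
    gamma = alpha * beta
    gamma /= gamma.sum(axis=1, keepdims=True)
    xi = (alpha[:-1, :, None] * A[None, :, :] *
          (eta[:, y[1:]].T * beta[1:])[:, None, :])
    xi /= xi.sum(axis=(1, 2), keepdims=True)
    # M-step: the re-estimation formulas from this slide
    pi_new = gamma[0]
    A_new = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    eta_new = np.zeros_like(eta)
    for t in range(T):
        eta_new[:, y[t]] += gamma[t]
    eta_new /= gamma.sum(axis=0)[:, None]
    return pi_new, A_new, eta_new
```

Iterating this update monotonically increases the observed-data likelihood $P(y \mid \theta)$, which is the usual EM guarantee.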

31 Extensions of HMMs

- Observable Operator Models: observable representations of hidden states
- Semi-Markov HMMs: state durations need not be geometric, but can be arbitrary
- Hierarchical HMMs: multi-level tree-structured models, which are a special case of probabilistic context-free grammars
- Abstract Hidden Markov Models (AHMMs): models with state-mediated transitions

32 Hierarchical HMMs

Fine, Singer, and Tishby, The Hierarchical Hidden Markov Model, Machine Learning, 32(1):41-62, 1998.

[Figure: a two-level hierarchical HMM over states s1-s8 with end states; the observation model for s6 gives, for each direction, wall/opening probabilities: FRONT (W: 0.1, O: 0.9), LEFT (W: 0.9, O: 0.1), BACK (W: 0.1, O: 0.9), RIGHT (W: 0.9, O: 0.1).]

33 Using Hierarchical HMMs in Robot Navigation

The actual model used extends the hierarchical HMM to include temporally extended actions, such as "exit corridor" (a hierarchical POMDP).

Observation vectors: Front, Left, Right, Back: Wall, Opening; Door on Right; Stripe on Right Wall.

See [Theocharous, Rohanimanesh, and Mahadevan, ICRA 2001] and [Theocharous, Murphy, and Kaelbling, IJCAI 2003] for details.

[Figure: corridor environment with states S1-S4 and locations A and B.]

34 Foveal Face Recognition using HMMs

Minut, Mahadevan, Henderson, and Dyer, "Face Recognition using Foveal Vision", in Lecture Notes in Computer Science: Biologically Motivated Computer Vision, Seong-Whan Lee, Heinrich H. Bülthoff, and Tomaso Poggio (editors), vol. 1811, Springer-Verlag, 2000.

35 Comparing Sliding Windows with Foveation

Competing approach: slide a fixed-size window down the image [Samaria, PhD thesis, Cambridge].

[Plot: average recognition rate vs. number of states, comparing subsampled and foveal HMM classifiers (women-foveal and women-subsampled curves).]

36 Abstract Hidden Markov Model

[Bui, Venkatesh, and West: JAIR; IJCAI 03]

[Figure: DBN structure of the abstract HMM; policy variables Π2 (level 2) and Π1 (level 1) with termination variables E2 and E1, over a level-0 chain of actions A, states S, and observations O.]
