LEARNING DYNAMIC SYSTEMS: MARKOV MODELS


1 LEARNING DYNAMIC SYSTEMS: MARKOV MODELS
- Markov Process and Markov Chains
- Hidden Markov Models
- Kalman Filters

2 Types of dynamic systems
Predictability:
- Easily predictable state
- Hardly predictable state: state noise
Observability:
- Easily observable state
- Hardly observable state: measurement noise
- Partially observable state + measurement noise
Examples: trajectory of a satellite; controlled underwater vehicle with an inertial navigation system in calm water; trajectory of an indoor controlled robot (SLAM); trajectory of a GPS-localized vehicle; radiolocalized phones using GSM cell triangulation; phone call localization, speech recognition.
Problems:
- Future state prediction
- Current state estimation (filtering)
- Past state estimation (smoothing)
- Past state trajectory

3 Markov models: a global view
Models by type of state and observability:
- Discrete state, observable: Markov chains
- Discrete state, partially observable: hidden Markov models (HMM)
- Linear continuous models: Kalman filter, ARMA models, etc.
- Non-linear continuous models: extended Kalman filter, particle filters, etc.

4 LEARNING DYNAMIC SYSTEMS: MARKOV MODELS
- Markov Process and Markov Chains
- Hidden Markov Models
- Bayesian Filtering and Kalman Filter

5 Markov process
Stochastic process: a sequence of random variables $X_0, \dots, X_t$, with
$$P(X_0, \dots, X_t) = P(X_0) \prod_{i=1}^{t} P(X_i \mid X_0, \dots, X_{i-1})$$
Markov process: «Knowing the past doesn't help to predict the future when the (close) present is known.»
$$P(X_0, \dots, X_t) = P(X_0) \prod_{i=1}^{t} P(X_i \mid X_{i-1}, \dots, X_{\max(i-k,\,0)})$$
$X$ is the system state, $k$ is the order.
Example: linear autoregressive (AR) models for prediction in economics and finance,
$$X_t = a_1 X_{t-1} + \dots + a_k X_{t-k} + \varepsilon_t$$
where $\varepsilon_t \sim \mathcal{N}(0, \sigma_t^2)$ is a normal white noise.
[Figures: dependency graphs over $X_0, X_1, X_2, X_3$ for a Markov process of order 2 and for a Markov process of order 1.]

6 Definition: Markov Chain
A Markov chain is a Markov process of order 1 with an observable discrete state (from 1 to $n$).
Parameterization: distributions of the initial state and of the transitions,
$$\theta = \left( (p_i)_i,\ (p^t_{i,j})_{t,i,j} \right), \qquad p_i = P(X_0 = i), \qquad p^t_{i,j} = P(X_t = j \mid X_{t-1} = i)$$
Homogeneous chain: time-independent transitions, $p_{i,j} = P(X_t = j \mid X_{t-1} = i)$.
Representation as the graph of a stochastic finite state machine.

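7 Matrix representation of Markov Chains
A Markov chain with $n$ states is defined by an $n \times n$ transition matrix:
$$P_t = \left( p^t_{i,j} \right)_{i,j} = \left( P(X_t = j \mid X_{t-1} = i) \right)_{i,j}$$
Property: $P_t$ is stochastic, i.e. $\forall i,\ \sum_{j=1}^{n} p^t_{i,j} = 1$.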

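8 State prediction from a Markov chain
Fundamental property: writing $\pi_t = \left( P(X_t = 1), \dots, P(X_t = n) \right)^T$ for the state distribution at time $t$,
$$P(X_t = j) = \sum_{i=1}^{n} P(X_t = j \mid X_{t-1} = i)\, P(X_{t-1} = i), \qquad \text{i.e.} \qquad \pi_t = P_t^T\, \pi_{t-1}$$
(The transposition was forgotten in the lecture notes.)
Predicting the state at «t + k»:
- General case: $\pi_{t+k} = P_{t+k}^T \cdots P_{t+1}^T\, \pi_t$
- Homogeneous case: $\pi_{t+k} = (P^T)^k\, \pi_t$
Likelihood of an observation $x_0, \dots, x_T$:
$$P(x_0, \dots, x_T \mid \theta) = P(X_0 = x_0) \prod_{i=1}^{T} p^i_{x_{i-1},\,x_i}$$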
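
The prediction and likelihood formulas of this slide can be sketched in a few lines of Python. This is only an illustration: the 2-state matrix `P`, the initial distribution `pi_0` and the helper names are hypothetical, states are indexed from 0, and the chain is assumed homogeneous.

```python
def transpose_times(P, v):
    """Return P^T v for a row-stochastic matrix P (list of rows) and a vector v."""
    n = len(P)
    return [sum(P[i][j] * v[i] for i in range(n)) for j in range(n)]


def predict(P, pi_t, k):
    """State distribution k steps ahead for a homogeneous chain: (P^T)^k pi_t."""
    pi = list(pi_t)
    for _ in range(k):
        pi = transpose_times(P, pi)
    return pi


def likelihood(P, pi_0, xs):
    """P(x_0..x_T | theta) = P(X_0 = x_0) * product of p[x_{i-1}][x_i]."""
    prob = pi_0[xs[0]]
    for prev, cur in zip(xs, xs[1:]):
        prob *= P[prev][cur]
    return prob


# Hypothetical 2-state chain (not from the slides).
P = [[0.9, 0.1],
     [0.5, 0.5]]
pi_0 = [1.0, 0.0]

pi_2 = predict(P, pi_0, 2)                    # distribution two steps ahead
seq_prob = likelihood(P, pi_0, [0, 0, 1, 1])  # likelihood of one trajectory
```

Note that `transpose_times` multiplies by the transpose of `P`, which is exactly the point of the correction on the slide.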

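9 Learning a Markov Chain
Given $N$ i.i.d. state sequences $s^1 = (x^1_0, \dots, x^1_T), \dots, s^N = (x^N_0, \dots, x^N_T)$, what is the underlying Markov chain?
MLE estimation by frequencies:
$$\hat p^t_{ij} = \frac{\left|\{k \text{ s.t. } x^k_t = j \text{ and } x^k_{t-1} = i\}\right|}{\left|\{k \text{ s.t. } x^k_{t-1} = i\}\right|}$$
Problem: many coefficients are likely to be equal to 0.
Introduction of a Dirichlet prior, $(p^t_{ij})_{1 \le j \le n} \sim \mathrm{Dir}\left((\alpha^t_{ij})_{1 \le j \le n}\right)$, which gives
$$\hat p^t_{ij} = \frac{\left|\{k \text{ s.t. } x^k_t = j \text{ and } x^k_{t-1} = i\}\right| + \alpha^t_{ij}}{\left|\{k \text{ s.t. } x^k_{t-1} = i\}\right| + \sum_{j'} \alpha^t_{ij'}}$$
In general $\alpha^t_{ij} = 1$ (uniform distribution).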
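
A minimal sketch of this estimator, under one simplifying assumption that is not on the slide: the chain is taken to be homogeneous, so transitions are pooled over all time steps instead of being counted separately for each $t$. The function name `learn_chain` and the toy sequences are hypothetical.

```python
def learn_chain(sequences, n, alpha=1.0):
    """Estimate a homogeneous transition matrix from state sequences.

    counts[i][j] is the number of observed transitions i -> j. alpha is the
    Dirichlet pseudo-count added to every cell; alpha = 0 gives the plain
    frequency (MLE) estimate.
    """
    counts = [[0] * n for _ in range(n)]
    for seq in sequences:
        for i, j in zip(seq, seq[1:]):
            counts[i][j] += 1
    P = []
    for i in range(n):
        total = sum(counts[i]) + alpha * n
        if total == 0:
            # State never visited and no prior: fall back to a uniform row.
            P.append([1.0 / n] * n)
        else:
            P.append([(counts[i][j] + alpha) / total for j in range(n)])
    return P


# Hypothetical toy data: two sequences over states 0, 1, 2.
sequences = [[0, 0, 1, 2], [0, 1, 1, 0]]
P_mle = learn_chain(sequences, n=3, alpha=0.0)
P_smooth = learn_chain(sequences, n=3, alpha=1.0)
```

With `alpha=0.0` the transition 0 -> 2, never observed, gets probability 0; with `alpha=1.0` every transition keeps a positive probability, which is the purpose of the prior.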


10 Example of application
One has a corpus of classical music scores with the name of their composer.
1) Problem of supervised classification: given a new score, guess the composer.
2) Problem of trajectory generation: generate a score that sounds as if it were composed by two given composers.
Xenakis and the «stochastic music».

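11 Solution
Learn a homogeneous Markov chain of order $k$ for each composer $c$. State space = (pitch, length) of a note.
Compute:
$$p^c_{i_k, \dots, i_1, j} = \frac{\left|\{t \text{ s.t. } X_t = j \text{ and } X_{t-1} = i_1 \dots \text{ and } X_{t-k} = i_k\}\right| + 1}{\left|\{t \text{ s.t. } X_{t-1} = i_1 \dots \text{ and } X_{t-k} = i_k\}\right| + n}$$
Find the optimal $k$ by cross-validation.
Predict the composer of a score $x_0, \dots, x_T$:
- For each composer, compute the likelihood $P(x_0, \dots, x_T \mid C = c) = \prod_{t} p^c_{x_{t-k}, \dots, x_{t-1}, x_t}$.
- Choose $\hat c = \arg\max_c P(x_0, \dots, x_T \mid C = c)$.
Problem of trajectory generation:
- Averaging both transition matrices is a bad idea (loss of information).
- Use a hierarchical Markov chain with two states (one per composer).
- Generate a note from the Markov chain of the current composer.
[Figure: two-state chain over composers A and B, each staying with probability 0.9 and switching with probability 0.1.]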
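
The classification step can be sketched as follows. To stay short, this sketch uses order $k = 1$, integer-coded notes, log-likelihoods instead of raw products (to avoid underflow on long scores), and it ignores the initial-state term. The corpus and composer names are made up for illustration.

```python
import math


def fit(scores, n):
    """Order-1 transition matrix with the slide's "+1 / +n" smoothing."""
    counts = [[1.0] * n for _ in range(n)]  # the "+1" in the numerator
    for score in scores:
        for i, j in zip(score, score[1:]):
            counts[i][j] += 1
    # Each row sum equals (number of transitions from i) + n.
    return [[c / sum(row) for c in row] for row in counts]


def log_likelihood(P, score):
    """Sum of log transition probabilities along the score."""
    return sum(math.log(P[i][j]) for i, j in zip(score, score[1:]))


def classify(models, score):
    """Return the composer whose chain gives the score the highest likelihood."""
    return max(models, key=lambda c: log_likelihood(models[c], score))


# Hypothetical corpus with two note symbols (0 and 1).
corpus = {
    "composer_A": [[0, 1, 0, 1, 0, 1], [1, 0, 1, 0, 1, 0]],  # alternating notes
    "composer_B": [[0, 0, 0, 1, 1, 1], [1, 1, 1, 0, 0, 0]],  # repeated notes
}
models = {c: fit(scores, n=2) for c, scores in corpus.items()}

guess_alternating = classify(models, [0, 1, 0, 1, 0])
guess_repeated = classify(models, [0, 0, 0, 0])
```

Choosing $k$ by cross-validation, as the slide suggests, would wrap `fit` and `classify` in a loop over candidate orders and held-out scores.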

12 Other application: PageRank at Google
Compute an importance index for the nodes of a network (Web, etc.).
Let $X_t$ be the current page of a random Web surfer (random walk model).
PageRank value of page $p$ = asymptotic probability $\lim_{t \to \infty} P(X_t = p)$.
Model the Web as a Markov chain with $p_{ij} = P(X_t = j \mid X_{t-1} = i)$:
- If $\deg(i) \neq 0$: $p_{ij} = 1 / \deg(i)$ if "i links to j", $p_{ij} = 0$ otherwise.
- If $\deg(i) = 0$: $p_{ij} = 1 / n$.
[Figure: example graph with 4 pages, labeled $p_{1,4} = 1$; $p_{2,1} = p_{2,2} = p_{2,3} = p_{2,4} = 1/4$; $p_{3,2} = p_{3,4} = 1/2$; $p_{4,1} = 1$.]
Does $\lim_{t \to \infty} P(X_t = p)$ exist, and is it independent of $X_0$?

13 Stationary distribution and equilibrium
With $\pi_t$ denoting the state distribution at time $t$:
Equilibrium state distribution: limit of the state distribution, independent of the initial state distribution,
$$\forall \pi_0, \qquad \lim_{t \to \infty} \pi_t = \lim_{t \to \infty} (P^T)^t\, \pi_0 = \pi^*$$
Stationary distribution: a state distribution $\pi$ that is a fixed point, $\pi = P^T \pi$.
Property 1: an equilibrium distribution is stationary.
Property 2: every Markov chain has some stationary distribution(s). Every stochastic matrix has 1 as its largest eigenvalue (in absolute value). The components of the right and left eigenvectors for eigenvalue 1 all have the same sign.
[Numerical example: eigenvalues $\Lambda$ of a matrix $P^T$ and the resulting stationary distribution.]
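
As a small worked case (not taken from the slides), consider a generic two-state homogeneous chain with switching probabilities $a$ and $b$, assuming $a + b > 0$. The fixed-point condition can be checked by hand:

```latex
P = \begin{pmatrix} 1-a & a \\ b & 1-b \end{pmatrix},
\qquad
\pi = \frac{1}{a+b} \begin{pmatrix} b \\ a \end{pmatrix},
\qquad
P^T \pi
= \frac{1}{a+b} \begin{pmatrix} (1-a)\,b + b\,a \\ a\,b + (1-b)\,a \end{pmatrix}
= \frac{1}{a+b} \begin{pmatrix} b \\ a \end{pmatrix}
= \pi .
```

The eigenvalues of this $P$ are $1$ and $1 - a - b$, so $\pi$ is also an equilibrium whenever $|1 - a - b| < 1$; the case $a = b = 1$ alternates forever between the two states and has a stationary distribution without converging to it.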

14 Notion of reducibility
State $s'$ is accessible from state $s$ ($s \to s'$) if $\exists t,\ P(X_t = s' \mid X_0 = s) > 0$.
$s$ and $s'$ communicate ($s \leftrightarrow s'$) if $s \to s'$ and $s' \to s$.
Communicating classes are the equivalence classes for $\leftrightarrow$.
A closed communicating class has no outgoing link.
Theorem: the number of (linearly independent) stationary distributions is the number of closed communicating classes.
A chain is irreducible if there is only one (closed) communicating class, i.e. the transition graph is strongly connected.
[Figure: transition graph showing a closed communicating class; strongly connected components = communicating classes.]

15 Notion of periodicity
The period of a state $s$ is $\mathrm{period}(s) = \gcd\{\, t \mid P(X_t = s \mid X_0 = s) > 0 \,\}$.
A Markov chain is aperiodic if all states have a period equal to 1.
[Figure: example chain over states $s_1, \dots, s_5$ annotated with their periods.]
Theorem (sufficient condition for convergence to an equilibrium): a homogeneous, irreducible and aperiodic Markov chain converges to an equilibrium.

16 Back to PageRank
Problem: the Markov chain of the Web is neither irreducible nor aperiodic.
Solution: every complete transition graph is irreducible and aperiodic: $\forall i, j,\ p_{ij} > 0$.
Algorithm: at every page, one draws a number $x$ between 0 and 1.
- If $x \le \alpha$, choose an outgoing link at random.
- Otherwise, teleport at random to a page of the Web.
Consequence: the new transition graph is complete:
$$\forall i, j, \qquad p'_{ij} = \alpha\, p_{ij} + \frac{1 - \alpha}{n} > 0, \qquad P'^T = \alpha\, P^T + \frac{1 - \alpha}{n}\, \mathbf{1}\mathbf{1}^T$$
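
The teleportation scheme can be sketched with a plain power iteration. The link structure below is an assumed reading of the 4-page example (page 1 links to 4, page 2 has no outgoing link, page 3 links to 2 and 4, page 4 links to 1), written with 0-based indices. The value `alpha=0.85` is a commonly quoted choice, not something stated on the slides.

```python
def pagerank(links, alpha=0.85, iters=100):
    """Power iteration on the teleporting chain.

    links[i] lists the pages that page i links to (0-based indices).
    """
    n = len(links)
    # Row-stochastic matrix: uniform over out-links, or over all pages
    # when page i has no outgoing link.
    P = [[0.0] * n for _ in range(n)]
    for i, outs in enumerate(links):
        if outs:
            for j in outs:
                P[i][j] = 1.0 / len(outs)
        else:
            P[i] = [1.0 / n] * n
    # Teleportation: p'_ij = alpha * p_ij + (1 - alpha) / n > 0.
    G = [[alpha * P[i][j] + (1.0 - alpha) / n for j in range(n)] for i in range(n)]
    rank = [1.0 / n] * n
    for _ in range(iters):
        # rank <- G^T rank
        rank = [sum(G[i][j] * rank[i] for i in range(n)) for j in range(n)]
    return rank


links = [[3], [], [1, 3], [0]]  # assumed 4-page example, 0-based
rank = pagerank(links)
```

Because every entry of `G` is positive, the chain is irreducible and aperiodic, so the iteration converges to the same vector whatever the starting distribution.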

17 LEARNING DYNAMIC SYSTEMS: MARKOV MODELS
- Markov Process and Markov Chains
- Hidden Markov Models
- Bayesian Filtering and Kalman Filter

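18 Example of application: Speech recognition systems
[Figure: HMM whose hidden states are phoneme sub-states ($a_1, a_2, w_1, w_2, n_1, n_2, n_3, ə_1, ə_2$) and whose observations are cepstral coefficients.]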

19 Example of a partially observable state: the burglar problem (from Barber)
A burglar walks on a 5 x 5 grid in the dark.
[Figure: grid maps of the two observation types, creaking floor and collision with an obstacle, with legend values P = 0.9 and P = 0.1.]

20 Different estimation problems with an HMM
[Figure panels on the burglar example:]
- Observation $Y_t$: creaks, collisions
- Present state: $P(X_t \mid Y_0, \dots, Y_t)$
- Past state: $P(X_t \mid Y_0, \dots, Y_T)$
- Most probable trajectory
- Real trajectory $X_t$

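21 Partially observable process
Markov process $(X_0, Y_0), \dots, (X_t, Y_t)$ such that:
- State $X_t$ is a hidden variable.
- State $X_t$ is partially observable through the observations $Y_0$ to $Y_t$.
- Observation $Y_t$ has only $X_t$ as parent: $P(Y_t \mid X_0, \dots, X_t, Y_0, \dots, Y_{t-1}) = P(Y_t \mid X_t)$.
Joint distribution:
$$P(X_0, Y_0, \dots, X_t, Y_t) = P(X_0)\, P(Y_0 \mid X_0) \prod_{i=1}^{t} P(X_i \mid X_0, \dots, X_{i-1})\, P(Y_i \mid X_i)$$
[Figure: dependency graph with hidden states $X_0, X_1, X_2, X_3$ and observations $Y_0, Y_1, Y_2, Y_3$.]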

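22 Hidden Markov Model (HMM)
A Hidden Markov Model is a partially observable Markov chain:
$$P(X_0, Y_0, \dots, X_t, Y_t) = P(X_0)\, P(Y_0 \mid X_0) \prod_{i=1}^{t} P(X_i \mid X_{i-1}, \dots, X_{i-k})\, P(Y_i \mid X_i)$$
A Hidden Markov Model of order 1 ($k = 1$) with discrete observations from 1 to $m$ is defined by:
- An $n \times n$ transition matrix: $P_t = \left( P(X_t = j \mid X_{t-1} = i) \right)_{i,j}$
- An $n \times m$ emission matrix: $Q_t = \left( P(Y_t = j \mid X_t = i) \right)_{i,j}$
[Figure: HMM of order 1, with hidden states $X_0, X_1, X_2, X_3$ and observations $Y_0, Y_1, Y_2, Y_3$.]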

23 HMM Example: tracking cachalots
Scientists stick a GPS device on the back of a cachalot:
- A cachalot dives on average 30 min every two hours.
- The device hibernates and wakes up for a few minutes every 24 hours.
- When woken up, if the device is at the sea surface, it emits its position.
- The risk of the device going out of service is 5% per day.
- The risk of the device coming off the cachalot is 10% per day. It then drifts on the surface.
- 50% of the messages are received by a satellite.
- 75% of the messages sent by a drifting device are received by a satellite.
- The average lifetime of the device battery is days with standard deviation of 5 days.
Model the problem with an HMM, as a graph and then as matrices.

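24 Tracking cachalots: solution
Time step = 1 day (when the device wakes up).
States: S = Surface, D = Diving, F = Floating, B = Broken. Observation: message received or not.
Transitions from S or D:
- to S: $0.75 \times 0.9 \times 0.95 = 0.64125$
- to D: $0.25 \times 0.9 \times 0.95 = 0.21375$
- to F: $0.1 \times 0.95 = 0.095$
- to B: $0.05$
Transitions from F: to F with probability $0.95$, to B with probability $0.05$. State B is absorbing.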

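25 Tracking cachalots: solution
With rows and columns of $P$ ordered (S, D, F, B), rows of $Q$ ordered (S, D, F, B), and columns of $Q$ ordered ($M$ = message received, $\bar M$ = no message):
$$P_t = \begin{pmatrix} 0.64125 & 0.21375 & 0.095 & 0.05 \\ 0.64125 & 0.21375 & 0.095 & 0.05 \\ 0 & 0 & 0.95 & 0.05 \\ 0 & 0 & 0 & 1 \end{pmatrix}, \qquad P(X_0) = \begin{pmatrix} 1 \\ 0 \\ 0 \\ 0 \end{pmatrix}, \qquad Q_t = \begin{pmatrix} 0.5 & 0.5 \\ 0 & 1 \\ 0.75 & 0.25 \\ 0 & 1 \end{pmatrix}$$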
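
As a cross-check, the matrices can be rebuilt from the rates of the problem statement. This is a sketch under the reading used here (5% failure, 10% detachment, 30 minutes of diving per two hours, 50% and 75% reception); the variable names are illustrative.

```python
p_broken = 0.05    # device goes out of service, per day
p_detach = 0.10    # device comes off the cachalot, per day
p_dive = 30 / 120  # fraction of time spent diving

working = 1 - p_broken               # device still works
attached = working * (1 - p_detach)  # works and is still on the animal

STATES = ["S", "D", "F", "B"]        # Surface, Diving, Floating, Broken

# Same row for S and D: where the animal is tomorrow does not depend on today.
row_attached = [attached * (1 - p_dive), attached * p_dive, working * p_detach, p_broken]
P = [
    row_attached,
    list(row_attached),
    [0.0, 0.0, working, p_broken],   # a floating device stays floating or breaks
    [0.0, 0.0, 0.0, 1.0],            # broken is absorbing
]

# Emission matrix, columns: (message received, no message).
Q = [
    [0.50, 0.50],  # S: half of the messages reach a satellite
    [0.00, 1.00],  # D: nothing is emitted under water
    [0.75, 0.25],  # F: drifting device
    [0.00, 1.00],  # B: broken device
]
```

Each row of `P` and `Q` sums to 1, which is a quick way to catch a modelling slip.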

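26 Online estimation of current state (filtering)
Estimation of the present state $P(X_t \mid y_0, \dots, y_t)$ from past observations:
$$P(X_t \mid y_0, \dots, y_t) = \frac{P(X_t, y_0, \dots, y_t)}{P(y_0, \dots, y_t)} = \frac{\alpha_t(X_t)}{\sum_{x} \alpha_t(x)} \qquad \text{with} \qquad \alpha_t(x) = P(X_t = x, y_0, \dots, y_t)$$
Recursive «forward» computation of the coefficients $\alpha_t(x)$:
$$\begin{aligned}
\alpha_t(x) &= \sum_{x'} P(X_t = x, X_{t-1} = x', y_0, \dots, y_t) \\
&= \sum_{x'} P(y_t \mid X_t = x)\, P(X_t = x \mid X_{t-1} = x')\, P(X_{t-1} = x', y_0, \dots, y_{t-1}) \\
&= P(y_t \mid X_t = x) \sum_{x'=1}^{n} P(X_t = x \mid X_{t-1} = x')\, \alpha_{t-1}(x')
\end{aligned}$$
In matrix form, with $\odot$ the componentwise product and $Q^t_{\cdot, y_t}$ the column of $Q_t$ for observation $y_t$:
$$\alpha_t = Q^t_{\cdot, y_t} \odot \left( P_t^T\, \alpha_{t-1} \right)$$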

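27 An example of HMM: on the track of cachalots
Estimate the state of the device on the 4th day (day 3), after positions have been received on days 0, 2 and 3.
$$\alpha_t = Q_{\cdot, y_t} \odot \left( P^T \alpha_{t-1} \right)$$
With states ordered (S, D, F, B) and observations ordered ($M$, $\bar M$):
$$P = \begin{pmatrix} 0.64125 & 0.21375 & 0.095 & 0.05 \\ 0.64125 & 0.21375 & 0.095 & 0.05 \\ 0 & 0 & 0.95 & 0.05 \\ 0 & 0 & 0 & 1 \end{pmatrix}, \qquad P(X_0) = \begin{pmatrix} 1 \\ 0 \\ 0 \\ 0 \end{pmatrix}, \qquad Q = \begin{pmatrix} 0.5 & 0.5 \\ 0 & 1 \\ 0.75 & 0.25 \\ 0 & 1 \end{pmatrix}$$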

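28 Solution
With $\odot$ the componentwise multiplication and states ordered (S, D, F, B):
- $\alpha_0 = (0.5, 0, 0.75, 0)^T \odot P(X_0) = (0.5, 0, 0, 0)^T$, so $P(X_0 \mid y_0) = (1, 0, 0, 0)$.
- $\alpha_1 = (0.5, 1, 0.25, 1)^T \odot P^T \alpha_0 = (0.1603, 0.1069, 0.0119, 0.025)^T$, so $P(X_1 \mid y_0, y_1) = (0.53, 0.35, 0.04, 0.08)$.
- $\alpha_2 = (0.5, 0, 0.75, 0)^T \odot P^T \alpha_1 = (0.0857, 0, 0.0275, 0)^T$, so $P(X_2 \mid y_0, \dots, y_2) = (0.76, 0, 0.24, 0)$.
- $\alpha_3 = (0.5, 0, 0.75, 0)^T \odot P^T \alpha_2 = (0.0275, 0, 0.0257, 0)^T$, so $P(X_3 \mid y_0, \dots, y_3) = (0.52, 0, 0.48, 0)$.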
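
The forward recursion is short enough to run directly. The sketch below hard-codes the cachalot matrices with states ordered (S, D, F, B) and assumes the observation sequence "message, no message, message, message" for days 0 to 3; function and variable names are illustrative.

```python
P = [[0.64125, 0.21375, 0.095, 0.05],
     [0.64125, 0.21375, 0.095, 0.05],
     [0.0, 0.0, 0.95, 0.05],
     [0.0, 0.0, 0.0, 1.0]]
Q = [[0.5, 0.5], [0.0, 1.0], [0.75, 0.25], [0.0, 1.0]]  # columns: message, no message
prior = [1.0, 0.0, 0.0, 0.0]
MSG, NO_MSG = 0, 1
ys = [MSG, NO_MSG, MSG, MSG]  # days 0..3


def forward(P, Q, prior, ys):
    """Return the list of unnormalized alpha_t vectors, t = 0..T."""
    n = len(P)
    alpha = [Q[x][ys[0]] * prior[x] for x in range(n)]
    alphas = [alpha]
    for y in ys[1:]:
        # alpha_t(x) = Q[x][y_t] * sum_x' P[x'][x] * alpha_{t-1}(x')
        alpha = [Q[x][y] * sum(P[xp][x] * alpha[xp] for xp in range(n))
                 for x in range(n)]
        alphas.append(alpha)
    return alphas


def normalize(v):
    total = sum(v)
    return [c / total for c in v]


alphas = forward(P, Q, prior, ys)
filtered = [normalize(a) for a in alphas]  # P(X_t | y_0..y_t) for each t
```

On day 3 the filtered distribution puts roughly 0.52 on "surface" and 0.48 on "floating": two consecutive messages rule out diving and a broken device, but cannot tell whether the device is still attached.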

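29 Offline estimation of past state (smoothing)
Estimation of $P(X_t \mid y_0, \dots, y_T)$ for $t \le T$:
$$\begin{aligned}
P(X_t = x \mid y_0, \dots, y_T) &\propto P(X_t = x, y_0, \dots, y_T) \\
&= P(y_{t+1}, \dots, y_T \mid X_t = x, y_0, \dots, y_t)\, P(X_t = x, y_0, \dots, y_t) \\
&= P(y_{t+1}, \dots, y_T \mid X_t = x)\, \alpha_t(x) \\
&= \alpha_t(x)\, \beta_t(x)
\end{aligned}$$
Hence
$$P(X_t = x \mid y_0, \dots, y_T) = \frac{\alpha_t(x)\, \beta_t(x)}{\sum_{x'} \alpha_t(x')\, \beta_t(x')} \qquad \text{with} \qquad \beta_t(x) = P(y_{t+1}, \dots, y_T \mid X_t = x)$$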

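30 Forward/Backward algorithm
In parallel:
1. Forward recursive computation of $\alpha_t(x)$ for all $t, x$.
2. Backward recursive computation of $\beta_t(x)$ for all $t, x$:
$$\begin{aligned}
\beta_t(x) &= P(y_{t+1}, \dots, y_T \mid X_t = x) = \sum_{x'} P(y_{t+1}, \dots, y_T, X_{t+1} = x' \mid X_t = x) \\
&= \sum_{x'} P(y_{t+2}, \dots, y_T \mid X_{t+1} = x')\, P(y_{t+1} \mid X_{t+1} = x')\, P(X_{t+1} = x' \mid X_t = x) \\
&= \sum_{x'=1}^{n} P(y_{t+1} \mid X_{t+1} = x')\, P(X_{t+1} = x' \mid X_t = x)\, \beta_{t+1}(x')
\end{aligned}$$
that is, $\beta_t(x) = \sum_{x'=1}^{n} Q^{t+1}_{x', y_{t+1}}\, P^{t+1}_{x, x'}\, \beta_{t+1}(x')$, with $\forall x,\ \beta_T(x) = 1$.
3. Compute $P(X_t \mid y_0, \dots, y_T)$ from $\alpha_t$ and $\beta_t$.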

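31 HMM example: on the track of cachalots
Estimate the state of the device on each day up to the 4th day (day 3), after positions have been received on days 0, 2 and 3.
$$\alpha_t = Q_{\cdot, y_t} \odot \left( P^T \alpha_{t-1} \right) \qquad \text{and} \qquad \beta_t(x) = \sum_{x'=1}^{n} Q_{x', y_{t+1}}\, P_{x, x'}\, \beta_{t+1}(x')$$
With states ordered (S, D, F, B) and observations ordered ($M$, $\bar M$):
$$P = \begin{pmatrix} 0.64125 & 0.21375 & 0.095 & 0.05 \\ 0.64125 & 0.21375 & 0.095 & 0.05 \\ 0 & 0 & 0.95 & 0.05 \\ 0 & 0 & 0 & 1 \end{pmatrix}, \qquad P(X_0) = \begin{pmatrix} 1 \\ 0 \\ 0 \\ 0 \end{pmatrix}, \qquad Q = \begin{pmatrix} 0.5 & 0.5 \\ 0 & 1 \\ 0.75 & 0.25 \\ 0 & 1 \end{pmatrix}$$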

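32 Solution
States are ordered (S, D, F, B); $y_{0:t}$ stands for $y_0, \dots, y_t$.
- $t = 0$: $\alpha_0 = (0.5, 0, 0, 0)$, $\beta_0 = (0.11, 0.11, 0.12, 0)$. Smoothed $P(X_0 \mid y_{0:3}) = (1, 0, 0, 0)$; filtered $P(X_0 \mid y_0) = (1, 0, 0, 0)$.
- $t = 1$: $\alpha_1 = (0.1603, 0.1069, 0.0119, 0.025)$, $\beta_1 = (0.18, 0.18, 0.51, 0)$. Smoothed $P(X_1 \mid y_{0:3}) = (0.53, 0.35, 0.11, 0)$; filtered $P(X_1 \mid y_{0:1}) = (0.53, 0.35, 0.04, 0.08)$.
- $t = 2$: $\alpha_2 = (0.0857, 0, 0.0275, 0)$, $\beta_2 = (0.39, 0.39, 0.71, 0)$. Smoothed $P(X_2 \mid y_{0:3}) = (0.63, 0, 0.37, 0)$; filtered $P(X_2 \mid y_{0:2}) = (0.76, 0, 0.24, 0)$.
- $t = 3$: $\alpha_3 = (0.0275, 0, 0.0257, 0)$, $\beta_3 = (1, 1, 1, 1)$. Smoothed $P(X_3 \mid y_{0:3}) = (0.52, 0, 0.48, 0)$; filtered $P(X_3 \mid y_{0:3}) = (0.52, 0, 0.48, 0)$.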
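
A minimal forward/backward sketch on the same example, again with states ordered (S, D, F, B) and the assumed observation sequence "message, no message, message, message". Names are illustrative.

```python
P = [[0.64125, 0.21375, 0.095, 0.05],
     [0.64125, 0.21375, 0.095, 0.05],
     [0.0, 0.0, 0.95, 0.05],
     [0.0, 0.0, 0.0, 1.0]]
Q = [[0.5, 0.5], [0.0, 1.0], [0.75, 0.25], [0.0, 1.0]]  # columns: message, no message
prior = [1.0, 0.0, 0.0, 0.0]
ys = [0, 1, 0, 0]  # 0 = message received, 1 = no message
n = len(P)


def forward(P, Q, prior, ys):
    alphas = [[Q[x][ys[0]] * prior[x] for x in range(n)]]
    for y in ys[1:]:
        prev = alphas[-1]
        alphas.append([Q[x][y] * sum(P[xp][x] * prev[xp] for xp in range(n))
                       for x in range(n)])
    return alphas


def backward(P, Q, ys):
    betas = [[1.0] * n]                # beta_T(x) = 1
    for y_next in reversed(ys[1:]):    # y_T, y_{T-1}, ..., y_1
        nxt = betas[-1]
        # beta_t(x) = sum_x' Q[x'][y_{t+1}] * P[x][x'] * beta_{t+1}(x')
        betas.append([sum(Q[xn][y_next] * P[x][xn] * nxt[xn] for xn in range(n))
                      for x in range(n)])
    return betas[::-1]                 # reorder as beta_0 .. beta_T


def normalize(v):
    total = sum(v)
    return [c / total for c in v]


alphas = forward(P, Q, prior, ys)
betas = backward(P, Q, ys)
smoothed = [normalize([a * b for a, b in zip(alpha, beta)])
            for alpha, beta in zip(alphas, betas)]
```

A handy sanity check is that $\sum_x \alpha_t(x)\beta_t(x)$ equals the likelihood of the whole observation sequence for every $t$. Smoothing also removes the "broken" hypothesis on day 1, since messages arrive afterwards.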

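33 Most probable trajectory: the Viterbi algorithm
Determine the most probable sequence of states $x_0, \dots, x_T$ given the observations $y_0, \dots, y_T$:
$$\arg\max_{x_0 \dots x_T} P(x_0, \dots, x_T \mid y_0, \dots, y_T) = \arg\max_{x_0 \dots x_T} P(x_0, \dots, x_T, y_0, \dots, y_T)$$
Resolution by dynamic programming.
$\mu_t(i)$: probability of the most probable trajectory $x_0, \dots, x_t$ such that $x_t = i$,
$$\mu_t(i) = \max_{x_0, \dots, x_{t-1}} P(x_0, \dots, x_{t-1}, x_t = i, y_0, \dots, y_t)$$
Bellman equation:
- $t > 0$: $\mu_t(i) = P(y_t \mid X_t = i)\, \max_j P(X_t = i \mid X_{t-1} = j)\, \mu_{t-1}(j)$
- $t = 0$: $\mu_0(i) = P(y_0 \mid X_0 = i)\, P(X_0 = i)$
Final result: $\max_{x_0 \dots x_T} P(x_0, \dots, x_T, y_0, \dots, y_T) = \max_i \mu_T(i)$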

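34 HMM example: on the track of cachalots
Determine the most probable trajectory up to the 4th day (day 3), after positions have been received on days 0, 2 and 3.
- $t > 0$: $\mu_t(i) = P(y_t \mid X_t = i)\, \max_j P(X_t = i \mid X_{t-1} = j)\, \mu_{t-1}(j)$
- $t = 0$: $\mu_0(i) = P(y_0 \mid X_0 = i)\, P(X_0 = i)$
Matrix reformulation, where $\mathrm{Diag}(v)$ is the diagonal matrix of the coefficients of $v = (c_1, \dots, c_n)$ and $\max_j$ takes the maximum of each row over its columns $j$:
- $t > 0$: $\mu_t = \mathrm{Diag}(Q_{\cdot, y_t})\, \max_j \left[ P^T\, \mathrm{Diag}(\mu_{t-1}) \right]$
- $t = 0$: $\mu_0 = \mathrm{Diag}(Q_{\cdot, y_0})\, P(X_0)$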

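35 Solution
States are ordered (S, D, F, B).
$$\mu_0 = \mathrm{Diag}(0.5, 0, 0.75, 0)\, P(X_0) = (0.5, 0, 0, 0)^T$$
$$\mu_1 = \mathrm{Diag}(0.5, 1, 0.25, 1)\, \max_j \left[ P^T\, \mathrm{Diag}(\mu_0) \right] = \mathrm{Diag}(0.5, 1, 0.25, 1) \begin{pmatrix} 0.3206 \\ 0.1069 \\ 0.0475 \\ 0.025 \end{pmatrix} = \begin{pmatrix} 0.1603 \\ 0.1069 \\ 0.0119 \\ 0.025 \end{pmatrix}$$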

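36 Solution
States are ordered (S, D, F, B), starting from $\mu_1 = (0.1603, 0.1069, 0.0119, 0.025)^T$.
$$\mu_2 = \mathrm{Diag}(0.5, 0, 0.75, 0)\, \max_j \begin{pmatrix} 0.1028 & 0.0685 & 0 & 0 \\ 0.0343 & 0.0228 & 0 & 0 \\ 0.0152 & 0.0102 & 0.0113 & 0 \\ 0.0080 & 0.0053 & 0.0006 & 0.025 \end{pmatrix} = \mathrm{Diag}(0.5, 0, 0.75, 0) \begin{pmatrix} 0.1028 \\ 0.0343 \\ 0.0152 \\ 0.025 \end{pmatrix} = \begin{pmatrix} 0.0514 \\ 0 \\ 0.0114 \\ 0 \end{pmatrix}$$
$$\mu_3 = \mathrm{Diag}(0.5, 0, 0.75, 0)\, \max_j \begin{pmatrix} 0.0330 & 0 & 0 & 0 \\ 0.0110 & 0 & 0 & 0 \\ 0.0049 & 0 & 0.0109 & 0 \\ 0.0026 & 0 & 0.0006 & 0 \end{pmatrix} = \mathrm{Diag}(0.5, 0, 0.75, 0) \begin{pmatrix} 0.0330 \\ 0.0110 \\ 0.0109 \\ 0.0026 \end{pmatrix} = \begin{pmatrix} 0.0165 \\ 0 \\ 0.0081 \\ 0 \end{pmatrix}$$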

37 Solution
State   μ0      μ1      μ2      μ3
S       0.5     0.160   0.051   0.016
D       0       0.107   0       0
F       0       0.012   0.011   0.008
B       0       0.025   0       0
Most probable trajectory: $(x_0, x_1, x_2, x_3) = (S, S, S, S)$
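
The table and the backtracking step can be reproduced with a short Viterbi sketch. It assumes the cachalot matrices with states ordered (S, D, F, B) and the observation sequence "message, no message, message, message"; the back-pointer bookkeeping is an implementation detail not shown on the slides.

```python
STATES = ["S", "D", "F", "B"]
P = [[0.64125, 0.21375, 0.095, 0.05],
     [0.64125, 0.21375, 0.095, 0.05],
     [0.0, 0.0, 0.95, 0.05],
     [0.0, 0.0, 0.0, 1.0]]
Q = [[0.5, 0.5], [0.0, 1.0], [0.75, 0.25], [0.0, 1.0]]  # columns: message, no message
prior = [1.0, 0.0, 0.0, 0.0]
ys = [0, 1, 0, 0]  # 0 = message received, 1 = no message


def viterbi(P, Q, prior, ys):
    """Return (most probable state path, its joint probability with ys)."""
    n = len(P)
    mu = [Q[i][ys[0]] * prior[i] for i in range(n)]
    back = []  # back[t-1][i] = best predecessor of state i at time t
    for y in ys[1:]:
        best_prev = [max(range(n), key=lambda j: P[j][i] * mu[j]) for i in range(n)]
        mu = [Q[i][y] * P[best_prev[i]][i] * mu[best_prev[i]] for i in range(n)]
        back.append(best_prev)
    # Backtrack from the best final state.
    state = max(range(n), key=lambda i: mu[i])
    path = [state]
    for best_prev in reversed(back):
        state = best_prev[state]
        path.append(state)
    return path[::-1], max(mu)


path, prob = viterbi(P, Q, prior, ys)
trajectory = [STATES[i] for i in path]
```

On long sequences the products underflow, so a practical version would work with sums of logarithms instead; that change does not affect the argmax.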

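38 Learning an HMM
Problem: learn $\theta = \left( P_0, (P_t)_t, (Q_t)_t \right)$ from $N$ i.i.d. sequences $(y^1_0, \dots, y^1_{T_1}), \dots, (y^N_0, \dots, y^N_{T_N})$.
Two cases:
- The states $(x^1_0, \dots, x^1_{T_1}), \dots, (x^N_0, \dots, x^N_{T_N})$ are known (expert annotation). Split the problem in two parts:
  1. Learn the Markov chain $\left( P_0, (P_t)_t \right)$.
  2. Learn the emission matrices $(Q_t)_t$ using MLE.
- The states $(x^1_0, \dots, x^1_{T_1}), \dots, (x^N_0, \dots, x^N_{T_N})$ are unknown. EM must be used to learn the distribution of the hidden states: Baum-Welch algorithm.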

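39 The Baum-Welch algorithm
1. Initialize $\theta = \left( P_0, (P_t)_t, (Q_t)_t \right)$ randomly, for a fixed number of states $n$.
2. E-step: estimate $(a^i_t)_{i,t}$ and $(B^i_t)_{i,t}$ from $\theta$ using the forward-backward algorithm.
   - Distribution $a^i_t$ of $X^i_t$: $\forall i,\ \forall t \in \{0, \dots, T_i\},\ \forall s$,
     $$a^i_t(s) = P(X^i_t = s \mid y^i_0, \dots, y^i_{T_i}, \theta) \propto \alpha^i_t(s)\, \beta^i_t(s)$$
   - Distribution $B^i_t$ of the transition $X^i_t \to X^i_{t+1}$: $\forall i,\ \forall t \in \{0, \dots, T_i - 1\},\ \forall s, s'$,
     $$B^i_t(s, s') = P(X^i_t = s, X^i_{t+1} = s' \mid y^i_0, \dots, y^i_{T_i}, \theta) \propto \alpha^i_t(s)\, P^{t+1}_{s,s'}\, Q^{t+1}_{s', y^i_{t+1}}\, \beta^i_{t+1}(s')$$
3. M-step: learn $P_0, (P_t)_t, (Q_t)_t$ from $(a^i_t)_{i,t}$ and $(B^i_t)_{i,t}$: $\forall t, s, s', y$,
   $$P_0(s) \propto \sum_i a^i_0(s), \qquad P^{t+1}_{s,s'} \propto \sum_i B^i_t(s, s'), \qquad Q^t_{s,y} \propto \sum_i a^i_t(s)\, \mathbb{1}(y^i_t = y)$$
4. Go to step 2 until convergence of $\theta$.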
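
One way to sketch these four steps, under an assumption the slide does not make: the model is homogeneous ($P_t = P$, $Q_t = Q$), so expected counts are pooled over time. The toy sequences, the random initialization and all names are illustrative, and no rescaling is done, so this only suits short sequences.

```python
import math
import random


def normalize(v):
    total = sum(v)
    return [c / total for c in v]


def forward(P, Q, prior, ys):
    n = len(P)
    alphas = [[Q[s][ys[0]] * prior[s] for s in range(n)]]
    for y in ys[1:]:
        prev = alphas[-1]
        alphas.append([Q[s][y] * sum(P[r][s] * prev[r] for r in range(n))
                       for s in range(n)])
    return alphas


def backward(P, Q, ys):
    n = len(P)
    betas = [[1.0] * n]
    for y_next in reversed(ys[1:]):
        nxt = betas[-1]
        betas.append([sum(P[s][r] * Q[r][y_next] * nxt[r] for r in range(n))
                      for s in range(n)])
    return betas[::-1]


def baum_welch_step(P, Q, prior, sequences):
    """One E-step + M-step. Returns (P, Q, prior, log-likelihood before update)."""
    n, m = len(P), len(Q[0])
    new_prior = [0.0] * n
    trans = [[0.0] * n for _ in range(n)]
    emit = [[0.0] * m for _ in range(n)]
    log_lik = 0.0
    for ys in sequences:
        alphas, betas = forward(P, Q, prior, ys), backward(P, Q, ys)
        lik = sum(alphas[-1])
        log_lik += math.log(lik)
        for t, y in enumerate(ys):
            a = [alphas[t][s] * betas[t][s] / lik for s in range(n)]  # a_t(s)
            if t == 0:
                new_prior = [new_prior[s] + a[s] for s in range(n)]
            for s in range(n):
                emit[s][y] += a[s]
            if t + 1 < len(ys):
                for s in range(n):
                    for r in range(n):  # B_t(s, r)
                        trans[s][r] += (alphas[t][s] * P[s][r]
                                        * Q[r][ys[t + 1]] * betas[t + 1][r] / lik)
    return ([normalize(row) for row in trans],
            [normalize(row) for row in emit],
            normalize(new_prior), log_lik)


def random_rows(rows, cols):
    return [normalize([random.random() + 0.1 for _ in range(cols)]) for _ in range(rows)]


random.seed(0)
sequences = [[0, 0, 1, 1, 0, 0, 1, 1], [1, 1, 0, 0, 1, 1, 0, 0]]  # toy observations
P, Q, prior = random_rows(2, 2), random_rows(2, 2), random_rows(1, 2)[0]
history = []
for _ in range(10):
    P, Q, prior, log_lik = baum_welch_step(P, Q, prior, sequences)
    history.append(log_lik)
```

The recorded log-likelihoods should never decrease from one iteration to the next, which is the usual way to monitor convergence; EM only finds a local optimum, so several random restarts are common.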
