Markov decision processes


IMT Atlantique
Technopôle de Brest-Iroise, CS Brest Cedex 3

Markov decision processes
Lecture notes

thierry.chonavel@imt-atlantique.fr
sandrine.vaton@imt-atlantique.fr

Date: August 30, 2017

Contents

1. Introduction
2. HMM, optimal filtering and parameter estimation
   2.1. Notations and definitions
   2.2. Filtering
        2.2.1. Continuous state
        2.2.2. Discrete state filtering and smoothing
   2.3. Parameter estimation
3. Fully observed MDP
   3.1. Finite horizon
   3.2. Infinite horizon
        3.2.1. Discounted cost
        3.2.2. Average cost
4. Partially observed MDP
   4.1. Introduction
   4.2. Finite horizon
   4.3. Infinite horizon
References

1. Introduction

These notes are related to the control of Markov processes. They closely follow the textbook dedicated to partially observed Markov decision processes by V. Krishnamurthy [1], which can be found at the library of IMT Atlantique and which we invite you to read for a deeper understanding. We first recall some facts related to filtering and parameter estimation in HMM models before discussing the control of such processes, both for controlled Markov chains (MC) and for controlled hidden Markov chains (HMC).

2. HMM, optimal filtering and parameter estimation

2.1. Notations and definitions

We work in a probability space denoted by (\Omega, \mathcal{A}, P). We consider an HMM (x_k, y_k)_{k \ge 0} where the y_k are observed and the x_k represent the states of the corresponding underlying Markov process. When the evolution of the HMM is described via a state space model, following the notations in [1], the state and observation equations are denoted by x_{k+1} = \phi_k(x_k, w_k) and y_k = \psi_k(x_k, v_k). In probabilistic terms, this amounts to supplying the state transition densities p(x_{k+1} | x_k) and the observation likelihoods p(y_k | x_k). Note that, for the sake of generality, p will denote a probability or a probability density function (pdf), depending on the situation. Given an initial distribution with pdf \pi(x_0) for x_0, the distribution of x_k can be computed recursively:

    \pi(x_{k+1}) = \int p(x_{k+1} | x_k) \pi(x_k) dx_k.    (1)

In the case of a finite state HMM with homogeneous underlying Markov chain x with values in X = {1, ..., X}, let us denote by P_{ij} = P(x_{k+1} = j | x_k = i) the transition matrix and by B_{xy} = p(y_k = y | x_k = x) the likelihood of y_k = y. Note that P 1_X = 1_X, with 1_n the vector of size n with unit entries. If y belongs to a finite space Y = {1, ..., Y}, we have B 1_Y = 1_X. Here, we will describe discrete probabilities via column vectors. Then, the Chapman-Kolmogorov equations write

    \pi_{k+1} = P^T \pi_k    (2)

where \pi_k = [P(x_k = 1), ..., P(x_k = X)]^T.
When it exists, for fixed \pi_0 the limiting distribution of x is defined as \lim_{k \to \infty} \pi_k = \lim_{k \to \infty} (P^T)^k \pi_0, while the stationary distribution (aka equilibrium, invariant or steady-state distribution) satisfies \pi_\infty = P^T \pi_\infty, with 1^T \pi_\infty = 1. Note that a limiting distribution is a stationary distribution but the converse is false: consider for instance the case P = [0 1; 1 0].

Recall that P is said to be irreducible if for all i, j there exists k > 0 such that P^k_{ij} > 0. In other words, it is always possible to transit from any state to any other after some time. A stronger property is regularity: P is regular if all the entries of P^k are strictly positive for some k > 0. For a regular P, limiting and stationary distributions exist and are equal. In addition, P satisfies

    P^k = 1 \pi_\infty^T + O(k^{m_2 - 1} |\lambda_2|^k),

where \lambda_2 is the eigenvalue of P with second largest modulus (|\lambda_2| < 1) and m_2 its order of multiplicity. Then, clearly \pi_k = (P^T)^k \pi_0 converges to \pi_\infty at a geometric rate.

For two distributions P_1 and P_2, the total variation norm is defined as

    ||P_1 - P_2||_{TV} = \max_{A \in \mathcal{A}} |P_1(A) - P_2(A)|.    (3)

For discrete distributions described by vectors p_1 and p_2, it can be shown that

    ||P_1 - P_2||_{TV} = (1/2) ||p_1 - p_2||_1 = (1/2) \sum_i |p_1(i) - p_2(i)|.    (4)

For a transition matrix P, the Dobrushin coefficient is defined as

    \rho(P) = \max_{i,j} ||P^T e_i - P^T e_j||_{TV}    (5)
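As a concrete illustration of the geometric convergence discussed above, here is a minimal Python sketch (not part of the original notes; the 2-state matrix below is an illustrative assumption) that iterates the Chapman-Kolmogorov equation (2) to approximate the stationary distribution:

```python
# Illustrative sketch: iterate pi_{k+1} = P^T pi_k (Eq. (2)) until it
# (numerically) reaches the stationary distribution of a regular chain.
# The 2-state matrix P is a hypothetical example, not taken from the notes.

def chapman_kolmogorov_step(P, pi):
    """One step of Eq. (2): pi_next(j) = sum_i P[i][j] * pi(i)."""
    X = len(P)
    return [sum(P[i][j] * pi[i] for i in range(X)) for j in range(X)]

def stationary_distribution(P, pi0, n_iter=200):
    """Apply Eq. (2) repeatedly; convergence is geometric in |lambda_2|."""
    pi = pi0
    for _ in range(n_iter):
        pi = chapman_kolmogorov_step(P, pi)
    return pi

P = [[0.9, 0.1],
     [0.2, 0.8]]                      # regular: all entries positive
pi = stationary_distribution(P, [1.0, 0.0])
# pi now (numerically) satisfies pi = P^T pi; here pi = [2/3, 1/3]
```

Since this P is regular, the same limit is reached for any initial distribution pi0, in accordance with the discussion above.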

where [e_i]_j = \delta_{i,j} and \delta denotes Kronecker's symbol: \delta_{a,b} = 1 if a = b and 0 otherwise. \rho(P) has the following properties: \rho(P) \in [|\lambda_2|, 1] and ||P^T \pi - P^T \pi'||_{TV} \le \rho(P) ||\pi - \pi'||_{TV}. Thus, when \rho(P) < 1, it supplies a geometric convergence parameter easier to compute than \lambda_2. In addition, one can prove that if there exists a probability vector p and \epsilon > 0 such that \min_i P_{ij} \ge \epsilon p_j, then \rho(P) \le 1 - \epsilon.

Ex. Show that if x_i \sim p_i, i \in {1, 2}, then ||p_1 - p_2||_{TV} \le P(x_1 \ne x_2).

2.2. Filtering

2.2.1. Continuous state

Here again we assume an HMM y with underlying state x. The optimal filter that supplies an estimate of x_k given y_{1:n} (that is, given {y_1, ..., y_n}) is defined as \arg\min_h E[||x_k - h(y_{1:n})||^2] and is given by \hat{x} = \hat{h}(y_{1:n}) = E[x_k | y_{1:n}] = \int x_k p(x_k | y_{1:n}) dx_k. More specifically, depending on whether k > n, k = n or k < n, the problem is called prediction, filtering or smoothing, respectively. The calculation of the corresponding distributions p(x_k | y_{1:n}) is also called prediction, filtering or smoothing. In particular, for filtering, \pi_k(x_k) = p(x_k | y_{1:k}) can be calculated recursively by the formula

    \pi_{k+1}(x_{k+1}) = \frac{p(y_{k+1} | x_{k+1}) \int p(x_{k+1} | x_k) \pi_k(x_k) dx_k}{\int p(y_{k+1} | x_{k+1}) \int p(x_{k+1} | x_k) \pi_k(x_k) dx_k \, dx_{k+1}}    (6)

and

    \hat{x}_{k+1|k+1} = E[x_{k+1} | y_{1:k+1}] = \int x_{k+1} \pi_{k+1}(x_{k+1}) dx_{k+1}.    (7)

In the particular case of a linear Gaussian model of the form

    x_{k+1} = A_k x_k + f_k u_k + w_k
    y_k = C_k x_k + g_k u_k + v_k    (8)

where u_k represents some known control variable, with x_0 \sim N(\hat{x}_{0|0}, \Sigma_{0|0}), w_k \sim N(0, Q_k) and v_k \sim N(0, R_k), we get \pi_k = N(\hat{x}_{k|k}, \Sigma_{k|k}), where \hat{x}_{k|k} and \Sigma_{k|k} can be calculated recursively in closed form thanks to the Kalman filter, the equations of which are summarized below:

    prediction of x_{k+1}:  \hat{x}_{k+1|k} = A_k \hat{x}_{k|k} + f_k u_k
    prediction of y_{k+1}:  \hat{y}_{k+1|k} = C_{k+1} \hat{x}_{k+1|k} + g_{k+1} u_{k+1}
    cov(x_{k+1} - \hat{x}_{k+1|k}):  \Sigma_{k+1|k} = A_k \Sigma_{k|k} A_k^T + Q_k
    Kalman gain:  K_{k+1} = \Sigma_{k+1|k} C_{k+1}^T [C_{k+1} \Sigma_{k+1|k} C_{k+1}^T + R_{k+1}]^{-1}
    filtering of x_{k+1}:  \hat{x}_{k+1|k+1} = \hat{x}_{k+1|k} + K_{k+1} (y_{k+1} - \hat{y}_{k+1|k})
    cov(x_{k+1} - \hat{x}_{k+1|k+1}):  \Sigma_{k+1|k+1} = \Sigma_{k+1|k} - K_{k+1} C_{k+1} \Sigma_{k+1|k}.    (9)

The Kalman filter can be derived directly by recursively calculating the mean and covariance of the Gaussian distribution p(x_k | y_{1:k}). Alternatively, it can be obtained from formula (6) in the linear Gaussian case, making use of the following result:

    N(y; Cx, R) N(x; \bar{x}, P) = N(y; C\bar{x}, CPC^T + R) N(x; \bar{x} + K(y - C\bar{x}), P - KCP)    (10)

where K = PC^T (CPC^T + R)^{-1}. For more details about filtering, prediction and smoothing in state space models, please refer for instance to the lecture notes on filtering in state space models [2].

2.2.2. Discrete state filtering and smoothing

In the case of a discrete state HMM, x \in {1, ..., X} but y is possibly continuous-valued. We shall note B_{y_k} = diag(p(y_k | x_k = 1), ..., p(y_k | x_k = X)) and \pi_k(i) = p(x_k = i | y_{1:k}). Then, we get

    \pi_{k+1}(j) = p(x_{k+1} = j | y_{1:k+1}) = \frac{p(y_{k+1} | x_{k+1} = j) \sum_i p(x_k = i | y_{1:k}) P_{ij}}{\sum_j p(y_{k+1} | x_{k+1} = j) \sum_i p(x_k = i | y_{1:k}) P_{ij}}.    (11)

Note that this equation is the discrete version of Eq. (6). Put in vector form, we get

    \pi_{k+1} = \frac{B_{y_{k+1}} P^T \pi_k}{1^T B_{y_{k+1}} P^T \pi_k}.    (12)

This is called the HMM filter. In un-normalized form, the recurrence is simply written q_{k+1} = B_{y_{k+1}} P^T q_k, where q_k is the un-normalized version of \pi_k.

For smoothing, to calculate \pi_{k|n}(i) = p(x_k = i | y_{1:n}), with n > k, we define the backward filter as

    \beta_{k|n}(i) = p(y_{k+1:n} | x_k = i).    (13)

It can be calculated iteratively via the relations

    \beta_{n|n} = 1,
    \beta_{k|n} = P B_{y_{k+1}} \beta_{k+1|n},  k = n-1, ..., 1.    (14)

Indeed, for 0 < k < n-1,

    \beta_{k|n}(i) = \sum_j p(y_{k+1:n}, x_{k+1} = j | x_k = i)
                   = \sum_j p(y_{k+2:n} | x_{k+1} = j) p(y_{k+1}, x_{k+1} = j | x_k = i)
                   = \sum_j p(y_{k+2:n} | x_{k+1} = j) p(y_{k+1} | x_{k+1} = j) p(x_{k+1} = j | x_k = i)
                   = \sum_j P_{ij} p(y_{k+1} | x_{k+1} = j) \beta_{k+1|n}(j) = [P B_{y_{k+1}} \beta_{k+1|n}]_i    (15)

and

    \beta_{n-1|n}(i) = p(y_n | x_{n-1} = i) = \sum_j p(y_n | x_n = j, x_{n-1} = i) p(x_n = j | x_{n-1} = i) = [P B_{y_n} 1]_i.    (16)

Then, letting \beta_{n|n} = 1, we get \beta_{k|n} = P B_{y_{k+1}} \beta_{k+1|n} for k = n-1, ..., 1. Furthermore, as

    \pi_{k|n}(i) = p(x_k = i | y_{1:n}) \propto p(x_k = i, y_{1:n}) \propto p(y_{k+1:n} | x_k = i) p(x_k = i | y_{1:k}),    (17)

we get the forward-backward formula:

    \pi_{k|n}(i) = \frac{\pi_k(i) \beta_{k|n}(i)}{\pi_k^T \beta_{k|n}}.    (18)

From the computation of \pi_{k|n}, it is possible to find the MAP (Maximum A Posteriori) estimator of x_k given y_{1:n}, that is, \hat{x}_{k|n} = \arg\max_i \pi_{k|n}(i). As we will see in the section related to parameter estimation, it is also useful to be able to compute the smoothed probabilities of the pairs (x_k, x_{k+1}), denoted by \pi_{k|n}(i, j):

    \pi_{k|n}(i, j) = p(x_k = i, x_{k+1} = j | y_{1:n})
                    \propto p(x_k = i, x_{k+1} = j, y_{1:n})
                    = p(y_{k+2:n} | x_{k+1} = j) p(y_{k+1} | x_{k+1} = j) p(x_{k+1} = j | x_k = i) p(x_k = i, y_{1:k})
                    \propto P_{ij} \pi_k(i) \beta_{k+1|n}(j) B_{j y_{k+1}}.    (19)

Normalization is ensured by introducing the constraint \sum_{i,j} \pi_{k|n}(i, j) = 1:

    \pi_{k|n}(i, j) = \frac{\pi_k(i) P_{ij} B_{j y_{k+1}} \beta_{k+1|n}(j)}{\sum_{a,b} \pi_k(a) P_{ab} B_{b y_{k+1}} \beta_{k+1|n}(b)}.    (20)

Now, if we look for the sequence \hat{x}_{1:n} that is most likely to occur given y_{1:n}, we can resort to the Viterbi algorithm to compute it. Note that \hat{x}_{1:n} = \arg\max_{x_{1:n}} p(x_{1:n} | y_{1:n}) = \arg\max_{x_{1:n}} p(x_{1:n}, y_{1:n}) and

    p(x_{1:k+1}, y_{1:k+1}) = p(y_{k+1} | x_{k+1}) p(x_{k+1} | x_k) p(x_{1:k}, y_{1:k}) = B_{x_{k+1} y_{k+1}} P_{x_k x_{k+1}} p(x_{1:k}, y_{1:k})    (21)

where B_{x_{k+1} y_{k+1}} = p(y_{k+1} | x_{k+1}). Then, let us define

    \delta_k(j) = \max_{x_1, ..., x_{k-1}, x_k = j} p(x_{1:k}, y_{1:k})    (22)

and u_k(j) as the value of x_{k-1} that achieves this maximum. From Eq. (21), we get the following recurrence:

    \delta_{k+1}(j) = \max_i B_{j y_{k+1}} P_{ij} \delta_k(i)
    u_{k+1}(j) = \arg\max_i B_{j y_{k+1}} P_{ij} \delta_k(i).    (23)

This recurrence permits the design of a forward dynamic programming algorithm to maximize p(x_{1:n} | y_{1:n}):

    Initialization: \delta_0(i) = \pi_0(i)
    Forward steps: for k = 0, ..., n-1:
        \delta_{k+1}(j) = \max_i B_{j y_{k+1}} P_{ij} \delta_k(i)
        u_{k+1}(j) = \arg\max_i B_{j y_{k+1}} P_{ij} \delta_k(i)
    Backtracking: \hat{x}_n = \arg\max_j \delta_n(j),
        \hat{x}_k = u_{k+1}(\hat{x}_{k+1}),  k = n-1, ..., 1.    (24)

Note that for numerical stability purposes the Viterbi algorithm is often implemented with sums instead of products, by maximizing log p(x_{1:k}, y_{1:k}) rather than p(x_{1:k}, y_{1:k}).

2.3. Parameter estimation

Let us briefly have a look at maximum likelihood estimation of MC and HMC parameters. For an MC x, the parameters are \theta = (\pi_0, P). Given an observation x_{0:n}, the log-likelihood is given by

    L(\theta) = \log p(x_{0:n} | \theta) = \sum_{k=1:n} \log p(x_k | x_{k-1}, P) + \log p(x_0 | \pi_0) = \sum_{i,j} n_{ij} \log P_{ij} + \log \pi_0(x_0),    (25)

where n_{ij} is the number of transitions from state i to state j in the pairs (x_k, x_{k+1}) (k = 0, ..., n-1). Then, maximizing L(\theta) under probability normalization constraints(1), we get the maximum likelihood estimators for P and \pi_0: \hat{P}_{ij} = n_{ij} / \sum_k n_{ik} and \hat{\pi}_0(i) = \delta_{x_0, i}, where \delta_{a,b} denotes the Kronecker symbol (\delta_{a,b} = 1 if a = b and 0 otherwise).
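The HMM filter (12), the backward recursion (14) and the forward-backward formula (18) can be sketched in a few lines of Python. Everything below (the 2-state model, the emission probabilities, the observation sequence) is an illustrative assumption, not material from the notes:

```python
# Illustrative sketch of the HMM filter (12) and of the smoother (18)
# for a hypothetical 2-state chain with binary observations.

def hmm_filter_step(P, B_col, pi):
    """pi_{k+1} = B_y P^T pi_k, normalized (Eq. (12)); B_col[j] = p(y | x = j)."""
    X = len(P)
    q = [B_col[j] * sum(P[i][j] * pi[i] for i in range(X)) for j in range(X)]
    s = sum(q)
    return [qj / s for qj in q]

def backward_step(P, B_col, beta):
    """beta_{k|n} = P B_{y_{k+1}} beta_{k+1|n} (Eq. (14))."""
    X = len(P)
    return [sum(P[i][j] * B_col[j] * beta[j] for j in range(X)) for i in range(X)]

P = [[0.7, 0.3], [0.3, 0.7]]
B = [[0.9, 0.2], [0.1, 0.8]]          # B[y][j] = p(y_k = y | x_k = j)
ys = [0, 0, 1]                        # observations y_1, y_2, y_3
pi = [0.5, 0.5]                       # pi_0

pis = []                              # filtered distributions pi_1, ..., pi_3
for y in ys:
    pi = hmm_filter_step(P, B[y], pi)
    pis.append(pi)

# smoothing at k = 2 with n = 3: pi_{2|3}(i) prop. to pi_2(i) beta_{2|3}(i), Eq. (18)
beta = backward_step(P, B[ys[2]], [1.0, 1.0])    # beta_{2|3} = P B_{y_3} 1
num = [pis[1][i] * beta[i] for i in range(2)]
s = sum(num)
smoothed = [v / s for v in num]
```

With this emission matrix, observations y = 0 favor the first state (index 0 in the code), so both the filtered pi_2 and the smoothed pi_{2|3} concentrate on it; the later observation y_3 = 1 pulls some mass away, which is exactly what the backward term captures.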
For an HMM, the Markovian state is unobserved, so direct estimation of the parameters is generally unfeasible. As seen in the lecture about inference in HMMs [3], where the problem of estimating the parameters of mixtures of distributions is addressed, one often resorts to the EM (Expectation-Maximization) algorithm in the presence of hidden variables. Before considering an example, let us recall the EM algorithm to estimate the parameters \theta of an HMM (x, y) from an observation y_{1:n}:

(1) Here, constrained optimization can be achieved either by using the method of Lagrange multipliers or by eliminating one variable in each line of P, for instance replacing the variable P_{iX} by 1 - \sum_{j=1:X-1} P_{ij} for i \le X.

    Initialization: \theta^{(0)}
    Iterations: for t = 1, 2, ...
        E step (expectation): Q(\theta^{(t-1)}, \theta) = E_{\theta^{(t-1)}}[ \log p(x_{0:n}, y_{1:n} | \theta) | y_{1:n} ]
        M step (maximization): \theta^{(t)} = \arg\max_\theta Q(\theta^{(t-1)}, \theta)    (26)

where E_\theta[.] represents the expectation computed assuming the value \theta for the parameter of the model. For the HMM,

    p(x_{0:n}, y_{1:n} | \theta) = \prod_{k=1:n} p(y_k | x_k, \theta) p(x_k | x_{k-1}, \theta) p(x_0)    (27)

and, for fixed p(x_0),

    Q(\theta^{(t-1)}, \theta) = \sum_k \sum_{x_k} \log p(y_k | x_k, \theta) p(x_k | y_{1:n}, \theta^{(t-1)})
        + \sum_k \sum_{x_k, x_{k-1}} \log p(x_k | x_{k-1}, \theta) p(x_k, x_{k-1} | y_{1:n}, \theta^{(t-1)}) + const.    (28)

We see that these equations involve the smoothing distributions p(x_k | y_{1:n}, \theta^{(t-1)}) and p(x_k, x_{k-1} | y_{1:n}, \theta^{(t-1)}), which can be computed by the forward-backward algorithm (see Eq. (18) and Eq. (20)). Letting \theta = (P, \eta = {\eta_i}_{i=1:X}), where \eta_i denotes the parameters of the distribution of y_k knowing that x_k = i, this expression becomes

    Q(\theta^{(t-1)}, \theta) = \sum_k \sum_i \pi_{k|n}^{(t-1)}(i) \log B_{i y_k}(\eta_i) + \sum_k \sum_{i,j} \pi_{k|n}^{(t-1)}(i, j) \log P_{ij}    (29)

where \pi_{k|n}^{(t-1)}(i) = p(x_k = i | y_{1:n}, \theta^{(t-1)}) and \pi_{k|n}^{(t-1)}(i, j) = p(x_k = i, x_{k+1} = j | y_{1:n}, \theta^{(t-1)}) can be calculated recursively from formulas (18) and (20). Accounting for the constraints \sum_j P_{ij} = 1 (i \le X), the maximization of Q(\theta^{(t-1)}, \theta) w.r.t. P yields

    P_{ij}^{(t)} = \frac{\sum_k \pi_{k|n}^{(t-1)}(i, j)}{\sum_k \pi_{k|n}^{(t-1)}(i)}.    (30)

Examples.

1) Let us consider the case where y is continuous-valued, with values in R^p, and y_k | x_k = i \sim N(m_i, \Sigma_i). Then \eta_i = (m_i, \Sigma_i) and

    \sum_k \pi_{k|n}^{(t-1)}(i) \log B_{i y_k}(\eta_i)
        = -(1/2) \sum_k \pi_{k|n}^{(t-1)}(i) [ \log|\Sigma_i| + (y_k - m_i)^T \Sigma_i^{-1} (y_k - m_i) ] + const
        = -(1/2) [ ( \sum_k \pi_{k|n}^{(t-1)}(i) ) \log|\Sigma_i| + Tr( \Sigma_i^{-1} \sum_k \pi_{k|n}^{(t-1)}(i) (y_k - m_i)(y_k - m_i)^T ) ] + const    (31)

where Tr(.) denotes the trace operator. Cancelling the partial derivatives w.r.t. m_i yields

    m_i^{(t)} = \frac{\sum_k \pi_{k|n}^{(t-1)}(i) y_k}{\sum_k \pi_{k|n}^{(t-1)}(i)}.    (32)

Now, letting \Sigma_i = U_i \Lambda_i U_i^T denote the eigenvalue decomposition of \Sigma_i, with \Lambda_i = diag[\lambda_1, ..., \lambda_p], and

    \hat{\Sigma}_i^{(t-1)}(m_i) = \frac{\sum_k \pi_{k|n}^{(t-1)}(i) (y_k - m_i)(y_k - m_i)^T}{\sum_k \pi_{k|n}^{(t-1)}(i)},    (33)

we note that

    \log|\Sigma_i| = \log|U_i| + \log|\Lambda_i| + \log|U_i^T| = \log|U_i U_i^T| + \log|\Lambda_i| = 0 + \sum_j \log \lambda_j
    Tr( \Sigma_i^{-1} \hat{\Sigma}_i^{(t-1)}(m_i) ) = Tr( U_i \Lambda_i^{-1} U_i^T \hat{\Sigma}_i^{(t-1)}(m_i) ) = Tr( \Lambda_i^{-1} U_i^T \hat{\Sigma}_i^{(t-1)}(m_i) U_i ).    (34)

Then, when c_i = \sum_k \pi_{k|n}^{(t-1)}(i) \log B_{i y_k}(\eta_i) is maximum, we get

    \frac{\partial c_i}{\partial \lambda_j} = -(1/2) ( \sum_k \pi_{k|n}^{(t-1)}(i) ) [ \frac{1}{\lambda_j} - \frac{1}{\lambda_j^2} [U_i]_j^T \hat{\Sigma}_i^{(t-1)}(m_i) [U_i]_j ] = 0,  1 \le j \le p,    (35)

where [U_i]_j denotes the j-th column of U_i. Put in matrix form, Equations (35) rewrite

    \Lambda_i = diag( U_i^T \hat{\Sigma}_i^{(t-1)}(m_i) U_i ),    (36)

and as \Sigma_i = U_i \Lambda_i U_i^T, we find that

    \Sigma_i^{(t)} = \hat{\Sigma}_i^{(t-1)}(m_i^{(t)}) = \frac{\sum_k \pi_{k|n}^{(t-1)}(i) (y_k - m_i^{(t)})(y_k - m_i^{(t)})^T}{\sum_k \pi_{k|n}^{(t-1)}(i)}.    (37)

Finally, for a Gaussian HMM the EM algorithm is completely described by equations (18) and (20) for the E step and by equations (30), (32) and (37) for the M step.

2) Now, if y is discrete with values in Y = {1, ..., Y}, with P(y_k = j | x_k = i) = B_{ij}, then \eta_i = (B_{i1}, ..., B_{iY}) and

    \sum_k \sum_i \pi_{k|n}^{(t-1)}(i) \log B_{i y_k}(\eta_i) = \sum_{i=1:X} \sum_{j=1:Y} \sum_k \pi_{k|n}^{(t-1)}(i) \delta_{j, y_k} \log B_{ij}.    (38)

Accounting for the constraints \sum_j B_{ij} = 1, the maximum of Q(\theta^{(t-1)}, \theta) is achieved for

    B_{ij}^{(t)} = \frac{\sum_k \pi_{k|n}^{(t-1)}(i) \delta_{j, y_k}}{\sum_k \pi_{k|n}^{(t-1)}(i)}.    (39)
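As a small illustration of the M-step updates (30) and (39), the sketch below assumes the smoothed probabilities pi_{k|n}(i) and pi_{k|n}(i, j) have already been produced by a forward-backward pass; the function name m_step and the toy numbers are hypothetical, not from the notes:

```python
# Illustrative sketch of the M-step: Eq. (30) for the transitions and
# Eq. (39) for discrete emissions. pi_s[k][i] stands for pi_{k|n}(i) and
# pi_pair[k][i][j] for pi_{k|n}(i, j); both are assumed precomputed.

def m_step(pi_s, pi_pair, ys, Y):
    X = len(pi_s[0])
    n_pairs = len(pi_pair)            # pairs (x_k, x_{k+1}), k = 0, ..., n-2
    # transition update, Eq. (30)
    P = [[sum(pi_pair[k][i][j] for k in range(n_pairs)) /
          sum(pi_s[k][i] for k in range(n_pairs))
          for j in range(X)] for i in range(X)]
    # emission update, Eq. (39)
    B = [[sum(pi_s[k][i] for k in range(len(ys)) if ys[k] == j) /
          sum(pi_s[k][i] for k in range(len(ys)))
          for j in range(Y)] for i in range(X)]
    return P, B

# toy smoothed probabilities (illustrative, consistent by construction)
pi_s = [[0.6, 0.4], [0.5, 0.5], [0.3, 0.7]]
pi_pair = [[[pi_s[k][i] * pi_s[k + 1][j] for j in range(2)]
            for i in range(2)] for k in range(2)]
ys = [0, 1, 1]
P_new, B_new = m_step(pi_s, pi_pair, ys, Y=2)
# rows of P_new and B_new are probability distributions
```

Note that the denominators in (30) and (39) sum over different ranges of k (n-1 pairs versus n observations), which the code reproduces.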

3. Fully observed MDP

A Markov decision process (MDP) is a controlled Markov chain x = (x_k)_{k \ge 0}. It is controlled in the sense that at time k a controller takes an action u_k, based on the observation of the current value x_k of the process. The control results in an updated transition matrix P(u_k) between instants k and k+1. These actions are chosen with a view to minimizing some cost function over a finite or infinite time interval (horizon). In each time interval [k, k+1) the cost depends on the values of x_k and of the control u_k. First, we are going to study finite horizon MDPs before considering the infinite horizon case.

3.1. Finite horizon

An MDP is characterized by its states x_k \in X = {1, ..., X} at instants k = 0, ..., n-1, actions u_k \in U = {1, ..., U}, transitions P_{k,ij}(u) = P(x_{k+1} = j | x_k = i, u_k = u) (1 \le i, j \le X and 1 \le u \le U), current costs c_k(i, u) and a final cost c_n(i). At time k, the available information is I_k = {x_0, u_0, ..., x_{k-1}, u_{k-1}, x_k} and the action u_k is chosen according to some policy \mu_k, so that u_k = \mu_k(I_k). A sequence of policies at instants 0, ..., n-1 is denoted by \mu = (\mu_0, ..., \mu_{n-1}). Then, for a given policy, a trajectory of the MDP is built as follows:

    Initialization: x_0 \sim \pi_0, I_0 = {x_0}
    For k = 0, ..., n-1: compute
        u_k = \mu_k(I_k)
        c_k(x_k, u_k)
        x_{k+1} \sim [P_{k, x_k 1}(u_k), ..., P_{k, x_k X}(u_k)]^T
        I_{k+1} = I_k \cup {u_k, x_{k+1}}
    Final cost: c_n(x_n)    (40)

In fact, the policy \mu should be chosen so as to minimize the expected cumulative cost defined by

    J_\mu(x) = E[ \sum_{k=0}^{n-1} c_k(x_k, \mu_k(I_k)) + c_n(x_n) | x_0 = x ].    (41)

It can be shown that an optimal policy sequence \mu^* = \arg\min_\mu J_\mu(x) can be found among deterministic Markovian policies, that is, policies of the form \mu_k(I_k) = \mu_k(x_k), where \mu_k is a deterministic function from X to U.
Such a policy can be computed iteratively via the following backward Bellman dynamic programming algorithm:

    Initialization: J_n(i) = c_n(i), 1 \le i \le X
    For k = n-1, ..., 0: compute
        J_k(i) = \min_u ( c_k(i, u) + \sum_j P_{k,ij}(u) J_{k+1}(j) )
        \mu_k^*(i) = \arg\min_u ( c_k(i, u) + \sum_j P_{k,ij}(u) J_{k+1}(j) )    (42)

Once the optimal policy has been computed, it can be used to apply sequentially the optimal action, given by u_k = \mu_k^*(x_k), for k = 0, ..., n-1 in Eq. (40). Note that this method can be extended to design MDPs with continuous-valued states. In particular, if we assume that x has transitions of the form

    x_{k+1} = \phi_k(x_k, u_k, w_k)    (43)

where the w_k are IID and independent from x_0, with pdf p(w), Bellman's algorithm iteration now writes

    J_k(x) = \min_u ( c_k(x, u) + \int J_{k+1}(\phi_k(x, u, w)) p(w) dw )
    \mu_k^*(x) = \arg\min_u ( c_k(x, u) + \int J_{k+1}(\phi_k(x, u, w)) p(w) dw ).    (44)
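The backward recursion (42) can be sketched as follows. The 2-state, 2-action model (a "stay" action and a "switch" action, with illustrative costs) is a hypothetical example, and for simplicity the costs and transitions are taken time-invariant:

```python
# Illustrative sketch of the backward Bellman recursion (42) for a
# toy finite-horizon MDP; P[u][i][j] and c[i][u] are hypothetical numbers.

def backward_induction(P, c, c_final, n):
    """Return J_0 and the policy sequence (mu_0, ..., mu_{n-1})."""
    X, U = len(c), len(c[0])
    J = list(c_final)                 # J_n(i) = c_n(i)
    policy = []
    for _ in range(n):                # k = n-1, ..., 0
        J_new, mu = [], []
        for i in range(X):
            q = [c[i][u] + sum(P[u][i][j] * J[j] for j in range(X))
                 for u in range(U)]
            u_best = min(range(U), key=q.__getitem__)
            J_new.append(q[u_best])
            mu.append(u_best)
        J = J_new
        policy.insert(0, mu)          # so that policy[0] is mu_0
    return J, policy

P = [[[1.0, 0.0], [0.0, 1.0]],        # action 0: stay in the current state
     [[0.0, 1.0], [1.0, 0.0]]]        # action 1: switch state
c = [[1.0, 0.5], [0.0, 0.5]]          # c[i][u]: staying in state 0 is costly
J0, policy = backward_induction(P, c, c_final=[0.0, 0.0], n=3)
# from state 0 it is optimal to pay 0.5 once and move to the free state 1
```

The expected cost-to-go J_0 and the first policy mu_0 can then be used to run the controlled trajectory (40).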

3.2. Infinite horizon

In the case of an infinite horizon, one usually considers either a discounted cost or an average cost. We are going to address these two alternatives below.

3.2.1. Discounted cost

Here, the cost at time k is given by c_k(x_k, u_k) = \rho^k c(x_k, u_k), where 0 < \rho < 1, ensuring that the cumulative cost remains bounded:

    J_\mu(i) = E_\mu[ \sum_{k=0}^{\infty} \rho^k c(x_k, u_k) | x_0 = i ] \le \frac{1}{1-\rho} \max_{i,u} |c(i, u)|.    (45)

After a long run, say n, it seems that the current time policy should become stationary, suggesting to look for a stationary policy \mu = (\mu, \mu, ...). Indeed, it can be proved that the optimal cumulative cost J_\mu(i) is obtained for such a stationary policy. To understand intuitively the derivation of this policy, first consider again the case of a finite horizon with terminal cost c_n(i) = 0. The backward Bellman equation writes

    J_k(i) = \min_u ( \rho^k c(i, u) + \sum_j P_{ij}(u) J_{k+1}(j) ).    (46)

Now, let us rewrite these equations with reversed time:

    V_k(i) = \rho^{-(n-k)} J_{n-k}(i),  k = 0, ..., n.    (47)

Then, we get V_0(i) = 0 and

    V_k(i) = \min_u ( c(i, u) + \rho \sum_j P_{ij}(u) V_{k-1}(j) ),  k = 1, ..., n.    (48)

At convergence, V_\infty = V_k = V_{k-1} and

    V_\infty(i) = \min_\mu ( E[c(x_0, \mu(x_0)) | x_0 = i] + \rho E[V_\infty(x_1) | x_0 = i] )
               = \min_\mu ( E[ \sum_{k=0}^{n-1} \rho^k c(x_k, \mu(x_k)) | x_0 = i ] + \rho^n E[V_\infty(x_n) | x_0 = i] ).    (49)

Letting n tend to infinity, it comes that V_\infty(i) = \min_\mu E[ \sum_{k=0}^{\infty} \rho^k c(x_k, \mu(x_k)) | x_0 = i ] = J_{\mu^*}(i). In fact, it can be proved formally that the minimum of (45) is obtained for a stationary policy \mu^* that is the solution of the following Bellman equations:

    V(i) = \min_u ( c(i, u) + \rho \sum_j P_{ij}(u) V(j) )
    \mu^*(i) = \arg\min_u ( c(i, u) + \rho \sum_j P_{ij}(u) V(j) ).    (50)

In practice, to get this solution, one can use a recursive approach, as suggested in the discussion above. This approach is named the value iteration algorithm. It consists in computing the following iterations:

    Initialization: V_0(i) = 0, 1 \le i \le X
    For k = 1, ..., n-1: compute
        V_k(i) = \min_u ( c(i, u) + \rho \sum_j P_{ij}(u) V_{k-1}(j) )
        \mu_k(i) = \arg\min_u ( c(i, u) + \rho \sum_j P_{ij}(u) V_{k-1}(j) ).    (51)
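The value iteration scheme (51) can be sketched as below; the 2-state, 2-action model (a "stay" action and a "switch" action with illustrative costs) is a hypothetical example, not taken from the notes:

```python
# Illustrative sketch of value iteration (51) for a toy discounted MDP.
# P[u][i][j] and c[i][u] are hypothetical numbers.

def value_iteration(P, c, rho, n_iter=500):
    X, U = len(c), len(c[0])
    V = [0.0] * X                     # V_0(i) = 0
    for _ in range(n_iter):
        V = [min(c[i][u] + rho * sum(P[u][i][j] * V[j] for j in range(X))
                 for u in range(U)) for i in range(X)]
    # greedy policy extracted from the (near-)converged value function
    mu = [min(range(U),
              key=lambda u: c[i][u] + rho * sum(P[u][i][j] * V[j]
                                                for j in range(X)))
          for i in range(X)]
    return V, mu

P = [[[1.0, 0.0], [0.0, 1.0]],        # action 0: stay in the current state
     [[0.0, 1.0], [1.0, 0.0]]]        # action 1: switch state
c = [[1.0, 0.5], [0.0, 0.5]]          # staying in state 0 costs 1 per step
V, mu = value_iteration(P, c, rho=0.9)
# optimal: pay 0.5 once in state 0 to reach the cost-free state 1
```

The geometric error bound (52) below explains why a few hundred iterations are ample for rho = 0.9.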

Then, \hat{\mu} = (\mu_{n-1}, \mu_{n-1}, ...) will be used as an approximate optimal solution. It can be shown that the mean cumulative cost error obtained when replacing \mu^* by \hat{\mu} decreases geometrically with n:

    |V(i) - V_n(i)| \le \frac{\rho^{n+1}}{1-\rho} \max_{i,u} |c(i, u)|.    (52)

However, we can do better. Indeed, there exist U^X stationary policies and it is possible to derive algorithms that reach the solution after a finite number of iterations. In particular, the policy iteration algorithm operates as follows:

    Initialization: choose an initial policy \mu_0 and compute J_{\mu_0} by solving
        J_{\mu_0}(i) = c(i, \mu_0(i)) + \rho \sum_j P_{ij}(\mu_0(i)) J_{\mu_0}(j)
    For k = 1, 2, ...: compute
        \mu_k(i) = \arg\min_u ( c(i, u) + \rho \sum_j P_{ij}(u) J_{\mu_{k-1}}(j) )
        solve J_{\mu_k}(i) = c(i, \mu_k(i)) + \rho \sum_j P_{ij}(\mu_k(i)) J_{\mu_k}(j)
    Iterate until J_{\mu_k} no longer decreases.    (53)

Alternatively, one can also use standard linear optimization libraries to solve the problem restated as a linear programming problem (see e.g. [4] for an introduction to linear programming). To this end, note that if

    W(i) \le \min_u ( c(i, u) + \rho \sum_j P_{ij}(u) W(j) ),  i \le X,

then, iterating the inequality over the W(j), we get

    W(i) \le \min_u E[ \sum_{k=0}^{\infty} \rho^k c(x_k, u_k) ] = V(i).    (54)

Thus, V is the largest such W and it is the solution of the following linear programming problem:

    \max_{W(1), ..., W(X)} \sum_i \alpha_i W(i)
    subject to W(i) \le c(i, u) + \rho \sum_j P_{ij}(u) W(j),  1 \le i \le X, 1 \le u \le U,    (55)

where the coefficients \alpha_i are chosen such that \alpha_i > 0.

So, it appears that the discounted cumulative cost is relatively easy to handle. However, it weighs the first terms more significantly, due to the exponentially fast decay of the weights \rho^k. Alternatively, an average cost approach can be considered, but it is technically more complex.

3.2.2. Average cost

In the average cost approach, the cumulative cost is defined as

    J_\mu(x) = \lim_{n \to \infty} \frac{1}{n+1} E_\mu[ \sum_{k=0}^{n} c(x_k, \mu_k(x_k)) | x_0 = x ].    (56)

An MDP is said to be unichain if for every stationary policy its Markov chain has only one recurrent class. For such MDPs, there exists a stationary policy that minimizes the average cost.
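The policy iteration scheme (53) can be sketched as follows. Two simplifications are assumed and are not the notes' prescription: the policy evaluation step, stated in (53) as a linear system, is performed here by fixed-point iteration, and the toy 2-state, 2-action model is illustrative:

```python
# Illustrative sketch of policy iteration (53); the evaluation step solves
# J_mu = c_mu + rho P_mu J_mu approximately by fixed-point iteration
# instead of a linear solve. Model numbers are hypothetical.

def evaluate(P, c, mu, rho, n_iter=500):
    """Approximate J_mu for a fixed stationary policy mu."""
    X = len(c)
    J = [0.0] * X
    for _ in range(n_iter):
        J = [c[i][mu[i]] + rho * sum(P[mu[i]][i][j] * J[j] for j in range(X))
             for i in range(X)]
    return J

def policy_iteration(P, c, rho):
    X, U = len(c), len(c[0])
    mu = [0] * X                      # initial policy
    while True:
        J = evaluate(P, c, mu, rho)   # policy evaluation
        mu_new = [min(range(U),       # policy improvement
                      key=lambda u: c[i][u] + rho * sum(P[u][i][j] * J[j]
                                                        for j in range(X)))
                  for i in range(X)]
        if mu_new == mu:
            return mu, J
        mu = mu_new

P = [[[1.0, 0.0], [0.0, 1.0]],        # action 0: stay
     [[0.0, 1.0], [1.0, 0.0]]]        # action 1: switch state
c = [[1.0, 0.5], [0.0, 0.5]]
mu, J = policy_iteration(P, c, rho=0.9)
```

Since the number of stationary policies is finite (U^X, as noted above), the improvement loop stops after finitely many policy changes.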
In addition, if the state space is finite, the optimal policy is given by the equations

    g + V(i) = \min_u ( c(i, u) + \sum_j P_{ij}(u) V(j) )
    \mu^*(i) = \arg\min_u ( c(i, u) + \sum_j P_{ij}(u) V(j) ).    (57)

The first equation is called the average cost optimality equation (ACOE). In this equation, the constant g is unique and equal to the optimal stationary average cost J_{\mu^*}(x), while V is not unique: adding the same constant to the V(i) satisfying Eq. (57) yields another solution for V.

An interesting point is that for unichain finite state MDPs, the discounted cost solution tends to the average cost solution as \rho approaches 1. In fact, as the number of stationary policies is finite, both solutions are equal for \rho in some interval [\hat{\rho}, 1). To implement the computation of the optimal policy, one can consider the relative value iteration algorithm. Since the optimal V(i) are defined up to an additive constant, one can set V(1) = 0. Then, the first equation in (57) writes

    g = \min_u ( c(1, u) + \sum_{j > 1} P_{1j}(u) V(j) ).    (58)

This yields the following iterations:

    Initialization: V_0(i) = 0, 1 \le i \le X; g_0 = 0
    For k = 1, ..., n-1: compute
        V_k(1) = 0
        V_k(i) = \min_u ( c(i, u) + \sum_j P_{ij}(u) V_{k-1}(j) ) - g_{k-1},  i > 1
        g_k = \min_u ( c(1, u) + \sum_j P_{1j}(u) V_{k-1}(j) )
        \mu_k(i) = \arg\min_u ( c(i, u) + \sum_j P_{ij}(u) V_{k-1}(j) ).    (59)

Then, \hat{\mu} = (\mu_{n-1}, \mu_{n-1}, ...) will be used as an approximate optimal solution. As for the discounted cost, a linear programming approach can also be used. Paralleling the discussion in the discounted cost section, we get the following linear programming problem:

    \max_{V, g} g
    subject to g + V(i) \le c(i, u) + \sum_j P_{ij}(u) V(j),  1 \le i \le X, 1 \le u \le U.    (60)

Note that the dual of this problem (see e.g. [4] for an introduction to duality) has an interesting form: the dual variables \pi(i, u) (1 \le i \le X, 1 \le u \le U) can be interpreted as probabilities \pi(i, u) = P(x_k = i, u_k = u) of a randomized policy, although at the optimum \pi(i, u) = 0 unless u = \mu^*(i), since the optimal policy is deterministic for a finite state unichain MDP. The dual form can accommodate additional cost constraints. Adding such constraints may then lead to randomized optimal solutions.
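The relative value iteration (59) can be sketched as below. The reference state is the first one (state 1 in the notes' numbering, index 0 in the code); the normalization used here (subtracting the minimum at the reference state after each sweep) is a common variant of (59), and the model numbers are illustrative:

```python
# Illustrative sketch of relative value iteration for a toy average-cost
# unichain MDP; g_k estimates the optimal average cost g of the ACOE (57).

def relative_value_iteration(P, c, n_iter=200):
    X, U = len(c), len(c[0])
    V = [0.0] * X
    g = 0.0
    for _ in range(n_iter):
        T = [min(c[i][u] + sum(P[u][i][j] * V[j] for j in range(X))
                 for u in range(U)) for i in range(X)]
        g = T[0]                      # value at the reference state
        V = [T[i] - g for i in range(X)]   # keeps V(reference) = 0
    mu = [min(range(U),
              key=lambda u: c[i][u] + sum(P[u][i][j] * V[j] for j in range(X)))
          for i in range(X)]
    return g, V, mu

P = [[[0.5, 0.5], [0.5, 0.5]],        # action 0
     [[0.9, 0.1], [0.1, 0.9]]]        # action 1 (sticky)
c = [[1.0, 2.0], [3.0, 0.5]]          # c[i][u]
g, V, mu = relative_value_iteration(P, c)
# here the optimal average cost solves g = 7/12, achieved by mu = [0, 1]
```

At convergence, g and V satisfy the ACOE (57) with V pinned to 0 at the reference state, and the extracted policy is the argmin of (57).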

4. Partially observed MDP

4.1. Introduction

A partially observed Markov decision process (POMDP) is a controlled HMM. As the state of the HMM is unknown, the control uses the filtered state to decide which action to perform. So, in this section, we are going to use results from both previous sections, related to HMM filtering and to MDP control. As for MDPs, in these notes we will consider the case of a finite horizon and then of an infinite horizon. Let us recall that an MDP is characterized by several quantities: states x_k \in X = {1, ..., X}, actions u_k \in U = {1, ..., U}, transition probabilities P(u) with u \in U, current costs c_k(i, u) and terminal costs c_n(x_n). In addition to these quantities, a POMDP also involves an observation process y with values in a finite or infinite (possibly uncountable) set, with conditional distributions

    B_{i y_k}(u) = p(y_k | x_k = i, u_{k-1} = u).    (61)

Note that the control can affect the states x_k but also the conditional distributions of the y_k. The dynamics of the POMDP differs from that of the MDP in that a trajectory of y is observed rather than a trajectory of x. Thus, the information available at time k is I_k = {\pi_0, u_0, y_1, ..., u_{k-1}, y_k}. Then, for a given policy \mu, a trajectory of the POMDP is built as follows:

    Initialization: x_0 \sim \pi_0, I_0 = {\pi_0}
    For k = 0, ...: compute
        u_k = \mu_k(I_k)
        c_k(x_k, u_k)
        x_{k+1} \sim [P_{x_k 1}(u_k), ..., P_{x_k X}(u_k)]^T
        y_{k+1} \sim B_{x_{k+1} y_{k+1}}(u_k) = p(y_{k+1} | x_{k+1}, u_k)
        I_{k+1} = I_k \cup {u_k, y_{k+1}}    (62)

In the case of a finite trajectory, the loop is performed for k = 0, ..., n-1, with an additional step for computing c_n(x_n).

4.2. Finite horizon

For a finite horizon interval [0, n], the optimal policy minimizes

    J_\mu(\pi_0) = E_\mu[ \sum_{k=0}^{n-1} c_k(x_k, u_k) + c_n(x_n) | \pi_0 ]    (63)

where the expectation is calculated w.r.t. p(x_0, y_0, ..., x_n, y_n). Note that more general costs, of the form c_k(x_k, x_{k+1}, y_k, y_{k+1}, u_k), can also be considered.
The objective function (63) rewrites

    J_\mu(\pi_0) = E_\mu[ \sum_{k=0}^{n-1} c_k(x_k, u_k) + c_n(x_n) | \pi_0 ]
                = E_\mu[ \sum_{k=0}^{n-1} E[c_k(x_k, u_k) | I_k] + E[c_n(x_n) | I_n] | \pi_0 ]
                = E_\mu[ \sum_{k=0}^{n-1} \sum_i c_k(i, u_k) \pi_k(i) + \sum_i c_n(i) \pi_n(i) | \pi_0 ]
                = E_\mu[ \sum_{k=0}^{n-1} c_k^T(u_k) \pi_k + c_n^T \pi_n | \pi_0 ],    (64)

where

    \pi_k(i) = p(x_k = i | I_k) = P(x_k = i | \pi_0, u_0, y_1, ..., u_{k-1}, y_k).    (65)

From the filtering of HMMs (Section 2.2), \pi_k = T(\pi_{k-1}, y_k, u_{k-1}), where

    T(\pi, y, u) = \frac{B_y(u) P^T(u) \pi}{1^T B_y(u) P^T(u) \pi}.    (66)
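The belief update T(pi, y, u) of Eq. (66) is just the HMM filter of Section 2.2 with action-dependent matrices. A minimal sketch (the controlled 2-state model and its numbers are hypothetical; sigma is the normalization factor that reappears in the Bellman equations below):

```python
# Illustrative sketch of the POMDP belief update, Eq. (66):
# T(pi, y, u) = B_y(u) P(u)^T pi / sigma(pi, y, u),
# with sigma(pi, y, u) = 1^T B_y(u) P(u)^T pi = p(y | I_k, u).

def belief_update(P_u, B_col, pi):
    """P_u = P(u); B_col[j] = p(y | x_{k+1} = j, u) for the observed y."""
    X = len(P_u)
    q = [B_col[j] * sum(P_u[i][j] * pi[i] for i in range(X)) for j in range(X)]
    sigma = sum(q)
    return [qj / sigma for qj in q], sigma

P_u = [[0.7, 0.3], [0.4, 0.6]]        # P(u) for one fixed action u
B_col = [0.8, 0.3]                    # likelihood of the observed y
pi, sigma = belief_update(P_u, B_col, [0.5, 0.5])
# the updated belief stays on the simplex Pi(X); sigma = 0.575 here
```

Since sigma(pi, y, u) is the predictive probability of the observation, it is exactly the weight attached to each branch of the Bellman backup over future observations.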

We will also note \sigma(\pi, y, u) = 1^T B_y(u) P^T(u) \pi. It can be proved that the optimal cost for a finite time POMDP is achieved by a deterministic policy that can be computed recursively via the following Bellman backward equations:

    Initialization: J_n(\pi) = c_n^T \pi
    For k = n-1, ..., 0: compute
        J_k(\pi) = \min_u ( c_k^T(u) \pi + \sum_y J_{k+1}(T(\pi, y, u)) \sigma(\pi, y, u) )
        \mu_k(\pi) = \arg\min_u ( c_k^T(u) \pi + \sum_y J_{k+1}(T(\pi, y, u)) \sigma(\pi, y, u) ).    (67)

Indeed,

    J_k(\pi) = \min_u ( c_k^T(u) \pi + E[ J_{k+1}(T(\pi, y, u)) | \pi_k = \pi ] )    (68)

and

    E[ J_{k+1}(T(\pi, y, u)) | \pi_k = \pi ]
        = \sum_{y_{k+1}} J_{k+1}(T(\pi, y_{k+1}, u)) p(y_{k+1} | \pi_k = \pi, u)
        = \sum_{y_{k+1}} J_{k+1}(T(\pi, y_{k+1}, u)) \sum_{x_{k+1}} p(y_{k+1} | x_{k+1}, u) \sum_{x_k} p(x_{k+1} | x_k, u) \pi(x_k)
        = \sum_{y_{k+1}} J_{k+1}(T(\pi, y_{k+1}, u)) 1^T B_{y_{k+1}}(u) P^T(u) \pi
        = \sum_{y_{k+1}} J_{k+1}(T(\pi, y_{k+1}, u)) \sigma(\pi, y_{k+1}, u).    (69)

Unlike the MDP, where the cost is defined on X = {1, ..., X}, for the POMDP it is defined on the (X-1)-dimensional simplex \Pi(X):

    \Pi(X) = { \pi \in R^X ; 1^T \pi = 1, 0 \le \pi(i) \le 1, i = 1, ..., X },    (70)

called the belief space. We are going to see that J_k is piecewise linear and concave. Note that in the case where y_k varies continuously, the sum in J_k becomes an integral and we get

    J_k(\pi) = \min_u ( c_k^T(u) \pi + \int J_{k+1}(T(\pi, y, u)) \sigma(\pi, y, u) dy ).    (71)

In such situations, J_k is no longer piecewise linear, but it remains concave. Unfortunately, \mu_k(\pi) cannot be stored numerically as \Pi(X) is a continuous domain, thus making recursion (67) impractical, even when y has values in a finite set. However, as stated above, it can be proved that J_k(\pi) is piecewise linear and concave over \Pi(X). This can be proved recursively. Letting \Gamma_k be the set of vectors in R^X that characterize these linear components, we get

    J_k(\pi) = \min_{\gamma \in \Gamma_k} \gamma^T \pi.    (72)

Then, the optimal policy is given by

    \mu_k(\pi) = \arg\min_{\gamma \in \Gamma_k} \gamma^T \pi.    (73)

Thus, it is sufficient to store the sets \Gamma_k to design the optimal policy. Unfortunately, it can be shown that the cardinality |\Gamma_k| of \Gamma_k can reach up to U |\Gamma_{k+1}|^{|Y|} vectors, again making the algorithm impractical except in simple situations where Y, U and n are small.
However, for the purpose of deriving algorithms, it is worth detailing how to go from \Gamma_{k+1} to \Gamma_k. Note first that J_n is piecewise linear: J_n(\pi) = c_n^T \pi. Then, if J_{k+1}(\pi) = \min_{\gamma \in \Gamma_{k+1}} \gamma^T \pi,

    J_k(\pi) = \min_u ( c_k^T(u) \pi + \sum_y \min_{\gamma \in \Gamma_{k+1}} \gamma^T \frac{B_y(u) P^T(u) \pi}{1^T B_y(u) P^T(u) \pi} \, 1^T B_y(u) P^T(u) \pi )
             = \min_u ( [ c_k(u) + \sum_{y \in Y} \min_{\gamma \in \Gamma_{k+1}} P(u) B_y(u) \gamma ]^T \pi )
             = \min_{\gamma \in \Gamma_k} \gamma^T \pi,    (74)

where \Gamma_k can be set as follows:

    \Gamma_k = \bigcup_{u \in U} { c_k(u) + \sum_{y \in Y} P(u) B_y(u) \gamma_y ; \gamma_y \in \Gamma_{k+1} }.    (75)

Instead of searching for the exact solution of Eq. (67) based on the iterative computation of \Gamma_k, a limited subset can be considered to find a suboptimal solution. Restricting to a subset \bar{\Gamma}_k of \Gamma_k results in an approximate cumulative cost \bar{J}_k(\pi) = \min_{\gamma \in \bar{\Gamma}_k} \gamma^T \pi that is an upper bound of J_k. This leads to Lovejoy's algorithm:

    Initialization: \bar{\Gamma}_n = {c_n}, compute \Gamma_{n-1} from Eq. (75)
    For k = n-1, n-2, ...:
        sample \Pi(X) \to \Pi_S(X) = {\pi_1, ..., \pi_R}
        set \bar{\Gamma}_k = { \arg\min_{\gamma \in \Gamma_k} \gamma^T \pi_r ; \pi_r \in \Pi_S(X) }
        compute \Gamma_{k-1} from \bar{\Gamma}_k following Eq. (75).    (76)

So, we see that we can keep the complexity constant at each step of the policy computation by using the approximate sets \bar{\Gamma}_k. Alternatively, open loop control defines a suboptimal strategy based only on the average predicted cost E[\sum_k c_k^T \pi_k | \pi_0, u_0, ..., u_{n-1}]:

    E[ \sum_k c_k^T \pi_k | \pi_0, u_0, ..., u_{n-1} ] = c_0^T(u_0) \pi_0 + c_1^T(u_1) P^T(u_0) \pi_0 + ... + c_n^T P^T(u_{n-1}) ... P^T(u_0) \pi_0.    (77)

A more efficient strategy is open loop feedback control (OLFC), which accounts for the observations, as described below:

    For m = 0, ..., n-1:
        compute C_m(u_m, ..., u_{n-1}) = E[ \sum_{k=m}^{n} c_k^T \pi_k | \pi_m, u_m, ..., u_{n-1} ]
            = c_m^T(u_m) \pi_m + ... + c_n^T P^T(u_{n-1}) ... P^T(u_m) \pi_m
        compute (u_m^*, ..., u_{n-1}^*) = \arg\min C_m(u_m, ..., u_{n-1})
        select \mu_m(\pi_m) = u_m^*
        compute \pi_{m+1} = T(\pi_m, y_{m+1}, u_m^*).    (78)

4.3. Infinite horizon

For discounted POMDPs, as for MDPs, we consider an objective function of the form

    J_\mu(\pi_0) = E_\mu[ \sum_{k=0}^{\infty} \rho^k c(x_k, u_k) ] = E_\mu[ \sum_{k=0}^{\infty} \rho^k c_{u_k}^T \pi_k ],    (79)

with c_u = [c(1, u), ..., c(X, u)]^T. Here again, the optimal policy is a stationary deterministic Markovian policy \mu^* that is obtained by solving the following Bellman dynamic programming equation:

    V(\pi) = \min_u ( c_u^T \pi + \rho \sum_y V(T(\pi, y, u)) \sigma(\pi, y, u) )
    \mu^*(\pi) = \arg\min_u ( c_u^T \pi + \rho \sum_y V(T(\pi, y, u)) \sigma(\pi, y, u) ).    (80)

It can be proved that for \rho \in (0, 1) value iteration converges to the unique solution V of Bellman's equation, which satisfies V(\pi) = \min_u ( c_u^T \pi + \rho \sum_y V(T(\pi, y, u)) \sigma(\pi, y, u) ).
The iterations of the value iteration algorithm can be implemented using the strategies developed for finite horizon POMDPs. Note also that the bound

    |V_n(\pi) - V(\pi)| \le \frac{\rho^{n+1}}{1-\rho} \max_{x,u} |c(x, u)|

still holds.

References

[1] V. KRISHNAMURTHY, Partially observed Markov decision processes: from filtering to controlled sensing, Cambridge University Press. (Ref. IMTA-Brest library: 1.88 KRIS)
[2] T. CHONAVEL, Linear and nonlinear filtering for state space models, Master course notes, UdelaR.
[3] S. VATON, T. CHONAVEL, Modélisation et Simulation Stochastique, IMT Atlantique, Mai 2014. (.../cours_printemps2014.pdf)
[4] D.G. LUENBERGER, Linear and nonlinear programming, 3rd edition, Kluwer Academic Publishers.
[5] D.P. BERTSEKAS, Dynamic programming and stochastic control, Academic Press. (Ref. IMTA-Brest library: 1.75 BERT)

IMT Atlantique, 2017. Legal deposit: February 2017. ISSN: pending.


More information

MMA and GCMMA two methods for nonlinear optimization

MMA and GCMMA two methods for nonlinear optimization MMA and GCMMA two methods for nonlnear optmzaton Krster Svanberg Optmzaton and Systems Theory, KTH, Stockholm, Sweden. krlle@math.kth.se Ths note descrbes the algorthms used n the author s 2007 mplementatons

More information

Hidden Markov Models

Hidden Markov Models CM229S: Machne Learnng for Bonformatcs Lecture 12-05/05/2016 Hdden Markov Models Lecturer: Srram Sankararaman Scrbe: Akshay Dattatray Shnde Edted by: TBD 1 Introducton For a drected graph G we can wrte

More information

Linear Approximation with Regularization and Moving Least Squares

Linear Approximation with Regularization and Moving Least Squares Lnear Approxmaton wth Regularzaton and Movng Least Squares Igor Grešovn May 007 Revson 4.6 (Revson : March 004). 5 4 3 0.5 3 3.5 4 Contents: Lnear Fttng...4. Weghted Least Squares n Functon Approxmaton...

More information

Chapter 5. Solution of System of Linear Equations. Module No. 6. Solution of Inconsistent and Ill Conditioned Systems

Chapter 5. Solution of System of Linear Equations. Module No. 6. Solution of Inconsistent and Ill Conditioned Systems Numercal Analyss by Dr. Anta Pal Assstant Professor Department of Mathematcs Natonal Insttute of Technology Durgapur Durgapur-713209 emal: anta.bue@gmal.com 1 . Chapter 5 Soluton of System of Lnear Equatons

More information

The Expectation-Maximization Algorithm

The Expectation-Maximization Algorithm The Expectaton-Maxmaton Algorthm Charles Elan elan@cs.ucsd.edu November 16, 2007 Ths chapter explans the EM algorthm at multple levels of generalty. Secton 1 gves the standard hgh-level verson of the algorthm.

More information

Lecture 10 Support Vector Machines II

Lecture 10 Support Vector Machines II Lecture 10 Support Vector Machnes II 22 February 2016 Taylor B. Arnold Yale Statstcs STAT 365/665 1/28 Notes: Problem 3 s posted and due ths upcomng Frday There was an early bug n the fake-test data; fxed

More information

Logistic Regression. CAP 5610: Machine Learning Instructor: Guo-Jun QI

Logistic Regression. CAP 5610: Machine Learning Instructor: Guo-Jun QI Logstc Regresson CAP 561: achne Learnng Instructor: Guo-Jun QI Bayes Classfer: A Generatve model odel the posteror dstrbuton P(Y X) Estmate class-condtonal dstrbuton P(X Y) for each Y Estmate pror dstrbuton

More information

Quantifying Uncertainty

Quantifying Uncertainty Partcle Flters Quantfyng Uncertanty Sa Ravela M. I. T Last Updated: Sprng 2013 1 Quantfyng Uncertanty Partcle Flters Partcle Flters Appled to Sequental flterng problems Can also be appled to smoothng problems

More information

The Multiple Classical Linear Regression Model (CLRM): Specification and Assumptions. 1. Introduction

The Multiple Classical Linear Regression Model (CLRM): Specification and Assumptions. 1. Introduction ECONOMICS 5* -- NOTE (Summary) ECON 5* -- NOTE The Multple Classcal Lnear Regresson Model (CLRM): Specfcaton and Assumptons. Introducton CLRM stands for the Classcal Lnear Regresson Model. The CLRM s also

More information

3.1 Expectation of Functions of Several Random Variables. )' be a k-dimensional discrete or continuous random vector, with joint PMF p (, E X E X1 E X

3.1 Expectation of Functions of Several Random Variables. )' be a k-dimensional discrete or continuous random vector, with joint PMF p (, E X E X1 E X Statstcs 1: Probablty Theory II 37 3 EPECTATION OF SEVERAL RANDOM VARIABLES As n Probablty Theory I, the nterest n most stuatons les not on the actual dstrbuton of a random vector, but rather on a number

More information

princeton univ. F 17 cos 521: Advanced Algorithm Design Lecture 7: LP Duality Lecturer: Matt Weinberg

princeton univ. F 17 cos 521: Advanced Algorithm Design Lecture 7: LP Duality Lecturer: Matt Weinberg prnceton unv. F 17 cos 521: Advanced Algorthm Desgn Lecture 7: LP Dualty Lecturer: Matt Wenberg Scrbe: LP Dualty s an extremely useful tool for analyzng structural propertes of lnear programs. Whle there

More information

COS 521: Advanced Algorithms Game Theory and Linear Programming

COS 521: Advanced Algorithms Game Theory and Linear Programming COS 521: Advanced Algorthms Game Theory and Lnear Programmng Moses Charkar February 27, 2013 In these notes, we ntroduce some basc concepts n game theory and lnear programmng (LP). We show a connecton

More information

An Experiment/Some Intuition (Fall 2006): Lecture 18 The EM Algorithm heads coin 1 tails coin 2 Overview Maximum Likelihood Estimation

An Experiment/Some Intuition (Fall 2006): Lecture 18 The EM Algorithm heads coin 1 tails coin 2 Overview Maximum Likelihood Estimation An Experment/Some Intuton I have three cons n my pocket, 6.864 (Fall 2006): Lecture 18 The EM Algorthm Con 0 has probablty λ of heads; Con 1 has probablty p 1 of heads; Con 2 has probablty p 2 of heads

More information

STATS 306B: Unsupervised Learning Spring Lecture 10 April 30

STATS 306B: Unsupervised Learning Spring Lecture 10 April 30 STATS 306B: Unsupervsed Learnng Sprng 2014 Lecture 10 Aprl 30 Lecturer: Lester Mackey Scrbe: Joey Arthur, Rakesh Achanta 10.1 Factor Analyss 10.1.1 Recap Recall the factor analyss (FA) model for lnear

More information

Using T.O.M to Estimate Parameter of distributions that have not Single Exponential Family

Using T.O.M to Estimate Parameter of distributions that have not Single Exponential Family IOSR Journal of Mathematcs IOSR-JM) ISSN: 2278-5728. Volume 3, Issue 3 Sep-Oct. 202), PP 44-48 www.osrjournals.org Usng T.O.M to Estmate Parameter of dstrbutons that have not Sngle Exponental Famly Jubran

More information

2E Pattern Recognition Solutions to Introduction to Pattern Recognition, Chapter 2: Bayesian pattern classification

2E Pattern Recognition Solutions to Introduction to Pattern Recognition, Chapter 2: Bayesian pattern classification E395 - Pattern Recognton Solutons to Introducton to Pattern Recognton, Chapter : Bayesan pattern classfcaton Preface Ths document s a soluton manual for selected exercses from Introducton to Pattern Recognton

More information

Google PageRank with Stochastic Matrix

Google PageRank with Stochastic Matrix Google PageRank wth Stochastc Matrx Md. Sharq, Puranjt Sanyal, Samk Mtra (M.Sc. Applcatons of Mathematcs) Dscrete Tme Markov Chan Let S be a countable set (usually S s a subset of Z or Z d or R or R d

More information

APPENDIX A Some Linear Algebra

APPENDIX A Some Linear Algebra APPENDIX A Some Lnear Algebra The collecton of m, n matrces A.1 Matrces a 1,1,..., a 1,n A = a m,1,..., a m,n wth real elements a,j s denoted by R m,n. If n = 1 then A s called a column vector. Smlarly,

More information

Continuous Time Markov Chains

Continuous Time Markov Chains Contnuous Tme Markov Chans Brth and Death Processes,Transton Probablty Functon, Kolmogorov Equatons, Lmtng Probabltes, Unformzaton Chapter 6 1 Markovan Processes State Space Parameter Space (Tme) Dscrete

More information

k t+1 + c t A t k t, t=0

k t+1 + c t A t k t, t=0 Macro II (UC3M, MA/PhD Econ) Professor: Matthas Kredler Fnal Exam 6 May 208 You have 50 mnutes to complete the exam There are 80 ponts n total The exam has 4 pages If somethng n the queston s unclear,

More information

The Feynman path integral

The Feynman path integral The Feynman path ntegral Aprl 3, 205 Hesenberg and Schrödnger pctures The Schrödnger wave functon places the tme dependence of a physcal system n the state, ψ, t, where the state s a vector n Hlbert space

More information

The Geometry of Logit and Probit

The Geometry of Logit and Probit The Geometry of Logt and Probt Ths short note s meant as a supplement to Chapters and 3 of Spatal Models of Parlamentary Votng and the notaton and reference to fgures n the text below s to those two chapters.

More information

Markov chains. Definition of a CTMC: [2, page 381] is a continuous time, discrete value random process such that for an infinitesimal

Markov chains. Definition of a CTMC: [2, page 381] is a continuous time, discrete value random process such that for an infinitesimal Markov chans M. Veeraraghavan; March 17, 2004 [Tp: Study the MC, QT, and Lttle s law lectures together: CTMC (MC lecture), M/M/1 queue (QT lecture), Lttle s law lecture (when dervng the mean response tme

More information

Lectures - Week 4 Matrix norms, Conditioning, Vector Spaces, Linear Independence, Spanning sets and Basis, Null space and Range of a Matrix

Lectures - Week 4 Matrix norms, Conditioning, Vector Spaces, Linear Independence, Spanning sets and Basis, Null space and Range of a Matrix Lectures - Week 4 Matrx norms, Condtonng, Vector Spaces, Lnear Independence, Spannng sets and Bass, Null space and Range of a Matrx Matrx Norms Now we turn to assocatng a number to each matrx. We could

More information

Maximum Likelihood Estimation of Binary Dependent Variables Models: Probit and Logit. 1. General Formulation of Binary Dependent Variables Models

Maximum Likelihood Estimation of Binary Dependent Variables Models: Probit and Logit. 1. General Formulation of Binary Dependent Variables Models ECO 452 -- OE 4: Probt and Logt Models ECO 452 -- OE 4 Maxmum Lkelhood Estmaton of Bnary Dependent Varables Models: Probt and Logt hs note demonstrates how to formulate bnary dependent varables models

More information

EEE 241: Linear Systems

EEE 241: Linear Systems EEE : Lnear Systems Summary #: Backpropagaton BACKPROPAGATION The perceptron rule as well as the Wdrow Hoff learnng were desgned to tran sngle layer networks. They suffer from the same dsadvantage: they

More information

10-701/ Machine Learning, Fall 2005 Homework 3

10-701/ Machine Learning, Fall 2005 Homework 3 10-701/15-781 Machne Learnng, Fall 2005 Homework 3 Out: 10/20/05 Due: begnnng of the class 11/01/05 Instructons Contact questons-10701@autonlaborg for queston Problem 1 Regresson and Cross-valdaton [40

More information

Dynamic Programming. Lecture 13 (5/31/2017)

Dynamic Programming. Lecture 13 (5/31/2017) Dynamc Programmng Lecture 13 (5/31/2017) - A Forest Thnnng Example - Projected yeld (m3/ha) at age 20 as functon of acton taken at age 10 Age 10 Begnnng Volume Resdual Ten-year Volume volume thnned volume

More information

Finite Mixture Models and Expectation Maximization. Most slides are from: Dr. Mario Figueiredo, Dr. Anil Jain and Dr. Rong Jin

Finite Mixture Models and Expectation Maximization. Most slides are from: Dr. Mario Figueiredo, Dr. Anil Jain and Dr. Rong Jin Fnte Mxture Models and Expectaton Maxmzaton Most sldes are from: Dr. Maro Fgueredo, Dr. Anl Jan and Dr. Rong Jn Recall: The Supervsed Learnng Problem Gven a set of n samples X {(x, y )},,,n Chapter 3 of

More information

Problem Set 9 Solutions

Problem Set 9 Solutions Desgn and Analyss of Algorthms May 4, 2015 Massachusetts Insttute of Technology 6.046J/18.410J Profs. Erk Demane, Srn Devadas, and Nancy Lynch Problem Set 9 Solutons Problem Set 9 Solutons Ths problem

More information

Parametric fractional imputation for missing data analysis. Jae Kwang Kim Survey Working Group Seminar March 29, 2010

Parametric fractional imputation for missing data analysis. Jae Kwang Kim Survey Working Group Seminar March 29, 2010 Parametrc fractonal mputaton for mssng data analyss Jae Kwang Km Survey Workng Group Semnar March 29, 2010 1 Outlne Introducton Proposed method Fractonal mputaton Approxmaton Varance estmaton Multple mputaton

More information

MLE and Bayesian Estimation. Jie Tang Department of Computer Science & Technology Tsinghua University 2012

MLE and Bayesian Estimation. Jie Tang Department of Computer Science & Technology Tsinghua University 2012 MLE and Bayesan Estmaton Je Tang Department of Computer Scence & Technology Tsnghua Unversty 01 1 Lnear Regresson? As the frst step, we need to decde how we re gong to represent the functon f. One example:

More information

For now, let us focus on a specific model of neurons. These are simplified from reality but can achieve remarkable results.

For now, let us focus on a specific model of neurons. These are simplified from reality but can achieve remarkable results. Neural Networks : Dervaton compled by Alvn Wan from Professor Jtendra Malk s lecture Ths type of computaton s called deep learnng and s the most popular method for many problems, such as computer vson

More information

Lecture 12: Discrete Laplacian

Lecture 12: Discrete Laplacian Lecture 12: Dscrete Laplacan Scrbe: Tanye Lu Our goal s to come up wth a dscrete verson of Laplacan operator for trangulated surfaces, so that we can use t n practce to solve related problems We are mostly

More information

On an Extension of Stochastic Approximation EM Algorithm for Incomplete Data Problems. Vahid Tadayon 1

On an Extension of Stochastic Approximation EM Algorithm for Incomplete Data Problems. Vahid Tadayon 1 On an Extenson of Stochastc Approxmaton EM Algorthm for Incomplete Data Problems Vahd Tadayon Abstract: The Stochastc Approxmaton EM (SAEM algorthm, a varant stochastc approxmaton of EM, s a versatle tool

More information

Assortment Optimization under MNL

Assortment Optimization under MNL Assortment Optmzaton under MNL Haotan Song Aprl 30, 2017 1 Introducton The assortment optmzaton problem ams to fnd the revenue-maxmzng assortment of products to offer when the prces of products are fxed.

More information

Errors for Linear Systems

Errors for Linear Systems Errors for Lnear Systems When we solve a lnear system Ax b we often do not know A and b exactly, but have only approxmatons  and ˆb avalable. Then the best thng we can do s to solve ˆx ˆb exactly whch

More information

Feature Selection: Part 1

Feature Selection: Part 1 CSE 546: Machne Learnng Lecture 5 Feature Selecton: Part 1 Instructor: Sham Kakade 1 Regresson n the hgh dmensonal settng How do we learn when the number of features d s greater than the sample sze n?

More information

More metrics on cartesian products

More metrics on cartesian products More metrcs on cartesan products If (X, d ) are metrc spaces for 1 n, then n Secton II4 of the lecture notes we defned three metrcs on X whose underlyng topologes are the product topology The purpose of

More information

APPROXIMATE PRICES OF BASKET AND ASIAN OPTIONS DUPONT OLIVIER. Premia 14

APPROXIMATE PRICES OF BASKET AND ASIAN OPTIONS DUPONT OLIVIER. Premia 14 APPROXIMAE PRICES OF BASKE AND ASIAN OPIONS DUPON OLIVIER Prema 14 Contents Introducton 1 1. Framewor 1 1.1. Baset optons 1.. Asan optons. Computng the prce 3. Lower bound 3.1. Closed formula for the prce

More information

STAT 309: MATHEMATICAL COMPUTATIONS I FALL 2018 LECTURE 16

STAT 309: MATHEMATICAL COMPUTATIONS I FALL 2018 LECTURE 16 STAT 39: MATHEMATICAL COMPUTATIONS I FALL 218 LECTURE 16 1 why teratve methods f we have a lnear system Ax = b where A s very, very large but s ether sparse or structured (eg, banded, Toepltz, banded plus

More information

Convergence of random processes

Convergence of random processes DS-GA 12 Lecture notes 6 Fall 216 Convergence of random processes 1 Introducton In these notes we study convergence of dscrete random processes. Ths allows to characterze phenomena such as the law of large

More information

Expected Value and Variance

Expected Value and Variance MATH 38 Expected Value and Varance Dr. Neal, WKU We now shall dscuss how to fnd the average and standard devaton of a random varable X. Expected Value Defnton. The expected value (or average value, or

More information

Digital Signal Processing

Digital Signal Processing Dgtal Sgnal Processng Dscrete-tme System Analyss Manar Mohasen Offce: F8 Emal: manar.subh@ut.ac.r School of IT Engneerng Revew of Precedent Class Contnuous Sgnal The value of the sgnal s avalable over

More information

Numerical Heat and Mass Transfer

Numerical Heat and Mass Transfer Master degree n Mechancal Engneerng Numercal Heat and Mass Transfer 06-Fnte-Dfference Method (One-dmensonal, steady state heat conducton) Fausto Arpno f.arpno@uncas.t Introducton Why we use models and

More information

xp(x µ) = 0 p(x = 0 µ) + 1 p(x = 1 µ) = µ

xp(x µ) = 0 p(x = 0 µ) + 1 p(x = 1 µ) = µ CSE 455/555 Sprng 2013 Homework 7: Parametrc Technques Jason J. Corso Computer Scence and Engneerng SUY at Buffalo jcorso@buffalo.edu Solutons by Yngbo Zhou Ths assgnment does not need to be submtted and

More information

The Basic Idea of EM

The Basic Idea of EM The Basc Idea of EM Janxn Wu LAMDA Group Natonal Key Lab for Novel Software Technology Nanjng Unversty, Chna wujx2001@gmal.com June 7, 2017 Contents 1 Introducton 1 2 GMM: A workng example 2 2.1 Gaussan

More information

Week 5: Neural Networks

Week 5: Neural Networks Week 5: Neural Networks Instructor: Sergey Levne Neural Networks Summary In the prevous lecture, we saw how we can construct neural networks by extendng logstc regresson. Neural networks consst of multple

More information

Dynamic Programming. Preview. Dynamic Programming. Dynamic Programming. Dynamic Programming (Example: Fibonacci Sequence)

Dynamic Programming. Preview. Dynamic Programming. Dynamic Programming. Dynamic Programming (Example: Fibonacci Sequence) /24/27 Prevew Fbonacc Sequence Longest Common Subsequence Dynamc programmng s a method for solvng complex problems by breakng them down nto smpler sub-problems. It s applcable to problems exhbtng the propertes

More information

3.1 ML and Empirical Distribution

3.1 ML and Empirical Distribution 67577 Intro. to Machne Learnng Fall semester, 2008/9 Lecture 3: Maxmum Lkelhood/ Maxmum Entropy Dualty Lecturer: Amnon Shashua Scrbe: Amnon Shashua 1 In the prevous lecture we defned the prncple of Maxmum

More information

1 Convex Optimization

1 Convex Optimization Convex Optmzaton We wll consder convex optmzaton problems. Namely, mnmzaton problems where the objectve s convex (we assume no constrants for now). Such problems often arse n machne learnng. For example,

More information

Yong Joon Ryang. 1. Introduction Consider the multicommodity transportation problem with convex quadratic cost function. 1 2 (x x0 ) T Q(x x 0 )

Yong Joon Ryang. 1. Introduction Consider the multicommodity transportation problem with convex quadratic cost function. 1 2 (x x0 ) T Q(x x 0 ) Kangweon-Kyungk Math. Jour. 4 1996), No. 1, pp. 7 16 AN ITERATIVE ROW-ACTION METHOD FOR MULTICOMMODITY TRANSPORTATION PROBLEMS Yong Joon Ryang Abstract. The optmzaton problems wth quadratc constrants often

More information

Limited Dependent Variables

Limited Dependent Variables Lmted Dependent Varables. What f the left-hand sde varable s not a contnuous thng spread from mnus nfnty to plus nfnty? That s, gven a model = f (, β, ε, where a. s bounded below at zero, such as wages

More information

Outline. Bayesian Networks: Maximum Likelihood Estimation and Tree Structure Learning. Our Model and Data. Outline

Outline. Bayesian Networks: Maximum Likelihood Estimation and Tree Structure Learning. Our Model and Data. Outline Outlne Bayesan Networks: Maxmum Lkelhood Estmaton and Tree Structure Learnng Huzhen Yu janey.yu@cs.helsnk.f Dept. Computer Scence, Unv. of Helsnk Probablstc Models, Sprng, 200 Notces: I corrected a number

More information

EM and Structure Learning

EM and Structure Learning EM and Structure Learnng Le Song Machne Learnng II: Advanced Topcs CSE 8803ML, Sprng 2012 Partally observed graphcal models Mxture Models N(μ 1, Σ 1 ) Z X N N(μ 2, Σ 2 ) 2 Gaussan mxture model Consder

More information

ANSWERS. Problem 1. and the moment generating function (mgf) by. defined for any real t. Use this to show that E( U) var( U)

ANSWERS. Problem 1. and the moment generating function (mgf) by. defined for any real t. Use this to show that E( U) var( U) Econ 413 Exam 13 H ANSWERS Settet er nndelt 9 deloppgaver, A,B,C, som alle anbefales å telle lkt for å gøre det ltt lettere å stå. Svar er gtt . Unfortunately, there s a prntng error n the hnt of

More information

2.3 Nilpotent endomorphisms

2.3 Nilpotent endomorphisms s a block dagonal matrx, wth A Mat dm U (C) In fact, we can assume that B = B 1 B k, wth B an ordered bass of U, and that A = [f U ] B, where f U : U U s the restrcton of f to U 40 23 Nlpotent endomorphsms

More information

The Second Anti-Mathima on Game Theory

The Second Anti-Mathima on Game Theory The Second Ant-Mathma on Game Theory Ath. Kehagas December 1 2006 1 Introducton In ths note we wll examne the noton of game equlbrum for three types of games 1. 2-player 2-acton zero-sum games 2. 2-player

More information

The exam is closed book, closed notes except your one-page cheat sheet.

The exam is closed book, closed notes except your one-page cheat sheet. CS 89 Fall 206 Introducton to Machne Learnng Fnal Do not open the exam before you are nstructed to do so The exam s closed book, closed notes except your one-page cheat sheet Usage of electronc devces

More information

Density matrix. c α (t)φ α (q)

Density matrix. c α (t)φ α (q) Densty matrx Note: ths s supplementary materal. I strongly recommend that you read t for your own nterest. I beleve t wll help wth understandng the quantum ensembles, but t s not necessary to know t n

More information

Simultaneous Optimization of Berth Allocation, Quay Crane Assignment and Quay Crane Scheduling Problems in Container Terminals

Simultaneous Optimization of Berth Allocation, Quay Crane Assignment and Quay Crane Scheduling Problems in Container Terminals Smultaneous Optmzaton of Berth Allocaton, Quay Crane Assgnment and Quay Crane Schedulng Problems n Contaner Termnals Necat Aras, Yavuz Türkoğulları, Z. Caner Taşkın, Kuban Altınel Abstract In ths work,

More information

Lecture 7: Boltzmann distribution & Thermodynamics of mixing

Lecture 7: Boltzmann distribution & Thermodynamics of mixing Prof. Tbbtt Lecture 7 etworks & Gels Lecture 7: Boltzmann dstrbuton & Thermodynamcs of mxng 1 Suggested readng Prof. Mark W. Tbbtt ETH Zürch 13 März 018 Molecular Drvng Forces Dll and Bromberg: Chapters

More information

CHAPTER 5 NUMERICAL EVALUATION OF DYNAMIC RESPONSE

CHAPTER 5 NUMERICAL EVALUATION OF DYNAMIC RESPONSE CHAPTER 5 NUMERICAL EVALUATION OF DYNAMIC RESPONSE Analytcal soluton s usually not possble when exctaton vares arbtrarly wth tme or f the system s nonlnear. Such problems can be solved by numercal tmesteppng

More information

Generalized Linear Methods

Generalized Linear Methods Generalzed Lnear Methods 1 Introducton In the Ensemble Methods the general dea s that usng a combnaton of several weak learner one could make a better learner. More formally, assume that we have a set

More information

Finding Dense Subgraphs in G(n, 1/2)

Finding Dense Subgraphs in G(n, 1/2) Fndng Dense Subgraphs n Gn, 1/ Atsh Das Sarma 1, Amt Deshpande, and Rav Kannan 1 Georga Insttute of Technology,atsh@cc.gatech.edu Mcrosoft Research-Bangalore,amtdesh,annan@mcrosoft.com Abstract. Fndng

More information

NUMERICAL DIFFERENTIATION

NUMERICAL DIFFERENTIATION NUMERICAL DIFFERENTIATION 1 Introducton Dfferentaton s a method to compute the rate at whch a dependent output y changes wth respect to the change n the ndependent nput x. Ths rate of change s called the

More information

Tracking with Kalman Filter

Tracking with Kalman Filter Trackng wth Kalman Flter Scott T. Acton Vrgna Image and Vdeo Analyss (VIVA), Charles L. Brown Department of Electrcal and Computer Engneerng Department of Bomedcal Engneerng Unversty of Vrgna, Charlottesvlle,

More information

Portfolios with Trading Constraints and Payout Restrictions

Portfolios with Trading Constraints and Payout Restrictions Portfolos wth Tradng Constrants and Payout Restrctons John R. Brge Northwestern Unversty (ont wor wth Chrs Donohue Xaodong Xu and Gongyun Zhao) 1 General Problem (Very) long-term nvestor (eample: unversty

More information

Lecture 4. Instructor: Haipeng Luo

Lecture 4. Instructor: Haipeng Luo Lecture 4 Instructor: Hapeng Luo In the followng lectures, we focus on the expert problem and study more adaptve algorthms. Although Hedge s proven to be worst-case optmal, one may wonder how well t would

More information

U.C. Berkeley CS294: Beyond Worst-Case Analysis Luca Trevisan September 5, 2017

U.C. Berkeley CS294: Beyond Worst-Case Analysis Luca Trevisan September 5, 2017 U.C. Berkeley CS94: Beyond Worst-Case Analyss Handout 4s Luca Trevsan September 5, 07 Summary of Lecture 4 In whch we ntroduce semdefnte programmng and apply t to Max Cut. Semdefnte Programmng Recall that

More information

Winter 2008 CS567 Stochastic Linear/Integer Programming Guest Lecturer: Xu, Huan

Winter 2008 CS567 Stochastic Linear/Integer Programming Guest Lecturer: Xu, Huan Wnter 2008 CS567 Stochastc Lnear/Integer Programmng Guest Lecturer: Xu, Huan Class 2: More Modelng Examples 1 Capacty Expanson Capacty expanson models optmal choces of the tmng and levels of nvestments

More information

A Hybrid Variational Iteration Method for Blasius Equation

A Hybrid Variational Iteration Method for Blasius Equation Avalable at http://pvamu.edu/aam Appl. Appl. Math. ISSN: 1932-9466 Vol. 10, Issue 1 (June 2015), pp. 223-229 Applcatons and Appled Mathematcs: An Internatonal Journal (AAM) A Hybrd Varatonal Iteraton Method

More information

Singular Value Decomposition: Theory and Applications

Singular Value Decomposition: Theory and Applications Sngular Value Decomposton: Theory and Applcatons Danel Khashab Sprng 2015 Last Update: March 2, 2015 1 Introducton A = UDV where columns of U and V are orthonormal and matrx D s dagonal wth postve real

More information

Applied Stochastic Processes

Applied Stochastic Processes STAT455/855 Fall 23 Appled Stochastc Processes Fnal Exam, Bref Solutons 1. (15 marks) (a) (7 marks) The dstrbuton of Y s gven by ( ) ( ) y 2 1 5 P (Y y) for y 2, 3,... The above follows because each of

More information

Conjugacy and the Exponential Family

Conjugacy and the Exponential Family CS281B/Stat241B: Advanced Topcs n Learnng & Decson Makng Conjugacy and the Exponental Famly Lecturer: Mchael I. Jordan Scrbes: Bran Mlch 1 Conjugacy In the prevous lecture, we saw conjugate prors for the

More information

CSci 6974 and ECSE 6966 Math. Tech. for Vision, Graphics and Robotics Lecture 21, April 17, 2006 Estimating A Plane Homography

CSci 6974 and ECSE 6966 Math. Tech. for Vision, Graphics and Robotics Lecture 21, April 17, 2006 Estimating A Plane Homography CSc 6974 and ECSE 6966 Math. Tech. for Vson, Graphcs and Robotcs Lecture 21, Aprl 17, 2006 Estmatng A Plane Homography Overvew We contnue wth a dscusson of the major ssues, usng estmaton of plane projectve

More information

Min Cut, Fast Cut, Polynomial Identities

Min Cut, Fast Cut, Polynomial Identities Randomzed Algorthms, Summer 016 Mn Cut, Fast Cut, Polynomal Identtes Instructor: Thomas Kesselhem and Kurt Mehlhorn 1 Mn Cuts n Graphs Lecture (5 pages) Throughout ths secton, G = (V, E) s a mult-graph.

More information

Inexact Newton Methods for Inverse Eigenvalue Problems

Inexact Newton Methods for Inverse Eigenvalue Problems Inexact Newton Methods for Inverse Egenvalue Problems Zheng-jan Ba Abstract In ths paper, we survey some of the latest development n usng nexact Newton-lke methods for solvng nverse egenvalue problems.

More information

C4B Machine Learning Answers II. = σ(z) (1 σ(z)) 1 1 e z. e z = σ(1 σ) (1 + e z )

C4B Machine Learning Answers II. = σ(z) (1 σ(z)) 1 1 e z. e z = σ(1 σ) (1 + e z ) C4B Machne Learnng Answers II.(a) Show that for the logstc sgmod functon dσ(z) dz = σ(z) ( σ(z)) A. Zsserman, Hlary Term 20 Start from the defnton of σ(z) Note that Then σ(z) = σ = dσ(z) dz = + e z e z

More information

CS : Algorithms and Uncertainty Lecture 17 Date: October 26, 2016

CS : Algorithms and Uncertainty Lecture 17 Date: October 26, 2016 CS 29-128: Algorthms and Uncertanty Lecture 17 Date: October 26, 2016 Instructor: Nkhl Bansal Scrbe: Mchael Denns 1 Introducton In ths lecture we wll be lookng nto the secretary problem, and an nterestng

More information

Maximum Likelihood Estimation of Binary Dependent Variables Models: Probit and Logit. 1. General Formulation of Binary Dependent Variables Models

Maximum Likelihood Estimation of Binary Dependent Variables Models: Probit and Logit. 1. General Formulation of Binary Dependent Variables Models ECO 452 -- OE 4: Probt and Logt Models ECO 452 -- OE 4 Mamum Lkelhood Estmaton of Bnary Dependent Varables Models: Probt and Logt hs note demonstrates how to formulate bnary dependent varables models for

More information

Stanford University CS359G: Graph Partitioning and Expanders Handout 4 Luca Trevisan January 13, 2011

Stanford University CS359G: Graph Partitioning and Expanders Handout 4 Luca Trevisan January 13, 2011 Stanford Unversty CS359G: Graph Parttonng and Expanders Handout 4 Luca Trevsan January 3, 0 Lecture 4 In whch we prove the dffcult drecton of Cheeger s nequalty. As n the past lectures, consder an undrected

More information

4DVAR, according to the name, is a four-dimensional variational method.

4DVAR, according to the name, is a four-dimensional variational method. 4D-Varatonal Data Assmlaton (4D-Var) 4DVAR, accordng to the name, s a four-dmensonal varatonal method. 4D-Var s actually a drect generalzaton of 3D-Var to handle observatons that are dstrbuted n tme. The

More information

Entropy of Markov Information Sources and Capacity of Discrete Input Constrained Channels (from Immink, Coding Techniques for Digital Recorders)

Entropy of Markov Information Sources and Capacity of Discrete Input Constrained Channels (from Immink, Coding Techniques for Digital Recorders) Entropy of Marov Informaton Sources and Capacty of Dscrete Input Constraned Channels (from Immn, Codng Technques for Dgtal Recorders). Entropy of Marov Chans We have already ntroduced the noton of entropy

More information

The Expectation-Maximisation Algorithm

The Expectation-Maximisation Algorithm Chapter 4 The Expectaton-Maxmsaton Algorthm 4. The EM algorthm - a method for maxmsng the lkelhood Let us suppose that we observe Y {Y } n. The jont densty of Y s f(y ; θ 0), and θ 0 s an unknown parameter.

More information

Maximum Likelihood Estimation

Maximum Likelihood Estimation Maxmum Lkelhood Estmaton INFO-2301: Quanttatve Reasonng 2 Mchael Paul and Jordan Boyd-Graber MARCH 7, 2017 INFO-2301: Quanttatve Reasonng 2 Paul and Boyd-Graber Maxmum Lkelhood Estmaton 1 of 9 Why MLE?

More information

j) = 1 (note sigma notation) ii. Continuous random variable (e.g. Normal distribution) 1. density function: f ( x) 0 and f ( x) dx = 1

j) = 1 (note sigma notation) ii. Continuous random variable (e.g. Normal distribution) 1. density function: f ( x) 0 and f ( x) dx = 1 Random varables Measure of central tendences and varablty (means and varances) Jont densty functons and ndependence Measures of assocaton (covarance and correlaton) Interestng result Condtonal dstrbutons

More information