Markov Decision Processes

Stewart Daniel

1 Markov Decision Processes

Definitions; stationary policies; value improvement algorithm, policy improvement algorithm, and linear programming for discounted cost and average cost criteria.

2 Markov Decision Process

Let $X = \{X_0, X_1, \ldots\}$ be a system description process on state space $E$ and let $D = \{D_0, D_1, \ldots\}$ be a decision process with action space $A$. The process $(X, D)$ is a Markov decision process if, for $j \in E$ and $n = 0, 1, \ldots$,

$$P\{X_{n+1} = j \mid X_0, D_0, \ldots, X_n, D_n\} = P\{X_{n+1} = j \mid X_n, D_n\}.$$

Furthermore, for each $k \in A$, let $f_k$ be a cost vector and $P_k$ be a one-step transition probability matrix. Then the cost $f_k(i)$ is incurred whenever $X_n = i$ and $D_n = k$, and

$$P\{X_{n+1} = j \mid X_n = i, D_n = k\} = P_k(i, j).$$

The problem is to determine how to choose a sequence of actions in order to minimize cost.

3 Policies

A policy is a rule that specifies which action to take at each point in time. Let $\mathcal{D}$ denote the set of all policies. In general, the decisions specified by a policy may

- depend on the current state of the system description process
- be randomized (depend on some external random event)
- also depend on past states and/or decisions

A stationary policy is defined by a (deterministic) action function that assigns an action to each state, independent of previous states, previous actions, and time. Under a stationary policy, the MDP is a Markov chain.

4 Cost Minimization Criteria

Since an MDP goes on indefinitely, it is likely that the total cost will be infinite. In order to meaningfully compare policies, two criteria are commonly used:

1. Expected total discounted cost computes the present worth of future costs using a discount factor $\alpha < 1$, such that one dollar obtained at time $n = 1$ has a present value of $\alpha$ at time $n = 0$. Typically, if $r$ is the rate of return, then $\alpha = 1/(1 + r)$. The expected total discounted cost is

$$E\left[\sum_{n=0}^{\infty} \alpha^n f_{D_n}(X_n)\right].$$

2. The long run average cost is

$$\lim_{m \to \infty} \frac{1}{m} \sum_{n=0}^{m-1} f_{D_n}(X_n).$$

5 Optimization with Stationary Policies

If the state space $E$ is finite, there exists a stationary policy that solves the problem to minimize the discounted cost:

$$v^\alpha(i) = \min_{d \in \mathcal{D}} v_d^\alpha(i), \quad \text{where } v_d^\alpha(i) = E_d\left[\sum_{n=0}^{\infty} \alpha^n f_{D_n}(X_n) \,\Big|\, X_0 = i\right].$$

If every stationary policy results in an irreducible Markov chain, there exists a stationary policy that solves the problem to minimize the average cost:

$$\phi^* = \min_{d \in \mathcal{D}} \phi_d, \quad \text{where } \phi_d = \lim_{m \to \infty} \frac{1}{m} \sum_{n=0}^{m-1} f_{D_n}(X_n).$$

6 Computing Expected Discounted Costs

Let $X = \{X_0, X_1, \ldots\}$ be a Markov chain with one-step transition probability matrix $P$, let $f$ be a cost function that assigns a cost to each state of the Markov chain, and let $\alpha$ ($0 < \alpha < 1$) be a discount factor. Then the expected total discounted cost is

$$g(i) = E\left[\sum_{n=0}^{\infty} \alpha^n f(X_n) \,\Big|\, X_0 = i\right] = \left[(I - \alpha P)^{-1} f\right](i).$$

Why? Starting from state $i$, the expected discounted cost can be found recursively as

$$g(i) = f(i) + \alpha \sum_{j} P_{ij}\, g(j), \quad \text{or} \quad g = f + \alpha P g.$$

Note that the expected discounted cost always depends on the initial state, while for the average cost criterion the initial state is unimportant.
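The formula $g = (I - \alpha P)^{-1} f$ can be checked numerically. The following is a minimal sketch in plain Python, not part of the original slides: the helper names `solve_linear` and `discounted_cost` and the two-state chain are illustrative assumptions. It solves $(I - \alpha P) g = f$ by Gaussian elimination rather than forming the inverse explicitly.

```python
def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy [A | b] so the inputs are left untouched.
    M = [[float(x) for x in A[r]] + [float(b[r])] for r in range(n)]
    for col in range(n):
        # Move the row with the largest entry in this column into pivot position.
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x


def discounted_cost(P, f, alpha):
    """Return g solving (I - alpha P) g = f for a Markov chain with costs f."""
    n = len(f)
    A = [[(1.0 if i == j else 0.0) - alpha * P[i][j] for j in range(n)]
         for i in range(n)]
    return solve_linear(A, f)


# Hypothetical two-state chain, chosen only for illustration.
P = [[0.5, 0.5], [0.2, 0.8]]
f = [1.0, 2.0]
alpha = 0.9
g = discounted_cost(P, f, alpha)
print(g)
```

The result can be verified against the recursion $g = f + \alpha P g$ component by component; with a constant cost of 1 in every state, the same function returns $1/(1-\alpha)$ for each starting state.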

7 Solution Procedures for Discounted Costs

Let $v^\alpha$ be the (vector) optimal value function whose $i$th component is

$$v^\alpha(i) = \min_{d \in \mathcal{D}} v_d^\alpha(i).$$

For each $i \in E$,

$$v^\alpha(i) = \min_{k \in A} \left\{ f_k(i) + \alpha \sum_{j \in E} P_k(i, j)\, v^\alpha(j) \right\}.$$

These equations uniquely determine $v^\alpha$. If we can somehow obtain the values $v^\alpha$ that satisfy the above equations, then the optimal policy is the vector $a^\alpha$, where

$$a^\alpha(i) = \arg\min_{k \in A} \left\{ f_k(i) + \alpha \sum_{j \in E} P_k(i, j)\, v^\alpha(j) \right\}.$$

Here "arg min" is the argument that minimizes.

8 Value Iteration for Discounted Costs

Make a guess, then keep applying the optimal value equations until the fixed point is reached.

Step 1. Choose $\varepsilon > 0$, set $n = 0$, and let $v_0(i) = 0$ for each $i$ in $E$.

Step 2. For each $i$ in $E$, find $v_{n+1}(i)$ as

$$v_{n+1}(i) = \min_{k \in A} \left\{ f_k(i) + \alpha \sum_{j \in E} P_k(i, j)\, v_n(j) \right\}.$$

Step 3. Let $\delta = \max_{i \in E} \{ v_{n+1}(i) - v_n(i) \}$.

Step 4. If $\delta < \varepsilon$, stop with $v^\alpha = v_{n+1}$. Otherwise, set $n = n + 1$ and return to Step 2.
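The four steps above can be sketched as follows. This is an illustrative sketch, not the slides' own code: the data layout (`f[k][i]` for $f_k(i)$, `P[k][i][j]` for $P_k(i,j)$), the function names, and the two-state, two-action example are all assumptions. Step 3 is implemented with an absolute difference, which is an added safeguard in case costs have mixed signs.

```python
def bellman_backup(f, P, alpha, v):
    """Apply the optimal value equations once; return (new values, greedy actions)."""
    n_actions, n_states = len(f), len(v)
    new_v, policy = [], []
    for i in range(n_states):
        q = [f[k][i] + alpha * sum(P[k][i][j] * v[j] for j in range(n_states))
             for k in range(n_actions)]
        best = min(range(n_actions), key=lambda k: q[k])
        new_v.append(q[best])
        policy.append(best)
    return new_v, policy


def value_iteration(f, P, alpha, eps=1e-8):
    v = [0.0] * len(f[0])                            # Step 1: v_0(i) = 0
    while True:
        v_next, _ = bellman_backup(f, P, alpha, v)   # Step 2
        # Step 3 (absolute value is an assumption added here for safety).
        delta = max(abs(a - b) for a, b in zip(v_next, v))
        v = v_next
        if delta < eps:                              # Step 4
            # Also return the actions that attain the minimum at the end.
            return v, bellman_backup(f, P, alpha, v)[1]


# Hypothetical MDP: f[k][i] is the cost of action k in state i,
# P[k][i][j] is the probability of moving from i to j under action k.
f = [[1.0, 3.0], [2.0, 2.5]]
P = [[[0.9, 0.1], [0.9, 0.1]],
     [[0.2, 0.8], [0.1, 0.9]]]
v, policy = value_iteration(f, P, alpha=0.9)
print(v, policy)
```

For this example the iterates approach $v^\alpha \approx (11.8, 13.8)$, with action 0 chosen in both states.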

9 Policy Improvement for Discounted Costs

Start myopic, then consider longer-term consequences.

Step 1. Set $n = 0$ and let $a_0(i) = \arg\min_{k \in A} f_k(i)$.

Step 2. Adopt the cost vector and transition matrix:

$$f(i) = f_{a_n(i)}(i), \qquad P(i, j) = P_{a_n(i)}(i, j).$$

Step 3. Find the value function $v = (I - \alpha P)^{-1} f$.

Step 4. Re-optimize:

$$a_{n+1}(i) = \arg\min_{k \in A} \left\{ f_k(i) + \alpha \sum_{j \in E} P_k(i, j)\, v(j) \right\}.$$

Step 5. If $a_{n+1}(i) = a_n(i)$ for every $i$, then stop with $v^\alpha = v$ and $a^\alpha(i) = a_n(i)$. Otherwise, set $n = n + 1$ and return to Step 2.
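A minimal sketch of these five steps in plain Python follows. The function names, the data layout (`f[k][i]`, `P[k][i][j]`), and the example MDP are assumptions made for illustration. One deliberate deviation from the slide is labeled in the code: on near-ties the current action is kept, so that floating-point noise cannot make the loop alternate between equally good policies.

```python
def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [[float(x) for x in A[r]] + [float(b[r])] for r in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x


def q_values(f, P, alpha, v, i):
    """One-step lookahead cost of each action in state i, given values v."""
    return [f[k][i] + alpha * sum(P[k][i][j] * v[j] for j in range(len(v)))
            for k in range(len(f))]


def policy_improvement(f, P, alpha):
    n_actions, n = len(f), len(f[0])
    # Step 1: myopic start (cheapest immediate cost in each state).
    a = [min(range(n_actions), key=lambda k: f[k][i]) for i in range(n)]
    while True:
        # Step 2: cost vector and transition matrix of the current policy.
        f_a = [f[a[i]][i] for i in range(n)]
        P_a = [P[a[i]][i] for i in range(n)]
        # Step 3: solve (I - alpha P) v = f.
        A = [[(1.0 if i == j else 0.0) - alpha * P_a[i][j] for j in range(n)]
             for i in range(n)]
        v = solve_linear(A, f_a)
        # Step 4: re-optimize; keep the current action on near-ties
        # (an added tie-breaking rule, not stated on the slide).
        a_next = []
        for i in range(n):
            q = q_values(f, P, alpha, v, i)
            best = min(range(n_actions), key=lambda k: q[k])
            a_next.append(a[i] if q[a[i]] <= q[best] + 1e-12 else best)
        # Step 5: stop when the policy no longer changes.
        if a_next == a:
            return v, a
        a = a_next


# Hypothetical MDP, for illustration only.
f = [[1.0, 3.0], [2.0, 2.5]]
P = [[[0.9, 0.1], [0.9, 0.1]],
     [[0.2, 0.8], [0.1, 0.9]]]
v, a = policy_improvement(f, P, alpha=0.9)
print(v, a)
```

In this example the myopic start picks action 1 in the second state, and one round of re-optimization switches it to action 0, after which the policy is stable.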

10 Linear Programming for Discounted Costs

Consider the linear program:

$$\max \sum_{i \in E} u(i)$$

$$\text{s.t. } u(i) \le f_k(i) + \alpha \sum_{j \in E} P_k(i, j)\, u(j) \quad \text{for each } i \in E,\ k \in A.$$

The optimal value of $u(i)$ will be $v^\alpha(i)$, and the optimal policy is identified by the constraints that hold as equalities in the optimal solution (slack variables equal 0).

Note: the decision variables are unrestricted in sign!

11 Long Run Average Cost per Period

For a given policy $d$, its long run average cost can be found from its cost vector $f_d$ and one-step transition probability matrix $P_d$. First, find the limiting probabilities by solving

$$\pi_j = \sum_{i \in E} \pi_i P_d(i, j), \quad j \in E; \qquad \sum_{j \in E} \pi_j = 1.$$

Then

$$\phi_d = \lim_{m \to \infty} \frac{1}{m} \sum_{n=0}^{m-1} f_{d(X_n)}(X_n) = \sum_{j \in E} f_d(j)\, \pi_j.$$

So, in principle, we could simply enumerate all policies and choose the one with the smallest average cost, but this is not practical if $A$ and $E$ are large.
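For a single fixed policy, the two-stage computation above can be sketched as follows. The helper names and the two-state chain are illustrative assumptions. Because the balance equations are linearly dependent, the sketch drops the last one and uses the normalization $\sum_j \pi_j = 1$ in its place.

```python
def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [[float(x) for x in A[r]] + [float(b[r])] for r in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x


def average_cost(P, f):
    """Return (phi, pi) for an irreducible chain with matrix P and costs f."""
    n = len(f)
    # Rows 0..n-2: pi_j - sum_i pi_i P(i, j) = 0 (one balance equation dropped).
    A = [[(1.0 if i == j else 0.0) - P[i][j] for i in range(n)]
         for j in range(n - 1)]
    # Last row: the probabilities sum to one.
    A.append([1.0] * n)
    b = [0.0] * (n - 1) + [1.0]
    pi = solve_linear(A, b)
    phi = sum(f[j] * pi[j] for j in range(n))
    return phi, pi


# Hypothetical two-state chain for one fixed policy.
P_d = [[0.5, 0.5], [0.2, 0.8]]
f_d = [1.0, 2.0]
phi, pi = average_cost(P_d, f_d)
print(phi, pi)
```

For this chain the limiting probabilities are $\pi = (2/7, 5/7)$, so the average cost is $\phi_d = 12/7$.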

12 Recursive Equation for Average Cost

Assume that every stationary policy yields an irreducible Markov chain. There exists a scalar $\phi^*$ and a vector $h$ such that for all states $i$ in $E$,

$$\phi^* + h(i) = \min_{k \in A} \left\{ f_k(i) + \sum_{j \in E} P_k(i, j)\, h(j) \right\}.$$

The scalar $\phi^*$ is the optimal average cost, and the optimal policy is found by choosing for each state the action that achieves the minimum on the right-hand side.

The vector $h$ is unique up to an additive constant. As we will see, the difference $h(i) - h(j)$ represents the increase in total cost from starting out in state $i$ rather than $j$.

13 Relationships between Discounted Cost and Long Run Average Cost

If a cost of $c$ is incurred each period and $\alpha$ is the discount factor, then the total discounted cost is

$$v = \sum_{n=0}^{\infty} \alpha^n c = \frac{c}{1 - \alpha}.$$

Therefore, a total discounted cost $v$ is equivalent to an average cost of $c = (1 - \alpha)v$ per period, so

$$\lim_{\alpha \to 1} (1 - \alpha)\, v^\alpha(i) = \phi^*.$$

Let $v^\alpha$ be the optimal discounted cost vector, $\phi^*$ be the optimal average cost, and $h$ be the mystery vector from the previous slide. Then

$$\lim_{\alpha \to 1} \left[ v^\alpha(i) - v^\alpha(j) \right] = h(i) - h(j).$$
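As a quick numerical check of the constant-cost identity (the numbers are illustrative, not from the slides), take $c = 2$ and $\alpha = 0.9$:

$$v = \sum_{n=0}^{\infty} (0.9)^n \cdot 2 = \frac{2}{1 - 0.9} = 20, \qquad (1 - \alpha)\, v = 0.1 \cdot 20 = 2 = c.$$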

14 Policy Improvement for Average Costs

Designate one state in $E$ to be state number 1.

Step 1. Set $n = 0$ and let $a_0(i) = \arg\min_{k \in A} f_k(i)$.

Step 2. Adopt the cost vector and transition matrix:

$$f(i) = f_{a_n(i)}(i), \qquad P(i, j) = P_{a_n(i)}(i, j).$$

Step 3. With $h(1) = 0$, solve $\phi + h = f + Ph$, where the scalar $\phi$ is added to every component of $h$.

Step 4. Re-optimize:

$$a_{n+1}(i) = \arg\min_{k \in A} \left\{ f_k(i) + \sum_{j \in E} P_k(i, j)\, h(j) \right\}.$$

Step 5. If $a_{n+1}(i) = a_n(i)$ for every $i$, then stop with $\phi^* = \phi$ and $a^*(i) = a_n(i)$. Otherwise, set $n = n + 1$ and return to Step 2.
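The average-cost procedure can be sketched in the same style. This is an assumption-laden illustration rather than the slides' implementation: the names are hypothetical, the designated "state number 1" is index 0 in the code, and the example MDP has strictly positive transition probabilities so that every stationary policy is irreducible. As a labeled addition, the current action is kept on near-ties.

```python
def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [[float(x) for x in A[r]] + [float(b[r])] for r in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x


def evaluate_average(P_a, f_a):
    """Solve phi + h(i) = f(i) + sum_j P(i, j) h(j) with h pinned to 0 at index 0."""
    n = len(f_a)
    # Unknowns are x = [phi, h(1), ..., h(n-1)]; h(0) = 0 is not a variable.
    A = [[1.0] + [(1.0 if i == j else 0.0) - P_a[i][j] for j in range(1, n)]
         for i in range(n)]
    x = solve_linear(A, f_a)
    return x[0], [0.0] + x[1:]


def lookahead(f, P, h, i):
    """Cost of each action in state i: f_k(i) + sum_j P_k(i, j) h(j)."""
    return [f[k][i] + sum(P[k][i][j] * h[j] for j in range(len(h)))
            for k in range(len(f))]


def average_policy_improvement(f, P):
    n_actions, n = len(f), len(f[0])
    a = [min(range(n_actions), key=lambda k: f[k][i]) for i in range(n)]  # Step 1
    while True:
        f_a = [f[a[i]][i] for i in range(n)]                              # Step 2
        P_a = [P[a[i]][i] for i in range(n)]
        phi, h = evaluate_average(P_a, f_a)                               # Step 3
        a_next = []                                                       # Step 4
        for i in range(n):
            q = lookahead(f, P, h, i)
            best = min(range(n_actions), key=lambda k: q[k])
            # Added tie-breaking rule: keep the current action on near-ties.
            a_next.append(a[i] if q[a[i]] <= q[best] + 1e-12 else best)
        if a_next == a:                                                   # Step 5
            return phi, h, a
        a = a_next


# Hypothetical MDP: f[k][i] is the cost, P[k][i][j] the transition probability.
f = [[1.0, 3.0], [2.0, 2.5]]
P = [[[0.9, 0.1], [0.9, 0.1]],
     [[0.2, 0.8], [0.1, 0.9]]]
phi, h, a = average_policy_improvement(f, P)
print(phi, h, a)
```

For this example the myopic policy has average cost 1.75; one improvement step switches the action in the second state and yields $\phi^* = 1.2$ with $h = (0, 2)$.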

15 Linear Programming for Average Costs

Consider randomized policies: let $w_i(k) = P\{D_n = k \mid X_n = i\}$. A stationary policy has $w_i(k) = 1$ for $k = a(i)$ and 0 otherwise. The decision variables are $x(i, k) = w_i(k)\, \pi(i)$. The objective is to minimize the expected value of the average cost (expectation taken over the randomized policy):

$$\min\ \phi = \sum_{i \in E} \sum_{k \in A} x(i, k)\, f_k(i)$$

$$\text{s.t. } \sum_{k \in A} x(j, k) = \sum_{i \in E} \sum_{k \in A} x(i, k)\, P_k(i, j) \quad \text{for each } j \in E,$$

$$\sum_{i \in E} \sum_{k \in A} x(i, k) = 1, \qquad x(i, k) \ge 0 \quad \text{for each } i \in E,\ k \in A.$$

Note that one of the balance constraints will be redundant and may be dropped.
