Long-run Average Reward for Markov Decision Processes


1 Long-run Average Reward for Markov Decision Processes. Based on a paper at CAV 2017. Pranav Ashok¹, Krishnendu Chatterjee², Przemysław Daca², Jan Křetínský¹ and Tobias Meggendorfer¹. August 9. ¹ Technical University of Munich, Germany. ² IST Austria.

2 Motivation: Markov Decision Processes (MDPs) are a standard model for describing systems that display probabilistic as well as non-deterministic behaviour.

3 Motivation. Open problem: computing the long-run average reward (mean-payoff) using Value Iteration (VI). VI approaches exist for subclasses, but not for general MDPs. Other existing approaches: Linear Programming (LP) and Strategy Iteration (SI)¹. ¹ Martin L. Puterman, Markov Decision Processes.

4 Contributions: disprove a conjectured stopping criterion for VI²; give a general solution using VI; improve performance using ideas from Machine Learning. ² Not covered in this talk.

5 Markov Decision Processes (MDPs). [Figure: example MDP over states s1 to s6; each transition is labelled with an action and its reward, e.g. (b, 0), (a, 10), (a, 4), (a, 20), (c, 0).]

6 Strategy: A strategy (or policy) gives the action to be taken at every state. Example: π := {s1 ↦ b, s2 ↦ a}. [Figure: the example MDP over states s1 to s6.]

7 Strategy: another example is π := {s1 ↦ a, ..., s6 ↦ a}, which picks action a in every state. [Figure: the example MDP over states s1 to s6.]

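8 Mean-payoff or Long-run Average Reward: for a run ρ (the slide shows an example run), let $\rho_i$ denote the reward collected in step i. Then the n-step average reward is given by $\mathrm{MP}_n(\rho) := \frac{1}{n} \sum_{i=1}^{n} \rho_i$, and the mean-payoff is $\mathrm{MP} := \sup_{\pi} \liminf_{n \to \infty} \mathbb{E}^{\pi}\left[\mathrm{MP}_n(\rho)\right]$.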
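The n-step average can be computed directly from a finite prefix of rewards. The following is a minimal sketch; the reward sequence (10 once, then 4 forever) is a hypothetical run chosen for illustration, not one taken from the slides.

```python
from fractions import Fraction


def n_step_average(rewards):
    """MP_n(rho) = (1/n) * sum_{i=1}^{n} rho_i for a finite reward prefix."""
    n = len(rewards)
    if n == 0:
        raise ValueError("need at least one step")
    return Fraction(sum(rewards), n)


# Hypothetical run: reward 10 in the first step, then reward 4 in every later step.
prefix = [10] + [4] * 99

# The averages drift towards 4, the reward that is collected in the long run.
averages = [n_step_average(prefix[:n]) for n in (1, 10, 100)]
```

The early reward of 10 dominates short prefixes but its weight vanishes as n grows, which is why the mean-payoff depends only on the long-run behaviour.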

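9 Value Iteration: VI for total rewards (Bellman equation), illustrated at s1: $v_n(s_1) = \max\{\, 0 + v_{n-1}(s_2),\; 10 + (0.99\, v_{n-1}(s_3) + p\, v_{n-1}(s_5)) \,\}$, where p is the probability with which action a moves from s1 to s5. [Figure: s1 with action (b, 0) leading to s2 and action (a, 10) leading to s3 and s5.]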
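The Bellman update above can be sketched generically as follows. The dictionary encoding of the MDP and the three-state example are assumptions made for illustration; they are not the MDP from the slides.

```python
def bellman_backup(mdp, v, s):
    """Best action value at s: reward plus expected previous value of the successors."""
    return max(
        reward + sum(p * v[t] for p, t in successors)
        for reward, successors in mdp[s].values()
    )


def total_reward_vi(mdp, steps):
    """Apply the Bellman update `steps` times, starting from v_0 = 0."""
    v = {s: 0.0 for s in mdp}
    for _ in range(steps):
        v = {s: bellman_backup(mdp, v, s) for s in mdp}
    return v


# Hypothetical MDP: state -> action -> (reward, [(probability, successor), ...]).
mdp = {
    "s1": {"a": (10, [(0.5, "s2"), (0.5, "s3")]), "b": (0, [(1.0, "s2")])},
    "s2": {"a": (4, [(1.0, "s2")])},
    "s3": {"a": (0, [(1.0, "s3")])},
}

v1 = total_reward_vi(mdp, 1)
v2 = total_reward_vi(mdp, 2)
```

Each call to `bellman_backup` is one instance of the max over actions shown on the slide.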

10 Value Iteration: Average reward. The average reward is $\lim_{n \to \infty} \frac{v_n(s)}{n}$; VI tracks the increments $v_n(s) - v_{n-1}(s)$³. ³ After making the MDP aperiodic.
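The role of aperiodicity can be seen on a two-state cycle. The sketch below assumes the standard textbook aperiodicity transformation (mix every transition with a self-loop with weight 1 - tau and scale rewards by tau, which scales the average reward by tau); the slides do not say which transformation is used, and the example MDP and function names are hypothetical.

```python
def increment_bounds(mdp, steps):
    """Run VI for `steps` rounds; return (min, max) over states of v_n - v_{n-1}."""
    v = {s: 0.0 for s in mdp}
    diffs = [0.0]
    for _ in range(steps):
        new = {
            s: max(r + sum(p * v[t] for p, t in succ) for r, succ in mdp[s].values())
            for s in mdp
        }
        diffs = [new[s] - v[s] for s in mdp]
        v = new
    return min(diffs), max(diffs)


def make_aperiodic(mdp, tau=0.5):
    """Assumed transformation: stay put with probability 1 - tau, scale rewards by tau."""
    return {
        s: {
            a: (tau * r, [(1 - tau, s)] + [(tau * p, t) for p, t in succ])
            for a, (r, succ) in actions.items()
        }
        for s, actions in mdp.items()
    }


# Hypothetical periodic MDP: A -> B with reward 0, B -> A with reward 2 (average reward 1).
cycle = {
    "A": {"a": (0, [(1.0, "B")])},
    "B": {"a": (2, [(1.0, "A")])},
}

periodic = increment_bounds(cycle, 50)                 # increments keep oscillating
tau = 0.5
aperiodic = increment_bounds(make_aperiodic(cycle, tau), 50)
estimate = (aperiodic[0] / tau, aperiodic[1] / tau)    # undo the reward scaling
```

On the periodic cycle the increments never settle, while on the transformed MDP the minimum and maximum increment coincide and give the average reward after rescaling.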

11 Towards a General VI: Maximal End Components (MECs). [Figure: an example MEC over states s3 and s4.] A MEC is a communicating MDP, so VI can be applied to it.

12 Towards a General VI: Step 1, MEC-Decomposition. Find all MECs. [Figure: the example MDP over states s1 to s6.]

13 Towards a General VI: Step 1, MEC-Decomposition. Find all MECs and run VI on them until ε-convergence. [Figure: the example MDP with its MECs marked.]
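One common way to find all MECs is to repeatedly compute strongly connected components and discard actions that may leave the current candidate set. The sketch below follows that textbook scheme; it is not necessarily the procedure used by the authors, and the graph encoding and the example MDP are hypothetical.

```python
def strongly_connected_components(nodes, successors):
    """Tarjan's algorithm (recursive); `successors(v)` yields the neighbours of v."""
    index, low, on_stack, stack, out = {}, {}, set(), [], []

    def visit(v):
        index[v] = low[v] = len(index)
        stack.append(v)
        on_stack.add(v)
        for w in successors(v):
            if w not in index:
                visit(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:
            comp = set()
            while True:
                w = stack.pop()
                on_stack.discard(w)
                comp.add(w)
                if w == v:
                    break
            out.append(comp)

    for v in nodes:
        if v not in index:
            visit(v)
    return out


def mec_decomposition(mdp):
    """mdp: state -> {action: set of possible successors}. Returns the MEC state sets."""
    result, work = [], [set(mdp)]
    while work:
        cand = work.pop()
        # Keep only actions whose whole support stays inside the candidate set.
        inside = {s: {a for a, succ in mdp[s].items() if succ <= cand} for s in cand}

        def successors(s):
            return {t for a in inside[s] for t in mdp[s][a]}

        for comp in strongly_connected_components(cand, successors):
            if comp != cand:
                work.append(comp)              # strictly smaller candidate, retry
            elif all(inside[s] for s in comp):
                result.append(frozenset(comp))  # closed and strongly connected
    return result


# Hypothetical MDP: s1 is transient, {s2} and {s3, s4} are the MECs.
example = {
    "s1": {"a": {"s2", "s3"}},
    "s2": {"a": {"s2"}},
    "s3": {"a": {"s4"}},
    "s4": {"a": {"s3"}, "b": {"s1"}},
}
mecs = mec_decomposition(example)
```

Only the support of each action matters here; the probabilities are needed later, when VI is run inside each MEC.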

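14 Towards a General VI: Step 2, Weighted Reachability. Max. mean-payoff $= \sup_{\pi} \left( p_1 \cdot 4 + p_2 \cdot 5 + p_3 \cdot 10 \right)$. [Figure: s1 reaches three MECs with values 4, 5 and 10 with probabilities p1, p2 and p3.]

15 Towards a General VI: Step 2, Weighted Reachability. $\frac{\text{Max. mean-payoff}}{R} = \sup_{\pi} \left( p_1 \cdot \frac{4}{R} + p_2 \cdot \frac{5}{R} + p_3 \cdot \frac{10}{R} \right)$, where dividing by R turns the MEC values into probabilities. [Figure: the MECs are replaced by transitions labelled 4/R, 5/R and 10/R.]

16 Towards a General VI: Step 2, Weighted Reachability. With the normalisation by R above, mean-payoff is reduced to reachability: $\mathbb{P}_{\max}(\Diamond\, +)$.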
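The reduction can be sketched as follows: each collapsed MEC with value v moves to the + state with probability v/R and to the − state otherwise, and the maximal probability of reaching + is multiplied back by R. The MEC values 4, 5 and 10 come from the slides; the choice R = 10, the two actions at s1 and all identifiers are assumptions for illustration.

```python
def max_reachability(mdp, target, iterations=100):
    """Value iteration for the maximal probability of reaching `target`."""
    p = {s: 1.0 if s == target else 0.0 for s in mdp}
    for _ in range(iterations):
        p = {
            s: 1.0 if s == target else max(
                (sum(q * p[t] for t, q in dist.items()) for dist in mdp[s].values()),
                default=0.0,  # states without actions (the - state) stay at 0
            )
            for s in mdp
        }
    return p


R = 10.0                                        # assumed normalising constant
mec_value = {"A": 4.0, "B": 5.0, "C": 10.0}      # MEC values from the slides

# Collapsed MDP: state -> action -> {successor: probability}.
collapsed = {
    "s1": {"a": {"A": 0.5, "C": 0.5}, "b": {"B": 1.0}},   # hypothetical choices at s1
    "+": {},
    "-": {},
}
for name, value in mec_value.items():
    collapsed[name] = {"stay": {"+": value / R, "-": 1.0 - value / R}}

reach = max_reachability(collapsed, "+")
mean_payoff = R * reach["s1"]
```

Here action a yields 0.5 · 4 + 0.5 · 10 = 7 and action b yields 5, so the optimal strategy picks a.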

17 Limitations of this method: the whole state space is explored; finding MECs on large state spaces is computationally expensive.

18 Improvement: avoid full state-space exploration. Idea: let sampling guide us to the important regions. [Figure: the example MDP over states s1 to s6 with one region highlighted in red.] The contribution of the red region to the mean-payoff (MP) is potentially low.

19 Improvement: guarantees through sampling. Existing: the BRTDP approach for reachability⁴ ⁵, which repeatedly samples paths from the initial state and back-propagates values along each path using the VI operator. ⁴ Bounded Real-Time Dynamic Programming, McMahan et al., ICML 2005. ⁵ Verification of Markov Decision Processes using Learning Algorithms, Brázdil et al., ATVA 2014.
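A minimal sketch of this sample-and-back-propagate loop for reachability is given below. It is a simplification under stated assumptions: successors are sampled by their transition probability only, the MDP has no end components apart from the + and − states, and the example MDP and all names are hypothetical.

```python
import random


def brtdp(mdp, init, target, sink, eps=1e-6, seed=0, max_episodes=10_000):
    """Maintain lower/upper bounds on the max. probability of reaching `target`."""
    rng = random.Random(seed)
    lo = {s: 0.0 for s in mdp}
    up = {s: 1.0 for s in mdp}
    lo[target] = 1.0   # the target is reached with certainty
    up[sink] = 0.0     # the sink never reaches the target

    def q(bound, s, a):
        return sum(p * bound[t] for t, p in mdp[s][a].items())

    for _ in range(max_episodes):
        if up[init] - lo[init] < eps:
            break
        # Sample one path, always following the action with the best upper bound.
        path, s = [], init
        while s not in (target, sink):
            path.append(s)
            best = max(mdp[s], key=lambda a: q(up, s, a))
            succ = list(mdp[s][best].items())
            s = rng.choices([t for t, _ in succ], weights=[p for _, p in succ])[0]
        # Back-propagate both bounds along the path with the VI operator.
        for s in reversed(path):
            up[s] = max(q(up, s, a) for a in mdp[s])
            lo[s] = max(q(lo, s, a) for a in mdp[s])
    return lo[init], up[init]


# Hypothetical MDP: state -> action -> {successor: probability}.
mdp = {
    "s0": {"a": {"s1": 0.5, "-": 0.5}, "b": {"+": 0.4, "-": 0.6}},
    "s1": {"a": {"+": 0.9, "-": 0.1}},
    "+": {},
    "-": {},
}
lower, upper = brtdp(mdp, "s0", "+", "-")
```

Only states that appear on sampled paths are ever updated, which is what lets the approach skip unimportant parts of the state space.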

20 to 27 Improvement: BRTDP in action. [Figure sequence: one state is fixed at the interval [0, 0] and the + state at [1, 1]; the interval on the remaining state tightens step by step from [0, 1] to [0.4, 1], [0.4, 0.8] and [0.4, 0.6], and in the last frame to a narrower interval with upper bound 0.6.]

28 Improvement: Algorithm
1. Run BRTDP in search of the + state.
2. Repeatedly collapse MECs amongst the states explored so far.
3. While collapsing, compute their values and add special transitions to the + and − states.

29 Improvement: can do even more

30 Improvement: can do even more. Collapse each MEC into a single state and add a special action. [Figure: a collapsed MEC whose special action has, among its targets, a new ? state.]

31 Final Algorithm: On-demand Value Iteration (ODV)
1. Sample paths like in BRTDP.
2. When MECs are detected, collapse them, but
3. don't compute the MEC value until ε-convergence.
4. Add a transition with probability (U − L) to the ? state.
5. Refine the value of a MEC only when ? is encountered.
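The special action of a collapsed MEC can be sketched as a three-way distribution. Only the probability U − L for the ? state is stated on the slide; the probabilities L for + and 1 − U for − are an assumption, chosen so that the distribution sums to one, and L and U are assumed to be already normalised to [0, 1]. All names are hypothetical.

```python
def special_action(lower, upper):
    """Distribution of a collapsed MEC whose normalised value lies in [lower, upper]."""
    if not 0.0 <= lower <= upper <= 1.0:
        raise ValueError("expected 0 <= lower <= upper <= 1")
    return {
        "+": lower,           # assumed: mass already known to be earned
        "?": upper - lower,   # from the slide: the still-undecided mass
        "-": 1.0 - upper,     # assumed: mass already known to be lost
    }


def reach_bounds(dist):
    """Bounds on reaching +: count ? as failure (lower) or as success (upper)."""
    return dist["+"], dist["+"] + dist["?"]


# Coarse bounds when the MEC is first collapsed ...
coarse = special_action(0.25, 0.75)
# ... and tighter ones after refining the MEC value on demand.
refined = special_action(0.5, 0.625)
```

The less often the ? state is reached from the initial state, the less its uncertainty matters, so MECs in unimportant regions never need to be solved precisely.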

32 Summary: We saw two methods which can be used to obtain the mean-payoff with guarantees:
1. Collapse MECs, add transitions to the +/− states, run reachability.
2. ODV: run sampling, collapse on-the-go, refine MEC values on-demand.

33 Benchmarks. [Table: columns Model, States, MECs, LP¹, MEC-VI²; rows virus, cs_nfail, investor, phil-nofair, rabin. Surviving entries: phil-nofair: TO, 6.67; rabin: TO.] ¹ MultiGain, Brázdil et al. ² MEC-VI: straightforward conversion to reachability, then VI.

34 Benchmarks. On-demand VI is better by orders of magnitude, depending on topology. [Table: columns Model, States, MEC-VI, ODV, ODV States, ODV MECs; entries in order: zeroconf(40,10), MO, avoid, zeroconf(300,15), MO, avoid, sensors(2), sensors(3).]
