The Art of Sequential Optimization via Simulations


1 The Art of Sequential Optimization via Simulations
Stochastic Systems and Learning Laboratory
EE, CS* & ISE* Departments (*by courtesy)
Viterbi School of Engineering, University of Southern California
(Based on joint work with Dileep Kalathil (UC Berkeley), W. Haskell (NU Singapore), V. Borkar (IITB), A. Gupta (Ohio State) & P. Glynn (Stanford))
Nov 19, 2015

2 Outline
I. Dynamic Programming
II. Empirical Dynamic Programming
III. Extensions
IV. Empirical Q-Value Iteration

3 Planning in Stochastic Environments
[Figure: example with States = 1200, Actions = 4]
Large state spaces, small action spaces.
What is the strategy to achieve maximum expected reward?
Model ingredients: States, Actions, Transitions, Rewards.
What is the solution? A policy: a map from a state to an action.

4 Markov Decision Processes: A Formal Model
[Figure: trajectory x_0 → x_1 → x_2 under actions a_0, a_1, collecting rewards r(x_0, a_0), r(x_1, a_1)]
An MDP with:
State space X, initial distribution λ
Action space A
State transition probability Q(y | x, a)
Reward function r(x, a)
Fixed, stationary policies π(a; x)
Value of a policy V^π; Objective: sup_π V^π
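To fix ideas, here is a minimal sketch (not from the talk) of a finite MDP represented as NumPy arrays; the helper name random_mdp and the array layout are illustrative assumptions, while Q, r, and the discount factor follow the slide's notation.

```python
import numpy as np

def random_mdp(num_states, num_actions, gamma=0.9, seed=0):
    """Build a random finite MDP (X, A, Q, r, gamma) as NumPy arrays."""
    rng = np.random.default_rng(seed)
    # Q[a, x, y] = transition probability Q(y | x, a); each row sums to 1.
    Q = rng.random((num_actions, num_states, num_states))
    Q /= Q.sum(axis=2, keepdims=True)
    # r[x, a] = one-step reward for taking action a in state x.
    r = rng.random((num_states, num_actions))
    return Q, r, gamma
```

A stationary deterministic policy is then just an integer array pi of length |X| mapping each state to an action.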

5 Dynamic Programming
State value function of a policy π:
V^π(x) = E^π[ Σ_{t=0}^∞ γ^t r(x_t, a_t) ]
Optimal value function: V*(x) = sup_π V^π(x)
The dynamic programming equation:
V*(x) = sup_a { r(x, a) + γ E[V*(y) | x, a] }
"Principle of Optimality" (Bellman, 1957)

11 DP and the Bellman Operator
[Figure: the simulation model Ψ maps (x_t, a_t, ω_t) to x_{t+1}]
Let Ψ : X × A × ℝ → X be a simulation model for the transition kernel Q, i.e., x_{t+1} = Ψ(x_t, a_t, ω_t) with noise ω_t.
The Bellman operator is
[TV](x) := sup_a { r(x, a) + γ E_ω[V(Ψ(x, a, ω))] }
The DP equation is now a fixed point equation: V* = TV*.
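One standard way to realize such a simulation model from a known kernel is inverse-transform sampling with uniform noise; a minimal sketch, assuming the random_mdp arrays above (the construction is a textbook device, not necessarily the talk's):

```python
def make_simulation_model(Q):
    """Return psi(x, a, w), a simulation model for the kernel Q.

    With w uniform on [0, 1], the next state is drawn by inverting the
    CDF of Q(. | x, a), so psi(x, a, W) ~ Q(. | x, a) when W ~ U[0, 1].
    """
    cdf = np.cumsum(Q, axis=2)  # cdf[a, x, y] = P(next state <= y | x, a)

    def psi(x, a, w):
        # first state index whose cumulative probability reaches w;
        # the min() guards against float round-off at the top of the CDF
        y = int(np.searchsorted(cdf[a, x], w))
        return min(y, cdf.shape[2] - 1)

    return psi
```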

12 The Value Iteration Algorithm
The Bellman operator T is a contraction:
‖TV_1 − TV_2‖ ≤ γ ‖V_1 − V_2‖
Value Iteration:
V_{k+1}(x) = [TV_k](x) := sup_a { r(x, a) + γ E_ω[V_k(Ψ(x, a, ω))] }
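For concreteness, a minimal sketch of exact value iteration on the arrays above (an illustrative implementation, not the talk's code):

```python
def value_iteration(Q, r, gamma, tol=1e-8, max_iters=10_000):
    """Exact VI: iterate V <- TV until the sup-norm change is below tol."""
    num_actions, num_states, _ = Q.shape
    V = np.zeros(num_states)
    for _ in range(max_iters):
        # [TV](x) = max_a { r(x, a) + gamma * sum_y Q(y | x, a) V(y) }
        TV = (r.T + gamma * (Q @ V)).max(axis=0)
        if np.abs(TV - V).max() < tol:
            return TV
        V = TV
    return V
```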

13 Online/Approximate Dynamic Programming
DP methods are known to suffer from the curse of dimensionality.
Approximate DP: Bertsekas-Tsitsiklis [NDP, 1996], Powell [ADP, 2011]
Reinforcement Learning: Q-Learning, Temporal Differences, etc. (Szepesvari [ARL, 2010])
Stochastic approximation-based schemes: slow rate of convergence.

14 Outline
I. Dynamic Programming
II. Empirical Dynamic Programming
III. Extensions
IV. Empirical Q-Value Iteration

15 Empirical Value Iteration
Dynamic programming by simulation.
EVI: V̂_{k+1}(x) = [T̂ V̂_k](x) := sup_a { r(x, a) + γ Ê_n[V̂_k(Ψ(x, a, ω))] } = sup_a { r(x, a) + (γ/n) Σ_{i=1}^n V̂_k(Ψ(x, a, ω_i^{k+1})) }
where the ω's are i.i.d. noise RVs.
{V̂_k} is a random sequence; T̂ is a random monotone operator, and E[T̂(V)] ≠ T(V).
Non-incremental updates.
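A minimal sketch of one empirical Bellman update, using a simulation model psi as above; the batch size n and the RNG plumbing are illustrative assumptions:

```python
def empirical_bellman(V, r, gamma, psi, n, rng):
    """One EVI step: replace E[V(psi(x, a, w))] by an n-sample average."""
    num_states, num_actions = r.shape
    V_next = np.empty(num_states)
    for x in range(num_states):
        best = -np.inf
        for a in range(num_actions):
            w = rng.random(n)  # fresh i.i.d. noise for every (x, a) update
            avg = np.mean([V[psi(x, a, wi)] for wi in w])
            best = max(best, r[x, a] + gamma * avg)
        V_next[x] = best
    return V_next
```

Iterating V <- empirical_bellman(V, ...) produces exactly the random sequence V̂_k whose behavior the next slide asks about.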

16 Questions
1. What is the behavior of the random sequence {V̂_k}?
2. Is there a relevant notion of (probabilistic) fixed point for a random (empirical Bellman) operator?
3. How does it relate to the fixed point of the classical Bellman operator?
4. Can we give a sample complexity bound on n? And how many iterations k do we need for a reasonable approximation?

17 Do EVI and EPI Converge? Numerical Evidence
100 states, 5 actions, random MDP.
[Figure, left: relative error ‖v_k − v*‖/‖v*‖ vs. number of iterations for EVI (n = 1, 5) and Exact Value Iteration]
[Figure, right: relative error ‖v_k − v*‖/‖v*‖ vs. number of iterations for EPI (n = m = 1, 5) and Exact Policy Iteration]

20 How do they compare?
[Figure: relative error ‖v_k − v*‖/‖v*‖ vs. number of iterations for Exact Policy Iteration, EPI (n = m = 5), EVI (n = 5), Exact Value Iteration, QL (step-size parameter 0.5), and OPI]
States = 100, Actions = 5, random MDP. Offline QL with n = 5 samples/iteration.
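For reference, the QL baseline in these plots is ordinary incremental Q-learning; a minimal sketch of its update (reading the "0.5" in the legend as the step-size exponent, alpha_k = k^{-0.5}, is my assumption):

```python
def q_learning_step(q, x, a, reward, x_next, gamma, alpha):
    """Incremental QL: move q(x, a) a step toward a one-sample target."""
    target = reward + gamma * q[x_next].max()
    q[x, a] += alpha * (target - q[x, a])
```

Unlike EVI, which averages a fresh batch of n samples per update, each QL update mixes a single sample in with a decaying step size, which is why it inherits the slow stochastic-approximation rate noted earlier.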

21 Actual Runtime
Simulation time comparison. EPI: very slow. LP method: even worse.
[Figure, left: simulation time (seconds) vs. relative error (%) for EVI (n = 5, 6, 10, 18, ...), Exact VI, and QL; |S| = 5000, |A| = 10]
[Figure, right: relative error vs. number of iterations for EVI (n = 1, 5, 10, 20), QL (n = 20), and VI; |S| = 5000, |A| = 10]
States = 5000, Actions = 10, random MDP. All simulations run on a MacBook Pro under identical conditions.

22 The Empirical Bellman Operator and its Iterations
Q. Can we prove convergence?
The EVI iterates compose i.i.d. random operators, V̂_k = T̂_k T̂_{k−1} ··· T̂_1 V̂_0; this is like multiplying random matrices together.
Q. Will this product converge?

23 Probabilistic Fixed Points of Random Operators
Another way to look at it: is there a (probabilistic) fixed point of T̂?
Multiple notions of probabilistic fixed points of a sequence of random operators:
Strong probabilistic fixed point
Weak probabilistic fixed point
Classical fixed point
These (asymptotic) notions coincide!
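For orientation, here is one way to formalize the two probabilistic notions, paraphrased from the Empirical Dynamic Programming paper cited below; the slides do not display the definitions, so treat this as a hedged reconstruction rather than the exact statements:

```latex
% Hedged reconstruction: \hat v is a strong probabilistic fixed point of
% the operator sequence {\widehat{T}_n} if applying \widehat{T}_n barely moves it:
\lim_{n \to \infty} \Pr\big( \| \widehat{T}_n \hat v - \hat v \| > \epsilon \big) = 0
  \quad \text{for all } \epsilon > 0.

% ... and a weak probabilistic fixed point if the k-fold iterates
% concentrate around \hat v as k, then n, grow:
\lim_{n \to \infty} \lim_{k \to \infty}
  \Pr\big( \| \widehat{T}_n^{\,k} v_0 - \hat v \| > \epsilon \big) = 0
  \quad \text{for all } v_0 \text{ and all } \epsilon > 0.
```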

35 Sample Complexity of EVI
Theorem: Given ε ∈ (0, 1) and δ ∈ (0, 1), select
n ≥ (C_1/ε²) log(2|X||A|/δ) and k ≥ log(1/δ)/μ_{n,min}.
Then P(‖V̂_k − V*‖ ≤ ε) ≥ 1 − δ.
Sample complexity of EVI: O(1/ε², log(1/δ), log |X||A|).
No assumptions on the MDP needed!
W. Haskell, R. Jain and D. Kalathil, "Empirical Dynamic Programming," Mathematics of Operations Research, to appear, 2015.

36 And Construct a Dominating Markov Chain
[Figure: the discrete error process X_k^n alongside a dominating Markov chain Y_k^n on the grid {0, η*, ..., N*}, with transition probabilities p_n and 1 − p_n]
The key idea is the construction of a Markov chain {Y_k^n} that stochastically dominates the error process {X_k^n}:
P(X_k^n ≥ z) ≤ P(Y_k^n ≥ z) for all z.
{Y_k^n} is easier to analyze than {X_k^n}; show that it converges to zero in probability as n, k → ∞.

37 Outline
I. Dynamic Programming
II. Empirical Dynamic Programming
III. Extensions
IV. Empirical Q-Value Iteration

38 Asynchronous EVI
Online EVI: update only one state at a time:
V̂_{k+1}(x) = [T̂ V̂_k](x) := sup_a { r(x, a) + γ Ê_n[V̂_k(Ψ(x, a, ω))] } = sup_a { r(x, a) + (γ/n) Σ_{i=1}^n V̂_k(Ψ(x, a, ω_i^{k+1})) }
Theorem. If each state is visited infinitely often, then V̂_k → V* in probability.
The proof relies on defining a new random operator that is the composition of the empirical Bellman operators between hitting times.
W. Haskell, R. Jain and D. Kalathil, "Empirical Dynamic Programming," Mathematics of Operations Research, to appear.
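A minimal sketch of the asynchronous variant, updating a single (here, randomly chosen) state per iteration; the uniform state-selection rule is an illustrative assumption that makes every state visited infinitely often:

```python
def async_evi(V, r, gamma, psi, n, num_iters, rng):
    """Asynchronous EVI: one empirical Bellman update per iteration."""
    num_states, num_actions = r.shape
    for _ in range(num_iters):
        x = rng.integers(num_states)  # visit states uniformly at random
        values = []
        for a in range(num_actions):
            w = rng.random(n)
            values.append(r[x, a] + gamma * np.mean([V[psi(x, a, wi)] for wi in w]))
        V[x] = max(values)  # update only the visited state
    return V
```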

39 Numerical Performance of Online EVI
[Figure, left: asynchronous EVI and QL, updating only 1 state per iteration; relative error vs. number of iterations for EVI (n = 10) and QL (n = 10, step-size parameter 0.6)]
[Figure, right: asynchronous EVI and QL, updating 10 states per iteration; relative error vs. number of iterations for EVI (n = 1, 5, 10) and QL (n = 10, step-size parameter 0.6)]
States = 500, Actions = 10, random MDP.

40 Outline
I. Dynamic Programming
II. Empirical Dynamic Programming
III. Extensions
IV. Empirical Q-Value Iteration

41 Q-Value Iteration
Q-value of a policy π:
Q^π(x, a) = E[ Σ_{t=0}^∞ γ^t r(x_t, a_t = π(x_t)) | x_0 = x, a_0 = a ]
Optimal Q-value: Q*(x, a) = sup_π Q^π(x, a), and V*(x) = max_{a∈A} Q*(x, a).
Q-value operator G:
G(Q)(x, a) := r(x, a) + γ Σ_y P(y | x, a) max_b Q(y, b)
The optimal Q* is the fixed point of G, a contraction, and the optimal policy is π*(x) = arg max_{a∈A} Q*(x, a).
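For concreteness, a minimal sketch of exact Q-value iteration on the same arrays (illustrative, not the talk's code):

```python
def q_value_iteration(Q, r, gamma, tol=1e-8, max_iters=10_000):
    """Exact QI: iterate q <- G(q) until the sup-norm change is below tol."""
    num_actions, num_states, _ = Q.shape
    q = np.zeros((num_states, num_actions))
    for _ in range(max_iters):
        # G(q)(x, a) = r(x, a) + gamma * sum_y Q(y | x, a) max_b q(y, b)
        Gq = r + gamma * (Q @ q.max(axis=1)).T
        if np.abs(Gq - q).max() < tol:
            return Gq
        q = Gq
    return q
```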

42 Empirical Q-Value Iteration (EQVI)
EQVI: simulation-based Q-value iteration,
Q̂_{k+1}(x, a) = [Ĝ Q̂_k](x, a) := r(x, a) + (γ/n) Σ_{i=1}^n max_b Q̂_k(Ψ(x, a, ω_i^{k+1}), b)
where the ω's are i.i.d. noise RVs.
Ĝ is a random (monotone) operator; Q̂_0, Q̂_1, Q̂_2, ... is a random sequence.
Non-incremental updates vs. QL.
D. Kalathil, V. Borkar and R. Jain, "Empirical Q-Value Iteration," submitted.
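A minimal sketch of one EQVI update with the simulation model psi (illustrative; note that only psi, not the kernel Q, is touched when estimating the expectation):

```python
def eqvi_step(q, r, gamma, psi, n, rng):
    """One EQVI step: empirical version of the Q-value operator G."""
    num_states, num_actions = r.shape
    q_next = np.empty_like(q)
    for x in range(num_states):
        for a in range(num_actions):
            w = rng.random(n)
            # n-sample estimate of E[max_b q(psi(x, a, w), b)]
            est = np.mean([q[psi(x, a, wi)].max() for wi in w])
            q_next[x, a] = r[x, a] + gamma * est
    return q_next
```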

43 Numerical Comparison: EQVI vs QL
[Figure: synchronous comparison; relative error ‖Q_t − Q*‖/‖Q*‖ vs. number of iterations for EQVI (n = 5, 10, 20), Exact QI, and QL (n = 20); |S| = 500, |A| = 10]
Speedup = 10x+

44 Online EQVI vs QL
[Figure: asynchronous comparison; relative error ‖Q_t − Q*‖/‖Q*‖ vs. number of iterations for EQVI (n = 20), QL (n = 20), and Exact QI; |S| = 500, |A| = 10]
An online version of EQVI. Speedup = 100x+?
Converges in probability under suitable recurrence conditions.

45 Other Extensions
Continuous state space MDPs:
State aggregation: construct an ε-net and perform EVI on the ε-net [Haskell, J. & Sharma (2015)]
Function approximation [Szepesvari & Munos 08]:
Kernel-based function approximation
Deep neural network-based function approximation: DEEP EVI/EQVI
Average cost: more complicated, similar numerical performance gains [Gupta, J. & Glynn (2015)]

46 Conclusions
Empirical Dynamic Programming algorithms: a "natural" way of doing approximate dynamic programming via simulations.
Iteration of random operators: the stochastically dominating Markov chain method is a fairly general technique.
Extensions to online algorithms for model-free settings.
Extension to continuous state space MDPs.
Doesn't solve all "curses of dimensionality."
Surprisingly good numerical performance; weaker notion of convergence.
W. Haskell, R. Jain and D. Kalathil, "Empirical Dynamic Programming," Mathematics of Operations Research, to appear.
D. Kalathil, V. Borkar and R. Jain, "Empirical Q-Value Iteration," submitted.


48 The cleverest thing to do is the simplest one.
