The Art of Sequential Optimization via Simulations
Stochastic Systems and Learning Laboratory
EE, CS* & ISE* Departments (*by courtesy)
Viterbi School of Engineering, University of Southern California
(Based on joint work with Dileep Kalathil (UC Berkeley), W. Haskell (NU Singapore), V. Borkar (IITB), A. Gupta (Ohio State) & P. Glynn (Stanford))
Nov 19, 2015
1
Outline I. Dynamic Programming II. Empirical Dynamic Programming III. Extensions IV. Empirical Q-Value Iteration 2
Planning in Stochastic Environments
Example: States = 1200, Actions = 4
Large state spaces, small action spaces
What is the strategy to achieve maximum expected reward?
Ingredients: States, Actions, Transitions, Rewards
What is the solution? A policy: a map from a state to an action
3
Markov Decision Processes: A Formal Model
[Figure: a sample path x_0 → x_1 → x_2 under actions a_0, a_1 with rewards r(x_0, a_0), r(x_1, a_1)]
An MDP has: state space X, initial distribution λ, action space A
State transition probability Q(y | x, a)
Reward function r(x, a)
Fixed, stationary policies π(a; x)
Value of a policy V^π; Objective: sup_π V^π
4
Dynamic Programming
State value function of a policy π: V^π(x) = E^π[ Σ_{t=0}^∞ γ^t r(x_t, a_t) ]
Optimal value function: V^*(x) = sup_π V^π(x)
The dynamic programming equation (`Principle of Optimality', Bellman, 1959): V^*(x) = sup_a { r(x, a) + γ E[V^*(y) | x, a] }
5
DP and the Bellman Operator
[Figure: the simulation model Ψ maps (x_t, a_t, ω_t) to the next state x_{t+1}]
Let Ψ : X × A × R → X be a simulation model for the transition kernel Q, i.e., Ψ(x, a, ω) is distributed as Q(· | x, a) when ω is a noise random variable
The Bellman operator is [T V](x) := sup_a { r(x, a) + γ E_ω[ V(Ψ(x, a, ω)) ] }
The DP equation is now a fixed point equation: V^* = T V^*
6
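To make the simulation-model view concrete, here is a minimal sketch in Python/NumPy of a random finite MDP and a simulation model Ψ(x, a, ω) whose output is distributed according to Q(· | x, a). The helper names (make_random_mdp, psi) and the uniform-noise construction are illustrative assumptions, not the code behind the experiments in this talk.

import numpy as np

def make_random_mdp(num_states, num_actions, seed=0):
    # Random finite MDP: transition kernel Q(y | x, a) and rewards r(x, a) in [0, 1].
    rng = np.random.default_rng(seed)
    Q = rng.random((num_states, num_actions, num_states))
    Q /= Q.sum(axis=2, keepdims=True)   # normalize each row into a distribution
    r = rng.random((num_states, num_actions))
    return Q, r

def psi(Q, x, a, omega):
    # Simulation model Psi(x, a, omega): omega ~ Uniform(0, 1) is mapped to a next
    # state by inverting the CDF of Q(. | x, a), so psi(Q, x, a, omega) ~ Q(. | x, a).
    return int(np.searchsorted(np.cumsum(Q[x, a]), omega))

For instance, Q, r = make_random_mdp(100, 5) followed by psi(Q, 3, 1, np.random.default_rng().random()) draws one simulated next state from state 3 under action 1.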
The Value Iteration Algorithm
The Bellman operator T is a contraction: ||T V_1 − T V_2|| ≤ γ ||V_1 − V_2||
Value Iteration: V_{k+1}(x) = [T V_k](x) := sup_a { r(x, a) + γ E_ω[ V_k(Ψ(x, a, ω)) ] }
7
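For reference, a minimal sketch of exact value iteration on such a finite MDP; the discount factor, tolerance, and iteration cap are illustrative defaults, not values used in the talk.

import numpy as np

def value_iteration(Q, r, gamma=0.9, tol=1e-8, max_iter=10_000):
    # Exact VI: V_{k+1}(x) = max_a { r(x, a) + gamma * sum_y Q(y | x, a) V_k(y) }.
    V = np.zeros(r.shape[0])
    for _ in range(max_iter):
        TV = (r + gamma * Q @ V).max(axis=1)   # Bellman backup under the true kernel
        if np.max(np.abs(TV - V)) < tol:       # stop once the sup-norm change is tiny
            return TV
        V = TV
    return V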
Online/Approximate Dynamic Programming
DP methods are known to suffer from the curse of dimensionality
Approximate DP: Bertsekas & Tsitsiklis [NDP, 1994], Powell [ADP, 2011]
Reinforcement Learning: Q-Learning, Temporal Differences, etc. (Szepesvari [ARL, 2010])
Stochastic approximation-based schemes: slow rate of convergence
8
Outline I. Dynamic Programming II. Empirical Dynamic Programming III. Extensions IV. Empirical Q-Value Iteration 9
Empirical Value Iteration
Dynamic programming by simulation
EVI: \hat{V}_{k+1}(x) = [\hat{T}_n \hat{V}_k](x) := sup_a { r(x, a) + γ \hat{E}_n[ \hat{V}_k(Ψ(x, a, ω)) ] } = sup_a { r(x, a) + (γ/n) Σ_{i=1}^n \hat{V}_k(Ψ(x, a, ω_i^{k+1})) }
where the ω's are i.i.d. noise RVs
\hat{T}_n is a random monotone operator, and E[\hat{T}_n(V)] ≠ T(V)
\hat{V}_0, \hat{V}_1, \hat{V}_2, ... is a random sequence
Non-incremental updates
10
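A minimal sketch of EVI in the same setting: at every iteration the expectation in the Bellman backup is replaced by an empirical mean over n fresh simulated next states. It reuses the hypothetical inverse-CDF sampling from the earlier sketch and is only an illustration of the update on this slide.

import numpy as np

def empirical_value_iteration(Q, r, gamma=0.9, n=5, num_iters=100, seed=0):
    # EVI: replace E[V(Psi(x, a, omega))] with a sample mean over n i.i.d. draws,
    # redrawn independently at every iteration (non-incremental updates).
    rng = np.random.default_rng(seed)
    num_states, num_actions = r.shape
    V = np.zeros(num_states)
    for _ in range(num_iters):
        V_new = np.empty(num_states)
        for x in range(num_states):
            backups = []
            for a in range(num_actions):
                omegas = rng.random(n)                                     # fresh noise samples
                next_states = np.searchsorted(np.cumsum(Q[x, a]), omegas)  # n simulated next states
                backups.append(r[x, a] + gamma * V[next_states].mean())
            V_new[x] = max(backups)
        V = V_new
    return V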
Questions
1. What is the behavior of the random sequence \hat{V}_0, \hat{V}_1, \hat{V}_2, ...?
2. Is there a relevant notion of (probabilistic) fixed point for a random (empirical Bellman) operator?
3. How does it relate to the fixed point of the classical Bellman operator?
4. Can we give a sample complexity bound on n? And how many iterations k do we need for a reasonable approximation?
11
Do EVI and EPI Converge? Numerical Evidence
100 states, 5 actions, random MDP
[Plots: relative error ||v_k − v*|| / ||v*|| vs. number of iterations; left: EVI (n = 1, 5) against exact Value Iteration; right: EPI (n = m = 1, 5) against exact Policy Iteration]
12
How do they compare?
[Plot: relative error ||v_k − v*|| / ||v*|| vs. number of iterations for Policy Iteration, EPI (n = m = 5), EVI (n = 5), exact Value Iteration, QL (0.5), and OPI (n = m = 5)]
States = 100, Actions = 5, random MDP; offline QL with n = 5 samples/iteration
13
Actual Runtime
EPI: very slow. LP method: even worse.
[Plots: simulation time (seconds) and relative error (%) vs. number of iterations for EVI (n = 1, 5, 10, 20), QL (n = 20), and exact VI; |S| = 5000, |A| = 10]
States = 5000, Actions = 10, random MDP. All simulations run on a MacBook Pro under identical conditions.
14
The Empirical Bellman Operator and its Iterations
Q. Can we prove convergence? This is like multiplying random matrices together.
Q. Will this product converge?
15
Probabilistic Fixed Points of Random Operators
Another way to look at it: is there a (probabilistic) fixed point of the random operator \hat{T}_n?
Multiple notions of probabilistic fixed points of a sequence of random operators:
Strong probabilistic fixed point
Weak probabilistic fixed point
Classical fixed point
These (asymptotic) notions coincide!
16
Sample Complexity of EVI
Theorem: Given ε ∈ (0, 1) and δ ∈ (0, 1), select n ≥ C (1/ε²) log(2|X||A|) and k large enough (in terms of δ and μ_{n,min}). Then P( ||\hat{V}_k − V^*|| ≤ ε ) ≥ 1 − δ.
Sample complexity of EVI: O( 1/ε², log(1/δ), log(|X||A|) )
No assumptions on the MDP needed!
W. Haskell, R. Jain and D. Kalathil, "Empirical Dynamic Programming," Mathematics of Operations Research, to appear, 2015.
17
And Construct a Dominating Markov Chain
[Figure: the discrete error process {X_k^n} and a dominating Markov chain {Y_k^n} on a finite state space {0, 1, ..., N*}, with transition probabilities p_n and 1 − p_n]
Key idea: construct a Markov chain {Y_k^n}_{k ≥ 0} that stochastically dominates the error process {X_k^n}_{k ≥ 0} in probability: P(X_k^n ≥ z) ≤ P(Y_k^n ≥ z) for all z
{Y_k^n}_{k ≥ 0} is easier to analyze than {X_k^n}_{k ≥ 0}; show it converges to zero in probability as n, k → ∞
18
Outline I. Dynamic Programming II. Empirical Dynamic Programming III. Extensions IV. Empirical Q-Value Iteration 19
Asynchronous EVI
Online EVI: update only one state at a time
\hat{V}_{k+1}(x) = [\hat{T}_n \hat{V}_k](x) := sup_a { r(x, a) + γ \hat{E}_n[ \hat{V}_k(Ψ(x, a, ω)) ] } = sup_a { r(x, a) + (γ/n) Σ_{i=1}^n \hat{V}_k(Ψ(x, a, ω_i^{k+1})) }, applied only at the currently visited state x
Theorem. If each state is visited infinitely often, \hat{V}_k → V^* in probability.
The proof relies on defining a new random operator that is the product of the empirical Bellman operators between hitting times.
W. Haskell, R. Jain and D. Kalathil, "Empirical Dynamic Programming," Mathematics of Operations Research, to appear, 2015.
20
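A minimal sketch of the asynchronous variant: only the currently visited state is updated at each step, with n fresh samples. The uniformly random visit schedule below is just one illustrative way to ensure every state is visited infinitely often; it is an assumption, not the schedule used in the experiments.

import numpy as np

def asynchronous_evi(Q, r, gamma=0.9, n=5, num_steps=5000, seed=0):
    # Asynchronous EVI: apply the empirical Bellman backup at a single visited
    # state per step, leaving all other entries of V unchanged.
    rng = np.random.default_rng(seed)
    num_states, num_actions = r.shape
    V = np.zeros(num_states)
    for _ in range(num_steps):
        x = rng.integers(num_states)            # visited state (uniform schedule here)
        backups = []
        for a in range(num_actions):
            omegas = rng.random(n)
            next_states = np.searchsorted(np.cumsum(Q[x, a]), omegas)
            backups.append(r[x, a] + gamma * V[next_states].mean())
        V[x] = max(backups)
    return V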
Numerical Performance of Online EVI
[Plots: relative error vs. number of iterations for asynchronous EVI (n = 1, 5, 10) and QL (n = 10, 0.6); left: updating only 1 state per iteration; right: updating 10 states per iteration]
States = 500, Actions = 10, random MDP
21
Outline I. Dynamic Programming II. Empirical Dynamic Programming III. Extensions IV. Empirical Q-Value Iteration 22
Q-Value Iteration
Q-value of a policy π: Q^π(x, a) = E[ Σ_{t=0}^∞ γ^t r(x_t, a_t = π(x_t)) | x_0 = x, a_0 = a ]
Q^*(x, a) = sup_π Q^π(x, a),   V^*(x) = max_{a ∈ A} Q^*(x, a),   π^*(x) = arg max_{a ∈ A} Q^*(x, a)
Q-value operator: G(Q)(x, a) := r(x, a) + γ Σ_y P(y | x, a) max_b Q(y, b)
Optimal Q^* is the fixed point of G, a contraction
23
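A minimal sketch of exact Q-value iteration with the operator G above, in the same conventions as the earlier sketches (the discount factor and stopping tolerance are illustrative defaults):

import numpy as np

def q_value_iteration(P, r, gamma=0.9, tol=1e-8, max_iter=10_000):
    # Iterate G(Q)(x, a) = r(x, a) + gamma * sum_y P(y | x, a) * max_b Q(y, b).
    Qv = np.zeros(r.shape)
    for _ in range(max_iter):
        GQ = r + gamma * P @ Qv.max(axis=1)   # contract the (S, A, S) kernel with max_b Q(., b)
        if np.max(np.abs(GQ - Qv)) < tol:
            return GQ
        Qv = GQ
    return Qv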
Empirical Q-Value Iteration (EQVI)
EQVI: simulation-based Q-value iteration, \hat{Q}_{k+1}(x, a) = [\hat{G}_n \hat{Q}_k](x, a) := r(x, a) + (γ/n) Σ_{i=1}^n max_b \hat{Q}_k(Ψ(x, a, ω_i^{k+1}), b)
where the ω's are i.i.d. noise RVs
\hat{G}_n is a random (monotone) operator; \hat{Q}_0, \hat{Q}_1, \hat{Q}_2, ... is a random sequence
Non-incremental updates, in contrast to QL's stochastic approximation-based updates
D. Kalathil, V. Borkar and R. Jain, "Empirical Q-Value Iteration," submitted, Nov. 2014. http://arxiv.org/abs/1412.0180
24
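A minimal sketch of EQVI along the same lines: the expectation in G is replaced by an empirical mean over n simulated next states, redrawn at every iteration, and the entire Q-table is updated at once rather than by stochastic-approximation steps. Again an illustration under the earlier random-MDP assumptions, not the experimental code.

import numpy as np

def empirical_q_value_iteration(P, r, gamma=0.9, n=5, num_iters=200, seed=0):
    # EQVI: hat{G}_n(Q)(x, a) = r(x, a) + (gamma / n) * sum_i max_b Q(Psi(x, a, omega_i), b).
    rng = np.random.default_rng(seed)
    num_states, num_actions = r.shape
    Qv = np.zeros((num_states, num_actions))
    for _ in range(num_iters):
        V = Qv.max(axis=1)                     # max_b Q(., b)
        GQ = np.empty_like(Qv)
        for x in range(num_states):
            for a in range(num_actions):
                omegas = rng.random(n)
                next_states = np.searchsorted(np.cumsum(P[x, a]), omegas)
                GQ[x, a] = r[x, a] + gamma * V[next_states].mean()
        Qv = GQ
    return Qv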
Numerical Comparison: EQVI vs QL
[Plot: relative error ||Q_t − Q^*|| / ||Q^*|| vs. number of iterations (synchronous updates) for EQVI (n = 5, 10, 20), exact Q-value iteration, and QL (n = 20); |S| = 500, |A| = 10]
Speedup = 10x+
25
Online EQVI vs QL
[Plot: relative error ||Q_t − Q^*|| / ||Q^*|| vs. number of iterations (asynchronous updates) for EQVI (n = 20), QL (n = 20), and exact Q-value iteration; |S| = 500, |A| = 10]
An online version of EQVI: speedup = 100x+? Converges in probability under suitable recurrence conditions.
26
Other Extensions
Continuous state space MDPs
State aggregation: construct an ε-net and perform EVI on the ε-net [Haskell, J. & Sharma (2015)]
Function approximation [Szepesvari & Munos '08]: kernel-based function approximation; deep neural network-based function approximation (Deep EVI/EQVI)
Average-case: more complicated, but similar numerical performance gains [Gupta, J. & Glynn (2015)]
27
Conclusions
Empirical Dynamic Programming algorithms: a ``natural'' way of doing approximate dynamic programming via simulations
Iteration of random operators: the stochastically dominating Markov chain method is a fairly general technique
Extensions to online algorithms for model-free settings; extension to continuous state space MDPs
Doesn't solve all ``curses of dimensionality''
Surprisingly good numerical performance; weaker notion of convergence
W. Haskell, R. Jain and D. Kalathil, "Empirical Dynamic Programming," Mathematics of Operations Research, to appear, 2015. http://arxiv.org/abs/1311.5918
D. Kalathil, V. Borkar and R. Jain, "Empirical Q-Value Iteration," submitted, Nov. 2014. http://arxiv.org/abs/1412.0180
28
The cleverest thing to do is the simplest one.
http://www-bcf.usc.edu/~rahuljai
29