The Art of Sequential Optimization via Simulations


The Art of Sequential Optimization via Simulations. Stochastic Systems and Learning Laboratory, EE, CS* & ISE* Departments (*by courtesy), Viterbi School of Engineering, University of Southern California. (Based on joint work with Dileep Kalathil (UC Berkeley), W. Haskell (NU Singapore), V. Borkar (IITB), A. Gupta (Ohio State) & P. Glynn (Stanford).) Nov 19, 2015. 1

Outline I. Dynamic Programming II. Empirical Dynamic Programming III. Extensions IV. Empirical Q-Value Iteration 2

Planning in Stochastic Environments. Example: States = 1200, Actions = 4; large state spaces, small action spaces. The model: States, Actions, Transitions, Rewards. What is the strategy that achieves the maximum expected reward? What is the solution? A policy: a map from a state to an action. 3

Markov Decision Processes: A Formal Model. [Diagram: a trajectory $x_0 \xrightarrow{a_0} x_1 \xrightarrow{a_1} x_2 \cdots$ collecting rewards $r(x_0,a_0), r(x_1,a_1), \ldots$] An MDP with: state space X and initial distribution λ; action space A; state transition probability $Q(y \mid x, a)$; reward function $r(x,a)$. Fixed, stationary policies $\pi(a \mid x)$. Value of a policy $V^\pi$; objective: $\sup_\pi V^\pi$. 4

Dynamic Programming. State value function of a policy $\pi$: $V^\pi(x) = \mathbb{E}^\pi\big[\sum_{t=0}^{\infty} \gamma^t r(x_t, a_t)\big]$. Optimal value function: $V^*(x) = \sup_\pi V^\pi(x)$. The dynamic programming equation: $V^*(x) = \sup_a \{ r(x,a) + \gamma\, \mathbb{E}[V^*(y) \mid x, a] \}$ ('Principle of Optimality', Bellman, 1959). 5

DP and the Bellman Operator. Let $\Psi: X \times A \times \mathbb{R} \to X$ be a simulation model for the transition kernel Q, i.e., $x_{t+1} = \Psi(x_t, a_t, \omega_t)$ for i.i.d. noise $\omega_t$. The Bellman operator is $[TV](x) = \sup_a \{ r(x,a) + \gamma\, \mathbb{E}_\omega[V(\Psi(x, a, \omega))] \}$, and the DP equation is now a fixed point equation: $TV^* = V^*$. 6
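To make the simulation-model idea concrete, here is a minimal sketch (my own illustration, not from the slides) of how a tabular kernel Q can be wrapped as such a Ψ via inverse-CDF sampling; all names and sizes are made up for the example.

```python
import numpy as np

def make_simulation_model(Q):
    """Illustrative sketch: wrap a tabular kernel Q[x, a, y] = P(y | x, a)
    as a simulation model Psi(x, a, w) with w ~ Uniform[0, 1) (inverse CDF)."""
    cdf = np.cumsum(Q, axis=-1)                      # CDF over next states per (x, a)

    def psi(x, a, w):
        # Smallest next state whose CDF reaches the uniform draw w.
        y = int(np.searchsorted(cdf[x, a], w))
        return min(y, Q.shape[-1] - 1)               # guard against float round-off

    return psi

# Usage: one simulated transition from state 0 under action 1.
rng = np.random.default_rng(0)
Q = rng.dirichlet(np.ones(4), size=(4, 2))           # random 4-state, 2-action kernel
psi = make_simulation_model(Q)
next_state = psi(0, 1, rng.uniform())
```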

The Value Iteration Algorithm. The Bellman operator T is a contraction: $\|TV_1 - TV_2\| \le \gamma\, \|V_1 - V_2\|$. Value Iteration: $V_{k+1}(x) = [TV_k](x) := \sup_a \{ r(x, a) + \gamma\, \mathbb{E}_\omega[V_k(\Psi(x, a, \omega))] \}$. 7
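For reference, a minimal runnable sketch of exact value iteration on a small tabular MDP; this is my own illustration (the slides only state the update), and the random MDP here is just a stand-in.

```python
import numpy as np

def bellman_operator(V, Q, r, gamma):
    """Exact Bellman operator: (TV)(x) = max_a { r(x,a) + gamma * E[V(y) | x, a] }."""
    return np.max(r + gamma * Q @ V, axis=1)          # Q @ V has shape (X, A)

def value_iteration(Q, r, gamma=0.9, tol=1e-8, max_iter=1000):
    """Iterate T until the sup-norm change is small (contraction => convergence)."""
    V = np.zeros(Q.shape[0])
    for _ in range(max_iter):
        V_next = bellman_operator(V, Q, r, gamma)
        if np.max(np.abs(V_next - V)) < tol:
            return V_next
        V = V_next
    return V

# Usage on a small random MDP (illustrative sizes only).
rng = np.random.default_rng(0)
X, A = 100, 5
Q = rng.dirichlet(np.ones(X), size=(X, A))            # Q[x, a, y] = P(y | x, a)
r = rng.uniform(size=(X, A))
V_star = value_iteration(Q, r)
```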

Online/Approximate Dynamic Programming. DP methods are known to suffer from a curse of dimensionality. Approximate DP: Bertsekas-Tsitsiklis [NDP, 1994], Powell [ADP, 2011]. Reinforcement Learning: Q-Learning, Temporal Differences, etc. (Szepesvari [ARL, 2010]). These stochastic approximation-based schemes have a slow rate of convergence. 8

Outline I. Dynamic Programming II. Empirical Dynamic Programming III. Extensions IV. Empirical Q-Value Iteration 9

Empirical Value Iteration: dynamic programming by simulation. EVI: $\hat V_{k+1}(x) = [\hat T \hat V_k](x) := \sup_a \{ r(x,a) + \gamma\, \hat{\mathbb{E}}_n[\hat V_k(\Psi(x, a, \omega))] \} = \sup_a \{ r(x,a) + \frac{\gamma}{n} \sum_{i=1}^{n} \hat V_k(\Psi(x, a, \omega_i^{k+1})) \}$, where the $\omega$'s are i.i.d. noise RVs. $\hat T$ is a random monotone operator with $\mathbb{E}[\hat T(V)] \neq T(V)$, and $\hat V_0, \hat V_1, \hat V_2, \ldots$ is a random sequence. Non-incremental updates. 10
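A hedged sketch of how the empirical Bellman operator might be implemented: the conditional expectation is replaced by an average over n next states simulated from the kernel, and the whole value vector is updated at once. This is my reading of the slide, not the authors' code; names and sizes are illustrative.

```python
import numpy as np

def empirical_bellman(V, Q, r, gamma, n, rng):
    """Illustrative empirical Bellman operator: E[V(y) | x, a] is replaced by
    an average over n next states simulated from Q(. | x, a)."""
    X, A = r.shape
    V_hat = np.empty(X)
    for x in range(X):
        vals = np.empty(A)
        for a in range(A):
            ys = rng.choice(X, size=n, p=Q[x, a])      # n simulated next states
            vals[a] = r[x, a] + gamma * V[ys].mean()
        V_hat[x] = vals.max()
    return V_hat

def empirical_value_iteration(Q, r, gamma=0.9, n=5, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    V = np.zeros(r.shape[0])
    for _ in range(iters):
        V = empirical_bellman(V, Q, r, gamma, n, rng)  # random, non-incremental update
    return V

# Usage (random MDP as a stand-in): V_hat approaches V* as n grows.
rng0 = np.random.default_rng(1)
Q0 = rng0.dirichlet(np.ones(50), size=(50, 4)); r0 = rng0.uniform(size=(50, 4))
V_hat = empirical_value_iteration(Q0, r0, n=5)
```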

Questions. 1. What is the behavior of the random sequence $\hat V_0, \hat V_1, \hat V_2, \ldots$? 2. Is there a relevant notion of (probabilistic) fixed point for a random (empirical Bellman) operator? 3. How does it relate to the fixed point of the classical Bellman operator? 4. Can we give a sample complexity bound on n? And how many iterations over k do we need for a reasonable approximation? 11

Do EVI and EPI Converge? Numerical Evidence. 100 states, 5 actions, random MDP. [Plots of relative error $\|v_k - v^*\| / \|v^*\|$ vs. number of iterations: EVI with n = 1 and n = 5 against exact value iteration; EPI with n = m = 1 and n = m = 5 against exact policy iteration.] 12

How do they compare? [Plot of relative error $\|v_k - v^*\| / \|v^*\|$ vs. number of iterations for exact policy iteration, EPI (n = m = 5), EVI (n = 5), exact value iteration, QL (parameter 0.5) and OPI (n = m = 5).] States = 100, Actions = 5, random MDP; offline QL with n = 5 samples/iteration. 13

Actual Runtime. [Plots for |S| = 5000, |A| = 10: simulation time (seconds) vs. number of iterations for EVI, exact VI and QL, and relative error (%) for EVI with n = 1, 5, 10, 20, QL with n = 20, and VI.] EPI: very slow; the LP method: even worse. States = 5000, Actions = 10, random MDP. All simulations run on a MacBook Pro under identical conditions. 14

The Empirical Bellman Operator and its Iterations. Q: Can we prove convergence? Iterating the empirical Bellman operator is like multiplying random matrices together. Q: Will this product converge? 15

Probabilistic Fixed Points of Random Operators. Another way to look at it: is there a (probabilistic) fixed point of the random operator $\hat T$? There are multiple notions of probabilistic fixed points of a sequence of random operators: the strong probabilistic fixed point, the weak probabilistic fixed point, and the classical fixed point. These (asymptotic) notions coincide! 16

Sample Complexity of EVI. Theorem: Given $\epsilon \in (0,1)$ and $\delta \in (0,1)$, select $n \ge C\, \frac{1}{\epsilon^2} \log\frac{2|X||A|}{\delta}$ and $k$ large enough (a threshold depending on $\mu_{n,\min}$ and $\delta$). Then $P(\|\hat V_k - V^*\| \le \epsilon) \ge 1 - \delta$. Sample complexity of EVI: $O\big(\frac{1}{\epsilon^2},\ \log\frac{1}{\delta},\ \log(|X||A|)\big)$. No assumptions on the MDP needed! W. Haskell, R. Jain and D. Kalathil, "Empirical Dynamic Programming", Mathematics of Operations Research, to appear, 2015. 17

And Construct a Dominating Markov Chain. [Figure: the discrete error process $X_n^k$ and a dominating Markov chain $Y_n^k$ on states $\eta^*, \ldots, N^*$ with transition probabilities $p_n$ and $1 - p_n$.] Key idea: construct a Markov chain $\{Y_n^k\}_{k \ge 0}$ that stochastically dominates the error process $\{X_n^k\}_{k \ge 0}$: $P(X_n^k \ge z) \le P(Y_n^k \ge z)$ for all z. $\{Y_n^k\}_{k \ge 0}$ is easier to analyze than $\{X_n^k\}_{k \ge 0}$; show that it converges to zero in probability as $n, k \to \infty$. 18

Outline I. Dynamic Programming II. Empirical Dynamic Programming III. Extensions IV. Empirical Q-Value Iteration 19

Asynchronous EVI. Online EVI: update only one state at a time, $\hat V_{k+1}(x) = [\hat T \hat V_k](x) := \sup_a \{ r(x,a) + \frac{\gamma}{n} \sum_{i=1}^{n} \hat V_k(\Psi(x, a, \omega_i^{k+1})) \}$ for the currently visited state x, leaving the other states unchanged. Theorem: if each state is visited infinitely often, then $\hat V_k \xrightarrow{P} V^*$. The proof relies on defining a new random operator that is the product of the empirical Bellman operators between hitting times. W. Haskell, R. Jain and D. Kalathil, "Empirical Dynamic Programming", Mathematics of Operations Research, to appear, 2015. 20
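An illustrative sketch of the asynchronous/online variant, in which only one state receives the empirical Bellman update per iteration; here the state is drawn uniformly at random as a stand-in for the state actually being visited. This is my construction, not the authors' code.

```python
import numpy as np

def async_evi(Q, r, gamma=0.9, n=10, iters=5000, seed=0):
    """Illustrative asynchronous empirical VI: at each iteration, only one state
    (chosen uniformly at random here) gets the empirical Bellman update."""
    rng = np.random.default_rng(seed)
    X, A = r.shape
    V = np.zeros(X)
    for _ in range(iters):
        x = rng.integers(X)                            # state updated this round
        vals = np.empty(A)
        for a in range(A):
            ys = rng.choice(X, size=n, p=Q[x, a])      # n simulated next states
            vals[a] = r[x, a] + gamma * V[ys].mean()
        V[x] = vals.max()                              # all other states unchanged
    return V
```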

Numerical Performance of Online EVI. [Plots: relative error vs. number of iterations for asynchronous EVI (n = 1, 5, 10) and QL (n = 10, parameter 0.6), updating only 1 state per iteration and updating 10 states per iteration.] States = 500, Actions = 10, random MDP. 21

Outline I. Dynamic Programming II. Empirical Dynamic Programming III. Extensions IV. Empirical Q-Value Iteration 22

Q-Value Iteration. Q-value of a policy $\pi$: $Q^\pi(x,a) = \mathbb{E}\big[\sum_{t=0}^{\infty} \gamma^t r(x_t, a_t = \pi(x_t)) \mid x_0 = x, a_0 = a\big]$. Optimal Q-value: $Q^*(x,a) = \sup_\pi Q^\pi(x,a)$, with $V^*(x) = \max_{a \in A} Q^*(x,a)$ and $\pi^*(x) = \arg\max_{a \in A} Q^*(x,a)$. Q-value operator: $G(Q)(x,a) := r(x,a) + \gamma \sum_y P(y \mid x,a) \max_b Q(y,b)$. The optimal $Q^*$ is the fixed point of G, a contraction. 23
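For concreteness, a minimal sketch (my own, not from the slides) of the exact Q-value operator G and Q-value iteration on a tabular MDP; names and sizes are illustrative.

```python
import numpy as np

def G(Qval, P, r, gamma):
    """Q-value operator: G(Q)(x,a) = r(x,a) + gamma * sum_y P(y|x,a) * max_b Q(y,b)."""
    return r + gamma * P @ Qval.max(axis=1)            # P @ max_b Q(., b) has shape (X, A)

def q_value_iteration(P, r, gamma=0.9, tol=1e-8, max_iter=1000):
    """Iterate G until the sup-norm change is small (G is a contraction)."""
    Qval = np.zeros_like(r)
    for _ in range(max_iter):
        Q_next = G(Qval, P, r, gamma)
        if np.max(np.abs(Q_next - Qval)) < tol:
            return Q_next
        Qval = Q_next
    return Qval
```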

Empirical Q-Value Iteration (EQVI). EQVI: simulation-based Q-value iteration, $\hat Q_{k+1}(x,a) = [\hat G \hat Q_k](x,a) := r(x,a) + \frac{\gamma}{n} \sum_{i=1}^{n} \max_b \hat Q_k(\Psi(x, a, \omega_i^{k+1}), b)$, where the $\omega$'s are i.i.d. noise RVs. $\hat G$ is a random (monotone) operator, and $\hat Q_0, \hat Q_1, \hat Q_2, \ldots$ is a random sequence. Non-incremental updates, vs. QL's incremental stochastic-approximation updates. D. Kalathil, V. Borkar and R. Jain, "Empirical Q-Value Iteration", submitted Nov. 2014. http://arxiv.org/abs/1412.0180 24
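A hedged sketch of the EQVI update as read from the slide: each iteration averages over n simulated next states in place of the expectation inside G, and the whole Q-table is updated non-incrementally. This is illustrative only, not the authors' implementation.

```python
import numpy as np

def empirical_qvi(P, r, gamma=0.9, n=10, iters=200, seed=0):
    """Illustrative empirical Q-value iteration:
    Q_{k+1}(x,a) = r(x,a) + (gamma/n) * sum_i max_b Q_k(y_i, b), y_i ~ P(.|x,a)."""
    rng = np.random.default_rng(seed)
    X, A = r.shape
    Qhat = np.zeros((X, A))
    for _ in range(iters):
        Vmax = Qhat.max(axis=1)                        # max_b Q_k(., b)
        Q_next = np.empty_like(Qhat)
        for x in range(X):
            for a in range(A):
                ys = rng.choice(X, size=n, p=P[x, a])  # n simulated next states
                Q_next[x, a] = r[x, a] + gamma * Vmax[ys].mean()
        Qhat = Q_next                                   # non-incremental: whole table at once
    return Qhat
```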

Numerical Comparison: EQVI vs QL. [Plot (synchronous case): relative error $\|Q_t - Q^*\| / \|Q^*\|$ vs. number of iterations for EQVI with n = 5, 10, 20, exact Q-value iteration and QL with n = 20; |S| = 500, |A| = 10.] Speedup = 10x+. 25

Online EQVI vs QL. [Plot (asynchronous case): relative error $\|Q_t - Q^*\| / \|Q^*\|$ vs. number of iterations for EQVI (n = 20), QL (n = 20) and exact Q-value iteration; |S| = 500, |A| = 10.] An online version of EQVI: speedup = 100x+? It converges in probability under suitable recurrence conditions. 26

Other Extensions. Continuous state space MDPs: state aggregation, i.e., construct an ε-net and perform EVI on the ε-net [Haskell, J. & Sharma (2015)]; function approximation [Szepesvari & Munos 08], kernel-based function approximation, and deep neural network-based function approximation (Deep EVI/EQVI). Average cost: more complicated, but similar numerical performance gains [Gupta, J. & Glynn (2015)]. 27

Conclusions. Empirical Dynamic Programming algorithms: a "natural" way of doing approximate dynamic programming via simulations, through iteration of random operators. The stochastically dominating Markov chain method is a fairly general proof technique. Extensions to online algorithms for model-free settings and to continuous state space MDPs. It doesn't solve all "curses of dimensionality", but shows surprisingly good numerical performance, with a weaker notion of convergence. W. Haskell, R. Jain and D. Kalathil, "Empirical Dynamic Programming", Mathematics of Operations Research, to appear, 2015. http://arxiv.org/abs/1311.5918 D. Kalathil, V. Borkar and R. Jain, "Empirical Q-Value Iteration", submitted Nov. 2014. http://arxiv.org/abs/1412.0180 28

The cleverest thing to do is the simplest one. http://www-bcf.usc.edu/~rahuljai 29