Williams-Baird Counterexample for Q-Factor Asynchronous Policy Iteration


September 2010

Williams-Baird Counterexample for Q-Factor Asynchronous Policy Iteration

Dimitri P. Bertsekas
(The author is with the Dept. of Electr. Engineering and Comp. Science, M.I.T., Cambridge, Mass. 02139. His research was supported by the AFOSR under Grant FA9550-10-1-0412.)

Abstract

A counterexample due to Williams and Baird [WiB93] (Example 2 in their paper) is transcribed here in the context and notation of two papers by Bertsekas and Yu [BeY10a], [BeY10b], and it is also adapted to the case of Q-factor-based policy iteration. The example illustrates that cycling is possible in asynchronous policy iteration if the initial policy and cost/Q-factor iterates do not satisfy a certain monotonicity condition, under which Williams and Baird [WiB93] show convergence. The papers [BeY10a], [BeY10b] show how asynchronous policy iteration can be modified to circumvent the difficulties illustrated in this example. The purpose of the transcription given here is to facilitate the understanding of this theoretically interesting example, thereby providing motivation and illustration of the methods proposed in [BeY10a], [BeY10b]. The example has not been altered in any material way.

1. COUNTEREXAMPLE

We consider the standard asynchronous version of policy iteration for Q-factors in a discounted problem, where the updates of the policy µ and the Q-factors Q are executed selectively, for only some of the states and state-control pairs (Eqs. (3.1) and (3.2) in [BeY10a]). There are two types of iterations: those corresponding to an index subset K_Q where Q is updated, and those corresponding to the complementary subset K_µ where µ is updated. The algorithm generates a sequence of pairs (Q_k, µ_k), starting from an arbitrary pair (Q_0, µ_0), as follows:

    Q_{k+1}(i, u) = { (F_{µ_k} Q_k)(i, u)   if (i, u) ∈ R_k,
                    { Q_k(i, u)             if (i, u) ∉ R_k,          k ∈ K_Q,      (1.1)

    µ_{k+1}(j)    = { arg min_{v ∈ U(j)} Q_k(j, v)   if j ∈ S_k,
                    { µ_k(j)                         if j ∉ S_k,      k ∈ K_µ,      (1.2)

where F_µ is the mapping given by

    (F_µ Q)(i, u) = Σ_{j=1}^n p_{ij}(u) ( g(i, u, j) + α Q(j, µ(j)) ),      for all (i, u),

and R_k and S_k are subsets of state-control pairs and states, respectively.
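To make the selective nature of the updates (1.1)-(1.2) concrete, the following minimal Python sketch (not part of the original note) spells out the two iteration types. It assumes Q is stored as a dictionary over state-control pairs, µ as a dictionary over states, that U[j] lists the controls available at state j, and that F_mu is a callable implementing the mapping F_µ just defined; all of these names are illustrative.

    # Minimal sketch of the asynchronous updates (1.1) and (1.2); all names are illustrative.
    # Q: dict mapping (state, control) -> Q-factor.  mu: dict mapping state -> control.
    # F_mu(Q, mu, i, u): callable returning (F_mu Q)(i, u).  U: dict mapping state -> controls.

    def q_update(Q, mu, R_k, F_mu):
        """Eq. (1.1): for k in K_Q, update Q only on the pairs in R_k; other pairs are unchanged."""
        Q_next = dict(Q)
        for (i, u) in R_k:
            Q_next[(i, u)] = F_mu(Q, mu, i, u)
        return Q_next

    def policy_update(Q, mu, S_k, U):
        """Eq. (1.2): for k in K_mu, update mu only on the states in S_k; other states are unchanged."""
        mu_next = dict(mu)
        for j in S_k:
            mu_next[j] = min(U[j], key=lambda v: Q[(j, v)])
        return mu_next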

The second counterexample of Williams and Baird [WiB93] is adapted to this notation below, showing that this algorithm may cycle and never converge to the optimal Q-factors.

The states are i = 1, 2, 3, 4, 5, 6, arranged in a cycle in clockwise order, and the controls are denoted h (high cost) and l (low cost). The system is deterministic, and transitions are counterclockwise, by either one state or two states. At states i = 1, 3, 5, only control h is available; the transition from i is to state (i − 2) mod 6, at cost g(i, h) = 1. At states i = 2, 4, 6, both controls h and l are available. Under h, the transition from i is to state (i − 1) mod 6, at cost g(i, h) = 1. Under l, the transition from i is to state (i − 2) mod 6, at cost g(i, l) = 3. The example works only when the discount factor α satisfies α > (√5 − 1)/2, e.g., α = 0.9.

A sequence of operations of the form (1.1) and (1.2) is exhibited that regenerates the initial conditions. This sequence consists of three iteration blocks. At the end of each block, the values Q(i, µ(i)) have been shifted clockwise by two states, i.e., they are equal to the values of Q((i − 2) mod 6, µ((i − 2) mod 6)) prevailing at the beginning of the block. Thus after three blocks the original values Q(i, µ(i)) are repeated for all i. A peculiarity of this example is that the values Q(i, µ(i)) at the end of an iteration block depend only on the initial values Q(i, µ(i)) at the beginning of the block, so we do not have to specify the initial µ and Q(i, u) for u ≠ µ(i). This fact also allows the adaptation of the example to the case where costs rather than Q-factors are iterated on (cf. [BeY10b]).

The initial values Q(i, µ(i)) at the beginning of the 1st iteration block are

    Q(i, µ(i)) =    if i = 1,    if i = 2,    +α  if i = 3,    3  if i = 4,    +α  if i = 5,    if i = 6.      (1.3)

Define an I-step at node i to be an update of all the Q-factors of node i via Eq. (1.1), followed by an update of µ(i) via Eq. (1.2) (this is like a full value iteration/policy improvement). Define an E-step at pair (i, u) to be an update of Q(i, u) based on Eq. (1.1) (this is like a Q-factor policy evaluation based on a single iteration).
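The I-steps and E-steps just defined can be written out as a short Python sketch for a deterministic model such as the six-state cycle above. The dictionaries succ, cost, and controls, and the discount alpha, are placeholders for the model data (they are not filled in with the values of [WiB93] here); the functions only mirror the definitions of the I-step and the E-step.

    # Sketch of an I-step and an E-step for a deterministic model; the model data are placeholders.
    # succ[(i, u)]: successor state of i under control u.   cost[(i, u)]: transition cost g(i, u).
    # controls[i]: controls available at state i.           alpha: discount factor (e.g., 0.9).

    def F_mu(Q, mu, i, u, succ, cost, alpha):
        """(F_mu Q)(i, u) when the transition from i under u leads to the single state succ[(i, u)]."""
        j = succ[(i, u)]
        return cost[(i, u)] + alpha * Q[(j, mu[j])]

    def I_step(Q, mu, i, controls, succ, cost, alpha):
        """Update all Q-factors of state i via Eq. (1.1), then update mu(i) via Eq. (1.2)."""
        new_values = {u: F_mu(Q, mu, i, u, succ, cost, alpha) for u in controls[i]}
        for u, value in new_values.items():
            Q[(i, u)] = value
        mu[i] = min(controls[i], key=lambda u: Q[(i, u)])

    def E_step(Q, mu, i, succ, cost, alpha):
        """Update only the Q-factor Q(i, mu(i)) via Eq. (1.1); the policy mu is left unchanged."""
        Q[(i, mu[i])] = F_mu(Q, mu, i, mu[i], succ, cost, alpha)

The iteration blocks used in the counterexample below are simply schedules of such I-steps and E-steps.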

The iteration block consists of the following:

1) I-step at i = 6.
2) I-step at i = 4.
3) I-step at i = 3, or equivalently an E-step at (i, u) = (3, h) (since only h is available at i = 3).
4) I-step at i = 1, or equivalently an E-step at (i, u) = (1, h) (since only h is available at i = 1).
5) E-step at (i, µ(i)), where i = 4.

Note that steps 1)-4) are ordinary value iterations where all Q-factors of i are updated. The 5th step is a modified policy iteration step and causes the difficulty (an algorithm that consists of just I-steps is convergent, as it is just asynchronous value iteration). It can be verified that the values Q(i, µ(i)) at the end of an iteration block depend only on the initial values Q(i, µ(i)) at the beginning of the block. The reason is that the results of the I-steps depend only on the prevailing values Q(i, µ(i)), and the last E-step depends only on the prevailing value of µ(4) (which is set at Step 2). It is calculated in [WiB93] (p. 19) that at the end of the iteration block the values Q(i, µ(i)) are

    Q(i, µ(i)) =    +α  if i = 1,    if i = 2,    if i = 3,    if i = 4,    +α  if i = 5,    3  if i = 6,

so the values are shifted clockwise by two states relative to the beginning of the iteration block. For example (allowing for differences in notation and context), [WiB93] calculate the results of the iterations as follows:

1) I-step at i = 6. Sets Q(6, h) = + α 1+α = +α, Q(6, l) = 3 + α 3 = 3, and µ(6) = l.
2) I-step at i = 4. Sets Q(4, h) = + α 1+α, and µ(4) = h (because α > (√5 − 1)/2).
3) I-step at i = 3. Sets Q(3, h) = + α 1.
4) I-step at i = 1. Sets Q(1, h) = + α 3 = +α.
5) E-step at (4, µ(4)). Here µ(4) = h because of Step 2, so Q(4, µ(4)) = 1 + α Q(3, µ(3)) = + α 1.

Note that only the initial values of Q(i, µ(i)), as given in Eq. (1.3), matter in these calculations.
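In code, an iteration block is just a schedule of I-steps and E-steps applied in the listed order. The sketch below, which reuses the I_step and E_step functions from the earlier sketch together with the same placeholder model data, encodes the first block; whether cycling actually occurs of course depends on the initial values (1.3).

    # Sketch of how the first iteration block would be driven; reuses I_step and E_step above.
    # The model data (controls, succ, cost, alpha) and the initial pair (Q, mu) are placeholders.

    BLOCK_1 = [("I", 6), ("I", 4), ("I", 3), ("I", 1), ("E", 4)]

    def run_block(block, Q, mu, controls, succ, cost, alpha):
        """Apply a block of I-steps and E-steps in the listed order, updating Q and mu in place."""
        for kind, i in block:
            if kind == "I":
                I_step(Q, mu, i, controls, succ, cost, alpha)
            else:  # E-step at (i, mu(i))
                E_step(Q, mu, i, succ, cost, alpha)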

The second iteration block consists of the following:

1) I-step at i = 2.
2) I-step at i = 6.
3) I-step at i = 5, or equivalently an E-step at (i, u) = (5, h).
4) I-step at i = 3, or equivalently an E-step at (i, u) = (3, h).
5) E-step at (i, µ(i)), where i = 6.

The third iteration block consists of the following:

1) I-step at i = 4.
2) I-step at i = 2.
3) I-step at i = 1, or equivalently an E-step at (i, u) = (1, h).
4) I-step at i = 5, or equivalently an E-step at (i, u) = (5, h).
5) E-step at (i, µ(i)), where i = 2.

At the end of this third block the initial conditions Q(i, µ(i)) should be recovered. The difference between the above calculations and the ones of [WiB93] is just notation: we assume minimization instead of maximization, hence a change of signs in the calculations. Moreover, in [BeY10a] we consider asynchronous modified policy iteration for Q-factors, while [WiB93] consider asynchronous modified policy iteration for costs, with J(i) in place of Q(i, µ(i)). With this notational change we also obtain a counterexample for the convergence of algorithm (1.5) of the paper [BeY10b].

Our remedy in the papers [BeY10a], [BeY10b] is to modify the dangerous E-steps so that an oscillation of the type seen in this example is prevented [e.g., by using min{J(j), Q(j, µ(j))} in place of Q(j, µ(j))]; in our Q-learning paper we introduce a mechanism for guarding against dangerously large increases of Q(i, u).
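As a rough illustration of the kind of safeguard meant here (a sketch only, not the actual algorithms of [BeY10a], [BeY10b]), the E-step can be modified to back up min{J(j), Q(j, µ(j))}, where J is a cost vector maintained alongside Q; how J itself is updated is part of those algorithms and is not shown.

    # Sketch of a safeguarded E-step; J is a cost vector maintained alongside Q (its update is not shown).
    # Backing up min(J[j], Q[(j, mu[j])]) is meant to guard against the large increases that drive the cycling.

    def guarded_E_step(Q, J, mu, i, succ, cost, alpha):
        """E-step at (i, mu(i)) that uses min{J(j), Q(j, mu(j))} in place of Q(j, mu(j))."""
        u = mu[i]
        j = succ[(i, u)]
        Q[(i, u)] = cost[(i, u)] + alpha * min(J[j], Q[(j, mu[j])])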

2. REFERENCES

[BeY10a] Bertsekas, D. P., and Yu, H., 2010. "Q-Learning and Enhanced Policy Iteration in Discounted Dynamic Programming," Lab. for Information and Decision Systems Report LIDS-P-2831, MIT, April 2010 (revised October 2010).

[BeY10b] Bertsekas, D. P., and Yu, H., 2010. "Distributed Asynchronous Policy Iteration in Dynamic Programming," Proc. of the 2010 Allerton Conference on Communication, Control, and Computing, Allerton Park, Ill.

[WiB93] Williams, R. J., and Baird, L. C., 1993. "Analysis of Some Incremental Variants of Policy Iteration: First Steps Toward Understanding Actor-Critic Learning Systems," Report NU-CCS-93-11, College of Computer Science, Northeastern University, Boston, MA.