
IEOR 8100: Reinforcement learning
By Shipra Agrawal
Lecture 4: Approximate dynamic programming

Deep Q Networks, discussed in the last lecture, are an instance of approximate dynamic programming. These are iterative algorithms that try to find a fixed point of the Bellman equations while approximating the value function / Q-function by a parametric function, for scalability when the state space is large. Similar to Q-learning, function approximation versions of TD-learning and value iteration can be obtained. Here, we will examine some versions of these algorithms in terms of their convergence properties.

1 TD(0) and TD(1)

In TD-learning we are only interested in evaluating a given policy $\pi$; you can think of this as a special case of Q-learning on an MDP with only one action available in every state. Therefore, we are interested in computing the value function $V^\pi(s)$ for all $s$. In the function approximation version, we learn a parametric approximation $\tilde{V}_\theta(s)$. For example, $\tilde{V}_\theta(s)$ could simply be a linear function of $\theta$ and features,
$$\tilde{V}_\theta(s) = \theta_0 f_0(s) + \theta_1 f_1(s) + \cdots + \theta_n f_n(s),$$
or the output of a deep neural net. The parameter $\theta$ maps the feature vector $f(s)$ of any state $s$ to its value, thus providing a compact representation. Instead of learning the $|S|$-dimensional value vector, the TD-learning algorithm learns the parameter $\theta$.

Here is a least-squares-approximation-based modification (similar to Q-learning with function approximation) of TD(0) and TD(1), respectively.

Algorithm 1: Approximate TD(0) method for policy evaluation
1: Initialization: Given a starting state distribution $D_0$ and a policy $\pi$, the method evaluates $V^\pi(s)$ for all states $s$. Initialize $\theta$. Initialize episode counter $e = 0$.
2: repeat
3:   $e = e + 1$.
4:   Set $t = 1$, $s_1 \sim D_0$. Choose step sizes $\alpha_1, \alpha_2, \ldots$.
5:   Perform TD(0) updates over an episode:
6:   repeat
7:     Take action $a_t \sim \pi(s_t)$. Observe reward $r_t$ and new state $s_{t+1}$.
8:     (*) Let $\delta_t := r_t + \gamma \tilde{V}_\theta(s_{t+1}) - \tilde{V}_\theta(s_t)$.
9:     (*) Update $\theta \leftarrow \theta + \alpha_t \delta_t \nabla_\theta \tilde{V}_\theta(s_t)$.
10:    $t = t + 1$.
11:  until episode terminates
12: until termination condition

Recall that in TD(1) the target is computed using an infinite lookahead, $\mathrm{target}(s_t) = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots =: R_t$. Effectively, the only difference is to replace Step 8 in TD(0) by:

TD(1) algorithm, Step 8: Let $\delta_t := R_t - \tilde{V}_\theta(s_t)$.

In the tabular case, both these algorithms converge to the true value function (they are both instances of the stochastic approximation method); we saw that, depending on the MDP, one may be more efficient than the other. However, these tradeoffs change when we use function approximation. Below is an example where these methods do converge, but sometimes to a function approximation that is far from the true value function! The example shows that this can happen even if there exists some value of $\theta$ for which the function approximation $\tilde{V}_\theta$ is close to $V^\pi$.
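To make Algorithm 1 concrete before moving to the counter-example, here is a minimal Python sketch of approximate TD(0) with a linear approximation $\tilde{V}_\theta(s) = \theta^\top f(s)$, so that $\nabla_\theta \tilde{V}_\theta(s) = f(s)$. The environment interface (env.reset(), env.step(a) returning (reward, next_state, done)), the policy function, and the feature map are illustrative assumptions, not part of these notes.

import numpy as np

def approximate_td0(env, policy, features, n_features, num_episodes=1000, gamma=0.99):
    # Approximate TD(0) for policy evaluation with a linear approximation
    # V_theta(s) = theta . f(s), so grad_theta V_theta(s) = f(s).
    theta = np.zeros(n_features)
    for episode in range(num_episodes):
        s = env.reset()                          # s_1 ~ D_0
        alpha = 1.0 / (episode + 1)              # one simple choice of step size
        done = False
        while not done:
            a = policy(s)                        # a_t ~ pi(s_t)
            r, s_next, done = env.step(a)        # observe r_t and s_{t+1}
            v_s = theta @ features(s)
            v_next = 0.0 if done else theta @ features(s_next)
            delta = r + gamma * v_next - v_s     # TD error delta_t (Step 8)
            theta = theta + alpha * delta * features(s)   # Step 9
            s = s_next
    return theta

For TD(1), one would instead accumulate the discounted return $R_t$ over the remainder of the episode and use delta = R_t - theta @ features(s), updating at the end of the episode.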

Counter-example. TODO: Example covered in class, from "A Counterexample to Temporal Differences Learning", Bertsekas 1995. This example also applies to Q-learning.

2 Fitted value iteration

Let's take a step back and consider an easier setting: the problem of solving a known MDP, rather than reinforcement learning. That is, we know the MDP model, or can simulate it in any state and action. If the state space were small, we could use value iteration to compute the optimal policy. To handle a large state space, we instead consider approximate value iteration, also known as fitted value iteration.

Recall that the tabular value iteration algorithm starts with an initial value vector $v_0$ and, in every iteration $k$, simply performs the update $v_k = L v_{k-1}$ (see (1)).

Tabular value iteration (recall).
1. Start with an arbitrary initialization $v_0$. Specify $\epsilon > 0$.
2. Repeat for $k = 1, 2, \ldots$ until $\|v_k - v_{k-1}\| \le \frac{\epsilon(1-\gamma)}{2\gamma}$: for every $s \in S$, improve the value vector as
$$v_k(s) = [L v_{k-1}](s) := \max_{a \in A} R(s, a) + \gamma \sum_{s'} P(s, a, s')\, v_{k-1}(s'). \qquad (1)$$

In fitted value iteration, we replace this with a scalable update by fitting a compact representation $\tilde{V}_\theta$ to $v_k$ in every iteration.

Fitted value iteration.
1. Start with an arbitrary initialization $\theta_0$.
2. Repeat for $k = 1, 2, 3, \ldots$: Evaluate $[L \tilde{V}_{\theta_{k-1}}](s)$ on a subset of states $s \in S_0$. Use some regression technique to find a $\theta$ that fits the data $(\tilde{V}_\theta(s), [L \tilde{V}_{\theta_{k-1}}](s))_{s \in S_0}$. Set this $\theta$ as $\theta_k$.

Example 1: Sampling + least-squares fitting [Munos and Szepesvári, 2008]. (This assumes the number of actions is small and the number of states is large; the MDP model is large, but it can be simulated for any arbitrary state $s$ and action $a$.) In iteration $k$:
1. Sample states $X_1, X_2, \ldots, X_N$ from the state space $S$, using some distribution $\mu$.
2. For each action $a$ and state $X_i$, take multiple samples of the next state and reward, $\{Y_{i,a,j}, R_{i,a,j}\}_{j=1,\ldots,M}$.
3. Approximate $[L \tilde{V}_{\theta_{k-1}}](s)$ for $s \in \{X_1, X_2, \ldots, X_N\}$ as
$$\mathrm{target}_i = \max_{a \in A} \frac{1}{M} \sum_{j=1}^{M} \left( R_{i,a,j} + \gamma \tilde{V}_{\theta_{k-1}}(Y_{i,a,j}) \right), \quad i = 1, \ldots, N.$$
4. Use least squares to find the new $\theta$:
$$\theta_k := \arg\min_\theta \frac{1}{N} \sum_{i=1}^{N} \left( \tilde{V}_\theta(X_i) - \mathrm{target}_i \right)^2.$$

Interestingly, it matters which regression technique is used. The example below shows that least-squares minimization does not always work!
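Before the counter-example, here is a short Python sketch of the sampling + least-squares scheme of Example 1. The simulator interface simulate(s, a) -> (next_state, reward), the state sampler sample_states(N), and the feature map are illustrative assumptions; the notes only require that the MDP can be simulated at arbitrary (s, a).

import numpy as np

def fitted_value_iteration(simulate, sample_states, features, actions, n_features,
                           K=50, N=200, M=10, gamma=0.95):
    # Fitted value iteration with a linear approximation V_theta(s) = theta . f(s).
    theta = np.zeros(n_features)
    for k in range(K):
        X = sample_states(N)                                # step 1: X_1,...,X_N ~ mu
        Phi = np.array([features(x) for x in X])            # N x n_features design matrix
        targets = np.empty(N)
        for i, x in enumerate(X):
            backups = []
            for a in actions:                               # step 2: M samples per (state, action)
                samples = [simulate(x, a) for _ in range(M)]
                backups.append(np.mean([r + gamma * (theta @ features(y))
                                        for (y, r) in samples]))
            targets[i] = max(backups)                       # step 3: empirical Bellman backup
        # step 4: least-squares fit of theta . f(X_i) to the targets
        theta, *_ = np.linalg.lstsq(Phi, targets, rcond=None)
    return theta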

2.1 Counter-example for least-squares regression [Tsitsiklis and Van Roy, 1996]

Consider an MDP with two states $x_1, x_2$ and 1-dimensional features for the two states, $f_{x_1} = 1$ and $f_{x_2} = 2$, with the linear function approximation $\tilde{V}_\theta(x) = \theta f_x$. Both states transition to $x_2$ with zero reward, so the backed-up target for each state is $\gamma \tilde{V}_{\theta_k}(x_2) = 2\gamma\theta_k$, and the least-squares update is
$$\theta_{k+1} := \arg\min_\theta \, (\theta - \mathrm{target}_1)^2 + (2\theta - \mathrm{target}_2)^2
= \arg\min_\theta \, (\theta - \gamma\theta_k f_{x_2})^2 + (2\theta - \gamma\theta_k f_{x_2})^2
= \arg\min_\theta \, (\theta - 2\gamma\theta_k)^2 + (2\theta - 2\gamma\theta_k)^2.$$
Setting the derivative to zero,
$$(\theta - 2\gamma\theta_k) + 2(2\theta - 2\gamma\theta_k) = 0 \;\Rightarrow\; 5\theta = 6\gamma\theta_k \;\Rightarrow\; \theta_{k+1} = \tfrac{6}{5}\gamma\theta_k.$$
This recursion diverges if $\gamma > 5/6$, even though the true value function (identically zero) is exactly representable with $\theta = 0$.
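A small numerical check of this recursion, as a sketch: iterating the closed-form least-squares update derived above shrinks $\theta$ toward the correct value $0$ when $\gamma < 5/6$ and blows up when $\gamma > 5/6$.

import numpy as np

def fitted_vi_two_state(gamma, theta0=1.0, iters=30):
    f = np.array([1.0, 2.0])                          # features f_{x1} = 1, f_{x2} = 2
    theta = theta0
    for _ in range(iters):
        targets = gamma * theta * f[1] * np.ones(2)   # both targets equal 2*gamma*theta_k
        theta = (f @ targets) / (f @ f)               # least-squares fit: = (6/5)*gamma*theta_k
    return theta

print(fitted_vi_two_state(0.8))   # factor 0.96 per iteration: shrinks toward 0
print(fitted_vi_two_state(0.9))   # factor 1.08 per iteration: grows without bound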

2.2 Convergence of non-expansive approximations

Operator view of fitted value iteration. A more general way to interpret fitted value iteration is that we have an operator $M_A$ that takes a value vector and projects it onto the function space formed by functions of the form $\tilde{V}_\theta$:
1. Start with an arbitrary initialization $V_0$, $\tilde{V}_{\theta_0} := M_A(V_0)$.
2. Repeat for $i = 1, 2, 3, \ldots$: $\tilde{V}_{\theta_i} = M_A L \tilde{V}_{\theta_{i-1}}$.

To match the description earlier, consider the operator $M_A$ defined as follows: fit a $\tilde{V}_\theta$ to $L\tilde{V}_{\theta_{i-1}}$ by comparing their values on a subset $S_0$ of states, using a regression technique, and then return this $\tilde{V}_\theta$ as the new function $\tilde{V}_{\theta_i}$ in the output space of $M_A$. Thus, $M_A$ is effectively an approximation operator. Equivalently:
1. Start with an arbitrary initialization $v_0$.
2. Repeat for $i = 1, 2, 3, \ldots$: $v_i = (L M_A) v_{i-1}$.
(In an efficient implementation, $u_i = M_A v_i$ probably has a more compact representation, so the first view may be better for implementation.)

This view lets us see fitted value iteration as just value iteration with a different operator: $v_i = L v_{i-1}$ is replaced by $\tilde{V}_i = (M_A L) \tilde{V}_{i-1}$. Therefore, as long as the new operator $(M_A L)$ is also a $\gamma$-contraction, the results proven earlier for the convergence of value iteration will hold. A sufficient condition is that the operator $M_A$ is non-expansive:
$$\|M_A v - M_A v'\| \le \|v - v'\|.$$
Then
$$\|(M_A L)v - (M_A L)v'\| \le \|Lv - Lv'\| \le \gamma \|v - v'\|.$$
Similarly, in the other view, $v_i = L v_{i-1}$ is replaced by $v_i = (L M_A) v_{i-1}$, which is also a $\gamma$-contraction; so $v_i$ converges.

Sufficient condition for non-expansiveness: averagers [Gordon, 1995].

Lemma 1. The operator $M_A$ is non-expansive if it is an averager, that is,
$$[M_A v](s) = \sum_{i \in S} w_{i,s}\, v(i) + w_{0,s}, \qquad \text{where } \sum_{i} w_{i,s} + w_{0,s} = 1,\; w_{i,s} \ge 0 \;\forall i,$$
or in matrix form, $M_A v = W v + w_0$.

Proof. $M_A$ is non-expansive because
$$\|M_A v - M_A v'\|_\infty = \max_s \Big| \sum_{i \in S} w_{i,s}\,\big(v(i) - v'(i)\big) \Big| \le \max_s \max_i |v(i) - v'(i)| = \|v - v'\|_\infty.$$

2.3 Converges to what?

Fitted value iteration converges exponentially fast under the above condition, but to what? Let $v_i$ converge to the fixed point $U^*$, that is, $U^* = L(M_A(U^*))$. We show that, under the contraction properties, $U^*$ is as close as it can be to $V^*$ (the optimal value function), in the sense that
$$\|U^* - V^*\| \le \frac{\gamma}{1-\gamma} \|M_A V^* - V^*\|.$$
The proof is as follows:
$$\|U^* - V^*\| = \|L(M_A(U^*)) - L V^*\| \le \gamma \|M_A U^* - V^*\|,$$
$$\|M_A U^* - V^*\| \le \|M_A U^* - M_A V^*\| + \|M_A V^* - V^*\| \le \|U^* - V^*\| + \|M_A V^* - V^*\|.$$
Combining the two,
$$\|U^* - V^*\| \le \gamma \|U^* - V^*\| + \gamma \|M_A V^* - V^*\| \quad\Longrightarrow\quad \|U^* - V^*\| \le \frac{\gamma}{1-\gamma} \|M_A V^* - V^*\|.$$

2.4 What about least squares?

TODO: PAC bounds in [Munos and Szepesvári, 2008]. (Book chapter reference: Chapter 3 of Szepesvári [2010].)
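As a numerical sanity check of the bound in Section 2.3, the following sketch runs fitted value iteration with a simple averager $M_A$ (state aggregation, i.e., within-cluster averaging) on a small randomly generated MDP and verifies $\|U^* - V^*\|_\infty \le \frac{\gamma}{1-\gamma}\|M_A V^* - V^*\|_\infty$. The random MDP and the clustering are made-up illustrations, not from the notes.

import numpy as np

rng = np.random.default_rng(0)
S, A, gamma = 6, 2, 0.9
P = rng.random((S, A, S))
P /= P.sum(axis=2, keepdims=True)                 # random transition kernel P(s, a, s')
R = rng.random((S, A))                            # random rewards R(s, a)
clusters = np.array([0, 0, 1, 1, 2, 2])           # state aggregation used by the averager M_A

def L(v):                                         # Bellman optimality operator
    return (R + gamma * P @ v).max(axis=1)

def M_A(v):                                       # averager: replace values by within-cluster means
    out = np.empty_like(v)
    for c in np.unique(clusters):
        out[clusters == c] = v[clusters == c].mean()
    return out

v_star = np.zeros(S)
for _ in range(2000):
    v_star = L(v_star)                            # value iteration: (approximately) V*

u_star = np.zeros(S)
for _ in range(2000):
    u_star = L(M_A(u_star))                       # fitted VI fixed point U* = L(M_A(U*))

lhs = np.abs(u_star - v_star).max()
rhs = gamma / (1 - gamma) * np.abs(M_A(v_star) - v_star).max()
print(lhs <= rhs + 1e-9, lhs, rhs)                # the bound of Section 2.3 should hold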

3 Discussion: what next?

So far, the methods we have seen are based on the value function approach, where all the function approximation effort went into scalable estimation of the value function. The action selection policy was represented implicitly as the greedy policy with respect to the estimated value functions. This approach has several limitations. First, it is oriented toward finding deterministic policies, whereas the optimal policy is often stochastic, selecting different actions with specific probabilities. Second, an arbitrarily small change in the estimated value of an action can cause it to be, or not be, selected. Such discontinuous changes have been identified as a key obstacle to establishing convergence assurances for algorithms following the value-function approach (Bertsekas and Tsitsiklis, 1996). For example, Q-learning, Sarsa, and dynamic programming methods have all been shown unable to converge to any policy for simple MDPs and simple function approximators (Gordon, 1995, 1996; Baird, 1995; Tsitsiklis and Van Roy, 1996; Bertsekas and Tsitsiklis, 1996). This can occur even if the best approximation is found at each step before changing the policy, and whether the notion of "best" is in the mean-squared-error sense or the slightly different senses of residual-gradient, temporal-difference, and dynamic-programming methods.

Next, we will look at policy gradient methods, which use function approximation to approximate the optimal policy directly and thereby eliminate some of the above limitations.

References

Geoffrey J. Gordon. Stable function approximation in dynamic programming. In Machine Learning Proceedings 1995, pages 261-268. Morgan Kaufmann, San Francisco, CA, 1995.

Rémi Munos and Csaba Szepesvári. Finite-time bounds for fitted value iteration. Journal of Machine Learning Research, 9:815-857, June 2008. ISSN 1532-4435. URL http://dl.acm.org/citation.cfm?id=39068.390708.

Csaba Szepesvári. Algorithms for Reinforcement Learning. Morgan & Claypool Publishers, 2010. URL http://old.sztaki.hu/~szcsaba/papers/rlalgsinmdps-lecture.pdf.

John N. Tsitsiklis and Benjamin Van Roy. Feature-based methods for large scale dynamic programming. Machine Learning, 22(1):59-94, March 1996.