Q-Learning and Stochastic Approximation

MS&E338 Reinforcement Learning, Lecture 4 (April 11, 2018)
Lecturer: Ben Van Roy. Scribes: Christopher Lazarus, Javier Sagastuy.

In this lecture we study the convergence of Q-learning updates via stochastic approximation results.

1 Q-learning update

Assume we are given data samples of the form $(s_k, a_k, r_{k+1}, s_{k+1})$, $k = 0, 1, 2, \ldots$, and that we iteratively apply the following update to the state-action value function:

$$Q_{k+1}(s, a) = \begin{cases} (1 - \gamma_k)\, Q_k(s, a) + \gamma_k \left( r_{k+1} + \max_{a' \in \mathcal{A}} Q_k(s_{k+1}, a') \right) & \text{if } (s, a) = (s_k, a_k), \\ Q_k(s, a) & \text{otherwise.} \end{cases}$$

Proposition 1. If each $(s, a) \in \mathcal{S} \times \mathcal{A}$ is sampled infinitely often and the step sizes $\{\gamma_k\}$ are deterministic and satisfy, for every $(s, a)$,

$$\sum_{k : (s_k, a_k) = (s, a)} \gamma_k = \infty, \qquad \sum_{k : (s_k, a_k) = (s, a)} \gamma_k^2 < \infty,$$

then $Q_k \to Q^*$.

Originally, in class, there were a couple of issues with the formulation of the above proposition. We want each state-action pair to be sampled infinitely often, but if an adversary decides when each pair is sampled, it can game step-size conditions that are stated over all times $k$ rather than over the times at which a particular pair is updated. For example:

- the adversary picks a particular state-action pair only at times when $\gamma_k$ is essentially zero;
- the adversary spaces the updates of a pair further and further apart, so that along that pair's subsequence the step sizes look as though they are plummeting quickly.

This is why the sums in Proposition 1 are taken over the times at which each individual pair is updated.

We can rewrite the update at the sampled pair as

$$Q_{k+1}(s_k, a_k) = Q_k(s_k, a_k) + \gamma_k \underbrace{\left( r_{k+1} + \max_{a' \in \mathcal{A}} Q_k(s_{k+1}, a') - Q_k(s_k, a_k) \right)}_{\text{temporal difference}}.$$

A temporal difference is a difference between predictions made in consecutive time periods. It is composed of two parts: the current value of the state-action value function and the value proposed by the greedy one-step update to the Q function. The difference tells us whether we over- or underestimated the previous value of the Q function.

Note that the Q-learning update can be understood as an asynchronous stochastic approximation version of value iteration [1]. Thus, in the following section we study stochastic approximation as a way to understand the convergence properties of Q-learning.
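To make the update concrete, here is a minimal tabular sketch in Python. It is not from the lecture: the toy MDP (`P`, `R`), the uniformly random behavior policy, and the explicit discount factor `discount` (which the notes leave implicit) are made-up placeholders. The per-pair step size $\gamma_k = 1/(\text{number of visits to } (s_k, a_k))$ satisfies the conditions of Proposition 1 along each pair's subsequence of update times.

```python
import numpy as np

# Made-up toy MDP: 5 states, 2 actions, random transition kernel and rewards.
rng = np.random.default_rng(0)
n_states, n_actions, discount = 5, 2, 0.9
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a] = next-state distribution
R = rng.uniform(0.0, 1.0, size=(n_states, n_actions))             # deterministic reward r(s, a)

Q = np.zeros((n_states, n_actions))
visits = np.zeros((n_states, n_actions))

s = 0
for k in range(200_000):
    a = rng.integers(n_actions)            # uniform behavior: every (s, a) is sampled infinitely often
    s_next = rng.choice(n_states, p=P[s, a])
    r = R[s, a]
    visits[s, a] += 1
    gamma_k = 1.0 / visits[s, a]           # harmonic step sizes: sum = infinity, sum of squares < infinity
    # Q-learning update: only the sampled pair (s_k, a_k) changes.
    Q[s, a] += gamma_k * (r + discount * Q[s_next].max() - Q[s, a])
    s = s_next

print(Q)  # approximate state-action value function
```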

2 Stochastic Approximation

We are not going to build a comprehensive foundation for proving stochastic approximation results; we will go over the intuition for how the formal analysis works. The basic stochastic approximation iteration is

$$x_{k+1} = x_k + \gamma_k\, s(x_k, w_k),$$

where $w_k$ is a random disturbance. Let us assume that $\{w_k\}$ is ergodic, and in particular that it has a stationary distribution, so that we can talk about expectations under that distribution. For $w$ drawn from the steady-state distribution, let

$$\bar{s}(x) = \mathbb{E}\left[ s(x, w) \right].$$

Under various technical conditions, when $\gamma_k$ is small the sequence approximates an ODE of the form

$$\dot{x}_t = \bar{s}(x_t).$$

The idea is to relate the discrete sequence to a continuous one: the iterate $x_k$ (discrete) is matched with the ODE state $x_{t_k}$ (continuous) at time $t_k = \sum_{i=0}^{k-1} \gamma_i$. The step sizes $\gamma_k$ can be thought of like the $dt$ in an ODE:

$$x_{t+dt} - x_t = \bar{s}(x_t)\, dt, \qquad \text{i.e.,} \qquad x_{t+dt} = x_t + dt\, \bar{s}(x_t).$$

If $\sum_k \gamma_k = \infty$ and $\sum_k \gamma_k^2 < \infty$, then the update rule converges: as $k$ increases, the iterates follow the track of the ODE more and more closely.

Let us look at a simple case: the sample mean of i.i.d. disturbances $w_k$ with $\mathbb{E}[|w|] < \infty$,

$$x_k = \frac{1}{k} \sum_{i=0}^{k-1} w_i.$$

Then

$$x_{k+1} = \frac{1}{k+1} \sum_{i=0}^{k} w_i = \frac{1}{k+1} w_k + \frac{k}{k+1} x_k = (1 - \gamma_k)\, x_k + \gamma_k w_k \qquad \text{if we let } \gamma_k = \frac{1}{k+1}.$$

By the law of large numbers, $x_k \to \mathbb{E}[w_0]$. In fact, as long as $\sum_k \gamma_k = \infty$ and $\sum_k \gamma_k^2 < \infty$, we get $x_k \to \mathbb{E}[w_0]$.
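As a quick sanity check (not part of the notes), the running-mean recursion can be simulated directly; with $\gamma_k = 1/(k+1)$ the iterate is exactly the sample mean. The noise distribution below (normal with mean 3) is a made-up placeholder.

```python
import numpy as np

rng = np.random.default_rng(1)
x = 0.0
for k in range(100_000):
    w = rng.normal(loc=3.0, scale=2.0)      # i.i.d. noise with E[w] = 3
    gamma_k = 1.0 / (k + 1)                 # sum gamma = infinity, sum gamma^2 < infinity
    x = (1.0 - gamma_k) * x + gamma_k * w   # stochastic approximation form of the running mean

print(x)  # close to 3.0, the mean of w, by the law of large numbers
```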

2.1 Understanding the intuition behind the ODE approximation

We now reason about the same kind of iteration in a continuous-time setting. Let $h$ be a small increment in the continuous time domain and write

$$x_{t+h} = x_t + h\, s(x_t, w_t).$$

Suppose $N$ is large but $Nh \ll 1$. Then

$$x_{t+Nh} = x_t + h \sum_{n=0}^{N-1} s(x_{t+nh}, w_{t+nh}) \qquad \text{(applying the update $N$ times, by induction)}$$

$$\approx x_t + h \sum_{n=0}^{N-1} s(x_t, w_{t+nh}) \qquad \text{($x_t$ does not change much, since $Nh$ is small)}$$

$$= x_t + Nh\, \frac{1}{N} \sum_{n=0}^{N-1} s(x_t, w_{t+nh}) \qquad \text{(multiplying and dividing by $N$)}$$

$$= x_t + dt \left( \bar{s}(x_t) + O\!\left( \tfrac{1}{\sqrt{N}} \right) \right),$$

where $dt = Nh$ and, in the last step, the $O(1/\sqrt{N})$ term comes from the fact that the variance of the mean of $N$ i.i.d. samples is on the order of $1/N$.
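To see the ODE approximation numerically, here is a small sketch (again not from the lecture) with the made-up choice $s(x, w) = 1 - x + w$, so that $\bar{s}(x) = 1 - x$ and the limiting ODE $\dot{x}_t = 1 - x_t$ has the closed-form solution $x(t) = 1 + (x_0 - 1) e^{-t}$. The iterate is compared to the ODE solution evaluated at the accumulated time $t_k = \sum_{i < k} \gamma_i$.

```python
import numpy as np

rng = np.random.default_rng(2)
x, t = 5.0, 0.0                    # iterate x_k and accumulated "ODE time" t_k = sum of step sizes
for k in range(50_000):
    gamma_k = 1.0 / (k + 2)
    w = rng.normal()
    x = x + gamma_k * ((1.0 - x) + w)   # s(x, w) = 1 - x + w, so sbar(x) = 1 - x
    t += gamma_k

ode = 1.0 + (5.0 - 1.0) * np.exp(-t)    # solution of dx/dt = 1 - x started from x(0) = 5
print(x, ode)  # the iterate tracks the ODE path; both approach the equilibrium x* = 1
```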

To make such arguments rigorous, we will leverage a basic martingale convergence theorem and use it to prove a prototypical stochastic approximation result.

2.2 A prototypical example of a stochastic approximation result

Proposition 2. Suppose the $w_k$ are i.i.d. and there exist $x^*$, $c_1 > 0$, and $c_2 > 0$ such that for all $x$:

1. $(x^* - x)^\top \bar{s}(x) \geq c_1 \|x^* - x\|^2$;
2. $\mathbb{E}\left[ \|s(x, w_k)\|^2 \right] \leq c_2 \left( 1 + \|x^* - x\|^2 \right)$.

If, in addition, $\{\gamma_k\}$ is deterministic with $\sum_k \gamma_k = \infty$ and $\sum_k \gamma_k^2 < \infty$, then $x_k \to x^*$ with probability 1.

Intuitively, the two conditions in Proposition 2 can be read as follows:

1. The expected update direction points toward $x^*$ and makes progress fast enough (proportionally to the squared distance).
2. The noise contributed by $w_k$ is bounded: it may be very noisy, but not too noisy. If you are far away you are allowed large noise, because that is compensated by the fact that you can take big steps.

Theorem 1 (Supermartingale Convergence Theorem). Let $X_k, Y_k, Z_k$, $k = 0, 1, 2, \ldots$, be nonnegative scalar random variables with $\sum_k Y_k < \infty$. If

$$\mathbb{E}_k\left[ X_{k+1} \right] \leq X_k + Y_k - Z_k,$$

then, with probability 1, $\lim_k X_k$ exists and is finite, and $\sum_k Z_k < \infty$.

In the above statement, the subscript $k$ on the expectation denotes conditioning on $X_0, \ldots, X_k$. Note that Theorem 1 is a stochastic generalization of a convergence result in real analysis. We are not going to prove Theorem 1 (it is pretty hard), but we will use it to prove Proposition 2.

Proof (of Proposition 2). Let $d_k = \|x_k - x^*\|^2$. Then

$$d_{k+1} = \|x_k + \gamma_k s(x_k, w_k) - x^*\|^2 = d_k + \gamma_k^2 \|s(x_k, w_k)\|^2 - 2 \gamma_k (x^* - x_k)^\top s(x_k, w_k),$$

and taking conditional expectations and applying the two conditions of Proposition 2,

$$\mathbb{E}_k\left[ d_{k+1} \right] \leq d_k + \gamma_k^2 c_2 (1 + d_k) - 2 \gamma_k c_1 d_k = d_k + c_2 \gamma_k^2 - \left( 2 c_1 \gamma_k - c_2 \gamma_k^2 \right) d_k.$$

Define

$$X_k = d_k, \qquad Y_k = c_2 \gamma_k^2, \qquad Z_k = \left( 2 c_1 \gamma_k - c_2 \gamma_k^2 \right) d_k,$$

so that $\mathbb{E}_k[X_{k+1}] \leq X_k + Y_k - Z_k$.

We now check that these variables satisfy the conditions of Theorem 1. First, $\sum_k Y_k = c_2 \sum_k \gamma_k^2 < \infty$ since $\sum_k \gamma_k^2 < \infty$. Also, $X_k$ and $Y_k$ as defined are nonnegative. However, $Z_k$ could take on negative values; but since $\gamma_k \to 0$, there is some $K$ such that $Z_k \geq 0$ for all $k \geq K$, so we can apply the theorem starting at index $k = K$.

Theorem 1 now tells us that $X_k = d_k = \|x_k - x^*\|^2$ converges to a finite random variable with probability 1, and that $\sum_k Z_k < \infty$. Since $d_k$ converges, it is bounded along the trajectory, so $\sum_k c_2 \gamma_k^2 d_k < \infty$; combined with $\sum_k Z_k = \sum_k (2 c_1 \gamma_k - c_2 \gamma_k^2) d_k < \infty$, this gives $\sum_k c_1 \gamma_k d_k < \infty$. However, since $c_1$ is a positive constant and $\sum_k \gamma_k = \infty$, the only way this sum can be finite is if $d_k$ converges to zero: if $d_k$ converged to any value other than zero, the sum would not converge. We conclude that $d_k = \|x_k - x^*\|^2 \to 0$ with probability 1, which implies that $x_k \to x^*$ with probability 1, as desired.

Example 1. Consider $F : \mathbb{R}^n \to \mathbb{R}^n$ such that $\|F(x) - F(y)\| \leq \alpha \|x - y\|$ for some $\alpha \in (0, 1)$, and let $x^*$ denote its unique fixed point. Let the $w_k$ be i.i.d. with $\mathbb{E}[w_k] = 0$ and $\mathbb{E}[\|w_k\|^2] < \infty$, and consider the iteration

$$x_{k+1} = x_k + \gamma_k \left( F(x_k) - x_k + w_k \right),$$

that is, $s(x, w) = F(x) - x + w$ and $\bar{s}(x) = F(x) - x$. We verify the two conditions of Proposition 2.

1. Since $F(x^*) = x^*$, we can write $\bar{s}(x) = F(x) - F(x^*) + x^* - x$, so

$$(x^* - x)^\top \bar{s}(x) = (x^* - x)^\top \left( F(x) - F(x^*) \right) + \|x^* - x\|^2 \geq -\|x^* - x\| \, \|F(x) - F(x^*)\| + \|x^* - x\|^2 \geq (1 - \alpha) \|x^* - x\|^2,$$

so the first condition holds with $c_1 = 1 - \alpha > 0$.

2. Using $\|F(x) - x\| \leq \|F(x) - F(x^*)\| + \|x^* - x\| \leq (1 + \alpha) \|x^* - x\|$ and $(a + b)^2 \leq 2 a^2 + 2 b^2$,

$$\mathbb{E}\left[ \|s(x, w_k)\|^2 \right] = \mathbb{E}\left[ \|F(x) - x + w_k\|^2 \right] \leq \mathbb{E}\left[ \left( (1 + \alpha) \|x^* - x\| + \|w_k\| \right)^2 \right] \leq 2 (1 + \alpha)^2 \|x^* - x\|^2 + 2\, \mathbb{E}\left[ \|w_k\|^2 \right] \leq c_2 \left( 1 + \|x^* - x\|^2 \right)$$

with $c_2 = \max\left\{ 2 (1 + \alpha)^2,\ 2\, \mathbb{E}[\|w_k\|^2] \right\}$.

Thus, by Proposition 2, $x_k \to x^*$ with probability 1.
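A quick numerical check of Example 1 (not part of the notes): the sketch below uses the made-up affine contraction $F(x) = \tfrac{1}{2} x + b$ in $\mathbb{R}^2$, which has modulus $\alpha = 1/2$ and fixed point $x^* = 2b$, and runs the iteration with step sizes $\gamma_k = 1/(k+1)$.

```python
import numpy as np

rng = np.random.default_rng(3)
b = np.array([1.0, -2.0])          # made-up offset; F(x) = 0.5 * x + b is a contraction with alpha = 0.5
x_star = 2.0 * b                   # unique fixed point: x* = F(x*)  =>  x* = 2b

x = np.array([10.0, 10.0])
for k in range(200_000):
    gamma_k = 1.0 / (k + 1)        # sum gamma = infinity, sum gamma^2 < infinity
    w = rng.normal(size=2)         # zero-mean i.i.d. noise with finite second moment
    x = x + gamma_k * ((0.5 * x + b) - x + w)   # x_{k+1} = x_k + gamma_k * s(x_k, w_k)

print(x, x_star)                   # the iterate ends up close to x* = (2, -4)
```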

One last problem to think about: in the last homework assignment we showed that asynchronous value iteration converges to the optimal value function, using the fact that the Bellman operator is a contraction with respect to the maximum norm. What if, instead of a maximum-norm contraction, we only had a contraction with respect to the Euclidean norm? It can be shown that this does not guarantee convergence of the asynchronous updates, although you do get convergence in the synchronous case. The homework below asks for an example of an asynchronous process that does not reach the fixed point of a Euclidean-norm contraction.

Homework #

Let $T : \mathbb{R}^n \to \mathbb{R}^n$ be such that there exists $\alpha \in (0, 1)$ with $\|T V - T \bar{V}\|_2 \leq \alpha \|V - \bar{V}\|_2$ for all $V, \bar{V}$. Consider the update

$$V_{k+1}(s) = \begin{cases} (T V_k)(s) & \text{if } s = s_k, \\ V_k(s) & \text{otherwise.} \end{cases}$$

Show that $V_k$ may not converge to $V^*$ even if each $s \in \mathcal{S}$ is selected infinitely often.

References

[1] John N. Tsitsiklis. Asynchronous stochastic approximation and Q-learning. Machine Learning, 16(3):185-202, 1994.