Elements of Reinforcement Learning

Elements of Reinforcement Learning
Policy: how the learning agent behaves (a mapping from states to actions).
Reward function: a mapping from each state-action pair to an immediate reward or cost.
Value function: the long-term return, i.e., the total (possibly discounted) reward accumulated now and in the future.
Model: something that mimics the behavior of the environment.

Evaluative Feedback Example
Consider the n-armed bandit problem: at every play we must choose one of n actions with the goal of maximizing the rewards received. Let Q*(a) be the expected reward of action a and Q_t(a) its estimated value at the t-th play. With the sample-average method, set Q_t(a) = (r_1 + r_2 + ... + r_{k_a}) / k_a, where k_a is the number of times action a has been taken. The central question is exploration versus exploitation: how should the next action a be chosen?
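
As a small illustration of the sample-average method (my own sketch, not from the slides), the estimate Q_t(a) can be maintained incrementally:

import numpy as np

n_actions = 10
Q = np.zeros(n_actions)   # estimated values Q_t(a)
k = np.zeros(n_actions)   # k_a: number of times each action has been taken

def update_estimate(action, reward):
    """Incremental form of the sample average Q_t(a) = (r_1 + ... + r_{k_a}) / k_a."""
    k[action] += 1
    Q[action] += (reward - Q[action]) / k[action]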

Policies for the n-Armed Bandit Problem
Greedy policy: always choose a* such that Q_t(a*) = max_a Q_t(a). No exploration; does well initially but has poor long-term performance.
ε-greedy policy: same as the greedy policy, but with probability ε choose an action uniformly at random. Some exploration; performs better than the greedy policy in the long run.
Softmax action selection: weight the actions probabilistically using a temperature parameter (Gibbs distribution).
Reinforcement comparison methods: keep track of received payoffs and action preferences rather than action-value estimates.
Pursuit methods: use both action-value estimates and action preferences.
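
A sketch of ε-greedy and softmax selection over such estimates (the epsilon and tau values below are arbitrary illustrative choices):

import numpy as np

def epsilon_greedy(Q, epsilon=0.1):
    """With probability epsilon explore uniformly at random, otherwise exploit the greedy action."""
    if np.random.rand() < epsilon:
        return np.random.randint(len(Q))
    return int(np.argmax(Q))

def softmax_action(Q, tau=0.5):
    """Gibbs/Boltzmann selection: higher-valued actions are chosen more often; tau is the temperature."""
    prefs = np.exp((Q - np.max(Q)) / tau)   # subtract the max for numerical stability
    probs = prefs / prefs.sum()
    return int(np.random.choice(len(Q), p=probs))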

Summary of the Example
Exploitation versus exploration: for the algorithms presented, what is the proper balance?
Learning schemes:
Supervised learning: the learner is instructed what to do.
Evaluative learning: try different actions and observe the rewards (gives the learner more control over the environment).
Non-associative learning: trial-and-error learning that is not associated with the situation or state of the problem (there is only one state).

Reinforcement Learning Model
[Block diagram: the learning system applies an action to the environment; the environment returns a cost and the next state.]
Exploration versus exploitation; learning can be slow.

Finite Markov Decision Processes: Parameters
State: X(n) = x(n), with N states.
Action: A(n) = a_ik (the k-th action available from state i).
Transition probability: p_ij(a) = P(X(n+1) = j | X(n) = i, A(n) = a).
Cost: r(i,a,j), with discount factor γ, 0 ≤ γ < 1.
Policy: π = {u(0), u(1), ...}, a sequence of mappings from states into actions (stationary or nonstationary).
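
For the planning sketches that follow, assume (my notation, not the slides') that a finite MDP is stored as numpy arrays:

import numpy as np

# Hypothetical finite MDP with N states and K actions:
# P[a, i, j] = p_ij(a), probability of moving from state i to j under action a
# R[a, i, j] = r(i, a, j), the one-step reward (the slides call it a cost)
N, K = 4, 2
rng = np.random.default_rng(0)
P = rng.random((K, N, N))
P /= P.sum(axis=2, keepdims=True)   # each row of transition probabilities sums to 1
R = rng.random((K, N, N))
gamma = 0.9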

Value Functions
Cost-to-go or value function (infinite horizon, discounted): J_π(i) = E[ Σ_n γ^n r(x(n), u(x(n)), x(n+1)) | x(0) = i ], where the expectation is over the Markov chain x(1), x(2), ...
Action-value function: Q_π(i,a) = E[ Σ_n γ^n r(x(n), u(x(n)), x(n+1)) | x(0) = i, a(0) = a ], averaged over the same chain.
Goal: find the policy π that maximizes J_π(i) for all initial states i.
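
For intuition, J_π(i) can also be estimated purely by simulation, averaging sampled discounted returns; the sketch below reuses the P, R, gamma arrays defined above and truncates the infinite horizon, which is an approximation of mine:

import numpy as np

def monte_carlo_value(P, R, gamma, pi, i0, horizon=200, n_rollouts=1000):
    """Estimate J_pi(i0) by averaging truncated discounted returns over simulated chains.
    P[a,i,j] and R[a,i,j] as above; pi[i,a] is a stochastic policy whose rows sum to 1."""
    K, N, _ = P.shape
    rng = np.random.default_rng()
    total = 0.0
    for _ in range(n_rollouts):
        i, ret, disc = i0, 0.0, 1.0
        for _ in range(horizon):
            a = rng.choice(K, p=pi[i])      # sample an action from the policy
            j = rng.choice(N, p=P[a, i])    # sample the successor state
            ret += disc * R[a, i, j]
            disc *= gamma
            i = j
        total += ret
    return total / n_rollouts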

Recursive Expression for the Value Function
J_π(i) = E[ Σ_n γ^n r(x(n), u(x(n)), x(n+1)) | x(0) = i ]
       = E[ r(i, u(i), j) + γ Σ_n γ^n r(x(n+1), u(x(n+1)), x(n+2)) | x(1) = j ]
       = Σ_a Σ_j π(i,a) p_ij(a) ( r(i,a,j) + γ J_π(j) ).
This Bellman equation for J_π allows the value function of a policy π to be calculated; it can be solved iteratively or directly.
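
A minimal iterative policy-evaluation sketch for this Bellman equation, using the same array conventions (the function name and tolerance are my own choices):

import numpy as np

def policy_evaluation(P, R, gamma, pi, tol=1e-8):
    """Iterate J(i) <- sum_a sum_j pi(i,a) p_ij(a) (r(i,a,j) + gamma J(j)) until convergence.
    P[a,i,j] and R[a,i,j] are the transition and reward arrays; pi[i,a] is a stochastic policy."""
    K, N, _ = P.shape
    J = np.zeros(N)
    while True:
        # expected one-step reward plus discounted next value, for every (a, i)
        Q = np.einsum('aij,aij->ai', P, R) + gamma * P @ J   # shape (K, N)
        J_new = np.einsum('ia,ai->i', pi, Q)                 # average over the policy
        if np.max(np.abs(J_new - J)) < tol:
            return J_new
        J = J_new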

Optimal Value Function
We want to find the optimal policy, i.e., the one maximizing the value function: J*(i) = max_π J_π(i).
The optimal value function can be expressed in terms of the optimal action-value function as J*(i) = max_u(i) Q*(i, u(i)), where Q*(i, u(i)) = max_π Q_π(i, u(i)).
A recursive expression for J*(i) then follows by expanding the right-hand side, using the same method as for the value function on the previous slide.

MDP Solution and Bellman Equation
Using dynamic programming, the optimal cost can be written in terms of Bellman's optimality equation:
J*(i) = max_u E_x(i)[ r(i, u(i), x(i)) + γ J*(x(i)) ].
Define the expected current cost c(i, u(i)) = E[ r(i, u(i), j) ] = Σ_{j=1..N} p_ij(u(i)) r(i, u(i), j).
Bellman's equation can then be rewritten as
J*(i) = max_u [ c(i, u(i)) + γ Σ_{j=1..N} p_ij(u(i)) J*(j) ],
a system of N equations (one per state), each involving a maximization.

Policy Evaluation and Improvement
Policy evaluation: for a given policy π, the value function can be computed iteratively,
J_{k+1}^π(i) = Σ_a Σ_j π(i,a) p_ij(a) ( r(i,a,j) + γ J_k^π(j) ),
and the iteration converges.
Policy improvement: the Q function can be expressed as
Q_π(i,a) = c(i,a) + γ Σ_{j=1..N} p_ij(a) J_π(j).
A policy u is said to be greedy with respect to J_π if u(i) = argmax_a Q_π(i,a) for all i.

Policy Iteration
1) Policy evaluation: the cost-to-go function of the current policy u_n is recomputed from
J_{u_n}(i) = c(i, u_n(i)) + γ Σ_{j=1..N} p_ij(u_n(i)) J_{u_n}(j),
a set of N linear equations solved directly or iteratively.
2) Policy improvement: u_{n+1}(i) = argmax_a Q_{u_n}(i, a).
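
A compact policy-iteration sketch under the same conventions (the deterministic-policy representation and the exact linear solve in step 1 are my choices):

import numpy as np

def policy_iteration(P, R, gamma):
    """Alternate exact policy evaluation (a linear solve) with greedy policy improvement.
    P[a,i,j] and R[a,i,j]; returns a greedy deterministic policy u and its value J."""
    K, N, _ = P.shape
    c = np.einsum('aij,aij->ai', P, R)      # expected immediate reward c(i,a), indexed [a, i]
    u = np.zeros(N, dtype=int)              # arbitrary initial deterministic policy
    while True:
        # 1) Policy evaluation: solve (I - gamma * P_u) J = c_u
        P_u = P[u, np.arange(N), :]         # N x N transition matrix under policy u
        c_u = c[u, np.arange(N)]
        J = np.linalg.solve(np.eye(N) - gamma * P_u, c_u)
        # 2) Policy improvement: act greedily with respect to J
        Q = c + gamma * P @ J               # Q[a, i]
        u_new = np.argmax(Q, axis=0)
        if np.array_equal(u_new, u):
            return u, J
        u = u_new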

Value Iteration
Initialization: start with an initial value J_0(i).
Iterate: Q_n(i,a) = c(i,a) + γ Σ_{j=1..N} p_ij(a) J_n(j), then J_{n+1}(i) = max_a Q_n(i,a).
Continue until |J_{n+1}(i) - J_n(i)| < ε.
Compute the policy: u*(i) = argmax_a Q(i,a).
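
A matching value-iteration sketch (same conventions; eps is an arbitrary tolerance):

import numpy as np

def value_iteration(P, R, gamma, eps=1e-8):
    """Repeatedly apply the backup J(i) <- max_a [c(i,a) + gamma sum_j p_ij(a) J(j)]."""
    K, N, _ = P.shape
    c = np.einsum('aij,aij->ai', P, R)      # c(i,a), indexed [a, i]
    J = np.zeros(N)
    while True:
        Q = c + gamma * P @ J               # Q[a, i]
        J_new = Q.max(axis=0)
        if np.max(np.abs(J_new - J)) < eps:
            return np.argmax(Q, axis=0), J_new   # greedy policy u* and final value estimate
        J = J_new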

Dynamic Programming Comments
The number of states often grows exponentially with the number of state variables (Bellman's curse of dimensionality).
For large state spaces it is infeasible to sweep the entire state space when performing the DP steps; asynchronous DP makes partial sweeps and updates instead.
DP algorithms run in time polynomial in the number of states and actions.
GPI (Generalized Policy Iteration), in which policy evaluation and policy improvement are carried out together, is often used instead of strict policy iteration.
DP assumes complete knowledge of the environment.

Approximate Dynamic Programming
Motivation: incomplete information (the Markov transition probabilities are unknown) and the curse of dimensionality.
Settle for a suboptimal policy in which J*(i) is replaced by an approximation, either a table lookup or a function parameterized by a set of weights.
Use Monte Carlo simulation to learn the policy, e.g., Q-learning.

Q-Learning Algorithm
Define the optimal Q function: Q*(i,a) = Σ_{j=1..N} p_ij(a) ( r(i,a,j) + γ max_b Q*(j,b) ), with J*(i) = max_a Q*(i,a).
Learn the Q function iteratively:
Q_{n+1}(i,a) = (1 - μ(i,a)) Q_n(i,a) + μ(i,a) ( r(i,a,j) + γ J_n(j) ),
where j is the randomly drawn successor state and J_n(j) = max_b Q_n(j,b).
Monte Carlo simulation: the update applies only to the current state-action pair; all other pairs are left unchanged.
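
A minimal tabular Q-learning sketch; the environment object and its reset/step interface here are assumptions for illustration, not something specified in the slides:

import numpy as np

def q_learning(env, n_states, n_actions, gamma=0.9, epsilon=0.1, episodes=5000):
    """Tabular Q-learning: Q(i,a) <- (1 - mu) Q(i,a) + mu (r + gamma * max_b Q(j,b))."""
    Q = np.zeros((n_states, n_actions))
    visits = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        i, done = env.reset(), False
        while not done:
            # epsilon-greedy exploration keeps all state-action pairs visited
            a = np.random.randint(n_actions) if np.random.rand() < epsilon else int(np.argmax(Q[i]))
            j, r, done = env.step(a)        # assumed interface: next state, reward, terminal flag
            visits[i, a] += 1
            mu = 1.0 / visits[i, a]         # step size mu(i,a) satisfying the usual conditions
            target = r + (0.0 if done else gamma * np.max(Q[j]))
            Q[i, a] = (1 - mu) * Q[i, a] + mu * target
            i = j
    return Q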

Q-Learning Comments
Convergence theorem: the Q-learning algorithm converges almost surely to the optimal Q function, provided the step sizes satisfy the stochastic approximation conditions and every state-action pair is visited infinitely often.
Representations: table lookup works well, but networks parameterized by weights often learn very slowly.
Exploration vs. exploitation: ensure that all state-action pairs are explored while still keeping the incurred cost-to-go small.

Temporal Difference Learning
Given a learning sequence that terminates and yields a reward at termination, how do we learn?
Temporal difference learning assigns credit to each training input in the sequence, so iterative learning algorithms can be set up with inputs and target outputs.
This gives the class of TD(λ) algorithms, with 0 ≤ λ ≤ 1.
Learning is much slower than in supervised learning.
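
As a concrete member of this family, a tabular TD(0) prediction sketch with one-step updates (the env interface and constant step size are the same kind of assumptions as in the Q-learning sketch):

import numpy as np

def td0_evaluation(env, policy, n_states, gamma=0.9, alpha=0.1, episodes=5000):
    """Tabular TD(0): after each transition i -> j with reward r,
    J(i) <- J(i) + alpha * (r + gamma * J(j) - J(i))."""
    J = np.zeros(n_states)
    for _ in range(episodes):
        i, done = env.reset(), False
        while not done:
            a = policy(i)                   # fixed policy being evaluated
            j, r, done = env.step(a)        # assumed interface: next state, reward, terminal flag
            target = r + (0.0 if done else gamma * J[j])
            J[i] += alpha * (target - J[i]) # temporal-difference update
            i = j
    return J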

Reinforcement Learning Applications
Backgammon, navigation, elevator control, helicopter control, computer network routing, sequential detection, and dynamic channel allocation (cellular systems).

References
R. Sutton and A. Barto, Reinforcement Learning: An Introduction, MIT Press, Cambridge, MA, 1998.
D. Bertsekas and J. Tsitsiklis, Neuro-Dynamic Programming, Athena Scientific, Belmont, MA, 1996.
J. Si, A. Barto, W. Powell, and D. Wunsch (Eds.), Handbook of Learning and Approximate Dynamic Programming, Wiley-IEEE Press, 2004.