Machine Learning I Reinforcement Learning


Machine Learning I: Reinforcement Learning. Thomas Rückstieß, Technische Universität München, December 17/18, 2009

Literature
Book: Reinforcement Learning: An Introduction, Sutton & Barto (free online version: google "sutton barto book").
Survey paper: Reinforcement Learning: A Survey, Kaelbling, Littman & Moore.

What is Reinforcement Learning? What is it not?
Supervised learning: for a given input, the learner is also provided with the desired output (target); the task is to learn a mapping from input to output.
Unsupervised learning: the goal is to build a model of the inputs (e.g. for clustering, outlier detection, compression, ...) without any feedback.
Reinforcement learning is more interactive: the learner does not receive the target, but it does receive evaluative feedback (rewards). It is more a problem description than a single method.

Problem Statement
[Figure: agent-environment loop. At each step the agent sends action a_t to the environment, which returns the reward r_{t+1} and the new state s_{t+1}.]
Definition (agent, environment, state, action, reward): An agent interacts with an environment at discrete time steps t = 0, 1, 2, ... At each time step t, the agent receives state s_t ∈ S from the environment. It then chooses to execute action a_t ∈ A(s_t), where A(s_t) is the set of available actions in s_t. At the next time step, it receives the immediate reward r_{t+1} ∈ ℝ and finds itself in state s_{t+1}.
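To make this loop concrete, here is a minimal Python sketch of one episode; the Environment class, its reset/step interface and the toy dynamics are illustrative assumptions, not part of the lecture.

```python
import random

class Environment:
    """Hypothetical toy environment with two states and two actions."""
    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # Toy dynamics: action 1 moves towards the terminal state 1.
        next_state = 1 if action == 1 else 0
        reward = 1.0 if next_state == 1 else 0.0
        done = next_state == 1
        self.state = next_state
        return next_state, reward, done

def run_episode(env, policy, max_steps=100):
    """Run one episode: at each step t the agent observes s_t,
    picks a_t = policy(s_t) and receives r_{t+1} and s_{t+1}."""
    s = env.reset()
    trajectory = []
    for _ in range(max_steps):
        a = policy(s)
        s_next, r, done = env.step(a)
        trajectory.append((s, a, r))
        s = s_next
        if done:
            break
    return trajectory

# Example: a random policy over the actions {0, 1}.
print(run_episode(Environment(), lambda s: random.choice([0, 1])))
```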

Problem Statement (comments)
This is a very general definition:
The environment does not have to be deterministic: transitions from s to s' with action a can be probabilistic, described by P^a_{ss'}.
Rewards only have to be scalar values. They can be given sparsely (reward only at the goal) or after each time step. They can be probabilistic; the expected reward when transitioning from s to s' with action a is described by R^a_{ss'}.
Time steps don't have to refer to fixed intervals of real time; they can just be successive stages of decision-making.
States can be actual states of a machine (e.g. on/off) or abstract states of some defined problem. An agent could be in a state of "I don't know what this object is".
Actions can be low-level motor controls of a robot or high-level decisions of a planner, such as "going to lunch".

General Assumptions
For now, we assume the following:
Both states and actions are discrete and finite.
Our problem fulfills the Markov property (MDP): the current state summarizes all relevant information from the past (e.g. chess, cannonball); the next state is determined only by the last state and the last action, not by the entire history; the environment has stationary P^a_{ss'} and R^a_{ss'}.

Example: Chess
Imagine an agent playing one of the players in a chess game. What are the states? actions? rewards? Is the environment probabilistic or deterministic? What is a good move for black?
Heuristic piece values: Pawn 1, Bishop 3, Knight 3, Rook 5, Queen 9.

Example: Chess
state: one legal combination of pieces on the board (about 10^50 possible states)
actions: all possible legal moves in the current state
rewards: end of game (+1 = win, -1 = loss); but: heuristics for taking pieces and reaching certain positions
environment: chess itself is deterministic, but since the agent plays against an opponent, it depends on the opponent; the other player can be considered probabilistic.

Different Types of RL
Direct RL: learn the policy π directly (state → action). Suited when data is cheap and computation is limited, e.g. embedded systems.
Value-based RL: learn a value function Q (state, action → value) and derive the policy from it.
Model-based RL: learn the model P^a_{ss'} and R^a_{ss'} (state, action → next state, reward) and plan with it. Suited when data is expensive and computation doesn't matter, e.g. medical trials.

Value-Based RL: Overview
Returns R_t
Policy π
Value functions V^π(s) and Q^π(s, a)
Optimal policy π* and optimal value functions V* and Q*
Policy iteration: policy evaluation (π → Q^π) and policy improvement (Q^π → π', π' ≥ π)
Monte Carlo algorithm
Temporal difference RL: SARSA, Q-Learning
Eligibility traces

Returns
What is the goal of the agent? What should we optimize? Change the agent's behavior so that it selects the actions that maximize the sum of future rewards: R = r_{t+1} + r_{t+2} + r_{t+3} + ...
Problem of infinite sums in continuing settings: use discounting.
Definition (return, discounting factor): The return R_t is defined as the (discounted) sum of rewards starting at time t+1:
R_t = Σ_{k=0}^{T} γ^k r_{t+k+1}
where 0 < γ ≤ 1 is the discounting factor, which ensures that recent rewards are more important than older ones. For continuing tasks, where T = ∞, the discounting factor must be strictly less than 1.
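As a small illustration (not from the slides), the discounted return of a finite reward sequence follows directly from this definition:

```python
def discounted_return(rewards, gamma):
    """R_t = sum_k gamma^k * r_{t+k+1} for a finite list of rewards
    [r_{t+1}, r_{t+2}, ...]."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

# Example: three steps with reward 0 and a final reward of 1.
print(discounted_return([0.0, 0.0, 0.0, 1.0], gamma=0.9))  # 0.9**3 ≈ 0.729
```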

Policy
The agent's behavior is formally described by its policy, which returns an action for each state.
[Figure: the policy π inside the agent maps the observed state s_t to an action a_t and is updated from the reward r_t.]
Definition (policy): An agent's policy is its mapping from observed states to probabilities of choosing each action:
π(s_t, a_t) : S × A(s_t) → [0, 1],  π(s_t, a_t) = p(a_t | s_t).
If the policy is deterministic, i.e. π(s, a) = 0 for all a ∈ A(s) \ {a'} and π(s, a') = 1, then we write π(s) = a'.
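A minimal sketch of how such a policy could be stored, with a stochastic table of action probabilities and the deterministic special case; the states, actions and probabilities are made-up examples:

```python
import random

# Stochastic policy: for each state, a distribution over actions.
policy = {
    "s1": {"a1": 0.7, "a2": 0.3},
    "s2": {"a1": 0.1, "a2": 0.9},
}

def sample_action(pi, s):
    """Draw a_t ~ pi(. | s_t)."""
    actions, probs = zip(*pi[s].items())
    return random.choices(actions, weights=probs)[0]

# Deterministic special case: pi(s, a') = 1 for exactly one action,
# so the policy can simply be written as pi(s) = a'.
deterministic_policy = {"s1": "a1", "s2": "a2"}

print(sample_action(policy, "s1"), deterministic_policy["s2"])
```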

Value Functions
We need to make use of the returns R_t to change the agent's behavior. One way to do this: value functions.
A value function describes how good it is for the agent to be in a certain state, or how good it is to perform a certain action in a state. "Goodness" is defined as the expected future return.
Value functions depend on the policy π of the agent.
Two variants: we can start in a state (state-value function V^π) or with a state-action pair (action-value function Q^π).
[Figure: a state s_1 with actions a_1 ... a_4 leading to successor states s_1 ... s_4; V^π(s_1) values the state, Q^π(s_1, a_1) values the state-action pair.]

Value Functions
Definition (state-value function): A state-value function V^π(s) with respect to policy π is a function of a state s, which describes the expected future return R_t if the agent starts in state s and follows policy π thereafter:
V^π(s) = E_π{ R_t | s_t = s } = E_π{ Σ_{k=0}^{T} γ^k r_{t+k+1} | s_t = s }
Definition (action-value function): An action-value function Q^π(s, a) with respect to policy π is a function of a state s and an action a, which describes the expected future return R_t if the agent starts in state s, takes action a and follows policy π thereafter:
Q^π(s, a) = E_π{ R_t | s_t = s, a_t = a } = E_π{ Σ_{k=0}^{T} γ^k r_{t+k+1} | s_t = s, a_t = a }

Value Functions
We will concentrate on action-value functions. From an action-value function, we can always get the state values: V^π(s) = Q^π(s, π(s)).
Action values can be represented in a Q-table, for example:
Q^π(s, a)    a_1    a_2    a_3
s_1          2.0   -0.5    0.1
s_2         -1.0    3.35   0.0
s_3          0.3    0.3   -0.3
s_4         10.4    0.5    0.5
The values can be estimated from experience: follow π and keep separate return estimates for each state-action pair → Policy Evaluation.
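A possible representation of this Q-table as a NumPy array (an assumption for illustration; the values are those from the slide), with the state values read off via V^π(s) = Q^π(s, π(s)):

```python
import numpy as np

# Q-table: rows = states s_1..s_4, columns = actions a_1..a_3.
Q = np.array([
    [ 2.0, -0.5,  0.1],
    [-1.0,  3.35, 0.0],
    [ 0.3,  0.3, -0.3],
    [10.4,  0.5,  0.5],
])

# A deterministic policy: for each state, the index of the chosen action.
pi = np.array([0, 1, 0, 0])

# State values under pi: V(s) = Q(s, pi(s)).
V = Q[np.arange(len(pi)), pi]
print(V)  # [ 2.    3.35  0.3  10.4 ]
```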

Optimal Policies & Optimal Value Functions
Definition (ranking of policies): A policy π' is considered better than π if the expected returns of π' are higher than or equal to those of π for all states and actions:
Q^{π'}(s, a) ≥ Q^π(s, a)   for all s ∈ S, a ∈ A(s)
The best policy (or policies) is denoted π* (optimal policy). The matching value function Q^{π*}(s, a) = Q*(s, a) is known as the optimal value function, and it is a unique solution to a reinforcement learning problem.
Our goal: finding Q*.

Policy Evaluation
If we have a policy π, how do we get the corresponding action-value function Q^π? In short: π → Q^π.
1. Start in a random state with a random action
2. Follow π
3. Keep a history of all visited (s, a) pairs and the received returns R
The average over the returns received after a (s, a) pair converges to its Q(s, a) value → learning from experience (Monte Carlo).
Example: Assume we play many chess games and encounter (s, a) a total of 5 times. After finishing these games, we received the rewards +1, +1, +1, -1, +1. The return is R_t = Σ_{k=0}^{T} γ^k r_{t+k+1}, but because all rewards of a game except the last one are 0, R_t equals the final reward, and we get Q^π(s, a) = (1 + 1 + 1 - 1 + 1) / 5 = 0.6.
With more games containing (s, a), the approximation improves.

Policy Improvement
Now we want to improve a (deterministic) policy π. Assume π' is identical to π except for state s', where π selects a and π' selects a' (deterministic policies):
π(s') = a,  π'(s') = a'
Q^π(s, π'(s)) = Q^π(s, π(s))   for all s ∈ S \ {s'}
If Q^π(s', π'(s')) ≥ Q^π(s', π(s')), then Q^{π'}(s, a) ≥ Q^π(s, a) for all s, a, and therefore π' is better than π (or at least equal).
This is called the Policy Improvement Theorem.

Policy Improvement
All we need to do is find a π' that improves Q in one or more states. We can construct a policy that does exactly that:
π'(s) = argmax_a Q^π(s, a)
This greedy policy uses the current Q-table to execute the currently most promising action in each state. By the theorem above, Q^{π'}(s, a) ≥ Q^π(s, a) for all s, a, and therefore π' ≥ π.
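The greedy improvement is a one-liner over the hypothetical Q-table used above:

```python
import numpy as np

Q = np.array([
    [ 2.0, -0.5,  0.1],
    [-1.0,  3.35, 0.0],
    [ 0.3,  0.3, -0.3],
    [10.4,  0.5,  0.5],
])

# Greedy policy improvement: pi'(s) = argmax_a Q(s, a) for every state.
pi_improved = np.argmax(Q, axis=1)
print(pi_improved)  # [0 1 0 0]
```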

Policy Iteration
We can evaluate a policy (π → Q^π) and we can find a better policy by taking the greedy policy of the last value function (Q^π → π' with π' ≥ π). This leads to the policy iteration cycle:
π_0 →(E) Q^{π_0} →(I) π_1 →(E) Q^{π_1} →(I) π_2 →(E) ... →(I) π* →(E) Q*
where E denotes policy evaluation and I denotes policy improvement.

Monte Carlo Reinforcement Learning
Monte Carlo algorithm:
1. initialize π and Q arbitrarily, returns(s, a) ← empty list
2. generate an episode starting at a random (s, a) and follow π
3. for each (s, a) pair in the episode: add R to returns(s, a); Q(s, a) ← average(returns(s, a))  (sample average)
4. for each s in the episode: π(s) ← argmax_a Q(s, a)
5. goto 2
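A compact Python sketch of this Monte Carlo control loop; generate_episode is a hypothetical helper that returns a list of (s, a, r) triples for one episode started from a random (s, a) pair:

```python
from collections import defaultdict

def mc_control(generate_episode, actions, num_episodes=10000, gamma=1.0):
    """Monte Carlo control with exploring starts (sketch)."""
    Q = defaultdict(float)
    returns = defaultdict(list)
    policy = {}

    for _ in range(num_episodes):
        episode = generate_episode(lambda s: policy.get(s, actions[0]))
        G = 0.0
        # Walk the episode backwards to accumulate the return G for each step.
        for s, a, r in reversed(episode):
            G = r + gamma * G
            returns[(s, a)].append(G)
            Q[(s, a)] = sum(returns[(s, a)]) / len(returns[(s, a)])
            # Improve the policy greedily at the visited state.
            policy[s] = max(actions, key=lambda act: Q[(s, act)])
    return Q, policy
```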

Example: Black Jack
state: sum of own cards c ∈ [4, 21], usable ace u ∈ {yes, no}, dealer's card d ∈ {A, 2, 3, 4, 5, 6, 7, 8, 9, 10}; s = (c, u, d)
action: hit or stick
reward: +1 or -1 on win or loss, 0 otherwise
episode: one game
return: same as the final reward (if γ = 1)
policy: strategy of playing: π(s) ∈ {hit, stick} with s = (c, u, d)
Q-table: table of states × actions containing the Q-values
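One way such a state and Q-table might be represented (illustrative only, not from the slides):

```python
from collections import defaultdict

# State: (card sum, usable ace, dealer's visible card); actions: "hit" or "stick".
state = (13, True, 10)
actions = ("hit", "stick")

# Q-table as a dictionary keyed by (state, action), defaulting to 0.
Q = defaultdict(float)
Q[(state, "hit")] = 0.2

# Greedy policy lookup for this state.
best_action = max(actions, key=lambda a: Q[(state, a)])
print(best_action)  # "hit"
```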

Some Remarks
If the transition probabilities P^a_{ss'} and expected rewards R^a_{ss'} are known, Dynamic Programming can be used. We skipped this part, but you can read it in Sutton & Barto's book, Chapter 4.
Often, exploring starts are not possible (real-life experiences). Instead, we can use an exploring policy (called an ε-greedy policy), as sketched below:
π(s) = argmax_a Q(s, a)   if r ≥ ε
π(s) = random action from A(s)   if r < ε
with small ε and a random number r ∈ [0, 1], drawn at each time step.
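A minimal ε-greedy action selection in Python, assuming (as in the sketches above) a dict-based Q-table keyed by (state, action):

```python
import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((state, a), 0.0))
```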

Full Backups vs. Bootstrapping
Dynamic Programming: full model dynamics known, uses bootstrapping.
Monte Carlo RL: collects samples from real experience, backs up full returns (no bootstrapping).
Temporal Differences: collects samples and bootstraps.
What is bootstrapping? [Figure: Münchhausen pulled himself out of the swamp by his own hair.] In our case: we use an approximation of Q to improve Q.

Temporal Difference Learning
In Monte Carlo, the targets for the Q updates are the returns R_t, so we have to wait until the end of the episode to make an update. Instead, use an approximation of R_t by bootstrapping:
Q^π(s, a) = E_π{ R_t | s_t = s, a_t = a }                                     (1)
          = E_π{ Σ_{k=0}^{T} γ^k r_{t+k+1} | s_t = s, a_t = a }               (2)
          = E_π{ r_{t+1} + γ Σ_{k=0}^{T} γ^k r_{t+k+2} | s_t = s, a_t = a }   (3)
          = E_π{ r_{t+1} + γ Q^π(s_{t+1}, a_{t+1}) | s_t = s, a_t = a }       (4)
This is called the Bellman equation for Q^π.

SARSA
Let's bootstrap and use the approximation for R_t:
Q^π(s_t, a_t) ← Q^π(s_t, a_t) + α (r_{t+1} + γ Q^π(s_{t+1}, a_{t+1}) - Q^π(s_t, a_t))
This is our new update rule, and the algorithm is called SARSA. For an update, we need the current state s_t and action a_t, the resulting reward r_{t+1}, the next state s_{t+1} and the next action a_{t+1}: (s, a, r, s', a').
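The update rule as a small Python function, again assuming a dict-based Q-table:

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """One SARSA step: move Q(s, a) towards the bootstrapped target
    r + gamma * Q(s', a')."""
    target = r + gamma * Q.get((s_next, a_next), 0.0)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (target - Q.get((s, a), 0.0))
```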

Q-Learning (Watkins, 1989)
SARSA was on-policy (we used the real trajectory). Q-Learning is an off-policy method: we can pick arbitrary (state, action, reward) triples to approximate the Q-function.
Q-Learning update rule:
Q^π(s_t, a_t) ← Q^π(s_t, a_t) + α (r_{t+1} + γ max_a Q^π(s_{t+1}, a) - Q^π(s_t, a_t))

Q-Learning vs. SARSA
Q-Learning update rule (off-policy):
Q^π(s_t, a_t) ← Q^π(s_t, a_t) + α (r_{t+1} + γ max_a Q^π(s_{t+1}, a) - Q^π(s_t, a_t))
SARSA update rule (on-policy):
Q^π(s_t, a_t) ← Q^π(s_t, a_t) + α (r_{t+1} + γ Q^π(s_{t+1}, a_{t+1}) - Q^π(s_t, a_t))
[Figure: Q-Learning backs up the maximum over the actions available in s', whereas SARSA backs up the value of the action a' actually taken in s'.]

Q-Learning Algorithm
1. initialize π and Q arbitrarily
2. choose a using the ε-greedy policy:
   a = π(s) = argmax_a Q(s, a)   if random(0, 1) ≥ ε
   a = random action from A(s)   otherwise
3. execute action a, observe r and s'
4. Q(s, a) ← Q(s, a) + α (r + γ max_{a'} Q(s', a') - Q(s, a))
5. s ← s'
6. goto 2
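The complete algorithm as a Python sketch; it reuses the hypothetical Environment and epsilon_greedy helpers sketched earlier:

```python
def q_learning(env, actions, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Tabular Q-learning with an epsilon-greedy behaviour policy (sketch)."""
    Q = {}
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            a = epsilon_greedy(Q, s, actions, epsilon)
            s_next, r, done = env.step(a)
            # Off-policy target: maximum Q-value over the next state's actions.
            best_next = 0.0 if done else max(Q.get((s_next, a2), 0.0) for a2 in actions)
            Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (r + gamma * best_next - Q.get((s, a), 0.0))
            s = s_next
    return Q
```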

Eligibility Traces
[Figure: a grid world with reward 1 at the power outlet and 0 everywhere else.]
How do the Q-values change when the robot walks from field to field? (γ = 1, α = 0.1)
Q^π(s_t, a_t) ← Q^π(s_t, a_t) + 0.1 (0 + 1 · Q^π(s_{t+1}, a_{t+1}) - Q^π(s_t, a_t))
            = Q^π(s_t, a_t) + 0.1 Q^π(s_{t+1}, a_{t+1}) - 0.1 Q^π(s_t, a_t)
            = 0.9 Q^π(s_t, a_t) + 0.1 Q^π(s_{t+1}, a_{t+1})
Answer: not much, we just stir around a bit in the Q-table. If we initialize the table with 0, then absolutely nothing happens.

Eligibility Traces
How do the Q-values change when the robot finds the power outlet? (γ = 1, α = 0.1)
Q^π(s_t, a_t) ← Q^π(s_t, a_t) + 0.1 (1 + 1 · Q^π(s_{t+1}, a_{t+1}) - Q^π(s_t, a_t))
            = Q^π(s_t, a_t) + 0.1 + 0.1 Q^π(s_{t+1}, a_{t+1}) - 0.1 Q^π(s_t, a_t)
            = 0.9 Q^π(s_t, a_t) + 0.1 Q^π(s_{t+1}, a_{t+1}) + 0.1
Answer: if the Q-table was initialized with 0, then the state-action pair that led to the goal now has a value of 0.1.

Eligibility Traces
[Figure: Q-values (here actually summarized as V-values) after 1, 2, 3, and many runs.]
With each run (episode), the value information moves backwards from the goal towards the start by only one state-action pair. This is slow!
Eligibility traces: remember where you came from. Use the trace (track, route, path, ...) to speed up convergence.

Eligibility Traces
How does it work? Each state-action pair (s, a) receives a counter e_t(s, a) that keeps track of its eligibility at time t (i.e. how recently it was visited).
Initially, all counters are set to 0 (no state-action pair has been visited). Whenever a state-action pair (s, a) is visited, we add +1 to its eligibility. All eligibilities decay over time: at each time step we multiply them by γλ, where 0 ≤ λ ≤ 1 is the trace decay parameter:
e_t(s, a) = γλ e_{t-1}(s, a)        if (s, a) ≠ (s_t, a_t)
e_t(s, a) = γλ e_{t-1}(s, a) + 1    if (s, a) = (s_t, a_t)
Main difference: after one step, we update the Q-values of all state-action pairs, but multiply each update by its eligibility e_t(s, a).
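The trace bookkeeping as a small Python helper (a sketch with a dict of eligibilities; the names are my own):

```python
def update_traces(e, visited, gamma=1.0, lam=0.9):
    """Decay all eligibilities by gamma*lambda and bump the visited pair by 1."""
    for key in e:
        e[key] *= gamma * lam
    e[visited] = e.get(visited, 0.0) + 1.0
    return e
```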

SARSA(λ)
Remember the SARSA(0) update rule:
Q^π(s_t, a_t) ← Q^π(s_t, a_t) + α (r_{t+1} + γ Q^π(s_{t+1}, a_{t+1}) - Q^π(s_t, a_t))
To save space, we substitute δ_t := r_{t+1} + γ Q^π_t(s_{t+1}, a_{t+1}) - Q^π_t(s_t, a_t).
The new SARSA(λ) update is then
Q^π(s, a) ← Q^π(s, a) + α δ_t e_t(s, a)   for all s, a.
There is a similar extension for Q(λ).
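Putting both pieces together, one SARSA(λ) step could look like the following sketch, where Q and e are dicts over (state, action) and update_traces is the helper above:

```python
def sarsa_lambda_step(Q, e, s, a, r, s_next, a_next,
                      alpha=0.1, gamma=1.0, lam=0.9):
    """One SARSA(lambda) step: compute the TD error delta_t once and
    apply it to every state-action pair, weighted by its eligibility."""
    delta = r + gamma * Q.get((s_next, a_next), 0.0) - Q.get((s, a), 0.0)
    e = update_traces(e, (s, a), gamma, lam)
    for key, elig in e.items():
        Q[key] = Q.get(key, 0.0) + alpha * delta * elig
    return Q, e
```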

Planning / Model-Based RL / Dyna-Q
[Figure: the standard agent-world loop. At each new step the world sends the state s_{t+1} and the reward to the agent, and the agent sends an action back; the arrows are labelled as input/output and teaching signal.]

Planning / Model-Based RL / Dyna-Q
[Figure: the same loop extended by a MODEL. The agent learns a model of the world from real experience (state_t, action → state_{t+1}, reward) and can use it as an additional source of simulated experience, as in Dyna-Q.]
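The diagram corresponds to the Dyna-Q idea; as a hedged illustration (not spelled out in these slides), after every real step the agent also replays a few transitions through the learned model:

```python
import random

def dyna_q_step(Q, model, s, a, r, s_next, actions,
                alpha=0.1, gamma=0.9, planning_steps=5):
    """One Dyna-Q step (sketch): learn from the real transition, store it in
    the model, then do a few planning updates from remembered transitions."""
    def q_update(Q, s, a, r, s_next):
        best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
        Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (r + gamma * best_next - Q.get((s, a), 0.0))

    q_update(Q, s, a, r, s_next)          # direct RL from real experience
    model[(s, a)] = (r, s_next)           # (deterministic) model learning
    for _ in range(planning_steps):       # planning from simulated experience
        (ps, pa), (pr, ps_next) = random.choice(list(model.items()))
        q_update(Q, ps, pa, pr, ps_next)
    return Q, model
```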