Reinforcement Learning. George Konidaris


Reinforcement Learning George Konidaris gdk@cs.brown.edu Fall 2017

Machine Learning. Subfield of AI concerned with learning from data. Broadly: using experience to improve performance on some task (Tom Mitchell, 1997).

AI vs ML vs Statistics vs Data Mining

Why? Developing effective learning methods has proved difficult. Why bother? Autonomous discovery: we don't know something and want to find out. Hard to program: it is easier to specify the task and collect data. Adaptive behavior: our agents should adapt to new data and unforeseen circumstances.

Types of Machine Learning. Depends on the feedback available: labeled data: supervised learning. No feedback, just data: unsupervised learning. Sequential data, weak labels: reinforcement learning.

Supervised Learning. Input: training data consisting of inputs $X = \{x_1, \ldots, x_n\}$ and labels $Y = \{y_1, \ldots, y_n\}$. Learn to predict new labels. Given $x$: what is $y$?

Unsupervised Learning. Input: inputs $X = \{x_1, \ldots, x_n\}$ only. Try to understand the structure of the data. E.g., how many types of cars are there? How can they vary?

Reinforcement Learning. The learning counterpart of planning: learn a policy $\pi : S \rightarrow A$ that maximizes the return $R = \sum_{t=0}^{\infty} \gamma^t r_t$.

MDPs. The agent interacts with an environment. At each time $t$ it receives a sensor signal $s_t$, executes an action $a_t$, and the transition yields a new sensor signal $s_{t+1}$ and a reward $r_t$. Goal: find a policy $\pi$ that maximizes the expected return (the sum of discounted future rewards): $\max_\pi E[R] = \max_\pi E\left[\sum_{t=0}^{\infty} \gamma^t r_t\right]$.

Markov Decision Processes. An MDP is a tuple $\langle S, A, \gamma, R, T\rangle$, where $S$ is the set of states, $A$ is the set of actions, and $\gamma$ is the discount factor. $R$ is the reward function: $R(s, a, s')$ is the reward received for taking action $a$ from state $s$ and transitioning to state $s'$. $T$ is the transition function: $T(s' \mid s, a)$ is the probability of transitioning to state $s'$ after taking action $a$ in state $s$.
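As a concrete illustration of this tuple, here is a minimal sketch of a tabular MDP in Python. The class name, the dictionary layout for T and R, and the step method are illustrative assumptions, not something defined on the slides.

```python
import random

class MDP:
    """A finite MDP <S, A, gamma, R, T> with tabular models (illustrative sketch)."""

    def __init__(self, states, actions, T, R, gamma=0.95):
        self.states = states      # S: list of states
        self.actions = actions    # A: list of actions
        self.T = T                # T[(s, a)] -> {s_next: probability}
        self.R = R                # R[(s, a, s_next)] -> reward
        self.gamma = gamma        # discount factor

    def step(self, s, a):
        """Sample s' ~ T(. | s, a) and return (s', reward)."""
        next_states, probs = zip(*self.T[(s, a)].items())
        s_next = random.choices(next_states, weights=probs)[0]
        return s_next, self.R[(s, a, s_next)]
```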

RL vs. Planning. In planning: the transition function ($T$) is known, the reward function ($R$) is known, and computation is offline. In reinforcement learning: one or both of $T$ and $R$ are unknown, acting in the world is the only source of data, and transitions are executed, not simulated.

Reinforcement Learning

RL This formulation is general enough to encompass a wide variety of learned control problems.

MDPs. As before, our target is a policy $\pi : S \rightarrow A$; a policy maps states to actions. The optimal policy maximizes $E[R(s)] = E\left[\sum_{t=0}^{\infty} \gamma^t r_t \mid s_0 = s\right]$ for every $s$: we wish to find a policy that maximizes the return from every state.

Planning via Policy Iteration. In planning, we used policy iteration to find an optimal policy: 1. Start with a policy $\pi$. 2. Estimate $V^\pi$. 3. Improve $\pi$: a. $\pi(s) = \arg\max_a E\left[r + \gamma V(s')\right]$, for all $s$. Repeat. More precisely, we use a value function $V^\pi(s) = E\left[\sum_{i=0}^{\infty} \gamma^i r_i\right]$, and then update $\pi$ by computing $\pi(s) = \arg\max_a \sum_{s'} T(s, a, s')\left[R(s, a, s') + \gamma V(s')\right]$. We can't do this anymore once $T$ and $R$ are unknown.
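To make the planning-time update concrete, here is a sketch of the greedy improvement step, assuming the illustrative tabular MDP class above; the function name and dictionary access pattern are also assumptions.

```python
def improve_policy(mdp, V):
    """One greedy improvement step with a known model (the planning setting):
    pi(s) = argmax_a sum_s' T(s, a, s') [ R(s, a, s') + gamma * V(s') ]."""
    pi = {}
    for s in mdp.states:
        best_a, best_val = None, float("-inf")
        for a in mdp.actions:
            val = sum(p * (mdp.R[(s, a, s2)] + mdp.gamma * V[s2])
                      for s2, p in mdp.T[(s, a)].items())
            if val > best_val:
                best_a, best_val = a, val
        pi[s] = best_a
    return pi
```

In RL this exact computation is unavailable because it sums over $T$ and $R$, which motivates the state-action value function on the next slide.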

Value Functions. For learning, we use a state-action value function: $Q^\pi(s, a) = E\left[\sum_{i=0}^{\infty} \gamma^i r_i \mid s_0 = s, a_0 = a\right]$. This is the value of executing $a$ in state $s$, then following $\pi$. Note that $V^\pi(s) = Q^\pi(s, \pi(s))$.

Policy Iteration. This leads to a general policy improvement framework: 1. Start with a policy $\pi$. 2. Learn $Q^\pi$. 3. Improve $\pi$: a. $\pi(s) = \arg\max_a Q(s, a)$, for all $s$. Repeat. Steps 2 and 3 can be interleaved as rapidly as you like; usually, step 3a is performed every time step.

Value Function Learning Learning proceeds by gathering samples of Q(s, a). Methods differ by: How you get the samples. How you use them to update Q.

Monte Carlo. Simplest thing you can do: sample the return $R(s)$ by rolling out the policy and summing the rewards along the trajectory. Do this repeatedly and average the values: $Q(s, a) = \frac{R_1(s) + R_2(s) + \ldots + R_n(s)}{n}$.
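A minimal sketch of this averaging, assuming episodes are recorded as lists of (state, action, reward) tuples and Q is a dictionary keyed by (state, action). This first-visit, incremental-average form is one common variant, not the slides' exact procedure.

```python
from collections import defaultdict

def mc_update(Q, counts, episode, gamma):
    """First-visit Monte Carlo update: average sampled returns for each (s, a).
    `episode` is a list of (s, a, r) tuples from one rollout."""
    G = 0.0
    tagged = []
    for (s, a, r) in reversed(episode):      # compute returns backwards
        G = r + gamma * G
        tagged.append((s, a, G))
    tagged.reverse()                          # forward order for the first-visit check
    visited = set()
    for (s, a, G) in tagged:
        if (s, a) not in visited:
            visited.add((s, a))
            counts[(s, a)] += 1
            # incremental form of the average: Q <- Q + (G - Q) / n
            Q[(s, a)] += (G - Q[(s, a)]) / counts[(s, a)]

# Usage (hypothetical): Q = defaultdict(float); counts = defaultdict(int)
# mc_update(Q, counts, [(s0, a0, r0), (s1, a1, r1)], gamma=0.99)
```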

Temporal Difference Learning. Where can we get more (immediate) samples? Idea: use the Bellman equation, $V(s) = E_{s'}\left[r(s, \pi(s), s') + \gamma V(s')\right]$, which relates the value of this state to the immediate reward plus the discounted value of the next state.

TD Learning. Ideally, and in expectation: $r_i + \gamma Q(s_{i+1}, a_{i+1}) - Q(s_i, a_i) = 0$. $Q$ is correct if this holds in expectation for all states. When it does not, the difference between the sampled target $r_t + \gamma Q(s_{t+1}, a_{t+1})$ and the current estimate $Q(s_t, a_t)$ is the temporal difference error.

Sarsa. Sarsa is a very simple algorithm: 1. Initialize Q[s][a] = 0. 2. For n episodes: observe state s; select a = argmax_a Q[s][a]; observe the transition (s, a, r, s', a'); compute the TD error δ = r + γQ(s', a') − Q(s, a), where Q(s', a') is zero by definition if s' is absorbing; update Q: Q(s, a) = Q(s, a) + αδ; if not at the end of the episode, repeat.
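Here is a runnable sketch of the tabular algorithm above. The environment interface (reset() returning a state, step(a) returning (next_state, reward, done)) is an assumption for illustration; the slide's purely greedy action selection is kept, with exploration addressed a few slides below.

```python
from collections import defaultdict

def sarsa(env, actions, episodes, alpha=0.1, gamma=0.99):
    """Tabular Sarsa sketch following the pseudocode above (assumed env interface)."""
    Q = defaultdict(float)                               # Q[(s, a)] = 0 initially

    def greedy(s):
        # Greedy selection, as on the slide; see the exploration slides below.
        return max(actions, key=lambda a: Q[(s, a)])

    for _ in range(episodes):
        s = env.reset()
        a = greedy(s)
        done = False
        while not done:
            s2, r, done = env.step(a)
            a2 = greedy(s2)
            target = r if done else r + gamma * Q[(s2, a2)]   # Q is zero at absorbing states
            delta = target - Q[(s, a)]                        # TD error
            Q[(s, a)] += alpha * delta
            s, a = s2, a2
    return Q
```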


Exploration vs. Exploitation. Always take argmax_a Q(s, a)? That exploits current knowledge. But what if your current knowledge is wrong? How are you going to find out? You must explore to gain new knowledge. Exploration vs. exploitation, i.e., when to try new things, is a consistent theme of RL.

Exploration vs. Exploitation. How to balance the two? The simplest and most popular approach: instead of always acting greedily (argmax_a Q(s, a)), explore with probability ε: take argmax_a Q(s, a) with probability 1 − ε, and a random action with probability ε (e.g., ε ≈ 0.1). This is ε-greedy exploration: it is very simple and ensures asymptotic coverage of the state space.
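A minimal ε-greedy selection helper, assuming the same (state, action)-keyed Q dictionary as the Sarsa sketch above; swapping it in for that sketch's greedy() helper gives ε-greedy Sarsa.

```python
import random

def epsilon_greedy(Q, s, actions, epsilon=0.1):
    """With probability epsilon take a random action; otherwise act greedily."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])
```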

TD vs. MC. TD and MC are two extremes of obtaining samples of Q: TD bootstraps after a single step, using a target of the form $r + \gamma V(s')$, while MC waits until the end of the episode (time $t = L$) and uses the complete return $\sum_i \gamma^i r_i$.

Generalizing TD. We can generalize this to the idea of an n-step rollout. Each n-step rollout tells us something about the value function, and we can combine all of the n-step rollouts. This is known as a complex backup.

TD(λ). Take a weighted sum of the n-step returns: $R^{(1)} = r_0 + \gamma V(s_1)$, $R^{(2)} = r_0 + \gamma r_1 + \gamma^2 V(s_2)$, ..., $R^{(n)} = \sum_{i=0}^{n-1} \gamma^i r_i + \gamma^n V(s_n)$, with the $n$-th return weighted by $\lambda^{n-1}$. Estimator: $R^\lambda = (1 - \lambda)\sum_{n=1}^{\infty} \lambda^{n-1} R^{(n)}$.

Sarsa(λ). This is called the λ-return. At λ = 0 we recover one-step Sarsa; at λ = 1 we recover MC. Intermediate values of λ are usually best. This gives the TD(λ) family of algorithms.
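A sketch of the forward-view λ-return for a finite episode, under the convention that values[i] holds V(s_{i+1}) and is zero at a terminal state; the function name and argument layout are assumptions. (The implementation slide that follows uses the equivalent backward view with eligibility traces instead.)

```python
def lambda_return(rewards, values, gamma, lam):
    """Forward-view lambda-return R^lambda for a finite trajectory.
    rewards[i] = r_i; values[i] = V(s_{i+1}), with 0 at a terminal state."""
    L = len(rewards)
    # n-step returns: R^(n) = sum_{i<n} gamma^i r_i + gamma^n V(s_n)
    n_step = []
    for n in range(1, L + 1):
        G = sum((gamma ** i) * rewards[i] for i in range(n)) + (gamma ** n) * values[n - 1]
        n_step.append(G)
    # (1 - lam) * sum_n lam^(n-1) R^(n); the final, complete return
    # absorbs the remaining weight lam^(L-1).
    G_lam = (1 - lam) * sum((lam ** (n - 1)) * n_step[n - 1] for n in range(1, L))
    G_lam += (lam ** (L - 1)) * n_step[L - 1]
    return G_lam
```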

Sarsa(λ): Implementation. Each state-action pair has an eligibility trace $e(s, a)$. At time $t$: $e(s_t, a_t) = 1$, and $e(s, a) = \gamma\lambda e(s, a)$ for all other $(s, a)$ pairs. At the end of an episode: $e(s, a) = 0$ for all $(s, a)$ pairs. When updating: compute $\delta$ as before, then $Q(s, a) = Q(s, a) + \alpha\delta e(s, a)$ for each $(s, a)$ pair.

Sarsa(λ): Implementation. 1. Initialize Q[s][a] = 0 for all (s, a). 2. Initialize e[s][a] = 0 for all (s, a). 3. For n episodes: observe state s_t; select a_t = argmax_a Q(s_t, a); observe the transition (s_t, a_t, r_t, s_{t+1}, a_{t+1}); compute the TD error δ = r_t + γQ(s_{t+1}, a_{t+1}) − Q(s_t, a_t), where Q(s_{t+1}, a_{t+1}) is zero by definition if s_{t+1} is absorbing; set e(s_t, a_t) = 1 and decay the other traces, e(s, a) = γλe(s, a); update Q: Q(s, a) = Q(s, a) + αδe(s, a) for all (s, a); if not at the end of the episode, repeat. At the end of the episode, set e[s][a] = 0 for all (s, a) (don't forget!).
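A runnable sketch of the procedure above, reusing the assumed env interface and greedy selection from the Sarsa sketch; keeping the traces in a dictionary is an implementation choice, not part of the slides.

```python
from collections import defaultdict

def sarsa_lambda(env, actions, episodes, alpha=0.1, gamma=0.99, lam=0.9):
    """Tabular Sarsa(lambda) with replacing eligibility traces (assumed env interface)."""
    Q = defaultdict(float)

    def greedy(s):
        return max(actions, key=lambda a: Q[(s, a)])

    for _ in range(episodes):
        e = defaultdict(float)                       # traces reset to 0 each episode
        s = env.reset()
        a = greedy(s)
        done = False
        while not done:
            s2, r, done = env.step(a)
            a2 = greedy(s2)
            target = r if done else r + gamma * Q[(s2, a2)]
            delta = target - Q[(s, a)]               # TD error, as before
            for key in e:                            # decay every existing trace
                e[key] *= gamma * lam
            e[(s, a)] = 1.0                          # set the visited pair's trace to 1
            for key, trace in e.items():             # credit all traced pairs
                Q[key] += alpha * delta * trace
            s, a = s2, a2
    return Q
```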


Next Week: More Realism