ROB 537: Learning-Based Control
Week 5, Lecture 1: Policy Gradient, Eligibility Traces, Transfer Learning (Matt Taylor)

Announcements:
- Project background due today
- HW 3 due on 10/30
- Midterm exam on 11/6

Reading: Transfer Learning survey by Taylor & Stone

Policy Gradient: Gradient Descent
The objective is the mean squared error of the value estimate, weighted by a state distribution p(s):
$$MSE(\theta_t) = \sum_{s \in S} p(s) \left[ V^\pi(s) - V_t(s) \right]^2$$
Basic parameter update (based on uniform p(s)):
$$\Delta\theta_t = -\tfrac{1}{2} \alpha \nabla_{\theta_t} \left[ V^\pi(s_t) - V_t(s_t) \right]^2 = \alpha \left[ V^\pi(s_t) - V_t(s_t) \right] \nabla_{\theta_t} V_t(s_t)$$
Update the parameters based on the negative of the gradient.
Gradient Descent Sarsa
Gradient descent for state-action values:
$$\Delta\theta_t = \alpha \left[ Q^\pi(s_t, a_t) - Q_t(s_t, a_t) \right] \nabla_{\theta_t} Q_t(s_t, a_t)$$
$$\Delta\theta_t = \alpha \left[ r_{t+1} + \gamma Q_t(s_{t+1}, a_{t+1}) - Q_t(s_t, a_t) \right] \nabla_{\theta_t} Q_t(s_t, a_t)$$

Linear Methods
A special case of gradient descent methods: $V_t$ is linear with respect to the parameters $\theta_t$. There is a vector of features $\phi_s$ for each state, and function approximation becomes:
$$V_t(s) = \sum_{i=1}^{n} \theta_t(i) \, \phi_s(i)$$
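To make the linear case concrete, here is a minimal sketch (an illustration, not code from the lecture) of a semi-gradient TD(0) update with a linear value function, where the unknown $V^\pi(s_t)$ is replaced by the bootstrapped target $r_{t+1} + \gamma V_t(s_{t+1})$; the feature vectors, step size, and discount are assumptions:

```python
import numpy as np

def td0_linear_update(theta, phi_s, phi_s_next, r, alpha=0.1, gamma=0.95):
    """One semi-gradient TD(0) step for V_t(s) = theta . phi(s).

    For a linear approximator, grad_theta V_t(s) is simply phi(s),
    so the gradient-descent update reduces to alpha * delta * phi(s).
    """
    delta = r + gamma * theta @ phi_s_next - theta @ phi_s  # TD error
    return theta + alpha * delta * phi_s

# Example with made-up 4-dimensional features:
theta = np.zeros(4)
phi_s, phi_s_next = np.array([1.0, 0, 1, 0]), np.array([0, 1.0, 0, 1])
theta = td0_linear_update(theta, phi_s, phi_s_next, r=1.0)
```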
Coarse Coding
- Cover the space with overlapping features (for example, circles)
- Weights are 0 or 1
- Value is based on coverage
(Figure: a point x in a 2-D state space covered by overlapping circles.)

Tile Coding
- Exhaustive partitions (tilings)
- The number of active features equals the number of tilings
(Figure: the same point x covered by several offset rectangular tilings.)
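As an illustration of tile coding, here is a sketch for a single 1-D state variable; the number of tiles, number of tilings, and offset scheme are assumptions, not values from the lecture. Each tiling contributes exactly one active (value 1) feature:

```python
def tile_features(x, low=0.0, high=1.0, n_tiles=8, n_tilings=4):
    """Return the indices of the active features for scalar x.

    Each tiling partitions [low, high] into n_tiles intervals and is
    shifted by a fraction of a tile width, so exactly one feature per
    tiling is active: n_tilings active features in total.
    """
    width = (high - low) / n_tiles
    active = []
    for t in range(n_tilings):
        offset = t * width / n_tilings              # shift this tiling
        idx = int((x - low + offset) / width)
        idx = min(idx, n_tiles - 1)                 # clamp at the boundary
        active.append(t * n_tiles + idx)            # global feature index
    return active

print(tile_features(0.53))  # -> [4, 12, 20, 28]
```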
Radial Basis Functions
- Coarse coding generalized to continuous-valued features
- Each feature can be weighted by a real number (instead of 0 or 1)
(Figure: Gaussian features with centers $c_{i-1}$, $c_i$, $c_{i+1}$ and width $\sigma_i$.)
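A minimal sketch of radial basis features (the Gaussian form and the parameter values are assumptions): instead of 0/1 membership, each feature responds with a real value that falls off with distance from its center $c_i$:

```python
import numpy as np

def rbf_features(s, centers, sigma=0.2):
    """Gaussian RBF features: phi_s(i) = exp(-||s - c_i||^2 / (2 sigma^2))."""
    diffs = np.asarray(centers) - np.asarray(s)
    return np.exp(-np.sum(diffs**2, axis=1) / (2.0 * sigma**2))

# Four illustrative centers on a 2-D state space:
centers = np.array([[0.25, 0.25], [0.25, 0.75], [0.75, 0.25], [0.75, 0.75]])
print(rbf_features([0.3, 0.4], centers))
```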
Eligibility Traces
- Eligibility traces are the basic mechanism for assigning credit in reinforcement learning
- In TD(λ), λ refers to the trace-decay parameter of the eligibility trace
- Basic idea:
  - Determine how many rewards will be used in evaluating a value function
  - Determine which states are eligible for an update
  - Connect TD learning and Monte Carlo learning

Eligibility Traces: Two Views
- Forward view: theoretical
  - Connects TD learning and Monte Carlo learning
  - In TD learning, only the previous state is eligible
  - In Monte Carlo learning, all visited states are eligible
  - Intuition: how far ahead do you need to look to figure out the value of a state?
- Backward view: mechanistic
  - A temporary record of events: visiting states, taking actions, receiving rewards
  - Intuition: how many state-value pairs do I need to store to figure out the value of a state?
Forward View
Recall Monte Carlo methods for a T-step episode:
$$R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots + \gamma^{T-t-1} r_T$$
The update is based on all the returns. In a TD backup:
$$R_t^{(1)} = r_{t+1} + \gamma V_t(s_{t+1})$$
The update is based on a 1-step backup. Why restrict ourselves to these two extremes?

Forward View: n-step TD
Why not look at a 2-step backup:
$$R_t^{(2)} = r_{t+1} + \gamma r_{t+2} + \gamma^2 V_t(s_{t+2})$$
How about an n-step backup?
$$R_t^{(n)} = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots + \gamma^{n-1} r_{t+n} + \gamma^n V_t(s_{t+n})$$
This is called the corrected n-step truncated return, or simply the n-step return.
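A small sketch of the n-step return (illustrative, not code from the lecture), computed from the first n rewards after time t and the current value estimate at $s_{t+n}$:

```python
def n_step_return(rewards, v_end, gamma=0.95, n=3):
    """R_t^(n) = r_{t+1} + gamma*r_{t+2} + ... + gamma^(n-1)*r_{t+n}
                 + gamma^n * V_t(s_{t+n}).

    rewards: [r_{t+1}, ..., r_{t+n}] (at least n entries);
    v_end:   the current estimate V_t(s_{t+n}).
    """
    g = sum(gamma**i * r for i, r in enumerate(rewards[:n]))
    return g + gamma**n * v_end

# n = 1 recovers the one-step TD target; large n approaches Monte Carlo.
print(n_step_return([1.0, 0.0, 1.0], v_end=2.0))
```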
Forward View (Sutton & Barto)
(Figure: the forward view, from Sutton & Barto; each state looks forward in time to the rewards and states that follow it.)

Backward View
- A causal, incremental view; implementable
- Introduce a new memory variable: the eligibility trace $e_t(s)$
- On each step the eligibility trace of every state decays by $\gamma\lambda$, and the eligibility trace of the visited state is increased by 1:
$$e_t(s) = \begin{cases} \gamma \lambda \, e_{t-1}(s) & \text{if } s \neq s_t \\ \gamma \lambda \, e_{t-1}(s) + 1 & \text{if } s = s_t \end{cases}$$
- $\lambda$ is the trace-decay parameter; $\gamma$ is the discount rate used previously
Intuition
- At any time, eligibility traces keep track of which states have been recently visited, where "recently" is quantified by $\gamma\lambda$
- The traces capture how eligible a state is for an update
- Recall basic reinforcement learning:
  NewEstimate ← OldEstimate + StepSize [Target − OldEstimate]
- Now, update all values based on the eligibility trace

Implementation
For TD, the error was:
$$\delta_t = r(s_t) + \gamma V(s_{t+1}) - V(s_t)$$
Then the update becomes:
$$V_{t+1}(s) = V_t(s) + \alpha \, \delta_t \, e_t(s) \quad \text{for all } s \in S$$
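Putting the backward view together, a minimal tabular TD(λ) step might look like the sketch below (the dictionary representation and the accumulating-trace variant are assumptions):

```python
def td_lambda_step(V, e, s, s_next, r, alpha=0.1, gamma=0.95, lam=0.8):
    """One backward-view TD(lambda) step over tabular V and traces e."""
    delta = r + gamma * V[s_next] - V[s]   # TD error delta_t
    e[s] += 1.0                            # visited state becomes eligible
    for x in V:                            # update *all* states...
        V[x] += alpha * delta * e[x]       # ...in proportion to their trace
        e[x] *= gamma * lam                # then decay every trace
    return V, e

V = {"A": 0.0, "B": 0.0, "C": 0.0}
e = {s: 0.0 for s in V}
V, e = td_lambda_step(V, e, s="A", s_next="B", r=1.0)
```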
Special Cases
Let's explore two special cases of this update rule:
$$V_{t+1}(s) = V_t(s) + \alpha \, \delta_t \, e_t(s) \quad \text{for all } s \in S$$
- For $\lambda = 0$: all traces are 0 at time t except for the state visited at t, $s_t$; the rule reduces to the one-step TD update
- For $\lambda = 1$: all states receive credit decaying by $\gamma$; this gives the Monte Carlo updates. If $\gamma = 1$ as well, there is no decay in time and we have undiscounted Monte Carlo updates
Forward and Backward Views
(Figures: the forward view and the backward view, from Sutton & Barto.)
Sarsa(λ)
Recall Sarsa learning:
$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right]$$
Now, we have:
$$Q(s, a) \leftarrow Q(s, a) + \alpha \, \delta_t \, e_t(s, a) \quad \text{for all } s, a$$
where
$$\delta_t = r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)$$
and, for all $s, a$:
$$e_t(s, a) = \begin{cases} \gamma \lambda \, e_{t-1}(s, a) + 1 & \text{if } s = s_t \text{ and } a = a_t \\ \gamma \lambda \, e_{t-1}(s, a) & \text{otherwise} \end{cases}$$
Increase the eligibility of the pair that is picked; decay the eligibility of all others.

Example: Sarsa(λ)
With one-step Sarsa, only the last state-action pair receives an update; all other state-action pairs are updated only when they are visited/taken next.
(Figure: a gridworld path, the action values increased by one-step Sarsa, and the action values increased by Sarsa(λ) with λ = 0.9.)
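A tabular Sarsa(λ) step, as a sketch under the same assumptions as the TD(λ) sketch above (dict-based tables, accumulating traces):

```python
def sarsa_lambda_step(Q, e, s, a, r, s_next, a_next,
                      alpha=0.1, gamma=0.95, lam=0.9):
    """One Sarsa(lambda) step; Q and e are dicts keyed by (state, action)."""
    delta = r + gamma * Q[(s_next, a_next)] - Q[(s, a)]
    e[(s, a)] += 1.0                       # bump the trace of the pair taken
    for sa in Q:                           # credit every eligible pair
        Q[sa] += alpha * delta * e[sa]
        e[sa] *= gamma * lam
    return Q, e
```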
Q(λ)
Recall Q-learning:
$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_a Q(s_{t+1}, a) - Q(s_t, a_t) \right]$$
Now, we have:
$$Q(s, a) \leftarrow Q(s, a) + \alpha \, \delta_t \, e_t(s, a) \quad \text{for all } s, a$$
where
$$\delta_t = r_{t+1} + \gamma \max_a Q(s_{t+1}, a) - Q(s_t, a_t)$$
$$e_t(s, a) = \begin{cases} I_{s s_t} I_{a a_t} + \gamma \lambda \, e_{t-1}(s, a) & \text{if } Q_{t-1}(s_t, a_t) = \max_a Q_{t-1}(s_t, a) \\ I_{s s_t} I_{a a_t} & \text{otherwise} \end{cases}$$

Q(λ)
Q-learning with eligibility traces:
- As in Sarsa(λ), except the eligibility traces are set to zero after an exploratory (non-greedy) move
- Intuition: a two-step process
  - Step 1: either every state-action pair's trace is decayed by $\gamma\lambda$, or all traces are set to 0 after an exploratory step
  - Step 2: the trace for the current state-action pair is incremented by 1
- (This is Watkins's Q(λ); there are other versions as well.)
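The only changes from the Sarsa(λ) sketch above are the max in the TD error and the trace cut after a non-greedy action. A sketch of one common arrangement; the `was_greedy` flag would be set by the caller's exploration policy, and `actions` is an assumed list of available actions:

```python
def watkins_q_lambda_step(Q, e, s, a, r, s_next, actions, was_greedy,
                          alpha=0.1, gamma=0.95, lam=0.9):
    """One Watkins's Q(lambda) step; Q and e keyed by (state, action)."""
    delta = r + gamma * max(Q[(s_next, b)] for b in actions) - Q[(s, a)]
    e[(s, a)] += 1.0                       # step 2: bump the current pair
    for sa in Q:
        Q[sa] += alpha * delta * e[sa]
        # step 1 (for the next update): decay if the move was greedy,
        # otherwise cut all traces to zero
        e[sa] = gamma * lam * e[sa] if was_greedy else 0.0
    return Q, e
```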
Back to Gradient Descent Sarsa
Gradient descent for state-action values:
$$\Delta\theta_t = \alpha \left[ Q^\pi(s_t, a_t) - Q_t(s_t, a_t) \right] \nabla_{\theta_t} Q_t(s_t, a_t)$$
$$\Delta\theta_t = \alpha \left[ r_{t+1} + \gamma Q_t(s_{t+1}, a_{t+1}) - Q_t(s_t, a_t) \right] \nabla_{\theta_t} Q_t(s_t, a_t)$$
Gradient descent Sarsa is a 1-step backup; we can extend it to eligibility traces.

Gradient Descent Sarsa(λ)
$$\Delta\theta_t = \alpha \, \delta_t \, e_t$$
where
$$\delta_t = r_{t+1} + \gamma Q_t(s_{t+1}, a_{t+1}) - Q_t(s_t, a_t)$$
$$e_t = \gamma \lambda \, e_{t-1} + \nabla_{\theta_t} Q_t(s_t, a_t)$$
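With function approximation the trace lives in parameter space rather than state space. A sketch for the linear case (feature construction and hyperparameters are assumptions):

```python
import numpy as np

def grad_sarsa_lambda_step(theta, e, phi_sa, phi_sa_next, r,
                           alpha=0.05, gamma=0.95, lam=0.9):
    """Semi-gradient Sarsa(lambda) for Q_t(s, a) = theta . phi(s, a).

    For a linear Q, grad_theta Q_t(s_t, a_t) = phi(s_t, a_t), so the
    trace update e_t = gamma*lam*e_{t-1} + grad Q is a simple vector sum.
    """
    delta = r + gamma * theta @ phi_sa_next - theta @ phi_sa
    e = gamma * lam * e + phi_sa
    theta = theta + alpha * delta * e
    return theta, e
```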
Transfer Learning
- So far, we have learned every policy from scratch; this can be unnecessarily slow
- Humans can use past information (e.g., soccer with different numbers of players)
- How do we leverage learned knowledge in novel tasks?
- Bias learning: a speedup method
Transfer Learning
Matthew E. Taylor and Peter Stone. Transfer Learning for Reinforcement Learning Domains: A Survey. Journal of Machine Learning Research, 10(1), 2009. (Slides courtesy of Matt Taylor.)

Primary Questions
- Is it possible to transfer learned knowledge?
- Is it possible to transfer without providing a task mapping?
- The focus is on reinforcement learning tasks; there is plenty of work in supervised settings
- Source task: $S_{SOURCE}, A_{SOURCE}$; target task: $S_{TARGET}, A_{TARGET}$
Value Function Transfer
(Figure: the agent-environment loop in the source task, with state $S_S$, action $A_S$, and reward $R_S$, and the corresponding loop in the target task with $S_T$, $A_T$, and $R_T$.)
- $Q_S$ is not defined on $S_T$ and $A_T$
- $\rho(Q_S(S_S, A_S)) = Q_T(S_T, A_T)$, where $Q_S: S_S \times A_S \to \mathbb{R}$ and $Q_T: S_T \times A_T \to \mathbb{R}$
- The action-value function is transferred; $\rho$ is task-dependent and relies on inter-task mappings

Inter-Task Mappings
(Figure: mappings between the source and target tasks.)
Inter-Task Mappings
- $\chi_X: s_{target} \to s_{source}$: given a state variable in the target task (some $x_i$ from $s = \langle x_1, x_2, \dots, x_n \rangle$), return the corresponding state variable in the source task
- $\chi_A: a_{target} \to a_{source}$: given an action in the target task, return the corresponding action in the source task
- Source task: state variables $x_1 \dots x_k$ and actions $\{a_1 \dots a_j\}$; target task: state variables $x_1 \dots x_n$ and actions $\{a_1 \dots a_m\}$
- Intuitive mappings exist in some domains (oracle); the mappings are used to construct the transfer function $\rho$

Q-Value Reuse
- Can be considered a type of reward shaping
- Directly use Q-values from the source task to bias the learner in the target task
- Not function-approximator specific; no initialization step is needed between learning the two tasks
- Drawbacks: increased lookup time and larger memory requirements
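As a concrete sketch of using $\chi_X$ and $\chi_A$ to build $\rho$: states are dicts of named variables, and the returned wrapper evaluates the source Q on the mapped state and action. All names here are hypothetical, not from the actual Keepaway implementation:

```python
# Hypothetical inter-task mappings: target name -> source name
chi_X = {"dist_K1_T1": "dist_player_opponent"}
chi_A = {"Hold": "Hold", "Pass1": "Pass1", "Pass2": "Pass2", "Pass3": "Pass2"}

def make_Q_target(Q_source, chi_X, chi_A):
    """rho: wrap the source action-value function with the mappings."""
    def Q_target(s_target, a_target):
        # Keep only the target variables that have a source counterpart
        s_source = {chi_X[name]: value
                    for name, value in s_target.items() if name in chi_X}
        return Q_source(s_source, chi_A[a_target])
    return Q_target
```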
Keepaway [Stone, Sutton, and Kuhlmann 2005]
(Figure: keepers $K_1$, $K_2$, $K_3$ and takers $T_1$, $T_2$ on the field.)
- Goal: maintain possession of the ball
- 3 vs. 2: 5 agents, 3 (stochastic) actions, 13 (noisy & continuous) state variables
- The keeper with the ball may hold it or pass to either teammate; both takers move towards the player with the ball
- 4 vs. 3: 7 agents, 4 actions, 19 state variables

Transfer Learning in Keepaway
- Learn 3 vs. 2 Keepaway, or 4 vs. 3
- Can learning transfer from 3 vs. 2 to 4 vs. 3?
Learning Keepaway
- Sarsa updates; RBF and neural network approximation are both successful
- $Q^\pi(s, a)$: the predicted number of steps the episode will last
- Reward = +1 for every timestep

ρ's Effect on the Function Approximator
For each weight in the 4 vs. 3 function approximator, use the inter-task mapping to find the corresponding 3 vs. 2 weight.
(Figure: 3 vs. 2 weights mapped through ρ to 4 vs. 3 weights.)
Keepaway: Hand-Coded χ_A
Actions in 4 vs. 3 have similar actions in 3 vs. 2:
- Hold (4v3) → Hold (3v2)
- Pass1 (4v3) → Pass1 (3v2)
- Pass2 (4v3) → Pass2 (3v2)
- Pass3 (4v3) → Pass2 (3v2)

Value Function Transfer: Time to Threshold in 4 vs. 3
(Figure: total simulator hours versus the number of 3 vs. 2 episodes, split into average 3 vs. 2 time and average 4 vs. 3 (target task) time, compared against the no-transfer baseline.)
Example Transfer Domains
- Series of mazes with different goals [Fernandez and Veloso, 2006]
- Mazes with different structures [Konidaris and Barto, 2007]
- Keepaway with different numbers of players [Taylor and Stone, 2005]
- Keepaway to Breakaway [Torrey et al., 2005]
Transfer Learning Across Domains
- So far, we talked about transfer learning within a single domain (task: an MDP; domain: a setting for semantically similar tasks)
- What about cross-domain transfer? The source task could be much simpler, and the source and target can be less similar
- What about transferring policies (or just rules)?

Rule Transfer Overview
Learn π → Learn $D_{source}$ → Translate ($D_{source}$ → $D_{target}$) → Use $D_{target}$
1. Learn a policy (π: S → A) in the source task (TD, policy search, model-based, etc.)
2. Learn a decision list, $D_{source}$, summarizing π
3. Translate $D_{source}$ → $D_{target}$ so it applies to the target task (state variables and actions can differ between the two tasks)
4. Use $D_{target}$ to learn a policy in the target task
This allows for different learning methods and function approximators in the source and target tasks.
Rule Transfer Details: Learn π
- For example, use Sarsa to learn $Q: S \times A \to \mathbb{R}$ (the expected return); other learning methods are possible
(Figure: the agent-environment loop in the source task.)

Rule Transfer Details: Learn $D_{source}$
- Use the learned policy to record ⟨state, action⟩ pairs
- Extract a rule list from them (for example, using JRip (RIPPER) in Weka):
  IF s1 < 4 and s2 > 5 → a1
  ELSEIF s1 < 3 → a2
  ELSEIF s3 > 7 → a1
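A decision list like the one above is just an ordered set of if/elif rules. A hypothetical $D_{source}$ mirroring the slide's rules (the fallback action is an assumption; the slides do not specify one):

```python
def D_source(s):
    """Decision list summarizing the learned source policy.

    s = (s1, s2, s3); rules are checked in order, first match wins.
    """
    s1, s2, s3 = s
    if s1 < 4 and s2 > 5:
        return "a1"
    elif s1 < 3:
        return "a2"
    elif s3 > 7:
        return "a1"
    return "a2"  # fallback when no rule fires (assumed)

print(D_source((2, 6, 0)))  # -> "a1" (first rule matches)
```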
Rule Transfer Details: Translate ($D_{source}$ → $D_{target}$)
Inter-task mappings again:
- $\chi_X: s_{target} \to s_{source}$: given a state variable in the target task (some $x_i$ from $s = \langle x_1, x_2, \dots, x_n \rangle$), return the corresponding state variable in the source task
- $\chi_A: a_{target} \to a_{source}$: similar, but for actions
- rule + ($\chi_X$, $\chi_A$) → translated rule

Keepaway example:
- $\chi_A$ pairs target actions with source actions: Hold Ball ↔ Stay; Pass to $K_2$ ↔ Run Near; Pass to $K_3$ ↔ Run Far
- $\chi_X$ pairs dist($K_1$, $T_1$) ↔ dist(Player, Opponent)
- So the source rule "IF dist(Player, Opponent) > 4 → Stay" translates to "IF dist($K_1$, $T_1$) > 4 → Hold Ball"
Rule Transfer Details: Use $D_{target}$
There are many possible ways to use $D_{target}$ (assuming a TD learner in the target task; this should generalize to other learning methods):
- Value bonus (shaping)
- Extra action (initially force the agent to select it)
- Extra variable (initially force the agent to select it)

Worked example (value bonus): evaluate the agent's 3 actions in state $s = \langle s_1, s_2 \rangle$: $Q(s_1, s_2, a_1) = 5$, $Q(s_1, s_2, a_2) = 3$, $Q(s_1, s_2, a_3) = 4$. Since $D_{target}(s) = a_2$, a bonus is added to that action's value: $Q(s_1, s_2, a_2) = 3 + 4 = 7$, so the agent takes action $a_2$.

Problem Statement
- A meta-planning problem for agent learning: learn from scratch, or learn through a series of similar tasks?
- Start with a similar, simple task and move closer to the target task: MDP → MDP → MDP → MDP
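A minimal sketch of the value-bonus option, using the numbers from the worked example above (the +4 bonus size is read off that example; all other names are illustrative):

```python
def act_with_bonus(Q, s, actions, D_target, bonus=4.0):
    """Pick the action maximizing Q plus a bonus on D_target's suggestion."""
    def score(a):
        return Q[(s, a)] + (bonus if a == D_target(s) else 0.0)
    return max(actions, key=score)

Q = {("s", "a1"): 5.0, ("s", "a2"): 3.0, ("s", "a3"): 4.0}
print(act_with_bonus(Q, "s", ["a1", "a2", "a3"], lambda s: "a2"))  # -> a2
```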