Reinforcement learning


Reinforcement learning

How to learn to make decisions in sequential problems (like chess, or a maze)?
Why is this difficult? Temporal credit assignment. Prediction can help.

Further reading
For modeling: Chapter 9, Dayan & Abbott, Theoretical Neuroscience (but very mathematical)
For dopamine: Schultz W (2002) Getting formal with dopamine and reward. Neuron 36.
For algorithms: Sutton RS & Barto AG, Reinforcement Learning: An Introduction

Classical conditioning
Pair a stimulus (bell, light) with a significant event (food, shock).
Measure anticipatory behavior (salivation, freezing).

Classical conditioning
Pair a stimulus (bell, light) with a significant event (food, shock); measure anticipatory behavior (salivation, freezing).
CS: Conditioned Stimulus (bell, light)
US: Unconditioned Stimulus (food, shock; reinforcement r)
CR: Conditioned Response (salivation, freezing)

Rescorla-Wagner rule (1972)
Experimental terms    Phase 1       Phase 2    Test
Acquisition:          S -> r                   S? -> response
Extinction:           S -> r        S -> -     S? -> no response
Partial Reinf.:       S -> r or -              S? -> weak response

Simple delta-rule model, with S = 1 if the stimulus is present (S = 0 if not):
w -> w + ε S δ;   δ = r - wS,   i.e. error-driven learning.
Equivalently, if S = 1: w -> (1 - ε) w + ε r.
w comes to estimate the reinforcement r associated with S.
Minimizes the error <(r - wS)²> whenever S is present.
[Diagram: stimulus S -> weight w -> prediction of reinforcement r]
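
A minimal sketch of this update in code (the learning rate, trial counts and the random 50% reinforcement schedule are illustrative assumptions, not values from the lecture):

```python
import random

def rescorla_wagner(rewards, eps=0.1, w=0.0):
    """Run the R-W update w <- w + eps*S*delta with the stimulus present (S = 1) on every trial."""
    history = []
    for r in rewards:
        delta = r - w            # prediction error: actual minus predicted reinforcement
        w += eps * 1.0 * delta   # S = 1
        history.append(w)
    return history

acquisition = rescorla_wagner([1.0] * 50)                                    # r = 1: w rises towards 1
extinction = rescorla_wagner([0.0] * 50, w=acquisition[-1])                  # r = 0: w decays back to 0
partial = rescorla_wagner([random.choice([0.0, 1.0]) for _ in range(300)])  # w hovers near <r> = 0.5
print(round(acquisition[-1], 2), round(extinction[-1], 2), round(partial[-1], 2))
```

The three runs reproduce the acquisition, extinction and partial-reinforcement curves sketched on the next slide.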

Rescorla-Wagner rule (1972), continued
[Figure: learning curves for w under acquisition (r = 1), extinction (r = 0), and partial reinforcement (<r> = 0.5)]

Multiple stimuli
What about when multiple stimuli are present? e.g. S1, S2 -> r
How would animals respond to S1 or S2? How should the model be modified?
w_i -> w_i + ε S_i δ_i, with either:
(a) δ_i = r - w_i S_i, i.e. a separate error term for each S_i
(b) δ_i = δ = r - V, with V = Σ_i w_i S_i
    V is the expected reinforcement r given all stimuli, i.e. a single error term for all stimuli S_i:
    δ is the difference between actual r and V (expected r).

Experiments with multiple stimuli
Experimental terms    Phase 1       Phase 2         Test
Overshadowing:        S1, S2 -> r                   S1? -> weak response
Blocking:             S1 -> r       S1, S2 -> r     S2? -> no response

Which model is favoured? Blocking (Kamin, 1969) and overshadowing (Kamin, 1969; Pavlov, 1927) imply:
w_i -> w_i + ε S_i δ;   δ = r - V;   V = Σ_i w_i S_i
i.e. a single error term for all stimuli (the Rescorla-Wagner rule):
δ = the difference between reinforcement and the expected reinforcement given all stimuli.
[Diagram: stimuli S1, S2 with weights w1, w2 converging on V]

Second-order conditioning
Phase 1: S1 -> r    Phase 2: S2 -> S1    Test: S2? What happens? S2 elicits a response.
The R-W rule does not work here: there is no r in Phase 2.
One problem: time (it is important that S2 precede S1 for the effect)
=> the temporal credit assignment problem.
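
A hedged sketch of why these results favour rule (b): train S1 alone, then the S1+S2 compound, under each candidate rule (learning rate and trial counts are arbitrary choices):

```python
def train(phase1, phase2, shared_error, eps=0.2):
    """Each phase is a list of (S1, S2, r) trials; returns the final weights (w1, w2)."""
    w = [0.0, 0.0]
    for trials in (phase1, phase2):
        for s1, s2, r in trials:
            stim = (s1, s2)
            if shared_error:
                d = r - (w[0] * s1 + w[1] * s2)          # rule (b): single error term r - V
                deltas = (d, d)
            else:
                deltas = (r - w[0] * s1, r - w[1] * s2)  # rule (a): separate error per stimulus
            for i in range(2):
                w[i] += eps * stim[i] * deltas[i]
    return w

phase1 = [(1, 0, 1.0)] * 40   # S1 -> r
phase2 = [(1, 1, 1.0)] * 40   # S1, S2 -> r
print("separate errors (a):", train(phase1, phase2, shared_error=False))  # w2 -> 1: no blocking
print("single error (b):   ", train(phase1, phase2, shared_error=True))   # w2 stays ~0: blocking
```

Only the single-error-term rule leaves S2 with a near-zero weight after Phase 2, matching the blocking result.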

Temporal-difference learning (Sutton)
We need to predict the sum of future rewards, not just the current reward, so that we can learn S2 -> S1, i.e. we want
V(t) = < Σ_{t' >= t} r(t') >
If V(t) = Σ_i w_i(t) S_i(t), how can we learn the right w_i?
[Diagram: stimuli S1, S2 with weights w1, w2 feeding the prediction V]

If V(t) = < Σ_{t' >= t} r(t') >, then V(t) = r(t) + V(t+1)
(current reward plus the estimate of subsequent rewards).

Temporal-difference learning (Sutton), continued
So use the delta rule to ensure that this happens, i.e. modify connection weights to bring V(t) closer to r(t) + V(t+1), i.e. use:
w_i(t+1) = w_i(t) + ε S_i δ(t);   δ(t) = [r(t) + V(t+1)] - V(t)
so that δ is the difference between V(t) and the estimate of all future reward.
Compare with the R-W rule: δ(t) = r(t) - V(t) (i.e. δ is the difference between V(t) and the current reward only).

Time
We need a representation of time as an input. Code time as the sequential activity of a set of input neurons t1, t2, t3, ..., each with its own weight w1, ..., wN. The output can then represent the expectations V(t1), V(t2), V(t3), ...
[Diagram: three copies of the network, with inputs t1, t2, t3 active in turn and outputs V(t1), V(t2), V(t3)]
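
A minimal sketch of this rule with time coded as one input unit per time step; the stimulus time, reward time, learning rate and number of trials are all assumptions made for illustration:

```python
T = 10                 # time steps per trial
t_stim, t_r = 3, 7     # stimulus onset and reward time (illustrative choices)
eps = 0.3
w = [0.0] * T          # one weight per "time since trial start" input unit

def run_trial(w, learn=True):
    """One pass through a trial, returning delta(t) = r(t) + V(t+1) - V(t) at each step."""
    deltas = []
    for t in range(T):
        r = 1.0 if t == t_r else 0.0
        V_t = w[t]
        V_next = w[t + 1] if t + 1 < T else 0.0
        delta = r + V_next - V_t
        if learn and t > t_stim:   # time units are active only after stimulus onset,
            w[t] += eps * delta    # so V at the stimulus time itself stays 0 (as in the slides)
        deltas.append(delta)
    return deltas

for _ in range(200):
    run_trial(w)
print([round(d, 2) for d in run_trial(w, learn=False)])
# After learning: delta is ~0 at the (now predicted) reward time, and the positive
# delta has moved back to the time of the stimulus.
```

The printed δ profile shows exactly the trial-by-trial shift described on the next two slides, run to convergence.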

Temporal difference learning rule:
w_i -> w_i + ε S_i δ(t_i);   δ(t) = r(t) + V(t+1) - V(t)

Trial 1: the reward at time t_r is unexpected, so δ(t_r) = 1 + 0 - 0 = 1, and the weight w for the time unit at t_r grows, so V(t_r) increases.
[Figure: δ(t) and w across trial 1, with the reward at t_r]

Trial 2: δ(t_r - 1) = 0 + 1 - 0 = 1, while δ(t_r) = 1 + 0 - 1 = 0: the prediction error has moved one step earlier.
[Figure: δ(t) and w across trial 2]

Temporal difference learning rule, continued:
w_i -> w_i + ε S_i δ(t_i);   δ(t) = r(t) + V(t+1) - V(t)

Trial 3: δ(t_r) = 1 + 0 - 1 = 0, δ(t_r - 2) = 0 + 1 - 0 = 1, δ(t_r - 1) = 0 + 1 - 1 = 0: the error moves back another step.
[Figure: δ(t) and w across trial 3]

Many trials later, the expected future reward V increases as soon as the stimulus occurs:
δ(t_s) = 0 + 1 - 0 = 1 at the stimulus, and δ(t_s + 1) = 0 + 1 - 1 = 0 thereafter.
[Figure: after learning, δ(t) has a single peak at the stimulus and nothing at the reward]

Does dopamine signal δ(t)? Dopamine is implicated in reward, self-stimulation, addiction, motor control and synaptic plasticity.

Dopamine responses interpreted as δ(t) (Schultz, Dayan & Montague, Science, 1997)
δ(t) = r(t) + V(t+1) - V(t)
Burst to an unexpected reward.
After many trials, the response transfers to the predictors (the V(t+1) term).
Pause at the time of an omitted, expected reward.
[Figure: dopamine recordings for an unexpected reward, a predicted reward, and an omitted reward (marked X)]
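
A toy check of these cases, assuming a value function that has already been learned so that V = 1 from the CS until just before the reward time:

```python
def delta(r, V_t, V_next):
    """The slides' prediction-error signal: delta(t) = r(t) + V(t+1) - V(t)."""
    return r + V_next - V_t

print(delta(r=1, V_t=0, V_next=0))   # +1: unexpected reward -> burst
print(delta(r=0, V_t=0, V_next=1))   # +1: prediction appears at the CS -> burst transfers to the predictor
print(delta(r=1, V_t=1, V_next=0))   # 0: fully predicted reward -> no response
print(delta(r=0, V_t=1, V_next=0))   # -1: omitted reward -> pause at the expected reward time
```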

Increasing the probability of reward (given the stimulus) gives more dopamine response to the stimulus.
Partial reinforcement task (Fiorillo, Tobler & Schultz). Accords with TD models: at the stimulus, δ(t) = r(t) + V(t+1) - V(t) grows with V(t+1), the probability-weighted expected reward (e.g. p = 0.5).
[Figure: dopamine responses to the stimulus for different reward probabilities]

Action choice
In operant (aka instrumental) conditioning, rewards are contingent on actions (e.g., a lever press); cf. classical conditioning (rewards depend only on stimuli, and we just model the expectation of reward).
Consider a simple bee foraging problem:
Choose between yellow and blue flowers.
Each pays off probabilistically, with different amounts of nectar r_y versus r_b.
Bees rapidly learn to choose the richer colour, and re-learn when r_y and r_b are switched.
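
A rough sketch of such a choice model, under assumptions of my own (fixed rather than probabilistic nectar amounts, a softmax choice rule, and arbitrary learning and temperature parameters): each flower colour's value is learned with a delta rule, and choices follow the current values.

```python
import math, random

random.seed(0)
nectar = {"yellow": 1.0, "blue": 0.2}        # assumed payoffs; switched halfway through
value = {"yellow": 0.0, "blue": 0.0}         # learned value of each flower colour
eps, beta = 0.1, 3.0                         # learning rate and softmax "choosiness"

def choose():
    """Sample a colour with probability proportional to exp(beta * value)."""
    weights = {c: math.exp(beta * v) for c, v in value.items()}
    x = random.random() * sum(weights.values())
    for colour, wgt in weights.items():
        x -= wgt
        if x <= 0:
            return colour
    return colour

for visit in range(400):
    if visit == 200:
        nectar = {"yellow": 0.2, "blue": 1.0}     # experimenter switches the payoffs
    flower = choose()
    r = nectar[flower]
    value[flower] += eps * (r - value[flower])    # delta-rule update for the chosen flower

print({c: round(v, 2) for c, v in value.items()})
# The learned values track the current payoffs, so after the switch the forager comes to prefer blue.
```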

Modeling action choice
Actor-critic architecture: use the value function V to decide on actions so as to maximize expected reward.
If V is correct, it allows future reward to be taken into account when reinforcing actions, solving the temporal credit assignment problem (we can also evaluate whether actions are good or bad using δ(t) even when r(t) = 0).
[Diagram: the state (blue/yellow) feeds a critic computing v and a dopamine-like signal δ(t) = r(t) + v(t+1) - v(t), which trains both the critic weights and the actor that chooses the action (blue or yellow)]

Sequential choice
Suppose we know how to learn what to do at B and C (e.g. choose the action associated with the maximum r).
How do we solve the temporal credit assignment problem at A?
It requires second-order conditioning: A-B-r; A-C-r.
Use temporal difference learning to learn the values (expected future reward) at B and C; use these to learn the best action at A ("direct actor"): the actor/critic architecture.
[Diagram: state or stimulus input -> critic (value v(t)) and actor (output o(t) or action), both trained by δ(t) = r(t) + v(t+1) - v(t) (dopamine?)]

State evaluation
The critic learns to estimate the value of states S_i(t): V = < Σ_{t' >= t} r(t') >.
Given an initially random choice of actions L/R (the "policy"), the values V are learned by changing the w_i with the temporal difference rule:
w_i -> w_i + ε S_i δ(t);   δ(t) = r(t) + V(t+1) - V(t)
[Diagram: maze with start A branching to B and C, then to terminal arms D, E, F, G; critic weights w_A, w_B, w_C, w_D, w_E; plot of the probability of turning left at A]

Policy improvement
The actor learns to act, using V to calculate δ and δ to assess actions: δ(t) = r(t) + V(t+1) - V(t).
If the animal goes left at A, δ(t) = V(B) - V(A) = 0.75; if it goes right at A, δ(t) = V(C) - V(A) = -0.75.
Thus it should choose left more frequently from A: w_i -> w_i + ε S_i δ(t), now applied to the actor weights (e.g. w_L).
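
A small actor-critic sketch of this kind of maze; the payoffs at B and C are placeholders (the lecture's actual numbers are not reproduced here), as are the learning rate and the sigmoid choice rule:

```python
import math, random

random.seed(1)
# Toy maze: at A you turn Left (to B) or Right (to C); B and C are terminal.
reward = {"B": 1.0, "C": 0.25}      # placeholder payoffs
V = {"A": 0.0, "B": 0.0, "C": 0.0}
pref_left = 0.0                     # actor's preference for turning left at A
eps = 0.1

def p_left():
    return 1.0 / (1.0 + math.exp(-pref_left))    # probability of turning left

for trial in range(2000):
    go_left = random.random() < p_left()
    nxt = "B" if go_left else "C"
    # Step A -> B or C: no reward yet, so delta = 0 + V(next) - V(A)
    delta = V[nxt] - V["A"]
    V["A"] += eps * delta                                # critic: state evaluation
    pref_left += eps * delta * (1 if go_left else -1)    # actor: policy improvement
    # Terminal step: reward is delivered, so delta = r + 0 - V(next)
    delta = reward[nxt] - V[nxt]
    V[nxt] += eps * delta                                # critic update for B or C

print({s: round(v, 2) for s, v in V.items()}, "P(left at A) =", round(p_left(), 2))
# The critic's values and the actor's preference for the richer branch rise together.
```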

Back to dopamine
Dual dopamine systems project the same signal to motivational (ventral striatum) and motor (dorsal striatum) areas: for state evaluation and policy improvement respectively?
[Figure: macaque brain showing the caudate and putamen (actor?) and the nucleus accumbens (critic?)]
Striatal targets of DA show a prediction-error-like signal for reinforcement learning (e.g. Seymour et al., 2004).

Summary of the temporal difference (sequential reinforcement) solution to the temporal credit assignment problem
In A_RP the strength of the smell of the tree gives a reinforcement with which to evaluate any action (i.e. output o(t)): does it lead closer to the goal or not?
If reinforcement is intermittent (e.g. only when the goal is reached), a critic learns an evaluation function: the value v of each state (or input) S is the expected future reward from that state (given how actions are usually made).
The change in value v(t+1) - v(t), plus any reward r(t), is used to evaluate any action o(t), so the output weights w can be modified as in A_RP, using δ(t) = v(t+1) - v(t) + r(t) (if δ(t) > 0, action o(t) was good):
w -> w + ε S(t) δ(t)
But how to learn v? A simple learning rule creates connection weights w so that v(t) = w·S(t). This is:
w -> w + ε S(t) δ(t),
i.e. you can also use δ to learn the weights for the critic!
[Diagram: state or input S(t) (e.g. place cells) -> critic (value v(t)) and actor (output o(t) or action), both trained by δ(t) = r(t) + v(t+1) - v(t) (dopamine?)]

Intro. to temporal difference learning (spatial example)
Use δ(t) = v(t+1) - v(t) + r(t) to evaluate actions (North, South, East, West).
Use v(t) = w·S(t), where w -> w + ε S(t) δ(t), to learn the value function.
First encounter the goal at location x: r = 1, so increase the weight to the critic from the place cell at x. Now v(x) = 1.
The next time you go to x from x', v(t+1) - v(t) = 1, so increase the weight to the critic from the place cell at x'. Now v(x') = 1.
The next time you go to x' from x'', v(t+1) - v(t) = 1, so increase the weight to the critic from the place cell at x'', and so on.
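
A minimal sketch of this spatial example, simplified to a 1-D corridor of place cells explored by a random walk (the corridor length, learning rate and run counts are arbitrary choices):

```python
import random

random.seed(0)
N = 8                 # positions 0..7 along a 1-D "corridor"; the goal is position 7
w = [0.0] * N         # one critic weight per place cell, so v(x) = w[x]
eps = 0.5

def run_to_goal():
    x = 0
    while x != N - 1:
        x_next = max(0, min(N - 1, x + random.choice([-1, 1])))   # random exploration
        r = 1.0 if x_next == N - 1 else 0.0
        v_next = 0.0 if x_next == N - 1 else w[x_next]            # the goal is terminal
        delta = r + v_next - w[x]          # delta(t) = r + v(t+1) - v(t)
        w[x] += eps * delta                # increase the weight from the place cell at x
        x = x_next

run_to_goal()
print("after 1 run:   ", [round(v, 2) for v in w[:-1]])  # only the cell next to the goal has learned
for _ in range(200):
    run_to_goal()
print("after many runs:", [round(v, 2) for v in w[:-1]])  # value has spread back towards the start
```

As in the slide's account, the place cell adjacent to the goal learns first, and the value then propagates backwards one state per visit.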

Temporal difference learning is a slow, trial-and-error based method. Developing its own value function overcomes the problem of temporal credit assignment.
[Figure: agent, barrier and goal in an arena; the learned value function after initial exploration (e.g. 100 runs to the goal) and after 2000 runs to the goal]
Initial actions will be random (it takes a long time to get to the goal); actions improve as a result of learning, as in the A_RP action network (but a random element is required to enable ongoing improvement, i.e. finding new shorter routes).
Dayan (1991) Neural Information Processing Systems 3. Morgan Kaufmann.

Temporal discounting
Often we value (temporally) distant rewards less than immediate rewards of the same size. We can do this with a small change to the reinforcement learning rule.
We want to predict the sum of future rewards, discounted by how long you have to wait for them, i.e. we want
V(t) = < Σ_{t' >= t} γ^(t' - t) r(t') >, where γ < 1.
If V(t) = < Σ_{t' >= t} γ^(t' - t) r(t') >, then V(t) = r(t) + γ V(t+1).
So use the delta rule to ensure that this happens, i.e. modify connection weights to bring V(t) closer to r(t) + γ V(t+1), i.e. use:
w_i(t+1) = w_i(t) + ε S_i δ(t), where δ(t) = [r(t) + γ V(t+1)] - V(t),
so that δ is the difference between V(t) and the estimate of all (temporally discounted) future reward.
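
The corridor sketch above can illustrate the effect of the single change δ = r + γ·v(t+1) - v(t): with γ < 1 the learned values now fall off with distance from the goal rather than all approaching 1 (the parameter values below are again assumptions made for illustration):

```python
import random

random.seed(0)
N, eps, gamma = 8, 0.1, 0.8
w = [0.0] * N          # one critic weight per place cell along the corridor; goal at N-1

for _ in range(2000):
    x = 0
    while x != N - 1:
        x_next = max(0, min(N - 1, x + random.choice([-1, 1])))
        r = 1.0 if x_next == N - 1 else 0.0
        v_next = 0.0 if x_next == N - 1 else w[x_next]
        delta = r + gamma * v_next - w[x]   # the only change: discount the future value by gamma
        w[x] += eps * delta
        x = x_next

print([round(v, 2) for v in w[:-1]])  # values now fall off with (temporal) distance from the goal
```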
