Classical Conditioning IV: TD learning in the brain


Classical Conditioning IV: TD learning in the brain
PSY/NEU338: Animal learning and decision making: Psychological, computational and neural perspectives

recap: Marr's levels of analysis
David Marr (1945-1980) proposed three levels of analysis:
1. the problem (Computational Level)
2. the strategy (Algorithmic Level)
3. how it is actually done by networks of neurons (Implementational Level)

let's start over, this time from the top...

V_t = E[ Σ_{i=t+1}^{∞} r_i ]                 want to predict expected sum of future reinforcement
V_t = E[ Σ_{i=t+1}^{∞} γ^{i-t-1} r_i ]       want to predict expected sum of discounted future reinforcement (0 < γ < 1)
V_t = E[ Σ_{i=t+1}^{T} r_i ]                 want to predict expected sum of future reinforcement in a trial/episode

let's start over, this time from the top...

V_t = E[ r_{t+1} + r_{t+2} + ... + r_T ]
    = E[ r_{t+1} ] + E[ r_{t+2} + ... + r_T ]        (note: t indexes time within a trial)
    = E[ r_{t+1} ] + V_{t+1}

V_t = E[ Σ_{i=t+1}^{T} r_i ]                 want to predict expected sum of future reinforcement in a trial/episode
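To make these targets concrete, here is a minimal Python sketch (mine, not from the slides) that evaluates the episodic and discounted sums for one sample reward sequence; the names rewards, gamma and t, and the particular values, are illustrative assumptions, and a single sampled sequence stands in for the expectation E.

```python
# Minimal sketch (assumed example values, not from the slides): the quantities
# V_t is supposed to predict, evaluated on one sample reward sequence.
rewards = [0.0, 0.0, 1.0, 0.0, 2.0]   # r_1 ... r_T for one trial/episode
gamma = 0.9                            # discount factor, 0 < gamma < 1
t = 1                                  # predict from time t (1-indexed, as on the slide)

# sum of future reinforcement within the trial: Σ_{i=t+1}^{T} r_i
episodic_return = sum(rewards[t:])

# sum of discounted future reinforcement: Σ_{i=t+1} γ^(i-t-1) r_i
discounted_return = sum(gamma ** (i - t - 1) * r
                        for i, r in enumerate(rewards[t:], start=t + 1))

print(episodic_return, discounted_return)   # 3.0 and ≈ 2.36
```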

let's start over, this time from the top...

V_t = E[ r_{t+1} + r_{t+2} + ... + r_T ]
    = E[ r_{t+1} ] + E[ r_{t+2} + ... + r_T ]        (note: t indexes time within a trial)
    = E[ r_{t+1} ] + V_{t+1}

Think football. What would be a sensible learning rule here? How is this different from Rescorla-Wagner?

Temporal Difference (TD) learning
Marr's 3 levels: The algorithm        (note: t indexes time within a trial, T indexes trials)

V_t = E[ r_{t+1} ] + V_{t+1}
V_t^new = V_t^old + η (r_{t+1} + V_{t+1}^old - V_t^old)        the term in parentheses is the temporal difference prediction error δ(t+1)

compare to Rescorla-Wagner: V_{T+1} = V_T + η (r_T - V_T)

Sutton & Barto 1983, 1990
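As a concrete illustration, here is a minimal Python sketch (mine, not from the slides) of the two update rules side by side; the function names and the learning rate eta are assumptions.

```python
def td_update(V, t, r_next, eta=0.1):
    """One TD update within a trial:
    V_t <- V_t + eta * (r_{t+1} + V_{t+1} - V_t),
    where V is a list of predictions for each time step and predictions
    beyond the end of the trial are taken to be 0."""
    V_next = V[t + 1] if t + 1 < len(V) else 0.0
    delta = r_next + V_next - V[t]     # temporal difference prediction error delta(t+1)
    V[t] += eta * delta
    return delta

def rw_update(V_trial, r_trial, eta=0.1):
    """Compare: Rescorla-Wagner updates one prediction once per trial,
    V_{T+1} = V_T + eta * (r_T - V_T)."""
    return V_trial + eta * (r_trial - V_trial)
```

Calling td_update(V, t, rewards[t + 1]) for t = 0 ... T-1 sweeps once through a trial, updating a prediction at every time step, whereas the per-trial R-W rule updates a single prediction once per trial.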

Temporal Difference (TD) learning
Marr's 3 levels: The algorithm

V_t^new = V_t^old + η (r_{t+1} + V_{t+1}^old - V_t^old)        prediction error δ(t+1)

beginning of trial:  r_t = 0 and V_{t-1} = 0,  so  δ(t) = V_t
middle of trial:     r_t = 0,                  so  δ(t) = V_t - V_{t-1}
end of trial:        V_t = 0,                  so  δ(t) = r_t - V_{t-1}

Sutton & Barto 1983, 1990

dopamine
[Figure: dopamine neuron recordings from Schultz et al, 1997, annotated with the corresponding TD prediction errors: δ(t) = r_t for an unpredicted reward; δ(t) = V_t at the CS and δ(t) = r_t - V_{t-1} at a predicted reward; δ(t) = V_t at the CS and δ(t) = 0 - V_{t-1} when the predicted reward is omitted]

simulation
what would happen with partial reinforcement?
what would happen in second order conditioning?
(a sketch of such a simulation appears below)

A note on time book-keeping

V_t^new = V_t^old + η (r_{t+1} + V_{t+1}^old - V_t^old)        prediction error δ(t+1)

the prediction error δ can be defined at any time point, but can only be based on information the animal already has (not information that it will get in the future!)
so, we can say δ(t+1) = r_{t+1} + V_{t+1} - V_t as above, because at time (t+1) the information regarding r_{t+1} and V_{t+1} is already known
we can also say δ(t) = r_t + V_t - V_{t-1} in pretty much the same way
we cannot say δ(t) = r_{t+1} + V_{t+1} - V_t, as it just would not make logical sense!
importantly, in all cases, δ is used to update the preceding prediction, that is, δ(t+1) is used to update V_t and δ(t) is used to update V_{t-1}
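One way to run the simulation the slide asks about is sketched below (my illustration, not the course's code): a CS at one time step, reward a few steps later, and the per-time-step TD rule applied over many trials. The indexing follows the book-keeping note above: δ(t+1) is computed from r_{t+1} and V_{t+1} and is used to update V_t. Every parameter value here (trial length, CS and reward times, eta, number of trials, reward probability) is an assumption; setting p_reward below 1 gives the partial-reinforcement case.

```python
import random

T = 20                          # time steps per trial (assumed)
cs_time, reward_time = 5, 15    # CS onset and reward delivery within the trial (assumed)
eta, n_trials = 0.2, 500
p_reward = 1.0                  # set to e.g. 0.5 to simulate partial reinforcement

V = [0.0] * (T + 1)             # per-time-step predictions; V beyond the trial stays 0

for trial in range(n_trials):
    r = [0.0] * (T + 1)
    if random.random() < p_reward:
        r[reward_time] = 1.0
    deltas = [0.0] * T
    for t in range(T):
        # delta(t+1) is based only on information available at time t+1 ...
        deltas[t] = r[t + 1] + V[t + 1] - V[t]
        # ... and updates the preceding prediction V_t; before CS onset there is
        # no stimulus to carry a prediction (as in "beginning of trial": V_{t-1} = 0),
        # so those predictions are left at zero.
        if t >= cs_time:
            V[t] += eta * deltas[t]

# After training, the prediction error (dopamine, on the hypothesis below) peaks
# at CS onset rather than at the now fully predicted reward; loop index
# cs_time - 1 corresponds to delta(cs_time).
print("largest delta at step", max(range(T), key=lambda s: deltas[s]))
```

Extending this to second-order conditioning would require a second CS whose value is learned from the first CS's prediction rather than from reward itself.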

what does the theory explain?

                          R-W    TD
acquisition                ✓      ✓
extinction                 ✓      ✓
blocking                   ✓      ✓
overshadowing              ✓      ✓
temporal relationships     ✗      ✓
overexpectation            ✓      ✓
2nd order conditioning     ✗      ✓

Summary so far...
Temporal difference learning is a better version of Rescorla-Wagner learning:
- derived from first principles (from the definition of the problem)
- explains everything that R-W does, and more (e.g. 2nd order conditioning)
- basically a generalization of R-W to real time

break!

Back to Marr's three levels
The algorithm: temporal difference learning
Neural implementation: does the brain use TD learning?

we already saw this:
[Figure: Schultz et al, 1997, again: δ(t) = r_t; δ(t) = V_t and δ(t) = r_t - V_{t-1}; δ(t) = V_t and δ(t) = 0 - V_{t-1}]

the prediction error hypothesis of dopamine
The idea: Dopamine encodes a temporal difference reward prediction error (Montague, Dayan, Barto, mid 90's)
Schultz et al, 1993; Tobler et al, 2005

the prediction error hypothesis of dopamine
[Figures: Fiorillo et al, 2003; Bayer & Glimcher (2005): measured firing rate plotted against model prediction error]

where does dopamine project to?
main target: striatum in the basal ganglia (also prefrontal cortex)

the basal ganglia: afferents
inputs to striatum are from all over the cortex (and they are topographic)
Voorn et al, 2004

a precise microstructure

dopamine and synaptic plasticity
prediction errors are for learning
cortico-striatal synapses show dopamine-dependent plasticity
three-factor learning rule: need presynaptic + postsynaptic + dopamine
Wickens et al, 1996

summary
- Classical conditioning can be viewed as prediction learning
- The problem: prediction of future reward
- The algorithm: temporal difference learning
- Neural implementation: dopamine-dependent learning in the BG
- A computational model of learning allows us to look in the brain for hidden variables postulated by the model
- Precise (normative!) theory for the generation of dopamine firing patterns
- Explains anticipatory dopaminergic responding and 2nd order conditioning
- A compelling account of the role of dopamine in classical conditioning: prediction error drives prediction learning

if you are confused or intrigued: additional reading
- Rescorla & Wagner (1972) - A theory of Pavlovian conditioning: Variations in the effectiveness of reinforcement and nonreinforcement - the original chapter that is so well cited (and well written!)
- Sutton & Barto (1990) - Time-derivative models of Pavlovian reinforcement - shows step by step why TD learning is a suitable rule for modeling classical conditioning
- Niv & Schoenbaum (2008) - Dialogues on prediction errors - a guide for the perplexed
- Barto (1995) - Adaptive critic and the basal ganglia - a very clear exposition of TD learning in the basal ganglia
(all will be on BlackBoard)