Classical Conditioning IV: TD learning in the brain


Classical Conditioning IV: TD learning in the brain
PSY/NEU338: Animal learning and decision making: Psychological, computational and neural perspectives

recap: Marr's levels of analysis
David Marr (1945-1980) proposed three levels of analysis:
1. the problem (Computational Level)
2. the strategy (Algorithmic Level)
3. how it is actually done by networks of neurons (Implementational Level)

let's start over, this time from the top...

V_t = E[ Σ_{i=t+1}^{∞} r_i ]                 want to predict expected sum of future reinforcement
V_t = E[ Σ_{i=t+1}^{∞} γ^{i-t-1} r_i ]       want to predict expected sum of discounted future reinforcement (0 < γ < 1)
V_t = E[ Σ_{i=t+1}^{T} r_i ]                 want to predict expected sum of future reinforcement in a trial/episode

let's start over, this time from the top...

V_t = E[ r_{t+1} + r_{t+2} + ... + r_T ]
    = E[ r_{t+1} ] + E[ r_{t+2} + ... + r_T ]        (note: t indexes time within a trial)
    = E[ r_{t+1} ] + V_{t+1}

V_t = E[ Σ_{i=t+1}^{T} r_i ]                 want to predict expected sum of future reinforcement in a trial/episode
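To make these targets concrete, here is a minimal Python sketch (mine, not from the slides) that evaluates the episodic and discounted sums for one sample reward sequence; the names rewards, gamma and t, and the particular values, are illustrative assumptions, and a single sampled sequence stands in for the expectation E.

```python
# Minimal sketch (assumed example values, not from the slides): the quantities
# V_t is supposed to predict, evaluated on one sample reward sequence.
rewards = [0.0, 0.0, 1.0, 0.0, 2.0]   # r_1 ... r_T for one trial/episode
gamma = 0.9                            # discount factor, 0 < gamma < 1
t = 1                                  # predict from time t (1-indexed, as on the slide)

# sum of future reinforcement within the trial: Σ_{i=t+1}^{T} r_i
episodic_return = sum(rewards[t:])

# sum of discounted future reinforcement: Σ_{i=t+1} γ^(i-t-1) r_i
discounted_return = sum(gamma ** (i - t - 1) * r
                        for i, r in enumerate(rewards[t:], start=t + 1))

print(episodic_return, discounted_return)   # 3.0 and ≈ 2.36
```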

let's start over, this time from the top...

V_t = E[ r_{t+1} + r_{t+2} + ... + r_T ]
    = E[ r_{t+1} ] + E[ r_{t+2} + ... + r_T ]        (note: t indexes time within a trial)
    = E[ r_{t+1} ] + V_{t+1}

Think football. What would be a sensible learning rule here? How is this different from Rescorla-Wagner?

Temporal Difference (TD) learning
Marr's 3 levels: The algorithm        (note: t indexes time within a trial, T indexes trials)

V_t = E[ r_{t+1} ] + V_{t+1}
V_t^new = V_t^old + η (r_{t+1} + V_{t+1}^old - V_t^old)        the term in parentheses is the temporal difference prediction error δ(t+1)

compare to Rescorla-Wagner: V_{T+1} = V_T + η (r_T - V_T)

Sutton & Barto 1983, 1990
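As a concrete illustration, here is a minimal Python sketch (mine, not from the slides) of the two update rules side by side; the function names and the learning rate eta are assumptions.

```python
def td_update(V, t, r_next, eta=0.1):
    """One TD update within a trial:
    V_t <- V_t + eta * (r_{t+1} + V_{t+1} - V_t),
    where V is a list of predictions for each time step and predictions
    beyond the end of the trial are taken to be 0."""
    V_next = V[t + 1] if t + 1 < len(V) else 0.0
    delta = r_next + V_next - V[t]     # temporal difference prediction error delta(t+1)
    V[t] += eta * delta
    return delta

def rw_update(V_trial, r_trial, eta=0.1):
    """Compare: Rescorla-Wagner updates one prediction once per trial,
    V_{T+1} = V_T + eta * (r_T - V_T)."""
    return V_trial + eta * (r_trial - V_trial)
```

Calling td_update(V, t, rewards[t + 1]) for t = 0 ... T-1 sweeps once through a trial, updating a prediction at every time step, whereas the per-trial R-W rule updates a single prediction once per trial.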

Temporal Difference (TD) learning
Marr's 3 levels: The algorithm

V_t^new = V_t^old + η (r_{t+1} + V_{t+1}^old - V_t^old)        prediction error δ(t+1)

beginning of trial:  r_t = 0 and V_{t-1} = 0,  so  δ(t) = V_t
middle of trial:     r_t = 0,                  so  δ(t) = V_t - V_{t-1}
end of trial:        V_t = 0,                  so  δ(t) = r_t - V_{t-1}

Sutton & Barto 1983, 1990

dopamine
[Figure: dopamine neuron recordings from Schultz et al, 1997, annotated with the corresponding TD prediction errors: δ(t) = r_t for an unpredicted reward; δ(t) = V_t at the CS and δ(t) = r_t - V_{t-1} at a predicted reward; δ(t) = V_t at the CS and δ(t) = 0 - V_{t-1} when the predicted reward is omitted]

simulation
what would happen with partial reinforcement?
what would happen in second order conditioning?
(a sketch of such a simulation appears below)

A note on time book-keeping

V_t^new = V_t^old + η (r_{t+1} + V_{t+1}^old - V_t^old)        prediction error δ(t+1)

the prediction error δ can be defined at any time point, but can only be based on information the animal already has (not information that it will get in the future!)
so, we can say δ(t+1) = r_{t+1} + V_{t+1} - V_t as above, because at time (t+1) the information regarding r_{t+1} and V_{t+1} is already known
we can also say δ(t) = r_t + V_t - V_{t-1} in pretty much the same way
we cannot say δ(t) = r_{t+1} + V_{t+1} - V_t, as it just would not make logical sense!
importantly, in all cases, δ is used to update the preceding prediction, that is, δ(t+1) is used to update V_t and δ(t) is used to update V_{t-1}
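One way to run the simulation the slide asks about is sketched below (my illustration, not the course's code): a CS at one time step, reward a few steps later, and the per-time-step TD rule applied over many trials. The indexing follows the book-keeping note above: δ(t+1) is computed from r_{t+1} and V_{t+1} and is used to update V_t. Every parameter value here (trial length, CS and reward times, eta, number of trials, reward probability) is an assumption; setting p_reward below 1 gives the partial-reinforcement case.

```python
import random

T = 20                          # time steps per trial (assumed)
cs_time, reward_time = 5, 15    # CS onset and reward delivery within the trial (assumed)
eta, n_trials = 0.2, 500
p_reward = 1.0                  # set to e.g. 0.5 to simulate partial reinforcement

V = [0.0] * (T + 1)             # per-time-step predictions; V beyond the trial stays 0

for trial in range(n_trials):
    r = [0.0] * (T + 1)
    if random.random() < p_reward:
        r[reward_time] = 1.0
    deltas = [0.0] * T
    for t in range(T):
        # delta(t+1) is based only on information available at time t+1 ...
        deltas[t] = r[t + 1] + V[t + 1] - V[t]
        # ... and updates the preceding prediction V_t; before CS onset there is
        # no stimulus to carry a prediction (as in "beginning of trial": V_{t-1} = 0),
        # so those predictions are left at zero.
        if t >= cs_time:
            V[t] += eta * deltas[t]

# After training, the prediction error (dopamine, on the hypothesis below) peaks
# at CS onset rather than at the now fully predicted reward; loop index
# cs_time - 1 corresponds to delta(cs_time).
print("largest delta at step", max(range(T), key=lambda s: deltas[s]))
```

Extending this to second-order conditioning would require a second CS whose value is learned from the first CS's prediction rather than from reward itself.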

what does the theory explain?

                          R-W    TD
acquisition                ✓      ✓
extinction                 ✓      ✓
blocking                   ✓      ✓
overshadowing              ✓      ✓
temporal relationships     ✗      ✓
overexpectation            ✓      ✓
2nd order conditioning     ✗      ✓

Summary so far...
Temporal difference learning is a better version of Rescorla-Wagner learning:
- derived from first principles (from the definition of the problem)
- explains everything that R-W does, and more (e.g. 2nd order conditioning)
- basically a generalization of R-W to real time

break!

Back to Marr's three levels
The algorithm: temporal difference learning
Neural implementation: does the brain use TD learning?

we already saw this:
[Figure: Schultz et al, 1997, again: δ(t) = r_t; δ(t) = V_t and δ(t) = r_t - V_{t-1}; δ(t) = V_t and δ(t) = 0 - V_{t-1}]

the prediction error hypothesis of dopamine
The idea: Dopamine encodes a temporal difference reward prediction error (Montague, Dayan, Barto, mid 90's)
Schultz et al, 1993; Tobler et al, 2005

the prediction error hypothesis of dopamine
[Figures: Fiorillo et al, 2003; Bayer & Glimcher (2005): measured firing rate plotted against model prediction error]

where does dopamine project to?
main target: striatum in the basal ganglia (also prefrontal cortex)

the basal ganglia: afferents
inputs to striatum are from all over the cortex (and they are topographic)
Voorn et al, 2004

a precise microstructure

dopamine and synaptic plasticity
prediction errors are for learning
cortico-striatal synapses show dopamine-dependent plasticity
three-factor learning rule: need presynaptic + postsynaptic + dopamine
Wickens et al, 1996

summary
- Classical conditioning can be viewed as prediction learning
- The problem: prediction of future reward
- The algorithm: temporal difference learning
- Neural implementation: dopamine-dependent learning in the BG
- A computational model of learning allows us to look in the brain for hidden variables postulated by the model
- Precise (normative!) theory for the generation of dopamine firing patterns
- Explains anticipatory dopaminergic responding and 2nd order conditioning
- A compelling account of the role of dopamine in classical conditioning: prediction error drives prediction learning

if you are confused or intrigued: additional reading
- Rescorla & Wagner (1972) - A theory of Pavlovian conditioning: Variations in the effectiveness of reinforcement and nonreinforcement - the original chapter that is so well cited (and well written!)
- Sutton & Barto (1990) - Time-derivative models of Pavlovian reinforcement - shows step by step why TD learning is a suitable rule for modeling classical conditioning
- Niv & Schoenbaum (2008) - Dialogues on prediction errors - a guide for the perplexed
- Barto (1995) - Adaptive critic and the basal ganglia - a very clear exposition of TD learning in the basal ganglia
(all will be on BlackBoard)