CSE/NEURO 528 Lecture 13: Reinforcement Learning & Course Review (Chapter 9)


Shanon Scott


Early Results: Pavlov and his Dog
- Classical (Pavlovian) conditioning experiments
- Training: Bell → Food
- After: Bell → Salivate
- Conditioned stimulus (bell) predicts future reward (food)
(Image: Wikimedia Commons; Animation: Tom Creed, SJU)

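Predicting Delayed Rewards
- How do we predict rewards delivered some time after a stimulus is presented?
- Given: many trials, each of length $T$ time steps
- Time within a trial: $0 \le t \le T$, with stimulus $u(t)$ and reward $r(t)$ at each time step (Note: $r(t)$ can be zero for some $t$)
- We would like a neuron whose output $v(t)$ predicts the expected total future reward starting from time $t$:
  $v(t) = \left\langle \sum_{\tau=0}^{T-t} r(t+\tau) \right\rangle_{\text{trials}}$

Learning to Predict Future Rewards
- Use a set of synaptic weights $w(\tau)$ and predict based on all past stimuli $u(t)$ (linear filter!):
  $v(t) = \sum_{\tau=0}^{t} w(\tau)\, u(t-\tau)$
- (Diagram: $v(t)$ is computed from inputs $u(t), u(t-1), \ldots, u(0)$ through weights $w(0), w(1), \ldots, w(t)$.)
- Learn weights $w(\tau)$ that minimize the error:
  $\left( \sum_{\tau=0}^{T-t} r(t+\tau) - v(t) \right)^2$
- Can we minimize this using gradient descent and the delta rule? Yes, BUT future rewards are not yet available!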

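Temporal Difference (TD) Learning
- Key idea: rewrite the error function to get rid of future terms:
  $\left( \sum_{\tau=0}^{T-t} r(t+\tau) - v(t) \right)^2 = \left( r(t) + \sum_{\tau=0}^{T-t-1} r(t+1+\tau) - v(t) \right)^2 \approx \left( r(t) + v(t+1) - v(t) \right)^2$
- Here $r(t) + v(t+1)$ is the expected future reward and $v(t)$ is the prediction. Minimize this using gradient descent!
- Temporal Difference (TD) learning rule, with learning rate $\epsilon$:
  $w(\tau) \to w(\tau) + \epsilon\, [\, r(t) + v(t+1) - v(t) \,]\, u(t-\tau)$

Predicting Future Rewards: TD Learning
- Stimulus at $t = 100$ and reward at $t = 200$
- (Figure: prediction error $\delta$ for each time step, over many trials.)
(Image Source: Dayan & Abbott textbook)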
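The TD rule above can be tried out on a scaled-down version of the stimulus-then-reward experiment. The following is a minimal sketch, not the lecture's own simulation: the function name `run_td`, the trial length, the stimulus and reward times (10 and 20 instead of 100 and 200), the reward size, the learning rate, and the choice to apply all weight updates at the end of each trial are assumptions made for illustration.

```python
def run_td(T=30, t_stim=10, t_reward=20, n_trials=300, eps=0.3):
    """TD learning with a linear filter v(t) = sum_tau w(tau) u(t - tau).

    All parameter values are illustrative assumptions.
    Returns the weights, the last trial's predictions, and the
    prediction errors delta(t) recorded on every trial.
    """
    u = [0.0] * (T + 1)
    u[t_stim] = 1.0            # brief stimulus
    r = [0.0] * (T + 1)
    r[t_reward] = 1.0          # delayed reward
    w = [0.0] * (T + 1)
    history = []
    v = [0.0] * (T + 2)
    for _ in range(n_trials):
        # Prediction at every time step; v(T+1) is taken to be 0.
        v = [sum(w[tau] * u[t - tau] for tau in range(t + 1))
             for t in range(T + 1)]
        v.append(0.0)
        # Temporal difference error: delta(t) = r(t) + v(t+1) - v(t)
        delta = [r[t] + v[t + 1] - v[t] for t in range(T + 1)]
        # w(tau) -> w(tau) + eps * delta(t) * u(t - tau)
        for t in range(T + 1):
            for tau in range(t + 1):
                w[tau] += eps * delta[t] * u[t - tau]
        history.append(delta)
    return w, v, history


w, v, history = run_td()
first, last = history[0], history[-1]
print("trial 1: largest error at t =", first.index(max(first)))
print("final trial: largest error at t =", last.index(max(last)))
```

On the first trial the error sits at the reward time; with training it moves back toward the stimulus. In this discrete sketch the final peak falls on the step just before the stimulus, because $v(t+1)$ already reflects the stimulus at that step.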

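Possible Reward Prediction Error Signal in the Primate Brain
- Dopaminergic cells in the Ventral Tegmental Area (VTA)
- Reward prediction error $\delta(t)$? $[\, r(t) + v(t+1) - v(t) \,]$
- (Figure panels: Before Training, After Training)
- After training, response at the stimulus: $[\, 0 + v(t+1) - v(t) \,]$
- After training, no error at the reward: $v(t) = r(t) + v(t+1)$, so $[\, r(t) + v(t+1) - v(t) \,] = 0$
(Image Source: Dayan & Abbott textbook)

More Evidence for Prediction Error Signals
- Dopaminergic cells in VTA after training
- Reward expected but not delivered: $r(t) = 0$ and $v(t+1) = 0$
- Negative error: $[\, r(t) + v(t+1) - v(t) \,] = -v(t)$
(Image Source: Dayan & Abbott textbook)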

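Reinforcement Learning: Acting to Maximize Rewards
- (Diagram: the Agent receives State $u$ and Reward $r$ from the Environment and sends Action $a$ back to it.)

The Problem
- Learn a state-to-action mapping or "policy", $u \to a$, which maximizes the expected total future reward:
  $\left\langle \sum_{\tau=0}^{T-t} r(t+\tau) \right\rangle_{\text{trials}}$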

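Example: Rat in a barn
- States = locations A, B, or C
- Actions = L (go left) or R (go right)
- If the rat chooses L or R at random (random policy), what is the expected reward (or value) $v$ for each state?
(Image Source: Dayan & Abbott textbook)

Policy Evaluation
- For the random policy, $v(B)$ and $v(C)$ are each the average of the two rewards reachable from that location, and
  $v(A) = \frac{1}{2}\,\big( v(B) + v(C) \big)$
- Let the value of state $u$ be $v(u) = $ weight $w(u)$
- Can learn the value of states using TD learning:
  $w(u) \to w(u) + \epsilon\, [\, r(u) + v(u') - v(u) \,]$
- (Location, action) → new location, i.e., $(u, a) \to u'$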
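Policy evaluation by TD learning can be sketched for the barn example as follows. The reward values in `MAZE` (0 and 5 at B, 2 and 0 at C) are not recoverable from the slide text and are assumed here from the maze task in the Dayan & Abbott textbook; the function name, learning rate, trial count, and seed are likewise illustrative assumptions.

```python
import random

# Assumed maze: (state, action) -> (reward, next state).
# A next state of None marks the end of a trial.
MAZE = {
    ("A", "L"): (0.0, "B"), ("A", "R"): (0.0, "C"),
    ("B", "L"): (0.0, None), ("B", "R"): (5.0, None),
    ("C", "L"): (2.0, None), ("C", "R"): (0.0, None),
}


def td_policy_evaluation(n_trials=20000, eps=0.01, seed=0):
    """Learn v(u) = w(u) under the random policy with the TD rule."""
    rng = random.Random(seed)
    w = {"A": 0.0, "B": 0.0, "C": 0.0}
    for _ in range(n_trials):
        u = "A"
        while u is not None:
            a = rng.choice("LR")                  # random policy
            r, u_next = MAZE[(u, a)]
            v_next = w[u_next] if u_next is not None else 0.0
            # w(u) -> w(u) + eps * [r(u) + v(u') - v(u)]
            w[u] += eps * (r + v_next - w[u])
            u = u_next
    return w


values = td_policy_evaluation()
print({u: round(x, 2) for u, x in values.items()})
```

Under these assumed rewards the exact values for the random policy are $v(B) = 2.5$, $v(C) = 1$ and $v(A) = 1.75$; the learned weights fluctuate around them because the learning rate is held constant.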

TD Learning of Values for Random Policy
- (Figure: learned values for the three states; the same $\epsilon$ is used for all three.)
- Once I know the values, I can pick the action that leads to the higher valued state!
(Image Source: Dayan & Abbott textbook)

Selecting Actions based on Values
- Values act as surrogate immediate rewards
- Locally optimal choice leads to globally optimal policy (for Markov environments)
- Related to Dynamic Programming

Putting it all together: Actor-Critic Learning
- Two separate components: Actor (selects action and maintains policy) and Critic (maintains value of each state)
1. Critic Learning ("Policy Evaluation"):
   Value of state $u$: $v(u) = w(u)$
   $w(u) \to w(u) + \epsilon\, [\, r(u) + v(u') - v(u) \,]$ (same as TD rule)
2. Actor Learning ("Policy Improvement"):
   Probabilistically select an action $a$ at state $u$:
   $P(a; u) = \dfrac{\exp(\beta\, Q_a(u))}{\sum_b \exp(\beta\, Q_b(u))}$
   For all actions $a'$:
   $Q_{a'}(u) \to Q_{a'}(u) + \epsilon\, [\, r(u) + v(u') - v(u) \,]\, \big( \delta_{aa'} - P(a'; u) \big)$
3. Repeat 1 and 2

Actor-Critic Learning in our Barn Example
- (Figure: probability of going Left at each location.)
(Image Source: Dayan & Abbott textbook)
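The actor and critic updates above can be combined in a short simulation of the barn example. This is a sketch under stated assumptions: the rewards in `MAZE` are assumed from the maze task in the Dayan & Abbott textbook, the values of `eps`, `beta`, the trial count and the seed are arbitrary, and both updates are applied at every step rather than in alternating phases.

```python
import math
import random

# Assumed maze: (state, action) -> (reward, next state); None ends a trial.
MAZE = {
    ("A", "L"): (0.0, "B"), ("A", "R"): (0.0, "C"),
    ("B", "L"): (0.0, None), ("B", "R"): (5.0, None),
    ("C", "L"): (2.0, None), ("C", "R"): (0.0, None),
}
ACTIONS = ("L", "R")


def softmax_policy(q, beta):
    """P(a; u) = exp(beta * Q_a(u)) / sum_b exp(beta * Q_b(u))."""
    m = max(q.values())  # subtract the max for numerical stability
    e = {a: math.exp(beta * (q[a] - m)) for a in q}
    z = sum(e.values())
    return {a: e[a] / z for a in q}


def actor_critic(n_trials=5000, eps=0.1, beta=1.0, seed=0):
    rng = random.Random(seed)
    w = {u: 0.0 for u in "ABC"}                            # critic
    Q = {u: {a: 0.0 for a in ACTIONS} for u in "ABC"}      # actor
    for _ in range(n_trials):
        u = "A"
        while u is not None:
            p = softmax_policy(Q[u], beta)
            a = "L" if rng.random() < p["L"] else "R"
            r, u_next = MAZE[(u, a)]
            v_next = w[u_next] if u_next is not None else 0.0
            delta = r + v_next - w[u]                      # TD error
            w[u] += eps * delta                            # critic update
            for a2 in ACTIONS:                             # actor update
                kron = 1.0 if a2 == a else 0.0
                Q[u][a2] += eps * delta * (kron - p[a2])
            u = u_next
    p_left = {u: softmax_policy(Q[u], beta)["L"] for u in "ABC"}
    return w, p_left


w, p_left = actor_critic()
print({u: round(p, 3) for u, p in p_left.items()})
```

With the assumed rewards, the probability of going left should rise at A and C and fall at B, since the largest reward is reached by going left at A and then right at B.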

Possible Implementation of the Actor-Critic Model in the Basal Ganglia
- (Diagram labels: Cortex, State Estimate, STN, Striatum, GPe, DA, SNc, Hidden Layer, Value, Actor, Critic, TD error, GPi/SNr, Action, Thalamus)
(See Supplementary Materials for references)

Reinforcement learning has been applied to many real-world problems!
- Examples: Google's AlphaGo beats human champion in Go; Autonomous Helicopter Flight (learned from human demonstrations)
- Videos and papers at: http://heli.stanford.edu/

Course Summary
- Where have we been? Course Highlights
- Where do we go from here? Challenges and Open Problems
- Further Reading

What is the neural code?
- What is the nature of the code?
- Representing the spiking output: single cells vs populations; rates vs spike times vs intervals
- What features of the stimulus does the neural system represent?

Encoding and decoding neural information
- Encoding: building functional models of neurons/neural systems and predicting the spiking output given the stimulus
- Decoding: what can we say about the stimulus given what we observe from the neuron or neural population?

Information maximization as a design principle of the nervous system

Biophysical Models of Neurons
- Voltage dependent
- Transmitter dependent (synaptic)
- Ca dependent

The neural equivalent circuit
- Ohm's law ($V = IR$) and Kirchhoff's law
- Terms in the circuit equation: capacitive current, ionic currents, externally applied current

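Simplified models: integrate-and-fire
- Integrate-and-Fire Model:
  $\tau_m \dfrac{dV}{dt} = -(V - E_L) + I_e R$
- If $V > V_{\text{threshold}}$ → Spike
- Then reset: $V = V_{\text{reset}}$

Modeling Networks of Neurons
  $\tau \dfrac{d\mathbf{v}}{dt} = -\mathbf{v} + F(W\mathbf{u} + M\mathbf{v})$
- Terms: output $\mathbf{v}$, decay $-\mathbf{v}$, input $W\mathbf{u}$, feedback $M\mathbf{v}$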
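The integrate-and-fire rule (integrate, spike at threshold, reset) can be sketched with forward-Euler integration. Every number below (time constant, resting, threshold and reset potentials, resistance, injected current, step size) is an illustrative assumption rather than a value from the lecture, and `simulate_lif` is a hypothetical helper name.

```python
def simulate_lif(i_e=2.0, t_max=200.0, dt=0.1, tau_m=10.0,
                 e_l=-65.0, v_threshold=-50.0, v_reset=-65.0, r_m=10.0):
    """Forward-Euler simulation of tau_m dV/dt = -(V - E_L) + I_e R.

    Assumed units: ms, mV, MOhm and nA (so r_m * i_e is in mV).
    Returns the list of spike times in ms.
    """
    v = e_l
    spike_times = []
    n_steps = int(round(t_max / dt))
    for step in range(1, n_steps + 1):
        v += (dt / tau_m) * (-(v - e_l) + r_m * i_e)
        if v > v_threshold:          # threshold crossed: emit a spike
            spike_times.append(step * dt)
            v = v_reset              # then reset the membrane potential
    return spike_times


print(len(simulate_lif(i_e=2.0)), "spikes with suprathreshold input")
print(len(simulate_lif(i_e=1.0)), "spikes with subthreshold input")
```

With these assumed values, a current of 2 nA drives the steady-state potential above threshold and the model fires regularly, while 1 nA leaves it below threshold and the model stays silent.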


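Unsupervised Learning
- For a linear neuron: $v = \mathbf{w}^T \mathbf{u} = \mathbf{u}^T \mathbf{w}$
- Basic Hebb rule: $\tau_w \dfrac{d\mathbf{w}}{dt} = \mathbf{u}\, v$
- Average effect over many inputs: $\tau_w \dfrac{d\mathbf{w}}{dt} = \langle \mathbf{u}\, v \rangle = Q \mathbf{w}$
- $Q$ is the input correlation matrix: $Q = \langle \mathbf{u} \mathbf{u}^T \rangle$
- Hebb rule performs principal component analysis (PCA)

The Connection to Statistics
- Unsupervised learning = learning the hidden causes of input data
- Generative model: Causes $v$ → Data $u$, with data likelihood $p[u \mid v; G]$
- Recognition model: posterior $p[v \mid u; G]$
- Use the EM algorithm for learning $G$, e.g. $G = (\mu_v, \sigma_v)$
- Examples: causes of clustered data; causes of natural images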
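The claim that the averaged Hebb rule performs PCA can be checked numerically. In the sketch below, the weight vector is renormalized after every step; that normalization is an assumption added here to keep the weights bounded (the plain Hebb rule grows without limit) and is not part of the rule on the slide. The sample data, step size and function names are also illustrative.

```python
import math


def correlation_matrix(samples):
    """Q = <u u^T>, averaged over the given input samples."""
    n = len(samples[0])
    return [[sum(u[i] * u[j] for u in samples) / len(samples)
             for j in range(n)] for i in range(n)]


def averaged_hebb(samples, w0, steps=200, dt_over_tau=0.1):
    """Euler steps of tau_w dw/dt = Q w, renormalizing w each step."""
    Q = correlation_matrix(samples)
    w = list(w0)
    for _ in range(steps):
        dw = [sum(Q[i][j] * w[j] for j in range(len(w)))
              for i in range(len(w))]
        w = [w[i] + dt_over_tau * dw[i] for i in range(len(w))]
        norm = math.sqrt(sum(x * x for x in w))
        w = [x / norm for x in w]        # assumed normalization step
    return w


# Toy inputs whose correlation matrix is [[2.5, 2.0], [2.0, 2.5]];
# its principal eigenvector points along (1, 1).
samples = [(2.0, 1.0), (1.0, 2.0), (-2.0, -1.0), (-1.0, -2.0)]
w = averaged_hebb(samples, w0=(1.0, 0.0))
print([round(x, 4) for x in w])
```

Starting from $(1, 0)$, the weight vector rotates toward the direction $(1, 1)/\sqrt{2}$, the eigenvector of $Q$ with the largest eigenvalue.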


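Generative Models
- (Figure labels: Droning lecture, Lack of sleep, Mathematical derivations)

Supervised Learning
- Backpropagation for Multilayered Networks:
  $v_i = g\left( \sum_j W_{ij}\, g\left( \sum_k w_{jk} u_k \right) \right)$, with hidden units $x_j = g\left( \sum_k w_{jk} u_k \right)$
- Goal: find $W$ and $w$ that minimize the error ($d_i$ is the desired output):
  $E(W_{ij}, w_{jk}) = \frac{1}{2} \sum_i (d_i - v_i)^2$
- Gradient descent learning rules:
  $W_{ij} \to W_{ij} - \epsilon \dfrac{\partial E}{\partial W_{ij}}$ (Delta rule)
  $w_{jk} \to w_{jk} - \epsilon \dfrac{\partial E}{\partial w_{jk}}$, where $\dfrac{\partial E}{\partial w_{jk}} = \dfrac{\partial E}{\partial x_j} \cdot \dfrac{\partial x_j}{\partial w_{jk}}$ (Chain rule)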
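The two gradient descent rules can be written out for a small two-layer network. This is a sketch under assumptions: the activation $g$ is taken to be the logistic sigmoid (the slide does not specify it), and the network size, initial weight range, learning rate, input and target are arbitrary illustrative choices.

```python
import math
import random


def g(a):
    """Assumed activation function: logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-a))


def init_weights(n_out, n_hidden, n_in, seed=0):
    rng = random.Random(seed)
    W = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden)] for _ in range(n_out)]
    w = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]
    return W, w


def forward(W, w, u):
    x = [g(sum(w[j][k] * u[k] for k in range(len(u)))) for j in range(len(w))]
    v = [g(sum(W[i][j] * x[j] for j in range(len(x)))) for i in range(len(W))]
    return x, v


def error(W, w, u, d):
    _, v = forward(W, w, u)
    return 0.5 * sum((d[i] - v[i]) ** 2 for i in range(len(d)))


def gradients(W, w, u, d):
    x, v = forward(W, w, u)
    # dE/d(net input of output i); for the sigmoid, g' = v (1 - v)
    out = [-(d[i] - v[i]) * v[i] * (1.0 - v[i]) for i in range(len(v))]
    dW = [[out[i] * x[j] for j in range(len(x))] for i in range(len(v))]
    # Chain rule: dE/dw_jk = (dE/dx_j) * (dx_j/dw_jk)
    dE_dx = [sum(out[i] * W[i][j] for i in range(len(v))) for j in range(len(x))]
    dw = [[dE_dx[j] * x[j] * (1.0 - x[j]) * u[k] for k in range(len(u))]
          for j in range(len(x))]
    return dW, dw


def train_step(W, w, u, d, eps=0.5):
    dW, dw = gradients(W, w, u, d)
    for i in range(len(W)):
        for j in range(len(W[i])):
            W[i][j] -= eps * dW[i][j]
    for j in range(len(w)):
        for k in range(len(w[j])):
            w[j][k] -= eps * dw[j][k]


W, w = init_weights(n_out=1, n_hidden=3, n_in=2)
u, d = [1.0, 0.5], [0.9]
before = error(W, w, u, d)
for _ in range(50):
    train_step(W, w, u, d)
print(before, "->", error(W, w, u, d))
```

A useful sanity check on any backpropagation code is to compare one entry of the analytic gradient with a finite-difference estimate of the same derivative; the two should agree closely.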


Reinforcement Learning
- Learning to predict rewards: $\mathbf{w} \to \mathbf{w} + \epsilon\, (r - v)\, \mathbf{u}$
- Learning to predict delayed rewards (TD learning): $w(\tau) \to w(\tau) + \epsilon\, [\, r(t) + v(t+1) - v(t) \,]\, u(t-\tau)$
- Actor-Critic Learning:
  - Critic learns value of each state using TD learning
  - Actor learns best actions based on value of next state (using the TD error)
(Animation: http://employees.csbsju.edu/tcreed/pb/pdoganim.html)

The Future: Challenges and Open Problems
- How do neurons encode information? Topics: synchrony, spike-timing based learning, dynamic synapses
- Does a neuron's structure confer computational advantages? Topics: role of channel dynamics, dendrites, plasticity in channels and their density
- How do networks implement computational principles such as efficient coding and Bayesian inference?
- How do networks learn "optimal" representations of their environment and engage in purposeful behavior? Topics: unsupervised/reinforcement/imitation learning

Further Reading (for Spring and beyond)
- Spikes: Exploring the Neural Code, F. Rieke et al., MIT Press, 1997
- The Biophysics of Computation, C. Koch, Oxford University Press, 1999
- Large-Scale Neuronal Theories of the Brain, C. Koch and J. L. Davis, MIT Press, 1994
- Probabilistic Models of the Brain, R. Rao et al., MIT Press, 2002
- Bayesian Brain, K. Doya et al., MIT Press, 2007
- Reinforcement Learning: An Introduction, R. Sutton and A. Barto, MIT Press

Next two classes: Project presentations!
- Keep your presentation short: ~7-8 slides, with a per-group time limit that includes questions
- Suggested outline: Introduction, Background, Methods, Results, Conclusion
- Slides: bring your slides on a USB stick to use the class laptop (Windows machine) OR bring your own laptop (esp. if you have videos etc.)
- Project reports are due March 12, by email to Adrienne, Rich, and Raj, before midnight

Have a great weekend!
