CSE/NB 528 Lecture 14: Reinforcement Learning (Chapter 9)

1 CSE/NB 528 Lecture 14: Reinforcement Learning (Chapter 9)
Image from http://clasdean.la.asu.edu/news/images/ubep2001/neuron3.jpg
Lecture figures are from Dayan & Abbott's book: http://people.brandeis.edu/~abbott/book/index.html

2 Today's Agenda
Reinforcement Learning
  What is reinforcement learning?
Classical conditioning
  Learning to salivate (predicting reward)
Predicting Delayed Rewards
  Temporal Difference Learning
Learning to Act
  Q-learning
  Actor-Critic Architecture

3 Some Supervised Learning Demos on the Web
Function Approximation: http://neuron.eng.wayne.edu/bpfuncionapprox/bpfuncionapprox.html
Pattern Recognition: http://eecs.wsu.edu/~cook/ai/lecures/apples/hnn/jrec.html
Image Compression: http://neuron.eng.wayne.edu/bpimagecompression9plus/bp9plus.html
Backpropagation for Control (Ball Balancing): http://neuron.eng.wayne.edu/bpballbalancing/ball5.html

4 Humans don't get exact supervisory signals (commands for muscles) for learning to talk, walk, ride a bicycle, play the piano, drive, etc.
We learn by trial-and-error (and by watching others)
Might get "rewards and punishments" along the way
Enter Reinforcement Learning

5 The Reinforcement Learning "Agent"
Diagram: the Agent receives State u and Reward r from the Environment and sends Action a back to the Environment

6 The Reinforcement Learning Framework
Unsupervised learning: Learn the hidden causes of inputs
Supervised learning: Learn a function based on training examples of (input, desired output) pairs
Reinforcement Learning: Learn the best action for any given state so as to maximize total expected (future) reward
  Learn by trial and error
  Intermediate between unsupervised and supervised learning
  Instead of an explicit teaching signal (or desired output), you get rewards or punishments
  Inspired by classical conditioning experiments (remember Pavlov's hyper-salivating dog?)

7 Early Results: Pavlov and his Dog
Classical (Pavlovian) conditioning experiments
Training: Bell → Food
After: Bell → Salivate
Conditioned stimulus (bell) predicts future reward (food)
http://employees.csbsju.edu/creed/pb/pdoganim.html

8 Predicting Reward
Stimulus u = 0 or 1
Expected reward $v = wu$
Delivered reward = r
Learn w by minimizing $(r - v)^2$:
$w \rightarrow w + \epsilon\,(r - v)\,u$ (same as the delta rule; also called the Rescorla-Wagner rule)
Prediction error $\delta = r - v$
For small ε and u = 1: $w \rightarrow w + \epsilon\,(r - w)$, so the average value of w is $\langle w \rangle = \langle r \rangle$
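The update rule above can be read as gradient descent on the squared prediction error. A short derivation under that reading, with the constant factor of 2 absorbed into the learning rate (a presentation choice, not something stated on the slide):

```latex
E = (r - v)^2, \qquad v = w u
\frac{\partial E}{\partial w} = -2\,(r - v)\,u
w \rightarrow w - \frac{\epsilon}{2}\,\frac{\partial E}{\partial w} = w + \epsilon\,(r - v)\,u
% With u = 1, averaging the update over trials and setting it to zero:
\langle \Delta w \rangle = \epsilon\,(\langle r \rangle - \langle w \rangle) = 0
\;\Rightarrow\; \langle w \rangle = \langle r \rangle
```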

9 Predicting Reward during Conditioning
Figure: learned weight w over trials in three phases
Reward (r) present: conditioning (r = 1, ε = 0.5)
Reward removed: extinction
Reward presented on 50% of the trials
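The three phases on this slide can be reproduced with a few lines of Python. This is a minimal sketch, not the code behind the figure: the function name `rescorla_wagner`, the trial counts, and the random seed are assumptions chosen for illustration, while r = 1 and ε = 0.5 come from the slide.

```python
import random

def rescorla_wagner(rewards, eps=0.5, w=0.0, u=1.0):
    """Apply w -> w + eps * (r - v) * u with v = w * u, once per trial.

    Returns the value of w after each trial.
    """
    history = []
    for r in rewards:
        v = w * u                  # expected reward
        w += eps * (r - v) * u     # delta (Rescorla-Wagner) rule
        history.append(w)
    return history

# Conditioning: reward present on every trial (r = 1).
acquisition = rescorla_wagner([1.0] * 20)

# Extinction: reward removed, starting from the learned weight.
extinction = rescorla_wagner([0.0] * 20, w=acquisition[-1])

# Partial reinforcement: reward on about 50% of trials (seed is arbitrary).
rng = random.Random(0)
partial_rewards = [float(rng.random() < 0.5) for _ in range(2000)]
partial = rescorla_wagner(partial_rewards)
partial_mean = sum(partial) / len(partial)

print(acquisition[-1], extinction[-1], partial_mean)
```

With reward on half the trials, w keeps fluctuating from trial to trial, but its average over trials should sit near the mean reward of 0.5.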

10 Predicting Delayed Rewards
In more realistic cases, reward is typically delivered at the end (when you know whether you succeeded or not)
Time: $0 \le t \le T$, with stimulus $u(t)$ and reward $r(t)$ at each time step t (Note: $r(t)$ can be zero)
Key Idea: Make the output $v(t)$ predict the total expected future reward starting from time t:
$v(t) \approx \left\langle \sum_{\tau=0}^{T-t} r(t+\tau) \right\rangle$

11 Learning to Predict Delayed Rewards
Use a set of modifiable weights $w(\tau)$ and predict based on all past stimuli $u(t)$:
$v(t) = \sum_{\tau=0}^{t} w(\tau)\,u(t-\tau)$
Would like to find the $w(\tau)$ that minimize:
$\left( \sum_{\tau=0}^{T-t} r(t+\tau) - v(t) \right)^2$
Can we minimize this using gradient descent and the delta rule?
Yes, BUT the future rewards are not yet available
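The prediction above is a causal convolution of the weights with the stimulus history, and the quantity it is meant to match is a reversed running sum of rewards. A small NumPy sketch of both; the function names `predict` and `future_reward` and the toy arrays are hypothetical, introduced only for illustration:

```python
import numpy as np

def predict(w, u):
    """v(t) = sum_{tau=0}^{t} w(tau) * u(t - tau) for every time step t."""
    return np.convolve(w, u)[:len(u)]

def future_reward(r):
    """Total reward from time t to the end of the trial, for every t."""
    return np.cumsum(r[::-1])[::-1]

# Toy trial: stimulus at t = 1, rewards at t = 2 and t = 3.
w = np.array([1.0, 2.0, 3.0, 4.0])
u = np.array([0.0, 1.0, 0.0, 0.0])
r = np.array([0.0, 0.0, 1.0, 2.0])

v = predict(w, u)           # v(t) = w(t - 1) once the stimulus has occurred
target = future_reward(r)   # what v(t) should approximate on average
```

The target at time t depends on rewards that arrive after t, which is the obstacle the slide points out: the delta rule cannot be applied online with this target.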

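12 Temporal Difference (TD) Learning
Key Idea: Rewrite the squared error to get rid of future terms:
$\left( \sum_{\tau=0}^{T-t} r(t+\tau) - v(t) \right)^2 = \left( r(t) + \sum_{\tau=0}^{T-t-1} r(t+1+\tau) - v(t) \right)^2 \approx \left( r(t) + v(t+1) - v(t) \right)^2$
Temporal Difference (TD) Learning:
For each time step t, do:
  For all τ ($0 \le \tau \le t$), do:
    $w(\tau) \rightarrow w(\tau) + \epsilon\,[r(t) + v(t+1) - v(t)]\,u(t-\tau)$
Here $r(t) + v(t+1)$ is the expected future reward, $v(t) = \sum_{\tau=0}^{t} w(\tau)\,u(t-\tau)$ is the prediction, and the bracketed difference is the prediction error δ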

13 Predicting Delayed Reward: TD Learning
Stimulus at t = 100 and reward at t = 200
Figure: prediction error δ for each time step (over many trials)
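A rough simulation of this setup, using the TD rule from the previous slide. It is a sketch under stated assumptions rather than the code behind the figure: the stimulus and reward are single-time-step pulses, and the trial length (250 steps), number of trials (400), and ε = 0.5 are illustrative choices. Only the stimulus and reward times come from the slide.

```python
import numpy as np

def td_trial(w, u, r, eps):
    """Run one trial of TD learning, updating w in place.

    Returns delta[t] = r(t) + v(t+1) - v(t) for every time step.
    """
    T = len(u)
    delta = np.zeros(T)
    for t in range(T):
        # v(t) = sum_{tau=0}^{t} w(tau) * u(t - tau)
        v_t = w[:t + 1] @ u[t::-1]
        # v(t+1), taken to be 0 past the end of the trial
        v_next = w[:t + 2] @ u[t + 1::-1] if t + 1 < T else 0.0
        delta[t] = r[t] + v_next - v_t
        # w(tau) -> w(tau) + eps * delta * u(t - tau) for all tau <= t
        w[:t + 1] += eps * delta[t] * u[t::-1]
    return delta

T, trials, eps = 250, 400, 0.5       # assumed values
u = np.zeros(T); u[100] = 1.0        # stimulus pulse at t = 100
r = np.zeros(T); r[200] = 1.0        # reward pulse at t = 200
w = np.zeros(T)

deltas = [td_trial(w, u, r, eps) for _ in range(trials)]
first, last = deltas[0], deltas[-1]
```

On the first trial the only prediction error is at the reward (t = 200). Over trials the error moves backward in time until it occurs when the stimulus arrives, which in this indexing is the step from t = 99 to t = 100, and the error at the reward time vanishes.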

14 Reward Prediction Error Signal in Monkeys?
Dopaminergic cells in the Ventral Tegmental Area (VTA)
Reward prediction error? $[r(t) + v(t+1) - v(t)]$
Figure panels: Before Training, After Training
After training: $[0 + v(t+1) - v(t)]$
No error: $v(t) = r(t) + v(t+1)$

15 More Evidence for Prediction Error Signals
Dopaminergic cells in VTA
Negative error: $r(t) = 0$, $v(t+1) = 0$
$[r(t) + v(t+1) - v(t)] = -v(t)$

16 That's great, but how does all that math help me get food in a maze?

17 Using Reward Predictions to Select Actions
Suppose you have computed a "Value" for each action
Q(a) = value (predicted reward) for executing action a
Higher if action yields more reward, lower otherwise
Can select actions probabilistically according to their value:
$P(a) = \frac{\exp(\beta Q(a))}{\sum_{a'} \exp(\beta Q(a'))}$
High β selects actions with the highest Q value. Low β selects more uniformly
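The selection rule above is a softmax over action values. A minimal Python sketch follows; the function name `softmax_policy` and the example Q values are hypothetical, and the largest exponent is subtracted before exponentiating purely for numerical stability, which does not change the probabilities.

```python
import math

def softmax_policy(q_values, beta):
    """Return P(a) = exp(beta * Q(a)) / sum_a' exp(beta * Q(a'))."""
    scaled = [beta * q for q in q_values]
    m = max(scaled)                          # shift for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

q = [1.0, 2.0]                               # illustrative action values
uniform = softmax_policy(q, beta=0.0)        # beta = 0 ignores the values
moderate = softmax_policy(q, beta=1.0)
greedy = softmax_policy(q, beta=50.0)        # nearly always the best action
```

With two actions the rule reduces to a logistic function of β times the difference in Q values, which is the form used in the bee foraging example.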

18 Simple Example: Bee Foraging
Experiment: Bees select either yellow (y) or blue (b) flowers based on nectar reward
Idea: Value of yellow/blue = average reward obtained so far
$Q(y) \rightarrow Q(y) + \epsilon\,(r_y - Q(y))$
$Q(b) \rightarrow Q(b) + \epsilon\,(r_b - Q(b))$ (delta rule, running average)
$P(y) = \frac{\exp(\beta Q(y))}{\exp(\beta Q(y)) + \exp(\beta Q(b))}$
$P(b) = 1 - P(y)$
Yum! (image: http://svi.cps.utexas.edu/bee_on_flower_original.htm)

19 Simulating Bees
Figure: $Q_y$ and $Q_b$ over flower visits; rewards are first $r_y = 2$, $r_b = 1$, then $r_y = 1$, $r_b = 2$
β = 1: exploration possible
β = 50: mostly exploitation
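A small simulation in the spirit of this slide, offered as a sketch under assumptions: the reward values and the two β settings come from the slide, while ε = 0.1, the 200 visits, the switch of rewards at the halfway point, deterministic nectar rewards, and the seed are illustrative choices. The name `simulate_bee` is hypothetical.

```python
import math
import random

def simulate_bee(beta, eps=0.1, visits=200, seed=0):
    """Simulate flower choices; returns final Q values and the choice list."""
    rng = random.Random(seed)
    q = {"y": 0.0, "b": 0.0}
    choices = []
    for n in range(visits):
        # Rewards swap halfway through (assumed timing).
        if n < visits // 2:
            reward = {"y": 2.0, "b": 1.0}
        else:
            reward = {"y": 1.0, "b": 2.0}
        # P(y) = exp(beta*Qy) / (exp(beta*Qy) + exp(beta*Qb)), logistic form
        p_yellow = 1.0 / (1.0 + math.exp(-beta * (q["y"] - q["b"])))
        flower = "y" if rng.random() < p_yellow else "b"
        # Running-average (delta rule) update for the visited flower only
        q[flower] += eps * (reward[flower] - q[flower])
        choices.append(flower)
    return q, choices

q_explore, c_explore = simulate_bee(beta=1.0)
q_exploit, c_exploit = simulate_bee(beta=50.0)
```

With β = 50 the bee tends to commit to whichever flower it happens to value first and may never notice that the rewards have swapped. With β = 1 it keeps sampling both flowers, so both Q values track the change.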

20 Forget bees, how do I get to the food in the maze?

21 Selecting Actions when Reward is Delayed
States: A, B, or C
Possible actions at any state: Left (L) or Right (R)
If you randomly choose to go L or R (random "policy"), what is the value v of each state?

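22 Policy Evaluation
For the random policy, each state's value is the average over the two equally likely actions: v(B) and v(C) are the means of the rewards reached by going L or R from B and from C, and
$v(A) = \tfrac{1}{2}\,[v(B) + v(C)]$
Can learn this using TD learning. Let $v(u) = w(u)$; for (location, action) → (new location), i.e. $(u, a) \rightarrow u'$:
$w(u) \rightarrow w(u) + \epsilon\,[r_a(u) + v(u') - v(u)]$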
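The TD rule for the random policy can be tried on a toy version of the maze. This is a sketch under assumptions: the reward values in `MAZE` are hypothetical placeholders (the transcript does not give the numbers), as are ε = 0.05, the trial count, and the seed. Because a constant learning rate leaves the weights fluctuating, the sketch reports each weight averaged over the second half of training.

```python
import random

# Hypothetical maze: (state, action) -> (next state or None at the end, reward).
MAZE = {
    ("A", "L"): ("B", 0.0), ("A", "R"): ("C", 0.0),
    ("B", "L"): (None, 0.0), ("B", "R"): (None, 5.0),
    ("C", "L"): (None, 2.0), ("C", "R"): (None, 0.0),
}

def evaluate_random_policy(trials=20000, eps=0.05, seed=0):
    """TD learning of v(u) = w(u) while choosing L or R at random."""
    rng = random.Random(seed)
    w = {"A": 0.0, "B": 0.0, "C": 0.0}
    tail_sum = dict.fromkeys(w, 0.0)
    for n in range(trials):
        u = "A"
        while u is not None:
            a = rng.choice("LR")                        # random policy
            u_next, r = MAZE[(u, a)]
            v_next = w[u_next] if u_next is not None else 0.0
            w[u] += eps * (r + v_next - w[u])           # TD update
            u = u_next
        if n >= trials // 2:                            # average late weights
            for s in w:
                tail_sum[s] += w[s]
    return {s: tail_sum[s] / (trials - trials // 2) for s in w}

values = evaluate_random_policy()
```

For these placeholder rewards the exact values under the random policy are v(B) = 2.5, v(C) = 1, and v(A) = 1.75, so the learned averages should land close to those numbers.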

23 Maze Value Learning for Random Policy
Once I know the values, I can pick the action that leads to the higher valued state!

24 Selecting Actions based on Values
Values act as surrogate immediate rewards
Locally optimal choice leads to globally optimal policy (for Markov environments)
Related to Dynamic Programming in CS (see appendix in text)

25 Q learning
A simple method for action selection based on action values (or Q values) $Q(x, a)$, where x is a state and a is an action
1. Let u be the current state. Select an action a according to:
   $P(a) = \frac{\exp(\beta Q(u, a))}{\sum_{a'} \exp(\beta Q(u, a'))}$
2. Execute a and record the new state u' and reward r. Update Q:
   $Q(u, a) \rightarrow Q(u, a) + \epsilon\,[r + \max_{a'} Q(u', a') - Q(u, a)]$
3. Repeat until an end state is reached
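The three steps above can be sketched on a toy three-state maze. The maze layout and reward values in `MAZE` are hypothetical placeholders, and ε = 0.2, β = 1, the trial count, and the seed are assumptions made for illustration. As on the slide, the update uses no discount factor.

```python
import math
import random

# Hypothetical maze: (state, action) -> (next state or None at the end, reward).
MAZE = {
    ("A", "L"): ("B", 0.0), ("A", "R"): ("C", 0.0),
    ("B", "L"): (None, 0.0), ("B", "R"): (None, 5.0),
    ("C", "L"): (None, 2.0), ("C", "R"): (None, 0.0),
}
ACTIONS = ["L", "R"]

def q_learning(trials=20000, eps=0.2, beta=1.0, seed=0):
    rng = random.Random(seed)
    q = {key: 0.0 for key in MAZE}
    for _ in range(trials):
        u = "A"
        while u is not None:
            # Step 1: softmax action selection, P(a) proportional to exp(beta*Q)
            weights = [math.exp(beta * q[(u, a)]) for a in ACTIONS]
            a = rng.choices(ACTIONS, weights=weights)[0]
            # Step 2: execute a, observe new state and reward, update Q
            u_next, r = MAZE[(u, a)]
            if u_next is None:
                best_next = 0.0
            else:
                best_next = max(q[(u_next, b)] for b in ACTIONS)
            q[(u, a)] += eps * (r + best_next - q[(u, a)])
            # Step 3: repeat until an end state is reached
            u = u_next
    return q

q = q_learning()
greedy = {s: max(ACTIONS, key=lambda a: q[(s, a)]) for s in "ABC"}
```

For these placeholder rewards the learned Q values should approach 5 for going left at A and right at B, and 2 for going right at A and left at C, so the greedy policy heads for the largest reward.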

26 Another Variant: Actor-Critic Learning
Two separate components: Actor (maintains policy) and Critic (maintains value of each state)
1. Critic Learning ("Policy Evaluation"): Value of state u = $v(u) = w(u)$
   $w(u) \rightarrow w(u) + \epsilon\,[r_a(u) + v(u') - v(u)]$ (same as TD rule)
2. Actor Learning ("Policy Improvement"):
   $P(a; u) = \frac{\exp(\beta Q_a(u))}{\sum_b \exp(\beta Q_b(u))}$ (use this to select an action a in u)
   For all a': $Q_{a'}(u) \rightarrow Q_{a'}(u) + \epsilon\,[r_a(u) + v(u') - v(u)]\,(\delta_{aa'} - P(a'; u))$
3. Interleave 1 and 2
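A compact sketch of the actor-critic scheme on a toy three-state maze. Several things here are assumptions: the reward values in `MAZE` are hypothetical placeholders, ε = 0.2 and β = 1 are illustrative, and interleaving is done in the simplest way, by applying the critic and actor updates at every step with the same prediction error.

```python
import math
import random

# Hypothetical maze: (state, action) -> (next state or None at the end, reward).
MAZE = {
    ("A", "L"): ("B", 0.0), ("A", "R"): ("C", 0.0),
    ("B", "L"): (None, 0.0), ("B", "R"): (None, 5.0),
    ("C", "L"): (None, 2.0), ("C", "R"): (None, 0.0),
}

def policy(prefs, u, beta):
    """P(a; u) = exp(beta * Q_a(u)) / sum_b exp(beta * Q_b(u))."""
    exps = {a: math.exp(beta * prefs[(u, a)]) for a in "LR"}
    total = sum(exps.values())
    return {a: e / total for a, e in exps.items()}

def actor_critic(trials=5000, eps=0.2, beta=1.0, seed=0):
    rng = random.Random(seed)
    w = {s: 0.0 for s in "ABC"}                            # critic: v(u) = w(u)
    prefs = {(s, a): 0.0 for s in "ABC" for a in "LR"}     # actor: Q_a(u)
    for _ in range(trials):
        u = "A"
        while u is not None:
            p = policy(prefs, u, beta)
            a = "L" if rng.random() < p["L"] else "R"
            u_next, r = MAZE[(u, a)]
            v_next = w[u_next] if u_next is not None else 0.0
            delta = r + v_next - w[u]                      # prediction error
            w[u] += eps * delta                            # critic (TD rule)
            for b in "LR":                                 # actor update
                indicator = 1.0 if b == a else 0.0         # delta_{a a'}
                prefs[(u, b)] += eps * delta * (indicator - p[b])
            u = u_next
    return {s: policy(prefs, s, beta)["L"] for s in "ABC"}

p_left = actor_critic()
```

For these placeholder rewards the probability of going left should rise toward 1 at A and C and fall toward 0 at B. State C is visited less and less often as the policy at A improves, so its preference changes more slowly.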

27 Actor-Critic Learning in the Maze Task
Figure: probability of going Left at a location

28 Demo of Reinforcement Learning in a Robot
From http://sysplan.nams.kyushuu.ac.jp/gen/papers/javademoml97/robodemo.html

29 Things to do:
Work on mini-project
Next class: Course Summary

CSE/NB 528 Lecture 14: From Supervised to Reinforcement Learning (Chapter 9) R. Rao, 528: Lecture 14

1 Recall from last time: Sigmoid Networks
Output: $v = g(\mathbf{w}^T \mathbf{u}) = g\left(\sum_i w_i u_i\right)$
Input nodes: $\mathbf{u} = (u_1\; u_2\; u_3)^T$
Sigmoid output function: $g(a) = \frac{1}{1 + e^{-a}}$