CSE/NB 528 Lecture 14: From Supervised to Reinforcement Learning (Chapter 9) R. Rao, 528: Lecture 14



1 CSE/NB 528 Lecture 14: From Supervised to Reinforcement Learning (Chapter 9)

2 Recall from last time: Sigmoid Networks
Output: $v = g(\mathbf{w}^T \mathbf{u}) = g\left(\sum_i w_i u_i\right)$
Input nodes: $\mathbf{u} = (u_1\ u_2\ u_3)^T$
Sigmoid output function: $g(a) = \frac{1}{1 + e^{-\beta a}}$
The sigmoid is a non-linear "squashing" function: it squashes its input to lie between 0 and 1. The parameter $\beta$ controls the slope.
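The squashing behaviour and the role of the slope parameter can be sketched in a few lines of NumPy. This is only an illustrative sketch: the function names `sigmoid` and `unit_output` and the sample values of $\beta$ are assumptions, not part of the lecture.

```python
import numpy as np

def sigmoid(a, beta=1.0):
    """Logistic squashing function g(a) = 1 / (1 + exp(-beta * a))."""
    return 1.0 / (1.0 + np.exp(-beta * a))

def unit_output(w, u, beta=1.0):
    """Output v = g(w^T u) of a single sigmoid unit (hypothetical helper)."""
    return sigmoid(np.dot(w, u), beta)

a = np.linspace(-4.0, 4.0, 9)
shallow = sigmoid(a, beta=0.5)  # small beta: gentle slope
steep = sigmoid(a, beta=5.0)    # large beta: close to a step function

# Example unit with three inputs, as in the slide's u = (u1 u2 u3)^T.
v = unit_output(np.array([0.5, -1.0, 2.0]), np.array([1.0, 0.0, 1.0]))
```

Whatever the value of `beta`, the outputs stay strictly between 0 and 1; only the steepness around `a = 0` changes.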

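3 What should we optimize?
Given training examples $(\mathbf{u}^m, d^m)$, $m = 1, \ldots, N$, define the output error function:
$E(\mathbf{w}) = \frac{1}{2} \sum_m (d^m - v^m)^2$, where $v^m = g(\mathbf{w}^T \mathbf{u}^m)$
How would you change $\mathbf{w}$ so that $E(\mathbf{w})$ is minimized?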

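4 Learning the Synaptic Weights
How would you change $\mathbf{w}$ so that $E(\mathbf{w})$ is minimized?
Gradient Descent: Change $\mathbf{w}$ in proportion to $-dE/d\mathbf{w}$ (why?)
$\mathbf{w} \to \mathbf{w} - \epsilon \frac{dE}{d\mathbf{w}}$
$\frac{dE}{d\mathbf{w}} = -\sum_m (d^m - v^m)\, g'(\mathbf{w}^T \mathbf{u}^m)\, \mathbf{u}^m$
Here $(d^m - v^m)$ is the delta (error) term and $g'$ is the derivative of the sigmoid.
Also known as the "delta rule" or "LMS (least mean square) rule"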
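As a rough illustration of this gradient-descent rule, the sketch below trains a single sigmoid unit on a tiny data set. The data (logical OR with a constant bias input), the learning rate, the number of epochs, and the function name `train_delta_rule` are all assumptions chosen for illustration. It uses the standard identity $g'(a) = \beta\, g(a)(1 - g(a))$ for the logistic sigmoid.

```python
import numpy as np

def sigmoid(a, beta=1.0):
    return 1.0 / (1.0 + np.exp(-beta * a))

def train_delta_rule(U, d, epsilon=0.5, epochs=2000, beta=1.0, seed=0):
    """Batch gradient descent on E(w) = 1/2 * sum_m (d^m - v^m)^2."""
    rng = np.random.default_rng(seed)
    w = rng.normal(scale=0.1, size=U.shape[1])
    errors = []
    for _ in range(epochs):
        v = sigmoid(U @ w, beta)           # v^m = g(w^T u^m) for every example
        delta = d - v                      # error term (d^m - v^m)
        errors.append(0.5 * np.sum(delta ** 2))
        g_prime = beta * v * (1.0 - v)     # derivative of the sigmoid
        grad = -(delta * g_prime) @ U      # dE/dw
        w = w - epsilon * grad             # w -> w - epsilon * dE/dw
    return w, errors

# Assumed toy data: logical OR; the third input is a constant bias of 1.
U = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=float)
d = np.array([0, 1, 1, 1], dtype=float)
w, errors = train_delta_rule(U, d)
predictions = sigmoid(U @ w) > 0.5
```

The list `errors` holds $E(\mathbf{w})$ at the start of each epoch, so plotting it shows how the error falls as the weights are adapted.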

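5 But wait. What if we have multiple layers?
$v_i = g\left(\sum_j W_{ji}\, g\left(\sum_k w_{kj} u_k\right)\right)$
Output $\mathbf{v} = (v_1\ v_2\ \ldots\ v_J)^T$; Desired = $\mathbf{d}$
The delta rule can be used to adapt the hidden-output weights $W$.
How do we adapt the input-hidden weights $w$? (No desired output is provided for the hidden layer.)
Input $\mathbf{u} = (u_1\ u_2\ \ldots\ u_K)^T$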

6 Enter the backpropagation algorithm (Actually, nothing but the chain rule from calculus)

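7 Uppermost layer (delta rule)
$v_i = g\left(\sum_j W_{ji} x_j\right)$, where $x_j$ are the hidden unit outputs
$E(W, w) = \frac{1}{2} \sum_i (d_i - v_i)^2$
Learning rule for hidden-output weights $W$:
$W_{ji} \to W_{ji} - \epsilon \frac{dE}{dW_{ji}}$ {gradient descent}
$\frac{dE}{dW_{ji}} = -(d_i - v_i)\, g'\left(\sum_j W_{ji} x_j\right) x_j$ {delta rule}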

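8 Backpropagation: Inner layer (chain rule)
$E(W, w) = \frac{1}{2} \sum_{m,i} (d_i^m - v_i^m)^2$, with $v_i^m = g\left(\sum_j W_{ji} x_j^m\right)$ and $x_j^m = g\left(\sum_k w_{kj} u_k^m\right)$
Learning rule for input-hidden weights $w$:
$w_{kj} \to w_{kj} - \epsilon \frac{dE}{dw_{kj}}$
But: $\frac{dE}{dw_{kj}} = \sum_m \frac{dE}{dx_j^m} \cdot \frac{dx_j^m}{dw_{kj}}$ {chain rule}
$\frac{dE}{dw_{kj}} = -\sum_{m,i} (d_i^m - v_i^m)\, g'\left(\sum_{j'} W_{j'i} x_{j'}^m\right) W_{ji}\; g'\left(\sum_{k'} w_{k'j} u_{k'}^m\right) u_k^m$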
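The two gradients (delta rule for $W$, chain rule for $w$) can be written compactly with matrices. The sketch below is one possible NumPy rendering, assuming $\beta = 1$ so that $g' = g(1 - g)$. The XOR data, layer sizes, learning rate, and the names `forward`, `error`, and `gradients` are illustrative assumptions rather than anything specified in the lecture.

```python
import numpy as np

def g(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward(W, w, U):
    X = g(U @ w)   # hidden outputs x_j^m, shape (examples, hidden)
    V = g(X @ W)   # network outputs v_i^m, shape (examples, outputs)
    return X, V

def error(W, w, U, D):
    _, V = forward(W, w, U)
    return 0.5 * np.sum((D - V) ** 2)

def gradients(W, w, U, D):
    X, V = forward(W, w, U)
    out_delta = -(D - V) * V * (1.0 - V)   # -(d_i - v_i) * g'(sum_j W_ji x_j)
    dE_dW = X.T @ out_delta                # delta rule for hidden-output weights
    dE_dX = out_delta @ W.T                # dE/dx_j: error passed back through W
    hid_delta = dE_dX * X * (1.0 - X)      # times g'(sum_k w_kj u_k)
    dE_dw = U.T @ hid_delta                # chain rule for input-hidden weights
    return dE_dW, dE_dw

# Assumed toy problem: XOR, with a constant bias input as the third column.
rng = np.random.default_rng(0)
U = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=float)
D = np.array([[0], [1], [1], [0]], dtype=float)
w = rng.normal(scale=0.5, size=(3, 4))   # w[k, j]: input k -> hidden j
W = rng.normal(scale=0.5, size=(4, 1))   # W[j, i]: hidden j -> output i
epsilon = 0.5
initial_error = error(W, w, U, D)
for _ in range(5000):
    dE_dW, dE_dw = gradients(W, w, U, D)
    W -= epsilon * dE_dW
    w -= epsilon * dE_dw
final_error = error(W, w, U, D)
```

A useful sanity check for any backprop implementation is to compare `gradients` against finite differences of `error`; the two should agree to several decimal places.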

9 Example: Learning to Drive

10 Example Network
Training Input: get current camera image, $\mathbf{u} = (u_1\ u_2\ \ldots\ u_{960})$ = image pixels
Training Output: get steering angle, $\mathbf{d} = (d_1\ d_2\ \ldots\ d_{30})$
(Pomerleau, 1992)

11 Training the network using backprop
Start with random weights $W$, $w$
Given input $\mathbf{u}$, network produces output $\mathbf{v}$
Use backprop to learn $W$ and $w$ that minimize total error over all output units (labeled $i$):
$E(W, w) = \frac{1}{2} \sum_i (d_i - v_i)^2$

12 Learning to Drive using Backprop
[Figure: one of the learned "road features" $w_i$]

13 ALVINN (Autonomous Land Vehicle in a Neural Network)
Trained using human driver + camera images
After learning:
Drove up to 70 mph on highway
Up to [...] miles without intervention
Drove cross-country largely autonomously
(Pomerleau)

14 But that doesn't help me find food in a maze

15 Humans (and animals in general) don't get exact supervisory signals (commands for muscles) for learning to talk, walk, ride a bicycle, play the piano, drive, etc.
We learn by trial-and-error (with hints from others)
Might get "rewards and punishments" along the way
Enter Reinforcement Learning

16 The Reinforcement Learning "Agent"
[Diagram: the Agent receives State $u$ and Reward $r$ from the Environment and sends Action $a$ back to it]

17 The Reinforcement Learning Framework
Unsupervised learning: Learn the hidden causes of inputs
Supervised learning: Learn a function based on training examples of (input, desired output) pairs
Reinforcement Learning: Learn the best action for any given state so as to maximize total expected (future) reward
Intermediate between unsupervised and supervised learning
Instead of explicit teaching signal (or desired output), you get rewards or punishments
Inspired by classical conditioning experiments

18 Early Results: Pavlov and his Dog
Classical (Pavlovian) conditioning experiments
Training: Bell → Food
After: Bell → Salivate
Conditioned stimulus (bell) predicts future reward (food)
http://employees.csbsju.edu/tcreed/pb/pdoganim.html

19 Predicting Delayed Rewards
Reward is typically delivered at the end (when you know whether you succeeded or not)
Time: $0 \le t \le T$ with stimulus $u(t)$ and reward $r(t)$ at each time step $t$ (Note: $r(t)$ can be zero at some time points)
Key Idea: Make the output $v(t)$ predict total expected future reward starting from time $t$:
$v(t) = \left\langle \sum_{\tau=0}^{T-t} r(t+\tau) \right\rangle$

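20 Learning to Predict Delayed Rewards
Use a set of modifiable weights $w(\tau)$ and predict based on all past stimuli $u(t)$:
$v(t) = \sum_{\tau=0}^{t} w(\tau)\, u(t-\tau)$
Would like to find the weights (or filter) $w(\tau)$ that minimize:
$\left(\sum_{\tau=0}^{T-t} r(t+\tau) - v(t)\right)^2$
Can we minimize this using gradient descent and the delta rule?
Yes, BUT the future rewards are not yet available at time $t$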

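21 Temporal Difference (TD) Learning
Key Idea: Rewrite squared error to get rid of future terms:
$\left(\sum_{\tau=0}^{T-t} r(t+\tau) - v(t)\right)^2 = \left(r(t) + \sum_{\tau=0}^{T-t-1} r(t+1+\tau) - v(t)\right)^2 \approx \left(r(t) + v(t+1) - v(t)\right)^2$
Temporal Difference (TD) Learning:
$w(\tau) \to w(\tau) + \epsilon \left[r(t) + v(t+1) - v(t)\right] u(t-\tau)$
Here $r(t) + v(t+1)$ is the expected future reward and $v(t)$ is the prediction.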
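A small simulation can make the TD rule concrete. The sketch below assumes a single unit-height stimulus pulse at $t = 100$ and a unit reward at $t = 200$ (the timings used on the following slide). The trial length of 250 steps, $\epsilon = 0.5$, 500 trials, and applying the weight change once per trial rather than at every time step are further assumptions made for brevity.

```python
import numpy as np

def td_trials(T=250, t_stim=100, t_reward=200, epsilon=0.5, n_trials=500):
    """TD learning of v(t) = sum_tau w(tau) * u(t - tau) over repeated trials."""
    u = np.zeros(T); u[t_stim] = 1.0      # stimulus pulse
    r = np.zeros(T); r[t_reward] = 1.0    # delayed reward
    w = np.zeros(T)                       # weights w(tau), tau = 0 .. T-1
    delta_history = []
    for _ in range(n_trials):
        v = np.convolve(w, u)[:T]         # prediction v(t)
        v_next = np.append(v[1:], 0.0)    # v(t+1), taken as 0 after the trial ends
        delta = r + v_next - v            # TD error r(t) + v(t+1) - v(t)
        delta_history.append(delta)
        for tau in range(T):
            # w(tau) += epsilon * sum_t delta(t) * u(t - tau)
            w[tau] += epsilon * np.dot(delta[tau:], u[:T - tau])
    v_final = np.convolve(w, u)[:T]
    return w, v_final, delta_history

w, v, delta_history = td_trials()
first_delta, last_delta = delta_history[0], delta_history[-1]
```

In this sketch the prediction error sits at the time of the reward on the first trial. After learning, `v` is close to 1 between stimulus and reward, the error at the reward time has vanished, and the remaining error is at the transition into the stimulus, which the model cannot predict.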

22 Predicting Delayed Reward: TD Learning
Stimulus at $t = 100$ and reward at $t = 200$
[Figure: prediction error for each time step (over many trials)]

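23 Reward Prediction Error in the Primate Brain?
Dopaminergic cells in the Ventral Tegmental Area (VTA)
Reward prediction error? $[r(t) + v(t+1) - v(t)]$
Before Training: response at the time of the reward
After Training: response $[0 + v(t+1) - v(t)]$ at the stimulus; no error at the reward, where $v(t) = r(t) + v(t+1)$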

24 More Evidence for Prediction Error Signals
Dopaminergic cells in VTA
Negative error when the expected reward is omitted: $r(t) = 0$, $v(t+1) = 0$, so $[r(t) + v(t+1) - v(t)] = -v(t)$

25 That's great, but how does all that math help me get food in a maze?

26 Using Reward Predictions to Select Actions
Suppose you have computed a "Value" for each action
$Q(a)$ = value (predicted reward) for executing action $a$
Higher if action yields more reward, lower otherwise
Can select actions probabilistically according to their value:
$P(a) = \frac{\exp(\beta Q(a))}{\sum_{a'} \exp(\beta Q(a'))}$
High $\beta$ selects actions with highest $Q$ value. Low $\beta$ selects more uniformly.
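The effect of $\beta$ on this probabilistic rule is easy to see numerically. The following is a minimal sketch; the function name `softmax_policy` and the example $Q$ values are assumptions for illustration. Subtracting the maximum before exponentiating is a common numerical-stability step and does not change the probabilities.

```python
import numpy as np

def softmax_policy(Q, beta):
    """P(a) = exp(beta * Q(a)) / sum_a' exp(beta * Q(a'))."""
    z = beta * np.asarray(Q, dtype=float)
    z = z - z.max()          # stabilise the exponentials
    p = np.exp(z)
    return p / p.sum()

Q = [1.0, 2.0]                            # assumed action values
p_uniform = softmax_policy(Q, beta=0.0)   # ignores the values entirely
p_explore = softmax_policy(Q, beta=1.0)   # favours the better action, still explores
p_exploit = softmax_policy(Q, beta=50.0)  # almost always picks the best action

rng = np.random.default_rng(0)
action = rng.choice(len(Q), p=p_explore)  # sample an action index
```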

27 Simple Example: Bee Foraging
Experiment: Bees select either yellow (y) or blue (b) flowers based on nectar reward
Idea: Value of yellow/blue = average reward obtained so far
$Q(y) \to Q(y) + \epsilon\,(r_y - Q(y))$
$Q(b) \to Q(b) + \epsilon\,(r_b - Q(b))$
(delta rule = running average)
$P(y) = \frac{\exp(\beta Q(y))}{\exp(\beta Q(y)) + \exp(\beta Q(b))}$, $P(b) = 1 - P(y)$
Yum!
http://svi.cps.utexas.edu/bee_on_flower_original.htm

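28 Simulating Bees
Rewards $r_y$ and $r_b$ take the values 1 and 2 and are swapped partway through the simulation
$\beta = 1$: exploration possible
$\beta = 50$: mostly exploitation
[Figure: $Q(y)$ and $Q(b)$ over flower visits for each value of $\beta$]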
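A simulation along these lines can be sketched as follows. The specific schedule (yellow pays 2 and blue pays 1 for the first 100 visits, then the rewards swap), the learning rate $\epsilon = 0.1$, deterministic rewards, and the function name `simulate_bee` are all assumptions, since the slide text does not preserve these details.

```python
import numpy as np

def simulate_bee(beta, epsilon=0.1, visits_per_phase=100, seed=0):
    """Bee chooses flowers by softmax over running-average values Q(y), Q(b)."""
    rng = np.random.default_rng(seed)
    Q = {"y": 0.0, "b": 0.0}
    # Assumed reward schedule: the two flowers swap rewards after the first phase.
    phases = [{"y": 2.0, "b": 1.0}, {"y": 1.0, "b": 2.0}]
    choices = []
    for rewards in phases:
        for _ in range(visits_per_phase):
            # P(y) = exp(beta Q(y)) / (exp(beta Q(y)) + exp(beta Q(b)))
            p_yellow = 1.0 / (1.0 + np.exp(-beta * (Q["y"] - Q["b"])))
            flower = "y" if rng.random() < p_yellow else "b"
            # Delta rule (running average) for the visited flower only.
            Q[flower] += epsilon * (rewards[flower] - Q[flower])
            choices.append(flower)
    return choices, Q

choices_explore, Q_explore = simulate_bee(beta=1.0)
choices_exploit, Q_exploit = simulate_bee(beta=50.0)
```

Comparing the fraction of visits to each flower in the two phases for `beta=1.0` and `beta=50.0` gives a feel for the exploration versus exploitation trade-off: with a large $\beta$ the bee commits quickly to whichever flower first looks better and may be slow to notice when the rewards change.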

29 Forget bees, how do I get to the food in the maze?

30 Selecting Actions when Reward is Delayed
States: A, B, or C
Possible actions at any state: Left (L) or Right (R)
If you randomly choose to go L or R (random "policy"), what is the value $v$ of each state?

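31 Policy Evaluation
For random policy: $v(B)$ and $v(C)$ are the average rewards received from B and C, and $v(A) = \frac{1}{2}\left(v(B) + v(C)\right)$
Can learn this using TD learning. Let $v(u) = w(u)$:
$w(u) \to w(u) + \epsilon \left[r_a(u) + v(u') - v(u)\right]$
(Location, action) → new location: $(u, a) \to u'$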
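Policy evaluation with the TD rule can be sketched on a three-state maze. The reward values below (0 or 5 on leaving B, 2 or 0 on leaving C, nothing on leaving A) are assumptions chosen for illustration, as are the learning rate, the number of trials, and the names `MAZE` and `evaluate_random_policy`. Under these assumed rewards the random-policy values would be $v(B) = 2.5$, $v(C) = 1$, and $v(A) = 1.75$.

```python
import numpy as np

# Assumed maze: (state, action) -> (reward, next state); None ends the trial.
MAZE = {
    ("A", "L"): (0.0, "B"), ("A", "R"): (0.0, "C"),
    ("B", "L"): (0.0, None), ("B", "R"): (5.0, None),
    ("C", "L"): (2.0, None), ("C", "R"): (0.0, None),
}

def evaluate_random_policy(trials=10000, epsilon=0.01, seed=0):
    """TD learning of state values v(u) = w(u) under a 50/50 random policy."""
    rng = np.random.default_rng(seed)
    w = {"A": 0.0, "B": 0.0, "C": 0.0}
    for _ in range(trials):
        u = "A"
        while u is not None:
            a = "L" if rng.random() < 0.5 else "R"
            reward, u_next = MAZE[(u, a)]
            v_next = w[u_next] if u_next is not None else 0.0
            # w(u) -> w(u) + epsilon * [r_a(u) + v(u') - v(u)]
            w[u] += epsilon * (reward + v_next - w[u])
            u = u_next
    return w

values = evaluate_random_policy()
```

Because the learning rate is held constant, the learned values keep fluctuating around the true averages; a smaller `epsilon` reduces the fluctuation at the cost of slower learning.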

32 Maze Value Learning for Random Policy
Once I know the values, I can pick the action that leads to the higher valued state!

33 Selecting Actions based on Values
[Figure: maze annotated with state values 2.5 and 1]
Values act as surrogate immediate rewards
Locally optimal choice leads to globally optimal policy for "Markov" environments
Related to Dynamic Programming in CS (see appendix in text)

34 Actor-Critic Learning
Two separate components: Actor (maintains policy) and Critic (maintains value of each state)
1. Critic Learning ("Policy Evaluation"): Value of state $u$ = $v(u) = w(u)$
$w(u) \to w(u) + \epsilon \left[r_a(u) + v(u') - v(u)\right]$ (same as TD rule)
2. Actor Learning ("Policy Improvement"):
$P(a; u) = \frac{\exp(\beta Q_a(u))}{\sum_b \exp(\beta Q_b(u))}$ (use this to select an action $a$ in $u$)
For all $a'$: $Q_{a'}(u) \to Q_{a'}(u) + \epsilon \left[r_a(u) + v(u') - v(u)\right] \left(\delta_{aa'} - P(a'; u)\right)$
3. Interleave 1 and 2
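The interleaved critic and actor updates can be sketched on the same kind of three-state maze. As before, the reward values (0 or 5 from B, 2 or 0 from C), the parameter settings $\epsilon = 0.5$ and $\beta = 1$, the number of trials, and the names `MAZE`, `policy`, and `actor_critic` are assumptions made for illustration only.

```python
import numpy as np

ACTIONS = ("L", "R")
# Assumed maze: (state, action) -> (reward, next state); None ends the trial.
MAZE = {
    ("A", "L"): (0.0, "B"), ("A", "R"): (0.0, "C"),
    ("B", "L"): (0.0, None), ("B", "R"): (5.0, None),
    ("C", "L"): (2.0, None), ("C", "R"): (0.0, None),
}

def policy(q, beta):
    """Softmax P(a; u) over the actor's action values for one state."""
    z = beta * (q - np.max(q))
    p = np.exp(z)
    return p / p.sum()

def actor_critic(trials=5000, epsilon=0.5, beta=1.0, seed=0):
    rng = np.random.default_rng(seed)
    w = {u: 0.0 for u in "ABC"}            # critic: state values v(u) = w(u)
    Q = {u: np.zeros(2) for u in "ABC"}    # actor: action values Q_a(u)
    for _ in range(trials):
        u = "A"
        while u is not None:
            p = policy(Q[u], beta)
            a = rng.choice(2, p=p)         # select action using P(a; u)
            reward, u_next = MAZE[(u, ACTIONS[a])]
            v_next = w[u_next] if u_next is not None else 0.0
            delta = reward + v_next - w[u] # TD error shared by critic and actor
            w[u] += epsilon * delta        # 1. critic update
            one_hot = np.zeros(2)
            one_hot[a] = 1.0               # delta_{a a'}
            Q[u] += epsilon * delta * (one_hot - p)  # 2. actor update
            u = u_next
    return w, {u: policy(Q[u], beta) for u in Q}

values, probs = actor_critic()
p_left = {u: probs[u][0] for u in probs}   # probability of going Left per state
```

With these assumed rewards, the learned policy should come to prefer Right at B, Left at C, and Left at A, since B leads to the larger reward.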

35 Actor-Critic Learning in the Maze Task
[Figure: probability of going Left at a location]

36 Demo of Reinforcement Learning in a Robot (from http://sysplan.nams.kyushu-u.ac.jp/gen/papers/javademoml97/robodemo.html)

37 Things to do:
Finish homework 3
Work on group project
Next week: Prof. Emo Todorov on motor control
