Chapter 21. Reinforcement Learning. The Reinforcement Learning Agent



CSE 47, Chapter 21: Reinforcement Learning

The Reinforcement Learning Agent
[Diagram: the agent receives state u and reward r from the environment and sends action a to the environment.]
CSE AI Faculty


Why reinforcement learning?
Programming an agent to drive a car or fly a helicopter is very hard!
Can an agent learn to drive or fly through positive/negative rewards?

Why reinforcement learning?
Can an agent learn to win at board games through rewards?
Win = large positive reward, Lose = negative reward
Learn an evaluation function for different board positions
Play games against itself

Why reinforcement learning?
Humans and animals learn through rewards
Reinforcement learning as a model of brain function
Pavlov's dog
Training: Bell → Food
After: Bell → Salivate

Toy Example: Agent in a Maze
[Figure: grid maze with one reward location and one punishment location.]
States = maze locations (grid coordinates)
Actions = move forward, left, right, back
Rewards = a positive reward at one location, a negative reward (punishment) at another, and a small negative reward at all other locations (the cost of moving)

Actions might be noisy
An action may not always succeed.
E.g., 0.9 probability of moving forward, 0.1 probability divided equally among the other neighboring locations.
Characterized by transition probabilities: P(next state | current state, action)

Goal: Learn a Policy
Policy = for each state, what is the best action that maximizes my expected reward?
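The noisy-action model above can be sketched in a few lines of Python. This is only an illustration: the function names, the neighbor labels, and the way neighbors are passed in are assumptions, not part of the lecture.

```python
import random

def transition_probabilities(intended, neighbors, p_success=0.9):
    """P(next state | current state, action) for one state-action pair.

    intended:  the location the action tries to reach.
    neighbors: all locations reachable from the current state
               (assumed to include `intended`).
    """
    others = [n for n in neighbors if n != intended]
    if not others:
        return {intended: 1.0}
    # The leftover probability is divided equally among the other neighbors.
    probs = {n: (1.0 - p_success) / len(others) for n in others}
    probs[intended] = p_success
    return probs

def sample_move(intended, neighbors, p_success=0.9, rng=random):
    """Sample the location actually reached under the noisy-action model."""
    others = [n for n in neighbors if n != intended]
    if not others or rng.random() < p_success:
        return intended
    return rng.choice(others)

if __name__ == "__main__":
    # Hypothetical state with three reachable neighbors.
    print(transition_probabilities("forward", ["forward", "left", "right"]))
```

With two other neighbors, each of them receives half of the remaining 0.1 probability.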

Goal: Learn a Policy
The Optimal Policy
[Figure: the optimal policy for the maze.]

A central problem in all these cases is learning to predict future reward.
How do we do it?
Can we use supervised learning?

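Predicting Delayed Rewards
Time: $0 \le t \le T$ with input $u(t)$ and reward $r(t)$ (possibly 0) at each time step
Key Idea: Make the output $v(t)$ of a supervised learner predict the total expected future reward starting from time $t$:
$$v(t) = \left\langle \sum_{\tau=0}^{T-t} r(t+\tau) \right\rangle$$
where $\langle \cdot \rangle$ denotes average.

Learning to Predict Delayed Rewards
Use a set of modifiable weights $w(\tau)$ and predict based on all past inputs $u(t)$ (a linear neural network):
$$v(t) = \sum_{\tau=0}^{t} w(\tau)\, u(t-\tau)$$
Would like to find the $w(\tau)$ that minimize:
$$\left( \sum_{\tau=0}^{T-t} r(t+\tau) - v(t) \right)^2$$
Can we minimize this using gradient descent and the delta rule?
Yes, BUT the future rewards are not yet available.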
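As a small sketch of the prediction and error defined above, the following Python computes $v(t)$ as a weighted sum over past inputs and compares it with the summed future reward. The trial (a stimulus at $t = 1$, a reward of 2 at $t = 3$) and the function names are assumptions made for illustration.

```python
def predict(w, u, t):
    """Linear prediction v(t) = sum over tau = 0..t of w[tau] * u[t - tau]."""
    return sum(w[tau] * u[t - tau] for tau in range(t + 1))

def future_reward(r, t):
    """Total reward from time t to the end of the trial."""
    return sum(r[t:])

def squared_error(w, u, r, t):
    """Squared difference between summed future reward and the prediction.

    Note that it needs r[t], r[t+1], ..., which are not yet known at time t.
    """
    return (future_reward(r, t) - predict(w, u, t)) ** 2

if __name__ == "__main__":
    u = [0, 1, 0, 0, 0]          # hypothetical stimulus at t = 1
    r = [0, 0, 0, 2, 0]          # hypothetical reward at t = 3
    w = [0.0] * len(u)           # untrained weights
    for t in range(len(u)):
        print(t, predict(w, u, t), future_reward(r, t), squared_error(w, u, r, t))
```

The error can only be evaluated after the trial ends, which is the difficulty the slide points out.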

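Temporal Difference (TD) Learning
Key Idea: Rewrite the squared error to get rid of future terms:
$$\left( \sum_{\tau=0}^{T-t} r(t+\tau) - v(t) \right)^2 = \left( r(t) + \sum_{\tau=0}^{T-t-1} r(t+1+\tau) - v(t) \right)^2 \approx \left( r(t) + v(t+1) - v(t) \right)^2$$

Temporal Difference (TD) Learning
TD Learning: For each time step $t$, do:
For all $\tau$ ($0 \le \tau \le t$), do:
$$w(\tau) \to w(\tau) + \varepsilon\,[\,r(t) + v(t+1) - v(t)\,]\,u(t-\tau)$$
Here $r(t) + v(t+1)$ is the expected future reward, $v(t)$ is the prediction, and
$$v(t) = \sum_{\tau=0}^{t} w(\tau)\, u(t-\tau)$$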
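A minimal sketch of the TD rule above, assuming a short trial with one stimulus and one later reward (both hypothetical) and taking $v$ to be 0 beyond the end of the trial:

```python
def td_learn(u, r, epsilon=0.1, n_trials=500):
    """Learn weights w(tau) with the TD rule over repeated identical trials."""
    T = len(u)
    w = [0.0] * T

    def v(t):
        # Prediction v(t) = sum_tau w[tau] * u[t - tau]; assumed 0 after the trial.
        if t >= T:
            return 0.0
        return sum(w[tau] * u[t - tau] for tau in range(t + 1))

    for _ in range(n_trials):
        for t in range(T):
            # Temporal difference error: r(t) + v(t+1) - v(t).
            delta = r[t] + v(t + 1) - v(t)
            for tau in range(t + 1):
                w[tau] += epsilon * delta * u[t - tau]
    return w

if __name__ == "__main__":
    u = [0, 1, 0, 0, 0]   # hypothetical stimulus at t = 1
    r = [0, 0, 0, 2, 0]   # hypothetical reward of 2 at t = 3
    print([round(x, 3) for x in td_learn(u, r)])
```

In this toy trial the weights for delays 0 to 2 approach 2, so the prediction stays at the upcoming reward from stimulus onset until the reward arrives and then drops to 0.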

Temporal Difference Learning in the Brain
Activity of a dopaminergic cell in the Ventral Tegmental Area
[Figure: cell activity before and after training. Panel labels: reward prediction error $[\,r(t) + v(t+1) - v(t)\,]$; before training; after training, $[\,0 + v(t+1) - v(t)\,]$; no error when $v(t) = r(t) + v(t+1)$.]

Selecting Actions when Reward is Delayed
Can we learn the optimal policy for this maze?
States: A, B, or C
Possible actions at any state: Left (L) or Right (R)
If you randomly choose to go L or R (random policy), what is the value $v$ of each state?

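Policy Evaluation
(Location, action) → new location: $(u, a) \to u'$
Use output $v(u) = w(u)$
For the random policy:
$$v(B) = \tfrac{1}{2}(0 + 5) = 2.5, \quad v(C) = \tfrac{1}{2}(2 + 0) = 1, \quad v(A) = \tfrac{1}{2}\big(v(B) + v(C)\big) = 1.75$$
Can learn this using TD learning:
$$w(u) \to w(u) + \varepsilon\,[\,r_a(u) + v(u') - v(u)\,]$$

Maze Value Learning for Random Policy
[Figure: learned values for the maze states under the random policy, e.g. 1.75 for A and 2.5 for B.]
Once I know the values, I can pick the action that leads to the higher-valued state!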
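The policy evaluation above can be tried out with a short simulation. The maze layout used here is an assumption consistent with the values on the slide: from A, left leads to B and right leads to C with no reward; at B the two exits pay 0 and 5; at C they pay 2 and 0. Which exit is left or right does not affect the values under a random policy.

```python
import random

# (state, action) -> (next state, reward); None means the trial ends.
# Layout is assumed for illustration (see the note above).
MAZE = {
    ("A", "L"): ("B", 0.0), ("A", "R"): ("C", 0.0),
    ("B", "L"): (None, 0.0), ("B", "R"): (None, 5.0),
    ("C", "L"): (None, 2.0), ("C", "R"): (None, 0.0),
}

def evaluate_random_policy(epsilon=0.005, n_trials=20000, seed=0):
    """Estimate v(u) = w(u) for the random policy with the TD rule."""
    rng = random.Random(seed)
    w = {"A": 0.0, "B": 0.0, "C": 0.0}
    for _ in range(n_trials):
        u = "A"
        while u is not None:
            a = rng.choice("LR")                 # random policy
            u_next, r = MAZE[(u, a)]
            v_next = w[u_next] if u_next is not None else 0.0
            w[u] += epsilon * (r + v_next - w[u])  # TD update
            u = u_next
    return w

if __name__ == "__main__":
    print(evaluate_random_policy())
```

Because rewards are sampled, the estimates fluctuate around 1.75, 2.5, and 1 rather than settling exactly; a smaller $\varepsilon$ reduces the fluctuation at the cost of slower learning.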

Selecting Actions based on Values
$v(B) = 2.5$, $v(C) = 1$
Values act as surrogate immediate rewards
Locally optimal choice leads to globally optimal policy
Related to Dynamic Programming

Q learning
Simple method for action selection based on action values (or Q values) $Q(u, a)$, where $u$ is a state and $a$ is an action.
1. Let $u$ be the current state. Select an action $a$ according to:
$$P(a) = \frac{\exp(\beta Q(u, a))}{\sum_{a'} \exp(\beta Q(u, a'))}$$
2. Execute $a$ and record the new state $u'$ and reward $r$. Update Q:
$$Q(u, a) \to Q(u, a) + \varepsilon \left( r + \max_{a'} Q(u', a') - Q(u, a) \right)$$
3. Repeat until an end state is reached
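The three Q learning steps above can be sketched as follows. The maze layout is the same assumed one as in the policy evaluation example (A leads to B or C; B pays 0 or 5; C pays 2 or 0), and the parameter values are arbitrary choices for illustration.

```python
import math
import random

# (state, action) -> (next state, reward); None means an end state is reached.
# Layout is assumed for illustration.
MAZE = {
    ("A", "L"): ("B", 0.0), ("A", "R"): ("C", 0.0),
    ("B", "L"): (None, 0.0), ("B", "R"): (None, 5.0),
    ("C", "L"): (None, 2.0), ("C", "R"): (None, 0.0),
}
ACTIONS = ("L", "R")

def softmax_action(Q, u, beta, rng):
    """Step 1: pick a with probability proportional to exp(beta * Q(u, a))."""
    q = [Q[(u, a)] for a in ACTIONS]
    q_max = max(q)  # subtracting the max leaves the probabilities unchanged
    weights = [math.exp(beta * (x - q_max)) for x in q]
    return rng.choices(ACTIONS, weights=weights)[0]

def q_learning(epsilon=0.1, beta=1.0, n_trials=5000, seed=0):
    rng = random.Random(seed)
    Q = {key: 0.0 for key in MAZE}
    for _ in range(n_trials):
        u = "A"
        while u is not None:                      # Step 3: repeat until the end
            a = softmax_action(Q, u, beta, rng)
            u_next, r = MAZE[(u, a)]              # Step 2: execute a
            best_next = 0.0 if u_next is None else max(Q[(u_next, b)] for b in ACTIONS)
            Q[(u, a)] += epsilon * (r + best_next - Q[(u, a)])
            u = u_next
    return Q

if __name__ == "__main__":
    for key, value in sorted(q_learning().items()):
        print(key, round(value, 3))
```

A larger $\beta$ makes the choice more greedy, while a smaller $\beta$ keeps the agent exploring the lower-valued actions.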

Reinforcement Learning Applications
Example: Flying a helicopter via reinforcement learning
Videos: work of Andrew Ng, Stanford
http://ai.stanford.edu/~ang/
