CSE 471: Reinforcement Learning

The Reinforcement Learning Agent
[Figure: the agent receives state u and reward r from the environment and sends back action a]

CSE AI Faculty
Why reinforcement learning?
Programming an agent to drive a car or fly a helicopter is very hard!
Can an agent learn to drive or fly through positive/negative rewards?

Why reinforcement learning?
Can an agent learn to win at board games through rewards?
Win = large positive reward, Lose = negative reward
Learn an evaluation function for different board positions
Play games against itself
Why reinforcement learning?
Humans and animals learn through rewards
Reinforcement learning as a model of brain function
Pavlov's dog: Training: Bell + Food; After: Bell -> Salivate

Toy Example: Agent in a Maze
States = maze locations
Actions = Move forward, left, right, back
Rewards = +10 at the reward square, -10 at the punishment square, -1 at all others (cost of moving)
Actions might be noisy
An action may not always succeed
E.g., 0.9 probability of moving forward, 0.1 probability divided equally among the other neighboring locations
Characterized by transition probabilities: P(next state | current state, action)

Goal: Learn a Policy
[Maze figure with the +10 and -10 squares]
Policy = for each state, the best action that maximizes my expected reward
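As a concrete sketch of such a noisy transition model, the following Python samples a successor location from P(next state | current state, action) on a grid with the 0.9/0.1 split described above. The coordinate and move conventions are my own assumptions for illustration, not the course's code:

```python
import random

# Hypothetical noisy transition model for a grid maze (a sketch, not the
# course's actual code): the intended move succeeds with probability 0.9;
# the remaining 0.1 is split equally among the other three directions.
MOVES = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}

def sample_next_state(state, action, p_success=0.9):
    """Sample a successor state from P(next state | state, action)."""
    x, y = state
    if random.random() < p_success:
        dx, dy = MOVES[action]
    else:
        # Slip: pick one of the other three directions uniformly
        others = [a for a in MOVES if a != action]
        dx, dy = MOVES[random.choice(others)]
    return (x + dx, y + dy)

# Example: from (1, 1), "right" yields (2, 1) about 90% of the time
freq = sum(sample_next_state((1, 1), "right") == (2, 1)
           for _ in range(10000)) / 10000
print(freq)  # close to 0.9
```

Walls and grid boundaries are ignored here; a full maze simulator would clip moves that leave the grid.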
Goal: Learn a Policy
[Figure: the optimal policy for the maze]

A central problem in all these cases is learning to predict future reward.
How do we do it? Can we use supervised learning?
Predicting Delayed Rewards
Time t = 0, ..., T, with input u(t) and reward r(t) (possibly 0) at each time step
Key Idea: Make the output of a supervised learner predict the total expected future reward starting from time t:
  v(t) = Sum_{tau=0..T-t} <r(t+tau)>     (< > denotes average)

Learning to Predict Delayed Rewards
Use a set of modifiable weights w(tau) and predict based on all past inputs u:
  v(t) = Sum_{tau=0..t} w(tau) u(t-tau)     (a linear neural network)
Would like to find w that minimize:
  < [ Sum_{tau=0..T-t} r(t+tau) - v(t) ]^2 >
Can we minimize this using gradient descent and the delta rule?
Yes, BUT the future rewards are not yet available!
Temporal Difference (TD) Learning
Key Idea: Rewrite the squared error to get rid of the future terms:
  Sum_{tau=0..T-t} r(t+tau) = r(t+1) + Sum_{tau=0..T-t-1} r(t+1+tau)
                            ~ r(t+1) + v(t+1)

Temporal Difference (TD) Learning
TD Learning: For each time step t, do:
  Compute the prediction v(t) = Sum_{tau=0..t} w(tau) u(t-tau)
  For all tau <= t, do:
    w(tau) -> w(tau) + eps [ r(t+1) + v(t+1) - v(t) ] u(t-tau)
  (r(t+1) + v(t+1) approximates the expected future reward; v(t) is the prediction)
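The TD rule above can be sketched in a few lines of Python. The setup is an assumption for illustration (one stimulus at t = 2, one reward at t = 5, 10 time steps per trial); after training, v(t) comes to predict the total future reward from the stimulus onward:

```python
import numpy as np

# A sketch of the TD rule above (assumed setup: one stimulus at t=2,
# one reward at t=5, T=10 time steps per trial).
# v(t) = sum_tau w(tau) u(t-tau); delta(t) = r(t+1) + v(t+1) - v(t);
# w(tau) += eps * delta(t) * u(t-tau).
T = 10
u = np.zeros(T); u[2] = 1.0   # stimulus at t = 2
r = np.zeros(T); r[5] = 1.0   # delayed reward at t = 5
w = np.zeros(T)
eps = 0.1

def predict(w, u):
    # v(t) = Sum_{tau=0..t} w(tau) u(t-tau)
    return np.array([np.dot(w[:t + 1], u[t::-1]) for t in range(T)])

for trial in range(500):
    v = predict(w, u)
    for t in range(T - 1):
        delta = r[t + 1] + v[t + 1] - v[t]   # TD prediction error
        w[:t + 1] += eps * delta * u[t::-1]  # delta rule on past inputs

v = predict(w, u)
print(np.round(v, 2))  # approximately [0 0 1 1 1 0 0 0 0 0]
```

Note how v(t) becomes 1 between stimulus and reward: the network predicts the upcoming reward as soon as the stimulus appears, which is exactly the behavior the dopamine recordings on the next slide illustrate.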
Temporal Difference Learning in the Brain
Activity of a dopaminergic cell in the Ventral Tegmental Area encodes the reward prediction error [ r(t+1) + v(t+1) - v(t) ]
Before training: the cell fires at the reward (error = r, reward unpredicted)
After training: no response at the reward (error = 0, reward fully predicted)

Selecting Actions when Reward is Delayed
Can we learn the optimal policy for this maze?
States: A, B, or C
Possible actions at any state: Left (L) or Right (R)
If you randomly choose to go L or R (a random policy), what is the value of each state?
Policy Evaluation
(location, action) -> new location: (u, a) -> u'
Use output v(u) = w(u)
For the random policy:
  v(B) = 1/2 (0 + 5) = 2.5
  v(C) = 1/2 (2 + 0) = 1
  v(A) = 1/2 (v(B) + v(C)) = 1.75
Can learn this using TD learning:
  w(u) -> w(u) + eps [ r_a(u) + v(u') - v(u) ]

Maze Value Learning for the Random Policy
v(A) = 1.75, v(B) = 2.5, v(C) = 1
Once I know the values, I can pick the action that leads to the higher-valued state!
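A minimal sketch of TD policy evaluation for this three-state maze, assuming the arm rewards implied by the slide's values (5 from B's right arm, 2 from C's left arm, 0 elsewhere; these specific payoffs are an assumption):

```python
import random

# TD policy evaluation under the random policy for a three-state binary maze:
# A branches to B (left) or C (right); B and C branch to terminal leaves.
# Arm rewards (5 and 2) are assumptions consistent with v(B)=2.5, v(C)=1.
rewards = {("A", "L"): 0, ("A", "R"): 0, ("B", "L"): 0, ("B", "R"): 5,
           ("C", "L"): 2, ("C", "R"): 0}
nxt = {("A", "L"): "B", ("A", "R"): "C"}   # B and C lead to terminal leaves
v = {"A": 0.0, "B": 0.0, "C": 0.0}
eps = 0.02
random.seed(1)

for episode in range(20000):
    u = "A"
    while u is not None:
        a = random.choice("LR")              # random policy
        r = rewards[(u, a)]
        u2 = nxt.get((u, a))                 # None = terminal
        v2 = v[u2] if u2 else 0.0
        v[u] += eps * (r + v2 - v[u])        # TD update
        u = u2

print({s: round(x, 2) for s, x in v.items()})
# close to the slide's values: A: 1.75, B: 2.5, C: 1.0
```

With a fixed learning rate the estimates fluctuate around the true values; decaying eps over episodes would make them converge exactly.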
Selecting Actions based on Values
v(B) = 2.5, v(C) = 1
Values act as surrogate immediate rewards
The locally optimal choice leads to the globally optimal policy
Related to Dynamic Programming

Q Learning
A simple method for action selection based on action values (or Q values) Q(u,a), where u is a state and a is an action:
1. Let u be the current state. Select an action a according to:
     P(a) = exp(beta Q(u,a)) / Sum_{a'} exp(beta Q(u,a'))
2. Execute a and record the new state u' and reward r.
3. Update Q:
     Q(u,a) -> Q(u,a) + eps [ r + max_{a'} Q(u',a') - Q(u,a) ]
4. Repeat until an end state is reached.
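The four steps above can be sketched as follows for the slide's three-state maze; the arm rewards (5 and 2) and the parameter values eps and beta are assumptions for illustration:

```python
import math
import random

# Q learning with softmax action selection on the three-state maze:
# A branches to B (left) or C (right); B's right arm pays 5, C's left arm
# pays 2 (assumed payoffs). Leaves are terminal.
rewards = {("A", "L"): 0, ("A", "R"): 0, ("B", "L"): 0, ("B", "R"): 5,
           ("C", "L"): 2, ("C", "R"): 0}
nxt = {("A", "L"): "B", ("A", "R"): "C"}
Q = {k: 0.0 for k in rewards}
eps, beta = 0.1, 1.0
random.seed(2)

def softmax_action(u):
    # Step 1: P(a) proportional to exp(beta * Q(u, a))
    weights = [math.exp(beta * Q[(u, a)]) for a in "LR"]
    return random.choices("LR", weights=weights)[0]

for episode in range(5000):
    u = "A"
    while u is not None:                       # Step 4: run to an end state
        a = softmax_action(u)
        r = rewards[(u, a)]                    # Step 2: execute, observe r, u'
        u2 = nxt.get((u, a))                   # None = terminal leaf
        best = max(Q[(u2, b)] for b in "LR") if u2 else 0.0
        Q[(u, a)] += eps * (r + best - Q[(u, a)])   # Step 3: Q update
        u = u2

# Greedy policy: left at A (toward B), then right at B (toward the reward of 5)
print(max("LR", key=lambda a: Q[("A", a)]),
      max("LR", key=lambda a: Q[("B", a)]))
```

The inverse-temperature beta trades off exploration and exploitation: small beta gives near-random action choices, large beta approaches greedy selection.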
Reinforcement Learning Applications
Example: Flying a helicopter via reinforcement learning (videos)
Work of Andrew Ng, Stanford: http://ai.stanford.edu/~ang/