An Introduction to COMPUTATIONAL REINFORCEMENT LEARNING. Andrew G. Barto. Department of Computer Science, University of Massachusetts Amherst


1 An Introduction to COMPUTATIONAL REINFORCEMENT LEARNING. Andrew G. Barto, Department of Computer Science, University of Massachusetts Amherst. UPF Lecture 1. Autonomous Learning Laboratory, Department of Computer Science. (Barcelona Lectures; slides based on R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, MIT Press.)

2 [Diagram: Computational Reinforcement Learning (RL) and related fields: Artificial Intelligence, Psychology, Control Theory and Operations Research, Neuroscience, Artificial Neural Networks.]

3 The Overall Plan. Lecture 1: What is Computational Reinforcement Learning? Learning from evaluative feedback. Markov decision processes. Lecture 2: Dynamic Programming. Basic Monte Carlo methods. Temporal Difference methods. A unified perspective. Connections to neuroscience. Lecture 3: Function approximation. Model-based methods. Dimensions of Reinforcement Learning.

4 What is Reinforcement Learning? Learning from interaction. Goal-oriented learning. Learning about, from, and while interacting with an external environment. Learning what to do, how to map situations to actions, so as to maximize a numerical reward signal.

5 Supervised Learning. Training Info = desired (target) outputs. Inputs → Supervised Learning System → Outputs. Error = (target output − actual output)

6 Reinforcement Learning. Training Info = evaluations ("rewards" / "penalties"). Inputs → RL System → Outputs ("actions"). Objective: get as much reward as possible

7 Key Features of RL. Learner is not told which actions to take. Trial-and-Error search. Possibility of delayed reward: sacrifice short-term gains for greater long-term gains. The need to explore and exploit. Considers the whole problem of a goal-directed agent interacting with an uncertain environment.

8 Complete Agent. Temporally situated. Continual learning and planning. Object is to affect the environment. Environment is stochastic and uncertain. [Diagram: Agent and Environment linked by state, action, and reward.]

9 A Less Misleading View. [Diagram: RL agent with internal state and memory, receiving external sensations, internal sensations, and reward, and producing actions.]

10 Elements of RL. Policy: what to do. Reward: what is good. Value: what is good because it predicts reward. Model: what follows what. [Diagram: Policy, Reward, Value, Model of environment.]

11 An Extended Example: Tic-Tac-Toe. [Figure: a sequence of board positions, alternating x's moves and o's moves.] Assume an imperfect opponent: he/she sometimes makes mistakes.

12 An RL Approach to Tic-Tac-Toe. 1. Make a table with one entry per state: State → V(s), the estimated probability of winning (.5?, .5?, ..., 1 for a win, 0 for a loss or a draw). 2. Now play lots of games. To pick our moves, look ahead one step from the current state to the various possible next states. Just pick the next state with the highest estimated prob. of winning, the largest V(s): a greedy move. But 10% of the time pick a move at random: an exploratory move.

13 RL Learning Rule for Tic-Tac-Toe. [Figure: game tree with greedy and exploratory moves marked.] s: the state before our greedy move; s': the state after our greedy move. We increment each V(s) toward V(s'), a "backup": V(s) ← V(s) + α[V(s') − V(s)], where α is a small positive fraction, e.g., α = .1, the step-size parameter.
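The backup above amounts to a one-line table update. Below is a minimal Python sketch, assuming states are stored as hashable board descriptions and values live in a dictionary; the helper names and the default value of 0.5 for unseen states are illustrative, not from the slides.

import random

V = {}            # state -> estimated probability of winning
ALPHA = 0.1       # step-size parameter
EPSILON = 0.1     # fraction of exploratory moves

def value(state):
    # Unseen states default to 0.5, as in the table initialization above.
    return V.get(state, 0.5)

def choose_next_state(candidate_next_states):
    # Greedy move most of the time, exploratory move 10% of the time.
    if random.random() < EPSILON:
        return random.choice(candidate_next_states)
    return max(candidate_next_states, key=value)

def backup(state, next_state):
    # V(s) <- V(s) + alpha * [V(s') - V(s)]
    V[state] = value(state) + ALPHA * (value(next_state) - value(state))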

14 How can we improve this T.T.T. player? Take advantage of symmetries: representation/generalization. How might this backfire? Do we need random moves? Why? Do we always need a full 10%? Can we learn from random moves? Can we learn offline? Pre-training from self play? Using learned models of opponent? ...

15 How is Tic-Tac-Toe Too Easy? Finite, small number of states. One-step look-ahead is always possible. State completely observable...

16 Some Notable RL Applications. TD-Gammon (Tesauro): world's best backgammon program. Elevator Control (Crites & Barto): high-performance down-peak elevator controller. Inventory Management (Van Roy, Bertsekas, Lee & Tsitsiklis): 10-15% improvement over industry standard methods. Dynamic Channel Assignment (Singh & Bertsekas, Nie & Haykin): high-performance assignment of radio channels to mobile telephone calls.

17 TD-Gammon (Tesauro). [Network diagram: value output; action selection by 2-3 ply search; TD error V_{t+1} − V_t.] Start with a random network. Play very many games against self. Learn a value function from this simulated experience. This produces arguably the best player in the world.

18 Elevator Dispatching (Crites and Barto, 1996). 10 floors, 4 elevator cars. STATES: button states; positions, directions, and motion states of cars; passengers in cars & in halls. ACTIONS: stop at, or go by, next floor. REWARDS: roughly, −1 per time step for each person waiting. Conservatively about ... states.

19 Autonomous Helicopter Flight. A. Ng, Stanford; H. Kim, M. Jordan, S. Sastry, Berkeley.

20 Quadrupedal Locomotion. Nate Kohl & Peter Stone, Univ. of Texas at Austin. All training done with physical robots: Sony Aibo ERS-210A. Before Learning. After 1000 trials, or about 3 hours.

21 Learning Control for Dynamically Stable Walking Robots. Russ Tedrake, Teresa Zhang, H. Sebastian Seung, MIT. Start with a Passive Walker.

22

23 Grasp Control. R. Platt, A. Fagg, R. Grupen, Univ. of Mass. UMass Torso: Dexter.

24 Some RL History. Three threads: Trial-and-Error learning; Temporal-difference learning; Optimal control, value functions. [Timeline: Thorndike (Ψ) 1911; Minsky; Secondary reinforcement (Ψ); Samuel; Hamilton (Physics) 1800s; Shannon; Bellman/Howard (OR); Klopf; Holland; Barto et al.; Witten; Sutton; Werbos; Watkins.]

25 Samuel's Checkers Player (Arthur Samuel, 1959, 1967). Score board configurations by a scoring polynomial (after Shannon, 1950). Minimax to determine backed-up score of a position. Alpha-beta cutoffs. Rote learning: save each board config encountered together with backed-up score; needed a "sense of direction": like discounting. Learning by generalization: similar to TD algorithm.

26 Samuel's Backups.

27 The Basic Idea. "... we are attempting to make the score, calculated for the current board position, look like that calculated for the terminal board positions of the chain of moves which most probably occur during actual play." A. L. Samuel, "Some Studies in Machine Learning Using the Game of Checkers," 1959.

28 MENACE (Michie 1961): Matchbox Educable Noughts and Crosses Engine.

29 The Overall Plan. Lecture 1: What is Computational Reinforcement Learning? Learning from evaluative feedback. Markov decision processes. Lecture 2: Dynamic Programming. Basic Monte Carlo methods. Temporal Difference methods. A unified perspective. Connections to neuroscience. Lecture 3: Function approximation. Model-based methods. Dimensions of Reinforcement Learning.

30 Lecture 1, Part 2: Evaluative Feedback. Evaluating actions vs. instructing by giving correct actions. Pure evaluative feedback depends totally on the action taken. Pure instructive feedback depends not at all on the action taken. Supervised learning is instructive; optimization is evaluative. Associative vs. Nonassociative: Associative: inputs mapped to outputs; learn the best output for each input. Nonassociative: learn (find) one best output. The n-armed bandit (at least how we treat it) is: Nonassociative, Evaluative feedback.

31 The n-Armed Bandit Problem. Choose repeatedly from one of n actions; each choice is called a play a_t. After each play, you get a reward r_t, where E{r_t | a_t} = Q*(a_t). These are unknown action values. The distribution of r_t depends only on a_t. The objective is to maximize the reward in the long term, e.g., over 1000 plays. To solve the n-armed bandit problem, you must explore a variety of actions and then exploit the best of them.

32 The Exploration/Exploitation Dilemma. Suppose you form action value estimates Q_t(a) ≈ Q*(a). The greedy action at t is a_t* = argmax_a Q_t(a). Choosing a_t = a_t* is exploitation; choosing a_t ≠ a_t* is exploration. You can't exploit all the time; you can't explore all the time. You can never stop exploring; but you should always reduce exploring.

33 Action-Value Methods. Methods that adapt action-value estimates and nothing else, e.g.: suppose by the t-th play, action a had been chosen k_a times, producing rewards r_1, r_2, ..., r_{k_a}; then Q_t(a) = (r_1 + r_2 + ... + r_{k_a}) / k_a, the sample average, and lim_{k_a → ∞} Q_t(a) = Q*(a).

34 ε-Greedy Action Selection. Greedy action selection: a_t = a_t* = argmax_a Q_t(a). ε-greedy: a_t = a_t* with probability 1 − ε, or a random action with probability ε. ... the simplest way to try to balance exploration and exploitation.
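A minimal sketch of ε-greedy selection with sample-average estimates, run on a bandit like the 10-armed testbed of the next slide; the reward model and all names here are illustrative assumptions.

import random

def epsilon_greedy(Q, epsilon):
    # Random (exploratory) action with probability epsilon, greedy otherwise.
    if random.random() < epsilon:
        return random.randrange(len(Q))
    return max(range(len(Q)), key=lambda a: Q[a])

n, epsilon, plays = 10, 0.1, 1000
true_values = [random.gauss(0.0, 1.0) for _ in range(n)]   # unknown Q*(a)
Q = [0.0] * n        # action-value estimates
counts = [0] * n     # number of times each action was taken
for _ in range(plays):
    a = epsilon_greedy(Q, epsilon)
    r = random.gauss(true_values[a], 1.0)                   # reward ~ N(Q*(a), 1)
    counts[a] += 1
    Q[a] += (r - Q[a]) / counts[a]                          # incremental sample average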

35 10-Armed Testbed. n = 10 possible actions. Each Q*(a) is chosen randomly from a normal distribution N(0, 1). Each reward r_t is also normal: N(Q*(a_t), 1). 1000 plays per run; repeat the whole thing 2000 times and average the results.

36 ε-Greedy Methods on the 10-Armed Testbed.

37 Softmax Action Selection. Softmax action selection methods grade action probs. by estimated values. The most common softmax uses a Gibbs, or Boltzmann, distribution: choose action a on play t with probability e^{Q_t(a)/τ} / Σ_{b=1}^{n} e^{Q_t(b)/τ}, where τ is the "computational temperature".
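A minimal sketch of softmax action selection using the Gibbs distribution above; the temperature values are illustrative, and a production version would subtract the maximum estimate before exponentiating for numerical stability.

import math
import random

def softmax_action(Q, tau):
    # Sample an action with probability proportional to exp(Q[a] / tau).
    weights = [math.exp(q / tau) for q in Q]
    return random.choices(range(len(Q)), weights=weights, k=1)[0]

Q = [0.2, 1.0, 0.5]
print(softmax_action(Q, tau=0.1))    # low temperature: almost always action 1
print(softmax_action(Q, tau=10.0))   # high temperature: nearly uniform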

38 Linear Learning Automata. Let π_t(a) = Pr{a_t = a} be the only adapted parameter. L_{R-I} (linear, reward-inaction): on success, π_{t+1}(a_t) = π_t(a_t) + α(1 − π_t(a_t)), 0 < α < 1 (the other action probs. are adjusted to still sum to 1); on failure, no change. L_{R-P} (linear, reward-penalty): on success, π_{t+1}(a_t) = π_t(a_t) + α(1 − π_t(a_t)), 0 < α < 1 (the other action probs. are adjusted to still sum to 1); on failure, π_{t+1}(a_t) = π_t(a_t) + α(0 − π_t(a_t)), 0 < α < 1. For two actions, a stochastic, incremental version of the supervised algorithm.
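A minimal sketch of the L_{R-I} and L_{R-P} updates above for a two-action automaton; the renormalization of the other action's probability reflects the "adjusted to still sum to 1" remark, and the function and argument names are illustrative assumptions.

def update_probs(probs, chosen, success, alpha, reward_penalty=False):
    # Move the chosen action's probability toward 1 on success,
    # and (for L_R-P only) toward 0 on failure.
    if success:
        probs[chosen] += alpha * (1.0 - probs[chosen])
    elif reward_penalty:
        probs[chosen] += alpha * (0.0 - probs[chosen])
    probs[1 - chosen] = 1.0 - probs[chosen]   # keep the two probabilities summing to 1
    return probs

probs = [0.5, 0.5]
print(update_probs(probs, chosen=0, success=True, alpha=0.1))   # [0.55, 0.45]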

39 Incremental Implementation. Recall the sample-average estimation method: the average of the first k rewards is (dropping the dependence on a): Q_k = (r_1 + r_2 + ... + r_k) / k. Can we do this incrementally (without storing all the rewards)? We could keep a running sum and count, or, equivalently: Q_{k+1} = Q_k + (1/(k+1))[r_{k+1} − Q_k]. This is a common form for update rules: NewEstimate = OldEstimate + StepSize[Target − OldEstimate].
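A minimal sketch showing that the incremental form reproduces the batch average without storing past rewards; the reward list is an illustrative example.

def incremental_average(rewards):
    Q, k = 0.0, 0
    for r in rewards:
        k += 1
        Q += (r - Q) / k    # NewEstimate = OldEstimate + StepSize * (Target - OldEstimate)
    return Q

rewards = [1.0, 0.0, 2.0, 1.0]
assert abs(incremental_average(rewards) - sum(rewards) / len(rewards)) < 1e-12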

40 Tracking a Nonstationary Problem. Choosing Q_k to be a sample average is appropriate in a stationary problem, i.e., when none of the Q*(a) change over time, but not in a nonstationary problem. Better in the nonstationary case is: Q_{k+1} = Q_k + α[r_{k+1} − Q_k], for constant α, 0 < α ≤ 1, which equals (1 − α)^k Q_0 + Σ_{i=1}^{k} α(1 − α)^{k−i} r_i: an exponential, recency-weighted average.
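A minimal sketch of the constant-step-size update together with a numerical check that it matches the exponential, recency-weighted average on the slide; the rewards and α are illustrative.

def constant_alpha_estimate(rewards, alpha, Q0=0.0):
    Q = Q0
    for r in rewards:
        Q += alpha * (r - Q)       # Q_{k+1} = Q_k + alpha * (r_{k+1} - Q_k)
    return Q

def recency_weighted_average(rewards, alpha, Q0=0.0):
    k = len(rewards)
    total = (1 - alpha) ** k * Q0
    for i, r in enumerate(rewards, start=1):
        total += alpha * (1 - alpha) ** (k - i) * r
    return total

rewards, alpha = [1.0, 0.0, 2.0], 0.3
assert abs(constant_alpha_estimate(rewards, alpha) - recency_weighted_average(rewards, alpha)) < 1e-12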

41 Optimistic Initial Values. All methods so far depend on Q_0(a), i.e., they are biased. Suppose instead we initialize the action values optimistically, i.e., on the 10-armed testbed, use Q_0(a) = 5 for all a.

42 Reinforcement Comparison. Compare rewards to a reference reward r̄_t, e.g., an average of observed rewards. Strengthen or weaken the action taken depending on r_t − r̄_t. Let p_t(a) denote the preference for action a. Preferences determine action probabilities, e.g., by the Gibbs distribution: π_t(a) = Pr{a_t = a} = e^{p_t(a)} / Σ_{b=1}^{n} e^{p_t(b)}. Then: p_{t+1}(a_t) = p_t(a_t) + β[r_t − r̄_t] and r̄_{t+1} = r̄_t + α[r_t − r̄_t].
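A minimal sketch of one reinforcement-comparison step: preferences drive a Gibbs (softmax) policy and are nudged by how the latest reward compares to a running reference reward. The step sizes and the reward_fn callable are illustrative assumptions.

import math
import random

def gibbs_probs(prefs):
    weights = [math.exp(p) for p in prefs]
    total = sum(weights)
    return [w / total for w in weights]

def reinforcement_comparison_step(prefs, ref_reward, reward_fn, alpha=0.1, beta=0.1):
    probs = gibbs_probs(prefs)
    a = random.choices(range(len(prefs)), weights=probs, k=1)[0]
    r = reward_fn(a)                          # reward_fn is a stand-in for the bandit
    prefs[a] += beta * (r - ref_reward)       # strengthen or weaken the action taken
    ref_reward += alpha * (r - ref_reward)    # update the reference reward
    return prefs, ref_reward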

43 Associative Search. Imagine switching bandits at each play. [Figure: a bandit with 3 actions.]

44 Conclusions. These are all very simple methods, but they are complicated enough that we will build on them. Ideas for improvements: estimating uncertainties... interval estimation; approximating Bayes optimal solutions; Gittins indices. The full RL problem offers some ideas for solution...

45 The Overall Plan. Lecture 1: What is Computational Reinforcement Learning? Learning from evaluative feedback. Markov decision processes. Lecture 2: Dynamic Programming. Basic Monte Carlo methods. Temporal Difference methods. A unified perspective. Connections to neuroscience. Lecture 3: Function approximation. Model-based methods. Dimensions of Reinforcement Learning.

46 Lecture 1, Part 3: Markov Decision Processes. Objectives of this part: describe the RL problem in terms of MDPs; present the idealized form of the RL problem for which we have precise theoretical results; introduce key components of the mathematics: value functions and Bellman equations; describe trade-offs between applicability and mathematical tractability.

47 The Agent-Environment Interface. Agent and environment interact at discrete time steps t = 0, 1, 2, .... The agent observes the state at step t: s_t ∈ S; produces an action at step t: a_t ∈ A(s_t); gets the resulting reward r_{t+1} ∈ ℝ and the resulting next state s_{t+1}. [Diagram: ... s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}, r_{t+2}, s_{t+2}, a_{t+2}, r_{t+3}, s_{t+3}, a_{t+3}, ...]

48 The Agent Learns a Policy. Policy at step t, π_t: a mapping from states to action probabilities; π_t(s, a) = probability that a_t = a when s_t = s. Reinforcement learning methods specify how the agent changes its policy as a result of experience. Roughly, the agent's goal is to get as much reward as it can over the long run.

49 Getting the Degree of Abstraction Right. Time steps need not refer to fixed intervals of real time. Actions can be low level (e.g., voltages to motors), high level (e.g., accept a job offer), mental (e.g., shift in focus of attention), etc. States can be low-level sensations, or they can be abstract, symbolic, based on memory, or subjective (e.g., the state of being "surprised" or "lost"). An RL agent is not like a whole animal or robot, which consist of many RL agents as well as other components. The environment is not necessarily unknown to the agent, only incompletely controllable. Reward computation is in the agent's environment because the agent cannot change it arbitrarily.

50 Goals and Rewards. Is a scalar reward signal an adequate notion of a goal? Maybe not, but it is surprisingly flexible. A goal should specify what we want to achieve, not how we want to achieve it. A goal must be outside the agent's direct control, thus outside the agent. The agent must be able to measure success: explicitly; frequently during its lifespan.

51 Returns. Suppose the sequence of rewards after step t is r_{t+1}, r_{t+2}, r_{t+3}, .... What do we want to maximize? In general, we want to maximize the expected return, E{R_t}, for each step t. Episodic tasks: interaction breaks naturally into episodes, e.g., plays of a game, trips through a maze. R_t = r_{t+1} + r_{t+2} + ... + r_T, where T is a final time step at which a terminal state is reached, ending an episode.

52 Returns for Continuing Tasks. Continuing tasks: interaction does not have natural episodes. Discounted return: R_t = r_{t+1} + γ r_{t+2} + γ² r_{t+3} + ... = Σ_{k=0}^{∞} γ^k r_{t+k+1}, where γ, 0 ≤ γ ≤ 1, is the discount rate (γ near 0: shortsighted; γ near 1: farsighted).
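A minimal sketch of computing the discounted return from a finite list of observed rewards (a truncation of the infinite sum; the numbers are illustrative):

def discounted_return(rewards, gamma):
    # R_t = sum_k gamma^k * r_{t+k+1}
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

print(discounted_return([1, 1, 1, 1], gamma=0.9))   # 1 + 0.9 + 0.81 + 0.729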

53 An Example. Avoid failure: the pole falling beyond a critical angle or the cart hitting the end of the track. As an episodic task where the episode ends upon failure: reward = +1 for each step before failure, so return = number of steps before failure. As a continuing task with discounted return: reward = −1 upon failure, 0 otherwise, so the return is related to −γ^k for k steps before failure. In either case, return is maximized by avoiding failure for as long as possible.

54 Another Example. Get to the top of the hill as quickly as possible. Reward = −1 for each step where not at the top of the hill, so return = −(number of steps before reaching the top of the hill). Return is maximized by minimizing the number of steps to reach the top of the hill.

55 A Unified Notation. In episodic tasks, we number the time steps of each episode starting from zero. We usually do not have to distinguish between episodes, so we write s_t instead of s_{t,j} for the state at step t of episode j. Think of each episode as ending in an absorbing state that always produces a reward of zero. We can cover all cases by writing R_t = Σ_{k=0}^{∞} γ^k r_{t+k+1}, where γ can be 1 only if a zero-reward absorbing state is always reached.

56 The Markov Property. By "the state" at step t, we mean whatever information is available to the agent at step t about its environment. The state can include immediate sensations, highly processed sensations, and structures built up over time from sequences of sensations. Ideally, a state should summarize past sensations so as to retain all "essential" information, i.e., it should have the Markov Property: Pr{s_{t+1} = s', r_{t+1} = r | s_t, a_t, r_t, s_{t−1}, a_{t−1}, ..., r_1, s_0, a_0} = Pr{s_{t+1} = s', r_{t+1} = r | s_t, a_t} for all s', r, and histories s_t, a_t, r_t, s_{t−1}, a_{t−1}, ..., r_1, s_0, a_0.

57 Markov Decision Processes. If a reinforcement learning task has the Markov Property, it is basically a Markov Decision Process (MDP). If the state and action sets are finite, it is a finite MDP. To define a finite MDP, you need to give: the state and action sets; the one-step dynamics defined by transition probabilities P^a_{ss'} = Pr{s_{t+1} = s' | s_t = s, a_t = a} for all s, s' ∈ S, a ∈ A(s); and reward expectations R^a_{ss'} = E{r_{t+1} | s_t = s, a_t = a, s_{t+1} = s'} for all s, s' ∈ S, a ∈ A(s).

58 An Example Finite MDP: Recycling Robot. At each step, the robot has to decide whether it should (1) actively search for a can, (2) wait for someone to bring it a can, or (3) go to home base and recharge. Searching is better but runs down the battery; if it runs out of power while searching, it has to be rescued (which is bad). Decisions are made on the basis of the current energy level: high, low. Reward = number of cans collected.

59 Recycling Robot MDP. S = {high, low}. A(high) = {search, wait}. A(low) = {search, wait, recharge}. R^search = expected no. of cans while searching. R^wait = expected no. of cans while waiting. R^search > R^wait.

60 Value Functions. The value of a state is the expected return starting from that state; it depends on the agent's policy. State-value function for policy π: V^π(s) = E_π{R_t | s_t = s} = E_π{Σ_{k=0}^{∞} γ^k r_{t+k+1} | s_t = s}. The value of taking an action in a state under policy π is the expected return starting from that state, taking that action, and thereafter following π. Action-value function for policy π: Q^π(s, a) = E_π{R_t | s_t = s, a_t = a} = E_π{Σ_{k=0}^{∞} γ^k r_{t+k+1} | s_t = s, a_t = a}.

61 Bellman Equation for a Policy π. The basic idea: R_t = r_{t+1} + γ r_{t+2} + γ² r_{t+3} + γ³ r_{t+4} + ... = r_{t+1} + γ(r_{t+2} + γ r_{t+3} + γ² r_{t+4} + ...) = r_{t+1} + γ R_{t+1}. So: V^π(s) = E_π{R_t | s_t = s} = E_π{r_{t+1} + γ V^π(s_{t+1}) | s_t = s}. Or, without the expectation operator: V^π(s) = Σ_a π(s, a) Σ_{s'} P^a_{ss'} [R^a_{ss'} + γ V^π(s')].

62 More on the Bellman Equation. V^π(s) = Σ_a π(s, a) Σ_{s'} P^a_{ss'} [R^a_{ss'} + γ V^π(s')]. This is a set of equations (in fact, linear), one for each state. The value function for π is its unique solution. Backup diagrams: for V^π and for Q^π.
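One way to see this equation in action is iterative policy evaluation: repeatedly apply the Bellman equation as an update until the values stop changing. Below is a minimal sketch on a tiny two-state MDP loosely inspired by the recycling robot; all transition probabilities and rewards are illustrative assumptions, not from the slides.

GAMMA = 0.9

# P[s][a] is a list of (probability, next_state, expected_reward) triples.
P = {
    "high": {"search": [(0.7, "high", 2.0), (0.3, "low", 2.0)],
             "wait":   [(1.0, "high", 1.0)]},
    "low":  {"wait":     [(1.0, "low", 1.0)],
             "recharge": [(1.0, "high", 0.0)]},
}
# Equiprobable random policy: pi[s][a] = probability of taking a in s.
pi = {s: {a: 1.0 / len(P[s]) for a in P[s]} for s in P}

V = {s: 0.0 for s in P}
for _ in range(1000):   # sweep until (approximately) converged
    V = {s: sum(pi[s][a] * sum(p * (r + GAMMA * V[s2]) for p, s2, r in P[s][a])
                for a in P[s])
         for s in P}
print(V)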

63 What about a Bellman Equation for Q^π? Q^π(s, a) = Σ_{s'} P^a_{ss'} [R^a_{ss'} + γ Σ_{a'} π(s', a') Q^π(s', a')].

64 Gridworld. Actions: north, south, east, west; deterministic. If an action would take the agent off the grid: no move, but reward = −1. Other actions produce reward = 0, except actions that move the agent out of the special states A and B as shown. State-value function for the equiprobable random policy; γ = 0.9. Note: A's value is less than its immediate reward; B's value is more than its immediate reward.

65 Optimal Value Functions. For finite MDPs, policies can be partially ordered: π ≥ π' if and only if V^π(s) ≥ V^π'(s) for all s ∈ S. There is always at least one (and possibly many) policy that is better than or equal to all the others. This is an optimal policy. We denote them all π*. Optimal policies share the same optimal state-value function: V*(s) = max_π V^π(s) for all s ∈ S. Optimal policies also share the same optimal action-value function: Q*(s, a) = max_π Q^π(s, a) for all s ∈ S and a ∈ A(s). This is the expected return for taking action a in state s and thereafter following an optimal policy.

66 Bellman Optimality Equation for V*. The value of a state under an optimal policy must equal the expected return for the best action from that state: V*(s) = max_{a ∈ A(s)} Q^{π*}(s, a) = max_{a ∈ A(s)} E{r_{t+1} + γ V*(s_{t+1}) | s_t = s, a_t = a} = max_{a ∈ A(s)} Σ_{s'} P^a_{ss'} [R^a_{ss'} + γ V*(s')]. [The relevant backup diagram.] V* is the unique solution of this system of nonlinear equations.
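Turning this equation into an update rule gives value iteration. Below is a minimal sketch on the same illustrative two-state MDP used in the policy-evaluation sketch above; again, all numbers are assumptions, not from the slides.

GAMMA = 0.9

# P[s][a] is a list of (probability, next_state, expected_reward) triples.
P = {
    "high": {"search": [(0.7, "high", 2.0), (0.3, "low", 2.0)],
             "wait":   [(1.0, "high", 1.0)]},
    "low":  {"wait":     [(1.0, "low", 1.0)],
             "recharge": [(1.0, "high", 0.0)]},
}

V = {s: 0.0 for s in P}
for _ in range(1000):
    # Bellman optimality backup: V(s) <- max_a sum_{s'} P [R + gamma * V(s')]
    V = {s: max(sum(p * (r + GAMMA * V[s2]) for p, s2, r in P[s][a]) for a in P[s])
         for s in P}

# A policy that is greedy with respect to V* is optimal (see slide 68).
policy = {s: max(P[s], key=lambda a: sum(p * (r + GAMMA * V[s2]) for p, s2, r in P[s][a]))
          for s in P}
print(V, policy)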

67 Bellman Optimality Equation for Q*. Q*(s, a) = E{r_{t+1} + γ max_{a'} Q*(s_{t+1}, a') | s_t = s, a_t = a} = Σ_{s'} P^a_{ss'} [R^a_{ss'} + γ max_{a'} Q*(s', a')]. [The relevant backup diagram.] Q* is the unique solution of this system of nonlinear equations.

68 Why Optimal State-Value Functions are Useful. Any policy that is greedy with respect to V* is an optimal policy. Therefore, given V*, one-step-ahead search produces the long-term optimal actions. E.g., back to the gridworld.

69 Car-on-the-Hill Optimal Value Function. Predicted minimum time to goal (negated). Get to the top of the hill as quickly as possible (roughly). Munos & Moore, "Variable resolution discretization for high-accuracy solutions of optimal control problems," IJCAI 99.

70 What About Optimal Action-Value Functions? Given Q*, the agent does not even have to do a one-step-ahead search: π*(s) = argmax_{a ∈ A(s)} Q*(s, a).

71 Solving the Bellman Optimality Equation. Finding an optimal policy by solving the Bellman Optimality Equation requires the following: accurate knowledge of environment dynamics; enough space and time to do the computation; the Markov Property. How much space and time do we need? Polynomial in the number of states (via dynamic programming methods; next lecture), BUT the number of states is often huge (e.g., backgammon has about 10^20 states). We usually have to settle for approximations. Many RL methods can be understood as approximately solving the Bellman Optimality Equation.

72 Semi-Markov Decision Processes (SMDPs). Generalization of an MDP where there is a waiting, or dwell, time τ in each state. Transition probabilities generalize to P(s', τ | s, a). Bellman equations generalize, e.g., for a discrete-time SMDP: V*(s) = max_{a ∈ A(s)} Σ_{s', τ} P(s', τ | s, a) [R^a_{ss'} + γ^τ V*(s')], where R^a_{ss'} is now the amount of discounted reward expected to accumulate over the waiting time in s upon doing a and ending up in s'.

73 Summary. Agent-environment interaction: states, actions, rewards. Policy: stochastic rule for selecting actions. Return: the function of future rewards the agent tries to maximize. Episodic and continuing tasks. Markov Property. Markov Decision Process: transition probabilities, expected rewards. Value functions: state-value function for a policy, action-value function for a policy, optimal state-value function, optimal action-value function. Optimal value functions. Optimal policies. Bellman Equations. The need for approximation. Semi-Markov Decision Processes.

74 Edward L. Thorndike (1874-1949). Learning by Trial-and-Error. Puzzle box.

75 Law of Effect. "Of several responses made to the same situation, those which are accompanied or closely followed by satisfaction to the animal will, other things being equal, be more firmly connected with the situation, so that, when it recurs, they will be more likely to recur; those which are accompanied or closely followed by discomfort to the animal will, other things being equal, have their connections with that situation weakened, so that when it recurs, they will be less likely to occur." Edward Thorndike, 1911.

76 Search + Memory. Search: Trial-and-Error, Generate-and-Test, Variation-and-Selection. Memory: remember what worked best for each situation and start from there next time.

77 Credit Assignment Problem (Marvin Minsky, 1961). Getting useful training information to the right places at the right times. Spatial. Temporal.

78 The Overall Plan. Lecture 1: What is Computational Reinforcement Learning? Learning from evaluative feedback. Markov decision processes. Lecture 2: Basic Monte Carlo methods. Dynamic Programming. Temporal Difference methods. A unified perspective. Connections to neuroscience. Lecture 3: Function approximation. Model-based methods. Dimensions of Reinforcement Learning.


More information

Eric Klein and Ning Sa

Eric Klein and Ning Sa Week 12. Statistical Appraches t Netwrks: p1 and p* Wasserman and Faust Chapter 15: Statistical Analysis f Single Relatinal Netwrks There are fur tasks in psitinal analysis: 1) Define Equivalence 2) Measure

More information

We can see from the graph above that the intersection is, i.e., [ ).

We can see from the graph above that the intersection is, i.e., [ ). MTH 111 Cllege Algebra Lecture Ntes July 2, 2014 Functin Arithmetic: With nt t much difficulty, we ntice that inputs f functins are numbers, and utputs f functins are numbers. S whatever we can d with

More information

Lecture 7: Damped and Driven Oscillations

Lecture 7: Damped and Driven Oscillations Lecture 7: Damped and Driven Oscillatins Last time, we fund fr underdamped scillatrs: βt x t = e A1 + A csω1t + i A1 A sinω1t A 1 and A are cmplex numbers, but ur answer must be real Implies that A 1 and

More information

Competency Statements for Wm. E. Hay Mathematics for grades 7 through 12:

Competency Statements for Wm. E. Hay Mathematics for grades 7 through 12: Cmpetency Statements fr Wm. E. Hay Mathematics fr grades 7 thrugh 12: Upn cmpletin f grade 12 a student will have develped a cmbinatin f sme/all f the fllwing cmpetencies depending upn the stream f math

More information

BLAST / HIDDEN MARKOV MODELS

BLAST / HIDDEN MARKOV MODELS CS262 (Winter 2015) Lecture 5 (January 20) Scribe: Kat Gregry BLAST / HIDDEN MARKOV MODELS BLAST CONTINUED HEURISTIC LOCAL ALIGNMENT Use Cmmnly used t search vast bilgical databases (n the rder f terabases/tetrabases)

More information

A - LEVEL MATHEMATICS 2018/2019

A - LEVEL MATHEMATICS 2018/2019 A - LEVEL MATHEMATICS 2018/2019 STRUCTURE OF THE COURSE Yur maths A-Level Maths curse cvers Pure Mathematics, Mechanics and Statistics. Yu will be eamined at the end f the tw-year curse. The assessment

More information

COMP 551 Applied Machine Learning Lecture 11: Support Vector Machines

COMP 551 Applied Machine Learning Lecture 11: Support Vector Machines COMP 551 Applied Machine Learning Lecture 11: Supprt Vectr Machines Instructr: (jpineau@cs.mcgill.ca) Class web page: www.cs.mcgill.ca/~jpineau/cmp551 Unless therwise nted, all material psted fr this curse

More information

The Law of Total Probability, Bayes Rule, and Random Variables (Oh My!)

The Law of Total Probability, Bayes Rule, and Random Variables (Oh My!) The Law f Ttal Prbability, Bayes Rule, and Randm Variables (Oh My!) Administrivia Hmewrk 2 is psted and is due tw Friday s frm nw If yu didn t start early last time, please d s this time. Gd Milestnes:

More information

COMP 551 Applied Machine Learning Lecture 5: Generative models for linear classification

COMP 551 Applied Machine Learning Lecture 5: Generative models for linear classification COMP 551 Applied Machine Learning Lecture 5: Generative mdels fr linear classificatin Instructr: Herke van Hf (herke.vanhf@mail.mcgill.ca) Slides mstly by: Jelle Pineau Class web page: www.cs.mcgill.ca/~hvanh2/cmp551

More information

MODULE FOUR. This module addresses functions. SC Academic Elementary Algebra Standards:

MODULE FOUR. This module addresses functions. SC Academic Elementary Algebra Standards: MODULE FOUR This mdule addresses functins SC Academic Standards: EA-3.1 Classify a relatinship as being either a functin r nt a functin when given data as a table, set f rdered pairs, r graph. EA-3.2 Use

More information

LHS Mathematics Department Honors Pre-Calculus Final Exam 2002 Answers

LHS Mathematics Department Honors Pre-Calculus Final Exam 2002 Answers LHS Mathematics Department Hnrs Pre-alculus Final Eam nswers Part Shrt Prblems The table at the right gives the ppulatin f Massachusetts ver the past several decades Using an epnential mdel, predict the

More information

AP Statistics Notes Unit Two: The Normal Distributions

AP Statistics Notes Unit Two: The Normal Distributions AP Statistics Ntes Unit Tw: The Nrmal Distributins Syllabus Objectives: 1.5 The student will summarize distributins f data measuring the psitin using quartiles, percentiles, and standardized scres (z-scres).

More information

BASD HIGH SCHOOL FORMAL LAB REPORT

BASD HIGH SCHOOL FORMAL LAB REPORT BASD HIGH SCHOOL FORMAL LAB REPORT *WARNING: After an explanatin f what t include in each sectin, there is an example f hw the sectin might lk using a sample experiment Keep in mind, the sample lab used

More information

Fall 2013 Physics 172 Recitation 3 Momentum and Springs

Fall 2013 Physics 172 Recitation 3 Momentum and Springs Fall 03 Physics 7 Recitatin 3 Mmentum and Springs Purpse: The purpse f this recitatin is t give yu experience wrking with mmentum and the mmentum update frmula. Readings: Chapter.3-.5 Learning Objectives:.3.

More information

I. Analytical Potential and Field of a Uniform Rod. V E d. The definition of electric potential difference is

I. Analytical Potential and Field of a Uniform Rod. V E d. The definition of electric potential difference is Length L>>a,b,c Phys 232 Lab 4 Ch 17 Electric Ptential Difference Materials: whitebards & pens, cmputers with VPythn, pwer supply & cables, multimeter, crkbard, thumbtacks, individual prbes and jined prbes,

More information

Lab 11 LRC Circuits, Damped Forced Harmonic Motion

Lab 11 LRC Circuits, Damped Forced Harmonic Motion Physics 6 ab ab 11 ircuits, Damped Frced Harmnic Mtin What Yu Need T Knw: The Physics OK this is basically a recap f what yu ve dne s far with circuits and circuits. Nw we get t put everything tgether

More information

CESAR Science Case The differential rotation of the Sun and its Chromosphere. Introduction. Material that is necessary during the laboratory

CESAR Science Case The differential rotation of the Sun and its Chromosphere. Introduction. Material that is necessary during the laboratory Teacher s guide CESAR Science Case The differential rtatin f the Sun and its Chrmsphere Material that is necessary during the labratry CESAR Astrnmical wrd list CESAR Bklet CESAR Frmula sheet CESAR Student

More information