Admin
Reinforcement Learning
Content adapted from Berkeley CS188
MDP Search Trees
Each MDP state projects an expectimax-like search tree.
Optimal Quantities
The value (utility) of a state s: V*(s) = expected utility starting in s and acting optimally.
The value (utility) of a q-state (s,a): Q*(s,a) = expected utility starting out having taken action a from state s and (thereafter) acting optimally.
The optimal policy: π*(s) = optimal action from state s.
Bellman Equations
Definition of optimal utility via the expectimax recurrence gives a simple one-step lookahead relationship amongst optimal utility values:
V*(s) = max_a Q*(s,a)
Q*(s,a) = Σ_{s'} T(s,a,s') [ R(s,a,s') + γ V*(s') ]
V*(s) = max_a Σ_{s'} T(s,a,s') [ R(s,a,s') + γ V*(s') ]
These are the Bellman equations, and they characterize optimal behavior in a way we'll use over and over.
Value Iteration
Bellman equations characterize the optimal values; value iteration computes them:
V_{k+1}(s) = max_a Σ_{s'} T(s,a,s') [ R(s,a,s') + γ V_k(s') ]
Value iteration is just a fixed-point solution method, though the V_k vectors are also interpretable as time-limited values.
Value Iteration Convergence*
How do we know the V_k vectors are going to converge?
Case 1: If the tree has maximum depth M, then V_M holds the actual untruncated values.
Case 2: If the discount is less than 1. Sketch: for any state, V_k and V_{k+1} can be viewed as depth-(k+1) expectimax results in nearly identical search trees. The difference is that on the bottom layer, V_{k+1} has actual rewards while V_k has zeros. That last layer is at best all R_MAX and at worst all R_MIN. But everything is discounted by γ^k that far out. So V_k and V_{k+1} are at most γ^k max|R| different. So as k increases, the values converge.
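For concreteness, here is a minimal value-iteration sketch over a small tabular MDP. The representation is an assumption for illustration (not from the slides): `states` is an iterable, `actions(s)` returns the available actions, `T[s][a]` is a list of (next_state, probability) pairs, and `R[s][a][s2]` is the reward; every state is assumed to have at least one action.

```python
# Value iteration sketch on an assumed tabular MDP representation (see lead-in).
def value_iteration(states, actions, T, R, gamma=0.9, tol=1e-6):
    V = {s: 0.0 for s in states}          # V_0(s) = 0 everywhere
    while True:
        V_new = {}
        for s in states:
            # Bellman backup: V_{k+1}(s) = max_a sum_{s'} T(s,a,s')[R(s,a,s') + gamma * V_k(s')]
            V_new[s] = max(
                sum(p * (R[s][a][s2] + gamma * V[s2]) for s2, p in T[s][a])
                for a in actions(s)
            )
        # Stop once the values stop changing (the fixed point has been reached)
        if max(abs(V_new[s] - V[s]) for s in states) < tol:
            return V_new
        V = V_new
```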
Value Iteration Convergence
Policy Loss [Demo]
Policy Iteration
Alternative approach for optimal policies:
Step 1: Policy evaluation: calculate utilities for some fixed policy (not optimal utilities!) until convergence.
Step 2: Policy improvement: update the policy using one-step look-ahead with the resulting converged (but not optimal!) utilities as future values.
This is policy iteration. It's still optimal! It can converge (much) faster under some conditions.
Policy Iteration (EM)
Evaluation: For the fixed current policy π, find values with policy evaluation. Iterate until values converge:
V^π_{k+1}(s) = Σ_{s'} T(s,π(s),s') [ R(s,π(s),s') + γ V^π_k(s') ]
Improvement: For fixed values, get a better policy using policy extraction. One-step look-ahead:
π_new(s) = argmax_a Σ_{s'} T(s,a,s') [ R(s,a,s') + γ V^π(s') ]
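A minimal policy-iteration sketch, under the same assumed tabular MDP representation as the value-iteration sketch above (`actions(s)` is additionally assumed to return an indexable list):

```python
# Policy iteration sketch: alternate policy evaluation and policy improvement.
def policy_iteration(states, actions, T, R, gamma=0.9, tol=1e-6):
    pi = {s: actions(s)[0] for s in states}   # start from an arbitrary fixed policy
    while True:
        # Step 1: policy evaluation -- iterate the fixed-policy Bellman update to convergence
        V = {s: 0.0 for s in states}
        while True:
            V_new = {s: sum(p * (R[s][pi[s]][s2] + gamma * V[s2]) for s2, p in T[s][pi[s]])
                     for s in states}
            if max(abs(V_new[s] - V[s]) for s in states) < tol:
                break
            V = V_new
        # Step 2: policy improvement -- one-step look-ahead (policy extraction) on the converged V
        pi_new = {
            s: max(actions(s),
                   key=lambda a: sum(p * (R[s][a][s2] + gamma * V[s2]) for s2, p in T[s][a]))
            for s in states
        }
        if pi_new == pi:                      # same policy twice: converged, return it
            return pi, V
        pi = pi_new
```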
Policy Iteration
Generalized Policy Iteration [Sutton and Barto]
Policy Iteration Proof (Sketch)
Guaranteed to converge: In every step the policy improves (otherwise we are done, since we return as soon as we get the same policy twice). Thus every iteration generates a new policy. There are a finite number of policies, so in the worst case we may have to iterate through all (num actions)^(num states) policies before we terminate.
Optimality at convergence: By definition of convergence, π_{k+1}(s) = π_k(s). This means:
∀s, V^{π_k}(s) = max_a Σ_{s'} T(s,a,s') [ R(s,a,s') + γ V^{π_k}(s') ]
Thus V^{π_k}(s) satisfies the Bellman equation, which means V^{π_k}(s) is a fixed-point solution to the Bellman equation, i.e. V^{π_k}(s) = V*(s).
Double Bandits
Double-Bandit MDP
Actions: Blue, Red. States: Win, Lose.
Offline Planning
Solving MDPs is offline planning: You determine all quantities through computation. You need to know the details of the MDP. You do not actually play the game!
Online Planning
New rules! Red's win chance is different.
Let's Play!
[Observed payoffs from the demo: $0 $0 $0 $2 $0 $2 $0 $0 $0]
What Just Happened?
That wasn't planning, it was learning! Specifically, reinforcement learning. There was an MDP, but you couldn't solve it with just computation. You needed to actually act to figure it out.
Reinforcement Learning
Basic idea: Receive feedback in the form of rewards. The agent's utility is defined by the reward function. Must (learn to) act so as to maximize expected rewards. All learning is based on observed samples of outcomes!
Example: Learning to Walk
Initial, Training, Finished [Kohl and Stone, ICRA 2004]
Example: Learning to Walk
The Crawler! [Tedrake, Zhang, and Seung 2005] [You, Project 3]
Reinforcement Learning
Offline (MDPs) vs. Online (RL)
Still assume a Markov decision process (MDP): a set of states s ∈ S, a set of actions (per state) A, a model T(s,a,s'), and a reward function R(s,a,s'). Still looking for a policy π(s).
New twist: we don't know T or R! I.e., we don't know which states are good or what the actions do. Must actually try actions and states out to learn (see the sketch below).
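To make the "new twist" concrete, here is a tiny sketch contrasting what a planner is given with what an RL agent sees. The environment interface (`reset()`, `step(action)`) is an assumption for illustration, not something defined in these notes.

```python
# Offline planning: the full MDP (S, A, T, R, gamma) is handed to the solver up front.
# Online RL: the agent only ever sees (s, a, r, s') samples by acting; T and R stay hidden.
def run_episode(env, policy):
    """Collect one episode of experience from an environment that only exposes
    reset() and step(action) -> (next_state, reward, done). (Assumed interface.)"""
    s = env.reset()
    experience = []
    done = False
    while not done:
        a = policy(s)
        s2, r, done = env.step(a)        # the only window into T and R
        experience.append((s, a, r, s2))
        s = s2
    return experience
```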
Quiz 1: Reinforcement Learning
The difference between planning in a known Markov Decision Process and reinforcement learning (RL) is that:
(1) In RL the agent doesn't know the transition model T or the reward function R.
(2) In RL the agent doesn't know what its current state is (e.g., doesn't know its own position when acting in a gridworld).
A) T/T B) T/F C) F/T D) F/F
Model-Based Learning
Model-based idea: Learn an approximate model based on experiences, then solve for values as if the learned model were correct.
Step 1: Learn the empirical MDP model. Count outcomes s' for each (s,a). Normalize to give an estimate of T(s,a,s'). Discover each R(s,a,s') when we experience (s,a,s').
Step 2: Solve the learned MDP. For example, use value iteration, as before (see the sketch below).
Example: Model-Based Learning
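A minimal model-based learning sketch: count outcomes, normalize the counts into transition probabilities, and record observed rewards. The flat tuple-keyed dictionaries for the learned T and R are an assumed simplification; the learned model would then be solved with, e.g., the value-iteration sketch above.

```python
from collections import defaultdict

# Model-based RL sketch: build empirical T and R estimates from (s, a, r, s') samples.
def learn_model(experience):
    counts = defaultdict(lambda: defaultdict(int))   # counts[(s,a)][s'] = N(s,a,s')
    rewards = {}                                     # rewards[(s,a,s')] = observed reward
    for s, a, r, s2 in experience:
        counts[(s, a)][s2] += 1
        rewards[(s, a, s2)] = r
    T_hat, R_hat = {}, {}
    for (s, a), outcomes in counts.items():
        total = sum(outcomes.values())
        for s2, n in outcomes.items():
            T_hat[(s, a, s2)] = n / total            # normalize counts into probabilities
            R_hat[(s, a, s2)] = rewards[(s, a, s2)]
    return T_hat, R_hat
```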
Example: Expected Age
Goal: Compute the expected age of CS 151 students. Without P(A), instead collect samples [a_1, a_2, ..., a_N].
Why does this work? You eventually learn the right model. Why does this work? Samples appear with the right frequencies. (A small sketch of both estimators appears after the quiz below.)
Try it! Get in a group of 3-6 students. Based on your sample, build a model for the expected graduation year of students in the class. What's your prediction?
Quiz 2: Model-Based Learning
Quiz 2: Rapid-Fire Click-in
What model would be learned from the above observed episodes?
T(A,south,C)= T(B,east,C)= T(C,south,E)= T(C,south,D)=
A) 1.0 B) 0.75 C) 0.5 D) 0.25 E) 0.0
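A small sketch of the two ways to estimate an expected age from samples. The particular sample values are made up for illustration; the point is that the "model-based" route first estimates P(a) from counts, while the "model-free" route just averages the samples directly.

```python
from collections import Counter

def expected_age_model_based(samples):
    # Learn an empirical distribution P(a) from counts, then compute sum_a P(a) * a.
    counts = Counter(samples)
    N = len(samples)
    return sum((n / N) * a for a, n in counts.items())

def expected_age_model_free(samples):
    # Skip the model entirely: just average the observed samples.
    return sum(samples) / len(samples)

# Both give the same number here; the distinction matters once whole value functions,
# not just a single average, have to be computed from the learned quantities.
print(expected_age_model_based([20, 22, 22, 25]))   # 22.25
print(expected_age_model_free([20, 22, 22, 25]))    # 22.25
```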
Model-Free Learning
Passive Reinforcement Learning
Simplified task: policy evaluation. Input: a fixed policy π(s). You don't know the transitions T(s,a,s'). You don't know the rewards R(s,a,s'). Goal: learn the state values.
In this case: the learner is along for the ride. No choice about what actions to take. Just execute the policy and learn from experience. This is NOT offline planning! You actually take actions in the world.
Direct Evaluation
Goal: Compute values for each state under π. Idea: Average together observed sample values. Act according to π. Every time you visit a state, write down what the sum of discounted rewards turned out to be. Average those samples. This is called direct evaluation.
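A minimal direct-evaluation sketch. Episodes are assumed to be lists of (s, a, r, s') tuples collected by following the fixed policy; for each state visit we record the discounted return actually experienced from that point on, then average.

```python
from collections import defaultdict

def direct_evaluation(episodes, gamma=0.9):
    totals = defaultdict(float)   # sum of observed returns per state
    visits = defaultdict(int)     # number of visits per state
    for episode in episodes:
        rewards = [r for (_, _, r, _) in episode]
        for t, (s, _, _, _) in enumerate(episode):
            # Sum of discounted rewards actually experienced from this visit onward
            G = sum(gamma**k * rewards[t + k] for k in range(len(rewards) - t))
            totals[s] += G
            visits[s] += 1
    return {s: totals[s] / visits[s] for s in totals}
```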
Example: Direct Evaluation
[Example grid with output values: -10, +8, +4, +10, -2]
Problems with Direct Evaluation
What's good about direct evaluation? It's easy to understand. It doesn't require any knowledge of T, R. It eventually computes the correct average values, using just sample transitions.
What's bad about it? It wastes information about state connections. Each state must be learned separately. So, it takes a long time to learn.
Why Not Use Policy Evaluation?
Simplified Bellman updates calculate V for a fixed policy: each round, replace V with a one-step-look-ahead layer over V. What gives? This approach fully exploited the connections between the states. Unfortunately, we need T and R to do it!
Key question: how can we do this update to V without knowing T and R? In other words, how do we take a weighted average without knowing the weights?
Quiz 3: Passive Reinforcement Learning
Estimate the output values of the following:
Quiz 4: Rapid-Fire Click-in
V^π(A)= V^π(B)= V^π(C)= V^π(D)= V^π(E)=
A) 10 B) 8 C) 4 D) -2 E) -10
Sample-Based Policy Evaluation?
We want to improve our estimate of V by computing these averages. Idea: take samples of outcomes s' (by doing the action!) and average.
Temporal Difference Learning
Big idea: learn from every experience! Update V(s) each time we experience a transition (s,a,s',r). Likely outcomes s' will contribute updates more often.
Temporal difference learning of values: the policy is still fixed, and we are still doing evaluation! Move values toward the value of whatever successor occurs: a running average (see the sketch below).
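A minimal TD value-learning sketch for a fixed policy π. The environment interface (`reset()`, `step(a)` returning (next_state, reward, done)) is the same assumed interface as in the earlier sketch, not part of the slides.

```python
from collections import defaultdict

def td_value_learning(env, pi, num_episodes=1000, alpha=0.1, gamma=0.9):
    V = defaultdict(float)                               # V(s) = 0 initially
    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            s2, r, done = env.step(pi(s))                # act according to the fixed policy
            sample = r + gamma * V[s2]                   # one-sample estimate of V(s)
            V[s] = (1 - alpha) * V[s] + alpha * sample   # running (exponential) average
            s = s2
    return dict(V)
```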
TD Demo!
Exponential Moving Average
Exponential moving average: the running interpolation update is
x̄_n = (1 - α) · x̄_{n-1} + α · x_n
It makes recent samples more important and forgets about the past (distant past values were wrong anyway). A decreasing learning rate (alpha) can give converging averages (see the small numeric sketch below).
Problems with TD Value Learning
TD value learning is a model-free way to do policy evaluation, mimicking Bellman updates with running sample averages. However, if we want to turn values into a (new) policy, we're sunk: extracting a policy from values requires a one-step look-ahead, which needs T and R. Idea: learn Q-values, not values. That makes action selection model-free too!
Quiz 4: TD Learning
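A small numeric sketch of the running interpolation update, with made-up sample values. With a fixed α, recent samples are weighted more; with α = 1/n, the update reproduces the ordinary average.

```python
def ema(samples, alpha=0.5):
    x_bar = samples[0]
    for x in samples[1:]:
        x_bar = (1 - alpha) * x_bar + alpha * x    # recent samples weigh more
    return x_bar

def decaying_average(samples):
    x_bar = 0.0
    for n, x in enumerate(samples, start=1):
        alpha = 1.0 / n                            # decreasing learning rate
        x_bar = (1 - alpha) * x_bar + alpha * x    # with alpha = 1/n this is the exact mean
    return x_bar

print(ema([0, 0, 10, 10]))               # 7.5 -- weighted toward the recent samples
print(decaying_average([0, 0, 10, 10]))  # 5.0 -- the ordinary average
```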
Active Reinforcement Learning
Full reinforcement learning: optimal policies (like value iteration). You don't know the transitions T(s,a,s'). You don't know the rewards R(s,a,s'). You choose the actions now. Goal: learn the optimal policy / values.
In this case: the learner makes choices! Fundamental tradeoff: exploration vs. exploitation. This is NOT offline planning! You actually take actions in the world and find out what happens.
Detour: Q-Value Iteration
Value iteration: find successive (depth-limited) values. Start with V_0(s) = 0, which we know is right. Given V_k, calculate the depth k+1 values for all states:
V_{k+1}(s) = max_a Σ_{s'} T(s,a,s') [ R(s,a,s') + γ V_k(s') ]
But Q-values are more useful, so compute them instead! Start with Q_0(s,a) = 0, which we know is right. Given Q_k, calculate the depth k+1 q-values for all q-states:
Q_{k+1}(s,a) = Σ_{s'} T(s,a,s') [ R(s,a,s') + γ max_{a'} Q_k(s',a') ]
Q-Learning
Q-Learning: sample-based Q-value iteration. Learn Q(s,a) values as you go:
Receive a sample (s,a,s',r). Consider your old estimate Q(s,a). Consider your new sample estimate: sample = r + γ max_{a'} Q(s',a'). Incorporate the new estimate into a running average: Q(s,a) ← (1 - α) Q(s,a) + α · sample.
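A minimal Q-learning sketch. The environment interface and the uniformly random exploration policy are assumptions for illustration (any sufficiently exploratory way of choosing actions works, which is the off-policy point made on the next slide).

```python
import random
from collections import defaultdict

def q_learning(env, actions, num_episodes=1000, alpha=0.1, gamma=0.9):
    Q = defaultdict(float)                       # Q_0(s, a) = 0
    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            a = random.choice(actions(s))        # explore: any exploratory behavior works
            s2, r, done = env.step(a)
            if done:
                sample = r                       # no future value from a terminal state
            else:
                # New sample estimate: r + gamma * max_a' Q(s', a')
                sample = r + gamma * max(Q[(s2, a2)] for a2 in actions(s2))
            # Incorporate the new estimate into a running average of Q(s, a)
            Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * sample
            s = s2
    return dict(Q)
```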
Q-Learning Demo: Gridworld
Q-Learning Demo: Crawler
Q-Learning Properties
Amazing result: Q-learning converges to the optimal policy even if you're acting suboptimally! This is called off-policy learning.
Caveats: You have to explore enough. You have to eventually make the learning rate small enough, but not decrease it too quickly. Basically, in the limit, it doesn't matter how you select actions(!)
Quiz 5: Q-Learning
Which of the following equations is the Q-value iteration update?
T/F: If α=1, no averaging will happen --- instead simply the value from the sample will be used.
T/F: If α=0, then the sample will not influence the update.
Next time
Amazing result: Q-learning converges to the optimal policy even if you're acting suboptimally!
...but how do we select actions? And what if our state/action space is too large to maintain?