The Winding Path to RL


Markov Decision Processes (MDP). Ron Parr, CompSci 270, Department of Computer Science, Duke University. With thanks to Kris Hauser for some slides.

The Winding Path to RL: Decision Theory, a descriptive theory of optimal behavior; Markov Decision Processes, a mathematical/algorithmic realization of decision theory; Reinforcement Learning, the application of learning techniques to the challenges of MDPs with numerous or unknown parameters.

Covered today: Decision Theory review; MDPs; algorithms for MDPs: Value Determination and Optimal Policy Selection (Value Iteration, Policy Iteration). Swept under the rug today: utility of money (assumed :), how to determine costs/utilities, how to determine probabilities.

Playing a Game Show. Assume a series of questions of increasing difficulty and increasing payoff. Choice at each step: accept the accumulated earnings and quit, or continue and risk losing everything. (Who wants to be a millionaire?)

State Representation. Dollar amounts indicate the payoff for getting the question right: Start, the $100 question, then (if correct) the $1,000 question, the $10K question, and the $50K question, with probabilistic transitions on each attempt to answer; a wrong answer pays $0, and answering the $50K question correctly pays $61,100. Downward green arrows indicate the choice to exit the game with $100, $1,100, or $11,100; green indicates profit at exit from the game. N.B.: these exit transitions should actually correspond to states.

Making Optimal Decisions: work backward from the future to the present. Consider the $50,000 question. Suppose P(correct) = 1/10. V(stop) = $11,100; V(continue) = 0.9*$0 + 0.1*$61.1K = $6.11K. Optimal decision: stop.

Working Backward: with success probabilities 9/10, 3/4, 1/2, and 1/10 on the $100, $1K, $10K, and $50K questions (a wrong answer pays $0), the state values are V=$3,749, V=$4,166, V=$5,555, V=$11.1K. Red X's indicate the bad choices; the exit payoffs are $100, $1,100, $11,100.
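
As a concrete check on the backward pass, here is a minimal Python sketch of the same computation; the probabilities and payoffs are the ones read off the figure above, and the printed values come out within a fraction of a percent of the figures quoted on the slide.

```python
# Minimal sketch: backward induction for the game-show example above.
# Probabilities and payoffs are the ones shown on the slide.
p_correct = [9/10, 3/4, 1/2, 1/10]     # $100, $1K, $10K, $50K questions
exit_cash = [0, 100, 1_100, 11_100]    # what you keep if you quit before each question
jackpot   = 61_100                     # payoff for answering the $50K question correctly

V = [0.0] * 4
for s in reversed(range(4)):
    keep_playing = p_correct[s] * (jackpot if s == 3 else V[s + 1])   # a wrong answer pays $0
    V[s] = max(exit_cash[s], keep_playing)
    print(f"state {s}: quit = {exit_cash[s]}, continue = {keep_playing:.0f}, V = {V[s]:.0f}")
```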

Decision Theory Review: provides a theory of optimal decisions, the principle of maximizing utility. Easy for small, tree-structured spaces with known utilities and known probabilities. [Outline recap: Decision Theory done; MDPs next; then algorithms for MDPs: Value Determination, Optimal Policy Selection (Value Iteration, Policy Iteration).]

Dealing with Loops: suppose you can pay $1000 (from any losing state) to play again. The success probabilities stay 9/10, 3/4, 1/2, 1/10 and the exit payoffs stay $100, $1,100, $11,100, but losing now costs -$1000 instead of ending the game with $0.

From Policies to Linear Systems: suppose we always pay and play until we win. What is the value of following this policy? Losing returns us to the start; otherwise we continue:
V(s0) = 0.10(-1000 + V(s0)) + 0.90 V(s1)
V(s1) = 0.25(-1000 + V(s0)) + 0.75 V(s2)
V(s2) = 0.50(-1000 + V(s0)) + 0.50 V(s3)
V(s3) = 0.90(-1000 + V(s0)) + 0.10(61,100)
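
These four equations are linear in the unknowns V(s0), ..., V(s3), so they can be solved directly; a small numpy sketch with the coefficients exactly as written above:

```python
import numpy as np

# Sketch: solve the four linear equations above for the "always pay and play" policy.
# Row i encodes V(si) = P(lose)*(-1000 + V(s0)) + P(win)*(V(s_{i+1}) or the $61,100 jackpot).
p_lose = np.array([0.10, 0.25, 0.50, 0.90])
A = np.eye(4)
A[:, 0] -= p_lose                    # -P(lose) * V(s0): paying $1000 sends us back to the start
for i in range(3):
    A[i, i + 1] -= 1 - p_lose[i]     # -P(win) * V(s_{i+1})
b = -1000 * p_lose                   # the P(lose) * (-1000) constant
b[3] += 0.10 * 61_100                # winning the last question pays the jackpot

print(np.linalg.solve(A, b))         # ≈ [32470, 32581, 32952, 34433]
```

These match the "with replay" values quoted on the next slide.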

And the solution is: without the replay option, V=$3,749, V=$4,166, V=$5,555, V=$11.1K; always paying to replay gives V=$32.47K, V=$32.58K, V=$32.95K, V=$34.4K. Is this optimal? How do we find the optimal policy?

The MDP Framework. State space: S. Action space: A. Transition function: P. Reward function: R(s,a,s') or R(s,a) or R(s). Discount factor: γ. Policy: π(s) → a. Objective: maximize expected, discounted return (decision-theoretic optimal behavior).

Applications of MDPs (AI/Computer Science): robotics control (Koenig & Simmons, Thrun et al., Kaelbling et al.), air campaign planning (Meuleau et al.), elevator control (Barto & Crites), computation scheduling (Zilberstein et al.), control and automation (Moore et al.), spoken dialogue management (Singh et al.), cellular channel allocation (Singh & Bertsekas).

Applications of MDPs (Economics/Operations Research): fleet maintenance (Howard, Rust), road maintenance (Golabi et al.), packet retransmission (Feinberg et al.), nuclear plant management (Rothwell & Rust), debt collection strategies (Abe et al.), data center management (DeepMind).

Applications of MDPs (EE/Control): missile defense (Bertsekas et al.), inventory management (Van Roy et al.), football play selection (Patek & Bertsekas). Agriculture: herd management (Kristensen, Toft). Other: sports strategies, video games.

The Markov Assumption. Let S_t be a random variable for the state at time t. Then P(S_t | A_{t-1}, S_{t-1}, ..., A_0, S_0) = P(S_t | A_{t-1}, S_{t-1}). Markov is a special kind of conditional independence: the future is independent of the past given the current state.

Understanding Discounting. Mathematical motivation: keeps values bounded (what if I promise you $0.01 every day you visit me?). Economic motivation: discounts come from inflation; a promise of $1.00 in the future is worth $0.99 today. Probability of dying: suppose there is probability ε of dying at each decision interval, i.e., a transition with probability ε to a state with value 0; this is equivalent to a 1-ε discount factor.

Discounting in Practice. Often chosen unrealistically low: faster convergence of the algorithms we'll see later, but leads to slightly myopic policies. Can reformulate most algorithms for average reward: mathematically uglier, somewhat slower run time.

[Outline recap: now at Value Determination.] Value Determination: determine the value of each state under policy π. Bellman equation for a fixed policy:
V^π(s) = R(s, π(s)) + γ Σ_{s'} P(s' | s, π(s)) V^π(s')
Example: if the policy's action in S1 has reward R = 1 and moves to S2 with probability 0.4 and to S3 with probability 0.6, then V(S1) = 1 + γ(0.4 V(S2) + 0.6 V(S3)).

Matrix Form: collect the transition probabilities P(s' | s, π(s)) into a matrix P_π (one row per state s, one column per successor s') and the rewards into a vector R; the Bellman equations then become V = γ P_π V + R. This is a generalization of the game show example from earlier. How do we solve this system efficiently? Does it even have a solution?

Solving for Values: V = γ P_π V + R. For moderate numbers of states we can solve this system exactly:
V = (I - γ P_π)^{-1} R
Guaranteed invertible because γ P_π has spectral radius < 1.
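
A concrete sketch of the closed-form solve; the 3-state transition matrix and rewards below are made-up numbers for illustration (the first row echoes the 0.4/0.6 example above), not values from the slides.

```python
import numpy as np

# Sketch: exact value determination V = (I - gamma * P)^(-1) R for a fixed policy.
gamma = 0.9
P = np.array([[0.0, 0.4, 0.6],    # row s holds P(s' | s, pi(s)); illustrative numbers
              [0.0, 0.5, 0.5],
              [1.0, 0.0, 0.0]])
R = np.array([1.0, 0.0, 2.0])     # R(s, pi(s))

V = np.linalg.solve(np.eye(3) - gamma * P, R)   # preferred over forming the inverse explicitly
print(V)
```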

Iteratively Solving for Values: V = γ P V + R. For larger numbers of states we can solve this system indirectly, iterating V_{i+1} = γ P V_i + R. Guaranteed convergent because γ P has spectral radius < 1.

Establishing Convergence: (1) eigenvalue analysis; (2) monotonicity: assume all values start pessimistic, then one value must always increase and we can never overestimate; easy to prove; (3) contraction analysis.
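
The same toy P and R as in the previous sketch, solved indirectly by repeated backups; the iterates approach the direct solution geometrically.

```python
import numpy as np

# Sketch: iterative value determination V_{i+1} = gamma * P * V_i + R (same toy numbers as above).
gamma = 0.9
P = np.array([[0.0, 0.4, 0.6],
              [0.0, 0.5, 0.5],
              [1.0, 0.0, 0.0]])
R = np.array([1.0, 0.0, 2.0])

V = np.zeros(3)                   # pessimistic start
for _ in range(500):              # gamma^500 is negligible, so this is converged
    V = R + gamma * P @ V
print(V)                          # matches np.linalg.solve(np.eye(3) - gamma * P, R)
```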

Contraction Analysis. Define the maximum norm ||V||_∞ = max_i |V[i]|. Consider two value functions V_a and V_b at the same iteration, and WLOG say ||V_a - V_b||_∞ = ε, so that V_a ≤ V_b + ε1, where ε1 is the vector with ε in every entry.

Contraction Analysis Cont'd. At the next iteration, V_b becomes R + γ P V_b, while for V_a:
R + γ P V_a ≤ R + γ P (V_b + ε1) = R + γ P V_b + γ P ε1 = R + γ P V_b + γ ε1
(distributing P over the sum; P ε1 = ε1 because each row of P sums to 1). Conclude: after one iteration the max norm distance between V_a and V_b is at most γε.
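
The same argument written out in standard notation, as a restatement of the steps above (assuming the row-stochastic P and the max norm defined on the slide):

```latex
\begin{align*}
\|V\|_\infty &= \max_i |V[i]|, \qquad
\|V_a - V_b\|_\infty = \varepsilon \;\Longrightarrow\; V_a \le V_b + \varepsilon\mathbf{1},\\
V_a' = R + \gamma P V_a
  &\le R + \gamma P V_b + \gamma P(\varepsilon\mathbf{1})
   = V_b' + \gamma\varepsilon\mathbf{1}
  && \text{(rows of $P$ sum to 1, so $P(\varepsilon\mathbf{1}) = \varepsilon\mathbf{1}$)},\\
\|V_a' - V_b'\|_\infty &\le \gamma\varepsilon .
\end{align*}
```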

Importance of Contraction: any two value functions get closer with each iteration. The true value function V* is a fixed point (its value doesn't change under the iteration). The max norm distance from V* decreases dramatically quickly with iterations: if ||V_0 - V*||_∞ = ε then ||V_n - V*||_∞ ≤ γ^n ε. [Outline recap: now at Optimal Policy Selection, Value Iteration.]

Finding Good Policies: suppose an expert told you the true value of each state, say V(S2) = 10 and V(S3) = 5, and from S1 action 1 reaches S2 or S3 with probability 0.5 each while action 2 reaches S2 with probability 0.7 and S3 with probability 0.3; then simply pick the action with the higher expected value.

Improving Policies: how do we get the optimal policy? If we knew the values under the optimal policy, then we could just take the optimal action in every state. How do we define these values? By a fixed point equation with a choice in it (the Bellman equation):
V*(s) = max_a [ R(s,a) + γ Σ_{s'} P(s' | s, a) V*(s') ]
This is the decision-theoretic optimal choice given V*. If we know V*, picking the optimal action is easy; if we know the optimal actions, computing V* is easy. How do we compute both at the same time?

Value Iteration: we can't solve the system directly with a max in the equation; can we solve it by iteration?
V_{i+1}(s) = max_a [ R(s,a) + γ Σ_{s'} P(s' | s, a) V_i(s') ]
This is called value iteration, or simply successive approximation. It is the same as value determination, except that we can change actions. Convergence: we can't do eigenvalue analysis (not linear), but it is still monotonic, still a contraction in max norm (exercise), and converges quickly.

Robot Navigation Example: the robot lives in a world described by a 4x3 grid of squares (columns 1-4, rows 1-3), with square (2,2) occupied by an obstacle. A state is defined by the square in which the robot is located, e.g. (1,1) in the figure: 11 states.
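
Before the grid-world example, here is a minimal value-iteration sketch on the earlier game show with the $1000 pay-to-replay option; the numbers are the ones from the game-show slides, and the values it converges to coincide with the always-pay-and-play policy values computed earlier, which indicates that always playing is optimal under these payoffs.

```python
# Sketch: value iteration on the game-show MDP with the $1000 pay-to-replay option.
# Actions in each state: quit (keep the exit cash) or play (answer the question).
p_correct = [9/10, 3/4, 1/2, 1/10]
exit_cash = [0, 100, 1_100, 11_100]
jackpot   = 61_100

V = [0.0] * 4
while True:
    V_new = []
    for s in range(4):
        win  = jackpot if s == 3 else V[s + 1]
        play = p_correct[s] * win + (1 - p_correct[s]) * (-1000 + V[0])  # lose: pay $1000, restart
        V_new.append(max(exit_cash[s], play))
    if max(abs(a - b) for a, b in zip(V_new, V)) < 1e-6:
        break
    V = V_new
print([round(v) for v in V_new])    # ≈ [32470, 32581, 32952, 34433]: playing always beats quitting
```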

Action (Transition) Model: in each state, the robot's possible actions are {U, D, R, L}. For each action, with probability 0.8 the robot does the right thing (moves up, down, right, or left by one square); with probability 0.2 it moves in a direction perpendicular to the intended one (0.1 each way); if the robot can't move, it stays in the same square. [This model satisfies the Markov condition.] For example, from (1,1), U brings the robot to (1,2) with probability 0.8, to (2,1) with probability 0.1, and leaves it at (1,1) with probability 0.1; L brings the robot to (1,1) with probability 0.8 + 0.1 = 0.9 and to (1,2) with probability 0.1.
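
A small Python sketch of this motion model; coordinates are (column, row) with the obstacle at (2,2), as in the grid above.

```python
# Sketch of the noisy motion model above: 0.8 intended direction, 0.1 each perpendicular;
# bumping into the obstacle or a wall means staying in the same square.
MOVES = {'U': (0, 1), 'D': (0, -1), 'R': (1, 0), 'L': (-1, 0)}
PERP  = {'U': 'LR', 'D': 'LR', 'L': 'UD', 'R': 'UD'}
OBSTACLE, COLS, ROWS = (2, 2), 4, 3

def step(square, move):
    nxt = (square[0] + MOVES[move][0], square[1] + MOVES[move][1])
    if nxt == OBSTACLE or not (1 <= nxt[0] <= COLS and 1 <= nxt[1] <= ROWS):
        return square                          # blocked: stay put
    return nxt

def transition(square, action):
    """Return {next_square: probability} for taking `action` in `square`."""
    probs = {}
    for move, p in [(action, 0.8), (PERP[action][0], 0.1), (PERP[action][1], 0.1)]:
        s2 = step(square, move)
        probs[s2] = probs.get(s2, 0.0) + p
    return probs

print(transition((1, 1), 'U'))   # {(1, 2): 0.8, (1, 1): 0.1, (2, 1): 0.1}
print(transition((1, 1), 'L'))   # {(1, 1): 0.9, (1, 2): 0.1}
```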

Terminal States, Rewards, and Costs. Two terminal states: (4,2) and (4,3). Rewards: R(4,3) = +1 [the robot finds gold], R(4,2) = -1 [the robot gets trapped in quicksand], and R(s) = -0.04 in all other states. This example (from the textbook) assumes no discounting (γ = 1). Discussion: is this a good modeling decision?

(Stationary) Policy. A stationary policy is a complete map π: state → action. For each non-terminal state it recommends an action, independent of when and how the state is reached. Under the Markov and infinite horizon assumptions, the optimal policy π* is necessarily a stationary policy. [The best action in a state does not depend on the past.]

(Stationary) Policy, continued: the optimal policy tries to avoid the dangerous state (4,2). Finding π* is called an observable Markov Decision Problem (MDP).

Optimal Policies for Various R(s): the optimal policy depends on the per-step reward; the slides show the different arrow maps obtained for R(s) = -0.04, for a strongly negative R(s), for a slightly negative R(s), and for R(s) > 0.

Bellman Equation. If s is terminal: V(s) = R(s). If s is non-terminal:
V(s) = R(s) + max_a Σ_{s'} P(s' | s, a) V(s')   [Bellman equation]
π*(s) = argmax_a Σ_{s'} P(s' | s, a) V(s')
where the max and argmax are over the actions applicable in s. The equations are non-linear: the utility of s depends on the utilities of other states s' (possibly including s), and vice versa.

Value Iteration Applied: 1. Initialize the utility of each non-terminal state to V_0(s) = 0. 2. For t = 0, 1, 2, ... do, for each non-terminal state,
V_{t+1}(s) = R(s) + max_a Σ_{s'} P(s' | s, a) V_t(s')
Starting from all zeros (with +1 and -1 at the terminal squares), the utilities on the 4x3 grid converge to 0.81, 0.87, 0.92, +1 in the top row; 0.76, (obstacle), 0.66, -1 in the middle row; 0.71, 0.66, 0.61, 0.39 in the bottom row.
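
A self-contained Python sketch of value iteration on this 4x3 world (reward -0.04 per non-terminal step, no discounting, and the 0.8/0.1/0.1 motion model described earlier); the printed utilities match the grid above when rounded to two decimals.

```python
# Sketch: value iteration on the 4x3 grid world (R = -0.04, gamma = 1, noise 0.8/0.1/0.1).
MOVES = {'U': (0, 1), 'D': (0, -1), 'R': (1, 0), 'L': (-1, 0)}
PERP  = {'U': 'LR', 'D': 'LR', 'L': 'UD', 'R': 'UD'}
TERMINAL = {(4, 3): 1.0, (4, 2): -1.0}
SQUARES = [(x, y) for x in range(1, 5) for y in range(1, 4) if (x, y) != (2, 2)]

def step(s, m):
    nxt = (s[0] + MOVES[m][0], s[1] + MOVES[m][1])
    return nxt if nxt in SQUARES else s                    # obstacle / wall: stay put

def backup(s, a, V):
    # expected utility of the successor square when trying action a in square s
    return sum(p * (TERMINAL[s2] if s2 in TERMINAL else V[s2])
               for m, p in [(a, 0.8), (PERP[a][0], 0.1), (PERP[a][1], 0.1)]
               for s2 in [step(s, m)])

V = {s: 0.0 for s in SQUARES if s not in TERMINAL}
for _ in range(100):                                       # plenty of sweeps to converge here
    V = {s: -0.04 + max(backup(s, a, V) for a in MOVES) for s in V}

for y in (3, 2, 1):
    print([round(V[(x, y)], 2) if (x, y) in V else TERMINAL.get((x, y))
           for x in range(1, 5)])
# prints roughly: [0.81, 0.87, 0.92, 1.0], [0.76, None, 0.66, -1.0], [0.71, 0.66, 0.61, 0.39]
```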

State Utilities. The utility of a state s is the maximal expected amount of reward that the robot will collect from s and future states by executing some action in each encountered state, until it reaches a terminal state (infinite horizon). Under the Markov and infinite horizon assumptions, the utility of s is independent of when and how s is reached (it only depends on the possible sequences of states after s, not on the possible sequences before s).

Convergence of Value Iteration: with these updates the utilities settle at the values shown above (0.81, 0.87, 0.92, +1; 0.76, 0.66, -1; 0.71, 0.66, 0.61, 0.39).

Properties of Value Iteration. VI converges to V* (the distance from V* shrinks by a factor of γ each iteration). It converges to the optimal policy. Why? Because we figure out V*, and the optimal policy is the argmax with respect to it. The optimal policy is stationary, i.e. Markovian (it depends only on the current state). Why? Because we are summing utilities. (Thought experiment: suppose you think it is better to change actions the second time you visit a state; why didn't you just take the best action the first time?) [Outline recap: now at Policy Iteration.]

Greedy Policy Construction. Let's name the action that looks best with respect to V:
π_V(s) = argmax_a [ R(s,a) + γ Σ_{s'} P(s' | s, a) V(s') ]
i.e. π_V = greedy(V); the bracketed term is an expectation over next-state values.

Bootstrapping: Policy Iteration. Idea: greedy selection is useful even with a suboptimal V. Guess an initial policy π_0; repeat: V_i = value of acting on π_i (solve the linear system), π_{i+1} = greedy(V_i); until the policy doesn't change. Guaranteed to find the optimal policy, and usually takes a very small number of iterations; computing the value function is the expensive part.
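
A compact numpy sketch of this loop on a made-up 3-state, 2-action MDP (action 0 reuses the toy chain from the earlier sketches; action 1 is another invented option, not from the slides):

```python
import numpy as np

# Sketch of policy iteration.  P[a] is the transition matrix of action a, R[a] its reward vector.
gamma = 0.9
P = np.array([[[0.0, 0.4, 0.6], [0.0, 0.5, 0.5], [1.0, 0.0, 0.0]],   # action 0
              [[0.9, 0.1, 0.0], [0.2, 0.0, 0.8], [0.1, 0.9, 0.0]]])  # action 1
R = np.array([[1.0, 0.0, 2.0],                                        # R(s, a=0)
              [0.5, 1.0, 0.0]])                                       # R(s, a=1)
n = 3

pi = np.zeros(n, dtype=int)                                # arbitrary initial policy
while True:
    P_pi = P[pi, np.arange(n)]                             # rows P(. | s, pi(s))
    R_pi = R[pi, np.arange(n)]
    V = np.linalg.solve(np.eye(n) - gamma * P_pi, R_pi)    # exact value of pi (linear system)
    Q = R + gamma * P @ V                                  # Q[a, s] for every action
    pi_new = Q.argmax(axis=0)                              # greedy(V)
    if np.array_equal(pi_new, pi):
        break
    pi = pi_new
print(pi, V)
```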

Comparing VI and PI. VI: values change at every step, the policy may change before the exact value of the current policy is computed, many cheap iterations. PI: alternates policy and value updates, solves for the value of each policy exactly, fewer but slower iterations (need to invert a matrix). Convergence: both are contractions in max norm; PI is shockingly fast in practice.

Computational Complexity. VI and PI are both contraction mappings with rate γ (we didn't prove this for PI in class). VI costs less per iteration. For n states and a actions, PI tends to take O(n) iterations in practice; recent results indicate roughly O(na/(1-γ)) iterations in the worst case. Interesting aside: the biggest insight into PI came about 50 years after the algorithm was introduced.

A Unified View of Value Iteration and Policy Iteration. Notation: the update for a fixed policy π defines the T^π operator,
(T^π V)(s) = R(s, π(s)) + γ Σ_{s'} P(s' | s, π(s)) V(s')
and the update with policy improvement defines the T operator,
(T V)(s) = max_a [ R(s,a) + γ Σ_{s'} P(s' | s, a) V(s') ]
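
The two operators as code, using the same array conventions as the earlier sketches (P[a] is the transition matrix of action a, R[a] its reward vector); value determination iterates T_pi, value iteration iterates T, and policy iteration alternates an exact T_pi fixed-point solve with greedy improvement.

```python
import numpy as np

def T_pi(V, P, R, pi, gamma):
    """Fixed-policy backup: (T^pi V)(s) = R(s, pi(s)) + gamma * sum_s' P(s'|s, pi(s)) V(s')."""
    idx = np.arange(len(V))
    return R[pi, idx] + gamma * P[pi, idx] @ V

def T(V, P, R, gamma):
    """Optimality backup: (T V)(s) = max_a [ R(s, a) + gamma * sum_s' P(s'|s, a) V(s') ]."""
    return (R + gamma * P @ V).max(axis=0)
```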

Value Determination in this notation: for 0 steps, V_0 = R_π; for i steps, V_i = T^π V_{i-1} = (T^π)^i R_π; in the infinite horizon limit, lim_{i→∞} V_i = (T^π)^∞ R_π = (I - γ P_π)^{-1} R_π = V^π.

Value Iteration in this notation: for 0 steps, V_0 = R (if R depends on a, pick the a with the highest immediate reward); for i steps, V_i = T V_{i-1} = T^i R; in the infinite horizon limit, lim_{i→∞} V_i = T^∞ R = V* (and T V* = V*).

Modified Policy Iteration. Guess V_0 (usually just R) and set i = 1. Repeat until convergence: for j = 1 to n, V_i = T^π V_{i-1} and i = i + 1; then π = greedy(V_{i-1}). Special cases: n = 1 gives VI, n → ∞ gives PI.

MDP Limitations and Reinforcement Learning. MDPs operate at the level of states; states are atomic events, and we usually have exponentially (or infinitely) many of them. We also assume P and R are known. Machine learning to the rescue: infer P and R (implicitly or explicitly) from data, and generalize from a small number of states/policies.
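
A short sketch of modified policy iteration on the same made-up 3-state, 2-action MDP used in the policy-iteration sketch; n_backups = 1 behaves like value iteration, and a large n_backups approaches policy iteration.

```python
import numpy as np

# Sketch of modified policy iteration: greedy improvement followed by a few fixed-policy
# backups instead of an exact linear solve.  The MDP numbers are illustrative only.
gamma, n_backups = 0.9, 5
P = np.array([[[0.0, 0.4, 0.6], [0.0, 0.5, 0.5], [1.0, 0.0, 0.0]],
              [[0.9, 0.1, 0.0], [0.2, 0.0, 0.8], [0.1, 0.9, 0.0]]])
R = np.array([[1.0, 0.0, 2.0],
              [0.5, 1.0, 0.0]])
idx = np.arange(3)

V = R.max(axis=0)                                 # guess V_0: best immediate reward in each state
for _ in range(200):
    pi = (R + gamma * P @ V).argmax(axis=0)       # greedy policy w.r.t. the current V
    for _ in range(n_backups):                    # n fixed-policy backups (T^pi applied n times)
        V = R[pi, idx] + gamma * P[pi, idx] @ V
print(pi, V)
```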