INFOB2KI 2017-2018, Utrecht University, The Netherlands. ARTIFICIAL INTELLIGENCE: Markov decision processes. Lecturer: Silja Renooij. These slides are part of the INFOB2KI Course Notes, available from www.cs.uu.nl/docs/vakken/b2ki/schema.html
PageRank (Google). PageRank can be understood as: a) A Markov Chain; b) A Markov Decision Process; c) A Partially Observable Markov Decision Process; d) None of the above
Markov models. A Markov model is a stochastic model that assumes the Markov property. Stochastic model: models a process where the state depends on previous states in a non-deterministic way. Markov property: the probability distribution over future states, conditioned on both past and present values, depends only upon the present state: given the present, the future does not depend on the past. Generally, this assumption enables reasoning and computation with the model that would otherwise be intractable.
Markov model types:

                       Prediction            Planning
Fully observable       Markov chain          MDP (Markov decision process)
Partially observable   Hidden Markov model   POMDP (Partially observable MDP)

The planning models are typically used for optimisation purposes. The prediction models can be represented at the variable level by a (Dynamic) Bayesian network. [Figure: a Markov chain S1 -> S2 -> S3, and a hidden Markov model S1 -> S2 -> S3 with observations O1, O2, O3]
MDPs: outline. Search in non-deterministic environments. Solution: an optimal policy (plan) of actions that maximizes reward (decision-theoretic planning). Bellman equations and value iteration. Link with learning.
Running example: Grid World. A maze-like problem: the agent lives in a grid, where walls block the agent's path. Movement is noisy: actions do not always go as planned. If there is a wall in the chosen direction, the agent stays put; 80% of the time, the action North takes the agent North; 10% of the time, North takes the agent West; 10% East (with the same deviations for the other actions). The agent receives rewards each time step: a small "living" reward each step (which can be negative), and big rewards at the end (good or bad). Goal: maximize the sum of rewards.
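As a minimal sketch of these noisy dynamics (the action names and the table encoding are my own, not from the course notes):

```python
# Hypothetical encoding of the Grid World noise model described above:
# each intended action succeeds with probability 0.8 and deviates to the
# two perpendicular directions with probability 0.1 each.
NOISY_MOVES = {
    'N': [('N', 0.8), ('W', 0.1), ('E', 0.1)],
    'S': [('S', 0.8), ('E', 0.1), ('W', 0.1)],
    'E': [('E', 0.8), ('N', 0.1), ('S', 0.1)],
    'W': [('W', 0.8), ('S', 0.1), ('N', 0.1)],
}
```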
Grid World Actions. [Figure: action outcomes in a Deterministic Grid World vs. a Stochastic Grid World]
Goals, rewards and optimality criteria. Traditional planning goals can be encoded in the reward function; the effects of transitions are uncertain. Example: achieving a state satisfying property P at minimal cost is encoded by making any state satisfying P a zero-reward absorbing state, and assigning all other states negative reward. Rewards are additive and time-separable, and the objective is to maximize expected total reward; future rewards may be discounted. The planning horizon can be finite, infinite or indefinite (a special case of infinite: guaranteed to reach a terminal state).
Markov Decision Processes. MDPs are non-deterministic search problems. An MDP is defined by: a set of states s ∈ S; a set of actions a ∈ A; a transition function T(s, a, s'), the probability that a from s leads to s', i.e., P(s' | s, a), also called the model or the dynamics; a reward function R(s, a, s'), sometimes just R(s) or R(s'); a start state; and sometimes a terminal state.
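A minimal sketch of this definition in code; the field names and the (next state, probability) convention are assumptions of this sketch, not course code:

```python
from dataclasses import dataclass
from typing import Callable, Hashable

@dataclass
class MDP:
    states: list                        # the set of states S
    actions: list                       # the set of actions A
    T: Callable                         # T(s, a) -> list of (s_next, P(s_next | s, a))
    R: Callable                         # R(s, a, s_next) -> immediate reward
    start: Hashable                     # the start state
    terminals: frozenset = frozenset()  # optional terminal states
```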
What is "Markov" about MDPs? Recall: "Markov" generally means that given the present state, the future and the past are independent. For Markov decision processes, "Markov" means that action outcomes depend only on the current state. (Andrey Markov, 1856-1922.) This is just like search, where the successor function can only depend on the current state (not the history).
MDP Search Trees. Each MDP state projects a search tree. [Figure: a state s branches on actions a into q-states (s, a), which branch on outcomes s'; a triple (s, a, s') is called a transition, with probability T(s, a, s') = P(s' | s, a) and reward R(s, a, s')]
Policies. In deterministic search problems, we wanted an optimal plan: a sequence of actions, from start to a goal. For MDPs, we want an optimal policy π*: S → A. A policy π gives an action for each state. An optimal policy is one that maximizes expected utility (reward) if followed. Example: the optimal policy when R(s, a, s') = -0.03 for all non-terminal states s. Note: an explicit policy defines a reflex agent.
Optimal Policies - example. [Figures: optimal Grid World policies for living rewards R(s) = -0.01, R(s) = -0.03, R(s) = -0.4, and R(s) = -2.0]
Utilities of Reward Sequences. What preferences should an agent have over reward sequences? More or less: [1, 2, 2] or [2, 3, 4]? Now or later: [0, 0, 1] or [1, 0, 0]? It is reasonable to maximize the sum of rewards. It is also reasonable to prefer rewards now to rewards later. One solution: the values of rewards decay exponentially.
Discounting. A reward worth 1 now is worth γ one step from now, and γ² two steps from now.
Returns in the long run. Episodic tasks: the interaction breaks naturally into episodes, e.g., plays of a game, trips through a maze. The return gives the total reward from time t to time T, the end of an episode:

R_t = r_{t+1} + r_{t+2} + ... + r_T = Σ_{k=0}^{T-t-1} r_{t+k+1}

Continuing tasks: the interaction does not have natural episodes. The discounted return is

R_t = r_{t+1} + γ r_{t+2} + γ² r_{t+3} + ... = Σ_{k=0}^{∞} γ^k r_{t+k+1}

where γ, 0 ≤ γ ≤ 1, is the discount rate (γ = 0: shortsighted; γ → 1: farsighted).
Discounting: implementation. How to discount? Each time we descend a level in the search tree, we multiply in the discount once. Why discount? Sooner rewards probably do have higher utility than later rewards; discounting also helps our algorithms converge. Example: the value of receiving [1, 2, 3] with a discount of 0.5 is 1·1 + 0.5·2 + 0.25·3 = 2.75, which is less than that of [3, 2, 1] (3·1 + 0.5·2 + 0.25·1 = 4.25).
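The same computation as a one-line sketch (function name is my own):

```python
def discounted_return(rewards, gamma):
    """Value of a reward sequence with exponential discounting."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

print(discounted_return([1, 2, 3], 0.5))  # 1*1 + 0.5*2 + 0.25*3 = 2.75
print(discounted_return([3, 2, 1], 0.5))  # 3*1 + 0.5*2 + 0.25*1 = 4.25
```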
Solving MDPs
Optimal Quantities. The value of a state s: V*(s) = the expected return when starting in s and acting optimally thereafter. (Intermediate q-states (s, a) have values Q*(s, a).) The optimal policy: π*(s) = the optimal action from state s. Any policy that is greedy with respect to V* is an optimal policy.
Characterizing V^π(s). Consider an arbitrary policy π prescribing actions in states. What is the value of following this policy when in state s? First, consider the deterministic situation:

V^π(s) = E[R_t] = r_{t+1} + γ r_{t+2} + γ² r_{t+3} + ... = r_{t+1} + γ (r_{t+2} + γ r_{t+3} + ...) = R(s, π(s), s') + γ V^π(s')

With noise, take the expected value over all possible next states:

V^π(s) = Σ_{s'} P(s' | s, π(s)) [ R(s, π(s), s') + γ V^π(s') ]

where P(s' | s, π(s)) is given by the transition function T(s, π(s), s').
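A minimal sketch of iterative policy evaluation built directly on this expectation; the two-state MDP and the T[s][a] format (lists of (next state, probability, reward) triples) are invented for illustration:

```python
GAMMA = 0.9

# Toy MDP (an assumption of this sketch): T[s][a] lists (s', P(s'|s,a), R(s,a,s')).
T = {
    'A': {'go': [('B', 1.0, 0.0)]},
    'B': {'go': [('A', 0.5, 1.0), ('B', 0.5, 0.0)]},
}
pi = {'A': 'go', 'B': 'go'}        # an arbitrary fixed policy

V = {s: 0.0 for s in T}
for _ in range(1000):              # repeat the expectation until convergence
    V = {s: sum(p * (r + GAMMA * V[s2]) for s2, p, r in T[s][pi[s]])
         for s in T}
print(V)                           # approximate V^pi for each state
```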
Example: Policy Evaluation. [Figures: V^π for each state (indicated in the cells) under π = "always go right" and π = "always go forward"]
Characerizing opimalv*() )), ( max ( ') ( '),, ( '),, ( max ') ( ') ), (, ( ') ), (, ( max ) ( max ) ( ) ( * * ' ' * * a Q V a R a T V R T V V V a a Expeced reurn from ae i maximized by acing opimally in and hereafer he opimal value for a ae i obained when following he opimal policy : * 22 Thi equaion i called he Bellman equaion.
Using V*(s) to obtain π*(s). The optimal policy can be extracted from V*(s):

π*(s) = arg max_a Σ_{s'} T(s, a, s') [ R(s, a, s') + γ V*(s') ]

using one-step look-ahead: use the Bellman equation once more to compute the given summation for all actions, but rather than returning the max value, return the action that gives the max value.
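A sketch of this one-step look-ahead in code; the function name and the T(s, a) outcome format are assumptions of this sketch:

```python
def extract_policy(states, actions, T, V, gamma):
    """Greedy policy w.r.t. V: arg max over actions of the Bellman sum."""
    policy = {}
    for s in states:
        policy[s] = max(
            actions,
            key=lambda a: sum(p * (r + gamma * V[s2]) for s2, p, r in T(s, a)),
        )
    return policy
```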
Using V*(s) to obtain π*(s). Back to Grid World: Noise = 0.2 (i.e., moves succeed with p = 0.8 and deviate to the left/right each with p = 0.1), discount γ = 0.9, living reward R(s, a, s') = 0. Optimal policy? Given V* (shown in the cells), one-step look-ahead produces the long-term optimal action (shown as small arrowheads).
Value Iteration. A dynamic programming algorithm for computing V*.
Value Iteration (VI). "Tree backup": define V_k(s) as the optimal value of s if the game ends in k more time steps. Start with V_0(s) = 0 for all s (including terminal s); terminal state rewards (if any) are added at k = 1. Given the V_k(s') values, compute for each s:

V_{k+1}(s) = max_a Σ_{s'} T(s, a, s') [ R(s, a, s') + γ V_k(s') ]  ( = max_a Q_{k+1}(s, a) )

Repeat until the V values converge. Theorem: VI converges to the unique optimal values. Basic idea: the approximations get refined towards the optimal values.
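A minimal value iteration sketch under the same assumed T[s][a] format as before; the toy MDP and the convergence tolerance are illustrative choices:

```python
GAMMA = 0.9
T = {
    'A': {'stay': [('A', 1.0, 0.0)], 'go': [('B', 0.8, 0.0), ('A', 0.2, 0.0)]},
    'B': {'stay': [('B', 1.0, 1.0)], 'go': [('A', 1.0, 0.0)]},
}

V = {s: 0.0 for s in T}                     # V_0(s) = 0 for all s
while True:
    # Bellman backup: V_{k+1}(s) = max_a sum_{s'} p * (r + gamma * V_k(s'))
    V_new = {s: max(sum(p * (r + GAMMA * V[s2]) for s2, p, r in outcomes)
                    for outcomes in T[s].values())
             for s in T}
    if max(abs(V_new[s] - V[s]) for s in T) < 1e-6:   # converged?
        break
    V = V_new
print(V)                                    # approximately V*
```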
VI init: k = 0. The policy shown is based on one-step look-ahead, i.e., the action that gives the max V_{k+1} value; it is not used in computing the V values! Not yet interesting (only shown to demonstrate the changes); the default policy is N. Noise = 0.2, discount = 0.9, living reward (R) = 0.
k = 1. Implementation of a terminal state s_e with reward r: one action x (exit), with T(s_e, x) = 1 and R(s_e, x) = r. At k = 1 the terminal states get their associated reward, and their V values do not change after that. Noise = 0.2, discount = 0.9, living reward (R) = 0.
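One way to encode this exit-action construction in the assumed T format; the sink state "done" and the reward value are illustrative:

```python
r = 10.0   # illustrative terminal reward
# Terminal state: a single 'exit' action that pays r once and moves to an
# absorbing, zero-reward sink, so V(s_e) becomes r and never changes afterwards.
T_terminal = {
    's_e':  {'exit': [('done', 1.0, r)]},
    'done': {'exit': [('done', 1.0, 0.0)]},
}
```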
[Figures: Grid World V_k values and greedy policies for k = 2, 3, ..., 12 and k = 100; in all cases Noise = 0.2, discount = 0.9, living reward = 0.]
Problems with Value Iteration. Value iteration repeats the Bellman update:

V_{k+1}(s) = max_a Σ_{s'} T(s, a, s') [ R(s, a, s') + γ V_k(s') ]

Problem 1: it is slow, O(S²A) per iteration. Problem 2: the max at each state rarely changes. Problem 3: the policy often converges long before the values.
Policy Iteration. An alternative approach for computing optimal values: Step 1, policy evaluation: calculate the returns for some fixed policy until convergence. Step 2, policy improvement: update the policy using one-step look-ahead, with the resulting converged (but not optimal!) returns as future values. Repeat these steps until the policy converges. This is policy iteration. It is still optimal, and it can converge (much) faster under some conditions.
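A sketch of policy iteration on the same assumed toy MDP format; the fixed evaluation pass count is an illustrative shortcut for "until convergence":

```python
GAMMA = 0.9
T = {
    'A': {'stay': [('A', 1.0, 0.0)], 'go': [('B', 0.8, 0.0), ('A', 0.2, 0.0)]},
    'B': {'stay': [('B', 1.0, 1.0)], 'go': [('A', 1.0, 0.0)]},
}

def q_value(s, a, V):
    return sum(p * (r + GAMMA * V[s2]) for s2, p, r in T[s][a])

policy = {s: next(iter(T[s])) for s in T}      # arbitrary initial policy
while True:
    V = {s: 0.0 for s in T}
    for _ in range(500):                       # Step 1: evaluate the fixed policy
        V = {s: q_value(s, policy[s], V) for s in T}
    new_policy = {s: max(T[s], key=lambda a, s=s: q_value(s, a, V))
                  for s in T}                  # Step 2: greedy improvement
    if new_policy == policy:                   # stop once the policy is stable
        break
    policy = new_policy
print(policy)
```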
Recall: Policy Evaluation. [Figures: V^π for each state under π = "always go right" and π = "always go forward"]
Comparison of VI and PI. Both are dynamic programs for solving MDPs, and both compute the same thing (all optimal values). In value iteration: every iteration updates both the values and (implicitly) the policy; we do not track the policy, since taking the max over actions implicitly recomputes it. In policy iteration: we do several passes that update the returns for a fixed policy (each pass is fast: we consider only one action, not all); after the policy is evaluated, a new policy is chosen (slow, like a value iteration pass); the new policy will be better (or we are done).
Double Bandits
Double-Bandit MDP. Actions: Blue, Red. States: Win, Lose. No discount, 100 time steps. [Figure: from either state, playing Red yields $2 with probability 0.75 and $0 with probability 0.25; playing Blue yields $1 with probability 1.0.] Both states have the same value. Note the representation at the value level rather than the variable level!
Offline Planning. Solving MDPs is offline planning: you determine all quantities through computation; you need to know the details of the MDP; and you do not actually play the game! With no discount, 100 time steps, and both states having the same value: the value of playing Red is 150, of playing Blue 100.
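The arithmetic behind those two values, as a quick check (expected reward per step, times 100 steps):

```python
red_per_step = 0.75 * 2 + 0.25 * 0    # expected $ per pull of Red = 1.5
blue_per_step = 1.0 * 1               # expected $ per pull of Blue = 1.0
print(100 * red_per_step, 100 * blue_per_step)   # 150.0 100.0
```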
Let's Play! $2 $2 $0 $2 $2 — $2 $2 $0 $0 $0
Online Planning. The rules have changed! Red's win chance is different. [Figure: the same double-bandit MDP, but Red now yields $2 or $0 with unknown probabilities (??); Blue still yields $1 with probability 1.0]
Let's Play again! $0 $0 $0 $2 $0 — $2 $0 $0 $0 $0
What Just Happened? That wasn't planning, it was learning! Specifically, reinforcement learning. There was an MDP, but you couldn't solve it with just computation: you needed to actually act to figure it out.
PageRank (Google). PageRank can be understood as: a) A Markov Chain; b) A Markov Decision Process; c) A Partially Observable Markov Decision Process; d) None of the above