Lecture 3
Reinforcement learning
Milos Hauskrecht
milos@cs.pitt.edu
5329 Sennott Square

Reinforcement learning
- We want to learn the control policy $\pi : X \to A$.
- We see examples of x (but the outputs a are not given).
- Instead of a we get feedback (reinforcement, reward) from a critic quantifying how good the selected output was.
- [Figure: Input x → Learner → Output a, with a Critic returning the reinforcement r]
- The reinforcements may not be deterministic.
- Goal: find $\pi : X \to A$ with the best expected reinforcements.

Gambling example
- Game: 3 different biased coins are tossed.
  - The coin to be tossed is selected randomly from the three options, and I always see which coin I am going to play next.
  - I make bets on head or tail and I always wager $1.
  - If I win I get $1, otherwise I lose my bet.
- RL model:
  - Input: X, a coin chosen for the next toss
  - Action: A, a choice of head or tail
  - Reinforcements: {1, -1}
- A policy $\pi : X \to A$. Example:
  - Coin1 → head
  - Coin2 → tail
  - Coin3 → head
- Learning goal: find $\pi : X \to A$ (Coin1 → ?, Coin2 → ?, Coin3 → ?) maximizing the future expected profits
  $$E\left(\sum_{t=0}^{\infty} \gamma^t r_t\right), \qquad 0 \le \gamma < 1,$$
  where $\gamma$ is a discount factor (the present value of money).
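
The gambling model above can be simulated in a few lines. The following is a minimal sketch, not part of the lecture: the head probabilities in `HEAD_PROB`, the function names, and the truncation of the infinite sum to a finite number of rounds are all assumptions made for illustration.

```python
import random

# Hypothetical head probabilities for the three biased coins (assumed values).
HEAD_PROB = {"Coin1": 0.7, "Coin2": 0.4, "Coin3": 0.55}

def play_round(policy, rng):
    """Play one round: a coin is drawn uniformly, we bet, and receive +1 or -1."""
    coin = rng.choice(sorted(HEAD_PROB))      # input x, shown to the learner
    bet = policy[coin]                        # action a = pi(x)
    outcome = "head" if rng.random() < HEAD_PROB[coin] else "tail"
    return 1 if bet == outcome else -1        # reinforcement r

def discounted_return(policy, gamma, steps, rng):
    """Sum of gamma^t * r_t over a finite number of simulated rounds."""
    return sum((gamma ** t) * play_round(policy, rng) for t in range(steps))

# The example policy from the slide.
policy = {"Coin1": "head", "Coin2": "tail", "Coin3": "head"}
rng = random.Random(0)
print(discounted_return(policy, gamma=0.9, steps=50, rng=rng))
```

A single call gives one sample of the discounted profit; averaging many such samples approximates the expectation that the learning goal maximizes.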

Agent navigation example
- Agent navigation in the maze:
  - 4 moves in compass directions
  - Effects of moves are stochastic: we may wind up in a location other than the intended one with non-zero probability
  - Objective: reach the goal state in the shortest expected time
- [Figure: maze grid with the available moves and the goal state G]
- The RL model:
  - Input: X, the position of an agent
  - Output: A, a move
  - Reinforcements: R
    - -1 for each move
    - +100 for reaching the goal
- A policy $\pi : X \to A$. Example:
  - Position 1 → right
  - Position 2 → right
  - ...
  - Position 20 → left
- Goal: find the policy maximizing the future expected rewards
  $$E\left(\sum_{t=0}^{\infty} \gamma^t r_t\right), \qquad 0 \le \gamma < 1.$$

Objectives of RL learning
- Objective: find a mapping $\pi^* : X \to A$ that maximizes some combination of future reinforcements (rewards) received over time.
- Valuation models (quantify how good the mapping is):
  - Finite horizon model, with time horizon $T > 0$:
    $$E\left(\sum_{t=0}^{T} r_t\right)$$
  - Infinite horizon discounted model, with discount factor $0 \le \gamma < 1$:
    $$E\left(\sum_{t=0}^{\infty} \gamma^t r_t\right)$$
  - Average reward:
    $$\lim_{T \to \infty} \frac{1}{T} E\left(\sum_{t=0}^{T} r_t\right)$$

Exploration vs. Exploitation
- The learner actively interacts with the environment:
  - At the beginning the learner does not know anything about the environment.
  - It gradually gains experience and learns how to react to the environment.
- Dilemma (exploration-exploitation): after some number of steps, should I select the best current choice (exploitation) or try to learn more about the environment (exploration)?
  - Exploitation may involve the selection of a sub-optimal action and prevent the learning of the optimal choice.
  - Exploration may spend too much time on trying bad, currently suboptimal actions.
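
To make the three valuation models listed earlier concrete, the sketch below evaluates each of them on a single observed reward sequence. It is an illustration under assumptions: the function names and the sample sequence are hypothetical, each function scores one trajectory (the expectation would be an average over many trajectories), and `average_reward` uses a finite sequence as a stand-in for the limit.

```python
def finite_horizon(rewards, T):
    """Sum of r_0 .. r_T (finite horizon model)."""
    return sum(rewards[: T + 1])

def discounted(rewards, gamma):
    """Sum of gamma^t * r_t (discounted model, truncated to the observed steps)."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

def average_reward(rewards):
    """Mean reward per step (finite-sample stand-in for the limit)."""
    return sum(rewards) / len(rewards)

# Hypothetical maze episode: three moves at -1 each, then +100 at the goal.
rewards = [-1, -1, -1, 100]
print(finite_horizon(rewards, T=3))
print(discounted(rewards, gamma=0.5))
print(average_reward(rewards))
```

With a small discount factor the delayed goal reward counts for much less than it does under the finite horizon sum, which is why the choice of valuation model changes which policy looks best.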

Effects of actions on the environment
- Effect of actions on the environment (the next input x to be seen):
  - No effect: the distribution over possible x is fixed; action consequences (rewards) are seen immediately.
  - Otherwise: the distribution of x can change; the rewards related to the action can be seen with some delay.
- This leads to two forms of reinforcement learning:
  - Learning with immediate rewards
    - Gambling example
  - Learning with delayed rewards
    - Agent navigation example; move choices affect the state of the environment (position changes), and a big reward at the goal state is delayed

RL with immediate rewards
- Game: 3 different biased coins are tossed.
  - The coin to be tossed is selected randomly from the three options, and I always see which coin I am going to play next.
  - I make bets on head or tail and I always wager $1.
  - If I win I get $1, otherwise I lose my bet.
- RL model:
  - Input: X, a coin chosen for the next toss
  - Action: A, a head or tail bet
  - Reinforcements: {1, -1}
- Learning goal: find $\pi : X \to A$ maximizing the future expected profits over time
  $$E\left(\sum_{t=0}^{\infty} \gamma^t r_t\right),$$
  where $\gamma$ is a discount factor (the present value of money).
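
RL with immediate rewards
- Expected reward:
  $$E\left(\sum_{t=0}^{\infty} \gamma^t r_t\right),$$
  where $\gamma$ is a discount factor (the present value of money).
- Immediate reward case:
  - The reward for the choice becomes available immediately.
  - Our choice does not affect the environment and thus future rewards.
  $$E\left(\sum_{t=0}^{\infty} \gamma^t r_t\right) = E(r_0) + E(\gamma r_1) + E(\gamma^2 r_2) + \dots$$
  where $r_0, r_1, r_2, \dots$ are the rewards for every step.
- Expected one-step reward for input x and the choice a: $R(x, a)$

RL with immediate rewards
- Immediate reward case:
  - The reward for the choice a becomes available immediately.
- Expected reward for the input x and choice a: $R(x, a)$
  - For the gambling problem it can be defined as
    $$R(x, a_i) = \sum_j r(\omega_j \mid a_i, x) \, P(\omega_j \mid x, a_i),$$
    where $\omega_j$ is a "hidden" outcome of the coin toss.
  - Recall the definition of the expected loss.
- Expected one-step reward for a strategy $\pi : X \to A$:
  $$R(\pi) = \sum_x R(x, \pi(x)) \, P(x)$$
  - $R(\pi)$ is the expected reward for each of $r_0, r_1, r_2, \dots$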
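
The two expectations above are short sums when the coin biases are known. The sketch below works them out for the gambling problem. It assumes hypothetical head probabilities (`HEAD_PROB`) and a uniform input distribution (`P_X`); neither value comes from the lecture, and in the RL setting these quantities are precisely what the learner does not know.

```python
# Hypothetical head probabilities and a uniform coin distribution (assumed values).
HEAD_PROB = {"Coin1": 0.7, "Coin2": 0.4, "Coin3": 0.55}
P_X = {coin: 1.0 / len(HEAD_PROB) for coin in HEAD_PROB}

def expected_reward(x, a):
    """R(x, a): sum over toss outcomes of reward times outcome probability."""
    p_head = HEAD_PROB[x]
    p_win = p_head if a == "head" else 1.0 - p_head
    return (+1) * p_win + (-1) * (1.0 - p_win)

def policy_reward(policy):
    """R(pi): sum over inputs x of R(x, pi(x)) * P(x)."""
    return sum(P_X[x] * expected_reward(x, policy[x]) for x in P_X)

policy = {"Coin1": "head", "Coin2": "tail", "Coin3": "head"}
for coin in HEAD_PROB:
    print(coin, expected_reward(coin, policy[coin]))
print("R(pi) =", policy_reward(policy))
```

Betting head on a coin with head probability p yields an expected reward of p - (1 - p) = 2p - 1, so a bet has a positive expected reward exactly when it matches the more likely side.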
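
RL with immediate rewards
- Expected reward:
  $$E\left(\sum_{t=0}^{\infty} \gamma^t r_t\right) = E(r_0) + E(\gamma r_1) + E(\gamma^2 r_2) + \dots = R(\pi) \left(\sum_{t=0}^{\infty} \gamma^t\right)$$
- Optimizing the expected reward:
  $$\max_\pi E\left(\sum_{t=0}^{\infty} \gamma^t r_t\right) = \max_\pi R(\pi) \left(\sum_{t=0}^{\infty} \gamma^t\right) = \left(\sum_{t=0}^{\infty} \gamma^t\right) \max_\pi R(\pi)$$
  $$\max_\pi R(\pi) = \max_\pi \sum_x R(x, \pi(x)) \, P(x) = \sum_x P(x) \left[\max_{\pi(x)} R(x, \pi(x))\right]$$
- Optimal strategy $\pi^* : X \to A$:
  $$\pi^*(x) = \arg\max_a R(x, a)$$

RL with immediate rewards
- We know that
  $$\pi^*(x) = \arg\max_a R(x, a)$$
- Problem: in the RL framework we do not know $R(x, a)$, the expected reward for performing action a at input x.
- How to get $R(x, a)$?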
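
The derivation says that the best strategy can be found one input at a time. A small check of that claim is sketched below, under the assumption that $R(x, a)$ is known: it compares the per-input argmax with a brute-force search over all $2^3$ policies. The coin biases, the uniform $P(x)$, and all function names are hypothetical.

```python
from itertools import product

# Hypothetical head probabilities and a uniform coin distribution (assumed values).
HEAD_PROB = {"Coin1": 0.7, "Coin2": 0.4, "Coin3": 0.55}
P_X = {coin: 1.0 / len(HEAD_PROB) for coin in HEAD_PROB}
ACTIONS = ("head", "tail")

def expected_reward(x, a):
    """R(x, a) for a $1 bet: P(win) - P(lose)."""
    p_win = HEAD_PROB[x] if a == "head" else 1.0 - HEAD_PROB[x]
    return p_win - (1.0 - p_win)

def policy_reward(policy):
    """R(pi) = sum_x R(x, pi(x)) P(x)."""
    return sum(P_X[x] * expected_reward(x, policy[x]) for x in P_X)

def greedy_policy():
    """pi*(x) = argmax_a R(x, a), solved independently for each input."""
    return {x: max(ACTIONS, key=lambda a: expected_reward(x, a)) for x in HEAD_PROB}

def brute_force_policy():
    """Search all |A|^|X| policies for the one with the largest R(pi)."""
    coins = list(HEAD_PROB)
    candidates = (dict(zip(coins, choice))
                  for choice in product(ACTIONS, repeat=len(coins)))
    return max(candidates, key=policy_reward)

def discounted_value(policy, gamma):
    """R(pi) * sum_t gamma^t, using the geometric series 1 / (1 - gamma)."""
    return policy_reward(policy) / (1.0 - gamma)

print(greedy_policy())
print(brute_force_policy())
print(discounted_value(greedy_policy(), gamma=0.9))
```

The per-input search needs |X| · |A| evaluations, while the brute-force search needs |A|^|X|, which is the practical payoff of moving the max inside the sum.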
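
RL with immediate rewards
- Problem: in the RL framework we do not know $R(x, a)$, the expected reward for performing action a at input x.
- Solution:
  - For each input x try different actions a.
  - Estimate $R(x, a)$ using the average of the observed rewards:
    $$\tilde{R}(x, a) = \frac{1}{N_{x,a}} \sum_{i=1}^{N_{x,a}} r_i^{x,a}$$
  - Action choice:
    $$\pi(x) = \arg\max_a \tilde{R}(x, a)$$
- Accuracy of the estimate: statistics (Hoeffding's bound)
  $$P\left(\tilde{R}(x, a) - R(x, a) \ge \varepsilon\right) \le \exp\left[-\frac{2 \varepsilon^2 N_{x,a}}{(r_{\max} - r_{\min})^2}\right] \le \delta$$
  (the two-sided bound on $|\tilde{R}(x, a) - R(x, a)|$ carries an extra factor of 2 on the right-hand side)
- Number of samples:
  $$N_{x,a} \ge \frac{(r_{\max} - r_{\min})^2}{2 \varepsilon^2} \ln \frac{1}{\delta}$$

RL with immediate rewards
- On-line (stochastic approximation): an alternative way to estimate $R(x, a)$
- Idea:
  - Choose action a for input x and observe a reward $r^{x,a}$.
  - Update the estimate:
    $$\tilde{R}(x, a) \leftarrow (1 - \alpha) \tilde{R}(x, a) + \alpha \, r^{x,a},$$
    where $\alpha$ is a learning rate.
- Convergence property: the approximation converges in the limit for an appropriate learning rate schedule.
- Assume $\alpha(n(x, a))$ is the learning rate for the nth trial of the (x, a) pair. Then convergence is assured if:
  1. $\sum_{i=1}^{\infty} \alpha(i) = \infty$
  2. $\sum_{i=1}^{\infty} \alpha(i)^2 < \infty$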
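
Both estimation ideas above fit in a few lines. The sketch below is an illustration under assumptions: the function names are hypothetical, the sample-size formula is the one-sided Hoeffding form with $\ln(1/\delta)$, and the learning rate schedule $\alpha(n) = 1/n$ is one choice that satisfies both convergence conditions. With that schedule the on-line update reproduces the running sample average exactly.

```python
import math

def hoeffding_samples(r_min, r_max, eps, delta):
    """Smallest integer N with N >= (r_max - r_min)^2 / (2 eps^2) * ln(1 / delta)."""
    bound = (r_max - r_min) ** 2 / (2 * eps ** 2) * math.log(1 / delta)
    return math.ceil(bound)

def online_estimate(rewards):
    """On-line update R <- (1 - alpha) R + alpha r with alpha(n) = 1 / n."""
    estimate = 0.0
    for n, r in enumerate(rewards, start=1):
        alpha = 1.0 / n          # assumed schedule: sum diverges, sum of squares converges
        estimate = (1 - alpha) * estimate + alpha * r
    return estimate

# Rewards in {-1, +1}, accuracy eps = 0.1, failure probability delta = 0.05.
print(hoeffding_samples(-1, 1, eps=0.1, delta=0.05))

# Hypothetical observed rewards for one (x, a) pair.
observed = [1, -1, 1, 1]
print(online_estimate(observed), sum(observed) / len(observed))
```

A constant learning rate violates the second condition, so the estimate keeps fluctuating with the most recent rewards instead of settling; that trade-off is sometimes accepted deliberately when the reward distribution may drift.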

Exploration vs. Exploitation
- In the RL framework the learner actively interacts with the environment.
- At any point in time it has an estimate $\tilde{R}(x, a)$ for any input-action pair.
- Dilemma:
  - Should the learner use the current best choice of action (exploitation)
    $$\hat{\pi}(x) = \arg\max_{a \in A} \tilde{R}(x, a)$$
  - or choose another action a and further improve its estimate (exploration)?
- Different exploration/exploitation strategies exist.

Exploration vs. Exploitation
- Uniform exploration
  - Choose the current best choice with probability $1 - \varepsilon$:
    $$\hat{\pi}(x) = \arg\max_{a \in A} \tilde{R}(x, a)$$
  - All other choices are selected with a uniform probability $\dfrac{\varepsilon}{|A| - 1}$.
- Boltzmann exploration
  - The action is chosen randomly but proportionally to its current expected reward estimate:
    $$p(a \mid x) = \frac{\exp\left[\tilde{R}(x, a) / T\right]}{\sum_{a' \in A} \exp\left[\tilde{R}(x, a') / T\right]}$$
  - T is a temperature parameter. What does it do?
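
Both strategies can be tried out on a table of current estimates for one fixed input x. The sketch below is an illustration under assumptions: the function names and the estimate values are hypothetical, and subtracting the maximum estimate before exponentiating is a numerical-stability step added here that leaves the Boltzmann probabilities unchanged. Running it suggests an answer to the temperature question: a small T concentrates almost all probability on the current best action, while a large T pushes the distribution toward uniform.

```python
import math
import random

def uniform_exploration(estimates, eps, rng):
    """Pick the best action with probability 1 - eps, else a uniformly chosen other action."""
    best = max(estimates, key=estimates.get)
    others = [a for a in estimates if a != best]
    if others and rng.random() < eps:
        return rng.choice(others)    # each other action has probability eps / (|A| - 1)
    return best

def boltzmann_probs(estimates, T):
    """p(a | x) proportional to exp(R~(x, a) / T)."""
    m = max(estimates.values())      # shift by the max for numerical stability
    weights = {a: math.exp((r - m) / T) for a, r in estimates.items()}
    z = sum(weights.values())
    return {a: w / z for a, w in weights.items()}

# Hypothetical current estimates R~(x, a) for one coin.
estimates = {"head": 0.4, "tail": -0.4}
rng = random.Random(0)
print(uniform_exploration(estimates, eps=0.1, rng=rng))
for T in (0.01, 1.0, 100.0):
    print(T, boltzmann_probs(estimates, T))
```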