Reinforcement learning

Milo Conley
5 years ago

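CS 2750 Machine Learning
Lecture …b: Reinforcement learning
Milos Hauskrecht
milos@cs.pitt.edu
5329 Sennott Square

Reinforcement learning

We want to learn a control policy \( \pi: X \to A \).
We see examples of inputs x, but the outputs a are not given.
Instead of a we get feedback (a reinforcement, or reward) from a critic quantifying how good the selected output was.
Diagram: input x -> Learner -> output a; the Critic sends the reinforcement r back to the Learner.
The reinforcements may not be deterministic.
Goal: find \( \pi: X \to A \) with the best expected reinforcements.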

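Gambling example

Game: 3 biased coins (coin 1, coin 2, coin 3).
The coin to be tossed is selected randomly from the three coin options.
The agent always sees which coin is going to be played next.
The agent makes a bet on either a head or a tail with a wager of $1.
If after the coin toss the outcome agrees with the bet, the agent wins $1; otherwise it loses $1.

RL model:
Input: X, the coin chosen for the next toss
Action: A, the choice of head or tail the agent bets on
Reinforcements: {1, -1}
A policy \( \pi: X \to A \)
Example policy \( \pi \): Coin 1 -> head, Coin 2 -> tail, Coin 3 -> head

Gambling example

RL model:
Input: X, the coin chosen for the next toss
Action: A, the choice of head or tail the agent bets on
Reinforcements: {1, -1}
A policy \( \pi \): Coin 1 -> head, Coin 2 -> tail, Coin 3 -> head

Learning goal: find the optimal policy \( \pi^*: X \to A \) maximizing the future expected profits
\[ E\left( \sum_{t=0}^{\infty} \gamma^t r_t \right), \qquad 0 \le \gamma < 1, \]
where \( \gamma \) is a discount factor (the present value of money).
The optimal policy \( \pi^* \): Coin 1 -> ?, Coin 2 -> ?, Coin 3 -> ?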
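The game above can be sketched as a small simulation. This is only an illustration: the head probabilities in `HEAD_PROB` are hypothetical (the slides do not give the biases), and the coin is assumed to be drawn uniformly at random.

```python
import random

# Hypothetical head probabilities for the three biased coins (not given in the lecture).
HEAD_PROB = {1: 0.7, 2: 0.4, 3: 0.55}


def play_round(policy, rng):
    """Play one round: a coin is drawn, the agent bets, the coin is tossed."""
    coin = rng.choice(sorted(HEAD_PROB))      # input x (assumed uniform over the coins)
    bet = policy[coin]                        # action a: "head" or "tail"
    outcome = "head" if rng.random() < HEAD_PROB[coin] else "tail"
    reward = 1 if outcome == bet else -1      # reinforcement r in {1, -1}
    return coin, bet, reward


# The example policy from the slide: Coin 1 -> head, Coin 2 -> tail, Coin 3 -> head.
policy = {1: "head", 2: "tail", 3: "head"}

rng = random.Random(0)
rewards = [play_round(policy, rng)[2] for _ in range(10_000)]
average_reward = sum(rewards) / len(rewards)
```

Under these assumed biases the average reward per round should settle near 0.23, since the example policy happens to bet on the more likely side of every coin.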

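Expected rewards

Expected rewards for \( \pi: X \to A \).
Figure: reward sequences received over time in run 1, run 2, and run 3.
The expectation is taken over many possible reward trajectories for \( \pi: X \to A \):
\[ E\left( \sum_{t=0}^{T} r_t \right) \]

Expected discounted rewards

Expected discounted rewards for \( \pi: X \to A \), discounting with \( \gamma \) (the future value of money).
Figure: one run over time shown without discounting and with discounting.
The expectation is taken over many possible discounted reward trajectories for \( \pi: X \to A \):
\[ E\left( \sum_{t=0}^{T} \gamma^t r_t \right) \]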
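A minimal sketch of the two quantities, averaging over a handful of runs. The three reward sequences and the value of `gamma` are made up for illustration; a real estimate would average over many sampled runs.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over one reward trajectory."""
    total = 0.0
    for t, r in enumerate(rewards):
        total += (gamma ** t) * r
    return total


# Three hypothetical runs of the same policy, four steps each.
runs = [
    [1, -1, 1, 1],
    [-1, -1, 1, -1],
    [1, 1, 1, -1],
]
gamma = 0.9

undiscounted = [sum(run) for run in runs]
discounted = [discounted_return(run, gamma) for run in runs]

# Monte Carlo estimate of the expected discounted reward from these runs.
expected_discounted = sum(discounted) / len(discounted)
```

Setting `gamma = 1.0` recovers the undiscounted sums, which makes the role of the discount factor easy to check.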

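RL learning: objective functions

Objective: find a mapping \( \pi^*: X \to A \) that maximizes some combination of future reinforcements (rewards) received over time.
Valuation models quantify how good the mapping is:
Finite horizon models: \( E\left( \sum_{t=0}^{T} r_t \right) \) with time horizon \( T > 0 \), or \( E\left( \sum_{t=0}^{T} \gamma^t r_t \right) \) with discount factor \( 0 \le \gamma < 1 \)
Infinite horizon discounted model: \( E\left( \sum_{t=0}^{\infty} \gamma^t r_t \right) \) with discount factor \( 0 \le \gamma < 1 \)
Average reward: \( \lim_{T \to \infty} \frac{1}{T} E\left( \sum_{t=0}^{T} r_t \right) \)

Agent navigation example

Agent navigation in the maze:
4 moves in compass directions.
Effects of moves are stochastic: we may wind up in a location other than the intended one with non-zero probability.
Objective: learn how to reach the goal state in the shortest expected time.
Figure: maze with the possible moves and the goal state G.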

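Agent navigation example

The RL model:
Input: X, the position of an agent
Output: A, the next move
Reinforcements: negative for each move, positive for reaching the goal
A policy \( \pi: X \to A \) assigns a move to every position, for example right, right, ..., left.
Goal: find the policy maximizing future expected rewards \( E\left( \sum_{t=0}^{\infty} \gamma^t r_t \right) \), \( 0 \le \gamma < 1 \).
Figure: maze with the moves and the goal state G.

Exploration vs. exploitation in RL

The learner actively interacts with the environment:
At the beginning the learner does not know anything about the environment.
It gradually gains experience and learns how to react to the environment.
Dilemma (exploration-exploitation): after some number of steps, should I select the best current choice (exploitation) or try to learn more about the environment (exploration)?
Exploitation may involve the selection of a sub-optimal action and prevent the learning of the optimal choice.
Exploration may spend too much time on trying bad, currently suboptimal actions.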

Effects of actions on the environment

Effect of actions on the environment (the next input x to be seen):
No effect. The distribution over possible x is fixed and independent of past actions. The rewards received depend only on the state and the action chosen. They are seen after the action.
Actions may affect the environment and the next inputs x. The distribution of x can change due to past actions; the rewards related to the action can be seen with some delay.
This leads to two forms of reinforcement learning:
Learning with immediate rewards: the 3 coin example.
Learning with delayed rewards: the agent navigation example; move choices affect the state of the environment (position changes), and the big reward at the goal state is delayed.

RL with immediate rewards

Game: 3 biased coins (coin 1, coin 2, coin 3).
The coin to be tossed is selected randomly from the three coin options.
The agent always sees which coin is going to be played next.
The agent makes a bet on either a head or a tail with a wager of $1.
If after the coin toss the outcome agrees with the bet, the agent wins $1; otherwise it loses $1.

RL model:
Input: X, the coin chosen for the next toss
Action: A, head or tail the agent bets on
Reinforcements: {1, -1} ($1 either won or lost)
Learning goal: find the optimal policy \( \pi^*: X \to A \) maximizing the future expected profits over time,
\[ E\left( \sum_{t=0}^{\infty} \gamma^t r_t \right), \qquad 0 \le \gamma < 1 \text{ a discount factor.} \]

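RL with immediate rewards

Expected reward:
\[ E\left( \sum_{t=0}^{\infty} \gamma^t r_t \right), \qquad 0 \le \gamma < 1 \]
Immediate reward case:
The reward depends only on x and the action choice a.
The action does not affect the environment and hence future inputs (states) and future rewards:
\[ E\left( \sum_{t=0}^{\infty} \gamma^t r_t \right) = E(r_0) + \gamma E(r_1) + \gamma^2 E(r_2) + \dots \]
where \( r_0, r_1, r_2, \dots \) are the rewards for every step of the game.
Expected one step reward for input x (the coin to play next) and the choice a: \( R(x, a) \).

RL with immediate rewards

Immediate reward case:
The reward for input x and the action choice a may vary.
Expected reward for the input x and choice a: \( R(x, a) \).
For the coin bet problem it is:
\[ R(x, a_i) = \sum_j r(\omega_j \mid a_i, x) \, P(\omega_j \mid x) \]
\( \omega_j \): an outcome of the coin toss x
\( r(\omega_j \mid a_i, x) \): the reward for an outcome \( \omega_j \) and the bet \( a_i \) made on x
Expected one step reward for a strategy \( \pi: X \to A \):
\[ R(\pi) = \sum_x R(x, \pi(x)) \, P(x) \]
\( R(\pi) \) is the expected reward for each of \( r_0, r_1, r_2, \dots \)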
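The two sums above, R(x, a) and R(pi), can be evaluated directly once the probabilities are known. The sketch below assumes hypothetical values for P(head | coin) in `HEAD_PROB` and a uniform P(x) in `COIN_PROB`; neither is given in the slides.

```python
HEAD_PROB = {1: 0.7, 2: 0.4, 3: 0.55}       # hypothetical P(head | coin)
COIN_PROB = {1: 1 / 3, 2: 1 / 3, 3: 1 / 3}  # assumed uniform P(x)
OUTCOMES = ("head", "tail")


def outcome_prob(outcome, coin):
    """P(omega_j | x): probability of an outcome for the given coin."""
    p_head = HEAD_PROB[coin]
    return p_head if outcome == "head" else 1.0 - p_head


def reward(outcome, bet):
    """r(omega_j | a_i, x): +1 if the outcome agrees with the bet, else -1."""
    return 1 if outcome == bet else -1


def expected_reward(coin, bet):
    """R(x, a_i) = sum_j r(omega_j | a_i, x) * P(omega_j | x)."""
    return sum(reward(o, bet) * outcome_prob(o, coin) for o in OUTCOMES)


def policy_reward(policy):
    """R(pi) = sum_x R(x, pi(x)) * P(x)."""
    return sum(COIN_PROB[c] * expected_reward(c, policy[c]) for c in COIN_PROB)


example_policy = {1: "head", 2: "tail", 3: "head"}
one_step_reward = policy_reward(example_policy)
```

With these assumed numbers, betting head on coin 1 has expected reward 0.7 - 0.3 = 0.4, and the example policy earns about 0.233 per step.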

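RL with immediate rewards

Expected reward:
\[ E\left( \sum_{t=0}^{\infty} \gamma^t r_t \right) = E(r_0) + \gamma E(r_1) + \gamma^2 E(r_2) + \dots \]
Optimizing the expected reward (every step has the same expected reward \( R(\pi) \)):
\[ \max_\pi E\left( \sum_{t=0}^{\infty} \gamma^t r_t \right) = \max_\pi \sum_{t=0}^{\infty} \gamma^t R(\pi) = \left( \sum_{t=0}^{\infty} \gamma^t \right) \max_\pi R(\pi) = \frac{1}{1 - \gamma} \max_\pi R(\pi) \]
\[ \max_\pi R(\pi) = \max_\pi \sum_x P(x) \, R(x, \pi(x)) = \sum_x P(x) \left[ \max_{\pi(x)} R(x, \pi(x)) \right] \]
Optimal strategy \( \pi^*: X \to A \):
\[ \pi^*(x) = \arg\max_a R(x, a) \]

RL with immediate rewards

We know that
\[ \pi^*(x) = \arg\max_a R(x, a) \]
Problem: in the RL framework we do not know \( R(x, a) \), the expected reward for performing action a at input x.
How do we estimate \( R(x, a) \)?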
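The argument above says that choosing the best action separately for each input also maximizes R(pi) over whole policies. A small sketch can check this by brute force; the table `R` and the uniform `P_X` are hypothetical values chosen for illustration.

```python
from itertools import product

ACTIONS = ("head", "tail")
P_X = {1: 1 / 3, 2: 1 / 3, 3: 1 / 3}   # assumed uniform P(x)
R = {                                  # hypothetical R(x, a) values
    (1, "head"): 0.4, (1, "tail"): -0.4,
    (2, "head"): -0.2, (2, "tail"): 0.2,
    (3, "head"): 0.1, (3, "tail"): -0.1,
}


def policy_reward(policy):
    """R(pi) = sum_x P(x) * R(x, pi(x))."""
    return sum(P_X[x] * R[(x, policy[x])] for x in P_X)


def greedy_policy():
    """pi*(x) = argmax_a R(x, a), chosen independently for every input x."""
    return {x: max(ACTIONS, key=lambda a: R[(x, a)]) for x in P_X}


pi_star = greedy_policy()

# Brute force over all |A|^|X| = 8 policies to confirm the per-input argmax.
all_policies = [dict(zip(P_X, choice)) for choice in product(ACTIONS, repeat=len(P_X))]
best_by_search = max(all_policies, key=policy_reward)

# Infinite-horizon discounted value of pi*: R(pi*) / (1 - gamma).
gamma = 0.9
discounted_value = policy_reward(pi_star) / (1 - gamma)
```

The brute-force search is only feasible for tiny problems; the per-input argmax is what makes the immediate-reward case easy.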

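RL with immediate rewards

Problem: in the RL framework we do not know \( R(x, a) \), the expected reward for performing action a at input x.
Solution:
For each input x try different actions a.
Estimate \( R(x, a) \) using the average of observed rewards:
\[ \tilde{R}(x, a) = \frac{1}{N_{x,a}} \sum_{i=1}^{N_{x,a}} r_i^{x,a} \]
Action choice: \( \pi(x) = \arg\max_a \tilde{R}(x, a) \)
Accuracy of the estimate: statistics (Hoeffding's bound, two-sided form)
\[ P\left( \left| \tilde{R}(x, a) - R(x, a) \right| \ge \varepsilon \right) \le 2 \exp\left[ - \frac{2 \varepsilon^2 N_{x,a}}{(r_{\max} - r_{\min})^2} \right] \le \delta \]
Number of samples:
\[ N_{x,a} \ge \frac{(r_{\max} - r_{\min})^2}{2 \varepsilon^2} \ln \frac{2}{\delta} \]

RL with immediate rewards

On-line (stochastic approximation): an alternative way to estimate \( R(x, a) \).
Idea: choose action a for input x and observe the reward \( r^{x,a} \).
Update the estimate in every step i:
\[ \tilde{R}(x, a)^{(i)} \leftarrow (1 - \alpha(i)) \, \tilde{R}(x, a)^{(i-1)} + \alpha(i) \, r_i^{x,a} \]
where \( \alpha \) is the learning rate.
Convergence property: the approximation converges in the limit for an appropriate learning rate schedule.
Assume \( \alpha(n(x, a)) \) is the learning rate for the nth trial of the pair (x, a). Then convergence is assured if:
1. \( \sum_{i=1}^{\infty} \alpha(i) = \infty \)
2. \( \sum_{i=1}^{\infty} \alpha(i)^2 < \infty \)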
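A minimal sketch of both estimators for a single (x, a) pair, plus the sample-size calculation. The sample count assumes the two-sided Hoeffding bound 2 exp(-2 eps^2 N / (r_max - r_min)^2) <= delta; the reward sequence and the function names are made up for illustration.

```python
import math


def hoeffding_samples(eps, delta, r_min, r_max):
    """Smallest N with 2 * exp(-2 * eps**2 * N / (r_max - r_min)**2) <= delta."""
    return math.ceil((r_max - r_min) ** 2 / (2 * eps ** 2) * math.log(2 / delta))


def batch_estimate(rewards):
    """Average of the observed rewards for one (x, a) pair."""
    return sum(rewards) / len(rewards)


def online_estimate(rewards, alpha):
    """Stochastic approximation: R <- (1 - alpha(i)) * R + alpha(i) * r_i."""
    estimate = 0.0
    for i, r in enumerate(rewards, start=1):
        a = alpha(i)
        estimate = (1 - a) * estimate + a * r
    return estimate


observed = [1, -1, 1, 1]                       # hypothetical rewards for one (x, a) pair

mean_estimate = batch_estimate(observed)
# alpha(i) = 1/i satisfies both convergence conditions and reproduces the running mean.
running_estimate = online_estimate(observed, lambda i: 1.0 / i)
# A constant rate weights recent rewards more heavily; it violates condition 2.
constant_rate_estimate = online_estimate(observed, lambda i: 0.5)

# Samples needed for accuracy 0.1 with confidence 0.95 when rewards lie in [-1, 1].
needed = hoeffding_samples(eps=0.1, delta=0.05, r_min=-1, r_max=1)
```

The schedule alpha(i) = 1/i is the usual textbook choice because its sum diverges while the sum of its squares converges.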

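RL with immediate rewards

At any step in time i during the experiment we have estimates of expected rewards for each (coin, action) pair:
\( \tilde{R}(\text{coin1}, \text{head})^{(i)} \), \( \tilde{R}(\text{coin1}, \text{tail})^{(i)} \)
\( \tilde{R}(\text{coin2}, \text{head})^{(i)} \), \( \tilde{R}(\text{coin2}, \text{tail})^{(i)} \)
\( \tilde{R}(\text{coin3}, \text{head})^{(i)} \), \( \tilde{R}(\text{coin3}, \text{tail})^{(i)} \)
Assume the next coin to play in step (i+1) is coin 2 and we pick head as our bet. Then we update \( \tilde{R}(\text{coin2}, \text{head})^{(i+1)} \) using the observed reward and one of the update strategies above, and keep the reward estimates for the remaining (coin, action) pairs unchanged, e.g.
\[ \tilde{R}(\text{coin2}, \text{tail})^{(i+1)} = \tilde{R}(\text{coin2}, \text{tail})^{(i)} \]

Exploration vs. exploitation

In the RL framework the learner actively interacts with the environment and chooses the action to play for the current input x.
Also, at any point in time it has an estimate \( \tilde{R}(x, a) \) for any (input, action) pair.
Dilemma for choosing the action to play for x:
Should the learner choose the current best choice of action (exploitation),
\[ \hat{\pi}(x) = \arg\max_{a \in A} \tilde{R}(x, a) \]
or choose some other action a which may help to improve its estimate \( \tilde{R}(x, a) \) (exploration)?
This dilemma is called the exploration/exploitation dilemma.
Different exploration/exploitation strategies exist.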
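The per-pair bookkeeping described above can be sketched as a small table keyed by (input, action). The class name `RewardTable` and the 1/n learning rate are illustrative assumptions, not part of the lecture.

```python
from collections import defaultdict


class RewardTable:
    """Keeps one running reward estimate per (input, action) pair."""

    def __init__(self):
        self.estimate = defaultdict(float)  # estimate of R(x, a), starts at 0.0
        self.count = defaultdict(int)       # number of trials of each (x, a) pair

    def update(self, x, a, r):
        """Update only the pair that was just played; all others stay unchanged."""
        key = (x, a)
        self.count[key] += 1
        alpha = 1.0 / self.count[key]       # assumed schedule: 1/n for the nth trial
        self.estimate[key] = (1 - alpha) * self.estimate[key] + alpha * r

    def best_action(self, x, actions):
        """Current greedy choice: argmax over a of the estimate of R(x, a)."""
        return max(actions, key=lambda a: self.estimate[(x, a)])


table = RewardTable()
table.update("coin2", "head", 1)    # step i+1: coin 2 is shown, we bet head and win
table.update("coin2", "head", -1)   # a later step: same pair, this time we lose
table.update("coin2", "tail", 1)    # another step: same coin, bet tail and win
```

After these three updates the estimate for (coin2, head) is 0.0 and for (coin2, tail) is 1.0, so the greedy choice for coin 2 is tail, while every pair involving coin 1 or coin 3 is still at its initial value.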

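Exploration vs. exploitation

Uniform exploration:
Exploration parameter \( 0 \le \varepsilon \le 1 \).
Choose the current best choice with probability \( 1 - \varepsilon \):
\[ \hat{\pi}(x) = \arg\max_{a \in A} \tilde{R}(x, a) \]
All other choices are selected with a uniform probability \( \frac{\varepsilon}{|A| - 1} \).

Boltzmann exploration:
The action is chosen randomly but proportionally to its current expected reward estimate:
\[ p(a \mid x) = \frac{\exp\left[ \tilde{R}(x, a) / T \right]}{\sum_{a' \in A} \exp\left[ \tilde{R}(x, a') / T \right]} \]
T is the temperature parameter. What does it do?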
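Both strategies can be written as functions that turn the current estimates for one input x into action probabilities. This is a sketch under stated assumptions: the estimates are hypothetical, at least two actions are assumed (otherwise epsilon / (|A| - 1) is undefined), and the maximum estimate is subtracted before exponentiating purely for numerical stability, which leaves the Boltzmann probabilities unchanged.

```python
import math
import random


def uniform_exploration_probs(estimates, epsilon):
    """Best action gets 1 - epsilon; the rest share epsilon uniformly."""
    best = max(estimates, key=estimates.get)
    others = len(estimates) - 1              # assumes at least two actions
    return {a: (1 - epsilon) if a == best else epsilon / others for a in estimates}


def boltzmann_probs(estimates, temperature):
    """p(a | x) proportional to exp(R(x, a) / T)."""
    top = max(estimates.values())            # shift for numerical stability only
    weights = {a: math.exp((r - top) / temperature) for a, r in estimates.items()}
    total = sum(weights.values())
    return {a: w / total for a, w in weights.items()}


def sample_action(probs, rng):
    """Draw one action according to the given probabilities."""
    actions = list(probs)
    return rng.choices(actions, weights=[probs[a] for a in actions], k=1)[0]


estimates = {"head": 0.4, "tail": -0.4}      # hypothetical estimates for one coin

eps_probs = uniform_exploration_probs(estimates, epsilon=0.1)
warm = boltzmann_probs(estimates, temperature=1.0)
hot = boltzmann_probs(estimates, temperature=1000.0)
cold = boltzmann_probs(estimates, temperature=0.01)
chosen = sample_action(warm, random.Random(0))
```

Comparing `hot` and `cold` suggests an answer to the question on the slide: a high temperature pushes the choice toward uniform random exploration, while a low temperature makes it nearly greedy.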
